User:Yiming/LineArt Further Improvements

= LineArt Further Improvements =

0314
Just started another grant working on line art! Proposal here


 * Working on `temp-lineart-embree` to get occlusion queries running through embree (for now only with a perspective camera).
 * Updated `temp-lineart-contained` branch to latest.

Regarding GPencil


 * Polishing fading for build modifier.
 * Some cleaning up work for curvature weight modifier.

0321
Embree line art: Progress as of 0319:


 * 1) The branch `temp-lineart-embree` is now runnable.
 * 2) Performance is mostly on par with the `temp-lineart-contained` branch, considering that optimization is not enabled. The calculation works mostly correctly, and the necessary data/flags are all registered as they would be in legacy line art.
 * 3) No need to load an additional mesh structure; all embree-related callbacks now use geometry from the loaded lineart data (also: remove that additional mesh).

Problems so far / stuff to be done:


 * 1) SOLVED The triangle index in `Mesh` is [supposedly] different from the index in `BMesh`, so the triangles in the collide func (where I need the triangle data structure) and in the bounds func (where it's just the plain mesh for embree) don't match. But even so,
 * 2) SOLVED it seems to add way fewer "potential virtual pairs" than it needs to, or maybe not, depending on the mesh layout. I just saw it with the default cube, where it only added 5 pairs. So I'll look into it later.
 * 3) SOLVED Hangs on larger files, and still seems to take a few seconds to build the BVH and everything before the actual occlusion query. So I don't know what's going on there.
 * 4) Precision issue in the internal triangle `isect` function, prominent in the default cube (algorithmically it's due to the lack of special treatment for triangles that share one vertex).
 * 5) Still copies `double` to `float` for the internal triangle `isect`; need to get rid of that and use line art's own function (which needs some modification because we don't want to add geometry in the callback).
 * 6) Need to take care of discarded triangles and lines.
 * 7) Try out a 3d bounds callback for the geometry used for intersection, but use 2d for occlusion only. Need to see if building two different BVH trees would take away the benefit of a faster intersection stage using 3d bounds.

0328
This week I have basically been trying out different ways of optimizing the line art embree core.


 * 1) Changed the tri-tri intersection call for virtual triangles (for both the occlusion stage and the intersection stage) to my own one instead of using blender's internal math function. A tiny speed-up; the performance bottleneck is mostly on the locks.
 * 2) Jacques suggested using `EnumerableThreadSpecific` (TLS) to do storage per thread, so we don't need to lock the result array when worker threads add to it. It indeed improved performance. The bottleneck then mostly became the "occlusion cutting" stage, where multiple threads try to cut edges that share memory, so there's a lot of locking going on.
 * 3) TODO: See if there's a thread-local allocator to use instead of `MEM_malloc` so it's faster in the threads. See the notes on thread-friendly memory management below.
 * 4) Tried spreading out locks by assigning 100 locks incrementally to all edges, in the hope that the cutting function doesn't collide that often, but it turns out the memory allocator is shared, so it's not improving much.
 * 5) Tried two ways to pre-check potential triangle intersections (in the occlusion stage). The first way is to check whether the triangles intersect using the internal tri-tri function; it does filter out a lot of non-intersecting ones, but that stage costs a lot of time. The second way is to skip that part entirely and feed potential intersection pairs directly into the line-triangle occlusion call. Performance mostly stayed the same for these two. (Which is generally slightly faster than current `master` but still slower than `temp-lineart-contained`.)
 * 6) Technically I could do a pre-check using "if the line crosses the triangle in 2d", but that's essentially the first step inside the actual occlusion call, so it's not gonna be very useful.
 * 7) A theory for the performance being this way is that embree only does bounding box checks, while the line art grid acceleration method puts triangles in a denser & adaptive grid. So embree gives more potentially intersecting triangle pairs than line art would have, because if two triangles are slanted in such a way that their bounding boxes overlap, they could still very likely be in separate grid tiles. I'm not sure which way is better now.
 * 8) Another reason for it being a bit slower than expected is that the legacy line art algorithm records intersection verts that have already been found onto the triangle, but in the embree method we need to calculate that again for the same edge from both sides, so that nearly doubles the work there?
 * 9) Basically removed the "intersection record" and did the intersection calculation directly in the embree `IntersectionCollide` callback. Reduced memory usage (supposedly, because I left those variables in place for convenience of testing...), and also increased performance a little bit (because the result points are recorded directly rather than copied again). So there are some improvements. However, the generation of points still suffers from the memory allocation lock issue mentioned above; need to find a solution for that. See the notes on thread-friendly memory management below.
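
The theory in 7) can be illustrated with a toy 2D sketch. This is not the line art or embree code: the helper names (`aabb_overlap`, `tri_overlaps_rect`, `tiles`), the fixed uniform grid and the two example triangles are all assumptions made up for illustration, and the real grid is adaptive. It only shows that two slanted triangles can have overlapping bounding boxes while never sharing a tile.

```python
def aabb(tri):
    """Axis-aligned bounding box of a 2D triangle as (x0, y0, x1, y1)."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return min(xs), min(ys), max(xs), max(ys)


def aabb_overlap(a, b):
    """Bounding-box-only test: the kind of pair a pure AABB check reports."""
    ax0, ay0, ax1, ay1 = aabb(a)
    bx0, by0, bx1, by1 = aabb(b)
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def tri_overlaps_rect(tri, x0, y0, x1, y1):
    """Separating-axis test between a triangle and an axis-aligned tile."""
    bx0, by0, bx1, by1 = aabb(tri)
    if bx1 < x0 or bx0 > x1 or by1 < y0 or by0 > y1:
        return False  # separated along the x or y axis
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    for i in range(3):
        (px, py), (qx, qy) = tri[i], tri[(i + 1) % 3]
        nx, ny = qy - py, px - qx  # normal of this triangle edge
        t = [nx * x + ny * y for x, y in tri]
        r = [nx * x + ny * y for x, y in corners]
        if max(t) < min(r) or max(r) < min(t):
            return False  # separated along this edge normal
    return True


def tiles(tri, n):
    """Set of (i, j) tiles of an n x n grid over the unit square that the triangle touches."""
    size = 1.0 / n
    return {
        (i, j)
        for i in range(n)
        for j in range(n)
        if tri_overlaps_rect(tri, i * size, j * size, (i + 1) * size, (j + 1) * size)
    }


# Two thin, slanted triangles on opposite sides of the diagonal (made-up data).
tri_a = [(0.02, 0.42), (0.58, 0.98), (0.02, 0.48)]
tri_b = [(0.42, 0.02), (0.98, 0.58), (0.48, 0.02)]

print(aabb_overlap(tri_a, tri_b))                   # bounding boxes overlap
print(bool(tiles(tri_a, 10) & tiles(tri_b, 10)))    # but no shared 10x10 tile
```

With these two triangles a bounding-box test reports a candidate pair, while a 10x10 tile lookup would never put them together, which is the extra work the theory is about.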

Solved stuff:


 * 1) Memory leaks fixed. (Just need to be careful with `new`ed objects from C++ and use a wrapper to properly take care of them.)
 * 2) Fixed Sebastian's mesh loading code with `totedge==0` handling, which further sped up the whole loading stage. (Still some minor crashes due to `edge_hash==NULL`, and I'm not sure what causes it because `totedge` is not zero in those cases.)
 * 3) This code works on both embree branch and legacy branch.

On the topic of thread-friendly memory management [IMPORTANT]:


 * 1) Turns out the `MEM_mallocN` functions internally use `tbbmalloc` and `jemalloc`, which are already optimized for multithreading. So now I need to take advantage of this by giving each thread a local mem pool (now, understandably, using thread local storage) so we don't lock anything for allocation, which would greatly increase the performance of line art. (Thanks Hans for clearing up that confusion for me.)

Currently `temp-lineart-embree` branch has this code path which I found to be the fastest up till now:


 * Directly record intersection result in `IntersectionCollide`. Use thread local storage and combine result afterwards to avoid locks.
 * Directly calculate occlusion cutting and only record the `l`, `r` cutting positions in `OcclusionCollide`, then later apply all cuts in parallel in `occlusion_worker`. Still using locks; needs `TLS` or something like that.
 * Do not use any pre-checks for potential "virtual triangle" intersections.
 * Only set up a basic 10x10 acceleration grid for the chaining code (which depends on it). (Note/TODO 0329: I checked afterwards, and at some point the code in master became 4x4 again. I'm not sure if there's a merging issue or I never updated master for that 10x10 change, so that's slowing stuff down a bit.)
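
A minimal sketch of the "record cuts per thread, combine afterwards" idea from the list above, written in Python rather than the C++ used in the branch. Everything here is hypothetical: `ThreadLocalCuts`, `merge_intervals` and `collect_cuts` are names invented for this example, and merging overlapping `(l, r)` ranges is a simplification, since real occlusion cutting has to track more than a hidden/visible flag per segment.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class ThreadLocalCuts:
    """Each worker thread appends to its own list, so recording a cut takes no lock."""

    def __init__(self):
        self._local = threading.local()
        self._buffers = []
        self._register_lock = threading.Lock()

    def record(self, edge_id, l, r):
        buf = getattr(self._local, "buf", None)
        if buf is None:
            buf = self._local.buf = []
            # Taken once per thread (to register the buffer), not once per cut.
            with self._register_lock:
                self._buffers.append(buf)
        buf.append((edge_id, l, r))

    def combined(self):
        """Merge all per-thread buffers; call only after the workers are done."""
        per_edge = defaultdict(list)
        for buf in self._buffers:
            for edge_id, l, r in buf:
                per_edge[edge_id].append((l, r))
        return per_edge


def merge_intervals(cuts):
    """Union of overlapping (l, r) ranges along one edge, sorted by start."""
    merged = []
    for l, r in sorted(cuts):
        if merged and l <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], r))
        else:
            merged.append((l, r))
    return merged


def collect_cuts(cuts, workers=4):
    """Record (edge_id, l, r) cuts from worker threads, then apply them per edge."""
    store = ThreadLocalCuts()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda c: store.record(*c), cuts))
    return {edge: merge_intervals(c) for edge, c in store.combined().items()}


print(collect_cuts([(0, 0.1, 0.3), (0, 0.2, 0.5), (1, 0.0, 1.0), (0, 0.7, 0.8)]))
```

Because the combine step groups cuts by edge, the per-edge work afterwards is independent and could itself be run in parallel without sharing memory between edges.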

Also some other progress on GPencil:


 * 1) Made fading support for the build modifier, with some back and forth on the UI and some hidden algorithm issues.
 * 2) Cyclic option for the dot dash modifier to satisfy a weirder look.
 * 3) A small fix for the curvature weight modifier.

0404

 * 1) Tried the Möller algorithm to speed up tri-tri intersection, but it turns out it doesn't give the correct result. Not sure about the reason; maybe I need to try copying the original data into `float[3]` and try again. But from the look of it, I suspect it's in the nature of this algorithm that it doesn't have good stability when triangles become quite small.
 * 2) Use `Vector::reserve` for getting combined occlusion result.
 * 3) Corrected crease loading, now faster object loading code is basically finished, need to test a bit more to see if there are hidden issues.
 * 4) Tested the `edge_hash` bug but can't reproduce.

Fixes:


 * 1) Fixed the `lineart-shadow` branch to use the correct intersection filtering logic (for whatever reason the logic was not merged from the master changes).
 * 2) Fixed https://developer.blender.org/T94888
 * 3) Closed https://developer.blender.org/T96846

0411

 * 1) Changed line art final edge list into an array and further sped up `temp-lineart-contained` branch.
 * 2) Finished up edge/face mark filtering logic under new object loading code and tested to work correctly.
 * 3) Feature line filtering by shadow region now working correctly.
 * 4) Shadow region enclosed shape support is now working, but the light contour didn't go into re-projection; it needs further fixes to make the result look great.

Generic GPencil stuff:


 * 1) Global scale compensation for sample modifier. https://developer.blender.org/D14544

0418

 * 1) Shadow contour re-projection logic is fixed, now the generated light/shadow shapes are guaranteed to be fully enclosed.
 * 2) Object loading code patch: https://developer.blender.org/D14627 Pending review.
 * 3) Sebastian also suggested a new way of building adjacent edges without using `EdgeHash` (https://youtu.be/z5oWopN39OU?t=191), will look into it, and if that turns out to be faster then the object loading patch should be updated to include that.
 * 4) Tried setting the embree build quality to `HIGH`, but it still didn't speed things up that much.
 * 5) We kinda decided that if everything fails, we still go with the tile solution but keep embree for intersection, because it's faster than the line art tile method in that stage (and then we won't need to do intersection in 2d tiles, which would save a lot of time spent on locks).
 * 6) Implemented an experimental `CAS tree` for the legacy line art tile algorithm. It's not completely working, and doesn't feel "very fast" yet either. It's in the `temp-lineart-contained` branch if anyone is interested in testing. It could be that I'm still including the intersection stuff inside the tile adding process... Needs further testing.

0425

 * 1) `CAS` tree is producing correct result except it doesn't free any memory.
 * 2) Index-sorting based edge adjacent lookup is working correctly atm for old object loading code, needs to be migrated to new object loading code.
 * 3) It's also working in new object loading code right now, but about 25% slower than that, probably due to qsort performance.
 * 4) Trying to keep threads busy by slicing `add_triangles` into smaller chunks instead of using each object as a chunk, so any single "huge" object gets split across different worker threads instead of being worked on by one thread.
 * 5) Well, it did keep all threads working, but it also introduced a lot of conflicts in tile operations, so it ended up much slower.
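
A rough sketch of the index-sorting based adjacency lookup from 2) and 3), as an alternative to a hash map keyed by edge. This is a simplified Python illustration, not the code in the patch: the function name `adjacent_triangles` and the tuple-based triangle format are assumptions, and the real implementation works on the mesh arrays directly.

```python
def adjacent_triangles(triangles):
    """Map each undirected edge (v_low, v_high) to the triangles that use it.

    `triangles` is a list of (v0, v1, v2) vertex index tuples. Instead of a
    hash map, every triangle edge is written out as a record, the records are
    sorted, and equal edges end up next to each other.
    """
    records = []
    for tri_index, (a, b, c) in enumerate(triangles):
        for u, v in ((a, b), (b, c), (c, a)):
            # Store the edge with its smaller vertex index first so both
            # directions of the same edge compare equal.
            records.append((min(u, v), max(u, v), tri_index))

    records.sort()

    adjacency = {}
    i = 0
    while i < len(records):
        j = i
        # Walk the run of records that share the same (v_low, v_high) key.
        while j < len(records) and records[j][:2] == records[i][:2]:
            j += 1
        adjacency[records[i][:2]] = [r[2] for r in records[i:j]]
        i = j
    return adjacency


# Two triangles sharing the edge between vertices 1 and 2.
adjacency = adjacent_triangles([(0, 1, 2), (2, 1, 3)])
print(adjacency[(1, 2)])
```

The cost is dominated by the sort, which fits the observation in 3) that the sort step is the likely place to look when this path is slower.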

0502

 * 1) The object loading code is done and awaits review :D . Currently using index ordering to find adjacent triangles and only adding loose edges with `MEdge`s.
 * 2) Silhouette group feature implemented and running correctly. (The algorithm is built on top of the shadow cast calculation.) This means the goal for shadow support is basically finished.
 * 3) Silhouette works out of the box, but it introduces ambiguity with lit/shade regions. Currently I break the silhouette up to match this setting, and most of the time it's good enough. In the future this needs to be improved (probably with nodes or some more logic, or with more intuitive presets).
 * 4) Intersection lit/shade info is not registered, need to take care of that.
 * 5) Fixed edge cutting function for erroneous cuts in the last segment (not registering correct silhouette group).

Others:


 * 1) Trying `CAS` tree without reallocating storage arrays, not succeeded yet.

0509

 * 1) Fixed lit/shade cutting for intersection lines (But expectedly slow)
 * 2) Object loading code committed into master :D
 * 3) Progress on the `CAS` tree acceleration experiment: running without reallocating is now a success, and it is a little bit faster than the traditional algorithm when no intersection line is involved. With embree intersection, the overall performance is just about to catch up with the traditional algorithm, but still not quite.

0516

 * 1) Fixed the object loading iterator so it won't crash on stuff like particles.
 * 2) Committed better smooth tolerance handling; now a greater smooth tolerance value won't reduce the entire contour loop into a single line.
 * 3) Made `cas` method work correctly with the use of `atomic_load` and `atomic_store`.
 * 4) Updated 7 more patches on the Lineart task, pending review.

0517-0530
Not particularly productive.


 * 1) Fixed two bugs related to line art crashes: https://developer.blender.org/T98355 and https://developer.blender.org/T98359
 * 2) `CAS` patch committed (but it got reverted for some atomic-related issues; a new patch is being reviewed).
 * 3) The way line art iterates objects when loading is unsafe in depsgraph. A new method is being researched: https://developer.blender.org/D15022
 * 4) Some minor fixes in the shadow branch to get reference assignment correct under the new object loading code.

Grease Pencil:


 * 1) Fixed sample modifier behavior of the last vert: https://developer.blender.org/D15005

0606

 * 1) "Speed up quad tree building" patch is finally fully polished and accepted into `master` (Yay!). Eventually we did not go with `cas` algorithm as it involves busy waiting, and it's not preferred in the sense of OS thread scheduling.
 * 2) Committed some minor fixes for line art that had not made it into `master` yet.
 * 3) Polishing the `lineart-shadow` patch, writing documentation and preparing for code review.
 * 4) Review task is here: https://developer.blender.org/D15109
 * 5) Design and implementation notes: https://developer.blender.org/T98498

0613

 * 1) Polished shadow patch more for consistency and removing irrelevant changes.
 * 2) Refactored `LineartRenderBuffer` to `LineartData` and reorganized variables for better clarity. (https://developer.blender.org/D15172)

Otherwise nothing substantial is happening :thinking:

0620

 * 1) Made a new model specifically for testing line art shadow functionality in one go, which is available here: https://developer.blender.org/D15109. It demonstrates:
   * Cast shadow and light contour.
   * Cast shadow over transparent materials.
   * Silhouette (wires).
   * Selection of lit/shaded regions.
   * Intersection priority grouping.
 * 2) While making that model I found a few more bugs and fixed them in that patch:
   * Threading issue regarding the `cast` function, which I reverted back to single thread. I'll design a better threading model for it in the future. (But since the entire shadow stage is pretty fast, it won't have much impact.)
   * Added another 4 bytes in `LineartEdge` to store the light contour `target_reference` for both adjacent triangles because `t1`/`t2` is not applicable for them. Now light contour adjacency doesn't have any ambiguity, which is much better.
   * Various stability improvements.
 * 3) Also did some more variable name clean-ups in master line art.

0627

 * 1) Polishing shadow patch.
 * 2) Cleaned up a bunch of the UI logic, as well as removing a few bugs introduced by typos.
 * 3) Updated the patch to include the "Object Silhouette Group" functionality. When that is selected, every object gets its own silhouette, but an object is not combined with other objects in the same silhouette group (e.g. when two monkeys overlap each other, their shapes are kept separate, but their inner features are removed).
 * 4) Writing and making demonstration illustrations for manual updates.