Dump of ideas for optimizing Cycles.
There are a few different ways to approach optimizations:
- Low level optimizations: SSE/AVX, memory prefetching, ...
- Reduce memory usage: to fit bigger scenes and avoid cache misses
- Algorithmic optimizations: better BVH, handling many lights, shader optimization
- Sampling: avoid unneeded light paths, skip empty background areas, branched path tracer tweaks, ...
- Tricks: skip some light paths, blur background for diffuse light, alpha threshold for hair, ...
- Shaders: make a good set of shader groups for various common materials with production tricks.
Low Level Optimization
- Find places where we can use SSE/AVX instructions
- See if we can benefit from memory prefetching anywhere
- See if we can use fast inverse sqrt instructions (watch out for precision issues)
- Look at latest embree code
- Add second level BVH traversal function so hair and motion blur don't slow down traversal everywhere.
- Add multithreaded spatial splits builder so it becomes more usable, might be good as default then
- Use SIMD to intersect multiple triangles at once
- Store smooth normals only when used
- Store quads instead of triangles
- Hair segments storage needs to be separated from triangles to reduce memory usage
- Hair needs to be more tightly bounded so we can avoid intersections (spatial splits?)
- Use SIMD to intersect multiple hairs at once
- Try doing minimum width from camera POV before rendering instead of in the BVH traversal
- Faster traversal/intersection, and other renderers do it this way as well apparently
- For (shadow) rays from fur, it may be faster to intersect against triangles first, and then hairs.
- About half of the shadow rays will likely hit the base surface anyway.
- Transparency cutoff so we can stop shading after most of the light is blocked (for all types of rays)
- The hair BSDF gets replaced by a transparent BSDF for backfacing curve points. Shouldn't we skip the intersection altogether, and if so, where exactly? Furthermore, it uses a fixed 1,1,1 weight for each BSDF, which will blow up into fireflies if multiple such BSDFs are mixed.
- Minimum hair width: when the hair is enlarged, it currently uses stochastic termination based on the thickness of the hair rather than transparency. This may be quite good for path tracing: if we used transparency we would stochastically continue or scatter anyway, and this avoids multiple scene intersection calls or shader setups by doing it right in the intersection function. Another problem is that the stochastic termination does not currently use a QMC sequence, and it's not entirely clear whether that would work well or how to get it working. Some noise from this stochastic termination is hard to get rid of.
- For branched path tracing with fewer AA samples this may not be ideal, though to beat it we may need to record all intersections, and perhaps add smarter behavior for hairs that are nearly fully covered by others. Each AA sample would have less variance but also be more costly, so it's tricky to find the right tradeoff.
- Add object attribute system so we can reduce object memory usage (for very high number of instances)
- Don't store vectors/colors in float4
- Convert CORNER to VERTEX attributes by splitting.
- A workaround for the terminator problem would reduce noise due to fewer rays going below the surface, some options:
- Flip ray direction above surface when it is below.
- Ignore backfaces of the same object (correct for closed meshes)
- Somehow keep rays above the surface by remapping them, with a smooth blend when near the surface.
- Slightly increase the glossiness for camera rays based on ray differentials to avoid noisy sharp highlights
- Or increase it a lot for depth of field and motion blur?
- Tweak the formulas and magic values used for Filter Glossy to see if we can get it to behave better
- Option to disable glossy for indirect light on given BSDFs, or some factor to control the amount for direct and indirect.
- Add constant folding for nodes where it is commonly useful
- Mipmapping and OIIO texture cache support
- More compact storage of image textures with fewer than 4 channels.
- Add frequency clamping or other ways to use ray differentials to filter perlin noise.
- Design a set of shader groups with production tricks, like
- Simpler texture or fixed color for indirect light or shadows for faster shader executions
- Shader that replaces glossy by diffuse for indirect light
- Fake shadows for glass to avoid caustics
- Hairs without transparency for indirect light
- Smoother light falloff to avoid fireflies for geometry near light sources
- For MIS of background, shader that makes area below horizon black to avoid unnecessarily sampling there.
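As a minimal sketch of the smoother light falloff idea from the list above: one possible form replaces the inverse square term 1/d² with 1/(d² + s), which stays bounded as the distance d approaches zero. The function names and the `smooth` parameter are hypothetical, and this is an assumption about what such a falloff could look like, not necessarily the formula Cycles or Blender Internal uses.

```python
def inverse_square_falloff(intensity, distance):
    """Physically based falloff: grows without bound as distance approaches zero."""
    return intensity / (distance * distance)


def smooth_falloff(intensity, distance, smooth):
    """Hypothetical smoothed falloff.

    With smooth > 0 the result never exceeds intensity / smooth, so geometry
    very close to a light cannot produce extreme values. For distances much
    larger than sqrt(smooth) it approaches the inverse square falloff.
    """
    return intensity / (distance * distance + smooth)
```

The tradeoff is that light very close to the source is darker than it physically should be, which a shader group could restrict to indirect light only.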
Random Number Sequences
- Test correlated multi-jittered sampling on more complex scenes to see if it helps
- Can we get more memory coherence and fewer cache misses by evaluating pixels and their samples in some other order?
- Test if using the power heuristic instead of balance heuristic helps for combining BSDFs
- Test using the power heuristic for combining SSS falloffs
- If a MIS weight is near 0 or 1, can we round it to avoid the sample?
- Make light tree for quick lookup of lights that have some influence when there are many lights
- SSS rays could be optimized by ensuring the object is instanced and tracing only the sub-BVH for the object.
- Hair: look into other BSDF importance sampling papers (e.g. paper)
- Hair: dual scattering could be used as an approximation for faster indirect light
- Adaptive number of (AA) samples: unsure if this can work reliably, needs tests on real world scenes.
- Hair shading could cache and interpolate shading results at curve vertices
- Probably use a per-thread, least recently used cache with fixed number of entries
- For indirect glossy shaders this does not work however, only camera rays and diffuse, but optimized shaders could skip the glossy here or turn them into diffuse
- Option for lower resolution viewport render, or fewer bounces, etc.
- On retina or high DPI display, it may be good to use lower resolution by default
- Rather than progressively rendering sample by sample, have a mode where it renders progressively higher resolution
- Use one of the recent filtering techniques to denoise the image. These give artifacts in final renders but can give a better preview with few samples.
- Diffuse x point light
- Glossy x point light
- Sharp x point light
- Sharp x area light (if MIS is enabled)
- Ambient occlusion
- Diffuse x area light
- Sharp x anything (if not too many bounces)
- Glossy x area light (if MIS is enabled)
- Diffuse x diffuse x ...
- Glossy x diffuse x …
- DoF or motion x diffuse
- High number of diffuse or glossy bounces
- Diffuse x sharp glossy x … (caustic)
- DoF or motion x sharp glossy
So it can help to exclude more expensive light paths or replace them by something else. When you exclude them there will of course be missing light for which the artists will have to compensate. Replacing can mean:
- Use a local trick like AO to replace many light bounces
- Blur a sharp glossy surface to a softer glossy after a diffuse bounce (that’s what Filter Glossy does)
- Replace a glossy surface by a diffuse surface
- For glass you can replace many bounces by a constant exit color after N bounces
One thing we are missing is the ability to disable a BSDF for indirect light. For example (glossy x point light) is fast and useful to give a specular highlight, but (glossy x diffuse) may be too noisy to be worth it. Being able to disable the latter would be useful.
Another is support for light groups or layers, so specific interactions between objects and lights can be disabled if they are too noisy.
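A minimal sketch of what a per-BSDF control for indirect light could look like, assuming the shading code knows whether the current ray is a camera ray or an indirect ray. The function name and the `indirect_factor` parameter are hypothetical and do not correspond to an existing Cycles setting.

```python
def glossy_weight(base_weight, is_indirect_ray, indirect_factor=0.0):
    """Scale the weight of a glossy BSDF depending on the ray type.

    Hypothetical control: indirect_factor = 0.0 disables the BSDF for
    indirect light entirely, 1.0 leaves it unchanged, and values in
    between reduce its contribution.
    """
    if is_indirect_ray:
        return base_weight * indirect_factor
    return base_weight
```

With `indirect_factor=0.0` the (glossy x point light) highlight seen directly by the camera is kept, while (glossy x diffuse) paths contribute nothing. The missing light is something artists would have to compensate for, as noted above.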
What Cycles Does
- Classic path tracing will only follow a single light path, much like a photon would. At each vertex you pick one BSDF/BSSRDF and one direction to continue the path in.
- Path tracing with next event estimation (as Cycles uses) connects every vertex on the path to one randomly sampled position on a light as well. So at that point the path is temporarily branched.
- Branched path tracing will sample all BSDFs and BSSRDFs at the first hit, and sample all lights for camera rays. The rest of the path is like regular path tracing, except that there is also an option to sample all lights each time instead of one.
- Branched path tracing also handles transparency differently, in that it also fully samples all BSDFs, BSSRDFs and lights at each transparent surface hit from the camera ray. For regular path tracing a transparent BSDF would be randomly picked among other BSDFs, here it is always picked.
Picking one BSDF, one light, etc. can introduce significant noise but is also clearly faster per sample. For complex scenes or lighting setups that require many bounces picking just one can be helpful, because you might need to try many different variations of the start of the path to find a light at the end.
With fewer bounces that’s less helpful and more branching helps. Branching does get expensive with many lights, because looping over all lights for each sample is slow, but it’s a tradeoff.
For production renders where you don’t have a ton of render time, it’s probably best to use branched path tracing and avoid putting too many lights in the scene to keep render times reasonable.
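The difference between picking one component and branching over all of them can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Cycles kernel code: each BSDF or light is reduced to a single scalar contribution, and all function names are hypothetical.

```python
import random


def pick_one(weights, rng):
    """Pick an index with probability proportional to its weight.

    Returns (index, probability of picking that index).
    """
    total = sum(weights)
    u = rng.random() * total
    accum = 0.0
    for i, w in enumerate(weights):
        accum += w
        if u < accum:
            return i, w / total
    # Guard against floating point round-off at the upper end.
    last = len(weights) - 1
    return last, weights[last] / total


def estimate_picking_one(contributions, weights, rng):
    """One-sample estimate: evaluate a single component, divide by its probability."""
    i, prob = pick_one(weights, rng)
    return contributions[i] / prob


def estimate_branched(contributions):
    """Branched estimate: evaluate every component, so there is no selection noise."""
    return sum(contributions)
```

Averaged over many samples both estimates converge to the same value, provided every component with a nonzero contribution has a nonzero weight. The one-sample version is cheaper per sample but noisier, while the branched version costs one evaluation per component, which is why it gets slow with many lights.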
- Area lights are more noisy than point lights for direct light
- However they can be less noisy for indirect light, because the inverse square falloff gives extremely high values for point lights
- For production probably best to always use smooth light falloff, at least for indirect light
- Blender Internal always uses a similar smoothing, not possible to get extreme values when shading point is near light
- Blender Internal does a trick where it only evaluates shadows for area lights but still treats shading as coming from a point light
- This leads to less noise but also some strange results
- I don’t think this is a trick that should be added to Cycles
- Mesh emitters should probably get multiple importance sampling disabled by default
- Meshes that emit light weakly can take away too many samples from meshes that do contribute a lot
- Not clear for users that this happens
- Multiple importance sampling is not enabled by default for lamps
- Maybe it should be, but there is a performance impact
- Generally users need to be mindful of which lights to enable MIS for
- For scenes with many lights, looping over all lights and checking if they influence the current shading point may be slow
- Solution could be some sort of light tree (similar to a BVH for triangles?) to quickly cull lights
- Lights also have an inverse square falloff which means they have a very far influence
- Some sort of max distance or intensity cutoff would help skipping lights when there are many
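A minimal sketch of the intensity cutoff idea, assuming point lights with a pure inverse square falloff. Solving intensity / d² = threshold for d gives the distance beyond which a light contributes less than the threshold. The `(position, intensity)` tuple layout and the function names are hypothetical.

```python
import math


def cutoff_distance(intensity, threshold):
    """Distance beyond which intensity / d^2 drops below threshold."""
    return math.sqrt(intensity / threshold)


def lights_in_range(lights, point, threshold):
    """Return indices of lights whose contribution at point can reach threshold.

    lights is a list of (position, intensity) tuples (hypothetical layout),
    with position and point given as 3D coordinate tuples.
    """
    result = []
    for index, (position, intensity) in enumerate(lights):
        if math.dist(position, point) <= cutoff_distance(intensity, threshold):
            result.append(index)
    return result
```

This version still loops over all lights, so it only shows the culling criterion. A light tree would store a bound like this per node so that whole groups of lights can be skipped at once. Skipping lights this way removes a small amount of light from the image, so the threshold would need to be chosen conservatively.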
Depth of Field and Motion Blur
- Doing it in compositing is much faster and perhaps the most practical
- Deep compositing will give better quality result for transparency and antialiasing.
- Perhaps preview renders could still use it for tweaking, and then have a better way for compositing to use the same settings as rendering
- Still need to split in render layers to avoid issues with missing pixels behind blurred objects
- REYES style motion blur
- Is faster, though it needs some sort of REYES dicing or shader caching to fit in a path tracer
- Does not give you motion blurred shadows or reflections
- Probably too difficult to fit in
- 3D sample sequences for pixel filter + time or 4D for pixel filter + lens may reduce noise
- For hair shading, it may be good to cache shader evaluations at curve key points and interpolate
- Lots of overlapping transparent hairs
- Transparency also caused by minimum width feature even if not used in shader
- Camera rays might benefit from the same optimization recently added for shadow rays
- Recording all possible transparent surface intersections in one go
- With branched path tracing fewer pixel samples are possible which can help performance
- Improved filtering in the shader may be needed to make the most of that
- Procedural noise could use tricks like frequency clamping to remove high frequency components
- Better quality image texture filtering (as implemented in OIIO) can help as well
- OSL backend supports good ray differentials in shading, SVM does not
- Blender Internal tries to shade only once per pixel, samples in a single pixel are merged
- Helps in some cases but also gives artifacts that you can’t get rid of (subtle flickering)
- With global illumination, shading for camera rays is not the main cost anymore though
- So practical benefit might not be so big anymore
- Don’t think this is a good way to go
- Cycles currently uses “filtered importance sampling” to implement pixel filters
- It may be faster to do as Blender Internal does and let pixel samples contribute to multiple pixels
- This requires padding pixels around tiles which can make things slower again with small tiles
- If we’re clever those padding pixels can be shared however between tiles if they are cached somewhere
- With path tracing you get many shader evaluations, so important to make them as fast as possible
- Only shaders that are visible to the camera, or visible through sharp reflections or refractions, need to be accurate
- So production shaders should have two evaluations, an accurate one and a fast one, where fast can mean:
- No detailed procedural textures
- No glossy, or replaced by diffuse
- Constant color
- Remembering to use all production tricks is not convenient, best to have a number of presets with tricks that can be the default
- Ideally a production should mainly use 20 or so preset shader groups for different kinds of materials, worlds and lights
- For predictability these shaders should be tested in some standard light setup (maybe a HDRI world)
- That way artists don’t have to tweak materials too much for specific scenes, but rather can tweak the lights knowing that the materials react as they should
Difficult Light Paths
- For production rendering without a huge render farm, it’s probably best to avoid difficult light paths like caustics or high number of bounces entirely
- Caustics can be blurred out or omitted
- If a lower number of bounces is not enough, more lights can be placed manually or ambient occlusion added (perhaps only when shading for indirect light to make it less obvious)
- Keeping all geometry in memory is pretty much a must
- Rays fly all over the place, difficult to find enough coherence to make caches work fast
- Better focus on compression and lowering memory usage than caching
- Mesh storage in Cycles can be reduced
- Data structures are sometimes duplicated for Blender, intermediate Cycles data and Cycles kernel data
- For image textures and volume data, caching is possible
- OpenImageIO gives this mostly for free
- But good ray differentials are needed to make it work, being very careful to only access high res when needed
- Need a good system for user to autogenerate tiled, mipmapped .tx and .exr files
- Big gain would be already in only loading lower resolution tiles
- Embree kernels from Intel are really good
- They added support for instancing and motion blur, and are also working on hair on GitHub
- BVH for hair can probably be significantly improved
- Lots of overlapping curves means you get a lot of needless intersections
- Improvements to the BVH for triangles are probably mostly in:
- Better quality BVH building for complex scenes (analyzing BVH and trying to find places where it performs poorly)
- Multithreaded spatial splits builder so it can be enabled more often or by default
- SIMD for triangle intersections
- Making BVH traversal faster makes the entire renderer faster