
= Cycles Kernel Scheduling =

== Wavefronts ==
Cycles uses wavefront path tracing on the GPU. For an introduction to the how and why, these two papers are a good reference:

 * Megakernels Considered Harmful: Wavefront Path Tracing on GPUs
 * The Iray Light Transport Simulation and Rendering System

Megakernels compute a light path from start to end. However, paths on different threads may terminate at different points, hit different types of objects, or hit objects with different shaders. This causes GPU threads to execute different parts of the kernel code, and such divergent execution is bad for performance.

Wavefront path tracing instead splits computation into multiple smaller kernels. Each kernel advances the path to the next kernel or terminates it. By tracing millions of paths at once, we can queue or sort paths so that GPU threads are coherently executing the same kernel or shader.
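
As an illustration (hypothetical structure, not the actual Cycles code), the queueing idea can be sketched as per-kernel queues of path indices, where executing one whole queue at a time keeps neighboring threads running the same code:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Paths are queued by the id of the next kernel they need. Executing all
// paths queued for one kernel together avoids the divergence a megakernel
// suffers when neighboring threads need different code.
struct KernelQueues {
  std::map<int, std::vector<uint32_t>> queues;  // kernel id -> path indices

  void enqueue(int kernel, uint32_t path) { queues[kernel].push_back(path); }

  // Take all paths waiting on one kernel, leaving its queue empty.
  std::vector<uint32_t> take(int kernel) {
    std::vector<uint32_t> paths = std::move(queues[kernel]);
    queues[kernel].clear();
    return paths;
  }
};
```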

== Graph ==
The following graphs are used for path tracing with next event estimation. Each kernel terminates the path or transitions to another kernel, like a state machine. For next event estimation, shadow paths branch off from the main path.

Baking uses the same kernel graph, with one difference: a kernel is used for initializing the path from a shading point instead of the camera.
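
The state-machine view can be sketched as a transition function. Kernel and parameter names here are illustrative, and the real graph has more kernels (e.g. for shadow rays):

```cpp
// Illustrative kernel ids, not the real Cycles kernel enum.
enum class Kernel {
  InitFromCamera,    // camera rendering entry point
  InitFromBake,      // baking entry point: start from a shading point
  IntersectClosest,
  ShadeSurface,
  ShadeBackground,
  Terminated,
};

// Each kernel either terminates the path or names the next kernel to run,
// exactly like a state machine transition.
Kernel next_kernel(Kernel current, bool hit_surface, bool continue_path) {
  switch (current) {
    case Kernel::InitFromCamera:
    case Kernel::InitFromBake:  // baking differs only in initialization
      return Kernel::IntersectClosest;
    case Kernel::IntersectClosest:
      return hit_surface ? Kernel::ShadeSurface : Kernel::ShadeBackground;
    case Kernel::ShadeSurface:
      return continue_path ? Kernel::IntersectClosest : Kernel::Terminated;
    default:
      return Kernel::Terminated;
  }
}
```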

== Integrator State ==
The state of each path is stored in an `IntegratorState`. This state contains all the information the following kernels need to compute the rest of the path. Memory is reserved for millions of such integrator states, with the exact amount depending on GPU capabilities.

On the GPU, structure-of-arrays (SoA) storage is used for more efficient memory access patterns. The state must be as compact as possible, since many paths together take up significant GPU memory, and the more paths we can handle in parallel the more coherence we can extract.

Each integrator state can be active or inactive. If it is active, it stores the next kernel to be executed.

== Scheduling ==
The basic scheduling algorithm repeatedly picks the kernel with the most queued paths and executes it for all of those paths, initializing new paths from tiles whenever too few paths remain active to fill the GPU.

Further details:
 * The image is split into smaller tiles, and multiple can be scheduled at once to fill the available paths. This improves coherence and allows more flexible scheduling to keep more paths active.
 * Before scheduling additional tiles, paths are compacted so that all active paths are together at the start of the array, and inactive paths are at the end. This reduces fragmentation and helps threads access memory more coherently.
 * Shadow catchers split the main path into two. For this reason, only half of all inactive paths can be initialized at once when there are such objects in the scene, as the other half of the paths may be needed for the split path.
 * Main and shadow path states are stored in separate arrays. We ensure that enough shadow path array space is available before executing shaders that may create shadow paths.
 * The `shade_surface` kernel sorts paths by shader, for more coherent execution.
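
The kernel-selection and compaction steps above can be sketched as follows. This is a simplified illustration with hypothetical names; real Cycles operates on SoA GPU buffers, not `std::vector`:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// counts[k] = number of active paths queued for kernel k. Launching the
// largest queue maximizes how many threads run identical code at once.
int pick_next_kernel(const std::vector<size_t>& counts) {
  auto it = std::max_element(counts.begin(), counts.end());
  if (it == counts.end() || *it == 0)
    return -1;  // no queued paths: schedule more tiles or finish
  return static_cast<int>(it - counts.begin());
}

// Simplified path slot; real compaction moves the full integrator state.
struct PathSlot {
  uint32_t path_id;
  bool active;
};

// Move active paths to the front of the array and return their count, so new
// paths can be initialized into the contiguous inactive tail.
size_t compact_paths(std::vector<PathSlot>& paths) {
  auto first_inactive = std::stable_partition(
      paths.begin(), paths.end(), [](const PathSlot& p) { return p.active; });
  return static_cast<size_t>(first_inactive - paths.begin());
}
```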

== CPU ==
CPU rendering currently traces a light path from start to end in each thread. A single megakernel calls the individual microkernels as needed, sharing code with the GPU implementation but not using wavefront path tracing.

Multi-threading uses a simple parallel for loop over all pixels and samples to be rendered.
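
A minimal sketch of that model, using hand-rolled threading here (Blender's actual task scheduler differs):

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run body(i) for i in [0, count) across hardware threads, interleaving
// indices so each thread gets a similar amount of work. Each body call would
// trace one full light path for a (pixel, sample) work item.
void parallel_for(size_t count, const std::function<void(size_t)>& body) {
  size_t num_threads = std::thread::hardware_concurrency();
  if (num_threads == 0)
    num_threads = 1;
  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; t++) {
    workers.emplace_back([t, num_threads, count, &body] {
      for (size_t i = t; i < count; i += num_threads)
        body(i);  // e.g. megakernel(pixel(i), sample(i))
    });
  }
  for (std::thread& w : workers)
    w.join();
}
```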

However, CPUs may also benefit from wavefront path tracing; this is an optimization to investigate in the future. In particular, a kernel could be executed for N paths at the same time using N-wide SIMD instructions. Different solutions would be needed for different kernels:

 * OSL has support for SIMD shader execution, so shader evaluation could take advantage of this.
 * Embree supports pack-tracing, to trace multiple rays at once using SIMD.
 * For the remainder, a solution may be ISPC (custom compiler) or Enoki (C++ template library).