User:Jbakker/reports/Cycles OpenCL status report 1

= Cycles OpenCL Status Report =

Last updated: 2019-03-07

Summary
The goal of the Cycles OpenCL Optimization project is to reduce the time needed to compile the Cycles OpenCL kernels. The project has two phases. The first phase looks at technical optimizations. The second phase looks at the user experience.

Activities in the first phase are centered around Cycles and its OpenCL programs and kernels. The goal of this phase is to reduce the actual time needed to compile Cycles for an OpenCL platform. The activities in the second phase are oriented more toward the expectations of the user.

Phase 1: Optimizing Compilation
This section gives an overview of the results so far and the steps we still want to take. Phase 1 is all about the actual compilation times. When the project started, compilation took between 60 and 220 seconds per scene, depending on the features that were needed.

After some research we implemented a number of optimizations. Based on our measurements, we expect that we can bring the compilation time down to between 15 and 40 seconds, with an average of around 20 seconds.

Research LuxRender and AMD ProRender
The biggest difference between Cycles and LuxRender or AMD ProRender is that Cycles tries to compile once and cover most cases. Recompilation is only needed when a feature such as hair or volumetrics is added. LuxRender and AMD ProRender compile with all features enabled, but need to recompile the materials every time the node setup of a material changes. AMD ProRender even has a material library that steers the artist toward node setups that do not change that often.

We were not able to compare production-level scenes, as we cannot convert production-level Cycles materials to LuxRender or AMD ProRender. AMD ProRender has a conversion tool, but it breaks on complex materials, making it hard to compare compile times.

Parallel Compilation Of OpenCL Programs
Compiling OpenCL programs uses a single CPU core. Using more cores for compilation reduces compilation times. To make parallel compilation more effective we need to restructure the OpenCL programs. This restructuring entails:


 * Compile larger kernels in their own OpenCL program. Compile tiny kernels in a single OpenCL program. Compiling in multiple programs has some compilation overhead.
 * Order the OpenCL programs so we start compiling with the OpenCL Programs that take the longest time to compile.
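The two restructuring steps above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the kernel names, the compile costs in seconds, and the threshold for a "tiny" kernel are hypothetical and are not measurements from this project. The scheduler is a plain greedy longest-first assignment, not the actual Cycles implementation.

```python
import heapq


def group_programs(kernel_costs, small_threshold):
    """Give each large kernel its own program; bundle all tiny kernels into one."""
    programs = []
    small = []
    for name, cost in kernel_costs.items():
        if cost >= small_threshold:
            programs.append(([name], cost))
        else:
            small.append((name, cost))
    if small:
        names = [name for name, _ in small]
        programs.append((names, sum(cost for _, cost in small)))
    return programs


def schedule_longest_first(programs, workers):
    """Simulate compiling programs in parallel, starting the longest first.

    Each program goes to the worker that becomes free earliest.
    Returns the total wall-clock time and the start order.
    """
    ordered = sorted(programs, key=lambda program: program[1], reverse=True)
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for _, cost in ordered:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at), ordered


# Hypothetical kernels and compile costs in seconds (illustration only).
costs = {
    "path_trace": 40,
    "shader_eval": 30,
    "subsurface": 20,
    "data_init": 2,
    "queue_enqueue": 1,
    "buffer_update": 1,
}
programs = group_programs(costs, small_threshold=5)
makespan, order = schedule_longest_first(programs, workers=2)
serial_time = sum(cost for _, cost in programs)
```

With these made-up numbers, two workers finish in 50 seconds where a serial build takes 94 seconds. Starting the longest program first keeps a long compile from being left until the end, when the other cores would sit idle.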

Volumetric Rendering
During our research we found that enabling volumetric rendering increases the compilation time of every kernel that uses shader evaluation. We focused our research on which material nodes impacted the compilation times the most, and found some areas we could improve:


 * Split the generic parameter retrieval into a volumetric parameter retrieval and a surface parameter retrieval. This way the unrolling of the OpenCL program can be controlled better.
 * Reduce the complexity of the cubic sampling of a volume texture.

Bump mapping
During bump mapping there is one node operation that handles attributes in the X-direction and another that handles attributes in the Y-direction. These operations can be merged. This will have a minor impact on render times, but saves compilation time. We should also investigate whether we can merge these two operations into the standard attribute operation.

Optimizing Nodes
During the research we found that several nodes have a significant impact on the compilation time due to unrolling. These are:


 * Point Density Texture Node
 * Voronoi Texture Node

We will look into optimizing these nodes to compile faster.

Feature sets
All kernels are recompiled when a render feature is changed. We currently have 8 feature sets:


 * Normal
 * Hair
 * Volumetric
 * Hair + Volumetric
 * Sub Surface Scattering
 * Sub Surface Scattering + Hair
 * Sub Surface Scattering + Volumetric
 * Sub Surface Scattering + Hair + Volumetric

When we switch from one of these feature sets to another, we should only recompile the kernels that are affected by the switch. Some kernels are compiled without an implementation (an empty function). We do not need to compile these kernels, and we should not trigger them either.
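The selective recompilation described here can be sketched as follows. The 8 feature sets are the combinations of three optional features on top of the normal set. The kernel names and the mapping from kernels to the features that affect them are hypothetical and only serve as an illustration; they are not taken from the Cycles source.

```python
from itertools import combinations

OPTIONAL_FEATURES = ("hair", "volumetric", "subsurface")


def all_feature_sets():
    """Enumerate every combination of the optional features (empty set = Normal)."""
    sets = []
    for size in range(len(OPTIONAL_FEATURES) + 1):
        for combo in combinations(OPTIONAL_FEATURES, size):
            sets.append(frozenset(combo))
    return sets


# Hypothetical mapping: kernel name -> features that change its compiled code.
KERNEL_FEATURES = {
    "shader_eval": {"hair", "volumetric", "subsurface"},
    "intersect": {"hair"},
    "do_volume": {"volumetric"},
    "subsurface_scatter": {"subsurface"},
    "buffer_update": set(),
}

# Hypothetical kernels that are an empty function unless a feature is enabled.
REQUIRES = {"do_volume": "volumetric", "subsurface_scatter": "subsurface"}


def kernels_to_recompile(old, new):
    """Return the kernels that need a rebuild when switching feature sets."""
    changed = old ^ new  # features that were turned on or off
    result = []
    for kernel, deps in KERNEL_FEATURES.items():
        if not (deps & changed):
            continue  # kernel is not affected by this switch
        required = REQUIRES.get(kernel)
        if required is not None and required not in new:
            continue  # kernel would be empty: skip compiling and triggering it
        result.append(kernel)
    return sorted(result)


normal = frozenset()
volumetric = frozenset({"volumetric"})
enable_volume = kernels_to_recompile(normal, volumetric)
disable_volume = kernels_to_recompile(volumetric, normal)
```

In this sketch, enabling volumetrics rebuilds only `shader_eval` and `do_volume`. Disabling it again rebuilds only `shader_eval`, because `do_volume` would be an empty kernel in the new feature set.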

Tasks

 * 1) [Done] Research how LuxRender and AMD ProRender are organized.
 * 2) [Done] Research what the impact of every material node is on the compile time [T61461].
 * 3) [Done] Implement parallel compiling for OpenCL kernels [D2264].
 * 4) [Done] Restructure and re-order OpenCL Programs so they are optimized for parallel compiling [T61463 T61514 T61463].
 * 5) [Done] Reduce compilation of volumetric feature [T61513 T61533].
 * 6) [In Progress] Reduce the number of events when recompilation happens [T61501 T62252 T62266].
 * 7) [Done] Do not compile kernels when they are not needed [T61576].
 * 8) [Done] Merge small kernels that are executed in serial [T61466]
 * 9) [Not Started] Merge shadow blocked kernels [T61464].
 * 10) [Invalid] Put Point Density Texture Node inside a compile directive [T61479].
 * 11) [Invalid] Optimize Texture Voronoi Node [T61465].
 * 12) [Not Started] Reduce compilation times when subsurface and volumetric rendering are used together. Both features influence each other's compilation times, but this influence can be minimized. [T62304]
 * 13) [Not Started] Clean up OpenCL code base [T62267]

We did in-depth research on the `Point Density Texture` and the `Texture Voronoi` nodes (tasks 10 and 11) and came to the conclusion that there was no room for improvement in terms of compilation times.

Phase 2: User Experience
This phase of the project is about the user experience and what we can do to improve it. We will be optimizing the process for users working with OpenCL in Blender's viewport.


 * 1) [In Progress] Background compilation during scene preparation. [T61752]
 * 2) [In Progress] Introduction of a preview kernel. A kernel that is compiled really fast and provides the user with an AO render during the compilation of the real kernel. This way the user can continue working. [T61752]
 * 3) [Not Started] Minimize blocking when calculating Multiple Importance Sampling [T62300]
 * 4) [Not Started] Minimize blocking when calculating mesh displacements [T62301]
 * 5) [Invalid] Distributing SPIR kernels. Research done; drivers do not seem to support SPIR for OpenCL kernels due to HSAIL. [T61663]
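The idea behind tasks 1 and 2 (background compilation combined with a fast preview kernel) can be sketched like this. It is a minimal illustration, not the Blender implementation: the class, the kernel names, and the use of a thread pool are assumptions, and the slow build is simulated with an event instead of a real OpenCL compile.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class KernelSelector:
    """Use a quick preview kernel until the full kernel has finished building."""

    def __init__(self, executor, build_full):
        # Hypothetical preview kernel that is cheap enough to build up front.
        self.preview = "preview_ao"
        # The full kernel is built in the background without blocking the caller.
        self.full = executor.submit(build_full)

    def active_kernel(self):
        """Return the full kernel once it is ready, otherwise the preview kernel."""
        if self.full.done():
            return self.full.result()
        return self.preview

    def wait_for_full(self):
        """Block until the background build has finished."""
        return self.full.result()


# Stand-in for a long OpenCL build: it finishes when the event is set.
build_finished = threading.Event()


def slow_build():
    build_finished.wait()
    return "full_kernel"


with ThreadPoolExecutor(max_workers=1) as pool:
    selector = KernelSelector(pool, slow_build)
    during_build = selector.active_kernel()  # full build is still running
    build_finished.set()                     # simulate the build completing
    selector.wait_for_full()
    after_build = selector.active_kernel()
```

While the build is running the viewport would render with the preview kernel, and it switches to the full kernel as soon as the background build is done, so the user is never blocked.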

Compile times
The next table shows the compile times we measured at the beginning of this project and what we currently measure.

Render times
As a side effect of optimizing the compilation times, the render times for heavy scenes also changed. The next table shows the render times at the start of this project and the current render times.

Advice/Next Steps

 * Add support for an intermediate format (like SPIR/Kernel) to the AMD Pro drivers.
 * Add support for multi-platform compilation for AMD/OpenCL. At this time it is not possible to cross-compile the OpenCL kernels and distribute them with the Blender binary. Such a capability would allow us to distribute Cycles/OpenCL for the main OpenCL platforms.
 * Support for non-inlined functions. Currently all functions are inlined when compiling OpenCL. With this support we would be able to mark specific functions so that they are not inlined.
 * Restructure Subsurface Scattering and Volumetrics. These functionalities are the most time consuming to compile.