Note: This is an archived version of the Blender Developer Wiki (archived 2024). The current developer documentation is available on developer.blender.org/docs.

User:Jbakker/reports/Cycles OpenCL status report 1

Cycles OpenCL Status Report

Last updated: 2019-03-07

Summary

The goal of the Cycles OpenCL Optimization project is to reduce the time needed to compile Cycles for OpenCL. The project has two phases. The first phase looks at technical optimizations. The second phase looks at the user experience.

Activities in the first phase center on Cycles and its OpenCL programs and kernels. The goal of this phase is to reduce the actual time needed to compile Cycles for an OpenCL platform. Activities in the second phase are oriented more toward the expectations of the user.

Phase 1: Optimizing Compilation

This section gives an overview of the results so far and the steps we still want to take. Phase 1 is all about the actual compilation times. When the project started, compilation took between 60 and 220 seconds per scene, depending on the features needed.

After some research we implemented several optimizations. Based on our measurements, we expect to bring the compilation time down to between 15 and 40 seconds, with an average of around 20 seconds.

Research LuxRender and AMD ProRender

The biggest difference between Cycles and LuxRender or AMD ProRender is that Cycles tries to compile once to support most cases. Only adding features such as hair and/or volumetrics requires recompilation. LuxRender and AMD ProRender compile with all features enabled, but need to recompile the materials every time the node setup of a material changes. AMD ProRender even has a material library that steers the artist in a direction where the node setup does not change that often.

We were not able to compare production-level scenes, as we cannot convert production-level Cycles materials to LuxRender or AMD ProRender. AMD ProRender has a conversion tool, but it breaks on complex materials, which makes it hard to compare compile times.

Parallel Compilation Of OpenCL Programs

Compiling OpenCL programs uses a single core of your CPU. Using more cores for compilation reduces compilation times. To make parallel compilation more effective we need to restructure the OpenCL programs. This restructuring entails:

  • Compile larger kernels in their own OpenCL program. Compile tiny kernels in a single OpenCL program. Compiling in multiple programs has some compilation overhead.
  • Order the OpenCL programs so we start compiling with the OpenCL Programs that take the longest time to compile.
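
The two restructuring rules above can be illustrated with a minimal Python sketch. It is not the Cycles implementation: the kernel names, the estimated costs, the 1-second threshold and the `compile_program` stand-in are all hypothetical and only show how kernels could be grouped into programs and queued longest-first on a worker pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical kernel names with estimated compile costs in seconds.
# These are illustrative numbers, not measurements from this report.
KERNEL_COSTS = {
    "shader_eval": 12.0,
    "subsurface_scatter": 9.0,
    "direct_lighting": 6.0,
    "queue_enqueue": 0.3,
    "path_init": 0.4,
    "buffer_update": 0.2,
}


def plan_programs(costs, small_threshold=1.0):
    """Group kernels into programs and order the programs longest-first.

    Every kernel at or above the (assumed) threshold gets its own program.
    All smaller kernels share one program to limit per-program overhead.
    """
    large = [[name] for name, cost in costs.items() if cost >= small_threshold]
    small = [name for name, cost in costs.items() if cost < small_threshold]
    programs = large + ([small] if small else [])
    return sorted(programs, key=lambda p: sum(costs[k] for k in p), reverse=True)


def compile_program(kernels):
    """Stand-in for a real OpenCL build call; returns what it 'built'."""
    return tuple(kernels)


def compile_all(costs, workers=4):
    """Submit programs to a thread pool in longest-first order."""
    programs = plan_programs(costs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() submits in list order, so the slowest program starts first.
        return list(pool.map(compile_program, programs))
```

Starting with the most expensive program keeps it from becoming the tail of the schedule, which is the reason for the ordering rule.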

Volumetric Rendering

During our research we found that enabling volumetric rendering increases the compilation time of every kernel that uses shader evaluation. We focused our research on which material nodes impacted the compilation times the most and found some areas we could improve:

  • Split the generic parameter retrieval into a volumetric parameter retrieval and a surface parameter retrieval. This way the unrolling of the OpenCL program can be controlled better.
  • Reduce the complexity of the cubic sampling of a volume texture.

Bump mapping

During bump mapping there is one node operation that handles attributes in the X-direction and another that handles attributes in the Y-direction. These operations can be merged. This will have a minor impact on render times, but saves compilation time. We should also investigate whether we can merge these two operations into the standard attribute operation.

Optimizing Nodes

During the research we found that several nodes have a significant impact on the compilation time due to unrolling. These are:

  • Point Density Texture Node
  • Voronoi Texture Node

We will look into optimizing these nodes to compile faster.

Feature sets

All kernels are recompiled when a render feature is changed. We currently have 8 feature sets:

  • Normal
  • Hair
  • Volumetric
  • Hair + Volumetric
  • Sub Surface Scattering
  • Sub Surface Scattering + Hair
  • Sub Surface Scattering + Volumetric
  • Sub Surface Scattering + Hair + Volumetric

When we switch from one of those feature sets to another, we should only recompile the kernels that are affected by the switch. Some kernels are compiled without an implementation (an empty function). We do not need to compile these kernels, and we should not launch them either.
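
As an illustration of the idea, the following Python sketch enumerates the 8 feature sets from three on/off features and computes which kernels a switch would touch. The kernel names and the dependency tables are hypothetical and do not reflect the actual Cycles kernel layout.

```python
from itertools import product

FEATURES = ("Sub Surface Scattering", "Hair", "Volumetric")


def feature_sets():
    """Enumerate every on/off combination of the three optional features."""
    sets = []
    for flags in product((False, True), repeat=len(FEATURES)):
        sets.append(frozenset(f for f, on in zip(FEATURES, flags) if on))
    return sets


# Hypothetical table: the features whose code each kernel contains.
KERNEL_DEPENDS = {
    "path_init": set(),
    "shader_eval": {"Sub Surface Scattering", "Hair", "Volumetric"},
    "do_volume": {"Volumetric"},
    "subsurface_scatter": {"Sub Surface Scattering"},
}

# Hypothetical table: kernels that exist only to serve a single feature
# and are an empty function when that feature is off.
FEATURE_ONLY_KERNELS = {
    "do_volume": "Volumetric",
    "subsurface_scatter": "Sub Surface Scattering",
}


def kernels_to_recompile(old, new):
    """Kernels whose code depends on a feature that changed state."""
    changed = old ^ new
    return sorted(k for k, deps in KERNEL_DEPENDS.items() if deps & changed)


def kernels_to_skip(active):
    """Kernels that would be empty for this feature set."""
    return sorted(k for k, f in FEATURE_ONLY_KERNELS.items() if f not in active)
```

With these assumed tables, switching from the Normal set to Hair only touches `shader_eval`, while `path_init` never needs a rebuild.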

Tasks

  1. [Done] Research how LuxRender and AMD ProRender are organized.
  2. [Done] Research what the impact of every material node is on the compile time [T61461].
  3. [Done] Implement parallel compiling for OpenCL kernels [D2264].
  4. [Done] Restructure and re-order OpenCL Programs so they are optimized for parallel compiling [T61463 T61514 T61463].
  5. [Done] Reduce compilation of volumetric feature [T61513 T61533].
  6. [In Progress] Reduce the number of events when recompilation happens [T61501 T62252 T62266].
  7. [Done] Do not compile kernels when they are not needed [T61576].
  8. [Done] Merge small kernels that are executed in serial [T61466]
  9. [Not Started] Merge shadow blocked kernels [T61464].
  10. [Invalid] Put Point Density Texture Node inside a compile directive [T61479].
  11. [Invalid] Optimize Texture Voronoi Node [T61465].
  12. [Not Started] Reduce compilation times when subsurface and volumetric rendering are used together. Each feature influences the other's compilation time, but this influence can be minimized. [T62304]
  13. [Not Started] Clean up OpenCL code base [T62267]

We did in-depth research on the Point Density Texture and the Voronoi Texture nodes and concluded that there was no room for improvement in terms of compilation times.

Phase 2: User Experience

This phase of the project is about the user experience and what we can do to improve it. We will be optimizing the process for users working with OpenCL in Blender's viewport.

  1. [In Progress] Background compilation during scene preparation. [T61752]
  2. [In Progress] Introduction of a preview kernel. A kernel that is compiled really fast and provides the user with an AO render during the compilation of the real kernel. This way the user can continue working. [T61752]
  3. [Not Started] Minimize blocking when calculating Multiple Importance Sampling [T62300]
  4. [Not Started] Minimize blocking when calculating mesh displacements [T62301]
  5. [Invalid] Distributing SPIR kernels. Research done, drivers do not seem to support SPIR for OpenCL Kernels due to HSAIL [1]. [T61663]
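
The preview kernel idea from items 1 and 2 can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not how Cycles implements it: `compile_kernel`, `KernelSelector` and the kernel names are hypothetical, and an event stands in for the duration of the slow compile.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def compile_kernel(name, ready=None):
    """Stand-in for an OpenCL build.

    If `ready` is given, the call blocks until the event is set, which
    mimics a slow compile without relying on timing.
    """
    if ready is not None:
        ready.wait()
    return name


class KernelSelector:
    """Serve a fast preview kernel until the full kernel has compiled."""

    def __init__(self, pool, full_ready=None):
        # The preview kernel is assumed to compile quickly, so build it inline.
        self.preview = compile_kernel("preview_ao")
        # The full kernel compiles in the background.
        self._full = pool.submit(compile_kernel, "full_path_trace", full_ready)

    def current(self):
        """Return the full kernel once it is ready, else the preview kernel."""
        return self._full.result() if self._full.done() else self.preview

    def wait(self):
        """Block until the background compile has finished."""
        self._full.result()
```

The viewport would call `current()` for every redraw, so the user sees the AO preview first and the full render as soon as the background compile finishes.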

Results so far

Compile times

The next table shows the compile times we measured at the beginning of this project and what we currently measure.

Scene            Start (s)  Current (s)  Improvement
empty                22.73         7.37          68%
bmw                  56.44        13.57          76%
fishycat             59.50        14.55          76%
barbershop          212.28        28.84          86%
classroom            51.46        13.70          73%
koro                 62.48        16.35          74%
pavillion            54.37        13.71          75%
splash279            55.76        14.97          73%
volume_emission     145.22        25.04          83%
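
The Improvement column appears to be the relative reduction in time, rounded to a whole percent. The small Python sketch below reproduces it under that assumption; the function name is hypothetical.

```python
def improvement_percent(start_s, current_s):
    """Relative time reduction in whole percent; negative means slower."""
    return round((start_s - current_s) / start_s * 100)
```

For example, `improvement_percent(212.28, 28.84)` matches the 86% listed for barbershop.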

Render times

As a side effect of optimizing the compilation times, the render times for heavy scenes also changed. The next table shows the render times at the start of this project and the current render times.

Scene            Start (s)  Current (s)  Improvement
empty                20.63        20.61           0%
bmw                 191.00       192.59          -1%
fishycat            393.48       388.57           1%
barbershop         1623.53       931.73          43%
classroom           341.23       339.37           1%
koro                475.96       354.81          25%
pavillion           903.48       918.77          -2%
splash279            53.68        54.42          -1%
volume_emission      62.38        38.52          38%

Advice/Next Steps

  • Add support for an intermediate format (like SPIR/Kernel) to the AMD Pro drivers.
  • Add support for multi-platform compilation for AMD/OpenCL. At this time it is not possible to cross-compile the OpenCL kernels and distribute them with the Blender binary. Such a capability would allow us to distribute Cycles/OpenCL for the main OpenCL platforms.
  • Support for non-inlined functions. Currently all functions are inlined when compiling OpenCL. With support for non-inlined functions we would be able to mark selected functions as not to be inlined.
  • Restructure Subsurface Scattering and Volumetrics. These features are the most time-consuming to compile.