CUDA Status


  • SoA (structure of arrays): this is important for good performance in the split kernel. For local memory in the megakernel it happens automatically, but for the split kernel's global memory state we need to do it manually. This will require macros or preprocessing of the source code to keep the code from becoming too complicated; no better solution is known.
  • Spilling: we are probably bottlenecked by this a lot. There are no easy solutions. We can restructure code and change algorithms, inline as much as possible, and manually place data in shared or global memory where it helps.
  • Block size: we currently use the maximum possible, but this may not be optimal. It was suggested to try smaller values, e.g. 128.
  • Max registers: the optimal value requires tweaking and benchmarking. It was recently increased from 48 to 64 in master for Pascal.
  • Memory latency and throughput depend on the type of memory used (HBM, GDDR, ..). Something to keep in mind when benchmarking and tuning parameters for performance.
  • Using shared memory instead of local memory could help in some cases (BVH traversal, SVM stack, ..); however, so far we haven't found improvements from this. Reducing the block size would make more shared memory available per thread, which could help.
  • Memory usage: we should try to reduce kernel state size as much as possible, both to keep more split kernel rays active and to avoid the memory bottleneck. Compression can help (quantize values between 0..1, half floats, ..). It could also help to do direct lighting before shader evaluation, so closures don't need to be kept in memory after the shader evaluation kernel.
  • Inlining: almost all functions can be force inlined, though this affects compile time and the instruction cache. It may help to revisit this: force inline everything and then incrementally tweak from there. It also helps to reorganize code so functions are not called from multiple places.
  • Driver timeouts and UI responsiveness are a problem. On Linux + Pascal we have compute preemption, but on other platforms performance is suboptimal since we can't schedule enough work without risking timeouts. The timed sync code from D2862 would work better with the split kernel, where long paths don't hold up others.
  • Hair: the Koro scene is slower on CUDA relative to OpenCL. Ideas to optimize it:
    • Increasing max registers helped performance in the Koro and Fishy Cat scenes, which indicates that hair intersection likely suffers from spilling. It may be possible to restructure the code to reduce this.
    • Removing the hair minimum width may help; it's unclear if it provides any benefit when rendering with many AA samples, so this would need to be tested.
    • Could Bézier control points in the local frame be stored in shared memory?
    • Better hair intersection algorithms could help too.
    • Koro also has lots of transparent shadows, some optimization may be possible there (avoid shader eval with fixed transparency, optimize shaders specifically for transparent shadow rays?).
    • Getting the CUDA split kernel to work would likely help in this and other scenes.
  • CUDA split kernel optimal performance probably requires:
    • SoA
    • Reduced state memory usage.
    • Better scheduling, the current work queues and atomics are slow.


Build and Code


  • CUDA devices can be enabled under File > User Preferences > System, then Save User Settings.
  • Hidden debugging settings can be found in the Debug panel in the Render properties, when running with: ./blender --debug-value 256
    • For CUDA this is mainly split kernel (wavefront) and adaptive compilation (don't compile features that are not used).
    • These currently do not work well on Windows; we're adding nvrtc support, which should help (D2913)

Profiling and Benchmarking