Cycles has had a split OpenCL kernel since Blender release 2.75. It is an alternative approach to the so-called megakernel used on the CPU. The idea behind splitting the kernel is to have multiple smaller kernels which are not only simpler and faster to compile, but also perform better. The initial split kernel patch was contributed by AMD. With subsequent patches and current drivers, all production files from the official Cycles benchmark pack (including the huge one from Gooseberry, for cards with 8+ GB of memory) now render pretty fast.
Activate OpenCL rendering
- For AMD cards on Windows and Linux you just need to select your GPU under File -> User Preferences -> System. The split kernel is used by default to give the best performance.
- To use OpenCL on other platforms, launch Blender with --debug-value 256 (either on the command line or by adding it to your shortcut). This adds a "Debug" section to the render options panel, where you can choose the kernel (split or mega) and the platform (set it to "All" to enable OpenCL for Intel and NVIDIA).
Then choose GPU as the render device in the render options panel.
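As a minimal sketch of the launch step above (it assumes the blender binary is on your PATH; adjust the path for your installation):

```shell
# Launch Blender with the debug value that exposes the "Debug" section
# in the render options panel, where the kernel and platform can be chosen.
blender --debug-value 256
```

On Windows the same flag can be appended to the target field of a shortcut.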
To test the performance of your computer, you can download the official Cycles benchmark files from https://download.blender.org/demo/test/cycles_benchmark_20160228.zip. These include production files for films, archviz (exterior and interior), comics, etc.
How to get the best performance
- A big tile size gives the best performance, but it also needs more memory. If your scene doesn't fit in memory (you either get an error or your render is transparent), try lowering the tile size to something like 384x384.
- For a single-GPU configuration it is best to set the tile size to the image size or larger.
- If you have many devices, you can either:
- cut one of the image dimensions by the number of devices you render with (for example, a 1920x1080 image rendered on 4 GPUs would have tiles of 480x1080). This works well if the tiles are of roughly equal complexity. If one tile contains only background while another contains only hair, the render will wait for the hair tile to finish; in that case, use the second solution:
- use progressive rendering (the checkbox just under the tile size in the Performance section).
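The tile-splitting arithmetic from the first option can be sketched as follows (a hypothetical helper for illustration, not part of Blender's API):

```python
def tiles_for_devices(width, height, num_devices):
    """Split an image into one full-height tile per device.

    Follows the rule of thumb above: cut one image dimension by the
    number of devices, so each device renders one tall tile.
    """
    tile_w = width // num_devices  # assumes width divides evenly
    return [(tile_w, height) for _ in range(num_devices)]

# A 1920x1080 image rendered on 4 GPUs -> four 480x1080 tiles.
print(tiles_for_devices(1920, 1080, 4))
```

This only balances work if the tiles have similar complexity; as noted above, progressive rendering is the better choice otherwise.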
Compare your results
- You can compare your configuration's performance to some platforms tested at the Blender Institute and Lionrender here: https://lionrender.com/2016/06/16/top-gpu-cards-performance-comparison-in-blender-gtx1080-titan-x-and-gtx980ti/
- There are also various threads where rendering times are discussed on http://www.blenderartists.org.
Current issues and limitations
There are some known issues which are common to all kernels and platforms:
- Baking does not deliver correct results.
- Transparent shadows work in 2.76 and up but are relatively slow. Depending on the depth set in the "Light Paths" panel, rendering time can increase by a factor of 2 to 5 (according to user reports made here: https://blenderartists.org/forum/showthread.php?400121-AMD-RX-480-with-8-Gigs-of-Vram-at-229&p=3091536&viewfull=1#post3091536). Slowdowns are also seen on CPU and CUDA, but between 1.01x and 2x on CPU and up to 5x on CUDA. Optimizing transparent shadows on the GPU would bring a considerable speedup in many scenes.
- The correlated multi-jitter pattern, SSS and volumes are planned for inclusion in upcoming releases.
- Branched path tracing is quite tricky to integrate into the current implementation of the split kernel; supporting it would mean doing the split kernel work once again.
- Compiling the kernels takes up to 30 seconds on a high-end CPU. This leads developers to reduce the selective node compilation possibilities in order to keep recompile times down. At the moment this process is single-threaded; making it multi-threaded would greatly reduce compile times. LuxRender, for example, can recompile its kernel in the Blender viewport, while adding new nodes to materials, in less than 2 seconds.
For now these are considered TODOs rather than bugs.
- Only GCN cards work. AMD cards that are more than 5 years old may not work.
- Rendering animations with drivers up to 16.8.2 will crash after a few hundred frames, depending on available system memory (at the time of writing, no tested driver has solved the problem). The reason is a memory leak; see T48410.
Some current performance issues are:
- Slowdowns in complex scenes because of high register pressure due to all functions being inlined.
- The ordering of nodes seems to have some impact on performance. See T44943 (speed regression after proper integration of selective node compilation); it seems to be caused by a different order of nodes in the switch() statement. While we can support selective nodes to a certain extent, we can't do selective ordering of nodes. It's not an issue on other platforms, so it's something to be improved in the driver.
- The current implementation of selective node compilation has lower performance than the original implementation from AMD. Some of the performance loss seems to be due to the above-mentioned ordering of nodes, but a big part is due to the current implementation not being as fine-grained as the original one, which compiled only the nodes actually used in the rendered scene.
Hardware Issues (very old reports, to be updated)
While the official statement from AMD about the split kernel was that all GCN devices are supported, we've got several bug reports in the tracker claiming that some GCN cards with the latest drivers do not work:
- T45336, T45892 AMD FirePro W5000
- T45417 AMD HD 8650G
- [Unreported] AMD FirePro W8000: Driver fails to compile Baking kernel (using 15.7 driver)
NOTE: We have some workarounds on the Cycles side to minimize the number of features being compiled in to the absolute minimum required to render a specific scene.
NOTE on the note: This is no longer true, but it remains a good goal for a future release. At the moment, only groups of nodes (0 to 3) are selectively compiled, as well as some special features like hair.
It's also unclear whether APUs are officially advertised as supported by the AMD compiler; here's a list of reports about APU support:
- Here in the studio, kernel compilation for the A10 works fine, but rendering gets stuck at some point. This is a similar issue to split OpenCL on OS X.
AMD on OSX
The AMD team working on OS X drivers for El Capitan (OS X 10.11) did really nice work improving the driver, which is now capable of compiling and running the OpenCL megakernel. The following features are supported:
- Hard and rough surface BSDF
- Transparent shadows
- Motion blur (camera, object, deformation)
Nothing special is needed to use OpenCL on OS X now; just go to the User Preferences and enable the OpenCL compute device.
The following features are to be investigated for inclusion into next Blender release:
- Correlated multi jitter noise pattern
- Volume scatter/absorption
Other features require somewhat bigger changes and will happen in one of the later releases.
Split kernel status
The split kernel on OS X has some issues with reporting ray status back to the CPU (and possibly others), which makes it unusable on this platform.
This is to be investigated still, more details later.
OpenCL on other platforms
OpenCL works fine on NVIDIA cards, but performance is considerably slower (up to a 2x slowdown) compared to CUDA, so it isn't really worth using OpenCL on NVIDIA cards at the moment.
Intel OpenCL works reasonably well. It's even possible to use OpenCL to render on the GPU and CPU at the same time, until a more proper solution is implemented.