Current state of OpenCL in Blender (May 2015) (WIP)
Cycles was included in Blender with the release of 2.61 in December 2011. The release notes mention: “Cycles has two GPU rendering modes, through CUDA, which is the preferred method for NVidia graphics cards, and OpenCL, which is intended to support rendering on AMD/ATI graphics cards”. Ever since, OpenCL support in Cycles, or the lack thereof, has been a topic of debate. In April 2015 AMD contributed a set of patches to improve the situation.
Supported AMD devices
|Name||Code Name||Cores||GCN CU count|
|Radeon HD 7730||Cape Verde LE||384||6|
|Radeon HD 7750||Cape Verde PRO||512||8|
|Radeon HD 7770 GHz||Cape Verde XT||640||10|
|Radeon HD 7790||Bonaire XT||896||14|
|Radeon HD 7850||Pitcairn PRO||1024||16|
|Radeon HD 7870 GHz||Pitcairn XT||1280||20|
|Radeon HD 7870 XT||Tahiti LE||1536||24|
|Radeon HD 7950||Tahiti PRO||1792||28|
|Radeon HD 7950 Boost||Tahiti PRO2||1792||28|
|Radeon HD 7970||Tahiti XT||2048||32|
|Radeon HD 7970 GHz||Tahiti XT2||2048||32|
|Radeon HD 7990||New Zealand||2048 x 2||32 x 2|
|Radeon HD 8570||Oland||384||6|
|Radeon HD 8670||Oland||384||6|
|Radeon HD 8760||Cape Verde XT||640||10|
|Radeon HD 8770||Bonaire XT||896||14|
|Radeon HD 8860||Pitcairn XT||1280||20|
|Radeon HD 8950||Tahiti PRO||1792||28|
|Radeon HD 8970||Tahiti XT2||2048||32|
|Radeon HD 8990||Malta||2048 x 2||32 x 2|
|Radeon R5 240||Oland||320||5|
|Radeon R7 240||Oland PRO||320||5|
|Radeon R7 250||Oland XT||384||6|
|Radeon R7 250X||Cape Verde XT||640||10|
|Radeon R7 260||Bonaire||768||12|
|Radeon R7 260X||Bonaire XTX||896||14|
|Radeon R7 265||Curaçao PRO||1024||16|
|Radeon R9 270||Curaçao PRO||1280||20|
|Radeon R9 270X||Curaçao XT||1280||20|
|Radeon R9 280||Tahiti PRO3||1792||28|
|Radeon R9 280X||Tahiti XT2||2048||32|
|Radeon R9 285||Tonga PRO||1792||28|
|Radeon R9 290||Hawaii PRO||2560||40|
|Radeon R9 290X||Hawaii XT||2816||44|
|Radeon R9 295X2||Vesuvius||2816 x 2||44 x 2|
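The core counts in the table follow directly from the CU counts: a GCN compute unit contains four 16-lane SIMD units, so 64 stream processors per CU. A minimal sketch of that arithmetic, checked against a few rows of the table; the function and variable names are made up for illustration only:

```python
# Each GCN compute unit (CU) holds 4 SIMD units of 16 lanes each.
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16


def gcn_core_count(cu_count, gpus=1):
    """Return the total stream-processor ("core") count for a GCN board.

    cu_count is the number of CUs per GPU, gpus the number of GPUs on the board.
    (Hypothetical helper, not part of Blender or any AMD tool.)
    """
    return cu_count * SIMDS_PER_CU * LANES_PER_SIMD * gpus


# A few rows from the table above: name -> (CU count, GPUs on the board)
examples = {
    "Radeon HD 7730": (6, 1),
    "Radeon R9 290X": (44, 1),
    "Radeon R9 295X2": (44, 2),
}

for name, (cus, gpus) in examples.items():
    print(name, gcn_core_count(cus, gpus))
```

For the dual-GPU boards the table lists the per-GPU figure "x 2"; the sketch simply multiplies it out.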
|Device||Operating system||Driver / Toolkit||Version||Status|
|Radeon HD 7000 series (Southern Islands)||Windows 7 x64||Catalyst Beta||15.04||Works (Not all features)|
|Radeon HD 7000 series (Southern Islands)||Linux x64||Catalyst Beta||15.04||Works (Not all features)|
|Radeon HD 7000 series (Southern Islands)||Mac OS X 10.9.2||Apple||10.10.1||Not usable|
|Intel Iris Pro||Mac OS X 10.9.2||Apple||10.10.1||Crash|
|Nvidia GTX 400/500 (Fermi)||All||Nvidia||all||Works (CUDA is faster)|
|Nvidia GTX 600/700 (Kepler)||All||Nvidia||all||Works (CUDA is faster)|
|Nvidia GTX 750 (Maxwell)||All||Nvidia||all||Works (CUDA is faster)|
|Intel Core / Xeon||Windows x64||Intel||SDK 2013||Works (C++ is faster)|
|Intel Core / Xeon||Linux x64||Intel||SDK 2013||Works (C++ is faster)|
|Intel or AMD CPU||Mac OS X 10.9.2||Apple||10.9.2||Works (C++ is faster)|
Current status of Windows / Linux AMD OpenCL drivers
In the past year AMD’s implementation has improved a lot. The Cycles kernel (the part that gets loaded onto the GPU) is really large compared to the typical workload. A year ago one would have had to disable most of what makes Cycles good to be able to compile it at all. As of a few months ago, if you have a recent AMD card and the latest driver, you can get away with disabling only some parts. While this in itself is good news and shows progress, it gives no indication of how long the remaining work will take.
As it stands now we are missing two crucial things:
- An OpenCL compiler tool-chain that can compile the whole of Cycles without having to disable parts. This may or may not include register spill improvements. Cycles will grow, and while spilling is very undesirable for the typical workload you want to run on a GPU, it is unavoidable for us.
- User control over when to spill. By default a good OpenCL compiler will do everything it can to avoid spilling and will use all available resources before doing so. We need it to spill registers into the slow main memory before it reaches the resource limit. This is needed to run enough parallel threads to get the kind of performance our users want from Cycles and expect when comparing their AMD product to a comparable Nvidia product. (We rely on this to get enough parallel threads on Nvidia-based hardware.)
These two do not include the expected workarounds and hacks that come with any language that has multiple implementations. Those are expected and will not cause serious problems. Once we have both, we could get to a situation where it makes economic sense to buy AMD products for Cycles GPU rendering.
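To make the second point more concrete, here is a rough sketch of why capping register use (and spilling the rest) can raise the number of threads in flight. The constants are illustrative GCN-style figures assumed for this example (256 vector registers per SIMD lane, at most 10 resident wavefronts per SIMD, allocation in blocks of 4), and the function name is hypothetical; real drivers take more limits into account.

```python
# Illustrative GCN-style figures (assumptions for this sketch, not measurements):
VGPRS_PER_SIMD_LANE = 256   # vector registers available to each SIMD lane
MAX_WAVES_PER_SIMD = 10     # hardware cap on resident wavefronts per SIMD
VGPR_GRANULARITY = 4        # registers are handed out in blocks of this size


def waves_per_simd(vgprs_per_thread):
    """Estimate how many wavefronts fit on one SIMD for a given register use."""
    if not 1 <= vgprs_per_thread <= VGPRS_PER_SIMD_LANE:
        raise ValueError("register count out of range")
    # Round the request up to the allocation granularity.
    blocks = -(-vgprs_per_thread // VGPR_GRANULARITY)
    allocated = blocks * VGPR_GRANULARITY
    # Fewer registers per thread means more wavefronts can be resident,
    # up to the hardware cap.
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // allocated)


for regs in (24, 64, 128, 200):
    print(regs, "registers per thread ->", waves_per_simd(regs), "wavefronts per SIMD")
```

Under these assumptions a kernel that uses 200 registers per thread keeps a single wavefront resident per SIMD, while forcing it down to 64 registers (and spilling the remainder to memory) allows four. Whether that trade pays off depends on how expensive the spills are, which is why the control needs to be in the user's hands.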
Current state of OS X AMD OpenCL drivers
For Apple’s OS X the situation is a bit different but generally comparable.
OpenCL Blender / Cycles roadmap:
- Improve texture lookup / interpolation. We are currently doing our own interpolation and lookup; on platforms that provide enough textures we could switch to using the native texture support instead.
- Implement device fission. This would be especially useful when the user does not have a dedicated GPU/accelerator.
- Use more OpenCL 1.2 features. Nvidia’s OpenCL is still at 1.1, but it is not a serious target because CUDA is the better option there anyway.
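As an illustration of what "doing our own interpolation and lookup" means in the first roadmap item, the following is a minimal sketch of a manual bilinear texture fetch on a flat, single-channel texel array. It is not the Cycles implementation; the function names, the clamp-to-edge behaviour and the texel-center convention are assumptions chosen for the example. Hardware texture units perform this filtering in fixed-function logic, which is what switching to native textures would take advantage of.

```python
import math


def texel(image, width, height, x, y):
    """Fetch one texel, clamping the coordinates to the image edge."""
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    return image[y * width + x]


def bilinear_lookup(image, width, height, u, v):
    """Sample a single-channel image at normalized (u, v) with bilinear filtering."""
    # Map to texel space; texel centers sit at half-integer positions.
    fx = u * width - 0.5
    fy = v * height - 0.5
    x0 = math.floor(fx)
    y0 = math.floor(fy)
    tx = fx - x0  # horizontal blend weight
    ty = fy - y0  # vertical blend weight
    # Blend the two texels on the upper row, then the two on the lower row.
    top = (1 - tx) * texel(image, width, height, x0, y0) \
        + tx * texel(image, width, height, x0 + 1, y0)
    bottom = (1 - tx) * texel(image, width, height, x0, y0 + 1) \
        + tx * texel(image, width, height, x0 + 1, y0 + 1)
    # Finally blend the two rows.
    return (1 - ty) * top + ty * bottom


# A 2 x 2 image stored row by row.
image = [0.0, 1.0,
         2.0, 3.0]
print(bilinear_lookup(image, 2, 2, 0.5, 0.5))
```

Sampling the exact middle of the 2 x 2 image weights all four texels equally, so the call above returns their average.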
OpenCL in Blender outside of Cycles
Cycles is not the only area in Blender that benefits or can benefit from OpenCL acceleration. The Blender compositor also optionally uses OpenCL for some operations and there are plans to increase its usage. Physics simulation, and especially Bullet, also offers chances to use more OpenCL.
As a technology OpenCL aligns very well with graphics and as a result with Blender. Our mission reads “We want to build a free and open source complete 3D creation pipeline for artists and small teams.” Affordable high performance GPGPU products like AMD discrete graphics cards and APUs fit really well with that. Now let’s make that happen.
Q & A:
Q: Why only talk about AMD’s OpenCL? There are a lot of other implementations out there.
A: While this is true, there is no real user-visible benefit to using them. On Nvidia hardware CUDA outperforms OpenCL, and Nvidia is stuck on OpenCL 1.1. Intel GPUs are getting more powerful but are not a good target yet, and CPU-based OpenCL provides little or no benefit over CPU-based Cycles.
Q: Why don’t you just split up Cycles so it can run better on AMD hardware?
A: While this would likely help, it is not a trivial matter to split up Cycles in this way. It is also not clear whether it would help, or by how much. For a resource-constrained open-source project this will most likely not be a top priority.
Q: There seem to be things that Blender can do to make things better, so why point to AMD for this?
A: While it is absolutely true that Cycles could be made to run better on any and all other OpenCL implementations, there is no compelling user-visible reason to do so. Making it run better on Nvidia’s OpenCL is of no use as we have CUDA there. Making it better on AMD’s or Intel’s CPU-based OpenCL implementations is of no real use as we have a compiled C++ kernel on those platforms.
Q: What about LuxRender? Why does that not have a problem running on AMD's GPUs?
A: While it is true that LuxRender does run (and has excellent performance) on AMD's hardware, this is not by accident. It is my understanding that LuxRender is developed targeting the AMD OpenCL runtime and as a result keeps within the limits it imposes. The Cycles kernel is a valid OpenCL program, but not all OpenCL implementations can compile and run every valid program. One could attribute this to hardware insufficiencies, and while that might be true for AMD's VLIW4 architecture and maybe for Intel's Iris, AMD's GCN architecture should be able to support kernels the size of Cycles. According to the AMD GCN ISA docs it should be possible to have proper function calls and thus run (maybe slowly) large programs: "The SALU also can perform operations directly on the Program Counter, allowing the program to create a call stack in SGPRs".