
= Cycles Devices =

== Overview ==
We assume rendering happens on a device where we cannot directly manipulate memory or call functions, so all communication needs to go through the Device interface.

There are a few device backends:

 * CPU: render on the CPU of the host machine, with multithreading.
 * CUDA: render on an NVIDIA GPU
 * OptiX: render on an NVIDIA GPU, using hardware ray-tracing
 * HIP: render on an AMD GPU
 * Multi: balance rendering on multiple devices (GPU+GPU or CPU+GPU)

These devices have methods to:
 * Query device information
 * Allocate, copy and free memory
 * Build BVHs
 * Execute kernels (in a queue)
 * Share buffers with OpenGL (graphics interop) for fast display of renders
 * Denoise with native APIs
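The method categories above can be sketched as a minimal abstract interface with a trivial CPU backend. This is illustrative only: the class names and signatures here are assumptions for the sketch, not the actual Cycles `Device` API.

```cpp
// Hypothetical sketch of a device abstraction in the spirit of the Device
// interface described above; names and signatures are illustrative.
#include <cstdlib>
#include <cstring>
#include <string>

struct DeviceInfo {
  std::string description;
  bool has_hardware_raytracing = false;
};

class Device {
 public:
  virtual ~Device() = default;

  // Query device information.
  virtual DeviceInfo info() const = 0;

  // Allocate, copy and free memory on the device.
  virtual void *mem_alloc(size_t size) = 0;
  virtual void mem_copy_to(void *device_ptr, const void *host_ptr, size_t size) = 0;
  virtual void mem_copy_from(void *host_ptr, const void *device_ptr, size_t size) = 0;
  virtual void mem_free(void *device_ptr) = 0;
};

// CPU backend: "device" memory is just host memory, so the copies are memcpy.
class CPUDevice : public Device {
 public:
  DeviceInfo info() const override { return {"CPU", false}; }
  void *mem_alloc(size_t size) override { return std::malloc(size); }
  void mem_copy_to(void *d, const void *h, size_t n) override { std::memcpy(d, h, n); }
  void mem_copy_from(void *h, const void *d, size_t n) override { std::memcpy(h, d, n); }
  void mem_free(void *d) override { std::free(d); }
};
```

A GPU backend would implement the same interface with the native driver calls (e.g. CUDA allocation and copy functions), which is what lets the rest of Cycles stay backend-agnostic.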

There are a few differences between CPU and GPU devices:
 * CPU devices do not have a kernel execution queue
 * OpenShadingLanguage is only supported on CPUs currently

== Device Memory ==
Different types of memory are allocated on devices using a few utility classes:
 * `device_only_memory`: memory that resides only on the device and is never read by the CPU host, typically working memory for kernels.
 * `device_vector`: equivalent of `std::vector`, for memory that is shared between CPU and GPU.
 * `device_texture`: 2D or 3D image texture, using native GPU texture handles.

Memory must be explicitly copied to and from devices; unified memory is currently not used.

By default, memory operations are performed synchronously on the default GPU queue (or stream). This is used for allocating scene memory and render buffers.

For kernel scheduling, memory allocation and copying should be performed on the GPU queue used for kernel execution. This ensures operations are properly synchronized, and can be performed asynchronously for better performance.
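The ordering guarantee that makes this safe can be modeled with a tiny in-order queue: operations submitted to the same queue run in submission order, and synchronization blocks until all have completed. This single-threaded mock only models the ordering semantics, not real asynchrony or the Cycles queue API.

```cpp
#include <functional>
#include <queue>

// Minimal model of in-order queue semantics: enqueued operations run in
// submission order when synchronize() is called.
class QueueSketch {
 public:
  void enqueue(std::function<void()> op) { ops_.push(std::move(op)); }
  void synchronize() {
    while (!ops_.empty()) {
      ops_.front()();
      ops_.pop();
    }
  }

 private:
  std::queue<std::function<void()>> ops_;
};
```

Because a copy enqueued before a kernel is guaranteed to finish first, the kernel can safely read the copied data without any extra synchronization, which is why copies belong on the execution queue.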

For historical reasons, some memory is encoded in vectors with types like `uint4` or `float4` even though a structure would be clearer. When refactoring an area, this can be changed to structs without performance loss.
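Why the refactor is performance-neutral: a named struct with the same field order has the same size and layout as the packed vector type, so the kernels see identical memory. The types below are local stand-ins for illustration, not the GPU vector types or any actual Cycles struct.

```cpp
#include <cstdint>

// Packed uint4-style element (stand-in for the GPU vector type).
struct uint4_sketch {
  uint32_t x, y, z, w;
};

// The same 16 bytes with self-documenting field names (hypothetical example).
struct KernelTriangle {
  uint32_t vert0, vert1, vert2, shader_id;
};

// Identical size and field order, so swapping one for the other is
// layout-neutral: no change to what the kernels read.
static_assert(sizeof(uint4_sketch) == sizeof(KernelTriangle),
              "struct refactor must not change the memory layout");
```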

== Host Memory Fallback ==
GPU devices typically have less memory than the CPU host. For this reason, scene memory can be automatically moved to host memory, which allows rendering bigger scenes at the cost of slower memory access.

Only `device_vector` and `device_texture` memory can be moved to the host. Other working memory is assumed to require fast access and must be in GPU memory.

== Textures ==
GPUs have dedicated hardware for interpolated texture lookups, so each device implements its own texture sampling to take advantage of it. This mechanism is used for both 2D and dense 3D textures.

Sparse 3D textures are stored and sampled with NanoVDB, also with a device-specific implementation.
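For intuition, this is the bilinear filtering that GPU texture hardware performs for a 2D lookup, and that a CPU implementation has to do in software. A minimal sketch, assuming unnormalized texel coordinates and clamped edges; the real implementations support more addressing and filtering modes.

```cpp
#include <cmath>
#include <vector>

// Bilinear filtering of a single-channel 2D texture: fetch the four
// neighboring texels and blend by the fractional coordinates.
float sample_bilinear(const std::vector<float> &tex, int width, int height,
                      float u, float v) {
  auto clampi = [](int x, int lo, int hi) { return x < lo ? lo : (x > hi ? hi : x); };
  int x0 = clampi((int)std::floor(u), 0, width - 1);
  int y0 = clampi((int)std::floor(v), 0, height - 1);
  int x1 = clampi(x0 + 1, 0, width - 1);
  int y1 = clampi(y0 + 1, 0, height - 1);
  float fx = u - std::floor(u);  // fractional part drives the blend weights
  float fy = v - std::floor(v);
  float t00 = tex[y0 * width + x0], t10 = tex[y0 * width + x1];
  float t01 = tex[y1 * width + x0], t11 = tex[y1 * width + x1];
  return (1 - fy) * ((1 - fx) * t00 + fx * t10) +
         fy * ((1 - fx) * t01 + fx * t11);
}
```

On GPUs this whole computation, including the four fetches, is a single hardware texture instruction, which is why device-specific sampling paths are worth the extra implementation effort.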

== Multi Device ==
Multi device abstracts memory allocation and BVH building over multiple devices, multiplexing calls to all devices. Multiple GPUs of the same type can share memory, either with peer-to-peer access or using a host memory fallback.

Kernel execution on the other hand is not abstracted, and each device must be handled individually. The `integrator/` module handles scheduling kernel execution, associated memory allocation and denoising over multiple devices.