Note: This is an archived version of the Blender Developer Wiki (archived 2024). The current developer documentation is available on developer.blender.org/docs.

User:OmarSquircleArt/GSoC2019/Documentation/Code Repetition: Performance Readability And Portability

While working on improving and extending procedural textures, I found it challenging to write performant, readable, and portable code while also reducing code repetition. This document outlines and discusses those challenges.

Portability in this document denote the facility of porting an implementation from one backend to another. In particular, the facility of writing code for all three backends GLSL, SVM, and OSL.

Generic Vector Types

Problem

Most of the procedural textures we write can operate in 1D, 2D, 3D, or 4D dimensions. In most cases, the implementation is very similar and can be implemented using a Generic Programming approach. Unfortunately, non of the backends support generic programming. So we end up duplicating the same code four times, one time for each dimension. For instance, in GLSL and OSL, since the noise function is overloaded, the noise_tubulance function is exactly the same for all dimensions, the only difference is that the input vector p have a different type:

float noise_turbulence([float|vec2|vec3|vec4] p, float octaves)
{
  float fscale = 1.0;
  float amp = 1.0;
  float sum = 0.0;
  octaves = clamp(octaves, 0.0, 16.0);
  int n = int(octaves);
  for (int i = 0; i <= n; i++) {
    float t = noise(fscale * p);
    sum += t * amp;
    amp *= 0.5;
    fscale *= 2.0;
  }
  float rmd = octaves - floor(octaves);
  if (rmd != 0.0) {
    float t = noise(fscale * p);
    float sum2 = sum + t * amp;
    sum *= (float(1 << n) / float((1 << (n + 1)) - 1));
    sum2 *= (float(1 << (n + 1)) / float((1 << (n + 2)) - 1));
    return (1.0 - rmd) * sum + rmd * sum2;
  } else {
    sum *= (float(1 << n) / float((1 << (n + 1)) - 1));
    return sum;
  }
}

Moreover, SVM does not support function overloading, so the noise function will also have to be changed to noise_[1|2|3|4]d. This code duplication can be seen in node_texture.h. Similarly, the situation is the same for the Musgrave texture code, as can be seen in node_musgrave_texture.osl.

Solution

One solution to this problem would be to utilize the preprocessor to simulate generic programming. For instance, writing the noise_turbulence function as a macro as follows:

#define TURBULANCE(type) \
  float noise_turbulence(type p, float details) \
  { \
    float fscale = 1.0; \
    float amp = 1.0; \
    float sum = 0.0; \
    float octaves = clamp(details, 0.0, 16.0); \
    int n = (int)octaves; \
    for (int i = 0; i <= n; i++) { \
      float t = safe_noise(fscale * p); \
      sum += t * amp; \
      amp *= 0.5; \
      fscale *= 2.0; \
    } \
    float rmd = octaves - floor(octaves); \
    if (rmd != 0.0) { \
      float t = safe_noise(fscale * p); \
      float sum2 = sum + t * amp; \
      sum *= ((float)(1 << n) / (float)((1 << (n + 1)) - 1)); \
      sum2 *= ((float)(1 << (n + 1)) / (float)((1 << (n + 2)) - 1)); \
      return (1.0 - rmd) * sum + rmd * sum2; \
    } \
    else { \
      sum *= ((float)(1 << n) / (float)((1 << (n + 1)) - 1)); \
      return sum; \
    } \
  }

TURBULANCE(float)
TURBULANCE(vec2)
TURBULANCE(vec3)
TURBULANCE(vec4)

Alternatively, the implementation can be done in a separate file that uses preprocessor constants instead of types and function names. Then the implementation file can be included multiple times with different constant definitions. For instance, for SVM, the implementation file can be as follows:

float noise_turbulence(TYPE p, float details)
{
  float fscale = 1.0;
  float amp = 1.0;
  float sum = 0.0;
  float octaves = clamp(details, 0.0, 16.0);
  int n = (int)octaves;
  for (int i = 0; i <= n; i++) {
    float t = NOISE_FUNCTION(fscale * p);
    sum += t * amp;
    amp *= 0.5;
    fscale *= 2.0;
  }
  float rmd = octaves - floor(octaves);
  if (rmd != 0.0) {
    float t = NOISE_FUNCTION(fscale * p);
    float sum2 = sum + t * amp;
    sum *= ((float)(1 << n) / (float)((1 << (n + 1)) - 1));
    sum2 *= ((float)(1 << (n + 1)) / (float)((1 << (n + 2)) - 1));
    return (1.0 - rmd) * sum + rmd * sum2;
  }
  else {
    sum *= ((float)(1 << n) / (float)((1 << (n + 1)) - 1));
    return sum;
  }
}

Which can be included as follows:

#define TYPE float
#define NOISE_FUNCTION noise_1d
#include "noise_turbulence_impl.h"
#undef TYPE
#undef NOISE_FUNCTION

#define TYPE float2
#define NOISE_FUNCTION noise_2d
#include "noise_turbulence_impl.h"
#undef TYPE
#undef NOISE_FUNCTION

#define TYPE float3
#define NOISE_FUNCTION noise_3d
#include "noise_turbulence_impl.h"
#undef TYPE
#undef NOISE_FUNCTION

#define TYPE float4
#define NOISE_FUNCTION noise_4d
#include "noise_turbulence_impl.h"
#undef TYPE
#undef NOISE_FUNCTION

The second approach is easier, cleaner, and more flexible, but it is harder to setup.

Eventually, I decided to duplicate the code for simplicity and readability. I also prepended the code with the following comment to make refactoring by other developers more optimal:

/* The following 4 functions are exactly the same but with different input type.
 * When refactoring, simply copy the function body to the rest of the functions.
 */

Macro Redefinitions

The SSE version of the Jenkins hash have a slightly different bit-rotate macro, while the mix and final macros are exactly the same. This forces us to redefine mix and final after redefining rot. This can be seen in util_hash.h.

Different Voronoi Dimensions

The Voronoi texture also have similar code for different dimensions, however, unlike the Musgrave and Noise functions, the Voronoi functions have different control structures for different dimensions. In particular, the 1D case have a single loop, the 2D case have 2 nested loops, the 3D case have 3 nested loops, and the 4D case have 4 nested loops.

Personally, I don't view this as code repetition as it seems unavoidable.

Different Voronoi Features

Currently, we support five Voronoi features. Initially, I tried to reduce code repetition by implementing all features in a single function. This didn't work out for the following reasons:

  • Search Kernel: The search kernel at the heart of the Voronoi algorithm was not constant. Smooth Voronoi required larger kernels, Distance To Voronoi Edges required two passes of big kernels, and N-Sphere Radius required two passes of small kernels. Moreover, GLSL compilers prefer constant expressions in loops for the purpose of compile-time optimizations and validation, in fact, the OpenGL ES specification doesn't even allow dynamic expressions in loops. So trying to dynamically control the loops based on required feature may not be the best thing to do.
  • Branching: The code branched a lot, which is bad for GPUs. Some branching is turned into conditional assignments like in the case of F1 voronoi, but more complicated branching can impact performance. (I am unsure if this issue make sense, since all wavefronts will choose the same branch, and thus there shouldn't be any stalling ...)
  • Compile Time: It appears shaders are recompiled every time a GPU_constant is changed (Which shouldn't happen?), having a big function would mean slower compile time.
  • Readability: The code became somewhat unreadable and not easy to refactor.

So, I eventually chose the simple approach of implementing everything independently.

Notes

  • While C++ and CUDA support generic programming in the form of templates, OpenCl does not. So SVM, in general, doesn't support generic programming except in special cases where the code is only transpiled to C++ (For instance, SSE kernels).