
= Function System =

The term "function system" describes a library that solves the problem of evaluating user-defined functions at run-time. Having a good solution to this problem is a key ingredient for new node systems in Blender.

== Motivation ==
I've talked about different use-cases for this system in this document before. Many more use-cases can be discovered in the future. The function system also has a lot of immediate value to me. The new particle system heavily depends on user-defined functions to provide more flexibility. The design of the particle system would be very different if we were not able to evaluate functions efficiently.

The image below shows a simple user-defined force that has to be evaluated for every particle in every time step. In fact, forces can only be implemented using nodes currently.



== Goals ==
One of the major goals for this system is performance. However, there are two ways to measure performance:


 * Latency: This measures how much time it takes to evaluate a function only once. For example, computing the force on a single particle takes 5ns.
 * Throughput: This measures how many times the function can be evaluated in a certain amount of time. For example, in 1ms I can compute the force on 1 million particles.

It turns out that you get very different solutions depending on what you optimize for. My initial implementation of the function system, as described in this document, was optimizing for latency. However, it turned out that optimizing for throughput makes a lot more sense when it comes to particle systems and many other use-cases.
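To make the difference concrete, here is a minimal sketch of the two interface styles. The drag force and the function names are hypothetical and only serve as an illustration; they are not part of the function system.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Latency-oriented interface: one call computes one element.
// (Hypothetical example force: drag proportional to velocity.)
inline float drag_force_single(float velocity, float drag)
{
  return -drag * velocity;
}

// Throughput-oriented interface: one call computes many elements.
// Any per-call overhead (dispatch, setup) is paid once per batch
// instead of once per element.
inline void drag_force_batch(const std::vector<float> &velocities,
                             float drag,
                             std::vector<float> &r_forces)
{
  assert(velocities.size() == r_forces.size());
  for (std::size_t i = 0; i < velocities.size(); i++) {
    r_forces[i] = -drag * velocities[i];
  }
}
```

With the batch form, the inner loop is the only place where time can be spent, which is also what makes it easy to find in a profile.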

Even more important than the performance right now is the optimizability of the system. Whenever there is a bottleneck in the function evaluation, there should be an obvious way to optimize it away. This is important, because micro-optimizing everything won't do us any good, but selectively optimizing small code segments is fine. Optimizing for optimizability has two important implications. Firstly, bottlenecks have to be easy to find by looking at a profile. Secondly, bottlenecks must only appear in small loops. I'd say that both things can only be achieved when optimizing for throughput.

The system needs a well-defined and extensible type system. Only supporting primitive types like `int` and `float` is not an option. It should be possible to use the system with most C++ types (they must be copyable, destructible, ...). Furthermore, passing lists of elements between nodes has to be supported efficiently, i.e. without potentially doing a separate memory allocation for every particle.

Functions have to be composable. That means I have to be able to take a couple of functions and connect them so that the result is a new function with the same interface as the original functions.

The setup cost should be low. So, transforming a node system or an expression into an executable function should be fast. This is important, because if it is slow, the loading time of .blend files can increase a lot and artists have to wait longer when they change a function. I expect files to have many small functions in the future.

== Simple Benchmark ==
Here I'll just provide some perspective for the current performance of the system by comparing it to a precompiled C++ function. Both the C++ code and the function evaluation can be optimized more of course, but to understand where we stand, it is good enough.

I use Heron's formula to compute the area of a triangle as an example. However, the input will be three vertex coordinates per triangle instead of three side lengths. Below is the C++ code and a screenshot of the node tree (implemented using node groups) that I will compare. I'm running the benchmark on a single core, but both functions can easily be extended to use multiple cores.
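The original listing is not reproduced here. As a rough sketch of what the precompiled side of such a comparison could look like, the code below computes triangle areas from vertex coordinates with Heron's formula. The `Float3` type and all function names are assumptions made for this illustration, not the code that was benchmarked.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal 3D vector type used only for this sketch.
struct Float3 {
  float x, y, z;
};

inline float point_distance(const Float3 &a, const Float3 &b)
{
  const float dx = a.x - b.x;
  const float dy = a.y - b.y;
  const float dz = a.z - b.z;
  return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Heron's formula: area = sqrt(s * (s - ab) * (s - bc) * (s - ca)),
// where ab, bc, ca are the side lengths and s is half the perimeter.
inline float triangle_area_heron(const Float3 &a, const Float3 &b, const Float3 &c)
{
  const float ab = point_distance(a, b);
  const float bc = point_distance(b, c);
  const float ca = point_distance(c, a);
  const float s = (ab + bc + ca) * 0.5f;
  return std::sqrt(s * (s - ab) * (s - bc) * (s - ca));
}

// Computes one area per triangle; vertex i of every array belongs to triangle i.
inline void triangle_areas(const std::vector<Float3> &a,
                           const std::vector<Float3> &b,
                           const std::vector<Float3> &c,
                           std::vector<float> &r_areas)
{
  assert(a.size() == b.size() && b.size() == c.size() && c.size() == r_areas.size());
  for (std::size_t i = 0; i < a.size(); i++) {
    r_areas[i] = triangle_area_heron(a[i], b[i], c[i]);
  }
}
```

The three distance computations, the additions, subtractions, multiplications and the square root map directly to the kinds of nodes one would expect in the corresponding node tree.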



I evaluate the functions on 10,000,000 elements. The C++ code takes approximately 60ms and the user-defined function 160ms to execute. That is 6ns and 16ns per element on average. There is no compilation happening at run-time. Personally, I think this result is quite good already.

Below is a flame graph that shows roughly the time spent in each function. We can see that the C++ code has been inlined entirely. Furthermore, when evaluating the node tree, almost all of the time is spent in small loops. Those can still be optimized individually or combined if necessary. Most time is spent in the three nodes that compute the vector distances. The other four flat segments of the profile show the time of the sqrt, subtract, add and multiply nodes respectively. The three large peaks indicate page faults when a memory buffer is written to for the first time.



== LLVM is not the solution, but ==
LLVM is not the solution, but can be part of it. There is no doubt that for many small functions (like the one in the benchmark above), LLVM will provide the best possible performance, probably even faster than the precompiled C++ code on many CPUs. However, LLVM does not solve all problems and comes with its own set of new problems.

The main downside is compilation time. Both optimized and unoptimized compilation can take a significant amount of time that quickly adds up when many small functions are used. Furthermore, debugging and profiling functions compiled at run-time is much harder. Integrating non-trivial C++ types with LLVM IR can be complex. Also, the performance of working with lists of elements is not magically better by using LLVM. For many nodes that do more than simple math operations (e.g. computing Perlin noise), LLVM does not provide any performance benefit and just makes everything more complex.

That is all to say that, while LLVM is great, it cannot be seen as the primary solution to the function evaluation problem (I developed a node tree to LLVM IR compiler last year, so I actually tried this).

Nevertheless, LLVM can be used in a node network optimization step. For example, this optimization could find groups of nodes that can be replaced by a single new node that has been compiled at run-time.

== Interface of a Function ==
This section shows how a run-time generated function is used. It explains the different data structures used in the interface. A later section will show a function from the inside.

The most important class is `FN::MultiFunction`. It encapsulates a function that can be computed on many elements at the same time, hence the name `MultiFunction`.

Usually, nodes have input and output sockets. However, that concept turned out to be bad for a `MultiFunction`. Instead, a `MultiFunction` instance has a list of parameters. Every parameter has an interface category, a data type category and a base type. There are three distinct interface categories:


 * Input: An input parameter has to be initialized by the caller of the function. It is readonly inside the function.
 * Output: An output parameter is initialized by the callee. The caller only provides the memory buffer.
 * Mutable: A mutable parameter is initialized by the caller. The function is allowed to modify it.

There are two data type categories:


 * Single: A single parameter is one that gets one value per element to be computed.
 * Vector: A vector parameter can receive zero or more values per element to be computed.

Lastly, the base data type of a parameter is represented by an `FN::CPPType` instance. Such instances are available for many types such as `float`, `int` and `std::string`.
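The three properties of a parameter can be pictured as a small record per parameter. The following is only a sketch with made-up names (`ParamDescription`, `caller_initializes`, `make_add_ints_signature`); the base type is represented by a string here, whereas the real system uses a `CPPType` instance.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class InterfaceCategory { Input, Output, Mutable };
enum class DataCategory { Single, Vector };

// Hypothetical description of one parameter of a function.
struct ParamDescription {
  std::string name;
  InterfaceCategory interface_category;
  DataCategory data_category;
  std::string base_type;  // stand-in for a reference to a run-time type object
};

// Inputs and mutable parameters are initialized by the caller,
// outputs are initialized by the callee.
inline bool caller_initializes(InterfaceCategory category)
{
  return category != InterfaceCategory::Output;
}

// Signature of a function that adds two integers: two inputs, one output.
inline std::vector<ParamDescription> make_add_ints_signature()
{
  return {
      {"A", InterfaceCategory::Input, DataCategory::Single, "int"},
      {"B", InterfaceCategory::Input, DataCategory::Single, "int"},
      {"Sum", InterfaceCategory::Output, DataCategory::Single, "int"},
  };
}
```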

In order to call a function, the caller has to provide all parameters. A special type `FN::MFParamsBuilder` helps with that (MF = MultiFunction). Remember that a `MultiFunction` always computes many elements at once.

The `Array.as_ref` and `as_mutable_ref` functions return a `BLI::ArrayRef` and `BLI::MutableArrayRef` respectively. The `context_builder` can be used to pass additional information to the function, but we don't need that here.

Note, the first argument of `fn.call` is a `BLI::IndexRange`. In this case it is `{0, 1, 2, 3, 4}`. The parameter is used to tell the callee which elements/indices should be computed. For example, if I only wanted to compute sum at indices 1 and 3, I could pass `{1, 3}` into the function. Internally, this is converted into a `BLI::IndexMask` structure, that just references an array of unsigned integers. The indices have to be ordered and duplicates are not allowed.
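The effect of the index mask can be shown with a standalone sketch that uses plain `std::vector` instead of the `BLI` types. The names `is_valid_index_mask` and `add_ints_masked` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A mask is valid when its indices are strictly increasing,
// i.e. ordered and free of duplicates.
inline bool is_valid_index_mask(const std::vector<unsigned> &indices)
{
  for (std::size_t i = 1; i < indices.size(); i++) {
    if (indices[i - 1] >= indices[i]) {
      return false;
    }
  }
  return true;
}

// Computes r_sum[i] = a[i] + b[i] only for the indices in the mask.
// All other elements of r_sum are left untouched.
inline void add_ints_masked(const std::vector<unsigned> &mask,
                            const std::vector<int> &a,
                            const std::vector<int> &b,
                            std::vector<int> &r_sum)
{
  assert(is_valid_index_mask(mask));
  for (unsigned i : mask) {
    assert(i < a.size() && i < b.size() && i < r_sum.size());
    r_sum[i] = a[i] + b[i];
  }
}
```

Passing `{1, 3}` as the mask writes only two of the five results; the buffers keep their full size so that indices stay meaningful.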

A great part of the design is that readonly inputs to a function can be "virtual arrays". So it does not have to be an actual array, but only has to look like an array to the callee. This becomes very handy when e.g. the second input in the example above is constant. The `MFParamsBuilder` takes care of the necessary conversions.

== Runtime Type System ==
A couple of classes are necessary to work with types generically in a good and safe way. This section will explain those classes independently. Afterwards, we'll see how they are used to evaluate functions.

=== CPPType ===
The `FN::CPPType` class is the core of the run-time type system. Every type has a size and alignment. Furthermore, a type has to implement the operations `construct_default`, `destruct`, `copy_to_initialized`, `relocate_to_uninitialized` and more.

Types are identified by the pointer of their `CPPType` object. So, no deep comparison between two types is necessary.

Types that correspond to compile time types (all currently) can be accessed using a special template function: `template<typename T> const CPPType &CPP_TYPE()`. For example, to get the `CPPType` for `float`, one can just use `CPP_TYPE<float>()`. A new type can be defined in a single line with the help of a macro.

Whenever a method on a type object is called, the alignment of pointers is checked. So it is important to be aware of alignment when working with generic types.
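The idea of a run-time type object can be sketched in a few lines. This is a simplified model, not the actual `FN::CPPType` implementation: the names `RuntimeType` and `runtime_type` are made up, and only three of the operations are modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Simplified model of a run-time type description.
struct RuntimeType {
  std::size_t size;
  std::size_t alignment;
  void (*construct_default)(void *ptr);
  void (*destruct)(void *ptr);
  void (*copy_to_initialized)(const void *src, void *dst);

  bool pointer_is_aligned(const void *ptr) const
  {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
  }
};

// Returns the single type object for T. Because there is exactly one
// object per type, two types can be compared by comparing addresses.
template<typename T> const RuntimeType &runtime_type()
{
  static const RuntimeType type = {
      sizeof(T),
      alignof(T),
      [](void *ptr) { new (ptr) T(); },
      [](void *ptr) { static_cast<T *>(ptr)->~T(); },
      [](const void *src, void *dst) {
        *static_cast<T *>(dst) = *static_cast<const T *>(src);
      },
  };
  return type;
}
```

Generic code can then construct, copy and destruct values through the function pointers without knowing the concrete type at compile time.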

=== GenericArrayRef and GenericMutableArrayRef ===
Those are mostly equivalent to `BLI::ArrayRef` and `BLI::MutableArrayRef`. However, instead of having their type defined at compile time, they have a `const CPPType *` member.
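A minimal sketch of such a type-erased array reference, assuming a hypothetical `TypeTag` as a stand-in for the run-time type object:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a run-time type object; only the element size is modeled.
struct TypeTag {
  std::size_t size;
};

template<typename T> const TypeTag &type_tag()
{
  static const TypeTag tag = {sizeof(T)};
  return tag;
}

// Read-only reference to an array whose element type is only known at run-time.
class GenericArrayRefSketch {
 public:
  template<typename T>
  GenericArrayRefSketch(const T *data, std::size_t size)
      : type_(&type_tag<T>()), buffer_(data), size_(size)
  {
  }

  const TypeTag &type() const { return *type_; }
  std::size_t size() const { return size_; }

  // Returns an untyped pointer to the element at the given index.
  const void *operator[](std::size_t index) const
  {
    assert(index < size_);
    return static_cast<const char *>(buffer_) + index * type_->size;
  }

  // Typed access; checks that the requested type matches the stored one.
  template<typename T> const T *as_typed() const
  {
    assert(type_ == &type_tag<T>());
    return static_cast<const T *>(buffer_);
  }

 private:
  const TypeTag *type_;
  const void *buffer_;
  std::size_t size_;
};
```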

=== GenericVectorArray ===
This is a more complex data structure (and also the most likely to change in the future). Its purpose is to store a constant number of lists of varying length; it is an array of vectors. It also references a `CPPType`.

=== GenericVirtualListRef ===
This is a generic version of `BLI::VirtualListRef`. It represents something that "looks like" an array. Internally, it can be an actual array, a single value, a smaller repeated array or an array of pointers. The data in the virtual list is readonly. Where performance matters, code can figure out the internal structure of the virtual list and optimize for different cases.
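To illustrate the concept, here is a typed (non-generic) sketch that supports three of the internal representations mentioned above. The class name and its methods are assumptions for this example and do not mirror the `BLI` API.

```cpp
#include <cassert>
#include <cstddef>

// Read-only view that looks like an array of `size()` elements,
// regardless of how the data is stored internally.
template<typename T> class VirtualListSketch {
  enum class Kind { FullArray, Single, RepeatedArray };

 public:
  static VirtualListSketch from_full_array(const T *data, std::size_t size)
  {
    return VirtualListSketch(Kind::FullArray, data, size, size);
  }

  static VirtualListSketch from_single(const T *value, std::size_t virtual_size)
  {
    return VirtualListSketch(Kind::Single, value, 1, virtual_size);
  }

  static VirtualListSketch from_repeated_array(const T *data,
                                               std::size_t real_size,
                                               std::size_t virtual_size)
  {
    assert(real_size > 0);
    return VirtualListSketch(Kind::RepeatedArray, data, real_size, virtual_size);
  }

  std::size_t size() const { return virtual_size_; }

  // Lets performance sensitive code special-case the constant input.
  bool is_single() const { return kind_ == Kind::Single; }

  const T &operator[](std::size_t index) const
  {
    assert(index < virtual_size_);
    switch (kind_) {
      case Kind::FullArray:
        return data_[index];
      case Kind::Single:
        return data_[0];
      case Kind::RepeatedArray:
        return data_[index % real_size_];
    }
    return data_[0];  // Not reached; all kinds are handled above.
  }

 private:
  VirtualListSketch(Kind kind, const T *data, std::size_t real_size, std::size_t virtual_size)
      : kind_(kind), data_(data), real_size_(real_size), virtual_size_(virtual_size)
  {
  }

  Kind kind_;
  const T *data_;
  std::size_t real_size_;
  std::size_t virtual_size_;
};
```

A constant input for a million elements then only needs storage for one value, while the callee can still index it like an array.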

=== GenericVirtualListListRef ===
This is a generic version of `BLI::VirtualListListRef`. It represents something that "looks like" an array of arrays. Internally, it can either be a single array or multiple arrays. External code can optimize for the different cases as well if necessary.

== Implementing a Function ==
As an example, I'll implement a function that adds two integers. This function could be used in the example above. By convention I call it `MF_AddInts`. Every function has to be a subclass of `MultiFunction` and has to implement a constructor and a `call` method.

The constructor is used to define the signature of the function as below. The function has three parameters: two inputs and one output. I'll give two functionally equivalent definitions of each function.

The actual work happens in the `call` function. It gets the set of indices to be computed, the parameters and the context as input. You can see that the parameters are accessed in a redundant way on purpose. The first parameter (here 0, 1 and 2) is the parameter index that should be accessed. In debug builds, there will be additional type and name checks to avoid some kinds of errors.
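The following is a toy model of this pattern, not the actual `FN::MultiFunction` API. It only supports `int` arrays, leaves out the context and the signature definition in the constructor, and every name in it (`ParamsSketch`, `MultiFunctionSketch`, `AddIntsSketch`) is made up. It shows the redundant access by index and name, with the name being verified by an assert.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy parameter container: every slot is either an input or an output.
class ParamsSketch {
 public:
  void add_input(std::string name, const int *data)
  {
    Slot slot;
    slot.name = std::move(name);
    slot.input = data;
    slots_.push_back(std::move(slot));
  }

  void add_output(std::string name, int *data)
  {
    Slot slot;
    slot.name = std::move(name);
    slot.output = data;
    slots_.push_back(std::move(slot));
  }

  // The index selects the parameter, the name is only used for checking.
  const int *readonly_input(std::size_t index, const std::string &expected_name) const
  {
    assert(index < slots_.size());
    assert(slots_[index].name == expected_name);
    assert(slots_[index].input != nullptr);
    return slots_[index].input;
  }

  int *uninitialized_output(std::size_t index, const std::string &expected_name) const
  {
    assert(index < slots_.size());
    assert(slots_[index].name == expected_name);
    assert(slots_[index].output != nullptr);
    return slots_[index].output;
  }

 private:
  struct Slot {
    std::string name;
    const int *input = nullptr;
    int *output = nullptr;
  };
  std::vector<Slot> slots_;
};

class MultiFunctionSketch {
 public:
  virtual ~MultiFunctionSketch() = default;
  virtual void call(const std::vector<unsigned> &mask, ParamsSketch &params) const = 0;
};

class AddIntsSketch : public MultiFunctionSketch {
 public:
  void call(const std::vector<unsigned> &mask, ParamsSketch &params) const override
  {
    const int *a = params.readonly_input(0, "A");
    const int *b = params.readonly_input(1, "B");
    int *sum = params.uninitialized_output(2, "Sum");
    // Only the requested indices are computed.
    for (unsigned i : mask) {
      sum[i] = a[i] + b[i];
    }
  }
};
```

The redundancy costs nothing in release builds, but catches mismatches between the signature and the `call` implementation early.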

== Multi Function Network ==
It is possible to hardcode e.g. the concatenation of multiple functions by calling them one after the other. However, the preferred way of combining multiple functions is to use the multi function network.

Essentially, it is an internal node graph (separate from the one in Blender's node editor) in which most nodes are functions and every socket corresponds to a parameter of a function. Input parameters correspond to an input socket, output parameters correspond to an output socket and mutable parameters correspond to an input and an output socket.

There is a second category of nodes: dummy nodes. Those do not correspond to a function, but still have input and output sockets. They are used to represent inputs and outputs of the multi function network.

Every socket has a data type. Links can only be made between sockets that have the exact same data type. Every input socket has to be linked to exactly one output socket. An output socket can be linked to an arbitrary amount of inputs.

There are two classes that can represent these networks: `FN::MFNetworkBuilder` and `FN::MFNetwork`. The first one is mutable and allows some invalid states such as an input that is not linked. Nodes and links can be added and removed. The second one represents a finalized network. It cannot change anymore. This is the network that is evaluated in the end.

These networks are usually generated from node trees or potentially expressions in the future. I won't explain the entire API here, but below is a small snippet using it.
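The snippet is not reproduced here. As a stand-in, here is a toy builder that enforces the linking rules described above. It does not use the real `MFNetworkBuilder` API; the class, its methods and the use of strings as data types are assumptions for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t kUnlinked = static_cast<std::size_t>(-1);

struct SocketRef {
  std::size_t node;
  std::size_t index;
};

struct NodeSketch {
  std::string name;
  bool is_dummy;
  std::vector<std::string> input_types;
  std::vector<std::string> output_types;
  std::vector<SocketRef> input_origins;  // One entry per input socket.
};

class NetworkBuilderSketch {
 public:
  std::size_t add_node(std::string name,
                       bool is_dummy,
                       std::vector<std::string> input_types,
                       std::vector<std::string> output_types)
  {
    NodeSketch node;
    node.name = std::move(name);
    node.is_dummy = is_dummy;
    node.input_origins.assign(input_types.size(), SocketRef{kUnlinked, kUnlinked});
    node.input_types = std::move(input_types);
    node.output_types = std::move(output_types);
    nodes_.push_back(std::move(node));
    return nodes_.size() - 1;
  }

  // Returns false when the link would violate one of the rules.
  bool add_link(SocketRef from_output, SocketRef to_input)
  {
    if (from_output.node >= nodes_.size() || to_input.node >= nodes_.size()) {
      return false;
    }
    const NodeSketch &from = nodes_[from_output.node];
    NodeSketch &to = nodes_[to_input.node];
    if (from_output.index >= from.output_types.size() ||
        to_input.index >= to.input_types.size()) {
      return false;
    }
    // Links are only allowed between sockets of the exact same type.
    if (from.output_types[from_output.index] != to.input_types[to_input.index]) {
      return false;
    }
    // Every input can be linked to one output only.
    if (to.input_origins[to_input.index].node != kUnlinked) {
      return false;
    }
    to.input_origins[to_input.index] = from_output;
    return true;
  }

  // A builder may contain unlinked inputs, a finalized network may not.
  bool all_inputs_linked() const
  {
    for (const NodeSketch &node : nodes_) {
      for (const SocketRef &origin : node.input_origins) {
        if (origin.node == kUnlinked) {
          return false;
        }
      }
    }
    return true;
  }

 private:
  std::vector<NodeSketch> nodes_;
};
```

In this model, `all_inputs_linked()` is the check a builder would have to pass before it can be turned into an immutable network.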

The `MFNetworkBuilder` has a dot exporter to visualize the graph.



== Network Optimization ==
An `MFNetworkBuilder` generated from user input can often still be optimized. This section presents three optimization passes that I've implemented already. The goal of all these passes is to reduce the number of nodes that need to be evaluated in the end.

The first optimization is constant folding. All function nodes that depend neither on the context nor on dummy nodes can be collapsed.
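Finding the nodes that qualify can be sketched as a single pass over the nodes. The sketch assumes that the nodes are stored in topological order (every input has a smaller index than the node using it); the `FoldNode` struct and the function name are hypothetical, and the actual replacement of the nodes by a constant is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct FoldNode {
  bool is_dummy;
  bool depends_on_context;
  std::vector<std::size_t> inputs;  // Indices of the nodes this node depends on.
};

// Returns for every node whether its outputs are constant, i.e. whether
// it depends neither on the context nor (transitively) on a dummy node.
// Assumes topological order: all inputs of node i have an index below i.
inline std::vector<bool> find_constant_nodes(const std::vector<FoldNode> &nodes)
{
  std::vector<bool> is_constant(nodes.size(), false);
  for (std::size_t i = 0; i < nodes.size(); i++) {
    const FoldNode &node = nodes[i];
    bool constant = !node.is_dummy && !node.depends_on_context;
    for (std::size_t dep : node.inputs) {
      assert(dep < i);
      constant = constant && is_constant[dep];
    }
    is_constant[i] = constant;
  }
  return is_constant;
}
```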

Before constant folding: After constant folding:

As you can see, a set of nodes has been replaced with a constant. However, the nodes have not been removed yet. This is the job of the next optimization called dead node elimination. It removes all nodes that aren't dependencies of dummy nodes. This does not impact evaluation performance as much, but fewer nodes are always better. Also, removing unused nodes helps when visualizing the network.
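Dead node elimination boils down to a reachability search that starts at the dummy nodes and walks backwards along the links. A minimal sketch, with a hypothetical `DagNode` struct and without the actual removal step:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct DagNode {
  bool is_dummy;
  std::vector<std::size_t> inputs;  // Indices of the nodes this node depends on.
};

// Returns for every node whether it has to be kept. A node is kept when
// it is a dummy node or a (transitive) dependency of a dummy node.
inline std::vector<bool> find_live_nodes(const std::vector<DagNode> &nodes)
{
  std::vector<bool> live(nodes.size(), false);
  std::vector<std::size_t> stack;
  for (std::size_t i = 0; i < nodes.size(); i++) {
    if (nodes[i].is_dummy) {
      live[i] = true;
      stack.push_back(i);
    }
  }
  while (!stack.empty()) {
    const std::size_t current = stack.back();
    stack.pop_back();
    for (std::size_t dep : nodes[current].inputs) {
      assert(dep < nodes.size());
      if (!live[dep]) {
        live[dep] = true;
        stack.push_back(dep);
      }
    }
  }
  return live;
}
```

Every node that is not marked as live can be removed without changing any value that reaches a dummy node.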



Lastly, there is duplicate removal. It can find and remove duplicate computations in the node tree. Duplicates are often created when a node group is inlined multiple times. This can also detect common subexpressions when we are able to generate networks from expressions.
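One common way to find duplicates is to give every node a key made of its operation and its (already deduplicated) inputs. The sketch below takes that approach; it is not necessarily how the pass is implemented here. It assumes topological node order, deterministic functions, and that distinct inputs use distinct operation names. `OpNode` and `find_canonical_nodes` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct OpNode {
  std::string operation;            // Identifies the function of the node.
  std::vector<std::size_t> inputs;  // Indices of the nodes this node depends on.
};

// Returns for every node the index of the node that should be used
// instead of it. Nodes that are not duplicates map to themselves.
// Assumes topological order: all inputs of node i have an index below i.
inline std::vector<std::size_t> find_canonical_nodes(const std::vector<OpNode> &nodes)
{
  std::map<std::pair<std::string, std::vector<std::size_t>>, std::size_t> seen;
  std::vector<std::size_t> canonical(nodes.size());
  for (std::size_t i = 0; i < nodes.size(); i++) {
    std::vector<std::size_t> canonical_inputs;
    for (std::size_t dep : nodes[i].inputs) {
      assert(dep < i);
      canonical_inputs.push_back(canonical[dep]);
    }
    auto key = std::make_pair(nodes[i].operation, std::move(canonical_inputs));
    // emplace does not overwrite; it returns the existing entry if there is one.
    auto result = seen.emplace(std::move(key), i);
    canonical[i] = result.first->second;
  }
  return canonical;
}
```

Because inputs are replaced by their canonical node before the key is built, whole duplicated subtrees collapse in one pass.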

Before duplicate removal: After duplicate removal:

== Network Evaluation ==
An `MFNetwork` itself is not a `MultiFunction` and therefore cannot be called. Currently, the `MF_EvaluateNetwork` function provides the functionality to execute a network. It takes two sets of sockets as input that represent the inputs and outputs in the network. There has to be at least one output. The `MF_EvaluateNetwork` is then called like any other `MultiFunction`.

There are many ways to implement the evaluation of a network. Since I started working on Animation Nodes, I've probably implemented at least a dozen of those. Here are a couple of possible approaches:


 * Recursive Interpreter: This starts at the output and uses recursion to compute the output of every node. This is the simplest variant when you allow that some nodes will be computed multiple times. With caching it becomes harder, but still fairly doable. The issue with such a recursive approach is that you might run out of stack memory when there are long node chains (which can happen when the network is generated programmatically).
 * Bytecode Interpreter: Convert the network with its inputs and outputs into some bytecode at construction time and interpret it later on. This can become quite annoying to debug and does not really provide any benefits here.
 * Compile: The order in which functions have to be evaluated can be converted into e.g. LLVM IR and then compiled. This is costly, because the compilation can take quite some time. Debugging becomes much harder with this approach. The performance benefit is negligible when many elements are evaluated per node. Also implementing control flow decisions is quite hard.
 * Stack-based Interpreter: This is similar to the recursive interpreter with caching. However, instead of using recursion, a stack and a while loop are used. This worked the best in my experiments.
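The stack-based approach can be sketched as follows. This is a deliberately small model with two hard-coded operations on `float` arrays and hypothetical names (`EvalNode`, `evaluate_with_stack`); it assumes an acyclic graph and leaves out buffer reuse and everything else a real evaluator needs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { Input, Add, Multiply };

struct EvalNode {
  Op op;
  std::vector<std::size_t> inputs;  // Empty for Input, two entries otherwise.
  std::vector<float> input_values;  // Only used by Input nodes.
};

// Computes the output of one node. Uses an explicit stack instead of
// recursion and computes every required node exactly once.
inline std::vector<float> evaluate_with_stack(const std::vector<EvalNode> &nodes,
                                              std::size_t output_node)
{
  std::vector<std::vector<float>> results(nodes.size());
  std::vector<bool> computed(nodes.size(), false);
  std::vector<std::size_t> stack = {output_node};

  while (!stack.empty()) {
    const std::size_t current = stack.back();
    if (computed[current]) {
      stack.pop_back();
      continue;
    }
    const EvalNode &node = nodes[current];

    // Schedule missing inputs first and come back to this node later.
    bool inputs_ready = true;
    for (std::size_t dep : node.inputs) {
      if (!computed[dep]) {
        stack.push_back(dep);
        inputs_ready = false;
      }
    }
    if (!inputs_ready) {
      continue;
    }

    if (node.op == Op::Input) {
      results[current] = node.input_values;
    }
    else {
      assert(node.inputs.size() == 2);
      const std::vector<float> &a = results[node.inputs[0]];
      const std::vector<float> &b = results[node.inputs[1]];
      assert(a.size() == b.size());
      std::vector<float> &out = results[current];
      out.resize(a.size());
      // Each node processes all elements in one small loop.
      for (std::size_t i = 0; i < a.size(); i++) {
        out[i] = (node.op == Op::Add) ? a[i] + b[i] : a[i] * b[i];
      }
    }
    computed[current] = true;
    stack.pop_back();
  }
  return results[output_node];
}
```

The depth of the node chain only affects the size of the heap-allocated stack vector, so long chains do not risk a stack overflow.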

The current network evaluator has the following features:


 * It does not use recursion. So there is no problem with long node chains.
 * It can handle all existing parameter types (including mutable parameters).
 * Memory buffers for intermediate results are cached and reused.
 * Uses a "deepest depth first" heuristic to decide in which order the inputs of a node should be computed in order to minimize the number of temporary buffers.
 * No data is copied when there is a chain of mutable parameters. So, the same buffer is just passed to every function in the right order.
 * Every node is executed at most once.
 * If the output of a node is determined to be the same for every element, the node will only be evaluated on a single element (instead of e.g. for every particle).
 * The last node before the output node can write directly into the buffer provided by the caller, eliminating a copy.
 * The more elements are processed at the same time, the more negligible the overhead of the evaluator itself becomes. That works because the system is throughput optimized and the evaluator does not add overhead for every element.

The evaluator does not do any multithreading on its own, currently. It does not feel very necessary so far. Simply splitting up the data into smaller chunks of elements is much better, because it scales to an arbitrary number of threads independent of the node network.
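Splitting the work into chunks can be sketched like this. The example runs the chunks sequentially to stay self-contained; since the chunks do not overlap, each one could be handed to a different thread by the caller. `ChunkRange`, `split_into_chunks` and `square_in_chunks` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ChunkRange {
  std::size_t start;
  std::size_t size;
};

// Splits the range [0, total) into consecutive chunks of at most chunk_size elements.
inline std::vector<ChunkRange> split_into_chunks(std::size_t total, std::size_t chunk_size)
{
  assert(chunk_size > 0);
  std::vector<ChunkRange> chunks;
  for (std::size_t start = 0; start < total; start += chunk_size) {
    chunks.push_back({start, std::min(chunk_size, total - start)});
  }
  return chunks;
}

// Stand-in for evaluating a function on every chunk independently.
inline void square_in_chunks(const std::vector<float> &input,
                             std::vector<float> &r_output,
                             std::size_t chunk_size)
{
  assert(input.size() == r_output.size());
  for (const ChunkRange &chunk : split_into_chunks(input.size(), chunk_size)) {
    // No chunk reads or writes the elements of another chunk.
    for (std::size_t i = chunk.start; i < chunk.start + chunk.size; i++) {
      r_output[i] = input[i] * input[i];
    }
  }
}
```

The number of chunks depends only on the amount of data, not on the shape of the node network, which is why this scales to any number of threads.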

== Summary ==
The function system contributes the following components to Blender:
 * A run-time type system and data structures to work with data of unknown type efficiently and safely.
 * A flexible and efficient interface for throughput optimized functions for the CPU.
 * Efficient data structures and algorithms to combine and evaluate multiple such functions.

All these components are internal and not visible to the user. However, they are an integral part of the other systems I'm working on:
 * A node interface framework to allow for more flexible node trees with type inferencing and other features (see Node Interface Framework).
 * A particle nodes implementation using the function system to give the user a lot of flexibility while still achieving high performance (see Particle Nodes Core Concepts).
 * A unified simulation system to integrate different kinds of simulations in a single node system (see Unified Simulation System Proposal).