## Thursday, October 29, 2015

### Real-time Raytracing part 2.1

In my last post I gave a summary on all available methods for creating bounding volume hierarchies. In this post I'd like to share my implementation of the Fast and Simple Agglomerative LBVH Construction (download link). I'm not using any special heuristic, it's going to be the same simple radix tree as mentioned in the previous post.

This is what we want to end up with. The green nodes in the bottom are the leaf nodes (containing triangles), and in blue are the internal nodes. The numbers below the nodes are their left and right childrens indices, more on this later. The numbers in the bottom green rectangles are the leaf nodes' respective morton codes, they represent an ordering in space. In our case this space is going to be 3D. There are plenty of methods on creating morton codes, so I'm going to skip discussing these methods here.

In Karras' paper on fast LBVH building, they use 2 compute passes to find all relations between the nodes, and build a radix tree. Apetrei found a similar method which only uses one pass to do both. Similar to Karras' method, we split the array of nodes into leaf nodes and internal nodes. You can still store them in the same array if you use an offset though.

First of all, the structures. We're going to used Axis Aligned Bounding Boxes (AABB), for which the structure looks as follows:
struct AABB
{
float3 min, max;
};

The BVHNode contains a little bit more:
struct BVHNode
{
AABB bounds;
BVHNode *children[2];
int triangleID;
int atomic;
int rangeLeft, rangeRight;
};

The basic things are there: AABB and relations to successors and parent. If this would be a leaf node, the triangles can be stored in the elements array. The rangeLeft and rangeRight are the, soon to be, childrens indices. The atomic counter will play an important role in this algorithm.

Important to note: If you're going to optimize this, think about memory alignment, this structure currently has the size of 6 floats and 6 ints. Storing in multiples of 128 bit (4 ints/floats) and reading/writing in those sizes on the GPU is a good memory optimization.

I started by simply allocating two arrays for the nodes on the GPU. Other requirements you need for this algorithm are the AABB's of every triangle, and their respective morton codes. Obtaining these is simple, but a little obfuscated on the GPU. We're going to use thrust (a package included in the cuda SDK) to sort the created morton keys:
void Sort(unsigned long long* morton, int *keys, const int n)
{
thrust::device_ptr<int> d_keys = thrust::device_pointer_cast(keys);
thrust::device_ptr<unsigned long long> d_morton = thrust::device_pointer_cast(morton);
thrust::stable_sort_by_key(d_morton, d_morton + n, d_keys);
}

Note that both morton and keys are both already existing arrays on the GPU. The keys array is a simple int array containing [1,2..n] storing the triangleIDs. Morton are the morton codes of the respective triangles. In this piece of code I simply convert them to something 'thrustworthy' and sort the morton codes together with the keys using stable sort by key (so you end up with both arrays sorted the same way).

Since allocated data on the GPU could contain anything, we're going to reset some parameters in the nodes and start by defining bounding boxes in the leaf nodes:
__global__ void Reset(BVHNode *leafNodes, BVHNode *internalNodes, AABB *data, int *sortedObjectIDs, unsigned long long *morton, int nNodes)
{
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= nNodes)
return;

// Reset parameters for internal and leaf nodes here

// Set ranges
leafNodes[idx].rangeLeft = idx;
leafNodes[idx].rangeRight = idx;
leafNodes[idx].atomic = 1; // To allow the next thread to process
internalNodes[idx].atomic = 0; // Second thread to process

// Set triangles in leaf
// Save triangle ID
// Expand bounds using min/max functions

// Special case
if (nNodes == 1)
{
internalNodes[0].bounds = leafNodes[0].bounds;
internalNodes[0].children[0] = &leafNodes[0];
}
}

For those who have never seen CUDA code before, the __host__ and __device__ simply indicate on which device this code can be executed, __global__ indicates the host calls the function, but the device executes the function.

The input for the function contains the same morton codes and keys from the sorting, the nodes themselves, and nNodes, which is simply nLeafNodes in this case. This is always true in a binary radix tree.

Compute shader logic: To find the index in the array, we have to look up the current threadID. I called this function using only one dimensional blocks and grids, so this is easy to calculate by adding threadId to the block index and size. If we surpass the maximum amount of threads (nNodes) we simply tell this thread to stop executing.

Walkthrough: First, we simply reset the leaf and internal nodes parameters to NULL and reset the AABB. We're going to start the algorithm in the leaf nodes, so the range counter will only include the node itself for every leaf node. The atomic counter is set to 1 for leaf nodes, and 0 in internal nodes. This is explained further in the algorithm below.

We want to end up with every node having a bound in terms of an AABB. This can be done by setting them directly. I prefer to use the min/max functions, since you don't have to rethink the process if you try to store multiple triangles later on.

The final part of the code simply handles the case when there is only one node. If this is the case, we don't even have to run the actual algorithm!

The actual algorithm

I made a function which calls the actual algorithm for every leaf node, fetching the threadID in the same manner as above. After that I start the recursion:
// Sets the bounding box and traverses to root
__device__ void ProcessParent(BVHNode *node, int nData, BVHNode* internalNodes, BVHNode *leafNodes, unsigned long long *morton)
{
// Allow only one thread to process a node
return;

// Set bounding box if the node is no leaf
if (!isLeaf(node))
{
// Expand bounding box with min/max functions from children AABB's
}

int left = node->rangeLeft;
int right = node->rangeRight;
if (left == 0 || (right != nData - 1 && HighestBit(right, morton) < HighestBit(left - 1, morton)))
{
// parent = right, set parent left child and range to node
}
else
{
// parent = left -1, set parent right child and range to node
}

if (left == 0 && right == nData)
return;
ProcessParent(parent, nData, internalNodes, leafNodes, morton);
}

Walkthrough: The first part is an atomicAdd, the function adds a value to a reference and returns the previous value. We need this to guarantee parallel reduction. You can read more on what atomics are and how they work here. In our context, we're saying if the old value stored in the atomic counter is equal to one, the thread can continue. This way we stop the first thread that encounters a new node and continue traversing when the second thread arrives. This is the reason why we set the atomic counter on 1 for every leaf node, since the first thread arriving will read this value and continue.

The next part simply states that if the thread encounters an internal node, we set the bounding box by combining the AABB's from the children. We know that both children are set from the logic with the atomic counter above.

Now we can calculate the parent according to the logic from the paper. With the left and right child index and the following function, we can calculate which internal node is the parent:
// Returns the highest differing bit of i and i+1
__device__ int HighestBit(int i, unsigned long long *morton)
{
return morton[i] ^ morton[i + 1];
}

The paper has a clear explanation as to why this algorithm works, so if you want to know why, I suggest reading the paper.

However, I will give a working example to show this. Let's trace the algorithm from the thread that handles internal node 5 from the image. The child indices are 5 and 6. The algorithm states:
if (left == 0 ||
(right != nData - 1 &&
HighestBit(right, morton) < HighestBit(left - 1, morton)))

The first two statements opt out the 0 and n-1 case (left and right part of the tree). This is not the case, so we have to compare some values. The HighestBit of right returns $1101 \wedge 1111 = 0010 = 2$ and the HighestBit of left-1 returns $1000 \wedge 1100 = 0100 = 4$ 2 < 4, so the statement is true. This means the parent has the same index of the right child, which is 6.

I hope you gained some insight in this algorithm from this code sample. Make sure to note that while this algorithm gives you fast building speed it does not give the best tracing speed.

Warning: LBVHs as shown in the graph above represent an optimal solution. Most of the LBVHs you will make using this algorithm will (most likely) contain tons of layers containing one leaf- and one internal node, thus the depth constraint of log2(n) in balanced binary trees does not apply here!