Saturday, October 31, 2015
Flux - Realtime path tracer
After all these posts on techniques, here is a video showing the current state of my path tracer:
Thursday, October 29, 2015
Real-time Raytracing part 2.1
In my last post I gave a summary on all available methods for creating bounding volume hierarchies. In this post I'd like to share my implementation of the Fast and Simple Agglomerative LBVH Construction (download link). I'm not using any special heuristic, it's going to be the same simple radix tree as mentioned in the previous post.
This is what we want to end up with. The green nodes in the bottom are the leaf nodes (containing triangles), and in blue are the internal nodes. The numbers below the nodes are their left and right childrens indices, more on this later. The numbers in the bottom green rectangles are the leaf nodes' respective morton codes, they represent an ordering in space. In our case this space is going to be 3D. There are plenty of methods on creating morton codes, so I'm going to skip discussing these methods here.
In Karras' paper on fast LBVH building, they use 2 compute passes to find all relations between the nodes, and build a radix tree. Apetrei found a similar method which only uses one pass to do both. Similar to Karras' method, we split the array of nodes into leaf nodes and internal nodes. You can still store them in the same array if you use an offset though.
First of all, the structures. We're going to used Axis Aligned Bounding Boxes (AABB), for which the structure looks as follows:
The BVHNode contains a little bit more:
The basic things are there: AABB and relations to successors and parent. If this would be a leaf node, the triangles can be stored in the elements array. The rangeLeft and rangeRight are the, soon to be, childrens indices. The atomic counter will play an important role in this algorithm.
Important to note: If you're going to optimize this, think about memory alignment, this structure currently has the size of 6 floats and 6 ints. Storing in multiples of 128 bit (4 ints/floats) and reading/writing in those sizes on the GPU is a good memory optimization.
I started by simply allocating two arrays for the nodes on the GPU. Other requirements you need for this algorithm are the AABB's of every triangle, and their respective morton codes. Obtaining these is simple, but a little obfuscated on the GPU. We're going to use thrust (a package included in the cuda SDK) to sort the created morton keys:
Note that both morton and keys are both already existing arrays on the GPU. The keys array is a simple int array containing [1,2..n] storing the triangleIDs. Morton are the morton codes of the respective triangles. In this piece of code I simply convert them to something 'thrustworthy' and sort the morton codes together with the keys using stable sort by key (so you end up with both arrays sorted the same way).
Since allocated data on the GPU could contain anything, we're going to reset some parameters in the nodes and start by defining bounding boxes in the leaf nodes:
For those who have never seen CUDA code before, the __host__ and __device__ simply indicate on which device this code can be executed, __global__ indicates the host calls the function, but the device executes the function.
The input for the function contains the same morton codes and keys from the sorting, the nodes themselves, and nNodes, which is simply nLeafNodes in this case. This is always true in a binary radix tree.
Compute shader logic: To find the index in the array, we have to look up the current threadID. I called this function using only one dimensional blocks and grids, so this is easy to calculate by adding threadId to the block index and size. If we surpass the maximum amount of threads (nNodes) we simply tell this thread to stop executing.
Walkthrough: First, we simply reset the leaf and internal nodes parameters to NULL and reset the AABB. We're going to start the algorithm in the leaf nodes, so the range counter will only include the node itself for every leaf node. The atomic counter is set to 1 for leaf nodes, and 0 in internal nodes. This is explained further in the algorithm below.
We want to end up with every node having a bound in terms of an AABB. This can be done by setting them directly. I prefer to use the min/max functions, since you don't have to rethink the process if you try to store multiple triangles later on.
The final part of the code simply handles the case when there is only one node. If this is the case, we don't even have to run the actual algorithm!
The actual algorithm
I made a function which calls the actual algorithm for every leaf node, fetching the threadID in the same manner as above. After that I start the recursion:
Walkthrough: The first part is an atomicAdd, the function adds a value to a reference and returns the previous value. We need this to guarantee parallel reduction. You can read more on what atomics are and how they work here. In our context, we're saying if the old value stored in the atomic counter is equal to one, the thread can continue. This way we stop the first thread that encounters a new node and continue traversing when the second thread arrives. This is the reason why we set the atomic counter on 1 for every leaf node, since the first thread arriving will read this value and continue.
The next part simply states that if the thread encounters an internal node, we set the bounding box by combining the AABB's from the children. We know that both children are set from the logic with the atomic counter above.
Now we can calculate the parent according to the logic from the paper. With the left and right child index and the following function, we can calculate which internal node is the parent:
The paper has a clear explanation as to why this algorithm works, so if you want to know why, I suggest reading the paper.
However, I will give a working example to show this. Let's trace the algorithm from the thread that handles internal node 5 from the image. The child indices are 5 and 6. The algorithm states:
The first two statements opt out the 0 and n-1 case (left and right part of the tree). This is not the case, so we have to compare some values. The HighestBit of right returns $1101 \wedge 1111 = 0010 = 2$ and the HighestBit of left-1 returns $1000 \wedge 1100 = 0100 = 4$ 2 < 4, so the statement is true. This means the parent has the same index of the right child, which is 6.
I hope you gained some insight in this algorithm from this code sample. Make sure to note that while this algorithm gives you fast building speed it does not give the best tracing speed.
Warning: LBVHs as shown in the graph above represent an optimal solution. Most of the LBVHs you will make using this algorithm will (most likely) contain tons of layers containing one leaf- and one internal node, thus the depth constraint of log2(n) in balanced binary trees does not apply here!
This is what we want to end up with. The green nodes in the bottom are the leaf nodes (containing triangles), and in blue are the internal nodes. The numbers below the nodes are their left and right childrens indices, more on this later. The numbers in the bottom green rectangles are the leaf nodes' respective morton codes, they represent an ordering in space. In our case this space is going to be 3D. There are plenty of methods on creating morton codes, so I'm going to skip discussing these methods here.
In Karras' paper on fast LBVH building, they use 2 compute passes to find all relations between the nodes, and build a radix tree. Apetrei found a similar method which only uses one pass to do both. Similar to Karras' method, we split the array of nodes into leaf nodes and internal nodes. You can still store them in the same array if you use an offset though.
First of all, the structures. We're going to used Axis Aligned Bounding Boxes (AABB), for which the structure looks as follows:
struct AABB { float3 min, max; };
The BVHNode contains a little bit more:
struct BVHNode { AABB bounds; BVHNode *children[2]; int triangleID; int atomic; int rangeLeft, rangeRight; };
The basic things are there: AABB and relations to successors and parent. If this would be a leaf node, the triangles can be stored in the elements array. The rangeLeft and rangeRight are the, soon to be, childrens indices. The atomic counter will play an important role in this algorithm.
Important to note: If you're going to optimize this, think about memory alignment, this structure currently has the size of 6 floats and 6 ints. Storing in multiples of 128 bit (4 ints/floats) and reading/writing in those sizes on the GPU is a good memory optimization.
I started by simply allocating two arrays for the nodes on the GPU. Other requirements you need for this algorithm are the AABB's of every triangle, and their respective morton codes. Obtaining these is simple, but a little obfuscated on the GPU. We're going to use thrust (a package included in the cuda SDK) to sort the created morton keys:
void Sort(unsigned long long* morton, int *keys, const int n) { thrust::device_ptr<int> d_keys = thrust::device_pointer_cast(keys); thrust::device_ptr<unsigned long long> d_morton = thrust::device_pointer_cast(morton); thrust::stable_sort_by_key(d_morton, d_morton + n, d_keys); }
Note that both morton and keys are both already existing arrays on the GPU. The keys array is a simple int array containing [1,2..n] storing the triangleIDs. Morton are the morton codes of the respective triangles. In this piece of code I simply convert them to something 'thrustworthy' and sort the morton codes together with the keys using stable sort by key (so you end up with both arrays sorted the same way).
Since allocated data on the GPU could contain anything, we're going to reset some parameters in the nodes and start by defining bounding boxes in the leaf nodes:
__global__ void Reset(BVHNode *leafNodes, BVHNode *internalNodes, AABB *data, int *sortedObjectIDs, unsigned long long *morton, int nNodes) { unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx >= nNodes) return; // Reset parameters for internal and leaf nodes here // Set ranges leafNodes[idx].rangeLeft = idx; leafNodes[idx].rangeRight = idx; leafNodes[idx].atomic = 1; // To allow the next thread to process internalNodes[idx].atomic = 0; // Second thread to process // Set triangles in leaf // Save triangle ID // Expand bounds using min/max functions // Special case if (nNodes == 1) { internalNodes[0].bounds = leafNodes[0].bounds; internalNodes[0].children[0] = &leafNodes[0]; } }
For those who have never seen CUDA code before, the __host__ and __device__ simply indicate on which device this code can be executed, __global__ indicates the host calls the function, but the device executes the function.
The input for the function contains the same morton codes and keys from the sorting, the nodes themselves, and nNodes, which is simply nLeafNodes in this case. This is always true in a binary radix tree.
Compute shader logic: To find the index in the array, we have to look up the current threadID. I called this function using only one dimensional blocks and grids, so this is easy to calculate by adding threadId to the block index and size. If we surpass the maximum amount of threads (nNodes) we simply tell this thread to stop executing.
Walkthrough: First, we simply reset the leaf and internal nodes parameters to NULL and reset the AABB. We're going to start the algorithm in the leaf nodes, so the range counter will only include the node itself for every leaf node. The atomic counter is set to 1 for leaf nodes, and 0 in internal nodes. This is explained further in the algorithm below.
We want to end up with every node having a bound in terms of an AABB. This can be done by setting them directly. I prefer to use the min/max functions, since you don't have to rethink the process if you try to store multiple triangles later on.
The final part of the code simply handles the case when there is only one node. If this is the case, we don't even have to run the actual algorithm!
The actual algorithm
I made a function which calls the actual algorithm for every leaf node, fetching the threadID in the same manner as above. After that I start the recursion:
// Sets the bounding box and traverses to root __device__ void ProcessParent(BVHNode *node, int nData, BVHNode* internalNodes, BVHNode *leafNodes, unsigned long long *morton) { // Allow only one thread to process a node if (atomicAdd(&node->atomic, 1) != 1) return; // Set bounding box if the node is no leaf if (!isLeaf(node)) { // Expand bounding box with min/max functions from children AABB's } int left = node->rangeLeft; int right = node->rangeRight; if (left == 0 || (right != nData - 1 && HighestBit(right, morton) < HighestBit(left - 1, morton))) { // parent = right, set parent left child and range to node } else { // parent = left -1, set parent right child and range to node } if (left == 0 && right == nData) return; ProcessParent(parent, nData, internalNodes, leafNodes, morton); }
Walkthrough: The first part is an atomicAdd, the function adds a value to a reference and returns the previous value. We need this to guarantee parallel reduction. You can read more on what atomics are and how they work here. In our context, we're saying if the old value stored in the atomic counter is equal to one, the thread can continue. This way we stop the first thread that encounters a new node and continue traversing when the second thread arrives. This is the reason why we set the atomic counter on 1 for every leaf node, since the first thread arriving will read this value and continue.
The next part simply states that if the thread encounters an internal node, we set the bounding box by combining the AABB's from the children. We know that both children are set from the logic with the atomic counter above.
Now we can calculate the parent according to the logic from the paper. With the left and right child index and the following function, we can calculate which internal node is the parent:
// Returns the highest differing bit of i and i+1 __device__ int HighestBit(int i, unsigned long long *morton) { return morton[i] ^ morton[i + 1]; }
The paper has a clear explanation as to why this algorithm works, so if you want to know why, I suggest reading the paper.
However, I will give a working example to show this. Let's trace the algorithm from the thread that handles internal node 5 from the image. The child indices are 5 and 6. The algorithm states:
if (left == 0 || (right != nData - 1 && HighestBit(right, morton) < HighestBit(left - 1, morton)))
The first two statements opt out the 0 and n-1 case (left and right part of the tree). This is not the case, so we have to compare some values. The HighestBit of right returns $1101 \wedge 1111 = 0010 = 2$ and the HighestBit of left-1 returns $1000 \wedge 1100 = 0100 = 4$ 2 < 4, so the statement is true. This means the parent has the same index of the right child, which is 6.
I hope you gained some insight in this algorithm from this code sample. Make sure to note that while this algorithm gives you fast building speed it does not give the best tracing speed.
Warning: LBVHs as shown in the graph above represent an optimal solution. Most of the LBVHs you will make using this algorithm will (most likely) contain tons of layers containing one leaf- and one internal node, thus the depth constraint of log2(n) in balanced binary trees does not apply here!
Thursday, October 22, 2015
Real-time Raytracing part 2
Back to part 1.
In this post we will look into a particular optimization for the raytracing algorithm, namely data structures. Once you got your raytracer up and running on the GPU, you will check for every ray, which triangles it intersects. In the last post we saw that checking all triangles for every ray is simply impossible. In order to save some time, we store the triangles in a spatial data structure, which is easier to query for a ray.
There are many possible data structures to store a set of triangles. The most frequently used ones for raytracing are: k-d trees, octrees and bounding volume hierarchies (BVH). Any one of them is going to speed up the process a LOT. If your ray would only hit the background, in both the octree and the BVH there would only be a single ray-box intersection and then we're done. That's already saving n triangle checks!
But this post wouldn't be interesting to just point to some data structures. In this post I'm going more in depth on the BVH, specifically using bounding boxes. Why bounding boxes? Because there is a very simple, computationally efficient test for ray-box intersections. No if-statements, so it's a very solid algorithm for massive parallel execution as well. Fits right in our GPU ray tracer! Now there are basically two types of BVHs. The LBVH (linear BVH) can be built really fast, but has less tracing performance than SAH (Surface Area Heuristic) BVHs. The SAH splits the set of triangles so that the surface areas of the two child spaces, weighted by the number of objects in each child, are equal.
Tero Karras from Nvidia research came up with a very cool algorithm to build LBVHs in parallel on the GPU. On the Nvidia blog you can find this and this post. It explains how you can traverse a BVH on the GPU, and some great performance tips to speed it up. Part three of the tutorial shows how you can build the LBVH on the GPU making use of the massive parallelism available. It's worth checking out the research paper on the subject, but below is an easier and even faster version. The basic principle of the paper is building a radix tree of a set of objects which are sorted using morton codes. This creates a spatial code by interleaving positional vectors in one integer or long. From this radix tree, you can precalculate where the tree would split, and thus process the whole tree in parallel.
This LBVH creation method is really fast: about 16ms for a million triangles is not uncommon. Even though, the algorithm could be way easier, as described in Fast and Simple Agglomerative LBVH Construction (download link). This paper shows that the same restrictions on the radix tree can be used to build the tree bottom-up instead of using an unnecessary top-down iteration. It shows that by comparing values with the neighbor, you can decide the parent node. It's even a bit faster than the method above, and a lot simpler to implement.
This is as fast as it's going to get for building times for BVHs. But how about tracing times? As mentioned above, the SAH is a good heuristic for ray tracing performance. This graph is an awesome comparison:
This graph is from another Nvidia paper: Fast Parallel Construction of High-Quality Bounding Volume Hierarchies. This is another paper on building LBVHs really really fast. They also mention an optimization in tracing speed by pre-splitting some large triangles. There is a lot more in depth information about building times and tracing times as well. They mention a lot of methods, but the main point from this image is to show that (H)LBVH has about 60~70% of the tracing speed in comparison with SAH BVHs. The building time is about 10~20 times slower than that of LBVHs though.
So how do we build these awesome BVHs using the SAH? Ingo Wald wrote a paper on building these kind of BVHs very fast. This algorithm is performed on the CPU in a top-down fashion. There are a lot of possibilities for performance boost which are also explained in the paper. The building times could be about 3 times as slow as building a LBVH by trading in some tracing performance for build optimizations. But, as shown in the graph above, this method still gives the best raytracing performance.
This image below shows a visualization of rays intersecting bounding boxes with 8 stanford dragon models. You can find the source code for this BVH in part 2.1.
I hope to have informed you enough about the possibilities of optimizing raytracing by using spatial data structures. There is some speculation about the creation of BVHs on the GPU, since the GPU will be working on tracing rays all the time anyway. More on this discussion can be read on ompf2.com (where you can also find an implementation of most of these methods).
Part 2.1: code sample of Fast and Simple Agglomerative LBVH Construction
Part 3
In this post we will look into a particular optimization for the raytracing algorithm, namely data structures. Once you got your raytracer up and running on the GPU, you will check for every ray, which triangles it intersects. In the last post we saw that checking all triangles for every ray is simply impossible. In order to save some time, we store the triangles in a spatial data structure, which is easier to query for a ray.
There are many possible data structures to store a set of triangles. The most frequently used ones for raytracing are: k-d trees, octrees and bounding volume hierarchies (BVH). Any one of them is going to speed up the process a LOT. If your ray would only hit the background, in both the octree and the BVH there would only be a single ray-box intersection and then we're done. That's already saving n triangle checks!
But this post wouldn't be interesting to just point to some data structures. In this post I'm going more in depth on the BVH, specifically using bounding boxes. Why bounding boxes? Because there is a very simple, computationally efficient test for ray-box intersections. No if-statements, so it's a very solid algorithm for massive parallel execution as well. Fits right in our GPU ray tracer! Now there are basically two types of BVHs. The LBVH (linear BVH) can be built really fast, but has less tracing performance than SAH (Surface Area Heuristic) BVHs. The SAH splits the set of triangles so that the surface areas of the two child spaces, weighted by the number of objects in each child, are equal.
Tero Karras from Nvidia research came up with a very cool algorithm to build LBVHs in parallel on the GPU. On the Nvidia blog you can find this and this post. It explains how you can traverse a BVH on the GPU, and some great performance tips to speed it up. Part three of the tutorial shows how you can build the LBVH on the GPU making use of the massive parallelism available. It's worth checking out the research paper on the subject, but below is an easier and even faster version. The basic principle of the paper is building a radix tree of a set of objects which are sorted using morton codes. This creates a spatial code by interleaving positional vectors in one integer or long. From this radix tree, you can precalculate where the tree would split, and thus process the whole tree in parallel.
This LBVH creation method is really fast: about 16ms for a million triangles is not uncommon. Even though, the algorithm could be way easier, as described in Fast and Simple Agglomerative LBVH Construction (download link). This paper shows that the same restrictions on the radix tree can be used to build the tree bottom-up instead of using an unnecessary top-down iteration. It shows that by comparing values with the neighbor, you can decide the parent node. It's even a bit faster than the method above, and a lot simpler to implement.
This is as fast as it's going to get for building times for BVHs. But how about tracing times? As mentioned above, the SAH is a good heuristic for ray tracing performance. This graph is an awesome comparison:
This graph is from another Nvidia paper: Fast Parallel Construction of High-Quality Bounding Volume Hierarchies. This is another paper on building LBVHs really really fast. They also mention an optimization in tracing speed by pre-splitting some large triangles. There is a lot more in depth information about building times and tracing times as well. They mention a lot of methods, but the main point from this image is to show that (H)LBVH has about 60~70% of the tracing speed in comparison with SAH BVHs. The building time is about 10~20 times slower than that of LBVHs though.
So how do we build these awesome BVHs using the SAH? Ingo Wald wrote a paper on building these kind of BVHs very fast. This algorithm is performed on the CPU in a top-down fashion. There are a lot of possibilities for performance boost which are also explained in the paper. The building times could be about 3 times as slow as building a LBVH by trading in some tracing performance for build optimizations. But, as shown in the graph above, this method still gives the best raytracing performance.
This image below shows a visualization of rays intersecting bounding boxes with 8 stanford dragon models. You can find the source code for this BVH in part 2.1.
I hope to have informed you enough about the possibilities of optimizing raytracing by using spatial data structures. There is some speculation about the creation of BVHs on the GPU, since the GPU will be working on tracing rays all the time anyway. More on this discussion can be read on ompf2.com (where you can also find an implementation of most of these methods).
Part 2.1: code sample of Fast and Simple Agglomerative LBVH Construction
Part 3
Wednesday, October 14, 2015
Real-time Raytracing part 1
In the next few posts I will talk about Real-time Raytracing. The real-time part has different interpretations. Some define it as being 'interactive', which is another vague term as any application running at 1 FPS can still be interactive. In this case we will define a "real-time constraint" as a time constraint per frame of about 16ms (which leads to 60 frames per second). This is a common time constraint used in games which have to be responsive.
Now for the raytracing part. The basic concept of raytracing is tracing the paths of light for every pixel on an image plane. This way we can simulate visual realism by by mimicking the real process. You can create pretty images without much code:
Unfortunately raytracing also has downsides. The process to trace the light paths requires global scene access, and is thus unable to work similar to a rasterizer (which draws objects one by one). This makes the whole rendering process more difficult and the standard route of rendering objects is averted. Raytracing has a very high computational cost. A basic algorithm for checking which triangle was hit already shows the problem:
The whole algorithm scales with screen size and the amount of triangles. Running this algorithm with the dragon model shown above is basically asking for your computer to explode. The model has 100.000 triangles exactly, and checking that amount of triangles for every pixel in a low resolution (1024x1024) is about 104 billion ray triangle checks. A simple line-triangle intersection already has a lot of instructions, needless to say, this algorithm isn't going to run real-time...
So, how do we make it real-time? Several decades of research show us that there are plenty of possibilities to speed up the process of raytracing. In the upcoming posts I will talk about some of these processes to speed up my path tracer. Most of the subjects will be on parallel programming on the GPU using CUDA. Even though I will try my best to keep it as readable as possible, unexplained terminology can always be found in the excellent ray tracing tutorial of scratchapixel.
If you're really interested in ray tracing, and haven't read it already: Ray tracey's blog already has a huge collection of ray tracing posts.
Part 2.
Now for the raytracing part. The basic concept of raytracing is tracing the paths of light for every pixel on an image plane. This way we can simulate visual realism by by mimicking the real process. You can create pretty images without much code:
Unfortunately raytracing also has downsides. The process to trace the light paths requires global scene access, and is thus unable to work similar to a rasterizer (which draws objects one by one). This makes the whole rendering process more difficult and the standard route of rendering objects is averted. Raytracing has a very high computational cost. A basic algorithm for checking which triangle was hit already shows the problem:
for (int x = 0; x < screenwidth; x++) for (int y = 0; y < screenheight; y++) for (int t = 0; t < nTriangles; t++) RayTriangle(x, y, t)
The whole algorithm scales with screen size and the amount of triangles. Running this algorithm with the dragon model shown above is basically asking for your computer to explode. The model has 100.000 triangles exactly, and checking that amount of triangles for every pixel in a low resolution (1024x1024) is about 104 billion ray triangle checks. A simple line-triangle intersection already has a lot of instructions, needless to say, this algorithm isn't going to run real-time...
So, how do we make it real-time? Several decades of research show us that there are plenty of possibilities to speed up the process of raytracing. In the upcoming posts I will talk about some of these processes to speed up my path tracer. Most of the subjects will be on parallel programming on the GPU using CUDA. Even though I will try my best to keep it as readable as possible, unexplained terminology can always be found in the excellent ray tracing tutorial of scratchapixel.
If you're really interested in ray tracing, and haven't read it already: Ray tracey's blog already has a huge collection of ray tracing posts.
Part 2.
Subscribe to:
Posts (Atom)