Monday, December 21, 2015

Real-time Raytracing part 4

In all of the previous posts, I've talked about optimizing ray tracing. I've mentioned path tracing, but never spoke of path tracing specific optimizations. This post will be a small survey of implementations and research about path tracing optimizations.

Foundations

Path tracing is great for simulating global illumination by tracing rays, and of course, these rays have to follow "realistic paths". As observed a long time ago (1760) by Johann Lambert, intensity of light from an ideal diffuse surface is directly proportional to the cosine of the angle $\theta$ between the direction of the incident light and the surface normal. This is called Lambert's cosine law in optics.

In path tracing, we mainly use the cosine law to calculate the contribution of light from the next ray. When tracing from a diffuse surface, the direction of the next ray is usually calculated by generating a random direction in the hemisphere of the surface. When tracing from a mirror, you can imagine there is only one direction on the hemisphere that is correct: the perfect reflection. Metal-like surfaces contain both elements, called glossy reflections. This image shows what I'm talking about in 2D:



If you would order them by calculation cost for the path tracing algorithm, the reflection would be the easiest, there is only one solution, one ray to calculate. The glossy one is harder to calculate, since you have to accumulate rays with different weights, but still simpler to fully calculate than the diffuse option. Besides these simple examples, materials can contain many functions to bend incoming rays. The collection of these functions describe the Bidirectional Reflectance Distribution Function (BRDF). Which in turn, you can even extend to allow for rays extending through the surface with a Bidirectional Scattering Distribution Function (BSDF).

Cosine weighted sampling
The intensity of light from an ideal diffuse surface is directly proportional to the cosine of the angle $\theta$ between the direction of the incident light and the surface normal.
This leads to an interesting optimization: cosine weighted sampling. Since the incoming and outgoing light is measured by a cosine, the outgoing rays nearly parallel to the surface don't add any (useful) information to the final solution. I could write this whole post about this optimization, but that would be a waste of time, since there are already many good examples out there. This post from Rory Driscoll shows results as well as the maths behind it.

This image shows the cosine weighted sampling on the top and the uniform sampling on the bottom. This was taken from the pathtracing blog
Sampling optimization

As mentioned above, when you've found a ray-triangle collision, the direction of the new ray is usually 'bruteforce' calculated by taking a random direction on the hemisphere. For diffuse surfaces, this is no problem, since every outgoing direction weights the same. If you would apply the same to perfectly reflecting surfaces, chances are you will most likely never find the correct outgoing ray.

By knowing the material properties you can determine how to sample the outgoing ray. This sampling strategy can be a very easy optimization if you're willing to end up with an approximation. If you look at the 3 examples from the image above: one variable for the degree of reflectiveness can produce these examples.

The problems with this reflectiveness variable for path tracing are the weights shown in the glossy example. Solving the Monte Carlo problem of path tracing requires a probability distribution function (PDF) over these samples. The outer rays in the glossy examples are shown smaller, because they affect the outcome less than the rays from the perfect reflection.

This optimization is as simple or as hard as you want to make it. The simpler you keep the sampling, the more approximate your solution is going to be. The best solution would be to sample your chosen BRDF. The cosine weighted sampling is shown to be perfect for diffuse only materials (diffuse materials have perfect lambertian reflectance). I found a sampling method for the Cook-Torrance BRDF in this paper: Microfacet Models for Refraction through Rough Surfaces.

Direct and indirect lighting

Direct lighting is easy to calculate. This is shown in almost all released 3D games out there. Simply calculating a dot product of the surface normal and light direction would result in perfect lambertian shading.

In path tracing we can apply an optimization to allow for direct light sampling. It's called next event estimation. This is where the math and statistics get nasty. To allow for direct light sampling, we have to apply a different probability distribution function. We have to sample the light and divide the contribution of direct light by the area of the light:
Area sampling. Image taken from [2]

Whereas the indirect illumination should be sampled by the original method using hemispheres. With one exception, all of the rays hitting the light should be discarded (since the light gets sampled by the other PDF from above).
Hemisphere sampling without direct light sampling (dashed lines). Image taken from [2]
These examples show only sampling of a single light, which is easy to maintain. Imagine a scene with thousands of lights and one sun. If you apply the same theory of the two PDFs from above, you would end up sampling more inefficient than the brute force sampling. The solution is rather simple, we add weights to each light. But these have to be based on different criteria: only adding weights based on distance could influence the effects of the sun (large distance, huge brightness) versus normal lights (small distance, small brightness). Other small optimizations could be added based on needs: you don't have to continually sample the sun during night time and so on. If you like to know the mathematics and details behind these two PDFs, I suggest you read the presentations from the references below.

Multiple importance sampling

In larger applications, thousands of material types and BRDFs are going to be used. Having a different sampling strategy for each of them is going to be time consuming and frankly undo-able. We already have two different PDFs to sample all of these materials, the direct and indirect sampling. The problem that arises from all of this shown in images:

These are the two sampling methods shown from above, the BSDF sampling is the hemisphere indirect sampling and the light source sampling is the direct sampling using the area of the light. The problem is the glossy reflection on the four plates. You can see that indirect sampling works if the light is large enough to have some collisions with rays, and the direct light source sampling works fine for small area lights.

Multiple Importance Sampling is a technique that can combine different sampling techniques to find a low variance estimate of the integral. By combining both sampling methods we can get the following result:
Images taken from [4]

Bidirectional path tracing

This leads us to more recent techniques in path tracing, namely bidirectional path tracing. This technique is based on the inverse of path tracing, instead of finding the light by tracing rays from the camera, find the camera by tracing from the light. These are the shooting rays, and the original camera rays the gathering rays. The bidirectional part indicates that these two types of rays can be combined.

This technique wasn't meant for real time path tracing applications which require responsiveness. Since for every gathering ray you shoot, the amount of calculation is doubled by also tracing shooting rays. This nearly doubles the time per frame, and thus decreases responsiveness. While you lose responsiveness, this algorithm increases the convergence rate for path tracing a lot, so in the end it's worth it.

As extension to bidirectional path tracing, there is an algorithm called Metropolis light transport. This algorithm allows finding more difficult light paths by extending existing paths. An example: A shooting and gathering ray have been found by the bidirectional raytracing method. Every bounce in these rays is recorded as a node. The metropolis light transport can add extra nodes on these rays to modify the existing ray to a new one. These nodes are placed in areas with high gradients. You can find more on this technique in this paper: Metropolis Light Transport.

Another extension I found, but have not fully explored is: Light Transport Simulation with Vertex Connection and Merging. The basic idea is to reuse existing paths by connecting vertices. This can increase the amount of generated paths per pixel:

Vertex merging: reusing paths to increase paths per pixel. Image taken from [6]
There is an opensource implementation of VCM called smallVCM. This implementation also contains multiple importance sampling and normal bidirectional path tracing.

Further reading

Most recent techniques include gradient domain tracing. It first originated from metropolis light transport in: Gradient-domain Metropolis Light Transport. The paper on Gradient Domain Bidirectional Path Tracing applies this to a bidirectional path tracer, and Gradient-Domain Path Tracing applies the same principle for normal path tracers. The path tracer also contains source code. You can see the path tracing algorithm is converging faster already with only one sample per pixel:

Showing the difference between Gradient domain path tracer (G-PT) and normal path tracer (PT). Image taken from [9].
Other blog posts:

http://blog.hvidtfeldts.net/index.php/2015/01/path-tracing-3d-fractals/
Path tracing 3D fractals. Also explains cosine weighted sampling, importance sampling, next event estimation, and image based lighting. Implementation in GLSL.

http://raytracey.blogspot.com
Everything about ray tracing, and is currently working on a GPU path tracer.

https://www.reddit.com/r/pathtracing/
https://www.reddit.com/r/raytracing/
Community for path- and ray tracing.

References

Presentations

[1] http://cg.informatik.uni-freiburg.de/course_notes/graphics2_09_pathTracing.pdf
[2] http://www.cs.dartmouth.edu/~cs77/slides/18_PathTracing.pdf

Papers

[3] Walter, Bruce, et al. "Microfacet models for refraction through rough surfaces."Proceedings of the 18th Eurographics conference on Rendering Techniques. Eurographics Association, 2007.
[4] Veach, Eric. Robust monte carlo methods for light transport simulation. Diss. Stanford University, 1997.
[5] Veach, Eric, and Leonidas J. Guibas. "Metropolis light transport." Proceedings of the 24th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 1997.
[6] Georgiev, Iliyan, et al. "Light transport simulation with vertex connection and merging." ACM Trans. Graph. 31.6 (2012): 192.
[7] Lehtinen, Jaakko, et al. "Gradient-domain metropolis light transport." ACM Transactions on Graphics (TOG) 32.4 (2013): 95.
[8] Manzi, Marco, et al. "Gradient-Domain Bidirectional Path Tracing." (2015).
[9] Kettunen, Markus, et al. "Gradient-domain path tracing." ACM Transactions on Graphics (TOG) 34.4 (2015): 123.


Saturday, December 5, 2015

Real-time Raytracing part 3.1

In part 3, I've shown some examples on how to tune algorithms on the GPU. Here I would like to address how we can apply those rules to optimize path tracing. As usual, I will show code samples for CUDA, but this doesn't mean it can't be applied on any other graphics programming language.

Fortunately for us, there is an opensource framework for optimized GPU based ray traversal. The framework is from the research paper Understanding the Efficiency of Ray Traversal on GPUs. In 2012 they added Kepler and Fermi Addendum, which added some changes to the framework to optimize traversal for those architectures. In this post I will be looking at the version for the kepler architecture. The framework is quite heavy on magic numbers and hard to read code. In this post I'm going to dissect some of the more important parts and how they improve performance.

Data structures

The first thing to notice is what kind of spatial data structure they are using. From the name of the class: "SplitBVHBuilder" you can already see that the builder is based on this paper. It comes down to a simple binned BVH builder, including triangle splits to optimize traversal. For more information on BVH techniques, you can read part 2 of this series.

More important for this post is how you store this BVH on the GPU and access it without creating any memory latency. If we look at the ray traversal, the following code shows how the nodes are read from memory:

    const float4 n0xy = tex1Dfetch(t_nodesA, nodeAddr + 0); // (c0.lo.x, c0.hi.x, c0.lo.y, c0.hi.y)
    const float4 n1xy = tex1Dfetch(t_nodesA, nodeAddr + 1); // (c1.lo.x, c1.hi.x, c1.lo.y, c1.hi.y)
    const float4 nz   = tex1Dfetch(t_nodesA, nodeAddr + 2); // (c0.lo.z, c0.hi.z, c1.lo.z, c1.hi.z)
            float4 tmp  = tex1Dfetch(t_nodesA, nodeAddr + 3); // child_index0, child_index1
            int2  cnodes= *(int2*)&tmp;

Recalling from part 3, I showed you a similar statement for memory alignment using the fact that arrays in global memory didn't have data fragmentation. Reading memory from a texture using this principle is more natural, because textures ensure no data fragmentation. As seen in the comments above, the information gathered contains two BVH bounding boxes and two child indices. Storing the data this way ensures you have four 128 bit reads and one 64 bit read, since the last two values are unused.

The other data structure I would like to point out is the way they store triangles. The nodeaddr variable can store either positive or negative values. The positive indicate an index in the BVH and negative values store indices to the triangle array. The actual index is found by using the bit negate function ($\sim$) on the index. From this address they fetch the triangle using similar code as above:

    // Tris in TEX (good to fetch as a single batch)
    const float4 v00 = tex1Dfetch(t_trisA, triAddr + 0);
    const float4 v11 = tex1Dfetch(t_trisA, triAddr + 1);
    const float4 v22 = tex1Dfetch(t_trisA, triAddr + 2);

In which you can store the data required for a ray-triangle test and execute it. The thing to note is: BVHs can store more than one triangle per node, especially SBVHs. In order to store this efficiently in one array, they inserted breaks in the array. The code for checking all triangles:

for (int triAddr = ~leafAddr;; triAddr += 3)
{
 // Read triangle data (see above)

    // End marker (negative zero) => all triangles processed.
    if (__float_as_int(v00.x) == 0x80000000)
        break;

 // Ray-triangle intersection (see part 1)
}

The for loop determines the triangle address and loops endlessly increasing the triangle address. If the first address contains a stop value (they chose for negative zero, but any distinguished value will do), stop the for loop. Simple as that, the array can now contain more than one triangle per node.

Traversal algorithm

The traversal algorithm is as follows: beginning this algorithm assumes the ray hit the BVH containing the model we wish to test for. If it didn't, it's not that bad, after two ray-box intersections the algorithm stops.

  • Starting from the first node address, we read the first two child bounding boxes using the code above
  • Using the optimized intersection test for the GPU from the paper, we test ray-box intersections.
  • If the ray intersects one of them, we change the node address and continue the algorithm. 
  • If we intersect both children, store one of the children on the stack and continue traversing the other.
  • If we intersect none, pull a child index from the stack. If the stack is empty, stop the algorithm.
While this algorithm is pretty easy to implement, optimizing it for the GPU is quite a hassle. In order to minimize divergence, they thought of two neat tricks to alter this algorithm. The first trick is persistent threads. This concept originated from the following thought: A warp will execute until all threads inside it are done processing. In a path tracer like ours this can lead to horrific situations: 31 threads waiting for just the one to be done tracing.

The idea to combat this is quite simple: if there are less than X threads active, fetch new work for the idle threads. Using older graphics hardware, the implementation would take some time for all threads to communicate their status using shuffle functions. Since kepler architecture there are some warp voting functions introduced. From the CUDA toolkit:
__all(predicate):
Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them.
__any(predicate):
Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them.
__ballot(predicate):
Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.

Which comes in very useful for this task. The code that remains to execute this test:
if( __popc(__ballot(true)) < DYNAMIC_FETCH_THRESHOLD )
                break;
Where __popc is a population count, counting the number of bits equal to 1 from the provided value.

The other trick in the algorithm to minimize divergence is part of triangle checking. It comes down to timing: when to check for triangles. In theory, the threads in the warp could be divergent: one thread checking ray-triangle intersections and the others traversing the BVH. While this is not completely preventable, we could help the algorithm postpone checking triangles until all threads have at least one triangle to check.

By creating an array (or in this case a variable) to store a triangle address, we can continue traversing the BVH until other threads have also found a triangle. The code in the source uses a direct compiled statement to check this, since it wouldn't be compiled in an optimal way. The code that you would need to check if all threads have a triangle is simple:
    if(!__any(leafAddr >= 0))
        break;
Where leafAddr indicates a triangle address. The statement simply checks if there is a thread for which the address is larger than or equal to 0. If there are none, it means all threads have found a triangle to check and we can stop traversing the BVH.


I hope to have cleared up some of the magic parts of the source code. Of course, if you have any more questions about it, feel free to post them below.