Robbin Marcus: Shader

Thursday, October 29, 2015

Real-time Raytracing part 2.1

In my last post I gave a summary of the available methods for creating bounding volume hierarchies. In this post I'd like to share my implementation of Fast and Simple Agglomerative LBVH Construction (download link). I'm not using any special heuristic; it's going to be the same simple radix tree as mentioned in the previous post.

This is what we want to end up with. The green nodes at the bottom are the leaf nodes (containing triangles), and the internal nodes are in blue. The numbers below the nodes are their left and right children's indices; more on this later. The numbers in the bottom green rectangles are the leaf nodes' respective Morton codes, which represent an ordering in space. In our case this space is going to be 3D. There are plenty of methods for creating Morton codes, so I'm going to skip discussing them here.
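For readers who have not built Morton codes before, here is a minimal CPU sketch of one common approach: quantize each coordinate to 21 bits and interleave the bits into a 63-bit code. The function names, the 21-bit grid and the assumption that coordinates are already normalized to [0, 1] are illustrative choices, not taken from the implementation discussed in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Spreads the lowest 21 bits of v so that bit k moves to bit 3k.
std::uint64_t expandBits(std::uint32_t v)
{
    std::uint64_t out = 0;
    for (int bit = 0; bit < 21; ++bit)
        out |= static_cast<std::uint64_t>((v >> bit) & 1u) << (3 * bit);
    return out;
}

// Interleaves three 21-bit cell coordinates into one 63-bit Morton code
// (x takes the most significant bit of every triple).
std::uint64_t mortonFromCells(std::uint32_t x, std::uint32_t y, std::uint32_t z)
{
    assert(x < (1u << 21) && y < (1u << 21) && z < (1u << 21));
    return (expandBits(x) << 2) | (expandBits(y) << 1) | expandBits(z);
}

// Assumption: x, y and z are already normalized to [0, 1], for example
// relative to the scene bounds. Values outside that range are clamped.
std::uint64_t morton3D(float x, float y, float z)
{
    auto quantize = [](float f) {
        const float scaled = std::min(std::max(f * 2097152.0f, 0.0f), 2097151.0f);
        return static_cast<std::uint32_t>(scaled);
    };
    return mortonFromCells(quantize(x), quantize(y), quantize(z));
}
```

Sorting triangles by the code of their centroid then places triangles that are close in space close together in the array.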

In Karras' paper on fast LBVH building, they use 2 compute passes to find all relations between the nodes, and build a radix tree. Apetrei found a similar method which only uses one pass to do both. Similar to Karras' method, we split the array of nodes into leaf nodes and internal nodes. You can still store them in the same array if you use an offset though.

First of all, the structures. We're going to use axis-aligned bounding boxes (AABBs), for which the structure looks as follows:

struct AABB
{
 float3 min, max;
};

The BVHNode contains a little bit more:

struct BVHNode
{
 AABB bounds;
 BVHNode *children[2];
 int triangleID;
 int atomic;
 int rangeLeft, rangeRight;
};

The basic things are there: the AABB and the links to the children. If this is a leaf node, triangleID identifies the triangle it stores. rangeLeft and rangeRight are the soon-to-be children's indices. The atomic counter will play an important role in this algorithm.

Important to note: if you're going to optimize this, think about memory alignment. This structure currently holds 6 floats, 4 ints and 2 pointers (8 bytes each on a 64-bit build). Storing data in multiples of 128 bits (4 ints/floats) and reading/writing in those sizes on the GPU is a good memory optimization.
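As a rough illustration of the alignment advice, the node could be packed into three 128-bit rows by replacing the child pointers with array indices. This layout and the name PackedBVHNode are assumptions made for this sketch, and it assumes 4-byte float and int; it is not the structure used in the rest of this post.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical packed layout: three rows of 16 bytes each.
struct alignas(16) PackedBVHNode
{
    float minX, minY, minZ; int leftChild;   // row 1: AABB minimum + left child index
    float maxX, maxY, maxZ; int rightChild;  // row 2: AABB maximum + right child index
    int triangleID, atomic, rangeLeft, rangeRight; // row 3: bookkeeping
};

static_assert(sizeof(PackedBVHNode) == 48, "node should span exactly three 128-bit rows");
static_assert(alignof(PackedBVHNode) == 16, "node should start on a 128-bit boundary");
```

Each row can then be read or written as one 128-bit value. The price is that children are looked up by index instead of through a pointer.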

I started by simply allocating two arrays for the nodes on the GPU. The other inputs this algorithm needs are the AABBs of every triangle and their respective Morton codes. Obtaining these is simple, but a little obfuscated on the GPU. We're going to use Thrust (a library included in the CUDA Toolkit) to sort the created Morton keys:

#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Sorts the Morton codes in place and applies the same permutation to keys
void Sort(unsigned long long* morton, int *keys, const int n)
{
    thrust::device_ptr<int> d_keys = thrust::device_pointer_cast(keys);
    thrust::device_ptr<unsigned long long> d_morton = thrust::device_pointer_cast(morton);
    thrust::stable_sort_by_key(d_morton, d_morton + n, d_keys);
}

Note that morton and keys are both already existing arrays on the GPU. The keys array is a simple int array containing [1,2..n] storing the triangleIDs, and morton holds the Morton codes of the respective triangles. In this piece of code I simply convert them to something 'thrustworthy' and sort the Morton codes together with the keys using a stable sort by key (so you end up with both arrays sorted the same way).
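To make the effect of a stable sort by key concrete, here is a small CPU analogue in plain C++. The function name sortByKeyCPU is made up for this sketch; it only mirrors what the Thrust call does to the two arrays and says nothing about how Thrust implements it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sorts morton ascending and reorders keys with the same permutation.
// Equal Morton codes keep their original relative order (stable).
void sortByKeyCPU(std::vector<unsigned long long>& morton, std::vector<int>& keys)
{
    assert(morton.size() == keys.size());

    // Sort a list of positions instead of the data itself
    std::vector<std::size_t> order(morton.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return morton[a] < morton[b]; });

    // Apply the permutation to both arrays
    std::vector<unsigned long long> sortedMorton;
    std::vector<int> sortedKeys;
    sortedMorton.reserve(order.size());
    sortedKeys.reserve(order.size());
    for (std::size_t i : order)
    {
        sortedMorton.push_back(morton[i]);
        sortedKeys.push_back(keys[i]);
    }
    morton.swap(sortedMorton);
    keys.swap(sortedKeys);
}
```

After the call, keys[i] still names the triangle that produced morton[i], which is what the leaf nodes rely on later.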

Since allocated data on the GPU could contain anything, we're going to reset some parameters in the nodes and start by defining bounding boxes in the leaf nodes:

__global__ void Reset(BVHNode *leafNodes, BVHNode *internalNodes, AABB *data, int *sortedObjectIDs, unsigned long long *morton, int nNodes)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nNodes)
        return;

    // Reset parameters for internal and leaf nodes here

    // Set ranges
    leafNodes[idx].rangeLeft = idx;
    leafNodes[idx].rangeRight = idx;
    leafNodes[idx].atomic = 1; // To allow the next thread to process
    internalNodes[idx].atomic = 0; // Second thread to process

    // Set triangles in leaf
    // Save triangle ID
    // Expand bounds using min/max functions

    // Special case
    if (nNodes == 1)
    {
        internalNodes[0].bounds = leafNodes[0].bounds;
        internalNodes[0].children[0] = &leafNodes[0];
    }
}

For those who have never seen CUDA code before: __host__ and __device__ indicate where a function can be executed (on the CPU or on the GPU), while __global__ marks a kernel, a function that is called from the host but executed on the device.

The input for the function contains the same Morton codes and keys from the sorting, the nodes themselves, and nNodes, which is simply the number of leaf nodes in this case. One thread per leaf is enough, because a binary radix tree with n leaves always has n - 1 internal nodes.

Compute shader logic: To find the index in the array, we have to look up the current thread ID. I called this function using only one-dimensional blocks and grids, so this is easy to calculate: multiply the block index by the block size and add the thread index. If the index surpasses the number of nodes (nNodes), we simply tell this thread to stop executing.

Walkthrough: First, we simply reset the leaf and internal nodes parameters to NULL and reset the AABB. We're going to start the algorithm in the leaf nodes, so the range counter will only include the node itself for every leaf node. The atomic counter is set to 1 for leaf nodes, and 0 in internal nodes. This is explained further in the algorithm below.

We want to end up with every node having a bound in terms of an AABB. This can be done by setting them directly. I prefer to use the min/max functions, since you don't have to rethink the process if you try to store multiple triangles later on.
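A minimal sketch of the min/max approach in plain C++ follows. Point3, Box, emptyBox, expand and merge are illustrative names chosen for this sketch; they stand in for float3 and AABB and are not part of the original source.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

struct Point3 { float x, y, z; };
struct Box { Point3 min, max; };

// An "inverted" box: expanding it with any point yields that point's bounds
Box emptyBox()
{
    const float inf = std::numeric_limits<float>::infinity();
    return Box{{inf, inf, inf}, {-inf, -inf, -inf}};
}

// Grows the box so that it contains p
void expand(Box& box, const Point3& p)
{
    box.min = {std::min(box.min.x, p.x), std::min(box.min.y, p.y), std::min(box.min.z, p.z)};
    box.max = {std::max(box.max.x, p.x), std::max(box.max.y, p.y), std::max(box.max.z, p.z)};
}

// Smallest box containing both a and b, as needed for an internal node
Box merge(const Box& a, const Box& b)
{
    Box out = a;
    expand(out, b.min);
    expand(out, b.max);
    return out;
}
```

A leaf calls expand once per triangle vertex, so storing several triangles per leaf later only means more expand calls.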

The final part of the code simply handles the case when there is only one node. If this is the case, we don't even have to run the actual algorithm!

The actual algorithm

I made a function which calls the actual algorithm for every leaf node, fetching the threadID in the same manner as above. After that I start the recursion:

// Sets the bounding box and traverses to root
__device__ void ProcessParent(BVHNode *node, int nData, BVHNode* internalNodes, BVHNode *leafNodes, unsigned long long *morton)
{
    // Allow only one thread to process a node
    if (atomicAdd(&node->atomic, 1) != 1)
        return;

    // Set bounding box if the node is no leaf
    if (!isLeaf(node))
    {
        // Expand bounding box with min/max functions from children AABB's
    }

    int left = node->rangeLeft;
    int right = node->rangeRight;
    BVHNode *parent; // assigned in one of the two branches below
    if (left == 0 || (right != nData - 1 && HighestBit(right, morton) < HighestBit(left - 1, morton)))
    {
        // parent = right, set parent left child and range to node
    }
    else
    {
        // parent = left -1, set parent right child and range to node
    }

    // A node that covers every leaf is the root, so we are done
    if (left == 0 && right == nData - 1)
        return;
    ProcessParent(parent, nData, internalNodes, leafNodes, morton);
}

Walkthrough: The first part is an atomicAdd; this function adds a value to the integer at the given address and returns the previous value. We need this to guarantee a correct parallel reduction. You can read more on what atomics are and how they work here. In our context, we're saying that if the old value stored in the atomic counter is equal to one, the thread can continue. This way we stop the first thread that encounters a new node and continue traversing when the second thread arrives. This is also the reason why we set the atomic counter to 1 for every leaf node: the first thread arriving will read this value and continue.

The next part simply states that if the thread encounters an internal node, we set the bounding box by combining the AABBs of the children. We know that both children are set because of the logic with the atomic counter above.
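The counter logic can be tried out on the CPU with std::atomic, whose fetch_add also returns the previous value. The helper name mayProcess is an assumption for this sketch; on the GPU the same role is played by the atomicAdd call in the kernel.

```cpp
#include <atomic>
#include <cassert>

// Returns true when the caller is allowed to process the node.
// fetch_add returns the value the counter held before the increment.
bool mayProcess(std::atomic<int>& counter)
{
    return counter.fetch_add(1) == 1;
}
```

With an internal node's counter starting at 0, the first caller reads 0 and stops, and the second reads 1 and continues. A leaf's counter starts at 1, so its only visitor continues right away.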

Now we can calculate the parent according to the logic from the paper. With the left and right child index and the following function, we can calculate which internal node is the parent:

// Returns the XOR of the Morton codes at i and i + 1; the larger the value,
// the more significant the highest differing bit
__device__ unsigned long long HighestBit(int i, unsigned long long *morton)
{
    return morton[i] ^ morton[i + 1];
}

The paper has a clear explanation as to why this algorithm works, so if you want to know why, I suggest reading the paper.

However, I will give a working example to show this. Let's trace the algorithm from the thread that handles internal node 5 from the image. The child indices are 5 and 6. The algorithm states:

if (left == 0 || 
(right != nData - 1 && 
HighestBit(right, morton) < HighestBit(left - 1, morton)))

The first two conditions handle the cases where the range touches index 0 or n-1 (the left and right ends of the tree). Neither applies here, so we have to compare some values. HighestBit of right returns $1101 \oplus 1111 = 0010 = 2$ and HighestBit of left-1 returns $1000 \oplus 1100 = 0100 = 4$. Since 2 < 4, the statement is true. This means the parent has the same index as the right child, which is 6.
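The worked example can be checked with a small serial sketch of the bottom-up walk. This is a CPU stand-in written for illustration, not the kernel from this post: the names delta, chooseParent, SketchNode and buildSerial are made up, child links and bounding boxes are left out, and duplicate Morton codes are not handled. Only the last four codes in the usage below (1000, 1100, 1101, 1111) come from the example; the first four are hypothetical filler values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// XOR of neighbouring codes; a larger value means a higher differing bit
std::uint64_t delta(int i, const std::vector<std::uint64_t>& morton)
{
    return morton[i] ^ morton[i + 1];
}

// Index of the internal node that becomes the parent of range [left, right]
int chooseParent(int left, int right, const std::vector<std::uint64_t>& morton)
{
    const int n = static_cast<int>(morton.size());
    if (left == 0 || (right != n - 1 && delta(right, morton) < delta(left - 1, morton)))
        return right;    // the range becomes the parent's left child
    return left - 1;     // the range becomes the parent's right child
}

struct SketchNode { int rangeLeft = -1, rangeRight = -1, visits = 0; };

// Walks up from every leaf, one leaf after the other, and returns the root index
int buildSerial(const std::vector<std::uint64_t>& morton, std::vector<SketchNode>& internal)
{
    const int n = static_cast<int>(morton.size());
    assert(n >= 2);
    internal.assign(n - 1, SketchNode{});
    int root = -1;
    for (int leaf = 0; leaf < n; ++leaf)
    {
        int left = leaf, right = leaf, current = -1; // -1 marks a leaf
        for (;;)
        {
            if (left == 0 && right == n - 1) { root = current; break; }
            const int parent = chooseParent(left, right, morton);
            if (parent == right) internal[parent].rangeLeft = left;
            else                 internal[parent].rangeRight = right;
            if (++internal[parent].visits < 2) break; // first arrival stops here
            left = internal[parent].rangeLeft;
            right = internal[parent].rangeRight;
            current = parent;
        }
    }
    return root;
}
```

For the codes {0001, 0010, 0100, 0101, 1000, 1100, 1101, 1111}, chooseParent(5, 6, codes) returns 6, matching the trace above. The visits counter plays the role of the atomic counter: the second arrival is the one that carries the merged range upwards.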

I hope you gained some insight into this algorithm from this code sample. Keep in mind that while this algorithm gives you a fast build, it does not give the best tracing speed.

Warning: LBVHs as shown in the graph above represent an optimal solution. Most of the LBVHs you will make using this algorithm will (most likely) contain tons of layers containing one leaf- and one internal node, thus the depth constraint of log2(n) in balanced binary trees does not apply here!

Thursday, October 22, 2015

Real-time Raytracing part 2

Back to part 1.

In this post we will look into a particular optimization for the raytracing algorithm, namely data structures. Once you have your raytracer up and running on the GPU, you will check, for every ray, which triangles it intersects. In the last post we saw that checking all triangles for every ray is simply impossible. In order to save some time, we store the triangles in a spatial data structure, which is easier to query for a ray.

There are many possible data structures to store a set of triangles. The most frequently used ones for raytracing are: k-d trees, octrees and bounding volume hierarchies (BVH). Any one of them is going to speed up the process a LOT. If your ray would only hit the background, in both the octree and the BVH there would only be a single ray-box intersection and then we're done. That's already saving n triangle checks!

But this post wouldn't be interesting if it just pointed to some data structures. In this post I'm going more in depth on the BVH, specifically using bounding boxes. Why bounding boxes? Because there is a very simple, computationally efficient test for ray-box intersections. It needs no if-statements, so it's a very solid algorithm for massively parallel execution as well. Fits right in our GPU ray tracer! Now there are basically two types of BVHs. The LBVH (linear BVH) can be built really fast, but has less tracing performance than SAH (Surface Area Heuristic) BVHs. The SAH chooses the split of a set of triangles that minimizes the sum of the surface areas of the two child volumes, each weighted by the number of triangles in that child.
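The ray-box test referred to here is usually the slab test, which can be sketched in plain C++ as follows. The struct and function names are assumptions for this sketch, and it relies on IEEE 754 floats, where dividing by a zero direction component gives infinity. A ray whose origin lies exactly on a box face while travelling parallel to it is an edge case this sketch does not treat.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

struct Ray { float origin[3]; float dir[3]; };
struct Box3 { float min[3]; float max[3]; };

// Slab test: intersect the ray with the three pairs of parallel planes and
// check that the resulting parameter intervals overlap. Only min/max are
// used per axis, no data-dependent branches.
bool rayHitsBox(const Ray& ray, const Box3& box)
{
    float tNear = 0.0f; // the ray starts at its origin
    float tFar = std::numeric_limits<float>::infinity();
    for (int axis = 0; axis < 3; ++axis)
    {
        const float inv = 1.0f / ray.dir[axis]; // +-infinity for axis-parallel rays
        const float t0 = (box.min[axis] - ray.origin[axis]) * inv;
        const float t1 = (box.max[axis] - ray.origin[axis]) * inv;
        tNear = std::max(tNear, std::min(t0, t1));
        tFar = std::min(tFar, std::max(t0, t1));
    }
    return tNear <= tFar;
}
```

In a real tracer the inverse direction would typically be computed once per ray instead of once per box.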

Tero Karras from Nvidia Research came up with a very cool algorithm to build LBVHs in parallel on the GPU. On the Nvidia blog you can find this and this post. They explain how you can traverse a BVH on the GPU, and give some great performance tips to speed it up. Part three of the tutorial shows how you can build the LBVH on the GPU making use of the massive parallelism available. It's worth checking out the research paper on the subject, but below is an easier and even faster version. The basic principle of the paper is building a radix tree of a set of objects which are sorted using Morton codes. A Morton code is a spatial code created by interleaving the bits of an object's quantized x, y and z coordinates into one integer or long. From this radix tree, you can precalculate where the tree would split, and thus process the whole tree in parallel.

This LBVH creation method is really fast: about 16ms for a million triangles is not uncommon. Even so, the algorithm can be a lot simpler, as described in Fast and Simple Agglomerative LBVH Construction (download link). This paper shows that the same restrictions on the radix tree can be used to build the tree bottom-up instead of using an unnecessary top-down iteration. It shows that by comparing values with the neighbor, you can decide the parent node. It's even a bit faster than the method above, and a lot simpler to implement.

This is as fast as it's going to get for building times for BVHs. But how about tracing times? As mentioned above, the SAH is a good heuristic for ray tracing performance. This graph is an awesome comparison:

This graph is from another Nvidia paper: Fast Parallel Construction of High-Quality Bounding Volume Hierarchies. This is another paper on building LBVHs really, really fast. They also mention an optimization in tracing speed by pre-splitting some large triangles. There is a lot more in-depth information about building times and tracing times as well. They mention a lot of methods, but the main point of this image is to show that (H)LBVH reaches about 60~70% of the tracing speed of SAH BVHs. Building an SAH BVH takes about 10~20 times as long as building an LBVH though.

So how do we build these awesome BVHs using the SAH? Ingo Wald wrote a paper on building this kind of BVH very fast. This algorithm is performed on the CPU in a top-down fashion. There are a lot of possibilities for performance boosts which are also explained in the paper. The building times could be about 3 times as slow as building an LBVH by trading in some tracing performance for build optimizations. But, as shown in the graph above, this method still gives the best raytracing performance.
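To give an idea of what a top-down SAH builder evaluates, here is a minimal sketch of the cost of one candidate split. The names Bounds, surfaceArea and sahCost are illustrative, and the constant traversal and intersection costs of the full heuristic are left out because they do not change which candidate wins.

```cpp
#include <cassert>

struct Bounds { float min[3]; float max[3]; };

// Surface area of an axis-aligned box
float surfaceArea(const Bounds& b)
{
    const float dx = b.max[0] - b.min[0];
    const float dy = b.max[1] - b.min[1];
    const float dz = b.max[2] - b.min[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Relative cost of splitting a node into two children: each child's
// surface area weighted by the number of triangles it holds
float sahCost(const Bounds& left, int leftCount, const Bounds& right, int rightCount)
{
    return surfaceArea(left) * leftCount + surfaceArea(right) * rightCount;
}
```

A builder would compute this for a number of candidate split positions and keep the one with the lowest cost.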

This image below shows a visualization of rays intersecting bounding boxes with 8 stanford dragon models. You can find the source code for this BVH in part 2.1.

I hope to have informed you enough about the possibilities of optimizing raytracing by using spatial data structures. There is some speculation about the creation of BVHs on the GPU, since the GPU will be working on tracing rays all the time anyway. More on this discussion can be read on ompf2.com (where you can also find an implementation of most of these methods).

Part 2.1: code sample of Fast and Simple Agglomerative LBVH Construction
Part 3

Wednesday, October 14, 2015

Real-time Raytracing part 1

In the next few posts I will talk about Real-time Raytracing. The real-time part has different interpretations. Some define it as being 'interactive', which is another vague term as any application running at 1 FPS can still be interactive. In this case we will define a "real-time constraint" as a time constraint per frame of about 16ms (which leads to 60 frames per second). This is a common time constraint used in games which have to be responsive.

Now for the raytracing part. The basic concept of raytracing is tracing the paths of light for every pixel on an image plane. This way we can simulate visual realism by mimicking the real process. You can create pretty images without much code:

Unfortunately raytracing also has downsides. The process to trace the light paths requires global scene access, and is thus unable to work similar to a rasterizer (which draws objects one by one). This makes the whole rendering process more difficult and the standard route of rendering objects is averted. Raytracing has a very high computational cost. A basic algorithm for checking which triangle was hit already shows the problem:

for (int x = 0; x < screenwidth; x++)
    for (int y = 0; y < screenheight; y++)
        for (int t = 0; t < nTriangles; t++)
            RayTriangle(x, y, t);

The whole algorithm scales with screen size and the amount of triangles. Running this algorithm with the dragon model shown above is basically asking for your computer to explode. The model has exactly 100,000 triangles, and checking that amount of triangles for every pixel at a low resolution (1024x1024) comes to about 105 billion ray-triangle checks. A simple line-triangle intersection already takes a lot of instructions, so needless to say, this algorithm isn't going to run in real time...
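The count is a plain product, which a few lines of C++ can confirm. The function name is made up for this sketch; 64-bit integers are used because the result does not fit in 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Number of ray-triangle tests when every pixel is tested against every triangle
constexpr std::uint64_t bruteForceChecks(std::uint64_t width, std::uint64_t height,
                                         std::uint64_t triangles)
{
    return width * height * triangles;
}

static_assert(bruteForceChecks(1024, 1024, 100000) == 104857600000ULL,
              "1024 x 1024 pixels times 100,000 triangles");
```

That is one primary ray per pixel; any extra samples or bounces multiply the number further.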

So, how do we make it real-time? Several decades of research show us that there are plenty of possibilities to speed up the process of raytracing. In the upcoming posts I will talk about some of these processes to speed up my path tracer. Most of the subjects will be on parallel programming on the GPU using CUDA. Even though I will try my best to keep it as readable as possible, unexplained terminology can always be found in the excellent ray tracing tutorial of scratchapixel.

If you're really interested in ray tracing, and haven't read it already: Ray tracey's blog already has a huge collection of ray tracing posts.

Part 2.

Friday, April 3, 2015

Compute Shader Framework

In the last few posts about ray tracing I briefly mentioned compute shaders. If you don't know what they are, here is a short summary:

Introduction

Compute shaders are not part of the ordinary graphics pipeline; they can be used separately from any other stage. They are particularly meant for computation on the GPU. Compute shaders are written in the same language as the other pipeline stages, like the pixel shader. In this case, that is HLSL. The compute shader takes advantage of the huge speedup the GPU has to offer over the CPU by making use of the GPU's parallel computation power.

When I first started out with compute shaders, I saw them as a black box and didn't really understand how to get started using one. After I found out that they can be really useful for large computations, I decided to implement one for my ray tracing project (with success). After this, I decided to make a simple framework allowing everyone access to compute shaders in a more user-friendly way. It is only intended for DirectX compute shaders in C#.

Framework

So without further ado, here is the framework: ComputeShader. You can also view it on GitHub.
On the first run, the framework will download some NuGet packages from SharpDX. If you already have SharpDX installed, you can simply reference them to skip this part.

With this framework it's possible to bind any structure to the GPU. You can do numerous things with these structures in your shader, and then output some data that you want to know. You can read this data back in your code and use it later on! A simple example would be updating a particle system: you dump all positions and velocities to the GPU, and then calculate the next positions in your shader.

Usage:
In your project, either reference the ComputeShaderAddon.dll or add the project to your solution and reference the project.
You can now calculate anything on the GPU by using the next 4 lines of code (don't forget to include ComputeShaderAddon):

ComputeShaderHelper CSHelper = new ComputeShaderHelper(Device, "effect.fx");
int index = CSHelper.SetData<ExampleStruct>(data);
CSHelper.Execute(50);
CSHelper.GetData<ExampleStruct>(index);

This is the code from the example in the framework. What it does per line:
- Initialize the helper, this compiles the shader (if necessary) and sets it up.
- Set your data from any possible struct to the GPU buffers. The index is stored to retrieve the data later on.
- Execute the compute shader. The number is the amount of cores (GPU threads) to use. The maximum number of cores is 1024; note that this will use all the calculation power of the GPU at once!
- Retrieve the data from the GPU, using the index from above.
Create your compute shader. Set the amount of cores you want to use in the brackets above the main function like this: [numthreads(cores, 1, 1)]

Likewise, save the length of the array of structs somewhere in the compute shader, if you want to use this like I did in the framework.
Done! Run your project!

Results
The example is a small program I wrote to test the computation power of the GPU in comparison with the CPU. The operation to perform is simple: for every struct you get, count numbers from zero to the length of the array and store them in the struct. Below you'll find a CPU and GPU version of this in code:

// CPU
for (int i = 0; i < amount; i++)
{
    int result = 0;
    for (int j = 0; j < amount; j++)
        result += j;
    data[i].Data = new Vector3(result, result, result);
}

// GPU -- ComputeShaderExample.fx in framework
[numthreads(nThreads, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    int range = nStructs / nThreads;
    for (uint i = id.x * range; i < id.x * range + range; i++)
    {
        int result = 0;
        for (uint j = 0; j < nStructs; j++)
        {
            result += j;
        }
        data[i].Data = float3(result, result, result);
    }
}

The framework ran with 50 cores on the GPU, and the results are as follows:

Last data: X:4,9995E+07 Y:4,9995E+07 Z:4,9995E+07
It took the GPU: 102 milliseconds
Last data: X:4,9995E+07 Y:4,9995E+07 Z:4,9995E+07
It took the CPU: 283 milliseconds

With this small calculation the GPU, using 50 cores, is about 3 times faster than the CPU, using only one core.

Some scalings:

Structs	Cores	GPU time (ms)	CPU time (ms)
10k	50	102	283
20k	50	380	1131
30k	50	751	2463
10k	10	452	283
10k	100	119	283
10k	1000	480	283

You can see that the amount of cores is something to fiddle with, since the resulting time differs greatly. This is because the overhead of running all the threads costs more than the actual computation itself, so be careful with this!

Fun stuff: the outcome is easily calculated by: n(n + 1) / 2. This is an easy way to calculate a numerical sequence like this. In this case, n = 9999 (because we start at 0).
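A short C++ check of the closed form against the loop from the example; the function names are made up for this sketch.

```cpp
#include <cassert>

// Closed form for 0 + 1 + ... + n
constexpr long long sumUpTo(long long n)
{
    return n * (n + 1) / 2;
}

// The same sum computed like the inner loop of the example: j = 0 .. count - 1
long long sumByLoop(long long count)
{
    long long result = 0;
    for (long long j = 0; j < count; ++j)
        result += j;
    return result;
}
```

For 10,000 structs the loop runs j = 0..9999, so both give 49,995,000, which is the 4,9995E+07 printed in the results.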

Future work

This framework currently only supports Unordered Access View (UAV) bindings, so if you use DirectX 10 you can only bind one array of structures to the compute shader and that's it. In DirectX 11 this is increased to 8, which is supported in this example project.

Currently you still have to set the length of the array and the amount of threads manually in the compute shader. I don't know if it's possible to change this dynamically from code, but if I ever find a way, I will update the framework for sure.

Wednesday, January 21, 2015

Raytracing Part 3

The final post in this series about my ray tracer. I spent last weekend creating some black magic called a compute shader. It allows you to use your GPU to aid in calculations. In my case this came in quite handy with all these ray calculations. The GPU is optimal for calculations that can be executed completely in parallel. Fortunately, raytracing falls in this category.

This is the first result worth looking at:

The first thing to notice is that it's quite grainy. This is because there are not that many samples per pixel, only about 5. However, the (still local) Cook-Torrance shading is already working, as you can see on the metal-ish sphere in the back.

The main thing to notice here are the soft shadows. A small change, but 'realism' awaits. There are some incorrect results in this image, because you can see the diffuse reflection of the green box is still biased. The local shading model from two lights draws the reflecting rays in their direction, meaning this is not a totally correct result.

The box in the middle suddenly became radioactive and now emits light. The image is still biased, even the emitting box only emits towards the light sources.

Experimenting with rougher surfaces and more glossy surfaces.

The first results of refracting rays.

Perfectly refracting sphere showing the bias is indeed gone here. The grain is also gone since I actually took the time to sit back and wait for the image to converge (this image took about two minutes to converge to this state).

The same scene with different materials and a more emitting 'roof'.

Monday, September 8, 2014

From zero to lighting in 2D

This tutorial is about spicing up your 2D game with some awesome lighting. I made this tutorial because there aren't any out there yet, and it's a very cool effect to add to your 2D game.

What you need to know

First off, I assume you have some familiarity with C#, or coding in general. Secondly it is advised to at least have programmed in a shader language before. Used in this tutorial is HLSL. The rest is what this tutorial is for!

The final result

You can download the code example here (or you can scroll down to find some explanations). The final result will look something like this:

Normal maps

Normal maps are textures containing a color indicating the normal of a certain pixel. How does this work? Take the normal map used in the code example alongside the original texture:

The left texture contains the regular colors, and on the right are the generated colors that encode the normals. To calculate the actual normal vector, we have to apply a little transformation:
$$normal = 2.0 * pixelcolor - 1.0$$

To explain this better, we take the most common color in the picture, a light blue-ish color. In RGB values it is about $(128, 128, 255)$, which becomes $(0.5, 0.5, 1.0)$ when scaled to the range [0,1]. After applying our transformation the value becomes $(0.0, 0.0, 1.0)$, which is a normal pointing in the Z direction.
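The same decoding step written out in C++ for 8-bit color channels. The struct and function names are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cmath>

struct Normal { float x, y, z; };

// Maps each 8-bit channel from [0, 255] to [0, 1] and then to [-1, 1]
Normal decodeNormal(unsigned char r, unsigned char g, unsigned char b)
{
    auto channel = [](unsigned char c) { return 2.0f * (c / 255.0f) - 1.0f; };
    return Normal{channel(r), channel(g), channel(b)};
}
```

Because 128 / 255 is slightly more than 0.5, the decoded x and y of the light blue color land near 0.004 rather than exactly 0. Shaders often normalize the decoded vector afterwards.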

In our 2D game this value will point from screen towards the viewer. As you can see in the normal image, there are several red and green colored portions, which will affect the direction of the normal. With this information we can add fake depth to a plain 2D texture!

The drawing setup

To pass this information to our lighting effect we need to have this normal map ready, which means we can't draw everything on the screen immediately. The setup used for this comes close to Deferred Shading. We only need the color and normal buffer, since the depth buffer is worthless here.

All the regular sprites are drawn first, as you're used to, except that this time we save them to a rendertarget. After that I draw all of the normals to another rendertarget. Unlike standard deferred shading, I chose to render the lights to a separate rendertarget here. If you wish, you can combine drawing the textures and drawing the lights into a single pass. This was merely done to show the different rendering steps here.

The actual lighting magic
Because it's a 2D game, you expect to need to draw a "lighting texture", something like this:

But we don't need to! Since we have a normal, we can simply apply the technique to draw a light in 3D, which is really fancy and easy to create. The effect I used to create this tutorial with is called Diffuse Reflection, or Lambertian Reflectance. We set up a point light (which is, a point from which light emanates) and calculate the pixel color on the GPU.

Diffuse reflection requires three things: the position of the light, the position of the current pixel being shaded, and the normal at that position. From the first two you can calculate the light direction, and by looking at the value of the dot product from the light direction and the normal you can determine the lighting coefficient.

Sometimes you will want to rotate the normal retrieved from the normal map. This is done by creating a separate rotation matrix and adding it to the shader. More information about creating such a matrix can be found in my other tutorial series on rotations.

Finding the correct normal on the pixel position is rather easy: we have a full screen buffer of normals, and a position given by the draw call. Dividing this position by the screen size, we have the texture coordinates of the normal pixel ranging in [0,1]. Exactly what we need!

Code example

All of this code can be found in the source code, I'd like to point out a few things in this article though, here's the code used for lighting in HLSL:

// Basic XNA Vertex shader
float4x4 MatrixTransform;
void SpriteVertexShader(inout float4 color    : COLOR0,
     inout float2 texCoord : TEXCOORD0,
     inout float4 position : SV_Position)
{
     position = mul(position, MatrixTransform);
}

float4 PixelShaderFunction(float2 position : SV_POSITION, 
         float4 color : COLOR0,
         float2 TexCoordsUV : TEXCOORD0) : COLOR0
{
     // Obtain texture coordinates corresponding to the current pixel on screen
     float2 TexCoords = position.xy / screenSize;
     TexCoords += 0.5f / screenSize;

     // Sample the input texture
     float4 normal = 2.0f * tex2D(NormalSampler, TexCoords) - 1.0f;

     // Transform input position to view space
     float3 newPos = float3(position.xy, 0.0f);
     float4 pos = mul(newPos, InverseVP);

     // Calculate the lighting with given normal and position
     float4 lighting = CalculateLight(pos.xyz, normal.xyz);
     return lighting;
}

// Calculates diffuse light with attenuation and normal dot light
float4 CalculateLight(float3 pos, float3 normal)
{
   float3 lightDir = LightPosition - pos;

   float attenuation = saturate(1.0f - length(lightDir) / LightRadius);
   lightDir = normalize(lightDir); 
 
   float NdL = max(0, dot(normal, lightDir));
   float4 diffuseLight = NdL * LightColor * LightIntensity * attenuation;
 
   return float4(diffuseLight.rgb, 1.0f);
}

As you can see, the shader consists of a vertex and a pixel shader. The vertex shader simply passes on the color, texture coordinates and position. It only transforms the position with the given matrix. After the vertex shader we know the rectangle on the screen, and the pixel shader will process all the pixels inside of it.

What you first see in the pixel shader is getting the texture coordinates from the position. This is done by dividing it by the screen size and adding half a pixel width (so we're in the center of the pixel). With this coordinate we can sample the normal map to get the color of the normal, and as shown in this article, calculate the actual normal from it. What happens next is retrieving the original position, by multiplying it with the inverse view-projection matrix. We can now calculate the lighting with the given parameters.

The light calculation method is nothing more than the dot product of the normal and the light direction, to see if the surface should get lit. Not to forget, of course, the attenuation, which looks at the range of the light and (smoothly) caps the result by multiplying it with this value.
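For readers who want to experiment with the numbers outside a shader, here is a C++ sketch of the same calculation reduced to a single coefficient. The names are made up for this sketch, light color and intensity are left out, and the early return for a light sitting exactly on the pixel is an added guard that the shader does not have.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Lambertian term (normal dot light direction) times a linear distance falloff
float diffuseCoefficient(const Vec3& lightPos, const Vec3& pixelPos,
                         const Vec3& normal, float lightRadius)
{
    assert(lightRadius > 0.0f);
    const Vec3 toLight{lightPos.x - pixelPos.x, lightPos.y - pixelPos.y, lightPos.z - pixelPos.z};
    const float distance =
        std::sqrt(toLight.x * toLight.x + toLight.y * toLight.y + toLight.z * toLight.z);
    if (distance == 0.0f)
        return 0.0f; // no direction to normalize (guard added for this sketch)

    // Same idea as saturate(1 - length / radius) in the shader
    const float attenuation = std::min(std::max(1.0f - distance / lightRadius, 0.0f), 1.0f);
    const float nDotL = std::max(
        0.0f, (normal.x * toLight.x + normal.y * toLight.y + normal.z * toLight.z) / distance);
    return nDotL * attenuation;
}
```

A light halfway inside its radius, straight in front of a surface facing it, gives 0.5. Turning the normal away or moving the light beyond its radius gives 0.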

Optimization

You want to draw a lot of lights, right? Normal deferred shading can't handle a lot of point lights, since you have to redraw the whole screen for every point light. Thus follows the first optimization: if we only draw a small square on the screen where we expect the light to shine, we don't draw the rest of the screen. This is done quite simply by adding a light radius, from which we can create a rectangle to draw in spritebatch.
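The rectangle follows directly from the light position and radius. In the sketch below, Rect and lightBounds are hypothetical names, and clamping to the screen is an optional extra that the text does not require.

```cpp
#include <algorithm>
#include <cassert>

struct Rect { int x, y, width, height; };

// Screen-space square around a light, clipped to the screen
Rect lightBounds(int centerX, int centerY, int radius, int screenWidth, int screenHeight)
{
    assert(radius >= 0);
    const int x0 = std::max(centerX - radius, 0);
    const int y0 = std::max(centerY - radius, 0);
    const int x1 = std::min(centerX + radius, screenWidth);
    const int y1 = std::min(centerY + radius, screenHeight);
    // A light that is entirely off screen ends up with zero width or height
    return Rect{x0, y0, std::max(x1 - x0, 0), std::max(y1 - y0, 0)};
}
```

On an 800x600 screen, a light with radius 50 covers at most 100x100 pixels, roughly 2% of the screen, so the pixel shader runs on far fewer pixels per light.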

Since we don't need to draw any textures, and we still have to make a draw call, I found the following optimization: in spritebatch you have to supply a texture for a draw call, the best way to use our previous optimization is drawing a pixel and upsize it to the square. This way, the pixel shader can sample the normal map and output the lighting on the positions given by the draw call. In the code I just pass the normal map as texture for simplicity.

I hope you learned something from this tutorial, and sure hope to see some awesome games created with this effect!