First published: 2024-02-21
Last updated: 2024-02-22
About the Mesh Shading Series
This post is part 5 of a series about mesh shading. My intent in this series is to introduce the various parts of mesh shading in an easy-to-understand fashion. Well, as easy as I can make it. My objective isn’t to convince you to use mesh shading. I assume you’re reading this post because you’re already interested in mesh shading. Instead, my objective is to explain the mechanics of how to do mesh shading in Direct3D 12, Metal, and Vulkan as best I can. My hope is that you’re able to use this information in your own graphics projects and experiments.
- Mesh Shading Part 1: Rendering Meshlets
- Mesh Shading Part 2: Amplification
- Mesh Shading Part 3: Instancing
- Mesh Shading Part 4: Culling
- Mesh Shading Part 5: LOD Selection
- Mesh Shading Part 6: LOD Calculation
- Mesh Shading Part 7: Culling + LOD
- Mesh Shading Part 8: Vertex Attributes (TBD)
- Mesh Shading Part 9: Barycentric Interpolation (TBD)
Sample Projects for This Post
115_mesh_shader_lod - Demonstrates the most basic LOD functionality, using the instance index to select the LOD.
The D3D12 version of the above sample displays pipeline statistics. The Metal and Vulkan versions do not, for different reasons. Metal doesn’t have pipeline statistics. Turning on pipeline statistics in the Vulkan version tanks performance, and I haven’t had a chance to investigate why or how it affects the various GPUs.
Introduction
Alongside culling, a much-touted use of mesh shading is LOD selection of meshlets. Before we jump in, a brief word about how the LOD discussion is covered.
The LOD discussion spans two posts to keep things simple. The first post, which is this post, will cover loading the LOD meshes, creating the LOD meshlets, and then drawing each LOD with a hard coded index. The next post will build upon this post and show how to do automatic LOD selection using distance to camera. Hopefully this keeps the posts shorter and easier to consume!
On to LOD meshes!
LOD Meshes
In order to get LODs of meshlets we need LOD meshes. To keep things simple, we’ll continue using the horse statue. We’ll say that horse_statue_01_1k.obj, the model we’ve been using, is LOD 0 - the level with the most detail. This means our convention will be 0..n from most detailed to least detailed. If you look in the GREX project’s asset/models directory, you’ll see 4 other files:
- horse_statue_01_1k_LOD_1.obj
- horse_statue_01_1k_LOD_2.obj
- horse_statue_01_1k_LOD_3.obj
- horse_statue_01_1k_LOD_4.obj
So all together, we’ll have 5 LODs.
How Were the LOD Meshes Created?
The LODs were created in Blender using the Decimate tool. I originally wanted to use meshopt’s simplification but I couldn’t get it to do what I wanted. So I opted to just create the LODs by hand. This means that we won’t be able to reuse LOD 0’s vertex data for the subsequent LODs. But it’s all good, this is sample code after all.
Loading the LOD Meshes
Loading the LOD meshes is straightforward: we just load 5 meshes instead of one.
std::vector<TriMesh> meshLODs;
{
// LOD 0
{
TriMesh mesh = {};
bool res = TriMesh::LoadOBJ2(GetAssetPath("models/horse_statue_01_1k.obj").string(), &mesh);
if (!res) {
assert(false && "failed to load model LOD 0");
return EXIT_FAILURE;
}
meshLODs.push_back(mesh);
}
// LOD 1
{
TriMesh mesh = {};
bool res = TriMesh::LoadOBJ2(GetAssetPath("models/horse_statue_01_1k_LOD_1.obj").string(), &mesh);
if (!res) {
assert(false && "failed to load model LOD 1");
return EXIT_FAILURE;
}
meshLODs.push_back(mesh);
}
// LOD 2
{
TriMesh mesh = {};
bool res = TriMesh::LoadOBJ2(GetAssetPath("models/horse_statue_01_1k_LOD_2.obj").string(), &mesh);
if (!res) {
assert(false && "failed to load model LOD 2");
return EXIT_FAILURE;
}
meshLODs.push_back(mesh);
}
// LOD 3
{
TriMesh mesh = {};
bool res = TriMesh::LoadOBJ2(GetAssetPath("models/horse_statue_01_1k_LOD_3.obj").string(), &mesh);
if (!res) {
assert(false && "failed to load model LOD 3");
return EXIT_FAILURE;
}
meshLODs.push_back(mesh);
}
// LOD 4
{
TriMesh mesh = {};
bool res = TriMesh::LoadOBJ2(GetAssetPath("models/horse_statue_01_1k_LOD_4.obj").string(), &mesh);
if (!res) {
assert(false && "failed to load model LOD 4");
return EXIT_FAILURE;
}
meshLODs.push_back(mesh);
}
}
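Since the five blocks above differ only in the file name, the same loading could also be driven by a loop. Here’s a minimal sketch of the file naming part, assuming the naming convention listed earlier. LodModelPath and kNumLODs are hypothetical names made up for this illustration, they are not part of GREX.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Number of LODs used in this post (LOD 0 through LOD 4).
constexpr uint32_t kNumLODs = 5;

// Hypothetical helper (not part of GREX): builds the asset-relative path
// for a given LOD using the file naming convention from this post.
// LOD 0 is the original model, LODs 1..4 carry a "_LOD_<n>" suffix.
inline std::string LodModelPath(uint32_t lod)
{
    assert(lod < kNumLODs && "LOD index out of range");

    const std::string base = "models/horse_statue_01_1k";
    if (lod == 0) {
        return base + ".obj";
    }
    return base + "_LOD_" + std::to_string(lod) + ".obj";
}
```

With something like this, the body of each block above could move into a for loop over the LOD index that passes GetAssetPath(LodModelPath(lod)).string() to TriMesh::LoadOBJ2 and pushes the result into meshLODs.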
Building LOD Meshlets
The gist here is that we iterate over the mesh LODs and build meshlets for each LOD. We store the meshlet data in combined arrays. For each LOD we also store the offset of the first meshlet and the meshlet count. For each LOD we’ll also need to adjust the meshlet data offsets so they correspond to the correct LOD.
We keep the meshlet build parameters (kMaxVertices, kMaxTriangles, and kConeWeight) the same for every LOD.
TriMesh::Aabb meshBounds = meshLODs[0].GetBounds();
std::vector<float3> combinedMeshPositions;
std::vector<meshopt_Meshlet> combinedMeshlets;
std::vector<uint32_t> combinedMeshletVertices;
std::vector<uint8_t> combinedMeshletTriangles;
std::vector<uint32_t> meshlet_LOD_Offsets; // Offset of first meshlet of each LOD
std::vector<uint32_t> meshlet_LOD_Counts; // Count of meshlets of each LOD
for (size_t lodIdx = 0; lodIdx < meshLODs.size(); ++lodIdx) {
const auto& mesh = meshLODs[lodIdx];
const size_t kMaxVertices = 64;
const size_t kMaxTriangles = 124;
const float kConeWeight = 0.0f;
std::vector<meshopt_Meshlet> meshlets;
std::vector<uint32_t> meshletVertices;
std::vector<uint8_t> meshletTriangles;
const size_t maxMeshlets = meshopt_buildMeshletsBound(mesh.GetNumIndices(), kMaxVertices, kMaxTriangles);
meshlets.resize(maxMeshlets);
meshletVertices.resize(maxMeshlets * kMaxVertices);
meshletTriangles.resize(maxMeshlets * kMaxTriangles * 3);
size_t meshletCount = meshopt_buildMeshlets(
meshlets.data(),
meshletVertices.data(),
meshletTriangles.data(),
reinterpret_cast<const uint32_t*>(mesh.GetTriangles().data()),
mesh.GetNumIndices(),
reinterpret_cast<const float*>(mesh.GetPositions().data()),
mesh.GetNumVertices(),
sizeof(float3),
kMaxVertices,
kMaxTriangles,
kConeWeight);
auto& last = meshlets[meshletCount - 1];
meshletVertices.resize(last.vertex_offset + last.vertex_count);
meshletTriangles.resize(last.triangle_offset + ((last.triangle_count * 3 + 3) & ~3));
meshlets.resize(meshletCount);
// Store offset of first meshlet and meshlet count for current LOD
meshlet_LOD_Offsets.push_back(static_cast<uint32_t>(combinedMeshlets.size()));
meshlet_LOD_Counts.push_back(static_cast<uint32_t>(meshlets.size()));
// Adjustment offsets for current LOD
const uint32_t vertexOffset = static_cast<uint32_t>(combinedMeshPositions.size());
const uint32_t meshletVertexOffset = static_cast<uint32_t>(combinedMeshletVertices.size());
const uint32_t meshletTriangleOffset = static_cast<uint32_t>(combinedMeshletTriangles.size());
// Copy current LOD's vertex data to the combined positions array
std::copy(mesh.GetPositions().begin(), mesh.GetPositions().end(), std::back_inserter(combinedMeshPositions));
// Adjusts the vertex offset and triangle offset for current LOD
for (auto meshlet : meshlets) {
meshlet.vertex_offset += meshletVertexOffset;
meshlet.triangle_offset += meshletTriangleOffset;
combinedMeshlets.push_back(meshlet);
}
// Adjust the vertex indices for current LOD
for (auto vertex : meshletVertices) {
vertex += vertexOffset;
combinedMeshletVertices.push_back(vertex);
}
std::copy(meshletTriangles.begin(), meshletTriangles.end(), std::back_inserter(combinedMeshletTriangles));
}
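The offset bookkeeping is the part that’s easy to get wrong, so here’s a stripped-down sketch of just that step without the meshopt calls. MeshletRef, LodMeshlets, CombinedMeshlets, and AppendLod are hypothetical stand-ins made up for this illustration. The idea is the same as in the listing above: the sizes of the combined arrays before appending become the offsets for the LOD being appended.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for meshopt_Meshlet (assumption: only the two
// offset fields matter for this illustration).
struct MeshletRef {
    uint32_t vertexOffset;   // first entry in the meshlet vertices array
    uint32_t triangleOffset; // first byte in the meshlet triangles array
};

// Hypothetical per-LOD data, standing in for one LOD's meshlet build output.
struct LodMeshlets {
    uint32_t                numPositions; // vertex count of the LOD's mesh
    std::vector<MeshletRef> meshlets;
    std::vector<uint32_t>   meshletVertices; // indices into the LOD's positions
    std::vector<uint8_t>    meshletTriangles;
};

struct CombinedMeshlets {
    uint32_t                numPositions = 0;
    std::vector<MeshletRef> meshlets;
    std::vector<uint32_t>   meshletVertices;
    std::vector<uint8_t>    meshletTriangles;
    std::vector<uint32_t>   lodOffsets; // first meshlet of each LOD
    std::vector<uint32_t>   lodCounts;  // meshlet count of each LOD
};

inline void AppendLod(const LodMeshlets& lod, CombinedMeshlets& out)
{
    // Record where this LOD's meshlets start and how many there are.
    out.lodOffsets.push_back(static_cast<uint32_t>(out.meshlets.size()));
    out.lodCounts.push_back(static_cast<uint32_t>(lod.meshlets.size()));

    // Sizes of the combined arrays before appending are the rebase offsets.
    const uint32_t positionBase = out.numPositions;
    const uint32_t vertexBase   = static_cast<uint32_t>(out.meshletVertices.size());
    const uint32_t triangleBase = static_cast<uint32_t>(out.meshletTriangles.size());

    // Rebase the meshlet offsets (iterating by value, the copy is what gets stored).
    for (MeshletRef m : lod.meshlets) {
        m.vertexOffset += vertexBase;
        m.triangleOffset += triangleBase;
        out.meshlets.push_back(m);
    }
    // Rebase vertex indices so they point into the combined positions array.
    for (uint32_t v : lod.meshletVertices) {
        out.meshletVertices.push_back(v + positionBase);
    }
    // Triangle bytes are meshlet-local vertex indices, so they copy over as-is.
    out.meshletTriangles.insert(
        out.meshletTriangles.end(), lod.meshletTriangles.begin(), lod.meshletTriangles.end());

    out.numPositions += lod.numPositions;
}
```

Calling AppendLod once per LOD, in order from LOD 0 to LOD 4, would leave lodOffsets and lodCounts with one entry per LOD, which is the same information meshlet_LOD_Offsets and meshlet_LOD_Counts hold above.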
Repacking
The only change to the repacking code is that we iterate over combinedMeshlets instead of meshlets. Everything else remains the same.
// Repack triangles from 3 consecutive bytes to 4-byte uint32_t to
// make it easier to unpack on the GPU.
//
std::vector<uint32_t> meshletTrianglesU32;
for (auto& m : combinedMeshlets)
{
// Save triangle offset for current meshlet
uint32_t triangleOffset = static_cast<uint32_t>(meshletTrianglesU32.size());
// Repack to uint32_t
for (uint32_t i = 0; i < m.triangle_count; ++i)
{
uint32_t i0 = 3 * i + 0 + m.triangle_offset;
uint32_t i1 = 3 * i + 1 + m.triangle_offset;
uint32_t i2 = 3 * i + 2 + m.triangle_offset;
uint8_t vIdx0 = combinedMeshletTriangles[i0];
uint8_t vIdx1 = combinedMeshletTriangles[i1];
uint8_t vIdx2 = combinedMeshletTriangles[i2];
uint32_t packed = ((static_cast<uint32_t>(vIdx0) & 0xFF) << 0) |
((static_cast<uint32_t>(vIdx1) & 0xFF) << 8) |
((static_cast<uint32_t>(vIdx2) & 0xFF) << 16);
meshletTrianglesU32.push_back(packed);
}
// Update triangle offset for current meshlet
m.triangle_offset = triangleOffset;
}
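To make the bit layout concrete, here’s a small sketch of the pack step along with the matching unpack. PackTriangle, UnpackTriangle, and TriangleIndices are hypothetical names for this illustration. Note that after repacking, triangle_offset counts whole triangles (one uint32_t each) rather than bytes.

```cpp
#include <cassert>
#include <cstdint>

// Packs three 8-bit meshlet-local vertex indices into one 32-bit value:
// bits 0..7 = v0, bits 8..15 = v1, bits 16..23 = v2, bits 24..31 unused.
inline uint32_t PackTriangle(uint8_t v0, uint8_t v1, uint8_t v2)
{
    return (static_cast<uint32_t>(v0) << 0) |
           (static_cast<uint32_t>(v1) << 8) |
           (static_cast<uint32_t>(v2) << 16);
}

struct TriangleIndices {
    uint32_t v0;
    uint32_t v1;
    uint32_t v2;
};

// Mirrors the shift-and-mask unpack that a shader would do on the GPU side.
inline TriangleIndices UnpackTriangle(uint32_t packed)
{
    TriangleIndices tri = {};
    tri.v0 = (packed >> 0) & 0xFF;
    tri.v1 = (packed >> 8) & 0xFF;
    tri.v2 = (packed >> 16) & 0xFF;
    return tri;
}
```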
Creating Buffers From meshopt Output
The positionBuffer, meshletBuffer, and meshletVerticesBuffer use the combined arrays combinedMeshPositions, combinedMeshlets, and combinedMeshletVertices, respectively.
MetalBuffer positionBuffer;
MetalBuffer meshletBuffer;
MetalBuffer meshletVerticesBuffer;
MetalBuffer meshletTrianglesBuffer;
MetalBuffer meshletBoundsBuffer;
{
CHECK_CALL(CreateBuffer(renderer.get(), SizeInBytes(combinedMeshPositions), DataPtr(combinedMeshPositions), &positionBuffer));
CHECK_CALL(CreateBuffer(renderer.get(), SizeInBytes(combinedMeshlets), DataPtr(combinedMeshlets), &meshletBuffer));
CHECK_CALL(CreateBuffer(renderer.get(), SizeInBytes(combinedMeshletVertices), DataPtr(combinedMeshletVertices), &meshletVerticesBuffer));
CHECK_CALL(CreateBuffer(renderer.get(), SizeInBytes(meshletTrianglesU32), DataPtr(meshletTrianglesU32), &meshletTrianglesBuffer));
CHECK_CALL(CreateBuffer(renderer.get(), SizeInBytes(meshletBounds), DataPtr(meshletBounds), &meshletBoundsBuffer));
}
LOD Constant Data
We’ll store the LOD offsets and counts in the SceneProperties struct since InstanceCount and MeshletCount are already there.
Note that the D3D12 and Vulkan versions use uvec4 to store the offsets and counts, since array elements in constant data structs are always aligned to 16 bytes. Metal, on the other hand, tightly packs arrays. Fun graphics API nuances.
// -----------------------------------------------------------------------------
// D3D12 and Vulkan
// -----------------------------------------------------------------------------
struct SceneProperties
{
mat4 CameraVP;
uint InstanceCount;
uint MeshletCount;
uint __pad0[2];
uvec4 Meshlet_LOD_Offsets[5]; // ** NEW **
uvec4 Meshlet_LOD_Counts[5]; // ** NEW **
};
// -----------------------------------------------------------------------------
// Metal
// -----------------------------------------------------------------------------
//
// NOTE: Unlike D3D12 and Vulkan, it looks like Metal arrays are tightly
// packed for 32-bit scalar types. This means that Meshlet_LOD_Offsets
// and Meshlet_LOD_Counts are uint here instead of uint4/uvec4.
//
struct SceneProperties
{
float4x4 CameraVP;
uint InstanceCount;
uint MeshletCount;
uint Meshlet_LOD_Offsets[5]; // ** NEW **
uint Meshlet_LOD_Counts[5]; // ** NEW **
uint __pad1[2]; // Make struct size aligned to 16
};
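If you want the compiler to keep you honest about these two layouts, a few static_asserts on the C++ side can help. This is only a sketch under some assumptions: UVec4 stands in for glm’s uvec4, a plain float[16] stands in for the matrix type, the struct names are made up, and the Metal struct’s trailing padding is left out to keep the focus on the array stride.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for a 4 x uint32_t vector type (assumption: 16 bytes, no padding).
struct UVec4 {
    uint32_t x, y, z, w;
};

// D3D12/Vulkan flavor: every array element occupies a full 16-byte slot.
struct ScenePropertiesPadded {
    float    CameraVP[16];
    uint32_t InstanceCount;
    uint32_t MeshletCount;
    uint32_t Pad0[2];
    UVec4    Meshlet_LOD_Offsets[5];
    UVec4    Meshlet_LOD_Counts[5];
};

// Metal flavor (trailing padding omitted): array elements are tightly packed.
struct ScenePropertiesTight {
    float    CameraVP[16];
    uint32_t InstanceCount;
    uint32_t MeshletCount;
    uint32_t Meshlet_LOD_Offsets[5];
    uint32_t Meshlet_LOD_Counts[5];
};

// Padded layout: arrays start on 16-byte boundaries with a 16-byte stride.
static_assert(sizeof(UVec4) == 16, "UVec4 must be 16 bytes");
static_assert(offsetof(ScenePropertiesPadded, Meshlet_LOD_Offsets) == 80, "unexpected offset");
static_assert(offsetof(ScenePropertiesPadded, Meshlet_LOD_Counts) == 160, "unexpected offset");

// Tight layout: arrays follow the two counters directly with a 4-byte stride.
static_assert(offsetof(ScenePropertiesTight, Meshlet_LOD_Offsets) == 72, "unexpected offset");
static_assert(offsetof(ScenePropertiesTight, Meshlet_LOD_Counts) == 92, "unexpected offset");
```

The 80 bytes per array in the padded version versus 20 bytes in the tight version is the whole difference between the two structs above.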
Updating Scene Constant Data
We add the necessary code to update the constant data using the offset and counts we stored earlier.
Note the minor differences between the D3D12/Vulkan updates and the Metal updates.
// -----------------------------------------------------------------------------
// D3D12 and Vulkan
// -----------------------------------------------------------------------------
scene.MeshletCount = meshlet_LOD_Counts[0];
scene.Meshlet_LOD_Offsets[0].x = meshlet_LOD_Offsets[0];
scene.Meshlet_LOD_Offsets[1].x = meshlet_LOD_Offsets[1];
scene.Meshlet_LOD_Offsets[2].x = meshlet_LOD_Offsets[2];
scene.Meshlet_LOD_Offsets[3].x = meshlet_LOD_Offsets[3];
scene.Meshlet_LOD_Offsets[4].x = meshlet_LOD_Offsets[4];
scene.Meshlet_LOD_Counts[0].x = meshlet_LOD_Counts[0];
scene.Meshlet_LOD_Counts[1].x = meshlet_LOD_Counts[1];
scene.Meshlet_LOD_Counts[2].x = meshlet_LOD_Counts[2];
scene.Meshlet_LOD_Counts[3].x = meshlet_LOD_Counts[3];
scene.Meshlet_LOD_Counts[4].x = meshlet_LOD_Counts[4];
// -----------------------------------------------------------------------------
// Metal
// -----------------------------------------------------------------------------
scene.MeshletCount = meshlet_LOD_Counts[0];
scene.Meshlet_LOD_Offsets[0] = meshlet_LOD_Offsets[0];
scene.Meshlet_LOD_Offsets[1] = meshlet_LOD_Offsets[1];
scene.Meshlet_LOD_Offsets[2] = meshlet_LOD_Offsets[2];
scene.Meshlet_LOD_Offsets[3] = meshlet_LOD_Offsets[3];
scene.Meshlet_LOD_Offsets[4] = meshlet_LOD_Offsets[4];
scene.Meshlet_LOD_Counts[0] = meshlet_LOD_Counts[0];
scene.Meshlet_LOD_Counts[1] = meshlet_LOD_Counts[1];
scene.Meshlet_LOD_Counts[2] = meshlet_LOD_Counts[2];
scene.Meshlet_LOD_Counts[3] = meshlet_LOD_Counts[3];
scene.Meshlet_LOD_Counts[4] = meshlet_LOD_Counts[4];
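The unrolled assignments above are easy to read, but the same update can be written as a loop. Here’s a minimal sketch with both flavors side by side. SceneLodPadded, SceneLodTight, UVec4, and SetLodData are hypothetical names that only model the LOD part of SceneProperties.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kNumLODs = 5;

// Stand-in for a 4 x uint32_t vector type.
struct UVec4 {
    uint32_t x = 0, y = 0, z = 0, w = 0;
};

// LOD portion of the scene constants, D3D12/Vulkan flavor (16-byte elements).
struct SceneLodPadded {
    UVec4 Meshlet_LOD_Offsets[kNumLODs];
    UVec4 Meshlet_LOD_Counts[kNumLODs];
};

// LOD portion of the scene constants, Metal flavor (tightly packed).
struct SceneLodTight {
    uint32_t Meshlet_LOD_Offsets[kNumLODs] = {};
    uint32_t Meshlet_LOD_Counts[kNumLODs]  = {};
};

// D3D12/Vulkan: the value lives in the .x component of each element.
inline void SetLodData(
    const std::vector<uint32_t>& offsets,
    const std::vector<uint32_t>& counts,
    SceneLodPadded&              scene)
{
    assert(offsets.size() == kNumLODs && counts.size() == kNumLODs);
    for (uint32_t i = 0; i < kNumLODs; ++i) {
        scene.Meshlet_LOD_Offsets[i].x = offsets[i];
        scene.Meshlet_LOD_Counts[i].x  = counts[i];
    }
}

// Metal: the value is the element itself.
inline void SetLodData(
    const std::vector<uint32_t>& offsets,
    const std::vector<uint32_t>& counts,
    SceneLodTight&               scene)
{
    assert(offsets.size() == kNumLODs && counts.size() == kNumLODs);
    for (uint32_t i = 0; i < kNumLODs; ++i) {
        scene.Meshlet_LOD_Offsets[i] = offsets[i];
        scene.Meshlet_LOD_Counts[i]  = counts[i];
    }
}
```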
LOD Instances
We’ll use the same instancing code from the earlier samples to store the model transform matrix. For 115_mesh_shader_lod, we’ll have 5 instances - one for each LOD.
const uint32_t kNumInstanceCols = 1;
const uint32_t kNumInstanceRows = 5;
std::vector<float4x4> instances(kNumInstanceCols * kNumInstanceRows);
Instance Positions
We hard code some positions for each LOD instance. LOD 0 is closest to the camera and LOD 4 is furthest away.
// Update instance transforms
{
float maxSpan = std::max<float>(meshBounds.Width(), meshBounds.Depth());
float instanceSpanX = 4.0f * maxSpan;
float instanceSpanZ = 4.5f * maxSpan;
float totalSpanX = kNumInstanceCols * instanceSpanX;
float totalSpanZ = kNumInstanceRows * instanceSpanZ;
float t = static_cast<float>(glfwGetTime());
// 0
{
float3 P = float3(0, 0, -static_cast<float>(0 * instanceSpanZ));
instances[0] = glm::translate(P) * glm::rotate(t, float3(0, 1, 0));
}
// 1
{
float3 P = float3(0, 0, -static_cast<float>(0.75f * instanceSpanZ));
instances[1] = glm::translate(P) * glm::rotate(t, float3(0, 1, 0));
}
// 2
{
float3 P = float3(0, 0, -static_cast<float>(2.5 * instanceSpanZ));
instances[2] = glm::translate(P) * glm::rotate(t, float3(0, 1, 0));
}
// 3
{
float3 P = float3(0, 0, -static_cast<float>(8 * instanceSpanZ));
instances[3] = glm::translate(P) * glm::rotate(t, float3(0, 1, 0));
}
// 4
{
float3 P = float3(0, 0, -static_cast<float>(40 * instanceSpanZ));
instances[4] = glm::translate(P) * glm::rotate(t, float3(0, 1, 0));
}
}
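The five blocks above differ only in the Z distance multiplier, so the positions can also be thought of as a small table. Here’s a sketch of that view. kLodDistanceScale and InstanceZ are hypothetical names, while the multipliers themselves are the ones from the listing above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Z distance multipliers from the listing above, one per LOD instance.
// Larger values push the instance further away from the camera.
constexpr std::array<float, 5> kLodDistanceScale = {0.0f, 0.75f, 2.5f, 8.0f, 40.0f};

// Returns the Z translation for an instance. The instance is moved along
// negative Z by its multiplier times the per-instance span.
inline float InstanceZ(std::size_t instanceIndex, float instanceSpanZ)
{
    assert(instanceIndex < kLodDistanceScale.size());
    return -(kLodDistanceScale[instanceIndex] * instanceSpanZ);
}
```

Each instance transform would then be the translation to float3(0, 0, InstanceZ(i, instanceSpanZ)) multiplied by the same Y rotation used above.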
That should do it for the C++ code. Let’s move on to the amplification shader.
Amplification Shader
We only need to make 2 small changes to the amplification shader to support LODs:
- Add the LOD information to the SceneProperties struct.
- Update amplification shader body to make use of the LOD information.
Add LOD Info To SceneProperties
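Add the Meshlet_LOD_Offsets and Meshlet_LOD_Counts arrays to the SceneProperties struct.
The HLSL and MSL code is identical in this case. On the HLSL side, each uint array element still occupies its own 16-byte slot in the constant buffer, which is what the uvec4 arrays on the C++ side line up with.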
struct SceneProperties {
float4x4 CameraVP;
uint InstanceCount;
uint MeshletCount;
uint Meshlet_LOD_Offsets[5]; // ** NEW **
uint Meshlet_LOD_Counts[5]; // ** NEW **
};
Select LOD
For 115_mesh_shader_lod, we’re going to use the instance index to select the LOD. Instance 0 will use LOD 0, instance 1 will use LOD 1, and so on. You get the idea.
Once the LOD is selected, we check to make sure that meshletIndex is within bounds of the meshlet count for the current LOD.
If meshletIndex is within the current LOD’s meshlet count, we then adjust it by the offset to the first meshlet for the current LOD. This puts meshletIndex at the correct place for the meshlet we want to draw.
Everything else remains the same. Pretty easy, huh :)
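If it helps to see the index math outside of a shader, here’s a CPU-side sketch of the selection steps described above. LodTable, MeshletSelection, and SelectMeshlet are hypothetical names for this illustration. It assumes, like the C++ update code earlier, that the meshlet count used for the divide and modulo is LOD 0’s count, so threads that land past a coarser LOD’s count simply produce nothing.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kNumLODs = 5;

// Per-LOD offset of the first meshlet and meshlet count.
struct LodTable {
    uint32_t offsets[kNumLODs];
    uint32_t counts[kNumLODs];
};

struct MeshletSelection {
    bool     visible;       // false if this thread has nothing to draw
    uint32_t instanceIndex;
    uint32_t meshletIndex;  // index into the combined meshlets, valid only if visible
};

// CPU emulation of the per-thread LOD selection done in the amplification shader.
inline MeshletSelection SelectMeshlet(
    uint32_t        dtid,          // dispatch thread ID
    uint32_t        instanceCount,
    uint32_t        meshletCount,  // assumed to be LOD 0's meshlet count
    const LodTable& lods)
{
    assert(meshletCount > 0);

    MeshletSelection s = {false, dtid / meshletCount, dtid % meshletCount};
    if (s.instanceIndex < instanceCount) {
        const uint32_t lod = s.instanceIndex; // instance index doubles as the LOD
        assert(lod < kNumLODs);
        if (s.meshletIndex < lods.counts[lod]) {
            // Rebase so the index points at a meshlet belonging to this LOD.
            s.meshletIndex += lods.offsets[lod];
            s.visible = true;
        }
    }
    return s;
}
```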
HLSL for D3D12 and Vulkan
[numthreads(AS_GROUP_SIZE, 1, 1)]
void asmain(
uint gtid : SV_GroupThreadID,
uint dtid : SV_DispatchThreadID,
uint gid : SV_GroupID
)
{
bool visible = false;
uint instanceIndex = dtid / Scene.MeshletCount;
uint meshletIndex = dtid % Scene.MeshletCount;
if (instanceIndex < Scene.InstanceCount){
uint lod = instanceIndex; // Use instance index for LOD
uint lodMeshletCount = Scene.Meshlet_LOD_Counts[lod]; // Get LOD's meshlet count
if (meshletIndex < lodMeshletCount) {
// Adjust meshletIndex so it refers to a meshlet in the current LOD
meshletIndex += Scene.Meshlet_LOD_Offsets[lod];
// Assuming visible, no culling here
visible = true;
}
}
if (visible) {
uint index = WavePrefixCountBits(visible);
sPayload.InstanceIndices[index] = instanceIndex;
sPayload.MeshletIndices[index] = meshletIndex;
}
uint visibleCount = WaveActiveCountBits(visible);
DispatchMesh(visibleCount, 1, 1, sPayload);
}
MSL for Metal
[[object]]
void objectMain(
constant SceneProperties& Scene [[buffer(0)]],
device const float4* MeshletBounds [[buffer(1)]],
device const Instance* Instances [[buffer(2)]],
uint gtid [[thread_position_in_threadgroup]],
uint dtid [[thread_position_in_grid]],
object_data Payload& outPayload [[payload]],
mesh_grid_properties outGrid)
{
uint visible = 0;
uint instanceIndex = dtid / Scene.MeshletCount;
uint meshletIndex = dtid % Scene.MeshletCount;
if (instanceIndex < Scene.InstanceCount) {
uint lod = instanceIndex;
uint lodMeshletCount = Scene.Meshlet_LOD_Counts[lod];
if (meshletIndex < lodMeshletCount) {
meshletIndex += Scene.Meshlet_LOD_Offsets[lod];
// Assuming visible, no culling here
visible = 1;
}
}
if (visible) {
uint index = simd_prefix_exclusive_sum(visible);
outPayload.InstanceIndices[index] = instanceIndex;
outPayload.MeshletIndices[index] = meshletIndex;
}
// Count the visible meshlets across the SIMD group
uint visibleCount = simd_sum(visible);
outGrid.set_threadgroups_per_grid(uint3(visibleCount, 1, 1));
}
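Both shaders compact the visible meshlets into the payload using a prefix count across the wave (or SIMD group in Metal terms). Here’s a CPU-side sketch of what that compaction computes, assuming the whole threadgroup maps to a single wave. CompactedPayload and CompactVisible are hypothetical names, and the serial loop just stands in for what the hardware does in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct CompactedPayload {
    std::vector<uint32_t> meshletIndices; // visible meshlets, tightly packed
    uint32_t              visibleCount = 0;
};

// CPU emulation of the wave-level compaction for one wave. Each visible lane
// writes to the slot equal to the number of visible lanes before it, which is
// the value WavePrefixCountBits / simd_prefix_exclusive_sum gives that lane.
inline CompactedPayload CompactVisible(
    const std::vector<bool>&     visible,      // one flag per lane
    const std::vector<uint32_t>& meshletIndex) // one meshlet index per lane
{
    assert(visible.size() == meshletIndex.size());

    CompactedPayload out;
    out.meshletIndices.resize(visible.size());

    uint32_t prefix = 0; // visible lanes seen so far (exclusive prefix count)
    for (std::size_t lane = 0; lane < visible.size(); ++lane) {
        if (visible[lane]) {
            out.meshletIndices[prefix] = meshletIndex[lane];
            ++prefix;
        }
    }

    // The total is what WaveActiveCountBits / simd_sum returns, and it is
    // the number of mesh shader threadgroups that get launched.
    out.visibleCount = prefix;
    out.meshletIndices.resize(out.visibleCount);
    return out;
}
```

With LODs in play, this is what keeps the payload dense even though the threads past a coarser LOD’s meshlet count don’t contribute anything.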
Mesh Shader Changes
There aren’t any mesh shader changes for this post. Hope it’s not too disappointing. We’ll make up for it soon.
Rendered Image
The 115_mesh_shader_lod sample renders 5 instances of the horse statue at 5 different LODs.