
First published: 2024-02-21
Last updated: 2024-02-22

About the Mesh Shading Series

This post is part 5 of a series about mesh shading. My intent in this series is to introduce the various parts of mesh shading in an easy to understand fashion. Well, as easy as I can make it. My objective isn’t to convince you to use mesh shading. I assume you’re reading this post because you’re already interested in mesh shading. Instead, my objective is to explain the mechanics of how to do mesh shading in Direct3D 12, Metal, and Vulkan as best I can. My hope is that you’re able to use this information in your own graphics projects and experiments.

Sample Projects for This Post

115_mesh_shader_lod - Demonstrates the most basic LOD functionality, using the instance index to select the LOD.

The D3D12 version of the above sample displays pipeline statistics. The Metal and Vulkan versions do not, for different reasons: Metal doesn’t have pipeline statistics, and turning them on in the Vulkan version tanks the performance. I haven’t had a chance to investigate why this is or how it affects the various GPUs.

Introduction

Alongside culling, a much-touted use of mesh shading is LOD selection of meshlets. Before we jump in, a brief word about how the LOD discussion is covered.

The LOD discussion spans two posts to keep things simple. This post, the first of the two, covers loading the LOD meshes, creating the LOD meshlets, and then drawing each LOD with a hard-coded index. The next post will build upon this one and show how to do automatic LOD selection using distance to camera. Hopefully this keeps the posts shorter and easier to consume!

On to LOD meshes!

LOD Meshes

To get LODs of meshlets we need LOD meshes. To keep things simple, we’ll continue using the horse statue. We’ll say that horse_statue_01_1k.obj, the model we’ve been using, is LOD 0 - the level with the most detail. This means our convention will be 0..n from most detailed to least detailed. If you look in the GREX project’s asset/models directory, you’ll see 4 other files:

  horse_statue_01_1k_LOD_1.obj
  horse_statue_01_1k_LOD_2.obj
  horse_statue_01_1k_LOD_3.obj
  horse_statue_01_1k_LOD_4.obj

So all together, we’ll have 5 LODs.

How Were the LOD Meshes Created?

The LODs were created in Blender using the Decimate tool. I originally wanted to use meshopt’s simplification but I couldn’t get it to do what I wanted, so I opted to create the LODs by hand. This means that each LOD has its own vertex data and we won’t be able to reuse LOD 0’s vertex data for the subsequent LODs. But that’s fine; this is sample code after all.

Loading the LOD Meshes

Loading the LOD meshes is straightforward: we just load 5 meshes instead of one.

std::vector<TriMesh> meshLODs;
{
    // LOD 0
    {
        TriMesh mesh = {};
        bool    res  = TriMesh::LoadOBJ2(GetAssetPath("models/horse_statue_01_1k.obj").string(), &mesh);
        if (!res) {
            assert(false && "failed to load model LOD 0");
            return EXIT_FAILURE;
        }
        meshLODs.push_back(mesh);
    }

    // LOD 1
    {
        TriMesh mesh = {};
        bool    res  = TriMesh::LoadOBJ2(GetAssetPath("models/horse_statue_01_1k_LOD_1.obj").string(), &mesh);
        if (!res) {
            assert(false && "failed to load model LOD 1");
            return EXIT_FAILURE;
        }
        meshLODs.push_back(mesh);
    }

    // LOD 2
    {
        TriMesh mesh = {};
        bool    res  = TriMesh::LoadOBJ2(GetAssetPath("models/horse_statue_01_1k_LOD_2.obj").string(), &mesh);
        if (!res) {
            assert(false && "failed to load model LOD 2");
            return EXIT_FAILURE;
        }
        meshLODs.push_back(mesh);
    }

    // LOD 3
    {
        TriMesh mesh = {};
        bool    res  = TriMesh::LoadOBJ2(GetAssetPath("models/horse_statue_01_1k_LOD_3.obj").string(), &mesh);
        if (!res) {
            assert(false && "failed to load model LOD 3");
            return EXIT_FAILURE;
        }
        meshLODs.push_back(mesh);
    }

    // LOD 4
    {
        TriMesh mesh = {};
        bool    res  = TriMesh::LoadOBJ2(GetAssetPath("models/horse_statue_01_1k_LOD_4.obj").string(), &mesh);
        if (!res) {
            assert(false && "failed to load model LOD 4");
            return EXIT_FAILURE;
        }
        meshLODs.push_back(mesh);
    }
}
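The five blocks above differ only in the file name, so the list of paths could be generated instead of spelled out. Here is a minimal sketch of that idea. LodModelPath and LodModelPaths are hypothetical helpers, not part of the sample, and they assume the _LOD_<n> naming used by the horse statue files.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not in the sample): builds the asset-relative path
// for one LOD. LOD 0 is the original file; LODs 1..n insert "_LOD_<n>"
// before the ".obj" extension, matching the file names used above.
std::string LodModelPath(const std::string& baseName, std::size_t lod)
{
    if (lod == 0) {
        return "models/" + baseName + ".obj";
    }
    return "models/" + baseName + "_LOD_" + std::to_string(lod) + ".obj";
}

// Collects the paths for LOD 0..lodCount-1, most detailed first.
std::vector<std::string> LodModelPaths(const std::string& baseName, std::size_t lodCount)
{
    std::vector<std::string> paths;
    paths.reserve(lodCount);
    for (std::size_t lod = 0; lod < lodCount; ++lod) {
        paths.push_back(LodModelPath(baseName, lod));
    }
    return paths;
}
```

Each generated path could then go through GetAssetPath and TriMesh::LoadOBJ2 in a loop, with the same error handling as the listing above.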

Building LOD Meshlets

The gist here is that we iterate over the mesh LODs and build meshlets for each LOD. We store the meshlet data in a set of combined arrays. For each LOD we also store the offset of its first meshlet and its meshlet count. Because each LOD’s data is appended after the previous LOD’s, we also need to adjust the meshlet data offsets so they point into the correct part of the combined arrays.

We keep the same recommended values as before: at most 64 vertices and 124 triangles per meshlet, with a cone weight of 0.

TriMesh::Aabb                meshBounds = meshLODs[0].GetBounds();
std::vector<float3>          combinedMeshPositions;
std::vector<meshopt_Meshlet> combinedMeshlets;
std::vector<uint32_t>        combinedMeshletVertices;
std::vector<uint8_t>         combinedMeshletTriangles;
std::vector<uint32_t>        meshlet_LOD_Offsets; // Offset of first meshlet of each LOD
std::vector<uint32_t>        meshlet_LOD_Counts;  // Count of meshlets of each LOD

for (size_t lodIdx = 0; lodIdx < meshLODs.size(); ++lodIdx) {
    const auto& mesh = meshLODs[lodIdx];

    const size_t kMaxVertices  = 64;
    const size_t kMaxTriangles = 124;
    const float  kConeWeight   = 0.0f;

    std::vector<meshopt_Meshlet> meshlets;
    std::vector<uint32_t>        meshletVertices;
    std::vector<uint8_t>         meshletTriangles;

    const size_t maxMeshlets = meshopt_buildMeshletsBound(mesh.GetNumIndices(), kMaxVertices, kMaxTriangles);

    meshlets.resize(maxMeshlets);
    meshletVertices.resize(maxMeshlets * kMaxVertices);
    meshletTriangles.resize(maxMeshlets * kMaxTriangles * 3);

    size_t meshletCount = meshopt_buildMeshlets(
        meshlets.data(),
        meshletVertices.data(),
        meshletTriangles.data(),
        reinterpret_cast<const uint32_t*>(mesh.GetTriangles().data()),
        mesh.GetNumIndices(),
        reinterpret_cast<const float*>(mesh.GetPositions().data()),
        mesh.GetNumVertices(),
        sizeof(float3),
        kMaxVertices,
        kMaxTriangles,
        kConeWeight);

    auto& last = meshlets[meshletCount - 1];
    meshletVertices.resize(last.vertex_offset + last.vertex_count);
    meshletTriangles.resize(last.triangle_offset + ((last.triangle_count * 3 + 3) & ~3));
    meshlets.resize(meshletCount);

    // Store offset of first meshlet and meshlet count for current LOD
    meshlet_LOD_Offsets.push_back(static_cast<uint32_t>(combinedMeshlets.size()));
    meshlet_LOD_Counts.push_back(static_cast<uint32_t>(meshlets.size()));

    // Adjustment offsets for current LOD
    const uint32_t vertexOffset          = static_cast<uint32_t>(combinedMeshPositions.size());
    const uint32_t meshletVertexOffset   = static_cast<uint32_t>(combinedMeshletVertices.size());
    const uint32_t meshletTriangleOffset = static_cast<uint32_t>(combinedMeshletTriangles.size());

    // Copy current LOD's vertex data to the combined positions array
    std::copy(mesh.GetPositions().begin(), mesh.GetPositions().end(), std::back_inserter(combinedMeshPositions));

    // Adjusts the vertex offset and triangle offset for current LOD
    for (auto meshlet : meshlets) {
        meshlet.vertex_offset += meshletVertexOffset;
        meshlet.triangle_offset += meshletTriangleOffset;
        combinedMeshlets.push_back(meshlet);
    }

    // Adjust the vertex indices for current LOD
    for (auto vertex : meshletVertices) {
        vertex += vertexOffset;
        combinedMeshletVertices.push_back(vertex);
    }

    std::copy(meshletTriangles.begin(), meshletTriangles.end(), std::back_inserter(combinedMeshletTriangles));
}
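To make the offset bookkeeping easier to see in isolation, here is a small sketch of the append step with meshopt taken out of the picture. Meshlet is a stand-in for meshopt_Meshlet with the same four fields, and CombinedLods and AppendLod are hypothetical names used only for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for meshopt_Meshlet (same four fields).
struct Meshlet {
    uint32_t vertex_offset;
    uint32_t triangle_offset;
    uint32_t vertex_count;
    uint32_t triangle_count;
};

// Hypothetical container for the combined arrays and per-LOD ranges.
struct CombinedLods {
    std::vector<Meshlet>  meshlets;
    std::vector<uint32_t> meshletVertices;
    std::vector<uint8_t>  meshletTriangles;
    std::vector<uint32_t> lodOffsets;        // first meshlet of each LOD
    std::vector<uint32_t> lodCounts;         // meshlet count of each LOD
    uint32_t              positionCount = 0; // vertices appended so far
};

// Appends one LOD's meshlet data, rebasing every offset and vertex index
// so that it points into the combined arrays.
void AppendLod(
    CombinedLods&                out,
    const std::vector<Meshlet>&  meshlets,
    const std::vector<uint32_t>& meshletVertices,
    const std::vector<uint8_t>&  meshletTriangles,
    uint32_t                     lodPositionCount)
{
    out.lodOffsets.push_back(static_cast<uint32_t>(out.meshlets.size()));
    out.lodCounts.push_back(static_cast<uint32_t>(meshlets.size()));

    const uint32_t vertexOffset          = out.positionCount;
    const uint32_t meshletVertexOffset   = static_cast<uint32_t>(out.meshletVertices.size());
    const uint32_t meshletTriangleOffset = static_cast<uint32_t>(out.meshletTriangles.size());

    for (Meshlet m : meshlets) { // copy on purpose, the input stays untouched
        m.vertex_offset += meshletVertexOffset;
        m.triangle_offset += meshletTriangleOffset;
        out.meshlets.push_back(m);
    }
    for (uint32_t v : meshletVertices) {
        out.meshletVertices.push_back(v + vertexOffset);
    }
    out.meshletTriangles.insert(
        out.meshletTriangles.end(), meshletTriangles.begin(), meshletTriangles.end());

    out.positionCount += lodPositionCount;
}
```

Three things get rebased: the meshlet’s offset into the vertex index array, its offset into the triangle array, and the vertex indices themselves, since each LOD brings its own positions. The triangle bytes are copied unchanged because they are local to each meshlet.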

Repacking

The only change to the repacking code is that we iterate over combinedMeshlets instead of meshlets. Everything else remains the same.

// Repack triangles from 3 consecutive bytes to 4-byte uint32_t to 
// make it easier to unpack on the GPU.
//
std::vector<uint32_t> meshletTrianglesU32;
for (auto& m : combinedMeshlets)
{
    // Save triangle offset for current meshlet
    uint32_t triangleOffset = static_cast<uint32_t>(meshletTrianglesU32.size());

    // Repack to uint32_t
    for (uint32_t i = 0; i < m.triangle_count; ++i)
    {
        uint32_t i0 = 3 * i + 0 + m.triangle_offset;
        uint32_t i1 = 3 * i + 1 + m.triangle_offset;
        uint32_t i2 = 3 * i + 2 + m.triangle_offset;

        uint8_t  vIdx0  = combinedMeshletTriangles[i0];
        uint8_t  vIdx1  = combinedMeshletTriangles[i1];
        uint8_t  vIdx2  = combinedMeshletTriangles[i2];
        uint32_t packed = ((static_cast<uint32_t>(vIdx0) & 0xFF) << 0) |
                            ((static_cast<uint32_t>(vIdx1) & 0xFF) << 8) |
                            ((static_cast<uint32_t>(vIdx2) & 0xFF) << 16);
        meshletTrianglesU32.push_back(packed);
    }

    // Update triangle offset for current meshlet
    m.triangle_offset = triangleOffset;
}
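As a quick sanity check on the packing, here is a standalone sketch of the pack step together with the matching unpack. PackTriangle mirrors the expression in the loop above. UnpackTriangle and TriangleIndices are hypothetical names that show the shifts and masks a shader would need to reverse it. They are not taken from the sample’s shader code.

```cpp
#include <cstdint>

// Packs three 8-bit meshlet-local vertex indices into one uint32_t:
// bits 0-7 = v0, bits 8-15 = v1, bits 16-23 = v2, bits 24-31 unused.
uint32_t PackTriangle(uint8_t v0, uint8_t v1, uint8_t v2)
{
    return (static_cast<uint32_t>(v0) << 0) |
           (static_cast<uint32_t>(v1) << 8) |
           (static_cast<uint32_t>(v2) << 16);
}

// Hypothetical result type for the unpack step.
struct TriangleIndices {
    uint32_t v0;
    uint32_t v1;
    uint32_t v2;
};

// Reverses PackTriangle with a shift and an 8-bit mask per index.
TriangleIndices UnpackTriangle(uint32_t packed)
{
    return {(packed >> 0) & 0xFF, (packed >> 8) & 0xFF, (packed >> 16) & 0xFF};
}
```

After repacking, triangle_offset counts whole triangles (one uint32_t each) rather than bytes, which is why the loop above rewrites it for every meshlet.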

Creating Buffers From meshopt Output

The positionBuffer, meshletBuffer, and meshletVerticesBuffer use the combined arrays combinedMeshPositions, combinedMeshlets, and combinedMeshletVertices, respectively.

MetalBuffer positionBuffer;
MetalBuffer meshletBuffer;
MetalBuffer meshletVerticesBuffer;
MetalBuffer meshletTrianglesBuffer;
MetalBuffer meshletBoundsBuffer;
{
    CHECK_CALL(CreateBuffer(renderer.get(), SizeInBytes(combinedMeshPositions), DataPtr(combinedMeshPositions), &positionBuffer));
    CHECK_CALL(CreateBuffer(renderer.get(), SizeInBytes(combinedMeshlets), DataPtr(combinedMeshlets), &meshletBuffer));
    CHECK_CALL(CreateBuffer(renderer.get(), SizeInBytes(combinedMeshletVertices), DataPtr(combinedMeshletVertices), &meshletVerticesBuffer));
    CHECK_CALL(CreateBuffer(renderer.get(), SizeInBytes(meshletTrianglesU32), DataPtr(meshletTrianglesU32), &meshletTrianglesBuffer));
    CHECK_CALL(CreateBuffer(renderer.get(), SizeInBytes(meshletBounds), DataPtr(meshletBounds), &meshletBoundsBuffer));
}

LOD Constant Data

We’ll store the LOD offsets and counts in the SceneProperties struct since InstanceCount and MeshletCount are already there.

Note that the D3D12 and Vulkan versions use uvec4 to store the offsets and counts, since array elements in constant data structs are aligned to 16 bytes there. Only the x component of each uvec4 is used. Metal, on the other hand, tightly packs arrays. Fun graphics API nuances.

// -----------------------------------------------------------------------------
// D3D12 and Vulkan
// -----------------------------------------------------------------------------
struct SceneProperties
{
    mat4        CameraVP;
    uint        InstanceCount;
    uint        MeshletCount;
    uint        __pad0[2];
    uvec4       Meshlet_LOD_Offsets[5]; // ** NEW **
    uvec4       Meshlet_LOD_Counts[5];  // ** NEW **
};

// -----------------------------------------------------------------------------
// Metal
// -----------------------------------------------------------------------------
//
// NOTE: Unlike D3D12 and Vulkan, it looks like Metal arrays are tightly
//       packed for 32-bit scalar types. This means that Meshlet_LOD_Offsets
//       and Meshlet_LOD_Counts are uint here instead of uint4/uvec4.
//
struct SceneProperties
{
    float4x4    CameraVP;
    uint        InstanceCount;
    uint        MeshletCount;
    uint        Meshlet_LOD_Offsets[5];  // ** NEW **
    uint        Meshlet_LOD_Counts[5];   // ** NEW **
    uint        __pad1[2];               // Make struct size aligned to 16
};
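If you want to see the layout difference on the CPU side, a few offsetof checks make it concrete. This is only a sketch: UVec4 stands in for uvec4, a plain float[16] stands in for the matrix type, and the struct names are made up for this example. The real matrix types may have stricter alignment, so treat the numbers as illustrative of the array stride rather than of the sample’s exact struct sizes.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for uvec4: four tightly packed 32-bit values (16 bytes).
struct UVec4 {
    uint32_t x, y, z, w;
};

// D3D12/Vulkan style: one 16-byte element per LOD entry.
struct ScenePropertiesPadded {
    float    CameraVP[16];
    uint32_t InstanceCount;
    uint32_t MeshletCount;
    uint32_t pad0[2];
    UVec4    Meshlet_LOD_Offsets[5];
    UVec4    Meshlet_LOD_Counts[5];
};

// Metal style: one 4-byte element per LOD entry.
struct ScenePropertiesPacked {
    float    CameraVP[16];
    uint32_t InstanceCount;
    uint32_t MeshletCount;
    uint32_t Meshlet_LOD_Offsets[5];
    uint32_t Meshlet_LOD_Counts[5];
};

// Padded layout: the arrays start at 80 and are 5 * 16 = 80 bytes each.
static_assert(sizeof(UVec4) == 16, "UVec4 must be 16 bytes");
static_assert(offsetof(ScenePropertiesPadded, Meshlet_LOD_Offsets) == 80, "unexpected offset");
static_assert(offsetof(ScenePropertiesPadded, Meshlet_LOD_Counts) == 160, "unexpected offset");

// Packed layout: the arrays start at 72 and are 5 * 4 = 20 bytes each.
static_assert(offsetof(ScenePropertiesPacked, Meshlet_LOD_Offsets) == 72, "unexpected offset");
static_assert(offsetof(ScenePropertiesPacked, Meshlet_LOD_Counts) == 92, "unexpected offset");
```

Checks like these are cheap to keep next to the struct definitions, and they fail at compile time if someone reorders a field.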

Updating Scene Constant Data

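We add the necessary code to update the constant data using the offsets and counts we stored earlier. MeshletCount is set to LOD 0’s meshlet count, which should be the largest since LOD 0 is the most detailed. The amplification shader relies on this when it splits the thread index into an instance index and a meshlet index.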

Note the minor differences between the D3D12/Vulkan updates and the Metal updates.

// -----------------------------------------------------------------------------
// D3D12 and Vulkan
// -----------------------------------------------------------------------------
scene.MeshletCount             = meshlet_LOD_Counts[0];
scene.Meshlet_LOD_Offsets[0].x = meshlet_LOD_Offsets[0];
scene.Meshlet_LOD_Offsets[1].x = meshlet_LOD_Offsets[1];
scene.Meshlet_LOD_Offsets[2].x = meshlet_LOD_Offsets[2];
scene.Meshlet_LOD_Offsets[3].x = meshlet_LOD_Offsets[3];
scene.Meshlet_LOD_Offsets[4].x = meshlet_LOD_Offsets[4];
scene.Meshlet_LOD_Counts[0].x  = meshlet_LOD_Counts[0];
scene.Meshlet_LOD_Counts[1].x  = meshlet_LOD_Counts[1];
scene.Meshlet_LOD_Counts[2].x  = meshlet_LOD_Counts[2];
scene.Meshlet_LOD_Counts[3].x  = meshlet_LOD_Counts[3];
scene.Meshlet_LOD_Counts[4].x  = meshlet_LOD_Counts[4];

// -----------------------------------------------------------------------------
// Metal
// -----------------------------------------------------------------------------
scene.MeshletCount           = meshlet_LOD_Counts[0];
scene.Meshlet_LOD_Offsets[0] = meshlet_LOD_Offsets[0];
scene.Meshlet_LOD_Offsets[1] = meshlet_LOD_Offsets[1];
scene.Meshlet_LOD_Offsets[2] = meshlet_LOD_Offsets[2];
scene.Meshlet_LOD_Offsets[3] = meshlet_LOD_Offsets[3];
scene.Meshlet_LOD_Offsets[4] = meshlet_LOD_Offsets[4];
scene.Meshlet_LOD_Counts[0]  = meshlet_LOD_Counts[0];
scene.Meshlet_LOD_Counts[1]  = meshlet_LOD_Counts[1];
scene.Meshlet_LOD_Counts[2]  = meshlet_LOD_Counts[2];
scene.Meshlet_LOD_Counts[3]  = meshlet_LOD_Counts[3];
scene.Meshlet_LOD_Counts[4]  = meshlet_LOD_Counts[4];
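The unrolled assignments are fine for 5 LODs, but the same update can be written as a loop. Below is a sketch of the D3D12/Vulkan flavor. SceneLodData, UVec4, kNumLods, and UpdateLodData are hypothetical names for this illustration, and the struct holds only the fields the update touches.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kNumLods = 5;

// Stand-in for uvec4.
struct UVec4 {
    uint32_t x, y, z, w;
};

// Only the LOD-related fields of the scene constants (hypothetical subset).
struct SceneLodData {
    uint32_t MeshletCount;
    UVec4    Meshlet_LOD_Offsets[kNumLods];
    UVec4    Meshlet_LOD_Counts[kNumLods];
};

// Copies the per-LOD offsets and counts into the x components.
// Returns false if the inputs do not have exactly kNumLods entries.
bool UpdateLodData(
    SceneLodData&                scene,
    const std::vector<uint32_t>& offsets,
    const std::vector<uint32_t>& counts)
{
    if (offsets.size() != kNumLods || counts.size() != kNumLods) {
        return false;
    }
    scene.MeshletCount = counts[0]; // LOD 0's count, as in the listing above
    for (std::size_t i = 0; i < kNumLods; ++i) {
        scene.Meshlet_LOD_Offsets[i].x = offsets[i];
        scene.Meshlet_LOD_Counts[i].x  = counts[i];
    }
    return true;
}
```

For the Metal flavor the loop body would assign to Meshlet_LOD_Offsets[i] and Meshlet_LOD_Counts[i] directly, without the .x.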

LOD Instances

We’ll use the same instancing code from the earlier samples to store the model transform matrix. For 115_mesh_shader_lod, we’ll have 5 instances - one for each LOD.

const uint32_t        kNumInstanceCols = 1;
const uint32_t        kNumInstanceRows = 5;
std::vector<float4x4> instances(kNumInstanceCols * kNumInstanceRows);

Instance Positions

We hard code some positions for each LOD instance. LOD 0 is closest to the camera and LOD 4 is furthest away.

// Update instance transforms
{
    float maxSpan       = std::max<float>(meshBounds.Width(), meshBounds.Depth());
    float instanceSpanX = 4.0f * maxSpan;
    float instanceSpanZ = 4.5f * maxSpan;
    float totalSpanX    = kNumInstanceCols * instanceSpanX;
    float totalSpanZ    = kNumInstanceRows * instanceSpanZ;

    float t = static_cast<float>(glfwGetTime());

    // 0
    {
        float3 P     = float3(0, 0, -static_cast<float>(0 * instanceSpanZ));
        instances[0] = glm::translate(P) * glm::rotate(t, float3(0, 1, 0));
    }

    // 1
    {
        float3 P     = float3(0, 0, -static_cast<float>(0.75f * instanceSpanZ));
        instances[1] = glm::translate(P) * glm::rotate(t, float3(0, 1, 0));
    }

    // 2
    {
        float3 P     = float3(0, 0, -static_cast<float>(2.5 * instanceSpanZ));
        instances[2] = glm::translate(P) * glm::rotate(t, float3(0, 1, 0));
    }

    // 3
    {
        float3 P     = float3(0, 0, -static_cast<float>(8 * instanceSpanZ));
        instances[3] = glm::translate(P) * glm::rotate(t, float3(0, 1, 0));
    }

    // 4
    {
        float3 P     = float3(0, 0, -static_cast<float>(40 * instanceSpanZ));
        instances[4] = glm::translate(P) * glm::rotate(t, float3(0, 1, 0));
    }
}
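The five blocks only differ in the multiplier applied to instanceSpanZ, so the placement can also be described with a small table. A sketch of that, where kLodDepthScale and InstanceZ are hypothetical names and the multipliers are the ones hard coded above:

```cpp
#include <array>
#include <cstddef>

// Depth multipliers for LOD 0..4, copied from the listing above.
constexpr std::array<float, 5> kLodDepthScale = {0.0f, 0.75f, 2.5f, 8.0f, 40.0f};

// Returns the Z coordinate for the instance that shows the given LOD.
// Negative Z moves the instance away from the camera in this sample's setup.
float InstanceZ(std::size_t lod, float instanceSpanZ)
{
    return -(kLodDepthScale.at(lod) * instanceSpanZ);
}
```

With this, the per-instance transform becomes a single loop that builds float3(0, 0, InstanceZ(i, instanceSpanZ)) and applies the same translate and rotate as before.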

That should do it for the C++ code. Let’s move on to the amplification shader.

Amplification Shader

We only need to make 2 small changes to the amplification shader to support LODs:

  1. Add the LOD information to the SceneProperties struct.
  2. Update amplification shader body to make use of the LOD information.

Add LOD Info To SceneProperties

Add Meshlet_LOD_Offsets and Meshlet_LOD_Counts arrays to the SceneProperties struct.

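The HLSL and MSL code is identical in this case. The source is the same, but the memory layout is not: as described earlier, each array element takes 16 bytes on D3D12 and Vulkan, while Metal packs the arrays tightly.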

struct SceneProperties {
    float4x4    CameraVP;
    uint        InstanceCount;
    uint        MeshletCount;
    uint        Meshlet_LOD_Offsets[5]; // ** NEW **
    uint        Meshlet_LOD_Counts[5];  // ** NEW **
};

Select LOD

For 115_mesh_shader_lod, we’re going to use the instance index to select the LOD. Instance 0 will use LOD 0, instance 1 will use LOD 1, and so on. You get the idea.

Once we’ve selected the LOD, we check to make sure that meshletIndex is within bounds of the meshlet count for the current LOD.

If meshletIndex is within the current LOD’s meshlet count, we then adjust it by the offset to the first meshlet for the current LOD. This puts meshletIndex at the correct place for the meshlet we want to draw.

Everything else remains the same. Pretty easy, huh :)

HLSL for D3D12 and Vulkan

[numthreads(AS_GROUP_SIZE, 1, 1)]
void asmain(
    uint gtid : SV_GroupThreadID,
    uint dtid : SV_DispatchThreadID,
    uint gid  : SV_GroupID
)
{
    bool visible = false;

    uint instanceIndex = dtid / Scene.MeshletCount;
    uint meshletIndex  = dtid % Scene.MeshletCount;

    if (instanceIndex < Scene.InstanceCount){
        uint lod             = instanceIndex;                 // Use instance index for LOD
        uint lodMeshletCount = Scene.Meshlet_LOD_Counts[lod]; // Get LOD's meshlet count

        if (meshletIndex < lodMeshletCount) {
            // Adjust meshletIndex so it refers to a meshlet in the current LOD
            meshletIndex += Scene.Meshlet_LOD_Offsets[lod];
           
            // Assuming visible, no culling here
            visible = true;
        }
    }

    if (visible) {
        uint index = WavePrefixCountBits(visible);
        sPayload.InstanceIndices[index] = instanceIndex;
        sPayload.MeshletIndices[index]  = meshletIndex;
    }
    
    uint visibleCount = WaveActiveCountBits(visible);    
    DispatchMesh(visibleCount, 1, 1, sPayload); 
}

MSL for Metal

[[object]]
void objectMain(
    constant SceneProperties&  Scene         [[buffer(0)]],
    device const float4*       MeshletBounds [[buffer(1)]],    
    device const Instance*     Instances     [[buffer(2)]],
    uint                       gtid          [[thread_position_in_threadgroup]],
    uint                       dtid          [[thread_position_in_grid]],
    object_data Payload&       outPayload    [[payload]],
    mesh_grid_properties       outGrid)
{
    uint visible = 0;

    uint instanceIndex = dtid / Scene.MeshletCount;
    uint meshletIndex  = dtid % Scene.MeshletCount;
   
    if (instanceIndex < Scene.InstanceCount) {
        uint lod             = instanceIndex;
        uint lodMeshletCount = Scene.Meshlet_LOD_Counts[lod];

        if (meshletIndex < lodMeshletCount) {
            meshletIndex += Scene.Meshlet_LOD_Offsets[lod];

            // Assuming visible, no culling here
            visible = 1;
        }
    }

    if (visible) {
        uint index = simd_prefix_exclusive_sum(visible);
        outPayload.InstanceIndices[index] = instanceIndex;
        outPayload.MeshletIndices[index]  = meshletIndex;
    }

    // Assumes all meshlets are visible
    uint visibleCount = simd_sum(visible);
    outGrid.set_threadgroups_per_grid(uint3(visibleCount, 1, 1));
}
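If the index math in the two shaders feels abstract, it can help to run it on the CPU. The sketch below emulates one threadgroup: it splits each thread index into an instance index and a meshlet index, uses the instance index as the LOD, applies the bounds check, and then adds the LOD’s offset. SelectMeshlets and Selection are hypothetical names. Appending to a vector plays the role that the wave prefix count plays in the shaders, which is an approximation and not how the GPU actually does it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One payload entry: which instance and which meshlet to draw.
struct Selection {
    uint32_t instanceIndex;
    uint32_t meshletIndex;
};

// CPU emulation of one amplification/object threadgroup.
// meshletCount is LOD 0's meshlet count (Scene.MeshletCount in the shaders).
std::vector<Selection> SelectMeshlets(
    uint32_t                     firstThread,
    uint32_t                     groupSize,
    uint32_t                     instanceCount,
    uint32_t                     meshletCount,
    const std::vector<uint32_t>& lodOffsets,
    const std::vector<uint32_t>& lodCounts)
{
    std::vector<Selection> payload;
    if (meshletCount == 0) {
        return payload; // avoid dividing by zero
    }
    for (uint32_t t = 0; t < groupSize; ++t) {
        const uint32_t dtid          = firstThread + t;
        const uint32_t instanceIndex = dtid / meshletCount;
        uint32_t       meshletIndex  = dtid % meshletCount;

        if (instanceIndex >= instanceCount) {
            continue;
        }
        const std::size_t lod = instanceIndex; // instance index selects the LOD
        if (lod >= lodCounts.size() || lod >= lodOffsets.size()) {
            continue; // extra guard that the shaders do not need
        }
        if (meshletIndex >= lodCounts[lod]) {
            continue; // this LOD has fewer meshlets than LOD 0
        }
        meshletIndex += lodOffsets[lod]; // rebase into the combined array
        payload.push_back({instanceIndex, meshletIndex});
    }
    return payload;
}
```

For example, with 3 LODs whose counts are 4, 2, and 1, twelve threads produce seven payload entries. The other five threads fall past the end of their LOD and are dropped by the bounds check.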

Mesh Shader Changes

There aren’t any mesh shader changes for this post. Hope it’s not too disappointing. We’ll make up for it soon.

Rendered Image

The 115_mesh_shader_lod sample renders 5 instances of the horse statue at 5 different LODs.