Andrzej Więckowski, Ph.D. — personal webpage

Exploring Burst 1: `imin`

Burst is a great and powerful tool that enables low-level optimizations while keeping your code readable and maintainable. I hope this marks the beginning of a series of blog posts in which I’ll share insights into how compiler can enhance performance through vectorization of mathematical utilities. I’m not aiming to measure performance explicitly, but rather to examine the generated assembly code.

`imin(float4)`

Unity.Mathematics offers a wide range of helper functions for mathematical operations. The API includes overloads for many vector types such as float2/3/4, float2x2/3x3/4x4, quaternions, and more. It provides min(float4, float4) for computing the component-wise minimum of two float4 vectors, and cmin(float4) to obtain the smallest component of a vector. However, it lacks an imin(float4) function to return the index of the smallest component—a utility that can be quite useful.

In this post, we’ll explore how the Burst compiler handles various implementations of the imin(float4) function. The following assembly code samples were generated using Unity.Burst@1.8.21. Note that results may vary with different package versions.

Let’s begin with the simplest, most straightforward implementation: iterating through all components and selecting the index of the minimum value.

[BurstCompile]
public static int imin0(ref float4 v)
{
    var (id, min) = (0, v[0]);
    for (int i = 1; i < 4; i++)
    {
        (id, min) = v[i] < min ? (i, v[i]) : (id, min);
    }
    return id;
}

Here is the compiled code:

        # imin0 imin1
        vmovss            xmm0, dword ptr [rcx]
        vmovss            xmm1, dword ptr [rcx + 4]
        xor               eax, eax
        vucomiss          xmm0, xmm1
        seta              al
        vminss            xmm0, xmm1, xmm0
        vmovss            xmm1, dword ptr [rcx + 8]
        vucomiss          xmm0, xmm1
        vminss            xmm0, xmm1, xmm0
        mov               edx, 2
        cmovbe            edx, eax
        vucomiss          xmm0, dword ptr [rcx + 12]
        mov               eax, 3
        cmovbe            eax, edx

Surprisingly, explicitly checking each component yields the same assembly code:

[BurstCompile]
public static int imin1(ref float4 v)
{
    var (id, min) = (0, v.x);
    (id, min) = v.y < min ? (1, v.y) : (id, min);
    (id, min) = v.z < min ? (2, v.z) : (id, min);
    (id, min) = v.w < min ? (3, v.w) : (id, min);
    return id;
}

Another approach is to first calculate the minimum value, then compare components to determine the matching index:

[BurstCompile]
public static int imin2(ref float4 v)
{
    var min = math.cmin(v);
    if (v.x == min) return 0;
    if (v.y == min) return 1;
    if (v.z == min) return 2;
    return 3;
}

As expected, this generates more complex assembly, including several jumps:

        # imin2
        vmovss            xmm1, dword ptr [rcx + 12]
        vmovss            xmm2, dword ptr [rcx + 8]
        vmovss            xmm3, dword ptr [rcx + 4]
        vmovups           xmm0, xmmword ptr [rcx]
        vminss            xmm4, xmm3, xmm0
        vcmpunordss        xmm5, xmm0, xmm0
        vblendvps         xmm3, xmm4, xmm3, xmm5
        vcmpunordss        xmm4, xmm3, xmm3
        vminss            xmm3, xmm2, xmm3
        vblendvps         xmm2, xmm3, xmm2, xmm4
        vcmpunordss        xmm3, xmm2, xmm2
        vminss            xmm2, xmm1, xmm2
        vblendvps         xmm1, xmm2, xmm1, xmm3
        xor               eax, eax
        vucomiss          xmm0, xmm1
        jne               .LBB4_1
        jp                .LBB4_1
.Ltmp1:
.LBB4_3:
        pop               rbp
        ret
.LBB4_1:
.Ltmp2:
        vmovshdup         xmm2, xmm0
        mov               eax, 1
        vucomiss          xmm2, xmm1
        jne               .LBB4_2
        jnp               .LBB4_3
.LBB4_2:
        vshufpd           xmm0, xmm0, xmm0, 1
        vcmpeqss          xmm0, xmm0, xmm1
        vmovd             eax, xmm0
        and               eax, 1
        xor               eax, 3

By using a component-wise equality check ==, we can reduce the number of jumps:

[BurstCompile]
public static int imin3(ref float4 v)
{
    var r = math.cmin(v) == v;
    if (r.x) return 0;
    if (r.y) return 1;
    if (r.z) return 2;
    return 3;
}

The generated assembly is still large, but it’s more branchless compared to imin2:

        # imin3 imin4
        vmovss            xmm0, dword ptr [rcx + 12]
        vmovss            xmm1, dword ptr [rcx + 8]
        vmovss            xmm2, dword ptr [rcx + 4]
        vmovups           xmm3, xmmword ptr [rcx]
        vminss            xmm4, xmm2, xmm3
        vcmpunordss        xmm5, xmm3, xmm3
        vblendvps         xmm2, xmm4, xmm2, xmm5
        vcmpunordss        xmm4, xmm2, xmm2
        vminss            xmm2, xmm1, xmm2
        vblendvps         xmm1, xmm2, xmm1, xmm4
        vcmpunordss        xmm2, xmm1, xmm1
        vminss            xmm1, xmm0, xmm1
        vblendvps         xmm0, xmm1, xmm0, xmm2
        vbroadcastss        xmm0, xmm0
        vcmpeqps          xmm0, xmm0, xmm3
        vpackssdw         xmm0, xmm0, xmm0
        vpacksswb         xmm0, xmm0, xmm0
        vpand             xmm0, xmm0, xmmword ptr [rip + __xmm@00000000000000000000000001010101]
        vmovd             ecx, xmm0
        xor               eax, eax
        test              cl, cl
        je                .LBB4_1
.Ltmp1:
.LBB4_3:
        pop               rbp
        ret
.LBB4_1:
.Ltmp2:
        vpextrb           ecx, xmm0, 1
        mov               eax, 1
        test              ecx, ecx
        jne               .LBB4_3
        vpextrb           ecx, xmm0, 2
        xor               eax, eax
        cmp               cl, 1
        adc               eax, 2

The same behavior can be achieved using C# pattern matching:

[BurstCompile]
public static int imin4(ref float4 v)
{
    var r = math.cmin(v) == v;
    return (r.x, r.y, r.z, r.w) switch
    {
        (true, _, _, _) => 0,
        (_, true, _, _) => 1,
        (_, _, true, _) => 2,
        _ => 3,
    };
}

To better leverage modern CPU instruction sets (SSE/AVX), we can break the problem into smaller parts using a divide-and-conquer approach:

[BurstCompile]
public static int imin5(ref float4 v)
{
    var id0 = v.x <= v.y ? 0 : 1;
    var id1 = v.z <= v.w ? 2 : 3;
    return v[id0] <= v[id1] ? id0 : id1;
}

Note: If you don’t care about the exact index in case of duplicate minimums, you can use < instead of <=. In practice, exact duplicates in a float vector are rare, and returning any correct index is typically acceptable.

This approach already results in fewer instructions than imin0 or imin1.

        # imin5
        vmovups           xmm0, xmmword ptr [rcx]
        vmovss            xmm1, dword ptr [rcx + 4]
        xor               edx, edx
        vucomiss          xmm1, xmm0
        setb              dl
        vbroadcastss        xmm1, dword ptr [rcx + 12]
        vcmpnleps         xmm0, xmm0, xmm1
        vextractps        r8d, xmm0, 2
        mov               eax, 2
        sub               eax, r8d
        vmovss            xmm0, dword ptr [rcx + 4*rax]
        vucomiss          xmm0, dword ptr [rcx + 4*rdx]
        cmovae            eax, edx

Finally, we can fully vectorize imin5 using math.select:

[BurstCompile]
public static int imin6(ref float4 v)
{
    var id = math.select(math.int2(1, 3), math.int2(0, 2), v.xz <= v.yw);
    return v[id[0]] <= v[id[1]] ? id[0] : id[1];
}

This results in the shortest and most optimized assembly, utilizing several SIMD instructions.

        # imin6
        vmovups           xmm0, xmmword ptr [rcx]
        vshufps           xmm1, xmm0, xmm0, 232
        vshufps           xmm0, xmm0, xmm0, 237
        vcmpleps          xmm0, xmm1, xmm0
        vpaddd            xmm0, xmm0, xmmword ptr [rip + __xmm@00000000000000000000000300000001]
        vmovd             eax, xmm0
        vpextrd           edx, xmm0, 1
        vmovss            xmm0, dword ptr [rcx + 4*rdx]
        vucomiss          xmm0, dword ptr [rcx + 4*rax]
        cmovb             eax, edx

Summary

To summarize, imin6 leverages SIMD for vectorized comparisons, which can significantly reduce latency by processing multiple elements in parallel. This method is especially efficient on modern x86-64 CPUs with SSE or AVX support. Keep in mind that performance gains depend on CPU architecture and may vary across different platforms. When optimizing critical code paths in your game or simulation, it’s a good idea to take advantage of utilities from the math library.

function	#instructions	#jumps
`imin0`, `imin1`	14	0
`imin2`	29	4
`imin3`, `imin4`	32	2
`imin5`	13	0
`imin6`	10	0

[Top]

Exploring Burst 1: imin

imin(float4)

Summary

Exploring Burst 1: `imin`

`imin(float4)`