Exploring Burst 1: imin
Burst is a great and powerful tool that enables low-level optimizations while keeping your code readable and maintainable. I hope this marks the beginning of a series of blog posts in which I’ll share insights into how compiler can enhance performance through vectorization of mathematical utilities. I’m not aiming to measure performance explicitly, but rather to examine the generated assembly code.
imin(float4)
Unity.Mathematics offers a wide range of helper functions for mathematical operations. The API includes overloads for many vector types such as float2/3/4, float2x2/3x3/4x4, quaternions, and more. It provides min(float4, float4) for computing the component-wise minimum of two float4 vectors, and cmin(float4) to obtain the smallest component of a vector. However, it lacks an imin(float4) function to return the index of the smallest component—a utility that can be quite useful.
In this post, we’ll explore how the Burst compiler handles various implementations of the imin(float4) function. The following assembly code samples were generated using Unity.Burst@1.8.21. Note that results may vary with different package versions.
Let’s begin with the simplest, most straightforward implementation: iterating through all components and selecting the index of the minimum value.
[BurstCompile]
public static int imin0(ref float4 v)
{
var (id, min) = (0, v[0]);
for (int i = 1; i < 4; i++)
{
(id, min) = v[i] < min ? (i, v[i]) : (id, min);
}
return id;
}
Here is the compiled code:
# imin0 imin1
vmovss xmm0, dword ptr [rcx]
vmovss xmm1, dword ptr [rcx + 4]
xor eax, eax
vucomiss xmm0, xmm1
seta al
vminss xmm0, xmm1, xmm0
vmovss xmm1, dword ptr [rcx + 8]
vucomiss xmm0, xmm1
vminss xmm0, xmm1, xmm0
mov edx, 2
cmovbe edx, eax
vucomiss xmm0, dword ptr [rcx + 12]
mov eax, 3
cmovbe eax, edx
Surprisingly, explicitly checking each component yields the same assembly code:
[BurstCompile]
public static int imin1(ref float4 v)
{
var (id, min) = (0, v.x);
(id, min) = v.y < min ? (1, v.y) : (id, min);
(id, min) = v.z < min ? (2, v.z) : (id, min);
(id, min) = v.w < min ? (3, v.w) : (id, min);
return id;
}
Another approach is to first calculate the minimum value, then compare components to determine the matching index:
[BurstCompile]
public static int imin2(ref float4 v)
{
var min = math.cmin(v);
if (v.x == min) return 0;
if (v.y == min) return 1;
if (v.z == min) return 2;
return 3;
}
As expected, this generates more complex assembly, including several jumps:
# imin2
vmovss xmm1, dword ptr [rcx + 12]
vmovss xmm2, dword ptr [rcx + 8]
vmovss xmm3, dword ptr [rcx + 4]
vmovups xmm0, xmmword ptr [rcx]
vminss xmm4, xmm3, xmm0
vcmpunordss xmm5, xmm0, xmm0
vblendvps xmm3, xmm4, xmm3, xmm5
vcmpunordss xmm4, xmm3, xmm3
vminss xmm3, xmm2, xmm3
vblendvps xmm2, xmm3, xmm2, xmm4
vcmpunordss xmm3, xmm2, xmm2
vminss xmm2, xmm1, xmm2
vblendvps xmm1, xmm2, xmm1, xmm3
xor eax, eax
vucomiss xmm0, xmm1
jne .LBB4_1
jp .LBB4_1
.Ltmp1:
.LBB4_3:
pop rbp
ret
.LBB4_1:
.Ltmp2:
vmovshdup xmm2, xmm0
mov eax, 1
vucomiss xmm2, xmm1
jne .LBB4_2
jnp .LBB4_3
.LBB4_2:
vshufpd xmm0, xmm0, xmm0, 1
vcmpeqss xmm0, xmm0, xmm1
vmovd eax, xmm0
and eax, 1
xor eax, 3
By using a component-wise equality check ==, we can reduce the number of jumps:
[BurstCompile]
public static int imin3(ref float4 v)
{
var r = math.cmin(v) == v;
if (r.x) return 0;
if (r.y) return 1;
if (r.z) return 2;
return 3;
}
The generated assembly is still large, but it’s more branchless compared to imin2:
# imin3 imin4
vmovss xmm0, dword ptr [rcx + 12]
vmovss xmm1, dword ptr [rcx + 8]
vmovss xmm2, dword ptr [rcx + 4]
vmovups xmm3, xmmword ptr [rcx]
vminss xmm4, xmm2, xmm3
vcmpunordss xmm5, xmm3, xmm3
vblendvps xmm2, xmm4, xmm2, xmm5
vcmpunordss xmm4, xmm2, xmm2
vminss xmm2, xmm1, xmm2
vblendvps xmm1, xmm2, xmm1, xmm4
vcmpunordss xmm2, xmm1, xmm1
vminss xmm1, xmm0, xmm1
vblendvps xmm0, xmm1, xmm0, xmm2
vbroadcastss xmm0, xmm0
vcmpeqps xmm0, xmm0, xmm3
vpackssdw xmm0, xmm0, xmm0
vpacksswb xmm0, xmm0, xmm0
vpand xmm0, xmm0, xmmword ptr [rip + __xmm@00000000000000000000000001010101]
vmovd ecx, xmm0
xor eax, eax
test cl, cl
je .LBB4_1
.Ltmp1:
.LBB4_3:
pop rbp
ret
.LBB4_1:
.Ltmp2:
vpextrb ecx, xmm0, 1
mov eax, 1
test ecx, ecx
jne .LBB4_3
vpextrb ecx, xmm0, 2
xor eax, eax
cmp cl, 1
adc eax, 2
The same behavior can be achieved using C# pattern matching:
[BurstCompile]
public static int imin4(ref float4 v)
{
var r = math.cmin(v) == v;
return (r.x, r.y, r.z, r.w) switch
{
(true, _, _, _) => 0,
(_, true, _, _) => 1,
(_, _, true, _) => 2,
_ => 3,
};
}
To better leverage modern CPU instruction sets (SSE/AVX), we can break the problem into smaller parts using a divide-and-conquer approach:
[BurstCompile]
public static int imin5(ref float4 v)
{
var id0 = v.x <= v.y ? 0 : 1;
var id1 = v.z <= v.w ? 2 : 3;
return v[id0] <= v[id1] ? id0 : id1;
}
Note: If you don’t care about the exact index in case of duplicate minimums, you can use < instead of <=. In practice, exact duplicates in a float vector are rare, and returning any correct index is typically acceptable.
This approach already results in fewer instructions than imin0 or imin1.
# imin5
vmovups xmm0, xmmword ptr [rcx]
vmovss xmm1, dword ptr [rcx + 4]
xor edx, edx
vucomiss xmm1, xmm0
setb dl
vbroadcastss xmm1, dword ptr [rcx + 12]
vcmpnleps xmm0, xmm0, xmm1
vextractps r8d, xmm0, 2
mov eax, 2
sub eax, r8d
vmovss xmm0, dword ptr [rcx + 4*rax]
vucomiss xmm0, dword ptr [rcx + 4*rdx]
cmovae eax, edx
Finally, we can fully vectorize imin5 using math.select:
[BurstCompile]
public static int imin6(ref float4 v)
{
var id = math.select(math.int2(1, 3), math.int2(0, 2), v.xz <= v.yw);
return v[id[0]] <= v[id[1]] ? id[0] : id[1];
}
This results in the shortest and most optimized assembly, utilizing several SIMD instructions.
# imin6
vmovups xmm0, xmmword ptr [rcx]
vshufps xmm1, xmm0, xmm0, 232
vshufps xmm0, xmm0, xmm0, 237
vcmpleps xmm0, xmm1, xmm0
vpaddd xmm0, xmm0, xmmword ptr [rip + __xmm@00000000000000000000000300000001]
vmovd eax, xmm0
vpextrd edx, xmm0, 1
vmovss xmm0, dword ptr [rcx + 4*rdx]
vucomiss xmm0, dword ptr [rcx + 4*rax]
cmovb eax, edx
Summary
To summarize, imin6 leverages SIMD for vectorized comparisons, which can significantly reduce latency by processing multiple elements in parallel. This method is especially efficient on modern x86-64 CPUs with SSE or AVX support. Keep in mind that performance gains depend on CPU architecture and may vary across different platforms. When optimizing critical code paths in your game or simulation, it’s a good idea to take advantage of utilities from the math library.
function | #instructions | #jumps |
---|---|---|
imin0, imin1 | 14 | 0 |
imin2 | 29 | 4 |
imin3, imin4 | 32 | 2 |
imin5 | 13 | 0 |
imin6 | 10 | 0 |