DirectXMath AVX and AVX2 - A Coda
Over the years, I’ve done a number of optimizations for DirectXMath using advanced instruction sets available on x86/x64 CPUs. For Xbox developers, making the choice to use these is very easy since you can count on them along with AVX. For PC developers, modern x64 development means you can rely on SSE, SSE2, and at this point SSE3, without sacrificing any target market. I’ve recently done some work for another project unrelated to DirectXMath per se, but I wanted to add some notes about using other advanced instruction sets.
The original blog series that summed up the advanced instructions applicable for DirectXMath are:
ABM
ABM (Advanced Bit Manipulation) was an instruction set originally introduced by AMD. It includes LZCNT (leading-zero count) and POPCNT (population count). After some back and forth with Intel over this and other instruction set extensions at the time, these are both supported by AMD and Intel, but you need to check more than one bit in CPUID: ABM indicates LZCNT and POPCNT indicates the population count instruction is supported.
For more on the convoluted history here, see Wikipedia.
Generally, if the PC you are using supports AVX2, it will support both of these instructions. The Visual C++ implementation of the C++20 Standard Library header <bit> will therefore use LZCNT and POPCNT when building with /arch:AVX2 if you use std::popcount and/or std::countl_zero.
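As a rough illustration of the point above, here is a minimal sketch that wraps the two C++20 <bit> functions. The helper names CountSetBits, CountLeadingZeros, and BitCountExample are made up for this example, and the loop-based branch is only an assumed fallback for toolsets without the C++20 library. Whether the compiler actually emits POPCNT or LZCNT depends on the compiler and the /arch (or equivalent) switch in use.

```cpp
#include <bit>       // std::popcount, std::countl_zero (C++20)
#include <cassert>
#include <cstdint>

#if defined(__cpp_lib_bitops)
// C++20 path: the standard library chooses the implementation.
inline int CountSetBits(uint32_t v)      { return std::popcount(v); }
inline int CountLeadingZeros(uint32_t v) { return std::countl_zero(v); }
#else
// Assumed fallback for pre-C++20 toolsets: plain loops, no special instructions.
inline int CountSetBits(uint32_t v)
{
    int n = 0;
    for (; v != 0; v &= v - 1)   // each pass clears the lowest set bit
        ++n;
    return n;
}

inline int CountLeadingZeros(uint32_t v)
{
    int n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (v & bit) == 0; bit >>= 1)
        ++n;
    return n;                    // 32 when v == 0, matching std::countl_zero
}
#endif

// Hypothetical usage: both paths give the same answers.
inline void BitCountExample()
{
    assert(CountSetBits(0xF0F0u) == 8);
    assert(CountLeadingZeros(1u) == 31);
    assert(CountLeadingZeros(0u) == 32);
}
```

Note that a zero input is well defined here (the count is simply the bit width), which is one reason to prefer the standard functions over raw intrinsics.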
BMI1
BMI (Bit Manipulation Instruction) adds some interesting new instructions like ANDN (logical AND NOT) and BEXTR (bit field extract) that can be useful for compiler code-generation when using /arch:AVX2. The TZCNT instruction is also used to implement C++20 std::countr_zero.
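To make the semantics of those two instructions concrete, here is a small sketch of the plain C++ expressions they correspond to. The function names AndNot, ExtractBits, and Bmi1Example are hypothetical, and nothing here guarantees that a given compiler will pick ANDN or BEXTR for this code; that is a code-generation decision that depends on the compiler, the optimization level, and the target architecture switch.

```cpp
#include <cassert>
#include <cstdint>

// ANDN computes (~a) & b: clear in 'value' every bit that is set in 'mask'.
inline uint32_t AndNot(uint32_t mask, uint32_t value)
{
    return ~mask & value;
}

// BEXTR-style extract: 'len' bits of 'value' starting at bit 'start'.
// Out-of-range requests return 0 and avoid undefined shifts.
inline uint32_t ExtractBits(uint32_t value, unsigned start, unsigned len)
{
    if (start >= 32 || len == 0)
        return 0;
    const uint32_t shifted = value >> start;
    return (len >= 32) ? shifted : (shifted & ((1u << len) - 1u));
}

// Hypothetical usage.
inline void Bmi1Example()
{
    assert(AndNot(0x0Fu, 0xFFu) == 0xF0u);
    assert(ExtractBits(0xABCD1234u, 8, 8) == 0x12u);
    assert(ExtractBits(0xABCD1234u, 16, 16) == 0xABCDu);
}
```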
BMI2
The BMI2 instruction set adds a few more instructions, like variants of the basic Intel ISA MUL, ROR, SAR, SHR, and SHL (MULX, RORX, SARX, SHRX, and SHLX) that don’t affect EFLAGS. Again, these are mostly useful for compilers building with /arch:AVX2.
It’s generally advised to avoid using PEXT and PDEP on AMD prior to Zen 3.
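For code that has to run on processors where PEXT and PDEP are best avoided, one option is a software path. The sketch below is a straightforward bit-by-bit reference for the 32-bit semantics of the two instructions (parallel bit extract and parallel bit deposit). The names SoftPext, SoftPdep, and PextPdepExample are made up for this example, and it is intended to show what the instructions compute rather than to be a tuned replacement.

```cpp
#include <cassert>
#include <cstdint>

// Gather the bits of 'src' selected by 'mask' into the low bits of the result.
inline uint32_t SoftPext(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    for (uint32_t out = 1; mask != 0; mask &= mask - 1, out <<= 1)
    {
        const uint32_t lowest = mask & (0u - mask);  // lowest set bit of mask
        if (src & lowest)
            result |= out;
    }
    return result;
}

// Scatter the low bits of 'src' to the bit positions selected by 'mask'.
inline uint32_t SoftPdep(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    for (uint32_t in = 1; mask != 0; mask &= mask - 1, in <<= 1)
    {
        const uint32_t lowest = mask & (0u - mask);  // lowest set bit of mask
        if (src & in)
            result |= lowest;
    }
    return result;
}

// Hypothetical usage: pull two nibbles together, then spread them back out.
inline void PextPdepExample()
{
    assert(SoftPext(0xABCDu, 0x0F0Fu) == 0xBDu);
    assert(SoftPdep(0xBDu, 0x0F0Fu) == 0x0B0Du);
}
```

Each loop runs once per set bit in the mask, so sparse masks are cheap and dense masks cost up to 32 iterations.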
AES
The AES (Advanced Encryption Standard) instructions provide hardware acceleration support for the AES cipher. Any PC that supports AVX or AVX2 is likely to support AES.
For more details, see Wikipedia.
MOVBE
The MOVBE instruction (officially called “Move Data After Swapping Bytes” but the mnemonic means “Move Big-Endian”) is an instruction for swapping Big-Endian/Little-Endian 16-bit, 32-bit, and 64-bit data. Much like SSSE3’s PSHUFB which can be used to implement BE swapping for SIMD data vectors, it’s pretty specialized, but useful when you need it.
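As an example of the kind of code this applies to, here is a minimal sketch of reading and writing a 32-bit big-endian value from a byte buffer in portable C++. The names LoadBigEndian32, StoreBigEndian32, and BigEndianExample are hypothetical. Optimizing compilers commonly recognize this pattern as a load or store combined with a byte swap, which is the operation MOVBE performs in one instruction, but whether MOVBE is actually selected depends on the compiler and target settings.

```cpp
#include <cassert>
#include <cstdint>

// Read a 32-bit big-endian value; works regardless of host byte order.
inline uint32_t LoadBigEndian32(const unsigned char* p)
{
    return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16)
         | (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
}

// Write a 32-bit value in big-endian byte order.
inline void StoreBigEndian32(unsigned char* p, uint32_t v)
{
    p[0] = static_cast<unsigned char>(v >> 24);
    p[1] = static_cast<unsigned char>(v >> 16);
    p[2] = static_cast<unsigned char>(v >> 8);
    p[3] = static_cast<unsigned char>(v);
}

// Hypothetical usage: round-trip a value through a byte buffer.
inline void BigEndianExample()
{
    unsigned char buffer[4] = {};
    StoreBigEndian32(buffer, 0x12345678u);
    assert(buffer[0] == 0x12 && buffer[3] == 0x78);
    assert(LoadBigEndian32(buffer) == 0x12345678u);
}
```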
CPUID Example
This code example shows checking each of the CPUID bits mentioned in this blog post.
#if defined(__clang__) || defined(__GNUC__)
#include <cpuid.h>
#else
#include <intrin.h>
#endif
#include <cstdint>

int CPUInfo[4] = { -1 };

// Leaf 0: EAX returns the highest supported standard leaf.
#if defined(__clang__) || defined(__GNUC__)
__cpuid(0, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
__cpuid(CPUInfo, 0);
#endif

bool bABM = false;
bool bAES = false;
bool bBMI1 = false;
bool bBMI2 = false;
bool bMOVBE = false;
bool bPOPCNT = false;

const bool checkextfeature = (CPUInfo[0] >= 7);

if (CPUInfo[0] > 0)
{
    // Leaf 1: feature bits in ECX.
#if defined(__clang__) || defined(__GNUC__)
    __cpuid(1, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
    __cpuid(CPUInfo, 1);
#endif
    bAES = (CPUInfo[2] & 0x2000000) != 0;    // ECX bit 25
    bPOPCNT = (CPUInfo[2] & 0x800000) != 0;  // ECX bit 23
    bMOVBE = (CPUInfo[2] & 0x400000) != 0;   // ECX bit 22
}

if (checkextfeature)
{
    // Leaf 7, sub-leaf 0: structured extended feature bits in EBX.
#if defined(__clang__) || defined(__GNUC__)
    __cpuid_count(7, 0, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
    __cpuidex(CPUInfo, 7, 0);
#endif
    bBMI2 = (CPUInfo[1] & 0x100) != 0;       // EBX bit 8
    bBMI1 = (CPUInfo[1] & 0x8) != 0;         // EBX bit 3
}

// Leaf 0x80000000: EAX returns the highest supported extended leaf.
#if defined(__clang__) || defined(__GNUC__)
__cpuid(0x80000000, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
__cpuid(CPUInfo, 0x80000000);
#endif

if (uint32_t(CPUInfo[0]) > 0x80000000)
{
    // Leaf 0x80000001: extended feature bits in ECX.
#if defined(__clang__) || defined(__GNUC__)
    __cpuid(0x80000001, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
    __cpuid(CPUInfo, 0x80000001);
#endif
    bABM = (CPUInfo[2] & 0x20) != 0;         // ECX bit 5 (LZCNT)
}
Xbox: Xbox One supports ABM, AES, BMI1, and MOVBE. Xbox Series X|S supports those as well plus BMI2.
Related: See Visual C++ Team Blog