DirectXMath - F16C and FMA
xbox, directxmathOriginally posted to Chuck Walbourn's Blog on MSDN,
In this installment in our series, we cover a few additional instructions that extend the AVX instruction set. These instructions make use of the VEX prefix and require the OS implement “OXSAVE”. Without this support, these instructions are all invalid and will generate an invalid instruction hardware exception.
Half-precision Floating-point Conversion
The F16C instruction set (also called CVT16 by AMD) provides support for doing half-precision <-> single-precision floating-point conversions. These intrinsics are in the immintrin.h
header.
inline float XMConvertHalfToFloat(HALF Value)
{
__m128i V1 = _mm_cvtsi32_si128(static_cast<uint32_t>(Value));
__m128 V2 = _mm_cvtph_ps(V1);
return _mm_cvtss_f32(V2);
}
inline HALF XMConvertFloatToHalf(float Value)
{
__m128 V1 = _mm_set_ss(Value);
__m128i V2 = _mm_cvtps_ph(V1, 0);
return static_cast<HALF>(_mm_cvtsi128_si32(V2));
}
This instruction actually converts 4 HALF <-> float values at a time, so this can be used to improve the performance of both XMConvertHalfToFloatStream
and XMConvertFloatToHalfStream
.
Fused Multiply-Add
Computations often contain steps where two values are multiplied and then the result is accumulated with previous results. This can be done in single instruction using a ‘fused’ multiply-add operation:
V = V1 * V2 + V3
DirectXMath provides this functionality with the XMVectorMultiplyAdd
function. The challenge in making use of FMA is that Intel and AMD took a while to agree on the exact details—thankfully ARM-NEON has a fused multiply-add instruction.
AMD Bulldozer implements FMA4. which uses a non-destructive destination form using 4 registers. These intrinsics are located in the ammintrin.h
header.
inline XMVECTOR XMVectorMultiplyAdd(FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3)
{
return _mm_macc_ps( V1, V2, V3 );
}
inline XMVECTOR XM_CALLCONV XMVectorNegativeMultiplySubtract(FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3)
{
return _mm_nmacc_ps( V1, V2, V3 );
}
Intel “Haswell” is expected to implement FMA3, which uses a destructive destination form using only 3 registers. The intrinsics are located in the immintrin.h
header.
inline XMVECTOR XMVectorMultiplyAdd(FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3)
{
return _mm_fmadd_ps( V1, V2, V3 );
}
inline XMVECTOR XM_CALLCONV XMVectorNegativeMultiplySubtract(FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3)
{
return _mm_fnmadd_ps( V1, V2, V3 );
}
AMD has announced it is planning to implement FMA3 with “Piledriver”. It is also fairly easy to use the same source code to generate both versions by just substituting one intrinsic for the other.
Processor Support
F16C/CVT16 is supported by AMD “Piledriver”, Intel “Ivy Bridge”, and later processors.
FMA4 is supported by AMD Bulldozer. AMD Zen1 or later removed FMA4.
FMA3 is supported by Intel “Haswell” and AMD “Piledriver” processors or later.
All CPUs with AVX2 support include AVX, FMA3, and F16C.
As extensions of the AVX instruction set, these instructions all require OSXSAVE support. This support is included in Windows 7 Service Pack 1, Windows Server 2008 R2 Service Pack 1, Windows 8, and Windows Server 2012.
#if defined(__clang__) || defined(__GNUC__)
#include <cpuid.h>
#else
#include <intrin.h>
#endif
int CPUInfo[4] = { -1 };
#if defined(__clang__) || defined(__GNUC__)
__cpuid(0, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
__cpuid(CPUInfo, 0);
#endif
bool bOSXSAVE = false;
bool bAVX = false;
bool bF16C = false;
bool bFMA3 = false;
bool bFMA4 = false;
if (CPUInfo[0] > 0)
{
#if defined(__clang__) || defined(__GNUC__)
__cpuid(1, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
__cpuid(CPUInfo, 1);
#endif
bOSXSAVE = (CPUInfo[2] & 0x8000000) != 0;
bF16C = bOSXSAVE && (CPUInfo[2] & 0x20000000) != 0;
bAVX = bOSXSAVE && (CPUInfo[2] & 0x10000000) != 0;
bFMA3 = bOSXSAVE && (CPUInfo[2] & 0x1000) != 0;
}
#if defined(__clang__) || defined(__GNUC__)
__cpuid(0x80000000, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
__cpuid(CPUInfo, 0x80000000);
#endif
if (CPUInfo[0] > 0x80000000)
{
#if defined(__clang__) || defined(__GNUC__)
__cpuid(0x80000001, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
__cpuid(CPUInfo, 0x80000001);
#endif
bFMA4 = bOSXSAVE && (CPUInfo[2] & 0x10000) != 0;
}
Compiler Support
FMA4 intrinsics were added to Visual Studio 2010 via Service Pack 1.
FMA3 and F16C/CVT16 intrinsic support requires Visual Studio 2012 or later.
For Visual C++, the compiler will emit these instructions when you use their intrinsic without any specific
/arch
setting, but clang/LLVM will error out if you use_mm_cvtph_ps
or_mm_cvtps_ph
unless you use the-mavx2
or-mf16c
compiler switches.
Utility Code
Update: The source for this project is now available on GitHub under the MIT license.
Xbox: Xbox One supports F16C, but does not support FMA3 or FMA4. Xbox Series X|S supports both F16C and FMA3.
See also: SSE. SSE2. and ARM-NEON; SSE3 and SSSE3; SSE4.1 and SSE4.2; AVX; AVX2; ARM64