DirectXMath - F16C and FMA

xbox, directxmath

Originally posted to Chuck Walbourn's Blog on MSDN, Sep 11, 2012

In this installment in our series, we cover a few additional instructions that extend the AVX instruction set. These instructions make use of the VEX prefix and require the OS implement “OXSAVE”. Without this support, these instructions are all invalid and will generate an invalid instruction hardware exception.

Half-precision Floating-point Conversion

The F16C instruction set (also called CVT16 by AMD) provides support for doing half-precision <-> single-precision floating-point conversions. These intrinsics are in the immintrin.h header.

inline float XMConvertHalfToFloat(HALF Value)
{
    __m128i V1 = _mm_cvtsi32_si128(static_cast<uint32_t>(Value));
    __m128 V2 = _mm_cvtph_ps(V1);
    return _mm_cvtss_f32(V2);
}

inline HALF XMConvertFloatToHalf(float Value)
{
    __m128 V1 = _mm_set_ss(Value);
    __m128i V2 = _mm_cvtps_ph(V1, 0);
    return static_cast<HALF>(_mm_cvtsi128_si32(V2));
}

This instruction actually converts 4 HALF <-> float values at a time, so this can be used to improve the performance of both XMConvertHalfToFloatStream and XMConvertFloatToHalfStream.

Fused Multiply-Add

Computations often contain steps where two values are multiplied and then the result is accumulated with previous results. This can be done in single instruction using a ‘fused’ multiply-add operation:

V = V1 * V2 + V3

DirectXMath provides this functionality with the XMVectorMultiplyAdd function. The challenge in making use of FMA is that Intel and AMD took a while to agree on the exact details—thankfully ARM-NEON has a fused multiply-add instruction.

AMD Bulldozer implements FMA4. which uses a non-destructive destination form using 4 registers. These intrinsics are located in the ammintrin.h header.

inline XMVECTOR XMVectorMultiplyAdd(FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3)
{
    return _mm_macc_ps( V1, V2, V3 );
}

inline XMVECTOR XM_CALLCONV XMVectorNegativeMultiplySubtract(FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3)
{
    return _mm_nmacc_ps( V1, V2, V3 );
}

Intel “Haswell” is expected to implement FMA3, which uses a destructive destination form using only 3 registers. The intrinsics are located in the immintrin.h header.

inline XMVECTOR XMVectorMultiplyAdd(FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3)
{
    return _mm_fmadd_ps( V1, V2, V3 );
}

inline XMVECTOR XM_CALLCONV XMVectorNegativeMultiplySubtract(FXMVECTOR V1, FXMVECTOR V2, FXMVECTOR V3)
{
    return _mm_fnmadd_ps( V1, V2, V3 );
}

AMD has announced it is planning to implement FMA3 with “Piledriver”. It is also fairly easy to use the same source code to generate both versions by just substituting one intrinsic for the other.

Processor Support

F16C/CVT16 is supported by AMD “Piledriver”, Intel “Ivy Bridge”, and later processors.

FMA4 is supported by AMD Bulldozer. AMD Zen1 or later removed FMA4.

FMA3 is supported by Intel “Haswell” and AMD “Piledriver” processors or later.

All CPUs with AVX2 support include AVX, FMA3, and F16C.

As extensions of the AVX instruction set, these instructions all require OSXSAVE support. This support is included in Windows 7 Service Pack 1, Windows Server 2008 R2 Service Pack 1, Windows 8, and Windows Server 2012.

#if defined(__clang__) || defined(__GNUC__)
#include <cpuid.h>
#else
#include <intrin.h>
#endif

int CPUInfo[4] = { -1 };
#if defined(__clang__) || defined(__GNUC__)
__cpuid(0, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
__cpuid(CPUInfo, 0);
#endif
bool bOSXSAVE = false;
bool bAVX = false;
bool bF16C = false;
bool bFMA3 = false;
bool bFMA4 = false;
if (CPUInfo[0] > 0)
{
#if defined(__clang__) || defined(__GNUC__)
    __cpuid(1, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
    __cpuid(CPUInfo, 1);
#endif
    bOSXSAVE = (CPUInfo[2] & 0x8000000) != 0;
    bF16C = bOSXSAVE && (CPUInfo[2] & 0x20000000) != 0;
    bAVX = bOSXSAVE && (CPUInfo[2] & 0x10000000) != 0;
    bFMA3 = bOSXSAVE && (CPUInfo[2] & 0x1000) != 0;
}
#if defined(__clang__) || defined(__GNUC__)
__cpuid(0x80000000, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
__cpuid(CPUInfo, 0x80000000);
#endif
if (CPUInfo[0] > 0x80000000)
{
#if defined(__clang__) || defined(__GNUC__)
    __cpuid(0x80000001, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
    __cpuid(CPUInfo, 0x80000001);
#endif
    bFMA4 = bOSXSAVE && (CPUInfo[2] & 0x10000) != 0;
}

Compiler Support

FMA4 intrinsics were added to Visual Studio 2010 via Service Pack 1.

FMA3 and F16C/CVT16 intrinsic support requires Visual Studio 2012 or later.

For Visual C++, the compiler will emit these instructions when you use their intrinsic without any specific /arch setting, but clang/LLVM will error out if you use _mm_cvtph_ps or _mm_cvtps_ph unless you use the -mavx2 or -mf16c compiler switches.

Utility Code

Update: The source for this project is now available on GitHub under the MIT license.

Xbox: Xbox One supports F16C, but does not support FMA3 or FMA4. Xbox Series X|S supports both F16C and FMA3.

Games for Windows and the DirectX SDK blog

Half-precision Floating-point Conversion

Fused Multiply-Add

Processor Support

Compiler Support

Utility Code