

DirectXMath - SSE3 and SSSE3

xbox, directxmath

Originally posted to Chuck Walbourn's Blog on MSDN.

The SSE3 instruction set adds about a dozen instructions (the intrinsics are in the pmmintrin.h header). The main operation these instructions provide is the ability to do “horizontal” adds and subtracts (ARM-NEON refers to these as ‘pairwise’ operations) for float4 and double2 data.

Result = _mm_hadd_ps(V1,V2);
->
Result[0] = V1[0] + V1[1];
Result[1] = V1[2] + V1[3];
Result[2] = V2[0] + V2[1];
Result[3] = V2[2] + V2[3];

There are also variants that subtract instead of add (_mm_hsub_ps, _mm_hsub_pd), but otherwise they work the same way.
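
For example, _mm_hsub_ps computes the pairwise differences:

Result = _mm_hsub_ps(V1,V2);
->
Result[0] = V1[0] - V1[1];
Result[1] = V1[2] - V1[3];
Result[2] = V2[0] - V2[1];
Result[3] = V2[2] - V2[3];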

The majority of the DirectXMath library is designed around avoiding the need for these operations, but they are useful for dot-product operations (VMX128 on the Xbox 360 had a specific instruction for doing dot-products across a vector, but not a general pairwise add).

The existing SSE/SSE2 dot-product for float4:

inline XMVECTOR XMVector4Dot(FXMVECTOR V1, FXMVECTOR V2)
{
    XMVECTOR vTemp2 = V2;
    XMVECTOR vTemp = _mm_mul_ps(V1, vTemp2);
    vTemp2 = _mm_shuffle_ps(vTemp2, vTemp, _MM_SHUFFLE(1, 0, 0, 0)); // copy X to the Z position and Y to the W position
    vTemp2 = _mm_add_ps(vTemp2, vTemp);                              // Z = X+Z; W = Y+W
    vTemp = _mm_shuffle_ps(vTemp, vTemp2, _MM_SHUFFLE(0, 3, 0, 0));  // copy W to the Z position
    vTemp = _mm_add_ps(vTemp, vTemp2);                               // Z = (X+Z)+(Y+W), the full dot product
    return XM_PERMUTE_PS(vTemp, _MM_SHUFFLE(2, 2, 2, 2));            // splat Z across all four lanes
}

can be rewritten using SSE3 as:

inline XMVECTOR XMVector4Dot(FXMVECTOR V1, FXMVECTOR V2)
{
    XMVECTOR vTemp = _mm_mul_ps(V1, V2); // x1*x2, y1*y2, z1*z2, w1*w2
    vTemp = _mm_hadd_ps(vTemp, vTemp);   // x+y, z+w, x+y, z+w (of the products)
    return _mm_hadd_ps(vTemp, vTemp);    // the dot product splatted across all four lanes
}

This version has the same number of multiply/add operations, but there are three fewer shuffles required. As we’ll see in a future installment, there are actually some better options than this in SSE4.1.

There are also two new instructions that can be used as special-case substitutes for the XMVectorSwizzle<> template. We’ll make use of these in a future installment.

XMVectorSwizzle<0,0,2,2>(V) -> _mm_moveldup_ps(V)
XMVectorSwizzle<1,1,3,3>(V) -> _mm_movehdup_ps(V)
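
In DirectXMath these map naturally onto template specializations of XMVectorSwizzle when the SSE3 code path is enabled. A minimal sketch of what such specializations look like (illustrative; the library's actual declarations may differ):

template<> inline XMVECTOR XMVectorSwizzle<0, 0, 2, 2>(FXMVECTOR V)
{
    return _mm_moveldup_ps(V);   // duplicate elements 0 and 2: (x, x, z, z)
}

template<> inline XMVECTOR XMVectorSwizzle<1, 1, 3, 3>(FXMVECTOR V)
{
    return _mm_movehdup_ps(V);   // duplicate elements 1 and 3: (y, y, w, w)
}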

The Supplemental SSE3 (SSSE3) instruction set adds the equivalent “horizontal” adds and subtracts for various integer vectors, so those are not particularly useful for DirectXMath. These intrinsics are located in the tmmintrin.h header. There are also some other useful integer operations that make life simpler for implementing algorithms like Fast Block Compress, codecs, or other image processing on integer data, which are a bit out of scope for DirectXMath.

There is one SSSE3 intrinsic of interest for DirectXMath: _mm_shuffle_epi8. This instruction rearranges the bytes in a vector, which makes it an excellent function for doing vector-based Big-Endian/Little-Endian swaps without having to ‘spill’ the vector to memory and reload it.

inline XMVECTOR XMVectorEndian(FXMVECTOR V)
{
    // Shuffle control that reverses the bytes within each 32-bit element
    static const XMVECTORU32 idx = { 0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F };
    __m128i Result = _mm_shuffle_epi8(_mm_castps_si128(V), _mm_castps_si128(idx));
    return _mm_castsi128_ps(Result);
}

There’s not enough use for this kind of operation to make this function part of the library (Windows x86, Windows x64, and Windows RT are all Little-Endian platforms), but it can be useful in cross-platform content tools (the Xbox 360 is Big-Endian).

Processor Support

SSE3 is supported by Intel Pentium 4 processors (“Prescott”), Intel Atom, AMD Athlon 64 (“revision E”), AMD Phenom, and later processors. This means most, but not quite all, x64 capable CPUs should support SSE3.

Windows 10 tightened the requirements for x64 support so those few early first-generation x64 AMD and Intel CPUs aren’t supported these days in any case.

Supplemental SSE3 (SSSE3) is supported by Intel Core 2 Duo, Intel Core i7/i5/i3, Intel Atom, AMD Bulldozer, and later processors.
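
You can detect support for these instruction sets at runtime by querying the CPUID feature bits: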

#if defined(__clang__) || defined(__GNUC__)
#include <cpuid.h>
#else
#include <intrin.h>
#endif

int CPUInfo[4] = { -1 };
// CPUID function 0: highest supported standard function
#if defined(__clang__) || defined(__GNUC__)
__cpuid(0, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
__cpuid(CPUInfo, 0);
#endif
bool bSSE3 = false;
bool bSSSE3 = false;
if (CPUInfo[0] > 0)
{
    // CPUID function 1: feature bits (ECX is returned in CPUInfo[2])
#if defined(__clang__) || defined(__GNUC__)
    __cpuid(1, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
    __cpuid(CPUInfo, 1);
#endif
    bSSE3 = (CPUInfo[2] & 0x1) != 0;    // ECX bit 0: SSE3
    bSSSE3 = (CPUInfo[2] & 0x200) != 0; // ECX bit 9: SSSE3
}

You can also use the IsProcessorFeaturePresent Win32 API with PF_SSE3_INSTRUCTIONS_AVAILABLE on Windows Vista or later to detect SSE3 support. This API does not report support for SSSE3.
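
A minimal sketch of that check:

#include <windows.h>

// PF_SSE3_INSTRUCTIONS_AVAILABLE covers SSE3 only; there is no equivalent constant for SSSE3
bool bSSE3 = IsProcessorFeaturePresent(PF_SSE3_INSTRUCTIONS_AVAILABLE) != FALSE;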

The Surface Pro X ARM64 device supports x86 emulation including SSE, SSE2, SSE3, SSSE3, and SSE4.1.

Utility Code

Update: The source for this project is now available on GitHub under the MIT license. An XMVectorSum function was added to recent versions of DirectXMath which makes use of horizontal adds.

Xbox: Xbox One and Xbox Series X|S support SSE3 and SSSE3.

DirectXMath 3.10: Added the XMVectorSum method for horizontal adds, as well as the ability to specifically opt in to just the SSE3 optimizations via -D_XM_SSE3_INTRINSICS_.
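
On the SSE3 path, that kind of horizontal sum boils down to the same double _mm_hadd_ps pattern used above. A minimal sketch (HorizontalSumSSE3 is an illustrative name, not the library's actual implementation):

inline XMVECTOR HorizontalSumSSE3(FXMVECTOR V)
{
    XMVECTOR vTemp = _mm_hadd_ps(V, V); // x+y, z+w, x+y, z+w
    return _mm_hadd_ps(vTemp, vTemp);   // x+y+z+w replicated to all four lanes
}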

Windows 10: As of Windows 10, x64 editions of the OS require support for a few specific instructions (CMPXCHG16b, PrefetchW, and LAHF/SAHF) which excludes a number of first-generation x64 CPUs from both Intel and AMD. The net result is that it’s even more likely that an x64 native application can assume SSE3 is supported even though only SSE/SSE2 is actually in the system requirements.

Update: Per the latest numbers from the Valve Hardware Survey, for PC games you can require SSE3 and SSSE3 support without excluding significant numbers of gamers. You should still check for CPU support at startup to avoid unexplained crashes due to invalid instructions if a customer tries to run the game on an ancient PC.

See Also: SSE, SSE2, and ARM-NEON; SSE4.1 and SSE4.2; AVX; F16C and FMA; AVX2; ARM64