Games for Windows and the DirectX SDK blog

Technical tips, tricks, and news about game development for Microsoft platforms including desktop, Xbox, and UWP

Project maintained by walbourn Hosted on GitHub Pages — Theme by mattgraham
Home | Posts by Tag | Posts by Month

DirectXMath - ARM64


Originally posted to Chuck Walbourn's Blog on MSDN,

The Visual Studio 2017 (15.9 update) now supports the ARM64 architecture for the Universal Windows Platform (UWP) apps.

The ARM64 platform supports ARM-NEON using the same intrinsics as the ARM (32-bit) platform. The Windows on ARM (32-bit) platform assumes support for ARMv7, ARM-NEON, and VFPv3. The Windows on ARM (64-bit) platform assumes support for ARMv8, ARM-NEON, and VFPv4.


The ARMv8 instruction set implies support for several useful intrinsics for DirectXMath data types:

  • vector divide: vdivq_f32
  • vector rounding: vrndq_f32, vrndnq_f32, vrndmq_f32, vrndpq_f32
  • half-precision conversion: vcvt_f32_f16, vcvt_f16_f32
  • fused-multiply and accumulate: vfmaq_f32, vfmsq_f32

In ARM (32-bit), vector division had to be implemented using multiply-by-reciprocal with 2 or 3 iterations of Newton-Raphson refinement which is less precise. For the ARM64 platform, I was able to replace all uses of divide in non-Est functions with a ‘true divide’ in XMVectorDivide, XMVectorReciprocal, and in the implementation for a number of other functions.

For ARM (32-bit) I used a number of tricks to perform the rounding operations. With the ARM64 platform, I can use the new intrinsics to implement XMVectorRound, XMVectorTruncate, XMVectorFloor, and XMVectorCeiling.

The half-precision conversion intrinsics are used when building for the ARM64 platform to implement XMConvertHalfToFloat, XMConvertFloatToHalf, XMConvertHalfToFloatStream, and XMConvertFloatToHalfStream.


  • DirectXMath 3.07 was the first version to include basic ARM64 support using the same ARM-NEON implementation as used for ARM (32-bit).
  • DirectXMath 3.10 uses ARMv8 intrinsics when building the ``_M_ARM64`` architecture for optimizations of specific functions including the new XMVectorSum horizontal add function.
  • DirectXMath 3.12 uses ARM64 fused-multiply and accumulate to implement XMVectorMultiplyAdd and XMVectorNegativeMultiplySubtract on the ARM64 platform.
  • DirectXMath 3.13 - 3.16 cleaned up the ARM-NEON implementation for better compiler portability for clang/LLVM and GNUC.

See also: SSE. SSE2. and ARM-NEON; SSE3 and SSSE3; SSE4.1 and SSE4.2; AVX; F16C and FMA; AVX2