
DirectXMath 3.06

windowssdk, directxmath

Originally posted to Chuck Walbourn's Blog on MSDN.

DirectXMath version 3.06 is included in the release of VS 2013. You can use this with VS 2012 or VS 2010 as well via the standalone Windows 8.1 SDK.

The high-level What's New is covered on Microsoft Docs, but here is a more technical summary of the changes between DirectXMath 3.03 (shipped in the VS 2012 / Windows 8.0 SDK) and 3.06.

This release includes the fixes mentioned in a previous post:

  • Corrected a bug in the ARM-NEON version of ``XMVector3Cross``
  • Fixed a behavior problem with odd whole numbers in ``XMVectorFloor`` and ``XMVectorCeiling``
  • Fixed a range issue with ``XMConvertHalfToFloat`` and ``XMConvertFloatToHalf``, which incorrectly still used the extended non-standard Xbox 360 range instead of the IEEE 754 standard behavior
  • Fixed scaling transformations with ``BoundingOrientedBox::Transform`` and ``BoundingFrustum::Transform``
  • Fixed ``XMLoadFloat3SE`` and ``XMStoreFloat3SE`` to actually match ``DXGI_FORMAT_R9G9B9E5_SHAREDEXP``

DirectXMath 3.06 also includes the following changes:

  • Takes advantage of the ``__vectorcall`` calling convention when available (see Microsoft Docs). This introduced the ``XM_CALLCONV`` macro and some new typedefs: ``HXMVECTOR`` and ``FXMMATRIX``. Details of using this convention in your own code are covered on Microsoft Docs; a short usage sketch follows this list.
  • New functions ``XMColorRGBToSRGB`` and ``XMColorSRGBToRGB``
  • New functions ``XMVectorExp2``, ``XMVectorExpE``, ``XMVectorLog2``, and ``XMVectorLogE``
    • The older ``XMVectorExp`` is now an alias for ``XMVectorExp2``, and ``XMVectorLog`` is an alias for ``XMVectorLog2``
  • New functions ``XMLoadUDecN4_XR`` and ``XMStoreUDecN4_XR`` to match ``DXGI_FORMAT_R10G10B10_XR_BIAS_A2_UNORM``
  • Makes use of the ARM-NEON multiply-by-scalar instructions to avoid some extra ``vdup`` operations, working around some bugs in older VS 2012 compiler releases where needed
  • Some cleanup of the ARM-NEON implementations to make use of the 'typed' register types rather than the generic ``__n64/__n128``
  • Some cleanup of the SSE implementations to use ``_mm_cast*`` instead of C++ style casts
  • Improved ``XMVectorRound``
  • ARM-NEON and SSE optimized versions of ``XMVectorExp*`` and ``XMVectorLog*``
  • Optimized ``XMMATRIX::operator/=`` and ``operator/``
  • Optimized all the ``XMVector*Stream`` functions for ARM-NEON and SSE, including making use of ``_mm_stream_ps`` when possible.
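
For the new calling-convention support, here is a short sketch (these helpers are hypothetical, not part of the library) showing the annotations in use. ``XM_CALLCONV`` expands to ``__vectorcall`` when the compiler and architecture support it, and the positional parameter typedefs (``FXMVECTOR`` for the first three XMVECTOR parameters, ``GXMVECTOR`` for the fourth, ``HXMVECTOR`` for the fifth and sixth, ``FXMMATRIX`` for the first XMMATRIX in most signatures; see Microsoft Docs for the exact rules) let arguments travel in SIMD registers where the convention allows:

#include <DirectXMath.h>

using namespace DirectX;

// Hypothetical helpers (not library functions) illustrating XM_CALLCONV and
// the new typedefs.

// Weighted blend of four points; 'weights' supplies the three lerp factors.
inline XMVECTOR XM_CALLCONV BlendFour(
    FXMVECTOR p0, FXMVECTOR p1, FXMVECTOR p2, GXMVECTOR p3, HXMVECTOR weights)
{
    XMVECTOR a = XMVectorLerp(p0, p1, XMVectorGetX(weights));
    XMVECTOR b = XMVectorLerp(p2, p3, XMVectorGetY(weights));
    return XMVectorLerp(a, b, XMVectorGetZ(weights));
}

// The first (and only) XMMATRIX parameter uses FXMMATRIX, mirroring the
// library's own XMVector3Transform(FXMVECTOR, FXMMATRIX) signature.
inline XMVECTOR XM_CALLCONV TransformPoint(FXMVECTOR p, FXMMATRIX m)
{
    return XMVector3Transform(p, m);
}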

Rounding

The original ``XMVectorRound`` implementation was not truly round-to-nearest and did not match the behavior of the SSE4.1 instruction ``_mm_round_ps``, so I revisited it for this release.

Here is the new version, which rounds half-way cases to the nearest even integer to match ``_mm_round_ps``:

namespace Internal
{
    inline float round_to_nearest(float x)
    {
        float i = floorf(x);
        x -= i;
        if (x < 0.5f)
            return i;
        if (x > 0.5f)
            return i + 1.f;

        // Exactly half-way: round to the nearest even integer (banker's rounding)
        float int_part;
        modff(i / 2.f, &int_part);
        if ((2.f*int_part) == i)
        {
            return i;
        }

        return i + 1.f;
    }
};

#if !defined(_XM_NO_INTRINSICS_)
#pragma float_control(push)
#pragma float_control(precise, on)
#endif

inline XMVECTOR XMVectorRound(FXMVECTOR V)
{
#if defined(_XM_NO_INTRINSICS_)

    XMVECTOR vResult = {
        Internal::round_to_nearest(V.vector4_f32[0]),
        Internal::round_to_nearest(V.vector4_f32[1]),
        Internal::round_to_nearest(V.vector4_f32[2]),
        Internal::round_to_nearest(V.vector4_f32[3])
    };
    return vResult;

#elif defined(_XM_ARM_NEON_INTRINSICS_)
    static const XMVECTORI32 magic = { 0x4B000000, 0x4B000000, 0x4B000000, 0x4B000000 };
    uint32x4_t sign = vandq_u32(V, g_XMNegativeZero);
    uint32x4_t sMagic = vorrq_u32(magic, sign);
    XMVECTOR vResult = vaddq_f32(V, sMagic);
    vResult = vsubq_f32(vResult, sMagic);
    return vResult;
#elif defined(_XM_SSE_INTRINSICS_)
    static const XMVECTORI32 magic = { 0x4B000000, 0x4B000000, 0x4B000000, 0x4B000000 };
    __m128 sign = _mm_and_ps(V, g_XMNegativeZero);
    __m128 sMagic = _mm_or_ps(magic, sign);
    XMVECTOR vResult = _mm_add_ps(V, sMagic);
    vResult = _mm_sub_ps(vResult, sMagic);
    return vResult;
#else // _XM_VMX128_INTRINSICS_
#endif // _XM_VMX128_INTRINSICS_
}

#if !defined(_XM_NO_INTRINSICS_)
#pragma float_control(pop)
#endif
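
The SSE and ARM-NEON paths avoid per-lane branching entirely by using the classic 2^23 "magic number" trick: adding ±2^23 (0x4B000000, with the input's sign bit OR'd in) pushes the fractional bits out of the float significand, so the hardware's default round-to-nearest-even mode does the rounding, and subtracting the same value recovers the result. The ``float_control(precise, on)`` pragma keeps an optimizing compiler from folding the add/subtract pair away. Here is a minimal scalar sketch of the same idea (illustrative only, for |x| < 2^23):

#include <cmath>
#include <cstdio>

float round_magic(float x)
{
    const float magic = 8388608.0f;        // 2^23 == 0x4B000000
    float m = std::copysign(magic, x);     // match the sign of x, like OR-ing in the sign bit
    return (x + m) - m;                    // rounds to nearest-even in the default FP mode
}

int main()
{
    const float tests[] = { 0.5f, 1.5f, 2.5f, -0.5f, -1.5f, 3.7f };
    for (float v : tests)
        std::printf("%g -> %g\n", v, round_magic(v));
    // Expected (ties to even): 0.5 -> 0, 1.5 -> 2, 2.5 -> 2, -0.5 -> -0, -1.5 -> -2, 3.7 -> 4
}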

The revised ``XMVectorFloor`` and ``XMVectorCeiling`` implementations were already covered in the previous blog post.

Exponent and Logarithm

``XMVectorExp`` was renamed ``XMVectorExp2`` and optimized for ARM-NEON and SSE.

inline XMVECTOR XMVectorExp2
(
    FXMVECTOR V
)
{
#if defined(_XM_NO_INTRINSICS_)

    XMVECTOR Result;
    Result.vector4_f32[0] = powf(2.0f, V.vector4_f32[0]);
    Result.vector4_f32[1] = powf(2.0f, V.vector4_f32[1]);
    Result.vector4_f32[2] = powf(2.0f, V.vector4_f32[2]);
    Result.vector4_f32[3] = powf(2.0f, V.vector4_f32[3]);
    return Result;

#elif defined(_XM_ARM_NEON_INTRINSICS_)
    int32x4_t itrunc = vcvtq_s32_f32(V);
    float32x4_t ftrunc = vcvtq_f32_s32(itrunc);
    float32x4_t y = vsubq_f32(V, ftrunc);

    float32x4_t poly = vmlaq_f32(g_XMExpEst6, g_XMExpEst7, y);
    poly = vmlaq_f32(g_XMExpEst5, poly, y);
    poly = vmlaq_f32(g_XMExpEst4, poly, y);
    poly = vmlaq_f32(g_XMExpEst3, poly, y);
    poly = vmlaq_f32(g_XMExpEst2, poly, y);
    poly = vmlaq_f32(g_XMExpEst1, poly, y);
    poly = vmlaq_f32(g_XMOne, poly, y);

    int32x4_t biased = vaddq_s32(itrunc, g_XMExponentBias);
    biased = vshlq_n_s32(biased, 23);
    float32x4_t result0 = XMVectorDivide(biased, poly);

    biased = vaddq_s32(itrunc, g_XM253);
    biased = vshlq_n_s32(biased, 23);
    float32x4_t result1 = XMVectorDivide(biased, poly);
    result1 = vmulq_f32(g_XMMinNormal.v, result1);

    // Use selection to handle the cases
    //  if (V is NaN) -> QNaN;
    //  else if (V sign bit set)
    //      if (V > -150)
    //          if (V.exponent < -126) -> result1
    //          else -> result0
    //      else -> +0
    //  else
    //      if (V < 128) -> result0
    //      else -> +inf

    int32x4_t comp = vcltq_s32(V, g_XMBin128);
    float32x4_t result2 = vbslq_f32(comp, result0, g_XMInfinity);

    comp = vcltq_s32(itrunc, g_XMSubnormalExponent);
    float32x4_t result3 = vbslq_f32(comp, result1, result0);

    comp = vcltq_s32(V, g_XMBinNeg150);
    float32x4_t result4 = vbslq_f32(comp, result3, g_XMZero);

    int32x4_t sign = vandq_s32(V, g_XMNegativeZero);
    comp = vceqq_s32(sign, g_XMNegativeZero);
    float32x4_t result5 = vbslq_f32(comp, result4, result2);

    int32x4_t t0 = vandq_s32(V, g_XMQNaNTest);
    int32x4_t t1 = vandq_s32(V, g_XMInfinity);
    t0 = vceqq_s32(t0, g_XMZero);
    t1 = vceqq_s32(t1, g_XMInfinity);
    int32x4_t isNaN = vbicq_s32(t1, t0);

    float32x4_t vResult = vbslq_f32(isNaN, g_XMQNaN, result5);
    return vResult;
#elif defined(_XM_SSE_INTRINSICS_)
    __m128i itrunc = _mm_cvttps_epi32(V);
    __m128 ftrunc = _mm_cvtepi32_ps(itrunc);
    __m128 y = _mm_sub_ps(V, ftrunc);
    __m128 poly = _mm_mul_ps(g_XMExpEst7, y);
    poly = _mm_add_ps(g_XMExpEst6, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMExpEst5, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMExpEst4, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMExpEst3, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMExpEst2, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMExpEst1, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMOne, poly);

    __m128i biased = _mm_add_epi32(itrunc, g_XMExponentBias);
    biased = _mm_slli_epi32(biased, 23);
    __m128 result0 = _mm_div_ps(_mm_castsi128_ps(biased), poly);

    biased = _mm_add_epi32(itrunc, g_XM253);
    biased = _mm_slli_epi32(biased, 23);
    __m128 result1 = _mm_div_ps(_mm_castsi128_ps(biased), poly);
    result1 = _mm_mul_ps(g_XMMinNormal.v, result1);

    // Use selection to handle the cases
    //  if (V is NaN) -> QNaN;
    //  else if (V sign bit set)
    //      if (V > -150)
    //          if (V.exponent < -126) -> result1
    //          else -> result0
    //      else -> +0
    //  else
    //      if (V < 128) -> result0
    //      else -> +inf

    __m128i comp = _mm_cmplt_epi32(_mm_castps_si128(V), g_XMBin128);
    __m128i select0 = _mm_and_si128(comp, _mm_castps_si128(result0));
    __m128i select1 = _mm_andnot_si128(comp, g_XMInfinity);
    __m128i result2 = _mm_or_si128(select0, select1);

    comp = _mm_cmplt_epi32(itrunc, g_XMSubnormalExponent);
    select1 = _mm_and_si128(comp, _mm_castps_si128(result1));
    select0 = _mm_andnot_si128(comp, _mm_castps_si128(result0));
    __m128i result3 = _mm_or_si128(select0, select1);

    comp = _mm_cmplt_epi32(_mm_castps_si128(V), g_XMBinNeg150);
    select0 = _mm_and_si128(comp, result3);
    select1 = _mm_andnot_si128(comp, g_XMZero);
    __m128i result4 = _mm_or_si128(select0, select1);

    __m128i sign = _mm_and_si128(_mm_castps_si128(V), g_XMNegativeZero);
    comp = _mm_cmpeq_epi32(sign, g_XMNegativeZero);
    select0 = _mm_and_si128(comp, result4);
    select1 = _mm_andnot_si128(comp, result2);
    __m128i result5 = _mm_or_si128(select0, select1);

    __m128i t0 = _mm_and_si128(_mm_castps_si128(V), g_XMQNaNTest);
    __m128i t1 = _mm_and_si128(_mm_castps_si128(V), g_XMInfinity);
    t0 = _mm_cmpeq_epi32(t0, g_XMZero);
    t1 = _mm_cmpeq_epi32(t1, g_XMInfinity);
    __m128i isNaN = _mm_andnot_si128(t0, t1);

    select0 = _mm_and_si128(isNaN, g_XMQNaN);
    select1 = _mm_andnot_si128(isNaN, result5);
    __m128i vResult = _mm_or_si128(select0, select1);

    return _mm_castsi128_ps(vResult);
#else // _XM_VMX128_INTRINSICS_
#endif // _XM_VMX128_INTRINSICS_
}
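
Both SIMD paths use the same decomposition: truncate the input to an integer part n and a remainder y, build 2^n directly from its raw exponent bits (that is what the ``g_XMExponentBias`` add and 23-bit shift do), and approximate the remaining factor with the ``g_XMExpEst1``..``g_XMExpEst7`` polynomial, which estimates 2^-y and is divided out. The result0/result1 selection then covers subnormal results, underflow to +0, overflow to +inf, and NaN propagation. A scalar sketch of just the main path (illustrative only, not library code):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Valid only while n + 127 stays in the normal exponent range; std::exp2 on
// the remainder stands in for the library's 2^-y polynomial and divide.
float exp2_split(float x)
{
    int   n = static_cast<int>(x);                // truncate toward zero, like _mm_cvttps_epi32 / vcvtq_s32_f32
    float y = x - static_cast<float>(n);          // remainder in (-1, 1)

    uint32_t bits = static_cast<uint32_t>(n + 127) << 23;   // 2^n from raw exponent bits
    float pow2n;
    std::memcpy(&pow2n, &bits, sizeof(pow2n));

    return pow2n * std::exp2(y);
}

int main()
{
    const float tests[] = { -3.25f, -0.5f, 0.0f, 1.0f, 4.75f };
    for (float v : tests)
        std::printf("exp2_split(%g) = %g   exp2f(%g) = %g\n", v, exp2_split(v), v, std::exp2(v));
}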

Here is the base-e version that was added to the library, ``XMVectorExpE``:

inline XMVECTOR XMVectorExpE
(
    FXMVECTOR V
)
{
#if defined(_XM_NO_INTRINSICS_)

    XMVECTOR Result;
    Result.vector4_f32[0] = expf(V.vector4_f32[0]);
    Result.vector4_f32[1] = expf(V.vector4_f32[1]);
    Result.vector4_f32[2] = expf(V.vector4_f32[2]);
    Result.vector4_f32[3] = expf(V.vector4_f32[3]);
    return Result;

#elif defined(_XM_ARM_NEON_INTRINSICS_)
    // expE(V) = exp2(vin*log2(e))
    float32x4_t Ve = vmulq_f32(g_XMLgE, V);

    int32x4_t itrunc = vcvtq_s32_f32(Ve);
    float32x4_t ftrunc = vcvtq_f32_s32(itrunc);
    float32x4_t y = vsubq_f32(Ve, ftrunc);


    float32x4_t poly = vmlaq_f32(g_XMExpEst6, g_XMExpEst7, y);
    poly = vmlaq_f32(g_XMExpEst5, poly, y);
    poly = vmlaq_f32(g_XMExpEst4, poly, y);
    poly = vmlaq_f32(g_XMExpEst3, poly, y);
    poly = vmlaq_f32(g_XMExpEst2, poly, y);
    poly = vmlaq_f32(g_XMExpEst1, poly, y);
    poly = vmlaq_f32(g_XMOne, poly, y);

    int32x4_t biased = vaddq_s32(itrunc, g_XMExponentBias);
    biased = vshlq_n_s32(biased, 23);
    float32x4_t result0 = XMVectorDivide(biased, poly);

    biased = vaddq_s32(itrunc, g_XM253);
    biased = vshlq_n_s32(biased, 23);
    float32x4_t result1 = XMVectorDivide(biased, poly);
    result1 = vmulq_f32(g_XMMinNormal.v, result1);

    // Use selection to handle the cases
    //  if (V is NaN) -> QNaN;
    //  else if (V sign bit set)
    //      if (V > -150)
    //          if (V.exponent < -126) -> result1
    //          else -> result0
    //      else -> +0
    //  else
    //      if (V < 128) -> result0
    //      else -> +inf

    int32x4_t comp = vcltq_s32(Ve, g_XMBin128);
    float32x4_t result2 = vbslq_f32(comp, result0, g_XMInfinity);

    comp = vcltq_s32(itrunc, g_XMSubnormalExponent);
    float32x4_t result3 = vbslq_f32(comp, result1, result0);

    comp = vcltq_s32(Ve, g_XMBinNeg150);
    float32x4_t result4 = vbslq_f32(comp, result3, g_XMZero);

    int32x4_t sign = vandq_s32(Ve, g_XMNegativeZero);
    comp = vceqq_s32(sign, g_XMNegativeZero);
    float32x4_t result5 = vbslq_f32(comp, result4, result2);

    int32x4_t t0 = vandq_s32(Ve, g_XMQNaNTest);
    int32x4_t t1 = vandq_s32(Ve, g_XMInfinity);
    t0 = vceqq_s32(t0, g_XMZero);
    t1 = vceqq_s32(t1, g_XMInfinity);
    int32x4_t isNaN = vbicq_s32(t1, t0);

    float32x4_t vResult = vbslq_f32(isNaN, g_XMQNaN, result5);
    return vResult;
#elif defined(_XM_SSE_INTRINSICS_)
    // expE(V) = exp2(vin*log2(e))
    __m128 Ve = _mm_mul_ps(g_XMLgE, V);

    __m128i itrunc = _mm_cvttps_epi32(Ve);
    __m128 ftrunc = _mm_cvtepi32_ps(itrunc);
    __m128 y = _mm_sub_ps(Ve, ftrunc);
    __m128 poly = _mm_mul_ps(g_XMExpEst7, y);
    poly = _mm_add_ps(g_XMExpEst6, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMExpEst5, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMExpEst4, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMExpEst3, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMExpEst2, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMExpEst1, poly);
    poly = _mm_mul_ps(poly, y);
    poly = _mm_add_ps(g_XMOne, poly);

    __m128i biased = _mm_add_epi32(itrunc, g_XMExponentBias);
    biased = _mm_slli_epi32(biased, 23);
    __m128 result0 = _mm_div_ps(_mm_castsi128_ps(biased), poly);

    biased = _mm_add_epi32(itrunc, g_XM253);
    biased = _mm_slli_epi32(biased, 23);
    __m128 result1 = _mm_div_ps(_mm_castsi128_ps(biased), poly);
    result1 = _mm_mul_ps(g_XMMinNormal.v, result1);

    // Use selection to handle the cases
    //  if (V is NaN) -> QNaN;
    //  else if (V sign bit set)
    //      if (V > -150)
    //          if (V.exponent < -126) -> result1
    //          else -> result0
    //      else -> +0
    //  else
    //      if (V < 128) -> result0
    //      else -> +inf

    __m128i comp = _mm_cmplt_epi32(_mm_castps_si128(Ve), g_XMBin128);
    __m128i select0 = _mm_and_si128(comp, _mm_castps_si128(result0));
    __m128i select1 = _mm_andnot_si128(comp, g_XMInfinity);
    __m128i result2 = _mm_or_si128(select0, select1);

    comp = _mm_cmplt_epi32(itrunc, g_XMSubnormalExponent);
    select1 = _mm_and_si128(comp, _mm_castps_si128(result1));
    select0 = _mm_andnot_si128(comp, _mm_castps_si128(result0));
    __m128i result3 = _mm_or_si128(select0, select1);

    comp = _mm_cmplt_epi32(_mm_castps_si128(Ve), g_XMBinNeg150);
    select0 = _mm_and_si128(comp, result3);
    select1 = _mm_andnot_si128(comp, g_XMZero);
    __m128i result4 = _mm_or_si128(select0, select1);

    __m128i sign = _mm_and_si128(_mm_castps_si128(Ve), g_XMNegativeZero);
    comp = _mm_cmpeq_epi32(sign, g_XMNegativeZero);
    select0 = _mm_and_si128(comp, result4);
    select1 = _mm_andnot_si128(comp, result2);
    __m128i result5 = _mm_or_si128(select0, select1);

    __m128i t0 = _mm_and_si128(_mm_castps_si128(Ve), g_XMQNaNTest);
    __m128i t1 = _mm_and_si128(_mm_castps_si128(Ve), g_XMInfinity);
    t0 = _mm_cmpeq_epi32(t0, g_XMZero);
    t1 = _mm_cmpeq_epi32(t1, g_XMInfinity);
    __m128i isNaN = _mm_andnot_si128(t0, t1);

    select0 = _mm_and_si128(isNaN, g_XMQNaN);
    select1 = _mm_andnot_si128(isNaN, result5);
    __m128i vResult = _mm_or_si128(select0, select1);

    return _mm_castsi128_ps(vResult);
#else // _XM_VMX128_INTRINSICS_
#endif // _XM_VMX128_INTRINSICS_
}

``XMVectorLog`` was renamed ``XMVectorLog2`` and optimized for SSE and ARM-NEON.

inline XMVECTOR XMVectorLog2
(
    FXMVECTOR V
)
{
#if defined(_XM_NO_INTRINSICS_)

    const float fScale = 1.4426950f; // (1.0f / logf(2.0f));

    XMVECTOR Result;
    Result.vector4_f32[0] = logf(V.vector4_f32[0])*fScale;
    Result.vector4_f32[1] = logf(V.vector4_f32[1])*fScale;
    Result.vector4_f32[2] = logf(V.vector4_f32[2])*fScale;
    Result.vector4_f32[3] = logf(V.vector4_f32[3])*fScale;
    return Result;

#elif defined(_XM_ARM_NEON_INTRINSICS_)
    int32x4_t rawBiased = vandq_s32(V, g_XMInfinity);
    int32x4_t trailing = vandq_s32(V, g_XMQNaNTest);
    int32x4_t isExponentZero = vceqq_s32(g_XMZero, rawBiased);

    // Compute exponent and significand for normals.
    int32x4_t biased = vshrq_n_u32(rawBiased, 23);
    int32x4_t exponentNor = vsubq_s32(biased, g_XMExponentBias);
    int32x4_t trailingNor = trailing;

    // Compute exponent and significand for subnormals.
    int32x4_t leading = Internal::GetLeadingBit(trailing);
    int32x4_t shift = vsubq_s32(g_XMNumTrailing, leading);
    int32x4_t exponentSub = vsubq_s32(g_XMSubnormalExponent, shift);
    int32x4_t trailingSub = vshlq_u32(trailing, shift);
    trailingSub = vandq_s32(trailingSub, g_XMQNaNTest);
    int32x4_t e = vbslq_f32(isExponentZero, exponentSub, exponentNor);
    int32x4_t t = vbslq_f32(isExponentZero, trailingSub, trailingNor);

    // Compute the approximation.
    int32x4_t tmp = vorrq_s32(g_XMOne, t);
    float32x4_t y = vsubq_f32(tmp, g_XMOne);

    float32x4_t log2 = vmlaq_f32(g_XMLogEst6, g_XMLogEst7, y);
    log2 = vmlaq_f32(g_XMLogEst5, log2, y);
    log2 = vmlaq_f32(g_XMLogEst4, log2, y);
    log2 = vmlaq_f32(g_XMLogEst3, log2, y);
    log2 = vmlaq_f32(g_XMLogEst2, log2, y);
    log2 = vmlaq_f32(g_XMLogEst1, log2, y);
    log2 = vmlaq_f32(g_XMLogEst0, log2, y);
    log2 = vmlaq_f32(vcvtq_f32_s32(e), log2, y);

    // if (x is NaN) -> QNaN
    // else if (V is positive)
    //     if (V is infinite) -> +inf
    //     else -> log2(V)
    // else
    //     if (V is zero) -> -inf
    //     else -> -QNaN

    int32x4_t isInfinite = vandq_s32((V), g_XMAbsMask);
    isInfinite = vceqq_s32(isInfinite, g_XMInfinity);

    int32x4_t isGreaterZero = vcgtq_s32((V), g_XMZero);
    int32x4_t isNotFinite = vcgtq_s32((V), g_XMInfinity);
    int32x4_t isPositive = vbicq_s32(isGreaterZero, isNotFinite);

    int32x4_t isZero = vandq_s32((V), g_XMAbsMask);
    isZero = vceqq_s32(isZero, g_XMZero);

    int32x4_t t0 = vandq_s32((V), g_XMQNaNTest);
    int32x4_t t1 = vandq_s32((V), g_XMInfinity);
    t0 = vceqq_s32(t0, g_XMZero);
    t1 = vceqq_s32(t1, g_XMInfinity);
    int32x4_t isNaN = vbicq_s32(t1, t0);

    float32x4_t result = vbslq_f32(isInfinite, g_XMInfinity, log2);
    tmp = vbslq_f32(isZero, g_XMNegInfinity, g_XMNegQNaN);
    result = vbslq_f32(isPositive, result, tmp);
    result = vbslq_f32(isNaN, g_XMQNaN, result);
    return result;
#elif defined(_XM_SSE_INTRINSICS_)
    __m128i rawBiased = _mm_and_si128(_mm_castps_si128(V), g_XMInfinity);
    __m128i trailing = _mm_and_si128(_mm_castps_si128(V), g_XMQNaNTest);
    __m128i isExponentZero = _mm_cmpeq_epi32(g_XMZero, rawBiased);

    // Compute exponent and significand for normals.
    __m128i biased = _mm_srli_epi32(rawBiased, 23);
    __m128i exponentNor = _mm_sub_epi32(biased, g_XMExponentBias);
    __m128i trailingNor = trailing;

    // Compute exponent and significand for subnormals.
    __m128i leading = Internal::GetLeadingBit(trailing);
    __m128i shift = _mm_sub_epi32(g_XMNumTrailing, leading);
    __m128i exponentSub = _mm_sub_epi32(g_XMSubnormalExponent, shift);
    __m128i trailingSub = Internal::multi_sll_epi32(trailing, shift);
    trailingSub = _mm_and_si128(trailingSub, g_XMQNaNTest);

    __m128i select0 = _mm_and_si128(isExponentZero, exponentSub);
    __m128i select1 = _mm_andnot_si128(isExponentZero, exponentNor);
    __m128i e = _mm_or_si128(select0, select1);

    select0 = _mm_and_si128(isExponentZero, trailingSub);
    select1 = _mm_andnot_si128(isExponentZero, trailingNor);
    __m128i t = _mm_or_si128(select0, select1);

    // Compute the approximation.
    __m128i tmp = _mm_or_si128(g_XMOne, t);
    __m128 y = _mm_sub_ps(_mm_castsi128_ps(tmp), g_XMOne);

    __m128 log2 = _mm_mul_ps(g_XMLogEst7, y);
    log2 = _mm_add_ps(g_XMLogEst6, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst5, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst4, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst3, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst2, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst1, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst0, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(log2, _mm_cvtepi32_ps(e));

    // if (x is NaN) -> QNaN
    // else if (V is positive)
    //     if (V is infinite) -> +inf
    //     else -> log2(V)
    // else
    //     if (V is zero) -> -inf
    //     else -> -QNaN

    __m128i isInfinite = _mm_and_si128(_mm_castps_si128(V), g_XMAbsMask);
    isInfinite = _mm_cmpeq_epi32(isInfinite, g_XMInfinity);

    __m128i isGreaterZero = _mm_cmpgt_epi32(_mm_castps_si128(V), g_XMZero);
    __m128i isNotFinite = _mm_cmpgt_epi32(_mm_castps_si128(V), g_XMInfinity);
    __m128i isPositive = _mm_andnot_si128(isNotFinite, isGreaterZero);

    __m128i isZero = _mm_and_si128(_mm_castps_si128(V), g_XMAbsMask);
    isZero = _mm_cmpeq_epi32(isZero, g_XMZero);

    __m128i t0 = _mm_and_si128(_mm_castps_si128(V), g_XMQNaNTest);
    __m128i t1 = _mm_and_si128(_mm_castps_si128(V), g_XMInfinity);
    t0 = _mm_cmpeq_epi32(t0, g_XMZero);
    t1 = _mm_cmpeq_epi32(t1, g_XMInfinity);
    __m128i isNaN = _mm_andnot_si128(t0, t1);

    select0 = _mm_and_si128(isInfinite, g_XMInfinity);
    select1 = _mm_andnot_si128(isInfinite, _mm_castps_si128(log2));
    __m128i result = _mm_or_si128(select0, select1);

    select0 = _mm_and_si128(isZero, g_XMNegInfinity);
    select1 = _mm_andnot_si128(isZero, g_XMNegQNaN);
    tmp = _mm_or_si128(select0, select1);

    select0 = _mm_and_si128(isPositive, result);
    select1 = _mm_andnot_si128(isPositive, tmp);
    result = _mm_or_si128(select0, select1);

    select0 = _mm_and_si128(isNaN, g_XMQNaN);
    select1 = _mm_andnot_si128(isNaN, result);
    result = _mm_or_si128(select0, select1);

    return _mm_castsi128_ps(result);
#else // _XM_VMX128_INTRINSICS_
#endif // _XM_VMX128_INTRINSICS_
}
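
The approach is a frexp-style decomposition: write x = 2^e * (1 + y) with y in [0, 1) by splitting the raw bits into the unbiased exponent e and the significand, so that log2(x) = e + log2(1 + y), with the ``g_XMLogEst0``..``g_XMLogEst7`` constants forming the polynomial approximation of log2(1 + y). Subnormal inputs are renormalized first (the ``exponentSub``/``trailingSub`` path), and the final selection handles zero, negative, infinite, and NaN inputs. A scalar sketch of just the main path (illustrative only, not library code):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Valid only for positive, finite, normal inputs; std::log2 on (1 + y) stands
// in for the library's polynomial.
float log2_split(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));

    int e = static_cast<int>((bits >> 23) & 0xFF) - 127;    // unbiased exponent
    uint32_t mantBits = (bits & 0x007FFFFF) | 0x3F800000;   // significand rebased to [1, 2)
    float onePlusY;
    std::memcpy(&onePlusY, &mantBits, sizeof(onePlusY));

    return static_cast<float>(e) + std::log2(onePlusY);     // e + log2(1 + y)
}

int main()
{
    const float tests[] = { 0.75f, 1.0f, 8.0f, 1000.0f };
    for (float v : tests)
        std::printf("log2_split(%g) = %g   log2f(%g) = %g\n", v, log2_split(v), v, std::log2(v));
}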

And here is the base-e version that was added, ``XMVectorLogE``:

inline XMVECTOR XMVectorLogE
(
    FXMVECTOR V
)
{
#if defined(_XM_NO_INTRINSICS_)

    XMVECTOR Result;
    Result.vector4_f32[0] = logf(V.vector4_f32[0]);
    Result.vector4_f32[1] = logf(V.vector4_f32[1]);
    Result.vector4_f32[2] = logf(V.vector4_f32[2]);
    Result.vector4_f32[3] = logf(V.vector4_f32[3]);
    return Result;

#elif defined(_XM_ARM_NEON_INTRINSICS_)
    int32x4_t rawBiased = vandq_s32(V, g_XMInfinity);
    int32x4_t trailing = vandq_s32(V, g_XMQNaNTest);
    int32x4_t isExponentZero = vceqq_s32(g_XMZero, rawBiased);

    // Compute exponent and significand for normals.
    int32x4_t biased = vshrq_n_u32(rawBiased, 23);
    int32x4_t exponentNor = vsubq_s32(biased, g_XMExponentBias);
    int32x4_t trailingNor = trailing;

    // Compute exponent and significand for subnormals.
    int32x4_t leading = Internal::GetLeadingBit(trailing);
    int32x4_t shift = vsubq_s32(g_XMNumTrailing, leading);
    int32x4_t exponentSub = vsubq_s32(g_XMSubnormalExponent, shift);
    int32x4_t trailingSub = vshlq_u32(trailing, shift);
    trailingSub = vandq_s32(trailingSub, g_XMQNaNTest);
    int32x4_t e = vbslq_f32(isExponentZero, exponentSub, exponentNor);
    int32x4_t t = vbslq_f32(isExponentZero, trailingSub, trailingNor);

    // Compute the approximation.
    int32x4_t tmp = vorrq_s32(g_XMOne, t);
    float32x4_t y = vsubq_f32(tmp, g_XMOne);

    float32x4_t log2 = vmlaq_f32(g_XMLogEst6, g_XMLogEst7, y);
    log2 = vmlaq_f32(g_XMLogEst5, log2, y);
    log2 = vmlaq_f32(g_XMLogEst4, log2, y);
    log2 = vmlaq_f32(g_XMLogEst3, log2, y);
    log2 = vmlaq_f32(g_XMLogEst2, log2, y);
    log2 = vmlaq_f32(g_XMLogEst1, log2, y);
    log2 = vmlaq_f32(g_XMLogEst0, log2, y);
    log2 = vmlaq_f32(vcvtq_f32_s32(e), log2, y);

    log2 = vmulq_f32(g_XMInvLgE, log2);

    // if (x is NaN) -> QNaN
    // else if (V is positive)
    //     if (V is infinite) -> +inf
    //     else -> log2(V)
    // else
    //     if (V is zero) -> -inf
    //     else -> -QNaN

    int32x4_t isInfinite = vandq_s32((V), g_XMAbsMask);
    isInfinite = vceqq_s32(isInfinite, g_XMInfinity);

    int32x4_t isGreaterZero = vcgtq_s32((V), g_XMZero);
    int32x4_t isNotFinite = vcgtq_s32((V), g_XMInfinity);
    int32x4_t isPositive = vbicq_s32(isGreaterZero, isNotFinite);

    int32x4_t isZero = vandq_s32((V), g_XMAbsMask);
    isZero = vceqq_s32(isZero, g_XMZero);

    int32x4_t t0 = vandq_s32((V), g_XMQNaNTest);
    int32x4_t t1 = vandq_s32((V), g_XMInfinity);
    t0 = vceqq_s32(t0, g_XMZero);
    t1 = vceqq_s32(t1, g_XMInfinity);
    int32x4_t isNaN = vbicq_s32(t1, t0);

    float32x4_t result = vbslq_f32(isInfinite, g_XMInfinity, log2);
    tmp = vbslq_f32(isZero, g_XMNegInfinity, g_XMNegQNaN);
    result = vbslq_f32(isPositive, result, tmp);
    result = vbslq_f32(isNaN, g_XMQNaN, result);
    return result;
#elif defined(_XM_SSE_INTRINSICS_)
    __m128i rawBiased = _mm_and_si128(_mm_castps_si128(V), g_XMInfinity);
    __m128i trailing = _mm_and_si128(_mm_castps_si128(V), g_XMQNaNTest);
    __m128i isExponentZero = _mm_cmpeq_epi32(g_XMZero, rawBiased);

    // Compute exponent and significand for normals.
    __m128i biased = _mm_srli_epi32(rawBiased, 23);
    __m128i exponentNor = _mm_sub_epi32(biased, g_XMExponentBias);
    __m128i trailingNor = trailing;

    // Compute exponent and significand for subnormals.
    __m128i leading = Internal::GetLeadingBit(trailing);
    __m128i shift = _mm_sub_epi32(g_XMNumTrailing, leading);
    __m128i exponentSub = _mm_sub_epi32(g_XMSubnormalExponent, shift);
    __m128i trailingSub = Internal::multi_sll_epi32(trailing, shift);
    trailingSub = _mm_and_si128(trailingSub, g_XMQNaNTest);

    __m128i select0 = _mm_and_si128(isExponentZero, exponentSub);
    __m128i select1 = _mm_andnot_si128(isExponentZero, exponentNor);
    __m128i e = _mm_or_si128(select0, select1);

    select0 = _mm_and_si128(isExponentZero, trailingSub);
    select1 = _mm_andnot_si128(isExponentZero, trailingNor);
    __m128i t = _mm_or_si128(select0, select1);

    // Compute the approximation.
    __m128i tmp = _mm_or_si128(g_XMOne, t);
    __m128 y = _mm_sub_ps(_mm_castsi128_ps(tmp), g_XMOne);

    __m128 log2 = _mm_mul_ps(g_XMLogEst7, y);
    log2 = _mm_add_ps(g_XMLogEst6, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst5, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst4, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst3, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst2, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst1, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(g_XMLogEst0, log2);
    log2 = _mm_mul_ps(log2, y);
    log2 = _mm_add_ps(log2, _mm_cvtepi32_ps(e));

    log2 = _mm_mul_ps(g_XMInvLgE, log2);

    // if (x is NaN) -> QNaN
    // else if (V is positive)
    //     if (V is infinite) -> +inf
    //     else -> log2(V)
    // else
    //     if (V is zero) -> -inf
    //     else -> -QNaN

    __m128i isInfinite = _mm_and_si128(_mm_castps_si128(V), g_XMAbsMask);
    isInfinite = _mm_cmpeq_epi32(isInfinite, g_XMInfinity);

    __m128i isGreaterZero = _mm_cmpgt_epi32(_mm_castps_si128(V), g_XMZero);
    __m128i isNotFinite = _mm_cmpgt_epi32(_mm_castps_si128(V), g_XMInfinity);
    __m128i isPositive = _mm_andnot_si128(isNotFinite, isGreaterZero);

    __m128i isZero = _mm_and_si128(_mm_castps_si128(V), g_XMAbsMask);
    isZero = _mm_cmpeq_epi32(isZero, g_XMZero);

    __m128i t0 = _mm_and_si128(_mm_castps_si128(V), g_XMQNaNTest);
    __m128i t1 = _mm_and_si128(_mm_castps_si128(V), g_XMInfinity);
    t0 = _mm_cmpeq_epi32(t0, g_XMZero);
    t1 = _mm_cmpeq_epi32(t1, g_XMInfinity);
    __m128i isNaN = _mm_andnot_si128(t0, t1);

    select0 = _mm_and_si128(isInfinite, g_XMInfinity);
    select1 = _mm_andnot_si128(isInfinite, _mm_castps_si128(log2));
    __m128i result = _mm_or_si128(select0, select1);

    select0 = _mm_and_si128(isZero, g_XMNegInfinity);
    select1 = _mm_andnot_si128(isZero, g_XMNegQNaN);
    tmp = _mm_or_si128(select0, select1);

    select0 = _mm_and_si128(isPositive, result);
    select1 = _mm_andnot_si128(isPositive, tmp);
    result = _mm_or_si128(select0, select1);

    select0 = _mm_and_si128(isNaN, g_XMQNaN);
    select1 = _mm_andnot_si128(isNaN, result);
    result = _mm_or_si128(select0, select1);

    return _mm_castsi128_ps(result);
#else // _XM_VMX128_INTRINSICS_
#endif // _XM_VMX128_INTRINSICS_
}
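
Both base-e variants are thin wrappers around the base-2 kernels: ``XMVectorExpE`` scales the input by ``g_XMLgE`` (log2 e) on the way in, and ``XMVectorLogE`` scales the result by ``g_XMInvLgE`` (ln 2) on the way out. A quick scalar check of the identities involved (illustrative only):

#include <cmath>
#include <cstdio>

// exp(x) = exp2(x * log2(e))    -- g_XMLgE    ~= 1.442695
// log(x) = log2(x) * ln(2)      -- g_XMInvLgE ~= 0.693147
int main()
{
    float x = 2.5f;
    std::printf("expf: %g vs %g\n", std::exp(x), std::exp2(x * 1.442695f));
    std::printf("logf: %g vs %g\n", std::log(x), std::log2(x) * 0.693147f);
}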

These all make use of new constants:

XMGLOBALCONST XMVECTORI32 g_XMExponentBias = {127, 127, 127, 127};
XMGLOBALCONST XMVECTORI32 g_XMSubnormalExponent = {-126, -126, -126, -126};
XMGLOBALCONST XMVECTORI32 g_XMNumTrailing = {23, 23, 23, 23};
XMGLOBALCONST XMVECTORI32 g_XMMinNormal = {0x00800000, 0x00800000, 0x00800000, 0x00800000};
XMGLOBALCONST XMVECTORI32 g_XMNegInfinity = {0xFF800000, 0xFF800000, 0xFF800000, 0xFF800000};
XMGLOBALCONST XMVECTORI32 g_XMNegQNaN = {0xFFC00000, 0xFFC00000, 0xFFC00000, 0xFFC00000};
XMGLOBALCONST XMVECTORI32 g_XMBin128 = {0x43000000, 0x43000000, 0x43000000, 0x43000000};
XMGLOBALCONST XMVECTORI32 g_XMBinNeg150 = {0xC3160000, 0xC3160000, 0xC3160000, 0xC3160000};
XMGLOBALCONST XMVECTORI32 g_XM253 = {253, 253, 253, 253};
XMGLOBALCONST XMVECTORF32 g_XMExpEst1 = {-6.93147182e-1f, -6.93147182e-1f, -6.93147182e-1f, -6.93147182e-1f};
XMGLOBALCONST XMVECTORF32 g_XMExpEst2 = {+2.40226462e-1f, +2.40226462e-1f, +2.40226462e-1f, +2.40226462e-1f};
XMGLOBALCONST XMVECTORF32 g_XMExpEst3 = {-5.55036440e-2f, -5.55036440e-2f, -5.55036440e-2f, -5.55036440e-2f};
XMGLOBALCONST XMVECTORF32 g_XMExpEst4 = {+9.61597636e-3f, +9.61597636e-3f, +9.61597636e-3f, +9.61597636e-3f};
XMGLOBALCONST XMVECTORF32 g_XMExpEst5 = {-1.32823968e-3f, -1.32823968e-3f, -1.32823968e-3f, -1.32823968e-3f};
XMGLOBALCONST XMVECTORF32 g_XMExpEst6 = {+1.47491097e-4f, +1.47491097e-4f, +1.47491097e-4f, +1.47491097e-4f};
XMGLOBALCONST XMVECTORF32 g_XMExpEst7 = {-1.08635004e-5f, -1.08635004e-5f, -1.08635004e-5f, -1.08635004e-5f};
XMGLOBALCONST XMVECTORF32 g_XMLogEst0 = {+1.442693f, +1.442693f, +1.442693f, +1.442693f};
XMGLOBALCONST XMVECTORF32 g_XMLogEst1 = {-0.721242f, -0.721242f, -0.721242f, -0.721242f};
XMGLOBALCONST XMVECTORF32 g_XMLogEst2 = {+0.479384f, +0.479384f, +0.479384f, +0.479384f};
XMGLOBALCONST XMVECTORF32 g_XMLogEst3 = {-0.350295f, -0.350295f, -0.350295f, -0.350295f};
XMGLOBALCONST XMVECTORF32 g_XMLogEst4 = {+0.248590f, +0.248590f, +0.248590f, +0.248590f};
XMGLOBALCONST XMVECTORF32 g_XMLogEst5 = {-0.145700f, -0.145700f, -0.145700f, -0.145700f};
XMGLOBALCONST XMVECTORF32 g_XMLogEst6 = {+0.057148f, +0.057148f, +0.057148f, +0.057148f};
XMGLOBALCONST XMVECTORF32 g_XMLogEst7 = {-0.010578f, -0.010578f, -0.010578f, -0.010578f};
XMGLOBALCONST XMVECTORF32 g_XMLgE = {+1.442695f, +1.442695f, +1.442695f, +1.442695f};
XMGLOBALCONST XMVECTORF32 g_XMInvLgE = {+6.93147182e-1f, +6.93147182e-1f, +6.93147182e-1f, +6.93147182e-1f};
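
Several of the integer constants are really float bit patterns compared against the raw bits of the input in the exponential code paths: 0x43000000 is 128.0f (the overflow threshold where exp2 returns +inf), 0xC3160000 is -150.0f (below which even a subnormal result underflows to +0), and ``g_XMMinNormal``'s 0x00800000 is FLT_MIN, used to rescale the subnormal result path. A quick check (illustrative only):

#include <cfloat>
#include <cstdint>
#include <cstdio>
#include <cstring>

static float from_bits(uint32_t u) { float f; std::memcpy(&f, &u, sizeof(f)); return f; }

int main()
{
    std::printf("0x43000000 = %g\n", from_bits(0x43000000u));   // 128
    std::printf("0xC3160000 = %g\n", from_bits(0xC3160000u));   // -150
    std::printf("0x00800000 = %g (FLT_MIN = %g)\n", from_bits(0x00800000u), FLT_MIN);
}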

GitHub: Note that DirectXMath is now hosted on GitHub.

Related: Visual Studio 2013 and Windows 8.1 SDK RTM, DirectXMath: SSE, SSE2, and ARM-NEON, Introducing DirectXMath, DirectXMath 3.07, DirectXMath 3.08, DirectXMath 3.09, DirectXMath 3.10, DirectXMath 3.11

Note that both DirectXTex and DirectXTK have been updated to work with DirectXMath 3.06 when available, including support for the ``__vectorcall`` calling convention.