Branchless Matrix to Quaternion Conversion

(EDIT: This arti­cle is a more in-depth write­up of an algo­rithm that I devel­oped around 2005, and first post­ed to Mar­tin Baker’s Euclid­ean Space web­site. That time was the height of the Intel Net­Burst archi­tec­ture, which was noto­ri­ous for its deep pipeline and high branch mis­pre­dic­tion penal­ty. Hence the moti­va­tion to devel­op a branch-free matrix to quater­nion con­ver­sion rou­tine. What fol­lows is the com­plete deriva­tion and analy­sis of this idea.)

The orig­i­nal rou­tine to con­vert a matrix to a quater­nion was giv­en by Ken Shoe­make [1] and is very branchy. There is a way to elim­i­nate these branch­es and arrive at a com­plete­ly branch-free and high­ly par­al­leliz­able code. The trade off is the intro­duc­tion of 3 addi­tion­al square roots. Jump to the analy­sis sec­tion and the end of this arti­cle, or con­tin­ue fist with the math bits.

Trade 3 roots for a branch

When a quater­nion is con­vert­ed to a matrix, the terms that final­ly end up in the matrix can be added and sub­tract­ed to arrive at a wealth of iden­ti­ties. The ele­ments in the matrix diag­o­nal are relat­ed to the quater­nion ele­ments by

    \begin{align*} r &= \frac{1}{2}\sqrt{1+m_{00}+m_{11}+m_{22}}, \\ i &= \frac{1}{2}\sqrt{1+m_{00}-m_{11}-m_{22}}, \\ j &= \frac{1}{2}\sqrt{1-m_{00}+m_{11}-m_{22}}, \\ k &= \frac{1}{2}\sqrt{1-m_{00}-m_{11}+m_{22}}. \end{align*}

The off-diag­o­nal matrix ele­ments are relat­ed to the quater­nion ele­ments by

    \begin{align*} i &= \frac{1}{4r}(m_{21}-m_{12}) , & i &= \frac{1}{4j}(m_{10}+m_{01}) , \\ j &= \frac{1}{4r}(m_{02}-m_{20}) , & j &= \frac{1}{4k}(m_{21}+m_{12}) , \\ k &= \frac{1}{4r}(m_{10}-m_{01}) , & k &= \frac{1}{4i}(m_{20}+m_{02}) . \end{align*}

It is obvi­ous that there are mul­ti­ple paths to restore the quater­nion from a matrix. One could first restore r from the diag­o­nal ele­ments, and then use the off-diag­o­nal ele­ments to restore i, j and k. Or first restore i from the diag­o­nal ele­ments, and then use the off-diag­o­nal for the rest, etc. The stan­dard, branchy con­ver­sion code just does that. Can we elim­i­nate the branch­es and just use the diag­o­nal ele­ments for every­thing? In the­o­ry, yes, the only thing miss­ing would be the signs.

void BranchlessMatrixToQuaternion( float out[4], float in[3][3] )
    out[0] = .5 * sqrt( max( 0, 1 + in[0][0] + in[1][1] + in[2][2] ) ); // r
    out[1] = .5 * sqrt( max( 0, 1 + in[0][0] - in[1][1] - in[2][2] ) ); // i
    out[2] = .5 * sqrt( max( 0, 1 - in[0][0] + in[1][1] - in[2][2] ) ); // j
    out[3] = .5 * sqrt( max( 0, 1 - in[0][0] - in[1][1] + in[2][2] ) ); // k
    out[1] = copysign( out[1], in[2][1] - in[1][2] ); // i
    out[2] = copysign( out[2], in[0][2] - in[2][0] ); // j
    out[3] = copysign( out[3], in[1][0] - in[0][1] ); // k

Math­e­mat­i­cal­ly, if the matrix \mathbf{M} has a deter­mi­nant of one, then 1 + \mathrm{tr}(\mathbf{M}) can­not be neg­a­tive, and all four square roots are defined. In prac­tice there can be round­ing errors, so to be safe the terms are clamped with max( 0, ... ) before the square roots are tak­en. The off diag­o­nal ele­ments are only used to restore the miss­ing signs. Since only the signs mat­ter, the divi­sion by r can be lift­ed! (The func­tion copysign is sup­posed to trans­fer the sign bit from one float­ing point num­ber to anoth­er, and your com­pil­er should do a decent job and reduce it to some sequence of bit­wise and/or instruc­tions. Clang and GCC do this on x86 with SSE float­ing point. See the com­ment sec­tion for a dis­cus­sion about this.)

Com­pared with the orig­i­nal con­ver­sion code, there are four square roots instead of one square root and one divi­sion, but there aren’t any branch­es. The code pipelines well (all the square roots can run in par­al­lel, fol­lowed by all the copy­signs).

Operation Counts

In the fol­low­ing table of oper­a­tion counts, I have includ­ed a max( 0, ... ) safe­guard also in the Shoe­make algo­rithm (the orig­i­nal doesn’t have this). The copysign is regard­ed as a sin­gle oper­a­tion.

Shoe­make Branch­less
add/sub/mul/max 11..13 23
div/sqrt 2 4
float­ing point
1..3 0
branch 1..3 0
copy­sign 0 3
steps in longest
depen­den­cy chain
8..15 6

Let’s assume the laten­cy of sqrt and div as 15 cycles (sin­gle pre­ci­sion), the laten­cy of a branch as 2 cycles with a mis­pre­dic­tion penal­ty of 15 cycles, and the aver­age laten­cy of every­thing else as 3 cycles. These num­bers are in the ball­park of Intel Wolf­dale resp. AMD Jaguar proces­sors [2]. Depend­ing on whether the exe­cu­tion is assumed to be either per­fect­ly ser­i­al and in-order (a sim­ple addi­tion of all the laten­cies), or assumed to be per­fect­ly par­al­lel and out-of-order (the sum of the laten­cies in the longest depen­den­cy chain), we get the cycle counts shown in the table below. The real­i­ty should be any­where between these extremes.

Shoe­make Branch­less
per­fect­ly ser­i­al (in-order) 68..129 138
per­fect­ly par­al­lel (out of order) 49..88 30

Numerical Stability

The catch of the branch­less algo­rithm is that the sin­gu­lar­i­ty at the rota­tion angle of 180° is still there. Although the divi­sion by r was lift­ed and so a divi­sion by zero can­not occur, it now appears as ran­dom signs from cat­a­stroph­ic can­cel­la­tion. Take for exam­ple, a 180° rota­tion around the x-y diag­o­nal,

    \[\begin{bmatrix} 0 & 1 & \epsilon \\ 1 & 0 & -\epsilon \\ -\epsilon & \epsilon & -1 \end{bmatrix}.\]

In an ide­al world, when the rota­tion angle trav­els through a small neigh­bor­hood around the point of 180°, the val­ue of \epsilon van­ish­es and changes sign. The i and j com­po­nents of the result­ing quater­nion should then flip signs simul­ta­ne­ous­ly too. In real­i­ty how­ev­er, with just the tini­est amount of round­ing error, these signs are qua­si ran­dom. The behav­ior can be described like this:

just before 180°
(con­sis­tent signs)
at 180°
(ran­dom signs)
just after 180°
(con­sis­tent signs)
[ 0 \;\; +i \;\; +j \;\; 0 ] [ 0 \;\; \pm i \;\; \pm j \;\; 0 ] [ 0 \;\; -i \;\; -j \;\; 0 ]


A con­fig­u­ra­tion which pro­vokes the sin­gu­lar­i­ty described above is sur­pris­ing­ly com­mon as an exchange of coor­di­nate axes (con­ver­sion from Z-up to Y-up, for exam­ple). Where the branch­less algo­rithm shines how­ev­er is the con­ver­sion of local ani­ma­tion keys. These rarely cross a 180° rota­tion angle. In one use case for me, the skele­ton is the skele­ton of a plant and the ani­ma­tion is dri­ven by physics, and I know or can make sure that a rota­tion of 180° sim­ply doesn’t occur.

Giv­en this knowl­edge, the branch­less con­ver­sion code is rec­om­mend­ed over the stan­dard one, under the con­di­tion that the rota­tion angle does not cross 180°.

[1] Ken Shoe­make, “Quater­nions”,

[2] Instruc­tion Tables at Agn­er Opti­miza­tion

2 thoughts on “Branchless Matrix to Quaternion Conversion

  1. On PPC archi­tec­tures, I would rec­om­mend *not* doing copy­sign() with bit oper­a­tions, because that incurs a LHS penal­ty. On those archi­tec­tures, you are bet­ter of imple­ment­ing copy­sign() using fabs() and a float­ing-point select.

    Like the branch­less ver­sion, will come in handy in the future!

Leave a Reply

Your email address will not be published.