Branchless Matrix to Quaternion Conversion

(Edit: This arti­cle is a more in-depth write­up of an algo­rithm that I devel­oped around 2005, and first post­ed to Mar­tin Baker’s Euclid­ean Space web­site. That time was the height of the Intel Net­Burst archi­tec­ture, which was noto­ri­ous for its deep pipeline and high branch mis­pre­dic­tion penal­ty. Hence the moti­va­tion to devel­op a branch-free matrix to quater­nion con­ver­sion rou­tine. What fol­lows is the com­plete deriva­tion and analy­sis of this idea.)

The orig­i­nal rou­tine to con­vert a matrix to a quater­nion was giv­en by Ken Shoe­make [1] and is very branchy. There is a way to elim­i­nate these branch­es and arrive at a com­plete­ly branch-free and high­ly par­al­leliz­able code. The trade off is the intro­duc­tion of 3 addi­tion­al square roots. Jump to the analy­sis sec­tion and the end of this arti­cle, or con­tin­ue fist with the math bits.

Trade 3 roots for a branch

When a quater­nion with ele­ments r,i,j,k is con­vert­ed to a matrix M, the terms that final­ly end up in the matrix can be added and sub­tract­ed to arrive at a wealth of iden­ti­ties. The ele­ments in the matrix diag­o­nal are relat­ed to the quater­nion ele­ments by

    \begin{align*} r &= \frac{1}{2}\sqrt{1+M_{00}+M_{11}+M_{22}}, \\ i &= \frac{1}{2}\sqrt{1+M_{00}-M_{11}-M_{22}}, \\ j &= \frac{1}{2}\sqrt{1-M_{00}+M_{11}-M_{22}}, \\ k &= \frac{1}{2}\sqrt{1-M_{00}-M_{11}+M_{22}}. \end{align*}

The off-diag­o­nal matrix ele­ments are relat­ed to the quater­nion ele­ments by

    \begin{align*} i &= \frac{1}{4r}(M_{21}-M_{12}) , & i &= \frac{1}{4j}(M_{10}+M_{01}) , \\ j &= \frac{1}{4r}(M_{02}-M_{20}) , & j &= \frac{1}{4k}(M_{21}+M_{12}) , \\ k &= \frac{1}{4r}(M_{10}-M_{01}) , & k &= \frac{1}{4i}(M_{20}+M_{02}) . \end{align*}

It is obvi­ous that there are mul­ti­ple paths to restore the quater­nion from a matrix. One could first restore r from the diag­o­nal ele­ments, and then use the off-diag­o­nal ele­ments to restore i, j and k. Or first restore i from the diag­o­nal ele­ments, and then use the off-diag­o­nal for the rest, etc. The stan­dard, branchy con­ver­sion code just does that. Can we elim­i­nate the branch­es and just use the diag­o­nal ele­ments for every­thing? In the­o­ry, yes, the only thing miss­ing would be the signs.

void BranchlessMatrixToQuaternion( float out[4], float in[3][3] )
{	
    out[0] = .5 * sqrt( max( 0, 1 + in[0][0] + in[1][1] + in[2][2] ) ); // r
    out[1] = .5 * sqrt( max( 0, 1 + in[0][0] - in[1][1] - in[2][2] ) ); // i
    out[2] = .5 * sqrt( max( 0, 1 - in[0][0] + in[1][1] - in[2][2] ) ); // j
    out[3] = .5 * sqrt( max( 0, 1 - in[0][0] - in[1][1] + in[2][2] ) ); // k
    out[1] = copysign( out[1], in[2][1] - in[1][2] ); // i
    out[2] = copysign( out[2], in[0][2] - in[2][0] ); // j
    out[3] = copysign( out[3], in[1][0] - in[0][1] ); // k
}

Math­e­mat­i­cal­ly, if the matrix M has a deter­mi­nant of one, then 1 + \operatorname{tr}M can­not be neg­a­tive, and all four square roots are defined. In prac­tice there can be round­ing errors, so to be safe the terms are clamped with max( 0, ... ) before the square roots are tak­en. The off diag­o­nal ele­ments are only used to restore the miss­ing signs. Since only the signs mat­ter, the divi­sion by r can be lift­ed! (The func­tion copysign is sup­posed to trans­fer the sign bit from one float­ing point num­ber to anoth­er, and your com­pil­er should do a decent job and reduce it to some sequence of bit­wise and/or instruc­tions. Clang and GCC do this on x86 with SSE float­ing point. See the com­ment sec­tion for a dis­cus­sion about this.)

Com­pared with the orig­i­nal con­ver­sion code, there are four square roots instead of one square root and one divi­sion, but there aren’t any branch­es. The code pipelines well (all the square roots can run in par­al­lel, fol­lowed by all the copysigns).

Operation Counts

In the fol­low­ing table of oper­a­tion counts, I have includ­ed a max( 0, ... ) safe­guard also in the Shoe­make algo­rithm (the orig­i­nal does­n’t have this). The copysign is regard­ed as a sin­gle operation.

Shoe­make Branch­less
add/sub/mul/max 11..13 23
div/sqrt 2 4
float­ing point
compare
1..3 0
branch 1..3 0
copy­sign 0 3
steps in longest
depen­den­cy chain
8..15 6

Let’s assume the laten­cy of sqrt and div as 15 cycles (sin­gle pre­ci­sion), the laten­cy of a branch as 2 cycles with a mis­pre­dic­tion penal­ty of 15 cycles, and the aver­age laten­cy of every­thing else as 3 cycles. These num­bers are in the ball­park of Intel Wolf­dale resp. AMD Jaguar proces­sors [2]. Depend­ing on whether the exe­cu­tion is assumed to be either per­fect­ly ser­i­al and in-order (a sim­ple addi­tion of all the laten­cies), or assumed to be per­fect­ly par­al­lel and out-of-order (the sum of the laten­cies in the longest depen­den­cy chain), we get the cycle counts shown in the table below. The real­i­ty should be any­where between these extremes.

Shoe­make Branch­less
per­fect­ly ser­i­al (in-order) 68..129 138
per­fect­ly par­al­lel (out of order) 49..88 30

Numerical Stability

The catch of the branch­less algo­rithm is that the sin­gu­lar­i­ty at the rota­tion angle of 180° is still there. Although the divi­sion by r was lift­ed and so a divi­sion by zero can­not occur, it now appears as ran­dom signs from cat­a­stroph­ic can­cel­la­tion. Take for exam­ple, a 180° rota­tion around the x‑y diagonal,

    \[\left( \begin{array}{d{0}d{0}d{0}} 0 & 1 & \epsilon \\ 1 & 0 & -\epsilon \\ -\epsilon & \epsilon & -1 \end{array} \right) .\]

In an ide­al world, when the rota­tion angle trav­els through a small neigh­bor­hood around the point of 180°, the val­ue of \epsilon van­ish­es and changes sign. The i and j com­po­nents of the result­ing quater­nion should then flip signs simul­ta­ne­ous­ly too. In real­i­ty how­ev­er, with just the tini­est amount of round­ing error, these signs are qua­si ran­dom. The behav­ior can be described like this:

just before 180°
(con­sis­tent signs)
at 180°
(ran­dom signs)
just after 180°
(con­sis­tent signs)
[ 0 \;\; +i \;\; +j \;\; 0 ] [ 0 \;\; \pm i \;\; \pm j \;\; 0 ] [ 0 \;\; -i \;\; -j \;\; 0 ]

Recommendation

A con­fig­u­ra­tion which pro­vokes the sin­gu­lar­i­ty described above is sur­pris­ing­ly com­mon as an exchange of coor­di­nate axes (con­ver­sion from Z‑up to Y‑up, for exam­ple). Where the branch­less algo­rithm shines how­ev­er is the con­ver­sion of local ani­ma­tion keys. These rarely cross a 180° rota­tion angle. In one use case for me, the skele­ton is the skele­ton of a plant and the ani­ma­tion is dri­ven by physics, and I know or can make sure that a rota­tion of 180° sim­ply does­n’t occur.

Giv­en this knowl­edge, the branch­less con­ver­sion code is rec­om­mend­ed over the stan­dard one, under the con­di­tion that the rota­tion angle does not cross 180°.


References

[1] Ken Shoe­make, “Quater­nions”,
http://www.cs.ucr.edu/~vbz/resources/quatut.pdf

[2] Instruc­tion Tables at Agn­er Optimization
http://www.agner.org/optimize/instruction_tables.pdf

4 Gedanken zu „Branchless Matrix to Quaternion Conversion

  1. On PPC archi­tec­tures, I would rec­om­mend *not* doing copy­sign() with bit oper­a­tions, because that incurs a LHS penal­ty. On those archi­tec­tures, you are bet­ter of imple­ment­ing copy­sign() using fabs() and a float­ing-point select.

    Like the branch­less ver­sion, will come in handy in the future!

  2. Dear Chris­t­ian,
    It seems your code may avoid sev­er­al additions/subtractions.
    See below f90 implementation.
    Alex
    ===========================================

    function mtrx2quat1(r)
    ! get rotation quaternion from matrix
    ! Method: http://www.thetenthplanet.de/archives/1994
    ! tested OK
      type(TQUAT)          :: mtrx2quat1
      real(SP),intent(IN) :: r(3,3)
    
      real(SP) :: s,t,sq2=sqrt(0.5_SP)
    
      t=(r(1,1)+r(2,2)+r(3,3))*0.5_SP
      s=0.5_SP-t
      mtrx2quat1%q(0)=sq2*sqrt(max(0._SP,0.5+t))
      mtrx2quat1%q(1)=sign(sq2*sqrt(max(0._SP,r(1,1)+s)),r(3,2)-r(2,3))
      mtrx2quat1%q(2)=sign(sq2*sqrt(max(0._SP,r(2,2)+s)),r(1,3)-r(3,1))
      mtrx2quat1%q(3)=sign(sq2*sqrt(max(0._SP,r(3,3)+s)),r(2,1)-r(1,2))
    end function mtrx2quat1
    
    • Hi Alex,
      Yes, the code that I showed in the arti­cle does not aggres­sive­ly com­bine com­mon subex­pres­sions, as your code does. I did this for clar­i­ty and under the assump­tion that the com­pil­er could do some­thing sim­i­lar, if giv­en the lib­er­ty to do so (eg. ‑ffast-math and co). In any case, thanks for your contribution!

      EDIT: On a side note, fac­tor­ing out com­mon sub expres­sions increas­es the length of the depen­den­cy chain, so it would be arguable if that is real­ly faster with out-of-order exe­cu­tion. In any case, mea­sure­ment over opinion!

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert

Please answer the following anti-spam test

Which thing makes "tick-tock" and if it falls down, the clock is broken?

  1.    pencil
  2.    ruler
  3.    watch
  4.    chair
  5.    table