From: chughes@maths.tcd.ie (Conrad Hughes)
Subject: Re: fast divide
Date: 31 Oct 91 13:15:24 GMT

DEFFNdiv(A,B,C,R):LOCALA%:[OPTpass:MOVS C,A,LSR#31:RSBNE A,A,#0
 :ADDS C,C,A,LSL#16:MOV A,A,LSR#16:ADDS A,B,A:SUBCC A,A,B:]
  FORA%=1TO31:[OPTpass:ADCS C,C,C:ADCS A,B,A,LSL#1:SUBCC A,A,B:]NEXT
[OPTpass:ADCS R,C,C:RSBCC R,R,#0:]=0

This is the BASIC assembler code to do the division.. Use it this way:

FNdiv(reg_top,reg_bot,reg_misc,reg_res)

while in the BASIC assembler.. It inlines the code. Register reg_res
contains 32768*(reg_top/reg_bot); reg_bot _must_ be negative (if you
don't know what sign it'll be, tack the following in at the beginning:

CMP B,#0:RSBPL B,B,#0:RSBPL A,A,#0

...) , and reg_top can be either sign.. If you need slightly higher
accuracy (I think I can increase the accuracy by half a bit) and
don't worry about registers, tell me.  reg_res can be any register,
including reg_top, reg_bot or reg_misc.

If you want to turn it into a loop, do so.. But remember that the
fundamental unit of of the routine takes 3 clock counts, a branch-loop
takes five! Also, it'll be straightforward enough to turn into a
subroutine..

If you need any clarification of it, ask me...

Someone posted another version which is faster for a logarithmic
distribution of numbers, with 0 and 1 occuring most frequently; I
haven't got it to hand, but could dig it up if nobody else posts
it.

 Conrad


From: torbenm@diku.dk (Torben AEgidius Mogensen)
Date: 31 Oct 91 15:34:21 GMT
Subject: Re: fast divide
Sender: torbenm@freke.diku.dk

Here is one I cooked up. It is from memory, so there might be slight
errors. The comparisons before the first SUB ensure that no overflow
happens on the ASL. Worst case time is 5 cycles per bit.

Arguments in p,q.
on exit, r = p div q, p = p mod q

.div
 MOV r,#0
 CMP p,q
 BLT nodiv
 CMP p,q ASL #1
 BLT div0
 CMP p,q ASL #2
 BLT div1
 ...			\ repeat for 3..30
 CMP p,q ASL #31
 SUBGE p,q ASL #31
 ADDGE r,#2^31
 CMP p,q ASL #30
.div30
 SUBGE p,q ASL #30
 ADDGE r,#2^30
 CMP p,q ASL #29
.div29
 ...			\ repeat for 29..1
.div0
 SUBGE p,q ASL #0	\ equiv. to SUBGE p,q
 ADDGE r,#2^0
.nodiv
 MOV PC,Link

 
From: gcwillia@daisy.waterloo.edu (Graeme Williams)
Date: 2 Nov 91 20:13:28 GMT
Subject: Re: fast divide
Organization: University of Waterloo

Ok, here is the fastest unsigned divide routine I know of. The 3 instruction
loop (unrolled) wasn't my idea - but the piece of code that figures out
whereabouts to jump into the unrolled loop is.

On entry d and n should be both +ve.

CMP d,#0 : BEQ end% ; Remove zero divisors
CMP d,n : MOVGT m,n : MOVGT n,#0 : BGT end% ; Exit if denominator > numerator

; This next bit of code figures out roughly how many bits the quotient is
; going to have and hence how many times you need to run through the
; divide loop proper - its only worth finding out to the nearest 4 times

.start%
MOV cnt,#28 : MOV m,n,LSR#4
CMP d,m,LSR#12 : SUBLE cnt,cnt,#16 : MOVLE m,m,LSR#16
CMP d,m,LSR#4  : SUBLE cnt,cnt,#8  : MOVLE m,m,LSR#8
CMP d,m        : SUBLE cnt,cnt,#4  : MOVLE m,m,LSR#4

MOV n,n,LSL cnt
ADDS n,n,n : RSB d,d,#0 ; Have to flip the sign of d for the divide loop
ADD cnt,cnt,cnt,LSL#1  ; Multiply cnt by 3
ADD pc,pc,cnt,LSL#2    ; Jump cnt instructions
MOV R0,R0              ; Dummy instruction to take care of pipelining

32 copies of { ADCS m,d,m,LSL#1 : SUBCC m,m,d : ADCS n,n,n }

.end%

The best times occur when the numerator and denominator of your division
are of similar size. The worst times when the numerator is very large
and the denominator small. (The case where the denominator is larger than
the numerator is killed off very early.)

Suppose one has a distribution of numbers wherein the highest bit of
the numbers is equally distributed between bit 0 and bit 31, that is,
there is the same number of numbers between 2^n and 2^(n+1) for every n.
(I think this is actually the case for most purposes)

Now if a numerator and denominator are selected at random from this
distribution subject only to the requirement that denom <= numer
then the above routine will execute on average in 58.25 cycles
(from the point start% in the code).

The worst case is 116 cycles (top bit of quotient is bit 28 or higher)
the best case is 32 cycles (top bit of quotient is bit 0 to 3)
The inbetween cases get 12 cycles worse for every 4 extra bits in the
quotient.

Note that the average case isn't the simple average of the various cases
but has to be weighted toward the better cases - this is because there
are more ways (27) for numer and denom to differ by say 5 bits than there
are (7) for them to differ by say 25 bits.

Have fun.

Graeme Williams


From: zrzm0370@rusmv1.rus.uni-stuttgart.de (Joerg Scheurich)
Subject: integer division
Date: 10 Jun 91 11:25:36 GMT

Hi 

So i know, there is no Assembler-command for integer-division in the 
ARM. Therefor a fast division-routine is a thing of common interest.

In a earlier Posting i wrote about the bad quick hacked division-routine
for the demo of my formula-tool (math. formula ==> assemblermacros).

Now i wrote a better routine, but i don't know, if there is a better 
algorithm.

The shorter and slower 32-Bit/32-Bit-division-routine requires
382     Cycles in 100 Bytes, with usage of 4 32-Bit-registers and 4 Byte memory.
With unrolling the loop,
the faster  and longer 32-Bit/32-Bit-division-routine requires 
182     Cycles in 692 Bytes, with usage of 4 32-Bit-registers.
  
This is very faster then the old routine (which requires 2**33 Cycles in 
worst case and requires all time of the universe, if there is a division
by zero ... )

Cause the INTERNAL-assembler-command of the INTEL 8086
for a                  32-Bit/16-Bit-division  requires
165-184 Cycles in 2-4 Bytes, with usage of 3  16-Bit-registers,
i think, the result is ok.   

The result is nothing against the INTERNAL-assembler-command of the ARM
for a                  32-Bit*32-Bit-multiplication which  requires
17     Cycles in    4 Bytes, with usage of 1-3 32-Bit-registers,
but my Professor in digital electronics used to say:
"If a program contain much divisions, the programmer is a idiot" ...

Know someone a better (faster or shorter) algorithm ?

so long
MUFTI

( internet:    zrzm0111@helpdesk.rus.uni-stuttgart.de
  ( from janet: zrzm0111%de.uni-stuttgart.rus.helpdesk@NFS-RELAY )
  bitnet:      ZRZM  AT DS0RUS1I )

div-fast:
---------
; division routine
; 1. calculate flag if result is negate and convert operands to positiv
; 2. second "divide" with unsigned suczessive Approximation  
; 3. fix the sign of result 

        .MACRO INTdivide
         LDR R2,%2
         LDR R3,%3 

         MOV R0,#0
         CMP R2,#0
          RSBLT R2,R2,#0
          SUBLT R0,R0,#1
         CMP R3,#0
          RSBLT R3,R3,#0
          MVNLT R0,R0

         MOV  R1,#0

         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l2
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l3
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l4
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l5
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l6
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l7
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l8
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l9
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l10
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l11
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l12
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l13
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l14
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l15
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l16
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l17
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l18
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l19
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l20
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l21
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l22
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l23
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l24
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l25
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l26
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l27
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l28
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l29
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l30
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l31
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
\l32
         ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
    
         CMP R0,#0
         RSBNE R2,R2,#0 
         STR R2,%1
        .ENDM 

div-short:
----------
; division routine
; 1. calculate flag if result is negate and convert operands to positiv
; 2. second "divide" with unsigned suczessive Approximation  
; 3. fix the sign of result 

        .MACRO INTdivide
         LDR R2,%2
         LDR R3,%3 

         MOV R0,#0
         CMP R2,#0
          RSBLT R2,R2,#0
          SUBLT R0,R0,#1
         CMP R3,#0
          RSBLT R3,R3,#0
          MVNLT R0,R0
         STR R0,minusflag

         MOV  R0,#32.
         MOV  R1,#0
\loop    ADDS R2,R2,R2
         ADCS R1,R1,R1
         CMP  R1,R3
         SUBGE  R1,R1,R3
         ADDGE  R2,R2,#1         
         SUB  R0,R0,#1
         CMP  R0,#0
         BNE  \loop        
    
         LDR R0,minusflag
         CMP R0,#0
         RSBNE R2,R2,#0 
         STR R2,%1
        .ENDM 


From: chughes@maths.tcd.ie (Conrad Hughes)
Subject: Re: integer division
Date: 11 Jun 91 14:10:31 GMT

zrzm0370@rusmv1.rus.uni-stuttgart.de (Joerg Scheurich) writes:
>382     Cycles in 100 Bytes, with usage of 4 32-Bit-registers
>and 4 Byte memory.
>With unrolling the loop,
>the faster  and longer 32-Bit/32-Bit-division-routine requires 
>182     Cycles in 692 Bytes, with usage of 4 32-Bit-registers.

This divides a 64bit number by a 32bit one, with 32bit result, in 97
instructions/388 bytes/97 clock cycles..

Entry: A=MSW of 64bit no., B=LSW of 64bit no., C=32bit no.
	C _must_ be negative (if it's positive, RSB C,C,#0),
	A _must_ be positive (if not, RSBS B,B,#0:RSC A,A,#0)
ADDS B,B,B
ADCS A,C,A,LSL#1:SUBCC A,A,C:ADCS B,B,B [repeat these three 32 times]
Exit: A=Remainder of division, B=result, C unchanged.
	..if you don't need the remainder, get rid of the last
	SUBCC A,A,C.
	The result will be unsigned, and flags will (obviously)
	reflect the last addition.
Adjust as you will to get 32/32 division (Initialise A to 0, or AB
= numerator shifted some way) or any other variety you like..
Rolling up the loop it could get slightly complicated since every
instruction requires flag information from the previous one ;-)
Something like SUB count,count,#1:TEQ count,#0:BNE loop with count
initialised to #32 might do the trick, but it'll slow it down to at
least 288 cycles (so what if it only uses 32 bytes?? ;-}), which is
pretty unacceptable unless you're on an ARM3..

This is off the top of my head and I'm nowhere near the Arc, so if
it doesn't work (it should - I've typed it in enough times now..)
flame me..

Conrad


From: gcwillia@daisy.waterloo.edu (Graeme Williams)
Subject: Re: really FAST integer division
Summary: Avg. exec. time < 60 cycles    - the fastest div in the West?
Date: 28 Jun 91 19:07:58 GMT
Organization: University of Waterloo

Following several posts regarding division routines, in particular one
with a lovely 3 instr. central loop I modified my old division routine
into the following (supersonic, lightning fast, ZR rated, :-) )
32bit routine:

;Essentially what's happening here is we are figuring out how far left
;we can shift the numerator before any actual dividing begins and thus save
;a significant number of useless trips through the (unrolled) 3 instr. loop.
; - Note we only go to the nearest 4 bits, trying to save another 2 bits
;   results in 3 extra instrs. for a saving of 6 instrs. half the time
;   and 0 instrs. the other half -  i.e. no nett benefit.

MOV cnt,#28 : MOV mod,num,LSR#4
CMP den,mod,LSR#12 : SUBLE cnt,cnt,#16 : MOVLE mod,mod,LSR#16
CMP den,mod,LSR#4 : SUBLE cnt,cnt,#8 : MOVLE mod,mod,LSR#8
CMP den,mod : SUBLE cnt,cnt,#4 : MOVLE mod,mod,LSR#4
MOV num,num,LSL cnt : RSB den,den,#0

ADDS num,num,num
;Now skip over cnt copies of the 3 instr. loop.
ADD cnt,cnt,cnt,LSL#1 : ADD pc,pc,cnt,LSL#2 : MOV R0,R0

{ADCS mod,den,mod,LSL#1 : SUBLO mod,mod,den : ADCS num,num,num} x 32 copies


The routine behaves as follows:
On Entry:  num - is a signed numerator A (but +ve)
           den - is a signed denominator B (but non-zero and +ve)
On Exit:   mod - contains A MOD B
           num - contains A DIV B

The routine requires 4 registers and 452 bytes of memory.

Execution time:
  Given numbers A and B (both +ve) chosen at random from a distribution in
  which the most significant bit of the numbers is uniformly distributed
  between bits 0 and 31 (i.e. the density of numbers goes as roughly log n)
  then the average execution time is:

       For A < B :  32 cycles      (no division actually required)
       For A >= B:  58.25 cycles

       The best and worst cases are 32 cycles and 116 cycles respectively.


Have fun. (Stuff like this ought to be worth good money!)

Graeme Williams
gcwillia@daisy.waterloo.edu

P.S. With any luck there's only a couple of typos :-).


From: dseal@armltd.uucp (David Seal)
Subject: Re: Dividing
Date: 5 Mar 91 19:47:12 GMT

In article <8828@castle.ed.ac.uk> ecwu61@castle.ed.ac.uk (R Renwick) writes:

>	Can anyone help me with this problem:
>In my super-duper, wire-frame model animating 'C' program, I represent 
>floating point numbers as 4-byte integers. This I do by shifting the number
>being represented by 15 bits (multiply by 32768) in order that I can keep a
>track of the fractional part. The reason for doing this is that I want
>to make sure my program runs as fast as possible by not using floating
>point arithmetic...
>
>	What I want to do is do a divide a number by another number
>and not lose alot of the precision.
>	The way I do this at the moment is by
>		 result=((a/b)<<15)
>But if a and b are close together then (a/b) loses alot of precison
>(since the result is an int).
>
>Does anyone know how I could do an integer division without losing
>precision. I cannot use result=(((a<<8)/b)<<7) or any other tricks
>because a may be close to 32768<<15 already...
>Perhaps if I can find the reciprocal of b in the above representation
>and multiply it by a then I'll not lose the as much precision, but I
>don't know how to do this (find the reciprocal) :-(

Two quick suggestions:

(1) The basic problem is that integer division stops when it determines the
    bit in the units place, whereas you want it to go on and calculate
    another 15 fractional binary places. So rather than using C's integer
    divide operator, write a long division routine using subtractions and
    shifts and force it to go on 15 iterations longer than normal. (This
    won't be much slower than C's integer divide, which essentially uses a
    long division routine anyway).

    If speed is very important, you'll probably want to either (a) write a
    well-optimised assembler routine; or (b) write it as a C macro, so that
    the compiler includes it inline. In the latter case, you may want to
    look carefully at the assembler produced by the compiler for an instance
    of your macro and experiment a bit - careful choice of the details of
    the long division algorithm can make a significant difference to the
    speed of the division.

    Incidentally, you may want to watch out for significant bits being lost
    during the last 15 iterations: this is perfectly possible and means that
    your result has overflowed.

(2) Use a Newton-Raphson iteration to find the reciprocal of your divisor
    and then multiply - this presupposes that you've got an efficient way to
    do multiplications, both for the Newton-Raphson iteration and for the
    final multiplication. The iteration to find the reciprocal of a number A
    is (in pseudo-code - you'll have to translate to C):

       newx = initial approximation to reciprocal;
       repeat
         x = newx;
         newx = x * (2 - A*x);
       until x and newx sufficiently close together;

    The main thing to beware of in this is that your initial approximation
    must be reasonably close to the reciprocal - it must be greater than 0
    and less than twice the reciprocal. Such an approximation can be found
    by finding the most significant 1 in A and choosing a value based on
    this - e.g. if A lies in the range 1.0 <= A < 2.0, its most significant
    1 is at bit position 16, and a suitable starting point for the iteration
    would be 0.75. This gives us:

    Most sign. bit position      Range for A        Suitable starting point
    -----------------------------------------------------------------------
              :
              :
              9              2^(-7) <= A < 2^(-6)       1.5*2^6
             10              2^(-6) <= A < 2^(-5)       1.5*2^5
             11              2^(-5) <= A < 2^(-4)       1.5*2^4
             12              2^(-4) <= A < 2^(-3)       1.5*2^3
             13              2^(-3) <= A < 2^(-2)       1.5*2^2
             14              2^(-2) <= A < 2^(-1)       1.5*2^1
             15              2^(-1) <= A < 2^0          1.5*2^0
             16                 2^0 <= A < 2^1          1.5*2^(-1)
             17                 2^1 <= A < 2^2          1.5*2^(-2)
             18                 2^2 <= A < 2^3          1.5*2^(-3)
             19                 2^3 <= A < 2^4          1.5*2^(-4)
             20                 2^4 <= A < 2^5          1.5*2^(-5)
             21                 2^5 <= A < 2^6          1.5*2^(-6)
             22                 2^6 <= A < 2^7          1.5*2^(-7)
             :
             :

    Watch out for very big values - e.g. 2^16 is representable in this
    number system, but 2^(-16) isn't!

Hope these suggestions help.

David Seal
dseal@acorn.co.uk or dseal@armltd.uucp