On most hardware, multiplication is significantly faster than division. So if you have to divide many numbers by the same divisor, it is usually faster to determine the reciprocal of the divisor once and multiply the numbers with the reciprocal. If you divide by a constant, Gforth performs this optimization automatically.

However, for cases where the divisor is not known during compilation, Gforth provides words that allow you to implement this optimization without to much fuss.

Let’s start with an example: You want to divide all elements of an array of cells by the same number n. A straightforward way to implement this is:

: array/ ( addr u n -- ) -rot cells bounds u+do i @ over / i ! 1 cells +loop drop ;

A possibly more efficient version looks like this:

: array/ ( addr u n -- ) {: | reci[ staged/-size ] :} reci[ /f-stage1m cells bounds u+do i @ reci[ /f-stage2m i ! 1 cells +loop ;

This example first creates a local buffer `reci[`

with size
`staged/-size`

for storing the reciprocal data. Then
`/f-stage1m`

computes the reciprocal of *n* and stores it in
`reci[`

. Finally, inside the loop `/f-stage2m`

uses the
data in `reci[`

to compute the quotient.

There are some limitations: Only positive divisors are supported for
`/f-stage1m`

; for `u/-stage1m`

you can use a divisor of 2 or
higher. You get an error if you try to use an unsupported divisor.
You must initalize the reciprocal buffer for the floored second-stage
words with `/f-stage1m`

and for the unsigned second-stage words
with `u/-stage1m`

. You must not modify the reciprocal buffer
between the first stage and the second stage; basically, don’t treat
it as a memory buffer, but as something that is only mutable by the
first stage; the point of this rule is that future versions of Gforth
will not consider aliasing of this buffer.

Measurements show that staged division is not always beneficial:

break 100 elem even speedup core 7 2.09 Skylake (Core i5-6600K) - 0.94 Rocket Lake (Xeon E-2388G) 40 1.09 Golden Cove (Core i3-1315U P-core) - 0.85 Gracemont (Core i3-1315U E-core) 6 1.68 Zen2 (Ryzen 9 3900X) - 0.56 Zen3 (Ryzen 7 5800X)

The words are:

`staged/-size`

(– u) gforth-1.0 “staged-slash-size”

Size of buffer for `u/-stage1m`

or `/f-stage1m`

.

`/f-stage1m`

(n a-reci –) gforth-1.0 “slash-f-stage1m”

Compute the reciprocal of *n* and store it in the buffer
*a-reci* of size `staged/-size`

. Throws an error if
*n*<1.

`/f-stage2m`

(n1 a-reci – nquotient) gforth-1.0 “slash-f-stage2m”

*Nquotient* is the result of dividing *n1* by the divisor represented
by *a-reci*, which was computed by `/f-stage1m`

.

`modf-stage2m`

(n1 a-reci – umodulus) gforth-1.0 “mod-f-stage2m”

*Umodulus* is the remainder of dividing *n1* by the divisor represented
by *a-reci*, which was computed by `/f-stage1m`

.

`/modf-stage2m`

(n1 a-reci – umodulus nquotient) gforth-1.0 “slash-mod-f-stage2m”

*Nquotient* is the quotient and *umodulus* is the remainder of
dividing *n1* by the divisor represented by *a-reci*, which was
computed by `/f-stage1m`

.

`u/-stage1m`

(u a-reci –) gforth-1.0 “u-slash-stage1m”

Compute the reciprocal of *u* and store it in the buffer
*a-reci* of size `staged/-size`

. Throws an error if
*u*<2.

`u/-stage2m`

(u1 a-reci – uquotient) gforth-1.0 “u-slash-stage2m”

*Uquotient* is the result of dividing *u1* by the divisor represented
by *a-reci*, which was computed by `u/-stage1m`

.

`umod-stage2m`

(u1 a-reci – umodulus) gforth-1.0 “u-mod-stage2m”

*Umodulus* is the remainder of dividing *u1* by the divisor represented
by *a-reci*, which was computed by `u/-stage1m`

.

`u/mod-stage2m`

(u1 a-reci – umodulus uquotient) gforth-1.0 “u-slash-mod-stage2m”

*Uquotient* is the quotient and *umodulus* is the remainder of
dividing *u1* by the divisor represented by *a-reci*, which was
computed by `u/-stage1m`

.

Gforth currently does not support staged symmetrical division.

You can recover the divisor from (the address of) a reciprocal with
`staged/-divisor @`

:

`staged/-divisor`

(addr1 – addr2) gforth-1.0 “staged-slash-divisor”

*Addr1* is the address of a reciprocal, *addr2* is the address
containing the divisor from which the reciprocal was computed.

This can be useful when looking at the decompiler output of Gforth: a division by a constant is often compiled to a literal containing the address of a reciprocal followed by a second-stage word.

The performance impact of using these words strongly depends on the
architecture (does it have hardware division?) and the specific
implementation (how fast is hardware division?), but just to give you
an idea about the relative performance of these words, here are the
cycles per iteration of a microbenchmark (which performs the mentioned
word once per iteration) on two AMD64 implementations; the *norm*
column shows the normal division word (e.g., `u/`

), while the
*stg2* column shows the corresponding stage2 word (e.g.,
`u/-stage2m`

):

Skylake Zen2 norm stg2 norm stg2 41.3 15.8 u/ 35.2 21.4 u/ 39.8 19.7 umod 36.9 25.8 umod 44.0 25.3 u/mod 43.0 33.9 u/mod 48.7 16.9 /f 36.2 22.5 /f 47.9 20.5 modf 37.9 27.1 modf 53.0 24.6 /modf 45.8 35.4 /modf 227.2 u/stage1 101.9 u/stage1 159.8 /fstage1 97.7 /fstage1