Next: , Previous: , Up: Arithmetic   [Contents][Index]

#### 5.5.5 Two-stage integer division

On most hardware, multiplication is significantly faster than division. So if you have to divide many numbers by the same divisor, it is usually faster to determine the reciprocal of the divisor once and multiply the numbers with the reciprocal. For integers, this is tricky, so Gforth packages this work into the words described in this section.

Let’s start with an example: You want to divide all elements of an array of cells by the same number n. A straightforward way to implement this is:

```: array/ ( addr u n -- )
-rot cells bounds u+do
i @ over / i !
1 cells +loop
drop ;
```

A more efficient version looks like this:

```: array/ ( addr u n -- )
{: | reci[ staged/-size ] :}
reci[ /f-stage1m
cells bounds u+do
i @ reci[ /f-stage2m i !
1 cells +loop ;
```

This example first creates a local buffer `reci[` with size `staged/-size` for storing the reciprocal data. Then `/f-stage1m` computes the reciprocal of n and stores it in `reci[`. Finally, inside the loop `/f-stage2m` uses the data in `reci[` to compute the quotient.

There are some limitations: Only positive divisors are supported for `/f-stage1m`; for `u/-stage1m` you can use a divisor of 2 or higher. You get an error if you try to use an unsupported divisor. You must initalize the reciprocal buffer for the floored second-stage words with `/f-stage1m` and for the unsigned second-stage words with `u/-stage1m`. You must not modify the reciprocal buffer between the first stage and the second stage; basically, don’t treat it as a memory buffer, but as something that is only mutable by the first stage; the point of this rule is that future versions of Gforth will not consider aliasing of this buffer.

The words are:

````staged/-size`       – u         gforth       “staged-slash-size”
```

Size of buffer for `u/-stage1m` or `/f-stage1m`.

````/f-stage1m`       n addr-reci –         gforth       “slash-f-stage1m”
```

Compute the reciprocal of n and store it in the buffer addr-reci of size `staged/-size`. Throws an error if n<1.

````/f-stage2m`       n1 a-reci – nquotient        gforth       “slash-f-stage2m”
```

Nquotient is the result of dividing n1 by the divisor represented by a-reci, which was computed by `/f-stage1m`.

````modf-stage2m`       n1 a-reci – umodulus        gforth       “mod-f-stage2m”
```

Umodulus is the remainder of dividing n1 by the divisor represented by a-reci, which was computed by `/f-stage1m`.

````/modf-stage2m`       n1 a-reci – umodulus nquotient        gforth       “slash-mod-f-stage2m”
```

Nquotient is the quotient and umodulus is the remainder of dividing n1 by the divisor represented by a-reci, which was computed by `/f-stage1m`.

````u/-stage1m`       u addr-reci –         gforth       “u-slash-stage1m”
```

Compute the reciprocal of u and store it in the buffer addr-reci of size `staged/-size`. Throws an error if u<2.

````u/-stage2m`       u1 a-reci – uquotient        gforth       “u-slash-stage2m”
```

Uquotient is the result of dividing u1 by the divisor represented by a-reci, which was computed by `u/-stage1m`.

````umod-stage2m`       u1 a-reci – umodulus        gforth       “u-mod-stage2m”
```

Umodulus is the remainder of dividing u1 by the divisor represented by a-reci, which was computed by `u/-stage1m`.

````u/mod-stage2m`       u1 a-reci – umodulus uquotient        gforth       “u-slash-mod-stage2m”
```

Uquotient is the quotient and umodulus is the remainder of dividing u1 by the divisor represented by a-reci, which was computed by `u/-stage1m`.

Gforth currently does not support staged symmetrical division.

You can recover the divisor from (the address of) a reciprocal with `staged/-divisor @`:

````staged/-divisor`       addr1 – addr2         gforth       “staged-slash-divisor”
```

This can be useful when looking at the decompiler output of Gforth: a division by a constant is often compiled to a literal containing the address of a reciprocal followed by a second-stage word.

The performance impact of using these words strongly depends on the architecture (does it have hardware division) and the specific implementation (how fast is hardware division?), but just to give you an idea about the relative performance of these words, here are the cycles per iteration of a microbenchmark (which performs the mentioned word once per iteration) on two AMD64 implementations; the norm column shows the normal division word (e.g., `u/`), while the stag column shows the corresponding stage2 word (e.g., `u/-stage2m`):

``` Skylake              Zen2
norm stag           norm stag
41.3 15.8 u/	    35.2 21.4 u/
39.8 19.7 umod	    36.9 25.8 umod
44.0 25.3 u/mod	    43.0 33.9 u/mod
48.7 16.9 /f	    36.2 22.5 /f
47.9 20.5 modf	    37.9 27.1 modf
53.0 24.6 /modf	    45.8 35.4 /modf
227.2 u/stage1      101.9 u/stage1
159.8 /fstage1       97.7 /fstage1
```

Next: , Previous: , Up: Arithmetic   [Contents][Index]