15.4 Performance

On RISCs the Gforth engine is very close to optimal; i.e., it is usually impossible to write a significantly faster threaded-code engine.

On register-starved machines such as the 386 architecture processors, improvements are possible because gcc does not utilize the registers as well as a human, even with explicit register declarations. For example, Bernd Beuster wrote a Forth system fragment in assembly language and hand-tuned it for the 486; this system is 1.19 times faster on the Sieve benchmark on a 486DX2/66 than Gforth compiled with gcc-2.6.3 with -DFORCE_REG. The situation has improved with gcc-2.95 and gforth-0.4.9: the most important virtual machine registers now fit in real registers (and we can even afford to use the TOS optimization, i.e., keeping the top-of-stack item in a register), resulting in a speedup of 1.14 on the Sieve over the earlier results. Dynamic superinstructions provide another speedup (but only by a factor of around 1.2 on the 486).
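The effect of the TOS optimization can be illustrated with a minimal sketch in C. This is not Gforth's engine code; the function names `sum_memory_tos` and `sum_cached_tos` are hypothetical, and the stack operations are written out by hand instead of being dispatched from threaded code. Both functions compute 1 + 2 + ... + n with Forth-style stack operations; the second keeps the top-of-stack item in a local variable that the compiler can hold in a register.

```c
#include <assert.h>

/* Top of stack lives in memory: every operation reads and writes it
   through sp. The stack grows downward from the end of the array. */
long sum_memory_tos(long n)
{
    long stack[4];
    long *sp = stack + 4;          /* empty stack */
    long i;

    assert(n >= 0);
    *--sp = 0;                     /* push the initial sum */
    for (i = 1; i <= n; i++) {
        *--sp = i;                 /* literal: one store */
        sp[1] += sp[0];            /* +: two loads and one store */
        sp++;                      /* drop the consumed item */
    }
    return *sp;
}

/* Top of stack cached in `tos`: only the items below it live in memory. */
long sum_cached_tos(long n)
{
    long stack[4];
    long *sp = stack + 4;          /* memory part of the stack is empty */
    long tos = 0;                  /* the initial sum is the cached top */
    long i;

    assert(n >= 0);
    for (i = 1; i <= n; i++) {
        *--sp = tos;               /* literal: spill the old top ... */
        tos = i;                   /* ... and cache the new one */
        tos += *sp++;              /* +: one load, no store */
    }
    return tos;
}
```

Calling either function with 10 yields 55. The cached variant saves memory traffic in `+`, but it ties up one more register, which is why it only pays off once enough real registers are available.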

The potential advantage of assembly language implementations is not necessarily realized in complete Forth systems. We compared Gforth-0.5.9 (direct threaded, compiled with gcc-2.95.1 and -DFORCE_REG) with Win32Forth 1.2093 (newer versions are reportedly much faster), LMI’s NT Forth (Beta, May 1994) and Eforth (with and without peephole (aka pinhole) optimization of the threaded code); all these systems were written in assembly language. We also compared Gforth with three systems written in C: PFE-0.9.14 (compiled with gcc-2.6.3 with the default configuration for Linux: -O2 -fomit-frame-pointer -DUSE_REGS -DUNROLL_NEXT), ThisForth Beta (compiled with gcc-2.6.3 -O3 -fomit-frame-pointer; ThisForth employs peephole optimization of the threaded code) and TILE (compiled with make opt).

We benchmarked Gforth, PFE, ThisForth and TILE on a 486DX2/66 under Linux. Kenneth O’Heskin kindly provided the results for Win32Forth and NT Forth on a 486DX2/66 with similar memory performance under Windows NT. Marcel Hendrix ported Eforth to Linux, then extended it to run the benchmarks, added the peephole optimizer, ran the benchmarks and reported the results.

We used four small benchmarks: the ubiquitous Sieve; bubble sort and matrix multiplication, which come from the Stanford integer benchmarks and were translated into Forth by Martin Fraeman (we used the versions included in the TILE Forth package, but with bigger data set sizes); and a recursive Fibonacci number computation for benchmarking calling performance. The following table shows the time taken for the benchmarks scaled by the time taken by Gforth (in other words, it shows the speedup factor that Gforth achieved over the other systems); a value above 1.00 means that the system was slower than Gforth on that benchmark.

relative       Win32-    NT       eforth       This-      
time     Gforth Forth Forth eforth  +opt   PFE Forth  TILE
sieve      1.00  2.16  1.78   2.16  1.32  2.46  4.96 13.37
bubble     1.00  1.93  2.07   2.18  1.29  2.21        5.70
matmul     1.00  1.92  1.76   1.90  0.96  2.06        5.32
fib        1.00  2.32  2.03   1.86  1.31  2.64  4.55  6.54
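Each entry in the table is a run time divided by Gforth's run time on the same benchmark. As a minimal sketch of that arithmetic (the function name `relative_time` is hypothetical, and the raw timings are not given in this section, so any inputs are made up for illustration):

```c
#include <assert.h>

/* Run time of a system scaled by Gforth's run time on the same benchmark.
   A result above 1.0 means the system was slower than Gforth. */
double relative_time(double t_system, double t_gforth)
{
    assert(t_gforth > 0.0);
    return t_system / t_gforth;
}
```

For example, if a benchmark hypothetically took 2.0 seconds on Gforth and 4.32 seconds on another system, the table entry would be 2.16.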

You may be quite surprised by the good performance of Gforth when compared with systems written in assembly language. One important reason for the disappointing performance of these other systems is probably that they are not written optimally for the 486 (e.g., they use the lods instruction). In addition, Win32Forth uses a convenient but costly method for relocating the Forth image: like cforth, it computes the actual addresses at run time, resulting in two address computations per NEXT (see Image File Background).
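The cost of relocating at run time can be modelled with a small sketch in C. It is a toy model under stated assumptions, not the code of Win32Forth or cforth: the image stores offsets instead of addresses, primitives are ordinary C functions reached through a table, and all names (`VM`, `next_relative`, `run_relative`, the `prim_` functions) are hypothetical. The point is only that every NEXT performs two base-plus-offset additions before it can transfer control.

```c
#include <assert.h>

typedef long Cell;

typedef struct VM {
    const Cell *base;                     /* start of the loaded image */
    const Cell *ip;                       /* instruction pointer */
    Cell stack[16];
    Cell *sp;                             /* next free stack slot */
    int running;
    unsigned long address_computations;   /* base+offset additions so far */
} VM;

typedef void (*Prim)(VM *);

static void prim_lit(VM *vm)  { *vm->sp++ = *vm->ip++; }
static void prim_add(VM *vm)  { vm->sp--; vm->sp[-1] += vm->sp[0]; }
static void prim_halt(VM *vm) { vm->running = 0; }

static const Prim prims[] = { prim_lit, prim_add, prim_halt };

/* NEXT for an image that is never relocated: both the word and its
   code are found by adding an offset to a base at run time. */
static void next_relative(VM *vm)
{
    const Cell *cfa = vm->base + *vm->ip++;   /* computation 1: word */
    const Prim *code = prims + *cfa;          /* computation 2: its code */
    vm->address_computations += 2;
    (*code)(vm);
}

/* Run threaded code starting at image[entry]; return the top of stack
   and report how many address computations were needed. */
Cell run_relative(const Cell *image, Cell entry, unsigned long *computations)
{
    VM vm;

    assert(entry >= 0);
    vm.base = image;
    vm.ip = image + entry;
    vm.sp = vm.stack;
    vm.running = 1;
    vm.address_computations = 0;
    while (vm.running)
        next_relative(&vm);
    *computations = vm.address_computations;
    return vm.sp[-1];
}
```

With the image `{ 0, 1, 2, 0, 40, 0, 2, 1, 2 }` (three code fields followed by threaded code for "40 2 +" and a halt) and entry 3, the run executes four NEXTs and therefore eight address computations. In an image that holds absolute addresses, each of these additions would reduce to a plain load.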

The speedup of Gforth over PFE, ThisForth and TILE is easily explained by the self-imposed restriction of the latter systems to standard C, which makes efficient threading impossible (however, the measured implementation of PFE uses a GNU C extension: see Defining Global Register Variables in GNU C Manual). Moreover, current C compilers have a hard time optimizing other aspects of the ThisForth and the TILE source.
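The difference between dispatch in standard C and threaded dispatch can be sketched as follows. This is an illustration only: the instruction set (`OP_LIT`, `OP_ADD`, `OP_MUL`, `OP_HALT`) and the function names are hypothetical, a central `switch` is just one of the portable techniques available in standard C, and `run_threaded` relies on the GNU C labels-as-values extension, so it needs gcc or a compatible compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction set for this sketch. */
enum { OP_LIT, OP_ADD, OP_MUL, OP_HALT };

/* Standard C: every instruction returns to one shared dispatch point. */
long run_switch(const long *code)
{
    long stack[16];
    long *sp = stack;                    /* next free stack slot */
    const long *ip = code;

    for (;;) {
        switch (*ip++) {
        case OP_LIT:  *sp++ = *ip++;            break;
        case OP_ADD:  sp--; sp[-1] += sp[0];    break;
        case OP_MUL:  sp--; sp[-1] *= sp[0];    break;
        case OP_HALT: return sp[-1];
        default:      assert(0); return 0;
        }
    }
}

/* A threaded-code cell: the address of a primitive's code or a literal. */
typedef union { void *addr; long value; } Cell;

/* NEXT: fetch the next code address and jump to it directly. */
#define NEXT goto *(ip++)->addr

/* GNU C: each primitive ends with its own indirect jump (direct threading). */
long run_threaded(const long *ops, size_t n)
{
    static void *const labels[] = { &&do_lit, &&do_add, &&do_mul, &&do_halt };
    Cell code[64];
    long stack[16];
    long *sp = stack;
    Cell *ip = code;
    size_t i;

    /* Translate opcodes into code addresses once, before execution. */
    assert(n <= 64);
    for (i = 0; i < n; i++) {
        assert(ops[i] >= OP_LIT && ops[i] <= OP_HALT);
        code[i].addr = labels[ops[i]];
        if (ops[i] == OP_LIT) {          /* the operand follows inline */
            assert(i + 1 < n);
            i++;
            code[i].value = ops[i];
        }
    }

    NEXT;
do_lit:  *sp++ = (ip++)->value;  NEXT;
do_add:  sp--; sp[-1] += sp[0];  NEXT;
do_mul:  sp--; sp[-1] *= sp[0];  NEXT;
do_halt: return sp[-1];
}
```

Both functions return 20 for the program `{ OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_LIT, 4, OP_MUL, OP_HALT }`. The threaded variant avoids the shared dispatch point and the range check that a `switch` typically implies, at the price of depending on a compiler extension.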

The performance of Gforth on 386 architecture processors varies widely with the version of gcc used. For example, gcc-2.5.8 failed to allocate any of the virtual machine registers into real machine registers by itself and would not work correctly with explicit register declarations, giving a significantly slower engine (on a 486DX2/66 running the Sieve) than the one measured above.

Note that there have been several releases of Win32Forth since the release presented here, so the results presented above may have little predictive value for the performance of Win32Forth today (results for the current release on an i486DX2/66 are welcome).

In Translating Forth to Efficient C by M. Anton Ertl and Martin Maierhofer (presented at EuroForth ’95), an indirect threaded version of Gforth is compared with Win32Forth, NT Forth, PFE, ThisForth, and several native code systems; that version of Gforth is slower on a 486 than the version used here. You can find a newer version of these measurements at https://www.complang.tuwien.ac.at/forth/performance.html. You can find numbers for Gforth on various machines in Benchres.