gforth-fast use another
optimization: Dynamic superinstructions with replication. As an
example, consider the following colon definition:
: squared ( n1 -- n2 ) dup * ;
Gforth compiles this into the threaded code sequence
dup * ;s
In normal direct threaded code there is a code address occupying one
cell for each of these primitives. Each code address points to a
machine code routine, and the interpreter jumps to this machine code in
order to execute the primitive. The routines for these three
primitives are (in
gforth-fast on the 386):
Code dup
( $804B950 ) add esi , # -4 \ $83 $C6 $FC
( $804B953 ) add ebx , # 4 \ $83 $C3 $4
( $804B956 ) mov dword ptr 4 [esi] , ecx \ $89 $4E $4
( $804B959 ) jmp dword ptr FC [ebx] \ $FF $63 $FC
end-code

Code *
( $804ACC4 ) mov eax , dword ptr 4 [esi] \ $8B $46 $4
( $804ACC7 ) add esi , # 4 \ $83 $C6 $4
( $804ACCA ) add ebx , # 4 \ $83 $C3 $4
( $804ACCD ) imul ecx , eax \ $F $AF $C8
( $804ACD0 ) jmp dword ptr FC [ebx] \ $FF $63 $FC
end-code

Code ;s
( $804A693 ) mov eax , dword ptr [edi] \ $8B $7
( $804A695 ) add edi , # 4 \ $83 $C7 $4
( $804A698 ) lea ebx , dword ptr 4 [eax] \ $8D $58 $4
( $804A69B ) jmp dword ptr FC [ebx] \ $FF $63 $FC
end-code
With dynamic superinstructions and replication the compiler does not just lay down the threaded code, but also copies the machine code fragments, usually without the jump at the end.
( $4057D27D ) add esi , # -4 \ $83 $C6 $FC
( $4057D280 ) add ebx , # 4 \ $83 $C3 $4
( $4057D283 ) mov dword ptr 4 [esi] , ecx \ $89 $4E $4
( $4057D286 ) mov eax , dword ptr 4 [esi] \ $8B $46 $4
( $4057D289 ) add esi , # 4 \ $83 $C6 $4
( $4057D28C ) add ebx , # 4 \ $83 $C3 $4
( $4057D28F ) imul ecx , eax \ $F $AF $C8
( $4057D292 ) mov eax , dword ptr [edi] \ $8B $7
( $4057D294 ) add edi , # 4 \ $83 $C7 $4
( $4057D297 ) lea ebx , dword ptr 4 [eax] \ $8D $58 $4
( $4057D29A ) jmp dword ptr FC [ebx] \ $FF $63 $FC
Only when a threaded-code control-flow change happens (e.g., in
;s), the jump is appended. This optimization eliminates many of
these jumps and makes the rest much more predictable. The speedup
depends on the processor and the application; on the Athlon and Pentium
III this optimization typically produces a speedup by a factor of 2.
The code addresses in the direct-threaded code are set to point to the appropriate points in the copied machine code, in this example like this:
primitive  code address
dup        $4057D27D
*          $4057D286
;s         $4057D292
Thus there can be threaded-code jumps to any place in this piece of code. This also simplifies decompilation quite a bit.
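The code addresses in this example follow directly from the instruction lengths in the listings above. The following Python sketch is only a model of that layout arithmetic, not Gforth's actual implementation; the names FRAGMENTS, DISPATCH and layout are made up for illustration, and the byte values are copied from the listings. It concatenates each primitive's machine code without its dispatch jump, keeps the jump only for ;s, and records where each fragment starts.

```python
# Model of the copying step for  dup * ;s  (illustrative, not Gforth source).
# Byte values are taken from the gforth-fast 386 listings.
DISPATCH = bytes([0xFF, 0x63, 0xFC])   # jmp dword ptr FC [ebx]

# (name, body without the trailing dispatch jump, changes control flow?)
FRAGMENTS = [
    ("dup", bytes([0x83, 0xC6, 0xFC, 0x83, 0xC3, 0x04, 0x89, 0x4E, 0x04]), False),
    ("*", bytes([0x8B, 0x46, 0x04, 0x83, 0xC6, 0x04, 0x83, 0xC3, 0x04,
                 0x0F, 0xAF, 0xC8]), False),
    (";s", bytes([0x8B, 0x07, 0x83, 0xC7, 0x04, 0x8D, 0x58, 0x04]), True),
]

def layout(base, fragments):
    """Concatenate fragments starting at `base`.

    Returns the copied code and a list of (name, start address) pairs.
    """
    code = bytearray()
    addresses = []
    for name, body, control_flow in fragments:
        addresses.append((name, base + len(code)))
        code += body
        if control_flow:
            # Only a threaded-code control-flow change keeps its jump.
            code += DISPATCH
    return bytes(code), addresses

code, addresses = layout(0x4057D27D, FRAGMENTS)
for name, addr in addresses:
    print(f"{name:4} ${addr:X}")
```

The computed start addresses agree with the table above, and the copied code contains a single dispatch jump instead of three.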
You can disable this optimization with --no-dynamic. With --no-super you get the copying without eliminating the jumps (i.e., dynamic replication, but without superinstructions), which gives the branch prediction benefit alone. The effect on performance depends on the CPU; on the Athlon and Pentium III the speedup is a little less than for dynamic superinstructions with replication.
One use of these options is patching the threaded code. With superinstructions, many of the dispatch jumps are eliminated, so patching often has no effect. These options preserve all the dispatch jumps.
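One way to picture the branch prediction benefit of replication is the toy model below. It rests on an assumption that is not stated in this section: the processor predicts each indirect jump to go wherever that same jump went last time. The function mispredictions and the primitive names A, B and C are hypothetical. Without replication, all uses of a primitive share one dispatch jump; with replication, every instance in the threaded code has its own copy and therefore its own jump.

```python
# Toy model (an assumption, not measured hardware behaviour): the predictor
# remembers only the last target of each indirect dispatch jump.

def mispredictions(program, replicate, rounds=100):
    """Count mispredicted dispatch jumps when `program` runs in a loop."""
    last_target = {}              # jump site -> target seen last time
    misses = 0
    n = len(program)
    for _ in range(rounds):
        for i, prim in enumerate(program):
            nxt = (i + 1) % n
            if replicate:
                # One jump per instance; the target is the next instance's copy.
                site, target = i, nxt
            else:
                # One shared jump per primitive; the target is the shared routine.
                site, target = prim, program[nxt]
            if last_target.get(site) != target:
                misses += 1
            last_target[site] = target
    return misses

loop = ["A", "B", "A", "C"]       # A is followed by B, then by C
print(mispredictions(loop, replicate=False))
print(mispredictions(loop, replicate=True))
```

In this model the shared jump at the end of A alternates between two targets and is mispredicted on every execution, while the replicated copies miss only during the first pass through the loop. Real predictors are more elaborate, so this shows the direction of the effect rather than its size.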
On some machines dynamic superinstructions are disabled by default because they are unsafe on these machines. However, if you feel adventurous, you can enable them with --dynamic.