15.2.3 Dynamic Superinstructions

The engines gforth and gforth-fast use another optimization: dynamic superinstructions with replication. As an example, consider the following colon definition:

: squared ( n1 -- n2 )
  dup * ;

Gforth compiles this into the threaded code sequence

dup
*
;s

Use simple-see (see Examining compiled code) to see the threaded code of a colon definition.

In normal direct-threaded code, each of these primitives is represented by a code address occupying one cell. Each code address points to a machine code routine, and the interpreter jumps to this routine in order to execute the primitive. The routines for these three primitives are (in gforth-fast on the 386):

Code dup  
( $804B950 )  add     esi , # -4  \ $83 $C6 $FC 
( $804B953 )  add     ebx , # 4  \ $83 $C3 $4 
( $804B956 )  mov     dword ptr 4 [esi] , ecx  \ $89 $4E $4 
( $804B959 )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC 
end-code
Code *  
( $804ACC4 )  mov     eax , dword ptr 4 [esi]  \ $8B $46 $4 
( $804ACC7 )  add     esi , # 4  \ $83 $C6 $4 
( $804ACCA )  add     ebx , # 4  \ $83 $C3 $4 
( $804ACCD )  imul    ecx , eax  \ $F $AF $C8 
( $804ACD0 )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC 
end-code
Code ;s  
( $804A693 )  mov     eax , dword ptr [edi]  \ $8B $7 
( $804A695 )  add     edi , # 4  \ $83 $C7 $4 
( $804A698 )  lea     ebx , dword ptr 4 [eax]  \ $8D $58 $4 
( $804A69B )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC 
end-code

With dynamic superinstructions and replication the compiler does not just lay down the threaded code, but also copies the machine code fragments, usually without the jump at the end. For squared, the copied code looks like this:

( $4057D27D )  add     esi , # -4  \ $83 $C6 $FC 
( $4057D280 )  add     ebx , # 4  \ $83 $C3 $4 
( $4057D283 )  mov     dword ptr 4 [esi] , ecx  \ $89 $4E $4 
( $4057D286 )  mov     eax , dword ptr 4 [esi]  \ $8B $46 $4 
( $4057D289 )  add     esi , # 4  \ $83 $C6 $4 
( $4057D28C )  add     ebx , # 4  \ $83 $C3 $4 
( $4057D28F )  imul    ecx , eax  \ $F $AF $C8 
( $4057D292 )  mov     eax , dword ptr [edi]  \ $8B $7 
( $4057D294 )  add     edi , # 4  \ $83 $C7 $4 
( $4057D297 )  lea     ebx , dword ptr 4 [eax]  \ $8D $58 $4 
( $4057D29A )  jmp     dword ptr FC [ebx]  \ $FF $63 $FC 

The dispatch jump is appended only when a threaded-code control-flow change happens (e.g., in ;s). This optimization eliminates many of the dispatch jumps and makes the remaining ones much more predictable. The speedup depends on the processor and the application; on the Athlon and Pentium III this optimization typically speeds up execution by a factor of 2.

The code addresses in the direct-threaded code are set to point to the appropriate points in the copied machine code, in this example like this:

primitive  code address
   dup       $4057D27D
   *         $4057D286
   ;s        $4057D292

Thus there can be threaded-code jumps to any place in this piece of code. This also simplifies decompilation quite a bit.
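
The copying step and the resulting code addresses can be modelled in a few lines. The following Python sketch is only an illustration, not Gforth's actual implementation (which is part of the engine itself): it concatenates the byte sequences from the listings above, drops the dispatch jump of every primitive except ;s, and records where each fragment starts. The names PRIMITIVES, CONTROL_FLOW and compile_dynamic are made up for this example.

```python
# Toy model of dynamic superinstructions with replication.
# Byte sequences are taken from the gforth-fast 386 listings above;
# all names here are hypothetical and exist only for this illustration.

DISPATCH = bytes.fromhex("FF63FC")  # jmp dword ptr FC [ebx]

# Machine code of each primitive, including its trailing dispatch jump.
PRIMITIVES = {
    "dup": bytes.fromhex("83C6FC 83C304 894E04") + DISPATCH,
    "*": bytes.fromhex("8B4604 83C604 83C304 0FAFC8") + DISPATCH,
    ";s": bytes.fromhex("8B07 83C704 8D5804") + DISPATCH,
}

# Primitives that change threaded-code control flow keep their jump.
CONTROL_FLOW = {";s"}


def compile_dynamic(words, base):
    """Concatenate the fragments for words, starting at address base.

    Returns the copied machine code and a list of (word, code address)
    pairs, one per threaded-code cell.
    """
    code = bytearray()
    addresses = []
    for word in words:
        # The threaded-code cell points at the start of this copy.
        addresses.append((word, base + len(code)))
        fragment = PRIMITIVES[word]
        if word not in CONTROL_FLOW:
            # Fall through into the next fragment: drop the dispatch jump.
            fragment = fragment[: -len(DISPATCH)]
        code += fragment
    return bytes(code), addresses


code, addresses = compile_dynamic(["dup", "*", ";s"], 0x4057D27D)
for word, address in addresses:
    print(f"{word:4} ${address:X}")
```

Run as a script, this prints the same three code addresses as the table above; the concatenated code is 32 bytes long and contains a single dispatch jump, at its end.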

The word see-code (see Examining compiled code) shows the threaded code intermingled with the native code of dynamic superinstructions. Nowadays some additional optimizations are applied to the dynamically generated native code, so the output of see-code squared on gforth-fast on one particular AMD64 installation looks like this:

$7FB689C678C8 dup    1->2 
7FB68990C1B2:   mov     r15,r8
$7FB689C678D0 *    2->1 
7FB68990C1B5:   imul    r8,r15
$7FB689C678D8 ;s    1->1 
7FB68990C1B9:   mov     rbx,[r14]
7FB68990C1BC:   add     r14,$08
7FB68990C1C0:   mov     rax,[rbx]
7FB68990C1C3:   jmp     eax

You can disable this optimization with --no-dynamic. With --no-super you get the copying without eliminating the jumps (i.e., dynamic replication, but without superinstructions). This gives the branch prediction benefit alone. The effect on performance depends on the CPU; on the Athlon and Pentium III the speedup is a little less than for dynamic superinstructions with replication.

These options are useful if you want to patch the threaded code. With superinstructions, many of the dispatch jumps are eliminated, so the patched cells are often never read and patching has no effect. Both --no-dynamic and --no-super preserve all the dispatch jumps.
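
Why patching can go unnoticed is easiest to see in a toy model. The Python sketch below is not how Gforth works internally; it only mimics the two situations. With dispatch jumps, every primitive fetches the next cell from the threaded code, so a patched cell is seen. With superinstructions, the fragments were joined at compile time and execution falls through them without looking at the cells again. The class Definition and its fields are hypothetical names.

```python
# Toy model: effect of patching threaded code with and without
# dispatch jumps. All names are hypothetical illustrations.

PRIMS = {
    "dup": lambda s: s.append(s[-1]),
    "*": lambda s: s.append(s.pop() * s.pop()),
    "+": lambda s: s.append(s.pop() + s.pop()),
}


class Definition:
    def __init__(self, words, keep_dispatch):
        # Threaded code: one patchable cell per primitive.
        self.threaded = list(words)
        self.keep_dispatch = keep_dispatch
        # Stand-in for the copied machine code: the fragments are
        # fixed once, at compile time (;s just ends the sequence).
        self.native = [PRIMS[w] for w in words if w != ";s"]

    def execute(self, n):
        stack = [n]
        if self.keep_dispatch:
            # Each dispatch jump reads the next threaded-code cell.
            for word in self.threaded:
                if word == ";s":
                    break
                PRIMS[word](stack)
        else:
            # Superinstruction: fall through the joined fragments
            # without consulting the threaded code.
            for fragment in self.native:
                fragment(stack)
        return stack[-1]


patched = {}
for keep_dispatch in (True, False):
    squared = Definition(["dup", "*", ";s"], keep_dispatch)
    squared.threaded[1] = "+"  # patch the cell for * to +
    patched[keep_dispatch] = squared.execute(5)
print(patched)
```

In this model the patch changes the result only when the dispatch jumps are kept: 5 becomes 10 (dup +) in that case, but still 25 (dup *) when the fragments are joined.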

On some machines dynamic superinstructions are disabled by default, because they are unsafe on these machines. However, if you feel adventurous, you can enable them with --dynamic.