OpenJDK ships two interpreters: the C++ interpreter and the template interpreter. The template interpreter is the default, and its implementation contains various optimizations that make it fast, which also makes its source code somewhat harder to follow. In this post we will focus on the template interpreter, first understanding how it works at a high level and then gradually diving deeper into the source code.

§Primer

We will start with a sketch of the interpreter, covering the major components while omitting most details of the actual implementation. The goal of this section is to build a rough mental picture of what the interpreter consists of, and how it differs from a traditional switch-based interpreter.

// bytecodes are defined as an enum.
enum bytecodes {
  bytecode0,
  bytecode1,
  bytecode2,
  ...
  num_bytecodes
};

// entry to the assembly snippet of each bytecode; IOW, each slot contains
// the starting address of the assembly snippet.
uintptr_t bc_table[num_bytecodes];

// bytecode pointer, pointing to the current bytecode; similar to the program counter (pc)
void* bcp;

// populate the bc_table.
// generate_asm is architecture-specific, and emits the assembly snippet implementing a bytecode.
void populate_bc_table() {
  for (bc : bytecodes) {
    bc_table[bc] = generate_asm(bc);
  }
}

In addition to generating the assembly snippet for a bytecode, generate_asm also emits the logic for updating the bcp and jumping to the entry of the next bytecode. In summary, after populate_bc_table, the generated assembly looks like:

bytecode0:
  ...
  ...
  bcp += size
  jmp bc_table[*bcp]      <--- dispatch point
bytecode1:
  ...
  ...
  bcp += size
  jmp bc_table[*bcp]      <--- dispatch point

Note that at the end of each bytecode case, bcp is incremented and we jump to the assembly block associated with the new bytecode. One advantage of this structure over a plain C switch is that there are multiple dispatch points, which makes branch prediction easier if opcode X is often followed by opcode Y. (The X-Y association sounds reasonable and convincing, but I am unaware of any hard evidence in the area of JVM bytecode.) This technique is known as "threaded code" in the literature.
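
To make the dispatch structure concrete, here is a minimal, hypothetical interpreter loop in C using the GCC/Clang computed-goto extension (&&label). The opcodes (PUSH1, ADD, HALT) and every other name in it are made up for illustration; it is a sketch of the dispatch technique, not HotSpot code.

#include <stdint.h>
#include <stdio.h>

// a made-up three-opcode bytecode set, purely for illustration
enum { PUSH1, ADD, HALT };

int run(const uint8_t *bcp) {
  int stack[16], top = 0;
  // one entry per opcode; &&label is the GCC/Clang "computed goto" extension
  void *table[] = { &&push1, &&add, &&halt };
  goto *table[*bcp];                    // initial dispatch

push1:
  stack[top++] = 1;
  bcp += 1; goto *table[*bcp];          // dispatch point #1
add:
  top -= 1;
  stack[top - 1] += stack[top];
  bcp += 1; goto *table[*bcp];          // dispatch point #2
halt:
  return stack[top - 1];
}

int main(void) {
  uint8_t code[] = { PUSH1, PUSH1, ADD, HALT };
  printf("%d\n", run(code));            // prints 2
  return 0;
}

Each handler ends with its own indirect jump, so the branch predictor can learn per-opcode patterns, whereas a switch-based loop funnels every dispatch through a single indirect branch.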

Here I am making a simplification and treating all bytecodes as "branchless" (they run till the end). However, some bytecodes contain "jump" logic, such as goto, ireturn, etc. IOW, after executing them, control does not move to the next bytecode in the stream. The following pseudo-assembly shows the difference between a "branchless" and a "branchful" bytecode: the former uses the generic dispatch logic, while the latter handles the dispatching itself.

bytecode0:                <--- branchless bytecode
  ...                     <-\
  ...                       | -- assembly corresponding to this bytecode
  ...                     <-/
  bcp += size
  jmp bc_table[*bcp]
goto:                     <--- branchful bytecode
  branch_offset = ...     <-\
  bcp += branch_offset      | -- assembly corresponding to this bytecode
  jmp bc_table[*bcp]      <-/
  // unreachable
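
To make the "branch_offset = ..." line concrete: in JVM bytecode, goto (opcode 0xa7) is followed by two bytes that form a signed 16-bit offset relative to the address of the goto opcode itself. The following standalone C sketch decodes that operand; the function name and the toy byte stream are made up for illustration, and in the template interpreter the equivalent logic is emitted as assembly.

#include <stdint.h>
#include <stdio.h>

// Compute the new bcp for a JVM "goto": opcode 0xa7, followed by branchbyte1 and
// branchbyte2, which form a signed 16-bit offset from the goto opcode's own address.
const uint8_t *take_goto(const uint8_t *bcp) {
  int16_t branch_offset = (int16_t)(((uint16_t)bcp[1] << 8) | bcp[2]);
  return bcp + branch_offset;   // the handler then dispatches from the new bcp
}

int main(void) {
  // a tiny stream: "goto +3" (0xa7 0x00 0x03), with the branch target at offset 3
  uint8_t code[] = { 0xa7, 0x00, 0x03, 0x00 };
  printf("%td\n", take_goto(code) - code);   // prints 3
  return 0;
}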

§A tale of two stacks

In the context of the JVM bytecode interpreter, there are two conceptually independent "stacks": the call stack and the JVM stack (the JVM is stack-based, as opposed to register-based). The call stack grows and shrinks as we call into and return from functions in C/C++, while the JVM stack grows and shrinks as JVM bytecode executes; e.g. iadd pops two integers from the stack (shrinking it) and pushes their sum back onto the stack (growing it).

The call stack is managed automatically by the C/C++ compiler and runtime, requiring nothing from us. In contrast, the JVM stack must be managed by the interpreter (us). The most straightforward way to implement the JVM stack is to preallocate some memory and keep a counter tracking the top of the stack: for each bytecode, we take the inputs from the stack (pop), do the calculation, and place the result back on the stack (push). The pseudo-code for iadd could be something like:

// JVM stack
uint64_t stack[stack_size_limit];
uint top = 0;

push(e) {
  stack[top++] = e;
}

pop() {
  return stack[--top];
}

iadd() {
  x = pop();
  y = pop();
  push(x + y);
}

Because the call stack is "natively" supported by the architecture (there is a dedicated CPU register, the stack pointer, and instructions for manipulating the stack, push and pop), it requires fewer instructions than a stack implemented at the C/C++ level. Therefore, as an optimization, the interpreter in OpenJDK uses the call stack to implement the JVM stack, which of course requires assembly, since the call stack is not visible at the C/C++ level. In the following, I will keep using the term JVM stack, but it is just a conceptual entity; the actual implementation is the call stack.
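
As a rough sketch (simplified pseudo-assembly in the style of the diagrams above, not the actual generated code), iadd implemented directly on top of the native stack could be as short as:

pop rax                 # pop one int operand with a native stack instruction
pop rdx                 # pop the other
add eax, edx            # add the 32-bit values
push rax                # push the sum back

There is no separate top counter to maintain; the stack pointer register itself tracks the top of the stack.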

§Top-of-stack (TOS) state

Even though the call stack is fast, it is still slower than CPU registers, so it is desirable to keep data in registers, in other words, to delay pushing it onto the call stack. For example, when one bytecode pushes something onto the JVM stack and the next bytecode immediately pops it, the two stack operations should ideally cancel out. Take the following Java bytecode as an example (generated from int x = 1;):

iconst_1 # places int value 1 onto the stack;
istore_0 # store int value (at top-of-stack) into variable 0;

If we implement the interpreter naively, the resulting assembly will be something like:

reg = 1
push reg            # pushing to the JVM stack
...                 # jump to the next bytecode
pop reg             # popping from the JVM stack

In order to cancel out these two stack operations (the push and the pop), the interpreter must be stateful, because the JVM-stack operations performed for the current bytecode are influenced by the next bytecode.

The technique used in OpenJDK is to track the current top-of-stack (TOS) state: what is, semantically, on top of the JVM stack. There are ten TOS states; we will use only two in our examples: vtos (void) and itos (integer). Each TOS state denotes a value of the corresponding type held in the TOS register (a dedicated CPU register). If the next bytecode requires an input with the matching TOS state, that input can be read directly from the register instead of being loaded from the JVM stack. Using the same int x = 1; from above, but annotated with the TOS state transitions:

iconst_1 # places int value 1 onto the stack;                 vtos -> itos
istore_0 # store int value (at top-of-stack) into variable 0; itos -> vtos

The corresponding assembly will semantically be something like:

reg_tos = 1         # store 1 to TOS register
...                 # jump to the next bytecode
var_0 = reg_tos     # load from TOS register

Note that we get rid of the JVM-stack operations completely here; both the store and the load go through registers. This is possible only if the output TOS state of one bytecode matches the input TOS state of the next. What happens if there is a mismatch? Let's look at another example, excerpted from the bytecode for f(1, 2):

iconst_1 # places int value 1 onto the stack; vtos -> itos
iconst_2 # places int value 2 onto the stack; vtos -> itos

The assembly will semantically be like:

reg_tos = 1
...                 # jump to the next bytecode
push(reg_tos)       # TOS state mismatch: actual (itos) differs from the expected (vtos); save the TOS register onto the JVM stack
reg_tos = 2

Basically, we save the value in the TOS register onto the JVM stack when a mismatch occurs. In a way, the TOS register can be viewed as a single-value cache in front of the JVM stack.
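
Concretely, this means the bc_table from the primer gains a second dimension: one row per TOS state, and the dispatch at the end of a bytecode goes through the row matching that bytecode's output TOS state, so the next bytecode is entered at the matching entry point. The sketch below reuses the primer's made-up names and is not the actual HotSpot data structure (HotSpot keeps a dispatch table indexed by TOS state and bytecode):

#include <stdint.h>

// two of the ten TOS states used in this post
enum tos_state { vtos, itos, num_tos_states };

enum bytecodes { bytecode0, bytecode1, num_bytecodes };

// one entry point per (TOS state, bytecode) pair: bc_table[tos][bc] holds the address
// of the snippet for bc whose prologue expects the top of stack cached as described by tos
uintptr_t bc_table[num_tos_states][num_bytecodes];

// A bytecode that leaves an int in the TOS register ends with
//     jmp bc_table[itos][*bcp]
// so the next bytecode is entered through its itos entry point. If that bytecode
// actually expects vtos, its itos entry simply spills the TOS register onto the
// JVM stack first; this is exactly the "push rax" entry visible in the assembly
// dump in the next section.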

§To see the world as it is

In this section we will use a trivial Java program to see what is actually generated: the application's bytecode and the interpreter's assembly.

public class test {
    public static void main(String[] args) {
        int x = 1;
    }
}

$ javac test.java
$ javap -c test

The Java bytecode looks like:

0: iconst_1
1: istore_1
2: return

Next we ask the JVM to print the interpreter (the assembly generated for each bytecode).

$ java -XX:+UnlockDiagnosticVMOptions \
       -XX:+PrintInterpreter \           # print assembly for each bytecode
       -XX:PrintAssemblyOptions=intel \  # use the same dst-src order as in the openjdk source code
       test

The part for iconst_1 looks like the following. It consists of three main sections:

  1. entries, depending on the TOS state
  2. the main body, the "essence"
  3. incrementing the bytecode pointer and jumping to the next bytecode

iconst_1  4 iconst_1  [0x00007f22097effe0, 0x00007f22097f0040]  96 bytes

--------------------------------------------------------------------------------
0x00007f22097effe0: sub rsp,0x8                   <----|
0x00007f22097effe4: vmovss DWORD PTR [rsp],xmm0        |
0x00007f22097effe9: jmp 0x00007f22097f0013             | diff entries depending
0x00007f22097effee: sub rsp,0x10                  <----| on the TOS state
0x00007f22097efff2: vmovsd QWORD PTR [rsp],xmm0        |
0x00007f22097efff7: jmp 0x00007f22097f0013             | all go to the essence
0x00007f22097efffc: sub rsp,0x10                  <----|
0x00007f22097f0000: mov QWORD PTR [rsp],rax            |
0x00007f22097f0004: mov QWORD PTR [rsp+0x8],0x0        |
0x00007f22097f000d: jmp 0x00007f22097f0013             |
0x00007f22097f0012: push rax                      <----|
0x00007f22097f0013: mov eax,0x1                   <---- the "essence"
0x00007f22097f0018: movzx ebx,BYTE PTR [r13+0x1]  <----|
0x00007f22097f001d: inc r13                            |
0x00007f22097f0020: movabs r10,0x7f22292851c0          | inc bcp and jump to next bytecode
0x00007f22097f002a: jmp QWORD PTR [r10+rbx*8]     <----|
0x00007f22097f002e: xchg ax,ax
0x00007f22097f0030: add BYTE PTR [rax],al
0x00007f22097f0032: add BYTE PTR [rax],al
0x00007f22097f0034: add BYTE PTR [rax],al
0x00007f22097f0036: add BYTE PTR [rax],al
0x00007f22097f0038: add BYTE PTR [rax],al
0x00007f22097f003a: add BYTE PTR [rax],al
0x00007f22097f003c: add BYTE PTR [rax],al
0x00007f22097f003e: add BYTE PTR [rax],al
--------------------------------------------------------------------------------

Here we can appreciate why it's called a "template" interpreter: depending on the TOS state, control enters this assembly snippet at different entry points, as if there were different versions of the assembly for the same bytecode.

Next we will look at some key stack frames, which might be useful when using a debugger.

JavaMain calls the main method in Java. Since JavaMain is C but main is Java, a stub is used in the middle to bridge the call: JavaCalls::call_helper identifies the Java method being called (main in this case) and calls into the interpreter via StubRoutines::call_stub. Control then enters the interpreter through one of many entries, depending on the method kind (AbstractInterpreter::MethodKind). The method kind of main is zerolocals, and the entry for that kind is generated by TemplateInterpreterGenerator::generate_normal_entry. At the end of the entry, we dispatch to the first bytecode of the current Java method.

In summary,

class::method                                          callsite
JavaMain                                               (*env)->CallStaticVoidMethod(...)
JavaCalls::call_helper                                 StubRoutines::call_stub
TemplateInterpreterGenerator::generate_normal_entry    __ dispatch_next(vtos)

Here is a concrete example on how to debug a Java program using GDB:

gdb java -ex "set args -cp . <class>"                       \
         -ex "set breakpoint pending on"                    \
         -ex "break jni_CallStaticVoidMethod"               \
         -ex "run"                                          \
         -ex "break InterpreterRuntime::resolve_from_cache" \
         -ex "cont"                                         \
         -ex 'call pns($rsp, $rbp, $pc)'

The first breakpoint catches the transition from C to Java, and the second breakpoint corresponds roughly to the first thing executed in Java land.

§References