Brief Intro to the Template Interpreter in OpenJDK
OpenJDK includes two interpreters, CPP interpreter and Template interpreter. The template interpreter is the default one, and its implementation contains various optimization to make it fast, which also makes its source code harder to follow to some extent. In this post we will focus on the template interpreter, understanding how it works on a high-level, and gradually diving deeper into the source code.
§Primer
We will start with a sketch of the interpreter, covering some major components while omitting most details from the actually implementation. The key in this section is to get a rough mental picture on what the interpreter consists of, and how it’s different from traditional switch-based interpreters.
1 | // bytecodes are defined as an enum. |
In addition to generating the assembly snippet for a bytecode, generate_asm
also emits the logic for updating the
bcp
and jump to the entry of the next bytecode. In summary, after populate_bc_table
, the generated assembly looks
like:
1 | bytecode0: /<- diff dispatch point |
Note that at the end of each bytecode case, bcp
is incremented and we jump to the assembly block associated with the
new bytecode. One advantage of this structure, instead of a plain C-switch, is that there are multiple dispatch points,
making branch-prediction easier, if opcode X is often followed by opcode Y. (The X-Y association sounds very reasonable
and convincing, but I am unaware of any hard evidence in the area of JVM bytecode.) This technique is called “direct
threaded code” in the literature.
Here I am making an omission, and treating all all bytecodes as “branchless” (they run-till-the-end). However, some
bytecode contains “jump” logic, such as goto
, ireturn
, etc. IOW, after executing them, the control does not move
to the next one in the bytecode stream. The following pseudo-assembly shows the difference between “branchless” and
“branchful” bytecode: the former has a generic dispatch logic, while the latter handles the dispatching itself.
1 | bytecode0: <--- branchless bytecode |
§A tale of two stacks
In the context of JVM bytecode interpreter, there are two conceptually independent “stacks”: the call stack and the JVM
stack (because JVM is stack-based as opposed to register-based). The call stack grows/shrinks as we call into and
return from a function in C/C++, while the JVM stack grows/shrinks as JVM bytecode is executed, e.g. iadd
pops two
integers from the stack (shrinking), and pushes the sum of them onto the stack (growing).
The call stack is automatically managed by the C/C++ compiler&runtime, requiring nothing from us. In contrast, the JVM
stack must be managed by the interpreter (us). The most straightforward way to implement JVM stack is to preallocate
some memory, and a counter tracking the top of the stack: for each bytecode, we get the inputs from the stack (pop from
the stack), do the calculation, and place the result on the stack (push to the stack). The pseudo-code for iadd
could
be something like:
1 | // JVM stack |
Because the call stack is “natively” supported by the architecture: there is a dedicated CPU register (the stack
register) and instructions for manipulate the stack (push
and pop
), it requires fewer instructions than stacks
implemented on the C/C++ level. Therefore, as an optimization, the interpreter in OpenJDK uses the call stack to
implement the JVM stack, which of course requires the use of assembly since the call stack is not visible on the C/C++
level. In the following, I will continue using the term JVM stack, but it’s just a conceptual entity, and the actual
implementation is the call stack.
§Top-of-stack (TOS) state
Even though the call stack is fast, it’s still slower than the CPU registers, so it’s desirable to keep the data in
registers (in other words, to delay pushing the data onto the call stack). For example, when a bytecode should push
something onto the JVM stack, and the bytecode immediately uses it (popping it from the JVM stack), the two stack
operations should cancel out ideally. Take the following Java bytecode as an example (generated from int x = 1;
):
1 | iconst_1 # places int value 1 onto the stack; |
If we implement the interpreter naively, the resulting assembly will be something like:
1 | reg = 1 |
In order to cancel out these two stack operations (push and pop), the interpreter must be stateful, because operations to the JVM stack corresponding to the current bytecode are influenced by the next bytecode.
The technique used in OpenJDK is to track the current top-of-stack (TOS) state: what is on the top of the JVM stack
semantically. There are ten TOS states, and we will use only two in our examples: vtos
(void) and itos
(integer).
Each TOS state denotes a value with the corresponding type in the TOS register (a preserved CPU register). If the
next bytecode requires an input with the matching TOS state, such input can be retrieved directly from the register
instead of loading from the JVM stack. Using the same int x = 1;
from above, but with the TOS state transition:
1 | iconst_1 # places int value 1 onto the stack; vtos -> itos |
The corresponding assembly will semantically be something like:
1 | reg_tos = 1 # store 1 to TOS register |
Note that we get rid of JVM-stack operations completely here; both load and store are done to registers. This is
possible only if the output TOS state matches the input TOS state of the next bytecode. What happens if there’s a
mismatch? Let’s look at another example, excerpt from f(1, 2)
:
1 | iconst_1 # places int value 1 onto the stack; vtos -> itos |
The assembly will semantically be like:
1 | reg_tos = 1 |
Basically, we save the TOS entry onto the JVM stack when mismatch occurs. In a way, this TOS register can be viewed as a single-value cache in front of the JVM stack.
§To see the world as it is
In this section we will use a trivial Java program to see the actual generated assembly (both the application and the interpreter).
1 | public class test { |
1 | javac test.java |
The Java bytecode looks like:
1 | 0: iconst_1 |
Next we ask JVM to print the interpreter (the assembly for each bytecode).
1 | $ java -XX:+UnlockDiagnosticVMOptions \ |
The part for iconst_1
looks like the following. It contains mostly three sections:
- entries, depending on the TOS state
- the main body, “essence”
- incrementing the bytecode pointer and jump to the next bytecode
1 | iconst_1 4 iconst_1 [0x00007f22097effe0, 0x00007f22097f0040] 96 bytes |
Here we can appreciate why it’s called a “template” interpreter: depending on the TOS state, control enters this assembly snippet at different entries, as if there are different versions of assembly for the same bytecode.
Next we will look at some key stack frames, which might be useful when using a debugger.
JavaMain
calls the main
method in Java. Since JavaMain
is C but main
is Java, a stub is used in the middle to
bridge the call: JavaCalls::call_helper
identifies the being called Java method (main
in this case), calls into the
interpreter using StubRoutines::call_stub
. The control moves to the interpreter through one of many entries, depending
on the method kind, AbstractInterpreter::MethodKind
. The method kind of main
is zerolocals
, and the entry for such
method kind is generated by TemplateInterpreterGenerator::generate_normal_entry
. At the end of the entry, we will
dispatch to the first bytecode in the current Java method.
In summary,
class::method | callsite |
---|---|
JavaMain |
(*env)->CallStaticVoidMethod(...) |
JavaCalls::call_helper |
StubRoutines::call_stub |
TemplateInterpreterGenerator::generate_normal_entry |
__ dispatch_next(vtos) |
Here is a concrete example on how to debug a Java program using GDB:
1 | gdb java -ex "set args -cp . <class>" \ |
The first breakpoint gets us the transition from C to Java, and the second breakpoint corresponds roughly to the first thing in Java land.