> This is called branch prediction, it has been the source of many fun security issues...
No, that's speculative execution you just described. Branch prediction was implemented long before out-of-order CPUs were a thing, as you need branch prediction to make the most of pipelining (eg. fetching and decoding a new instruction while you're still executing the previous one--if you predict branches, you're more likely to keep the pipeline full).
Speculative execution does not require out-of-order execution. When you predict a branch, you're speculatively executing the predicted branch. Whether you're doing it in the same order as instruction order or out of order is independent of that.
Essentially all microarchtectural state is fodder for side channel exploits.
Static branch prediction like "predict taken if negative branch offset" doesn't leak anything, but just about any dynamically updated tables will (almost tautologically) contain statistical information about what was executed recently.
A shame operating systems like iOS/iPadOS do not allow JIT. iPad Pro's have such fast CPU's that you cant even use fully because of decisions like this.
Good read. But a word of caution - the "JIT vs interpreter" comparisons often favor the interpreter when the JIT is inplemented as more-or-less simple inlining of the interpreter code. (Here called "copy-and-patch" but a decades-only approach). I've had fairly senior engineers try to convince me that this is true even for Java VMs. It's not in general, at least not with the right kind of JIT compiler design.
I just recently upgraded[1] a JIT that essentially compiled each bytecode separately to one that shares registers within the same basic block. Easy 40 percent improvement to runtime, as expected.
But something I hadn't expected was it also improved compilation time by 40 percent too (fewer virtual registers made for much faster register allocation).
This is an embarrassing context to admit, but here goes.
Back when Parrot was a thing and the Perl 6 people were targeting it, I profiled the prelude of Perl 6 to optimize startup time and discovered two things:
- the first basic block of the prelude was thousands of instructions long (not surprising)
- the compiler had to allocate thousands of registers because the prelude instructions used virtual registers
The prelude emitted two instructions, one right after another: load a named symbol from a library, then make it available. I forget all of the details, but each of those instructions either one string register and one PMC register. Because register allocation used the dominance frontier method, the size of the basic block and total number of all symbolic registers dominated the algorithm.
I suggested a change to the prelude emitter to reuse actual registers and avoid virtual registers and compilation sped up quite a bit.
Yeah, I expect the real advantage of a JIT is that you can perform proper register allocation and avoid a lot of stack and/or virtual register manipulation.
I wrote a toy copy-patch JIT before and I don't remember being impressed with the performance, even compared to a naive dispatch loop, even on my ~11 year old processor.
Exactly, and it's not just register allocatio: but for many languages also addign proper typing, some basic data flow optimization, some constant folding, and a few other things that can be done fairly quickly, without the full set of trees and progressive lowering of the operators down to instruactions.
What's odd about the "JIT vs interpreter" debate is that it keeps coming up, given that it is fairly easy to see even in toy examples.
From the previous article in the series, it looks like the biggest impediment to just using full llvm to compile the query is that they didn't find a good way to cache the results across invocations.
Sql server hekaton punted this problem in a seemingly effective way by requiring the client to use stored procedures to get full native compilation. Not sure though if they recompile if the table statistics indicate a different query plan is needed.
I'm not really interested in building an interpreter, but the part about scalar out of order execution got me thinking. The opcode sequencing logic of an interpreter is inherently serial and an obvious bottleneck (step++; goto step->label; requires an add, then a fetch and then a jump, pretty ugly).
Why not do the same thing the CPU does and fetch N jump addresses at once?
Now the overhead is gone and you just need to figure out how to let the CPU fetch the chain of instructions that implement the opcodes.
You simply copy the interpreter N times, store N opcode jump addresses in N registers and each interpreter copy is hardcoded to access its own register during the computed goto.
You run into the same problem a CPU does: if you have dependencies between the instructions, you can't execute ahead of time. Your processor has a bunch of hardware to efficiently resolve conflicts but your interpreter does not.
No, that's speculative execution you just described. Branch prediction was implemented long before out-of-order CPUs were a thing, as you need branch prediction to make the most of pipelining (eg. fetching and decoding a new instruction while you're still executing the previous one--if you predict branches, you're more likely to keep the pipeline full).
Static branch prediction like "predict taken if negative branch offset" doesn't leak anything, but just about any dynamically updated tables will (almost tautologically) contain statistical information about what was executed recently.
But something I hadn't expected was it also improved compilation time by 40 percent too (fewer virtual registers made for much faster register allocation).
[1] https://github.com/ZQuestClassic/ZQuestClassic/commit/68087d...
Back when Parrot was a thing and the Perl 6 people were targeting it, I profiled the prelude of Perl 6 to optimize startup time and discovered two things:
- the first basic block of the prelude was thousands of instructions long (not surprising) - the compiler had to allocate thousands of registers because the prelude instructions used virtual registers
The prelude emitted two instructions, one right after another: load a named symbol from a library, then make it available. I forget all of the details, but each of those instructions either one string register and one PMC register. Because register allocation used the dominance frontier method, the size of the basic block and total number of all symbolic registers dominated the algorithm.
I suggested a change to the prelude emitter to reuse actual registers and avoid virtual registers and compilation sped up quite a bit.
I wrote a toy copy-patch JIT before and I don't remember being impressed with the performance, even compared to a naive dispatch loop, even on my ~11 year old processor.
What's odd about the "JIT vs interpreter" debate is that it keeps coming up, given that it is fairly easy to see even in toy examples.
Sql server hekaton punted this problem in a seemingly effective way by requiring the client to use stored procedures to get full native compilation. Not sure though if they recompile if the table statistics indicate a different query plan is needed.
My take is that you can get pretty far these days with a simple bytecode interpreter. Food for thought if your side project could benefit from a DSL!
Why not do the same thing the CPU does and fetch N jump addresses at once?
Now the overhead is gone and you just need to figure out how to let the CPU fetch the chain of instructions that implement the opcodes.
You simply copy the interpreter N times, store N opcode jump addresses in N registers and each interpreter copy is hardcoded to access its own register during the computed goto.