JIT compilers are among the most performance-critical components in modern software stacks. Java, JavaScript, Python — every language with a managed runtime depends on JIT compilation to approach native-code speed. When a JIT compiler has a bug that produces wrong output, users notice. When a JIT compiler has a bug that produces slow output, nobody investigates.
Until this paper, no systematic study of JIT compiler performance bugs existed.
The gap is remarkable. Decades of compiler testing research have focused exclusively on correctness: does the compiled code produce the right answer? Differential testing — running the same program through multiple compilers or optimization levels and comparing outputs — is a mature methodology for finding correctness bugs. But the same methodology was never applied to performance. A program that runs correctly but 10x slower than it should passes every existing test.
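To make the contrast concrete, here is a minimal sketch of classic differential correctness testing. It is illustrative, not the paper's harness: Python's interpreter with and without the `-O` flag stands in for any two compiler configurations, and the function names are my own.

```python
import subprocess
import sys

def run_program(source: str, opt_flags: list[str]) -> str:
    """Run a program under the given interpreter flags and capture
    its stdout. (Python's -O flag stands in for a compiler's
    optimization level; any two configurations would do.)"""
    result = subprocess.run(
        [sys.executable, *opt_flags, "-c", source],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout

def differential_test(source: str) -> bool:
    """Classic differential testing: same program, two configurations,
    outputs must agree. A mismatch signals a correctness bug in one
    configuration. Note what is NOT checked: how long each run took."""
    baseline = run_program(source, [])        # unoptimized
    optimized = run_program(source, ["-O"])   # optimized bytecode
    return baseline == optimized

program = "print(sum(i * i for i in range(100)))"
print(differential_test(program))
```

The omission is the whole point: the oracle compares only outputs, so a configuration that produces the right answer arbitrarily slowly still passes.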
An empirical analysis of 191 performance bug reports across four JIT compilers reveals the landscape. These bugs cause significant runtime degradation, factors of 2x to 100x, and they persist in production compilers because no automated tool looks for them. The bugs live in optimization passes: a JIT compiler that fails to apply a known optimization, or applies an optimization that interacts poorly with another, produces correct but slow code.
The resulting tool, Jittery, applies layered differential performance testing: run the same code across JIT tiers (interpreter, baseline compiler, optimizing compiler) and flag cases where a higher optimization tier is slower. Test prioritization cuts testing time by 92% without losing detection capability. The approach found twelve previously unknown performance bugs in Oracle HotSpot and Graal, six of which have already been fixed.
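The tier-comparison idea can be sketched in a few lines. This is a hypothetical reconstruction of the oracle, not Jittery's code: the tier names, the `slack` threshold, and the stand-in workloads are all my assumptions; in the real setting the "tiers" would be the same code forced into a JIT's interpreter, baseline, and optimizing tiers.

```python
import time
from typing import Callable

def measure(workload: Callable[[], object], repeats: int = 5) -> float:
    """Best-of-N wall-clock timing to damp scheduler noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best

def flag_tier_inversions(tiers: dict[str, Callable[[], object]],
                         order: list[str],
                         slack: float = 1.5) -> list[tuple[str, str]]:
    """Layered differential performance testing: `order` lists tiers
    from least to most optimized. A higher tier running more than
    `slack`x slower than a lower one is a candidate performance bug."""
    times = {name: measure(fn) for name, fn in tiers.items()}
    inversions = []
    for lo, hi in zip(order, order[1:]):
        if times[hi] > slack * times[lo]:
            inversions.append((lo, hi))
    return inversions

# Stand-in "tiers": two implementations of the same computation,
# mimicking an interpreter and an optimizing compiler.
def interpreter():
    total = 0
    for i in range(50_000):
        total += i
    return total

def optimizing():
    return sum(range(50_000))

tiers = {"interpreter": interpreter, "optimizing": optimizing}
print(flag_tier_inversions(tiers, ["interpreter", "optimizing"]))
```

The key property is that the oracle needs no ground-truth timing: the lower tiers of the same compiler serve as the baseline, so any inversion is suspicious by construction.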
The bugs were always there. The testing methodology just never asked the performance question.