v0.12.0 - Tier-2 trace optimizer
Released May 8, 2026.
v0.11 gave us the adaptive specializer. v0.12 gives us what the specializer was always meant to feed: a Tier-2 trace optimizer that takes the typed opcodes the specializer produces, projects them into a linear uop stream, runs an analysis pass over the stream, and dispatches the result through a tight tier-2 eval loop. Where Tier-1 (the specializer) rewrites individual bytecodes in place, Tier-2 captures whole traces of bytecodes and treats them as a unit.
This release lays down all of the control flow. The per-install-site executor side table, the JUMP_BACKWARD warmup callback that drives _PyOptimizer_Optimize, projection from quickened bytecode into a uop trace, the analysis skeleton with the cleanup pass that prunes dead _SET_IP and _CHECK_VALIDITY rows, and the uop dispatch loop itself are all in. Trace install, side-table wiring, the dispatcher, and the deopt arm are all end-to-end gated.
What is intentionally not in: the long tail of hand-ported per-uop
bodies. Of the roughly 285 Tier-2-viable uops, 14 ship with real
bodies in this release. The rest live as StatusDeopt stubs in
uops_stubs_gen.go and bail back to Tier-1 when the dispatcher
hits them. This is by design. Adding a body removes a stub on the
next regen, so the long tail is a body swap rather than a
control-flow change. We wanted the scaffolding gated and shipped
before we started porting hundreds of small uop bodies against it.
Highlights
Three pieces of work carry this release.
Trace projection and install
A "trace" in Tier-2 is a linear sequence of micro-operations
(uops) projected from a hot loop's quickened bytecode. The
projector starts at a JUMP_BACKWARD that has warmed past a
threshold, walks forward through the bytecode, expands each
opcode into its uop expansion via the metadata table, stamps
every guarded type / dict / code pointer into a dependency bloom
filter, closes the loop with _JUMP_TO_TOP when the projector
returns to the install site, and bails on a return or explicit
error before the loop edge.
def hot():
    s = 0
    for i in range(100_000):
        s += i
    return s
When hot() runs, the JUMP_BACKWARD at the bottom of the for loop accumulates warmup. After enough hits, the warmup callback in vm/tier2.go, tryWarmupTier2, fires. It allocates an executor slot and calls optimizer.Optimize, which walks the loop body and produces a uop trace that looks roughly like:
_START_EXECUTOR
_MAKE_WARM
_CHECK_VALIDITY
_SET_IP (bytecode index N)
_LOAD_FAST (i)
_LOAD_FAST (s)
_GUARD_BOTH_INT
_BINARY_OP_ADD_INT
_STORE_FAST (s)
_LOAD_FAST (i)
_LOAD_CONST (1)
_GUARD_BOTH_INT
_BINARY_OP_ADD_INT
_STORE_FAST (i)
... (range iteration check)
_JUMP_TO_TOP
The install path patches the JUMP_BACKWARD site to
ENTER_EXECUTOR. The next pass through that site does not run
the Tier-1 loop body; it runs the trace. The Tier-1 op the
install path stashed in Executor.VMData.Opcode becomes the
deopt target: if anything goes wrong (a guard fails, an uop
returns StatusDeopt), the dispatcher falls back to that Tier-1
opcode and continues from there.
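The deopt arm is mechanical enough to sketch. Here is a minimal Go model, assuming the stashed VMData fields described above; runTrace and the trySimple signature are simplified stand-ins for the real dispatcher and Tier-1 re-dispatch, not gopy's actual shapes.

package main

import "fmt"

type vmData struct {
    Opcode int // the Tier-1 op the install path patched out
    Oparg  int
}

type executor struct{ VMData vmData }

// enterExecutor runs the trace; when the run deopts, it re-dispatches
// the stashed Tier-1 instruction so execution continues from the site.
func enterExecutor(e *executor, runTrace func() bool, trySimple func(op, arg int)) {
    if deopted := runTrace(); deopted {
        trySimple(e.VMData.Opcode, e.VMData.Oparg)
    }
}

func main() {
    e := &executor{VMData: vmData{Opcode: 77, Oparg: 3}}
    enterExecutor(e,
        func() bool { return true }, // a guard failed mid-trace
        func(op, arg int) { fmt.Printf("tier-1 fallback: opcode=%d oparg=%d\n", op, arg) },
    )
}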
The uop dispatcher
The dispatcher is in optimizer/uops.go. RunExecutor builds a
Tier2State around the trace and the calling frame; Run walks
the buffer one uop at a time, calling a method on *Tier2State
per uop. Each method returns a Tier2Status:
type Tier2Status int

const (
    StatusContinue Tier2Status = iota
    StatusError
    StatusDeopt
    StatusExit
)
Continue advances to the next uop. Error propagates a Python
exception. Deopt falls back to Tier-1 at the deopt target.
Exit leaves the trace cleanly through _EXIT_TRACE.
JUMP_TO_TOP and JUMP_TO_JUMP_TARGET mutate NextUop in place
so the switch stays small.
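The loop shape is worth a sketch. This is a minimal, self-contained Go model of the walk, not the real RunExecutor: the rows here carry a closure where the real UOPInstruction carries an opcode, and everything beyond the Tier2Status values and the NextUop convention is illustrative.

package main

import "fmt"

type Tier2Status int

const (
    StatusContinue Tier2Status = iota
    StatusError
    StatusDeopt
    StatusExit
)

type uop struct {
    name string
    body func(s *tier2State) Tier2Status
}

type tier2State struct {
    trace   []uop
    NextUop int // jump uops overwrite this in place, so the loop stays a straight walk
}

// run walks the buffer one uop at a time until a body returns
// something other than StatusContinue.
func (s *tier2State) run() Tier2Status {
    for s.NextUop < len(s.trace) {
        cur := s.trace[s.NextUop]
        s.NextUop++ // advance first; _JUMP_TO_TOP's body resets NextUop itself
        if st := cur.body(s); st != StatusContinue {
            return st
        }
    }
    return StatusDeopt // fell off the buffer: bail back to Tier-1
}

func main() {
    s := &tier2State{trace: []uop{
        {name: "_NOP", body: func(*tier2State) Tier2Status { return StatusContinue }},
        {name: "_EXIT_TRACE", body: func(*tier2State) Tier2Status { return StatusExit }},
    }}
    fmt.Println(s.run() == StatusExit) // true
}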
The dispatch table itself is generated. uops_dispatch_gen.go
fans every Tier-2-viable uop ID into a method call on
*Tier2State. The generator scans optimizer/ for hand-ported
bodies via go/ast and writes only stubs for uops that do not
already have a method, so adding a new port is one file edit
followed by a regen.
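The discovery half of that generator is ordinary go/ast work. Here is a sketch of the scan, assuming only what the text states (hand-ported bodies are methods on *Tier2State); the directory path and the method-naming convention in main are hypothetical.

package main

import (
    "fmt"
    "go/ast"
    "go/parser"
    "go/token"
)

// portedMethods returns the names of all methods declared on
// *Tier2State anywhere under dir.
func portedMethods(dir string) (map[string]bool, error) {
    fset := token.NewFileSet()
    pkgs, err := parser.ParseDir(fset, dir, nil, 0)
    if err != nil {
        return nil, err
    }
    found := map[string]bool{}
    for _, pkg := range pkgs {
        for _, file := range pkg.Files {
            for _, decl := range file.Decls {
                fn, ok := decl.(*ast.FuncDecl)
                if !ok || fn.Recv == nil || len(fn.Recv.List) != 1 {
                    continue // not a method
                }
                star, ok := fn.Recv.List[0].Type.(*ast.StarExpr)
                if !ok {
                    continue // value receiver: not one of ours
                }
                if id, ok := star.X.(*ast.Ident); ok && id.Name == "Tier2State" {
                    found[fn.Name.Name] = true
                }
            }
        }
    }
    return found, nil
}

func main() {
    ported, err := portedMethods("optimizer")
    if err != nil {
        fmt.Println(err)
        return
    }
    // Hypothetical naming scheme: one method per uop, named after it.
    for _, u := range []string{"uopNop", "uopBinaryOp"} {
        if !ported[u] {
            fmt.Printf("would emit a StatusDeopt stub for %s\n", u)
        }
    }
}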
The cleanup analysis pass
The analysis pipeline runs three phases unconditionally:
- removeGlobals. Currently a stub returning 1. The dict and type watcher callbacks land with the watcher infrastructure in v0.13. When wired, this pass folds _LOAD_GLOBAL_MODULE and _LOAD_GLOBAL_BUILTINS rows into inline-const loads when the dict version still matches and the watcher can prove the dict has not mutated.
- optimizeUops. Init / fini bracket, contradiction and out-of-space short-circuits, terminator detection. The per-opcode abstract semantics land with the DSL-generated case table from Python/optimizer_bytecodes.c. Today the orchestrator runs the unwired default arm; v0.13 adds the bodies behind it.
- removeUnneededUops. A full port. This is the only analysis pass that actually does work in v0.12, and it does a lot. It walks the trace once, NOPs _SET_IP and _CHECK_VALIDITY where escapes can be proved out of reach since _START_EXECUTOR, resurrects _SET_IP ahead of the next escaping uop, and collapses the load-then-pop idiom.
The collapse pattern is worth showing. A trace projection
expands LOAD_CONST 0; POP_TOP into:
_LOAD_CONST_INLINE 0
_POP_TOP
The cleanup pass recognizes the pair and fuses them to:
_POP_TOP_LOAD_CONST_INLINE 0
The fused uop skips the push / pop round trip. The same fusion applies across the load family: _LOAD_FAST, _LOAD_FAST_BORROW, _LOAD_CONST_INLINE, _LOAD_CONST_INLINE_BORROW, _LOAD_SMALL_INT, and _COPY each have a fused variant when paired with _POP_TOP.
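Here is a minimal Go model of that rewrite, assuming a fusion table keyed by the load opcode and a later compaction of the leftover NOP; the opcode constants are stand-ins, not gopy's real IDs.

package main

import "fmt"

const (
    opNOP = iota
    opLoadConstInline
    opPopTop
    opPopTopLoadConstInline
)

type uop struct{ opcode, oparg int }

// fusedFor maps a load-family opcode to its _POP_TOP-fused variant.
var fusedFor = map[int]int{
    opLoadConstInline: opPopTopLoadConstInline,
}

// fuseLoadPop rewrites a load row followed by _POP_TOP into the fused
// uop and NOPs the now-dead second row.
func fuseLoadPop(trace []uop) {
    for i := 0; i+1 < len(trace); i++ {
        if fused, ok := fusedFor[trace[i].opcode]; ok && trace[i+1].opcode == opPopTop {
            trace[i].opcode = fused // the oparg rides along unchanged
            trace[i+1] = uop{opcode: opNOP}
        }
    }
}

func main() {
    t := []uop{{opLoadConstInline, 0}, {opPopTop, 0}}
    fuseLoadPop(t)
    fmt.Println(t) // the pair is now the fused row plus a NOP
}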
Seen from Python:

def loop():
    for _ in range(100_000):
        pass
    return None

loop()
loop()

ex = loop.__code__.co_executors[0]
for name, oparg, target in ex.get_trace():
    print(f'{name:35} {oparg:5} {target}')
Above a certain warmup threshold, the trace is visible through the Executor.get_trace method. The output shows the post-cleanup uop stream, which is what the dispatcher actually walks.
What's new
The full breakdown, grouped by where it landed.
optimizer/ package
The Tier-2 optimizer lives in its own package because it touches
a different concern from the eval loop. The split mirrors
CPython's Python/optimizer.c / Python/optimizer_analysis.c
split.
- optimizer/bloom.go. The 256-bit dependency filter (_Py_BLOOM_FILTER_WORDS = 8). Trace projection adds every guarded type, dict, and code pointer to the filter. The invalidation walk hashes a mutated pointer and asks every executor whether it might be in scope. False positives cost a trace invalidation; false negatives would be a correctness bug, so the hash is chosen to never miss.
- optimizer/types.go. The shared shapes:
  - UOPInstruction. The per-uop row: opcode, oparg, target, operand pointers, two-int operand slots.
  - Executor. The trace container: uop buffer, exit table, side pointer back into Code.Executors, the dependency bloom filter, and the per-trace metadata VMData block.
  - ExitData. Per-exit deopt metadata.
  - BackoffCounter. The retry-on-deopt counter that prevents a cold loop from re-optimizing every time it warms a little.
  - The JitOptSymbol lattice. Nine tags: Unknown, NonNull, Null, Bottom, TypeVersion, KnownClass, KnownValue, Tuple, Truthiness. The lattice is what the abstract interpreter walks to fold types and constants.
  - JitOptContext. The per-trace arena: locals plus stack pool, contradiction flag, out-of-space tripwire.
  - Bounds: UOPMaxTraceLength = 800, MaxAbstractInterpSize = 4096, TyArenaSize, MaxAbstractFrameDepth. The same constants CPython uses.
- optimizer/uop_ids_gen.go, optimizer/uop_meta_gen.go. Generated tables. Every _PyUop_Flags and _PyUop_Replication row, plus the per-uop num-popped table. Upstream uop IDs are aliased to gopy's compile.OPCODE constants where the Tier-1 and Tier-2 IDs match (_LOAD_CONST, _POP_TOP, _NOP, _COPY, _SWAP, _RETURN_VALUE). The generator drives off Include/internal/pycore_uop_ids.h and Include/internal/pycore_uop_metadata.h from upstream.
- optimizer/executor.go. Allocate, ExecutorInit, link against the per-interpreter executor list, detach from Code.Executors on dispose, ExecutorClear. Plus the two-phase free: pending deletions queue on InterpState.ExecutorDeletionListHead and the sweep runs when RemainingCapacity underflows, so tp_dealloc cannot race a thread that is currently executing the trace. This matches the CPython _Py_Executors_InvalidateAll deferred-free shape.
- optimizer/pyobject.go. The Python-visible executor surface: Executor.GetTrace returns the uop sequence as a list of (name, oparg, target) tuples for dis.dis to render; is_valid mirrors the upstream method. This is how Tools/scripts/uop_metrics.py and friends will plug in.
- optimizer/side_table.go. Code.Executors packing: hasSpaceForExecutor, getIndexForExecutor, insertExecutor. The last one sets VMData.Opcode, VMData.Oparg, VMData.Code, VMData.Index, then patches the bytecode at the install site to ENTER_EXECUTOR. The Tier-1 opcode that was there before the install lives in VMData.Opcode so the deopt arm can find it.
- optimizer/trace.go. TranslateBytecodeToTrace. The projection. Walks the instruction stream from a warmup site, expands each opcode through the lookupMacro identity fallback, stamps every guarded type and dict pointer into the dependency bloom filter, closes the loop with _JUMP_TO_TOP when the projector returns to the install site, and bails on a return or explicit error before the loop edge. Preludes every trace with _START_EXECUTOR and _MAKE_WARM.
- optimizer/symbols.go. The JitOptContext lattice helpers. SymNew*, SymSet*, SymGet*, SymIs*, SymTruthiness, SymTupleGetitem, AbstractContextInit / AbstractContextFini, FrameNew / FramePop. Per-trace symbol arena, locals and stack pool, contradiction and out-of-space tripwires from Python/optimizer_symbols.c. The abstract interpreter that runs over the trace queries these helpers to fold types and constants.
- optimizer/optimize.go. The orchestration. Optimize is the warmup-callback entry point; uopOptimize is the projection + analysis + fix-up pipeline; prepareForExecution is the exit / error stub split; makeExit, countExits, and makeExecutorFromUops wire _EXIT_TRACE rows to per-exit ExitData slots and stamp the executor's own pointer into _START_EXECUTOR.operand0 so the running trace can find its own metadata.
- optimizer/analysis.go. The forward-pass orchestrator. Analyze runs removeGlobals, optimizeUops, and removeUnneededUops in order. The third pass is fully ported; the first two are skeletons.
- optimizer/dis_hook.go, optimizer/uops_print.go. Format hooks. dis.dis knows how to print an ENTER_EXECUTOR row with the upstream operand layout; the per-uop _PyUOpPrint formatter that the executor's GetTrace rides on lives here.
- optimizer/uops.go. The dispatcher itself. RunExecutor builds a Tier2State around the trace, Run walks the buffer, and per-uop methods on *Tier2State return Tier2Status. JUMP_TO_TOP and JUMP_TO_JUMP_TARGET are encoded as in-method mutations of NextUop so the switch stays small.
- optimizer/uops_impl.go. The hand-ported uop bodies for this release: the trace prologue and a starter set of stack and local ops (_NOP, _START_EXECUTOR, _MAKE_WARM, _CHECK_VALIDITY, _SET_IP, _EXIT_TRACE, _JUMP_TO_TOP, _LOAD_FAST, _LOAD_FAST_BORROW, _STORE_FAST, _POP_TOP, _PUSH_NULL, _COPY, _SWAP). Fourteen out of roughly 285 Tier-2-viable uops. The rest live as unimplementedUop stubs in uops_stubs_gen.go and bail to StatusDeopt until their bodies land.
- optimizer/uops_dispatch_gen.go, optimizer/uops_stubs_gen.go. The dispatch switch. The generator scans optimizer/ for hand-ported bodies via go/ast and writes only stubs for uops that do not already have a method, so adding a new port is one file edit followed by a regen. The split between dispatch and stubs lets the generator rewrite the stubs file without touching the dispatch table.
vm/ wiring
The Tier-2 optimizer plugs into the eval loop at two points: the
warmup callback on JUMP_BACKWARD and the dispatch on
ENTER_EXECUTOR.
- vm/tier2.go. tryWarmupTier2 is the per-JUMP_BACKWARD warmup callback (a sketch of its shape follows this list). It bails cheaply when no executor slot is available (most warm code is already optimized) and otherwise calls optimizer.Optimize. On success, the install site flips to ENTER_EXECUTOR and subsequent passes land in enterExecutor. With the dispatcher only partially populated, any ENTER_EXECUTOR that hits a stub deopts cleanly: it reads back the Tier-1 instruction the install path stashed in Executor.VMData and dispatches it through trySimple. The codeExecutorAt helper encapsulates the any-typed Code.Executors cast so the vm layer stays independent of the optimizer's concrete ExecutorArray type.
- vm/eval_simple.go. Wires the warmup site into JUMP_BACKWARD. The computed jump target goes through tryWarmupTier2 on the way down. The ENTER_EXECUTOR arm calls enterExecutor. The two-line addition is the entire user-visible coupling; the bulk of the logic lives in vm/tier2.go.
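A sketch of the warmup shape, assuming a plain counter threshold for brevity; the real warmup rides the BackoffCounter from optimizer/types.go, and everything here except the names tryWarmupTier2, ENTER_EXECUTOR, and optimizer.Optimize is illustrative.

package main

import "fmt"

const warmupThreshold = 16 // illustrative; the real threshold is backoff-driven

type site struct {
    opcode string
    warmth int
}

// tryWarmupTier2 runs on every pass through a JUMP_BACKWARD; it bails
// cheaply until the site is hot, then asks the optimizer for a trace.
func tryWarmupTier2(s *site, optimize func() bool) {
    if s.opcode == "ENTER_EXECUTOR" {
        return // already optimized
    }
    s.warmth++
    if s.warmth < warmupThreshold {
        return
    }
    if optimize() { // optimizer.Optimize: project, analyze, install
        s.opcode = "ENTER_EXECUTOR"
    }
}

func main() {
    s := &site{opcode: "JUMP_BACKWARD"}
    for i := 0; i < warmupThreshold; i++ {
        tryWarmupTier2(s, func() bool { return true })
    }
    fmt.Println(s.opcode) // ENTER_EXECUTOR
}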
objects/code.go
Code.Executors is added as an any-typed field so the vm layer
does not need to import optimizer. It stores
*optimizer.ExecutorArray once the trace optimizer installs the
first executor for a code object. The lazy allocation matters
because the field stays nil for code that never warms past the
threshold (the common case for one-shot scripts), and the import
direction stays clean.
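The import-direction trick is a one-way type assertion. A trimmed sketch, assuming the codeExecutorAt helper named above; the struct bodies are cut down to what the cast needs, and the flattened VMDataOpcode field is illustrative.

package main

import "fmt"

// In objects: the field is any, so objects never imports optimizer.
type Code struct {
    Executors any // nil until the first executor installs; then *ExecutorArray
}

// In optimizer: the concrete side-table types.
type Executor struct{ VMDataOpcode int }
type ExecutorArray struct{ Slots []*Executor }

// codeExecutorAt encapsulates the cast so only one function in the vm
// layer knows the concrete type.
func codeExecutorAt(c *Code, i int) *Executor {
    arr, ok := c.Executors.(*ExecutorArray)
    if !ok || i >= len(arr.Slots) {
        return nil
    }
    return arr.Slots[i]
}

func main() {
    c := &Code{}
    fmt.Println(codeExecutorAt(c, 0) == nil) // true: cold code never allocates
    c.Executors = &ExecutorArray{Slots: []*Executor{{VMDataOpcode: 42}}}
    fmt.Println(codeExecutorAt(c, 0).VMDataOpcode) // 42
}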
tools/uops_gen/ scaffold
The DSL generator. Parses pycore_uop_ids.h and
pycore_uop_metadata.h from a CPython checkout and emits
optimizer/uop_ids_gen.go and optimizer/uop_meta_gen.go. Drift
mode hashes the input header and fails CI when the recorded hash
falls behind upstream, so regeneration is a forced step rather
than a silent miss.
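Drift mode reduces to a hash comparison. A sketch, assuming the generator stamps the upstream header's SHA-256 into the generated file; the path and the recorded constant are placeholders.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "os"
)

// recordedHash is what the last regen stamped into the generated
// table (placeholder value).
const recordedHash = "4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a"

func main() {
    data, err := os.ReadFile("Include/internal/pycore_uop_ids.h")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    sum := sha256.Sum256(data)
    if got := hex.EncodeToString(sum[:]); got != recordedHash {
        // Fail CI so regeneration is a forced step, not a silent miss.
        fmt.Fprintf(os.Stderr, "uops_gen drift: header hash %s != recorded %s\n", got, recordedHash)
        os.Exit(1)
    }
}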
The matching pass over Python/optimizer_bytecodes.c (the DSL that drives the abstract-interpreter semantics) is left for the follow-up that lights up optimize_uops. Writing both generator passes now (one for IDs and metadata, another for semantics) would have been more work for no real benefit in v0.12; the orchestrator runs the default-arm semantics today, which is correct, just unoptimized.
compile/opcodes.go
COMPARE_OP's oparg now packs the comparison kind in the high
bits, matching CPython's pre-specializer layout. Unblocks the
comparison_eq family of smoke fixtures deferred from v0.11.
This is the one piece of v0.12 that is not Tier-2; it slipped
in here because the fix is small and needed downstream.
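The packing itself is plain bit arithmetic. A sketch assuming the CPython 3.13-style split (comparison kind in oparg >> 5, flags such as to-bool conversion in the low bits); treat the exact shift as illustrative rather than as gopy's contract.

package main

import "fmt"

const (
    cmpLT = iota
    cmpLE
    cmpEQ
    cmpNE
    cmpGT
    cmpGE
)

const flagToBool = 1 << 4 // illustrative low-bit flag

func pack(kind, flags int) int { return kind<<5 | flags }
func kindOf(oparg int) int     { return oparg >> 5 }
func flagsOf(oparg int) int    { return oparg & 0x1f }

func main() {
    oparg := pack(cmpEQ, flagToBool)
    fmt.Println(kindOf(oparg) == cmpEQ)         // true
    fmt.Println(flagsOf(oparg)&flagToBool != 0) // true
}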
Why we built it this way
Several decisions deserve callouts.
Why "control flow first, bodies later"
The single most important decision in this release is shipping the scaffolding before the bodies. The alternative would have been to port all 285 uop bodies and ship the whole optimizer at once. We chose not to because the bodies are independent and the scaffolding is shared.
Concretely: if you ship 14 bodies plus the scaffolding, anyone can port body 15 in a single PR that touches one file. The generator regenerates the dispatch table, the new body slots in, and the test suite re-runs. If instead you tried to ship 285 bodies at once, every body change rebases against every other body, and the diff is unreviewable.
The same logic explains the body-swap shape of the spec. Adding a body removes a stub on the next regen. There is no other plumbing change. The dispatch table, the side table, the analysis pass, the install path, the deopt arm: all already in, all already gated.
Why the bloom filter is 256 bits
The bloom filter's job is to ask "could this trace possibly care about that pointer?". A false positive costs a trace invalidation (the trace is dropped and the code re-warms). A false negative would be a correctness bug (a stale trace continues to run against mutated objects).
CPython tuned the filter at 256 bits because traces typically reference a few dozen pointers and the false positive rate at 256 bits is low. We kept the same bound. Going larger would have made invalidation lookups slower without measurably lowering the false-positive rate; going smaller would have collided more often. The same shape is the right shape for our workload because our traces are CPython traces.
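The filter itself is a few lines. A sketch of the 256-bit shape with a splitmix64-style mix and two probe bits per entry; the real filter keeps the upstream hash choice, this one only illustrates the add / might-contain asymmetry.

package main

import "fmt"

const bloomWords = 8 // _Py_BLOOM_FILTER_WORDS: 8 * 32 = 256 bits

type bloomFilter struct{ bits [bloomWords]uint32 }

// mix is an illustrative 64-bit finalizer (splitmix64).
func mix(x uint64) uint64 {
    x ^= x >> 30
    x *= 0xbf58476d1ce4e5b9
    x ^= x >> 27
    x *= 0x94d049bb133111eb
    return x ^ (x >> 31)
}

func (f *bloomFilter) add(ptr uintptr) {
    h := mix(uint64(ptr))
    for i := 0; i < 2; i++ { // two probe bits per entry
        bit := h & 255
        f.bits[bit>>5] |= 1 << (bit & 31)
        h >>= 8
    }
}

// mightContain can return false positives (a wasted invalidation)
// but never false negatives (a stale-trace correctness bug).
func (f *bloomFilter) mightContain(ptr uintptr) bool {
    h := mix(uint64(ptr))
    for i := 0; i < 2; i++ {
        bit := h & 255
        if f.bits[bit>>5]&(1<<(bit&31)) == 0 {
            return false
        }
        h >>= 8
    }
    return true
}

func main() {
    var f bloomFilter
    f.add(0xdeadbeef)
    fmt.Println(f.mightContain(0xdeadbeef)) // always true for added pointers
    fmt.Println(f.mightContain(0x1234))     // usually false; true only costs an invalidation
}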
Why the deferred-free for executors
tp_dealloc on an executor cannot run while a thread is
currently inside that executor's trace. CPython solves this with
a deferred-free list: dead executors queue on the interpreter,
and the sweep runs when the executor pool is under pressure. We
copied the shape exactly because the alternative (refcounted
trampolines, or stop-the-world before free) was more code for the
same effect.
InterpState.ExecutorDeletionListHead holds the queue;
RemainingCapacity decrementing through zero triggers the
sweep. A racing thread that was inside the trace at the time of
disposal completes its current uop and exits cleanly through
_EXIT_TRACE before the sweep runs.
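A sketch of the two-phase shape, assuming a singly linked deletion list; only the two names the text gives (ExecutorDeletionListHead, RemainingCapacity) are real, the rest is illustrative.

package main

import "fmt"

type Executor struct {
    next    *Executor // deletion-list link
    invalid bool
}

type InterpState struct {
    ExecutorDeletionListHead *Executor
    RemainingCapacity        int
}

// dispose queues the executor instead of freeing it: a thread still
// inside the trace exits through _EXIT_TRACE before any sweep.
func (ip *InterpState) dispose(e *Executor) {
    e.invalid = true // new entries to the trace deopt immediately
    e.next = ip.ExecutorDeletionListHead
    ip.ExecutorDeletionListHead = e
    ip.RemainingCapacity--
    if ip.RemainingCapacity < 0 { // underflow: pool pressure
        ip.sweep()
    }
}

// sweep drains the queue; here it only counts, where the real sweep
// releases uop buffers and exit tables.
func (ip *InterpState) sweep() {
    n := 0
    for e := ip.ExecutorDeletionListHead; e != nil; e = e.next {
        n++
    }
    ip.ExecutorDeletionListHead = nil
    ip.RemainingCapacity = n // illustrative refill
    fmt.Printf("swept %d executors\n", n)
}

func main() {
    ip := &InterpState{RemainingCapacity: 1}
    ip.dispose(&Executor{})
    ip.dispose(&Executor{}) // capacity underflows; the sweep fires
}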
Why removeUnneededUops first
Of the three analysis passes, removeUnneededUops is the only
one that we ported in full for v0.12. The choice was deliberate.
The pass is structurally simple (a single linear walk),
self-contained (it does not need watchers or version counters),
and visibly improves the trace output. Shipping it first gives
the dispatcher a clean trace to walk, which means the gates can
exercise real cleanup output rather than hand-built fixtures.
removeGlobals and optimizeUops both rely on infrastructure
that does not exist yet (dict watchers, the DSL-generated case
table). Stubbing them out keeps the orchestrator runnable while
we land the dependencies separately. The cost is that
optimization quality is below what the final pipeline produces,
which is fine because v0.12 is about shipping the pipeline at
all.
Why a typed-method dispatcher
The dispatcher could have been a flat switch over an enum, the
way the Tier-1 eval loop is. We made the dispatcher dispatch
through methods on *Tier2State instead, for two reasons.
The first reason is locality. Each uop body sits next to its method declaration, not in a giant case statement. Code review is per-uop; the diff for a new body is one method.
The second reason is generation. The dispatch table generator
scans optimizer/ via go/ast and emits stubs for uops without
a method body, then a switch that calls the method. Adding a
body deletes a stub on regen. The generator does not need to
know about the body's content, just its name. This works because
methods are syntactically distinct from free functions; the
generator can identify them precisely.
The cost is a virtual call per uop. Benchmarks said this was negligible compared to the cost of the uop's actual work, so we kept the readability win.
Where it lives
The map of this release in the repo:
- optimizer/. Everything Tier-2.
  - bloom.go. The dependency filter.
  - types.go. The shared shapes.
  - uop_ids_gen.go, uop_meta_gen.go. Generated tables.
  - executor.go. Allocation, install, two-phase free.
  - pyobject.go. Python-visible executor surface.
  - side_table.go. Code.Executors packing.
  - trace.go. Projection from quickened bytecode to uop trace.
  - symbols.go. The lattice and arena helpers.
  - optimize.go. Orchestration.
  - analysis.go. The three-pass analyze pipeline.
  - dis_hook.go, uops_print.go. Format hooks.
  - uops.go. The dispatcher.
  - uops_impl.go. The 14 hand-ported uop bodies.
  - uops_dispatch_gen.go, uops_stubs_gen.go. Dispatch table and stubs.
- vm/tier2.go. Warmup callback and enterExecutor.
- vm/eval_simple.go. JUMP_BACKWARD warmup site and ENTER_EXECUTOR arm.
- objects/code.go. Code.Executors field.
- tools/uops_gen/. The DSL generator that produces the uop_ids_gen.go and uop_meta_gen.go tables.
- compile/opcodes.go. The COMPARE_OP oparg packing fix.
Compatibility
Two user-visible behaviors change.
- Bytecode disassembly shows ENTER_EXECUTOR rows. Hot loops that have warmed past the Tier-2 threshold show ENTER_EXECUTOR at the install site in dis.dis output. The source is unchanged; only the warmed bytecode shape is visible. CPython 3.13+ behaves the same way.
- dis.get_instructions exposes executor traces. The Executor.get_trace method returns the uop sequence as a list of (name, oparg, target) tuples. Tools that walk disassembled code for analysis can read the Tier-2 trace directly.
Both behaviors are gated behind warmup. Cold code shows the Tier-1 bytecode it always did.
There are no behavior changes for user code: the Tier-2 trace is semantically equivalent to the Tier-1 bytecode it replaces, and the deopt arm covers the cases the optimizer cannot statically prove. If a guard fails mid-trace, the dispatcher falls back to Tier-1 at the deopt target and continues from there.
Gates
End-to-end gates cover the control flow this release lands. They
live under v012test/.
- v012test/optimizer_gate_test.go. The Tier-2 install gate.
  - TestOptimizerGateInstall. Calls optimizer.Optimize directly on a tiny LOAD_CONST / POP_TOP / JUMP_BACKWARD loop, asserts the install site flips to ENTER_EXECUTOR, the side table is populated at the right slot, the executor stashes the Tier-1 op the install site was patched out of, and the trace prelude is _START_EXECUTOR / _MAKE_WARM with _JUMP_TO_TOP closing the loop.
  - TestOptimizerGateEnterExecutorDispatch. Manually constructs an executor whose VMData.Opcode is RETURN_VALUE, runs LOAD_CONST 0; ENTER_EXECUTOR 0 through EvalCode, and verifies the deopt path returns None. Proves the arm went through enterExecutor, not through any other hand-rolled arm or notImplemented.
- v012test/analysis_gate_test.go. The cleanup-pass gate.
  - TestAnalysisGateCleanupRunsInOptimize. Runs the install flow and asserts the post-finalize trace contains no _NOP rows and no _SET_IP rows ahead of the first escaping uop.
  - TestAnalysisGateBenignBailFromOptimizeUops. Blows the abstract-interp stack budget through a giant Stacksize, confirms Optimize returns a benign bail (status 0, nil executor), and the install site is left untouched. Out-of-budget bails should never leave a half-patched site behind.
- v012test/uops_gate_test.go. The Tier-2 dispatch gate.
  - TestUopsGateHandPortedHappyPath. Drives a hand-built trace through every hand-ported uop in turn and asserts the dispatch loop terminates with StatusExit, the warm bit flips, the frame instruction pointer is what _SET_IP set it to, and the local round-trips through _LOAD_FAST / _STORE_FAST.
  - TestUopsGateStubReturnsDeopt. Hits a stubbed uop (_BINARY_OP) and asserts the dispatcher cleanly returns StatusDeopt rather than panicking. The whole point of the stub design is to bail rather than crash; this gate locks it in.
  - TestUopsGateProjectedTraceTerminates. Runs a real Optimize-produced trace through RunExecutor and asserts the loop terminates on either StatusExit or StatusDeopt. Partial-port traces stay runtime-safe.
- optimizer/uops_test.go. Direct unit coverage for the dispatcher: _NOP / _EXIT_TRACE round-trip, _LOAD_FAST / _STORE_FAST echo, stub deopt, and the _CHECK_VALIDITY jump-on-invalid arm.
- optimizer/analysis_test.go. Direct unit coverage for removeUnneededUops: drops redundant _SET_IP / _CHECK_VALIDITY, keeps _SET_IP ahead of an escaping op, collapses the load-then-pop idiom. Plus the Analyze pass-through on an empty trace.
A note on Tier-1 vs Tier-2
The "tier" terminology in CPython's optimizer is borrowed from JIT design but does not refer to a JIT in the traditional sense. There is no machine code generation. Both tiers are bytecode interpreters; they differ in granularity and analysis.
Tier-1 is the specializer from v0.11. It operates on
individual opcodes. When a LOAD_ATTR site has stable receiver
type, the opcode is rewritten in place to LOAD_ATTR_INSTANCE_VALUE.
The eval loop dispatches the typed variant. The unit of
optimization is one opcode.
Tier-2 is the trace optimizer from v0.12. It operates on linear sequences of micro-operations. A hot loop accumulates warmup at its backward jump; once it crosses a threshold, the projector captures the body as a uop stream; the analyzer folds across the stream; the dispatcher walks the result. The unit of optimization is a whole trace.
The two tiers are complementary. Tier-1 makes individual opcodes faster. Tier-2 fuses sequences of typed opcodes and eliminates redundant checks across them. A trace that walks ten typed opcodes can collapse ten type-version checks into one (the guard at trace entry), provided the analyzer can prove the type cannot change between checks. That kind of elimination is exactly what Tier-2 is for.
The cumulative win is significant. CPython benchmarks show Tier-2 lifting performance another 10-20% above Tier-1 on hot loops. Our porting target is to deliver the same improvement on the same workloads.
A note on the abstract interpreter
The analysis pass that will fold types across a trace is an
abstract interpreter. It walks the trace once, propagating
lattice elements through each uop. The lattice tags live on JitOptSymbol in optimizer/types.go:
- Unknown. Top. No information.
- NonNull. The value is known not to be null.
- Null. The value is known to be null.
- Bottom. Contradiction. Unreachable.
- TypeVersion. The value's type version is pinned.
- KnownClass. The value's class is pinned.
- KnownValue. The value itself is pinned.
- Tuple. The value is a tuple of known length with per-slot lattice info.
- Truthiness. The value's truthiness is known.
Each uop's abstract semantics describe how its operands and
outputs flow through the lattice. _GUARD_BOTH_INT narrows both
stack slots to KnownClass(int). _BINARY_OP_ADD_INT reads
two KnownClass(int) slots and produces a KnownClass(int).
If a later _GUARD_BOTH_INT re-checks slots that are already
KnownClass(int), the abstract interpreter can prove the guard
is redundant and the cleanup pass can drop it.
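Worked in miniature, with a two-tag lattice and strings standing in for class pointers; everything except the tag names is illustrative.

package main

import "fmt"

type symTag int

const (
    tagUnknown symTag = iota // top: no information
    tagKnownClass            // the value's class is pinned
)

type jitOptSymbol struct {
    tag symTag
    cls string // stand-in for a class pointer
}

// narrowToClass is what _GUARD_BOTH_INT's abstract semantics do to
// each of its two stack inputs.
func narrowToClass(s *jitOptSymbol, cls string) {
    s.tag, s.cls = tagKnownClass, cls
}

// guardIsRedundant is the question the pass asks before dropping a
// later _GUARD_BOTH_INT over the same two slots.
func guardIsRedundant(a, b *jitOptSymbol) bool {
    return a.tag == tagKnownClass && a.cls == "int" &&
        b.tag == tagKnownClass && b.cls == "int"
}

func main() {
    x, y := &jitOptSymbol{}, &jitOptSymbol{}
    fmt.Println(guardIsRedundant(x, y)) // false: nothing proved yet
    narrowToClass(x, "int")             // the first guard narrows both slots
    narrowToClass(y, "int")
    fmt.Println(guardIsRedundant(x, y)) // true: the re-check can be dropped
}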
The v0.12 release ships the lattice helpers in symbols.go and
the orchestrator in analysis.go optimizeUops. The per-uop
abstract semantics live in Python/optimizer_bytecodes.c
upstream, in a DSL that generates a case table. Porting that DSL
is the v0.13 follow-up. Until then, optimizeUops runs the
default arm, which is "lattice element stays Unknown for every
output". The dispatcher still runs correctly; the optimization
quality is just below the upstream ceiling.
What's next
The v0.12 ship lands trace projection, install, the analysis skeleton, the dispatch loop, and a starter set of hand-ported uop bodies. The remaining work is body swaps against the wiring this release ships.
- The long tail of per-uop bodies. 14 of about 285 Tier-2-viable uops are hand-ported; the rest live as StatusDeopt stubs. Adding a body removes a stub on the next regen, so this is incremental. The priority list goes by Tier-2 hit frequency in upstream benchmarks: the arithmetic family (_BINARY_OP_ADD_INT and its siblings), the attribute family (_LOAD_ATTR_INSTANCE_VALUE), the call family (_CALL_BUILTIN_O, _CALL_PY_EXACT_ARGS), the iteration family (_FOR_ITER_RANGE, _FOR_ITER_LIST).
- The DSL-generated case table (optimizer/uops_cases_gen.go from Python/optimizer_bytecodes.c). Per-opcode abstract semantics drive the constant-folding, type-version, and stack-effect parts of optimize_uops. The orchestrator runs the unwired default arm today, so this too is a body swap rather than a control-flow change. Landing this lights up real optimization quality on top of the dispatch.
- The dict / type watcher infrastructure plus the remove_globals pass. Folds _LOAD_GLOBAL_MODULE, _LOAD_GLOBAL_BUILTINS, and _LOAD_ATTR_MODULE rows into inline-const loads when the dict version still matches and the watcher can prove the dict has not mutated. The invalidation walk through _Py_Executors_InvalidateDependency rides on the watcher fan-out. Without watchers, the analysis pass cannot safely fold cross-trace state, so this is on the critical path for the second-tier wins.
Carried over for the broader v0.13 ship beyond Tier-2:
- The four CPython smoke fixtures left ahead of v0.11 (comparison_eq and friends) are unblocked by the COMPARE_OP oparg fix; full re-enable still waits on a v0.13 specializer fix-up sweep.
- The _io.File layered split (RawIOBase / BufferedReader / TextIOWrapper) was deferred from v0.10.1 and is still pending.
- The fast (super-instruction) compaction passes in compile/flowgraph (swaptimize, super-instructions, LOAD_FAST ref-stack, cold-block hoist) were deferred from v0.5 and remain open.
Acknowledgments
This release closes work tracked across the v0.12 spec series. The public-facing pointers:
- PEP 659. The adaptive-specializer design that motivates Tier-2's existence. Without the specializer producing typed opcodes, Tier-2 has nothing to fold.
- CPython source we ported against: Python/optimizer.c, Python/optimizer_analysis.c, Python/optimizer_symbols.c, Python/optimizer_bytecodes.c (the DSL), Include/internal/pycore_optimizer.h, Include/internal/pycore_uop_ids.h, Include/internal/pycore_uop_metadata.h.
- The pull request that shipped this release (#20) merged at bc3369c.
With the control flow in, the long tail of body ports is what gates v0.13's quality on hot loops. We expect to land a meaningful chunk of those bodies per release through the v0.13 / v0.14 window.