v0.12.0 - Tier-2 trace optimizer

Released May 8, 2026.

v0.11 gave us the adaptive specializer. v0.12 gives us what the specializer was always meant to feed: a Tier-2 trace optimizer that takes the typed opcodes the specializer produces, projects them into a linear uop stream, runs an analysis pass over the stream, and dispatches the result through a tight Tier-2 eval loop. Where Tier-1 (the specializer) rewrites individual bytecodes in place, Tier-2 captures whole traces of bytecodes and treats them as a unit.

This release lays down all of the control flow. The per-install-site executor side table, the JUMP_BACKWARD warmup callback that drives _PyOptimizer_Optimize, projection from quickened bytecode into a uop trace, the analysis skeleton with the cleanup pass that prunes dead _SET_IP and _CHECK_VALIDITY rows, and the uop dispatch loop itself are all in. Trace install, side-table wiring, the dispatcher, and the deopt arm are all end-to-end gated.

What is intentionally not in: the long tail of hand-ported per-uop bodies. Of the roughly 285 Tier-2-viable uops, 14 ship with real bodies in this release. The rest live as StatusDeopt stubs in uops_stubs_gen.go and bail back to Tier-1 when the dispatcher hits them. This is by design. Adding a body removes a stub on the next regen, so the long tail is a body swap rather than a control-flow change. We wanted the scaffolding gated and shipped before we started porting hundreds of small uop bodies against it.

Highlights

Three pieces of work carry this release.

Trace projection and install

A "trace" in Tier-2 is a linear sequence of micro-operations (uops) projected from a hot loop's quickened bytecode. The projector starts at a JUMP_BACKWARD that has warmed past a threshold, walks forward through the bytecode, expands each opcode into its uop expansion via the metadata table, stamps every guarded type / dict / code pointer into a dependency bloom filter, closes the loop with _JUMP_TO_TOP when the projector returns to the install site, and bails on a return or explicit error before the loop edge.

def hot():
    s = 0
    for i in range(100_000):
        s += i
    return s

When hot() runs, the JUMP_BACKWARD at the bottom of the for loop accumulates warmup. After enough hits, the warmup callback tryWarmupTier2 in vm/tier2.go fires. It allocates an executor slot and calls optimizer.Optimize, which walks the loop body and produces a uop trace that looks roughly like:

_START_EXECUTOR
_MAKE_WARM
_CHECK_VALIDITY
_SET_IP (bytecode index N)
_LOAD_FAST (i)
_LOAD_FAST (s)
_GUARD_BOTH_INT
_BINARY_OP_ADD_INT
_STORE_FAST (s)
_LOAD_FAST (i)
_LOAD_CONST (1)
_GUARD_BOTH_INT
_BINARY_OP_ADD_INT
_STORE_FAST (i)
... (range iteration check)
_JUMP_TO_TOP

The install path patches the JUMP_BACKWARD site to ENTER_EXECUTOR. The next pass through that site does not run the Tier-1 loop body; it runs the trace. The Tier-1 op the install path stashed in Executor.VMData.Opcode becomes the deopt target: if anything goes wrong (a guard fails, an uop returns StatusDeopt), the dispatcher falls back to that Tier-1 opcode and continues from there.
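For concreteness, here is a minimal sketch of that install shape. The types are trimmed stand-ins for the real ones in optimizer/types.go and optimizer/side_table.go, and the placeholder opcode value is not the real ENTER_EXECUTOR ID:

type Instruction struct{ Opcode, Oparg uint8 }

const opEnterExecutor uint8 = 255 // placeholder, not the real opcode ID

type VMData struct {
    Opcode, Oparg uint8 // the Tier-1 instruction patched out: the deopt target
    Index         int   // bytecode index of the install site
}

type Executor struct{ VMData VMData }

// insertExecutor stashes the Tier-1 op as the deopt target, then flips the
// site to ENTER_EXECUTOR carrying the executor's side-table slot in its oparg.
func insertExecutor(bytecode []Instruction, site, slot int, ex *Executor) {
    ex.VMData.Opcode = bytecode[site].Opcode
    ex.VMData.Oparg = bytecode[site].Oparg
    ex.VMData.Index = site
    bytecode[site] = Instruction{Opcode: opEnterExecutor, Oparg: uint8(slot)}
}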

The uop dispatcher

The dispatcher is in optimizer/uops.go. RunExecutor builds a Tier2State around the trace and the calling frame; Run walks the buffer one uop at a time, calling a method on *Tier2State per uop. Each method returns a Tier2Status:

type Tier2Status int

const (
    StatusContinue Tier2Status = iota
    StatusError
    StatusDeopt
    StatusExit
)

Continue advances to the next uop. Error propagates a Python exception. Deopt falls back to Tier-1 at the deopt target. Exit leaves the trace cleanly through _EXIT_TRACE. JUMP_TO_TOP and JUMP_TO_JUMP_TARGET mutate NextUop in place so the switch stays small.
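In sketch form, reusing the Tier2Status values above (UOPInstruction is trimmed to the one field the loop needs, and dispatch stands in for the generated switch):

type UOPInstruction struct{ Opcode int } // trimmed; the real row is in optimizer/types.go

type Tier2State struct {
    Trace   []UOPInstruction
    NextUop int
}

// dispatch stands in for the generated switch in uops_dispatch_gen.go that
// fans each uop ID into a method on *Tier2State.
func (s *Tier2State) dispatch(u UOPInstruction) Tier2Status { return StatusExit }

// Run walks the buffer one uop at a time. The jump uops overwrite NextUop
// inside their handlers before returning StatusContinue, so the loop itself
// needs no jump arm.
func (s *Tier2State) Run() Tier2Status {
    for {
        u := s.Trace[s.NextUop]
        s.NextUop++
        if status := s.dispatch(u); status != StatusContinue {
            return status // StatusError, StatusDeopt, or StatusExit
        }
    }
}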

The dispatch table itself is generated. uops_dispatch_gen.go fans every Tier-2-viable uop ID into a method call on *Tier2State. The generator scans optimizer/ for hand-ported bodies via go/ast and writes only stubs for uops that do not already have a method, so adding a new port is one file edit followed by a regen.

The cleanup analysis pass

The analysis pipeline runs three phases unconditionally:

  1. removeGlobals. Currently a stub returning 1. The dict and type watcher callbacks land with the watcher infrastructure in v0.13. When wired, this pass folds _LOAD_GLOBAL_MODULE and _LOAD_GLOBAL_BUILTINS rows into inline-const loads when the dict version still matches and the watcher can prove the dict has not mutated.
  2. optimizeUops. Init / fini bracket, contradiction and out-of-space short-circuits, terminator detection. The per-opcode abstract semantics land with the DSL-generated case table from Python/optimizer_bytecodes.c. Today the orchestrator runs the unwired default arm; v0.13 adds the bodies behind it.
  3. removeUnneededUops. A full port. This is the only analysis pass that actually does work in v0.12, and it does a lot. It walks the trace once, NOPs _SET_IP and _CHECK_VALIDITY where escapes can be proved out of reach since _START_EXECUTOR, resurrects _SET_IP ahead of the next escaping uop, and collapses the load-then-pop idiom. A sketch of the walk follows this list.
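The pruning half of the walk, sketched. It reuses the UOPInstruction row from the dispatcher sketch above; the uop IDs are placeholders, and mayEscape stands in for the per-uop escape flag in the metadata table:

// Placeholder uop IDs; real values come from uop_ids_gen.go.
const (
    uopNOP = iota
    uopSetIP
    uopCheckValidity
)

// Two rules, simplified: a _SET_IP is dead if another one arrives before any
// escaping uop, and a _CHECK_VALIDITY is dead while the region it covers is
// still proven valid.
func pruneSetIPAndValidity(trace []UOPInstruction, mayEscape func(opcode int) bool) {
    lastSetIP := -1  // index of the most recent, still-prunable _SET_IP
    checked := false // a _CHECK_VALIDITY already covers this region
    for i := range trace {
        switch trace[i].Opcode {
        case uopSetIP:
            if lastSetIP >= 0 {
                trace[lastSetIP].Opcode = uopNOP // superseded before any escape
            }
            lastSetIP = i
        case uopCheckValidity:
            if checked {
                trace[i].Opcode = uopNOP // region already validated
            }
            checked = true
        default:
            if mayEscape(trace[i].Opcode) {
                lastSetIP = -1  // the pending _SET_IP must survive the escape
                checked = false // escapes can invalidate; re-check afterwards
            }
        }
    }
}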

The collapse pattern is worth showing. A trace projection expands LOAD_CONST 0; POP_TOP into:

_LOAD_CONST_INLINE 0
_POP_TOP

The cleanup pass recognizes the pair and fuses them to:

_POP_TOP_LOAD_CONST_INLINE 0

The fused uop skips the push / pop round trip. The same fusion applies to _LOAD_FAST / _LOAD_FAST_BORROW / _LOAD_CONST_INLINE / _LOAD_CONST_INLINE_BORROW / _LOAD_SMALL_INT / _COPY paired with _POP_TOP and its load-then-pop variants.

def loop():
    for _ in range(100_000):
        pass
    return None

loop()
loop()

ex = loop.__code__.co_executors[0]
for name, oparg, target in ex.get_trace():
    print(f'{name:35} {oparg:5} {target}')

Once the loop has warmed past the threshold, the installed trace is visible through the Executor.get_trace method. The output shows the post-cleanup uop stream, which is what the dispatcher actually walks.

What's new

The full breakdown, grouped by where it landed.

optimizer/ package

The Tier-2 optimizer lives in its own package because it touches a different concern from the eval loop. The split mirrors CPython's Python/optimizer.c / Python/optimizer_analysis.c split.

  • optimizer/bloom.go. The 256-bit dependency filter (_Py_BLOOM_FILTER_WORDS = 8). Trace projection adds every guarded type, dict, and code pointer to the filter. The invalidation walk hashes a mutated pointer and asks every executor whether it might be in scope. False positives cost a trace invalidation; false negatives would be a correctness bug, and the filter rules them out by construction: anything added always tests positive.

  • optimizer/types.go. The shared shapes:

    • UOPInstruction. The per-uop row: opcode, oparg, target, operand pointers, two-int operand slots.
    • Executor. The trace container: uop buffer, exit table, side pointer back into Code.Executors, the dependency bloom filter, and the per-trace metadata VMData block.
    • ExitData. Per-exit deopt metadata.
    • BackoffCounter. The retry-on-deopt counter that prevents a cold loop from re-optimizing every time it warms a little.
    • JitOptSymbol lattice. Nine tags: Unknown, NonNull, Null, Bottom, TypeVersion, KnownClass, KnownValue, Tuple, Truthiness. The lattice is what the abstract interpreter walks to fold types and constants.
    • JitOptContext. The per-trace arena: locals plus stack pool, contradiction flag, out-of-space tripwire.
    • Bounds: UOPMaxTraceLength = 800, MaxAbstractInterpSize = 4096, TyArenaSize, MaxAbstractFrameDepth. The same constants CPython uses.

  • optimizer/uop_ids_gen.go, optimizer/uop_meta_gen.go. Generated tables. Every _PyUop_Flags and _PyUop_Replication row, plus the per-uop num-popped table. Upstream uop IDs are aliased to gopy's compile.OPCODE constants where the Tier-1 and Tier-2 IDs match (_LOAD_CONST, _POP_TOP, _NOP, _COPY, _SWAP, _RETURN_VALUE). The generator drives off Include/internal/pycore_uop_ids.h and Include/internal/pycore_uop_metadata.h from upstream.

  • optimizer/executor.go. Allocate, ExecutorInit, link against the per-interpreter executor list, detach from Code.Executors on dispose, ExecutorClear. Plus the two-phase free: pending deletions queue on InterpState.ExecutorDeletionListHead and the sweep runs when RemainingCapacity underflows so tp_dealloc cannot race a thread that is currently executing the trace. This matches the CPython _Py_Executors_InvalidateAll deferred-free shape.

  • optimizer/pyobject.go. The Python-visible executor surface: Executor.GetTrace returns the uop sequence as a list of (name, oparg, target) tuples for dis.dis to render; is_valid mirrors the upstream method. This is how Tools/scripts/uop_metrics.py and friends will plug in.

  • optimizer/side_table.go. Code.Executors packing: hasSpaceForExecutor, getIndexForExecutor, insertExecutor. The last one sets VMData.Opcode, VMData.Oparg, VMData.Code, VMData.Index, then patches the bytecode at the install site to ENTER_EXECUTOR. The Tier-1 opcode that was there before the install lives in VMData.Opcode so the deopt arm can find it.

  • optimizer/trace.go. TranslateBytecodeToTrace. The projection. Walks the instruction stream from a warmup site, expands each opcode through the lookupMacro identity fallback, stamps every guarded type and dict pointer into the dependency bloom filter, closes the loop with _JUMP_TO_TOP when the projector returns to the install site, and bails on a return or explicit error before the loop edge. Preludes every trace with _START_EXECUTOR and _MAKE_WARM.

  • optimizer/symbols.go. The JitOptContext lattice helpers. SymNew*, SymSet*, SymGet*, SymIs*, SymTruthiness, SymTupleGetitem, AbstractContextInit / AbstractContextFini, FrameNew / FramePop. Per-trace symbol arena, locals and stack pool, contradiction and out-of-space tripwires from Python/optimizer_symbols.c. The abstract interpreter that runs over the trace queries these helpers to fold types and constants.

  • optimizer/optimize.go. The orchestration. Optimize is the warmup-callback entry point; uopOptimize is the projection + analysis + fix-up pipeline; prepareForExecution is the exit / error stub split; makeExit, countExits, and makeExecutorFromUops wire _EXIT_TRACE rows to per-exit ExitData slots and stamp the executor's own pointer into _START_EXECUTOR.operand0 so the running trace can find its own metadata.

  • optimizer/analysis.go. The forward-pass orchestrator. Analyze runs removeGlobals, optimizeUops, and removeUnneededUops in order. The third pass is fully ported; the first two are skeleton.

  • optimizer/dis_hook.go, optimizer/uops_print.go. Format hooks. dis.dis knows how to print an ENTER_EXECUTOR row with the upstream operand layout; the per-uop _PyUOpPrint formatter the executor's GetTrace rides on lives here.

  • optimizer/uops.go. The dispatcher itself. RunExecutor builds a Tier2State around the trace, Run walks the buffer, and per-uop methods on *Tier2State return Tier2Status. JUMP_TO_TOP and JUMP_TO_JUMP_TARGET are encoded as in-method mutations of NextUop so the switch stays small.

  • optimizer/uops_impl.go. The hand-ported uop bodies for this release. The trace prologue and a starter set of stack and local ops:

    • _NOP
    • _START_EXECUTOR
    • _MAKE_WARM
    • _CHECK_VALIDITY
    • _SET_IP
    • _EXIT_TRACE
    • _JUMP_TO_TOP
    • _LOAD_FAST
    • _LOAD_FAST_BORROW
    • _STORE_FAST
    • _POP_TOP
    • _PUSH_NULL
    • _COPY
    • _SWAP

    Fourteen out of roughly 285 Tier-2-viable uops. The rest live as unimplementedUop stubs in uops_stubs_gen.go and bail to StatusDeopt until their bodies land.

  • optimizer/uops_dispatch_gen.go, optimizer/uops_stubs_gen.go. The dispatch switch. The generator scans optimizer/ for hand-ported bodies via go/ast and writes only stubs for uops that do not already have a method, so adding a new port is one file edit followed by a regen. The split between dispatch and stubs is so the generator can rewrite the stubs file without touching the hand-edited dispatch table.

vm/ wiring

The Tier-2 optimizer plugs into the eval loop at two points: the warmup callback on JUMP_BACKWARD and the dispatch on ENTER_EXECUTOR.

  • vm/tier2.go. tryWarmupTier2 is the per-JUMP_BACKWARD warmup callback. It bails cheaply when no executor slot is available (most warm code is already optimized) and otherwise calls optimizer.Optimize. On success, the install site flips to ENTER_EXECUTOR and subsequent passes land in enterExecutor. Because most uop bodies are still stubs, any trace that reaches one deopts cleanly: the dispatcher reads back the Tier-1 instruction the install path stashed in Executor.VMData and dispatches it through trySimple. The codeExecutorAt helper encapsulates the any-typed Code.Executors cast so the vm layer stays independent of the optimizer's concrete ExecutorArray type. A sketch of the warmup arm follows this list.
  • vm/eval_simple.go. Wires the warmup site into JUMP_BACKWARD. The computed jump target goes through tryWarmupTier2 on the way down. The ENTER_EXECUTOR arm calls enterExecutor. The two-line addition is the entire user-visible coupling; the bulk of the logic lives in vm/tier2.go.
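The warmup arm, sketched as promised above. The counter layout, the threshold value, and the helper stubs are all hypothetical; the real shapes live in vm/tier2.go:

const warmupThreshold = 16 // hypothetical; the real threshold lives in vm/tier2.go

type Code struct {
    Warmup    []int // per-site warmup counters (hypothetical layout)
    Executors any   // *optimizer.ExecutorArray once the first trace installs
}

func hasSpaceForExecutor(c *Code) bool { return true } // stub for the sketch
func optimize(c *Code, site int) bool  { return true } // stands in for optimizer.Optimize

// tryWarmupTier2 is the per-JUMP_BACKWARD callback shape: count, bail cheaply
// when cold or when no executor slot is free, otherwise hand off to the
// optimizer, which patches the site to ENTER_EXECUTOR on success.
func tryWarmupTier2(c *Code, site int) {
    c.Warmup[site]++
    if c.Warmup[site] < warmupThreshold {
        return // still cold
    }
    if !hasSpaceForExecutor(c) {
        return // cheap bail: most warm code is already optimized
    }
    optimize(c, site)
}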

objects/code.go

Code.Executors is added as an any-typed field so the vm layer does not need to import optimizer. It stores *optimizer.ExecutorArray once the trace optimizer installs the first executor for a code object. The lazy allocation matters: the field stays nil for code that never warms past the threshold (the common case for one-shot scripts), and the import direction stays clean.
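A sketch of the decoupling this buys, with trimmed types; only the one helper in vm/tier2.go ever names the concrete ExecutorArray:

type Executor struct{} // trimmed; the real shape is in optimizer/types.go

type ExecutorArray struct{ Slots []*Executor }

// codeExecutorAt is the single place the any-typed field is cast back; the
// rest of the vm layer and objects/code.go never name the concrete type.
func codeExecutorAt(executors any, index int) *Executor {
    arr, ok := executors.(*ExecutorArray)
    if !ok || arr == nil {
        return nil // cold code: Code.Executors was never allocated
    }
    return arr.Slots[index]
}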

tools/uops_gen/ scaffold

The DSL generator. Parses pycore_uop_ids.h and pycore_uop_metadata.h from a CPython checkout and emits optimizer/uop_ids_gen.go and optimizer/uop_meta_gen.go. Drift mode hashes the input header and fails CI when the recorded hash falls behind upstream, so regeneration is a forced step rather than a silent miss.
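A minimal sketch of the drift check, assuming SHA-256 over the raw header bytes; the recorded-hash constant and the failure message are illustrative, not the tool's real layout:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "os"
)

// recordedHash is stamped at the last regen; regenerating rewrites it.
const recordedHash = "..." // elided

func main() {
    data, err := os.ReadFile("Include/internal/pycore_uop_ids.h")
    if err != nil {
        panic(err)
    }
    sum := sha256.Sum256(data)
    if got := hex.EncodeToString(sum[:]); got != recordedHash {
        fmt.Fprintf(os.Stderr, "uops_gen: header drifted (hash %s); regenerate\n", got)
        os.Exit(1) // fail CI: regeneration is a forced step, not a silent miss
    }
}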

The matching pass over Python/optimizer_bytecodes.c (the DSL that drives the abstract-interpreter semantics) is left for the follow-up that lights up optimize_uops. Building both generator passes in v0.12 (one for IDs and metadata, one for semantics) would have been more work for no real benefit; the orchestrator runs the default-arm semantics today, which is correct (just unoptimized).

compile/opcodes.go

COMPARE_OP's oparg now packs the comparison kind in the high bits, matching CPython's pre-specializer layout. Unblocks the comparison_eq family of smoke fixtures deferred from v0.11. This is the one piece of v0.12 that is not Tier-2; it slipped in here because the fix is small and needed downstream.
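The packing idea, sketched. The shift width is an assumption for illustration (CPython 3.12 read the comparison kind as oparg >> 4); the real constant is whatever compile/opcodes.go matched from upstream:

const compareKindShift = 4 // assumed width, illustrative only

func packCompareOparg(kind, flags int) int { return kind<<compareKindShift | flags }
func compareKind(oparg int) int            { return oparg >> compareKindShift }
func compareFlags(oparg int) int           { return oparg & ((1 << compareKindShift) - 1) }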

Why we built it this way

Several decisions deserve callouts.

Why "control flow first, bodies later"

The single most important decision in this release is shipping the scaffolding before the bodies. The alternative would have been to port all 285 uop bodies and ship the whole optimizer at once. We chose not to because the bodies are independent and the scaffolding is shared.

Concretely: if you ship 14 bodies plus the scaffolding, anyone can port body 15 in a single PR that touches one file. The generator regenerates the dispatch table, the new body slots in, and the test suite re-runs. If instead you tried to ship 285 bodies at once, every body change rebases against every other body, and the diff is unreviewable.

The same logic explains the body-swap shape of the spec. Adding a body removes a stub on the next regen. There is no other plumbing change. The dispatch table, the side table, the analysis pass, the install path, the deopt arm: all already in, all already gated.

Why the bloom filter is 256 bits

The bloom filter's job is to ask "could this trace possibly care about that pointer?". A false positive costs a trace invalidation (the trace is dropped and the code re-warms). A false negative would be a correctness bug (a stale trace continues to run against mutated objects).

CPython tuned the filter at 256 bits because traces typically reference a few dozen pointers and the false positive rate at 256 bits is low. We kept the same bound. Going larger would have made invalidation lookups slower without measurably lowering the false-positive rate; going smaller would have collided more often. The same shape is the right shape for our workload because our traces are CPython traces.
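A minimal sketch of the filter shape. It sets one bit per key for brevity (the real filter derives several bit positions per pointer), and the hash mix is illustrative:

const bloomWords = 8 // 8 words × 32 bits = 256 bits (_Py_BLOOM_FILTER_WORDS = 8)

type BloomFilter struct{ words [bloomWords]uint32 }

// hashPointer only has to be deterministic: Add and MightContain share it,
// so anything added is guaranteed to test positive later.
func hashPointer(p uintptr) uint32 {
    h := uint64(p) * 0x9e3779b97f4a7c15 // Fibonacci mix, illustrative
    return uint32(h >> 32)
}

// Add records a dependency pointer at trace-projection time.
func (b *BloomFilter) Add(p uintptr) {
    h := hashPointer(p)
    b.words[h%bloomWords] |= 1 << (h / bloomWords % 32)
}

// MightContain may answer true for a pointer that was never added (a false
// positive, costing one trace invalidation) but never false for one that
// was: no false negatives, by construction.
func (b *BloomFilter) MightContain(p uintptr) bool {
    h := hashPointer(p)
    return b.words[h%bloomWords]&(1<<(h/bloomWords%32)) != 0
}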

Why the deferred-free for executors

tp_dealloc on an executor cannot run while a thread is currently inside that executor's trace. CPython solves this with a deferred-free list: dead executors queue on the interpreter, and the sweep runs when the executor pool is under pressure. We copied the shape exactly because the alternative (refcounted trampolines, or stop-the-world before free) was more code for the same effect.

InterpState.ExecutorDeletionListHead holds the queue; RemainingCapacity decrementing through zero triggers the sweep. A racing thread that was inside the trace at the time of disposal completes its current uop and exits cleanly through _EXIT_TRACE before the sweep runs.
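In sketch form, with trimmed types; the link field and the sweep body are simplified stand-ins for the real code in optimizer/executor.go:

type Executor struct {
    next    *Executor // deletion-list link (simplified)
    invalid bool      // a running trace observes this via _CHECK_VALIDITY
}

type InterpState struct {
    ExecutorDeletionListHead *Executor
    RemainingCapacity        int
}

// dispose never frees directly: it invalidates (a thread mid-trace exits
// cleanly through _EXIT_TRACE at its next validity check), queues, and only
// sweeps once the capacity counter underflows.
func dispose(interp *InterpState, ex *Executor) {
    ex.invalid = true
    ex.next = interp.ExecutorDeletionListHead
    interp.ExecutorDeletionListHead = ex

    interp.RemainingCapacity--
    if interp.RemainingCapacity < 0 {
        interp.ExecutorDeletionListHead = nil // drop the queue; the GC reclaims
        interp.RemainingCapacity = 0
    }
}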

Why removeUnneededUops first

Of the three analysis passes, removeUnneededUops is the only one that we ported in full for v0.12. The choice was deliberate. The pass is structurally simple (a single linear walk), self-contained (it does not need watchers or version counters), and visibly improves the trace output. Shipping it first gives the dispatcher a clean trace to walk, which means the gates can exercise real cleanup output rather than hand-built fixtures.

removeGlobals and optimizeUops both rely on infrastructure that does not exist yet (dict watchers, the DSL-generated case table). Stubbing them out keeps the orchestrator runnable while we land the dependencies separately. The cost is that optimization quality is below what the final pipeline produces, which is fine because v0.12 is about shipping the pipeline at all.

Why a typed-method dispatcher

The dispatcher could have been a flat switch over an enum, the way the Tier-1 eval loop is. We made the dispatcher dispatch through methods on *Tier2State instead, for two reasons.

The first reason is locality. Each uop body sits next to its method declaration, not in a giant case statement. Code review is per-uop; the diff for a new body is one method.

The second reason is generation. The dispatch table generator scans optimizer/ via go/ast and emits stubs for uops without a method body, then a switch that calls the method. Adding a body deletes a stub on regen. The generator does not need to know about the body's content, just its name. This works because methods are syntactically distinct from free functions; the generator can identify them precisely.
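The shape of the two generated files, abbreviated against the dispatcher sketch from earlier; the method names and the _BINARY_OP case are illustrative:

const uopBinaryOp = 100 // placeholder ID; real values come from uop_ids_gen.go

// uops_stubs_gen.go (shape): one stub per unported uop. When a hand-written
// method with this name lands in uops_impl.go, the next regen stops emitting
// the stub, and nothing else changes.
func (s *Tier2State) uopBinaryOp(u UOPInstruction) Tier2Status {
    return s.unimplementedUop(u) // bail to Tier-1 rather than crash
}

func (s *Tier2State) unimplementedUop(u UOPInstruction) Tier2Status {
    return StatusDeopt
}

// uops_dispatch_gen.go (shape): the switch fans uop IDs into method calls.
func (s *Tier2State) dispatchGenerated(u UOPInstruction) Tier2Status {
    switch u.Opcode {
    case uopBinaryOp:
        return s.uopBinaryOp(u)
    default:
        return StatusDeopt
    }
}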

The cost is a method call per uop instead of inlined switch arms. Benchmarks said this was negligible compared to the cost of the uop's actual work, so we kept the readability win.

Where it lives

The map of this release in the repo:

  • optimizer/. Everything Tier-2.
    • bloom.go. The dependency filter.
    • types.go. The shared shapes.
    • uop_ids_gen.go, uop_meta_gen.go. Generated tables.
    • executor.go. Allocation, install, two-phase free.
    • pyobject.go. Python-visible executor surface.
    • side_table.go. Code.Executors packing.
    • trace.go. Projection from quickened bytecode to uop trace.
    • symbols.go. The lattice and arena helpers.
    • optimize.go. Orchestration.
    • analysis.go. The three-pass analyze pipeline.
    • dis_hook.go, uops_print.go. Format hooks.
    • uops.go. The dispatcher.
    • uops_impl.go. The 14 hand-ported uop bodies.
    • uops_dispatch_gen.go, uops_stubs_gen.go. Dispatch table and stubs.
  • vm/tier2.go. Warmup callback and enterExecutor.
  • vm/eval_simple.go. JUMP_BACKWARD warmup site and ENTER_EXECUTOR arm.
  • objects/code.go. Code.Executors field.
  • tools/uops_gen/. The DSL generator that produces the uop_ids_gen.go and uop_meta_gen.go tables.
  • compile/opcodes.go. The COMPARE_OP oparg packing fix.

Compatibility

Two user-visible behaviors change.

  • Bytecode disassembly shows ENTER_EXECUTOR rows. Hot loops that have warmed past the Tier-2 threshold show ENTER_EXECUTOR at the install site in dis.dis output. The source is unchanged; only the warmed bytecode shape is visible. CPython 3.13+ behaves the same way.
  • Executor traces are visible from Python. The Executor.get_trace method returns the uop sequence as a list of (name, oparg, target) tuples. Tools that walk disassembled code for analysis can read the Tier-2 trace directly.

Both behaviors are gated behind warmup. Cold code shows the Tier-1 bytecode it always did.

There are no behavior changes for user code: the Tier-2 trace is semantically equivalent to the Tier-1 bytecode it replaces, and the deopt arm covers the cases the optimizer cannot statically prove. If a guard fails mid-trace, the dispatcher falls back to Tier-1 at the deopt target and continues from there.

Gates

End-to-end gates cover the control flow this release lands. They live under v012test/.

  • v012test/optimizer_gate_test.go. The Tier-2 install gate.
    • TestOptimizerGateInstall. Calls optimizer.Optimize directly on a tiny LOAD_CONST / POP_TOP / JUMP_BACKWARD loop, asserts the install site flips to ENTER_EXECUTOR, the side table is populated at the right slot, the executor stashes the Tier-1 op the install site was patched out of, and the trace prelude is _START_EXECUTOR / _MAKE_WARM with _JUMP_TO_TOP closing the loop.
    • TestOptimizerGateEnterExecutorDispatch. Manually constructs an executor whose VMData.Opcode is RETURN_VALUE, runs LOAD_CONST 0; ENTER_EXECUTOR 0 through EvalCode, and verifies the deopt path returns None. Proves the arm went through enterExecutor, not through any other hand-rolled arm or notImplemented.
  • v012test/analysis_gate_test.go. The cleanup-pass gate.
    • TestAnalysisGateCleanupRunsInOptimize. Runs the install flow and asserts the post-finalize trace contains no _NOP rows and no _SET_IP rows ahead of the first escaping uop.
    • TestAnalysisGateBenignBailFromOptimizeUops. Blows the abstract-interp stack budget through a giant Stacksize, confirms Optimize returns a benign bail (status 0, nil executor), and the install site is left untouched. Out-of-budget bails should never leave a half-patched site behind.
  • v012test/uops_gate_test.go. The Tier-2 dispatch gate.
    • TestUopsGateHandPortedHappyPath. Drives a hand-built trace through every hand-ported uop in turn and asserts the dispatch loop terminates with StatusExit, the warm bit flips, the frame instruction pointer is what _SET_IP set it to, and the local round-trips through _LOAD_FAST / _STORE_FAST.
    • TestUopsGateStubReturnsDeopt. Hits a stubbed uop (_BINARY_OP) and asserts the dispatcher cleanly returns StatusDeopt rather than panicking. The whole point of the stub design is to bail rather than crash; this gate locks it in.
    • TestUopsGateProjectedTraceTerminates. Runs a real Optimize-produced trace through RunExecutor and asserts the loop terminates on either StatusExit or StatusDeopt. Partial-port traces stay runtime-safe.
  • optimizer/uops_test.go. Direct unit coverage for the dispatcher: _NOP / _EXIT_TRACE round-trip, _LOAD_FAST / _STORE_FAST echo, stub deopt, and the _CHECK_VALIDITY jump-on-invalid arm.
  • optimizer/analysis_test.go. Direct unit coverage for removeUnneededUops: drops redundant _SET_IP / _CHECK_VALIDITY, keeps _SET_IP ahead of an escaping op, collapses the load-then-pop idiom. Plus the Analyze pass-through on an empty trace.

A note on Tier-1 vs Tier-2

The "tier" terminology in CPython's optimizer is borrowed from JIT design but does not refer to a JIT in the traditional sense. There is no machine code generation. Both tiers are bytecode interpreters; they differ in granularity and analysis.

Tier-1 is the specializer from v0.11. It operates on individual opcodes. When a LOAD_ATTR site has a stable receiver type, the opcode is rewritten in place to LOAD_ATTR_INSTANCE_VALUE. The eval loop dispatches the typed variant. The unit of optimization is one opcode.

Tier-2 is the trace optimizer from v0.12. It operates on linear sequences of micro-operations. A hot loop accumulates warmup at its backward jump; once it crosses a threshold, the projector captures the body as a uop stream; the analyzer folds across the stream; the dispatcher walks the result. The unit of optimization is a whole trace.

The two tiers are complementary. Tier-1 makes individual opcodes faster. Tier-2 fuses sequences of typed opcodes and eliminates redundant checks across them. A trace that walks ten typed opcodes can collapse ten type-version checks into one (the guard at trace entry), provided the analyzer can prove the type cannot change between checks. That kind of elimination is exactly what Tier-2 is for.

The cumulative win is significant. CPython benchmarks show Tier-2 lifting performance another 10-20% above Tier-1 on hot loops. Our porting target is to deliver the same improvement on the same workloads.

A note on the abstract interpreter

The analysis pass that will fold types across a trace is an abstract interpreter. It walks the trace once, propagating lattice elements through each uop. The lattice tags live on JitOptSymbol in optimizer/types.go:

  • Unknown. Top. No information.
  • NonNull. The value is known not to be null.
  • Null. The value is known to be null.
  • Bottom. Contradiction. Unreachable.
  • TypeVersion. The value's type version is pinned.
  • KnownClass. The value's class is pinned.
  • KnownValue. The value itself is pinned.
  • Tuple. The value is a tuple of known length with per-slot lattice info.
  • Truthiness. The value's truthiness is known.

Each uop's abstract semantics describe how its operands and outputs flow through the lattice. _GUARD_BOTH_INT narrows both stack slots to KnownClass(int). _BINARY_OP_ADD_INT reads two KnownClass(int) slots and produces a KnownClass(int). If a later _GUARD_BOTH_INT re-checks slots that are already KnownClass(int), the abstract interpreter can prove the guard is redundant and the cleanup pass can drop it.
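A hedged sketch of that fold. Tag names follow the JitOptSymbol list; the Sym shape and the flat stack slice are simplified for illustration:

type SymTag int

const (
    TagUnknown SymTag = iota
    TagKnownClass
)

type Sym struct {
    Tag   SymTag
    Class string // "int", "float", ... when Tag == TagKnownClass
}

// abstractGuardBothInt applies the guard's abstract semantics to the two top
// stack slots and reports whether the concrete guard was already proven.
func abstractGuardBothInt(stack []Sym) (redundant bool) {
    n := len(stack)
    l, r := &stack[n-2], &stack[n-1]
    redundant = l.Tag == TagKnownClass && l.Class == "int" &&
        r.Tag == TagKnownClass && r.Class == "int"
    // Whether the guard is kept or dropped, both slots are ints afterwards.
    *l = Sym{Tag: TagKnownClass, Class: "int"}
    *r = Sym{Tag: TagKnownClass, Class: "int"}
    return redundant
}

The real pass records the narrowing in the JitOptContext arena rather than a flat slice, but the transition is the same: narrow on guard, and drop any later guard that cannot learn anything new.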

The v0.12 release ships the lattice helpers in symbols.go and the orchestrator in analysis.go optimizeUops. The per-uop abstract semantics live in Python/optimizer_bytecodes.c upstream, in a DSL that generates a case table. Porting that DSL is the v0.13 follow-up. Until then, optimizeUops runs the default arm, which is "lattice element stays Unknown for every output". The dispatcher still runs correctly; the optimization quality is just below the upstream ceiling.

What's next

The v0.12 ship lands trace projection, install, the analysis skeleton, the dispatch loop, and a starter set of hand-ported uop bodies. The remaining work is body swaps against the wiring this release ships.

  • The long tail of per-uop bodies. 14 of about 285 Tier-2-viable uops are hand-ported; the rest live as StatusDeopt stubs. Adding a body removes a stub on the next regen, so this is incremental. The priority list goes by Tier-2 hit frequency in upstream benchmarks: the arithmetic family (_BINARY_OP_ADD_INT and its siblings), the attribute family (_LOAD_ATTR_INSTANCE_VALUE), the call family (_CALL_BUILTIN_O, _CALL_PY_EXACT_ARGS), the iteration family (_FOR_ITER_RANGE, _FOR_ITER_LIST).
  • The DSL-generated case table (optimizer/uops_cases_gen.go from Python/optimizer_bytecodes.c). Per-opcode abstract semantics drive the constant-folding, type-version, and stack-effect parts of optimize_uops. The orchestrator runs the unwired default arm today, so this too is a body swap rather than a control-flow change. Landing this lights up real optimization quality on top of the dispatch.
  • The dict / type watcher infrastructure plus the remove_globals pass. Folds _LOAD_GLOBAL_MODULE, _LOAD_GLOBAL_BUILTINS, and _LOAD_ATTR_MODULE rows into inline-const loads when the dict version still matches and the watcher can prove the dict has not mutated. The invalidation walk through _Py_Executors_InvalidateDependency rides on the watcher fan-out. Without watchers, the analysis pass cannot safely fold cross-trace state, so this is on the critical path for the second-tier wins.

Carried over for the broader v0.13 ship beyond Tier-2:

  • The four CPython smoke fixtures deferred ahead of v0.11 (comparison_eq and friends) are unblocked by the COMPARE_OP oparg fix; full re-enablement still waits on a v0.13 specializer fix-up sweep.
  • The _io.File layered split (RawIOBase / BufferedReader / TextIOWrapper) was deferred from v0.10.1 and is still pending.
  • The compaction passes in compile/flowgraph (swaptimize, super-instructions, LOAD_FAST ref-stack, cold-block hoist) were deferred from v0.5 and remain open.

Acknowledgments

This release closes work tracked across the v0.12 spec series. The public-facing pointers:

  • PEP 659. The adaptive-specialization design that motivates Tier-2's existence. Without the specializer producing typed opcodes, Tier-2 has nothing to fold.
  • CPython source we ported against: Python/optimizer.c, Python/optimizer_analysis.c, Python/optimizer_symbols.c, Python/optimizer_bytecodes.c (the DSL), Include/internal/pycore_optimizer.h, Include/internal/pycore_uop_ids.h, Include/internal/pycore_uop_metadata.h.
  • The pull request that shipped this release (#20) merged at bc3369c.

With the control flow in, the long tail of body ports is what gates v0.13's quality on hot loops. We expect to land a meaningful chunk of those bodies per release through the v0.13 / v0.14 window.