v0.11.0 - Adaptive specializer and sys.monitoring

Released May 7, 2026.

If you have ever stared at the CPython source tree and wondered why Python/specialize.c is six thousand lines long, this is the release where that answer becomes ours to maintain. v0.11 brings two large 3.12 / 3.13 era runtime features over the line at once: the PEP 659 adaptive specializer that rewrites bytecode into typed variants on first hit, and the PEP 669 sys.monitoring surface that fans nineteen events out to as many as seven independent tools without paying a per-event cost when nothing is listening.

These two pieces of work are not coincidentally landing together. They share a substrate. Both rewrite quickened bytecode in place, both rely on the per-code instruction stride to find their caches, and both reach into the dispatch loop the same way. The specializer rewrites LOAD_ATTR into LOAD_ATTR_INSTANCE_VALUE when the type version stays stable; the monitor rewrites LOAD_ATTR into INSTRUMENTED_LOAD_ATTR when a tool subscribes to the right event. The arms coexist because they were designed together upstream, and we ported them together so the eval loop sees a single coherent surface.

Legacy sys.settrace and sys.setprofile ride on top of the same fan-out through a faithful port of CPython's Python/legacy_tracing.c. There is one dispatch path for tracing in v0.11. The old "if profile != nil { ... } else if trace != nil { ... }" tangle is gone.

Highlights

Three big themes carry this release.

Adaptive specialization at every hot opcode

PEP 659 is the design that makes CPython 3.12 noticeably faster than 3.11 on real Python code without bringing in a JIT. The trick is small: rewrite the bytecode in place on the first call, swapping a generic LOAD_ATTR for a typed variant like LOAD_ATTR_INSTANCE_VALUE that assumes the receiver is a plain instance with a fixed shape. Cache the type version. On the next call, if the version still matches, the typed arm bypasses the descriptor dance and reads the slot directly. On a miss, the counter ticks down; after enough misses, the opcode "deopts" back to the generic arm and starts the warmup window over.

v0.11 ports every PEP 659 family. The list is long and worth spelling out so you know what is and is not specialized:

LOAD_ATTR module / class / instance value / slot / property / method
STORE_ATTR
LOAD_GLOBAL
LOAD_SUPER_ATTR
BINARY_OP add / subtract / multiply for int, float, unicode, list, dict
COMPARE_OP
CONTAINS_OP
TO_BOOL
STORE_SUBSCR
UNPACK_SEQUENCE
FOR_ITER list / tuple / range / generator
SEND
CALL / CALL_KW

Each family lands with its hit path, its miss path, and the deopt path that walks the counter back. Each family carries the upstream tp_version_tag and dk_version checks the fast variants assume. The counter packing matches CPython exactly: (value << 8) | (backoff << 4) | tag, the same shape Include/internal/pycore_backoff.h defines.
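The packing is easy to hold in your head as a few lines of Python. This is a hypothetical sketch of the layout just described (tag in the bottom four bits, backoff exponent in the next four, value in the top eight); pack_counter and unpack_counter are illustrative names, not gopy APIs.

```python
# Hypothetical sketch of the 16-bit counter word: tag in bits 0-3,
# backoff exponent in bits 4-7, value in bits 8-15.
# pack_counter / unpack_counter are illustrative names, not gopy APIs.

def pack_counter(value: int, backoff: int, tag: int) -> int:
    assert 0 <= value <= 0xFF and 0 <= backoff <= 0xF and 0 <= tag <= 0xF
    return (value << 8) | (backoff << 4) | tag

def unpack_counter(word: int) -> tuple[int, int, int]:
    return (word >> 8) & 0xFF, (word >> 4) & 0xF, word & 0xF

# A hit decrements value; a miss that saturates bumps the backoff
# exponent, widening the next warmup window.
word = pack_counter(value=16, backoff=2, tag=1)
assert unpack_counter(word) == (16, 2, 1)
```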

def hot(obj):
    return obj.x + 1

class P:
    __slots__ = ('x',)
    def __init__(self): self.x = 41

p = P()
for _ in range(10_000):
    hot(p)

On the first call, LOAD_ATTR x is the generic arm. Inside _PyCode_Quicken (our specialize/quicken.go), the opcode flips to LOAD_ATTR_SLOT because P carries __slots__ and the attribute resolves to a slot descriptor. The inline cache stashes the type version. The next ten thousand calls take the slot read path, which is roughly four instructions: check the type version, load the slot offset, read the slot. The __getattribute__ descriptor walk does not run.

If you swap p for an instance of a different class on call number 9,001, the version check fails. The opcode increments its deopt counter. When the counter saturates, the opcode falls back to generic LOAD_ATTR and starts another warmup window. None of this is visible to your Python code. It just runs faster.

sys.monitoring with seven independent tools

PEP 669 replaces sys.settrace and sys.setprofile for new code. The model is simple but careful: there are nineteen events (PY_START, PY_RESUME, PY_RETURN, PY_YIELD, CALL, C_CALL, LINE, INSTRUCTION, JUMP, BRANCH, RAISE, EXCEPTION_HANDLED, PY_UNWIND, PY_THROW, STOP_ITERATION, RERAISE, C_RETURN, C_RAISE, BRANCH_RIGHT / BRANCH_LEFT). There are eight tool IDs, seven of them carrying tools (ID 4 is reserved):

0 debugger
1 coverage
2 profiler
3 branch coverage
4 reserved
5 optimizer
6 sys.profile (legacy bridge)
7 sys.trace (legacy bridge)

Each tool registers a callback per event it cares about. The interpreter walks the active-tool bitmask on every fire site and calls each tool in turn. If a tool returns the DISABLE sentinel, the per-code instrumentation for that event clears, and that line never fires again for that tool. This is what makes coverage tools cheap: once a line has been recorded, it stops emitting.

from sys import monitoring as mon

DEBUGGER = mon.DEBUGGER_ID
mon.use_tool_id(DEBUGGER, 'my-debugger')

def on_return(code, off, retval):
    print(f'{code.co_name} returned {retval!r}')

mon.register_callback(DEBUGGER, mon.events.PY_RETURN, on_return)
mon.set_events(DEBUGGER, mon.events.PY_RETURN)

def f():
    return 42

f()  # prints: f returned 42

The shadow walk in monitor/install.go rewrites quickened bytecode in place. Every opcode that can fire an event gets an INSTRUMENTED_* mirror (INSTRUMENTED_CALL, INSTRUMENTED_RETURN_VALUE, etc.). Deinstrument restores the original byte. The per-event and per-tool masks are recomputed lazily so re-registering a callback stays cheap.

One dispatcher for old and new tracing

sys.settrace and sys.setprofile predate PEP 669 by twenty years, and CPython keeps them working by routing the old shape through the new fan-out. The bridge registers sys.profile as tool 6 and sys.trace as tool 7, listens for the events those two APIs care about, and translates between the two callback conventions. Forward jumps return the DISABLE sentinel so the line handler takes over via INSTRUMENTED_LINE.

import sys

def tracer(frame, event, arg):
    if event == 'line':
        print(f'  line {frame.f_lineno}')
    return tracer

sys.settrace(tracer)

def go():
    x = 1
    y = 2
    return x + y

go()

Under v0.11, sys.settrace(tracer) registers tool 7, subscribes to PY_START, LINE, PY_RETURN, and the exception family, and hands every event back to the user-supplied Python callable through a Go trampoline shaped like CPython's Py_tracefunc. Per-frame TraceLines / TraceOpcodes / Lineno slots mirror the f_trace_lines / f_trace_opcodes fields on PyFrameObject, so the bridge can drive line and opcode events without paying for them when the user has not asked for them.

What's new

The full breakdown, grouped by where it landed.

specialize/ core

The specializer lives in its own package. The shape mirrors the upstream split between counter / cache layouts, the quicken pass, and the per-family bodies.

  • specialize/backoff.go. The counter packing. Each counter is one 16-bit word with the layout (value << 8) | (backoff << 4) | tag. The tag lives in the bottom four bits; the backoff exponent occupies the next four; the value is the remaining eight. The hit path decrements value; the miss path increments backoff. The shape mirrors Include/internal/pycore_backoff.h exactly because we want identical warmup behavior to CPython under the same workload.

  • specialize/cache.go. The inline-cache layout helpers. For each opcode, the cache count comes from _PyOpcode_Caches. The helpers walk past the prefix codeunit to find the per-opcode cache slots, which is where the type version, dict keys version, function pointer, or attribute index lives.

  • specialize/quicken.go. A 1:1 port of _PyCode_Quicken from Python/specialize.c. Walks the bytecode at code object install time. Every eligible opcode gets rewritten to its adaptive variant (the _ADAPTIVE suffix in CPython, dropped from the table for terseness). The counters are seeded to the upstream initial values so the first call enters the warmup window rather than firing instantly.

  • specialize/core.go and specialize/deopt.go. The _PyOpcode_Caches and _PyOpcode_Deopt tables plus the set_opcode, specialize, and unspecialize core helpers from Python/specialize.c. specialize flips the opcode to the typed variant on the hit path; unspecialize reverts to the adaptive prefix when the deopt counter saturates.

  • The per-family ports. Every PEP 659 family in Python/specialize.c has a counterpart here:

    • LOAD_ATTR arms for module attribute, class attribute, instance value, slot descriptor, property descriptor, and method (the bound-method fast path).
    • STORE_ATTR and its symmetric arms.
    • LOAD_GLOBAL for module dict vs builtins dict, with the keys-version check the fast variants assume.
    • LOAD_SUPER_ATTR for the two-argument and zero-argument super cases.
    • BINARY_OP for add, subtract, multiply across int, float, unicode, list, and dict operand pairs.
    • COMPARE_OP for the three numeric kinds (int / int, float / float, str / str) plus the generic arm.
    • CONTAINS_OP for dict, set, list, tuple.
    • TO_BOOL across int, list, str, dict, set, none, bool.
    • STORE_SUBSCR for list-int and dict-str.
    • UNPACK_SEQUENCE for list and tuple of fixed length.
    • FOR_ITER arms for list, tuple, range, and generator.
    • SEND for the generator fast path.
    • CALL and CALL_KW arms for Python function, C function, method-bound function, type call, builtin call, list-append call (the hottest method call in the language).

    Each family carries its hit path, its miss path, and its deopt path. Each family stamps the right version into the inline cache so the typed arm can re-verify on the next call.

Exact-type predicates

The PEP 659 fast paths key off "is this the exact type" rather than "is this an instance of this type". int subclasses do not take the int fast path. dict subclasses do not take the dict fast path. The exact-type predicates live in objects/exact.go. Every fast path family consults them before flipping to the typed variant.
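The predicate is the same one you can write in pure Python. A sketch for illustration only (is_exact_int is an illustrative name, not a gopy export):

```python
def is_exact_int(x) -> bool:
    # type(x) is int: subclasses deliberately fail this check,
    # because they may override __add__, __index__, and friends.
    return type(x) is int

class MyInt(int):
    def __add__(self, other):  # a subclass can change arithmetic...
        return 'surprise'

assert is_exact_int(5)
assert not is_exact_int(MyInt(5))   # ...so it must not take the int fast path
assert isinstance(MyInt(5), int)    # isinstance would wrongly admit it
```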

The corresponding version counters live in objects/version.go, objects/type_specialize.go, and objects/dict_specialize.go. The type version increments when a class mutates (a new method added, a base swapped). The dict keys version increments when the keys shape changes (a key added or removed). The fast-path arms read these versions out of the inline cache, compare against the live values, and deopt on mismatch. This is the mechanism that makes the warm path so cheap: a couple of integer comparisons replace the full attribute lookup.

vm/adaptive.go and vm/dispatch.go

These wire the specializer into the eval loop. Every quickened opcode peels into its typed variant on hit, increments its counter, and falls back to the generic arm on miss with the deopt counter primed. The dispatch loop sees one opcode per codeunit just like CPython; the typed variants are real opcodes with their own table rows, not a runtime branch on opcode + tag.

This decision matters for two reasons. First, the dispatcher stays a flat switch. Second, the typed variants are independently analyzable by the Tier-2 optimizer (which ships in v0.12); a trace projection can identify a LOAD_ATTR_INSTANCE_VALUE row and fold the type check, where it could not do the same if the type tag were a runtime branch.

vm/instrument_fire.go

The nineteen fire-event entry points (_Py_call_instrumentation family from Python/instrumentation.c) live here. Each entry point dispatches to the per-tool callbacks registered against monitor.InterpState. The fan-out walks the active-tool bitmask, calls each callback in tool-id order (debugger first, then coverage, then profiler, etc.), and honors the Disable sentinel by clearing the per-code instrumentation for that event.

monitor/ package

The monitor package is the gopy side of sys.monitoring.

  • monitor/events.go and monitor/tools.go. The PEP 669 event IDs and tool IDs. The constants match CPython's exact numeric values so cross-implementation tooling that hard-codes these IDs continues to work.
  • monitor/state.go. The per-code instrumentation slab. Each code object carries an optional MonitoringData block with per-event masks, per-event callback IDs, and the saved original bytecode bytes the instrument pass overwrote. The slab is lazy: uninstrumented code carries a nil pointer.
  • monitor/interp.go. InterpState carries the per-tool callback table, the per-tool event mask, the active-tool registry, the global monitoring version, and the shared sentinels (Disable, Missing). The global monitoring version bumps on every set_events / register_callback call; per-code instrumentation re-checks the version and reinstruments on mismatch.
  • monitor/tables.go and monitor/install.go. The shadow walk from _Py_Instrument. Quickened bytecode is rewritten in place to INSTRUMENTED_* variants. Deinstrument restores the original byte. The per-event and per-tool masks are recomputed lazily through setup_*_callbacks so re-registration stays cheap.
  • monitor/line.go. Line instrumentation. The PEP 626 line table encodes start-of-line positions. Line tracking walks the table to identify line transitions; INSTRUMENTED_LINE fires the LINE event for each transition through the dispatch fan-out.
  • monitor/local.go. The eleven local events that gate per-frame line and opcode tracing without touching the global mask. Local events let a debugger say "trace lines in this frame but not its callees" without rewriting the callees' bytecode.
  • monitor/fire.go. The shared callback runner. Walks the active tool bitmask, calls each callback in tool-id order, honors the Disable sentinel by clearing the per-code instrumentation, and re-raises Python exceptions back into the interpreter. A callback that raises propagates the exception; the surrounding opcode handles it the same way a manual raise would.
  • monitor/sysmonitoring.go. The Python-visible sys.monitoring module: use_tool_id, free_tool_id, get_tool, register_callback, set_events, get_events, set_local_events, get_local_events, restart_events, plus the events.* and tool ID constants and the DISABLE / MISSING singletons.
  • monitor/sentinel.go. The Disable and Missing Python objects. Disable is what a callback returns to say "do not fire this event for this code location again"; Missing is the argument the interpreter passes when an event fires without a natural value (a LINE event has no retval, for example).

Legacy tracing bridge

The bridge in vm/legacy_tracing.go is a 1:1 port of Python/legacy_tracing.c. The job is to take a Python-shaped trace function (the one you pass to sys.settrace) and present it to the monitor layer as a PEP 669 tool.

  • vm/legacy_tracing.go. The bridge registers sys.profile and sys.trace as tools 6 and 7. It fans the matching events through Go-level LegacyTraceFunc callbacks shaped like CPython's Py_tracefunc, and translates between the two callback conventions. Forward jumps return the Disable sentinel so the line handler takes over via INSTRUMENTED_LINE.
  • vm/sys_trace_builtins.go. The Python-visible sys.settrace, sys.setprofile, sys.gettrace, sys.getprofile. Threads the user-supplied Python callable through a Go trampoline shaped like LegacyTraceFunc, then defers to SetTrace / SetProfile to install the bridge. The shape mirrors Python/sysmodule.c's sys_settrace / sys_setprofile.
  • frame/frame.go. Per-frame TraceLines, TraceOpcodes, and Lineno slots so the bridge can drive line and opcode events without paying for them when the user has not asked. These mirror the f_trace_lines / f_trace_opcodes flags on PyFrameObject.
  • state/state.go and monitor/interp.go. Per-thread tracing slots on the VM-side thread state plus SysProfileOnce and SysTraceOnce flags so the one-time setup_*_callbacks install runs at most once per interpreter. Re-calling sys.settrace(f) swaps the callback without paying the install cost again.

compile/opcodes_gen.go

The opcode table was regenerated. The specialized variants (LOAD_ATTR_INSTANCE_VALUE, BINARY_OP_ADD_INT, etc.) and the INSTRUMENTED_* mirror set are all there now. The cache-count column lines up with _PyOpcode_Caches so the dispatch loop can stride over the inline cache slots without per-opcode special cases.

Regenerating the table is one command. Adding a new specialized opcode means editing the upstream table, regenerating, and adding the per-opcode body. The generator does not invent variants; it mirrors what upstream declares.

Why we built it this way

A few decisions deserve calling out.

Why typed variants instead of a runtime tag

PEP 659 could in principle have been a single LOAD_ATTR opcode that branches at runtime on a tag stored in the cache. CPython chose to mint a separate opcode per typed arm (LOAD_ATTR_INSTANCE_VALUE, LOAD_ATTR_SLOT, LOAD_ATTR_MODULE, etc.) and we mirrored that choice. The benefit is that the eval loop's switch goes through one branch per opcode rather than two. The dispatch logic for the typed arm is just a different case in the same switch. There is no extra indirection.

The other benefit is downstream. A Tier-2 trace optimizer (which v0.12 ships) reads a stream of typed opcodes and folds the type checks. If the type were a runtime tag, the optimizer would have to reason about the tag's value, which is harder. The flat typed-variant model makes the trace projector almost trivial.

Why a per-code instrumentation slab

The monitor could maintain per-event masks at the interpreter level and check them on every opcode. CPython chose instead to rewrite the bytecode in place: code that is being monitored carries INSTRUMENTED_* opcodes that fire the event inline; code that is not being monitored carries the original opcode and pays nothing. We followed.

The win is that monitoring is free when no tool is listening. This matters because sys.monitoring is intended to replace sys.settrace, and sys.settrace historically carried a real runtime cost even when the trace function was None. PEP 669 fixes that. The shadow walk pays the install cost up front and then runs at native speed.

The cost is that re-instrumenting takes a walk through every relevant code object. We made the same tradeoff CPython did: re-instrumentation is rare (you call set_events once at program start), so the up-front cost is amortized.

Why two-phase tracing

The legacy sys.settrace model is one callback that receives every event. The PEP 669 model is one callback per event per tool. The bridge collapses the new model down to the old shape, but it cannot run the user callback on every opcode without paying the cost the new model was designed to avoid.

The compromise is the per-frame TraceLines / TraceOpcodes flags. The bridge subscribes to LINE and INSTRUCTION globally but only delivers the events to the user callback when the current frame's flag is set. The user callback can flip the flag on or off from within itself, mirroring CPython's f_trace_lines semantics. The result is that sys.settrace(f) works without re-instrumenting on every frame push.
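The flag flipping is visible from plain Python. This is standard f_trace_lines behavior, nothing gopy-specific: clearing the flag inside the 'call' event opts that one frame out of line events.

```python
import sys

seen = []

def tracer(frame, event, arg):
    if event == 'call' and frame.f_code.co_name == 'quiet':
        frame.f_trace_lines = False   # no line events for this frame
    if event == 'line':
        seen.append((frame.f_code.co_name, frame.f_lineno))
    return tracer

def quiet():
    a = 1
    b = 2
    return a + b

def noisy():
    return quiet()

sys.settrace(tracer)
noisy()
sys.settrace(None)

assert all(name != 'quiet' for name, _ in seen)   # quiet() produced no line events
assert any(name == 'noisy' for name, _ in seen)   # noisy() was still traced
```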

Where it lives

The map of this release in the repo:

  • specialize/. The PEP 659 specializer core. backoff.go, cache.go, quicken.go, core.go, deopt.go, plus one file per family (load_attr.go, binary_op.go, etc.).
  • monitor/. The PEP 669 monitoring surface. events.go, tools.go, state.go, interp.go, tables.go, install.go, line.go, local.go, fire.go, sysmonitoring.go, sentinel.go.
  • vm/adaptive.go, vm/dispatch.go. Specializer wiring in the eval loop.
  • vm/instrument_fire.go. The fire-event entry points.
  • vm/legacy_tracing.go. The sys.settrace / sys.setprofile bridge.
  • vm/sys_trace_builtins.go. The Python-visible builtins.
  • frame/frame.go. Per-frame trace flags.
  • state/state.go. Per-thread trace state.
  • objects/exact.go, objects/version.go, objects/type_specialize.go, objects/dict_specialize.go. The exact-type predicates and version counters the families consult.
  • compile/opcodes_gen.go. The regenerated opcode table carrying the specialized and instrumented variants.

Compatibility

A handful of user-visible behaviors changed.

  • sys.settrace granularity. The trace callback now fires at PEP 669 event boundaries rather than at the legacy line / call / return / exception positions. For most code this is identical. For code that depended on the exact tick timing of the old callback (a rare case, mostly debuggers), the events fire at the slightly different positions PEP 669 specifies.
  • sys.monitoring is available. The module did not exist in earlier gopy releases. Tools that want to use PEP 669 can. Tools that import it conditionally on Python 3.12+ now resolve the import.
  • Bytecode disassembly shows specialized variants. dis.dis on a warm function shows opcodes like LOAD_ATTR_INSTANCE_VALUE rather than LOAD_ATTR. The Python source is unchanged; only the JIT-warmed bytecode shape is visible. CPython behaves the same way on 3.12+.
  • INSTRUMENTED_* opcodes show in dis.dis when a tool is registered. Code rewritten by the instrument pass surfaces the rewritten ops. Removing the tool restores the original opcodes in the listing.

Gates

End-to-end gates lock the surface in. They live in vmtest/v011_gate_test.go.

  • TestGateSpecializerRewritesToBool. Drives a TO_BOOL through the adaptive path with a forced-zero counter, lands on the specialized variant, recognizes the deopt path, and still produces True for an int operand. The gate proves the specializer rewrites, dispatches, and deopts cleanly on a representative family.
  • TestGateMonitorPyReturnFires. Registers a debugger callback against EventPyReturn, instruments the code, runs the program, and checks the callback received the per-event arg trio. Proves the fire path and the per-tool dispatch agree.
  • TestGateLegacyTraceFires. Installs a Go-level legacy tracefunc, runs a small program, and confirms the bridge fires PyTrace_RETURN with the right value. Proves the legacy bridge translates events correctly.
  • TestGateSysMonitoringRoundTrip. Claims a tool slot through use_tool_id, registers a Python callback through register_callback, sets events, and checks the internal state reflects the request. Proves the Python-visible API and the internal state stay synchronized.

A note on the PEP 659 design

PEP 659 is worth reading if you have not. The short version: the authors observed that real Python code has narrow type behavior at hot opcodes. Most LOAD_ATTR sites in a long-running program see exactly one receiver type. Most BINARY_OP sites see exactly one operand pair shape. A general-purpose interpreter that dispatches on the receiver type for every call is paying a cost the workload does not actually impose.

The PEP's answer is to specialize lazily. The first call records the shape; the second and subsequent calls take the typed fast path. A miss costs the same as the unspecialized arm plus a counter increment. After enough misses, the opcode deopts and the warmup window restarts. This is structurally similar to what a tracing JIT does, without the JIT.

The reason this matters for gopy: we want to reach CPython performance on workloads that benefit from PEP 659. Skipping the specializer would have meant accepting a 20-40% performance gap on the kind of code that benefits most (idiomatic OO with stable shapes). Porting the specializer recovers that gap.

A note on the PEP 669 design

PEP 669 is the other PEP worth reading. The motivating problem: sys.settrace is a global single-callback API that fires on every opcode boundary. A coverage tool, a profiler, and a debugger cannot coexist; they each want different events at different frequencies; and the per-event cost is paid even when no tool needs the event.

PEP 669 fixes this with three changes. First, events are typed: LINE, CALL, RETURN, etc. are separate channels. Second, tools are independent: up to seven of them can register non-conflicting callbacks. Third, the instrumentation is per-code and per-event: code that nobody is watching pays nothing, and a tool that wants LINE does not pay the CALL cost.

The shape of the implementation falls naturally out of the design. The INSTRUMENTED_* opcodes are a per-event tax that each subscribing tool pays once. The Disable sentinel lets a tool unsubscribe per-location, which is how coverage tools record a line once and then stop firing on it. The per-code instrumentation slab is the storage for the rewritten bytecode and the saved originals.

We followed the design verbatim because it solves the right problems and because tools written against it on CPython 3.12+ should run unmodified on gopy.

What's next

v0.12 lays down the Tier-2 trace optimizer on top of what v0.11 ships. The specialized variants this release introduces are the input to that optimizer: a trace projection identifies the typed opcodes, folds the type checks across the trace, and runs the resulting straight-line uop stream through a dispatch loop. The trace projector cannot do its job without the specialized variants, so v0.11 has to land first.

Items that slip from v0.11 to v0.12:

  • The four CPython smoke fixtures left open ahead of v0.11 (comparison_eq and friends) surface a pre-existing comparison wiring bug. Triage moves to v0.12 along with the COMPARE_OP oparg fix.
  • The _io.File layered split (RawIOBase / BufferedReader / TextIOWrapper) was deferred from v0.10.1 and is still pending. It rides along with the v0.12 stdlib work.
  • The fast (super-instruction) compaction passes in compile/flowgraph (swaptimize, super-instructions, LOAD_FAST ref-stack, cold-block hoist) were deferred from v0.5 and remain open. These are pure compile-time optimizations and do not gate runtime correctness.

Acknowledgments

This release closes work tracked across the v0.11 spec series. The public-facing pointers:

  • PEP 659 (the specializer design). The CPython source we ported against: Python/specialize.c, Python/ceval.c, Include/internal/pycore_backoff.h.
  • PEP 669 (the monitoring design). The CPython source we ported against: Python/instrumentation.c, Python/legacy_tracing.c, Python/sysmodule.c.

The pull request that shipped this release covered every file under specialize/, monitor/, and the matching vm/ arms. With this in, the foundation for Tier-2 in v0.12 is ready.