v0.11.0 - Adaptive specializer and sys.monitoring
Released May 7, 2026.
If you have ever stared at the CPython source tree and wondered
why Python/specialize.c is six thousand lines long, this is the
release where that answer becomes ours to maintain. v0.11 brings
two large 3.12 / 3.13 era runtime features over the line at once:
the PEP 659 adaptive specializer that rewrites bytecode into
typed variants on first hit, and the PEP 669 sys.monitoring
surface that fans nineteen events out to as many as six
independent tools without paying a per-event cost when nothing is
listening.
These two pieces of work are not coincidentally landing together.
They share a substrate. Both rewrite quickened bytecode in place,
both rely on the per-code instruction stride to find their
caches, and both reach into the dispatch loop the same way. The
specializer rewrites LOAD_ATTR into LOAD_ATTR_INSTANCE_VALUE
when the type version stays stable; the monitor rewrites
LOAD_ATTR into INSTRUMENTED_LOAD_ATTR when a tool subscribes
to the right event. The arms coexist because they were designed
together upstream, and we ported them together so the eval loop
sees a single coherent surface.
Legacy sys.settrace and sys.setprofile ride on top of the
same fan-out through a faithful port of CPython's
Python/legacy_tracing.c. There is one dispatch path for tracing
in v0.11. The old "if profile != nil { ... } else if trace !=
nil { ... }" tangle is gone.
Highlights
Three big themes carry this release.
Adaptive specialization at every hot opcode
PEP 659 is the design that makes CPython 3.12 noticeably faster
than 3.11 on real Python code without bringing in a JIT. The
trick is small: rewrite the bytecode in place on the first call,
swapping a generic LOAD_ATTR for a typed variant like
LOAD_ATTR_INSTANCE_VALUE that assumes the receiver is a plain
instance with a fixed shape. Cache the type version. On the next
call, if the version still matches, the typed arm bypasses the
descriptor dance and reads the slot directly. On a miss, the
counter ticks down; after enough misses, the opcode "deopts" back
to the generic arm and starts the warmup window over.
v0.11 ports every PEP 659 family. The list is long and worth spelling out so you know what is and is not specialized:
- LOAD_ATTR module / class / instance value / slot / property / method
- STORE_ATTR
- LOAD_GLOBAL
- LOAD_SUPER_ATTR
- BINARY_OP add / subtract / multiply for int, float, unicode, list, dict
- COMPARE_OP
- CONTAINS_OP
- TO_BOOL
- STORE_SUBSCR
- UNPACK_SEQUENCE
- FOR_ITER list / tuple / range / generator
- SEND
- CALL / CALL_KW
Each family lands with its hit path, its miss path, and the
deopt path that walks the counter back. Each family carries the
upstream tp_version_tag and dk_version checks the fast
variants assume. The counter packing matches CPython exactly:
(value << 8) | (backoff << 4) | tag, the same shape
Include/internal/pycore_backoff.h defines.
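Under the stated layout, the packing and the hit / miss transitions can be sketched in Go. The names (pack, hit, miss) and the reseed policy shown here are illustrative, not gopy's actual API:

```go
package main

import "fmt"

// Counter packing sketch: tag in the low four bits, backoff exponent
// in the next four, value in the top eight.
type counter uint16

func pack(value, backoff, tag uint8) counter {
	return counter(uint16(value)<<8 | uint16(backoff&0xF)<<4 | uint16(tag&0xF))
}

func (c counter) value() uint8   { return uint8(c >> 8) }
func (c counter) backoff() uint8 { return uint8(c>>4) & 0xF }
func (c counter) tag() uint8     { return uint8(c) & 0xF }

// hit ticks the value down toward zero; at zero the opcode acts
// (specializes or deopts, depending on which counter this is).
func (c counter) hit() counter {
	if v := c.value(); v > 0 {
		return pack(v-1, c.backoff(), c.tag())
	}
	return c
}

// miss escalates the backoff exponent and reseeds the value to
// (1 << backoff) - 1 (truncated to the 8-bit field), widening the
// next warmup window.
func (c counter) miss() counter {
	b := c.backoff()
	if b < 15 {
		b++
	}
	return pack(uint8((1<<b)-1), b, c.tag())
}

func main() {
	c := pack(3, 1, 5)
	fmt.Printf("value=%d backoff=%d tag=%d\n", c.value(), c.backoff(), c.tag())
	fmt.Printf("after miss: backoff=%d value=%d\n", c.miss().backoff(), c.miss().value())
}
```

The exponential reseed is what keeps a persistently polymorphic site from thrashing between specialize and deopt: each failure roughly doubles the wait before the next attempt.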
def hot(obj):
    return obj.x + 1

class P:
    __slots__ = ('x',)
    def __init__(self):
        self.x = 41

p = P()
for _ in range(10_000):
    hot(p)
On the first call, LOAD_ATTR x is the generic arm. Inside
_PyCode_Quicken (our specialize/quicken.go), the opcode flips
to LOAD_ATTR_SLOT because P carries __slots__ and the
attribute resolves to a slot descriptor. The inline cache stashes
the type version. The next ten thousand calls take the slot read
path, which is roughly four instructions: check the type version,
load the slot offset, read the slot. The __getattribute__
descriptor walk does not run.
If you swap p for an instance of a different class on call
number 9,001, the version check fails. The opcode increments its
deopt counter. When the counter saturates, the opcode falls back
to generic LOAD_ATTR and starts another warmup window. None of
this is visible to your Python code. It just runs faster.
sys.monitoring with six independent tools
PEP 669 replaces sys.settrace and sys.setprofile for new code.
The model is simple but careful: there are nineteen events
(PY_START, PY_RESUME, PY_RETURN, PY_YIELD, CALL,
C_CALL, LINE, INSTRUCTION, JUMP, BRANCH, RAISE,
EXCEPTION_HANDLED, PY_UNWIND, PY_THROW, STOP_ITERATION,
RERAISE, C_RETURN, C_RAISE, BRANCH_RIGHT / BRANCH_LEFT).
There are eight tool slots, addressed by ID (the top two are
reserved for the legacy bridges):
0 debugger
1 coverage
2 profiler
3 branch coverage
4 reserved
5 optimizer
6 sys.profile (legacy bridge)
7 sys.trace (legacy bridge)
Each tool registers a callback per event it cares about. The
interpreter walks the active-tool bitmask on every fire site and
calls each tool in turn. If a tool returns the DISABLE
sentinel, the per-code instrumentation for that event clears, and
that line never fires again for that tool. This is what makes
coverage tools cheap: once a line has been recorded, it stops
emitting.
import sys.monitoring as mon

DEBUGGER = mon.DEBUGGER_ID
mon.use_tool_id(DEBUGGER, 'my-debugger')

def on_return(code, off, retval):
    print(f'{code.co_name} returned {retval!r}')

mon.register_callback(DEBUGGER, mon.events.PY_RETURN, on_return)
mon.set_events(DEBUGGER, mon.events.PY_RETURN)

def f():
    return 42

f()  # prints: f returned 42
The shadow walk in monitor/install.go rewrites quickened
bytecode in place. Every opcode that can fire an event gets an
INSTRUMENTED_* mirror (INSTRUMENTED_CALL,
INSTRUMENTED_RETURN_VALUE, etc.). Deinstrument restores the
original byte. The per-event and per-tool masks are recomputed
lazily so re-registering a callback stays cheap.
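The shadow-walk mechanics reduce to a save-and-swap over the bytecode. A minimal sketch, with invented opcode values and names:

```go
package main

import "fmt"

// Invented opcode values for illustration.
const (
	opLoadAttr             = 0x10
	opInstrumentedLoadAttr = 0x80
)

// instrumentedMirror maps each instrumentable opcode to its
// INSTRUMENTED_* twin.
var instrumentedMirror = map[byte]byte{
	opLoadAttr: opInstrumentedLoadAttr,
}

type codeObject struct {
	bytecode []byte
	saved    map[int]byte // offset -> original opcode, lazily allocated
}

// instrument rewrites in place, saving each original byte so the
// rewrite is reversible.
func (c *codeObject) instrument() {
	if c.saved == nil {
		c.saved = make(map[int]byte)
	}
	for i, op := range c.bytecode {
		if mirror, ok := instrumentedMirror[op]; ok {
			c.saved[i] = op
			c.bytecode[i] = mirror
		}
	}
}

// deinstrument restores the saved originals.
func (c *codeObject) deinstrument() {
	for i, op := range c.saved {
		c.bytecode[i] = op
		delete(c.saved, i)
	}
}

func main() {
	code := &codeObject{bytecode: []byte{opLoadAttr, 0x01}}
	code.instrument()
	fmt.Printf("instrumented: %#v\n", code.bytecode)
	code.deinstrument()
	fmt.Printf("restored: %#v\n", code.bytecode)
}
```

Uninstrumented code never allocates the saved map, which mirrors the "lazy slab" property: code nobody watches carries no monitoring state at all.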
One dispatcher for old and new tracing
sys.settrace and sys.setprofile predate PEP 669 by twenty
years, and CPython keeps them working by routing the old shape
through the new fan-out. The bridge registers sys.profile as
tool 6 and sys.trace as tool 7, listens for the events those
two APIs care about, and translates between the two callback
conventions. Forward jumps return the DISABLE sentinel so the
line handler takes over via INSTRUMENTED_LINE.
import sys

def tracer(frame, event, arg):
    if event == 'line':
        print(f'  line {frame.f_lineno}')
    return tracer

sys.settrace(tracer)

def go():
    x = 1
    y = 2
    return x + y

go()
Under v0.11, sys.settrace(tracer) registers tool 7, subscribes
to PY_START, LINE, PY_RETURN, and the exception family, and
hands every event back to the user-supplied Python callable
through a Go trampoline shaped like CPython's Py_tracefunc.
Per-frame TraceLines / TraceOpcodes / Lineno slots mirror
the f_trace_lines / f_trace_opcodes fields on
PyFrameObject, so the bridge can drive line and opcode events
without paying for them when the user has not asked for them.
What's new
The full breakdown, grouped by where it landed.
specialize/ core
The specializer lives in its own package. The shape mirrors the upstream split between counter / cache layouts, the quicken pass, and the per-family bodies.
- specialize/backoff.go. The counter packing. Each counter is one
  16-bit word with the layout (value << 8) | (backoff << 4) | tag.
  The tag lives in the bottom four bits; the backoff exponent
  occupies the next four; the value is the remaining eight. The hit
  path decrements value; the miss path increments backoff. The
  shape mirrors Include/internal/pycore_backoff.h exactly because
  we want identical warmup behavior to CPython under the same
  workload.
- specialize/cache.go. The inline-cache layout helpers. For each
  opcode, the cache count comes from _PyOpcode_Caches. The helpers
  walk past the prefix codeunit to find the per-opcode cache slots,
  which is where the type version, dict keys version, function
  pointer, or attribute index lives.
- specialize/quicken.go. A 1:1 port of _PyCode_Quicken from
  Python/specialize.c. Walks the bytecode at code object install
  time. Every eligible opcode gets rewritten to its adaptive
  variant (the _ADAPTIVE suffix in CPython, dropped from the table
  for terseness). The counters are seeded to the upstream initial
  values so the first call enters the warmup window rather than
  firing instantly.
- specialize/core.go and specialize/deopt.go. The _PyOpcode_Caches
  and _PyOpcode_Deopt tables plus the set_opcode, specialize, and
  unspecialize core helpers from Python/specialize.c. specialize
  flips the opcode to the typed variant on the hit path;
  unspecialize reverts to the adaptive prefix when the deopt
  counter saturates.
- The per-family ports. Every PEP 659 family in Python/specialize.c
  has a counterpart here: LOAD_ATTR arms for module attribute,
  class attribute, instance value, slot descriptor, property
  descriptor, and method (the bound-method fast path). STORE_ATTR
  and its symmetric arms. LOAD_GLOBAL for module dict vs builtins
  dict, with the keys-version check the fast variants assume.
  LOAD_SUPER_ATTR for the two-argument and zero-argument super
  cases. BINARY_OP for add, subtract, multiply across int, float,
  unicode, list, and dict operand pairs. COMPARE_OP for the three
  exact operand pairs (int / int, float / float, str / str) plus
  the generic arm. CONTAINS_OP for dict, set, list, tuple. TO_BOOL
  across int, list, str, dict, set, none, bool. STORE_SUBSCR for
  list-int and dict-str. UNPACK_SEQUENCE for list and tuple of
  fixed length. FOR_ITER arms for list, tuple, range, and
  generator. SEND for the generator fast path. CALL and CALL_KW
  arms for Python function, C function, method-bound function, type
  call, builtin call, and list-append call (the hottest method call
  in the language).
Each family carries its hit path, its miss path, and its deopt path. Each family stamps the right version into the inline cache so the typed arm can re-verify on the next call.
Exact-type predicates
The PEP 659 fast paths key off "is this the exact type" rather
than "is this an instance of this type". int subclasses do not
take the int fast path. dict subclasses do not take the dict
fast path. The exact-type predicates live in
objects/exact.go. Every fast path family consults them before
flipping to the typed variant.
The corresponding version counters live in
objects/version.go, objects/type_specialize.go, and
objects/dict_specialize.go. The type version increments when a
class mutates (a new method added, a base swapped). The dict keys
version increments when the keys shape changes (a key added or
removed). The fast-path arms read these versions out of the
inline cache, compare against the live values, and deopt on
mismatch. This is the mechanism that makes the warm path so
cheap: a couple of integer comparisons replace the full attribute
lookup.
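The keys-version discipline can be sketched in a few lines. The dict shape below is an illustrative stand-in for the real object, but the bump-on-shape-change rule is the one described above:

```go
package main

import "fmt"

// dictObject sketch: keysVersion tracks the keys *shape*, not the
// values. Overwriting an existing key's value leaves it untouched.
type dictObject struct {
	keysVersion uint32
	entries     map[string]int
}

func (d *dictObject) set(key string, v int) {
	if _, present := d.entries[key]; !present {
		d.keysVersion++ // new key: the keys shape changed
	}
	d.entries[key] = v
}

func (d *dictObject) del(key string) {
	if _, present := d.entries[key]; present {
		d.keysVersion++ // key removed: the keys shape changed
		delete(d.entries, key)
	}
}

func main() {
	d := &dictObject{entries: map[string]int{}}
	d.set("x", 1) // bump: key added
	d.set("x", 2) // no bump: value overwrite, same shape
	d.del("x")    // bump: key removed
	fmt.Println(d.keysVersion)
}
```

This is why a warm LOAD_GLOBAL site survives ordinary mutation of globals' values: only adding or removing a name invalidates the cached version.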
vm/adaptive.go and vm/dispatch.go
These wire the specializer into the eval loop. Every quickened opcode peels into its typed variant on a hit and falls back to the generic arm on a miss with the deopt counter primed. The dispatch loop sees one opcode per codeunit just like CPython; the typed variants are real opcodes with their own table rows, not a runtime branch on opcode + tag.
This decision matters for two reasons. First, the dispatcher
stays a flat switch. Second, the typed variants are independently
analyzable by the Tier-2 optimizer (which ships in v0.12); a
trace projection can identify a LOAD_ATTR_INSTANCE_VALUE row
and fold the type check, where it could not do the same if the
type tag were a runtime branch.
vm/instrument_fire.go
The nineteen fire-event entry points
(_Py_call_instrumentation family from
Python/instrumentation.c) live here. Each entry point dispatches
to the per-tool callbacks registered against
monitor.InterpState. The fan-out walks the active-tool bitmask,
calls each callback in tool-id order (debugger first, then
coverage, then profiler, etc.), and honors the Disable sentinel
by clearing the per-code instrumentation for that event.
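The fan-out reduces to a bitmask walk that clears a tool's bit when its callback asks to disable. A sketch, with illustrative types rather than gopy's real ones:

```go
package main

import "fmt"

const numTools = 8

type result int

const (
	keep result = iota
	disable
)

type callback func(codeName string, offset int) result

// fireSite stands in for one instrumented code location.
type fireSite struct {
	activeMask uint8 // bit i set => tool i wants this event here
	callbacks  [numTools]callback
}

// fire walks the active-tool bitmask in tool-id order and honors the
// disable sentinel by clearing that tool's bit for this location.
func (s *fireSite) fire(codeName string, offset int) {
	for id := 0; id < numTools; id++ {
		if s.activeMask&(1<<id) == 0 {
			continue
		}
		if s.callbacks[id](codeName, offset) == disable {
			s.activeMask &^= 1 << id // this location stops firing for this tool
		}
	}
}

func main() {
	site := &fireSite{activeMask: 1 << 1} // tool 1: a coverage-style tool
	hits := 0
	site.callbacks[1] = func(string, int) result {
		hits++
		return disable // record once, then stop firing here
	}
	site.fire("f", 0)
	site.fire("f", 0) // mask bit already cleared: callback never runs
	fmt.Println(hits)
}
```

The second fire costs one masked load, which is the whole point of the DISABLE sentinel: a coverage tool pays for each location once.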
monitor/ package
The monitor package is the gopy side of sys.monitoring.
- monitor/events.go and monitor/tools.go. The PEP 669 event IDs and
  tool IDs. The constants match CPython's exact numeric values so
  cross-implementation tooling that hard-codes these IDs continues
  to work.
- monitor/state.go. The per-code instrumentation slab. Each code
  object carries an optional MonitoringData block with per-event
  masks, per-event callback IDs, and the saved original bytecode
  bytes the instrument pass overwrote. The slab is lazy:
  uninstrumented code carries a nil pointer.
- monitor/interp.go. InterpState carries the per-tool callback
  table, the per-tool event mask, the active-tool registry, the
  global monitoring version, and the shared sentinels (Disable,
  Missing). The global monitoring version bumps on every
  set_events / register_callback call; per-code instrumentation
  re-checks the version and reinstruments on mismatch.
- monitor/tables.go and monitor/install.go. The shadow walk from
  _Py_Instrument. Quickened bytecode is rewritten in place to
  INSTRUMENTED_* variants. Deinstrument restores the original byte.
  The per-event and per-tool masks are recomputed lazily through
  setup_*_callbacks so re-registration stays cheap.
- monitor/line.go. Line instrumentation. The PEP 626 line table
  encodes start-of-line positions. Line tracking walks the table to
  identify line transitions; INSTRUMENTED_LINE fires the LINE event
  for each transition through the dispatch fan-out.
- monitor/local.go. The eleven local events that gate per-frame
  line and opcode tracing without touching the global mask. Local
  events let a debugger say "trace lines in this frame but not its
  callees" without rewriting the callees' bytecode.
- monitor/fire.go. The shared callback runner. Walks the
  active-tool bitmask, calls each callback in tool-id order, honors
  the Disable sentinel by clearing the per-code instrumentation,
  and re-raises Python exceptions back into the interpreter. A
  callback that raises propagates the exception; the surrounding
  opcode handles it the same way a manual raise would.
- monitor/sysmonitoring.go. The Python-visible sys.monitoring
  module: use_tool_id, free_tool_id, get_tool, register_callback,
  set_events, get_events, set_local_events, get_local_events,
  restart_events, plus the events.* and tool ID constants and the
  DISABLE / MISSING singletons.
- monitor/sentinel.go. The Disable and Missing Python objects.
  Disable is what a callback returns to say "do not fire this event
  for this code location again"; Missing is the argument the
  interpreter passes when an event fires without a natural value (a
  LINE event has no retval, for example).
Legacy tracing bridge
The bridge in vm/legacy_tracing.go is a 1:1 port of
Python/legacy_tracing.c. The job is to take a Python-shaped
trace function (the one you pass to sys.settrace) and present
it to the monitor layer as a PEP 669 tool.
- vm/legacy_tracing.go. The bridge registers sys.profile and
  sys.trace as tools 6 and 7. It fans the matching events through
  Go-level LegacyTraceFunc callbacks shaped like CPython's
  Py_tracefunc, and translates between the two callback
  conventions. Forward jumps return the Disable sentinel so the
  line handler takes over via INSTRUMENTED_LINE.
- vm/sys_trace_builtins.go. The Python-visible sys.settrace,
  sys.setprofile, sys.gettrace, sys.getprofile. Threads the
  user-supplied Python callable through a Go trampoline shaped like
  LegacyTraceFunc, then defers to SetTrace / SetProfile to install
  the bridge. The shape mirrors Python/sysmodule.c's sys_settrace /
  sys_setprofile.
- frame/frame.go. Per-frame TraceLines, TraceOpcodes, and Lineno
  slots so the bridge can drive line and opcode events without
  paying for them when the user has not asked. These mirror the
  f_trace_lines / f_trace_opcodes flags on PyFrameObject.
- state/state.go and monitor/interp.go. Per-thread tracing slots on
  the VM-side thread state plus SysProfileOnce and SysTraceOnce
  flags so the one-time setup_*_callbacks install runs at most once
  per interpreter. Re-calling sys.settrace(f) swaps the callback
  without paying the install cost again.
compile/opcodes_gen.go
The opcode table has been regenerated. The specialized variants
(LOAD_ATTR_INSTANCE_VALUE, BINARY_OP_ADD_INT, etc.) and the
INSTRUMENTED_* mirror set are all there now. The cache-count
column lines up with _PyOpcode_Caches so the dispatch loop can
stride over the inline cache slots without per-opcode special
cases.
Regenerating the table is one command. Adding a new specialized opcode means editing the upstream table, regenerating, and adding the per-opcode body. The generator does not invent variants; it mirrors what upstream declares.
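The stride that the cache-count column enables can be sketched as a single loop; the opcode numbers and cache counts below are invented for illustration:

```go
package main

import "fmt"

// codeUnit sketch: opcode in the high byte, oparg in the low byte.
type codeUnit uint16

// Invented opcodes for illustration.
const (
	opLoadConst = 1 // no inline cache
	opLoadAttr  = 2 // followed by 4 cache codeunits, say
	opReturnVal = 3 // no inline cache
)

// cacheCount plays the role of the _PyOpcode_Caches column: how many
// inline-cache codeunits follow each opcode.
var cacheCount = map[byte]int{
	opLoadConst: 0,
	opLoadAttr:  4,
	opReturnVal: 0,
}

// walk visits every real instruction, striding over the cache slots
// without per-opcode special cases.
func walk(code []codeUnit, visit func(offset int, opcode byte)) {
	for i := 0; i < len(code); {
		op := byte(code[i] >> 8)
		visit(i, op)
		i += 1 + cacheCount[op]
	}
}

func main() {
	code := []codeUnit{
		opLoadConst << 8,
		opLoadAttr << 8, 0, 0, 0, 0, // opcode plus 4 cache slots
		opReturnVal << 8,
	}
	walk(code, func(off int, op byte) {
		fmt.Printf("offset %d: opcode %d\n", off, op)
	})
}
```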
Why we built it this way
A few decisions deserve calling out.
Why typed variants instead of a runtime tag
PEP 659 could in principle have been a single LOAD_ATTR opcode
that branches at runtime on a tag stored in the cache. CPython
chose to mint a separate opcode per typed arm
(LOAD_ATTR_INSTANCE_VALUE, LOAD_ATTR_SLOT,
LOAD_ATTR_MODULE, etc.) and we mirrored that choice. The
benefit is that the eval loop's switch goes through one branch
per opcode rather than two. The dispatch logic for the typed arm
is just a different case in the same switch. There is no extra
indirection.
The other benefit is downstream. A Tier-2 trace optimizer (which v0.12 ships) reads a stream of typed opcodes and folds the type checks. If the type were a runtime tag, the optimizer would have to reason about the tag's value, which is harder. The flat typed-variant model makes the trace projector almost trivial.
Why a per-code instrumentation slab
The monitor could maintain per-event masks at the interpreter
level and check them on every opcode. CPython chose instead to
rewrite the bytecode in place: code that is being monitored
carries INSTRUMENTED_* opcodes that fire the event inline;
code that is not being monitored carries the original opcode and
pays nothing. We followed.
The win is that monitoring is free when no tool is listening.
This matters because sys.monitoring is intended to replace
sys.settrace, and sys.settrace historically carried a real
runtime cost even when the trace function was None. PEP 669
fixes that. The shadow walk pays the install cost up front and
then runs at native speed.
The cost is that re-instrumenting takes a walk through every
relevant code object. We made the same tradeoff CPython did:
re-instrumentation is rare (you call set_events once at
program start), so the up-front cost is amortized.
Why two-phase tracing
The legacy sys.settrace model is one callback that receives
every event. The PEP 669 model is one callback per event per
tool. The bridge collapses the new model down to the old shape,
but it cannot run the user callback on every opcode without
paying the cost the new model was designed to avoid.
The compromise is the per-frame TraceLines / TraceOpcodes
flags. The bridge subscribes to LINE and INSTRUCTION globally
but only delivers the events to the user callback when the
current frame's flag is set. The user callback can flip the flag
on or off from within itself, mirroring CPython's
f_trace_lines semantics. The result is that
sys.settrace(f) works without re-instrumenting on every frame
push.
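The gating compromise can be sketched as a globally installed handler with a per-frame flag check; the types here are illustrative, not the bridge's real ones:

```go
package main

import "fmt"

// frame carries the per-frame gate, mirroring f_trace_lines.
type frame struct {
	name       string
	traceLines bool
}

type traceFunc func(f *frame, event string, lineno int)

// bridge holds the user-supplied trace function.
type bridge struct {
	user traceFunc
}

// onLine is the bridge's LINE handler: installed globally, gated
// per frame, so un-traced frames pay only this flag check.
func (b *bridge) onLine(f *frame, lineno int) {
	if !f.traceLines {
		return
	}
	b.user(f, "line", lineno)
}

func main() {
	delivered := 0
	br := &bridge{user: func(f *frame, event string, lineno int) {
		delivered++
		fmt.Printf("%s %s:%d\n", event, f.name, lineno)
	}}

	traced := &frame{name: "go", traceLines: true}
	quiet := &frame{name: "helper", traceLines: false}

	br.onLine(traced, 2)
	br.onLine(quiet, 7) // gated off: the user callback never runs
	fmt.Println(delivered)
}
```

Because the gate lives on the frame, the user callback can flip it mid-trace, which is exactly the f_trace_lines behavior the bridge mirrors.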
Where it lives
The map of this release in the repo:
- specialize/. The PEP 659 specializer core. backoff.go, cache.go,
  quicken.go, core.go, deopt.go, plus one file per family
  (load_attr.go, binary_op.go, etc.).
- monitor/. The PEP 669 monitoring surface. events.go, tools.go,
  state.go, interp.go, tables.go, install.go, line.go, local.go,
  fire.go, sysmonitoring.go, sentinel.go.
- vm/adaptive.go, vm/dispatch.go. Specializer wiring in the eval
  loop.
- vm/instrument_fire.go. The fire-event entry points.
- vm/legacy_tracing.go. The sys.settrace / sys.setprofile bridge.
- vm/sys_trace_builtins.go. The Python-visible builtins.
- frame/frame.go. Per-frame trace flags.
- state/state.go. Per-thread trace state.
- objects/exact.go, objects/version.go, objects/type_specialize.go,
  objects/dict_specialize.go. The exact-type predicates and version
  counters the families consult.
- compile/opcodes_gen.go. The regenerated opcode table carrying the
  specialized and instrumented variants.
Compatibility
A handful of user-visible behaviors changed.
- sys.settrace granularity. The trace callback now fires at PEP 669
  event boundaries rather than at the legacy line / call / return /
  exception positions. For most code this is identical. For code
  that depended on the exact tick timing of the old callback (a
  rare case, mostly debuggers), the events fire at the slightly
  different positions PEP 669 specifies.
- sys.monitoring is available. The module did not exist in earlier
  gopy releases. Tools that want to use PEP 669 can. Tools that
  import it conditionally on Python 3.12+ now resolve the import.
- Bytecode disassembly shows specialized variants. dis.dis on a
  warm function shows opcodes like LOAD_ATTR_INSTANCE_VALUE rather
  than LOAD_ATTR. The Python source is unchanged; only the warmed
  bytecode shape is visible. CPython behaves the same way on 3.12+.
- INSTRUMENTED_* opcodes show in dis.dis when a tool is registered.
  Code rewritten by the instrument pass surfaces the rewritten ops.
  Removing the tool restores the original opcodes in the listing.
Gates
End-to-end gates lock the surface in. They live in
vmtest/v011_gate_test.go.
- TestGateSpecializerRewritesToBool. Drives a TO_BOOL through the
  adaptive path with a forced-zero counter, lands on the
  specialized variant, recognizes the deopt path, and still
  produces True for an int operand. The gate proves the specializer
  rewrites, dispatches, and deopts cleanly on a representative
  family.
- TestGateMonitorPyReturnFires. Registers a debugger callback
  against EventPyReturn, instruments the code, runs the program,
  and checks the callback received the per-event arg trio. Proves
  the fire path and the per-tool dispatch agree.
- TestGateLegacyTraceFires. Installs a Go-level legacy tracefunc,
  runs a small program, and confirms the bridge fires
  PyTrace_RETURN with the right value. Proves the legacy bridge
  translates events correctly.
- TestGateSysMonitoringRoundTrip. Claims a tool slot through
  use_tool_id, registers a Python callback through
  register_callback, sets events, and checks the internal state
  reflects the request. Proves the Python-visible API and the
  internal state stay synchronized.
A note on the PEP 659 design
PEP 659 is worth reading if you have not. The short version: the
authors observed that real Python code has narrow type behavior
at hot opcodes. Most LOAD_ATTR sites in a long-running program
see exactly one receiver type. Most BINARY_OP sites see exactly
one operand pair shape. A general-purpose interpreter that
dispatches on the receiver type for every call is paying a cost
the workload does not actually impose.
The PEP's answer is to specialize lazily. The first call records the shape; the second and subsequent calls take the typed fast path. A miss costs the same as the unspecialized arm plus a counter increment. After enough misses, the opcode deopts and the warmup window restarts. This is structurally similar to what a tracing JIT does, without the JIT.
The reason this matters for gopy: we want to reach CPython performance on workloads that benefit from PEP 659. Skipping the specializer would have meant accepting a 20-40% performance gap on the kind of code that benefits most (idiomatic OO with stable shapes). Porting the specializer recovers that gap.
A note on the PEP 669 design
PEP 669 is the other PEP worth reading. The motivating problem:
sys.settrace is a global single-callback API that fires on
every opcode boundary. A coverage tool, a profiler, and a
debugger cannot coexist; they each want different events at
different frequencies; and the per-event cost is paid even when
no tool needs the event.
PEP 669 fixes this with three changes. First, events are typed:
LINE, CALL, RETURN, etc. are separate channels. Second,
tools are independent: up to six of them can register
non-conflicting callbacks. Third, the instrumentation is per-code
and per-event: code that nobody is watching pays nothing, and a
tool that wants LINE does not pay the CALL cost.
The shape of the implementation falls naturally out of the
design. The INSTRUMENTED_* opcodes are a per-event tax that
each subscribing tool pays once. The Disable sentinel lets a
tool unsubscribe per-location, which is how coverage tools record
a line once and then stop firing on it. The per-code
instrumentation slab is the storage for the rewritten bytecode
and the saved originals.
We followed the design verbatim because it solves the right problems and because tools written against it on CPython 3.12+ should run unmodified on gopy.
What's next
v0.12 lays down the Tier-2 trace optimizer on top of what v0.11 ships. The specialized variants this release introduces are the input to that optimizer: a trace projection identifies the typed opcodes, folds the type checks across the trace, and runs the resulting straight-line uop stream through a dispatch loop. The trace projector cannot do its job without the specialized variants, so v0.11 has to land first.
Items that slip from v0.11 to v0.12:
- The four CPython smoke fixtures left ahead of v0.11
  (comparison_eq and friends) surface a pre-existing comparison
  wiring bug. Triage moves to v0.12 along with the COMPARE_OP oparg
  fix.
- The _io.File layered split (RawIOBase / BufferedReader /
  TextIOWrapper) was deferred from v0.10.1 and is still pending. It
  rides along with the v0.12 stdlib work.
- The fast (super-instruction) compaction passes in
  compile/flowgraph (swaptimize, super-instructions, LOAD_FAST
  ref-stack, cold-block hoist) were deferred from v0.5 and remain
  open. These are pure compile-time optimizations and do not gate
  runtime correctness.
Acknowledgments
This release closes work tracked across the v0.11 spec series. The public-facing pointers:
- PEP 659 (the specializer design). The CPython source we ported
  against: Python/specialize.c, Python/ceval.c,
  Include/internal/pycore_backoff.h.
- PEP 669 (the monitoring design). The CPython source we ported
  against: Python/instrumentation.c, Python/legacy_tracing.c,
  Python/sysmodule.c.
The pull request that shipped this release covered every file
under specialize/, monitor/, and the matching vm/ arms.
With this in, the foundation for Tier-2 in v0.12 is ready.