v0.8.0 - Import drop
Released Pre-release.
A working import statement is a deceptive amount of machinery.
The Python source you type is a single keyword. The runtime
behavior underneath that keyword spans three executable layers,
two file formats, a registry, a search algorithm, a compiler, a
serializer, a deserializer, and a codec table. The fact that
import json "just works" in CPython is the result of decades
of careful work on the import system.
For most of gopy's life up through v0.7, we didn't have any of
this. We had the lifecycle from v0.7 (which let us boot), and we
had a builtin namespace (which let print and len work). What
we did not have was a way to say import math. Every test that
needed a module had to bring that module along as a hand-stamped
Go value injected into the eval loop. The runner could execute
single-file Python with builtins; it could not execute Python
that imported anything.
v0.8.0 is the import drop. After this release the runtime
has a working import system: marshal reads and writes .pyc
files, the codecs layer handles utf-8, ascii, and latin-1, and
the imp package implements the full sys.modules to frozen to
built-in lookup chain. IMPORT_NAME and IMPORT_FROM bytecodes
in the eval loop use the new machinery end to end.
Highlights
Three pieces of work define this release.
Real .pyc round-trip
marshal is the serialization format CPython uses for bytecode
on disk. A .pyc file is a 16-byte header (magic, flags,
timestamp or hash payload) followed by a marshaled code object.
A marshaled code object recursively marshals its constants, its
name, its filename, its line table, its exception table, its
free and cell variable names, and the bytecode itself. Marshal
has to handle every type that can appear in co_consts: ints
(small and arbitrary-precision), floats, complex, strings (with
interning), bytes, tuples, frozensets, code objects.
We landed all of it.
// Write a code object to disk as a .pyc file.
err := marshal.WritePyc(file, code, mtime, sourceSize)
// Read it back. The header is validated against the gopy magic.
code, err := marshal.ReadPyc(file)
The marshal layer handles TYPE_LONG (arbitrary-precision ints
in sign-extended 15-bit digits), FLAG_REF (object
back-references for shared subobjects), interned strings,
TYPE_CODE (the 25-field 3.11+ code object wire format),
TYPE_SET, TYPE_FROZENSET, TYPE_DICT, and
TYPE_BINARY_COMPLEX. The PEP 552 header supports both the
timestamp variant (mtime + source size) and the hash variant
(SipHash of the source).
Three codecs that unblock everything else
We weren't ready to ship the full codec registry in this drop (that's roughly forty codecs with their own table dependencies). We were ready to ship the three you actually need to bootstrap.
"hello".encode('utf-8') # works
"hello".encode('ascii') # works
"hello".encode('latin-1') # works
b"\xc3\xa9".decode('utf-8') # works, gives 'é'
utf-8 covers source files because PEP 263 makes utf-8 the
default. ascii covers most text protocols and the conservative
fallback path. latin-1 covers byte-identity round-trip (every
byte 0..255 maps to a single codepoint, every codepoint
0..255 maps back to the same byte). With those three, you can
read Python source, decode HTTP headers, and write the
.pyc interned strings. Every other codec we add later layers
on top of the same registry.
The aliases (utf_8, utf8, u8, ascii, latin_1,
iso_8859_1, l1, 8859) match CPython's normalization rules
so 'utf-8', 'UTF-8', 'utf_8', and 'U8' all resolve to
the same codec.
The full lookup chain
When you write import foo, the runtime walks an ordered chain:
- Is
'foo'already insys.modules? If yes, return it. - Is
'foo'a frozen module (importlib bootstrap or a user-registered frozen)? If yes, exec its code object into a fresh module. - Is
'foo'inPyImport_Inittab(a built-in extension module)? If yes, call its init function and register the result. - Walk
sys.paththrough the path finder. (This part is a shim through v0.8; the realFileFinderlands in v0.12.1.) - Raise
ModuleNotFoundError.
Step 1 is imp/sysmodules.go. Step 2 is imp/frozen.go. Step 3
is imp/inittab.go. The full chain is imp/import.go ImportModule. Relative imports (from . import foo,
from ..bar import baz) are resolved up front into absolute
names via resolveAbsName before the chain runs, matching
CPython's _resolve_name.
What's new
The full feature breakdown.
marshal/
The .pyc serializer and deserializer.
marshal.goTYPE_LONG. Added the sign-extended 15-bit digit decoder. CPython stores arbitrary-precision integers as a sign byte (in the length field) plus a sequence of 15-bit digits, little-endian. PortsPython/marshal.c r_long.marshal.goFLAG_REF and TYPE_INTERNED. Object back-references. When the same object appears multiple times in a marshaled stream (a frequent interned-string case), the first occurrence carries aFLAG_REFbit that pushes it onto a reference table; subsequent occurrences areTYPE_REFwith an index into that table. PortsPython/marshal.c r_ref_reserveandr_ref_insert.marshal.goTYPE_CODE. The 3.11+ wire format for code objects. 25 fields: argcount, posonlyargcount, kwonlyargcount, stacksize, flags, code, consts, names, locals, freevars, cellvars, filename, name, qualname, firstlineno, linetable, exceptiontable, plus the smaller fields. PortsPython/marshal.c:L1045 r_object TYPE_CODE. We ported it as a single function with a long body because that's what CPython does; splitting it into per-field helpers obscured the wire layout.marshal.goTYPE_SET / TYPE_FROZENSET / TYPE_DICT / TYPE_BINARY_COMPLEX. The container and complex types.fromObjecttranslatesobjects.Objectitems back to plain Go values for the writer path so the writer can recurse through native types.pyc.goWritePyc / WritePycHash / ReadPyc. The 16-byte PEP 552 header: 4-byte magic (3620 | 0x0D0A<<16), 4-byte flags, 8-byte payload that's either{mtime, source_size}for the timestamp variant orsource_hashfor the hash variant. PortsPython/import.c read_pyc_header. The magic number was picked to be distinct from CPython's 3.14 magic so our.pycfiles don't get loaded by CPython and vice versa.
codecs/
The codec layer. Three codecs, one registry.
registry.go.Register,Lookup,Search, and theCodecInfostruct. PortsModules/_codecsmodule.candLib/codecs.py lookup. The registry is keyed by normalized name (lowercase, hyphens to underscores), which matches the CPython rule and is the reason'UTF-8','utf-8','UTF_8', and'utf_8'all hit the same entry.builtin.go. utf-8, ascii, and latin-1 codecs withEncodeandDecodeentry points. Alias normalization (utf_8,utf8,u8,ascii,latin_1,iso_8859_1,l1,8859). PortsModules/_codecsmodule.c _Py_codec_lookup.
We picked these three because they cover the bootstrap. utf-8 decodes Python source files. ascii is the conservative default for many text protocols. latin-1 is the byte-identity codec that the marshal layer leans on for its interned-string table. The remaining codecs (cp1252, cp437, the Asian encodings, the JIS family) all import through the same registry and land as their own ports later.
imp/
The import package. This is where the actual import system lives.
frozen.gofromPython/import.c.FrozenModuletable,RegisterFrozen,FindFrozen,IsFrozen,FrozenList. Frozen modules are code objects that get linked into the binary so we can import them before the filesystem path search is available. The importlib bootstrap modules (_frozen_importlib,_frozen_importlib_external) are the canonical example, but user code can register frozens too.frozen_bootstrap.go. Registers the five importlib bootstrap names as nil-Code placeholders soIsFrozenanswers correctly before the real code objects are embedded. The placeholders returnnilfromFindFrozen(so the import chain falls through), butIsFrozenreturnstrue. This lets us answersys._has_frozen_modules('foo')honestly while the actual frozen code embed sits behind v0.10's spec cut.sysmodules.gofromPython/import.c.sys.modulesaccess layer:GetModule,AddModule,RemoveModule,SysModulesSnapshot. The access functions go through this layer rather than touching the dict directly so the__del__hooks on user-replacementsys.modulesobjects fire correctly.inittab.gofromPython/import.c PyImport_Inittab.AppendInittab,ExtendInittab,FindInitFunc,InittabSnapshot. The inittab is the registry of built-in extension modules. We use it for modules we ship in Go (rather than as Python source). Embedders can extend it before init runs.bootstrap.gofromPython/import.c init_importlib. Two-phase sequence:InitImportlibexecutes_frozen_importlib,InitImportlibExternalexecutes_frozen_importlib_external. TheExecutorinterface decouples the bootstrap from the eval loop soimpdoesn't have to importvm. We need that decoupling becausevmimportsimpfor the IMPORT_NAME bytecode; closing the cycle would make the package graph unbuildable.exec.go ExecCodeModule. Creates a module, registers it in sys.modules, executes the code object in its namespace, removes it on failure. PortsPython/import.c:L632 exec_code_in_module. The remove-on-failure matters: a half-imported module visible in sys.modules is a trap for the nextimportof the same name, which would return the half-import rather than retry.loader.go LoadPyc. Reads a.pycfile and executes it viaExecCodeModule.LoadSourceandLoadSourceFilecompile source text through an injectedSourceCompilerand callLoadPycon the result. PortsLib/importlib/_bootstrap_external.py SourceFileLoaderandSourcelessFileLoader.import.go ImportModule / ImportModuleLevel. The sys.modules to frozen to inittab lookup chain with relative-import resolution viaresolveAbsName. PortsPython/import.c:L1561 PyImport_ImportModuleLevelObject.
vm/
The eval loop learns to import.
eval_import.gowithIMPORT_NAMEandIMPORT_FROMbytecode arms. The arms wire into the eval loop viavmExecutor(implementsimp.ExecutorusingEvalCode).tryImportis consulted bydispatchbefore the generated arms. PortsPython/bytecodes.c IMPORT_NAMEandIMPORT_FROM. TheIMPORT_FROMarm honors thefrom x import ydistinction whereycan be either an attribute onxor a submodule ofx(Python checks for the attribute first, then falls back to a submodule import).
objects/
Two new types land because the import system pulls them in.
module.go.Moduletype with__name__,__doc__,__file__,__loader__,__spec__dict slots.ModuleTypesingleton,NewModule,module_repr. PortsObjects/moduleobject.c. The dict slots are real attributes rather than getters because user code routinely writes to them (__file__reassignment in particular is a common pattern in test fixtures that fake module identities).set.go.SetandFrozensettypes backed bymap[uint64]objects.Objectwith FNV hash keying.NewSet,NewFrozenset,set.add,set.discard,frozenset.__hash__. PortsObjects/setobject.c. We landed sets in v0.8 because marshal's TYPE_SET / TYPE_FROZENSET decoding needed them and the bootstrap path serializes sets in module-level constants.
errors/
The exception hierarchy grows.
exc_import.go.ImportErrorandModuleNotFoundError. PortsObjects/exceptions.c.ImportErrorcarries thenameandpathattributes that the path-finder failure branches stamp;ModuleNotFoundErroris the subclass used for the actual "this module is not found anywhere" case rather than other ImportError flavors (a syntactically broken.py, a.sothat fails to load, a circular import that catches a partial module).
Why we built it this way
A few decisions in this release deserve a callout.
marshal is a 1:1 port, not a Go-flavored serializer. We
could have invented our own format for .pyc files that played
better with Go's type system. We didn't, because being able to
read CPython-produced .pyc files (and have CPython refuse to
read ours) is a useful property. The first lets you bootstrap
against pre-compiled stdlib bytecode. The second prevents the
worst kind of cross-runtime confusion. The magic number split
guarantees the second.
The codec registry is a registry, not a switch statement. With three codecs there's a temptation to skip the registry and hard-code the dispatch. We didn't, because every codec we add later (cp1252, gb2312, shift_jis, the codec extensions in user-installed packages) goes through the same lookup. A switch statement would have to be replaced before the second codec landed; a registry just gets another entry.
Frozen modules ship as placeholders today. A real frozen
bootstrap means embedding the marshaled _frozen_importlib code
object into the gopy binary at compile time. We have the
machinery to load it (ExecCodeModule) but not yet the build
step that bakes the bytes into the binary. The placeholder
approach lets IsFrozen answer correctly so the import chain
falls through to the next step (inittab), which is where the
modules actually live today.
IMPORT_FROM does the attribute-then-submodule fallback.
The CPython behavior is: look up y as an attribute on x
first; if that fails, try to import x.y as a submodule. Most
packages work either way, but a handful (the ones that
intentionally don't expose submodules as attributes until you
import them) rely on the fallback order being attribute-first.
We picked the CPython order rather than invert it because the
alternative quietly broke a real package we tried to load.
Where it lives
The new packages, with their entry points.
marshal/.marshal.go,pyc.go. Entry pointsmarshal.Read,marshal.Write,marshal.ReadPyc,marshal.WritePyc,marshal.WritePycHash.codecs/.registry.go,builtin.go. Entry pointscodecs.Lookup,codecs.Register,codecs.Encode,codecs.Decode.imp/.frozen.go,frozen_bootstrap.go,sysmodules.go,inittab.go,bootstrap.go,exec.go,loader.go,import.go. Entry pointimp.ImportModule.vm/.eval_import.goaddsIMPORT_NAMEandIMPORT_FROMarms.objects/.module.goaddsModuleandModuleType.set.goaddsSet,Frozenset,SetType,FrozensetType.errors/.exc_import.goaddsImportErrorandModuleNotFoundError.
Compatibility
A few user-visible changes are worth flagging if you were tracking gopy through v0.7.
importworks. This is the headline. Code that doesimport json(or any other module) now imports rather than failing withErrNotImplemented. The catch is that the resolved module may itself be a placeholder until later drops add the real implementation.gopy -m fooruns the named module rather than returningErrNotImplemented. The-marm dispatches throughimp.ImportModuleplus the__main__rebind..pycfiles are written to__pycache__/. Source-file import compiles, marshals, and caches by default. You can disable the cache write withPYTHONDONTWRITEBYTECODE=1or the-Bflag.- The frozen importlib bootstrap is wired but stubbed.
_frozen_importlib.__name__resolves; calling into it walks the placeholder path that defers to the v0.8 inittab. Real frozen bytes embed in v0.10.
What's next
The v0.9 release is the vm tail. Highlights:
- Generator opcodes.
RETURN_GENERATOR,YIELD_VALUE,SEND,GET_YIELD_FROM_ITER,CLEANUP_THROW. Each generator runs on its own goroutine. - Pattern matching.
MATCH_MAPPING,MATCH_SEQUENCE,MATCH_KEYS,MATCH_CLASS. from x import *. Wired throughCALL_INTRINSIC_1so the helper can see the current frame's locals.WITH_EXCEPT_STARTand the async-iter / awaitable stubs.- Contextvars on top of an immutable HAMT.
pytimeas the runtime's nanosecond clock layer (also the storage for the GIL switch interval).- The real tokenizer.
tokenize.Itergraduates from a hand-rolled splitter to aParser/tokenizer/tokenizer.cport.
The remaining pending list from v0.7 carries forward:
sub-interpreters and Py_NewInterpreter, the compile /
exec / eval builtins beyond the smoke wrappers, Windows
pathconfig, PEP 657 caret pinning in tracebacks, and the
remaining bytecode handlers (LOAD_SPECIAL / LOAD_SUPER_ATTR,
IMPORT_STAR, SETUP_FINALLY / WITH_EXCEPT_START / CLEANUP_THROW /
BEFORE_WITH, CHECK_EG_MATCH / LOAD_ASSERTION_ERROR,
YIELD_VALUE / SEND / GET_AWAITABLE / RETURN_GENERATOR,
MATCH). v0.9 closes most of those.
Acknowledgments
This release lines up against Python/marshal.c,
Python/import.c, Modules/_codecsmodule.c,
Lib/importlib/_bootstrap_external.py,
Objects/moduleobject.c, Objects/setobject.c,
Objects/exceptions.c, and Python/bytecodes.c in the CPython
3.14 tree. Each ported file in gopy carries a citation to the
originating function so a future reader can compare line by
line.