v0.8.0 - Import drop

Released Pre-release.

A working import statement is a deceptive amount of machinery. The Python source you type is a single keyword. The runtime behavior underneath that keyword spans three executable layers, two file formats, a registry, a search algorithm, a compiler, a serializer, a deserializer, and a codec table. The fact that import json "just works" in CPython is the result of decades of careful work on the import system.

For most of gopy's life up through v0.7, we didn't have any of this. We had the lifecycle from v0.7 (which let us boot), and we had a builtin namespace (which let print and len work). What we did not have was a way to say import math. Every test that needed a module had to bring that module along as a hand-stamped Go value injected into the eval loop. The runner could execute single-file Python with builtins; it could not execute Python that imported anything.

v0.8.0 is the import drop. After this release the runtime has a working import system: marshal reads and writes .pyc files, the codecs layer handles utf-8, ascii, and latin-1, and the imp package implements the full sys.modules to frozen to built-in lookup chain. IMPORT_NAME and IMPORT_FROM bytecodes in the eval loop use the new machinery end to end.

Highlights

Three pieces of work define this release.

Real .pyc round-trip

marshal is the serialization format CPython uses for bytecode on disk. A .pyc file is a 16-byte header (magic, flags, timestamp or hash payload) followed by a marshaled code object. A marshaled code object recursively marshals its constants, its name, its filename, its line table, its exception table, its free and cell variable names, and the bytecode itself. Marshal has to handle every type that can appear in co_consts: ints (small and arbitrary-precision), floats, complex, strings (with interning), bytes, tuples, frozensets, code objects.

We landed all of it.

// Write a code object to disk as a .pyc file.
err := marshal.WritePyc(file, code, mtime, sourceSize)

// Read it back. The header is validated against the gopy magic.
code, err := marshal.ReadPyc(file)

The marshal layer handles TYPE_LONG (arbitrary-precision ints stored as 15-bit digits behind a signed digit count), FLAG_REF (object back-references for shared subobjects), interned strings, TYPE_CODE (the 25-field 3.11+ code object wire format), TYPE_SET, TYPE_FROZENSET, TYPE_DICT, and TYPE_BINARY_COMPLEX. The PEP 552 header supports both the timestamp variant (mtime + source size) and the hash variant (SipHash of the source).
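To make the header layout concrete, here is a minimal Go sketch of the 16-byte PEP 552 header. It is not gopy's pyc.go: the names pycHeader, encodeHeader, and decodeHeader, and the placeholder magic value, are assumptions made for illustration. The flag bits follow PEP 552 (bit 0 selects the hash variant, bit 1 is check_source).

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const headerSize = 16

// Flag bits defined by PEP 552.
const (
	flagHashBased   uint32 = 1 << 0 // payload is a source hash
	flagCheckSource uint32 = 1 << 1 // only meaningful with flagHashBased
)

// pycHeader is a hypothetical in-memory view of the header.
type pycHeader struct {
	Magic      uint32
	Flags      uint32
	Mtime      uint32  // timestamp variant only
	SourceSize uint32  // timestamp variant only
	SourceHash [8]byte // hash variant only
}

// encodeHeader lays the fields out little-endian in 16 bytes.
func encodeHeader(h pycHeader) []byte {
	buf := make([]byte, headerSize)
	binary.LittleEndian.PutUint32(buf[0:4], h.Magic)
	binary.LittleEndian.PutUint32(buf[4:8], h.Flags)
	if h.Flags&flagHashBased != 0 {
		copy(buf[8:16], h.SourceHash[:])
	} else {
		binary.LittleEndian.PutUint32(buf[8:12], h.Mtime)
		binary.LittleEndian.PutUint32(buf[12:16], h.SourceSize)
	}
	return buf
}

// decodeHeader rejects short input and a magic other than wantMagic.
func decodeHeader(buf []byte, wantMagic uint32) (pycHeader, error) {
	if len(buf) < headerSize {
		return pycHeader{}, errors.New("pyc header: short read")
	}
	h := pycHeader{
		Magic: binary.LittleEndian.Uint32(buf[0:4]),
		Flags: binary.LittleEndian.Uint32(buf[4:8]),
	}
	if h.Magic != wantMagic {
		return pycHeader{}, fmt.Errorf("pyc header: bad magic %#x", h.Magic)
	}
	if h.Flags&flagHashBased != 0 {
		copy(h.SourceHash[:], buf[8:16])
	} else {
		h.Mtime = binary.LittleEndian.Uint32(buf[8:12])
		h.SourceSize = binary.LittleEndian.Uint32(buf[12:16])
	}
	return h, nil
}

func main() {
	const magic uint32 = 0xCAFEF00D // placeholder, not gopy's real magic
	in := pycHeader{Magic: magic, Mtime: 1700000000, SourceSize: 42}
	out, err := decodeHeader(encodeHeader(in), magic)
	fmt.Println(out == in, err)
}
```

The marshaled code object would follow immediately after these 16 bytes; the sketch stops at the header.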

Three codecs that unblock everything else

We weren't ready to ship the full codec registry in this drop (that's roughly forty codecs with their own table dependencies). We were ready to ship the three you actually need to bootstrap.

"hello".encode('utf-8') # works
"hello".encode('ascii') # works
"hello".encode('latin-1') # works
b"\xc3\xa9".decode('utf-8') # works, gives 'é'

utf-8 covers source files because PEP 3120 makes utf-8 the default source encoding (PEP 263 defines the coding declaration that overrides it). ascii covers most text protocols and the conservative fallback path. latin-1 covers byte-identity round-trip (every byte 0..255 maps to a single codepoint, every codepoint 0..255 maps back to the same byte). With those three, you can read Python source, decode HTTP headers, and write the .pyc interned strings. Every other codec we add later layers on top of the same registry.
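The byte-identity property of latin-1 is easy to check in a few lines of Go. This is an illustrative sketch, not the code in codecs/builtin.go; encodeLatin1 and decodeLatin1 are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// encodeLatin1 maps each code point 0..255 to the byte of the same value
// and rejects anything above U+00FF.
func encodeLatin1(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, errors.New("latin-1: code point out of range")
		}
		out = append(out, byte(r))
	}
	return out, nil
}

// decodeLatin1 maps each byte to the code point of the same value,
// so it can never fail.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// Every possible byte value survives a decode/encode round trip.
	all := make([]byte, 256)
	for i := range all {
		all[i] = byte(i)
	}
	back, err := encodeLatin1(decodeLatin1(all))
	fmt.Println(string(back) == string(all), err)
}
```

Decoding cannot fail because every byte is a valid latin-1 character; only encoding has an error path.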

The aliases (utf_8, utf8, u8, ascii, latin_1, iso_8859_1, l1, 8859) match CPython's normalization rules so 'utf-8', 'UTF-8', 'utf_8', and 'U8' all resolve to the same codec.
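A rough sketch of how such alias resolution can work, assuming the normalization is "lowercase, then hyphens to underscores" followed by a table lookup. The aliases map and the normalize and lookup functions are hypothetical stand-ins, not the gopy registry API.

```go
package main

import (
	"fmt"
	"strings"
)

// aliases maps a normalized spelling to a canonical codec name.
// The keys are the aliases listed in these notes.
var aliases = map[string]string{
	"utf_8":      "utf-8",
	"utf8":       "utf-8",
	"u8":         "utf-8",
	"ascii":      "ascii",
	"latin_1":    "latin-1",
	"iso_8859_1": "latin-1",
	"l1":         "latin-1",
	"8859":       "latin-1",
}

// normalize lowercases the name and turns hyphens into underscores.
func normalize(name string) string {
	return strings.ReplaceAll(strings.ToLower(name), "-", "_")
}

// lookup resolves any accepted spelling to its canonical codec name.
func lookup(name string) (string, bool) {
	canon, ok := aliases[normalize(name)]
	return canon, ok
}

func main() {
	for _, n := range []string{"utf-8", "UTF-8", "utf_8", "U8", "ISO-8859-1", "cp1252"} {
		canon, ok := lookup(n)
		fmt.Printf("%-10s -> %q (found=%v)\n", n, canon, ok)
	}
}
```

Because the table is keyed on the normalized form, adding a spelling is one map entry rather than a new branch.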

The full lookup chain

When you write import foo, the runtime walks an ordered chain:

  1. Is 'foo' already in sys.modules? If yes, return it.
  2. Is 'foo' a frozen module (importlib bootstrap or a user-registered frozen)? If yes, exec its code object into a fresh module.
  3. Is 'foo' in PyImport_Inittab (a built-in extension module)? If yes, call its init function and register the result.
  4. Walk sys.path through the path finder. (This part is a shim through v0.8; the real FileFinder lands in v0.12.1.)
  5. Raise ModuleNotFoundError.
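The five steps can be sketched as an ordered fallthrough. This is a simplified model under stated assumptions, not the body of ImportModule: the importer struct, its maps, and the pathFinder hook are hypothetical, and "executing a frozen code object" or "calling an init function" is reduced to a plain Go closure.

```go
package main

import (
	"errors"
	"fmt"
)

type Module struct{ Name string }

var ErrModuleNotFound = errors.New("ModuleNotFoundError")

// importer holds hypothetical stand-ins for the three tables and the path shim.
type importer struct {
	sysModules map[string]*Module
	frozen     map[string]func() *Module // stands in for executing a frozen code object
	inittab    map[string]func() *Module // stands in for a built-in init function
	pathFinder func(name string) *Module // returns nil when nothing is found
}

func (im *importer) importModule(name string) (*Module, error) {
	// Step 1: a cached module wins.
	if m, ok := im.sysModules[name]; ok {
		return m, nil
	}
	// Steps 2 and 3: frozen table first, then the inittab.
	for _, table := range []map[string]func() *Module{im.frozen, im.inittab} {
		if load, ok := table[name]; ok {
			m := load()
			im.sysModules[name] = m
			return m, nil
		}
	}
	// Step 4: path search (a shim here, as in the release).
	if im.pathFinder != nil {
		if m := im.pathFinder(name); m != nil {
			im.sysModules[name] = m
			return m, nil
		}
	}
	// Step 5: nothing matched.
	return nil, fmt.Errorf("%w: no module named %q", ErrModuleNotFound, name)
}

func main() {
	calls := 0
	im := &importer{
		sysModules: map[string]*Module{},
		inittab: map[string]func() *Module{
			"math": func() *Module { calls++; return &Module{Name: "math"} },
		},
	}
	a, _ := im.importModule("math")
	b, _ := im.importModule("math") // served from sysModules
	_, err := im.importModule("nope")
	fmt.Println(a == b, calls, errors.Is(err, ErrModuleNotFound))
}
```

The second import of math never reaches the inittab, which is the point of checking sys.modules first.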

Step 1 is imp/sysmodules.go. Step 2 is imp/frozen.go. Step 3 is imp/inittab.go. The full chain is imp/import.go ImportModule. Relative imports (from . import foo, from ..bar import baz) are resolved up front into absolute names via resolveAbsName before the chain runs, matching CPython's _resolve_name.
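For readers unfamiliar with how a relative name becomes absolute, here is a small Go sketch modeled on the logic of CPython's _resolve_name. The function name resolveAbs and its signature are assumptions; the signature of gopy's resolveAbsName is not shown in these notes.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// resolveAbs turns (name, package, level) into an absolute module name.
// level 0 means the name is already absolute; level 1 is the current
// package, level 2 its parent, and so on.
func resolveAbs(name, pkg string, level int) (string, error) {
	if level == 0 {
		return name, nil
	}
	if pkg == "" {
		return "", errors.New("attempted relative import with no known parent package")
	}
	parts := strings.Split(pkg, ".")
	// level-1 trailing components are stripped; at least one must remain.
	if level-1 >= len(parts) {
		return "", errors.New("attempted relative import beyond top-level package")
	}
	base := strings.Join(parts[:len(parts)-(level-1)], ".")
	if name == "" {
		return base, nil
	}
	return base + "." + name, nil
}

func main() {
	fmt.Println(resolveAbs("", "a.b.c", 1))    // from . import foo
	fmt.Println(resolveAbs("bar", "a.b.c", 2)) // from ..bar import baz
	fmt.Println(resolveAbs("x", "a.b.c", 4))   // climbs past the top-level package
}
```

Resolving up front means the lookup chain itself only ever sees absolute names.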

What's new

The full feature breakdown.

marshal/

The .pyc serializer and deserializer.

  • marshal.go TYPE_LONG. Added the 15-bit digit decoder. CPython stores an arbitrary-precision integer as a signed 32-bit digit count, whose sign carries the sign of the value, followed by that many 15-bit digits, least significant first, each in a 16-bit little-endian word. Ports Python/marshal.c r_PyLong.
  • marshal.go FLAG_REF and TYPE_INTERNED. Object back-references. When the same object appears multiple times in a marshaled stream (interned strings are the common case), the first occurrence carries a FLAG_REF bit that pushes it onto a reference table; subsequent occurrences are TYPE_REF with an index into that table. Ports Python/marshal.c r_ref_reserve and r_ref_insert.
  • marshal.go TYPE_CODE. The 3.11+ wire format for code objects. 25 fields: argcount, posonlyargcount, kwonlyargcount, stacksize, flags, code, consts, names, locals, freevars, cellvars, filename, name, qualname, firstlineno, linetable, exceptiontable, plus the smaller fields. Ports Python/marshal.c:L1045 r_object TYPE_CODE. We ported it as a single function with a long body because that's what CPython does; splitting it into per-field helpers obscured the wire layout.
  • marshal.go TYPE_SET / TYPE_FROZENSET / TYPE_DICT / TYPE_BINARY_COMPLEX. The container and complex types. fromObject translates objects.Object items back to plain Go values for the writer path so the writer can recurse through native types.
  • pyc.go WritePyc / WritePycHash / ReadPyc. The 16-byte PEP 552 header: 4-byte magic (3620 | 0x0D0A<<16), 4-byte flags, 8-byte payload that's either {mtime, source_size} for the timestamp variant or source_hash for the hash variant. Ports Python/import.c read_pyc_header. The magic number was picked to be distinct from CPython's 3.14 magic so our .pyc files don't get loaded by CPython and vice versa.
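As an illustration of the TYPE_LONG body format used by CPython's marshal (a signed 32-bit digit count followed by 15-bit digits in 16-bit little-endian words, least significant first), here is a minimal decoding sketch built on math/big. decodeLong is a hypothetical helper, not the function in marshal.go, and it skips the normalization checks a full port would need.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math/big"
)

const marshalShift = 15 // bits carried by each marshaled digit

// decodeLong decodes the bytes that follow a TYPE_LONG tag.
func decodeLong(buf []byte) (*big.Int, error) {
	if len(buf) < 4 {
		return nil, errors.New("marshal: truncated long header")
	}
	n := int32(binary.LittleEndian.Uint32(buf[:4]))
	count := int64(n)
	if count < 0 {
		count = -count // the sign of the count is the sign of the value
	}
	if int64(len(buf)) < 4+2*count {
		return nil, errors.New("marshal: truncated long digits")
	}
	v := new(big.Int)
	// Fold digits in from most significant to least significant.
	for i := count - 1; i >= 0; i-- {
		d := binary.LittleEndian.Uint16(buf[4+2*i:])
		if d >= 1<<marshalShift {
			return nil, errors.New("marshal: digit out of range in long")
		}
		v.Lsh(v, marshalShift)
		v.Or(v, big.NewInt(int64(d)))
	}
	if n < 0 {
		v.Neg(v)
	}
	return v, nil
}

func main() {
	// 100000 = 3*2^15 + 1696, so the digits are [1696, 3].
	v, err := decodeLong([]byte{0x02, 0, 0, 0, 0xA0, 0x06, 0x03, 0x00})
	fmt.Println(v, err)
}
```

The writer path is the mirror image: split the magnitude into 15-bit chunks and negate the count for negative values.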

codecs/

The codec layer. Three codecs, one registry.

  • registry.go. Register, Lookup, Search, and the CodecInfo struct. Ports Modules/_codecsmodule.c and Lib/codecs.py lookup. The registry is keyed by normalized name (lowercase, hyphens to underscores), which matches the CPython rule and is the reason 'UTF-8', 'utf-8', 'UTF_8', and 'utf_8' all hit the same entry.
  • builtin.go. utf-8, ascii, and latin-1 codecs with Encode and Decode entry points. Alias normalization (utf_8, utf8, u8, ascii, latin_1, iso_8859_1, l1, 8859). Ports Modules/_codecsmodule.c _Py_codec_lookup.

We picked these three because they cover the bootstrap. utf-8 decodes Python source files. ascii is the conservative default for many text protocols. latin-1 is the byte-identity codec that the marshal layer leans on for its interned-string table. The remaining codecs (cp1252, cp437, the Asian encodings, the JIS family) all register through the same registry and land as their own ports later.

imp/

The import package. This is where the actual import system lives.

  • frozen.go from Python/import.c. FrozenModule table, RegisterFrozen, FindFrozen, IsFrozen, FrozenList. Frozen modules are code objects that get linked into the binary so we can import them before the filesystem path search is available. The importlib bootstrap modules (_frozen_importlib, _frozen_importlib_external) are the canonical example, but user code can register frozens too.
  • frozen_bootstrap.go. Registers the five importlib bootstrap names as nil-Code placeholders so IsFrozen answers correctly before the real code objects are embedded. The placeholders return nil from FindFrozen (so the import chain falls through), but IsFrozen returns true. This lets us answer sys._has_frozen_modules('foo') honestly while the actual frozen code embed sits behind v0.10's spec cut.
  • sysmodules.go from Python/import.c. sys.modules access layer: GetModule, AddModule, RemoveModule, SysModulesSnapshot. The access functions go through this layer rather than touching the dict directly so the __del__ hooks on user-replacement sys.modules objects fire correctly.
  • inittab.go from Python/import.c PyImport_Inittab. AppendInittab, ExtendInittab, FindInitFunc, InittabSnapshot. The inittab is the registry of built-in extension modules. We use it for modules we ship in Go (rather than as Python source). Embedders can extend it before init runs.
  • bootstrap.go from Python/import.c init_importlib. Two-phase sequence: InitImportlib executes _frozen_importlib, InitImportlibExternal executes _frozen_importlib_external. The Executor interface decouples the bootstrap from the eval loop so imp doesn't have to import vm. We need that decoupling because vm imports imp for the IMPORT_NAME bytecode; closing the cycle would make the package graph unbuildable.
  • exec.go ExecCodeModule. Creates a module, registers it in sys.modules, executes the code object in its namespace, removes it on failure. Ports Python/import.c:L632 exec_code_in_module. The remove-on-failure matters: a half-imported module visible in sys.modules is a trap for the next import of the same name, which would return the half-import rather than retry.
  • loader.go LoadPyc. Reads a .pyc file and executes it via ExecCodeModule. LoadSource and LoadSourceFile compile source text through an injected SourceCompiler and call LoadPyc on the result. Ports Lib/importlib/_bootstrap_external.py SourceFileLoader and SourcelessFileLoader.
  • import.go ImportModule / ImportModuleLevel. The sys.modules to frozen to inittab lookup chain with relative-import resolution via resolveAbsName. Ports Python/import.c:L1561 PyImport_ImportModuleLevelObject.
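The register-then-roll-back behavior described for ExecCodeModule can be sketched as follows. This is a simplified model, not gopy's exec.go: the registry type, the execModule method, and the body callback (which stands in for running a code object in the module's namespace) are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

type Module struct {
	Name string
	Dict map[string]any
}

// registry is a hypothetical stand-in for the sys.modules access layer.
type registry struct{ modules map[string]*Module }

var errBoom = errors.New("boom") // sample failure used by the demo

// execModule creates a module, registers it, runs body in it, and removes
// the registration again if body fails.
func (r *registry) execModule(name string, body func(m *Module) error) (*Module, error) {
	m := &Module{Name: name, Dict: map[string]any{"__name__": name}}
	r.modules[name] = m // registered before execution starts
	if err := body(m); err != nil {
		delete(r.modules, name) // never leave a half-imported module behind
		return nil, fmt.Errorf("executing %q: %w", name, err)
	}
	return m, nil
}

func main() {
	r := &registry{modules: map[string]*Module{}}

	_, err := r.execModule("broken", func(m *Module) error {
		m.Dict["partial"] = true
		return errBoom
	})
	_, stillThere := r.modules["broken"]
	fmt.Println(errors.Is(err, errBoom), stillThere)

	ok, _ := r.execModule("fine", func(m *Module) error {
		m.Dict["x"] = 1
		return nil
	})
	fmt.Println(r.modules["fine"] == ok)
}
```

After the failed run a retry of "broken" starts from a clean slate instead of finding the partial module.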

vm/

The eval loop learns to import.

  • eval_import.go with IMPORT_NAME and IMPORT_FROM bytecode arms. The arms wire into the eval loop via vmExecutor (implements imp.Executor using EvalCode). tryImport is consulted by dispatch before the generated arms. Ports Python/bytecodes.c IMPORT_NAME and IMPORT_FROM. The IMPORT_FROM arm honors the from x import y distinction where y can be either an attribute on x or a submodule of x (Python checks for the attribute first, then falls back to a submodule import).
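The attribute-first order can be sketched in a few lines. This is an illustrative model only: the Module type, importFrom, and the importSub callback (a stand-in for whatever resolves the submodule x.y) are hypothetical and do not mirror the eval_import.go code.

```go
package main

import (
	"errors"
	"fmt"
)

type Module struct {
	Name  string
	Attrs map[string]any
}

var errNoSubmodule = errors.New("no such submodule") // sample failure for the demo

// importFrom resolves "from mod import name": the attribute on mod is tried
// first, and the submodule fallback runs only when the attribute is missing.
func importFrom(mod *Module, name string, importSub func(fullName string) (*Module, error)) (any, error) {
	if v, ok := mod.Attrs[name]; ok {
		return v, nil
	}
	sub, err := importSub(mod.Name + "." + name)
	if err != nil {
		return nil, fmt.Errorf("cannot import name %q from %q: %w", name, mod.Name, err)
	}
	return sub, nil
}

func main() {
	pkg := &Module{Name: "pkg", Attrs: map[string]any{"VERSION": "1.0"}}
	importSub := func(full string) (*Module, error) {
		if full == "pkg.util" {
			return &Module{Name: full}, nil
		}
		return nil, errNoSubmodule
	}

	v, _ := importFrom(pkg, "VERSION", importSub) // attribute wins
	sub, _ := importFrom(pkg, "util", importSub)  // falls back to the submodule
	_, err := importFrom(pkg, "missing", importSub)
	fmt.Println(v, sub.(*Module).Name, err != nil)
}
```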

objects/

Two new types land because the import system pulls them in.

  • module.go. Module type with __name__, __doc__, __file__, __loader__, __spec__ dict slots. ModuleType singleton, NewModule, module_repr. Ports Objects/moduleobject.c. The dict slots are real attributes rather than getters because user code routinely writes to them (__file__ reassignment in particular is a common pattern in test fixtures that fake module identities).
  • set.go. Set and Frozenset types backed by map[uint64]objects.Object with FNV hash keying. NewSet, NewFrozenset, set.add, set.discard, frozenset.__hash__. Ports Objects/setobject.c. We landed sets in v0.8 because marshal's TYPE_SET / TYPE_FROZENSET decoding needed them and the bootstrap path serializes sets in module-level constants.

errors/

The exception hierarchy grows.

  • exc_import.go. ImportError and ModuleNotFoundError. Ports Objects/exceptions.c. ImportError carries the name and path attributes that the path-finder failure branches stamp; ModuleNotFoundError is the subclass used for the actual "this module is not found anywhere" case rather than other ImportError flavors (a syntactically broken .py, a .so that fails to load, a circular import that catches a partial module).

Why we built it this way

A few decisions in this release deserve a callout.

marshal is a 1:1 port, not a Go-flavored serializer. We could have invented our own format for .pyc files that played better with Go's type system. We didn't, because being able to read CPython-produced .pyc files (and have CPython refuse to read ours) is a useful property. The first lets you bootstrap against pre-compiled stdlib bytecode. The second prevents the worst kind of cross-runtime confusion. The magic number split guarantees the second.

The codec registry is a registry, not a switch statement. With three codecs there's a temptation to skip the registry and hard-code the dispatch. We didn't, because every codec we add later (cp1252, gb2312, shift_jis, the codec extensions in user-installed packages) goes through the same lookup. A switch statement would need editing for every new codec and could never accept one registered from outside the package; a registry just gets another entry.

Frozen modules ship as placeholders today. A real frozen bootstrap means embedding the marshaled _frozen_importlib code object into the gopy binary at compile time. We have the machinery to load it (ExecCodeModule) but not yet the build step that bakes the bytes into the binary. The placeholder approach lets IsFrozen answer correctly while FindFrozen returns nil, so the import chain falls through to the next step (inittab), which is where the modules actually live today.

IMPORT_FROM does the attribute-then-submodule fallback. The CPython behavior is: look up y as an attribute on x first; if that fails, fall back to the submodule x.y, which CPython resolves by looking up the fully qualified name in sys.modules. Most packages work either way, but a handful (the ones that intentionally don't expose submodules as attributes until you import them) rely on the fallback order being attribute-first. We picked the CPython order rather than invert it because the alternative quietly broke a real package we tried to load.

Where it lives

The new packages, with their entry points.

  • marshal/. marshal.go, pyc.go. Entry points marshal.Read, marshal.Write, marshal.ReadPyc, marshal.WritePyc, marshal.WritePycHash.
  • codecs/. registry.go, builtin.go. Entry points codecs.Lookup, codecs.Register, codecs.Encode, codecs.Decode.
  • imp/. frozen.go, frozen_bootstrap.go, sysmodules.go, inittab.go, bootstrap.go, exec.go, loader.go, import.go. Entry point imp.ImportModule.
  • vm/. eval_import.go adds IMPORT_NAME and IMPORT_FROM arms.
  • objects/. module.go adds Module and ModuleType. set.go adds Set, Frozenset, SetType, FrozensetType.
  • errors/. exc_import.go adds ImportError and ModuleNotFoundError.

Compatibility

A few user-visible changes are worth flagging if you were tracking gopy through v0.7.

  • import works. This is the headline. Code that does import json (or any other module) now imports rather than failing with ErrNotImplemented. The catch is that the resolved module may itself be a placeholder until later drops add the real implementation.
  • gopy -m foo runs the named module rather than returning ErrNotImplemented. The -m arm dispatches through imp.ImportModule plus the __main__ rebind.
  • .pyc files are written to __pycache__/. Source-file import compiles, marshals, and caches by default. You can disable the cache write with PYTHONDONTWRITEBYTECODE=1 or the -B flag.
  • The frozen importlib bootstrap is wired but stubbed. _frozen_importlib.__name__ resolves; calling into it walks the placeholder path that defers to the v0.8 inittab. Real frozen bytes embed in v0.10.

What's next

The v0.9 release is the vm tail. Highlights:

  • Generator opcodes. RETURN_GENERATOR, YIELD_VALUE, SEND, GET_YIELD_FROM_ITER, CLEANUP_THROW. Each generator runs on its own goroutine.
  • Pattern matching. MATCH_MAPPING, MATCH_SEQUENCE, MATCH_KEYS, MATCH_CLASS.
  • from x import *. Wired through CALL_INTRINSIC_1 so the helper can see the current frame's locals.
  • WITH_EXCEPT_START and the async-iter / awaitable stubs.
  • Contextvars on top of an immutable HAMT.
  • pytime as the runtime's nanosecond clock layer (also the storage for the GIL switch interval).
  • The real tokenizer. tokenize.Iter graduates from a hand-rolled splitter to a Parser/tokenizer/tokenizer.c port.

The remaining pending list from v0.7 carries forward: sub-interpreters and Py_NewInterpreter, the compile / exec / eval builtins beyond the smoke wrappers, Windows pathconfig, PEP 657 caret pinning in tracebacks, and the remaining bytecode handlers (LOAD_SPECIAL / LOAD_SUPER_ATTR, IMPORT_STAR, SETUP_FINALLY / WITH_EXCEPT_START / CLEANUP_THROW / BEFORE_WITH, CHECK_EG_MATCH / LOAD_ASSERTION_ERROR, YIELD_VALUE / SEND / GET_AWAITABLE / RETURN_GENERATOR, MATCH). v0.9 closes most of those.

Acknowledgments

This release lines up against Python/marshal.c, Python/import.c, Modules/_codecsmodule.c, Lib/importlib/_bootstrap_external.py, Objects/moduleobject.c, Objects/setobject.c, Objects/exceptions.c, and Python/bytecodes.c in the CPython 3.14 tree. Each ported file in gopy carries a citation to the originating function so a future reader can compare line by line.