1690. gopy marshal
What we are porting
Python/marshal.c (2163 lines): the binary serialisation format that
Python uses for .pyc bytecode files, marshal.dumps / marshal.loads,
and the compile-time constant pool. Every .pyc file is a 16-byte header
followed by a marshalled code object.
The existing marshal/marshal.go skeleton (v0.5) handles None, bool,
int (TYPE_INT), float (binary), str, bytes, tuple, and list. The
following are missing and must land in v0.8:
- TYPE_LONG: arbitrary-precision integers encoded as base-2^15 limbs.
- FLAG_REF (0x80 on any tag): back-reference dedup that lets cyclic or repeated objects be stored once and referred to by index.
- TYPE_INTERNED / TYPE_ASCII_INTERNED / TYPE_SHORT_ASCII_INTERNED: interned string variants used heavily in code objects.
- TYPE_CODE: the full code-object round-trip needed to run .pyc files.
- TYPE_COMPLEX / TYPE_BINARY_COMPLEX: complex number encoding.
- TYPE_SET / TYPE_FROZENSET: set types.
- TYPE_DICT: dictionary.
- .pyc file header: 16 bytes of magic + flags + source stamp + marshal blob, validated on read.
The v0.8 gate is import json; json.dumps({"a": 1}), which requires
loading a .pyc file containing a code object for json/__init__.py.
Key CPython functions
| C item | Location |
|---|---|
w_object main write dispatch | Python/marshal.c:459 |
w_complex_object FLAG_REF wrap | Python/marshal.c:356 |
w_long TYPE_LONG encode | Python/marshal.c:201 |
w_short_pstring / w_pstring | Python/marshal.c:168 |
r_object main read dispatch | Python/marshal.c:1159 |
r_object FLAG_REF read path | Python/marshal.c:1022 |
r_long TYPE_LONG decode | Python/marshal.c:937 |
PyMarshal_ReadLastObjectFromFile | Python/marshal.c:1807 |
PyMarshal_WriteObjectToString | Python/marshal.c:1928 |
PyMarshal_ReadObjectFromString | Python/marshal.c:1876 |
Type tags (full set, matching Python/marshal.c:58-110):
| Tag | Char | Notes |
|---|---|---|
| TYPE_NULL | '0' | |
| TYPE_NONE | 'N' | |
| TYPE_FALSE | 'F' | |
| TYPE_TRUE | 'T' | |
| TYPE_STOPITER | 'S' | |
| TYPE_ELLIPSIS | '.' | |
| TYPE_INT | 'i' | 32-bit signed |
| TYPE_INT64 | 'I' | 64-bit signed (Python 2 compat) |
| TYPE_FLOAT | 'f' | ASCII float |
| TYPE_BINARY_FLOAT | 'g' | IEEE 754 little-endian double |
| TYPE_LONG | 'l' | base-2^15 limbs |
| TYPE_STRING | 's' | bytes |
| TYPE_TUPLE | '(' | |
| TYPE_SMALL_TUPLE | ')' | 1-byte length prefix |
| TYPE_LIST | '[' | |
| TYPE_DICT | '{' | |
| TYPE_CODE | 'c' | code object |
| TYPE_UNICODE | 'u' | |
| TYPE_SHORT_ASCII | 'z' | 1-byte length prefix |
| TYPE_SHORT_ASCII_INTERNED | 'Z' | interned |
| TYPE_ASCII | 'a' | 4-byte length prefix |
| TYPE_ASCII_INTERNED | 'A' | interned |
| TYPE_REF | 'r' | back-reference |
| TYPE_COMPLEX | 'y' | ASCII re+im |
| TYPE_BINARY_COMPLEX | 'Y' | two IEEE 754 doubles |
| TYPE_SET | '<' | |
| TYPE_FROZENSET | '>' |
FLAG_REF = 0x80 ORed onto any tag byte. When writing: if the object
has been registered for reference tracking, OR its reference count is
greater than 1, set FLAG_REF and record an index in the ref table
before recursing. On reading: if FLAG_REF bit is set, pre-allocate a
slot in the decoder ref list, fill in the decoded object after
construction. TYPE_REF 'r' reads a 4-byte index and returns the
stored object.
TYPE_LONG encoding (Python/marshal.c:201): the absolute value is
split into base-2^15 digits in little-endian order. The digit count is
stored as a signed 32-bit integer; negative count signals a negative
number. Each digit is a uint16 in little-endian.
.pyc header (16 bytes, Python/importlib/_bootstrap_external.py):
uint32 magic_number // PYC_MAGIC_NUMBER_TOKEN for 3.14, pinned as const
uint32 bit_flags // bit 0: checked; bit 1: source-hash mode
uint32 field2 // source mtime (timestamp mode) or source_hash[0:4] (hash mode)
uint32 field3 // source size (timestamp mode) or source_hash[4:8] (hash mode)
// followed immediately by the marshalled code object
Go shape
// WriteCodeToString marshals a code object to bytes at the given version.
// CPython: Python/marshal.c:1928 PyMarshal_WriteObjectToString
func WriteCodeToString(code *objects.Code, version int) ([]byte, error)
// ReadCodeFromBytes unmarshals a code object from raw marshal bytes (no pyc header).
// CPython: Python/marshal.c:1876 PyMarshal_ReadObjectFromString
func ReadCodeFromBytes(data []byte) (*objects.Code, error)
// WritePyc writes a complete .pyc file: 16-byte header then marshalled code.
// flags: bit 0 = checked, bit 1 = source-hash mode.
// sourceHash is used when bit 1 is set; mtime+size when bit 1 is clear.
func WritePyc(w io.Writer, code *objects.Code, flags uint32, mtime uint32, size uint32, sourceHash uint64) error
// ReadPyc reads and validates a .pyc file header then returns the code object.
// Returns an error if the magic number does not match or the source stamp is stale.
func ReadPyc(r io.Reader, expectedMtime uint32, expectedSize uint32, expectedSourceHash uint64) (*objects.Code, error)
// Dumps serialises obj to bytes (marshal.dumps equivalent).
// CPython: Python/marshal.c:1928 PyMarshal_WriteObjectToString
func Dumps(obj objects.Object, version int) ([]byte, error)
// Loads deserialises bytes to an object (marshal.loads equivalent).
// CPython: Python/marshal.c:1876 PyMarshal_ReadObjectFromString
func Loads(data []byte) (objects.Object, error)
File mapping
| C source | Go target |
|---|---|
Python/marshal.c w_object full dispatch | marshal/write.go (replaces encoder in marshal.go) |
Python/marshal.c r_object full dispatch | marshal/read.go (replaces decoder in marshal.go) |
Python/marshal.c:201 TYPE_LONG encode/decode | marshal/long.go |
Python/marshal.c:356 FLAG_REF writer, ref table | marshal/refs.go |
Python/marshal.c:1022 FLAG_REF reader | marshal/refs.go |
Python/marshal.c:1807 pyc header read | marshal/pyc.go |
Python/marshal.c:1928 pyc header write | marshal/pyc.go |
| Existing skeleton | marshal/marshal.go (shrinks to package doc + Dumps/Loads wrappers) |
Checklist
Status legend: [x] shipped, [ ] pending, [~] partial / scaffold,
[n] deferred / not in scope this phase.
TYPE_LONG (marshal/long.go)
-
writeLong: split*big.Intabsolute value into base-2^15 uint16 limbs, little-endian. Emit signed 32-bit count (negative = negative number). CPython:Python/marshal.c:201 w_long. -
readLong: read signed 32-bit count, read|count|uint16 limbs, reconstruct*big.Int, apply sign from count sign. CPython:Python/marshal.c:937 r_long. - Zero
big.Intencodes as count=0 with no limbs. - Negative numbers encode as negative count (not two's complement).
- Round-trip property:
readLong(writeLong(n)) == nfor arbitrary*big.Int.
FLAG_REF (marshal/refs.go)
-
writerRefs: map fromobjects.Objectidentity to pre-assigned 32-bit index. Populated duringw_complex_objectwhen an object is seen more than once or is flagged for interning. CPython:Python/marshal.c:356 w_complex_object. - Write path: before writing a compound object, check if it needs a
ref slot; if so, OR FLAG_REF (0x80) onto the tag byte and record
len(refs)as the slot index. -
readerRefs: growable slice ofobjects.Object, indexed by order of first appearance. - Read path: if FLAG_REF bit is set on the tag, pre-allocate slot
(
append(refs, nil)) before decoding; after decoding, store the result at that slot. CPython:Python/marshal.c:1022. - TYPE_REF (
'r'): read 4-byte little-endian index, returnrefs[index]. Error if index out of range. CPython:Python/marshal.c:1159REF arm.
TYPE_INTERNED string variants (marshal/write.go, marshal/read.go)
- Write TYPE_SHORT_ASCII_INTERNED (
'Z') for interned strings shorter than 256 bytes. CPython:Python/marshal.c:459 w_objectunicode arm. - Write TYPE_ASCII_INTERNED (
'A') for interned strings >= 256 bytes. CPython:Python/marshal.c:459 w_object. - Write TYPE_SHORT_ASCII (
'z') for non-interned ASCII strings shorter than 256 bytes. - Write TYPE_ASCII (
'a') for non-interned ASCII strings >= 256 bytes. - Write TYPE_UNICODE (
'u') for non-ASCII strings (UTF-8 encoded, 4-byte length). - Read path: TYPE_ASCII_INTERNED and TYPE_SHORT_ASCII_INTERNED add
the decoded string to the interned-string table (the decoder's
stringsslice). CPython:Python/marshal.c:1159string arms.
TYPE_CODE (marshal/write.go, marshal/read.go)
- Write TYPE_CODE: emit tag, then all
objects.Codefields in the exact CPython order (Python/marshal.c:459 w_objectcode arm): argcount, posonlyargcount, kwonlyargcount, stacksize, flags, code (bytes), consts (tuple), names (tuple), localsplusnames (tuple), localspluskinds (bytes), filename (str), name (str), qualname (str), firstlineno (int), linetable (bytes), exceptiontable (bytes). - Read TYPE_CODE: reconstruct
objects.Codefrom the same field sequence. CPython:Python/marshal.c:1159code arm. - Field order must match CPython 3.14 exactly; any mismatch produces an unreadable .pyc file.
- CODE_FIELD_COUNT constant (16 for 3.14) guards against accidental field reordering in the writer.
TYPE_COMPLEX (marshal/write.go, marshal/read.go)
- Write TYPE_BINARY_COMPLEX (
'Y'): real then imag, each as IEEE 754 little-endian double (8 bytes each). CPython:Python/marshal.c:459 w_objectcomplex arm. - Read TYPE_BINARY_COMPLEX: two 8-byte IEEE 754 doubles.
CPython:
Python/marshal.c:1159binary-complex arm. - [n] TYPE_COMPLEX ASCII form (
'y'): deferred to v0.9 with full complex number support.
TYPE_SET / TYPE_FROZENSET (marshal/write.go, marshal/read.go)
- Write TYPE_SET (
'<'): 4-byte count then each element. CPython:Python/marshal.c:459 w_objectset arm. - Write TYPE_FROZENSET (
'>'): same layout. - Read TYPE_SET / TYPE_FROZENSET: count then elements, reconstruct
*objects.Setor*objects.FrozenSet. CPython:Python/marshal.c:1159set/frozenset arm. - Empty frozenset uses FLAG_REF dedup (CPython interns it).
TYPE_DICT (marshal/write.go, marshal/read.go)
- Write TYPE_DICT (
'{'): interleaved key, value pairs terminated by a NULL byte (thew_object(NULL)sentinel). CPython:Python/marshal.c:459 w_objectdict arm. - Read TYPE_DICT: read key/value pairs until NULL sentinel is read.
CPython:
Python/marshal.c:1159dict arm.
.pyc header (marshal/pyc.go)
-
MagicNumberconstant pinned to CPython 3.14'sPYC_MAGIC_NUMBER_TOKENvalue (3495 * 10 + 13 = 34963, stored asmagic | (b'\r\n' << 16)). CPython:Python/import.cmagic generation andLib/importlib/_bootstrap_external.py. -
WritePyc: write 4 uint32s (magic, flags, field2, field3) then callWriteCodeToStringand write the blob. -
ReadPyc: read 4 uint32s; reject on magic mismatch (raiseImportError); validate timestamp or source-hash depending on bit 1 of flags; then callReadCodeFromBytes. - Timestamp mode (flags bit 1 clear): field2 = source mtime, field3 = source size. Stale if either mismatches.
- Source-hash mode (flags bit 1 set): field2/field3 = 64-bit source hash split little-endian. Stale if hash mismatches.
- Unchecked mode (flags bit 0 clear): skip validation entirely.
CPython:
Lib/importlib/_bootstrap_external.py _validate_bytecode_header.
Write dispatch completeness (marshal/write.go)
-
w_objecthandles all 27 type tags listed in the type-tag table. - TYPE_NONE, TYPE_TRUE, TYPE_FALSE: already in skeleton.
- TYPE_INT (32-bit): already in skeleton.
- TYPE_INT64: emit when int fits in int64 but not int32
(Python 2 compat path). CPython:
Python/marshal.c:459. - TYPE_BINARY_FLOAT: already in skeleton.
- TYPE_STRING (bytes): already in skeleton.
- TYPE_SMALL_TUPLE / TYPE_TUPLE: already in skeleton.
- TYPE_LIST: already in skeleton.
- TYPE_LONG: delegate to
marshal/long.go writeLong. - TYPE_DICT: as above.
- TYPE_CODE: as above.
- TYPE_BINARY_COMPLEX: as above.
- TYPE_SET / TYPE_FROZENSET: as above.
- TYPE_ELLIPSIS (
'.'): singleton write. - TYPE_STOPITER (
'S'): singleton write. - TYPE_NULL (
'0'): internal sentinel; never called from user code.
Read dispatch completeness (marshal/read.go)
-
r_objecthandles all 27 type tags. - TYPE_NONE, TYPE_TRUE, TYPE_FALSE: already in skeleton.
- TYPE_INT: already in skeleton.
- TYPE_INT64: read 8 bytes, return
*objects.Long. - TYPE_BINARY_FLOAT: already in skeleton.
- TYPE_FLOAT ASCII form: parse ASCII float string.
CPython:
Python/marshal.c:1159float arm. - TYPE_STRING (bytes): already in skeleton.
- TYPE_SMALL_TUPLE, TYPE_TUPLE: already in skeleton.
- TYPE_LIST: already in skeleton.
- TYPE_UNICODE (
'u'): 4-byte length, UTF-8 bytes. - TYPE_SHORT_ASCII, TYPE_ASCII: as above (non-interned).
- TYPE_SHORT_ASCII_INTERNED, TYPE_ASCII_INTERNED: intern on read.
- TYPE_LONG: delegate to
marshal/long.go readLong. - TYPE_DICT: as above.
- TYPE_CODE: as above.
- TYPE_BINARY_COMPLEX: as above.
- TYPE_SET / TYPE_FROZENSET: as above.
- TYPE_REF: index into ref slice.
- TYPE_ELLIPSIS: return
objects.Ellipsis. - TYPE_STOPITER: return
objects.StopIterationsingleton. - TYPE_NULL: return nil (internal sentinel).
Tests
-
marshal/marshal_test.go:DumpsthenLoadsround-trip for every supported type tag (None, bool, int, big.Int, float, bytes, str, tuple, list, dict, set, frozenset, code object). -
marshal/marshal_test.go: byte-for-byte match againstpython3 -c "import marshal, sys; sys.stdout.buffer.write(marshal.dumps(42))"for int, float, str, bytes, and a simple tuple. -
marshal/marshal_test.go: code-object round-trip: compile a trivial Python source, marshal the code object, unmarshal it, verifyco_consts,co_names,co_filename,co_firstlineno. -
marshal/long_test.go: round-trip forbig.Intvalues 0, 1, -1, 2^15, 2^31, 2^63, 2^128, -(2^128). -
marshal/refs_test.go: object shared between two tuple slots is written once with FLAG_REF and read back as the same pointer. -
marshal/pyc_test.go:WritePycthenReadPycround-trip in timestamp mode; source-hash mode; stale-magic rejection; stale-mtime rejection.
Cross-references
- Code object definition: 1687 (
objects/code.go). The marshal writer reads every field ofobjects.Code; the reader constructs one. - Import system that reads .pyc files: 1691 (
imp/loader.gocallsmarshal.ReadPyc). - Compile pipeline that produces Code objects: 1620 series.
- objects/set.go, objects/dict.go: 1680, 1681 (needed by TYPE_SET, TYPE_FROZENSET, TYPE_DICT read paths).
Out of scope
- TYPE_COMPLEX ASCII form (
'y'). Lands in v0.9 with full complex numbers. - TYPE_FRAME, TYPE_ELLIPSIS, TYPE_STOPITER object construction. The tags are written; the resulting objects land with their respective types in later phases.
- Pickle compatibility differences for marshal version < 4 interning behaviour.
- The
marshalPython module surface (marshal.version,marshal.MAGIC_NUMBER). Lands with the stdlib bridge. - Free-threaded write-barrier integration. gopy is single-threaded through v0.10.