Skip to main content

1690. gopy marshal

What we are porting

Python/marshal.c (2163 lines): the binary serialisation format that Python uses for .pyc bytecode files, marshal.dumps / marshal.loads, and the compile-time constant pool. Every .pyc file is a 16-byte header followed by a marshalled code object.

The existing marshal/marshal.go skeleton (v0.5) handles None, bool, int (TYPE_INT), float (binary), str, bytes, tuple, and list. The following are missing and must land in v0.8:

  • TYPE_LONG: arbitrary-precision integers encoded as base-2^15 limbs.
  • FLAG_REF (0x80 on any tag): back-reference dedup that lets cyclic or repeated objects be stored once and referred to by index.
  • TYPE_INTERNED / TYPE_ASCII_INTERNED / TYPE_SHORT_ASCII_INTERNED: interned string variants used heavily in code objects.
  • TYPE_CODE: the full code-object round-trip needed to run .pyc files.
  • TYPE_COMPLEX / TYPE_BINARY_COMPLEX: complex number encoding.
  • TYPE_SET / TYPE_FROZENSET: set types.
  • TYPE_DICT: dictionary.
  • .pyc file header: 16 bytes of magic + flags + source stamp + marshal blob, validated on read.

The v0.8 gate is import json; json.dumps({"a": 1}), which requires loading a .pyc file containing a code object for json/__init__.py.

Key CPython functions

C itemLocation
w_object main write dispatchPython/marshal.c:459
w_complex_object FLAG_REF wrapPython/marshal.c:356
w_long TYPE_LONG encodePython/marshal.c:201
w_short_pstring / w_pstringPython/marshal.c:168
r_object main read dispatchPython/marshal.c:1159
r_object FLAG_REF read pathPython/marshal.c:1022
r_long TYPE_LONG decodePython/marshal.c:937
PyMarshal_ReadLastObjectFromFilePython/marshal.c:1807
PyMarshal_WriteObjectToStringPython/marshal.c:1928
PyMarshal_ReadObjectFromStringPython/marshal.c:1876

Type tags (full set, matching Python/marshal.c:58-110):

TagCharNotes
TYPE_NULL'0'
TYPE_NONE'N'
TYPE_FALSE'F'
TYPE_TRUE'T'
TYPE_STOPITER'S'
TYPE_ELLIPSIS'.'
TYPE_INT'i'32-bit signed
TYPE_INT64'I'64-bit signed (Python 2 compat)
TYPE_FLOAT'f'ASCII float
TYPE_BINARY_FLOAT'g'IEEE 754 little-endian double
TYPE_LONG'l'base-2^15 limbs
TYPE_STRING's'bytes
TYPE_TUPLE'('
TYPE_SMALL_TUPLE')'1-byte length prefix
TYPE_LIST'['
TYPE_DICT'{'
TYPE_CODE'c'code object
TYPE_UNICODE'u'
TYPE_SHORT_ASCII'z'1-byte length prefix
TYPE_SHORT_ASCII_INTERNED'Z'interned
TYPE_ASCII'a'4-byte length prefix
TYPE_ASCII_INTERNED'A'interned
TYPE_REF'r'back-reference
TYPE_COMPLEX'y'ASCII re+im
TYPE_BINARY_COMPLEX'Y'two IEEE 754 doubles
TYPE_SET'<'
TYPE_FROZENSET'>'

FLAG_REF = 0x80 ORed onto any tag byte. When writing: if the object has been registered for reference tracking, OR its reference count is greater than 1, set FLAG_REF and record an index in the ref table before recursing. On reading: if FLAG_REF bit is set, pre-allocate a slot in the decoder ref list, fill in the decoded object after construction. TYPE_REF 'r' reads a 4-byte index and returns the stored object.

TYPE_LONG encoding (Python/marshal.c:201): the absolute value is split into base-2^15 digits in little-endian order. The digit count is stored as a signed 32-bit integer; negative count signals a negative number. Each digit is a uint16 in little-endian.

.pyc header (16 bytes, Python/importlib/_bootstrap_external.py):

uint32 magic_number // PYC_MAGIC_NUMBER_TOKEN for 3.14, pinned as const
uint32 bit_flags // bit 0: checked; bit 1: source-hash mode
uint32 field2 // source mtime (timestamp mode) or source_hash[0:4] (hash mode)
uint32 field3 // source size (timestamp mode) or source_hash[4:8] (hash mode)
// followed immediately by the marshalled code object

Go shape

// WriteCodeToString marshals a code object to bytes at the given version.
// CPython: Python/marshal.c:1928 PyMarshal_WriteObjectToString
func WriteCodeToString(code *objects.Code, version int) ([]byte, error)

// ReadCodeFromBytes unmarshals a code object from raw marshal bytes (no pyc header).
// CPython: Python/marshal.c:1876 PyMarshal_ReadObjectFromString
func ReadCodeFromBytes(data []byte) (*objects.Code, error)

// WritePyc writes a complete .pyc file: 16-byte header then marshalled code.
// flags: bit 0 = checked, bit 1 = source-hash mode.
// sourceHash is used when bit 1 is set; mtime+size when bit 1 is clear.
func WritePyc(w io.Writer, code *objects.Code, flags uint32, mtime uint32, size uint32, sourceHash uint64) error

// ReadPyc reads and validates a .pyc file header then returns the code object.
// Returns an error if the magic number does not match or the source stamp is stale.
func ReadPyc(r io.Reader, expectedMtime uint32, expectedSize uint32, expectedSourceHash uint64) (*objects.Code, error)

// Dumps serialises obj to bytes (marshal.dumps equivalent).
// CPython: Python/marshal.c:1928 PyMarshal_WriteObjectToString
func Dumps(obj objects.Object, version int) ([]byte, error)

// Loads deserialises bytes to an object (marshal.loads equivalent).
// CPython: Python/marshal.c:1876 PyMarshal_ReadObjectFromString
func Loads(data []byte) (objects.Object, error)

File mapping

C sourceGo target
Python/marshal.c w_object full dispatchmarshal/write.go (replaces encoder in marshal.go)
Python/marshal.c r_object full dispatchmarshal/read.go (replaces decoder in marshal.go)
Python/marshal.c:201 TYPE_LONG encode/decodemarshal/long.go
Python/marshal.c:356 FLAG_REF writer, ref tablemarshal/refs.go
Python/marshal.c:1022 FLAG_REF readermarshal/refs.go
Python/marshal.c:1807 pyc header readmarshal/pyc.go
Python/marshal.c:1928 pyc header writemarshal/pyc.go
Existing skeletonmarshal/marshal.go (shrinks to package doc + Dumps/Loads wrappers)

Checklist

Status legend: [x] shipped, [ ] pending, [~] partial / scaffold, [n] deferred / not in scope this phase.

TYPE_LONG (marshal/long.go)

  • writeLong: split *big.Int absolute value into base-2^15 uint16 limbs, little-endian. Emit signed 32-bit count (negative = negative number). CPython: Python/marshal.c:201 w_long.
  • readLong: read signed 32-bit count, read |count| uint16 limbs, reconstruct *big.Int, apply sign from count sign. CPython: Python/marshal.c:937 r_long.
  • Zero big.Int encodes as count=0 with no limbs.
  • Negative numbers encode as negative count (not two's complement).
  • Round-trip property: readLong(writeLong(n)) == n for arbitrary *big.Int.

FLAG_REF (marshal/refs.go)

  • writerRefs: map from objects.Object identity to pre-assigned 32-bit index. Populated during w_complex_object when an object is seen more than once or is flagged for interning. CPython: Python/marshal.c:356 w_complex_object.
  • Write path: before writing a compound object, check if it needs a ref slot; if so, OR FLAG_REF (0x80) onto the tag byte and record len(refs) as the slot index.
  • readerRefs: growable slice of objects.Object, indexed by order of first appearance.
  • Read path: if FLAG_REF bit is set on the tag, pre-allocate slot (append(refs, nil)) before decoding; after decoding, store the result at that slot. CPython: Python/marshal.c:1022.
  • TYPE_REF ('r'): read 4-byte little-endian index, return refs[index]. Error if index out of range. CPython: Python/marshal.c:1159 REF arm.

TYPE_INTERNED string variants (marshal/write.go, marshal/read.go)

  • Write TYPE_SHORT_ASCII_INTERNED ('Z') for interned strings shorter than 256 bytes. CPython: Python/marshal.c:459 w_object unicode arm.
  • Write TYPE_ASCII_INTERNED ('A') for interned strings >= 256 bytes. CPython: Python/marshal.c:459 w_object.
  • Write TYPE_SHORT_ASCII ('z') for non-interned ASCII strings shorter than 256 bytes.
  • Write TYPE_ASCII ('a') for non-interned ASCII strings >= 256 bytes.
  • Write TYPE_UNICODE ('u') for non-ASCII strings (UTF-8 encoded, 4-byte length).
  • Read path: TYPE_ASCII_INTERNED and TYPE_SHORT_ASCII_INTERNED add the decoded string to the interned-string table (the decoder's strings slice). CPython: Python/marshal.c:1159 string arms.

TYPE_CODE (marshal/write.go, marshal/read.go)

  • Write TYPE_CODE: emit tag, then all objects.Code fields in the exact CPython order (Python/marshal.c:459 w_object code arm): argcount, posonlyargcount, kwonlyargcount, stacksize, flags, code (bytes), consts (tuple), names (tuple), localsplusnames (tuple), localspluskinds (bytes), filename (str), name (str), qualname (str), firstlineno (int), linetable (bytes), exceptiontable (bytes).
  • Read TYPE_CODE: reconstruct objects.Code from the same field sequence. CPython: Python/marshal.c:1159 code arm.
  • Field order must match CPython 3.14 exactly; any mismatch produces an unreadable .pyc file.
  • CODE_FIELD_COUNT constant (16 for 3.14) guards against accidental field reordering in the writer.

TYPE_COMPLEX (marshal/write.go, marshal/read.go)

  • Write TYPE_BINARY_COMPLEX ('Y'): real then imag, each as IEEE 754 little-endian double (8 bytes each). CPython: Python/marshal.c:459 w_object complex arm.
  • Read TYPE_BINARY_COMPLEX: two 8-byte IEEE 754 doubles. CPython: Python/marshal.c:1159 binary-complex arm.
  • [n] TYPE_COMPLEX ASCII form ('y'): deferred to v0.9 with full complex number support.

TYPE_SET / TYPE_FROZENSET (marshal/write.go, marshal/read.go)

  • Write TYPE_SET ('<'): 4-byte count then each element. CPython: Python/marshal.c:459 w_object set arm.
  • Write TYPE_FROZENSET ('>'): same layout.
  • Read TYPE_SET / TYPE_FROZENSET: count then elements, reconstruct *objects.Set or *objects.FrozenSet. CPython: Python/marshal.c:1159 set/frozenset arm.
  • Empty frozenset uses FLAG_REF dedup (CPython interns it).

TYPE_DICT (marshal/write.go, marshal/read.go)

  • Write TYPE_DICT ('{'): interleaved key, value pairs terminated by a NULL byte (the w_object(NULL) sentinel). CPython: Python/marshal.c:459 w_object dict arm.
  • Read TYPE_DICT: read key/value pairs until NULL sentinel is read. CPython: Python/marshal.c:1159 dict arm.

.pyc header (marshal/pyc.go)

  • MagicNumber constant pinned to CPython 3.14's PYC_MAGIC_NUMBER_TOKEN value (3495 * 10 + 13 = 34963, stored as magic | (b'\r\n' << 16)). CPython: Python/import.c magic generation and Lib/importlib/_bootstrap_external.py.
  • WritePyc: write 4 uint32s (magic, flags, field2, field3) then call WriteCodeToString and write the blob.
  • ReadPyc: read 4 uint32s; reject on magic mismatch (raise ImportError); validate timestamp or source-hash depending on bit 1 of flags; then call ReadCodeFromBytes.
  • Timestamp mode (flags bit 1 clear): field2 = source mtime, field3 = source size. Stale if either mismatches.
  • Source-hash mode (flags bit 1 set): field2/field3 = 64-bit source hash split little-endian. Stale if hash mismatches.
  • Unchecked mode (flags bit 0 clear): skip validation entirely. CPython: Lib/importlib/_bootstrap_external.py _validate_bytecode_header.

Write dispatch completeness (marshal/write.go)

  • w_object handles all 27 type tags listed in the type-tag table.
  • TYPE_NONE, TYPE_TRUE, TYPE_FALSE: already in skeleton.
  • TYPE_INT (32-bit): already in skeleton.
  • TYPE_INT64: emit when int fits in int64 but not int32 (Python 2 compat path). CPython: Python/marshal.c:459.
  • TYPE_BINARY_FLOAT: already in skeleton.
  • TYPE_STRING (bytes): already in skeleton.
  • TYPE_SMALL_TUPLE / TYPE_TUPLE: already in skeleton.
  • TYPE_LIST: already in skeleton.
  • TYPE_LONG: delegate to marshal/long.go writeLong.
  • TYPE_DICT: as above.
  • TYPE_CODE: as above.
  • TYPE_BINARY_COMPLEX: as above.
  • TYPE_SET / TYPE_FROZENSET: as above.
  • TYPE_ELLIPSIS ('.'): singleton write.
  • TYPE_STOPITER ('S'): singleton write.
  • TYPE_NULL ('0'): internal sentinel; never called from user code.

Read dispatch completeness (marshal/read.go)

  • r_object handles all 27 type tags.
  • TYPE_NONE, TYPE_TRUE, TYPE_FALSE: already in skeleton.
  • TYPE_INT: already in skeleton.
  • TYPE_INT64: read 8 bytes, return *objects.Long.
  • TYPE_BINARY_FLOAT: already in skeleton.
  • TYPE_FLOAT ASCII form: parse ASCII float string. CPython: Python/marshal.c:1159 float arm.
  • TYPE_STRING (bytes): already in skeleton.
  • TYPE_SMALL_TUPLE, TYPE_TUPLE: already in skeleton.
  • TYPE_LIST: already in skeleton.
  • TYPE_UNICODE ('u'): 4-byte length, UTF-8 bytes.
  • TYPE_SHORT_ASCII, TYPE_ASCII: as above (non-interned).
  • TYPE_SHORT_ASCII_INTERNED, TYPE_ASCII_INTERNED: intern on read.
  • TYPE_LONG: delegate to marshal/long.go readLong.
  • TYPE_DICT: as above.
  • TYPE_CODE: as above.
  • TYPE_BINARY_COMPLEX: as above.
  • TYPE_SET / TYPE_FROZENSET: as above.
  • TYPE_REF: index into ref slice.
  • TYPE_ELLIPSIS: return objects.Ellipsis.
  • TYPE_STOPITER: return objects.StopIteration singleton.
  • TYPE_NULL: return nil (internal sentinel).

Tests

  • marshal/marshal_test.go: Dumps then Loads round-trip for every supported type tag (None, bool, int, big.Int, float, bytes, str, tuple, list, dict, set, frozenset, code object).
  • marshal/marshal_test.go: byte-for-byte match against python3 -c "import marshal, sys; sys.stdout.buffer.write(marshal.dumps(42))" for int, float, str, bytes, and a simple tuple.
  • marshal/marshal_test.go: code-object round-trip: compile a trivial Python source, marshal the code object, unmarshal it, verify co_consts, co_names, co_filename, co_firstlineno.
  • marshal/long_test.go: round-trip for big.Int values 0, 1, -1, 2^15, 2^31, 2^63, 2^128, -(2^128).
  • marshal/refs_test.go: object shared between two tuple slots is written once with FLAG_REF and read back as the same pointer.
  • marshal/pyc_test.go: WritePyc then ReadPyc round-trip in timestamp mode; source-hash mode; stale-magic rejection; stale-mtime rejection.

Cross-references

  • Code object definition: 1687 (objects/code.go). The marshal writer reads every field of objects.Code; the reader constructs one.
  • Import system that reads .pyc files: 1691 (imp/loader.go calls marshal.ReadPyc).
  • Compile pipeline that produces Code objects: 1620 series.
  • objects/set.go, objects/dict.go: 1680, 1681 (needed by TYPE_SET, TYPE_FROZENSET, TYPE_DICT read paths).

Out of scope

  • TYPE_COMPLEX ASCII form ('y'). Lands in v0.9 with full complex numbers.
  • TYPE_FRAME, TYPE_ELLIPSIS, TYPE_STOPITER object construction. The tags are written; the resulting objects land with their respective types in later phases.
  • Pickle compatibility differences for marshal version < 4 interning behaviour.
  • The marshal Python module surface (marshal.version, marshal.MAGIC_NUMBER). Lands with the stdlib bridge.
  • Free-threaded write-barrier integration. gopy is single-threaded through v0.10.