Skip to main content

1692. gopy codecs

What we are porting

Python/codecs.c (1708 lines): the codec registry and the encode/decode dispatch layer that str.encode(), bytes.decode(), and the import system's source-file reader all call into.

CPython's codec system is lookup-driven. Encodings are registered as search functions via codecs.register(search_fn). Each search function is called with a normalised encoding name and returns either None (not handled) or a CodecInfo four-tuple (encode, decode, incrementalencoder, incrementaldecoder).

For gopy v0.8, three built-in codecs must be available before the import source reader can open .py files:

  • utf-8 / utf_8 / utf8 (any normalisation): the default source encoding.
  • ascii: for source files declared with # -*- coding: ascii -*-.
  • latin-1 / iso-8859-1 / latin_1: byte-identity mapping.

The codec registry lives on the interpreter state as interp->codecs.search_path (a list of callables).

Key CPython functions

FunctionLocation
PyCodec_RegisterPython/codecs.c:31
_PyCodec_LookupPython/codecs.c:141
_PyCodec_LookupBuiltinPython/codecs.c:92
codec_register (Python-level)Python/codecs.c:57
PyCodec_EncodePython/codecs.c:491
PyCodec_DecodePython/codecs.c:504
_PyCodec_EncodeInternalPython/codecs.c:435
_PyCodec_DecodeInternalPython/codecs.c:458
PyCodec_RegisterErrorPython/codecs.c:625
PyCodec_LookupErrorPython/codecs.c:654
PyCodec_StrictErrorsPython/codecs.c:785
PyCodec_IgnoreErrorsPython/codecs.c:812
PyCodec_ReplaceErrorsPython/codecs.c:882
PyCodec_XMLCharRefReplaceErrorsPython/codecs.c:960
PyCodec_BackslashReplaceErrorsPython/codecs.c:1058
PyCodec_NameReplaceErrorsPython/codecs.c:1131
normalise_encodingPython/codecs.c:125

Name normalisation

Python/codecs.c:125 normalise_encoding: convert to lower-case, replace hyphens and spaces with underscores, collapse leading/trailing underscores. "UTF-8" normalises to "utf_8". "iso-8859-1" normalises to "iso_8859_1".

Go shape

// CodecInfo holds the encode and decode functions for one codec.
// Mirrors the essential two slots of the CPython four-tuple CodecInfo.
// CPython: Python/codecs.c:141 _PyCodec_Lookup return value.
type CodecInfo struct {
// Encode converts a string to bytes.
// Returns (result object, bytes consumed, error).
Encode func(obj objects.Object, errors string) (objects.Object, int, error)
// Decode converts bytes to a string.
// Returns (result object, bytes consumed, error).
Decode func(obj objects.Object, errors string) (objects.Object, int, error)
}

// SearchFunc is a function that maps a normalised encoding name to
// a CodecInfo, or returns (nil, false) if it does not handle that name.
// Mirrors a Python-level search_function callable.
type SearchFunc func(encoding string) (*CodecInfo, bool)

// Register appends a search function to the codec search path.
// CPython: Python/codecs.c:31 PyCodec_Register
func Register(fn SearchFunc) error

// Lookup normalises the encoding name and walks the registered search
// functions until one returns a CodecInfo.
// Returns LookupError if no codec is found.
// CPython: Python/codecs.c:141 _PyCodec_Lookup
func Lookup(encoding string) (*CodecInfo, error)

// Encode encodes obj using the named codec and error handler.
// CPython: Python/codecs.c:491 PyCodec_Encode
func Encode(obj objects.Object, encoding, errors string) (objects.Object, error)

// Decode decodes obj using the named codec and error handler.
// CPython: Python/codecs.c:504 PyCodec_Decode
func Decode(obj objects.Object, encoding, errors string) (objects.Object, error)

// RegisterError registers a named error handler.
// CPython: Python/codecs.c:625 PyCodec_RegisterError
func RegisterError(name string, handler ErrorHandler) error

// LookupError returns the named error handler.
// Returns LookupError if not found.
// CPython: Python/codecs.c:654 PyCodec_LookupError
func LookupError(name string) (ErrorHandler, error)

// ErrorHandler is the Go type for a codec error callback.
// It receives the encode/decode error and returns a replacement
// (result string or bytes, resume position).
type ErrorHandler func(err error) (replacement objects.Object, pos int, rerr error)

File mapping

C sourceGo target
Python/codecs.c:31 PyCodec_Registercodecs/registry.go
Python/codecs.c:141 _PyCodec_Lookupcodecs/registry.go
Python/codecs.c:125 normalise_encodingcodecs/registry.go
Python/codecs.c:491 PyCodec_Encodecodecs/codec.go
Python/codecs.c:504 PyCodec_Decodecodecs/codec.go
Python/codecs.c:435 _PyCodec_EncodeInternalcodecs/codec.go
Python/codecs.c:458 _PyCodec_DecodeInternalcodecs/codec.go
Python/codecs.c:625 PyCodec_RegisterErrorcodecs/errors.go
Python/codecs.c:654 PyCodec_LookupErrorcodecs/errors.go
Python/codecs.c:785 PyCodec_StrictErrorscodecs/errors.go
Python/codecs.c:812 PyCodec_IgnoreErrorscodecs/errors.go
Python/codecs.c:882 PyCodec_ReplaceErrorscodecs/errors.go
Python/codecs.c:960 PyCodec_XMLCharRefReplaceErrorscodecs/errors.go
Python/codecs.c:1058 PyCodec_BackslashReplaceErrorscodecs/errors.go
Python/codecs.c:1131 PyCodec_NameReplaceErrorscodecs/errors.go
Built-in utf-8, ascii, latin-1 implementationscodecs/builtin.go

Checklist

Status legend: [x] shipped, [ ] pending, [~] partial / scaffold, [n] deferred / not in scope this phase.

codecs/registry.go

  • searchPath: package-level slice of SearchFunc, protected by a mutex for future thread safety. Initially empty; seeded by init(). CPython: Python/codecs.c:141 interp->codecs.search_path.
  • normalizeEncoding(name string) string: lower-case, replace - and with _. CPython: Python/codecs.c:125 normalise_encoding.
  • lookupCache: map[string]*CodecInfo keyed by normalised name, cleared when a new search function is registered. CPython: Python/codecs.c:141 interp->codecs.search_cache.
  • Register(fn SearchFunc) error: append to searchPath, clear lookupCache. CPython: Python/codecs.c:31 PyCodec_Register.
  • Lookup(encoding string) (*CodecInfo, error): normalise name, check lookupCache, walk searchPath, populate cache on hit, return LookupError on miss. CPython: Python/codecs.c:141 _PyCodec_Lookup.
  • init() in codecs/registry.go: call Register with the built-in search function from codecs/builtin.go that handles utf_8, ascii, latin_1 (and their common aliases).

codecs/errors.go

  • errorHandlers: map[string]ErrorHandler, seeded by init(). CPython: Python/codecs.c:625 interp->codecs.error_registry.
  • RegisterError(name string, handler ErrorHandler) error: insert into errorHandlers. CPython: Python/codecs.c:625 PyCodec_RegisterError.
  • LookupError(name string) (ErrorHandler, error): look up errorHandlers[name], return LookupError if absent. CPython: Python/codecs.c:654 PyCodec_LookupError.
  • strictErrors: raise UnicodeEncodeError or UnicodeDecodeError immediately. CPython: Python/codecs.c:785 PyCodec_StrictErrors.
  • ignoreErrors: return an empty string / empty bytes replacement and advance past the bad range. CPython: Python/codecs.c:812 PyCodec_IgnoreErrors.
  • replaceErrors (encode): replace each un-encodable codepoint with ? (0x3F). (Decode): replace each bad byte sequence with U+FFFD. CPython: Python/codecs.c:882 PyCodec_ReplaceErrors.
  • xmlcharrefreplace: replace un-encodable codepoints with &#N; decimal entity references. CPython: Python/codecs.c:960 PyCodec_XMLCharRefReplaceErrors.
  • backslashreplace: replace un-encodable or un-decodable ranges with \xNN, \uNNNN, or \UNNNNNNNN escape sequences. CPython: Python/codecs.c:1058 PyCodec_BackslashReplaceErrors.
  • namereplace: replace un-encodable codepoints with \N{UNICODE NAME}. CPython: Python/codecs.c:1131 PyCodec_NameReplaceErrors.
  • init() in codecs/errors.go: seed errorHandlers with "strict", "ignore", "replace", "xmlcharrefreplace", "backslashreplace", "namereplace", "surrogateescape" (stub), "surrogatepass" (stub).

codecs/codec.go

  • Encode(obj objects.Object, encoding, errors string) (objects.Object, error): call Lookup(encoding), then info.Encode(obj, errors). CPython: Python/codecs.c:491 PyCodec_Encode via Python/codecs.c:435 _PyCodec_EncodeInternal.
  • Decode(obj objects.Object, encoding, errors string) (objects.Object, error): call Lookup(encoding), then info.Decode(obj, errors). CPython: Python/codecs.c:504 PyCodec_Decode via Python/codecs.c:458 _PyCodec_DecodeInternal.
  • Both entry points must call LookupError(errors) if the codec signals a partial error and re-invoke the error handler in a loop until all input is consumed or an error is raised. CPython: Python/codecs.c:435 _PyCodec_EncodeInternal handler loop.
  • Return type is the first element of the codec result tuple (the encoded/decoded object); discard the integer consumed count at this level. CPython: Python/codecs.c:491 result extraction.

codecs/builtin.go

  • utf8Codec: CodecInfo backed by Go's unicode/utf8 package.
    • Encode: []byte(s) for a valid UTF-8 *objects.Str; invoke error handler for surrogates depending on the errors mode.
    • Decode: string(b) after validating UTF-8 with utf8.Valid; invoke error handler on invalid byte sequences.
  • asciiCodec: reject any byte > 0x7F on decode, any codepoint > 0x7F on encode. Invoke the error handler for offending ranges.
  • latin1Codec: byte-identity mapping (codepoint N maps to byte N for 0..255). No error handler needed for encode; every Python codepoint <= 255 is representable.
  • builtinSearch(name string) (*CodecInfo, bool): return the appropriate CodecInfo for the following normalised names: utf_8, utf8, u8, ascii, us_ascii, 646, latin_1, latin1, iso8859_1, iso_8859_1, iso_8859_1_, 8859. Return (nil, false) for anything else.
  • init() in codecs/builtin.go: call Register(builtinSearch).

Surface guarantees

  • codecs.Encode(objects.NewStr("hello"), "utf-8", "strict") returns b"hello" as *objects.Bytes.
  • codecs.Decode(objects.NewBytes([]byte{104,101,108,108,111}), "utf-8", "strict") returns "hello" as *objects.Str.
  • Encoding "\xff" with "ascii" and errors="strict" returns a UnicodeEncodeError.
  • Encoding "\xff" with "ascii" and errors="ignore" returns b"".
  • Encoding "\xff" with "ascii" and errors="replace" returns b"?".
  • Decoding b"\xff" with "utf-8" and errors="replace" returns "�".
  • codecs.Lookup("UTF-8") and codecs.Lookup("utf_8") and codecs.Lookup("utf8") all return the same *CodecInfo.
  • RegisterError / LookupError round-trip: a custom handler registered under "myhandler" is returned by LookupError("myhandler").
  • Registering a new search function clears the lookup cache so subsequent Lookup calls invoke all registered functions.

Tests

  • codecs/codecs_test.go:
    • Encode("hello", "utf-8", "strict") matches b"hello".
    • Encode("hello", "ascii", "strict") matches b"hello".
    • Encode("é", "latin-1", "strict") matches b"\xe9".
    • Decode(b"hello", "utf-8", "strict") matches "hello".
    • Encode("Ā", "ascii", "strict") returns UnicodeEncodeError.
    • Encode("Ā", "ascii", "ignore") returns b"".
    • Encode("Ā", "ascii", "replace") returns b"?".
    • Encode("Ā", "ascii", "xmlcharrefreplace") returns b"&#256;".
    • Encode("Ā", "ascii", "backslashreplace") returns b"\\u0100".
    • Decode(b"\xff", "utf-8", "replace") returns "�".
    • Decode(b"\xff", "utf-8", "ignore") returns "".
  • codecs/errors_test.go:
    • RegisterError("myhandler", fn) then LookupError("myhandler") returns fn.
    • LookupError("nonexistent") returns an error.
    • LookupError("strict") returns the built-in strict handler.
  • codecs/registry_test.go:
    • Lookup("UTF-8") succeeds.
    • Lookup("utf_8") succeeds and returns same codec as "UTF-8".
    • Lookup("unknown-encoding-xyz") returns an error.
    • Custom Register adds a new codec; subsequent Lookup finds it.
    • Registering a new function clears the cache; re-lookup calls the new function.

Cross-references

  • str.encode and bytes.decode methods on objects.Str and objects.Bytes call codecs.Encode / codecs.Decode: 1676, 1677.
  • Import source reader uses codecs.Decode("utf-8", ...) to read .py files: 1691 (imp/loader.go).
  • UnicodeEncodeError / UnicodeDecodeError exception types: 1686.
  • The codecs stdlib module (Python surface) bridges through the import system: 1691. That bridge is not part of this spec.

Out of scope

  • Incremental encoder / decoder interface (IncrementalEncoder, IncrementalDecoder). Needed by io.TextIOWrapper; deferred to the io/ port.
  • StreamReader / StreamWriter. Same deferral.
  • Charmap-based codecs (cp1252, cp437, cp850, etc.). Deferred to the encodings/ stdlib bridge.
  • UTF-16 and UTF-32 with BOM handling. Deferred.
  • surrogateescape and surrogatepass full implementations. Stubbed (raise NotImplementedError) until the io/ port needs them.
  • The encodings/ stdlib package lookup path. That requires the import system (1691) to load Python source; it is a follow-on task after v0.8.
  • The codecs Python module's open(), iterencode(), iterdecode() functions. Stdlib bridge, not part of Python/codecs.c.