1692. gopy codecs
What we are porting
Python/codecs.c (1708 lines): the codec registry and the
encode/decode dispatch layer that str.encode(), bytes.decode(),
and the import system's source-file reader all call into.
CPython's codec system is lookup-driven. Encodings are registered as
search functions via codecs.register(search_fn). Each search
function is called with a normalised encoding name and returns either
None (not handled) or a CodecInfo four-tuple
(encode, decode, streamreader, streamwriter); the incremental encoder
and decoder are carried as attributes on the CodecInfo object.
For gopy v0.8, three built-in codecs must be available before the
import source reader can open .py files:
- utf-8 / utf_8 / utf8 (any normalisation): the default source encoding.
- ascii: for source files declared with # -*- coding: ascii -*-.
- latin-1 / iso-8859-1 / latin_1: byte-identity mapping.
The codec registry lives on the interpreter state as
interp->codecs.search_path (a list of callables).
Key CPython functions
| Function | Location |
|---|---|
| PyCodec_Register | Python/codecs.c:31 |
| _PyCodec_Lookup | Python/codecs.c:141 |
| _PyCodec_LookupBuiltin | Python/codecs.c:92 |
| codec_register (Python-level) | Python/codecs.c:57 |
| PyCodec_Encode | Python/codecs.c:491 |
| PyCodec_Decode | Python/codecs.c:504 |
| _PyCodec_EncodeInternal | Python/codecs.c:435 |
| _PyCodec_DecodeInternal | Python/codecs.c:458 |
| PyCodec_RegisterError | Python/codecs.c:625 |
| PyCodec_LookupError | Python/codecs.c:654 |
| PyCodec_StrictErrors | Python/codecs.c:785 |
| PyCodec_IgnoreErrors | Python/codecs.c:812 |
| PyCodec_ReplaceErrors | Python/codecs.c:882 |
| PyCodec_XMLCharRefReplaceErrors | Python/codecs.c:960 |
| PyCodec_BackslashReplaceErrors | Python/codecs.c:1058 |
| PyCodec_NameReplaceErrors | Python/codecs.c:1131 |
| normalise_encoding | Python/codecs.c:125 |
Name normalisation
Python/codecs.c:125 normalise_encoding: convert to lower-case,
replace hyphens and spaces with underscores, collapse leading/trailing
underscores. "UTF-8" normalises to "utf_8". "iso-8859-1"
normalises to "iso_8859_1".
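The rule above can be sketched in a few lines of Go. This is a minimal illustration, not the shipped implementation: it assumes ASCII-only lower-casing, and it leaves the underscore collapsing step out because its exact semantics are not pinned down here.

```go
package main

import "strings"

// normalizeEncoding lower-cases ASCII letters and maps hyphens and
// spaces to underscores. Assumption: only ASCII letters are folded,
// and leading/trailing underscores are left untouched in this sketch.
func normalizeEncoding(name string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == '-' || r == ' ':
			return '_'
		case r >= 'A' && r <= 'Z':
			return r + ('a' - 'A')
		default:
			return r
		}
	}, name)
}
```

With this sketch, normalizeEncoding("UTF-8") yields "utf_8" and normalizeEncoding("iso-8859-1") yields "iso_8859_1", matching the two examples given above.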
Go shape
// CodecInfo holds the encode and decode functions for one codec.
// Mirrors the essential two slots of the CPython four-tuple CodecInfo.
// CPython: Python/codecs.c:141 _PyCodec_Lookup return value.
type CodecInfo struct {
	// Encode converts a string to bytes.
	// Returns (result object, input length consumed, error).
	Encode func(obj objects.Object, errors string) (objects.Object, int, error)
	// Decode converts bytes to a string.
	// Returns (result object, bytes consumed, error).
	Decode func(obj objects.Object, errors string) (objects.Object, int, error)
}

// SearchFunc is a function that maps a normalised encoding name to
// a CodecInfo, or returns (nil, false) if it does not handle that name.
// Mirrors a Python-level search_function callable.
type SearchFunc func(encoding string) (*CodecInfo, bool)

// Register appends a search function to the codec search path.
// CPython: Python/codecs.c:31 PyCodec_Register
func Register(fn SearchFunc) error

// Lookup normalises the encoding name and walks the registered search
// functions until one returns a CodecInfo.
// Returns LookupError if no codec is found.
// CPython: Python/codecs.c:141 _PyCodec_Lookup
func Lookup(encoding string) (*CodecInfo, error)

// Encode encodes obj using the named codec and error handler.
// CPython: Python/codecs.c:491 PyCodec_Encode
func Encode(obj objects.Object, encoding, errors string) (objects.Object, error)

// Decode decodes obj using the named codec and error handler.
// CPython: Python/codecs.c:504 PyCodec_Decode
func Decode(obj objects.Object, encoding, errors string) (objects.Object, error)

// RegisterError registers a named error handler.
// CPython: Python/codecs.c:625 PyCodec_RegisterError
func RegisterError(name string, handler ErrorHandler) error

// LookupError returns the named error handler.
// Returns LookupError if not found.
// CPython: Python/codecs.c:654 PyCodec_LookupError
func LookupError(name string) (ErrorHandler, error)

// ErrorHandler is the Go type for a codec error callback.
// It receives the encode/decode error and returns a replacement
// (result string or bytes, resume position).
type ErrorHandler func(err error) (replacement objects.Object, pos int, rerr error)
File mapping
| C source | Go target |
|---|---|
| Python/codecs.c:31 PyCodec_Register | codecs/registry.go |
| Python/codecs.c:141 _PyCodec_Lookup | codecs/registry.go |
| Python/codecs.c:125 normalise_encoding | codecs/registry.go |
| Python/codecs.c:491 PyCodec_Encode | codecs/codec.go |
| Python/codecs.c:504 PyCodec_Decode | codecs/codec.go |
| Python/codecs.c:435 _PyCodec_EncodeInternal | codecs/codec.go |
| Python/codecs.c:458 _PyCodec_DecodeInternal | codecs/codec.go |
| Python/codecs.c:625 PyCodec_RegisterError | codecs/errors.go |
| Python/codecs.c:654 PyCodec_LookupError | codecs/errors.go |
| Python/codecs.c:785 PyCodec_StrictErrors | codecs/errors.go |
| Python/codecs.c:812 PyCodec_IgnoreErrors | codecs/errors.go |
| Python/codecs.c:882 PyCodec_ReplaceErrors | codecs/errors.go |
| Python/codecs.c:960 PyCodec_XMLCharRefReplaceErrors | codecs/errors.go |
| Python/codecs.c:1058 PyCodec_BackslashReplaceErrors | codecs/errors.go |
| Python/codecs.c:1131 PyCodec_NameReplaceErrors | codecs/errors.go |
| Built-in utf-8, ascii, latin-1 implementations | codecs/builtin.go |
Checklist
Status legend: [x] shipped, [ ] pending, [~] partial / scaffold,
[n] deferred / not in scope this phase.
codecs/registry.go
- searchPath: package-level slice of SearchFunc, protected by a mutex for future thread safety. Initially empty; seeded by init(). CPython: Python/codecs.c:141 interp->codecs.search_path.
- normalizeEncoding(name string) string: lower-case, replace "-" and " " with "_". CPython: Python/codecs.c:125 normalise_encoding.
- lookupCache: map[string]*CodecInfo keyed by normalised name, cleared when a new search function is registered. CPython: Python/codecs.c:141 interp->codecs.search_cache.
- Register(fn SearchFunc) error: append to searchPath, clear lookupCache. CPython: Python/codecs.c:31 PyCodec_Register.
- Lookup(encoding string) (*CodecInfo, error): normalise name, check lookupCache, walk searchPath, populate cache on hit, return LookupError on miss. CPython: Python/codecs.c:141 _PyCodec_Lookup.
- init() in codecs/registry.go: call Register with the built-in search function from codecs/builtin.go that handles utf_8, ascii, latin_1 (and their common aliases).
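The registry items above might fit together roughly as follows. This is a sketch under stated assumptions: CodecInfo is reduced to a placeholder struct so the block is self-contained, errors are plain Go errors rather than Python exception objects, and the search functions are called outside the lock so that one of them can call Lookup itself.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// CodecInfo is a placeholder for the spec's CodecInfo (assumption:
// only a name is kept here to keep the sketch self-contained).
type CodecInfo struct{ Name string }

// SearchFunc maps a normalised name to a codec, or reports a miss.
type SearchFunc func(encoding string) (*CodecInfo, bool)

var (
	mu          sync.Mutex
	searchPath  []SearchFunc
	lookupCache = map[string]*CodecInfo{}
)

// Register appends fn to the search path and clears the cache.
func Register(fn SearchFunc) error {
	if fn == nil {
		return errors.New("TypeError: argument must be callable")
	}
	mu.Lock()
	defer mu.Unlock()
	searchPath = append(searchPath, fn)
	lookupCache = map[string]*CodecInfo{}
	return nil
}

// Lookup normalises the name, consults the cache, then walks a
// snapshot of the search path and caches the first hit.
func Lookup(encoding string) (*CodecInfo, error) {
	name := strings.NewReplacer("-", "_", " ", "_").Replace(strings.ToLower(encoding))

	mu.Lock()
	if info, ok := lookupCache[name]; ok {
		mu.Unlock()
		return info, nil
	}
	path := append([]SearchFunc(nil), searchPath...)
	mu.Unlock()

	for _, fn := range path {
		if info, ok := fn(name); ok && info != nil {
			mu.Lock()
			lookupCache[name] = info
			mu.Unlock()
			return info, nil
		}
	}
	return nil, fmt.Errorf("LookupError: unknown encoding: %s", encoding)
}
```

Because hits are cached by normalised name, a second Lookup for "utf_8" after Lookup("UTF-8") returns the same pointer without calling any search function again; misses are not cached in this sketch.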
codecs/errors.go
- errorHandlers: map[string]ErrorHandler, seeded by init(). CPython: Python/codecs.c:625 interp->codecs.error_registry.
- RegisterError(name string, handler ErrorHandler) error: insert into errorHandlers. CPython: Python/codecs.c:625 PyCodec_RegisterError.
- LookupError(name string) (ErrorHandler, error): look up errorHandlers[name], return LookupError if absent. CPython: Python/codecs.c:654 PyCodec_LookupError.
- strictErrors: raise UnicodeEncodeError or UnicodeDecodeError immediately. CPython: Python/codecs.c:785 PyCodec_StrictErrors.
- ignoreErrors: return an empty string / empty bytes replacement and advance past the bad range. CPython: Python/codecs.c:812 PyCodec_IgnoreErrors.
- replaceErrors (encode): replace each un-encodable codepoint with ? (0x3F). (Decode): replace each bad byte sequence with U+FFFD. CPython: Python/codecs.c:882 PyCodec_ReplaceErrors.
- xmlcharrefreplace: replace un-encodable codepoints with &#N; decimal entity references. CPython: Python/codecs.c:960 PyCodec_XMLCharRefReplaceErrors.
- backslashreplace: replace un-encodable or un-decodable ranges with \xNN, \uNNNN, or \UNNNNNNNN escape sequences. CPython: Python/codecs.c:1058 PyCodec_BackslashReplaceErrors.
- namereplace: replace un-encodable codepoints with \N{UNICODE NAME}. CPython: Python/codecs.c:1131 PyCodec_NameReplaceErrors.
- init() in codecs/errors.go: seed errorHandlers with "strict", "ignore", "replace", "xmlcharrefreplace", "backslashreplace", "namereplace", "surrogateescape" (stub), "surrogatepass" (stub).
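The replacement text produced by xmlcharrefreplace and backslashreplace for a single code point can be illustrated as below. The helper names xmlCharRef and backslashEscape are hypothetical and exist only for this sketch; a real handler would also receive the error object and return a resume position.

```go
package main

import "fmt"

// xmlCharRef renders r as a decimal numeric character reference,
// the form the xmlcharrefreplace handler emits.
func xmlCharRef(r rune) string {
	return fmt.Sprintf("&#%d;", r)
}

// backslashEscape renders r as \xNN, \uNNNN or \UNNNNNNNN, choosing
// the shortest form that fits the code point (lower-case hex digits).
func backslashEscape(r rune) string {
	switch {
	case r <= 0xFF:
		return fmt.Sprintf(`\x%02x`, r)
	case r <= 0xFFFF:
		return fmt.Sprintf(`\u%04x`, r)
	default:
		return fmt.Sprintf(`\U%08x`, r)
	}
}
```

For U+0100 these helpers give "&#256;" and `\u0100` respectively.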
codecs/codec.go
- Encode(obj objects.Object, encoding, errors string) (objects.Object, error): call Lookup(encoding), then info.Encode(obj, errors). CPython: Python/codecs.c:491 PyCodec_Encode via Python/codecs.c:435 _PyCodec_EncodeInternal.
- Decode(obj objects.Object, encoding, errors string) (objects.Object, error): call Lookup(encoding), then info.Decode(obj, errors). CPython: Python/codecs.c:504 PyCodec_Decode via Python/codecs.c:458 _PyCodec_DecodeInternal.
- Both entry points must call LookupError(errors) if the codec signals a partial error and re-invoke the error handler in a loop until all input is consumed or an error is raised. CPython: Python/codecs.c:435 _PyCodec_EncodeInternal handler loop.
- Return type is the first element of the codec result tuple (the encoded/decoded object); discard the integer consumed count at this level. CPython: Python/codecs.c:491 result extraction.
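The dispatch described above (look up the codec, call it, keep only the first element of the result) could look like this in miniature. Everything here is a simplified stand-in: plain string and []byte replace objects.Object, a hypothetical map replaces Lookup, the encoding name is not normalised, and the three error modes are inlined in the codec rather than resolved through LookupError.

```go
package main

import (
	"errors"
	"fmt"
)

// codecFunc mirrors the shape of CodecInfo.Encode with plain Go types:
// output, input length consumed (in code points), error.
type codecFunc func(s, mode string) ([]byte, int, error)

// encoders is a hypothetical stand-in for the registry's Lookup.
var encoders = map[string]codecFunc{"ascii": asciiEncode}

// asciiEncode handles "strict", "ignore" and "replace" inline.
func asciiEncode(s, mode string) ([]byte, int, error) {
	out := make([]byte, 0, len(s))
	n := 0 // code points consumed so far
	for _, r := range s {
		if r <= 0x7F {
			out = append(out, byte(r))
		} else {
			switch mode {
			case "strict":
				return nil, n, fmt.Errorf(
					"UnicodeEncodeError: 'ascii' codec can't encode character %U in position %d", r, n)
			case "ignore":
				// drop the offending code point
			case "replace":
				out = append(out, '?')
			default:
				return nil, n, errors.New("LookupError: unknown error handler name " + mode)
			}
		}
		n++
	}
	return out, n, nil
}

// Encode looks up the codec, invokes it, and discards the consumed
// count, returning only the encoded bytes.
func Encode(s, encoding, mode string) ([]byte, error) {
	enc, ok := encoders[encoding]
	if !ok {
		return nil, errors.New("LookupError: unknown encoding: " + encoding)
	}
	out, _, err := enc(s, mode)
	return out, err
}
```

Note that the Go literal "\xff" is a raw byte, not U+00FF; a Go test of the ascii cases should spell the input as "\u00ff".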
codecs/builtin.go
- utf8Codec: CodecInfo backed by Go's unicode/utf8 package.
  - Encode: []byte(s) for a valid UTF-8 *objects.Str; invoke error handler for surrogates depending on the errors mode.
  - Decode: string(b) after validating UTF-8 with utf8.Valid; invoke error handler on invalid byte sequences.
- asciiCodec: reject any byte > 0x7F on decode, any codepoint > 0x7F on encode. Invoke the error handler for offending ranges.
- latin1Codec: byte-identity mapping (codepoint N maps to byte N for 0..255). No error handler is needed for decode, since every byte is a valid codepoint; on encode, invoke the error handler for codepoints > 0xFF.
- builtinSearch(name string) (*CodecInfo, bool): return the appropriate CodecInfo for the following normalised names: utf_8, utf8, u8, ascii, us_ascii, 646, latin_1, latin1, iso8859_1, iso_8859_1, iso_8859_1_, 8859. Return (nil, false) for anything else.
- init() in codecs/builtin.go: call Register(builtinSearch).
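As an illustration of the utf8Codec decode path, the sketch below takes the utf8.Valid fast path and otherwise walks the input with utf8.DecodeRune. It is written under assumptions: plain Go types replace objects.Bytes and objects.Str, the error modes are inlined, and each invalid byte is handled on its own, which may yield a different number of U+FFFD replacements than CPython for truncated multi-byte sequences.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// decodeUTF8 decodes b under the "strict", "ignore" or "replace" mode.
func decodeUTF8(b []byte, mode string) (string, error) {
	if utf8.Valid(b) {
		return string(b), nil // fast path: nothing to repair
	}
	var sb strings.Builder
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		if r == utf8.RuneError && size == 1 {
			// Invalid byte at position i.
			switch mode {
			case "strict":
				return "", fmt.Errorf(
					"UnicodeDecodeError: 'utf-8' codec can't decode byte 0x%02x in position %d", b[i], i)
			case "ignore":
				// skip the byte
			case "replace":
				sb.WriteRune(utf8.RuneError) // U+FFFD
			default:
				return "", fmt.Errorf("LookupError: unknown error handler name %q", mode)
			}
			i++
			continue
		}
		sb.WriteRune(r)
		i += size
	}
	return sb.String(), nil
}
```

A validly encoded U+FFFD in the input decodes with size 3, so it passes through unchanged instead of being mistaken for an error.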
Surface guarantees
- codecs.Encode(objects.NewStr("hello"), "utf-8", "strict") returns b"hello" as *objects.Bytes.
- codecs.Decode(objects.NewBytes([]byte{104,101,108,108,111}), "utf-8", "strict") returns "hello" as *objects.Str.
- Encoding "\xff" with "ascii" and errors="strict" returns a UnicodeEncodeError.
- Encoding "\xff" with "ascii" and errors="ignore" returns b"".
- Encoding "\xff" with "ascii" and errors="replace" returns b"?".
- Decoding b"\xff" with "utf-8" and errors="replace" returns "�".
- codecs.Lookup("UTF-8") and codecs.Lookup("utf_8") and codecs.Lookup("utf8") all return the same *CodecInfo.
- RegisterError / LookupError round-trip: a custom handler registered under "myhandler" is returned by LookupError("myhandler").
- Registering a new search function clears the lookup cache so subsequent Lookup calls invoke all registered functions.
Tests
- codecs/codecs_test.go:
  - Encode("hello", "utf-8", "strict") matches b"hello".
  - Encode("hello", "ascii", "strict") matches b"hello".
  - Encode("é", "latin-1", "strict") matches b"\xe9".
  - Decode(b"hello", "utf-8", "strict") matches "hello".
  - Encode("Ā", "ascii", "strict") returns UnicodeEncodeError.
  - Encode("Ā", "ascii", "ignore") returns b"".
  - Encode("Ā", "ascii", "replace") returns b"?".
  - Encode("Ā", "ascii", "xmlcharrefreplace") returns b"&#256;".
  - Encode("Ā", "ascii", "backslashreplace") returns b"\\u0100".
  - Decode(b"\xff", "utf-8", "replace") returns "�".
  - Decode(b"\xff", "utf-8", "ignore") returns "".
- codecs/errors_test.go:
  - RegisterError("myhandler", fn) then LookupError("myhandler") returns fn.
  - LookupError("nonexistent") returns an error.
  - LookupError("strict") returns the built-in strict handler.
- codecs/registry_test.go:
  - Lookup("UTF-8") succeeds.
  - Lookup("utf_8") succeeds and returns the same codec as "UTF-8".
  - Lookup("unknown-encoding-xyz") returns an error.
  - Custom Register adds a new codec; subsequent Lookup finds it.
  - Registering a new function clears the cache; re-lookup calls the new function.
Cross-references
- str.encode and bytes.decode methods on objects.Str and objects.Bytes call codecs.Encode / codecs.Decode: 1676, 1677.
- Import source reader uses codecs.Decode("utf-8", ...) to read .py files: 1691 (imp/loader.go).
- UnicodeEncodeError / UnicodeDecodeError exception types: 1686.
- The codecs stdlib module (Python surface) bridges through the import system: 1691. That bridge is not part of this spec.
Out of scope
- Incremental encoder / decoder interface (IncrementalEncoder, IncrementalDecoder). Needed by io.TextIOWrapper; deferred to the io/ port.
- StreamReader / StreamWriter. Same deferral.
- Charmap-based codecs (cp1252, cp437, cp850, etc.). Deferred to the encodings/ stdlib bridge.
- UTF-16 and UTF-32 with BOM handling. Deferred.
- surrogateescape and surrogatepass full implementations. Stubbed (raise NotImplementedError) until the io/ port needs them.
- The encodings/ stdlib package lookup path. That requires the import system (1691) to load Python source; it is a follow-on task after v0.8.
- The codecs Python module's open(), iterencode(), iterdecode() functions. Stdlib bridge, not part of Python/codecs.c.