`Lib/codecs.py`

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py

codecs.py is the Python wrapper around the codec registry that lives in Python/codecs.c. Its first act is from _codecs import *, which pulls in the C implementations of register, lookup, encode, decode, register_error, lookup_error, and the various *_encode/*_decode functions. Everything else in the file is pure Python.

The module has four distinct layers. First, BOM constants (lines 44-78) are byte-string literals for UTF-8/16/32 signatures. Second, the codec class hierarchy (lines 83-882) provides CodecInfo, Codec, IncrementalEncoder, IncrementalDecoder, StreamWriter, StreamReader, StreamReaderWriter, and StreamRecoder. Third, shortcut helpers (lines 886-1078) wrap lookup into single-purpose functions. Fourth, named error-handler aliases (lines 1114-1119) bind the six predefined handler names to their callables.
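As a rough illustration of the first layer, the sketch below checks how the native-endian BOM alias resolves and strips a UTF-8 signature from a byte string. The helper name strip_utf8_bom is hypothetical and not part of codecs.py.

```python
import codecs
import sys

# The native-endian alias should match the LE or BE constant for this platform.
native16 = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
alias_matches = codecs.BOM_UTF16 == native16


def strip_utf8_bom(data: bytes) -> bytes:
    """Hypothetical helper: drop a leading UTF-8 signature if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


payload = strip_utf8_bom(codecs.BOM_UTF8 + b"hello")
untouched = strip_utf8_bom(b"hello")
```

For decoding, the utf-8-sig codec performs the same stripping without a helper.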

Map

| Lines | Symbol | Role | gopy |
| --- | --- | --- | --- |
| 15-18 | `from _codecs import *` | Imports `register`, `lookup`, `encode`, `decode`, `register_error`, `lookup_error`, and the C codec functions into the module namespace. | `codecs/registry.go` |
| 44-78 | BOM constants | `BOM_UTF8`, `BOM_UTF16_LE` (alias `BOM_LE`), `BOM_UTF16_BE` (alias `BOM_BE`), `BOM_UTF32_LE`, `BOM_UTF32_BE`, the native-endian `BOM`/`BOM_UTF16` and `BOM_UTF32`, plus the legacy aliases `BOM32_*`/`BOM64_*`. Native-endian names are aliased to LE or BE based on `sys.byteorder`. | `codecs/builtin.go` |
| 83-116 | `CodecInfo` | A `tuple` subclass holding `(encode, decode, streamreader, streamwriter)` with named attributes; the `_is_text_encoding` flag distinguishes text codecs from binary-only ones. | `codecs/registry.go` |
| 117-181 | `Codec` | Abstract base for stateless codecs; defines `encode(input, errors)` and `decode(input, errors)`, both of which raise `NotImplementedError`. | `(stdlib pending)` |
| 183-255 | `IncrementalEncoder`, `BufferedIncrementalEncoder` | Stateful encoding base class; `encode(input, final=False)` accumulates state between calls. `BufferedIncrementalEncoder` manages a `self.buffer` string for codecs that cannot emit output on every chunk. | `(stdlib pending)` |
| 257-340 | `IncrementalDecoder`, `BufferedIncrementalDecoder` | Stateful decoding base class; `getstate()`/`setstate(state)` checkpoint the pending-bytes buffer. `BufferedIncrementalDecoder.getstate()` returns `(self.buffer, 0)`. | `(stdlib pending)` |
| 349-422 | `StreamWriter` | Inherits from `Codec`; wraps a writable stream with encoding. `write()` calls `self.encode(object, self.errors)` and forwards bytes to `self.stream`. `__getattr__` delegates unknown attributes to the underlying stream. | `(stdlib pending)` |
| 425-673 | `StreamReader` | Inherits from `Codec`; wraps a readable stream. `read(size, chars, firstline)` fills `self.charbuffer` from `self.bytebuffer` via `self.decode`. `readline` handles a `\r\n` pair that is split across two read chunks. | `(stdlib pending)` |
| 677-763 | `StreamReaderWriter` | Combines a `StreamReader` and a `StreamWriter` on the same stream; used by `codecs.open()` to return a bidirectional file object. | `(stdlib pending)` |
| 767-882 | `StreamRecoder` | Transcodes between two encodings in a pipeline: decode with one codec and re-encode with another. The `encode`/`decode` callables handle the frontend; `Reader`/`Writer` handle the backend. | `(stdlib pending)` |
| 886-935 | `open` | Deprecated wrapper around `builtins.open`; when an encoding is given it forces binary mode and wraps the result in a `StreamReaderWriter`. Emits `DeprecationWarning` since 3.14; callers should use `builtins.open(filename, encoding=...)`. | `(stdlib pending)` |
| 975-1078 | `getencoder`, `getdecoder`, `getincrementalencoder`, `getincrementaldecoder`, `getreader`, `getwriter`, `iterencode`, `iterdecode` | Convenience wrappers that call `lookup(encoding)` and return one field of the resulting `CodecInfo`. `iterencode`/`iterdecode` drive an `IncrementalEncoder`/`IncrementalDecoder` over an iterator of strings or bytes chunks, flushing with `final=True` at the end. | `(stdlib pending)` |
| 1081-1110 | `make_identity_dict`, `make_encoding_map` | Charmap helpers: `make_identity_dict` builds `{i: i}` for a range; `make_encoding_map` inverts a decoding map, setting any collision target to `None`. | `(stdlib pending)` |
| 1114-1119 | Error-handler aliases | `strict_errors`, `ignore_errors`, `replace_errors`, `xmlcharrefreplace_errors`, `backslashreplace_errors`, `namereplace_errors` are bound by calling `lookup_error(name)` at module import time. | `codecs/errors.go` |
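A short sketch of the shortcut and charmap helpers listed in the table may help. The sample strings and the toy decoding map are made up for illustration; only the codecs functions themselves come from the module.

```python
import codecs

# getencoder returns the stateless encode callable from the CodecInfo.
utf8_encode = codecs.getencoder("utf-8")
encoded_pair = utf8_encode("é")  # (bytes, number of characters consumed)

# iterencode drives an incremental encoder over an iterator of str chunks.
encoded_chunks = list(codecs.iterencode(["he", "llo ", "wörld"], "utf-8"))

# iterdecode does the same for bytes chunks; the two-byte sequence for "ö"
# is deliberately split across the chunk boundary here.
decoded_text = "".join(codecs.iterdecode([b"w\xc3", b"\xb6rld"], "utf-8"))

# Charmap helpers: identity map for a range, then inversion with collisions.
identity = codecs.make_identity_dict(range(3))
toy_decoding_map = {0x41: 0x61, 0x42: 0x61, 0x43: 0x63}  # two keys map to 0x61
inverted = codecs.make_encoding_map(toy_decoding_map)
```

In the inverted map the colliding target 0x61 ends up as None, which marks it as having no unambiguous source code point.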

Reading

`lookup()` and the search-function chain (line 16, implemented in `Python/codecs.c`)

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py#L15-18

lookup is a C function imported from _codecs. It normalizes the encoding name (lowercase, whitespace and hyphens to underscores) and then walks a per-interpreter list of search functions in registration order. The first function that returns a non-None value wins; the result is cached in a dict keyed by the normalized name.

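# Example: how lookup and register interact
import codecs

def rot13_encode(text, errors='strict'):
    # Stateless codec functions return (output, length consumed).
    return codecs.encode(text, 'rot_13'), len(text)

def rot13_decode(text, errors='strict'):
    return codecs.decode(text, 'rot_13'), len(text)

def my_search(name):
    # name arrives already normalized (lowercase, underscores).
    # 'my_rot13' is a made-up name; the stdlib already owns 'rot_13'/'rot13'.
    if name == 'my_rot13':
        return codecs.CodecInfo(
            name='my_rot13',
            encode=rot13_encode,
            decode=rot13_decode,
        )
    return None

codecs.register(my_search)
info = codecs.lookup('My-Rot13')   # normalized to 'my_rot13'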

In gopy the same logic is in codecs/registry.go. Register appends a SearchFunc to the per-process slice; Lookup normalizes with strings.ToLower and replaces hyphens/spaces with underscores, then walks the slice and caches hits in a map[string]*CodecInfo.
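The normalization and caching behaviour described for lookup can be observed from Python. This is a minimal sketch: the codec name demo_upper and the demo_search function are hypothetical, and the expected behaviour assumes the Python 3.9+ rule that hyphens become underscores.

```python
import codecs

calls = []  # records every normalized name our search function is asked about


def demo_encode(text, errors="strict"):
    # Toy codec: upper-case the text, then encode as ASCII.
    return text.upper().encode("ascii", errors), len(text)


def demo_decode(data, errors="strict"):
    return bytes(data).decode("ascii", errors), len(data)


def demo_search(name):
    calls.append(name)
    if name == "demo_upper":
        return codecs.CodecInfo(name="demo_upper",
                                encode=demo_encode, decode=demo_decode)
    return None


codecs.register(demo_search)
first = codecs.lookup("Demo-Upper")   # normalized before the search runs
second = codecs.lookup("DEMO_UPPER")  # same normalized key, served from the cache
shouted = codecs.encode("abc", "demo-upper")
```

Because the cache is keyed by the normalized name, the search function is consulted once and later lookups return the same CodecInfo object.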

`CodecInfo`: the four-tuple (lines 83 to 116)

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py#L83-116

class CodecInfo(tuple):
    def __new__(cls, encode, decode, streamreader=None, streamwriter=None,
        incrementalencoder=None, incrementaldecoder=None, name=None,
        *, _is_text_encoding=None):
        self = tuple.__new__(cls, (encode, decode, streamreader, streamwriter))
        self.name = name
        self.encode = encode
        self.decode = decode
        self.incrementalencoder = incrementalencoder
        self.incrementaldecoder = incrementaldecoder
        self.streamwriter = streamwriter
        self.streamreader = streamreader
        ...
        return self

CodecInfo stores the four main callables both as tuple slots (so that legacy code unpacking enc, dec, sr, sw = lookup(name) still works) and as named attributes. incrementalencoder and incrementaldecoder are not in the tuple; they are plain attributes set on the instance after the tuple is constructed. _is_text_encoding is a class attribute that defaults to True; the keyword argument defaults to None, which leaves that class-level value in place. The standard library passes False for binary codecs such as hex_codec, which keeps them out of str.encode(), bytes.decode(), and text-mode open().
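The dual tuple/attribute layout and the text-encoding flag can be checked with the built-in codecs. The sketch below only uses lookups that ship with CPython; the variable names are illustrative.

```python
import codecs

info = codecs.lookup("utf-8")

# Legacy four-way unpacking still works because CodecInfo is a tuple.
encode, decode, streamreader, streamwriter = info
slots_match = info[0] is info.encode and info[1] is info.decode

# The incremental classes are attributes only; they are not tuple slots.
has_incremental = info.incrementalencoder is not None

# Binary-only codecs carry _is_text_encoding = False ...
text_flag = info._is_text_encoding
hex_flag = codecs.lookup("hex")._is_text_encoding

# ... so str.encode() refuses them, while codecs.encode() does not.
try:
    "abc".encode("hex")
except LookupError:
    rejected = True
else:
    rejected = False
hex_bytes = codecs.encode(b"abc", "hex")
```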

`IncrementalEncoder` stateful encoding (lines 183 to 255)

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py#L183-255

class IncrementalEncoder:
    def __init__(self, errors='strict'):
        self.errors = errors
        self.buffer = ""

    def encode(self, input, final=False):
        raise NotImplementedError

    def reset(self):
        pass

    def getstate(self):
        return 0

    def setstate(self, state):
        pass

class BufferedIncrementalEncoder(IncrementalEncoder):
    def encode(self, input, final=False):
        data = self.buffer + input
        (result, consumed) = self._buffer_encode(data, self.errors, final)
        self.buffer = data[consumed:]
        return result

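final=True marks the last call: the encoder is expected to flush anything it is still holding. When final=False the encoder may hold back input that it cannot encode yet, for example a trailing fragment that only makes sense together with the next chunk. BufferedIncrementalEncoder implements this pattern by prepending self.buffer to each new input and keeping whatever _buffer_encode did not consume. The base-class getstate shown above always returns 0; BufferedIncrementalEncoder overrides it to return self.buffer or 0, so that an empty buffer is reported as the integer 0 rather than an empty string, and its setstate restores the buffer with state or "". This matches the convention in the codecs documentation that 0 should be the most common encoder state.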
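To make the buffering pattern concrete, here is a minimal sketch of a BufferedIncrementalEncoder subclass. LineEncoder and its rule of emitting only complete lines are invented for illustration and do not exist in codecs.py; only the base class and the _buffer_encode hook come from the module.

```python
import codecs


class LineEncoder(codecs.BufferedIncrementalEncoder):
    """Hypothetical encoder that emits only complete lines until final=True."""

    def _buffer_encode(self, input, errors, final):
        if final:
            cut = len(input)               # flush everything on the last call
        else:
            cut = input.rfind("\n") + 1    # 0 when no newline has arrived yet
        # Return (encoded output, number of input characters consumed).
        return input[:cut].encode("ascii", errors), cut


enc = LineEncoder()
out1 = enc.encode("ab")              # nothing emitted, "ab" stays buffered
state_mid = enc.getstate()           # the pending buffer
out2 = enc.encode("c\nde")           # the completed line is emitted
out3 = enc.encode("", final=True)    # the remainder is flushed
state_end = enc.getstate()           # empty buffer is reported as 0
```

The subclass never touches self.buffer directly; the inherited encode method handles prepending and trimming, so the hook only has to report how much it consumed.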

Error-handler name constants (lines 1114 to 1119)

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py#L1114-1119

strict_errors           = lookup_error("strict")
ignore_errors           = lookup_error("ignore")
replace_errors          = lookup_error("replace")
xmlcharrefreplace_errors = lookup_error("xmlcharrefreplace")
backslashreplace_errors  = lookup_error("backslashreplace")
namereplace_errors       = lookup_error("namereplace")

lookup_error is a C function from _codecs. It returns the callable registered for the given error-handler name and raises LookupError for an unknown name. The six names above are the predefined handlers that codecs.py exposes as module-level aliases; the interpreter also registers surrogateescape and surrogatepass, which have no alias here. Third-party code can add more with register_error(name, handler). In gopy the same six handlers are bound in codecs/errors.go, and the registry itself is in codecs/registry.go.
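A small sketch of the error-handler registry in use follows. The handler name demo_qmark and the function question_mark are hypothetical; lookup_error and register_error are the real entry points.

```python
import codecs

# The module-level alias is the object the registry returns for that name.
same_object = codecs.strict_errors is codecs.lookup_error("strict")


def question_mark(exc):
    """Hypothetical handler: replace each unencodable character with '?'."""
    if isinstance(exc, UnicodeEncodeError):
        # Return (replacement text, position at which to resume encoding).
        return "?" * (exc.end - exc.start), exc.end
    raise exc


codecs.register_error("demo_qmark", question_mark)
replaced = "naïve".encode("ascii", "demo_qmark")

# Unknown handler names raise LookupError.
try:
    codecs.lookup_error("no_such_handler_demo")
except LookupError:
    unknown_raises = True
else:
    unknown_raises = False
```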

`StreamReader.read()`: bytebuffer and charbuffer (lines 457 to 535)

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py#L457-535

def read(self, size=-1, chars=-1, firstline=False):
    while True:
        if chars >= 0 and len(self.charbuffer) >= chars:
            break
        newdata = self.stream.read(size if size >= 0 else -1)
        data = self.bytebuffer + newdata
        if not data:
            break
        try:
            newchars, decodedbytes = self.decode(data, self.errors)
        except UnicodeDecodeError as exc:
            if firstline:
                newchars, decodedbytes = self.decode(data[:exc.start], self.errors)
                ...
            else:
                raise
        self.bytebuffer = data[decodedbytes:]
        self.charbuffer += newchars
        if not newdata:
            break
    ...

StreamReader maintains two buffers. self.bytebuffer holds raw bytes that arrived from the stream but could not yet be decoded (e.g., an incomplete multi-byte sequence at a read boundary). self.charbuffer holds already-decoded characters waiting to be returned. On each iteration the method tries to decode self.bytebuffer + newdata; the codec returns how many bytes it consumed, and the remainder goes back into self.bytebuffer. firstline=True is passed by readline: on a decode error, the bytes before the failing position are decoded on their own, and the error is suppressed only if that prefix already contains at least one complete line; otherwise it is re-raised.
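The two-buffer behaviour can be observed with the stock UTF-8 StreamReader. In this sketch the input string and the deliberately small size argument are chosen so that a multi-byte character straddles a read boundary.

```python
import codecs
import io

raw = "a€b".encode("utf-8")          # b"a\xe2\x82\xacb"; the euro sign is 3 bytes
reader = codecs.getreader("utf-8")(io.BytesIO(raw))

# Read two bytes at a time: the first chunk ends inside the euro sign.
first = reader.read(size=2, chars=1)
pending = reader.bytebuffer           # undecoded lead byte kept for the next call

second = reader.read(size=2, chars=1)  # the rest of the sequence arrives
leftover = reader.bytebuffer

rest = reader.read()                   # drain whatever remains in the stream
```

Inspecting bytebuffer directly is for illustration only; ordinary callers just see complete characters come back from read.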

gopy mirror

The gopy codec layer is in codecs/. codecs/registry.go implements Register, Lookup, and CodecInfo (with Name, Encode, and Decode fields; StreamReader/StreamWriter are deferred). codecs/builtin.go registers the built-in UTF-8 and UTF-16 codecs. codecs/errors.go binds the six predefined error-handler names.

IncrementalEncoder, IncrementalDecoder, StreamWriter, StreamReader, StreamReaderWriter, and StreamRecoder have not yet been ported; they are needed for the io.TextIOWrapper layer and are tracked as stdlib pending. The BOM constants (BOM_UTF8, BOM_UTF16, BOM_UTF32) are defined in codecs/builtin.go as []byte variables matching the CPython byte-string literals exactly.
