Lib/codecs.py

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py

codecs.py is the Python wrapper around the codec registry that lives in Python/codecs.c. Its first act is from _codecs import *, which pulls in the C implementations of register, lookup, encode, decode, register_error, lookup_error, and the various *_encode/*_decode functions. Everything else in the file is pure Python.

The module has four distinct layers. First, BOM constants (lines 44-78) are byte-string literals for UTF-8/16/32 signatures. Second, the codec class hierarchy (lines 83-882) provides CodecInfo, Codec, IncrementalEncoder, IncrementalDecoder, StreamWriter, StreamReader, StreamReaderWriter, and StreamRecoder. Third, shortcut helpers (lines 886-1078) wrap lookup into single-purpose functions. Fourth, named error-handler aliases (lines 1114-1119) bind six module-level names to the corresponding built-in handler callables.

Map

| Lines | Symbol | Role | gopy |
| --- | --- | --- | --- |
| 15-18 | from _codecs import * | Imports register, lookup, encode, decode, register_error, lookup_error, and the C codec functions into the module namespace. | codecs/registry.go |
| 44-78 | BOM constants | BOM_UTF8, BOM_LE, BOM_BE, BOM_UTF16, BOM_UTF32, BOM_UTF32_LE, BOM_UTF32_BE, plus deprecated aliases BOM32_*/BOM64_*. Native-endian names are aliased to LE or BE based on sys.byteorder. | codecs/builtin.go |
| 83-116 | CodecInfo | A tuple subclass holding (encode, decode, streamreader, streamwriter) with named attributes; _is_text_encoding flag distinguishes text codecs from binary-only ones. | codecs/registry.go |
| 117-181 | Codec | Abstract base for stateless codecs; defines encode(input, errors) and decode(input, errors), both raise NotImplementedError. | (stdlib pending) |
| 183-255 | IncrementalEncoder, BufferedIncrementalEncoder | Stateful encoding ABC; encode(input, final=False) accumulates state between calls. BufferedIncrementalEncoder manages a self.buffer string for codecs that cannot emit output on every chunk. | (stdlib pending) |
| 257-340 | IncrementalDecoder, BufferedIncrementalDecoder | Stateful decoding ABC; getstate()/setstate(state) checkpoint the pending-bytes buffer. BufferedIncrementalDecoder.getstate() returns (self.buffer, 0). | (stdlib pending) |
| 349-422 | StreamWriter | Inherits from Codec; wraps a writable stream with encoding. write() calls self.encode(object, self.errors) and forwards bytes to self.stream. __getattr__ delegates unknown attributes to the underlying stream. | (stdlib pending) |
| 425-673 | StreamReader | Inherits from Codec; wraps a readable stream. read(size, chars, firstline) fills self.charbuffer from self.bytebuffer via self.decode. readline handles \r\n boundary splits across network chunks. | (stdlib pending) |
| 677-763 | StreamReaderWriter | Combines a StreamReader and a StreamWriter on the same stream; used by codecs.open() to return a bidirectional file object. | (stdlib pending) |
| 767-882 | StreamRecoder | Transcodes between two encodings in a pipeline: decode with one codec and re-encode with another. The encode/decode callables handle the frontend; Reader/Writer handle the backend. | (stdlib pending) |
| 886-935 | open | Deprecated wrapper around builtins.open; forces binary mode and wraps the result in a StreamReaderWriter. Emits DeprecationWarning since 3.14; callers should use builtins.open(filename, encoding=...). | (stdlib pending) |
| 975-1078 | getencoder, getdecoder, getincrementalencoder, getincrementaldecoder, getreader, getwriter, iterencode, iterdecode | Convenience wrappers that call lookup(encoding) and return one field of the resulting CodecInfo. iterencode/iterdecode drive an IncrementalEncoder/Decoder over a string iterator, flushing with final=True at the end. | (stdlib pending) |
| 1081-1110 | make_identity_dict, make_encoding_map | Charmap helpers: make_identity_dict builds {i: i} for a range; make_encoding_map inverts a decoding map, setting any collision target to None. | (stdlib pending) |
| 1114-1119 | Error-handler aliases | strict_errors, ignore_errors, replace_errors, xmlcharrefreplace_errors, backslashreplace_errors, namereplace_errors are bound by calling lookup_error(name) at module import time. | codecs/errors.go |
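
Several of the rows above can be exercised directly from the standard library. The following is a minimal sketch, not part of codecs.py itself; the variable names and sample inputs are chosen here purely for illustration.

```python
import codecs
import sys

# BOM constants are plain bytes; the native-endian UTF-16 name is an alias
# for the LE or BE constant, depending on sys.byteorder.
native_utf16 = (
    codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
)
bom_matches = codecs.BOM_UTF16 == native_utf16

# iterencode drives an incremental encoder over an iterable of strings.
joined = b"".join(codecs.iterencode(["caf", "é"], "utf-8"))

# make_identity_dict maps every value in the range to itself.
identity = codecs.make_identity_dict(range(3))

# make_encoding_map inverts a decoding map; targets hit twice become None.
inverted = codecs.make_encoding_map({1: 10, 2: 20, 3: 10})
```

In this sketch the two keys that both decode to 10 collide, which is why the inverted map cannot pick a single source for that target.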

Reading

lookup() and the search-function chain (line 16, implemented in Python/codecs.c)

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py#L15-18

lookup is a C function imported from _codecs. It normalizes the encoding name (lowercasing it and converting spaces and hyphens to underscores) and then walks a per-interpreter list of search functions in registration order. The first function that returns a non-None value wins; the result is cached in a dict keyed by the normalized name.

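# Example: how lookup and register interact
import codecs

def rot13_encode(text, errors='strict'):
    # Stateless codec functions return (output, length_consumed).
    return codecs.encode(text, 'rot_13'), len(text)

def rot13_decode(text, errors='strict'):
    return codecs.decode(text, 'rot_13'), len(text)

def my_search(name):
    # name is already normalized. 'my_rot13' is a made-up name: the stdlib
    # search function runs first and already answers for 'rot13'/'rot_13'.
    if name == 'my_rot13':
        return codecs.CodecInfo(
            name='my_rot13',
            encode=rot13_encode,
            decode=rot13_decode,
        )
    return None

codecs.register(my_search)
info = codecs.lookup('My-ROT13')  # normalized to 'my_rot13'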

In gopy the same logic is in codecs/registry.go. Register appends a SearchFunc to the per-process slice; Lookup normalizes with strings.ToLower and replaces hyphens/spaces with underscores, then walks the slice and caches hits in a map[string]*CodecInfo.

CodecInfo: the four-tuple (lines 83 to 116)

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py#L83-116

class CodecInfo(tuple):
    def __new__(cls, encode, decode, streamreader=None, streamwriter=None,
                incrementalencoder=None, incrementaldecoder=None, name=None,
                *, _is_text_encoding=None):
        self = tuple.__new__(cls, (encode, decode, streamreader, streamwriter))
        self.name = name
        self.encode = encode
        self.decode = decode
        self.incrementalencoder = incrementalencoder
        self.incrementaldecoder = incrementaldecoder
        self.streamwriter = streamwriter
        self.streamreader = streamreader
        ...
        return self

CodecInfo stores the four main callables both as tuple slots (so that legacy code unpacking enc, dec, sr, sw = lookup(name) still works) and as named attributes. incrementalencoder and incrementaldecoder are not in the tuple; they are plain attributes set on the instance after the tuple is constructed. _is_text_encoding is a class attribute that defaults to True, and the constructor overrides it only when a non-None value is passed. The standard library sets it to False for binary codecs such as hex_codec so that text-oriented entry points such as str.encode() and text-mode open() reject them.
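
The dual tuple/attribute layout can be checked from the interpreter. This is a small illustrative sketch using the built-in utf-8 and hex codecs; the variable names are arbitrary, and _is_text_encoding is a private flag that is read here only for demonstration.

```python
import codecs

info = codecs.lookup("utf-8")

# Legacy four-way unpacking still works because CodecInfo is a tuple.
encode, decode, streamreader, streamwriter = info
tuple_len = len(info)

# The incremental classes are attributes only; they are not tuple slots.
decoder = info.incrementaldecoder()
text = decoder.decode(b"\xc3", final=False) + decoder.decode(b"\xa9", final=True)

# Binary-only codecs carry _is_text_encoding = False.
utf8_is_text = info._is_text_encoding
hex_is_text = codecs.lookup("hex")._is_text_encoding
```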

IncrementalEncoder: stateful encoding (lines 183 to 255)

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py#L183-255

class IncrementalEncoder:
    def __init__(self, errors='strict'):
        self.errors = errors
        self.buffer = ""

    def encode(self, input, final=False):
        raise NotImplementedError

    def reset(self):
        pass

    def getstate(self):
        return 0

    def setstate(self, state):
        pass

class BufferedIncrementalEncoder(IncrementalEncoder):
    def encode(self, input, final=False):
        data = self.buffer + input
        (result, consumed) = self._buffer_encode(data, self.errors, final)
        self.buffer = data[consumed:]
        return result

The final=True signal tells the encoder that no more input will follow, so it must flush anything it is still holding. When final=False the encoder may hold back input that it cannot encode until it sees what comes next. BufferedIncrementalEncoder implements this pattern by prepending self.buffer to each new input and keeping whatever _buffer_encode did not consume. Its getstate override (not shown in the excerpt) returns self.buffer or 0, so an empty buffer is reported as the integer 0, the same value the base class uses for the initial state, rather than as an empty string.
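
To make the buffering contract concrete, here is a minimal sketch of a BufferedIncrementalEncoder subclass. NewlineNormalizingEncoder is a hypothetical class invented for this illustration (it is not in the standard library): it emits UTF-8 and collapses "\r\n" to "\n", which forces it to hold back a trailing "\r" until the next chunk arrives.

```python
import codecs

class NewlineNormalizingEncoder(codecs.BufferedIncrementalEncoder):
    """Hypothetical encoder: UTF-8 output with "\\r\\n" collapsed to "\\n"."""

    def _buffer_encode(self, input, errors, final):
        # A trailing "\r" might be the first half of "\r\n", so leave it
        # unconsumed unless the caller says this is the last chunk.
        if not final and input.endswith("\r"):
            usable = input[:-1]
        else:
            usable = input
        encoded = usable.replace("\r\n", "\n").encode("utf-8", errors)
        # Report how many input characters were consumed; the base class
        # keeps the rest in self.buffer for the next call.
        return encoded, len(usable)

enc = NewlineNormalizingEncoder()
first = enc.encode("a\r")                # the "\r" is held back
state_mid = enc.getstate()               # non-empty buffer
second = enc.encode("\nb", final=True)   # buffer + input = "\r\nb"
state_end = enc.getstate()               # empty buffer is reported as 0
```

The subclass only implements _buffer_encode; the buffer bookkeeping and the getstate behaviour come from the base class.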

Error-handler name constants (lines 1114 to 1119)

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py#L1114-1119

strict_errors = lookup_error("strict")
ignore_errors = lookup_error("ignore")
replace_errors = lookup_error("replace")
xmlcharrefreplace_errors = lookup_error("xmlcharrefreplace")
backslashreplace_errors = lookup_error("backslashreplace")
namereplace_errors = lookup_error("namereplace")

lookup_error is a C function from _codecs. It returns the callable registered for the given error-handler name. The six names above are module-level aliases for handlers built into the interpreter; surrogateescape and surrogatepass are also built in but are reachable only through lookup_error. Third-party code can add more with register_error(name, handler). In gopy the same six handlers are bound in codecs/errors.go, and the registry itself is in codecs/registry.go.
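
The register_error/lookup_error pair can be sketched as follows. The handler question_mark_errors and the name "example.qmark" are hypothetical, chosen only for this illustration; the calling convention (receive the exception, return a replacement and a resume position) is the one the codecs module documents for error handlers.

```python
import codecs

def question_mark_errors(exc):
    """Hypothetical handler: replace each unencodable character with '?'."""
    if isinstance(exc, UnicodeEncodeError):
        # Return (replacement, position at which encoding should resume).
        return "?" * (exc.end - exc.start), exc.end
    raise exc

codecs.register_error("example.qmark", question_mark_errors)

custom = "héllo".encode("ascii", errors="example.qmark")
builtin = "héllo".encode("ascii", errors="xmlcharrefreplace")

# lookup_error hands back the callable that was registered under the name.
same_handler = codecs.lookup_error("example.qmark") is question_mark_errors
```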

StreamReader.read(): bytebuffer and charbuffer (lines 457 to 535)

cpython 3.14 @ ab2d84fe1023/Lib/codecs.py#L457-535

def read(self, size=-1, chars=-1, firstline=False):
    while True:
        if chars >= 0 and len(self.charbuffer) >= chars:
            break
        newdata = self.stream.read(size if size >= 0 else -1)
        data = self.bytebuffer + newdata
        if not data:
            break
        try:
            newchars, decodedbytes = self.decode(data, self.errors)
        except UnicodeDecodeError as exc:
            if firstline:
                newchars, decodedbytes = self.decode(data[:exc.start], self.errors)
                ...
            else:
                raise
        self.bytebuffer = data[decodedbytes:]
        self.charbuffer += newchars
        if not newdata:
            break
    ...

StreamReader maintains two buffers. self.bytebuffer holds raw bytes that arrived from the stream but could not yet be decoded (e.g., an incomplete multi-byte sequence at a read boundary). self.charbuffer holds already-decoded characters waiting to be returned. On each iteration the method tries to decode self.bytebuffer + newdata; the codec returns how many bytes it consumed, and the remainder goes back into self.bytebuffer. firstline=True is used by readline: if a decode error occurs after at least one complete line has been decoded, read keeps the text decoded before the bad byte so that the first line can still be returned; if no complete line was decoded, the error is re-raised.
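
The interaction between the two buffers can be observed with a UTF-8 reader over an in-memory stream. This is an illustrative sketch only: the input bytes and the size/chars values are chosen so that one read call splits a two-byte character, and bytebuffer is inspected directly even though it is an implementation attribute.

```python
import codecs
import io

# "a" followed by "é", which is two bytes in UTF-8 (0xC3 0xA9).
raw = io.BytesIO(b"a\xc3\xa9")
reader = codecs.getreader("utf-8")(raw)

# Read two bytes but ask for one character: the read cuts "é" in half,
# so its first byte has to wait in bytebuffer.
first = reader.read(size=2, chars=1)
pending = reader.bytebuffer

# The next call prepends the pending byte to the newly read one.
second = reader.read(size=2, chars=1)
leftover = reader.bytebuffer
```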

gopy mirror

The gopy codec layer is in codecs/. codecs/registry.go implements Register, Lookup, and CodecInfo (with Name, Encode, and Decode fields; StreamReader/StreamWriter are deferred). codecs/builtin.go registers the built-in UTF-8 and UTF-16 codecs. codecs/errors.go binds the six predefined error-handler names.

IncrementalEncoder, IncrementalDecoder, StreamWriter, StreamReader, StreamReaderWriter, and StreamRecoder have not yet been ported; they are needed for the io.TextIOWrapper layer and are tracked as stdlib pending. The BOM constants (BOM_UTF8, BOM_UTF16, BOM_UTF32) are defined in codecs/builtin.go as []byte variables matching the CPython byte-string literals exactly.