Lib/encodings/__init__.py

cpython 3.14 @ ab2d84fe1023/Lib/encodings/__init__.py

Lib/encodings/__init__.py is the bootstrap layer between the C-level codecs machinery and the per-encoding Python modules under Lib/encodings/. When the package is imported during interpreter startup, it registers search_function with codecs.register(). From that point on, a call to open(..., encoding='utf-8'), str.encode('latin-1'), or codecs.lookup('ascii') that is not already satisfied by a cache or a built-in fast path reaches search_function, which normalizes the encoding name and imports the matching submodule (e.g. encodings.utf_8). Each submodule exposes a getregentry() function that returns a codecs.CodecInfo object (a tuple subclass) carrying encode, decode, IncrementalEncoder, and IncrementalDecoder references, among others.
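
As a small illustration of the getregentry() contract described above, the sketch below calls it directly on the standard encodings.utf_8 submodule and inspects the returned entry. It only uses attributes that codecs.CodecInfo documents (name, encode, decode, incrementaldecoder); the sample string is arbitrary.

```python
import codecs
import encodings.utf_8

# Each codec submodule exposes getregentry(); call it directly.
entry = encodings.utf_8.getregentry()
print(type(entry).__name__, entry.name)

# The stateless encode/decode callables return (result, length_consumed).
data, consumed = entry.encode("h\u00e9llo")
text, used = entry.decode(data)
print(data, consumed, text, used)

# The incremental decoder buffers a multi-byte sequence split across calls.
decoder = entry.incrementaldecoder()
first_half = decoder.decode(b"\xc3")   # incomplete sequence, nothing emitted yet
second_half = decoder.decode(b"\xa9")  # completes U+00E9
print(repr(first_half), repr(second_half))
```

In normal use nothing calls getregentry() by hand; codecs.lookup() reaches it through search_function.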

Map

| Lines | Symbol | Role | gopy |
| --- | --- | --- | --- |
| 1-30 | module docstring, imports | codecs, sys, _codecs (C extension) | n/a |
| 31-60 | normalize_encoding | Map arbitrary encoding name strings to canonical module names | Not yet ported |
| 61-100 | search_function | Import and cache per-encoding submodules | Not yet ported |
| 101-130 | codecs.register(search_function) | Hook registration at import time | Not yet ported |
| 131-160 | _cache, _unknown, _aliases | Lookup cache and alias table | Not yet ported |
| 161-200 | CodecInfo usage notes, __all__ | Package-level exports | n/a |

Reading

normalize_encoding: canonicalizing encoding names (lines 31 to 60)

cpython 3.14 @ ab2d84fe1023/Lib/encodings/__init__.py#L31-60

Python's encoding API accepts a wide range of spelling variants: "UTF-8", "utf_8", "utf8", and "UTF8" are all valid. normalize_encoding converts any such string to a form usable as a Python module name: runs of spaces, hyphens, and other punctuation become a single underscore, and leading and trailing punctuation is dropped. The function as listed does not change letter case; search_function lowercases the result before importing.

def normalize_encoding(encoding):
    if isinstance(encoding, bytes):
        encoding = str(encoding, 'ascii')
    chars = []
    punct = False
    for c in encoding:
        if c.isalnum() or c == '.':
            if punct and chars:
                chars.append('_')
            chars.append(c)
            punct = False
        else:
            punct = True
    return ''.join(chars)

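The punct flag collapses any run of characters that are neither alphanumeric nor a dot into a single underscore, so "utf---8" and "utf 8" both normalize to "utf_8". Because the underscore is only emitted when a later alphanumeric character arrives and chars is non-empty, leading and trailing punctuation disappears. The dot is passed through unchanged because it is the separator in Python package names. A name such as "utf-8-sig" contains no dot and simply normalizes to "utf_8_sig".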
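
The behavior of the listing can be checked against the function shipped in the standard library. The sketch below is only a quick probe; the sample inputs are arbitrary and chosen to exercise punctuation runs, leading and trailing whitespace, case, the dot, and a bytes argument.

```python
from encodings import normalize_encoding

samples = ["utf---8", "utf 8", " utf-8 ", "UTF-8", "latin.1", b"utf-8-sig"]

# Map each raw spelling to its normalized form.
results = {raw: normalize_encoding(raw) for raw in samples}

for raw, norm in results.items():
    print(f"{raw!r:>14} -> {norm!r}")
```

Note that "UTF-8" comes back as "UTF_8": this function collapses punctuation but leaves case alone.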

search_function: the codec registry callback (lines 61 to 100)

cpython 3.14 @ ab2d84fe1023/Lib/encodings/__init__.py#L61-100

search_function is the callable passed to codecs.register(). The C layer calls it whenever a lookup misses the internal C cache, passing the encoding name already lowercased and with spaces and hyphens converted to underscores. The function first checks its own _cache, then normalizes the name, consults _aliases (a large dict defined in Lib/encodings/aliases.py) to resolve common alternate spellings, and finally imports the submodule and calls its getregentry().

def search_function(encoding):
    entry = _cache.get(encoding, _unknown)
    if entry is not _unknown:
        return entry
    norm_encoding = normalize_encoding(encoding)
    aliased_encoding = _aliases.get(norm_encoding) or \
                       _aliases.get(norm_encoding.replace('.', '_'))
    if aliased_encoding is not None:
        norm_encoding = aliased_encoding
    norm_encoding = norm_encoding.lower()
    try:
        mod = __import__('encodings.' + norm_encoding,
                         fromlist=_import_tail,
                         level=0)
    except ImportError:
        mod = None
    try:
        getregentry = mod.getregentry
    except AttributeError:
        mod = None
    if mod is None:
        _cache[encoding] = None
        return None
    entry = getregentry()
    if not isinstance(entry, CodecInfo):
        ...
    _cache[encoding] = entry
    return entry

The result is stored in _cache (a plain dict), so subsequent lookups for the same name string are O(1) dictionary reads that skip the import. Failed lookups are cached as None too, so an unknown name does not trigger repeated import attempts.
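
The name-resolution and caching steps above can be observed from ordinary Python code. In the sketch below, resolve_module_name is a hypothetical helper written for this illustration (it is not part of the encodings package); it mirrors only the normalize-then-alias portion of the listing. The calls to encodings.search_function use the real function, and the unknown codec name is made up.

```python
import encodings
from encodings.aliases import aliases


def resolve_module_name(name):
    """Hypothetical helper: mirror the normalize + alias steps of the listing."""
    norm = encodings.normalize_encoding(name)
    aliased = aliases.get(norm) or aliases.get(norm.replace(".", "_"))
    if aliased is not None:
        norm = aliased
    return norm.lower()


# Alias spellings map to the submodule name under Lib/encodings/.
latin = resolve_module_name("latin1")
short = resolve_module_name("u8")
plain = resolve_module_name("utf_8")
print(latin, short, plain)

# A second call with the same string returns the cached CodecInfo object.
first = encodings.search_function("latin_1")
second = encodings.search_function("latin_1")
print(first is second)

# An unknown name yields None (and the miss is remembered).
missing = encodings.search_function("no_such_codec_for_demo")
print(missing)
```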

Registration at import time (lines 101 to 130)

cpython 3.14 @ ab2d84fe1023/Lib/encodings/__init__.py#L101-130

The final lines of __init__.py call codecs.register(search_function) unconditionally at module import time. Because encodings is imported during interpreter startup (before user code runs), this registration happens once per interpreter. The C codec machinery maintains a per-interpreter list of search functions; codecs.lookup(name) walks that list in registration order until one returns a non-None value.

codecs.register(search_function)

This one line is what connects the pure-Python submodule tree under Lib/encodings/ to the C-level codec machinery. Removing or deferring it would break str.encode, bytes.decode, and open with an explicit encoding argument for every codec that is resolved through the registry rather than handled by a built-in C fast path.
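
The same hook is available to user code, which makes the registry walk easy to observe. The sketch below registers a second search function after the one installed by encodings. The names demo_search and demo_upper are hypothetical and exist only for this example, and the toy codec simply upper-cases text before encoding it as UTF-8. It assumes Python 3.9 or later, where the C layer converts hyphens as well as spaces to underscores before calling search functions.

```python
import codecs

seen = []  # names our search function was asked to resolve


def demo_search(name):
    """Hypothetical search function for a toy codec called demo_upper."""
    seen.append(name)
    if name != "demo_upper":
        return None  # let lookup continue (or fail) for every other name

    def encode(text, errors="strict"):
        return text.upper().encode("utf-8", errors), len(text)

    def decode(data, errors="strict"):
        return bytes(data).decode("utf-8", errors), len(data)

    return codecs.CodecInfo(encode, decode, name="demo_upper")


codecs.register(demo_search)

# encodings.search_function is consulted first and returns None,
# so the lookup falls through to demo_search.
encoded = "abc".encode("Demo-Upper")
print(encoded)
print(seen)
```

The recorded name shows that the search function receives the lowercased, underscore-separated form rather than the spelling the caller used.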

gopy mirror

Not yet ported in full. gopy currently handles ASCII and UTF-8 natively inside the VM and string object layer (objects/str.go) without going through a registry. The alias table (Lib/encodings/aliases.py) is large enough that it is unlikely to be hand-ported; a future implementation would either vendor it as a generated Go map or call into a CPython subprocess for codec resolution. The normalize_encoding function is small and self-contained and would be a straightforward port.

CPython 3.14 changes

CPython 3.14 added support for the "locale" pseudo-encoding name at the encodings layer, resolving it to the current locale codec at lookup time rather than at open() time. This allows codecs.lookup("locale") to return a stable CodecInfo entry for the session. Additionally, the alias table in Lib/encodings/aliases.py gained several new entries for WHATWG-compatible encoding names to improve compatibility with web content processed through the http and email stacks.