`Python/codecs.c`

cpython 3.14 @ ab2d84fe1023/Python/codecs.c

The codec registry is how CPython maps an encoding name (e.g. "utf-8", "latin_1", "iso8859-15") to a codec-info tuple of callables: the encoder, the decoder, and the incremental and stream factories, in the index order listed under PyCodec_Encode below. Search functions registered via codecs.register() are tried in registration order; the first one to return something other than None wins, and its result is cached under the normalized name.

The file is roughly 900 lines and covers four concerns: registration of search functions, lookup with name normalization and caching, convenience wrappers (PyCodec_Encode, PyCodec_Decode, PyCodec_IncrementalEncoder, etc.), and the charmap_build / charmap_decode helpers used by single-byte codec implementations. The Python-visible codecs module imports the C entry points via _codecsmodule.c; the functions here are pure C API.

Name normalization is the subtlest part: before consulting the cache or calling any search function, _PyCodec_Lookup lowercases the encoding string and maps spaces and hyphens to underscores, so "UTF-8", "utf_8", "utf 8", and "utf-8" all normalize to the same cache key, "utf_8".

Map

| Lines | Symbol | Role | gopy |
| --- | --- | --- | --- |
| 1-80 | `_PyCodec_InitRegistry` / `PyCodec_Register` | Initialize the per-interpreter `codec_search_path` list and `codec_search_cache` dict; append a new search function. | `pythonrun/codecs.go:InitRegistry` / `Register` |
| 81-220 | `_PyCodec_Lookup` / `normalizestring` | Normalize the encoding name (lowercase, spaces/hyphens to `_`), check `codec_search_cache`, then walk `codec_search_path` calling each function. | `pythonrun/codecs.go:Lookup` |
| 221-380 | `_PyCodecInfo_GetIncrementalDecoder` / `_PyCodecInfo_GetIncrementalEncoder` | Extract `codec[3]` / `codec[2]` from the codec-info tuple and instantiate with the `errors` argument; used for BOM-aware UTF-* codecs. | `pythonrun/codecs.go:GetIncrementalDecoder` / `GetIncrementalEncoder` |
| 381-540 | `PyCodec_Encode` / `PyCodec_Decode` | Look up the codec tuple, call `codec[0](obj, errors)` or `codec[1](obj, errors)`, validate that the result is a 2-tuple, and return `result[0]`. | `pythonrun/codecs.go:Encode` / `Decode` |
| 541-680 | `PyCodec_IncrementalEncoder` / `PyCodec_IncrementalDecoder` / `PyCodec_StreamReader` / `PyCodec_StreamWriter` | Factory wrappers: call `codec[2]` through `codec[5]` with the `errors` string and return the resulting object. | `pythonrun/codecs.go:IncrementalEncoder` etc. |
| 681-800 | `_PyCodec_EncodeInternal` / `_PyCodec_DecodeInternal` | Internal variants that accept a pre-looked-up codec tuple, used by the `bytes.decode` / `str.encode` fast paths. | `pythonrun/codecs.go:EncodeInternal` / `DecodeInternal` |
| 801-900 | `charmap_build` / helper tables | Build a decode table from a 256-character Unicode string mapping ordinal positions to code points; used by single-byte encodings. | `pythonrun/codecs.go:CharmapBuild` |

Reading

`_PyCodec_Lookup` — normalization and cache (lines 81 to 220)

cpython 3.14 @ ab2d84fe1023/Python/codecs.c#L81-220

PyObject *
_PyCodec_Lookup(const char *encoding)
{
    if (encoding == NULL) {
        PyErr_BadArgument();
        return NULL;
    }
    PyInterpreterState *interp = _PyInterpreterState_GET();
    if (interp->codec_search_path == NULL && _PyCodec_InitRegistry(interp) < 0) {
        return NULL;
    }

    /* Normalize the encoding name */
    PyObject *v = normalizestring(encoding);
    if (v == NULL) {
        return NULL;
    }

    /* Check the cache first */
    PyObject *result = PyDict_GetItemWithError(interp->codec_search_cache, v);
    if (result != NULL) {
        Py_INCREF(result);
        Py_DECREF(v);
        return result;
    }
    ...
    /* Walk codec_search_path */
    Py_ssize_t i, len = PyList_GET_SIZE(interp->codec_search_path);
    for (i = 0; i < len; i++) {
        PyObject *func = PyList_GET_ITEM(interp->codec_search_path, i);
        result = PyObject_CallOneArg(func, v);
        if (result == Py_None) {
            Py_DECREF(result);
            continue;
        }
        if (result != NULL) {
            /* Cache successful lookup */
            if (PyDict_SetItem(interp->codec_search_cache, v, result) < 0) {
                Py_DECREF(result);
                Py_DECREF(v);
                return NULL;
            }
            Py_DECREF(v);
            return result;
        }
        Py_DECREF(v);
        return NULL;
    }
    /* No search function matched */
    PyErr_Format(PyExc_LookupError,
                 "unknown encoding: %s", encoding);
    Py_DECREF(v);
    return NULL;
}

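normalizestring (immediately above this function) walks the bytes of the encoding name: ASCII letters are lowercased with Py_TOLOWER, and spaces (0x20) and hyphens (0x2D) become underscores (0x5F). Since Python 3.9 the folding is broader and follows encodings.normalize_encoding: any run of characters other than ASCII letters, digits, and "." that sits between two such characters collapses to a single underscore. The result is a Python str used as the dict key.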
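The per-character part of that folding can be sketched in a few lines of Python. This is a simplified model, not the CPython routine: `normalize_name` is a hypothetical helper, and it covers only ASCII lowercasing and the space/hyphen mapping described above.

```python
def normalize_name(encoding: str) -> str:
    """Simplified model of the name folding (hypothetical helper)."""
    out = []
    for ch in encoding:
        if ch in (" ", "-"):
            out.append("_")         # space (0x20) and hyphen (0x2D) become 0x5F
        elif "A" <= ch <= "Z":
            out.append(ch.lower())  # ASCII-only lowercasing, in the spirit of Py_TOLOWER
        else:
            out.append(ch)          # everything else passes through in this model
    return "".join(out)


spellings = ["UTF-8", "utf_8", "utf 8", "utf-8"]
keys = {normalize_name(s) for s in spellings}
print(keys)  # a single key shared by all four spellings
```

All four spellings end up in one set element, which is why they share one cache entry.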

The cache (codec_search_cache) is a plain dict on the per-interpreter state. It is populated after the first successful lookup and consulted before walking the search path on subsequent calls, so repeated "utf-8" lookups are a single dict probe.
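The caching behaviour can be observed from Python with a search function that records how often it is consulted. This is a minimal sketch: the codec name `grdemoascii` and the helpers `identity_encode`, `identity_decode`, and `demo_search` are made up for the illustration, and the name deliberately avoids punctuation so that normalization details do not matter.

```python
import codecs

calls = []  # normalized names our search function was asked about


def identity_encode(text, errors="strict"):
    # Codec encoders return (encoded_object, length_consumed).
    return text.encode("ascii", errors), len(text)


def identity_decode(data, errors="strict"):
    raw = bytes(data)
    return raw.decode("ascii", errors), len(raw)


def demo_search(name):
    calls.append(name)
    if name != "grdemoascii":
        return None  # not ours: let the next search function try
    return codecs.CodecInfo(identity_encode, identity_decode, name="grdemoascii")


codecs.register(demo_search)

first = codecs.lookup("GrDemoAscii")   # walks the search path, then caches
second = codecs.lookup("grdemoascii")  # same normalized key: served from the cache

print(calls.count("grdemoascii"), first is second)
```

The search function sees the lowercased name exactly once; the second lookup returns the cached object without calling it again.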

`PyCodec_Encode` and `PyCodec_Decode` (lines 381 to 540)

cpython 3.14 @ ab2d84fe1023/Python/codecs.c#L381-540

PyObject *
PyCodec_Encode(PyObject *object, const char *encoding, const char *errors)
{
    PyObject *codec = _PyCodec_Lookup(encoding);
    if (codec == NULL)
        return NULL;

    PyObject *encoder = PyTuple_GET_ITEM(codec, 0);  /* borrowed from codec */
    PyObject *args = Py_BuildValue("(Os)", object, errors ? errors : "strict");
    if (args == NULL) {
        Py_DECREF(codec);
        return NULL;
    }

    /* Keep codec alive across the call: encoder is only borrowed from it. */
    PyObject *result = PyObject_Call(encoder, args, NULL);
    Py_DECREF(args);
    Py_DECREF(codec);
    if (result == NULL)
        return NULL;

    /* result must be (encoded_object, length) */
    if (!PyTuple_Check(result) || PyTuple_GET_SIZE(result) != 2) {
        PyErr_SetString(PyExc_TypeError,
                        "encoder must return a tuple (object, integer)");
        Py_DECREF(result);
        return NULL;
    }
    PyObject *encoded = PyTuple_GET_ITEM(result, 0);
    Py_INCREF(encoded);
    Py_DECREF(result);
    return encoded;
}

The codec tuple layout is fixed: index 0 is the encoder callable, index 1 is the decoder, index 2 is the incremental encoder factory, index 3 is the incremental decoder factory, index 4 is the stream reader factory, and index 5 is the stream writer factory. PyCodec_Decode follows the identical pattern using index 1.

Both functions validate the return value. A codec that forgets to wrap its output in a two-element tuple gets a TypeError from the C layer rather than a silent wrong-type result propagating to the caller.
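The return-value check is visible from Python by registering a deliberately broken codec. In this sketch the name `grdemobroken` and the helpers `bad_encode`, `ok_decode`, and `broken_search` are hypothetical; the encoder returns bare bytes instead of a `(bytes, length)` pair.

```python
import codecs


def bad_encode(text, errors="strict"):
    return text.encode("ascii", errors)  # bug: bare bytes, not (bytes, length)


def ok_decode(data, errors="strict"):
    raw = bytes(data)
    return raw.decode("ascii", errors), len(raw)


def broken_search(name):
    if name == "grdemobroken":
        return codecs.CodecInfo(bad_encode, ok_decode, name="grdemobroken")
    return None


codecs.register(broken_search)

try:
    codecs.encode("abc", "grdemobroken")
except TypeError as exc:
    outcome = f"TypeError: {exc}"
else:
    outcome = "no error"

print(outcome)
```

The decoder of the same codec returns a proper pair, so `codecs.decode(b"abc", "grdemobroken")` still succeeds; only the malformed encoder result is rejected.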

`_PyCodecInfo_GetIncrementalDecoder` (lines 221 to 380)

cpython 3.14 @ ab2d84fe1023/Python/codecs.c#L221-380

PyObject *
_PyCodecInfo_GetIncrementalDecoder(PyObject *codec_info,
                                   const char *errors)
{
    PyObject *incremental_decoder =
        PyTuple_GET_ITEM(codec_info, 3);
    if (incremental_decoder == Py_None) {
        PyErr_SetString(PyExc_LookupError,
                        "codec doesn't support incremental decoding");
        return NULL;
    }
    return PyObject_CallFunction(incremental_decoder, "s", errors);
}

The incremental interfaces are used by io.TextIOWrapper for buffered stream decoding. They matter most for UTF-8, UTF-16, and UTF-32, where a single character can straddle two read() calls; UTF-16 and UTF-32 additionally detect a BOM on the first chunk. _PyCodecInfo_GetIncrementalEncoder is the symmetric counterpart using index 2.
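The boundary handling is easiest to see at the Python level, where `codecs.getincrementaldecoder` returns the same kind of factory. The sketch below splits the three-byte UTF-8 encoding of "€" across two chunks, as two consecutive read() calls might.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

euro = "€".encode("utf-8")      # three bytes: e2 82 ac
chunks = [euro[:2], euro[2:]]   # split in the middle of the character

# The first chunk is incomplete, so the decoder buffers it and returns "".
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))  # flush; raises if bytes are left over
text = "".join(pieces)

print(pieces, text)
```

A plain one-shot decode of the first chunk would raise UnicodeDecodeError, which is why stream readers rely on the incremental object and its internal buffer.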

Notes for the gopy mirror

pythonrun/codecs.go holds the registry state as a per-interpreter struct with a []callable search path and a map[string]CodecInfo cache. The name normalization is a straightforward byte loop matching normalizestring in CPython. Encode / Decode call into the Python-side codec tuple via objects.Call and validate the two-element tuple return in the same order as the C code.

CPython 3.14 changes worth noting

PyCodec_Register checks that the search function is callable before appending it to the path, so a bad argument raises TypeError eagerly at registration time rather than surfacing later during lookup. This check is long-standing rather than new in 3.14. The codec tuple layout (indices 0 through 5) is unchanged since Python 2.2.
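The eager failure at registration time can be confirmed from Python, where codecs.register() is the visible entry point for PyCodec_Register. A minimal check, using an int as the non-callable argument:

```python
import codecs

try:
    codecs.register(42)  # an int is not callable
except TypeError as exc:
    registration_error = exc
else:
    registration_error = None

print(type(registration_error).__name__)
```

Nothing is appended to the search path in this case, so later lookups are unaffected.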
