Python/codecs.c

Source:

cpython 3.14 @ ab2d84fe1023/Python/codecs.c

codecs.c implements the codec registry that backs str.encode(), bytes.decode(), and codecs.open(). Codec lookup normalizes the encoding name and calls registered search functions.

Map

| Lines | Symbol | Role |
| --- | --- | --- |
| 1-100 | PyCodec_Register | Add a search function to the registry |
| 101-250 | _PyCodec_Lookup | Normalize name, call search functions, cache result |
| 251-450 | PyCodec_Encode, PyCodec_Decode | Encode/decode using a named codec |
| 451-650 | PyCodec_IncrementalEncoder | codecs.getincrementalencoder(name)() |
| 651-850 | Buffered incremental | IncrementalDecoder.decode(data, final=False) |
| 851-1000 | Error handlers | replace, ignore, xmlcharrefreplace, backslashreplace, namereplace |
| 1001-1200 | PyCodec_LookupError | Get a named error handler |

Reading

Codec lookup and normalization

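// CPython: Python/codecs.c:145 _PyCodec_Lookup (simplified)
PyObject *
_PyCodec_Lookup(const char *encoding)
{
    PyInterpreterState *interp = _PyInterpreterState_GET();
    /* Normalize: lowercase, replace hyphens and spaces with underscores */
    PyObject *name = normalizestring(encoding);
    if (name == NULL) return NULL;
    /* Check cache first; PyDict_GetItemRef hands back a strong reference */
    PyObject *result = NULL;
    if (PyDict_GetItemRef(interp->codecs.search_cache, name, &result) < 0) goto done;
    if (result != NULL) goto done;
    /* Try each registered search function, in registration order */
    PyObject *search_path = interp->codecs.search_path;
    for (Py_ssize_t i = 0; i < PyList_GET_SIZE(search_path); i++) {
        PyObject *func = PyList_GET_ITEM(search_path, i);
        result = PyObject_CallOneArg(func, name);
        if (result == NULL) goto done;   /* the search function raised */
        if (result != Py_None) break;    /* found a codec */
        Py_CLEAR(result);                /* None: not handled, try the next one */
    }
    if (result == NULL) {
        PyErr_Format(PyExc_LookupError, "unknown encoding: %s", encoding);
    }
    else if (PyDict_SetItem(interp->codecs.search_cache, name, result) < 0) {
        Py_CLEAR(result);
    }
done:
    Py_DECREF(name);
    return result;                       /* NULL with an exception set on failure */
}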

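'UTF-8', 'utf-8', and 'utf_8' all normalize to 'utf_8'. 'utf8' normalizes to itself and reaches the same codec through the alias table consulted by the encodings package's search function.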
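The lookup path can be observed from Python. The following is a minimal sketch, assuming Python 3.9 or later (where hyphens and spaces become underscores); `demo_search` and the codec name `Demo-Codec X` are hypothetical names made up for the illustration.

```python
import codecs

seen = []  # normalized names our hypothetical search function is asked about


def demo_search(name):
    """Hypothetical search function: record the name, claim exactly one codec."""
    seen.append(name)
    if name == "demo_codec_x":
        return codecs.lookup("utf-8")  # reuse the UTF-8 CodecInfo
    return None  # not ours: let the next search function try


codecs.register(demo_search)

# Every spelling resolves to the same codec.
names = {codecs.lookup(s).name for s in ("UTF-8", "utf-8", "utf_8", "utf8")}

first = codecs.lookup("Demo-Codec X")   # cache miss: search functions run
second = codecs.lookup("demo codec-x")  # same normalized key: served from cache
calls_for_demo = seen.count("demo_codec_x")

if hasattr(codecs, "unregister"):  # Python 3.10+; also clears the lookup cache
    codecs.unregister(demo_search)
```

The search function receives the already-normalized name, and the second lookup never reaches it because the registry caches the first result.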

PyCodec_Encode / PyCodec_Decode

// CPython: Python/codecs.c:290 PyCodec_Encode (simplified)
PyObject *
PyCodec_Encode(PyObject *object, const char *encoding, const char *errors)
{
    PyObject *codec = _PyCodec_Lookup(encoding);
    if (codec == NULL) return NULL;
    /* codec is a CodecInfo, a 4-tuple (encode, decode, streamreader, streamwriter);
       the incremental classes are attributes, not tuple items */
    PyObject *encoder = PyTuple_GET_ITEM(codec, 0);   /* borrowed from codec */
    PyObject *result = errors
        ? PyObject_CallFunction(encoder, "Os", object, errors)
        : PyObject_CallOneArg(encoder, object);
    Py_DECREF(codec);
    if (result == NULL) return NULL;
    /* result is a tuple (output, length consumed); return just the output */
    PyObject *output = Py_NewRef(PyTuple_GET_ITEM(result, 0));
    Py_DECREF(result);
    return output;
}
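The same call sequence can be reproduced from Python. This is a small sketch rather than a transcription of the C code; it only uses `codecs.lookup` and the stateless encoder and decoder stored in the returned `CodecInfo`.

```python
import codecs

info = codecs.lookup("utf-8")

# CodecInfo is a tuple subclass: item 0 is the stateless encoder,
# item 1 the stateless decoder.
encoder, decoder = info[0], info[1]

# Both return a pair: (output, number of input units consumed).
encoded, consumed = encoder("héllo", "strict")
decoded, used = decoder(encoded, "strict")

# str.encode() keeps only the first item of that pair.
same_as_method = "héllo".encode("utf-8") == encoded
```

Here `consumed` counts characters of the input string, while `used` counts bytes of the encoded input.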

Incremental encoder/decoder protocol

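// CPython: Python/codecs.c:530 PyCodec_IncrementalEncoder (simplified)
PyObject *
PyCodec_IncrementalEncoder(const char *encoding, const char *errors)
{
    PyObject *codec = _PyCodec_Lookup(encoding);
    if (codec == NULL) return NULL;
    /* The incremental classes are CodecInfo attributes, not tuple items */
    PyObject *inc_enc_cls = PyObject_GetAttrString(codec, "incrementalencoder");
    Py_DECREF(codec);
    if (inc_enc_cls == NULL) return NULL;
    PyObject *inc_enc = errors ? PyObject_CallFunction(inc_enc_cls, "s", errors)
                               : PyObject_CallNoArgs(inc_enc_cls);
    Py_DECREF(inc_enc_cls);
    return inc_enc;
}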

IncrementalEncoder.encode(data, final=False) may hold back input it cannot encode yet. Passing final=True marks the last call, so the encoder flushes whatever it is still buffering.
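A short sketch of the incremental protocol from Python, using only `codecs.getincrementalencoder` and `codecs.getincrementaldecoder`. The decoder half is included because buffering is easiest to see when a multi-byte sequence is split across calls.

```python
import codecs

# Encoder: state carries across calls. UTF-16 writes its BOM only once.
enc = codecs.getincrementalencoder("utf-16")()
chunks = [enc.encode("a"), enc.encode("b", final=True)]

# Decoder: an incomplete UTF-8 sequence is buffered until the rest arrives.
data = "é".encode("utf-8")           # two bytes: b'\xc3\xa9'
dec = codecs.getincrementaldecoder("utf-8")(errors="strict")
part1 = dec.decode(data[:1])         # nothing can be produced yet
part2 = dec.decode(data[1:], final=True)

# The factory is an attribute of CodecInfo, not one of its tuple items.
info = codecs.lookup("utf-8")
is_attribute = info.incrementalencoder is codecs.getincrementalencoder("utf-8")
```

Joining `chunks` gives the same bytes as a one-shot `"ab".encode("utf-16")`, which is the property the incremental protocol is meant to preserve.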

Built-in error handlers

// CPython: Python/codecs.c:880 PyCodec_StrictErrors
/* 'strict': raise UnicodeEncodeError/UnicodeDecodeError */
/* 'ignore': skip unencodable characters or undecodable bytes */
/* 'replace': replace with '?' (encode) or U+FFFD (decode) */
/* 'xmlcharrefreplace': &#{ord}; for unencodable characters (encode only) */
/* 'backslashreplace': \xNN, \uNNNN, or \UNNNNNNNN escapes */
/* 'namereplace': \N{NAME} using Unicode character names (encode only) */
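To make the differences between the built-in handlers concrete, the sketch below encodes one sample string to ASCII under each of them. The sample text is arbitrary.

```python
text = "naïve ☃"  # 'ï' is U+00EF, the snowman is U+2603

# Encode with each non-strict built-in handler.
results = {
    name: text.encode("ascii", errors=name)
    for name in ("ignore", "replace", "xmlcharrefreplace",
                 "backslashreplace", "namereplace")
}

# 'replace' on the decode side substitutes U+FFFD for the bad byte.
decoded = b"na\xefve".decode("ascii", errors="replace")

# 'strict' raises; the exception carries the offending span.
try:
    text.encode("ascii")
    strict_span = None
except UnicodeEncodeError as exc:
    strict_span = (exc.start, exc.end)
```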

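Error handlers are callables that receive a UnicodeEncodeError, UnicodeDecodeError, or UnicodeTranslateError instance and return a (replacement, new_position) tuple. Encoding or decoding resumes at new_position.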
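A custom handler follows the same contract. The sketch below registers one with `codecs.register_error`; the handler `hex_marker` and its name `demo.hexmarker` are hypothetical, chosen only for this illustration.

```python
import codecs


def hex_marker(exc):
    """Hypothetical handler: replace each unencodable character with <U+XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only handles encoding errors
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(f"<U+{ord(ch):04X}>" for ch in bad)
    return replacement, exc.end  # resume right after the bad span


codecs.register_error("demo.hexmarker", hex_marker)

encoded = "naïve".encode("ascii", errors="demo.hexmarker")
handler = codecs.lookup_error("demo.hexmarker")
```

The replacement string must itself be encodable by the target codec, which is why the marker uses ASCII characters only.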

gopy notes

The codec registry is in module/codecs/registry.go as a []SearchFunc slice. _PyCodec_Lookup normalizes via strings.ToLower + hyphen-to-underscore. Built-in codecs (utf-8, utf-16, latin-1, ascii, etc.) are registered during stdlibinit. Error handlers are a map[string]ErrorHandler. PyCodec_Encode/Decode call objects.StrEncode/BytesDecode.