Modules/_codecsmodule.c

cpython 3.14 @ ab2d84fe1023/Modules/_codecsmodule.c

_codecs is the C backbone of Python's codec system. Lib/codecs.py star-imports this module at startup and re-exports its public functions under the codecs name. The file covers three areas: the codec registry (register, unregister, lookup, encode, decode), the error-handler registry (register_error, lookup_error, _unregister_error), and a large set of built-in codec wrappers for UTF-8, UTF-16, UTF-32, UTF-7, Latin-1, ASCII, charmap, escape, unicode-escape, raw-unicode-escape, and (on Windows) MBCS/OEM/code-page codecs. Almost every codec wrapper delegates immediately into Objects/unicodeobject.c via the PyUnicode_* / _PyUnicode_* family. Apart from the inline escaping loop in escape_encode (lines 199-253), the file contains no codec logic of its own: it is an argument-adaptation and dispatch layer.
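The re-export relationship can be checked from Python. The following is a minimal sketch, assuming a CPython interpreter where the `_codecs` extension module is importable; the list of names is only a sample chosen for illustration.

```python
import codecs
import _codecs  # CPython-specific C extension module

# A sample of names that codecs.py pulls in through its star-import.
reexported = ["lookup", "register", "encode", "decode",
              "utf_8_decode", "latin_1_encode", "charmap_build"]

# Each public name should be the very same builtin function object.
same_object = {name: getattr(codecs, name) is getattr(_codecs, name)
               for name in reexported}

print(same_object)
```

If every value is `True`, the `codecs` name is only an alias for the C function, with no Python-level wrapper in between.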

Map

| Lines | Symbol | Role | gopy |
|---|---|---|---|
| 1-48 | headers, clinic boilerplate | Python.h, pycore_codecs.h, pycore_unicodeobject.h, clinic include | - |
| 63-71 | _codecs_register | Register a codec search function via PyCodec_Register | - |
| 83-92 | _codecs_unregister | Remove a codec search function via PyCodec_Unregister | - |
| 102-107 | _codecs_lookup_impl | Look up a codec by name; returns a CodecInfo object | - |
| 124-134 | _codecs_encode_impl | Encode an object using a named codec (default utf-8) | - |
| 151-161 | _codecs_decode_impl | Decode an object using a named codec (default utf-8) | - |
| 165-172 | codec_tuple | Helper: wrap a decoded/encoded object and its length into a (obj, n) tuple | - |
| 182-190 | _codecs_escape_decode_impl | Decode Python byte-string escape sequences | - |
| 199-253 | _codecs_escape_encode_impl | Encode a bytes object as an escaped byte string | - |
| 264-274 | _codecs_utf_7_decode_impl | Stateful UTF-7 decode | - |
| 284-294 | _codecs_utf_8_decode_impl | Stateful UTF-8 decode | - |
| 304-316 | _codecs_utf_16_decode_impl | Stateful UTF-16 decode (native byte order) | - |
| 326-338 | _codecs_utf_16_le_decode_impl | Stateful UTF-16-LE decode | - |
| 348-360 | _codecs_utf_16_be_decode_impl | Stateful UTF-16-BE decode | - |
| 378-392 | _codecs_utf_16_ex_decode_impl | UTF-16 decode, exposes byteorder to caller | - |
| 401-414 | _codecs_utf_32_decode_impl | Stateful UTF-32 decode | - |
| 424-436 | _codecs_utf_32_le_decode_impl | Stateful UTF-32-LE decode | - |
| 446-458 | _codecs_utf_32_be_decode_impl | Stateful UTF-32-BE decode | - |
| 476-488 | _codecs_utf_32_ex_decode_impl | UTF-32 decode, exposes byteorder to caller | - |
| 498-508 | _codecs_unicode_escape_decode_impl | Decode \uXXXX / \UXXXXXXXX escape sequences | - |
| 518-528 | _codecs_raw_unicode_escape_decode_impl | Decode \uXXXX raw escape sequences | - |
| 537-544 | _codecs_latin_1_decode_impl | Latin-1 decode | - |
| 553-560 | _codecs_ascii_decode_impl | ASCII decode | - |
| 570-582 | _codecs_charmap_decode_impl | Decode using an arbitrary charmap mapping | - |
| 584-644 | _codecs_mbcs_decode_impl, _codecs_oem_decode_impl, _codecs_code_page_decode_impl | Windows-only MBCS/OEM/code-page decoders | - |
| 656-664 | _codecs_readbuffer_encode_impl | Copy a buffer verbatim as bytes (identity encode) | - |
| 673-680 | _codecs_utf_7_encode_impl | UTF-7 encode | - |
| 689-696 | _codecs_utf_8_encode_impl | UTF-8 encode | - |
| 713-720 | _codecs_utf_16_encode_impl | UTF-16 encode with optional BOM | - |
| 729-736 | _codecs_utf_16_le_encode_impl | UTF-16-LE encode | - |
| 745-752 | _codecs_utf_16_be_encode_impl | UTF-16-BE encode | - |
| 769-776 | _codecs_utf_32_encode_impl | UTF-32 encode with optional BOM | - |
| 785-792 | _codecs_utf_32_le_encode_impl | UTF-32-LE encode | - |
| 801-808 | _codecs_utf_32_be_encode_impl | UTF-32-BE encode | - |
| 817-824 | _codecs_unicode_escape_encode_impl | Encode as \uXXXX escape sequences | - |
| 833-840 | _codecs_raw_unicode_escape_encode_impl | Encode as raw \uXXXX escape sequences | - |
| 849-856 | _codecs_latin_1_encode_impl | Latin-1 encode | - |
| 865-872 | _codecs_ascii_encode_impl | ASCII encode | - |
| 882-892 | _codecs_charmap_encode_impl | Encode using an arbitrary charmap mapping | - |
| 900-905 | _codecs_charmap_build_impl | Build an encoding map from a decoding-table string | - |
| 907-955 | _codecs_mbcs_encode_impl, _codecs_oem_encode_impl, _codecs_code_page_encode_impl | Windows-only MBCS/OEM/code-page encoders | - |
| 972-981 | _codecs_register_error_impl | Register a named error handler | - |
| 1000-1005 | _codecs__unregister_error_impl | Remove a named error handler | - |
| 1018-1023 | _codecs_lookup_error_impl | Retrieve a named error handler | - |
| 1027-1099 | _codecs_functions, _codecs_slots, codecsmodule, PyInit__codecs | Method table, module def, init | - |

Reading

Registry dispatch: encode and decode (lines 124 to 161)

cpython 3.14 @ ab2d84fe1023/Modules/_codecsmodule.c#L124-161

The two generic entry points are deliberately minimal. They exist only to supply a default encoding and forward to the registry:

static PyObject *
_codecs_encode_impl(PyObject *module, PyObject *obj,
                    const char *encoding, const char *errors)
{
    if (encoding == NULL)
        encoding = PyUnicode_GetDefaultEncoding();
    return PyCodec_Encode(obj, encoding, errors);
}

static PyObject *
_codecs_decode_impl(PyObject *module, PyObject *obj,
                    const char *encoding, const char *errors)
{
    if (encoding == NULL)
        encoding = PyUnicode_GetDefaultEncoding();
    return PyCodec_Decode(obj, encoding, errors);
}

PyCodec_Encode and PyCodec_Decode live in Python/codecs.c. They look up the codec by name, call the appropriate encoder or decoder callable, and validate the return type. The _codecs module never touches the registry data structures directly.
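Seen from Python, the dispatch looks like the sketch below. It only uses the public `codecs` functions; the sample strings are arbitrary.

```python
import codecs

# With no encoding argument, the C wrapper falls back to the default, utf-8.
encoded = codecs.encode("héllo")
decoded = codecs.decode(b"h\xc3\xa9llo")

# Any registered codec name goes through the same registry path,
# including codecs that map str to str.
rot = codecs.encode("abc", "rot_13")

# lookup() returns the CodecInfo found by the registry; the name is normalized.
info = codecs.lookup("UTF8")

# An unknown name fails in the registry lookup, before any codec runs.
try:
    codecs.encode("x", "no_such_codec_for_this_demo")
    unknown_failed = False
except LookupError:
    unknown_failed = True

print(encoded, decoded, rot, info.name, unknown_failed)
```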

codec_tuple and the (result, length) convention (lines 165 to 172)

cpython 3.14 @ ab2d84fe1023/Modules/_codecsmodule.c#L165-172

Every built-in codec wrapper returns not just the encoded/decoded object but also the number of input units consumed (bytes when decoding, characters when encoding). This is the protocol that the buffered incremental decoders and StreamReader in codecs.py rely on for streaming:

static PyObject *
codec_tuple(PyObject *decoded, Py_ssize_t len)
{
    if (decoded == NULL)
        return NULL;
    return Py_BuildValue("Nn", decoded, len);
}

The N format unit stores decoded in the tuple without incrementing its reference count, so the tuple takes over the caller's reference (no extra Py_DECREF needed on decoded). The n unit converts a Py_ssize_t. The wrapper functions typically pass data->len for fixed-length codecs (Latin-1, ASCII) and a consumed variable for stateful codecs that may stop short of the full buffer.
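The (result, length) pairs are visible when the low-level functions are called directly. A short sketch, with arbitrary sample input:

```python
import codecs

# Decoding: the second item counts input bytes consumed.
text, n_bytes = codecs.latin_1_decode(b"caf\xe9")

# Encoding: the second item counts input characters consumed,
# which can differ from the length of the resulting bytes.
data, n_chars = codecs.utf_8_encode("café")

print(text, n_bytes)            # decoded str and byte count
print(data, n_chars, len(data)) # encoded bytes, character count, byte length
```

Here `n_chars` is 4 while `data` is 5 bytes long, because the count always refers to the input side of the conversion.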

Stateful UTF-8 decode (lines 284 to 294)

cpython 3.14 @ ab2d84fe1023/Modules/_codecsmodule.c#L284-294

The stateful variant is used by StreamReader and incremental decoders. When final is False, the decoder stops at the last complete character, leaving any trailing partial sequence to be fed in the next call:

static PyObject *
_codecs_utf_8_decode_impl(PyObject *module, Py_buffer *data,
                          const char *errors, int final)
{
    Py_ssize_t consumed = data->len;
    PyObject *decoded = PyUnicode_DecodeUTF8Stateful(
        data->buf, data->len, errors,
        final ? NULL : &consumed);
    return codec_tuple(decoded, consumed);
}

When final is true, the consumed pointer argument is NULL, which tells PyUnicode_DecodeUTF8Stateful to treat a trailing incomplete sequence as an error; the local consumed variable keeps its initial value, data->len. When final is false, consumed is updated to the number of bytes actually decoded, and codec_tuple returns that smaller count so the caller can resume from the correct position.
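The resume-from-consumed pattern can be sketched in a few lines of Python. The helper `decode_chunks` is hypothetical and written only for this illustration; it is a simplified stand-in for what a buffered incremental decoder does, not the code in codecs.py.

```python
import codecs

def decode_chunks(chunks):
    """Decode an iterable of byte chunks, carrying partial sequences over."""
    pending = b""
    parts = []
    for chunk in chunks:
        buf = pending + chunk
        # final=False: stop before a trailing incomplete sequence.
        text, consumed = codecs.utf_8_decode(buf, "strict", False)
        parts.append(text)
        pending = buf[consumed:]   # bytes to retry on the next call
    # final=True: anything still pending is now an error.
    text, _ = codecs.utf_8_decode(pending, "strict", True)
    parts.append(text)
    return "".join(parts)

# "é" and "€" are deliberately split across chunk boundaries.
result = decode_chunks([b"caf\xc3", b"\xa9 \xe2\x82", b"\xac"])

# A single call shows the shortened count directly.
partial = codecs.utf_8_decode(b"caf\xc3", "strict", False)

# The same input with final=True is rejected.
try:
    codecs.utf_8_decode(b"caf\xc3", "strict", True)
    final_raised = False
except UnicodeDecodeError:
    final_raised = True

print(result, partial, final_raised)
```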

gopy mirror

Basic UTF-8 and ASCII encode/decode is handled natively in the gopy codecs package. The full codec registry, error-handler registry, and the remaining built-in codecs (UTF-16, UTF-32, UTF-7, Latin-1, charmap, escape variants) are pending.

CPython 3.14 changes

3.14 added _codecs._unregister_error (lines 1000-1005), a private counterpart to register_error that removes a custom error handler by name. _codecs.unregister (lines 83-92) follows the same pattern for search functions but is not new in 3.14: codecs.unregister and the C-level PyCodec_Unregister have been available since Python 3.10. The module definition also carries the Py_mod_gil = Py_MOD_GIL_NOT_USED slot, which was introduced in 3.13 for free-threaded builds.