Objects/unicodeobject.c (part 8)

Source:

cpython 3.14 @ ab2d84fe1023/Objects/unicodeobject.c

This annotation covers the codec interface. See objects_unicodeobject7_detail for str.__new__, str.join, str.format, str.split, and internal encodings.

Map

Lines	Symbol	Role
1-80	`str.encode`	Encode to `bytes` using a codec
81-160	`bytes.decode`	Decode to `str` using a codec
161-240	`PyUnicode_AsEncodedString`	C API: encode with explicit error handler
241-340	Error handlers	`strict`, `ignore`, `replace`, `xmlcharrefreplace`, `backslashreplace`
341-600	Codec search / `codecs.lookup`	Find encoder/decoder for a codec name

Reading

`str.encode`

// CPython: Objects/unicodeobject.c:11540 unicode_encode_impl
static PyObject *
unicode_encode_impl(PyObject *self, const char *encoding, const char *errors)
{
    if (encoding == NULL) encoding = "utf-8";
    if (errors == NULL) errors = "strict";
    return PyUnicode_AsEncodedString(self, encoding, errors);
}

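PyObject *
PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
{
    /* Look up the text codec by name and run its encoder */
    PyObject *v = _PyCodec_EncodeText(unicode, encoding, errors);
    if (v == NULL) {
        return NULL;  /* lookup or encoding failed; exception already set */
    }
    /* Verify result is bytes */
    if (!PyBytes_Check(v)) {
        PyErr_Format(PyExc_TypeError, "encoder did not return a bytes object");
        Py_DECREF(v);
        return NULL;
    }
    return v;
}

'hello'.encode() defaults to 'utf-8' with 'strict' error handling. The excerpt is simplified: the real function first short-circuits common encodings such as UTF-8, ASCII, and Latin-1 to dedicated C encoders. Every other name goes through _PyCodec_EncodeText, which looks up the codec by name, rejects codecs that are not text encodings, and calls the encoder function. A NULL result means an exception is already set, so it must be checked before the bytes check.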
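The defaults and the result-type restriction can be observed from Python. The following is a small sketch, not part of the annotated source; the helper name encode_rejects is made up for illustration. It assumes a stock CPython 3 where the 'hex' bytes-to-bytes codec is available.

```python
import codecs

text = "héllo"

# No arguments: encoding defaults to UTF-8, errors to "strict".
default_bytes = text.encode()
explicit_bytes = text.encode("utf-8", "strict")

# str.encode only accepts text encodings. A bytes-to-bytes codec such as
# "hex" is rejected with LookupError.
def encode_rejects(name):  # hypothetical helper for this sketch
    try:
        "abc".encode(name)
    except LookupError:
        return True
    return False

hex_rejected = encode_rejects("hex")

# codecs.encode() does not restrict the codec to text encodings.
hex_via_codecs = codecs.encode(b"abc", "hex")

print(default_bytes, hex_rejected, hex_via_codecs)
```

The split between str.encode and codecs.encode is why the C path can insist on a bytes result: arbitrary object-to-object codecs are expected to go through the codecs module instead.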

Error handlers

// CPython: Objects/unicodeobject.c:11200 unicode_encode_call_errorhandler
static PyObject *
unicode_encode_call_errorhandler(const char *errors, PyObject **errorHandler,
                          const char *encoding, const char *reason,
                          PyObject *unicode, PyObject **exceptionObject,
                          Py_ssize_t startpos, Py_ssize_t endpos,
                          Py_ssize_t *newpos)
{
    /* Look up the error handler by name; the caller caches it across calls */
    if (*errorHandler == NULL) {
        *errorHandler = PyCodec_LookupError(errors);
        if (*errorHandler == NULL)
            return NULL;
    }
    /* Create (or update) the UnicodeEncodeError, then call handler(exc) */
    make_encode_exception(exceptionObject, encoding, unicode,
                          startpos, endpos, reason);
    if (*exceptionObject == NULL)
        return NULL;
    PyObject *restuple = PyObject_CallOneArg(*errorHandler, *exceptionObject);
    /* restuple: (replacement_str_or_bytes, new_position) */
    ...
}

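Error handlers receive a UnicodeEncodeError (or a UnicodeDecodeError when decoding) and return a (replacement, position) tuple, where position is the index at which processing resumes. Built-in handlers when encoding: strict re-raises the exception, ignore returns ('', endpos), replace returns one '?' per unencodable character, xmlcharrefreplace returns '&#N;' with N the decimal code point, and backslashreplace returns an escape such as '\xNN' or '\uNNNN'. When decoding, replace substitutes U+FFFD instead of '?'.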
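The handler contract is easiest to see from the Python side. The sketch below applies the built-in handlers to an ASCII encode and then registers a custom handler through codecs.register_error. The handler name demo_underscore and the function underscore_handler are invented for this example.

```python
import codecs

text = "naïve ☃"  # contains U+00EF and U+2603, neither encodable as ASCII

# Built-in handlers applied while encoding to ASCII.
results = {
    name: text.encode("ascii", errors=name)
    for name in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

# Hypothetical custom handler: it receives the exception object and returns
# (replacement, position at which encoding should resume).
def underscore_handler(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only supports encoding
    bad = exc.object[exc.start:exc.end]
    return ("_" * len(bad), exc.end)

codecs.register_error("demo_underscore", underscore_handler)
custom = text.encode("ascii", errors="demo_underscore")

for name, value in results.items():
    print(name, value)
print("demo_underscore", custom)
```

Returning exc.end as the position skips exactly the offending range; a handler may return a different index to resume elsewhere in the input.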

Codec search

// CPython: Python/codecs.c:80 _PyCodec_Lookup
PyObject *
_PyCodec_Lookup(const char *encoding)
{
    /* Normalize: lowercase; hyphens and spaces become underscores */
    PyObject *v = normalizestring(encoding);
    if (v == NULL) return NULL;
    /* First, try the per-interpreter cache (the dict returns a borrowed reference) */
    PyObject *result = PyDict_GetItemWithError(interp->codec_search_cache, v);
    if (result != NULL) return Py_NewRef(result);
    /* Next, call each registered search function in registration order */
    for each search_function in interp->codec_search_path:
        result = search_function(v);
        if (result != Py_None) {   /* None means "name not recognized" */
            PyDict_SetItem(interp->codec_search_cache, v, result);
            return result;
        }
    /* No search function matched: LookupError("unknown encoding: ...") */
    ...
}

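Codec names are normalized before lookup, so 'UTF-8' and 'UTF_8' both become 'utf_8'. Spellings that normalize differently, such as 'utf8', reach the same codec through the alias table in the encodings package. The search cache, keyed by the normalized name, avoids repeated lookups. Custom codecs register via codecs.register(search_function); a search function returns a CodecInfo for names it recognizes and None otherwise.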
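A short sketch of name resolution and registration, seen from Python. The codec name demo_rot13_text and the function demo_search are hypothetical; the example leans on the standard rot_13 codec only to have a visible transformation, and it assumes Python 3.9 or later, where hyphens in the requested name reach the search function as underscores.

```python
import codecs

# Different spellings resolve to the same canonical codec name.
spellings = ["UTF-8", "utf8", "UTF_8", "Utf-8"]
canonical = {codecs.lookup(s).name for s in spellings}

# Spellings that share a normalized name hit the same cache entry.
same_object = codecs.lookup("UTF-8") is codecs.lookup("utf-8")

def demo_search(name):  # hypothetical search function
    # The registry passes the normalized name.
    if name != "demo_rot13_text":
        return None  # not ours; let the next search function try

    def encode(text, errors="strict"):
        rotated = codecs.encode(text, "rot_13")
        return rotated.encode("ascii", errors), len(text)

    def decode(data, errors="strict"):
        raw = bytes(data)  # the decoder may be handed a memoryview
        return codecs.decode(raw.decode("ascii", errors), "rot_13"), len(raw)

    return codecs.CodecInfo(encode, decode, name="demo_rot13_text")

codecs.register(demo_search)

encoded = "Hello".encode("Demo-Rot13-Text")
decoded = encoded.decode("demo_rot13_text")
print(canonical, same_object, encoded, decoded)
```

Because search functions are consulted in registration order, the standard encodings package gets the first chance at every name; a custom search function only sees names the built-in search did not resolve.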

gopy notes

str.encode is objects.StrEncode in objects/str.go. It calls module/codecs.Encode. Error handlers are registered in a Go map[string]objects.ErrorHandler. The utf-8, ascii, latin-1, and utf-16 codecs are built in; others are looked up via sys.codec_search_functions.
