
Objects/unicodeobject.c (part 9)

Source:

cpython 3.14 @ ab2d84fe1023/Objects/unicodeobject.c

This annotation covers the encode path. See objects_strobject8_detail for str.decode and codec infrastructure, and earlier parts for str.__new__, __format__, and search methods.

Map

| Lines | Symbol | Role |
|---|---|---|
| 1-80 | str.encode | Python-level entry; resolve codec name and errors handler |
| 81-200 | PyUnicode_AsEncodedString | Core: look up codec, call encoder, check result type |
| 201-320 | PyCodec_Encode | Codec registry lookup and dispatch |
| 321-440 | _PyUnicode_AsUTF8 | Fast path: return internal UTF-8 cache if present |
| 441-600 | Incremental encoder | codecs.getincrementalencoder flow |

Reading

str.encode

// CPython: Objects/unicodeobject.c:11420 unicode_encode_impl
static PyObject *
unicode_encode_impl(PyObject *self, const char *encoding, const char *errors)
{
    if (encoding == NULL)
        encoding = PyUnicode_GetDefaultEncoding();  /* "utf-8" */
    return PyUnicode_AsEncodedString(self, encoding, errors);
}

"hello".encode() defaults to UTF-8. With errors='replace', any character the target codec cannot represent is replaced with ?, so "€".encode('latin-1', 'replace') yields b'?'. The errors argument is passed verbatim to the codec.
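The defaults and error handlers described above can be checked from the Python side. This is a small sketch; the sample strings and the helper name strict_fails are chosen for illustration and do not come from the CPython source.

```python
# No codec name given: str.encode falls back to UTF-8.
default_bytes = "héllo".encode()

# latin-1 can represent 'é' but not the euro sign.
latin_ok = "héllo".encode("latin-1")
replaced = "5 €".encode("latin-1", "replace")  # unencodable char becomes '?'
ignored = "5 €".encode("latin-1", "ignore")    # unencodable char is dropped


def strict_fails(text, encoding):
    """Return True if encoding with the default 'strict' handler raises."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False
```

With the default 'strict' handler the same euro sign raises UnicodeEncodeError for latin-1, which is why the handler name matters only when the codec cannot represent the input.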

PyUnicode_AsEncodedString

// CPython: Objects/unicodeobject.c:3620 PyUnicode_AsEncodedString
PyObject *
PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding,
                          const char *errors)
{
    /* Fast paths for common encodings */
    if (_Py_IsUTF8Encoding(encoding)) {
        return _PyUnicode_AsUTF8String(unicode, errors);
    }
    if (_Py_IsASCIIEncoding(encoding)) {
        return PyUnicode_EncodeASCII(...);
    }
    /* General path: look up codec */
    PyObject *v = PyCodec_Encode(unicode, encoding, errors);
    if (v == NULL) {
        /* Lookup or encoder failed; the exception is already set */
        return NULL;
    }
    if (!PyBytes_Check(v)) {
        PyErr_Format(PyExc_TypeError,
                     "'%.400s' encoder returned '%s' instead of 'bytes'",
                     encoding, Py_TYPE(v)->tp_name);
        Py_DECREF(v);
        return NULL;
    }
    return v;
}

The fast paths bypass the codec registry for the two most common encodings. If a custom codec's encoder returns something other than bytes, the call raises TypeError: encoders reached through str.encode must return bytes.
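The result-type check can be observed by registering a deliberately broken codec. The sketch below is illustrative only: the codec name notbytes_demo and the helper functions are hypothetical, and the encoder returns str on purpose.

```python
import codecs

CODEC_NAME = "notbytes_demo"  # hypothetical codec name, used only in this sketch


def _bad_encode(text, errors="strict"):
    # Deliberately wrong for a text encoding: returns str, not bytes.
    return text.upper(), len(text)


def _decode(data, errors="strict"):
    # Ordinary decoder so the CodecInfo is complete.
    return bytes(data).decode("ascii", errors), len(data)


def _search(name):
    # The registry calls search functions with the normalized codec name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(_bad_encode, _decode, name=CODEC_NAME)
    return None


codecs.register(_search)


def encode_error_type(text):
    """Return the exception type name raised by str.encode, or None."""
    try:
        text.encode(CODEC_NAME)
    except TypeError as exc:
        return type(exc).__name__
    return None


result = encode_error_type("abc")
# codecs.encode does not apply the bytes check, so the str comes back as-is.
via_codecs = codecs.encode("abc", CODEC_NAME)
```

The contrast with codecs.encode shows that the check belongs to the str.encode path rather than to the codec registry itself.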

_PyUnicode_AsUTF8

// CPython: Objects/unicodeobject.c:3980 _PyUnicode_AsUTF8
const char *
_PyUnicode_AsUTF8(PyObject *unicode)
{
    /* Return the cached UTF-8 representation if available.
       If not, encode and cache it in self->utf8. */
    if (PyUnicode_IS_ASCII(unicode)) {
        /* ASCII strings: the internal buffer IS the UTF-8 */
        return (const char *)PyUnicode_DATA(unicode);
    }
    if (((PyCompactUnicodeObject *)unicode)->utf8 != NULL) {
        return ((PyCompactUnicodeObject *)unicode)->utf8;
    }
    /* Encode and cache */
    PyObject *bytes = _PyUnicode_AsUTF8String(unicode, "strict");
    ...
    ((PyCompactUnicodeObject *)unicode)->utf8 = cache;
    return cache;
}

_PyUnicode_AsUTF8 is used by PyUnicode_AsUTF8AndSize, which C extensions call to get a const char * for passing to C APIs. The cache makes repeated calls O(1) after the first. The cached buffer is freed when the str object is deallocated, so the returned pointer must not be used after the str is gone.
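The caching behaviour can be probed from Python through ctypes. This is a CPython-only sketch, assuming ctypes.pythonapi is available; the helper name utf8_view is hypothetical. It calls the public PyUnicode_AsUTF8AndSize twice on the same non-ASCII str and compares the returned addresses.

```python
import ctypes

# Bind the public C API function; restype is a raw address so we can compare it.
_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_as_utf8.restype = ctypes.c_void_p


def utf8_view(s):
    """Return (address, bytes copy) of the UTF-8 buffer CPython hands out for s."""
    size = ctypes.c_ssize_t()
    ptr = _as_utf8(s, ctypes.byref(size))
    # Copy the bytes while s is still alive; the buffer belongs to the str.
    return ptr, ctypes.string_at(ptr, size.value)


text = "naïve café"  # non-ASCII, so a separate UTF-8 buffer is needed
ptr1, data1 = utf8_view(text)
ptr2, data2 = utf8_view(text)
same_buffer = ptr1 == ptr2
```

The second call returning the same address is consistent with the buffer being cached on the str object rather than re-encoded each time.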

Incremental encoder

# CPython: Lib/codecs.py:180 IncrementalEncoder
class IncrementalEncoder:
    """Stateful encoder for streaming use."""
    def __init__(self, errors='strict'):
        self.errors = errors
        self.buffer = ""

    def encode(self, input, final=False):
        raise NotImplementedError

    def reset(self):
        pass

    def getstate(self):
        return 0

    def setstate(self, state):
        pass

codecs.getincrementalencoder('utf-8')() returns a stateful encoder. Feed partial strings via encode(chunk) and call encode('', final=True) to flush. TextIOWrapper uses this interface for streaming writes.
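The feed-then-flush pattern described above can be sketched as a small helper. The function name encode_stream and the sample chunks are hypothetical; UTF-16 is included only because its byte order mark makes the encoder's state visible.

```python
import codecs


def encode_stream(chunks, encoding="utf-8", errors="strict"):
    """Encode an iterable of str chunks with one stateful incremental encoder."""
    encoder = codecs.getincrementalencoder(encoding)(errors)
    out = [encoder.encode(chunk) for chunk in chunks]
    # Final call with an empty string flushes any pending state.
    out.append(encoder.encode("", final=True))
    return b"".join(out)


chunks = ["hé", "llo"]
utf8_stream = encode_stream(chunks, "utf-8")
utf16_stream = encode_stream(chunks, "utf-16")

# Encoding each chunk independently writes a BOM per chunk for UTF-16.
utf16_naive = b"".join(chunk.encode("utf-16") for chunk in chunks)
```

For UTF-8 the chunked and one-shot results are identical either way; for UTF-16 only the incremental encoder matches the one-shot result, because it remembers that the byte order mark has already been written.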

gopy notes

str.encode is objects.UnicodeEncode in objects/str.go. PyUnicode_AsEncodedString calls objects.CodecEncode. _PyUnicode_AsUTF8 returns objects.Str.UTF8Cache (a []byte field populated on first access). Incremental encoders are module/codecs.IncrementalEncoder in module/codecs/module.go.