
Objects/unicodeobject.c (part 5)

Source:

cpython 3.14 @ ab2d84fe1023/Objects/unicodeobject.c

This annotation covers str.format_map, the codec interface, and the translation table mechanism. See objects_unicodeobject_detail through objects_unicodeobject4_detail for the basic API, PEP 393 compact strings, case operations, and str.format.

Map

Lines      Symbol                         Role
1-100      str.format_map                 Like str.format(**mapping) but takes any mapping
101-250    str.translate / str.maketrans  Character-by-character substitution via a table
251-400    str.encode                     Encode to bytes using a named codec
401-600    PyUnicode_AsEncodedString      Core codec dispatch used by encode
601-800    _PyUnicode_AsASCIIString       Fast ASCII encoding without codec lookup
801-1000   PyUnicode_DecodeUTF8           Fast UTF-8 decoder
1001-1500  PyUnicode_DecodeUTF8Stateful   Incremental UTF-8 decoding for streams

Reading

str.format_map

// CPython: Objects/unicodeobject.c:4820 unicode_format_map
static PyObject *
unicode_format_map(PyObject *self, PyObject *map)
{
    /* Like str.format(**map) but:
       1. The mapping is not copied into a dict first.
       2. Positional fields ('{0}', '{}') are rejected with ValueError;
          attribute and index access ('{obj.attr}', '{obj[0]}') still work.
       Purpose: allow custom mapping types (e.g. defaultdict). */
    return _PyObject_FormatMap(self, map);
}

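'{key}'.format_map(some_mapping) looks up some_mapping['key'] on the mapping object itself, so a dict subclass's __missing__ hook runs (this is how defaultdict supplies its defaults). str.format(**some_mapping) unpacks the mapping into a plain dict first, which loses that hook.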
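The difference in lookup behavior can be observed from Python. The following is a minimal sketch; KeepMissing is a hypothetical dict subclass introduced only for illustration.

```python
from collections import defaultdict


class KeepMissing(dict):
    """Hypothetical dict subclass: unknown keys render as a placeholder."""

    def __missing__(self, key):
        return "<" + key + ">"


values = KeepMissing(name="Ada")

# format_map indexes the mapping itself, so __missing__ runs for 'lang'.
via_format_map = "{name} writes {lang}".format_map(values)

# format(**values) unpacks into a plain dict first, so the lookup fails.
try:
    "{name} writes {lang}".format(**values)
    format_raised = False
except KeyError:
    format_raised = True

# defaultdict takes the same route through its default_factory.
with_default = "{hits}".format_map(defaultdict(int))
```

Here via_format_map is 'Ada writes <lang>', while the str.format call raises KeyError because the keyword-argument dict no longer carries the subclass behavior.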

str.maketrans

// CPython: Objects/unicodeobject.c:5020 unicode_maketrans
static PyObject *
unicode_maketrans(PyObject *null, PyObject *args)
{
    /* Builds a translation table dict:
         x: str of from-chars, y: str of to-chars (same length as x),
         z: optional str of delete-chars
       OR
         x: dict mapping ordinals or single chars to
            ordinals, strings, or None */
    ...
    /* Result: {ord(from_char): ord(to_char), ord(del_char): None, ...} */
}

str.translate(table) looks up each character's code point in the table: table[ord(ch)] gives a replacement ordinal, a replacement string, or None (delete).
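Both calling conventions of str.maketrans produce an ordinary dict keyed by code point. A short sketch of the two forms, with arbitrary sample characters:

```python
# Three-argument form: map a -> 1, b -> 2, and delete every 'x'.
table3 = str.maketrans("ab", "12", "x")

# One-argument form: keys may be ordinals or single characters;
# values may be ordinals, strings of any length, or None.
table1 = str.maketrans({"a": "1", ord("b"): ord("2"), "x": None, "&": "and"})

result3 = "abxc".translate(table3)
result1 = "a&bxc".translate(table1)
```

Character keys are converted to their ordinals when the table is built, so table3 is {97: 49, 98: 50, 120: None}.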

str.translate

// CPython: Objects/unicodeobject.c:5100 unicode_translate
static PyObject *
unicode_translate(PyObject *self, PyObject *table)
{
    return _PyUnicode_TranslateCharmap(self, table, "ignore");
}

The inner loop looks up each character with the equivalent of PyObject_GetItem(table, PyLong_FromLong(ch)). If the result is None, the character is deleted; if it is an int, it becomes the replacement code point; if it is a string, the string is inserted. If the lookup raises LookupError, the character is copied through unchanged.
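Because the lookup goes through the generic item protocol, the table does not have to be a dict. The sketch below uses a hypothetical SampleTable class (not part of CPython) that implements only __getitem__ and exercises each kind of result.

```python
class SampleTable:
    """Hypothetical table object; str.translate only needs __getitem__."""

    def __getitem__(self, codepoint):
        ch = chr(codepoint)
        if ch == "a":
            return None               # delete the character
        if ch == "e":
            return ord("E")           # replacement code point
        if ch == "o":
            return "[o]"              # replacement string
        raise LookupError(codepoint)  # keep the character unchanged


translated = "a neon sign".translate(SampleTable())
```

The leading 'a' is dropped, 'e' becomes 'E', 'o' expands to three characters, and everything else passes through.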

str.encode

// CPython: Objects/unicodeobject.c:5200 unicode_encode_impl
static PyObject *
unicode_encode_impl(PyObject *self, const char *encoding,
                    const char *errors)
{
    return PyUnicode_AsEncodedString(self, encoding, errors);
}

PyUnicode_AsEncodedString

// CPython: Objects/unicodeobject.c:3600 PyUnicode_AsEncodedString
PyObject *
PyUnicode_AsEncodedString(PyObject *unicode,
                          const char *encoding, const char *errors)
{
    /* Simplified: the real code normalizes the codec name before comparing. */
    if (encoding == NULL)  /* NULL selects UTF-8 */
        return _PyUnicode_AsUTF8String(unicode, errors);
    /* Fast path for UTF-8, ASCII, Latin-1 */
    if (strcmp(encoding, "utf-8") == 0 || strcmp(encoding, "utf8") == 0)
        return _PyUnicode_AsUTF8String(unicode, errors);
    if (strcmp(encoding, "ascii") == 0)
        return _PyUnicode_AsASCIIString(unicode, errors);
    if (strcmp(encoding, "latin-1") == 0 || strcmp(encoding, "latin1") == 0)
        return _PyUnicode_AsLatin1String(unicode, errors);
    /* General path: look up the codec by name in the codec registry */
    return _PyCodec_EncodeText(unicode, encoding, errors);
}
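Which internal path a call takes is not visible from Python, but the results of the dispatch are. The sketch below assumes only the standard codecs shipped with CPython; cp437 stands in for a codec that is resolved by name through the registry.

```python
import codecs

text = "café"

# Different spellings of one codec name give the same bytes.
utf8_a = text.encode("utf-8")
utf8_b = text.encode("UTF8")

latin1 = text.encode("latin-1")

# ASCII cannot represent 'é'; the errors argument decides what happens.
ascii_replaced = text.encode("ascii", errors="replace")
ascii_ignored = text.encode("ascii", errors="ignore")

try:
    text.encode("ascii")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# A codec resolved by name through the codec registry.
cp437 = text.encode("cp437")
registry_name = codecs.lookup("cp437").name
```

With the default errors='strict', the ASCII call raises UnicodeEncodeError, while 'replace' substitutes '?' and 'ignore' drops the character.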

PyUnicode_DecodeUTF8Stateful

// CPython: Objects/unicodeobject.c:1200 PyUnicode_DecodeUTF8Stateful
PyObject *
PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size,
                             const char *errors, Py_ssize_t *consumed)
{
    /* consumed: set to number of bytes consumed (for incremental decoding).
       If consumed != NULL, incomplete sequences at the end are not errors. */
    ...
}

consumed is used by codecs.getincrementaldecoder('utf-8') to handle partial multi-byte sequences at buffer boundaries.
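The effect of consumed can be seen from Python by splitting a multi-byte character across two chunks. This is a small sketch; codecs.utf_8_decode is the low-level function whose third argument, final, controls whether a trailing incomplete sequence is an error.

```python
import codecs

data = "€".encode("utf-8")  # three bytes: e2 82 ac

# The incremental decoder buffers the incomplete prefix between calls.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(data[:2])
second = decoder.decode(data[2:])

# With final=False the call reports how many bytes it actually consumed.
partial, used = codecs.utf_8_decode(b"a" + data[:2], "strict", False)

# With final=True the same truncated input is an error.
try:
    codecs.utf_8_decode(data[:2], "strict", True)
    final_raised = False
except UnicodeDecodeError:
    final_raised = True
```

The first decode call returns an empty string and the second returns '€'; in the low-level call only the single byte for 'a' is reported as consumed, leaving the two-byte prefix for the caller to resubmit.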

gopy notes

str.encode calls vm.CodecEncode in vm/codec.go. Fast paths for UTF-8/ASCII/Latin-1 are implemented in objects/unicode_encode.go. PyUnicode_DecodeUTF8Stateful is objects.UnicodeDecodeUTF8Stateful, used by the incremental codec infrastructure in module/codecs/incremental.go.