Objects/unicodeobject.c (part 8)

Source:

cpython 3.14 @ ab2d84fe1023/Objects/unicodeobject.c

This annotation covers encoding and table-driven operations. See objects_strobject7_detail for str.split/join/replace/strip, and earlier parts for str.__new__ and repr.

Map

Lines      Symbol           Role
1-100      str.encode       Encode to bytes via the codec registry
101-220    str.format_map   Like str.format but with a mapping argument
221-360    str.maketrans    Build a translation table for str.translate
361-480    str.translate    Character-by-character replacement using a table
481-600    str.expandtabs   Replace \t with spaces aligned to tab stops

Reading

str.encode

// CPython: Objects/unicodeobject.c:3480 unicode_encode_impl
static PyObject *
unicode_encode_impl(PyObject *self, const char *encoding, const char *errors)
{
    /* Default encoding is UTF-8 if encoding is NULL */
    if (encoding == NULL) {
        return _PyUnicode_AsUTF8String(self, errors);
    }
    return PyUnicode_AsEncodedString(self, encoding, errors);
}

'hello'.encode('latin-1') calls into the codec registry. The fast path for UTF-8 uses _PyUnicode_AsUTF8String, which skips the registry lookup entirely. The errors argument controls what happens on unencodable characters: 'strict' (the default, raises UnicodeEncodeError), 'replace', 'ignore', 'xmlcharrefreplace', or 'backslashreplace'.
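The error handlers are easiest to see side by side. A quick demonstration of the behaviors listed above:

```python
# str.encode with different error handlers on a non-ASCII character.
s = "café"  # 'é' is U+00E9

assert s.encode("utf-8") == b"caf\xc3\xa9"      # default encoding, fast path
assert s.encode("latin-1") == b"caf\xe9"        # single-byte codec via the registry
assert s.encode("ascii", "replace") == b"caf?"  # unencodable char becomes '?'
assert s.encode("ascii", "ignore") == b"caf"    # unencodable char is dropped
assert s.encode("ascii", "xmlcharrefreplace") == b"caf&#233;"
assert s.encode("ascii", "backslashreplace") == b"caf\\xe9"
```

With 'strict' (or no handler argument), the same call raises UnicodeEncodeError instead.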

str.maketrans

// CPython: Objects/unicodeobject.c:2980 unicode_maketrans_impl
static PyObject *
unicode_maketrans_impl(PyObject *x, PyObject *y, PyObject *z)
{
    PyObject *new = PyDict_New();
    if (PyUnicode_Check(x)) {
        /* Two-argument form: str.maketrans('abc', 'xyz')
           (x and y must have equal length).
           Map each char in x to the corresponding char in y. */
        for (Py_ssize_t i = 0; i < PyUnicode_GET_LENGTH(x); i++) {
            Py_UCS4 c = PyUnicode_READ_CHAR(x, i);
            Py_UCS4 d = PyUnicode_READ_CHAR(y, i);
            PyDict_SetItem(new, PyLong_FromLong(c), PyLong_FromLong(d));
        }
    } else {
        /* One-argument form: str.maketrans({'a': 'x', 'b': None}) */
        ...
    }
    /* z = chars to delete: map to None */
    ...
    return new;
}

str.maketrans('abc', 'ABC', 'xyz') creates a dict mapping ord('a') -> ord('A'), etc., and ord('x') -> None (delete). The dict is keyed by Unicode code points (integers), not characters.
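The integer keying is visible from Python. A short check of both forms described above:

```python
# Three-argument form: map 'abc' -> 'ABC', delete 'xyz'.
table = str.maketrans("abc", "ABC", "xyz")

# Keys are Unicode code points (ints), not one-character strings.
assert table[ord("a")] == ord("A")
assert table[ord("x")] is None
assert "abcxyz".translate(table) == "ABC"

# One-argument (mapping) form: keys are converted to code points,
# values are kept as-is (int, str, or None).
table2 = str.maketrans({"a": "x", "b": None})
assert table2 == {ord("a"): "x", ord("b"): None}
```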

str.translate

// CPython: Objects/unicodeobject.c:3080 unicode_translate
static PyObject *
unicode_translate(PyObject *self, PyObject *table)
{
    /* Walk through self, looking up each code point in table.
       None -> delete, int -> replacement code point, str -> insert string. */
    return _PyUnicode_TranslateCharmap(self, table, "ignore");
}

'hello world'.translate(str.maketrans('aeiou', '12345')) produces 'h2ll4 w4rld'. Translating is O(n) in the string length, multiplied by the per-character table lookup cost.
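The three value types the comment mentions (int, str, None) can all appear in one table, which need not come from str.maketrans at all — any mapping from code points works:

```python
# Int values replace, str values insert, None deletes.
table = str.maketrans("aeiou", "12345")
assert "hello world".translate(table) == "h2ll4 w4rld"

# A hand-built mapping keyed by code points: 'l' -> "LL", 'o' deleted.
mixed = {ord("l"): "LL", ord("o"): None}
assert "hello".translate(mixed) == "heLLLL"

# Code points absent from the table pass through unchanged.
assert "xyz".translate({}) == "xyz"
```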

str.expandtabs

// CPython: Objects/unicodeobject.c:3280 unicode_expandtabs_impl
static PyObject *
unicode_expandtabs_impl(PyObject *self, int tabsize)
{
    /* Replace each \t with spaces to reach the next tab stop.
       Tab stops are at columns 0, tabsize, 2*tabsize, ... */
    Py_ssize_t col = 0;
    for each char c in self:
        if (c == '\t') {
            /* tabsize <= 0 removes tabs without inserting spaces */
            Py_ssize_t spaces = (tabsize > 0) ? tabsize - (col % tabsize) : 0;
            /* append 'spaces' space chars */
            col += spaces;
        } else {
            col = (c == '\n' || c == '\r') ? 0 : col + 1;
            /* append c */
        }
}

'\thello\tworld'.expandtabs(8) produces '        hello   world': the first tab expands to eight spaces (column 0 to 8), and after 'hello' the column is 13, so the second tab expands to three spaces to reach column 16. \n and \r reset the column counter.
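The column arithmetic from the snippet can be checked directly:

```python
# Tab stops every 8 columns: col 0 -> 8 spaces; after 'hello' col is 13,
# so the second tab expands to 16 - 13 = 3 spaces.
assert "\thello\tworld".expandtabs(8) == "        hello   world"

# Newlines reset the column counter, so the tab after 'y' (col 1)
# expands to 3 spaces with tabsize 4.
assert "x\ny\tz".expandtabs(4) == "x\ny   z"

# tabsize 0 removes tabs without inserting anything.
assert "a\tb".expandtabs(0) == "ab"
```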

gopy notes

str.encode is objects.UnicodeEncode in objects/str.go. It calls module/codecs.Encode. str.maketrans builds a Go map from rune to rune or nil (delete). str.translate walks the []rune and applies the map. str.expandtabs uses a strings.Builder and tracks column position.