Objects/unicodeobject.c

Source:

cpython 3.14 @ ab2d84fe1023/Objects/unicodeobject.c

Objects/unicodeobject.c is the largest file in CPython (~15,500 lines). It implements Python's str type. Since PEP 393 (Python 3.3), each string is stored with the narrowest fixed-width element that fits its largest code point: 1 byte per character for ASCII and Latin-1 (ASCII-only strings also get a smaller header), 2 bytes for the rest of the BMP, and 4 bytes for anything beyond it. This keeps ASCII-heavy text small while still covering the full Unicode range with constant-time indexing.
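The width selection described above can be sketched in a few lines. This is only an illustration: `pick_kind` is a hypothetical helper invented for this page, not a CPython function. In CPython the caller passes the largest code point to `PyUnicode_New` as its `maxchar` argument, and the allocator derives the width from that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the bytes per character (1, 2 or 4) that a
   PEP 393 style representation would need for the given code points. */
int
pick_kind(const uint32_t *cps, size_t n)
{
    uint32_t maxchar = 0;
    for (size_t i = 0; i < n; i++) {
        assert(cps[i] <= 0x10FFFF);   /* must be inside the Unicode range */
        if (cps[i] > maxchar) {
            maxchar = cps[i];
        }
    }
    if (maxchar < 0x100) {
        return 1;   /* ASCII or Latin-1 */
    }
    if (maxchar < 0x10000) {
        return 2;   /* Basic Multilingual Plane */
    }
    return 4;       /* supplementary planes */
}
```

A single character above U+FFFF forces the whole string to 4 bytes per character, which is the price paid for keeping indexing constant-time.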

Map

Lines	Symbol	Role
1-500	Object layout, `_PyUnicode_New`, `PyUnicode_New`	Allocation by kind
501-1500	Encoding/decoding: UTF-8, UTF-16, UTF-32, Latin-1, ASCII, raw-unicode-escape	Codec dispatch
1501-2500	`PyUnicode_FromFormat`, `PyUnicode_FromObject`	Construction
2501-3500	`PyUnicode_Decode`, `PyUnicode_Encode`	Generic codec interface
3501-5000	String operations: `find`, `count`, `replace`, `split`, `join`, `strip`	Method implementations
5001-6500	`unicode_format`	`%` formatting and `str.format()`
6501-8000	`PyUnicode_CompareWithASCIIString`, `PyUnicode_RichCompare`, `unicode_hash`	Comparison and hashing
8001-10000	Interning: `PyUnicode_InternInPlace`, `_Py_ID` table	String interning
10001-15500	`tp_methods`, `tp_as_sequence`, `tp_as_mapping`	`str` type slots and methods

Reading

PEP 393 compact representation

The layout of a Unicode object is:

// CPython: Include/cpython/unicodeobject.h PyASCIIObject
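typedef struct {
    PyObject_HEAD
    Py_ssize_t length;   /* number of code points */
    Py_hash_t  hash;     /* cached hash, -1 if not yet computed */
    struct {
        ...
        unsigned int kind:3;      /* 1, 2 or 4 bytes per character */
        unsigned int compact:1;   /* data lives in the same allocation as the header */
        unsigned int ascii:1;     /* every character is below 128 */
        ...
    } state;
} PyASCIIObject;   /* the wstr field of older versions was removed in 3.12 */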

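For compact ASCII strings (kind=1 with the ascii bit set), the character data immediately follows the PyASCIIObject header. All other compact strings, that is non-ASCII Latin-1 (kind=1, ascii=0), UCS-2 (kind=2), and UCS-4 (kind=4), use the larger PyCompactUnicodeObject header, which embeds a PyASCIIObject and adds a lazily filled UTF-8 cache (utf8_length, utf8). Their character data follows that larger header.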
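The "data follows the header" idea can be modelled without any Python headers. The sketch below is an assumption-laden mock: `MockAscii`, `MockCompact`, `mock_new`, and `mock_data` are hypothetical names, and the structs omit `PyObject_HEAD`, the hash, and most state bits. It assumes ASCII-only strings use the short header and every other string uses the longer one, and that one allocation holds header, characters, and a trailing NUL character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for illustration only. */
typedef struct {
    size_t length;
    unsigned int kind : 3;
    unsigned int ascii : 1;
} MockAscii;

typedef struct {
    MockAscii base;
    size_t utf8_length;   /* cached UTF-8 size */
    char *utf8;           /* cached UTF-8 buffer, filled lazily */
} MockCompact;

/* Character data starts right after whichever header the string uses. */
void *
mock_data(MockAscii *op)
{
    if (op->ascii) {
        return (void *)(op + 1);
    }
    return (void *)((MockCompact *)op + 1);
}

/* Allocate header, characters and a terminating NUL character in one block. */
MockAscii *
mock_new(size_t length, int kind, int ascii)
{
    assert(kind == 1 || kind == 2 || kind == 4);
    assert(!ascii || kind == 1);   /* ASCII implies one byte per character */
    size_t header = ascii ? sizeof(MockAscii) : sizeof(MockCompact);
    MockAscii *op = calloc(1, header + (length + 1) * (size_t)kind);
    if (op == NULL) {
        return NULL;
    }
    op->length = length;
    op->kind = (unsigned int)kind;
    op->ascii = ascii ? 1u : 0u;
    return op;
}
```

Because the header and the characters share one allocation, creating a string costs a single allocator call and reading a character needs no extra pointer dereference.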

Interning

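PyUnicode_InternInPlace looks up the string in interned (a dict). If an equal string is already there, it replaces the caller's reference with that interned singleton; otherwise the caller's string is stored and becomes the canonical copy. Identifier-like strings (NAME tokens, attribute names) are interned automatically. The excerpt below is simplified.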

// CPython: Objects/unicodeobject.c:14910 PyUnicode_InternInPlace
void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    ...
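    PyObject *t = PyDict_SetDefault(interned, s, s);  /* borrowed reference */
    if (t == NULL) {
        PyErr_Clear();  /* interning is best-effort; leave *p untouched */
        return;
    }
    if (t != s) {
        /* an equal string was interned earlier: hand that one back */
        Py_SETREF(*p, Py_NewRef(t));
    }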
    ...
}
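The set-default pattern above (insert if absent, otherwise reuse what is already stored) can be tried outside the interpreter. The sketch below is a toy under stated assumptions: `intern`, `bucket_of`, `table`, and `TABLE_SIZE` are hypothetical names, the table is a fixed-size open-addressing array rather than a dict, and NUL-terminated C strings stand in for str objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 64   /* arbitrary; a power of two so masking works */

static char *table[TABLE_SIZE];   /* hypothetical stand-in for `interned` */
static size_t used;

/* FNV-1a, chosen only for brevity. */
static size_t
bucket_of(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return (size_t)(h & (TABLE_SIZE - 1));
}

/* Return the canonical copy of s, storing one on first sight.
   Returns NULL if the table is full or memory runs out. */
const char *
intern(const char *s)
{
    assert(s != NULL);
    size_t i = bucket_of(s);
    while (table[i] != NULL) {
        if (strcmp(table[i], s) == 0) {
            return table[i];              /* already interned: reuse it */
        }
        i = (i + 1) & (TABLE_SIZE - 1);   /* linear probing */
    }
    if (used == TABLE_SIZE - 1) {
        return NULL;   /* keep one slot free so probing always terminates */
    }
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, s, n);
    table[i] = copy;
    used++;
    return copy;
}
```

Once two strings have both been interned, equality can be decided by comparing pointers, which is the main reason identifiers are interned.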

`unicode_hash`

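Hashes the string's raw character buffer (length × kind bytes), not a UTF-8 encoding, with the build's configured byte-hash function: SipHash-1-3 by default, with FNV available as a build-time option. Because a string is always stored in the narrowest kind that fits, equal strings have identical buffers and therefore equal hashes. The result is cached in _PyASCIIObject_CAST(op)->hash, where -1 means "not computed yet".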
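The compute-once, cache-in-the-object pattern can be sketched as follows. Everything here is illustrative: `MockStr`, `mock_hash`, and `fnv1a` are hypothetical names, the sketch hashes the raw buffer bytes, and FNV-1a is used purely as a short stand-in for whatever hash function the build selects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const unsigned char *data;  /* raw character buffer */
    size_t length;              /* number of characters */
    int kind;                   /* bytes per character: 1, 2 or 4 */
    int64_t hash;               /* -1 means "not computed yet" */
} MockStr;

/* FNV-1a over raw bytes: a stand-in, not the default CPython algorithm. */
static uint64_t
fnv1a(const unsigned char *p, size_t n)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Return the hash, computing and caching it on the first call. */
int64_t
mock_hash(MockStr *s)
{
    assert(s->kind == 1 || s->kind == 2 || s->kind == 4);
    if (s->hash != -1) {
        return s->hash;          /* cached by an earlier call */
    }
    int64_t h = (int64_t)fnv1a(s->data, s->length * (size_t)s->kind);
    if (h == -1) {
        h = -2;                  /* -1 is reserved as the sentinel */
    }
    s->hash = h;
    return h;
}
```

Caching is safe because strings are immutable: the buffer cannot change after the hash has been stored, so repeated dict lookups with the same key pay for hashing only once.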

`PyUnicode_FromFormat`

Builds a str from a C format string with printf-like substitutions, plus conversions that take Python objects: %R inserts repr() of a PyObject*, %S inserts str() of a PyObject*, and %U inserts an existing str object (passed as PyObject*).

gopy notes

The gopy equivalent is objects/str.go. gopy stores each string as a Go string (immutable UTF-8 bytes). The PEP 393 kind system is not replicated; a []rune slice is used for character-level operations. String interning uses a sync.Map. PyUnicode_FromFormat maps to fmt.Sprintf.
