Objects/unicodeobject.c: Unicode Object Detail

unicodeobject.c is one of the largest files in CPython (roughly 16 000 lines). It defines the str type in all its storage forms, handles encoding and decoding for a dozen codecs, hashes strings with secret-keyed SipHash, and manages a global intern table. The 3.14 cycle added the public PyUnicodeWriter C API.

Map

| Lines | Symbol | Purpose |
|---|---|---|
| 1–200 | kind/form macros | PyUnicode_1BYTE_KIND, PyASCIIObject, PyCompactUnicodeObject layout |
| 201–450 | PyUnicode_New | Allocate and select storage kind (Latin-1, UCS-2, UCS-4) |
| 451–700 | _PyUnicode_Ready | Upgrade legacy non-compact objects (removed in 3.12+) |
| 701–900 | unicode_hash | SipHash-1-3 over the raw character buffer, keyed by _Py_HashSecret |
| 901–1050 | _PyUnicode_InternInPlace | Global intern table insert and dedup |
| 1051–1300 | PyUnicode_DecodeUTF8Stateful | Streaming UTF-8 decoder with error handlers |
| 1301–1550 | PyUnicode_AsUTF8AndSize | Encode to cached UTF-8 byte buffer |
| 1551–1800 | unicode_concatenate | + operator with in-place realloc fast path |
| 1801–2100 | PyUnicode_Format | %-formatting implementation |
| 2101–2400 | PyUnicode_RichCompare | Lexicographic compare across kinds |
| 2401–2700 | unicode_find / _Py_FindFirst | Boyer-Moore-Horspool substring search |
| 2701–3000 | unicode_split / unicode_rsplit | Split by whitespace or separator |
| 3001–3300 | unicode_join | str.join fast path for all-ASCII lists |
| 3301–3600 | PyUnicodeWriter_* | 3.14 public writer API (new) |
| 3601–3900 | _PyUnicodeWriter_* internal | Internal writer used by format, decode, etc. |
| 3901–4200 | codec registration / lookup | PyCodec_* wrappers for the codec registry |

Reading

PyUnicode_New: kind selection

Objects/unicodeobject.c #L201-450

CPython stores strings in one of three in-memory forms, chosen at construction time from the maximum code point:

| Max code point | Kind | Bytes per character | Struct type |
|---|---|---|---|
| <= 127 | ASCII compact | 1 | PyASCIIObject |
| <= 255 | Latin-1 compact | 1 | PyCompactUnicodeObject |
| <= 65535 | UCS-2 compact | 2 | PyCompactUnicodeObject |
| <= 1114111 | UCS-4 compact | 4 | PyCompactUnicodeObject |

The "compact" flag means the character buffer is allocated immediately after the header in one block, so a single free() releases both. Non-compact objects instead reach their character data through a separate pointer. The legacy form that also carried a wstr (wchar_t) buffer, rare since 3.3, was removed entirely in 3.12.

// Objects/unicodeobject.c:298
// ASCII fast path: one allocation, no kind upgrade ever needed.
if (maxchar < 128) {
    obj = (PyObject *)PyObject_Malloc(sizeof(PyASCIIObject) + length + 1);
    ...
    _PyUnicode_STATE(obj).kind = PyUnicode_1BYTE_KIND;
    _PyUnicode_STATE(obj).ascii = 1;
}
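The kind table above can be condensed into a small decision function. The following is a minimal standalone sketch, not CPython's code: the names sketch_kind_for_maxchar and sketch_alloc_size are hypothetical, and the header size is passed in as a parameter because the real struct sizes depend on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PyUnicode_1BYTE_KIND and friends. In CPython
   the kind value equals the number of bytes per character. */
enum sketch_kind { SKETCH_1BYTE = 1, SKETCH_2BYTE = 2, SKETCH_4BYTE = 4 };

/* Narrowest kind that can hold every code point up to maxchar.
   Returns 0 for values above U+10FFFF, which PyUnicode_New also rejects. */
int
sketch_kind_for_maxchar(uint32_t maxchar)
{
    if (maxchar <= 0xFF)
        return SKETCH_1BYTE;   /* ASCII and Latin-1 both use one byte */
    if (maxchar <= 0xFFFF)
        return SKETCH_2BYTE;
    if (maxchar <= 0x10FFFF)
        return SKETCH_4BYTE;
    return 0;
}

/* Bytes needed for a compact object: the header, then `length` characters
   plus one terminating NUL character of the same width. */
size_t
sketch_alloc_size(size_t header_size, size_t length, uint32_t maxchar)
{
    int kind = sketch_kind_for_maxchar(maxchar);
    assert(kind != 0);         /* caller must pass a valid code point */
    return header_size + (length + 1) * (size_t)kind;
}
```

With a hypothetical 40-byte header, a 5-character ASCII string needs 40 + 6 = 46 bytes, while the same length with a maximum code point of U+20AC needs 40 + 12 = 52 bytes.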

unicode_hash: SipHash with Py_HashSecret

Objects/unicodeobject.c #L701-900

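String hashing uses SipHash-1-3 (1 compression round per message block, 3 finalisation rounds), the default since 3.11, keyed by the 128-bit siphash key (k0, k1) held in _Py_HashSecret. The secret is filled from the operating system's random source at interpreter startup, or derived from PYTHONHASHSEED when that is set. The random key prevents hash-flooding denial-of-service attacks.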

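// Objects/unicodeobject.c:762 (simplified)
// unicode_hash hands the raw character buffer to the generic byte hasher.
// The SipHash-1-3 core and its keys (_Py_HashSecret.siphash.k0 / .k1)
// live in Python/pyhash.c.
x = _Py_HashBytes(PyUnicode_DATA(self),
                  PyUnicode_GET_LENGTH(self) * PyUnicode_KIND(self));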

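Every string, ASCII or not, is hashed over its raw character buffer: length × kind bytes, with no UTF-8 encoding step. Because a string is always stored in the narrowest kind that fits, equal strings have byte-identical buffers and therefore equal hashes. The result is cached in PyASCIIObject.hash. The cached value -1 is reserved for "not yet computed"; actual -1 results are mapped to -2.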
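The caching scheme with its -1 sentinel can be illustrated in isolation. This is a minimal standalone sketch under loud assumptions: sketch_str is a hypothetical struct rather than PyASCIIObject, and the byte hasher is an unkeyed 64-bit FNV-1a standing in for CPython's keyed SipHash-1-3, so it offers no protection against hash flooding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object: only the fields the caching pattern needs. */
typedef struct {
    int64_t hash;        /* -1 means "not computed yet" */
    size_t length;       /* number of characters */
    int kind;            /* bytes per character: 1, 2 or 4 */
    const void *data;    /* raw character buffer */
} sketch_str;

/* Stand-in for the byte hasher: 64-bit FNV-1a, NOT SipHash, NOT keyed. */
int64_t
sketch_hash_bytes(const void *src, size_t len)
{
    const unsigned char *p = src;
    uint64_t x = 0xcbf29ce484222325ULL;     /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        x ^= p[i];
        x *= 0x100000001b3ULL;              /* FNV prime */
    }
    int64_t h = (int64_t)x;
    return h == -1 ? -2 : h;                /* keep -1 free for the sentinel */
}

/* Hash once over length * kind bytes, then serve the cached value. */
int64_t
sketch_str_hash(sketch_str *s)
{
    assert(s->kind == 1 || s->kind == 2 || s->kind == 4);
    if (s->hash != -1)
        return s->hash;                     /* already cached */
    s->hash = sketch_hash_bytes(s->data, s->length * (size_t)s->kind);
    return s->hash;
}
```

Remapping -1 to -2 inside the byte hasher means the cache check never mistakes a genuine result for "not computed", at the cost of one extra collision.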

PyUnicodeWriter (3.14 public API)

Objects/unicodeobject.c #L3301-3600

Prior to 3.14 the _PyUnicodeWriter type was private. The 3.14 public C API exposes PyUnicodeWriter_Create, PyUnicodeWriter_WriteStr, PyUnicodeWriter_WriteChar, PyUnicodeWriter_WriteUTF8, and PyUnicodeWriter_Finish, along with PyUnicodeWriter_Discard for abandoning a writer on an error path.

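Internally the writer tracks a PyObject *buffer. PyUnicodeWriter_Create pre-allocates it as a compact ASCII string when a non-zero length is requested. When more room is needed the buffer is resized (with over-allocation), and when a wider character arrives a new string of the wider kind is allocated and the existing characters are copied into it. PyUnicodeWriter_Finish returns the final str as a new reference and frees the writer, so the writer must not be used afterwards.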

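// Objects/unicodeobject.c:3340 (abridged)
PyUnicodeWriter *
PyUnicodeWriter_Create(Py_ssize_t length)
{
    _PyUnicodeWriter *writer = PyMem_Malloc(sizeof(_PyUnicodeWriter));
    if (writer == NULL) {
        // Allocation failed: raise MemoryError and return NULL.
        return (PyUnicodeWriter *)PyErr_NoMemory();
    }
    _PyUnicodeWriter_Init(writer);
    ...
}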
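Kind promotion in a writer can be sketched without any CPython headers. The code below is a minimal standalone illustration, assuming a hypothetical sketch_writer struct that is not the real _PyUnicodeWriter layout. It starts at one byte per character, doubles its capacity when full, and copies into a wider buffer whenever a character does not fit the current kind.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical writer: a growable buffer of 1-, 2- or 4-byte characters. */
typedef struct {
    void *buf;
    size_t len;    /* characters written */
    size_t cap;    /* characters allocated */
    int kind;      /* bytes per character */
} sketch_writer;

static int
kind_for(uint32_t ch)
{
    return ch <= 0xFF ? 1 : ch <= 0xFFFF ? 2 : 4;
}

static uint32_t
read_char(const void *buf, int kind, size_t i)
{
    if (kind == 1) return ((const uint8_t *)buf)[i];
    if (kind == 2) return ((const uint16_t *)buf)[i];
    return ((const uint32_t *)buf)[i];
}

static void
write_char(void *buf, int kind, size_t i, uint32_t ch)
{
    if (kind == 1) ((uint8_t *)buf)[i] = (uint8_t)ch;
    else if (kind == 2) ((uint16_t *)buf)[i] = (uint16_t)ch;
    else ((uint32_t *)buf)[i] = ch;
}

void
sketch_writer_init(sketch_writer *w)
{
    w->buf = NULL; w->len = 0; w->cap = 0; w->kind = 1;
}

/* Append one code point. Returns 0 on success, -1 on allocation failure. */
int
sketch_writer_write_char(sketch_writer *w, uint32_t ch)
{
    assert(ch <= 0x10FFFF);
    int kind = kind_for(ch);
    if (kind < w->kind)
        kind = w->kind;                  /* the buffer never narrows */
    size_t cap = w->cap;
    if (w->len == cap)
        cap = cap ? cap * 2 : 8;         /* over-allocate when full */

    if (kind != w->kind || cap != w->cap) {
        /* Fresh buffer; copy element by element, widening if needed. */
        void *nbuf = malloc(cap * (size_t)kind);
        if (nbuf == NULL)
            return -1;
        for (size_t i = 0; i < w->len; i++)
            write_char(nbuf, kind, i, read_char(w->buf, w->kind, i));
        free(w->buf);
        w->buf = nbuf; w->cap = cap; w->kind = kind;
    }
    write_char(w->buf, w->kind, w->len++, ch);
    return 0;
}

uint32_t
sketch_writer_get(const sketch_writer *w, size_t i)
{
    return read_char(w->buf, w->kind, i);
}

void
sketch_writer_free(sketch_writer *w)
{
    free(w->buf);
    sketch_writer_init(w);
}
```

Writing 'a', U+20AC, U+1F600 and 'b' in that order moves the kind from 1 to 2 to 4, and the final 'b' is stored at four bytes because the buffer never narrows. The real writer differs in that its buffer is itself a str object, which is what lets Finish hand it over without a final copy.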

gopy notes

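  • objects/str.go uses a Go string (immutable, holding UTF-8 bytes) as the backing storage. The three CPython kinds collapse to a single representation because the port keeps every str as UTF-8, removing the need for kind selection. Go itself does not enforce valid UTF-8 in a string, so that invariant has to be maintained by the port.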
  • unicode_hash is ported in objects/str.go using the same SipHash-1-3 algorithm, with a Py_HashSecret equivalent seeded at pythonrun init.
  • The intern table is implemented in objects/str.go as a map[string]*StrObject guarded by a sync.Mutex, matching CPython's global intern dict.
  • PyUnicode_DecodeUTF8Stateful maps to DecodeUTF8Stateful and relies on Go's unicode/utf8 package for byte-level validation, with CPython-compatible surrogateescape and replace error handlers.
  • The 3.14 PyUnicodeWriter API is exposed as UnicodeWriter in objects/str.go using a strings.Builder internally, with kind-upgrade logic replaced by Go's native UTF-8 concatenation.
  • PyUnicode_AsUTF8AndSize is implemented as a no-op cache since Go strings are already UTF-8; the returned pointer is the backing array of the Go string.