Include/internal/pycore_unicode.h

cpython 3.14 @ ab2d84fe1023/Include/internal/pycore_unicode.h

The private companion to Include/unicodeobject.h. The public header declares the stable C API (PyUnicode_FromString, PyUnicode_AsUTF8, etc.), while this file exposes the three-tier struct layout from PEP 393, the _PyUnicodeWriter incremental builder, fast-path data accessors, the interning table, and internal consistency checks.

PEP 393 (CPython 3.3) replaced the build-time choice between UCS-2 ("narrow") and UCS-4 ("wide") storage with a per-string compact scheme. Each string uses the narrowest code unit that can hold its largest code point: 1-byte (Latin-1) units when every code point is at most U+00FF, 2-byte (UCS-2) units when every code point is at most U+FFFF, and 4-byte (UCS-4) units otherwise. Non-ASCII strings of any width are laid out as a PyCompactUnicodeObject. Pure ASCII strings use the smallest variant, PyASCIIObject, which omits the utf8 and utf8_length cache fields because ASCII character data is already valid UTF-8.

CPython 3.12 completed the removal of the wstr and wstr_length fields that existed through 3.11 as a migration aid from pre-PEP-393 code. The structs here reflect the final 3.12+ layout.

In gopy, all three tiers collapse into a single objects.Unicode struct backed by a Go string (always valid UTF-8). The kind, ascii, length, and hash fields pin the PEP 393 metadata. _PyUnicodeWriter maps to Go's strings.Builder.

Map

| Lines | Symbol | Role | gopy |
| --- | --- | --- | --- |
| 1-40 | PyASCIIObject | Smallest representation: ob_base + length + hash + packed state bits (interned, kind, compact, ascii, ready). No separate data pointer; character data follows immediately in memory. | objects/str.go (Unicode.ascii, Unicode.ready) |
| 41-80 | PyCompactUnicodeObject | Extends PyASCIIObject with a utf8_length and utf8 cache pointer for non-ASCII strings. Character data (1-, 2-, or 4-byte units) follows in memory. | objects/str.go (Unicode.v) |
| 81-120 | PyUnicodeObject | Non-compact ("legacy") form that adds a data union (data.any plus typed aliases) pointing at a separately allocated buffer. Originally kept for Py_UNICODE * API compatibility; the compact bit distinguishes it from the compact forms. | (not ported; pre-3.12 layout) |
| 121-160 | _PyUnicode_STATE / PyUnicode_IS_ASCII / PyUnicode_IS_COMPACT / PyUnicode_KIND / PyUnicode_GET_LENGTH | Accessor macros that read the state bitfield inline without an indirect call. | objects/str.go (Unicode.IsASCII, Unicode.Kind, Unicode.Length) |
| 161-200 | _PyUnicodeWriter struct | Incremental string builder: pre-allocated buffer, current write position, minimum character kind, and overallocation strategy. | (Go strings.Builder) |
| 201-240 | _PyUnicodeWriter_Init / _PyUnicodeWriter_Prepare / _PyUnicodeWriter_WriteStr / _PyUnicodeWriter_WriteChar / _PyUnicodeWriter_Finish / _PyUnicodeWriter_Dealloc | Writer lifecycle: initialize, grow buffer on demand, append string/char, finalize into a str. | (Go strings.Builder) |
| 241-280 | _PyUnicode_FastCopyCharacters / _PyUnicode_EqualToASCIIString / _PyUnicode_CheckConsistency | Fast bulk copy between compatible-kind buffers; ASCII equality shortcut; debug consistency checker. | objects/str.go |

Reading

Three-tier unicode object layout (lines 1 to 120)

cpython 3.14 @ ab2d84fe1023/Include/internal/pycore_unicode.h#L1-120

/* Tier 1: ASCII-only strings */
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* number of code points */
    Py_hash_t hash;             /* -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;    /* 1 = Latin-1, 2 = UCS-2, 4 = UCS-4 */
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;   /* legacy flag; upstream dropped it in 3.12 */
    } state;
    /* character data follows immediately */
} PyASCIIObject;

/* Tier 2: non-ASCII compact strings */
typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* byte length of UTF-8 cache */
    char *utf8;                 /* NULL until first AsUTF8 call */
    /* character data (kind * length bytes) follows immediately */
} PyCompactUnicodeObject;

/* Tier 3: legacy non-compact strings (pre-3.12 migration aid) */
typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
} PyUnicodeObject;

The three structs form a strict prefix chain. Code that only inspects length, hash, and state can accept any of the three via a PyASCIIObject * pointer. Code that needs the character data uses PyUnicode_DATA(op), which computes the pointer past the end of the appropriate header:

static inline void *
PyUnicode_DATA(PyObject *op) {
    if (PyUnicode_IS_COMPACT(op)) {
        /* Data starts immediately after the header */
        return (char *)op + (PyUnicode_IS_ASCII(op)
                             ? sizeof(PyASCIIObject)
                             : sizeof(PyCompactUnicodeObject));
    }
    /* Legacy path: explicit data pointer */
    return ((PyUnicodeObject *)op)->data.any;
}

The compact bit in state distinguishes the two allocation strategies. On the compact (3.3+) path the object header and its character data come from a single PyObject_Malloc call; there is no separate ob_item-style pointer. This removes one pointer indirection on every character access and improves cache locality.
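The single-allocation idea can be sketched outside CPython. The following is a minimal model, not the real implementation: MiniASCII, MiniCompact, mini_new, and mini_data are hypothetical names, the headers are heavily simplified, and there are no overflow checks. It shows only that the data pointer is derived from the header size rather than stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified headers: NOT the real CPython structs. */
typedef struct {
    size_t length;        /* number of code points */
    unsigned int kind;    /* bytes per code unit: 1, 2, or 4 */
    unsigned int ascii;   /* 1 if every code point is below 128 */
} MiniASCII;

typedef struct {
    MiniASCII base;       /* shared prefix, so a MiniASCII * view stays valid */
    size_t utf8_length;
    char *utf8;           /* lazily filled cache, NULL at first */
} MiniCompact;

/* One allocation holds the header plus (length + 1) code units. */
static MiniASCII *mini_new(size_t length, unsigned int kind, unsigned int ascii)
{
    assert(kind == 1 || kind == 2 || kind == 4);
    assert(!ascii || kind == 1);          /* ASCII strings are always 1-byte */
    size_t header = ascii ? sizeof(MiniASCII) : sizeof(MiniCompact);
    MiniASCII *op = calloc(1, header + (length + 1) * kind);
    if (op == NULL) {
        return NULL;
    }
    op->length = length;
    op->kind = kind;
    op->ascii = ascii;
    return op;
}

/* The data pointer is computed from the header size, never stored. */
static void *mini_data(MiniASCII *op)
{
    return (char *)op + (op->ascii ? sizeof(MiniASCII) : sizeof(MiniCompact));
}
```

A caller would allocate with mini_new(5, 1, 1), copy five ASCII bytes plus a terminator to mini_data(op), and release everything with one free(op).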

The kind field encodes the bytes per code unit: 1 for Latin-1 (ASCII is a subset of this tier), 2 for BMP strings that need UCS-2, and 4 for strings that contain supplementary-plane code points. PyUnicode_KIND reads this field. PyUnicode_MAX_CHAR_VALUE(op) returns an upper bound on the code points in op: 0x7f for ASCII strings, otherwise 0xff, 0xffff, or 0x10ffff for kinds 1, 2, and 4.
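The thresholds behind the kind field are easy to state in code. This is a small sketch with hypothetical helper names (kind_for_max_char and max_char_bound are not CPython functions); it mirrors the ranges described above and nothing more.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: smallest kind that can store max_char. */
static int kind_for_max_char(uint32_t max_char)
{
    assert(max_char <= 0x10FFFF);        /* beyond this is not Unicode */
    if (max_char < 0x100) {
        return 1;                        /* Latin-1, which includes ASCII */
    }
    if (max_char < 0x10000) {
        return 2;                        /* Basic Multilingual Plane */
    }
    return 4;                            /* supplementary planes */
}

/* Hypothetical helper: upper bound on code points for a (kind, ascii) pair. */
static uint32_t max_char_bound(int kind, int ascii)
{
    if (ascii) {
        return 0x7F;
    }
    switch (kind) {
    case 1:  return 0xFF;
    case 2:  return 0xFFFF;
    default: return 0x10FFFF;
    }
}
```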

In gopy, objects.Unicode stores the canonical UTF-8 Go string and derives kind, ascii, and length in classify() by scanning the runes once on construction. There is no separate character data buffer; all three tiers collapse into the Go string.
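To make the single-pass scan concrete, here is a sketch in C rather than Go, so it is not gopy's classify() and its names (Classified, classify_utf8) are hypothetical. It assumes the input is already valid UTF-8, in which case the lead byte of each sequence is enough to decide the kind without fully decoding the code point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; field names follow the text, not gopy's source. */
typedef struct {
    size_t length;   /* number of code points */
    int kind;        /* 1, 2, or 4 */
    int ascii;       /* 1 if every byte is below 0x80 */
} Classified;

/* Assumes s holds n bytes of valid UTF-8; no validation is attempted. */
static Classified classify_utf8(const char *s, size_t n)
{
    Classified c = { 0, 1, 1 };
    for (size_t i = 0; i < n; i++) {
        unsigned char b = (unsigned char)s[i];
        if ((b & 0xC0) == 0x80) {
            continue;                    /* continuation byte */
        }
        c.length++;                      /* each lead byte starts one code point */
        if (b < 0x80) {
            continue;                    /* ASCII */
        }
        c.ascii = 0;
        if (b >= 0xF0) {
            c.kind = 4;                  /* U+10000 and above */
        } else if (b >= 0xC4 && c.kind < 2) {
            c.kind = 2;                  /* U+0100 through U+FFFF */
        }
        /* Lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF: still kind 1. */
    }
    return c;
}
```

The empty string comes out as length 0, kind 1, ascii 1, which matches how CPython treats it.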

_PyUnicodeWriter init/write/finish (lines 161 to 240)

cpython 3.14 @ ab2d84fe1023/Include/internal/pycore_unicode.h#L161-240

typedef struct {
    PyObject *buffer;           /* string object being built */
    void *data;                 /* start of buffer's character data */
    int kind;                   /* kind of buffer: 1, 2, or 4 */
    Py_UCS4 maxchar;            /* largest code point buffer can hold */
    Py_ssize_t size;            /* allocated capacity in code units */
    Py_ssize_t pos;             /* next write position in code units */
    Py_ssize_t min_length;      /* hint for initial allocation */
    Py_UCS4 min_char;           /* hint for minimum kind */
    unsigned char overallocate; /* nonzero = over-allocate on growth */
    unsigned char readonly;     /* 1 = buffer is a shared string; copy before writing */
} _PyUnicodeWriter;

_PyUnicodeWriter is the workhorse for CPython code that builds a string whose final length and widest character are not known upfront: str % args, str.format, and repr() for containers. Operations that can compute the result size in advance, such as str.join and + concatenation, allocate the result once and copy into it instead.

The lifecycle is:

_PyUnicodeWriter writer;
_PyUnicodeWriter_Init(&writer);
writer.overallocate = 1;    /* geometric growth for long builds */

/* append pieces; each call returns -1 with an exception set on failure */
if (_PyUnicodeWriter_WriteStr(&writer, piece) < 0) goto error;
if (_PyUnicodeWriter_WriteChar(&writer, sep) < 0) goto error;

/* finalize: returns a new str reference (NULL on failure) and ends the writer */
return _PyUnicodeWriter_Finish(&writer);

error:
    _PyUnicodeWriter_Dealloc(&writer);
    return NULL;

_PyUnicodeWriter_Prepare is the growth step. When the current buffer's kind is too narrow for a new character (e.g. adding a UCS-2 character to a Latin-1 buffer), it reallocates and widens all previously written code units. This widening is paid at most twice (1 to 2, then 2 to 4) over the lifetime of any writer.
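The widening behaviour can be modelled with a miniature writer. This is a sketch under stated assumptions, not the _PyUnicodeWriter implementation: MiniWriter and the mini_* functions are hypothetical, the buffer is a plain malloc block rather than a string object, and the widenings counter exists only to show that the kind changes at most twice.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical miniature writer; not the _PyUnicodeWriter implementation. */
typedef struct {
    void *data;      /* start of the buffer */
    int kind;        /* bytes per code unit: 1, 2, or 4 */
    size_t size;     /* capacity in code units */
    size_t pos;      /* code units written so far */
    int widenings;   /* bookkeeping for this example only */
} MiniWriter;

static int kind_for(uint32_t ch)
{
    return ch < 0x100 ? 1 : ch < 0x10000 ? 2 : 4;
}

static uint32_t mini_get(const MiniWriter *w, size_t i)
{
    switch (w->kind) {
    case 1:  return ((const uint8_t *)w->data)[i];
    case 2:  return ((const uint16_t *)w->data)[i];
    default: return ((const uint32_t *)w->data)[i];
    }
}

static void mini_set(void *data, int kind, size_t i, uint32_t ch)
{
    switch (kind) {
    case 1:  ((uint8_t *)data)[i] = (uint8_t)ch; break;
    case 2:  ((uint16_t *)data)[i] = (uint16_t)ch; break;
    default: ((uint32_t *)data)[i] = ch; break;
    }
}

/* Growth step: ensure room for one more unit of at least new_kind. */
static int mini_prepare(MiniWriter *w, int new_kind)
{
    int kind = new_kind > w->kind ? new_kind : w->kind;
    size_t size = w->pos < w->size ? w->size : (w->size ? w->size * 2 : 8);
    if (kind == w->kind && size == w->size) {
        return 0;                              /* nothing to do */
    }
    void *fresh = malloc(size * (size_t)kind);
    if (fresh == NULL) {
        return -1;
    }
    for (size_t i = 0; i < w->pos; i++) {      /* copy, widening each unit */
        mini_set(fresh, kind, i, mini_get(w, i));
    }
    if (kind != w->kind) {
        w->widenings++;
    }
    free(w->data);
    w->data = fresh;
    w->kind = kind;
    w->size = size;
    return 0;
}

static int mini_write_char(MiniWriter *w, uint32_t ch)
{
    assert(ch <= 0x10FFFF);
    if (mini_prepare(w, kind_for(ch)) < 0) {
        return -1;
    }
    mini_set(w->data, w->kind, w->pos++, ch);
    return 0;
}
```

Starting from MiniWriter w = { NULL, 1, 0, 0, 0 }, writing 'a', U+20AC, U+1F600, and 'b' in that order widens the buffer from kind 1 to 2 and then to 4; the final 'b' is stored in a 4-byte unit without any further widening.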

_PyUnicodeWriter_Finish returns the finished string and leaves the writer unusable. If the buffer was over-allocated, it is first shrunk to pos code units, which hands the trailing bytes back through PyObject_Realloc. Since 3.12 there is no ready bit left to set at this point.

In gopy, strings.Builder covers the _PyUnicodeWriter use case. Kind widening is unnecessary because gopy stores every string as UTF-8, so appending a wider character never forces earlier output to be rewritten. The Finish step is builder.String().

gopy mirror

objects/str.go pins the PEP 393 metadata (kind, length, ascii, ready, hash) as fields on objects.Unicode. NewStr does the work of PyUnicode_FromString and the compact allocation, followed by the classify() kind scan. StrKind1Byte, StrKind2Byte, and StrKind4Byte mirror PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND, and PyUnicode_4BYTE_KIND. The interning table corresponds to _Py_interned: CPython uses a PyDictObject, while gopy uses a sync.Map from Go string to *Unicode to avoid import cycles.