`Include/internal/pycore_unicode.h`

cpython 3.14 @ ab2d84fe1023/Include/internal/pycore_unicode.h

The private companion to Include/unicodeobject.h. The public header declares the stable C API (PyUnicode_FromString, PyUnicode_AsUTF8, etc.), while this file exposes the three-tier struct layout from PEP 393, the _PyUnicodeWriter incremental builder, fast-path data accessors, the interning table, and internal consistency checks.

PEP 393 (Python 3.3) replaced the build-time choice between a narrow (UCS-2) and a wide (UCS-4) representation with a compact scheme that picks the code-unit width per string: strings whose largest code point fits in one byte use 1-byte (Latin-1) code units; those that fit in two bytes use 2-byte (UCS-2) units; all others use 4-byte (UCS-4) units. Non-ASCII strings are stored in a PyCompactUnicodeObject. Pure ASCII strings use the smallest variant, PyASCIIObject, which omits the utf8 and utf8_length cache fields because the 1-byte character data is already valid UTF-8.

CPython 3.12 completed the removal of the wstr and wstr_length fields that existed through 3.11 as a migration aid from pre-PEP-393 code. The structs here reflect the final 3.12+ layout.

In gopy, all three tiers collapse into a single objects.Unicode struct backed by a Go string (always valid UTF-8). The kind, ascii, length, and hash fields pin the PEP 393 metadata. _PyUnicodeWriter maps to Go's strings.Builder.

Map

Lines	Symbol	Role	gopy
1-40	`PyASCIIObject`	Smallest representation: `ob_base` + `length` + `hash` + packed `state` bits (interned, kind, compact, ascii, ready). No separate data pointer; character data follows immediately in memory.	`objects/str.go` (`Unicode.ascii`, `Unicode.ready`)
41-80	`PyCompactUnicodeObject`	Extends `PyASCIIObject` with a `utf8_length` and `utf8` cache pointer for non-ASCII strings. Character data (1-, 2-, or 4-byte units) follows in memory.	`objects/str.go` (`Unicode.v`)
81-120	`PyUnicodeObject`	Legacy non-compact wrapper that adds the `data` union (`data.any` plus typed aliases), which points at a separately allocated character buffer. The `compact` bit distinguishes this form from the compact ones.	(not ported; pre-3.12 layout)
121-160	`_PyUnicode_STATE` / `PyUnicode_IS_ASCII` / `PyUnicode_IS_COMPACT` / `PyUnicode_KIND` / `PyUnicode_GET_LENGTH`	Accessor macros that read the `state` bitfield inline without an indirect call.	`objects/str.go` (`Unicode.IsASCII`, `Unicode.Kind`, `Unicode.Length`)
161-200	`_PyUnicodeWriter` struct	Incremental string builder: pre-allocated buffer, current write position, minimum character kind, and overallocation strategy.	(Go `strings.Builder`)
201-240	`_PyUnicodeWriter_Init` / `_PyUnicodeWriter_Prepare` / `_PyUnicodeWriter_WriteStr` / `_PyUnicodeWriter_WriteChar` / `_PyUnicodeWriter_Finish` / `_PyUnicodeWriter_Dealloc`	Writer lifecycle: initialize, grow buffer on demand, append string/char, finalize into a `str`.	(Go `strings.Builder`)
241-280	`_PyUnicode_FastCopyCharacters` / `_PyUnicode_EqualToASCIIString` / `_PyUnicode_CheckConsistency`	Fast bulk copy between compatible-kind buffers; ASCII equality shortcut; debug consistency checker.	`objects/str.go`

Reading

Three-tier unicode object layout (lines 1 to 120)

cpython 3.14 @ ab2d84fe1023/Include/internal/pycore_unicode.h#L1-120

/* Tier 1: ASCII-only strings */
typedef struct {
    PyObject_HEAD
    Py_ssize_t  length;   /* number of code points */
    Py_hash_t   hash;     /* -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;      /* 1 = Latin-1, 2 = UCS-2, 4 = UCS-4 */
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    /* character data follows immediately */
} PyASCIIObject;

/* Tier 2: non-ASCII compact strings */
typedef struct {
    PyASCIIObject _base;
    Py_ssize_t    utf8_length;  /* byte length of UTF-8 cache */
    char         *utf8;        /* NULL until first AsUTF8 call */
    /* character data (kind * length bytes) follows immediately */
} PyCompactUnicodeObject;

/* Tier 3: legacy non-compact strings (pre-3.12 migration aid) */
typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void      *any;
        Py_UCS1   *latin1;
        Py_UCS2   *ucs2;
        Py_UCS4   *ucs4;
    } data;
} PyUnicodeObject;

The three structs form a strict prefix chain. Code that only inspects length, hash, and state can accept any of the three via a PyASCIIObject * pointer. Code that needs the character data uses PyUnicode_DATA(op), which computes the pointer past the end of the appropriate header:

static inline void *
PyUnicode_DATA(PyObject *op) {
    if (PyUnicode_IS_COMPACT(op)) {
        /* Data starts immediately after the header */
        return (char *)op + (PyUnicode_IS_ASCII(op)
            ? sizeof(PyASCIIObject)
            : sizeof(PyCompactUnicodeObject));
    }
    /* Legacy path: explicit data pointer */
    return ((PyUnicodeObject *)op)->data.any;
}

The compact bit in state distinguishes the two allocation strategies. For the compact (3.3+) path the object and its character data are a single PyObject_Malloc call; there is no separate ob_item-style pointer. This eliminates one pointer indirection on every character access and improves cache locality.
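The single-allocation layout can be sketched as simple arithmetic. The Go snippet below is only an illustration: the two header sizes are assumed values for a typical 64-bit build (the real ones come from sizeof(PyASCIIObject) and sizeof(PyCompactUnicodeObject) and vary by platform), and dataOffset and allocSize are hypothetical helper names, not gopy or CPython APIs.

```go
package main

import "fmt"

// Assumed header sizes for a typical 64-bit build. The real values are
// sizeof(PyASCIIObject) and sizeof(PyCompactUnicodeObject) and depend on
// the platform and build options.
const (
	asciiHeaderSize   = 40
	compactHeaderSize = 56
)

// dataOffset mirrors the compact branch of PyUnicode_DATA: character data
// starts right after whichever header the string uses.
func dataOffset(isASCII bool) int {
	if isASCII {
		return asciiHeaderSize
	}
	return compactHeaderSize
}

// allocSize returns the size of the one allocation backing a compact
// string: the header plus (length + 1) code units of kind bytes each.
// The extra unit holds the trailing NUL terminator.
func allocSize(isASCII bool, kind, length int) int {
	return dataOffset(isASCII) + kind*(length+1)
}

func main() {
	fmt.Println(allocSize(true, 1, 5))  // ASCII string of five characters
	fmt.Println(allocSize(false, 2, 5)) // five BMP code points, 2-byte units
}
```

Because the offset depends only on the ascii bit, a reader of the character data never loads a pointer from the object; it adds a constant to the object address.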

The kind field encodes the bytes per code unit: 1 for Latin-1 (ASCII is a subset of this tier), 2 for BMP strings that need UCS-2, and 4 for strings that contain supplementary-plane code points. PyUnicode_KIND reads this field. PyUnicode_MAX_CHAR_VALUE returns an upper bound on the code points the string can hold: 0x7F for ASCII strings, 0xFF for other 1-byte strings, 0xFFFF for 2-byte strings, and 0x10FFFF for 4-byte strings.

In gopy, objects.Unicode stores the canonical UTF-8 Go string and derives kind, ascii, and length in classify() by scanning the runes once on construction. There is no separate character data buffer; all three tiers collapse into the Go string.
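A single-pass scan of this sort might look like the following. This is a minimal sketch, not the gopy source: the strInfo type and the body of classify are assumptions that reproduce only the kind, ascii, and length derivation described above.

```go
package main

import "fmt"

// strInfo is a hypothetical stand-in for the PEP 393 metadata that
// objects.Unicode keeps next to its Go string.
type strInfo struct {
	kind   int  // 1, 2, or 4: bytes per code unit CPython would need
	ascii  bool // true when every code point is <= 0x7F
	length int  // number of code points, not bytes
}

// classify walks the runes of s once and derives the metadata.
func classify(s string) strInfo {
	info := strInfo{kind: 1, ascii: true}
	for _, r := range s {
		info.length++
		if r > 0x7F {
			info.ascii = false
		}
		switch {
		case r > 0xFFFF:
			info.kind = 4
		case r > 0xFF && info.kind < 2:
			info.kind = 2
		}
	}
	return info
}

func main() {
	for _, s := range []string{"abc", "é", "€", "a😀"} {
		fmt.Printf("%q -> %+v\n", s, classify(s))
	}
}
```

The kind only ever increases during the scan, so the result matches what CPython would pick from the string's largest code point.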

`_PyUnicodeWriter` init/write/finish (lines 161 to 240)

cpython 3.14 @ ab2d84fe1023/Include/internal/pycore_unicode.h#L161-240

typedef struct {
    PyObject  *buffer;      /* current PyUnicodeObject being built */
    void      *data;        /* write pointer into buffer->data */
    int        kind;        /* kind of buffer: 1, 2, or 4 */
    Py_UCS4    maxchar;     /* largest char written so far */
    Py_ssize_t size;        /* allocated capacity in code units */
    Py_ssize_t pos;         /* next write position in code units */
    Py_ssize_t min_length;  /* hint for initial allocation */
    Py_UCS4    min_char;    /* hint for minimum kind */
    unsigned char overallocate;  /* 1 = use geometric growth */
    unsigned char readonly;      /* 1 = buffer is borrowed */
} _PyUnicodeWriter;

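_PyUnicodeWriter is the workhorse for CPython code that builds a string incrementally without knowing the final length or the widest character upfront, for example str % args, str.format, and repr() for containers. Operations that can compute the result size in advance, such as str.join and the + operator, allocate the result once instead of going through the writer.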

The lifecycle is:

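/* build_piece_and_sep is a hypothetical caller used only for illustration */
static PyObject *
build_piece_and_sep(PyObject *piece, Py_UCS4 sep)
{
    _PyUnicodeWriter writer;
    _PyUnicodeWriter_Init(&writer);
    writer.overallocate = 1;   /* geometric growth for long builds */

    /* append pieces; each call returns -1 with an exception set on failure */
    if (_PyUnicodeWriter_WriteStr(&writer, piece) < 0) goto error;
    if (_PyUnicodeWriter_WriteChar(&writer, sep) < 0) goto error;

    /* finalize: returns a new str reference, or NULL on failure */
    return _PyUnicodeWriter_Finish(&writer);

error:
    /* release the partially built buffer */
    _PyUnicodeWriter_Dealloc(&writer);
    return NULL;
}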

_PyUnicodeWriter_Prepare is the growth step. When the current buffer's kind is too narrow for a new character (e.g. adding a UCS-2 character to a Latin-1 buffer), it reallocates and widens all previously written code units. Because the kind never shrinks, this widening is paid at most twice over the lifetime of any writer (1 to 2, then 2 to 4), or once when the writer jumps straight from 1 to 4.
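The widening step can be modelled in a few lines of Go. The sketch below is a simplified illustration, not CPython's algorithm verbatim: kindWriter, kindFor, getUnit, and putUnit are hypothetical names, code units are stored little-endian in a byte slice, and the overallocation strategy is left to Go's append.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// kindWriter is a hypothetical model of a writer whose buffer holds
// fixed-width code units of kind bytes each.
type kindWriter struct {
	kind   int    // 1, 2, or 4 bytes per code unit
	buf    []byte // len(buf) == kind * number of code units written
	widens int    // how many times the buffer was re-encoded
}

func newKindWriter() *kindWriter { return &kindWriter{kind: 1} }

// kindFor returns the narrowest kind able to hold r.
func kindFor(r rune) int {
	switch {
	case r <= 0xFF:
		return 1
	case r <= 0xFFFF:
		return 2
	default:
		return 4
	}
}

func getUnit(b []byte, kind int) uint32 {
	switch kind {
	case 1:
		return uint32(b[0])
	case 2:
		return uint32(binary.LittleEndian.Uint16(b))
	default:
		return binary.LittleEndian.Uint32(b)
	}
}

func putUnit(b []byte, kind int, v uint32) {
	switch kind {
	case 1:
		b[0] = byte(v)
	case 2:
		binary.LittleEndian.PutUint16(b, uint16(v))
	default:
		binary.LittleEndian.PutUint32(b, v)
	}
}

// widen re-encodes every code unit written so far at the wider kind.
func (w *kindWriter) widen(newKind int) {
	n := len(w.buf) / w.kind
	wide := make([]byte, n*newKind)
	for i := 0; i < n; i++ {
		putUnit(wide[i*newKind:], newKind, getUnit(w.buf[i*w.kind:], w.kind))
	}
	w.buf, w.kind = wide, newKind
	w.widens++
}

// WriteChar appends r, widening the buffer first if r does not fit.
func (w *kindWriter) WriteChar(r rune) {
	if k := kindFor(r); k > w.kind {
		w.widen(k)
	}
	var unit [4]byte
	putUnit(unit[:], w.kind, uint32(r))
	w.buf = append(w.buf, unit[:w.kind]...)
}

func main() {
	w := newKindWriter()
	for _, r := range "aé€😀" {
		w.WriteChar(r)
	}
	fmt.Println(w.kind, w.widens, len(w.buf))
}
```

Since kind only moves through 1, 2, and 4, the widens counter in this model can never exceed two, which is the bound stated above.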

_PyUnicodeWriter_Finish trims the buffer to pos code units, sets the ready bit, and returns the result. If the buffer was over-allocated, the trailing bytes are reclaimed with _PyObject_Realloc.

In gopy, strings.Builder covers the _PyUnicodeWriter use-case. The kind widening is unnecessary because Go's string type is always UTF-8. The Finish step is builder.String().
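For comparison, the same init/write/finish lifecycle with strings.Builder could look like this. It is a minimal sketch: joinPieces is a hypothetical helper, not a function from gopy, and the Grow call plays roughly the role of the min_length hint.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// joinPieces is a hypothetical helper that appends each piece, separated
// by sep, and returns the finished string.
func joinPieces(pieces []string, sep rune) string {
	var b strings.Builder // the zero value is ready to use (Init)

	// Optional capacity hint, similar in spirit to min_length.
	// utf8.UTFMax is the upper bound for one encoded rune.
	n := 0
	for _, p := range pieces {
		n += len(p) + utf8.UTFMax
	}
	b.Grow(n)

	for i, p := range pieces {
		if i > 0 {
			b.WriteRune(sep) // counterpart of WriteChar
		}
		b.WriteString(p) // counterpart of WriteStr
	}
	return b.String() // counterpart of Finish
}

func main() {
	fmt.Println(joinPieces([]string{"spam", "eggs", "ham"}, ','))
}
```

There is no error path to clean up here: the strings.Builder write methods always return a nil error, and the garbage collector takes the place of _PyUnicodeWriter_Dealloc.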

gopy mirror

objects/str.go pins the PEP 393 metadata (kind, length, ascii, ready, hash) as fields on objects.Unicode. NewStr combines three CPython steps: the PyUnicode_FromString entry point, the compact allocation, and the kind scan, which gopy performs in classify(). StrKind1Byte, StrKind2Byte, and StrKind4Byte mirror PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND, and PyUnicode_4BYTE_KIND. The interning table corresponds to _Py_interned. CPython stores it as a PyDictObject; gopy uses a sync.Map from Go string to *Unicode to avoid import cycles.
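An interning table built on sync.Map can be sketched as below. The Unicode type here is a deliberately reduced stand-in with a single field, and Intern is a hypothetical function name; neither is copied from gopy.

```go
package main

import (
	"fmt"
	"sync"
)

// Unicode is a simplified stand-in for objects.Unicode.
type Unicode struct {
	v string // canonical UTF-8 contents
}

// interned maps string contents to the one canonical *Unicode.
var interned sync.Map

// Intern returns the canonical object for u's contents, registering u
// itself if no equal string has been interned yet. LoadOrStore makes the
// check-and-insert atomic, so concurrent callers agree on one winner.
func Intern(u *Unicode) *Unicode {
	actual, _ := interned.LoadOrStore(u.v, u)
	return actual.(*Unicode)
}

func main() {
	a := Intern(&Unicode{v: "spam"})
	b := Intern(&Unicode{v: "spam"})
	fmt.Println(a == b) // both names refer to the same canonical object
}
```

Once two strings are interned, equality reduces to a pointer comparison, which is the same property CPython relies on for identifier lookups.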
