Include/internal/pycore_unicodeobject.h

Source:

cpython 3.14 @ ab2d84fe1023/Include/internal/pycore_unicodeobject.h

pycore_unicodeobject.h exposes the three in-memory string layouts used internally by CPython and the macros that read them directly, without a function call.

Map

Lines	Symbol	Role
1-60	`PyASCIIObject`	Pure-ASCII strings: data immediately follows the header
61-120	`PyCompactUnicodeObject`	Latin-1 or UCS-2/UCS-4: single data block after the header
121-180	`PyUnicodeObject`	Legacy (non-compact): separately allocated data buffer reached through `data.any`
181-240	`_PyUnicode_KIND` macros	Extract 1/2/4-byte code unit width without branching
241-280	`_PyUnicode_DATA`	Pointer to the code-unit array

Reading

`PyASCIIObject`

// CPython: Include/internal/pycore_unicodeobject.h:38 PyASCIIObject
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;    /* number of code points */
    Py_hash_t  hash;      /* cached hash, -1 = uncached */
    struct {
        unsigned int interned : 2;   /* 0=not, 1=mortal, 2=immortal */
        unsigned int kind     : 3;   /* code-unit width in bytes: 1, 2, or 4 */
        unsigned int compact  : 1;   /* data immediately follows? */
        unsigned int ascii    : 1;   /* all code points < 128? */
        unsigned int ready    : 1;   /* legacy: always 1; dropped in 3.12 */
    } state;
    wchar_t *wstr;         /* deprecated; removed in 3.12 (PEP 623) */
} PyASCIIObject;
/* For ASCII strings, the char data starts at (PyASCIIObject *)s + 1 */

PyASCIIObject is the header of every string. For ASCII compact strings the character array follows immediately in memory. Accessing it avoids an indirection: (char *)((PyASCIIObject *)s + 1).
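The header-plus-data idiom can be tried outside CPython. The following is a minimal sketch, not CPython code: `MiniAsciiHeader`, `mini_ascii_new`, and `mini_ascii_data` are hypothetical names, and the header carries only a length field. It shows one allocation holding both the header and the characters, with the data reached by pointer arithmetic on the header type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for PyASCIIObject: only the field this sketch needs. */
typedef struct {
    size_t length;   /* number of characters, excluding the trailing NUL */
} MiniAsciiHeader;

/* Allocate the header and the character data as a single block. */
static MiniAsciiHeader *mini_ascii_new(const char *text)
{
    assert(text != NULL);
    size_t n = strlen(text);
    MiniAsciiHeader *s = malloc(sizeof(MiniAsciiHeader) + n + 1);
    if (s == NULL) {
        return NULL;
    }
    s->length = n;
    /* The data starts right after the header: same shape as
       (char *)((PyASCIIObject *)s + 1). */
    memcpy((char *)(s + 1), text, n + 1);
    return s;
}

/* No pointer field is loaded: the address is computed from the header. */
static const char *mini_ascii_data(const MiniAsciiHeader *s)
{
    assert(s != NULL);
    return (const char *)(s + 1);
}
```

Because the data address is derived from the object address, freeing the string is a single `free(s)` and reading it touches one contiguous region of memory.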

`PyCompactUnicodeObject`

// CPython: Include/internal/pycore_unicodeobject.h:70 PyCompactUnicodeObject
typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;   /* length of cached UTF-8 encoding */
    char *utf8;               /* cached UTF-8 encoding or NULL */
    Py_ssize_t wstr_length;   /* deprecated; removed in 3.12 (PEP 623) */
} PyCompactUnicodeObject;
/* For non-ASCII compact strings, data starts at
   (PyCompactUnicodeObject *)s + 1 */

PyCompactUnicodeObject adds optional UTF-8 caching. Most strings created from Python source code are compact. The data array may be UCS-1 (Latin-1), UCS-2, or UCS-4 depending on the kind field.
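How the kind relates to the content can be sketched as follows: the narrowest width that holds the largest code point is enough for the whole string. This is an illustration under assumptions, not the CPython allocator; `narrowest_kind`, `data_bytes`, and the `KIND_*` constants are hypothetical names that mirror the 1/2/4 values of the kind field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirrors of the 1/2/4 kind values. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Return the narrowest code-unit width that can hold every code point. */
static int narrowest_kind(const uint32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    uint32_t max = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] > max) {
            max = cps[i];
        }
    }
    if (max < 0x100) {
        return KIND_1BYTE;   /* Latin-1 range, 0-255 */
    }
    if (max < 0x10000) {
        return KIND_2BYTE;   /* UCS-2 range, 0-65535 */
    }
    return KIND_4BYTE;       /* anything above needs UCS-4 */
}

/* Size of the data block that follows the header, in bytes. */
static size_t data_bytes(int kind, size_t n)
{
    return (size_t)kind * n;
}
```

A single code point above 0xFFFF therefore quadruples the storage of an otherwise Latin-1 string, which is the trade-off behind having three widths.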

`_PyUnicode_KIND` macros

// CPython: Include/internal/pycore_unicodeobject.h:195 _PyUnicode_KIND
#define PyUnicode_1BYTE_KIND  1  /* Latin-1, code points 0-255 */
#define PyUnicode_2BYTE_KIND  2  /* UCS-2, code points 0-65535 */
#define PyUnicode_4BYTE_KIND  4  /* UCS-4, full Unicode range */

#define PyUnicode_KIND(op)    \
    (assert(PyUnicode_Check(op)), \
     ((PyASCIIObject *)(op))->state.kind)

#define PyUnicode_GET_LENGTH(op) \
    (assert(PyUnicode_Check(op)), \
     ((PyASCIIObject *)(op))->length)

#define PyUnicode_READ(kind, data, index) \
    ((Py_UCS4)((kind) == PyUnicode_1BYTE_KIND ? \
        ((Py_UCS1 *)(data))[(index)] : \
        ((kind) == PyUnicode_2BYTE_KIND ? \
            ((Py_UCS2 *)(data))[(index)] : \
            ((Py_UCS4 *)(data))[(index)])))

PyUnicode_READ is used in hot loops over string characters. Because kind is fixed for a given string, the dispatch takes the same branch on every iteration, so the CPU branch predictor handles it well and the compiler can often hoist the test out of the loop.
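The two ways of structuring such a loop can be compared in a standalone sketch. `MINI_READ`, `count_generic`, and `count_hoisted` are hypothetical names: the first function dispatches on kind for every element, in the same shape as PyUnicode_READ, while the second switches once and then runs a loop specialized for the code-unit type. Whether a compiler performs the second transformation on its own is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Per-element dispatch, modeled on PyUnicode_READ. */
#define MINI_READ(kind, data, index) \
    ((uint32_t)((kind) == KIND_1BYTE ? ((const uint8_t *)(data))[(index)] : \
        ((kind) == KIND_2BYTE ? ((const uint16_t *)(data))[(index)] : \
                                ((const uint32_t *)(data))[(index)])))

/* Count occurrences of ch, testing kind on every iteration. */
static size_t count_generic(int kind, const void *data, size_t n, uint32_t ch)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        hits += (MINI_READ(kind, data, i) == ch);
    }
    return hits;
}

/* Same result, with the kind test hoisted out of the loop. */
static size_t count_hoisted(int kind, const void *data, size_t n, uint32_t ch)
{
    size_t hits = 0;
    switch (kind) {
    case KIND_1BYTE: {
        const uint8_t *p = data;
        for (size_t i = 0; i < n; i++) hits += (p[i] == ch);
        break;
    }
    case KIND_2BYTE: {
        const uint16_t *p = data;
        for (size_t i = 0; i < n; i++) hits += (p[i] == ch);
        break;
    }
    case KIND_4BYTE: {
        const uint32_t *p = data;
        for (size_t i = 0; i < n; i++) hits += (p[i] == ch);
        break;
    }
    default:
        assert(0 && "invalid kind");
        break;
    }
    return hits;
}
```

The generic form is shorter and works for any kind; the hoisted form trades code size for loops that are easier to vectorize.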

`_PyUnicode_DATA`

// CPython: Include/internal/pycore_unicodeobject.h:220 _PyUnicode_DATA
#define PyUnicode_DATA(op) \
    (assert(PyUnicode_Check(op)), \
     PyUnicode_IS_COMPACT(op) ? \
       _PyUnicode_COMPACT_DATA(op) : \
       ((PyUnicodeObject *)(op))->data.any)

#define _PyUnicode_COMPACT_DATA(op)  \
    (PyUnicode_IS_ASCII(op) ? \
        (void *)((PyASCIIObject *)(op) + 1) : \
        (void *)((PyCompactUnicodeObject *)(op) + 1))

PyUnicode_DATA returns a void * pointer to the first code unit. Combined with PyUnicode_KIND and PyUnicode_READ, this is the standard triple for iterating string contents.
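The data lookup for compact strings can be modeled without the CPython headers. In this sketch, `MiniAscii`, `MiniCompact`, `mini_new`, `mini_data`, and `mini_char_at` are hypothetical names and the structs keep only the fields the lookup needs. It shows why the ascii flag has to be checked first: the two headers have different sizes, so the data begins at a different offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for PyASCIIObject and PyCompactUnicodeObject. */
typedef struct {
    size_t length;
    unsigned int kind  : 3;   /* 1, 2, or 4 bytes per code unit */
    unsigned int ascii : 1;   /* selects which header precedes the data */
} MiniAscii;

typedef struct {
    MiniAscii base;
    size_t utf8_length;
    char *utf8;
} MiniCompact;

/* Same decision as _PyUnicode_COMPACT_DATA: skip the right header. */
static void *mini_data(MiniAscii *s)
{
    return s->ascii ? (void *)(s + 1) : (void *)((MiniCompact *)s + 1);
}

/* Build a compact string from n pre-encoded code units of the given kind. */
static MiniAscii *mini_new(int kind, int ascii, const void *units, size_t n)
{
    assert(units != NULL);
    assert(kind == 1 || kind == 2 || kind == 4);
    size_t header = ascii ? sizeof(MiniAscii) : sizeof(MiniCompact);
    unsigned char *raw = calloc(1, header + (size_t)kind * n);
    if (raw == NULL) {
        return NULL;
    }
    MiniAscii *s = (MiniAscii *)raw;
    s->length = n;
    s->kind = (unsigned int)kind;
    s->ascii = ascii ? 1u : 0u;
    memcpy(raw + header, units, (size_t)kind * n);
    return s;
}

/* The kind / data / read triple applied to one index. */
static uint32_t mini_char_at(MiniAscii *s, size_t i)
{
    assert(s != NULL && i < s->length);
    const void *data = mini_data(s);
    switch (s->kind) {
    case 1:  return ((const uint8_t *)data)[i];
    case 2:  return ((const uint16_t *)data)[i];
    default: return ((const uint32_t *)data)[i];
    }
}
```

In real extension code the same loop would fetch the kind and the data pointer once, before iterating, and pass both to PyUnicode_READ for each index.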

gopy notes

objects.Str in objects/str.go stores its content as a Go string (UTF-8). PyUnicode_KIND and PyUnicode_DATA are not exposed in gopy; instead, objects.StrGetItem(s, i) returns the i-th Unicode code point. Compact ASCII optimization is not needed since Go strings are already compact.
