Parser/string_parser.c
cpython 3.14 @ ab2d84fe1023/Parser/string_parser.c
Handles every string literal form that the Python grammar recognises.
The tokenizer delivers raw token bytes; this file turns them into
PyObject * values or AST nodes. There are four families: plain
str (decoded as UTF-8 with escape expansion), bytes (ASCII-only source
characters, with escape expansion), raw (no escape processing), and
f-string (split into literal segments and FormattedValue AST nodes
before being merged into a JoinedStr).
The file is small relative to its surface area because two concerns are deliberately outside its scope: the tokenizer identifies the literal boundaries and delivers a single token per literal (no multi-line joining), and the AST arena allocator owns all node memory. This file only decodes and constructs.
Map
| Lines | Symbol | Role | gopy |
|---|---|---|---|
| 1-40 | _PyPegen_decode_string | Entry point for non-f-string literals: strip quotes, detect prefix flags, run the escape decoder. | parser/string_parser.go:DecodeString |
| 41-110 | (escape decode loop) | Walk the token bytes, expand \n, \t, \xHH, \uXXXX, \UXXXXXXXX, \N{name}, \ooo; raw strings skip this block. | parser/string_parser.go:decodeEscapes |
| 111-170 | fstring_find_literal | Scan an f-string segment for { / }, splitting off the leading literal portion and detecting doubled braces {{ / }}. | parser/string_parser.go:fstringFindLiteral |
| 171-230 | fstring_compile_expr / fstring_parse_expr | Parse the expression inside {...} by feeding it back into the PEG parser as a sub-expression; handle = (debug format), ! conversion, and : format spec. | parser/string_parser.go:fstringCompileExpr |
| 231-290 | fstring_find_expr | Drive fstring_find_literal and fstring_compile_expr in alternation to build the list of FormattedValue and Constant sub-nodes for one f-string token. | parser/string_parser.go:fstringFindExpr |
| 291-320 | _PyPegen_parse_string | Dispatch based on prefix characters (b/B, r/R, f/F, u/U) to the correct decode path. | parser/string_parser.go:ParseString |
| 321-339 | _PyPegen_concatenate_strings | Merge a sequence of already-decoded string parts into a single Constant or JoinedStr expr node. | parser/string_parser.go:ConcatenateStrings |
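The prefix dispatch listed for _PyPegen_parse_string can be pictured with a small standalone sketch. It is not the CPython code: the struct literal_prefix type and the scan_literal_prefix function are hypothetical names introduced here, and the sketch assumes the tokenizer has already rejected illegal prefix combinations, so it only records which flags are present.

```c
#include <stddef.h>

/* Hypothetical flag set for illustration; not a CPython type. */
struct literal_prefix {
    int is_bytes;    /* b or B seen */
    int is_raw;      /* r or R seen */
    int is_fstring;  /* f or F seen */
};

/*
 * Consume prefix letters up to the opening quote.
 * Returns the index of the opening quote, or -1 if a character that is
 * neither a prefix letter nor a quote is found first (this includes the
 * terminating NUL).
 */
static int
scan_literal_prefix(const char *tok, struct literal_prefix *out)
{
    out->is_bytes = 0;
    out->is_raw = 0;
    out->is_fstring = 0;
    for (int i = 0; ; i++) {
        switch (tok[i]) {
        case 'b': case 'B': out->is_bytes = 1; break;
        case 'r': case 'R': out->is_raw = 1; break;
        case 'f': case 'F': out->is_fstring = 1; break;
        case 'u': case 'U': break;  /* legacy marker, sets no flag */
        case '\'': case '"': return i;
        default: return -1;
        }
    }
}
```

For the token rb'x' the sketch returns 2 with is_raw and is_bytes set. A caller would then pick the decode path from the flags and start reading the body after the quote characters.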
Reading
Escape decoding in _PyPegen_decode_string (lines 1 to 110)
cpython 3.14 @ ab2d84fe1023/Parser/string_parser.c#L1-110
static int
decode_unicode_with_escapes(Parser *parser, const char **s, size_t length,
                            Token *t)
{
    const char *p = *s;
    const char *end = p + length;
    while (p < end) {
        if (*p != '\\') { p++; continue; }
        p++; /* skip backslash */
        switch (*p) {
        case 'n': *w++ = '\n'; p++; break;
        case 't': *w++ = '\t'; p++; break;
        case 'x': /* \xHH */
            ...
        case 'u': /* \uXXXX */
            ...
        case 'U': /* \UXXXXXXXX */
            ...
        case 'N': /* \N{name} */
            ...
        default: /* \ooo octal or unrecognised: keep both chars */
            ...
        }
    }
}
Raw strings (r"..." or R"...") bypass this function entirely: the
prefix check in _PyPegen_parse_string detects the r/R flag and
calls PyUnicode_DecodeUTF8 directly on the token bytes. For all other
str literals the escape loop above runs. \N{name} is the only case
that performs a runtime import: it calls _PyUnicode_GetNameObject,
which uses the Unicode name database. The default branch decodes \ooo
octal escapes and, for any other character, keeps both the backslash and
that character. This matches the documented behavior that an unrecognised
escape sequence is passed through unchanged (with a DeprecationWarning
from 3.6 and a SyntaxWarning from 3.12).
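Bytes literals follow the same loop but only expand \xHH, \ooo,
\n, \t, and the single-character escapes. \uXXXX, \UXXXXXXXX,
and \N{name} are not recognised in bytes, so they are treated like any
other unrecognised escape: the backslash is kept and the invalid-escape
warning is emitted.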
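The shape of the escape loop can be tried out with a reduced, standalone sketch. It covers only \n, \t, \\, \xHH, \ooo, and the pass-through of unrecognised escapes. \uXXXX, \UXXXXXXXX, and \N{name} are deliberately not modelled. The names decode_escapes_sketch and hex_val are hypothetical, and the sketch works on plain byte buffers rather than on PyObject values.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int
hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode a subset of escapes from src[0..len) into dst.
 * dst must hold at least len bytes; the output never grows.
 * Returns the number of bytes written, or -1 on a malformed \xHH.
 * *unknown counts escapes that were passed through unchanged.
 */
static long
decode_escapes_sketch(const char *src, size_t len,
                      unsigned char *dst, int *unknown)
{
    const char *p = src;
    const char *end = src + len;
    unsigned char *w = dst;
    *unknown = 0;
    while (p < end) {
        if (*p != '\\' || p + 1 == end) {
            *w++ = (unsigned char)*p++;   /* ordinary byte */
            continue;
        }
        p++;                              /* skip backslash */
        char c = *p++;
        switch (c) {
        case 'n': *w++ = '\n'; break;
        case 't': *w++ = '\t'; break;
        case '\\': *w++ = '\\'; break;
        case 'x':
            if (end - p < 2 || hex_val(p[0]) < 0 || hex_val(p[1]) < 0)
                return -1;
            *w++ = (unsigned char)(hex_val(p[0]) * 16 + hex_val(p[1]));
            p += 2;
            break;
        default:
            if (c >= '0' && c <= '7') {
                /* up to three octal digits in total */
                int v = c - '0';
                for (int i = 0; i < 2 && p < end && *p >= '0' && *p <= '7'; i++)
                    v = v * 8 + (*p++ - '0');
                *w++ = (unsigned char)(v & 0xFF);  /* sketch: truncate */
            } else {
                /* unrecognised: keep both characters */
                *w++ = '\\';
                *w++ = (unsigned char)c;
                (*unknown)++;
            }
        }
    }
    return (long)(w - dst);
}
```

Fed the 13 source characters a\n\x41\101\q, the sketch writes six bytes (a, newline, A, A, backslash, q) and reports one unrecognised escape. A real decoder would turn that counter into the invalid-escape warning.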
F-string literal scanning (lines 111 to 170)
cpython 3.14 @ ab2d84fe1023/Parser/string_parser.c#L111-170
static int
fstring_find_literal(const char **p, const char *end, int raw,
                     PyObject **literal, int recursion_depth, Parser *parser)
{
    const char *s = *p;
    while (s < end) {
        char ch = *s;
        if (ch == '{' || ch == '}') {
            if (s + 1 < end && s[1] == ch) {
                /* doubled brace: emit one brace literal, advance two */
                ...
                s += 2;
                continue; /* do not fall through to the s++ below */
            } else if (ch == '}') {
                RAISE_SYNTAX_ERROR("f-string: single '}' is not allowed");
                return -1;
            } else {
                break; /* start of expression */
            }
        }
        s++;
    }
    *p = s;
    ...
}
An f-string token is scanned character by character. A { that is not
followed by another { terminates the leading literal and signals the
start of an interpolation. A } that is not doubled is a syntax error.
The raw flag is threaded through so that rf"..." does not expand
escape sequences in the literal portions, which matches the rule that
raw f-strings are raw everywhere except inside {...}.
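The brace rules described above fit in a few lines of standalone C. This is a sketch under simplifying assumptions, not the CPython routine: scan_fstring_literal is a hypothetical name, the literal is copied into a caller-supplied buffer instead of a PyObject, and escape handling (the raw flag) is left out entirely.

```c
#include <stddef.h>

/*
 * Copy the leading literal part of an f-string body into out, collapsing
 * {{ to { and }} to }. Stops at the '{' that opens an expression.
 * out must hold at least len bytes.
 * Returns the number of input bytes consumed, or -1 for a lone '}'.
 */
static long
scan_fstring_literal(const char *s, size_t len, char *out, size_t *out_len)
{
    size_t i = 0;
    size_t n = 0;
    while (i < len) {
        char ch = s[i];
        if (ch == '{' || ch == '}') {
            if (i + 1 < len && s[i + 1] == ch) {
                out[n++] = ch;   /* doubled brace: emit one, skip two */
                i += 2;
                continue;
            }
            if (ch == '}') {
                return -1;       /* single '}' is not allowed */
            }
            break;               /* single '{': start of expression */
        }
        out[n++] = ch;
        i++;
    }
    *out_len = n;
    return (long)i;
}
```

For the body a{{b}}c{x} the sketch consumes seven bytes and yields the literal a{b}c, leaving the caller positioned on the '{' that begins the expression.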
Nested f-strings (f-strings inside the expression part of another
f-string) are handled by the recursion_depth guard. CPython 3.12
(PEP 701) lifted the earlier restrictions on nesting, such as reusing
the enclosing quote character, and the guard replaces the earlier hard
limit with an ast.MAX_NESTING_DEPTH check so that pathological inputs
do not exhaust the C stack.
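A depth guard of this kind follows a common pattern: pass the current depth down each recursive call and fail once it passes a limit. The sketch below shows only that pattern. The scan_nested function and its max_depth parameter are hypothetical, braces are matched naively (quotes and string contents inside expressions are ignored), and the limit is whatever the caller passes in rather than a CPython constant.

```c
#include <stddef.h>

/*
 * Scan one replacement field whose opening '{' has already been consumed.
 * Returns the number of bytes consumed including the closing '}', or -1
 * if the field is unterminated or nesting exceeds max_depth.
 */
static long
scan_nested(const char *s, size_t len, int depth, int max_depth)
{
    if (depth > max_depth) {
        return -1;                        /* guard: refuse to recurse further */
    }
    size_t i = 0;
    while (i < len) {
        if (s[i] == '}') {
            return (long)(i + 1);         /* end of this field */
        }
        if (s[i] == '{') {
            long used = scan_nested(s + i + 1, len - i - 1,
                                    depth + 1, max_depth);
            if (used < 0) {
                return -1;                /* propagate the failure */
            }
            i += 1 + (size_t)used;        /* skip '{' plus the nested field */
            continue;
        }
        i++;
    }
    return -1;                            /* ran out of input before '}' */
}
```

With the input a{b{c}}} the outer field reaches depth 3, so a limit of 3 accepts it and a limit of 2 rejects it. Because the check runs before any further recursion, stack use is bounded by the limit rather than by the input.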
_PyPegen_concatenate_strings (lines 321 to 339)
cpython 3.14 @ ab2d84fe1023/Parser/string_parser.c#L321-339
expr_ty
_PyPegen_concatenate_strings(Parser *p, asdl_seq *strings,
                             PyArena *arena)
{
    Py_ssize_t n = asdl_seq_LEN(strings);
    /* fast path: single non-fstring */
    if (n == 1 && !is_fstring(asdl_seq_GET(strings, 0))) {
        return asdl_seq_GET(strings, 0);
    }
    /* mixing bytes with str/fstring is illegal */
    for (Py_ssize_t i = 0; i < n; i++) {
        if (kind_of(asdl_seq_GET(strings, i)) == KIND_BYTES) {
            if (any_non_bytes(strings, i)) {
                RAISE_SYNTAX_ERROR(...);
                return NULL;
            }
        }
    }
    /* merge: if any part is an fstring build JoinedStr, else Constant */
    ...
}
Adjacent literals are fed in as an asdl_seq of already-decoded
expr_ty nodes. The merge logic has three outcomes: if there is only
one part and it is not an f-string, it is returned unchanged; if no
part is an f-string, the parts (all plain str or all bytes) are
concatenated into one Constant holding a str or a PyBytesObject; if
any part is an f-string, all parts are flattened into a JoinedStr
where the non-f-string segments become Constant children. Mixing
bytes with str or f-string is caught here with a SyntaxError rather
than propagating to the compiler.
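The decision made by the merge step can be isolated from the AST construction. The sketch below classifies a sequence of part kinds and does nothing else. The enum names and classify_merge are hypothetical, and the str-only result follows the "else Constant" comment in the excerpt above.

```c
#include <stddef.h>

/* Hypothetical kinds for illustration; not CPython enums. */
enum part_kind { PART_STR, PART_BYTES, PART_FSTRING };

enum merge_result {
    MERGE_ERROR,           /* bytes mixed with str or f-string */
    MERGE_BYTES_CONSTANT,  /* all parts are bytes */
    MERGE_STR_CONSTANT,    /* all parts are plain str */
    MERGE_JOINED_STR       /* at least one f-string, no bytes */
};

/* Decide which node kind a run of adjacent literals would merge into. */
static enum merge_result
classify_merge(const enum part_kind *parts, size_t n)
{
    int saw_bytes = 0;
    int saw_nonbytes = 0;
    int saw_fstring = 0;
    for (size_t i = 0; i < n; i++) {
        if (parts[i] == PART_BYTES) {
            saw_bytes = 1;
        } else {
            saw_nonbytes = 1;
        }
        if (parts[i] == PART_FSTRING) {
            saw_fstring = 1;
        }
    }
    if (saw_bytes && saw_nonbytes) {
        return MERGE_ERROR;
    }
    if (saw_fstring) {
        return MERGE_JOINED_STR;
    }
    return saw_bytes ? MERGE_BYTES_CONSTANT : MERGE_STR_CONSTANT;
}
```

Checking the bytes mix first mirrors the order in the excerpt: the error is reported before any merging is attempted, so a sequence such as bytes followed by an f-string never reaches the JoinedStr path.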