Parser/pegen.c

cpython 3.14 @ ab2d84fe1023/Parser/pegen.c

Runtime for the generated PEG parser at Parser/parser.c. Every grammar rule produced by Tools/peg_generator/ calls into this file. The file covers six concerns: packrat memoisation, token buffering, keyword classification, lookahead helpers, the Parser lifecycle, and the two-pass error-recovery driver.

Map

Lines      Symbol                                      Role
17-23      _PyPegen_interactive_exit                   REPL hook for ^D.
26-77      _PyPegen_byte_offset_to_character_offset*   Byte to character offsets for error messages.
80-93      _PyPegen_insert_memo                        Prepend a Memo onto a token's memo list.
96-109     _PyPegen_update_memo                        Update an existing memo entry or insert a new one.
112-123    init_normalization                          Lazy import of unicodedata.normalize.
125-159    growable_comment_array_*                    Type-comment stash.
162-178    _get_keyword_or_name_type                   NAME vs keyword classification.
244-309    _PyPegen_fill_token                         Pull the next token from the lexer.
349-390    _PyPegen_is_memoized                        Probe the memo table.
392-410    LOOKAHEAD1 / LOOKAHEAD2                     Generic lookahead helpers.
412-462    _PyPegen_expect_token                       Match the next token by type.
464-486    _PyPegen_expect_soft_keyword                Match soft keywords (match, case, type).
502-570    _PyPegen_new_identifier                     Build an interned, NFKC-normalised NAME.
572-620    _PyPegen_name_from_token                    Wrap a NAME token in a Name AST node.
622-694    parsenumber* / _PyPegen_number_token        Numeric literal parsing.
754-778    bad_single_statement                        single_input multi-statement diagnostic.
780-865    compute_parser_flags                        Compose parser flags from compiler flags.
867-878    _PyPegen_Parser_Free                        Tear down the parser.
879-945    reset_parser_state_for_error_pass           Reset memos and marks for the second pass.
947-994    _PyPegen_run_parser                         Drive the two-pass parse.
996-1042   _PyPegen_run_parser_from_file_pointer      Top-level file entry.
1044-1083  _PyPegen_run_parser_from_string            Top-level string entry.

Reading

The memo table (lines 80 to 109)

cpython 3.14 @ ab2d84fe1023/Parser/pegen.c#L80-109

int
_PyPegen_insert_memo(Parser *p, int mark, int type, void *node)
{
    Memo *m = _PyArena_Malloc(p->arena, sizeof(Memo));
    if (m == NULL) {
        return -1;
    }
    m->type = type;
    m->node = node;
    m->mark = p->mark;
    m->next = p->tokens[mark]->memo;
    p->tokens[mark]->memo = m;
    return 0;
}

The Memo struct (declared in pegen.h) lives in the per-parser arena and is keyed by rule type. Each Token owns a singly linked list of memos. When the generated parser enters rule R at token position mark it calls _PyPegen_is_memoized; on a hit the cached node is returned and p->mark advances to the cached value. On a miss the rule runs, then _PyPegen_insert_memo or _PyPegen_update_memo records the result. The linked-list-per-token shape is chosen over a global hash because the memo lifetime tracks the token lifetime naturally.
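
The shape can be sketched in a few lines. This is a simplified model, not the real pegen types: Memo and Token here carry only the fields the memo logic touches, and insert_memo / is_memoized stand in for _PyPegen_insert_memo / _PyPegen_is_memoized (the real versions allocate from the parser arena and take the Parser itself):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative, cut-down versions of the pegen memo types. */
typedef struct Memo {
    int type;            /* rule id */
    void *node;          /* cached result; NULL records a cached failure */
    int mark;            /* token position the parser should resume at */
    struct Memo *next;
} Memo;

typedef struct {
    Memo *memo;          /* head of this token's memo list */
} Token;

/* Prepend a memo onto the token's list, as _PyPegen_insert_memo does. */
static int insert_memo(Token *tok, int type, void *node, int end_mark) {
    Memo *m = malloc(sizeof(Memo));
    if (m == NULL) {
        return -1;
    }
    m->type = type;
    m->node = node;
    m->mark = end_mark;
    m->next = tok->memo;
    tok->memo = m;
    return 0;
}

/* Probe the list; on a hit report the cached node and the mark to
   fast-forward to, as _PyPegen_is_memoized does. */
static int is_memoized(Token *tok, int type, void **node, int *mark) {
    for (Memo *m = tok->memo; m != NULL; m = m->next) {
        if (m->type == type) {
            *node = m->node;
            *mark = m->mark;
            return 1;
        }
    }
    return 0;
}
```

A miss costs a walk of the token's short list; a hit skips re-running the rule entirely.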

Token feed (lines 244 to 309)

cpython 3.14 @ ab2d84fe1023/Parser/pegen.c#L244-309

_PyPegen_fill_token pulls one token from _PyTokenizer_Get into p->tokens[]. The buffer grows geometrically when full. Comments are discarded (the parser never sees COMMENT); TYPE_COMMENT tokens are stashed in p->type_ignore_comments. NAME tokens get their keyword classification applied here via _get_keyword_or_name_type (162-178), so "for" is already FOR by the time the parser sees it.
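
The growth policy is an ordinary doubling realloc. A minimal sketch with illustrative names (the real buffer holds Token pointers and starts from a fixed initial capacity, not zero):

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the parser's token buffer: fill tracks how many slots
   are used, size the current capacity. */
typedef struct {
    int *tokens;
    int fill;
    int size;
} Buf;

/* Append one token, doubling the buffer when it is full, in the spirit
   of the growth step inside _PyPegen_fill_token. */
static int push_token(Buf *p, int tok) {
    if (p->fill == p->size) {
        int newsize = p->size ? p->size * 2 : 1;   /* geometric growth */
        int *t = realloc(p->tokens, (size_t)newsize * sizeof(int));
        if (t == NULL) {
            return -1;
        }
        p->tokens = t;
        p->size = newsize;
    }
    p->tokens[p->fill++] = tok;
    return 0;
}
```

Doubling keeps the amortised cost per token constant, which matters because the whole token stream stays live for the duration of the parse (the memo table points into it).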

The keyword tables are indexed first by name length, then linearly searched on the matching bytes. The generator emits one table per length, each terminated by a sentinel entry with type -1, which keeps both the memory footprint and the inner loop tight:

static int
_get_keyword_or_name_type(Parser *p, struct token *new_token)
{
    Py_ssize_t name_len = new_token->end_col_offset - new_token->col_offset;
    if (name_len >= p->n_keyword_lists ||
        p->keywords[name_len] == NULL ||
        p->keywords[name_len]->type == -1) {
        return NAME;
    }
    for (KeywordToken *k = p->keywords[name_len]; k != NULL && k->type != -1; k++) {
        if (strncmp(k->str, new_token->start, (size_t)name_len) == 0) {
            return k->type;
        }
    }
    return NAME;
}
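
A toy version of the layout makes the lookup shape concrete. The tables below cover only three keywords and use made-up token ids; the real tables are emitted by the PEG generator:

```c
#include <assert.h>
#include <string.h>

/* Illustrative subset of the generated keyword tables. Each per-length
   list ends with a sentinel whose type is -1. */
typedef struct {
    const char *str;
    int type;
} KeywordToken;

enum { NAME = 1, TOK_IF = 100, TOK_FOR = 101, TOK_WHILE = 102 };

static KeywordToken kw2[] = {{"if", TOK_IF}, {NULL, -1}};
static KeywordToken kw3[] = {{"for", TOK_FOR}, {NULL, -1}};
static KeywordToken kw5[] = {{"while", TOK_WHILE}, {NULL, -1}};

/* Index 0..5 by name length; lengths with no keywords are NULL. */
static KeywordToken *keywords[] = {NULL, NULL, kw2, kw3, NULL, kw5};
static const int n_keyword_lists = 6;

/* Classify a span of source bytes, mirroring _get_keyword_or_name_type:
   bucket by length, then linear scan within the bucket. */
static int keyword_or_name(const char *start, int len) {
    if (len >= n_keyword_lists || keywords[len] == NULL) {
        return NAME;
    }
    for (KeywordToken *k = keywords[len]; k->type != -1; k++) {
        if (strncmp(k->str, start, (size_t)len) == 0) {
            return k->type;
        }
    }
    return NAME;
}
```

The length bucket prunes almost every candidate before a single byte comparison happens, which is why a plain linear scan beats a hash table here.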

Identifier construction (lines 502 to 570)

cpython 3.14 @ ab2d84fe1023/Parser/pegen.c#L502-570

Two concerns shape this function. The first is NFKC normalisation: identifiers with any non-ASCII characters are normalised through unicodedata.normalize('NFKC', name), which is why café (composed) and café (decomposed) compare equal as identifiers. The second is interning. Every NAME PyUnicode object goes through _PyUnicode_InternMortal, which lets later LOAD_GLOBAL lookups compare by pointer rather than by string content. init_normalization (112-123) defers the unicodedata import until an identifier actually needs it, because most parses are pure ASCII.
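
The payoff of interning can be shown with a deliberately naive pool; this is nothing like the real PyUnicode interning (which hashes, refcounts, and lives in the interpreter state), just a linear scan over C strings to illustrate the canonical-pointer idea:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A toy intern pool: every distinct spelling is stored once, and
   intern() always returns that one canonical pointer. */
#define POOL_MAX 256
static const char *pool[POOL_MAX];
static int pool_len = 0;

static const char *intern(const char *s) {
    for (int i = 0; i < pool_len; i++) {
        if (strcmp(pool[i], s) == 0) {
            return pool[i];              /* already interned: reuse */
        }
    }
    if (pool_len == POOL_MAX) {
        return NULL;                     /* toy pool is full */
    }
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, s, n);
    pool[pool_len++] = copy;
    return copy;
}
```

After interning, "are these two identifiers the same?" is a pointer comparison rather than a byte-by-byte strcmp, which is the property the bytecode interpreter's name lookups exploit.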

Numeric literals (lines 622 to 694)

cpython 3.14 @ ab2d84fe1023/Parser/pegen.c#L622-694

Three thin layers. parsenumber_raw inspects the suffix and dispatches: j produces a complex, . or e produces a float, otherwise the prefix (0x, 0o, 0b, or none) selects an integer base. Underscores are stripped inline. parsenumber wraps the raw helper with an error-context string. _PyPegen_number_token reads the NUMBER token off the stream, decodes its bytes, and returns a Constant AST node holding the value.

The integer paths bottom out in _PyLong_FromString, which has its own single-digit cache, so this file does not add a separate fast path for small ints.
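
The dispatch order can be sketched as below. This is a simplification, assuming ASCII input, ignoring underscore stripping and the complex value itself, and using strtol/strtod in place of the CPython conversion routines:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { LIT_INT, LIT_FLOAT, LIT_COMPLEX };

/* Classify a numeric literal the way parsenumber_raw dispatches:
   trailing j/J means imaginary, a 0x/0o/0b prefix picks an integer
   base, a '.' or exponent means float, otherwise decimal int. */
static int classify(const char *s, long *ival, double *fval) {
    size_t n = strlen(s);
    if (n > 0 && (s[n - 1] == 'j' || s[n - 1] == 'J')) {
        return LIT_COMPLEX;
    }
    /* Check the base prefix before scanning for 'e', so hex digits
       like the E in 0xE are not mistaken for an exponent. */
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        *ival = strtol(s + 2, NULL, 16);
        return LIT_INT;
    }
    if (s[0] == '0' && (s[1] == 'o' || s[1] == 'O')) {
        *ival = strtol(s + 2, NULL, 8);
        return LIT_INT;
    }
    if (s[0] == '0' && (s[1] == 'b' || s[1] == 'B')) {
        *ival = strtol(s + 2, NULL, 2);
        return LIT_INT;
    }
    if (strchr(s, '.') || strchr(s, 'e') || strchr(s, 'E')) {
        *fval = strtod(s, NULL);
        return LIT_FLOAT;
    }
    *ival = strtol(s, NULL, 10);
    return LIT_INT;
}
```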

The two-pass driver (lines 947 to 994)

cpython 3.14 @ ab2d84fe1023/Parser/pegen.c#L947-994

mod_ty
_PyPegen_run_parser(Parser *p)
{
    void *res = _PyPegen_parse(p);
    assert(p->level == 0);
    if (res == NULL) {
        if ((p->flags & PyPARSE_ALLOW_INCOMPLETE_INPUT) && _is_end_of_source(p)) {
            PyErr_Clear();
            return _PyPegen_raise_error(p, PyExc_IncompleteInputError, 0, "incomplete input");
        }
        if (PyErr_Occurred() && !PyErr_ExceptionMatches(PyExc_SyntaxError)) {
            return NULL;
        }
        Token *last_token = p->tokens[p->fill - 1];
        reset_parser_state_for_error_pass(p);
        _PyPegen_parse(p);
        _Pypegen_set_syntax_error(p, last_token);
        if (PyErr_ExceptionMatches(PyExc_SyntaxError)) {
            _PyPegen_set_syntax_error_metadata(p);
        }
        return NULL;
    }
    ...
}

Three things to notice. First, the PyPARSE_ALLOW_INCOMPLETE_INPUT flag, set by the REPL, converts a parse failure at end of input into the dedicated IncompleteInputError exception, so the REPL prompts for another line instead of reporting a SyntaxError. Second, the error pass is gated by reset_parser_state_for_error_pass (879-945), which clears the memo table, resets p->mark, and sets p->call_invalid_rules = 1; that flag is what the generated parser tests in its invalid_* alternatives to enable the heavier diagnostic rules. Third, the recovery pass is allowed to fail too; _Pypegen_set_syntax_error then constructs a SyntaxError from the error indicator and the failing token.
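
Stripped of all real parsing, the two-pass control flow looks like the sketch below; the struct, the toy one-word "grammar", and the return conventions are illustrative only:

```c
#include <assert.h>
#include <string.h>

/* Toy parser state: just the pieces the two-pass driver manipulates. */
typedef struct {
    int mark;
    int call_invalid_rules;
    const char *error;       /* set by the diagnostic pass */
} P;

/* Toy parse: accepts only "ok". With invalid rules enabled it also
   "matches" bad input, purely to attach a diagnostic message, the way
   the generated invalid_* alternatives do. */
static int parse(P *p, const char *src) {
    if (strcmp(src, "ok") == 0) {
        return 1;
    }
    if (p->call_invalid_rules) {
        p->error = "expected 'ok'";
    }
    return 0;
}

/* Mirror of the _PyPegen_run_parser shape: fast first pass, then a
   reset and a second pass run only for its side effect (the error). */
static int run_parser(P *p, const char *src) {
    if (parse(p, src)) {
        return 0;                        /* first pass succeeded */
    }
    p->mark = 0;                         /* reset_parser_state_for_error_pass */
    p->call_invalid_rules = 1;
    parse(p, src);                       /* result discarded; error captured */
    return -1;
}
```

Keeping the invalid_* rules out of the first pass is the point: the happy path never pays for the diagnostic machinery.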

In debug builds the successful tree is validated through _PyAST_Validate (982-991), which checks invariants the grammar cannot encode, such as "starred targets only inside tuples".

File and string entry points (lines 996 to 1083)

cpython 3.14 @ ab2d84fe1023/Parser/pegen.c#L996-1083

Two thin wrappers. Both build a tok_state, transfer ownership of the filename PyObject to the tokenizer, create a Parser, run it through _PyPegen_run_parser, and free both. The file variant additionally flips tok->fp_interactive when reading from <stdin> so the tokenizer emits the line-edit cues the REPL needs.

Notes for the gopy mirror

parser/pegen/runtime.go mirrors _PyPegen_run_parser with the same two-pass structure; the invalid_* rules are present in parser/parser_gen.go. The Memo becomes a Go struct kept in a slice-per-token rather than a linked list; semantics are unchanged. NFKC normalisation goes through golang.org/x/text/unicode/norm. Name interning uses the parser/intern package, a single global string set.