Parser

CPython's frontend has two pieces. The tokenizer turns source bytes into a flat stream of tokens. The parser turns that stream into an AST. Both pieces live under Parser/ in the source tree.

Source map

Parser/tokenizer.c: the hand-written tokenizer.
Parser/tokenizer.h: the tok_state struct.
Parser/lexer/lexer.c: the low-level character pump used by the tokenizer.
Parser/pegen.c: the PEG parser core (memoization, packrat tables).
Parser/parser.c: generated from Grammar/python.gram.
Grammar/python.gram: the grammar source.
Tools/peg_generator/: the PEG generator that builds parser.c.

Tokenizer

The tokenizer is a state machine. Each call to tok_get reads characters until it can return one token. The state includes the file position, the indentation stack, the f-string nesting stack, the encoding declaration, and a small pushback buffer.

The token kinds are listed in Grammar/Tokens. The big ones are NAME, NUMBER, STRING, OP, NEWLINE, INDENT, and DEDENT. Soft keywords such as match and case are not separate token kinds: they arrive as plain NAME tokens and are only treated as keywords in the grammar positions that ask for them.
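The token stream is easy to observe from the Python side with the stdlib tokenize module, whose output mirrors the C tokenizer's:

```python
# Inspect the token stream for a tiny program; the names match the
# kinds listed in Grammar/Tokens.
import io
import tokenize

names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO("x = 42\n").readline)]
print(names)   # ['NAME', 'OP', 'NUMBER', 'NEWLINE', 'ENDMARKER']
```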

Indentation

CPython does not use whitespace insensitivity tricks. The tokenizer maintains an explicit stack of indentation widths. When a line's leading whitespace is wider than the top of the stack, it pushes and emits INDENT. When narrower, it pops and emits DEDENT for each level popped, raising IndentationError if the new width does not match any prior level.
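The indent stack is visible through the stdlib tokenize module, which reproduces the C tokenizer's INDENT/DEDENT behaviour, including the IndentationError for a dedent that matches no prior level:

```python
# Show INDENT/DEDENT emission and the inconsistent-dedent error.
import io
import tokenize

def token_names(src):
    return [tokenize.tok_name[t.type]
            for t in tokenize.generate_tokens(io.StringIO(src).readline)]

print(token_names("if x:\n    y\n"))
# A dedent to width 2 when the stack holds widths 0 and 8 matches
# no prior level, so the tokenizer raises IndentationError:
try:
    token_names("if x:\n        y\n  z\n")
except IndentationError as e:
    print(e.msg)
```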

F-strings

F-strings are tokenized as a separate sub-language. Inside an f-string, the tokenizer switches mode and emits FSTRING_START, FSTRING_MIDDLE, and FSTRING_END tokens, with { and } braces marking embedded expressions. Since PEP 701, each embedded expression is tokenized and parsed with the regular grammar rather than by a separate hand-written string parser.
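The result is visible in the AST: an f-string parses to a JoinedStr whose FormattedValue nodes carry a full expression tree and an optional format spec.

```python
# Parse an f-string and dump the resulting AST; the embedded
# expression 3.14 and the format spec .1f each become real nodes.
import ast

tree = ast.parse('f"pi is {3.14:.1f}"', mode="eval")
print(ast.dump(tree.body, indent=2))
```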

Encoding

The tokenizer reads the encoding declaration in the first two lines of the file (PEP 263) and decodes bytes through that codec before tokenizing. The default is UTF-8.
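The same PEP 263 sniffing is implemented at the Python level by tokenize.detect_encoding, which reads at most two lines looking for a coding cookie:

```python
# Detect a declared coding cookie versus the UTF-8 default.
import io
import tokenize

declared = b"# -*- coding: latin-1 -*-\nx = 1\n"
enc, _lines = tokenize.detect_encoding(io.BytesIO(declared).readline)
print(enc)   # iso-8859-1 (the latin-1 aliases are normalized)

enc_default, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
print(enc_default)   # utf-8
```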

PEG parser

CPython moved to a PEG parser in PEP 617. The grammar lives in Grammar/python.gram. The PEG generator under Tools/peg_generator/ emits Parser/parser.c.

What "PEG" means here

A Parsing Expression Grammar matches input by ordered choice: alternatives are tried left to right and the first match wins. This removes ambiguity at the cost of making grammar rules order-sensitive. PEG is paired with packrat memoization: each rule, evaluated at each position, caches its result. That makes worst-case linear time achievable.
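Ordered choice can be sketched in a few lines of toy code (this is an illustration, not CPython's implementation); note how swapping the alternatives changes what is recognized:

```python
# A toy PEG ordered choice: alternatives are tried left to right
# and the first success wins, so rule order matters.
import re

def choice(s, pos, alts):
    for alt in alts:
        end = alt(s, pos)
        if end is not None:
            return end          # first success wins; later alternatives never run
    return None

def num(s, pos):
    m = re.match(r"\d+", s[pos:])
    return pos + m.end() if m else None

def sum_(s, pos):            # num '+' num
    end = num(s, pos)
    if end is not None and end < len(s) and s[end] == "+":
        return num(s, end + 1)
    return None

print(choice("1+2", 0, [num, sum_]))   # -> 1: bare num wins, "+2" left over
print(choice("1+2", 0, [sum_, num]))   # -> 3: the whole input is consumed
```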

Memoization

pegen.c keeps a memo table indexed by (rule, position). On re-entry it returns the cached result. The memo table is also how the parser detects and aborts left-recursive descents that would otherwise loop.
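A minimal packrat memo table looks like this (a sketch of the idea, not pegen's actual code): results are cached per (rule, position), so each rule body runs at most once at each position.

```python
# Packrat memoization sketch: cache rule results by (rule, position).
calls = 0
memo = {}

def memoize(rule):
    def wrapper(s, pos):
        key = (rule.__name__, pos)
        if key not in memo:
            memo[key] = rule(s, pos)   # evaluate at most once per position
        return memo[key]
    return wrapper

@memoize
def num(s, pos):
    global calls
    calls += 1
    return pos + 1 if pos < len(s) and s[pos].isdigit() else None

num("7", 0)
num("7", 0)      # second call is served from the memo table
print(calls)     # -> 1
```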

Left recursion

PEP 617 supports left-recursive rules through a fixed-point trick. A left-recursive rule is initially memoized as "fail". The parser then iteratively retries the rule, replacing the memo with the longest successful parse, until the result stabilizes.
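The fixed-point loop can be sketched in toy form (again an illustration in the style PEP 617 describes, not pegen's code). The rule expr: expr '+' num | num is seeded as a failure, then re-run; each iteration the left-recursive call sees the previous, shorter result, so the parse grows one term at a time and comes out left-associative:

```python
# Toy fixed-point handling of a left-recursive PEG rule.
import re

def parse_expr(s, pos, memo):
    # expr: expr '+' num | num
    key = ("expr", pos)
    if key in memo:
        return memo[key]
    memo[key] = None             # seed: the left-recursive call fails
    while True:
        result = alt(s, pos, memo)
        best = memo[key]
        if result is None or (best is not None and result[0] <= best[0]):
            return best          # stopped growing: fixed point reached
        memo[key] = result       # keep the longer parse and retry

def alt(s, pos, memo):
    left = parse_expr(s, pos, memo)    # left-recursive alternative
    if left is not None:
        end, tree = left
        if end < len(s) and s[end] == "+":
            n = num(s, end + 1)
            if n is not None:
                return n[0], (tree, "+", n[1])
    return num(s, pos)                 # base alternative

def num(s, pos):
    return (pos + 1, s[pos]) if pos < len(s) and s[pos].isdigit() else None

print(parse_expr("1+2+3", 0, {}))   # left-associative: ((1 + 2) + 3)
```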

Actions

Each grammar rule has an attached action, written in C inside curly braces. The action allocates AST nodes from the per-parse PyArena via the _PyAST_* node constructors. The generator turns the action into part of the rule's C function in parser.c. Helper macros like RAISE_SYNTAX_ERROR, CHECK, and CHECK_VERSION simplify the action bodies; the supporting helper functions live in Parser/action_helpers.c.
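A representative rule, abridged from Grammar/python.gram (the exact spelling varies between CPython versions), shows the shape: grammar on the left, C action in braces on the right.

```
# The action builds the AST node; EXTRA expands to the node's
# source-location arguments.
return_stmt[stmt_ty]:
    | 'return' a=[star_expressions] { _PyAST_Return(a, EXTRA) }
```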

Error recovery

A failed parse does not stop at the first failure. The parser makes a second pass from the top with the grammar's "invalid" rules enabled; those rules match almost anything and produce a tuned SyntaxError message. They are the source of the suggestions you see for missing colons, unmatched brackets, and walrus operators used in the wrong context.
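The effect is visible from Python: each of these sources trips one of the grammar's invalid rules, so the message is specific rather than a generic "invalid syntax".

```python
# Collect the SyntaxError message (if any) for a source string.
def msg(src):
    try:
        compile(src, "<test>", "exec")
    except SyntaxError as e:
        return e.msg
    return None

print(msg("if True\n    pass"))   # a missing-colon message
print(msg("x = (1 + 2"))          # an unclosed-bracket message
```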

Arena allocation

Every AST node and intermediate object is allocated in a PyArena. The arena is freed wholesale when parsing finishes. This keeps the parser pointer-stable and removes the need for per-node reference counting during construction.

Putting it together

// Parser/peg_api.c
mod_ty
_PyParser_ASTFromString(const char *str, PyObject *filename,
                        int mode, PyCompilerFlags *flags,
                        PyArena *arena);

That function spins up a tokenizer over str, calls the generated file_rule (or eval_rule, etc.) in parser.c, and returns a mod_ty AST. From there control passes to Python/ast.c for validation.
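The Python-level mirror of this entry point is ast.parse, which goes through compile(..., flags=PyCF_ONLY_AST); the mode argument selects the start rule.

```python
# mode="exec" drives the file rule, mode="eval" the eval rule.
import ast

module = ast.parse("x = 1\n", mode="exec")   # file input -> Module
expr = ast.parse("1 + 2", mode="eval")       # eval input -> Expression
print(type(module).__name__, type(expr).__name__)
```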

Reading order

For the validator and the AST shape, continue to AST. For an end-to-end picture of how parse output feeds compile, return to Pipeline overview.