Pipeline overview

The compile pipeline takes a .py file and produces a PyCodeObject. A PyCodeObject is not callable by itself; it is an immutable container of bytecode and its metadata that the eval loop executes.

Six stages run in order. Each stage has its own page in the sidebar; this page is the high-level map.

source bytes

▼ Parser/tokenizer.c, Parser/parser.c
tokens, then AST

▼ Python/ast.c
validated AST

▼ Python/symtable.c
symbol table

▼ Python/compile.c
intermediate "instruction sequence" per scope

▼ Python/flowgraph.c
control-flow graph, optimised

▼ Python/assemble.c
PyCodeObject

Entry points

The compiler is reached from three entry points.

Function                 Where called from
Py_CompileStringFlags    The public C API.
PyRun_FileExFlags        python file.py.
builtin_compile          compile(...) in Python.

All three call _PyAST_Compile in Python/compile.c, which is the canonical entry point.

// Python/compile.c
PyCodeObject *
_PyAST_Compile(mod_ty mod, PyObject *filename, PyCompilerFlags *flags,
               int optimize, PyArena *arena);
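
The same convergence can be observed from Python, where the built-in compile accepts either source text or an already-parsed AST. The sketch below is illustrative only: the source string, the "<example>" filename, and the variable names are assumptions made for the example, not part of CPython.

```python
import ast
import types

source = "answer = 6 * 7\n"

# Path 1: compile() parses the text and runs the rest of the pipeline.
code_from_text = compile(source, "<example>", "exec")

# Path 2: parse first, then hand the finished AST to compile().
tree = ast.parse(source, filename="<example>", mode="exec")
code_from_ast = compile(tree, "<example>", "exec")

# Both paths end in a code object that exec() can run.
namespace = {}
exec(code_from_ast, namespace)
print(type(code_from_text).__name__, namespace["answer"])
```

Either path yields a types.CodeType instance, which is the Python-level view of a PyCodeObject.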

What each stage produces

1. Parse

The tokenizer in Parser/tokenizer.c turns source bytes into a stream of Token structs. The parser in Parser/parser.c consumes the token stream and produces a mod_ty AST. The AST is defined by Parser/Python.asdl and generated into Include/internal/pycore_ast.h.
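
The two halves of this stage can be approximated from Python with the standard tokenize and ast modules. This is a sketch, not a view into Parser/tokenizer.c itself: the tokenize module is a separate Python-facing interface, and the sample source line is an assumption chosen for the example.

```python
import ast
import io
import tokenize

source = "total = price * 2\n"

# Tokens: a flat stream of (type, string, position) records.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
token_types = [tokenize.tok_name[tok.type] for tok in tokens]
token_strings = [tok.string for tok in tokens]

# AST: the same source as a tree rooted at a Module node.
tree = ast.parse(source)
assign = tree.body[0]

print(token_types)
print(ast.dump(assign))
```

The token list is linear, while the AST already encodes that the multiplication is the value of the assignment.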

2. AST validate

_PyAST_Validate in Python/ast.c walks the tree and rejects malformed shapes that the grammar accepts but the language disallows. Examples include augmented assignment to a literal, await outside async def, and starred expressions in invalid positions.
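
Validation is easiest to observe with a tree built by hand through the ast module, since such a tree never passed through the parser. The following is a minimal sketch under that assumption: the malformed node (an Assign with no targets) and the "<example>" filename are chosen for illustration, and the exact error message may differ between CPython versions.

```python
import ast

# A hand-built Assign node with an empty target list. No source text
# can produce this shape, but the ast module lets us construct it.
bad_tree = ast.Module(
    body=[ast.Assign(targets=[], value=ast.Constant(value=1))],
    type_ignores=[],
)
ast.fix_missing_locations(bad_tree)  # fill in lineno / col_offset

try:
    compile(bad_tree, "<example>", "exec")
    error = None
except ValueError as exc:
    error = exc  # the validator rejected the tree

print(repr(error))
```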

3. Symbol table

_PySymtable_Build in Python/symtable.c walks the AST and builds a tree of scopes. Each scope records every name it sees and classifies the name as local, free, cell, global, or implicit global. The compiler relies on this classification whenever it chooses an opcode that loads, stores, or deletes a name.
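
The standard symtable module exposes the result of this stage to Python code, which makes the classification easy to inspect. In this sketch the source snippet and the names outer, inner, x, and g are assumptions made up for the example.

```python
import symtable

source = """
g = 1
def outer():
    x = 1
    def inner():
        return x + g
    return inner
"""

# Build the scope tree: module -> outer -> inner.
top = symtable.symtable(source, "<example>", "exec")
outer = top.lookup("outer").get_namespace()
inner = outer.lookup("inner").get_namespace()

x_in_outer = outer.lookup("x")  # bound in outer, captured by inner
x_in_inner = inner.lookup("x")  # free variable resolved in outer
g_in_inner = inner.lookup("g")  # never declared global, found at module level

print(x_in_outer.is_local(), x_in_inner.is_free(), g_in_inner.is_global())
```

The same name x is classified differently in the two scopes, which is why the classification is stored per scope rather than per name.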

4. Codegen

compiler_codegen in Python/compile.c walks the AST top-down once per scope. For each statement and expression form, the compiler emits a sequence of pseudo-instructions into the current "instruction sequence". A nested scope produces a nested sequence; the outer sequence references it through a constant that holds the inner code object.
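
The nesting described above survives into the finished code objects, so it can be checked from Python. This sketch assumes a small two-level example; nested_code is a hypothetical helper written for illustration, not a CPython API.

```python
import types

source = """
def outer():
    def inner():
        return 1
    return inner
"""

module_code = compile(source, "<example>", "exec")


def nested_code(code):
    """Return the code objects stored in a code object's constant pool."""
    return [c for c in code.co_consts if isinstance(c, types.CodeType)]


# Each scope's code object is a constant of the enclosing scope.
outer_code = nested_code(module_code)[0]
inner_code = nested_code(outer_code)[0]

print(module_code.co_name, outer_code.co_name, inner_code.co_name)
```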

5. Flow graph

After codegen, the instruction sequence is broken into basic blocks in Python/flowgraph.c. Several passes then run over the block graph: jump threading, constant folding, dead-block removal, stack-depth analysis, and exception-table construction.
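
The effect of constant folding can be seen in the final bytecode, even though the intermediate block graph is not exposed to Python. The sketch below only inspects the end result; which internal stage performs the folding has varied between CPython versions, and the sample expression is an assumption chosen for the example.

```python
import dis

code = compile("seconds_per_day = 60 * 60 * 24\n", "<example>", "exec")

# After optimisation the product is a single constant, and no
# multiplication instruction remains in the bytecode.
opnames = [ins.opname for ins in dis.get_instructions(code)]
folded = 86400 in code.co_consts
has_binary_op = any(name.startswith("BINARY_") for name in opnames)

print(folded, has_binary_op)
```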

6. Assemble

assemble in Python/assemble.c linearises the block graph, resolves jump targets to byte offsets, encodes the exception table and the location table (which carries the PEP 657 column locations), packs the constant / name / variable pools, and emits a PyCodeObject.
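
The pools packed by the assembler are readable as attributes of the resulting code object. This is a small sketch using a made-up function named sample; the dictionary keys are labels chosen for the example, and co_exceptiontable is read with getattr because it is assumed to exist only on newer CPython versions.

```python
def sample(a, b):
    total = a + b
    return len(str(total))


code = sample.__code__

pools = {
    "bytecode": code.co_code,     # linearised instructions, as bytes
    "constants": code.co_consts,  # constant pool
    "names": code.co_names,       # global and attribute names
    "locals": code.co_varnames,   # arguments, then other local variables
}

# Present on newer CPython versions only; None elsewhere.
exception_table = getattr(code, "co_exceptiontable", None)

for label, value in pools.items():
    print(label, value)
```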

Reading order

Read Parser for stage 1, AST for stage 2, Symtable for stage 3, Compiler for stage 4, Flowgraph for stage 5, and Assembler for stage 6.

The output of stage 6, the PyCodeObject, is the input to the VM.