
v0.5.5 - The lexer and the parser handover

Released May 5, 2026.

The Python lexer is one of those subsystems that everybody thinks is simple until they have to write one. Whitespace is significant. Indentation is significant in a way that requires a counter, not a regex. A line continuation is a backslash followed by exactly the right kind of newline. F-strings (PEP 701) and t-strings (PEP 750) contain arbitrary expressions that the lexer has to recognize as a sub-language, push a mode onto a stack, and pop the mode when the closing brace arrives. Triple-quoted strings span lines but single-quoted ones don't. Comments end at the newline unless they end at the EOF. Type comments look like regular comments unless the caller asked for them, in which case they're tokens.

CPython's lexer (Parser/lexer/lexer.c plus the buffer and state files alongside it) has accumulated this complexity over thirty years. Every edge case is in there. You can't write a faithful Python lexer by reading the language reference; you have to read the C source and port it.

v0.5.5 ports that lexer. After this release, gopy can take Python source bytes and produce a token stream that matches CPython's, token for token, position for position, including the PEP 701 f-string mode stack and the PEP 750 t-string variant. The SyntaxError messages a lexer error produces match CPython byte for byte.

The pegen runtime ships alongside: the token buffer, the mark / reset / peek primitives, the Expect helpers the generated parser table will lean on. The generated parser table itself is not in this drop, because it depends on a Go-targeted PEG generator that's still in flight under tools/parser_gen/. The generator plus the table land in v0.6.

What this means: after v0.5.5, the lexer side of the parse stage is done. The parser is one release away.

Highlights

Three themes pull through this release.

A faithful Python lexer

The lexer is structured the way CPython's is: a state struct (Parser/lexer/state.h), a buffer struct (Parser/lexer/buffer.h), and a giant FSM (Parser/lexer/lexer.c) that walks the buffer one rune at a time and emits tokens.

import "gopy/parser/lexer"

src := []byte(`def f(x):
    return x + 1
`)

lex, _ := lexer.NewFromString(src)
for {
    tok, err := lex.Next()
    if err != nil {
        break // err carries the SyntaxError record
    }
    if tok.Type == token.ENDMARKER {
        break
    }
    fmt.Println(tok)
}
// NAME 'def', NAME 'f', LPAR '(', NAME 'x', RPAR ')', COLON ':', NEWLINE,
// INDENT, NAME 'return', NAME 'x', OP '+', NUMBER '1', NEWLINE, DEDENT,
// ENDMARKER
The FSM handles INDENT and DEDENT through a pending counter, the way the C source does (you can't emit two tokens at once, so multi-level dedents queue up). Line continuations (backslash, newline) re-enter the scan at the right offset. Paren balance is tracked across lines, so a ( opened on line 1 lets line 2 ignore its leading whitespace.
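The queueing mechanic can be sketched in a few lines. This is an illustration of the pending-counter idea, not gopy's actual API; the names DedentQueue, Observe, indents, and pending are invented here.

```go
package main

import "fmt"

// DedentQueue sketches how multi-level dedents are queued: the scanner can
// only emit one token per call, so popping several indent levels at once
// increments a pending counter that subsequent calls drain.
type DedentQueue struct {
	indents []int // stack of active indentation widths, innermost last
	pending int   // DEDENT tokens owed but not yet emitted
}

// Observe compares a line's indentation width against the stack and
// returns the token kinds that become due ("INDENT" or queued "DEDENT"s).
func (d *DedentQueue) Observe(width int) []string {
	var out []string
	top := 0
	if n := len(d.indents); n > 0 {
		top = d.indents[n-1]
	}
	switch {
	case width > top:
		d.indents = append(d.indents, width)
		out = append(out, "INDENT")
	case width < top:
		// Pop every level deeper than the new width; each pop owes one DEDENT.
		for len(d.indents) > 0 && d.indents[len(d.indents)-1] > width {
			d.indents = d.indents[:len(d.indents)-1]
			d.pending++
		}
		for d.pending > 0 {
			d.pending--
			out = append(out, "DEDENT")
		}
	}
	return out
}

func main() {
	q := &DedentQueue{}
	fmt.Println(q.Observe(4)) // [INDENT]
	fmt.Println(q.Observe(8)) // [INDENT]
	fmt.Println(q.Observe(0)) // [DEDENT DEDENT]  (two levels close at once)
}
```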

The buffer is offset-based, not pointer-based. CPython's C source has a buffer-rebase dance where a reallocating tok->buf requires fixing up every pointer that was held into it. We sidestepped that by using integer offsets instead of raw pointers, which is one of the few places the Go port reads shorter than the C original.

PEP 701 f-strings, PEP 750 t-strings

F-strings used to be a parser-level rewrite. In Python 3.12, PEP 701 promoted them to first-class syntax with a proper grammar. The lexer now emits FSTRING_START, FSTRING_MIDDLE, and FSTRING_END tokens, plus the expression tokens between them.

name = "world"
greeting = f"hello, {name.upper()!r:>20}"

The lexer for this is its own mode. When the regular-mode scanner sees an f-prefix on a string literal, it pushes f-string mode onto a stack and emits FSTRING_START. Inside f-string mode, the scanner reads literal text until it hits {, at which point it emits FSTRING_MIDDLE for the literal so far, switches back to regular mode for the expression, and tracks brace depth. When the brace depth returns to zero on a }, it pops back into f-string mode and continues. When the closing quote arrives, it emits FSTRING_END and pops the mode.
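The mode handoff above can be sketched on the body of an f-string (the text between the quotes). This toy ignores {{/}} escapes, format specs, and nested strings; splitFString and the MIDDLE/EXPR labels are invented for illustration, not gopy's real tokens.

```go
package main

import "fmt"

// splitFString walks an f-string body: literal runs become MIDDLE segments,
// and `{...}` hands off to "regular mode" until brace depth returns to zero,
// so a dict display inside the replacement field doesn't end it early.
func splitFString(body string) []string {
	var parts []string
	lit := ""
	for i := 0; i < len(body); i++ {
		c := body[i]
		if c != '{' {
			lit += string(c)
			continue
		}
		if lit != "" {
			parts = append(parts, "MIDDLE:"+lit)
			lit = ""
		}
		// Regular mode: scan the expression, tracking brace depth.
		depth := 1
		j := i + 1
		for j < len(body) && depth > 0 {
			switch body[j] {
			case '{':
				depth++
			case '}':
				depth--
			}
			j++
		}
		parts = append(parts, "EXPR:"+body[i+1:j-1])
		i = j - 1
	}
	if lit != "" {
		parts = append(parts, "MIDDLE:"+lit)
	}
	return parts
}

func main() {
	for _, p := range splitFString("hello, {name.upper()} and {a + {1: 2}[1]}!") {
		fmt.Println(p)
	}
}
```

The real lexer does the same dance with a mode stack instead of a nested loop, which is what lets the modes recurse for f-strings inside f-strings.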

The mode stack supports nesting: an f-string inside an f-string inside an f-string works, the way CPython supports it post-3.12.

f"outer {f'inner {f"deep"}'}"

T-strings (PEP 750, new in 3.14) are template literals: the lexer treats them the same way as f-strings but emits the template-variant tokens, and the parser builds a TemplateStr AST node instead of JoinedStr. The lexer-level work is identical aside from the token types.

Byte-faithful SyntaxError messages

When Python tells you IndentationError: unexpected indent, the exact wording matters. Test suites match against it. IDEs parse it. Tooling depends on it. CPython has a giant pegen_errors.c that emits these messages with care, including the file path, the line number, the column range, and a caret pointer at the failing column.

We ported every message verbatim.

  File "<stdin>", line 2
    return x
    ^^^^^^
IndentationError: unexpected indent

The wording, the line wrap, the caret position. Byte for byte the same as CPython 3.14.

The error builder publishes four constructors:

  • Raise for a one-position SyntaxError.
  • RaiseIndent for indentation errors.
  • RaiseTab for tab/space mixing errors.
  • RaiseRange for an error that spans a range of columns (used by f-string scope errors that point at the whole expression).

Each carries a Kind enum so the caller can lift the record into the matching exception class: SyntaxError, IndentationError, TabError, or OverflowError. The OverflowError variant covers the case where a number literal is too big to parse, which CPython delivers as a SyntaxError-shaped diagnostic but a different exception class.

What's new

The full package breakdown.

parser/lexer/

The full Python lexer. Ports the Parser/lexer/ family from CPython.

  • state.go from Parser/lexer/state.[ch]. The lexer state struct: cursor offset, line/column counters, paren-balance stack, indent stack, pending counter, mode stack, error slot.
  • buffer.go from Parser/lexer/buffer.[ch]. The source buffer. We use offsets instead of raw pointers, so the C source's buffer-rebase dance (the part that fixes up dangling pointers when the buffer grows) disappears in the Go port.
  • lexer.go from Parser/lexer/lexer.c. The regular-mode FSM. NAME / NUMBER / STRING / OP scanning. INDENT / DEDENT emission via the pending counter. Line continuation. Paren balance tracking (a ( opens, a ) closes; opens across lines swallow the newline). COMMENT / NL / type-comment emission under the extra-tokens flag.
  • token.go for the Token struct and the per-kind constants. The constants are generated against CPython's Grammar/Tokens in v0.5 (tokenize skeleton); this release wires them into real Token instances.
  • fstring.go from Parser/lexer/lexer.c tok_get_fstring_mode. PEP 701 f-string mode plus the PEP 750 t-string variant. Prefix detection (any of f, F, rf, Rf, rF, RF, fr, Fr, fR, FR, plus the t-equivalents). Mode stack push. FSTRING_START emission. The literal-text reader that walks until { and emits FSTRING_MIDDLE. The brace-depth tracker that re-enters f-string mode when } closes the expression. FSTRING_END emission and mode stack pop on closing quote.

The lexer ships with three drivers, mirroring CPython's three tokenizer entry points:

  • driver_string.go. Tokenize from a byte slice. BOM stripping on the first line lives here. Used by the in-memory test path.
  • driver_file.go. Tokenize from an io.Reader. Used by the file-source entry point.
  • driver_readline.go. Tokenize through a readline-style callback. Used by the REPL where we don't have the full source upfront. The callback returns one line per call; the lexer asks for the next line whenever its buffer runs out.

parser/pegen/

The pegen runtime. Ports Parser/pegen.[ch] and the action helper file.

  • parser.go from Parser/pegen.[ch]. The Parser struct, the fillToken routine that pulls tokens from the lexer into the parser's token array, the mark / reset / peek primitives the generated parser table will use to do its packrat caching. Expect and ExpectName as the consume helpers (consume a specific token type / a specific NAME).
  • actions.go from Parser/action_helpers.c. The action helpers the generated parser calls during reduction. Sequence-shape operations (singleton, insert-in-front, append, flatten, first-item, last-item), the dotted-name joiners (for building Attribute(Attribute(Name("a"), "b"), "c") from a.b.c), SetExprContext that walks the Name / Tuple / List / Starred / Attribute / Subscript family to set their ctx field to Store / Load / Del, and the GetExprName phrase table behind "cannot assign to %s" (the thing that tells you you can't assign to a function call).
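The mark / reset / expect discipline the generated parser will follow can be sketched over a toy token slice. This is a PEG-style illustration, not gopy's pegen runtime; the names and the two-alternative rule are invented here.

```go
package main

import "fmt"

// parser sketches the primitives a generated PEG parser leans on: save a
// mark, try an alternative, and on failure rewind to the mark.
type parser struct {
	toks []string
	pos  int
}

func (p *parser) mark() int   { return p.pos }
func (p *parser) reset(m int) { p.pos = m }

// expect consumes one token if it matches, PEG-style.
func (p *parser) expect(t string) bool {
	if p.pos < len(p.toks) && p.toks[p.pos] == t {
		p.pos++
		return true
	}
	return false
}

// assignOrExpr tries the alternative `NAME = NAME` first, and rewinds to
// try the bare `NAME` alternative if it fails.
func (p *parser) assignOrExpr() string {
	m := p.mark()
	if p.expect("NAME") && p.expect("=") && p.expect("NAME") {
		return "assign"
	}
	p.reset(m) // failed alternative: rewind before trying the next one
	if p.expect("NAME") {
		return "expr"
	}
	return "fail"
}

func main() {
	fmt.Println((&parser{toks: []string{"NAME", "=", "NAME"}}).assignOrExpr())
	fmt.Println((&parser{toks: []string{"NAME", "+", "NAME"}}).assignOrExpr())
}
```

The real runtime adds packrat memoization on top of this, caching each (rule, mark) result so rewinding never re-parses the same span twice.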

The AST builder helpers (MakeArguments for function signatures with positional-only / keyword-only / defaults markers, the decorated FunctionDef constructor, JoinedStr assembly from the f-string token stream) ship alongside the generated parser in v0.6. The shape is wired but the helpers are stubs until the parser table calls them.

parser/errors/

The SyntaxError surface.

  • messages.go transcribes the entire SyntaxError text panel from Parser/pegen_errors.c. Goal: byte-for-byte parity with CPython 3.14. A SyntaxError that gopy emits should be indistinguishable from the one CPython would emit for the same source. We pinned this with a panel test that runs CPython against a corpus of broken inputs, captures the output, and diffs against gopy's.
  • builder.go exposes the four constructors the parser reaches for: Raise, RaiseIndent, RaiseTab, RaiseRange. Each builds a SyntaxError record with the right kind, message, filename, line, column range, and source line for the caret.
  • tokenizer_errors.go dispatches from a lexer errcode to the matching message. Mirrors _Pypegen_tokenizer_error in the C source: each lexer-internal error code maps to one of the panel messages.

parser/string/

String literal escape decoding. Ports Parser/string_parser.c.

  • parse.go. ParseString takes a raw string-literal token (the byte sequence between and including the quotes, with the b/u/r prefix attached) and returns the decoded string. It strips the wrapping (prefix plus quotes, single or triple), identifies the mode (bytes vs str, raw vs cooked), and delegates to the decoder.
  • decode.go. The escape table: \n, \r, \t, \v, \a, \b, \f, \\, \', \", octal up to three digits (\012), \xNN, \uNNNN, \UNNNNNNNN. Bytes mode rejects non-ASCII bytes per the C source's check. Raw mode passes backslashes through unchanged. The no-backslash fast path bypasses the table entirely for the common case where the literal has no escapes.

The escape decoder is a hot path during compile (every string literal in every module hits it), so the fast path matters. The C source has the same fast path with the same shape; we kept it.
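The two-path shape can be sketched like this. A handful of escapes stand in for the full table; decodeEscapes is an invented name, and the real decoder also handles octal, \xNN, \uNNNN, \UNNNNNNNN, and the bytes-mode ASCII check.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeEscapes sketches the decoder's shape: a no-backslash fast path
// that returns the input untouched, and a slow path walking an escape table.
func decodeEscapes(s string) string {
	// Fast path: most string literals contain no backslash at all.
	if !strings.ContainsRune(s, '\\') {
		return s
	}
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] != '\\' || i+1 == len(s) {
			b.WriteByte(s[i])
			continue
		}
		i++
		switch s[i] {
		case 'n':
			b.WriteByte('\n')
		case 't':
			b.WriteByte('\t')
		case '\\':
			b.WriteByte('\\')
		case '\'':
			b.WriteByte('\'')
		case '"':
			b.WriteByte('"')
		default:
			// Unknown escape: Python keeps the backslash.
			b.WriteByte('\\')
			b.WriteByte(s[i])
		}
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", decodeEscapes(`hello\n\tworld`)) // "hello\n\tworld"
	fmt.Printf("%q\n", decodeEscapes("no escapes"))     // fast path: returned as-is
}
```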

Why we built it this way

A few decisions deserve a callout.

Why offsets instead of pointers in the buffer

CPython's lexer buffer is a char * that grows by reallocation as more source comes in (the readline driver in particular adds one line at a time, growing the buffer each call). The C source has to fix up every pointer into the buffer when it grows, which it does through a "buffer rebase" function that adjusts the cur, start, end, inp, and several others.

The Go port uses integer offsets into a []byte. A growing buffer through append may move, but offsets don't care. The rebase code disappears entirely, and the resulting code is shorter and easier to read.
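The offset idea in miniature, with invented names (buffer, appendLine): cursor positions are indices into the slice, so growth via append never invalidates them, where a held pointer would dangle after realloc.

```go
package main

import "fmt"

// buffer sketches the offset-based design. cur and start are indices into
// data; the C original holds raw pointers and must rebase them after realloc.
type buffer struct {
	data       []byte
	cur, start int // offsets survive growth; pointers would not
}

// appendLine grows the buffer one line at a time, readline-driver style.
// append may move the backing array; the offsets don't care.
func (b *buffer) appendLine(line string) {
	b.data = append(b.data, line...)
}

func main() {
	b := &buffer{}
	b.appendLine("x = 1\n")
	b.cur, b.start = 6, 6 // cursor parked at the start of the next line
	b.appendLine("y = 2\n") // backing array may reallocate here
	fmt.Printf("%s", b.data[b.cur:]) // offsets still point at line 2
}
```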

Why a separate fstring.go

The f-string mode is its own state machine with its own transition rules. Putting it next to the regular-mode FSM in lexer.go would have made both harder to follow. CPython's C source keeps both in the same file but separates them by a clear "f-string mode" header comment; we made that separation a file boundary. The mode stack and the brace-depth tracker live in fstring.go; the regular-mode FSM lives in lexer.go. The two files call into each other at the mode transitions.

Why ship the pegen runtime before the parser

The Parser struct, the token buffer, the mark / reset / peek primitives are a small amount of code that the generated parser table will call into. Shipping them now lets the parser table generator (under tools/parser_gen/) target a stable API. It also lets the action helpers be ported and tested independently: SetExprContext walks an existing AST, so we can write tests for it without a parser to feed it inputs.
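That parser-free testability can be shown with a minimal sketch of the SetExprContext walk. The node shapes and field names here are invented for illustration, not gopy's real AST; the recursion pattern is the point.

```go
package main

import "fmt"

// Ctx is the expression context: is this node being read or written?
type Ctx int

const (
	Load Ctx = iota
	Store
)

// Expr is a toy AST node covering the shapes assignment targets can take.
type Expr struct {
	Kind  string // "Name", "Tuple", "List", "Starred", "Attribute", ...
	Ctx   Ctx
	Elts  []*Expr // Tuple/List elements
	Value *Expr   // Starred/Attribute/Subscript inner value
}

// SetExprContext flips a target's context, recursing into the container
// shapes that can appear on the left of `=`.
func SetExprContext(e *Expr, ctx Ctx) {
	e.Ctx = ctx
	switch e.Kind {
	case "Tuple", "List":
		for _, elt := range e.Elts {
			SetExprContext(elt, ctx)
		}
	case "Starred":
		SetExprContext(e.Value, ctx)
		// Attribute and Subscript set only their own ctx: in `a.b = 1`,
		// `a` is still loaded; only the attribute access is a store.
	}
}

func main() {
	// The target tuple of `a, *rest = ...`
	target := &Expr{Kind: "Tuple", Elts: []*Expr{
		{Kind: "Name"},
		{Kind: "Starred", Value: &Expr{Kind: "Name"}},
	}}
	SetExprContext(target, Store)
	fmt.Println(target.Ctx == Store, target.Elts[1].Value.Ctx == Store)
}
```

Because the walk takes a finished tree as input, a table of hand-built trees and expected ctx values exercises it completely with no parser in sight.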

Why byte-for-byte SyntaxError parity

Test suites match against SyntaxError messages. IDEs parse them. The traceback formatter uses them verbatim. If a gopy SyntaxError reads differently from a CPython SyntaxError for the same source, every downstream tool that consumes the text breaks.

We chose the byte-for-byte target because it's the only target that doesn't break consumers. The cost is that we now have to keep messages.go in sync with pegen_errors.c as CPython evolves; we manage that with a panel test that captures CPython output at the pinned 3.14 commit and diffs against ours.

Where it lives

The new packages:

  • parser/lexer/ for the lexer FSM, state, buffer, and the three drivers.
  • parser/pegen/ for the parser runtime: Parser struct, mark / reset / peek, action helpers.
  • parser/errors/ for the SyntaxError text panel and the Raise / RaiseIndent / RaiseTab / RaiseRange constructors.
  • parser/string/ for string literal escape decoding.

The CPython sources we ported from:

  • Parser/lexer/state.[ch] for the state struct.
  • Parser/lexer/buffer.[ch] for the buffer (with the rebase dance dropped in favor of offsets).
  • Parser/lexer/lexer.c for the regular-mode FSM and the f-string mode.
  • Parser/pegen.[ch] for the Parser runtime.
  • Parser/action_helpers.c for the action helpers.
  • Parser/pegen_errors.c for the SyntaxError text panel.
  • Parser/string_parser.c for escape decoding.
  • Grammar/Tokens plus Include/internal/pycore_token.h plus Lib/token.py for the token type table (via the v0.5 tokens_go generator).

Compatibility

  • Go: 1.26 or newer.
  • CPython behavioral target: 3.14.0+.

The gate panel for v0.5.5 pins:

  • A representative source file tokenizes to the same token stream CPython produces (same types, same string values, same line and column numbers).
  • The INDENT / DEDENT stack handles multi-level dedent correctly (one DEDENT per indent level, queued through the pending counter).
  • A nested f-string scenario (f"outer {f'inner'}") tokenizes to the expected FSTRING_START / FSTRING_MIDDLE / FSTRING_END sequence with the right mode pushes and pops.
  • A broken source (unclosed string, mismatched indent, unexpected dedent) produces the same SyntaxError text CPython produces, byte for byte.

Out of scope

A few things wait for v0.6.

  • tools/parser_gen/. The Go-targeted PEG generator. It reads Grammar/python.gram and emits a parser table. The resulting table lands as parser/pegen/parser_gen.go.
  • The full action helper panel. MakeArguments, decorated FunctionDef, JoinedStr and TemplateStr assembly, ConcatenateStrings. The shapes are wired in actions.go; the implementations land alongside the generated parser when the parser starts calling them.
  • parser/errors/invalid_rules.go. The second-pass diagnostic helpers CPython uses to produce better error messages on specific patterns ("did you mean := instead of =", "missing colon after if"). The first-pass diagnostics are in for v0.5.5; the second-pass helpers land with the parser.
  • partest/gate_test.go. An end-to-end test that round-trips the v0.5 disassembly goldens through the parser to prove byte parity from source through to disassembly. The test is wired but skipped until the parser arrives.

What's next

v0.6 is the big one. It brings:

  • The generated parser table. Python source becomes an AST.
  • The VM. AST plus compile pipeline becomes executable bytecode.
  • The frame stack, the evaluator loop, real opcode dispatch.
  • The integration: source goes in one end, a Python program runs out the other.

The pieces we shipped today (the lexer, the pegen runtime, the SyntaxError panel) are the inputs to v0.6's parser. The pieces that landed in v0.5 (the compile pipeline) are the consumers of the AST the parser produces. v0.6 wires them together.

Three releases from the start of the project, gopy will run Python.