v0.5.5 - The lexer and the parser handover
Released May 5, 2026.
The Python lexer is one of those subsystems that everybody thinks is simple until they have to write one. Whitespace is significant. Indentation is significant in a way that requires a counter, not a regex. A line continuation is a backslash followed by exactly the right kind of newline. F-strings (PEP 701) and t-strings (PEP 750) contain arbitrary expressions that the lexer has to recognize as a sub-language, push a mode onto a stack, and pop the mode when the closing brace arrives. Triple-quoted strings span lines but single-quoted ones don't. Comments end at the newline unless they end at the EOF. Type comments look like regular comments unless the caller asked for them, in which case they're tokens.
CPython's lexer (Parser/lexer/lexer.c plus the buffer and state
files alongside it) has accumulated this complexity over thirty
years. Every edge case is in there. You can't write a faithful
Python lexer by reading the language reference; you have to read
the C source and port it.
v0.5.5 ports that lexer. After this release, gopy can take Python source bytes and produce a token stream that matches CPython's, token for token, position for position, including the PEP 701 f-string mode stack and the PEP 750 t-string variant. The SyntaxError messages a lexer error produces match CPython byte for byte.
The pegen runtime ships alongside: the token buffer, the mark /
reset / peek primitives, the Expect helpers the generated parser
table will lean on. The generated parser table itself is not in
this drop, because it depends on a Go-targeted PEG generator
that's still in flight under tools/parser_gen/. The generator
plus the table land in v0.6.
What this means: after v0.5.5, the lexer side of the parse stage is done. The parser is one release away.
Highlights
Three themes run through this release.
A faithful Python lexer
The lexer is structured the way CPython's is: a state struct
(Parser/lexer/state.h), a buffer struct
(Parser/lexer/buffer.h), and a giant FSM (Parser/lexer/lexer.c)
that walks the buffer one rune at a time and emits tokens.
import "gopy/parser/lexer"
src := []byte(`def f(x):
return x + 1
`)
lex, _ := lexer.NewFromString(src)
for {
tok, err := lex.Next()
if err != nil { /* SyntaxError */ }
if tok.Type == token.ENDMARKER { break }
fmt.Println(tok)
}
// NAME 'def', NAME 'f', LPAR '(', NAME 'x', RPAR ')', COLON ':', NEWLINE,
// INDENT, NAME 'return', NAME 'x', OP '+', NUMBER '1', NEWLINE, DEDENT,
// ENDMARKER
The FSM handles INDENT and DEDENT through a pendin counter, the way the C source does (the name comes straight from CPython's tok_state field; you can't emit two tokens at once, so multi-level dedents queue up). Line continuations (backslash newline) re-enter the scan at the right offset. Paren balance tracks across lines, so a ( opened on line 1 lets line 2 ignore its leading whitespace.
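A minimal sketch of that queueing, with illustrative names rather than gopy's exported API:

package main

import "fmt"

type indentState struct {
    indents []int // stack of indentation widths; starts at [0]
    pendin  int   // >0: INDENTs owed; <0: DEDENTs owed
}

// lineIndent is called once per logical line with the width of its
// leading whitespace; a change in depth becomes queued tokens.
func (s *indentState) lineIndent(col int) {
    top := s.indents[len(s.indents)-1]
    switch {
    case col > top:
        s.indents = append(s.indents, col)
        s.pendin++ // one INDENT owed
    case col < top:
        for len(s.indents) > 1 && col < s.indents[len(s.indents)-1] {
            s.indents = s.indents[:len(s.indents)-1]
            s.pendin-- // one DEDENT owed per popped level
        }
    }
}

// nextPending drains one queued token per call, because the lexer can
// only hand back a single token at a time.
func (s *indentState) nextPending() (string, bool) {
    switch {
    case s.pendin > 0:
        s.pendin--
        return "INDENT", true
    case s.pendin < 0:
        s.pendin++
        return "DEDENT", true
    }
    return "", false
}

func main() {
    s := &indentState{indents: []int{0}}
    for _, col := range []int{0, 4, 8, 0} { // the last line dedents two levels
        s.lineIndent(col)
        for tok, ok := s.nextPending(); ok; tok, ok = s.nextPending() {
            fmt.Println(tok)
        }
    }
    // INDENT, INDENT, DEDENT, DEDENT
}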
The buffer is offset-based, not pointer-based. CPython's C
source has a buffer-rebase dance where a reallocating
tok->buf requires fixing up every pointer that was held into
it. We sidestepped that by using integer offsets instead of raw
pointers, which is one of the few places the Go port reads
shorter than the C original.
PEP 701 f-strings, PEP 750 t-strings
F-strings used to be a parser-level rewrite. PEP 701, which landed in Python 3.12, promoted them to first-class syntax with a proper grammar. The lexer now emits FSTRING_START, FSTRING_MIDDLE, and FSTRING_END tokens, plus the expression tokens between them.
name = "world"
greeting = f"hello, {name.upper()!r:>20}"
The lexer for this is its own mode. When the regular-mode scanner
sees an f-prefix on a string literal, it pushes f-string mode
onto a stack and emits FSTRING_START. Inside f-string mode, the
scanner reads literal text until it hits {, at which point it
emits FSTRING_MIDDLE for the literal so far, switches back to
regular mode for the expression, and tracks brace depth. When
the brace depth returns to zero on a }, it pops back into
f-string mode and continues. When the closing quote arrives, it
emits FSTRING_END and pops the mode.
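The transitions, sketched in Go; the mode names and handlers here are illustrative stand-ins for fstring.go, not its real API:

package main

import "fmt"

type mode int

const (
    modeRegular mode = iota
    modeFString
)

type lexState struct {
    modes      []mode // mode stack; the top entry is the active scanner
    braceDepth []int  // one brace-depth counter per open f-string
}

func (s *lexState) push(m mode) { s.modes = append(s.modes, m) }
func (s *lexState) pop()        { s.modes = s.modes[:len(s.modes)-1] }

// onFStringStart: the regular-mode scanner saw an f-prefixed quote.
// FSTRING_START goes out and f-string mode takes over.
func (s *lexState) onFStringStart() {
    s.push(modeFString)
    s.braceDepth = append(s.braceDepth, 0)
}

// onOpenBrace: '{' seen. The first one hands the expression to the
// regular-mode FSM; deeper ones just raise the depth.
func (s *lexState) onOpenBrace() {
    last := len(s.braceDepth) - 1
    if s.braceDepth[last] == 0 {
        s.push(modeRegular)
    }
    s.braceDepth[last]++
}

// onCloseBrace: '}' seen. When the innermost f-string's depth returns
// to zero, pop back into f-string mode.
func (s *lexState) onCloseBrace() {
    last := len(s.braceDepth) - 1
    s.braceDepth[last]--
    if s.braceDepth[last] == 0 {
        s.pop()
    }
}

// onClosingQuote: FSTRING_END goes out and the f-string mode pops.
func (s *lexState) onClosingQuote() {
    s.braceDepth = s.braceDepth[:len(s.braceDepth)-1]
    s.pop()
}

func main() {
    s := &lexState{}
    s.onFStringStart() // f"  pushes f-string mode
    s.onOpenBrace()    // {   pushes regular mode for the expression
    s.onCloseBrace()   // }   pops back into f-string mode
    s.onClosingQuote() // "   pops f-string mode
    fmt.Println(len(s.modes)) // 0: every push was matched by a pop
}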
The mode stack supports nesting: an f-string inside an f-string inside an f-string works, the way CPython supports it post-3.12.
f"outer {f'inner {f"deep"}'}"
T-strings (PEP 750, new in 3.14) are template literals: the
lexer treats them the same way as f-strings but emits the
template-variant tokens, and the parser builds a TemplateStr
AST node instead of JoinedStr. The lexer-level work is
identical aside from the token types.
Byte-faithful SyntaxError messages
When Python tells you IndentationError: unexpected indent, the
exact wording matters. Test suites match against it. IDEs parse
it. Tooling depends on it. CPython has a giant pegen_errors.c
that emits these messages with care, including the file path,
the line number, the column range, and a caret pointer at the
failing column.
We ported every message verbatim.
File "<stdin>", line 1
def f(x):
return x
^^^^^^
IndentationError: unexpected indent
The wording, the line wrap, the caret position. Byte for byte the same as CPython 3.14.
The error builder publishes four constructors:
- Raise for a one-position SyntaxError.
- RaiseIndent for indentation errors.
- RaiseTab for tab/space mixing errors.
- RaiseRange for an error that spans a range of columns (used by f-string scope errors that point at the whole expression).
Each carries a Kind enum so the caller can lift the record into
the matching exception class: SyntaxError, IndentationError,
TabError, or OverflowError. The OverflowError variant covers the
case where a number literal is too big to parse, which CPython
delivers as a SyntaxError-shaped diagnostic but a different
exception class.
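A sketch of that lift; the Record shape and its field names are assumptions for illustration, not the builder's actual types:

package main

import "fmt"

type Kind int

const (
    KindSyntax Kind = iota
    KindIndentation
    KindTab
    KindOverflow // number literal too big to parse
)

// Record is a stand-in for the error record the builder produces.
type Record struct {
    Kind             Kind
    Msg, Filename    string
    Line             int
    ColStart, ColEnd int
}

// exceptionClass lifts the record's kind into the Python exception
// class the caller should raise.
func (r *Record) exceptionClass() string {
    switch r.Kind {
    case KindIndentation:
        return "IndentationError"
    case KindTab:
        return "TabError"
    case KindOverflow:
        return "OverflowError"
    default:
        return "SyntaxError"
    }
}

func main() {
    r := &Record{Kind: KindIndentation, Msg: "unexpected indent", Line: 2}
    fmt.Printf("%s: %s\n", r.exceptionClass(), r.Msg)
}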
What's new
The full package breakdown.
parser/lexer/
The full Python lexer. Ports the Parser/lexer/ family from
CPython.
- state.go from Parser/lexer/state.[ch]. The lexer state struct: cursor offset, line/column counters, paren-balance stack, indent stack, pendin counter, mode stack, error slot.
- buffer.go from Parser/lexer/buffer.[ch]. The source buffer. We use offsets instead of raw pointers, so the C source's buffer-rebase dance (the part that fixes up dangling pointers when the buffer grows) disappears in the Go port.
- lexer.go from Parser/lexer/lexer.c. The regular-mode FSM. NAME / NUMBER / STRING / OP scanning. INDENT / DEDENT emission via the pendin counter. Line continuation. Paren balance tracking (a ( opens, a ) closes; opens across lines swallow the newline). COMMENT / NL / type-comment emission under the extra-tokens flag.
- token.go for the Token struct and the per-kind constants. The constants are generated against CPython's Grammar/Tokens in v0.5 (the tokenize skeleton); this release wires them into real Token instances.
- fstring.go from Parser/lexer/lexer.c's tok_get_fstring_mode. PEP 701 f-string mode plus the PEP 750 t-string variant. Prefix detection (any of f, F, rf, Rf, rF, RF, fr, Fr, fR, FR, plus the t-equivalents). Mode stack push. FSTRING_START emission. The literal-text reader that walks until { and emits FSTRING_MIDDLE. The brace-depth tracker that re-enters f-string mode when } closes the expression. FSTRING_END emission and mode stack pop on closing quote.
The lexer ships with three drivers, mirroring CPython's three tokenizer entry points:
- driver_string.go. Tokenize from a byte slice. BOM stripping on the first line lives here. Used by the in-memory test path.
- driver_file.go. Tokenize from an io.Reader. Used by the file-source entry point.
- driver_readline.go. Tokenize through a readline-style callback. Used by the REPL where we don't have the full source upfront. The callback returns one line per call; the lexer asks for the next line whenever its buffer runs out (a fill loop in this shape is sketched after this list).
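A sketch of that handoff; ReadlineFunc and fill are illustrative names, not the driver's exported API:

package main

import (
    "bufio"
    "fmt"
    "io"
    "strings"
)

// ReadlineFunc returns the next source line (newline included), or
// io.EOF once the input is exhausted.
type ReadlineFunc func() (string, error)

// fill appends lines until the buffer holds at least n unread bytes,
// mirroring the "ask for the next line when the buffer runs out" shape.
func fill(buf *[]byte, off, n int, readline ReadlineFunc) error {
    for len(*buf)-off < n {
        line, err := readline()
        *buf = append(*buf, line...)
        if err != nil {
            return err // io.EOF ends the stream; ENDMARKER follows
        }
    }
    return nil
}

func main() {
    r := bufio.NewReader(strings.NewReader("x = 1\ny = 2\n"))
    readline := func() (string, error) { return r.ReadString('\n') }

    var buf []byte
    for fill(&buf, len(buf), 1, readline) == nil {
    }
    fmt.Printf("%q\n", buf) // "x = 1\ny = 2\n"
}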
parser/pegen/
The pegen runtime. Ports Parser/pegen.[ch] and the action helper
file.
- parser.go from Parser/pegen.[ch]. The Parser struct, the fillToken routine that pulls tokens from the lexer into the parser's token array, the mark / reset / peek primitives the generated parser table will use to do its packrat caching (sketched after this list). Expect and ExpectName as the consume helpers (consume a specific token type / a specific NAME).
- actions.go from Parser/action_helpers.c. The action helpers the generated parser calls during reduction. Sequence-shape operations (singleton, insert-in-front, append, flatten, first-item, last-item), the dotted-name joiners (for building Attribute(Attribute(Name("a"), "b"), "c") from a.b.c), SetExprContext that walks the Name / Tuple / List / Starred / Attribute / Subscript family to set their ctx field to Store / Load / Del, and the GetExprName phrase table behind "cannot assign to %s" (the thing that tells you you can't assign to a function call).
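A sketch of how a generated alternative leans on those primitives; the rule and the struct here are illustrative, not the pegen runtime's real types:

package main

import "fmt"

type tok struct {
    typ, val string
}

type parser struct {
    toks []tok
    pos  int
}

func (p *parser) mark() int   { return p.pos }
func (p *parser) reset(m int) { p.pos = m }
func (p *parser) peek() tok   { return p.toks[p.pos] }

// expect consumes one token of the given type, or fails without
// consuming anything.
func (p *parser) expect(typ string) (tok, bool) {
    if t := p.peek(); t.typ == typ {
        p.pos++
        return t, true
    }
    return tok{}, false
}

// atom: NAME | NUMBER. Each alternative starts from the same mark and
// resets on failure, PEG-style. A packrat cache would memoize the
// (rule, mark) result; that part is elided here.
func (p *parser) atom() (tok, bool) {
    m := p.mark()
    if t, ok := p.expect("NAME"); ok {
        return t, true
    }
    p.reset(m)
    if t, ok := p.expect("NUMBER"); ok {
        return t, true
    }
    p.reset(m)
    return tok{}, false
}

func main() {
    p := &parser{toks: []tok{{"NUMBER", "1"}, {"ENDMARKER", ""}}}
    t, ok := p.atom()
    fmt.Println(t.val, ok) // 1 true
}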
The AST builder helpers (MakeArguments for function signatures
with positional-only / keyword-only / defaults markers, the
decorated FunctionDef constructor, JoinedStr assembly from
the f-string token stream) ship alongside the generated parser
in v0.6. The shape is wired but the helpers are stubs until the
parser table calls them.
parser/errors/
The SyntaxError surface.
- messages.go transcribes the entire SyntaxError text panel from Parser/pegen_errors.c. Goal: byte-for-byte parity with CPython 3.14. A SyntaxError that gopy emits should be indistinguishable from the one CPython would emit for the same source. We pinned this with a panel test that runs CPython against a corpus of broken inputs, captures the output, and diffs against gopy's.
- builder.go exposes the four constructors the parser reaches for: Raise, RaiseIndent, RaiseTab, RaiseRange. Each builds a SyntaxError record with the right kind, message, filename, line, column range, and source line for the caret.
- tokenizer_errors.go dispatches from a lexer errcode to the matching message. Mirrors _PyPegen_tokenizer_error in the C source: each lexer-internal error code maps to one of the panel messages.
parser/string/
String literal escape decoding. Ports Parser/string_parser.c.
- parse.go. ParseString takes a raw string-literal token (the byte sequence between and including the quotes, with the b/u/r prefix attached) and returns the decoded string. It strips the wrapping (prefix plus quotes, single or triple), identifies the mode (bytes vs str, raw vs cooked), and delegates to the decoder.
- decode.go. The escape table: \n, \r, \t, \v, \a, \b, \f, \\, \', \", octal up to three digits (\012), \xNN, \uNNNN, \UNNNNNNNN. Bytes mode rejects non-ASCII bytes per the C source's check. Raw mode passes backslashes through unchanged. The no-backslash fast path bypasses the table entirely for the common case where the literal has no escapes.
The escape decoder is a hot path during compile (every string literal in every module hits it), so the fast path matters. The C source has the same fast path with the same shape; we kept it.
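A sketch of that shape; the function below is illustrative, not decode.go's actual signature:

package main

import (
    "bytes"
    "fmt"
)

func decodeEscapes(body []byte, raw bool) []byte {
    // Raw mode and escape-free literals skip the table entirely.
    if raw || bytes.IndexByte(body, '\\') < 0 {
        return body
    }
    out := make([]byte, 0, len(body))
    for i := 0; i < len(body); i++ {
        if body[i] != '\\' || i+1 == len(body) {
            out = append(out, body[i])
            continue
        }
        i++
        switch body[i] {
        case 'n':
            out = append(out, '\n')
        case 't':
            out = append(out, '\t')
        case '\\', '\'', '"':
            out = append(out, body[i])
        default:
            // Full table (\r, \v, \a, \b, \f, octal, \xNN, \uNNNN,
            // \UNNNNNNNN) elided in this sketch.
            out = append(out, '\\', body[i])
        }
    }
    return out
}

func main() {
    fmt.Printf("%q\n", decodeEscapes([]byte(`hello`), false)) // fast path: returned as-is
    fmt.Printf("%q\n", decodeEscapes([]byte(`a\nb`), false))  // cooked: "a\nb"
    fmt.Printf("%q\n", decodeEscapes([]byte(`a\nb`), true))   // raw: unchanged
}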
Why we built it this way
A few decisions deserve a callout.
Why offsets instead of pointers in the buffer
CPython's lexer buffer is a char * that grows by reallocation
as more source comes in (the readline driver in particular adds
one line at a time, growing the buffer each call). The C source
has to fix up every pointer into the buffer when it grows, which
it does through a "buffer rebase" function that adjusts cur,
start, end, inp, and several other pointers into the buffer.
The Go port uses integer offsets into a []byte. A growing
buffer through append may move, but offsets don't care. The
rebase code disappears entirely, and the resulting code is
shorter and easier to read.
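A sketch of the difference, with illustrative field names:

package main

import "fmt"

type buffer struct {
    src   []byte
    start int // start offset of the token being scanned
    cur   int // scan position: an offset, never a pointer
}

// grow appends another chunk of source. append may move the backing
// array, but start and cur stay valid with no rebase step.
func (b *buffer) grow(chunk []byte) {
    b.src = append(b.src, chunk...)
}

func main() {
    b := &buffer{src: []byte("x = ")}
    b.start, b.cur = 0, 4
    b.grow([]byte("1\n")) // reallocation is invisible to the offsets
    fmt.Printf("%q\n", b.src[b.cur:]) // "1\n"
}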
Why a separate fstring.go
The f-string mode is its own state machine with its own
transition rules. Putting it next to the regular-mode FSM in
lexer.go would have made both harder to follow. CPython's C
source keeps both in the same file but separates them by a clear
"f-string mode" header comment; we made that separation a file
boundary. The mode stack and the brace-depth tracker live in
fstring.go; the regular-mode FSM lives in lexer.go. The two
files call into each other at the mode transitions.
Why ship the pegen runtime before the parser
The Parser struct, the token buffer, the mark / reset / peek
primitives are a small amount of code that the generated parser
table will call into. Shipping them now lets the parser table
generator (under tools/parser_gen/) target a stable API. It
also lets the action helpers be ported and tested independently:
SetExprContext walks an existing AST, so we can write tests
for it without a parser to feed it inputs.
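A parser-free test target can be sketched like this; the node shapes are illustrative, not gopy's AST types:

package main

import "fmt"

type Ctx int

const (
    Load Ctx = iota
    Store
    Del
)

// Expr is a stand-in for the assignable-node family.
type Expr interface{ setCtx(Ctx) }

type Name struct {
    Id  string
    Ctx Ctx
}

type Tuple struct {
    Elts []Expr
    Ctx  Ctx
}

func (n *Name) setCtx(c Ctx) { n.Ctx = c }

func (t *Tuple) setCtx(c Ctx) {
    t.Ctx = c
    for _, e := range t.Elts {
        e.setCtx(c) // recurse into assignable children
    }
}

// SetExprContext mirrors the helper's entry point: flip an existing
// subtree to Store / Load / Del.
func SetExprContext(e Expr, c Ctx) { e.setCtx(c) }

func main() {
    // (a, b) = ...: the target tuple gets Store all the way down.
    target := &Tuple{Elts: []Expr{&Name{Id: "a"}, &Name{Id: "b"}}}
    SetExprContext(target, Store)
    fmt.Println(target.Ctx, target.Elts[0].(*Name).Ctx) // 1 1
}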
Why byte-for-byte SyntaxError parity
Test suites match against SyntaxError messages. IDEs parse them.
The traceback formatter uses them verbatim. If a gopy
SyntaxError reads differently from a CPython SyntaxError for the
same source, every downstream tool that consumes the text breaks.
We chose the byte-for-byte target because it's the only target
that doesn't break consumers. The cost is that we now have to
keep messages.go in sync with pegen_errors.c as CPython
evolves; we manage that with a panel test that captures CPython
output at the pinned 3.14 commit and diffs against ours.
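The shape of that panel test, sketched with an assumed corpus, an assumed python3.14 binary on PATH, and a hypothetical renderSyntaxError entry point:

package errors

import (
    "os/exec"
    "testing"
)

// renderSyntaxError stands in for gopy's panel rendering; hypothetical.
func renderSyntaxError(src string) string { return "" }

func TestSyntaxErrorParity(t *testing.T) {
    corpus := map[string]string{
        "unclosed_string": "s = 'abc\n",
        "bad_dedent":      "if x:\n    pass\n  pass\n",
    }
    for name, src := range corpus {
        t.Run(name, func(t *testing.T) {
            // The pinned CPython prints the SyntaxError panel on stderr.
            out, _ := exec.Command("python3.14", "-c", src).CombinedOutput()
            if got := renderSyntaxError(src); got != string(out) {
                t.Errorf("panel mismatch:\nCPython: %q\ngopy:    %q", out, got)
            }
        })
    }
}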
Where it lives
The new packages:
- parser/lexer/ for the lexer FSM, state, buffer, and the three drivers.
- parser/pegen/ for the parser runtime: Parser struct, mark / reset / peek, action helpers.
- parser/errors/ for the SyntaxError text panel and the Raise / RaiseIndent / RaiseTab / RaiseRange constructors.
- parser/string/ for string literal escape decoding.
The CPython sources we ported from:
- Parser/lexer/state.[ch] for the state struct.
- Parser/lexer/buffer.[ch] for the buffer (with the rebase dance dropped in favor of offsets).
- Parser/lexer/lexer.c for the regular-mode FSM and the f-string mode.
- Parser/pegen.[ch] for the Parser runtime.
- Parser/action_helpers.c for the action helpers.
- Parser/pegen_errors.c for the SyntaxError text panel.
- Parser/string_parser.c for escape decoding.
- Grammar/Tokens plus Include/internal/pycore_token.h plus Lib/token.py for the token type table (via the v0.5 tokens_go generator).
Compatibility
- Go: 1.26 or newer.
- CPython behavioral target: 3.14.0+.
The gate panel for v0.5.5 pins:
- A representative source file tokenizes to the same token stream CPython produces (same types, same string values, same line and column numbers).
- The INDENT / DEDENT stack handles multi-level dedent correctly (one DEDENT per indent level, queued through the pendin counter).
- A nested f-string scenario (f"outer {f'inner'}") tokenizes to the expected FSTRING_START / FSTRING_MIDDLE / FSTRING_END sequence with the right mode pushes and pops.
- A broken source (unclosed string, mismatched indent, unexpected dedent) produces the same SyntaxError text CPython produces, byte for byte.
Out of scope
A few things wait for v0.6.
- tools/parser_gen/. The Go-targeted PEG generator. It reads Grammar/python.gram and emits a parser table. The resulting table lands as parser/pegen/parser_gen.go.
- The full action helper panel: MakeArguments, decorated FunctionDef, JoinedStr and TemplateStr assembly, ConcatenateStrings. The shapes are wired in actions.go; the implementations land alongside the generated parser when the parser starts calling them.
- parser/errors/invalid_rules.go. The second-pass diagnostic helpers CPython uses to produce better error messages on specific patterns ("did you mean := instead of =", "missing colon after if"). The first-pass diagnostics are in for v0.5.5; the second-pass helpers land with the parser.
- partest/gate_test.go. An end-to-end test that round-trips the v0.5 disassembly goldens through the parser to prove byte parity from source through to disassembly. The test is wired but skipped until the parser arrives.
What's next
v0.6 is the big one. It brings:
- The generated parser table. Python source becomes an AST.
- The VM. AST plus compile pipeline becomes executable bytecode.
- The frame stack, the evaluator loop, real opcode dispatch.
- The integration: source goes in one end, a Python program runs out the other.
The pieces we shipped today (the lexer, the pegen runtime, the SyntaxError panel) are the inputs to v0.6's parser. The pieces that landed in v0.5 (the compile pipeline) are the consumers of the AST the parser produces. v0.6 wires them together.
Three releases from the start of the project, gopy will run Python.