v0.10.2 - The parser drop
Released May 7, 2026.
A Python parser is a strange thing to write. The CPython parser is
generated. There's a grammar file at Parser/Python.gram, a
generator at Tools/peg_generator/, and a result at
Parser/parser.c that's roughly twenty thousand lines of
machine-emitted code. Nobody writes Python's parser by hand. The
generator does it.
We've followed CPython's approach. Our generator lives at
tools/parser_gen/, our grammar is the same Python.gram file
CPython ships, and our generated output is parser/pegen/parser_gen.go.
The benefit is enormous: when CPython 3.14.1 lands a grammar
patch, we apply it to the grammar and regenerate. The cost is
strange too: a bug in the generator looks like a bug in 268
different generated functions all at once.
This release is the one where the generator finally does what it
needs to do. v0.10.2 is the first release that parses the full
Lib/test/test_grammar.py panel end to end. It's the first that
agrees with CPython 3.14 on ast.dump byte-for-byte across a
seeded fixture set. The corpus number that had been stuck at 10
out of 720 typed test files (1.4%) now reads 720 out of 720
(100%).
Three things had been holding the corpus back. The lexer was
emitting NEWLINE tokens for blank and comment-only lines instead
of swallowing them, which broke statement boundaries in any file
that contained a comment between two statements. The generated
parser confused "this alt's action returned nil" with "this alt
failed", which broke any rule whose action returned a Python None
(an empty arglist, an absent return type annotation, a missing
default). Integer literals overflowed Go's int64, which broke any
file that contained a literal like 0xffffffffffffffff. Each had
a different fix and shipped in this release.
After those three landed, the corpus moved from 1.4% typed to 78.2%. Closing the long tail of action-helper shapes (f-strings, comprehensions, augmented operators, future imports, match / lambda / with / type-alias) walked it the rest of the way to ok=720 / sentinel=0 / fail=0. The remaining 21.8% wasn't gated by a single design issue; it was a long list of helper functions the generator emitted with wrong argument indexing.
Highlights
Three pieces of work define this release.
The lexer's blank-line handling
The CPython lexer has a state called blankline that tracks
whether the current line has produced any non-whitespace,
non-comment content. If a line ends with no content, the lexer
loops back to the top of the dispatch function ("goto nextline"
at Parser/lexer/lexer.c:805) rather than emitting a NEWLINE
token. That's why a file like this:
```python
def f():
    return 1

# this comment
def g():
    return 2
```
parses correctly. The blank line and the comment-only line both elide their NEWLINEs, so the two function definitions sit at the same indentation level without a phantom NEWLINE between them.
Our v0.10.1 lexer had the opposite default. Every newline emitted
a NEWLINE token. The parser then saw def f(): return 1,
NEWLINE, NEWLINE (from the blank line), NEWLINE (from the
comment-only line), def g(): return 2, and treated each NEWLINE
as a statement terminator. The grammar's statements rule
expected one NEWLINE between statements at the same level; three
NEWLINEs in a row pinned a SyntaxError.
We ported the goto nextline flow from
Parser/lexer/lexer.c:805. Blank lines, comment-only lines, and
newlines inside parens loop back to the top of the dispatch under
the parser's tok_extra_tokens = 0 mode. Under the tokenize-module
mode (tok_extra_tokens = 1), the same paths emit NL and
COMMENT tokens, matching Lib/tokenize.py's shape: the
tokenize module needs to see the blank lines and comments to
round-trip the source, but the parser needs to skip them.
Tracking blankline correctly across the indent / nextline /
newline branches is the load-bearing piece. A blank line can be
seen across two separate calls into tokGetNormalMode: the
indentation prefix on one call, the actual \n on the next.
v0.10.1 lost the blankline flag between the two calls, so the
second call emitted a NEWLINE even though the first call had
established the line as blank. The new code carries the flag in
the lexer state instead of stack-local memory.
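The mechanism fits in a few lines. The sketch below is a minimal model with invented names (`lexState`, `feed`), not the real lexer code; it tracks the inverse of CPython's blankline flag in a state struct, which is why a `\n` arriving on a later call than its line's indentation is still handled correctly.

```go
package main

import (
	"fmt"
	"strings"
)

// lexState is a toy model (our names, not the real lexer's) of the
// blank-line bookkeeping. hasContent is the inverse of CPython's
// blankline flag; keeping it in the state struct means it survives
// across separate calls into the tokenizer.
type lexState struct {
	hasContent bool // true once the current line produced a real token
}

// feed consumes one lexeme and reports whether a NEWLINE token should
// be emitted for it.
func (s *lexState) feed(lexeme string) bool {
	switch {
	case lexeme == "\n":
		emit := s.hasContent
		s.hasContent = false // the next line starts blank
		return emit
	case strings.HasPrefix(lexeme, "#"):
		return false // a comment never makes a line non-blank
	case strings.TrimSpace(lexeme) == "":
		return false // pure indentation / whitespace
	default:
		s.hasContent = true
		return false
	}
}

func main() {
	s := &lexState{}
	// Only the two content-bearing lines get a NEWLINE; the blank line
	// and the comment-only line are swallowed.
	for _, lx := range []string{
		"def f(): return 1", "\n", // content line: NEWLINE
		"\n",        // blank line: swallowed
		"# c", "\n", // comment-only line: swallowed
		"def g(): return 2", "\n", // content line: NEWLINE
	} {
		fmt.Println(s.feed(lx))
	}
}
```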
The comment branch under tokenize-module mode now backs up the
trailing \n (tok_backup at Parser/lexer/lexer.c:721) so
the newline emits as NL on the next call rather than getting
absorbed into the COMMENT token. That matches CPython exactly
and unblocks the tokenize module's round-trip behavior.
The generator's matched-vs-failed split
PEG parsers have a subtle distinction between "this alt failed" and "this alt matched and returned a value that happens to be falsy". Consider a grammar rule like:
```
arguments: args=positional? -> Call(args=args)
```
If the input has no positional arguments, positional? matches
the empty sequence and binds args to nil. The alt matched.
But our generator was emitting:

```go
if args := positional(); args != nil {
    return CallNode{Args: args}
}
return nil // alt failed
```
That treats args != nil as the success condition, which means a
matching-but-empty args looks identical to a failed match.
Result: every empty arglist in the corpus rolled back to
ErrParserNotImplemented.
The fix is a sentinel value. We added placeholderMatched and
the matchedOr helper to parser/pegen/action_helpers_gen.go:
```go
// matchedOr wraps a possibly-nil result. nil becomes placeholderMatched
// so the calling alt sees "matched". The truthy() helper collapses
// placeholderMatched back to false in boolean tests.
func matchedOr(v interface{}) interface{} {
	if v == nil {
		return placeholderMatched
	}
	return v
}
```
The generator now emits matchedOr(...) around every default-action
and bare-bound-name return. The regenerated
parser/pegen/parser_gen.go carries 306 call sites. Each one
lifts the "matched but empty" case out of the "actually failed"
case.
Without this bridge, class B2(): (empty parens), def f() -> None:
(missing return type), and dozens of other shapes were rolling
back. With it, they parse cleanly.
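The other half of the bridge is the sentinel itself and the `truthy()` collapse. A self-contained sketch of the mechanism (the real definitions live in parser/pegen/action_helpers_gen.go and may differ in detail):

```go
package main

import "fmt"

// sentinel gives placeholderMatched a unique address, so no real parse
// result can ever compare equal to it.
type sentinel struct{}

var placeholderMatched = &sentinel{}

// matchedOr lifts "matched but produced nil" out of "failed" (nil).
func matchedOr(v interface{}) interface{} {
	if v == nil {
		return placeholderMatched
	}
	return v
}

// truthy collapses the sentinel back to false so it never survives
// into a surrounding boolean test.
func truthy(v interface{}) bool {
	return v != nil && v != placeholderMatched
}

func main() {
	empty := matchedOr(nil)    // an alt that matched the empty sequence
	fmt.Println(empty != nil)  // true: the alt counts as matched
	fmt.Println(truthy(empty)) // false: the bound value is still falsy
}
```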
Big-int literal support
Go's int64 tops out at 9223372036854775807. CPython's integers
are arbitrary precision. Numeric literals like
0xffffffffffffffff (18446744073709551615) overflow Go's
strconv.ParseInt but parse cleanly in CPython through
PyLong_FromString, which builds a PyLongObject of whatever
size the literal needs.
The fix is a fallback. parser/pegen/action_helpers_gen.go
now routes through *math/big.Int when strconv.ParseInt
overflows. Literals like 0xffffffffffffffff and
0o1000000000000000000000 (octal for 2^63, one past int64's
maximum) now parse cleanly instead of pinning a SyntaxError.
We chose math/big.Int rather than rolling our own arbitrary
precision because (a) the Go standard library implementation is
already well-tested, (b) the CPython PyLong representation is
deliberately not Go-interoperable so we'd need a conversion layer
either way, and (c) the parser-side cost is exactly one fallback
on overflow, which is much cheaper than a custom int type.
```python
# Literals that broke v0.10.1 and parse in v0.10.2:
mask = 0xffffffffffffffff
big = 0o1000000000000000000000
shift = 1 << 100
formatted = 1_000_000  # PEP 515 underscored literal
```
The PEP 515 underscored literal is in the same patch: the
numberToken function now defaults the parser's
featureVersion to build.PythonMinorVersion rather than zero,
so the featureVersion < 6 guard that was rejecting underscores
no longer fires.
What's new
The full feature breakdown, grouped by package.
parser/lexer/
The lexer changes that unblocked the blank-line handling, plus two related fixes.
- lexer.go. Port the goto nextline flow from Parser/lexer/lexer.c:805. Blank lines, comment-only lines, and newlines inside parens loop back to the top of the dispatch function under the parser's tok_extra_tokens = 0 mode rather than emitting NEWLINE. Under tokenize-module mode (tok_extra_tokens = 1) the same paths emit NL and COMMENT, matching Lib/tokenize.py.
- state.go. Track blankline across the indent / nextline / newline branches so a blank or comment-only line is recognised even when the indentation prefix and the actual \n are seen on separate calls into tokGetNormalMode.
- The comment branch under tokenize-module mode now backs up the trailing \n (tok_backup at Parser/lexer/lexer.c:721) so the newline emits as NL on the next call, matching CPython.
- The . branch in scanOperator no longer eagerly consumes .. as part of an ELLIPSIS attempt. Two dots emit two DOT tokens and three dots emit ELLIPSIS, matching Parser/lexer/lexer.c:832. Without the fix, from ..pool import ThreadPool lost its first dot, which broke relative imports in any package nested more than one level deep.
- Inside f-strings, every brace opener bumps the curly-bracket depth so the lexer's expression / format-spec state machine pops back out at the right }. Mirrors Parser/lexer/lexer.c f-string mode. Without this, f-strings containing dict or set displays (f"{ {1, 2} }") closed early.
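The dot-scanning rule reduces to a small greedy grouping. A toy model (dotTokens is our invention, not the lexer's code) of how a run of dots splits into tokens:

```go
package main

import "fmt"

// dotTokens models the greedy dot scan: each run of three consecutive
// dots becomes ELLIPSIS, and any leftover dots emit as single DOT
// tokens. Toy model of the behaviour, not the lexer's actual code.
func dotTokens(run int) []string {
	var toks []string
	for ; run >= 3; run -= 3 {
		toks = append(toks, "ELLIPSIS")
	}
	for ; run > 0; run-- {
		toks = append(toks, "DOT")
	}
	return toks
}

func main() {
	fmt.Println(dotTokens(2)) // [DOT DOT]   -- from ..pool keeps both dots
	fmt.Println(dotTokens(3)) // [ELLIPSIS]
}
```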
parser/
Top-level parser changes.
- parser.go. runParse now surfaces the pinned SyntaxError whenever a rule reaches ErrParserNotImplemented. CPython's _PyPegen_run_parser (Parser/pegen.c:1136) prefers the deepest pinned error over a generic miss; the gopy entry point now does the same instead of swallowing real syntax errors as ErrParserNotImplemented. Before this fix, a real syntax error in user code surfaced as a generic "parser not implemented" message rather than the line / column / reason CPython would have produced.
- grammar_panel_test.go. New gate (TestParseTestGrammar) parses the full Lib/test/test_grammar.py end to end. This is the v0.10.2 HARD goal pin; the corpus iteration in corpus_test.go keeps the broader number moving.
parser/pegen/
The generated parser's helper table picks up the matched-vs-failed sentinel and the big-int fallback, plus a long list of action-helper ports.
- action_helpers_gen.go. Numeric literal parsing falls back to *math/big.Int when strconv.ParseInt overflows. CPython's parsenumber (Parser/action_helpers.c) calls PyLong_FromString, which is arbitrary-precision; literals like 0xffffffffffffffff and 0o1000000000000000000000 now parse cleanly instead of pinning a SyntaxError.
- action_helpers_gen.go. Add the matchedOr helper plus the placeholderMatched sentinel. Single-binding alts that bind a rule whose body returned nil (e.g. an empty arglist matching _rhs_22) lift the nil through matchedOr so the outer alt sees "matched" instead of treating the alt as a failure and resetting the mark. truthy() collapses the sentinel back to false so it never survives into a surrounding boolean test. Without this bridge, class B2(): (empty parens) and similar shapes would unwind LPAR / RPAR consumption and surface as ErrParserNotImplemented.
- Port the helpers the generated grammar reaches for once the parse actually starts producing real AST: the alias, global, nonlocal, lambda, with, match, and type-alias constructors out of Parser/action_helpers.c. Each of these had been a panic-stub since v0.7. With the matched-vs-failed fix in place, the parser actually reaches them, and they needed to do the right thing.
- Argument indexing fixes that the generator had been getting wrong: actionPgenCmpopExprPair reads op / expr at the right slots, augoperator reads at args[1], checked_future_import calls seq_count_dots with the right index. Each was off-by-one against CPython. The generator emits indexes based on the grammar's bracket positions; the off-by-one came from a mismatch in how we counted forced markers.
- DictComp: recognise the pair->key / pair->value shape from Parser/Python.asdl so dict comprehensions build a proper DictComp node instead of falling through to the generic miss. CPython's grammar splits the comprehension into a kvpair rule that returns a struct; our generator was treating the struct as a tuple and pulling the wrong fields.
- F-string action helpers: port joined_str and formatted_value for the no-debug path, plus _PyPegen_check_fstring_conversion for the {x!r} / {x!s} / {x!a} flag and _PyPegen_setup_full_format_spec for {x:>10}-style format specs. FSTRING_MIDDLE literal segments now build the right Constant node. Without these, every f-string in the corpus rolled back.
- numberToken: default the parser's featureVersion to build.PythonMinorVersion rather than zero, so PEP 515 underscored literals like 1_000_000 and 0xFFFF_FFFF actually parse instead of being rejected by the featureVersion < 6 guard.
tools/parser_gen/
The generator changes that ripple through the 268 grammar rules.
- emit.go. Wrap the default-action and bare-bound-name returns in matchedOr(...) so the alt-success / action-result split applies uniformly across the 268 rules. The regenerated parser/pegen/parser_gen.go carries 306 matchedOr call sites. This is the single biggest change in the release by line count: each call site is one or two lines, but the change ripples through every grammar rule the generator touches.
- Quoted operator literals in the grammar ('+=', '**', ...) now route through Expect(token.<EXACT>) instead of ExpectOp, matching the upgrade-on-intake step in fillToken. The exact-token distinction matters because ** is exact (DOUBLESTAR) and **= is also exact (DOUBLESTAR_EQUAL); routing them as ExpectOp(*) would have collapsed them.
- Drop the implicit name on forced markers; capitalise initialisms (URL, ID, HTTP) in generated identifiers so golangci-lint is happy with the emitted code. The lint cleanup is cosmetic but it kept the regen pipeline green; a failing lint on regenerated code is a permanent yellow CI.
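The exact-token distinction is easiest to see as a table. A sketch of the idea; the token names follow the release notes above and the map itself is illustrative, not the generator's actual lookup:

```go
package main

import "fmt"

// exactToken maps a quoted operator literal from the grammar to the
// exact token kind the generated parser should Expect. Illustrative
// table only: the generator's real routing lives in emit.go, and the
// PLUSEQUAL name here is an assumption modelled on CPython's tokens.
var exactToken = map[string]string{
	"+=":  "PLUSEQUAL",
	"**":  "DOUBLESTAR",
	"**=": "DOUBLESTAR_EQUAL",
}

func main() {
	// Routing both ** and **= through a generic ExpectOp would collapse
	// the two distinct exact kinds below.
	fmt.Println(exactToken["**"], exactToken["**="])
}
```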
ast/
The Python-level ast module fills out.
- Port the rest of Lib/ast.py: iter_fields, iter_child_nodes, walk, NodeVisitor, NodeTransformer, dump, parse, unparse, plus the literal_eval constant subset. These are the names downstream tooling reaches for, and they are needed by the differential parity gate (below). ast.dump in particular is the spec for what "matches CPython" means at the AST level; if our ast.dump produces the same string CPython's does for the same input, the two parsers agree.
Parser differential parity gate
The strongest single signal in the tree that "we are CPython 3.14".
- parity_test.go: parse the same source with python3 and with gopy, dump both via ast.dump, and require byte-equality. The seeded fixture set covers 50+ shapes: assigns, if / while / for, def with *args / **kw, classes with empty / metaclass bases, decorators, imports, raise / raise-from, try / except / finally, with, lambda, the four comprehension kinds, slices, unpacking, global / nonlocal, assert, f-strings with !r and format specs, match, type alias, every constant kind.
- The CI test job installs Python 3.14 so the gate runs on ubuntu / macos / windows. The test skips when python3 reports a different minor version, since ast.dump output is not stable across versions.
- The gate requires byte-for-byte equality, not semantic equivalence. We chose this deliberately: semantic equivalence is easier to game (an "almost-equal" comparator forgives differences we'd rather know about), while byte-equality fails noisily and forces us to look at every divergence.
Corpus and panel
- Lib/test/test_grammar.py parses end to end. The panel exercises decorators (PEP 614), classes with empty and non-empty arglists, function definitions with positional-only and keyword-only parameters, comprehensions, walrus, match statements, f-strings, t-strings, type aliases, async functions, and big-int literals. This was the HARD goal pin for v0.10.2.
- TestCorpusParse over $CPYTHON/Lib (test/ subtrees skipped): ok=720 / sentinel=0 / fail=0 / total=720 (100% typed). v0.10.1 was 10/720. Every action-helper shape the corpus reaches for has a real implementation; no panic-stubs remain on this path.
Why we built it this way
Three calls deserve a callout.
Why we kept the PEG generator
The obvious shortcut would have been to write a hand-rolled recursive-descent parser. Python's grammar is not that wild; an experienced compiler engineer could write a parser for it in a month. We considered it and rejected it for the same reason CPython rejected it: maintenance.
CPython 3.14.1 will land grammar tweaks. Match statements got extra patterns in 3.12. F-strings got debug expressions in 3.13. Type parameter syntax landed in 3.12. Each of these arrived as a grammar patch; the generator did the work to translate that patch into thousands of lines of parser code. If we had a hand-rolled parser, every minor version of CPython would mean a week of careful porting work.
With the generator, a grammar patch is a git apply plus
go generate. The diff in our generated code matches the diff
in CPython's. The matched-vs-failed and big-int fixes in this
release are exactly the kind of work the generator's strange
behaviour produces, but they're also exactly the kind of work
you do once and the result keeps working.
Why byte-for-byte ast.dump matching
We could have written a semantic-equivalence comparator for the parity gate: a function that walks two ASTs and returns true if they "mean the same thing". That comparator would have been more forgiving of subtle differences, which is exactly the problem. We don't want forgiveness. We want every divergence to surface.
ast.dump is what CPython publishes as the canonical
serialisation. If our ast.dump produces different bytes from
CPython's ast.dump on the same input, something is different.
Maybe a position attribute is missing. Maybe a node type is wrong.
Maybe an ast.Constant should be an ast.Num for back-compat.
Whatever it is, we want to know.
The cost is that the gate is noisy when CPython 3.14 itself
changes (which is why it skips when python3 --version reports
a different minor). The benefit is that within a minor version,
the gate catches divergence the moment it happens, not three
weeks later when somebody else finds the wrong AST shape in
production.
Why we ported _PyPegen_run_parser error preference
CPython's _PyPegen_run_parser does something subtle. When the
parser fails, it returns the deepest pinned error rather than the
most recent miss. The reason is that PEG parsers explore many
alternatives in parallel; the most recent miss is often a generic
"didn't match" from way back in the grammar, while the deepest
pinned error is usually the actual syntax error the user wrote.
Without this preference, we kept seeing "parser not implemented" errors for code that had a perfectly explicable syntax error somewhere deep in the parse tree. Porting the preference gave us the same error messages CPython produces, which dramatically improved the developer experience: error messages that point at the actual problem rather than at the place the parser gave up.
Where it lives
- parser/lexer/lexer.go has the blank-line and ELLIPSIS fixes.
- parser/lexer/state.go carries the blankline flag.
- parser/pegen/action_helpers_gen.go has the matchedOr sentinel, the big-int fallback, and most of the ported action helpers.
- tools/parser_gen/emit.go is the generator change.
- ast/ carries the Lib/ast.py port.
- parser/parity_test.go is the differential gate.
- parser/grammar_panel_test.go is the HARD goal gate.
Compatibility
A few user-visible changes are worth flagging.
- Big-int literals work. Programs that worked around the v0.10.1 overflow (constructing the value at runtime via int('0xff...', 16)) can now use the literal directly.
- Comment-only lines no longer break statement boundaries. Programs that had inserted dummy statements between comments to keep the parser happy can drop them.
- Real syntax errors now surface with line / column. The v0.10.1 generic "parser not implemented" is gone; instead, you'll see the same SyntaxError message CPython would have produced.
- PEP 515 underscored literals parse. Code that wrote 1000000 instead of 1_000_000 to dodge the v0.10.1 rejection can use the readable form.
- F-strings with !r / !s / !a and format specs work. Programs that fell back to format(repr(x)) or '%r' % x can use the f-string form directly.
What's next
The remaining parser polish lands in v0.11:
- Position tracking parity: lineno, col_offset, end_lineno, end_col_offset on every AST node. The parser produces nodes today but the position panel is not yet pinned against CPython. The differential gate doesn't catch this because ast.dump by default doesn't include positions; once we widen the gate to ast.dump(node, include_attributes=True), position bugs will surface.
- Memo-cache hit pattern parity with CPython on Lib/test/test_grammar.py. Diagnostic only; not user-observable, but matching the memo-cache hits tells us our generator's cache layout matches CPython's, which in turn catches edge cases where we descend into a branch CPython doesn't.
- Broaden the parity fixture set beyond the seeded 50: drive ast.dump byte-equality across the full $CPYTHON/Lib corpus. The corpus gate proves the parser doesn't crash; parity proves the parser produces the right AST. We want both.
After the parser-side polish, the next big push is the VM. v0.11 opens the path to the optimizer rewrite (tier 1, then tier 2) that lands in v0.12.
Acknowledgments
This release closes spec 1670 (parser corpus 100%), spec 1671
(differential parity gate), and the long tail of action-helper
porting work that had been parked since v0.7. The pull request
that shipped this release carries 306 matchedOr call sites in
the regenerated parser. Every one of them is a small piece of
the matched-vs-failed bridge that turns a generator-emitted
"return nil" into a "return matched".