
v0.10.2 - The parser drop

Released May 7, 2026.

A Python parser is a strange thing to write. The CPython parser is generated. There's a grammar file at Parser/Python.gram, a generator at Tools/peg_generator/, and a result at Parser/parser.c that's roughly twenty thousand lines of machine-emitted code. Nobody writes Python's parser by hand. The generator does it.

We've followed CPython's approach. Our generator lives at tools/parser_gen/, our grammar is the same Python.gram file CPython ships, and our generated output is parser/pegen/parser_gen.go. The benefit is enormous: when CPython 3.14.1 lands a grammar patch, we apply it to the grammar and regenerate. The cost is strange too: a bug in the generator looks like a bug in 268 different generated functions all at once.

This release is the one where the generator finally does what it needs to do. v0.10.2 is the first release that parses the full Lib/test/test_grammar.py panel end to end. It's the first that agrees with CPython 3.14 on ast.dump byte-for-byte across a seeded fixture set. The corpus number that had been stuck at 10 out of 720 typed test files (1.4%) now reads 720 out of 720 (100%).

Three things had been holding the corpus back. The lexer was emitting NEWLINE tokens for blank and comment-only lines instead of swallowing them, which broke statement boundaries in any file that contained a comment between two statements. The generated parser confused "this alt's action returned nil" with "this alt failed", which broke any rule whose action returned a Python None (an empty arglist, an absent return type annotation, a missing default). Integer literals overflowed Go's int64, which broke any file that contained a literal like 0xffffffffffffffff. Each had a different fix and shipped in this release.

After those three landed, the corpus moved from 1.4% typed to 78.2%. Closing the long tail of action-helper shapes (f-strings, comprehensions, augmented operators, future imports, match / lambda / with / type-alias) walked it the rest of the way to ok=720 / sentinel=0 / fail=0. The remaining 21.8% wasn't gated by a single design issue; it was a long list of helper functions the generator emitted with wrong argument indexing.

Highlights

Three pieces of work define this release.

The lexer's blank-line handling

The CPython lexer has a state called blankline that tracks whether the current line has produced any non-whitespace, non-comment content. If a line ends with no content, the lexer loops back to the top of the dispatch function ("goto nextline" at Parser/lexer/lexer.c:805) rather than emitting a NEWLINE token. That's why a file like this:

def f():
    return 1

# this comment

def g():
    return 2

parses correctly. The blank line and the comment-only line both elide their NEWLINEs, so the two function definitions sit at the same indentation level without a phantom NEWLINE between them.

Our v0.10.1 lexer had the opposite default. Every newline emitted a NEWLINE token. The parser then saw def f(): return 1, NEWLINE, NEWLINE (from the blank line), NEWLINE (from the comment-only line), def g(): return 2, and treated each NEWLINE as a statement terminator. The grammar's statements rule expected one NEWLINE between statements at the same level; three NEWLINEs in a row pinned a SyntaxError.

We ported the goto nextline flow from Parser/lexer/lexer.c:805. Blank lines, comment-only lines, and newlines inside parens loop back to the top of the dispatch under the parser's tok_extra_tokens = 0 mode. Under the tokenize-module mode (tok_extra_tokens = 1), the same paths emit NL and COMMENT tokens, matching Lib/tokenize.py's shape: the tokenize module needs to see the blank lines and comments to round-trip the source, but the parser needs to skip them.

Tracking blankline correctly across the indent / nextline / newline branches is the load-bearing piece. A blank line can be seen across two separate calls into tokGetNormalMode: the indentation prefix on one call, the actual \n on the next. v0.10.1 lost the blankline flag between the two calls, so the second call emitted a NEWLINE even though the first call had established the line as blank. The new code carries the flag in the lexer state instead of in a stack-local variable.
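The two-call shape can be sketched in miniature. This is an illustrative reduction, not the real API: lexState, scanIndent, and scanNewline are hypothetical names, and the real lexer does far more per call. The point is only that the flag must live on the state struct to survive between entries.

```go
package main

import (
	"fmt"
	"strings"
)

// lexState carries the blankline flag across separate calls into the
// lexer, mirroring (in miniature) the fix described above.
type lexState struct {
	blankline bool // true while the current line has produced no real content
}

// scanIndent runs on one entry into the lexer: it looks past the
// indentation and records whether the line is blank or comment-only.
func (s *lexState) scanIndent(line string) {
	rest := strings.TrimLeft(line, " \t")
	s.blankline = rest == "" || strings.HasPrefix(rest, "#")
}

// scanNewline runs on a later, separate entry: the local context from
// scanIndent is gone, so it must consult the flag carried in state.
func (s *lexState) scanNewline() string {
	if s.blankline {
		return "" // swallowed: loop back to the dispatch, emit nothing
	}
	return "NEWLINE"
}

func main() {
	s := &lexState{}
	for _, line := range []string{"    return 1", "", "# comment", "def g():"} {
		s.scanIndent(line)
		fmt.Printf("%q -> %q\n", line, s.scanNewline())
	}
}
```

If the flag were a local variable in scanIndent, the scanNewline call would have no way to know the line was blank, which is exactly the v0.10.1 failure.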

The comment branch under tokenize-module mode now backs up the trailing \n (tok_backup at Parser/lexer/lexer.c:721) so the newline emits as NL on the next call rather than getting absorbed into the COMMENT token. That matches CPython exactly and unblocks the tokenize module's round-trip behavior.

The generator's matched-vs-failed split

PEG parsers have a subtle distinction between "this alt failed" and "this alt matched and returned a value that happens to be falsy". Consider a grammar rule like:

arguments: args=positional? -> Call(args=args)

If the input has no positional arguments, positional? matches the empty sequence and binds args to nil. The alt matched. But our generator was emitting:

if args := positional(); args != nil {
    return CallNode{Args: args}
}
return nil // alt failed

That treats args != nil as the success condition, which means a matching-but-empty args looks identical to a failed match. Result: every empty arglist in the corpus rolled back to ErrParserNotImplemented.

The fix is a sentinel value. We added placeholderMatched and the matchedOr helper to parser/pegen/action_helpers_gen.go:

// matchedOr wraps a possibly-nil result. nil becomes placeholderMatched
// so the calling alt sees "matched". The truthy() helper collapses
// placeholderMatched back to false in boolean tests.
func matchedOr(v interface{}) interface{} {
    if v == nil {
        return placeholderMatched
    }
    return v
}

The generator now emits matchedOr(...) around every default-action and bare-bound-name return. The regenerated parser/pegen/parser_gen.go carries 306 call sites. Each one lifts the "matched but empty" case out of the "actually failed" case.

Without this bridge, class B2(): (empty parens), def f() -> None: (missing return type), and dozens of other shapes were rolling back. With it, they parse cleanly.
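The whole bridge fits in a few lines. The sketch below is self-contained and uses illustrative stand-ins: the placeholder type, the truthy() body, and the positional() stub are assumptions, not the real generated code; positional() models an optional sub-rule that matched but legitimately bound nil.

```go
package main

import "fmt"

// placeholder is an illustrative stand-in for the real sentinel type.
type placeholder struct{}

var placeholderMatched = placeholder{}

// matchedOr lifts a matched-but-nil result to the sentinel so the
// calling alt can distinguish it from a failed match (bare nil).
func matchedOr(v interface{}) interface{} {
	if v == nil {
		return placeholderMatched
	}
	return v
}

// truthy collapses the sentinel back to false so it never leaks into a
// surrounding boolean test.
func truthy(v interface{}) bool {
	return v != nil && v != placeholderMatched
}

// positional models an optional sub-rule: it matched, but the matched
// value is legitimately nil (think of an empty arglist).
func positional() interface{} { return nil }

func main() {
	// The fixed emitted shape: the alt tests "did the rule match",
	// not "is the bound value non-nil".
	if args := matchedOr(positional()); args != nil {
		fmt.Println("alt matched; args present:", truthy(args))
	} else {
		fmt.Println("alt failed")
	}
	// Prints: alt matched; args present: false
}
```

The old emitted code collapsed both cases into the else branch; with the sentinel, the empty arglist takes the success path and only a genuine miss returns bare nil.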

Big-int literal support

Go's int64 tops out at 9223372036854775807. CPython's integers are arbitrary precision. Numeric literals like 0xffffffffffffffff (18446744073709551615) overflow Go's strconv.ParseInt but parse cleanly in CPython through PyLong_FromString, which builds a PyLongObject of whatever size the literal needs.

The fix is a fallback. parser/pegen/action_helpers_gen.go now routes through *math/big.Int when strconv.ParseInt overflows. Literals like 0xffffffffffffffff and 0o1000000000000000000000 (a 70-bit octal) now parse cleanly instead of pinning a SyntaxError.

We chose math/big.Int rather than rolling our own arbitrary precision because (a) the Go standard library implementation is already well-tested, (b) the CPython PyLong representation is deliberately not Go-interoperable so we'd need a conversion layer either way, and (c) the parser-side cost is exactly one fallback on overflow, which is much cheaper than a custom int type.

# Literals that broke v0.10.1 and parse in v0.10.2:
mask = 0xffffffffffffffff
big = 0o1000000000000000000000
shift = 1 << 100
formatted = 1_000_000 # PEP 515 underscored literal

The PEP 515 underscored literal is in the same patch: the numberToken function now defaults the parser's featureVersion to build.PythonMinorVersion rather than zero, so the featureVersion < 6 guard that was rejecting underscores no longer fires.
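The guard itself is simple; here is a sketch under assumed names (checkUnderscores and defaultFeatureVersion are illustrative, standing in for the real numberToken logic and build.PythonMinorVersion):

```go
package main

import (
	"fmt"
	"strings"
)

// defaultFeatureVersion stands in for build.PythonMinorVersion; the bug
// was that the parser defaulted this to zero.
const defaultFeatureVersion = 14

// checkUnderscores rejects PEP 515 underscores when the target feature
// version predates Python 3.6.
func checkUnderscores(lit string, featureVersion int) error {
	if featureVersion < 6 && strings.Contains(lit, "_") {
		return fmt.Errorf("Python 3.%d does not support underscores in numeric literals", featureVersion)
	}
	return nil
}

func main() {
	// Old default (zero): every underscored literal rejected.
	fmt.Println(checkUnderscores("1_000_000", 0))
	// New default: the guard never fires on an ordinary parse.
	fmt.Println(checkUnderscores("1_000_000", defaultFeatureVersion))
}
```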

What's new

The full feature breakdown, grouped by package.

parser/lexer/

The lexer changes that unblocked the blank-line handling, plus two related fixes.

  • lexer.go. Port the goto nextline flow from Parser/lexer/lexer.c:805. Blank lines, comment-only lines, and newlines inside parens loop back to the top of the dispatch function under the parser's tok_extra_tokens = 0 mode rather than emitting NEWLINE. Under tokenize-module mode (tok_extra_tokens = 1) the same paths emit NL and COMMENT, matching Lib/tokenize.py.
  • state.go. Track blankline across the indent / nextline / newline branches so a blank or comment-only line is recognised even when the indentation prefix and the actual \n are seen on separate calls into tokGetNormalMode.
  • The comment branch under tokenize-module mode now backs up the trailing \n (tok_backup at Parser/lexer/lexer.c:721) so the newline emits as NL on the next call, matching CPython.
  • The . branch in scanOperator no longer eagerly consumes .. as part of an ELLIPSIS attempt. Two dots emit two DOT tokens and three dots emit ELLIPSIS, matching Parser/lexer/lexer.c:832. Without the fix, from ..pool import ThreadPool lost its first dot, which broke relative imports in any package nested more than one level deep.
  • Inside f-strings, every brace opener bumps the curly-bracket depth so the lexer's expression / format-spec state machine pops back out at the right }. Mirrors Parser/lexer/lexer.c f-string mode. Without this, f-strings containing dict or set displays (f"{ {1, 2} }") closed early.
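The dot dispatch can be sketched as a lookahead that only commits at three (scanDots is an illustrative name, not the real scanOperator):

```go
package main

import "fmt"

// scanDots counts up to three consecutive dots before deciding, instead
// of eagerly committing to an ELLIPSIS attempt at two and losing a dot.
func scanDots(src string, pos int) (tok string, width int) {
	n := 0
	for n < 3 && pos+n < len(src) && src[pos+n] == '.' {
		n++
	}
	if n == 3 {
		return "ELLIPSIS", 3
	}
	return "DOT", 1 // one or two dots: emit a single DOT, rescan the rest
}

func main() {
	// "from ..pool" needs two DOT tokens for the relative-import level.
	for pos := 0; pos < 2; {
		tok, w := scanDots("..pool", pos)
		fmt.Println(tok)
		pos += w
	}
	tok, _ := scanDots("...", 0)
	fmt.Println(tok)
	// Prints: DOT, DOT, ELLIPSIS
}
```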

parser/

Top-level parser changes.

  • parser.go. runParse now surfaces the pinned SyntaxError whenever a rule reaches ErrParserNotImplemented. CPython's _PyPegen_run_parser (Parser/pegen.c:1136) prefers the deepest pinned error over a generic miss; the gopy entry point now does the same instead of swallowing real syntax errors as ErrParserNotImplemented. Before this fix, a real syntax error in user code surfaced as a generic "parser not implemented" message rather than the line / column / reason CPython would have produced.
  • grammar_panel_test.go. New gate (TestParseTestGrammar) parses the full Lib/test/test_grammar.py end to end. This is the v0.10.2 HARD goal pin; the corpus iteration in corpus_test.go keeps the broader number moving.

parser/pegen/

The generated parser's helper table picks up the matched-vs-failed sentinel and the big-int fallback, plus a long list of action-helper ports.

  • action_helpers_gen.go. Numeric literal parsing falls back to *math/big.Int when strconv.ParseInt overflows. CPython's parsenumber (Parser/action_helpers.c) calls PyLong_FromString which is arbitrary-precision; literals like 0xffffffffffffffff and 0o1000000000000000000000 now parse cleanly instead of pinning a SyntaxError.
  • action_helpers_gen.go. Add the matchedOr helper plus the placeholderMatched sentinel. Single-binding alts that bind a rule whose body returned nil (e.g. an empty arglist matching _rhs_22) lift the nil through matchedOr so the outer alt sees "matched" instead of treating the alt as a failure and resetting the mark. truthy() collapses the sentinel back to false so it never survives into a surrounding boolean test. Without this bridge class B2(): (empty parens) and similar shapes would unwind LPAR / RPAR consumption and surface as ErrParserNotImplemented.
  • Port the helpers the generated grammar reaches for once the parse actually starts producing real AST: alias, global, nonlocal, lambda, with, match, and type alias constructors out of Parser/action_helpers.c. Each of these had been a panic-stub since v0.7. With the matched-vs-failed fix in place, the parser actually reaches them, and they needed to do the right thing.
  • Argument indexing fixes that the generator had been getting wrong: actionPgenCmpopExprPair reads op / expr at the right slots, augoperator reads at args[1], checked_future_import calls seq_count_dots with the right index. Each was off-by-one against CPython. The generator emits indexes based on the grammar's bracket positions; the off-by-one came from a mismatch in how we counted forced markers.
  • DictComp: recognise the pair->key / pair->value shape from Parser/Python.asdl so dict comprehensions build a proper DictComp node instead of falling through to the generic miss. CPython's grammar splits the comprehension into a kvpair rule that returns a struct; our generator was treating the struct as a tuple and pulling the wrong fields.
  • F-string action helpers: port joined_str and formatted_value for the no-debug path, plus _PyPegen_check_fstring_conversion for the {x!r} / {x!s} / {x!a} flag and _PyPegen_setup_full_format_spec for {x:>10}-style format specs. FSTRING_MIDDLE literal segments now build the right Constant node. Without these, every f-string in the corpus rolled back.
  • numberToken: default the parser's featureVersion to build.PythonMinorVersion rather than zero, so PEP 515 underscored literals like 1_000_000 and 0xFFFF_FFFF actually parse instead of being rejected by the featureVersion < 6 guard.
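The DictComp shape difference can be sketched like this (kvPair and buildDictComp are illustrative names, not the real generated helpers): the kvpair rule hands back a struct with named fields, and the action must read those fields rather than indexing the result like a tuple.

```go
package main

import "fmt"

// kvPair models the struct the kvpair rule returns: named key and value
// fields, not positional tuple slots.
type kvPair struct {
	Key   string
	Value string
}

// buildDictComp reads the named fields; the broken code indexed the
// struct as if it were a two-element tuple and pulled the wrong fields.
func buildDictComp(pair kvPair, generators []string) string {
	return fmt.Sprintf("DictComp(key=%s, value=%s, generators=%d)",
		pair.Key, pair.Value, len(generators))
}

func main() {
	// Models {k: v for k in xs}.
	fmt.Println(buildDictComp(
		kvPair{Key: "Name(k)", Value: "Name(v)"},
		[]string{"for k in xs"},
	))
}
```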

tools/parser_gen/

The generator changes that ripple through the 268 grammar rules.

  • emit.go. Wrap the default-action and bare-bound-name returns in matchedOr(...) so the alt-success / action-result split applies uniformly across the 268 rules. The regenerated parser/pegen/parser_gen.go carries 306 matchedOr call sites. This is the single biggest change in the release by line count: each call site is one or two lines, but the change ripples through every grammar rule the generator touches.
  • Quoted operator literals in the grammar ('+=', '**', ...) now route through Expect(token.<EXACT>) instead of ExpectOp, matching the upgrade-on-intake step in fillToken. The exact-token distinction matters because ** is exact (DOUBLESTAR) and **= is also exact (DOUBLESTAR_EQUAL); routing them as ExpectOp(*) would have collapsed them.
  • Drop the implicit name on forced markers; capitalise initialisms (URL, ID, HTTP) in generated identifiers so golangci-lint is happy with the emitted code. The lint cleanup is cosmetic, but it kept the regen pipeline green; a failing lint on regenerated code means a permanently yellow CI.

ast/

The Python-level ast module fills out.

  • Port the rest of Lib/ast.py: iter_fields, iter_child_nodes, walk, NodeVisitor, NodeTransformer, dump, parse, unparse, plus the literal_eval constant subset. These are the names downstream tooling reaches for, and they are needed by the differential parity gate (below). ast.dump in particular is the spec for what "matches CPython" means at the AST level; if our ast.dump produces the same string CPython's does for the same input, the two parsers agree.

Parser differential parity gate

The strongest single signal in the tree that "we are CPython 3.14".

  • parity_test.go: parse the same source with python3 and with gopy, dump both via ast.dump, and require byte-equality. The seeded fixture set covers 50+ shapes: assigns, if / while / for, def with *args / **kw, classes with empty / metaclass bases, decorators, imports, raise / raise-from, try / except / finally, with, lambda, the four comprehension kinds, slices, unpacking, global / nonlocal, assert, f-strings with !r and format specs, match, type alias, every constant kind.
  • The CI test job installs Python 3.14 so the gate runs on ubuntu / macos / windows. The test skips when python3 reports a different minor version, since ast.dump output is not stable across versions.
  • The gate requires byte-for-byte equality, not semantic equivalence. We chose this deliberately: semantic equivalence is easier to game (an "almost-equal" comparator forgives differences we'd rather know about), while byte-equality fails noisily and forces us to look at every divergence.

Corpus and panel

  • Lib/test/test_grammar.py parses end to end. The panel exercises decorators (PEP 614), classes with empty and non-empty arglists, function definitions with positional-only and keyword-only parameters, comprehensions, walrus, match statements, f-strings, t-strings, type aliases, async functions, and big-int literals. This was the HARD goal pin for v0.10.2.
  • TestCorpusParse over $CPYTHON/Lib (test/ subtrees skipped): ok=720 / sentinel=0 / fail=0 / total=720 (100% typed). v0.10.1 was 10/720. Every action-helper shape the corpus reaches for has a real implementation; no panic-stubs remain on this path.

Why we built it this way

Three calls deserve a callout.

Why we kept the PEG generator

The obvious shortcut would have been to write a hand-rolled recursive-descent parser. Python's grammar is not that wild; an experienced compiler engineer could write a parser for it in a month. We considered it and rejected it for the same reason CPython rejected it: maintenance.

CPython 3.14.1 will land grammar tweaks. Match statements got extra patterns in 3.12. F-strings got debug expressions in 3.13. Type parameter syntax landed in 3.12. Each of these arrived as a grammar patch; the generator did the work to translate that patch into thousands of lines of parser code. If we had a hand-rolled parser, every minor version of CPython would mean a week of careful porting work.

With the generator, a grammar patch is a git apply plus go generate. The diff in our generated code matches the diff in CPython's. The matched-vs-failed and big-int fixes in this release are exactly the kind of work the generator's strange behaviour produces, but they're also exactly the kind of work you do once and the result keeps working.

Why byte-for-byte ast.dump matching

We could have written a semantic-equivalence comparator for the parity gate: a function that walks two ASTs and returns true if they "mean the same thing". That comparator would have been more forgiving of subtle differences, which is exactly the problem. We don't want forgiveness. We want every divergence to surface.

ast.dump is what CPython publishes as the canonical serialisation. If our ast.dump produces different bytes from CPython's ast.dump on the same input, something is different. Maybe a position attribute is missing. Maybe a node type is wrong. Maybe an ast.Constant should be an ast.Num for back-compat. Whatever it is, we want to know.

The cost is that the gate is noisy when CPython 3.14 itself changes (which is why it skips when python3 --version reports a different minor). The benefit is that within a minor version, the gate catches divergence the moment it happens, not three weeks later when somebody else finds the wrong AST shape in production.

Why we ported _PyPegen_run_parser error preference

CPython's _PyPegen_run_parser does something subtle. When the parser fails, it returns the deepest pinned error rather than the most recent miss. The reason is that PEG parsers explore many alternatives in parallel; the most recent miss is often a generic "didn't match" from way back in the grammar, while the deepest pinned error is usually the actual syntax error the user wrote.

Without this preference, we kept seeing "parser not implemented" errors for code that had a perfectly explicable syntax error somewhere deep in the parse tree. Porting the preference gave us the same error messages CPython produces, which dramatically improved the developer experience: error messages that point at the actual problem rather than at the place the parser gave up.
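The preference can be sketched in a few lines (all names illustrative; the real port tracks token positions and CPython's error kinds): record only the deepest pinned error, and report it in preference to the generic miss.

```go
package main

import "fmt"

// syntaxError is a pinned error with the position it was pinned at.
type syntaxError struct {
	pos int
	msg string
}

// parser keeps only the deepest pinned error seen so far.
type parser struct {
	deepest *syntaxError
}

// pin records an error only if it is deeper than anything recorded yet.
func (p *parser) pin(pos int, msg string) {
	if p.deepest == nil || pos > p.deepest.pos {
		p.deepest = &syntaxError{pos: pos, msg: msg}
	}
}

// fail reports the best error: a deep pinned SyntaxError wins over the
// generic "parser not implemented" miss.
func (p *parser) fail() error {
	if p.deepest != nil {
		return fmt.Errorf("SyntaxError at offset %d: %s", p.deepest.pos, p.deepest.msg)
	}
	return fmt.Errorf("parser not implemented")
}

func main() {
	p := &parser{}
	p.pin(3, "invalid syntax")  // shallow alternative
	p.pin(17, "expected ':'")   // deepest alternative: this one is reported
	p.pin(5, "invalid syntax")  // shallower again: ignored
	fmt.Println(p.fail())
	// Prints: SyntaxError at offset 17: expected ':'
}
```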

Where it lives

  • parser/lexer/lexer.go has the blank-line and ELLIPSIS fixes.
  • parser/lexer/state.go carries the blankline flag.
  • parser/pegen/action_helpers_gen.go has the matchedOr sentinel, the big-int fallback, and most of the ported action helpers.
  • tools/parser_gen/emit.go is the generator change.
  • ast/ carries the Lib/ast.py port.
  • parser/parity_test.go is the differential gate.
  • parser/grammar_panel_test.go is the HARD goal gate.

Compatibility

A few user-visible changes are worth flagging.

  • Big-int literals work. Programs that worked around the v0.10.1 overflow (constructing the value at runtime via int('0xff...', 16)) can now use the literal directly.
  • Comment-only lines no longer break statement boundaries. Programs that had inserted dummy statements between comments to keep the parser happy can drop them.
  • Real syntax errors now surface with line / column. The v0.10.1 generic "parser not implemented" is gone; instead, you'll see the same SyntaxError message CPython would have produced.
  • PEP 515 underscored literals parse. Code that wrote 1000000 instead of 1_000_000 to dodge the v0.10.1 rejection can use the readable form.
  • F-strings with !r / !s / !a and format specs work. Programs that fell back to format(repr(x)) or '%r' % x can use the f-string form directly.

What's next

The remaining parser polish lands in v0.11:

  • Position tracking parity. lineno, col_offset, end_lineno, end_col_offset on every AST node. The parser produces nodes today but the position panel is not yet pinned against CPython. The differential gate doesn't catch this because ast.dump by default doesn't include positions; once we widen the gate to ast.dump(node, include_attributes=True), position bugs will surface.
  • Memo-cache hit pattern parity with CPython on Lib/test/test_grammar.py. Diagnostic only; not user-observable, but matching the memo-cache hits tells us our generator's cache layout matches CPython's, which in turn catches edge cases where we descend into a branch CPython doesn't.
  • Broaden the parity fixture set beyond the seeded 50: drive ast.dump byte-equality across the full $CPYTHON/Lib corpus. The corpus gate proves the parser doesn't crash; parity proves the parser produces the right AST. We want both.

After the parser-side polish, the next big push is the VM. v0.11 opens the path to the optimizer rewrite (tier 1, then tier 2) that lands in v0.12.

Acknowledgments

This release closes spec 1670 (parser corpus 100%), spec 1671 (differential parity gate), and the long tail of action-helper porting work that had been parked since v0.7. The pull request that shipped this release carries 306 matchedOr call sites in the regenerated parser. Every one of them is a small piece of the matched-vs-failed bridge that turns a generator-emitted "return nil" into a "return matched".