1642. gopy pegen
What we are porting
The PEG parser is two files at runtime:
Parser/pegen.c: the runtime. ~3500 lines. Memo cache, token buffer, error pinning,_PyPegen_*helpers used by the generated rules.Parser/parser.c: the generated parser. ~40k lines, regenerated fromGrammar/python.grambyTools/peg_generator/. Each rule is one C function returning an AST node orNULL.
Parser/peg_api.c is a thin public wrapper (PyPegen_ASTFrom*).
The generated parser.c is too large to hand-port; we run the
upstream generator (in Python) at build time and let it emit Go.
Strategy
- Hand-port
pegen.ctoparser/pegen/. Memo cache, token feeder, error sink, expect/expect_token/lookahead helpers. - Fork
Tools/peg_generator/so it emits Go from the samepython.gram. Output lands inparser/pegen/parser_gen.go. Keep the generator undertools/parser_gen/and check its output in (no runtime codegen). - Hand-port
peg_api.cto a smallparser.Parsedriver that chooses the start rule fromMode.
The generator fork is the only divergence from "hand-port the C". The grammar is too volatile (CPython tweaks it nearly every release) and the generator already emits target-language-agnostic PEG state machines.
Go shape
// Parser is the runtime mirror of struct Parser from pegen.h.
// Each generated rule takes *Parser and returns (ast.Node, bool).
type Parser struct {
tokens []lexer.Tok
pos int
memo map[memoKey]memoEntry
err *SyntaxError
arena *arena.Arena
flags Flags
mode StartRule
fill func() error // pull more tokens from the lexer
}
type memoKey struct {
rule int // rule id, dense small-int
pos int
}
type memoEntry struct {
node ast.Node
end int // position after the match, or pos for failure
ok bool
}
The memo cache is the heart of PEG. Every rule call is keyed by
(rule_id, pos). The cache survives backtracking. CPython uses an
array-of-array indexed by rule then pos; Go uses a hash map for
simplicity. The hot path is the cache hit, which is one map lookup
either way.
Token fill
pegen.c:_PyPegen_fill_token pulls the next token from the lexer
when the parser exceeds the buffered range. We translate to:
func (p *Parser) fill() error {
for p.pos >= len(p.tokens) {
tok, err := p.next()
if err != nil {
return err
}
p.tokens = append(p.tokens, tok)
}
return nil
}
Same behaviour: lazy, unbounded buffer. The buffer never shrinks because the memo cache references token indices.
Error pinning
PEG errors fire on the deepest position any rule reached. CPython
tracks p->error_indicator, p->level, p->call_invalid_rules.
We mirror these on Parser:
func (p *Parser) RaiseSyntaxError(msg string) ast.Node {
if p.pos > p.err.farthestPos {
p.err = &SyntaxError{Pos: p.tokens[p.pos].Start, Msg: msg}
}
return nil
}
The "farthest reached" rule wins. This is CPython's "last good error" heuristic; without it, PEG errors point at the start of the file.
Generator
tools/parser_gen/ is a Go-targeted fork of
cpython/Tools/peg_generator/. It reads Grammar/python.gram and
emits parser/pegen/parser_gen.go containing:
const Rule_<name> int = Nfor each rule.- A
func (p *Parser) <rule_name>() (ast.Node, bool)per rule. - A
func StartRule(p *Parser, m StartRule) (ast.Node, bool)that dispatches onMode.
The generator runs at go generate time, not at compile time. The
generated file is checked in.
File mapping
| C source | Go target |
|---|---|
Parser/pegen.c | parser/pegen/runtime.go |
Parser/pegen.h | parser/pegen/parser.go (struct) |
Parser/peg_api.c | parser/parser.go (Parse, ParseString) |
Parser/parser.c | parser/pegen/parser_gen.go (generated) |
Tools/peg_generator/ | tools/parser_gen/ (forked, Go target) |
Checklist
Status legend: [x] shipped, [ ] pending, [~] partial / scaffold,
[n] deferred / not in scope this phase.
Files
-
parser/pegen/parser.go:Parserstruct,StartRuleenum,Flags,New. -
parser/pegen/runtime.go:expect_token,expect_forced,lookahead,lookahead_with_name,seq_count_dots,singleton_seq, the helpers the generator emits calls to. -
parser/pegen/memo.go:memo_get,memo_put, the cache. -
parser/pegen/fill.go: lazy token pulling from the lexer. -
parser/pegen/errors.go:RaiseSyntaxError,RaiseSyntaxErrorKnownLocation, error pinning,farthestPos. -
parser/pegen/parser_gen.go: generated. ~19.5k lines, all 268 rules fromGrammar/python.gram. Includes left-recursion seed-and-grow leaders, cut commit, invalid_rules second-pass gating, and agram-sha256drift header. -
parser/pegen/action_helpers_gen.go: hand-written helpers that meet the typed AST surface. CurrentlyactionPgenMakeModulefor the file rule. The generator's action-helper exclusion list defers to this file by name. -
parser/parser.go: top-levelParse(mode, src) (ast.Mod, error), thepeg_api.csurface. -
tools/parser_gen/: Go-targeted PEG generator. Eight milestones land:- M1 grammar parser (
grammar_parser.go). - M2 nullable + first sets (
firstsets.go,nullable.go). - M3 per-rule emitter (
emit.go). - M4 left-recursion + cut (
sccs.go, leader wrappers). - M5 invalid_rules second-pass gating.
- M6 action expression translator (
action.go): translates a subset of C action bodies (CHECK pass-through, RAISE_*,_PyAST_*,_PyPegen_*); falls back to the bound-name[]anyform for patterns it cannot type yet. - M7 wires Dispatch to return the rule result and surfaces
*ast.Modulefor empty source. - M8 corpus iteration (
parser/corpus_test.go) and grammar drift check (drift.go,-check-driftflag).
- M1 grammar parser (
Generator output panel
- Generator round-trips against the upstream
Grammar/python.gramfor 3.14.0. Pinned byTestEmitPythonGramRoundtripand the SHA256 in the generated preamble. - Each rule body emits the M3 closure-per-alt shape: per-item
bindings with shape-aware blocking / always / boolean / noop
arms, then either a translated action expression or the bound
names lifted into
[]any. - Cut operator (
~) commits at the same point CPython does;TestEmitPythonGramRoundtripverifies at least one alt setscut = trueagainstpython.gram. - Left-recursive rule sets (Tarjan SCC) emit a seed-and-grow
leader wrapper plus an unmemoized
_rawbody for non-leaders. -
invalid_*rules gate themselves onp.CallInvalid()andDispatchretries withSetCallInvalid(true)on first-pass miss. - Alt-success vs action-result split. CPython treats an alt as
matched as soon as every item in the sequence consumes (the action
is then run for its side effect / return); gopy's emitter signals
alt success by returning a non-nil
any. Single-binding alts (default action_res = <bound>) liftnilthroughmatchedOr(...), which substitutes theplaceholderMatchedsentinel so the outer alt sees "matched" instead of resetting the mark.truthy()collapses the sentinel back tofalseso it does not survive into surrounding boolean tests. Without this bridge, shapes likeclass B2():(empty arglist) would unwind the consumedLPAR/RPARand re-raise asErrParserNotImplemented.
Surface guarantees
-
Parse(ModeFile, src)returns a real*ast.Modulefor the fullLib/test/test_grammar.pypanel; pinned byTestParseTestGrammar. Coverage on the broader Lib tree is tracked byTestCorpusParse: v0.10.2 reaches ok=563 / sentinel=155 / fail=2 / total=720 (78.2% typed) on$CPYTHON/Lib. The remaining sentinels are dominated by action-helper shapes the generator still emits panic-stubs for; sentinels surface the pinnedSyntaxErrorif one fires, falling back toErrParserNotImplementedonly when no real error pinned. - Memo cache hit pattern parity against CPython on the
Lib/test/test_grammar.pycorpus. Diagnostic, not user-visible. - SyntaxError text and position match CPython on the
parser/errors/golden_panel_test.gocorpus. -
Parseis goroutine-safe for distinct calls; one Parser instance is not safe for concurrent use. - Grammar drift detection:
parser_gen -check-driftandTestCheckGrammarDriftfail loudly when the recordedgram-sha256does not match the currentpython.gram.
Out of scope for v0.5.5
- Incremental / resumable parsing for the LSP. Tracked separately.
- Soft keyword handling beyond what 3.14 already needs (
match,case,type).
Cross-references
- Lexer surface: 1641.
- SyntaxError text: 1643.
- Action helpers: 1643.
- String literal handling inside parser actions: 1644.