1641. gopy lexer and tokenizer
What we are porting
CPython's lexer is split into two layers:
- Parser/lexer/: the state machine that consumes bytes and emits tokens. lexer.c is the FSM, state.c owns the per-tokenizer state struct, buffer.c owns the slidable input buffer.
- Parser/tokenizer/: four drivers that feed the lexer from different sources: utf8_tokenizer.c (in-memory UTF-8 string), string_tokenizer.c (legacy string with encoding detection), file_tokenizer.c (FILE*), readline_tokenizer.c (REPL callback). helpers.c is the shared decode / line-handling surface.
Together they are the most stateful part of CPython's parser, with roughly 6k lines of C. The lexer tracks indentation, parenthesis depth, type-comment mode, async-aware keywords (3.7+), f-string nesting, and continuation lines.
Go translation
Top-level surface lives in parser/lexer/:
// State is the per-tokenizer struct. Mirrors struct tok_state from
// Parser/lexer/state.h. Fields renamed from snake_case to Go style.
type State struct {
	buf     *Buffer      // input buffer
	indents []int        // indent stack
	parens  []byte       // open paren stack: '(' '[' '{' or 0
	line    int          // 1-based
	col     int          // 0-based code-point offset
	mode    Mode         // file, single, eval, fstring
	async   asyncState   // 3.7 keyword tracking
	fstring []fstringTok // open f-string contexts
	err     *SyntaxError
}
// Mode mirrors Parser/lexer/state.h:Pegen_*Mode.
type Mode int
const (
	ModeFile Mode = iota
	ModeSingle
	ModeEval
	ModeFunctionType
	ModeFString
)
Buffer model in parser/lexer/buffer.go:
// Buffer mirrors the buf/cur/inp/end pointer quartet from
// Parser/lexer/buffer.c. We use offsets into a []byte instead of
// raw pointers, but the semantics are identical.
type Buffer struct {
	src       []byte
	cur       int // current read offset
	lineStart int
	eof       bool
}
Token emission lives in parser/lexer/lexer.go:
// Tok is the lexer's emitted token. Mirrors struct token from
// Parser/lexer/state.h. Distinct from tokenize.Token (1665) which
// is the public Python-facing surface.
type Tok struct {
	Kind     tokenize.Type
	Bytes    []byte
	Start    Pos
	End      Pos
	Metadata uint32 // packs is_keyword, is_async_keyword, etc.
}
// Next pulls one token. Mirrors tok_get from Parser/lexer/lexer.c.
func (s *State) Next() (Tok, error)
Driver dispatch
Each of the four tokenizer drivers is a thin constructor over
State:
// FromUTF8 mirrors utf8_tokenizer.c:_PyTokenizer_FromUTF8.
func FromUTF8(src []byte, mode Mode) *State
// FromString mirrors string_tokenizer.c with encoding detection
// (BOM + PEP 263 cookie).
func FromString(src []byte, mode Mode) (*State, error)
// FromFile mirrors file_tokenizer.c. Wraps an io.Reader and
// handles incremental reads.
func FromFile(r io.Reader, mode Mode) *State
// FromReadline mirrors readline_tokenizer.c. The callback returns
// one line at a time; used by the REPL.
func FromReadline(rl func() (string, error), mode Mode) *State
Indentation, parens, async
The three pieces of lexer state CPython tracks:
- Indent stack: tabsize=8, alttabsize=1, error on inconsistent tab/space mixing under PEP 8 mode. Same algorithm as tok_get_indent in lexer.c.
- Paren stack: balances (), [], {} across logical lines. A mismatch yields the same unmatched ']' text CPython emits.
- Async-keyword state: the pre-3.7 quirk is gone in 3.14; the field stays so we can re-enable it for older grammar tests if needed.
f-string and t-string nesting
f-strings recursively re-enter the lexer with ModeFString. The
nesting stack is fstring []fstringTok. Each entry tracks the
quote style, the brace depth, and whether we are inside a : format
spec. Same structure as tok->tok_mode_stack in CPython 3.12+.
t-strings (PEP 750, 3.14) reuse the same machinery with a different Tok kind on emission. The nesting algorithm is identical; only the emitted token type differs.
Errors
Lexer errors lift to a *SyntaxError whose text is verbatim from
pegen_errors.c. The mapping table lives in 1643.
File mapping
| C source | Go target |
|---|---|
| Parser/lexer/state.h (struct) | parser/lexer/state.go |
| Parser/lexer/state.c | parser/lexer/state.go |
| Parser/lexer/buffer.c | parser/lexer/buffer.go |
| Parser/lexer/lexer.c | parser/lexer/lexer.go |
| Parser/tokenizer/utf8_tokenizer.c | parser/lexer/driver_utf8.go |
| Parser/tokenizer/string_tokenizer.c | parser/lexer/driver_string.go |
| Parser/tokenizer/file_tokenizer.c | parser/lexer/driver_file.go |
| Parser/tokenizer/readline_tokenizer.c | parser/lexer/driver_readline.go |
| Parser/tokenizer/helpers.c | parser/lexer/helpers.go |
Checklist
Status legend: [x] shipped, [ ] pending, [~] partial / scaffold,
[n] deferred / not in scope this phase.
Files
- [x] parser/lexer/state.go: State struct, Mode constants, New, Free. Indent stack, paren stack, f-string mode stack. Async-keyword tracking is intentionally not wired: 3.14 made async/await full hard keywords, so the pre-3.7 quirk is dead code in CPython too.
- [x] parser/lexer/buffer.go: collapses to a no-op pair plus reserveBuf. The C source's pointer-rebase dance is unnecessary because gopy stores offsets.
- [x] parser/lexer/lexer.go: regular-mode FSM (NAME, NUMBER, STRING single + triple, OP, NEWLINE/NL, INDENT/DEDENT, comment, ENDMARKER), type-comment branch, line continuation, and the entry into f-string mode all land. The f-string scanner itself sits in fstring.go. Async-keyword tracking is N/A on 3.14.
- [x] parser/lexer/fstring.go: f-string brace-balance scanner, :-format-spec mode, conversion specifiers.
- [n] parser/lexer/driver_utf8.go: collapsed into driver_string.go because Go strings are already UTF-8.
- [x] parser/lexer/driver_string.go: in-memory driver with BOM and PEP 263 cookie detection. The cookie scanner lives alongside in source.go and is exercised by source_test.go.
- [x] parser/lexer/driver_file.go: io.Reader driver with incremental refill.
- [x] parser/lexer/driver_readline.go: REPL driver over a func() (string, error) callback.
- [x] parser/lexer/helpers.go: shared decode / line-slicing / printable-ASCII filter ports of helpers.c.
- [x] parser/lexer/lexer_test.go: tokenisation panels including type comments. Indent/dedent, paren balance, and f-string nesting are pinned by sibling panels under partest/ (indent_test.go, paren_mismatch_test.go, fstring_nesting_test.go, fstring_walrus_test.go).
Surface guarantees
- Token kinds match the table generated for 1665. Pinned by parser/lexer/types_test.go referencing tokenize.Type.
- Indent/dedent emission matches CPython on the Lib/test/test_tokenize.py indentation corpus.
- Paren-mismatch errors quote the same span CPython quotes (start of the opening paren, spanning to the current position).
- f-string nesting depth panel: 0..6 levels with mixed : format specs reproduces CPython's emission.
- Encoding detection: UTF-8 BOM, ASCII default, PEP 263 cookies on line 1 and line 2, conflicting BOM-vs-cookie error message.
- CRLF, CR, LF line-ending normalisation matches CPython.
Cross-references
- Token table values: 1665.
- SyntaxError text: 1643.
- String literal post-processing: 1644.
Out of scope for v0.5.5
- Interactive readline. Lands in 1645 alongside v0.9 REPL work.
- tok->tok_extra_tokens for COMMENT / NL / ENCODING in extraTokens=true mode. Surface lands in 1665, lexer side lands here in v0.9.
Out of scope, period
- Free-threaded parser paths. The parser runs under one goroutine.