1665. gopy tokenize
What we are porting
One file from cpython/Python/:
| C source | Lines | Go target |
|---|---|---|
| Python-tokenize.c | 445 | tokenize/tokenize.go |
Python-tokenize.c is the Python-visible wrapper around the lexer. It
exposes the _tokenize.TokenizerIter class that tokenize.tokenize()
in the stdlib delegates to. The actual lexer (state machine, indent
tracking, f-string handling, encoding detection) lives in
cpython/Parser/tokenizer/ and is in the parser spec series.
This spec is small on purpose. It covers only the iterator surface and its mapping to a Go iterator. The lexer port underneath is tracked elsewhere; v0.9 wires this module through the import system.
What the wrapper does
tokenizeriter_new constructs a tokenizeriterobject from:
- a readline callable (io.TextIO.readline-shaped), or
- a source string (the source keyword), with an optional extra_tokens flag.
tokenizeriter_next advances the underlying struct tok_state* one
token at a time and returns a 5-tuple
(type, string, start, end, line). The type value is one of the
constants exported from token.h (NAME=1, NUMBER=2, STRING=3, OP=54,
etc.). When extra_tokens=False the wrapper filters out COMMENT, NL,
ENCODING, and NEWLINE-at-EOF.
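For orientation, a minimal sketch of what that filter could look like on the Go side, assuming the generated kind constants described under "Go shape" below (COMMENT, NL, ENCODING); the NEWLINE-at-EOF case is positional and would be handled inside Next itself:

// skip reports whether a token of kind t should be dropped. Sketch only:
// the kind constants come from the generated token set, and the trailing
// NEWLINE at EOF is filtered by position rather than by kind.
func skip(t Type, extraTokens bool) bool {
    if extraTokens {
        return false
    }
    return t == COMMENT || t == NL || t == ENCODING
}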
The wrapper also handles:
- Encoding detection on byte input. The first physical line is parsed for a BOM or a PEP 263 coding: cookie (sketched after this list).
- SyntaxError lifting. A lexer error becomes a Python SyntaxError with the right offset, end-offset, and source line.
- tok->cur to source-line slicing for the line field of every token.
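A minimal sketch of the coding-cookie check the Go port will need, simplified against CPython (which also inspects the second line and requires the cookie to sit in a comment); the helper name and the regexp spelling are assumptions:

package tokenize

import (
    "bytes"
    "regexp"
)

// codingRE approximates the PEP 263 pattern "coding[:=]\s*([-\w.]+)".
var codingRE = regexp.MustCompile(`coding[:=][ \t]*([-_.a-zA-Z0-9]+)`)

// detectEncoding checks the UTF-8 BOM first, then looks for a PEP 263
// cookie on the first physical line; utf-8 is CPython's default.
func detectEncoding(firstLine []byte) string {
    if bytes.HasPrefix(firstLine, []byte{0xEF, 0xBB, 0xBF}) {
        return "utf-8"
    }
    if m := codingRE.FindSubmatch(firstLine); m != nil {
        return string(m[1])
    }
    return "utf-8"
}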
Go shape
package tokenize

type Type int

const (
    ENDMARKER Type = iota
    NAME
    NUMBER
    STRING
    NEWLINE
    INDENT
    DEDENT
    LPAR
    // ... full set generated from cpython/Grammar/Tokens.
)

type Pos struct{ Line, Col int }

type Token struct {
    Type  Type
    Value string
    Start Pos
    End   Pos
    Line  string
}

// Iter is the Go-side TokenizerIter equivalent. Next returns
// io.EOF when the stream is exhausted.
type Iter struct {
    // unexported handle into the lexer state
}

func New(src string, extraTokens bool) *Iter
func NewReadline(rl func() (string, error), extraTokens bool) *Iter
func (it *Iter) Next() (Token, error)
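Intended usage, as a sketch (the import path is assumed; error handling follows the io.EOF convention above):

package main

import (
    "fmt"
    "io"
    "log"

    "gopy/tokenize" // module path assumed for illustration
)

func main() {
    it := tokenize.New("x = 1\n", false)
    for {
        tok, err := it.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err) // a lexer error, lifted as described under Errors
        }
        fmt.Printf("%v %q %d:%d-%d:%d\n", tok.Type, tok.Value,
            tok.Start.Line, tok.Start.Col, tok.End.Line, tok.End.Col)
    }
}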
The lexer state behind Iter is the Go port of
cpython/Parser/tokenizer/. v0.5 stubs this with a parser-side
lexer skeleton sufficient for the pipeline gate; the public surface in
this package is stable across that change.
Token type constants are generated from cpython/Grammar/Tokens by
tools/tokens_go and emitted into tokenize/types_gen.go. The
generator is shared with the parser spec.
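For orientation, the generated file would look roughly like this excerpt, replacing the hand-enumerated constants sketched above; the numeric values are CPython's, while the package clause and table layout are whatever the generator and the token/tokenize split in the checklist below settle on:

// Code generated by tools/tokens_go. DO NOT EDIT.

package tokenize

const (
    ENDMARKER Type = 0
    NAME      Type = 1
    NUMBER    Type = 2
    STRING    Type = 3
    NEWLINE   Type = 4
    INDENT    Type = 5
    DEDENT    Type = 6
    // ... remaining kinds elided ...
)

// tokenNames backs Type.String(); indexed by the numeric kind.
var tokenNames = [...]string{
    ENDMARKER: "ENDMARKER",
    NAME:      "NAME",
    NUMBER:    "NUMBER",
    STRING:    "STRING",
    NEWLINE:   "NEWLINE",
    INDENT:    "INDENT",
    DEDENT:    "DEDENT",
    // ...
}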
Stdlib bridge
Python's tokenize.tokenize(readline) is implemented in
Lib/tokenize.py and calls _tokenize.TokenizerIter. Once v0.9
wires the import system through frozen modules, our tokenize/
package backs that import via a Go module exposing TokenizerIter
through the objects.Module protocol.
Errors
The wrapper preserves CPython's exact SyntaxError strings. Cases:
- Unterminated string literal.
- Inconsistent indentation.
- Unmatched bracket.
- Encoding declaration mismatch.
- Tab/space mixing under -tt.
Each is reproduced verbatim from Python-tokenize.c plus the lexer
error path.
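A minimal sketch of the Go-side error type, assuming field names this spec does not pin down; only the message text must match CPython verbatim:

package tokenize

// SyntaxError mirrors what Python-tokenize.c attaches to the exception:
// the verbatim message plus start/end offsets and the source line.
// Field names here are assumptions; only the message text is pinned.
type SyntaxError struct {
    Msg       string // verbatim CPython text, e.g. "unterminated string literal (detected at line 3)"
    Line      int    // 1-based start line
    Offset    int    // 1-based start column
    EndLine   int
    EndOffset int
    Text      string // the offending source line
}

// Error returns the message unadorned; callers that need positions read
// the fields directly, as Python code does on the exception object.
func (e *SyntaxError) Error() string { return e.Msg }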
Gate
Round-trip tokenize a panel of source files (test_tokenize.py
fixtures from CPython) and assert the token stream matches what
CPython's tokenize.tokenize produces, byte-for-byte on the value
field and exact on type plus position. The gate lives in
tokenize/tokenize_test.go and ships with v0.9 once the import
plumbing exists; v0.5 adds the package skeleton plus a Go-only unit
test of the iterator contract.
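A sketch of how that gate might be structured; the fixture layout, golden-file format, and the loadGolden helper are placeholders, not part of the contract:

package tokenize_test

import (
    "io"
    "os"
    "path/filepath"
    "testing"

    "gopy/tokenize" // module path assumed
)

func TestCPythonParity(t *testing.T) {
    fixtures, err := filepath.Glob("testdata/fixtures/*.py")
    if err != nil {
        t.Fatal(err)
    }
    for _, path := range fixtures {
        src, err := os.ReadFile(path)
        if err != nil {
            t.Fatal(err)
        }
        // loadGolden is a placeholder: it would parse a dump of
        // CPython's tokenize.tokenize output for the same fixture.
        want := loadGolden(t, path+".tokens")
        it := tokenize.New(string(src), false)
        for i := 0; ; i++ {
            tok, err := it.Next()
            if err == io.EOF {
                break
            }
            if err != nil {
                t.Fatalf("%s: %v", path, err)
            }
            if i >= len(want) || tok != want[i] {
                t.Fatalf("%s: token %d mismatch: got %+v", path, i, tok)
            }
        }
    }
}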
Out of scope
- The lexer state machine itself (Parser/tokenizer/). Separate spec.
- untokenize. Lives in Lib/tokenize.py, ported as part of the stdlib effort.
- Async tokenization. Not a CPython surface.
v0.5 checklist
This is the implementation contract. Each box maps to a test in 1625 §16.
Files
- token/token.go: hand-written Type declaration plus a String() method that delegates to the generated tokenNames table. Lives in its own package to mirror CPython's pycore_token.h / Lib/token.py split and avoid an import cycle with the parser's lexer.
- tools/tokens_go/main.go: generator that reads Include/internal/pycore_token.h (the #define NAME N lines) and Lib/token.py (the NAME = N lines), filters N_TOKENS / NT_OFFSET / ENCODING_OFFSET, and emits token/types_gen.go.
- token/types_gen.go: 69 token kinds (ENDMARKER=0 .. ENCODING=68) plus NTokens=69 and the tokenNames table.
- tokenize/tokenize.go:
  - Pos struct{ Line, Col int }.
  - Token struct{ Type token.Type; Value string; Start, End Pos; Line string }. (Line is left empty; full source-line slicing is tracked under the v0.9 surface guarantee below.)
  - Iter opaque struct holding the underlying *lexer.State.
  - New(src string, extraTokens bool) *Iter: wires lexer.FromString plus SetExtraTokens.
  - NewReadline(rl func() (string, error), extraTokens bool) *Iter: wires lexer.FromReadline with a []byte/string adapter.
  - (it *Iter) Next() (Token, error): returns io.EOF after the lexer's ENDMARKER lands.
- tokenize/tokenize_test.go: ENDMARKER-then-EOF (sketched below), simple assignment kinds, readline driver smoke. The numeric/Type.String panel moved to token/token_test.go along with the constants.
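The iterator-contract test reads roughly like this sketch (import paths assumed, exact assertions up to the implementation):

package tokenize_test

import (
    "io"
    "testing"

    "gopy/token"    // module paths assumed
    "gopy/tokenize"
)

// Contract: once ENDMARKER has been delivered, every further Next
// call returns io.EOF.
func TestEndmarkerThenEOF(t *testing.T) {
    it := tokenize.New("", false)
    for {
        tok, err := it.Next()
        if err != nil {
            t.Fatalf("error before ENDMARKER: %v", err)
        }
        if tok.Type == token.ENDMARKER {
            break
        }
    }
    if _, err := it.Next(); err != io.EOF {
        t.Fatalf("after ENDMARKER: got %v, want io.EOF", err)
    }
}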
Surface guarantees
- Numeric values of Type constants match cpython/Include/internal/pycore_token.h and cpython/Lib/token.py one-to-one. Pinned by TestTypeNumericValues (sketched after this list).
- extraTokens=false filters COMMENT, NL, ENCODING, and the trailing NEWLINE-at-EOF (CPython parity). (v0.9.)
- Encoding detection on byte input recognises BOM and PEP 263 cookies. (v0.9.)
- Lexer errors lift to a Go *SyntaxError whose Error() string is verbatim from Python-tokenize.c plus the lexer error path. (v0.9.)
- tok->cur slicing for the Line field of every token. (v0.9.)
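TestTypeNumericValues could be as small as the sketch below; the real pin walks all 69 kinds (import path assumed):

package token_test

import (
    "testing"

    "gopy/token" // module path assumed
)

// Pins a sample of the generated constants to CPython's numeric values.
func TestTypeNumericValues(t *testing.T) {
    cases := map[token.Type]int{
        token.ENDMARKER: 0,
        token.NAME:      1,
        token.NUMBER:    2,
        token.STRING:    3,
        token.NEWLINE:   4,
        token.INDENT:    5,
        token.DEDENT:    6,
    }
    for typ, want := range cases {
        if int(typ) != want {
            t.Fatalf("%v = %d, want %d", typ, int(typ), want)
        }
    }
}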
Stdlib bridge (v0.9, tracked here)
- objects.Module shim that exposes TokenizerIter so Lib/tokenize.py works once the import system is wired.
Out of scope for v0.5
- The actual lexer state machine. Tracked under Parser/tokenizer/.
- untokenize. Stdlib effort.