
1665. gopy tokenize

What we are porting

One file from cpython/Python/:

C source             Lines   Go target
Python-tokenize.c    445     tokenize/tokenize.go

Python-tokenize.c is the Python-visible wrapper around the lexer. It exposes the _tokenize.TokenizerIter class that tokenize.tokenize() in the stdlib delegates to. The actual lexer (state machine, indent tracking, f-string handling, encoding detection) lives in cpython/Parser/tokenizer/ and is in the parser spec series.

This spec is small on purpose. It covers only the iterator surface and its mapping to a Go iterator. The lexer port underneath is tracked elsewhere; v0.9 wires this module through the import system.

What the wrapper does

tokenizeriter_new constructs a tokenizeriterobject from:

  • a readline callable (io.TextIO.readline-shaped), or
  • a source string (the source keyword), with an optional extra_tokens flag.

tokenizeriter_next advances the underlying struct tok_state* one token at a time and returns a 5-tuple (type, string, start, end, line). The type value is one of the constants exported from token.h (NAME=1, NUMBER=2, STRING=3, OP=54, etc.). When extra_tokens=False the wrapper filters out COMMENT, NL, ENCODING, and NEWLINE-at-EOF.

The wrapper also handles:

  • Encoding detection on byte input. The start of the input is checked for a BOM and the first two physical lines for a PEP 263 coding: cookie (see the sketch after this list).
  • SyntaxError lifting. A lexer error becomes a Python SyntaxError with the right offset, end-offset, and source line.
  • Source-line slicing (from the line start up to tok->cur) to fill the line field of every token.
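
As a rough illustration of the cookie check only: the sketch below is not part of the spec surface; the helper name and package placement are assumptions, and only the regular expression (from PEP 263) and the UTF-8 BOM bytes are fixed facts.

package tokenize

import "regexp"

// codingRE is the PEP 263 pattern: a comment containing "coding[:=] <name>".
var codingRE = regexp.MustCompile(`^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)`)

// detectEncoding is a hypothetical helper: it inspects the raw bytes for a
// UTF-8 BOM and the first two physical lines for a coding cookie, defaulting
// to UTF-8 as CPython does.
func detectEncoding(firstTwoLines []string, raw []byte) string {
    if len(raw) >= 3 && raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF {
        return "utf-8" // a BOM wins over any cookie
    }
    for _, line := range firstTwoLines {
        if m := codingRE.FindStringSubmatch(line); m != nil {
            return m[1]
        }
    }
    return "utf-8"
}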

Go shape

package tokenize

type Type int

const (
    ENDMARKER Type = iota
    NAME
    NUMBER
    STRING
    NEWLINE
    INDENT
    DEDENT
    LPAR
    // ... full set generated from cpython/Grammar/Tokens.
)

type Pos struct{ Line, Col int }

type Token struct {
    Type  Type
    Value string
    Start Pos
    End   Pos
    Line  string
}

// Iter is the Go-side TokenizerIter equivalent. Next returns
// io.EOF when the stream is exhausted.
type Iter struct {
    // unexported handle into the lexer state
}

func New(src string, extraTokens bool) *Iter
func NewReadline(rl func() (string, error), extraTokens bool) *Iter

func (it *Iter) Next() (Token, error)
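
For orientation, a consumer of this surface would look roughly like the following. The import path and the formatting are illustrative assumptions, not part of the contract; only New, Next, and the io.EOF termination come from the shape above.

package main

import (
    "fmt"
    "io"
    "log"

    "gopy/tokenize" // import path is an assumption
)

func main() {
    it := tokenize.New("x = 1\n", false)
    for {
        tok, err := it.Next()
        if err == io.EOF {
            break // stream exhausted after ENDMARKER
        }
        if err != nil {
            log.Fatal(err) // lexer errors surface here
        }
        fmt.Printf("%v %q %d:%d-%d:%d\n",
            tok.Type, tok.Value,
            tok.Start.Line, tok.Start.Col, tok.End.Line, tok.End.Col)
    }
}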

The lexer state behind Iter is the Go port of cpython/Parser/tokenizer/. v0.5 stubs this with a parser-side lexer skeleton sufficient for the pipeline gate; the public surface in this package is stable across that change.

Token type constants are generated from cpython/Grammar/Tokens by tools/tokens_go and emitted into tokenize/types_gen.go. The generator is shared with the parser spec.
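
A minimal form of that generator might look like the sketch below. It follows the pycore_token.h input and token/ output layout described in the v0.5 checklist further down; the parsing details are assumptions and the real tools/tokens_go may differ (it also does not emit tokenNames or NTokens).

package main

import (
    "fmt"
    "os"
    "regexp"
    "strings"
)

var defineRe = regexp.MustCompile(`^#define\s+([A-Z_]+)\s+(\d+)`)

func main() {
    src, err := os.ReadFile("cpython/Include/internal/pycore_token.h")
    if err != nil {
        panic(err)
    }
    var b strings.Builder
    b.WriteString("// Code generated by tools/tokens_go. DO NOT EDIT.\n\npackage token\n\nconst (\n")
    for _, line := range strings.Split(string(src), "\n") {
        m := defineRe.FindStringSubmatch(line)
        if m == nil {
            continue
        }
        // Skip the bookkeeping macros that are not token kinds.
        switch m[1] {
        case "N_TOKENS", "NT_OFFSET", "ENCODING_OFFSET":
            continue
        }
        fmt.Fprintf(&b, "\t%s Type = %s\n", m[1], m[2])
    }
    b.WriteString(")\n")
    if err := os.WriteFile("token/types_gen.go", []byte(b.String()), 0o644); err != nil {
        panic(err)
    }
}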

Stdlib bridge

Python's tokenize.tokenize(readline) is implemented in Lib/tokenize.py and calls _tokenize.TokenizerIter. Once v0.9 wires the import system through frozen modules, our tokenize/ package backs that import via a Go module exposing TokenizerIter through the objects.Module protocol.

Errors

The wrapper preserves CPython's exact SyntaxError strings. Cases:

  • Unterminated string literal.
  • Inconsistent indentation.
  • Unmatched bracket.
  • Encoding declaration mismatch.
  • Tab/space mixing under -tt.

Each is reproduced verbatim from Python-tokenize.c plus the lexer error path.
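
One plausible Go-side shape for the lifted error is sketched below. The field names are assumptions; the spec only fixes the Error() text, which must be the CPython-verbatim message.

package tokenize

// SyntaxError is a sketch of the error lifted from a lexer failure.
// The fields mirror what Python-tokenize.c attaches to the exception.
type SyntaxError struct {
    Msg       string // verbatim CPython message
    Filename  string
    Line      int // 1-based line of the error
    Offset    int // 1-based column where the error starts
    EndOffset int
    Text      string // the offending source line
}

// Error returns the message unchanged, keeping the verbatim-string guarantee.
func (e *SyntaxError) Error() string { return e.Msg }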

Gate

Round-trip tokenize a panel of source files (test_tokenize.py fixtures from CPython) and assert the token stream matches what CPython's tokenize.tokenize produces, byte-for-byte on the value field and exact on type plus position. The gate lives in tokenize/tokenize_test.go and ships with v0.9 once the import plumbing exists; v0.5 adds the package skeleton plus a Go-only unit test of the iterator contract.
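
The parity check could take roughly this form. readGolden and the golden-file layout are hypothetical (the expected stream would be dumped offline by CPython's tokenize.tokenize); only the fixture source and the field-for-field comparison are fixed by the gate.

package tokenize_test

import (
    "io"
    "os"
    "testing"

    "gopy/tokenize" // import path is an assumption
)

// readGolden is a hypothetical helper that parses a token stream dumped by
// CPython's tokenize.tokenize into []tokenize.Token (parsing not shown).
func readGolden(t *testing.T, path string) []tokenize.Token { t.Helper(); return nil }

func TestCPythonParity(t *testing.T) {
    src, err := os.ReadFile("testdata/fixture.py")
    if err != nil {
        t.Fatal(err)
    }
    want := readGolden(t, "testdata/fixture.tokens")
    it := tokenize.New(string(src), true)
    for i := 0; ; i++ {
        tok, err := it.Next()
        if err == io.EOF {
            if i != len(want) {
                t.Fatalf("stream length: got %d tokens, want %d", i, len(want))
            }
            return
        }
        if err != nil {
            t.Fatal(err)
        }
        if i >= len(want) {
            t.Fatalf("extra token %d: %+v", i, tok)
        }
        if tok != want[i] {
            t.Fatalf("token %d: got %+v, want %+v", i, tok, want[i])
        }
    }
}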

Out of scope

  • The lexer state machine itself (Parser/tokenizer/). Separate spec.
  • untokenize. Lives in Lib/tokenize.py, ported as part of the stdlib effort.
  • Async tokenization. Not a CPython surface.

v0.5 checklist

This is the implementation contract. Each box maps to a test in 1625 §16.

Files

  • token/token.go: hand-written Type declaration plus String() method that delegates to the generated tokenNames table. Lives in its own package to mirror CPython's pycore_token.h / Lib/token.py split and avoid an import cycle with the parser's lexer.
  • tools/tokens_go/main.go: generator that reads Include/internal/pycore_token.h (the #define NAME N lines) and Lib/token.py (the NAME = N lines), filters N_TOKENS/NT_OFFSET/ENCODING_OFFSET, and emits token/types_gen.go.
  • token/types_gen.go: 69 token kinds (ENDMARKER=0..ENCODING=68) plus NTokens=69 and the tokenNames table.
  • tokenize/tokenize.go:
    • Pos struct{ Line, Col int }.
    • Token struct{ Type token.Type; Value string; Start, End Pos; Line string }. (Line is left empty; full source-line slicing tracked under the v0.9 surface guarantee below.)
    • Iter opaque struct holding the underlying *lexer.State.
    • New(src string, extraTokens bool) *Iter: wires lexer.FromString plus SetExtraTokens.
    • NewReadline(rl func() (string, error), extraTokens bool) *Iter: wires lexer.FromReadline with a []byte/string adapter.
    • (it *Iter) Next() (Token, error) returning io.EOF after the lexer's ENDMARKER lands.
  • tokenize/tokenize_test.go: ENDMARKER-then-EOF (sketched after this list), simple assignment kinds, readline driver smoke. The numeric/Type.String panel moved to token/token_test.go along with the constants.
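
The iterator-contract test might be sketched as follows. It assumes the token/tokenize package split described above and illustrative import paths; the exact assertions are not fixed beyond "ENDMARKER, then io.EOF".

package tokenize_test

import (
    "io"
    "testing"

    "gopy/token"    // import paths are assumptions
    "gopy/tokenize"
)

// TestEndmarkerThenEOF checks the Next contract: the stream ends with an
// ENDMARKER token, and calls after exhaustion keep returning io.EOF.
func TestEndmarkerThenEOF(t *testing.T) {
    it := tokenize.New("x = 1\n", false)
    var last tokenize.Token
    for {
        tok, err := it.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            t.Fatal(err)
        }
        last = tok
    }
    if last.Type != token.ENDMARKER {
        t.Fatalf("last token = %v, want ENDMARKER", last.Type)
    }
    if _, err := it.Next(); err != io.EOF {
        t.Fatalf("Next after exhaustion: err = %v, want io.EOF", err)
    }
}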

Surface guarantees

  • Numeric values of Type constants match cpython/Include/internal/pycore_token.h and cpython/Lib/token.py one-to-one. Pinned by TestTypeNumericValues.
  • extraTokens=false filters COMMENT, NL, ENCODING, and the trailing NEWLINE-at-EOF (CPython parity). (v0.9.)
  • Encoding detection on byte input recognises BOM and PEP 263 cookies. (v0.9.)
  • Lexer errors lift to a Go *SyntaxError whose Error() string is verbatim from Python-tokenize.c plus the lexer error path. (v0.9.)
  • Source-line slicing based on tok->cur fills the Line field of every token. (v0.9.)

Stdlib bridge (v0.9, tracked here)

  • objects.Module shim that exposes TokenizerIter so Lib/tokenize.py works once the import system is wired.

Out of scope for v0.5

  • The actual lexer state machine. Tracked under Parser/tokenizer/.
  • untokenize. Stdlib effort.