Modules/_sre.c: SRE Regex Engine
_sre.c is the C heart of Python's re module. It implements the backtracking,
NFA-style pattern matching engine (SRE, Secret Labs' Regular Expression
Engine), plus the SRE_Pattern and SRE_Match Python-facing objects. The
bytecode compiler lives separately in sre_compile.py (Lib/re/_compiler.py
since Python 3.11); this file only executes precompiled opcodes.
Map
| Lines | Symbol | Role |
|---|---|---|
| 1–120 | includes, SRE_CHAR typedef | platform portability, UCS-2/UCS-4 dispatch |
| 121–310 | SrePattern struct, SreMatch struct | C layout of pattern and match objects |
| 311–520 | mark stack helpers | save/restore group capture positions |
| 521–900 | sre_match | core NFA execution loop, opcode dispatch |
| 901–1050 | sre_search | anchored vs unanchored search, charset scan |
| 1051–1300 | pattern_match, pattern_search | Python-level entry points wrapping sre_match/sre_search |
| 1301–1550 | match_group, match_groups, match_span | match object attribute accessors |
| 1551–1750 | pattern_scanner | incremental scanner object |
| 1751–1950 | _sre_compile_impl | loads compiled bytecode from Python into SrePattern |
| 1951–2200 | pattern_richcompare, pattern_repr | pattern object protocol |
| 2201–2500 | module init, method tables | PyModuleDef, PyTypeObject for pattern and match |
Reading
SrePattern and SreMatch structs
SrePattern stores the compiled opcode array, flags, number of groups, and a
weak-reference list (omitted from the simplified layout below). SreMatch
holds a back-pointer to its pattern, the subject string (read through the
buffer protocol when it is bytes-like), and the flat mark array that records
(start, end) pairs for the whole match and for every capture group.
// CPython: Modules/_sre.c:121 SrePatternObject
typedef struct {
    PyObject_HEAD
    PyObject *pattern;      /* original pattern string */
    int flags;
    PyObject *code;         /* compiled bytecode (list of ints) */
    Py_ssize_t groups;
    PyObject *groupindex;
    PyObject *indexgroup;
} SrePatternObject;

// CPython: Modules/_sre.c:200 SreMatchObject
typedef struct {
    PyObject_HEAD
    PyObject *string;
    Py_ssize_t pos, endpos;
    SrePatternObject *pattern;
    Py_ssize_t lastindex;
    Py_ssize_t numgroups;
    Py_ssize_t mark[1];     /* flexible array: 2 * numgroups entries */
} SreMatchObject;
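To make the flat mark layout concrete, here is a minimal sketch of a span lookup. It assumes the layout is [start0, end0, start1, end1, ...], that pair 0 holds the whole match, and that -1 marks a group that did not participate. The helper name group_span and its return convention are hypothetical, not CPython functions.

```c
#include <stddef.h>

/* Hypothetical helper, not a CPython function: look up the span of one
 * group in a flat mark array laid out as [start0, end0, start1, end1, ...].
 * Pair 0 is assumed to hold the whole match; -1 marks an unset group. */
static int
group_span(const ptrdiff_t *mark, size_t npairs, size_t index,
           ptrdiff_t *start, ptrdiff_t *end)
{
    if (index >= npairs)
        return -1;                  /* no such group */
    *start = mark[2 * index];
    *end = mark[2 * index + 1];
    if (*start < 0 || *end < 0)
        return 0;                   /* group did not participate */
    return 1;                       /* span is valid */
}
```

With marks {0, 5, 1, 3, -1, -1}, group 1 spans offsets 1 to 3 and group 2 is reported as unset.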
sre_match NFA loop
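sre_match is an opcode interpreter: a loop that dispatches on the current
opcode (a switch, or computed gotos where the compiler supports them). The
engine saves the current marks (MARK_PUSH) at points where it may later
backtrack and restores them when it does. Repetition opcodes (REPEAT,
MAX_UNTIL, MIN_UNTIL) carry their own saved-state structs. Saved marks and
per-opcode match contexts live on a heap-allocated data stack owned by
SRE_STATE, not in the C call frame, so deep backtracking does not exhaust
the C stack.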
// CPython: Modules/_sre.c:521 sre_match
static Py_ssize_t
sre_match(SRE_STATE *state, SRE_CODE *pattern, int match_all)
{
    /* ... locals such as ptr and lastmark elided ... */
    for (;;) {
        switch (pattern[0]) {
        /* ... other opcodes ... */
        case SRE_OP_MARK:
            /* save the current marks, then record this group
               boundary in state->mark[] */
            MARK_PUSH(lastmark);
            state->mark[pattern[1]] = ptr;
            pattern += 2;
            break;
        /* ... */
        }
    }
}
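The save-and-restore idea behind the mark stack can be sketched in isolation. This is a simplified illustration, not CPython's code: MarkStack, mark_push, and mark_pop are hypothetical names, and the snapshot is a plain copy of the first count marks onto a growable heap buffer.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; this mirrors the idea, not CPython's code. */
typedef struct {
    ptrdiff_t *data;    /* heap buffer holding saved mark snapshots */
    size_t used;        /* number of slots currently in use */
    size_t cap;         /* number of slots allocated */
} MarkStack;

/* Save marks[0..count-1] before trying something that may fail.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int
mark_push(MarkStack *st, const ptrdiff_t *marks, size_t count)
{
    if (count == 0)
        return 0;
    if (st->used + count > st->cap) {
        size_t ncap = st->cap ? st->cap * 2 : 16;
        while (ncap < st->used + count)
            ncap *= 2;
        ptrdiff_t *p = realloc(st->data, ncap * sizeof *p);
        if (p == NULL)
            return -1;
        st->data = p;
        st->cap = ncap;
    }
    memcpy(st->data + st->used, marks, count * sizeof *marks);
    st->used += count;
    return 0;
}

/* Restore the most recent snapshot on backtrack. The caller must pass
 * the same count that was used for the matching mark_push. */
static void
mark_pop(MarkStack *st, ptrdiff_t *marks, size_t count)
{
    if (count == 0)
        return;
    st->used -= count;
    memcpy(marks, st->data + st->used, count * sizeof *marks);
}
```

A caller pushes before a speculative attempt, lets the attempt overwrite the marks freely, and pops only if the attempt fails. Keeping the buffer on the heap is what lets the depth of backtracking grow without touching the C stack.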
SRE_SEARCH vs SRE_MATCH entry
sre_search advances the subject pointer and calls sre_match at each
candidate position. For patterns anchored at the start of the string (a
leading \A, or ^ without SRE_FLAG_MULTILINE) it stops after one attempt,
since no later position can match. The Python-visible pattern.match calls
sre_match once at pos, while pattern.search goes through sre_search; the
match_all flag is what separates pattern.fullmatch from pattern.match.
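The relationship between the two entry points can be sketched with a toy matcher. Everything here is a stand-in: toy_match plays the role of sre_match by testing a literal at one position, and toy_search plays the role of sre_search by scanning start positions. The names and signatures are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for sre_match: succeeds if `literal` occurs at `pos`.
 * Returns the end offset of the match, or -1 on failure. */
static ptrdiff_t
toy_match(const char *subject, size_t len, size_t pos, const char *literal)
{
    size_t n = strlen(literal);
    if (pos > len || n > len - pos)
        return -1;
    if (memcmp(subject + pos, literal, n) != 0)
        return -1;
    return (ptrdiff_t)(pos + n);
}

/* Toy stand-in for sre_search: try each start position in turn.
 * When `anchored` is nonzero only the first position is attempted.
 * On success stores the start in *start and returns the end offset. */
static ptrdiff_t
toy_search(const char *subject, size_t pos, const char *literal,
           int anchored, size_t *start)
{
    size_t len = strlen(subject);
    for (size_t i = pos; i <= len; i++) {
        ptrdiff_t end = toy_match(subject, len, i, literal);
        if (end >= 0) {
            *start = i;
            return end;
        }
        if (anchored)
            break;      /* an anchored pattern cannot match later */
    }
    return -1;
}
```

Searching "xxabc" for "abc" succeeds at offset 2 when unanchored and fails when anchored, because the single permitted attempt at offset 0 does not match.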
gopy notes
- The opcode set is defined alongside sre_compile.py in sre_constants.py and is versioned by a MAGIC number that changes whenever the opcodes change. A Go port can vendor the opcode constants directly for a pinned CPython version.
- The mark array is sized at match-object allocation time using 2 * (pattern.groups + 1) slots; the extra pair is group 0, the whole match. Go equivalent: a []int field on the match struct, allocated in newMatch.
- Pattern object caching (re._cache) is handled in pure Python above this layer; the C struct has no cache slot.
- Atomic grouping and possessive quantifiers arrived in Python 3.11, which added the opcodes ATOMIC_GROUP, POSSESSIVE_REPEAT, and POSSESSIVE_REPEAT_ONE to the opcode table. The feature is expressed through these opcodes; the pattern struct needs no extra flag field.
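The sizing rule for the mark array can be sketched as follows, assuming one extra pair for the whole match (group 0) and -1 as the "unset" value. ToyMatch and toy_match_new are hypothetical names; a Go port would do the same job with a single make([]int, ...) call in its constructor.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical match record; field names are illustrative only. */
typedef struct {
    size_t npairs;      /* capture groups + 1 for the whole match */
    ptrdiff_t mark[];   /* 2 * npairs entries: start, end, start, end, ... */
} ToyMatch;

/* Allocate a match for a pattern with `groups` capture groups and
 * initialise every mark to -1 (unset). Returns NULL on allocation failure. */
static ToyMatch *
toy_match_new(size_t groups)
{
    size_t npairs = groups + 1;
    ToyMatch *m = malloc(sizeof *m + 2 * npairs * sizeof m->mark[0]);
    if (m == NULL)
        return NULL;
    m->npairs = npairs;
    for (size_t i = 0; i < 2 * npairs; i++)
        m->mark[i] = -1;
    return m;
}
```

Sizing the array once at allocation means the match object never has to grow, which is why the group count must be known before the match is created.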