Modules/_sre.c: SRE Regex Engine

_sre.c is the C core of Python's re module. It implements the NFA-based backtracking matching engine (SRE, Secret Labs' Regular Expression Engine), plus the Python-facing SRE_Pattern and SRE_Match objects. The bytecode compiler lives separately in sre_compile.py (Lib/re/_compiler.py since Python 3.11); this file only executes precompiled opcodes.

Map

| Lines | Symbol | Role |
| --- | --- | --- |
| 1–120 | includes, SRE_CHAR typedef | platform portability, UCS-2/UCS-4 dispatch |
| 121–310 | SrePattern struct, SreMatch struct | C layout of pattern and match objects |
| 311–520 | mark stack helpers | save/restore group capture positions |
| 521–900 | sre_match | core NFA execution loop, opcode dispatch |
| 901–1050 | sre_search | anchored vs unanchored search, charset scan |
| 1051–1300 | pattern_match, pattern_search | Python-level entry points wrapping sre_match/sre_search |
| 1301–1550 | match_group, match_groups, match_span | match object attribute accessors |
| 1551–1750 | pattern_scanner | incremental scanner object |
| 1751–1950 | _sre_compile_impl | loads compiled bytecode from Python into SrePattern |
| 1951–2200 | pattern_richcompare, pattern_repr | pattern object protocol |
| 2201–2500 | module init, method tables | PyModuleDef, PyTypeObject for pattern and match |

Reading

SrePattern and SreMatch structs

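SrePattern stores the original pattern, the compiled opcode array, the flags, the number of groups, the group name maps (groupindex and indexgroup), and a weak-reference list (omitted from the abbreviated listing below). SreMatch holds a back-pointer to its pattern, the subject string (read through the buffer protocol when it is a bytes-like object), and the flat mark array that records (start, end) offset pairs for group 0 (the whole match) and every capture group.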

// CPython: Modules/_sre.c:121 SrePatternObject
typedef struct {
    PyObject_HEAD
    PyObject *pattern;      /* original pattern string */
    int flags;
    PyObject *code;         /* compiled bytecode (list of ints) */
    Py_ssize_t groups;
    PyObject *groupindex;
    PyObject *indexgroup;
} SrePatternObject;

// CPython: Modules/_sre.c:200 SreMatchObject
typedef struct {
    PyObject_HEAD
    PyObject *string;
    Py_ssize_t pos, endpos;
    SrePatternObject *pattern;
    Py_ssize_t lastindex;
    Py_ssize_t numgroups;
    Py_ssize_t mark[1];     /* flexible array: 2 * numgroups entries */
} SreMatchObject;
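
To make the flat mark layout concrete, here is a minimal standalone sketch. It does not use CPython's headers: ToyMatch, toy_match_new, and toy_group_span are hypothetical names chosen for illustration. It assumes that slot 2*i holds the start of group i and slot 2*i+1 its end, that numgroups counts group 0 (the whole match), and that -1 marks a group that did not take part in the match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the match object: a small header followed
   by a flexible array of 2 * numgroups offsets. */
typedef struct {
    ptrdiff_t numgroups;   /* capture groups plus group 0 */
    ptrdiff_t mark[];      /* mark[2*i] = start, mark[2*i+1] = end */
} ToyMatch;

/* Allocate a match whose groups all start out as "did not participate". */
ToyMatch *toy_match_new(ptrdiff_t numgroups)
{
    assert(numgroups >= 1);   /* group 0 always exists */
    size_t slots = 2 * (size_t)numgroups;
    ToyMatch *m = malloc(sizeof(ToyMatch) + slots * sizeof(ptrdiff_t));
    if (m == NULL)
        return NULL;
    m->numgroups = numgroups;
    for (size_t i = 0; i < slots; i++)
        m->mark[i] = -1;
    return m;
}

/* Read the (start, end) pair of one group; returns 0 on a bad index. */
int toy_group_span(const ToyMatch *m, ptrdiff_t group,
                   ptrdiff_t *start, ptrdiff_t *end)
{
    if (group < 0 || group >= m->numgroups)
        return 0;
    *start = m->mark[2 * group];
    *end = m->mark[2 * group + 1];
    return 1;
}
```

Because the header and the offsets come from a single allocation, releasing the match is one free call, and a group lookup is plain index arithmetic with no per-group objects.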

sre_match NFA loop

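sre_match is a backtracking opcode interpreter whose main loop switches on the opcode at the current pattern position. Group boundaries live in state->mark[]; MARK_PUSH saves the current marks before the engine enters a group so that a backtrack can restore them. Repetition opcodes (REPEAT, MAX_UNTIL, MIN_UNTIL) carry their own saved-state structs, pushed onto a heap-allocated data stack owned by SRE_STATE rather than onto the C call frame, so deep backtracking does not exhaust the C stack.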

// CPython: Modules/_sre.c:521 sre_match
static Py_ssize_t
sre_match(SRE_STATE *state, SRE_CODE *pattern, int match_all)
{
    /* ... opcode switch ... */
        case SRE_OP_MARK:
            /* save/restore group boundary in state->mark[] */
            MARK_PUSH(lastmark);
            state->mark[pattern[1]] = ptr;
            pattern += 2;
            break;
    /* ... */
}
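
The save-and-restore discipline is easier to follow in a toy interpreter. The sketch below is not SRE: the opcodes (OP_LITERAL, OP_ANY, OP_MARK, OP_SPLIT, OP_JUMP, OP_SUCCESS), the ToyFrame struct, and toy_match are all hypothetical, and the backtrack stack is a fixed-size local array purely for brevity. It only illustrates the general technique of snapshotting the marks at each choice point and restoring them when an alternative fails.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy opcodes, not SRE's real ones. */
enum { OP_LITERAL, OP_ANY, OP_MARK, OP_SPLIT, OP_JUMP, OP_SUCCESS };

#define TOY_MAX_MARKS 8    /* mark slots available to a program */
#define TOY_MAX_STACK 64   /* pending choice points */

/* One choice point: where to resume, and the marks as they were then. */
typedef struct {
    size_t pc;
    const char *ptr;
    const char *mark[TOY_MAX_MARKS];
} ToyFrame;

/* Returns 1 on a match at ptr, 0 on no match, -1 on stack overflow or
   an unknown opcode. The caller passes mark[] with TOY_MAX_MARKS
   entries, initialised to NULL. */
int toy_match(const int *code, const char *ptr, const char *end,
              const char **mark)
{
    ToyFrame stack[TOY_MAX_STACK];
    size_t sp = 0, pc = 0;

    for (;;) {
        int failed = 0;
        switch (code[pc]) {
        case OP_LITERAL:                 /* <LITERAL> <ch> */
            if (ptr < end && *ptr == (char)code[pc + 1]) { ptr++; pc += 2; }
            else failed = 1;
            break;
        case OP_ANY:                     /* <ANY> */
            if (ptr < end) { ptr++; pc += 1; }
            else failed = 1;
            break;
        case OP_MARK:                    /* <MARK> <slot> */
            assert(code[pc + 1] >= 0 && code[pc + 1] < TOY_MAX_MARKS);
            mark[code[pc + 1]] = ptr;
            pc += 2;
            break;
        case OP_SPLIT:                   /* <SPLIT> <first> <second> */
            if (sp == TOY_MAX_STACK)
                return -1;
            stack[sp].pc = (size_t)code[pc + 2];   /* alternative to try later */
            stack[sp].ptr = ptr;
            memcpy(stack[sp].mark, mark, sizeof stack[sp].mark);
            sp++;
            pc = (size_t)code[pc + 1];
            break;
        case OP_JUMP:                    /* <JUMP> <target> */
            pc = (size_t)code[pc + 1];
            break;
        case OP_SUCCESS:
            return 1;
        default:
            return -1;
        }
        if (failed) {
            if (sp == 0)
                return 0;                /* no alternatives left */
            sp--;                        /* backtrack to the last choice point */
            pc = stack[sp].pc;
            ptr = stack[sp].ptr;
            memcpy(mark, stack[sp].mark, sizeof stack[sp].mark);
        }
    }
}
```

Consider a program that tries a, then ab, writes mark slot 1, and finally requires c. On the subject "abc" the first alternative consumes a and records slot 1 at offset 1, then fails on c. The backtrack restores slot 1 to its earlier value before the second alternative records it again at offset 2. Without the snapshot, a failed branch could leave stale group boundaries behind.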

SRE_SEARCH vs SRE_MATCH entry

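sre_search advances the subject pointer and calls sre_match at each candidate position. For patterns anchored at the start (a leading \A, or ^ without SRE_FLAG_MULTILINE) it stops after one attempt, since no later position can match. At the Python level, pattern.search goes through sre_search, while pattern.match calls sre_match once at the starting position. The match_all flag is what separates pattern.fullmatch, which must consume the whole subject, from pattern.match.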
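
The scan-and-retry shape of the search entry point can be sketched in a few lines. This is an illustration under stated assumptions, not the real sre_search: toy_search, toy_match_fn, and toy_literal_at are hypothetical names, the matcher is passed as a callback, and the prefix and charset scans that the real function uses to skip hopeless positions are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical matcher signature: returns nonzero if the pattern
   matches starting exactly at ptr. */
typedef int (*toy_match_fn)(const char *ptr, const char *end, void *ctx);

/* Returns the start of the leftmost match in [start, end], or NULL.
   With anchored set, only the first position is tried. */
const char *toy_search(toy_match_fn match_at, void *ctx,
                       const char *start, const char *end, int anchored)
{
    assert(start <= end);
    /* ptr == end is a valid candidate: an empty match can occur there. */
    for (const char *ptr = start; ptr <= end; ptr++) {
        if (match_at(ptr, end, ctx))
            return ptr;
        if (anchored)
            break;   /* an anchored pattern cannot match further right */
    }
    return NULL;
}

/* Example matcher: ctx is a NUL-terminated literal to compare at ptr. */
int toy_literal_at(const char *ptr, const char *end, void *ctx)
{
    const char *lit = ctx;
    size_t n = strlen(lit);
    return (size_t)(end - ptr) >= n && memcmp(ptr, lit, n) == 0;
}
```

Searching "xxabc" for the literal abc finds it at offset 2 when unanchored and finds nothing when anchored, because the single permitted attempt at offset 0 fails.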

gopy notes

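  • The opcode constants are defined in sre_constants.py and emitted by the compiler in sre_compile.py. They are versioned by a MAGIC number that changes whenever the opcode set does, so they are not stable across releases. A Go port can vendor the constants directly, pinned to the MAGIC of the CPython version it targets.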
  • The mark array is sized at match-object allocation time using 2 * (pattern.groups + 1) slots; the extra pair holds group 0, the whole match. Go equivalent: a []int field on the match struct, allocated in newMatch.
  • Pattern object caching (re._cache) is handled in pure Python above this layer; the C struct has no cache slot.
  • Atomic grouping and possessive quantifiers: supported by re since Python 3.11, which added the ATOMIC_GROUP, POSSESSIVE_REPEAT, and POSSESSIVE_REPEAT_ONE opcodes to the opcode table. A port targeting 3.14 has to implement them.