Lib/html/parser.py

Source:

cpython 3.14 @ ab2d84fe1023/Lib/html/parser.py

Map

| Lines | Symbol | Purpose |
| --- | --- | --- |
| 1–50 | imports, constants | Compiled regexes: tagfind_tolerant, attrfind_tolerant, entity patterns |
| 51–100 | HTMLParser.__init__ | Buffer init, convert_charrefs flag (default True) |
| 101–160 | HTMLParser.feed, close | Incremental buffering; close flushes end-of-stream |
| 161–310 | HTMLParser.goahead | Central state machine: DATA, TAG, ATTR, COMMENT, CDATA dispatch |
| 311–390 | handle_starttag, handle_endtag | Callback stubs; subclasses override these |
| 391–440 | handle_data, handle_comment | Data and comment callback stubs |
| 441–470 | handle_entityref, handle_charref | Entity and character reference callback stubs |
| 471–500 | getpos, reset | (line, offset) position tracking with a 1-based line and a 0-based column offset; buffer reset |

Reading

feed buffering and the goahead loop

feed appends incoming text to self.rawdata and immediately calls goahead(0). The 0 argument signals "not end of file", so goahead leaves incomplete tokens in the buffer to be completed by a later feed call instead of forcing them out. close calls goahead(1) to flush any remaining content.

# CPython: Lib/html/parser.py:101 HTMLParser.feed
def feed(self, data):
    self.rawdata = self.rawdata + data
    self.goahead(0)

def close(self):
    HTMLParser.goahead(self, 1)
    self.rawdata = ''
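The buffering can be observed from the outside by splitting a tag across two feed calls. The following is a minimal sketch, not part of the CPython source; the EventLog class and its events list are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Hypothetical helper: records every callback in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = EventLog()
parser.feed("<di")                  # incomplete start tag stays in rawdata
after_first_chunk = list(parser.events)
parser.feed("v>hi</div>")           # the rest arrives and the tag is completed
parser.close()
all_events = list(parser.events)

print(after_first_chunk)
print(all_events)
```

No callback fires for the first chunk because the start tag has no closing > yet; the second feed call completes it.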

goahead scans self.rawdata with a while loop. On the convert_charrefs path, each iteration locates the next < with rawdata.find('<', i) and dispatches the text before that position to handle_data once its character references are resolved. When no < is found and end is false, text that might end in a truncated character reference (an & near the end of the buffer with no ; or whitespace after it) stays in the buffer for the next feed call; other trailing text is delivered right away.

# CPython: Lib/html/parser.py:161 HTMLParser.goahead
def goahead(self, end):
    # Abridged excerpt: only the "no '<' left in the buffer" branch is shown.
    rawdata = self.rawdata
    i = 0
    n = len(rawdata)
    while i < n:
        # cdata_elem is None outside <script>/<style> content
        if self.convert_charrefs and not self.cdata_elem:
            j = rawdata.find('<', i)
            if j < 0:
                if end:
                    self.handle_data(rawdata[i:n])
                    i = n
                break
        # ... tag dispatch continues
    self.rawdata = rawdata[i:]
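To see which trailing text is held back in practice, the sketch below feeds plain text and a character reference cut in half. It is an observation of the standard-library parser and not an excerpt from it; DataLog and chunks are hypothetical names.

```python
from html.parser import HTMLParser


class DataLog(HTMLParser):
    """Hypothetical helper: collects every handle_data payload."""

    def __init__(self):
        super().__init__()          # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Plain text with no '&' near the end is delivered without waiting.
plain = DataLog()
plain.feed("hello")
plain_chunks = list(plain.chunks)

# A reference split across chunks is held until it can be resolved.
split = DataLog()
split.feed("a &am")
held_back = list(split.chunks)
split.feed("p; b")
split.close()
resolved = "".join(split.chunks)

print(plain_chunks, held_back, resolved)
```

A port that flushes text eagerly without this check would hand callers the two halves of the reference separately.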

Attribute parsing and handle_starttag

When goahead identifies a start tag it hands the position to parse_starttag, which calls handle_starttag(tag, attrs) with the tag name lowercased. The attrs argument is a list of (name, value) tuples with lowercased names. Bare attributes such as disabled produce (name, None). The tagfind_tolerant and attrfind_tolerant regexes accept spaces around = and unquoted values to handle real-world malformed HTML.

# CPython: Lib/html/parser.py:311 HTMLParser.handle_starttag
def handle_starttag(self, tag, attrs):
    pass  # override in subclass

def handle_endtag(self, tag):
    pass
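A subclass that overrides handle_starttag shows the shape of attrs. This is a small sketch under the assumptions above; AttrLog and seen are hypothetical names, and the input deliberately mixes an unquoted value, a bare attribute, and spaces around =.

```python
from html.parser import HTMLParser


class AttrLog(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


parser = AttrLog()
parser.feed('<INPUT type=text disabled value = "x">')
parser.close()

for tag, attrs in parser.seen:
    print(tag, attrs)
```

The bare disabled attribute arrives with a value of None, which a port needs to distinguish from an empty string.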

In CPython 3.14, tagfind_tolerant was tightened to reject tag names beginning with a digit. Markup such as <3foo> is now passed to handle_data as raw text instead of being treated as a tag, aligning behavior more closely with the HTML5 specification.
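The digit case can be checked directly. The sketch below only observes the interpreter it runs on, so it does not establish which release introduced the behavior; TagOrText, tags, and text are hypothetical names.

```python
from html.parser import HTMLParser


class TagOrText(HTMLParser):
    """Hypothetical helper: separates start tags from text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


parser = TagOrText()
parser.feed("<3foo>")
parser.close()

# The text may arrive in more than one handle_data call, so join it.
joined_text = "".join(parser.text)
print(parser.tags, repr(joined_text))
```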

CDATA mode for script and style

Inside <script> and <style> elements the parser enters CDATA mode via set_cdata_mode. In this mode goahead scans for the literal closing tag string instead of treating every < as a potential new tag. This prevents content like if (a < b) from being misread as markup.

# CPython: Lib/html/parser.py:270 set_cdata_mode
def set_cdata_mode(self, elem):
    self.cdata_elem = elem.lower()

def clear_cdata_mode(self):
    self.cdata_elem = None

# Inside goahead, when self.cdata_elem is set:
#     j = rawdata.lower().find('</' + self.cdata_elem, i)
#     if j < 0:
#         break  # wait for more data
#     self.handle_data(rawdata[i:j])
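The effect of CDATA mode is easiest to see with script content that contains a bare <. This is an illustrative sketch against the standard-library parser; ScriptLog and its attributes are hypothetical names.

```python
from html.parser import HTMLParser


class ScriptLog(HTMLParser):
    """Hypothetical helper: records tags and text separately."""

    def __init__(self):
        super().__init__()
        self.starts = []
        self.ends = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.starts.append(tag)

    def handle_endtag(self, tag):
        self.ends.append(tag)

    def handle_data(self, data):
        self.text.append(data)


body = "if (a < b) { run(); }"
parser = ScriptLog()
parser.feed("<script>" + body + "</script>")
parser.close()

# Join the text in case it is delivered in several pieces.
script_text = "".join(parser.text)
print(parser.starts, parser.ends, repr(script_text))
```

Only the script element itself is reported as a tag; the comparison inside the body reaches handle_data untouched.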

Comments follow the same incomplete-token deferral: if --> is not yet present in the buffer, the comment fragment stays in rawdata, and handle_comment is called only after a later feed call delivers the rest.
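A comment split across two feed calls illustrates the deferral. The sketch is a minimal check, not CPython code; CommentLog and comments are hypothetical names.

```python
from html.parser import HTMLParser


class CommentLog(HTMLParser):
    """Hypothetical helper: collects comment bodies."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


parser = CommentLog()
parser.feed("<!-- par")             # no closing --> yet
before_close_marker = list(parser.comments)
parser.feed("tial -->")             # comment is now complete
parser.close()

print(before_close_marker, parser.comments)
```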

gopy notes

Status: not yet ported.

Planned package path: module/html/ (public name html.parser).

Key porting decisions:

  • convert_charrefs (default True since Python 3.5) causes goahead to batch all text between tags and resolve &amp;, &#39;, and similar references before calling handle_data. A port must apply the same batching or callers will see raw entity text.
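  • CDATA_CONTENT_ELEMENTS is a class-level tuple containing 'script' and 'style'. It is consulted in parse_starttag, right after handle_starttag is called, to decide whether to call set_cdata_mode; goahead itself only checks self.cdata_elem. A port must preserve this placement.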
  • getpos returns (line, offset) where the line number is 1-based and the column offset is 0-based. Line counts are maintained by counting \n in consumed data. A port must update position counters at the same points CPython does or getpos will drift.
  • The 3.14 digit-in-tagname restriction is baked into tagfind_tolerant. A port should translate the updated regex pattern unconditionally.
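A port's position tracking can be checked against the reference parser with a probe like the one below. It is a sketch that only reports what the running interpreter does; PosLog and positions are hypothetical names.

```python
from html.parser import HTMLParser


class PosLog(HTMLParser):
    """Hypothetical helper: records getpos() at each start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # Inside the callback, getpos() points at the tag's opening '<'.
        self.positions.append((tag, self.getpos()))


parser = PosLog()
initial = parser.getpos()           # position before any input is consumed
parser.feed("a\nb<i>")
parser.close()

print(initial, parser.positions)
```

Running the same inputs through the port and comparing the recorded tuples is a cheap way to catch counter drift early.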