Lib/html/parser.py

cpython 3.14 @ ab2d84fe1023/Lib/html/parser.py

HTMLParser is a push-based, event-driven HTML parser. Callers pass text (str, not bytes) to feed() incrementally; the parser buffers incomplete markup and fires callbacks as complete constructs are recognized. The public interface includes eight overridable methods: handle_starttag, handle_endtag, handle_startendtag, handle_data, handle_comment, handle_decl, handle_pi, and unknown_decl. The base class provides default implementations that do nothing useful on their own, so subclasses override only what they need.
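As a minimal sketch of this callback interface, the subclass below overrides four of the handlers and records each event in a list. The class name EventLogger and its events attribute are illustrative and not part of html.parser.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record parser callbacks as tuples (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


parser = EventLogger()
parser.feed("<p class='x'>Hi<!-- note --></p>")
parser.close()

for event in parser.events:
    print(event)
```

Handlers that are not overridden (handle_decl, handle_pi, and so on) keep their default behaviour, so the subclass stays small.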

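The scanner at the heart of the class is goahead(end), a hand-rolled loop that skips literal text and finds the next < (or, when convert_charrefs is false, the next < or &). It does this with str.find or with the compiled regex held in self.interesting: interesting_normal by default, or a per-element pattern installed by set_cdata_mode. From there it dispatches on the character sequence to a per-construct parse routine (parse_starttag, parse_endtag, parse_comment, parse_html_declaration, and parse_pi in CPython; the excerpts below show them in simplified form as handle_*_dispatch). Those routines in turn invoke handle_starttag, handle_endtag, handle_comment, handle_decl, or handle_pi. A start tag written with a trailing /> is reported through handle_startendtag instead of handle_starttag; the choice depends on the syntax, not on whether the element is void.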
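A small sketch of how the start-tag dispatch chooses between handle_starttag and handle_startendtag. The choice follows the tag's own syntax (a trailing />) and not the element name. The TagKinds class and its seen list are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class TagKinds(HTMLParser):
    """Distinguish <tag> from <tag/> (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Overriding this replaces the default implementation, which
        # calls handle_starttag followed by handle_endtag.
        self.seen.append(("startend", tag))


kinds = TagKinds()
kinds.feed("<br><br/><img src='a.png' />")
kinds.close()
print(kinds.seen)
```

The bare <br> arrives as an ordinary start tag, while <br/> and <img ... /> arrive through handle_startendtag.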

HTMLParser extends _markupbase.ParserBase, which handles declarations (<!DOCTYPE ...>, <!ELEMENT ...>, etc.). When convert_charrefs is true, character references (&amp;, &#38;, &#x26;) are resolved with html.unescape before the text reaches handle_data.

Map

| Lines | Symbol | Role | gopy |
| --- | --- | --- | --- |
| 1-100 | Module header, compiled regexes, HTMLParser.__init__, reset, feed | Compiled patterns for tag scanning; feed appends to a buffer and calls goahead(0); reset clears state. | (stdlib pending) |
| 100-250 | goahead, handle_starttag dispatch, handle_endtag dispatch | Main scanning loop; regex-based tag detection; attribute parsing via attrfind_tolerant; detection of self-closing (/>) start tags. | (stdlib pending) |
| 250-400 | handle_data, handle_comment, handle_decl, handle_pi, CDATA section | Character-data and comment callbacks; handle_decl delegates to _markupbase; CDATA section scanning for convert_charrefs=False. | (stdlib pending) |
| 400-500 | error, getpos, set_cdata_mode, clear_cdata_mode, overridable stubs | Line/column tracking; CDATA mode toggle for <script> and <style> raw-text elements; empty stubs for all eight event methods. | (stdlib pending) |
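The CDATA mode toggle listed in the last row can be observed from the outside. In this sketch the content of a <script> element is delivered as data, and markup-like text inside it produces no tag events. RawText and its attributes are hypothetical names; the text is joined because the parser may deliver it in more than one handle_data call.

```python
from html.parser import HTMLParser


class RawText(HTMLParser):
    """Collect tag events and text separately (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))

    def handle_data(self, data):
        self.text.append(data)


raw = RawText()
raw.feed("<script>if (a < b) x = '<b>';</script>")
raw.close()
print(raw.tags)
print("".join(raw.text))
```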

Reading

goahead() main scanning loop (lines 100 to 250)

cpython 3.14 @ ab2d84fe1023/Lib/html/parser.py#L100-250

def goahead(self, end):
    rawdata = self.rawdata
    i = 0
    n = len(rawdata)
    while i < n:
        if self.convert_charrefs and not self.cdata_elem:
            j = rawdata.find('<', i)
            if j < 0:
                # No more tags. Hold the text back only if it may end in a
                # half-received character reference: an "&" near the end
                # with no whitespace or ";" after it.
                amppos = rawdata.rfind('&', max(i, n - 34))
                if (amppos >= 0 and
                        not re.compile(r'[\s;]').search(rawdata, amppos)):
                    break  # wait for more text; flushed after the loop if end
                j = n
        else:
            match = self.interesting.search(rawdata, i)  # next < or &
            if match:
                j = match.start()
            else:
                if self.cdata_elem:
                    break  # raw-text element not closed yet
                j = n

        if i < j:
            # Text before the next markup character
            if self.convert_charrefs and not self.cdata_elem:
                self.handle_data(unescape(rawdata[i:j]))
            else:
                self.handle_data(rawdata[i:j])
        i = self.updatepos(i, j)

        if i == n:
            break
        startswith = rawdata.startswith
        if startswith('<', i):
            if starttagopen.match(rawdata, i):  # < followed by a letter
                k = self.handle_starttag_dispatch(i)
            elif startswith("</", i):
                k = self.handle_endtag_dispatch(i)
            elif startswith("<!--", i):
                k = self.handle_comment_dispatch(i)
            elif startswith("<!", i):
                k = self.handle_decl_dispatch(i)
            elif startswith("<?", i):
                k = self.handle_pi_dispatch(i)
            else:
                self.handle_data(rawdata[i])  # a literal "<"
                k = i + 1
            if k < 0:
                # Incomplete tag: stop and wait for more data
                if not end:
                    break
                k = rawdata.find('>', i + 1)
                if k < 0:
                    k = rawdata.find('<', i + 1)
                    if k < 0:
                        k = i + 1
                else:
                    k += 1
                self.handle_data(rawdata[i:k])
            i = self.updatepos(i, k)
        elif startswith("&", i):
            ...

feed() calls goahead(False) after appending each chunk. With end false, the scanner must stop whenever it cannot confirm that a construct is complete (a parse routine returns a negative k), so the unfinished text is left in rawdata for the next feed(). close() calls goahead(True), which in the code above treats any trailing incomplete markup as literal data.
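The buffering described above can be observed by feeding a document in awkward chunks. This is a sketch under stated assumptions: Recorder is a hypothetical helper, and rawdata is an internal attribute read here only to show what is still pending. Text is joined afterwards because it may arrive in more than one handle_data call.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Record start, end, and data events (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


rec = Recorder()
counts = []   # number of events seen after each chunk
pending = []  # unconsumed buffer after each chunk (internal attribute)
for chunk in ["<a hre", "f='x'>li", "nk</a", ">"]:
    rec.feed(chunk)
    counts.append(len(rec.events))
    pending.append(rec.rawdata)
rec.close()

tags = [e for e in rec.events if e[0] != "data"]
text = "".join(e[1] for e in rec.events if e[0] == "data")
print(counts[0], repr(pending[0]))
print(tags, text)
```

After the first chunk no event has fired and the partial tag is still buffered; the start tag is reported only once its closing > arrives.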

When convert_charrefs=True (the default since Python 3.5), each run of text up to the next < is passed to handle_data with its character references already resolved. When convert_charrefs=False, scanning uses the interesting regex, which also stops at &, and each character reference is reported inline through handle_charref or handle_entityref instead of being resolved.
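To make the difference concrete, the sketch below parses the same input in both modes. Collector and run are hypothetical names for this illustration; handle_entityref and handle_charref are the standard html.parser callbacks used when convert_charrefs is false.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Record data and character-reference events (illustrative helper)."""

    def __init__(self, *, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def run(text, convert_charrefs):
    parser = Collector(convert_charrefs=convert_charrefs)
    parser.feed(text)
    parser.close()
    return parser.events


source = "<p>a &amp; b &#38; c</p>"
resolved = run(source, True)
inline = run(source, False)
print(resolved)
print(inline)
```

With conversion enabled the subclass never sees the references; with it disabled the subclass receives the reference names and decides how to decode them.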

Starttag attribute parsing (lines 100 to 250)

cpython 3.14 @ ab2d84fe1023/Lib/html/parser.py#L100-250

import re
from html import unescape

# Regex used to parse attribute key=value pairs tolerantly
attrfind_tolerant = re.compile(
    r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
    r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^>\s]*))?(?:\s|/(?!>))*',
    re.ASCII)

def handle_starttag_dispatch(self, i):
    rawdata = self.rawdata
    end = rawdata.find('>', i)
    if end < 0:
        return -1  # incomplete
    tag, attrs = self._parse_starttag_attrs(rawdata, i, end)
    self.handle_starttag(tag, attrs)
    ...

def _parse_starttag_attrs(self, rawdata, i, end):
    attrsd = []
    match = tagfind_tolerant.match(rawdata, i + 1)
    tag = match.group(1).lower()
    k = match.end()
    while k < end:
        m = attrfind_tolerant.match(rawdata, k)
        if not m:
            break
        attrname, rest, attrvalue = m.group(1, 2, 3)
        if not rest:
            # Bare flag attribute such as "disabled": no value at all
            attrvalue = None
        elif (attrvalue[:1] == "'" == attrvalue[-1:] or
              attrvalue[:1] == '"' == attrvalue[-1:]):
            # Strip the matching outer quotes
            attrvalue = attrvalue[1:-1]
        if attrvalue:
            # Resolve character references in every non-empty value
            attrvalue = unescape(attrvalue)
        attrsd.append((attrname.lower(), attrvalue))
        k = m.end()
    return tag, attrsd

Attribute parsing uses attrfind_tolerant, a deliberately permissive regex that accepts HTML5's unquoted attribute values, bare flag attributes (no = sign), and stray / characters before >. This tolerant approach matches browsers, which must parse real-world HTML that violates the specification. Attribute names are lowercased; a flag attribute gets the value None. Quoted values have their outer ' or " stripped, and every non-empty value is then passed through html.unescape to resolve character references.
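A short sketch of what the tolerant attribute rules produce through the public API. AttrDump is a hypothetical helper that keeps the attribute list of the last start tag it saw.

```python
from html.parser import HTMLParser


class AttrDump(HTMLParser):
    """Keep the attrs list of the most recent start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        self.attrs = attrs


dump = AttrDump()
# Mixed forms: uppercase name with unquoted value, bare flag,
# double-quoted value with a character reference, single-quoted value.
dump.feed("<input TYPE=text disabled value=\"a &amp; b\" data-x='1'>")
dump.close()
print(dump.attrs)
```

The attrs argument is a list of (name, value) pairs in source order, so duplicate attribute names are preserved and not merged.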

gopy mirror

html.parser is pure Python and has no C extension dependency. A gopy port will vendor Lib/html/parser.py and Lib/_markupbase.py together, rewriting only the import of html.unescape to point at the gopy implementation. The compiled regexes are portable and require no changes. Line and column tracking also translates directly: getpos() returns the (lineno, offset) pair that updatepos() maintains by counting '\n' characters in each consumed span of rawdata.
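The position bookkeeping is simple enough to sketch as a pure function. The advance function below is a hypothetical restatement of the newline-counting approach, not the code from _markupbase, and the Where class exists only to compare its result with what getpos() reports.

```python
from html.parser import HTMLParser


def advance(lineno, offset, text, i, j):
    """Return the (lineno, offset) reached after consuming text[i:j]."""
    if i >= j:
        return lineno, offset
    nlines = text.count("\n", i, j)
    if nlines:
        # Offset restarts after the last newline in the consumed span
        last_newline = text.rindex("\n", i, j)
        return lineno + nlines, j - (last_newline + 1)
    return lineno, offset + (j - i)


class Where(HTMLParser):
    """Record getpos() at each start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        self.positions.append((tag, self.getpos()))


doc = "line one\n  <b>bold</b>"
where = Where()
where.feed(doc)
where.close()

# Lines are 1-based and offsets 0-based, so parsing starts at (1, 0).
sketch = advance(1, 0, doc, 0, doc.index("<b>"))
print(where.positions, sketch)
```

Because the function depends only on string counting and indexing, a port needs no regex support for this part.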