Lib/html/parser.py (part 3)

Source:

cpython 3.14 @ ab2d84fe1023/Lib/html/parser.py

This annotation covers the incremental HTML parsing loop. See lib_html2_detail for HTMLParser.__init__, entity reference handling, and the convert_charrefs option.

Map

Lines     Symbol                            Role
1-80      HTMLParser.feed                   Incrementally feed data to the parser
81-180    HTMLParser.goahead                Main parsing loop
181-280   handle_starttag / handle_endtag   Override-able tag callbacks
281-380   handle_data                       Character data callback
381-500   Attribute parsing                 handle_starttag_attrs regex

Reading

HTMLParser.feed

# CPython: Lib/html/parser.py:108 feed
def feed(self, data):
    self.rawdata = self.rawdata + data
    self.goahead(0)

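feed appends new data to self.rawdata (the buffer of input that has not been consumed yet, including any incomplete token left over from the previous call) and calls goahead. The 0 argument is the end flag: it means "don't force end-of-file handling", whereas close() calls goahead(1) to flush whatever is still buffered. Incremental feeds allow parsing HTML as it streams in.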
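As a rough illustration of incremental feeding, the sketch below splits one small document across three feed calls, cutting a start tag and an end tag in half. EventLog and the chunk boundaries are made up for this example; only HTMLParser and its handle_* hooks come from the standard library.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Hypothetical helper: records every callback as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = EventLog()
# Chunk boundaries chosen arbitrarily so that tokens are cut in half.
for chunk in ['<p cl', 'ass="x">Hello</', 'p>']:
    parser.feed(chunk)
parser.close()  # flush anything still buffered

print(parser.events)
```

The callbacks fire only once a token is complete, so the recorded events here match what a single feed of the whole string would produce. Text that arrives without a following < may still be delivered in several handle_data calls, so callers that need whole text nodes should concatenate.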

HTMLParser.goahead

# CPython: Lib/html/parser.py:130 goahead
def goahead(self, end):
    rawdata = self.rawdata
    i = 0
    n = len(rawdata)
    while i < n:
        if self.convert_charrefs and not self.cdata_elem:
            j = rawdata.find('<', i)
            if j < 0:
                ...  # may break here to wait for the rest of a split charref
                j = n
        else:
            match = self.interesting.search(rawdata, i)  # '<' or '&'
            j = match.start() if match else n
        if i < j:
            # text is unescape()d first when convert_charrefs is in effect
            self.handle_data(rawdata[i:j])
        i = self.updatepos(i, j)
        if i == n: break
        startswith = rawdata.startswith
        if startswith('<', i):
            if starttagopen.match(rawdata, i):  # '<' + letter
                k = self.parse_starttag(i)
            elif startswith("</", i):
                k = self.parse_endtag(i)
            elif startswith("<!--", i):
                k = self.parse_comment(i)
            ...
            if k < 0:
                break  # incomplete token: wait for more data
            i = self.updatepos(i, k)
        elif startswith('&', i):
            ...  # match a charref or entityref, then call
                 # handle_charref(name) or handle_entityref(name)
    self.rawdata = rawdata[i:]  # save unconsumed remainder

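goahead alternates between two steps. First it finds the next interesting character: only < when convert_charrefs is set and the parser is not inside a script or style element, otherwise < or &. Any text before that position is passed to handle_data. Then it dispatches on what follows: each branch consumes one token, fires the matching handle_* callback, and yields the index k just past the token. If a token is incomplete (e.g., < without a matching >), k < 0, the loop breaks, and the unconsumed remainder is kept in self.rawdata for the next feed call.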
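The buffering of an incomplete token can be observed from outside, as in the sketch below. It reads parser.rawdata, which is an internal attribute rather than documented API, so treat this as a way to inspect the loop and not as something to rely on. TagLog is a hypothetical helper.

```python
from html.parser import HTMLParser


class TagLog(HTMLParser):
    """Hypothetical helper: remembers start tags as they are reported."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


parser = TagLog()
parser.feed("<p>text<b")        # "<b" has no closing ">" yet
pending = parser.rawdata         # unconsumed remainder kept by goahead
tags_before = list(parser.tags)  # only the complete <p> has been reported

parser.feed(">")                 # completes the buffered token
tags_after = list(parser.tags)

print(repr(pending), tags_before, tags_after)
```

After the first feed the parser has reported <p> and the text, but holds "<b" back; the second feed supplies the missing ">" and the start tag is reported then.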

Attribute parsing

# CPython: Lib/html/parser.py:260 handle_starttag (attr parsing excerpt)
attrfind_tolerant = re.compile(
    r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
    r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^>\s]*))?(?:\s|/(?!>))*',
    re.ASCII)

Start-tag attributes are matched by attrfind_tolerant, which handles quoted values ('val', "val"), unquoted values (<img width=100>), and boolean attributes (<input disabled>), whose value is reported as None. The regex is "tolerant": it does not reject malformed attributes, mirroring browser behavior.
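A small sketch of what the attribute forms described above look like by the time they reach handle_starttag. AttrDump is a hypothetical helper, and the sample tag is invented to exercise single-quoted, unquoted, boolean, and double-quoted attributes in one go.

```python
from html.parser import HTMLParser


class AttrDump(HTMLParser):
    """Hypothetical helper: keeps the attrs list of the last start tag."""

    def __init__(self):
        super().__init__()
        self.last_attrs = None

    def handle_starttag(self, tag, attrs):
        self.last_attrs = attrs


parser = AttrDump()
parser.feed("""<input type='text' WIDTH=100 disabled data-x="a&amp;b">""")
print(parser.last_attrs)
# [('type', 'text'), ('width', '100'), ('disabled', None), ('data-x', 'a&b')]
```

Attribute names arrive lowercased, surrounding quotes are stripped, character references inside values are decoded, and a bare attribute name carries None as its value.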

Callback model

# CPython: Lib/html/parser.py:320 handle_* stubs
def handle_starttag(self, tag, attrs): pass
def handle_endtag(self, tag): pass
def handle_data(self, data): pass
def handle_comment(self, data): pass
def handle_entityref(self, name): pass
def handle_charref(self, name): pass

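HTMLParser uses the callback pattern: subclasses override the handle_* methods they care about, and the base-class versions do nothing. Calling feed('<p>Hello</p>') on a parser instance invokes handle_starttag('p', []), handle_data('Hello'), then handle_endtag('p'). Attributes are passed as a list of (name, value) tuples. With the default convert_charrefs=True, character references in text are decoded before handle_data sees them, so handle_entityref and handle_charref are only called when that option is off.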
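For a typical use of the callback model, here is a minimal sketch of a subclass that overrides a single hook. LinkCollector and the sample markup are hypothetical; the point is that unneeded callbacks can simply be left at their no-op defaults.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass: gathers href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) tuples; value may be None.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


collector = LinkCollector()
collector.feed('<a href="https://example.org/">home</a> <a name="top">top</a>')
collector.close()
print(collector.links)  # ['https://example.org/']
```

Iterating over the tuples instead of building a dict keeps duplicate attribute names visible, which matters for malformed input.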

gopy notes

html.parser is a pure-Python module; gopy runs it directly. goahead uses re.compile and str.find, which route through module/re and objects.StrFind. The callback methods act as virtual methods: subclasses override them as regular Python methods, and they are invoked via the standard __call__ dispatch.