Lib/xml/etree/ElementTree.py

cpython 3.14 @ ab2d84fe1023/Lib/xml/etree/ElementTree.py

At import time CPython normally loads the C accelerator (_elementtree), whose definitions shadow the pure-Python classes defined in this module. The Python source remains the reference for each class's behaviour and is the only implementation available when the C extension is absent. The file is divided into four broad areas: the in-memory tree model (Element, ElementTree), serialisation helpers (tostring, _serialize_xml), the streaming parser stack (TreeBuilder, XMLParser, XMLPullParser, iterparse), and the C14N 2.0 writer (C14NWriterTarget).

Map

| Lines | Symbol | Role |
|---|---|---|
| 107-116 | ParseError | SyntaxError subclass carrying code and position |
| 126-416 | Element | Core node: tag, attrib, text, tail, children list |
| 419-434 | SubElement | Factory that creates and appends a child Element |
| 470-513 | QName | Namespace-qualified name wrapper with comparison operators |
| 518-743 | ElementTree | Tree wrapper with parse, write, find, findall, iterfind |
| 1077-1099 | tostring | Serialise element tree to bytes or Unicode string |
| 1204-1215 | parse | Load XML file into an ElementTree |
| 1218-1279 | iterparse | Streaming iterator yielding (event, elem) pairs |
| 1281-1336 | XMLPullParser | Event-queue based non-blocking pull parser |
| 1339-1353 | XML / fromstring | Parse an XML string and return the root Element |
| 1398-1516 | TreeBuilder | SAX-like target: start/data/end callbacks build the tree |
| 1520-1752 | XMLParser | expat-backed parser that drives a TreeBuilder target |
| 1790-2102 | C14NWriterTarget | Canonical XML 2.0 serialisation writer target |
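To make the map concrete, here is a minimal sketch that touches several of the entry points listed above using only the public xml.etree.ElementTree API. The tag names, attribute values, and the example.com namespace URI are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Build a small tree with Element and SubElement.
root = ET.Element("root", {"version": "1"})
child = ET.SubElement(root, "item", name="a")
child.text = "hello"

# tostring returns bytes by default and str when encoding="unicode".
as_bytes = ET.tostring(root)
as_text = ET.tostring(root, encoding="unicode")

# fromstring parses a string and returns the root Element.
parsed = ET.fromstring(as_text)

# QName wraps a namespace-qualified name and exposes it in Clark notation.
qname = ET.QName("http://example.com/ns", "item")

# ParseError carries a numeric code and a (line, column) position.
try:
    ET.fromstring("<root><unclosed></root>")
except ET.ParseError as exc:
    error_code, error_position = exc.code, exc.position
```

The same calls behave identically whether the C accelerator or the pure-Python classes are in use, which is the point of keeping the two implementations in step.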

Reading

Element: the core data node

# CPython: Lib/xml/etree/ElementTree.py:170 Element.__init__
def __init__(self, tag, attrib={}, **extra):
    if not isinstance(attrib, dict):
        raise TypeError("attrib must be dict, not %s" % (
            attrib.__class__.__name__,))
    self.tag = tag
    self.attrib = {**attrib, **extra}
    self._children = []

# CPython: Lib/xml/etree/ElementTree.py:276 Element.find
def find(self, path, namespaces=None):
    """Find first matching element by tag name or path.

    *path* is a string having either an element tag or an XPath,
    *namespaces* is an optional mapping from namespace prefix to full name.

    Return the first matching element, or None if no element was found.

    """
    return ElementPath.find(self, path, namespaces)

_children is a plain Python list; every list-protocol method on Element (append, extend, insert, remove, __getitem__, __len__) delegates to it. The C accelerator replicates this layout with a PyObject * array on the struct.
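The list-like behaviour can be seen from the outside without touching _children directly. The following is a small sketch using the public API only; the tag and attribute names are arbitrary.

```python
import xml.etree.ElementTree as ET

# Keyword arguments are merged into attrib alongside the dict argument.
parent = ET.Element("parent", {"id": "1"}, lang="en")
first = ET.Element("first")
second = ET.Element("second")

parent.append(first)
parent.insert(0, second)              # children are now [second, first]
parent.extend([ET.Element("third")])  # children are now [second, first, third]

child_tags = [child.tag for child in parent]  # iteration walks children in order
child_count = len(parent)
last_child = parent[-1]

parent.remove(first)
remaining_tags = [child.tag for child in parent]
```

Because attrib is built with {**attrib, **extra}, the dict passed by the caller is copied rather than shared, so mutating parent.attrib afterwards does not affect the original dict.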

TreeBuilder: incremental tree construction

# CPython: Lib/xml/etree/ElementTree.py:1460 TreeBuilder.start
def start(self, tag, attrs):
    """Open new element and return it.

    *tag* is the element name, *attrs* is a dict containing element
    attributes.

    """
    self._flush()
    self._last = elem = self._factory(tag, attrs)
    if self._elem:
        self._elem[-1].append(elem)
    elif self._root is None:
        self._root = elem
    self._elem.append(elem)
    self._tail = 0
    return elem

# CPython: Lib/xml/etree/ElementTree.py:1477 TreeBuilder.end
def end(self, tag):
    """Close and return current Element."""
    self._flush()
    self._last = self._elem.pop()
    assert self._last.tag == tag,\
           "end tag mismatch (expected %s, got %s)" % (
               self._last.tag, tag)
    self._tail = 1
    return self._last
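The start/end pair above can be exercised by hand, without a parser, to see how buffered character data ends up as either text or tail. This is a minimal sketch that drives a TreeBuilder directly through its public methods; the tag names are arbitrary.

```python
from xml.etree.ElementTree import TreeBuilder, tostring

builder = TreeBuilder()
builder.start("doc", {})
builder.data("intro ")     # flushed on the next start(): becomes doc.text
builder.start("b", {})
builder.data("bold")       # flushed on end("b"): becomes b.text
builder.end("b")
builder.data(" trailing")  # arrives after an end(): becomes b.tail
builder.end("doc")
root = builder.close()     # returns the root element

serialised = tostring(root, encoding="unicode")
```

The element stack is what lets start() attach each new element to its parent, while the tail flag set by end() decides which attribute the next flush writes to.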

XMLParser wrapping expat

# CPython: Lib/xml/etree/ElementTree.py:1530 XMLParser.__init__
def __init__(self, *, target=None, encoding=None):
    try:
        from xml.parsers import expat
    except ImportError:
        try:
            import pyexpat as expat
        except ImportError:
            raise ImportError(
                "No module named expat; use SimpleXMLTreeBuilder instead"
                )
    parser = expat.ParserCreate(encoding, "}")
    if target is None:
        target = TreeBuilder()
    self.parser = self._parser = parser
    self.target = self._target = target
    self._error = expat.error
    self._names = {} # name memo cache
    # main callbacks
    parser.DefaultHandlerExpand = self._default
    if hasattr(target, 'start'):
        parser.StartElementHandler = self._start
    if hasattr(target, 'end'):
        parser.EndElementHandler = self._end
    if hasattr(target, 'data'):
        parser.CharacterDataHandler = target.data
    parser.buffer_text = 1
    parser.ordered_attributes = 1

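The ordered_attributes = 1 flag causes expat to report attributes as a flat list of alternating name/value strings, which _start then converts to a dict. Passing "}" as the namespace separator makes expat report qualified names as uri}local; the parser prepends "{" to produce the Clark-notation form {uri}local used throughout ElementTree. Handlers are installed only for the callbacks the target actually defines, so a target may implement any subset of start, end, and data.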
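The effect of both settings is visible from a custom target. The sketch below assumes a hypothetical RecordingTarget class (not part of the standard library) that logs each callback; the namespace URI and attribute names are made up.

```python
from xml.etree.ElementTree import XMLParser


class RecordingTarget:
    """Hypothetical target that records the callbacks XMLParser makes."""

    def __init__(self):
        self.calls = []

    def start(self, tag, attrib):
        # attrib arrives as a dict even though expat reports a flat list.
        self.calls.append(("start", tag, dict(attrib)))

    def end(self, tag):
        self.calls.append(("end", tag))

    def data(self, text):
        self.calls.append(("data", text))

    def close(self):
        # Whatever close() returns is what XMLParser.close() returns.
        return self.calls


parser = XMLParser(target=RecordingTarget())
parser.feed(
    '<a:root xmlns:a="http://example.com/ns" a:kind="demo" id="7">hi</a:root>'
)
calls = parser.close()

start_call = calls[0]
end_call = calls[-1]
text_seen = "".join(call[1] for call in calls if call[0] == "data")
```

Both the element name and the prefixed attribute name come back in Clark notation, while the unprefixed id attribute is left as a plain name.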

iterparse: streaming (event, elem) pairs

# CPython: Lib/xml/etree/ElementTree.py:1218 iterparse
def iterparse(source, events=None, parser=None):
    """Incrementally parse XML document into ElementTree.

    Returns an iterator providing (event, elem) pairs.

    """
    pullparser = XMLPullParser(events=events, _parser=parser)

    if not hasattr(source, "read"):
        source = open(source, "rb")
        close_source = True
    else:
        close_source = False

    def iterator(source):
        try:
            while True:
                yield from pullparser.read_events()
                data = source.read(16 * 1024)
                if not data:
                    break
                pullparser.feed(data)
            root = pullparser._close_and_return_root()
            yield from pullparser.read_events()
            it = wr()
            if it is not None:
                it.root = root
        finally:
            if close_source:
                source.close()

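The inner iterator generator reads in 16 KiB chunks and drains read_events() between chunks, so callers see events as soon as they are available rather than after the full document is loaded. The name wr is a weak reference to the iterator object that iterparse returns; it is created later in the function, outside this excerpt, and lets the generator set the root attribute once parsing finishes without keeping that iterator alive.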
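The same feed-then-drain loop can be written against the public XMLPullParser API. The sketch below is a simplified stand-in for the inner generator, not the CPython implementation: iter_events and its chunk_size parameter are hypothetical names, and the weak-reference and source-closing logic is left out.

```python
import io
import xml.etree.ElementTree as ET


def iter_events(stream, events=("end",), chunk_size=16 * 1024):
    """Yield (event, elem) pairs, draining the queue between chunks."""
    parser = ET.XMLPullParser(events=events)
    while True:
        yield from parser.read_events()  # hand over whatever is ready
        data = stream.read(chunk_size)
        if not data:
            break
        parser.feed(data)
    parser.close()                       # flush any remaining parser state
    yield from parser.read_events()


document = b"<log>" + b"".join(
    b"<entry id='%d'>line %d</entry>" % (i, i) for i in range(3)
) + b"</log>"

# A tiny chunk size forces many feed() calls, so events arrive incrementally.
sketch_events = [
    (event, elem.tag, elem.text)
    for event, elem in iter_events(io.BytesIO(document), chunk_size=8)
]

# The standard-library iterparse reports the same sequence.
stdlib_iter = ET.iterparse(io.BytesIO(document), events=("end",))
stdlib_events = [(event, elem.tag, elem.text) for event, elem in stdlib_iter]
root_tag = stdlib_iter.root.tag  # root is set once iteration has finished
```

When processing large documents this way, calling elem.clear() on each element after handling its end event keeps memory use from growing with the size of the input.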

gopy notes

gopy ships no XML package today. When the time comes, the natural split is to port Element (tag/attrib/text/tail plus a []*Element slice for children) as a Go struct, keep ElementTree as a thin wrapper that holds the root, and implement TreeBuilder as the target interface fed by an expat binding or a Go-native XML tokeniser. The iterparse streaming path maps cleanly onto a Go channel whose element type is a small struct pairing the event name with an *Element. The XPath subset in ElementPath lives in a separate file and would be a separate port task.