Lib/xml/etree/ElementTree.py

cpython 3.14 @ ab2d84fe1023/Lib/xml/etree/ElementTree.py

CPython normally loads a C accelerator (_elementtree): the import at the end of this module rebinds Element, SubElement, TreeBuilder, XMLParser and a few related names to their C implementations. The Python source remains the reference for how those classes behave and is the only implementation available when the C extension is absent. The file covers four broad areas: the in-memory tree model (Element, ElementTree), serialisation helpers (tostring, _serialize_xml), the streaming parser stack (TreeBuilder, XMLParser, XMLPullParser, iterparse), and the C14N 2.0 writer (C14NWriterTarget).

Map

Lines	Symbol	Role
107-116	`ParseError`	SyntaxError subclass carrying code and position
126-416	`Element`	Core node: tag, attrib, text, tail, children list
419-434	`SubElement`	Factory that creates and appends a child Element
470-513	`QName`	Namespace-qualified name wrapper with comparison operators
518-743	`ElementTree`	Tree wrapper with parse, write, find, findall, iterfind
1077-1099	`tostring`	Serialise element tree to bytes or Unicode string
1204-1215	`parse`	Load XML file into an ElementTree
1218-1279	`iterparse`	Streaming iterator yielding (event, elem) pairs
1281-1336	`XMLPullParser`	Event-queue based non-blocking pull parser
1339-1353	`XML` / `fromstring`	Parse an XML string and return the root Element
1398-1516	`TreeBuilder`	SAX-like target: start/data/end callbacks build the tree
1520-1752	`XMLParser`	expat-backed parser that drives a TreeBuilder target
1790-2102	`C14NWriterTarget`	Canonical XML 2.0 serialisation writer target
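
As a quick orientation, the sketch below exercises a few of the mapped symbols through the public xml.etree.ElementTree API. The sample document, the urn:example namespace and the variable names are illustrative assumptions, not taken from the source file.

```python
import xml.etree.ElementTree as ET

# fromstring parses a complete document and returns the root Element.
root = ET.fromstring("<catalog><book id='b1'>Dune</book></catalog>")

# SubElement creates a child and appends it to the parent in one call.
extra = ET.SubElement(root, "book", id="b2")
extra.text = "Emma"

# QName joins a namespace URI and a local name into Clark notation.
qualified = ET.QName("urn:example", "book")

# tostring serialises the tree; encoding="unicode" returns str, not bytes.
xml_text = ET.tostring(root, encoding="unicode")

# ParseError is a SyntaxError subclass carrying code and position.
try:
    ET.fromstring("<catalog>")  # unclosed root element
    parse_error = None
except ET.ParseError as exc:
    parse_error = exc
```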

Reading

Element: the core data node

# CPython: Lib/xml/etree/ElementTree.py:170 Element.__init__
    def __init__(self, tag, attrib={}, **extra):
        if not isinstance(attrib, dict):
            raise TypeError("attrib must be dict, not %s" % (
                attrib.__class__.__name__,))
        self.tag = tag
        self.attrib = {**attrib, **extra}
        self._children = []

# CPython: Lib/xml/etree/ElementTree.py:276 Element.find
    def find(self, path, namespaces=None):
        """Find first matching element by tag name or path.

        *path* is a string having either an element tag or an XPath,
        *namespaces* is an optional mapping from namespace prefix to full name.

        Return the first matching element, or None if no element was found.

        """
        return ElementPath.find(self, path, namespaces)

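_children is a plain Python list; the sequence methods on Element (append, extend, insert, remove, __getitem__, __len__) delegate to it, and the methods that add children first check that the new item is an Element. The mutable default attrib={} is harmless because __init__ copies it into a fresh dict together with any keyword attributes. The C accelerator keeps the same logical layout but stores the children in a C array of PyObject * pointers.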
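
A minimal sketch of the list protocol described above, using only the public API. The tag names and attribute values are made up for illustration.

```python
import xml.etree.ElementTree as ET

shared = {"version": "1"}
# attrib and **extra are merged into a new dict; `shared` is not mutated.
root = ET.Element("config", shared, env="dev")

first = ET.SubElement(root, "item")      # create and append in one step
second = ET.Element("item")
root.append(second)                      # list-style append
root.insert(0, ET.Element("header"))     # list-style insert

child_tags = [child.tag for child in root]   # iteration walks the children
found = root.find("item")                    # first direct child named "item"
root.remove(second)                          # removes by identity

# Only Element instances are accepted as children.
try:
    root.append("not an element")
    rejected = False
except TypeError:
    rejected = True
```

After these calls root holds two children, header followed by the first item, and indexing with root[0] or root[-1] behaves as it would on a list.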

TreeBuilder: incremental tree construction

# CPython: Lib/xml/etree/ElementTree.py:1460 TreeBuilder.start
    def start(self, tag, attrs):
        """Open new element and return it.

        *tag* is the element name, *attrs* is a dict containing element
        attributes.

        """
        self._flush()
        self._last = elem = self._factory(tag, attrs)
        if self._elem:
            self._elem[-1].append(elem)
        elif self._root is None:
            self._root = elem
        self._elem.append(elem)
        self._tail = 0
        return elem

# CPython: Lib/xml/etree/ElementTree.py:1477 TreeBuilder.end
    def end(self, tag):
        """Close and return current Element."""
        self._flush()
        self._last = self._elem.pop()
        assert self._last.tag == tag,\
               "end tag mismatch (expected %s, got %s)" % (
                   self._last.tag, tag)
        self._tail = 1
        return self._last
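
The _tail flag set in start and end decides where buffered character data lands when _flush runs: after start it becomes the text of the open element, after end it becomes the tail of the element that was just closed. The sketch below drives a TreeBuilder by hand to show this; the tag names and strings are illustrative.

```python
import xml.etree.ElementTree as ET

builder = ET.TreeBuilder()
builder.start("root", {})
builder.start("item", {"id": "1"})
builder.data("hello")        # buffered; flushed into item.text by end()
builder.end("item")
builder.data("after item")   # buffered; flushed into item.tail by end()
builder.end("root")
root = builder.close()       # returns the top-level element

item = root[0]
serialised = ET.tostring(root, encoding="unicode")
```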

XMLParser wrapping expat

# CPython: Lib/xml/etree/ElementTree.py:1530 XMLParser.__init__
    def __init__(self, *, target=None, encoding=None):
        try:
            from xml.parsers import expat
        except ImportError:
            try:
                import pyexpat as expat
            except ImportError:
                raise ImportError(
                    "No module named expat; use SimpleXMLTreeBuilder instead"
                    )
        parser = expat.ParserCreate(encoding, "}")
        if target is None:
            target = TreeBuilder()
        self.parser = self._parser = parser
        self.target = self._target = target
        self._error = expat.error
        self._names = {} # name memo cache
        # main callbacks
        parser.DefaultHandlerExpand = self._default
        if hasattr(target, 'start'):
            parser.StartElementHandler = self._start
        if hasattr(target, 'end'):
            parser.EndElementHandler = self._end
        if hasattr(target, 'data'):
            parser.CharacterDataHandler = target.data
        parser.buffer_text = 1
        parser.ordered_attributes = 1

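Setting ordered_attributes = 1 makes expat pass attributes to the start handler as a flat list of alternating names and values, which _start converts to a dict before calling target.start. Setting buffer_text = 1 asks expat to coalesce adjacent character data before calling the data handler. Passing "}" as the namespace separator to ParserCreate makes expat report namespaced names as uri}local; _fixname then prepends "{" to produce Clark notation ({uri}local), the form used throughout ElementTree.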
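
Because the handlers are installed only when the target has the matching method, any object with start, end, data and close methods can stand in for TreeBuilder. The Recorder class below is a hypothetical target written for this sketch; it records the callbacks so the Clark-notation tag names are visible.

```python
import xml.etree.ElementTree as ET

class Recorder:
    """Hypothetical target that logs parser callbacks instead of building a tree."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag, dict(attrib)))

    def data(self, text):
        self.events.append(("data", text))

    def end(self, tag):
        self.events.append(("end", tag))

    def close(self):
        # XMLParser.close() returns whatever the target's close() returns.
        return self.events

parser = ET.XMLParser(target=Recorder())
parser.feed('<a xmlns="urn:x" k="v">hi</a>')
events = parser.close()

start_events = [e for e in events if e[0] == "start"]
end_events = [e for e in events if e[0] == "end"]
text = "".join(e[1] for e in events if e[0] == "data")
```

The element name arrives as {urn:x}a, while the unprefixed attribute k keeps its plain name because default namespaces do not apply to attributes.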

iterparse: streaming (event, elem) pairs

# CPython: Lib/xml/etree/ElementTree.py:1218 iterparse
def iterparse(source, events=None, parser=None):
    """Incrementally parse XML document into ElementTree.

    Returns an iterator providing (event, elem) pairs.

    """
    pullparser = XMLPullParser(events=events, _parser=parser)

    if not hasattr(source, "read"):
        source = open(source, "rb")
        close_source = True
    else:
        close_source = False

    def iterator(source):
        try:
            while True:
                yield from pullparser.read_events()
                data = source.read(16 * 1024)
                if not data:
                    break
                pullparser.feed(data)
            root = pullparser._close_and_return_root()
            yield from pullparser.read_events()
            it = wr()
            if it is not None:
                it.root = root
        finally:
            if close_source:
                source.close()

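The inner generator, iterator, feeds the pull parser in 16 KiB chunks and drains read_events() before each read, so callers receive events as soon as the parser produces them rather than after the whole document has been loaded. The excerpt stops before the end of the function: wr is a weak reference to the iterator object that iterparse returns, created after the generator is defined, which lets the generator set root on that object without keeping it alive.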
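
The same feed-then-drain loop can be written against the public XMLPullParser API. This is a simplified sketch, not the library's implementation: stream_events is a hypothetical helper, the chunks are supplied by the caller instead of being read from a file, and no root attribute is tracked.

```python
import xml.etree.ElementTree as ET

def stream_events(chunks, events=("end",)):
    """Yield (event, elem) pairs while feeding byte chunks to a pull parser."""
    parser = ET.XMLPullParser(events=events)
    for chunk in chunks:
        parser.feed(chunk)
        # Hand over whatever the parser has queued so far.
        yield from parser.read_events()
    parser.close()
    # Closing can release events that were still pending.
    yield from parser.read_events()

chunks = [b"<r><a>1</a>", b"<b>2</b></r>"]
seen = [(event, elem.tag, elem.text) for event, elem in stream_events(chunks)]
```

With events=("end",) each element is yielded once its closing tag has been parsed, so children appear before their parent.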

gopy notes

gopy ships no XML package today. When the time comes, the natural split is to port Element (tag/attrib/text/tail plus a []*Element slice for children) as a Go struct, keep ElementTree as a thin wrapper that holds the root, and implement TreeBuilder as the target interface fed by an expat binding or a Go-native XML tokeniser. The iterparse streaming path maps cleanly onto a Go channel whose items pair an event name (string) with an *Element. The XPath subset lives in a separate file, ElementPath, and would be a separate porting task.
