XML: ElementTree (etree)

SAX and DOM

SAX

  • Event-driven (elements start and end)

  • Commonly used to parse long streams of structured data

  • “De-facto” standard

  • Available in multiple languages

  • Python: xml.sax
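
As a rough illustration of the event-driven style, the sketch below feeds a tiny document to xml.sax. The handler class EventRecorder and the sample document are hypothetical, made up for this example; the handler only records which elements start and end.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records element start/end events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called when the parser sees an opening tag
        self.events.append(("start", name))

    def endElement(self, name):
        # Called when the parser sees the matching closing tag
        self.events.append(("end", name))


handler = EventRecorder()
xml.sax.parseString(b'<root><child age="15" /></root>', handler)
print(handler.events)
```

The parser never builds a tree; the handler sees one event at a time, which is why this style suits long streams.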

DOM: “Document Object Model”

  • Document available as a tree

  • Programmatically navigable as a tree

  • Relatively comfortable

  • Python: xml.dom

  • Problems

    • Only relatively comfortable

    • Not Pythonic enough
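
To give a feel for what "only relatively comfortable" can mean in practice, here is a small sketch using xml.dom.minidom. The sample document and the variable names are assumptions chosen for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document
document = parseString('<root><child age="15">Text</child></root>')
root = document.documentElement

found = []
for node in root.childNodes:
    # childNodes may also contain text nodes, so the node type is checked
    if node.nodeType == node.ELEMENT_NODE:
        age = node.getAttribute("age")
        text = node.firstChild.data  # the text lives in a separate child node
        found.append((age, text))

print(found)
```

Navigation works, but node types, getAttribute() and separate text nodes have to be handled explicitly.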

ElementTree

xml.etree: Python specific ⟶ absolutely comfortable

  • Seamless integration in Python (⟶ iteration)

  • A document is a tree, and trees are lists of lists

  • XML attributes represented as dictionaries

⟶ simple!
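
A minimal sketch of these three points, using a made-up two-child document: elements support len(), indexing and iteration like lists, and attributes are reached through a plain dictionary.

```python
from xml.etree.ElementTree import Element, SubElement

# Hypothetical sample tree
root = Element("root")
SubElement(root, "child", {"age": "15"})
SubElement(root, "child", {"age": "17"})

# An element behaves like a list of its children
count = len(root)
first_tag = root[0].tag

# Iteration yields the child elements; attrib is a dictionary
ages = [child.attrib["age"] for child in root]

print(count, first_tag, ages)
```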

A Very Simple Document

Python code
from xml.etree.ElementTree import Element, SubElement
element = Element("root")
child = Element("child")
element.append(child)
Or alternatively …
element = Element("root")
SubElement(element, "child")
XML
<root>
  <child />
</root>

Attributes

  • XML elements have attributes

  • Python’s XML elements have the attrib dictionary

element = Element("root")
child = SubElement(element, "child")
child.attrib['age'] = '15'
child = SubElement(element, "child")
child.attrib['age'] = '17'
<root>
  <child age="15" />
  <child age="17" />
</root>
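
Besides writing to attrib directly, attributes can be passed as keyword arguments or handled with set() and get(). The following sketch rebuilds the document above that way and serializes it with tostring() to check the result; the attribute name "name" and its default value are assumptions for illustration.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

element = Element("root")
first = SubElement(element, "child", age="15")  # attribute as keyword argument
second = SubElement(element, "child")
second.set("age", "17")                         # same effect as attrib['age'] = '17'

# get() returns a default when the attribute is absent
missing = first.get("name", "unknown")

serialized = tostring(element, encoding="unicode")
print(serialized)
```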

Text (1)

In XML documents, free text is permitted …

  • Inside one element

  • After the end of one element, but before the next tag

Accordingly, Python elements have members …

  • element.text

  • element.tail

  • No text ⟶ None
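
The following sketch shows where text and tail end up when a small document with mixed content is parsed. The sample document is made up for illustration.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical mixed-content document
root = fromstring("<root>Hello <b>bold</b> world<i /></root>")
bold, italic = root  # an element can be unpacked like a list of its children

print(repr(root.text))    # text inside root, before the first child
print(repr(bold.text))    # text inside <b>
print(repr(bold.tail))    # text after </b>, before the next tag
print(repr(italic.text))  # empty element: no text
print(repr(italic.tail))  # nothing follows <i /> before </root>
```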

Text (2)

element = Element("root")
child = SubElement(element, "child")
child.text = 'Text'
child.tail = 'Tail'
<root><child>Text</child>Tail</root>

Careful with indentation

  • Whitespace, linefeeds, etc. are text, no matter what

  • str.strip() may be helpful
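
To make this concrete, the sketch below parses an indented version of the previous document (an assumption for illustration) and shows that the indentation appears in text and tail.

```python
from xml.etree.ElementTree import fromstring

# Same content as before, but indented by hand
pretty = """<root>
  <child>Text</child>
</root>"""

root = fromstring(pretty)
child = root[0]

print(repr(root.text))   # linefeed and indentation before <child>
print(repr(child.tail))  # linefeed after </child>

# text/tail may be None, so guard before stripping
cleaned = (root.text or "").strip()
print(repr(cleaned))
```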

Writing XML Documents

  • We have created Element objects

  • Added child elements

  • Now how do we create XML?

  • Wrap the root element in an ElementTree, a helper class

import sys
from xml.etree.ElementTree import ElementTree
tree = ElementTree(element)
# Python 3: encoding="unicode" writes str, as text streams require
tree.write(sys.stdout, encoding="unicode")  # or open(..., 'w')
  • Output is very compact: no whitespace is added

  • Text is preserved as-is

  • Pretty-printed output would change the document

    • Linefeeds and indentation are text
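
A self-contained sketch of the write step, assuming Python 3 and using an in-memory io.StringIO buffer in place of sys.stdout so the result can be inspected:

```python
import io
from xml.etree.ElementTree import Element, SubElement, ElementTree

root = Element("root")
child = SubElement(root, "child")
child.text = "Text"
child.tail = "Tail"

# encoding="unicode" makes write() emit str, which text streams expect
buffer = io.StringIO()
ElementTree(root).write(buffer, encoding="unicode")
output = buffer.getvalue()

print(output)
```

The output contains exactly the text that was set and no extra linefeeds or indentation.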

Reading XML Documents

This is simple …
import sys
from xml.etree.ElementTree import parse

tree = parse(sys.stdin)
for child in tree.getroot():
    age = child.attrib.get('age')
    if age is not None:
        print(age)
    if child.text is not None:
        print(child.text)
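
To try the same loop without piping a file into standard input, the sketch below parses from an in-memory io.StringIO object instead. The sample document and the names in it are assumptions for illustration.

```python
import io
from xml.etree.ElementTree import parse

# Hypothetical input standing in for sys.stdin
source = io.StringIO(
    '<root><child age="15">Anna</child><child>Ben</child></root>'
)

tree = parse(source)
found = []
for child in tree.getroot():
    # attrib.get() returns None when the attribute is missing
    found.append((child.attrib.get("age"), child.text))

print(found)
```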