The xml module

The xml package is part of the Python standard library. This section focuses on two of its submodules, xml.dom.minidom and xml.etree.ElementTree.

Working with minidom

In the following example we analyse books.xml:

    <?xml version="1.0"?>
    <catalog>
       <book id="1">
          <title>Python basics</title>
          <language>en</language>
          <author>Veit Schiele</author>
          <license>BSD-3-Clause</license>
          <date>2021-10-28</date>
       </book>
       <book id="2">
          <title>Jupyter Tutorial</title>
          <language>en</language>
          <author>Veit Schiele</author>
          <license>BSD-3-Clause</license>
          <date>2019-06-27</date>
       </book>
       <book id="3">
          <title>Jupyter Tutorial</title>
          <language>de</language>
          <author>Veit Schiele</author>
          <license>BSD-3-Clause</license>
          <date>2020-10-26</date>
       </book>
       <book id="4">
          <title>PyViz Tutorial</title>
          <language>en</language>
          <author>Veit Schiele</author>
          <license>BSD-3-Clause</license>
          <date>2020-04-13</date>
       </book>
    </catalog>

  1. To do this, we first import the minidom module under the short name minidom so that it can be referenced more easily:

    import xml.dom.minidom as minidom
    
  2. Then we define the function getTitles, parse the file and collect the book elements with the method getElementsByTagName:

    def getTitles(xml):
        """
        Print all titles found in books.xml
        """
        doc = minidom.parse(xml)
        node = doc.documentElement
        books = doc.getElementsByTagName("book")
    
  3. Then we create an empty list called titles and fill it with the first title element of each book:

        titles = []
        for book in books:
            titleObj = book.getElementsByTagName("title")[0]
            titles.append(titleObj)
    
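  4. Now the titles are printed in nested for loops: the outer loop walks over the title elements, the inner loop prints the text nodes each of them contains: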

        for title in titles:
            nodes = title.childNodes
            for node in nodes:
                if node.nodeType == node.TEXT_NODE:
                    print(node.data)
    
  5. Finally, we check whether the __name__ variable equals "__main__", which is only the case when the module is run as the main program rather than imported. In that case we call our getTitles function on our books.xml file:

    if __name__ == "__main__":
        document = "books.xml"
        getTitles(document)
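
The five steps above can also be condensed into a function that returns its results instead of printing them. The following is only a minimal sketch under a few assumptions: the name get_titles and the constant SAMPLE are hypothetical and not part of the original example, and SAMPLE is an abbreviated copy of books.xml that is parsed with minidom.parseString so that the sketch runs without the file.

```python
from xml.dom import minidom

# Abbreviated copy of books.xml (assumption: only two books and two
# child elements each, so the sketch runs without the file on disk).
SAMPLE = """<?xml version="1.0"?>
<catalog>
   <book id="1">
      <title>Python basics</title>
      <language>en</language>
   </book>
   <book id="2">
      <title>Jupyter Tutorial</title>
      <language>en</language>
   </book>
</catalog>"""


def get_titles(xml_text):
    """Return (id, title) pairs for every book element (hypothetical helper)."""
    doc = minidom.parseString(xml_text)
    result = []
    for book in doc.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # Join the text nodes that are direct children of the title element.
        text = "".join(
            node.data
            for node in title.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        # getAttribute returns the attribute value as a string.
        result.append((book.getAttribute("id"), text))
    return result


if __name__ == "__main__":
    for book_id, title in get_titles(SAMPLE):
        print(book_id, title)
```

Returning a list rather than printing makes the function easier to reuse and to test. To read the real file instead, minidom.parse("books.xml") can be used in place of minidom.parseString.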
    

Parsing with ElementTree

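  1. Import ElementTree:

    import xml.etree.ElementTree as ET
    

    Note

    Older code often imports xml.etree.cElementTree, the implementation written in C. That module was deprecated in Python 3.3 and removed in Python 3.9. Since Python 3.3, xml.etree.ElementTree uses the fast C implementation automatically whenever it is available.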

  2. Then we define the function parseXML and retrieve the root element:

    def parseXML(xml_file):
        """
        Parse XML with ElementTree
        """
        tree = ET.ElementTree(file=xml_file)
        print(tree.getroot())
        root = tree.getroot()
        print(f"tag={root.tag}, attrib={root.attrib}")
    
    <Element 'catalog' at 0x10b009620>
    tag=catalog, attrib={}
    
  3. Output the child elements of the root element and, for each book, the tags of its own child elements:

        for child in root:
            print(child.tag, child.attrib)
            if child.tag == "book":
                for step_child in child:
                    print(step_child.tag)
    
    book {'id': '1'}
    title
    language
    author
    license
    date
    book {'id': '2'}
    ...
    
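  4. Output the tag and text of the elements with iter, which yields the element it is called on followed by all of its descendants. Because the two loops are nested, elements further down the tree are printed more than once: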

        print("-" * 20)
        print("Iterating using iter")
        print("-" * 20)
        books = root.iter()
        for book in books:
            book_children = book.iter()
            for book_child in book_children:
                print(f"{book_child.tag}={book_child.text}")
    
    --------------------
    Iterating using iter
    --------------------
    catalog=
    book=
    title=Python basics
    language=en
    author=Veit Schiele
    license=BSD-3-Clause
    date=2021-10-28
    book=
    title=Jupyter Tutorial
    ...
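
When only certain elements are of interest, ElementTree can select them directly instead of walking the whole tree. The following is a minimal sketch rather than part of the original example: the names list_books, list_titles and SAMPLE are hypothetical, and SAMPLE is an abbreviated copy of books.xml that is parsed with ET.fromstring so that the sketch runs without the file.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of books.xml (assumption: only two books and two
# child elements each, so the sketch runs without the file on disk).
SAMPLE = """<?xml version="1.0"?>
<catalog>
   <book id="1">
      <title>Python basics</title>
      <language>en</language>
   </book>
   <book id="2">
      <title>Jupyter Tutorial</title>
      <language>en</language>
   </book>
</catalog>"""


def list_books(xml_text):
    """Return one dict per book element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": book.get("id"),  # attribute value
            "title": book.findtext("title"),  # text of the first title child
            "language": book.findtext("language"),
        }
        for book in root.findall("book")  # direct children named book
    ]


def list_titles(xml_text):
    """Return the text of every title element in document order."""
    root = ET.fromstring(xml_text)
    # iter("title") only yields elements whose tag is title.
    return [element.text for element in root.iter("title")]


if __name__ == "__main__":
    for book in list_books(SAMPLE):
        print(book)
    print(list_titles(SAMPLE))
```

Passing a tag name to iter restricts the iteration to matching elements, so each title appears exactly once. For the real file, ET.parse("books.xml").getroot() returns the same kind of root element.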