Elegant Python loop for parsing flat XML

+1 vote
asked May 23, 2016 by paragbaxi

I am parsing an XML file in Python 3 using lxml.objectify:

<root>
  <object_header></object_header>
  <object_details></object_details>
  <object_details></object_details>
  <object_header></object_header>
  <object_details></object_details>
  <object_header></object_header>
</root>

Note that an object sometimes has no <object_details> elements after its header.

My current approach works but is inelegant:

from lxml import objectify, etree
root = objectify.parse(xmlFile).getroot()
elems = [el for el in root.iterchildren()]
# data is list of objects
data = []
# Have to instantiate outside of the for loop in case the last object has no details.
objectDetails = ''
# Don't store first object right away.
firstObject = True
# Iterate through each XML element.
for elem in elems:
    if elem.tag == 'object_header':
        # Remember object header info.
        object = storeHeaderInfo(objectDetails)
        # Skip saving if first object, need to grab object details.
        if firstObject == True:
            # Don't skip again, in case object has no details.
            firstObject = False
            continue
        # Save object, already grabbed object details.
        data.append(object)
    else:
        # Process object details in <object_details> tag.
        objectDetails += etree.tostring(elem)
# Save last object.
object = storeHeaderInfo(objectDetails)
data.append(object)

What I don't like is that I have to write the code that stores an object twice: once inside the for loop, and again after the loop for the last object.

Is there a more Pythonic or elegant way of doing this?

1 Answer

+2 votes
answered Nov 29 by alecxe

You can simplify this by selecting each header and walking its siblings with the XPath following-sibling::* expression:

from lxml import objectify, etree
root = objectify.parse("input.xml").getroot()
elems = root.xpath("//object_header")
for elem in elems:
    header = elem.text
    objectDetails = ''
    for sibling in elem.xpath("following-sibling::*"):
        if sibling.tag == 'object_header':
            break
        objectDetails += str(etree.tostring(sibling))
    print(header, objectDetails)

Given the following input:

<root>
  <object_header>object1</object_header>
  <object_details>detail1</object_details>
  <object_details>detail2</object_details>
  <object_header>object2</object_header>
  <object_details>detail1</object_details>
  <object_header>object3</object_header>
</root>

The code would print:

object1 b'<object_details>detail1</object_details>'b'<object_details>detail2</object_details>'
object2 b'<object_details>detail1</object_details>'
object3 
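
As a possible alternative, here is a minimal single-pass sketch that stores each object as soon as its header is seen, so nothing has to be saved again after the loop. It is not from the original answer. It assumes the flat header/details layout shown above, and it uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies; the same loop should carry over to lxml.etree. The name group_by_header and the inline XML string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample input matching the layout in the question (illustrative only).
XML = """<root>
  <object_header>object1</object_header>
  <object_details>detail1</object_details>
  <object_details>detail2</object_details>
  <object_header>object2</object_header>
  <object_details>detail1</object_details>
  <object_header>object3</object_header>
</root>"""


def group_by_header(root):
    """Return a list of (header_text, [serialized details]) pairs."""
    groups = []
    for child in root:
        if child.tag == "object_header":
            # Start a new group and store it immediately, so a header
            # with no details still appears in the result.
            groups.append((child.text, []))
        elif groups:
            # Attach the detail to the most recent header.
            # encoding="unicode" returns str instead of bytes; strip()
            # drops the whitespace that follows the element in the file.
            groups[-1][1].append(ET.tostring(child, encoding="unicode").strip())
    return groups


groups = group_by_header(ET.fromstring(XML))
for header, details in groups:
    print(header, "".join(details))
```

Because the details are serialized as str rather than bytes, the printed lines do not contain the b'...' wrappers seen in the output above, and object3 ends up with an empty list of details.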