Don’t Parse That XML!

August 7th, 2010

I’ve talked a few times about how the best code you can write is code you never write.  One of the major places I see developers writing code that they don’t need to write is when parsing XML.

A word of caution before I go into how to not have to parse XML!

What I am going to describe is not always going to be the best solution.  It is an easy solution that will cover simple processing of XML files.  For a large XML file, the solutions I am going to suggest might be memory intensive and take too long.
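
For a sense of what the heavier alternative looks like when a file really is too large to load whole, here is a minimal sketch in Java using the JDK’s StAX streaming reader.  The `total` element name and the `StreamingTotals` class are made up for illustration.  Treat it as one possible approach under those assumptions, not as a recommendation for the simple cases this post is about.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: sums every <total> element in a document.
class StreamingTotals {
    // Walks the XML one event at a time instead of building an object tree,
    // so memory use does not grow with the number of records.
    static double sumTotals(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            double sum = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "total".equals(reader.getLocalName())) {
                    // getElementText() reads the text and moves to the end tag.
                    sum += Double.parseDouble(reader.getElementText());
                }
            }
            reader.close();
            return sum;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
    }
}
```

Notice that even this tiny task needs hand-written, structure-specific code, which is exactly the cost the rest of this post tries to avoid when the file is small enough.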

Seems like everyone is doing it

I’ve walked into so many software development shops and seen code to parse XML files.  It seems to be one of those really common things that not enough developers realize can be completely automated.

I have started to wonder if it is self-propagating: developers see XML being manually parsed in one place, assume that there is not a better way, then carry that manual parsing to the next place they go.

Why is it so bad?

First of all, it is not an easy task to parse XML.  Even when using an XML parsing library there is a large amount of code that has to be written, especially for a complex XSD.
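
To make that concrete, here is a rough sketch of what hand-written parsing tends to look like in Java, using the DOM parser that ships with the JDK.  The `Order` class, the `ManualOrderParser` name, and the `orders` / `order` / `customer` / `total` structure are all invented for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical record type for the illustration.
class Order {
    final int id;
    final String customer;
    final double total;

    Order(int id, String customer, double total) {
        this.id = id;
        this.customer = customer;
        this.total = total;
    }
}

class ManualOrderParser {
    // Every element and attribute name is hard-coded here, so any change
    // to the XML structure means editing this method by hand.
    static List<Order> parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<Order> orders = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("order");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element node = (Element) nodes.item(i);
                int id = Integer.parseInt(node.getAttribute("id"));
                String customer = node.getElementsByTagName("customer").item(0).getTextContent();
                double total = Double.parseDouble(
                        node.getElementsByTagName("total").item(0).getTextContent());
                orders.add(new Order(id, customer, total));
            }
            return orders;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse orders", e);
        }
    }
}
```

That is a fair amount of code for three fields, and it has no handling yet for missing elements, optional attributes, or nesting.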

XML parsing code is also very fragile.  If the structure of an XML file changes, the code will have to be modified, and the modification can have cascading effects.

Hand-written XML parsing code cannot be regenerated if the structure of the XML changes.

Most importantly, any code you have to write runs the risk of introducing bugs and complexity into the system.

It’s so simple you wouldn’t believe it

So, how simple is it to automatically parse XML into objects?

Very simple.  First I am going to give you the basic pattern, then I am going to tell you how to do it in both C# and Java.

Basic pattern:

  1. Use a tool to generate an inferred XSD from your XML file.  (You can skip this step if you already have an XSD file.)
  2. Use a code generation tool to generate your classes automatically from the XSD file.
  3. In your code, deserialize your XML file into an object tree using the framework you generated the classes from.

If you are doing something more complex than this, without a really good reason, you are doing it wrong!

Learning how to do this in your language of choice adds a very important tool to your tool bag.  Many times when I have needed to parse XML files, I have saved hours of development time by knowing how to automatically deserialize them into objects.

There are two main ways in which XML serialization frameworks work.

  1. Serializers that auto-generate the classes from the XSD files.
  2. Serializers that use annotations or attributes on classes.

Using a serializer that auto-generates the classes from an XSD is the easiest to use and can work in most cases.  If you need more control over the generation of the XML, you might want to use an attribute or annotation based framework.
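
If the annotation-based style is unfamiliar, the following toy sketch in Java shows the general idea using only the JDK.  The `@FromElement` annotation, the `Person` class, and `TinyMapper` are all hypothetical names I made up for this illustration.  Real frameworks define their own annotations and handle far more cases (attributes, nesting, lists, type conversion), so read this as a picture of the mechanism rather than of any particular library.

```java
import java.io.StringReader;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical annotation, invented for this sketch.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface FromElement {
    String value(); // name of the XML element to read the field from
}

// A hand-written class whose fields are tied to XML elements by annotations.
class Person {
    @FromElement("name")
    String name;

    @FromElement("city")
    String city;
}

class TinyMapper {
    // Creates an instance of the given type and fills each annotated String
    // field from the first element in the document with the matching name.
    static <T> T read(Class<T> type, String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            Constructor<T> constructor = type.getDeclaredConstructor();
            constructor.setAccessible(true);
            T target = constructor.newInstance();
            for (Field field : type.getDeclaredFields()) {
                FromElement mapping = field.getAnnotation(FromElement.class);
                if (mapping == null) {
                    continue; // unannotated fields are left alone
                }
                Node match = root.getElementsByTagName(mapping.value()).item(0);
                if (match != null) {
                    field.setAccessible(true);
                    field.set(target, match.getTextContent());
                }
            }
            return target;
        } catch (Exception e) {
            throw new IllegalStateException("Could not map XML to " + type.getName(), e);
        }
    }
}
```

With that in place, `TinyMapper.read(Person.class, xml)` returns a populated `Person`.  The point is that the mapping lives on the class itself, which is what gives you the extra control mentioned above.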

One of the biggest barriers in getting started with an XML framework is knowing what to use and how to use it.  I am going to cover 3 options that will get you going for C#, Java SE, and Java Android development.

C# (XSD.exe)

XML serialization is so easy in C# because it is built right into the .NET framework.

The only real piece of magic you need to know is the XSD.exe tool which is installed with Visual Studio.  This one tool can be run to infer an XSD from your XML file and then again to take that XSD and produce fully serializable / deserializable classes.

If you have an XML file named myFile.xml, you can simply go to the Visual Studio command prompt and type:

xsd myFile.xml

This produces myFile.xsd.

Then type:

xsd myFile.xsd /c

This will generate a set of classes that you can add to your project, and then you can deserialize an XML file with this simple code:

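    using System.IO;
    using System.Xml.Serialization;

    XmlSerializer serializer = new XmlSerializer(typeof(MyFile));

    MyFile myFile;
    // Dispose the stream when done so the file handle is released.
    using (Stream reader = new FileStream("myFile.xml", FileMode.Open))
    {
        myFile = (MyFile) serializer.Deserialize(reader);
    }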

 
It really is that simple.  There is no excuse for hand writing XML parsing code when you can literally take an XML file you have never seen before and turn it into an object in memory in 10 minutes.

The serialization framework and XSD tool also provide options for using attributes to control how the XML is generated.

Java (JAXB)

The steps are slightly more complicated with JAXB, but it is still fairly easy.

First we have to generate an XSD file from an XML file.  JAXB doesn’t do this itself as far as I know, but there is another tool we can use called Trang.

First step, download Trang, then run it like so:

java -jar trang.jar -I xml -O xsd myFile.xml myFile.xsd

You can also use the XSD.exe tool from Visual Studio if you have it installed or download it.  There are a few other tools out there as well.

Once you have the XSD file, or if you already had one you had written, you need to generate Java classes using JAXB’s tool like so:

xjc -p my.package.name myFile.xsd -d myDirectory

Running this command will produce Java files that represent the elements in your XML document.

Finally, to create your objects you can use the JAXB unmarshaller.

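    import java.io.File;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;

    // The context path is the package that xjc generated the classes into.
    JAXBContext jc = JAXBContext.newInstance("my.package.name");
    Unmarshaller unmarshaller = jc.createUnmarshaller();
    MyFile myFile = (MyFile) unmarshaller.unmarshal(new File("myFile.xml"));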

Not as simple as the C# example, but really quite simple.  I’ve omitted the steps like downloading JAXB and adding it to your class path, but you can see that the process really is not very painful at all.

JAXB also provides some options for customizing the serialization and deserialization.

Android (Simple XML)

You can’t use JAXB with Android.  It seems that, because of the Dalvik VM, the reflection that JAXB relies on doesn’t work.

I found a pretty good and small XML framework that I am using in my Android app that seems to do the trick nicely.  You have to annotate your classes and create them by hand, but it is very simple and straightforward to do so.

The tool is called “Simple XML.”

You can find lots of examples on the “tutorial” tab on the web page.

Basically, you download Simple XML, add the jars to your class path, and create class files with some annotations to specify how to deserialize or serialize your XML.

Here is a small example of an annotated class, from the Simple XML website.

    import org.simpleframework.xml.Attribute;
    import org.simpleframework.xml.Element;
    import org.simpleframework.xml.Root;

    @Root
    public class Example {

       @Element
       private String text;

       @Attribute
       private int index;

       public Example() {
          super();
       }

       public Example(String text, int index) {
          this.text = text;
          this.index = index;
       }

       public String getMessage() {
          return text;
       }

       public int getId() {
          return index;
       }
    }

 

To deserialize this XML you just use the following code.

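    import java.io.File;
    import org.simpleframework.xml.Serializer;
    import org.simpleframework.xml.core.Persister;

    Serializer serializer = new Persister();
    File source = new File("example.xml");

    // read() is declared to throw Exception, so call it from code that handles it.
    Example example = serializer.read(Example.class, source);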

Very simple and straightforward.  The only downside is generating the Java class files yourself, but that isn’t really very hard to do using the annotations.

No excuses

So there you have it, XML serialization frameworks abound to make your life easier when dealing with XML.  For most simple cases you should never handwrite XML parsing code, even when using a library to help you do it.

Now that I’ve shown you how easy it is, there really are no excuses!

As always, you can subscribe to this RSS feed to follow my posts on elegant code.  Feel free to check out my main personal blog at http://simpleprogrammer.com, which has a wider range of posts, updated 2-3 times a week.  Also, you can follow me on twitter here.

  • Brad

    The XmlSerializer route (in C#) is crazy slow. It is about 2 full orders of magnitude slower than custom (or generated) classes using Linq2Xml. The XmlReader is faster yet, especially for large files.

  • James

    Yeah, our initial get-it-working solution used XmlSerializer, but it was so, freakin, slow, we dumped it for XmlReader, and then later dumped XML for Google Protocol Buffers, since this message processing is on a critical path.

    Config files, and hardly-read XML, sure, use a framework like this, if you don’t mind paying the cost.

  • http://elegantcode.com/about/john-sonmez/ John Sonmez

    @Brad
    @James

    This solution isn’t going to be best for speed or large XML files that are time sensitive. Even though XML serialization may be slower, you really need to have a good reason to abandon such a simple solution and choose to write your own code to parse the XML.

    Consider that web services in asp.net basically do the serialization / deserialization on the fly.

  • Steve

    I have often used a hybrid technique.
    Say you have a file of orders with … repeat ad nauseam
    I will use an XmlReader to read from record to record, then when positioned on the start of a single, use ReadSubtree to get a new XmlReader starting there and feed that to the XmlSerializer.
    That way you are only deserializing a single order at a time and can thus operate on arbitrarily large files without increasing your memory footprint.

  • Steve

    @Steve
    hrumpf. sample xml was sacrificed to the XSS gods.
    ex. [orders][order]…[/order] repeated order nodes here[/orders]

  • http://blog.zoolutions.se Mikael Henriksson

    Hi John,

    There are situations when parsing xml by hand is the only sane option. One example that springs to mind is http://www.192.com/. Linq-To-Xml is superior to generating the classes from the schema and trying to use the right types, since the names etc. do not make sense. Now, why should I generate 200 classes I don’t need? Maybe it’s just that the title of the blog post was chosen unwisely, because it’s a good tip. I just don’t agree with the title.

  • http://www.voronoigame.com Indrit Selimi

    Hi,

    I don’t agree at all. This approach (the use of automatic tools like jaxb) creates useless intermediate DTOs, i.e. data structures that typically have a very short lifecycle, and you need to copy data from your DTO to your domain object. This is verbose and creates ugly temporary data structures (like jaxb does); the result is to have the same data moved from xml format to java/c#/etc format… Why do this? Why not use xpath and map properties directly to your (eventually annotated) domain object (like SimpleXml for android)? Why have a +1 step approach?

    I don’t agree.

  • http://codegoop.com Chad

    Java 6 includes JAXB. No need to worry about downloading or putting stuff on the classpath.

  • http://elegantcode.com/about/john-sonmez/ John Sonmez

    @Steve

    I like your approach. Very nice way to have the convenience of generated XML code and not have the memory footprint.

  • Alex

    It depends what you want to achieve – speed of coding or speed of application. We had an XML parsers contest and the winners were all custom-built parsers. They outperformed standard common ones (like JAXB) by at least ten times. That’s the price you pay when you use a “common” solution.