Don’t Parse That XML!

August 7th, 2010

I’ve talked a few times about how the best code you can write is code you never write.  One of the major places I see developers writing code that they don’t need to write is when parsing XML.

A word of caution before I go into how to not have to parse XML!

What I am going to describe is not always going to be the best solution.  It is an easy solution that will cover simple processing of XML files.  For a large XML file, the solutions I am going to suggest might be memory intensive and take too long.
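For that large-file case, a streaming parser is usually the "really good reason" to go lower level.  Here is a minimal sketch using StAX, which ships with Java SE.  The StreamingCount class and the log/entry element names are made up for illustration; the point is only that nothing is held in memory beyond the current event.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingCount {

    // Counts elements with the given local name without building an
    // object tree, reading the document one event at a time.
    public static int countElements(Reader input, String localName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(input);
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(localName)) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document.
        String xml = "<log><entry/><entry/><entry/></log>";
        System.out.println(countElements(new StringReader(xml), "entry"));
    }
}
```

For everything that fits comfortably in memory, the rest of this post applies.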

Seems like everyone is doing it

I’ve walked into so many software development shops and seen code to parse XML files.  It seems to be one of those really common things that not enough developers realize can be completely automated.

I have started to wonder if it is self-propagating.  Developers see XML being parsed by hand in one place, assume that there is not a better way, then carry that manual parsing to the next place they go.

Why is it so bad?

First of all, it is not an easy task to parse XML.  Even when using an XML parsing library there is a large amount of code that has to be written, especially for a complex XSD.

XML parsing code is also very fragile.  If the structure of an XML file changes, the code will have to be modified, and the modification can have cascading effects.

Hand-written XML parsing code cannot be regenerated if the structure of the XML changes, whereas generated classes can simply be produced again from the updated XSD.

Most importantly, any code you have to write runs the risk of introducing bugs and complexity into the system.
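To give a feel for what I mean, here is a minimal sketch of hand-written parsing with the DOM API that ships with Java SE.  The HandParsedOrders class and the order/item/quantity names are made up for illustration.  Notice that every element and attribute name is a string literal buried in the code, and this is for a document with only two kinds of element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class HandParsedOrders {

    // Hypothetical data holder for one <item> element.
    public static final class Item {
        public final String sku;
        public final int quantity;

        Item(String sku, int quantity) {
            this.sku = sku;
            this.quantity = quantity;
        }
    }

    // Walks the DOM by hand. A change to the XML structure means finding
    // and editing each of the string literals below.
    public static List<Item> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<Item> items = new ArrayList<>();
        NodeList nodes = doc.getDocumentElement().getElementsByTagName("item");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element item = (Element) nodes.item(i);
            String sku = item.getAttribute("sku");
            String quantityText = item.getElementsByTagName("quantity")
                    .item(0)
                    .getTextContent()
                    .trim();
            items.add(new Item(sku, Integer.parseInt(quantityText)));
        }
        return items;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document.
        String xml = "<order><item sku=\"A1\"><quantity>2</quantity></item>"
                + "<item sku=\"B7\"><quantity>5</quantity></item></order>";
        for (Item item : parse(xml)) {
            System.out.println(item.sku + " x " + item.quantity);
        }
    }
}
```

Rename quantity to qty in the XML and this code throws a NullPointerException at runtime, with nothing at compile time to warn you.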

It’s so simple you wouldn’t believe it

So, how simple is it to automatically parse XML into objects?

Very simple.  First I am going to give you the basic pattern, then I am going to tell you how to do it in both C# and Java.

Basic pattern:

  1. Use a tool to generate an inferred XSD from your XML file.  (You can skip this step if you already have an XSD file.)
  2. Use a code generation tool to generate your classes automatically from the XSD file.
  3. In your code, deserialize your XML file into an object tree using the framework you generated the classes from.

If you are doing something more complex than this, without a really good reason, you are doing it wrong!

Learning how to do this in your language of choice is a very important tool to put into your tool bag.  Many times when I have needed to parse XML files, knowing how to automatically deserialize them into objects has saved me hours of development time.

There are two main ways in which XML serialization frameworks work.

  1. Serializers that auto-generate the classes from the XSD files.
  2. Serializers that use annotations or attributes on classes.

Using a serializer that auto-generates the classes from an XSD is the easiest approach and works in most cases.  If you need more control over the generation of the XML, you might want to use an attribute or annotation based framework.
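If the annotation style sounds like magic, here is a toy sketch of the idea using only what ships with Java SE.  Everything in it (TinyMapper, the XmlChild annotation, the Person class) is hypothetical and far simpler than what JAXB or any real framework does; it only shows how an annotation on a field can carry the mapping so that one generic reader works for any class.

```java
import java.io.ByteArrayInputStream;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TinyMapper {

    // Hypothetical annotation: binds a field to an element by name.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface XmlChild {
        String value();
    }

    // Hypothetical target class; the annotations carry the whole mapping.
    public static class Person {
        @XmlChild("name")
        public String name;

        @XmlChild("city")
        public String city;
    }

    // Fills each annotated String field from the first element whose
    // tag name matches the annotation value.
    public static <T> T read(Class<T> type, String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();

        T target = type.getDeclaredConstructor().newInstance();
        for (Field field : type.getDeclaredFields()) {
            XmlChild binding = field.getAnnotation(XmlChild.class);
            if (binding == null) {
                continue; // not mapped to XML
            }
            NodeList matches = root.getElementsByTagName(binding.value());
            if (matches.getLength() > 0) {
                field.setAccessible(true);
                field.set(target, matches.item(0).getTextContent());
            }
        }
        return target;
    }

    public static void main(String[] args) throws Exception {
        Person p = read(Person.class, "<person><name>Ada</name><city>London</city></person>");
        System.out.println(p.name + ", " + p.city);
    }
}
```

A real framework adds type conversion, nesting, attributes, lists and writing XML back out, which is exactly the code you don't want to write yourself.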

One of the biggest barriers in getting started with an XML framework is knowing what to use and how to use it.  I am going to cover 3 options that will get you going for C#, Java SE, and Java Android development.

C# (XSD.exe)

XML serialization is so easy in C# because it is built right into the .NET framework.

The only real piece of magic you need to know is the XSD.exe tool which is installed with Visual Studio.  This one tool can be run to infer an XSD from your XML file and then again to take that XSD and produce fully serializable / deserializable classes.

If you have an XML file named myFile.xml, you can simply go to the Visual Studio command prompt and type:

xsd myFile.xml

Which will produce a myFile.xsd.

Then type

xsd myFile.xsd /c

This will generate a set of classes that you can add to your project, and then you can deserialize an XML file with this simple code:

    using System.IO;
    using System.Xml.Serialization;

    XmlSerializer serializer = new XmlSerializer(typeof(MyFile));
    MyFile myFile;
    using (Stream reader = new FileStream("myFile.xml", FileMode.Open))
    {
        myFile = (MyFile) serializer.Deserialize(reader);
    }

 
It really is that simple.  There is no excuse for hand writing XML parsing code when you can literally take an XML file you have never seen before and turn it into an object in memory in 10 minutes.

The serialization framework and XSD tool also provide options for using attributes to control how the XML is generated.

Java (JAXB)

The steps are slightly more complicated with JAXB, but it is still fairly easy.

First we have to generate an XSD file from an XML file.  JAXB doesn’t do this itself as far as I know, but there is another tool we can use called Trang.

Download Trang, then run it like so:

java -jar trang.jar -I xml -O xsd myFile.xml myFile.xsd

You can also use the XSD.exe tool from Visual Studio if you have it installed or download it.  There are a few other tools out there as well.

Once you have the XSD file, or if you already had one you had written, you need to generate Java classes using JAXB’s tool like so:

xjc -p my.package.name myFile.xsd -d myDirectory

Running this command will produce Java files that represent the elements in your XML document.

Finally, to create your objects you can use the JAXB unmarshaller.

    import java.io.File;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;

    JAXBContext jc = JAXBContext.newInstance("my.package.name");
    Unmarshaller unmarshaller = jc.createUnmarshaller();
    MyFile myFile = (MyFile) unmarshaller.unmarshal(new File("myFile.xml"));

Not as simple as the C# example, but really quite simple.  I’ve omitted the steps like downloading JAXB and adding it to your class path, but you can see that the process really is not very painful at all.

JAXB also provides some options for customizing the serialization and deserialization.

Android (Simple XML)

You can’t use JAXB with Android.  As far as I can tell, the reflection that JAXB relies on doesn’t work on the Dalvik VM.

I found a pretty good and small XML framework that I am using in my Android app that seems to do the trick nicely.  You have to annotate your classes and create them by hand, but it is very simple and straightforward to do so.

The tool is called “Simple XML.”

You can find lots of examples on the “tutorial” tab on the web page.

Basically, you download Simple XML, add the jars to your class path, and create class files with some annotations to specify how to deserialize or serialize your XML.

Here is a small example of an annotated class, from the Simple XML website.

    import org.simpleframework.xml.Attribute;
    import org.simpleframework.xml.Element;
    import org.simpleframework.xml.Root;

    @Root
    public class Example {

       @Element
       private String text;

       @Attribute
       private int index;

       public Example() {
          super();
       }

       public Example(String text, int index) {
          this.text = text;
          this.index = index;
       }

       public String getMessage() {
          return text;
       }

       public int getId() {
          return index;
       }
    }

 

To deserialize this XML you just use the following code.

    import java.io.File;
    import org.simpleframework.xml.Serializer;
    import org.simpleframework.xml.core.Persister;

    Serializer serializer = new Persister();
    File source = new File("example.xml");

    Example example = serializer.read(Example.class, source);

Very simple and straightforward.  The only downside is generating the Java class files yourself, but that isn’t really very hard to do using the annotations.

No excuses

So there you have it, XML serialization frameworks abound to make your life easier when dealing with XML.  For most simple cases you should never handwrite XML parsing code, even when using a library to help you do it.

Now that I’ve shown you how easy it is, there really are no excuses!

As always, you can subscribe to this RSS feed to follow my posts on elegant code.  Feel free to check out my main personal blog at http://simpleprogrammer.com, which has a wider range of posts, updated 2-3 times a week.  Also, you can follow me on twitter here.

  • http://Www.Manchesterdeveloper.com Lee englestone

    I’m immediately suspicious of any process that involves breaking out and running command line tools, as this is not feasible with a frequently changing structure. How often are you going to do this really?

    However if the structure is somewhat static I can see the benefits.

    Your post if anything has encouraged me to use xsds where I don’t already.

    Cheers

  • Valeriu Caraulean

    It’s not as simple as you’re saying. Real world is cruel.
    Samples are looking nice, but when you’re trying to do some real work – it doesn’t work.

    An example – xsd.exe isn’t handling properly lists in xml: http://stackoverflow.com/questions/1419316/trouble-with-xml-deserialization-into-xsd-generated-classes.

  • Geert

    I can be harsh, since you make a claim that is not true:

    “If you are doing something more complex than this, without a really good reason, you are doing it wrong!”

    Well, this is how I see it. If I want to go somewhere by car and someone tries to convince me I should use a bike it could be for two reasons
    1. The road is short, easy and it is sunny
    2. the “someone” has no drivers license

    My road is long, windy and it is pouring rain.
    I am afraid you don’t have a drivers license :-)

    Please come back to this once you have experienced this with complex XML

  • http://www.charlesstrahan.com Charles Strahan

    Sheesh, the illiterate attack. I’m pretty sure “without a really good reason” invalidates the two arguments above.

    With that said, thanks John for the post. I see a lot of developers bend over backwards to hand write XML gunk, when it could have just as easily been accomplished via xsd inference and code generation. There’s a time and a place for both approaches; unfortunately, most devs never even consider the codegen route.

  • Nick Martin

    I did not know about this XSD tool. I used this method to automate the creation of objects for application setting persistence and I like it. Great post.

  • Jeremy Cook

    simple-xml will not work for Android since it also depends on StAX, which is a shame.

  • Jim Stevenson

    @Jeremy Cook Simple does work with Android, if it can’t find StAX it defaults to DOM and works fine!

  • Robert Strauch

    Thanks for this post. Though I like the idea of mapping XML elements to Java objects using a serialization framework it may not always be the best idea.

    Let’s take the Simple framework for example. It’s doing a great job and it’s pretty easy to use, even for me as a “non-pro developer”. However, in most cases, you need to create a class for every non-primitive type which can be pretty much overhead.

    Example:

    <departure>
        <time>06:32</time>
    </departure>
    <arrival>
        <time>07:48</time>
    </arrival>

    I would map both Time elements to attributes of a Connection class like:

    public class Connection {
      private String departureTime;
      private String arrivalTime;
    }

    I cannot do this with Simple without creating classes for Departure and Arrival. This does not seem reasonable to me.

    • Blaise Doughan

      Hi Robert,

      The EclipseLink MOXy framework does not require the creation of intermediate objects.  Below is your Connection object mapped to your sample XML document:

      import org.eclipse.persistence.oxm.annotations.XmlPath;

      public class Connection {
        @XmlPath("departure/time/text()")
        private String departureTime;

        @XmlPath("arrival/time/text()")
        private String arrivalTime;
      }

      -Blaise
