Don’t Parse That XML!

I’ve talked a few times about how the best code you can write is code you never write.  One of the major places I end up seeing developers writing code that they don’t need to write is when parsing XML.

A word of caution before I go into how to not have to parse XML!

What I am going to describe is not always going to be the best solution.  It is an easy solution that will cover simple processing of XML files.  For a large XML file, the solutions I am going to suggest might be memory intensive and take too long.

Seems like everyone is doing it

I’ve walked into so many software development shops and seen code to parse XML files.  It seems to be one of those really common things that not enough developers realize can be completely automated.

I have started to wonder if it is self-propagating.  Developers see XML being manually parsed in one place, assume that there is not a better way, and then propagate that manual parsing to the next place they go.

Why is it so bad?

First of all, it is not an easy task to parse XML.  Even when using an XML parsing library there is a large amount of code that has to be written, especially for a complex XSD.

XML parsing code is also very fragile.  If the structure of an XML file changes, the code will have to be modified, and the modification can have cascading effects.

Hand-written XML parsing code, unlike generated code, cannot simply be regenerated if the structure of the XML changes.

Most importantly, any code you have to write runs the risk of introducing bugs and complexity into the system.
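
To make the first two points concrete, here is a minimal sketch of hand-written parsing in Java, using the DOM parser that ships with the JDK.  The order document, the HandParsedOrder class, and its two fields are hypothetical, chosen only for illustration; they do not come from any real schema.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example: hand-parsing <order id="7"><customer>Ann</customer></order>.
class HandParsedOrder {
    final int id;
    final String customer;

    HandParsedOrder(int id, String customer) {
        this.id = id;
        this.customer = customer;
    }

    // Every element and attribute name below is hard-coded, so a structure
    // change (say, renaming <customer>) means editing this method by hand.
    static HandParsedOrder parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            Element root = doc.getDocumentElement();
            if (!"order".equals(root.getTagName())) {
                throw new IllegalArgumentException("expected <order> root");
            }
            int id = Integer.parseInt(root.getAttribute("id"));

            NodeList customers = root.getElementsByTagName("customer");
            if (customers.getLength() != 1) {
                throw new IllegalArgumentException("expected one <customer>");
            }
            String customer = customers.item(0).getTextContent().trim();

            return new HandParsedOrder(id, customer);
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers see one exception type.
            throw new IllegalArgumentException("could not parse order XML", e);
        }
    }

    public static void main(String[] args) {
        HandParsedOrder order =
            parse("<order id=\"7\"><customer>Ann</customer></order>");
        System.out.println(order.id + " " + order.customer); // prints "7 Ann"
    }
}
```

Even for a document with one attribute and one element, most of the method is plumbing and validation, and renaming the customer element in the XML breaks parse until someone edits it by hand.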

It’s so simple you wouldn’t believe it

So, how simple is it to automatically parse XML into objects?

Very simple.  First I am going to give you the basic pattern, then I am going to tell you how to do it in both C# and Java.

Basic pattern:

  1. Use a tool to generate an inferred XSD from your XML file.  (You can skip this step if you already have an XSD file.)
  2. Use a code generation tool to generate your classes automatically from the XSD file.
  3. In your code, deserialize your XML file into an object tree using the framework you generated the classes from.

If you are doing something more complex than this, without a really good reason, you are doing it wrong!

Learning how to do this in your language of choice is a very important tool to put into your tool bag.  I have run into the need to parse XML files many times, and knowing how to automatically deserialize them into objects has saved me many hours of development time.

There are two main ways in which XML serialization frameworks work.

  1. Serializers that auto-generate the classes from the XSD files.
  2. Serializers that use annotations or attributes on classes.

Using a serializer that auto-generates the classes from an XSD is the easiest to use and can work in most cases.  If you need more control over the generation of the XML, you might want to use an attribute or annotation based framework.

One of the biggest barriers in getting started with an XML framework is knowing what to use and how to use it.  I am going to cover 3 options that will get you going for C#, Java SE, and Java Android development.

C# (XSD.exe)

XML serialization is so easy in C# because it is built right into the .NET framework.

The only real piece of magic you need to know is the XSD.exe tool which is installed with Visual Studio.  This one tool can be run to infer an XSD from your XML file and then again to take that XSD and produce fully serializable / deserializable classes.

If you have an XML file named myFile.xml, you can simply go to the Visual Studio command prompt and type:

xsd myFile.xml

Which will produce a myFile.xsd.

Then type

xsd myFile.xsd /c

This will generate a set of classes that you can add to your project, and then you can deserialize an XML file with this simple code:

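using System.IO;
using System.Xml.Serialization;

XmlSerializer serializer = new XmlSerializer(typeof(MyFile));

// Open the file and deserialize it into the generated MyFile class.
using (Stream reader = new FileStream("myFile.xml", FileMode.Open))
{
    MyFile myFile = (MyFile) serializer.Deserialize(reader);
}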

 
It really is that simple.  There is no excuse for hand writing XML parsing code when you can literally take an XML file you have never seen before and turn it into an object in memory in 10 minutes.

The serialization framework and the XSD tool also provide options for using attributes to control how the XML is generated.

Java (JAXB)

The steps are slightly more complicated with JAXB, but it is still fairly easy.

First we have to generate an XSD file from an XML file.  JAXB doesn’t do this itself as far as I know, but there is another tool we can use called Trang.

First step, download Trang, then run it like so:

java -jar trang.jar -I xml -O xsd myFile.xml myFile.xsd

You can also use the XSD.exe tool from Visual Studio if you have it installed or download it.  There are a few other tools out there as well.

Once you have the XSD file, or if you already had one you had written, you need to generate Java classes using JAXB’s tool like so:

xjc -p my.package.name myFile.xsd -d myDirectory

Running this command will produce Java files that represent the elements in your XML document.

Finally, to create your objects you can use the JAXB unmarshaller.

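import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;

// "my.package.name" is the package passed to xjc with -p.
JAXBContext jc = JAXBContext.newInstance("my.package.name");
Unmarshaller unmarshaller = jc.createUnmarshaller();
MyFile myFile = (MyFile) unmarshaller.unmarshal(new File("myFile.xml"));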

Not as simple as the C# example, but really quite simple.  I’ve omitted the steps like downloading JAXB and adding it to your class path, but you can see that the process really is not very painful at all.

JAXB also provides some options for customizing the serialization and deserialization.

Android (Simple XML)

You can’t use JAXB with Android.  It seems that, because of the Dalvik VM, the reflection part of JAXB doesn’t work.

I found a pretty good and small XML framework that I am using in my Android app that seems to do the trick nicely.  You have to annotate your classes and create them by hand, but it is very simple and straightforward to do so.

The tool is called “Simple XML.”

You can find lots of examples on the “tutorial” tab on the web page.

Basically, you download Simple XML, add the jars to your class path, and create class files with some annotations to specify how to deserialize or serialize your XML.

Here is a small example of an annotated class, from the Simple XML website.

import org.simpleframework.xml.Attribute;
import org.simpleframework.xml.Element;
import org.simpleframework.xml.Root;

@Root
public class Example {

   // Mapped to a child element named "text".
   @Element
   private String text;

   // Mapped to an attribute named "index".
   @Attribute
   private int index;

   public Example() {
      super();
   }

   public Example(String text, int index) {
      this.text = text;
      this.index = index;
   }

   public String getMessage() {
      return text;
   }

   public int getId() {
      return index;
   }
}

 

To deserialize this XML you just use the following code.

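import java.io.File;
import org.simpleframework.xml.Serializer;
import org.simpleframework.xml.core.Persister;

Serializer serializer = new Persister();
File source = new File("example.xml");

// read() is declared to throw Exception, so call it from a method that handles or declares it.
Example example = serializer.read(Example.class, source);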

Very simple and straightforward.  The only downside is generating the Java class files yourself, but that isn’t really very hard to do using the annotations.

No excuses

So there you have it, XML serialization frameworks abound to make your life easier when dealing with XML.  For most simple cases you should never handwrite XML parsing code, even when using a library to help you do it.

Now that I’ve shown you how easy it is, there really are no excuses!

As always, you can subscribe to this RSS feed to follow my posts on elegant code.  Feel free to check out my main personal blog at http://simpleprogrammer.com, which has a wider range of posts, updated 2-3 times a week.  Also, you can follow me on twitter here.

31 thoughts on “Don’t Parse That XML!”

  1. The XmlSerializer route (in c#) is crazy slow. It is about 2 full orders of magnitude slower than custom (or generated) classes using Linq2Xml. The XmlReader is faster yet, especially for large files.

  2. Yeah, our initial get-it-working solution used XmlSerializer, but it was so, freakin, slow, we dumped it for XmlReader, and then later dumped XML for Google Protocol Buffers, since this message processing is on a critical path.

    Config files, and hardly-read XML, sure, use a framework like this, if you don’t mind paying the cost.

  3. @Brad
    @James

    This solution isn’t going to be best for speed or large XML files that are time sensitive. Even though XML serialization may be slower, you really need to have a good reason to abandon such a simple solution and choose to write your own code to parse the XML.

    Consider that web services in asp.net basically do the serialization / deserialization on the fly.

  4. I have often used a hybrid technique.
    Say you have a file of orders with … repeat ad nauseam
    I will use an XmlReader to read from record to record, then when positioned on the start of a single order, use ReadSubtree to get a new XmlReader starting there and feed that to the XmlSerializer.
    That way you are only deserializing a single order at a time and can thus operate on arbitrarily large files without increasing your memory footprint.

  5. @Steve
    hrumpf. sample xml was sacrificed to the XSS gods.
    ex. [orders][order]…[/order] repeated order nodes here[/orders]

  6. Hi John,

    There are situations when parsing xml by hand is the only sane option. One example that springs to mind is http://www.192.com/: Linq-To-Xml is superior to generating the classes from the schema and trying to use the right types, since the names etc. do not make sense. Now, why should I generate 200 classes I don’t need? Maybe it’s just that the title of the blog post was chosen unwisely, because it’s a good tip. I just don’t agree with the title.

  7. Hi,

    I don’t agree at all. This approach (the use of automatic tools like JAXB) creates useless intermediate DTOs, i.e. data structures that typically have a very short lifecycle, and you need to copy data from your DTO to your domain object. This is verbose and creates ugly temporary data structures (like JAXB does); the result is to have the same data go from XML format to Java/C#/etc. format… Why do this? Why not use XPath and map properties directly to your (eventually annotated) domain object (like Simple XML for Android)? Why have a +1 step approach?

    I don’t agree.

    It depends on what you want to achieve – speed of coding or speed of application. We had an XML parser contest and the winners were all custom-built parsers. They outperformed the standard common ones (like JAXB) by at least ten times. That’s the price you pay when you use a “common” solution.

    This approach is not applicable if you have to deal with XML that is a document, like for example the word2007ml format. The number of classes that you will have to take care of will be too big, as will the memory footprint and the processing time.
    Automatically generated classes (JAXB) are fine, but you will have to use them in some of your own classes. So you will have to maintain these relations too.

  10. I always start with an XPath library. I’ll use a data binding framework if the XML represents object serialization. But, XPath saves the day most of the time. I agree that dealing with parsers directly (DOM, SAX, StAX, etc.) is usually a bad idea.

  11. Working in Java, I have used all kinds of approaches for this kind of thing, and I am a firm believer in KISS. Therefore, I’m not too crazy about using XSD’s and generated classes, like you would for your JAXB solution. Even if they are autogenerated, it means that you have a couple extra steps every time you want to change your object model.

    In response to this, I came up with my own library which I call EasyXML. This attempts to be as close to no-config as possible, using annotations to determine how different Java properties will be persisted. All you have to do is set up an EasyXML persister with a map of classes to element names, annotate the getters of properties you want to persist with @MapToAttribute or @MapToElement, then call toString() on an object to get XML, or toObject() on some XML to get the object back. I’ve used it in several places and it just works, with very little care and feeding.

    Recently, I discovered Simple XML, which takes the same approach as EasyXML but implements it more fully. I haven’t used it yet but it looks very well thought out. If I were starting over again, I would go with Simple XML, but at the time I started with EasyXML I don’t think it was out there yet.

    Another library which does a really good job of low-config XML persistence is XStream, but it doesn’t give you as much control over the resulting XML.

  12. JAXB has excellent mvn integration which can be used to generate your Java classes (if you’re using mvn).

    something like this in the build config:

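    <plugin>
      <groupId>org.jvnet.jaxb2.maven2</groupId>
      <artifactId>maven-jaxb2-plugin</artifactId>
      <version>0.7.4</version>
      <configuration>
        <schemaDirectory>src/main/resources/schema</schemaDirectory>
        <generateDirectory>src/generated/java</generateDirectory>
      </configuration>
      <executions>
        <execution>
          <goals>
            <goal>generate</goal>
          </goals>
        </execution>
      </executions>
    </plugin>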

  13. Several things :

    @bob tinsman : changing the XML format is not a problem if you’ll have just one version lying around at any time and don’t need to migrate from version to version. As it’s not good to make your DB model from your object model, I’m not sure that going from the object model to choose the format of your XML is any good. What about the version issues? Backward compatibility and such?

    If people don’t read your XML, and data is just sent over the wire between software of the same version, maybe you don’t even need XML anyway. Java serialization, for example, will give far better performance and will avoid dealing with XML at all.

    In response to the blog post:

    When you need XML, I prefer to have a nice XSD. A good XSD helps you validate your file (many error cases automatically handled) and helps people who write your XML file by using completion in their editor…

    But you must then make your XSD yourself. You cannot just use a generated version from an XML sample. (It can serve as a model too.) When making your XSD, it’s good to think about being future-proof too.

    But the generated code will not suffice, because each time you change your XSD a little, the generated code will change. And the generated object model is not really the object model you want for the rest of your code.

    So using an intermediate object model that will change less often and will make you less dependent on the actual XML format is a good thing. For example, you could have only one object model, supporting all the features of the application. And your import code could support several versions of the XML file and deal with the details there.

    Another added benefit of an intermediate model is the opportunity to add additional checks to your files so as to fail early on inconsistent files, instead of having your code break.

  14. Good post!

    Your article focussed on starting from an XML schema. JAXB is also aimed at people starting from Java classes and has a rich set of annotations for mapping to an XML schema. Another advantage of JAXB is that it is a standard with multiple implementations (I lead the MOXy JAXB implementation).

    @Bob Tinsman: JAXB also supports a minimal config annotation-based approach for mapping classes to XML. JAXB implementations such as MOXy also support an XML representation of this metadata.

    @Indrit Selimi: You don’t necessarily need DTOs. JAXB implementations such as MOXy offer extensions such as XPath based mapping and JPA entity mapping that allow you to use your own domain model.

    @Alex: It’s hard to judge your performance benchmark without knowing any details. Did you include the creation of the JAXB context in the performance test? When comparing to SAX what did the ContentHandler do? Were you using a recent version of JAXB 2.2?

  15. I’m immediately suspicious of any process that involves breaking out and running command line tools as this is not feasible with a frequently changing structure. How often are you going to do this really?

    However if the structure is somewhat static I can see the benefits.

    Your post if anything has encouraged me to use xsds where I don’t already.

    Cheers

  16. I can be harsh, since you make a claim that is not true:

    “If you are doing something more complex than this, without a really good reason, you are doing it wrong!”

    Well, this is how I see it. If I want to go somewhere by car and someone tries to convince me I should use a bike it could be for two reasons
    1. The road is short, easy and it is sunny
    2. the “someone” has no drivers license

    My road is long, windy and it is pouring rain.
    I am afraid you don’t have a drivers license 🙂

    Please come back to this once you have experienced this with complex XML

  17. Sheesh, the illiterate attack. I’m pretty sure “without a really good reason” invalidates the two arguments above.

    With that said, thanks John for the post. I see a lot of developers bend over backwards to hand write XML gunk, when it could have just as easily been accomplished via xsd inference and code generation. There’s a time and a place for both approaches; unfortunately, most devs never even consider the codegen route.

  18. I did not know about this XSD tool. I used this method to automate the creation of objects for application setting persistence and I like it. Great post.

  19. simple-xml will not work for Android since it also depends on StAX, which is a shame.

  20. Thanks for this post. Though I like the idea of mapping XML elements to Java objects using a serialization framework it may not always be the best idea.

    Let’s take the Simple framework for example. It’s doing a great job and it’s pretty easy to use, even for me as a “non-pro developer”. However, in most cases, you need to create a class for every non-primitive type which can be pretty much overhead.

    Example:

     
    <departure>
        <time>06:32</time>
    </departure>
    <arrival>
        <time>07:48</time>
    </arrival>

    I would map both Time elements to attributes of a Connection class like:

    public class Connection {
      private String departureTime;
      private String arrivalTime;
    }

    I cannot do this with Simple without creating classes for Departure and Arrival. This does not seem reasonable to me.

  21. Hi Robert,

    The EclipseLink MOXy framework does not require the creation of intermediate objects.  Below is your Connection object mapped to your sample XML document:

    import org.eclipse.persistence.oxm.annotations.XmlPath;

    public class Connection {
      @XmlPath("departure/time/text()")
      private String departureTime;

      @XmlPath("arrival/time/text()")
      private String arrivalTime;
    }

    -Blaise
