Oracle Blog

Don't panic !

refactoring xml

Refactoring is defined as "Improving a computer program by reorganising its internal structure without altering its external behaviour". This is incredibly useful in OO programming. It has driven the growth of IDEs such as NetBeans, IntelliJ and Eclipse, and it underpins influential software development movements such as Agile and Extreme Programming. It is what helps every OO programmer get over the insidious writer's block: don't worry too much about the model or field names now, it will be easy to refactor those later!

If maintaining behavior is what defines refactoring of OO programs - change the code, but maintain the behavior - what would the equivalent be for XML? If XML is considered a syntax for declarative languages, then refactoring XML would be changing the XML whilst maintaining its meaning. So this brings us right to the question of meaning. Meaning in a procedural language is easy to define. It is closely related to behavior, and behavior is what programming languages do their best to specify very precisely. Java pushes that very far, with very complex and detailed tests for every aspect of the language. Nothing can be called Java if it does not pass the compatibility tests defined through the JCP, that is, if it does not act the way specified.
So again, what is the meaning of an XML document? XML does not define behavior. It does not even define an abstract semantics, that is, how the symbols refer to the world. XML is purely specified at the syntactic level: how one can combine strings to form valid XML documents, or valid subsets of XML documents. If there is no general mapping of XML to one thing, then there is nothing that can be maintained to retain its meaning. There is nothing in general that can be said to be preserved by transforming one XML document into another.
So it is not really possible to define the meaning of an XML document in the abstract. One has to look at subsets of it, such as the Atom syndication format. These subsets are given more or less formal semantics. The Atom syndication format, for example, is given one in readable English. Other XML formats in the wild may have none at all, other than what an English reader will be able to deduce by looking at them. Now it is not always necessary to formally describe the semantics of a language for it to gain one. Natural languages, for example, do not have a formal semantics; they evolved one. The problem with artificial languages that don't have a formal semantics is that in order to reconstruct it one has to look at how they are used, and so one has to make very subtle distinctions between appropriate and inappropriate uses. This inevitably ends up being time consuming and controversial. None of this is going to make it easy to build automatic refactoring tools.
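
To make the point concrete, here is a minimal Python sketch. The two documents and their element and attribute names are made up for illustration, and the "format-specific" rule is an assumption standing in for the kind of semantics a format like Atom would have to spell out. A generic XML tool can only compare syntax, so it sees two different documents; only knowledge of the particular format tells us they say the same thing.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents a human reader would take to say the same thing.
DOC_A = '<person name="Henry"/>'
DOC_B = '<person><name>Henry</name></person>'


def shape(element):
    """Purely syntactic summary of an element: tag, attributes, text, children."""
    return (
        element.tag,
        tuple(sorted(element.attrib.items())),
        (element.text or "").strip(),
        tuple(shape(child) for child in element),
    )


def person_name(element):
    """Format-specific rule (an assumption for this sketch):
    the name may be given as an attribute or as a child element."""
    name = element.get("name")
    if name is None:
        name = element.findtext("name")
    return name


tree_a = ET.fromstring(DOC_A)
tree_b = ET.fromstring(DOC_B)

# Generic XML knowledge: the two documents are simply different.
same_syntax = shape(tree_a) == shape(tree_b)

# Format-specific knowledge: they name the same person.
same_meaning = person_name(tree_a) == person_name(tree_b)
```

Here `same_syntax` is false while `same_meaning` is true, and nothing in XML itself could have told us the latter.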

This is where frameworks such as RDF come in very handy. The semantics of RDF is very well defined using model theory. This defines clearly what every element of an RDF document means, what it refers to. To refactor RDF is then simply to make any change that preserves the meaning of the document. If two RDF names refer to the same resource, then one can replace one name with the other and the meaning will remain the same, or at least the facts described by the one document will be the same as those described by the other, which may be exactly what the person doing the refactoring wishes to preserve.

In conclusion: to refactor a document is to change it at the syntactic level whilst preserving its meaning. One cannot refactor XML in general, and in particular instances it will be much easier to build refactoring tools for documents with clear semantics. XML documents that have clear RDF interpretations will be very easy to refactor mechanically. So if you are ever asking yourself what XML format you want to use, think how useful it is to be able to refactor your Java programs, and consider that by using a format with clear semantics you will be able to make use of similar tools for your data.

Henry, are you suggesting that people who are considering XML as their data format should consider RDF/XML? Or that such people should consider the RDF data model (graphs of triples)? Because I suspect you're promoting RDF/XML here, and I feel strongly that RDF/XML would give people more pain than pleasure; in fact some RDF/XML refactoring could break their XPaths. What, XPath should not be used on RDF/XML, you might say? If RDF/XML is suggested as an alternative to XML, people will surely expect to use XPath.

So maybe I just don't understand, can you please clarify the point of the article?

Hi Jacek,
Thanks for asking these questions. What I am really suggesting is that you take care to think about refactoring, when you consider storing your data, and that the only way to do this is to work with a format that has a clear semantics.

Now if you want to use an XML format, then RDF/XML is a good standard solution. If you also wish to make it easy for people who only have access to DOM tools to use it, then you can create an XML crystallization of your RDF graph, i.e. specify a RELAX NG schema for your XML that is compatible with RDF/XML. This will allow XPath, XSLT and other tools to work nicely. It will make refactoring a little more difficult, as you will have to stay within the RELAX NG constraints, but if it is worth it for you, then so be it. Tools for parsing RDF are now available on most platforms and in most languages, so I don't think this consideration need be the only one anymore.

So you may wonder in what situations one comes across the need to refactor data structures. A really good example is NetBeans. NetBeans is built up of many little XML files, lying around everywhere on the file system. You can write plugins for NetBeans, using NetBeans, which will generate many such little files all over the place. All fair and good. But what happens when you want to change your plugin's class name? Suddenly NetBeans has to regenerate a whole bunch of interdependent XML files, each with very little systematic semantics. This job is not a very pleasant one. Had these files been written out in RDF/XML, the job could have been automated very easily. This is just one example. I suspect that there are many more.

Hi Henry, I must still be missing something, because it seems to me that there is a difference between what you can do to refactor code and what you can do to refactor data.

Refactored code does the same thing the old code did, just differently and hopefully more elegantly. You can change all the class names, packages, method names, parameters and all that, and the behavior is still the same. When you do refactoring, the IDE makes sure that all the necessary dependencies are satisfied for every change.

Refactored data, as you put it, is the same data just written down differently. Nothing of significance can be renamed or reordered etc. I don't think the term "refactoring" is really appropriate for that. I could see data refactoring as combined with the code that handles the data - the IDE would allow you to move things around in the data and it would propagate those changes (semiautomatically, I guess) to the code that uses that data. Now that would be data refactoring, but you don't talk about such propagation of changes from data to code.

If you feel that refactoring data is insignificant, or that "refactoring" is an inappropriate word for it, then you may want to talk to Martin Fowler, who co-authored a large book entitled "Refactoring Databases".

It would be fun to reread the book with my new RDF glasses on, and see how what he says is simplified by using RDF. It certainly will not be simplified by using XML without clear semantics, for the reason I mentioned, namely that refactoring is changing the way things are said whilst keeping what is said constant. Semantics gives us the mapping from syntax to the world, which gives us the tools to keep the world constant whilst rewriting the descriptions.

Changing the name of an object is really easy in RDF. Say I want to change my name from http://bblfish.net/people/henry/foaf#me to http://bblfish.net/people/henry/card#me. I can simply add a statement to my inferencing database saying that the two names refer to the same resource, for example with an owl:sameAs relation between them.

I could also take a graph, search for all triples that start or end with http://bblfish.net/people/henry/foaf#me and replace them with triples that start or end with the other.
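
The search-and-replace approach just described can be sketched in a few lines of Python. This is only an illustration: the graph is modelled as a plain set of (subject, predicate, object) tuples rather than with a real RDF library, and the two sample triples, including the example.org URI, are made up.

```python
OLD = "http://bblfish.net/people/henry/foaf#me"
NEW = "http://bblfish.net/people/henry/card#me"
FOAF = "http://xmlns.com/foaf/0.1/"


def rename_node(graph, old, new):
    """Return a copy of the graph in which `old` is replaced by `new`
    wherever it appears as the subject or the object of a triple."""
    return {
        (new if s == old else s, p, new if o == old else o)
        for (s, p, o) in graph
    }


# Hypothetical sample data: one triple that starts with the old name
# and one that ends with it.
graph = {
    (OLD, FOAF + "name", "Henry"),
    ("http://example.org/jacek#me", FOAF + "knows", OLD),
}

renamed = rename_node(graph, OLD, NEW)
```

The predicates and all other nodes are untouched, so the renamed graph states the same facts about the same resource under its new name.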

Perhaps I want to create from a relation R a new relation S with a larger domain and range and make R inherit from S. That would be like pushing a method up the class hierarchy. It can be done easily with RDF by saying that

:R rdfs:subPropertyOf :S
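
What a reasoner can do with such a statement can be sketched as follows. This is a toy illustration in Python, not a full RDFS reasoner: it applies only the sub-property entailment rule (if x R y and R is a sub-property of S, then x S y), the graph is again a plain set of tuples, and the resources :a and :b are made up.

```python
SUBPROPERTY_OF = "rdfs:subPropertyOf"


def infer_superproperties(graph):
    """Apply the rule (x P y), (P rdfs:subPropertyOf Q) => (x Q y)
    repeatedly until no new triples appear."""
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        # Collect every (sub-property, super-property) pair currently known.
        sub_of = {(s, o) for (s, p, o) in inferred if p == SUBPROPERTY_OF}
        for (s, p, o) in list(inferred):
            for (sub, sup) in sub_of:
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


# Hypothetical data: one use of :R, plus the statement from the text.
graph = {
    (":R", SUBPROPERTY_OF, ":S"),
    (":a", ":R", ":b"),
}

closure = infer_superproperties(graph)
```

Existing data written with :R does not have to be touched: anyone querying for :S will now also find the :R statements, which is the data equivalent of callers seeing a method that has been pushed up the hierarchy.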

There are many more examples I can come up with...

Dean Allemang is writing a book that covers a lot of this field. Hopefully it will be out soon. That should help you get some idea of how this can be used.