Building Hierarchical Structures from Flat Data

Sometimes you want to convert only a subset of the available data.
For example, you might want to store each year in a separate
XML document. With a few tweaks to the basic code, this is
not hard to do. Another possibility is to arrange the data hierarchically,
with Agency elements containing Bureau elements which contain
Account elements which contain Subfunction elements.
This is a little more complex, but far from impossible.
Or perhaps you want to massage the data while converting it, for instance
by changing the amounts of thousands of dollars to amounts of single dollars.
Or you can do all three.
The hierarchy would be arranged something like Example 4.5:

There are two basic approaches to imposing hierarchy on the
flat structures we’re starting with. One is to build the hierarchy
into the Java data structures in memory. Then these data structures
can be output in XML quite simply. The alternative is to leave the data
structures in memory flat, but try to insert the hierarchy as the output code is
being written. The latter approach is not too difficult when the data can
be easily sorted, but
most of the time I find the former approach to be simpler.

Although all the data could still be stored in lists of lists and
lists of maps and maps of lists and so forth, I really do find
that at this point the data structures are clearer if we name them;
and in Java there really isn’t any difference between a named data structure
and a class. Thus we can describe the structure as classes and objects.
Informally, the structure looks like this:

Note

I must admit that the lack of generic types (templates to C++
programmers) hurts a little here. For example, I want to say that an
Agency contains a list of
Bureaus. However, all I can really say is that an
Agency contains a list of objects and that this
list happens to be named bureaus. It’s not a huge
problem—if it were, I could always define my own custom list
classes with the appropriate types—but it does lead to a less
natural solution than generic types do. I thought seriously about using
generics for this example,
but that would limit this code to Java 1.4. Maybe for the second edition.

Of course, we’ll also need a number of methods in these classes.
Factory methods will be used instead of constructors
to guarantee that 19 IRS Bureau objects aren’t created just because
the IRS has 19 line items in the budget. (One IRS is bad enough. :-) )
Each class will have a getXML() method that returns
a String containing the
XML form of the object. And each class will have a method to add a child
node of the appropriate type. Figure 4.2
shows a UML static structure diagram for the complete group of classes.

Figure 4.2. A UML diagram for the budget class hierarchy

The Budget class, Example 4.6
is at the top of the tree. It contains a list of agencies and
a year. Line items encoded as maps by BudgetData are fed into
the tree through the add() method.
This method breaks that map up into the relevant parts, encodes each as an
object, adds each of those objects to the tree in the right place, and
stores the agency in the list of agencies. The various add() methods
in both this and the other classes are responsible for putting
each piece in the right place.

Most of the other classes are structurally the same with subtle variations.
Each has a list of its children,
a private constructor, a public factory method,
an add() method to add a child
of the appropriate type, and a getXML() method
that returns the XML representation of the object as a
String.
I’ll begin with the
Agency class in Example 4.7.

This is not a general purpose class. The public interface is limited
to a factory method, a method to add
a Bureau to the Agency,
and a method to get the XML representation of the object as a
String. There aren’t even any
getter methods. I’ve deliberately chosen not to burden this and the other classes with
a lot of error checking. I’m assuming that will be done both in the database
before the data is input here and later through validation
after the XML document has been produced. Doing it here too just
seemed a tad superfluous.

Clients retrieve instances of the class
using the getInstance() factory method.
The first time this class is asked to get a particular agency
it constructs it using the private constructor, stores it
in the static instances map, and returns a reference to it.
Subsequent requests will return
the formerly created object from the map.
The key for the map is the concatenation of the agency’s code and year.

The add() method simply stores
Bureau objects in the
bureaus list. Since bureaus
is private and only accessed through this add() method,
we can be certain that no
non-Bureau objects will ever
get added to this list. We can further guarantee that the list
doesn’t contain any duplicates.

The getXML() method returns an
XML representation of this Agency object.
It loops through all the bureaus the agency contains,
and uses their getXML() methods
to get the XML representations
for each of them. The XML is returned as a String.
An alternative approach that might be more memory efficient
and perhaps a little faster, especially in streaming applications,
would be to
pass the OutputStream or Writer
that the XML should be written onto
to the getXML() method. In essence, this would be a
writeXML() method rather than a
getXML() method.

The code here is far from the most efficient possible.
There are a lot of optimizations that could be done, including
many that would build far fewer intermediate string representations of
each element. For instance, you could pass in the StringBuffer
to append to to
each getXML() method. However,
since this is intended primarily as a one-off solution,
the optimization didn’t seem to be worth the investment of time
or the added complexity.

The Bureau class shown in
Example 4.8 is much the same except
that it contains a list of accounts. Furthermore, both an agency code
and a bureau code are required to
uniquely identify bureaus. The bureau code alone is not sufficient,
so the keys for the instances map
are formed by concatenating the agency code, the bureau code,
and the year.

The Account class shown in
Example 4.9 is similar to
Bureau and Agency.
Accounts are uniquely identified by an account code, a bureau code,
an agency
code, and a BEA code.
Otherwise, this class is structured the same as the last two.

Finally the Subfunction class,
Example 4.10, contains a title,
a code, and an amount of money for a
given year. Unlike agencies, bureaus, and accounts,
subfunctions are not necessarily unique. For example, the
“Department of Defense-Military” subfunction
appears in 222 different accounts. Therefore this class does not
use a factory method,
just a regular constructor.

The ultimate plan is to read all the records of the budget data,
and use that information to create objects of these classes. As
each object is created, it will be added to its parent object’s
list. When all records have been processed, the
getXML() method of the
Budget object will be invoked to retrieve
the XML representation of the budget. This XML can be written
onto an output stream. Example 4.11 does this. It has a
main() method to provide a simple user interface.
This just reads the year and the
input and output file names from the user, and calls
convert(). The convert() method
delegates reading the input to the
BudgetData class and the data-structure building
and most of the output to
the Budget class.

Example 4.11. The driver class that builds the data structure and writes it out again

The technique used here is actually an extremely powerful one.
The data is read one item at a time. Each item is stored in a data structure.
The data structure automatically moves each item into its appropriate slot.
The final result is generated
by a single method call to the root of the data structure.

Note

A similar approach would be to define classes
that represent the XML elements and attributes
rather than the input data. Indeed this is the approach taken by the Document
Object Model, DOM. I’ll explore this alternative in coming chapters.