In SQL: Processing XML’s Complex Hierarchical Structures

In Part 1 of last issue’s article entitled In SQL: Using XML to Link Below the Root I described how ANSI SQL can transparently integrate native
XML, and do it at a full hierarchical level that utilizes the hierarchical semantics in the data. This inherent hierarchical SQL processing became possible in the SQL-92 standard with the
introduction of the Left Outer Join operation which can model multi-leg hierarchical structures. The Left Outer Join, also known simply as the Left Join, enables hierarchical data structures to be
defined in SQL by placing the left join data argument over the right data argument and linking the two using the ON clause specification. This hierarchical semantics instructs the relational engine
how to process the data structure hierarchically. In effect, hierarchical processing became a subset of relational processing with the Left Join operating on tree structures where the nodes are
tables or elements. This produces a valid hierarchical result that is also ANSI SQL accurate and mathematically sound.

Complex XML Capabilities

There are many complex and unconventional capabilities being introduced by XML that are not being currently addressed in commercial SQL-based products. In this Part 2, we will examine how ANSI SQL
inherently goes beyond the standard hierarchical processing described in Part 1 to naturally process XML’s complex hierarchical capabilities. These include network structures, variable
structures and structural transformations. These capabilities are not being fully addressed because they impose significant problems for SQL and its relational processing. These SQL capabilities
are described in this article. It is felt that these XML capabilities are useful and it is important that SQL-based products naturally handle these new unconventional XML features and capabilities.
You will notice how well these complex capabilities naturally integrate with and utilize SQL’s inherent hierarchical processing capabilities. Because of this, the examples used in this
article can be processing relational, XML, legacy data or a combination.

Network Structures

Network structures can be created by XML’s duplicate and shared elements when they occur in an XML data structure. Duplicate elements occur where the same (named) element type occurs in
multiple locations in the XML data structure. Shared elements are created by the XML IDREF and ID attribute constructs which creates a logical alternative pathway in the physical XML data structure
forming a network structure. Both of these unconventional structures are demonstrated in Figure 1 as the top two structure diagrams. The Addr node in these diagrams represents the duplicate and
shared element. The dotted arrow in the XML Shared Element diagram in Figure 1 represents the logical IDREF pathway. The problem with these two structures is that they are both ambiguous for a
nonprocedural hierarchical query language such as SQL because there is no single unambiguous access path to a specific Addr node location. What makes a nonprocedural hierarchical query language so
powerful is that the hierarchical data structures they operate on are naturally unambiguous because they only have a single path to each node. In this way, the query can be specified unambiguously
because each node in the structure has its own unique access path and specific semantics that can be utilized automatically to process the query.

The Alias feature of SQL allows the duplicate and shared element structures shown in Figure 1 to be data modeled as unambiguous hierarchical structures by using the optional AS keyword to rename
the nodes (tables or elements). In this example, both of the ambiguous structures can use the same data modeling SQL to produce an unambiguous structure which maintains the original semantics of
both input structures. This is possible because the semantics of both input structures with differing storage distribution of the Addr node are the same and produce the same result. With the
unambiguously modeled structure shown in Figure 1, each specific Addr node can be unambiguously referenced by using its unique node name as a prefix to the field name. In this way each Addr node
reference has its own logical path with its own hierarchical semantics. This allows the full nonprocedural hierarchical power of SQL to be easily controlled with simple intuitive queries. Avoiding
the use of node name prefixes can be accomplished by using the SQL Alias feature on the SELECT list to rename the duplicate field names to unique names. The underlying XML access module’s
logic can adapt transparently to the physical storage representation of the Addr element whether it is shared or a duplicated.

Variable Structures

XML can define variable structures which allow for considerable variability of structure formats for a single definition of the structure. This means that from structure occurrence to occurrence or
even within a single structure occurrence of a record or document, the structure format can vary. Usually with a varying structure, some piece of information from a higher level in the structure
indicates how a variable substructure is to or has formed. This operation is very similar to COBOL’s variable structure capability which is controlled by its DEPENDING ON clause. With this
indicator information, SQL data modeling which controls the building of each structure occurrence can define the appropriate substructures dynamically. An example of this is shown in Figure 2
below, where the ON clause is used to control the generation of the variable portion of the structure.

Figure 2 is a simple example where the generation of the data structure in memory can vary depending on the value of the field StoreType in the Store node. In this case, only one substructure was
affected, but there is no limit to the number of variations which can be controlled separately by the ON clauses testing for any number of structure indicator values to control whether they are
generated. These structure decision tests can be coded to duplicate the rules specified in XML DTD and Schemas for variable element generation which can become quite complex. These indicator values
can be embedded in variable portions of the structure offering increased capabilities. This variable structure modeling in SQL also controls how variable physical structures are nonprocedurally
navigated.

The example in Figure 2 also demonstrates that ON clauses and WHERE clauses are very different in their operation. WHERE clauses remove entire rows, while ON clauses can remove isolated portions of
rows which are replaced with null values. Looking at this operation from a hierarchical processing perspective, WHERE clauses can remove nodes above and below the node used as the qualifying
filter, while ON clauses can only remove data from their qualification node downward. This means that the WHERE clause is logically applied after each record or row occurrence is complete, while
the ON clause is applied as the record is being built. This is why the ON clause can dynamically control how the structured record occurrence is built.

Structure Transformation

In Part 1, it was shown how the natural operation of node promotion controlled by which nodes were selected for output formed fragments that maintained their general structure and could be
manipulated by joining as a contiguous structure. Figure 3 below demonstrated two fragments indicated by the dashed circles that encircle them. Unselected nodes are represented by a dashed box.
These unselected nodes are not transferred when output causing lower level nodes to be promoted to the next ancestor on its path, retaining its basic structure and semantics For example, the Cust
node which is an ancestor of the Item node controlled its existence before and after its parent node was removed.

By combining fragment processing and the Alias feature used in processing network structures above, it is possible to flexibly perform powerful structure transformations. This operates by
specifying different fragments from the same structure by using the Alias capability to logically create multiple copies of the same structure. This enables different fragments in the same
structure can be isolated and then independently manipulated. The Alias feature renames views allowing duplicate input structures to be logically defined so that references to them can be made
unambiguously. The example in Figure 3 above demonstrates this by creating two separate and independent fragments from the StoreView view structure which are encircled. These now independent
fragments are then recombined into a different structure by re-joining them. Notice that this is a hierarchical controlled join (described in Part 1) that precisely controls the hierarchical
placement of the two fragments made possible by the Outer Joins’s ON clause. This is a simple example of structure transformation, multiple structures can each have multiple fragments
extracted which can all be combined in any order. The fragments can be joined as they are needed. Structural transformations can also be defined in an SQL view for abstraction and reuse in larger
structures.

Heterogeneous Hierarchical Structures

It is important and interesting to note that the XML advanced hierarchical capabilities described in this article, while brought into prominence by XML are not unique to XML. These are inherent
hierarchical operations and are applicable to any logical or physical hierarchical structure. This means that they will operate seamlessly and consistently across logical (relational) and physical
(XML and legacy data) hierarchically modeled structures under the control of SQL hierarchical processing.

Conclusion

This article has demonstrated how ANSI SQL’s standard well known capabilities coupled with its inherent hierarchical processing naturally perform XML’s advance hierarchical operations.
Part 1 in the last issue demonstrated how hierarchical processing is a valid subset of relational processing. This enables all of SQL’s basic hierarchical processing capability to integrate
transparently with native XML at a full hierarchical processing level. This includes dynamic hierarchical controlled joining of structures and powerful hierarchical optimization seamlessly across
relational and XML native data.

Part 3 will examine hierarchical query semantics to see how and why standard SQL exactly follows it. I will also compare the natural ANSI hierarchical solution described here to proprietary
solutions, XQuery used in SQL, and the SQL/XML Standard. I will also examine the synergy between hierarchical processing optimization and hierarchical SQL view support that produces a flexible and
powerful, and at the same time, user friendly and intuitive nonprocedural hierarchical processor.

For additional information on the capabilities described in this article, see www.adatinc.com. Any potential customers or SQL/XML developer/researchers
interested in becoming involved with this transparent ANSI SQL native XML technology, please get in touch with mike@adatinc.com.

Share

About Michael David

Michael is is the founder of Advanced Data Access Technologies, Inc. Previously, he was the lead XML architect for NCR/Teradata, and served as their representative to the ANSI SQLX Group. Before that
he was a staff scientist for Teradata and designed high level multi-featured SQL utilities. From his earlier career, he has more than 25 years of experience researching and designing commercial
nonprocedural heterogeneous database hierarchical query processing products using flat, relational and hierarchical data. From this experience, he authored the book Advanced ANSI SQL Data Modeling
and Structure Processing, as well as numerous papers and articles on this subject. His research on hierarchical and relational systems and data integration has resulted in discoveries that led to the
development of an ANSI SQL transparent XML hierarchical processor prototype that integrates and processes relational and XML data at a full multipath hierarchical level. This also proves that
inherent hierarchical processing capability is possible in ANSI SQL.
Contact Mike at mike@adatinc.com, and read his blog at www.adatinc.com/blog1.