Interview: Frank Cohen on FastSOA

InfoQ today publishes a one-chapter excerpt from Frank Cohen's book "FastSOA". On this occasion, InfoQ had a chance to talk to Frank Cohen, creator of the FastSOA methodology, about the issues when trying to process XML messages, scalability, using XQuery in the middle tier, and document-object-relational-mapping.
InfoQ: Can you briefly explain the ideas behind "FastSOA"?

Frank Cohen: For the past 5-6 years I have been investigating the impact an average Java developer's choice of technology, protocols, and patterns for building services has on the scalability and performance of the resulting application. For example, Java developers today have a choice of 21 different XML parsers! Each one has its own scalability, performance, and developer productivity profile. So a developer's choice on technology makes a big impact at runtime.

I looked at distributed systems that used message oriented middleware to make remote procedure calls. Then I looked at SOAP-based Web Services. And most recently at REST and AJAX. These experiences led me to look at SOA scalability and performance built using application server, enterprise service bus (ESB,) business process execution (BPEL,) and business integration (BI) tools. Across all of these technologies I found a consistent theme: At the intersection of XML and SOA are significant scalability and performance problems.

FastSOA is a test methodology and set of architectural patterns to find and solve scalability and performance problems. The patterns teach Java developers that there are native XML technologies, such as XQuery and native XML persistence engines, that should be considered in addition to Java-only solutions.

InfoQ: What's "Fast" about it? ;-)

FC: First off, let me describe the extent of the problem. Java developers building Web enabled software today have a lot of choices. We've all heard about Service Oriented Architecture (SOA), Web Services, REST, and AJAX techniques. While there are a LOT of different and competing definitions for these, most Java developers I speak to expect that they will be working with objects that message to other objects - locally or on some remote server - using encoded data, and often the encoded data is in XML format.

The nature of these interconnected services we're building means our software needs to handle messages that can be small to large and simple to complex. Consider the performance penalty of using a SOAP interface and streams XML parser (StAX) to handle a simple message schema where the message size grows. A modern and expensive multi-processor server that easily serves 40 to 80 Web pages per second serves as little as 1.5 to 2 XML requests per second.

Without some sort of remediation Java software often slows to a crawl when handling XML data because of a mismatch between the XML schema and the XML parser. For instance, we checked one SOAP stack that instantiated 14,385 Java objects to handle a request message of 7000 bytes that contains 200 XML elements.

Of course, titling my work SlowSOA didn't sound as good. FastSOA offers a way to solve many of the scalability and performance problems. FastSOA uses native XML technology to provide service acceleration, transformation, and federation services in the mid-tier. For instance, an XQuery engine provides a SOAP interface for a service to handle decoding the request, transform the request data into something more useful, and routes the request to a Java object or another service.

InfoQ: One alternative to XML databinding in Java is the use of XML technologies, such as XPath or XQuery. Why muddy the water with XQuery? Why not just use Java technology?

FC:We're all after the same basic goals:

Good scalability and performance in SOA and XML environments.

Rapid development of software code.

Flexible and easy maintenance of software code as the environment and needs change.

In SOA, Web Service, and XML domains I find the usual Java choices don't get me to all three goals.

Chris Richardson explains the Domain Model Pattern in his book POJOs in Action. The Domain Model is a popular pattern to build Web applications and is being used by many developers to build SOA composite applications and data services.

The Domain Model divides into three portions: A presentation tier, an application tier, and a data tier. The presentation tier uses a Web browser with AJAX and RSS capabilities to create a rich user interface. The browser makes a combination of HTML and XML requests to the application tier. Also at the presentation tier is a SOAP-based Web Service interface to allow a customer system to access functions directly, such as a parts ordering function for a manufacturer's service.

At the application tier, an Enterprise Java Bean (EJB) or plain-old Java object (Pojo) implements the business logic to respond to the request. The EJB uses a model, view, controller (MVC) framework - for instance, Spring MVC, Struts or Tapestry - to respond to the request by generating a response Web page. The MVC framework uses an object/relational (O/R) mapping framework - for instance Hibernate or Spring - to store and retrieve data in a relational database.

I see problem areas that cause scalability and performance problems when using the Domain Model in XML environments:

Each request operates the entire service. For instance, many times the user will check order status sooner than any status change is realistic. If the system kept track of the most recent response's time-to-live duration then it would not have to operate all of the service to determine the most previously cached response.

The vendor application requires the request message to be in XML form. The data the EJB previously processed from XML into Java objects now needs to be transformed back into XML elements as part of the request message. Many Java to XML frameworks - for instance, JAXB, XMLBeans, and Xerces ? require processor intensive transformations. Also, I find these frameworks challenging me to write difficult and needlessly complex code to perform the transformation.

The service persists order information in a relational database using an object-relational mapping framework. The framework transforms Java objects into relational rowsets and performs joins among multiple tables. As object complexity and size grows my research shows many developers need to debug the O/R mapping to improve speed and performance.

In no way am I advocating a move away from your existing Java tools and systems. There is a lot we can do to resolve these problems without throwing anything out. For instance, we could introduce a mid-tier service cache using XQuery and a native XML database to mitigate and accelerate many of the XML domain specific requests.

The advantage to using the FastSOA architecture as a mid-tier service cache is in its ability to store any general type of data, and its strength in quickly matching services with sets of complex parameters to efficiently determine when a service request can be serviced from the cache. The FastSOA mid-tier service cache architectures accomplishes this by maintaining two databases:

Service Database. Holds the cached message payloads. For instance, the service database holds a SOAP message in XML form, an HTML Web page, text from a short message, and binary from a JPEG or GIF image.

Policy Database. Holds units of business logic that look into the service database contents and make decisions on servicing requests with data from the service database or passing through the request to the application tier. For instance, a policy that receives a SOAP request validates security information in the SOAP header to validate that a user may receive previously cached response data. In another instance a policy checks the time-to-live value from a stock market price quote to see if it can respond to a request from the stock value stored in the service database.

FastSOA uses the XQuery data model to implement policies. The XQuery data model supports any general type of document and any general dynamic parameter used to fetch and construct the document. Used to implement policies the XQuery engine allows FastSOA to efficiently assess common criteria of the data in the service cache and the flexibility of XQuery allows for user-driven fuzzy pattern matches to efficiently represent the cache.

FastSOA uses native XML database technology for the service and policy databases for performance and scalability reasons. Relational database technology delivers satisfactory performance to persist policy and service data in a mid-tier cache provided the XML message schemas being stored are consistent and the message sizes are small.

InfoQ: What kinds of performance advantages does this deliver?

FC: I implemented a scalability test to contrast native XML technology and Java technology to implement a service that receives SOAP requests.

The test varies the size of the request message among three levels: 68 K, 202 K, 403 K bytes. The test measures the roundtrip time to respond to the request at the consumer. The test results are from a server with dual CPU Intel Xeon 3.0 Ghz processors running on a gigabit switched Ethernet network. I implemented the code in two ways:

FastSOA technique. Uses native XML technology to provide a SOAP service interface. I used a commercial XQuery engine to expose a socket interface that receives the SOAP message, parses its content, and assembles a response SOAP message.

Java technique. Uses the SOAP binding proxy interface generator from a popular commercial Java application server. A simple Java object receives the SOAP request from the binding, parses its content using JAXB created bindings, and assembles a response SOAP message using the binding.

The results show a 2 to 2.5 times performance improvement when using the FastSOA technique to expose service interfaces. The FastSOA method is faster because it avoids many of the mappings and transformations that are performed in the Java binding approach to work with XML data. The greater the complexity and size of the XML data the greater will be the performance improvement.

InfoQ: Won't these problems get easier with newer Java tools?

FC: I remember hearing Tim Bray, co-inventor of XML, extolling a large group of software developers in 2005 to go out and write whatever XML formats they needed for their applications. Look at all of the different REST and AJAX related schemas that exist today. They are all different and many of them are moving targets over time. Consequently, when working with Java and XML the average application or service needs to contend with three facts of life:

There's no gatekeeper to the XML schemas. So a message in any schema can arrive at your object at any time.

The messages may be of any size. For instance, some messages will be very short (less than 200 bytes) while some messages may be giant (greater than 10 Mbytes.)

The messages use simple to complex schemas. For instance, the message schema may have very few levels of hierarchy (less than 5 children for each element) while other messages will have multiple levels of hierarchy (greater than 30 children.)

What's needed is an easy way to consume any size and complexity of XML data and to easily maintain it over time as the XML changes. This kind of changing landscape is what XQuery was created to address.

InfoQ: Is FastSOA only about improving service interface performance?

FC: FastSOA addresses these problems:

Solves SOAP binding performance problems by reducing the need for Java objects and increasing the use of native XML environments to provide SOAP bindings.

Introduces a mid-tier service cache to provide SOA service acceleration, transformation, and federation.

FastSOA is an architecture that provides a mid-tier service binding, XQuery processor, and native XML database. The binding is a native and streams-based XML data processor. The XQuery processor is the actual mid-tier that parses incoming documents, determines the transaction, communicates with the ?local? service to obtain the stored data, serializes the data to XML and stores the data into a cache while recording a time-to-live duration. While this is an XML oriented design XQuery and native XML databases handle non-XML data, including images, binary files, and attachments. An equally important benefit to the XQuery processor is the ability to define policies that operate on the data at runtime in the mid-tier.

FastSOA provides mid-tier transformation between a consumer that requires one schema and a service that only provides responses using a different and incompatible schema. The XQuery in the FastSOA tier transforms the requests and responses between incompatible schema types.

Lastly, when a service commonly needs to aggregate the responses from multiple services into one response, FastSOA provides service federation. For instance, many content publishers such as the New York Times provide new articles using the Rich Site Syndication (RSS) protocol. FastSOA may federate news analysis articles published on a Web site with late breaking news stories from several RSS feeds. This can be done in your application but is better done in FastSOA because the content (news stores and RSS feeds) usually include time-to-live values that are ideal for FastSOA's mid-tier caching.

InfoQ: Can you elaborate on the problems you see in combining XML with objects and relational databases?

FC: While I recommend using a native XML database for XML persistence it is possible to be successful using a relational database. Careful attention to the quality and nature of your application's XML is needed. For instance, XML is already widely used to express documents, document formats, interoperability standards, and service orchestrations. There are even arguments put forward in the software development community to represent service governance in XML form and operated upon with XQuery methods. In a world full of XML, we software developers have to ask if it makes sense to use relational persistence engines for XML data. Consider these common questions:

How difficult is it to get XML data into a relational database?

How difficult is it to get relational data to a service or object that needs XML data? Can my database retrieve the XML data with lossless fidelity to the original XML data? Will my database deliver acceptable performance and scalability for operations on XML data stored in the database? Which database operations (queries, changes, complex joins) are most costs in terms of performance and required resources (cpus, network, memory, storage)?

Your answers to these questions forms a criteria by which it will make sense to use a relational database, or perhaps not. The alternative to relational engines are native XML persistence engines such as eXist, Mark Logic, IBM DB2 V9, TigerLogic, and others.

InfoQ: What are the core ideas behind the PushToTest methodology, and what is its relation to SOA?

FC: It frequently surprises me how few enterprises, institutions, and organizations have a method to test services for scalability and performance. One fortune 50 company asked a summer intern they wound up hiring to run a few performance tests when he had time between other assignments to check and identify scalability problems in their SOA application. That was their entire approach to scalability and performance testing.

The business value of running scalability and performance tests comes once a business formalizes a test method that includes the following:

Choose the right set of test cases. For instance, the test of a multiple-interface and high volume service will be different than a service that handles periodic requests with huge message sizes. The test needs to be oriented to address the end-user goals in using the service and deliver actionable knowledge.

Accurate test runs. Understanding the scalability and performance of a service requires dozens to hundreds of test case runs. Ad-hoc recording of test results is unsatisfactory. Test automation tools are plentiful and often free.

Make the right conclusions when analyzing the results. Understanding the scalability and performance of a service requires understanding how the throughput measured as Transactions Per Second (TPS) at the service consumer changes with increased message size and complexity and increased concurrent requests.

All of this requires much more than an ad-hoc approach to reach useful and actionable knowledge. So I built and published the PushToTest SOA test methodology to help software architects, developers, and testers. The method is described on the PushToTest.com Web site and I maintain an open-source test automation tool called PushToTest TestMaker to automate and operate SOA tests.

PushToTest provides Global Services to its customers to use our method and tools to deliver SOA scalability knowledge. Often we are successful convincing an enterprise or vendor that contracts with PushToTest for primary research to let us publish the research under an open source license. For example, the SOA Performance kit comes with the encoding style, XML parser, and use cases. The kit is available for free download at: http://www.pushtotest.com/Downloads/kits/soakit.html and older kits are at http://www.pushtotest.com/Downloads/kits.

InfoQ: Thanks a lot for your time.

Frank Cohen is the leading authority for testing and optimizing software developed with Service Oriented Architecture (SOA) and Web Service designs. Frank is CEO and Founder of PushToTest and inventor of TestMaker, the open-source SOA test automation tool, that helps software developers, QA technicians and IT managers understand and optimize the scalability, performance, and reliability of their systems. Frank is author of several books on optimizing information systems (Java Testing and Design from Prentice Hall in 2004 and FastSOA from Morgan Kaufmann in 2006.) For the past 25 years he led some of the software industry's most successful products, including Norton Utilities for the Macintosh, Stacker, and SoftWindows. He began by writing operating systems for microcomputers, helping establish video games as an industry, helping establish the Norton Utilities franchise, leading Apple's efforts into middleware and Internet technologies, and was principal architect for the Sun Community Server. He cofounded Inclusion.net (OTC: IINC), and TuneUp.com (now Symantec Web Services.) Contact Frank at fcohen@pushtotest.com and http://www.pushtotest.com.

Hi Jason: The patterns are platform agnostic, but they do require a persistence engine. As far as I know IBM's DataPower appliances give you the transformation and security decription capability. I did not see a persistence engine to let you do things like intelligent caching, transformation, and federation in the appliance. Did I get that wrong? -Frank

I would expect that for a fixed data set (where the message itself had a high correlation to the physical storage) this would be more efficient. For a message that has customer, order, payment, and product to represent and "purchase order service" I would expect each entity to be persisted in a relational way so that another service might be “customer details” and another to be “product details”. I would expect the performance to go down as you had more needs to aggregate data from the XML DB. I am guessing that as the more aggregation was needed, the less a hierarchical db would be more efficient than an object relational solution.

XML is a hierarchical, so I presume that Native XML DB would also be persist hierarchical, but most apps have a need to store things in a relational database because of the relational nature of the information. While the transformations are a cost, shouldn’t this cost be compared on an aggregate level?

I have not worked with any XML databases so I was just wondering about the efficiencies of using them.

Hi Jeff: Native XML databases are typically hierarchical in the way they store data and multi-dimensional in the each stored XML document have a different hierarchy. This is different from object databases where you truly didn't know exactly what you would be storing and querying ahead of time. Native XML databases now have the XQuery and XPath query syntax to work with the stored documents. Most of the XML DBs also have an indexing engine that uses one or more schemas to determine the correct approach to indexing the contents of a collection (the XML DB version of a row-set.) In my performance and scalability testing I find that native XML databases usually achieve 5x to 30x performance increase over using similar XML techniques on a relational DB.