Enclosed is my report for the Saturday meeting on messaging
architectures suitable for use by MOBY. I cover REST, CORBA and
SOAP. Looking forward to discussing this with everybody in more
detail this weekend.
Lincoln
MOBY PROJECT: TECHNICAL REPORT ON WEB MESSAGING LAYER
Date: March 9, 2003
Author: Lincoln Stein
Version: 1.0
This report concerns the messaging layer of the Moby project, that
point at which semantic information is exchanged between the data
consumer (the biologist or client process) and the data provider (the
model organism system database).
There is a wide variety of possible messaging systems. Here are a
number of prominent ones, listed in rough chronological order.
1) Custom messaging system using raw TCP/IP
2) Custom messaging system using BEEP (an IETF application
protocol framework)
3) ASN.1 exchange
4) Microsoft DCOM
5) REST
6) CORBA
7) XML-RPC
8) SOAP
For the purposes of this assessment, I ignored 1, 2, 3 and 4. I
rejected custom messaging alternatives 1 and 2 because they represent
fallback positions that we should consider only if none of the
standard solutions meet our requirements. Exchange of messages via
ASN.1 streams and Microsoft DCOM can rightly be treated as legacy
solutions that have found important places in niche applications but
are no longer acceptable as solutions for enabling the exchange of
semantic information across administrative domains. SOAP arises from,
and supersedes, XML-RPC, so I have folded the two together. This
leaves REST, CORBA, and SOAP.
I will now consider these in chronological order.
--------------------------------------------------------------------------
REST
----
REST stands for REpresentational State Transfer, and is a term coined
in Roy Fielding's graduate thesis to describe a style of information
architecture that had already become the de facto standard for the
World Wide Web. According to Fielding, REST is suited for scalable
applications in which relatively large hypermedia representations of
information resources are exchanged within the "anarchic network."
Key Features:
a) Resources are identified using stable addresses, the URI.
b) Resources are never exchanged themselves. Instead representations
of resources are exchanged. A particular resource, such as a database
entry, may be represented in multiple ways, e.g. as an HTML file or a
PostScript file. Resources are viewed as changing with time.
c) REST has many nouns but few verbs. Its verbs follow the CRUD
paradigm, and consist of PUT, GET, POST and DELETE. Its nouns are an
extensible set of hypermedia representations.
d) REST is stateless, and places the burden of maintaining session
information squarely on the client.
e) REST is close to the transport layer, and allows but does not
require applications to be concerned with performance issues
such as caching, parsing, and rendering latency.
Discussion:
Probably the most innovative aspect of REST is its use of URIs to
address each resource. I will use a DAS-like application (DAS is the
Distributed Sequence Annotation System, used to distribute annotations
on a genome) as an exemplar of this style. In DAS, a URI can be used
to identify a particular segment of a genome:
http://my.site/das/d-melanogaster/r3.1/2R
This identifies the genome of Drosophila melanogaster, assembly
release version 3.1, chromosome arm 2R. To fetch the list of features
from this region, one would issue a GET request on the following URL:
GET http://my.site/das/d-melanogaster/r3.1/2R/features
To address an individual feature named "exon00001", one refers to this
URL:
GET http://my.site/das/d-melanogaster/r3.1/2R/features/exon00001
To add a new feature to the chromosome, one issues a PUT:
PUT http://my.site/das/d-melanogaster/r3.1/2R/features/exon00002
Updates and deletes are handled similarly.
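As a minimal sketch of the requests above, the following Python
builds (but does not send) one request per verb using only the
standard library. The helper names feature_url and build_request are
mine and are not part of DAS; my.site is the placeholder host used in
the examples above.

```python
from urllib.request import Request

BASE = "http://my.site/das"  # placeholder host from the examples above


def feature_url(genome, release, segment, feature=None):
    """Return the URL of a segment's feature list, or of one named feature."""
    parts = [BASE, genome, release, segment, "features"]
    if feature is not None:
        parts.append(feature)
    return "/".join(parts)


def build_request(verb, url, body=None):
    """Build, but do not send, a request using one of the four REST verbs."""
    if verb not in ("GET", "PUT", "POST", "DELETE"):
        raise ValueError("unsupported verb: " + verb)
    return Request(url, data=body, method=verb)


# Read the feature list, then create and delete an individual feature.
list_req = build_request("GET", feature_url("d-melanogaster", "r3.1", "2R"))
put_req = build_request(
    "PUT",
    feature_url("d-melanogaster", "r3.1", "2R", "exon00002"),
    body=b"...representation of the new feature...",
)
del_req = build_request(
    "DELETE", feature_url("d-melanogaster", "r3.1", "2R", "exon00002")
)

for req in (list_req, put_req, del_req):
    print(req.get_method(), req.full_url)
```

Note that the resource is named entirely by the URL, and the action
entirely by the verb; the body of the PUT is whatever representation
the service agrees to accept.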
REST is elegant because it allows very generic software to be written.
For example, caching code does not have to know anything about the
contents of the data it caches, and fetching code can simply hand off
the data it receives to the appropriate helper application. However,
it is unclear to me how REST can be used to handle transformative
tasks. For example, for the task of transforming genes into GO_terms,
should the task be represented as a method on the gene:
GET http://my.site/das/d-melanogaster/genes/notch/GO_terms
or as a hierarchy of tasks:
GET http://my.site/transformations/gene/gene2go?gene=notch
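The two candidate designs differ only in where the gene name sits in
the URL. A small sketch, using hypothetical helper names of my own,
makes the contrast concrete:

```python
from urllib.parse import quote, urlencode

SITE = "http://my.site"  # placeholder host from the examples above


def go_terms_as_method(genome, gene):
    """Design 1: the transformation is a sub-resource of the gene."""
    return "/".join([SITE, "das", quote(genome), "genes", quote(gene), "GO_terms"])


def go_terms_as_task(gene):
    """Design 2: the transformation is its own resource; the gene is a parameter."""
    return SITE + "/transformations/gene/gene2go?" + urlencode({"gene": gene})


print(go_terms_as_method("d-melanogaster", "notch"))
print(go_terms_as_task("notch"))
```

In the first design, anything scoped to the gene's URL (a cache entry,
an access rule) naturally covers its GO terms as well. In the second,
the set of available transformations can be browsed independently of
any particular gene. I do not see that REST itself favors one over
the other.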
Who is using REST:
In one sense, everyone is. In another sense, nobody. The main
exemplar of a fully RESTful Web service is WebDAV, an implementation
of the Distributed Authoring and Versioning (DAV) protocol. Beyond
WebDAV, there are precious few "pure REST" applications out there.
There are many almost-REST services, but a variety of common
practices, such as the use of cookies, interferes with REST by
confusing the semantics of stateless information transfer operations.
DAS/1 is among these "almost REST" services. More or less
accidentally, it follows some of the REST conventions but it does
other things that are discouraged by REST, such as using POST to mean
GET.
The security of REST messages is limited to whatever HTTP can provide,
which ranges from horribly insecure cleartext passwords (Basic
authentication) to a sophisticated session-based public key encryption
system (SSL). Because each REST message is specified by a distinct
combination of URL and request method, Web server-based access and
authentication controls can be applied directly to REST messages,
allowing fine-grained control over who accesses data and what
manipulations they are allowed to perform on it.
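To illustrate what access control keyed on URL and request method
might look like, here is a sketch in Python. It does not reproduce
the configuration syntax of any real web server; the rule table, the
user names, and the is_allowed function are all invented for the
example.

```python
# Each rule: (URL path prefix, request method, set of users allowed).
# The users and rules are invented for illustration.
RULES = [
    ("/das/d-melanogaster/", "GET", {"anyone"}),
    ("/das/d-melanogaster/r3.1/2R/features/", "PUT", {"curator"}),
    ("/das/d-melanogaster/r3.1/2R/features/", "DELETE", {"curator"}),
]


def is_allowed(user, method, path):
    """Return True if some rule matching the method and path admits the user."""
    for prefix, rule_method, users in RULES:
        if method == rule_method and path.startswith(prefix):
            if "anyone" in users or user in users:
                return True
    return False


path = "/das/d-melanogaster/r3.1/2R/features/exon00002"
print(is_allowed("guest", "GET", path))
print(is_allowed("guest", "PUT", path))
print(is_allowed("curator", "PUT", path))
```

The point is that the rule never has to look inside the message body:
the URL says what is being touched and the verb says how.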
Software support:
There is significant infrastructural support for REST services. It's
the web!
==================================================================
CORBA
-----
CORBA is a remote method call (RMC) protocol that uses binary-encoded
objects and an object request, lookup and serialization infrastructure
called an ORB.
Key Features:
- An RMC-based API that makes remote method calls look more-or-less
like local ones.
- Bindings to many popular languages, including C, Java, C++ and
Perl.
- A language-independent interface description language (IDL) to
describe objects and their methods.
- A directory service for finding services and returning their
identifiers and locations.
Another key feature of CORBA is its ability to support legacy
applications in C++, Java or Perl. In theory, one can take existing
library code written to support a local application, define a public
interface to it in IDL, and then link a small CORBA application
wrapper to it, thereby turning it into a network service. Client code
written to access the local library can now operate on the remote
service with no other source code changes. In practice, I have found
this process less than transparent because the architecture of a
network server has fundamental constraints that are different from
that of a local application, and rarely does a converted application
perform in a satisfactory way.
Discussion:
The Life Sciences committee of the Object Management Group (OMG) has
been hard at work for several years developing IDLs for the life
sciences. However, due to the rapid change of the field, the IDLs
that are being ratified now have little relationship to the MOBY use
cases, and therefore are not as valuable as one would hope.
To contrast CORBA to REST, let us consider casting DAS in CORBA terms.
The first step would be to write an IDL with some of the following
declarations:
// Forward declarations so the interfaces can refer to one another.
interface Segment;
interface Feature;
typedef sequence<Feature> FeatureArray;

interface DataSource {
    void    setGenome(in string new_source);
    string  getGenome();
    void    setVersion(in string new_version);
    string  getVersion();
    Segment getSegment(in string lsid);
};

interface Segment {
    DataSource   getSource();
    string       getReference();
    long         getStart();
    long         getEnd();
    FeatureArray getFeatureSet();
};

// A Feature is a Segment with a type and optional subfeatures.
interface Feature : Segment {
    string       getType();
    FeatureArray getSubFeatures();
};
A CORBA DAS service would provide an API like the following:
data_source = MYCORBA.get_an_object_somehow('urn:lsid:biodas.org:provider/das');
data_source.setGenome('urn:lsid:www.taxonomy.org:taxa/dmelanogaster');
data_source.setVersion('r3.1');
segment = data_source.getSegment('urn:lsid:my.site:chromosomes/2R');
features = segment.getFeatureSet();
if (features[0].getType() == "transcript")
    exons = features[0].getSubFeatures();
The process of fetching a list of exon features becomes a set of method
calls. Objects are identified by an arbitrary naming system that is
unrelated to the Web's URI system. For fun, I've used the LSID
system, but in fact any opaque identifier would do here. The
get_an_object_somehow() call is a stand-in for a series of CORBA
object directory lookup calls, which are outside the scope of this
discussion.
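To make the shape of that client code concrete without an ORB, here
is a plain-Python stand-in for the three interfaces. Nothing in it is
CORBA: the classes are ordinary local objects, the data is invented,
the setGenome/setVersion calls are omitted for brevity, and a counter
(my own device) marks each call that would be a network round trip
under a real ORB.

```python
class RoundTrips:
    """Counts the calls that would cross the network under a real ORB."""
    count = 0


def remote(method):
    """Decorator marking a method as a remote call and counting each use."""
    def wrapper(self, *args):
        RoundTrips.count += 1
        return method(self, *args)
    return wrapper


class Segment:
    def __init__(self, reference, start, end, features=()):
        self._reference, self._start, self._end = reference, start, end
        self._features = list(features)

    @remote
    def getReference(self):
        return self._reference

    @remote
    def getFeatureSet(self):
        return self._features


class Feature(Segment):
    def __init__(self, ftype, reference, start, end, subfeatures=()):
        super().__init__(reference, start, end)
        self._type = ftype
        self._subfeatures = list(subfeatures)

    @remote
    def getType(self):
        return self._type

    @remote
    def getSubFeatures(self):
        return self._subfeatures


class DataSource:
    def __init__(self, segments):
        self._segments = segments

    @remote
    def getSegment(self, lsid):
        return self._segments[lsid]


# Invented data: one transcript with two exons on chromosome arm 2R.
exons = [Feature("exon", "2R", 100, 200), Feature("exon", "2R", 300, 400)]
transcript = Feature("transcript", "2R", 100, 400, exons)
source = DataSource({
    "urn:lsid:my.site:chromosomes/2R": Segment("2R", 1, 1000, [transcript]),
})

# Same call sequence as the client code above.
segment = source.getSegment("urn:lsid:my.site:chromosomes/2R")
features = segment.getFeatureSet()
found = features[0].getSubFeatures() if features[0].getType() == "transcript" else []
print(len(found), "exons in", RoundTrips.count, "round trips")
```

Even this short fetch involves one round trip per method call, which
is worth keeping in mind when reading the performance remarks later
in this report.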
Unlike SOAP, CORBA is strongly tied to the underlying transport layer
at one side, and to the directory service at the other. It is also
very much integrated with the syntax and feature set of IDL. This has
usage implications. For example, the Internet Inter-ORB Protocol
(IIOP) is the only sanctioned object exchange protocol for TCP/IP.
IIOP currently provides synchronous message exchange only: after
initiating a method call, a client process must wait until it is
complete. Asynchronous communications are precluded, at least until a
new version of CORBA becomes available. Similarly, the object
discovery service is tightly bound to the CORBA package, and cannot
easily be mixed and matched.
The CORBA IDL is a powerful and expressive interface language that
includes many of the features of object-oriented languages. It was
designed during the days when C++ reigned, and this heritage shows: it
provides the basic C++ types, including characters, strings, integers,
floats, unions, references and enums, as well as aggregations of these
types, including structs and arrays. The pointer type is not
available, for good reasons.
IDL supports class inheritance including multiple inheritance, but
does not have a straightforward mapping to C++'s method scoping rules,
such as protected methods. The multiple inheritance rules also forbid
inheriting the same method name from two base classes, something that
will prevent multiple inheritance from being much use in large
collaborative projects where such collisions are frequent. IDL also
has the concept of an object "attribute", something akin to an
instance variable, but some texts recommend against using them.
CORBA supports both application- and system-defined exceptions.
Application-level exceptions are explicitly declared in the IDL
interface definition and are mapped onto whatever language-specific
exception-handling mechanism CORBA is bound to. Because of the need
to support non-object oriented languages like C, exception types
cannot be inherited or extended.
Security is provided via a CORBA Security Service, which provides an
API for object-level access control. The API is "technology neutral,"
which means that it is up to ORB implementors to find the best way to
implement the API. The dominant solution seems to be to run CORBA
on top of SSL/TLS, but I do not have the complete picture of
how widespread such implementations are.
Who is using CORBA:
At one point, CORBA was going to be the saviour of bioinformatics. It
was heavily promoted by the EBI and by a number of biotech/biopharm
companies. It has found a niche in certain LAN applications, but has
not achieved any significant use for public servers. I do not have a
good sampling of opinions as to why it has failed, but Ewan Birney, an
early and strong proponent of CORBA, now quotes "performance problems"
as a major factor.
CORBA never had the support of Microsoft, and no longer has the
support of IBM or Sun.
Software Support:
Software support is good if you are on a Linux machine running the
Gnome desktop environment. Gnome made the big leap five years ago and
committed to a completely CORBAized architecture. Therefore the CORBA
libraries, development tools, and other infrastructural elements are
preinstalled on such machines. As far as I can tell from my personal
experience, CORBA supports Gnome on the desktop well, but it has not
provided the interoperability win or "killer app" that one might hope
for.
Netscape and Mozilla both include a freeware ORB and IIOP, allowing
components of those browsers to send and receive CORBA objects. The
Mozilla ORB appears to be different from the one that comes with
Gnome, at least insofar as it comes with a different IDL compiler and
a slightly different IDL syntax.
The Java runtime up to J2SE v1.4 includes an ORB, and the standard
Java library has a full set of CORBA bindings. However, Sun clearly
intends to deprecate its CORBA support. The Java Web Services FAQ,
marketing literature, and white papers are exclusively devoted to
SOAP/XML, and references to CORBA are now buried in the technical
documentation.
Microsoft has long been antagonistic to CORBA, and spent the 90s
actively promoting DCOM (under a variety of names, including ActiveX)
as an alternative framework for network services. Microsoft operating
systems require the installation of third-party ORBs in order to
participate in CORBA-based services.
==================================================================
SOAP
----
SOAP initially stood for Simple Object Access Protocol. However it
isn't particularly simple, and it has little to do with accessing
objects, so recent reference works have tended to use the acronym on
its own. In a nutshell, SOAP is a Remote Procedure Call (RPC)
protocol which uses XML for its messages.
Key Features:
- An RPC-based API that makes remote procedure calls look more-or-less
like local ones.
- Bindings to many popular languages, including Java, C++ and Perl.
- A choice of XML-based data definition languages, the most popular
being XSD.
- A language-independent service description language called WSDL.
- Support for a directory service called UDDI.
- A lot of industry support, books, etc.
Discussion:
SOAP is positioned very much in the same niche as CORBA. In theory,
one can take legacy applications, flip a compiler switch, and have
them act as SOAP clients and servers. This is because each language
provides bindings that map its fundamental data types and method call
conventions into language-independent XML encodings.
I would repeat the DAS example here, but the interface definition
would be very much larger and harder to understand in XML/WSDL format.
The application-level code, however, would be similar, if not
identical to the CORBA example.
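Although the full WSDL is too long to show, the message that goes
over the wire for a single call is easy to sketch. The Python below
builds a SOAP 1.1 envelope for a getSegment call using only the
standard library. The envelope namespace is the one defined by SOAP
1.1; the method namespace urn:example:das and the build_call helper
are my own inventions, and a real toolkit would typically add
encoding and type attributes that I have left out.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
DAS_NS = "urn:example:das"  # hypothetical namespace for the DAS methods


def qualified(namespace, local_name):
    """Return a tag name in ElementTree's {namespace}local notation."""
    return "{" + namespace + "}" + local_name


def build_call(method, **params):
    """Wrap one method call and its string parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(qualified(SOAP_ENV, "Envelope"))
    body = ET.SubElement(envelope, qualified(SOAP_ENV, "Body"))
    call = ET.SubElement(body, qualified(DAS_NS, method))
    for name, value in params.items():
        ET.SubElement(call, name).text = value
    return ET.tostring(envelope, encoding="unicode")


message = build_call("getSegment", lsid="urn:lsid:my.site:chromosomes/2R")
print(message)
```

The language binding's job is to generate and parse messages like
this one behind the scenes, so that the application programmer sees
only the method call.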
In contrast to CORBA, which is tightly coupled to its transport layer,
interface language, and directory service, SOAP takes a modular
approach. The SOAP transport framework can run on top of stateless
synchronous protocols such as HTTP, stateful synchronous protocols
such as FTP, stateless asynchronous protocols such as Jabber, and
delayed stateless asynchronous protocols such as SMTP. In theory
services can be described using a variety of data definition
languages, although in practice XSD and WSDL dominate. Resource
discovery is outside of core SOAP, but is provided by the separate
UDDI specification.
SOAP does not formally support inheritance, nor does it, to be honest,
truly support an object-oriented API. It is up to the language
binding to serialize and deserialize native objects in such a way to
simulate the exchange of objects and the invocation of method calls on
them. XSD provides for inheritance, but it is an extremely
data-centric type of inheritance that requires some oddities, such as
restating the contents of the base class in the derived classes, that
interfere with OO design. I do not understand how inheritance works
in WSDL, and have been unable to determine from my readings whether a
SOAP service that provides a derived object class can communicate
correctly with a client that expects to receive and manipulate the
superclass.
SOAP has a formal exception-handling mechanism that maps onto the
exception system of the currently bound language. As in CORBA, the
list of exceptions forms a simple enumerated list without
inheritance. However, the list of exceptions can be extended by
application developers, and it does not seem impossible for a language
binding to impose inheritance on the system.
SOAP security can be achieved by running SOAP sessions across SSL/TLS.
This provides a very coarse-grained access control based on the
identities of the server and client, and not the fine-grained access
control needed to provide selective access to individual objects and
method calls. A number of proposed extensions to SOAP add this
fine-grained access control. The one that is furthest along is a
straightforward application of the XML digital signature syntax to
SOAP. Interestingly, one of SOAP's strongest selling points is that
when used on top of HTTP it will go through firewalls, which typically
pass port 80 traffic. Thus its ability to circumvent firewall
security is a feature, not a bug.
I have tried SOAP in my own applications and find that it works fine
for simple to moderately complex applications. Because of its
transparency, programmers can be tricked into performing foolish
operations. For example, in a local application it makes sense to
create lots of large complex objects and then invoke method calls on
them. In SOAP, every method call requires the object to be marshalled
(serialized along with all its subobjects), transmitted across the
wire, and unserialized by the server. The whole process is repeated
on the way back, making the application slow. Just as is the case
with CORBA, awareness of the strengths and weaknesses of network-based
software must inform the design of services from the very beginning.
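A rough way to see the cost is to count the bytes marshalled under
two designs. In the sketch below, JSON stands in for XML marshalling
(an assumption made purely to keep the example short), and the
segment data and call names are invented.

```python
import json

# Invented data: a segment carrying 1000 features as subobjects.
segment = {
    "lsid": "urn:lsid:my.site:chromosomes/2R",
    "features": [
        {"id": "exon%05d" % i, "start": i * 100, "end": i * 100 + 50}
        for i in range(1000)
    ],
}


def marshal(message):
    """Stand-in for XML marshalling: serialize the message to bytes."""
    return json.dumps(message).encode("utf-8")


calls = ["getReference", "getStart", "getEnd"]

# Naive design: each call ships the whole object and its subobjects.
naive_bytes = sum(len(marshal({"call": c, "target": segment})) for c in calls)

# Network-aware design: each call names the object by identifier only.
lean_bytes = sum(
    len(marshal({"call": c, "target": segment["lsid"]})) for c in calls
)

print("naive:", naive_bytes, "bytes; lean:", lean_bytes, "bytes")
```

The naive figure grows with the size of the object on every call,
while the lean figure does not, and that is before counting the
return trip.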
In my hands, SOAP does not work well in applications that transfer
extremely large amounts of data. For example, the genome-size data
streams that DAS generates rapidly exceed the DOM data structures of
SOAP/1.1 and earlier libraries, which expect objects to fit in memory.
SOAP 1.2 fixes this by allowing for incremental event-based (SAX)
parsing of messages, but this weakens the procedure-call API by
exposing the developer to the innards of object marshalling and
unmarshalling.
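For readers who have not used event-based parsing, here is a small
Python sketch of the SAX style using the standard xml.sax module. It
tallies features by type as the elements stream past, without ever
holding the whole document in memory. The element and attribute
names are invented for the example and are not the actual DAS or SOAP
formats.

```python
import xml.sax


class FeatureCounter(xml.sax.ContentHandler):
    """Tallies features by type as they stream past; keeps no tree in memory."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag, in document order.
        if name == "FEATURE":
            ftype = attrs.get("type", "unknown")
            self.counts[ftype] = self.counts.get(ftype, 0) + 1


# A tiny invented document standing in for a genome-sized response.
document = b"""<SEGMENT id="2R">
  <FEATURE id="exon00001" type="exon"/>
  <FEATURE id="exon00002" type="exon"/>
  <FEATURE id="tx00001" type="transcript"/>
</SEGMENT>"""

handler = FeatureCounter()
xml.sax.parseString(document, handler)
print(handler.counts)
```

This is exactly the trade-off described above: memory use stays flat,
but the developer is now writing callbacks against the raw message
structure rather than calling methods on objects.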
Who is Using SOAP:
Many people are talking about SOAP, but toy examples and proofs of
principle far outnumber production applications. This applies both
to biological and non-biological
domains. My greatest success with SOAP has been a database
application that tracks the merges and splits in gene names. The
operations in this application are lightweight and require very little
data transfer, and a server written in Perl communicates very nicely
with clients written in Java and C. However, the application remains
a toy. In production I connect to the database over a socket using
the database's SQL API. One issue is speed, but another is that I do
not have confidence in the Perl SOAP library. Undoubtedly I would be
more enthusiastic if I were using the Java binding, which is more
mature and better supported.
Software Support:
SOAP is receiving strong developer support from IBM and Sun. The
level of support is not even across languages. It is very good for
Java and C#, pretty good for C++, good for Perl (although I don't
trust the library to be bug-free), and poor for Python.
Microsoft's .NET architecture, as far as I can tell from the marketing
literature and sometimes contradictory Internet commentary, is SOAP,
WSDL and UDDI with a set of Microsoft APIs at the front end and a
runtime intermediate language in the middle. Provided that Microsoft
does not follow its traditional practice of embracing and modifying
open standards, the future of SOAP and its associated technologies
looks assured.
CONCLUSION:
-----------
REST is a collection of software architecture patterns that is
suitable for developing highly-tuned web-based services. However, it
is very much a do-it-yourself proposition that is hard to compare
directly to either CORBA or SOAP. If we were to consider adopting
REST for use in MOBY, we would have to define the following:
- A URL-based system for referring to biological objects by
identifier. This would be extremely similar to the "moby triple"
and LSID ideas, but we would have to come up with a new URL-like
syntax.
- A series of schemas, partial schemas, or MIME types to describe
shared biological objects. This is hard. However, you have to do
the same thing to get interoperable data representations for CORBA
and SOAP as well. One "advantage" of REST is that the objects
aren't constrained to use XML representations. For example, DAS
could distribute genome annotations using tab-delimited files and a
well-known MIME type.
- A way of describing what services a web site offers. No help here
from WSDL or UDDI.
A REST-based MOBY would be tied for the conceivable future to the HTTP
protocol and to the HTTP security architecture. We would be unable to
run MOBY on top of asynchronous or delayed protocols such as instant
messaging or SMTP.
CORBA is a very complete package that provides everything from the
transport layer to the resource discovery system. Details of the
CORBA messaging system are well-hidden from the application developer.
This simplifies the development process, but makes it harder to tune
performance. The major downside to CORBA is that it has been
abandoned as a technology by the vendors that once promoted it, and
has been effectively pithed by Microsoft's SOAP-based .NET
architecture. If we were to run MOBY on CORBA, we would be hitching
our cart to a half-dead horse caught in quicksand. I do not recommend
this course of action.
SOAP provides a messaging system that is mostly transparent to the
applications developer. It does not have a tightly-coupled transport
layer and resource discovery system, thereby providing the flexibility
to mix and match these components. We could run MOBY services off an
e-mail server, thereby reintroducing the sorely-missed batch BLAST and
sequence retrieval services of my graduate school days.
Unlike CORBA, SOAP is well supported by the industry and, barring an
underhanded move by Microsoft, is likely to dominate web services over
the next half decade. At the same time, it is a work in progress, and
we can expect to reimplement MOBY a few times as SOAP evolves.
If we were to run MOBY on SOAP (as Mark's prototype does!) we would
have to define the following:
- A series of LSID or MOBY triple mappings for shared biological
IDs.
- A series of XSDs, or partial XSDs for common biological
objects. See the note below for a major concern of mine.
- A registry system, i.e., should we continue with MOBY Central
or bite the bullet and use UDDI? Alternatively, perhaps we
should redirect our efforts towards making web service providers
self-describing?
An unresolved concern of mine is whether SOAP truly supports
object-oriented interface design. I don't know if this will eventually
turn into a requirement, but a nice feature for MOBY to have would be
the ability to write clients that access and manipulate the base class
of an object. If a simple client that understands the
"SimpleSequence" class tries to retrieve data from a more
sophisticated MOBY service that returns "SuperDuperSequence" objects,
will the client be able to invoke the base class's methods on the
derived objects? I
don't see this being enforced in any real way in the SOAP
specification, and the textbooks and Internet sources are curiously
silent on this topic.
BOTTOM LINE: I think we can make a plausible argument for either SOAP
or REST as the messaging layer for MOBY.
------------------------------
Bibliography:
* REST
REST+SOAP
http://www.intertwingly.net/stories/2002/07/20/restSoap.html
Roy Fielding's Dissertation:
http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
FrontPage RESTwiki
http://internet.conveyor.com/RESTwiki/moin.cgi/FrontPage
Roots of the REST/SOAP Debate
http://www.prescod.net/rest/rest_vs_soap_overview/
* CORBA
Client/Server Programming with Java and CORBA, Orfali and Harkey
Advanced CORBA Programming with C++, Henning and Vinoski
* SOAP
Programming Web Services with SOAP, Snell, Tidwell and Kulchenko
* Digital Signature Extension for SOAP
http://www.w3.org/TR/SOAP-dsig/
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein at cshl.org Cold Spring Harbor, NY
========================================================================