In this paper we
address one of the problems that may arise when two heterogeneous web services
(i.e., not originally conceived to cooperate) try to exchange messages. We
assume that the web services (WS) exchange SOAP messages containing (in
“document style”) a complete business document (e.g., a request-for-quote: rfq)
in XML format (or RDF, in the Semantic Web).

Assume that a web service of a buyer company (ws-b) sends an rfq document (rfq-doc),
as payload of a SOAP message (rfq-msg),
to a web service of a provider company (ws-p).
The latter will be able to directly acquire and process the data carried by rfq-msg only if the structure, the tags,
the coding, and all the required elements of the rfq-doc have been previously agreed (e.g., by adhering to a common
business standard, such as ebXML). If this is not the case, it is highly probable
that ws-p will not be able to
correctly acquire and process the data transported by the rfq-msg. This is one of the key interoperability problems that
arise between heterogeneous web services: the data interoperability clash. Today
there is no easy solution to this problem.

Today, the most advanced commercial technology for data
interoperability is represented by Enterprise Application Integration (EAI)
solutions, such as Tibco or webMethods. Such solutions require the development
of an adaptor for each pair of
cooperating web services. While EAI technology represents an important
commercial reality and is widely adopted, especially in large corporations (due
to the high costs), this kind of solution presents a number of drawbacks, and
increasing costs, when the scenario is not stable (e.g., cooperating partners
and/or their data schemas often change) and the number of cooperating WSs is
high. In fact, the development of an adaptor is a critical and expensive task,
requiring highly skilled experts. Furthermore, given n cooperating WSs, the number of adaptors to be developed (assuming
that potentially any partner may exchange data with any other partner) is O(n^2).
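The quadratic growth can be made concrete with a small calculation. The sketch below is only illustrative: the function name `adapters_needed` and the `directed` flag are assumptions introduced here, to distinguish one adaptor per ordered (sender, receiver) pair from one bidirectional adaptor per unordered pair.

```python
def adapters_needed(n: int, directed: bool = True) -> int:
    """Count point-to-point adaptors among n services (hypothetical helper).

    directed=True  : one adaptor per ordered (sender, receiver) pair.
    directed=False : one bidirectional adaptor per unordered pair.
    Both counts grow quadratically with n.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    ordered_pairs = n * (n - 1)
    return ordered_pairs if directed else ordered_pairs // 2


if __name__ == "__main__":
    for n in (2, 5, 10, 50):
        print(n, adapters_needed(n), adapters_needed(n, directed=False))
```

Under either reading, doubling the number of partners roughly quadruples the number of adaptors to build and maintain.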

We foresee that with the advent of the Semantic Web [SIGMOD] the
picture will change, since each partner will have access to the knowledge needed
to align the structures involved in a data exchange, allowing for a
semantic-based data reconciliation. In essence, we will have a scenario where
the differences in the representation of a document, say rfq-doc, in ws-b and ws-p will remain, but the available data
semantics will allow an automatic reconciliation of the diverging representations
in a large number of cases (naturally, there will be cases where manual
intervention and/or ad-hoc solutions will be required). The development of an
ontology-based infrastructure for semantic reconciliation is one of the goals
of the IST Athena Integrate Project,
launched in the context of the 6th Framework Programme of the
European Union. The proposed approach is based on a reference ontology and an
inference engine dedicated to the enactment of a set of semantic reconciliation
rules. Such rules are produced by comparing the two different data structures
(e.g., rfq-doc-b and rfq-doc-p) on two
different levels: representational and semantic. The goal is to identify
the representational mismatches and bridge them by using the underlying
semantics. A complete description of the Athena Semantic Suite falls outside
the scope of this position paper. Here we focus on the analysis of the representational
mismatches that are at the basis of the production of the semantic
reconciliation rules.

Document representational mismatches: a systematic analysis

The goal of our
work is to understand whether, given two document schemas, there exists a mapping
between them, i.e., a function capable of transforming an instance of
the first document into an instance of the other without loss of information. Furthermore,
if a lossless transformation does not exist, we aim to identify the mapping that minimises the loss of information.
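One way to read the lossless condition operationally is as a round trip: transforming an instance forward and then back should return the original instance. The sketch below illustrates this on sample documents only; the dictionary representation, the helper `round_trips`, and the four example transformations are hypothetical names introduced here, and passing the check on a few samples does not prove that a mapping is lossless in general.

```python
from typing import Callable, Dict, Iterable

Doc = Dict[str, object]  # a document instance, simplified to a flat dictionary


def round_trips(samples: Iterable[Doc],
                forward: Callable[[Doc], Doc],
                backward: Callable[[Doc], Doc]) -> bool:
    """True if backward(forward(d)) == d for every sample document d."""
    return all(backward(forward(d)) == d for d in samples)


# Hypothetical lossless pair: only the label of the content changes.
def rename_forward(doc: Doc) -> Doc:
    return {"RFQuote": doc["RFQ"]}


def rename_backward(doc: Doc) -> Doc:
    return {"RFQ": doc["RFQuote"]}


# Hypothetical lossy pair: the target schema has no slot for the content.
def drop_forward(doc: Doc) -> Doc:
    return {}


def drop_backward(doc: Doc) -> Doc:
    return {"RFQ": None}


samples = [{"RFQ": "quote-001"}, {"RFQ": "quote-002"}]
print(round_trips(samples, rename_forward, rename_backward))
print(round_trips(samples, drop_forward, drop_backward))
```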

Our work started with the objective of identifying a limited
number of patterns that characterise the representational mismatches between business
document schemas, with the aim of producing semantic reconciliation rules to be
used at runtime. The analysed mismatches are partitioned into two main groups:
lossless and lossy.

Lossless mismatches –
when two document schemas express the same content and an
information-preserving transformation can be defined. The mapping can be simple or
composite:

- Simple: an element of the doc-p schema corresponds exactly to an
element of the doc-b schema (and vice versa).

- Composite: the meaning of an element of the doc-p schema can be
precisely expressed by a suitable composition of elements of the
doc-b schema. (Such a composite transformation requires, in general, a Reference Ontology.)

Lossy mismatches – when it is not possible to define an
information-preserving mapping between the two schemas, because of their
inherent semantic divergence. Therefore, when transforming an instance of doc-b into an instance of doc-p, in the best case we will
have a quasi-matching, and some
information may be lost. Given an element of the first schema, there are three
possible situations:

- There is a quasi-matching element in the second schema that
exhibits a greater level of abstraction. There will be a direct information
loss (e.g., from doc-b to doc-p): part of the detail sent is dropped, but every element the receiving WS expects is filled, so the receiver will
not perceive any missing information.

- There is a quasi-matching element in the second schema that
exhibits a greater level of refinement. There will be an inverse information
loss: all the information sent will be represented, but the receiving document will
have missing elements.

- There is no corresponding element in the second schema: there will
be a bidirectional loss.

Below we report a table with the types of
mismatch identified. Besides the name and a brief description, we report fragments
of two RFQ documents, seen from the buyer and provider perspectives. The examples
are expressed in RDF/N3 [] syntax.

Lossless Mismatches

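Each entry reports, in order: mismatch name, description, first schema fragment (doc-a), second schema fragment (doc-b), and a comment comparing the two.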

Naming

Different labels for the same content

doc-a Schema:

    :RFQ a :Class .

doc-b Schema:

    :RFQuote a :Class .

- doc-a identifies a request for quotation as RFQ

- doc-b identifies a request for quotation as RFQuote

Attribute Granularity

The same information is decomposed into a different number of attributes

doc-a Schema:

    :Buyer a :Class .
    :has_Address a rdf:Property;
        :domain :Buyer;
        :range xsd:string .

doc-b Schema:

    :BuyerAddress a :Class .
    :has_Street a rdf:Property;
        :domain :BuyerAddress;
        :range xsd:string .
    :has_StreetNr a rdf:Property;
        :domain :BuyerAddress;
        :range xsd:string .

- doc-a represents the Address of a Buyer as a single
string, containing the name of the street and the street number

- doc-b represents the Address of a Buyer as a structure
explicitly decomposed into two fields, one for the name of the street and the
other for the street number.
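A reconciliation rule for the Attribute Granularity mismatch has to split one string into two fields and join them back. The following is a minimal sketch, not the rule format used by the project: the function names are hypothetical, and it assumes that the street number is the last whitespace-separated token of the address and starts with a digit, which does not hold for every national address format.

```python
import re
from typing import Tuple


def split_address(address: str) -> Tuple[str, str]:
    """Split 'street name + number' into (has_Street, has_StreetNr).

    Assumption: the number is the final token and starts with a digit.
    """
    match = re.fullmatch(r"\s*(.+?)\s+(\d+\w*)\s*", address)
    if match is None:
        raise ValueError(f"cannot split address: {address!r}")
    return match.group(1), match.group(2)


def join_address(street: str, street_nr: str) -> str:
    """Rebuild the single-string form used by has_Address."""
    return f"{street} {street_nr}"


street, number = split_address("Via Roma 12")
print(street, "|", number)
print(join_address(street, number))
```

When the assumption about the address format fails, the function raises an error instead of guessing, which corresponds to the cases where manual intervention is required.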

Structure Organization

Different structures and organization of the same content

doc-a Schema:

    :RFQ a :Class .
    :Buyer a :Class .
    :has_Buyer a rdf:Property;
        :domain :RFQ;
        :range :Buyer .

doc-b Schema:

    :RFQuote a :Class .
    :Parties a :Class .
    :Buyer a :Class .
    :has_Parties a rdf:Property;
        :domain :RFQuote;
        :range :Parties .
    :has_Buyer a rdf:Property;
        :domain :Parties;
        :range :Buyer .

- doc-a represents the information of the Buyer directly nested in
the RFQ structure

- doc-b collects the information about the Buyer under the Parties
structure and not directly under RFQuote.
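For the Structure Organization mismatch the content is unchanged and only its position in the structure moves. A minimal sketch, assuming instances are represented as nested dictionaries keyed by the class and property names of the fragments above (the representation and the function names `to_doc_b` and `to_doc_a` are assumptions made for illustration):

```python
from typing import Any, Dict

Doc = Dict[str, Any]


def to_doc_b(doc_a: Doc) -> Doc:
    """Move the Buyer from directly under RFQ to under RFQuote/Parties."""
    buyer = doc_a["RFQ"]["has_Buyer"]
    return {"RFQuote": {"has_Parties": {"has_Buyer": buyer}}}


def to_doc_a(doc_b: Doc) -> Doc:
    """Inverse restructuring: lift the Buyer out of the Parties structure."""
    buyer = doc_b["RFQuote"]["has_Parties"]["has_Buyer"]
    return {"RFQ": {"has_Buyer": buyer}}


rfq = {"RFQ": {"has_Buyer": {"name": "ACME"}}}
print(to_doc_b(rfq))
print(to_doc_a(to_doc_b(rfq)) == rfq)
```

Because each function undoes the other, no information is lost in either direction.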

SubClass-Attribute

An attribute with a predefined value set is represented by a set of subclasses

doc-a Schema:

    :RawMaterial a :Class .
    :MaterialType a :Class .
    :has_Type a rdf:Property;
        :domain :RawMaterial;
        :range :MaterialType .
    :Copper a :MaterialType .
    :Iron a :MaterialType .

doc-b Schema:

    :RawMaterial a :Class .
    :Copper a :Class;
        :subClassOf :RawMaterial .
    :Iron a :Class;
        :subClassOf :RawMaterial .

- doc-a specifies the type of a RawMaterial by instantiating the
has_Type property, selecting the value from a predefined set of
MaterialType instances (Copper and Iron)

- doc-b represents the same information by instantiating either the Copper
class or the Iron class, both subclasses of the RawMaterial
class
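The SubClass-Attribute mismatch can be bridged by turning the attribute value into the class of the instance, and back. The sketch below assumes a dictionary representation in which the key `rdf_type` holds the class of an instance; that key and the function names are hypothetical, while the value set (Copper, Iron) comes from the fragments above.

```python
from typing import Dict

MATERIAL_SUBCLASSES = {"Copper", "Iron"}  # predefined value set from the fragments


def attribute_to_subclass(instance: Dict[str, str]) -> Dict[str, str]:
    """doc-a -> doc-b: the has_Type value becomes the class of the instance."""
    material = instance["has_Type"]
    if material not in MATERIAL_SUBCLASSES:
        raise ValueError(f"unknown material type: {material!r}")
    return {"rdf_type": material}


def subclass_to_attribute(instance: Dict[str, str]) -> Dict[str, str]:
    """doc-b -> doc-a: the subclass becomes the value of has_Type."""
    material = instance["rdf_type"]
    if material not in MATERIAL_SUBCLASSES:
        raise ValueError(f"not a known RawMaterial subclass: {material!r}")
    return {"rdf_type": "RawMaterial", "has_Type": material}


piece = {"rdf_type": "RawMaterial", "has_Type": "Copper"}
print(attribute_to_subclass(piece))
print(subclass_to_attribute(attribute_to_subclass(piece)) == piece)
```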

Schema-Instance

Data hold schema information

doc-a Schema:

    :Contact a :Class .
    :Position a :Class .
    :has_Position a rdf:Property;
        :domain :Contact;
        :range :Position .
    :has_Name a rdf:Property;
        :domain :Contact;
        :range xsd:string .
    :Director a :Position .
    :Employee a :Position .

doc-b Schema:

    :Contact a :Class .
    :has_DirectorName a rdf:Property;
        :domain :Contact;
        :range xsd:string .
    :has_EmployeeName a rdf:Property;
        :domain :Contact;
        :range xsd:string .

- doc-a represents the position of a Contact person by instantiating
the has_Position property, selecting the value from predefined
instances (Director and Employee)

- doc-b has two different properties, one representing the fact that a
Contact person is a director and the other that he/she is an
employee

What is a value
in doc-a is part of the schema in doc-b
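In the Schema-Instance case the transformation must move a data value into a property name and back. A minimal sketch, assuming contacts are flat dictionaries keyed by the property names of the fragments above; the function names are hypothetical:

```python
from typing import Dict

POSITIONS = ("Director", "Employee")  # predefined instances in doc-a


def value_to_property(contact: Dict[str, str]) -> Dict[str, str]:
    """doc-a -> doc-b: the has_Position value selects the property name."""
    position = contact["has_Position"]
    if position not in POSITIONS:
        raise ValueError(f"unknown position: {position!r}")
    return {f"has_{position}Name": contact["has_Name"]}


def property_to_value(contact: Dict[str, str]) -> Dict[str, str]:
    """doc-b -> doc-a: the property name is turned back into a value."""
    for position in POSITIONS:
        key = f"has_{position}Name"
        if key in contact:
            return {"has_Name": contact[key], "has_Position": position}
    raise ValueError("no has_DirectorName or has_EmployeeName property found")


contact_a = {"has_Name": "Anna Rossi", "has_Position": "Director"}
print(value_to_property(contact_a))
print(property_to_value(value_to_property(contact_a)) == contact_a)
```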

Encoding

Different format of data or unit of measure

doc-a Schema:

    :Product a :Class .
    :has_PriceInEuro a rdf:Property;
        :domain :Product;
        :range xsd:float .

doc-b Schema:

    :Product a :Class .
    :has_PriceInUSD a rdf:Property;
        :domain :Product;
        :range xsd:float .

- doc-a expresses the price of a Product in euro

- doc-b expresses the price of a Product in US dollars

A simple
conversion from euro to US dollars allows the right information to be
exchanged
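The Encoding mismatch reduces to a unit conversion. The sketch below is illustrative only: the function names are hypothetical, the exchange rate is a required parameter because no rate is given in the text, and rounding to two decimals is an assumption that can make a round trip differ by a cent.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def euro_to_usd(price_eur: Decimal, usd_per_eur: Decimal) -> Decimal:
    """Convert has_PriceInEuro to has_PriceInUSD at the supplied rate."""
    return (price_eur * usd_per_eur).quantize(CENT)


def usd_to_euro(price_usd: Decimal, usd_per_eur: Decimal) -> Decimal:
    """Convert has_PriceInUSD back to has_PriceInEuro at the same rate."""
    return (price_usd / usd_per_eur).quantize(CENT)


rate = Decimal("1.25")  # example value only, not a real exchange rate
price_usd = euro_to_usd(Decimal("100.00"), rate)
print(price_usd)
print(usd_to_euro(price_usd, rate))
```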

Lossy Mismatches

Each entry reports, in order: clash name, description, doc-a schema fragment, doc-b schema fragment, and a comment comparing the two.

Content

Different content denoted by the same concept (typically expressed by enumeration)

doc-a Schema:

    :RFQ a :Class .
    :TransportationTerm a :Class .
    :has_TranspTerm a rdf:Property;
        :domain :RFQ;
        :range :TransportationTerm .
    :EXW a :TransportationTerm .
    :FCA a :TransportationTerm .
    :CFR a :TransportationTerm .

doc-b Schema:

    :RFQuote a :Class .
    :TransportationTerm a :Class .
    :has_TranspTerm a rdf:Property;
        :domain :RFQuote;
        :range :TransportationTerm .
    :ExWorks a :TransportationTerm .
    :FreeCarrier a :TransportationTerm .

- doc-a represents the transportation terms by a set of values: EXW,
FCA, CFR

- doc-b represents the transportation terms by a different set of values:
ExWorks and FreeCarrier

While there is
an equivalence between the first two pairs of values, the third option of doc-a
has no corresponding value in doc-b
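The Content clash can be sketched as a partial mapping between the two enumerations: two values have an equivalent, the third does not. The table `TERM_MAP` and the function `map_term` are hypothetical names; the value pairs follow the equivalence stated in the comment above.

```python
from typing import Optional

# doc-a value -> equivalent doc-b value (CFR is deliberately absent)
TERM_MAP = {
    "EXW": "ExWorks",
    "FCA": "FreeCarrier",
}


def map_term(term_a: str) -> Optional[str]:
    """Return the doc-b transportation term, or None when none exists."""
    return TERM_MAP.get(term_a)


for term in ("EXW", "FCA", "CFR"):
    mapped = map_term(term)
    print(term, "->", mapped if mapped is not None else "no corresponding value")
```

Returning `None` rather than the nearest value keeps the information loss explicit, so that the caller can decide how to handle it.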

Coverage

The absence of information

doc-a Schema:

    :RFQ a :Class .
    :has_PreferredDeliveryDate a rdf:Property;
        :domain :RFQ;
        :range xsd:string .

doc-b Schema:

(no corresponding element: the schema cannot express a preferred date for the delivery of the goods)

The preferred delivery date is not considered
by doc-b. There is no way to exchange such information.
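For a Coverage clash, the most a transformation can do is carry over what the target schema supports and report what was left out. A minimal sketch, assuming flat dictionaries and a hypothetical set `DOC_B_PROPERTIES` listing the properties the target schema knows:

```python
from typing import Dict, List, Set, Tuple

# Hypothetical: the properties that the target (doc-b) schema can represent.
DOC_B_PROPERTIES: Set[str] = {"has_TranspTerm"}


def project(doc: Dict[str, str],
            supported: Set[str]) -> Tuple[Dict[str, str], List[str]]:
    """Keep only supported properties and list the ones that were dropped."""
    kept = {key: value for key, value in doc.items() if key in supported}
    dropped = sorted(key for key in doc if key not in supported)
    return kept, dropped


rfq = {"has_TranspTerm": "EXW", "has_PreferredDeliveryDate": "2005-03-01"}
kept, dropped = project(rfq, DOC_B_PROPERTIES)
print(kept)
print(dropped)
```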

Precision

The accuracy of information

doc-a Schema:

    :RawMaterialPiece a :Class .
    :MaterialSize a :Class .
    :has_Size a rdf:Property;
        :domain :RawMaterialPiece;
        :range :MaterialSize .
    :LessThan1Cm a :MaterialSize .
    :Bw1And5Cm a :MaterialSize .
    :MoreThan5Cm a :MaterialSize .

doc-b Schema:

    :RawMaterialPiece a :Class .
    :has_SizeInCubicMeters a rdf:Property;
        :domain :RawMaterialPiece;
        :range xsd:float .

- doc-a represents the size of a piece of material by selecting the value
from three possible ranges of values

- doc-b represents the size of a piece of material with the exact measure
in cubic meters

There is no
precise way to transform information from doc-a to doc-b

Abstraction

Level of specialisation/refinement of the information

doc-a Schema:

    :Order a :Class .
    :DeliveryTerms a :Class .
    :has_DeliveryTerms a rdf:Property;
        :domain :Order;
        :range :DeliveryTerms .

doc-b Schema:

    :Order a :Class .
    :NationalDelivTerms a :Class .
    :ForeignDelivTerms a :Class .
    :has_NationalDelivTerms a rdf:Property;
        :domain :Order;
        :range :NationalDelivTerms .
    :has_ForeignDelivTerms a rdf:Property;
        :domain :Order;
        :range :ForeignDelivTerms .

- doc-a only allows generic delivery terms to be represented

- doc-b allows one to express whether the delivery conditions have to respect, for instance,
national or international laws
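The asymmetry of the Abstraction clash shows up clearly in code: generalising from the refined schema is always possible, while specialising requires information that the generic schema does not carry. The sketch below assumes flat dictionaries; the function names and the optional `scope` hint are hypothetical.

```python
from typing import Dict, Optional

SCOPES = ("National", "Foreign")


def generalise(order_b: Dict[str, str]) -> Dict[str, str]:
    """doc-b -> doc-a: the national/foreign distinction is dropped."""
    for scope in SCOPES:
        key = f"has_{scope}DelivTerms"
        if key in order_b:
            return {"has_DeliveryTerms": order_b[key]}
    raise ValueError("no delivery terms found")


def specialise(order_a: Dict[str, str],
               scope: Optional[str] = None) -> Optional[Dict[str, str]]:
    """doc-a -> doc-b: only possible if the scope is supplied from outside."""
    if scope not in SCOPES:
        return None  # the generic terms alone cannot be refined
    return {f"has_{scope}DelivTerms": order_a["has_DeliveryTerms"]}


order_b = {"has_ForeignDelivTerms": "30 days"}
order_a = generalise(order_b)
print(order_a)
print(specialise(order_a))
print(specialise(order_a, scope="Foreign"))
```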

Conclusions

In this paper we
presented the preliminary results of the work carried out in the Athena European IP, aimed at developing
a semantic interoperability platform for web services within enterprise
software applications. The proposed solution is based on reconciliation rules
that allow WSs to exchange data even if they have different representations and
schemas. The solution has been developed bottom-up, starting with the
analysis of the typical mismatches that two different schemas, aimed at
representing the same business entity, may exhibit. Starting from this analysis,
we are developing a set of rule templates that,
also using a Reference Ontology, will be processed by a Jena2 inference engine [McBr02] to reconcile the instances
exchanged by two WSs.