We
consider online communities in the general sense of sharing common
interests or purposes through data. Community members can be private
users volunteering their time for a common project. They can also be
professionals (researchers, engineers, support staff, etc.) who use
web-scale collaboration in their workplace within or across their
organizations. Therefore, the advantages of mass collaboration such as
faster production and better accuracy of knowledge and data can be
brought to all kinds of companies and, for instance, help them create
better services and products faster at lower cost. To illustrate the
data management requirements in this context, let us introduce two
representative, rather complex examples of online community
applications: collaborative medical research and social networking
systems.

Collaborative medical research. Medical
research is a highly collaborative process, involving multiple
organizations (research laboratories, hospitals, pharmacy companies,
etc.) and multiple participants with different levels of expertise
(patient volunteers, medical scientists, biologists, pharmacists,
physicians, pathologists, surgeons, etc.). As an example of a complex
collaboration scenario, an extensive study of the course of a disease
over a number of years would require integrating data about a
population of selected patients (with their diagnoses, family
histories, therapies, etc.) and matching them with other data (pathology
data, oncology data, etc.). Other scenarios may involve simpler,
shorter collaboration, e.g. a group of pathologists trying to diagnose
a given patient.

Vast amounts of medical data are
produced continuously with various levels of accessibility (patient
records, drug studies, epidemiologic studies, genomic sequences, etc.).
Although most of the data is stored in medical information systems,
these systems are isolated: they do not interoperate and provide no
support for collaboration. Thus, the typical way of collaborating is
for participants to copy and directly exchange data as files (e.g.
spreadsheets), which makes data integration and analysis
time-consuming (e.g. when a high number of files from different
participants is needed) and error-prone (e.g. when merging data from
spreadsheets in different formats). Furthermore, data copying may
violate patient privacy regulations.

A good analysis of the general requirements of a Computer Supported
Cooperative Work (CSCW) system for medical research collaboration is
given in [Sta+08]. From the point of view of data management, we can
derive the following requirements:
transparent data access with query capabilities (join, transformation),
update support for shared data (e.g. adding annotations to experimental
results, relating data sources), support for dynamic groups of
participants, data privacy with role management (different participants
have different roles and thus different access rights on the data).
Furthermore, data uncertainty must be supported, e.g. to deal with
observations made with tools of different accuracy.

Some of these requirements (except data uncertainty and data privacy)
can be addressed by building a CSCW system on top of an existing
database, as proposed in [Sta+08]. But this solution has the
traditional drawbacks of centralized systems: single point of failure,
heavy administration of global information, and latency for remote
users. Furthermore, it is ill-suited to letting dynamic groups form
quickly in order to perform fast, short-lived collaboration (e.g. when
a patient's life is at stake). Finally, data privacy may be compromised
as a result of copying sensitive data into the central database.

We claim that a P2P solution is well suited to this case, as it is
lightweight in terms of administration and can scale up easily.
Peers can be the participants or organizations involved in
collaboration and may keep full control over their data. Furthermore,
data replication can be exploited to increase data availability and
foster parallel work.

Social networks. A social network such as Facebook enables its users to
share personal information stored in a central repository. It is
straightforward to develop new applications using a simple API.
However, these applications are rather limited in scope. We claim that
there are two fundamental flaws in this setting. The first is that it
is technically inefficient to centralize all the data and control in a
system that is bound to become a bottleneck or end up wasting enormous
resources (the Facebook server farm). More importantly, many users are
reluctant to give full control over their private data to a provider
(Facebook) which can sell it to other businesses and, worse, leave such
control to third parties of unknown affiliations.

In this case again, a P2P solution where peers are social participants
is promising for data management, as it lets owners keep better control
over the privacy of their personal data (e.g. in a proxy database). A
user can then interact with the system through a mashup-style
interface, with a proxy handling her data and her interaction with the
community. Note that with such an approach, a user can have her own
data (e.g., phone number, list of trusted friends) shared between many
systems (Myspace, GoogleMail, Flickr, etc.) rather than replicated and
inconsistent on the private servers of these systems. Finally, more
advanced data management capabilities (e.g. queries, replication) could
be provided, significantly increasing the scope of social networks
(e.g. enabling large-scale collaboration of social participants).

Summary of Requirements. Like many other online community
applications, these two applications have common requirements (e.g.
high-level data access, data privacy) and differences. For instance,
collaborative medical research may be quite demanding in terms of the
quantity of data exchanged, while social networks may involve very high
numbers of participants. A P2P architecture provides important
advantages such as decentralized control and administration,
scalability to high numbers of peers, and support for the dynamic
behaviour of peers (who may join or leave the system at will). These
advantages are important for online communities. In addition, we have
the following requirements for data management:

Data
uncertainty. Some data should not be assumed to be 100% certain,
precise or correct, in particular when coming from peers with
different levels of confidence. Data uncertainty should be supported at all
levels of data management: schema management, semantic data
descriptions, query processing, replication and privacy.

Semantic
data integration. Users should be able to access a set of data sources
using their own semantic descriptions (e.g. ontologies) or annotations.
For this purpose, the system should provide a mapping discovery service
based on an automatic, incremental process that is self-configuring and
efficient.

Query
expressiveness. The query language should allow users to describe the
desired data at the appropriate level of detail. For structured data,
an SQL-like query capability is necessary; it should provide the
ability to rank results and deal with uncertainty. For simple queries,
keyword search as in search engines can also be provided on top of an
SQL-like query facility [CHZ05].

Update,
change control, replication. Data should be replicated to improve
availability despite peer failures and to improve the performance of
mass collaboration. Since the data can be updated in parallel by
different peers, data reconciliation must be supported. The management
and monitoring of changes are also major challenges.
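As a minimal illustration of this requirement, the sketch below (in Python, with invented data) reconciles divergent replicas by keeping, for each item, the value with the highest version number; real P2P systems need richer, semantics-aware reconciliation than this last-writer-wins rule.

```python
def reconcile(replicas):
    """Merge divergent replicas of shared data. Each replica maps
    item -> (version, value); the merged state keeps, for every item,
    the value carrying the highest version (last writer wins).
    This is a deliberately simple illustration of reconciliation."""
    merged = {}
    for replica in replicas:
        for item, (version, value) in replica.items():
            if item not in merged or version > merged[item][0]:
                merged[item] = (version, value)
    return merged
```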

Data
privacy and trust. Existing P2P data sharing systems pay little
attention to data privacy and trust among participants; indeed, they
have mostly been used to avoid centralized control (and violate
copyright law). But for collaboration among professionals with
sensitive data (as in collaborative medical research), data privacy and
trust among participants are major requirements.

State of the Art

The
state of the art useful to the DataRing project is related to recent
extensions of database systems, data integration systems, P2P data
sharing systems and the semantic web.

Database systems.
For a long time, the research agenda of the database research community
has been to provide advanced database system capabilities for emerging
applications of information systems. Some recent work in database
systems is related to DataRing: support for top-k queries, data
uncertainty and data privacy.

Top-k
queries enable users to rank their results based on a scoring function
as in search engines but on structured data (e.g. with SQL syntax). The
first important work is [FLN03], which models the general problem of
answering top-k queries using lists of data items sorted by their local
scores and proposes a simple yet efficient algorithm, Fagin's
algorithm (FA). A better algorithm over sorted lists is the Threshold
Algorithm (TA) [FLN03]. TA has been the basis for many extensions in
distributed database systems. Recently, Best Position Algorithms (BPA)
[APV07a] demonstrated significant and consistent performance
improvement over TA. We plan to capitalize on this work in DataRing to
support more general forms of flexible querying in a P2P environment.
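To make the idea concrete, here is a simplified Python sketch of TA (the example data and the `sum` scoring function are assumptions for illustration; the actual algorithm of [FLN03] interleaves sorted and random accesses in the same way but with many refinements):

```python
import heapq

def threshold_algorithm(lists, k, agg=sum):
    """TA sketch over m lists, each sorted by descending local score.
    `lists` maps a list name to (item, score) pairs; exact scores of an
    item in the other lists are fetched by random access in `index`."""
    index = {name: dict(pairs) for name, pairs in lists.items()}
    top = []                       # min-heap of (aggregate score, item)
    seen = set()
    depth = 0
    max_depth = max(len(p) for p in lists.values())
    while depth < max_depth:
        for pairs in lists.values():       # one sorted access per list
            if depth < len(pairs):
                item = pairs[depth][0]
                if item not in seen:
                    seen.add(item)
                    score = agg(index[n].get(item, 0) for n in lists)
                    heapq.heappush(top, (score, item))
                    if len(top) > k:
                        heapq.heappop(top)
        depth += 1
        # Threshold: aggregate of the scores at the current frontier.
        # No unseen item can score higher than this.
        threshold = agg(pairs[min(depth, len(pairs)) - 1][1]
                        for pairs in lists.values())
        if len(top) == k and top[0][0] >= threshold:
            break                  # the top-k answer is final
    return sorted(top, reverse=True)
```

The early stop is what distinguishes TA from a naive full scan: as soon as the k-th best aggregate score reaches the threshold, the remainder of the lists need not be read.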

Data uncertainty in DBMS has recently received attention in order to
deal with data extracted from data sources of various qualities (e.g.
scientific data, commercial data). An important project is Trio-One at
Stanford [ABD+06] which aims at providing data uncertainty and lineage
in an integrated manner in a DBMS. This is done by extending the
relational model and SQL with several constructs, in particular,
numeric confidence values, optionally attached to tuples. Confidence
values represent the degree of certainty and follow a probabilistic
interpretation, as in probabilistic databases [DS05]: the certainty
about the correctness of data is the probability that the data is
correct. Trio-One is built on top of a relational DBMS using data and
query translation techniques and stored procedures. Another important
approach to dealing with imprecise data is fuzzy logic, where data
values range over a user-defined vocabulary. It has been used
successfully to build user-oriented database summaries [SRM05].
Probabilistic and fuzzy databases are two complementary approaches
which we plan to explore in DataRing, but in a different context (P2P).
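To illustrate the probabilistic interpretation, the following Python sketch joins uncertain tuples and, assuming independence, multiplies confidence values in the style of Trio-like systems (the medical tables are invented for the example, not taken from [ABD+06]):

```python
def join_uncertain(r, s, key_r, key_s):
    """Join two lists of uncertain tuples. Each tuple carries a
    confidence in [0, 1]; assuming independence, the confidence of a
    joined result is the product of the input confidences, as in
    probabilistic databases."""
    out = []
    for (tr, cr) in r:
        for (ts, cs) in s:
            if tr[key_r] == ts[key_s]:
                out.append(({**tr, **ts}, cr * cs))
    return out

# Hypothetical uncertain data: (tuple, confidence) pairs.
patients = [({"pid": 1, "diagnosis": "A"}, 0.9),
            ({"pid": 2, "diagnosis": "B"}, 0.6)]
pathology = [({"pid": 1, "finding": "x"}, 0.8)]
# Joining on pid: patient 1 matches, with confidence 0.9 * 0.8.
```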

As data about individuals and organizations can be easily
disclosed and collected on the web, data privacy is becoming a major
issue. A basic principle of data privacy is purpose specification which
states that data providers should be able to specify the purpose for
which their data will be collected and used. Hippocratic databases
provide mechanisms for enforcing purpose-based disclosure control
within a database [AKS+02]. This is achieved by using privacy metadata,
i.e. privacy policies and privacy authorizations stored in relational
tables. In the context of P2P systems, decentralized control makes it
hard to enforce purpose-based privacy, which remains an open problem.
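The purpose-specification principle can be sketched as follows (a minimal Python illustration; the policy table, attribute names and purposes are hypothetical, not the actual metadata schema of [AKS+02]):

```python
# Privacy metadata: for each (table, attribute), the set of purposes
# for which disclosure is authorized. Every query carries a declared
# purpose, and only authorized attributes are returned.
privacy_policy = {
    ("patients", "diagnosis"): {"treatment", "research"},
    ("patients", "phone"):     {"treatment"},
}

def disclose(table, row, purpose):
    """Project `row` onto the attributes authorized for `purpose`."""
    return {attr: val for attr, val in row.items()
            if purpose in privacy_policy.get((table, attr), set())}

row = {"diagnosis": "A", "phone": "555-0100"}
```

For instance, a query with purpose "research" would see the diagnosis but not the phone number, while one with purpose "treatment" would see both.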

Data integration systems.
Data management in distributed systems has been traditionally achieved
by distributed database systems [ÖV99] which enable users to
transparently access and update multiple databases in a network using a
high-level query language (e.g. SQL). Transparency is achieved through
a global schema which hides the local databases’ heterogeneity. In its
simplest form, a distributed database system is a centralized server
that supports a global schema and implements distributed database
techniques (query processing, transaction management, consistency
management, etc.). This approach has proved effective for applications
that can benefit from centralized control and full-fledged database
capabilities, e.g. information systems. However, it cannot scale up to
more than tens of databases. Data integration systems, e.g. DISCO
[TRV98], extend the distributed database approach to access autonomous
data sources (such as files, databases, documents, etc.) on the web
with a simpler query language in read-only mode. However, data
integration systems typically do not support important data management
functions such as replication and updates, which our target
collaborative applications require. Recent work on data integration
systems has dealt with XML schema matching [DBH07].

Dataspaces [FHM05] go one step further than data integration systems by
relaxing the need for a global schema and providing data management
functionality over all data sources, regardless of how they are
integrated. One basic function is keyword search which does not require
any integration at all. However, for richer SQL-like querying over some
data sources, an additional integration effort is needed, following an
incremental, “pay-as-you-go” principle. The vision of a Data Ring in
[AP07] which we adopt in this project focuses on a high-level,
easy-to-use dataspace for content sharing communities and emphasizes
declarative querying with data exchanged in a high-level format (e.g.
XML). The MetaQuerier project [CHZ05] adopts an
extreme approach to web-scale data integration by automatically
creating a unified interface to deep-web sources in specific semantic
domains. The PayGo project at Google [MCD+07] represents a major effort
to realize the vision of dataspaces and emphasizes pay-as-you-go as a
means to achieve web-scale integration of structured data including
deep-web sources and sites like Google Base. Besides the
“pay-as-you-go” incremental fashion of improving semantic data
integration, PayGo proposes new components to go beyond the
state-of-the-art in data integration: management of approximate
mappings, support of keyword queries, heterogeneous result ranking, and
support of uncertainty for data mappings and queries.

P2P data sharing systems.
P2P techniques which focus on scaling, dynamicity, autonomy and
decentralized control can be very useful to online communities. Initial
research on P2P systems focused on improving the performance of query
routing in unstructured systems, which rely on flooding.
work led to structured solutions based on distributed hash tables (DHT)
or hybrid solutions with superpeers that index subsets of peers. Recent
work on P2P data management has concentrated on supporting semantically
rich data (e.g., XML documents, relational tables, etc.) using a
high-level query language and distributed database capabilities (mostly
schema management and query processing), e.g. ActiveXML [ABC+03], Appa
[AMP+06]. Somewhere [RAC+06] enables more semantic integration of web
data using ontologies. PeerSum [HRV+08] is a first attempt at building
summaries over P2P data. Work on update support in P2P has started only
recently [APV07, MPE+08]. Privacy is considered a critical issue in
such systems. For instance, in social networks, users are very
concerned by leaks of their private data. P2P systems have to provide
access control mechanisms of the same quality as in centralized
systems. More precisely, data owners should have the means to control
access (in read or write mode) to their contents. This issue is
challenging in P2P settings and should rely on sophisticated encryption
techniques such as [ACF+06], privacy techniques such as [HSV08] and new
trust models for P2P such as [NCR08].
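The contrast with flooding can be illustrated by a minimal DHT-style placement sketch in Python (peer names are invented; real DHTs such as Chord add O(log n) finger-table routing and node join/leave protocols on top of this placement rule):

```python
import hashlib
from bisect import bisect_right

def ring_id(key, bits=16):
    """Hash a key onto a 2^bits identifier ring."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)

class DHT:
    """Each data key is stored on its successor peer on the identifier
    ring, so a lookup is a deterministic computation, not a flood."""
    def __init__(self, peers):
        self.ring = sorted((ring_id(p), p) for p in peers)

    def successor(self, key):
        """Peer responsible for `key`: first peer clockwise from its id."""
        ids = [pid for pid, _ in self.ring]
        i = bisect_right(ids, ring_id(key)) % len(self.ring)
        return self.ring[i][1]
```

Because placement is a pure function of the key, any peer can locate data without contacting all others, which is what gives structured systems their scalability over flooding.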

Semantic web.
The semantic web now provides a simple data expression language (RDF)
with a powerful ontology language (OWL) and associated query language
(SPARQL). The amount of available information expressed in these
languages is rapidly increasing. Because RDF has been designed from
scratch for distributed use and integration, it is well suited to P2P
data integration. Furthermore, using ontologies instead of schemas
provides a flexible way to specialize data semantics for specific
purposes: each peer having different interests and different
capabilities can adapt ontologies to its purposes. Ontology
reconciliation, and thus interoperability, can be obtained through
ontology matching [ES07]. However, research on semantic P2P systems
[HB04, SS06] has so far considered only ontologies that do not evolve.
One challenge in DataRing is to use ontology alignments
serendipitously in a P2P environment.
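As a toy illustration of what an alignment is, the following Python sketch matches concepts from two peers' ontologies by label similarity (a deliberately naive element-level matcher with invented concept names; real matchers surveyed in [ES07] combine terminological, structural and semantic techniques):

```python
from difflib import SequenceMatcher

def match_ontologies(concepts_a, concepts_b, threshold=0.8):
    """Align concepts from two peers' ontologies by string similarity
    of their labels. Returns (concept_a, concept_b, similarity) triples
    whose similarity reaches the threshold."""
    alignment = []
    for a in concepts_a:
        for b in concepts_b:
            sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if sim >= threshold:
                alignment.append((a, b, round(sim, 2)))
    return alignment
```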