Andritsos, Fuxman, and Miller 1 have
shown that a probabilistic database
can simplify the deduplication task, by
allowing multiple conflicting tuples to
coexist in the database. Many other applications have looked at probabilistic
databases for their data management
needs: RFID data management, 34 management of anonymized data, 30 and scientific data management. 28

We present a number of key concepts for managing probabilistic data
that have emerged in recent years. We
group these concepts by the three facets, although some concepts may be
relevant to more than one facet.

Facet 1: Semantics and Representation

The de facto formal semantics of a
probabilistic database is the possible
worlds model. 12 By contrast, there is no
agreement on a representation system;
instead, there are several approaches
covering a spectrum between expressive power and usability. 4 A key concept
in most representation systems is that
of lineage, which is derived from early
work on incomplete databases by Imielinski and Lipski. 22

Possible Worlds Semantics. In its most
general form, a probabilistic database
is a probability space over the possible
contents of the database. It is customary to denote a (conventional) relational database instance with the letter I.
Assuming there is a single table in our
database, I is simply a set of tuples
(records) representing that table; this is a
conventional database. A probabilistic
database is a discrete probability space
PDB = $(W, p)$, where $W = \{I_1, I_2, \ldots, I_n\}$ is
a set of possible instances, called possible worlds, and
$p: W \to [0, 1]$ is such that $\sum_{j=1}^{n} p(I_j) = 1$.
In the terminology of networks of belief, there is one random
variable for each possible tuple, whose values are 0 (meaning
that the tuple is not present) or 1 (meaning that the tuple is
present), and a probabilistic database is a joint probability
distribution over the values of these random variables.
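
The definition can be made concrete with a small sketch. The Python
below is only an illustration under stated assumptions: the tuple names
"t1" and "t2", the three worlds, their probabilities, and the helper
names make_pdb and indicator are all hypothetical and do not come from
any system discussed here.

```python
import math

# A possible world is one conventional instance I: a frozenset of tuples.
# The tuples, worlds, and probabilities are made up for illustration.
I1 = frozenset({"t1"})
I2 = frozenset()
I3 = frozenset({"t1", "t2"})


def make_pdb(p):
    """Check that p maps worlds into [0, 1] and sums to 1; return (W, p)."""
    if any(not 0.0 <= prob <= 1.0 for prob in p.values()):
        raise ValueError("each p(I_j) must lie in [0, 1]")
    if not math.isclose(sum(p.values()), 1.0):
        raise ValueError("the world probabilities must sum to 1")
    return set(p), dict(p)


def indicator(t, world):
    """Value of the 0/1 random variable attached to tuple t in one world."""
    return 1 if t in world else 0


W, p = make_pdb({I1: 0.3, I2: 0.2, I3: 0.5})
```

Enumerating worlds explicitly like this is feasible only for toy inputs,
since the number of possible worlds grows exponentially with the number
of possible tuples.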

This is a very powerful definition
that encompasses all the concrete
data models over discrete domains
that have been studied. In practice,
however, one must step back from this
generality and impose some workable
restrictions, but it is always helpful to
keep the general model in mind. Note
that in our discussion we restrict ourselves to discrete domains: although
probabilistic databases with continuous attributes are needed in some applications, 7, 15 no formal semantics in
terms of possible worlds has been proposed so far.

Consider some tuple t (we use the terms tuple and record
interchangeably in this article). The probability that the tuple
belongs to a randomly chosen world is
$p(t) = \sum_{j:\, t \in I_j} p(I_j)$, and is also called the
marginal probability of the tuple t. Similarly, if we have two tuples
$t_1$, $t_2$, we can examine the probability that both are present in a
randomly chosen world, denoted $p(t_1 t_2)$. When the latter is
$p(t_1)p(t_2)$, we say that $t_1$, $t_2$ are independent tuples; if it
is 0 then we say that $t_1$, $t_2$ are disjoint tuples or exclusive
tuples. If neither of these holds, then the tuples are correlated in a
nonobvious way. Consider a query Q, expressed in some relational query
language like SQL, and a possible tuple t in the query's answer.
$p(t \in Q)$ denotes the probability that, in a randomly chosen world,
t is an answer to Q. The job of a probabilistic database system is to
return all possible tuples $t_1, t_2, \ldots$ together with their
probabilities $p(t_1 \in Q), p(t_2 \in Q), \ldots$.
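
These quantities can be computed directly from an explicit list of
possible worlds. The sketch below is a brute-force illustration, not how
a real system evaluates queries: the tuples "a", "b", "c", the six worlds
and their probabilities, and the function names marginal, joint,
relationship, and answer_probabilities are all assumptions introduced for
this example. A query is modeled as a plain Python function from a world
to its set of answer tuples.

```python
import math

# Hypothetical database: world (frozenset of tuples) -> probability.
# By construction, "a" and "b" never co-occur, and "c" is independent of both.
PDB = {
    frozenset(): 0.10,
    frozenset({"a"}): 0.15,
    frozenset({"b"}): 0.25,
    frozenset({"c"}): 0.10,
    frozenset({"a", "c"}): 0.15,
    frozenset({"b", "c"}): 0.25,
}


def marginal(pdb, t):
    """p(t): total probability of the worlds that contain tuple t."""
    return sum(prob for world, prob in pdb.items() if t in world)


def joint(pdb, t1, t2):
    """p(t1 t2): total probability of the worlds containing both tuples."""
    return sum(prob for world, prob in pdb.items() if t1 in world and t2 in world)


def relationship(pdb, t1, t2):
    """Classify a pair of tuples as disjoint, independent, or correlated."""
    both = joint(pdb, t1, t2)
    if math.isclose(both, 0.0, abs_tol=1e-12):
        return "disjoint"
    if math.isclose(both, marginal(pdb, t1) * marginal(pdb, t2)):
        return "independent"
    return "correlated"


def answer_probabilities(pdb, query):
    """Return {t: p(t in Q)} by evaluating the query in every world."""
    result = {}
    for world, prob in pdb.items():
        for t in set(query(world)):
            result[t] = result.get(t, 0.0) + prob
    return result


def select_a_or_b(world):
    """Toy query: keep only the tuples "a" and "b"."""
    return {t for t in world if t in ("a", "b")}


answers = answer_probabilities(PDB, select_a_or_b)
```

With these made-up numbers, relationship(PDB, "a", "b") reports the pair
as disjoint and relationship(PDB, "a", "c") reports it as independent.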

Representation Formalisms. In practice, one can never enumerate all possible worlds, and instead we need to
use some more concise representation
formalism. One way to achieve that is
to restrict the class of probabilistic databases that one may represent. A popular approach is to restrict the possible
tuples to be either independent or disjoint. Call a probabilistic database block
independent-disjoint, or BID, if the set
of all possible tuples can be partitioned
into blocks such that tuples from the
same block are disjoint events, and
tuples from distinct blocks are independent. A BID database is specified by
defining the partition into blocks, and
by listing the tuples’ marginal probabilities. This is illustrated in Figure 1.
The blocks are obtained by grouping
Researchers by Name, and grouping
Services by (Name, Conference,
Role). The probabilities are given by
the P attribute. Thus, the tuples $t_1^2$ and $t_1^3$ are disjoint
(they are in the same block), while the tuples $t_1^1$, $t_5^2$,
$s_1$, $s_2$ are independent (they are from different
blocks). An intuitive BID model was