Nội dung Text: Database Systems: The Complete Book- P13

tells us that total sales of all Aardvark lnodels in all colors, over all time at all all-or-nothing choices of grouping by day or aggregating over all time. For
dealers is 198.000 cars for a total price of $3,521,727,000. another esanlple based on our running automobile database, Ive could choose
to aggregate dealers completely or not aggregate them a t all. Hon-ever, we could
Consider how to answer a query in \\-hich we specify conditions on certain also choose to aggregate by city, by state, or perhaps by other regions, larger
attributes of the S a l e s relation and group by some other attributes, n-hile or smaller. Thus: there are a t least sis choices of grouping for time and at least
asking for the sum, count, or average price. In the relation are r sales), we four for dealers.
look for those tuples t with the fo1lov;ing properties: l\Tllen the number of choices for grouping along each dimension grows, it
becomes increasingly expensive to store the results of aggregating by every
1. If the query specifies a value v for attribute a ; then tuple t has v in its possible conlbination of groupings. S o t only are there too many of them, but
component for a. they are not as easily organized as the structure of Fig. 20.17 suggests for tlle
2. If the query groups by an attribute a , then t has any non-* value in its all-or-nothing case. Thus, commercial data-cube systems may help the user t o
conlponent for a. choose some n~aterializedviews of the data cube. A materialized view is the
result of some query, which we choose t o store in the database, rather than
3. If the query neither groups by attribute a nor specifies a value for a. then reconstructing (parts of) it as needed in response t o queries. For the data cube,
t has * in its component for a. the vie~vswe n-ould choose to materialize xi11 typically be aggregations of the
full data cube.
Each tuple t has tlie sum and count for one of the desired groups. If n-e \%-ant The coarser the partition implied by the grouping, the less space the mate-
the average price, a division is performed on the sum and count conlponents of rialized view takes. On the other hand, if ire ~vantto use a view to answer a
each tuple t. certain query, then the view must not partition any dimension more coarsely
E x a m p l e 20.18 : The query than the query does. Thus, to maximize the utility of materialized views, we
generally n-ant some large \-iers that group dimensions into a fairly fine parti-
SELECT c o l o r , AVG(price) tion. In addition, the choice of vien-s to materialize is heavily influenced by the
F O Sales
RM kinds of queries that the analysts are likely to ask. .in example will suggest tlie
W E E model = 'Gobi'
HR tradeoffs in\-011-ed.
GROUP BY c o l o r ;
INSERT INTO SalesVl
is ansn-ered by looking for all tuples of sales) ~ v i t h form
the
SELECT model, c o l o r , month, c i t y ,
('Gobi', C. *, *, 21, n) SUM(va1) A v a l , SUM(cnt) A c n t
S S
F O S a l e s JOIN Dealers ON d e a l e r = name
RM
here c is any specific color. In this tuple, v will be the sum of sales of Gobis GROUP BY model, c o l o r , month, c i t y ;
in that color, while n will be the nlini!)cr of sales of Gobis in that color. Tlie
average price. although not a n attribute of S a l e s or sales) directly. is
v / n . Tlie answer to the query is the set of ( c ,vln) pairs obtained fi-om all Figure 20.18: The materialized vien. SalesVl
('Gobi'. c , *, *. v. n ) tuples.
Example 20.19 : Let us return to the data cube
20.5.2 Cube ImplementaOion by Materialized Views
11% suggested in Fig. 20.17 that adding aggregations to the cube doesn't cost S a l e s (model, c o l o r , d a t e , d e a l e r , v a l , c n t )
much in tcrms of space. and saves a lot in time \vhen the common kincis of
decision-support queries are asked. Ho~vever:our analysis is based on the as- that n e de\-eloped in Esample 20.17. One possible materialized vie\\- groups
sumption that queries choose either t o aggregate completely in a dimension dates by nionth and dealers by city. This view. 1%-hich call SalesV1, is
1%-e
or not to aggregate a t all. For some dime~isions.there are many degrees of constlucted by the query in Fig. 20.18. This query is not strict SQL. since n-e
granularity that could be chosen for a grouping on that dimension. imagine that dates and their grouping units such as months are understood
U c have already mentioned thc case of time. xvl-herenumerolls options such by the data-cube system n-ithout being told to join S a l e s with the imaginary
as aggregation by weeks, months: quarters, or ycars exist,, in addition to the relation rep~esenting a j s that \ve discussed in Example 20.14.
d

CHAPTER 20. IiYFORI\IATIOAr IArTEGR.4TION 20.5. DdT.4 CUBES 1055
aggregation of cities into states, probably by accessing the dimension table for
INSERT INTO SalesV2
dealers.
SELECT model, week, s t a t e ,
\Ye cannot answer Q2 from SalesV2. Although we could roll-up cities into
SUM(va1) A v a l , SUM(cnt) A c n t
S S
states (i.e.. aggregate the cities into their states) to use SalesV1, we carrrlot
F O S a l e s JOIN D e a l e r s ON d e a l e r = name
RM
roll-up ~veeksinto years, since years are not evenly divided into weeks. and
GROUP BY model, week, s t a t e ;
data from a week beginning. say, Dec. 29, 2001. contributes to years 2001 and
2002 in a way we carinot tell from the data aggregated by weeks.
Figure 20.19: Another materialized view, SalesV2 Finally, a query like
43: SELECT model, c o l o r , d a t e , ~ ~ ~ ( v a l )
Another possible materialized view aggregates colors completely, aggregates F O Sales
RM
time into u-eeks, and dealers by states. This view, SalesV2, is defined by the GROUP B model, c o l o r , d a t e ;
Y
query in Fig. 20.19. Either view S a l e s V l or SalesV2 can be used to ansn-er a
query that partitions no more finely than either in any dimension. Thus, the can b e anslvered from neither SalesVl nor SalesV2. It cannot be answered
query from S a l e s v l because its partition of days by ~nonths too coarse to recover
is
sales by day, and it cannot be ans~vered from SalesV2 because that view does
41: SELECT model, SUM(va1) not group by color. We would have to answer this query directly from the full
FO Sales
RM data cube.
G O P BY model;
RU
can be answered either by 20.5.3 The Lattice of Views
To formalize the cbservations of Example 20.10. it he!ps to think of a lattice of
SELECT model, SUM(va1)
possibl~ groupings for each dimension of the cube. The points of the lattice are
F O SalesVl
RM
the ways that we can partition the ~ a l u c of a dimension by grouping according
s
G O P BY model;
RU
t o one or more attributes of its dimension table. nB say that partition PI is
belo~v partition P2.written PI 5 P2 if and only if each group of Pl is contained
within some group of PZ.
SELECT model, SUM(va1)
All
F O SalesV2
RM
G O P B model;
RU Y
On the other hand, the query
/ Years
1
42: SELECT model, y e a r , s t a t e , SUM(va1)
F O S a l e s JOIN D e a l e r s ON d e a l e r = name
RM
G O P BY model, y e a r , s t a t e ;
RU
I
Weeks
Quarters
I
Months
can o n 1 be ans\vered from SalesV1. as
Days
SELECT model, y e a r , s t a t e , SUM(va1)
F O SalesVl
RM Figure 20.20: A lattice of partitions for time inter\-als
G O P BY model, y e a r , s t a t e ;
RU
Incidentally. the query inmediately above. like the qu'rics that nggregate time Example 20.20: For the lattice of time partitions n-e might choose the dia-
units, is not strict SQL. That is. s t a t e is not ari attribute of SalesVl: only gram of Fig. 20.20. -4 path from some node f i dotvn to PI means that PI 5 4.
c i t y is. \Ye rmust assume that the data-cube systenl knol\-s how to perform the These are not the only possible units of time, but they \\-ill serve as an example

of what units a s ~ s t e r n
might support. Sotice that daks lie below both \reeks
and months, but weeks do not lie below months. The reason is that while a
group of events that took place in one day surely took place within one \reek
and within one month. it is not true that a group of events taking place in one
week necessarily took place in any one month. Similarly, a week's group need
not be contained within the group cor~esponding one quarter or to one year.
to Sales
At tlie top is a partition we call "all," meaning that events are grouped into a
single group; i.e.. we niake no distinctions among diffeient times. Figure 20.22: The lattice of views and queries from Example 20.19
All
I SalesV2; of course it could also be answered froni the full data cube S a l e s , but
there is no reason to want to do so if one of the other views is materialized. Q2
State
can be answered from either SalesVl or S a l e s , while Q 3 can only be answered
I from S a l e s . Each of these relationships is expressed in Fig. 20.22 by the paths
City downxard from the queries to their supporting vie~vs.
I Placing queries in the lattice of views helps design data-cube databases.
Dealer
Some recently developed design tools for data-cube systems start with a set of
Figure 20.21: A lattice of partitions for automobile dealers queries that they regard as ..typical" of the application a t hand. They then
select a set of views to materialize so that each of these queries is above a t least
Figure 20.21 shows another lattice, this time for the dealer dimension of our one of the riel\-s, preferably identical to it or very close (i.e., the query and the
automobiles example. This lattice is siniplcr: it shows that partitioning sales by view use the same grouping in most of the dimensions).
dealer gives a finer partition than partitioning by the city of the dealer. i

1088 CHAPTER 20. ISFORJlATIOS IXTEGRA4TIOS 20.6. DATA -111-YIA-G 1089
*! Exercise 20.5.3: In Exercise 20.4.1 lve spoke of PC-order data organized as ! Exercise 20.5.9: In Exercise 20.5.3 n e designed a cube suitable for use ~ v i t h
a cube. If we are to apply the CCBE operator, we might find it convenient to the CCBE operator. Horn-ever. some of the dimensions could also b e given a non-
break several dimensions more finely. For example, instead of one processor trivial lattice structure. In particular, the processor type could b e organized by
dimension, we might have one dimension for the type (e.g., AlID Duron or manufacturer (e g., SUT, Intel. .AND. llotorola). series (e.g.. SUN Ult~aSparc.
Pentium-IV), and another d~mension the speed. Suggest a set of dimrnsions
for Intel Pentium or Celeron. AlID rlthlon, or llotorola G-series), and model (e.g.,
and dependent attributes that will allow us to obtain answers to a variety of Pentiuni-I\- or G4).
useful aggregation queries. In particular, what role does the customer play?
.Also, the price in Exercise 20.4.1 referred to the price of one macll~ne,while a) Design tlie lattice of processor types following the examples described
several identical machines could be ordered in a single tuple. What should the above.
dependent attribute(s) be?
b) Define a view that groups processors by series, hard disks by type, and
Exercise 20.5.4 : What tuples of the cube from Exercise 20.5.3 would you use removable disks by speed, aggregating everything else.
to answer the following queries?
c) Define a view that groups processors by manufacturer, hard disks by
a) Find, for each processor speed, the total number of computers ordered in speed. and aggregates everything else except memory size.
each month of the year 2002. d) Give esamples of qneries that can be ansn-ered from the view of (11) only,
the vieiv of (c) only, both, and neither.
b) List for each type of hard disk (e.g., SCSI or IDE) and eacli processor
type the number of computers ordered. *!! Exercise 20.5.10: If the fact table F to n-hicli n-e apply the C u B E operator is
c) Find the average price of computers with 1500 megahertz processors for sparse (i.e.. there are inany fen-er tuples in F than the product of the number
each month from Jan., 2001. of possihle values along each dimension), then tlie ratio of the sizes of CCBE(F)
and F can be very large. Hon large can it be?
! Exercise 20.5.5 : The computers described in the cube of Exercise 20.5.3 do
not include monitors. IVhat dimensions would you suggest to represent moni-
tors? You may assume that the price of the monitor is included in the price of
20.6 Data Mining
the computer. A family of database applications cal!ed data rnin,ing or knowledge discovery in
dntnbases has captured considerable interest because of opportunities to learn
Exercise 20.5.6 : Suppose that a cube has 10 dimensions. and eacli dimension
surprising facts fro111esisting databases. Data-mining queries can be thought
has 5 options for granularity of aggregation. including "no aggregation" and
of as an estended form of decision-support querx, although the distinction is in-
"aggregate fully.'' How many different views can we construct by clioosing a
formal (see the box on -Data-llining Queries and Decision-Support Queries").
granularity in each dinlension?
Data nli11i11:. stresses both the cpcry-optimization and data-management com-
Exercise 20.5.7 : Show how t o add the following time units to the lattice of ponents of a traditional database system, as 1%-ell suggesting some important
as
Fig. 20.20: hours, minutes, seconds, fortnights (two-week periods). decades. estensions to database languages, such as language primitix-es that support effi-
and centuries. cient sampling of data. In this section, we shall esamine the principal directions
data-mining applications have taken. Me then focus on tlie problem called "fre-
Exercise 20.5.8: How 15-onld you change the dealer lattice of Fig. 20.21 to quc'iit iteinsets." n-hich has 1-eceiwd the most attention from the database point
include -regions." ~ f : of view.
a ) A region is a set of states. 20.6.1 Data-Iblining Applications
* b) Regions are not com~liensuratewith states. but each city is in only one Broadly. data-mining queries ask for a useful summary of data, often ~vithout
region. suggcstir~gthe values of para~netcrsthat would best yield such a summary.
This family of problems thus requires rethinking the n a y database systems are
c) Regions are like area codes: each region is contained \vithin a state. some to be used to provide snch insights a b o ~ i tthe data. Below are some of tlie
cities are in two or more regions. and some regions h a ~ several cities.
e applications and problems that are being addressed using very large amounts

1092 CHAPTER 20. I;YFORhlATION INTEGR.4TION
(stop words) such as .'and" or 'The." which tend to be present in all docu- Baskets (basket, item)
ments and tell us nothing about the content A document is placed in this
space according t o t h e fraction of its word occurrences that are any particular where the first attribute is a .'basket ID," o r unique identifier for a market
word. For instance, if the document has 1000 word occurrences, two of which basket, and the secoild attribute is the ID of some item found in that basket.
are "database." then the doculllent ~vould placed a t the ,002 coordinate in
be S o t e that it is not essential for the relation t o come from true ma~ket-basket
the dimension cor~espondingto "database." By clustering documents in this data; it could be any relation from which we x a n t t o find associated items. For
space, we tend t o get groups of documents that talk about the same thing. ~nstance,the '.baskets" could be documents and the "items" could be words,
For instance, documents that talk about databases might have occurrences of in which case n e are really looking for words that appear in many documents
words like "data," "query," "lock," and so on, while documents about baseball together.
are unlikely to have occurrences of these rvords. The simplest form of market-basket analysis searches for sets of items that
The data-mining problem here is to take the data and select the "means" frequently appear together in market baskets. The support for a set of items is
or centers of the clusters. Often the number of clusters is given in advance. the number of baskets in which all those items appear. The problem of finding
although that number niay be selectable by the data-mining process as ti-ell. frequent sets of ~ t e m s to find, given a support threshold s , all those sets of
is
Either way, a naive algorithm for choosing the centers so that the average items that have support a t least s.
distance from a point to its nearest center is minimized involves many queries; If the number of items in the database is large, then even if we restrict our
each of which does a complex aggregation. attention to small sets, say pairs of items only, the time needed to count the
support for all pairs of items is enormous. Thus, the straightforward way to
solve even the frequent pairs problem - compute the support for each pair of
20.6.2 Finding Frequent Sets of Items items z and j, as suggested by the SQL query in Fig. 20.24 - ~villnot work
Now. we shall see a data-mining problem for which algorithms using secondary This query involves joining Baskets r ~ i t hitself, grouping the resulting tuples
storage effectively have been developed. The problem is most easily described by the tri-o l t e ~ n s
found 111 that tuple, and throwing anay groups where the
in terms of its principal application: the analysis of market-basket data. Stores number of baskets is belon- the support threshold s Sote that the condition
today often hold in a data warehouse a record of what customers have bought I.item < J. item in the WHERE-clause is there to prevent the same pair from
together. That is, a customer approaches the checkout with a .'market basket" being considered in both orders. or for a .'pair" consisting of the same item
full of the items he or she has selected. The cash register records all of these twice from being considered at all.
items as part of a single transaction. Thus, even if lve don't know anything
about the customer, and we can't tell if the customer returns and buys addi- SELECT I.itern, J.item, COUNT(I.basket)
tional items. we do know certain items that a single customer b u - s together. FROM Baskets I, Baskets J
If items appear together in market baskets more often than ~vouldbe es- WHERE 1.basket = J.basket AND
pected, then the store has an opportunity to learn something about how cus- I.item < J.item
tomers are likely to traverse the store. The items can be placed in the store so GROUP BY I.item, J.item
that customers will tend to take certain paths through the store, and attractive HAVING COUNT(I.basket) >= s;
items can be placed along these paths.
E x a m p l e 20.22 : .A famous example. which has been clainied by several peo- Figure 20.24: Saive way to find all high-support pairs of items
ple; is the discovery that people rvho buy diapcrs are unusually likely also to
buy beer. Theories have been advanced foi n.hy that relationship is true. in-
cluding tile possibility that peoplc n-110 buy diapers. having a baby at home. ale 20.6.3 T h e A-Priori Algorithm
less likely to go out to a bar in the evening and therefore tcnd to drink beer at
home. Stores may use the fact that inany customers 15-illwalk through the store There is an optimization that greatly reduccs the running time of a qutry like
from where the diapers are to where the beer is. or vice versa. Clever maiketers Fig. 20.21 \\-hen the support threshold is sufficiently large that few pairs meet
place beer and diapers near each other, rvitli potato chips in the middle. The it. It is ieaso~iableto set the threshold high, because a list of thousands or
claim is that sales of all three items then increase. millions of pairs would not be very useful anyxay; ri-e xi-ant the data-mining
query to focus our attention on a sn~allnumber of the best candidates. The
We can represent market-basket data by a fact table: a-przorz algorithm is based on the folloiving observation:

1094 C H A P T E R 20. IATFORlI~4TION
INTEGR.ATION
INSERT INTO Candidates
Association Rules SELECT *
FROM Baskets
A more complex type of market-basket mining searches for associatzon
WHERE item IN (
~ x l e sof the form {il, 22, . . . , i n ) 3 j. TKO possible properties that \ve
SELECT item
might want in useful rules of this form are:
FROM Baskets
1. Confidence: the probability of finding item j in a basket that has GROUP BY item
all of {il,i2.. . . ,i n ) is above a certain threshold. e.g., 50%; e.g.. "at HAVING COUNT(*) >= s
least 50% of the people who buy diapers buy beer." >;
2. Interest: the probability of finding item j in a basket that has all SELECT I.item, J.item, ~ ~ ~ N ~ ( ~ . b a s k e t )
of {il, i2,. . . , i n } is significantly higher or lower than the probability FROM Candidates I, Candidates J
of finding j in a random basket. In statistical terms, j correlates WHERE 1.basket = J.basket AND
with {il, i z ,. . . ,i,,), either positively or negatively. The discovery in I.item < J.item
Example 20.22 was really that the rule {diapers) + beer has high GROUP BY I.item, J.item
interest. HAVING COUNT(*) >= s;
Sote that el-en if a n association rule has high confidence or interest. it n-ill
tend not to be useful unless the set of items inrrolved has high support. Figure 20.25: Tlie a-priori algorithm first finds frequent items before finding
The reason is that if the support is low, then the number of instances of frequent pairs
the rule is not large, which limits the benefit of a strategy that exploits
the rule.
E x a m p l e 20.23 : To get a feel for how the a-priori algorithm helps, consider a
supermarket that sells 10,000 different items. Suppose that the average market-
basket has 20 items in it. Also assume that the database keeps 1,000,000 baskets
If a set of items S has support s. then each subset of A must also have
' as data (a small number compared with what would be stored in practice).
support a t least s. Then the Baskets relation has 20,000,000 tuples, and the join in Fig. 20.24
(the naive algorithm) has 190,000,000 pairs. This figure represents one million
In particular, if a pair of items. say {i. j ) appears in, say, 1000 baskets. then baskets times (y) which is 190: pairs of items. These 190,000,000 tuples must
we know there are a t least 1000 baskets with item i and we know there are at all be grouped and counted.
least 1000 baskets xvith item j. However, suppose that s is 10,000, i.e., 1% of the baskets. It is impossi-
The converse of the above rule is that if we are looking for pairs of items ble that Inore than 20.000,000/10,000 = 2000 items appear in a t least 10,000
~vith support at least s. we may first eliminate from consideration any item that baskets. because there are only 20,000.000 tuples in Baskets, and any item ap-
does not by itself appear in a t least s baskets. The a-priorz algorltl~m ans11-ers pearing in 10.000 baskets appears in at least 10,000 of those tuples. Thus: if we
the same query as Fig. 20.24 by: use the a-priori algoritllrn of Fig. 20.25, the subquery that finds the candidate
ite~ns cannot produce more than 2000 items. and I\-ill probably produce many
1. First finding the srt of candidate nte~ns those that appear in a sufficient
- fewer than 2000.
number of baskets by thexnsel~es and then
- \\'e cannot he sure ho~v large Candidates is. since in the norst case all the
items that appear in Baskets will appear in at least 1%of them. Honever. in
2. Running the query of Fig. 20.24 on only the candidate items. practice Candidates will be considerably smaller than Baskets. if the threshold
s is high. For sake of argument, suppose Candidates has on the average 10
The a-priori algorithnl is thus summarized by the sequence of two SQL queries itelns per basket: i.e., it is half the size of Baskets. Then the join of Candidates
in Fig. 20.25. It first computes Candidates. the subset of the Baskets relation with itself in step (2) has 1,000,000 times (y) = 45,000,000 tuples, less than
i~hose iter~ls
ha\-c high support by theniselves. then joins Candidates ~vith
itself. 11-1 of the number of tuples in the join of Baskets ~ - i t h itself. \Ye ~vould
as in the naive algorithm of Fig. 20.24. thtis expect the a-priori algorithm to run in about 111 the time of the naive

1096 C H A P T E R 20. IlYFORM-rlTI0.V INTEGRATION 20.7. SC'AIAI,4RY OF C H A P T E R 20 1097
algorithm. In common situations, where C a n d i d a t e s has much less than half Exercise 20.6.3: Using the baskets of Exercise 20.6.1, answer the following:
tlie tuples of B a s k e t s , the improvement is even greater, since running time
shrinks quadratically with the reduction in the number of tuples involved in a) If the support threshold is 35%, what is the set of candidate triples?
the join.
b) If the support threshold is 35%, what sets of triples are frequent?
20.6.4 Exercises for Section 20.6
Exercise 20.6.1: Suppose we are given the eight "market baskets" of Fig. 20.7 Summary of Chapter 20
20.26.
+ Integration of Information: Frequently, there exist a variety of databases
or other information sources that contain related information. nTe have
B1= {milk, coke, beer) the opportunity to combine these sources into one. Ho~vever,hetero-
BP= {milk, pepsi, juice) geneities in the schemas often exist; these incompatibilities include dif-
B3 = {milk, beer) fering types, codes or conventions for values, interpretations of concepts,
B4 = {coke, juice) and different sets of concepts represented in different schernas.
B = {milk, pepsi, beer)
g
B6 = {milk, beer, juice, pepsi) + Approaches to Information Integration: Early approaches involved "fed-
B7 = {coke, beer, juice) eration," where each database would query t h e others in the terms under-
B8 = {beer, pepsi) stood by the second. Nore recent approaches involve ~varehousing, where
data is translated to a global schema and copied to the warehouse. An
alternative is mediation, where a virtual warehouse is created to allolv
Figure 20.26: Example market-basket data queries to a global schema; the queries are then translated to the terms
of the data sources.
* a) As a percentage of the baskets, what is the support of the set {beer, juice)? + Extractors and Wrappers: Warehousing and mediation require compo-
b) What is the support of the set {coke, pepsi)? nents a t each source, called extractors and wrappers, respectively. X ma-
jor function is to translate querics and results betneen the global schema
* c) What is t h e confidence of milk given beer (i.e., of the association rule and the local schema a t the source.
{beer) + milk)?
+ Wrapper Generators: One approach to designing wrappers is t o use tem-
d) What is the confidence of juice given milk? plates, which describe how a query of a specific form is translated from the
global schema to the local schema. These templates are tabulated and in-
e) What is the confidence of coke, given beer and juice?
terpreted by a driver that tries to match queries t o templates. The driver
* f) If the support threshold is 35% (i.e., 3 out of the eight baskets are needed), may also have the ability to combine templates in various ways, and/or
which pairs of items are frequent? perform additional ~vork such as filtering. to answer more con~plex queries.
g) If the support threshold is 50%, which pairs of items are frequent? + Capability-Based Optimtzation: The sources for a mediator often are able
or ~villingto answer only limited forms of queries. Thus. the mediator
! Exercise 20.6.2 : The a-priori algorithm also may be used to find frequent sets must select a query plan based on the capabilities of its sources, before it
of more than ttvo items. Recall that a set S of k items cannot have support at can el-en think about optiniizing the cost of query plans as con\-entional
least s t~nlessevery proper subset of S has support at least s. In particular. DBAIS's do.
the subsets of X that are of size k - 1 must all have support a t least s. Thus.
having found the frequent itemsets (those with support at least s) of size k - 1. + OLAP: An important application of data I

1098 C H A P T E R 20. IXFORJIIATION IhTTEGR.4TI0.\' 20.8. REFERENCES FOR CH-APTER 20 1099
+ ROLAP and AIOLAP: It is frequently useful when building a warehouse [4] is a survey of data-mining techniques, and [13] is a n on-line survey of
for OLAP, to think of the data as residing in a multidimensional space. data mining. The a-priori algorithm was del-eloped in [I] and 121.
with diniensions corresponding t o independent aspects of the data repre-
sented. Systems that support such a vie~v data take either a relational
of 1. R. Agranal, T. Imielinski, and A. Sn-ami: '.lIining association rules be-
point of view (ROLAP, or relational OLAP systems), or use the special- tween sets of items in large databases," Proc. -ACAi SIGAlOD Intl. Conf.
ized data-cube model (lIOL.AP, or multidimensional OLAP systems). on ibfanagement of Data (1993), pp. 203-216.
+ Star Schernas: In a star schema, each data element (e.g., a sale of an item) 2. R. Agrawal, and R. Srikant, "Fast algorithms for mining association rules,"
Proc. Intl. Conf. on V e q Large Databa.ses (1994), pp. 487-199.
is represented in one relation, called tlie fact table, while inforniation
helping to interpret the values along each dimension (e.g.. what kind of 3. S. Chaudhuri and U.Dayal, .'Ail overview of data warehousing and OLAP
product is iten1 1234?) is stored in a diinension table for each diinension. technology," SIGAJOD Record 26: 1 (1997), pp. 63-74.
+ The Cube Operator: A specialized operator called CCBE pre-aggregates 4. U. 52. Fayyad, G. Piatetsky-Shapiro. P. Smyth, and R. Uthurusamy, Ad-
the fact table along all subsets of dimensions. It may add little to the space Lances i n Knowledge Discovery and Data hlznzng. AAAI Press, hlenlo
needed by the fact table, and greatly increases the speed with which many Park CA, 1996.
OLAP queries can be answered.
3. H. Garcia-llolina, Y. Papakonstalltinou. D. Quass. -1. Rajalaman, Y. Sa-
+ Dzmenszon Lattices and Alaterialized Vzews: A more polverful approach giv. V. \Bssalos. J. D. Ullman, and J . n7idorn) The TSIlIlIIS approach
than the CLBE operator, used by some data-cube implementations. is to to mediation: data nlodels and languages. J. Intellzgent Informatzon Sys-
establish a lattice of granularities for aggregation along each dimension tems 8:2 (1997), pp. 117-132.
(e.g., different time units like days, months, and years). The ~vareliouse
is then designed by materializing certain v i e w that aggregate in different 6. J. S. Gray, A. Bosworth, A. Layman. and H. Pirahesh, .'Data cube: a
\va!.s along the different dimensions, and the rien- with the closest fit is relational aggregation operator generalizing group-by. cross-tab, and sub-
used t o answer a given query. totals." Proc. Intl. Conf. on Data Englneerzng (1996). pp. 132-139.
+ Data Mining: IVareliouses are also used to ask broad questions that in- 7. -1.Gupta and I. S. SIumick. A.laterioltzed Vieccs: Technzques, Implemcn-
tatzons, and Applzcatzons. l I I T Pres4. Cambridge 11-1. 1999
volve not only aggregating on command. as in OL.1P queries, but search-
ing for the "right" aggregation. Common types of data mining include 8. V. Harinarayan, - .1Rajaraman, and J . D. Ullman. ~~Implementiiig
data
clustering data into similar groups. designing decision trees to predict one cubes efficiently." Proc. ACAf SIGilfOD Intl. Conf. on Management of
attribute based on the value of others. and finding sets of items that occur Data (1996). pp. 205-216.
together frequently.
9. D. Loniet and J. U-idom (eds.). Special i ~ s u e materialized l-ie~vsand
on
+ The A-Priori Algorithm: -An efficiellt \\-a?; to find frequent itemsets is to data warehouses. IEEE Data Erlg?ilcerlng Builet~n18:2 (1395).
use the a-priori algorithm. This technique exploits the fact that if a set
occurs frequently. then so do all of its subsets. 10. I*.Papakonstantinou. H. Garcia-llolina. arid J . n'idom. "Object ex-
change across heterogeneous information sources." Proc. Intl. Conf. on
Data Englneerlng (1993). pp 251-260.
20.8 References for Chapter 20
11. I-.Papakonstantinou. . .Gupta. and L. Haas. "Capnl>ilities-base query
I
Recent smveys of \varehonsing arid related technologics are in [9]. [3]. and [ T I . ren-riting in mediator s!-stems." Conference 011 Par(111el and Distributed
Federated systems are surveyed 111 11'21. The concept of tlic mediato1 conies Informntion Systc~ns (1996). ,\l-;lil~il~le
as:
from [14].
Implementation of mediators and \\-rappers, especially tlie mapper-genera-
tor approach. is covered in [5]. Capabilities-based optilnization for iriediators
n-as explored in [ll. 131. 12. .A. P. Sheth and J . -1. Larson. "Federated databases for managing dis-
The cube operator was proposed in 161. The i~iipleinentationof cubes by tributed. heterogeneous. and autonomous databases." Cornputzng Surreys
materialized vie\\-s appeared in 181. 22:3 (1990), pp. 183-236.