6
6 Query Model in Virtual Integration User formulates query in terms of his/her ontology on the mediated (or “global”) schema System reformulates queries in terms of sub-queries for each source (“local” schema) Structure of the query model should be more intuitive for the user

7
7 Reformulation Problem Given: –A query Q posed over the mediated schema –Descriptions of the data sources Find: –A query Q’ over the data source relations, such that: Q’ provides only correct answers to Q, and Q’ provides all possible answers from to Q given the sources.

8
8 Mapping between the global logical schema and the single source schemata (logical view definition) Two basic approaches – GAV (Global As View) – LAV (Local As View) Can be used also in case of different data models In that case a model transformation is required

9
9 GAV (Global As View) Up to now we supposed that the global schema be derived from the integration process of the data source schemata Thus the global schema is expressed in terms of the data source schemata Such approach is called the Global As View approach

10
10 The other possible ways… LAV (Local As View) The global schema has been designed independently of the data source schemata The relationship (mapping) between sources and global schema is obtained by defining each data source as a view over the global schema GLAV (Global and Local As View) The relationship (mapping) between sources and global schema is obtained by defining a set of views, some over the global schema and some over the data sources

11
11 GGlobal schema G schemata SSource schemata S Mapping M between sources and global schema: a set of assertionsMapping M between sources and global schema: a set of assertions q S  q G q G  q S Intuitively, the first assertion specifies that the concept represented by a view (query) q S over a source schema S corresponds to the concept specified by q G over the global schema. Viceversa for the second assertion. Mapping between data sources and global schema

12
12 A data integration system is a triple (G, S, M)A data integration system is a triple (G, S, M) The query to the integrated system are posed in terms of G and specify which data of the virtual database we are interested inThe query to the integrated system are posed in terms of G and specify which data of the virtual database we are interested in The problem is understanding which real data (in the data sources) correspond to those virtual dataThe problem is understanding which real data (in the data sources) correspond to those virtual data Mapping between data sources and mediated schema

13
13 GAV A GAV mapping is a set of assertions, one for each element g of G g  q S That is, the mapping specifies g as a query q S over the data sources. This means that the mapping tells us exactly how the element g is computed.  OK for stable data sources  Difficult to extend with a new data source

16
16 GAV Suppose now we introduce a new sourceSuppose now we introduce a new source The simple view we have just created is to be modifiedThe simple view we have just created is to be modified In the simplest case we only need to add a union with a new SELECT-FROM- WHERE clauseIn the simplest case we only need to add a union with a new SELECT-FROM- WHERE clause This is not true in general, view definitions may be much more complexThis is not true in general, view definitions may be much more complex

17
17 GAV Quality depends on how well we have compiled the sources into the global schema through the mapping Whenever a source changes or a new one is added, the global schema needs to be reconsidered Query processing can be based on some sort of unfolding Example: one already seen

18
18 LAV The Local-as-View approach, see Figure, describes local sources as a view defined in terms of the global schema. The global schema is predefined and for each local source is described how it delivers information to the global schema. Each mapping associates the entities in the source schemas by way of a query over the global schema. Thereby, each source schema is defined as a view over the global schema, hence the name “Local-as-View”.

19
19 LAV A mapping LAV is a set of assertions, one for each element s of each source S s  q G Thus the content of each source is characterized in terms of a view q G over the global schema OK if the global schema is stable, e.g. based on a domain ontology or an enterprise modelOK if the global schema is stable, e.g. based on a domain ontology or an enterprise model It favours extensibilityIt favours extensibility Query processing much more complexQuery processing much more complex

20
20 LAV Quality depends on how well we have characterized the sources High modularity and extensibility (if the global schema is well designed, when a source changes or is added, only its definition is to be updated) Query processing needs reasoning

35
35 Views: Sound & Completeness The terms sound, exact and complete are used to express to what degree the extent of a view corresponds to its definition. A view defined over some data source is sound if it provides a subset of the available data in the data source that corresponds to the definition. –It delivers only, but not necessarily all answers to its definition. The answers it does deliver might be incomplete, but they are correct (sound). If a view is complete, it provides a superset of the available data in the data source that corresponds to the definition. It delivers all answers to its definition, and maybe more. –Since the set of answers might contain more than the answers corresponding to the views definition, but does contain all answers corresponding to the views definition, the term complete is being used. A view is exact if it provides all and only data corresponding to the definition.

36
36 Query answering in GAV Query answering in the Global-as-View happens through query unfolding. Unfolding queries to the local sources is relatively easy compared to the Local as-View approach in which queries posted to the global schema have to be rewritten before being able to answer the query.

37
37 Query answering in GAV Because each global entity is defined as a view over the local schemas, the total number of GAV-rules necessary to define the mappings in a data- integration system corresponds to the number of entities in the global schema. the extent of a global schema defined by GAV-rules is assumed to be exact.

38
38 Query answering in LAV The Local-as-View approach, see Figure, describes local sources as a view defined in terms of the global schema. The global schema is predefined and for each local source is described how it delivers information to the global schema. Each mapping associates the entities in the source schemas by way of a query over the global schema. Thereby, each source schema is defined as a view over the global schema, hence the name “Local-as-View”.

39
39 Query answering in LAV Each entity in the local schemas is defined as a view over the global schema. Therefore, the total number of LAV-rules necessary to define the mappings in a data-integration system corresponds to the number of entities in the local schemas. Processing queries in the Local-as-View approach is difficult because the only knowledge of the global-schema available is through the views representing the sources. Such view only provides partial information about the data. Since the mapping associated to each source as a view over the global schema it is not clear how to use the sources in order to answer queries expressed over the global schema. Therefore extracting information from an integration system using the Local-as-View approach is a complex task because one has to answer queries with incomplete information. Views defined by LAV-rules might be sound or complete, in terms of 3.1.

46
Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 46 More datalog terminology A datalog program is a set of datalog rules. A program with one rule is a conjunctive query. We distinguish EDB predicates and IDB predicates » EDB’s are stored in the database, › appear only in rule bodies » IDB’s are intensionally defined, › appear in both bodies and heads.

50
50 Query reformulation using views Problem: Given a user query Q and view definitions V={V1..Vn} Q’ is an Equivalent Rewriting of Q using V iff Q’ refers only to views in V, Q’ is equivalent to Q Q’ is a Maximally-contained Rewriting of Q using V iff Q’ refers only to views in V, Q’  Q, there is no such rewriting Q1 that Q’  Q1  Q, Q1  Q

51
Kambhampati & KnoblockInformation Integration on the Web (MA-1)51 Reformulation in LAV Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn? –Notice the use of materialized (pre-computed) views ONLY materialized views should be used… Approaches –Bucket algorithm [Levy; 96] –Inverse rules algorithm [Duschka, 99] –Hybrid versions SV-Bucket [2001], MiniCon [2001]

53
Kambhampati & KnoblockInformation Integration on the Web (MA-1)53 Maximal Containment Query plan should be sound and complete –Sound implies that Query should be contained in the Plan (I.e., tuples satisfying the query are subset of the tuples satisfying the plan –Completeness? –Traditional DBs aim for equivalence Query contains the plan; Plan contains the query –Impossible We want all query tuples that can be extracted given the sources we have –Maximal containment (no other query plan, which “contains” this plan is available) P contains Q if P |= Q (exponential even for “conjunctive” (SPJ) query plans)

54
Kambhampati & KnoblockInformation Integration on the Web (MA-1)54 The world of Containment Consider Q 1 (.) :- B 1 (.) Q 2 (.) :- B 2 (.) Q 1  Q 2 (“contained in”) if the answer to Q 1 is a subset of Q 2 –Basically holds if B 1 (x) |= B 2 (x) Given a query Q, and a query plan Q 1, –Q 1 is a sound query plan if Q 1 is contained in Q –Q 1 is a complete query plan if Q is contained in Q 1 –Q 1 is a maximally contained query plan if there is no Q 2 which is a sound query plan for Q 1, such that Q 1 is contained in Q 2

58
Kambhampati & KnoblockInformation Integration on the Web (MA-1)58 Take-home vs. In-class (The “Tastes Great/Less Filling” debate of CSE 494) More time consuming –To set, to take and to grade Caveat: people tend to over-estimate the time taken to do the take-home since the don’t factor out the “preparation” time that is interleaved with the exam time..but may be more realistic Probably will be given on Th 6 th and will be due by 11 th evening. (+/-) Less time-consuming –To take (sort of like removing a bandage..) –and definitely to grade… Scheduled to be on Tuesday, May 11 th 2:40— 4:30

59
Kambhampati & KnoblockInformation Integration on the Web (MA-1)59 Interactive Review on 5/4 A large (probably most?) of the class on 5/4 will be devoted to interactive semester review Everyone should come prepared with at least 3 topics/ideas that they got out of the class –You will be called in random order and you should have enough things to not repeat what others before you said. What you say will then be immortalized on the class homepage –See the *old* acquired wisdom page on the class page.

62
Kambhampati & KnoblockInformation Integration on the Web (MA-1)62 Complexity of finding maximally contained plans in LAV Complexity does change if the sources are not “conjunctive queries” –Sources as unions of conjunctive queries (NP-hard) Disjunctive descriptions –Sources as recursive queries (Undecidable) Comparison predicates Complexity is less dependent on the query –Recursion okay; but inequality constraints lead to NP-hardness Complexity also changes based on Open vs. Closed world assumption True source contents (of Big Two) Advertised description “All cars” [Abiteboul & Duschka, 98] You can “reduce” the complexity by taking a conjunctive query that is an upperbound.  This just pushes the complexity to minimization phase Advertised description “Toyota” U “Olds” Big Two

63
Kambhampati & KnoblockInformation Integration on the Web (MA-1)63 Query Optimization Challenges -- Deciding what to optimize --Getting the statistics on sources --Doing the optimization We will first talk about reformulation level challenges

64
Kambhampati & KnoblockInformation Integration on the Web (MA-1)64 Local Completeness Information If sources are incomplete, we need to look at each one of them. Often, sources are locally complete. Movie(title, director, year) complete for years after 1960, or for American directors. Question: given a set of local completeness statements, is a query Q’ a complete answer to Q? True source contents Advertised description Guarantees (LCW; Inter-source comparisons) Problems: 1. Sources may not be interested in giving these!  Need to learn  hard to learn! 2. Even if sources are willing to give, there may not be any “big enough” LCWs Saying “I definitely have the car with vehicle ID XXX is useless

74
Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 74 Dealing with Incomplete Information » Assume to add a new source for Enrolled, that in the back-end becomes, e.g.,: relat(2,X,Y):- s(4,X,Y). relat(2,X,Y):- s(9,X,Y). » No connection to student: but, one who is enrolled to a University should be a student

76
MOMIS – www.dbgroup.unimo.it 76 The MOMIS Query Manager reformulates the query taking into account the mappings defined among the local classes and the global classes of the GVV (Global Virtual View). The mappings are defined by using a GAV (Global as View) approach: each global class of the GVV is expressed by means of the full-disjunction operator over the local classes. Query Execution Query rewriting Query rewriting  GAV approach:  GAV approach: the query is processed by means of unfolding Fusion and Reconciliation Fusion and Reconciliation of the local answers into the global answer  Object Identification  Object Identification : Join conditions among local classes  Inconsistencies:  Inconsistencies: Resolution functions to deal with conflits

79
MOMIS – www.dbgroup.unimo.it 79 Fusion and Reconciliation L3scqG1 result set L1scqG2 result set L2scqG1 result set L1scqG1 result set L2scqG2 result set VENERE SAPERVIAGGIAREGUIDACAMPEGGI partial results full join full join L1scqG1 full join L2scqG1 full join L3scqG1 full join L1scqG2 full join L2scqG2 scqG1 result set scqG2 result set q 0 result set join scqG1 join scqG2