
Building Data Warehouse
Zhenhao Qi
Department of Biochemistry & Department of Computer Science and Engineering
State University of New York at Buffalo
March 23rd, 2000

Outline:
1. Migrating data from legacy systems: an iterative, incremental methodology.
2. Building high data quality into the data warehouse.
3. Optimal machine architectures for parallel query scalability.

The difficulties of migrating data from legacy systems (a normalization sketch follows this list):
1. The same data is represented differently in different systems.
2. The schema for a single database may not be consistent over time.
3. Data may simply be bad.
4. The data values are not represented in a form that is meaningful to end users.
5. Conversions and migrations in heterogeneous environments typically involve data from multiple incompatible DBMSs and hardware platforms.
6. The execution windows for data conversion programs must be coordinated carefully in order to provide the new applications with a consistent view of the data without impacting production systems.
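As an illustration of difficulty 1, the sketch below reconciles the same customer record from two legacy systems that encode gender and birth date differently. All record layouts and field names here are hypothetical, not taken from any particular system.

    from datetime import datetime

    # Hypothetical: two legacy systems encode the same facts differently.
    GENDER_MAP_SYS_A = {"M": "male", "F": "female"}
    GENDER_MAP_SYS_B = {1: "male", 2: "female"}

    def normalize_sys_a(rec):
        # System A: gender as 'M'/'F', birth date as MM/DD/YYYY.
        return {
            "name": rec["cust_name"].strip().title(),
            "gender": GENDER_MAP_SYS_A[rec["sex"]],
            "birth_date": datetime.strptime(rec["dob"], "%m/%d/%Y").date(),
        }

    def normalize_sys_b(rec):
        # System B: gender as 1/2, birth date as YYYYMMDD.
        return {
            "name": rec["NAME"].strip().title(),
            "gender": GENDER_MAP_SYS_B[rec["GENDER_CD"]],
            "birth_date": datetime.strptime(rec["BIRTH_DT"], "%Y%m%d").date(),
        }

    # Both sources map to the same warehouse representation:
    assert normalize_sys_a({"cust_name": "SMITH, JOHN", "sex": "M", "dob": "01/15/1960"}) == \
           normalize_sys_b({"NAME": "smith, john", "GENDER_CD": 1, "BIRTH_DT": "19600115"})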

The need for an iterative, incremental methodology

IT organizations have failed by applying to large migration problems the same serial methodology they used for relatively small, discrete projects:
1. In a large organization, the complexity of the analysis and design can consume person-years of effort without any demonstrable results.
2. The sheer complexity of both the data analysis and the design of the target system has prevented effective progress.
3. The rate of change in operational systems has outstripped the migration team's ability to keep current.

Implications for metadata and the data warehouse

Three sources of change that the data warehouse team must anticipate:
1. Those arising from the normal, regular changes to operational systems.
2. Those that result from using an iterative, incremental methodology.
3. Those that result from external business drivers, such as acquisitions.
The key to dealing with change cost-effectively lies in metadata.

Four Dimensions of Metadata
Another way of looking at the sources of technical challenge is with respect to the type of metadata required to minimize the impact of change. The need to adapt to change, error, and complexity recurs regularly over time, like waves on a beach.
[Figure: waves of change, error, and complexity arriving over time.]

Metadata that captures the current environment (a meta-model sketch follows this list):
1. The record and data element definitions.
2. Inter- and intra-database relationships.
3. A definition of each interface program used to build or refresh the warehouse:
   a. which inter-database joins it uses;
   b. the timing and direction of its execution;
   c. any execution parameters;
   d. dependencies on any other interface programs;
   e. the use or production of other ancillary databases;
   f. the name and location of the file that contains the source code for this data interface program;
   g. the tool and session name, if this interface program was automatically generated.
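A minimal sketch of how items a through g might be captured in a meta-model, expressed as a hypothetical Python record type; the field names are assumptions, not any particular repository product's schema.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class InterfaceProgramMeta:
        """Metadata describing one data interface program (items a-g above)."""
        name: str
        inter_database_joins: List[str]            # (a) joins the program uses
        schedule: str                              # (b) timing of execution
        direction: str                             # (b) e.g. "source -> warehouse"
        execution_parameters: dict = field(default_factory=dict)      # (c)
        depends_on: List[str] = field(default_factory=list)           # (d) other programs
        ancillary_databases: List[str] = field(default_factory=list)  # (e)
        source_file: str = ""                      # (f) location of the source code
        generator_tool: Optional[str] = None       # (g) tool name, if generated
        generator_session: Optional[str] = None    # (g) session name, if generated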

Metadata required to reduce the cost of errors

The meta-model should allow the inclusion of the information discovered at the level of data element, record, database, and join. This information includes, but is not limited to (see the sketch after this section):
1. Legal ranges and values.
2. Any exception logic that the data interface program should execute if an illegal value is found.

Metadata that can reduce the cost of complexity

Other types of metadata may be needed to reduce the complexity of specifying, maintaining, and executing the data interface programs.

Factoring in time
1. Information about each database that can affect execution time, such as:
   a. database size and volatility;
   b. the time window during which each database can be accessed;
   c. the mechanism that should be used for changed data capture.
2. Versioning.
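Returning to the error metadata: a minimal sketch, with hypothetical element names, of a rule that pairs a data element's legal range or values with the exception action the interface program should take.

    # Hypothetical data-element rules: legal ranges/values plus exception logic.
    RULES = {
        "order.quantity": {"legal_range": (1, 10_000), "on_violation": "reject_row"},
        "customer.state": {"legal_values": {"NY", "NJ", "PA"}, "on_violation": "route_to_suspense"},
    }

    def check(element, value):
        """Return the exception action for an illegal value, or None if legal."""
        rule = RULES.get(element)
        if rule is None:
            return None  # no rule recorded for this element
        if "legal_range" in rule:
            lo, hi = rule["legal_range"]
            if not (lo <= value <= hi):
                return rule["on_violation"]
        if "legal_values" in rule and value not in rule["legal_values"]:
            return rule["on_violation"]
        return None

    assert check("order.quantity", 0) == "reject_row"
    assert check("customer.state", "NY") is None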

Servers
Metadata Repositories
Replication Tools

Developing an evaluation grid
The best strategy is to create a list of the types of change the organization is most likely to encounter, then determine the types of metadata required to respond to each change cost-effectively. From this, one should be able to determine a set of requirements regarding:
1. The number of systems and tools that must be interfaced.
2. The types of metadata required for the meta-model.
3. The best versioning strategy for performing impact analysis.
4. The desired set of functionality for automating this process.
(A sketch of such a grid follows.)
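One way to record such a grid is as a simple mapping from anticipated change types to the metadata needed to absorb them. This is a minimal sketch; the change types and metadata categories are illustrative assumptions, not a prescribed taxonomy.

    # Hypothetical evaluation grid: likely changes -> metadata needed to absorb them.
    EVALUATION_GRID = {
        "column added to operational table":  ["record/element definitions", "interface program definitions"],
        "source DBMS replaced":               ["inter-database relationships", "interface source locations"],
        "acquisition adds new source system": ["versioning", "database size/volatility"],
    }

    def impact(change):
        """List the metadata categories consulted during impact analysis."""
        return EVALUATION_GRID.get(change, ["(no grid entry; extend the grid)"])

    print(impact("source DBMS replaced"))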

The greatest cost benefit of high data quality

Data quality provides previously unavailable competitive advantages and strategic capabilities:
1. Improved accuracy, timeliness, and confidence in decision making.
2. Improved customer service and retention.
3. Unprecedented sales and marketing opportunities.
4. Support for business reengineering initiatives.

High data quality improves productivity

Why uniform data access times are optimal for parallel query execution

Algorithmic parallelism is achieved using a paradigm similar to the division of labor: the material (data) must be evenly distributed among the personnel (CPUs), or the benefit of parallelism is lost (see the partitioning sketch below).

Symmetric multiprocessor (SMP)
A classical SMP is a tightly coupled connection model in which all components connected to a single bus are equidistant.
Disadvantage: very short buses limit scalability.
[Figure: SMP — CPUs one hop across a shared symmetric system bus from shared memory and disks.]
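A minimal sketch of the even-distribution point above: hash partitioning spreads rows roughly evenly across CPUs, whereas a split on a skewed key can leave one CPU doing most of the work. All names and data here are illustrative.

    # Even distribution of rows across CPUs via hash partitioning (illustrative).
    NUM_CPUS = 4

    def partition(rows, key):
        buckets = [[] for _ in range(NUM_CPUS)]
        for row in rows:
            buckets[hash(row[key]) % NUM_CPUS].append(row)
        return buckets

    rows = [{"cust_id": i, "amount": i * 10} for i in range(1000)]
    buckets = partition(rows, "cust_id")
    # With an even spread, each CPU scans ~250 rows; a skewed split would
    # leave some CPUs idle while one does most of the scan.
    print([len(b) for b in buckets])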

Loosely Coupled Architectures

The 2-D mesh
It has a connection density of 4, in that each node is attached to at most four of its neighbors.
[Figure: 2-D mesh of nodes, each with its own CPU and memory; disks, including auxiliary disks, attach to some nodes.]
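To quantify the contrast with the crossbar described below: in a 2-D mesh, a message travels as many hops as the Manhattan distance between nodes, while a crossbar reaches any node in one hop. A small sketch, assuming nodes are addressed by (row, column) grid coordinates:

    def mesh_hops(a, b):
        """Hops between nodes at coordinates a=(row, col), b=(row, col) in a 2-D mesh."""
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    # Opposite corners of a 4x4 mesh: 6 hops; a crossbar would need only 1.
    assert mesh_hops((0, 0), (3, 3)) == 6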

Crossbar Switch
Borrowed from telephony, this technology creates a direct, point-to-point connection between every pair of nodes, with only one hop through the switch to get from one node to any other. All nodes are only one hop away from one another.
[Figure: crossbar built from switch elements; a direct connection between each node and every other node; node connections are typically bidirectional and non-blocking.]

A query parse tree example

SELECT * FROM Table_a ORDER BY Column_2
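As one plausible rendering of such a tree for the query above: a sort node over a projection over a table scan, sketched here as nested Python tuples. The node labels are assumptions, not the slide's original notation.

    # Plausible parse tree for: SELECT * FROM Table_a ORDER BY Column_2
    parse_tree = (
        "SORT",                                   # ORDER BY Column_2
        {"key": "Column_2"},
        (
            "PROJECT",                            # SELECT *
            {"columns": "*"},
            ("SCAN", {"table": "Table_a"}),       # FROM Table_a
        ),
    )
    print(parse_tree)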
