Which are our Which are our lowest/highest margin lowest/highest margin customers ?? customers What is the most What is the most effective distribution effective distribution channel? channel?
Who are my customers Who are my customers and what products and what products are they buying? are they buying?
What product promWhat product prom-otions have the biggest -otions have the biggest impact on revenue? impact on revenue? What impact will What impact will new products/services new products/services have on revenue have on revenue and margins? and margins?
Which customers Which customers are most likely to go are most likely to go to the competition ?? to the competition
4
.A producer wants to know….

Data.  I can’t find the data I need
 data is scattered over the network  many versions... subtle differences
 I can’t get the data I need
 need an expert to get the data
 I can’t understand the data I found
 available data poorly documented
 I can’t use the data I found
 results are unexpected  data needs to be transformed from one form to other
5
. Data everywhere yet .

complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.What is a Data Warehouse?
A single. [Barry Devlin]
6
.

What are the users saying...
 Data should be integrated across the enterprise  Summary data has a real value to the organization  Historical data holds the key to understanding data over time  What-if capabilities are required
7

What is Data Warehousing?
Information
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]

Data Warehousing -It is a process
 Technique for assembling and managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possible  A decision support database maintained separately from the organization’s operational database
12
.

Data Warehouse for Decision Support & OLAP
 Putting Information technology to help the knowledge worker make faster and better decisions
 Which of my customers are most likely to go to the competition?  What product promotions have the biggest impact on revenue?  How did the share price of software companies correlate with profits over last 10 years?
16
.

Decision Support
 Used to manage and control business  Data is historical or point-in-time  Optimized for inquiry rather than update  Use of the system is loosely defined and can be ad-hoc  Used by managers and end-users to understand the business and make judgements
17
.

500 airline miles as an incentive to purchase rather than 5.We want to know ... which persons are the least likely to default on their credit cards?  Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?  If I raise the price of my product by Rs.000. how many lost responses will result?  If I emphasize ease-of-use of the product as opposed to its technical capabilities. what is the effect on my ROI?  If I offer only 2. what will be the net effect on my revenues?  Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
19
.000 names. 2.
 Given a database of 100.

Data Mining in Use
 The US Government uses Data Mining to track fraud  A Supermarket becomes an information broker  Basketball teams use it to track game strategy  Cross Selling  Warranty Claims Routing  Holding on to Good Customers  Weeding out Bad Customers
21
.

products -. salespeople etc.Operational Systems
 Run the business in real time  Based on up-to-the-second data  Optimized to handle large numbers of simple read/write transactions  Optimized for fast response to predefined transactions  Used by people who deal with customers.  They are increasingly used by customers
26
.clerks.

OLTP vs. average amount spent on phone calls between 9AM-5PM in Pune during the month of December
30
..g. Data Warehouse
 OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse  Special data organization. access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
 e.

.
 OLTP Systems are used to “run” a business
 The Data Warehouse helps to “optimize” the business
34
.To summarize ..

Why Now?
      Data is being produced ERP provides clean data The computing power is available The computing power is affordable The competitive pressures are strong Commercial products are available
35
.

.The Reality
 Legacy systems no longer documented  Outside sources with questionable quality procedures  Production systems with no built in integrity checks and no integration
 Operational systems are usually designed to solve a specific business problem and are rarely developed to a a corporate plan
 “And get it done quickly.Data Quality . we do not have time to worry about corporate standards..”
48
.

Data Integration Across Sources
Savings Loans Trust Credit card
Same data different name
Different data Same name
Data found here nowhere else
Different keys same data
49
.

 Conditioning
 The conversion of data types from the source to the target data store (warehouse) -- always a relational database
53

Data Transformation Terms
 Householding
 Identifying all members of a household (living at the same address)  Ensures only one mail is sent to a household  Can result in substantial savings: 1 lakh catalogues at Rs. 50 each costs Rs. 50 lakhs. A 2% savings would save Rs. 1 lakh.
54

 Scoring
 computation of a probability of an event. e.g..., chance that a customer will defect to AT&T from MCI, chance that a customer is likely to buy a new product
55

resume. scrubbing. need to load the data into the warehouse  Issues
 huge volumes of data to be loaded  small time window available when warehouse can be taken off line (usually nights)  when to build index and summary tables  allow system administrators to monitor.restart after failure from where you were and without loss of data integrity
56
. cancel. change load rates  Recover gracefully -. validating etc.Loads
 After extracting. cleaning.

Load Techniques
 Use SQL to append or insert new data
 record at a time interface  will lead to random disk I/O’s
 Use batch load utility
57
.

g. every week) or after significant events  on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)  refresh policy set by administrator based on user needs and traffic  possibly different policies for different sources
60
.. every night.When to Refresh?
 periodically (e.

subject orientation
67
.Data -.Heart of the Data Warehouse
 Heart of the data warehouse is the data itself!  Single version of the truth  Corporate memory  Data is organized in a way that represents business -.

g.. account etc. customer may be five tables
68
. product... E.customer.Data Warehouse Structure
 Subject Orientation -. policy. A subject may be implemented as a set of related tables.

 Detailed data too voluminous
71
.Granularity in Warehouse
 Can not answer some questions with summarized data
 Did Anand call Seshadri last month? Not possible to answer if total duration of calls by Anand over a month is only maintained and individual call details are not.

Granularity in Warehouse
 Tradeoff is to have dual level of granularity
 Store summary data on disks
 95% of DSS processing done against this data
 Store detail on tapes
 5% of DSS processing against this data
72
.

No Name Date Opened
Rarely accessed
Interest Rate Address
Smaller table and so less I/O
73
.Vertical Partitioning
Acct. No Name BalanceDate Opened Interest Rate Address
Frequently accessed
Acct. Balance No Acct.

Derived Data
 Introduction of derived (calculated data) may often help  Have seen this in the context of dual levels of granularity  Can keep auxiliary views and indexes to speed up query processing
74
.

De-normalization
 Normalization in a data warehouse may lead to lots of small tables  Can lead to excessive I/O’s since many tables have to be accessed  De-normalization is the answer especially since updates are rare
81
.

Creating Arrays
 Many times each occurrence of a sequence of data is in a different physical location  Beneficial to collect all occurrences together and store as an array in a single row  Makes sense only if there are a stable number of occurrences which are accessed together  In a data warehouse. such situations arise naturally due to time based orientation
 can create an array by month
82
.

Selective Redundancy
 Description of an item can be stored redundantly with order table -most often item description is also accessed with order table  Updates have to be careful
83
.

Partitioning
 Breaking data into several physical units that can be handled separately  Not a question of whether to do it in data warehouses but how to do it  Granularity and partitioning are key to effective implementation of a warehouse
84
.

Where to Partition?
 Application level or DBMS level  Makes sense to partition at application level
 Allows different definition for each year
 Important since warehouse spans many years and as business evolves definition changes
 Allows data to be moved between processing complexes easily
87
.

one for each distinct value of the column  Each bitmap has N bits where N is the number of rows in the table  A bit corresponding to a value v for a row r is set if and only if r has the value for the indexed attribute
98
.Indexing Techniques
 Bitmap index:
 A collection of bitmaps -.

BitMap Indexes
 An alternative representation of RID-list  Specially advantageous for low-cardinality domains  Represent each row of a table by a bit and the table as a bit vector  There is a distinct bit vector Bv for each value v for the domain  Example: the attribute sex has values M and F. A table of 100 million people needs 2 lists of 100 million bits
99
.

a join index on city dimension of calls fact table  correlates for each city the calls (in the calls table) from that city
103
..g.Join Indexes
 Pre-computed joins  A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
 e.

Join Indexes
 Join indexes can also span multiple dimension tables
 e.g., a join index on city and time dimension of calls fact table

cumulative totals
 Cube operator
 group by on all subsets of a set of attributes (month.city)  redundant scan and sorting of data can be avoided
114
.SQL Extensions
 Reporting features
 running total.

cume(dollars) as run_dollars.perkey
115
. period t where year = 1993 and product like ‘Columbian%’ and city like ‘San Fr%’ order by t. product. dollars.Red Brick has Extended set of Aggregates
 Select month. market. cume(weight) as run_weights from sales. weight.

D. in SQL”
-.Limitations of SQL “A Freshman in Business needs a Ph.Ralph Kimball
120
.

Typical OLAP Queries
 Write a multi-table join to compare sales for each product line YTD this year vs. last year.
121
.  Repeat the above process to find the sales of a product line to new vs.  Repeat the above process to find the top 5 product contributors to margin. existing customers.  Repeat the above process to find the customers that have had negative sales growth.

interactive response times  It is good for analyzing time series  It can be useful to find some clusters and outliers  Many vendors offer OLAP tools
124
.Strengths of OLAP
 It is a powerful visualization paradigm  It provides fast.

Organizationally Structured Data
 Different Departments look at the same detailed data in different ways. Without the detailed. organizationally structured data as a foundation. there is no reconcilability of data
marketing sales finance manufacturing
133
.

starnetinc.For a Successful Warehouse
 From day one establish that warehousing is a joint user/builder project  Establish that maintaining data quality will be an ONGOING joint user/builder responsibility  Train the users one step at a time  Consider doing a high level corporate data model in no more than three weeks
143
From Larry Greenfield.htm
. http://pwp.com/larryg/index.

and loading tools  Implement a user accessible automated directory to information stored in the warehouse  Determine a plan to test the integrity of the data in the warehouse  From the start get warehouse users in the habit of 'testing' complex queries
144
. cleaning.For a Successful Warehouse
 Look closely at the data extracting.

ask others who have done the same thing for advice  Be on the lookout for small. but strategic. projects  Market and sell your data warehousing systems
145
.For a Successful Warehouse
 Coordinate system roll-out with network administration personnel  When in a bind.

and loading data  Despite best efforts at project management. data warehousing project scope will increase  You are going to find problems with systems feeding the data warehouse  You will find the need to store data not being captured by any existing system  You will need to validate data not being validated by transaction processing systems
146
. cleaning.Data Warehouse Pitfalls
 You are going to spend much time extracting.

requests for IS written reports may increase  Your warehouse users will develop conflicting business rules  Large scale data warehousing can become an exercise in data homogenizing
147
.Data Warehouse Pitfalls
 Some transaction processing systems feeding the warehousing system will not contain detail  Many warehouse end users will be trained and never or seldom apply their training  After end users receive query and report tools.

.Data Warehouse Pitfalls
 'Overhead' can eat up great amounts of disk space  The time it takes to load the warehouse will expand to the amount of the time in the available window. and then some  Assigning security cannot be done with a transaction processing system mindset  You are building a HIGH maintenance system  You will fail if you concentrate on resource optimization to the neglect of project. data.. and customer management issues and an understanding of what adds value to the customer
148
.