Presentation Transcript

Slide1:

Data Mining

Introduction:

Introduction Finding interesting trends or pattern in large datasets Statistics: exploratory data analysis
Artificial intelligence: knowledge discovery and machine learning Scalability with respect to data size An algorithm is scalable if the running time grows (linearly) in proportion to the dataset size, given the available system resources

Knowledge Discovery (KDD):

Knowledge Discovery (KDD) Data Selection Data Cleaning Data Mining Evaluation The Knowledge Discovery Process Identify the target dataset and relevant attributes Remove noise and outliers, transform field values to common units, generate new fields, bring the data into the relational schema Present the patterns in an understandable form to the end user (e.g., through visualization)

Frequent Itemsets:

Frequent Itemsets Itemset: a set of items
Support of an itemset: the fraction of transactions in the database that contain all items in the itemset Example:
Itemset {pen, ink} Support 75%
Itemset {milk, juice} Support 25%

Frequent Itemsets:

Frequent Itemsets An algorithm for identify (all) frequent itemsets? The a priory property:
Every subset of a frequent itemset must also be a frequent itemset

Frequent Itemsets:

Frequent Itemsets For each item,
check if it is a frequent itemset
k = 1
repeat
for each new frequent itemset Ik with k items
Generate all itemsets Ik+1 with k+1 items, Ik  Ik+1
Scan all transactions once and check if the generated k+1 itemsets are frequent
k = k + 1
Until no new frequent itemsets are identified An algorithm for identifying frequent itemsets

Frequent Itemsets:

Frequent Itemsets For each item,
check if it is a frequent itemset
k = 1
repeat
for each new frequent itemset Ik with k items
Generate all itemsets Ik+1 with k+1 items, Ik  Ik+1 whose subsets are ferquent itemsets
Scan all transactions once and check if the generated k+1 itemsets are frequent
k = k + 1
Until no new frequent itemsets are identified A refinement of the algorithm for identifying frequent itemsets

Iceberg Queries:

Iceberg Queries Assume we want to find pairs of customers and items such that the customer has purchased the item at least 5 times select P.custid, P. item, sum(P.qty)
from Purchases P
group by P.custid, P.item
having sum (P.qty) > 5 Execution plan for the query? The number of groups is very large but the answer to the query (the tip of the iceberg) is usually very small

Association Rules:

Association Rules Example
{pen}  {ink}
If a pen is purchased in a transaction, it is likely that ink will also be purchased in the transaction In general:
LHS  RHS

Association Rules:

Association Rules LHS  RHS Support: support(LHS  RHS)
The percentage of transactions that contain all of these items Confidence: support(LHS  RHS) / support(LHS)
Is an indication of the strength of the rule
P(RHS | LHS) An algorithm for finding all rules with minsum and minconf?

Association Rules:

Association Rules An algorithm for finding all rules with minsup and minconf Step 1: Find all frequent itemsets with minsup
Step 2: Generate all rules from step 1 For each frequent itemset I with support support(I)
Divide I into LHSI and RHSI
confidence = support(I) / support(LHSI) Step 2

Association Rules and ISA Hierarchies:

Association Rules and ISA Hierarchies An ISA hierarchy or category hierarchy upon the set of items: a transaction implicitly contains for each of its items all of the item’s ancestors Beverage Milk Juice Stationery Pen Ink Detect relationships between items at different levels of the hierarchy
In general, the support of an itemset can only increase if an item is replaced by its ancestor

Sequential Patterns:

Sequential Patterns Sequence of Itemsets:
The sequence of itemsets purchased by the customer:
Example custid 201: {pen, ink, milk, juice}, {pen, ink, juice}
(ordered by date)
A subsequence of a sequence of itemsets is obtained by deleting one or more itemsets and is also a sequence of itemsets

Sequential Patterns:

Sequential Patterns A sequence {a1, a2, .., an} is contained in sequence S if S has a subsequence {bq, .., bm} such that ai  bi for 1  i  m Example
{pen}, {ink, milk}, {pen, juice} is contained in {pen, ink}, {shirt}, {juice, ink, milk}, {juice, pen, milk}
The order of items within each itemset does not matter but the order of itemsets does matter
{pen}, {ink, milk}, {pen, juice} is not conatined in {pen, ink}, {shirt}, {juice, pen, milk}, {juice, milk, ink}

Sequential Patterns:

Sequential Patterns The support for a sequence S of itemsets is the percentage of customer sequences of which S is a subsequence Identify all sequences that have a minimum support

Classification and Regression Rules:

Classification and Regression Rules InsuranceInfo(age: integer, cartype: string, highrisk: boolean) There is one attribute (highrisk) whose value we would like to predict: dependent attribute
The other attributes are called the predictors General form of the types of rules we want to discover:
P1(X1)  P2(X2)  …  Pk(Xk)  Y = c

Classification and Regression Rules:

Classification and Regression Rules P1(X1)  P2(X2)  …  Pk(Xk)  Y = c Support:
The support for a condition C is the percentage of tuples that satisfy C.
The support for a rule C1  C2 is the support of the condition C1  C2
Confidence
Consider the tuples that satisfy condition C1. The confidence for a rule C1  C2 is the percentage of such tuples that also satisfy condition C2

Classification and Regression Rules:

Classification and Regression Rules Differ from association rules by considering continuous and categorical attributes, rather than one field that is set-valued

Tree-Structured Rules:

Tree-Structured Rules Classification or decision trees
Regression trees
Typically the tree itself is the output of data mining
Easy to understand
Efficient algorithms to construct them

Decision Trees:

Decision Trees A graphical representation of a collection of classification rules. Given a data record, the tree directs the record from the root to a leaf. Internal nodes: labeled with a predictor attribute (called a splitting attribute)
Outgoing edges: labeled with predicates that involve the splitting attribute of the node (splitting criterion)
Leaf nodes: labeled with a value of a dependent attribute

Decision Trees:

Decision Trees Constructed into two phases
Phase 1: growth phase
construct a vary large tree (e.g., leaf nodes for individual records in the database
Phase 2: pruning phase Build the tree greedily top down:
At the root node, examine the database and compute the locally best splitting criterion
Partition the database into two parts
Recurse on each child

Decision Trees:

Decision Trees Split selection method
An algorithm that takes as input (part of) a relation and outputs the locally best spliting criterion
Example: examine the attributes cartype and age, select one of them as a splitting attribute and then select the splitting predicates

Decision Trees:

Decision Trees How can we construct decision trees when the input database is larger than main memory? Provide the split selection method with aggregated information about the database instead of loading the complete database into main memory
We need aggregated information for each predictor attribute AVC set of the predictor attribute X at node n is the projection of n’s database partition onto X and the dependent attribute where counts of the individual values in the domain of the dependent attribute are aggregated

Clustering:

Clustering Partition a set of records into groups (clusters) such that all records within a group are similar to each other and records that belong to two different groups are disimilar. Similarity between records measured computationally by a distance function.

Clustering:

Clustering:

Clustering The output of a clustering algorithm consists of a summarized representation of each cluster.
Type of output depends on type and shape of clusters.
For example if spherical clusters: center C (mean) and radius R:
given a collection of records r1, r2, .., rn
C =  ri R =  (ri - C)
n

Clustering:

Clustering Two types of clustering algorithms:
Partitional clustering: partitions the data into k groups such that some criterion that evaluates the clustering quality is optimized
Hierarchical clustering generates a sequence of partitions of the records. Starting with a partition in which each cluster consists of a single record, merges two partitions in each step

Clustering:

Clustering Assumptions
Large number of records, just one scan of them
A limited amount of main memory The BIRCH algorithm: Two parameters
k: main memory threshold: maximum number of clusters that can be maintained in memory
e: initial threshold of the radius of each cluster. A cluster is compact if its radius is smaller than e.
Always maintain in main memory k or fewer compact cluster summaries (Ci, Ri)
(If this is no possible adjust e)

Clustering:

Clustering Read a record r from the database
Compute the distance of r and each of the existing cluster centers
Let i be the cluster (index) such that the distance between r and Ci is the smallest
Compute R’i assuming r is inserted in the ith cluster
If R’i  e,
insert r in the ith cluster
recompute Ri and Ci
else
start a new cluster containing only r The BIRCH algorithm:

Similarity Search over Sequences:

Similarity Search over Sequences A user specifies a query sequence and wants to retrieve all data sequences that are similar to the query sequence Not exact matches A data sequence X is a sequence of numbers X = <x1, x2, .., xk>
Also called a time series
k length of the sequence
A subsequence Z = <z1, z2, …, zj> is deleting from another sequence by deleting numbers from the front and back of the sequence

Similarity Search over Sequences:

Similarity Search over Sequences Given two sequences X and Y we can define the distance of the two sequences as the Euclidean norm Given a user-specified query sequence and a threshold parameter e, retrieve all data sequences within e-distance to the query sequence Complete sequence matching (the query sequence and the sequence in the database have the same length)
Subsequence matching (the query is shorter)

Similarity Search over Sequences:

Similarity Search over Sequences Given a user-specified query sequence and a threshold parameter e, retrieve all data sequences within e-distance to the query sequence Brute-force method? Each data sequence and the query sequence of length k may be represented as a point in a k-dimensional space
Construct a multidimensional index
Non-exact matches? Query the index with a hyper-rectangle with side length 2-e and the query sentence as the center