CS 345: Topics in Data Warehousing

Provided by: BrianB120
Learn more at: http://web.stanford.edu

1
CS 345: Topics in Data Warehousing
  • Tuesday, November 16, 2004

2
Review of Thursday's Class
  • Dimension Key Mapping Revisited
  • Comments on Assignment #2
  • Updating the Data Warehouse
  • Incremental maintenance vs. drop & rebuild
  • Self-maintainable views
  • Approximate Query Processing
  • Sampling-based techniques
  • Computing confidence intervals
  • Online vs. pre-computed samples
  • Sampling and joins
  • Alternative techniques

3
Outline of Today's Class
  • Data Mining
  • What is data mining?
  • Types of data mining
  • Data mining pitfalls
  • Decision Tree Classifiers
  • What is a decision tree?
  • Learning decision trees
  • Entropy
  • Information Gain
  • Cross-Validation

4
Data Mining
  • What is data mining?
  • Many definitions
  • Basically: identify interesting patterns in data
  • Most often, the term data mining refers to
    automatic detection of patterns through machine
    learning
  • Data mining is one part of the broader process of
    knowledge discovery in databases (KDD)
  • KDD: the process of identifying valid, novel,
    potentially useful, and ultimately understandable
    patterns in data
  • This is what data warehousing is all about.
  • Data mining is a field of research
  • Draws from databases, artificial intelligence,
    statistics
  • Relatively new research community
  • Several conferences and journals
  • ACM KDD, SIAM Data Mining, IEEE ICDM

5
Knowledge Discovery in Databases
[Figure: layers of the KDD process, listed here from the final result (Knowledge) back to the raw input (Data)]
Knowledge
Interpretation/Evaluation
  • Validation Tests
  • Visualization

Data Mining
  • Identify Patterns
  • Generate Models

Preprocessing
  • Selection
  • Cleaning
  • Transformation
  • Feature Extraction

Data

6
Types of Data Mining
  • OLAP
  • Group-by aggregation queries are a simple type of
    data mining
  • Summarize the data set
  • Classification
  • Build predictive model to categorize records into
    discrete classes
  • Examples
  • Classify mortgage applicants as "will default" or
    "will not default"
  • Face recognition in image database
  • Identify likely terrorists vs. unlikely
    terrorists
  • Regression
  • Build predictive model to predict real-valued
    function
  • Examples
  • Predict how much revenue each customer will
    generate
  • Predict profitability of planned marketing
    campaign
  • Clustering
  • Separate data records into groups of similar
    items
  • Clustering vs. Classification
  • Classification is supervised, clustering is
    unsupervised
  • Classification uses pre-defined class labels,
    clustering doesn't.

7
Types of Data Mining
  • Outlier detection
  • Identify unusual or atypical data records
  • Sometimes to investigate them further
  • Sometimes to exclude them from a broad analysis
  • Trend analysis / forecasting
  • Identify changes in patterns of data over time
  • Example: What will next month's revenue be?
  • Dependency detection
  • Which attributes are correlated with one another?
  • Which attribute values are likely to occur
    together?
  • Popular technique: association rule mining
  • Also known as market basket analysis
  • Find products that are often bought together as
    part of same transaction
  • Temporal pattern detection / time series mining
  • Recognize commonly recurring patterns in time
    series data
  • Example: technical analysis of financial
    markets

8
Data Mining Pitfalls
  • Overfitting
  • Spurious patterns may emerge by chance
  • Don't mistake coincidence for causality
  • Example: ESP experiment
  • Ask 10,000 test subjects to predict whether each
    of 10 face-down playing cards is red or black
  • 10 subjects predicted all 10 cards correctly!
  • Conclusion: "1 out of every 1,000 people has ESP"
  • Can be a particular concern in data sets with:
  • Lots of attributes
  • Not too many records
  • Reporting obvious patterns
  • Learning cancer risk factors
  • Women are more likely than men to have breast
    cancer
  • Men are more likely than women to have prostate
    cancer
  • These patterns are not novel
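
The ESP example on this slide can be checked with a little arithmetic. The sketch below (variable names are illustrative, and a 50/50 chance per card is assumed) computes the probability of guessing all 10 cards by luck and the number of perfect scores expected among 10,000 subjects if nobody has ESP.

```python
from fractions import Fraction

# Probability of guessing one red/black card by pure chance (assumed 50/50).
p_single = Fraction(1, 2)
cards_per_subject = 10
subjects = 10_000

# Chance that a single subject gets all 10 cards right by luck alone.
p_all_correct = p_single ** cards_per_subject

# Expected number of "perfect" subjects if nobody has ESP.
expected_perfect = subjects * p_all_correct

print(p_all_correct)            # 1/1024
print(float(expected_perfect))  # 9.765625
```

Roughly 10 perfect scores are what chance alone predicts, so the 10 observed successes are no evidence of ESP.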

9
Data Mining Pitfalls
  • Confusing correlation and causation
  • Data mining can identify attributes that are
    correlated
  • Correlation doesn't necessarily imply causation
  • Example: studying causes of obesity
  • Overweight people are more likely to drink diet
    soda
  • Conclusion: "Drinking diet soda causes obesity"
  • Moral of the story: interpretation and
    evaluation of patterns is crucial
  • Data mining algorithms are not magical
  • Patterns they identify must be examined carefully
    to avoid drawing inappropriate conclusions

10
Decision Tree Classifiers
  • Decision trees are one type of classification
    model
  • Internal nodes of decision tree labeled with
    attributes
  • Each internal node represents a test
  • Edges labeled with attribute values
  • Edges represent the results of the tests
  • Leaves labeled with class values
  • Leaves represent the classifier's predictions
  • To classify a record, walk down the tree starting
    at the root
  • The path that is followed depends on the
    attribute values of the record being classified

[Figure: example decision tree. The root tests Employed? (Yes / No); the second-level nodes test Credit Score? and Income? (High / Low each); the leaves are Approve or Reject.]
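
The walk from root to leaf described above can be sketched in a few lines. The tree layout below is an assumption: the slide's figure is flattened in this transcript, so which branch leads to which leaf is guessed for illustration only, and the tuple/dict representation is hypothetical.

```python
# Assumed layout of the example tree (the branch-to-leaf mapping is a guess,
# since the original figure did not survive in the transcript).
# Internal node: (attribute, {value: subtree}); leaf: a class label string.
TREE = (
    "Employed?",
    {
        "Yes": ("Credit Score?", {"High": "Approve", "Low": "Reject"}),
        "No": ("Income?", {"High": "Approve", "Low": "Reject"}),
    },
)


def classify(tree, record):
    """Walk from the root to a leaf, following the record's attribute values."""
    node = tree
    while not isinstance(node, str):        # leaves are plain class labels
        attribute, children = node
        node = children[record[attribute]]  # follow the edge matching the value
    return node


applicant = {"Employed?": "Yes", "Credit Score?": "Low", "Income?": "High"}
print(classify(TREE, applicant))  # Reject
```

Only the attributes along the chosen path are consulted: for this applicant, Income? is never tested.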
11
Decision Tree Learning
  • We're given a data set with unknown values for an
    attribute of interest
  • Example
  • Data set is customer records
  • Attribute of interest is "Will Close Account in
    Next 3 Months"
  • Unknown attribute referred to as target attribute
  • This data set is referred to as the test set
  • We also have a second data set where the values
    of the target attribute are known
  • Referred to as the training set
  • We would like to build a decision tree classifier
    to predict the value of the target attribute
  • Construct a decision tree that accurately
    classifies the records in the training set
  • Use the decision tree to predict the value of the
    target attribute for the records from the test
    set
  • Hopefully a classifier that works well on the
    training set will also work well on the other
    data set!

12
Decision Tree Learning
  • When does decision tree learning work well?
  • Training set and test set are similar
  • Patterns in the training set are also present in
    the test set
  • Rules learned from one data set apply to the
    other
  • Decision tree identifies general, globally valid
    patterns
  • And not specific, idiosyncratic properties of the
    training records
  • Need to avoid overfitting the model to the
    training set
  • Occam's razor: simple explanations are usually
    the best
  • Simple (small) decision trees are usually
    preferable
  • Easier for humans to interpret
  • Usually less prone to overfitting
  • Finding the smallest accurate decision tree is
    NP-hard
  • Decision trees are usually built top-down using
    a greedy heuristic
  • Idea: first test the attributes that do the best
    job of separating the classes

13
Decision Tree Learning
  • Basic decision tree learning algorithm
  • Do all records in training set belong to same
    class?
  • Yes → Return leaf node with that class.
  • Do all records in training set have the same
    values for all attributes (other than target)?
  • Yes → Return leaf node with most common class.
  • Otherwise:
  • Pick the single attribute that best separates
    records from different classes
  • Use that attribute for the root of the decision
    tree
  • Children of root node are decision trees
  • Build them recursively using same algorithm
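
A minimal sketch of the recursive algorithm above. The slide does not say here how the "best" attribute is chosen; this sketch assumes the lowest weighted-average entropy after the split (the criterion developed on the following slides). The function names and the tuple/dict tree representation are illustrative, and the records are the weather example used later in the deck.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Bits per record needed to encode the class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def expected_entropy(records, attr, target):
    """Weighted average entropy of the subsets produced by splitting on attr."""
    subsets = {}
    for r in records:
        subsets.setdefault(r[attr], []).append(r[target])
    return sum(len(s) / len(records) * entropy(s) for s in subsets.values())


def build_tree(records, attrs, target):
    """Return a class label (leaf) or (attribute, {value: subtree})."""
    labels = [r[target] for r in records]
    # Case 1: all records belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: records agree on every remaining attribute -> most common class.
    if all(len({r[a] for r in records}) == 1 for a in attrs):
        return Counter(labels).most_common(1)[0][0]
    # Otherwise: split on the attribute that best separates the classes
    # (assumed criterion: lowest expected entropy), then recurse.
    best = min(attrs, key=lambda a: expected_entropy(records, a, target))
    rest = [a for a in attrs if a != best]
    children = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        children[value] = build_tree(subset, rest, target)
    return (best, children)


COLUMNS = ("State", "Season", "Barometer", "Weather")
ROWS = [
    ("AK", "Winter", "Down", "Snow"), ("HI", "Winter", "Down", "Sun"),
    ("HI", "Summer", "Up", "Sun"), ("CA", "Summer", "Up", "Rain"),
    ("AK", "Winter", "Up", "Snow"), ("CA", "Winter", "Down", "Sun"),
    ("AK", "Summer", "Down", "Sun"), ("CA", "Winter", "Up", "Rain"),
    ("HI", "Summer", "Down", "Sun"),
]
records = [dict(zip(COLUMNS, row)) for row in ROWS]
tree = build_tree(records, ["State", "Season", "Barometer"], "Weather")
print(tree)
```

On these nine records the sketch splits on State at the root, then on Season for AK and on Barometer for CA, with HI as a pure Sun leaf.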

14
Splitting Criterion
  • How to decide which attribute is best to test
    first?
  • Each attribute splits data into subsets
  • Ideally, each subset should be as homogeneous as
    possible
  • Need metric for homogeneity of a data set
  • Example
  • Two classes, +/-
  • 100 records overall (50 +s and 50 -s)
  • A and B are two binary attributes
  • Records with A=0: 48+, 2-
    Records with A=1: 2+, 48-
  • Records with B=0: 26+, 24-
    Records with B=1: 24+, 26-
  • Splitting on A is better than splitting on B
  • A does a good job of separating +s and -s
  • B does a poor job of separating +s and -s

15
Entropy
  • Entropy is a good way to measure homogeneity
  • Measures minimum number of bits per record needed
    to optimally encode class values
  • Entropy example
  • Three classes (A,B,C)
  • A occurs ½ of the time
  • B and C each occur ¼ of the time
  • Optimal encoding: A = 0, B = 10, C = 11
  • Entropy = average bits / record = 1.5
  • Entropy formula: $H(S) = -\sum_i p_i \log_2 p_i$
  • Entropy of data set S is denoted H(S)
  • The c_i are the possible classes
  • p_i = fraction of records from S that have class c_i
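
As a quick check of the three-class example, the sketch below computes the entropy as minus the sum of p_i log2 p_i and compares it with the average length of the code A=0, B=10, C=11. The function name is illustrative.

```python
from math import log2


def entropy(probabilities):
    """H(S) = -sum(p_i * log2(p_i)), skipping classes with p_i = 0."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


# Slide example: A occurs 1/2 of the time, B and C 1/4 each.
h = entropy([0.5, 0.25, 0.25])

# Average code length of the encoding A=0, B=10, C=11 (1, 2 and 2 bits).
avg_bits = 0.5 * 1 + 0.25 * 2 + 0.25 * 2

print(h, avg_bits)  # 1.5 1.5
```

The two numbers agree because every class probability here is a power of 1/2, so each code word can have exactly -log2(p_i) bits.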

16
Entropy Examples
  • Example
  • 10 records have class A
  • 20 records have class B
  • 30 records have class C
  • 40 records have class D
  • Entropy = -[(.1 log .1) + (.2 log .2) + (.3 log
    .3) + (.4 log .4)], with logs taken base 2
  • Entropy = 1.85
  • Earlier example revisited
  • Two classes, +/-
  • 100 records overall (50 +s and 50 -s)
  • A and B are two binary attributes
  • Records with A=0: 48+, 2- (entropy = 0.24)
    Records with A=1: 2+, 48- (entropy = 0.24)
  • Records with B=0: 26+, 24- (entropy = 0.99)
    Records with B=1: 24+, 26- (entropy = 0.99)
  • A is better than B because average entropy is
    less after splitting on A

17
Information Gain
  • Information gain = expected reduction in entropy
  • Expected entropy after splitting on attribute A
    is denoted H(S|A)
  • H(S|A) = Sum of (fraction of records with
    A=a_i) × (entropy of records with A=a_i)
  • Sum is taken over all possible values a_i of
    attribute A
  • Computes weighted average entropy across all
    subsets
  • Weight of subset is proportional to the number of
    records in the subset
  • Always split on attribute with greatest
    information gain
  • This is one possible splitting rule for building
    decision trees
  • However, other splitting criteria are also used
    sometimes
  • Gain ratio, Gini index, etc.
  • Alternative methods of measuring homogeneity
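
To make the definition concrete, the sketch below computes information gain as H(S) - H(S|A) for the two binary attributes A and B from the earlier 100-record example. The function names and the count-list representation are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def expected_entropy(subsets):
    """H(S|A): average of subset entropies, weighted by subset size."""
    total = sum(sum(counts) for counts in subsets)
    return sum(sum(counts) / total * entropy(counts) for counts in subsets)


def information_gain(parent_counts, subsets):
    """Expected reduction in entropy: H(S) - H(S|A)."""
    return entropy(parent_counts) - expected_entropy(subsets)


parent = [50, 50]               # 50 "+" and 50 "-" records
split_a = [[48, 2], [2, 48]]    # class counts for A=0 and A=1
split_b = [[26, 24], [24, 26]]  # class counts for B=0 and B=1

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(round(gain_a, 2), round(gain_b, 3))
```

Splitting on A removes about three quarters of a bit of uncertainty per record, while splitting on B removes almost none, which matches the slide's conclusion that A is the better split.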

18
Decision Tree Example
State  Season  Barometer  Weather
AK     Winter  Down       Snow
HI     Winter  Down       Sun
HI     Summer  Up         Sun
CA     Summer  Up         Rain
AK     Winter  Up         Snow
CA     Winter  Down       Sun
AK     Summer  Down       Sun
CA     Winter  Up         Rain
HI     Summer  Down       Sun
Predicting the weather
Target attribute: Weather
Source attributes: State, Season, Barometer
19
Decision Tree Example
(Training data: the same nine records as on the previous slide.)
State
  AK: 2 Snow, 1 Sun → 0.92
  HI: 3 Sun → 0.00
  CA: 2 Rain, 1 Sun → 0.92
  Expected entropy = 0.61
Season
  Winter: 2 Snow, 2 Sun, 1 Rain → 1.52
  Summer: 3 Sun, 1 Rain → 0.81
  Expected entropy = 1.20
Barometer
  Down: 1 Snow, 4 Sun → 0.72
  Up: 1 Snow, 1 Sun, 2 Rain → 1.50
  Expected entropy = 1.07
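
The per-attribute figures on this slide can be recomputed from the nine training records. The sketch below is illustrative (the names are not from the slides); to three decimals it yields about 0.612 for State, 1.206 for Season and 1.068 for Barometer, so State has the lowest expected entropy and becomes the root.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ("State", "Season", "Barometer")  # source attributes
ROWS = [                                    # last field is the target, Weather
    ("AK", "Winter", "Down", "Snow"), ("HI", "Winter", "Down", "Sun"),
    ("HI", "Summer", "Up", "Sun"), ("CA", "Summer", "Up", "Rain"),
    ("AK", "Winter", "Up", "Snow"), ("CA", "Winter", "Down", "Sun"),
    ("AK", "Summer", "Down", "Sun"), ("CA", "Winter", "Up", "Rain"),
    ("HI", "Summer", "Down", "Sun"),
]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def expected_entropy(rows, column):
    """Weighted average entropy of Weather after splitting on one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])  # collect class labels per value
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


scores = {name: expected_entropy(ROWS, i) for i, name in enumerate(COLUMNS)}
best = min(scores, key=scores.get)
print(best)  # State
```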
20
Decision Tree Example
State  Season  Barometer  Weather
AK     Winter  Down       Snow
AK     Winter  Up         Snow
AK     Summer  Down       Sun
HI     Winter  Down       Sun
HI     Summer  Up         Sun
HI     Summer  Down       Sun
CA     Summer  Up         Rain
CA     Winter  Down       Sun
CA     Winter  Up         Rain
State = AK: split on Season (Winter → Snow, Summer → Sun)
State = HI: leaf node (Sun)
State = CA: split on Barometer (Up → Rain, Down → Sun)
21
Decision Tree Example
State
  AK → Season
    Winter → Snow
    Summer → Sun
  HI → Sun
  CA → Barometer
    Up → Rain
    Down → Sun
22
Overfitting and Pruning
  • Performance graph at right exhibits typical
    phenomenon
  • Accuracy on training data increases as decision
    tree grows
  • Accuracy on test data initially increases, then
    decreases.
  • Why does this happen?
  • Highly predictive attributes near root of
    decision tree capture general patterns
  • Less predictive attributes added later are mostly
    capturing statistical noise
  • Goal: stop building the decision tree before
    overfitting kicks in
  • Pruning → eliminate lower portions of the
    decision tree
  • Replace sub-tree with a leaf node

[Figure: accuracy (y-axis) vs. decision tree size (x-axis), with one curve for training set accuracy and one for test set accuracy; the optimal tree size is marked.]
23
Pruning via Cross-Validation
  • Cross-validation
  • Separate training set into two parts
  • Most of the training set is used to build tree
  • Small holdout set is used to validate accuracy
  • Post-pruning approach
  • Build decision tree with training data (less
    holdout set)
  • Traverse tree in bottom-up fashion
  • For each sub-tree
  • Consider pruning sub-tree, replacing with leaf
    node
  • If pruned tree is more accurate on holdout set,
    then use it
  • Otherwise, stick with original sub-tree
  • Idea behind pruning
  • Portion of tree that models general patterns
    works well on holdout set
  • Portion of tree that fits random noise works
    poorly on holdout set
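
The post-pruning steps above can be sketched as follows. Everything about the representation is an assumption made for illustration: nodes are dicts, each internal node stores the majority class of the training records that reached it (`majority`), and the tiny tree and holdout set are hypothetical.

```python
def classify(node, record):
    """Follow the record's attribute values down to a leaf label."""
    while "label" not in node:
        child = node["children"].get(record[node["attr"]])
        if child is None:          # unseen value: fall back to majority class
            return node["majority"]
        node = child
    return node["label"]


def accuracy(tree, records, target):
    """Fraction of records whose predicted class matches the target."""
    return sum(classify(tree, r) == r[target] for r in records) / len(records)


def prune(node, root, holdout, target):
    """Bottom-up: replace a sub-tree by a leaf if holdout accuracy improves."""
    if "label" in node:
        return
    for child in node["children"].values():
        prune(child, root, holdout, target)  # handle lower sub-trees first
    before = accuracy(root, holdout, target)
    saved = dict(node)
    node.clear()
    node["label"] = saved["majority"]        # tentatively prune to a leaf
    if accuracy(root, holdout, target) <= before:
        node.clear()
        node.update(saved)                   # not more accurate: keep sub-tree


# Hypothetical over-grown tree: the "Noise" test under Employed=No is spurious.
tree = {
    "attr": "Employed", "majority": "Approve",
    "children": {
        "Yes": {"label": "Approve"},
        "No": {
            "attr": "Noise", "majority": "Reject",
            "children": {"x": {"label": "Approve"}, "y": {"label": "Reject"}},
        },
    },
}
holdout = [
    {"Employed": "Yes", "Noise": "x", "Class": "Approve"},
    {"Employed": "No", "Noise": "x", "Class": "Reject"},
    {"Employed": "No", "Noise": "y", "Class": "Reject"},
    {"Employed": "Yes", "Noise": "y", "Class": "Approve"},
]
prune(tree, tree, holdout, "Class")
print(tree)
```

In this toy case the Noise sub-tree is replaced by a Reject leaf because that raises holdout accuracy, while the root split on Employed survives because pruning it would lower accuracy.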

24
Sufficient Statistics
  • What information is needed to determine what
    attribute to split on?
  • Need to compute expected entropy of each
    attribute
  • To compute expected entropy after splitting on
    attribute A
  • How many records are there with each value of A?
  • Among the records with each A value, how many
    belong to each class?
  • These counts are called sufficient statistics
  • Computing sufficient statistics via SQL
  • Use a simple group-by SQL query (one per
    attribute):
    SELECT A, Class, COUNT(*)
    FROM Table
    GROUP BY A, Class
  • For non-root nodes, need a WHERE clause for
    earlier splits:
    SELECT A, Class, COUNT(*)
    FROM Table
    WHERE B = x AND C = y
    GROUP BY A, Class
  • Full data cube contains all sufficient statistics
    for entire decision tree
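
The group-by queries above can be tried with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table name `Records`, the helper `sufficient_statistics`, and the use of the deck's weather example as data are all illustrative (the slide's `Table` is a placeholder, and TABLE is a reserved word in SQL). An ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

ATTRIBUTES = ("State", "Season", "Barometer")  # allowed column names
ROWS = [
    ("AK", "Winter", "Down", "Snow"), ("HI", "Winter", "Down", "Sun"),
    ("HI", "Summer", "Up", "Sun"), ("CA", "Summer", "Up", "Rain"),
    ("AK", "Winter", "Up", "Snow"), ("CA", "Winter", "Down", "Sun"),
    ("AK", "Summer", "Down", "Sun"), ("CA", "Winter", "Up", "Rain"),
    ("HI", "Summer", "Down", "Sun"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Records (State TEXT, Season TEXT, Barometer TEXT, Weather TEXT)"
)
conn.executemany("INSERT INTO Records VALUES (?, ?, ?, ?)", ROWS)


def sufficient_statistics(conn, attribute, filters=None):
    """Counts per (attribute value, class) among rows matching earlier splits."""
    filters = filters or {}
    # Column names are interpolated into the SQL text, so restrict them to a
    # known list; the filter values themselves are passed as bound parameters.
    for column in (attribute, *filters):
        if column not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {column}")
    where = ""
    if filters:
        where = "WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    sql = (
        f"SELECT {attribute}, Weather, COUNT(*) FROM Records {where} "
        f"GROUP BY {attribute}, Weather ORDER BY {attribute}, Weather"
    )
    return conn.execute(sql, tuple(filters.values())).fetchall()


root_stats = sufficient_statistics(conn, "State")
ca_stats = sufficient_statistics(conn, "Barometer", {"State": "CA"})
print(root_stats)
print(ca_stats)
```

The first call gathers the counts needed to score State at the root; the second shows the WHERE-clause variant for the node reached after the split State = CA.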

25
Decision Trees and Data Warehouses
  • Generally building a decision tree involves
    dimension-focused queries
  • As opposed to typical fact-focused queries
  • Records for which predictions are made are
    dimension rows (e.g. Customers, Accounts)
  • Sometimes queries just involve the dimension
    table
  • Other times dimension attributes may be
    supplemented by virtual behavioral attributes
  • Two approaches for gathering sufficient
    statistics
  • Compute entire data cube (including subtotals) in
    one query
  • Issue a series of small group-by queries