Title: The Software Infrastructure for Electronic Commerce
1The Software Infrastructurefor Electronic
Commerce
- Databases and Data Mining
- Lecture 3 An Introduction To Data Mining (I)
- Johannes Gehrke
- johannes_at_cs.cornell.edu
- http//www.cs.cornell.edu/johannes
2Lectures Three and Four
- Data preprocessing
- Multidimensional data analysis
- Data mining
- Association rules
- Classification trees
- Clustering
3Why Data Preprocessing?
- Quality decisions come from quality data.
- Problems with real life data
- Data needs to be integrated from different
sources - Missing values
- Noisy and inconsistent values
- Data is not at the right level of aggregation
4Recall The Relational Data Model
- A relational database is a set of relations
- A relation has two components
- The relation instance. Basically a table, with
rows (also records, tuples) and columns (also
fields, attributes).Number of records in the
relation instance Cardinality. - The relation schema. Specifies name of the
relation, plus name and type of each column. - Turing award (Nobel price in CS) for Codd in 1981
for his work on the relational model
5Example Customer Relation
- Relation schemaCustomers (cid integer, name
string, byear integer, state string) - Relation instance
6Data Integration
- Integrate data from multiple sources into a
common format for data mining. - Note A good data warehouse has already taken
care of this step.
Data Warehouse Server
OLTPDBMSs
Extract, clean,transform, aggregate,load, update
Other Data Sources
Data Marts
7Data Integration (Contd.)
- Problem Heterogeneous schema integration
- Different attribute names
- Different units Sales in , sales in Yen, sales
in DM -
8Data Integration (Contd.)
- Problem Heterogeneous schema integration
- Different scales Sales in dollars versus sales
in pennies - Derived attributes Annual salary versus monthly
salary -
9Data Integration (Contd.)
- Problem Inconsistency due to redundancy
- Customer with customer-id 150 has three children
in relation1 and four children in relation2 - Computation of annual salary from monthly salary
in relation1 does not match annual-salary
attribute in relation2 -
10Missing Values
- Often the values of some attributes in a record
are not known. - Reasons for missing values
- Attribute does not apply (e.g., maiden name)
- Inconsistency with other recorded data
- Equipment malfunction
- Human errors
- Attribute introduced recently (e.g., email
address)
11Missing Values Approaches
- Ignore the record
- Complete the missing value
- Manual completion Tedious and likely to be
infeasible - Fill in a global constant, e.g., NULL,
unknown - Use the attribute mean or mode
- Construct a data mining model that predicts the
missing value
12Noisy Data
- Examples
- Faulty data collection instruments
- Data entry problems, misspellings
- Data transmission problems
- Technology limitation
- Inconsistency in naming conventions
- Duplicate records with different values for a
common field
13Noisy Data Remove Outliers
14Noisy Data Smoothing
y
Y1
Y1
x
X1
15Noisy Data Normalization
- Scale data to fall within a small, specified
range - Leave out extreme order statistics
- Min-max normalization
- Z-score normalization
- Normalization by decimal scaling
16Data Reduction
- Problem
- Data might not be at the right scale for
analysis.Example Individual phone calls versus
monthly phone call usage - Complex data mining tasks might run a very long
time.Example Multi-terabyte data warehousesOne
disk drive About 20MB/s
17Data Reduction Attribute Selection
- Select the relevant attributes for the data
mining task - If there are k attributes, there are 2k-1
different subsets - Example salary,children,byear,
salary,children, salary,byear,
children,byear, salary, children, byear - Choice of the right subset depends on
- Data mining task
- Underlying probability distribution
18Attribute Selection (Contd.)
- How to choose relevant attributes
- Forward selection Select greedily one attribute
at a time - Backward elimination Start with all attributes,
eliminate irrelevant attributes - Combination of forward selection and backward
elimination
19Data Reduction Parametric Models
- Main idea
- Fit a parametric model to the data (e.g.,
multivariate normal distribution) - Store the model parameters, discard the data
(except for outliers)
20Parametric Models Example
- Instead of storing (x,y) pairs, store only the
x-value.Then recompute the y-value usingy ax
b
y
x
21Data Reduction Sampling
- Choose a representative subset of the data
- Simple random sampling may have very poor
performance in the presence of skew
22Data Reduction Sampling (Contd.)
- Stratified sampling Biased sampling
- Example Keep population group ratios
- Example Keep minority population group count
23Data Reduction Histograms
- Divide data into buckets and store average (sum)
for each bucket - Can be constructed optimally for one attribute
using dynamic programming - ExampleDataset 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,
5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram
(range, count, sum)(1-2,12,16), (3-6,8,36),
(7-9,6,48), (10-12,6,66)
24Histograms (Contd.)
- Equal-width histogram
- Divides the domain of an attribute into k
intervals of equal size - Interval width (Max Min)/k
- Computationally easy
- Problems with data skew and outliers
- ExampleDataset 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,
5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram
(range, count, sum)(1-3,14,22), (4-6,6,30),
(7-9,6,48), (10-12,6,66)
25Histograms (Contd.)
- Equal-depth histogram
- Divides the domain of an attribute into k
intervals, each containing the same number of
records - Variable interval width
- Computationally easy
- ExampleDataset 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,
5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram
(range, count, sum)(1,8,8), (2-4,8,22),
(5-8,8,52), (9-12,8,84)
26Data Reduction Discretization
- Same concept as histograms
- Divide domain of a numerical attribute into
intervals. - Replace attribute value with label for interval.
- Example
- Dataset (age salary)(2530,000),(3080,000),(27
50,000),(6070,000),(5055,000),(2825,000) - Discretized dataset (age, discretizedSalary)(25,
low),(30,high),(27,medium),(60,high),(50,medium),
(28,low)
27Discretization Natural Hierarchies
- Replace low-level concepts with high-level
concepts - Example replacements
- Product SKU by category
- City by state
Year
Industry
Country
Quarter
Category
State
Month
Week
Product
City
Day
28Data Reduction Aggregation
- Natural hierarchies on attributes can be used to
aggregate data along the hierarchy
Year
Industry
Country
Quarter
Category
State
Month
Week
Product
City
Day
29Aggregation Example
30Data Reduction Other Methods
- Principal component analysis
- Fourier transformation
- Wavelet transformation
31Data Preprocessing Summary
- Problems during data integration
- Different attribute names
- Different units
- Different scales
- Derived attributes
- Redundant data
- Missing values
- Imputation
- Prediction
-
- Noisy data
- Outlier removal
- Smoothing
- Normalization
- Data Reduction
- Attribute selection
- Fitting parametric models
- Sampling
- Histograms
- Discretization
- Aggregation
32Data Analysis
- Data preprocessing
- Multidimensional data analysis
- Data mining
- Association rules
- Classification trees
- Clustering
33Multidimensional Data Analysis
- Recall
- Transactions(ckey, timekey, pkey, units, price)
- Customers(ckey, cid, name, byear, city, state,
country) - Time(tkey, day, month, quarter, year)
- Products(pkey, pname, price, pid, category,
industry) - Hierarchies on dimensions
Year
Industry
Country
Quarter
Category
State
Month
Week
Product
City
Day
34Multidimensional Data Analysis
Year
Industry
CountryUSA
Quarter
Category
State
Month
Week
Product
City
Day
35Corresponding Query in SQL
- SELECT SUM(units)FROM Transactions T, Products
P, Customers CWHERE T.pkey P.pkey AND T.ckey
C.ckey AND C.country USAGROUP BY
P.industry, C.state - We think that Industry3 in CA is interesting.
Year
Industry
CountryUSA
Quarter
Category
State
Month
Week
Product
City
Day
36Slice and Drill-Down
Year
IndustryIndustry3
Country
Quarter
Category
StateCA
Month
Week
Product
City
Day
37Corresponding Query in SQL
- SELECT SUM(units)FROM Transactions T, Products
P, Customers CWHERE T.pkey P.pkey AND T.ckey
C.ckey AND P.industry Industry3 AND C.state
CAGROUP BY P.category, C.city - We think that Category3 is interesting.
Year
IndustryIndustry3
Country
Quarter
Category
StateCA
Month
Week
Product
City
Day
38Slice and Drill-Down
Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
39Corresponding Query in SQL
- SELECT SUM(units)FROM Transactions T, Products
P, Customers CWHERE T.pkey P.pkey AND T.ckey
C.ckey AND C.state CA AND P.category
Category3GROUP BY P.product, C.city - Nothing new in this view of the data.
Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
40Pivot To (City, Year)
Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
41Corresponding Query in SQL
- SELECT SUM(units)FROM Transactions T, Products
P, Customers CWHERE T.pkey P.pkey AND T.ckey
C.ckey AND C.state CA AND P.category
Category3GROUP BY C.city, T.year
Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
42Multidimensional Data Analysis
- Set of data manipulation operators
- Roll-up Go up one step in a dimension hierarchy.
Example month -gt quarter - Drill-down Go down one step in a dimension
hierarchy. Example quarter -gt month - Slice Select a subset of the values of a
dimension. Example All categories -gt only
Category3 - Dice Select all values of a dimension. Example
Only Category3 -gt all categories - Pivot Select new dimensions to visualize the
data. Example Pivot to Time(quarter) and
Customer(state)
43Visual Intuition Cube
roll-up to category
Customer Data Mart
roll-up to state
SH
SF
LA
Product1 Product2Product3 Product4Product5Produ
ct6
20 30 20 15 10 50
Product
roll-up to week
M T W Th F S S
Time
50 Units of Product6 sold on Monday in LA
44OLAP Server Architectures
- Relational OLAP (ROLAP)
- Relational DBMS stores data mart
- OLAP middleware
- Aggregation and navigation logic
- Optimized for DBMS in the background, but slow
and complex - Basically only one vendor Microstrategy
- Multidimensional OLAP (MOLAP)
- Specialized array-based storage structure
- Vendors Hyperion (Essbase), Appix (iTM1),
Oracle, Microsoft
45OLAP Server Architectures
- Desktop OLAP (DOLAP)
- Performs OLAP directly at your PC
- Vendors Cognos (Powerplay), Business Objects,
Brio Technology, Hummingbird - Hybrids and Application OLAP
- More www.olapreport.com
46The OLAP Market
47Enterprise Reporting Tools
48Summary Multidimensional Analysis
- Spreadsheet style data analysis
- Roll-up, drill-down, slice, dice, and pivot your
way to interesting cells in the CUBE - Mainstream technology
- Established enterprises already have OLAP
installations - When establishing your e-business, OLAP will be
your first step in data analysis
49Data Mining
50Definition
- Data mining is the exploration and analysis of
large quantities of data in order to discover
valid, novel, potentially useful, and ultimately
understandable patterns in data. - Example pattern (Census Bureau Data)If
(relationship husband), then (gender male).
99.6
51Definition (Cont.)
- Data mining is the exploration and analysis of
large quantities of data in order to discover
valid, novel, potentially useful, and ultimately
understandable patterns in data. - Valid The patterns hold in general.
- Novel We did not know the pattern beforehand.
- Useful We can devise actions from the patterns.
- Understandable We can interpret and comprehend
the patterns.
52Why Use Data Mining Today?
- Human analysis skills are inadequate
- Volume and dimensionality of the data
- High data growth rate
- Availability of
- Data
- Storage
- Computational power
- Off-the-shelf software
53An Abundance of Data
- Supermarket scanners, POS data
- Preferred customer cards
- Credit card transactions
- Direct mail response
- Call center records
- Web server logs
- Customer web site trails
- ATM machines
- Demographic data
54Evolution of Database Technology
- 1960s IMS and network DBMS
- 1970s The relational data model, small
relational DBMS implementations - 1980s Maturing RDBMS, application-specific DBMS
(spatial data, scientific data, images, etc.) - 1990s Mature, high-performance RDBMS technology,
parallel DBMS, terabyte data warehouses,
object-relational DBMS, middleware and web
technology - 2000s High availability, maintainability,
seamless integration into business processes
55Computational Power
- Moores Law In 1965, Intel Corp. cofounder
Gordon Moore predicted that the density of
transistors in an integrated circuit would double
every year. - Later changed to reflect 18 months progress.
- Experts on ants estimate that there are 1016 to
1017 ants on earth. In the year 1997, we produced
one transistor per ant.
56OTS-Software for Data Mining
- ANGOSS KnowledgeStudio
- IBM Intelligent Miner
- Metaputer PolyAnalyst
- SAS Enterprise Miner
- SGI Mineset
- SPSS Clementine
- Many others
- More at http//www.kdnuggets.com/software
57Why Use Data Mining Today?
- And competitive pressure!
- The secret of success is to know something that
nobody else knows. - Aristotle Onassis
- Competition on service, not only on price (Banks,
phone companies, hotel chains, rental car
companies) - Personalization
- Your competitors already apply data mining
58The Knowledge Discovery Process
- Steps
- Identify business problem
- Data preprocessing
- Data selection Identify target datasets and
relevant fields - Data cleaning Remove noise and outliers
- Data transformation
- Create common units
- Generate new fields
- Data mining and model evaluation
- Deployment and measurement of the results
59Preprocessing and Mining
Knowledge
Patterns
PreprocessedData
TargetData
Interpretation
ModelConstruction
Original Data
Preprocessing
DataIntegrationand Selection
60What is a Data Mining Model?
- A data mining model is a description of a
specific aspect of a dataset. It produces output
values for an assigned set of input values. - Examples
- Linear regression model
- Classification model
- Clustering
61Data Mining Models (Contd.)
- A data mining model can be described at two
levels - Functional level
- Describes model in terms of its intended
usage.Examples Classification, clustering - Representational level
- Specific representation of a model.Example
Log-linear model, classification tree, nearest
neighbor method. - Black-box models versus transparent models
62Data Mining Techniques
- Supervised learning
- Classification and regression
- Unsupervised learning
- Clustering
- Dependency modeling
- Associations, summarization, causality
- Outlier and deviation detection
- Trend analysis and change detection
63Fields Related to Data Mining
- Database systems
- Machine learning
- Statistics
- Visualization
- Parallel computing
64Traditional Analysis Tools
- Predefined set of reports
- Capture most important known business questions
- Generated at fixed points in time
- No support for impromptu queries
- SQL Write your own SQL query.
- Statistics department within a company
- Slow reaction to users needs
- Limited understanding of the business questions
- Problems communicating the findings
65Example Application Sports
- IBM Advanced Scout analyzesNBA game statistics
- Shots blocked
- Assists
- Fouls
- http//www.research.ibm.com/scout/home.html
66Advanced Scout
- Example pattern An analysis of thedata from a
game played betweenthe New York Knicks and the
CharlotteHornets revealed that When Glenn Rice
played the shooting guard position, he shot 5/6
(83) on jump shots." - Pattern is interestingThe average shooting
percentage for the Charlotte Hornets during that
game was 54.
67Example Application Sky Survey
- Input data 3 TB of image data with 2 billion sky
objects, took more than six years to complete - Goal Generate a catalog with all objects and
their type - Method Use decision trees as data mining model
- Results
- 94 accuracy in predicting sky object classes
- Increased number of faint objects classified by
300 - Helped team of astronomers to discover 16 new
high red-shift quasars in one order of magnitude
less observation time
68Gold Nuggets?
- Investment firm mailing list Discovered that old
people do not respond to IRA mailings - Bank clustered their customers. One cluster
Older customers, no mortgage, less likely to have
a credit card - Bank of 1911
- Customer churn example
69Market Basket Analysis
- Consider shopping cart filled with several items
- Market basket analysis tries to answer the
following questions - Who makes purchases?
- What do customers buy together?
- In what order do customers purchase items?
70Market Basket Analysis
- Given
- A database of customer transactions
- Each transaction is a set of items
- ExampleTransaction with TID 111 contains items
Pen, Ink, Milk, Juice
71Market Basket Analysis (Contd.)
- Coocurrences
- 80 of all customers purchase items X, Y and Z
together. - Association rules
- 60 of all customers who purchase X and Y also
buy Z. - Sequential patterns
- 60 of customers who first buy X also purchase Y
within three weeks.
72Confidence and Support
- We prune the set of all possible association
rules using two interestingness measures - Confidence of a rule
- X gt Y has confidence c if P(YX) c
- Support of a rule
- X gt Y has support s if P(XY) s
- We can also define
- Support of an itemset (a coocurrence) XY
- XY has support s if P(XY) s
73Example
- Examples
- Pen gt MilkSupport 75Confidence 75
- Ink gt PenSupport 100Confidence 100
74Exercise
- Find all itemsets withsupport gt 75?
75Exercise
- Can you find all association rules with support
gt 50?
76Extensions
- Imposing constraints
- Only find rules involving the dairy department
- Only find rules involving expensive products
- Only find expensive rules
- Only find rules with whiskey on the right hand
side - Only find rules with milk on the left hand side
- Hierarchies on the items
- Calendars (every Sunday, every 1st of the month)
77Market Basket Analysis Applications
- Sample Applications
- Direct marketing
- Fraud detection for medical insurance
- Floor/shelf planning
- Web site layout
- Cross-selling
78Beyond Support and Confidence
- Example 5000 students
- 3000 students play basketball
- 3750 students eat cereal
- 2000 students both play basketball and eat cereal
79Misleading Association Rules
- Basketball gt Cereal (support 40, confidence
66.7) is misleading because 75 of students eat
cereal - Basketball gt No cereal (support 20,
confidence 33.3) is more interesting, although
with lower support and confidence
80Interest
- Interest of rule A gt B P(AB)/(P(A)P(B))
- Symmetric (uses both P(A) and P(B))
- Note that confidence is not symmetric(confidence
of rule A gt B P(AB)/P(A)) - Interest values
- Interest 1 A and B are independent
(P(AB)P(B)P(A)) - Interest gt 1 A and B are positively correlated
- Interest lt 1 A and B are negatively correlated
81Interest Example
82Extensions
- Imposing constraints
- Only find rules involving the dairy department
- Only find rules involving expensive products
- Only find expensive rules
- Only find rules with whiskey on the right hand
side - Only find rules with milk on the left hand side
- Hierarchies on the items
- Calendars (every Sunday, every 1st of the month)
83Applications
- Sample Applications
- Direct marketing
- Fraud detection for medical insurance
- Floor/shelf planning
- Web site layout
- Cross-selling
- Case study
84Summary
- Data preprocessing
- Quality analysis comes from quality data
- Multidimensional data analysis
- Interactive, spreadsheet-style data analysis
- Data mining
- The knowledge discovery process
- Association rules
85Questions?
- (In the fourth lecture More data mining.)