The Software Infrastructure for Electronic Commerce

1
The Software Infrastructure for Electronic Commerce
  • Databases and Data Mining
  • Lecture 4: An Introduction to Data Mining (II)
  • Johannes Gehrke
  • johannes@cs.cornell.edu
  • http://www.cs.cornell.edu/johannes

2
Lectures Three and Four
  • Data preprocessing
  • Multidimensional data analysis
  • Data mining
  • Association rules
  • Classification trees
  • Clustering

3
Types of Attributes
  • Numerical: Domain is ordered and can be
    represented on the real line (e.g., age, income)
  • Nominal or categorical: Domain is a finite set
    without any natural ordering (e.g., occupation,
    marital status, race)
  • Ordinal: Domain is ordered, but absolute
    differences between values are unknown (e.g.,
    preference scale, severity of an injury); see the
    sketch below
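A minimal Python sketch (not from the slides; the attribute names and
values are made up) of how the three attribute types can be represented
with pandas:

```python
import pandas as pd

# Hypothetical records illustrating the three attribute types.
df = pd.DataFrame({
    "age": [23, 35, 47],                            # numerical: ordered, on the real line
    "occupation": ["student", "teacher", "nurse"],  # nominal: no natural ordering
    "severity": ["minor", "moderate", "severe"],    # ordinal: ordered labels
})

# Nominal: an unordered categorical domain.
df["occupation"] = pd.Categorical(df["occupation"])

# Ordinal: ordered categories; distances between levels stay undefined.
df["severity"] = pd.Categorical(
    df["severity"], categories=["minor", "moderate", "severe"], ordered=True
)
print(df["severity"].cat.codes.tolist())  # [0, 1, 2]
```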

4
Classification
  • Goal: Learn a function that assigns a record to
    one of several predefined classes

5
Classification Example
  • Example training database
  • Two predictor attributes: Age and Car-type
    (Sport, Minivan, and Truck)
  • Age is ordered; Car-type is a categorical
    attribute
  • Class label indicates whether the person bought
    the product
  • Dependent attribute is categorical

6
Regression Example
  • Example training database
  • Two predictor attributes: Age and Car-type
    (Sport, Minivan, and Truck)
  • Spent indicates how much the person spent during
    a recent visit to the web site
  • Dependent attribute is numerical

7
Types of Variables (Review)
  • Numerical: Domain is ordered and can be
    represented on the real line (e.g., age, income)
  • Nominal or categorical: Domain is a finite set
    without any natural ordering (e.g., occupation,
    marital status, race)
  • Ordinal: Domain is ordered, but absolute
    differences between values are unknown (e.g.,
    preference scale, severity of an injury)

8
Definitions
  • Random variables X1, …, Xk (predictor variables)
    and Y (dependent variable)
  • Xi has domain dom(Xi), Y has domain dom(Y)
  • P is a probability distribution on
    dom(X1) × … × dom(Xk) × dom(Y); the training
    database D is a random sample from P
  • A predictor d is a function
    d: dom(X1) × … × dom(Xk) → dom(Y)

9
Classification Problem
  • If Y is categorical, the problem is a
    classification problem, and we use C instead of
    Y; |dom(C)| = J
  • C is called the class label, d is called a
    classifier
  • Let r be a record randomly drawn from P. Define
    the misclassification rate of d:
    RT(d,P) = P(d(r.X1, …, r.Xk) ≠ r.C)
  • Problem definition: Given a dataset D that is a
    random sample from probability distribution P,
    find a classifier d such that RT(d,P) is
    minimized (see the sketch below)
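As a minimal sketch (not from the slides), the misclassification rate
can be estimated on a sample exactly as the definition reads; the
classifier d and the records below are hypothetical:

```python
# Estimate RT(d, P) = P(d(r.X1, ..., r.Xk) != r.C) on a random sample.
def misclassification_rate(d, records):
    errors = sum(1 for x, c in records if d(x) != c)
    return errors / len(records)

# Hypothetical classifier: x is a tuple of predictor values (Age, Car-type).
d = lambda x: "YES" if x[0] >= 30 else "NO"
sample = [((25, "Sports"), "NO"), ((40, "Minivan"), "YES"),
          ((22, "Truck"), "YES")]
print(misclassification_rate(d, sample))  # 1 of 3 records misclassified
```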

10
Regression Problem
  • If Y is numerical, the problem is a regression
    problem
  • Y is called the dependent variable, d is called a
    regression function
  • Let r be a record randomly drawn from P. Define
    the mean squared error of d:
    RT(d,P) = E[(r.Y - d(r.X1, …, r.Xk))^2]
  • Problem definition: Given a dataset D that is a
    random sample from probability distribution P,
    find a regression function d such that RT(d,P) is
    minimized (see the sketch below)
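The corresponding empirical estimate for regression is the sample
average of the squared errors; again a minimal sketch with a
hypothetical regression function:

```python
# Estimate RT(d, P) = E[(r.Y - d(r.X1, ..., r.Xk))^2] on a sample.
def mean_squared_error(d, records):
    return sum((y - d(x)) ** 2 for x, y in records) / len(records)

# Hypothetical regression function: predicted spending grows with age.
d = lambda x: 2.0 * x[0]
sample = [((25, "Sports"), 60.0), ((40, "Minivan"), 75.0)]
print(mean_squared_error(d, sample))  # ((60-50)^2 + (75-80)^2) / 2 = 62.5
```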

11
Goals and Requirements
  • Goals
  • To produce an accurate classifier/regression
    function
  • To understand the structure of the problem
  • Requirements on the model
  • High accuracy
  • Understandable by humans, interpretable
  • Fast construction for very large training
    databases

12
Different Types of Classifiers
  • Linear discriminant analysis (LDA)
  • Quadratic discriminant analysis (QDA)
  • Density estimation methods
  • Nearest neighbor methods
  • Logistic regression
  • Neural networks
  • Fuzzy set theory
  • Decision Trees

13
Difficulties with LDA and QDA
  • Multivariate normal assumption often not true
  • Not designed for categorical variables
  • Form of classifier in terms of linear or
    quadratic discriminant functions is hard to
    interpret

14
Histogram Density Estimation
  • Curse of dimensionality
  • Cell boundaries are discontinuities. Beyond the
    boundary cells, the estimate falls abruptly to
    zero (see the sketch below)
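A minimal one-dimensional sketch of the boundary problem (the data and
grid are made up): the estimate is piecewise constant, jumps at cell
boundaries, and is exactly zero outside the outermost cells.

```python
import numpy as np

data = np.array([1.0, 1.2, 1.9, 2.5, 2.6, 3.1])
counts, edges = np.histogram(data, bins=4, range=(0.0, 4.0))
width = edges[1] - edges[0]
density = counts / (len(data) * width)  # piecewise-constant density

def hist_density(x):
    if x < edges[0] or x >= edges[-1]:
        return 0.0                      # abrupt fall to zero beyond the grid
    return density[int((x - edges[0]) // width)]

print(hist_density(2.4), hist_density(5.0))  # inside vs. outside the grid
```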

15
Kernel Density Estimation
  • How to choose the kernel bandwidth h? (see the
    sketch below)
  • The optimal h depends on the optimality criterion
  • The optimal h depends on the form of the kernel
  • The optimal h might depend on the class label
  • The optimal h might depend on the part of the
    predictor space
  • How to choose the form of the kernel?
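A minimal sketch of a one-dimensional Gaussian kernel density estimate
(data and bandwidths are made up), showing how strongly the estimate
depends on the choice of h:

```python
import numpy as np

def kde(x, data, h):
    u = (x - data) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return kernel.sum() / (len(data) * h)

data = np.array([1.0, 1.2, 1.9, 2.5, 2.6, 3.1])
for h in (0.1, 0.5, 2.0):        # too small, plausible, too large
    print(h, kde(2.0, data, h))  # the estimate varies markedly with h
```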

16
K-Nearest Neighbor Methods
  • Difficulties (see the sketch below):
  • To classify a new record, all training data must
    be stored and available
  • Computationally expensive in high dimensions
  • The right choice of k is not known in advance
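A minimal k-nearest-neighbor sketch (data made up) that makes the
difficulties visible: the full training set is kept in memory, and every
prediction computes the distance to all records.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every record
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Hypothetical records: (age, sports-car indicator) with a class label.
X_train = np.array([[25.0, 1.0], [40.0, 0.0], [22.0, 1.0], [55.0, 0.0]])
y_train = ["NO", "YES", "NO", "YES"]
print(knn_predict(np.array([30.0, 1.0]), X_train, y_train, k=3))  # "NO"
```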

17
Difficulties with Logistic Regression
  • Few goodness-of-fit and model selection
    techniques
  • Categorical predictor variables have to be
    transformed into dummy vectors

18
Neural Networks and Fuzzy Set Theory
  • Difficulties
  • Classifiers are hard to understand
  • How to choose network topology and initial
    weights?
  • Categorical predictor variables?

19
What are Decision Trees?

[Figure: a decision tree whose root splits on Age (< 30 vs. >= 30); the
< 30 branch leads to a Car Type node splitting Minivan vs. Sports,
Truck; each branch ends in a YES or NO leaf.]
20
Decision Trees
  • A decision tree T encodes d (a classifier or
    regression function) in the form of a tree
  • A node t in T without children is called a leaf
    node; otherwise t is called an internal node

21
Internal Nodes
  • Each internal node has an associated splitting
    predicate; most common are binary predicates.
    Example predicates:
  • Age < 20
  • Profession in {student, teacher}
  • 5000·Age + 3·Salary - 10000 > 0

22
Internal Nodes Splitting Predicates
  • Binary univariate splits:
  • Numerical or ordered X: X < c, c in dom(X)
  • Categorical X: X in A, A a subset of dom(X)
  • Binary multivariate splits:
  • Linear combination split on numerical variables:
    Σ ai Xi < c
  • k-ary (k > 2) splits are analogous (see the
    sketch below)
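A minimal sketch of the split forms as Python predicates over a record;
the attribute names, constants, and coefficients are made up:

```python
def numeric_split(r):             # univariate, numerical/ordered: X < c
    return r["age"] < 30

def categorical_split(r):         # univariate, categorical: X in A
    return r["profession"] in {"student", "teacher"}

def linear_combination_split(r):  # multivariate: sum of a_i * X_i < c
    return 0.3 * r["age"] + 0.001 * r["salary"] < 60

r = {"age": 25, "profession": "student", "salary": 30000}
print(numeric_split(r), categorical_split(r), linear_combination_split(r))
```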

23
Leaf Nodes
  • Consider a leaf node t
  • Classification problem: Node t is labeled with
    one class label c in dom(C)
  • Regression problem: Two choices
  • Piecewise constant model: t is labeled with a
    constant y in dom(Y)
  • Piecewise linear model: t is labeled with a
    linear model Y = yt + Σ ai Xi

24
Example
  • Encoded classifier:
  • If (age < 30 and carType = Minivan) then YES
  • If (age < 30 and (carType = Sports or
    carType = Truck)) then NO
  • If (age >= 30) then NO

[Figure: the same decision tree as before, with the root split on Age
and a Car Type split on the Age < 30 branch.]
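Written out in code, the encoded classifier above is just nested
conditionals; a decision tree makes this structure explicit:

```python
def classify(age, car_type):
    if age < 30:
        if car_type == "Minivan":
            return "YES"
        else:              # Sports or Truck
            return "NO"
    else:                  # age >= 30
        return "NO"

print(classify(25, "Minivan"), classify(25, "Sports"), classify(45, "Truck"))
# -> YES NO NO
```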
25
Choice of Classification Algorithm?
  • Example study (Lim, Loh, and Shih, Machine
    Learning 2000)
  • 33 classification algorithms
  • 16 (small) data sets (UC Irvine ML Repository)
  • Each algorithm applied to each data set
  • Experimental measurements
  • Classification accuracy
  • Computational speed
  • Classifier complexity

26
Classification Algorithms
  • Tree-structured classifiers
  • IND, S-Plus Trees, C4.5, FACT, QUEST, CART, OC1,
    LMDT, CAL5, T1
  • Statistical methods
  • LDA, QDA, NN, LOG, FDA, PDA, MDA, POL
  • Neural networks
  • LVQ, RBF

27
Experimental Details
  • 16 primary data sets; 16 more data sets created
    by adding noise
  • Converted categorical predictor variables to 0-1
    dummy variables where necessary
  • Error rates for 6 data sets estimated from
    supplied test sets; 10-fold cross-validation used
    for the other data sets (see the sketch below)
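A minimal sketch of 10-fold cross-validation as used for the error
estimates above; the learner here is a made-up majority-class baseline,
standing in for any of the 33 algorithms:

```python
import numpy as np

def cross_val_error(train_fn, X, y, n_folds=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = train_fn(X[train], y[train])  # fit on the other 9 folds
        errors.append(np.mean(model(X[test]) != y[test]))
    return float(np.mean(errors))             # averaged held-out error

def majority(X_tr, y_tr):                     # hypothetical baseline learner
    vals = y_tr.tolist()
    label = max(set(vals), key=vals.count)
    return lambda X: np.array([label] * len(X))

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.array(["NO"] * 12 + ["YES"] * 8)
print(cross_val_error(majority, X, y))
```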

28
Ranking by Mean Error Rate
    Rank  Algorithm            Mean Error  Time
       1  Polyclass            0.195       3 hours
       2  Quest Multivariate   0.202       4 min
       3  Logistic Regression  0.204       4 min
       6  LDA                  0.208       10 s
       8  IND CART             0.215       47 s
      12  C4.5 Rules           0.220       20 s
      16  Quest Univariate     0.221       40 s

29
Other Results
  • Number of leaves for tree-based classifiers
    varied widely (median number of leaves between 5
    and 32, after removing some outliers)
  • Mean misclassification rates for the top 26
    algorithms are not statistically significantly
    different; the bottom 7 algorithms have
    significantly higher error rates

30
Decision Trees Summary
  • Powerful data mining model for classification
    (and regression) problems
  • Easy to understand and to present to
    non-specialists
  • Tips:
  • Even if black-box models sometimes give higher
    accuracy, construct a decision tree anyway
  • Construct decision trees with different splitting
    variables at the root of the tree

31
Clustering
  • Input: Relational database with fixed schema
  • Output: k groups of records called clusters, such
    that the records within a group are more similar
    to each other than to records in other groups
  • More difficult than classification (unsupervised
    learning: no record labels are given)
  • Usage:
  • Exploratory data mining
  • Preprocessing step (e.g., outlier detection)

32
Clustering (Contd.)
  • In clustering we partition a set of records into
    meaningful sub-classes called clusters
  • Cluster: a collection of data objects that are
    similar to one another and thus can be treated
    collectively as one group
  • Clustering helps users to detect inherent
    groupings and structure in a data set

33
Clustering (Contd.)
  • Example input database: Two numerical variables
  • How many groups are there?
  • Requirement: Need to define similarity between
    records

34
Graphical Representation
35
Clustering (Contd.)
  • Output of clustering:
  • Representative points for each cluster
  • Labeling of each record with its cluster number
  • Other descriptions of each cluster
  • Important: Use the right distance function (see
    the sketch below)
  • Scale or normalize all attributes (example:
    seconds vs. hours vs. days)
  • Assign different weights to attributes according
    to their importance
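A minimal sketch of such a distance function (the scales and weights are
made up): attributes are first normalized to comparable units, then
weighted by importance.

```python
import numpy as np

def weighted_distance(a, b, scale, weights):
    diff = (a - b) / scale              # normalize, e.g., seconds vs. days
    return np.sqrt(np.sum(weights * diff ** 2))

# Hypothetical records: (session length in seconds, account age in days).
a = np.array([3600.0, 30.0])
b = np.array([1800.0, 400.0])
scale = np.array([3600.0, 365.0])       # bring attributes to similar ranges
weights = np.array([1.0, 2.0])          # account age treated as more important
print(weighted_distance(a, b, scale, weights))
```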

36
Clustering Summary
  • Finding natural groups in data
  • Common post-processing steps (see the sketch
    below):
  • Build a decision tree with the cluster label as
    class label
  • Try to explain the groups using the decision tree
  • Visualize the clusters
  • Examine the differences between the clusters with
    respect to the fields of the dataset
  • Try different numbers of clusters
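A minimal sketch of the first two post-processing steps, assuming
scikit-learn is available (a modern library, not one of the tools
discussed in these slides): cluster the data, then fit a decision tree
on the cluster labels to explain the groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # two synthetic groups
               rng.normal(5, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)    # find groups
tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)  # explain them
print(export_text(tree))  # human-readable rules describing each cluster
```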

37
Web Usage Mining
  • Data sources:
  • Web server log
  • Information about the web site:
  • Site graph
  • Metadata about each page (type, objects shown)
  • Object concept hierarchies
  • Preprocessing:
  • Detect sessions and user context (cookies, user
    authentication, personalization); see the sketch
    below
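A minimal sessionization sketch for one user's requests; the 30-minute
timeout is a common heuristic, not something stated in the slides:

```python
from datetime import datetime, timedelta

def sessionize(timestamps, timeout=timedelta(minutes=30)):
    sessions, current = [], [timestamps[0]]
    for prev, t in zip(timestamps, timestamps[1:]):
        if t - prev > timeout:     # gap too large: start a new session
            sessions.append(current)
            current = []
        current.append(t)
    sessions.append(current)
    return sessions

hits = [datetime(2000, 1, 1, 10, 0), datetime(2000, 1, 1, 10, 10),
        datetime(2000, 1, 1, 12, 0)]       # hypothetical request times
print([len(s) for s in sessionize(hits)])  # -> [2, 1]
```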

38
Web Usage Mining (Contd.)
  • Data Mining
  • Association Rules
  • Sequential Patterns
  • Classification
  • Action
  • Personalized pages
  • Cross-selling
  • Evaluation and Measurement
  • Deploy personalized pages selectively
  • Measure effectiveness of each implemented action

39
Large Case Study: Churn
  • Telecommunications industry
  • Try to predict churn (whether a customer will
    switch long-distance carriers)
  • Dataset:
  • 5000 records (a tiny dataset, but manageable here
    in class)
  • 21 attributes, both numerical and categorical
    (very few attributes)
  • Data is already cleaned! No missing values,
    inconsistencies, etc. (again, for classroom
    purposes)

40
Churn Example Dataset Columns
  • State
  • Account length: Number of months the customer has
    been with the company
  • Area code
  • Phone number
  • International plan: yes/no
  • Voice mail: yes/no
  • Number of voice messages: Average number of voice
    messages per day
  • Total (day, evening, night, international)
    minutes: Average number of minutes charged
  • Total (day, evening, night, international) calls:
    Average number of calls made
  • Total (day, evening, night, international)
    charge: Average amount charged per day
  • Number customer service calls: Number of calls
    made to customer support in the last six months
  • Churned: Did the customer switch long-distance
    carriers in the last six months?

41
Churn Example Analysis
  • We start out by getting familiar with the dataset
    (see the sketch below)
  • Record viewer
  • Statistics visualization
  • Evidence classifier
  • Visualizing joint distributions
  • Visualizing geographic distribution of churn
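The slides use an interactive tool for this step; as a minimal stand-in
sketch, the same first look can be taken with pandas (churn.csv and the
exact column names below are hypothetical):

```python
import pandas as pd

df = pd.read_csv("churn.csv")            # hypothetical file with the data
print(df.shape)                          # expect (5000, 21)
print(df["Churned"].value_counts())      # class distribution
print(df.describe())                     # per-column summary statistics
print(df.groupby("Churned")["Number customer service calls"].mean())
```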

42
Churn Example Analysis (Contd.)
  • Building and interpreting data mining models
  • Decision trees
  • Clustering

43
Evaluating Data Mining Tools
44
Evaluating Data Mining Tools
  • Checklist
  • Integration with current applications and your
    data management infrastructure
  • Ease of use
  • Automation
  • Scalability to large datasets
  • Number of records
  • Number of attributes
  • Datasets larger than main memory
  • Support of sampling
  • Export of models into your enterprise
  • Stability of the company that offers the product

45
Integration With Data Management
  • Proprietary storage format?
  • Native support of major database systems
  • IBM DB2, Informix, Oracle, SQL Server, Sybase
  • ODBC
  • Support of parallel database systems
  • Integration with your data warehouse

46
Cost Considerations
  • Proprietary or commodity hardware and operating
    system
  • Client and server might be different
  • What server platforms are supported?
  • Support staff needed
  • Training of your staff members
  • Online training, tutorials
  • On-site training
  • Books, course material

47
Data Mining Projects
  • Checklist:
  • Start with well-defined business questions
  • Have a champion within the company
  • Define measures of success and failure
  • Main difficulty: No automation of the following
    steps:
  • Understanding the business problem
  • Selecting the relevant data
  • Data transformation
  • Selection of the right mining methods
  • Interpretation

48
Understand the Business Problem
  • Important questions
  • What is the problem that we need to solve?
  • Are there certain aspects of the problem that are
    especially interesting?
  • Do we need data mining to solve the problem?
  • What information is actionable, and when?
  • Are there important business rules that constrain
    our solution?
  • What people should we keep in the loop, and with
    whom should we discuss intermediate results?
  • Who are the (internal) customers of the effort?

49
Hiring Outside Experts?
  • Factors
  • One-time problem versus ongoing process
  • Source of data
  • Deployment of data mining models
  • Availability and skills of your own staff

50
Hiring Experts
  • Types of experts:
  • Your software vendor
  • Consulting companies/centers/individuals
  • Your goal: Develop in-house expertise

51
The Data Mining Market
  • Revenues for the data mining market: $8 billion
    (Meta Group, 1/1999)
  • Sales of data mining software (Two Crows
    Corporation, 6/99):
  • 1998: $50 million
  • 1999: $75 million
  • 2000: $120 million
  • Hardware companies often use their data mining
    software as loss leaders (examples: IBM, SGI)

52
Knowledge Management in General
  • Percent of information technology executives
    citing the systems used in their knowledge
    management strategy (IW, 4/1999):
  • Relational Database: 95%
  • Text/Document Search: 80%
  • Groupware: 71%
  • Data Warehouse: 65%
  • Data Mining Tools: 58%
  • Expert Database/AI Tools: 25%

53
Crossing the Chasm
  • Data mining is currently trying to cross this
    chasm
  • Great opportunities, but also great perils
  • You have a unique advantage if you apply data
    mining the right way
  • It is not yet common knowledge how to apply data
    mining the right way
  • There are no proven recipes for making a data
    mining project work (yet)

54
Summary
  • Database and data mining technology is crucial
    for any enterprise
  • We talked about the complete data management
    infrastructure
  • DBMS technology
  • Querying
  • WWW/DBMS integration
  • Data warehousing and dimensional modeling
  • OLAP
  • Data mining

55
Additional Material Web Sites
  • Data mining companies, jobs, courses,
    publications, datasets, etc.: www.kdnuggets.com
  • ACM Special Interest Group on Knowledge Discovery
    and Data Mining: www.acm.org/sigkdd

56
Additional Material Books
  • U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R.
    Uthurusamy, editors, Advances in Knowledge
    Discovery and Data Mining, AAAI/MIT Press, 1996
  • Michael Berry and Gordon Linoff, Data Mining
    Techniques: For Marketing, Sales and Customer
    Support, John Wiley & Sons, 1997
  • Ian Witten and Eibe Frank, Data Mining: Practical
    Machine Learning Tools and Techniques with Java
    Implementations, Oct 1999
  • Michael Berry and Gordon Linoff, Mastering Data
    Mining, John Wiley & Sons, 2000

57
Additional Material Database Systems
  • IBM DB2 www.ibm.com/software/data/db2
  • Oracle www.oracle.com
  • Sybase www.sybase.com
  • Informix www.informix.com
  • Microsoft www.microsoft.com/sql
  • NCR Teradata www.ncr.com/product/teradata

58
Questions?
  • "Prediction is very difficult, especially about
    the future."
  • Niels Bohr