The Software Infrastructure for Electronic Commerce

1
The Software Infrastructure for Electronic Commerce
  • Databases and Data Mining
  • Lecture 4: An Introduction to Data Mining (II)
  • Johannes Gehrke
  • johannes@cs.cornell.edu
  • http://www.cs.cornell.edu/johannes

2
Lectures Three and Four
  • Data preprocessing
  • Multidimensional data analysis
  • Data mining
  • Association rules
  • Classification trees
  • Clustering

3
Types of Attributes
  • Numerical: Domain is ordered and can be
    represented on the real line (e.g., age, income)
  • Nominal or categorical: Domain is a finite set
    without any natural ordering (e.g., occupation,
    marital status, race)
  • Ordinal: Domain is ordered, but absolute
    differences between values are unknown (e.g.,
    preference scale, severity of an injury); see the
    sketch below
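A minimal Python sketch (not from the slides; the attribute names and
values are made up) of how the three attribute types can be represented
with pandas:

```python
import pandas as pd

# Hypothetical records illustrating the three attribute types.
df = pd.DataFrame({
    "age": [23, 35, 47],                            # numerical: ordered, on the real line
    "occupation": ["student", "teacher", "nurse"],  # nominal: no natural ordering
    "severity": ["minor", "moderate", "severe"],    # ordinal: ordered labels
})

# Nominal: an unordered categorical domain.
df["occupation"] = pd.Categorical(df["occupation"])

# Ordinal: ordered categories; distances between levels stay undefined.
df["severity"] = pd.Categorical(
    df["severity"], categories=["minor", "moderate", "severe"], ordered=True
)
print(df["severity"].cat.codes.tolist())  # [0, 1, 2]
```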

4
Classification
  • Goal: Learn a function that assigns a record to
    one of several predefined classes

5
Classification Example
  • Example training database
  • Two predictor attributes: Age and Car-type
    (Sport, Minivan, and Truck)
  • Age is ordered; Car-type is a categorical
    attribute
  • Class label indicates whether the person bought
    the product
  • Dependent attribute is categorical

6
Regression Example
  • Example training database
  • Two predictor attributes: Age and Car-type
    (Sport, Minivan, and Truck)
  • Spent indicates how much the person spent during
    a recent visit to the web site
  • Dependent attribute is numerical

7
Types of Variables (Review)
  • Numerical: Domain is ordered and can be
    represented on the real line (e.g., age, income)
  • Nominal or categorical: Domain is a finite set
    without any natural ordering (e.g., occupation,
    marital status, race)
  • Ordinal: Domain is ordered, but absolute
    differences between values are unknown (e.g.,
    preference scale, severity of an injury)

8
Definitions
  • Random variables X1, …, Xk (predictor variables)
    and Y (dependent variable)
  • Xi has domain dom(Xi), Y has domain dom(Y)
  • P is a probability distribution on
    dom(X1) × … × dom(Xk) × dom(Y); the training
    database D is a random sample from P
  • A predictor d is a function
    d: dom(X1) × … × dom(Xk) → dom(Y)

9
Classification Problem
  • If Y is categorical, the problem is a
    classification problem, and we use C instead of
    Y; |dom(C)| = J
  • C is called the class label, d is called a
    classifier
  • Let r be a record randomly drawn from P. Define
    the misclassification rate of d:
    RT(d,P) = P(d(r.X1, …, r.Xk) ≠ r.C)
  • Problem definition: Given a dataset D that is a
    random sample from probability distribution P,
    find a classifier d such that RT(d,P) is
    minimized (see the sketch below)
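As a minimal sketch (not from the slides), the misclassification rate
can be estimated on a sample exactly as the definition reads; the
classifier d and the records below are hypothetical:

```python
# Estimate RT(d, P) = P(d(r.X1, ..., r.Xk) != r.C) on a random sample.
def misclassification_rate(d, records):
    errors = sum(1 for x, c in records if d(x) != c)
    return errors / len(records)

# Hypothetical classifier: x is a tuple of predictor values (Age, Car-type).
d = lambda x: "YES" if x[0] >= 30 else "NO"
sample = [((25, "Sports"), "NO"), ((40, "Minivan"), "YES"),
          ((22, "Truck"), "YES")]
print(misclassification_rate(d, sample))  # 1 of 3 records misclassified
```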

10
Regression Problem
  • If Y is numerical, the problem is a regression
    problem
  • Y is called the dependent variable, d is called a
    regression function
  • Let r be a record randomly drawn from P. Define
    the mean squared error of d:
    RT(d,P) = E[(r.Y - d(r.X1, …, r.Xk))^2]
  • Problem definition: Given a dataset D that is a
    random sample from probability distribution P,
    find a regression function d such that RT(d,P) is
    minimized (see the sketch below)
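The corresponding empirical estimate for regression is the sample
average of the squared errors; again a minimal sketch with a
hypothetical regression function:

```python
# Estimate RT(d, P) = E[(r.Y - d(r.X1, ..., r.Xk))^2] on a sample.
def mean_squared_error(d, records):
    return sum((y - d(x)) ** 2 for x, y in records) / len(records)

# Hypothetical regression function: predicted spending grows with age.
d = lambda x: 2.0 * x[0]
sample = [((25, "Sports"), 60.0), ((40, "Minivan"), 75.0)]
print(mean_squared_error(d, sample))  # ((60-50)^2 + (75-80)^2) / 2 = 62.5
```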

11
Goals and Requirements
  • Goals
  • To produce an accurate classifier/regression
    function
  • To understand the structure of the problem
  • Requirements on the model
  • High accuracy
  • Understandable by humans, interpretable
  • Fast construction for very large training
    databases

12
Different Types of Classifiers
  • Linear discriminant analysis (LDA)
  • Quadratic discriminant analysis (QDA)
  • Density estimation methods
  • Nearest neighbor methods
  • Logistic regression
  • Neural networks
  • Fuzzy set theory
  • Decision Trees

13
Difficulties with LDA and QDA
  • Multivariate normal assumption often not true
  • Not designed for categorical variables
  • Form of classifier in terms of linear or
    quadratic discriminant functions is hard to
    interpret

14
Histogram Density Estimation
  • Curse of dimensionality
  • Cell boundaries are discontinuities. Beyond the
    boundary cells, the estimate falls abruptly to
    zero (see the sketch below)
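A minimal one-dimensional sketch of the boundary problem (the data and
grid are made up): the estimate is piecewise constant, jumps at cell
boundaries, and is exactly zero outside the outermost cells.

```python
import numpy as np

data = np.array([1.0, 1.2, 1.9, 2.5, 2.6, 3.1])
counts, edges = np.histogram(data, bins=4, range=(0.0, 4.0))
width = edges[1] - edges[0]
density = counts / (len(data) * width)  # piecewise-constant density

def hist_density(x):
    if x < edges[0] or x >= edges[-1]:
        return 0.0                      # abrupt fall to zero beyond the grid
    return density[int((x - edges[0]) // width)]

print(hist_density(2.4), hist_density(5.0))  # inside vs. outside the grid
```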

15
Kernel Density Estimation
  • How to choose the kernel bandwidth h? (see the
    sketch below)
  • The optimal h depends on the optimality criterion
  • The optimal h depends on the form of the kernel
  • The optimal h might depend on the class label
  • The optimal h might depend on the part of the
    predictor space
  • How to choose the form of the kernel?
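A minimal sketch of a one-dimensional Gaussian kernel density estimate
(data and bandwidths are made up), showing how strongly the estimate
depends on the choice of h:

```python
import numpy as np

def kde(x, data, h):
    u = (x - data) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return kernel.sum() / (len(data) * h)

data = np.array([1.0, 1.2, 1.9, 2.5, 2.6, 3.1])
for h in (0.1, 0.5, 2.0):        # too small, plausible, too large
    print(h, kde(2.0, data, h))  # the estimate varies markedly with h
```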

16
K-Nearest Neighbor Methods
  • Difficulties (see the sketch below):
  • To classify a new record, all training data must
    be stored and available
  • Computationally expensive in high dimensions
  • The right choice of k is not known in advance
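A minimal k-nearest-neighbor sketch (data made up) that makes the
difficulties visible: the full training set is kept in memory, and every
prediction computes the distance to all records.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every record
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Hypothetical records: (age, sports-car indicator) with a class label.
X_train = np.array([[25.0, 1.0], [40.0, 0.0], [22.0, 1.0], [55.0, 0.0]])
y_train = ["NO", "YES", "NO", "YES"]
print(knn_predict(np.array([30.0, 1.0]), X_train, y_train, k=3))  # "NO"
```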

17
Difficulties with Logistic Regression
  • Few goodness-of-fit and model selection
    techniques
  • Categorical predictor variables have to be
    transformed into dummy vectors

18
Neural Networks and Fuzzy Set Theory
  • Difficulties
  • Classifiers are hard to understand
  • How to choose network topology and initial
    weights?
  • Categorical predictor variables?

19
What are Decision Trees?

[Figure: a decision tree whose root splits on Age (< 30 vs. >= 30); the
< 30 branch leads to a Car Type node splitting Minivan vs. Sports,
Truck; each branch ends in a YES or NO leaf.]
20
Decision Trees
  • A decision tree T encodes d (a classifier or
    regression function) in the form of a tree
  • A node t in T without children is called a leaf
    node; otherwise t is called an internal node

21
Internal Nodes
  • Each internal node has an associated splitting
    predicate; most common are binary predicates.
    Example predicates:
  • Age < 20
  • Profession in {student, teacher}
  • 5000·Age + 3·Salary - 10000 > 0

22
Internal Nodes Splitting Predicates
  • Binary univariate splits:
  • Numerical or ordered X: X < c, c in dom(X)
  • Categorical X: X in A, A a subset of dom(X)
  • Binary multivariate splits:
  • Linear combination split on numerical variables:
    Σ ai Xi < c
  • k-ary (k > 2) splits are analogous (see the
    sketch below)
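A minimal sketch of the split forms as Python predicates over a record;
the attribute names, constants, and coefficients are made up:

```python
def numeric_split(r):             # univariate, numerical/ordered: X < c
    return r["age"] < 30

def categorical_split(r):         # univariate, categorical: X in A
    return r["profession"] in {"student", "teacher"}

def linear_combination_split(r):  # multivariate: sum of a_i * X_i < c
    return 0.3 * r["age"] + 0.001 * r["salary"] < 60

r = {"age": 25, "profession": "student", "salary": 30000}
print(numeric_split(r), categorical_split(r), linear_combination_split(r))
```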

23
Leaf Nodes
  • Consider a leaf node t
  • Classification problem: Node t is labeled with
    one class label c in dom(C)
  • Regression problem: Two choices
  • Piecewise constant model: t is labeled with a
    constant y in dom(Y)
  • Piecewise linear model: t is labeled with a
    linear model Y = yt + Σ ai Xi

24
Example
  • Encoded classifier:
  • If (age < 30 and carType = Minivan) then YES
  • If (age < 30 and (carType = Sports or
    carType = Truck)) then NO
  • If (age >= 30) then NO

[Figure: the same decision tree as before, with the root split on Age
and a Car Type split on the Age < 30 branch.]
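Written out in code, the encoded classifier above is just nested
conditionals; a decision tree makes this structure explicit:

```python
def classify(age, car_type):
    if age < 30:
        if car_type == "Minivan":
            return "YES"
        else:              # Sports or Truck
            return "NO"
    else:                  # age >= 30
        return "NO"

print(classify(25, "Minivan"), classify(25, "Sports"), classify(45, "Truck"))
# -> YES NO NO
```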
25
Choice of Classification Algorithm?
  • Example study (Lim, Loh, and Shih, Machine
    Learning 2000)
  • 33 classification algorithms
  • 16 (small) data sets (UC Irvine ML Repository)
  • Each algorithm applied to each data set
  • Experimental measurements
  • Classification accuracy
  • Computational speed
  • Classifier complexity

26
Classification Algorithms
  • Tree-structured classifiers
  • IND, S-Plus Trees, C4.5, FACT, QUEST, CART, OC1,
    LMDT, CAL5, T1
  • Statistical methods
  • LDA, QDA, NN, LOG, FDA, PDA, MDA, POL
  • Neural networks
  • LVQ, RBF

27
Experimental Details
  • 16 primary data sets; 16 more data sets created
    by adding noise
  • Converted categorical predictor variables to 0-1
    dummy variables where necessary
  • Error rates for 6 data sets estimated from
    supplied test sets; 10-fold cross-validation used
    for the other data sets (see the sketch below)
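A minimal sketch of 10-fold cross-validation as used for the error
estimates above; the learner here is a made-up majority-class baseline,
standing in for any of the 33 algorithms:

```python
import numpy as np

def cross_val_error(train_fn, X, y, n_folds=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = train_fn(X[train], y[train])  # fit on the other 9 folds
        errors.append(np.mean(model(X[test]) != y[test]))
    return float(np.mean(errors))             # averaged held-out error

def majority(X_tr, y_tr):                     # hypothetical baseline learner
    vals = y_tr.tolist()
    label = max(set(vals), key=vals.count)
    return lambda X: np.array([label] * len(X))

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.array(["NO"] * 12 + ["YES"] * 8)
print(cross_val_error(majority, X, y))
```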

28
Ranking by Mean Error Rate
    Rank  Algorithm            Mean Error  Time
       1  Polyclass            0.195       3 hours
       2  Quest Multivariate   0.202       4 min
       3  Logistic Regression  0.204       4 min
       6  LDA                  0.208       10 s
       8  IND CART             0.215       47 s
      12  C4.5 Rules           0.220       20 s
      16  Quest Univariate     0.221       40 s

29
Other Results
  • Number of leaves for tree-based classifiers
    varied widely (median number of leaves between 5
    and 32, after removing some outliers)
  • Mean misclassification rates for the top 26
    algorithms are not statistically significantly
    different; the bottom 7 algorithms have
    significantly higher error rates

30
Decision Trees Summary
  • Powerful data mining model for classification
    (and regression) problems
  • Easy to understand and to present to
    non-specialists
  • Tips:
  • Even if black-box models sometimes give higher
    accuracy, construct a decision tree anyway
  • Construct decision trees with different splitting
    variables at the root of the tree

31
Clustering
  • Input: Relational database with fixed schema
  • Output: k groups of records called clusters, such
    that the records within a group are more similar
    to each other than to records in other groups
  • More difficult than classification (unsupervised
    learning: no record labels are given)
  • Usage:
  • Exploratory data mining
  • Preprocessing step (e.g., outlier detection)

32
Clustering (Contd.)
  • In clustering we partition a set of records into
    meaningful sub-classes called clusters
  • Cluster: a collection of data objects that are
    similar to one another and thus can be treated
    collectively as one group
  • Clustering helps users to detect inherent
    groupings and structure in a data set

33
Clustering (Contd.)
  • Example input database: Two numerical variables
  • How many groups are there?
  • Requirement: Need to define similarity between
    records

34
Graphical Representation
35
Clustering (Contd.)
  • Output of clustering:
  • Representative points for each cluster
  • Labeling of each record with its cluster number
  • Other descriptions of each cluster
  • Important: Use the right distance function (see
    the sketch below)
  • Scale or normalize all attributes (example:
    seconds vs. hours vs. days)
  • Assign different weights to attributes according
    to their importance
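A minimal sketch of such a distance function (the scales and weights are
made up): attributes are first normalized to comparable units, then
weighted by importance.

```python
import numpy as np

def weighted_distance(a, b, scale, weights):
    diff = (a - b) / scale              # normalize, e.g., seconds vs. days
    return np.sqrt(np.sum(weights * diff ** 2))

# Hypothetical records: (session length in seconds, account age in days).
a = np.array([3600.0, 30.0])
b = np.array([1800.0, 400.0])
scale = np.array([3600.0, 365.0])       # bring attributes to similar ranges
weights = np.array([1.0, 2.0])          # account age treated as more important
print(weighted_distance(a, b, scale, weights))
```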

36
Clustering Summary
  • Finding natural groups in data
  • Common post-processing steps (see the sketch
    below):
  • Build a decision tree with the cluster label as
    class label
  • Try to explain the groups using the decision tree
  • Visualize the clusters
  • Examine the differences between the clusters with
    respect to the fields of the dataset
  • Try different numbers of clusters
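A minimal sketch of the first two post-processing steps, assuming
scikit-learn is available (a modern library, not one of the tools
discussed in these slides): cluster the data, then fit a decision tree
on the cluster labels to explain the groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # two synthetic groups
               rng.normal(5, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)    # find groups
tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)  # explain them
print(export_text(tree))  # human-readable rules describing each cluster
```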

37
Web Usage Mining
  • Data sources:
  • Web server log
  • Information about the web site:
  • Site graph
  • Metadata about each page (type, objects shown)
  • Object concept hierarchies
  • Preprocessing:
  • Detect sessions and user context (cookies, user
    authentication, personalization); see the sketch
    below
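A minimal sessionization sketch for one user's requests; the 30-minute
timeout is a common heuristic, not something stated in the slides:

```python
from datetime import datetime, timedelta

def sessionize(timestamps, timeout=timedelta(minutes=30)):
    sessions, current = [], [timestamps[0]]
    for prev, t in zip(timestamps, timestamps[1:]):
        if t - prev > timeout:     # gap too large: start a new session
            sessions.append(current)
            current = []
        current.append(t)
    sessions.append(current)
    return sessions

hits = [datetime(2000, 1, 1, 10, 0), datetime(2000, 1, 1, 10, 10),
        datetime(2000, 1, 1, 12, 0)]       # hypothetical request times
print([len(s) for s in sessionize(hits)])  # -> [2, 1]
```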

38
Web Usage Mining (Contd.)
  • Data Mining
  • Association Rules
  • Sequential Patterns
  • Classification
  • Action
  • Personalized pages
  • Cross-selling
  • Evaluation and Measurement
  • Deploy personalized pages selectively
  • Measure effectiveness of each implemented action

39
Large Case Study: Churn
  • Telecommunications industry
  • Try to predict churn (whether a customer will
    switch long-distance carriers)
  • Dataset:
  • 5000 records (a tiny dataset, but manageable here
    in class)
  • 21 attributes, both numerical and categorical
    (very few attributes)
  • Data is already cleaned! No missing values,
    inconsistencies, etc. (again, for classroom
    purposes)

40
Churn Example Dataset Columns
  • State
  • Account length: Number of months the customer has
    been with the company
  • Area code
  • Phone number
  • International plan: yes/no
  • Voice mail: yes/no
  • Number of voice messages: Average number of voice
    messages per day
  • Total (day, evening, night, international)
    minutes: Average number of minutes charged
  • Total (day, evening, night, international) calls:
    Average number of calls made
  • Total (day, evening, night, international)
    charge: Average amount charged per day
  • Number customer service calls: Number of calls
    made to customer support in the last six months
  • Churned: Did the customer switch long-distance
    carriers in the last six months?

41
Churn Example Analysis
  • We start out by getting familiar with the dataset
    (see the sketch below)
  • Record viewer
  • Statistics visualization
  • Evidence classifier
  • Visualizing joint distributions
  • Visualizing geographic distribution of churn
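The slides use an interactive tool for this step; as a minimal stand-in
sketch, the same first look can be taken with pandas (churn.csv and the
exact column names below are hypothetical):

```python
import pandas as pd

df = pd.read_csv("churn.csv")            # hypothetical file with the data
print(df.shape)                          # expect (5000, 21)
print(df["Churned"].value_counts())      # class distribution
print(df.describe())                     # per-column summary statistics
print(df.groupby("Churned")["Number customer service calls"].mean())
```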

42
Churn Example Analysis (Contd.)
  • Building and interpreting data mining models
  • Decision trees
  • Clustering

43
Evaluating Data Mining Tools
44
Evaluating Data Mining Tools
  • Checklist
  • Integration with current applications and your
    data management infrastructure
  • Ease of use
  • Automation
  • Scalability to large datasets
  • Number of records
  • Number of attributes
  • Datasets larger than main memory
  • Support of sampling
  • Export of models into your enterprise
  • Stability of the company that offers the product

45
Integration With Data Management
  • Proprietary storage format?
  • Native support of major database systems
  • IBM DB2, Informix, Oracle, SQL Server, Sybase
  • ODBC
  • Support of parallel database systems
  • Integration with your data warehouse

46
Cost Considerations
  • Proprietary or commodity hardware and operating
    system
  • Client and server might be different
  • What server platforms are supported?
  • Support staff needed
  • Training of your staff members
  • Online training, tutorials
  • On-site training
  • Books, course material

47
Data Mining Projects
  • Checklist:
  • Start with well-defined business questions
  • Have a champion within the company
  • Define measures of success and failure
  • Main difficulty: No automation of the following
    steps:
  • Understanding the business problem
  • Selecting the relevant data
  • Data transformation
  • Selection of the right mining methods
  • Interpretation

48
Understand the Business Problem
  • Important questions
  • What is the problem that we need to solve?
  • Are there certain aspects of the problem that are
    especially interesting?
  • Do we need data mining to solve the problem?
  • What information is actionable, and when?
  • Are there important business rules that constrain
    our solution?
  • What people should we keep in the loop, and with
    whom should we discuss intermediate results?
  • Who are the (internal) customers of the effort?

49
Hiring Outside Experts?
  • Factors
  • One-time problem versus ongoing process
  • Source of data
  • Deployment of data mining models
  • Availability and skills of your own staff

50
Hiring Experts
  • Types of experts:
  • Your software vendor
  • Consulting companies/centers/individuals
  • Your goal: Develop in-house expertise

51
The Data Mining Market
  • Revenues for the data mining market: $8 billion
    (Meta Group, 1/1999)
  • Sales of data mining software (Two Crows
    Corporation, 6/99):
  • 1998: $50 million
  • 1999: $75 million
  • 2000: $120 million
  • Hardware companies often use their data mining
    software as loss leaders (examples: IBM, SGI)

52
Knowledge Management in General
  • Percent of information technology executives
    citing the systems used in their knowledge
    management strategy (IW, 4/1999):
  • Relational Database: 95%
  • Text/Document Search: 80%
  • Groupware: 71%
  • Data Warehouse: 65%
  • Data Mining Tools: 58%
  • Expert Database/AI Tools: 25%

53
Crossing the Chasm
  • Data mining is currently trying to cross this
    chasm
  • Great opportunities, but also great perils
  • You have a unique advantage if you apply data
    mining the right way
  • It is not yet common knowledge how to apply data
    mining the right way
  • There are no proven recipes for making a data
    mining project work (yet)

54
Summary
  • Database and data mining technology is crucial
    for any enterprise
  • We talked about the complete data management
    infrastructure
  • DBMS technology
  • Querying
  • WWW/DBMS integration
  • Data warehousing and dimensional modeling
  • OLAP
  • Data mining

55
Additional Material Web Sites
  • Data mining companies, jobs, courses,
    publications, datasets, etc.: www.kdnuggets.com
  • ACM Special Interest Group on Knowledge Discovery
    and Data Mining: www.acm.org/sigkdd

56
Additional Material Books
  • U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R.
    Uthurusamy, editors, Advances in Knowledge
    Discovery and Data Mining, AAAI/MIT Press, 1996
  • Michael Berry and Gordon Linoff, Data Mining
    Techniques: For Marketing, Sales and Customer
    Support, John Wiley & Sons, 1997
  • Ian Witten and Eibe Frank, Data Mining: Practical
    Machine Learning Tools and Techniques with Java
    Implementations, Oct 1999
  • Michael Berry and Gordon Linoff, Mastering Data
    Mining, John Wiley & Sons, 2000

57
Additional Material Database Systems
  • IBM DB2 www.ibm.com/software/data/db2
  • Oracle www.oracle.com
  • Sybase www.sybase.com
  • Informix www.informix.com
  • Microsoft www.microsoft.com/sql
  • NCR Teradata www.ncr.com/product/teradata

58
Questions?
  • "Prediction is very difficult, especially about
    the future."
  • Niels Bohr