Title: Tutorial on Data Mining
1. Tutorial on Data Mining
- Workshop of the Indian Database Research Community
- Sunita Sarawagi
- School of IT, IIT Bombay
2. Data mining
- Process of semi-automatically analyzing large databases to find interesting and useful patterns
- Overlaps with machine learning, statistics, artificial intelligence and databases, but is
  - more scalable in number of features and instances
  - more automated to handle heterogeneous data
3. Outline
- Applications
- Usage scenarios
- Overview of operations
- Mining research groups
- Relevance in India
- Ten research problems
4. Applications
- Customer relationship management: identify those who are likely to leave for a competitor
- Targeted marketing: identify likely responders to promotions
- Fraud detection: telecommunications, financial transactions
- Manufacturing and production
- Medicine: disease outcome, effectiveness of treatments
- Molecular/pharmaceutical: identify new drugs
- Scientific data analysis
- Web site/store design and promotion
5. Usage scenarios
- Data warehouse mining
  - assimilate data from operational sources
  - mine static data
- Mining log data
- Continuous mining: example in process control
- Stages in mining: data selection → pre-processing/cleaning → transformation → mining → result evaluation → visualization
6. Some basic operations
- Predictive
- Regression
- Classification
- Descriptive
- Clustering / similarity matching
- Association rules and variants
- Deviation detection
7. Classification
- Given old data about customers and payments, predict a new applicant's loan eligibility.
(Figure: previous customers' records (age, salary, profession, location, customer type) train a classifier, which produces decision rules such as "Salary > 5L" or "Prof. = Exec" → good/bad; the rules are then applied to new applicants' data.)
8. Classification methods
- Goal: predict class Ci = f(x1, x2, ..., xn)
- Regression (linear or any other polynomial)
  - e.g. a·x1 + b·x2 + c = Ci
- Nearest neighbour
- Decision tree classifier: divide the decision space into piecewise constant regions
- Probabilistic/generative models
- Neural networks: partition by non-linear boundaries
9. Nearest neighbour
- Define proximity between instances, find the neighbours of a new instance and assign the majority class
- Case-based reasoning: when attributes are more complicated than real-valued
- Cons
  - Slow during application
  - No feature selection
  - Notion of proximity vague
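The nearest-neighbour idea above can be sketched in a few lines of Python (a toy illustration with invented (age, salary) data; a real system would index the training set rather than scan it):

```python
import math
from collections import Counter

def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest
    training instances under Euclidean proximity."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Sort training instances by distance to the new point, keep k.
    neighbours = sorted(train, key=lambda item: dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Invented data: ((age, salary in lakhs), customer type) pairs.
train = [((25, 3), "bad"), ((30, 8), "good"), ((45, 10), "good"),
         ((22, 2), "bad"), ((40, 9), "good")]
print(knn_predict(train, (35, 9), k=3))  # majority class of the 3 nearest
```

Note the "slow during application" con: every prediction scans all training instances.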
10. Decision trees
- Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
(Example tree: splits such as Salary < 1M, Prof = teacher, Age < 30 lead to class-label leaves.)
11. Algorithm for tree building
- Greedy top-down construction:
Gen_Tree(node, data)
    if the stopping criterion says make node a leaf: stop
    find the best attribute and the best split on that attribute (selection criteria)
    partition the data on the split condition
    for each child j of node: Gen_Tree(node_j, data_j)
12. Split criteria
- K classes; a set S of instances is partitioned into r subsets. Subset Sj contains a fraction pij of instances of class i.
- Information entropy: Entropy(Sj) = -Σi pij log pij
- Gini index: Gini(Sj) = 1 - Σi pij²
(Plot: impurity of a node as a function of the class fractions, for r = 1, k = 2.)
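Both impurity measures are straightforward to compute from a vector of class fractions; a small sketch:

```python
import math

def entropy(fracs):
    """Information entropy of a class-fraction vector (log base 2)."""
    return -sum(p * math.log2(p) for p in fracs if p > 0)

def gini(fracs):
    """Gini index: 1 minus the sum of squared class fractions."""
    return 1 - sum(p * p for p in fracs)

print(entropy([1.0]), gini([1.0]))            # pure node: zero impurity
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # maximally impure, k = 2
```

A pure node scores 0 under both measures; an even two-class split maximizes both (entropy 1.0, Gini 0.5), which is why a split that separates the classes lowers the criterion.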
13Scalable algorithm
rid A1 A2 A3 C
- Input table of records
- Vertically partition data and sort on ltattribute
value, classgt - Finding best split
- Scan and maintain class counts in memory and find
gini incrementally. - Performing split
- Use split attribute to build
- rid to L/R hash in memory.
- Divide other attributes using above hash table.
A2 C rid
A3 C rid
A1 C rid
14. Issues
- Preventing overfitting
  - Occam's razor: prefer the simplest hypothesis that fits the data
  - Tree pruning methods
    - Cross validation with separate test data
    - Minimum description length (MDL) criteria
- Multi-attribute tests on nodes to handle correlated attributes
  - Linear multivariate
  - Non-linear multivariate, e.g. a neural net at each node
- Methods of handling missing values
15. Pros and cons of decision trees
- Cons
  - Cannot handle complicated relationships between features
  - Simple decision boundaries
  - Problems with lots of missing data
- Pros
  - Reasonable training time
  - Fast application
  - Easy to interpret
  - Easy to implement
  - Can handle a large number of features
More information: http://www.stat.wisc.edu/limt/treeprogs.html
16. Neural networks
- Useful for learning complex data like handwriting, speech and image recognition
(Figure: decision boundaries learned by a neural network vs. a classification tree.)
17. Neural network
- Set of nodes connected by directed weighted edges
(Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3; a more typical NN with input nodes x1–x3, hidden nodes and output nodes.)
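The basic unit in the figure computes a weighted sum of its inputs passed through an activation function; a minimal sketch (the sigmoid activation and the example weights here are illustrative choices, not from the slide):

```python
import math

def nn_unit(inputs, weights, bias=0.0):
    """Basic NN unit: weighted sum of inputs squashed by a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps z to (0, 1)

# Three inputs x1..x3 with weights w1..w3, as in the figure.
out = nn_unit([1.0, 0.5, -1.0], [0.4, 0.2, 0.1])
print(round(out, 4))
```

A full network feeds such unit outputs into the hidden and output nodes; the non-linear activation is what lets the network form the non-linear class boundaries mentioned earlier.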
18. Pros and cons of neural networks
- Cons
  - Slow training time
  - Hard to interpret
  - Hard to implement: trial and error for choosing the number of nodes
- Pros
  - Can learn more complicated class boundaries
  - Fast application
  - Can handle a large number of features
Conclusion: use neural nets only if decision trees/NN fail.
19. Bayesian learning
- Assume a probability model for the generation of the data
- Apply Bayes' theorem to find the most likely class as c* = argmax_c P(c) · P(x1, ..., xn | c)
- Naïve Bayes: assume attributes conditionally independent given the class value
  - Easy to learn the probabilities by counting
  - Useful in some domains, e.g. text
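Learning the probabilities by counting, as the slide notes, can be sketched as follows (the weather/play data is invented for illustration; a real implementation would smooth zero counts):

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """Learn naive Bayes probabilities by counting.
    examples: list of (attribute-tuple, class) pairs."""
    class_counts = Counter(c for _, c in examples)
    attr_counts = defaultdict(Counter)  # (position, class) -> value counts
    for attrs, c in examples:
        for i, v in enumerate(attrs):
            attr_counts[(i, c)][v] += 1
    return class_counts, attr_counts, len(examples)

def predict_nb(model, attrs):
    """Pick the class maximizing P(c) * product of P(attr_i | c)."""
    class_counts, attr_counts, n = model
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n  # prior P(c)
        for i, v in enumerate(attrs):
            score *= attr_counts[(i, c)][v] / cc  # P(attr_i = v | c)
        if score > best_score:
            best, best_score = c, score
    return best

# Invented data: (weather, temperature) -> play?
data = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes"),
        (("sunny", "mild"), "yes"), (("rainy", "hot"), "no"),
        (("sunny", "mild"), "yes")]
model = train_nb(data)
print(predict_nb(model, ("sunny", "mild")))
```

The conditional-independence assumption is what lets the joint probability factor into these per-attribute counts.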
20. Bayesian belief network
- Find the joint probability over a set of variables, making use of conditional independence whenever known
- Learning the parameters is hard when there are hidden units: use gradient descent / EM algorithms
- Learning the structure of the network is harder
(Figure: example network over variables a–e, with a conditional probability table for b given a and d; variable e is independent of d given b.)
21. Clustering
- Unsupervised learning: used when old data with class labels is not available, e.g. when introducing a new product to a customer base
- Group/cluster existing customers based on time series of payment history, such that similar customers fall in the same cluster
- Identify micro-markets and develop policies for each
- Key requirement: a good measure of similarity between instances
22. Distance functions
- Numeric data: Euclidean, Manhattan distances
- Categorical data: 0/1 to indicate presence/absence, followed by
  - Hamming distance (# of dissimilarities)
  - Jaccard coefficient: (# of common 1s) / (# of 1s)
  - data-dependent measures: similarity of A and B depends on co-occurrence with C
- Combined numeric and categorical data: weighted normalized distance
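The two categorical measures above, sketched on 0/1 presence vectors:

```python
def hamming(a, b):
    """Number of positions where the 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Common 1s divided by positions where either vector has a 1."""
    common = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return common / either if either else 1.0

a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]
print(hamming(a, b))  # disagree at positions 1 and 2 -> 2
print(jaccard(a, b))  # 2 common 1s / 4 positions with a 1 -> 0.5
```

Jaccard ignores shared absences (0/0 positions), which is why it is often preferred over Hamming for sparse categorical data.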
23. Distance functions on high-dimensional data
- Examples: time series, text, images
- Euclidean measures make all points equally far
- Reduce the number of dimensions
  - choose a subset of the original features, using random projections or feature selection techniques
  - transform the original features using statistical methods like Principal Component Analysis
- Define domain-specific similarity measures: e.g. for images, define features like number of objects or colour histogram; for time series, define shape-based measures
- Define non-distance-based (model-based) clustering methods
24. Clustering methods
- Hierarchical clustering
  - agglomerative vs. divisive
  - single link vs. complete link
- Partitional clustering
  - distance-based: K-means
  - model-based: EM
  - density-based
25. Partitional methods: K-means
- Criteria: minimize the sum of squared distances
  - between each point and the centroid of its cluster, or
  - between each pair of points in the cluster
- Algorithm
  - Select an initial partition with K clusters: random, first K, or K well-separated points
  - Repeat until stabilization
    - Assign each point to the closest cluster center
    - Generate new cluster centers
    - Adjust clusters by merging/splitting
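The loop above, minus the merge/split adjustment step, can be sketched as follows (random initialization over toy 2-D points; a real implementation would also test for convergence):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: random initial centers, then alternate the
    assignment step and the centroid-update step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Update step: recompute each non-empty centroid.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # the two natural groups of 3
```

On well-separated data like this, the iteration recovers the two groups regardless of which points are drawn as initial centers; on harder data it may stop at a local optimum, as the next slide notes.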
26. Properties
- May not reach the global optimum
- Converges fast in practice; guaranteed for certain forms of the optimization function
- Complexity: O(K·n·d·I)
  - I = number of iterations, n = number of points, d = number of dimensions, K = number of clusters
- Database research on scalable algorithms
  - BIRCH: one/two passes of the data, keeping an R-tree-like index in memory [SIGMOD 96]
27. Model-based clustering
- Assume the data is generated from K probability distributions; need to find the distribution parameters
- EM algorithm: K Gaussian mixtures
- Iterate between two steps
  - Expectation step: assign points to clusters
  - Maximization step: estimate model parameters
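A minimal sketch of the two-step iteration for a mixture of two 1-D Gaussians (the min/max initialization and the variance floor are illustrative choices, not from the slide):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iters=100):
    """EM for a two-component 1-D Gaussian mixture."""
    mu1, mu2 = min(data), max(data)  # crude initialization
    var1 = var2 = 1.0
    w1 = w2 = 0.5
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point.
        r = []
        for x in data:
            p1 = w1 * normal_pdf(x, mu1, var1)
            p2 = w2 * normal_pdf(x, mu2, var2)
            r.append(p1 / (p1 + p2))
        # M-step: re-estimate weights, means and variances.
        n1 = sum(r); n2 = len(data) - n1
        w1, w2 = n1 / len(data), n2 / len(data)
        mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
        var1 = max(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / n1, 1e-6)
        var2 = max(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / n2, 1e-6)
    return (mu1, var1), (mu2, var2)

# Two invented groups centered near 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
g1, g2 = em_two_gaussians(data)
print(round(g1[0], 1), round(g2[0], 1))
```

Unlike K-means, the E-step assigns points to clusters softly (by probability), and the M-step re-estimates full distribution parameters rather than just centroids.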
28. Association rules
- Given a set T of groups of items
  - Example: a set of baskets of items purchased, e.g. T = { {milk, cereal}, {tea, milk}, {tea, rice, bread}, {cereal} }
- Goal: find all rules on itemsets of the form a → b such that
  - support of a and b > user threshold s
  - conditional probability (confidence) of b given a > user threshold c
- Example: milk → bread
- A lot of work has been done on scalable algorithms
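Support and confidence for the basket example above can be computed directly:

```python
def support(baskets, itemset):
    """Fraction of baskets containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(baskets, a, b):
    """Estimated conditional probability of b given a."""
    return support(baskets, set(a) | set(b)) / support(baskets, a)

baskets = [{"milk", "cereal"}, {"tea", "milk"},
           {"tea", "rice", "bread"}, {"cereal"}]
print(support(baskets, {"milk"}))                 # 2 of 4 baskets = 0.5
print(confidence(baskets, {"milk"}, {"cereal"}))  # 1 of 2 milk baskets = 0.5
```

The scalable algorithms the slide mentions avoid this brute-force scan per itemset by counting many candidate itemsets in a few passes over the data.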
29. Variants
- High confidence may not imply high correlation
- Use correlations: find the expected support, and treat large departures from it as interesting
  - Brin et al.: a limited attempt
  - More complete work exists in the statistical literature on contingency tables
- Still too many rules; need to prune...
- Does not imply causality as in Bayesian networks
30. Prevalent ≠ Interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena
(Cartoon: in 1995, "Milk and cereal sell together!" is news; years later, the same discovery is no longer interesting.)
31. What makes a rule surprising?
- Does not match prior expectation
  - e.g. the correlation between milk and cereal remains roughly constant over time
- Cannot be trivially derived from simpler rules
  - Milk: 10%, cereal: 10%
  - Milk and cereal: 10% (surprising)
  - Eggs: 10%
  - Milk, cereal and eggs: 0.1% (surprising!)
  - Expected: 1%
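Under independence, the expected support of an itemset is just the product of the individual item supports, which is what makes the figures above surprising:

```python
def expected_support(*marginals):
    """Expected itemset support if items occur independently:
    the product of the individual item supports."""
    prod = 1.0
    for p in marginals:
        prod *= p
    return prod

# Milk, cereal and eggs each appear in 10% of baskets.
print(expected_support(0.10, 0.10))        # milk & cereal: ~0.01 (1%)
print(expected_support(0.10, 0.10, 0.10))  # all three: ~0.001 (0.1%)
```

Observing 10% support for milk and cereal against an expected 1% is a large positive departure; observing 0.1% for all three against an expected 1% (given that milk and cereal already co-occur 10% of the time) is a large negative one, and both are flagged as surprising.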
32. Applications of fast itemset counting
- Find correlated events
  - Applications in medicine: find redundant tests
  - Cross-selling in retail, banking
  - Improve the predictive capability of classifiers that assume attribute independence
  - New similarity measures for categorical attributes [Mannila et al., KDD 98]
33. Mining market
- Around 20 to 30 mining tool vendors; roughly 1/5th the size of the OLAP market
- Major players
  - Clementine
  - IBM's Intelligent Miner
  - SGI's MineSet
  - SAS's Enterprise Miner
- All offer pretty much the same set of tools
- Many embedded products: fraud detection, electronic commerce applications
34. Integrating mining with DBMS
- Need to
  - intermix operations
  - iterate through results
  - flexibly query and filter results and data
- The existing file-based, batched approach is not satisfactory
- Research challenge: identify a collection of primitive, composable operators, as in a relational DBMS, and build a mining engine
35. OLAP-mining integration
- OLAP (On-Line Analytical Processing)
  - Multidimensional view of data: factors are dimensions; the quantity to be analyzed forms the measures/cells
  - Facilitates fast interactive exploration of multidimensional aggregates
- OLAP products provide a minimal set of tools for analysis
- Heavy reliance on manual operations for analysis
  - tedious and error-prone on large multidimensional data
- Ideal platform for vertical integration of mining, but needs to be interactive instead of batch
36. State of the art in mining-OLAP integration
- Decision trees [Information Discovery, Cognos]
  - find factors influencing high profits
- Clustering [Pilot Software]
  - segment customers to define a hierarchy on that dimension
- Time series analysis [Seagate's Holos]
  - query for various shapes along time: spikes, outliers etc.
- Multi-level associations [Han et al.]
  - find associations between members of dimensions
37. New approach
- Identify complex operations with specific OLAP needs in mind (what does an analyst need?), rather than looking at mining operations and choosing what fits
- Two examples
  - Exceptions in data to guide exploration
    - One reason for manual exploration is to make sure that there are no surprises
    - Pre-mine abnormalities in the data and point them out to analysts using highlights at aggregate levels
  - Reasons for specific "why" questions at the aggregate level
    - most compactly represent the answer so that the user can quickly assimilate it
38. Vertical integration: mining on the web
- Web log analysis for site design
  - what are the popular pages
  - what links are hard to find
- Electronic stores: sales enhancements
  - recommendations, advertisements
- Collaborative filtering [Net Perception, Wisewire]
- Inventory control: what was a shopper looking for and could not find?
39. Research problems
- Automatic model selection: different ways of solving the same problem; which one to use?
- Automatic classification of complex data types, especially time series data
- Refreshing mined results: explaining and modeling changes over time
- Quality of mined results: guarding against wrong conclusions, chance discoveries
- Incorporating domain knowledge to filter results and improve result quality
40. Research problems
- Close integration with the data sources to be mined
- Distributed mining: across multiple relations at a single site, or spread across multiple sites
- Integration with other data analysis tools, for example statistical tools, OLAP and SQL querying
- Interactive data mining: a toolkit of micro-operators
- Mixed-media mining: link textual reports with images and numeric fields
41. Relevance in India
- Emerging application areas, especially in the banking and retail industries and manufacturing processes
- Mining large scientific databases: export laws might require indigenous technology
- Rich research area with interesting algorithmic components; just need to implement
- Too expensive to purchase US/European products
42.
- Need to build usable prototypes, not simply tweak algorithms for publications
43. Summary
- What data mining is, and an overview of the various operations
- Classification: regression, nearest neighbour, neural networks, Bayesian
- Clustering: distance-based (K-means), distribution-based (EM)
- Itemset counting
- Several operations: the challenge is choosing the right operation for the problem
- New directions and identification of research problems