Title: Tutorial on Data Mining
1. Tutorial on Data Mining
- Workshop of the Indian Database Research Community
- Sunita Sarawagi
- School of IT, IIT Bombay
2. Data mining
- Process of semi-automatically analyzing large databases to find interesting and useful patterns
- Overlaps with machine learning, statistics, artificial intelligence and databases, but is
  - more scalable in number of features and instances
  - more automated to handle heterogeneous data
3. Outline
- Applications
- Usage scenarios
- Overview of operations
- Mining research groups
- Relevance in India
- Ten research problems
4. Applications
- Customer relationship management: identify those who are likely to leave for a competitor
- Targeted marketing: identify likely responders to promotions
- Fraud detection: telecommunications, financial transactions
- Manufacturing and production
- Medicine: disease outcome, effectiveness of treatments
- Molecular/pharmaceutical: identify new drugs
- Scientific data analysis
- Web site/store design and promotion
5. Usage scenarios
- Data warehouse mining
  - assimilate data from operational sources
  - mine static data
- Mining log data
- Continuous mining: example in process control
- Stages in mining: data selection → pre-processing/cleaning → transformation → mining → result evaluation → visualization
6. Some basic operations
- Predictive
- Regression
- Classification
- Descriptive
- Clustering / similarity matching
- Association rules and variants
- Deviation detection
7. Classification
- Given old data about customers and payments, predict a new applicant's loan eligibility.
(Figure: previous customers' records (age, salary, profession, location, customer type) train a classifier, which produces decision rules such as "Salary > 5L" or "Prof. = Exec" → good/bad; the rules are then applied to new applicants' data.)
8. Classification methods
- Goal: predict class Ci = f(x1, x2, ..., xn)
- Regression (linear or any other polynomial)
  - e.g. a·x1 + b·x2 + c = Ci
- Nearest neighbour
- Decision tree classifier: divide the decision space into piecewise constant regions
- Probabilistic/generative models
- Neural networks: partition by non-linear boundaries
9. Nearest neighbour
- Define proximity between instances, find the neighbours of a new instance and assign the majority class
- Case-based reasoning: when attributes are more complicated than real-valued
- Cons
  - Slow during application
  - No feature selection
  - Notion of proximity vague
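The nearest-neighbour idea above can be sketched in a few lines of Python (a toy illustration with invented (age, salary) data; a real system would index the training set rather than scan it):

```python
import math
from collections import Counter

def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest
    training instances under Euclidean proximity."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Sort training instances by distance to the new point, keep k.
    neighbours = sorted(train, key=lambda item: dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Invented data: ((age, salary in lakhs), customer type) pairs.
train = [((25, 3), "bad"), ((30, 8), "good"), ((45, 10), "good"),
         ((22, 2), "bad"), ((40, 9), "good")]
print(knn_predict(train, (35, 9), k=3))  # majority class of the 3 nearest
```

Note the "slow during application" con: every prediction scans all training instances.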
10. Decision trees
- Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
(Example tree: splits such as Salary < 1M, Prof = teacher, Age < 30 lead to class-label leaves.)
11. Algorithm for tree building
- Greedy top-down construction:
Gen_Tree(node, data)
    if the stopping criterion says make node a leaf: stop
    find the best attribute and the best split on that attribute (selection criteria)
    partition the data on the split condition
    for each child j of node: Gen_Tree(node_j, data_j)
12. Split criteria
- K classes; a set S of instances is partitioned into r subsets. Subset Sj contains a fraction pij of instances of class i.
- Information entropy: Entropy(Sj) = -Σi pij log pij
- Gini index: Gini(Sj) = 1 - Σi pij²
(Plot: impurity of a node as a function of the class fractions, for r = 1, k = 2.)
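Both impurity measures are straightforward to compute from a vector of class fractions; a small sketch:

```python
import math

def entropy(fracs):
    """Information entropy of a class-fraction vector (log base 2)."""
    return -sum(p * math.log2(p) for p in fracs if p > 0)

def gini(fracs):
    """Gini index: 1 minus the sum of squared class fractions."""
    return 1 - sum(p * p for p in fracs)

print(entropy([1.0]), gini([1.0]))            # pure node: zero impurity
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # maximally impure, k = 2
```

A pure node scores 0 under both measures; an even two-class split maximizes both (entropy 1.0, Gini 0.5), which is why a split that separates the classes lowers the criterion.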
13Scalable algorithm
rid A1 A2 A3 C
- Input table of records
- Vertically partition data and sort on ltattribute
value, classgt - Finding best split
- Scan and maintain class counts in memory and find
gini incrementally. - Performing split
- Use split attribute to build
- rid to L/R hash in memory.
- Divide other attributes using above hash table.
A2 C rid
A3 C rid
A1 C rid
14. Issues
- Preventing overfitting
  - Occam's razor: prefer the simplest hypothesis that fits the data
  - Tree pruning methods
    - Cross validation with separate test data
    - Minimum description length (MDL) criteria
- Multi-attribute tests on nodes to handle correlated attributes
  - Linear multivariate
  - Non-linear multivariate, e.g. a neural net at each node
- Methods of handling missing values
15. Pros and cons of decision trees
- Cons
  - Cannot handle complicated relationships between features
  - Simple decision boundaries
  - Problems with lots of missing data
- Pros
  - Reasonable training time
  - Fast application
  - Easy to interpret
  - Easy to implement
  - Can handle a large number of features
More information: http://www.stat.wisc.edu/limt/treeprogs.html
16. Neural networks
- Useful for learning complex data like handwriting, speech and image recognition
(Figure: decision boundaries learned by a neural network vs. a classification tree.)
17. Neural network
- Set of nodes connected by directed weighted edges
(Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3; a more typical NN with input nodes x1–x3, hidden nodes and output nodes.)
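The basic unit in the figure computes a weighted sum of its inputs passed through an activation function; a minimal sketch (the sigmoid activation and the example weights here are illustrative choices, not from the slide):

```python
import math

def nn_unit(inputs, weights, bias=0.0):
    """Basic NN unit: weighted sum of inputs squashed by a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps z to (0, 1)

# Three inputs x1..x3 with weights w1..w3, as in the figure.
out = nn_unit([1.0, 0.5, -1.0], [0.4, 0.2, 0.1])
print(round(out, 4))
```

A full network feeds such unit outputs into the hidden and output nodes; the non-linear activation is what lets the network form the non-linear class boundaries mentioned earlier.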
18. Pros and cons of neural networks
- Cons
  - Slow training time
  - Hard to interpret
  - Hard to implement: trial and error for choosing the number of nodes
- Pros
  - Can learn more complicated class boundaries
  - Fast application
  - Can handle a large number of features
Conclusion: use neural nets only if decision trees/NN fail.
19. Bayesian learning
- Assume a probability model for the generation of the data
- Apply Bayes' theorem to find the most likely class as c* = argmax_c P(c) · P(x1, ..., xn | c)
- Naïve Bayes: assume attributes conditionally independent given the class value
  - Easy to learn the probabilities by counting
  - Useful in some domains, e.g. text
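Learning the probabilities by counting, as the slide notes, can be sketched as follows (the weather/play data is invented for illustration; a real implementation would smooth zero counts):

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """Learn naive Bayes probabilities by counting.
    examples: list of (attribute-tuple, class) pairs."""
    class_counts = Counter(c for _, c in examples)
    attr_counts = defaultdict(Counter)  # (position, class) -> value counts
    for attrs, c in examples:
        for i, v in enumerate(attrs):
            attr_counts[(i, c)][v] += 1
    return class_counts, attr_counts, len(examples)

def predict_nb(model, attrs):
    """Pick the class maximizing P(c) * product of P(attr_i | c)."""
    class_counts, attr_counts, n = model
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n  # prior P(c)
        for i, v in enumerate(attrs):
            score *= attr_counts[(i, c)][v] / cc  # P(attr_i = v | c)
        if score > best_score:
            best, best_score = c, score
    return best

# Invented data: (weather, temperature) -> play?
data = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes"),
        (("sunny", "mild"), "yes"), (("rainy", "hot"), "no"),
        (("sunny", "mild"), "yes")]
model = train_nb(data)
print(predict_nb(model, ("sunny", "mild")))
```

The conditional-independence assumption is what lets the joint probability factor into these per-attribute counts.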
20. Bayesian belief network
- Find the joint probability over a set of variables, making use of conditional independence whenever known
- Learning the parameters is hard when there are hidden units: use gradient descent / EM algorithms
- Learning the structure of the network is harder
(Figure: example network over variables a–e, with a conditional probability table for b given a and d; variable e is independent of d given b.)
21. Clustering
- Unsupervised learning: used when old data with class labels is not available, e.g. when introducing a new product to a customer base
- Group/cluster existing customers based on time series of payment history, such that similar customers fall in the same cluster
- Identify micro-markets and develop policies for each
- Key requirement: a good measure of similarity between instances
22. Distance functions
- Numeric data: Euclidean, Manhattan distances
- Categorical data: 0/1 to indicate presence/absence, followed by
  - Hamming distance (# of dissimilarities)
  - Jaccard coefficient: (# of common 1s) / (# of 1s)
  - data-dependent measures: similarity of A and B depends on co-occurrence with C
- Combined numeric and categorical data: weighted normalized distance
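The two categorical measures above, sketched on 0/1 presence vectors:

```python
def hamming(a, b):
    """Number of positions where the 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Common 1s divided by positions where either vector has a 1."""
    common = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return common / either if either else 1.0

a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]
print(hamming(a, b))  # disagree at positions 1 and 2 -> 2
print(jaccard(a, b))  # 2 common 1s / 4 positions with a 1 -> 0.5
```

Jaccard ignores shared absences (0/0 positions), which is why it is often preferred over Hamming for sparse categorical data.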
23. Distance functions on high-dimensional data
- Examples: time series, text, images
- Euclidean measures make all points equally far
- Reduce the number of dimensions
  - choose a subset of the original features, using random projections or feature selection techniques
  - transform the original features using statistical methods like Principal Component Analysis
- Define domain-specific similarity measures: e.g. for images, define features like number of objects or colour histogram; for time series, define shape-based measures
- Define non-distance-based (model-based) clustering methods
24. Clustering methods
- Hierarchical clustering
  - agglomerative vs. divisive
  - single link vs. complete link
- Partitional clustering
  - distance-based: K-means
  - model-based: EM
  - density-based
25. Partitional methods: K-means
- Criteria: minimize the sum of squared distances
  - between each point and the centroid of its cluster, or
  - between each pair of points in the cluster
- Algorithm
  - Select an initial partition with K clusters: random, first K, or K well-separated points
  - Repeat until stabilization
    - Assign each point to the closest cluster center
    - Generate new cluster centers
    - Adjust clusters by merging/splitting
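The loop above, minus the merge/split adjustment step, can be sketched as follows (random initialization over toy 2-D points; a real implementation would also test for convergence):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: random initial centers, then alternate the
    assignment step and the centroid-update step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Update step: recompute each non-empty centroid.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # the two natural groups of 3
```

On well-separated data like this, the iteration recovers the two groups regardless of which points are drawn as initial centers; on harder data it may stop at a local optimum, as the next slide notes.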
26. Properties
- May not reach the global optimum
- Converges fast in practice; guaranteed for certain forms of the optimization function
- Complexity: O(K·n·d·I)
  - I = number of iterations, n = number of points, d = number of dimensions, K = number of clusters
- Database research on scalable algorithms
  - BIRCH: one/two passes of the data, keeping an R-tree-like index in memory [SIGMOD 96]
27. Model-based clustering
- Assume the data is generated from K probability distributions; need to find the distribution parameters
- EM algorithm: K Gaussian mixtures
- Iterate between two steps
  - Expectation step: assign points to clusters
  - Maximization step: estimate model parameters
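A minimal sketch of the two-step iteration for a mixture of two 1-D Gaussians (the min/max initialization and the variance floor are illustrative choices, not from the slide):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iters=100):
    """EM for a two-component 1-D Gaussian mixture."""
    mu1, mu2 = min(data), max(data)  # crude initialization
    var1 = var2 = 1.0
    w1 = w2 = 0.5
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point.
        r = []
        for x in data:
            p1 = w1 * normal_pdf(x, mu1, var1)
            p2 = w2 * normal_pdf(x, mu2, var2)
            r.append(p1 / (p1 + p2))
        # M-step: re-estimate weights, means and variances.
        n1 = sum(r); n2 = len(data) - n1
        w1, w2 = n1 / len(data), n2 / len(data)
        mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
        var1 = max(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / n1, 1e-6)
        var2 = max(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / n2, 1e-6)
    return (mu1, var1), (mu2, var2)

# Two invented groups centered near 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
g1, g2 = em_two_gaussians(data)
print(round(g1[0], 1), round(g2[0], 1))
```

Unlike K-means, the E-step assigns points to clusters softly (by probability), and the M-step re-estimates full distribution parameters rather than just centroids.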
28. Association rules
- Given a set T of groups of items
  - Example: a set of baskets of items purchased, e.g. T = { {milk, cereal}, {tea, milk}, {tea, rice, bread}, {cereal} }
- Goal: find all rules on itemsets of the form a → b such that
  - support of a and b > user threshold s
  - conditional probability (confidence) of b given a > user threshold c
- Example: milk → bread
- A lot of work has been done on scalable algorithms
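Support and confidence for the basket example above can be computed directly:

```python
def support(baskets, itemset):
    """Fraction of baskets containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(baskets, a, b):
    """Estimated conditional probability of b given a."""
    return support(baskets, set(a) | set(b)) / support(baskets, a)

baskets = [{"milk", "cereal"}, {"tea", "milk"},
           {"tea", "rice", "bread"}, {"cereal"}]
print(support(baskets, {"milk"}))                 # 2 of 4 baskets = 0.5
print(confidence(baskets, {"milk"}, {"cereal"}))  # 1 of 2 milk baskets = 0.5
```

The scalable algorithms the slide mentions avoid this brute-force scan per itemset by counting many candidate itemsets in a few passes over the data.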
29. Variants
- High confidence may not imply high correlation
- Use correlations: find the expected support, and treat large departures from it as interesting
  - Brin et al.: a limited attempt
  - More complete work exists in the statistical literature on contingency tables
- Still too many rules; need to prune...
- Does not imply causality as in Bayesian networks
30. Prevalent ≠ Interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena
(Cartoon: in 1995, "Milk and cereal sell together!" is news; years later, the same discovery is no longer interesting.)
31. What makes a rule surprising?
- Does not match prior expectation
  - e.g. the correlation between milk and cereal remains roughly constant over time
- Cannot be trivially derived from simpler rules
  - Milk: 10%, cereal: 10%
  - Milk and cereal: 10% (surprising)
  - Eggs: 10%
  - Milk, cereal and eggs: 0.1% (surprising!)
  - Expected: 1%
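Under independence, the expected support of an itemset is just the product of the individual item supports, which is what makes the figures above surprising:

```python
def expected_support(*marginals):
    """Expected itemset support if items occur independently:
    the product of the individual item supports."""
    prod = 1.0
    for p in marginals:
        prod *= p
    return prod

# Milk, cereal and eggs each appear in 10% of baskets.
print(expected_support(0.10, 0.10))        # milk & cereal: ~0.01 (1%)
print(expected_support(0.10, 0.10, 0.10))  # all three: ~0.001 (0.1%)
```

Observing 10% support for milk and cereal against an expected 1% is a large positive departure; observing 0.1% for all three against an expected 1% (given that milk and cereal already co-occur 10% of the time) is a large negative one, and both are flagged as surprising.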
32. Applications of fast itemset counting
- Find correlated events
  - Applications in medicine: find redundant tests
  - Cross-selling in retail, banking
  - Improve the predictive capability of classifiers that assume attribute independence
  - New similarity measures for categorical attributes [Mannila et al., KDD 98]
33. Mining market
- Around 20 to 30 mining tool vendors; roughly 1/5th the size of the OLAP market
- Major players
  - Clementine
  - IBM's Intelligent Miner
  - SGI's MineSet
  - SAS's Enterprise Miner
- All offer pretty much the same set of tools
- Many embedded products: fraud detection, electronic commerce applications
34. Integrating mining with DBMS
- Need to
  - intermix operations
  - iterate through results
  - flexibly query and filter results and data
- The existing file-based, batched approach is not satisfactory
- Research challenge: identify a collection of primitive, composable operators, as in a relational DBMS, and build a mining engine
35. OLAP-mining integration
- OLAP (On-Line Analytical Processing)
  - Multidimensional view of data: factors are dimensions; the quantity to be analyzed forms the measures/cells
  - Facilitates fast interactive exploration of multidimensional aggregates
- OLAP products provide a minimal set of tools for analysis
- Heavy reliance on manual operations for analysis
  - tedious and error-prone on large multidimensional data
- Ideal platform for vertical integration of mining, but needs to be interactive instead of batch
36. State of the art in mining-OLAP integration
- Decision trees [Information Discovery, Cognos]
  - find factors influencing high profits
- Clustering [Pilot Software]
  - segment customers to define a hierarchy on that dimension
- Time series analysis [Seagate's Holos]
  - query for various shapes along time: spikes, outliers etc.
- Multi-level associations [Han et al.]
  - find associations between members of dimensions
37. New approach
- Identify complex operations with specific OLAP needs in mind (what does an analyst need?), rather than looking at mining operations and choosing what fits
- Two examples
  - Exceptions in data to guide exploration
    - One reason for manual exploration is to make sure that there are no surprises
    - Pre-mine abnormalities in the data and point them out to analysts using highlights at aggregate levels
  - Reasons for specific "why" questions at the aggregate level
    - most compactly represent the answer so that the user can quickly assimilate it
38. Vertical integration: mining on the web
- Web log analysis for site design
  - what are the popular pages
  - what links are hard to find
- Electronic stores: sales enhancements
  - recommendations, advertisements
- Collaborative filtering [Net Perception, Wisewire]
- Inventory control: what was a shopper looking for and could not find?
39. Research problems
- Automatic model selection: different ways of solving the same problem; which one to use?
- Automatic classification of complex data types, especially time series data
- Refreshing mined results: explaining and modeling changes over time
- Quality of mined results: guarding against wrong conclusions, chance discoveries
- Incorporating domain knowledge to filter results and improve result quality
40. Research problems
- Close integration with the data sources to be mined
- Distributed mining: across multiple relations at a single site, or spread across multiple sites
- Integration with other data analysis tools, for example statistical tools, OLAP and SQL querying
- Interactive data mining: a toolkit of micro-operators
- Mixed-media mining: link textual reports with images and numeric fields
41. Relevance in India
- Emerging application areas, especially in the banking and retail industries and manufacturing processes
- Mining large scientific databases: export laws might require indigenous technology
- Rich research area with interesting algorithmic components; just need to implement
- Too expensive to purchase US/European products
42.
- Need to build usable prototypes, not simply tweak algorithms for publications
43. Summary
- What data mining is, and an overview of the various operations
- Classification: regression, nearest neighbour, neural networks, Bayesian
- Clustering: distance-based (K-means), distribution-based (EM)
- Itemset counting
- Several operations: the challenge is choosing the right operation for the problem
- New directions and identification of research problems