Title: An Introduction to Data Mining
1An Introduction to Data Mining
- Ling Chen
- lchen_at_L3S.de
Slides courtesy of:
http://www.cse.iitb.ac.in/dbms/Data/Talks/datamining-intro-IEP.ppt
http://www-users.cs.umn.edu/kumar/dmbook/figures/chap1.ppt
2Why Data Mining? Commercial Viewpoint
- Lots of data is being collected and warehoused
- Web data, e-commerce
- purchases at department/grocery stores
- Bank/Credit Card transactions
- Computers have become cheaper and more powerful
- Competitive Pressure is strong
- Provide better, customized services for an edge (e.g., in Customer Relationship Management)
3Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
- remote sensors on a satellite
- telescopes scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
- in classifying and segmenting data
- in Hypothesis Formation
4Mining Large Data Sets - Motivation
- There is often information hidden in the data that is not readily evident.
- Human analysts may take weeks to discover useful information.
5What is Data Mining?
- Many Definitions
- Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
- Exploratory analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
6What is (not) Data Mining?
- What is Data Mining?
- Certain names are more prevalent in certain US locations (e.g., O'Brien, O'Rourke, O'Reilly in the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com)
- What is not Data Mining?
- Look up phone number in phone directory
- Query a Web search engine for information about
Amazon
7Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
[Figure: data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems]
8Data Mining Tasks
- Prediction Methods
- Use some variables to predict unknown or future values of other variables.
- Description Methods
- Find human-interpretable patterns that describe the data.
9Data Mining Tasks...
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
10- Classification (Supervised learning)
11Classification
Given old data about customers and payments, predict a new applicant's loan eligibility.
[Figure: previous customer records (age, salary, profession, location, customer type) train a classifier; the learned decision tree (e.g., salary > 5K, profession = exec) labels new applicants as good or bad]
12Classification methods
- Goal: predict class Ci = f(x1, x2, ..., xn)
- Methods
- Regression (linear or any other polynomial)
- e.g., a*x1 + b*x2 + c = Ci
- Nearest neighbour
- Decision tree classifier: divide the decision space into piecewise constant regions
- Neural networks: partition by non-linear boundaries
- Bayesian classifiers
- SVM
- ...
13Nearest neighbor
- Define proximity between instances; find the neighbors of a new instance and assign the majority class (a minimal sketch follows)
- Cons
- Slow during application
- No feature selection
- Notion of proximity can be vague
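A minimal Python sketch of the majority-vote nearest-neighbour idea above; the toy training pairs and the new point are invented for illustration, and Euclidean distance is one possible proximity measure.

from collections import Counter
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, new_point, k=3):
    # train: list of (feature_vector, class_label) pairs
    neighbours = sorted(train, key=lambda xy: euclidean(xy[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Invented (age, salary) examples labelled good/bad
train = [((25, 3.0), "bad"), ((40, 6.5), "good"), ((35, 5.2), "good"),
         ((22, 2.1), "bad"), ((50, 8.0), "good")]
print(knn_predict(train, (36, 5.5), k=3))   # majority class of the 3 nearest -> 'good'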
14Decision trees
A tree whose internal nodes are simple decision rules on one or more attributes and whose leaf nodes are predicted class labels.
15Decision tree classifiers
- Widely used learning method (a small sketch follows)
- Easy to interpret: can be re-represented as if-then-else rules
- Does not require prior knowledge of the data distribution; works well on noisy data
- Pros
- Reasonable training time
- Fast application
- Easy to implement
- Can handle a large number of features
- Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
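A hedged sketch of training a decision tree classifier on invented "loan eligibility" data, assuming scikit-learn is installed; the features, labels, and max_depth are illustrative choices, not part of the original slides.

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 2.0], [40, 6.5], [35, 5.2], [22, 1.8], [50, 8.0], [30, 4.0]]  # [age, salary in K]
y = ["bad", "good", "good", "bad", "good", "bad"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "salary"]))  # if-then-else view of the tree
print(tree.predict([[28, 5.5]]))                           # class for a new applicant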
16Neural networks
A set of nodes connected by directed, weighted edges.
[Figure: a basic neural network unit]
17Neural networks
[Figure: a more typical neural network with inputs x1, x2, x3, a layer of hidden nodes, and output nodes]
18Neural networks
- Useful for learning complex data like handwriting, speech, and image recognition
[Figure: decision boundaries learned by linear regression, a classification tree, and a neural network]
19Pros and Cons of Neural Network
- Pros
- Can learn more complicated class boundaries
- Fast application
- Can handle a large number of features
- Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes (sketch below)
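A hedged sketch of a small feed-forward network on a toy non-linear (XOR-like) problem, assuming scikit-learn is installed; the data, the single hidden layer of 8 units, and the lbfgs solver are illustrative choices.

from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 25   # replicate the four XOR points
y = [0, 1, 1, 0] * 25                       # not linearly separable

# hidden_layer_sizes is the "number of nodes" typically chosen by trial and error
nn = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                   max_iter=5000, random_state=0)
nn.fit(X, y)
print(nn.predict([[0, 1], [1, 1]]))   # predicted labels for two new inputs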
20- Clustering (Unsupervised Learning)
21Clustering
- Unsupervised learning: used when old data with class labels is not available
- Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster
- Key requirement: a good measure of similarity between instances
22Distance functions
- Numeric data: Euclidean, Manhattan distances (sketches below)
- Categorical data: 0/1 to indicate absence/presence
- Hamming distance (dissimilarity): number of positions that differ
- Jaccard coefficient (similarity): number of common 1s / number of positions with a 1
- Data-dependent measures: similarity of A and B depends on co-occurrence with C
- Combined numeric and categorical data: weighted normalized distance
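Hedged pure-Python sketches of the measures named above; the example vectors are invented for illustration.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # number of positions where two 0/1 vectors disagree (a dissimilarity)
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    # similarity: shared 1s divided by positions where at least one vector has a 1
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0

print(euclidean([1, 2], [4, 6]))             # 5.0
print(manhattan([1, 2], [4, 6]))             # 7
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))   # 2
print(jaccard([1, 0, 1, 1], [1, 1, 0, 1]))   # 0.5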
23Clustering methods
- Hierarchical clustering
- agglomerative vs. divisive
- single link vs. complete link
- ...
- Partitional clustering
- distance-based: K-means
- model-based: GMM
- density-based: DBSCAN
24Agglomerative Hierarchical clustering
- Given a matrix of similarities between every pair of points
- Start with each point in a separate cluster and merge clusters based on some criterion (see the sketch below)
- Single link: merge the two clusters for which the minimum distance between two points from the two different clusters is smallest
- Complete link: merge the two clusters for which the maximum distance between two points from the two different clusters is smallest
25Partitional methods K-means
- Criterion: minimize the sum of squared distances between each point and the centroid of its cluster
- Algorithm (sketched in code below)
- Randomly select K points as initial centroids
- Repeat until stabilization
- Assign each point to the closest centroid
- Generate new cluster centroids
- Adjust clusters by merging/splitting
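A hedged pure-Python sketch of the K-means loop described above; the merge/split adjustment step is omitted for brevity, and the toy points are invented.

import random

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)                # 1. random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                # 2. assign to closest centroid
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new_centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)            # 3. recompute centroids
        ]
        if new_centroids == centroids:                  # stabilised
            break
        centroids = new_centroids
    return centroids, clusters

cents, _ = kmeans([(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)], k=2)
print(cents)   # roughly one centroid per dense group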
26K-Means
- Strength
- Easy to use.
- Efficient to calculate.
- Weakness
- Initialization problem
- Cannot handle clusters of different densities.
- Restricted to data for which there is a notion of a center/centroid.
27Model-based methods GMM
Each data point is viewed as an observation from a mixture of Gaussian distributions:
p(x) = sum_k pi_k N(x | mu_k, Sigma_k), where the mixture weights pi_k are non-negative and sum to 1.
(a fitting sketch follows)
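A hedged sketch of fitting a two-component Gaussian mixture, assuming scikit-learn and NumPy are installed; the 1-D data is sampled from two invented Gaussians purely for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200),
                    rng.normal(6, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)                 # mixture weights (the pi_k above)
print(gmm.means_.ravel())           # component means
print(gmm.predict([[0.2], [5.8]]))  # hard assignments from the fitted mixture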
28Model-based methods
- Strength
- More general than K-means
- Better representation of clusters
- Satisfies the statistical assumptions
- Weakness
- Inefficient in estimating the parameters
- Hard to choose the model (number of components)
- Problems with noise and outliers
29Density based method DBSCAN
Given the radius Eps and the threshold MinPts:
Core point: the number of points within its Eps-neighborhood exceeds the threshold MinPts.
Border point: not a core point, but lies within the Eps-neighborhood of a core point.
Outlier: neither a core point nor a border point.
30Density based method DBSCAN
- Label all points as core, border, or outlier points (a scikit-learn sketch follows this list).
- Eliminate the outlier points.
- Put an edge between all core points that are within Eps of each other.
- Make each group of connected core points into a separate cluster.
- Assign each border point to one of the clusters of its associated core points (ties may need to be broken).
31Density-based methods
- Strength
- Relatively resistant to noise.
- Handles clusters of arbitrary shapes and sizes.
- Weakness
- Problems with clusters of widely varying densities.
- Density is more difficult to define for high-dimensional data.
- Expensive: requires calculating all pairwise proximities.
32
33Association rules
- Input: a set of transactions (groups of items), e.g.
- {milk, cereal, bread}
- {tea, milk, bread}
- {milk, rice}
- {cereal}
- Goal: find all rules on itemsets of the form a -> b such that
- Support of a and b > threshold s
- Confidence (conditional probability) of b given a > threshold c
- Example: milk -> bread
- Support(milk, bread) = 2/4
- Confidence(milk -> bread) = 2/3
(a computation sketch follows)
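A hedged pure-Python sketch computing the support and confidence values from the example above, using the slide's four transactions.

transactions = [
    {"milk", "cereal", "bread"},
    {"tea", "milk", "bread"},
    {"milk", "rice"},
    {"cereal"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # conditional probability of rhs given lhs
    return support(lhs | rhs) / support(lhs)

print(support({"milk", "bread"}))        # 2/4 = 0.5
print(confidence({"milk"}, {"bread"}))   # (2/4) / (3/4) = 0.666...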
34Prevalent ≠ Interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena
[Cartoon: the discovery "Milk and cereal sell together!" is surprising in 1995 but merely prevalent, already-known knowledge later]
35Variants
- Frequent itemset mining / infrequent itemset mining
- Positive association rules / negative association rules
- Frequent itemset mining in high-dimensional data
- Frequent sub-tree mining
- Frequent sub-graph mining
36Other Issues
37Evaluation
- Classification
- Metric: classification accuracy
- Strategy: holdout, random sampling, cross-validation, bootstrap (sketch below)
- Clustering
- Cohesion, separation
- Association Rule Mining
- Efficiency w.r.t. thresholds
- Scalability w.r.t. thresholds
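A hedged sketch of estimating classification accuracy with a holdout split and with 5-fold cross-validation, assuming scikit-learn is installed; the bundled iris dataset and the decision tree are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: train on 70% of the data, measure accuracy on the held-out 30%
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(clf.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: average accuracy over 5 train/test splits
print(cross_val_score(clf, X, y, cv=5).mean())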
38Tools
- Weka
- http://www.cs.waikato.ac.nz/ml/weka/
- CLUstering Toolkit (CLUTO)
- http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
- SAS, SPSS
39Applications of Data Mining
- Web data mining
- Biological data mining
- Financial data mining
- Social network data mining
- ...
40
41Bayesian learning
- Assume a probability model for the generation of the data.
- Apply Bayes' theorem to find the most likely class as c* = argmax_c P(c | x1, ..., xn) = argmax_c P(x1, ..., xn | c) P(c)
- Naïve Bayes: assume the attributes are conditionally independent given the class value, so P(x1, ..., xn | c) = P(x1 | c) * ... * P(xn | c)
- Easy to learn the probabilities by counting (a counting sketch follows)
- Useful in some domains, e.g., text
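A hedged pure-Python sketch of naïve Bayes on text-like data, learning the probabilities by counting (with add-one smoothing); the tiny spam/ham documents are invented for illustration.

from collections import Counter, defaultdict

docs = [("buy cheap pills now", "spam"),
        ("cheap pills cheap",   "spam"),
        ("meeting at noon",     "ham"),
        ("lunch meeting today", "ham")]

class_counts = Counter(label for _, label in docs)      # counts for the prior P(c)
word_counts = defaultdict(Counter)                       # word_counts[label][word]
for text, label in docs:
    word_counts[label].update(text.split())

vocab = {w for text, _ in docs for w in text.split()}

def predict(text):
    best, best_p = None, 0.0
    for label in class_counts:
        p = class_counts[label] / len(docs)              # prior P(c)
        total = sum(word_counts[label].values())
        for w in text.split():                           # product of smoothed P(w | c)
            p *= (word_counts[label][w] + 1) / (total + len(vocab))
        if p > best_p:
            best, best_p = label, p
    return best

print(predict("cheap pills"))   # -> 'spam'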
42SVM
- "Perhaps the biggest limitation of the support
vector approach lies in choice of the
kernel."Burgess (1998) - "A second limitation is speed and size, both in
training and testing."Burgess (1998) - "Discete data presents another problem..."Burgess
(1998) - "...the optimal design for multiclass SVM
classifiers is a further area for
research."Burgess (1998) - "Although SVMs have good generalization
performance, they can be abysmally slow in test
phase, a problem addressed in (Burges, 1996
Osuna and Girosi, 1998)."Burgess (1998) - "Besides the advantages of SVMs - from a
practical point of view - they have some
drawbacks. An important practical question that
is not entirely solved, is the selection of the
kernel function parameters - for Gaussian kernels
the width parameter sigma - and the value of
epsilon in the epsilon-insensitive loss
function...more"Horváth (2003) in Suykens et
al. - "However, from a practical point of view perhaps
the most serious problem with SVMs is the high
algorithmic complexity and extensive memory
requirements of the required quadratic
programming in large-scale tasks."Horváth (2003)
in Suykens et al. p 392