Title: An Introduction to Data Mining
1An Introduction to Data Mining
- Ling Chen
- lchen_at_L3S.de
Slides courtesy of:
http://www.cse.iitb.ac.in/dbms/Data/Talks/datamining-intro-IEP.ppt
http://www-users.cs.umn.edu/kumar/dmbook/figures/chap1.ppt
2Why Data Mining? Commercial Viewpoint
- Lots of data is being collected and warehoused
- Web data, e-commerce
- purchases at department/grocery stores
- Bank/Credit Card transactions
- Computers have become cheaper and more powerful
- Competitive Pressure is strong
- Provide better, customized services for an edge (e.g., in Customer Relationship Management)
3Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
- remote sensors on a satellite
- telescopes scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
- in classifying and segmenting data
- in Hypothesis Formation
4Mining Large Data Sets - Motivation
- There is often information hidden in the data that is not readily evident.
- Human analysts may take weeks to discover useful information.
5What is Data Mining?
- Many Definitions
- Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
- Exploratory analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
6What is (not) Data Mining?
- What is Data Mining?
- Certain names are more prevalent in certain US locations (e.g., O'Brien, O'Rourke, O'Reilly in the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com)
- What is not Data Mining?
- Look up phone number in phone directory
- Query a Web search engine for information about
Amazon
7Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
[Figure: data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems]
8Data Mining Tasks
- Prediction Methods
- Use some variables to predict unknown or future values of other variables.
- Description Methods
- Find human-interpretable patterns that describe the data.
9Data Mining Tasks...
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
10- Classification (Supervised learning)
11Classification
Given old data about customers and payments, predict a new applicant's loan eligibility.
[Figure: previous customer records (age, salary, profession, location, customer type) train a classifier; the learned decision tree (e.g., salary > 5K, profession = exec) labels new applicants as good or bad]
12Classification methods
- Goal: predict class Ci = f(x1, x2, ..., xn)
- Methods
- Regression (linear or any other polynomial)
- e.g., a*x1 + b*x2 + c = Ci
- Nearest neighbour
- Decision tree classifier: divide the decision space into piecewise constant regions
- Neural networks: partition by non-linear boundaries
- Bayesian classifiers
- SVM
- ...
13Nearest neighbor
- Define proximity between instances; find the neighbors of a new instance and assign the majority class (a minimal sketch follows)
- Cons
- Slow during application
- No feature selection
- Notion of proximity can be vague
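A minimal Python sketch of the majority-vote nearest-neighbour idea above; the toy training pairs and the new point are invented for illustration, and Euclidean distance is one possible proximity measure.

from collections import Counter
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, new_point, k=3):
    # train: list of (feature_vector, class_label) pairs
    neighbours = sorted(train, key=lambda xy: euclidean(xy[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Invented (age, salary) examples labelled good/bad
train = [((25, 3.0), "bad"), ((40, 6.5), "good"), ((35, 5.2), "good"),
         ((22, 2.1), "bad"), ((50, 8.0), "good")]
print(knn_predict(train, (36, 5.5), k=3))   # majority class of the 3 nearest -> 'good'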
14Decision trees
A tree whose internal nodes are simple decision rules on one or more attributes and whose leaf nodes are predicted class labels.
15Decision tree classifiers
- Widely used learning method (a small sketch follows)
- Easy to interpret: can be re-represented as if-then-else rules
- Does not require prior knowledge of the data distribution; works well on noisy data
- Pros
- Reasonable training time
- Fast application
- Easy to implement
- Can handle a large number of features
- Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
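A hedged sketch of training a decision tree classifier on invented "loan eligibility" data, assuming scikit-learn is installed; the features, labels, and max_depth are illustrative choices, not part of the original slides.

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 2.0], [40, 6.5], [35, 5.2], [22, 1.8], [50, 8.0], [30, 4.0]]  # [age, salary in K]
y = ["bad", "good", "good", "bad", "good", "bad"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "salary"]))  # if-then-else view of the tree
print(tree.predict([[28, 5.5]]))                           # class for a new applicant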
16Neural networks
A set of nodes connected by directed, weighted edges.
[Figure: a basic neural network unit]
17Neural networks
[Figure: a more typical neural network with inputs x1, x2, x3, a layer of hidden nodes, and output nodes]
18Neural networks
- Useful for learning complex data like handwriting, speech, and image recognition
[Figure: decision boundaries learned by linear regression, a classification tree, and a neural network]
19Pros and Cons of Neural Network
- Pros
- Can learn more complicated class boundaries
- Fast application
- Can handle a large number of features
- Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes (sketch below)
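A hedged sketch of a small feed-forward network on a toy non-linear (XOR-like) problem, assuming scikit-learn is installed; the data, the single hidden layer of 8 units, and the lbfgs solver are illustrative choices.

from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 25   # replicate the four XOR points
y = [0, 1, 1, 0] * 25                       # not linearly separable

# hidden_layer_sizes is the "number of nodes" typically chosen by trial and error
nn = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                   max_iter=5000, random_state=0)
nn.fit(X, y)
print(nn.predict([[0, 1], [1, 1]]))   # predicted labels for two new inputs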
20- Clustering (Unsupervised Learning)
21Clustering
- Unsupervised learning: used when old data with class labels is not available
- Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster
- Key requirement: a good measure of similarity between instances
22Distance functions
- Numeric data: Euclidean, Manhattan distances (sketches below)
- Categorical data: 0/1 to indicate absence/presence
- Hamming distance (dissimilarity): number of positions that differ
- Jaccard coefficient (similarity): number of common 1s / number of positions with a 1
- Data-dependent measures: similarity of A and B depends on co-occurrence with C
- Combined numeric and categorical data: weighted normalized distance
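Hedged pure-Python sketches of the measures named above; the example vectors are invented for illustration.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # number of positions where two 0/1 vectors disagree (a dissimilarity)
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    # similarity: shared 1s divided by positions where at least one vector has a 1
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0

print(euclidean([1, 2], [4, 6]))             # 5.0
print(manhattan([1, 2], [4, 6]))             # 7
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))   # 2
print(jaccard([1, 0, 1, 1], [1, 1, 0, 1]))   # 0.5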
23Clustering methods
- Hierarchical clustering
- agglomerative vs. divisive
- single link vs. complete link
- ...
- Partitional clustering
- distance-based: K-means
- model-based: GMM
- density-based: DBSCAN
24Agglomerative Hierarchical clustering
- Given a matrix of similarities between every pair of points
- Start with each point in a separate cluster and merge clusters based on some criterion (see the sketch below)
- Single link: merge the two clusters for which the minimum distance between two points from the two different clusters is smallest
- Complete link: merge the two clusters for which the maximum distance between two points from the two different clusters is smallest
25Partitional methods K-means
- Criterion: minimize the sum of squared distances between each point and the centroid of its cluster
- Algorithm (sketched in code below)
- Randomly select K points as initial centroids
- Repeat until stabilization
- Assign each point to the closest centroid
- Generate new cluster centroids
- Adjust clusters by merging/splitting
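A hedged pure-Python sketch of the K-means loop described above; the merge/split adjustment step is omitted for brevity, and the toy points are invented.

import random

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)                # 1. random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                # 2. assign to closest centroid
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new_centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)            # 3. recompute centroids
        ]
        if new_centroids == centroids:                  # stabilised
            break
        centroids = new_centroids
    return centroids, clusters

cents, _ = kmeans([(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)], k=2)
print(cents)   # roughly one centroid per dense group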
26K-Means
- Strength
- Easy to use.
- Efficient to calculate.
- Weakness
- Initialization problem
- Cannot handle clusters of different densities.
- Restricted to data for which there is a notion of a center/centroid.
27Model-based methods GMM
Each data point is viewed as an observation from a mixture of Gaussian distributions:
p(x) = sum_k pi_k N(x | mu_k, Sigma_k), where the mixture weights pi_k are non-negative and sum to 1.
(a fitting sketch follows)
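A hedged sketch of fitting a two-component Gaussian mixture, assuming scikit-learn and NumPy are installed; the 1-D data is sampled from two invented Gaussians purely for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200),
                    rng.normal(6, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)                 # mixture weights (the pi_k above)
print(gmm.means_.ravel())           # component means
print(gmm.predict([[0.2], [5.8]]))  # hard assignments from the fitted mixture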
28Model-based methods
- Strength
- More general than K-means
- Better representation of clusters
- Satisfies the statistical assumptions
- Weakness
- Inefficient in estimating the parameters
- Hard to choose the model (number of components)
- Problems with noise and outliers
29Density based method DBSCAN
Given the radius Eps and the threshold MinPts:
Core point: the number of points within its Eps-neighborhood exceeds the threshold MinPts.
Border point: not a core point, but lies within the Eps-neighborhood of a core point.
Outlier: neither a core point nor a border point.
30Density based method DBSCAN
- Label all points as core, border, or outlier points (a scikit-learn sketch follows this list).
- Eliminate the outlier points.
- Put an edge between all core points that are within Eps of each other.
- Make each group of connected core points into a separate cluster.
- Assign each border point to one of the clusters of its associated core points (ties may need to be broken).
31Density-based methods
- Strength
- Relatively resistant to noise.
- Handles clusters of arbitrary shapes and sizes.
- Weakness
- Problems with clusters of widely varying densities.
- Density is more difficult to define for high-dimensional data.
- Expensive: requires calculating all pairwise proximities.
32
33Association rules
- Input: a set of transactions (groups of items), e.g.
- {milk, cereal, bread}
- {tea, milk, bread}
- {milk, rice}
- {cereal}
- Goal: find all rules on itemsets of the form a -> b such that
- Support of a and b > threshold s
- Confidence (conditional probability) of b given a > threshold c
- Example: milk -> bread
- Support(milk, bread) = 2/4
- Confidence(milk -> bread) = 2/3
(a computation sketch follows)
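A hedged pure-Python sketch computing the support and confidence values from the example above, using the slide's four transactions.

transactions = [
    {"milk", "cereal", "bread"},
    {"tea", "milk", "bread"},
    {"milk", "rice"},
    {"cereal"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # conditional probability of rhs given lhs
    return support(lhs | rhs) / support(lhs)

print(support({"milk", "bread"}))        # 2/4 = 0.5
print(confidence({"milk"}, {"bread"}))   # (2/4) / (3/4) = 0.666...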
34Prevalent ≠ Interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena
[Cartoon: the discovery "Milk and cereal sell together!" is surprising in 1995 but merely prevalent, already-known knowledge later]
35Variants
- Frequent itemset mining / infrequent itemset mining
- Positive association rules / negative association rules
- Frequent itemset mining in high-dimensional data
- Frequent sub-tree mining
- Frequent sub-graph mining
36Other Issues
37Evaluation
- Classification
- Metric: classification accuracy
- Strategy: holdout, random sampling, cross-validation, bootstrap (sketch below)
- Clustering
- Cohesion, separation
- Association Rule Mining
- Efficiency w.r.t. thresholds
- Scalability w.r.t. thresholds
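A hedged sketch of estimating classification accuracy with a holdout split and with 5-fold cross-validation, assuming scikit-learn is installed; the bundled iris dataset and the decision tree are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: train on 70% of the data, measure accuracy on the held-out 30%
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(clf.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: average accuracy over 5 train/test splits
print(cross_val_score(clf, X, y, cv=5).mean())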
38Tools
- Weka
- http://www.cs.waikato.ac.nz/ml/weka/
- CLUstering Toolkit (CLUTO)
- http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
- SAS, SPSS
39Applications of Data Mining
- Web data mining
- Biological data mining
- Financial data mining
- Social network data mining
- ...
40
41Bayesian learning
- Assume a probability model for the generation of the data.
- Apply Bayes' theorem to find the most likely class as c* = argmax_c P(c | x1, ..., xn) = argmax_c P(x1, ..., xn | c) P(c)
- Naïve Bayes: assume the attributes are conditionally independent given the class value, so P(x1, ..., xn | c) = P(x1 | c) * ... * P(xn | c)
- Easy to learn the probabilities by counting (a counting sketch follows)
- Useful in some domains, e.g., text
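A hedged pure-Python sketch of naïve Bayes on text-like data, learning the probabilities by counting (with add-one smoothing); the tiny spam/ham documents are invented for illustration.

from collections import Counter, defaultdict

docs = [("buy cheap pills now", "spam"),
        ("cheap pills cheap",   "spam"),
        ("meeting at noon",     "ham"),
        ("lunch meeting today", "ham")]

class_counts = Counter(label for _, label in docs)      # counts for the prior P(c)
word_counts = defaultdict(Counter)                       # word_counts[label][word]
for text, label in docs:
    word_counts[label].update(text.split())

vocab = {w for text, _ in docs for w in text.split()}

def predict(text):
    best, best_p = None, 0.0
    for label in class_counts:
        p = class_counts[label] / len(docs)              # prior P(c)
        total = sum(word_counts[label].values())
        for w in text.split():                           # product of smoothed P(w | c)
            p *= (word_counts[label][w] + 1) / (total + len(vocab))
        if p > best_p:
            best, best_p = label, p
    return best

print(predict("cheap pills"))   # -> 'spam'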
42SVM
- "Perhaps the biggest limitation of the support
vector approach lies in choice of the
kernel."Burgess (1998) - "A second limitation is speed and size, both in
training and testing."Burgess (1998) - "Discete data presents another problem..."Burgess
(1998) - "...the optimal design for multiclass SVM
classifiers is a further area for
research."Burgess (1998) - "Although SVMs have good generalization
performance, they can be abysmally slow in test
phase, a problem addressed in (Burges, 1996
Osuna and Girosi, 1998)."Burgess (1998) - "Besides the advantages of SVMs - from a
practical point of view - they have some
drawbacks. An important practical question that
is not entirely solved, is the selection of the
kernel function parameters - for Gaussian kernels
the width parameter sigma - and the value of
epsilon in the epsilon-insensitive loss
function...more"Horváth (2003) in Suykens et
al. - "However, from a practical point of view perhaps
the most serious problem with SVMs is the high
algorithmic complexity and extensive memory
requirements of the required quadratic
programming in large-scale tasks."Horváth (2003)
in Suykens et al. p 392