An Introduction to Data Mining - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
An Introduction to Data Mining
  • Ling Chen
  • lchen@L3S.de

Slides courtesy of:
http://www.cse.iitb.ac.in/dbms/Data/Talks/datamining-intro-IEP.ppt
http://www-users.cs.umn.edu/~kumar/dmbook/figures/chap1.ppt
2
Why Data Mining? Commercial Viewpoint
  • Lots of data is being collected and warehoused
  • Web data, e-commerce
  • purchases at department/grocery stores
  • Bank/Credit Card transactions
  • Computers have become cheaper and more powerful
  • Competitive Pressure is strong
  • Provide better, customized services for an edge
    (e.g. in Customer Relationship Management)

3
Why Mine Data? Scientific Viewpoint
  • Data collected and stored at enormous speeds
    (GB/hour)
  • remote sensors on a satellite
  • telescopes scanning the skies
  • microarrays generating gene expression data
  • scientific simulations generating terabytes of
    data
  • Traditional techniques infeasible for raw data
  • Data mining may help scientists
  • in classifying and segmenting data
  • in Hypothesis Formation

4
Mining Large Data Sets - Motivation
  • There is often information hidden in the data
    that is not readily evident.
  • Human analysts may take weeks to discover useful
    information.

5
What is Data Mining?
  • Many Definitions
  • Non-trivial extraction of implicit, previously
    unknown and potentially useful information from
    data
  • Exploration and analysis, by automatic or
    semi-automatic means, of large quantities of
    data in order to discover meaningful patterns

6
What is (not) Data Mining?
  • What is Data Mining?
  • Certain names are more prevalent in certain US
    locations (e.g. O'Brien, O'Rourke, O'Reilly in
    the Boston area)
  • Group together similar documents returned by a
    search engine according to their context (e.g.
    Amazon rainforest, Amazon.com)
  • What is not Data Mining?
  • Look up phone number in phone directory
  • Query a Web search engine for information about
    Amazon

7
Origins of Data Mining
  • Draws ideas from machine learning/AI, pattern
    recognition, statistics, and database systems
  • Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data

[Venn diagram: Data Mining at the intersection of
Statistics/AI, Machine Learning/Pattern Recognition,
and Database Systems.]
8
Data Mining Tasks
  • Prediction Methods
  • Use some variables to predict unknown or future
    values of other variables.
  • Description Methods
  • Find human-interpretable patterns that describe
    the data.

9
Data Mining Tasks...
  • Classification [Predictive]
  • Clustering [Descriptive]
  • Association Rule Discovery [Descriptive]
  • Sequential Pattern Discovery [Descriptive]
  • Regression [Predictive]
  • Deviation Detection [Predictive]

10
  • Classification (Supervised learning)

11
Classification
Given old data about customers and payments,
predict a new applicant's loan eligibility.
[Figure: previous customers (age, salary, profession,
location, customer type) train a classifier that learns
a decision tree with rules such as "Salary > 5K" and
"Profession = Exec" leading to good/bad labels; the
tree is then applied to new applicants' data.]
12
Classification methods
  • Goal: predict class Ci = f(x1, x2, ..., xn)
  • Methods
  • Regression (linear or any other polynomial)
  • e.g., a·x1 + b·x2 + c = Ci
  • Nearest neighbour
  • Decision tree classifier: divides the decision
    space into piecewise constant regions
  • Neural networks: partition by non-linear
    boundaries
  • Bayesian classifiers
  • SVM

13
Nearest neighbor
  • Define proximity between instances, find neighbors
    of a new instance and assign the majority class
  • Cons
  • Slow during application
  • No feature selection
  • Notion of proximity can be vague
  • Pros
  • Fast training

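The nearest-neighbour procedure above can be sketched in a few lines of plain Python (the toy loan data, attribute scales, and function names are illustrative, not from the slides):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points under Euclidean distance."""
    # train: list of (feature_vector, class_label) pairs
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: (age, salary in K) -> customer type
train = [((25, 3), "bad"), ((30, 4), "bad"),
         ((45, 9), "good"), ((50, 8), "good"), ((40, 7), "good")]
print(knn_predict(train, (42, 8), k=3))  # -> good
```

Note the "slow during application" con: every prediction scans the whole training set.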
14
Decision trees
A tree where internal nodes are simple decision
rules on one or more attributes, and leaf nodes
are predicted class labels.
15
Decision tree classifiers
  • Widely used learning method
  • Easy to interpret can be re-represented as
    if-then-else rules
  • Does not require any prior knowledge of data
    distribution, works well on noisy data.
  • Pros
  • Reasonable training time
  • Fast application
  • Easy to implement
  • Can handle large number of features
  • Cons
  • Cannot handle complicated relationships
    between features
  • Simple decision boundaries
  • Problems with lots of missing data

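The if-then-else re-representation mentioned above can be illustrated with a hand-built tree; attribute names follow the toy loan example, and the thresholds are made-up values:

```python
def classify(applicant):
    """Hand-built decision tree: each `if` is an internal node,
    each return value a leaf's class label."""
    if applicant["salary"] > 5_000:      # root node: salary test
        return "good"
    elif applicant["profession"] == "exec":  # second-level test
        return "good"
    else:
        return "bad"

print(classify({"salary": 7_000, "profession": "clerk"}))  # -> good
print(classify({"salary": 3_000, "profession": "clerk"}))  # -> bad
```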
16
Neural networks
A set of nodes connected by directed, weighted edges.
[Figure: basic NN unit]
17
Neural networks
A more typical NN
x1
x2
x3
Output nodes
Hidden nodes
18
Neural networks
  • Useful for learning complex data like handwriting,
    speech and image recognition

[Figure: decision boundaries learned by a neural
network, a classification tree, and linear regression.]
19
Pros and Cons of Neural Network
  • Cons
  • Slow training time
  • Hard to interpret
  • Hard to implement: trial and error for
    choosing the number of nodes
  • Pros
  • Can learn more complicated class boundaries
  • Fast application
  • Can handle large number of features

20
  • Clustering (Unsupervised Learning)

21
Clustering
  • Unsupervised learning: used when old data with
    class labels is not available
  • Group/cluster existing customers based on the time
    series of their payment history such that similar
    customers end up in the same cluster
  • Key requirement: a good measure of similarity
    between instances

22
Distance functions
  • Numeric data: Euclidean, Manhattan distances
  • Categorical data: 0/1 to indicate
    absence/presence
  • Hamming distance (# of dissimilarities)
  • Jaccard coefficient: # of similarities in
    1s / (# of 1s)
  • Data-dependent measures: similarity of A and B
    depends on co-occurrence with C
  • Combined numeric and categorical data:
    weighted normalized distance

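The distance functions above can be sketched directly (the function names are ours and the 0/1 vectors are toy data):

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared 1s divided by positions where either vector has a 1."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return both / either if either else 0.0

a, b = [1, 0, 1, 1], [1, 1, 0, 1]
print(hamming(a, b), jaccard(a, b))  # -> 2 0.5
```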
23
Clustering methods
  • Hierarchical clustering
  • agglomerative vs. divisive
  • single link vs. complete link
  • Partitional clustering
  • distance-based: K-means
  • model-based: GMM
  • density-based: DBSCAN

24
Agglomerative Hierarchical clustering
  • Given a matrix of similarities between every
    point pair
  • Start with each point in a separate cluster and
    merge clusters based on some criterion
  • Single link: merge the two clusters such that the
    minimum distance between two points from the two
    different clusters is the least
  • Complete link: merge the two clusters such that the
    maximum distance between two points from the two
    different clusters is the least

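The two merge criteria can be sketched as follows (a 1-D toy distance is assumed for illustration):

```python
def single_link(c1, c2, dist):
    """Single link: cluster distance = minimum pairwise distance."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2, dist):
    """Complete link: cluster distance = maximum pairwise distance."""
    return max(dist(p, q) for p in c1 for q in c2)

d = lambda p, q: abs(p - q)          # toy 1-D distance
c1, c2 = [1.0, 2.0], [4.0, 8.0]
print(single_link(c1, c2, d))        # -> 2.0  (between 2.0 and 4.0)
print(complete_link(c1, c2, d))      # -> 7.0  (between 1.0 and 8.0)
```

At each agglomerative step, the pair of clusters minimizing the chosen criterion is merged.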
25
Partitional methods K-means
  • Criterion: minimize the sum of squared distances
    between each point and the centroid of its cluster
  • Algorithm
  • Randomly select K points as initial centroids
  • Repeat until stabilization
  • Assign each point to the closest centroid
  • Generate new cluster centroids
  • Adjust clusters by merging/splitting

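The core loop above can be sketched in plain Python (a minimal Lloyd's-iteration sketch on toy 2-D points; the merging/splitting adjustment step is omitted):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Assign points to the nearest centroid, recompute centroids,
    and repeat until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # random initial centroids
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # New centroid = coordinate-wise mean of each cluster
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
               else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:            # stabilization reached
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
cents, clusters = kmeans(pts, k=2)
```

The random initialization is exactly the weakness noted on the next slide: a bad seed can converge to a poor local optimum.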
26
K-Means
  • Strength
  • Easy to use.
  • Efficient to calculate.
  • Weakness
  • Initialization problem
  • Cannot handle clusters of different densities.
  • Restricted to data for which there is a notion of
    a center/centroid.

27
Model-based methods GMM
Each data point is viewed as an observation from a
mixture of Gaussian distributions:
p(x) = π1·N(x | μ1, Σ1) + ... + πK·N(x | μK, ΣK),
where π1 + ... + πK = 1 and each N(· | μk, Σk) is a
Gaussian density with mean μk and covariance Σk.
28
Model-based methods
  • Strength
  • More general than K-means
  • Better representation of cluster
  • Satisfy the statistical assumptions
  • Weakness
  • Inefficient in estimating the parameters
  • Hard to choose the model
  • Problems with noise and outliers

29
Density based method DBSCAN
Given the radius Eps and the threshold MinPts:
Core point: the number of points within its
Eps-neighborhood exceeds the threshold MinPts.
Border point: not a core point, but within the
Eps-neighborhood of a core point.
Outlier: neither a core point nor a border point.
30
Density based method DBSCAN
  • Label all points as core, border, or outlier
    points.
  • Eliminate outlier points.
  • Put an edge between all core points that are
    within Eps of each other.
  • Make each group of connected core points into a
    separate cluster.
  • Assign each border point to one of the clusters
    of its associated core points (ties may need to
    be resolved).

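The steps above can be sketched in plain Python (a simplified version; whether the MinPts test counts the point itself, and which cluster wins a border-point tie, are conventions we chose here):

```python
import math

def dbscan(points, eps, min_pts):
    """Label core points, connect cores within eps into clusters,
    then attach border points; unlabeled points are outliers."""
    n = len(points)
    nbrs = [[j for j in range(n)
             if j != i and math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
    cluster, cid = {}, 0
    for i in sorted(core):              # flood-fill connected cores
        if i in cluster:
            continue
        stack, cluster[i] = [i], cid
        while stack:
            u = stack.pop()
            for v in nbrs[u]:
                if v in core and v not in cluster:
                    cluster[v] = cid
                    stack.append(v)
        cid += 1
    for i in range(n):                  # attach border points
        if i in core:
            continue
        for j in nbrs[i]:
            if j in core:
                cluster[i] = cluster[j]  # tie broken arbitrarily
                break
    return cluster  # indices absent from the dict are outliers

labels = dbscan([(0, 0), (0, 1), (1, 0), (10, 10)], eps=1.5, min_pts=2)
# the three nearby points share one cluster; (10, 10) stays unlabeled
```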
31
Density-based methods
  • Strength
  • Relatively resistant to noise.
  • Handle clusters of arbitrary shapes and sizes.
  • Weakness
  • Problem with clusters having widely varying
    densities.
  • Density is more difficult to define with
    high-dimensional data.
  • Expensive in calculating all pairwise
    proximities.

32
  • Association Rules

33
Association rules
Transactions:
  milk, cereal, bread
  tea, milk, bread
  milk, rice
  cereal

  • Input: a set of groups of items
  • Goal: find all rules on itemsets of the form
    a --> b such that
  • Support of a and b > threshold s
  • Confidence (conditional probability) of b given
    a > threshold c
  • Example: milk --> bread
  • Support(milk, bread) = 2/4
  • Confidence(milk --> bread) = 2/3
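The support and confidence numbers for the example transactions can be checked with a short sketch (function names are ours):

```python
transactions = [
    {"milk", "cereal", "bread"},
    {"tea", "milk", "bread"},
    {"milk", "rice"},
    {"cereal"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """Conditional probability of b appearing given a appears."""
    return support(a | b) / support(a)

print(support({"milk", "bread"}))       # -> 0.5   (= 2/4)
print(confidence({"milk"}, {"bread"}))  # -> 0.666... (= 2/3)
```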
34
Prevalent ≠ Interesting
  • Analysts already know about prevalent rules
    (e.g., "Milk and cereal sell together!")
  • Interesting rules are those that deviate from
    prior expectation
  • Mining's payoff is in finding surprising phenomena
35
Variants
  • Frequent itemset mining /Infrequent itemset
    mining
  • Positive association rules /Negative association
    rules
  • Frequent high dimensional data
  • Frequent sub-tree mining
  • Frequent sub-graph mining

36
Other Issues
37
Evaluation
  • Classification
  • Metric: classification accuracy
  • Strategy: holdout, random sampling,
    cross-validation, bootstrap
  • Clustering
  • Cohesion, separation
  • Association Rule Mining
  • Efficiency w.r.t. thresholds
  • Scalability w.r.t. thresholds

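The holdout strategy can be sketched with a trivial majority-class baseline (the data and baseline classifier are illustrative, not from the slides):

```python
import random
from collections import Counter

def holdout_accuracy(data, test_frac=0.25, seed=0):
    """Holdout evaluation: shuffle, split into train/test,
    fit a majority-class baseline on train, score on test."""
    rng = random.Random(seed)
    rows = data[:]
    rng.shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    train, test = rows[:cut], rows[cut:]
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return sum(y == majority for _, y in test) / len(test)

# Toy labeled data: 8 "good" examples, 2 "bad" examples
data = ([((i,), "good") for i in range(8)]
        + [((i,), "bad") for i in range(2)])
acc = holdout_accuracy(data)  # a fraction between 0 and 1
```

Cross-validation repeats this with rotating held-out folds and averages the accuracies.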
38
Tools
  • Weka
  • http://www.cs.waikato.ac.nz/ml/weka/
  • CLUstering Toolkit (CLUTO)
  • http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
  • SAS, SPSS

39
Applications of Data Mining
  • Web data mining
  • Biological data mining
  • Financial data mining
  • Social network data mining
  • ...

40
  • Questions ?
  • Thanks!

41
Bayesian learning
  • Assume a probability model on the generation of
    the data
  • Apply Bayes' theorem to find the most likely
    class: c* = argmax_c P(c) · P(x1, ..., xn | c)
  • Naïve Bayes: assume attributes are conditionally
    independent given the class value, so
    P(x1, ..., xn | c) = P(x1 | c) · ... · P(xn | c)
  • Easy to learn the probabilities by counting
  • Useful in some domains, e.g. text

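Learning the probabilities by counting can be sketched as follows (the toy documents and the Laplace-smoothing parameter are illustrative):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count class frequencies and per-class word frequencies
    from (word_list, label) pairs."""
    priors = Counter(label for _, label in docs)
    counts = defaultdict(Counter)
    for words, label in docs:
        counts[label].update(words)
    return priors, counts

def predict_nb(priors, counts, words, alpha=1.0):
    """argmax over classes of log P(c) + sum_i log P(w_i | c),
    with Laplace (add-alpha) smoothing."""
    vocab = {w for c in counts for w in counts[c]}
    best, best_lp = None, float("-inf")
    for c in priors:
        total = sum(counts[c].values())
        lp = math.log(priors[c])
        for w in words:
            lp += math.log((counts[c][w] + alpha)
                           / (total + alpha * len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [(["cheap", "pills"], "spam"), (["cheap", "offer"], "spam"),
        (["meeting", "notes"], "ham")]
priors, counts = train_nb(docs)
print(predict_nb(priors, counts, ["cheap", "offer"]))  # -> spam
```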
42
SVM
  • "Perhaps the biggest limitation of the support
    vector approach lies in choice of the
    kernel."Burgess (1998)
  • "A second limitation is speed and size, both in
    training and testing."Burgess (1998)
  • "Discete data presents another problem..."Burgess
    (1998)
  • "...the optimal design for multiclass SVM
    classifiers is a further area for
    research."Burgess (1998)
  • "Although SVMs have good generalization
    performance, they can be abysmally slow in test
    phase, a problem addressed in (Burges, 1996
    Osuna and Girosi, 1998)."Burgess (1998)
  • "Besides the advantages of SVMs - from a
    practical point of view - they have some
    drawbacks. An important practical question that
    is not entirely solved, is the selection of the
    kernel function parameters - for Gaussian kernels
    the width parameter sigma - and the value of
    epsilon in the epsilon-insensitive loss
    function...more"Horváth (2003) in Suykens et
    al.
  • "However, from a practical point of view perhaps
    the most serious problem with SVMs is the high
    algorithmic complexity and extensive memory
    requirements of the required quadratic
    programming in large-scale tasks."Horváth (2003)
    in Suykens et al. p 392