Data Mining - PowerPoint PPT Presentation

Provided by: sve2
Transcript and Presenter's Notes

1
Data Mining
  • Introduction
  • By Svetlana Stenchikova

2
Definition
  • Analysis of (often large) observational data sets
    to
  • find unsuspected relationships
  • summarize the data in novel ways
  • that are understandable and useful to the user

3
Statistics vs. Data Mining
  • DM usually deals with data collected for other
    purposes (DM is secondary analysis)
  • DM: large data sets
  • Statistics: data collection using efficient
    strategies to answer specific questions

4
Approaches to DM
  • Generalizing: data is collected for a sample
    population; generalize to the whole population
  • Determine how future customers will behave
  • Generalization may not be achievable through
    statistical approaches because the data is not a
    random sample
  • Compressing: make the result more comprehensible
  • Census data for the whole country

5
Which is true?
  • DM deals with large amounts of data, while ML may
    deal with smaller data sets
  • ML is a subset of DM
  • DM uses ML algorithms
  • In a DM task, we do not need to have a hypothesis,
    while in ML we start with hypotheses

6
What is Data Mining
  • DM is an interdisciplinary exercise involving
  • Statistics
  • Databases
  • Machine learning
  • Pattern recognition
  • AI
  • Visualization
  • Difficult to draw a boundary

7
Data Set / training data / sample / database
  • Set of measurements taken from an environment or
    process
  • Stored in an n x p matrix (table): n samples, p
    variables
  • Ex.: Census Bureau data
  • Find relationships between variables: predict
    income from other variables
  • Cluster groups
  • Find values at which variables often coincide

8
  • Ex.: Data set on medical patients
  • Multiple measurements (different time/day)
  • Image data for some patients
  • Text comments
  • Hierarchy (doctors, hospitals, geographic
    locations)
  • n x p matrix is a simplification

9
Data set examples
  • Text documents: row = document id, column =
    word, table entry = number of occurrences of
    word w in document d
  • Web transaction log: row = person, column = web
    page
  • Sparse matrices
  • Some information may be lost
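The document-by-word table described above can be sketched in a few lines. This is a minimal illustration, assuming a made-up two-document corpus and whitespace tokenization; the names `docs`, `term_counts`, and `dense_row` are hypothetical and not from the slides. Only non-zero entries are stored, which is the point of a sparse representation.

```python
from collections import Counter

# Hypothetical toy corpus: document id -> text (illustrative data only).
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "statistics answers specific questions",
}

def term_counts(documents):
    """Return a sparse document-term table as {doc_id: Counter(word -> count)}."""
    return {doc_id: Counter(text.lower().split())
            for doc_id, text in documents.items()}

table = term_counts(docs)

# Columns of the conceptual n x p matrix: every distinct word in the corpus.
vocabulary = sorted({word for counts in table.values() for word in counts})

def dense_row(doc_id):
    """Expand one sparse row into a dense row over the whole vocabulary."""
    return [table[doc_id][word] for word in vocabulary]
```

Word order within each document is discarded by this representation, which is one example of how some information may be lost.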

10
Global vs. Local structures
  • Model structure: global summary of a data set
  • Row in data matrix is a p-dimensional vector
  • Y = aX + c (Y and X are variables; a and c are
    parameters)
  • Pattern structure: restricted regions of the
    space spanned by the variables
  • If X > x1 then prob(Y > y1) = p1
  • Anomaly detection
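The contrast between a global model such as Y = aX + c and a local pattern of the form "if X > x1 then prob(Y > y1) = p1" can be illustrated as follows. This is only a sketch on made-up data: the least-squares fit and the relative-frequency estimate are assumptions chosen for illustration, and `fit_line` and `pattern_probability` are hypothetical names.

```python
def fit_line(xs, ys):
    """Global model: least-squares estimates of a and c in Y = aX + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    c = mean_y - a * mean_x
    return a, c

def pattern_probability(xs, ys, x1, y1):
    """Local pattern: estimate p1 = prob(Y > y1 | X > x1) as a relative frequency."""
    selected = [y for x, y in zip(xs, ys) if x > x1]
    if not selected:
        return None  # the pattern's region contains no data
    return sum(1 for y in selected if y > y1) / len(selected)

# Illustrative data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

a, c = fit_line(xs, ys)                               # uses every row
p1 = pattern_probability(xs, ys, x1=3.0, y1=10.0)     # uses only rows with X > 3
```

The model summarizes all rows with two parameters, while the pattern makes a statement only about the region where X > x1.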

11
Linearity of a model
  • In Statistics, linearity refers to the
    parameters
  • Y = aX + c: linear
  • Y = aX^2 + c: linear
  • In Data Mining, linearity refers to the
    variables
  • Y = aX + c: linear
  • Y = aX^2 + c: not linear
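A model such as Y = aX^2 + c is linear in the parameters a and c even though it is not linear in the variable X. One practical consequence is that it can be fitted with ordinary linear least squares after substituting Z = X^2. The sketch below assumes noise-free toy data; `least_squares` is a hypothetical helper name.

```python
def least_squares(zs, ys):
    """Least-squares estimates of a and c in Y = aZ + c."""
    n = len(zs)
    mean_z = sum(zs) / n
    mean_y = sum(ys) / n
    szz = sum((z - mean_z) ** 2 for z in zs)
    szy = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
    a = szy / szz
    c = mean_y - a * mean_z
    return a, c

# Illustrative data generated from y = 3x^2 + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [3.0 * x * x + 1.0 for x in xs]

# Transform the variable, then fit a model that is linear in a and c.
zs = [x * x for x in xs]
a, c = least_squares(zs, ys)
```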

12
Model Structure vs. Pattern
  • Model structure
  • Function representing the data
  • Fitted model or pattern: parameters have
    specific values
  • Sometimes the distinction is unclear

13
Data Mining Tasks
  • Exploratory Data Analysis
  • Descriptive Modeling
  • Predictive Modeling
  • Discovering Patterns and Rules
  • Retrieval by content
  • A measure of similarity or distance is shared
    between tasks
  • Different models and pattern structures for
    different tasks

14
Data Mining Tasks
  • Exploratory Data Analysis
  • No clear idea what to look for
  • Interactive and visual display of data
  • As the dimensionality (p, the number of
    variables) of the data increases, it becomes
    difficult to visualize

15
Data Mining Tasks
  • Descriptive Modeling
  • Density estimation: models for the overall
    probability distribution
  • Cluster analysis (discover natural groups in
    data) and segmentation (group similar records,
    e.g. market analysis; the number of groups is
    specified by the user, there is no correct number)
  • Dependency modeling: describes relationships
    between variables
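Segmentation with a user-specified number of groups can be sketched with a one-dimensional k-means loop. The slides do not name a specific algorithm, so k-means is an assumption here, as are the toy data and the evenly spaced starting centers; `kmeans_1d` is a hypothetical name.

```python
def kmeans_1d(values, k, iterations=20):
    """Group 1-D values into k clusters; k is chosen by the user."""
    ordered = sorted(values)
    # Deterministic start: k evenly spaced data points (an illustrative choice).
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group.
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups

# Illustrative data: spending of six hypothetical customers.
spend = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centers, groups = kmeans_1d(spend, k=2)
```

Running the same data with a different k yields a different but equally valid segmentation, which reflects the point that there is no single correct number of groups.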

16
Data Mining Tasks
  • Predictive Modeling: Classification and
    Regression
  • Build a model to predict values of one variable
    from the known values of others
  • Classification: the predicted variable is
    categorical; regression: the predicted variable
    is quantitative
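The difference between the two predictive tasks shows up in how predictions are combined. The sketch below uses a k-nearest-neighbour predictor, which is an assumption chosen for brevity and not a method named in the slides; all data and function names are hypothetical.

```python
from collections import Counter

def nearest(train, x, k):
    """Return the k training pairs whose input value is closest to x."""
    return sorted(train, key=lambda pair: abs(pair[0] - x))[:k]

def knn_classify(train, x, k=3):
    """Classification: the target is categorical, so take a majority vote."""
    labels = [label for _, label in nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, x, k=3):
    """Regression: the target is quantitative, so take the mean."""
    targets = [target for _, target in nearest(train, x, k)]
    return sum(targets) / len(targets)

# Illustrative training sets: (input value, target).
class_data = [(1, "low"), (2, "low"), (3, "low"),
              (8, "high"), (9, "high"), (10, "high")]
reg_data = [(1, 10.0), (2, 20.0), (3, 30.0),
            (8, 80.0), (9, 90.0), (10, 100.0)]

predicted_label = knn_classify(class_data, 2.5)
predicted_value = knn_regress(reg_data, 2.0)
```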

17
Data Mining Tasks
  • Discovering Patterns and Rules
  • Finding outliers: spotting fraudulent behavior
  • Finding combinations that occur frequently
  • Retrieval by content
  • User has a pattern of interest and wishes to find
    similar patterns in data set
  • Common for text and image data sets
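Retrieval by content for text can be sketched as ranking documents by their similarity to a query. The example assumes word-count vectors and cosine similarity, which are common choices but are not specified in the slides; the corpus and the names `cosine` and `retrieve` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, documents, top=1):
    """Return the ids of the `top` documents most similar to the query."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), doc_id)
              for doc_id, text in documents.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top]]

# Illustrative corpus.
documents = {
    "d1": "fraud detection in credit card data",
    "d2": "clustering customers for market analysis",
    "d3": "image retrieval by visual similarity",
}

best = retrieve("credit card fraud", documents)
```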

18
Data Mining Algorithms
  • Model or Pattern Structure
  • Score function
  • Optimization and Search method
  • Data management strategy

19
Score Function
  • Measures how well a model or pattern structure
    fits a given data set
  • Likelihood
  • Sum of squared errors
  • Misclassification rate
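The three score functions listed above can each be written in a few lines. This is a sketch under simple assumptions: the likelihood example uses a Bernoulli model for 0/1 outcomes, which is an illustrative choice, and all function names are hypothetical.

```python
import math

def sum_squared_errors(actual, predicted):
    """Score for regression: lower is better."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted))

def misclassification_rate(actual, predicted):
    """Score for classification: fraction of wrong labels, lower is better."""
    wrong = sum(1 for y, p in zip(actual, predicted) if y != p)
    return wrong / len(actual)

def bernoulli_log_likelihood(outcomes, p):
    """Log-likelihood of 0/1 outcomes under success probability p (0 < p < 1).

    Higher is better.
    """
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in outcomes)

sse = sum_squared_errors([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
error_rate = misclassification_rate(["a", "b", "a", "b"], ["a", "a", "a", "b"])
ll_good = bernoulli_log_likelihood([1, 1, 0], 2.0 / 3.0)
ll_poor = bernoulli_log_likelihood([1, 1, 0], 0.5)
```

In this example the parameter value 2/3 matches the observed frequency of ones, so it scores a higher log-likelihood than 0.5.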

20
Optimization and Search Methods
  • Goal is to determine the structure and the
    parameter values that give the best value of the
    score function
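As one simple illustration of a search method, the parameter values of Y = aX + c can be found by exhaustively scoring a grid of candidates. Grid search and the sum of squared errors are assumptions chosen for clarity; practical algorithms usually rely on more efficient optimization. The names `sse` and `grid_search` are hypothetical.

```python
def sse(a, c, xs, ys):
    """Score function: sum of squared errors of Y = aX + c on the data."""
    return sum((y - (a * x + c)) ** 2 for x, y in zip(xs, ys))

def grid_search(xs, ys, candidates):
    """Try every (a, c) pair from the candidate list and keep the best score."""
    best = None
    for a in candidates:
        for c in candidates:
            score = sse(a, c, xs, ys)
            if best is None or score < best[0]:
                best = (score, a, c)
    return best

# Illustrative data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

candidates = [i * 0.5 for i in range(9)]  # 0.0, 0.5, ..., 4.0
best_score, best_a, best_c = grid_search(xs, ys, candidates)
```

The cost of this approach grows with the number of candidates raised to the number of parameters, which is why the choice of search method matters.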

21
Data Management Strategies
  • How data is stored, indexed, or accessed
  • Implementation of data mining algorithms deals
    with this issue
  • Current data analysis algorithms assume fast RAM
    access, so they scale poorly

22
Statistics and Data Mining
  • Data size of DM problems
  • Genome project: 10^9
  • Digital sky survey: 10^8 individual sky objects,
    400 GB catalog
  • Massive data can be approached by
  • Sampling (for modeling, but not pattern rec.)
  • Adaptive methods
  • Summarizing in terms of sufficient statistics
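Summarizing in terms of sufficient statistics means a massive data set can be processed in chunks without keeping the raw values. The sketch below assumes a normal model, for which the count, the sum, and the sum of squares are sufficient for the mean and variance; the function names and data are hypothetical.

```python
def summarize(chunk):
    """Reduce one chunk of values to (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def merge(s1, s2):
    """Combine the summaries of two chunks; the raw data is not needed."""
    return tuple(a + b for a, b in zip(s1, s2))

def mean_and_variance(stats):
    """Recover the mean and the population variance from the summary."""
    n, total, total_sq = stats
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance

# Illustrative data arriving in two chunks, e.g. read from disk one at a time.
chunks = [[1.0, 2.0], [3.0, 4.0]]

stats = (0, 0.0, 0.0)
for chunk in chunks:
    stats = merge(stats, summarize(chunk))

mean, variance = mean_and_variance(stats)
```

The sum-of-squares formula can lose precision when the mean is large relative to the spread, so a numerically stable update would be preferable for real data.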

23
Statistics and Data Mining
  • Curse of dimensionality: exponential growth as
    the number of variables increases
  • Impose a restriction: prior choice of models,
    e.g. assume a linear model
  • Contamination or corruption of data points is
    common
  • Part of a model is a component describing the
    mechanism by which missing data arises
  • EM algorithm
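The exponential growth behind the curse of dimensionality, noted in the first bullet of this slide, can be made concrete with a small calculation. The sketch assumes each variable is divided into the same number of bins, which is an illustrative simplification and not from the slides; the function names are hypothetical.

```python
def cells_needed(bins_per_variable, p):
    """Number of grid cells when each of p variables is split into equal bins."""
    return bins_per_variable ** p

def average_points_per_cell(n, bins_per_variable, p):
    """Average number of data points per cell for a data set of n rows."""
    return n / cells_needed(bins_per_variable, p)

# With 10 bins per variable and one million rows, the cells empty out quickly.
for p in (1, 2, 5, 10):
    print(p, cells_needed(10, p), average_points_per_cell(10 ** 6, 10, p))
```

Restricting the model family, for example by assuming a linear model, reduces the number of quantities that have to be estimated from such sparsely populated space.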

24
Homework
  • Find two data mining papers and identify for each
    of them:
  • Which task is being executed (Exploratory Data
    Analysis/Descriptive Modeling/Predictive
    Modeling/Discovering Patterns and Rules/Retrieval
    by content)
  • What is the size and representation of the data
    set
  • What is the score function