Transcript and Presenter's Notes

Title: The Knowledge Discovery Process; Data Preparation


1
The Knowledge Discovery Process: Data Preparation & Preprocessing
Bamshad Mobasher DePaul University
2
The Knowledge Discovery Process
- The KDD Process
3
Steps in the KDD Process
  • Learning the application domain
  • Translate the business problem into a data mining
    problem
  • Gathering and integrating data
  • Cleaning and preprocessing data
  • may be the most resource-intensive part
  • Reducing and selecting data
  • find useful features, dimensionality reduction,
    etc.
  • Choosing functions of data mining
  • summarization, classification, regression,
    association, clustering, etc.
  • Choosing the mining algorithm(s)
  • Data mining: discover patterns of interest and
    construct models
  • Evaluate patterns and models
  • Interpretation and analysis of results
  • visualization, alteration, removing redundant
    patterns, querying
  • Use of discovered knowledge

4
Data Mining: What Kind of Data?
  • Structured Databases
  • relational, object-relational, etc.
  • can use SQL to perform parts of the process
  • e.g., SELECT count(*) FROM Items
    WHERE type = 'video' GROUP BY category

5
Data Mining: What Kind of Data?
  • Flat Files
  • most common data source
  • can be text (or HTML) or binary
  • may contain transactions, statistical data,
    measurements, etc.
  • Transactional databases
  • set of records each with a transaction id, time
    stamp, and a set of items
  • may have an associated description file for the
    items
  • typical source of data used in market basket
    analysis

6
Data Mining: What Kind of Data?
  • Other Types of Databases
  • legacy databases
  • multimedia databases (usually very
    high-dimensional)
  • spatial databases (containing geographical
    information, such as maps, or satellite imaging
    data, etc.)
  • Time Series / Temporal Data (time-dependent
    information such as stock market data; usually
    very dynamic)
  • World Wide Web
  • basically a large, heterogeneous, distributed
    database
  • need for new or additional tools and techniques
  • information retrieval, filtering and extraction
  • agents to assist in browsing and filtering
  • Web content, usage, and structure (linkage)
    mining tools

7
Data Mining: What Kind of Data?
  • Data Warehouses
  • a data warehouse is a repository of data
    collected from multiple data sources (often
    heterogeneous) so that analysis can be done under
    the same unified schema
  • data from the different sources are loaded,
    cleaned, transformed and integrated
  • allows for interactive analysis and
    decision-making
  • usually modeled by a multi-dimensional data
    structure (data cube) to facilitate
    decision-making
  • aggregate values can be pre-computed and stored
    along many dimensions
  • each dimension of the data cube contains a
    hierarchy of values for one attribute
  • data cubes are well suited for fast interactive
    querying and analysis of data at different
    conceptual levels, known as On-Line Analytical
    Processing (OLAP)
  • OLAP operations allow the navigation of data at
    different levels of abstraction, such as
    drill-down, roll-up, slice, dice, etc.

8
DM Tasks: Classification
  • Classifying observations/instances into different
    given classes (i.e., classification is
    supervised)
  • example, classifying credit applicants as low,
    medium, or high risk
  • normally use a training set where all objects are
    already associated with known class labels
  • classification algorithm learns from the training
    set and builds a model
  • the model is used to classify new objects
  • Suitable data mining tools
  • Decision Tree algorithms (CHAID, C5.0, and CART)
  • Memory-based reasoning
  • Neural Network
  • Example (Hypothetical Video Store)
  • build a model of users based on their rental
    history (returning on-time, payments, etc.)
  • observe a current customer's rental and payment
    history
  • decide whether a deposit should be charged to the
    current customer

9
DM Tasks: Prediction
  • Same as classification
  • except classify according to some predicted or
    estimated future value
  • In prediction, historical data is used to build a
    (predictive) model that explains the current
    observed behavior
  • Model can then be applied to new instances to
    predict future behavior or forecast the future
    value of some missing attribute
  • Examples
  • predicting the size of balance transfer if the
    prospect accepts the offer
  • predicting the load on a Web server in a
    particular time period
  • Suitable data mining tools
  • Association rule discovery
  • Memory-based reasoning
  • Decision Trees
  • Neural Networks

10
DM Tasks: Affinity Grouping
  • Determine what items often go together (usually
    in transactional databases)
  • Often Referred to as Market Basket Analysis
  • used in retail for planning arrangement on
    shelves
  • used for identifying cross-selling opportunities
  • can be used to determine the best link structure
    for a Web site
  • Examples
  • people who buy milk and beer also tend to buy
    diapers
  • people who access pages A and B are likely to
    place an online order
  • Suitable data mining tools
  • association rule discovery
  • clustering
  • Nearest Neighbor analysis (memory-based reasoning)

11
DM Tasks: Clustering
  • Like classification, clustering is the
    organization of data into classes
  • however, class labels are unknown and it is up to
    the clustering algorithm to discover acceptable
    classes
  • also called unsupervised classification, because
    the classification is not dictated by given class
    labels
  • There are many clustering approaches
  • all based on the principle of maximizing the
    similarity between objects in the same class
    (intra-class similarity) and minimizing the
    similarity between objects of different classes
    (inter-class similarity)
  • Examples
  • doing market segmentation of customers based on
    buying patterns and demographic attributes
  • grouping user transactions on a Web site based on
    their navigational patterns

12
Characterization and Discrimination
  • Data characterization is a summarization of
    general features of objects in a target class
  • The data relevant to a target class are retrieved
    by a database query and run through a
    summarization module to extract the essence of
    the data at different levels of abstraction
  • example: characterize the video store's customers
    who regularly rent more than 30 movies a year
  • Data discrimination is used to compare the general
    features of objects between two classes
  • compares relevant features of objects in a target
    class against those in a contrasting class
  • example: compare the general characteristics of
    the customers who rented more than 30 movies in
    the last year with those who rented fewer than 5
  • The techniques used for data discrimination are
    very similar to the techniques used for data
    characterization with the exception that data
    discrimination results include comparative
    measures

13
Example: Moviegoer Database
14
Example: Moviegoer Database
SELECT moviegoers.name, moviegoers.sex, moviegoers.age,
       sources.source, movies.name
FROM movies, sources, moviegoers
WHERE sources.source_ID = moviegoers.source_ID
  AND movies.movie_ID = moviegoers.movie_ID
ORDER BY moviegoers.name
15
Example: Moviegoer Database
  • Classification
  • determine sex based on age, source, and movies
    seen
  • determine source based on sex, age, and movies
    seen
  • determine most recent movie based on past movies,
    age, sex, and source
  • Prediction or Estimation
  • for prediction, we need a continuous variable
    (e.g., age)
  • predict age as a function of source, sex, and
    past movies
  • if we had a rating field for each moviegoer, we
    could predict the rating a new moviegoer gives to
    a movie based on age, sex, past movies, etc.
  • Clustering
  • find groupings of movies that are often seen by
    the same people
  • find groupings of people that tend to see the
    same movies
  • clustering might reveal relationships that are
    not necessarily recorded in the data (e.g., we
    may find a cluster that is dominated by people
    with young children or a cluster of movies that
    correspond to a particular genre)

16
Example: Moviegoer Database
  • Affinity Grouping
  • market basket analysis (MBA): which movies go
    together?
  • need to create transactions for each moviegoer
    containing movies seen by that moviegoer
  • may result in association rules such as
  • {Phenomenon, The Birdcage} => {Trainspotting}
  • {Trainspotting, The Birdcage} => {sex = f}
  • Sequence Analysis
  • similar to MBA, but order in which items appear
    in the pattern is important
  • e.g., people who rent The Birdcage during a
    visit tend to rent Trainspotting in the next
    visit.

17
The Knowledge Discovery Process
- The KDD Process
18
Data Preprocessing
  • Why do we need to prepare the data?
  • In real-world applications, data can be
    inconsistent, incomplete, and/or noisy
  • Data entry, data transmission, or data collection
    problems
  • Discrepancy in naming conventions
  • Duplicated records
  • Incomplete data
  • Contradictions in data
  • What happens when the data cannot be trusted?
  • Can the decision be trusted? Decision making is
    jeopardized
  • Better chance to discover useful knowledge when
    data is clean

19
Data Preprocessing
20
Data Preprocessing
  • Data Cleaning
  • Data Integration
  • Data Transformation (e.g., -2, 32, 100, 59, 48 ->
    -0.02, 0.32, 1.00, 0.59, 0.48)
  • Data Reduction
21
Data Cleaning
  • Real-world application data can be incomplete,
    noisy, and inconsistent
  • No recorded values for some attributes
  • Not considered at time of entry
  • Random errors
  • Irrelevant records or fields
  • Data cleaning attempts to
  • Fill in missing values
  • Smooth out noisy data
  • Correct inconsistencies
  • Remove irrelevant data

22
Dealing with Missing Values
  • Data is not always available (missing attribute
    values in records)
  • equipment malfunction
  • deleted due to inconsistency or misunderstanding
  • not considered important at time of data
    gathering
  • Handling Missing Data
  • Ignore the record with missing values
  • Fill in the missing values manually
  • Use a global constant to fill in missing values
    (NULL, unknown, etc.)
  • Use the attribute mean to fill in missing values
    of that attribute (the mean-based strategies are
    sketched below)
  • Use the attribute mean for all samples belonging
    to the same class to fill in the missing values
  • Infer the most probable value to fill in the
    missing value
  • may need to use methods such as Bayesian
    classification or decision trees to automatically
    infer missing attribute values
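
A minimal pandas sketch of the two mean-based strategies above; the
income attribute, its values, and the risk class label are hypothetical
stand-ins:

import pandas as pd

df = pd.DataFrame({
    "income": [42000, None, 58000, None, 61000, 39000],
    "risk":   ["low", "low", "high", "high", "high", "low"],
})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("risk")["income"].transform("mean"))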

23
Smoothing Noisy Data
  • The purpose of data smoothing is to eliminate
    noise

Binning

Original data for price (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into equidepth bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means (each value in a bin is replaced by the mean
value of the bin):
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries (the min and max values in each bin are
identified as boundaries; each value in a bin is replaced with the
closest boundary value):
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
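
A minimal plain-Python sketch of the equidepth binning above,
reproducing both smoothing variants on the slide's price data:

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
size = 3
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of
# its bin's min or max (ties go to the lower boundary)
by_boundaries = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
                 for b in bins]

print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]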
24
Smoothing Noisy Data
  • Other Methods

Clustering: similar values are organized into groups (clusters); values
falling outside of clusters may be considered outliers and may be
candidates for elimination.

Regression: fit data to a function. Linear regression finds the best
line to fit two variables; multiple regression can handle multiple
variables. The values given by the function are used instead of the
original values.
25
Smoothing Noisy Data - Example
  • Want to smooth Temperature by bin means with
    bins of size 3
  • First sort the values of the attribute (keep track
    of the ID or key so that the transformed values
    can be replaced in the original table).
  • Divide the data into bins of size 3 (or fewer in
    the case of the last bin).
  • Convert the values in each bin to the mean value
    for that bin.
  • Put the resulting values into the original table.

26
Smoothing Noisy Data - Example
Value of every record in each bin is changed to
the mean value for that bin. If it is necessary
to keep the value as an integer, then the mean
values are rounded to the nearest integer.
27
Smoothing Noisy Data - Example
The final table with the new values for the
Temperature attribute.
28
Data Integration
  • Data analysis may require a combination of data
    from multiple sources into a coherent data store
  • Challenges in Data Integration
  • Schema integration (e.g., recognizing that CID,
    C_number, and Cust-id all name the same customer
    attribute)
  • Semantic heterogeneity
  • Data value conflicts (different representations
    or scales, etc.)
  • Synchronization (especially important in Web
    usage mining)
  • Redundant attributes (redundant if it can be
    derived from other attributes) -- may be able to
    identify redundancies via correlation analysis
  • Meta-data is often necessary for successful data
    integration

Correlation analysis: corr(A,B) = Pr(A,B) / (Pr(A) x Pr(B))
corr = 1: independent; corr > 1: positive correlation;
corr < 1: negative correlation.
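
A minimal plain-Python sketch of this correlation measure, computed
over hypothetical boolean records for two attributes A and B:

records = [
    {"A": True,  "B": True},
    {"A": True,  "B": False},
    {"A": False, "B": True},
    {"A": True,  "B": True},
]
n = len(records)
p_a  = sum(r["A"] for r in records) / n
p_b  = sum(r["B"] for r in records) / n
p_ab = sum(r["A"] and r["B"] for r in records) / n

# = 1: independent; > 1: positive; < 1: negative correlation
corr = p_ab / (p_a * p_b)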
29
Normalization
  • Min-max normalization: a linear transformation
    from v to v'
  • v' = (v - min) / (max - min) x (new_max - new_min)
    + new_min
  • Ex: transform 30000 in the range 10000..45000 into
    0..1 =>
  • (30000 - 10000) / 35000 x (1 - 0) + 0 = 0.571
  • z-score normalization: normalization of v into v'
    based on the attribute mean and standard deviation
  • v' = (v - Mean) / StandardDeviation
  • Normalization by decimal scaling
  • moves the decimal point of v by j positions, where
    j is the minimum number of positions moved so that
    the absolute maximum value falls in 0..1
  • v' = v / 10^j
  • Ex: if v ranges between -56 and 9976, j = 4 => v'
    ranges between -0.0056 and 0.9976
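
A minimal plain-Python sketch of the three normalizations, checked
against the slide's example values (the humidity value 85 passed to
z_score is a hypothetical input):

import math

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, stdev):
    return (v - mean) / stdev

def decimal_scale(v, max_abs):
    j = math.ceil(math.log10(max_abs))  # smallest j with max_abs <= 10^j
    return v / 10 ** j

print(min_max(30000, 10000, 45000))  # 0.571...
print(z_score(85, 80.3, 9.84))       # 0.477...
print(decimal_scale(9976, 9976))     # 0.9976 (j = 4)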

30
Normalization Example
  • z-score normalization: v' = (v - Mean) / Stdev
  • Example: normalizing the Humidity attribute

Mean = 80.3, Stdev = 9.84
31
Normalization Example II
  • Min-max normalization on an employee database
  • max distance for salary: 100000 - 19000 = 81000
  • max distance for age: 52 - 27 = 25
  • new min for age and salary = 0; new max for age
    and salary = 1

32
Data Reduction
  • Data is often too large; reducing the data can
    improve performance
  • Data reduction consists of reducing the
    representation of the data set while producing
    the same (or almost the same) results
  • Data reduction includes
  • Data cube aggregation
  • Dimensionality reduction
  • Discretization
  • Numerosity reduction
  • Regression
  • Histograms
  • Clustering
  • Sampling

33
Data Cube Aggregation
  • Reduce the data to the concept level needed in
    the analysis
  • Use the smallest (most detailed) level necessary
    to solve the problem
  • Queries regarding aggregated information should
    be answered using the data cube when possible

34
Dimensionality Reduction
  • Feature selection (i.e., attribute subset
    selection)
  • Select only the necessary attributes.
  • The goal is to find a minimum set of attributes
    such that the resulting probability distribution
    of data classes is as close as possible to the
    original distribution obtained using all
    attributes.
  • Exponential number of possibilities.
  • Use heuristics: select the locally best (or most
    pertinent) attribute, e.g., using information
    gain, etc.
  • step-wise forward selection: {A1} -> {A1, A3} ->
    {A1, A3, A5} (see the sketch after this list)
  • step-wise backward elimination: {A1, A2, A3, A4,
    A5} -> {A1, A3, A4, A5} -> {A1, A3, A5}
  • combining forward selection and backward
    elimination
  • decision-tree induction
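
A minimal sketch of step-wise forward selection using scikit-learn's
SequentialFeatureSelector, assuming scikit-learn is available; the iris
data and decision tree are stand-ins for any dataset and learner:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Greedily add the attribute that most improves cross-validated
# accuracy; direction="backward" gives step-wise backward elimination
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(), n_features_to_select=2, direction="forward")
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected attributes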

35
Decision Tree Induction
36
Numerosity Reduction
  • Reduction via histograms
  • Divide data into buckets and store representation
    of buckets (sum, count, etc.)
  • Reduction via clustering
  • Partition data into clusters based on closeness
    in space
  • Retain representatives of clusters (centroids)
    and outliers
  • Reduction via sampling
  • Will the patterns in the sample represent the
    patterns in the data?
  • Random sampling can produce poor results
  • Stratified sample (stratum = a group based on an
    attribute value)

37
Sampling Techniques
SRSWOR (simple random sample without replacement)
SRSWR (simple random sample with replacement)
Cluster/Stratified Sample
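
A minimal pandas sketch of the three sampling schemes; the DataFrame
and its cls attribute are hypothetical:

import pandas as pd

df = pd.DataFrame({"cls": ["a"] * 80 + ["b"] * 20, "value": range(100)})

srswor = df.sample(n=10, replace=False, random_state=0)  # SRSWOR
srswr  = df.sample(n=10, replace=True,  random_state=0)  # SRSWR

# Stratified sample: draw 10% from each stratum (group by attribute value)
stratified = df.groupby("cls").sample(frac=0.1, random_state=0)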
38
Discretization
  • 3 Types of attributes
  • nominal - values from an unordered set (also
    categorical attributes)
  • ordinal - values from an ordered set
  • continuous - real numbers (but sometimes also
    integer values)
  • Discretization is used to reduce the number of
    values for a given continuous attribute
  • usually done by dividing the range of the
    attribute into intervals
  • interval labels are then used to replace actual
    data values
  • Some data mining algorithms only accept
    categorical attributes and cannot handle a range
    of continuous attribute values
  • Discretization can also be used to generate
    concept hierarchies
  • reduce the data by collecting and replacing low
    level concepts (e.g., numeric values for age)
    by higher level concepts (e.g., young, middle
    aged, old)

39
Discretization - Example
  • Example: discretizing the Humidity attribute
    using 3 bins.

Low: 60-69    Normal: 70-79    High: 80 and above
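
A minimal pandas sketch of this 3-bin discretization; the humidity
values are hypothetical:

import pandas as pd

humidity = pd.Series([65, 70, 70, 75, 78, 80, 80, 85, 90, 95, 96])
labels = pd.cut(humidity, bins=[59, 69, 79, humidity.max()],
                labels=["Low", "Normal", "High"])
# 65 -> Low, 70..78 -> Normal, 80 and above -> High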
40
Converting Categorical Attributes to Numerical
Attributes
Attributes:
  Outlook: (overcast, rain, sunny)
  Temperature: real
  Humidity: real
  Windy: (true, false)
Standard Spreadsheet Format
Create separate columns for each value of a
categorical attribute (e.g., three values for the
Outlook attribute and two for the Windy attribute).
There is no change to the numerical attributes.
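
A minimal pandas sketch of this conversion using get_dummies; the
categorical attributes become separate 0/1 columns while Temperature
and Humidity pass through unchanged (the rows are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "Outlook":     ["sunny", "overcast", "rain"],
    "Temperature": [85, 83, 70],
    "Humidity":    [85, 78, 96],
    "Windy":       [False, True, True],
})
encoded = pd.get_dummies(df, columns=["Outlook", "Windy"])
# Columns: Temperature, Humidity, Outlook_overcast, Outlook_rain,
# Outlook_sunny, Windy_False, Windy_True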
41
Visualizing Patterns
  • Example Cross Tabulation

                    Windy   Not Windy
Outlook = sunny       2         3
Outlook = rain        2         3
Outlook = overcast    2         2
42
Evaluating Models
  • To train and evaluate models, data are often
    divided into three sets: the training set, the
    test set, and the evaluation set
  • Training Set
  • is used to build the initial model
  • may need to enrich the data to get enough of
    the special cases
  • Test Set
  • is used to adjust the initial model
  • models can be tweaked to be less sensitive to
    idiosyncrasies of the training data, yielding a
    more general model
  • idea is to prevent over-training (i.e., finding
    patterns where none exist).
  • Evaluation Set
  • is used to evaluate the model performance

43
Training Sets
  • The training set will be used to train the models
  • most important consideration: the training set
    needs to cover the full range of values for all
    the features that the model might encounter
  • good idea to have several examples for each value
    of a categorical feature and for a range of
    values for numerical features
  • Data Enrichment
  • training set should have sufficient number of
    examples of rare events
  • a random sample of available data is not
    sufficient since common examples will swamp the
    rare examples
  • training set needs to over-sample the rare cases
    so that the model can learn the features for rare
    events instead of predicting everything to be
    common outputs

44
Test and Evaluation Sets
  • Reading too much into the training set
    (overfitting)
  • common problem with most data mining algorithms
  • resulting model works well on the training set
    but performs miserably on unseen data
  • test set can be used to tweak the initial
    model, and to remove unnecessary inputs or
    features
  • Evaluation Set for Final Performance Evaluation
  • the test set cannot be used to evaluate model
    performance on future unseen data
  • the error rate on the evaluation set is a good
    predictor of the error rate on unseen data
  • Insufficient data to divide into three disjoint
    sets?
  • In such cases, validation techniques can play a
    major role
  • Cross Validation
  • Bootstrap Validation

45
Cross Validation
  • Cross validation is a heuristic that works as
    follows
  • randomly divide the data into n folds, each with
    approximately the same number of records
  • create n models using the same algorithm and
    training parameters; each model is trained with
    n-1 folds of the data and tested on the remaining
    fold
  • can be used to find the best algorithm and its
    optimal training parameters
  • Steps in Cross Validation
  • 1. Divide the available data into a training set
    and an evaluation set
  • 2. Split the training data into n folds
  • 3. Select an architecture (algorithm) and
    training parameters
  • 4. Train and test n models
  • 5. Repeat steps 2 to 4 using different
    architectures and parameters
  • 6. Select the best model
  • 7. Use all the training data to train the model
  • 8. Assess the final model using the evaluation set
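
A minimal scikit-learn sketch of these steps; the iris data and
decision tree are hypothetical stand-ins:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: split off an evaluation set
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 2-4: train and test n models on n folds (repeat with other
# architectures/parameters and keep the best)
model = DecisionTreeClassifier(max_depth=3)
scores = cross_val_score(model, X_train, y_train, cv=10)
print(scores.mean())

# Steps 7-8: retrain on all training data, assess on the evaluation set
model.fit(X_train, y_train)
print(model.score(X_eval, y_eval))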

46
Bootstrap Validation
  • Based on the statistical procedure of sampling
    with replacement
  • data set of n instances is sampled n times (with
    replacement) to give another data set of n
    instances
  • since some elements will be repeated, there will
    be elements in the original data set that are not
    picked
  • these remaining instances are used as the test
    set
  • How many instances in the test set?
  • Probability of not getting picked in one sampling:
    1 - 1/n
  • Pr(not getting picked in n samples) = (1 - 1/n)^n
    ≈ e^(-1) ≈ 0.368
  • so, for a large data set, the test set will
    contain about 36.8% of the instances
  • to compensate for the smaller training sample
    (63.2%), the test set error rate is combined with
    the re-substitution error on the training set:
  • e = (0.632 x e_test_instances) + (0.368 x
    e_training_instances)
  • Bootstrap validation increases the variance that
    can occur in each fold
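
A minimal NumPy/scikit-learn sketch of one bootstrap round with the
0.632 error estimate described above; the iris data and decision tree
are hypothetical stand-ins:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = len(X)
rng = np.random.default_rng(0)

boot = rng.integers(0, n, size=n)         # n draws with replacement
oob = np.setdiff1d(np.arange(n), boot)    # instances never picked (~36.8%)

model = DecisionTreeClassifier().fit(X[boot], y[boot])
e_test  = 1 - model.score(X[oob],  y[oob])   # error on unseen instances
e_train = 1 - model.score(X[boot], y[boot])  # re-substitution error
e = 0.632 * e_test + 0.368 * e_train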

47
Measuring Effectiveness
  • Predictive models are measured based on the
    accuracy of their predictions on unseen data
  • Classification and Prediction Tasks
  • accuracy measured in terms of error rate (usually
    the percentage of records classified incorrectly)
  • error rate on a pre-classified evaluation set
    estimates the real error rate
  • can also use cross-validation methods discussed
    before
  • Estimation Effectiveness
  • difference between predicted scores and the
    actual results (from evaluation set)
  • the accuracy of the model is measured in terms of
    variance (i.e., average of the squared
    differences)
  • to be able to express this in terms of the
    original units, compute the standard deviation
    (i.e., square root of the variance)

48
Ordinal or Nominal Outputs
  • When the output field is ordinal or nominal
    (e.g., in two-class prediction), we use the
    classification table, the so-called confusion
    matrix to evaluate the resulting model
  • Example

                 Predicted T   Predicted F
  Actual T           18             2
  Actual F            3            15

  • Overall correct classification rate = (18 + 15) /
    38 = 87%
  • Given T, correct classification rate = 18 / 20 =
    90%
  • Given F, correct classification rate = 15 / 18 =
    83%
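
The rates above follow directly from the matrix cells; a minimal
plain-Python sketch:

tp, fn = 18, 2   # actual T: 18 classified T, 2 classified F
fp, tn = 3, 15   # actual F: 3 classified T, 15 classified F

overall = (tp + tn) / (tp + fn + fp + tn)  # 33 / 38 = 0.87
rate_t  = tp / (tp + fn)                   # 18 / 20 = 0.90
rate_f  = tn / (fp + tn)                   # 15 / 18 = 0.83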
49
Measuring Effectiveness
  • Market Basket Analysis
  • MBA may be used for estimation or prediction
    (e.g., people who buy milk and diapers tend to
    also buy beer)
  • confidence: the percentage of transactions
    containing milk and diapers in which beer also
    occurs (i.e., conditional probability)
  • support: the percentage of transactions in the
    whole training set in which milk, diapers, and
    beer occur together (i.e., prior probability)
  • Distance-Based Techniques
  • clustering and memory-based reasoning: a measure
    of distance is used to evaluate the closeness or
    similarity of a point to cluster centers or to a
    neighborhood
  • regression: the line of best fit minimizes the sum
    of the distances between the line and the
    observations
  • for numeric variables, the Euclidean distance
    measure is often used (the square root of the sum
    of the squares of the differences along each
    dimension)
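
A minimal plain-Python sketch of the support and confidence
computations for the rule {milk, diapers} => {beer}, over hypothetical
transactions:

transactions = [
    {"milk", "diapers", "beer"},
    {"milk", "diapers"},
    {"milk", "beer"},
    {"diapers", "beer"},
    {"milk", "diapers", "beer"},
]
n = len(transactions)
body = sum({"milk", "diapers"} <= t for t in transactions)          # 3
rule = sum({"milk", "diapers", "beer"} <= t for t in transactions)  # 2

support = rule / n         # prior probability: 2/5 = 0.40
confidence = rule / body   # conditional probability: 2/3 = 0.67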

50
Measuring Effectiveness Lift Charts
  • usually used for classification, but can be
    adapted to other methods
  • measures the change in conditional probability of
    a target class when going from the general
    population (full test set) to a biased sample
  • Example
  • suppose the expected response rate to a direct
    mailing campaign is 5% in the training set
  • use a classifier to assign a "yes" or "no" value
    to the target class ("predicted to respond")
  • the "yes" group will contain a higher proportion
    of actual responders than the test set
  • suppose the "yes" group (our biased sample)
    contains 50% actual responders
  • this gives a lift of 10 = 0.5 / 0.05
  • What if the lift sample is too small?
  • need to increase the sample size
  • trade-off between lift and sample size
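
A minimal plain-Python sketch of the lift computation in this example:

overall_rate = 0.05   # 5% responders in the full test set
yes_rate = 0.50       # 50% responders in the biased "yes" sample

lift = yes_rate / overall_rate   # 10.0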