Tools for Data Preparation - PowerPoint PPT Presentation

Provided by: lab2102

1
Tools for Data Preparation
  • November 8, 2002

2
Why Data Preparation?
Data Mining Stages   Time to Complete (% of Total)   Importance to Success (% of Total)
Data Preparation     75                              75
Data Surveying       20                              15
Data Modeling        5                               10
  • Source: D. Pyle, Data Preparation for Data Mining, 1999

3
Data Preparation Process
  • Data Selection
  • Data Cleaning
  • New Data Construction
  • Data Formatting

4
Data Selection
  • Based on the following criteria:
  • Data quality properties: completeness and
    correctness
  • Technical constraints, such as limits on data
    volume or data type, related to the data mining
    tools

5
Data Cleaning
  • Possible techniques for data cleaning:
  • Data normalization,
    e.g., decimal scaling into the range (0,1) by
    mapping, or standard deviation normalization.
  • Data smoothing,
    e.g., discretization of numeric attributes, which
    is helpful or even necessary for logic-based
    methods.
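
The three techniques named on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function names and the `ages` sample are hypothetical, decimal scaling is assumed to mean dividing by the smallest power of ten that brings every absolute value below 1, and discretization is assumed to use equal-width bins.

```python
import statistics

def decimal_scale(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    k = 0
    while largest / (10 ** k) >= 1:
        k += 1
    return [v / (10 ** k) for v in values]

def zscore(values):
    """Standard deviation normalization: subtract the mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def equal_width_bins(values, n_bins):
    """Discretize numeric values into n_bins equal-width intervals (0-based bin ids).

    Assumes the values are not all identical, otherwise the bin width is zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last interval, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [23, 35, 47, 59, 71]          # hypothetical numeric attribute
scaled = decimal_scale(ages)         # every value divided by 100
standardized = zscore(ages)          # mean 0, sample SD 1
binned = equal_width_bins(ages, 3)   # bin ids in the range 0..2
```

Decimal scaling keeps the relative spacing of the values, while the z-score form is the usual choice when attributes measured in different units must be compared.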

6
Data Cleaning (Contd.)
  • Treatment of missing values.
  • Predict missing values and replace them with
    the least biased values, e.g., values that
    preserve the relationships between variables.
  • Data reduction.
  • The most usual step: examine the attributes
    and consider their predictive potential,
  • e.g., attribute selection from means and
    variances, or
  • merging features using a linear transform.
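
One way to read "attribute selection from means and variances" is a simple two-class test: keep an attribute when its class means differ by more than a few standard errors. The sketch below assumes that reading; the function names, the threshold of 2.0, and the sample data are all hypothetical and are not taken from the slides.

```python
import math
import statistics

def mean_variance_score(class_a, class_b):
    """Distance between two class means, measured in standard errors."""
    se = math.sqrt(statistics.variance(class_a) / len(class_a)
                   + statistics.variance(class_b) / len(class_b))
    return abs(statistics.mean(class_a) - statistics.mean(class_b)) / se

def select_features(samples_a, samples_b, threshold=2.0):
    """Return the indices of features whose score exceeds the assumed threshold."""
    kept = []
    for j in range(len(samples_a[0])):
        col_a = [row[j] for row in samples_a]
        col_b = [row[j] for row in samples_b]
        if mean_variance_score(col_a, col_b) > threshold:
            kept.append(j)
    return kept

# Hypothetical data: feature 0 separates the two classes, feature 1 does not.
class_a = [[1.0, 5.1], [1.2, 4.8], [0.9, 5.3], [1.1, 4.9]]
class_b = [[3.0, 5.0], [3.2, 5.2], [2.9, 4.7], [3.1, 5.1]]
selected = select_features(class_a, class_b)
```

With this data only feature 0 is retained. The score is undefined when an attribute is constant in both classes, so such attributes would need to be dropped beforehand.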

7
Data Missing Example
Position   Original Sample   Position 11 Missing   Preserve Mean   Preserve Variance
1          0.0886            0.0886                0.0886          0.0886
2          0.0684            0.0684                0.0684          0.0684
3          0.3515            0.3515                0.3515          0.3515
4          0.9874            0.9874                0.9874          0.9874
5          0.4713            0.4713                0.4713          0.4713
6          0.6115            0.6115                0.6115          0.6115
7          0.2573            0.2573                0.2573          0.2573
8          0.2914            0.2914                0.2914          0.2914
9          0.1662            0.1662                0.1662          0.1662
10         0.4400            0.4400                0.4400          0.4400
11         0.6939            ?                     0.3731          0.6629
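
The two replacement values in the table can be reproduced approximately from the ten known values. The slide does not say how they were computed, so the sketch below is an assumption: the mean-preserving value is taken to be the mean of the known values, and the variance-preserving value is taken to be the value above the mean that leaves the sample variance (n - 1 divisor) unchanged when appended. The function names are hypothetical.

```python
import math
import statistics

# The ten known values from the table (positions 1 to 10).
sample = [0.0886, 0.0684, 0.3515, 0.9874, 0.4713,
          0.6115, 0.2573, 0.2914, 0.1662, 0.4400]

def mean_preserving_value(values):
    """Appending the mean itself leaves the mean unchanged."""
    return statistics.mean(values)

def variance_preserving_value(values):
    """Value above the mean that keeps the sample variance unchanged when appended.

    Appending x adds n/(n+1) * (x - mean)^2 to the sum of squared deviations,
    and the divisor grows from n-1 to n, so (x - mean)^2 = variance * (n+1)/n.
    """
    n = len(values)
    mean = statistics.mean(values)
    var = statistics.variance(values)
    return mean + math.sqrt(var * (n + 1) / n)

fill_mean = mean_preserving_value(sample)
fill_var = variance_preserving_value(sample)
```

Computed from the rounded values shown, the results come to roughly 0.373 and 0.663, which agree with the table's 0.3731 and 0.6629 to three decimals. A second variance-preserving value exists the same distance below the mean; the table appears to use the upper one. Neither replacement recovers the original 0.6939, which is the point of the slide: each choice preserves one property of the sample and distorts others.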
 
8
New Data Construction
  • Constructive operations on the selected data
    include:
  • Derivation of new attributes from the existing
    attributes.
  • Generation of new records.
  • Data transformation.
  • Merging tables.
  • Aggregation: summarizing information from
    multiple records and/or tables.
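
Three of these operations (deriving an attribute, aggregating, and merging tables) can be shown together on a toy example. The table and field names below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical source tables.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
orders = [
    {"customer_id": 1, "quantity": 2, "unit_price": 5.0},
    {"customer_id": 1, "quantity": 1, "unit_price": 20.0},
    {"customer_id": 2, "quantity": 4, "unit_price": 2.5},
]

# Derivation: a new attribute computed from existing attributes.
for order in orders:
    order["total"] = order["quantity"] * order["unit_price"]

# Aggregation: summarize multiple order records per customer.
spend = defaultdict(float)
for order in orders:
    spend[order["customer_id"]] += order["total"]

# Merging: attach the aggregate to the customer table by key.
merged = [
    {**customer, "total_spend": spend.get(customer["customer_id"], 0.0)}
    for customer in customers
]
```

After these steps each customer record carries a `total_spend` attribute (30.0 for Ann and 10.0 for Bob) that did not exist in either source table.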

9
Data Formatting
  • It involves syntactic modifications required by
    the modeling tools:
  • Reordering of the attributes or records.
  • Changes related to the constraints of the
    modeling tools, e.g., removing commas or tabs,
    trimming strings to the maximum allowed number of
    characters, and replacing special characters with
    an allowed set of special characters.
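
The formatting changes listed above can be sketched as a single cleanup function. The limits used here (a 20-character maximum, and `_`, `-`, `.` as the allowed special characters) are assumptions for illustration; a real modeling tool would dictate its own.

```python
import re

def format_field(text, max_len=20, allowed_specials="_-."):
    """Apply tool-driven syntactic cleanup to one text field."""
    # Remove commas and tabs, which many flat-file readers treat as delimiters.
    cleaned = text.replace(",", " ").replace("\t", " ")
    # Replace any character outside letters, digits, spaces and the allowed set.
    pattern = r"[^A-Za-z0-9 " + re.escape(allowed_specials) + r"]"
    cleaned = re.sub(pattern, "_", cleaned)
    # Collapse runs of spaces left behind by the replacements.
    cleaned = re.sub(r" +", " ", cleaned).strip()
    # Trim to the maximum length the tool accepts.
    return cleaned[:max_len]

formatted = format_field("Smith, John\t(NY) #42")
shortened = format_field("Smith, John\t(NY) #42", max_len=10)
```

Here `formatted` is `"Smith John _NY_ _42"` and `shortened` is `"Smith John"`. Trimming is done last so that the length limit applies to the cleaned text.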

10
Data Preparation Tools
  • Data Junction Integration Studio
    - http://www.datajunction.com/
  • SPSS Base 11.5
    - http://www.spss.com/
  • Informatica PowerCenter
    - http://www.informatica.com/
  • WizWhy
    - http://www.wizsoft.com/

11
Data Junction Integration Studio
  • It includes five visual design tools:
  • Process Designer:
  • Full conditional flow control
  • Testing of global variables
  • Execution of external processes and a complete
    expression language allow for automation of
    complex event-driven or scheduled routines
  • Multi-threaded Integration Engine

12
Data Junction Integration Studio (Contd.)
  • Map Designer:
  • Mapping source data to target structures
  • Defining rules for mapping complex hierarchical
    structures
  • Defining complex rules for record filtering
  • Error and reject record handling
  • Error logging

13
Data Junction Integration Studio (Contd.)
  • Metadata Query:
  • Allows users to run queries against the Data
    Junction Metadata Repository
  • Record Layout Designer
  • A visual tool for defining or modifying data
    structures (including field names, sizes, length,
    offset, data types, etc.) for both sources and
    targets

14
Data Junction Integration Studio (Contd.)
  • Universal Data Browser:
  • Allows users to view files other than the sources
    and targets involved in a current design session
  • View data formats from applications not installed
    on the system

15
SPSS Base 11.5 Data Preparation Components
  1. Data Editor: a spreadsheet-like system for
    defining, entering, editing and displaying data
  2. Data preparation tools: get data ready for
    analysis. Use the Define Variable Properties tool
    to easily set up data dictionary information (such
    as value labels, variable labels and variable
    types) as a "template" so it can be applied to
    other data files and to other variables within
    the same file. Apply the dictionary information
    using the Copy Data Properties tool.

16
SPSS Base 11.5 (Contd.)
  • Data Restructure Wizard: takes a data file that
    has multiple records per subject and restructures
    it so the data for each subject are in a single
    record. No need to set up vectors or loops.
    Particularly helpful with transactional data.
  • Can also do the reverse: take data from a single
    record and spread it across multiple cases.
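
The restructuring described here is the familiar long-to-wide reshape and its reverse. The sketch below shows the idea in plain Python rather than SPSS; the function names, the `subject`/`visit`/`amount` fields, and the `amount_1`, `amount_2` column naming are hypothetical, and the index values are assumed to be integers.

```python
from collections import defaultdict

def long_to_wide(records, key, index, value):
    """Collapse multiple records per subject into one record per subject."""
    wide = defaultdict(dict)
    for record in records:
        wide[record[key]][f"{value}_{record[index]}"] = record[value]
    return [{key: k, **columns} for k, columns in wide.items()]

def wide_to_long(rows, key, index, value):
    """Reverse action: spread one record per subject across multiple cases."""
    prefix = value + "_"
    cases = []
    for row in rows:
        for column, cell in row.items():
            if column.startswith(prefix):
                cases.append({key: row[key],
                              index: int(column[len(prefix):]),
                              value: cell})
    return cases

# Hypothetical transactional data: several records per subject.
transactions = [
    {"subject": "A", "visit": 1, "amount": 10},
    {"subject": "A", "visit": 2, "amount": 15},
    {"subject": "B", "visit": 1, "amount": 7},
]
wide = long_to_wide(transactions, "subject", "visit", "amount")
back = wide_to_long(wide, "subject", "visit", "amount")
```

Subjects with fewer records simply lack the corresponding columns in the wide form (subject B has no `amount_2`), which is where missing values typically enter after such a restructure.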

17
SPSS Base 11.5 (Contd.)
  • Data transformations: work with combined data
    more reliably by "flipping" responses so all
    the data are in the same direction.
  • e.g., this helps to create multiple-item indices
    when working with surveys that ask respondents to
    give both positively worded and negatively worded
    responses.
  • Other transformation capabilities include
    conditional transformations, computing new
    variables and recoding values.
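
"Flipping" is commonly called reverse coding. A minimal sketch outside SPSS, assuming a 1 to 5 response scale; the item names, the set of negatively worded items, and the recode cut-offs are hypothetical.

```python
def flip(response, low=1, high=5):
    """Reverse-code a response so that 1 becomes 5, 2 becomes 4, and so on."""
    return low + high - response

# Hypothetical survey answers; q2 is assumed to be negatively worded.
respondent = {"q1": 4, "q2": 2, "q3": 5}
NEGATIVE_ITEMS = {"q2"}

# Conditional transformation: flip only the negatively worded items.
aligned = {item: flip(answer) if item in NEGATIVE_ITEMS else answer
           for item, answer in respondent.items()}

# Compute a new variable: the multiple-item index is the mean of aligned items.
index_score = sum(aligned.values()) / len(aligned)

def recode(score):
    """Recode the numeric index into a labelled category (assumed cut-offs)."""
    if score >= 4:
        return "high"
    if score >= 2.5:
        return "medium"
    return "low"

category = recode(index_score)
```

Without the flip, the answer of 2 on the negatively worded item would pull the index down even though it expresses the same attitude as a 4 on a positively worded item.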

18
WizWhy
  • Features
  • Performs Boolean as well as multi-value analysis
  • Analyzes the data by discovering all the if-then
    rules
  • Reveals necessary and sufficient conditions
    (if-and-only-if rules)
  • Calculates the error probability of each rule
  • Reveals the interesting phenomena in the data by
    uncovering the unexpected rules
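
To make the difference between an if-then rule and an if-and-only-if rule concrete, the sketch below checks one candidate rule against a handful of records. It is not WizWhy's algorithm and does not compute its error probability; the function name, record fields, and data are hypothetical, loosely modelled on the kind of rule shown in the report example.

```python
def rule_stats(records, condition, conclusion):
    """Count how often a candidate rule holds in each direction."""
    matches_if = [r for r in records if condition(r)]
    matches_then = [r for r in records if conclusion(r)]
    both = [r for r in matches_if if conclusion(r)]
    return {
        # Number of records in which condition and conclusion both hold.
        "support": len(both),
        # Share of records meeting the condition that also meet the conclusion.
        "if_then": len(both) / len(matches_if) if matches_if else None,
        # Share of records meeting the conclusion that also meet the condition.
        "then_if": len(both) / len(matches_then) if matches_then else None,
    }

# Hypothetical records.
records = [
    {"customer": "MORGAN LTD", "key": 985},
    {"customer": "MORGAN LTD", "key": 985},
    {"customer": "SMITH CO", "key": 101},
    {"customer": "JONES INC", "key": 202},
]
stats = rule_stats(records,
                   condition=lambda r: r["key"] == 985,
                   conclusion=lambda r: r["customer"].startswith("MORGA"))
is_iff = stats["if_then"] == 1.0 and stats["then_if"] == 1.0
```

A rule qualifies as if-and-only-if in this data only when it holds without exception in both directions, which is the case here.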

19
WizWhy Features (Contd.)
  • Predicts new cases on the basis of the
    discovered rules
  • Explains predictions by listing relevant rules
  • Calculates each prediction's conclusive
    probability and error probability
  • Predictions are based on error costs (the cost of
    a miss vs. the cost of a false alarm) and are not
    influenced by subjective choices
  • Points out cases deviating from the discovered
    rules
  • Proven to be faster and more accurate than
    other data mining methods
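
Basing a prediction on error costs usually means choosing the outcome with the lower expected cost instead of the more probable one. The sketch below shows that standard decision rule; it is an assumption about what the slide means, not WizWhy's documented procedure, and the function name and numbers are hypothetical.

```python
def predict_positive(p_positive, cost_miss, cost_false_alarm):
    """Predict the positive class when doing so has the lower expected cost."""
    # Predicting negative is wrong with probability p_positive (a miss).
    expected_cost_if_negative = p_positive * cost_miss
    # Predicting positive is wrong with probability 1 - p_positive (a false alarm).
    expected_cost_if_positive = (1 - p_positive) * cost_false_alarm
    return expected_cost_if_positive < expected_cost_if_negative

# The same 30% probability leads to different predictions under different costs.
costly_miss = predict_positive(0.3, cost_miss=10, cost_false_alarm=1)
equal_costs = predict_positive(0.3, cost_miss=1, cost_false_alarm=1)
```

When a miss costs ten times as much as a false alarm, a case with only a 30% probability is still predicted positive; with equal costs it is predicted negative.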

20
WizWhy Rules Report Example
1) CUSTOMER starts with MORGA if and only if KEY is 985
   The rule exists in 32 records.
   Significance Level: Error probability is almost 0