Data Preparation and - PowerPoint PPT Presentation

Slides: 30
Provided by: Zage
Learn more at: http://www.cs.bsu.edu
Transcript and Presenter's Notes

1
Data Preparation and Preliminary Analysis
2
Data
  • Once the data starts to flow, our attention turns
    to data analysis
  • Data preparation includes editing, coding and
    data entry
  • Exploring, displaying and examining data: the
    search for meaningful patterns
  • Data mining: used to extract patterns and
    predictive trends from databases

3
Data Editing
  • Editing: checking entries for correctness and
    consistency
  • Coding: assigning numbers
  • Data entry: spreadsheet, data editor of a
    statistical program, or database
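
As a minimal sketch of the editing and coding steps, the snippet below assigns numeric codes to category labels and flags entries that fail a simple consistency check. The codes follow the Market Sector frequency table shown later in this presentation; the raw entries and the function name `code_responses` are hypothetical and used only for illustration.

```python
# Codebook: category label -> numeric code (codes taken from the
# Market Sector frequency table; only the first five are listed here).
CODEBOOK = {
    "Chemicals": 1,
    "Consumer Products": 2,
    "Durables": 3,
    "Energy": 4,
    "Financial": 5,
}


def code_responses(responses):
    """Return (codes, rejected): numeric codes for valid entries and
    a list of (position, raw_value) pairs that need editing."""
    codes, rejected = [], []
    for position, raw in enumerate(responses):
        label = raw.strip().title()  # normalise spacing and capitalisation
        if label in CODEBOOK:
            codes.append(CODEBOOK[label])
        else:
            codes.append(None)  # keep the position; treat as missing
            rejected.append((position, raw))
    return codes, rejected


# Hypothetical raw entries, including one misspelled value.
raw_entries = ["Energy", " financial", "Enrgy", "Consumer products"]
codes, rejected = code_responses(raw_entries)
print(codes)
print(rejected)
```

Entries that cannot be matched are kept as missing rather than guessed, so they can be corrected by hand before data entry.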

4
Exploring Data
  • You could move directly into the statistical
    analysis
  • When the study's purpose is not the production of
    causal inferences, confirmatory data analysis is
    not required
  • When it is, you should discover as much as
    possible about the data before selecting the
    appropriate means of confirmation

5
Exploratory Data Analysis
  • Set of techniques
  • The flexibility to respond to the patterns
    revealed by successive iterations in the
    discovery process is an important attribute
  • EDA can be compared to the role of police
    detectives and other investigators
  • Confirmatory analysis can be compared to the role
    of the judge
  • The former are involved in the search for clues;
    the latter is preoccupied with evaluating the
    strength of the evidence

6
EDA
  • Free to take many paths in revealing mysteries in
    the data
  • Emphasizes visual representations and graphical
    techniques over summary statistics
  • Summary statistics may obscure or conceal the
    underlying structure of the data
  • When numerical summaries are used exclusively and
    accepted without visual inspection, the selection
    of confirmatory modes may be based on flawed
    assumptions and may produce erroneous conclusions

7
Techniques for Displaying Data
  • Frequency Tables
  • Bar Charts
  • Pie Charts

8
Frequency Tables
  • Information
  • Displays the data from the lowest value to the
    highest
  • Columns for percent
  • Percent adjusted for missing values
  • Cumulative percent
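
The columns listed above can be sketched in a few lines of Python. This is an illustrative example only: the data are hypothetical coded values, `None` is assumed to mark a missing case, and the function name `frequency_table` is not from the presentation.

```python
from collections import Counter


def frequency_table(values):
    """Build rows of (value, frequency, percent, valid percent,
    cumulative percent). None marks a missing value."""
    counts = Counter(v for v in values if v is not None)
    total = len(values)            # all cases, including missing
    valid = sum(counts.values())   # cases with a recorded value
    rows, cumulative = [], 0.0
    for value in sorted(counts):   # lowest value to highest
        freq = counts[value]
        percent = 100.0 * freq / total
        valid_percent = 100.0 * freq / valid  # adjusted for missing values
        cumulative += valid_percent
        rows.append((value, freq, percent, valid_percent, cumulative))
    return rows, valid, total - valid


# Hypothetical coded data: 10 cases, 2 of them missing.
data = [1, 2, 2, 3, 3, 3, 3, None, 1, None]
rows, valid_cases, missing_cases = frequency_table(data)
for value, freq, pct, valid_pct, cum_pct in rows:
    print(f"{value:>5} {freq:>5} {pct:>7.1f} {valid_pct:>7.1f} {cum_pct:>7.1f}")
print("Valid cases:", valid_cases, "Missing cases:", missing_cases)
```

When there are no missing cases, the percent and valid percent columns coincide.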

9
A Frequency Table for Market Sector
Value Label        Value  Frequency  Percent  Valid Percent  Cum. Percent
Chemicals              1         10     10.0           10.0          10.0
Consumer Products      2          8      8.0            8.0          18.0
Durables               3          7      7.0            7.0          25.0
Energy                 4         13     13.0           13.0          38.0
Financial              5         24     24.0           24.0          62.0
Health                 6          4      4.0            4.0          66.0
High-Tech              7         11     11.0           11.0          77.0
Insurance              8          6      6.0            6.0          83.0
Retailing              9          7      7.0            7.0          90.0
Other                 10         10     10.0           10.0         100.0
Total                           100    100.0          100.0
Valid Cases 100    Missing Cases 0
10
Sector Bar Chart Display
11
Sector Pie Chart Display
12
Analysis
  • The values and percentages are more readily
    understood in graphic format.
  • The relative sizes of the sectors can be
    visualized with the bar and pie charts.

13
Another Frequency Table (Ratio-Interval Data)
Row  Value  Freq.  Pct.  Cum. Pct.
  1   54.9      1     2      2
  2   55.4      1     2      4
  3   55.6      1     2      6
  4   56.4      1     2      8
  5   56.8      1     2     10
  6   56.9      1     2     12
  7   57.8      1     2     14
  8   58.1      1     2     16
  9   58.2      1     2     18
 10   58.3      1     2     20
 11   58.5      1     2     22
 12   59.2      2     4     26
 13   61.5      1     2     28
 14   62.6      1     2     30
 15   64.8      1     2     32
 16   66.0      2     4     36
 17   66.3      1     2     38
 18   67.6      1     2     40
 19   69.1      1     2     42
 20   69.2      1     2     44
 21   70.5      1     2     46
 22   72.7      1     2     48
 23   72.9      1     2     50
 24   73.5      1     2     52

14
Interval-Ratio Data
  • The last table was not informative
  • Its primary contribution was an ordered list of
    values
  • If converted to a bar chart, it would have 48
    bars of equal length and two bars with two
    occurrences
  • A pie chart would also be pointless
  • Notice that when the variable of interest is
    measured on an interval-ratio scale and can take
    one of many potential values, these techniques
    are not particularly informative

15
Histogram
  • Conventional solution for display of
    interval-ratio data
  • Groups the variable's values into intervals
  • Useful for:
  • Displaying all intervals in a distribution, even
    those without observed values
  • Examining the shape of the distribution for
    skewness, kurtosis and the modal pattern
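
The grouping step can be sketched as follows. This is a minimal illustration, not the presenter's procedure: the sample is a small subset of values that appear elsewhere in this presentation, the interval width of 20 is borrowed from the histogram slide that follows, and the function name `histogram_counts` is hypothetical.

```python
import math


def histogram_counts(values, width):
    """Count values per interval [start, start + width), keeping
    intervals with zero observations so gaps stay visible."""
    first = math.floor(min(values) / width) * width
    n_bins = int((max(values) - first) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - first) // width)] += 1
    return [(first + i * width, first + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Subset of values taken from the frequency table and boxplot example.
sample = [54.9, 55.4, 55.6, 56.4, 61.5, 62.6, 66.0, 66.0, 72.7, 73.5, 218.2]
for low, high, count in histogram_counts(sample, width=20):
    print(f"{low:>5.0f}-{high:<5.0f} {'*' * count}")
```

Printing the empty intervals is deliberate: the run of blank rows between the main cluster and the largest value is exactly the kind of gap a histogram is meant to expose.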

16
Histogram
  • Questions to ask
  • Is there a single hump?
  • Are subgroups identifiable when multiple modes
    are present?
  • Are straggling data values detached from the
    central concentration?

17
Histogram when grouping in increments of 20
18
Observations
  • Intervals with 0 counts show gaps in the data and
    alert the analyst to look for problems with
    spread
  • There are two extreme values
  • Along with the peaked midpoint and reduced number
    of observations in the upper tail, this histogram
    warns us of irregularities in the data.

19
Stem and Leaf Displays
  • Closely related to the histogram
  • Shares features but offers unique advantages
  • Easy to construct by hand for small samples
  • In contrast to histograms, which lose information
    by grouping values into intervals, the actual
    data can be inspected directly
  • Range of data is apparent at a glance
  • Shape and spread impressions are also immediate

20
Stem and Leaf Displays
  • To develop a display, the first digit(s) of each
    data item are arranged to the left of a vertical
    line, with the remaining digit placed to the right.
  • Each row is referred to as a stem and each piece
    of information on it as a leaf
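
A stem-and-leaf display can be built along these lines. The sketch assumes stems are the tens digit (and above) and leaves are the units digit, with decimals truncated; the sample reuses the first rows of the ratio-interval frequency table, and the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit and above) to its sorted leaves
    (units digit). Decimals are truncated, an assumption made here."""
    stems = defaultdict(list)
    for v in sorted(values):
        whole = int(v)  # drop the decimal part
        stems[whole // 10].append(whole % 10)
    # Include empty stems so gaps in the data remain visible.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}


sample = [54.9, 55.4, 55.6, 56.4, 56.8, 56.9, 57.8, 58.1, 58.2,
          58.3, 58.5, 59.2, 59.2, 61.5, 62.6, 64.8, 66.0, 66.0]
for stem, leaves in stem_and_leaf(sample).items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because every leaf is printed, the individual values can still be read off the display, which is the advantage over a histogram noted above.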

21
Example of a Stem and Leaf Display
  1. 0 2 2 3 5 6 7 8
  2. 4 5 5 6 6 6 7 8 8 8 8 9 9
  3. 0 2 2 6 8
  4. 2 4
  5. 0 1 8
  6. 3
  7. 1
  8. 1
  9. 0 6
  10. 3
  11. 3 6
  12. 3
  13. 6
  14. 8

22
Boxplots
  • Another technique for exploratory data analysis
  • Boxplot reduces the detail of the stem-and-leaf
    display and provides a different visual image of
    the distribution's location, spread, shape, tail
    length, and outliers
  • Summary consists of the median, upper and lower
    quartiles, and the largest and smallest
    observations.
  • The median and quartiles are used because they
    are particularly resistant statistics.

23
Resistant Statistics
  • Example data set 5,6,6,7,7,7,8,8,9
  • The mean is 7 and the (sample) standard deviation
    is 1.22
  • Replace the 9 with 90 and the mean becomes 16 and
    the standard deviation 27.77.
  • Changing only one of the nine values has
    disturbed the location and spread summaries to
    the point where they no longer represent the
    other eight values. Both mean and standard
    deviation are considered nonresistant statistics
  • The median remained at 7 and the lower and upper
    quartiles stayed at 6 and 8, respectively.
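
The comparison can be checked with Python's standard `statistics` module. This is a sketch under two assumptions: the standard deviation uses the sample (n - 1) formula, and the quartiles use the module's "inclusive" interpolation method, which may differ from the hinge rule used for boxplots on other data sets. The function name `summarize` is illustrative.

```python
import statistics


def summarize(data):
    """Return mean, sample standard deviation, median and quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),  # sample (n - 1) standard deviation
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
    }


original = [5, 6, 6, 7, 7, 7, 8, 8, 9]
altered = original[:-1] + [90]  # replace the 9 with 90

for name, data in (("original", original), ("altered", altered)):
    s = summarize(data)
    print(name, round(s["mean"], 2), round(s["stdev"], 2),
          s["median"], s["q1"], s["q3"])
```

With the sample formula the standard deviation moves from about 1.22 to about 27.77, while the median and both quartiles are identical for the two data sets.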

24
Boxplots
  • Rectangular plot that encompasses 50 percent of
    the data values
  • A center line (or other notation) marks the
    median and runs through the width of the box
  • The edges of the box are called hinges
  • Whiskers extend from the right and left hinges
    to the largest and smallest values

25
Boxplot Components
  • Median
  • IQR: 50% of observed values are within the box
  • Whiskers: extend to the largest observed value
    within 1.5 IQR of the upper hinge and the smallest
    observed value within 1.5 IQR of the lower hinge
  • Inner fences: lower hinge minus 1.5(IQR) and
    upper hinge plus 1.5(IQR)
  • Outer fences: lower hinge minus 3(IQR) and
    upper hinge plus 3(IQR)
  • Outside value, or outlier
  • Extreme, or far outside, value
26
Example
  • Minimum: 54.9
  • Lower hinge: 60.3
  • Median: 74.55
  • Upper hinge: 111.52
  • Maximum: 218.2
  • IQR = 111.52 - 60.3 = 51.22
  • 0.5(IQR) = 25.61
  • Lower inner fence = lower hinge - 1.5(IQR)
    = 60.3 - (51.22 + 25.61) = -16.53
  • Upper inner fence = upper hinge + 1.5(IQR)
    = 111.52 + (51.22 + 25.61) = 188.35
  • The smallest and largest values from the
    distribution within the fences are used to
    determine the whisker length
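
The fence arithmetic above can be reproduced with a short sketch. The hinge values come from this example and the 1.5(IQR) and 3(IQR) multipliers from the Boxplot Components slide; the function names `boxplot_fences` and `classify` and the labels they return are illustrative assumptions.

```python
def boxplot_fences(lower_hinge, upper_hinge):
    """Return the IQR plus inner (1.5 IQR) and outer (3 IQR) fences."""
    iqr = upper_hinge - lower_hinge
    return {
        "iqr": iqr,
        "inner": (lower_hinge - 1.5 * iqr, upper_hinge + 1.5 * iqr),
        "outer": (lower_hinge - 3.0 * iqr, upper_hinge + 3.0 * iqr),
    }


def classify(value, fences):
    """Label a value relative to the inner and outer fences."""
    inner_low, inner_high = fences["inner"]
    outer_low, outer_high = fences["outer"]
    if value < outer_low or value > outer_high:
        return "extreme"
    if value < inner_low or value > inner_high:
        return "outlier"
    return "inside"


fences = boxplot_fences(lower_hinge=60.3, upper_hinge=111.52)
print("IQR:", round(fences["iqr"], 2))
print("Inner fences:", [round(f, 2) for f in fences["inner"]])
print("Outer fences:", [round(f, 2) for f in fences["outer"]])
print("Minimum 54.9:", classify(54.9, fences))
print("Maximum 218.2:", classify(218.2, fences))
```

Under these rules the minimum lies inside the fences, while the maximum of 218.2 falls between the upper inner and outer fences, so it would be plotted as an outside value rather than as the end of a whisker.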

27
Observations
  • In preliminary analysis, it is important to
    separate legitimate outliers from errors in
    measurement, editing, coding and data entry
  • Outliers that are mistakes should be corrected or
    removed

28
Other Observations
Panels: Symmetric, Right Skewed, Left Skewed, Small Spread
29
Visual Techniques of EDA
  • Gain insight into the data
  • More common ways of summarizing location,
    spread, and shape
  • Used resistant statistics
  • From these we could make decisions on test
    selection and whether the data should be
    transformed or reexpressed before further analysis