Data Preparation and Preliminary Analysis

Slides: 30

Provided by: Zage

Learn more at: http://www.cs.bsu.edu

Transcript and Presenter's Notes

1
Data Preparation and Preliminary Analysis
2
Data

Once the data starts to flow, our attention turns
to data analysis
Data preparation includes editing, coding and
data entry
Exploring, displaying and examining data: the
search for meaningful patterns
Data mining: used to extract patterns and
predictive trends from databases

3
Data Editing

Checking entries for correctness and consistency
Coding: assigning numbers
Data entry: spreadsheet, data editor of a
statistical program, or database
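
Coding and editing can be illustrated with a small sketch. The codebook below reuses the sector labels and numbers from the Market Sector frequency table later in this presentation; the function name `code_responses` and the raw entries are hypothetical, chosen only to show how an entry that fails the codebook lookup can be flagged for editing.

```python
# Codebook taken from the Market Sector frequency table in this presentation.
SECTOR_CODES = {
    "Chemicals": 1, "Consumer Products": 2, "Durables": 3, "Energy": 4,
    "Financial": 5, "Health": 6, "High-Tech": 7, "Insurance": 8,
    "Retailing": 9, "Other": 10,
}

def code_responses(labels, codebook):
    """Map each label to its numeric code; unknown labels become None."""
    return [codebook.get(label.strip()) for label in labels]

# Hypothetical raw entries; the misspelled one is flagged for editing.
raw = ["Energy", "Financial ", "Helth", "Other"]
codes = code_responses(raw, SECTOR_CODES)
to_review = [label for label, code in zip(raw, codes) if code is None]
print(codes, to_review)
```

Returning `None` for an unknown label, instead of raising an error, keeps the whole batch flowing while still leaving a list of entries to check by hand.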

4
Exploring Data

You could move directly into the statistical
analysis
When the study's purpose is not the production of
causal inferences, confirmatory data analysis is
not required
When it is, you should discover as much as
possible about the data before selecting the
appropriate means of confirmation

5
Exploratory Data Analysis

Set of techniques
The flexibility to respond to the patterns
revealed by successive iterations in the
discovery process is an important attribute
EDA can be compared to the role of the police
detectives and other investigators
Confirmatory analysis can be compared to the role
of the judge
The former are involved in the search for clues;
the latter are preoccupied with evaluating the
strength of the evidence

6
EDA

Free to take many paths in revealing mysteries in
the data
Emphasizes visual representations and graphical
techniques over summary statistics
Summary statistics may obscure or conceal the
underlying structure of the data
When numerical summaries are used exclusively and
accepted without visual inspection, the selection
of confirmatory modes may be based on flawed
assumptions and may produce erroneous conclusions

7
Techniques for Displaying Data

Frequency Tables
Bar Charts
Pie Charts

8
Frequency Tables

Information
Displays the data from the lowest value to the
highest
Columns for percent
Percent adjusted for missing values
Cumulative percent
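
The columns listed above can be computed in a few lines. This is a minimal sketch, not the output of any particular statistical package: the function name `frequency_table` and the sample responses are hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, percent, valid percent, cumulative percent).

    None is treated as a missing value (an assumption of this sketch).
    """
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):                    # lowest value to highest
        freq = counts[value]
        percent = 100.0 * freq / len(values)        # share of all cases
        valid_percent = 100.0 * freq / len(valid)   # adjusted for missing values
        cumulative += valid_percent
        rows.append((value, freq, percent, valid_percent, cumulative))
    return rows

# Hypothetical coded responses; None marks a missing answer.
responses = [1, 2, 2, 3, 3, 3, None, 1, 2, None]
for row in frequency_table(responses):
    print("%5s %5d %7.1f %7.1f %7.1f" % row)
```

When no values are missing, the percent and valid percent columns coincide, as they do in the Market Sector table.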

9
A Frequency Table for Market Sector
Value Label        Value  Frequency  Percent  Valid Percent  Cum. Percent
Chemicals              1         10     10.0           10.0          10.0
Consumer Products      2          8      8.0            8.0          18.0
Durables               3          7      7.0            7.0          25.0
Energy                 4         13     13.0           13.0          38.0
Financial              5         24     24.0           24.0          62.0
Health                 6          4      4.0            4.0          66.0
High-Tech              7         11     11.0           11.0          77.0
Insurance              8          6      6.0            6.0          83.0
Retailing              9          7      7.0            7.0          90.0
Other                 10         10     10.0           10.0         100.0
Total                           100    100.0          100.0
Valid Cases 100
Missing Cases 0
10
Sector Bar Chart Display
11
Sector Pie Chart Display
12
Analysis

The values and percentages are more readily
understood in graphic format.
The relative sizes of the sectors can be
visualized with the bar and pie charts

13
Another Frequency Table (Ratio-Interval Data)

Row  Value  Freq.  Percent  Cum. Percent
  1   54.9      1        2             2
  2   55.4      1        2             4
  3   55.6      1        2             6
  4   56.4      1        2             8
  5   56.8      1        2            10
  6   56.9      1        2            12
  7   57.8      1        2            14
  8   58.1      1        2            16
  9   58.2      1        2            18
 10   58.3      1        2            20
 11   58.5      1        2            22
 12   59.2      2        4            26
 13   61.5      1        2            28
 14   62.6      1        2            30
 15   64.8      1        2            32
 16   66.0      2        4            36
 17   66.3      1        2            38
 18   67.6      1        2            40
 19   69.1      1        2            42
 20   69.2      1        2            44
 21   70.5      1        2            46
 22   72.7      1        2            48
 23   72.9      1        2            50
 24   73.5      1        2            52
[Rows beyond a cumulative 52 percent are not included in the transcript]
14
Interval-Ratio Data

The last chart was not informative
Primary contribution was an ordered list of
values
If converted to a bar chart, it would have 48
bars of equal length and two bars with two
occurrences
A pie chart would also be pointless
Notice that when the variable of interest is
measured on an interval-ratio scale and can take
many potential values, these techniques are
not particularly informative

15
Histogram

Conventional solution for display of
interval-ratio data
Groups the variable's values into intervals
Useful for:
Displaying all intervals in a distribution, even
those without observed values
Examining the shape of the distribution for
skewness, kurtosis and the modal pattern
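
Grouping values into intervals can be sketched as follows. The function name `histogram_counts`, the starting point of 50, and the sample values are illustrative assumptions, not data from the slides; the sketch also assumes that `start` is no larger than the smallest value.

```python
def histogram_counts(values, start, width):
    """Count values in consecutive intervals [start + k*width, start + (k+1)*width).

    Every interval up to the one holding the maximum is returned,
    including intervals with a count of zero.
    Assumes start <= min(values).
    """
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    return [(start + k * width, start + (k + 1) * width, c)
            for k, c in enumerate(counts)]

sample = [54.9, 58.1, 66.0, 66.0, 72.9, 101.3, 218.2]  # illustrative values
for low, high, count in histogram_counts(sample, start=50, width=20):
    print("%5.0f-%-5.0f %s" % (low, high, "*" * count))
```

Keeping the zero-count intervals in the result is what makes gaps and detached values visible in the printed display.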

16
Histogram

Questions to ask
Is there a single hump?
Are subgroups identifiable when multiple modes
are present?
Are straggling data values detached from the
central concentration?

17
Histogram when grouping in increments of 20
18
Observations

Intervals with 0 counts show gaps in the data and
alert the analyst to look for problems with
spread
There are two extreme values
Along with the peaked midpoint and reduced number
of observations in the upper tail, this histogram
warns us of irregularities in the data.

19
Stem and Leaf Displays

Closely related to the histogram
Shares features but offers unique advantages
Easy to construct by hand for small samples
In contrast to histograms which lose information
by grouping values into intervals, actual data
can be inspected directly
Range of data is apparent at a glance
Also shape and spread impressions immediate

20
Stem and Leaf Displays

To develop one, the first digits of each data item
are arranged to the left of a vertical line.
Each row is referred to as a stem and each piece
of information on it as a leaf
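
A stem-and-leaf display is simple enough to build in code. In this sketch the stem is the tens digit (and above) and the leaf is the units digit, with decimals truncated; those choices, the function name `stem_and_leaf`, and the restriction to non-negative values are assumptions. The sample uses a few values from the ratio-interval frequency table earlier in the transcript.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines with tens (and above) as stem and the units digit as leaf.

    Decimals are truncated, not rounded (an assumption of this sketch).
    Stems with no leaves are still included so gaps stay visible.
    Assumes non-negative values.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        whole = int(v)                       # drop the decimal part
        leaves[whole // 10].append(whole % 10)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(d) for d in leaves.get(stem, []))
        lines.append(("%2d | %s" % (stem, row)).rstrip())
    return lines

sample = [54.9, 55.4, 55.6, 56.4, 61.5, 62.6, 66.0, 66.0, 70.5, 72.7]
print("\n".join(stem_and_leaf(sample)))
```

Unlike the histogram, each observation remains readable in the output: stem 6 with leaf 1 is the value 61.5 with its decimal dropped.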

21
Example of a Stem and Leaf Display

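[Stem-and-leaf display: the stem labels and vertical line
were lost in extraction; leaf rows as extracted]
0 2 2 3 5 6 7 8
4 5 5 6 6 6 7 8 8 8 8 9 9
0 2 2 6 8
2 4
0 1 8
3
1
1
0 6
3
3 6
3
6
8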

22
Boxplots

Another technique for exploratory data analysis
Boxplot reduces the detail of the stem-and-leaf
display and provides a different visual image of
the distribution's location, spread, shape, tail
length, and outliers
Summary consists of the median, upper and lower
quartiles, and the largest and smallest
observations.
The median and quartiles are used because they
are particularly resistant statistics.

23
Resistant Statistics

Example data set: 5, 6, 6, 7, 7, 7, 8, 8, 9
The mean is 7 and the standard deviation is 1.22
Replace the 9 with 90 and the mean becomes 16 and
the standard deviation 27.77.
Changing only one of the nine values has
disturbed the location and spread summaries to
the point where they no longer represent the
other eight values. Both mean and standard
deviation are considered nonresistant statistics
The median remained at 7 and the lower and upper
quartiles stayed at 6 and 8, respectively.
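
The comparison can be checked with Python's standard `statistics` module. This is a sketch under two assumptions: the standard deviation is the sample version (n - 1 divisor), and quartiles use the module's default "exclusive" method. Other quartile conventions can give slightly different values on other data sets.

```python
import statistics

original = [5, 6, 6, 7, 7, 7, 8, 8, 9]
altered = original[:-1] + [90]        # replace the 9 with 90

def summarize(data):
    """Return mean, sample standard deviation, and the three quartiles."""
    q1, q2, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    return {
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),          # sample SD (n - 1 divisor)
        "q1": q1, "median": q2, "q3": q3,
    }

for name, data in (("original", original), ("altered", altered)):
    print(name, summarize(data))
```

The mean and standard deviation shift sharply between the two runs while the three quartiles do not, which is the sense in which the latter are resistant.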

24
Boxplots

Rectangular plot that encompasses 50 percent of
the data values
A center line (or other notation) marking the
median and going through the width of the box
The edges of the box are called hinges
Whiskers extend from the right and left
hinges to the largest and smallest values

25
Boxplot Components
Median: center line of the box
Box: 50% of observed values are within the box;
its length is the IQR
Whiskers: extend to the largest observed value
within 1.5 IQR of the upper hinge and to the
smallest observed value within 1.5 IQR of the
lower hinge
Inner fences: lower hinge minus 1.5(IQR) and
upper hinge plus 1.5(IQR)
Outer fences: lower hinge minus 3(IQR) and
upper hinge plus 3(IQR)
Outside value, or outlier: beyond an inner fence
Extreme, or far outside, value: beyond an outer
fence
26
Example

Minimum = 54.9
Lower hinge = 60.3
Median = 74.55
Upper hinge = 111.52
Maximum = 218.2
IQR = 111.52 - 60.3 = 51.22
0.5(IQR) = 25.61
Lower inner fence = lower hinge - 1.5(IQR)
= 60.3 - (51.22 + 25.61) = -16.53
Upper inner fence = upper hinge + 1.5(IQR)
= 111.52 + (51.22 + 25.61) = 188.35
The smallest and largest values from the
distribution within the fences are used to
determine the whisker length
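
The fence arithmetic can be written as a short sketch. The hinges come from the example slide and the 1.5 and 3 multipliers from the Boxplot Components slide; the function names `boxplot_fences` and `classify` and the three labels are hypothetical.

```python
def boxplot_fences(lower_hinge, upper_hinge):
    """Return the IQR plus inner (1.5 IQR) and outer (3 IQR) fences."""
    iqr = upper_hinge - lower_hinge
    return {
        "iqr": iqr,
        "inner": (lower_hinge - 1.5 * iqr, upper_hinge + 1.5 * iqr),
        "outer": (lower_hinge - 3.0 * iqr, upper_hinge + 3.0 * iqr),
    }

def classify(value, fences):
    """Label a value as inside, outside (outlier), or extreme (far outside)."""
    inner_low, inner_high = fences["inner"]
    outer_low, outer_high = fences["outer"]
    if value < outer_low or value > outer_high:
        return "extreme"
    if value < inner_low or value > inner_high:
        return "outside"
    return "inside"

fences = boxplot_fences(60.3, 111.52)   # hinges from the example slide
print(round(fences["iqr"], 2), [round(x, 2) for x in fences["inner"]])
print(classify(54.9, fences), classify(218.2, fences))
```

With these hinges the minimum of 54.9 falls inside the fences, while the maximum of 218.2 lies beyond the upper inner fence but within the outer fence, so it would be plotted as an outside value and not as part of a whisker.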

27
Observations

In preliminary analysis, it is important to
separate legitimate outliers from errors in
measurement, editing, coding and data entry
Outliers that are mistakes should be corrected or
removed

28
Other Observations
Symmetric
Right Skewed
Left Skewed
Small Spread
29
Visual Techniques of EDA

Gain insight into the data
More common ways of summarizing location,
spread, and shape
Used resistant statistics
From these we could make decisions on test
selection and whether the data should be
transformed or reexpressed before further analysis
