Part 2: Describing Data - PowerPoint PPT Presentation

Provided by: eduh1175

Transcript and Presenter's Notes

Title: Part 2: Describing Data

1
Part 2 Describing Data

What I am going to talk about
Data and the SPSS data file
Frequency and related plots
Basic statistics and related plots

2
3 Types of Data in SPSS

Nominal - categorical, qualitative, or attribute
variables
Examples: male and female, smoking and
non-smoking, buy and do not buy, color of hair
Ordinal - ordered variables
Examples: three levels of students,
degree of satisfaction
Scale - numerical or quantitative variables
Discrete variables: the number of phone calls in
one day, the number of trips to downtown Zhuhai
per month, etc.
Continuous variables: height, income, weight

3
Summary of Types of Variables
Categorical Data
4
Data collected from a sample survey: the General
Social Survey (GSS)

The GSS has been conducted regularly since 1972
by NORC, a social science research organization
at the University of Chicago.
The population of interest is all adults living
in the US, but not in institutions such as mental
hospitals and college dormitories.
A carefully trained interviewer visits each
selected household and questions the chosen
person, called the respondent.

A simple real example
5
A General Social Survey (GSS)

In this chapter we shall often use the GSS data
to teach how to apply SPSS for various purposes.
People were asked the following questions:
Do you personally ever use a computer at home, at
work, or at some other location?
About how many minutes or hours per week do you
spend sending and answering electronic mail, or email?
Other than for email, do you ever use the
Internet or World Wide Web?
How many hours per week do you use the Web?

6
Data: GSSNET, a survey on Internet use

There are a number of variables; some are
numerical, some are categorical.
Age: age of respondent, in years
Educ: highest year of school completed
Usecomp: uses a computer? Yes (1) or No (0)
Usenet: uses the Internet? Yes or No
Usemail: uses email? Yes or No
Emailhrs: weekly email hours
Webhrs: hours of web use
Region: region of interview

7
A simple frequency table

The missing item tells us how many people did
not select one of the two valid answers.

8
Missing data
Missing data can occur in many variables of a
data set.

One way to treat the missing data problem is to
remove those observations.
-- Removing people who aren't asked a question
from the calculation of percentages is not
troublesome. They don't make interpretation of
the results difficult.
When your data have many missing values because
people refuse to answer questions, it may be
difficult to draw correct conclusions.
-- If a lot of people who are asked the question
refuse to answer, that can be a problem.

9
Remark for the code

A code of -1 is used for someone who does not use
the Internet at all.
A code of -3 is used when you don't know if
someone uses the Internet.
A code of -9 is used for Internet users whose
time on the Internet is unknown.
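As a minimal sketch of how these codes might be handled outside SPSS, the snippet below separates valid values from the three missing codes. The web-hours values are made up for illustration, and the function name is hypothetical.

```python
# Missing-value codes as described on the slide.
MISSING_CODES = {
    -1: "does not use the Internet",
    -3: "unknown whether the person uses the Internet",
    -9: "Internet user, time unknown",
}

def split_valid_missing(values):
    """Return (list of valid values, count of each missing code)."""
    valid = [v for v in values if v not in MISSING_CODES]
    missing = {code: values.count(code) for code in MISSING_CODES}
    return valid, missing

# Made-up weekly web hours for ten respondents (illustration only).
webhrs = [5, -1, 12, -9, 0, -1, 3, -3, 7, -1]
valid, missing = split_valid_missing(webhrs)
print(valid)
for code, count in missing.items():
    print(code, MISSING_CODES[code], count)
```

Keeping the three codes separate preserves the distinction between people who were never asked and people whose answer is unknown.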

10
Percentages based on valid responses

There is a large percentage of missing
observations, so it is difficult to interpret the
result based on all cases. Therefore, we
use only the valid data and look at the valid
percentages.
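The difference between the percent of all cases and the valid percent can be sketched as follows. The answers are invented for illustration, and representing a missing response by None is an assumption of this sketch, not an SPSS convention.

```python
from collections import Counter

def frequency_table(responses):
    """Map each answer to (count, percent of all cases, valid percent).

    None stands for a missing response; it gets no valid percent.
    """
    counts = Counter(responses)
    total = len(responses)
    n_valid = total - counts.get(None, 0)
    table = {}
    for answer, count in counts.items():
        percent = 100 * count / total
        valid_percent = None if answer is None else 100 * count / n_valid
        table[answer] = (count, percent, valid_percent)
    return table

# Invented answers: 6 Yes, 2 No, 2 missing.
answers = ["Yes"] * 6 + ["No"] * 2 + [None] * 2
table = frequency_table(answers)
for answer, row in table.items():
    print(answer, row)
```

Here "Yes" is 60% of all cases but 75% of the valid responses, which is the number to report when many cases are missing.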

11
Pie charts
12
Pie charts
13
Bar chart
14
Frequency table sorted by counts
15
(No Transcript)
16
Histogram

A histogram is a chart for grouped numerical data
in which the frequencies or percentages of each
group of numerical data are represented as
individual bars.

17
Histogram
18
Remarks on histogram

There are many parameters in drawing a histogram:
the number of intervals
the width of intervals
the minimum point
the maximum point
SPSS offers two choices: automatic settings, or
parameters that you specify yourself.
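To see how these parameters interact, here is a small sketch that counts values per interval given a minimum point, an interval width, and a number of intervals (the maximum point then follows from the other three). The function name and the ten data values are illustrative assumptions, and this is not how SPSS chooses its automatic settings.

```python
def histogram_counts(data, minimum, width, n_intervals):
    """Count values in n_intervals bins of equal width starting at minimum.

    Each bin is [lower, upper); the maximum point is
    minimum + width * n_intervals. Values outside that span are ignored.
    """
    counts = [0] * n_intervals
    for x in data:
        index = int((x - minimum) // width)
        if 0 <= index < n_intervals:
            counts[index] += 1
    return counts

# Ten illustrative values, binned from 25 to 55 in steps of 5.
data = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
counts = histogram_counts(data, minimum=25, width=5, n_intervals=6)
for i, count in enumerate(counts):
    lower = 25 + 5 * i
    print(f"{lower}-{lower + 5}: {'*' * count}")
```

Changing the width or the minimum point regroups the same data and can noticeably change the shape of the picture.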

19
Histogram by auto
20
Histogram by specific parameters
21
Basic statistics
22
Basic statistics

A. The mean
Suppose you define the time to get ready as
the time in minutes from when you get out of bed
to when you leave your home. You collect the
times shown below for 10 consecutive work days.

Day            1  2  3  4  5  6  7  8  9  10
Time (minutes) 39 29 43 52 39 44 40 31 44 35

The mean is 396/10 = 39.6 minutes.
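The arithmetic for the mean of the ten times can be checked with a few lines of Python (a minimal sketch; the variable names are illustrative):

```python
# Times to get ready, in minutes, for the 10 work days on the slide.
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

total = sum(times)          # sum of all observations
mean = total / len(times)   # divide by the number of observations
print(total, mean)
```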
23
Basic statistics

B. The median
The median is the value that splits a ranked set
of data into two equal parts.
The median is the middle value in a set of data
that has been ordered from lowest to highest
value.
For odd number of observations, the median is the
middle ranked value.
For even number of observations, the median is
the average of the two middle ranked values.

24
Basic statistics

B. The median

Ranked values 29 31 35 39 39 40 43 44 44 52
Ranks 1 2 3 4 5 6 7 8 9 10
Median = (39 + 40)/2 = 39.5

Ranked values 37.3 39.2 44.2 44.5 53.8 56.6 59.3 62.4 66.5
Ranks 1 2 3 4 5 6 7 8 9
Median = 53.8
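The odd and even cases of the median rule can be sketched as a short function and applied to the two data sets from this slide (the function name is illustrative):

```python
def median(values):
    """Middle ranked value, or the average of the two middle values."""
    ranked = sorted(values)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:                      # odd n: single middle value
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2   # even n

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
returns = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]
print(median(times))
print(median(returns))
```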
25
Basic statistics

Comments on mean and median
Robustness: the median is not affected by
extreme values, but the mean does not have this
property.
The mean has many convenient mathematical
formulas; such formulas are hard to obtain for
the median.
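A quick way to see the robustness point is to corrupt one observation and compare the two statistics. The sketch below uses Python's standard statistics module on the ten get-ready times; the value 520 is an invented typing error, used only for illustration.

```python
from statistics import mean, median

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
# Same data, but the largest value 52 is mistyped as 520 (hypothetical).
corrupted = [520 if t == 52 else t for t in times]

print(mean(times), median(times))
print(mean(corrupted), median(corrupted))
```

The median stays at 39.5 while the mean jumps from 39.6 to 86.4, because only the mean uses the size of the extreme value.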

26
Basic statistics

C. Mode
The mode is the value in a set of data that
appears most frequently.
Example: the following data represent the number
of server failures in a day for the past two
weeks
1 3 0 3 26 2 7 4 0 2 3 3 6 3
The mode is 3, as 3 appears five times.
The extreme value 26 is an outlier. An
observation is called an outlier if it departs
from the pattern of the majority of the data set.
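Counting how often each value occurs gives the mode directly. A minimal sketch with the server-failure data, using Counter from the standard library:

```python
from collections import Counter

# Server failures per day for the past two weeks (from the slide).
failures = [1, 3, 0, 3, 26, 2, 7, 4, 0, 2, 3, 3, 6, 3]

counts = Counter(failures)
mode, frequency = counts.most_common(1)[0]   # most frequent value and its count
print(mode, frequency)
```

Note that the outlier 26 has no influence on the mode, since it occurs only once.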

27
Basic statistics

D. Quartiles
Quartiles split a set of data into four equal
parts.
The first quartile divides the smallest 25%
of the values from the other 75% that are larger.
The second quartile is just the median.
The third quartile divides the smallest
75% of the values from the largest 25%.
We shall show the application in the box plot.

28
Basic statistics

D. Quartiles
Compute Q1 and Q3 of the 2003 returns for the
nine small-cap mutual funds with high risk.
Ranked value
37.1 39.2 44.2 44.5 53.8 56.6 59.3 62.4
66.5
Ranks
1 2 3 4 5 6
7 8 9

29
Basic statistics

Measures of variation
E. The range
The range is equal to the largest value minus
the smallest value.
Data
35 39 40 43 29 31 44 52 44 39
Range = 52 - 29 = 23

30
Basic statistics

F. The interquartile range
It is the difference between the third and
first quartiles.
Data
35 39 40 43 29 31 44 52 44 39
Interquartile range = 44 - 35 = 9
The interval from Q1 to Q3 is often called the
middle fifty.
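The range and interquartile range for these ten values can be reproduced with the sketch below. The slides do not state which quartile rule is used, so the rule here is an assumption: take the ranked position q(n+1)/4, use it directly if it is a whole number, average the two neighbouring ranked values if it ends in .5, and otherwise round to the nearest rank. It is chosen because it gives the 44 and 35 shown above; statistical software often interpolates differently and may report slightly different quartiles.

```python
def quartile(values, q):
    """q-th quartile (q = 1, 2, 3) at ranked position q(n+1)/4.

    Assumed rounding rule: whole position -> that ranked value;
    position ending in .5 -> average of the two neighbours;
    anything else -> nearest rank.
    """
    ranked = sorted(values)
    n = len(ranked)
    if n < 2 or q not in (1, 2, 3):
        raise ValueError("need at least two values and q in 1, 2, 3")
    base, rem = divmod(q * (n + 1), 4)   # position = base + rem/4 (1-based)
    if rem == 0:
        return ranked[base - 1]
    if rem == 2:
        return (ranked[base - 1] + ranked[base]) / 2
    if rem == 1:
        return ranked[base - 1]          # .25 rounds down
    return ranked[base]                  # .75 rounds up

data = [35, 39, 40, 43, 29, 31, 44, 52, 44, 39]
data_range = max(data) - min(data)
q1, q3 = quartile(data, 1), quartile(data, 3)
iqr = q3 - q1
print(data_range, q1, q3, iqr)
```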

31
Basic statistics

G. The variance and the standard deviation
These two statistics measure the average
scatter around the mean

32
Basic statistics

G. The variance and the standard deviation
The sample standard deviation is the square root
of the sample variance.
The sample standard deviation has the same units
as the original data.
Dividing by n-1 rather than n comes from the
statistical criterion of unbiasedness. When n becomes
larger, the difference between dividing by n-1 and
dividing by n becomes smaller.
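As a sketch of the calculation, the sample variance is the sum of squared deviations from the mean divided by n-1, and the sample standard deviation is its square root. The code below applies this standard definition to the ten get-ready times used earlier in the deck and also shows the result of dividing by n for comparison; the function and variable names are illustrative.

```python
import math

def sum_squared_deviations(values):
    """Sum of (x - mean)^2 over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
n = len(times)
ss = sum_squared_deviations(times)

sample_variance = ss / (n - 1)        # divide by n-1 (unbiased)
sample_sd = math.sqrt(sample_variance)
variance_div_n = ss / n               # divide by n, for comparison
print(round(sample_variance, 2), round(sample_sd, 2), round(variance_div_n, 2))
```

With only ten observations the two divisors give visibly different answers (about 45.82 versus 41.24); with hundreds of observations the gap becomes negligible.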

33
The Box Plot

A box plot or Box-and-Whisker plot is a graphical
display, based on quartiles, that helps to
picture a set of data.
Five characteristics of the data are needed to
construct a box plot:
the Minimum Value,
the First Quartile,
the Median,
the Third Quartile,
the Maximum Value.
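The five values needed for a box plot can be collected in one function. In this sketch the quartiles are taken as Tukey's hinges, that is, the medians of the lower and upper halves of the ranked data; this is one common convention for box plots and is an assumption here, since other quartile rules can give a slightly different box.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum).

    Quartiles are Tukey's hinges: medians of the lower and upper halves,
    with the middle value included in both halves when n is odd.
    """
    ranked = sorted(values)
    n = len(ranked)
    half = (n + 1) // 2
    lower, upper = ranked[:half], ranked[n - half:]
    return (ranked[0], median(lower), median(ranked), median(upper), ranked[-1])

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
summary = five_number_summary(times)
print(summary)
```

The box runs from the first to the third quartile with a line at the median, and the whiskers extend toward the minimum and maximum.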

34
(No Transcript)
35
Skewness

Skewness is a measure of the lack of
symmetry of the distribution.
The coefficient of skewness can range
from -3.00 up to 3.00 when using the following
formula
A value of 0 indicates a symmetric distribution.
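The formula itself is not reproduced in this transcript. The stated range of -3 to 3 is consistent with Pearson's coefficient of skewness, sk = 3(mean - median)/s, where s is the sample standard deviation, so that formula is assumed in the sketch below. The data are the ten get-ready times used earlier in the deck.

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """Pearson's coefficient of skewness: 3 * (mean - median) / s.

    Assumed to be the formula the slide refers to; stdev divides by n-1.
    """
    return 3 * (mean(values) - median(values)) / stdev(values)

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
sk = pearson_skewness(times)
print(round(sk, 3))
```

The result is close to 0 because the mean (39.6) and the median (39.5) nearly coincide, which suggests a roughly symmetric distribution.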

36
Relationship between the box plot and polygon