Title: Part 2: Describing Data
1Part 2 Describing Data
- What I am going to talk about
- Data and SPSS data files
- Frequency and related plots
- Basic statistics and related plots
2 Three Types of Data in SPSS
- Nominal: categorical, qualitative, or attribute variables, e.g. male and female, smoking and non-smoking, buy and do not buy, color of hair
- Ordinal: ordered variables, e.g. three levels of students, degree of satisfaction
- Scale: numerical or quantitative variables, either discrete or continuous
- Discrete variables: the number of phone calls in one day, the number of trips to downtown Zhuhai per month, etc.
- Continuous variables: height, income, weight
3Summary of Types of Variables
Categorical Data
4Data collected from a sample survey: the General Social Survey (GSS)
- The GSS has been conducted regularly since 1972 by NORC, a social science research organization at the University of Chicago.
- The population of interest is all adults living in the US who are not in institutions such as mental hospitals and college dormitories.
- A carefully trained interviewer visits each selected household and questions the chosen person, called the respondent.
A simple real example
5A General Social Survey (GSS)
- In this chapter we shall often use the GSS data to show how to apply SPSS for various purposes.
- People were asked the following questions:
- Do you personally ever use a computer at home, at work, or at some other location?
- About how many minutes or hours per week do you spend sending and answering electronic mail, or email?
- Other than for email, do you ever use the Internet or World Wide Web?
- How many hours per week do you use the Web?
6Data: GSSNET, a survey on Internet use
- There are a number of variables; some are numerical, some are categorical.
- Age: age of respondent, in years
- Educ: highest year of school completed
- Usecomp: uses a computer? Yes (1) or No (0)
- Usenet: uses the Internet? Yes or No
- Usemail: uses email? Yes or No
- Emailhrs: weekly email hours
- Webhrs: hours of web use
- Region: region of interview
7A simple frequency table
- The Missing row tells us how many people did not give one of the two valid answers.
8Missing data
Missing data can occur in many variables of a data set.
- One way to treat the missing data problem is to remove the affected observations.
-- Removing people who were not asked a question from the calculation of percentages is not troublesome. It does not make the results harder to interpret.
- When your data have many missing values because people refused to answer questions, it may be difficult to draw correct conclusions.
-- If many of the people who were asked the question refuse to answer, that can be a problem.
9Remark for the code
- A code of -1 is used for someone who does not use the Internet at all.
- A code of -3 is used when it is not known whether someone uses the Internet.
- A code of -9 is used for Internet users whose time on the Internet is unknown.
10Percentages based on valid responses
- There is a large percentage of missing observations, so the result above, which is based on all cases, is difficult to interpret. Therefore, we use only the valid data and look at the corresponding valid percentages.
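The difference between a percentage of all cases and a valid percentage can be sketched in Python. This is only an illustration: the responses are made up (they are not GSS figures), the function name `frequency_table` is hypothetical, and `None` stands for a missing answer.

```python
from collections import Counter

def frequency_table(responses):
    """Return {answer: (count, percent, valid_percent)}; None marks a missing answer."""
    counts = Counter(responses)
    total = len(responses)
    valid_total = total - counts.get(None, 0)
    table = {}
    for answer, count in counts.items():
        percent = 100 * count / total
        # Missing answers get no valid percent: they are excluded from the base.
        valid_percent = None if answer is None else 100 * count / valid_total
        table[answer] = (count, percent, valid_percent)
    return table

# Hypothetical answers to a yes/no question (illustrative only, not GSS data).
responses = ["Yes"] * 6 + ["No"] * 2 + [None] * 12
table = frequency_table(responses)
for answer, (count, percent, valid_percent) in table.items():
    print(answer, count, percent, valid_percent)
```

With 12 of 20 answers missing, "Yes" is 30% of all cases but 75% of the valid responses, which is why the base of a percentage has to be stated.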
11Pie charts
12Pie charts
13Bar chart
14Frequency table sorted by counts
16Histogram
- A histogram is a bar chart for grouped numerical data in which the frequency or percentage of each group is represented by an individual bar.
17Histogram
18Remarks on histogram
- There are many parameters in drawing a histogram:
- the number of intervals
- the width of intervals
- the minimum point
- the maximum point
- SPSS offers two choices: automatic settings, or parameters that you specify.
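How these parameters interact can be sketched in Python. The function `histogram_counts` and the ages below are hypothetical, and the sketch says nothing about how SPSS picks its automatic settings; it only shows that the minimum, maximum, and number of intervals together fix the interval width and the counts.

```python
def histogram_counts(values, minimum, maximum, n_intervals):
    """Count values in n_intervals equal-width intervals covering [minimum, maximum]."""
    width = (maximum - minimum) / n_intervals
    counts = [0] * n_intervals
    for v in values:
        if v < minimum or v > maximum:
            continue  # values outside the chosen range are ignored
        # The maximum itself goes into the last interval rather than a new one.
        index = min(int((v - minimum) / width), n_intervals - 1)
        counts[index] += 1
    return width, counts

# Hypothetical ages, for illustration only.
ages = [19, 23, 25, 31, 34, 38, 42, 47, 51, 58, 63, 70]
print(histogram_counts(ages, minimum=10, maximum=70, n_intervals=6))
print(histogram_counts(ages, minimum=10, maximum=70, n_intervals=3))
```

The same data give a different picture with six intervals of width 10 than with three intervals of width 20, so it is worth trying more than one setting.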
19Histogram by auto
20Histogram by specific parameters
21Basic statistics
22 Basic statistics
- A. The mean
- Suppose you define the time to get ready as the time in minutes from when you get out of bed to when you leave your home. You collect the times shown below for 10 consecutive work days.
Day            1  2  3  4  5  6  7  8  9  10
Time (minutes) 39 29 43 52 39 44 40 31 44 35
- The mean is the sum of the times divided by the number of days: 396/10 = 39.6 minutes.
23 Basic statistics
- B. The median
- The median is the value that splits a ranked set of data into two equal parts.
- The median is the middle value in a set of data that has been ordered from lowest to highest value.
- For an odd number of observations, the median is the middle ranked value.
- For an even number of observations, the median is the average of the two middle ranked values.
24Basic statistics
Ranked values 29 31 35 39 39 40 43 44 44 52
Ranks         1  2  3  4  5  6  7  8  9  10
Median = (39 + 40)/2 = 39.5
Ranked values 37.3 39.2 44.2 44.5 53.8 56.6 59.3 62.4 66.5
Ranks         1    2    3    4    5    6    7    8    9
Median = 53.8
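The odd and even rules for the median can be written out as a short Python sketch. The function name `median_by_rank` is hypothetical; the two data sets are the ones ranked above.

```python
def median_by_rank(values):
    """Median via ranking: middle value (odd n) or mean of the two middle values (even n)."""
    ranked = sorted(values)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2

get_ready = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]                    # even count
fund_returns = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]   # odd count

print(sum(get_ready) / len(get_ready))   # mean: 39.6
print(median_by_rank(get_ready))         # 39.5
print(median_by_rank(fund_returns))      # 53.8
```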
25Basic statistics
- Comments on mean and median
- Robustness: the median is not affected by extreme values, but the mean does not have this property.
- It is easy to derive elegant formulas involving the mean, but difficult to do so for the median.
26 Basic statistics
- C. Mode
- The mode is the value in a set of data that appears most frequently.
- Example: the following data represent the number of server failures in a day for the past two weeks.
- 1 3 0 3 26 2 7 4 0 2 3 3 6 3
- The mode is 3, as 3 appears five times.
- The extreme value 26 is an outlier. An observation is called an outlier if it follows a different pattern from the majority of the data set.
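A small Python check on the server-failure data, using the standard `statistics` module, shows the mode and also how the outlier 26 moves the mean but not the median. Dropping the outlier here is only for comparison, not a recommendation for handling it.

```python
from statistics import mean, median, mode

failures = [1, 3, 0, 3, 26, 2, 7, 4, 0, 2, 3, 3, 6, 3]
without_outlier = [x for x in failures if x != 26]

print(mode(failures), failures.count(3))   # 3 5
print(mean(failures), median(failures))    # 4.5 3.0
# Without the outlier the mean drops below 3, while the median stays at 3.
print(round(mean(without_outlier), 2), median(without_outlier))
```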
27 Basic statistics
- D. Quartiles
- Quartiles split a set of data into four equal parts.
- The first quartile (Q1) divides the smallest 25% of the values from the other 75% that are larger.
- The second quartile (Q2) is just the median.
- The third quartile (Q3) divides the smallest 75% of the values from the largest 25%.
- We shall show the application in the box plot.
28 Basic statistics
- D. Quartiles
- Compute the first and third quartiles of the 2003 returns for the nine small-cap mutual funds with high risk.
Ranked values 37.1 39.2 44.2 44.5 53.8 56.6 59.3 62.4 66.5
Ranks         1    2    3    4    5    6    7    8    9
29Basic statistics
- Measures of variation
- E. The range
- The range is equal to the largest value minus the smallest value.
- Data
- 35 39 40 43 29 31 44 52 44 39
- Range = 52 - 29 = 23
30 Basic statistics
- F. The interquartile range
- It is the difference between the third and first quartiles.
- Data
- 35 39 40 43 29 31 44 52 44 39
- Interquartile range = 44 - 35 = 9
- The interval from the first quartile to the third quartile is often called the middle fifty.
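The slides do not state which rule is used to locate the quartiles. The sketch below assumes one common textbook convention: take the ranked value at position (n+1)/4 for the first quartile and 3(n+1)/4 for the third, average the two neighbours when the position ends in .5, and otherwise round to the nearest rank. This assumption reproduces the values 35 and 44 used above. The function names are hypothetical, and other conventions that interpolate between ranks can give slightly different quartiles.

```python
import math

def value_at_rank(ranked, position):
    """Value at a 1-based rank position under the assumed rounding convention."""
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return ranked[lower - 1]                        # whole rank
    if fraction == 0.5:
        return (ranked[lower - 1] + ranked[lower]) / 2  # halfway: average neighbours
    return ranked[math.floor(position + 0.5) - 1]       # otherwise: nearest rank

def quartiles(values):
    """First and third quartiles; assumes at least two observations."""
    ranked = sorted(values)
    n = len(ranked)
    return value_at_rank(ranked, (n + 1) / 4), value_at_rank(ranked, 3 * (n + 1) / 4)

data = [35, 39, 40, 43, 29, 31, 44, 52, 44, 39]
q1, q3 = quartiles(data)
print(max(data) - min(data))   # range: 23
print(q1, q3, q3 - q1)         # 35 44 9
```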
31 Basic statistics
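- G. The variance and the standard deviation
- These two statistics measure the average scatter around the mean.
- Sample variance: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
- Sample standard deviation: $S = \sqrt{S^2}$
- Here $X_i$ are the observations, $\bar{X}$ is the sample mean, and $n$ is the sample size.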
32 Basic statistics
- G. The variance and the standard deviation
- The sample standard deviation is the square root of the sample variance.
- The sample standard deviation has the same unit as the original data.
- Dividing by n-1 rather than n comes from the statistical criterion of unbiasedness. As n becomes larger, the difference between dividing by n-1 and dividing by n becomes smaller.
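As a rough illustration, the following Python sketch computes the variance of the ten get-ready times both ways, dividing by n-1 and by n, and cross-checks the results against the standard `statistics` module. The variable names are chosen for this example only.

```python
from statistics import pvariance, variance

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
n = len(times)
mean_time = sum(times) / n
sum_sq = sum((x - mean_time) ** 2 for x in times)   # squared deviations from the mean

sample_var = sum_sq / (n - 1)    # divide by n - 1 (sample variance)
divide_by_n = sum_sq / n         # divide by n, shown for comparison
sample_sd = sample_var ** 0.5    # same unit as the data (minutes)

print(round(sample_var, 2), round(sample_sd, 2))   # 45.82 6.77
print(round(divide_by_n, 2))                       # 41.24
# statistics.variance divides by n - 1; statistics.pvariance divides by n.
print(round(variance(times), 2), round(pvariance(times), 2))
```

With only ten observations the two divisors differ noticeably (45.82 versus 41.24); the gap shrinks as n grows.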
33The Box Plot
- A box plot, or box-and-whisker plot, is a graphical display, based on quartiles, that helps to picture a set of data.
- Five characteristics of the data are needed to construct a box plot:
- the minimum value,
- the first quartile,
- the median,
- the third quartile,
- the maximum value.
35Skewness
- Skewness is a measure of the lack of symmetry of a distribution.
- The coefficient of skewness can range from -3.00 up to 3.00 when using Pearson's formula, $sk = \frac{3(\bar{X} - \text{median})}{S}$, where $\bar{X}$ is the mean and $S$ is the standard deviation.
- A value of 0 indicates a symmetric distribution.
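The bound of -3 to 3 matches Pearson's coefficient of skewness, 3(mean - median)/s. Assuming that is the formula meant here, a minimal Python sketch is given below. The function name `pearson_skewness` is hypothetical, and the data sets are the get-ready times and server failures from earlier slides plus a small symmetric example.

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """Pearson's coefficient of skewness: 3 * (mean - median) / s.

    Assumes the values are not all identical (s would be 0).
    """
    return 3 * (mean(values) - median(values)) / stdev(values)

symmetric = [1, 2, 3, 4, 5]
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
failures = [1, 3, 0, 3, 26, 2, 7, 4, 0, 2, 3, 3, 6, 3]

print(pearson_skewness(symmetric))          # 0.0: mean equals median
print(round(pearson_skewness(times), 3))    # close to 0: nearly symmetric
print(round(pearson_skewness(failures), 3)) # positive: the outlier 26 pulls the mean right
```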
36Relationship between the box plot and polygon
- A and D are symmetric; the mean and median are equal.
- B is left-skewed; the few small values pull the mean toward the left tail.
- C is right-skewed; the concentration of values is on the low end of the scale.