Part 2: Describing Data - PowerPoint PPT Presentation

Provided by: eduh1175

Title: Part 2: Describing Data


1
Part 2: Describing Data
  • What I am going to talk about
  • Data and the SPSS data file
  • Frequency and related plots
  • Basic statistics and related plots

2
3 Types of Data in SPSS
  • Nominal - Categorical, qualitative or attribute
    variables
  • Examples: male and female, smoking and
    non-smoking, buy and do not buy, color of hair
  • Ordinal - Ordered variables
  • Examples: three levels of students,
    degree of satisfaction
  • Scale - Numerical or quantitative variables
  • Discrete variables: the number of phone calls in
    one day, the number of trips to downtown
    Zhuhai per month, etc.
  • Continuous variables: height, income, weight

3
Summary of Types of Variables
Categorical Data
4
Data collected from a sample survey: the General
Social Survey (GSS)
  • The GSS has been conducted regularly since 1972
    by NORC, a social science research organization
    at the University of Chicago.
  • The population of interest is all adults living
    in the US, but not in institutions such as mental
    hospitals and college dormitories.
  • A carefully trained interviewer visits each
    selected household and questions the chosen
    person, called the respondent.

A simple real example
5
The General Social Survey (GSS)
  • In this chapter we shall often use the GSS data
    to show how to apply SPSS for various purposes.
  • People were asked the following questions:
  • Do you personally ever use a computer at home, at
    work, or at some other location?
  • About how many minutes or hours per week do you
    spend sending and answering electronic mail, or email?
  • Other than for email, do you ever use the
    Internet or World Wide Web?
  • How many hours per week do you use the Web?

6
Data: GSSNET, a survey on Internet use
  • There are a number of variables; some are
    numerical, some are categorical.
  • Age: age of respondent, in years
  • Educ: highest year of school completed
  • Usecomp: use computer? Yes (1) or No (0)
  • Usenet: use Internet? Yes or No
  • Usemail: use email? Yes or No
  • Emailhrs: weekly email hours
  • Webhrs: hours of web use
  • Region: region of interview

7
A simple frequency table
  • The missing item tells us how many people did
    not select one of the two valid answers.

8
Missing data
Missing data can occur in many variables of a
data set.
  • One way to treat missing data is to
    remove those observations.
  • -- Removing people who were not asked a question
    from the calculation of percentages is not
    troublesome. It does not make interpretation of
    the results difficult.
  • When your data have many missing values because
    people refused to answer questions, it may be
    difficult to draw correct conclusions.
  • -- If a lot of people who are asked the question
    refuse to answer, that can be a problem.

9
Remarks on the codes
  • A code of -1 is used for someone who does not use
    the Internet at all.
  • A code of -3 is used when it is not known whether
    someone uses the Internet.
  • A code of -9 is used for Internet users whose
    time on the Internet is unknown.

10
Percentages based on valid responses
  • There is a large percentage of missing
    observations, so it is difficult to interpret the
    result based on all cases. Therefore, we
    use only the valid data and look at the
    corresponding valid percentages.
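
The difference between a percentage of all cases and a valid percentage can be sketched in a few lines of Python. This is only an illustration: the answers below are made up (not GSS data), the function name `frequency_table` is hypothetical, and the codes -1, -3 and -9 are treated as missing, following the remarks on the codes.

```python
from collections import Counter

# Codes treated as missing, following the remarks on the codes.
MISSING_CODES = {-1, -3, -9}

# Hypothetical coded answers (1 = Yes, 0 = No); not real GSS data.
answers = [1, 0, 1, -3, 1, -9, 0, 1, -1, 1]

def frequency_table(values, missing=MISSING_CODES):
    """Return {code: (count, percent of all cases, valid percent or None)}."""
    counts = Counter(values)
    n_all = len(values)
    n_valid = sum(c for v, c in counts.items() if v not in missing)
    table = {}
    for code, count in sorted(counts.items()):
        percent = 100 * count / n_all
        # Missing codes get no valid percent: they are excluded from the base.
        valid_percent = None if code in missing else 100 * count / n_valid
        table[code] = (count, percent, valid_percent)
    return table

table = frequency_table(answers)
for code, (count, percent, valid_percent) in table.items():
    print(code, count, round(percent, 1),
          None if valid_percent is None else round(valid_percent, 1))
```

In this made-up sample, "Yes" is 50% of all cases but about 71% of the valid cases, which is why the two columns should not be confused.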

11
Pie charts
12
Pie charts
13
Bar chart
14
Frequency table sorted by counts
16
Histogram
  • A histogram is a chart for grouped numerical data
    in which the frequencies or percentages of each
    group of numerical data are represented as
    individual bars.

17
Histogram
18
Remarks on histogram
  • There are several parameters in drawing a histogram:
  • the number of intervals
  • the width of intervals
  • the minimum point
  • the maximum point
  • SPSS offers two choices: automatic settings or
    parameters that you specify.
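
How these parameters interact can be sketched in Python. This is not how SPSS builds its histograms; it is a minimal illustration with made-up ages and a hypothetical function `histogram_counts`, assuming equal-width intervals where the width follows from the minimum, the maximum and the number of intervals.

```python
def histogram_counts(values, minimum, maximum, n_intervals):
    """Count values in equal-width intervals between minimum and maximum.

    Each interval includes its left end point; the last interval also
    includes the maximum. Values outside [minimum, maximum] are ignored.
    """
    width = (maximum - minimum) / n_intervals
    counts = [0] * n_intervals
    for v in values:
        if not (minimum <= v <= maximum):
            continue
        # Clamp so that v == maximum falls into the last interval.
        index = min(int((v - minimum) / width), n_intervals - 1)
        counts[index] += 1
    return width, counts

# Hypothetical ages, for illustration only.
ages = [18, 22, 25, 31, 34, 38, 41, 47, 52, 59, 63, 70]
width, counts = histogram_counts(ages, minimum=10, maximum=70, n_intervals=6)
print(width, counts)
```

Changing any one of the parameters (for example, using 3 intervals instead of 6) changes the bar heights, which is why the same data can give histograms that look quite different.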

19
Histogram by auto
20
Histogram by specific parameters
21
Basic statistics
22
Basic statistics
  • A. The mean
  • Suppose you define the time to get ready as
    the time in minutes from when you get out of bed
    to when you leave your home. You collect the
    times shown below for 10 consecutive work days.

Day            1  2  3  4  5  6  7  8  9  10
Time (minutes) 39 29 43 52 39 44 40 31 44 35

  • The sum of the ten times is 396, so the mean is
    396/10 = 39.6 minutes.
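
The same arithmetic can be checked with a short Python sketch; the variable names are only illustrative.

```python
# Getting-ready times (minutes) for the 10 consecutive work days.
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

total = sum(times)          # sum of all observations
mean = total / len(times)   # divide by the number of observations
print(total, mean)
```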
23
Basic statistics
  • B. The median
  • The median is the value that splits a ranked set
    of data into two equal parts.
  • The median is the middle value in a set of data
    that has been ordered from lowest to highest
    value.
  • For an odd number of observations, the median is the
    middle ranked value.
  • For an even number of observations, the median is
    the average of the two middle ranked values.

24
Basic statistics
  • B. The median

Ranked values 29 31 35 39 39 40 43 44 44 52
Ranks         1  2  3  4  5  6  7  8  9  10
Median = (39 + 40)/2 = 39.5

Ranked values 37.3 39.2 44.2 44.5 53.8 56.6 59.3 62.4 66.5
Ranks         1    2    3    4    5    6    7    8    9
Median = 53.8
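
The odd and even cases can be sketched in Python as follows. The function `median` is written out by hand here to mirror the rule on the slide; it is an illustration rather than the routine SPSS uses.

```python
def median(values):
    """Middle ranked value (odd n) or average of the two middle values (even n)."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]                  # even n = 10
returns = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]  # odd n = 9

print(median(times))    # average of the 5th and 6th ranked values
print(median(returns))  # the 5th ranked value
```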
25
Basic statistics
  • Comments on mean and median
  • Robustness: the median is not affected by
    extreme values, but the mean does not have this
    property.
  • Many convenient mathematical formulas are
    available for the mean, but such formulas are
    hard to find for the median.

26
Basic statistics
  • C. The mode
  • The mode is the value in a set of data that
    appears most frequently.
  • Example: the following data represent the number
    of server failures in a day for the past two
    weeks:
  • 1 3 0 3 26 2 7 4 0 2 3 3 6 3
  • The mode is 3, as 3 appears five times.
  • The extreme value 26 is an outlier. An
    observation is called an outlier if it differs
    markedly from the majority of the data set.
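
Counting how often each value occurs is easy to sketch with Python's `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# Number of server failures per day over the past two weeks.
failures = [1, 3, 0, 3, 26, 2, 7, 4, 0, 2, 3, 3, 6, 3]

counts = Counter(failures)
mode, mode_count = counts.most_common(1)[0]  # most frequent value and its count
print(mode, mode_count)
```

If two values tied for the highest count, `most_common(1)` would report only one of them, so a data set with several modes needs a check of the full table of counts.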

27
Basic statistics
  • D. Quartiles
  • Quartiles split a set of data into four equal
    parts.
  • The first quartile divides the smallest 25%
    of the values from the other 75% that are larger.
  • The second quartile is just the median.
  • The third quartile divides the smallest
    75% of the values from the largest 25%.
  • We shall show the application in the box plot.

28
Basic statistics
  • D. Quartiles
  • Compute Q1 and Q3 of the 2003 returns for the
    nine small-cap mutual funds with high risk.
  • Ranked values:
  • 37.1 39.2 44.2 44.5 53.8 56.6 59.3 62.4
    66.5
  • Ranks:
  • 1 2 3 4 5 6
    7 8 9
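
One way to carry out this computation is sketched below. The slide does not state which quartile rule it uses, so the sketch assumes a common textbook rule: the k-th quartile sits at ranked position k(n+1)/4; a whole-number position takes that ranked value, a half-way position averages the two neighbouring values, and any other position is rounded to the nearest rank. SPSS and other packages may interpolate differently, so their results can differ from these. The function name `quartile` is hypothetical.

```python
import math

def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) under the assumed k(n+1)/4 positioning rule.

    Intended for data sets with at least two values.
    """
    ranked = sorted(values)
    n = len(ranked)
    pos = k * (n + 1) / 4          # 1-based ranked position
    lower = math.floor(pos)
    frac = pos - lower
    if frac == 0:                  # whole-number position
        return ranked[lower - 1]
    if frac == 0.5:                # half-way: average the two neighbours
        return (ranked[lower - 1] + ranked[lower]) / 2
    nearest = lower if frac < 0.5 else lower + 1
    return ranked[nearest - 1]     # otherwise round to the nearest rank

returns = [37.1, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]
q1 = quartile(returns, 1)   # position 2.5: average of 2nd and 3rd values
q3 = quartile(returns, 3)   # position 7.5: average of 7th and 8th values
print(q1, quartile(returns, 2), q3)
```

Under this assumed rule Q1 is about 41.7 and Q3 about 60.85, and the second quartile coincides with the median, 53.8.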

29
Basic statistics
  • Measures of variation
  • E. The range
  • The range is equal to the largest value minus
    the smallest value.
  • Data:
  • 35 39 40 43 29 31 44 52 44 39
  • Range = 52 - 29 = 23

30
Basic statistics
  • F. The interquartile range
  • It is the difference between the third and
    first quartiles.
  • Data:
  • 35 39 40 43 29 31 44 52 44 39
  • Interquartile range = 44 - 35 = 9
  • The interval from Q1 to Q3 is often called the
    middle fifty.

31
Basic statistics
  • G. The variance and the standard deviation
  • These two statistics measure the average
    scatter around the mean

32
Basic statistics
  • G. The variance and the standard deviation
  • The sample standard deviation is the square root
    of the sample variance.
  • The sample standard deviation has the same unit
    as the original data.
  • Dividing by n-1 rather than n comes from the
    statistical criterion of unbiasedness. When n becomes
    larger, the difference between dividing by n-1 and
    dividing by n becomes smaller.
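
As an illustration, the sketch below computes the sample variance and standard deviation of the ten getting-ready times from the mean slide, assuming the usual definition: the sum of squared deviations from the mean, divided by n-1. It also shows the divide-by-n version for comparison. The variable names are illustrative.

```python
import math
import statistics

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

n = len(times)
mean = sum(times) / n
sum_sq = sum((x - mean) ** 2 for x in times)  # sum of squared deviations

sample_variance = sum_sq / (n - 1)            # divide by n-1 (unbiased)
divide_by_n = sum_sq / n                      # divide by n, for comparison
sample_sd = math.sqrt(sample_variance)        # same unit as the data (minutes)

print(round(sample_variance, 2), round(divide_by_n, 2), round(sample_sd, 2))

# The standard library's statistics.variance also divides by n-1.
print(round(statistics.variance(times), 2))
```

With only 10 observations the two divisors give noticeably different values (about 45.82 versus 41.24); the gap shrinks as n grows.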

33
The Box Plot
  • A box plot or Box-and-Whisker plot is a graphical
    display, based on quartiles, that helps to
    picture a set of data.
  • Five characteristics of data are needed to
    construct a box plot
  • the Minimum Value,
  • the First Quartile,
  • the Median,
  • the Third Quartile,
  • the Maximum Value.
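
These five values can be computed before any plot is drawn. The sketch below does so for the ten getting-ready times from the mean slide. It assumes a common textbook quartile rule (position k(n+1)/4, averaging at a half-way position, otherwise rounding to the nearest rank); the slides do not say which rule they use, but this one reproduces the quartiles 35 and 44 that appear in the interquartile range example. The function names are hypothetical, and SPSS may place the quartiles slightly differently.

```python
import math

def ranked_value(ranked, pos):
    """Value at a 1-based ranked position under the assumed rounding rule."""
    lower = math.floor(pos)
    frac = pos - lower
    if frac == 0:
        return ranked[lower - 1]
    if frac == 0.5:
        return (ranked[lower - 1] + ranked[lower]) / 2
    return ranked[(lower if frac < 0.5 else lower + 1) - 1]

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum (for data with at least two values)."""
    ranked = sorted(values)
    n = len(ranked)
    q1 = ranked_value(ranked, (n + 1) / 4)
    med = ranked_value(ranked, (n + 1) / 2)
    q3 = ranked_value(ranked, 3 * (n + 1) / 4)
    return ranked[0], q1, med, q3, ranked[-1]

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
minimum, q1, med, q3, maximum = five_number_summary(times)
box_length = q3 - q1   # the box spans Q1 to Q3, i.e. the interquartile range
print(minimum, q1, med, q3, maximum, box_length)
```

The box runs from Q1 to Q3 with a line at the median, and in the simplest form of the plot the whiskers extend to the minimum and maximum.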

35
Skewness
  • Skewness is a measure of the lack of
    symmetry of the distribution.
  • The coefficient of skewness can range
    from -3.00 up to 3.00 when using the following
    formula
  • A value of 0 indicates a symmetric distribution.

36
Relationship between the box plot and polygon
  • A and D are symmetric: the mean and median are equal.
  • B is left-skewed: the few small values pull
    the mean toward the left tail.
  • C is right-skewed: the concentration of values
    is on the low end of the scale.
