Descriptive Statistics - PowerPoint PPT Presentation

About This Presentation
Title:

Descriptive Statistics

Description:

Title: PowerPoint Presentation Author: Chris Holdsworth Last modified by: Chris Created Date: 10/8/2002 8:25:28 PM Document presentation format: On-screen Show (4:3) – PowerPoint PPT presentation

Slides: 60
Provided by: ChrisH484

Transcript and Presenter's Notes

Title: Descriptive Statistics


1
Descriptive Statistics
2
Five types of statistical analysis
  • Descriptive: What are the characteristics of the
    respondents?
  • Inferential: What are the characteristics of the
    population?
  • Differences: Are two or more groups the same or
    different?
  • Associative: Are two or more variables related in
    a systematic way?
  • Predictive: Can we predict one variable if we know
    one or more other variables?
3
Descriptive Statistics
  • Summarization of a collection of data in a clear
    and understandable way
  • The most basic form of statistics
  • Lays the foundation for all statistical knowledge
  • Measures of central tendency: mean, median, mode
  • Measures of dispersion: range, standard deviation,
    and coefficient of variation
  • Measures of shape: skewness and kurtosis
  • If you use fewer statistics to describe the
    distribution of a variable, you lose information
    but gain clarity.

4
Type of Measurement        Type of descriptive analysis
Nominal, two categories    Frequency table; Proportion (percentage)
Nominal, more than two     Frequency table; Category proportions
categories                 (percentages); Mode
5
Type of Measurement        Type of descriptive analysis
Ordinal                    Rank order; Median
Interval                   Arithmetic mean
Ratio                      Means
6
Data Tabulation
  • Tabulation: The organized arrangement of data in
    a table format that is easy to read and
    understand.
  • A count of the number of responses to each
    question.
  • Simple Tabulation: Tabulating the results of only
    one variable; it informs you how often each
    response was given.
  • Frequency Distribution: A distribution of data
    that summarizes the number of times a certain
    value of a variable occurs, expressed in terms of
    percentages.

7
Frequency Tables
  • The arrangement of statistical data in a
    row-and-column format that exhibits the count of
    responses or observations for each category
    assigned to a variable
  • How many of certain brand users can be called
    loyal?
  • What percentage of the market are heavy users and
    light users?
  • How many consumers are aware of a new product?
  • What brand is the Top of Mind of the market?

8
More on relative frequency distributions
  • Rules for relative frequency distributions
  • Make sure each observation is in one and only one
    category.
  • Use categories of equal width.
  • Choose an appealing number of categories.
  • Provide labels
  • Double-check your graph.

A bar graph displays the relative frequency
distribution of a qualitative variable.
A histogram displays the relative frequency
distribution of a quantitative variable.
9
10
How many times per week do you use mouthwash?
1__ 2__ 3__ 4__ 5__ 6__ 7__

Responses: 1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7

Times per week   Frequency
1                2
2                3
3                5
4                7
5                5
6                3
7                2
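
The tally above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original deck; the variable names are illustrative, and the relative-frequency column is an addition that follows the deck's definition of a frequency distribution expressed in percentages.

```python
from collections import Counter

# Responses to "How many times per week do you use mouthwash?"
responses = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
             5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

counts = Counter(responses)   # absolute frequency of each answer
n = len(responses)            # number of respondents

# Map each answer to (count, relative frequency), in ascending order
frequency_table = {
    value: (counts[value], counts[value] / n)
    for value in sorted(counts)
}

for value, (count, share) in frequency_table.items():
    print(f"{value}  {count:2d}  {share:6.1%}")
```

Each observation lands in exactly one category, so the counts add up to the sample size and the relative frequencies add up to 1.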
11
Normal Distribution
[Figure: normal curve with points a and b marked on the horizontal axis]
12
The total area under the curve is equal to 1,
i.e. it takes in all observations. The area of a
region under the normal distribution between any
two values equals the probability of observing a
value in that range when an observation is
randomly selected from the distribution. For
example, on a single draw there is a 34% chance
of selecting from the distribution a person with
an IQ between 100 and 115.
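
The 34% figure can be checked numerically. The sketch below assumes that IQ scores follow a normal distribution with mean 100 and standard deviation 15; the slide does not state these parameters, so they are an assumption, as are the function and variable names.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution with the given mean and SD."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Assumption (not stated on the slide): IQ ~ Normal(mean=100, sd=15)
IQ_MEAN, IQ_SD = 100.0, 15.0

# The area under the curve between two values is the probability
# of drawing an observation in that range.
p_100_to_115 = normal_cdf(115, IQ_MEAN, IQ_SD) - normal_cdf(100, IQ_MEAN, IQ_SD)
print(f"P(100 <= IQ <= 115) = {p_100_to_115:.4f}")
```

Under that assumption, 115 lies exactly one standard deviation above the mean, which is where the roughly 34% comes from.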
13
  • Curve is basically bell shaped, extending from
    -∞ to +∞
  • Symmetric, with scores more concentrated in the
    middle (i.e. around the mean) than in the tails.
  • Mean, median and mode coincide
  • Normal distributions differ in how spread out
    they are.
  • The area under each curve is 1.
  • The height of a normal distribution can be
    specified mathematically in terms of two
    parameters: the mean (μ) and the standard
    deviation (σ).
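
As an illustration of the last point, the height of the curve can be computed from the mean and the standard deviation alone using the standard normal density formula. This is a sketch; `normal_pdf` and the two example curves are illustrative names and values, not something taken from the deck.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Two curves with the same mean but different spread
peak_narrow = normal_pdf(0.0, 0.0, 1.0)   # sigma = 1
peak_wide = normal_pdf(0.0, 0.0, 2.0)     # sigma = 2

print(f"peak height, sigma=1: {peak_narrow:.4f}")
print(f"peak height, sigma=2: {peak_wide:.4f}")
```

A larger standard deviation gives a lower, wider curve, while the total area under each curve stays at 1.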

Normal Distributions
14
Skewed Distributions
  • Occur when one tail of the distribution is longer
    than the other.
  • Positive Skew Distributions
  • have a long tail in the positive direction.
  • sometimes called "skewed to the right"
  • more common than distributions with negative
    skews
  • E.g. distribution of income. Most people make
    under $80,000 a year, but some make quite a bit
    more, with a small number making many millions of
    dollars per year
  • The positive tail therefore extends out quite a
    long way, whereas the negative tail stops at zero
  • Negative Skew Distributions
  • have a long tail in the negative direction.
  • called "skewed to the left."
  • E.g. GPA

15
  • Kurtosis: how peaked a distribution is. A value
    of zero indicates a normal distribution, positive
    numbers indicate a more peaked distribution, and
    negative numbers indicate a flatter distribution.
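
Skewness and kurtosis can both be computed from central moments. The sketch below uses the simple moment-based (population) formulas and reports excess kurtosis, so that a normal distribution scores zero as described above; statistical packages may apply small-sample corrections and give slightly different numbers. The income figures are hypothetical and only illustrate a long right tail.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Positive for a long right tail, negative for a long left tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Zero for a normal distribution, positive if more peaked, negative if flatter."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

# Mouthwash responses from the deck's example (symmetric around 4)
mouthwash = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
             5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

# Hypothetical incomes in thousands, with one very large value
incomes = [20, 25, 30, 35, 40, 300]

print(f"mouthwash skewness: {skewness(mouthwash):.3f}")
print(f"mouthwash excess kurtosis: {excess_kurtosis(mouthwash):.3f}")
print(f"income skewness: {skewness(incomes):.3f}")
```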

Peaked distribution
Flat distribution
Thanks, Scott!
16
Summary statistics
  • central tendency
  • Dispersion or variability
  • A quantitative measure of the degree to which
    scores in a distribution are spread out or are
    clustered together

17
Measures of Central Tendency
  • Mode: the number that occurs most often in a
    string (nominal data)
  • Median: half of the responses fall above this
    point, half fall below this point (ordinal data)
  • Mean: the average (interval/ratio data)
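
All three measures are available in Python's standard `statistics` module. The sketch below applies them to the mouthwash responses used elsewhere in this deck; the variable names are illustrative.

```python
import statistics

# Number of times per week consumers use mouthwash (27 respondents)
mouthwash = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
             5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

mode_value = statistics.mode(mouthwash)      # most frequent response
median_value = statistics.median(mouthwash)  # middle observation
mean_value = statistics.mean(mouthwash)      # arithmetic average

print(f"mode={mode_value}, median={median_value}, mean={mean_value}")
```

For this symmetric sample the three measures coincide at 4; in a skewed sample they would generally differ.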

18
  • Mode
  • the most frequent category
  • users: 25%
  • non-users: 75%
  • Advantages
  • meaning is obvious
  • the only measure of central tendency that can be
    used with nominal data.
  • Disadvantages
  • many distributions have more than one mode, i.e.
    are multimodal
  • greatly subject to sample fluctuations
  • therefore not recommended to be used as the only
    measure of central tendency.

19
  • Median
  • the middle observation of the data
  • number of times per week consumers use mouthwash
  • 1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6
    7 7

Frequency distribution of Mouthwash use per week
20
21
22
Measures of Dispersion or Variability
  • Minimum, Maximum, and Range (Highest value minus
    the lowest value)
  • Variance
  • Standard Deviation (a measure of distance from
    the mean)

23
[Figure: distribution curve marked with -1 SD, 1 SD, and RANGE]
24
Variance
  • The difference between an observed value and the
    mean is called the deviation from the mean
  • The variance is the mean squared deviation from
    the mean
  • i.e. you subtract the mean from each value,
    square each result and then take the average.
  • Because each deviation is squared, the variance
    can never be negative

25
Standard Deviation
  • The standard deviation is the square root of the
    variance
  • Thus the standard deviation is expressed in the
    same units as the variable
  • Helps us to understand how clustered or spread
    the distribution is around the mean value.
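
The dispersion measures can be sketched in Python as follows. The example uses the population formulas (`pvariance`, `pstdev`), which match the "mean squared deviation" definition given in this deck; many packages report the sample version that divides by n - 1 instead, so their output would be slightly larger. The data are the mouthwash responses from earlier slides.

```python
import statistics

mouthwash = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
             5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

value_range = max(mouthwash) - min(mouthwash)   # highest minus lowest value
variance = statistics.pvariance(mouthwash)      # mean squared deviation
std_dev = statistics.pstdev(mouthwash)          # square root of the variance

print(f"range={value_range}, variance={variance:.3f}, SD={std_dev:.3f}")
```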

26
27

28

29
Normal Distributions with different SD
30
Cross Tabulation
  • A statistical technique that involves tabulating
    the results of two or more variables
    simultaneously
  • informs you how often each response was given
  • Shows relationships among and between variables
  • frequency distribution for each subgroup compared
    to the frequency distribution for the total
    sample
  • variables must be nominally scaled

31
Cross-tabulation
  • Helps answer questions about whether two or more
    variables of interest are linked
  • Is the type of mouthwash user (heavy or light)
    related to gender?
  • Is the preference for a certain flavor (cherry or
    lemon) related to the geographic region (north,
    south, east, west)?
  • Is income level associated with gender?
  • Cross-tabulation determines association, not
    causality.

32
Dependent and Independent Variables
  • The variable being studied is called the
    dependent variable or response variable.
  • A variable that influences the dependent variable
    is called the independent variable.

33
Cross-tabulation
  • Cross-tabulation of two or more variables is
    possible if the variables are discrete
  • The frequency of one variable is subdivided by
    the other variable categories.
  • Generally a cross-tabulation table has
  • Row percentages
  • Column percentages
  • Total percentages
  • Which one is better?
  • DEPENDS on which variable is considered as
    independent.

34
Contingency Table
  • A contingency table shows the conjoint
    distribution of two discrete variables
  • This distribution represents the probability of
    observing a case in each cell
  • Probability is calculated as

35
Cross tabulation
GROUPINC Gender Crosstabulation
36
General Procedure for Hypothesis Test
  • Formulate H0 (null hypothesis) and H1
    (alternative hypothesis)
  • Select appropriate test
  • Choose level of significance
  • Calculate the test statistic (SPSS)
  • Determine the probability associated with the
    statistic.
  • Determine the critical value of the test
    statistic.

37
General Procedure for Hypothesis Test
  • a) Compare the probability with the level of
    significance, α
  • b) Determine if the test statistic falls in
    the rejection region. (check tables)
  • Reject or do not reject H0
  • Draw a conclusion

38
1. Formulate H1 and H0
  • The hypothesis the researcher wants to test is
    called the alternative hypothesis H1.
  • The opposite of the alternative hypothesis is the
    null hypothesis H0 (the status quo)(no difference
    between the sample and the population, or between
    samples).
  • The objective is to DISPROVE the null hypothesis.
  • The significance level is the critical
    probability used in choosing between the null
    hypothesis and the alternative hypothesis

39
2. Select Appropriate Test
  • The selection of a proper test depends on:
  • Scale of the data: nominal, interval
  • The statistic you seek to compare: proportions
    (percentages), means
  • The sampling distribution of such statistic:
    normal distribution, t distribution, χ²
    distribution
  • Number of variables: univariate, bivariate,
    multivariate
  • Type of question to be answered

40
Example: A tire manufacturer believes that men are
more aware of their brand than women. To find
out, a survey is conducted of 100 customers, 65
of whom are men and 35 of whom are women. The
question they are asked is "Are you aware of our
brand: Yes or No?" 50 of the men were aware and
15 were not, whereas 10 of the women were aware
and 25 were not. Are these differences
significant?

          Men   Women   Total
Aware      50      10      60
Unaware    15      25      40
Total      65      35     100
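
Row, column, and total percentages for this table can be sketched in Python as below. The dictionary layout and the helper `percentages` are illustrative choices, not part of the deck.

```python
# Observed counts from the tire brand awareness survey
counts = {
    "Aware":   {"Men": 50, "Women": 10},
    "Unaware": {"Men": 15, "Women": 25},
}
columns = ["Men", "Women"]

row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
col_totals = {col: sum(counts[row][col] for row in counts) for col in columns}
n = sum(row_totals.values())

def percentages(denominator_for):
    """Build a table of percentages; denominator_for(row, col) picks the base."""
    return {
        row: {col: 100.0 * counts[row][col] / denominator_for(row, col)
              for col in columns}
        for row in counts
    }

row_pct = percentages(lambda row, col: row_totals[row])    # base: row total
col_pct = percentages(lambda row, col: col_totals[col])    # base: column total
total_pct = percentages(lambda row, col: n)                # base: sample size

for col in columns:
    print(f"{col}: {col_pct['Aware'][col]:.1f}% aware")
```

If gender is treated as the independent variable, the column percentages are the natural comparison here: they show the share of men and the share of women who are aware of the brand.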
41
1. Formulate H1 and H0
  • We want to know whether brand awareness is
    associated with gender. What are the hypotheses?

H0: There is no difference in brand awareness based
on gender.
H1: There is a difference in brand awareness based
on gender.
Chi-square test results are unstable if cell
count is lower than 5
42
2. Select Appropriate Test
χ² (Chi-Square)
  • Used to discover whether 2 or more groups of one
    variable (dependent variable) vary significantly
    from each other with respect to some other
    variable (independent variable).
  • Are the two variables of interest associated?
  • Do men and women differ with respect to product
    usage (heavy, medium, or light)?
  • Is the preference for a certain flavor (cherry or
    lemon) related to the geographic region (north,
    south, east, west)?
  • H0: The two variables are independent (not
    associated)
  • H1: The two variables are not independent
    (associated)
  • Variables must be nominal level or, if interval
    or ratio, must be divided into categories

43
Awareness of Tire Manufacturer's Brand
(each cell shows observed / expected frequency)

          Men     Women   Total
Aware     50/39   10/21      60
Unaware   15/26   25/14      40
Total     65      35        100

Estimated cell frequency: $E_{ij} = \frac{R_i C_j}{n}$
  R_i  = total observed frequency in the ith row
  C_j  = total observed frequency in the jth column
  n    = sample size
  E_ij = estimated cell frequency
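
The expected counts in the table follow directly from the row and column totals. A minimal sketch, with illustrative variable names:

```python
# Observed counts: rows are Aware / Unaware, columns are Men / Women
observed = [[50, 10],
            [15, 25]]

row_totals = [sum(row) for row in observed]          # R_i
col_totals = [sum(col) for col in zip(*observed)]    # C_j
n = sum(row_totals)                                  # sample size

# Estimated cell frequency: row total times column total, divided by n
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print(row)
```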
44
3. Choose Level of Significance
  • Whenever we draw inferences about a population,
    there is a risk that an incorrect conclusion will
    be reached
  • The real question is how strong the evidence in
    favor of the alternative hypothesis must be to
    reject the null hypothesis.
  • The significance level states the probability of
    rejecting H0 when in fact it is true.
  • In this example an error would be committed if we
    said that there is a difference between men and
    women with respect to brand awareness when in
    fact there was no difference i.e. we have
    rejected the null hypothesis when it is in fact
    true
  • This error is commonly known as a Type I error.
    The probability of committing it, α, is called
    the significance level of the test.

45
  • Significance level selected is typically .05 or
    .01
  • i.e. 5% or 1%
  • In other words we are willing to accept the risk
    that 5% (or 1%) of the time the results we get
    indicate that we should reject the null
    hypothesis when it is in fact true.
  • 5% (or 1%) of the time we are willing to commit
    a Type I error
  • stating there is a difference between men and
    women with respect to brand awareness when in
    fact there is no difference

46
3. Choose Level of Significance
  • We commit a Type II error when we incorrectly
    accept a null hypothesis that is false. The
    probability of committing a Type II error is
    denoted by β.
  • In our example we commit a Type II error when we
    say that there is NO difference between men and
    women with respect to brand awareness (we accept
    the null hypothesis) when in fact there is.
47
Type I and Type II Errors
                 Accept null          Reject null
Null is true     Correct (no error)   Type I error
Null is false    Type II error        Correct (no error)
48
Which is worse?
  • Both are serious, but traditionally Type I error
    has been considered more serious. That's why the
    objective of hypothesis testing is to reject H0
    only when there is enough evidence to support
    rejecting it.
  • Therefore, we choose α to be as small as possible
    without compromising β (the probability of
    accepting H0 when it is false).
  • Increasing the sample size for a given α will
    decrease β (i.e. the probability of accepting the
    null hypothesis when it is in fact false)

50
Chi-Square Test
Estimated cell frequency: $E_{ij} = \frac{R_i C_j}{n}$
  R_i  = total observed frequency in the ith row
  C_j  = total observed frequency in the jth column
  n    = sample size
  E_ij = estimated cell frequency
Chi-square statistic: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
  χ²  = chi-square statistic
  O_i = observed frequency in the ith cell
  E_i = expected frequency in the ith cell
Degrees of freedom: d.f. = (R - 1)(C - 1), where R is
the number of rows and C is the number of columns
51
  • Degrees of Freedom
  • the number of values in the final calculation of
    a statistic that are free to vary
  • For example, to calculate the standard deviation
    of a random sample, we must first calculate the
    mean of that sample and then compute the sum of
    the squared deviations from that mean
  • While there will be n such squared deviations,
    only (n - 1) of them are free to assume any value
    whatsoever.
  • This is because the final squared deviation from
    the mean must include the one value of X such
    that the sum of all the Xs divided by n will
    equal the obtained mean of the sample.
  • All of the other (n - 1) squared deviations from
    the mean can, theoretically, have any values
    whatsoever.

52
4. Calculate the Test Statistic
Chi-Square Test Differences Among Groups
Chi-square test results are unstable if cell
count is lower than 5
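
For the tire brand awareness example, the test statistic can be computed with a short function. This is a sketch of the standard Pearson chi-square calculation for a contingency table; the deck itself relies on SPSS for this step, and the function name `chi_square_test` is illustrative.

```python
def chi_square_test(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency
            statistic += (o - e) ** 2 / e           # this cell's contribution

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Rows: Aware / Unaware; columns: Men / Women
statistic, df = chi_square_test([[50, 10], [15, 25]])
print(f"chi-square = {statistic:.2f}, d.f. = {df}")
```

The result agrees with the value of 22.16 used later in this deck.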
53
5. Determine the Probability-value (Critical
Value)
  • The p-value is the probability of seeing a random
    sample at least as extreme as the sample observed,
    given that the null hypothesis is true.
  • Given the value of alpha (α), we use statistical
    theory to determine the rejection region.
  • If the sample falls into this region we reject
    the null hypothesis; otherwise, we do not reject
    it
  • Sample evidence that falls into the rejection
    region is called statistically significant at the
    alpha level.

54
COMBINATIONS
A combination is the selection of a certain
number of objects taken from a group of objects
without regard to order. We use the symbol
(5 choose 3) to indicate that we have five
objects taken three at a time, without regard to
order. To calculate the possible number of
combinations the formula used is

$\binom{5}{3} = \frac{5!}{3!\,2!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1) \times (2 \times 1)} = \frac{120}{12} = 10$

If we choose a sample of 5 from a total of 20
there are 15,504 possible combinations. If we
took the means of some measurement for each of
the possible combinations, those means would form
an approximately normal distribution.
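
Both counts can be verified with Python's standard library. `math.comb` (available from Python 3.8) computes the number of combinations directly; the helper `combinations_count` is an illustrative name that spells out the factorial formula.

```python
import math

def combinations_count(n, k):
    """Number of ways to choose k objects from n, ignoring order: n! / (k! (n-k)!)."""
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

five_choose_three = combinations_count(5, 3)
samples_of_5_from_20 = math.comb(20, 5)   # built-in equivalent

print(five_choose_three, samples_of_5_from_20)
```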
55
Critical value
  • A critical value is the value that a test
    statistic must exceed in order for the null
    hypothesis to be rejected.
  • For example, the critical value of t (with 12
    degrees of freedom using the .05 significance
    level) is 2.18.
  • This means that for the probability value to be
    less than or equal to .05, the absolute value of
    the t statistic must be 2.18 or greater.

[Figure: distribution showing the critical value, the significance level (.05), and the test statistic]
56
Significance from p-values -- continued
  • How small is a small p-value? This is largely a
    matter of semantics but if the
  • p-value is less than 0.01, it provides
    convincing evidence that the alternative
    hypothesis is true
  • p-value is between 0.01 and 0.05, there is
    strong evidence in favor of the alternative
    hypothesis
  • p-value is between 0.05 and 0.10, it is in a
    gray area
  • p-values greater than 0.10 are interpreted as
    weak or no evidence in support of the
    alternative.
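
These rules of thumb can be written as a small lookup function. This is only a sketch: the function name and the handling of values that fall exactly on a boundary (0.01, 0.05, 0.10) are assumptions, since the slide does not say which side they belong to.

```python
def evidence_label(p_value):
    """Describe the evidence for the alternative hypothesis given a p-value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    if p_value < 0.01:
        return "convincing"
    if p_value < 0.05:       # assumption: exactly 0.01 counts as "strong"
        return "strong"
    if p_value <= 0.10:      # assumption: exactly 0.05 and 0.10 are "gray area"
        return "gray area"
    return "weak or none"

for p in (0.001, 0.03, 0.07, 0.25):
    print(p, evidence_label(p))
```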

57
5. Determine the Probability-value (Critical
Value)
Chi-square Test for Independence
Under H0, the test statistic approximately
follows the chi-square distribution (χ²).
Chi-square (χ²) with 1 d.f. at the .05 level:
critical value = 3.84
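
For a 2 x 2 table (1 degree of freedom) the p-value can be computed without statistical tables, because a chi-square variable with 1 d.f. is the square of a standard normal variable. The sketch below relies on that identity and is valid only for 1 d.f.; the function name is illustrative, and the statistic 22.16 is the value this deck reports for the tire example.

```python
import math

def chi_square_p_value_1df(statistic):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    # P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z
    return math.erfc(math.sqrt(statistic / 2.0))

ALPHA = 0.05
CRITICAL_VALUE = 3.84          # chi-square, 1 d.f., .05 level
statistic = 22.16              # test statistic from the tire example

p_value = chi_square_p_value_1df(statistic)
reject_h0 = statistic > CRITICAL_VALUE

print(f"p-value = {p_value:.2e}, reject H0: {reject_h0}")
print(f"tail area beyond 3.84 = {chi_square_p_value_1df(CRITICAL_VALUE):.4f}")
```

Comparing the statistic with the critical value and comparing the p-value with α lead to the same decision.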
58
  • a) Compare the probability with the level of
    significance, α
  • b) Determine if the test statistic falls in
    the rejection region. (check tables)
  • 22.16 is greater than 3.84 and falls in the
    rejection area
  • In fact it is significant at the .001 level,
    which means that if our variables were
    independent, the chance of picking a sample this
    far out of line would be less than 1/1000
  • Or, in other words, the chance that we commit a
    Type I error is less than .1%
  • i.e. there is less than a .1% chance that we
    reject the null hypothesis (that there is no
    difference between men and women with respect to
    brand awareness) when it is in fact true.

59
  • Reject or do not reject H0
  • Since 22.16 is greater than 3.84 we reject the
    null hypothesis
  • Draw a conclusion
  • Men and women differ with respect to brand
    awareness; specifically, men are more brand aware
    than women