Basic Statistics with Excel - PowerPoint PPT Presentation

1 / 31
About This Presentation
Title:

Basic Statistics with Excel

Description:

It consists of both information and exercises to introduce Excel as a tool for ... To 'fix' a cell, use $ in front of the column letter and/or row number. ... – PowerPoint PPT presentation

Number of Views:243
Avg rating:3.0/5.0
Slides: 32
Provided by: webcheE
Category:

less

Transcript and Presenter's Notes

Title: Basic Statistics with Excel


1
Basic Statistics with Excel
  • ChE 408 Engineering Experimental Design
  • Valerie L. Young
  • Chemical Engineering, Ohio University

2
Outline Excel as a Tool for . . .
  • Summarizing data
  • Fitting data (regression)
  • Hypothesis testing

This is a tutorial. It consists of both
information and exercises to introduce Excel as a
tool for statistical analysis. Work through it at
your own pace. It assumes that you are familiar
with Excel and can do simple calculations and
make plots. If you cant, seek more help.
3
Some Excel Tidbits
  • Excel can display dates in many formats, but it
    stores dates as numbers.
  • 1 January 1, 1900
  • 37622 January 1, 2003
  • A single quote forces a cell entry to be text
  • To fix a cell, use in front of the column
    letter and/or row number. Then the address wont
    update when you copy the cell.
  • Excel can count the number of values in a range
  • count(N2N35)
  • countif(N2N35,-777)
  • countif(N2N35,gt-777)

4
Summarizing Data Histograms
  • When your data set consists of a single variable
    measured multiple times, a histogram shows how
    the values are distributed.

About 26 of measurements are between 55 and 56.
Indicate total of values on graph or in caption.
This distribution is left-skewed.
This is ? 50 and lt 51. Next category is ? 51 and
lt 52. No overlap.
All categories the same size.
5
Exercise Construct a Histogram
  • Now that you have seen a histogram generated by
    Excel, you should practice by making one
    yourself. On the next slide, you will be
    directed to load a data file and make a histogram
    to display the data. Each bullet on the slide is
    a hint for what you should do next to complete
    the histogram. You should be able to figure out
    how to make Excel do these things.

6
Exercise Construct a Histogram
  • Load calibration.xls from the ChE 408 homepage
  • Count the total number of entries under cal
    factor
  • count(b2b28)
  • Set up a list of categories for the histogram
  • 46-47
  • Count the number of entries in each category
  • countif(b2b28,lt47)-countif(b2b28,lt46)
  • Calculate the fraction of entries in each
    category
  • Insert a column chart on a new page, fraction vs.
    category

7
End of Histogram Exercise
  • Continue to learn about more ways to summarize
    data.

8
More Ways to Summarize Data
  • Plots
  • Time series (use xy, not line, to see trends
    over course of experiment)
  • y vs. x (use xy, not line, to investigate
    correlations)
  • Column chart (to make histogram)
  • Pie chart (to show proportions rarely used)
  • Descriptive Statistics
  • Measures of location
  • Measures of variability
  • Tools gt Data Analysis gt Descriptive Statistics

9
Plots to Summarize Data
Dependent variable. Cal Factor depends on sample
size.
Look! No gray background.
Axes scaled so data covers graph.
Independent variable
Figure 2a. Example of a time-series plot, showing
that values from the first day appear to be lower
and more scattered than the rest.
Figure 2b. Example of a y vs. x plot, showing
that lower values of calibration factor are
obtained with smaller amounts of sample.
10
Descriptive Statistics Location
  • Mean (average)
  • Strongly affected by unusual points
  • average(b2b28)
  • Median (50 of data higher, 50 of data lower)
  • Seldom strongly affected by unusual points
  • median(b2b28)

11
Descriptive Statistics Variability
  • Standard Deviation, stdev(b2b28)
  • A measure of how widely the data is spread around
    the mean
  • Strongly affected by unusual points
  • Relative Standard Deviation
  • stdev(b2b28)/average(b2b28)
  • Usually given as percentage
  • Format gt Cells gt Percent
  • Variance
  • (stdev(b2b28))2

12
Descriptive Statistics Variability
  • Interquartile range, IQR
  • Width of the middle 50 of the data
  • Seldom strongly affected by unusual points
  • IQR q0.75 q0.25
  • q0.75 75 of data is lower, percentile(b2b28,0
    .75)
  • q0.25 25 of data is lower, percentile(b2b28,0
    .25)
  • Range
  • Values spanned by the data
  • max(b2b28) min(b2b28)
  • Strongly affected by unusual points

13
Descriptive Statistics Variability
  • Confidence Interval
  • Mean /- Uncertainty covers the confidence
    interval
  • 95 confidence limits means that if you
    calculate a mean and confidence interval from a
    set of replicate measurements (the sample), you
    can be 95 sure that if you made an infinite
    number of measurements (the population), their
    mean would lie within the confidence interval

14
Descriptive Statistics
The mode is the value that appears most often.
Each value appears just once in this data set.
The 95 confidence interval is 54.2 0.9.
Kurtosis and skewness describe the symmetry of
the distribution. We wont discuss these further
in this class.
This table is as-generated by Excel. Round to
appropriate sig figs before reporting these
numbers.
15
Exercise Summarizing Data
  • Now that you have seen graphical and statistical
    methods for summarizing data, you should
    practice. On the next slide, you will be
    directed to load a data file and generate plots
    and descriptive statistics. Each bullet on the
    slide is a hint for what you should do next. You
    should be able to figure out how to make Excel do
    these things.

16
Exercise Summarizing Data
  • Load calibration.xls from the ChE 408 homepage
  • Construct a time-series plot for the calibration
    factor
  • Plot calibration factor vs. sample amount
  • Use descriptive statistics to summarize the
    location and variability of the calibration
    factor
  • Calculate each statistic individually
  • Use Tools gt Data Analysis gt Descriptive Statistics

17
End of Summarizing Data Exercise
  • Continue to learn about fitting equations to
    data.

18
Fitting Data
  • For linear regression, use Tools gt Data Analysis
    gt Regression
  • Dont use trendline no statistics
  • X Variable ? slope
  • R2 Coefficient of Determination
  • A value near 1.0 means that the value of y
    depends strongly on the value of x. Does NOT
    mean the dependence is linear.
  • Use residual plots to show linearity
  • Residuals should be random around zero
  • Use p-values to show significance of linear fit
  • p probability that points are arranged like
    this by chance

19
Linear Regression Example of Excel Output
Round to appropriate sig figs before reporting
these numbers.
99 of the variation in y is explained by
variation in x. The remainder may be random
error, or may be explained by some factor other
than x.
Ratio of variability explained by model to
leftover variability. High number means model
explains most variation in data.
Probability of getting that value of F by
randomly sampling from normally-distributed data.
y (0.580.02)x (0.0150.010)
Probability of getting a slope or intercept this
much different from zero by randomly sampling
from normally distributed data.
Confidence limits on slope and intercept.
20
Residual Plot
This plot is as-generated by Excel. Fix the
formatting before including this plot in a report.
The random distribution of residuals around zero
suggests that the model accounts for all
predictable variation in y, and all that is left
is random uncertainty.
A residual for a given value of x is the
difference between the measured value of y, and
the value of y calculated using the regression
model.
21
Exercise Linear Regression
  • Now that you have seen the kind of information
    Excel gives you when you fit a straight line to
    data, you should practice. On the next slide, you
    will be directed to load a data file, plot the
    data, and fit a line to it. Each bullet on the
    slide is a hint for what you should do next. You
    should be able to figure out how to make Excel do
    these things.

22
Exercise Fitting Data
  • Load Jul2627data.xls from the ChE 408 homepage
  • Plot i-butane vs. n-butane. Does this plot
    appear linear?
  • Perform linear regression on i-butane vs.
    n-butane.
  • Are these values strongly correlated?
  • Are these values linearly correlated?
  • What is the linear equation relating these
    values?
  • What uncertainty would you place on the values of
    the slope and intercept for this linear
    relationship?

23
End of Summarizing Data Exercise
  • Continue to learn about fitting equations to
    data.

24
Hypothesis Testing Can Answer Questions Like . .
.
  • Is this value significantly different from the
    one I expected?
  • Is the variability of this data significantly
    different from what I expected?
  • Are these two sets of data significantly
    different from each other?
  • When you say significant, back it up with
    statistics.

25
Hypothesis Testing
  • H0 null hypothesis
  • There is no significant difference
  • H1 alternative hypothesis
  • There is a significant difference (two-sided)
  • This is significantly higher / lower (one-sided)
  • You cannot prove the null hypothesis. You can at
    best say the data offer no significant evidence
    against it.

26
Testing Locationp-value (Significance Level)
  • Significance level of 5 (p lt 0.05) means at
    most a 5 chance that this difference is due to
    random uncertainty and not real.
  • Excel calculates this as part of a t-test
  • p-value gt significance level accept H0
  • Cannot prove that observations and expectations
    differ

27
Testing Location t-test
  • t ? (difference between samples) / (variability)
  • Excel will automatically calculate t-values to
    compare
  • Means of two datasets with equal variances
  • Means of two datasets with unequal variances
  • Two sets of paired data
  • abs(t-score) lt abs(t-critical) accept H0
  • Cannot prove that observations and expectations
    differ

28
t-Test Independent Samples
Round to appropriate sig figs before reporting
these numbers.
Change this if you want to know whether the means
of the two samples differ by at least some
specified amount.
Probability of drawing two random samples from a
normally distributed population and getting the
mean of sample 1 this much larger than the mean
of sample 2. The mean of sample 1 is larger at
a significance level of ?0.02 (or at the 2
significance level), because p lt 0.02.
t gt tcritical(one-tail), so the mean of sample 1
is significantly larger than the mean of sample
2.
t gt tcritical(two-tail), so the mean of sample 1
is significantly different from the mean of
sample 2.
P lt 0.03, so the means of the two samples are
different at the 3 significance level.
29
Exercise Hypothesis Testing
  • Now that you have seen the results of some
    t-tests and p-tests in Excel, you should
    practice. On the next slide, you will be directed
    to load a data file and make decisions about the
    data. Each bullet on the slide is a hint for
    what you should do next. You should be able to
    figure out how to make Excel do these things.

30
Exercise - Hypothesis Testing
  • Load calibration.xls from the ChE 408 homepage
  • Calculate the mean, std dev, and variance for
    only the first day.
  • Calculate the mean, std dev, and variance
    excluding the first day.
  • Use a t-test and a p-test to learn whether the
    calibration factor measured the first day differs
    significantly from the rest
  • Tools gt Data analysis gt t-Test Two sample . . .
  • Can use equal variances if ratio of variances lt
    3.

31
End of Hypothesis Testing Exercise
  • This concludes the Excel tutorial.
Write a Comment
User Comments (0)
About PowerShow.com