Title: Data
1 Data: Sampling, Summary Statistics, Models for Parameter Estimation and Multifactor Effects
- Engineering Experimental Design
- Valerie Young
2 In Today's Lecture
- Principles of sampling
- Summary statistics
  - Location
  - Variability
- Models
3 Principles of Sampling
4 Sampling
- Goal: use statistics based on analysis of a sample to estimate the same characteristics for the whole population.
- Random sampling is the best choice
  - Likely to be representative of the whole population
  - Ironically, requires careful planning
  - Uses a consistent, predetermined protocol
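The idea of a predetermined random protocol can be sketched in a few lines of Python. This is only an illustration: the population values are made up, and the seed, the sample size of 30, and the helper name `draw_random_sample` are assumptions, not part of the lecture.

```python
import random
import statistics

# Hypothetical population: 1,000 measurements (made-up values for illustration).
rng = random.Random(42)  # fixed seed so the protocol is repeatable
population = [rng.gauss(50.0, 5.0) for _ in range(1000)]

def draw_random_sample(values, n, rng):
    """Predetermined protocol: pick n distinct elements, each equally likely."""
    return rng.sample(values, n)

sample = draw_random_sample(population, 30, rng)

# The sample statistic is used as an estimate of the population statistic.
population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
```

The protocol (which elements may be chosen, how many, and by what random mechanism) is fixed before any data are taken, which is the "careful planning" the slide refers to.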
5 Questions
- True or false: you can always get a more accurate characterization of a population if you measure every element instead of sampling.
- Which is a better example of random sampling?
  - The operator takes a sample at 10 past the hour, every hour.
  - The operator takes a sample once an hour at whatever time he gets a chance.
6 Replicate Measurements
- Goal: determine how much of the variability in y is due to the way the experiment is done, and not due to the factors you want to test.
- Your definition of "replicate" determines what effects (or sources of variability) are included in the uncertainty you determine.
- Watching the flow meter for a while could be considered making replicate measurements.
7 Questions
- Suppose that you watch a thermocouple reading for several minutes and observe that it varies by no more than ±1 °C while you are watching.
  - What might cause this variability?
  - The manufacturer specifies an accuracy of 2 °C. What additional sources of uncertainty might be included in the manufacturer's estimate that you cannot see by watching this single readout?
8 Summary Statistics
9 Location
- To define a typical value for your data, try:
  - Mean, $\bar{x}$
    - Sum of values / number of values
    - Susceptible to any outliers
  - Median, $x_{0.5}$
    - Middle value
    - Not altered by a couple of outliers
- A typical value alone loses information about variability, time dependence, etc.
- The typical value for your sample is your best estimate of the true value (or location for the whole population).
Exercise: for the data 1, 2, 3, 4, 10, Mean = ? Median = ?
10 Location
1, 2, 3, 4, 10: Mean = 20/5 = 4, Median = 3
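The mean and median for the data 1, 2, 3, 4, 10 can be checked with a short Python sketch. The second data set, with 10 replaced by 100, is a hypothetical variation added here to show how an outlier pulls the mean but not the median.

```python
import statistics

data = [1, 2, 3, 4, 10]            # values from the slide
mean = sum(data) / len(data)       # 20 / 5
median = statistics.median(data)   # middle value of the sorted data

# Hypothetical variation: make the largest value a more extreme outlier.
with_outlier = [1, 2, 3, 4, 100]
mean_outlier = sum(with_outlier) / len(with_outlier)
median_outlier = statistics.median(with_outlier)

print(mean, median)                  # 4.0 3
print(mean_outlier, median_outlier)  # 22.0 3
```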
11 Variability
- To define the variability of your data, try:
  - Standard deviation, $s$
    - $s = \sqrt{S_{xx}/(n-1)}$
  - Variance, $s^2$
    - $s^2 = S_{xx}/(n-1)$
  - Interquartile range, IQR
    - $\text{IQR} = x_{0.75} - x_{0.25}$
    - Less susceptible to outliers than $s$ or $s^2$
$S_{xx}$ means the sum of the squared differences between each data point ($x_i$) and the mean of all data points ($\bar{x}$): $S_{xx} = \sum_i (x_i - \bar{x})^2$.
Thus, $s$ and $s^2$ measure how widely distributed the data are around their mean.
Exercise: for the data 1, 4, 15, 16, 17, 25, 50, 90, $x_{0.25}$ = ? $x_{0.75}$ = ?
12 Variability
1, 4, 15, 16, 17, 25, 50, 90: $x_{0.25}$ = 15, $x_{0.75}$ = 25
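The variability measures can be computed for the same eight values as a rough check. Quartile conventions differ between textbooks and software. The helper `nearest_rank_quantile` below is an assumed convention, chosen only because it reproduces the quartiles 15 and 25 quoted on the slide; it is not necessarily the rule used in the course text.

```python
import math
import statistics

data = [1, 4, 15, 16, 17, 25, 50, 90]   # values from the slide
n = len(data)
x_bar = sum(data) / n

s_xx = sum((x - x_bar) ** 2 for x in data)   # sum of squared deviations
variance = s_xx / (n - 1)                     # s^2
std_dev = math.sqrt(variance)                 # s

def nearest_rank_quantile(values, p):
    """Assumed convention: take the sorted value nearest to position p*(n-1)."""
    ordered = sorted(values)
    return ordered[round(p * (len(ordered) - 1))]

q1 = nearest_rank_quantile(data, 0.25)
q3 = nearest_rank_quantile(data, 0.75)
iqr = q3 - q1

print(f"Sxx = {s_xx}, s^2 = {variance:.1f}, s = {std_dev:.1f}, IQR = {iqr}")
```

Other quantile rules (for example, interpolating between neighboring values) give different quartiles for small data sets like this one, so it is worth stating which rule was used when reporting an IQR.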
13 Getting Uncertainty from Replicate Measurements
- With replicate measurements, the mean is commonly reported as the best estimate of the true value.
- Uncertainty may be described by standard deviations or confidence limits (covered later).
  - 1 s.d. or 2 s.d. are both common choices
- Propagation of error on the calculation of the mean is NOT appropriate.
  - Error propagation makes the uncertainty grow with more math operations. You should be more certain of your answer with more replicates, not less.
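The claim that more replicates should make you more certain of the mean can be illustrated with a small simulation. Everything in it is an assumption made for illustration: the true value, the noise level, independent normally distributed errors, and the function names.

```python
import random
import statistics

rng = random.Random(0)             # fixed seed for repeatability
TRUE_VALUE, NOISE_SD = 100.0, 2.0  # hypothetical measurement model

def mean_of_replicates(n):
    """Simulate n replicate measurements and return their mean."""
    return statistics.mean(rng.gauss(TRUE_VALUE, NOISE_SD) for _ in range(n))

def spread_of_means(n, trials=2000):
    """Standard deviation of the mean across many repeated experiments."""
    return statistics.stdev(mean_of_replicates(n) for _ in range(trials))

spread_4 = spread_of_means(4)
spread_64 = spread_of_means(64)
print(f"spread of the mean with  4 replicates: {spread_4:.2f}")
print(f"spread of the mean with 64 replicates: {spread_64:.2f}")
```

Under these assumptions the mean of 64 replicates scatters much less than the mean of 4, which is the opposite of what a naive error propagation through the sum would suggest.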
14 Questions
- After carefully controlling all the chemical reagents and conditions during a reaction, the researcher weighs the product on an electronic balance five times, removing and replacing the same sample on the balance each time.
  - What measurement is being replicated?
  - What sources of uncertainty are characterized by the standard deviation of the five weighings?
  - What would you do to determine the uncertainty on the reaction yield?
15 Mathematical Models for Experimental Data
16 Single-Factor Experiment
- Hypothesis: The height of a chemical engineering student depends on the student's gender.
- Population: All U.S. chemical engineering undergraduates
- Sample: Students in ChE 408 W03 at OU
  - Whether this sample is representative of all ChE students could certainly be questioned, but let's go with it.
- Factor (independent variable to be investigated): Gender
- Response (dependent variable to be investigated): Height
17 Model for Single-Factor Experiment
Table 1. Self-reported heights of students in ChE 408 in Winter 03
From this sample of 10 women and 19 men, male chemical engineering students are taller than their female counterparts, with heights of (72 ± 2) in and (64 ± 3) in, respectively. The uncertainties represent the standard deviations of the data.
The Model:
$\text{Height}_{\text{female},j} = 64\ \text{inches} + \varepsilon_j$
$\text{Height}_{\text{male},i} = 72\ \text{inches} + \varepsilon_i$
- Every model consists of 2 parts:
  - The predictable relationship between the factor and response
  - The random variability
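The two parts of a single-factor model can be separated numerically. The heights below are hypothetical and are not the values from Table 1; they were chosen only so that the group means come out at 64 in and 72 in, matching the model on the slide.

```python
import statistics

# Hypothetical heights in inches (NOT the data from Table 1).
heights = {
    "female": [61, 64, 66, 65],
    "male": [70, 72, 74, 71, 73],
}

# Predictable part: one mean per level of the factor (gender).
group_means = {g: statistics.mean(v) for g, v in heights.items()}

# Random part: residual of each observation about its group mean.
residuals = {g: [h - group_means[g] for h in v] for g, v in heights.items()}

# Spread of the random part, reported as a standard deviation per group.
group_sds = {g: statistics.stdev(v) for g, v in heights.items()}

for g in heights:
    print(f"{g}: mean = {group_means[g]} in, s = {group_sds[g]:.1f} in")
```

Each observation equals its group mean plus its residual, and the residuals within a group sum to zero by construction.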
18 Two-Factor Experiment
- Hypothesis: The height of a chemical engineering student depends on the student's gender and whether his/her last name starts with A-L or M-Z.
- Population: All U.S. chemical engineering undergraduates
- Sample: Students in ChE 408 W03 at OU
- Factors (independent variables to be investigated): Gender, First Letter of Last Name
- Response (dependent variable to be investigated): Height
19 Two-Factor Experiment
- Gender appears to have an important effect.
- Alphabet appears not to have an important effect.
- How do we quantify these effects?
- What about interaction?
  - Does alphabet modify the effect of gender?
20 Two-Factor Experiment
[Figure: interaction plot; vertical axis: Mean]
Crossing here doesn't count. There is no gender after Male.
Page 304 of text shows a plot with interaction.
Lines don't cross, so no interaction.
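One common way to put a number on what an interaction plot shows is a difference of differences between cell means. The sketch below uses made-up cell means, not the class data, and the function name `gender_effect` is hypothetical.

```python
# Hypothetical cell means in inches (made up; NOT the class data).
cell_means = {
    ("female", "A-L"): 63.0,
    ("female", "M-Z"): 65.0,
    ("male", "A-L"): 71.0,
    ("male", "M-Z"): 73.0,
}

def gender_effect(last_name_group):
    """Male minus female mean height within one last-name group."""
    male = cell_means[("male", last_name_group)]
    female = cell_means[("female", last_name_group)]
    return male - female

# Interaction contrast: does the gender effect change between groups?
interaction = gender_effect("M-Z") - gender_effect("A-L")
print(interaction)  # 0.0
```

A contrast near zero means the two lines on the plot are roughly parallel, which is how an absence of interaction usually appears. A contrast far from zero means one factor changes the effect of the other.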
21 Model for Two-Factor Experiment with No Interaction
$\text{Height}_{\text{female},\,\text{A-L},\,i} = 69.4\ \text{in} + (-5.597\ \text{in}) + (-1.014\ \text{in}) + \varepsilon_i$
where:
- 69.4 in: overall mean of all data in sample (best estimate of true mean of population)
- (-5.597 in): (mean height of females) - (overall mean of data); best estimate of main effect of being female
- (-1.014 in): (mean height of A-L) - (overall mean of data); best estimate of main effect of being A-L
- $\varepsilon_i$: error
- Every model consists of two parts:
  - Variability due to the predictable relationship between response and factors (often the only part of the model that is written)
  - Random variability (also called error or uncertainty)
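The predictable part of the no-interaction model is just a sum of terms, which can be sketched as follows. The three constants are the numbers quoted on the slide. The helper `main_effect` and the small demonstration data set are hypothetical and only show how a main effect would be estimated as a level mean minus the overall mean.

```python
import statistics

# Terms quoted on the slide, in inches.
OVERALL_MEAN = 69.4
EFFECT_FEMALE = -5.597
EFFECT_A_TO_L = -1.014

# Predictable part for a female student with an A-L last name.
predicted = OVERALL_MEAN + EFFECT_FEMALE + EFFECT_A_TO_L
print(f"predicted height = {predicted:.3f} in")

def main_effect(level_values, all_values):
    """(Mean of one factor level) minus (overall mean of all data)."""
    return statistics.mean(level_values) - statistics.mean(all_values)

# Hypothetical heights (NOT the class data) to exercise main_effect.
females = [62.0, 64.0, 66.0]
males = [70.0, 72.0, 74.0]
effect_female_demo = main_effect(females, females + males)
print(f"demo main effect of being female = {effect_female_demo} in")
```

An individual's measured height differs from the predicted value by the error term, which the model treats as random variability.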