Title: Association
1Association
- Reference
- Brown's Lecture Note 1
- Grading on Curve
2Topics
- Methods for studying relationships among several
variables
- Scatter plot
- Correlation coefficient
- Association and causation.
- Regression
- Examine the distribution of a single variable.
- QQplot
3Regression
- Sir Francis Galton, in his 1885 presidential
address before the anthropology section of the
British Association for the Advancement of
Science, described a study he had made of the question:
- How tall are children compared to their parents?
- He thought he had made a discovery when he found
that children's heights tend to be more moderate
than those of their parents.
- For example, if the parents were very tall, their
children tended to be tall, but shorter than the
parents.
- He called this discovery a regression to the
mean.
- The term regression has come to be applied to the
least-squares technique that we now use to
produce results of the type he found (but which
he did not use to produce his results).
- Association between variables
- Two variables measured on the same individuals
are associated if some values of one variable
tend to occur more often with some values of the
second variable than with other values of that
variable.
4Study relationships among several variables
- Associations are possible between
- Two quantitative variables.
- A quantitative and a categorical variable.
- Two categorical variables.
- Quantitative and categorical variables
- Regression
- Response variable and explanatory variable
- A response variable measures an outcome of a
study.
- An explanatory variable explains or causes
changes in the response variable.
- If one sets values of one variable, what effect
does it have on the other variable?
- Other names
- Response variable = dependent variable.
- Explanatory variable = independent variable.
5Principles for studying association
- Start with a graphical display: scatterplots
- Display the relationship between two quantitative
variables.
- The values of one variable appear on the
horizontal axis (the x axis) and the values of
the other variable on the vertical axis (the y
axis).
- Each individual is the point in the plot fixed by
the values of both variables for that individual.
- In regression, we usually call the explanatory
variable x and the response variable y.
- Look for overall patterns and for striking
deviations from the pattern: interpreting
scatterplots
- Overall pattern: the relationship has ...
- form (linear relationships, curved relationships,
clusters)
- direction (positive/negative association)
- strength (how closely do the points follow a clear
form?)
- Outliers
- For a categorical x and quantitative y, show the
distributions of y for each category of x.
- When the overall pattern is quite regular, use a
compact mathematical model to describe it.
6Positive/negative association
- Two variables are positively associated when
above-average values of one tend to accompany
above-average values of the other and
below-average values also tend to occur together.
- Two variables are negatively associated when
above-average values of one tend to accompany
below-average values of the other and vice
versa.
7Association or Causation
8Add numerical summaries - the correlation
Straight-line (linear) relations are particularly
interesting. (correlation)
Our eyes are not good judges of how strong a
relationship is: they are affected by the plotting scales
and the amount of white space around the cloud of
points.
9Correlation
- The correlation r measures the direction and
strength of the linear relationship between two
quantitative variables.
- For the data for n individuals on variables x and
y,
$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$
- Calculation
- Begins by standardizing the observations.
- Standardized values have no units.
- r is an average of the products of the
standardized x and y values for the n
individuals.
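The calculation steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual convention of sample standard deviations and an average with divisor n - 1; the function name `correlation` and the data are made up for the example and are not from the lecture.

```python
import math

def correlation(xs, ys):
    """Correlation r: (n - 1)-averaged product of standardized x and y values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divisor n - 1, an assumed convention).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Standardized values are unit-free z-scores.
    z_x = [(x - mean_x) / s_x for x in xs]
    z_y = [(y - mean_y) / s_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Hypothetical data: ages in months, heights in centimeters.
ages = [18, 20, 22, 24, 26, 28]
heights = [76.0, 77.5, 78.4, 80.1, 81.0, 82.6]
r = correlation(ages, heights)
# Converting heights to inches should leave r unchanged.
r_inches = correlation(ages, [h / 2.54 for h in heights])
```

Because the standardized values carry no units, `r` and `r_inches` agree up to rounding error.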
10Properties of r
- Makes no use of the distinction between explanatory
and response variables.
- Requires that both variables be quantitative.
- Does not change when the units of measurement
are changed.
- r > 0 for a positive association and r < 0 for
a negative one.
- $-1 \le r \le 1$.
- A near-zero r indicates a weak linear relationship;
the strength of the relationship increases as r
moves away from 0 toward either -1 or 1.
- The extreme values r = -1 and r = 1 occur only when the
points lie exactly along a straight line.
- It measures the strength of only the linear
relationship.
- Scatterplots and correlations
- It is not so easy to guess the value of r from a
scatterplot.
11Various data and their correlations
12Cautions about correlation
- Correlation is not a complete description of
two-variable data.
- A high correlation means a stronger linear
relationship, not that the x and y values are similar.
- Summary: If a scatterplot shows a linear
relationship, we'd like to summarize the overall
pattern by drawing a line on the scatterplot.
- Use a compact mathematical model to describe it:
least-squares regression.
- A regression line
- It summarizes the relationship between two
variables, one explanatory and the other response.
- It is a straight line that describes how a
response variable y changes as an explanatory
variable x changes.
- Often used to predict the value of y for a
given value of x.
14Mean height of children against age
- Strong, positive, linear relationship. (r = 0.994)
15Fitting a line to data
- It means to draw a line that comes as close as
possible to the points.
- The equation of the line gives a compact
description of the dependence of the response
variable y on the explanatory variable x.
- A mathematical model for the straight-line
relationship.
- A straight line relating y to x has an equation
of the form $y = a + bx$, where b is the slope and a is the intercept.
16
- Height = 64.93 + (0.635 × Age)
- Predict the mean height of the children at 32, 0 and
240 months of age.
- Can we do extrapolation?
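As a quick check of the prediction exercise, the fitted line can be evaluated directly. This sketch assumes the garbled slide equation reads Height = 64.93 + 0.635 × Age with age in months; the function name `predicted_height` is made up for the illustration, and the height units are not stated on the slide.

```python
def predicted_height(age_months):
    """Fitted line read from the slide: Height = 64.93 + 0.635 * Age."""
    return 64.93 + 0.635 * age_months

for age in (32, 0, 240):
    # Ages far from the data used to fit the line are extrapolations.
    print(age, round(predicted_height(age), 2))
```

The prediction at 240 months (about 217 in the slide's height units) shows why extrapolation far beyond the observed ages deserves suspicion: a straight line keeps growing at the same rate, while children do not.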
17Prediction
- The accuracy of predictions from a regression
line depends on how much scatter the data show
around the line.
- Extrapolation is the use of a regression line for
prediction far outside the range of values of the
explanatory variable x that you used to obtain
the line.
- Such predictions are often not accurate.
18Which line??
19Least-squares regression
- We need a way to draw a regression line that does
not depend on our eyeball guess.
- We want a regression line that makes the
prediction errors as small as possible.
- The least-squares idea.
- The least-squares regression line of y on x is
the line that makes the sum of the squares of the
vertical distances of the data points from the
line as small as possible.
- Find a and b such that
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$
is the smallest. ($\hat{y}_i = a + b x_i$ is the predicted response for
the given $x_i$.)
21Equation of the LS regression line
- The equation of the least-squares regression line
of y on x is $\hat{y} = a + bx$, with slope $b = r \, s_y / s_x$
and intercept $a = \bar{y} - b\bar{x}$.
- Interpreting the regression line and its
properties
- A change of one standard deviation in x
corresponds to a change of r standard deviations
in y.
- It always passes through the point (x-bar, y-bar).
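A minimal sketch of the least-squares fit, assuming the standard formulas: slope b = r s_y / s_x and intercept a = y-bar - b x-bar. The function name `least_squares_line` and the data are hypothetical.

```python
import math

def least_squares_line(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)      # correlation
    s_x = math.sqrt(sxx / (n - 1))      # sample SD of x
    s_y = math.sqrt(syy / (n - 1))      # sample SD of y
    b = r * s_y / s_x                   # slope involves r
    a = mean_y - b * mean_x             # forces the line through (x-bar, y-bar)
    return a, b, r

# Hypothetical data lying close to a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.1, 6.9, 9.2, 10.8, 13.0]
a, b, r = least_squares_line(xs, ys)
```

Algebraically, r s_y / s_x equals the familiar ratio sum((x - x-bar)(y - y-bar)) / sum((x - x-bar)^2), so this gives the same line as minimizing the sum of squared vertical distances.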
22The height-age data
23Correlation and regression
- In regression, x and y play different roles.
- In correlation, they don't.
- Comparing the regression of y on x and x on y.
- The slope of the LS regression involves r.
- r^2 is the fraction of the variance of y that is
explained by the LS regression of y on x.
- If r = 0.7 or -0.7, r^2 = 0.49 and about half the
variation is accounted for by the linear
relationship.
- Quantify the success of regression in explaining
y. There are two sources of variation in y, one systematic
and the other random.
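The two sources of variation can be written out explicitly. The following is the standard decomposition for a least-squares line with an intercept, offered as a sketch in generic notation rather than the slide's own:

```latex
\sum_{i=1}^{n} (y_i - \bar{y})^2
  = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{systematic (explained by the line)}}
  + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{random (residual)}},
\qquad
r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.
```

With r = 0.7 the ratio is 0.49, so just under half of the variation in y is attributed to the straight-line relationship.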
24 25Scatterplot smoothers
- Systematic methods of extracting the overall
pattern.
- Help us see overall patterns.
- Reveal relationships that are not obvious from a
scatterplot alone.
26Categorical explanatory variable
- Make a side-by-side comparison of the
distributions of the response for each category.
- Back-to-back stemplots, side-by-side boxplots.
- If the categorical variable is not ordinal,
i.e. has no natural order, it is hard to speak of the
direction of the association.
27Regression
- Francis Galton (1822-1911) measured the heights
of about 1,000 fathers and sons.
- The following plot summarizes the data on sons'
heights.
- The curve on the histogram is a N(68.2, 2.6^2)
density curve.
28Data is often normally distributed
- The following table summarizes some aspects of
the data
- Quantiles
  100.0%  maximum   74.69
   90.0%            71.74
   75.0%  quartile  69.92
   50.0%  median    68.24
   25.0%  quartile  66.42
   10.0%            64.56
    0.0%  minimum   61.20
- Moments
  Mean 68.20   Std Dev 2.60   N 952
29Normal Quantile Plot
- A normal quantile plot provides a better way of
determining whether data is well fitted by a
normal distribution.
- How are these plots formed and interpreted?
- The plot for the Galton data on sons' heights
30Normal Quantile Plot
- The data points very nearly follow a straight
line on this plot.
- This verifies that the data is approximately
normally distributed.
- This is data from the population of all adult,
English, male heights.
- The fact that the sample is approximately normal
is a reflection of the fact that this population
of heights is normally distributed, or at least
approximately so.
- If the POPULATION is really normal, how close to
normal should the SAMPLE histogram be, and how
straight should the normal probability plot be?
- Empirical Cumulative Distribution Function
- Suppose that $x_1, x_2, \ldots, x_n$ is a batch of numbers
(the word sample is often used in the case that
the $x_i$ are independently and identically
distributed with some distribution function; the
word batch will imply no such commitment to a
stochastic model).
- The empirical cumulative distribution function
(ecdf) is defined as $F_n(x) = \frac{1}{n} \#\{x_i \le x\}$
(with this definition, $F_n$ is right-continuous).
31Empirical Cumulative Distribution Function
- If $X_1, \ldots, X_n$ is a random sample from F, the
random variables $I(X_i \le x)$ are independent
Bernoulli random variables.
- $nF_n(x)$ is a binomial random variable (n trials,
probability F(x) of success) and so
- $E[F_n(x)] = F(x)$, $\mathrm{Var}[F_n(x)] = \frac{1}{n} F(x)[1 - F(x)]$.
- $F_n(x)$ is an unbiased estimate of F(x) and has
maximum variance at that value of x such that
F(x) = 0.5, that is, at the median.
- As x becomes very large or very small, the
variance tends to zero.
- The Survival Function
- It is equivalent to a distribution function and
is defined as
- $S(t) = P(T > t) = 1 - F(t)$
- Here T is a random variable with cdf F.
- In applications where the data consist of times
until failure or death and are thus nonnegative,
it is often customary to work with the survival
function rather than the cumulative distribution
function, although the two give equivalent
information.
- Data of this type occur in medical and
reliability studies. In these cases, S(t) is
simply the probability that the lifetime will be
longer than t. We will be concerned with the
sample analogue of S, $S_n(t) = 1 - F_n(t)$.
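A minimal sketch of the ecdf and its survival analogue, assuming the definition F_n(x) = (number of observations ≤ x) / n. The helper names `make_ecdf` and `make_survival` and the lifetimes are made up for illustration.

```python
from bisect import bisect_right

def make_ecdf(batch):
    """Return F_n, where F_n(x) is the fraction of observations <= x."""
    data = sorted(batch)
    n = len(data)

    def ecdf(x):
        # bisect_right counts the observations <= x, so F_n is right-continuous.
        return bisect_right(data, x) / n

    return ecdf

def make_survival(batch):
    """Return S_n(t) = 1 - F_n(t), the fraction of observations > t."""
    ecdf = make_ecdf(batch)
    return lambda t: 1.0 - ecdf(t)

# Hypothetical lifetimes (note the tie at 3.5).
times = [2.0, 3.5, 3.5, 5.0, 8.0]
F = make_ecdf(times)
S = make_survival(times)
```

At a tied value the ecdf jumps by the number of ties divided by n, here by 2/5 at 3.5.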
32Quantile-Quantile Plots
- Q-Q plots are useful for comparing distribution
functions.
- If X is a continuous random variable with a
strictly increasing distribution function, F, the
pth quantile of F is defined to be that value
$x_p$ such that $F(x_p) = p$, or $x_p = F^{-1}(p)$.
- In a Q-Q plot, the quantiles of one distribution
are plotted against those of another.
- For two batches of equal size, a Q-Q plot is
constructed simply by plotting the points
$(X_{(i)}, Y_{(i)})$, the ordered values of the two batches.
- If the batches are of unequal size, an
interpolation process can be used.
- Suppose that one cdf (F) is a model for
observations (x) of a control group and another
cdf (G) is a model for observations (y) of a
group that has received some treatment.
- The simplest effect that the treatment could have
would be to increase the expected response of every member
of the treatment group by the same amount, say h
units.
- Both the weakest and the strongest individuals
would have their responses changed by h. Then
$y_p = x_p + h$, and the Q-Q plot would be a straight
line with slope 1 and intercept h.
33Quantile-Quantile Plots
- In the additive case, the cdfs are related as $G(y) = F(y - h)$.
- Another possible effect of a treatment would be
multiplicative: the response (such as lifetime or
strength) is multiplied by a constant, c.
- The quantiles would then be related as $y_p = c x_p$,
and the Q-Q plot would be a straight line with
slope c and intercept 0. The cdfs would be
related as $G(y) = F(y/c)$.
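The additive and multiplicative treatment effects can be checked numerically. This sketch pairs the order statistics of two equal-size batches and fits a line through the resulting Q-Q points; the helper names `qq_points` and `fit_line`, the control data, and the values of h and c are all made up for illustration.

```python
def qq_points(xs, ys):
    """Pair the order statistics (x_(i), y_(i)) of two equal-size batches."""
    if len(xs) != len(ys):
        raise ValueError("this sketch handles equal-size batches only")
    return list(zip(sorted(xs), sorted(ys)))

def fit_line(points):
    """Least-squares (slope, intercept) through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

control = [4.1, 5.3, 2.8, 6.0, 3.7, 4.9]       # hypothetical control responses
h, c = 1.5, 2.0                                 # hypothetical effect sizes (c > 0)
additive = [x + h for x in control]             # every response shifted by h
multiplicative = [c * x for x in control]       # every response scaled by c

slope_add, intercept_add = fit_line(qq_points(control, additive))
slope_mul, intercept_mul = fit_line(qq_points(control, multiplicative))
```

The additive batch gives slope 1 and intercept h, and the multiplicative batch gives slope c and intercept 0, up to rounding error.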
34Simulation
- Here is a histogram and probability plot for a
sample of size 1000 from a perfectly normal
population with mean 68 and SD 2.6.
Moments: Mean 67.92, Std Dev 2.60, N 1000
35Simulation
36Summary on parents heights
37Another Data Set
- R. A. Fisher (1890-1962) (who many claim was
the greatest statistician ever) analyzed a series
of measurements of Iris flowers in some of his
important developmental papers.
- Histogram of the sepal lengths of 50 Iris setosa
flowers
This data has mean 5.0 and S.D. 0.35. The curve is
the density of a N(5, 0.35^2) distribution.
38Normal Quantile Plot Sepal length
- Why are the dots on this plot arranged in neat
little rows?
- Apart from this, the data nicely follows a
straight line pattern on the plot.
Fitted normal: mean 5.006, S.D. 0.35249
39Fisher's Iris Data
- Array giving 4 measurements on 50 flowers from
each of 3 species of iris.
- Sepal length and width, and petal length and
width are measured in centimeters.
- Species are Setosa, Versicolor, and Virginica.
- SOURCE
- R. A. Fisher, "The Use of Multiple Measurements
in Taxonomic Problems", Annals of Eugenics, 7,
Part II, 1936, pp. 179-188. Republished by
permission of Cambridge University Press.
- The data were collected by Edgar Anderson, "The
irises of the Gaspe Peninsula", Bulletin of the
American Iris Society, 59, 1935, pp. 2-5.
40Not all real data is approximately normal
- Histogram and normal probability plot for the
salaries (in $1,000) of all major league baseball
players in 1987.
- Only position players (not pitchers) who were
on a major league roster for the entire season
are included.
Moments: Mean 529.7, S.D. 441.6, N 260
41Normal Quantile Plot
- This distribution is skewed to the right.
- How is this skewness reflected in the normal
quantile plot?
- Both the largest salaries and the smallest
salaries are much too large to match an ideal
normal pattern. (They can be called outliers.)
- This histogram looks something like an
exponential density. Further investigation
confirms a reasonable agreement with an
exponential density truncated below at 67.5.
42Judging whether a distribution is approximately
normal or not
- Personal incomes, survival times, etc. are usually
skewed and not normal.
- It is risky to assume that a distribution is normal
without actually inspecting the data.
- Stemplots and histograms are useful.
- A still more useful tool is the normal quantile
plot.
43Normal quantile plots
- Arrange the data in increasing order. Record
the percentile of each data value.
- Do normal distribution calculations to find the
z-scores at these same percentiles.
- Plot each data point x against the corresponding
z.
- If the data distribution is standard normal, the
points will lie close to the 45-degree line x = z.
- If it is close to any normal distribution, the
points will lie close to some straight line.
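The recipe above can be sketched with Python's standard library. The plotting position (i - 0.5)/n used for the percentile of the ith smallest value is one common convention and an assumption here (software packages differ); the function name `normal_quantile_points` is made up.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return (z, x) pairs for a normal quantile plot of the data."""
    xs = sorted(data)                      # step 1: arrange in increasing order
    n = len(xs)
    std_normal = NormalDist()              # standard normal, mean 0 and SD 1
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n                  # assumed plotting position (percentile)
        z = std_normal.inv_cdf(p)          # step 2: z-score at that percentile
        points.append((z, x))              # step 3: plot x against z
    return points

# Idealized "data": exact quantiles of N(68, 2.6^2), so the points
# fall on the straight line x = 68 + 2.6 z.
n = 20
ideal = [NormalDist(68, 2.6).inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
points = normal_quantile_points(ideal)
```

For data from any normal distribution the points follow a line whose intercept is roughly the mean and whose slope is roughly the standard deviation.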
44 45- Right-skewed distribution
47qqline (R-function)
- Plots a line through the points formed by the first
and third quartiles of the data and the corresponding
quantiles of the standard normal distribution.
- Provides a good straight line that helps us see
whether the points lie close to a straight line.
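For readers who want to see the idea outside R, here is a rough Python sketch of the same construction: a line through the (normal quantile, data quartile) points at 25% and 75%. It is an illustration of the idea, not a reimplementation of R's `qqline`; the function name `quartile_line` is made up, and the "inclusive" quartile method is an assumption about which quartile definition to use.

```python
from statistics import NormalDist, quantiles

def quartile_line(data):
    """Slope and intercept of the line through the first- and third-quartile points."""
    # Data quartiles; method="inclusive" is an assumed quartile definition.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    # Matching standard normal quantiles.
    z1 = NormalDist().inv_cdf(0.25)
    z3 = NormalDist().inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    intercept = q1 - slope * z1
    return slope, intercept

slope, intercept = quartile_line(list(range(1, 10)))  # hypothetical data 1..9
```

Using quartiles rather than a least-squares fit keeps the reference line from being pulled around by outliers in the tails.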
49 50Simulations
51Summary
- Density curves: relative frequencies.
- The mean ($\mu$), median, quantiles, standard
deviation ($\sigma$).
- The normal distributions $N(\mu, \sigma^2)$.
- Standardizing: z-score, $z = (x - \mu)/\sigma$.
- 68-95-99.7 rule; standard normal distribution and
table.
- Normal quantile plots/lines.
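The standardizing formula and the 68-95-99.7 rule can be checked with the standard library. This is a small sketch; the function name `z_score` is made up, and the example height of 70.8 is hypothetical, chosen to sit one standard deviation above the mean of the N(68.2, 2.6^2) curve quoted for the Galton data.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: number of standard deviations above the mean."""
    return (x - mu) / sigma

std_normal = NormalDist()  # standard normal distribution
# Probability of falling within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

# Hypothetical height, one SD above the mean of N(68.2, 2.6^2).
z = z_score(70.8, 68.2, 2.6)
```

The three probabilities come out near 0.68, 0.95 and 0.997, which is where the rule gets its name.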
52Another non-normal pattern
- The data here is the number of runs scored in the
1986 season by each of the players in the above
data set.
Moments: Mean 55.33, S.D. 25.02, N 261
Note: n = 261 here, but in the preceding data
n = 260. The discrepancy results from the fact
that one player in the data set has a missing
salary figure.
53Gamma Quantile Plot
- This data is fairly well fit by a gamma density
with parameters α = 4.55 and λ = 12.16. (How do
we find those two numbers?)
- What is the gamma density curve?
- How do we make a quantile plot to check agreement
with a gamma density?
- The data points form a fairly straight line on
this plot; hence there is reasonable agreement
between the data and a theoretical Γ(4.55, 12.16)
distribution.
54Methods of Estimation
- Basic approach to parameter estimation
- The observed data will be regarded as realizations
of random variables $X_1, X_2, \ldots, X_n$, whose joint
distribution depends on an unknown parameter $\theta$.
- $\theta$ may be a vector, such as $(\alpha, \lambda)$ in the gamma
density function.
- Often the $X_i$ can be modeled as independent random
variables all having the same distribution
$f(x \mid \theta)$, in which case their joint distribution is
$f(x_1 \mid \theta) f(x_2 \mid \theta) \cdots f(x_n \mid \theta)$.
- Refer to such $X_i$ as independent and identically
distributed, or i.i.d.
- An estimate of $\theta$ will be a function of $X_1, X_2, \ldots, X_n$
and will hence be a random variable with a
probability distribution called its sampling
distribution.
- We will use approximations to the sampling
distribution to assess the variability of the
estimate, most frequently through its standard
deviation, which is commonly called its standard
error.
- The Method of Moments
- The Method of Maximum Likelihood
55The Method of Moments
- The kth moment of a probability law is defined
as $\mu_k = E(X^k)$.
- Here X is a random variable following that
probability law (of course, this is defined only
if the expectation exists).
- $\mu_k$ is a function of $\theta$ when the $X_i$ have the
distribution $f(x \mid \theta)$.
- If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables
from that distribution, the kth sample moment is
defined as $\hat{\mu}_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k$.
- According to the law of large numbers, the
sample moment $\hat{\mu}_k$ converges to the
population moment $\mu_k$ in probability.
- If the functions relating the estimates to the sample moments
are continuous, the estimates will converge to
the parameters as the sample moments converge to
the population moments.
- The method of moments estimates parameters by
finding expressions for them in terms of the
lowest possible order moments and then
substituting sample moments into the
expressions.
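As an illustration, here is a method-of-moments sketch for a gamma density. It assumes a shape/scale parametrization (mean = shape × scale, variance = shape × scale^2), which is an assumption about the slides' convention; the function names are made up. The summary figures 55.33 and 25.02 are the mean and S.D. reported for the runs data.

```python
def gamma_method_of_moments(sample):
    """Method-of-moments (shape, scale) estimates from raw data."""
    n = len(sample)
    m1 = sum(sample) / n                     # first sample moment
    m2 = sum(x * x for x in sample) / n      # second sample moment
    variance = m2 - m1 ** 2                  # moment-based variance (divisor n)
    # Solve mean = shape * scale and variance = shape * scale^2.
    return m1 ** 2 / variance, variance / m1

def gamma_mom_from_summary(mean, sd):
    """Same estimates from a reported mean and standard deviation."""
    return mean ** 2 / sd ** 2, sd ** 2 / mean

# Summary statistics reported for the runs data (treating the S.D. as moment-based).
shape, scale = gamma_mom_from_summary(55.33, 25.02)
```

This gives a shape near 4.89 and a scale near 11.31, close to but not equal to the 4.55 and 12.16 quoted for the gamma fit; the slides do not say how those values were obtained, so they may come from a different estimation method.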
56The Method of Maximum Likelihood
- It can be applied to a great variety of other
statistical problems, such as regression, for
example. This general utility is one of the major
reasons for the importance of likelihood methods
in statistics.
- The maximum likelihood estimate (mle) of $\theta$ is
that value of $\theta$ that maximizes the likelihood, that
is, makes the observed data most probable or
most likely.
- Rather than maximizing the likelihood itself, it
is usually easier to maximize its natural
logarithm (which is equivalent since the
logarithm is a monotonic function).
- For an i.i.d. sample, the log likelihood is
$l(\theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$.
- The large sample distribution of a maximum
likelihood estimate is approximately normal with
mean $\theta_0$ and variance $1/[n I(\theta_0)]$, where $\theta_0$ is
the true parameter value and $I$ is the Fisher information.
- This is merely a limiting result, which holds as
the sample size tends to infinity. We say that
the mle is asymptotically unbiased and refer to
the variance of the limiting normal distribution
as the asymptotic variance of the mle.
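A short worked example may help. The exponential density $f(x \mid \lambda) = \lambda e^{-\lambda x}$ is chosen here purely for illustration (it is not worked in the slides); the steps are the standard ones:

```latex
\begin{aligned}
l(\lambda) &= \sum_{i=1}^{n} \log\!\left(\lambda e^{-\lambda X_i}\right)
            = n \log \lambda - \lambda \sum_{i=1}^{n} X_i, \\
l'(\lambda) &= \frac{n}{\lambda} - \sum_{i=1}^{n} X_i = 0
  \quad\Longrightarrow\quad \hat{\lambda} = \frac{1}{\bar{X}}, \\
I(\lambda) &= -E\!\left[\frac{\partial^2}{\partial \lambda^2} \log f(X \mid \lambda)\right]
            = \frac{1}{\lambda^2}
  \quad\Longrightarrow\quad \text{asymptotic variance } \frac{1}{n I(\lambda_0)} = \frac{\lambda_0^2}{n}.
\end{aligned}
```

Since $l''(\lambda) = -n/\lambda^2 < 0$, the stationary point is a maximum.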
57QQplot
- x <- qgamma(seq(.001, .999, len = 100), 1.5)  # compute a vector of quantiles
- plot(x, dgamma(x, 1.5), type = "l")  # density plot for shape 1.5
- QQplots are used to assess
- whether data have a particular distribution, or
- whether two datasets have the same distribution.
- If the distributions are the same, then the
QQplot will be approximately a straight line.
- The extreme points have more variability than
points toward the center.
- A plot with a "U" shape means that one
distribution is skewed relative to the other.
- An "S" shape implies that one distribution has
longer tails than the other.
- In the default configuration a plot from qqnorm
that is bent down on the left and bent up on the
right means that the data have longer tails than
the Gaussian.
- plot(qlnorm(ppoints(y)), sort(y))  # log normal qqplot