Title: Research Methods of Applied Linguistics and Statistics (11)
1Research Methods of Applied Linguistics and
Statistics (11)
- Correlation and multiple regression
- By Qin Xiaoqing
2Pearson Correlation
- The Pearson correlation allows us to establish
the strength of relationships between continuous
variables.
- To show the relationship, the first step is to
draw a scatterplot or scattergram, which can
help us to obtain a preliminary understanding of
this relationship.
- The scatterplot can be described in terms of
direction, strength and linearity.
3Correlation and SPSS
- Pearson product-moment coefficient is designed
for interval level (continuous) variables. It can
also be used if you have one continuous variable
(e.g., scores on a measure of self-esteem) and
one dichotomous variable (e.g., sex M/F).
- Spearman rank order correlation is designed for
use with ordinal level or ranked data.
- SPSS will calculate two types of correlation.
First, it will give a simple bivariate
correlation (which just means between two
variables), also known as zero-order correlation.
SPSS will also explore the relationship between
two variables, while controlling for another
variable. This is known as partial correlation.
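To make "controlling for another variable" concrete, a first-order partial correlation can be computed from the three zero-order correlations. The sketch below uses the standard textbook formula rather than SPSS itself; the function name and the example values are assumptions made up for illustration.

```python
import math

def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation between X and Y, controlling for Z.

    Standard formula:
    (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator

# Hypothetical zero-order correlations, not taken from the slides.
r_partial = partial_correlation(0.60, 0.50, 0.40)
print(round(r_partial, 3))
```

If Z is unrelated to both X and Y (both correlations with Z are zero), the partial correlation equals the zero-order correlation.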
4Direction
- Positive relationships represent relationships in
which an increase in one variable is associated
with an increase in a second.
- Negative relationships represent relationships in
which an increase in one variable is associated
with a decrease in a second.
5Strength
- Strong relationships appear as those in which the
dots are very close to a straight line.
- Weak relationships appear as those in which the
dots are more scattered about a straight line, or
farther away from that line.
6Linearity
- Linear relationships are indicated when the
pattern of dots on the scatter diagram appears to
be straight, or if the points could be
represented by drawing a straight line through
them.
7Steps for computation
- List the scores for each subject (S) in parallel
columns on a data sheet.
- Square each score and enter these values in the
columns labeled X² and Y².
- Multiply each S's two scores and enter this value
in the XY column.
- Add the values in each column.
- Insert the totals in the formula for the
correlation coefficient.
8Example
S      X    Y    X²    Y²    XY
1      12   8    144   64    96
2      10   12   100   144   120
3      11   5    121   25    55
4      9    8    81    64    72
5      8    4    64    16    32
6      7    13   49    169   91
7      7    7    49    49    49
8      5    3    25    9     15
9      4    8    16    64    32
10     3    5    9     25    15
Total  76   73   658   629   577
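The column arithmetic in this table can be checked with a few lines of Python. This is only a sketch of the hand computation described in the steps; the variable names are chosen here for illustration.

```python
# Scores from the example table (X and Y for subjects 1-10).
x = [12, 10, 11, 9, 8, 7, 7, 5, 4, 3]
y = [8, 12, 5, 8, 4, 13, 7, 3, 8, 5]

# Build the derived columns: squares and cross-products.
x_sq = [xi ** 2 for xi in x]
y_sq = [yi ** 2 for yi in y]
xy = [xi * yi for xi, yi in zip(x, y)]

# Column totals, in the same order as the table's "Total" row.
totals = (sum(x), sum(y), sum(x_sq), sum(y_sq), sum(xy))
print(totals)  # (76, 73, 658, 629, 577)
```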
9Scatterplot
10Interpretation of scatterplot
- Checking for outliers
- Inspecting the distribution of data points
- Are the data points spread all over the place?
This suggests a very low correlation.
- Are all the points neatly arranged in a narrow
cigar shape? This suggests quite a strong
correlation.
- Could you draw a straight line through the main
cluster of points, or would a curved line better
represent the points? If a curved line is evident
(suggesting a curvilinear relationship), then
Pearson correlation should not be used.
- What is the shape of the cluster? Is it even from
one end to the other? Or does it start off narrow
and then get fatter? If this is the case, the
data may be violating the assumption of
homogeneity of variance.
- Determining the direction of the relationship
between the variables
11Formula of r for raw score
X Y X2 Y2 XY
Total 76 73 658 629 577
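The formula itself did not survive on this slide. Assuming it is the usual raw-score formula for Pearson's r (an assumption, since only the totals remain), the computation from those totals can be sketched as follows; the function name is illustrative.

```python
import math

def pearson_r_from_totals(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson r from column totals (standard raw-score formula):

    r = (N*ΣXY - ΣX*ΣY) / sqrt((N*ΣX² - (ΣX)²) * (N*ΣY² - (ΣY)²))
    """
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Totals from the table; N = 10 subjects.
r = pearson_r_from_totals(10, 76, 73, 658, 629, 577)
print(round(r, 2))  # 0.25
```

For these ten subjects the relationship between X and Y is therefore positive but weak.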
12Assumptions underlying Pearson correlation
- The data are measured as scores or ordinal scales
that are truly continuous.
- The scores on the two variables, X and Y, are
independent.
- The data should be normally distributed through
their range.
- The relationship between X and Y must be linear.
13Interpreting the correlation coefficient
- When r = .60, the variance overlap between the 2
measures is .36 (r² = .60² = .36).
- The overlap tells us that the 2 measures provide
similar information. In other words, the magnitude
of r² indicates the amount of variance in X which
is accounted for by Y, or vice versa.
14Correlation coefficient
- If you hope 2 tests measure basically the same
thing, .71 isn't very strong; .80 or .90 may be
desirable.
- A correlation of .30 or lower may appear weak,
but in educational research such a correlation
might be very important.
- Significance level: p < .05, .01; df = N − 2
15
- r = .10 to .29 or r = −.10 to −.29: small
- r = .30 to .49 or r = −.30 to −.49: medium
- r = .50 to 1.0 or r = −.50 to −1.0: large
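The guidelines above, together with the shared-variance reading of r² and the df = N − 2 significance test, can be sketched in Python. The function names are illustrative, the "below small" label is an assumption (the slides give no name for |r| under .10), and the t formula is the standard one for testing r against zero rather than something stated on the slides.

```python
import math

def shared_variance(r: float) -> float:
    """Proportion of variance the two measures share (r squared)."""
    return r ** 2

def strength_label(r: float) -> str:
    """Label the size of r using the small/medium/large bands."""
    size = abs(r)  # the sign gives direction, not strength
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "below small"  # assumed label; not named on the slides

def t_for_r(r: float, n: int) -> float:
    """t statistic for testing r against zero, with df = N - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(strength_label(-0.55), round(shared_variance(0.60), 2))
```

The resulting t value is compared with the critical value for N − 2 degrees of freedom at the chosen significance level.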
16Presenting the results from correlation
17Comparing the correlation coefficients for two
groups
- Sometimes when doing correlational research you
may want to compare the strength of the
correlation coefficients for two separate groups.
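The slide does not say how the comparison is made. One common approach, assumed here, converts each r to a Fisher z value and tests the difference; with a two-tailed .05 level the difference is treated as significant when the observed z falls outside ±1.96. The function name and the example figures are hypothetical.

```python
import math

def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two independent correlations."""
    z1 = math.atanh(r1)  # Fisher r-to-z transformation
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / standard_error

# Hypothetical groups: r = .50 (N = 103) versus r = .30 (N = 103).
z_obs = compare_correlations(0.50, 103, 0.30, 103)
significant = abs(z_obs) > 1.96
print(round(z_obs, 2), significant)
```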
18Factors affecting correlation
- If you have a restricted range of scores on
either of the variables, this will reduce the
value of r, e.g., age (18-20) and success on an
exam.
- The existence of scores with extreme outliers in
the data.
- The presence of extremely high and extremely low
scores on a variable with little in the middle.
- Reliability of the data.
- Non-linear relationship. Always check the
scatterplot, particularly if you obtain low
values of r.
19Correlation versus causality
- Correlation provides an indication that there is
a relationship between two variables. It does not,
however, indicate that one variable causes the
other. The correlation between two variables (A
and B) could be due to the fact that A causes B,
that B causes A, or (just to complicate matters)
that an additional variable (C) causes both A and
B. The possibility of a third variable that
influences both of your observed variables should
always be considered.
20Statistical vs practical significance
- Don't get too excited if your correlation
coefficients are significant. With large
samples, even quite small correlation
coefficients can reach statistical significance.
Although statistically significant, the practical
significance of a correlation of .2 is very
limited. You should focus on the actual size of
Pearson's r and the amount of shared variance
between the two variables. To interpret the
strength of your correlation coefficient you
should also take into account other research that
has been conducted in your particular topic area.
If other researchers in your area have only been
able to predict 9 per cent of the variance (a
correlation of .3) in a particular outcome (e.g.,
anxiety), then your study that explains 25 per
cent would be impressive in comparison. In other
topic areas, 25 per cent of the variance
explained may seem small and irrelevant.
21Linear regression / Multiple regression
22Understanding regression
- Regression is a way of predicting performance on
the dependent variable via one or more
independent variables.
- In simple regression, we predict scores on one
variable on the basis of scores on a second.
- In multiple regression, we expand the possible
sources of prediction and test to see which of
many variables and which combination of variables
allow us to make the best prediction.
23Linear regression
- Regression and correlation are related
procedures. The correlation coefficient is
central to simple linear regression. While we
can't make causal claims on the basis of
correlation, we can use correlation to predict
one variable from another.
- We can't just throw variables into a multiple
regression and hope that, magically, answers will
appear.
- We should have a sound theoretical or conceptual
reason for the analysis and, in particular, the
order of variables entering the equation.
24Uses of multiple regression
- how well a set of variables is able to predict a
particular outcome
- which variable in a set of variables is the best
predictor of an outcome and
- whether a particular predictor variable is still
able to predict an outcome when the effects of
another variable are controlled for.
25Assumptions of multiple regression
- Sample size
- Stevens (1996) recommends that for social
science research, about 15 subjects per predictor
are needed for a reliable equation.
- Tabachnick and Fidell (1996, p. 132) give a
formula for calculating sample size requirements,
taking into account the number of independent
variables that you wish to use: N > 50 + 8m
(where m = number of independent variables). If
you have five independent variables you would
need 90 cases.
- More cases are needed if the dependent variable
is skewed.
- For stepwise regression there should be a ratio
of forty cases for every independent variable.
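The three sample-size rules of thumb above reduce to simple arithmetic. The sketch below just encodes them; the function names are made up for illustration.

```python
def stevens_cases(m: int) -> int:
    """About 15 subjects per predictor (Stevens, 1996)."""
    return 15 * m

def tabachnick_fidell_cases(m: int) -> int:
    """The 50 + 8m figure that N is compared against."""
    return 50 + 8 * m

def stepwise_cases(m: int) -> int:
    """Forty cases per independent variable for stepwise regression."""
    return 40 * m

m = 5  # five independent variables, as in the slide's example
print(stevens_cases(m), tabachnick_fidell_cases(m), stepwise_cases(m))
# 75 90 200
```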
26
- Multicollinearity. It exists when the independent
variables are highly correlated (r = .9 and above).
Multiple regression doesn't like
multicollinearity, and it certainly doesn't
contribute to a good regression model, so always
check for this problem before you start.
- Outliers. Multiple regression is very sensitive
to outliers (very high or very low scores).
- Normality, linearity
27MLAT and language learning
The closer r is to ±1, the smaller the error will
be in predicting performance on one variable from
that on the second. The smaller r is, the greater
the error.
28Predicting scores using regression
- Four pieces of information are needed:
- the mean for scores on one variable,
- the mean for scores on the second variable,
- the S's score on X, and
- the slope of the best-fitting straight line of
the joint distribution.
- With this information, we can predict the S's
score on Y from X on a mathematical basis. By
regressing Y on X, predicting Y from X will be
possible.
29Regression line
- Lines drawn from each data point to the straight
line in the scatterplot show the amount of error.
Suppose we square each of these errors and then
find the mean of these squared errors. The
best-fitting straight line is called the regression
line and is technically defined as the line that
results in the smallest mean of the squared
errors.
- We can think of the regression line as being
that which is closest to all the dots but, more
precisely, it is the one that results in a mean
of the squared errors that is less than any other
line we might produce.
30Determining the slope
- Convert the MLAT and language learning scores to
z scores for comparability.
- Then plot the intersection of each S's z score on
the MLAT and on the test. As the z scores on the
MLAT increase, they form a run, the horizontal
line of a triangle. At the same time, the z
scores on the test increase to form a rise, the
vertical line.
- The slope (b) of the regression line is shown as
we connect these 2 lines to form the third side
of the triangle.
31Regression coefficient with known r and SD
- In the diagram, an increase of, say, 6 units on
the run (MLAT) would equal 2 units of increase on
the rise.
- The slope is the rise divided by the run. The
result is a fraction. That fraction is the
correlation coefficient.
- The correlation coefficient is the same as the
slope of the best-fitting line in a z-score
scatterplot. In the triangle, the slope of the
regression line was 2/6, and so r for the two is
.33. Suppose the SDs are 8 and 10 respectively for
Y and X.
- To obtain the slope for the raw scores, we
multiply the correlation coefficient by the
standard deviation of Y over the standard
deviation of X: b = .33 × 8/10 = .264.
32Regression coefficient with raw data
- With r and SD, it is very easy to find the slope.
With raw data, the formula for slope follows
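Both routes to the slope can be sketched in Python. The raw-data formula is missing from the slide, so the version below assumes the standard least-squares formula built from the same column totals used for r; the function names are illustrative, and the raw-data example reuses the totals from the earlier ten-subject table.

```python
def slope_from_r_and_sd(r: float, sd_y: float, sd_x: float) -> float:
    """b = r * (SD of Y / SD of X)."""
    return r * sd_y / sd_x

def slope_from_totals(n, sum_x, sum_y, sum_x2, sum_xy) -> float:
    """b = (N*ΣXY - ΣX*ΣY) / (N*ΣX² - (ΣX)²)  (assumed standard formula)."""
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

def intercept_from_totals(n, sum_x, sum_y, b) -> float:
    """a = mean of Y - b * mean of X."""
    return sum_y / n - b * sum_x / n

# r = .33, SD of Y = 8, SD of X = 10 (values from the slide).
b_from_r = slope_from_r_and_sd(0.33, 8, 10)

# Totals from the ten-subject example table.
b_raw = slope_from_totals(10, 76, 73, 658, 577)
a_raw = intercept_from_totals(10, 76, 73, b_raw)

print(round(b_from_r, 3))  # 0.264
print(round(b_raw, 3))     # 0.276
```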
33Example using TSE to predict TOEFL
- Mean on TOEFL = 540, SD = 40. Mean on TSE = 30,
SD = 4. r = .80, b = 8.0
- A student achieved 36 on the TSE, 6 higher than
the mean. Multiplying that by the slope, we get
8 × 6 = 48. So our prediction of TOEFL is mean Y
(540) + 48 = 588. The formula follows:
predicted Y = mean Y + b × (X − mean X)
- Another regression equation is
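The TSE-to-TOEFL prediction can be reproduced in a few lines. The second equation is missing from the slide; the sketch assumes it is the usual intercept form, predicted Y = a + bX with a = mean Y − b × mean X, which is algebraically the same prediction. Function names are illustrative.

```python
def predict_from_means(x, mean_x, mean_y, b):
    """Predicted Y = mean Y + b * (X - mean X)."""
    return mean_y + b * (x - mean_x)

def predict_from_intercept(x, a, b):
    """Predicted Y = a + b * X (assumed form of the second equation)."""
    return a + b * x

b = 0.80 * 40 / 4        # slope: r * SD_Y / SD_X = 8.0
a = 540 - b * 30         # intercept: mean Y - b * mean X = 300.0

print(predict_from_means(36, 30, 540, b))   # 588.0
print(predict_from_intercept(36, a, b))     # 588.0
```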
34Standard error of estimate
- There is some overlap in the variance of the two
variables. When we square the value of r, we find
the degree of shared variance.
- Of the original 100% of the variance, with an
r = .50, we have accurately accounted for 25% of
the variance using the straight line as the basis
for prediction. The error variance now is reduced
to 75%.
- In regression, the standard error of estimate
(SEE) shows the dispersion of scores away from the
straight line. If all the data are tightly
clustered on the line, little error is made in
prediction.
- SEE tells us how much error is likely to occur in
prediction.
35Error variance
- To compute SEE, we need to know the error
variance, which is the sum of the squared
differences between actual and predicted scores,
divided by N − 2.
- The square root of this variance is referred to
as the SEE (1.35 in this example).
Mean for X = 8, SD = 4.47; mean for Y = 10.8,
SD = 2.96; r = .89
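The SEE computation can be sketched as follows. The slide does not list its raw scores, so the first example uses made-up data; the second function is a common shortcut (SD of Y times the square root of 1 − r²), which is an assumption not stated on the slide but which reproduces the 1.35 figure from the summary values given.

```python
import math

def standard_error_of_estimate(x, y):
    """SEE = sqrt(sum((Y - predicted Y)^2) / (N - 2)) for Y regressed on X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Least-squares slope and intercept.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Error variance: squared residuals divided by N - 2.
    ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(ss_error / (n - 2))

def see_from_r(sd_y: float, r: float) -> float:
    """Shortcut approximation: SEE is about SD_Y * sqrt(1 - r^2)."""
    return sd_y * math.sqrt(1 - r ** 2)

# Hypothetical scores, used only to exercise the function.
print(round(standard_error_of_estimate([1, 2, 3, 4], [2, 4, 5, 8]), 3))

# Summary values from the slide: SD of Y = 2.96, r = .89.
print(round(see_from_r(2.96, 0.89), 2))  # 1.35
```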
36Confidence interval
- 68% confidence interval: ±1 SEE (e.g., 1.35); 68%
of actual Y scores would fall within ±1.35 of
the predicted Y score.
- 95% confidence interval: ±1.96 SEE
- 99% confidence interval: ±2.58 SEE
- Suppose the estimated score is 11.98, then
- 95% confidence interval: between 9.33
(11.98 − 1.35 × 1.96) and 14.63 (11.98 + 1.35 × 1.96)
- 99% confidence interval?
- 8.50 (11.98 − 3.48) to 15.46 (11.98 + 3.48)
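The interval arithmetic above is easy to reproduce. This is a minimal sketch using the slide's own figures (predicted score 11.98, SEE 1.35); the function name is illustrative.

```python
def interval_around_prediction(predicted: float, see: float, z: float):
    """Return (lower, upper) bounds: predicted score plus or minus z * SEE."""
    margin = z * see
    return predicted - margin, predicted + margin

low95, high95 = interval_around_prediction(11.98, 1.35, 1.96)
low99, high99 = interval_around_prediction(11.98, 1.35, 2.58)

print(round(low95, 2), round(high95, 2))  # 9.33 14.63
print(round(low99, 2), round(high99, 2))  # 8.5 15.46
```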
37Estimated L2 scores predicted from class hours
38Goodness of fit for regression model R2
- R², also called the squared multiple correlation
or the coefficient of multiple determination, is
the percent of the variance in the dependent
explained uniquely or jointly by the
independents.
- Adjusted R² is an adjustment for the fact that
when one has a large number of independents, it
is possible that R² will become artificially high
simply because some independents' chance
variations "explain" small parts of the variance
of the dependent.
- The greater the number of independents, the more
the researcher is expected to report the adjusted
coefficient.
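The slide does not give the adjustment itself. A minimal sketch, assuming the standard formula adjusted R² = 1 − (1 − R²)(N − 1)/(N − k − 1), where k is the number of independents; the example values are hypothetical.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R² for a model with k independents fitted to n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical model: R² = .50 with 5 independents.
print(round(adjusted_r_squared(0.50, 50, 5), 3))    # 50 cases
print(round(adjusted_r_squared(0.50, 500, 5), 3))   # 500 cases
```

With the same R² and the same number of independents, the adjustment shrinks as the sample grows, so the penalty matters most in small samples with many predictors.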
39T-test
- t-tests are used to assess the significance of
individual b coefficients, specifically testing
the null hypothesis that the regression
coefficient is zero.
40F test
- The F test is used to test the significance of R,
which is the same as testing the significance of
R², which is the same as testing the significance
of the regression model as a whole.
- If prob(F) < .05, then the model is considered
significantly better than would be expected by
chance and we reject the null hypothesis of no
linear relationship of y to the independents.
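For reference, the overall F statistic can be computed directly from R². The sketch assumes the standard formula F = (R²/k) / ((1 − R²)/(N − k − 1)) with k and N − k − 1 degrees of freedom; the probability itself still has to come from an F table or from the statistics package. The example values are hypothetical.

```python
def f_statistic(r2: float, n: int, k: int) -> float:
    """Overall F for a regression with k independents and n cases."""
    explained = r2 / k                   # variance explained per independent
    unexplained = (1 - r2) / (n - k - 1) # error variance per residual df
    return explained / unexplained

# Hypothetical model: R² = .50, 50 cases, 5 independents.
f_value = f_statistic(0.50, 50, 5)
print(round(f_value, 1), "with df =", 5, "and", 50 - 5 - 1)
```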
41Multicollinearity
- Multicollinearity is the intercorrelation of
independent variables. R² values near 1 violate the
assumption of no perfect collinearity, while high
R² values increase the standard error of the beta
coefficients and make assessment of the unique
role of each independent difficult or impossible.
42tolerance or VIF
- To assess multivariate multicollinearity, one
uses tolerance or VIF (variance inflation factor),
which build in the regressing of each independent
on all the others.
- As a rule of thumb, if tolerance is less than
.20, a problem with multicollinearity is
indicated.
- When tolerance is close to 0 there is high
multicollinearity of that variable with other
independents and the b and beta coefficients will
be unstable.
- The more the multicollinearity, the lower the
tolerance and the larger the standard error of the
regression coefficients.
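The slides do not define the two statistics. Under the standard definitions (assumed here), tolerance is 1 minus the R² obtained when one independent is regressed on all the others, and VIF is its reciprocal. The function names are illustrative; the example uses two independents correlated at r = .9, for which that R² is simply .9² = .81.

```python
def tolerance(r2_with_others: float) -> float:
    """Tolerance = 1 - R² from regressing this independent on the others."""
    return 1 - r2_with_others

def vif(r2_with_others: float) -> float:
    """Variance inflation factor = 1 / tolerance."""
    return 1 / tolerance(r2_with_others)

def multicollinearity_flag(r2_with_others: float, cutoff: float = 0.20) -> bool:
    """True when tolerance falls below the rule-of-thumb cutoff of .20."""
    return tolerance(r2_with_others) < cutoff

r2 = 0.9 ** 2  # two independents correlated at r = .9
print(round(tolerance(r2), 2), round(vif(r2), 2), multicollinearity_flag(r2))
```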
43Methods for selecting predictor variables
Forward selection
- This method starts with a model containing none
of the explanatory variables. In the first step,
the procedure considers variables one by one for
inclusion and selects the variable that results
in the largest increase in R². In the second
step, the procedure considers variables for
inclusion in a model that only contains the
variable selected in the first step. In each
step, the variable with the largest increase in
R² is selected until, according to an F-test,
further additions are judged to not improve the
model.
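The forward selection procedure can be sketched in plain Python. This is an illustration under stated assumptions, not SPSS's implementation: the stopping rule is a fixed F-to-enter threshold (`f_to_enter=4.0` is a hypothetical default, whereas packages typically use a probability criterion), and all function names are made up here.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(columns, y):
    """R² of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    mean_y = sum(y) / n
    ss_res = sum((y[i] - sum(bj * v for bj, v in zip(beta, design[i]))) ** 2
                 for i in range(n))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot

def forward_selection(predictors, y, f_to_enter=4.0):
    """Add the variable giving the largest R² increase until F is too small."""
    n = len(y)
    selected, current_r2 = [], 0.0
    remaining = list(predictors)
    while remaining:
        # Try each remaining variable in turn alongside those already chosen.
        trials = {name: r_squared([predictors[s] for s in selected + [name]], y)
                  for name in remaining}
        best = max(trials, key=trials.get)
        new_r2 = trials[best]
        df_resid = n - len(selected) - 2  # n - (number of predictors) - 1
        if df_resid <= 0:
            break
        # F for the change in R² when one variable is added.
        f_change = (new_r2 - current_r2) / (max(1.0 - new_r2, 1e-12) / df_resid)
        if f_change < f_to_enter:
            break
        selected.append(best)
        remaining.remove(best)
        current_r2 = new_r2
    return selected, current_r2

# Hypothetical data: y depends on x1; x2 adds nothing beyond x1.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
y = [2.5, 3.5, 5.5, 8.5, 10.5, 11.5, 13.5, 16.5]
selected, r2 = forward_selection({"x1": x1, "x2": x2}, y)
print(selected, round(r2, 3))
```

Backward and stepwise selection follow the same pattern, removing (or adding and then re-checking) one variable at a time against the same kind of F criterion.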
44Backward selection
- This method starts with a model containing all
the variables and eliminates variables one by
one, at each step choosing the variable for
exclusion as that leading to the smallest
decrease in R². Again, the procedure is repeated
until, according to an F-test, further exclusions
would represent a deterioration of the model.
45Stepwise selection
- This method is, essentially, a combination of the
previous two approaches. Starting with no
variables in the model, variables are added as
with the forward selection method. In addition,
after each inclusion step, a backward elimination
process is carried out to remove variables that
are no longer judged to improve the model.
46Interpretation of the results from multiple
regression
- Checking the assumptions
- Evaluating the model
- Evaluating each of the independent variables
47Presenting the results of multiple regression
- It would be a good idea to look for examples of
the presentation of different statistical
analyses in the journals relevant to your topic
area. Different journals have different
requirements and expectations. Given the severe
space limitations in journals these days, often
only a brief summary of the results is presented
and readers are encouraged to contact the author
for a copy of the full results.