Title: Correlation
1Correlation
- Forensic Statistics CIS205
2Introduction
- Chi-squared shows the strength of relationship
between variables when the data is of count form
- However, many variables measured in a lab are on
a continuous scale, such as concentrations of
chemicals, time, and most machine responses
- The term for the strength of the relation between
continuous variables is correlation
- Any continuous variables which have some sort of
systematic relationship are said to covary, and
any variable which covaries with another is said
to be a covariate.
- A basic tool for the investigation of correlation
is the scatterplot. Usually only two variables
are plotted, but three can be accommodated.
3Correlation Coefficient
- A statistical measure of correlation is called
the correlation coefficient, which can only take
on values between -1 and 1.
- Both 1 and -1 mean that the variables are
perfectly related
- 1 means that as one variable increases, so does
the other
- -1 means that as one variable increases, the
other decreases.
- 0 means that the variables are linearly unrelated.
- The strength of relationship is independent of
the form of relationship. Most commonly
relationships are linear (plotting one variable
against another yields a straight line), next
most commonly loglinear (a graph of one variable
against the logarithm of the other is linear).
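The end points of the scale can be illustrated with a short Python sketch. It assumes the usual product-moment form of the linear correlation coefficient; the function name `pearson_r` and the small data sets are made up for illustration and are not from the course material.

```python
import math

def pearson_r(xs, ys):
    """Linear (product-moment) correlation coefficient of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]           # illustrative values only

r_up = pearson_r(xs, [2 * x + 1 for x in xs])      # y rises exactly with x
r_down = pearson_r(xs, [10 - 3 * x for x in xs])   # y falls exactly with x
r_curve = pearson_r(xs, [(x - 3) ** 2 for x in xs])  # symmetric curve, no linear trend

print(r_up, r_down, r_curve)
```

The third case is a reminder that the coefficient measures linear association: a curved relationship can give a value near 0 until the variables are suitably transformed.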
5Ageing properties of the dye methyl violet (Grim
et al., 2002)
- This example will be used to demonstrate the
process involved in the calculation of a linear
correlation coefficient
- Laser desorption mass spectrometry was used to
examine the ageing properties of the dye methyl
violet, a dye used in inks from the 1950s.
- Documents written in methyl violet ink were
artificially aged with ultraviolet radiation.
- After various times the average molecular weight
for the methyl violet compound was measured.
- The raw data are shown in Table 6.1, and plotted
in Figure 6.2
6Table 6.1. Average molecular weight of the dye
methyl violet and UV irradiation time from an
accelerated ageing experiment.
Time (min) Weight (Da)
0.0 367.20
15.3 368.97
30.6 367.42
45.3 366.19
60.2 365.91
75.5 365.68
90.6 365.12
105.7 363.59
8Correlation coefficient r
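- Visual inspection of Fig. 6.2 suggests that there
is a negative linear correlation between time and
mean molecular weight.
- A suitable measure of this linear correlation, r,
is
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} $$
where $\bar{x}$ and $\bar{y}$ are the means of x and y.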
9Table 6.2
Time (min) x   x - mean x   (x - mean x)²   Weight (Da) y   y - mean y   (y - mean y)²   (x - mean x)(y - mean y)
0.0            -52.90       2798.41         367.20          0.94         0.883           -49.72
15.3           -37.61       1414.51         368.97          2.71         7.344           -101.92
30.6           -22.83       498.63          367.42          1.16         1.345           -25.90
45.3           -7.61        57.91           366.19          -0.07        0.005           0.53
60.2           7.33         53.73           365.91          -0.35        0.122           -2.57
75.5           22.61        511.21          365.68          -0.58        0.336           -13.11
90.6           37.67        1419.03         365.12          -1.14        1.300           -42.94
105.7          52.84        2792.06         363.59          -2.67        7.129           -141.08
mean x = 52.89              Σ = 9545.50     mean y = 366.26              Σ = 18.465      Σ = -376.72
10Substituting these values into the equation for r
we have
$$ r = \frac{-376.72}{\sqrt{9545.50 \times 18.465}} \approx -0.89 $$
- This means that as the irradiation time increases
the average molecular weight of methyl violet
ions decreases, and as -0.89 is close to -1, the
negative linear relationship is quite strong
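The column-by-column calculation can be reproduced with a minimal Python sketch, using the times and weights as printed in Table 6.1. The variable names are illustrative. Because the printed times appear to be rounded to one decimal place, the sums come out slightly different from the tabulated ones, and the coefficient evaluates to roughly -0.897, in line with the -0.89 quoted on the slide.

```python
import math

# Data as printed in Table 6.1
time_min = [0.0, 15.3, 30.6, 45.3, 60.2, 75.5, 90.6, 105.7]
weight_da = [367.20, 368.97, 367.42, 366.19, 365.91, 365.68, 365.12, 363.59]

n = len(time_min)
mean_x = sum(time_min) / n
mean_y = sum(weight_da) / n

# Deviations from the means (the "x - mean x" and "y - mean y" columns)
dx = [x - mean_x for x in time_min]
dy = [y - mean_y for y in weight_da]

# Column sums used in the formula for r
sxx = sum(d * d for d in dx)
syy = sum(d * d for d in dy)
sxy = sum(a * b for a, b in zip(dx, dy))

r = sxy / math.sqrt(sxx * syy)

print(f"mean x = {mean_x:.2f}, mean y = {mean_y:.2f}")
print(f"Sxx = {sxx:.2f}, Syy = {syy:.3f}, Sxy = {sxy:.2f}")
print(f"r = {r:.3f}")
```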
11Significance tests for correlation coefficients
- A linear correlation coefficient of -0.89 sounds
quite high, but is it significantly high? Is it
possible that such a coefficient would occur by
chance in data drawn randomly from an uncorrelated
bivariate normal distribution?
- Also, what about the effect of sample size? It
makes sense that a high coefficient based on lots
of x,y pairs is somehow more significant than an
equal correlation based on only a few
observations.
- For the null hypothesis that the correlation
coefficient is 0, a suitable test statistic is
- $t = r\sqrt{df} / \sqrt{1 - r^2}$
12Substituting for the methyl violet example
- $t = r\sqrt{df} / \sqrt{1 - r^2}$
- t is the abscissa (horizontal axis) on the
t-distribution
- df is degrees of freedom, equal to n - 2 (here 6
because we have 8 x,y pairs)
- The linear correlation coefficient was -0.89, so
- $t = -0.89 \times \sqrt{6} / \sqrt{1 - (-0.89)^2} = -4.78$
- If we look at the values of the t-distribution
table for df = 6 we see that 95% of the area lies
within ±2.447.
- Our value of -4.78 is beyond -2.447, so we can
say that the correlation coefficient is
significant at 95% confidence.
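As a check on the arithmetic, the test can be sketched in a few lines of Python. The function name `correlation_t` is hypothetical, and the critical value 2.447 is simply the table value quoted above for df = 6, typed in rather than computed.

```python
import math

def correlation_t(r, n):
    """t statistic and degrees of freedom for H0: correlation = 0."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df

# Methyl violet example: r = -0.89 from 8 (x, y) pairs
t_stat, df = correlation_t(-0.89, 8)

# Two-tailed 95% critical value for df = 6, taken from the t-table
T_CRIT_95_DF6 = 2.447
significant = abs(t_stat) > T_CRIT_95_DF6

print(f"t = {t_stat:.2f} on {df} degrees of freedom")
print("significant at 95% confidence" if significant else "not significant")
```

Comparing the absolute value of t with the critical value treats the test as two-tailed, which matches reading "95% of the area" between the two limits.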
13Correlation coefficients for non-linear data
- Andrasko and Ståhling measured three compounds
associated with the discharge of firearms,
naphthalene, TEAC-2 and nitroglycerin, over a
period of time by solid phase microextraction
(SPME) of the gaseous residue from the expended
cartridge.
- They found that the concentrations of these
compounds would decrease with time, and that this
property would be of use in estimating the time
since discharge for this type of cartridge.
- Table 6.3 gives the peak area for
nitroglycerin and time elapsed since discharge
for a Winchester SKEET 100 cartridge stored at
7 °C, shown as scatterplots in Figure 6.3
14Table 6.3
Time since discharge (days) Nitroglycerin (peak height)
1.21 218.34
2.42 216.16
3.62 100.00
4.69 75.55
7.49 56.52
9.42 50.62
11.60 31.00
14.69 41.44
21.50 15.53
25.70 14.63
29.86 10.41
37.20 5.16
42.42 7.26
16Log-linear relationships
- A common model for loss in chemistry (e.g.
radioactive decay) is called inverse exponential
decay, which entails a log-linear relationship
between the two variables
- The right hand scatterplot of Figure 6.3 shows
the log to the base e (or natural logarithm) of
the nitroglycerin peak height against time. Here
we can see that the data looks much more linear.
- The linear correlation coefficient is -0.95,
which is quite high, and suggests that this may
be a reasonable transformation of the variables
- The calculations for the log-linear correlation
coefficient are exactly the same kind as in Table
6.2, only the log to the base e of the y variable
has been used, rather than the untransformed y.
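The effect of the transformation can be sketched in Python with the values from Table 6.3. The helper `pearson_r` is an illustrative name, not part of the course material. The coefficient for the log-transformed peak heights comes out at about -0.95, as quoted above, and is stronger than the coefficient for the untransformed data.

```python
import math

# Table 6.3: time since discharge (days) and nitroglycerin peak height
days = [1.21, 2.42, 3.62, 4.69, 7.49, 9.42, 11.60,
        14.69, 21.50, 25.70, 29.86, 37.20, 42.42]
peak = [218.34, 216.16, 100.00, 75.55, 56.52, 50.62, 31.00,
        41.44, 15.53, 14.63, 10.41, 5.16, 7.26]

def pearson_r(xs, ys):
    """Linear correlation coefficient of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Untransformed y against time
r_raw = pearson_r(days, peak)

# Natural log of y against time (the log-linear case)
ln_peak = [math.log(p) for p in peak]
r_log = pearson_r(days, ln_peak)

print(f"r, untransformed peak height: {r_raw:.2f}")
print(f"r, ln(peak height):           {r_log:.2f}")
```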
17The coefficient of determination
- The coefficient of determination is a direct
measure of how much of the variance in one of the
covariates can be attributed to the other.
- We can imagine that the total variance in the
nitroglycerin peak is made up of two parts, that
which is attributable to the relationship with x
(time), and that which can be seen as random
noise.
- The coefficient of determination describes what
proportion of the variance is attributable to the
relationship with time.
- The coefficient of determination is simply the
square of the correlation coefficient.
- If r = -0.95, r² = 0.90.
- Often the coefficient of determination is
described as a percentage, which in the example
above would mean that 90% of the variance in
nitroglycerin peak area is attributable to time.
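The split into an explained part and a noise part can be made concrete with a short Python sketch. It fits a least-squares straight line to the log-transformed Table 6.3 data, which is an assumption added here for illustration (the slides do not fit a line), and checks that the explained share of the sum of squares equals r². The decomposition is therefore on the natural-log scale, and all variable names are illustrative.

```python
import math

# Table 6.3: time since discharge (days) and nitroglycerin peak height
days = [1.21, 2.42, 3.62, 4.69, 7.49, 9.42, 11.60,
        14.69, 21.50, 25.70, 29.86, 37.20, 42.42]
peak = [218.34, 216.16, 100.00, 75.55, 56.52, 50.62, 31.00,
        41.44, 15.53, 14.63, 10.41, 5.16, 7.26]
ln_peak = [math.log(p) for p in peak]

n = len(days)
mean_x = sum(days) / n
mean_y = sum(ln_peak) / n
sxx = sum((x - mean_x) ** 2 for x in days)
syy = sum((y - mean_y) ** 2 for y in ln_peak)   # total sum of squares
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ln_peak))

# Least-squares line (illustrative assumption, not from the slides)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in days]

# Part attributable to time, and part left over as noise
ss_explained = sum((f - mean_y) ** 2 for f in fitted)
ss_residual = sum((y - f) ** 2 for y, f in zip(ln_peak, fitted))

r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2
explained_fraction = ss_explained / syy

print(f"r = {r:.2f}, r squared = {r_squared:.2f}")
print(f"explained share of variance = {100 * explained_fraction:.0f}%")
```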