Spatial Statistics : Relationships - PowerPoint PPT Presentation

Slides: 55
Provided by: andrew58

Transcript and Presenter's Notes



1
Lecture 3
  • Spatial Statistics Relationships

2
Relationship statements
  • Male moderate drinkers are less likely to suffer
    from insulin dependent diabetes than nondrinkers.
  • A better economy has more potential for people to
    be employed.
  • The number of watches someone wears is directly
    proportional to the number of arms they have.
  • Mountains cause rainfall. Smoking causes cancer.
  • Girls are better than boys, so there.
  • The number of teapots in China has no effect on
    the frequency of volcanic eruptions in Italy.

3
This lecture
  • Correlation: how much do two variables vary
    together?
  • Regression: what is their relationship?
  • Spatial autocorrelation and cross-correlation
  • Semi-variograms
  • Geographically Weighted Regression

4
Correlation
  • As one variable changes, how closely do others
    follow it?
  • Usually represented on graphs.
  • Can plot unrelated pairs of data from datasets of
    different sizes (q-q plots):
  • rank both datasets and plot the 10% value against
    the 10% value, the 90% value against the 90% value.
  • Or plot linked pairs of data in scatterplots.
  • Correlation can be positive or negative.
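
As a rough illustration of the q-q idea above, the sketch below pairs up the deciles of two datasets of different sizes. The data and the helper name `qq_pairs` are made up for the example; it is one possible way to build the plotted pairs, not a prescribed method.

```python
import statistics

def qq_pairs(sample_a, sample_b, n=10):
    """Pair the quantiles of two samples (deciles when n=10).

    The samples may have different sizes; each is ranked internally
    by statistics.quantiles before the cut points are taken.
    """
    qa = statistics.quantiles(sample_a, n=n, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical data: 11 values and 21 values with the same shape of distribution
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
b = [2 * v for v in range(1, 22)]

pairs = qq_pairs(a, b)
for qa, qb in pairs:
    print(qa, qb)  # points to plot: 10% value vs 10% value, and so on
```

If the plotted pairs fall on a straight line, the two datasets have a similarly shaped distribution, even though no value in one is linked to a value in the other.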

5
Positive correlations
  • Attractiveness of chosen gender vs. alcohol
    intake.
  • Bus trip time from Headingley vs. importance of
    travel reason.
  • Ash's cumulative Pokémon losses vs. matches
    played.

6
Negative correlations
  • Ability to perform with chosen gender vs. alcohol
    intake.
  • Money vs. clubs visited.
  • Will to live vs. time in statistics lectures.

7
Correlation is one of the most useful and used
statistical techniques.
  • Correlation is an essential part of science.
  • Correlation is an essential part of politics.

8
Examples
9
Correlation is one of the most abused statistical
techniques.
  • Correlation is an essential part of dodgy
    science.
  • Correlation is an essential part of political
    misinformation.
  • Data can be selectively correlated.
  • There is no cause and effect link just because
    two variables correlate.

10
Examples
11
What can we do about this?
  • Tricky, but one start is to build convincing
    cause-and-effect models that demonstrate the same
    behavior.
  • This gives us something concrete to investigate.
  • But we then have to test our predictions.

12
Correlation can be strong or weak.
  • [Figure: scatterplots of a strong and a weak correlation]

13
How do we measure correlation?
  • We use correlation coefficients.
  • These are usually denoted as r and vary between
    -1 and 1.
  • -1: very strong negative correlation.
  • 0: no correlation.
  • 1: very strong positive correlation.

14
Which one depends on the type of data
  • Parametric tests are used for data that is:
  • interval or ratio;
  • normally distributed;
  • from sample populations that have the same
    standard deviations.
  • Non-parametric tests are used for all other data,
    including:
  • ranked data;
  • categorized data.

15
One parametric test: Pearson's Correlation
Coefficient.
  • Idea is to calculate the average covariance: how
    much one variable varies as the other varies.
  • Deviation = value - mean
  • The product of the variables' deviations gives a
    measure of covariance.
  • (valueOne - meanOne) x (valueTwo - meanTwo)
  • If both variables' values are far from the mean
    the product is large. If one deviation is large
    and the other small, the number will be smaller.

16
Pearson's Correlation Coefficient
  • Pearson's correlation coefficient r is the
    average of these products normalised by the
    standard deviations.
  • The simplest way of calculating this is
  • $r = \frac{(\sum xy)/n - x_m y_m}{s_x s_y}$
  • where x and y are the samples, $x_m$ and $y_m$ are
    the sample means, $s_x$ and $s_y$ are the sample
    standard deviations (computed with n, not n - 1,
    as the divisor), and n is the sample size.
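
A minimal sketch of the calculation on this slide, assuming the standard deviations are computed with n as the divisor (which is what this form of the formula needs). The data and the function name `pearson_r` are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r as (mean of x*y - xm*ym) / (sx * sy)."""
    n = len(xs)
    xm = sum(xs) / n
    ym = sum(ys) / n
    # Standard deviations with divisor n, matching the formula above
    sx = math.sqrt(sum((x - xm) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - ym) ** 2 for y in ys) / n)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return (mean_xy - xm * ym) / (sx * sy)

# Hypothetical linked pairs
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 3))  # about 0.775 for this data
```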

17
One non-parametric test: Spearman's Rank
Correlation Coefficient.
  • Given x and y sample pairs, we convert each x
    into its rank among all the x values, and each y
    into its rank among all the y values.
  • Spearman's coefficient is then calculated using
  • $r_s = 1 - \frac{6 \sum d^2}{n^3 - n}$
  • where d is the difference between the ranks for
    any given pair (a measure of the covariance) and
    n is the number of pairs.
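
The ranking step and the formula can be sketched as follows. This assumes there are no tied values (ties need averaged ranks, which this sketch does not handle); the data and helper names are hypothetical.

```python
def ranks(values):
    """Rank each value within its own list (1 = smallest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for position, index in enumerate(order, start=1):
        rank[index] = position
    return rank

def spearman_rs(xs, ys):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n^3 - n)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n ** 3 - n)

# Hypothetical sample pairs
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]

rs = spearman_rs(xs, ys)
print(rs)  # the rank differences are 0, -1, 1, -1, 1, so sum(d^2) = 4
```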

18
Testing the significance of the coefficients
  • When the data can be assumed normal, we can test
    the null hypothesis that there is no correlation
    (r = 0) using the following statistic
  • $t = \frac{r (n - 2)^{0.5}}{(1 - r^2)^{0.5}}$
  • which has a t distribution with n - 2 degrees of
    freedom.
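
A small worked example of the statistic above, using made-up values of r and n. The critical value quoted in the comment is the usual two-tailed 5% table value for 16 degrees of freedom; looking it up is left to tables or a statistics package.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical result: r = 0.6 from n = 18 pairs
t = t_statistic(0.6, 18)
degrees_of_freedom = 18 - 2

# The two-tailed 5% critical value for 16 degrees of freedom is about 2.12,
# so a t of this size would lead us to reject the null hypothesis r = 0.
print(t, degrees_of_freedom)
```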

19
Problem correlations
  • Bizarre
  • Strong but non-linear: with many non-linear
    relationships we can transform the data to a
    linear form. For example, exponential data can be
    made linear by taking the natural log of the data.
  • Very bizarre
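
To illustrate the log transform mentioned on this slide, the sketch below builds some exponential data (the constants 2.0 and 0.5 are arbitrary choices for the example) and checks that its natural log rises by a constant amount per unit of x, i.e. that it is linear.

```python
import math

# Hypothetical exponential relationship: y = 2 * exp(0.5 * x)
xs = [0, 1, 2, 3, 4]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# After the transform, ln(y) = ln(2) + 0.5 * x, which is a straight line in x
log_ys = [math.log(y) for y in ys]

# Constant step between neighbouring points means a constant slope
steps = [log_ys[i + 1] - log_ys[i] for i in range(len(xs) - 1)]
print(steps)
```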
20
Regression
  • Quantifying the relationship between two or more
    variables.
  • Linear regression with two variables.

We aim to produce a single line that quantifies
the relationship. The equation for such a line is
$y = a + bx$, where a is the intercept and b is the
slope (Δy/Δx on the figure). We can use this line
to predict new values given an independent value.
[Figure: regression line; x axis: independent
variable (x); y axis: dependent variable (y);
intercept a and slope Δy/Δx marked]
21
Finding the regression line
  • We take the line that minimizes some measure of
    how badly the line fits the data.
  • In the case of two variable linear regression, we
    try to minimize the sum of the squared deviations
    between the data and the line (the residuals).

The equation for such a line is given by
$b = \frac{\sum (x - x_m)(y - y_m)}{\sum (x - x_m)^2}$ and
$a = y_m - b x_m$, where $x_m$ and $y_m$ are the sample
means.
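
The slope and intercept formulas can be sketched directly in code. The data and the function name `fit_line` are made up for the example.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x for one independent variable."""
    n = len(xs)
    xm = sum(xs) / n
    ym = sum(ys) / n
    b = (sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
         / sum((x - xm) ** 2 for x in xs))
    a = ym - b * xm
    return a, b

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
prediction = a + b * 6  # predict y for a new independent value x = 6
print(a, b, prediction)
```

For this data the sums work out to b = 6/10 and a = 4 - 0.6 x 3, so the line is roughly y = 2.2 + 0.6x.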
22
How much the line explains the data
  • The sum of the squared residuals gives us a
    measure of how much of the data is not explained
    by the line.
  • This value, divided by the total variation in the
    data (the sum of the squared deviations of the y
    values from their mean), gives a fraction of how
    badly the line matches.
  • One minus this gives how well it matches - the
    coefficient of determination.
  • Conveniently, this value is the square of the
    correlation coefficient r, and is also therefore
    known as $r^2$.
  • Thus, the significance test for the r value also
    gives us the significance of our line.
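
A minimal sketch of the coefficient of determination, assuming the total variation is measured as squared deviations of y from its mean. The data is invented; the fit is repeated inside the block so that it runs on its own.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    b = (sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
         / sum((x - xm) ** 2 for x in xs))
    return ym - b * xm, b

def r_squared(xs, ys):
    """1 - (sum of squared residuals) / (total variation of y about its mean)."""
    a, b = fit_line(xs, ys)
    ym = sum(ys) / len(ys)
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - ym) ** 2 for y in ys)
    return 1 - unexplained / total

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r2 = r_squared(xs, ys)
print(r2)  # squared residuals sum to 2.4, total variation is 6
```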

23
Multiple regression
  • We can still do regression when there is more
    than one independent variable.
  • For example, in the case of three variables (two
    independent) we are looking for a solution sheet,
    not a line.

We can do the same thing with a computer for as
many dimensions as we like, but more than three
become hard to visualize as graphs. We're
essentially trying to fit a line with the
equation $y = a + b_1 x_1 + b_2 x_2 + b_3 x_3$
[Figure: fitted sheet; axes x1, x2 and y]
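
One way to do this with a computer is an ordinary least-squares solve. The sketch below uses NumPy's `numpy.linalg.lstsq` on made-up, noise-free data with two independent variables, so the fit should recover the coefficients used to build it; real data would of course not be this tidy.

```python
import numpy as np

# Hypothetical data: y = 1 + 2*x1 - 0.5*x2 exactly (no noise)
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y = 1.0 + 2.0 * x1 - 0.5 * x2

# Design matrix: a column of ones for the intercept a, then one column per variable
X = np.column_stack([np.ones_like(x1), x1, x2])

coeffs, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coeffs
print(a, b1, b2)
```

Adding further independent variables only means adding further columns to `X`.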
24
Polynomial regression
  • In some cases we may want to fit a curve through
    non-linear data in multi-variable space.
  • For one independent variable, the equation (which
    is a type known as a polynomial) for the line is
  • $y = a + b x + b_2 x^2 + b_3 x^3$
  • Excel, for example, will fit this for you.

25
Polynomial curve fitting
  • The degree of the polynomial is the power of the
    last term. Higher degree curves fit the sample
    data better.
  • However, we've seen that a sample doesn't
    necessarily have the same distribution as the
    population.
  • Our curve should reflect a general population
    model, not the sample data with all its
    measurement and random errors.
  • When we look at AI techniques we'll see that
    predictive models based on data can become less
    accurate about the world as they increasingly
    match our samples and not the population.
  • Therefore we have to make a judgement as to the
    polynomial degree, and not necessarily pick the
    highest.
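
The trade-off described above can be explored with a small experiment. The sketch below assumes a made-up quadratic "population" model plus random noise, fits polynomials of several degrees with `numpy.polyfit`, and records the error on the fitted sample and on fresh data from the same population. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def population_model(x):
    # Hypothetical "true" relationship, chosen only for illustration
    return 1.0 + 2.0 * x - 3.0 * x ** 2

# A small noisy sample, and fresh data from the same population
x_sample = np.linspace(0.0, 1.0, 12)
y_sample = population_model(x_sample) + rng.normal(0.0, 0.2, x_sample.size)
x_fresh = np.linspace(0.05, 0.95, 50)
y_fresh = population_model(x_fresh) + rng.normal(0.0, 0.2, x_fresh.size)

sample_sse = {}  # error on the data the curve was fitted to
fresh_mse = {}   # error on data the curve has not seen
for degree in (1, 2, 5):
    coeffs = np.polyfit(x_sample, y_sample, degree)
    sample_sse[degree] = float(np.sum((np.polyval(coeffs, x_sample) - y_sample) ** 2))
    fresh_mse[degree] = float(np.mean((np.polyval(coeffs, x_fresh) - y_fresh) ** 2))

for degree in sample_sse:
    print(degree, sample_sse[degree], fresh_mse[degree])
```

The error on the fitted sample can only go down as the degree rises; whether the error on fresh data follows it is the thing to judge when picking a degree.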

26
Summary
  • Correlation measures covariance, but doesn't say
    anything about causal relationships.
  • We can measure correlation and test its
    significance.
  • We can quantify relationships using regression
    equations and use these to predict.

27
Spatial autocorrelation
  • One of the major issues in dealing with
    geographical data.
  • The idea that values at one point may be
    correlated with values of the same variable
    nearby (or cross-correlated with another variable
    nearby).
  • Geodemographics: people living near each other
    may have the same interests because they have the
    same opportunities and self-cluster.
  • Rainfall in one geographical area stops rainfall
    in another.
  • Crime spots cause social decay which in turn
    causes more crime in a limited geographical area.
  • All graded or clustered information suffers from
    this.

28
Frog averaging
  • Say we want to know what the average number of
    frogs in the country is.
  • We take a sample of six points.
  • Three of them are normal and fall across the
    whole country but three fall in a small area
    where there's a hidden pond.
  • It's like we've only really taken four samples.

29
How does this affect significance testing /
correlation / regression?
  • Essentially if our data is spatially correlated,
    we aren't sampling as randomly as we would like
    in our attempts to get an overview of the
    population.
  • i.e. some of our samples are the same (not
    independent) and don't count. This is the
    equivalent of taking a smaller sample.
  • In correlation, it is possible that all our
    correlation is due to geography, and none to our
    variables.

30
How do we test for it?
  • First, plot the data and look for geographical
    trends.
  • Particularly plot the residuals of any
    regressions.

For example, the plot to the left might represent
murder rate residuals in an area after
deprivation and policing levels are taken into
account (regressed out). Anyone want to guess where
Dr Lecter lives?
  • Cluster analysis (in two weeks) looks for these
    kinds of trends.

31
But what about if it's more pervasive?
  • How do we test if, for example, it's a constant
    relationship between neighbours? Statistics that
    junk individuals together are useless.
  • Example: ring speciation.
  • If you start eastwards from Alaska there's no
    real difference between Herring gulls in one area
    of the Arctic Ocean and the next. But the minor
    differences build up around the globe so that
    Alaskan and Siberian gulls can't interbreed.
  • There's negative spatial autocorrelation in the
    fertility that you wouldn't understand if you
    mixed the whole population up.
  • Example: factors in the spread of Ebola.
  • We need a measure of the covariance between
    neighbours.

32
Plotting Autocorrelations
  • Imagine we had the following map of mineral
    deposits, just showing one variable.

[Figure: map of mineral deposits; NNE direction marked]
  • Obviously there is a spatial autocorrelation in
    the NNE direction and not in the others.

33
h-scattergrams
  • One way of displaying autocorrelation is to
    plot the values of points against the value of a
    neighbour distance h away in some direction.
  • Usually the correlation will decrease with
    distance.
  • Correlation may vary with direction as well.

34
Correlogram
  • We can get a number of h-scatterplots for
    different h, and work out their correlation
    coefficients.
  • This shows how the strength of the correlation
    drops off with distance from a set of points.

We can plot these against each other for one
direction.
Or as a contour plot for all directions.
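
A one-dimensional sketch of a correlogram: values along a transect are paired with the value h steps further on (the points of an h-scatterplot), and a correlation coefficient is computed for each h. The transect values and function names are made up, and a real dataset would need pairs chosen by distance and direction in two dimensions.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length lists."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    cov = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - xm) ** 2 for x in xs)
                       * sum((y - ym) ** 2 for y in ys))
    return cov / spread

def lag_pairs(values, h):
    """Points for an h-scatterplot: each value and its neighbour h steps on."""
    return values[:-h], values[h:]

# Hypothetical values measured at equal spacing along one direction
transect = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 2]

correlogram = {h: pearson_r(*lag_pairs(transect, h)) for h in (1, 2, 3, 4)}
for h, r in correlogram.items():
    print(h, round(r, 3))
```

In this artificial transect the values repeat every eight steps, so neighbours one step apart are positively correlated while points four steps apart are perfectly negatively correlated.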
35
Moment of Inertia
  • If a point x1 and its neighbour x2 were identical
    and plotted against each other, they'd fall on
    the 45° line x1 = x2.
  • A measure of how far the data stray from this
    line is the moment of inertia.
  • $m = \frac{1}{2n} \sum (x_1 - x_2)^2$
  • Unlike the correlation coefficient, m increases
    as the data gets further spread.

36
Variograms
  • A plot of the moment of inertia vs. h is called a
    Semi-variogram, or, more usually, just a
    Variogram.
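
The moment of inertia and the resulting variogram can be sketched for a one-dimensional transect as follows. The values are invented, and the spacing between points is assumed to be equal so that h can be counted in steps.

```python
def moment_of_inertia(values, h):
    """m = 1/(2n) * sum((x1 - x2)^2) over the n pairs that are h steps apart."""
    pairs = list(zip(values[:-h], values[h:]))
    return sum((x1 - x2) ** 2 for x1, x2 in pairs) / (2 * len(pairs))

# Hypothetical values measured at equal spacing along one direction
transect = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 2]

# The variogram is m plotted against h
variogram = {h: moment_of_inertia(transect, h) for h in range(1, 5)}
for h, m in variogram.items():
    print(h, m)
```

Here m grows with h, the opposite behaviour to the correlation coefficient, because points further apart differ more.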

[Figure: variograms of m against h-distance and
h-angle; labelled values m = 100 and m = 200]
37
Problems
  • Variograms can't use all the data values without
    additional assumptions, e.g. what is north of the
    northernmost data point? Usually we ignore the
    boundaries.
  • All the correlation plots can suffer badly from a
    few unusual values, which can badly reduce the
    correlations.
  • h-scatterplots allow us to see which unusual
    points are causing the problems and let us decide
    whether to remove them.

38
Multi-variant Plots
  • We may be interested in the relationship between
    two variables and whether they are spatially
    cross-correlated.
  • We can plot h-scatterplots for a variable x and a
    variable y, but shift the y location by h.

[Figure: h-scatterplot of y (shifted by h) against x]
39
Multi-variant Correlation
  • We can also calculate the cross-correlation for
    this h-scatterplot.
  • $r = \frac{(\sum xy)/n - x_m y_m}{s_x s_y}$ (for some h)
  • where the means $x_m$, $y_m$ and standard
    deviations $s_x$, $s_y$ are just for the variable
    points used, i.e. x at one position, y at another.

40
Multi-variant Variograms
  • Equally the equation for the moment of inertia
    can be extended to
  • $m = \frac{1}{2n} \sum (x_1 - x_2)(y_1 - y_2)$
  • Note that this is no longer the moment of
    inertia as the line can be off 45°.
  • Also note that it uses both x and y at positions
    1 and 2.
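
A short sketch of the two-variable version, again on a made-up one-dimensional transect with equal spacing. The function name `cross_moment` is hypothetical.

```python
def cross_moment(xs, ys, h):
    """m = 1/(2n) * sum((x1 - x2) * (y1 - y2)) over pairs h steps apart.

    Both variables are read at position 1 and at position 2 of each pair.
    """
    n = len(xs) - h
    return sum((xs[i] - xs[i + h]) * (ys[i] - ys[i + h]) for i in range(n)) / (2 * n)

# Hypothetical variables measured at the same locations
xs = [1, 2, 3, 4, 5, 4, 3, 2]
ys_same = [2 * x for x in xs]  # rises and falls with x
ys_opposite = [-x for x in xs]  # falls when x rises

m_same = cross_moment(xs, ys_same, 1)
m_opposite = cross_moment(xs, ys_opposite, 1)
print(m_same, m_opposite)
```

Unlike the single-variable moment of inertia, this value can be negative, which happens when the two variables change in opposite directions between neighbours.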

41
The Use of Variograms
  • As we'll see in later lectures, variograms can be
    very useful.
  • They represent the variability at different
    distances from a point.
  • You can therefore use them to construct
    probability models of a landscape and predict the
    value of missing areas.
  • This is known as kriging, and we'll look at it in
    later sessions.
  • However, it might be nice to have a single
    statistic we can use to assess autocorrelation.
    One way is using Join Count Statistics.

42
Join Count Statistics
  • Moran and Geary in the 1950s.
  • Defines a binary variable: something is either
    present (white) or not (black).
  • Calculate the number of B-B, W-W and B-W
    connections (joins).
  • These totals can then be compared with the normal
    distribution, which is what we'd get if the
    process was random.
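
The counting step can be sketched on a small grid. This assumes cells are joined only to their horizontal and vertical neighbours (one possible choice of contiguity), and the two grids are invented to show the extremes; the comparison with the expected distribution is not attempted here.

```python
def join_counts(grid):
    """Count B-B, W-W and B-W joins between horizontally/vertically adjacent cells."""
    counts = {"BB": 0, "WW": 0, "BW": 0}
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so that each join is counted once
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    first, second = grid[r][c], grid[rr][cc]
                    if first != second:
                        counts["BW"] += 1
                    elif first == "B":
                        counts["BB"] += 1
                    else:
                        counts["WW"] += 1
    return counts

# Hypothetical maps: W = present (white), B = absent (black)
clustered = ["BBWW", "BBWW", "BBWW", "BBWW"]
checkerboard = ["BWBW", "WBWB", "BWBW", "WBWB"]

clustered_counts = join_counts(clustered)
checkerboard_counts = join_counts(checkerboard)
print(clustered_counts)
print(checkerboard_counts)
```

A clustered map has few B-W joins and a checkerboard has nothing else; a random process would fall somewhere in between.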

43
Developments of this
  • Moran's I for contiguous areas.
  • Geary's c for contiguous areas.
  • However, this is strongly dependent on
  • which directions you take as contiguous,
  • variation in the size of areas and boundaries.

44
Cliff and Ord's Moran's I test
  • Core values are the deviations from the mean at
    two locations.
  • These are then multiplied by an a priori weight
    which represents how much two areas might affect
    each other.
  • This is then normalized by the variation and
    sample-number-to-weights ratio.
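
A minimal sketch of the statistic as described: products of deviations, weighted, then normalised by the variation and by the ratio of the number of areas to the total weight. The four-area chain and its binary adjacency weights are assumptions made for the example, and the significance test itself is not shown.

```python
def morans_i(values, weights):
    """Moran's I = (n / S0) * sum_ij(w_ij * d_i * d_j) / sum_i(d_i^2).

    d_i is the deviation of area i from the mean and S0 is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

# Hypothetical chain of four areas; weight 1 if adjacent, otherwise 0
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

i_gradient = morans_i([1, 2, 3, 4], weights)      # similar neighbours
i_alternating = morans_i([1, 4, 1, 4], weights)   # dissimilar neighbours
print(i_gradient, i_alternating)
```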

45
How do we define the weights?
  • Various options:
  • One or zero depending on whether the areas are
    adjacent.
  • Each area has a total of one, and this is
    divided up between its adjacent neighbours
    dependent on the number of them.
  • Exponentially related to the distance between the
    areas (it's possible to assess the relationship
    between each area and all the others).
  • We have to pick the most reasonable.
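
The three options above might be built like this. The neighbour lists, coordinates and the `scale` parameter of the distance decay are all hypothetical; the exponential form shown (exp(-distance / scale)) is one common choice, not necessarily the one intended on the slide.

```python
import math

def binary_weights(neighbours, n):
    """1 if two areas are adjacent, otherwise 0."""
    return [[1.0 if j in neighbours[i] else 0.0 for j in range(n)]
            for i in range(n)]

def row_standardised(weights):
    """Each area's weights sum to one, shared equally between its neighbours.

    Assumes every area has at least one neighbour.
    """
    return [[w / sum(row) for w in row] for row in weights]

def exponential_weights(coords, scale):
    """Weight decays exponentially with distance; every pair of areas gets a weight."""
    n = len(coords)
    return [[0.0 if i == j
             else math.exp(-math.dist(coords[i], coords[j]) / scale)
             for j in range(n)] for i in range(n)]

# Hypothetical chain of four areas with centres one unit apart
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

w_binary = binary_weights(neighbours, 4)
w_shared = row_standardised(w_binary)
w_distance = exponential_weights(coords, scale=1.0)
print(w_shared[1], w_distance[0])
```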

46
Geographically Weighted Regression
  • Pioneered by the Newcastle United team of
    Fotheringham, Brunsdon, and Charlton.
  • A bit like Moran's for regression, only even more
    arduous.

47
The Core Idea
  • A standard regression has the same parameters
    wherever you are geographically.
  • E.g. relationship between socioeconomics and
    secondary school performance.
  • Usually the residuals tell you where you've gone
    wrong.
  • GWR allows the parameters to vary spatially, so
    you can look at these.
  • Assumes a link between where you are, and the
    strength of a relationship.

48
Locally weighted regression
  • Run a standard regression for each point, but
    weight near points as more important.
  • Often the weights are an exponential function of
    distance and/or limited to a fixed number of
    nearest neighbours.
  • Weakly dependent on the form of the weights.
    Strongly dependent on how far weights stretch
    around an area.
  • Can try to find the best distance. This is the
    one that gives the best prediction for each
    point, if that point is excluded from GWR
    calculations.
  • Run, run and run again.
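
The procedure above can be sketched in a much simplified form: one predictor, locations along a line, and weights that decay exponentially with distance. This is only an illustration of the locally weighted idea under those assumptions, not the GWR software's algorithm; the data, the candidate bandwidths and the function names are all invented.

```python
import math

def local_fit(u0, locations, predictor, response, bandwidth, skip=None):
    """Weighted regression centred on location u0; returns (a, b).

    Weights are exp(-distance / bandwidth). The point with index `skip`
    is left out (used for the leave-one-out check below).
    """
    rows = [(math.exp(-abs(u - u0) / bandwidth), p, y)
            for i, (u, p, y) in enumerate(zip(locations, predictor, response))
            if i != skip]
    total = sum(w for w, _, _ in rows)
    pm = sum(w * p for w, p, _ in rows) / total
    ym = sum(w * y for w, _, y in rows) / total
    b = (sum(w * (p - pm) * (y - ym) for w, p, y in rows)
         / sum(w * (p - pm) ** 2 for w, p, _ in rows))
    return ym - b * pm, b

def leave_one_out_error(bandwidth, locations, predictor, response):
    """Squared error when each point is predicted with itself excluded."""
    error = 0.0
    for i, (u, p, y) in enumerate(zip(locations, predictor, response)):
        a, b = local_fit(u, locations, predictor, response, bandwidth, skip=i)
        error += (y - (a + b * p)) ** 2
    return error

# Hypothetical data: the slope is 1 in the west (u < 5) and 3 in the east
locations = list(range(10))
predictor = [1, 2, 3, 1, 2, 3, 1, 3, 2, 1]
response = [p if u < 5 else 3 * p for u, p in zip(locations, predictor)]

# One regression per point: the local slope is allowed to vary over space
local_slopes = [local_fit(u, locations, predictor, response, 1.0)[1]
                for u in locations]

# Try several distances and keep the one that predicts excluded points best
candidates = [0.5, 1.0, 2.0, 4.0]
best_bandwidth = min(
    candidates,
    key=lambda bw: leave_one_out_error(bw, locations, predictor, response))
print(local_slopes[0], local_slopes[9], best_bandwidth)
```

A single global regression on this data would report one compromise slope; mapping `local_slopes` is what reveals the west/east difference.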

49
GWR Software
  • Derive local t statistics.
  • Perform tests to assess the significance of the
    spatial variation in the local parameter
    estimates.
  • Perform tests to determine if the local model
    performs better than the global one, accounting
    for differences in degrees of freedom.
  • http://www.ncl.ac.uk/geography/GWR

50
GWR Example: Spatial Variations in School
Performance
  • Did a global regression on Primary School Maths
    results vs. demographics.
  • Then did a GWR regression. Derived the weights
    for each factor for each point and plotted them.
  • Divided by an error variation term to give a
    rough idea of the significance of the weights,
    and plotted these.

51
GWR Example Results
  • In Leeds / Bradford, school size was much more
    important than elsewhere (inverse relationship).
  • In Manchester, middle-class children do
    proportionally better than their social group
    elsewhere.
  • While the combination of variables and unknowns
    is complex, GWR does suggest interesting avenues
    of investigation.

Weights for school size
52
Summary
  • Correlation measures covariance, but doesn't say
    anything about causal relationships.
  • We can measure correlation and test its
    significance.
  • We can quantify relationships using regression
    equations and use these to predict.

53
Summary
  • Spatial autocorrelation means our sampling
    strategies aren't as random / large as we'd like.
  • Correlations can be due to geographical
    correlations, not the ones we've tested for.
  • Plotting residuals geographically may let us see
    autocorrelation.
  • Variograms and h-scatterplots are another good
    way.
  • Moran's I test allows us to quantify spatial
    autocorrelation for a given weight scheme.
  • Geographically Weighted Regression helps us to
    take autocorrelation into account and investigate
    the weight it has.

54
Next lecture
  • Interpolation
  • Homework
  • Read handout on Autocorrelation Stats.
  • View GWR talk.
  • http://www.geocomputation.org/2001/
  • Keynote.