Title: Uncertainty, error, and quality control
GIS is not perfect
- A GIS cannot perfectly represent the world for many reasons, including:
  - The world is too complex and detailed.
  - The data structures or models (raster, vector, or TIN) used by a GIS to represent the world are not discriminating or flexible enough.
  - We make decisions (how to categorize data, how to define zones) that are not always fully informed or justified.
- It is impossible to make a perfect representation of the world, so uncertainty is inevitable.
- Uncertainty degrades the quality of a spatial representation.
A Conceptual View of Uncertainty
[Figure: diagram of the stages Real World, Conception, Measurement and Representation, Data conversion and Analysis (with error propagation), and Result.]
1. Uncertainty in the conception of geographic phenomena
- Many spatial objects are not well defined, or their definition is to some extent arbitrary, so that people can reasonably disagree about whether a particular object is x or not. There are at least four types of conceptual uncertainty:
  - Spatial uncertainty
  - Vagueness
  - Ambiguity
  - Regionalization problems
- Spatial uncertainty: This occurs when objects do not have a discrete, well-defined extent. They may have indistinct boundaries (where exactly does a wetland end?), they may have impacts that extend beyond their boundaries (should an oil spill be defined by the dispersion of pollutants or by the area of environmental damage?), or they may simply be statistical entities. The attributes ascribed to spatial objects may also be subjective; for example, the spatial distributions of poverty and biodiversity depend on human interpretations of what these things mean.
- Vagueness (obscureness): Vagueness occurs when the criteria that define an object as x are not explicit or rigorous. For example, in a land cover analysis, how many oaks (or what proportion of oaks) must be found in a tract of land to qualify it as oak woodland? What incidence of crime (or resident criminals) defines a high-crime neighborhood?
- Ambiguity: Ambiguity occurs when y is used as a substitute, or indicator, for x because x is not available. The link between direct indicators and the phenomena for which they substitute is straightforward and fairly unambiguous. Soil nutrient levels (y) are a direct indicator of crop yields (x). Indirect indicators tend to be more ambiguous and opaque. Wetlands (y) are an indirect indicator of animal species diversity (x). Of course, indicators are not simply direct or indirect; they occupy a continuum. The more indirect they are, the greater the ambiguity and the less certain it is that an object being approximated using y really is x.
- Regionalization problems: Regional geography is largely founded on the creation of a mosaic of zones that make it easy to portray spatial data distributions. A uniform zone is defined by the extent of a common characteristic, such as climate, landform, or soil type. Functional zones are areas that delimit the extent of influence of a facility or feature, for example, how far people travel to a shopping center or the geographic extent of support for a football team. Regionalization problems occur because zones are artificial. In the development of climate zones, for instance, experts may disagree on what combination of characteristics defines a zone, how these characteristics should be weighted to create a composite indicator, and what the minimum size threshold for a zone is. This should not be surprising; after all, spatial distributions tend to change gradually, while zones imply that there are sharp boundaries between them.
2.1 Uncertainty in the measurement of geographic phenomena
- Error occurs in the physical measurement of objects. This error creates further uncertainty about the true nature of spatial objects.
  - Physical measurement error
  - Digitizing error
  - Error caused by combining data sets with different lineages
- Physical measurement error: Instruments and procedures used to make physical measurements are not perfectly accurate. For example, a survey of Mount Everest might find its height to be 8,850 meters, with an accuracy of plus or minus 5 meters. In addition, the earth is not a perfectly stable platform from which to make measurements. Seismic motion, continental drift, and the wobbling of the earth's axis cause physical measurements to be inexact. (Examples: GPS error, remote sensing error.)
- Digitizing error: A great deal of spatial data has been digitized from paper maps. Digitizing, or the electronic tracing of paper maps, is prone to human error. Lines may be drawn too far, not far enough, or missed entirely. Errors caused by digitizing mistakes can be partially, but not completely, fixed by software. Additional error occurs because adjacent data digitized from different maps may not align correctly. This problem can also be partially corrected through a software technique called rubbersheeting.
- Error caused by combining data sets with different lineages: Data sets produced by different agencies or vendors may not match because different processes were used to capture or automate the data. For example, buildings in one data set may appear on the opposite side of the street in another data set. Error may also be caused by combining sample and population data or by using sample estimates that are not robust at fine scales. "Lifestyle" data are derived from shopping surveys and provide business and service planners with up-to-date socioeconomic data not found in traditional data sources like the census. Yet the methods by which lifestyle data are gathered and aggregated to zones, or are compared to census data, may not be scientifically rigorous.
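Accuracy and Precision
- Precision: a measure of repeatability, that is, how closely repeated measurements agree with one another.
- Accuracy: a measure of reliability, that is, the difference between reality and our measurement.
[Figure: Shooting a target, four panels numbered 1 to 4.]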
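The distinction can be made concrete with a small calculation. The following is a minimal sketch, assuming that accuracy is summarised by the bias (mean measurement minus true value) and precision by the standard deviation of repeated measurements; the function name and the readings are hypothetical and are not taken from the slides.

```python
from statistics import mean, pstdev

def accuracy_and_precision(measurements, true_value):
    """Return (bias, spread) for repeated measurements of one quantity."""
    bias = mean(measurements) - true_value   # accuracy: distance from reality
    spread = pstdev(measurements)            # precision: repeatability
    return bias, spread

# Hypothetical repeated height readings (metres) of a point whose true height is 100 m.
precise_but_inaccurate = [104.9, 105.0, 105.1, 105.0]
accurate_but_imprecise = [96.0, 104.0, 98.0, 102.0]

for name, shots in [("precise but inaccurate", precise_but_inaccurate),
                    ("accurate but imprecise", accurate_but_imprecise)]:
    bias, spread = accuracy_and_precision(shots, true_value=100.0)
    print(f"{name}: bias = {bias:+.2f} m, spread = {spread:.2f} m")
```

The first set of readings clusters tightly around the wrong value, while the second is centred on the true value but scattered, which mirrors the tight-but-off-centre and centred-but-scattered cases of a target diagram.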
Digitizing Error
- Any digitized map requires considerable post-processing:
  - Check for missing features
  - Connect lines
  - Remove spurious polygons
- Some of these steps can be automated.
2.2 Uncertainty in the representation of geographic phenomena
- Representation is closely related to measurement. Representation is not just an input to analysis, but sometimes also the outcome of it. For this reason, we consider representation separately from measurement. The world is infinitely complex, but computer systems are finite. Representation is all about the choices that are made in capturing knowledge about the world.
- Uncertainty in the earth model: ellipsoid models, datum, projection types
- Uncertainty in the raster data model (structure)
- Uncertainty in the vector data model (structure)
- Uncertainty in the raster data structure: The raster structure partitions space into square cells of equal size (also called pixels). Spatial objects x, y, and z emerge from cell classification, in which Cell A1 is classified as x, Cell A2 as y, Cell A3 as z, and so on, until all cells are evaluated. A spatial object x can be defined as a set of contiguous cells classified as x. Commonly, a cell is not purely one thing or another, but might contain some x, some y, and maybe a bit of z within its area. These impure cells are termed "mixels." Because a cell can hold only one value, a mixel must be classified as if it were all one thing or another. Therefore, the raster structure may distort the shape of spatial objects.
- Uncertainty in the vector data structure: Socioeconomic data (facts about people, houses, and households) are often best represented as points. For various reasons (to protect privacy, to limit data volume), data are usually aggregated and reported at a zonal level, such as census tracts or ZIP Codes. This distorts the data in two ways: first, it gives them a spatially inappropriate representation (polygons instead of points); second, it forces the data into zones whose boundaries may not respect natural distribution patterns.
Error in raster
- Because of the distortions due to flattening, cells in a raster can never be perfectly equal in size on the Earth's surface.
- When information is represented in raster form, all detail about variation within cells is lost, and instead the cell is given a single value. Common rules for choosing that value are largest share, central point (e.g. USGS DEM), and mean value (e.g. remote sensing imagery).
[Figure: example cells illustrating the largest share and central point rules. Area-weighted mean values for the example cells:]
  8 × (1/6) + 6 × (5/6) = 6.33
  8 × (3/4) + 6 × (1/4) = 7.5
  8 × (1/7) + 6 × (6/7) = 6.29
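The cell-value rules can be sketched in a few lines. This is an illustrative sketch only: it assumes each cell is described by hypothetical (value, area fraction) pairs, and it reuses the values 8 and 6 and the area fractions that appear in the slide's arithmetic. The central point rule is left out because it needs the cell geometry, which the transcript does not preserve.

```python
def largest_share(parts):
    """Value that covers the largest fraction of the cell."""
    return max(parts, key=lambda p: p[1])[0]

def mean_value(parts):
    """Area-weighted mean of the values inside the cell."""
    return sum(value * fraction for value, fraction in parts)

# Each cell is a list of (value, fraction of cell area) pairs.
cells = [
    [(8, 1/6), (6, 5/6)],
    [(8, 3/4), (6, 1/4)],
    [(8, 1/7), (6, 6/7)],
]

for parts in cells:
    print(largest_share(parts), round(mean_value(parts), 2))
```

The two rules give different answers for the same cell (for the first cell, 6 under largest share and 6.33 under mean value), which is exactly the within-cell detail that a single stored value hides.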
Map representation error
3. Uncertainty in the data conversion and analysis of geographic phenomena
- Uncertainties in data lead to uncertainties in the results of analysis. Data conversion and spatial analysis methods can create further uncertainty:
  - Data conversion error
  - Georeferencing and resampling (nearest, bilinear, cubic)
  - Projection and datum conversions
  - The ecological fallacy
  - The modifiable areal unit problem (MAUP)
  - Classification errors
- The ecological fallacy: The ecological fallacy is the mistake of assuming that an overall characteristic of a zone is also a characteristic of any location or individual within the zone.
- The Modifiable Areal Unit Problem (MAUP): The results of data analysis are influenced by the number and sizes of the zones used to organize the data. The Modifiable Areal Unit Problem has at least three aspects:
  - The number, sizes, and shapes of zones affect the results of analysis.
  - The number of ways in which fine-scale zones can be aggregated into larger units is often great.
  - There are usually no objective criteria for choosing one zoning scheme over another.
- An example of the influence of the number of zones on analysis is the 1950 study by Yule and Kendall, which found that the correlation between wheat and potato yields in England changed from low to high as the data were grouped into fewer and fewer zones (starting with 48 and ending with 2).
- An example of the influence of zone shape is gerrymandering, in which voting district boundaries are manipulated in order to engineer a desired election outcome.
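The effect of the number of zones on correlation can be reproduced with synthetic data. The sketch below is not the Yule and Kendall data: it assumes 48 hypothetical fine zones whose two variables share a gentle trend plus strong local noise, and it merges consecutive zones by averaging. The function names are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def aggregate(values, zone_size):
    """Merge consecutive fine zones into larger zones by averaging."""
    return [sum(values[i:i + zone_size]) / zone_size
            for i in range(0, len(values), zone_size)]

# Synthetic data for 48 fine zones: a shared trend plus strong local noise
# whose patterns cancel out within any block of four zones.
n = 48
x = [i + 30 * (1 if i % 2 == 0 else -1) for i in range(n)]
y = [i + 30 * (1 if i % 4 < 2 else -1) for i in range(n)]

for zone_size in (1, 4, 12):
    r = pearson(aggregate(x, zone_size), aggregate(y, zone_size))
    print(f"{n // zone_size:2d} zones: r = {r:.2f}")
```

With these inputs the correlation is weak at the finest level and becomes very strong once the zones are merged, because averaging removes the local noise and leaves only the shared trend. Nothing about the underlying phenomenon changes, only the zoning.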
Zone shape change
Classification error and quality check
Selecting ROIs
[Figure: regions of interest (ROIs) for the classes Alfalfa, Cotton, Grass, and Fallow.]
[Figure: Classification result. Background: ETM, 7/15/01. Top image: IKONOS, Oct. 2000.]
Confusion Matrix
[Table: classification results compared against ground truth.]
Bases of Confusion Matrix
- Producer accuracy is the probability that the classifier has labeled an image pixel as Class A given that the ground truth is Class A.
- User accuracy is the probability that a pixel is Class A given that the classifier has labeled the pixel as Class A.
- Overall accuracy is the total classification accuracy.
- Kappa index (another parameter for overall accuracy) is a more useful index for evaluating accuracy, because it discounts the agreement expected by chance.
- Errors of commission represent pixels that belong to another class but are labeled as belonging to the class.
- Errors of omission represent pixels that belong to the ground truth class but that the classification technique has failed to classify into the proper class.
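These measures all follow from the confusion matrix. The sketch below assumes a matrix with rows for the classified label and columns for the ground truth; the pixel counts are hypothetical and are not the values from the slides, and the function name is illustrative.

```python
def accuracy_report(matrix, classes):
    """Summarise a confusion matrix (rows = classified, columns = ground truth)."""
    n = len(classes)
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(n))
    row_totals = [sum(matrix[i]) for i in range(n)]                       # pixels labeled as class i
    col_totals = [sum(matrix[r][j] for r in range(n)) for j in range(n)]  # ground-truth pixels of class j

    overall = diagonal / total
    producer = {c: matrix[i][i] / col_totals[i] for i, c in enumerate(classes)}
    user = {c: matrix[i][i] / row_totals[i] for i, c in enumerate(classes)}
    # Kappa corrects overall accuracy for the agreement expected by chance.
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (overall - chance) / (1 - chance)
    return overall, producer, user, kappa

classes = ["Alfalfa", "Cotton", "Grass", "Fallow"]
# Hypothetical pixel counts, not taken from the slides.
matrix = [
    [45,  3,  2,  0],
    [ 4, 40,  1,  5],
    [ 1,  2, 46,  1],
    [ 0,  5,  1, 44],
]

overall, producer, user, kappa = accuracy_report(matrix, classes)
print(f"overall = {overall:.3f}, kappa = {kappa:.3f}")
for c in classes:
    print(f"{c}: producer = {producer[c]:.3f}, user = {user[c]:.3f}, "
          f"omission = {1 - producer[c]:.3f}, commission = {1 - user[c]:.3f}")
```

In this layout the omission error of a class is one minus its producer accuracy, and the commission error is one minus its user accuracy. If the matrix is transposed (rows as ground truth), the two roles swap.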
4. Error Propagation
- The errors in the input will propagate to the output of the operation.
- Error propagation measures the impacts of error (uncertainty) in data on the results of GIS operations.
[Figure: the conceptual diagram repeated, showing Real World, Conception, Measurement and Representation, Data conversion and Analysis (with error propagation), and Result.]
Quantitative error propagation
The output uncertainty is a function of the input errors (uncertainties), assuming that errors are independent and random in the variables x1, x2, ..., xn.
1. One variable
[Equation missing from transcript.]
Quantitative error propagation, cont.
2. Multiple variables: variables in additive or subtractive relations
3. Multiple variables: variables in power law relations
4. Multiple variables: variables in multiply or divide relations
5. Multiple variables: the errors of variables have correlation
If you are interested in knowing more about error propagation, please see this document: http://www.uottawa.ca/academic/arts/geographie/lpcweb/web6102/download/error_prop.pdf
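The equations for these five cases did not survive in the transcript. For orientation, the standard first-order textbook forms are given below. They are an assumption about what the slides showed, not a reconstruction of them; the notation (u for the output, σ for standard error, σ with two subscripts for a covariance) is chosen here for illustration.

```latex
\begin{align}
u = f(x):\quad
  & \sigma_u = \left|\frac{df}{dx}\right| \sigma_x \\
u = x_1 \pm x_2 \pm \dots \pm x_n:\quad
  & \sigma_u = \sqrt{\sigma_{x_1}^2 + \sigma_{x_2}^2 + \dots + \sigma_{x_n}^2} \\
u = x^p:\quad
  & \frac{\sigma_u}{|u|} = |p| \, \frac{\sigma_x}{|x|} \\
u = x_1 x_2 \ \text{or}\ u = x_1 / x_2:\quad
  & \frac{\sigma_u}{|u|} = \sqrt{\left(\frac{\sigma_{x_1}}{x_1}\right)^2 + \left(\frac{\sigma_{x_2}}{x_2}\right)^2} \\
u = f(x_1, \dots, x_n),\ \text{correlated errors}:\quad
  & \sigma_u^2 = \sum_{i} \left(\frac{\partial f}{\partial x_i}\right)^2 \sigma_{x_i}^2
    + 2 \sum_{i<j} \frac{\partial f}{\partial x_i} \frac{\partial f}{\partial x_j} \, \sigma_{x_i x_j}
\end{align}
```

The first four lines assume independent, random errors, as stated on the slide; the last line reduces to the independent case when every covariance term is zero.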
Two Examples
- Suppose three maps at scales of 1:30,000 (map1), 1:50,000 (map2), and 1:250,000 (map3) are used for a final map or analysis. If we assume this is an additive process, the errors of the three maps combine into the error of the final map:
  map1: 7.5 m
  map2: 12.5 m
  map3: 62.5 m
  final map: 64.2 m
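The four figures on this slide (7.5 m, 12.5 m, 62.5 m, and 64.2 m) are consistent with a root-sum-of-squares combination of independent errors. The sketch below reproduces them under one assumption that is inferred from the numbers and not stated in the transcript: each map carries a 0.25 mm error on paper, which is scaled to ground distance by the map scale.

```python
from math import sqrt

def additive_error(errors):
    """Combined error of a sum or difference of independent variables."""
    return sqrt(sum(e ** 2 for e in errors))

MAP_ERROR_MM = 0.25  # assumed error on paper, inferred from the slide's figures
scales = {"map1": 30_000, "map2": 50_000, "map3": 250_000}

# Ground error in metres = map error in mm * scale denominator / 1000.
ground_errors = {name: MAP_ERROR_MM * scale / 1000 for name, scale in scales.items()}
combined = additive_error(ground_errors.values())

print(ground_errors)
print(round(combined, 1))
```

The combined error is dominated by the coarsest map: dropping map1 and map2 entirely would change the result by less than 2 m.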
[Figure: correlated and uncorrelated cases.]
Living with uncertainty
- Uncertainty is inevitable and easy to find.
- Use metadata to document the uncertainty.
- Use sensitivity analysis to find the impacts of input uncertainty on output.
- Rely on multiple sources of data.
- Be honest and informative in reporting the results of GIS analysis.
- The US Federal Geographic Data Committee lists five components of data quality: attribute accuracy, positional accuracy, logical consistency, completeness, and lineage (for details, see www.fgdc.gov).
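One common way to carry out a sensitivity analysis is Monte Carlo simulation: perturb the inputs with random errors many times and look at the spread of the output. The sketch below is a minimal illustration under stated assumptions, not a method taken from the slides. The parcel dimensions, the 1 m Gaussian errors, and the function names are all hypothetical.

```python
import random
from statistics import pstdev

def monte_carlo_spread(func, means, sigmas, runs=20_000, seed=0):
    """Standard deviation of func's output when inputs get Gaussian errors."""
    rng = random.Random(seed)
    outputs = [func(*(rng.gauss(m, s) for m, s in zip(means, sigmas)))
               for _ in range(runs)]
    return pstdev(outputs)

def parcel_area(width, height):
    return width * height

means = (100.0, 50.0)   # metres, hypothetical parcel
sigmas = (1.0, 1.0)     # metres, hypothetical measurement errors

# Perturb both inputs, then one input at a time, to see which matters most.
both = monte_carlo_spread(parcel_area, means, sigmas)
width_only = monte_carlo_spread(parcel_area, means, (sigmas[0], 0.0))
height_only = monte_carlo_spread(parcel_area, means, (0.0, sigmas[1]))

print(f"area spread: both = {both:.0f} m2, width only = {width_only:.0f} m2, "
      f"height only = {height_only:.0f} m2")
```

Here the same 1 m error matters about twice as much on the short side as on the long side, because an error in the height is multiplied by the larger width. Perturbing one input at a time like this shows where better measurement would pay off most.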
Main references
- Paul A. Longley et al., 2001, Geographic Information Systems and Science, John Wiley & Sons.
- ESRI: www.esri.com
- http://www.uottawa.ca/academic/arts/geographie/lpcweb/web6102/download/error_prop.pdf
- FGDC: www.fgdc.gov