Title: Precision, Accuracy, and Numeric Data Types
1. Precision, Accuracy, and Numeric Data Types
- Talbot J. Brooks
- Delta State University
2. Big Questions / Topics for Tonight
- Why such a big deal over projections and coordinate systems?
- How do projections and coordinate systems relate to spatial analysis?
- How does any of this help me get a job?
3. Numerical Data Types
- Integers
  - Short
  - Long
- Decimals
  - Single precision
  - Double precision
4. Short Integer
- 2 bytes = 16 bits
- 2^8 values per byte, so 2 bytes give 2^16 = 65,536 possible values
- Note that the highest value is NOT 65,536
- The RANGE of values is -32,768 to +32,767 (including 0)
5. Long Integer
- 4 bytes = 32 bits
- 2^8 values per byte, so 4 bytes give 2^32 = 4,294,967,296 possible values
- The minimum and maximum are roughly HALF of that number
- The RANGE of values is -2,147,483,648 to +2,147,483,647 (including 0)
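These ranges can be verified directly. A minimal sketch (Python, standard library only) reinterprets values modulo 2^16 as a signed short, showing both the 65,536-value count and the wraparound that occurs one step past the maximum:

```python
import struct

# A signed 16-bit ("short") integer has 2**16 = 65,536 bit patterns,
# covering -32,768 through +32,767.  Packing a value into 2 bytes as
# unsigned and reading it back as signed shows the wraparound that
# occurs one step past the maximum.
def wrap_int16(n):
    """Interpret n modulo 2**16 as a signed 16-bit integer."""
    return struct.unpack('<h', struct.pack('<H', n % 2**16))[0]

print(wrap_int16(32767))   # the true maximum: 32767
print(wrap_int16(32768))   # one past the max wraps to -32768
print(2**16)               # 65,536 values for a 2-byte short
print(2**32)               # 4,294,967,296 values for a 4-byte long
```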
6. Precision
In computer science, precision is often described in terms of the "bus size" of a computer. Think of the bus as a freeway whose capacity is determined by the number of lanes it has. Bits, or more precisely, groups of 8 bits called bytes, travel on these lanes. 32-bit systems, like the ones we use in our lab, have a single precision of 32 bits. Double precision on such a system therefore means 64 bits of precision. Of course, more bits in the mantissa mean higher precision.
7. Single vs. Double Precision
- Single (or "float")
  - 4 bytes = 32 bits
  - 1 sign bit, 8 exponent bits, 23 mantissa bits (24 significant bits counting the implied leading bit)
- Double
  - 8 bytes = 64 bits
  - 1 sign bit, 11 exponent bits, 52 mantissa bits (53 significant bits counting the implied leading bit)
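The practical effect of the shorter mantissa can be seen by rounding a double-precision coordinate down to single precision. A short sketch using only the Python standard library:

```python
import struct

# Python floats are 64-bit doubles.  Rounding one to the nearest 32-bit
# single-precision value shows how much the shorter mantissa discards.
def to_single(x):
    """Round a double to the nearest representable single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

pi = 3.141592653589793        # double precision: ~15-16 significant digits
print(to_single(pi))          # single precision keeps only ~7 digits
print(pi - to_single(pi))     # the rounding error just introduced
print(to_single(0.5))         # powers of two survive exactly: 0.5
```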
8. Accuracy
- Accuracy is the degree to which information on a map or in a digital database matches true or accepted values. Accuracy is an issue pertaining to the quality of data and the number of errors contained in a dataset or map. In discussing a GIS database, it is possible to consider horizontal and vertical accuracy with respect to geographic position, as well as attribute, conceptual, and logical accuracy.
- The level of accuracy required for particular applications varies greatly.
- Highly accurate data can be very difficult and costly to produce and compile.
9. Precision
- Precision refers to the level of measurement and exactness of description in a GIS database. Precise locational data may measure position to a fraction of a unit. Precise attribute information may specify the characteristics of features in great detail. It is important to realize, however, that precise data, no matter how carefully measured, may be inaccurate. Surveyors may make mistakes, or data may be entered into the database incorrectly.
- The level of precision required for particular applications varies greatly. Engineering projects such as road and utility construction require very precise information measured to the millimeter or tenth of an inch. Demographic analyses of marketing or electoral trends can often make do with less, say to the closest zip code or precinct boundary.
- Highly precise data can be very difficult and costly to collect. Carefully surveyed locations needed by utility companies to record the locations of pumps, wires, pipes, and transformers cost $5-20 per point to collect.
10. Put Another Way
- Any given measurement is precise only to the degree of accuracy with which it was made.
- Precision is a function of the repeatability of a measurement.
- How precisely can a measurement be made using the ruler below?
11. Implications
- High precision does not indicate high accuracy, nor does high accuracy imply high precision. But high accuracy and high precision are both expensive. Be aware also that GIS practitioners are not always consistent in their use of these terms. Sometimes the terms are used almost interchangeably, and this should be guarded against.
- Two additional terms are used as well:
  - Data quality refers to the relative accuracy and precision of a particular GIS database. These facts are often documented in data quality reports.
  - Error encompasses both the imprecision of data and its inaccuracies.
12. The Unknown
- Neither accuracy nor precision can be evaluated without knowing and understanding the potential and real sources of error.
- Real sources
  - Instrumental, environmental, or numeric (calculation-based)
  - Can be assessed and taken into account
- Potential sources
  - Mistakes, inconsistency, general sloppy work
  - Impossible to assess without detailed documentation (metadata)
13. Types of Error in GIS
- Positional accuracy and precision
- Attribute accuracy and precision
- Conceptual accuracy and precision
- Logical accuracy and precision
- Numeric accuracy and precision
14. Positional Accuracy and Precision
- Applies to both horizontal and vertical positions.
- Accuracy and precision are a function of the scale at which a map (paper or digital) was created. The mapping standards employed by the United States Geological Survey specify the requirements for horizontal accuracy: 90 percent of all measurable points must be within 1/30th of an inch for maps at a scale of 1:20,000 or larger, and within 1/50th of an inch for maps at scales smaller than 1:20,000.
- Accuracy standards for various scale maps:
  - 1:1,200 = 3.33 feet
  - 1:2,400 = 6.67 feet
  - 1:4,800 = 13.33 feet
  - 1:10,000 = 27.78 feet
  - 1:12,000 = 33.33 feet
  - 1:24,000 = 40.00 feet
  - 1:63,360 = 105.60 feet
  - 1:100,000 = 166.67 feet
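The table follows directly from the standard: 1/30 inch of map error at 1:20,000 or larger, 1/50 inch at smaller scales, converted to feet on the ground. A sketch reproducing it:

```python
# USGS National Map Accuracy Standard: points must fall within 1/30 inch
# on the map at scales of 1:20,000 or larger, 1/50 inch at smaller scales.
# Converting that map-sheet tolerance to ground distance reproduces the
# accuracy table above.
def nmas_ground_accuracy_feet(scale_denominator):
    """Horizontal tolerance on the ground, in feet, for a 1:n map."""
    map_error_inches = 1 / 30 if scale_denominator <= 20000 else 1 / 50
    return scale_denominator * map_error_inches / 12   # 12 inches per foot

for n in (1200, 2400, 4800, 10000, 12000, 24000, 63360, 100000):
    print(f"1:{n:,}  {nmas_ground_accuracy_feet(n):.2f} feet")
```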
15. Implications
- This means that when we see a point on a map we have its "probable" location within a certain area. The same applies to lines.
- Beware of the dangers of false accuracy and false precision, that is, reading locational information from a map to levels of accuracy and precision beyond those at which it was created. This is a very great danger in computer systems that allow users to pan and zoom at will to an infinite number of scales. Accuracy and precision are tied to the original map scale and do not change even if the user zooms in and out. Zooming in and out can, however, mislead the user into believing, falsely, that the accuracy and precision have improved.
16. Attribute Accuracy and Precision
- The non-spatial data linked to location may also be inaccurate or imprecise. Inaccuracies may result from mistakes of many sorts. Non-spatial data can also vary greatly in precision. Precise attribute information describes phenomena in great detail. For example, a precise description of a person living at a particular address might include gender, age, income, occupation, level of education, and many other characteristics. An imprecise description might include just income, or just gender.
17. Conceptual Accuracy and Precision
- GIS depend upon the abstraction and classification of real-world phenomena. The user determines what amount of information is used and how it is classified into appropriate categories. Sometimes users may use inappropriate categories or misclassify information. For example, classifying cities by voting behavior would probably be an ineffective way to study fertility patterns. Failing to classify power lines by voltage would limit the effectiveness of a GIS designed to manage an electric utility's infrastructure. Even if the correct categories are employed, data may be misclassified. A study of drainage systems may involve classifying streams and rivers by "order," that is, where a particular drainage channel fits within the overall tributary network. Individual channels may be misclassified if tributaries are miscounted. Yet some studies might not require such a precise categorization of stream order at all. All they may need is the location and names of all streams and rivers, regardless of order.
18. How do conceptual precision and accuracy relate to the GIS pipeline?
19. Logical Accuracy and Precision
- Information stored in a database can be employed illogically. For example, permission might be given to build a residential subdivision on a floodplain unless the user compares the proposed plat with floodplain maps. Then again, building may be possible on some portions of a floodplain, but the user will not know unless variations in flood potential have also been recorded and are used in the comparison. The point is that information stored in a GIS database must be used and compared carefully if it is to yield useful results. GIS systems are typically unable to warn the user if inappropriate comparisons are being made or if data are being used incorrectly. Some rules for use can be incorporated in GIS designed as "expert systems," but developers still need to make sure that the rules employed match the characteristics of the real-world phenomena they are modeling.
- Finally, it would be a mistake to believe that highly accurate and highly precise information is needed for every GIS application. The need for accuracy and precision will vary radically depending on the type of information coded and the level of measurement needed for a particular application. The user must determine what will work. Excessive accuracy and precision are not only costly but can cause considerable delays.
20. Can you think of an example where a logical error has been made?
- (besides electing Bush as President)
21. Numeric Error
- Computers can only perform numeric calculations out to a certain number of decimal places before introducing rounding errors.
- Numeric error was the source of a failed Patriot Missile intercept during the First Gulf War and was directly responsible for allowing a Scud to hit the US Army barracks at Dhahran.
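The Patriot failure is a concrete case of fixed-point rounding error. The sketch below follows the commonly reported reconstruction of the arithmetic (0.1 second truncated after 23 binary fraction bits in a 24-bit register, 100 hours of continuous operation); the figures are illustrative, not taken from the original source:

```python
# 0.1 has no exact binary representation, so storing it in a 24-bit
# fixed-point register introduces a tiny truncation error on every
# clock tick, and the error accumulates over time.
ticks_per_second = 10                      # the clock counted tenths of a second
ticks = ticks_per_second * 3600 * 100      # ticks in 100 hours of uptime

stored_tenth = int(0.1 * 2**23) / 2**23    # 0.1 truncated after 23 fraction bits
error_per_tick = 0.1 - stored_tenth        # roughly 9.5e-8 seconds per tick

drift_seconds = error_per_tick * ticks
print(f"clock drift after 100 hours: {drift_seconds:.2f} s")

# At a closing speed of roughly 1,676 m/s, that drift shifts the
# tracking gate by more than half a kilometer:
print(f"tracking error: {drift_seconds * 1676:.0f} m")
```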
22.
- Since GIS process data digitally, numeric errors may be introduced during the conversion process.
- A drawing may be put together very precisely and accurately, but not only is that precision and accuracy lost in translation, it is likely not accounted for by how the computer stores the resultant data.
23. Data Storage
- Precision and accuracy are also a function of how the data are stored (integer, single/float, or double).
- Juan, please explain!
24. (No Transcript)
25. What computer maker screwed up big-time and produced huge computational error?
- (Definitely not intelligent)
26. Sources of Error
- Burrough (1986) divides sources of error into three main categories:
  - Obvious sources of error
  - Errors resulting from natural variations or from original measurements
  - Errors arising through processing
- Generally, errors of the first two types are easier to detect than those of the third, because errors arising through processing can be quite subtle and may be difficult to identify. Burrough further divided these main groups into several subcategories.
27. Obvious Sources of Error
- Age of data
- Areal coverage
- Map scale
- Density of observations
- Relevance (remember Mike's example?)
- Format
- Accessibility
- Cost
28. Errors Resulting from Natural Variation or from Original Measurements
- Positional accuracy
- Content accuracy
- Sources of variation within data
29. Errors Arising Through Processing
- Numerical errors (previously discussed)
- Errors through topologic analysis
- Classification and generalization problems
- Digitizing and geocoding errors
30. The Problems of Propagation and Cascading
- The discussion has focused to this point on errors that may be present in single sets of data. GIS usually depend on comparisons of many sets of data. This schematic diagram shows how a variety of discrete datasets may have to be combined and compared to solve a resource analysis problem. It is unlikely that the information contained in each layer is of equal accuracy and precision. Errors may also have been made compiling the information. If this is the case, the solution to the GIS problem may itself be inaccurate, imprecise, or erroneous. The point is that inaccuracy, imprecision, and error may be compounded in GIS that employ many data sources. There are two ways in which this compounding may occur.
31. Propagation
- Propagation occurs when one error leads to another. For example, if a map registration point has been mis-digitized in one coverage and is then used to register a second coverage, the second coverage will propagate the first mistake. In this way, a single error may lead to others and spread until it corrupts data throughout the entire GIS project. To avoid this problem, use the largest-scale map to register your points.
- Often propagation occurs in an additive fashion, as when maps of different accuracy are collated.
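When the errors of collated layers are assumed independent, a common way to estimate the combined positional error is the root-sum-of-squares, with a straight sum as the additive worst case. A sketch with hypothetical layer accuracies:

```python
import math

# Combining layers of unequal accuracy: if the positional errors of the
# layers are independent, the root-sum-of-squares (RSS) is a common
# estimate of the combined error; simply summing them gives the additive
# worst case.  Layer accuracies in feet (hypothetical sources):
layer_errors_ft = [40.0, 33.3, 13.3]   # e.g. 1:24,000, 1:12,000, 1:4,800 maps

worst_case = sum(layer_errors_ft)
rss = math.sqrt(sum(e ** 2 for e in layer_errors_ft))

print(f"additive worst case: {worst_case:.1f} ft")
print(f"root-sum-of-squares: {rss:.1f} ft")
```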
32. Cascading
- Cascading means that erroneous, imprecise, and inaccurate information will skew a GIS solution when information is combined selectively into new layers and coverages. In a sense, cascading occurs when errors are allowed to propagate unchecked from layer to layer repeatedly.
- The effects of cascading can be very difficult to predict. They may be additive or multiplicative and can vary depending on how information is combined, that is, from situation to situation. Because cascading can have such unpredictable effects, it is important to test for its influence on a given GIS solution. This is done by calibrating a GIS database using techniques such as sensitivity analysis. Sensitivity analysis allows the user to gauge how, and how much, errors will affect solutions. Calibration and sensitivity analysis are discussed in Managing Error.
- It is also important to realize that propagation and cascading may affect horizontal, vertical, attribute, conceptual, and logical accuracy and precision.
33. Beware of False Precision and False Accuracy!
- GIS users are not always aware of the difficult problems caused by error, inaccuracy, and imprecision. They often fall prey to false precision and false accuracy, that is, they report their findings to a level of precision or accuracy that is impossible to achieve with their source materials. If locations on a GIS coverage are only measured within a hundred feet of their true position, it makes no sense to report predicted locations in a solution to a tenth of a foot. That is, just because computers can store numeric figures down to many decimal places does not mean that all those decimal places are "significant." It is important for GIS solutions to be reported honestly and only to the level of accuracy and precision they can support. In practice this means that GIS solutions are often best reported as ranges or rankings, or presented within statistical confidence intervals.
34. The Dangers of Undocumented Data
- Given these issues, it is easy to understand the
dangers of using undocumented data in a GIS
project. Unless the user has a clear idea of the
accuracy and precision of a dataset, mixing this
data into a GIS can be very risky. Data that you
have prepared carefully may be disrupted by
mistakes someone else made. This brings up three
important issues.
35. Managing Error
36. (1) Setting Standards for Procedures and Products
- No matter what the project, standards should be
set from the start. Standards should be
established for both spatial and non-spatial data
to be added to the dataset. Issues to be resolved
include the accuracy and precision to be invoked
as information is placed in the dataset,
conventions for naming geographic features,
criteria for classifying data, and so forth. Such
standards should be set both for the procedures
used to create the dataset and for the final
products. Setting standards involves three steps.
37. (2) Establishing Criteria that Meet the Specific Demands of a Project
- Standards are not arbitrary; they should suit the demands of accuracy, precision, and completeness determined for a project. The federal government and many state governments have established standards that meet the needs of a wide range of mapping and GIS projects in their domains. Other users may follow these standards if they apply, but often the designer must carefully establish standards for particular projects.
Picking arbitrarily high levels of precision,
accuracy, and completeness simply adds time and
expense. Picking standards that are too low means
the project may not be able to reach its
analytical goals once the database is compiled.
Indeed, it is perhaps best to consider standards
in the light of ultimate project goals. That is,
how accurate, precise, and complete will a
solution need to be? The designer can then work
backward to establish standards for the
collection and input of raw data. Sensitivity
analysis (discussed below) applied to a prototype
can also help to establish standards for a
project.
38. (3) Training People Involved to Meet Standards, Including Practice
- The people who will be compiling and entering
data must learn how to apply the standards to
their work. This includes practice with the
standards so that they learn to apply them as a
natural part of their work. People working on the
project should be given a clear idea of why the
standards are being employed. If standards are
enforced as a set of laws or rules without
explanation, they may be resisted or subverted.
If the people working on a project know why the
standards have been set, they are often more
willing to follow them and to suggest procedures
that will improve data quality.
39. Testing That the Standards Are Being Employed Throughout a Project and Are Reached by the Final Products
- Regular checks and tests should be employed throughout a project to make sure that standards are being followed. This may include the regular testing of all data added to the dataset or may involve spot checks of the materials. This allows the designer to pinpoint difficulties at an early stage and correct them.
- Examples of data standards:
  - USGS Geospatial Data Standards
  - Information on the Spatial Data Transfer Standard
  - USGS Map Accuracy Standards
40. Documenting Procedures and Products: Data Quality Reports
- Standards for procedures and products should always be documented in writing or in the dataset itself. Data documentation should include information about how data was collected and from what sources, how it was preprocessed and geocoded, how it was entered in the dataset, and how it is classified and encoded. On larger projects, one person or a team should be assigned responsibility for data documentation. Documentation is vitally important to the value and future use of a dataset. The saying is that an undocumented dataset is a worthless dataset. By and large, this is true. Without clear documentation, a dataset cannot be expanded and cannot be used by other people or organizations now or in the future.
41. Measuring and Testing Products
- GIS datasets should be checked regularly against
reality. For spatial data, this involves checking
maps and positions in the field or, at least,
against sources of high quality. A sample of
positions can be resurveyed to check their
accuracy and precision. The USGS employs a
testing procedure to check on the quality of its
digital and paper maps, as does the Ordnance
Survey. Indeed, the Ordnance Survey continues
periodically to test maps and digital datasets
long after they have first been compiled. If too
many errors crop up, or if the mapped area has
changed greatly, the work is updated and
corrected.
42.
- Non-spatial attribute data should also be checked either against reality or against a source of equal or greater quality. The particular tests employed will, of course, vary with the type of data used and its level of measurement. Indeed, many different tests have been developed to test the quality of interval, ordinal, and nominal data. Both parametric and nonparametric statistical tests can be employed to compare true values (those observed "on the ground") with those recorded in the dataset.
- Cohen's Kappa provides just one example of the types of test employed, this one for nominal data. The following example shows how data on land cover stored in a database can be tested against reality.
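As a sketch of such a test, Cohen's Kappa can be computed in a few lines; the land-cover classes and sample values below are hypothetical:

```python
from collections import Counter

# Cohen's Kappa: agreement between two nominal classifications, corrected
# for the agreement expected by chance.
def cohens_kappa(rated_a, rated_b):
    n = len(rated_a)
    observed = sum(a == b for a, b in zip(rated_a, rated_b)) / n
    freq_a, freq_b = Counter(rated_a), Counter(rated_b)
    # Chance agreement: both ratings pick the same class independently
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

database = ["forest", "forest", "water", "urban", "forest", "urban", "water", "forest"]
field    = ["forest", "urban",  "water", "urban", "forest", "urban", "forest", "forest"]
print(f"kappa = {cohens_kappa(database, field):.3f}")   # 0.600
```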
43. (No Transcript)
44. Calibrating a Dataset to Ascertain How Error Influences Solutions
- Solutions reached by GIS analysis should be checked or calibrated against reality. The best way to do this is to check the results of a GIS analysis against findings produced from completely independent calculations. If the two agree, then the user has some confidence that the data and modeling procedure are valid.
- This process of checking and calibrating a GIS is often referred to as sensitivity analysis. Sensitivity analysis allows the user to test how variations in data and modeling procedure influence a GIS solution. What the user does is vary the inputs of a GIS model, or the procedure itself, to see how each change alters the solution. In this way, the user can judge quite precisely how data quality and error will influence subsequent modeling.
- This is quite straightforward with interval/ratio input data. The user tests to see how an incremental change in an input variable changes the output of the system. From this, the user can derive the "marginal sensitivity" to an input and establish "marginal weights" to compensate for error.
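The interval/ratio case can be sketched as a one-at-a-time perturbation: bump each input by a unit increment and record the change in the output. The runoff model and its coefficients below are entirely hypothetical:

```python
# One-at-a-time sensitivity analysis for interval/ratio inputs: perturb
# each input by a unit increment and record how much the output moves.
def model(rainfall_mm, slope_pct, soil_perm):
    """Toy runoff score combining three input layers (made-up weights)."""
    return 0.8 * rainfall_mm + 1.5 * slope_pct - 2.0 * soil_perm

baseline = {"rainfall_mm": 50.0, "slope_pct": 6.0, "soil_perm": 3.0}
base_out = model(**baseline)

for name in baseline:
    bumped = dict(baseline)
    bumped[name] += 1.0                        # unit change in one input
    marginal = model(**bumped) - base_out      # marginal sensitivity
    print(f"{name}: {marginal:+.2f} output change per unit of input")
```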
45.
- But sensitivity analysis can also be applied to nominal (categorical) and ordinal (ranked) input data. In these cases, data may be purposefully misclassified or misranked to see how such errors will change a solution.
- Sensitivity analysis can also be used during system design and development to test the levels of precision and accuracy required to meet system goals. That is, users can experiment with data of differing levels of precision and accuracy to see how they perform. If a test solution is not accurate or precise enough in one pass, the levels can be refined and tested again. Such testing of accuracy and precision is very important in large GIS projects that will generate large quantities of data. It is of little use (and tremendous cost) to gather and store data to levels of accuracy and precision beyond what is needed to meet a particular modeling need.
46.
- Sensitivity analysis can also be useful at the design stage in testing the theoretical parameters of a GIS model. It is sometimes the case that a factor, though of seemingly great theoretical importance to a solution, proves to be of little value in solving a particular problem. For example, soil type is certainly important in predicting crop yields, but if soil type varies little in a particular region, it is a waste of time entering it into a dataset designed for this purpose. Users can check for such situations by selectively removing certain data layers from the modeling process. If they make no difference to the solutions, then no further data entry needs to be made.
47. Sensitivity Analysis
- A small town adjacent to both a national forest and an air force base must increase its water capacity. The city hired a consulting firm to assist water board planners in determining different courses of action to increase municipal water capacity. Using GIS analysis based on geologic, hydrological, and land-use data, and on proximity to the town, the consultant determined that four well sites are suitable to meet the town's needs. Although each site is suitable, there are several options that must be considered before choosing the final site. Water from the wells can be piped via the shortest route or by using existing rights-of-way (ROW). The cost is variable due to distance and trenching difficulty. Water may also be treated either on site or piped raw to the current city treatment plant. For the purposes of this example, drilling costs are constant. Therefore, each site has four variable costs depending on piping route and location of treatment.
48. (No Transcript)
49.
- There is no single best solution. Political or policy considerations may require a solution that is not necessarily the least expensive; in other words, cost may not be the only factor in the decision-making process. Instead, each site is ranked according to the variables. Each well site and its variables are examined below.
50. (No Transcript)
51.
- Well 2 is situated on municipal property within the city. Trenching costs are higher for either method because streets and sidewalks will be torn up and then have to be repaired. Treatment is less expensive at the water plant; on-site treatment would require purchasing additional property for a treatment facility.
52. (No Transcript)
53.
- Well 3 is situated on a large dairy farm. The owner of the property is not willing to sell the required land or pipeline easement, and property condemnation would be required. Therefore, piping via the highway easement is less expensive. Additionally, the direct path would require trenching under the river or constructing a pipeline bridge. Treatment costs are only slightly different.
54. (No Transcript)
55.
- Well 4 is on US Air Force property and is co-located with the base water well. Although the piping costs are lower, treatment costs are significantly higher due to increased contaminants in the water compared to other sites.
56. (No Transcript)
57.
- As you can see in the following table, none of the options is the optimal solution in every case. Also, increasing the number of variables, such as different drilling costs, water quality, allowances for unknown factors, and production life, would increase the number of permutations and further complicate site ranking. Additionally, if variables change for a site, for example when new or different data become available, the ranking will probably also change. In other words, the answer is not always "cut and dried." In this case each option has advantages and disadvantages and is ranked accordingly; a high rank for one option may be offset by a lower ranking for another.
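The ranking logic can be sketched in a few lines: for each piping/treatment option, find the cheapest site, then count how often each site wins. All dollar figures below are hypothetical stand-ins for the deck's table:

```python
# Ranking sites rather than naming a single "best" one: for each
# piping/treatment option, find the cheapest site, then count wins.
options = ["direct/on-site", "direct/plant", "ROW/on-site", "ROW/plant"]
costs = {                      # cost per site, per option, in $1,000s (made up)
    "Well 1": [120, 110, 140, 125],
    "Well 2": [150, 105, 160, 115],
    "Well 3": [170, 155, 130, 120],
    "Well 4": [100, 145, 105, 150],
}

wins = {site: 0 for site in costs}
for i, option in enumerate(options):
    cheapest = min(costs, key=lambda site: costs[site][i])
    wins[cheapest] += 1
    print(f"{option}: cheapest is {cheapest}")

print("times cheapest:", wins)   # no single site wins in every situation
```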
58. (No Transcript)
59. Back to Managing Error
- Report results in terms of the uncertainties of the data!
- Too often GIS projects fall prey to the problem of false precision, that is, reporting results to a level of accuracy and precision unsupported by the intrinsic quality of the underlying data. Just because a system can store numeric solutions down to four, six, or eight decimal places does not mean that all of those places are significant. Common practice allows users to round down one decimal place below the level of measurement. Beyond that, the remaining digits are meaningless. As examples of what this means, consider:
  - Population figures are reported in whole numbers (5,421, 10,238, etc.), meaning that calculations can be carried down 1 decimal place (a density of 21.5, a mortality rate of 10.3).
  - If forest coverage is measured to the closest 10 meters, then calculations can be rounded to the closest 1 meter.
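The rounding rule in these two examples can be sketched as follows (the population, area, and length values are hypothetical):

```python
# "Round one decimal place below the level of measurement": derived values
# are reported one step finer than the precision of the inputs, no further.
def round_to_step(value, step):
    """Round value to the nearest multiple of step."""
    return round(value / step) * step

# Population counted in whole persons -> report density to 0.1
density = 5421 / 252.3            # persons per square mile
print(round(density, 1))          # 21.5

# Forest cover measured to the nearest 10 m -> report lengths to 1 m
edge_m = 1234.5678                # meters, as stored by the computer
print(round_to_step(edge_m, 1))   # 1235
```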
60.
- A second problem is false certainty, that is, reporting results with a degree of certitude unsupported by the natural variability of the underlying data. Most GIS solutions involve employing a wide range of data layers, each with its own natural dynamics and variability. Combining these layers can exacerbate the problem of arriving at a single, precise solution. Sensitivity analysis (discussed above) helps to indicate how much variations in one data layer will affect a solution. But GIS users should carry this lesson all the way to final solutions. These solutions are often best reported in terms of ranges, confidence intervals, or rankings. In some cases, this involves preparing high, low, and mid-range estimates of a solution based upon maximum, minimum, and average values of the data used in a calculation.
61.
- You will notice that the site-selection case considered above reported its results in terms of rankings. Each site was optimal in certain confined situations, but only a couple proved optimal in more than one situation. The results rank the number of times each site came out ahead in terms of total cost.
- In situations where statistical analysis is possible, the use of confidence intervals is recommended. Confidence intervals establish the probability of a solution falling within a certain range (e.g., a 95% probability that a solution falls between 100 m and 150 m).
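A minimal sketch of reporting a measured quantity as a 95% confidence interval rather than a single figure (the sample of distances is hypothetical; 1.96 is the standard z-value for a two-sided 95% interval):

```python
import statistics

# Report a GIS-derived quantity as a confidence interval instead of a
# single number: mean plus or minus 1.96 standard errors.
sample_m = [118.0, 124.5, 131.2, 127.8, 122.3, 129.9, 125.4, 121.7]

mean = statistics.mean(sample_m)
sem = statistics.stdev(sample_m) / len(sample_m) ** 0.5   # standard error
low, high = mean - 1.96 * sem, mean + 1.96 * sem

print(f"95% CI: {low:.1f} m to {high:.1f} m (mean {mean:.1f} m)")
```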
62. References
- http://www.colorado.edu/geography/gcraft/notes/error/error_f.html
- Antenucci, J.C., Brown, K., Croswell, P.L., Kevany, M., and Archer, H. 1991. Geographic Information Systems: A Guide to the Technology. Chapman and Hall, New York.
- Burrough, P.A. 1986. Principles of Geographical Information Systems for Land Resources Assessment. Clarendon Press, Oxford.
- Koeln, G.T., Cowardin, L.M., and Strong, L.L. 1994. "Geographic Information Systems." P. 540 in T.A. Bookhout, ed. Research and Management Techniques for Wildlife and Habitats. The Wildlife Society, Bethesda.
- Muehrcke, P.C. 1986. Map Use: Reading, Analysis, and Interpretation. 2d ed. JP Publications, Madison.
- Sample, V.A. (ed.). 1994. Remote Sensing and GIS in Ecosystem Management. Island Press, Washington, D.C.
- Star, J. and Estes, J. 1990. Geographic Information Systems: An Introduction. Prentice Hall, Englewood Cliffs.
- Tufte, E.R. 1990. Envisioning Information. Graphics Press, Cheshire, Conn.