Title: Logistic Regression
1Logistic Regression
- Often, the spatial phenomenon under investigation can only be described by a categorical variable.
- Wildfires are typically depicted with polygons showing burned vs. not burned areas.
- Or, bird distribution is mapped as the presence or absence of birds.
- The previous regression technique is not suitable because the dependent variable is neither interval nor ratio.
- Logistic regression treats the distribution in a probabilistic manner; that is, the occurrence of the study phenomenon is evaluated in terms of probability.
2Logistic Regression
- If the probability of presence of a phenomenon is Pa and Pb represents the probability of its absence, then
- Pa + Pb = 1
- Pa = exp(Ua) / (1 + exp(Ua))
- Ua = b0 + b1X1 + b2X2 + ... + bnXn + e
- Ua is the utility function of event a, expressed as a linear combination of a number of explanatory variables X1, X2, ..., Xn; bn is the estimated parameter of variable Xn
3Logistic Regression
- A greater value of Ua implies a greater probability for the event to take place. When Ua approaches infinity, Pa approaches 1, indicating a high likelihood for the event to occur. When Ua approaches negative infinity, Pa approaches 0.
- When Ua equals zero, the probability is .50, implying a 50/50 chance for the event to occur.
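The link between the utility Ua and the probability Pa can be sketched in Python. This is a minimal illustration that assumes the standard logistic form Pa = exp(Ua) / (1 + exp(Ua)); the function names `utility` and `probability` and all coefficient values are hypothetical and do not come from the slides.

```python
import math

def utility(coefficients, values, intercept=0.0):
    """Linear combination Ua = b0 + b1*X1 + ... + bn*Xn (error term omitted)."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

def probability(u):
    """Logistic transform of the utility, written to avoid overflow for large |u|."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

# Hypothetical coefficients and explanatory values, for illustration only.
ua = utility([0.8, -0.3], [1.5, 2.0], intercept=-1.0)
print(round(ua, 3), round(probability(ua), 3))

print(probability(0.0))    # a utility of zero gives a 50/50 chance
print(probability(50.0))   # large positive utility: probability close to 1
print(probability(-50.0))  # large negative utility: probability close to 0
```

The two branches in `probability` are algebraically the same function; splitting on the sign of the utility only keeps `math.exp` away from very large arguments.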
4Logistic Regression Example
- Example from Chou
- Fires in the San Jacinto Ranger District of the San Bernardino National Forest were examined to map the distribution of fire occurrence probability. The basic model consisted of eight independent variables.
- Area, perimeter, vegetation, proximity to buildings, proximity to campgrounds, proximity to roads, maximum temperature in July, and annual precipitation
5Variables in Fire Distribution Study
- X1 Area: area of geographic unit
- X2 Perimeter: perimeter of geographic unit
- X3 Vegetation: vegetation computed by rotation period
- X4 Building: proximity to structures
- X5 Campground: proximity to campgrounds
- X6 Road: proximity to roads
- X7 Temperature: maximum temperature in July
- X8 Precipitation: annual precipitation
- The dependent variable is a code indicating whether or not a geographic unit is burned. Area and perimeter provide general geometric characteristics. Vegetation, precipitation, and temperature represent environmental factors, while building, campground, and road represent human-related factors.
6Results of Logistic Regression
- The model indicates that perimeter, vegetation, campground, road, and temperature are variables to be included in the model. Other variables are not included because their coefficients are not statistically different from 0.
7Results of Logistic Regression
- The percentage-correctly-estimated (PCE) index shows the maximum level of estimation accuracy of a model.
- In this example, PCE is 60%, not much better than a random 50/50 chance.
- Therefore, another parameter was evaluated
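The slides do not say how the PCE index is computed. The sketch below assumes the simplest reading: classify each unit as burned when its estimated probability reaches a cutoff of 0.5, then report the percentage of units whose class matches the observation. The function name, the cutoff, and the data are hypothetical.

```python
def percentage_correctly_estimated(observed, predicted_probs, cutoff=0.5):
    """Percentage of units whose predicted class matches the observed class."""
    if len(observed) != len(predicted_probs):
        raise ValueError("observed and predicted_probs must have the same length")
    correct = sum(
        1 for y, p in zip(observed, predicted_probs)
        if (1 if p >= cutoff else 0) == y
    )
    return 100.0 * correct / len(observed)

# Hypothetical data: 1 = burned, 0 = not burned (not taken from the fire study).
observed = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [0.7, 0.4, 0.3, 0.6, 0.55, 0.2, 0.8, 0.65, 0.45, 0.35]
print(percentage_correctly_estimated(observed, predicted))
```

With a balanced binary outcome, a value near 50 means the model does little better than guessing, which is the comparison the slide draws.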
8Alternative Model
- Included an additional variable to determine whether it makes any significant difference in model performance
- New variable represents neighborhood effects, or conditions of the surrounding geographic units
- Assumes that fire occurrence probability is affected not only by the environmental and human-related variables listed in the basic model, but also by the distribution of fire occurrence probability of adjacent units
- The new spatial term X9 is defined by the percentage of neighboring units that were burned during the study period
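One way to derive the spatial term X9 is sketched below, assuming the adjacency between geographic units is already available as a lookup table. The function name, the unit identifiers, and the choice to assign 0 to units without neighbors are all assumptions made for illustration.

```python
def neighborhood_burned_percentage(adjacency, burned):
    """For each unit, the percentage of its adjacent units that were burned."""
    x9 = {}
    for unit, neighbors in adjacency.items():
        if not neighbors:
            x9[unit] = 0.0  # assumption: units with no neighbors get 0
            continue
        burned_count = sum(1 for n in neighbors if burned[n])
        x9[unit] = 100.0 * burned_count / len(neighbors)
    return x9

# Hypothetical units A-D with their adjacent units and burn status.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}
burned = {"A": True, "B": False, "C": True, "D": False}
print(neighborhood_burned_percentage(adjacency, burned))
```

A unit's own burn status is not counted, so X9 describes only the surrounding units.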
9New Results
- Results from the new study are quite different
- Only two variables are statistically significant: vegetation and neighborhood effects
- Vegetation appears to be the determining environmental variable in the distribution of wildfires in the study area
- Finally, wildfires are influenced by neighborhood conditions
10Testing Statistical Significance
- Did the neighborhood effects significantly change the model? This requires a chi-square test of the likelihood ratio, where L0 denotes the likelihood of the basic model and L1 denotes the likelihood of the study model
- Statistical testing suggests that the neighborhood variable significantly improved the performance of the model
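The test can be sketched as follows, assuming the usual likelihood ratio statistic -2 ln(L0 / L1) = -2 (ln L0 - ln L1) and one degree of freedom because a single variable is added. The log-likelihood values are hypothetical; the slides do not report the actual ones.

```python
import math

def likelihood_ratio_test_1df(log_l0, log_l1):
    """Chi-square likelihood ratio test for one added parameter.

    log_l0: log-likelihood of the basic model
    log_l1: log-likelihood of the model with the extra variable
    """
    statistic = -2.0 * (log_l0 - log_l1)
    # For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value

# Hypothetical log-likelihoods, not the values from the fire study.
stat, p = likelihood_ratio_test_1df(log_l0=-120.0, log_l1=-110.0)
print(stat, p)
```

A small p-value indicates that the added variable improves the fit by more than chance alone would explain. The `erfc` shortcut holds only for one degree of freedom; adding several variables at once would need a general chi-square tail function.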
11Procedure for Regression Analysis (Barber, p. 448)
- Specify the variables in the model and the exact form of the relationship between them
- Collect data
- Estimate the parameters of the model
- Statistically test the utility of the developed model, and check whether the assumptions of the simple linear regression model are satisfied
- Use the model for prediction
12Example of Data Manipulation and Programming in
ArcView
- Manipulating Yield Data with DataManipulation.ave
13Spatial Prediction of Landslide Hazard Using
Logistic Regression and GIS
- Art Lembo
- 620 Presentation
- Based on paper by Gorsevski, Gessler, and Foltz
14Introduction
- Landslides are natural geologic processes that cause different types of damage, resulting in billions of dollars in losses and thousands of deaths each year
- 95% of landslides occur in developing countries
15Causes of Landslides
- Human activities, such as deforestation and urban expansion, accelerate the process of landslides
- Roads and harvest activities in timberlands increase the occurrence of landslides
- In undisturbed forest, soil erosion is generally negligible
16Clearwater National Forest
- 1995-1996
- Major landslides occurred during the winter following heavy rains, snowmelt, and high river flow
- Over 900 landslides were recorded on the unstable slopes of the forest
- Landslide occurrence was widely distributed and included artificial slopes such as road cuts and fills, or natural slopes in clearcut areas
18Landslide Data
- Within the large remote area, a DEM was used to generate quantitative topographic attributes
- Slope, elevation, aspect, profile curvature, tangent curvature, plan curvature, flow path, and contributing area
- Photo interpretation and field inventory identified landslide areas
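As an illustration of deriving one topographic attribute from a DEM, the sketch below computes slope with central differences on a small grid. It is not the procedure used in the study; the function name, the finite-difference scheme, and the sample elevations are assumptions.

```python
import math

def slope_degrees(dem, cell_size):
    """Slope in degrees for interior cells of a DEM given as a list of rows.

    Uses central differences; border cells are left as None.
    """
    rows, cols = len(dem), len(dem[0])
    slope = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2.0 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2.0 * cell_size)
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return slope

# Hypothetical 3 x 3 DEM (metres) on a 30 m grid, rising 30 m per cell eastward.
dem = [
    [100.0, 130.0, 160.0],
    [100.0, 130.0, 160.0],
    [100.0, 130.0, 160.0],
]
print(slope_degrees(dem, cell_size=30.0)[1][1])
```

GIS packages typically use a weighted 3 x 3 window rather than plain central differences, so values from this sketch can differ slightly from those of a production tool.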
20Considerations in Creating Hazard Models
- Datasets combined and stored in a GIS database
- Hazard model assumptions
- Strength of a model depends on the quality of the data collected
- Data-driven models are not appropriate to extrapolate to neighboring areas
- Climatic conditions may change so that the past is not an indicator of the future
- Uncertainty exists when a hazard map is derived from a statistically based model
21Models Used in Study
- Logistic regression was used, which correlated the environmental attributes and landslide distribution
- Because of the existence of uncertainty, a Receiver Operating Characteristic (ROC) curve was used; it plots the proportion of false positives against the proportion of true positives at each level of the criterion
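The construction of such a curve can be sketched as follows: each distinct predicted probability is tried as the cutoff, and the false positive and true positive proportions are recorded. The function names, labels, and scores are hypothetical, and the trapezoidal area is a common summary that the slides do not mention.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) at each score cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under ROC points ordered by increasing false positive rate."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical cells: 1 = landslide present, 0 = absent, with model probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pts = roc_points(labels, scores)
print(pts)
print(area_under_curve(pts))
```

A decision maker can then pick the cutoff whose trade-off between missed landslides and false alarms is acceptable.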
22Assessing Landslide Hazard
- Field inspection using a checklist to identify sites susceptible to landsliding
- Projection of future patterns of instability from analysis of landslide inventories
- Multivariate analysis of factors characterizing observed sites of slope instability
- Stability ranking based on criteria such as slope, land forms, or geologic structure
- Failure probability analysis based on slope stability models with stochastic hydrologic simulation
23Preparing the Data
- Primary and secondary attributes are derived from a 30 m DEM, reducing the high cost of collecting the data
- Landslides assessed through aerial reconnaissance
- Landslide hazard areas are then identified based on spatial correlation between the attributes derived from the DEM
- ROC curves used for decision making
24Data Sampling
- 15% of non-landslide cells were randomly sampled for an absence of landslides
- A multivariate subset was derived from the coverages where landslides were absent
- The landslide coverage was a point data set that sampled grid cells where landslides were present
- Both samples were joined together, with a dependent variable that had a binary response (present or absent)
- Final output stored in ASCII and used in SAS
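The sampling step can be sketched as follows. The study did this with GIS coverages and SAS; the Python below is only an illustration, and the function name, the cell identifiers, and the fixed random seed are assumptions.

```python
import random

def build_sample(absence_cells, presence_cells, fraction=0.15, seed=0):
    """Join a random sample of absence cells with all presence cells.

    Each record is (cell_id, response), where response is 1 for landslide
    present and 0 for landslide absent.
    """
    rng = random.Random(seed)
    n_absent = round(fraction * len(absence_cells))
    sampled_absent = rng.sample(absence_cells, n_absent)
    records = [(cell, 0) for cell in sampled_absent]
    records += [(cell, 1) for cell in presence_cells]
    return records

# Hypothetical cell identifiers: 1000 cells without landslides, 40 with.
absence = list(range(1000))
presence = list(range(1000, 1040))
sample = build_sample(absence, presence)
print(len(sample), sum(response for _, response in sample))
```

Sampling only a fraction of the absence cells keeps the two classes from being too unbalanced, since landslide cells are rare compared with the rest of the grid.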
25Statistical Analysis
- A normal plot of the data was used to determine whether the data followed a normal distribution
- The plot showed that the data points do not fall along a straight line, so the data are not multivariate normal
- Logistic regression is used when the predictor variables are not normally distributed and some predictor variables are categorical
- Factor analysis was applied to determine the number of underlying variables
- Only significantly loaded variables were considered
26Statistical Analysis
- The logistic regression model gives the probability of the binary outcome variable y as a function of the data vector x for a randomly selected experimental unit. Maximum likelihood was used to estimate B for the predictive equation
- Variables not significant at the .1 level were eliminated
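Maximum likelihood estimation for a logistic model can be illustrated with plain gradient ascent on the log-likelihood. This is a teaching sketch, not the SAS procedure used in the study; the function names, learning rate, iteration count, and data are all hypothetical, and the significance screening at the .1 level is not implemented.

```python
import math

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Estimate coefficients by gradient ascent on the log-likelihood.

    xs: list of feature lists; a constant 1.0 is prepended for the intercept.
    ys: list of 0/1 outcomes.
    """
    rows = [[1.0] + list(x) for x in xs]
    beta = [0.0] * len(rows[0])
    for _ in range(iterations):
        gradient = [0.0] * len(beta)
        for row, y in zip(rows, ys):
            p = sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                gradient[j] += (y - p) * v  # derivative of the log-likelihood
        beta = [b + learning_rate * g / len(rows) for b, g in zip(beta, gradient)]
    return beta

# Hypothetical single predictor with overlapping classes.
xs = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
beta = fit_logistic(xs, ys)
print(beta)
```

Statistical packages usually solve the same problem with Newton-type iterations, which converge in far fewer steps and also yield the standard errors needed for significance tests.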
27Logit Results
- Logit showed that the most important variables contributing to slope instability were flow path and mean slope of upland area
- log(p / (1 - p)) = -2.2642 + FACTOR8 × 0.4969 + FLPATH × 0.6039
- or p = exp(-2.2642 + FACTOR8 × 0.4969 + FLPATH × 0.6039) / (1 + exp(-2.2642 + FACTOR8 × 0.4969 + FLPATH × 0.6039))
- p = probability of landslide hazard
- FACTOR8 = factor with underlying characteristics of aspect
- FLPATH = maximum distance of water to the point in the catchment
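The reported equation can be evaluated directly, as in the sketch below. It assumes that the operators lost from the slide text are additions and multiplications, which is consistent with the later remark that the coefficients are positive. The input scores are hypothetical, since the slides do not give the units or scaling of FACTOR8 and FLPATH.

```python
import math

# Coefficients as reported on the slide.
INTERCEPT = -2.2642
B_FACTOR8 = 0.4969
B_FLPATH = 0.6039

def landslide_probability(factor8, flpath):
    """Probability of landslide hazard from the reported logit equation."""
    logit = INTERCEPT + B_FACTOR8 * factor8 + B_FLPATH * flpath
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical input scores, for illustration only.
for factor8, flpath in [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]:
    print(factor8, flpath, round(landslide_probability(factor8, flpath), 4))
```

The expression 1 / (1 + exp(-z)) is algebraically identical to exp(z) / (1 + exp(z)), the form written on the slide.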
29Logit Results
- The variable coefficients of the logit model are positive. Therefore, higher scores would increase the probability of landslide hazard.
- The logit model assumes a nonlinear relationship between the probability and the explanatory variables
- The hazard map based on the ROC curve technique groups the hazard into two classes, Low Hazard and High Hazard; the probability map shows five classes of probabilities of landslide hazard
30Final Results
- 59.1% of the landslides and 69.8% of non-landslides were correctly determined
- Model can be applied to large geographic areas
- ROC curves are incorporated as a sophisticated tool for decision makers for the spatial prediction of landslide hazard
31
- a) Cut-off based on ROC curve technique
- b) Probability of landslide hazard