Title: Gaussian Processes for Statistical Soil Modeling of the Tropics
1 Gaussian Processes for Statistical Soil Modeling of the Tropics
- CMU/TechBridgeWorld: Juan Pablo Gonzalez, Drew Bagnell
- CIAT Team: Simon Cook, Thomas Oberthur, Andrew Jarvis, Mauricio Rincon
2 Introduction: What is CIAT?
- International Center for Tropical Agriculture
- Is a not-for-profit organization
- Conducts socially and environmentally progressive research in developing countries aimed at
  - reducing hunger and poverty
  - preserving natural resources
- Works through partnerships with farmers, scientists, and policy makers
- 800 people, 120 researchers from 37 different countries
3 Introduction: CIAT locations
- One of 15 Future Harvest centers, with offices in
  - Cali, Colombia (headquarters)
  - Kampala, Uganda (African Regional Office)
  - Vientiane, Laos (Asian Regional Office)
  - Honduras, Ecuador, Nicaragua, Bolivia, Kenya, Brazil, Sri Lanka and Thailand, amongst others
- Funded by CGIAR
  - Consultative Group on International Agricultural Research
  - 58 countries, private foundations, and international organizations
CGIAR Members: World Bank, FAO, Ford Foundation, Rockefeller Foundation, Kellogg Foundation, USA, Canada, U.K., Australia, New Zealand, Sweden, Portugal, Norway, Denmark, Austria, Italy, India, Pakistan, Kenya, Nigeria, Bangladesh, Belgium, Brazil, China, Colombia, Cote d'Ivoire, Egypt, Finland, France, Germany, Indonesia, Iran, Ireland, Israel, Japan, Korea, Luxembourg, Malaysia, Mexico, Morocco, The Netherlands, Peru, The Philippines, Portugal, Republic of South Africa, Romania, Russian Federation, Spain, Switzerland, Syrian Arab Republic, Thailand, Turkey, Uganda
4 Introduction: What is CMU?
- Carnegie Mellon University
- World leader in technology development
  - Computer Science
  - Robotics
- Birthplace of Artificial Intelligence
- Located in Pittsburgh, PA, USA
5 Introduction: What is TechBridgeWorld?
- An initiative within Carnegie Mellon University
- Mission
  - To collaboratively design and implement creative technological solutions that will benefit developing communities around the world
  - To bridge the world with technology
6 Introduction: Task at Hand
- Input
  - Soil scientists from CIAT
  - Computer scientists from CMU/TechBridgeWorld
  - 2500 field samples from Honduras
- Result
  - Statistical soil modeling for the tropics
7 Introduction
- Statistical soil modeling
  - The development of statistical soil models for large areas based on soil samples and digital maps of environmental variables
  - Exploiting easy-to-measure variables
  - Also known as predictive soil mapping (PSM)
8 Introduction
- Importance
  - To detect opportunities
    - Target soil-sensitive crops confidently within new areas
    - Reduce risk of failure in new crops
  - To detect threats
    - Assess impact of climate change
  - To understand soil interactions with land use
    - Understand local hydrology
    - Make decisions about appropriate changes in land use
9 Introduction
- Why in the tropics?
  - Most developing countries are located in the tropics
  - Most funding for soil analysis and modeling does not go to the tropics
  - The tropics have climate patterns distinct from the rest of the globe
    - Only dry and wet seasons (instead of four seasons)
    - Almost constant day length
    - The main determinant of temperature is elevation
10 Introduction: Current Soil Map Coverage Throughout the World
- Detailed soil maps
  - USA: complete coverage at 1:24,000; very extensive and expensive (30 m grid size)
  - 68% of the countries (31% by area) have complete coverage at 1:1,000,000 or better (1 km grid size)
- Rest of the world
  - 69% by area
  - FAO World Map
11 Introduction: Current Soil Map Coverage Throughout the World
- Food and Agriculture Organization (FAO) Worldwide Soil Map
  - Published in 1974
  - Worldwide coverage at 1:5,000,000 (5 km grid size)
  - Based on U.S. Soil Taxonomy
  - 26 classes with subcategories
NITOSOLS (N), Subclass UHTa-3: Soils having an argillic B horizon with a clay distribution where the percentage of clay does not decrease from its maximum amount by as much as 20 percent within 150 cm of the surface; lacking plinthite within 125 cm of the surface; lacking vertic and ferric properties. Low pH (high acidity)
12 Introduction: Current Soil Map Coverage Throughout the World
NITOSOLS (N)
13 Previous Work: FAO Soil Map
- Problems
  - Made with the information and technology of 1960
    - Significant changes since then in technologies such as GPS, remote sensing and GIS
  - Categorical data
    - Most soil types explain only a small proportion of the actual variation of properties
    - Soil variation is continuous
    - Soil attributes do not cluster perfectly: a cut on the basis of one attribute may split the variance of another attribute near its peak
  - Dependent on subjective expert opinion
  - Dependent on soil classification used
  - Low resolution
14 Traditional Soil Survey
- Three steps
  - Observation and measurement of ancillary data and soil profile
  - Observations incorporated into implicit conceptual model
  - Apply conceptual model to predict soil variation in unobserved sites
- Conceptual model uses factors of soil formation
  - Soil is a function of climate, topography, organisms, parent material, time (H. Jenny, 1941)
15 Predictive Soil Mapping (PSM)
- Statistical model using factors of soil formation
  - Soil is a function of climate, topography, organisms, parent material, time
- Goals
  - Exploit relationships between environmental variables and soil properties to improve data collection efficiency
  - Produce and present data that better represent soil landscape continuity
  - Explicitly incorporate expert knowledge in the design
16 PSM: Existing Approaches
- Ordinary Kriging
  - Weighted local spatial averaging
  - Spatial interpolation
  - Does not use knowledge of soil materials or processes
  - Requires a large number of closely-spaced samples
- Block Kriging, Indicator Kriging, Co-Kriging
  - Extensions to include ancillary data
  - Difficult to extend to more than one ancillary variable
17 PSM: Existing Approaches
- Expert Systems
  - Use expert knowledge to establish rule-based relationships between environment and soil properties
  - Do not use soil data to determine soil-landscape relationships
- Regression Trees
  - Decision trees with linear models
  - Promising: good results in Australia (Henderson, 2004)
18 New Approach: Gaussian Processes
- Generalization of Gaussian distribution to function space of infinite dimension
- Probabilistic (Bayesian) model
  - Completely determined by mean and covariance function
  - Prediction with mean and variance (confidence intervals)
- Non-parametric
  - Very powerful
  - Complexity of model increases with more data
- Not new. It started as kriging and has evolved as a replacement for supervised neural networks
19 New Approach: Gaussian Processes
- Generalization of Gaussian distribution to function space of infinite dimension
- Probabilistic (Bayesian) model
  - Prediction with mean and variance (confidence intervals)
- Non-parametric
  - Very powerful
- Not new. It started as kriging and has evolved as a replacement for supervised neural networks
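The "prediction with mean and variance" bullet can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model used in the talk: it assumes a zero-mean GP with a squared-exponential covariance, and the function names (`sq_exp_kernel`, `gp_predict`), the hyperparameter values, and the toy pH-like data are all hypothetical.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_new, lengthscale=1.0, signal_var=1.0, noise_var=0.1):
    """Predictive mean and variance of a zero-mean GP at the rows of X_new."""
    K = sq_exp_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(len(X))
    K_s = sq_exp_kernel(X, X_new, lengthscale, signal_var)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    # Variance of a new noisy observation: prior variance, minus what the
    # training data explain, plus observation noise.
    var = signal_var - (v ** 2).sum(axis=0) + noise_var
    return mean, var

# Hypothetical 1-D toy data (e.g. pH along a transect); not from the talk.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([5.2, 5.8, 6.1, 5.5])

# Centre the targets so the zero-mean assumption is reasonable.
mean, var = gp_predict(X, y - y.mean(), np.array([[1.5], [10.0]]))
mean = mean + y.mean()

# Approximate 95% interval for each prediction.
lower, upper = mean - 1.96 * np.sqrt(var), mean + 1.96 * np.sqrt(var)
```

The second query point lies far from every sample, so its predicted mean falls back to the data mean and its variance returns to the prior level. That behaviour is what lets a GP-based map flag regions where it has little support.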
20 Gaussian Processes
- Interpolation technique equivalent to
  - Neural network with an infinite number of hidden units
  - Radial basis functions, with an infinite number of basis functions
  - Least squares SVMs
  - Kernel ridge regression
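One of these equivalences can be written out explicitly. As a sketch, assume a zero-mean GP with covariance function k, Gaussian noise variance \sigma_n^2, training inputs x_1, ..., x_n with targets y, and a ridge penalty \lambda (these symbols are introduced here for illustration and do not appear in the slides). The GP posterior mean and the kernel ridge regression predictor at a new point x_* are then

```latex
\mu_{\mathrm{GP}}(x_*) = k_*^{\top}\,(K + \sigma_n^{2} I)^{-1}\, y,
\qquad
\hat{f}_{\mathrm{KRR}}(x_*) = k_*^{\top}\,(K + \lambda I)^{-1}\, y,
\qquad
K_{ij} = k(x_i, x_j),\quad (k_*)_i = k(x_i, x_*).
```

The two predictors coincide when \lambda = \sigma_n^2. The GP additionally supplies a predictive variance, which kernel ridge regression does not.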
21 Gaussian Processes
22 Available Data
- 2500 soil samples from Honduras
- Digital maps of Honduras with
  - Climate
    - Temperatures (max, min, average, etc.)
    - Precipitation (max, min, average, etc.)
  - Topography
    - 90 m elevation maps
  - Vegetation index
    - Measurement of vegetation cover
  - And derived variables
23 Gaussian Processes
- Learning the hyperparameters
  - Maximize the probability of the hyperparameters given the data
  - Use scaled conjugate gradient descent
  - Takes approximately 20 minutes with current data set
- Selecting variables
  - Select most promising variables and incrementally add them to the model
  - Would take 54 hrs for each variable selected!
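As a rough illustration of hyperparameter learning: with a flat prior, the probability of the hyperparameters given the data is proportional to the GP marginal likelihood, so the sketch below scores candidate length scales by their log marginal likelihood. It makes several assumptions that are not from the talk: a squared-exponential covariance, a flat prior, synthetic data, and a plain grid search standing in for the scaled conjugate gradient optimizer named on the slide. The function name `log_marginal_likelihood` is hypothetical.

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, signal_var, noise_var):
    """log p(y | X, hyperparameters) for a zero-mean GP with a
    squared-exponential covariance plus independent Gaussian noise."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = signal_var * np.exp(-0.5 * d2 / lengthscale**2) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                  # penalises poor fit
    complexity = -np.log(np.diag(L)).sum()       # equals -0.5 * log|K|
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

# Grid search over the length scale (a stand-in for gradient-based search).
candidates = [0.01, 0.1, 0.3, 1.0, 3.0, 10.0, 100.0]
scores = [log_marginal_likelihood(X, y, ell, 1.0, 0.01) for ell in candidates]
best_lengthscale = candidates[int(np.argmax(scores))]
```

Each evaluation needs a Cholesky factorization of the n-by-n covariance matrix, which costs on the order of n^3 operations. That cubic cost is consistent with the slide's point that fitting on the full data set is slow.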
24 Gaussian Processes: Variable Selection
- Greedy search on R² of validation set
  - Learn parameters for all variables @ 10% of training set
  - Calculate R² on validation set for all variables @ 10% of training set
  - Select variable with best R²
  - Learn parameters @ 80% of training set with selected variables
  - Calculate R² with selected variables @ 80% of training set
  - Decide whether to continue based on R² on validation set for these parameters
R²: coefficient of determination. Percentage of the variance explained by the model
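The greedy procedure above can be sketched as follows. This is an illustration under assumptions, not the code used for the results: a linear least-squares fit stands in for the GP so the example runs quickly, the full training split plays the role of the "80%" set, the stopping rule (stop when validation R² no longer improves) is one possible reading of the last step, and all function and variable names are hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: fraction of variance explained."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def fit_predict(X_train, y_train, X_val):
    """Stand-in model (linear least squares); the slides fit a GP here."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([X_val, np.ones(len(X_val))]) @ coef

def greedy_select(X_tr, y_tr, X_val, y_val, names, screen_frac=0.1, max_vars=8):
    n_screen = max(2, int(screen_frac * len(X_tr)))
    selected, best_r2 = [], -np.inf
    while len(selected) < max_vars:
        remaining = [j for j in range(X_tr.shape[1]) if j not in selected]
        if not remaining:
            break
        # Cheap screening: try each remaining variable on a small subset.
        scores = {}
        for j in remaining:
            cols = selected + [j]
            pred = fit_predict(X_tr[:n_screen][:, cols], y_tr[:n_screen],
                               X_val[:, cols])
            scores[j] = r2_score(y_val, pred)
        j_best = max(scores, key=scores.get)
        # Expensive refit on the larger split with the chosen variable added.
        cols = selected + [j_best]
        r2 = r2_score(y_val, fit_predict(X_tr[:, cols], y_tr, X_val[:, cols]))
        if r2 <= best_r2:          # no improvement on validation: stop
            break
        selected, best_r2 = cols, r2
    return [names[j] for j in selected], best_r2

# Synthetic data: only v1 and v3 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = 2.0 * X[:, 1] - 1.0 * X[:, 3] + 0.1 * rng.standard_normal(500)
names = ["v0", "v1", "v2", "v3", "v4"]
selected, val_r2 = greedy_select(X[:400], y[:400], X[400:], y[400:], names)
```

Screening on a small subset keeps the per-candidate cost low, and the larger split is used only once per round, for the variable that screening picked.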
25 Gaussian Processes: Variable Selection
R²: coefficient of determination. Percentage of the variance explained by the model
26 Training Time
- With 10/80 approach
  - 15 s per R² calculation @ 10%
  - 50 minutes for all variables (68), with three length scale priors on each
  - 20 minutes per R² calculation @ 80%
  - Total: 1 h 10 min per variable. Up to 9 h for 8 variables
- With 25/80 approach
  - 1 minute per R² calculation
  - Total: 3 h 30 min per variable. Up to 27 h for 8 variables
- With 80 approach
  - 20 minutes per R² calculation
  - Total: 54 h per variable. Up to 18 days for 8 variables
27 Results: FAO Map of Honduras
NITOSOLS (N): Soils having an argillic B horizon with a clay distribution where the percentage of clay does not decrease from its maximum amount by as much as 20 percent within 150 cm of the surface; lacking plinthite within 125 cm of the surface; lacking vertic and ferric properties. Low pH (high acidity)
28 Results: pH in topsoil
29 Results: pH in topsoil, no X, Y
30 Results: Accuracy of Current Techniques
- A soil survey is good if the map units have the right soil more than 50% of the time
- Most measurements have a variability of 20% or more between laboratories
- Most quantitative prediction methods explain less than 10% of variation
  - Exception: Henderson 2004 in Australia
31 Results: pH in Topsoil
- Experiment 554, PHW1 vs. inputs. Training set 82
  - out_variable: PHW1
  - variables: 'XUTM' 'YUTM' 'P5'
  - final hyperparameters
    - in_params: 0.1414 -1.3439 4.3123 3.5009 -1.9544 -0.8364 -1.3607
  - Train/Test2 error
    - Data: 0.7547/0.7567
    - Model: 0.4800/0.5590
  - Train/Test2 r2
    - 0.5954/0.4544
  - bias: 1.151939
  - noise: 0.260834 (std 0.51072)
  - lengthscale
    - XUTM: 0.115770 (11067.51)
    - YUTM: 0.173696 (11198.10)
P5: Maximum temperature of warmest month
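The error and r2 figures in this log appear to be linked by a simple relation. The sketch below assumes, as an inference from the numbers rather than anything stated on the slide, that "Data" is the standard deviation of the target, "Model" is the root-mean-square prediction error, and "noise" is a variance whose square root is the reported std. The function name is hypothetical.

```python
def r2_from_errors(data_std, model_rmse):
    """R² = 1 - (RMSE / std)², i.e. the fraction of variance explained.
    Assumes 'Data' is the target std and 'Model' is the prediction RMSE."""
    return 1.0 - (model_rmse / data_std) ** 2

# Values copied from Experiment 554 (pH in topsoil).
train_r2 = r2_from_errors(0.7547, 0.4800)   # slide reports 0.5954
test_r2 = r2_from_errors(0.7567, 0.5590)    # slide reports 0.4544
noise_std = 0.260834 ** 0.5                 # slide reports std 0.51072

print(f"train r2 ~ {train_r2:.4f}, test r2 ~ {test_r2:.4f}, noise std ~ {noise_std:.5f}")
```

Under this reading, the learned noise standard deviation of about 0.51 pH units sits close to the training error of 0.48.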
32 Results: pH in Topsoil
P5: Maximum temperature of warmest month
33 Results: pH in Topsoil, variable selection
34 Results: pH in Topsoil
P5: Maximum temperature of warmest month
35 Results: pH in Topsoil, No X, Y
P5: Maximum temperature of warmest month; P2: Mean Diurnal Temp. Range; P16: Precipitation of wettest quarter
36 Results: pH in Topsoil, No X, Y
- Experiment 504, PHW1 vs. inputs. Training set 82
  - out_variable: PHW1
  - variables: 'P5' 'P2' 'P16' 'XGeology_Code_SA1'
  - final hyperparameters
    - in_params: -0.1648 -0.9890 1.6712 2.1778 -3.1989 3.5034 -3.7036 -1.8056
  - Train/Test2 error
    - Data: 0.7546/0.7567
    - Model: 0.5522/0.6029
  - Train/Test2 r2
    - 0.4645/0.3652
  - bias: 0.848064
  - noise: 0.371960 (std 0.60989)
  - lengthscale
    - P5: 0.433610 (1.08)
    - P2: 0.336585 (0.32)
P5: Maximum temperature of warmest month; P2: Mean Diurnal Temp. Range; P16: Precipitation of wettest quarter
37 Results: pH in Topsoil, No X, Y, variable selection
38 Results: pH in Topsoil, No X, Y
P5: Maximum temperature of warmest month; P2: Mean Diurnal Temp. Range; P16: Precipitation of wettest quarter
39 Results: Sand in topsoil (%)
P13: Precipitation of wettest month; P19: Precipitation of coldest quarter; P14: Precipitation of driest month
40 Results: Sand in topsoil (%)
- Experiment 654, SA1 vs. inputs. Training set 82
  - out_variable: SA1
  - variables: 'XUTM' 'YUTM' 'ZDEM' 'mean_ndvi' 'intra_var' 'XGeology_Code_SA1' 'P13' 'P19' 'P14' 'P13'
  - final hyperparameters
    - in_params: -0.0829 5.0725 0.8620 1.8612 1.3229 0.9330 0.0655 -3.2179 -3.2194 0.0184 0.0115 -3.2189 -0.4414 3.8994
  - Train/Test2 error
    - Data: 14.9129/14.4163
    - Model: 11.5649/12.6090
  - Train/Test2 r2
    - 0.3986/0.2350
  - bias: 0.920486
  - noise: 159.578757 (std 12.63245)
  - lengthscale
    - XUTM: 0.649868 (62143.46)
    - YUTM: 0.394312 (25394.81)
P13: Precipitation of wettest month; P19: Precipitation of coldest quarter; P14: Precipitation of driest month
41 Results: Sand in topsoil (%), variable selection
42 Results: Sand in topsoil (%)
P13: Precipitation of wettest month; P19: Precipitation of coldest quarter; P14: Precipitation of driest month
43 Results: Sand in topsoil (%), no X, Y
P12: Annual Precipitation; P13: Precipitation of wettest month
44 Results: Sand in topsoil (%), no X, Y
- Experiment 604, SA1 vs. inputs. Training set 82
  - out_variable: SA1
  - variables: 'mean_ndvi' 'XFeat_1km_9_SA1' 'XGeology_Code_SA1' 'P12' 'intra_var' 'P13'
  - final hyperparameters
    - in_params: 0.2806 5.2131 0.9563 -3.2208 -3.2170 0.5258 0.0168 -3.2173 -0.3648 2.1717
  - Train/Test2 error
    - Data: 14.8333/14.3487
    - Model: 13.4789/13.5924
  - Train/Test2 r2
    - 0.1743/0.1026
  - bias: 1.323985
  - noise: 183.653649 (std 13.55189)
  - lengthscale
    - mean_ndvi: 0.619923 (10.34)
    - XFeat_1km_9_SA1: 5.004900 (4.32)
P12: Annual Precipitation; P13: Precipitation of wettest month
45 Results: Sand in topsoil (%), no X, Y, variable selection
46 Results: Sand in topsoil (%), no X, Y
P12: Annual Precipitation; P13: Precipitation of wettest month
47 Results: Sand in topsoil (%)
48 Results: Sand in topsoil (%), no X, Y
49 Results: Clay in topsoil (%)
P16: Precipitation of wettest quarter
50 Results: Clay in topsoil (%)
- Experiment 754, CL1 vs. inputs. Training set 82
  - out_variable: CL1
  - variables: 'XUTM' 'YUTM' 'Geology_Code' 'P16'
  - final hyperparameters
    - in_params: -0.0301 4.7231 1.9283 1.0973 -0.1593 0.0280 -0.7034 3.3708
  - Train/Test2 error
    - Data: 12.1955/11.3255
    - Model: 10.4334/10.3302
  - Train/Test2 r2
    - 0.2681/0.1680
  - bias: 0.970348
  - noise: 112.516514 (std 10.60738)
  - lengthscale
    - XUTM: 0.381307 (36462.41)
    - YUTM: 0.577729 (37207.37)
P16: Precipitation of wettest quarter
51 Results: Clay in topsoil (%), variable selection
52 Results: Clay in topsoil (%)
P16: Precipitation of wettest quarter
53 Results: Clay in topsoil (%), no X, Y
P13: Precipitation of wettest month; P2: Mean Diurnal Temp. Range; P19: Precipitation of coldest quarter; P4: Temperature Seasonality
54 Results: Clay in topsoil (%), no X, Y
- Experiment 704, CL1 vs. inputs. Training set 82
  - out_variable: CL1
  - variables: 'P13' 'XGeology_Code_SA1' 'P2' 'P19' 'mean_ndvi' 'P4' 'P2'
  - final hyperparameters
    - in_params: 0.2078 4.7290 -1.0324 -1.2979 0.2249 0.1773 -0.4783 0.9329 1.3486 0.2713 2.8471
  - Train/Test2 error
    - Data: 12.1955/11.3255
    - Model: 10.4058/10.5010
  - Train/Test2 r2
    - 0.2720/0.1403
  - bias: 1.231002
  - noise: 113.184496 (std 10.63882)
  - lengthscale
    - P13: 1.675687 (111.26)
    - XGeology_Code_SA1: 1.913570 (6.29)
P13: Precipitation of wettest month; P2: Mean Diurnal Temp. Range; P19: Precipitation of coldest quarter; P4: Temperature Seasonality
55 Results: Clay in topsoil (%), no X, Y, variable selection
56 Results: Clay in topsoil (%), no X, Y
P13: Precipitation of wettest month; P2: Mean Diurnal Temp. Range; P19: Precipitation of coldest quarter; P4: Temperature Seasonality
57 Results: Clay in topsoil (%)
58 Results: Clay in topsoil (%), no X, Y
59 Prediction Time
- 21 ms/cell with 1700 training points, Pentium 4 1.8 GHz
- Honduras (112,000 km²)
  - 40 minutes @ 1 km
  - 3.4 days @ 90 m
  - 30 days @ 30 m
- Africa (30,000,000 km²)
  - 7.2 days @ 1 km
  - 2.4 years @ 90 m
  - 22 years @ 30 m
- USA (9,158,000 km²)
  - 2.2 days @ 1 km
- World (148,940,000 km²)
  - 37 days @ 1 km
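The prediction times above follow from the per-cell cost and the number of grid cells. The sketch below reproduces that arithmetic, assuming square cells, a constant 21 ms per cell, and a single machine; the function name `prediction_days` is hypothetical.

```python
SECONDS_PER_CELL = 0.021   # 21 ms per cell, as reported on the slide
SECONDS_PER_DAY = 86_400

def prediction_days(area_km2, cell_size_m):
    """Days needed to predict every cell of a square grid covering the area."""
    cells = area_km2 * 1e6 / cell_size_m**2   # area in m² divided by cell area
    return cells * SECONDS_PER_CELL / SECONDS_PER_DAY

regions = {
    "Honduras": 112_000,
    "Africa": 30_000_000,
    "USA": 9_158_000,
    "World": 148_940_000,
}

for name, area in regions.items():
    for cell in (1000, 90, 30):
        days = prediction_days(area, cell)
        print(f"{name:8s} @ {cell:4d} m: {days:12.2f} days")
```

Halving the cell size quadruples the number of cells, which is why moving from 1 km to 30 m multiplies the run time by roughly 1,100.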
60 Results: Impact
- Gaussian Processes for PSM
  - Provide quantitative predictions
  - Provide quantitative estimate of confidence
  - Combine pedogenic factors and spatial interpolation
  - Allow for complete coverage
  - Enable continued improvement
  - Match or advance state of the art in predictive soil mapping
61 Future Work
- In Gaussian Processes for Predictive Soil Mapping
  - Validate results
  - Improve existing variables
  - Find new variables to improve results
  - Compare with leading approach: regression trees
  - Participate in international workshop to assess viability of worldwide coverage with latest techniques
62 Future Work
- In TechBridgeWorld work with CIAT
  - Computer vision for monitoring and management of agricultural fields and natural resources from low-cost flying platforms
    - Digital elevation map generation
    - Automated image mosaicing
    - Segmentation of individual tree crowns
    - Disease monitoring and detection
  - Developing weather insurance schemes for small-holder farmers in developing countries
  - Species/crop distribution modeling for targeting conservation and identifying new opportunities for farmers
  - Temporal analysis of land cover data
63 Future Work: Weather Index Insurance for Small Farmers
- Rather than insuring yield loss
  - Insure for weather: the most likely cause of yield loss is lack or excess of rain
  - Reduces fraud
  - Reduces cost
- Challenges
  - Event timing is critical
  - Needs very low false positive and false negative rates
  - Impact of rainfall depends on terrain and soil type
64 Future Work: Analysis of Digital Aerial Imagery
- Captured with low-cost hot air balloon or kite
- Automatic image mosaicing
- Generation of elevation maps from images
65 Future Work: Monitoring of Rainforest Tree Species
66 Future Work: Automatic Coast Line Extraction
- 90 m Digital Elevation Maps available for the world, from shuttle mission.
67 Future Work: Temporal Analysis of Vegetation Cover
- To monitor natural changes and human impact
68 Conclusions
- Great contributions can be made by applying computer science techniques to other fields
- Scientists in other fields are frequently limited to off-the-shelf solutions
- Working with existing groups in developing countries can maximize impact of short-term work