Title: Automated classification of rainfall systems using statistical characterization
1Automated classification of rainfall systems
using statistical characterization
- Michael Baldwin
- University of Oklahoma
- CIMMS
2Motivation
- Verification and predictability
- e.g., Ebert and McBride (2000)
- Climatology
- e.g., Houze et al. (1990)
- Forecasting
- e.g., Doswell et al. (1996)
- Diagnosis of ensemble forecasts
- e.g., Elmore et al. (2002)
3Subjective classification
- From Houze et al. (1990)
- Leading line-trailing stratiform
- Symmetric/asymmetric
- Unclassifiable class
4Verification of detailed forecasts
observed
RMSE 3.4, MAE 0.97, ETS 0.06
RMSE 1.7, MAE 0.64, ETS 0.00
- 12h forecasts of 1h precipitation valid 00Z 24 Apr 2003
5Outline
- Training
- 48 cases from Aug-Nov 2000 (target data set)
- Histogram analysis
- Correlogram analysis
- Automation
- Classification
- Object identification
- Validation
- 100 cases from 2002 data
6Classify people?
7Classification terminology
- Objects: things you wish to classify
- Individuals, cases, subjects, entities
- Attributes: descriptions of the objects
- Variables, features, descriptors, characteristics, properties
- 2 types of automated classification
- Supervised
- Unsupervised
8Object classes are known ahead of time: where should new objects go?
9Using cluster analysis to discover classes
10What if you know the classes, but don't know how
to characterize the objects in such a way that an
automated classification will agree with your
classification?
- First, build a training data set
- Second, put together a set of trial attributes that might be useful in a classification
- Third, do a lot of unsupervised classification experiments using combinations of trial attributes
- Next, build a supervised classification procedure from the essential trial attributes
11Target data set - training
- 48 cases
- Summer-Fall of 2000
- Fixed domain size
12Target data set - training
- Cases selected by hand
- Populate data set with typical rainfall systems
13Target data set - training
- NCEP Stage IV radar-gage analyses
- 1h accumulation
- 128 x 128 4km grid boxes
14Target data set - training
- Variety of phenomena, geographical locations,
times
15Expert (subjective) classification
- Two-level classification hierarchy of rain systems
- Two-class: Convective, Non-convective
- Three-class: Linear, Cellular (both Convective); Stratiform (Non-convective)
16Subjective classification based upon objective
criteria
- Considering intensity, degree of alignment or linear organization
- Significant fraction > 5 mm/hr: Convective
- Otherwise: Stratiform
- Bounding box around convective region with aspect ratio > 3:1: Linear
- Otherwise: Cellular
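The criteria above can be sketched as a small decision rule. This is only an illustration: the function name, the value used for a "significant fraction" (`min_conv_fraction`), and the use of an axis-aligned bounding box are assumptions, since the slide does not specify them.

```python
def classify_rain_system(rates, conv_rate=5.0, min_conv_fraction=0.1, min_aspect=3.0):
    """Return 'stratiform', 'linear', or 'cellular' for a 2-D list of rain rates (mm/hr).

    min_conv_fraction is a hypothetical stand-in for "significant fraction".
    """
    raining = [(j, i, r) for j, row in enumerate(rates)
               for i, r in enumerate(row) if r > 0.0]
    if not raining:
        raise ValueError("field contains no rain")
    convective = [(j, i) for j, i, r in raining if r > conv_rate]
    # Intensity test: too few intense grid boxes means stratiform
    if len(convective) / len(raining) < min_conv_fraction:
        return "stratiform"
    # Organization test: aspect ratio of the (axis-aligned) bounding box
    rows = [j for j, _ in convective]
    cols = [i for _, i in convective]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    aspect = max(height, width) / min(height, width)
    return "linear" if aspect > min_aspect else "cellular"
```

A bounding box aligned with the grid underestimates the aspect ratio of a diagonal line, so a rotated box would be a natural refinement.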
17Intensity-related attributes
- Compact way to describe how much rain fell
- Histogram analysis
- Parameters of a statistical distribution fit to
histogram will be used as attributes
18Attributes: gamma distribution parameters
- Gamma PDF depends on 2 parameters: shape (α) and scale (β)
- $f(x; \alpha, \beta) = \dfrac{(x/\beta)^{\alpha-1} \exp(-x/\beta)}{\beta\,\Gamma(\alpha)}, \quad x \ge 0,\ \alpha, \beta > 0$
shape
scale
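As a quick check of the two-parameter gamma density, here is a minimal sketch that evaluates it directly. The function name `gamma_pdf` is an assumption made for this example; only the standard library is used.

```python
import math

def gamma_pdf(x, shape, scale):
    """Gamma density with shape and scale parameters, both required to be positive."""
    if shape <= 0.0 or scale <= 0.0:
        raise ValueError("shape and scale must be positive")
    if x < 0.0:
        return 0.0
    if x == 0.0:
        # The limiting value at the origin depends on the shape parameter
        if shape < 1.0:
            return math.inf
        return 1.0 / scale if shape == 1.0 else 0.0
    return (x / scale) ** (shape - 1.0) * math.exp(-x / scale) / (scale * math.gamma(shape))
```

With shape 1 the density reduces to an exponential distribution with mean equal to the scale, which is a convenient sanity check.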
19Example: shape, scale (2 moments)
- Figure labels: Cluster 1, Cluster 2, Cluster 3, 4
- Percent correct: 3-class 63.8%, 2-class 97.8%
20Spatial organization related attributes
- Figure labels: tail x_t, head x_h, separation vector h
- Geostatistics
- Measure aspects of spatial field as a function of separation vector h
- Variogram
- Covariance
- Correlogram
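One common way to obtain a 2-D correlogram for every separation vector h at once is through the FFT. The sketch below assumes NumPy is available and uses the simple estimator that divides each lag sum by the zero-lag sum; the function name is an assumption, and the talk does not state which estimator was used.

```python
import numpy as np

def correlogram(field):
    """Return the 2-D autocorrelation of `field` for all lags, zero lag at the centre.

    The field must not be constant, otherwise the zero-lag sum is zero.
    """
    f = np.asarray(field, dtype=float)
    ny, nx = f.shape
    anom = f - f.mean()
    # Zero-pad to twice the size so the FFT product gives linear (not circular) lag sums
    spec = np.fft.rfft2(anom, s=(2 * ny, 2 * nx))
    acov = np.fft.irfft2(spec * np.conj(spec), s=(2 * ny, 2 * nx))
    # Move zero lag to index (ny, nx), then drop the unused first row and column
    acov = np.fft.fftshift(acov)[1:, 1:]
    # Normalize by the zero-lag value, now at index (ny - 1, nx - 1)
    return acov / acov[ny - 1, nx - 1]
```

For an east-west band of rain, correlation decays slowly along the band and quickly across it, so the correlation contours are elongated in the direction of the band.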
21Synthetic data
- Similar to Wood and Brown (1986) with Doppler velocity data
- Figure panels: rainfall, correlogram
22Synthetic data
- Degree of linear organization related to shape of
correlation contours
23Image processing
24Image processing
- Thresholding, connected component labeling, edge
detection
25Example: shape, scale, 0.6 contour eccentricity
- Figure labels: Cluster 1, Cluster 2, 3, 4, 5
- Percent correct: 3-class 76.1%, 2-class 100%
26Essential attributes
- Gamma scale parameter (β) and correlogram contour eccentricity (a/b)
- No clear advantage to standardization; these attributes have approximately the same range
- Addition of contour area and/or gamma shape parameter did not improve classification
- Expand the number of correlation contours to 0.2, 0.4, 0.6, and 0.8
27Example: scale, eccentricity of 0.2, 0.4, 0.6, 0.8 contours
- Figure labels: Cluster 1, Cluster 2, 3, 4, 5
- Percent correct: 3-class 90.5%, 2-class 100%
28Automated classification
- Now using supervised classification
- Find cluster means from best HCA results
- Any new object will be assigned to the nearest of these 5 cluster means
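A nearest-mean rule of this kind is short to write down. In the sketch below the cluster names and the numbers in `means` are made-up placeholders, not values from the talk; each vector stands for (gamma scale, a/b at the 0.2, 0.4, 0.6, and 0.8 contours), and Euclidean distance is used.

```python
import math

def nearest_cluster(attributes, cluster_means):
    """Return the key of the cluster mean closest (Euclidean distance) to `attributes`."""
    return min(cluster_means,
               key=lambda name: math.dist(attributes, cluster_means[name]))

# Hypothetical cluster means, for illustration only
means = {
    "cluster 1": (8.0, 4.0, 3.5, 3.0, 2.5),
    "cluster 2": (6.0, 1.5, 1.4, 1.3, 1.2),
    "cluster 3": (1.5, 1.8, 1.6, 1.5, 1.4),
    "cluster 4": (4.0, 2.5, 2.2, 2.0, 1.8),
    "cluster 5": (3.0, 9.0, 7.0, 5.0, 3.0),
}
label = nearest_cluster((7.5, 3.8, 3.2, 3.1, 2.6), means)
```

Because the attributes are not standardized here, an attribute with a larger numerical range would dominate the distance.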
29Automated rainfall object identification
- Contiguous regions of measurable rainfall (similar to the CRA of Ebert and McBride (2000))
30Connected component labeling
31Expand area by 15%, connect regions that are
within 20 km, relabel
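The labeling and linking steps can be sketched with a breadth-first search in which rainy grid boxes within a given number of cells of each other share a label. This is a simplification under stated assumptions: it covers only the "connect regions within 20 km" step (5 boxes on a 4 km grid, measured along rows and columns), not the 15% area expansion, and the function and parameter names are invented for the example.

```python
from collections import deque

def label_rain_objects(rain, threshold=0.0, link_cells=5):
    """Label rainy cells; cells within `link_cells` rows and columns share a label.

    link_cells=1 gives ordinary 8-connected component labeling.
    Returns (labels, number_of_objects); label 0 means no rain.
    """
    ny, nx = len(rain), len(rain[0])
    labels = [[0] * nx for _ in range(ny)]
    n_objects = 0
    for j in range(ny):
        for i in range(nx):
            if rain[j][i] <= threshold or labels[j][i]:
                continue
            n_objects += 1
            labels[j][i] = n_objects
            queue = deque([(j, i)])
            while queue:
                cj, ci = queue.popleft()
                # Visit every unlabeled rainy cell inside the linking window
                for nj in range(max(0, cj - link_cells), min(ny, cj + link_cells + 1)):
                    for ni in range(max(0, ci - link_cells), min(nx, ci + link_cells + 1)):
                        if rain[nj][ni] > threshold and not labels[nj][ni]:
                            labels[nj][ni] = n_objects
                            queue.append((nj, ni))
    return labels, n_objects
```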
32Object analysis
332002 data: 799,014 objects
Validation
- Looking for power-law regimes on a log-log plot
- Small: meso-γ, area < (50 km)^2
- 65.6% (524,224 objects, 60/h)
- Medium: meso-β, area (50-200 km)^2
- 30.4% (242,914 objects, 28/h)
- Large: meso-α, area > (200 km)^2
- 4% of total (31,876 objects, 3.7/h)
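The size regimes and the quoted shares can be reproduced with a few lines. The sketch assumes that object area is the number of 4 km grid boxes times 16 km^2, which the slides do not state explicitly; `size_regime` is a hypothetical helper name. The object counts are the ones listed for each regime.

```python
def size_regime(n_cells, cell_km=4.0):
    """Assign an object to a size regime from its number of grid boxes."""
    area = n_cells * cell_km ** 2          # km^2, assuming area = boxes x box area
    if area < 50.0 ** 2:
        return "small"                     # meso-gamma
    if area <= 200.0 ** 2:
        return "medium"                    # meso-beta
    return "large"                         # meso-alpha

# Object counts per regime as listed for the 2002 data
counts = {"small": 524224, "medium": 242914, "large": 31876}
total = sum(counts.values())
shares = {name: 100.0 * n / total for name, n in counts.items()}
```

The three counts add up to the 799,014 objects reported for the year.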
34Example of object sizes
- Figure labels: large, small, medium, medium
35Distribution of object centers of mass
36Random sample of large objects to validate
classification procedure
- 100 cases, classified by hand (stratiform, linear, cellular) and by automated procedure
- Training data set consisted of large objects
- Overwhelming majority of small, medium objects have nearly identical attributes (low scale, eccentricity values)
- Large objects will have least amount of uncertainty in parameter estimation
- Large objects have wide range of attribute values
37Validation of automated classification procedure
- 85% correct in three-class (linear, cellular, stratiform)
- 89% correct in two-class (convective, non-convective)
38Classification of 2002 data
- Figure panels: All, Small, Large, Medium
39Summary
- Developed an automated rainfall system classification procedure
- Using statistically-based characteristics of intensity and degree of linear organization
- Validated against random sample of 2002 data
40Future work
- Do these attributes allow more refined classes, such as leading-line trailing stratiform, symmetric/asymmetric?
- Very large (synoptic scale) objects contain multiple classes of rainfall systems
- Why are 99% of small objects stratiform?
- Apply this to forecast and observed rain for verification purposes
41Verification
β = 7.8; a/b at 0.2, 0.4, 0.6, 0.8 contours: 3.6, 3.1, 4.5, 3.6
observed
β = 3.1; a/b at 0.2, 0.4, 0.6, 0.8 contours: 2.6, 2.0, 2.1, 2.8
β = 1.6; a/b at 0.2, 0.4, 0.6, 0.8 contours: 10.7, 7.5, 4.3, 2.8
- 12h forecasts of 1h precipitation valid 00Z 24 Apr 2003
43Cluster analysis - data matrix
- Objects are columns
- Attributes are rows
- Cluster based on the similarity between objects
(column vectors)
ith attribute
jth object
45Subjective classification
- Three main classes
- Linear
- Cellular
- Stratiform
48Objective classification
- Hierarchical cluster analysis (HCA)
- Similar to ensemble data analysis by Alhamed et al. (2002) for SAMEX and Yussouf et al. (2003) for New England
- Analysis of similarity of objects
- Similarity: correlation
- Dissimilarity: distance
- Clusters are groups of similar objects
- Optimal clusters minimize within-cluster variation and maximize between-cluster variation
49Cluster analysis: Ward's method
- Ward (1963), based on variance conservation law
- Agglomerative clustering algorithm
- Step 1: Place each object in a separate cluster
- Step 2: Compute within-cluster variance for every possible merger of two clusters
- Step 3: Merge the two clusters that increase the within-cluster variance the least
- Repeat steps 2 and 3 until all objects are in one cluster
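The steps above can be sketched in plain Python. This is a naive illustration, not an efficient implementation: it stops at a requested number of clusters instead of building the full tree, and it scores each candidate merger by the increase in within-cluster sum of squares, n_a n_b / (n_a + n_b) times the squared distance between centroids. The function name is an assumption.

```python
def ward_clustering(points, n_clusters):
    """Agglomerative clustering with Ward's criterion; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]   # Step 1: one object per cluster
    dim = len(points[0])

    def centroid(members):
        return [sum(points[m][d] for m in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares caused by merging a and b
        ca, cb = centroid(a), centroid(b)
        dist2 = sum((p - q) ** 2 for p, q in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):              # Step 2: score every possible merger
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best                              # Step 3: cheapest merger wins
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Stopping at `n_clusters` plays the role of cutting the tree at a chosen level.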
50Rainfall distribution
- Distribution of rainfall amounts is highly positively skewed, non-negative
- Heavy rain is a rare event
- Considered using
- Weibull (Wilks 1989)
- Two-parameter kappa (Mielke 1973)
- Gamma (Wilks 1990)
51Gamma distribution
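- Gamma PDF depends on 2 parameters: α (shape) and β (scale)
- $f(x; \alpha, \beta) = \dfrac{(x/\beta)^{\alpha-1} \exp(-x/\beta)}{\beta\,\Gamma(\alpha)}$
- $x \ge 0,\ \alpha, \beta > 0$
- Figure panels: modify β (scale) parameter; modify α (shape) parameter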
52Continuous spectrum of objects
53Parameter estimation
- For 2 parameters, a set of 2 equations is typically used, relating the population and sample moments (e.g., 1st and 2nd)
- $\bar{x} = \alpha\beta$
- $s^2 = \alpha\beta^2$
- Familiar method of moments
- Resulting distribution fits 1st and 2nd moments but not higher-order moments
- Wilks (1990) discusses problems with method of moments estimates, particularly for small values of α
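Solving the two moment equations gives shape = mean^2 / variance and scale = variance / mean. A minimal sketch follows; the function name is an assumption, and the sample variance with an n - 1 denominator is one possible choice that the slide does not specify.

```python
import statistics

def gamma_method_of_moments(values):
    """Return (shape, scale) of a gamma distribution matching the sample mean and variance."""
    mean = statistics.fmean(values)
    var = statistics.variance(values)      # sample variance, n - 1 denominator
    if mean <= 0.0 or var <= 0.0:
        raise ValueError("need positive mean and variance")
    shape = mean * mean / var
    scale = var / mean
    return shape, scale

# Example with made-up rainfall amounts (mm)
shape, scale = gamma_method_of_moments([1.0, 2.0, 3.0, 4.0, 5.0])
```

By construction the fitted distribution reproduces the first two sample moments: shape times scale equals the mean, and shape times scale squared equals the variance.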
54Maximum likelihood estimation
- Find parameters that make observed data most likely
- Assuming independent, identically distributed data
- Likelihood function becomes product of likelihoods for each observed value
- Wilks (1990) used this on rainfall time series; we have spatial data, which are correlated
- Want to use an estimation method that can take serial correlation into account
55Method of validating objective classification
- In order to convert HCA to classification, subjective decision is required
- Kalkstein et al. (1987) suggest calling it automated instead of objective
- Cut tree so we get 3-5 clusters plus 6 outliers (at most)
- Count up number of cases in each class
- Determine dominant class for each cluster
- Number of cases correctly in dominant classes divided by total number of cases minus outliers is percent correct
56Cluster membership
- Cluster 1: 5 lines, 5 cells, 0 stratiform
- Cluster 2: 8 lines, 10 cells, 0 stratiform
- Cluster 3: 1 line, 0 cells, 11 stratiform
- Cluster 4: 4 lines, 3 cells, 0 stratiform
- 2-class: 46 correct cases / 47 total (48 - 1 outlier) = 97.8% correct
- 3-class: 30 correct / 47 total = 63.8%
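The scoring rule (dominant class per cluster, summed and divided by the number of non-outlier cases) can be checked against the membership counts on this slide. The function and variable names below are assumptions made for the sketch.

```python
# Cluster membership counts from the slide: lines, cells, stratiform
membership = {
    1: {"linear": 5, "cellular": 5, "stratiform": 0},
    2: {"linear": 8, "cellular": 10, "stratiform": 0},
    3: {"linear": 1, "cellular": 0, "stratiform": 11},
    4: {"linear": 4, "cellular": 3, "stratiform": 0},
}

def percent_correct(membership, groups):
    """Score a clustering; `groups` maps each subjective class to its scoring class."""
    correct = total = 0
    for counts in membership.values():
        merged = {}
        for cls, n in counts.items():
            merged[groups[cls]] = merged.get(groups[cls], 0) + n
        correct += max(merged.values())    # cases in the cluster's dominant class
        total += sum(merged.values())
    return correct, total, 100.0 * correct / total

three_class = {"linear": "linear", "cellular": "cellular", "stratiform": "stratiform"}
two_class = {"linear": "convective", "cellular": "convective",
             "stratiform": "non-convective"}
score3 = percent_correct(membership, three_class)
score2 = percent_correct(membership, two_class)
```

This gives 30 of 47 for three classes and 46 of 47 for two classes (about 97.9%, shown on the slide as 97.8).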
57Shape (α) and scale (β) for all 48 cases
non-conv
conv
58Performance of cluster analysis using gamma
distribution attributes
- Slight variation in performance as number of moments increases and with changes in q
59Spatial organization related attributes
- Figure labels: head x_h, tail x_t, separation vector h
- Geostatistics
- Measure aspects of spatial field as a function of separation vector h
- Variogram
- Covariance
- Correlogram
60Example variogram
- Germann and Joss (2001) 1-D
- Harris et al. (2001) 1-D structure function
61Example covariance
62Example correlogram
- Kessler and Russo (1963)
- Kessler (1966)
- Zawadzki (1973)
63Attributes: summary measures of correlation
contours
- Approximations to the area and eccentricity of an ellipse
- Lengths of major, minor axes (a, b) found using image processing techniques
- Area: ab
- Eccentricity: a/b
- 1.0 for circle, larger for flatter ellipses
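One way to approximate the axis lengths is from the second moments of the grid boxes enclosed by a correlation contour. This moment-based estimate is a swapped-in technique chosen for the sketch, not necessarily the image-processing method used in the talk; `mask` is assumed to mark the interior of one contour (for example, correlation >= 0.6), and the function name is invented.

```python
import math

def axis_lengths(mask):
    """Approximate (major, minor) axis lengths, in grid cells, of the True region in `mask`."""
    pts = [(j, i) for j, row in enumerate(mask) for i, inside in enumerate(row) if inside]
    n = len(pts)
    if n < 2:
        raise ValueError("region too small")
    mean_j = sum(p[0] for p in pts) / n
    mean_i = sum(p[1] for p in pts) / n
    syy = sum((p[0] - mean_j) ** 2 for p in pts) / n
    sxx = sum((p[1] - mean_i) ** 2 for p in pts) / n
    sxy = sum((p[0] - mean_j) * (p[1] - mean_i) for p in pts) / n
    # Eigenvalues of the 2 x 2 coordinate covariance matrix (closed form)
    half_trace = 0.5 * (sxx + syy)
    root = math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy ** 2)
    lam_max, lam_min = half_trace + root, max(half_trace - root, 0.0)
    # For a filled ellipse the variance along a semi-axis of length r is r**2 / 4
    return 4.0 * math.sqrt(lam_max), 4.0 * math.sqrt(lam_min)
```

The ratio of the two returned lengths plays the role of a/b: close to 1 for a circular contour and larger for elongated ones.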
64Image processing
65Image processing
- Threshold
- Connected component labeling
66Image processing
- Thresholding, connected component labeling, edge
detection
67Example: α, β, 0.6 contour a/b
- Figure labels: Cluster 1, Cluster 2, 3, 4, 5
- Percent correct: 3-class 76.1%, 2-class 100%
68Percent correct
- Question as to whether/how attributes should be standardized
- Test every combination of 2, 3, and 4 attributes
69Essential attributes
- β (gamma scale) and a/b (contour eccentricity)
- No clear advantage to standardization; these attributes have approximately the same range
- Addition of ab (contour area) and/or α (gamma shape) did not improve classification
- Expand the number of correlation contours to 0.2, 0.4, 0.6, and 0.8
70Example: β, a/b for 0.2, 0.4, 0.6, 0.8 contours
- Figure labels: Cluster 1, Cluster 2, 3, 4, 5
- Percent correct: 3-class 90.5%, 2-class 100%
71Examples of hybrid cases
72Automated classification
- Now using partitional clustering
- Find cluster means from best HCA results
- Object class is nearest neighbor (Euclidean
distance) to these 5 cluster means
73Automated rainfall object identification
- Contiguous regions of measurable rainfall (similar to the CRA of Ebert and McBride (2000))
74Connected component labeling
75Expand area by 15%, connect regions that are
within 20 km, relabel
76Object analysis
77Attributes from example objects
78Summary stats for 2002 data
- 799,014 objects, 8679 hours (99.1% of year)
79Size regimes
- Looking for power-law regimes on a log-log plot
- Small: meso-γ, area < (50 km)^2
- 65.6% (524,224 objects, 60/h)
- Medium: meso-β, area (50-200 km)^2
- 30.4% (242,914 objects, 28/h)
- Large: meso-α, area > (200 km)^2
- 4% of total (31,876 objects, 3.7/h)
80Data reduction - feature extraction
- Find useful features that represent the data with a relatively small number of variables or dimensions
- Determine which attributes are essential (those that help to discriminate) by experiment
- Automate the computation of essential attributes
81Random sample of large objects to validate
classification procedure
- 100 cases, classified subjectively (stratiform, linear, cellular) and by automated procedure
- Target data set consisted of large objects
- Overwhelming majority of small, medium objects have nearly identical attributes (low β, low a/b)
- Large objects will have least amount of uncertainty in parameter estimation
- Large objects have wide range of attribute values
82Log(density) of attributes
- Drop a regular grid (in log-log space)
- Count up number of objects in each gridbox
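Counting objects on a regular grid in log-log space can be sketched as follows. The function name and the bin width (in decades) are assumptions; `xs` and `ys` stand for any two positive attributes, such as gamma scale and a/b.

```python
import math
from collections import Counter

def log_density(xs, ys, bin_width=0.5):
    """Return {(x_bin, y_bin): log10(count)} on a regular grid in log10-log10 space."""
    counts = Counter()
    for x, y in zip(xs, ys):
        if x <= 0.0 or y <= 0.0:
            continue                        # logarithm undefined; skip the object
        key = (math.floor(math.log10(x) / bin_width),
               math.floor(math.log10(y) / bin_width))
        counts[key] += 1
    return {key: math.log10(n) for key, n in counts.items()}

# Made-up attribute pairs for illustration
density = log_density([1.1, 1.2, 30.0], [1.3, 1.1, 300.0])
```

Grid boxes with no objects are simply absent from the result, which avoids taking the logarithm of zero.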
83Diurnal cycle
84Monthly distribution
85Spatial distribution
86α - β
87β - a/b (0.4 contour)
88Validation of automated classification procedure
- 85% correct in three-class (linear, cellular, stratiform)
- 89% correct in two-class (convective, non-convective)
89Summary
- Developed an automated rainfall system classification procedure
- Using statistically-based characteristics of intensity and degree of organization
- Validated against random sample of 2002 data
90Future work
- Further refinement of classification scheme
- Diagnosis of ensemble forecast systems
- Feature tracking
- Climatological studies
- Apply to forecast and observed rainfall for
verification purposes
91Classification procedure
- Categorizing entities based on their similarity to other members of a class
- A taxonomy, considering rainfall systems as objects in their entirety
- General, automated procedure using 1h accumulated precipitation analyses
- Universally applicable to rainfall systems observable by NCEP Stage IV analysis system
92Method of determining essential attributes
- Classify a set of rainfall patterns (target data set) both objectively and subjectively
- If results of an objective classification agree with subjective classification, then attributes are considered essential
- We will then have a set of attributes that describe rainfall patterns in a manner consistent with an expert analyst
93Feature extraction requires tools to manipulate
data
- Multivariate statistical analysis
- Parameter estimation
- Geostatistics
- Image processing
- Pattern recognition
94Verification
- Main motivation for this work
- Verify forecasts using an object-oriented approach (e.g., Somerville 1977, Williamson 1981, Neilley 1993)
- In order to do this, must first be able to locate, analyze, and characterize objects in an automated fashion
- An automated classification procedure is needed
95Previous classification methods
- Subjective
- e.g., Maddox (1980), Bluestein and Jain (1985), Houze et al. (1990), Doswell et al. (1996), Parker and Johnson (2000)
- Rain rate threshold
- Johnson and Hamilton (1988)
- Agglomerative image segmentation
- Lakshmanan (2001)
96Subjective classification
- From Parker and Johnson (2000)
97Previous classification methods
- Analysis of local peaks
- e.g., Churchill and Houze (1984), Steiner et al. (1995), Mohr and Zipser (1996), Biggerstaff and Listemaa (2000)
- Drop size distribution
- Yuter and Houze (1997), Rao et al. (2001)
- Cloud model analysis
- Xu (1995), Lang et al. (2003)
98Analysis of local peaks
- Micro-classification
- From Biggerstaff and Listemaa (2000)
99Why segment convective/stratiform?
convective
stratiform
- Tropical rainfall
- Vertical latent heating estimation
- MCS parameterization (Alexander and Cotton 1998)
Heating rate
Divergence
Adapted from Houze (1997)
100Development of an automated classification
procedure required
- Macro-classification approach
- Want to classify entire system rather than separate regions within a system
- Use subjective methods as a guide
101Outline of talk
- Introduction
- Process of developing classification procedure
- Analysis of target data set
- Automating the procedure
- Analysis of 2002 data
- Validation of automated procedure
- Future work
103Results from using beta, a/b for 0.2, 0.4, 0.6,
0.8 contours
104PCA
105Include a discussion of future work or remaining
issues
- Various ways to verify forecasts using the attribute vector
- Generalized Euclidean distance (how to determine weight matrix?)
- Marginal distributions
- Mean errors given a certain range of 1 or more attributes
- Joint distribution of errors given .
106Summary
- Developing an events-oriented verification approach by characterizing forecasts and observations
- Cluster analysis on gamma distribution parameters successfully discriminated convective/non-convective events
- Future work involves finding attributes that describe the spatial organization of rainfall
107Ideal set of attributes
- Small set of numbers
- Easy to compute (minimize CPU time)
- Able to characterize important aspects of meteorological phenomena
- Discriminate among different significant and interesting phenomena
- Easy to explain to meteorologists
108Synthetic data
109a
b
q
110Weight matrix
- Correlation can be taken into account by modifying the error covariance matrix estimate
- Iterative solution
- First guess of q using A = I
- Use this to estimate S, invert to get A
- Iterate until convergence is reached
111Aspects of rainfall systems
- Intensity
- Degree of organization (mode)
- Orientation
- Location
112Subjective classification - MCS
- Bluestein and Jain (1985)
- Bluestein et al. (1987)
- Houze et al. (1990)
- Blanchard (1990)
- Geerts (1998)
- Parker and Johnson (2000)
113Agglomerative
- Lakshmanan (2001)
- Region-growing
114Segment convective and stratiform Analysis of
local peaks
- Churchill and Houze (1984)
- Steiner et al. (1995)
- Biggerstaff and Listemaa (2000)
- Mohr and Zipser (1996)
- Looking for reflectivity/satellite brightness pixels that stand out above the crowd
- Micro-classification, dividing a system into convective/stratiform regions
115Segment convective and stratiform Drop size
distribution
- Yuter and Houze (1997)
- Rao et al. (2001)
- Again, micro-classification
- Do not have routine access to this kind of data
116Segment convective and stratiform Cloud model
analysis
- Lang et al. (2003)
- Xu (1995)
- Looking at vertical motion, cloud/rain water mixing ratio
- Do not have access to this kind of information in real atmosphere
117Rain rate
- 6mm/hr (Johnson and Hamilton (1988))
- 20mm/hr (Churchill and Houze (1984))
118Traditional verification methods
- Compute statistics based upon matching pairs of
forecast and observed variables at the same set
of points in space/time
119Subjective not gospel truth
- Ask 5 economists, you'll get 6 opinions
- 4th class "catdog" could be classified as either lines or cells depending on who you ask
- Should not punish the objective classification if experts would have valid disagreements
- Going to compute a more realistic % correct by not punishing if catdogs are called lines or cells
120Rand statistic
- What you are trying to do is the general problem of comparing partitions between cluster solutions. Although people often use simple misclassification rates or Cohen's kappa, the best approach is to use the adjusted Rand statistic, which has been developed to do exactly what you want. For more information on the adjusted Rand, you should refer to
- Hubert, L., Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193-218.
- I don't think I can use this since I have a different number of subjective and objective classes
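For reference, the adjusted Rand statistic of Hubert and Arabie (1985) is computed from the contingency table of two labelings of the same cases. The sketch below is a plain-Python illustration with an invented function name; returning 1.0 for the degenerate case where the denominator vanishes is a convention chosen for this example.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cases."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))           # contingency table cells
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)              # agreement expected by chance
    maximum = 0.5 * (sum_a + sum_b)
    if maximum == expected:
        return 1.0                                     # degenerate case (convention)
    return (sum_cells - expected) / (maximum - expected)
```

The index is 1 for identical partitions regardless of how the classes are named, and is near 0 for partitions that agree only by chance.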
121Scores
- Can produce single scores (like Euclidean distance) but should we?
- How do you weigh errors in various attributes?
- Again, this must be user-specific, like the value question
122Determining weights for generalized Euclidean
distance
- Could this be taken from Gerrity's response to Gandin and Murphy?
- Inverse of the attribute observed frequency covariance matrix?
123Problems
- Cluster analysis finds groups with similar attributes
- My target data set might happen to have clusters that are not representative of those that may or may not be found in real atmosphere
- Classification is single-valued, either the case is in one class or another. Might need fuzzy classification
- Cell/line class in the transition zone between more definitive line and cell cases
- What does the difference between a 30% cell and a 50% cell mean in practical terms? (cost/loss for misclassification/misforecasting)
124Process of developing classification procedure
- Knowledge Discovery in Databases Process
- Understand goals/application of end-user
- Create target data set
- Preprocess data set
- Data reduction
- Choose data mining task
- Choose data mining algorithm
- Execute data mining
- Interpret mined patterns, possibly repeat steps
- Consolidate discovered knowledge