Verifying cloud forecasts
What is the half-life of a cloud forecast?
Is the Equitable Threat Score really equitable?
  • Robin Hogan
  • Ewan O'Connor, Anthony Illingworth
  • University of Reading, UK
  • Chris Ferro, Ian Jolliffe, David Stephenson
  • University of Exeter, UK

How skillful is a forecast?
ECMWF 500-hPa geopotential anomaly correlation
  • Most model evaluations of clouds test the cloud
  • What about individual forecasts?
  • Standard measure shows ECMWF forecast half-life
    of 6 days in 1980 and 9 days in 2000
  • But virtually insensitive to clouds!

  • The Cloudnet processing of ground-based radar
    and lidar observations
  • Continuous evaluation of the climatology of
    clouds in models
  • Evaluation of the diurnal cycle of boundary-layer
  • Desirable properties of verification measures
    (skill scores)
  • Usefulness for rare events: the Symmetric Extreme
    Dependency Score
  • Equitability: is the Equitable Threat Score
    equitable?
  • Testing the skill of cloud forecasts from seven
  • Skill versus cloud fraction, height, scale,
    forecast lead time, season...
  • Estimating the forecast half life
  • Testing the skill of cloud forecasts from space
  • Evaluation of ECMWF model with ICESat/GLAS lidar
  • Most results taken from these papers
  • Hogan, O'Connor & Illingworth (QJ 2009)
  • Hogan, Ferro, Jolliffe & Stephenson (WAF, in

  • Aim: to retrieve and evaluate the crucial cloud
    variables in forecast and climate models
  • Models: 8 global, mesoscale and high-resolution
    forecast models
  • Variables: cloud fraction, LWC, IWC, plus a
    number of others
  • Sites: 4 across Europe plus worldwide ARM sites
  • Period: several years, to avoid unrepresentative
    case studies
  • Current status
  • Funded by US Department of Energy Climate Change
    Prediction Program to apply to ARM data worldwide

Level 1b
  • Minimum instrument requirements at each site
  • Cloud radar, lidar, microwave radiometer, rain
    gauge, model or sondes
  • Radar
  • Lidar

Level 1c
  • Instrument Synergy product
  • Example of target classification and data quality

Level 2a/2b
  • Cloud products on (L2a) observational and (L2b)
    model grid
  • Water content and cloud fraction

L2a: IWC on radar/lidar grid. L2b: cloud fraction on model grid.
Cloud fraction
Panels: Chilbolton observations, Met Office Mesoscale Model,
ECMWF Global Model, Meteo-France ARPEGE Model, KNMI RACMO
Model, Swedish RCA model
Cloud fraction in 7 models
  • Mean PDF for 2004 for Chilbolton, Paris and

0-7 km
Illingworth et al. (BAMS 2007)
Diurnal cycle composite of clouds
Radar and lidar provide cloud boundaries and
cloud properties above site
  • Barrett, Hogan & O'Connor (GRL 2009)

Joint PDFs of cloud fraction
  • Raw (1 hr) resolution
  • 1 year from Murgtal
  • DWD COSMO model

Contingency tables
DWD model, Murgtal
                   Observed cloud         Observed clear-sky
Model cloud        a = 7194 (cloud hit)   b = 4098 (false alarm)
Model clear-sky    c = 4502 (miss)        d = 41062 (clear-sky hit)
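A minimal sketch of how such a table could be counted from paired model and observed cloud fractions. The function name, the sample values and the 0.05 threshold are illustrative assumptions, not taken from the talk.

```python
def contingency_table(model_cf, obs_cf, threshold=0.05):
    """Count hits (a), false alarms (b), misses (c) and clear-sky hits (d).

    Cloud is deemed to occur when cloud fraction exceeds `threshold`
    (the default of 0.05 is an assumed value for illustration only).
    """
    a = b = c = d = 0
    for m, o in zip(model_cf, obs_cf):
        model_cloud = m > threshold
        obs_cloud = o > threshold
        if model_cloud and obs_cloud:
            a += 1          # cloud hit
        elif model_cloud:
            b += 1          # false alarm
        elif obs_cloud:
            c += 1          # miss
        else:
            d += 1          # clear-sky hit
    return a, b, c, d


# Hypothetical hourly cloud fractions at a single height level
model_cf = [0.0, 0.3, 0.8, 0.0, 0.1, 0.0]
obs_cf = [0.0, 0.5, 0.0, 0.2, 0.4, 0.0]

a, b, c, d = contingency_table(model_cf, obs_cf)
n = a + b + c + d
base_rate = (a + c) / n     # observed frequency of occurrence
print(a, b, c, d, base_rate)
```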

Skill-Bias diagrams
For a given set of observed events, there are only 2 degrees
of freedom in all possible forecasts (e.g. a and b), because
2 quantities are fixed:
  • Number of events that occurred: $n = a + b + c + d$
  • Base rate (observed frequency of occurrence): $p = (a + c)/n$
Figure labels: Reality (n = 16, p = 1/4); Forecast

5 desirable properties of verification measures
  • Equitable: all random forecasts receive
    expected score zero
  • Constant forecasts of occurrence or
    non-occurrence also score zero
  • Note that forecasting the right cloud climatology
    versus height but with no other skill should also
    score zero
  • Difficult to hedge
  • Some measures reward under- or over-prediction
  • Useful for rare events
  • Almost all measures are degenerate in that they
    asymptote to 0 or 1 for vanishingly rare events
  • Dependence on full joint PDF, not just 2x2
    contingency table
  • Difference between cloud fraction of 0.9 and 1 is
    as important for radiation as a difference
    between 0 and 0.1
  • Difficult to achieve together with the other desirable
    properties: won't be studied much today...
  • Linear: so that an inverse exponential can be fitted
    to estimate the half-life
  • Some measures (e.g. Odds Ratio Skill Score) are
    very non-linear

Hedging
Issuing a forecast that differs from your true belief in order
to improve your score (e.g. Jolliffe 2008)
  • Hit rate $H = a/(a+c)$
  • Fraction of events correctly forecast
  • Easily hedged by randomly changing some forecasts
    of non-occurrence to occurrence
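As an illustration of why hit rate is easily hedged, the sketch below takes the extreme case of switching every forecast to "cloud", using the Murgtal counts quoted earlier in the talk. The Peirce Skill Score (hit rate minus false alarm rate) is included for contrast; the function names are illustrative.

```python
def hit_rate(a, b, c, d):
    """Fraction of observed events that were forecast: H = a / (a + c)."""
    return a / (a + c)


def peirce_skill_score(a, b, c, d):
    """Hit rate minus false alarm rate: a / (a + c) - b / (b + d)."""
    return a / (a + c) - b / (b + d)


honest = (7194, 4098, 4502, 41062)      # DWD model, Murgtal
a, b, c, d = honest
# Extreme hedge: always forecast cloud, so every miss becomes a hit
# and every clear-sky hit becomes a false alarm.
always_cloud = (a + c, b + d, 0, 0)

for label, table in (("honest", honest), ("always cloud", always_cloud)):
    print(label, hit_rate(*table), peirce_skill_score(*table))
```

The hedged forecast reaches the maximum hit rate while its Peirce Skill Score drops to zero, which is the behaviour an equitable measure should show for a constant forecast.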

Equitability
  • Defined by Gandin and Murphy (1992)
  • Requirement 1: An equitable verification measure
    awards all random forecasting systems, including
    those that always forecast the same value, the
    same expected score
  • Inequitable measures rank some random forecasts
    above skillful ones
  • Requirement 2: An equitable verification measure
    S must be expressible as the linear weighted sum
    of the elements of the contingency table, i.e.
    $S = (S_a a + S_b b + S_c c + S_d d)/n$
  • This can safely be discarded: it is incompatible
    with other desirable properties, e.g. usefulness
    for rare events
  • Gandin and Murphy reported that only the Peirce
    Skill Score and linear transforms of it are
    equitable by their requirements
  • PSS = Hit Rate minus False Alarm Rate:
    $\mathrm{PSS} = a/(a+c) - b/(b+d)$
  • What about all the other measures reported to be
    equitable?

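Some reportedly equitable measures
$\mathrm{HSS} = [x - E(x)] / [n - E(x)]$, where $x = a + d$
$\mathrm{ETS} = [a - E(a)] / [a + b + c - E(a)]$
$E(a) = (a+b)(a+c)/n$ is the expected value of a
for an unbiased random forecasting system
$\mathrm{LOR} = \ln(ad/bc)$
$\mathrm{ORSS} = (ad/bc - 1) / (ad/bc + 1)$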
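The four measures above can be sketched in a few lines of Python and evaluated on the Murgtal contingency table quoted earlier. This assumes the usual definitions: E(a) = (a+b)(a+c)/n, and for HSS the expected number of correct forecasts E(x) = E(a) + E(d) with E(d) = (c+d)(b+d)/n. The function names are illustrative.

```python
import math


def heidke_skill_score(a, b, c, d):
    n = a + b + c + d
    x = a + d                                   # number of correct forecasts
    # Expected number correct by chance: E(a) + E(d)
    expected_x = ((a + b) * (a + c) + (c + d) * (b + d)) / n
    return (x - expected_x) / (n - expected_x)


def equitable_threat_score(a, b, c, d):
    n = a + b + c + d
    expected_a = (a + b) * (a + c) / n          # hits expected by chance
    return (a - expected_a) / (a + b + c - expected_a)


def log_odds_ratio(a, b, c, d):
    return math.log(a * d / (b * c))


def odds_ratio_skill_score(a, b, c, d):
    theta = a * d / (b * c)                     # odds ratio
    return (theta - 1) / (theta + 1)


murgtal = (7194, 4098, 4502, 41062)             # a, b, c, d for the DWD model
for score in (heidke_skill_score, equitable_threat_score,
              log_odds_ratio, odds_ratio_skill_score):
    print(f"{score.__name__}: {score(*murgtal):.3f}")
```

Note that these functions divide by zero for degenerate tables (for example b = 0 or c = 0 in the odds ratio); a production version would need to decide how to handle those cases.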
Skill versus cloud-fraction threshold
  • Consider 7 models evaluated over 3 European sites
    in 2003-2004

Extreme dependency score
  • Stephenson et al. (2008) explained this behavior
  • Almost all scores have a meaningless limit as
    base rate $p \to 0$
  • HSS tends to zero and LOR tends to infinity
  • They proposed the Extreme Dependency Score
    $\mathrm{EDS} = 2\ln[(a+c)/n] / \ln(a/n) - 1$
  • where $n = a + b + c + d$
  • It can be shown that this score tends to a
    meaningful limit
  • Rewrite in terms of hit rate $H = a/(a+c)$ and base
    rate $p = (a+c)/n$
  • Then assume a power-law dependence of H on p as
    $p \to 0$: $H \propto p^{\delta}$
  • In the limit $p \to 0$ we find
    $\mathrm{EDS} \to (1-\delta)/(1+\delta)$
  • This is useful because random forecasts have hit
    rate converging to zero at the same rate as base
    rate: $\delta = 1$, so EDS = 0
  • Perfect forecasts have constant hit rate with
    base rate: $\delta = 0$, so EDS = 1
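The limit can be worked through in a few lines. This sketch takes the EDS definition of Stephenson et al. (2008) and writes the assumed power law as $H = \kappa p^{\delta}$, where the constant $\kappa$ is a symbol introduced here for illustration.

```latex
\begin{align}
\mathrm{EDS} &= \frac{2\ln[(a+c)/n]}{\ln(a/n)} - 1
  = \frac{2\ln p}{\ln(pH)} - 1
  = \frac{\ln p - \ln H}{\ln p + \ln H}, \\
H = \kappa\,p^{\delta}
  \;&\Rightarrow\;
  \mathrm{EDS} = \frac{(1-\delta)\ln p - \ln\kappa}{(1+\delta)\ln p + \ln\kappa}
  \;\xrightarrow{\;p \to 0\;}\; \frac{1-\delta}{1+\delta}.
\end{align}
```

The first line uses $a/n = pH$; in the second, $\ln p \to -\infty$ dominates the constant $\ln\kappa$, giving 0 for $\delta = 1$ and 1 for $\delta = 0$.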

Symmetric extreme dependency score
  • EDS problems
  • Easy to hedge (unless calibrated)
  • Not equitable
  • Solved by defining a symmetric version:
    $\mathrm{SEDS} = \{\ln[(a+b)/n] + \ln[(a+c)/n]\} / \ln(a/n) - 1$
  • All the benefits of EDS, none of the drawbacks!
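A short sketch comparing the two scores, assuming the definitions EDS = 2 ln[(a+c)/n] / ln(a/n) - 1 (Stephenson et al. 2008) and SEDS = {ln[(a+b)/n] + ln[(a+c)/n]} / ln(a/n) - 1 (Hogan, O'Connor and Illingworth 2009). It uses the Murgtal counts from earlier in the talk and an "always forecast cloud" table to illustrate the hedging problem.

```python
import math


def eds(a, b, c, d):
    """Extreme Dependency Score (depends only on a, c and n)."""
    n = a + b + c + d
    return 2 * math.log((a + c) / n) / math.log(a / n) - 1


def seds(a, b, c, d):
    """Symmetric Extreme Dependency Score (also uses the forecast rate)."""
    n = a + b + c + d
    return (math.log((a + b) / n) + math.log((a + c) / n)) / math.log(a / n) - 1


murgtal = (7194, 4098, 4502, 41062)
a, b, c, d = murgtal
always_cloud = (a + c, b + d, 0, 0)     # hedge: forecast cloud every time

for label, table in (("Murgtal", murgtal), ("always cloud", always_cloud)):
    print(label, round(eds(*table), 3), round(seds(*table), 3))
```

Under these definitions the constant forecast gets the maximum EDS but zero SEDS, and SEDS is unchanged when false alarms and misses are swapped, which is the symmetry the name refers to.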

Hogan, O'Connor and Illingworth (2009 QJRMS)
Skill versus cloud-fraction threshold
SEDS has much flatter behaviour for all models
(except for Met Office which underestimates high
cloud occurrence significantly)
Skill versus height
  • Most scores not reliable near the tropopause
    because cloud fraction tends to zero

A surprise?
  • Is mid-level cloud well forecast???
  • Frequency of occurrence of these clouds is
    commonly too low (e.g. from Cloudnet Illingworth
    et al. 2007)
  • Specification of cloud phase cited as a problem
  • Higher skill could be because large-scale ascent
    has largest amplitude here, so cloud response to
    large-scale dynamics most clear at mid levels
  • Higher skill for Met Office models (global and
    mesoscale) because they have the arguably most
    sophisticated microphysics, with separate liquid
    and ice water content (Wilson and Ballard 1999)?
  • Low skill for boundary-layer cloud is not a
    surprise
  • Well known problem for forecasting (Martin et al.
  • Occurrence and height a subtle function of
    subsidence rate, stability, free-troposphere
    humidity, surface fluxes, entrainment rate...

Key properties for estimating ½ life
  • We wish to model the score S versus forecast lead
    time t as $S(t) = S_0\, 2^{-t/t_{1/2}}$
  • where $S_0$ is the initial score and $t_{1/2}$ is
    the forecast half-life
  • We need linearity
  • Some measures saturate at high skill end
    (e.g. Yule's Q / ORSS)
  • Leads to misleadingly long half-life
  • ...and equitability
  • The formula above assumes that score tends to
    zero for very long forecasts, which will only
    occur if the measure is equitable
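Both requirements can be seen by taking logarithms of the inverse-exponential model. In this sketch, $m$ denotes the slope of a straight-line fit and $S_\infty$ a hypothetical non-zero score at very long lead times; both symbols are introduced here for illustration.

```latex
\begin{align}
S(t) = S_0\,2^{-t/t_{1/2}}
\quad&\Longleftrightarrow\quad
\ln S(t) = \ln S_0 - \frac{\ln 2}{t_{1/2}}\,t ,
\qquad t_{1/2} = -\frac{\ln 2}{m}, \\
S(t) = S_\infty + (S_0 - S_\infty)\,2^{-t/t_{1/2}}
\quad&\Longrightarrow\quad
\ln S(t) \ \text{is no longer linear in } t \text{ unless } S_\infty = 0 .
\end{align}
```

A measure that saturates near its upper limit flattens the start of the curve and so lengthens the fitted half-life, while an inequitable measure has $S_\infty \neq 0$ and breaks the straight-line relationship.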

Which measures are equitable?
  • Expected values of a to d for a random forecasting
    system may score zero:
    $S[E(a), E(b), E(c), E(d)] = 0$
  • But expected score may not be zero!
    $E[S(a,b,c,d)] = \sum P(a,b,c,d)\,S(a,b,c,d)$
  • Width of random probability distribution
    decreases for larger sample size n
  • A measure is only equitable if positive and
    negative scores cancel
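The expected score can be computed exactly for small samples by summing over every possible contingency table. The sketch below makes several assumptions that are not from the talk: forecasts are independent of observations, the forecast frequency equals the base rate p (unbiased), the table therefore follows a multinomial distribution, and the two degenerate tables where a score is undefined (all hits or all clear-sky hits) are assigned a score of zero.

```python
from math import factorial


def heidke_skill_score(a, b, c, d):
    n = a + b + c + d
    if a == n or d == n:        # degenerate table: undefined, set to 0 (assumption)
        return 0.0
    expected_x = ((a + b) * (a + c) + (c + d) * (b + d)) / n
    return (a + d - expected_x) / (n - expected_x)


def equitable_threat_score(a, b, c, d):
    n = a + b + c + d
    if a == n or d == n:        # degenerate table: undefined, set to 0 (assumption)
        return 0.0
    expected_a = (a + b) * (a + c) / n
    return (a - expected_a) / (a + b + c - expected_a)


def expected_score(score, n, p):
    """Exact E[S] for unbiased random forecasts of an event with base rate p."""
    # Cell probabilities for (a, b, c, d) when forecast and observation are independent
    probs = (p * p, p * (1 - p), (1 - p) * p, (1 - p) * (1 - p))
    total = 0.0
    for a in range(n + 1):
        for b in range(n + 1 - a):
            for c in range(n + 1 - a - b):
                d = n - a - b - c
                prob = factorial(n) // (factorial(a) * factorial(b)
                                        * factorial(c) * factorial(d))
                for count, q in zip((a, b, c, d), probs):
                    prob *= q ** count
                total += prob * score(a, b, c, d)
    return total


for n in (10, 20, 40):
    print(n, expected_score(heidke_skill_score, n, 0.5),
          expected_score(equitable_threat_score, n, 0.5))
```

Under these assumptions the positive and negative Heidke scores cancel for any n, whereas the Equitable Threat Score keeps a small positive expected value for finite samples.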

Asymptotic equitability
  • Consider first unbiased forecasts of events that
    occur with probability p = ½

What about rarer events?
  • Equitable Threat Score still virtually
    equitable for n > 30
  • ORSS, EDS and SEDS approach zero much more slowly
    with n
  • For events that occur 2% of the time (e.g.
    Finley's tornado forecasts), need n > 25,000
    before magnitude of expected score is less than
  • But these measures are supposed to be useful for
    rare events!

Possible solutions
  • Ensure n is large enough that E(a) > 10
  • Inequitable scores can be scaled to make them
    equitable
  • This opens the way to a new class of non-linear
    equitable measures

What is the origin of the term ETS?
  • First use of "Equitable Threat Score": Mesinger &
    Black (1992)
  • A modification of the Threat Score $a/(a+b+c)$
  • They cited Gandin and Murphy's equitability
    requirement that constant forecasts score zero
    (which ETS does), although it doesn't satisfy the
    requirement that non-constant random forecasts
    have expected score 0
  • ETS is now one of the most widely used verification
    measures in meteorology
  • An example of rediscovery
  • Gilbert (1884) discussed $a/(a+b+c)$ as a possible
    verification measure in the context of Finley's
    (1884) tornado forecasts
  • Gilbert noted deficiencies of this and also
    proposed exactly the same formula as ETS, 108
    years before!
  • Suggest that ETS is referred to as the Gilbert
    Skill Score (GSS)
  • Or use the Heidke Skill Score, which is
    unconditionally equitable and is uniquely related
    to ETS: $\mathrm{ETS} = \mathrm{HSS}/(2 - \mathrm{HSS})$
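The relation between the two scores follows directly from their definitions. In this sketch $\Delta = ad - bc$ is shorthand introduced here, and $E(d) = (c+d)(b+d)/n$ is assumed as the chance expectation of clear-sky hits, by analogy with $E(a) = (a+b)(a+c)/n$.

```latex
\begin{align}
a - E(a) &= \frac{a(a+b+c+d) - (a+b)(a+c)}{n} = \frac{ad - bc}{n} = \frac{\Delta}{n},
\qquad d - E(d) = \frac{\Delta}{n}, \\
\mathrm{ETS} &= \frac{a - E(a)}{a + b + c - E(a)}
  = \frac{\Delta/n}{b + c + \Delta/n}
  = \frac{\Delta}{n(b+c) + \Delta}, \\
\mathrm{HSS} &= \frac{x - E(x)}{n - E(x)}
  = \frac{2\Delta/n}{b + c + 2\Delta/n}
  = \frac{2\Delta}{n(b+c) + 2\Delta}, \\
\frac{\mathrm{HSS}}{2 - \mathrm{HSS}}
  &= \frac{2\Delta}{2n(b+c) + 2\Delta}
  = \frac{\Delta}{n(b+c) + \Delta} = \mathrm{ETS}.
\end{align}
```

Because the mapping is monotonic, the two scores always rank forecasts identically; only the scaling differs.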

Hogan, Ferro, Jolliffe and Stephenson (WAF, in
Properties of various measures
Measure                                                  Equitable  Useful for rare events  Linear
Peirce Skill Score, PSS; Heidke Skill Score, HSS         Y          N                       Y
Equitably Transformed SEDS                               Y          Y
Symmetric Extreme Dependency Score, SEDS                            Y
Log of Odds Ratio, LOR
Odds Ratio Skill Score, ORSS (also known as Yule's Q)                                       N
Gilbert Skill Score, GSS (formerly ETS)                             N                       N
Extreme Dependency Score, EDS                            N          Y
Hit rate, H; False alarm rate, FAR                       N          N                       Y
Critical Success Index, CSI                              N          N                       N
  • Truly equitable
  • Asymptotically equitable
  • Not equitable

Skill versus lead time
  • Only possible for UK Met Office 12-km model and
    German DWD 7-km model
  • Steady decrease of skill with lead time
  • Both models appear to improve between 2004 and
  • Generally, UK model best over UK, German best
    over Germany
  • An exception is Murgtal in 2007 (Met Office model

Forecast half life
  • Fit an inverse-exponential: $S(t) = S_0\, 2^{-t/t_{1/2}}$
  • $S_0$ is the initial score and $t_{1/2}$ is the half-life
  • Noticeably longer half-life fitted after 36 hours
  • Same thing found for Met Office rainfall forecast
    (Roberts 2008)
  • First timescale due to data assimilation and
    convective events
  • Second due to more predictable large-scale
    weather systems
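One way to estimate the two timescales is a straight-line least-squares fit of ln S against lead time, applied separately before and after 36 hours. The sketch below uses synthetic, noise-free scores with assumed half-lives of 48 and 96 hours purely to show that the fit recovers them; none of the numbers are from the talk.

```python
import math


def fit_half_life(lead_times, scores):
    """Least-squares fit of ln S = ln S0 - (ln 2 / t_half) * t.

    Returns (S0, t_half). Requires all scores to be positive.
    """
    logs = [math.log(s) for s in scores]
    n = len(lead_times)
    mean_t = sum(lead_times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in lead_times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(lead_times, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -math.log(2) / slope


def synthetic_score(t):
    """Hypothetical skill: 48 h half-life up to 36 h, 96 h half-life afterwards."""
    if t <= 36:
        return 0.5 * 2 ** (-t / 48)
    return 0.5 * 2 ** (-36 / 48) * 2 ** (-(t - 36) / 96)


hours = [6, 12, 18, 24, 30, 36, 42, 48, 54, 60]
scores = [synthetic_score(t) for t in hours]

early = [(t, s) for t, s in zip(hours, scores) if t <= 36]
late = [(t, s) for t, s in zip(hours, scores) if t >= 36]
s0_early, half_early = fit_half_life(*zip(*early))
s0_late, half_late = fit_half_life(*zip(*late))
print(half_early, half_late)
```

With real scores the fit is only meaningful for a measure that is both linear and equitable, and any lead times with non-positive scores would have to be excluded before taking logarithms.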

Why is half-life less for clouds than pressure?
  • Different spatial scales? Convection?
  • Average temporally before calculating skill
  • Absolute score and half-life increase with number
    of hours averaged

Geopotential height anomaly Vertical velocity
  • Cloud is noisier than geopotential height Z
    because it is separated by around two orders of
  • Cloud ~ vertical wind ~ relative vorticity ~
    ∇²streamfunction ~ ∇²pressure
  • Suggests cloud observations should be used
    routinely to evaluate models

Satellite observations: ICESat
  • Cloud observations from ICESat 0.5-micron lidar
    (first data Feb 2004)
  • Global coverage, but lidar attenuated by thick
    clouds: direct model comparison difficult

Lidar apparent backscatter coefficient (m^-1 sr^-1)
Optically thick liquid cloud obscures view of any
clouds beneath
Solution: forward-model the measurements
(including attenuation) using the ECMWF variables
Global cloud fraction comparison
ECMWF raw cloud fraction
ECMWF processed cloud fraction
  • Results for October 2003
  • Tropical convection peaks too high
  • Too much polar cloud
  • Elsewhere agreement is good
  • Results can be ambiguous
  • An apparent low cloud underestimate could be a
    real error, or could be due to high cloud above
    being too thick

ICESat cloud fraction
Wilkinson, Hogan, Illingworth and Benedetti (MWR
Testing the model skill from space
Unreliable region
  • Clearly need to apply SEDS to cloud estimated
    from lidar and radar!

Wilkinson, Hogan, Illingworth and Benedetti (MWR
CCPP project
  • US Dept of Energy Climate Change Prediction
    Program recently funded 5-year consortium project
    centred at Brookhaven, NY
  • Implement updated Cloudnet processing system at
    Atmospheric Radiation Measurement (ARM)
    radar-lidar sites worldwide
  • Ingests ARM's cloud boundary diagnosis, but uses
    Cloudnet for stats
  • New diagnostics being tested
  • Testing of NWP models
  • NCEP, ECMWF, Met Office, Meteo-France...
  • Over a decade of data at several sites: have
    cloud forecasts improved over this time?
  • Single-column model testbed
  • SCM versions of many GCMs will be run over ARM
    sites by Roel Neggers
  • Different parameterization schemes tested
  • Verification measures can be used to judge

US Southern Great Plains 2004
Summary and outlook
  • Model comparisons reveal
  • Half-life of a cloud forecast is between 2.5 and
    4 days, much less than 9 days for ECMWF 500-hPa
    geopotential height forecast
  • In Europe, higher skill for mid-level cloud and
    lower for boundary-layer cloud, but larger
    seasonal contrast in Southern US
  • Findings applicable to other verification
  • Symmetric Extreme Dependency Score is a
    reliable measure of skill for both common and
    rare events (given we have large enough sample)
  • Many measures regarded as equitable are only so
    for very large samples, including the Equitable
    Threat Score, but they can be rescaled
  • Future work (in addition to CCPP)
  • CloudSat and CALIPSO: what is the skill of cloud
    forecasts globally?
  • What is half-life of ECMWF cloud forecasts? (Need
    more data!)
  • Near-real-time evaluation for rapid feedback to
    NWP centres?

Monthly skill versus time
  • Measure of the skill of forecasting cloud
  • Comparing models using similar forecast lead time
  • Compared with the persistence forecast
    (yesterday's measurements)
  • Lower skill in summer convective events

Statistics from AMF
  • Murgtal, Germany, 2007
  • 140-day comparison with Met Office 12-km model
  • Dataset released to the COPS community
  • Includes German DWD model at multiple resolutions
    and forecast lead times

Possible skill scores
Contingency table
                     Observed cloud    Observed clear sky
Modeled cloud        a: hit            b: false alarm
Modeled clear sky    c: miss           d: correct negative
  • Cloud deemed to occur when cloud fraction f is
    larger than some threshold $f_\mathrm{thresh}$
  • To ensure equitability and linearity, we can use
    the concept of the generalized skill score
    $S = (x - x_\mathrm{random}) / (x_\mathrm{perfect} - x_\mathrm{random})$
  • Where x is any number derived from the joint PDF
  • Resulting scores vary linearly from random = 0 to
    perfect = 1
  • Simplest example: Heidke skill score (HSS) uses
    $x = a + d$
  • We will use this as a reference to test other

DWD model
a = 7194     b = 4098
c = 4502     d = 41062
Perfect forecast
a_p = 11696  b_p = 0
c_p = 0      d_p = 45160
Random forecast
a_r = 2581   b_r = 8711
c_r = 9115   d_r = 36449
  • Brier skill score uses x = mean squared
    cloud-fraction difference; Linear Brier skill
    score (LBSS) uses x = mean absolute difference
  • Sensitive to errors in model for all values of
    cloud fraction
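A minimal sketch of the generalized skill score, assuming the form S = (x - x_random) / (x_perfect - x_random) and using x = a + d (the Heidke choice) with the model, perfect and random tables listed on this slide. The function and variable names are illustrative.

```python
def generalized_skill_score(x, x_random, x_perfect):
    """Scale x linearly so that a random forecast scores 0 and a perfect one 1."""
    return (x - x_random) / (x_perfect - x_random)


# (a, b, c, d) tables for the DWD model at Murgtal, as listed on the slide
model_table = (7194, 4098, 4502, 41062)
perfect_table = (11696, 0, 0, 45160)
random_table = (2581, 8711, 9115, 36449)


def number_correct(table):
    a, b, c, d = table
    return a + d            # Heidke choice: x = hits + correct negatives


score = generalized_skill_score(number_correct(model_table),
                                number_correct(random_table),
                                number_correct(perfect_table))
print(round(score, 3))
```

The same function applies to the Brier-type scores by passing a mean squared or mean absolute cloud-fraction difference as x, in which case x_perfect is zero.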

Alternative approach
  • How valid is it to estimate 3D cloud fraction
    from 2D slice?
  • Henderson and Pincus (2009) imply that it is
    reasonable, although presumably not in convective
  • Alternative: treat cloud fraction as a
    probability forecast
  • Each time the model forecasts a particular cloud
    fraction, calculate the fraction of time that
    cloud was observed instantaneously over the site
  • Leads to a Reliability Diagram
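The points of such a reliability diagram could be computed as sketched below: forecasts are grouped into cloud-fraction bins, and for each bin the fraction of times cloud was observed over the site is calculated. The bin count and sample data are hypothetical.

```python
def reliability_curve(forecast_cf, observed_cloud, n_bins=5):
    """Return (mean forecast, observed frequency, count) for each non-empty bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for f, o in zip(forecast_cf, observed_cloud):
        i = min(int(f * n_bins), n_bins - 1)    # f = 1.0 falls in the top bin
        sums[i] += f
        hits[i] += 1 if o else 0
        counts[i] += 1
    return [(sums[i] / counts[i], hits[i] / counts[i], counts[i])
            for i in range(n_bins) if counts[i] > 0]


# Hypothetical model cloud fractions and instantaneous cloud detections overhead
forecast_cf = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
observed_cloud = [False, False, True, False, True, True, True, False]

curve = reliability_curve(forecast_cf, observed_cloud)
for mean_forecast, observed_frequency, count in curve:
    print(f"{mean_forecast:.2f} {observed_frequency:.2f} {count}")
```

A perfectly reliable forecast would have observed frequency equal to mean forecast in every bin, so the points would lie on the diagonal.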

No resolution
No skill
Jakob et al. (2004)
ECMWF raw cloud fraction
  • Simulate lidar backscatter
  • Create subcolumns with max-rand overlap
  • Forward-model lidar backscatter from ECMWF water
    content and particle size
  • Remove signals below lidar sensitivity
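The attenuation and sensitivity steps can be sketched for a single subcolumn as follows. This is a simplified illustration, not the scheme used in the study: it assumes single scattering with two-way attenuation exp(-2τ), ignores multiple scattering and molecular returns, takes backscatter and extinction profiles as given inputs, and uses hypothetical values throughout. Subcolumn generation with overlap is not shown.

```python
import math


def apparent_backscatter(beta, extinction, dz, min_detectable):
    """Attenuated backscatter seen by a downward-looking spaceborne lidar.

    beta and extinction are listed from the top of the atmosphere downward,
    in m^-1 sr^-1 and m^-1; dz is the layer thickness in m. Signals below
    min_detectable are set to zero (no cloud detected).
    """
    optical_depth = 0.0
    result = []
    for b, alpha in zip(beta, extinction):
        tau_mid = optical_depth + 0.5 * alpha * dz      # optical depth to mid-layer
        signal = b * math.exp(-2.0 * tau_mid)           # two-way attenuation
        result.append(signal if signal >= min_detectable else 0.0)
        optical_depth += alpha * dz
    return result


# Hypothetical column: thin ice cloud, thick liquid cloud (two layers),
# clear layer, then a low cloud underneath
beta = [0.0, 5e-6, 1e-3, 1e-3, 0.0, 1e-3]
extinction = [0.0, 1e-4, 2e-2, 2e-2, 0.0, 2e-2]
observed = apparent_backscatter(beta, extinction, dz=100.0, min_detectable=1e-7)
detected = [signal > 0.0 for signal in observed]
print(detected)
```

In this example the lowest cloud layer exists in the model column but falls below the assumed sensitivity after attenuation, so it would be excluded from the simulated cloud fraction in the same way that the real lidar would miss it.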

Testing the model climatology
