Slide 1: OM8-30 for assessment and outcome in OME: origins, applications and the ultra-short OM2-13
Mark Haggard, Helen Spencer, Mariella Gregori
MRC Multi-centre Otitis Media Study Group, Cambridge, UK
with Wendy Floate, Cathy Harper (Mid-Cheshire Hospitals NHS Trust; Kingston Hospital NHS Trust) and Eurotitis 2 Study Group Collaborators
BACDA 20th Anniversary Meeting, London, 27 January; revised Autumn 2006 for website
Slide 2: It is necessary to have
- Valid and reliable measures of outcome in important domains, for clinical trials and other types of study
- Standardised, efficient assessment tools
  - defining cases as to health and development status, not just pathology
  - as clinical indicators for treatment decisions
- International comparability on such measures (for general communication about case types, or multi-national studies that seek to accumulate or contrast)
Slide 3: The next 5 slides convey
- The starting point in deciding what facets and domains to include in a measure. This is usually done by interviewing about 20 people as qualitative preliminaries; we wanted greater comprehensiveness and a quantitative estimate of relative importance, so gave an open-ended questionnaire to over 1000 parents.
- An overview of the multiple stages gone through in the psychometric development. This is highly simplified and each stage actually consisted of many sub-stages. The latter occasionally had to differ according to the aim of the resulting measure, eg behaviour (generic, developed on unaffected children) vs reported hearing difficulties (RHD items scaled, see below, against HL to maximise correlation) vs all others (conventional internal consistency within the clinical sample).
- The resulting mixture of facets served by items in OM8-30.
- Their breakdown into reliably supportable domains.
- The high concurrent (criterion) validity of the 32-item short form against the 83-item long form used in the TARGET trial for maximum inclusiveness. Diminishing returns are met, in that the gain in reliability from adding back in the extra 51 items does not make the full set much better in various applications such as showing group or treatment differences (examples of equivalence not given here).
Slide 4: Content validity: most often mentioned categories of parent concern (N ~ 1100). The long-form (TARGET trial) and short-form (OM8-30) measures take their weighting from the % of mentions.

  Hearing           20.8     School progress          20.1
  Behaviour          8.3     Family impact            10.9
  Speech/language    6.3     Child QoL                13.1
  Safety             3.9     Physical health           5.5
  Miscellaneous      5.7     Long-term ear/hearing     3.2
  Balance            0.9     Sleep pattern             1.3

About 90% of concerns are covered by OM8-30 measures:
- Physical health comprises ear symptoms, respiratory symptoms, global health and other physical problems
- Child quality of life covers missing out socially, ambiguous reaction of others to the child, and a vague future
- Family impact aligns with Parent Quality of Life when some of the following are included: ambiguous communication, non-acknowledgement, service delivery, treatment anxieties, other
Slide 5: [Flow diagram of the psychometric development process. Elements shown: item pool; item scaling, item selection and item weighting (quantification of each response level, response rate, range, consistency, reliability, validity); provisional internally constructed scale; definition of scale unit; external scale (ear infection, N); formulation of scaled score by principal component; score; various types of validation of the developed score; items not used.]
Slide 6: Given 6 domains, the number of items in each mini-measure must be small
- Behaviour (6)
- Parent QoL (5)
- School progress (1)
- Speech/language (3)
- Sleep pattern (3)
- Reported Hearing Difficulty (4)
- Ear symptoms (3)
- Global health (1)
- Respiratory symptoms (5)
Slide 7: 2-factor summary of impact with 27 items
- Developmental impact (15): Behaviour (6), Parent QoL (5), School progress (1), Speech/language (3)
- Physical health (12): Sleep pattern (3), Ear symptoms (3), Global health (1), Respiratory symptoms (5)
- For bias adjustment (4): Reported Hearing Difficulties
Slide 8: Concurrent validity: correlation of the bias-adjusted 27-item OM8-30 total score with the bias-adjusted 83-item TARGET total (the unadjusted correlation is even higher): r = 0.90, N = 324. [Scatter plot: weighted bias-adjusted total (83 items) against the (bias-adjusted) 27-item score from OM8-30.]
Slide 9: The next 6 slides convey
- The general format of the items, and the need not to assume that the separations between adjacent pairs of response levels are necessarily all equal (eg 1.00).
- The idea of optimally scaling the response levels of a Likert item, to maximise its discriminating potential. This is done by a regression between the item, distinguishing its response levels initially as floating categories, and the raw total count (ie for each individual in a very large sample). Thus the best spacing between the response levels is determined by the average spacing in the item count for similar items, as this is what maximises the correlation. The particular example shows an additional sophistication, contingent scoring for one response category. That has now been abandoned, as the multiplication involved was found to add unnecessary variability; "frequent colds" is now scored as a separate additive item like any other, and a single scale value is attributed to the "only when colds" answers.
- A graphic representation of this idea, whereby for the underlying more sensitive scale the equality assumption is wrong (orange), whereas the scaled version with empirically assigned spacings between response levels is correct, in the sense that it maps the item more efficiently onto a better and more highly aggregated version of what it is trying to measure. One could go round the iterative loop one or more further times, eg basing the total not on the raw dichotomy items but on such scaled versions, and then re-scaling. However this is labour-intensive, and automatic algorithms to do this are perhaps not to be trusted at this stage. Maximum gain comes from the first stage as described.
- Using the expected moderate correlation of a reported scale (RHD) with a measured one (HL) as a test, the enhancement to the correlation from scaling (compare the penultimate with the last column) is worth having. Although 7% does not sound huge, it can be mapped into substantial savings in sample size, hence feasibility of studies. Evident in the comparison table is that the gain from scaling is roughly equivalent to doubling the length of that part of the questionnaire by adding back in the less good 5 items discarded for OM8-30.
- Clarification of the distinction between the purpose and content of the long and short forms.
- The particular items before and after the selection for the short form (OM8-30).
Slide 10: Response options and item scaling
Typical questionnaire format: "How often does your child <ACTION>?"
Typical response options: Never / Sometimes / Often / Always
Typical item coding for data entry: 0 Never, 1 Sometimes, 2 Often, 3 Always
Possible item scaling, obtained by predicting HL (categorical regression): 0 Never, 0.1 Sometimes, 0.5 Often, 0.6 Always
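To make the scaling idea of slides 9-12 concrete, here is a minimal sketch in Python (pandas/numpy) of deriving unequal spacings for a Likert item's response levels from a criterion such as HL. The column names and the simulated data are illustrative assumptions only; the actual OM8-30 development used categorical regression on large clinical samples and the rescaling conventions described above.

```python
# A minimal sketch (not the authors' actual procedure) of scaling a Likert
# item's response levels against a criterion such as hearing level (HL).
# Column names ("item", "hl") are illustrative.
import numpy as np
import pandas as pd

def scale_item(df, item="item", criterion="hl"):
    """Empirically spaced codes for each response level of `item`.

    Treating each level as a floating category, its provisional scale value is
    the mean of the criterion within that level (equivalent to regressing the
    criterion on dummy-coded levels). The means are then shifted so the lowest
    level scores 0; any convenient rescaling of the spacings can follow.
    """
    level_means = df.groupby(item)[criterion].mean().sort_index()
    return (level_means - level_means.min()).round(2).to_dict()

# Toy demonstration with simulated data
rng = np.random.default_rng(0)
levels = rng.integers(0, 4, size=2000)                # 0=Never ... 3=Always
true_spacing = np.array([0.0, 0.1, 0.5, 0.6])         # unequal, as on slide 10
hl = 20 + 15 * true_spacing[levels] + rng.normal(0, 8, 2000)
demo = pd.DataFrame({"item": levels, "hl": hl})
print(scale_item(demo))   # recovers roughly the unequal spacings, not 0, 1, 2, 3
```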
Slide 11: Item format and scaling for scoring [figure]
Slide 12: Scaling by categorical regression: a frequency-based Likert item. [Figure: mean HL (transformed) plotted against the item's response levels from Never to Always, contrasting equispaced data-entry coding with the justified final coding (eg 0.6 rather than 3 for the top level).]
Slide 13: Item scaling improves score reliability and validity by ~7%

                                 RHD-9 scaled   RHD-4 scaled   RHD-4 not scaled
  Correlation with avHL (set A)      0.400          0.398            0.352
  Correlation with avHL (set B)      0.530          0.527            0.466

  r (RHD-9 with RHD-4), set A = 0.93
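As an illustration of the point on slide 9 that a 7% gain in validity correlation maps into worthwhile sample-size savings, this sketch applies the standard Fisher-z approximation to the set A correlations above. It is a generic calculation under assumed alpha and power, not the authors' own power analysis.

```python
# Illustrative only: how a modest gain in validity correlation maps into the
# N needed to detect that correlation (Fisher-z approximation).
from math import atanh
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return ((z_a + z_b) / atanh(r)) ** 2 + 3

for label, r in [("RHD-4 scaled", 0.398), ("RHD-4 not scaled", 0.352)]:
    print(f"{label:18s} r={r:.3f}  N needed ~ {n_for_correlation(r):.0f}")
# The scaled version needs noticeably fewer cases for the same power.
```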
Slide 14: Developing RHD-9 and RHD-4
- RHD-9: the best way of totalling the 9 best items on Reported Hearing Difficulties
  - comprehensive and reliable, for clinical research
  - items scaled and weighted for optimality
- RHD-4: a simple short form (the 4 items given highest weight by the 1st principal component)
  - efficient and simple, for routine practice
  - items scaled but, for simplicity, not weighted, as the optimum weights are highly similar
Slide 15: RHD hearing items (abbreviated)
1) How would you describe your child's hearing?
2) Has hearing ability varied?
3) Speaks unduly loudly?
4) Raises sound level of TV/radio?
5) Responds when called in a normal voice?
6) Mishears words when not looking at you?
7) Turns wrong way to a call or sound?
8) Difficulty hearing when spoken to face to face in a quiet room?
9) Difficulty hearing when with a group of people?
10) Asks for things to be repeated?
Slide 16: The next 4 slides convey
- Eight a priori facets of presentation (the "8" in the title OM8-30) were used to cluster the items and allow them to help select one another on the basis of internal consistency. However, if you only have time and data-entry capacity for 30 items you cannot reliably support measures of 8 facets, an all-too-infrequently recognised limit. For extremely large samples it would be worth comparing sub-scores of down to 3 items, but more generally about 6 items per score is a limit, and roughly double this is a safe and reliable way to proceed. Thus an empirical evidence base is sought for a 2-domain summary of the items that are scored for this purpose. Structural equation modelling (SEM, summarised in the path diagram) has shown on the full data that a 2-domain summary is indeed more efficient and parsimonious than either a 1- or a 3-domain summary (a simplified exploratory check of this point is sketched after this list). This is not surprising, and is a result often found in the development of outcome measures, for example even in the generic SF-36, which produces a physical and a mental domain. In the path diagram, blue arrows are assumed causal, red ones a matter of marking a construct by contribution, and green and purple arrows show correlated residuals, strictly outside the model, which might be of either type or a third type: joint manifestations of a common unmarked cause.
- The simplest view is that the relation between the two domains is causal. However developmental impact also has many other determinants than physical health in OME, psychological and social, as we have documented in detail elsewhere.
- One material part of the model's structure is the strong linkage between the part of reported hearing difficulties that is NOT explained by the HL, and the parts of all other measures that are not explained by their structural relations to each other. This is most economically expressed on the underlying variables (in effect, the totals) for the 2 main summary domains (see curves in light green). This is the evidence that response bias can and must be extracted: the excess of report over what is measurable may not all be due to pure bias, but it is usefully interpreted as a bias adjustment. Failure to consider this contribution to all scores results in unnecessarily high error and failure to reflect expected and generally confirmed relationships. Where bias exists it is consequently reduced by fitting the bias term.
- Specifically for OM8-30, a hierarchical view of this structure and the scores possible is useful, one in which the RHD items are set apart for bias adjustment and not totalled into impact.
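The SEM itself is not reproduced here, but the following sketch shows the kind of simpler exploratory check referred to in the first bullet: comparing 1-, 2- and 3-factor summaries of a set of mini-measure scores by cross-validated maximum-likelihood factor analysis (scikit-learn). The data are simulated and the variable names invented; this illustrates the principle, not the authors' analysis.

```python
# Not the authors' SEM: a simpler exploratory check of how many latent factors
# a set of mini-measure scores supports.  `scores` stands for a cases x
# mini-measures matrix (eg the 11 TARGET mini-measure totals); the data here
# are simulated purely to make the sketch runnable.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_cases, n_measures = 500, 11
# Simulate two correlated latent domains driving the 11 measures
latent = rng.normal(size=(n_cases, 2)) @ np.array([[1.0, 0.4], [0.4, 1.0]])
loadings = rng.uniform(0.4, 0.9, size=(2, n_measures))
scores = latent @ loadings + rng.normal(scale=0.7, size=(n_cases, n_measures))

for k in (1, 2, 3):
    ll = cross_val_score(FactorAnalysis(n_components=k), scores, cv=5).mean()
    print(f"{k}-factor summary: mean held-out log-likelihood = {ll:.2f}")
# On data with a genuine 2-domain structure, k=2 typically wins; a third factor
# buys little, mirroring the parsimony argument of the SEM on slide 18.
```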
Slide 17: Impacts of OME: SEM defines the best summary measures. [Path diagram; mini-measures and covariates shown: Average Hearing Level (HL), Reported hearing difficulty, Respiratory symptoms, Sleep pattern, Ear symptoms, Social confidence, Global health rating (1), Physical health (+ sleep), Developmental outcomes, Speech and language, Parent Quality of Life (QoL), Age, Anxiety, Schooling concerns (1), Balance problems, Context-directed behaviour.]
Slide 18: Construct validity: structural equation modelling on the full 11 TARGET measures confirmed that they are best summarised in 2 factors, not 1 nor 3.
Slide 19: Enabling an efficient summary of all variables that is rooted in basic biological constraints. [Path diagram including Age, Average Hearing Level (HL) and Reported hearing difficulty, with path coefficients such as 0.23 and 0.71.] The model fits details of the data extremely well, and is parsimonious (only 20, not 13 x 14/2 = 91, links between mini-measures).
Slide 20: Hierarchical simplification for OM8-30
- TOTAL
  - Physical health: Sleep (3), Respiratory (5), Ear infections (3), Global health (1)
  - General developmental impact: Speech and language (3), Behaviour (6), School progress (1), Parent QoL (5)
- Reported Hearing Difficulties (4): kept apart, for bias adjustment
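Purely to fix ideas about the hierarchy above, here is a structural sketch of the domain/subscale layout as a data structure, with naive unweighted totalling. The real OM8-30 scoring applies empirically scaled item values, weights and bias adjustment that are not reproduced here, and the item names (eg "sleep_1") are invented.

```python
# Structural sketch only: the OM8-30 hierarchy as a data structure, with
# naive unweighted totalling.  Not the published scoring algorithm.
OM8_30_HIERARCHY = {
    "physical_health": {
        "sleep": 3, "respiratory": 5, "ear_infections": 3, "global_health": 1,
    },
    "developmental_impact": {
        "speech_language": 3, "behaviour": 6, "school_progress": 1, "parent_qol": 5,
    },
    "reported_hearing_difficulties": {   # kept apart, for bias adjustment only
        "rhd": 4,
    },
}

def domain_totals(responses: dict) -> dict:
    """Naive domain totals from a flat dict of item responses, eg {'sleep_1': 2}."""
    totals = {}
    for domain, subscales in OM8_30_HIERARCHY.items():
        totals[domain] = sum(
            responses.get(f"{sub}_{i}", 0)
            for sub, n_items in subscales.items()
            for i in range(1, n_items + 1)
        )
    return totals

example = {f"sleep_{i}": 1 for i in range(1, 4)} | {"behaviour_1": 2}
print(domain_totals(example))  # {'physical_health': 3, 'developmental_impact': 2, ...}
```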
Slide 21: Current activities and agenda for OM8-30
- Re-standardisation in UK and NZ with the items in their current number (30), positions in sequence, and no other context
- Further development of the facility for adjustment of parental response bias (field version using tympanometry, not HL)
- Completing the programme of translations for European languages and getting large enough national standardisation samples
- Piloting applications in audit
Slide 22: Bias-adjustment: what and why?
- Physical and developmental impact can be measured objectively, but this is totally infeasible and unaffordable for clinical work
- So such information, essential for assessment and decisions, must be provided by reports: answers to questions
- Patients'/parents' differing mental standards affect their mentioning and their quantitative responses
- Knowledge about such distortions in judgements can be used to improve the validity, hence value, of (self-)report
  - specifically, by adjusting for bias
  - using discrepancies between the quantitative report and objectively predicted values
Slide 23: RHD-4 score within OM8-30: curves summarise the distributions at various HLs. [Scatter plot of RHD-4 (1st clinical visit) against average HL (1st clinical visit, 0-60 dB), with 10th, 50th and 90th percentile curves.]
Slide 24: RHD-4 (best hearing difficulty questions) vs HL: curve zones show the residual as a bias estimator. [Same plot with 10th, 25th, 75th and 90th percentile curves; the top ~10% of the distribution is labelled "over-concerned" and the bottom ~10% "under-aware".]
Slide 25: Response bias adjustment, STEP 1 and STEP 2. [Diagram: Hearing Level (or the weighted tymp score) and Reported Hearing Difficulties (RHD) enter a regression whose outputs are the predicted values and the residuals (the RB estimate), which then feed the adjusting analysis of slide 27.]
Slide 26: Residuals: discrepancies of data points from the best-fit regression line (STEP 2). [Scatter plot of RHD (Reported Hearing Difficulties) against Hearing Level with the predicted-value (regression) line; an over-concerned case lies above the line with residual (RB) = +10, an under-aware case below it with residual (RB) = -5.]
Slide 27: Response bias adjustment (RBA), STEP 2. [Diagram: the expected influences plus the parental bias (RB) term enter the adjusting analysis (multiple regression), giving a better model and an estimate of each particular influence with subjective bias adjusted for, ie minimised.]
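A minimal sketch of the two regression steps of slides 25-27, using statsmodels on simulated data. Variable names (hl, rhd, impact, age) are illustrative assumptions; the point is only the mechanics: residuals from the RHD-on-HL regression serve as the RB estimate, which then enters the adjusting regression.

```python
# Sketch of two-step response-bias adjustment (slides 25-27), simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
hl = rng.normal(25, 8, n)                         # measured hearing level (dB)
bias = rng.normal(0, 1, n)                        # each parent's reporting bias
rhd = 0.08 * hl + bias + rng.normal(0, 0.3, n)    # reported hearing difficulty
age = rng.normal(5, 1.5, n)
impact = 0.05 * hl - 0.2 * age + 0.8 * bias + rng.normal(0, 1, n)
df = pd.DataFrame({"hl": hl, "rhd": rhd, "age": age, "impact": impact})

# Step 1: regress reported difficulty (RHD) on HL (or a tymp-based pseudo-HL);
# the residual is the response-bias (RB) estimate.
step1 = sm.OLS(df["rhd"], sm.add_constant(df[["hl"]])).fit()
df["rb"] = step1.resid

# Step 2: include RB in the adjusting regression for the outcome of interest.
unadj = sm.OLS(df["impact"], sm.add_constant(df[["age", "hl"]])).fit()
adj = sm.OLS(df["impact"], sm.add_constant(df[["age", "hl", "rb"]])).fit()
print(f"R-squared unadjusted:    {unadj.rsquared:.2f}")
print(f"R-squared bias-adjusted: {adj.rsquared:.2f}   (lower residual error)")
```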
Slide 28: Good distribution of the OM8-30 total score after bias-adjustment. [Histogram of case frequency; baseline data from children included in TARGET.]
Slide 29: But how to do bias-adjustment without having to obtain HL, eg with tympanogram types A, C1, C2 and B, which do not give a continuous scale? Make them do so: you don't even need middle-ear pressure and maximum compliance. We derived a formula predicting the 2-ear, 4-frequency mean HL in 1489 cases at 25 dB HL or below, thus estimating weights for A, C1, C2 and B on each ear. The distribution of this tymp-based score is slightly lumpy, but with care it can be used in samples of up to 30 dB HL severity.
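A sketch, again on simulated data with invented names, of the kind of derivation described on slide 29: regressing measured HL on dummy-coded tympanogram type for each ear, so that the fitted weights give a tymp-based pseudo-HL for children whose HL is not measured. The actual published formula and weights are not reproduced here.

```python
# Sketch of deriving per-ear weights for tympanogram types so that a
# "pseudo-HL" can stand in for measured HL (cf. slide 29).  Simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

TYPES = ["A", "C1", "C2", "B"]                    # standard tympanogram categories
rng = np.random.default_rng(3)
n = 1500
true_effect = {"A": 0.0, "C1": 3.0, "C2": 7.0, "B": 15.0}   # assumed, in dB
left = rng.choice(TYPES, n)
right = rng.choice(TYPES, n)
hl = (12 + 0.5 * (pd.Series(left).map(true_effect) + pd.Series(right).map(true_effect))
      + rng.normal(0, 5, n))
df = pd.DataFrame({"left": left, "right": right, "hl": hl})

# Dummy-code each ear's tymp type (type A as the reference category)
X = pd.get_dummies(df[["left", "right"]], drop_first=True).astype(float)
fit = sm.OLS(df["hl"], sm.add_constant(X)).fit()
print(fit.params.round(1))                        # per-category weights, per ear

# Scoring: the fitted values ARE the tymp-based pseudo-HL for each child
df["pseudo_hl"] = fit.fittedvalues
print(df[["hl", "pseudo_hl"]].head())
```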
Slide 30: Showing the value of bias-adjustment: epidemiological associations. [Diagram: age, SEG, sex, referral source etc as predictors, plus parents' response biases (RB) at V2, enter an adjusting regression to predict (1) physical health or (2) developmental impact at V2; the model is stronger if RB is included.]
Slide 31: Severity distributions for this test sample: actual HL (2-ear average) vs the tymp-based prediction of HL. The many B tymps create a nasty spike of undifferentiable pseudo-HLs in the distribution, but we now have a method to predict HL for a child with B tymps, so can disperse the spike, giving a more continuous and powerful measure.
Slide 32: Bias-adjusting one risk-factor model gives a power gain of 26% from reducing error (model: sex, age, referral source, SEG -> impact)

                               Multiple R   % variance explained   Residual error (dev. units)
  Unadjusted                       0.31             10                      0.44
  Bias-adjusted (via tymps)        0.57             32                      0.38
  Bias-adjusted (via HL)           0.58             34                      0.38

Sex becomes NS with adjustment, giving a simpler model. This uses the pseudo-HL described in the previous slides; bias-adjustment uses reported hearing difficulties on the same occasion (TARGET V2) as developmental impact.
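One simple way to relate the residual errors in the table above to feasibility, under the usual assumption that the required sample size scales with residual variance. This is an illustrative calculation, not necessarily the definition of "power gain" used on the slide.

```python
# Illustrative only: required N for a given precision scales with the residual
# variance, so the ratio of squared residual errors approximates the saving.
resid_unadj, resid_adj = 0.44, 0.38          # residual errors from slide 32
ratio = (resid_adj / resid_unadj) ** 2       # variance ratio
print(f"Bias-adjusted model needs ~{ratio:.0%} of the cases "
      f"(~{1 - ratio:.0%} saving) for the same precision.")
# Roughly a quarter fewer cases, in the same ballpark as the quoted power gain.
```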
Slide 33: Bias-adjusting another risk-factor model (age, SEG -> physical health) gains 17% in power

                               Multiple R   % variance explained   Residual error (dev. units)
  Unadjusted                       0.29              9                      2.15
  Bias-adjusted (via tymps)        0.51             26                      1.96
  Bias-adjusted (via HL)           0.52             27                      1.94

- Using RHD-4 from OM8-30 on the same occasion (TARGET V2) as the physical outcome. N = 244, up to 30 dB HL, including some unaffected children.
- On the full RHD-9 and physical health scores (9 + 19 items) for the TARGET V1+V2 average, with N = 1261, this model gave R = 0.42, 18% variance explained, 13% power gain.
Slide 34: HL is no better here for deriving the bias-adjustment term than the tymp-based pseudo-HL
- Regression to predict the residual from the derivation formula via performance determinants: age, concentration, time to do the audiogram, method (play etc)
- SEG marginal; type (play/conventional) NS
- ...these significantly predict the residual from the regression of HL on tympanogram data
- Multiple R = 0.23: a fairly good model (but other sources of discrepancy too)
Slide 35: Why avoid HL? It is needed at many points, but the typical time for a 4-frequency, 2-ear PTA is just under 15 minutes, so bias-adjusting OM8-30 from the tympanogram is worthwhile.
Slide 36: Import of this work
- The performance factors in measured HL can be teased out with a simple set of variables
- Bias-adjustment works, reducing variability in reported measures
  - in the ranges of age and of HL typical of OME
  - tymp-based adjustment is as good as HL-based
  - we know why (irrelevant performance variation in HL)
- This points to a possible clinical approach, re-distributing time productively away from just HL measurement:
  - obtain OM8-30 and tymps
  - confirm diagnosis from these
  - if B/B or B/C2 and impact is greater than a certain criterion, test HL at 1 and 4 kHz and note the average
  - if the difference is greater than Y, do bone conduction to identify any SNHL underlay
  - save expensive clinician and audiologist time for more important activities, including further tests in some children, eg speech-in-noise
  - use more of the cheaper time of computer and clerical assistants
Slide 37: Basis of international validations: (1) the TARGET sample is severe for OME; (2) Finland is special with respect to RAOM
- TARGET visit 1 was selected by UK gatekeeping, visit 2 by the trial entry criterion (2 x 20 dB HL, BE)
- RAOM is seen in secondary care in the Finnish system, due to high staffing and minimal gatekeeping to meet demand
- Finland differs on the OM8-30 RAOM items (p<0.01)
- The literature reflects a greater frequency of AOM seen in recent years, despite no increased virulence:
  - Joki-Erkkila VP, Laippala P, Pukander J. Increase in paediatric acute otitis media diagnosed by primary care in two Finnish municipalities: 1994-5 versus 1978-9. Epidemiol Infect 1998;121:529-534.
  - Joki-Erkkila VP, Pukander J, Laippala P. Alteration of clinical picture and treatment of pediatric acute otitis media over the past two decades. Int J Pediatr Otorhinolaryngol 2000;55:197-201.
  - Blomgren K, Pohjavuori S, Poussa T, Hatakka K, Korpela R, Pitkäranta A. Effect of accurate diagnostic criteria on incidence of acute otitis media in otitis-prone children. Scand J Infect Dis 2004;36:6-9.
Slide 38: Two extreme samples revealed by OM8-30 in a 2-D plot, when scores are adjusted for bias
There is a strong negative correlation (r = -0.88) between physical health and developmental impact across centres, both adjusted for their significant determinants (age, response bias; development also adjusted for selectivity: prior op / 2nd visit).
[Scatter plot of centre means: physical health (approx. 3.6-4.4) against developmental impact (square root, approx. 1.9-2.4); centres shown: Finland, UK ENT (TARGET), Netherlands, France, Kingston, Belgium, Cheshire.]
A negative correlation is expected because a child will be referred either for RAOM or for OME symptoms, or a mixture.
Slide 39: Pattern less sensible if not bias-adjusted
Raw centre means for physical health against developmental impact, not adjusted for bias or severities (r = 0.17).
[Scatter plot of raw centre means: raw physical health mean (approx. 3-5) against raw developmental impact mean (square root, approx. 1.9-2.5); centres shown: UK (TARGET), Finland, UK Cheshire, Netherlands, Belgium, UK Kingston, France.]
Slide 40: National variation is more in the difference (physical - developmental) than in the sum. [Plot of physical health against developmental impact (square root), with directions annotated "physical health worse" and "development worse".]
Slide 41: Value of the Eurotitis 2 international standardisation study to OM8-30
- The factor structure, hence the basis of scoring, is highly similar across language translations and healthcare cultures
- A 2-D plot of (phys + dev) vs (phys - dev), ie total vs difference, is the best way to think about impact
- Adjusting for simple clinical differences, and particularly for parental response bias:
  - reveals expected known effects much better
  - leaves little purely national variation
  - this makes standardisation, including inter-nationally, feasible without needing vast N or a vast budget
- The model for impact (HL, RAOM, URTI -> developmental impact) is strong in both TARGET and Eurotitis 2 data
- Ease of application of bias-adjustment in OM8-30 is considerably boosted by the demonstration (in Eurotitis 2 data also) that the tymp-based formula pseudo-HL gives results similar to true HL for samples varying through normal, marginal and definite OME (not so applicable if many HLs go above 30 dB)
- The principal-component version of the factor structure allows selection of the best items for the ultra-short form OM2-13; Eurotitis 2 data support a very similar selection, hence the next logical development
Slide 42: OM8-30 does several useful things, as recently shown in several languages and health(care) cultures. But it has 32 items, taking about 8-10 minutes, and some clinicians and public health types will expect ultra-short questionnaires to do more than reliability permits. So, can an ultra-short form be useful at all?
YES!
Slide 43: Why two ultra-short forms?
- A small number of items cannot reliably cover multiple measures (or aims)
- Indicators reside in pathophysiology: quite precise, reportable symptoms
- Outcomes must be broader: many things influence them, hence they are variable
- Different items, and more of them, are required (for reliability, given the variability)
- But two domains are supportable (just!)
Slide 44: OM2-9 INDICATORS (9 standard history items from OM8-30, to guide the decision on VTs +/- adenoidectomy)
- Respiratory infection history (6 items)
- Ear infection history (3 items)
Give adenoidectomy with VTs if the history is bad (eg score at or above the TARGET median) and HL is at least 20 dB; give VTs only if the history is not bad and HL is at least 20 dB.
Evidence for these criteria described previously (eg at BACO) - at end if time.
Slide 45: OM2-13 OUTCOMES (summary domains only; no bias-adjustment, nor treatment indicators)
- Developmental impact (7 items): no items on schooling or speech/language are kept, so a slight content shift places a ceiling on the correlation (0.837), thus underestimating concurrent validity
- Physical health (6 items): concurrent validity, ie correlation with the physical score in OM8-30, r = 0.938
- No reported hearing difficulty items, so no bias-adjustment
Slide 46: Use of the short forms to refine, revise or justify clinical policies via audit studies, ie not throwing out the baby with the bathwater.
Slide 47: Audits require such tools PLUS
- Ownership by an interested group of doctors, with some incentive to a high participation rate
- Slight resources
  - to remind, manage, enter data, convene review
  - a database tool (already existing for OM8-30)
  - we will modify it for OM2-9 and OM2-13 if demand justifies
- Usually, ethical approval not needed
- Sufficient numbers (ie a power calculation is needed)
- 1 or more appropriate audit questions, eg
  - Is a rule/guideline being followed? (OM2-9)
  - Are X% of outcomes above standard? (OM2-13)
- Example from New Zealand available
Slide 48: Future prospects for ultra-short forms
- Audit via OM2-9 the application of the evidence-based case selection criteria (RAOM, URTI, developmental impact) for grommets, based on TARGET RCT treatment analyses
- Audit service quality via OM2-13, but for general quality of outcomes only, as these items were not selected for differential indication
- For both forms, develop further and make more robust the already automated scoring software, and evaluate the time and errors of manual scoring by professionals
- Develop automated data capture
  - scoring routines (eg optical/magnetic reading)
  - for many parents in future, on-line administration
Slide 49: Parting shot: given current NHS changes, there may be an increased need to show
- A research base and appropriate tools (OM8-30)
- The capacity to monitor outcomes (OM2-13)
  - Do hospitals differ in outcomes? In TARGET, outcomes did differ, but only slightly; this was due mostly to the rate of discretionary adenoidectomy (longer-term benefits on HL etc)
  - Benchmarks for quality are useful, implying reference data, even if the patient-choice agenda is not pursued
- The capacity to accurately select patients for the ability to benefit (HL and OM2-9)
Slide 50: Contact for further information
MRC Multi-centre Otitis Media Study Group, Cambridge, UK
mark.haggard_at_mrc-cbu.cam.ac.uk