Title: Capture-recapture analysis
1Capture-recapture analysis
- Keith Sabin, PhD, MPH
- DHHS/CDC/GAP
2What is it for?
- Capture-recapture analysis is used to count the total number of people in a population from two or more incomplete lists of those people
- Why should I be interested?
- Evaluating surveillance systems
- Magnitude of issues
3Overview
- Origin of method
- Application to epidemiology - why is it useful for us?
- Principles
- Conditions for using capture-recapture methods
- Methods
- Two sources
- Multiple sources
- Limitations
4Origins of capture-recapture analysis
- Origins in demography
- 1662 - used to estimate the population of London
- 1783 - Laplace used it to estimate the population of France
- 1949 - Sekar and Deming used it to estimate birth rate and mortality in India
- Subsequently used most often for estimating wildlife populations
- More recently applied to epidemiology (Wittes 1968)
5Application of capture-recapture analysis to human epidemiology
- Evaluating completeness of a surveillance source
- Passive surveillance
- Registers
- Refining incidence and prevalence estimates from surveillance systems or population surveys
- Used for cancers, stroke, homelessness, mental illness, drug use, congenital disorders, infections
6Principles
- Two or more sources (lists) of cases of a given disease
- Sources are considered random capture samples from the population
- Cases can be matched by unique identifiers
- Estimate the total number of cases not captured by any source from the matched and unmatched cases
13Critical assumptions/conditions
- 1. Population is closed
- methods exist for open populations
- 2. Individuals captured on both occasions can be matched
- 3. Capture in the second sample is independent of capture in the first
- 4. Probability of capture is homogeneous across individuals
- Homogeneity of individuals
- Homogeneity of lists
14Application to humans
- Capture: appearing on a list
- Recapture: linking, by identifying individuals who appear on both lists using criteria such as name, date of birth, etc.
- Trap fascination
- if you feed the animals they are more likely to be caught again
- laboratory-confirmed cases are more likely to be reported in other systems
- Trap avoidance
- if you scare the animals they will avoid the trap
- a person can't appear on a community injecting drug user (IDU) registry if they are in prison
16Two sources
                         Source B
                         1       2
Source A         1       x11     x12
                 2       x21     x22 = ?
1 = included in source; 2 = not included in source
17Capture (Source A) and recapture (Source B)
18Estimation
- If sources are independent: P(A | B) = P(A | not B)
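The independence condition can be turned into an estimate of the unknown cell. A minimal sketch, assuming the usual two-source (Lincoln-Petersen) estimator and the cell labels x11 (in both sources), x12 (in A only), x21 (in B only); the function name and the counts are illustrative, not taken from the slides:

```python
def lincoln_petersen(x11: int, x12: int, x21: int) -> dict:
    """Two-source estimate under independence of sources A and B.

    Independence means x11 / (x11 + x21) = x12 / (x12 + x22),
    which gives x22 = x12 * x21 / x11.
    """
    if x11 <= 0:
        raise ValueError("x11 must be positive for this estimator")
    n_a = x11 + x12              # all cases listed by source A
    n_b = x11 + x21              # all cases listed by source B
    x22_hat = x12 * x21 / x11    # estimated cases missed by both sources
    n_hat = n_a * n_b / x11      # estimated total number of cases
    return {"x22_hat": x22_hat, "N_hat": n_hat}


# Illustrative counts only
est = lincoln_petersen(x11=30, x12=70, x21=20)
print(est)
```

The estimated total equals the three observed cells plus the estimated missing cell, which is a quick way to check the arithmetic.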
19Capture (Source A) recapture (Source B)
20Estimation
- Sensitivity of sources
- If numbers in cells are small, the probability that x11 = 0 is not zero
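Both points can be sketched in a few lines. The slide does not name a small-sample correction; the sketch below assumes Chapman's nearly unbiased estimator, a common choice that stays finite when x11 = 0, and takes the sensitivity of a source as its case count divided by the estimated total. Function names and counts are illustrative:

```python
def chapman(n_a: int, n_b: int, x11: int) -> float:
    """Chapman's nearly unbiased two-source estimate of N (defined even if x11 == 0)."""
    return (n_a + 1) * (n_b + 1) / (x11 + 1) - 1


def source_sensitivity(n_source: int, n_hat: float) -> float:
    """Share of the estimated total that a single source captured."""
    return n_source / n_hat


# Illustrative counts only: 100 cases in A, 50 in B, 30 in both
n_hat = chapman(n_a=100, n_b=50, x11=30)
print(round(n_hat, 1))
print(round(source_sensitivity(100, n_hat), 2))  # sensitivity of source A
print(round(source_sensitivity(50, n_hat), 2))   # sensitivity of source B
```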
21Conditions
- Same study period and area
- Closed population
- All cases in any source are true cases
- True matches are identified
- Equal catchability
- Sources are independent
22Same study period and area
- Cases occur during the study period and in the study area
- Different period of capture
- Probability of recapture ↓ → x11 ↓ → overestimates N
23Closed population
- Nobody enters or leaves the population during the study period
- No immigration, emigration, death
- Open population
- Individuals captured in the first sample cannot be captured in the second
- Probability of recapture ↓ → x11 ↓ → overestimates N
24True cases
- All cases in any source are true cases
- False positive cases
- Positive predictive value (PPV) < 1
- Overestimation of N1 or N2 → overestimates N
- Correction
- Take a random sample of positive samples and verify them
- Estimate PPV and apply it to the formula
25True matches
- Matches and only matches are identified
- Ideally, a unique identifier is available (social security number, name, etc.)
- Combination of criteria: name initials, age, sex
- True matches missed
- x11 ↓ → overestimates N
- Wrong matches created
- x11 ↑ → underestimates N
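Matching on a combination of criteria can be sketched as follows. This is a minimal illustration, assuming each record is a dictionary with hypothetical fields `initials`, `age` and `sex`; real record linkage usually needs fuzzier rules and manual review, which is where missed and wrong matches arise:

```python
def match_key(record: dict) -> tuple:
    """Composite identifier built from initials, age and sex (case-insensitive)."""
    return (record["initials"].upper(), record["age"], record["sex"].upper())


def count_cells(source_a: list, source_b: list) -> tuple:
    """Return (x11, x12, x21): in both sources, in A only, in B only."""
    keys_a = {match_key(r) for r in source_a}
    keys_b = {match_key(r) for r in source_b}
    return len(keys_a & keys_b), len(keys_a - keys_b), len(keys_b - keys_a)


# Made-up records for illustration
source_a = [
    {"initials": "AB", "age": 34, "sex": "M"},
    {"initials": "CD", "age": 41, "sex": "F"},
    {"initials": "EF", "age": 29, "sex": "M"},
]
source_b = [
    {"initials": "ab", "age": 34, "sex": "m"},
    {"initials": "GH", "age": 52, "sex": "F"},
]
print(count_cells(source_a, source_b))
```

A coarse key such as this one will create wrong matches in large lists (two different people sharing initials, age and sex), which inflates x11.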
26Equal catchability
- For a given source, the probability of capture is the same for all cases, although this probability may differ from one source to another
- Often not true for epidemiological datasets
- Low or no probability of capture by any source (e.g. IVDU, homeless, disease severity)
- Disregarded in estimate → underestimates N
- Identify and exclude the population outside of all sources
27Accounting for variable catchability
- Stratify by the factor introducing variable catchability
- Calculate estimates by strata
Stratum 1
Stratum 2
N = Σ Ni = N1 + N2
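The stratified calculation can be sketched as below, assuming the simple two-source estimator N = NA x NB / x11 within each stratum. The stratum counts are invented to show how pooling strata with different catchability can pull the estimate down:

```python
def two_source_estimate(n_a: int, n_b: int, x11: int) -> float:
    """Two-source estimate of N from source totals and the number of matches."""
    return n_a * n_b / x11


# Illustrative counts per stratum: (cases in A, cases in B, cases in both)
strata = {
    "stratum_1": (60, 40, 24),  # high catchability
    "stratum_2": (40, 30, 6),   # low catchability
}

per_stratum = {name: two_source_estimate(*counts) for name, counts in strata.items()}
n_stratified = sum(per_stratum.values())

# Ignoring the strata and pooling all counts
pooled = two_source_estimate(
    sum(c[0] for c in strata.values()),
    sum(c[1] for c in strata.values()),
    sum(c[2] for c in strata.values()),
)

print(per_stratum)
print(n_stratified, round(pooled, 1))
```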
28Sources are independent
- Being in one source does not influence the probability of being in the other source
OR > 1 (positive dependence): estimated d < true d → underestimates N
OR < 1 (negative dependence): estimated d > true d → overestimates N
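The direction of the bias can be checked on a made-up population in which all four cells are known. A sketch, assuming d denotes the cell missed by both sources (x22) and a true population of 1,000 in both scenarios:

```python
def summarize(x11: int, x12: int, x21: int, x22: int) -> tuple:
    """Return (odds ratio, estimated missing cell, estimated N) for a full 2x2 table."""
    odds_ratio = (x11 * x22) / (x12 * x21)
    d_hat = x12 * x21 / x11          # estimate of x22 assuming independence
    n_hat = x11 + x12 + x21 + d_hat
    return odds_ratio, d_hat, n_hat


# Positive dependence: being in A makes being in B more likely
print(summarize(x11=200, x12=100, x21=100, x22=600))

# Negative dependence: being in A makes being in B less likely
print(summarize(x11=20, x12=280, x21=280, x22=420))
```

In the first table the estimate falls well below 1,000 and in the second it rises well above it, even though both sources list 300 cases each time.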
32Example
- Estimation of the number of IVDU in Bangkok in 1991 (Mastro 1994)
- Two sources used
- Methadone (April-May 1991)
- Police arrests (June-September 1991)
- Methadone → need for drugs ↓ → probability of being arrested ↓ → negative dependence, overestimation of N
33Evaluation of source dependence
- Two sources
- Qualitative analysis of the notification process in each source, i.e. there is no statistical method to allow for dependence between two sources
- Multiple (> 2) sources
- Wittes method
- Log-linear modelling
34Behavioral Surveillance Using Respondent Driven Sampling
35Presentation Outline
- Sampling methods for hard-to-reach populations
- Description of RDS
- Lessons learned from Vietnam
36Probability Sampling (Simple, Systematic, Cluster)
- Gold standard: best methods for sampling
- But they do not reach hidden populations
- No sampling frame
- Stigmatized
- Would need huge sample sizes in order to capture a hidden population
- Expensive
37Sampling Methods to Reach Hidden Populations
- Time-Location (TLS), Venue-Based
- Major bias: only captures those who are visible
- Snowball
- Major bias: not representative of the population (tendency for in-group affiliation, volunteerism and masking)
38Background on RDS
- Developed by D. Heckathorn and R. Broadhead with IDUs in Connecticut and in Yaroslavl, Russia
- Sampling vs. recruitment strategy
- Different from other chain referral methods because it can give us point estimates with standard errors.
39How RDS Works
- Use of a dual system of recruitment through the use of incentives.
- Use of recruitment quotas.
- Use of peers to recruit peers.
- Use of links between recruiters and recruits.
40The Theory Behind RDS
- Uses principles of first-order Markov theory
- Long referral chains
- Final sample will be independent of those selected as seeds
- Final sample will be similar to the population of the network from which you are recruiting
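The Markov argument can be illustrated with a toy recruitment process. This is only a sketch: the two groups and the recruitment probabilities in `P` are hypothetical, and the point is simply that after enough waves the sample composition no longer depends on which group the seeds came from:

```python
def next_wave(composition: list, P: list) -> list:
    """One recruitment wave: composition[i] is the share of recruiters in group i,
    P[i][j] is the probability that a recruiter in group i recruits from group j."""
    n = len(composition)
    return [sum(composition[i] * P[i][j] for i in range(n)) for j in range(n)]


def composition_after(seeds: list, P: list, waves: int) -> list:
    """Sample composition after a given number of waves, starting from the seeds."""
    comp = list(seeds)
    for _ in range(waves):
        comp = next_wave(comp, P)
    return comp


# Hypothetical recruitment pattern between two groups
P = [[0.7, 0.3],
     [0.4, 0.6]]

print(composition_after([1.0, 0.0], P, waves=1))   # seeds all from group 0
print(composition_after([0.0, 1.0], P, waves=1))   # seeds all from group 1
print(composition_after([1.0, 0.0], P, waves=20))
print(composition_after([0.0, 1.0], P, waves=20))
```

After one wave the two samples still differ sharply; after twenty waves both settle near the same equilibrium (4/7 and 3/7 for this matrix), which is why long referral chains matter.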
41-46[Figure sequence; panel labels: Wave 1, Wave 2, Wave 3, Wave 4, Wave 5]
47A Long Referral Chain: Jazz Musicians in New York City
48Selection of Seeds
49Example in Hai Phong, Vietnam
- Final sample size: 420 IDUs in Hai Phong and Saigon; 418 CSWs in Saigon and 220 in Hai Phong
- Recruitment process
- 20 seeds selected by peer educators
- Three coupons to each participant
- Participants asked to recruit their peers
- Time: March-June 2004
- Three sites (Hai Phong), four sites (Saigon)
50Eligibility Criteria
- CSWs
- Women, 18 years or more, living or working in Hai Phong or Saigon
- Has sold sex for money in the last 30 days
- Has a green coupon (except seeds)
- Has provided consent.
- IDUs
- Women (Saigon only) or men, 18 years or more, living in Hai Phong or Saigon
- Has injected drugs during the last 30 days
- Has a yellow coupon (except seeds)
- Has provided consent.
51Coupon Front Side
LIFE-GAP project: For Your Health and Safety
Payment coupon
Address ________________________
Telephone ___________________________________
(You can call to make an appointment in advance)
You will receive 15,000 VND for each person whom you recruit and who enrolls in the study (you may recruit up to 3 persons)
ID number
Please call us in advance. You must present this coupon for payment
52Coupon Back Side
53Networks of CSWs in Hai Phong
54A network in Hai Phong
Seed
55Initial Lessons from Vietnam
- Seeds should have high degree; an initial focus group may be important
- No slow-down mechanism to end RDS
- Need for security: interviewers have no choice of whom they interview
- Managing multiple sites can be difficult
- Managing coupon numbers
- No way to control for those who recruit faster.
56Initial Lessons from Vietnam (Cont)
- Difficult to discourage recruiters from selling coupons or giving them out in a non-random way
- Non-response information is difficult to obtain (incentives picked up by friends, recruiters do not return for secondary incentive)
57Philosophical objection
- Capture-recapture is fun, so it must be epidemiology!
- But, as epidemiologists we are interested in
- Time, place and person
- Capture-recapture does not capture time - it is a static tool which relies on lists which correspond to prevalence of a chronic disease (e.g. diabetes) or long time periods for acute diseases (legionella)
- Can be used for measuring broad trends by repeat analysis (Nardone et al Epidemiol Infect 2003)
58Practical limitations
- Unique identifier has to match in all data sources
- This may contravene confidentiality laws
- Clever statistics can't correct bad data
- Rubbish in, rubbish out.
- For chronic and expensive diseases (e.g. diabetes) it may be better to carry out an expensive detailed survey than to use quick and dirty methods
- it may be even more expensive to get it wrong.
59Extrapolation is based on assumptions
- we are assuming that the model which describes the observed data also describes the count of the unobserved individuals. We have no way of checking this assumption. This is analogous to, and has the same dangers as fitting an arbitrary curve to a series of points (x,y), where x > 0, with the intention of estimating y at x = 0. ... this is analogous to the position of those who automatically assume that the k samples in our problem are independent.
- Fienberg, Biometrika 1972;59:591-603
60Conclusion
- If conditions are met
- Potential to use multiple incomplete registers and to estimate population size by capture-recapture
- Cheaper than exhaustive registers
- Two sources
- Impossible to quantify extent of dependence
- Requires third source
- Multiple sources
- Log-linear modelling is the method of choice
- Can adjust for dependence and variable catchability
61Caveats
- Use the technique but be careful!
- Don't treat this as a black-box method
- All prior knowledge should be used to formulate the model
- Know your data!
- Not the solution to all problems
- Conditions often not met when applied to epidemiology
- There may still be heterogeneity you don't understand
- Complementary technique
62References
- Wittes JT, Colton T and Sidel VW. Capture-recapture models for assessing the completeness of case ascertainment using multiple information sources. J Chronic Dis 1974;27:25-36.
- Hook EB, Regal RR. Capture-recapture methods in epidemiology: methods and limitations. Epidemiol Rev 1995;17(2):243-264.
- International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple-record systems estimation I: History and theoretical development. Am J Epidemiol 1995;142:1047-58.
- International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple-record systems estimation II: Applications in human diseases. Am J Epidemiol 1995;142:1059-68.
- LaPorte RE, Dearwater SR, Yue-Fang C et al. Efficiency and accuracy of disease monitoring systems: application of capture-recapture methods to injury monitoring. Am J Epidemiol 1995;142:1069-77.
63Recent examples of application to field epidemiology
- Legionnaires' disease. Infuso et al Eurosurveillance 1998;3:48-50; Nardone et al Epidemiol Infect 2003;131:647-54
- Malaria. Van Hest et al Epidemiol Infect 2002;129:371-7
- Measles. Van den Hof et al Pediatr Inf Dis J 2002;21:1146-50
- Acute flaccid paralysis. Whitfield Bull WHO 2002;80:846-851
- Pertussis deaths. Crowcroft et al Arch Dis Child 2002;86:336-8
- Intussusception after rotavirus vaccination. Verstraeten et al Am J Epidemiol 2001;154:1006-1012
- Tuberculosis. Tocque et al Commun Dis Public Health 2001;4:141-3
- Salmonella outbreaks. Gallay et al Am J Epidemiol 2000;152:171-7
- AIDS. Bernillon et al Int J Epidemiol 2000;29:168-174
- Meningitis. Faustini et al Eur J Epidemiol 2000;16:843-8
64Special thanks to Nancy Crowcroft, Health Protection Agency, London. Many of the capture-recapture analysis slides come directly from her class at Epi-Et.
65THANK YOU!
66RDS Advantages
- Ease of field operations
- Little need for formative research/mapping
- Target members recruit for you
- Reach less visible segment of population
- Good external validity (found in other studies; still waiting to see in Vietnam)
- Minimal number of additional questions needed
- Computer software available
- Lower cost (still waiting to see)
67RDS Limitations
- Population must be a network
- Must be able to verify group membership
- Must track links between recruiters and recruits (coupon management)
- Incentives
- Very difficult to deal with selective non-response bias.
68Option 1: Use RDS with Institutional Data
- Capture-recapture requires two samples of the population, only one of which need be representative.
- If an institutional database is available, only a single number is required to recapture the population.
- Example of registered NEP members
69Example of Capture-Recapture
- Capture: During the study period, police recorded contacts with 86 injectors. The detective who provided this information said he was confident that this is almost all the shooters in town.
- Recapture: During the study period, 388 were interviewed using RDS.
- Overlap: 32 respondents were in both the police and the RDS samples.
- Estimated population size
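The figure for the estimated population size did not survive on the slide. As a sketch only, assuming the simple two-source estimator (capture count x recapture count / overlap) is the one intended, the three numbers above combine as follows:

```python
captured = 86      # injectors with recorded police contact
recaptured = 388   # injectors interviewed through RDS
overlap = 32       # respondents found in both samples

# Two-source estimate of the total number of injectors
n_hat = captured * recaptured / overlap
print(n_hat)

# Implied share of injectors known to the police
print(round(captured / n_hat, 3))
```

Under these assumptions the police list would cover well under a tenth of the estimated population, in contrast to the detective's impression.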
70Estimating the Number of Jazz Musicians in NYC using the Logic of Capture/Recapture
- Capture: Proportion of NYC musician union members who identified themselves as jazz musicians (in response to a union member survey): 70% (415/592).
- Number of musician union members in the New York metropolitan area, according to union records, is 10,499.
- Therefore, the estimated number of union jazz musicians is 7,360 (10,499 x .70).
- Recapture: Proportion of all NYC jazz musicians who are union members according to an RDS study is 22%.
- Using the estimate of the number of NYC union jazz musicians and the estimated proportion of all NYC jazz musicians who are union members, the size of the NYC jazz musician universe is 7,360/.223 = 33,003
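The arithmetic can be retraced step by step. A sketch using only the figures quoted above; the variable names are illustrative, and because the inputs are rounded the final figure comes out within a few units of the slide's 33,003 rather than matching it exactly:

```python
union_members = 10_499               # union records, New York metropolitan area
jazz_share_of_union = 415 / 592      # union survey: members identifying as jazz musicians
union_share_of_jazz = 0.223          # RDS study: jazz musicians who are union members

# Step 1: estimated number of union jazz musicians
union_jazz = union_members * jazz_share_of_union

# Step 2: scale up by the share of jazz musicians who are in the union
all_jazz = union_jazz / union_share_of_jazz

print(round(union_jazz))
print(round(all_jazz))
```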
71Multiple sources
72Wittes Method
- Evaluate dependence among sources
- Compare two-source estimates of N
- If estimates differ → possible dependence
- Test of independence
- Calculate odds ratios between cell counts of two sources within a third source
- If OR ≠ 1 → dependence
- Merge dependent sources
- Repeat calculation of estimates with merged source
73Test of independence
[Venn diagram: sources A, B and C; regions labelled a, b, c, d, e, f, g]
74Test of independence
[Venn diagram: sources A, B and C; regions labelled a, b, c, d, e, f, g]
OR = cg/de
OR = 1 → independence
OR > 1 → positive dependence → underestimation of N
OR < 1 → negative dependence → overestimation of N
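The odds ratio test can be sketched as follows. The diagram itself is lost, so the region labelling here is an assumption consistent with OR = cg/de: c is the number of cases in all three sources, d and e are the cases in C plus exactly one of A or B, and g is the number in C only. Counts are invented:

```python
def odds_ratio_within_c(c: int, d: int, e: int, g: int) -> float:
    """Odds ratio between sources A and B among the cases listed by source C."""
    if d == 0 or e == 0:
        raise ValueError("d and e must be non-zero")
    return (c * g) / (d * e)


def interpret(odds_ratio: float, tolerance: float = 1e-9) -> str:
    """Direction of dependence and its effect on the two-source estimate of N."""
    if abs(odds_ratio - 1) <= tolerance:
        return "independence"
    if odds_ratio > 1:
        return "positive dependence, N underestimated"
    return "negative dependence, N overestimated"


for cells in [(10, 20, 15, 30), (20, 20, 15, 30), (5, 20, 15, 30)]:
    value = odds_ratio_within_c(*cells)
    print(cells, round(value, 2), interpret(value))
```

In practice an OR is rarely exactly 1, so a confidence interval or significance test would normally replace the fixed tolerance used here.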
75Test of independence
- To solve, have to assume highest order interaction = 0
- i.e. the chance of being in all the lists (in c) is a simple function of the chance of being on any single list or lesser combination of lists
- Or, there is nothing special about c
[Venn diagram: sources A, B and C; regions labelled a, b, c, d, e, f, g]
76Log-linear modeling - General
- Analyze relationship between categorical variables in a contingency table
- Logarithm of expected frequency of a cell expressed as a linear function of effects for each cell and interaction terms
- For 3 variables, A with i levels, B with j levels, C with k levels, the logarithm of the expected frequency Fijk of cell ijk is the sum of a main effect, first-order effects (one variable, e.g. A) and second-order effects (interactions between two variables, e.g. AB)
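For reference, the saturated log-linear model for three variables is usually written as below. The symbols are an assumption (the slide's own symbols did not survive extraction) and follow a common textbook convention: θ for the main effect and λ for the effect terms, with the final term being the highest-order (three-way) interaction.

```latex
\ln F_{ijk} = \theta
  + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k}
  + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk}
  + \lambda^{ABC}_{ijk}
```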
77Log-linear modeling - CRM
- Estimates value of a missing cell in a 2^k contingency table
- k = number of sources
- Missing cell = number of cases not listed by any source (m222)
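For three sources, the missing cell has a closed form once the highest-order interaction is set to zero: the odds ratio between two sources is then the same inside and outside the third. A sketch, assuming the index convention 1 = listed and 2 = not listed (so m111 is in all three sources and m222 in none); the counts are synthetic and generated from three independent sources so the answer is known:

```python
def missing_cell_no_three_way(m: dict) -> float:
    """Estimate m222 from the seven observed cells, assuming no three-way interaction.

    Setting (m111*m221)/(m121*m211) equal to (m112*m222)/(m122*m212)
    and solving for m222 gives the expression below.
    """
    numerator = m["111"] * m["122"] * m["212"] * m["221"]
    denominator = m["112"] * m["121"] * m["211"]
    return numerator / denominator


# Synthetic expected counts: N = 1000, capture probabilities 0.5, 0.4 and 0.2
observed = {
    "111": 40, "112": 160, "121": 60, "122": 240,
    "211": 40, "212": 160, "221": 60,
}

m222_hat = missing_cell_no_three_way(observed)
n_hat = sum(observed.values()) + m222_hat
print(m222_hat, n_hat)
```

Simpler models (fewer interaction terms) give different estimates of m222 and generally need iterative fitting rather than a one-line formula.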
78Log-linear modeling
- No interaction: sources are independent (1 model)
- Interaction between 2 sources only (3 models)
- Interactions between two pairs of sources (3 models)
- Interactions between all sources 2 by 2 (1 model)
79How to choose the best model
- Aim
- Best fit of observed data with least number of interaction terms
- Principle of parsimony
- Strategy
- Start with saturated model (all interactions accounting for all potential dependency)
- Remove interaction terms in stepwise fashion based on likelihood ratio statistic G²
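The likelihood ratio statistic compares the observed cell counts with those expected under a candidate model. A minimal sketch of the statistic itself, G² = 2 Σ observed x ln(observed / expected); fitting the expected counts for each model is not shown, and the numbers are illustrative:

```python
import math


def g_squared(observed: list, expected: list) -> float:
    """Likelihood ratio statistic; cells with an observed count of 0 contribute nothing."""
    return 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


# Illustrative counts for the seven observed cells of a three-source table
observed = [40, 160, 60, 240, 40, 160, 60]
expected_good_fit = [40, 160, 60, 240, 40, 160, 60]
expected_poor_fit = [60, 140, 80, 220, 60, 140, 60]

print(g_squared(observed, expected_good_fit))
print(round(g_squared(observed, expected_poor_fit), 2))
```

A small G² relative to its degrees of freedom means the simpler model fits about as well as the saturated one, so the dropped interaction term can be left out.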
80Evaluation of Legionella notification system,
France 1995
- Mandatory notification system
- Implemented 1987
- Clinician report
- No validation, little feedback
- 60 cases per year (average)
- Re-organisation defined as priority
→ Evaluate sensitivity of system using capture-recapture
81Three sources
- Notification system (NS)
- National Reference Laboratory (NRL)
- Confirmation of diagnosis, typing of strains
- > 200 diagnoses per year
- Hospital Laboratories (HL)
- Survey among all hospital bacteriology laboratories (n = 432)
- 357 cases identified in 1995
82Distribution of case reports by source
NS: Notification system; NRL: National Reference Laboratory; HL: Hospital Laboratories
83Two-source estimates
- Two-source estimates
- Tests of independence (Wittes)
- Merge NS/NRL into one source
NS/NRL: 389 cases; NS/HL: 615 cases; HL/NRL: 715 cases
Merged NS+NRL / HL: 528 cases (495-561)
84How many deaths from pertussis in England 1994-9?
Official statistics (ONS): 18
Deaths in hospital (HES): 9
Laboratory surveillance (ES): 22
Total: 33 deaths observed
Estimated true number of deaths: 46 (37-71)
Official statistics: 18/33 (54%) observed or 18/46 (39%) estimated
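The same calculation can be repeated for each source. A sketch using only the counts quoted above, with sensitivity taken as the source's count divided by either the observed total or the capture-recapture estimate; the dictionary keys are the slide's abbreviations:

```python
deaths_by_source = {"ONS": 18, "HES": 9, "ES": 22}
observed_total = 33    # distinct deaths found across the three sources
estimated_total = 46   # capture-recapture estimate quoted above

for source, count in deaths_by_source.items():
    against_observed = count / observed_total
    against_estimated = count / estimated_total
    print(f"{source}: {against_observed:.1%} of observed, {against_estimated:.1%} of estimated")
```

Measured against the estimate rather than the observed total, every source looks less complete, which is the practical point of the exercise.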
85What is the sensitivity of hepatitis A
surveillance in England?
- Known under-reporting of cases
- Known failure to report risk factor data
- Under-ascertainment of outbreaks in injecting drug users
- Evaluation of surveillance system
86What is the sensitivity of hepatitis A
surveillance in England?