Title: Sampling Based Clerical Review Methods in Probabilistic Matching
1Sampling Based Clerical Review Methods in
Probabilistic Matching
2Clerical Review
[Diagram: record pairs from File A and File B receive one of three outcomes]
- Automatically assign as a non-link
- Send for manual clerical review
- Automatically assign as a link
3Output of Record Pair Comparisons
- Set of matched records with an associated comparison weight
- Lots of these: high weights, low weights and in-between
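The three clerical review outcomes can be tied to the comparison weight with two cut-offs. A minimal sketch in Python, assuming a lower and an upper cut-off; the function name, the parameter names and the example values are illustrative and do not come from the slides:

```python
def assign_outcome(weight, lower_cutoff, upper_cutoff):
    """Assign a record pair to one of three outcomes by its comparison weight."""
    if lower_cutoff > upper_cutoff:
        raise ValueError("lower_cutoff must not exceed upper_cutoff")
    if weight >= upper_cutoff:
        return "link"             # high weight: automatically assign as a link
    if weight < lower_cutoff:
        return "non-link"         # low weight: automatically assign as a non-link
    return "clerical review"      # in-between: send for manual clerical review


# Illustrative weights only; real weights come from the record pair comparison.
example_weights = [-4.2, 3.0, 7.5, 15.1]
outcomes = [assign_outcome(w, lower_cutoff=0.0, upper_cutoff=10.0)
            for w in example_weights]
```

Pairs between the two cut-offs are the ones that generate the clerical review workload.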
4[Figure: Frequency plotted against Comparison Weight]
5... but
- Clerical review can be time consuming
- Thousands or tens of thousands of clerical review pairs
- High level of repetitive VDU based tasks can lead to health and safety issues
6[Figure: Frequency plotted against Comparison Weight]
7Acceptance Sampling
- Allows quantification of uncertainty in sampling
- Methods
  - AS 1199 Sampling Procedures and Tables for Inspection by Attributes
  - DIY calculations
proc power;
   onesamplemeans
      mean = 5 10 ntotal = 150 stddev = 30 50 power = .;
   plot x=n min=100 max=200;
run;
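A DIY calculation for an attributes sampling plan can also be sketched in Python. This is a minimal sketch under assumptions that are not from the slides: a binomial model (the batch is large relative to the sample), a sample of `n` pairs per batch, and an acceptance number `c`, so that a batch goes to manual review when more than `c` sampled pairs disagree with the automatic assignment. The values n = 65 and c = 2 are illustrative only.

```python
from math import comb


def prob_accept(n, c, p):
    """P(at most c nonconforming pairs in a sample of n) under a binomial model.

    p is the assumed true proportion of nonconforming pairs in the batch.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def prob_manual_review(n, c, p):
    """P(batch is sent for manual review) = 1 - P(batch is accepted)."""
    return 1.0 - prob_accept(n, c, p)


# Points on the curve of P(send for manual review) against the true rate p.
curve = [(p, prob_manual_review(65, 2, p)) for p in (0.0, 0.01, 0.05, 0.10, 0.20)]
```

Evaluating `prob_manual_review` over a grid of `p` shows how sharply the plan separates good batches from bad ones.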
9Producer's Risk (α): risk of having to review a batch with a large number of non-matches
AQL: Match rate for automatic rejection
10RQL: Match rate that is unacceptable to automatically accept as non-links
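Given an AQL, an RQL and the two risks, a sampling plan can be searched for directly. The sketch below is one way to do it, not the method used in the talk. It assumes a binomial model, treats `aql` and `rql` as proportions of nonconforming pairs (mapping these onto the match rates above is an assumption), and introduces a consumer's risk `beta` at the RQL as the standard counterpart of the producer's risk. All names and example values are hypothetical.

```python
from math import comb


def prob_accept(n, c, p):
    """P(at most c nonconforming pairs in a sample of n), binomial model."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def find_plan(aql, rql, alpha, beta, max_n=500):
    """Return the smallest-n plan (n, c) meeting both risks, or None.

    Producer's risk: P(accept | aql) >= 1 - alpha.
    Consumer's risk: P(accept | rql) <= beta.
    """
    for n in range(1, max_n + 1):
        for c in range(n + 1):
            if prob_accept(n, c, aql) >= 1 - alpha:
                # Smallest c that meets the producer's risk for this n;
                # a larger c could only raise P(accept | rql).
                if prob_accept(n, c, rql) <= beta:
                    return n, c
                break
    return None


plan = find_plan(aql=0.01, rql=0.10, alpha=0.05, beta=0.10)
```

A narrower gap between AQL and RQL, or smaller risks, pushes the required sample size up.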
11[Figure: P(send for manual review), 0 to 100, plotted against Actual Quality Level, 0 to 100]
13Setting a Single Cut-off
- Sometimes there are not enough fields to do meaningful clerical review
- Particularly when we are not using names and addresses
- In these cases we want to meaningfully set a single cut-off
14[Figure: Estimated Cumulative Matches Linked and Estimated Cumulative Non-Matches Linked, plotted against Comparison Weight]
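The two cumulative curves can be built from sampled match rates per weight band. The following is a minimal sketch, assuming each band comes with a count of pairs and an estimated match rate from a clerical sample. The band data, the function names and the rule for picking the cut-off (a cap on the estimated share of non-matches among linked pairs) are illustrative assumptions, not taken from the slides.

```python
def cumulative_links(bands):
    """bands: iterable of (weight, n_pairs, est_match_rate).

    Returns rows (cutoff, est_matches_linked, est_non_matches_linked),
    accumulating from the highest weight band downwards.
    """
    rows = []
    matches = 0.0
    non_matches = 0.0
    for weight, n_pairs, rate in sorted(bands, key=lambda b: b[0], reverse=True):
        matches += n_pairs * rate
        non_matches += n_pairs * (1.0 - rate)
        rows.append((weight, matches, non_matches))
    return rows


def choose_cutoff(bands, max_false_link_share):
    """Lowest cut-off whose estimated share of non-matches among links is within the cap."""
    best = None
    for cutoff, matches, non_matches in cumulative_links(bands):
        total = matches + non_matches
        if total > 0 and non_matches / total <= max_false_link_share:
            best = cutoff
    return best


# Hypothetical bands: (comparison weight, pairs in band, sampled match rate).
bands = [(10, 1000, 0.99), (8, 500, 0.90), (6, 400, 0.50), (4, 600, 0.10)]
rows = cumulative_links(bands)
cutoff = choose_cutoff(bands, max_false_link_share=0.05)
```

With these made-up bands, linking everything at weight 8 or above keeps the estimated non-match share at 4 percent, so 8 is chosen.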
15Case Study
- Migrants Settlement Database to Census 2006 linkage
- 131,000 records identified for clerical review
- Sampling scheme was 50 batches, from each of which 65 record pairs were selected
- Only 39 batches were actually inspected, for a total of 2,535 record pairs
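The case study figures can be checked with a few lines of Python. The variable names are illustrative, and the calculation assumes that 65 record pairs were sampled from every batch, which is consistent with the reported total.

```python
pairs_for_review = 131_000    # records identified for clerical review
batches_planned = 50
sample_per_batch = 65
batches_inspected = 39

planned_sample = batches_planned * sample_per_batch   # 3250 pairs planned
inspected = batches_inspected * sample_per_batch      # 2535 pairs inspected
share_inspected = inspected / pairs_for_review        # roughly 0.019

print(f"Inspected {inspected} of {pairs_for_review} pairs "
      f"({share_inspected:.1%} of the clerical review pool)")
```

On these figures, just under 2 percent of the pairs identified for clerical review were inspected manually.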
16Final Remarks
- Sampling based clerical review can very significantly reduce the amount of clerical review
- Can be used to rigorously set up cut-offs
- Provides information on linkage quality
- Can introduce missed and false links, but the extent of these can be estimated
17Thank you for your attention
Any Questions?