Title: Statistics for Proficiency Testing
1Statistics for Proficiency Testing
- Hong Kong Accreditation Service
- 10 September, 2009
- Daniel Tholen, M.S.
2Overview of Statistical Methods
- Requirements for statistical methods from ISO/IEC 17043
- Overview of statistical procedures in the major standards
- Determining the assigned value
- Determining the performance score
- Checking homogeneity and stability
- Examples from ISO, APLAC PT, class
3Overview of Modules 2-4
- Cover all areas of PT
- Chemical testing
- Medical testing
- Calibration
- Concepts and procedures cover inspection
- Practical application with examples from PT
projects.
4Documents for PT Statistics
- ISO/IEC FDIS 17043 Conformity Assessment: General requirements for Proficiency Testing (was Guide 43-1)
- ISO 13528 Statistical Methods for use in Proficiency Testing by Interlaboratory Comparisons
- IUPAC Harmonized Protocol for PT of Chemical Analytical Laboratories, 2006
- Previous APLAC Statistical procedures
5ISO/IEC 17043 Annex B
- Same basic methods as ISO/IEC Guide 43-1 Annex A on Statistical Methods
- Adds considerations for semi-quantitative and categorical data
- Main topics
- Determine Assigned Value
- Calculate Performance Statistic
- Evaluate Performance
- Determination of Homogeneity and Stability
617043 Annex B
- References ISO 13528 (2005) and IUPAC Harmonized Protocol (2006)
- Adds considerations for GUM
7ISO 13528
- Companion to ISO Guide 43-1, Annex A
- Written as a Standard
- High interest / widely used
- Goal is to describe optimal procedures, but to allow other procedures as long as they are
- Statistically valid
- Fully described to participants
8ISO 13528
- Written by ISO TC69, SC6
- Approved work item in 1997
- Published in 2005
- Reaffirmed in 2009
- Proposal to revise for ISO/IEC 17043
- Correction for discovered gaps
- Add qualitative and ordinal data
- Harmonize with GUM and VIM
- Other? from Seminar?
9Reporting considerations ISO 13528, section 4.6
- Possible conflicts with requirement for laboratories to treat and report PT items the same as customer samples
- NO TRUNCATED RESULTS (!!??)
- Less than values not allowed
- Possible resolution
- Restriction only applies to consensus
10Reporting considerations ISO 13528, section 4.6
- Rounding
- Independently estimate typical repeatability $s_r$
- Do not round digits by more than $s_r/2$
- Number of replicates
- Concern for getting accurate estimate of bias
- When a method's repeatability is large, it can confuse interpretation of scores
- Determine number n of replicates so that
- $s_r/\sqrt{n} < 0.3\,s_P$
11Limiting the effect of repeatability Example
- Say $s_P = 5$ and $s_r = 2$
- Then $s_r/\sqrt{n} < 0.3\,s_P$
- So $2/\sqrt{n} < 0.3(5) = 1.5$, or
- $2/1.5 < \sqrt{n}$ → $(1.33)^2 < n$ → $1.78 < n$
- Or n = 2 replicates
- This criterion can lead to a large number of replicates when $s_r$ is large relative to $s_P$.
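A minimal Python sketch of the replicate criterion, assuming the strict inequality $s_r/\sqrt{n} < 0.3\,s_P$. Solving for n gives $n > (s_r/(0.3\,s_P))^2$, so the sketch returns the smallest integer above that bound. The function name `min_replicates` is hypothetical, not taken from ISO 13528.

```python
import math


def min_replicates(s_r: float, s_p: float, factor: float = 0.3) -> int:
    """Smallest integer n with s_r / sqrt(n) < factor * s_p."""
    # n must be strictly greater than this bound
    bound = (s_r / (factor * s_p)) ** 2
    return max(1, math.floor(bound) + 1)


# Values from the example: s_P = 5, s_r = 2
print(min_replicates(2, 5))
# A method with poor repeatability needs many more replicates
print(min_replicates(10, 5))
```

The second call illustrates how quickly n grows once $s_r$ exceeds $s_P$.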
12IUPAC Harmonized Protocol (2006)
- Available free at IUPAC website
- Revision of 1996 Version
- Update to ISO 13528
- Revises selected portions
- Homogeneity criteria
- Detection of bimodality (kernel analysis)
- Strong opinions
13APLAC PT Committee
- Historically followed NATA (Australia) procedures
- Convention gradually changing to ISO 13528 and IUPAC
- No standard procedures
- Statistically valid
- Explained to participants
15Requirements for Statistical Methods ISO/IEC
17043
- 4.4.1.4 Access to the necessary technical expertise, including statistics
- 4.4.3.2 Procedures for ensuring homogeneity and stability in accordance with appropriate statistical designs, including random selection of items
16Statistical Methods in 17043 4.4.4 Statistical
Design
- 4.4.4.1 Designs shall meet the objectives of the scheme, based on the nature of the data
- NOTE 1 Covers the process of planning, collection, analysis and reporting
- NOTE 2 Data analysis methods could vary from the very simple to the complex
- NOTE 3 Statistical design and data analysis methods can be taken directly from specifications by regulatory agencies or customers.
- NOTE 4 In the absence of reliable information, a preliminary interlaboratory comparison can be used.
17Statistical Design 4.4.4
- 4.4.4.2 Documented statistical design and data
analysis methods to identify the assigned value
and evaluate participant results. Demonstrate
that the statistical assumptions are reasonable.
18Statistical Design 4.4.4
- 4.4.4.3 Give careful consideration to the following
- the accuracy and measurement uncertainty required or expected
- the minimum number of participants
- number of significant figures and decimal places
- number of proficiency test items and repeat tests
- procedures to establish evaluation criteria
- procedures to be used to handle outliers
- procedures for the evaluation of excluded values
- the objectives to be met for the design and frequency of proficiency testing rounds.
19ISO/IEC 17043 Other
- 4.4.4.5 Requirements for assigned values (traceability and uncertainty): calibration, testing, consensus
- 4.7.1 Data analysis
- 4.7.2 Evaluation of performance
- 4.8 Reports
- including summary statistics for methods used by
other participants
20ISO 17043 Annex B
- B.1 General
- Many types of PT, many types of data
- Reference ISO 13528 and IUPAC Harmonized Protocol
- Note that ISO 13528 allows other techniques if
they are statistically valid and explained to
participants
21ISO 17043 Annex B
- B.2 Determining the assigned value and its uncertainty
- Definition
- 3.1 assigned value
- value attributed to a particular property of a
proficiency test item
22B.2 Determining the assigned value
- Various procedures available, listed below in order of increasing uncertainty
- known values by formulation (e.g. manufacture or dilution)
- certified reference values (for quantitative tests)
- reference values
- consensus values from expert participants
- consensus values from participants.
23Determining the assigned value
- Procedures for qualitative data (nominal or categorical), or semi-quantitative values (ordinal) are not in ISO 13528 or the IUPAC Harmonized Protocol.
- Generally determined by expert judgment or manufacture.
- May use a consensus value, such as agreement of a predetermined percentage of responses (e.g., 80%)
- May use median or mode, not mean
24Determining the assigned value
- No such thing as standard deviation
- IT IS NOT APPROPRIATE to calculate the mean or SD
of semi-quantitative values.
25Example Semi-Quantitative
- Measurand: level of reaction, by category
- 1 = no reaction, normal
- 2 = mild reaction
- 3 = moderate reaction
- 4 = severe reaction
- 2 PT samples, A and B
- 50 participants
26Example Semi-Quantitative
- Sample A
- Category 1: 20 results (40%)
- Category 2: 18 results (36%)
- Category 3: 10 results (20%)
- Category 4: 2 results (4%)
- Sample B
- Category 1: 8 results (16%)
- Category 2: 12 results (24%)
- Category 3: 20 results (40%)
- Category 4: 10 results (20%)
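A short Python sketch of how the median, the mode and a percentage-agreement check could be computed for these ordinal results. The counts are the ones listed for Samples A and B; the helper names and the use of the lower median (so that no two categories are averaged) are assumptions for illustration.

```python
from statistics import median_low, mode

# Category counts from the example (50 participants per sample)
sample_a = {1: 20, 2: 18, 3: 10, 4: 2}
sample_b = {1: 8, 2: 12, 3: 20, 4: 10}


def expand(counts):
    """Turn {category: count} into a sorted list of individual results."""
    return [cat for cat, n in sorted(counts.items()) for _ in range(n)]


def summarize(counts, consensus_fraction=0.80):
    """Median, mode and share of the modal category for ordinal results."""
    results = expand(counts)
    modal = mode(results)
    share = counts[modal] / len(results)
    return {
        "median": median_low(results),  # stays on the category scale
        "mode": modal,
        "mode_share": share,
        "consensus": share >= consensus_fraction,
    }


print(summarize(sample_a))
print(summarize(sample_b))
```

Neither sample reaches 80% agreement on one category, so a percentage-consensus rule would not yield an assigned value here, while the median and mode are still defined.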
27Responses for Samples A and B
28Determining the assigned value
- Other considerations
- If consensus, control outliers
- If consensus, check trueness of process
- Criteria for acceptability on the basis of
uncertainty of the assigned value (for all a.v.,
especially consensus)
29ISO 13528 Procedures
- Calculate Summary Statistics
- Outlier detection and removal are allowed if done in a statistically valid way
- Robust measures are preferred
- Mean
- SD
- Preferred robust method is given, others are allowed if
- Statistically valid
- Fully described to participants
30ISO 13528 Procedures
- Determine Assigned Value
- Determined before PT shipment
- Result from formulation
- Certified reference value
- Other reference values
- Determined from PT data
- Consensus of experts
- Consensus of participants
- Control the uncertainty of the assigned value
3113528 - Robust Analysis
- Algorithm A for mean and SD
- Starts with $x^* = \mathrm{median}(x_i)$
- $s^* = 1.483 \times \mathrm{median}\,|x_i - x^*|$
- Limit data at $x^* + 1.5s^*$ and $x^* - 1.5s^*$
- Extreme values trimmed to $x^* \pm 1.5s^*$
- Option to use initial $x^*$ and $s^*$ and skip iterations
- (Per NATA and many APLAC studies)
3213528 - Robust Analysis
- Calculate new $x^* = (\sum x_i^*)/p$, where $x_i^*$ are the limited values
- $s^* = 1.134\sqrt{\sum (x_i^* - x^*)^2/(p-1)}$
- Trim data again, at $1.5s^*$
- Recalculate new $x^*$ and $s^*$
- Repeat until convergence
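The iteration on these two slides can be sketched in Python as follows. This is an illustrative reading of the steps listed here, not a validated implementation of ISO 13528: the function name, the numeric convergence tolerance `tol` and the iteration cap `max_iter` are assumptions.

```python
from statistics import mean, median, stdev


def algorithm_a(values, tol=1e-6, max_iter=100):
    """Robust mean x* and robust SD s* by iterated limiting at x* +/- 1.5 s*."""
    # Initial estimates: median and scaled median absolute deviation
    x = median(values)
    s = 1.483 * median([abs(v - x) for v in values])
    for _ in range(max_iter):
        delta = 1.5 * s
        # Pull extreme values in to the limits x - delta and x + delta
        limited = [min(max(v, x - delta), x + delta) for v in values]
        new_x = mean(limited)
        new_s = 1.134 * stdev(limited)  # stdev uses the (p - 1) divisor
        converged = abs(new_x - x) <= tol and abs(new_s - s) <= tol
        x, s = new_x, new_s
        if converged:
            break
    return x, s


data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 15.0]  # one gross outlier
robust_mean, robust_sd = algorithm_a(data)
print(robust_mean, robust_sd, mean(data))
```

With the made-up data above, the robust mean stays close to the bulk of the results, while the ordinary mean is pulled toward the outlier.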
3313528 Quality check Section 5.7
- When AV is determined prior to PT
- Compare AV with robust mean of results
- Determine uncertainty of comparison $u_d$
- If difference exceeds $2u_d$ then investigate
- When AV is determined from consensus
- Compare AV with a reference value from a competent laboratory (could come from homogeneity data)
- Compare robust SD with experience
34Determine the Standard Uncertainty of the
Assigned Value
- Determined before PT shipment
- Result from formulation
- Uncertainty per manufacture process, usually very small relative to measurement uncertainty
- Certified reference value
- Uncertainty provided with certificate
- Other reference values
- Uncertainty calculated per GUM or other procedure
35Determine the Standard Uncertainty of the
Assigned Value
- Determined from data in PT shipment
- Consensus of expert laboratories (p of them)
- Each lab should know their MU, and report it
- $u_X = 1.25\sqrt{\sum u_i^2}/p$ for robust mean (median)
- Caution about bias in experts
- Consensus of participants (p of them)
- Calculate robust mean and SD ($s^*$)
- $u_X = 1.25\,s^*/\sqrt{p}$
- Caution about bias due to method mix
- Caution about lack of consensus
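The two consensus formulas on this slide can be written as a small Python sketch. The function names are hypothetical, and the input values in the usage lines are made up for illustration.

```python
from math import sqrt


def u_from_expert_labs(lab_uncertainties):
    """u_X = 1.25 * sqrt(sum(u_i^2)) / p for p expert laboratories."""
    p = len(lab_uncertainties)
    return 1.25 * sqrt(sum(u ** 2 for u in lab_uncertainties)) / p


def u_from_participants(robust_sd, p):
    """u_X = 1.25 * s* / sqrt(p) for a consensus of p participants."""
    return 1.25 * robust_sd / sqrt(p)


# Illustrative inputs (not from the slides)
print(u_from_expert_labs([0.3, 0.4]))
print(u_from_participants(robust_sd=2.0, p=25))
```

The second formula shows why a consensus value from many participants can have a small standard uncertainty even when the robust SD is large.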
36Limiting the uncertainty of the Assigned Value
(X) 13528 Section 4.2
- Establish limits for uncertainty of AV
- $u(X) < 0.3\,s_P$
- When using fixed limits (E)
- $u(X) < 0.3(E/3)$
- $u(X) < E/10$
37Limiting the uncertainty of the Assigned Value
(X) 13528 Section 4.2
- If this cannot be met then
- Look for a better way to determine AV
- Incorporate uncertainty in score
- $z'$
- En
- zeta
- Advise participants of large uncertainty
39Limiting the uncertainty of the Assigned Value
(X) Example
- When consensus mean and SD are used to determine performance
- Then $u(X) = \mathrm{SD}/\sqrt{n}$
- So one can have very small uncertainty with a large number of labs.
- What n is needed to assure criterion is met?
40Limiting the uncertainty of the Assigned Value
(X) Example
- What n is needed to assure criterion is met?
- Need $1/\sqrt{n} < 0.3$, or $n > (1/0.3)^2 \approx 11.1$, so $n \ge 12$
- If $n \le 11$, then cannot meet criterion.
- If robust mean is used, need $1.25/\sqrt{n} < 0.3$, or $n > (1.25/0.3)^2 \approx 17.4$, so
- $n \ge 18$
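A minimal Python sketch of this calculation, assuming the SD for proficiency is the same consensus SD used in $u(X)$, so that the criterion $u(X) < 0.3\,s_P$ reduces to $m/\sqrt{n} < 0.3$. Here `m` is 1 for a plain mean and 1.25 for a robust mean; because `m` multiplies $u(X)$, it enters the bound on n squared. The function name is hypothetical.

```python
import math


def min_participants(multiplier: float = 1.0, factor: float = 0.3) -> int:
    """Smallest integer n with multiplier / sqrt(n) < factor."""
    # n must be strictly greater than this bound
    bound = (multiplier / factor) ** 2
    return math.floor(bound) + 1


print(min_participants())      # plain mean: u(X) = SD / sqrt(n)
print(min_participants(1.25))  # robust mean: u(X) = 1.25 * SD / sqrt(n)
```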
41B.3 Calculating performance statistics
- Quantitative results
- D and D%
- z, z'
- En, Zeta
- Qualitative/semi-quantitative results
- Combined performance scores
42Calculate Performance Statistic
- Estimates of bias
- Difference $D = (x - X)$
- Percentage difference $D\% = 100\,(x - X)/X$
- D and D% can be evaluated with fixed limits
- Estimates of relative performance
- rank or percentage rank (not recommended)
- z score (recommended) $z = (x - X)/s_P$
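The three statistics on this slide can be sketched in Python as below. The function names and the example numbers are illustrative assumptions; `x` is the participant result, `X` the assigned value and `s_p` the SD for proficiency assessment.

```python
def difference(x, X):
    """D = x - X"""
    return x - X


def percent_difference(x, X):
    """D% = 100 * (x - X) / X"""
    return 100.0 * (x - X) / X


def z_score(x, X, s_p):
    """z = (x - X) / s_p"""
    return (x - X) / s_p


# Illustrative values: result 10.5, assigned value 10.0, s_p = 0.25
print(difference(10.5, 10.0))
print(percent_difference(10.5, 10.0))
print(z_score(10.5, 10.0, 0.25))
```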
43Determine Performance Interval
- Fixed Limits (or Fitness for Purpose)
- Can come from methods for SD
- Not widely used
- Preferred for interpretation
- Fixed percentage across range
- Fixed value across range
- Mixed or segmented.
44SD for Proficiency Assessment
- SD for proficiency: $s_P$
- 5 ways to get SD for Proficiency (for z scores)
- By prescription (set by Accreditor or advisors)
- By experience (perception) of experts
- From a general model (e.g., Horwitz)
- By a precision experiment (ISO 5725-2)
- From participant data (robust SD)
- Should be chosen as fitness for purpose, under a
common model for all analytes
45SD for proficiency testing
- Discussed in detail in section 6 of 13528
- SD as used in z scores
- Can also be thought of as 1/3 of evaluation interval
- (when $|z| > 3$ is action signal)
- For example, if fixed interval is E = 10...
- Then $E = 3\,s_P$
- $s_P = E/3 = 10/3 = 3.3$
46Scores that use uncertainty
- En and zeta consider uncertainty of participant result and assigned value
- Requires consistent determination of uncertainty by all laboratories
- En in common use in calibration
- z' uses uncertainty of assigned value only
- Useful when too much uncertainty in assigned value.
- Same as z when small uncertainty
47Scores that use uncertainty
- En score (Error, normalized)
- $E_n = (x - X)/\sqrt{U_{lab}^2 + U_{ref}^2}$
- z' scores (like z, but includes $u_X$)
- $z' = (x - X)/\sqrt{s_P^2 + u_X^2}$
- zeta scores (like En, but with standard uncertainty)
- $\zeta = (x - X)/\sqrt{u_{lab}^2 + u_{ref}^2}$
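A Python sketch of the three uncertainty-based scores, assuming the denominators combine the two terms in quadrature (root sum of squares), with expanded uncertainties `U` for En and standard uncertainties `u` for zeta. Function names and example values are hypothetical.

```python
from math import hypot


def en_score(x, X, U_lab, U_ref):
    """En = (x - X) / sqrt(U_lab^2 + U_ref^2), expanded uncertainties."""
    return (x - X) / hypot(U_lab, U_ref)


def z_prime_score(x, X, s_p, u_X):
    """z' = (x - X) / sqrt(s_p^2 + u_X^2)"""
    return (x - X) / hypot(s_p, u_X)


def zeta_score(x, X, u_lab, u_ref):
    """zeta = (x - X) / sqrt(u_lab^2 + u_ref^2), standard uncertainties."""
    return (x - X) / hypot(u_lab, u_ref)


# Illustrative values only
print(en_score(10.5, 10.0, U_lab=0.3, U_ref=0.4))
print(z_prime_score(10.5, 10.0, s_p=0.25, u_X=0.0))  # equals z when u_X = 0
print(zeta_score(10.5, 10.0, u_lab=0.15, u_ref=0.2))
```

The second call shows z' reducing to the ordinary z score when the uncertainty of the assigned value is negligible.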
48Evaluate performance
- Compare performance statistic against criteria, determine acceptability, i.e.,
- For fixed limits
- $|\mathrm{Bias}| < \mathrm{Limit}$ → acceptable
- $|\mathrm{Bias}| \ge \mathrm{Limit}$ → unacceptable
49Evaluate performance
- Compare performance statistic against criteria, determine acceptability, i.e.,
- For z, z', zeta
- $-2 < z < 2$ → acceptable
- $-3 < z \le -2$ or $2 \le z < 3$ → warning signal
- $z \le -3$ or $z \ge 3$ → unacceptable
50Evaluate performance
- $|E_n| < 1$ → acceptable
- $|E_n| \ge 1$ → unacceptable
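The evaluation rules on these three slides can be sketched as simple Python classifiers. The treatment of results exactly on a boundary (2, 3, or 1) is an assumption here: the sketch assigns boundary values to the stricter category. Function names are hypothetical.

```python
def classify_z(score):
    """Classify a z, z' or zeta score."""
    magnitude = abs(score)
    if magnitude < 2:
        return "acceptable"
    if magnitude < 3:
        return "warning"
    return "unacceptable"


def classify_en(en):
    """Classify an En score."""
    return "acceptable" if abs(en) < 1 else "unacceptable"


def classify_fixed(bias, limit):
    """Classify a difference D or D% against a fixed limit."""
    return "acceptable" if abs(bias) < limit else "unacceptable"


for z in (0.4, -2.5, 3.2):
    print(z, classify_z(z))
```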
51Combined performance scores
- Analyze data for each item independently
- Special process for Youden pairs
- Can be other reasons to combine results
- Precision
- Linearity
- Can count number of satisfactory scores
- Not recommended to combine performance scores
(such as average z)
52Graphic Reports for PT round
- Rank vs. Result (with or without MU)
- Used to check Normal distribution
- Used to visualize data
55Graphic Reports for PT round
56Graphic Reports for PT round
- Histograms of results or scores
59Combined Performance Scores
- Generally discouraged in 17043 and 13528
- Can miss problem on one sample or measurand
- OK only for statistics that have the same distribution (rare); sometimes true for performance scores.
60Graphic Reports for PT round
- Bar plot of standardized performance statistics (z, h, k)
- z, or other standardized scores (% error)
- h and k plots from ISO 5725
- h same as z, except always from sample SD
- k for repeatability ($n \ge 2$ replicates)
62Graphic Reports for PT round
- Youden plot (usually w/median lines)
- In this document uses only z scores.
- Should use sample results, for clarity
- Provides evidence of related results, which can suggest consistent bias
- Consistent bias can suggest lack of clearly defined method.
- Confirm with rank correlation test.
64Youden Plot Example
67Graphic Reports for More than One PT Round
- Line plot (Shewhart plot) for scores on previous rounds
- Use any standardized score
- Show evaluation intervals
- Show test dates
69Graphic Reports for more than One PT Round
- Dot Plot
- Show all samples on same chart
- Show evaluation intervals
- Show dates or scheme codes
71Graphic Reports for More than One PT Round
- CUSUM control chart
- Can show trends affecting bias
- Choose some number to use (rolling sum)
- Sums should trend to zero
- Not sensitive to current problems
73Graphic Reports for More than One PT Round
- Plot of Standardized Laboratory Biases against assigned value
- Shows relationship between score and level
- Can mask time effect; do both
75Demonstration of homogeneity and stability in
17043
- Ensure sufficient homogeneity so as to not impact evaluation of performance
- Different needs for determining homogeneity and stability in PT and for Reference Materials (ISO Guides 34 and 35)
- PT (and RM) needs to ensure sufficient homogeneity and stability
- CRM needs to estimate SD between samples, and instability as part of uncertainty of assigned value
76Homogeneity - 13528
- Homogeneity
- 10 or more samples, 2 replicates
- $SD_S$ for samples (ANOVA calculation)
- $SD_S < 0.3\,s_P$
- No F test
- Can use experience to reduce testing
- When evidence and theory prove homogeneous
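A Python sketch of this homogeneity check for duplicate measurements, assuming the usual one-way ANOVA estimate: the between-sample variance is the variance of the sample means minus half the within-sample (repeatability) variance, set to zero if negative. Function names and the test data are illustrative assumptions.

```python
from math import sqrt
from statistics import stdev


def between_sample_sd(pairs):
    """Between-sample SD from duplicates, one (a, b) pair per sample."""
    g = len(pairs)
    sample_means = [(a + b) / 2 for a, b in pairs]
    s_means = stdev(sample_means)
    # Within-sample SD from the differences between duplicates
    s_within = sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * g))
    return sqrt(max(0.0, s_means ** 2 - s_within ** 2 / 2))


def is_homogeneous(pairs, s_p, factor=0.3):
    """Apply the criterion SD_S < 0.3 * s_P."""
    return between_sample_sd(pairs) < factor * s_p


# Ten samples whose duplicates differ but whose means are identical
identical_means = [(10.0, 10.2)] * 10
print(between_sample_sd(identical_means), is_homogeneous(identical_means, 1.0))
```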
77Homogeneity IUPAC (2006)
- Similar to 13528, larger criterion for acceptance, more complex statistics.
- 10 or more samples, in duplicate
- Sufficient repeatability $s_{an} < 0.5\,s_P$
- Cochran test for duplicates
- Visual check for anomalies
- Non-random differences between replicates
- Time trend across manufacture
78Homogeneity IUPAC (2006)
- Calculate variances
- $s_{an}^2$ (between replicates)
- $s_{sam}^2$ (between samples)
- $s_{all}^2 = (0.3\,s_P)^2$
- Calculate acceptance criterion
- Take $F_1$ and $F_2$ from tables
- $c = F_1 s_{all}^2 + F_2 s_{an}^2$
- If $s_{sam}^2 < c$ then acceptable homogeneity
- Since the tabulated $F_1 > 1$ and $s_{an}^2 > 0$, and $s_{all}^2$ is the 13528 criterion, this is always an easier criterion
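A Python sketch of the acceptance calculation on this slide for duplicate results. The factors `f1` and `f2` are left as caller-supplied inputs because they must be read from the protocol's table for the number of samples tested; the values used in the usage line are illustrative only. The variance estimates assume the usual one-way ANOVA for duplicates, and the function name is hypothetical.

```python
from statistics import variance


def iupac_homogeneity(pairs, s_p, f1, f2):
    """Return (acceptable, s_sam2, c) for duplicate results per sample."""
    m = len(pairs)
    # Analytical (between-replicate) variance from duplicate differences
    s_an2 = sum((a - b) ** 2 for a, b in pairs) / (2 * m)
    # Between-sample variance, floored at zero
    var_means = variance([(a + b) / 2 for a, b in pairs])
    s_sam2 = max(0.0, var_means - s_an2 / 2)
    # Allowable variance and critical value
    s_all2 = (0.3 * s_p) ** 2
    c = f1 * s_all2 + f2 * s_an2
    return s_sam2 < c, s_sam2, c


pairs = [(10.0, 10.2)] * 10
# f1 and f2 below are illustrative; take real values from the table
print(iupac_homogeneity(pairs, s_p=1.0, f1=1.88, f2=1.01))
```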
79Homogeneity - traditional
- F test (allowed, not recommended)
- $F = SD_S^2 / s_r^2$
- $s_r$ = repeatability SD; $SD_S$ = ANOVA treatment SD
- $F_{crit} = F(0.05,\ k-1,\ k(n-1))$, for k samples and n replicates
- High $s_r$ → insensitive test (large $SD_S$ passes)
- Low $s_r$ → too sensitive test (small $SD_S$ fails)
80Stability - 13528
- Stability
- Analysis on or after closing date
- (2-)3 samples, (1-)2 replicates, depending on experience
- Calculate overall mean
- $|\mathrm{Mean}(H) - \mathrm{Mean}(S)| < 0.3\,s_P$, where H is the homogeneity data and S the stability data
- No statistical t test
- High $s_r$ → insensitive test (big difference passes)
- Low $s_r$ → too sensitive test (small difference fails)
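A minimal Python sketch of this stability check, assuming it compares the overall mean of the homogeneity results with the overall mean of the results measured on or after the closing date. The function name and the example numbers are hypothetical.

```python
from statistics import mean


def is_stable(homogeneity_results, stability_results, s_p, factor=0.3):
    """Check |mean(H) - mean(S)| < factor * s_p."""
    shift = abs(mean(homogeneity_results) - mean(stability_results))
    return shift < factor * s_p


# Illustrative values only
print(is_stable([10.0, 10.2, 10.1], [10.1, 10.3], s_p=1.0))
print(is_stable([10.0, 10.2, 10.1], [11.0, 11.2], s_p=1.0))
```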
81Stability - practical
- Can use experience and technical knowledge (backed by data)
- Same measurand, same manufacture process, same matrix
- For calibration artefact, homogeneity and stability are usually the same
82APLAC (NATA) Robust procedure
- Calculate quartiles $Q_1$, median, $Q_3$
- $\mathrm{IQR} = Q_3 - Q_1$
- Median is an estimate of mean
- Normalized IQR is an estimate of SD
- $\mathrm{IQR}_N = 0.7413 \times \mathrm{IQR}$
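A Python sketch of these robust summary statistics. Quartile conventions differ between software packages; the "inclusive" interpolation method of the standard library is used here as an assumption, not as the method prescribed by APLAC or NATA. The function name is hypothetical.

```python
from statistics import median, quantiles


def robust_summary(values):
    """Return (median, normalized IQR) as robust estimates of mean and SD."""
    # Assumed quartile convention: linear interpolation, endpoints included
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return median(values), 0.7413 * iqr


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(robust_summary(data))
```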
83APLAC performance statistics
- Calculate relative performance measures
- Between lab agreement
- $S_i = (A_i + B_i)/\sqrt{2}$
- Within lab agreement
- $D_i = (A_i - B_i)/\sqrt{2}$ if $\mathrm{median}(A_i) > \mathrm{median}(B_i)$
- $D_i = (B_i - A_i)/\sqrt{2}$ if $\mathrm{median}(A_i) < \mathrm{median}(B_i)$
- Calculate z-scores for these measures
85The End