Title: Anatomic Pathology Data Mining
1Anatomic Pathology Data Mining
- Jules J. Berman, Ph.D., M.D., Program Director,
Pathology Informatics, Cancer Diagnosis Program,
National Cancer Institute
- Dr. Bill Moore, Workshop Director
- Friday, October 27, 2000, 8:00 A.M.
- All opinions herein are Dr. Berman's and do not
represent those of any federal agency.
2Expertise Domain of the Pathology Data Miner
- Confidentiality/Privacy Issues
- Data sharing issues, which include data
standardization
- Data Analysis
3Data Domain of Pathology Data Miner
- Pathology Data linked to tissue samples
- Any medical record data that can be linked to
pathology data (including cancer registry data)
- Any other relevant data in existence that can be
sensibly linked to pathology records (this
usually means the internet)
4Confidentiality/privacy
- Anyone interested in using confidential
information (essentially any data generated in a
hospital that is attached to a patient) needs to
understand confidentiality and privacy issues.
- The fact that you might be using only your
department's data and that you treat the data
confidentially will almost never exempt you from
existing regulations.
- The consequences to you and your institution of
ignoring regulations can be profound.
5Confidentiality/Privacy Lecture
- I am giving a lecture on this subject this
afternoon in the UAREP focus session.
- The lecture is entitled "Bioinformatics Data
Confidentiality Issues" and is scheduled for 3:30.
6Issues related to data sharing
- Nomenclatures and free-text mapping
- Common Data Elements
- Standard Report Formats
- Internet Protocols
7CDE for Date of Birth
- birthdate September 15, 1970
- birthday September 15, 1970
- D.O.B. September 15, 1970
- d.o.b. September 15, 1970
- date of birth September 15, 1970
- date of birth September 15, 1970
- date-of-birth September 15, 1970
- date_of_birth September 15, 1970
- dob September 15, 1970
- DOB September 15, 1970
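The variants above all name the same data element. A minimal sketch of collapsing them onto one common data element name is shown here; the canonical name `date_of_birth` and the helper `normalize_tag` are illustrative assumptions, not part of any published CDE standard.

```python
import re

# Hypothetical canonical name for the date-of-birth common data element.
CANONICAL_TAG = "date_of_birth"

# Spellings from the slide, reduced to lowercase letters only.
DOB_SYNONYMS = {"birthdate", "birthday", "dob", "dateofbirth"}

def normalize_tag(tag: str) -> str:
    """Map any known date-of-birth spelling to the canonical tag.

    Case, spaces, dots, hyphens and underscores are ignored;
    unknown tags are returned unchanged.
    """
    key = re.sub(r"[^a-z]", "", tag.lower())
    return CANONICAL_TAG if key in DOB_SYNONYMS else tag
```

For example, `normalize_tag("D.O.B.")` and `normalize_tag("date-of-birth")` both return `date_of_birth`, while an unrelated tag such as `surname` passes through untouched.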
8Representation of CDE
- date_of_birth September 15, 1970
- date_of_birth 15, September, 1970
- date_of_birth 9/15/70
- date_of_birth 15/9/70
- date_of_birth 15/09/70
- date_of_birth 9/15/1970
- date_of_birth 9.15.70
- date_of_birth 9,15,70
- date_of_birth some delta time
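Even with one agreed tag, the value can be written many ways, and some forms are ambiguous (month/day versus day/month). The sketch below tries a list of candidate formats and accepts a value only if every format that parses agrees on one date. The format list, the function name `to_iso`, and the choice of ISO 8601 as the target are assumptions made for illustration; month names are assumed to be in English (the default C locale), and two-digit years follow Python's `%y` convention.

```python
from datetime import datetime

# Candidate layouts covering the representations listed on the slide.
CANDIDATE_FORMATS = [
    "%B %d, %Y",   # September 15, 1970
    "%d, %B, %Y",  # 15, September, 1970
    "%m/%d/%y",    # 9/15/70
    "%d/%m/%y",    # 15/9/70 and 15/09/70
    "%m/%d/%Y",    # 9/15/1970
    "%m.%d.%y",    # 9.15.70
    "%m,%d,%y",    # 9,15,70
]

def to_iso(value: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if the
    value cannot be parsed or parses to more than one date."""
    parsed = set()
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed.add(datetime.strptime(value.strip(), fmt).date())
        except ValueError:
            continue  # this layout does not fit; try the next one
    if len(parsed) != 1:
        raise ValueError(
            f"cannot interpret {value!r} unambiguously: {sorted(parsed)}"
        )
    return parsed.pop().isoformat()
```

A value such as `9/10/70` fits both the month/day and day/month layouts and is rejected, which is the practical argument for fixing the representation as part of the CDE definition.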
9Annotation/Curation of the CDE
- Unique identifier
- Creator name
- Date of creation
- Date of modifications
- Exact definition
- Hierarchy (if applicable)
- List of users or CDE-specific browsers
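The annotation fields above could be held in a small record type. This is only a sketch: the class name, field names, and the example values are hypothetical and do not come from any existing CDE registry.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class CommonDataElement:
    identifier: str                  # unique identifier
    creator: str                     # creator name
    created: date                    # date of creation
    definition: str                  # exact definition
    modified: List[date] = field(default_factory=list)  # dates of modifications
    parent: Optional[str] = None     # hierarchy: identifier of parent CDE, if any
    users: List[str] = field(default_factory=list)      # users or CDE-specific browsers

# Hypothetical example entry.
dob = CommonDataElement(
    identifier="CDE-EXAMPLE-0001",
    creator="example curator",
    created=date(2000, 10, 27),
    definition="Calendar date on which the patient was born",
)
```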
10Shared Pathology Informatics Network
- 5-year project beginning April 2001
- Will develop the tools that will allow about 6
large laboratories to share their data with
researchers, using the internet
- Basically, it will allow a researcher to
interrogate the pathology records at multiple
institutions simultaneously and receive a summary
report almost instantaneously.
11VIRTUAL MODEL (CBCTR)
Required steps
- Extract patient data from clinical record
- Evaluate specimens
- Quality control of data and specimens
- Audit data and specimen quality
- Update central database regularly
- Re-evaluate data quality and currency before
specimens are shipped
[Diagram: Users, Request, Central Database,
Research Evaluation Panel, Resources 1-4, Data,
Specimens]
12What is so special about anatomic pathology data?
- Every anatomic pathology record is linked to the
patient identifier and to the tissue blocks for
that record
- One of the important rate-limiting factors in
cancer research today is access to tissues
- Access to even a small fraction of the tissues
routinely collected by pathology departments
(about 40 million each year) would be of enormous
research benefit.
13Example project: Virtual Precancer Archive
- Johns Hopkins Surgical Pathology has cases
accrued in electronic form since 1984
- 372,536 is the current (circa Sept. 2000)
number of accrued cases
- Wouldn't it be nice to be able to survey the
archived precancer cases in a large archive such
as the Hopkins Archive?
14Step 1. (Drs. Bill Moore and Robert Miller) Build
a phrase list from all cases
- The text of the reports can be represented as a
collection of phrases that contain all of the
concepts included in the reports.
- The 372,536 records were parsed to find the
diagnostic field free-text.
- Diagnostic field free-text was parsed into
sentences.
- Diagnostic field sentences were parsed into
phrases and words.
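The parsing steps above can be sketched roughly as follows. This is not the parser used on the Hopkins records, whose rules are not given here; it assumes sentences end at periods, semicolons, or line breaks, and that phrases break at commas, colons, and a short list of stop words. The stop-word list and the sample text are invented for illustration.

```python
import re

# Illustrative stop words that end a phrase (an assumption, not the real list).
STOP_WORDS = {"a", "an", "and", "of", "the", "with", "is", "in", "or"}

def sentences(text: str) -> list:
    """Split free text into sentences at '.', ';' or line breaks."""
    return [s.strip() for s in re.split(r"[.;\n]+", text) if s.strip()]

def phrases(sentence: str) -> list:
    """Split one sentence into phrases at punctuation and stop words."""
    out, current = [], []
    for token in re.findall(r"[a-z0-9-]+|[,:()]", sentence.lower()):
        if token in ",:()" or token in STOP_WORDS:
            if current:
                out.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:
        out.append(" ".join(current))
    return out

# Invented sample diagnostic text.
sample = "Kidney, biopsy: minimal mononuclear cell infiltrate. No evidence of rejection."
all_phrases = [p for s in sentences(sample) for p in phrases(s)]
```

Collecting the distinct phrases over every report gives a phrase list of the kind described above; the words are simply the whitespace-separated tokens of each phrase.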
15 418,159 phrases represent all the textual
concepts in the JHH surgical pathology records
and lie outside the realm of the Common Rule
- minimal mononuclear cell infiltrate
- minimal mononuclear cell infiltration
- minimal mononuclear cell interstitial
- minimal mononuclear infiltrate
- minimal mononuclear inflammation
- minimal mononuclear interstitial infitrates
- minimal mononuclear meningeal
- minimal morphologic abnormalities
16Step 2. Create a precancer terminology
- Started with the National Library of Medicine's
UMLS (Unified Medical Language System)
- We use the concept list file, which is
113,699,627 bytes and contains 1,598,176 terms.
- As an example, rcc (renal cell carcinoma) has
about 80 synonymous terms in UMLS
17UMLS CUI C0007134 Renal cell carcinoma
- carcinoma, renal cell
- carcinomas, renal cell
- renal cell carcinoma
- hypernephroid carcinoma
- grawitz tumor
- hypernephroma
- renal cell adenocarcinoma
- rcc
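A terminology of this kind is, in effect, a many-to-one lookup from synonym to concept identifier. The sketch below uses only the eight terms listed above (UMLS holds about 80 for this concept); the dictionary layout and the `lookup` helper are illustrative assumptions, not the UMLS file format.

```python
RCC_CUI = "C0007134"  # UMLS concept identifier from the slide

RCC_SYNONYMS = [
    "carcinoma, renal cell",
    "carcinomas, renal cell",
    "renal cell carcinoma",
    "hypernephroid carcinoma",
    "grawitz tumor",
    "hypernephroma",
    "renal cell adenocarcinoma",
    "rcc",
]

# Every synonym points at the same concept identifier.
term_to_cui = {term: RCC_CUI for term in RCC_SYNONYMS}

def lookup(term: str):
    """Return the concept identifier for a term, or None if unknown."""
    return term_to_cui.get(term.strip().lower())
```

With this mapping, `lookup("Grawitz tumor")` and `lookup("RCC")` resolve to the same concept, so reports that use different wording can be counted together.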
18The UMLS precancer terms
- 2,984 terms
- Includes 221 terms that I added and assigned
private J-codes
19Step 3. Map the Hopkins phrases to the precancer
terms
- Start with 418,159 phrases
- One by one, try to find a matching phrase in the
list of 2,984 precancer terms
- Prepare a file of all the matching terms
- This step takes 33 seconds to complete with a Perl
script running on a 450 MHz desktop computer,
i.e., it's scalable
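The matching step can be sketched as below. This is a Python illustration, not the Perl script referred to above, and the matching rule (longest run of consecutive words found in the term list) is an assumption. The handful of terms and codes are copied from the match results in this presentation; the real list has 2,984 terms.

```python
# A few precancer terms and their codes, copied from the match results.
PRECANCER_TERMS = {
    "actinic keratosis": "0022602",
    "adenomatous polyp": "0206677",
    "dysplasia": "0334044",
    "dysplastic": "0334045",
    "metaplasia": "0025568",
}

def longest_match(phrase: str, terms: dict):
    """Return (term, code) for the longest run of consecutive words in
    the phrase that appears in the term list, or None if nothing matches."""
    words = phrase.lower().split()
    for n in range(len(words), 0, -1):          # longest candidates first
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            if candidate in terms:
                return candidate, terms[candidate]
    return None

phrase_list = [
    "early actinic keratosis",
    "early dysplastic change",
    "minimal mononuclear infiltrate",
]
matches = {p: longest_match(p, PRECANCER_TERMS) for p in phrase_list}
```

Because each candidate is checked with a dictionary lookup, the work per phrase depends only on the phrase length and not on the size of the term list, which is one way a run over hundreds of thousands of phrases can stay fast.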
20The result: 10,310 term matches from 418,159
phrases (a scalable work in progress)
- phrase | matched term | code
- early actinic keratosis | actinic keratosis | 0022602
- early adenomatous polyp | adenomatous polyp | 0206677
- early borderline rejection | borderline | 0205189
- early dysplasia | dysplasia | 0334044
- early dysplastic change | dysplastic | 0334045
- early dysplastic process | dysplastic | 0334045
- early gastric mucin cell metaplasia | metaplasia | 0025568
- early gastric mucous cell metaplasia | metaplasia | 0025568
21Step 4. Give precancer match list to Drs. Bill
Moore and Robert Miller to create a concordance
- 10,310 precancer terms occurred in 54,909
accessioned surgical pathology cases between 1984
and 2000. That is, on average, each precancer term
was found in a little more than 5 cases.
- The 54,909 cases containing a precancer term
represent 54,909 / 372,536, or about 15%, of all
cases
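The two ratios quoted above can be checked directly from the counts given in this presentation; the variable names are just labels for this check.

```python
matched_terms = 10_310      # precancer term matches
cases_with_term = 54_909    # cases containing at least one precancer term
total_cases = 372_536       # all accrued cases, 1984 to Sept. 2000

cases_per_term = cases_with_term / matched_terms   # a little more than 5
fraction_of_cases = cases_with_term / total_cases  # roughly 15%

print(f"{cases_per_term:.1f} cases per term")
print(f"{fraction_of_cases:.1%} of all cases")
```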
22The concordance looks like this
- C0001815367220497667008419098
- C0002893394120765570701149177
- C0002893435120960421908784068
- C0002893436410698795906686356
- C0002893445510623875200588234
23Precancer-related cases by year
- 1984 1175 7
- 1985 1573 8
- 1986 2024 10
- 1987 2195 11
- 1988 2239 11
- 1989 2328 11
- 1990 2721 12
- 1991 3077 14
- 1992 3185 14
- 1993 2878 13
- 1994 3060 14
- 1995 2968 13
- 1996 3475 14
- 1997 4726 17
- 1998 4989 18
- 1999 5996 20
- 2000 6298 25
24Precancer-related cases by year
25 Precancer-related cases by year
26Cases by year of intraductal carcinoma
- C0007124 1984 14
- C0007124 1985 18
- C0007124 1986 30
- C0007124 1987 31
- C0007124 1988 33
- C0007124 1989 50
- C0007124 1990 42
- C0007124 1991 51
- C0007124 1992 50
- C0007124 1993 80
- C0007124 1994 106
- C0007124 1995 85
- C0007124 1996 100
- C0007124 1997 180
- C0007124 1998 217
- C0007124 1999 228
27Cases per year of Barrett's esophagus
- C0004763 1984 30
- C0004763 1985 35
- C0004763 1986 82
- C0004763 1987 97
- C0004763 1988 106
- C0004763 1989 84
- C0004763 1990 97
- C0004763 1991 100
- C0004763 1992 132
- C0004763 1993 126
- C0004763 1994 144
- C0004763 1995 162
- C0004763 1996 221
- C0004763 1997 307
- C0004763 1998 341
- C0004763 1999 401
28What does this really mean?
- With this approach, we can identify all the cases
of interest for any diagnosis or diagnoses,
stratifying data by year of diagnosis, age,
gender, or any other record element
- We can determine the encrypted case identifiers
for all those cases
- We can give those encrypted case numbers back to
the laboratory archivist, who can supply us with
the (encrypted) tissue blocks belonging to those
cases.
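The workflow above can be sketched as follows. How the case identifiers are actually encrypted is not described in this presentation; the sketch substitutes a keyed one-way hash (HMAC-SHA256) as one plausible way to hide accession numbers from the researcher while letting the laboratory, which holds the key, recompute the link. The key, the accession numbers, and the records are all invented.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret held only by the laboratory archivist.
SECRET_KEY = b"held-by-the-laboratory-only"

def pseudonym(case_id: str) -> str:
    """Keyed one-way hash of a case identifier (an assumed scheme)."""
    digest = hmac.new(SECRET_KEY, case_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# Invented records: (accession number, year of diagnosis, concept identifier).
records = [
    ("S84-0001", 1984, "C0004763"),
    ("S84-0002", 1984, "C0004763"),
    ("S85-0001", 1985, "C0004763"),
    ("S85-0002", 1985, "C0007124"),
]

def cases_for(cui: str) -> list:
    """Pseudonymous identifiers and years for all cases with this concept."""
    return [(pseudonym(cid), year) for cid, year, c in records if c == cui]

def counts_by_year(cui: str) -> Counter:
    """Number of matching cases per year of diagnosis."""
    return Counter(year for _, year in cases_for(cui))
```

The researcher sees only the output of `cases_for` and `counts_by_year`; the archivist, holding the key and the original accession numbers, can recompute each pseudonym to locate the corresponding tissue blocks.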
29Conclusion
- With these techniques, laboratories with good
informatics infrastructure can create a virtual
omni-archive (at very low cost) that operates
within current human subject protection
guidelines for minimal-risk de-identified
retrospective studies.