Anatomic Pathology Data Mining - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Anatomic Pathology Data Mining


1
Anatomic Pathology Data Mining
  • Jules J. Berman, Ph.D., M.D.
  • Program Director, Pathology Informatics, Cancer
    Diagnosis Program, National Cancer Institute
  • Dr. Bill Moore, Workshop Director
  • Friday, October 27, 2000, 8:00 A.M.
  • All opinions herein are Dr. Berman's and do not
    represent those of any federal agency.

2
Expertise Domain of the Pathology Data Miner
  • Confidentiality/Privacy Issues
  • Data Sharing issues, which includes data
    standardization
  • Data Analysis

3
Data Domain of Pathology Data Miner
  • Pathology Data linked to tissue samples
  • Any medical record data that can be linked to
    pathology data (including cancer registry data)
  • Any other relevant data in existence that can be
    sensibly linked to pathology records (this
    usually means the internet)

4
Confidentiality/privacy
  • Anyone interested in using confidential
    information (essentially any data generated in a
    hospital that is attached to a patient) needs to
    understand confidentiality and privacy issues.
  • The fact that you might be using only your
    department's data and that you treat the data
    confidentially will almost never exempt you from
    existing regulations.
  • The consequences to you and your institution of
    ignoring regulations can be profound.

5
Confidentiality/Privacy Lecture
  • I am giving a lecture on this subject today in
    the afternoon UAREP focus session.
  • The lecture is entitled "Bioinformatics Data
    Confidentiality Issues" and is scheduled for 3:30.

6
Issues related to data sharing
  • Nomenclatures and free-text mapping
  • Common Data Elements
  • Standard Report Formats
  • Internet Protocols

7
CDE for Date of Birth
  • birthdate September 15, 1970
  • birthday September 15, 1970
  • D.O.B. September 15, 1970
  • d.o.b. September 15, 1970
  • date of birth September 15, 1970
  • date of birth September 15, 1970
  • date-of-birth September 15, 1970
  • date_of_birth September 15, 1970
  • dob September 15, 1970
  • DOB September 15, 1970
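
The variants above all label the same data element. A minimal Python sketch of collapsing them onto one canonical name follows; the canonical name `date_of_birth` and the function `canonical_field` are illustrative choices, not part of the presentation.

```python
import re
from typing import Optional

# Hypothetical canonical name for the date-of-birth common data element.
CANONICAL = "date_of_birth"

# The slide's label variants, reduced to lowercase letters only.
_DOB_KEYS = {"birthdate", "birthday", "dob", "dateofbirth"}


def canonical_field(name: str) -> Optional[str]:
    """Return the canonical CDE name if `name` is a known date-of-birth label."""
    # Drop case, punctuation, spaces, hyphens and underscores before lookup.
    key = re.sub(r"[^a-z]", "", name.lower())
    return CANONICAL if key in _DOB_KEYS else None
```

With this lookup, `D.O.B.`, `date-of-birth` and `dob` all resolve to the same element, while an unrelated label returns `None`.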

8
Representation of CDE
  • date_of_birth September 15, 1970
  • date_of_birth 15, September, 1970
  • date_of_birth 9/15/70
  • date_of_birth 15/9/70
  • date_of_birth 15/09/70
  • date_of_birth 9/15/1970
  • date_of_birth 9.15.70
  • date_of_birth 9,15,70
  • date_of_birth some delta time
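
Even with one agreed field name, the value can be written many ways. The sketch below tries a fixed list of formats in order, assuming an English locale for month names; the format list and the function `parse_dob` are assumptions made for illustration. A value such as 5/9/70 is genuinely ambiguous, and this sketch simply reads it month-first.

```python
from datetime import date, datetime
from typing import Optional

# Formats are tried in order; month-first is tried before day-first.
FORMATS = [
    "%B %d, %Y",   # September 15, 1970
    "%d, %B, %Y",  # 15, September, 1970
    "%m/%d/%y",    # 9/15/70
    "%d/%m/%y",    # 15/9/70 and 15/09/70
    "%m/%d/%Y",    # 9/15/1970
    "%m.%d.%y",    # 9.15.70
    "%m,%d,%y",    # 9,15,70
]


def parse_dob(text: str) -> Optional[date]:
    """Return the first successful parse of `text`, or None if none fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None  # e.g. a relative value such as "some delta time"
```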

9
Annotation/Curation of the CDE
  • Unique identifier
  • Creator name
  • Date of creation
  • Date of modifications
  • Exact definition
  • Hierarchy (if applicable)
  • List of users or CDE-specific browsers
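
One way to hold these annotations is a small record type. This is only a sketch: the class name, field names and types are assumptions, chosen to mirror the bullet list above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CommonDataElement:
    identifier: str                 # unique identifier
    creator: str                    # creator name
    created: date                   # date of creation
    definition: str                 # exact definition
    modified: List[date] = field(default_factory=list)  # dates of modifications
    parent: Optional[str] = None    # hierarchy: identifier of parent CDE, if any
    users: List[str] = field(default_factory=list)      # users or CDE-specific browsers


# Hypothetical example values, for illustration only.
dob = CommonDataElement(
    identifier="CDE-0001",
    creator="example curator",
    created=date(2000, 10, 27),
    definition="Calendar date on which the patient was born",
)
```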

10
Shared Pathology Informatics Network
  • 5-year project beginning April 2001
  • Will develop the tools that will allow about 6
    large laboratories to share their data with
    researchers, using the internet
  • Basically, it will allow a researcher to
    interrogate the pathology records at multiple
    institutions simultaneously and receive a summary
    report almost instantaneously.
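
The idea of interrogating several institutions at once and merging the answers can be sketched as below. Nothing here reflects the actual network design: each site is modelled as a hypothetical callable that returns a case count for a term, and the site names and counts in the usage example are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def query_all(sites: Dict[str, Callable[[str], int]], term: str) -> dict:
    """Send `term` to every site concurrently and build a summary report."""
    with ThreadPoolExecutor(max_workers=len(sites) or 1) as pool:
        futures = {name: pool.submit(fn, term) for name, fn in sites.items()}
        per_site = {name: fut.result() for name, fut in futures.items()}
    return {"term": term, "per_site": per_site, "total": sum(per_site.values())}


# Hypothetical stand-ins for institutional query services.
demo_sites = {
    "site_a": lambda term: 12,
    "site_b": lambda term: 30,
}
report = query_all(demo_sites, "adenomatous polyp")
```

Only aggregate counts leave each site in this sketch, which is one reason a summary report is easier to share than the records themselves.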

11
VIRTUAL MODEL (CBCTR)
  • [Diagram; labels: Users, Request, Central
    Database, Data, Research Evaluation Panel,
    Specimens, Resources 1 through 4]
  • Required steps:
  • Extract patient data from clinical record
  • Evaluate specimens
  • Quality control of data and specimens
  • Audit data and specimen quality
  • Update central database regularly
  • Re-evaluate data quality and currency before
    specimens are shipped

12
What is so special about anatomic pathology data?
  • Every anatomic pathology record is linked to the
    patient identifier and to the tissue blocks for
    that record
  • One of the important rate-limiting factors in
    cancer research today is access to tissues
  • Access to even a small fraction of the tissues
    routinely collected by pathology departments
    (about 40 million each year) would be of enormous
    research benefit.

13
Example project: Virtual Precancer Archive
  • Johns Hopkins Surgical Pathology has cases
    accrued in electronic form since 1984
  • 372,536 is the current (circa Sept. 2000)
    number of accrued cases
  • Wouldn't it be nice to be able to survey the
    archived precancer cases in a large archive such
    as the Hopkins Archive?

14
Step 1. (Drs. Bill Moore and Robert Miller) Build
a phrase list from all cases
  • The text of the reports can be represented as a
    collection of phrases that contain all of the
    concepts included in the reports.
  • The 372,536 records were parsed to find the
    diagnostic field free-text.
  • Diagnostic field free-text was parsed into
    sentences.
  • Diagnostic field sentences were parsed into
    phrases and words.
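
The slides do not say how sentences and phrases were delimited. As a rough Python sketch under stated assumptions, sentences are split on periods, semicolons and newlines, and phrases are split on commas and a small, purely illustrative stop-word list.

```python
import re
from typing import List

# Illustrative stop words; the real parser's rules are not given in the slides.
STOP_WORDS = {"and", "or", "with", "of", "the", "in", "a"}


def sentences(text: str) -> List[str]:
    """Split free text into sentences on periods, semicolons and newlines."""
    return [s.strip() for s in re.split(r"[.;\n]+", text) if s.strip()]


def phrases(sentence: str) -> List[str]:
    """Split a sentence into phrases at commas and stop words."""
    out, current = [], []
    for token in re.findall(r"[a-z0-9'-]+|,", sentence.lower()):
        if token == "," or token in STOP_WORDS:
            if current:
                out.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        out.append(" ".join(current))
    return out
```

For example, `phrases("Kidney with minimal mononuclear cell infiltrate, and early dysplasia")` yields `kidney`, `minimal mononuclear cell infiltrate` and `early dysplasia`.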

15
418,159 phrases represent all the textual
concepts in the JHH surg path records; they lie
outside the realm of the Common Rule
  • minimal mononuclear cell infiltrate
  • minimal mononuclear cell infiltration
  • minimal mononuclear cell interstitial
  • minimal mononuclear infiltrate
  • minimal mononuclear inflammation
  • minimal mononuclear interstitial infitrates
  • minimal mononuclear meningeal
  • minimal morphologic abnormalities

16
Step 2. Create a precancer terminology
  • Started with the National Library of Medicine's
    UMLS (Unified Medical Language System)
  • We use the concept list file, which is
    113,699,627 bytes and contains 1,598,176 terms.
  • As an example, rcc has about 80 synonymous terms
    in UMLS

17
UMLS CUI C0007134 Renal cell carcinoma
  • carcinoma, renal cell
  • carcinomas, renal cell
  • renal cell carcinoma
  • hypernephroid carcinoma
  • grawitz tumor
  • hypernephroma
  • renal cell adenocarcinoma
  • rcc
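
The practical use of a concept identifier is that every synonym resolves to the same code. A minimal Python sketch using only the terms listed on this slide; the dictionary layout and the function `lookup_cui` are assumptions, not the UMLS file format.

```python
from typing import Optional

RCC_CUI = "C0007134"

# Synonyms from the slide, all mapped to the one concept identifier.
TERM_TO_CUI = {
    term: RCC_CUI
    for term in [
        "carcinoma, renal cell",
        "carcinomas, renal cell",
        "renal cell carcinoma",
        "hypernephroid carcinoma",
        "grawitz tumor",
        "hypernephroma",
        "renal cell adenocarcinoma",
        "rcc",
    ]
}


def lookup_cui(term: str) -> Optional[str]:
    """Case-insensitive lookup of a term's concept identifier."""
    return TERM_TO_CUI.get(term.strip().lower())
```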

18
The UMLS precancer terms
  • 2,984 terms
  • Contains 221 terms added by myself and given
    private J-codes

19
Step 3. Map the Hopkins phrases to the precancer
terms
  • Start with 418,159 phrases
  • One by one, try to find a matching term from the
    list of 2,984 precancer terms
  • Prepare a file of all the matching terms
  • This step takes 33 seconds to complete with a
    Perl script running on a 450 MHz desktop
    computer, i.e., it's scalable
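
The original script was written in Perl and is not shown. The following Python sketch illustrates one way such a mapping could work: each phrase is checked for the longest run of words that appears in the term list, using a dictionary so each lookup is constant time. The three terms and codes are taken from the result slide; the matching rule and the pipe-delimited output layout are assumptions.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# A tiny stand-in for the 2,984-term precancer list.
TERMS: Dict[str, str] = {
    "actinic keratosis": "0022602",
    "adenomatous polyp": "0206677",
    "dysplasia": "0334044",
}


def match_phrase(phrase: str, terms: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (term, code) for the longest word run of `phrase` found in `terms`."""
    words = phrase.split()
    n = len(words)
    for length in range(n, 0, -1):              # longest candidates first
        for start in range(n - length + 1):
            candidate = " ".join(words[start:start + length])
            if candidate in terms:
                return candidate, terms[candidate]
    return None


def map_phrases(phrases: Iterable[str], terms: Dict[str, str]) -> Iterator[str]:
    """Yield one 'phrase|term|code' line per phrase that contains a term."""
    for phrase in phrases:
        hit = match_phrase(phrase, terms)
        if hit:
            yield f"{phrase}|{hit[0]}|{hit[1]}"
```

Because the cost grows linearly with the number of phrases and each phrase is short, this kind of pass stays fast as the archive grows.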

20
The result: 10,310 term matches from 418,159
phrases (a scalable work in progress)
  • early actinic keratosis|actinic keratosis|0022602
  • early adenomatous polyp|adenomatous polyp|0206677
  • early borderline rejection|borderline|0205189
  • early dysplasia|dysplasia|0334044
  • early dysplastic change|dysplastic|0334045
  • early dysplastic process|dysplastic|0334045
  • early gastric mucin cell metaplasia|metaplasia|0025568
  • early gastric mucous cell metaplasia|metaplasia|0025568

21
Step 4. Give precancer match list to Drs. Bill
Moore and Robert Miller to create a concordance
  • 10,310 precancer terms occurred in 54,909
    accessioned surgical pathology cases between 1984
    and 2000. That is, on average each precancer
    term was found in a little more than 5 cases.
  • 54,909 cases containing a precancer term
    represent 54,909/372,536, or about 15% of all
    cases
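
The two ratios on this slide can be checked directly from the counts it gives. The variable names are illustrative.

```python
matched_terms = 10_310      # precancer terms found in the phrase list
cases_with_term = 54_909    # cases containing at least one precancer term
total_cases = 372_536       # all accrued cases, 1984 to 2000

# Average number of cases per matched term: a little more than 5.
cases_per_term = cases_with_term / matched_terms

# Share of all cases that contain a precancer term: about 15%.
share_of_cases = cases_with_term / total_cases

print(f"{cases_per_term:.2f} cases per term, {share_of_cases:.1%} of cases")
```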

22
The concordance looks like this
  • C0001815367220497667008419098
  • C0002893394120765570701149177
  • C0002893435120960421908784068
  • C0002893436410698795906686356
  • C0002893445510623875200588234

23
Precancer-related cases by year
  • 1984 1175 7
  • 1985 1573 8
  • 1986 2024 10
  • 1987 2195 11
  • 1988 2239 11
  • 1989 2328 11
  • 1990 2721 12
  • 1991 3077 14
  • 1992 3185 14
  • 1993 2878 13
  • 1994 3060 14
  • 1995 2968 13
  • 1996 3475 14
  • 1997 4726 17
  • 1998 4989 18
  • 1999 5996 20
  • 2000 6298 25

24
Precancer-related cases by year
25
Precancer-related cases by year
26
Cases by year of intraductal ca
  • C0007124 1984 14
  • C0007124 1985 18
  • C0007124 1986 30
  • C0007124 1987 31
  • C0007124 1988 33
  • C0007124 1989 50
  • C0007124 1990 42
  • C0007124 1991 51
  • C0007124 1992 50
  • C0007124 1993 80
  • C0007124 1994 106
  • C0007124 1995 85
  • C0007124 1996 100
  • C0007124 1997 180
  • C0007124 1998 217
  • C0007124 1999 228

27
Cases per year of Barrett's esophagus
  • C0004763 1984 30
  • C0004763 1985 35
  • C0004763 1986 82
  • C0004763 1987 97
  • C0004763 1988 106
  • C0004763 1989 84
  • C0004763 1990 97
  • C0004763 1991 100
  • C0004763 1992 132
  • C0004763 1993 126
  • C0004763 1994 144
  • C0004763 1995 162
  • C0004763 1996 221
  • C0004763 1997 307
  • C0004763 1998 341
  • C0004763 1999 401
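
Per-year tables like these can be produced by tallying concordance records. The sketch below assumes each record has been reduced to a (CUI, year, encrypted case identifier) tuple; that layout is hypothetical, since the slides do not describe the concordance fields, and the sample records are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, str]  # (CUI, year, encrypted case id), assumed layout


def cases_by_year(records: Iterable[Record], cui: str) -> Dict[int, int]:
    """Count distinct cases per year for one concept identifier."""
    seen = set()
    counts: Counter = Counter()
    for rec_cui, year, case_id in records:
        if rec_cui == cui and case_id not in seen:
            seen.add(case_id)        # count each case once, even if matched twice
            counts[year] += 1
    return dict(sorted(counts.items()))


# Made-up records for illustration only.
sample = [
    ("C0004763", 1984, "a1"),
    ("C0004763", 1984, "a1"),   # same case matched by a second phrase
    ("C0004763", 1985, "b2"),
    ("C0007124", 1984, "c3"),
]
```

Here `cases_by_year(sample, "C0004763")` counts one case in 1984 and one in 1985.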

28
What does this really mean?
  • With this approach, we can identify all the cases
    of interest for any diagnosis or diagnoses,
    stratifying data by year of diagnosis, age,
    gender, or any other record element
  • We can determine the encrypted case identifiers
    for all those cases
  • We can give those encrypted case numbers back to
    the laboratory archivist, who can supply me with
    the (encrypted) tissue blocks belonging to those
    cases.
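
The slides do not say how case identifiers were encrypted. One common approach, shown here purely as an assumption, is a keyed one-way hash: the laboratory keeps the secret key and a lookup table, so only the archivist can translate an encrypted identifier back to a case. The function names, key and accession numbers below are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Iterable


def encrypt_case_id(case_id: str, secret: bytes) -> str:
    """Return a keyed SHA-256 digest of the case identifier (hex string)."""
    return hmac.new(secret, case_id.encode("utf-8"), hashlib.sha256).hexdigest()


def build_lookup(case_ids: Iterable[str], secret: bytes) -> Dict[str, str]:
    """Archivist-side table mapping each encrypted id back to the real id."""
    return {encrypt_case_id(cid, secret): cid for cid in case_ids}


# Hypothetical key and accession numbers, for illustration only.
SECRET = b"kept-by-the-laboratory"
lookup = build_lookup(["S00-12345", "S00-12346"], SECRET)
```

The researcher sees only the digests; the archivist uses `lookup` to find the blocks that belong to a requested digest.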

29
Conclusion
  • With these techniques, laboratories with good
    informatics infrastructure can create a virtual
    omni-archive (at very low cost) that operates
    within current human subject protection
    guidelines for minimal-risk de-identified
    retrospective studies.