Modern Neoplasm Classification - PowerPoint PPT Presentation

Title: Modern Neoplasm Classification


1
Modern Neoplasm Classification
Dept of Pathology, University of Michigan
October 27, 2005
Jules J. Berman, Ph.D., M.D.
jjberman_at_alum.mit.edu
2
What is a tumor classification?
A grouped taxonomy listing of all tumors, with the following properties:
Inheritance: hierarchical structure, with each class of tumors inheriting the properties of its ancestors
Uniqueness: each tumor occurs in only one place in the classification
Comprehensive: all tumors are included
Class-intransitive: a tumor from one class does not change into a tumor from another class (e.g. an adenocarcinoma does not become a lymphoma)
Ernst Mayr. The Growth of Biological Thought: Diversity, Evolution, and Inheritance. Cambridge: Belknap Press, 1982.
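The inheritance and uniqueness properties can be pictured as a tree in which every class has exactly one parent. The following is a minimal sketch in Python, not the structure of any real classification: the `TumorClass` helper, the placeholder property strings, and the three-level lineage are all hypothetical and chosen only for illustration.

```python
class TumorClass:
    """One node in a toy classification tree (hypothetical example)."""

    def __init__(self, name, parent=None, properties=None):
        self.name = name
        self.parent = parent  # uniqueness: each class has exactly one parent
        self.own_properties = set(properties or [])

    def lineage(self):
        """Return the class names from this node up to the root."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def properties(self):
        """Inheritance: own properties plus everything the ancestors have."""
        inherited = self.parent.properties() if self.parent else set()
        return inherited | self.own_properties


# Illustrative classes only; the property strings are placeholders.
neoplasm = TumorClass("neoplasm", properties=["root property"])
endo_ecto = TumorClass("endoderm_or_ectoderm", neoplasm, ["lineage property"])
adenocarcinoma = TumorClass("adenocarcinoma", endo_ecto, ["leaf property"])

print(adenocarcinoma.lineage())
print(sorted(adenocarcinoma.properties()))
```

Because a node stores a single parent, a tumor cannot appear in two places, and anything asserted about an ancestor class is automatically true of its descendants.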
3
Problems with current tumor classifications
Mixed bag of tumor classes based on:
Anatomic site (roughly, distance from the tumor to the floor, as in head and neck tumors)
Clinical specialty (dermatologic tumors)
Functional similarity of cell types (e.g. endocrine tumors)
Not based on any describable biologic premise.
4
Molecular classification of cancer
The so-called molecular classifications (based largely on gene expression arrays of tumors) are simply a way of finding variants within a population. Mostly, you see experiments designed to cluster out variants of a tumor type (slow-growing, responsive to a specific treatment, prone to metastasize, etc.).
This is simply not classification (it ignores the intransitive law), and in fact no classification has emerged from any of the work that's been done with molecular diagnostics.
My opinion: gene expression array studies do not create classifications, but they are very useful taxon finders.
5
Developmental Lineage Classification and Taxonomy of Neoplasms
Similar to (but different from) the classification efforts of the 1950s (particularly Willis).
Old hypothesis (more or less discredited): tumor development recapitulates embryologic development.
New (my) hypothesis: tumors will tend to inherit the molecular pathways of their developmental ancestors. This may be helpful in selecting classes of tumors responsive to molecular targets.
Despite the difference in hypotheses, either way you end up with a classification that follows embryologic lines and that fits in with the stem cell hypothesis.

7
Developmental Lineage Classification and Taxonomy of Neoplasms
Now 145,000 terms (10 Megabytes)
Publicly available and free
The latest version is at www.pathologyinformatics.org

8
53 ways of writing prostate cancer
Prostate cancer is the concept, the 53 synonyms are the terms for the concept, and C4863000 is the code.
<name nci-code="C4863000">prostate with adenoca</name>
<name nci-code="C4863000">adenoca arising in prostate</name>
<name nci-code="C4863000">adenoca involving prostate</name>
<name nci-code="C4863000">adenoca arising from prostate</name>
<name nci-code="C4863000">adenoca of prostate</name>
<name nci-code="C4863000">adenoca of the prostate</name>
<name nci-code="C4863000">prostate with adenocarcinoma</name>
<name nci-code="C4863000">adenocarcinoma arising in prostate</name>
<name nci-code="C4863000">adenocarcinoma involving prostate</name>
<name nci-code="C4863000">adenocarcinoma arising from prostate</name>
<name nci-code="C4863000">adenocarcinoma of prostate</name>
<name nci-code="C4863000">adenocarcinoma of the prostate</name>
<name nci-code="C4863000">adenocarcinoma arising in the prostate</name>
<name nci-code="C4863000">adenocarcinoma involving the prostate</name>
<name nci-code="C4863000">adenocarcinoma arising from the prostate</name>
<name nci-code="C4863000">prostate with ca</name>
<name nci-code="C4863000">ca arising in prostate</name>
<name nci-code="C4863000">ca involving prostate</name>
<name nci-code="C4863000">ca arising from prostate</name>
<name nci-code="C4863000">ca of prostate</name>
<name nci-code="C4863000">ca of the prostate</name>
<name nci-code="C4863000">prostate with cancer</name>
<name nci-code="C4863000">cancer arising in prostate</name>
<name nci-code="C4863000">cancer involving prostate</name>
<name nci-code="C4863000">cancer arising from prostate</name>
<name nci-code="C4863000">cancer of prostate</name>
9
More
<name nci-code="C4863000">cancer of the prostate</name>
<name nci-code="C4863000">cancer arising in the prostate</name>
<name nci-code="C4863000">cancer involving the prostate</name>
<name nci-code="C4863000">cancer arising from the prostate</name>
<name nci-code="C4863000">prostate with carcinoma</name>
<name nci-code="C4863000">carcinoma arising in prostate</name>
<name nci-code="C4863000">carcinoma involving prostate</name>
<name nci-code="C4863000">carcinoma arising from prostate</name>
<name nci-code="C4863000">carcinoma of prostate</name>
<name nci-code="C4863000">carcinoma of the prostate</name>
<name nci-code="C4863000">carcinoma arising in the prostate</name>
<name nci-code="C4863000">carcinoma involving the prostate</name>
<name nci-code="C4863000">carcinoma arising from the prostate</name>
<name nci-code="C4863000">prostate adenoca</name>
<name nci-code="C4863000">prostate adenocarcinoma</name>
<name nci-code="C4863000">prostate ca</name>
<name nci-code="C4863000">prostate cancer</name>
<name nci-code="C4863000">prostate carcinoma</name>
<name nci-code="C4863000">prostatic cancer</name>
<name nci-code="C4863000">prostatic carcinoma</name>
<name nci-code="C4863000">prostatic adenocarcinoma</name>
<name nci-code="C4863000">prostate gland adenocarcinoma</name>
<name nci-code="C4863000">adenocarcinoma of the prostate gland</name>
<name nci-code="C4863000">adenocarcinoma of prostate gland</name>
<name nci-code="C4863000">prostate gland carcinoma</name>
<name nci-code="C4863000">carcinoma of the prostate gland</name>
<name nci-code="C4863000">carcinoma of prostate gland</name>
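The concept/term/code relationship in this list amounts to a many-to-one lookup: many terms, one code. A minimal sketch in Python follows; the four terms and the code are copied from the list, while the `TERM_TO_CODE` table and the `code_for` helper are hypothetical names used only for illustration.

```python
# A handful of the 53 synonyms, all mapped to the single concept code.
TERM_TO_CODE = {
    "prostate cancer": "C4863000",
    "adenocarcinoma of the prostate": "C4863000",
    "prostatic carcinoma": "C4863000",
    "ca of prostate": "C4863000",
}


def code_for(term):
    """Return the concept code for a term, or None if the term is unknown."""
    return TERM_TO_CODE.get(term.strip().lower())


print(code_for("Prostatic Carcinoma"))
print(code_for("adenocarcinoma of the prostate") == code_for("ca of prostate"))
print(code_for("prostate biopsy"))
```

Two reports that use different wording for the same tumor end up with the same code, which is what makes retrieval by concept possible.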
10
Is the taxonomy comprehensive? Let's compare it
with SNOMED.
11
Comparing the Developmental Lineage Classification with SNOMED
1. Used the 2005 version of UMLS (free from www.nlm.gov)
2. Files used (name, size, date):
   MRCON05   650,948,750     1-18-05
   MRCXT     1,610,612,736   1-18-05
   MRCXT2    1,610,612,736   1-18-05
   MRCXT3    1,610,612,736   1-18-05
   MRCXT4    1,610,612,736   1-18-05
   MRCXT5    1,610,612,736   1-18-05
   MRCXT6    1,610,612,736   1-18-05
   MRCXT7    1,196,031,492   1-18-05
4. Extracted the SNOMED CT terms from MRCON05 using the script:
   MRCON05.PL   2,098   5-30-05
12
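MRCON05.PL

# Pull the English SNOMED CT terms out of the pipe-delimited MRCON05 file
$start = time();
open (TEXT, "mrcon05") || die "cannot open mrcon05: $!";
open (OUT, ">snom05") || die "cannot open snom05: $!";
while ($line = <TEXT>)
  {
  @linearray = split(/\|/, $line);
  $cuinumber = $linearray[0];      # concept unique identifier (CUI)
  $language = $linearray[1];
  $vocabulary = $linearray[11];    # source vocabulary
  next if ("ENG" ne $language);
  next if ("SNOMEDCT" ne $vocabulary);
  print OUT "$cuinumber $linearray[14]\n";   # CUI followed by the term
  print "$cuinumber $linearray[14]\n";
  }
close (TEXT);
close (OUT);
$end = time();
$total = $end - $start;
print "\ntotal time was $total seconds\n";
exit;

Execution time of 132 seconds on a 2.89 GHz PC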
13
5. This produced a 35 MByte file:
   SNOM05   35,127,210   5-30-05
6. Created a Perl script, neopull2.pl, that uses the MRCXT "Neoplasm" relationship to identify all the neoplasm CUIs in UMLS and to pull out any of the SNOMED terms that corresponded to a neoplasm CUI.
7. The output file is:
   SNOM.OUT   567,372   5-30-05
8. This output file contains a lot of redundant terms and plurals, so I wrote snoclean.pl to get rid of the extraneous terms:
   SNOCLEAN.PL   1,092   5-30-05
9. The final output file is:
   SNOCLEAN.OUT   300,834   5-30-05
SNOMED contains 2,673 different neoplasm concepts and 7,696 neoplasm terms.
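The slides do not show how snoclean.pl decides which terms are extraneous. As a rough sketch only, assuming that "redundant" means an exact duplicate after lowercasing and that a plural is a term ending in "s" whose singular is also present, the cleaning step might look like this in Python. The `clean_terms` function and the sample terms are hypothetical.

```python
def clean_terms(terms):
    """Drop exact duplicates and simple '-s' plurals of terms that are present."""
    unique = {t.strip().lower() for t in terms if t.strip()}
    kept = []
    for term in sorted(unique):
        if term.endswith("s") and term[:-1] in unique:
            continue  # e.g. 'adenomas' is redundant when 'adenoma' is present
        kept.append(term)
    return kept


sample = ["Adenoma", "adenoma", "adenomas", "lymphoma", "lymphomas"]
print(clean_terms(sample))
```

A real cleaner would need more careful plural handling (for example "carcinomata"), so this should be read as an illustration of the idea rather than a reconstruction of the script.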
14
SNOMED
The total number of neoplasm concepts is 2,673
The total number of neoplasm terms is 7,696
Developmental Lineage
The total number of neoplasm concepts is 6,193
The total number of neoplasm terms is 146,666
The Developmental Lineage has 2.3 times the neoplasm concepts of SNOMED and 19 times the neoplasm terms of SNOMED.
Can one pathologist create a better nomenclature than the CAP? Maybe.
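The two ratios quoted above follow directly from the four counts. A short check in Python (the variable names are arbitrary):

```python
snomed = {"concepts": 2673, "terms": 7696}
lineage = {"concepts": 6193, "terms": 146666}

concept_ratio = lineage["concepts"] / snomed["concepts"]
term_ratio = lineage["terms"] / snomed["terms"]

print(f"concepts: {concept_ratio:.1f} times SNOMED")
print(f"terms: {term_ratio:.0f} times SNOMED")
```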
15
The large curated nomenclatures can't be used for concept matching and are fast becoming obsolete for their intended mode of human-based implementation, due to the explosive growth of the data domain: terabytes and terabytes every day (think about all types of digital data in medical information systems).
Prakash Nadkarni, MD, Roland Chen, MD, Cynthia Brandt, MD, MPH. UMLS Concept Indexing for Production Databases: A Feasibility Study. J Am Med Inform Assoc. 2001;8:80-91.
Conclusions: Considerable curation needs to be performed to define a UMLS subset that is suitable for concept matching.
16
What is the value of a comprehensive neoplasm classification?
1. A modern classification is the key to retrieving, organizing, and integrating the data held in biomedical databases (including the data held in hospital information systems).
   Can we use the taxonomy to code our surgical pathology reports and other textual documents?
2. A classification is a hypothesis about the nature of reality.
   Can we use the classification to select classes of tumors (rather than single tumors) for molecularly targeted cancer therapy? We've done this with antibiotics, with astounding success.
   Can we learn something about the biology of tumors by using the classification to stratify the data found in large biological databases and inspecting the results?
17
Autocoding Surgical Pathology Reports
What is the size of the data domain when we're talking about surgical pathology reports?
There are about 25 million surgical pathology reports generated in the U.S. each year (and about 50 million cytology reports).
18
Autocoding Surgical Pathology Reports
Allowing 1000 bytes per report, these reports occupy 25 Gigabytes of text (25 thousand million bytes).
Here is what 1000 bytes looks like:
To be, or not to be,--that is the question:--
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles,
And by opposing end them?--To die,--to sleep,--
No more; and by a sleep to say we end
The heartache, and the thousand natural shocks
That flesh is heir to,--'tis a consummation
Devoutly to be wish'd. To die,--to sleep;--
To sleep! perchance to dream:--ay, there's the rub;
For in that sleep of death what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despis'd love, the law's delay,
The insolence of office, and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would these fardels bear,
To grunt and sweat under a weary life,
But
Compressed, all of the surgical pathology reports produced in the U.S. in one year will fit easily on one DVD (like 10 episodes of I Love Lucy).
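The size estimate above is simple multiplication. The sketch below repeats it in Python and, as an added assumption not stated in the slides, takes a single-layer DVD to hold 4.7 Gigabytes in order to show how much compression the DVD claim requires.

```python
reports_per_year = 25_000_000   # surgical pathology reports per year (from the slides)
bytes_per_report = 1_000        # allowance per report (from the slides)
dvd_bytes = 4_700_000_000       # assumption: single-layer DVD capacity

total_bytes = reports_per_year * bytes_per_report
compression_needed = total_bytes / dvd_bytes

print(total_bytes)
print(f"{total_bytes / 10**9:.0f} Gigabytes of text")
print(f"compression factor needed for one DVD: {compression_needed:.1f}")
```

Under that assumption the text has to shrink by a factor of a little over five, which is within reach of general-purpose compressors on plain English text.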
19
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://www.purl.org/dc/elements/1.0/"
  xmlns:v="http://www.pathologyinformatics.org/informatics_r.htm">
<rdf:Description about="urn:PMID-16160487">
  <dc:title>interobserver and intraobserver variability in the diagnosis of hydatidiform mole</dc:title>
  <v:autocode term="mole" code="C0000000" />
  <v:autocode term="hydatidiform mole" code="" />
  <de_id>and in the of hydatidiform mole</de_id>
</rdf:Description>
<rdf:Description about="urn:PMID-16160486">
  <dc:title>primary glial tumor of the retina with features of myxopapillary ependymoma</dc:title>
  <v:autocode term="tumor" code="C0000000" />
  <v:autocode term="myxopapillary ependymoma" code="C0000000" />
  <v:autocode term="tumor of the retina" code="C0000000" />
  <v:autocode term="glial tumor" code="C3059000" />
  <v:autocode term="ependymoma" code="C0000000" />
  <de_id>glial tumor of the retina with of myxopapillary ependymoma</de_id>
</rdf:Description>
<rdf:Description about="urn:PMID-16160485">
  <dc:title>cd20-negative t-cell-rich b-cell lymphoma as a progression of a nodular lymphocyte-predominant hodgkin lymphoma treated with rituximab a molecular analysis using laser capture microdissection</dc:title>
  <v:autocode term="lymphoma" code="C0000000" />
  <v:autocode term="hodgkin" code="C0000000" />
  <v:autocode term="b-cell lymphoma" code="C6858100" />
  <v:autocode term="t-cell-rich b-cell lymphoma" code="C9496100" />
  <v:autocode term="hodgkin lymphoma" code="" />
  <de_id>t-cell-rich b-cell lymphoma as a of a hodgkin lymphoma with a using</de_id>
</rdf:Description>
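Each record above pairs a title with the nomenclature terms found in it and with a scrubbed copy of the title that keeps only common words and matched terms. The sketch below shows one naive way to get output of that shape in Python. It is a plain phrase lookup, not the author's Doublet Method or Concept Match method, and the three-term `NOMENCLATURE`, the `STOP_WORDS` list, `MAX_WORDS`, and the `autocode` function are all assumptions made for illustration; only the three codes are copied from the records.

```python
NOMENCLATURE = {  # term -> code (codes copied from the records above)
    "glial tumor": "C3059000",
    "b-cell lymphoma": "C6858100",
    "t-cell-rich b-cell lymphoma": "C9496100",
}
STOP_WORDS = {"a", "and", "as", "in", "of", "the", "with", "using"}
MAX_WORDS = 3  # longest term, in words, that this sketch looks for


def autocode(title):
    """Return (matched terms with codes, scrubbed title) for one title."""
    words = title.lower().split()
    keep = [w in STOP_WORDS for w in words]  # common words survive scrubbing
    matches = []
    for size in range(MAX_WORDS, 0, -1):     # try longer phrases first
        for start in range(len(words) - size + 1):
            phrase = " ".join(words[start:start + size])
            if phrase in NOMENCLATURE:
                matches.append((phrase, NOMENCLATURE[phrase]))
                for i in range(start, start + size):
                    keep[i] = True           # matched term words survive too
    scrubbed = " ".join(w for w, k in zip(words, keep) if k)
    return matches, scrubbed


title = ("cd20-negative t-cell-rich b-cell lymphoma as a progression of a "
         "nodular lymphocyte-predominant hodgkin lymphoma treated with rituximab")
matches, scrubbed = autocode(title)
print(matches)
print(scrubbed)
```

With this tiny term list the scrubbed text is shorter than the one in the record above, because words such as "hodgkin lymphoma" are not in the sketch's nomenclature and are therefore removed.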
20
The autocoder prepares an XML file in RDF format (a self-describing document). It autocodes and scrubs text concurrently, at a speed of about 8,000 reports per second, and does an incomparably better job than human coders!
This means that it will code and scrub the 25 million surgical pathology reports in the U.S. in about an hour using a desktop PC.
If we had access to a supercomputer (operating more than 3,000 times faster than my desktop PC), we could autocode and scrub every pathology report produced in the country in about a second.
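The timing claims can be checked from the figures on the slides. A short calculation in Python (variable names are arbitrary):

```python
reports = 25_000_000        # surgical pathology reports per year
reports_per_second = 8_000  # quoted autocoder speed on a desktop PC
speedup = 3_000             # quoted supercomputer advantage

desktop_seconds = reports / reports_per_second
supercomputer_seconds = desktop_seconds / speedup

print(f"desktop: {desktop_seconds:.0f} seconds, about {desktop_seconds / 60:.0f} minutes")
print(f"supercomputer: {supercomputer_seconds:.2f} seconds")
```

The desktop figure comes to a little under an hour and the supercomputer figure to roughly one second, consistent with the estimates above.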
21
Why is it so important to autocode fast?
Because we're not really talking about coding (coded datasets cannot be justified on the basis of their scientific value). We're really talking about re-coding very large datasets as necessary.
You almost always need to re-code!!!
1. Whenever you want to change from one nomenclature to another (eliminates the problem of brand-name loyalty)
2. Whenever you introduce a new version of a nomenclature
3. Whenever you want to use a new coding algorithm (e.g. parsimonious versus comprehensive, linking code to a particular extracted portion of the report)
4. Whenever you add legacy data to your LIS
5. Whenever you merge different pathology datasets
Forget mapping!!!
22
How can we integrate the neoplasm classification with OMIM to discover a new biological observation about tumors?
What is OMIM?
OMIM is a free, comprehensive listing of all the so-called Mendelian inherited diseases.
OMIM is 103,610,906 bytes (over 100 million bytes).
Shakespeare's Hamlet is 180,711 bytes.
OMIM is about 573 times larger than Hamlet.
Each record of OMIM lists the name of the inherited disease and all the medical conditions (including neoplasms) that may be associated with the condition.
23
Let's autocode all of OMIM and examine the results
1. The time to autocode was 92 seconds
2. The number of records in OMIM is 16,785
3. The number of records listing primitive tumors is 348
4. The number of records listing endoderm_or_ectoderm tumors is 1,220
5. The number of records listing mesoderm tumors is 1,766 (completely unlike what you might expect with non-inherited tumors)
6. The number of records listing neuroectoderm tumors is 747
So, because we have a class system, we can look at instance-coded datasets and make observations about CLASS.
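Counting records by class, as in the list above, only requires that every tumor code can be traced to its lineage. A minimal sketch in Python, assuming a hypothetical code-to-lineage table and a few made-up autocoded records (none of the codes or records below come from OMIM):

```python
from collections import Counter

# Hypothetical instance codes mapped to their top-level lineage class.
CODE_TO_LINEAGE = {
    "T1": "mesoderm",
    "T2": "neuroectoderm",
    "T3": "endoderm_or_ectoderm",
    "T4": "mesoderm",
}

# Hypothetical autocoded records: record id -> tumor codes found in its text.
records = {
    "rec1": ["T1", "T4"],
    "rec2": ["T2"],
    "rec3": ["T1", "T3"],
    "rec4": [],
}


def records_per_lineage(records):
    """Count how many records mention at least one tumor of each lineage."""
    counts = Counter()
    for codes in records.values():
        lineages = {CODE_TO_LINEAGE[code] for code in codes}
        counts.update(lineages)  # a record counts once per lineage
    return counts


counts = records_per_lineage(records)
for lineage, n in sorted(counts.items()):
    print(lineage, n)
```

The records are coded at the level of individual tumors, but the tallies are made at the level of the class, which is the point of having a hierarchy.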
24
Easy to count the three combinations of two-lineage (discordant) records
The number of OMIM records with neoplasm concepts in the record text is 1,015.
ectoderm/mesoderm        72 OMIM records
ectoderm/neuroectoderm   24 OMIM records
mesoderm/neuroectoderm   39 OMIM records
total                   135 class-discordant OMIM records
So, 135/1,015 (13%) have a lineage discordance.
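The total and the percentage follow from the three pair counts, and the pair counts themselves can be produced by tallying every two-lineage combination found in a record. A sketch in Python; the three counts and the 1,015 total are from the slide, while the `discordant_pairs` helper and its sample input are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Counts from the slide.
pair_counts = {
    "ectoderm/mesoderm": 72,
    "ectoderm/neuroectoderm": 24,
    "mesoderm/neuroectoderm": 39,
}
records_with_neoplasms = 1015

total_discordant = sum(pair_counts.values())
percent = 100 * total_discordant / records_with_neoplasms
print(total_discordant, f"{percent:.1f}%")


def discordant_pairs(record_lineages):
    """Tally each pair of distinct lineages that co-occur in one record."""
    counts = Counter()
    for lineages in record_lineages:
        for pair in combinations(sorted(set(lineages)), 2):
            counts["/".join(pair)] += 1
    return counts


# Hypothetical records, each given as the lineages of the tumors it mentions.
sample = [["mesoderm", "ectoderm"], ["mesoderm"], ["neuroectoderm", "mesoderm"]]
print(discordant_pairs(sample))
```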
25
Causes for 135 cases of class discordance
1. Inherited conditions with an (external) environmental factor
2. Physiologic (internal) effects that cross lineages (breast and ovarian cancers caused by an endocrine sensitivity that extends across lineages)
3. Conditions that included a tumor that occurs too infrequently to be correctly associated with the inherited condition
4. Mistakes in parsing OMIM (finding the name of a tumor in a record that was never intended to indicate that the condition is associated with the tumor)
5. Bad classification
How do you decide? In this case, you go back and read the 135 records and try to understand what went wrong in each case.
26
Classification papers
Autocoding papers (Doublet Method: 20,000 times faster than other published methods)
Confidentiality/privacy papers
 - De-identification and data scrubbing (Concept Match method)
 - Zero-knowledge reconciliation of identities
 - Threshold method for exchanging pieces of data
Data integration papers
www.pubmed.org: search on berman jj
27

end