Title: LATE: Lisp Architecture for Text Engineering*
1. LATE: Lisp Architecture for Text Engineering
In homage to GATE, the General Architecture for Text Engineering
2. Desiderata for a Natural Language Processing Framework
- Flexibility preferred over performance, aimed at furthering research
- Ability to deal with large corpora of documents
- Persistence of analysis results for subsequent re-use
- Stand-off style of annotations
- Wrappers for existing language processing components, independent of
  their implementation language
- Incorporation of tools for machine learning
- General purpose programming language for developing workflows,
  pipelines, scaling and distribution
- Dynamic programming language to make experimentation easy
- Convention over Configuration
3. LATE
- Core programming environment is Common Lisp
  - Object-oriented, multiple inheritance
  - Garbage collection
  - Strong support for meta-programming (manifest types, meta-object
    protocol)
  - Interfaces to C, C++, and Java external programs
  - SOAP, HTTP, interfaces to web services
- Persistence is implemented in a relational database (now MySQL)
  - Unlimited storage for corpora, documents, annotations,
    classification and regression models, ...
  - Cumulative development of annotations
  - Provides a conventional interface to external programs that can
    make use of LATE-developed interpretations, or add to them
  - Efficient support for indexed search and retrieval
4. LATE, continued
- Core entities
  - Corpus: a collection of documents, (possibly) a structure grammar
    of how to interpret section/subsection structure of the documents
  - Document: a single document, including its content as a UTF-8
    string, plus meta-data about source, ...; documents can belong to
    many corpora, for example for cross-validation studies
  - Annotation: a stand-off characterization of a substring (perhaps
    all) of a document's content; annotations are of various types,
    organized as a taxonomy of subtypes
    - persistent annotation: annotations never to be deleted, e.g.,
      result of human annotation
      - PHI mark, and type of PHI, e.g., name, address, MRN, ...
      - gold standard diagnosis, procedure, medication, ...
    - volatile annotation: machine-generated annotation
      - while in use, organized in efficient red-black interval tree
  - Model: a stored version of a learned model, from a machine learning
    procedure
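
The stand-off idea on this slide can be illustrated with a minimal Python sketch (Python is used here for brevity; LATE itself is written in Common Lisp). The names `Annotation` and `overlapping` are hypothetical, and a sorted list with binary search stands in for the red-black interval tree mentioned on the slide, so this shows the lookup behavior rather than LATE's data structure.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Annotation:
    start: int       # offset of the first character covered
    end: int         # offset one past the last character covered
    kind: str        # e.g. "sentence-annotation", "lp-token"
    data: str = ""   # optional payload; the document text itself is never modified


def overlapping(annotations, start, end):
    """Return annotations whose [start, end) span overlaps the query span.

    `annotations` must be sorted by start offset.
    """
    starts = [a.start for a in annotations]
    # Anything starting at or after `end` cannot overlap the query.
    hi = bisect_left(starts, end)
    return [a for a in annotations[:hi] if a.end > start]


text = "She had refractory preterm labor."
annos = sorted([
    Annotation(0, 33, "sentence-annotation"),
    Annotation(0, 3, "lp-token"),
    Annotation(4, 7, "lp-token"),
    Annotation(19, 26, "lp-token"),
])

hits = overlapping(annos, 4, 7)
print([(a.kind, text[a.start:a.end]) for a in hits])
```

Because annotations only carry offsets, several analyses (sentences, tokens, concepts) can describe the same characters without interfering with one another.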
5. LATE Volatile Annotations
- Sections: represent a structural hierarchy of the structures and
  substructures in a document, as specified by a document structure
  grammar
  - list item annotations recognize enumerated lists within sections, a
    common form in which lists of drugs, procedures, etc. are reported
- Sentences: we break text (within section boundaries) into sentences
  because many processing algorithms work on sentence-level text
- Tokens: tokens are identified within text
  - different incorporated processing algorithms have slightly
    different definitions of token, e.g., "12/4/2008" vs. "12", "/",
    "4", "/", "2008"
  - some compound tokens have internal structure that we parse, e.g.,
    T-99.4, spO292
  - parts of speech, determined by the UMLS Specialist lexicon, Link
    Grammar parser, and Brill tagger
- UMLS-annotations: we map tokens and sequences of tokens to UMLS
  concept unique identifiers (CUI), type identifiers (TUI), MeSH terms,
  and an aggregate semantic type (problem, test, drug, ...)
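
The point about differing token definitions can be made concrete with a small Python sketch. The two functions below are hypothetical stand-ins, not the tokenizers of any component LATE wraps; they only show how the same date yields one token under one definition and five under another, with stand-off offsets in both cases.

```python
import re


def whitespace_tokens(text):
    """Tokens are maximal runs of non-whitespace characters."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\S+", text)]


def punctuation_split_tokens(text):
    """Tokens are runs of word characters or single punctuation marks."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


text = "seen 12/4/2008"
print(whitespace_tokens(text))
print(punctuation_split_tokens(text))
```

Keeping both token sets as separate annotation types over the same offsets lets each downstream component use the definition it expects.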
6. Example
Admission Date 2011-11-13 Discharge Date 2012-02-10
Date of Birth 2011-11-13 Sex M
Service Neonatology
HISTORY OF PRESENT ILLNESS Baby Dean Leslie Baugher was born at 26 and
6/7 weeks gestation by cesarean section to a 39 year old, Gravida II,
Para I now II woman.
PRENATAL SCREENS A positive, antibody negative, Rubella immune RPR
nonreactive hepatitis B surface antigen negative group B strep unknown.
This pregnancy was remarkable for cervical incompetence, leading to
cerclage placement at 12 weeks. Mother was admitted on 2011-10-27 for
preterm labor and treated with Nifedipine, tocolysis and bed rest and
received a course of Betamethasone at that time. She had refractory
preterm labor, thus leading to delivery.
This infant emerged with good tone and cry and delivered spontaneously.
Apgars were seven at one minute and eight at five minutes. Birth weight
was 1,070 grams (50 to 75th percentile). His birth length was 37 cm (50
to 75th percentile) and his head circumference was 26.5 cm (50 to 75th
percentile). Discharge weight was 3,375 grams (50th percentile) length
49.5 (greater than 50th percentile) head circumference 36.5 (greater
than 90th percentile).
PHYSICAL EXAMINATION On admission, examination revealed an extremely
pre term infant anterior fontanel was soft, flat. Non dysmorphic,
intact palate. Chest with moderate retractions with spontaneous
breaths, fair breath sounds ...
- A discharge summary, de-identified, with synthesized names, dates,
  etc.
- Dates are uniformly offset, to retain time relations
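
Why a uniform offset retains time relations can be shown in a few lines of Python. This is only an illustration: `shift_dates` is a hypothetical helper, the offset of 1234 days is arbitrary, and the two input dates are simply the ones that appear in the example text above.

```python
from datetime import date, timedelta


def shift_dates(dates, offset_days):
    """Shift every date by the same number of days."""
    shift = timedelta(days=offset_days)
    return [d + shift for d in dates]


# Two dates taken from the example note: mother's admission and the birth.
original = [date(2011, 10, 27), date(2011, 11, 13)]
shifted = shift_dates(original, 1234)  # arbitrary offset, for illustration only

# The interval between the two events is unchanged by the shift.
print(original[1] - original[0], shifted[1] - shifted[0])
```

Any single constant added to all dates in a record cancels out when two dates are subtracted, which is why intervals such as length of stay survive de-identification.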
7. Structure
cl-user(22) (show-annotations tt :type 'section-annotation)
0 12217 " Admission Date 2011-11-13 ...d of Report) " DOC discharge_summary
1 17 " Admission Date" SH admission_date
17 36 " 2011-11-13 " SA admission_date
36 51 "Discharge Date" SH discharge_date
51 64 " 2012-02-10 " SA discharge_date
64 79 " Date of Birth" SH date_of_birth
79 99 " 2011-11-13 " SA date_of_birth
99 103 "Sex" SH sex
103 107 " M " SA sex
107 116 " Service" SH service
116 130 " Neonatology " SA service
130 158 " HISTORY OF PRESENT ILLNESS" SH history_of_present_illness
158 1215 " Baby Dean Leslie Baugher was bor...th percentile). " SA history_of_present_illness
1215 1237 " PHYSICAL EXAMINATION" SH physical_examination
1237 1815 " On admission, examination reveal... and clavicles. " SA physical_examination
1815 1832 " HOSPITAL COURSE" SH hospital_course
1832 9546 " 1.) Respiratory Scranton was i...h his progress. " SA hospital_course
9546 9570 " CONDITION AT DISCHARGE" SH discharge_condition
9570 9580 " Stable. " SA discharge_condition
9580 9603 " DISCHARGE DISPOSITION" SH discharge_disposition
9603 10303 " Home with family. PRIMARY PEDIA... breast ad lib. " SA discharge_disposition
10303 10316 " MEDICATIONS" SH medications
10316 11716 " Fer-in-Joshua and Poly-Vi-James a... Lane Hospital. " SA medications
11716 11737 " DISCHARGE DIAGNOSES" SH discharge_diagnoses
11737 12217 " Former 26 and 08-13 premature mal...d of Report) " SA discharge_diagnoses
- Headings and subheadings are found, hierarchic structure is
  recognized
- Only common headings are used in this structure grammar
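
A heading-driven segmentation of this kind can be sketched in Python as follows. This is a flat, simplified stand-in for LATE's structure grammar, not its implementation: the `HEADINGS` table and `find_sections` function are hypothetical, subsections are not handled, and the labels SH (heading span) and SA (body span) are used only in the way they appear in the output on this slide.

```python
import re

# Toy "structure grammar": heading text -> section name.
# The section names follow those shown in the slide's output.
HEADINGS = {
    "HISTORY OF PRESENT ILLNESS": "history_of_present_illness",
    "PHYSICAL EXAMINATION": "physical_examination",
    "HOSPITAL COURSE": "hospital_course",
}


def find_sections(text):
    """Return (start, end, label, name) tuples for headings and their bodies."""
    pattern = re.compile("|".join(re.escape(h) for h in HEADINGS))
    matches = list(pattern.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        name = HEADINGS[m.group()]
        # A section body runs until the next known heading or the end of text.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.start(), m.end(), "SH", name))
        sections.append((m.end(), body_end, "SA", name))
    return sections


text = "HISTORY OF PRESENT ILLNESS Baby was born. PHYSICAL EXAMINATION Chest clear."
for section in find_sections(text):
    print(section)
```

Because only listed headings are recognized, an uncommon heading simply stays inside the body of the preceding section, which matches the slide's remark that only common headings are used.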
8. Sentence Breaks
cl-user(24) (sentencize tt)
nil
cl-user(25) (show-annotations tt :type 'sentence-annotation)
19 29 "2011-11-13" sent nil
53 63 "2012-02-10" sent nil
82 92 "2011-11-13" sent nil
105 106 "M" sent nil
118 129 "Neonatology" sent nil
160 294 "Baby Dean Leslie Baugher was born ... I now II woman." sent nil
296 439 "PRENATAL SCREENS A positive, ant...B strep unknown." sent nil
441 540 "This pregnancy was remarkable for ...ent at 12 weeks." sent nil
541 697 "Mother was admitted on 2011-10-27 ...ne at that time." sent nil
699 758 "She had refractory preterm labor, ...ing to delivery." sent nil
760 831 "This infant emerged with good tone...d spontaneously." sent nil
833 891 "Apgars were seven at one minute an...at five minutes." sent nil
892 945 "Birth weight was 1,070 grams (50 t...5th percentile)." sent nil
947 1061 "His birth length was 37 cm (50 to ...5th percentile)." sent nil
1063 1214 "Discharge weight was 3,375 grams (...0th percentile)." sent nil
1239 1337 "On admission, examination revealed... was soft, flat." sent nil
1338 1368 "Non dysmorphic, intact palate." sent nil
1370 1494 "Chest with moderate retractions wi...coarse crackles." sent nil
1495 1540 "Heart was regular rate and rhythm, no murmur." sent nil
1541 1564 "Pink and well perfused." sent nil
1565 1625 "Abdomen soft and distended with th... umbilical cord." sent nil
1626 1638 "Patent anus." sent nil
1640 1706 "Normal preterm male genitalia with...ded bilaterally." sent nil
1708 1742 "Age appropriate tone and reflexes." sent nil
1743 1772 "Bruising of arms bilaterally." sent nil
1774 1814 "Normal spines, limbs, hip and clavicles." sent nil
1834 1910 "1.) Respiratory Scranton was int...s of Surfactant." sent nil
1911 1976 "He remained on SIMV 8until day of ... self-extubated." sent nil
1978 2167 "He was then placed on continuous p...ned to room air." sent nil
2169 2442 "He had a trial of diuretic therapy...ping the Diuril." sent nil
2444 2552 "Baby was loaded with caffeine citr... day of life 48." sent nil
- Within each section, sentence breaks are determined by a MAXENT
  algorithm from OPENNLP
  - The model was trained on a newspaper corpus, hence perhaps not
    appropriate for clinical text
  - but, it seems to work reasonably well
- 139 sentences in example
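
Splitting within section boundaries while reporting document-level offsets can be sketched as below. Note the substitution: this uses a naive punctuation rule, not the maximum-entropy model from OpenNLP that the slide describes, and `split_sentences` is a hypothetical name. It is meant only to show how section-relative matches become stand-off spans over the whole document.

```python
import re


def split_sentences(text, section_start, section_end):
    """Naive splitter: a sentence ends at '.', '?' or '!' followed by
    whitespace or the end of the section.

    Returns (start, end) offsets relative to the whole document.
    """
    section = text[section_start:section_end]
    spans = []
    for m in re.finditer(r"\S.*?[.?!](?=\s|$)", section, flags=re.DOTALL):
        spans.append((section_start + m.start(), section_start + m.end()))
    return spans


doc = "PHYSICAL EXAMINATION Head circumference was 26.5 cm. Patent anus."
body_start = len("PHYSICAL EXAMINATION ")
for start, end in split_sentences(doc, body_start, len(doc)):
    print(start, end, repr(doc[start:end]))
```

The decimal point in "26.5" does not end a sentence here because it is not followed by whitespace; a trained model handles many more such cases than this rule does.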
9. Link Grammar Parser
cl-user(26) (link-parse tt)
(135 4 0 0)
cl-user(28) (length (annotations tt :type 'lp-token))
2171
cl-user(31) (show-annotations tt :type 'lp-token)
19 29 "2011-11-13" lptok 1
53 63 "2012-02-10" lptok 1
82 92 "2011-11-13" lptok 1
105 106 "M" lptok 1
118 129 "Neonatology" lptok 1
160 164 "Baby" lptok 1
165 169 "Dean" lptok 2
170 176 "Leslie" lptok 3
177 184 "Baugher" lptok 4
185 188 "was" lptok 5
189 193 "born" lptok 6
194 196 "at" lptok 7
197 199 "26" lptok 8
200 203 "and" lptok 9
204 207 "6/7" lptok 10
208 213 "weeks" lptok 11
214 223 "gestation" lptok 12
224 226 "by" lptok 13
227 235 "cesarean" lptok 14
236 243 "section" lptok 15
244 246 "to" lptok 16
247 248 "a" lptok 17
249 251 "39" lptok 18
252 256 "year" lptok 19
257 260 "old" lptok 20
260 261 "," lptok 21
262 269 "Gravida" lptok 22
270 272 "II" lptok 23
272 273 "," lptok 24
274 278 "Para" lptok 25
279 280 "I" lptok 26
281 284 "now" lptok 27
285 287 "II" lptok 28
288 293 "woman" lptok 29
293 294 "." lptok 30
296 304 "PRENATAL" lptok 1
305 312 "SCREENS" lptok 2
312 313 "" lptok 3
315 316 "A" lptok 4
317 325 "positive" lptok 5
325 326 "," lptok 6
327 335 "antibody" lptok 7
336 344 "negative" lptok 8
...
- Constraint-based lexicalized parser
  - Tokenizes
  - computes all possible links among word pairs
  - chooses linkages in which links do not cross
- Example has 139 sentences, of which 135 parsed
  - Combinatorial explosion in 4
  - Multiple parses possible in many
- Links are stored with the 2171 tokens
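
The no-crossing condition on linkages mentioned on this slide can be checked with a short Python sketch. The functions `links_cross` and `is_planar` are hypothetical helpers, links are represented simply as pairs of word positions, and the example link sets are illustrative rather than taken from a parser run; the Link Grammar parser itself enforces this constraint during parsing rather than filtering afterwards.

```python
from itertools import combinations


def links_cross(a, b):
    """True if two links, given as pairs of word positions, cross.

    Links (i, j) and (k, l) cross when exactly one endpoint of one link
    lies strictly between the endpoints of the other.
    """
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j


def is_planar(links):
    """True if no pair of links in the linkage crosses."""
    return not any(links_cross(a, b) for a, b in combinations(links, 2))


# Illustrative linkages over word positions 1..6.
nested = [(1, 2), (2, 5), (3, 5), (4, 5)]
crossing = nested + [(3, 6)]

print(is_planar(nested), is_planar(crossing))
```

Links that share an endpoint or are nested inside one another do not count as crossing, which is why the first set is accepted.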
10. Parsing examples
- She had refractory preterm labor, thus leading to delivery.
[Linkage diagram; layout lost in extraction. Link labels shown: Os,
MXsp, A, Xd, Xc, Ss, A, E, MVp, Jp]
she had.v refractory.a preterm?.a labor.n , thus leading.g to delivery.n .
cl-user(36) (setq s9 (elt (annotations tt :type 'sentence-annotation) 9))
<sent 235584073 (699-758) "She had refractory preterm labor, ...ing to delivery.">
cl-user(37) (setq st9 (annotations s9 :type 'lp-token))
(<lptok 235586149 (699-702)1 "She">
 <lptok 235586148 (703-706)2 "had">
 <lptok 235586147 (707-717)3 "refractory">
 <lptok 235586146 (718-725)4 "preterm">
 <lptok 235586145 (726-731)5 "labor">
 <lptok 235586144 (731-732)6 ",">
 <lptok 235586143 (733-737)7 "thus">
 <lptok 235586142 (738-745)8 "leading">
 <lptok 235586141 (746-748)9 "to">
 <lptok 235586140 (749-757)10 "delivery">
 ...)
cl-user(38) (print-table (left-links (elt st9 1)))
1 she Ss Ss S had.v-d 2 1 1 63
cl-user(39) (print-table (right-links (elt st9 1)))
2 had.v-d Ij Ij I labor.v 5 3 2 15
2 had.v-d O Os Os preterm?.n 4 2 2 15
2 had.v-d MV MVg MVg leading.g 8 6 1 23
2 had.v-d O Ou Ou labor.n-u 5 3 1 47
11. Parsing examples
- Baby Dean Leslie Baugher was born at 26 and 6/7 weeks gestation by
  cesarean section to a 39 year old, Gravida II, Para I now II woman.
[Linkage diagram in two parts; layout lost in extraction. Link labels
shown include MVp, Jp, GN, A, G, Ss, Pv, AN, MXs, Js, Xd, Ds, X, Xc]
baby.n Dean.b Leslie.b Baugher was.v born.v at 26 and 6/7?.a weeks.n
gestation.n by cesarean?.n section to a
39 year.n old , Gravida II , Para I.n now II woman .
12. UMLS Lookup examples (only the good)
208 223 "weeks gestation" TUI T033 SEM _finding CUI C1135241
227 243 "cesarean section" TUI T061,T033 SEM _finding,_procedure CUI C0007876,C0029535,C1384674,C2053588,C2114431 MeSH E04.520.252.500
262 272 "Gravida II" TUI T033 SEM _finding CUI C0232997
274 278 "Para" TUI T033 SP-POS noun SEM _finding CUI C0030563 MeSH G08.686.677,G08.686.785.760.769.472,N06.850.490.812.600
327 344 "antibody negative" TUI T033 SEM _finding CUI C0855852
346 353 "Rubella" TUI T047,T116,T121,T129 SP-POS noun SEM _medication,_disease CUI C0035920,C0035923 MeSH C02.782.930.700.700,D20.215.894.899.779
354 360 "immune" TUI T169 SP-POS adj,noun SEM _modifier CUI C0439662
362 377 "RPR nonreactive" TUI T034 SEM _bodyparam CUI C0748443
379 406 "hepatitis B surface antigen" TUI T121,T129,T059 SEM _procparam,_medication CUI C0019168,C0201477,C2229745 MeSH D23.050.327.495.500.475
407 415 "negative" TUI T080,T033 SP-POS noun,verb,adj SEM _finding,_modifier CUI C0205160,C1513916
417 430 "group B strep" TUI T007 SEM _modifier CUI C0579233 MeSH B03.510.400.800.872.100
431 438 "unknown" TUI T078,T169,T170,T056,T080,T121,T129,T032,T033,T098 SP-POS adj,noun SEM _finding,_bodyparam,_medication,_modifier CUI C0439673,C1521803,C1546837,C1546841,C1547283,C1547294,C1547306,C1547312,...
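
Mapping single tokens and token sequences to concept identifiers can be sketched in Python as a lookup over all short token windows. This is an assumption about the general technique, not a description of LATE's lookup code: `LEXICON` is a toy table holding a handful of phrase-to-CUI entries copied from this deck's output, and `lookup_all` is a hypothetical name. Overlapping matches are all kept, as the stored annotations for "weeks", "gestation" and "weeks gestation" suggest.

```python
# Toy lexicon: lower-cased token tuple -> CUIs.
# Entries are copied from the annotation output shown in this deck.
LEXICON = {
    ("weeks",): ["C0439230", "C0439506", "C1561540"],
    ("gestation",): ["C0032961"],
    ("weeks", "gestation"): ["C1135241"],
}


def lookup_all(tokens, max_len=3):
    """Return (first_index, end_index_exclusive, cuis) for every window
    of up to `max_len` tokens that appears in the lexicon."""
    hits = []
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n > len(tokens):
                break
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in LEXICON:
                hits.append((i, i + n, LEXICON[key]))
    return hits


tokens = ["26", "and", "6/7", "weeks", "gestation"]
for first, end, cuis in lookup_all(tokens):
    print(" ".join(tokens[first:end]), cuis)
```

Keeping every match defers disambiguation to later stages, which can use features such as section or syntactic context to choose among candidate concepts.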
13. Database storage
mysql> select * from annotations where document_id=31039 limit 30;
+-----------+----------------+-------------+-------+-----+----------+-------+------+
| id        | type           | document_id | start | end | data     | other | up   |
+-----------+----------------+-------------+-------+-----+----------+-------+------+
| 235586254 | cui-annotation |       31039 |   105 | 106 | C0024554 | NULL  | NULL |
| 235586255 | cui-annotation |       31039 |   105 | 106 | C0221134 | NULL  | NULL |
| 235586256 | cui-annotation |       31039 |   105 | 106 | C0227102 | NULL  | NULL |
| 235586257 | cui-annotation |       31039 |   105 | 106 | C0369637 | NULL  | NULL |
| 235586258 | cui-annotation |       31039 |   105 | 106 | C0439113 | NULL  | NULL |
| 235586259 | cui-annotation |       31039 |   105 | 106 | C0439232 | NULL  | NULL |
| 235586260 | cui-annotation |       31039 |   105 | 106 | C0441923 | NULL  | NULL |
| 235586261 | cui-annotation |       31039 |   105 | 106 | C0456533 | NULL  | NULL |
| 235586262 | cui-annotation |       31039 |   105 | 106 | C0456644 | NULL  | NULL |
| 235586263 | cui-annotation |       31039 |   105 | 106 | C0475209 | NULL  | NULL |
| 235586264 | cui-annotation |       31039 |   105 | 106 | C1553028 | NULL  | NULL |
| 235586265 | cui-annotation |       31039 |   105 | 106 | C1553034 | NULL  | NULL |
| 235586266 | cui-annotation |       31039 |   105 | 106 | C1706456 | NULL  | NULL |
| 235586267 | cui-annotation |       31039 |   105 | 106 | C1706457 | NULL  | NULL |
| 235586268 | cui-annotation |       31039 |   105 | 106 | C1883310 | NULL  | NULL |
| 235586281 | cui-annotation |       31039 |   118 | 129 | C0027621 | NULL  | NULL |
| 235586285 | cui-annotation |       31039 |   160 | 164 | C0021270 | NULL  | NULL |
| 235586286 | cui-annotation |       31039 |   160 | 164 | C1550504 | NULL  | NULL |
| 235586297 | cui-annotation |       31039 |   189 | 193 | C0004897 | NULL  | NULL |
| 235586298 | cui-annotation |       31039 |   189 | 193 | C1301886 | NULL  | NULL |
| 235586299 | cui-annotation |       31039 |   189 | 193 | C1704689 | NULL  | NULL |
| 235586311 | cui-annotation |       31039 |   197 | 199 | C0227067 | NULL  | NULL |
| 235586312 | cui-annotation |       31039 |   197 | 199 | C0450349 | NULL  | NULL |
| 235586318 | cui-annotation |       31039 |   208 | 213 | C0439230 | NULL  | NULL |
| 235586319 | cui-annotation |       31039 |   208 | 213 | C0439506 | NULL  | NULL |
| 235586320 | cui-annotation |       31039 |   208 | 213 | C1561540 | NULL  | NULL |
| 235586325 | cui-annotation |       31039 |   208 | 223 | C1135241 | NULL  | NULL |
| 235586328 | cui-annotation |       31039 |   214 | 223 | C0032961 | NULL  | NULL |
| 235586338 | cui-annotation |       31039 |   227 | 243 | C0007876 | NULL  | NULL |
| 235586339 | cui-annotation |       31039 |   227 | 243 | C0029535 | NULL  | NULL |
+-----------+----------------+-------------+-------+-----+----------+-------+------+
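
A table of this shape, together with an indexed span query, can be sketched with Python's built-in `sqlite3` module. SQLite is used here only so the example is self-contained; LATE stores its data in MySQL. The column names follow the query output on this slide, while the column types, the index name `annotations_doc_span`, and the overlap query are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotations (
        id          INTEGER PRIMARY KEY,
        type        TEXT    NOT NULL,
        document_id INTEGER NOT NULL,
        "start"     INTEGER NOT NULL,
        "end"       INTEGER NOT NULL,
        data        TEXT,
        other       TEXT,
        up          INTEGER
    )""")
# Hypothetical index supporting retrieval by document and span.
conn.execute("""CREATE INDEX annotations_doc_span
                ON annotations (document_id, "start", "end")""")

# A few rows copied from the slide's output.
rows = [
    (235586318, "cui-annotation", 31039, 208, 213, "C0439230", None, None),
    (235586325, "cui-annotation", 31039, 208, 223, "C1135241", None, None),
    (235586328, "cui-annotation", 31039, 214, 223, "C0032961", None, None),
    (235586338, "cui-annotation", 31039, 227, 243, "C0007876", None, None),
]
conn.executemany("INSERT INTO annotations VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# All annotations in document 31039 overlapping the span [208, 223).
found = conn.execute(
    """SELECT data FROM annotations
       WHERE document_id = ? AND "start" < ? AND "end" > ?
       ORDER BY id""",
    (31039, 223, 208),
).fetchall()
print([data for (data,) in found])
```

Storing one row per annotation keeps results from separate processing runs cumulative: a later run can add rows for the same document without touching earlier ones.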
14. Modeling Document Content
- Statistical Natural Language Processing
- Generate large numbers of features
  - Token-level features
    - words themselves, parts of speech, mapping to dictionary
      meanings, UMLS concepts (includes ICD-9, SNOMED, MeSH, ...)
    - n-tuples of features based on adjacent sets of token-level
      features
  - Syntactic features
    - Noun-phrase chunks, mapped as for tokens
    - Full parse (e.g., using link-parser grammar), yields n-tuples of
      syntactically linked tokens and phrases
  - Position in document, section, subsection
- Generate, test, and then apply machine learning models that identify
  - names, locations, institutions, identifiers, addresses, phone
    numbers, ...
  - signs, symptoms, diagnoses, allergies, ...
  - tests, results, treatments, outcomes, medications, dosages, ...
- Currently support wrappers for LIBSVM, WEKA learners
- Goal: formalized representation of the meaningful content of the
  entire note
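
Token-level features and n-tuples of adjacent tokens can be generated along the following lines. This Python sketch is illustrative only: the feature naming scheme (`word=`, `tag=`, `bigram=`) and the functions `ngrams` and `features` are hypothetical, and the tags are the word-class suffixes (v, a, n) that appear in the parse of the example sentence earlier in the deck.

```python
def ngrams(items, n):
    """All runs of n adjacent items, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def features(tokens, tags):
    """Build a set of string-valued features from tokens and their tags."""
    feats = set()
    feats.update(f"word={t.lower()}" for t in tokens)
    feats.update(f"tag={t}" for t in tags if t)
    feats.update("bigram=" + "_".join(g).lower() for g in ngrams(tokens, 2))
    return feats


tokens = ["had", "refractory", "preterm", "labor"]
tags = ["v", "a", "a", "n"]
for feature in sorted(features(tokens, tags)):
    print(feature)
```

Feature sets like this are typically converted to sparse vectors before being passed to a learner; syntactic features would add tuples of linked rather than merely adjacent tokens.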
15. Steps for SHARPN
- cTAKES performs many of the same tasks
- Adopt UIMA/cTAKES framework
  - Learning curve
  - Reproduce some of current unique efforts
    - importers for specific data sets, annotators for complex tokens,
      use of features from link parser, ...
  - Suggest/develop incorporation of database-backed persistence
- Alternative
  - Build data-level translation/interoperability, i.e.,
    - map UIMA type system to LATE type system
    - build import/export functions between XML representation of UIMA
      and database representation of LATE
    - incorporate LATE environment in UIMA environment
  - Will do small experiments to determine whether feasible