Title: Classification of Semantic Relations in Noun Compounds using MeSH
1Classification of Semantic Relations in Noun
Compounds using MeSH
- Marti Hearst, Barbara Rosario
- SIMS, UC Berkeley
2LINDI Project Synopsis
- Goal: extract semantics from text
- Method: statistical corpus analysis
- Focus: biomedical text
- Interesting inferences (Swanson)
- Rich lexical resources
- Difficult NLP problems
- Noun Compounds
3Noun Compounds (NCs)
- Any sequence of nouns that itself functions as a noun
  - asthma hospitalizations
  - asthma hospitalization rates
  - bone marrow aspiration needle
  - health care personnel hand wash
- Technical text is rich with NCs
  - Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.
4NCs: 3 computational tasks (Lauer & Dras 94)
- Identification
- Syntactic analysis (attachments)
  - [Baseline [headache frequency]]
  - [[Tension headache] patient]
- Semantic analysis
  - Headache treatment: treatment for headache
  - Corticosteroid treatment: treatment that uses corticosteroid
5NC Semantic Relations
- Linguistic theories regarding the nature of the relations between constituents in NCs all conflict.
  - J. Levi 78
  - P. Downing 77
  - B. Warren 78
6NC Semantic relations
- 38 relations found by iterative refinement based on 2245 NCs
- Goals
  - More specific than case roles
  - General enough to aid coverage
  - Allow for domain-specific relations
7Semantic relations
- Examples
  - Frequency/time of: influenza season, headache interval
  - Measure of: relief rate, asthma mortality, hospital survival
  - Instrument: aciclovir therapy, laser irradiation, aerosol treatment
  - Purpose: headache drugs, voice therapy, influenza treatment
  - Defect: hormone deficiency, csf fistulas, gene mutation
  - Inhibitor: adrenoreceptor blockers, influenza prevention
8Multi-class Assignment
- Some NCs can be described by more than one semantic relation
  - eyelid abnormalities: location and defect
  - food allergy: cause and activator
  - cell growth: change and activity
  - tumor regression: change and ending/reduction
9Extraction of NCs
- Titles and abstracts from Medline (medical bibliographic database)
- Part-of-speech tagger
- Extraction of sequences of units tagged as nouns
- Collection of 2245 NCs with 2 nouns
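
A minimal sketch of the extraction step, assuming Penn Treebank-style noun tags (the slides do not name the tagger or its tag set). The function name, the tag set, and the sample input are illustrative only.

```python
from typing import Iterable, List, Tuple

# Assumption: Penn Treebank-style noun tags; the actual tagger is not named.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_noun_compounds(
    tagged: Iterable[Tuple[str, str]], min_len: int = 2
) -> List[List[str]]:
    """Return maximal runs of consecutive noun-tagged tokens.

    Only runs with at least `min_len` tokens are kept, so isolated
    nouns are not reported as compounds.
    """
    compounds: List[List[str]] = []
    run: List[str] = []
    for token, tag in tagged:
        if tag in NOUN_TAGS:
            run.append(token.lower())
        else:
            if len(run) >= min_len:
                compounds.append(run)
            run = []
    if len(run) >= min_len:  # flush a run that ends the input
        compounds.append(run)
    return compounds


# Hypothetical tagger output for a short title fragment.
tagged_title = [
    ("Asthma", "NN"), ("hospitalization", "NN"), ("rates", "NNS"),
    ("in", "IN"), ("children", "NNS"),
]
all_ncs = extract_noun_compounds(tagged_title)

# Keeping only two-noun compounds, as in the collection described above.
two_noun_ncs = [nc for nc in all_ncs if len(nc) == 2]
```

Filtering on length after extraction keeps the run detection independent of the two-noun restriction, so longer compounds remain available for later work.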
10Models
- Lexical (words)
  - headache pain
- Class-based model using MeSH descriptors for levels of description
  - MeSH 2: C23 G11
  - MeSH 3: C23.888 G11.561
  - MeSH 4: C23.888.592 G11.561.796
  - MeSH 5: C23.888.592.612 G11.561.796.444
  - MeSH 6: C23.888.592.612.441 G11.561.796.444
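
The class-based models can be read as truncations of each noun's MeSH tree number. A minimal sketch, assuming that "MeSH k" keeps the tree letter with its first number plus the next k - 2 dotted components (an interpretation of the labels above, not something the slides state); `truncate_mesh` is a hypothetical helper name.

```python
def truncate_mesh(descriptor: str, level: int) -> str:
    """Truncate a dotted MeSH tree number to the given model level.

    Assumed numbering: level 2 keeps "C23", level 3 keeps "C23.888",
    and so on. A descriptor shorter than the requested level is
    returned unchanged.
    """
    if level < 2:
        raise ValueError("level must be >= 2")
    parts = descriptor.split(".")
    return ".".join(parts[: level - 1])


# Tree numbers for "headache" and "pain" as given on the mapping slide.
headache = "C23.888.592.612.441"
pain = "G11.561.796.444"

for level in range(2, 7):
    print(level, truncate_mesh(headache, level), truncate_mesh(pain, level))
```

Because "pain" has fewer components than "headache", its descriptor stops changing once the level exceeds its depth.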
11MeSH Tree Structures
- 1. Anatomy A
- 2. Organisms B
- 3. Diseases C
- 4. Chemicals and Drugs D
- 5. Analytical, Diagnostic and Therapeutic Techniques and Equipment E
- 6. Psychiatry and Psychology F
- 7. Biological Sciences G
- 8. Physical Sciences H
- 9. Anthropology, Education, Sociology and Social Phenomena I
- 10. Technology and Food and Beverages J
- 11. Humanities K
- 12. Information Science L
- 13. Persons M
- 14. Health Care N
- 15. Geographic Locations Z
12MeSH Tree Structures
- 1. Anatomy A
  - Body Regions A01
  - Musculoskeletal System A02
  - Digestive System A03
  - Respiratory System A04
  - Urogenital System A05
  - Endocrine System A06
  - Cardiovascular System A07
  - Nervous System A08
  - Sense Organs A09
  - Tissues A10
  - Cells A11
  - Fluids and Secretions A12
  - Animal Structures A13
  - Stomatognathic System A14
  - (...)
- Body Regions A01
  - Abdomen A01.047
    - Groin A01.047.365
    - Inguinal Canal A01.047.412
    - Peritoneum A01.047.596
    - Umbilicus A01.047.849
  - Axilla A01.133
  - Back A01.176
  - Breast A01.236
  - Buttocks A01.258
  - Extremities A01.378
  - Head A01.456
  - Neck A01.598
  - (...)
13Mapping Nouns to MeSH Concepts
- headache recurrence
  - C23.888.592.612.441 C23.550.291.937
- headache pain
  - C23.888.592.612.441 G11.561.796.444
- breast cancer cells
  - A01.236 C04 A11
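
The mapping can be sketched as a word-by-word dictionary lookup. The miniature lexicon below is hypothetical: it contains only the tree numbers listed in the examples above, and it assumes one tree number per noun, whereas a MeSH term can appear at more than one tree position.

```python
from typing import Dict, List, Optional

# Hypothetical miniature lexicon built from the examples on this slide.
MESH_LEXICON: Dict[str, str] = {
    "headache": "C23.888.592.612.441",
    "recurrence": "C23.550.291.937",
    "pain": "G11.561.796.444",
    "breast": "A01.236",
    "cancer": "C04",
    "cells": "A11",
}


def map_compound(
    nc: str, lexicon: Dict[str, str] = MESH_LEXICON
) -> List[Optional[str]]:
    """Map each noun of a compound to its tree number (None if missing)."""
    return [lexicon.get(word.lower()) for word in nc.split()]


mapped = map_compound("breast cancer cells")
partly_mapped = map_compound("headache drugs")  # "drugs" is not in the toy lexicon
```

Returning `None` for a missing noun makes unmapped words explicit, which matters for words outside the medical vocabulary.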
14Levels of Description
- headache pain (C23.888.592.612.441 G11.561.796.444)
- Only Tree: C G
  - C (Diseases)
  - G (Biological Sciences)
- Level 1: C23 G11
  - C23 (Diseases: Pathological Conditions)
  - G11 (Biological Sciences: Musculoskeletal, Neural, and Ocular Physiology)
- Level 2: C23.888 G11.561
  - C23.888 (Diseases: Pathological Conditions: Signs and Symptoms)
  - G11.561 (Biological Sciences: Musculoskeletal, Neural, and Ocular Physiology: Nervous System Physiology)
- Level 3: C23.888.592 G11.561.796
  - C23.888.592 (Diseases: Pathological Conditions: Signs and Symptoms: Neurologic Manifestations)
  - G11.561.796 (Biological Sciences: Musculoskeletal, Neural, and Ocular Physiology: Nervous System Physiology: Sensation)
15Classification Task & Method
- Multi-class (18) classification problem
- Multi-layer neural networks to classify across all relations simultaneously
- Evaluation: distinguish between
  - Seen: NCs where 1 or 2 words appeared in the training set
  - Unseen: NCs in which neither word appeared in the training set
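
The slides do not say how the descriptors are presented to the network. One plausible input encoding, shown purely as an assumption, is to one-hot encode each noun's truncated descriptor and concatenate the two vectors. All names and the toy training pairs below are illustrative.

```python
from typing import Dict, List, Sequence


def build_vocab(values: Sequence[str]) -> Dict[str, int]:
    """Assign a stable index to each distinct descriptor."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(value: str, vocab: Dict[str, int]) -> List[int]:
    """One-hot encode a descriptor; unknown descriptors give all zeros."""
    vec = [0] * len(vocab)
    idx = vocab.get(value)
    if idx is not None:
        vec[idx] = 1
    return vec


def encode_pair(
    first: str, second: str,
    vocab_first: Dict[str, int], vocab_second: Dict[str, int],
) -> List[int]:
    """Concatenate the encodings of the first and second noun."""
    return one_hot(first, vocab_first) + one_hot(second, vocab_second)


# Toy training pairs of truncated descriptors (illustrative only).
train_pairs = [("C23.888", "G11.561"), ("C23.888", "C23.550"), ("A01.236", "C04")]
vocab_first = build_vocab([a for a, _ in train_pairs])
vocab_second = build_vocab([b for _, b in train_pairs])

x = encode_pair("C23.888", "G11.561", vocab_first, vocab_second)
```

Keeping separate vocabularies for the two positions lets the classifier treat a descriptor differently depending on whether it describes the modifier or the head.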
16Accuracy for 18-way Classification
- Correct answer in first three: 76-78%
- Correct answer in first two: 71-73%
- Correct answer ranked first: 61-62%
- Training: 855 NCs (50%)
- Testing: 805 NCs (75 unseen)
- Baseline: guessing most frequent class
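
The three accuracy figures correspond to checking whether the correct relation is among the classifier's k highest-ranked answers, for k = 3, 2, 1. A minimal sketch of that metric and of the most-frequent-class baseline follows; the function names and the toy predictions are assumptions for illustration, not the evaluation code behind the reported numbers.

```python
from collections import Counter
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]], gold: Sequence[str], k: int
) -> float:
    """Fraction of items whose gold label is among the top-k ranked labels."""
    hits = sum(1 for ranks, g in zip(ranked_predictions, gold) if g in ranks[:k])
    return hits / len(gold)


def majority_baseline(train_labels: Sequence[str], test_labels: Sequence[str]) -> float:
    """Accuracy obtained by always guessing the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for g in test_labels if g == most_common) / len(test_labels)


# Toy ranked outputs (best guess first) and gold labels, for illustration.
ranked = [
    ["purpose", "instrument", "defect"],
    ["defect", "purpose", "instrument"],
    ["instrument", "defect", "purpose"],
    ["purpose", "defect", "instrument"],
]
gold = ["purpose", "purpose", "purpose", "defect"]

acc_first = top_k_accuracy(ranked, gold, 1)
acc_two = top_k_accuracy(ranked, gold, 2)
acc_three = top_k_accuracy(ranked, gold, 3)
baseline = majority_baseline(["purpose", "purpose", "defect"], gold)
```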
17Accuracies for 18-way classification: generalization on unseen NCs
- Training: 73 NCs (5%)
- Testing: 1587 NCs (810 unseen) (95%)
- Legend: MeSH on unseen, Lexical, Lexical on unseen
18Accuracies by Unseen Noun
- Case 1: first N unseen
- Case 2: second N unseen
- Case 3: both N seen
- Case 4: neither N seen
- Training: 73 NCs (5%)
- Testing: 1587 NCs (810 unseen) (95%)
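
The four cases can be assigned by checking each noun of a test NC against the training vocabulary. A minimal sketch, assuming a noun counts as seen if it occurred in either position of any training NC (the slides do not specify whether position matters); the function names and toy data are hypothetical.

```python
from typing import Iterable, Set, Tuple

NounCompound = Tuple[str, str]


def training_vocab(train_ncs: Iterable[NounCompound]) -> Set[str]:
    """Collect every noun that occurs anywhere in the training NCs."""
    return {word for nc in train_ncs for word in nc}


def unseen_case(nc: NounCompound, vocab: Set[str]) -> int:
    """Return the case number (1-4) for a two-noun test compound."""
    first_seen = nc[0] in vocab
    second_seen = nc[1] in vocab
    if first_seen and second_seen:
        return 3  # both nouns seen
    if not first_seen and not second_seen:
        return 4  # neither noun seen
    return 1 if not first_seen else 2  # exactly one noun unseen


vocab = training_vocab([("headache", "pain"), ("asthma", "therapy")])
cases = [
    unseen_case(("influenza", "therapy"), vocab),
    unseen_case(("headache", "season"), vocab),
    unseen_case(("asthma", "pain"), vocab),
    unseen_case(("pollen", "season"), vocab),
]
```

Under this reading, only case 4 matches the "neither word appeared in the training set" definition of an unseen NC.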
19Accuracy for each relation
20Accuracy for sample relations
- Produces (genetic)
  - Ex. test set: thymidine allele, tumor dna, csf mrna, acetylase gene, virion rna, (...)
21Accuracy for sample relations
- Frequency/time of
  - Test set: disease recurrence, headache recurrence, enterovirus season, influenza season, mosquito season, pollen season, disease stage, transcription stage, drive time, injection time, ischemia time, travel time
22Accuracy for sample relations
- Purpose
  - Test set: varicella vaccine, tb vaccines, poliovirus vaccine, influenza vaccination, influenza immunization, abscess drainage, acne therapy, asthma therapy, asthma treatment, carcinogen treatment, disease treatment, hiv treatment
23Related work
- Finin (1980)
  - Detailed AI analysis, hand-coded
- Rindflesch et al. (2000)
  - Hand-coded rule base to extract certain types of assertions
24Related work
- Vanderwende (1994)
  - automatically extracts semantic information from an on-line dictionary
  - manipulates a set of handwritten rules
  - 13 classes
  - 52% accuracy
- Lapata (2000)
  - classifies nominalizations into subject/object binary distinction
  - 80% accuracy
- Lauer (1995)
  - probabilistic model
  - 8 classes
  - 47% accuracy
25Related work
- Prepositional Phrase Attachment
  - The problem
    - Eat spaghetti with a fork
    - Eat spaghetti with sauce
    - V N1 P N2
  - Attachment/association, not semantics
- Approaches
  - Word occurrences (Hindle & Rooth 93)
  - Using a lexical hierarchy
    - Conceptual association (Resnik 93, Resnik & Hearst 93)
    - Transformation-based (Brill & Resnik 94)
    - MDL to find optimal tree cut (Li & Abe 98)
- LINDI: use ML techniques to determine appropriate level of lexical hierarchy, classify into semantic relations
26Conclusions
- A simple method for assigning semantic relations to noun compounds
  - Does not require complex hand-coded rules
  - Does make use of existing lexical resources
- High accuracy levels for an 18-way class assignment
  - Small training set gets 60% accuracy on mixed seen and unseen words
  - Tiny training set (73 NCs) gets 40% accuracy on entirely unseen words
  - Off-the-shelf, unoptimized ML algorithms
27Future work
- Analysis of cases where it doesn't work
- NCs with > 2 terms
- How to generalize patterns found for noun compounds to other syntactic structures?
- How can we best formally represent semantics?
- How can we deal with non-medical words?
- Should we use other ontologies (e.g., WordNet)?
28Using Relations
- Eventual plan: combine relations with constituents' ontology memberships
- Examples
  - Instrument_2(biopsy, needle) -> Instrument_2(Diagnostic, Tool)
  - Procedure(brain, biopsy) -> Procedure(Anatomical-Element, Diagnostic)
  - Procedure(tumor, marker) -> Procedure(Disease-element, Indicator)
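
The generalization in these examples can be sketched as replacing each constituent with its ontology class while keeping the relation name. The lookup table below is hypothetical and contains only the memberships implied by the three examples; the `Relation` type and `generalize` function are illustrative names, not part of the project.

```python
from typing import Dict, NamedTuple


class Relation(NamedTuple):
    name: str
    first: str
    second: str


# Hypothetical ontology memberships, taken from the examples above.
ONTOLOGY_CLASS: Dict[str, str] = {
    "biopsy": "Diagnostic",
    "needle": "Tool",
    "brain": "Anatomical-Element",
    "tumor": "Disease-element",
    "marker": "Indicator",
}


def generalize(rel: Relation, ontology: Dict[str, str] = ONTOLOGY_CLASS) -> Relation:
    """Lift both arguments to their ontology class; keep unknown words as-is."""
    return Relation(
        rel.name,
        ontology.get(rel.first, rel.first),
        ontology.get(rel.second, rel.second),
    )


lifted = generalize(Relation("Instrument_2", "biopsy", "needle"))
lifted_procedure = generalize(Relation("Procedure", "brain", "biopsy"))
```

Leaving unknown words unchanged is one possible fallback; a real system would need a policy for constituents that have no ontology entry.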