Title: Extracting biological names and relations from texts
1Extracting biological names and relations from
texts
- Ting-Yi Sung
- Bioinformatics Program, TIGP
- Institute of Information Science
- Academia Sinica
- 2004/12/16
2Motivation
- To automatically extract information from natural language text.
- The need arises from the rapid accumulation of biomedical literature.
- Expedite survey efforts
- Support database curation (automatically associate papers with database records)
3Targets of Information Extraction
- Protein-Protein interaction/binding/inhibition
- Protein-Small Molecules
- Gene-Gene regulation
- Gene-Gene Product interaction
- Gene-Drug relation
- Protein-Subcellular location
- Amino Acid-Protein relation
- Example relationships between gene and drugs
- The gene is the drug target
- The gene confers resistance to the drug
- The gene metabolizes the drug
4Information Extraction Tasks
Identify Target Named Entities
Identify Relations among Named Entities
Identify Relations among Events and Named Entities
Associate Results with existing database records
5Outline
- NER (named entity recognition) in biomedical domain
- Challenges in biomedical NER
- State of progress in NER
- Abbreviation disambiguation
- Future works
6What is NER?
- NER
- Named Entity Recognition
- Including two tasks
- Identification of proper names in text
- Classification of proper names in text
- Newswire Domain
- Person, Location, Organization
- Biomedical Domain
- Protein, DNA, RNA, Body Part, Cell Type, Lipid,
etc.
7Example of NER - Biomedical
Protein
tissue
Disease
8NER in biomedical domain
- BioNER aims to recognize the following names
- First Priority
- Protein name, DNA name, RNA name
- Second Priority
- cell type, other organic compound, cell line,
lipid, multi-cell, virus, cell component, body
part, tissue, amino acid monomer, polynucleotide,
mono-cell, inorganic, peptide, nucleotide, atom,
other artificial source, carbohydrate, organic
9The Overall Spectrum
- BioNER is only the starting point of biological information extraction.
- A whole suite of NLP techniques is needed to handle relations and events in literature mining.
- Techniques developed for BioNER should be adaptable to problems in later stages, e.g., NE relation recognition.
10Intrinsic Features of BioNER
- Unknown words
- Long compound words
- Variations of expressions
- Nested NEs
11Unknown Words
- Words containing hyphen, digit, letter, Greek letter, Roman numeral.
- Alpha B1
- Adenylyl cyclase 76E
- Latent membrane protein 1
- 4-mycarosyl isovaleryl-CoA transferase
- oligodeoxyribonucleotide
- 18-deoxyaldosterone
- Abbreviation and Acronym
- IL, TECd, IFN, TPA
12Long Compound words
- interleukin 1 (IL-1)-responsive kinase
- interleukin 1-responsive kinase
- epidermal growth factor receptor
- SH2 domain containing tyrosine kinase Syk
- SH2 domain (GENIA example)
13Various expressions of the same NE
- Spelling variation
- N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine
- Word permutation
- beta-1 integrin, integrin beta-1
- Ambiguous expressions
- epidermal growth factor receptor, EGF receptor, EGFR
- c-jun, c-Jun, c jun
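The spelling and word-order variants listed above can be mapped to a shared lookup key. The following is a minimal sketch, not a method described in the talk: the function names `spelling_key` and `permutation_key` are hypothetical, and the normalization rules (lowercasing, dropping hyphens and whitespace, sorting words) are assumptions chosen only to cover the examples on this slide.

```python
import re

def spelling_key(name: str) -> str:
    """Collapse case, hyphens, and whitespace, e.g. 'c-Jun' and 'c jun' -> 'cjun'."""
    return re.sub(r"[-\s]+", "", name.lower())

def permutation_key(name: str) -> tuple:
    """Order-insensitive key: normalize each space-separated word, then sort."""
    return tuple(sorted(spelling_key(word) for word in name.split()))

variants = ["N-acetylcysteine", "N-acetyl-cysteine", "NAcetylCysteine"]
print({spelling_key(v) for v in variants})  # {'nacetylcysteine'}
print(permutation_key("beta-1 integrin") == permutation_key("integrin beta-1"))  # True
```

Keys like these do not relate a full name to its abbreviation (epidermal growth factor receptor vs. EGFR); that case needs the abbreviation handling discussed later in the talk.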
14Various expressions: the name explains its function
- the Ras guanine nucleotide exchange factor Sos
- the Ras guanine nucleotide releasing protein Sos
- the Ras exchanger Sos
- the GDP-GTP exchange factor Sos
- Sos(mSos), a GDP/GTP exchange protein for Ras
15Various expressions: the name includes a preposition and/or conjunction (ambiguity of dependencies)
- p85 alpha subunit of PI 3-kinase
- SH2 and SH3 domains of Src
- NF-AT1 , AP-1 , and NF-kB sites
- E2F1 and -3
- Residues 432, 435, 437, 438, and 440
16Nested Named Entity
- An NE embedded in another NE.
- IL-2: protein
- IL-2 gene: gene
- CBP/p300 associated factor: protein
- CBP/p300 associated factor binding promoter: DNA
17Outline
- NER (named entity recognition) in biomedical domain
- Challenges in biomedical NER
- State of progress in NER
- Abbreviation disambiguation
- Future works
18Challenges of NER
- Unknown word identification
- Named entity boundary detection
- Class disambiguation
19Challenges
- Unknown word identification
- t(10;11)(p13;q14)
- DNA methyltransferase
- 73 kDa protein
- interleukin 1 (IL-1)-responsive kinase (an NE may contain an abbreviation within it.)
- Some unknown words occur very few times in the corpus, so they are hard to recognize.
20Challenges (contd)
- NE boundary detection
- A boundary can be a regular English word, unknown word, Roman numeral, or digit.
- MHC Class II
- latent protein 1 (The left boundary is an adjective)
- cyclin-like UDG gene product
- Conjunction (and, or, ...)
- alpha- and beta-globin
- human and mouse gene
21Challenges (contd)
- Classification of abbreviations
- NF-AT
- Full name: nuclear factor of activated T cells
- Class: Protein
- HTLV-I
- Full name: Human T cell lymphotropic virus I
- Class: Virus
- TCDD
- Full name: 2,3,7,8-tetrachlorodibenzo-p-dioxin
- Class: Other Organic
- GRE
- Full name: glucocorticoid response element
- Class: DNA
22Outline
- NER (named entity recognition) in biomedical domain
- Challenges in biomedical NER
- State of progress in NER
- Abbreviation disambiguation
- Future works
23State-of-the-art Systems on NER: Two evaluation contests
- BioCreative 2004 (March)
- Critical Assessment of Information Extraction Systems in Biology
- Task 1: Entity extraction
- Target: genes (or proteins, where there is ambiguity)
- 10,000 sentences from Medline as training data, and 5,000 sentences as testing data
- BioNLP 2004 (August)
- GENIA Corpus as training data and 404 abstracts as testing data
- Target: 5 classes (protein, DNA, RNA, cell line, and cell type)
- Both use exact match scoring.
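Exact match scoring counts a predicted entity as correct only when its boundaries and class both equal a gold annotation. The sketch below illustrates that idea under stated assumptions: entities are represented as hypothetical `(start, end, class)` tuples, and the function name `exact_match_prf` is not taken from either contest's scoring tool.

```python
def exact_match_prf(gold, predicted):
    """Return (precision, recall, F1) where a hit needs identical span and class."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical annotations: (start token, end token, class)
gold = [(0, 2, "protein"), (5, 6, "DNA")]
pred = [(0, 2, "protein"), (5, 7, "DNA")]  # second span is one token too long
print(exact_match_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Under this scheme a boundary that is off by one token earns no partial credit, which is why boundary detection weighs so heavily in the results that follow.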
24BioNLP 2004 Datasets
Dataset | # of abstracts | # of sentences | # of tokens
Training Set | 2,000 | 20,546 (10.27/abs) | 472,006 (236.00/abs) (22.97/sen)
Test Set, Total | 404 | 4,260 (10.54/abs) | 96,780 (239.55/abs) (22.72/sen)
Test Set, 1978-1989 | 104 | 991 (9.53/abs) | 22,320 (214.62/abs) (22.52/sen)
Test Set, 1990-1999 | 106 | 1,115 (10.52/abs) | 25,080 (236.60/abs) (22.49/sen)
Test Set, 2000-2001 | 130 | 1,452 (11.17/abs) | 33,380 (256.77/abs) (22.99/sen)
Test Set, S/1998-2001 | 204 | 2,254 (11.05/abs) | 51,628 (253.08/abs) (22.91/sen)
25R/P/F
System | 1978-1989 set | 1990-1999 set | 2000-2001 set | S/1998-2001 set | Total
Zho04 | 75.3 / 69.5 / 72.3 | 77.1 / 69.2 / 72.9 | 75.6 / 71.3 / 73.8 | 75.8 / 69.5 / 72.5 | 76.0 / 69.4 / 72.6
Fin04 | 66.9 / 70.4 / 68.6 | 73.8 / 69.4 / 71.5 | 72.6 / 69.3 / 70.9 | 71.8 / 67.5 / 69.6 | 71.6 / 68.6 / 70.1
Set04 | 63.6 / 71.4 / 67.3 | 72.2 / 68.7 / 70.4 | 71.3 / 69.6 / 70.5 | 71.3 / 68.8 / 70.1 | 70.3 / 69.3 / 69.8
Son04 | 60.3 / 66.2 / 63.1 | 71.2 / 65.6 / 68.2 | 69.5 / 65.8 / 67.6 | 68.3 / 64.0 / 66.1 | 67.8 / 64.8 / 66.3
Zha04 | 63.2 / 60.4 / 61.8 | 72.5 / 62.6 / 67.2 | 69.1 / 60.2 / 64.7 | 69.2 / 60.3 / 64.4 | 69.1 / 61.0 / 64.8
Rös04 | 59.2 / 60.3 / 59.8 | 70.3 / 61.8 / 65.8 | 68.4 / 61.5 / 64.8 | 68.3 / 60.4 / 64.1 | 67.4 / 61.0 / 64.0
Par04 | 62.8 / 55.9 / 59.2 | 70.3 / 61.4 / 65.6 | 65.1 / 60.4 / 62.7 | 65.9 / 59.7 / 62.7 | 66.5 / 59.8 / 63.0
Lee04 | 42.5 / 42.0 / 42.2 | 52.5 / 49.1 / 50.8 | 53.8 / 50.9 / 52.3 | 52.3 / 48.1 / 50.1 | 50.8 / 47.6 / 49.1
BL | 47.1 / 33.9 / 39.4 | 56.8 / 45.5 / 50.5 | 51.7 / 46.3 / 48.8 | 52.6 / 46.0 / 49.1 | 52.6 / 43.6 / 47.7
26Current Methods
- Machine Learning
- HMM, SVM, ME (Maximum Entropy), CRF (Conditional Random Field)
- Hybrid methods
- Dictionary Based
- Approximate String matching algorithm
- Naming Rules
- Dynamic Programming
27Features for Machine Learning Methods
- Morphological Features
- Orthographical Features
- POS Features
- Genia POS tagger
- Semantic Trigger Features
- Head-noun Features
- NF-kappaB consensus site
- IL-2 gene
28Morphological Features
Prefix/Suffix | Example
cin | actinomycin
mide | Cycloheximide
zole | Sulphamethoxazole
lipid | phospholipids
rogen | estrogen
vitamin | dihydroxyvitamin
blast | erythroblast
cyte | thymocyte
phil | eosinophil
phosph | phosphorylation
methyl | methyltransferase
immuno | immunomodulator
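As an illustration of how the affixes above could become features for a classifier, here is a minimal sketch. The split into `PREFIXES` and `SUFFIXES`, the feature naming, and the stripping of a trailing plural "s" are assumptions made for this example; the talk does not specify how its system encodes these features.

```python
# Affixes taken from the table above; the prefix/suffix split is an assumption.
PREFIXES = ("phosph", "methyl", "immuno")
SUFFIXES = ("cin", "mide", "zole", "lipid", "rogen", "vitamin",
            "blast", "cyte", "phil")

def affix_features(token: str) -> list:
    """Return the morphological features that fire for one token."""
    word = token.lower()
    # Assumed handling of plurals such as 'phospholipids'.
    stem = word[:-1] if word.endswith("s") else word
    feats = [f"prefix={p}" for p in PREFIXES if word.startswith(p)]
    feats += [f"suffix={s}" for s in SUFFIXES if stem.endswith(s)]
    return feats

print(affix_features("Cycloheximide"))   # ['suffix=mide']
print(affix_features("phospholipids"))   # ['prefix=phosph', 'suffix=lipid']
```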
29Orthographical Features
Orthographical Feature | Example
AllCaps | EBNA, NFAT
AlphaDigit | p50, p65
AlphaDigitAlpha | IL23R, E1A
ATGCSequence | CCGCCC
CapLowAlpha | Src, Ras, Epo
CapMixAlpha | NFkappaB
CapsAndDigits | IL2, STAT4, SH2
DigitAlpha | 2xNFkappaB
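The feature names above suggest simple character-pattern tests. The sketch below is one possible reading: the regular expressions are assumptions reverse-engineered from the example tokens, not the definitions used by the system in the talk. Because several patterns can fit the same token (CCGCCC is both AllCaps and ATGCSequence), the hypothetical function `ortho_features` returns every feature that matches.

```python
import re

# Assumed patterns, chosen so that each example token in the table matches.
ORTHO_PATTERNS = {
    "AllCaps": r"[A-Z]+",
    "AlphaDigit": r"[A-Za-z]+[0-9]+",
    "AlphaDigitAlpha": r"[A-Za-z]+[0-9]+[A-Za-z]+",
    "ATGCSequence": r"[ATGC]+",
    "CapLowAlpha": r"[A-Z][a-z]+",
    "CapMixAlpha": r"(?=.*[a-z])[A-Z][A-Za-z]*[A-Z][A-Za-z]*",
    "CapsAndDigits": r"(?=.*[A-Z])(?=.*[0-9])[A-Z0-9]+",
    "DigitAlpha": r"[0-9]+[A-Za-z]+",
}

def ortho_features(token: str) -> list:
    """Return the names of all orthographical patterns the whole token matches."""
    return [name for name, pattern in ORTHO_PATTERNS.items()
            if re.fullmatch(pattern, token)]

print(ortho_features("NFkappaB"))  # ['CapMixAlpha']
print(ortho_features("CCGCCC"))    # ['AllCaps', 'ATGCSequence']
```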
30Head Nouns
Unigram: factor, protein, receptor, alpha, NF-kappaB, IL-2, cytokine, kinase, transcription, domain, complex, TNF-alpha, Nuclear, p50, CD28, TNF, molecule, subunit, cell, STAT3, family, tumor, factor-alpha, expression, interleukin
Bigram: NF-kappa B, transcription factor, I kappa, nuclear factor, protein kinase, B alpha, kinase C, tumor necrosis, T cell, glucocorticoid receptor, binding protein, factor alpha, adhesion molecule, monoclonal antibody, gene product, binding domain
31Additional features used by Manning's group: local features
- Clues within a sentence
- Include
- Previous NEs
- Abbreviations: an abbreviation, a long form, or neither
- Parenthesis-matching
- etc.
32External resources used by Manning's group
- Motivation
- Contextual clues do not provide sufficient evidence for confident classification.
- External resources may be vulnerable to incompleteness, noise, and ambiguity.
- Web
- Least vulnerable to incompleteness, highly vulnerable to noise.
- Prepare patterns for each class
- For genes: X gene, X antagonist, X mutation
- For RNA: X mRNA, ...
- For proteins: X ligation, ...
- Features: web-protein, web-RNA, O-web, ...
- Does not work well in the BioNLP task.
33External resources (2)
- Gazetteers (dictionaries)
- Are arguably subject to all three problems (incompleteness, noise, ambiguity), and yet have been used successfully in some systems.
- Compiled a list of gene names from databases (e.g., LocusLink) and GO, plus the data from BioCreative Tasks 1A and 1B.
- Filtering
- Single-character entries, e.g., A, 1
- Entries containing only digits, or symbols and digits, e.g., 37, 3-1
- Entries containing only words that can be found in an English dictionary (CELEX), e.g., abnormal, brain tumor
- 1,731,581 entries
- Larger context
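The three gazetteer filtering rules on this slide can be sketched as a single predicate. This is an illustrative reading only: `keep_entry` is a hypothetical name, and the small `ENGLISH_WORDS` set merely stands in for a real English dictionary such as CELEX.

```python
# Tiny stand-in for an English dictionary such as CELEX (assumption).
ENGLISH_WORDS = {"abnormal", "brain", "tumor"}

def keep_entry(entry: str, english_words=ENGLISH_WORDS) -> bool:
    """Return False for gazetteer entries that the slide's rules would filter out."""
    text = entry.strip()
    if len(text) <= 1:                        # single-character entries: 'A', '1'
        return False
    if not any(ch.isalpha() for ch in text):  # only digits/symbols: '37', '3-1'
        return False
    words = text.lower().split()
    if all(w in english_words for w in words):  # only common English words
        return False
    return True

entries = ["A", "37", "3-1", "abnormal", "brain tumor", "IL-2", "c-jun"]
print([e for e in entries if keep_entry(e)])  # ['IL-2', 'c-jun']
```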
34State-of-the-art approaches
- Machine learning + post-processing
- Our method (BioKDD2004)
- Maximum entropy
- Post-processing
- Boundary extension
- Re-classification
35Zhou et al. approach
- HMM + SVM
- Post-processing
- Rule-based post-processing is used to resolve nested named entities.
- Top 1 in the NLPBA task, F = 72.5
36Manning et al. method
- Machine learning
- ME Markov model
- Local features
- External resources and larger context
- Post-processing
- To correct gene boundaries (mainly for the BioCreative task)
- Top 1 in BioCreative, F = 83.2
- Top 2 in the NLPBA task, F = 70.1
37Our Method Overview
[Flowchart. Training phase: knowledge input, training data, construction of boundary word lists and dictionary, feature mapping, ME learning. Testing phase: testing data, ME, post-processing (boundary extension, re-classification), output NEs.]
38Experimental Results
ME-based NER
NE identification (P/R/F): 0.56/0.589/0.574
NE recognition (P/R/F): 0.512/0.538/0.525
39Post-Processing
- Nested Named Entity
- Ex: CIITA mRNA
- Nested annotation: <RNA><DNA>CIITA</DNA> mRNA</RNA>
- ME sometimes recognizes only CIITA, as DNA
- 16.57% of NEs in GENIA 3.02 contain one or more shorter NEs [Zhang, 2003]
- Post-processing method
- Boundary Extension
- Re-classification
40Boundary Extension (1)
- Boundary extension for nested NEs
- Extend the right boundary (R-boundary) repeatedly if the NE is followed by another NE, a head noun, or an R-boundary word with a valid POS tag.
- Extend the left boundary (L-boundary) repeatedly if the NE is preceded by an L-boundary word with a valid POS tag.
41Example
- ICAM-1 surface protein
- ME result: ICAM-1/1U surface/unknown protein/unknown (1 = protein, U = single)
- Boundary extension
- surface: in the R-boundary word list, valid POS tag
- Extension: ICAM-1 surface
- protein: in the R-boundary word list, valid POS tag
- Extension: ICAM-1 surface protein
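The right-boundary rule traced in this example can be sketched as a loop over the tokens that follow the NE. This is a simplified illustration, not the authors' implementation: the word lists, the set of valid POS tags, the half-open `(start, end)` span convention, and the name `extend_right` are all assumptions. Left extension would mirror it using an L-boundary word list.

```python
def extend_right(tokens, pos_tags, span, other_nes, head_nouns, r_words, valid_pos):
    """Grow span=(start, end), end exclusive, to the right while the next token
    starts another NE, is a head noun, or is an R-boundary word with a valid POS tag."""
    start, end = span
    ne_end_by_start = {s: e for s, e in other_nes}
    while end < len(tokens):
        word = tokens[end].lower()
        if ne_end_by_start.get(end, end) > end:   # followed by another NE
            end = ne_end_by_start[end]
        elif word in head_nouns:                  # followed by a head noun
            end += 1
        elif word in r_words and pos_tags[end] in valid_pos:
            end += 1                              # R-boundary word, valid POS
        else:
            break
    return start, end

# Hypothetical resources for the slide's example.
tokens = ["ICAM-1", "surface", "protein", "expression"]
pos_tags = ["NN", "NN", "NN", "NN"]
r_words = {"surface", "protein"}
new_span = extend_right(tokens, pos_tags, (0, 1), [], set(), r_words, {"NN", "NNS"})
print(" ".join(tokens[new_span[0]:new_span[1]]))  # ICAM-1 surface protein
```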
42Boundary extension (2)
- Boundary extension for NEs containing brackets or slashes
- NE -> NE ( NE ), followed by an NE, a head noun, or an R-boundary word with a valid POS tag
- NE -> NE / NE ( / NE ), followed by an NE, a head noun, or an R-boundary word with a valid POS tag
- Example
- granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene
- ME result: granulocyte-macrophage colony-stimulating factor, GM-CSF
- Extension: granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene
43Re-classification
- Use dictionary lookup
- Use R-boundary word
- CIITA mRNA -> RNA class
- granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene -> DNA class
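A minimal sketch of the two re-classification steps, assuming the dictionary is consulted first and the final word of the extended NE is then checked against an R-boundary word table. The mapping `R_WORD_CLASS`, the lookup order, and the function name `reclassify` are illustrative assumptions; only the two example outcomes come from the slide.

```python
# Hypothetical mapping from R-boundary words to NE classes.
R_WORD_CLASS = {"mrna": "RNA", "gene": "DNA"}

def reclassify(ne_text, current_class, dictionary, r_word_class=R_WORD_CLASS):
    """Return a possibly corrected class for an NE after boundary extension."""
    if ne_text in dictionary:            # step 1: dictionary lookup
        return dictionary[ne_text]
    words = ne_text.split()
    if words:                            # step 2: class implied by the last word
        return r_word_class.get(words[-1].lower(), current_class)
    return current_class

print(reclassify("CIITA mRNA", "DNA", {}))  # RNA
print(reclassify(
    "granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene",
    "protein", {}))                         # DNA
```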
44Experimental Results: NE Identification
Config | Boundary extension (BE-1 / BE-2 / BE-3) | NE identification P/R/F
Baseline | | 0.56/0.589/0.574
Conf1 | ✓ | 0.582/0.597/0.594
Conf2 | ✓ | 0.591/0.6/0.595
Conf3 | ✓ | 0.757/0.746/0.751
Conf4 | ✓ ✓ ✓ | 0.776/0.763/0.769
BE-1: boundary extension for nested NEs
BE-2: boundary extension for brackets and slashes
BE-3: with human name filter
45Experimental Results: NE Recognition
Config | Boundary extension (BE-1 / BE-2 / BE-3) and re-classification (RC-1 / RC-2) | NE recognition P/R/F
Baseline | | 0.512/0.538/0.525
Conf4 | ✓ ✓ ✓ | 0.645/0.634/0.639
Conf5 | ✓ ✓ ✓ ✓ | 0.67/0.658/0.664
Conf6 | ✓ ✓ ✓ ✓ | 0.707/0.695/0.701
Conf7 | ✓ ✓ ✓ ✓ ✓ | 0.727/0.715/0.721
RC-1: re-classification using dictionary lookup
RC-2: re-classification using R-boundary words
46Experimental Results
GENIA v3.02 (10-fold CV). Recently, Zhou improved the F-measure of his HMM model to 0.712 by combining it with SVM.
System | Overall | Protein | DNA | RNA
Our System | 0.721 | 0.785 | 0.700 | 0.752
Zhou et al. (Bioinformatics, 2004) | 0.666 | 0.758 | 0.633 | 0.612
47Error Analysis
- GENIA inconsistent annotation
- IL-2 gene expression
- <DNA>IL-2 gene</DNA> expression
- <othername><DNA>IL-2 gene</DNA> expression</othername>
- Conjunction
- Human and mouse gene
- Boundary detection error (boundary not in boundary word file)
- Squirrel, manic, bursal
48Error Analysis
- Abbreviation classification
- Orthographical form fits into at least two classes.
- Protein: SOS1, FLICE, GAG
- Other Organic: CD336
- False negative
- A number of errors are due to low-frequency words or words not encountered in the training data.
- False positive
- Ellipsis
- Many inflammatory cytokine genes including TNF,
IL-1, and IL-6
49Outline
- NER (named entity recognition) in biomedical domain
- Challenges in biomedical NER
- Current methods and our method
- State of progress in NER
- Future works
50Manning's conclusion (I): Key factor for low performance
- Task difficulty does not appear to be the primary factor leading to low performance.
- BioCreative: 1 class; BioNLP: 5 classes
- Key factor: quality of the training and evaluation data
- Higher inconsistency in the annotation of the BioNLP data.
- Two of the authors independently reviewed 50 of the system's errors; 34-35 were attributed to annotation.
- The authors do not think the annotation inconsistencies are due to biological subtleties.
51Manning's Conclusion (II)
- To improve biomedical annotation
- BioNLP organizers emphasized that participants should focus on deep knowledge sources (coreference resolution and use of dependency relations) over widely used lexical-level features (POS, morphological, orthographical, etc.)
- Proper exploitation of external resources
- In both tasks, external resources led to an improvement of only 1-2%.
- Consistent annotation might have led to a 70% reduction in error rate.
52Outline
- NER (named entity recognition) in biomedical domain
- Challenges in biomedical NER
- State of progress in NER
- Abbreviation disambiguation
- Future works
53Disambiguation of abbreviation
54Motivation (I)
- Named entity (NE) recognition (NER) is the first step of information extraction.
- NER contains two steps
- NE identification: extract named entities from text
- NE classification: classify a given NE into a specific class.
55Motivation (II)
- Since many protein or gene names are long compound names, authors usually refer to them by abbreviations.
- A2M: Alpha-2-macroglobulin
- A4GALT: alpha 1,4-galactosyltransferase
- EGFR: epidermal growth factor receptor, EGF receptor
- NF-AT: nuclear factor of activated T cells
- HTLV-I: Human T cell lymphotropic virus I
- TCDD: 2,3,7,8-tetrachlorodibenzo-p-dioxin
- GRE: glucocorticoid response element
56Motivation (III)
- Abbreviation identification task
- It is easier than the classification task.
- Abbreviations often have some orthographical clues.
- All capital letters, letter-digit hybrids, etc.
- Abbreviation classification task
- In some situations, it is hard to disambiguate an abbreviation's class.
- Example: only the abbreviation is mentioned, without the full name
57Challenges of abbreviation
- Two cases
- Case 1: the sentence contains both the abbreviation and the full name
- Human immunodeficiency virus type 2 (HIV-2), like HIV-1, causes AIDS and is associated with AIDS cases primarily in West Africa.
- Case 2: the sentence contains only the abbreviation
- HIV-1 and HIV-2 display significant differences in nucleic acid sequence and in the natural history of clinical disease.
58Case 1
- Case 1 is easier than Case 2
- The classification can be solved by the following steps
- Abbreviation-full name association
- Disambiguate the full name's class
- Assign the full name's class to the abbreviation
- The challenge has shifted from abbreviation classification to abbreviation-full name association
59Example of Case 1
- Sentence
- Human immunodeficiency virus type 2 (HIV-2), like HIV-1, causes AIDS and is associated with AIDS cases primarily in West Africa.
- Step 1: Abbreviation-full name association
- (Full name, Abbreviation): (Human immunodeficiency virus type 2, HIV-2)
- Step 2: Full name class assignment
- Name: Human immunodeficiency virus type 2
- Class: Virus
- Step 3: Abbreviation class assignment
- Abbreviation: HIV-2
- Class: Virus
60A solution method to Case 1
- Schwartz and Hearst, PSB 2003.
- Identify <long form, short form> pairs.
- Both long form and short form occur in the same sentence.
- long form ( short form ): more frequent
- short form ( long form )
61Algorithm: Identify long form ( short form )
- Identify long form and short form candidates (using adjacency to parentheses).
- Identify the correct long form.
- Starting from the end of both candidates, move right to left, trying to find the shortest long form that matches the short form.
- Every character in the short form must match a character in the long form.
- The matched characters in the long form must be in the same order as the characters in the short form.
- <HSF, Heat shock transcription factor>
- <TTF-1, Thyroid transcription factor 1>: fail
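The right-to-left matching described above can be sketched as follows. This is a Python rendering written for illustration, following the published description of the Schwartz and Hearst procedure; the function name `find_best_long_form` and the rule that the first short-form character must align with the start of a word are taken from that description as understood here, so treat the details as assumptions rather than the reference implementation. Candidate extraction around parentheses is not shown.

```python
def find_best_long_form(short_form: str, long_form: str):
    """Scan both strings right to left; return the shortest suffix of long_form
    (cut at a word boundary) that matches short_form, or None if no match."""
    l = len(long_form) - 1
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():          # skip hyphens etc. in the short form
            continue
        # Move left until the character matches; the first short-form character
        # must additionally sit at the beginning of a word in the long form.
        while (l >= 0 and long_form[l].lower() != ch) or \
              (s == 0 and l > 0 and long_form[l - 1].isalnum()):
            l -= 1
        if l < 0:
            return None
        l -= 1
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]

print(find_best_long_form("HSF", "Heat shock transcription factor"))
# Heat shock transcription factor
print(find_best_long_form("TTF-1", "Thyroid transcription factor 1"))
# transcription factor 1   (the failure case: 'Thyroid' is lost)
```

The second call reproduces the failure noted on the slide: because the procedure prefers the shortest match, both T's of TTF-1 are consumed inside "transcription" and "Thyroid" is never reached.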
62Error analysis
- Unused characters, e.g., <CNS1, cyclophilin seven suppressor>
- No pattern between long form and short form, e.g., <ATN, anterior thalamus>
- Partial matching
- The long form includes additional words to the left of the match, e.g., <Pol I, RNA polymerase I>
- Out-of-order mapping
- First character matches an internal character (of the long form).
- Non-continuous long form.
- Transformation in the mapping (2D -> two-dimensional)
- Short form of only one character.
63Other types of abbreviations
- Schwartz and Hearst's algorithm only considers candidates in parentheses.
- Challenge: finding all possible pairs is a more difficult problem.
64Example of Case 2
- It's hard to disambiguate an abbreviation's class, even with context information.
- Example
- HIV-1 and HIV-2 display significant differences in nucleic acid sequence and in the natural history of clinical disease.
- HIV-1 and HIV-2 are both viruses, but if we replace HIV-1 and HIV-2 with IL-2 and IL-10, the sentence still makes sense.
- IL-2 and IL-10 display significant differences in nucleic acid sequence and in the natural history of clinical disease.
- IL-2 and IL-10: gene names
65Case 2
- Leave for future work.
- Clue
- Statistical methods
- Dictionary-based methods
66Outline
- NER (named entity recognition) in biomedical domain
- Challenges in biomedical NER
- State of progress in NER
- Abbreviation disambiguation
- Future works
67What's Next after NER is solved?
- Named entity relation recognition (NERR)
- Protein-Protein interaction/binding/inhibition
- Protein-Small Molecules
- Gene-Gene regulation
- Gene-Gene Product interaction
- Gene-Drug relation
- Protein-Subcellular location
- Amino Acid-Protein relation
- Gene-drug relation
68Identify Relations among Named Entities
- Target: Extract relations between various biological named entities.
Here we demonstrate that the c-myb proto-oncogene product, which is itself a DNA-binding protein, and transcriptional transactivator, can interact synergistically with Z.
Relation (Subject, Action, Object): (c-myb proto-oncogene product, interact, Z)
69Future works
- Few papers have been published on the following specific challenging topics of NER.
- Automated corpus correction
- Disambiguation of abbreviations (Schwartz & Hearst, 2003, ...)
- Conjunction
-
- NERR (difficult)
- parser
- Pronoun and anaphora resolution
70Acknowledgements
- Bioinformatics: Yi-Feng Lin, Wen-Chi Chou
- NLP: Tzong-Han Tsai, Cheng-Wei Lee
- Postdoc: Kuen-Pin Wu
- Colleague: Wen-Lian Hsu (Fu Chang is jumping on the bandwagon now.)
71Lab Introduction
72Research topics
- Protein structure prediction
- Secondary structure prediction
- Tertiary structure prediction: local structure
- Members: Hsin-Nan Lin, Caster Chen, Jia-Ming Chang
- Protein structure determination based on NMR data
- Backbone assignment
- Side chain assignment
- RDC
- Jia-Ming Chang, Caster Chen, Philip Chen
- Collaborator: Prof. TH Huang, IBMS
73Research topics
- Mass spectrometry based proteomics
- Protein quantification
- Protein identification for modification study
- Yi-Hwa Yian, Wen-Ting Lin, Jacky Chou, Wei-Nung Hung
- Collaborator: Prof. YR Chen, Inst. of Chemistry
- Biological literature mining
- NER, NERR
- Yi-Feng Lin, Jacky Chou, Richard Tsai
74Faculty
- PI: Wen-Lian Hsu, Ting-Yi Sung
- Post-doc: Kuen-Pin Wu