Extracting biological names and relations from texts

Provided by: yifen

Transcript and Presenter's Notes

1
Extracting biological names and relations from
texts

Ting-Yi Sung
Bioinformatics Program, TIGP
Institute of Information Science
Academia Sinica
2004/12/16

2
Motivation

To automatically extract information from natural
language text.
The need arises from rapid accumulation of
biomedical literature.
Expedite survey efforts
Support the database curation (automatically
associate the papers with database records)

3
Targets of Information Extraction

Protein-Protein interaction/binding/inhibition
Protein-Small Molecules
Gene-Gene regulation
Gene-Gene Product interaction
Gene-Drug relation
Protein-Subcellular location
Amino Acid-Protein relation
Example relationships between gene and drugs
The gene is the drug target
The gene confers resistance to the drug
The gene metabolizes the drug

4
Information Extraction Tasks
Identify Target Named Entities
Identify Relations among Named Entities
Identify Relations among Events and Named Entities
Associate Results with existing database records
5
Outline

NER (named entity recognition) in biomedical
domain
Challenges in biomedical NER
State of progress in NER
Abbreviation disambiguation
Future works

6
What is NER?

NER
Named Entity Recognition
Including two tasks
Identification of proper names in text
Classification of proper names in text
Newswire Domain
Person, Location, Organization
Biomedical Domain
Protein, DNA, RNA, Body Part, Cell Type, Lipid,
etc.

7
Example of NER - Biomedical
[Figure: example text annotated with Protein, Tissue, and Disease entities]
8
NER in biomedical domain

BioNER aims to recognize the following names
First Priority
Protein name, DNA name, RNA name
Second Priority
cell type, other organic compound, cell line,
lipid, multi-cell, virus, cell component, body
part, tissue, amino acid monomer, polynucleotide,
mono-cell, inorganic, peptide, nucleotide, atom,
other artificial source, carbohydrate, organic

9
The Overall Spectrum

BioNER is only the starting point of biological
information extraction
A whole suite of NLP techniques is needed to
handle relations and events in literature mining
Techniques developed for BioNER should be
adaptable to problems in later stages,
e.g. NE relation recognition

10
Intrinsic Features of BioNER

Unknown words
Long compound words
Variations of expressions
Nested NEs

11
Unknown Words

Words containing hyphen, digit, letter, Greek
letter, Roman numeral.
Alpha B1
Adenylyl cyclase 76E
Latent membrane protein 1
4-mycarosyl isovaleryl-CoA transferase
oligodeoxyribonucleotide
18-deoxyaldosterone
Abbreviation and Acronym
IL, TECd, IFN, TPA

12
Long Compound words

interleukin 1 (IL-1)-responsive kinase
interleukin 1-responsive kinase
epidermal growth factor receptor
SH2 domain containing tyrosine kinase Syk
SH2 domain (GENIA example)

13
Various expressions of the same NE

Spelling variation
N-acetylcysteine, N-acetyl-cysteine,
NAcetylCysteine
Word permutation
beta-1 integrin, integrin beta-1
Ambiguous expressions
epidermal growth factor receptor, EGF receptor,
EGFR
c-jun, c-Jun, c jun

14
Various expressions: the name explains its
function

the Ras guanine nucleotide exchange factor Sos
the Ras guanine nucleotide releasing protein Sos
the Ras exchanger Sos
the GDP-GTP exchange factor Sos
Sos(mSos), a GDP/GTP exchange protein for Ras

15
Various expressions: the name includes a
preposition and/or conjunction (ambiguity of
dependencies)

p85 alpha subunit of PI 3-kinase
SH2 and SH3 domains of Src
NF-AT1, AP-1, and NF-kB sites
E2F1 and -3
Residues 432, 435, 437, 438, and 440

16
Nested Named Entity

An NE embedded in another NE.
IL-2: protein
IL-2 gene: gene
CBP/p300 associated factor: protein
CBP/p300 associated factor binding promoter: DNA

17
Outline

NER (named entity recognition) in biomedical
domain
Challenges in biomedical NER
State of progress in NER
Abbreviation disambiguation
Future works

18
Challenges of NER

Unknown word identification
Named entity boundary detection
Class disambiguation

19
Challenges

Unknown word identification
t(10;11)(p13;q14)
DNA methyltransferase
73 kDa protein
interleukin 1 (IL-1)-responsive kinase (NE may
contain an abbreviation within it.)
Some unknown words occur very few times in the
corpus, which makes them hard to recognize.

20
Challenges (contd)

NE boundary detection
Can be a regular English word, unknown word,
Roman numeral, digit.
MHC Class II
latent protein 1 (The left boundary is an
adjective)
cyclin-like UDG gene product
Conjunction (and, or, ...)
alpha- and beta-globin
human and mouse gene

21
Challenges (contd)

Classification of abbreviations
NF-AT
Full name: nuclear factor of activated T cells
Class: Protein
HTLV-I
Full name: Human T cell lymphotropic virus I
Class: Virus
TCDD
Full name: 2,3,7,8-tetrachlorodibenzo-p-dioxin
Class: Other Organic
GRE
Full name: glucocorticoid response element
Class: DNA

22
Outline

NER (named entity recognition) in biomedical
domain
Challenges in biomedical NER
State of progress in NER
Abbreviation disambiguation
Future works

23
State-of-the-art Systems on NER: Two evaluation
contests

BioCreative 2004 (March)
Critical Assessment of Information Extraction
Systems in Biology
Task 1: Entity extraction
Target: genes (or proteins, where there is
ambiguity)
10,000 sentences from Medline as training data,
and 5,000 sentences as testing data
BioNLP 2004 (August)
GENIA Corpus as training data and 404 abstracts
as testing data
Target: 5 classes (protein, DNA, RNA, cell line,
and cell type)
Both use exact match scoring.
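
Exact match scoring can be illustrated with a minimal sketch. It assumes each entity is represented as a (start, end, class) tuple, which is a choice made here for illustration and not the official evaluation script of either contest. A prediction counts only if both boundaries and the class agree with the gold annotation.

```python
def exact_match_prf(gold, predicted):
    """Precision, recall and F-score under exact match.

    gold, predicted: iterables of (start, end, class) tuples
    (a hypothetical representation chosen for this sketch).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # identical span AND class
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


gold = {(0, 2, "protein"), (5, 6, "DNA"), (9, 11, "cell_type")}
pred = {(0, 2, "protein"), (5, 7, "DNA"), (9, 11, "cell_type"), (13, 14, "RNA")}
p, r, f = exact_match_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In this toy example the DNA prediction is off by one token on the right boundary, so it earns no credit at all, which is why boundary errors weigh so heavily in the results.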

24
BioNLP 2004 Datasets
Set                    # of abstracts  # of sentences      # of tokens
Training Set           2,000           20,546 (10.27/abs)  472,006 (236.00/abs) (22.97/sen)
Test Set: Total        404             4,260 (10.54/abs)   96,780 (239.55/abs) (22.72/sen)
Test Set: 1978-1989    104             991 (9.53/abs)      22,320 (214.62/abs) (22.52/sen)
Test Set: 1990-1999    106             1,115 (10.52/abs)   25,080 (236.60/abs) (22.49/sen)
Test Set: 2000-2001    130             1,452 (11.17/abs)   33,380 (256.77/abs) (22.99/sen)
Test Set: S/1998-2001  204             2,254 (11.05/abs)   51,628 (253.08/abs) (22.91/sen)
25
R/P/F  1978-1989 set       1990-1999 set       2000-2001 set       S/1998-2001 set     Total
Zho04  75.3 / 69.5 / 72.3  77.1 / 69.2 / 72.9  75.6 / 71.3 / 73.8  75.8 / 69.5 / 72.5  76.0 / 69.4 / 72.6
Fin04  66.9 / 70.4 / 68.6  73.8 / 69.4 / 71.5  72.6 / 69.3 / 70.9  71.8 / 67.5 / 69.6  71.6 / 68.6 / 70.1
Set04  63.6 / 71.4 / 67.3  72.2 / 68.7 / 70.4  71.3 / 69.6 / 70.5  71.3 / 68.8 / 70.1  70.3 / 69.3 / 69.8
Son04  60.3 / 66.2 / 63.1  71.2 / 65.6 / 68.2  69.5 / 65.8 / 67.6  68.3 / 64.0 / 66.1  67.8 / 64.8 / 66.3
Zha04  63.2 / 60.4 / 61.8  72.5 / 62.6 / 67.2  69.1 / 60.2 / 64.7  69.2 / 60.3 / 64.4  69.1 / 61.0 / 64.8
Rös04  59.2 / 60.3 / 59.8  70.3 / 61.8 / 65.8  68.4 / 61.5 / 64.8  68.3 / 60.4 / 64.1  67.4 / 61.0 / 64.0
Par04  62.8 / 55.9 / 59.2  70.3 / 61.4 / 65.6  65.1 / 60.4 / 62.7  65.9 / 59.7 / 62.7  66.5 / 59.8 / 63.0
Lee04  42.5 / 42.0 / 42.2  52.5 / 49.1 / 50.8  53.8 / 50.9 / 52.3  52.3 / 48.1 / 50.1  50.8 / 47.6 / 49.1
BL     47.1 / 33.9 / 39.4  56.8 / 45.5 / 50.5  51.7 / 46.3 / 48.8  52.6 / 46.0 / 49.1  52.6 / 43.6 / 47.7
26
Current Methods

Machine Learning
HMM, SVM, ME (Maximum Entropy), CRF (Conditional
Random Field)
Hybrid methods
Dictionary Based
Approximate String matching algorithm
Naming Rules
Dynamic Programming

27
Features for Machine Learning Methods

Morphological Features
Orthographical Features
POS Features
Genia POS tagger
Semantic Trigger Features
Head-noun Features
NF-kappaB consensus site
IL-2 gene

28
Morphological Features
Prefix/Suffix  Example
cin            actinomycin
mide           Cycloheximide
zole           Sulphamethoxazole
lipid          phospholipids
rogen          estrogen
vitamin        dihydroxyvitamin
blast          erythroblast
cyte           thymocyte
phil           eosinophil
phosph         phosphorylation
methyl         methyltransferase
immuno         immunomodulator
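
A minimal sketch of how such affixes could be turned into features for a machine learning tagger. The split into prefixes and suffixes, the feature naming, and the handling of a trailing plural "s" are assumptions made for this illustration, not the exact feature extractor used in any of the systems discussed.

```python
# Affixes taken from the table above; which ones act as prefixes and
# which as suffixes is inferred from the example words (assumption).
PREFIXES = ("phosph", "methyl", "immuno")
SUFFIXES = ("cin", "mide", "zole", "lipid", "rogen",
            "vitamin", "blast", "cyte", "phil")


def morphological_features(token):
    """Return a list of 'prefix=...' / 'suffix=...' feature strings."""
    lower = token.lower()
    # Assumption: ignore a plural "s" so "phospholipids" still matches "lipid".
    singular = lower[:-1] if lower.endswith("s") else lower
    features = [f"prefix={p}" for p in PREFIXES if lower.startswith(p)]
    features += [f"suffix={s}" for s in SUFFIXES if singular.endswith(s)]
    return features


for word in ["actinomycin", "methyltransferase", "phospholipids", "thymocyte"]:
    print(word, morphological_features(word))
```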
29
Orthographical Features
Orthographical Feature  Example
AllCaps                 EBNA, NFAT
AlphaDigit              p50, p65
AlphaDigitAlpha         IL23R, E1A
ATGCSequence            CCGCCC
CapLowAlpha             Src, Ras, Epo
CapMixAlpha             NFkappaB
CapsAndDigits           IL2, STAT4, SH2
DigitAlpha              2xNFkappaB
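
The orthographical classes above can be approximated with regular expressions. The patterns and their priority order below are assumptions chosen so that every example in the table lands in its listed class; the slides do not give the actual definitions.

```python
import re

# (feature name, pattern) pairs, tried in order; the first full match wins.
# Both the patterns and the ordering are assumptions for this sketch.
ORTHO_PATTERNS = [
    ("ATGCSequence", r"[ATGC]{4,}"),
    ("AllCaps", r"[A-Z]+"),
    ("CapsAndDigits", r"[A-Z]+[0-9]+"),
    ("AlphaDigitAlpha", r"[A-Za-z]+[0-9]+[A-Za-z]+"),
    ("AlphaDigit", r"[A-Za-z]+[0-9]+"),
    ("DigitAlpha", r"[0-9]+[A-Za-z]+"),
    ("CapLowAlpha", r"[A-Z][a-z]+"),
    ("CapMixAlpha", r"[A-Z][A-Za-z]+"),
]


def orthographic_feature(token):
    """Return the name of the first pattern that matches the whole token."""
    for name, pattern in ORTHO_PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    return "Other"


for tok in ["EBNA", "p50", "IL23R", "CCGCCC", "Src", "NFkappaB", "STAT4", "2xNFkappaB"]:
    print(tok, orthographic_feature(tok))
```

Order matters here: CCGCCC would also satisfy AllCaps, so the more specific ATGCSequence pattern is tried first.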
30
Head Nouns
Unigram: factor, protein, receptor, alpha, NF-kappaB, IL-2, cytokine, kinase, transcription, domain, complex, TNF-alpha, Nuclear, p50, CD28, TNF, molecule, subunit, cell, STAT3, family, tumor, factor-alpha, expression, interleukin
Bigram: NF-kappa B, transcription factor, I kappa, nuclear factor, protein kinase, B alpha, kinase C, tumor necrosis, T cell, glucocorticoid receptor, binding protein, factor alpha, adhesion molecule, monoclonal antibody, gene product, binding domain
31
Additional features used by Manning's group:
local features

Clues within a sentence
Include
Previous NEs
Abbreviations: an abbreviation, a long form, or neither
Parenthesis-matching
etc.

32
External resources used by Manning's group

Motivation
Contextual clues do not provide sufficient
evidence for confident classification.
May be vulnerable to incompleteness, noise, and
ambiguity.
Web
Least vulnerable to incompleteness, highly
vulnerable to noise.
Prepare patterns for each class
For genes: X gene, X antagonist, X mutation
For RNA: X mRNA, ...
For proteins: X ligation, ...
Features: web-protein, web-RNA, O-web, ...
Did not work well in the BioNLP task.

33
External resources (2)

Gazetteers (dictionaries)
Are arguably subject to all three, and yet have
been used successfully in some systems.
Compiled a list of gene names from databases
(e.g. LocusLink) and GO, and the data from
BioCreative Tasks 1A and 1B.
Filtering
Single-character entries, e.g., A, 1
Entries containing only digits, or symbols and
digits, e.g., 37, 3-1
Entries containing only words that can be found in an
English dictionary (CELEX), e.g., abnormal,
brain tumor
1,731,581 entries
Larger context
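
The three gazetteer filtering rules can be sketched as a single predicate. The tiny `ENGLISH_WORDS` set is a hypothetical stand-in for a real English lexicon such as CELEX, and the regular expression for "only digits or symbols and digits" is an assumption about how that rule might be encoded.

```python
import re

# Hypothetical stand-in for an English dictionary (CELEX is not bundled here).
ENGLISH_WORDS = {"abnormal", "brain", "tumor"}


def keep_entry(entry, english_words=ENGLISH_WORDS):
    """Return True if a gazetteer entry survives all three filters."""
    entry = entry.strip()
    if len(entry) <= 1:                      # single-character entries
        return False
    if re.fullmatch(r"[\W\d_]+", entry):     # only digits and/or symbols
        return False
    words = entry.lower().split()
    if all(w in english_words for w in words):  # only ordinary English words
        return False
    return True


raw = ["A", "1", "37", "3-1", "abnormal", "brain tumor", "IL-2", "insulin receptor"]
print([e for e in raw if keep_entry(e)])
```

With the toy dictionary above, only "IL-2" and "insulin receptor" are kept; a fuller dictionary would remove more ordinary-word entries.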

34
State-of-the-art approaches

Machine learning + post-processing
Our method (BioKDD2004)
Maximum entropy
Post-processing
Boundary extension
Re-classification

35
Zhou et al. approach

HMM + SVM
Post-processing
Rule-based post-processing is used to resolve nested named entities.
Top 1 in the NLPBA Task, F = 72.5

36
Manning et al. method

Machine learning
Maximum entropy (ME) Markov model
Local features
External resources and larger context
Post-processing
To correct gene boundaries (mainly for the
BioCreative Task)
Top 1 in BioCreative, F = 83.2
Top 2 in NLPBA Task, F = 70.1

37
Our Method Overview
[Figure: flowchart of the method.
Training phase: the training data and the knowledge input are used to construct the boundary word lists and the dictionary; features are mapped and passed to ME learning.
Testing phase: the testing data are tagged by the ME model, then post-processed (boundary extension, re-classification) to produce the final NEs.]
38
Experimental Results
ME-based NER       P/R/F
NE identification  0.56/0.589/0.574
NE recognition     0.512/0.538/0.525
39
Post-Processing

Nested Named Entity
Ex: CIITA mRNA
Nested annotation: <RNA><DNA>CIITA</DNA> mRNA</RNA>
ME sometimes recognizes only CIITA, as DNA
16.57% of NEs in GENIA 3.02 contain one or more
shorter NEs (Zhang, 2003)
Post-processing method
Boundary Extension
Re-classification
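
To make the notion of a nested annotation concrete, here is a small sketch that converts inline class tags into character spans. The simple `<CLASS>...</CLASS>` tag syntax is assumed for illustration; the real GENIA corpus uses a richer XML format.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")


def parse_nested(annotated):
    """Return (plain_text, spans) where spans are (start, end, class)."""
    pieces, spans, stack = [], [], []
    pos = length = 0
    for match in TAG.finditer(annotated):
        chunk = annotated[pos:match.start()]
        pieces.append(chunk)
        length += len(chunk)
        is_closing, name = match.group(1), match.group(2)
        if is_closing:
            if not stack or stack[-1][0] != name:
                raise ValueError(f"unbalanced tag </{name}>")
            _, start = stack.pop()
            spans.append((start, length, name))  # inner spans close first
        else:
            stack.append((name, length))
        pos = match.end()
    pieces.append(annotated[pos:])
    if stack:
        raise ValueError("unclosed tag")
    return "".join(pieces), spans


text, spans = parse_nested("<RNA><DNA>CIITA</DNA> mRNA</RNA>")
print(text, spans)
```

The inner DNA span and the outer RNA span share the same start offset, which is exactly the situation a flat sequence tagger cannot represent directly.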

40
Boundary Extension (1)

Boundary extension for nested NEs
Extend the R-boundary repeatedly if the NE is
followed by another NE, a head noun, or an
R-boundary word with a valid POS tag.
Extend the left boundary repeatedly if the NE is
preceded by an L-boundary word with a valid POS
tag.

41
Example

ICAM-1 surface protein
ME result: ICAM-1/1U surface/unknown protein/unknown
(1 = protein, U = single)
Boundary extension
"surface" is in the R-boundary word list, with a valid POS tag
Extension: ICAM-1 surface
"protein" is in the R-boundary word list, with a valid POS tag
Extension: ICAM-1 surface protein
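
The extension rule for nested NEs can be sketched as follows. The word lists, the set of "valid" POS tags, and the span representation (token indices, end exclusive) are all hypothetical; the slides do not list the actual boundary words or tags. Applying the POS check only to boundary words, not to head nouns, is also an assumption.

```python
# Hypothetical resources for this sketch (lower-cased).
R_BOUNDARY_WORDS = {"surface", "protein", "gene", "mrna"}
L_BOUNDARY_WORDS = {"human", "latent"}
HEAD_NOUNS = {"protein", "gene", "receptor", "kinase"}
VALID_POS = {"NN", "NNS", "NNP", "JJ"}


def extend_boundaries(tokens, pos_tags, start, end, other_nes=None):
    """Extend the NE span [start, end) to the right, then to the left.

    other_nes maps the start index of another recognized NE to its end index.
    """
    other_nes = other_nes or {}
    while end < len(tokens):
        word = tokens[end].lower()
        if other_nes.get(end, end) > end:          # followed by another NE
            end = other_nes[end]
        elif word in HEAD_NOUNS:                   # followed by a head noun
            end += 1
        elif word in R_BOUNDARY_WORDS and pos_tags[end] in VALID_POS:
            end += 1
        else:
            break
    while (start > 0 and tokens[start - 1].lower() in L_BOUNDARY_WORDS
           and pos_tags[start - 1] in VALID_POS):
        start -= 1
    return start, end


tokens = ["the", "ICAM-1", "surface", "protein", "was", "expressed"]
pos = ["DT", "NN", "NN", "NN", "VBD", "VBN"]
start, end = extend_boundaries(tokens, pos, 1, 2)
print(" ".join(tokens[start:end]))
```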

42
Boundary extension (2)

Boundary extension for NEs containing brackets or
slashes
NE -> NE ( NE ) + [NE, head noun, or
R-boundary word with valid POS tag]
NE -> NE / NE ( / NE ) + [NE, head
noun, or R-boundary word with valid POS tag]
Example
granulocyte-macrophage colony-stimulating factor
( GM-CSF ) gene
ME result: granulocyte-macrophage
colony-stimulating factor, GM-CSF
Extension: granulocyte-macrophage
colony-stimulating factor ( GM-CSF ) gene
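
A minimal sketch of the bracket case, under simplifying assumptions: spans are token index pairs (end exclusive), only a trailing head noun or R-boundary word is handled (not a trailing NE or the slash pattern), and the `TRAILING_WORDS` list is hypothetical.

```python
TRAILING_WORDS = {"gene", "protein", "mrna"}  # hypothetical list


def merge_bracketed(tokens, spans, trailing_words=TRAILING_WORDS):
    """Merge 'NE ( NE ) word' into one span when word is a trailing word.

    spans: list of (start, end) token spans sorted by start, end exclusive.
    """
    merged, i = [], 0
    while i < len(spans):
        start, end = spans[i]
        if i + 1 < len(spans):
            nxt_start, nxt_end = spans[i + 1]
            after = nxt_end + 1  # index of the token following ")"
            if (end < len(tokens) and tokens[end] == "("
                    and nxt_start == end + 1
                    and nxt_end < len(tokens) and tokens[nxt_end] == ")"
                    and after < len(tokens)
                    and tokens[after].lower() in trailing_words):
                end = after + 1
                i += 1  # the bracketed NE is absorbed by the merge
        merged.append((start, end))
        i += 1
    return merged


tokens = ["granulocyte-macrophage", "colony-stimulating", "factor",
          "(", "GM-CSF", ")", "gene", "expression"]
result = merge_bracketed(tokens, [(0, 3), (4, 5)])
print([" ".join(tokens[s:e]) for s, e in result])
```

When no trailing word follows the closing bracket, the long form and the abbreviation are left as two separate NEs.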

43
Re-classification

Use dictionary lookup
Use R-boundary word
CIITA mRNA -> RNA class
granulocyte-macrophage colony-stimulating factor
( GM-CSF ) gene -> DNA class
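
Re-classification can be sketched as two lookups. The dictionary contents, the mapping from R-boundary words to classes, and the order in which the two lookups are tried are assumptions for this illustration.

```python
# Hypothetical resources (lower-cased keys).
NE_DICTIONARY = {"il-2": "protein"}
R_BOUNDARY_CLASS = {"mrna": "RNA", "gene": "DNA", "promoter": "DNA"}


def reclassify(tokens, predicted_class):
    """Return a corrected class for an NE given its tokens."""
    name = " ".join(tokens).lower()
    if name in NE_DICTIONARY:               # RC-1: dictionary lookup
        return NE_DICTIONARY[name]
    last_word = tokens[-1].lower()          # RC-2: rightmost boundary word
    return R_BOUNDARY_CLASS.get(last_word, predicted_class)


print(reclassify(["CIITA", "mRNA"], "DNA"))
long_name = "granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene"
print(reclassify(long_name.split(), "protein"))
```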

44
Experimental Results: NE Identification
Config    BE-1  BE-2  BE-3  NE Identification P/R/F
Baseline  -     -     -     0.56/0.589/0.574
Conf1     yes   -     -     0.582/0.597/0.594
Conf2     -     yes   -     0.591/0.6/0.595
Conf3     -     -     yes   0.757/0.746/0.751
Conf4     yes   yes   yes   0.776/0.763/0.769
BE-1: boundary extension for nested NEs
BE-2: boundary extension for brackets and slashes
BE-3: with human name filter
45
Experimental Results: NE Recognition
Config    BE-1  BE-2  BE-3  RC-1  RC-2  NE Recognition P/R/F
Baseline  -     -     -     -     -     0.512/0.538/0.525
Conf4     yes   yes   yes   -     -     0.645/0.634/0.639
Conf5     yes   yes   yes   yes   -     0.67/0.658/0.664
Conf6     yes   yes   yes   -     yes   0.707/0.695/0.701
Conf7     yes   yes   yes   yes   yes   0.727/0.715/0.721
RC-1: re-classification using dictionary lookup
RC-2: re-classification using R-boundary words
46
Experimental Results
GENIA v3.02 (10-fold CV). Recently, Zhou
improved the F-measure of his HMM model to 0.712
by combining it with an SVM.
System                              Overall  Protein  DNA    RNA
Our System                          0.721    0.785    0.700  0.752
Zhou et al. (Bioinformatics, 2004)  0.666    0.758    0.633  0.612
47
Error Analysis

GENIA inconsistent annotation
IL-2 gene expression
<DNA>IL-2 gene</DNA> expression
<other_name><DNA>IL-2 gene</DNA>
expression</other_name>
Conjunction
Human and mouse gene
Boundary detection error (boundary not in
boundary word file)
Squirrel, manic, bursal

48
Error Analysis

Abbreviation classification
Orthographical form fits into at least two
classes.
Protein: SOS1, FLICE, GAG
Other Organic: CD336
False negative
A number of errors are due to low-frequency words or
words not encountered in the training data.
False positive
Ellipsis
Many inflammatory cytokine genes including TNF,
IL-1, and IL-6

49
Outline

NER (named entity recognition) in biomedical
domain
Challenges in biomedical NER
Current methods and our method
State of progress in NER
Future works

50
Manning's conclusion (I): Key factor for low
performance

Task difficulty does not appear to be the primary
factor leading to low performance.
BioCreative: 1 class; BioNLP: 5 classes
Key factor: quality of the training and
evaluation data
Higher inconsistency in the annotation of the
BioNLP data.
Two of the authors independently reviewed 50
system errors; 34-35 were attributed to
annotation.
The authors do not think the annotation
inconsistencies are due to biological subtleties.

51
Manning's Conclusion (II)

To improve biomedical annotation
BioNLP organizers emphasized that participants
should focus on deep knowledge sources
(coreference resolution and use of dependency
relations) over widely used lexical-level features
(POS, morphological, orthographical, etc.)
Proper exploitation of external resources
In both tasks, external resources led to an
improvement of only 1-2%.
Consistent annotation might have led to a 70%
reduction in error rate.

52
Outline

NER (named entity recognition) in biomedical
domain
Challenges in biomedical NER
State of progress in NER
Abbreviation disambiguation
Future works

53
Disambiguation of abbreviation
54
Motivation (I)

Named entity (NE) recognition (NER) is the first step
of information extraction.
NER contains two steps
NE identification: extract named entities from text
NE classification: classify a given NE into a
specific class.

55
Motivation (II)

Since many protein or gene names are long
compound names, authors usually refer to them by
abbreviations.
A2M: Alpha-2-macroglobulin
A4GALT: alpha 1,4-galactosyltransferase
EGFR: epidermal growth factor receptor, EGF
receptor
NF-AT: nuclear factor of activated T cells
HTLV-I: Human T cell lymphotropic virus I
TCDD: 2,3,7,8-tetrachlorodibenzo-p-dioxin
GRE: glucocorticoid response element

56
Motivation (III)

Abbreviation identification task
It is easier than the classification task.
Abbreviations often have some orthographical
clues.
All capital letters, alphabet and digit
hybrids, etc.
Abbreviation classification task
In some situations, it is hard to disambiguate
an abbreviation's class.
Example: the text mentions only the abbreviation, without the full
name

57
Challenges of abbreviation

Two cases
Case 1: the sentence contains both the abbreviation and the full
name
Human immunodeficiency virus type 2 (HIV-2), like
HIV-1, causes AIDS and is associated with AIDS
cases primarily in West Africa.
Case 2: the sentence contains only the abbreviation
HIV-1 and HIV-2 display significant differences
in nucleic acid sequence and in the natural
history of clinical disease.

58
Case 1

Case 1 is easier than Case 2
The classification can be solved by the following
steps
Abbreviation-full name association
Disambiguate the full name's class
Assign the full name's class to the abbreviation
The challenge has shifted from abbreviation
classification to abbreviation-full name
association

59
Example of Case 1

Sentence
Human immunodeficiency virus type 2 (HIV-2), like
HIV-1, causes AIDS and is associated with AIDS
cases primarily in West Africa.
Step 1: Abbreviation-full name association
(Full name, Abbreviation) = (Human
immunodeficiency virus type 2, HIV-2)
Step 2: Full name class assignment
Name: Human immunodeficiency virus type 2
Class: Virus
Step 3: Abbreviation class assignment
Abbreviation: HIV-2
Class: Virus
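
The three steps can be strung together in a toy pipeline. Everything here is deliberately naive and hypothetical: the association step simply takes the text before the first parenthesis as the full name, and the full-name classifier is a keyword lookup standing in for a real NE classifier.

```python
import re

# Hypothetical keyword-to-class table standing in for a real classifier.
CLASS_KEYWORDS = {"virus": "Virus", "receptor": "Protein", "element": "DNA"}


def associate(sentence):
    """Step 1: return (full_name, abbreviation) or None."""
    match = re.search(r"^(.*?)\s*\(([^()]+)\)", sentence)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def classify_full_name(full_name):
    """Step 2: assign a class to the full name by keyword lookup."""
    for word in full_name.lower().split():
        if word in CLASS_KEYWORDS:
            return CLASS_KEYWORDS[word]
    return None


def classify_abbreviation(sentence):
    """Step 3: pass the full name's class on to the abbreviation."""
    pair = associate(sentence)
    if pair is None:
        return None
    full_name, abbreviation = pair
    return abbreviation, classify_full_name(full_name)


sentence = ("Human immunodeficiency virus type 2 (HIV-2), like HIV-1, causes "
            "AIDS and is associated with AIDS cases primarily in West Africa.")
print(classify_abbreviation(sentence))
```

A Case 2 sentence, which has no parenthesized pair, makes `associate` return None, so this pipeline has nothing to propagate.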

60
A solution method to Case 1

Schwartz and Hearst, PSB 2003.
Identify <long form, short form> pairs.
Both long form and short form occur in the same
sentence.
The pattern "long form ( short form )" occurs more frequently than
"short form ( long form )"

61
Algorithm: Identify long form ( short form )

Identify long form and short form candidates
(using adjacency to parentheses).
Identify correct long form.
Starting from the end of both candidates, move
right to left, trying to find the shortest long
form that matches the short form.
Every character in the short form must match a
character in the long form.
The matched characters in the long form must be
in the same order as the characters in the short
forms.
<HSF, Heat shock transcription factor>
<TTF-1, Thyroid transcription factor 1> fails
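
The right-to-left matching step can be sketched in a few lines. This is an illustrative re-implementation based on the description above, not the authors' released code; it covers only the matching step and assumes the candidate long form (the text before the parenthesis) has already been extracted. It additionally requires the first character of the short form to match at the start of a word, and skips non-alphanumeric characters of the short form.

```python
def find_best_long_form(short_form, candidate):
    """Return the shortest suffix of candidate matching short_form, or None."""
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue  # hyphens etc. in the short form need no match
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the beginning of a word.
        while l_index >= 0 and (
            candidate[l_index].lower() != ch
            or (s_index == 0 and l_index > 0
                and candidate[l_index - 1].isalnum())
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Start the long form at the beginning of the word that was matched last.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


print(find_best_long_form("HSF", "Heat shock transcription factor"))
print(find_best_long_form("TTF-1", "Thyroid transcription factor 1"))
print(find_best_long_form("ATN", "anterior thalamus"))
```

For TTF-1 the sketch returns the truncated "transcription factor 1": both T's are matched inside "transcription", so "Thyroid" is never reached. For ATN no ordered match exists, so the result is None.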

62
Error analysis

Unused characters, e.g., <CNS1, cyclophilin seven
suppressor>
No pattern between long form and
short form, e.g., <ATN, anterior thalamus>
Partial matching
The long form includes additional words to the
left of the matching, e.g., <Pol I, RNA
polymerase I>
Out-of-order mapping
First character matches to the internal character
(of the long form).
Non-continuous long form.
Transformation in the mapping (2D ->
two-dimensional)
Short form of only one character.

63
Other types of abbreviations

Schwartz and Hearst's algorithm only considers
candidates in parentheses.
Challenge: finding all possible pairs is a more
difficult problem.

64
Example of Case 2

It's hard to disambiguate an abbreviation's class,
even with context information.
Example
HIV-1 and HIV-2 display significant differences
in nucleic acid sequence and in the natural
history of clinical disease.
HIV-1 and HIV-2 are both viruses, but if we replace
HIV-1 and HIV-2 with IL-2 and IL-10, the sentence
still makes sense.
IL-2 and IL-10 display significant differences in
nucleic acid sequence and in the natural history
of clinical disease.
IL-2 and IL-10: gene names

65
Case 2

Leave for future work.
Clue
Statistical methods
Dictionary-based methods

66
Outline

NER (named entity recognition) in biomedical
domain
Challenges in biomedical NER
State of progress in NER
Abbreviation disambiguation
Future works

67
What's next after NER is solved?

Named entity relation recognition (NERR)
Protein-Protein interaction/binding/inhibition
Protein-Small Molecules
Gene-Gene regulation
Gene-Gene Product interaction
Gene-Drug relation
Protein-Subcellular location
Amino Acid-Protein relation
Gene-drug relation

68
Identify Relations among Named Entities

Target: extract relations between various
biological named entities.

Here we demonstrate that the c-myb proto-oncogene
product, which is itself a DNA-binding protein,
and transcriptional transactivator, can interact
synergistically with Z.
Relation: (Subject, Action, Object) = (c-myb
proto-oncogene product, interact, Z)
69
Future works

Few papers have been published on the following
specific challenging topics of NER.
Automated corpus correction
Disambiguation of abbreviations (Schwartz and
Hearst, 2003)
Conjunction
NERR (difficult)
parser
Pronoun and anaphora resolution

70
Acknowledgements