Title: Mining External Resources for Biomedical IE
1 Mining External Resources for Biomedical IE
Why, How, What
Malvina Nissim mnissim_at_inf.ed.ac.uk
2 Why
- goal: Named Entity Recognition
- method: supervised learning
- (text) internal features: word shape, n-grams, ...
protein-indicative features:
- of shape a0a0a0a
- followed by /bind/
- shorter than 5 characters
- generalisations on training data might be
incomplete
- acquired evidence might be absent in test
instance
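A minimal sketch of the three protein-indicative internal features listed above. The character classes used for the shape (`A` upper case, `a` lower case, `0` digit), the reading of /bind/ as "next token starts with bind", and the function names are assumptions made for illustration, not the feature set of the original system.

```python
import re

# Assumption: /bind/ is read as "the next token starts with 'bind'"
# (bind, binds, binding), case-insensitively.
BIND = re.compile(r"^bind", re.IGNORECASE)

def word_shape(token):
    """Map each character to a class: 'A' upper, 'a' lower, '0' digit."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)  # punctuation is kept as-is
    return "".join(out)

def internal_features(tokens, i):
    """Text-internal features for the token at position i."""
    token = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "shape": word_shape(token),
        "followed_by_bind": bool(BIND.match(nxt)),
        "shorter_than_5": len(token) < 5,
    }

feats = internal_features(["p53", "binds", "DNA"], 0)
```

A token that never co-occurs with these cues in training gets no support from them at test time, which is the gap external evidence is meant to fill.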
3 Getting Additional Evidence
internal features might be insufficient, but
good evidence might be somewhere else...
- small and accurate lists of proteins (gazetteers)
  - use as rules
  - use as features
- other texts might contain indicative n-grams
  - how to use other texts?
  - which texts to use?
Note: some systems (MaxEnt, for instance) can
easily and successfully integrate a
huge number of features
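The two ways of using a gazetteer can be sketched as follows. The gazetteer entries, the label names, and the function names are hypothetical; the point is only the contrast between a hit that decides the label outright (rule) and a hit that is passed to the learner as one more piece of evidence (feature).

```python
# Hypothetical toy gazetteer; a real one would be a curated protein list.
GAZETTEER = {"p53", "brca1", "cdk2"}

def in_gazetteer(token):
    """Case-insensitive lookup in the toy gazetteer."""
    return token.lower() in GAZETTEER

def tag_by_rule(tokens):
    """Rule use: a gazetteer hit decides the label directly."""
    return ["PROTEIN" if in_gazetteer(t) else "O" for t in tokens]

def gazetteer_feature(tokens, i):
    """Feature use: the hit is evidence for a classifier to weigh."""
    return {"in_gazetteer": in_gazetteer(tokens[i])}

labels = tag_by_rule(["BRCA1", "binds", "p53"])
```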
4 How
patterns:
X gene/protein/DNA
X sequence/motif
A. Create patterns (aim, method, input)
B. Search corpus for patterns and obtain counts
C. Use counts as appropriate
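Steps A and B can be sketched over a small in-memory corpus. The two example sentences and the function names are made up for illustration; a real run would query a search engine or an indexed collection rather than a Python list. Matching here is exact, case-insensitive, and word-bounded, which assumes the candidate X starts and ends with a letter or digit.

```python
import re

HEADS = ["gene", "protein", "DNA", "sequence", "motif"]

def make_patterns(x):
    """Step A: instantiate 'X <head>' for a candidate X."""
    return [f"{x} {head}" for head in HEADS]

def count_phrase(phrase, documents):
    """Step B: count word-bounded, case-insensitive phrase matches."""
    rx = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return sum(len(rx.findall(doc)) for doc in documents)

def pattern_counts(x, documents):
    """Counts for every instantiated pattern, ready for step C."""
    return {p: count_phrase(p, documents) for p in make_patterns(x)}

# Hypothetical two-sentence corpus
docs = [
    "The p53 gene is mutated in many tumours.",
    "Expression of the p53 protein and the p53 gene was measured.",
]
counts = pattern_counts("p53", docs)
```

Step C is left open on purpose: the resulting dictionary can feed a rule, a feature, or a ranking.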
5 Create Patterns (I)
1. AIM  2. METHOD  3. INPUT
1. AIM (granularity)
- distinguish entities from non-entities
  X gene OR DNA OR protein
  pro: bypasses ambiguities and data sparseness
  con: less information
- distinguish between entities
  X DNA
  X gene
  X binds
  pro: more information
  con: ambiguities, data sparseness
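The two granularities translate into different query shapes: one pooled OR query per candidate, or one query per pattern. This is a sketch only; the phrase quoting, the `OR` syntax, and the function names are assumptions and would need adapting to the search engine actually used.

```python
COARSE_HEADS = ["gene", "DNA", "protein"]
FINE_HEADS = ["DNA", "gene", "binds"]

def coarse_query(x):
    """Entity vs non-entity: a single pooled query, so a single count."""
    return " OR ".join(f'"{x} {head}"' for head in COARSE_HEADS)

def fine_queries(x):
    """Between entities: one query, and so one count, per pattern."""
    return {head: f'"{x} {head}"' for head in FINE_HEADS}

pooled = coarse_query("p53")
separate = fine_queries("p53")
```

The pooled query gives one count that is less likely to be zero; the separate queries keep the distinction between patterns but each count is sparser.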
6 Create Patterns (II)
1. AIM  2. METHOD  3. INPUT
2. METHOD
- by hand (experts)
  pro: high precision, exact target
  con: time consuming, experts needed
- automatically (collocations, clustering)
  pro: no human intervention
  con: lower precision, not necessarily interesting patterns
7 Create Patterns (III)
1. AIM  2. METHOD  3. INPUT
3. INPUT (X gene)
- low frequency words (as estimated from a non-specific corpus)
- words not found in standard dictionary
- NP chunks
- first output of classifier
increase precision but lower recall
               prec   rec    f-score
all features   .813   .861   .836
web            .807   .864   .835
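As a consistency check on the scores above, the reported f-scores match the balanced F-score (harmonic mean of precision and recall). That the slide uses the balanced variant is an assumption, though the arithmetic agrees with it.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

first = f_score(0.813, 0.861)   # first row of scores
second = f_score(0.807, 0.864)  # second row of scores
print(f"{first:.3f} {second:.3f}")  # prints "0.836 0.835"
```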
8 What? Google vs PubMed
- PubMed: searchable collection of over 12M
biomedical abstracts, more
sophisticated search options
- Everything: Google searches over 8 billion
pages, raw search, API
p53 gene
PubMed: 5,843 documents
Google: 165,000 pages
9 Google PubMed
anything you want site:<specific_site>
p53 gene site:www.ncbi.nlm.nih.gov
Rob Futrelle has this function available on this
webpage:
http://www.ccs.neu.edu/home/futrelle/bionlp/search.html
- comment: sometimes PubMed reports
"Quoted phrase not found" even when
Google finds the phrase.
PubMed provides phrase search only on pre-indexed
phrases
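A site-restricted query like the one above can be built programmatically. This sketch only assembles and URL-encodes the query string; it does not call any search API. Wrapping the phrase in quotes is an assumption (the slide's example is unquoted), and the function name is hypothetical.

```python
from urllib.parse import quote_plus

def site_query(phrase, site):
    """Quoted-phrase query restricted to one site via the site: operator."""
    return f'"{phrase}" site:{site}'

query = site_query("p53 gene", "www.ncbi.nlm.nih.gov")
encoded = quote_plus(query)  # safe to place in a URL query parameter
```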
12 PubMed > Google
PubMed uses the MeSH headings to match
synonyms (it will expand Pol II to search for
DNA Polymerase II)
Google will only try to correct misspellings
PubMed allows field-specific searches (e.g. year)
Google cannot refine its search in this respect
PubMed is updated daily
Google is slow in updating
13 PubMed > Google (contd)
Google does a vote-based ranking: not
necessarily good
PubMed does not do any ranking (possibly bad
too...)
- truncation and flexibility
PubMed accepts truncated entries and will look
for all possible variations. It will try to break
phrases if no matches are found.
Google has a rigid search
PubMed's MeSH headings contain keywords not necessarily
contained in the abstract
Google cannot find something that is not
mentioned in the abstract
14 What to Use? (or How to Use the Evidence)
- sure identification of entities: too powerful
  -> high risk of false positives
  might be better to use PubMed: less info but
  precise, fewer false positives
- some systems (MaxEnt) can integrate a huge
  number of features: the added evidence might
  still not get used, or might not provide enough evidence
  might be OK to use Google: more info but not
  necessarily precise
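The same hit count can be used in either of the two ways discussed on this slide: as a hard rule, or as a coarse feature handed to the learner. The threshold value, the order-of-magnitude bucketing, and the function names below are assumptions chosen for illustration; only the two counts (5,843 and 165,000 for p53 gene) come from the slides.

```python
def count_bin(count):
    """Bucket a raw hit count by order of magnitude; zero gets its own bucket."""
    if count <= 0:
        return "0"
    return f"1e{len(str(int(count))) - 1}"

def as_rule(count, threshold=100):
    """Rule use: a count at or above a (hypothetical) threshold decides the label."""
    return count >= threshold

def as_feature(pattern, count):
    """Feature use: the bucketed count is one input among many."""
    return {f"hits[{pattern}]": count_bin(count)}

pubmed_feature = as_feature("p53 gene", 5843)
google_feature = as_feature("p53 gene", 165000)
```

Bucketing keeps the feature space small and makes counts from sources of very different size easier to compare than raw numbers.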
15 iHOP (Information Hyperlinked Over Proteins): A
gene network for navigating the literature
Nature Genetics, Vol. 36(7), July 2004
http://www.pdg.cnb.uam.es/UniPub/iHOP
- uses genes and proteins as hyperlinks between
sentences and abstracts
- each step through the network produces
information about one single gene and its interactions
- information retrieved by connecting similar
concepts
- precision of gene name and synonym
identification: 87-99%
- readers can still check correctness of sentences
when they are presented to them
- shortest path between any 2 genes is on average
4 steps only
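The "steps between two genes" figure is a shortest-path length in a graph whose nodes are genes. A minimal breadth-first search over a toy network shows how such a length is computed; the graph below is invented and is not iHOP data, and nothing here reflects how iHOP itself is implemented.

```python
from collections import deque

def shortest_path_length(graph, start, goal):
    """Breadth-first search: number of steps from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # goal not reachable from start

# Hypothetical toy network with placeholder gene names
toy = {
    "geneA": ["geneB"],
    "geneB": ["geneA", "geneC"],
    "geneC": ["geneB", "geneD"],
    "geneD": ["geneC"],
    "geneE": [],
}
steps = shortest_path_length(toy, "geneA", "geneD")
```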