Mining External Resources for Biomedical IE

1
Mining External Resources for Biomedical IE
Why, How, What
Malvina Nissim mnissim_at_inf.ed.ac.uk
2
Why
  • goal: Named Entity Recognition
  • method: supervised learning
  • feature extraction
  • (text-)internal features: word shape, n-grams,
    ...

protein-indicative features:
  - of shape a0a0a0a
  - followed by /bind/
  - shorter than 5 characters
  • generalisations on training data might be
    incomplete
  • acquired evidence might be absent in the test
    instance
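
The protein-indicative features listed on this slide could be extracted roughly as follows. This is a minimal sketch, not the presenter's implementation: the function names are hypothetical, the shape is assumed to collapse letter runs to "a" and digit runs to "0", and /bind/ is assumed to mean "the next token starts with bind".

```python
import re


def word_shape(token: str) -> str:
    """Collapse letter runs to 'a' and digit runs to '0' (assumed shape scheme)."""
    shape = re.sub(r"[A-Za-z]+", "a", token)
    return re.sub(r"[0-9]+", "0", shape)


def internal_features(tokens: list[str], i: int) -> dict[str, object]:
    """Features for tokens[i] that use only the text itself."""
    token = tokens[i]
    next_token = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    return {
        "shape": word_shape(token),
        # Assumption: /bind/ matches any following token starting with "bind".
        "followed_by_bind": next_token.startswith("bind"),
        "shorter_than_5": len(token) < 5,
    }


tokens = "p53 binds to DNA".split()
features = internal_features(tokens, 0)
print(features)
```

A token never seen in training with these properties, or a protein name that lacks them, is where such internal features stop helping.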

3
Getting Additional Evidence
internal features might be insufficient, but
good evidence might be somewhere else...
  • small and accurate lists of proteins
    (gazetteers)
  • use as rules
  • use as features
  • other texts might contain indicative n-grams
  • how to use other texts
  • which texts to use

Note: some systems (MaxEnt, for instance) can
easily and successfully integrate a
huge number of features
4
How
patterns:
X gene/protein/DNA
X sequence/motif
A. Create patterns (aim, method, input)
B. Search corpus for patterns and obtain counts
C. Use counts as appropriate
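
Steps A and B can be sketched on a toy in-memory corpus. Everything here is an assumption for illustration: the helper names, the three example sentences, and the choice of case-insensitive whole-phrase matching. A real setup would query a search engine or a large text collection instead.

```python
import re

# Head words taken from the patterns on the slide.
HEADS = ["gene", "protein", "DNA", "sequence", "motif"]


def create_patterns(candidate: str) -> list[str]:
    """Step A: instantiate 'X head' patterns for a candidate X."""
    return [f"{candidate} {head}" for head in HEADS]


def count_pattern(pattern: str, corpus: list[str]) -> int:
    """Step B: count whole-phrase, case-insensitive occurrences in the corpus."""
    regex = re.compile(r"\b" + re.escape(pattern) + r"\b", re.IGNORECASE)
    return sum(len(regex.findall(doc)) for doc in corpus)


def pattern_counts(candidate: str, corpus: list[str]) -> dict[str, int]:
    return {p: count_pattern(p, corpus) for p in create_patterns(candidate)}


corpus = [
    "The p53 gene is mutated in many tumours.",
    "Expression of the p53 protein was measured.",
    "Mutations in the p53 gene were found.",
]
counts = pattern_counts("p53", corpus)
print(counts)
```

Step C, what to do with the counts, is a separate design decision.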
5
Create Patterns (I)
1. AIM 2. METHOD 3. INPUT
1. AIM (granularity)
distinguish entities from non-entities
    X gene OR DNA OR protein
    pro: bypasses ambiguities and data sparseness
    con: less information
distinguish between entities
    X DNA
    X gene
    X binds
    pro: more information
    con: ambiguities, data sparseness
6
Create Patterns (II)
1. AIM 2. METHOD 3. INPUT
2. METHOD
by hand (experts)
    pro: high precision, exact target
    con: time consuming, experts needed
automatically (collocations, clustering)
    pro: no human intervention
    con: lower precision, not necessarily interesting patterns
7
Create Patterns (III)
1. AIM 2. METHOD 3. INPUT
3. INPUT (X gene)
low frequency words (as estimated from a non-specific corpus)
words not found in standard dictionary
NP chunks
first output of classifier
increase precision but lower recall
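
Two of the input options above, low-frequency words and words missing from a standard dictionary, can be sketched as a simple filter. The frequency table, dictionary, threshold, and the decision to combine both tests with OR are all assumptions made for this example.

```python
def select_candidates(
    tokens: list[str],
    general_freq: dict[str, int],
    dictionary: set[str],
    max_freq: int = 5,
) -> list[str]:
    """Keep tokens that are rare in a general corpus or absent from a dictionary."""
    selected = []
    for token in tokens:
        key = token.lower()
        is_rare = general_freq.get(key, 0) <= max_freq
        is_unknown = key not in dictionary
        if is_rare or is_unknown:
            selected.append(token)
    return selected


# Toy resources standing in for a non-specific corpus and a standard dictionary.
general_freq = {"the": 1000, "binds": 12, "to": 900, "protein": 40}
dictionary = {"the", "binds", "to", "protein"}
tokens = ["the", "p53", "protein", "binds", "to", "mdm2"]
candidates = select_candidates(tokens, general_freq, dictionary)
print(candidates)
```

Only the selected tokens would then be inserted as X into the patterns.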
               prec   rec    f-score
all features   .813   .861   .836
web            .807   .864   .835
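
The f-score values in the table are consistent with the balanced F-measure, the harmonic mean of precision and recall. A quick check, using only the numbers from the slide:

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for precision, recall in [(0.813, 0.861), (0.807, 0.864)]:
    print(f"prec={precision:.3f} rec={recall:.3f} f={f_score(precision, recall):.3f}")
```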
8
What? Google vs PubMed
  • PubMed: searchable collection of over 12M
    biomedical abstracts, more
    sophisticated search options
  • Everything: Google searches over 8 billion
    pages, raw search, API

query: p53 gene
PubMed   5,843 documents
Google   165,000 pages
9
Google PubMed
anything you want site:<specific_site>
p53 gene site:www.ncbi.nlm.nih.gov
Rob Futrelle has this function available on this
webpage:
http://www.ccs.neu.edu/home/futrelle/bionlp/search.html
  • comment: sometimes PubMed reports "Quoted phrase
    not found" even when Google finds the phrase.

PubMed provides phrase search only on pre-indexed
phrases
10
(No Transcript)
11
(No Transcript)
12
PubMed > Google
  • query expansion

PubMed uses the MeSH headings to match
synonyms (it will expand Pol II to search for
DNA Polymerase II)
Google will only try to correct misspellings
  • field specific search

PubMed allows field-specific searches (e.g. year)
Google cannot refine its search in this respect
  • timeliness

PubMed is updated daily
Google is slow in updating
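
The query expansion described on this slide can be imitated for an engine that does not do it. The sketch below is an assumption-heavy illustration: the synonym table is a toy with a single entry (the slide's Pol II example), not real MeSH data, and the output is a generic Boolean OR query string.

```python
# Toy synonym table; the only entry is the example given on the slide.
SYNONYMS = {"pol ii": ["DNA Polymerase II"]}


def expand_query(term: str) -> str:
    """Join a term and its known synonyms into one quoted OR query."""
    variants = [term] + SYNONYMS.get(term.lower(), [])
    return " OR ".join(f'"{variant}"' for variant in variants)


print(expand_query("Pol II"))
print(expand_query("p53"))
```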
13
PubMed > Google (contd)
  • ranking

Google does a vote-based ranking: not
necessarily good
PubMed does not do any ranking (possibly bad
too...)
  • truncation and flexibility

PubMed accepts truncated entries and will look
for all possible variations. It will try to break
phrases if no matches are found.
Google has a rigid search
  • manual indexing

PubMed's MeSH headings contain keywords not necessarily
contained in the abstract
Google cannot find something that is not
mentioned in the abstract
14
What to Use? (or How to Use the Evidence)
  • as a rule

sure identification of entities
too powerful -> high risk of false positives
might be better to use PubMed: less info but
precise
  • as a feature

fewer false positives
some systems (MaxEnt) can integrate a huge number of features
might still not get used or provide enough evidence
might be OK to use Google: more info but not
necessarily precise
15
iHOP (Information Hyperlinked Over Proteins)
A gene network for navigating the literature
Nature Genetics, Vol. 36(7), July 2004
http://www.pdg.cnb.uam.es/UniPub/iHOP
  • uses genes and proteins as hyperlinks between
    sentences and abstracts
  • each step through the network produces
    information about one single gene and its
    interactions
  • information retrieved by connecting similar
    concepts
  • precision of gene name and synonym
    identification: 87-99%
  • readers can still check correctness of sentences
    when they are presented to them
  • shortest path between any 2 genes is on average
    4 steps only