NLP Tools for Biology Literature Mining

Title: NLP Tools for Biology Literature Mining


1
NLP Tools for Biology Literature Mining
  • Qiaozhu Mei
  • Jing Jiang
  • ChengXiang Zhai
  • Nov 3, 2004

2
What do we have?
  • Biology Literature (huge amount of text)
  • E.g. Mites in the genus Varroa are the primary
    parasites of honey bees ... Ten of 22 transfer RNAs
    are in different locations relative to hard
    ticks, and the 12S ribosomal RNA subunit is
    inverted and separated from the 16S rRNA by a
    novel non-coding region, a trait not yet seen in
    other arthropods. (from Biological Abstracts)

3
What do we want?
  • Named entities
  • gene names, protein names, drugs, etc.
  • Interaction events between entities
  • transcription, translation, post translational
    modification, etc.
  • Relationships between basic events
  • caused by, inhibited by, etc.
  • (from Hirschman et al., 2002)

4
Preliminary System Structure

Collections of raw textual data
Text Pre-processing (NLP)
  • POS Tagger: nouns, verbs, etc.
  • Parser: NPs, VPs, relations
  • Entity Extractor: genes, proteins, other entities
Pre-processed data ready to mine
Text Mining Modules (TM)
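
The structure above can be sketched as a chain of interchangeable stages. This is only a minimal illustration, not the system built for the talk: `run_pipeline`, `whitespace_tokenizer`, and `toy_entity_extractor` are hypothetical names, and the toy stages stand in for the real taggers, parsers, and extractors.

```python
from typing import Callable, Dict, List

# Each stage receives the shared document record and returns it with
# extra annotations, so any stage can be swapped for a better tool.
Doc = Dict[str, object]
Stage = Callable[[Doc], Doc]


def whitespace_tokenizer(doc: Doc) -> Doc:
    # Hypothetical stand-in for a real tokenizer / POS tagger.
    doc["tokens"] = str(doc["text"]).split()
    return doc


def toy_entity_extractor(doc: Doc) -> Doc:
    # Hypothetical stand-in for an entity extractor:
    # it simply marks tokens that contain "RNA".
    doc["entities"] = [t for t in doc["tokens"] if "RNA" in t]
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Doc:
    # Apply the stages in order to produce pre-processed data.
    doc: Doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline(
    "the 12S ribosomal RNA subunit is inverted",
    [whitespace_tokenizer, toy_entity_extractor],
)
print(doc["tokens"])
print(doc["entities"])
```

Replacing a stage means changing one entry in the list passed to `run_pipeline`; the text mining modules only read the final record.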

5
POS Taggers
  • Tree Tagger
  • Brill Tagger
  • SNoW Tagger
  • LT Chunk
  • Stanford Tagger

6
Results of POS Tagging
  • Raw text
  • Mites in the genus Varroa are the primary
    parasites of honey bees ... Ten of 22 transfer RNAs
    are in different locations relative to hard
    ticks, and the 12S ribosomal RNA subunit is
    inverted and separated from the 16S rRNA by a
    novel non-coding region, a trait not yet seen in
    other arthropods.
  • (from Biological Abstracts)

7
Results of POS Tagging (cont.)

8
Results of POS Tagging (cont.)

9
Comparison of POS Taggers

10
Conclusions
  • Existing general-purpose POS taggers work fine
    for our task.
  • Most nouns and verbs correctly identified
  • There is still room to improve existing POS
    taggers for biology data.
  • E.g. to identify gene and protein names
  • Speed and adaptability are important.

11
A Little Bit More on SNoW
  • SNoW has a POS tagger and a shallow parser.
  • Speed is reasonable.
  • Software is adaptable as help is available from
    CCG.
  • The network model can be trained if we have
    training data.

12
Result of SNoW Shallow Parser
  • [NP the 12 S ribosomal RNA subunit] [VP is] [ADJP
    inverted] and [VP separated] [PP from] [NP the 16
    S rRNA] [PP by] [NP a novel non-coding region]
  • (from online demo)
  • Problems
  • Currently the package is not available for
    download from the new CCG page.
  • There are still problems running the old package
    on our machine (compilation, path setting, etc.).

13
Parsers
  • SNoW (already covered)
  • LT-Chunk
  • MiniPar
  • Collins
  • Stanford

14
Result of LT-Chunk
  • [[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]]
    (( is_VBZ inverted_VBN and_CC separated_VBN ))
    from_IN [[ the_DT 16S_JJ rRNA_NNP ]] by_IN
    [[ a_DT novel_JJ non-coding_JJ region_NN ]]
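
Tagged output in this word_TAG form is easy to post-process. The sketch below assumes tokens are separated by whitespace and that the tag follows the last underscore; `parse_tagged` is a hypothetical helper, not part of LT-Chunk.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        if "_" not in token:
            continue  # skip chunk delimiters such as (( and ))
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


sample = (
    "the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN "
    "(( is_VBZ inverted_VBN and_CC separated_VBN )) "
    "from_IN the_DT 16S_JJ rRNA_NNP"
)
pairs = parse_tagged(sample)

# Penn Treebank noun tags start with NN, verb tags with VB.
nouns = [w for w, t in pairs if t.startswith("NN")]
verbs = [w for w, t in pairs if t.startswith("VB")]
print(nouns)
print(verbs)
```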

15
Result of MiniPar
  • 16 (the ~ Det 20 det (gov subunit))
  • 17 (12S ~ N 20 nn (gov subunit))
  • 18 (ribosomal ~ A 20 mod (gov subunit))
  • 19 (RNA ~ N 20 nn (gov subunit))
  • 20 (subunit ~ N 22 s (gov invert))
  • 21 (is be be 22 be (gov invert))
  • 22 (inverted invert V E0 i (gov fin))
  • E4 (() subunit N 22 obj (gov invert)
  • 23 (and ~ U 22 lex-mod (gov invert))
  • 24 (separated separate V 22 lex-dep (gov invert))
  • 25 (from ~ Prep 22 mod (gov invert))
  • 26 (the ~ Det 28 det (gov rRNA))
  • 27 (16S ~ N 28 nn (gov rRNA))
  • 28 (rRNA ~ N 25 pcomp-n (gov from))
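
To use this output downstream, each line can be read into a small record. The sketch assumes the layout `label (word lemma category head relation (gov ...))`, with `~` standing for a lemma identical to the word; that layout is an assumption about MiniPar's text output, and `Dep` and `parse_line` are hypothetical names rather than part of the MiniPar API.

```python
import re
from typing import NamedTuple, Optional

# label (word lemma category head relation ...
LINE = re.compile(r"^(\S+)\s+\((\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)")


class Dep(NamedTuple):
    label: str
    word: str
    lemma: str
    category: str
    head: str
    relation: str


def parse_line(line: str) -> Optional[Dep]:
    m = LINE.match(line.strip())
    if m is None:
        return None  # line does not follow the assumed layout
    label, word, lemma, category, head, relation = m.groups()
    if lemma == "~":  # "~" means the lemma equals the surface word
        lemma = word
    return Dep(label, word, lemma, category, head, relation)


sample = """16 (the ~ Det 20 det (gov subunit))
20 (subunit ~ N 22 s (gov invert))
21 (is be be 22 be (gov invert))
22 (inverted invert V E0 i (gov fin))"""

deps = [d for d in map(parse_line, sample.splitlines()) if d is not None]

# Subject relations ("s") link a noun to the label of its governing verb.
subjects = [(d.word, d.head) for d in deps if d.relation == "s"]
print(subjects)
```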

16
Results of Collins Parser
  • (S~is~2~2 (NPB~subunit~5~5 the/DT 12S/CD
    ribosomal/JJ RNA/NNP subunit/NN ) (VP~is~2~1
    is/VBZ (UCP~inverted~3~1 (ADJP~inverted~1~1
    inverted/JJ ) and/CC (VP~separated~3~1
    separated/VBN (PP~from~2~1 from/IN (NPB~rRNA~3~3
    the/DT 16S/CD rRNA/NN ) ) (PP~by~2~1 by/IN
    (NP~region~2~1 (NPB~region~4~4 a/DT novel/JJ
    non-coding/JJ region/NN ,/PUNC, )

17
Comparison of Parsers

18
Conclusion on Parsers
  • MiniPar has advantages so far
  • Fast
  • Outputs dependency governing info. and useful
    relations
  • Provides API
  • If SNoW is tuned for the task, we can easily
    plug it into the module.

19
Entity Extractors
  • Abner: extracts protein, DNA, RNA, cell line,
    and cell type names
  • Yagi: a sibling of Abner that extracts only gene
    names
  • LingPipe: named entity extraction that can be
    trained for different domains

20
Result of Abner
  • Ten of <RNA>22 transfer RNAs</RNA> are in
    different locations relative to hard ticks , and
    the 12 <protein>S ribosomal RNA subunit</protein>
    is inverted and separated from the 16 S rRNA by a
    novel non-coding region,

21
Result of LingPipe
  • Ten of 22 transfer RNAs are in different
    locations relative to hard ticks, and the <ENAMEX
    id="0" type="GENE">12S ribosomal RNA
    subunit</ENAMEX> is inverted and separated from
    the <ENAMEX id="1" type="GENE">16S rRNA</ENAMEX>
    by a novel non-coding region,
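
Inline markup of this kind can be turned into a list of entities with a regular expression. The sketch assumes tags of the form `<ENAMEX id="N" type="T">...</ENAMEX>` with no nesting; `extract_entities` and `strip_tags` are hypothetical helpers, not LingPipe functions.

```python
import re
from typing import List, Tuple

# Matches one non-nested ENAMEX element: id, type, and the tagged text.
ENAMEX = re.compile(r'<ENAMEX id="(\d+)" type="([^"]+)">(.*?)</ENAMEX>')


def extract_entities(tagged: str) -> List[Tuple[str, str]]:
    """Return (type, text) for every tagged entity, in order."""
    return [(m.group(2), m.group(3)) for m in ENAMEX.finditer(tagged)]


def strip_tags(tagged: str) -> str:
    """Recover the plain sentence by keeping only the tagged text."""
    return ENAMEX.sub(lambda m: m.group(3), tagged)


sample = (
    'the <ENAMEX id="0" type="GENE">12S ribosomal RNA subunit</ENAMEX> '
    'is inverted and separated from the '
    '<ENAMEX id="1" type="GENE">16S rRNA</ENAMEX>'
)
entities = extract_entities(sample)
plain = strip_tags(sample)
print(entities)
print(plain)
```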

22
Comparison of Entity Extractors

23
Conclusion on Entity Extractors
  • There is still a lot of room to improve. However,
    with existing extractors we can begin high-level
    text mining work.
  • Performance on honeybee data needs to be
    evaluated.
  • As soon as a better extractor is built, we can
    plug it in easily.

24
Summary
  • Evaluated some existing NLP tools for supporting
    biology literature mining: POS taggers, parsers,
    and entity extractors
  • Observations along two lines
  • There is still considerable room for improvement
    beyond the existing NLP tools, especially in
    customizing them for special domains.
  • We can begin exploring higher-level text mining
    research with the support of these toolkits.
  • Text pre-processing modules are independent and
    easy to plug and play

25
References
  • Hirschman, L. et al. Accomplishments and
    challenges in literature data mining for biology.
    Bioinformatics, 2002.
  • Lin, D. Dependency-based evaluation of MiniPar.
    In Workshop on the Evaluation of Parsing
    Systems, 1998.

26
End of Talk
  • Thank you!