NLP Tools for Biology Literature Mining

1
NLP Tools for Biology Literature Mining

Qiaozhu Mei
Jing Jiang
ChengXiang Zhai
Nov 3, 2004

2
What do we have?

Biology literature (huge amount of text)
E.g. Mites in the genus Varroa are the primary
parasites of honey bees ... Ten of 22 transfer RNAs
are in different locations relative to hard
ticks, and the 12S ribosomal RNA subunit is
inverted and separated from the 16S rRNA by a
novel non-coding region, a trait not yet seen in
other arthropods. (from Biological Abstracts)

3
What do we want?

Named entities: gene names, protein names, drugs, etc.
Interaction events between entities: transcription,
translation, post-translational modification, etc.
Relationships between basic events: caused by,
inhibited by, etc.
(from Hirschman et al., 2002)

4
Preliminary System Structure

Collections of raw textual data
-> Text pre-processing (NLP)
   POS Tagger -> nouns, verbs, etc.
   Parser -> NPs, VPs, relations
   Entity Extractor -> genes, proteins, other entities
-> Pre-processed data ready to mine
-> Text mining modules (TM)
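
The structure on this slide can be sketched as a small registry of interchangeable annotators. This is only a minimal sketch under stated assumptions, not the authors' system: `Preprocessor`, `plug`, `run`, and both toy annotators are hypothetical names introduced for illustration, and the toy functions merely stand in for a real POS tagger or entity extractor.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# An annotator maps raw text to a list of (span, label) pairs.
Annotation = Tuple[str, str]
Annotator = Callable[[str], List[Annotation]]


@dataclass
class Preprocessor:
    """Hypothetical container for independent pre-processing modules."""
    modules: Dict[str, Annotator] = field(default_factory=dict)

    def plug(self, name: str, annotator: Annotator) -> None:
        # Registering under an existing name swaps the module in place.
        self.modules[name] = annotator

    def run(self, text: str) -> Dict[str, List[Annotation]]:
        # Every module sees the same raw text and runs independently.
        return {name: annotate(text) for name, annotate in self.modules.items()}


def toy_pos_tagger(text: str) -> List[Annotation]:
    # Placeholder, NOT a real tagger: capitalised tokens get NNP, the rest X.
    return [(tok, "NNP" if tok[:1].isupper() else "X") for tok in text.split()]


def toy_entity_extractor(text: str) -> List[Annotation]:
    # Placeholder: flags tokens that end in "RNA" or "RNAs".
    return [(tok, "RNA") for tok in text.split() if tok.rstrip("s").endswith("RNA")]


pipeline = Preprocessor()
pipeline.plug("pos", toy_pos_tagger)
pipeline.plug("entities", toy_entity_extractor)
annotations = pipeline.run("the 16S rRNA and transfer RNAs")
```

Because each module is looked up by name, a better tagger or extractor can replace a toy one with a single `plug` call, leaving the downstream text mining code unchanged.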

5
POS Taggers

Tree Tagger
Brill Tagger
SNoW Tagger
LT Chunk
Stanford Tagger

6
Results of POS Tagging

Raw text
Mites in the genus Varroa are the primary
parasites of honey bees ... Ten of 22 transfer RNAs
are in different locations relative to hard
ticks, and the 12S ribosomal RNA subunit is
inverted and separated from the 16S rRNA by a
novel non-coding region, a trait not yet seen in
other arthropods.
(from Biological Abstracts)

7
Results of POS Tagging (cont.)

8
Results of POS Tagging (cont.)

9
Comparison of POS Taggers

10
Conclusions

Existing general-purpose POS taggers work fine
for our task.
Most nouns and verbs correctly identified
There is still room to improve existing POS
taggers for biology data.
E.g. to identify gene and protein names
Speed and adaptability are important.

11
A Little Bit More on SNoW

SNoW has a POS tagger and a shallow parser.
Speed is reasonable.
Software is adaptable as help is available from
CCG.
The network model can be trained if we have
training data.

12
Result of SNoW Shallow Parser

[NP the 12 S ribosomal RNA subunit] [VP is] [ADJP
inverted] and [VP separated] [PP from] [NP the 16
S rRNA] [PP by] [NP a novel non-coding region]
(from online demo)

Problems
Currently the package is not available for
download from the new CCG page.
There are still problems running the old package
on our machine (compilation, path setting, etc.).

13
Parsers

SNoW (already covered)
LT-Chunk
MiniPar
Collins
Stanford

14
Result of LT-Chunk

[[ the_DT 12S_JJ ribosomal_JJ RNA_NNP subunit_NN ]]
(( is_VBZ inverted_VBN and_CC separated_VBN ))
from_IN [[ the_DT 16S_JJ rRNA_NNP ]] by_IN
[[ a_DT novel_JJ non-coding_JJ region_NN ]]
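
Output in this `word_TAG` form is easy to turn into (word, tag) pairs for later mining. The following is a minimal sketch; `parse_tagged_tokens` is a hypothetical helper, and it assumes that any token without an underscore is a chunk delimiter that can be skipped.

```python
def parse_tagged_tokens(line):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        if "_" not in token:
            continue  # assumed to be a chunk delimiter such as "((" or "))"
        # Split on the last underscore so words containing "_" stay whole.
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


sample = "(( is_VBZ inverted_VBN and_CC separated_VBN )) from_IN the_DT 16S_JJ rRNA_NNP"
tagged = parse_tagged_tokens(sample)
# Keep only tokens whose tag starts with NN (nouns and proper nouns).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
```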

15
Result of MiniPar

16 (the Det 20 det (gov subunit))
17 (12S N 20 nn (gov subunit))
18 (ribosomal A 20 mod (gov subunit))
19 (RNA N 20 nn (gov subunit))
20 (subunit N 22 s (gov invert))
21 (is be be 22 be (gov invert))
22 (inverted invert V E0 i (gov fin))
E4 (() subunit N 22 obj (gov invert))
23 (and U 22 lex-mod (gov invert))
24 (separated separate V 22 lex-dep (gov invert))
25 (from Prep 22 mod (gov invert))
26 (the Det 28 det (gov rRNA))
27 (16S N 28 nn (gov rRNA))
28 (rRNA N 25 pcomp-n (gov from))
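
Each line of this output carries a node id, the word, an optional root form, a category, the head's id, the relation, and the governing word. A minimal sketch of reading such lines into records follows; the field layout is inferred only from the lines shown on this slide, `parse_minipar_line` and `DepNode` are hypothetical names, and falling back to the surface word when no root form is printed is a simplifying assumption.

```python
import re
from typing import NamedTuple, Optional


class DepNode(NamedTuple):
    node_id: str
    word: str
    lemma: str
    category: str
    head_id: str
    relation: str
    governor: str


# id, "(", the middle fields, "(gov WORD)", then one or more closing parens
LINE_RE = re.compile(r"^(\S+)\s+\((.*?)\(gov\s+([^()\s]+)\)+\s*$")


def parse_minipar_line(line: str) -> Optional[DepNode]:
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    node_id, middle, governor = match.groups()
    fields = middle.split()
    if len(fields) == 4:
        # No separate root form printed: reuse the surface word (assumption).
        word, category, head_id, relation = fields
        lemma = word
    elif len(fields) == 5:
        word, lemma, category, head_id, relation = fields
    else:
        return None
    return DepNode(node_id, word, lemma, category, head_id, relation, governor)


lines = [
    "20 (subunit N 22 s (gov invert))",
    "22 (inverted invert V E0 i (gov fin))",
    "28 (rRNA N 25 pcomp-n (gov from))",
]
nodes = [parse_minipar_line(line) for line in lines]
# (relation, governor, dependent) triples, a convenient form for mining relations
triples = [(n.relation, n.governor, n.word) for n in nodes if n is not None]
```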

16
Results of Collins Parser

(S~is~2~2 (NPB~subunit~5~5 the/DT 12S/CD
ribosomal/JJ RNA/NNP subunit/NN ) (VP~is~2~1
is/VBZ (UCP~inverted~3~1 (ADJP~inverted~1~1
inverted/JJ ) and/CC (VP~separated~3~1
separated/VBN (PP~from~2~1 from/IN (NPB~rRNA~3~3
the/DT 16S/CD rRNA/NN ) ) (PP~by~2~1 by/IN
(NP~region~2~1 (NPB~region~4~4 a/DT novel/JJ
non-coding/JJ region/NN ,/PUNC, )

17
Comparison of Parsers

18
Conclusion on Parsers

MiniPar has the advantage so far:
Fast
Outputs dependency (governor) information and useful
relations
Provides an API
If SNoW is tuned for the task, we can easily
plug it into the module.

19
Entity Extractors

Abner: extracts protein, DNA, RNA, cell line,
and cell type names.
Yagi: a sibling of Abner that extracts only
gene names.
LingPipe: named entity extraction that can be
trained for different domains.

20
Result of Abner

Ten of <RNA>22 transfer RNAs</RNA> are in
different locations relative to hard ticks , and
the 12 <protein>S ribosomal RNA subunit</protein>
is inverted and separated from the 16 S rRNA by a
novel non-coding region,

21
Result of LingPipe

Ten of 22 transfer RNAs are in different
locations relative to hard ticks, and the <ENAMEX
id="0" type="GENE">12S ribosomal RNA
subunit</ENAMEX> is inverted and separated from
the <ENAMEX id="1" type="GENE">16S rRNA</ENAMEX>
by a novel non-coding region,
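
Both extractors mark entities inline, so their results can be collected with a small amount of post-processing. The sketch below assumes the markup is angle-bracket tags of the form `<TYPE>...</TYPE>` or `<ENAMEX type="TYPE">...</ENAMEX>` with no nesting; `extract_entities` is a hypothetical helper, not part of either tool.

```python
import re

# An opening tag, its attributes, the enclosed text, and the matching close tag.
TAG_RE = re.compile(r"<(?P<tag>\w+)(?P<attrs>[^>]*)>(?P<text>.*?)</(?P=tag)>", re.S)
TYPE_RE = re.compile(r'type="([^"]*)"')


def extract_entities(marked_up):
    """Return (label, text) pairs for every inline-tagged entity."""
    entities = []
    for match in TAG_RE.finditer(marked_up):
        type_attr = TYPE_RE.search(match.group("attrs"))
        # Prefer an explicit type attribute; otherwise use the tag name itself.
        label = type_attr.group(1) if type_attr else match.group("tag")
        # Collapse line breaks and repeated spaces inside the entity text.
        text = " ".join(match.group("text").split())
        entities.append((label, text))
    return entities


enamex_style = 'and the <ENAMEX id="0" type="GENE">12S ribosomal RNA\nsubunit</ENAMEX> is inverted'
plain_tag_style = "Ten of <RNA>22 transfer RNAs</RNA> are in different locations"
found = extract_entities(enamex_style) + extract_entities(plain_tag_style)
```

Reducing both outputs to the same (label, text) pairs makes it straightforward to compare extractors on the same sentence.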

22
Comparison of Entity Extractors

23
Conclusion on Entity Extractors

There is still a lot of room to improve. However, with
existing extractors we can begin high-level text
mining work.
Performance on honeybee data needs to be
evaluated.
As soon as a better extractor is constructed, we
can plug it in easily.

24
Summary

Some existing NLP tools for supporting biology
literature mining were evaluated: POS taggers,
parsers, and entity extractors.
Observations along two lines:
There is still considerable room for improvement
beyond the existing NLP tools, especially in
customizing them for special domains.
We can begin exploring higher-level text mining
research with the support of these toolkits.
Text pre-processing modules are independent and easy
to plug and play.

25
References

Hirschman, L. et al. Accomplishments and
challenges in literature data mining for biology.
Bioinformatics, 2002.
Lin, D. Dependency-based evaluation of
MiniPar. In Workshop on the Evaluation of Parsing
Systems, 1998.

26
End of Talk