Title: Application of Data and Text Mining to Bioinformatics
1Application of Data and Text Mining to
Bioinformatics
- Sammy Wang
- Computer Science
- University of Georgia
2Data Mining (DM) Definition
- Extraction of implicit, previously unknown, and
potentially useful information (patterns) from
large data sets or databases.
- Uses computational techniques from statistics,
machine learning and pattern recognition.
--from wiki
3Where Does DM Come From?
- High Performance Computers
DATA MINING
--from http://wwwmaths.anu.edu.au/steve/pdcn.pdf
4Knowledge Discovery Process
- Data mining: the core of the knowledge discovery
process.
[Figure: knowledge discovery pipeline, bottom to top:
Databases -> Data Cleaning / Data Integration ->
Preprocessed Data -> Selection / Data Transformations ->
Task-relevant Data -> Data Mining -> Interpretation ->
Knowledge]
--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
5DM Tasks (Goals)
- There are two categories of goals (or high-level
tasks) in DM:
- Description: models are constructed to describe
particular patterns or relationships in the data
(e.g. beer and diapers)
- Prediction: models are constructed using
historical cases to predict outcomes for new
cases (e.g. PROSPECTR)
--from http://www.sys-consulting.co.uk/
6Data Mining Tech
Data Mining
- Descriptive
- Clustering, Summarization, Association Rules,
Sequence Discovery
- Predictive
- Classification, Regression, Time Series Analysis
- Decision Tree, Artificial Neural Network
--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
7PRiOritization by Sequence and PhylogEnetic
features of CandidaTe Regions - PROSPECTR
- Classify genes as likely or unlikely to be
involved in human hereditary disease, based on
features derived from their sequence
- Build classifier using a decision tree (based on
C4.5) using the Weka machine learning package
- can input an arbitrary number of features /
attributes
- gives a human-readable classifier
- also happens to give better results than SVM and
Bayes-based methods.
- Trained on approx. 1084 disease genes
versus 1084 randomly selected genes
8START
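[Figure: PROSPECTR decision tree. Each test node has an
N branch and a Y branch, and each branch carries a score;
the tree layout is not recoverable from the extracted text.]
Node tests shown in the figure:
- Mouse homol > 42
- Has signal peptide?
- Mouse homol > 95
- Gene length > 563
- Gene length > 997bp
- Exons > 32
- Best paralog > 78
- Rata id > 59
- 3'UTR < 647bp
- CDS len > 704bp
- Hs/Mm Ka/Ks < 0.195
- GC > 37.5
- Mouse id > 68.3
- Worm id > 55
Branch scores, in extraction order (positions in the tree
not recoverable):
0.827, -0.163, -0.315, 0.114, -0.036, 0.151, 0.818,
-0.026, 0.029, -0.594, -0.014, 0.792, -0.422, 0.344,
-0.044, 0.205, 0.106, -0.087, -0.034, 0.2, 0.008, -0.57,
-0.038, 0.213, 0.015, -0.492, -0.034, 0.027
CLASS is DISEASE if Score < 0
--from http://www.genetics.med.ed.ac.uk/tutorials/InfMSc2_12.ppt27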
9Text Mining Definition
- Refers to the process of extracting interesting,
non-trivial information and knowledge from
unstructured text (i.e. free text).
- A young interdisciplinary area that draws on
information retrieval, data mining, machine
learning, statistics and computational
linguistics.
--from Wikipedia
10Two Approaches of TM
- Bag of Words
- looking for features (words, concepts, headings,
formatting, authors, references and links)
- Statistical methods, machine learning methods,
algorithms, etc.
- Natural Language Processing (NLP)
- Syntax and semantics
11Bag of Words Approach
- Any techniques used in data mining, statistics and
machine learning
- For example, neural networks, decision trees, HMMs,
stochastic methods, Naïve Bayes, Maximum Entropy, SVM,
etc. They can also be used in NLP
- Web search: Google, Yahoo
- Classification
- Clustering
12Natural Language Processing (NLP)
- NLP is a subfield of AI and linguistics.
- Goal of NLP: design and build software that
will analyze, understand, and generate natural
languages, letting people address the computer as
though they were addressing another person.
- Major tasks: information extraction (IE),
information retrieval (IR), text to speech,
speech recognition, natural language generation,
machine translation, question answering,
text proofing, translation technology, automatic
summarization
--from Wiki and Microsoft Research
13Five Types of IE
- Named Entity recognition (NE)
- Finds and classifies names, places, etc.
- Co-reference resolution (CO)
- Identifies identity relations between entities
- Template Element construction (TE)
- Adds descriptive information to NE results (using
CO)
- Template Relation construction (TR)
- Finds relations between TE entities
- Scenario Template production (ST)
- Fits TE and TR results into specified event
scenarios
--according to the definition given by the MUC
programme in 1998
14NE Example
- the E2F-RB complex induced by TGF-beta may bind
to E2F sites and suppress expression of specific
genes whose promoters contain E2F binding sites
Named Entity recognition (NE)
15CO Example
- the E2F-RB complex induced by TGF-beta may bind
to E2F sites and suppress expression of
specific genes whose promoters contain E2F
binding sites
Coreference resolution (CO)
16TE Example
- the E2F-RB complex induced by TGF-beta may bind
to E2F sites and suppress expression of specific
genes whose promoters contain E2F binding sites
Template Element construction (TE)
17TR Example
- the E2F-RB complex induced by TGF-beta may bind
to E2F sites and suppress expression of specific
genes whose promoters contain E2F binding sites
Template Relation construction (TR)
18Foundation of NLP--Tagging
- Assign part-of-speech tags to words reflecting
their syntactic category (noun, verb, adjective,
etc.)
- Difficulty: words can belong to different
syntactic categories in different contexts.
- "he books tickets" vs. "he reads books"
- "Buffalo buffaloes buffalo buffalo buffaloes"
- Algorithms used: Viterbi algorithm, HMM, maximum
likelihood, rule-based, stochastic, n-grams,
Baum-Welch, dynamic programming
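To make the "books" ambiguity concrete, here is a minimal sketch of Viterbi decoding over a toy HMM. The tag set and every probability below are hypothetical values chosen for illustration, not estimates from a corpus, and the function name `viterbi` is our own.

```python
import math

TAGS = ["PRON", "VERB", "NOUN"]

# Hypothetical probabilities for illustration only (not corpus estimates).
START = {"PRON": 0.6, "VERB": 0.1, "NOUN": 0.3}
TRANS = {
    "PRON": {"PRON": 0.05, "VERB": 0.85, "NOUN": 0.10},
    "VERB": {"PRON": 0.10, "VERB": 0.10, "NOUN": 0.80},
    "NOUN": {"PRON": 0.10, "VERB": 0.50, "NOUN": 0.40},
}
EMIT = {
    "PRON": {"he": 1.0},
    "VERB": {"books": 0.3, "reads": 0.7},
    "NOUN": {"books": 0.5, "tickets": 0.5},
}


def _logp(p):
    """Log-probability, mapping zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words):
    """Return the most probable tag sequence for words under the toy HMM."""
    # best[i][tag] = (log-prob of best path ending in tag at position i, backpointer)
    best = [{t: (_logp(START[t]) + _logp(EMIT[t].get(words[0], 0.0)), None)
             for t in TAGS}]
    for i in range(1, len(words)):
        column = {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[i - 1][p][0] + _logp(TRANS[p][t]))
            score = (best[i - 1][prev][0] + _logp(TRANS[prev][t])
                     + _logp(EMIT[t].get(words[i], 0.0)))
            column[t] = (score, prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


first = viterbi(["he", "books", "tickets"])
second = viterbi(["he", "reads", "books"])
```

With these toy numbers, "books" is tagged as a verb in the first sentence and as a noun in the second, because the tag of the preceding word changes which path is most probable.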
19Foundation of NLP--Tokenizing
- Tokenizing: segmenting sentences into words and
phrases. This process determines which words
should be retained as phrases and which ones
should be segmented into individual words.
- "Type II Diabetes" vs. "A patient with diabetes"
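One simple way to keep phrases together is to check a phrase lexicon before splitting. The sketch below assumes a small hand-made lexicon (`PHRASES` is hypothetical; in practice it would come from a controlled vocabulary) and is not meant as a complete tokenizer.

```python
import re

# Hypothetical phrase lexicon, stored as lowercase word tuples.
PHRASES = [("type", "ii", "diabetes")]


def tokenize(sentence, phrases=PHRASES):
    """Split a sentence into words, keeping known multiword phrases whole."""
    words = re.findall(r"\w+", sentence)
    tokens, i = [], 0
    while i < len(words):
        # Try longer phrases first so the longest match wins.
        for phrase in sorted(phrases, key=len, reverse=True):
            n = len(phrase)
            if tuple(w.lower() for w in words[i:i + n]) == phrase:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens


kept = tokenize("Type II Diabetes is common")
split = tokenize("A patient with diabetes")
```

Here `kept` holds "Type II Diabetes" as a single token, while `split` contains four separate words.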
20Text Mining Steps
- IR yields all relevant texts
- Gathers, selects, filters relevant documents
- IE extracts entities, relations, facts, events
of interest to the user
- Finds relevant concepts, facts about concepts
- Finds only what we are looking for
- DM discovers unsuspected associations
- Combines and links facts and events
- Discovers new knowledge, finds new associations
--from Text Mining and Terminology Management in
Biomedicine
21Main TM Research in Biology
- Entity/Term recognition
- Relationship extraction
22Named Entity Recognition (NE)
- Term ambiguity/variation
- Lack of clear naming convention
- Synonyms, abbreviations and acronyms (nuclear
receptor: NR)
- Many different terms refer to the same concept
vs. one term can have multiple meanings
- BAD: human gene encoding a protein of the BCL-2
family (vs. bad things, bad weather)
23Approaches of NE
- Dictionary/controlled vocabularies
- MeSH
- Ontology Approaches
- GO
- Rule-based
- CFG parser
- Statistical Methods
- Term frequency
- Machine Learning
- Neural Network
- Hybrid Approaches
24Ontology Approaches
- Ontologies are crucial for automatic term recognition (ATR)
- Manage term acronym, synonym and variation
- Link term in text with concept
- Add meaning, semantic annotation of texts
- Support relationship extraction
- Ontologies support IE/IR
- populate terms into ontologies (automatic
ontology generation)
25Relationship Extraction
- Detect a prespecified type of relationship
between a pair of entities of given types.
- Relationships between genes and proteins
- Relationships between genes, proteins, or other
biological entities (Protein Active Site Template
Acquisition System, PASTA)
26Relationships between Genes and Proteins
- Grouping genes by functional relationships could
aid gene expression analysis and database
annotation
- How to know if a group of genes share the same
function?
- Raychaudhuri et al. used a measure of neighbor
divergence of papers to measure the functional
coherence of a group of genes
27Methodology of Functional Coherence
28Relationships between Genes and Proteins
- MeKE (Medical Knowledge Explorer) system (Chiang
and Yu)
- Uses GO codes as a lexicon of function names,
combining it with a lexicon of gene and gene
product names from LocusLink
- Uses sentence alignment to determine patterns
associated with statements about gene function
- Uses a Naïve Bayes classifier to extract
sentences containing information about gene
product function
29PASTA
- Uses type and POS tagging along with manually
created templates and lexicons assembled from
biological databases to extract relationships
between amino acid residues and their function
within a protein.
30Example of Annotated Text
TITLE: The crystal structure of a <NAME
TYPE=PROTEIN>triacylglycerol lipase</NAME> from
<NAME TYPE=SPECIES>Pseudomonas cepacia</NAME>
reveals a highly open conformation in the absence
of a bound inhibitor
AUTHORS: Kim_KK, Song_HK, Shin_DH, Hwang_KY, Suh_SW
JOURNAL: STRUCTURE, 1997, Vol.5, No.2, pp.173-185
ABSTRACT:
Results: We have determined the crystal structure
of a <NAME TYPE=PROTEIN>triacylglycerol
lipase</NAME> from <NAME TYPE=PROTEIN>Pseudomonas
cepacia (Pet)</NAME> in the absence of a bound
inhibitor using X-ray crystallography. The
structure shows the <NAME TYPE=PROTEIN>lipase</NAME>
to contain an <NAME TYPE=PROTEIN>alpha/beta-hydrolase</NAME>
fold and a catalytic triad
comprising of residues <NAME TYPE=RESIDUE>Ser87</NAME>,
<NAME TYPE=RESIDUE>His286</NAME>
and <NAME TYPE=RESIDUE>Asp264</NAME>. The enzyme
shares several structural features with
homologous <NAME TYPE=PROTEIN>lipases</NAME>
from <NAME TYPE=SPECIES>Pseudomonas glumae
(PgL)</NAME> and <NAME TYPE=SPECIES>Chromobacterium
viscosum (CvL)</NAME>, including a
calcium-binding site. The present structure of
<NAME TYPE=SPECIES>Pet</NAME> reveals a highly
open conformation with a solvent-accessible
active site. This is in contrast to the
structures of <NAME TYPE=SPECIES>PgL</NAME> and
<NAME TYPE=SPECIES>Pet</NAME> in which the
active site is buried under a closed or
partially opened 'lid', respectively.
31Approaches of Relationship Extraction
- Manually generated template-based methods
- use patterns (usually in the form of regular
expressions) generated by domain experts to
extract concepts connected by a specific relation
from text.
- Automatic template methods
- create similar templates automatically by
generalizing patterns from text known to have the
relevant relationship.
- Statistical methods
- identify relationships by looking for concepts
that are found with each other more often than
would be predicted by chance.
- NLP-based methods
- perform a substantial amount of sentence parsing
to decompose the text into a structure from which
relationships can be readily extracted
--from A survey of current work in biomedical
text mining
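Two of the approaches above are easy to sketch in a few lines: a hand-written regular-expression template, and a co-occurrence score that compares how often two entities share a sentence with what chance would predict. The template, the entity names (`proteinA` etc.) and the sentences are all made up for illustration, and the substring matching used here is deliberately naive.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical hand-written template: "<entity> binds to|inhibits|activates <entity>".
TEMPLATE = re.compile(
    r"\b([A-Za-z0-9-]+) (binds to|inhibits|activates) ([A-Za-z0-9-]+)\b"
)


def extract_by_template(sentence):
    """Return (subject, relation, object) triples matched by the template."""
    return [m.groups() for m in TEMPLATE.finditer(sentence)]


def cooccurrence_lift(sentences, entities):
    """Observed / chance-expected sentence co-occurrence for each entity pair.

    A value above 1 means the pair co-occurs more often than expected
    if the two entities were distributed independently.
    """
    n = len(sentences)
    single, pair = Counter(), Counter()
    for s in sentences:
        present = sorted(e for e in entities if e in s)  # naive substring match
        single.update(present)
        pair.update(combinations(present, 2))
    return {
        (a, b): (count / n) / ((single[a] / n) * (single[b] / n))
        for (a, b), count in pair.items()
    }


# Made-up sentences for illustration.
sentences = [
    "proteinA binds to proteinB in vitro",
    "proteinA inhibits proteinB",
    "proteinC was also measured",
    "proteinA and proteinC were expressed",
]
triples = [t for s in sentences for t in extract_by_template(s)]
lift = cooccurrence_lift(sentences, ["proteinA", "proteinB", "proteinC"])
```

The template only finds relations phrased exactly as anticipated, whereas the co-occurrence score flags the proteinA/proteinB pair without saying what the relation is; this trade-off is why the approaches are often combined.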
32Hypothesis Generation
- Goal: uncover previously unrecognized
relationships.
- Swanson found a connection between fish oil and
Raynaud's syndrome in 1986 by manually connecting
concepts between journal articles.
- He also traced 11 indirect connections between
migraine and magnesium using summarizations of
published articles.
--from A survey of current work in biomedical
text mining
33Approaches of Hypothesis Generation
- A influences B, and B influences C, therefore A
may influence C (Swanson)
- Automated hypothesis generation
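The A-B-C pattern can be sketched as a join over known "influences" pairs. The function name `abc_hypotheses` and the input format are assumptions made for this example, and the links are a toy set loosely modeled on the fish oil case, not data extracted from the literature.

```python
from collections import defaultdict


def abc_hypotheses(known_links):
    """Propose A -> C links that are not stated directly.

    known_links is an iterable of (A, B) pairs meaning "A influences B".
    Returns {(A, C): set of intermediate B terms supporting the hypothesis}.
    """
    influences = defaultdict(set)
    for a, b in known_links:
        influences[a].add(b)

    hypotheses = defaultdict(set)
    for a, bs in influences.items():
        for b in bs:
            # .get avoids creating new keys while iterating over the dict.
            for c in influences.get(b, ()):
                if c != a and c not in bs:  # skip links that are already known
                    hypotheses[(a, c)].add(b)
    return dict(hypotheses)


# Toy links for illustration.
links = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud's syndrome"),
    ("platelet aggregation", "Raynaud's syndrome"),
]
found = abc_hypotheses(links)
```

Each proposed pair keeps its intermediate B terms, so a reader can check the supporting literature before treating the link as a hypothesis worth testing.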
34Initial Thought
- Data resources: paper abstracts/full papers, web
pages, online databases
- Computing resources: computer with large memory
(several GB) for training the tagger/parser
(Stanford POS tagger)
- Learning and clarifying the biological problem
(horizontal gene transfer) in cooperation with a
biologist
- Preparing ontology
- Open Biomedical Ontologies (OBO)
- Basic Formal Ontology (BFO)
- Finding relationships
35 36Comparison of Document-handling Techs
--from Text analysis and knowledge mining system
37Model Types
- For successful IR/IE, it is necessary to
represent the documents in some way. There are a
number of models for this purpose. They can be
classified along two dimensions, as shown
in the left figure: the mathematical basis and
the properties of the model.
--from wiki
38--from wiki
39Bag of words approach
- Treats a document as a collection of words or
phrases
- Generally ignores the word order
- May count each word occurrence, or just flag
which words occur
- Stemming and stop word elimination
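As a minimal sketch of this representation, the function below turns a document into word counts (or 0/1 flags) after dropping stop words. The stop-word list is a short illustrative assumption, and stemming is left out to keep the example small.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real systems use longer lists).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "with", "may", "whose"}


def bag_of_words(text, binary=False):
    """Return word counts (or 0/1 flags) for a document, ignoring word order."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    if binary:
        return {word: 1 for word in counts}
    return dict(counts)


doc = ("the E2F-RB complex may bind to E2F sites and suppress genes "
       "whose promoters contain E2F binding sites")
bag = bag_of_words(doc)
flags = bag_of_words(doc, binary=True)
# Shuffling the words does not change the representation.
reversed_bag = bag_of_words(" ".join(reversed(doc.split())))
```

In this example "e2f" and "sites" each get a count of 2 in `bag` and a flag of 1 in `flags`, and `reversed_bag` equals `bag` because word order is discarded.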
40Common Techniques of NLP
- Stemming: identifying the stem of each word. For
example, "hybridized", "hybridizing", and
"hybridization" would be stemmed to "hybrid". As
a result, the analysis phase of the NLP process
has to deal with only the stem of each word, not
every possible variant.
- Tagging: identifying the part of speech
represented by each word, such as noun, verb, or
adjective.
- Tokenizing: segmenting sentences into words and
phrases. This process determines which words
should be retained as phrases and which ones
should be segmented into individual words. For
example, "Type II Diabetes" should be retained as
a word phrase, whereas "A patient with diabetes"
would be segmented into four separate words.
- Core terms: significant terms, such as protein
names and experimental method names, are
identified based on a dictionary of core terms. A
related process is ignoring insignificant words
such as "the", "and", and "a".
- Resolving abbreviations, acronyms, and
synonyms: replacing abbreviations with the words
they represent, and resolving acronyms and
synonyms to a controlled vocabulary. For example,
"DM" and "Diabetes Mellitus" could be resolved to
"Type II Diabetes", depending on the controlled
vocabulary.
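Three of these steps can be sketched with plain string handling. The suffix list, stop-word set and synonym table below are hypothetical and sized to the slide's examples only; the suffix stripper is a naive stand-in, not the Porter stemmer or any other published algorithm.

```python
import re

# Hypothetical suffix list, longest first; a naive stand-in for a real stemmer.
SUFFIXES = ["ization", "izing", "ized", "ize"]
STOP_WORDS = {"the", "and", "a"}
# Hypothetical controlled-vocabulary mapping following the example above.
SYNONYMS = {"dm": "type ii diabetes", "diabetes mellitus": "type ii diabetes"}


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def drop_stop_words(words):
    """Remove insignificant words such as "the", "and" and "a"."""
    return [w for w in words if w.lower() not in STOP_WORDS]


def resolve_terms(text):
    """Replace abbreviations and synonyms with the controlled-vocabulary term."""
    text = text.lower()
    for variant, preferred in SYNONYMS.items():
        text = re.sub(r"\b" + re.escape(variant) + r"\b", preferred, text)
    return text


stems = [stem(w) for w in ("hybridized", "hybridizing", "hybridization")]
content = drop_stop_words("the enzyme and a substrate".split())
resolved = resolve_terms("Patient has DM")
```

All three words in `stems` reduce to "hybrid", and both "DM" and "Diabetes Mellitus" resolve to the same vocabulary entry, so later analysis sees one form per concept.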
41Relationships between Genes and Proteins
- Pan et al.'s Dragon TF Association Miner system
used linear discriminant analysis on terms and
neural networks to create models that recognized
abstracts that contained information relating
transcription factors (TFs) with GO codes and
diseases.
--from A survey of current work in biomedical
text mining
42Dictionary-based Approaches
- Neologisms, variations: a major issue for these
- Combine dictionaries with edit distance for
flexible string matching
- Tuning of cost function (space to hyphen less
costly)
- Hirschman et al. (2002)
- Tsuruoka & Tsujii (2004)
--from Text Mining and Terminology Management In
Biomedicine
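The idea of a tuned cost function can be sketched as a weighted Levenshtein distance in which swapping a space for a hyphen is cheap. The 0.1 weight, the threshold and the function names are assumptions made for this example; they are not taken from the cited papers.

```python
def substitution_cost(a, b):
    """Cost of replacing character a with b; space/hyphen swaps are cheap."""
    if a == b:
        return 0.0
    if {a, b} == {" ", "-"}:
        return 0.1  # hypothetical weight for "space to hyphen less costly"
    return 1.0


def edit_distance(s, t):
    """Weighted Levenshtein distance computed by dynamic programming."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # delete a
                curr[j - 1] + 1.0,                      # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute a -> b
            ))
        prev = curr
    return prev[-1]


def lookup(mention, dictionary, max_cost=1.0):
    """Return the closest dictionary term within max_cost, or None."""
    scored = [(edit_distance(mention.lower(), term.lower()), term)
              for term in dictionary]
    if not scored:
        return None
    cost, term = min(scored)
    return term if cost <= max_cost else None


terms = ["NF-kappa-B", "TGF-beta"]  # example dictionary entries
match = lookup("NF kappa B", terms)
miss = lookup("insulin", terms)
```

With unit costs "NF kappa B" would be two edits away from "NF-kappa-B"; with the cheaper space/hyphen swap it is only 0.2 away, so it matches comfortably under a strict threshold while unrelated strings are still rejected.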
43Rule-based Approaches
- Use of dictionaries of typical term constituents
- Heads, class-specific adjectives, affixes,
specific acronyms
- Use of term formation patterns (Ananiadou,
Gaizauskas)
- Context-free grammars
- Simple lexical patterns and orthographic features
- Fukuda et al.: PROPER (core term + feature term)
- core terms are domain-characteristic words
- feature terms are keywords that describe
function and characteristic of a term (e.g.
protein, receptor, etc.)
- SAP kinase: core = SAP, feature = kinase
- Usual problem of tuning and porting
- Usual problem of tuning and porting
--from Text Mining and Terminology Management In
Biomedicine
44Machine Learning Approaches
- Typically designed for specific classes of
entities
- Challenges
- Selecting a set of representative features for
accurate recognition and classification
- Detection of term boundaries of multiword terms
- Few reliable training resources for biomedicine
- Experimentation with various techniques
- Hidden Markov models (Collier et al.), Naïve
Bayes, support vector machines (Kazama et al.,
Yamamoto et al.), decision trees, etc.
--from Text Mining and Terminology Management In
Biomedicine
45Statistical Approaches
- Based on statistical distributions of
collocations
- Challenge: to define an adequate measure of
termhood of candidate terms
- Main strategy
- Extract specific noun phrases as term candidates
- Estimate termhoods
- Ranked list, thresholds
- More easily tuned, more portable, no training
data
- Work best on large collections (normalization
required for small)
--from Text Mining and Terminology Management In
Biomedicine
46Hybrid Approaches
- Combine several techniques
- C/NC value (Frantzi & Ananiadou), being used by
National Centre
- Combines statistical, linguistic and contextual
processing to rank candidate terms
- Nested (embedded) sub-terms help to recognize
full compounds
- ABGENE (Tanabe & Wilbur)
- Machine learning, transformation rules,
dictionary combined with probabilistic approach
--from Text Mining and Terminology Management In
Biomedicine
47Descriptive Data Mining Tasks
- Classification maps data into predefined groups
or classes
- Pattern recognition
- direct marketing, retention
- Clustering groups similar data together into
clusters/groups.
- Segmentation
- Partitioning
- www marketing
--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
48Predictive Data Mining Tasks
- Regression is used to map a data item to a
real-valued prediction variable.
- credit scoring
- Link Analysis uncovers relationships among data.
- Association Rules
- Sequential Analysis determines sequential
patterns.
--from http://www.cse.ohio-state.edu/srini/694Z/part1.ppt
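Association rules are usually scored by support and confidence. The sketch below computes both for a "diapers -> beer" rule over a handful of made-up shopping baskets; the basket contents and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Made-up baskets for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]
rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
```

Here two of the five baskets contain both items (support 0.4), and two of the three baskets with diapers also contain beer (confidence about 0.67). A rule is normally reported only when both values pass chosen thresholds.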
49Decision Trees
- Popular technique for classification; a leaf node
indicates the class to which the corresponding tuple
belongs.
- Decision Tree (DT) representation
- Each internal node tests an attribute.
- Each branch corresponds to an attribute value.
- Each leaf node assigns a classification.
50Training Data Set
--from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch3.pdf
51Decision Tree for PlayTennis
--from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch3.pdf
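The PlayTennis tree from the cited textbook chapter can be written down as nested dictionaries, which mirrors the representation described on the Decision Trees slide: internal nodes test an attribute, branches are attribute values, and leaves are class labels. The nested-dict encoding and the `classify` helper are our own choices for this sketch.

```python
# Internal node: {attribute: {value: subtree}}; leaf: class label string.
PLAY_TENNIS_TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}


def classify(tree, example):
    """Walk from the root to a leaf, following the example's attribute values."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree


sunny_humid = classify(
    PLAY_TENNIS_TREE, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}
)
overcast = classify(
    PLAY_TENNIS_TREE, {"Outlook": "Overcast", "Humidity": "High", "Wind": "Strong"}
)
```

A sunny day with high humidity ends at a "No" leaf, while any overcast day ends at "Yes" without testing further attributes.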