Title: Data Mining for Bioinformatics
1Data Mining for Bioinformatics
- Craig A. Struble, Ph.D.
- Marquette University
- craig.struble@marquette.edu
2Overview
- Survey of KDD for Bioinformatics
- KDD overview
- Bioinformatics data
- Survey of KDD steps
- Case Study: miRNA Project
- Identifying the problem
- Data collection with Perl
- Selection/cleansing
- Future work
- Next Time
3Knowledge Discovery in Databases
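[Figure: the KDD process as a pipeline]
- Data -> Cleaning, Integration -> Data Warehouse
- Data Warehouse -> Selection, Transformation -> Prepared data
- Prepared data -> Data Mining -> Patterns
- Patterns -> Evaluation, Visualization -> Knowledge
- Knowledge is stored in the Knowledge Base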
4Bioinformatics Data
- DNA Sequences
- Genes
- Location, introns, exons, function, etc.
- Gene products
- RNA, Proteins
- Pathways
- Signaling, metabolic, genomic, etc.
5Bioinformatics Data
- Experimental
- Gene expression, knockouts, etc.
- Literature
- Diseases, viruses, bacteria
- Organisms
- Textbooks
- Expert knowledge
- Unpublished
- Insights
- Etc.
6KDD for Bioinformatics
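[Figure: the KDD pipeline applied to bioinformatics]
- Data sources: Genomic, Literature, Experimental, Expert Knowledge
- Data -> Normalization, Curation, Validation, etc. -> Data Warehouse
- Data Warehouse -> Sampling, Expressed Genes, Homologs, etc. -> Prepared data
- Prepared data -> Clustering, SVMs, ILP, Classification, etc. -> Patterns
- Patterns -> Evaluation, Visualization -> Knowledge
- Annotation in the figure: "Often not explicitly implemented"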
7Data Collection and Cleansing
- Perl scripts (BioPerl)
- From literature
- Read a paper and enter the information
- Supplemental data for papers
- Public databases
- GenBank
- Stanford Microarray Database
- SWISS-Prot
- Etc.
8Data Cleansing
- Remove invalid, redundant, or otherwise useless data
- Extrapolate missing data values
- Data formatting/transformation
- Binning, normalization, scaling, etc.
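The transformations named above can be sketched in a few lines of core Perl. This is a minimal illustration, not part of the miRNA project code: the function names and the expression values are made up for the example, and equal-width binning is only one of several binning schemes.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Min-max scaling: map values linearly onto [0, 1].
sub min_max_scale {
    my @v = @_;
    my ($lo, $hi) = (min(@v), max(@v));
    my $range = $hi - $lo;
    # Constant input has zero range; return zeros instead of dividing by zero
    return map { $range ? ($_ - $lo) / $range : 0 } @v;
}

# Z-score normalization: subtract the mean, divide by the sample standard deviation.
sub z_score {
    my @v    = @_;
    my $n    = @v;
    my $mean = sum(@v) / $n;
    my $var  = $n > 1 ? sum(map { ($_ - $mean) ** 2 } @v) / ($n - 1) : 0;
    my $sd   = sqrt($var);
    return map { $sd ? ($_ - $mean) / $sd : 0 } @v;
}

# Equal-width binning: assign each value to one of $bins intervals (0 .. $bins-1).
sub equal_width_bins {
    my ($bins, @v) = @_;
    my ($lo, $hi) = (min(@v), max(@v));
    my $width = ($hi - $lo) / $bins || 1;
    return map {
        my $i = int(($_ - $lo) / $width);
        $i >= $bins ? $bins - 1 : $i;      # the maximum falls into the last bin
    } @v;
}

# Hypothetical expression values, for illustration only
my @expr = (200, 400, 600, 800, 1000);
print join(' ', min_max_scale(@expr)), "\n";
print join(' ', map { sprintf '%.2f', $_ } z_score(@expr)), "\n";
print join(' ', equal_width_bins(2, @expr)), "\n";
```

Scaling and z-scores keep the relative spacing of the values, while binning deliberately discards it, which is why binning is usually applied last.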
9Data Selection
- Database queries for specific genes, organisms, sequences, etc.
- Statistical analysis (microarray)
- Random sampling
- Etc.
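Random sampling, one of the selection methods listed above, might look like the following in Perl. It is a sketch under simple assumptions: the whole collection fits in memory, sampling is uniform and without replacement, and the gene identifiers are placeholders.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Draw $k items uniformly at random, without replacement.
sub random_sample {
    my ($k, @items) = @_;
    $k = @items if $k > @items;        # cannot draw more items than exist
    return () if $k <= 0;
    my @shuffled = shuffle(@items);
    return @shuffled[0 .. $k - 1];
}

my @genes = map { "gene$_" } 1 .. 20;   # hypothetical identifiers
print join(' ', random_sample(5, @genes)), "\n";
```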
10Data Mining Techniques
- Statistical
- Principal Component Analysis
- ANOVA
- Outlier analysis
- Discrimination
- Some clustering techniques (K-Means)
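To make K-Means concrete, here is a small one-dimensional version in core Perl. It is an illustrative sketch only: real expression data is multi-dimensional, the initial centers are passed in by the caller rather than chosen automatically, and the values are invented for the example.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# One-dimensional k-means (Lloyd's algorithm).
# Returns (\@centers, \@assignment); $assignment[i] is the cluster of value i.
sub kmeans_1d {
    my ($values, $initial, $max_iter) = @_;
    my @centers = @$initial;               # initial centers supplied by the caller
    my @assign  = (-1) x @$values;
    for my $iter (1 .. $max_iter) {
        my $changed = 0;
        # Assignment step: attach each value to its nearest center
        for my $i (0 .. $#$values) {
            my ($best, $best_d) = (0, abs($values->[$i] - $centers[0]));
            for my $c (1 .. $#centers) {
                my $d = abs($values->[$i] - $centers[$c]);
                ($best, $best_d) = ($c, $d) if $d < $best_d;
            }
            if ($assign[$i] != $best) {
                $assign[$i] = $best;
                $changed = 1;
            }
        }
        last unless $changed;              # converged: no value switched cluster
        # Update step: move each center to the mean of its members
        for my $c (0 .. $#centers) {
            my @members = map { $values->[$_] }
                          grep { $assign[$_] == $c } 0 .. $#$values;
            $centers[$c] = sum(@members) / @members if @members;
        }
    }
    return (\@centers, \@assign);
}

# Hypothetical expression ratios forming two obvious groups
my @expr = (1.0, 1.2, 0.8, 5.0, 5.3, 4.7);
my ($centers, $assign) = kmeans_1d(\@expr, [0, 10], 100);
printf "centers: %s\nassignment: %s\n", join(' ', @$centers), join(' ', @$assign);
```

The result depends on the starting centers, so in practice the algorithm is often run several times with different initializations.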
11Data Mining Techniques
- Machine Learning
- Neural Networks
- Support Vector Machines
- Decision Trees
- Inductive Logic Programming
- Fuzzy Logic
- Rough Sets
- Bayesian Belief Networks
12Data Mining Techniques
- More Techniques
- Clustering
- Self Organizing Maps
- Hidden Markov Models
- Maximum Likelihood Estimators
- Association Rules
13Kinds of Techniques
- Unsupervised
- Technique makes no assumption about a priori knowledge
- Useful when not much is known
- Supervised
- Attach class labels to data items
- Identify (or learn about) properties that distinguish classes
14Kinds of Techniques
- Unsupervised
- Clustering
- SOMs
- Supervised
- Support Vector Machines
- Neural Networks
- Bayesian Belief Networks
- HMMs
15Kinds of Techniques
- Supervised techniques require training
- Data split into training and test sets
- Many kinds of validation
- N-way cross-validation
- Leave-one-out testing
- Etc.
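The training/test split behind N-way cross-validation can be sketched as follows. The function name and the round-robin fold assignment are choices made for this example; in practice the items would normally be shuffled first so that folds do not follow the input order.

```perl
use strict;
use warnings;

# Split item indices 0 .. $n-1 into $k folds (round-robin assignment).
# Each fold serves once as the test set; the others form the training set.
sub k_fold_splits {
    my ($n, $k) = @_;
    my @folds;
    push @folds, [] for 1 .. $k;
    push @{ $folds[$_ % $k] }, $_ for 0 .. $n - 1;
    my @splits;
    for my $f (0 .. $k - 1) {
        my @train = map { @{ $folds[$_] } } grep { $_ != $f } 0 .. $k - 1;
        push @splits, {
            train => [ sort { $a <=> $b } @train ],
            test  => $folds[$f],
        };
    }
    return @splits;
}

# 3-way cross-validation over 6 items
for my $s (k_fold_splits(6, 3)) {
    print "train: @{ $s->{train} }  test: @{ $s->{test} }\n";
}
```

Leave-one-out testing is the special case where the number of folds equals the number of items, for example `k_fold_splits(6, 6)`.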
16Visualization of Results
- Graphs/Charts
- Rules
- If expression of X < 1035, then tissue is cancerous
- Largely dependent on the technique used
17Case Study: miRNA Project
- Started January 2002
- Participants
- Dr. Craig Struble
- Dr. Stephen Munroe
- Dr. John Simms
- Parthav Jailwala
- Peigang Li
- http://bistro.mscs.mu.edu/miRNA
18Case Study: miRNA Project
- Lee, R. C. & Ambros, V. An extensive class of small RNAs in Caenorhabditis elegans. Science 294, 862-864 (2001).
- Lagos-Quintana, M., Rauhut, R., Lendeckel, W. & Tuschl, T. Identification of novel genes coding for small expressed RNAs. Science 294, 853-858 (2001).
- Hutvágner, G. et al. A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. Science 293, 834-838 (2001).
- Lau, N. C., Lim, L. P., Weinstein, E. G. & Bartel, D. P. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294, 858-862 (2001).
19Research Questions
- Can we identify features of existing miRNAs that can be used to predict the existence of other miRNA genes?
- Which mRNAs (messenger RNAs) are targeted by miRNAs?
- What other family-wide behavioral and structural questions can be answered about miRNAs?
20Current Implementation
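[Figure: pipeline of the current implementation]
- GenBank -> Perl script -> miRNA library
- miRNA library -> Perl script -> BLAST reports
- BLAST reports -> Perl script -> Homolog library
- Homolog library -> Multiple sequence alignment
- Stage labels in the figure: Data Selection/Cleansing, Data warehouse, Initial mining and cleansing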
21Perl
- Practical Extraction and Report Language
- Language of choice for many bioinformaticians
- Excellent support for parsing/transforming data
- http://www.perl.com
22Data Collection with Perl
E.g., using Entrez
23Data Collection with Perl
Construct a URL to search and access information
in Entrez
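One way to construct such a URL is to assemble the query string from key/value pairs and percent-encode each part. The helper below is a sketch in core Perl, not the author's code: `build_query_url` is a name invented here, the encoding assumes ASCII input, and the parameter names simply mirror those used in the sample script on the later slides.

```perl
use strict;
use warnings;

# Build a URL from a base and an ordered list of key/value pairs,
# percent-encoding everything outside the unreserved character set.
sub build_query_url {
    my ($base, @params) = @_;
    my @pairs;
    while (my ($key, $value) = splice(@params, 0, 2)) {
        for ($key, $value) {
            s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
        }
        push @pairs, "$key=$value";
    }
    return $base . join('&', @pairs);
}

print build_query_url(
    'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?',
    cmd  => 'Search',
    db   => 'nucleotide',
    term => 'miRNA AND Homo sapiens',   # example search term with spaces
), "\n";
```

Encoding matters as soon as a search term contains spaces, ampersands, or brackets, all of which are common in Entrez queries.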
24Data Collection with Perl
- Use LWP module
- Makes network connections easy
- Use BioPerl (http://www.bioperl.org)
- Perl modules/objects for handling bioinformatics data
- Handles connections to databases
25Sample Perl Script
#!/usr/local/bin/perl
# Simple Entrez Query in Perl
# Craig A. Struble

# For internet requests and protocols
use LWP;

# A user agent for testing
my $ua = LWP::UserAgent->new;
$ua->agent('miRNA/0.1 ');

# URL base for Entrez search
my $NCBI_ENTREZ = 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?';
26Script (cont.)
# Building up the URL for the Entrez Search
my $search_URL = $NCBI_ENTREZ      # URL Base
    . 'cmd=Search'                 # Command
    . '&db=nucleotide'             # Database
    . '&dispmax=100'               # Max results
    . '&term=miRNA'                # Search term
    . '&doptcmdl=FASTA';           # result format

# Make an HTTP GET request for an Entrez search
my $req = HTTP::Request->new(GET => $search_URL);
$req->push_header(Connection => 'Keep-Alive');

# Get the response
my $res = $ua->request($req);
27Script (cont.)
# Check the response. If it's OK, print out the content
if ($res->is_success) {
    print $res->content;
} else {
    print $res->error_as_HTML;
    exit 1;
}
28Sample Result
<input name="showndispmax" type="hidden" value="100"><input name="page" type="hidden" value="0"></table></td></tr> </table>
<dl><dt><table cellpadding="0" cellspacing="0" width="100"><tr><td><input name="uid" type="checkbox" value="17646034"><b>1 </b>AJ421749. Homo sapiens micr...gi17646034</td>
<td align="right"><SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&amp;cmd=Display&amp;dopt=nucleotide_pubmed&amp;from_uid=17646034">PubMed, </a></SPAN>
<SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&amp;cmd=Display&amp;dopt=nucleotide_taxonomy&amp;from_uid=17646034">Taxonomy</a></SPAN> </td> </tr></table></dt></dl>
<pre>>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
</pre>
<dl><dt><table cellpadding="0" cellspacing="0" width="100"><tr><td><input name="uid" type="checkbox" value="17646061"><b>2 </b>AJ421776. Drosophila melano...gi17646061</td>
29Parsing Result
- Result is a big, ugly HTML file
- Need to take out data in <pre> tags
- Fortunately, Perl can come to the rescue!
30Parsing Result with Perl
#!/usr/local/bin/perl
# Use an HTML parser
use HTML::TreeBuilder;

# Extract out FASTA entries for each file on the command line
foreach my $file_name (@ARGV) {
    # Build an HTML Parse Tree
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file_name);

    # FASTA entries are in PRE tags
    my @entries = $tree->find_by_tag_name('pre');

    # Print out each entry
    foreach my $entry (@entries) {
        my @children = $entry->content_list;
        print $children[0] . "\n";    # first child is text content
    }
}
31Processed Results
>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
>gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14
TCAGTCTTTTTCTCTCTCCTA
>gi|17646060|emb|AJ421775.1|DME421775 Drosophila melanogaster microRNA miR-13b-2
TATCACAGCCATTTTGACGAGT
>gi|17646059|emb|AJ421774.1|DME421774 Drosophila melanogaster microRNA miR-13b-1
TATCACAGCCATTTTGACGAGT
>gi|17646058|emb|AJ421773.1|DME421773 Drosophila melanogaster microRNA miR-13a
TATCACAGCCATTTTGATGAGT
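Once the entries are in FASTA form, each header can be broken into fields for later selection steps. The sketch below uses only core Perl and assumes NCBI-style headers of the form `>gi|NUMBER|DB|ACCESSION|LOCUS description`; the function name and the field names in the returned hashes are choices made for this example.

```perl
use strict;
use warnings;

# Parse FASTA text into records with keys
# gi, db, accession, locus, description, sequence.
sub parse_fasta {
    my ($text) = @_;
    my (@records, $current);
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;                  # skip blank lines
        if ($line =~ /^>(\S+)\s*(.*)$/) {
            my ($id, $desc) = ($1, $2);
            my (undef, $gi, $db, $acc, $locus) = split /\|/, $id;
            $current = {
                gi => $gi, db => $db, accession => $acc,
                locus => $locus, description => $desc, sequence => '',
            };
            push @records, $current;
        } elsif ($current) {
            $line =~ s/\s+//g;                      # sequence may wrap over lines
            $current->{sequence} .= $line;
        }
    }
    return @records;
}

my $fasta = <<'END';
>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
>gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14
TCAGTCTTTTTCTCTCTCCTA
END

for my $r (parse_fasta($fasta)) {
    printf "%s\t%s\t%d nt\n",
        $r->{accession}, $r->{description}, length($r->{sequence});
}
```

BioPerl's Bio::SeqIO covers the same ground more robustly; the hand-written parser is shown only to make the record structure visible.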
32Getting BLAST Reports
- Can automate getting BLAST reports with Perl
- URL format documentation is available at
- http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
- Perl code not displayed
33Parsing BLAST Reports
- Use BioPerl Bio::Tools::BPlite
- Find high scoring pairs that contain surrounding sequence
- BLAST also reports original sequence hits
- Extract out matching sequence with up and downstream surrounding sequence
34Perl Script
#!/usr/local/bin/perl
# Create homolog database from BLAST reports
# Author: Craig A. Struble

# Various BioPerl modules to use
use Bio::Tools::BPlite;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::Seq;
35Script (cont.)
# Function: rev_comp
# Description: Calculates the reverse complement of a DNA sequence.
sub rev_comp {
    my @seqs;
    foreach my $seq (@_) {
        # Work on a copy: $seq aliases the caller's argument
        (my $rc = $seq) =~ tr/AaCcTtGg/TtGgAaCc/;
        $rc = reverse $rc;
        push @seqs, $rc;
    }
    # wantarray checks whether we were called in list context
    return wantarray ? @seqs : $seqs[0];
}
36Script (cont.)
# Function: around_seq
# Description: Returns the upstream and downstream sequence around an HSP
# Parameters:
#   hsp        - the high scoring pair
#   seq        - the sequence of reference
#   upstream   - number of basepairs upstream
#   downstream - number of basepairs downstream
sub around_seq {
    my ($hsp, $seq, $upstream, $downstream) = @_;
    # Code deleted due to space
    return $subseq;
}
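Since the body of around_seq is not shown, the following is only a guess at the kind of logic involved, written against plain strings rather than BioPerl objects. The function name `flanked_subseq`, its argument order, the 1-based inclusive coordinates, and the decision to return minus-strand hits as their reverse complement are all assumptions of this sketch, not the author's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the elided routine: extract a hit plus flanking
# sequence from a plain string, given 1-based inclusive hit coordinates.
sub flanked_subseq {
    my ($sequence, $start, $end, $strand, $upstream, $downstream) = @_;
    # On the minus strand, "upstream" of the hit lies at higher coordinates
    my ($left, $right) = $strand < 0 ? ($downstream, $upstream)
                                     : ($upstream, $downstream);
    my $from = $start - $left;
    my $to   = $end + $right;
    $from = 1                 if $from < 1;                  # clamp to bounds
    $to   = length($sequence) if $to > length($sequence);
    my $sub = substr($sequence, $from - 1, $to - $from + 1);
    if ($strand < 0) {
        # Report minus-strand hits in their own orientation
        $sub = reverse $sub;
        $sub =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $sub;
}

my $dna = 'AAACCCGGGTTT';                            # toy sequence
print flanked_subseq($dna, 4, 6,  1, 2, 1), "\n";    # plus-strand hit on CCC
print flanked_subseq($dna, 4, 6, -1, 2, 1), "\n";    # same hit, minus strand
```

Clamping at the sequence ends means a hit near either end yields a shorter window rather than an error, which matters when the flanking length is large relative to the GenBank record.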
37Script (cont.)
# Open the BLAST report
open(BLAST, "<", $ARGV[0]) or die "open failed: $!";
my $report = Bio::Tools::BPlite->new(-fh => \*BLAST);
my $gb = Bio::DB::GenBank->new;

# Open output file
my $out = Bio::SeqIO->new('-file' => ">$ARGV[1]", '-format' => 'fasta');

# Amount up and downstream to get
my $upstream = $ARGV[2];
my $downstream = $ARGV[3];
38Script (cont.)
while (my $sbjct = $report->nextSbjct) {
    my ($db, $accv, $acc, $rest) = split /\|/, $sbjct->name;
    my $seq = $gb->get_Seq_by_acc($acc);
    print $seq->accession_number . "\n";
    while (my $hsp = $sbjct->nextHSP) {
        my $seqstr = around_seq($hsp, $seq, $upstream, $downstream);
        my $subseq = Bio::Seq->new(
            '-seq'              => $seqstr,
            '-accession_number' => $seq->accession_number,
            '-display_id'       => $seq->accession_number
                . "_" . $hsp->subject->start
                . ".." . $hsp->subject->end
                . "_" . $hsp->subject->strand);
        $out->write_seq($subseq);
    }
}
39Results
>AC084471_10966..10987_-1
TCCCCCTTGGTCCCTTCTCATATACCATACTACATTTCTTTCAAAACTAACCGGGATTTT
TCAGGGGATTGCAGGATGATGGCTCTACACTGGGGTACGGTGAGGTAGTAGGTTGTATAG
TTTAGAATATTACTCTCGGTGAACTATGCAAGTTTCTACCTCACCGAATACCAGGTTCTC
AACTGCATCGTGTCAATTACTCTCAAACGACGGACACCTTCA
>AF274345_1763..1784_1
CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG
CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG
TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT
CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT
>Z70203_12425..12446_-1
CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG
CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG
TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT
CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT
40Multiple Sequence Alignment
- Currently using clustalw/clustalx
- Eventually generate web pages with sequence alignments
- Investigate conserved regions of the surrounding sequence
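Looking for conserved regions can start with a per-column tally over the aligned sequences. The sketch below is plain Perl and independent of clustalw: it assumes the input sequences are already aligned to equal length, the function name is invented here, and marking fully conserved columns with `*` simply imitates the familiar clustalw consensus line. The example sequences are the miR-13 entries from the processed Entrez results, which have equal length and need no gaps.

```perl
use strict;
use warnings;

# For each alignment column, report the most common residue and the fraction
# of sequences that carry it. Gaps ('-') count as a residue of their own.
sub column_conservation {
    my @aligned = @_;
    my $len = length $aligned[0];
    die "aligned sequences must have equal length\n"
        if grep { length($_) != $len } @aligned;
    my @result;
    for my $col (0 .. $len - 1) {
        my %count;
        for my $seq (@aligned) {
            my $residue = uc substr($seq, $col, 1);
            $count{$residue}++;
        }
        # Most frequent residue first; ties broken alphabetically
        my ($top) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        push @result, [ $top, $count{$top} / @aligned ];
    }
    return @result;
}

my @aln = qw(
    TATCACAGCCATTTTGACGAGT
    TATCACAGCCATTTTGACGAGT
    TATCACAGCCATTTTGATGAGT
);
my @cons = column_conservation(@aln);
print "$_\n" for @aln;
print join('', map { $_->[1] == 1 ? '*' : ' ' } @cons), "\n";
```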
41Multiple Sequence Alignment
42Future Work
- Process homolog library with RNA fold prediction software (mFold)
- Collect together fold structure information and other information
- Transform into logical representation for ILP analysis
- Store data in a database (Postgres)
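A logical representation for ILP typically means ground facts in a Prolog-like syntax. As a rough sketch of that transformation step, the code below turns one record into such facts. Everything specific in it is hypothetical: the predicate names, the attributes, and the values are invented for illustration and do not come from the project.

```perl
use strict;
use warnings;

# Turn one record (a hash reference) into Prolog-style ground facts,
# one fact per attribute: attribute(id, value).
sub to_facts {
    my ($id, $record) = @_;
    my @facts;
    for my $attr (sort keys %$record) {
        my $value = $record->{$attr};
        # Quote non-numeric values as Prolog atoms
        # (embedded apostrophes are not escaped in this sketch)
        $value = "'$value'" unless $value =~ /^-?\d+(?:\.\d+)?$/;
        push @facts, "$attr($id, $value).";
    }
    return @facts;
}

# Hypothetical attributes of one homolog candidate
my %candidate = (
    organism    => 'Homo sapiens',
    stem_length => 22,
    free_energy => -31.5,
);
print "$_\n" for to_facts('seq1', \%candidate);
```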
43Next Time
- Applications of
- Clustering
- Neural Networks
- Support Vector Machines
- Etc.
- Available tools to use, etc.