Title: Pattern Discovery in Bioinformatics
1Pattern Discovery in Bioinformatics
Jaak Vilo, vilo_at_ut.ee, http://biit.cs.ut.ee
2Topics
- Bioinformatics
- Pattern discovery
- Microarray data
- ...
3Pattern Discovery
- Choose the language (formalism) to represent the patterns (search space)
- Choose the rating for patterns, to tell which is better than others
- Design an algorithm that finds the best patterns from the pattern class, fast.
Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol. 1998;5(2):279-305.
4Bioinformatics
- Have the right data (real, relevant, interesting)
- Interpret and report the results (make someone's life easier)
- Contribute to the field of biology
5Bioinformatics
- Study of biological data with the goal to better
understand biology (JV)
6Level 0
ATCGCTGAATTCCAATGTG
[Figure: Levels 1 to 6 of DNA structure]
The eukaryotic genome can be thought of as six levels of DNA structure. The loops at Level 4 range from 0.5 kb to 100 kb in length. If these loops were stabilized, then the genes inside the loop would not be expressed.
7 DNA determines function (?)
Protein SwissProt/TrEMBL
Structure PDB/Molecular Structure Database
DNA GenBank / EMBL Bank
20 Amino Acids (3 nt = 1 AA)
4 Nucleotides
Function?
8A Simple Gene
A
B
C
Upstream/ promoter
Downstream
ATCGAAAT TAGCTTTA
Modifications
DNA
9Species and individuals
- Animals, plants, fungi, bacteria, ...
- Species
- Individuals
www.tolweb.org
13http://www.youtube.com/watch?v=bk7PW1FKMTI
14Gene regulation
- How are all genetic entities regulated?
- Networks
- parts lists and connections
- parameters and dynamics
15Possible mechanisms of action for secreted protein function in cell proliferation, either by intracellular second messenger pathways or by nuclear import. FT: transcription factor; Co-reg: co-regulator.
Planque, Cell Communication and Signaling 2006, 4:7, doi:10.1186/1478-811X-4-7
16http://wwwmgs.bionet.nsc.ru/mgs/gnw/genenet/viewer/
18Model of RNA Polymerase II Transcription Initiation Machinery. The machinery depicted here encompasses over 85 polypeptides in ten (sub)complexes: core RNA polymerase II (RNAPII) consists of 12 subunits; TFIIH, 9 subunits; TFIIE, 2 subunits; TFIIF, 3 subunits; TFIIB, 1 subunit; TFIID, 14 subunits; core SRB/mediator, more than 16 subunits; Swi/Snf complex, 11 subunits; Srb10 kinase complex, 4 subunits; and SAGA, 13 subunits.
F.C.P. Holstege, E.G. Jennings, J.J. Wyrick, Tong Ihn Lee, C.J. Hengartner, M.R. Green, T.R. Golub, E.S. Lander, and R.A. Young. Dissecting the Regulatory Circuitry of a Eukaryotic Genome. Cell 95: 717-728 (1998)
19Chen and Rajewsky, Nature Reviews Genetics 8, 93-103 (February 2007), doi:10.1038/nrg1990
20microRNA
23
>mmu-mir-678 MI0004635
GUGGACUGUGACUUGCAGAGCUGUGCUCCAAUAUGAGAGAUGGCCAUGCACCCGUGUCUCGGUGCAAGGACUGGAGGUGGCAGU
24Alignment of microRNA targets
CtTCCA-TCCTT--G-ACCGAGAt ENSMUSG00000041359
tCTtCAtaCCcTcaGgACCGAGAC ENSMUSG00000032436
CCTttAGcCCTT--GggCtGgGAg ENSMUSG00000028581
CCatttGaCtcc--aCACtGAGAg ENSMUSG00000034006
gCTCCAGgCCTT--GgACCtAGgC ENSMUSG00000001053
CactgccTaacT--GCACtGAGAa ENSMUSG00000032470
ggaCCAGgttTT--GCACCaAGgC ENSMUSG00000053175
CCTCagGaCCTT--GtgtCGAGAg ENSMUSG00000004040
agagaccTCgaa--GaACtGAGAa ENSMUSG00000031163
tagCCtGTCCTTctG-ACtGAGAC ENSMUSG00000006342
25Sequence patterns in BI
26Biological applications
- DNA
- Gene regulation (promoters, TF binding)
- Gene prediction (including TSS, to polyA site)
- Repeats, duplications, tandem repeats, etc. sequence features
- RNA
- Splicing of the mRNA
- microRNA targeting mRNA-s
- Secondary structure, 3D structure
- Proteins
- Protein families and their functional conserved elements
- Active sites and protein-protein interactions
- 3-D structure of proteins
27Gene Regulatory Signal Finding
Transcription Factor
Transcription Factor Binding Site
Goal: Detect Transcription Factor Binding Sites.
Eleazar Eskin, Columbia Univ.
28How can we find TF binding sites?
Tallinn
29How to detect signals in DNA?
- Biologists have in the past created some experimental data: a few examples
- Generalise from these
- Indirect evidence of being co-regulated
- Search for common signals
- New techniques (lab)
- Identify regions in which binding occurs (ChIP-chip)
- SELEX
30Position weight matrices (PWM, PSSM,...)
ACGTGA ACGATG AGGTGG ACGAGG TCGTGA ACGAGG ACGAGA TCGTGA
A 6 0 0 4 0 4
C 0 7 0 0 0 0
G 0 1 8 0 7 4
T 2 0 0 4 1 0
PWM
p/f log p/f
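The step from aligned sites to a count matrix and then to log(p/f) scores can be sketched as follows. This is a minimal illustration, not the method used in the talk: the uniform background f = 0.25, the pseudocount of 1 and the function names are assumptions introduced here.

```python
import math

# The eight aligned binding sites listed on the slide.
SITES = ["ACGTGA", "ACGATG", "AGGTGG", "ACGAGG",
         "TCGTGA", "ACGAGG", "ACGAGA", "TCGTGA"]
ALPHABET = "ACGT"

def count_matrix(sites):
    """Count each nucleotide at each position of the aligned sites."""
    length = len(sites[0])
    counts = {base: [0] * length for base in ALPHABET}
    for site in sites:
        for i, base in enumerate(site):
            counts[base][i] += 1
    return counts

def log_odds_matrix(counts, n_sites, background=0.25, pseudocount=1.0):
    """Convert counts to log2(p / f) scores.

    background (f) and pseudocount are assumed values, not from the slide.
    """
    total = n_sites + pseudocount * len(ALPHABET)
    return {
        base: [math.log2(((c + pseudocount) / total) / background) for c in row]
        for base, row in counts.items()
    }

def score(pwm, word):
    """Sum the per-position scores of a word of the same length as the PWM."""
    return sum(pwm[base][i] for i, base in enumerate(word))

counts = count_matrix(SITES)
pwm = log_odds_matrix(counts, len(SITES))
print(counts["A"])  # [6, 0, 0, 4, 0, 4]
```

A word that resembles the sites, such as ACGTGA, gets a positive score, while an unrelated word such as TTTTTT gets a negative one.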
31Motif matching
- Find all occurrences of the given motif(s)
- Databases of biologically valid motifs
- We'll touch on it a bit later
32Motif discovery
- Hypothesis: a (sub)set of sequences may share a common signal.
33Common biological role
- Genes known to have related roles and hence needed at the same time
- e.g. same Gene Ontology class
- Measurements by microarrays
- genes coordinately expressed should have common regulators (and signals)
34Microarrays
- Measure gene expression activity
- genes mRNA
- tiling anywhere in the genome
- Measure in vitro TF binding
- ChIP-chip
- Methylation etc features of DNA
35How to know what's in the cells?
Cells and mRNAs
I
36How to know what's in the cells?
Cells and mRNAs
I
II
37Microarray, the measurement device
Gene 3
Gene 1
Gene 2
38Microarray, after hybridisation
39Microarray, 2 colors mixed
40TIGR 32k Human Arrays
41Affymetrix Wafer and Chip Format
20 - 50 µm
20 - 50 µm
one oligonucleotide sequence per pixel
49 - 400 chips/wafer
1.0 cm
up to 1.3 million features/chip
42From microarray images to gene expression data
Intermediate data
Raw data
Final data
Array scans
Image quantifications
Samples
Spots
Genes
Gene expression levels
Spot/Image quantifications
43Eisen et al., PNAS 98
Spellman et al., Mol Biol Cell 98
44Tumor classification: 1) class prediction 2) class discovery
ALL AML
Golub et al., Science, Oct 15th 1999
- 38 samples of acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)
- 6817 genes
- classifier built based on the 50 best correlated genes
- tested on 34 new samples, 29 of them predicted accurately
ALL AML
45Hughes, T. R. et al. Functional Discovery via a Compendium of Expression Profiles. Cell 102 (2000), 109-126.
46Cluster of co-expressed genes, pattern discovery
in regulatory regions
Expression profiles
600 basepairs
Retrieve
Upstream regions
Find patterns over-represented within cluster
Genome Research 1998; ISMB (Intelligent Systems in Mol. Biol.) 2000
47Binomial or hypergeometric distribution tail
Background - ALL upstream sequences: 5 out of 25, p = 0.2
Cluster: the pattern occurs 3 times
P(3,6,0.2) is the probability of having ≥3 matches in 6 sequences
P(pattern,3,6,0.2) = 0.0989
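The value 0.0989 on this slide can be reproduced with a binomial tail. The sketch below assumes the reading that the background match probability is p = 5/25 = 0.2 and that the cluster has 6 sequences, 3 of which match; the function name is introduced here for illustration.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

background_p = 5 / 25          # 5 of 25 background sequences match
p_value = binomial_tail(3, 6, background_p)
print(round(p_value, 4))       # 0.0989
```

The smaller this tail probability, the less likely it is that the pattern is as frequent in the cluster as it is by chance alone.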
48ChIP-chip (or sequencing)
I
49ChIP-chip (or sequencing)
I
II
50ChIP-chip (or sequencing)
I
II
III
51ChIP-chip (or sequencing)
I
II
III
Microarray or sequencing
IV
52Clustering and Gene set enrichment
- Analysis of (any) HT data (cluster, visualise, test of significance, ...)
- Produces gene lists
- partitioning produces bags or sets
- sorting produces ranked lists
- How to interpret these results?
- What to do next?
53K-means: k = 200 vs 50
59Clustering: observe your first patterns
60gProfiler
61Find a common function (pattern)
- Experiment or analysis identifies a set of genes
- What is a common theme to these genes?
- Biological function, Gene Ontology, molecular pathway, shared regulatory motif or miRNA target site
63Previously known functions
F1
F4
F2
F5
F3
F6
F7
64Your query
F1
F4
F2
F5
Q1
F3
F6
Q2
F7
67GO Evidence Codes
From reviews or introductions
IDA - Inferred from Direct Assay
IMP - Inferred from Mutant Phenotype
IGI - Inferred from Genetic Interaction
IPI - Inferred from Physical Interaction
IEP - Inferred from Expression Pattern
TAS - Traceable Author Statement
NAS - Non-traceable Author Statement
IC - Inferred by Curator
ISS - Inferred from Sequence or structural Similarity
IEA - Inferred from Electronic Annotation
ND - Not Determined
automated
From primary literature
68Evidence codes
Genes
GO categories
P-value
Ordered list query
KEGG pathways
69KEGG Biosynthesis of steroids
70p-value
- tail of the hypergeometric distribution
- Multiple testing
- multiple sets to compare against
- different sizes of queries
- different sizes (and nrs) of reference sets
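The hypergeometric tail for one query against one reference set, followed by a correction for testing many sets, can be sketched as below. The Bonferroni correction is a generic stand-in chosen for illustration, not necessarily the correction used by the tools in this talk, and all numbers in the example are hypothetical.

```python
from math import comb

def hypergeom_tail(k, query_size, set_size, population):
    """P(overlap >= k) when query_size genes are drawn without replacement
    from a population in which set_size genes carry the annotation."""
    upper = min(query_size, set_size)
    favourable = sum(
        comb(set_size, i) * comb(population - set_size, query_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, query_size)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical example: 6000 genes, a category of 40 genes,
# a query of 100 genes of which 8 fall into the category.
p = hypergeom_tail(8, 100, 40, 6000)
print(p)
print(bonferroni([p, 0.5, 0.04]))
```

Because the number of reference sets and the query sizes vary, the same raw p-value can mean different things in different analyses, which is the point of the bullets above.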
71SCS - Set Counts and Sizes threshold
72Motif discovery in sequences
- Deterministic and probabilistic
- Pattern driven vs sequence driven
- Descriptive or Discriminative
73Cluster of co-expressed genes, pattern discovery
in regulatory regions
Expression profiles
600 basepairs
Retrieve
Upstream regions
Find patterns over-represented within cluster
Genome Research 1998; ISMB (Intelligent Systems in Mol. Biol.) 2000
76Pattern vs cluster strength
The pattern probability vs. the average
silhouette for the cluster
The same for randomised clusters
Vilo et al., ISMB 2000
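77Suffix tree: represent all suffixes
CATAT => suffix tree
Positions: 123456
Suffixes: CATAT 1, ATAT 2, TAT 3, AT 4, T 5, the empty suffix 6
[Figure: suffix tree of CATAT with edge labels AT, T and CATAT and leaves numbered 1 to 6]
O(n) time and space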
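To make the idea of "all suffixes in one tree" concrete, here is a deliberately naive suffix trie. It takes O(n^2) time and space, so it is not the O(n) suffix tree construction the slide refers to; the end marker "$" and the dictionary layout are assumptions made for this sketch.

```python
def build_suffix_trie(text, end="$"):
    """Naive suffix trie. Every node stores the 1-based start positions
    of the suffixes that pass through it."""
    text += end
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root

def occurrences(trie, pattern):
    """1-based start positions of a non-empty pattern in the indexed text."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["starts"]

trie = build_suffix_trie("CATAT")
print(occurrences(trie, "AT"))  # [2, 4]
```

Walking down from the root along a pattern reaches exactly the suffixes that start with it, which is why every substring query reduces to a path lookup.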
78SPEXS - Sequence Pattern EXhaustive Search
Jaak Vilo, 1998, 2002
- User-definable pattern language: substrings, character groups, wildcards, flexible wildcards (c.f. PROSITE)
- Fast exhaustive search over pattern language
- Lazy suffix tree construction-like algorithm (Kurtz, Giegerich)
- Analyze multiple sets of sequences simultaneously
- Restrict search to most frequent patterns only (in each set)
- Report most frequent patterns, patterns over- or underrepresented in selected subsets, or patterns significant by various statistical criteria, e.g. by binomial distribution
79SPEXS 1998
Jaak Vilo: Discovering Frequent Patterns from Strings. Technical Report C-1998-9 (pp. 20), May 1998. Department of Computer Science, University of Helsinki.
83Sequence patterns: the basis of SPEXS
A
G
A
A
T
C
G
C
C
C
GCAT (4 positions)
GCATA (3 positions)
GCATA.
GCATA.C
84Implementation example
Input 1: ACGTGCACGATATCG
Input 2: AGTACATGAAGCAGG
P = pattern, e.g. AC: AC.pos = 2, 9, 23; ACG.pos = 3, 10
Convert into internal representation:
................................. 36
ACGTGCACGATATCGAGTACATGAAGCAGG
11111-11111-11111-22222-22222-22222
86SPEXS general algorithm
- 1. S = input sequences ( |S| = n )
- 2. ε = empty pattern, ε.pos = {1,...,n}
- 3. enqueue( order, ε, priority )
- 4. while p = dequeue( order )
- 5. generate all allowed extensions (p', p'.pos) of p
- 6. enqueue( output, p', fitness(p') )
- 7. enqueue( order, p', priority(p') )
- 8. while p = dequeue( output )
- 9. Output p
Jaak Vilo: Discovering Frequent Patterns from Strings. Technical Report C-1998-9 (pp. 20), May 1998. Department of Computer Science, University of Helsinki.
Jaak Vilo: Pattern Discovery from Biosequences. PhD Thesis, Department of Computer Science, University of Helsinki, Finland. Report A-2002-3, Helsinki, November 2002, 149 pages
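The queue-driven loop above can be illustrated with a much reduced sketch. It is not the SPEXS implementation: it handles plain substrings only, uses the number of supporting sequences both as priority (frequent-first) and as fitness, and the parameters min_support and max_length are assumptions added to keep the search finite.

```python
import heapq

def spexs_sketch(sequences, min_support=2, max_length=4):
    """Frequent-first search for substrings found in >= min_support sequences."""
    # Each occurrence is (sequence index, offset of the next character to read).
    start = [(s, i) for s, seq in enumerate(sequences) for i in range(len(seq))]
    order = [(-len(sequences), "", start)]   # max-heap via negated support
    output = []
    while order:
        _, pattern, pos = heapq.heappop(order)
        if len(pattern) >= max_length:
            continue
        # Generate all one-character extensions together with their positions.
        extensions = {}
        for s, i in pos:
            if i < len(sequences[s]):
                extensions.setdefault(sequences[s][i], []).append((s, i + 1))
        for ch, new_pos in extensions.items():
            support = len({s for s, _ in new_pos})
            if support >= min_support:
                new_pattern = pattern + ch
                output.append((support, new_pattern))
                heapq.heappush(order, (-support, new_pattern, new_pos))
    # Report the most frequent patterns first, ties in alphabetical order.
    return sorted(output, key=lambda item: (-item[0], item[1]))

results = spexs_sketch(["ACGTGCACGATATCG", "AGTACATGAAGCAGG"], max_length=2)
print(results)
```

Swapping the priority function changes the traversal order (breadth-first, depth-first, frequent-first) without touching the rest of the loop, which is the design idea behind the generic "order" queue.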
87Order
[Figure: two search trees, Breadth-first and Depth-first, with nodes numbered 1 to 10 in visiting order]
88Order
[Figure: search tree, Frequent-first, with node labels 50, 40, 6, 4, 4, 34, 2, 6, 24, 4]
89SPEXS count and memorize
i...v....x....v....x
abracadabradadabraca
a: 1,4,6,8,11,13,15,18,20 (occurrences)
a: 2,5,7,9,12,14,16,19,21 (positions following each occurrence)
90SPEXS extend
i...v....x....v....x
abracadabradadabraca
a: 2,5,7,9,12,14,16,19,21
b: 2,9,16
c: 5,19
d: 7,12,14
91SPEXS find frequent first
i...v....x....v....x
abracadabradadabraca
a: 2,5,7,9,12,14,16,19,21
b: 2,9,16
d: 7,12,14
92SPEXS group positions
i...v....x....v....x
abracadabradadabraca
a: 2,5,7,9,12,14,16,19,21
. : 2,5,7,9,12,14,16,19,21
b: 2,9,16
d: 7,12,14
[bd]: 2,7,9,12,14,16
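The position lists on the last four slides can be recomputed with a few lines. This is a sketch of the bookkeeping only, using 1-based positions as on the slides; the function names are introduced here.

```python
TEXT = "abracadabradadabraca"

def occurrences_after(text, ch):
    """1-based positions immediately following each occurrence of ch."""
    return [i + 2 for i, c in enumerate(text) if c == ch]

def extend(text, pos):
    """Split a position list by the character found at each position."""
    groups = {}
    for p in pos:
        if p <= len(text):          # position 21 lies past the end of the text
            groups.setdefault(text[p - 1], []).append(p)
    return groups

def group(groups, chars):
    """Merge the position lists of several characters (a character group)."""
    return sorted(p for ch in chars for p in groups.get(ch, []))

a_pos = occurrences_after(TEXT, "a")
by_char = extend(TEXT, a_pos)
print(a_pos)                 # [2, 5, 7, 9, 12, 14, 16, 19, 21]
print(by_char)               # {'b': [2, 9, 16], 'c': [5, 19], 'd': [7, 12, 14]}
print(group(by_char, "bd"))  # [2, 7, 9, 12, 14, 16]
```

A wildcard keeps the whole list, a single character keeps one bucket, and a character group is the sorted union of several buckets.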
93The wildcards
GCAT.X
94The wildcards
GCAT.A
95The wildcards
GCAT.{3,6}X
96The wildcards not too many
w0
a
.3.6
w0
b
w1
w0
97Multiple data sets
D1
D2
D3
4/3 (6)
3/3 (12)
2/2 (9)
98GPCR coupling
Agonist
Signal
Current perspective
GPCR
Effector Enzyme channels
Intracellular messengers
G-protein
99Our Computational Approach
- Membrane topology 7TMHMM
- Intracellular domains of ? 100 receptor sequences with well-characterised and non-promiscuous coupling (split into Gs, Gi/o and Gq/11)
Steffen Möller, Jaak Vilo, Michael D.R. Croning. Prediction of the coupling specificity of G protein coupled receptors to their G proteins. ISMB-2001, July 2001. Bioinformatics 2001; 17: S174-S181.
100
RK....R.0,9EK
DR.4,11H...AGS
FR....RK.0,3L
S...L.1,10TILV
C.FWY.2,11K
ILV.L.6,10A.T
S....RKA.3,10S
AILV.1,5Y..ILV.T
LR.1,9T...ILV
Steffen Möller, Jaak Vilo, Michael D.R. Croning. Prediction of the coupling specificity of G protein coupled receptors to their G proteins. ISMB-2001, July 2001. Bioinformatics 2001; 17: S174-S181.
101Receptor Match Positions
Möller, Vilo, Croning, ISMB 2001
102Improving upon discrete patterns
103101 Sequences relative to ORF start
YGR128C 100
>YAL036C chromo1 coord(76154-75048(C)) start-600 end2 seq(76152-76754)
TGTTCTTTCTTCTT
CTGCTTCTCCTTTTCCTTTTTTTCCTTCTCCTTTTCCTTCTTGGACTTTA
GTATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTTTCTTGCTTC
TTCTTCGATTGCTTCAAAGTAGACATGAAGTCGCCTTCAATGGCCTCAGC
ACCTTCAGCACTTGCACTTGCTTCTCTGGAAGTGTCATCTGCACCTGCGC
TGCTTTCTGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGTTCTGGGCG
GCGTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACT
CTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATC
CCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGCAAGATCTCTTGCTTT
CAATGGGCTTAAAGCTTGAAAAATTTTTTCACATCACAAGCGACGAGGGC
CCGTTTTTTTCATCGATGAGCTATAAGAGTTTTCCACTTTTAAGATGGGA
TATTACGGTGTGATGAGGGCGCAATGATAGGAAGTGTTTGAAGCTAGATG
CAGTAGGTGCAAGCGTAGAGTTGTTGATTGAGCAAA_ATG_
>YAL025C chromo1 coord(101147-100230(C)) start-600 end2 seq(101145-101747)
CTTAGAAGATAAAGTAGTGAATT
ACAATAAATTCGATACGAACGTTCAAATAGTCAAGAATTTCATTCAAAGG
GTTCAATGGTCCAAGTTTTACACTTTCAAAGTTAACCACGAATTGCTGAG
TAAGTGTGTTTATATTAGCACATTAACACAAGAAGAGATTAATGAACTAT
CCACATGAGGTATTGTGCCACTTTCCTCCAGTTCCCAAATTCCTCTTGTA
AAAAACTTTGCATATAAAATATACAGATGGAGCATATATAGATGGAGCAT
ACATACATGTTTTTTTTTTTTTAAAAACATGGACTCGAACAGAATAAAAG
AATTTATAATGATAGATAATGCATACTTCAATAAGAGAGAATACTTGTTT
TTAAATGAGAATTGCTTTCATTAGCTCATTATGTTCAGATTATCAAAATG
CAGTAGGGTAATAAACCTTTTTTTTTTTTTTTTTTTTTTTTGAAAAATTT
TCCGATGAGCTTTTGAAAAAAAATGAAAAAGTGATTGGTATAGAGGCAGA
TATTGCATTGCTTAGTTCTTTCTTTTGACAGTGTTCTCTTCAGTACATAA
CTACAACGGTTAGAATACAACGAGGAT_ATG_
...
>YBR084W chromo2 coord(411012-413936) start-600 end2 seq(410412-411014)
CCATGTATCCAAGACCTGCTGAAGATGCTT
ACAATGCCAATTATATTCAAGGTCTGCCCCAGTACCAAACATCTTATTTT
TCGCAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCA
AAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGT
CTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTA
GTAAGAACCGTCACAGATATGATTGAGCACGGTACAATTATGTATTAGTC
AAACATTACCAGTTCTCGAACAAAACCAAAGCTACTCCTGCAACACTCTT
CTATCGCACATGTATGGTTCTTATTGTTTCCCGAGTTCTTTTTTACTGAC
GCGCCAGAACGAGTAAGAAAGTTCTCTAGCGCCATGCTGAAATTTTTTTC
ACTTCAACGGACAGCGATTTTTTTTCTTTTTCCTCCGAAATAATGTTGCA
GCGGTTCTCGATGCCTCAAGAATTGCAGAAGTAAACCAGCCAATACACAT
CAAAAAACAACTTTCATTACTGTGATTCTCTCAGTCTGTTCATTTGTCAG
ATATTTAAGGCTAAAAGGAA_ATG_
GATGAG.T 152/70 2453/508 R:7.52345 BP:1.02391e-33
G.GATGAG.T 139/49 2193/222 R:13.244 BP:2.49026e-33
AAAATTTT 163/77 2833/911 R:4.95687 BP:5.02807e-32
TGAAAA.TTT 145/53 2333/350 R:8.85687 BP:1.69905e-31
TG.AAA.TTT 153/61 2538/570 R:6.45662 BP:3.24836e-31
TG.AAA.TTTT 140/43 2254/260 R:10.3214 BP:3.84624e-30
TGAAA..TTT 154/65 2608/645 R:5.82106 BP:1.0887e-29
...
GATGAG.T TGAAA..TTT
104.G.GATGAG.T. 39 seq
.G.GATGAG.T. 39 seq (vs 193) p 2.5e-33
105-1 .G.GATGAG.T. 61 seq (vs 1292)
-1 .G.GATGAG.T. 61 seq (vs 1292) p 1.4e-19
106-2 .G.GATGAG.T. 91 seq
-2 .G.GATGAG.T. 91 seq (vs 5464)
107-3 .G.GATGAG.T. 98 seq
108Jaak Vilo: Pattern Discovery from Biosequences. PhD Thesis, Department of Computer Science, University of Helsinki, Finland. Series of Publications A, Report A-2002-3, Helsinki, November 2002, 149 pages
109-2 .G.GATGAG.T. 91 seq
These hits result in a PWM
110PWM based on all previous hits, here shown
highest-scoring occurrences in blue
111All against all approximate matching
For every subsequence of every sequence: match approximately against all the sequences. Approximate hits define PWM matrices (not all positions vary equally). Look for ALL PWM-s derived from the data that are enriched in the data set (vs. background).
Hendrik Nigul, Jaak Vilo
112Dynamic programming
- A small number of edit operations allows the search to be limited efficiently to a band around the main diagonal
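The band restriction can be sketched as follows: if at most k edit operations are allowed, only cells within k of the main diagonal of the dynamic programming table can matter. This is a generic illustration and not the trie-based implementation discussed in the talk. For simplicity it allocates full rows, although only the band is filled.

```python
def banded_edit_distance(a, b, k):
    """Edit distance between a and b if it is at most k, otherwise None.

    Only cells with |i - j| <= k are computed; every other cell is treated
    as k + 1, which stands for "too far".
    """
    if abs(len(a) - len(b)) > k:
        return None
    too_far = k + 1
    prev = [j if j <= k else too_far for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [too_far] * (len(b) + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion from a
                         cur[j - 1] + 1,      # insertion into a
                         too_far)
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None

print(banded_edit_distance("ACGTGA", "ACGAGG", 2))  # 2
```

With k fixed, the work per row is proportional to k rather than to the length of the other sequence.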
113Suffix Tree
A
T
G
C
G
G
T
124,212,223
114Trie based all against all approximate matching
- trieindex
- trieagrep
- trieallagrep
- triematrix
Hendrik Nigul, Jaak Vilo
115More directions for PD
116Multiple alignment
Marko Hyvönen
117Artificial setup
118Challenge problem
- Pevzner, P., Sze, S.H. 2000. Combinatorial Approaches to Finding Subtle Signals in DNA Sequences. Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, 269-278.
- Plant into every sequence a string X of length l, with d characters randomly altered.
- What was the original X?
- (l,d)-problem
119(4, 1) - problem
ACTG
Seed -
CCTG
12 possible planted versions
GCTG
TCTG
AATG
AGTG
ATTG
....
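The count of 12 planted versions for the (4,1)-problem follows from 4 positions times 3 alternative nucleotides. A short sketch that enumerates them, with a function name introduced here for illustration:

```python
from itertools import combinations, product

def planted_versions(seed, d, alphabet="ACGT"):
    """All strings that differ from seed in exactly d positions."""
    versions = set()
    for positions in combinations(range(len(seed)), d):
        # For each chosen position, any character except the original one.
        choices = [[c for c in alphabet if c != seed[p]] for p in positions]
        for replacement in product(*choices):
            chars = list(seed)
            for p, c in zip(positions, replacement):
                chars[p] = c
            versions.add("".join(chars))
    return sorted(versions)

versions = planted_versions("ACTG", 1)
print(len(versions))  # 12
print(versions)
```

For realistic settings such as (15,4) the number of variants grows quickly, and two planted instances can differ from each other in up to 2d positions, which is what makes the signal subtle.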
120Graph constructed by WINNOWER
- For (15,4)-signal
- connect all words with distance at most 8
- atgaccgggatactgatAgAAgAAAGGttGGGtataatggagtacgataa
- atgacttcAAtAAAAcGGcGGGtgctctcccgattttgagtatccctggg
- gcaatcgcgaaccaagctgagaattggatgtcAAAAtAAtGGaGtGGcac
- gtcaatcgaaaaaacggtggaggatttcAAAAAAAGGGattGgaccgctt
real signals
signal edges
spurious signals
spurious edges
from Eleazar Eskin
121Pairs of motifs
122Composite Patterns
atgactAGGGTAACATgattgagaccagtgaCAGGAATTCactgacaa
Conserved Region
Conserved Region
Unconserved Spacing
- Co-occurring patterns
- (GuhaThakurta, Stormo 2001)
- Fixed Order (Dyad Problem)
- (van Helden et al. 2000)
- (Gelfand et al. 2000)
- (Marsan, Sagot 2000)
from Eleazar Eskin
123Patterns with Mismatches
AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG
Mismatches: d = 8
Instances
AcAAAAcAGGGGtGG-11-CTGAcTCTtATAaAG
AAAcAAAgaGtGGtG-12-CTGcgTCTAATtcAG
AtAAAAAtcGGGcGG-10-CTGATcCTAtTACcG
AAAAAtAAGGGGcGG-14-CgGAcTCTAATgCAG
Eleazar Eskin
124Sample Sequences
AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG
actgatAAAAAAAAGGGGGGGggcgtacacattagCTGATTCCAATACAGacgt
aaAAAAAAAAGGGGGGGaaacttttccgaataCTGATTCCAATACAGgatcagt
atgacttAAAAAAAAGGGGGGGtgctctcccgattttcCTGATTCCAATACAGc
aggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAGta
ggaggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAG
Eleazar Eskin
125Sample Sequences
AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG
actgatAAAAtAAAGcGGGaGggcgtacacattagCaGAcTCCAATtgAGacgt
aaAAtAAAAAaaaGGcGaaacttttccgaataCTGAcTCCAAagCAGgatcagt
atgacttAAcAAtAgGGGaGGGtgctctcccgattttcCTGcTaCCAAgAtAGc
aggAAtAAAAtGGaGGGGagccctaacggacttaatCCaGATTgCAcTAaAata
ggaggAAgAAAAAGGaGaGGagccctaacggacttaatCtTGAaTCCtATACAc
Eleazar Eskin
126Traditional Approach Weaknesses
atgactAGGGTAACATgattgagaccagtgaCAGGAATTCactgacaa
Conserved Region
Conserved Region
Unconserved Spacing
Traditional Approach: Find each conserved region separately.
Problem: Each region is too weak.
Eleazar Eskin
127Traditional Approach Solution
atgactAGGGTAACATgattgagaccagtgaCAGGAATTCactgacaa
Conserved Region
Conserved Region
Unconserved Spacing
Traditional Approach: Find each conserved region separately.
Problem: Each region is too weak.
Our approach: Find both regions simultaneously.
Conserved Region
Conserved Region
A single pattern after preprocessing.
Eleazar Eskin
128Combinations and modules
- Regulatory signals do not work alone
- Motif co-occurrences
129- 700 bp windows with at least 13 binding site occurrences
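Finding such dense windows is a simple sliding-window count over sorted site positions. The sketch below assumes that predicted binding site positions are already available as integers; the function name, the choice to start windows at sites, and the example coordinates are all hypothetical.

```python
def dense_windows(site_positions, window=700, min_sites=13):
    """Return (start, end, count) for every window that begins at a site and
    contains at least min_sites sites in [start, start + window)."""
    sites = sorted(site_positions)
    hits = []
    right = 0
    for left, start in enumerate(sites):
        # Advance the right pointer to the first site outside the window.
        while right < len(sites) and sites[right] < start + window:
            right += 1
        count = right - left
        if count >= min_sites:
            hits.append((start, start + window, count))
    return hits

# Hypothetical example: 13 sites spaced 50 bp apart, plus two distant ones.
example_sites = list(range(0, 650, 50)) + [5000, 9000]
print(dense_windows(example_sites))  # [(0, 700, 13)]
```

Both pointers only move forward, so after sorting the scan is linear in the number of sites.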
131- 700 bp windows with at least 13 binding site occurrences
132Using multiple species
- Phylogenetic footprinting
- Phylogenetic shadowing
- Conservation across species
134Phylogenetic footprinting
- McCue L, Thompson W, Carmack C, Ryan MP, Liu JS, Derbyshire V, Lawrence CE. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res. 2001 Feb 1;29(3)
- Mathieu Blanchette and Martin Tompa. Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research Vol. 12, Issue 5, 739-748, May 2002
135Men and mice are alike
13726 species
13845 species
139Phylogenetic footprinting
Study the same gene in many species
human
ape
mouse
fish
chicken
If preserved during evolution then must be
important for something!!!
140What if species too similar?
- Almost entire genome is highly similar
- Signal gets lost
141Phylogenetic shadowing
- Use many closely related species (monkeys, apes, ...)
- All regions that differ are shadowed out
- Regions that show no differences in (almost) any species are probably important
- Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science. 2003 Feb 28;299(5611):1331-3.
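The intuition of "shadowing out" columns that differ can be shown with a toy column filter over a multiple alignment. This is a strong simplification: the published method scores columns under a phylogenetic model, whereas the sketch below only counts mismatches against the first sequence. The alignment and the function name are hypothetical.

```python
def conserved_columns(alignment, max_differences=0):
    """Indices of alignment columns in which at most max_differences
    sequences differ from the first (reference) sequence."""
    reference = alignment[0]
    keep = []
    for col in range(len(reference)):
        differing = sum(1 for seq in alignment[1:] if seq[col] != reference[col])
        if differing <= max_differences:
            keep.append(col)
    return keep

# Hypothetical alignment of three closely related species.
toy_alignment = ["ACGTAC",
                 "ACGTTC",
                 "ACGAAC"]
print(conserved_columns(toy_alignment))                     # [0, 1, 2, 5]
print(conserved_columns(toy_alignment, max_differences=1))  # [0, 1, 2, 3, 4, 5]
```

With only a few close species most columns survive, which is why the approach needs many species to accumulate enough differences.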
142Phylogenetic shadowing
http://chr21.molgen.mpg.de/images/projects/BACH1_small.jpg
143Alignment of functional elements
- Outi Hallikas, Kimmo Palin, Natalia Sinjushina, Reetta Rautiainen, Juha Partanen, Esko Ukkonen and Jussi Taipale. Genome-wide Prediction of Mammalian Enhancers Based on Analysis of Transcription-Factor Binding Affinity. Cell 124(1), 13 January 2006, Pages 47-59.
- Kimmo Palin, Jussi Taipale and Esko Ukkonen. Locating potential enhancer elements by comparative genomics using the EEL software. Nature Protocols 1(1), 27 June 2006, Pages 368-374.
- E. Blanco, X. Messeguer, T. Smith and R. Guigó. "Transcription Factor Map Alignment of Promoter Regions." PLoS Computational Biology, 2(5): e49 (2006)
145Ranked list data
Tartu
146Problem
- Target vs background data?
- Strong vs weak
- No clear cut
147- (c1) the cutoff used to partition data into a target set and background set of sequences is often chosen arbitrarily
- (c2) lack of an exact statistical score and p-value for motif enrichment
- (c3) a need for an appropriate framework that accounts for multiple motif occurrences in a single promoter.
- (c4) motif discovery methods tend to report presumably significant motifs even when applied on randomly generated data. These motifs are clear cases of false positives and should be avoided.
152Summary
153Pattern languages
- Substrings: ATCGA
- Character groups: ATCGC.A
- Unrestricted wildcards: AT.CG
- Restricted wildcards: AT.{2,5}CG
- Combine all above: A.TGC.{1,3}GTAC TGC GCA
- Closures: TGAAATTT
- Allow mismatches, insertions, deletions
- Probabilistic versions of the above
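Most of the deterministic pattern classes above map directly onto regular expressions, which gives a quick way to try them out. The sketch uses Python's re module; the test sequence and the concrete patterns (in particular the character group A[TC]G) are hypothetical examples chosen here, not patterns from the talk.

```python
import re

def find_matches(pattern, sequence):
    """0-based start positions of all matches, including overlapping ones.

    The lookahead (?=...) lets the scan report a match at every start
    position instead of skipping past each match.
    """
    regex = re.compile("(?=(" + pattern + "))")
    return [m.start() for m in regex.finditer(sequence)]

sequence = "ATCGATTTCGACG"                      # hypothetical test sequence

print(find_matches("ATCGA", sequence))          # substring: [0]
print(find_matches("A[TC]G", sequence))         # character group: [10]
print(find_matches("AT..CG", sequence))         # fixed wildcards: [4]
print(find_matches("AT.{2,5}CG", sequence))     # restricted wildcard: [4]
```

Mismatches, insertions and deletions fall outside what plain regular expressions express compactly, which is where approximate matching and probabilistic models come in.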
154Probabilistic motifs
- Gary Stormo lab
- EM-algorithm
- MEME (Bailey, Elkan)
- Gibbs Sampling
- AlignAce (Roth et al)
- (Rocke, Tompa)
- Neural networks
- HMM models, SCFG
155The advantages and disadvantages of discrete
patterns
- Advantages
- simple and easily interpretable objects
- easier to discover from scratch (i.e., if no additional information about the sequences is given), particularly in noisy data
- Disadvantages
- limited descriptive power (no weights can be attributed to alternatives)
- No probability of a match
156Fitness measures
- Ratio (times over-represented)
- ROC AUC
- Probability (p-value)
- Domain specific (biological) score
157Multiple testing due to
- Large pattern (search) space
- Many data sets analysed
- Different cut-off thresholds
158Search algorithms
- Pattern driven
- generate all possible patterns, evaluate
- Data Driven
- e.g. align data sets, read out patterns
- EM, Gibbs, ...
- (all probabilistic methods)
159Search algorithm
- Pattern driven
- generate all possible patterns, evaluate
- Data Driven
- e.g. align data sets, read out patterns
- Combined
- Use data as a guide for exhaustive search through
pattern space
160Regular pattern tools
- SPEXS (Jaak Vilo)
- Pratt (Inge Jonassen, U. of Bergen)
- TEIRESIAS (IBM Research, Rigoutsos, Floratos)
- MobyDick etc (Harmen Bussemaker)
- RSA-tools (Jacques van Helden)
- Martin Tompa
- Marsan, Sagot (suffix tree gapped motifs)
- Jensen, Knudsen (suffix tree based substrings)
- Verbumculus (Stefano Lonardi, A. Apostolico)
164Anno 2007 (BIIT and Quretec)
165Tartu, ESTONIA