Title: Bioinformatics
1Bioinformatics
- The Machine Learning Approach
- Pierre Baldi
- School of Information and Computer Science
- Dept. Biological Chemistry
- Institute for Genomics and Bioinformatics
- University of California, Irvine
2(No Transcript)
3(No Transcript)
4OUTLINE
- INTRODUCTION
- BIOINFORMATICS: THE DATA
- MACHINE LEARNING: PROBABILISTIC MODELING
- EXAMPLES OF MODELS
- MARKOV MODELS
- NEURAL NETWORKS
- HIDDEN MARKOV MODELS
- PROTEIN APPLICATIONS
- DNA APPLICATIONS
- SOFTWARE DEMONSTRATION
- GRAPHICAL MODELS
- PROTEIN STRUCTURE PREDICTION (GMs and RNNs)
- STOCHASTIC GRAMMARS
- RNA MODELING
- DNA MICROARRAYS
- SYSTEMS BIOLOGY
- DISCUSSION
5- tggaagggctaattcactcccaacgaagacaagatatccttgatctgtgg
atctaccacacacaaggctacttccctgattagcagaactacacaccagg
[... complete genome of a virus, roughly 9.7 kb of raw DNA sequence, continues for some 190 further lines; truncated here ...]
gtctgttgtgtgactctggtaactagagatccctcagacccttttagtca
gtgtggaaaatctctagca
6(No Transcript)
7(No Transcript)
8(No Transcript)
9(No Transcript)
10SCALES
Organism        Genome           Genes
virus           10-100,000 b     10-100s
bacteria        5 Mb             4,000
single cell     15 Mb            6,000
simple animal   100 Mb           15,000
homo            3,000 Mb         30-40,000
11(No Transcript)
12(No Transcript)
13(No Transcript)
14(No Transcript)
15(No Transcript)
16Examples of Computational Problems
- Physical and Genetic Maps
- Genome assembly
- Pairwise and Multiple Alignments
- Motif Detection/Discrimination/Classification
- Database Searches and Mining
- Phylogenetic Tree Reconstruction
- Gene Finding and Gene Parsing
- Protein Secondary Structure Prediction
- Protein Tertiary Structure Prediction
- Protein Function Prediction
- Comparative genomics and evolution
- DNA microarray analysis
- Molecular docking/Drug design
- Gene regulation/regulatory networks
- Systems biology
17Machine Learning
- Extract information from the data automatically (inference) via a process of model fitting (learning from examples).
- Model Selection: Neural Networks, Hidden Markov Models, Stochastic Grammars, Bayesian Networks.
- Model Fitting: gradient methods, Monte Carlo methods, etc.
- Machine learning approaches are most useful in areas where there is a lot of data but little theory.
18Three Key Factors for Expansion
- Data Mining/Machine Learning expansion is fueled by:
- Progress in sensors, data storage, and data management.
- Computing power.
- Theoretical framework.
19Intuitive Approach
- Look at ALL available data, background information, and hypotheses.
- Use probabilities to express PRIOR knowledge.
- Use probabilities for inference, model selection, model comparison, etc., by computing POSTERIOR distributions and deriving UNIQUE answers.
20Deduction and Induction
- DEDUCTION:
- If A => B and A is true,
- then B is true.
- INDUCTION:
- If A => B and B is true,
- then A is more plausible.
21Bayesian Statistics
- Bayesian framework for induction: we start with a hypothesis space and wish to express relative preferences in terms of background information (the Cox-Jaynes axioms).
- Axiom 0: Transitivity of preferences.
- Theorem 1: Preferences can be represented by a real number p(A).
- Axiom 1: There exists a function f such that p(not A) = f(p(A)).
- Axiom 2: There exists a function F such that p(A,B) = F(p(A), p(B|A)).
- Theorem 2: There is always a rescaling w such that p*(A) = w(p(A)) lies in [0,1] and satisfies the sum and product rules.
22Probability as Degree of Belief
- Sum Rule:
- P(not A) = 1 - P(A)
- Product Rule:
- P(A and B) = P(A) P(B|A)
- Bayes' Theorem:
- P(B|A) = P(A|B) P(B) / P(A)
- Induction Form:
- P(M|D) = P(D|M) P(M) / P(D)
- Equivalently:
- log P(M|D) = log P(D|M) + log P(M) - log P(D)
- Recursive Form:
- P(M|D1,D2,...,Dn+1) = P(Dn+1|M) P(M|D1,...,Dn) / P(Dn+1|D1,...,Dn)
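A minimal numerical sketch of the induction and recursive forms above (Python; the two candidate models and all probability values are invented for illustration):
  # Posterior over two candidate models via P(M|D) = P(D|M) P(M) / P(D).
  priors = {"M1": 0.5, "M2": 0.5}          # P(M), assumed uniform
  likelihoods = {"M1": 0.20, "M2": 0.05}   # P(D|M), illustrative values
  evidence = sum(likelihoods[m] * priors[m] for m in priors)  # P(D)
  posterior = {m: likelihoods[m] * priors[m] / evidence for m in priors}
  # -> {'M1': 0.8, 'M2': 0.2}
  # Recursive form: today's posterior is tomorrow's prior when D_{n+1} arrives.
  new_lik = {"M1": 0.10, "M2": 0.30}       # P(D_{n+1}|M), illustrative
  evidence2 = sum(new_lik[m] * posterior[m] for m in priors)
  posterior2 = {m: new_lik[m] * posterior[m] / evidence2 for m in priors}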
23Learning
- MODEL FITTING AND MODEL COMPARISON
- MAXIMUM LIKELIHOOD AND MAXIMUM A POSTERIORI
24Paradox
- A non-probabilistic model is NOT a scientific
model.
25EXAMPLES OF NON-SCIENTIFIC MODELS
- F = ma
- E = mc2
- etc.
- These are only first-order approximations and do not fit the data exactly (the likelihood of the data is zero).
- Correction: (F + εF) = (m + εm)(a + εa), with explicit noise terms.
26Paradox
- TO CHOOSE A SIMPLE MODEL BECAUSE DATA IS
SCARCE IS LIKE SEARCHING FOR THE KEY UNDER THE
LIGHT IN THE PARKING LOT AT NIGHT
27Different Levels of Bayesian Inference
- Level 1: find the best model w.
- Level 2: integrate over models.
28Model Classes
- NEURAL NETWORKS
- MARKOV MODELS
- HIDDEN MARKOV MODELS
- STOCHASTIC GRAMMARS
- DECISION TREES
- BAYESIAN NETWORKS
- GRAPHICAL MODELS
- KERNEL METHODS (SVMs, Gaussian kernels, spectral kernels, etc.)
29PRIORS
- NON-INFORMATIVE PRIORS (UNIFORM, MAXIMUM ENTROPY, SYMMETRIES)
- STANDARD PRIORS: GAUSSIAN, DIRICHLET, CONJUGATE, ETC.
30LEARNING ALGORITHMS
- Minimize -log p(M|D) (toy sketch below)
- Gradient methods (gradient descent, conjugate gradient, back-propagation).
- Monte Carlo methods (Metropolis, Gibbs sampling, simulated annealing).
- Other methods: EM (Expectation-Maximization), GEM, etc.
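A toy sketch of the first two bullets (Python/NumPy; a one-parameter Gaussian model whose data, prior width tau, and learning rate are all invented):
  import numpy as np
  # Minimize -log p(M|D) = -log p(D|M) - log p(M) + const for a single
  # parameter mu: Gaussian likelihood (sigma = 1), Gaussian prior N(0, tau^2).
  data = np.array([1.8, 2.2, 1.9, 2.4])
  tau, mu, lr = 2.0, 0.0, 0.05
  for _ in range(200):
      grad = -(data - mu).sum() + mu / tau**2  # gradient of -log posterior
      mu -= lr * grad                          # plain gradient descent step
  # mu converges to the MAP estimate: data.mean() shrunk toward the prior mean 0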
31OTHER ASPECTS
- Model complexity
- VC dimension
- Minimum description length
- Validation and cross validation
- Early stopping
- Ensembles
- Boosting, bagging, etc.
- Second-order methods (Hessian, Fisher information matrix)
- etc.
32(No Transcript)
33AXIOMATIC HIERARCHY
- GAME THEORY
- DECISION THEORY
- BAYESIAN STATISTICS
- GRAPHICAL MODELS
34- In general it is wise to model biological data probabilistically, for several reasons:
- 1. Measurement noise (arrays vs. sequences)
- 2. Variability (evolutionary tinkering)
- 3. High-dimensional complex systems (hidden variables)
35MARKOV MODELS
36(No Transcript)
37(No Transcript)
38- Problem of long-range dependencies
- Generalization: graphical models
39ARTIFICIAL NEURAL NETWORKS
40(No Transcript)
41MODEL NEURON
out = f(Σi wi ui - t)
f: threshold function, sigmoidal function, etc.
[Diagram: inputs ui reach the unit through weights wi; t is the threshold]
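A direct transcription of the model neuron in code (Python/NumPy; the weights, inputs, and threshold values are arbitrary):
  import numpy as np
  def neuron(u, w, t, f):
      # out = f(sum_i w_i * u_i - t)
      return f(np.dot(w, u) - t)
  sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))  # sigmoidal transfer function
  step = lambda x: float(x > 0)                 # threshold transfer function
  u = np.array([0.5, 1.0, -0.3])                # inputs u_i (arbitrary)
  w = np.array([0.8, -0.2, 0.4])                # weights w_i (arbitrary)
  print(neuron(u, w, 0.1, sigmoid), neuron(u, w, 0.1, step))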
42(No Transcript)
43NEURAL NETWORKS
- Early Neural Networks: layered feed-forward, one hidden layer, sigmoidal transfer functions, LMS error.
- Modern Neural Networks: a general class of Bayesian statistical models.
- Probabilistic Framework: the network N is deterministic, the data D = (x,y) is stochastic. The output of the network F(x) is the average or expectation of y given x.
- Regression: Gaussian model, linear transfer functions in the output layer; error function LMS = negative log-likelihood.
- Classification: multinomial model, normalized exponential transfer functions in the output layer; error function KL distance = relative entropy = negative log-likelihood.
- Classification (special case): binomial model, sigmoidal transfer functions in the output layer; error function KL distance = relative entropy = negative log-likelihood.
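A small numerical sketch of the error-function/negative-log-likelihood correspondences above (Python/NumPy; the targets and outputs are made-up numbers):
  import numpy as np
  # Regression: LMS error = Gaussian negative log-likelihood (up to constants).
  y, f = np.array([1.2, -0.4]), np.array([1.0, -0.1])
  lms = 0.5 * np.sum((y - f) ** 2)
  # Classification: relative entropy to a one-hot target = multinomial
  # negative log-likelihood of the normalized-exponential (softmax) outputs.
  target = np.array([1.0, 0.0, 0.0])
  output = np.array([0.7, 0.2, 0.1])
  kl = -np.sum(target * np.log(output))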
44UNIVERSAL APPROXIMATION PROPERTIES
- Any reasonable input-output function can be approximated to any degree of precision by a neural network.
- Trivial for Boolean functions and threshold units.
- Simple proof for continuous functions.
- The real issues are:
- (1) Complexity (the size of the network).
- (2) Learning (how to find the network).
45(No Transcript)
46BACK-PROPAGATION
ERROR: E = F(w)
[Diagram: feed-forward network with unit i in the output layer connected by weight wij to unit j toward the input layer]
GRADIENT DESCENT: Δwij = µ outj δi
µ = learning rate
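A compact sketch of this update rule with one hidden layer (Python/NumPy; the network sizes, toy data, and µ are arbitrary, and LMS error with sigmoid units is assumed):
  import numpy as np
  rng = np.random.default_rng(0)
  X = rng.normal(size=(8, 3))                      # 8 examples, 3 inputs
  Y = (X.sum(axis=1, keepdims=True) > 0) * 1.0     # toy targets
  W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))
  mu = 0.5                                         # learning rate
  sig = lambda x: 1 / (1 + np.exp(-x))
  for _ in range(500):
      H = sig(X @ W1)                              # hidden activities out_j
      O = sig(H @ W2)                              # network outputs
      d_out = (O - Y) * O * (1 - O)                # deltas at the output layer
      d_hid = (d_out @ W2.T) * H * (1 - H)         # deltas back-propagated
      W2 -= mu * H.T @ d_out                       # delta w_ij = µ out_j delta_i
      W1 -= mu * X.T @ d_hid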
47HIDDEN MARKOV MODELS
48Hidden Markov Models
- A first-order Hidden Markov Model is completely defined by:
- A set of states.
- An alphabet of symbols.
- A transition probability matrix T = (tij).
- An emission probability matrix E = (ei(X)).
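A minimal forward-algorithm sketch for this definition (Python/NumPy; a toy two-state DNA model whose probabilities are all invented):
  import numpy as np
  alphabet = {"a": 0, "c": 1, "g": 2, "t": 3}
  T = np.array([[0.9, 0.1], [0.2, 0.8]])          # transition matrix (t_ij)
  E = np.array([[0.35, 0.15, 0.15, 0.35],         # emission matrix (e_i(X)):
                [0.15, 0.35, 0.35, 0.15]])        # AT-rich vs GC-rich state
  pi = np.array([0.5, 0.5])                       # initial state distribution
  def forward(seq):
      # P(sequence | model) by the forward recursion
      f = pi * E[:, alphabet[seq[0]]]
      for ch in seq[1:]:
          f = (f @ T) * E[:, alphabet[ch]]
      return f.sum()
  print(forward("acgtgcgc"))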
49(No Transcript)
50Basic Ideas
- As in speech recognition, use Hidden Markov Models (HMMs) to model a family of related primary sequences.
- As in speech recognition, in general use a left-to-right HMM: once the system leaves a state it can never reenter it. The basic architecture consists of a main backbone chain of main states, and two side chains of insert and delete states.
- The parameters of the model are the transition and emission probabilities. These parameters are adjusted during training from examples.
- After learning, the model can be used in a variety of tasks including multiple alignments, detection of motifs, classification, and database searches.
51HMM APPLICATIONS
- MULTIPLE ALIGNMENTS
- DATABASE SEARCHES AND DISCRIMINATION/CLASSIFICATION
- STRUCTURAL ANALYSIS AND PATTERN DISCOVERY
52Multiple Alignments
- No precise definition of what a good alignment is (low entropy, detection of motifs).
- The multiple alignment problem is NP-complete (related to finding the longest common subsequence).
- Pairwise alignment can be solved efficiently by dynamic programming in O(N^2) steps (sketched below).
- For K sequences of average length N, dynamic programming scales like O(N^K), i.e., exponentially in the number of sequences.
- Problem of variable scores and gap penalties.
53HMMs of Protein Families
- Globins
- Immunoglobulins
- Kinases
- G-Protein-Coupled Receptors
- Pfam is a database of protein domains
54HMMs of DNA
- coding/non-coding regions (E. coli)
- exons/introns/acceptor sites
- promoter regions
- gene finding/gene parsing
55(No Transcript)
56IMMUNOGLOBULINS
- 294 sequences (V regions) with minimum length 90, average length 117, and maximum length 254
- Linear model of length 117 trained with a random subset of 150 sequences
57(No Transcript)
58IG MODEL ENTROPY
59IG EMISSIONS
60IG Viterbi Path
61IG MULTIPLE ALIGNMENT
62G-PROTEIN-COUPLED RECEPTORS
- 145 sequences with minimum length 310, average length 430, and maximum length 764.
- Model trained with 143 sequences (3 sequences contained undefined symbols) using Viterbi learning.
63GPCR ENTROPY
64(No Transcript)
65GPCR SCORING
66(No Transcript)
67Linear Architecture
68Loop Architecture
69Wheel Architecture
70(No Transcript)
71(No Transcript)
72(No Transcript)
73PROMOTER ENTROPY
74PROMOTER BENDABILITY
75(No Transcript)
76(No Transcript)
77(No Transcript)
78GRAPHICAL MODELS
79GRAPHICAL MODELS
- Bayesian statistics and modeling lead to very high-dimensional distributions P(D,H,M) which are typically intractable.
- Need for factorization into independent clusters of variables that reflect the local (Markovian) dependencies of the world and the data.
- Hence the general theory of graphical models.
- Directed models reflect temporal and causal relationships: NNs, HMMs, Bayesian networks, etc.
- Directed models are used, for instance, in expert systems.
- Undirected models reflect correlations: Markov Random Fields, Boltzmann machines, etc.
- Undirected models are used, for instance, in image modeling problems.
- Mixed directed/undirected and other models are possible.
80GRAPHICAL MODELS: BAYESIAN NETWORKS
- X1, ..., Xn: random variables associated with the vertices of a DAG (Directed Acyclic Graph).
- The local conditional distributions P(Xi | Xj : j parent of i) are the parameters of the model. They can be represented by look-up tables (costly) or by other more compact parameterizations (sigmoidal belief networks, noisy-OR, etc.).
- The global distribution is the product of the local characteristics: P(X1,...,Xn) = Πi P(Xi | Xj : j parent of i)
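A tiny look-up-table instance of the product formula (Python; a three-node chain DAG A -> B -> C over binary variables, with invented numbers):
  # P(A,B,C) = P(A) P(B|A) P(C|B)
  P_A = {1: 0.3, 0: 0.7}
  P_B_given_A = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}  # indexed [a][b]
  P_C_given_B = {1: {1: 0.5, 0: 0.5}, 0: {1: 0.4, 0: 0.6}}  # indexed [b][c]
  def joint(a, b, c):
      return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]
  # sanity check: the joint sums to 1 over all 8 configurations
  total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
  assert abs(total - 1.0) < 1e-12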
81(No Transcript)
82(No Transcript)
83(No Transcript)
84(No Transcript)
85Inference and Learning
- Visible and Hidden Nodes
- Bayes' rule
- Trees and Polytrees
- General DAGs and the Junction Tree Algorithm
- Belief propagation in DAGs
- Approximation methods (e.g., variational methods, etc.)
86PROTEIN STRUCTURE PREDICTION (GMs AND RNNs)
87PROTEINS
- [Backbone diagram: the repeating N-Cα-C units of the polypeptide chain, with side chains R1, R2, R3 attached at the Cα atoms]
88(No Transcript)
89Utility of Structural Information
(Baker and Sali, 2001)
90CAVEAT
91REMARKS
- Structure/Folding
- Backbone/Full Atom
- Homology Modeling
- Protein Threading
- Ab Initio (Physical Potentials/Molecular Dynamics, Statistical Mechanics/Lattice Models)
- Statistical/Machine Learning (Training Sets, SS prediction)
- Mixtures: ab initio with statistical potentials, machine learning with profiles, etc.
92PROTEIN STRUCTURE PREDICTION
93(No Transcript)
94Helices
- 1GRJ (GreA transcript cleavage factor from Escherichia coli)
95Antiparallel β-sheets
- 1MSC (bacteriophage MS2 unassembled coat protein dimer)
96Parallel β-sheets
97Contact map
98Secondary structure prediction
99(No Transcript)
100(No Transcript)
101(No Transcript)
102(No Transcript)
103DATA PREPARATION
- Starting point: the PDB database (filtering pipeline sketched below).
- Remove sequences not determined by X-ray diffraction.
- Remove sequences where DSSP crashes.
- Remove proteins with physical chain breaks (neighboring AAs with distances exceeding 4 Angstroms).
- Remove sequences with resolution worse than 2.5 Angstroms.
- Remove chains with fewer than 30 AAs.
- Remove redundancy (Hobohm's algorithm, Smith-Waterman, PAM 120, etc.).
- Build multiple alignments (BLAST, PSI-BLAST, etc.)
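A hypothetical sketch of this filtering pipeline (Python; the record fields and the toy raw_chains list are assumptions for illustration, not a real PDB interface; redundancy reduction and alignment building would follow as separate passes):
  MAX_BREAK_DIST, MAX_RESOLUTION, MIN_LENGTH = 4.0, 2.5, 30
  def keep(chain):
      if chain["method"] != "X-ray diffraction": return False
      if chain["dssp_failed"]: return False
      if any(d > MAX_BREAK_DIST for d in chain["neighbor_distances"]):
          return False                            # physical chain break
      if chain["resolution"] > MAX_RESOLUTION: return False
      if chain["length"] < MIN_LENGTH: return False
      return True
  raw_chains = [  # two made-up records for illustration
      {"method": "X-ray diffraction", "dssp_failed": False,
       "neighbor_distances": [3.8, 3.7], "resolution": 1.9, "length": 120},
      {"method": "NMR", "dssp_failed": False,
       "neighbor_distances": [3.8], "resolution": 0.0, "length": 90},
  ]
  dataset = [c for c in raw_chains if keep(c)]    # keeps only the first record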
104SECONDARY STRUCTURE PROGRAMS
- DSSP (Kabsch and Sander, 1983): works by assigning potential backbone hydrogen bonds (based on the 3D coordinates of the backbone atoms) and subsequently by identifying repetitive bonding patterns.
- STRIDE (Frishman and Argos, 1995): in addition to hydrogen bonds, it also uses dihedral angles.
- DEFINE (Richards and Kundrot, 1988): uses difference distance matrices for evaluating the match of interatomic distances in the protein to those from idealized SS.
105SECONDARY STRUCTURE ASSIGNMENTS
- DSSP classes:
- H = alpha helix
- E = sheet (extended strand)
- G = 3-10 helix
- S = bend (a kind of turn)
- T = hydrogen-bonded turn
- B = beta bridge
- I = pi helix (very rare)
- C = the rest
- CASP (harder) assignment:
- α = H and G
- β = E and B
- γ = the rest (coil)
- Alternative assignment:
- α = H
- β = B
- γ = the rest (coil)
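The CASP-style class reduction written out as a look-up table (Python sketch; class labels a/b/c stand for α/β/coil):
  casp_map = {"H": "a", "G": "a",                      # helix classes
              "E": "b", "B": "b",                      # strand classes
              "S": "c", "T": "c", "I": "c", "C": "c"}  # everything else
  def reduce_ss(dssp_string):
      return "".join(casp_map.get(ch, "c") for ch in dssp_string)
  print(reduce_ss("CCHHHHGGGEEETTSC"))  # -> ccaaaaaaabbbcccc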
106ENSEMBLES
107(No Transcript)
108(No Transcript)
109(No Transcript)
110FUNDAMENTAL LIMITATIONS
- 100% CORRECT RECOGNITION IS PROBABLY IMPOSSIBLE FOR SEVERAL REASONS:
- SOME PROTEINS DO NOT FOLD SPONTANEOUSLY OR MAY NEED CHAPERONES
- QUATERNARY STRUCTURE: BETA-STRAND PARTNERS MAY BE ON A DIFFERENT CHAIN
- STRUCTURE MAY DEPEND ON OTHER VARIABLES: ENVIRONMENT, pH
- DYNAMICAL ASPECTS
- FUZZINESS OF DEFINITIONS AND ERRORS IN DATABASES
111(No Transcript)
112(No Transcript)
113BB-RNNs
1142D RNNs
1152D INPUTS
- AA at positions i and j
- Profiles at positions i and j
- Correlated profiles at positions i and j
- Secondary Structure, Accessibility, etc.
116(No Transcript)
117Protein Reconstruction
Using predicted secondary structure and predicted contact map
PDB ID 1HCR, chain A (52 residues)
Sequence: GRPRAINKHEQEQISRLLEKGHPRQQLAIIFGIGVSTLYRYFPASSIKKRMN
True SS:  CCCCCCCCHHHHHHHHHHHCCCCHHHHHHHCECCHHHHHHHCCCCCCCCCCC
Pred SS:  CCCCCCCHHHHHHHHHHHHCCCCHHHHEEHECHHHHHHHHCCCHHHHHHHCC
Model 147, RMSD 3.47 Å
118Protein Reconstruction
Using predicted secondary structure and predicted contact map
PDB ID 1BC8, chain C (93 residues)
Sequence: MDSAITLWQFLLQLLQKPQNKHMICWTSNDGQFKLLQAEEVARLWGIRKNKPNMNYDKLSRALRYYYVKNIIKKVNGQKFVYKFVSYPEILNM
True SS:  CCCCCCHHHHHHHHCCCHHHCCCCEECCCCCEEECCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHCCEEECCCCCCEEEECCCCHHHCC
Pred SS:  CCCHHHHHHHHHHHHHCCCCCCEEEEECCCEEEEECCHHHHHHHHHHHCCCCCCCHHHHHHHHHHHHHCCCEEECCCCEEEEEEECCHHHHCC
Model 1714, RMSD 4.21 Å
119STRUCTURAL PROTEOMICS SUITE
- www.igb.uci.edu
- SSpro: secondary structure
- SSpro8: secondary structure (8 classes)
- ACCpro: accessibility
- CONpro: contact number
- DI-pro: disulphide bridges
- BETA-pro: beta partners
- CMAP-pro: contact map
- CCMAP-pro: coarse contact map
- CON23D-pro: contact map to 3D
- 3D-pro: 3D structure
120(No Transcript)
121Advantage of Machine Learning
- Pitfalls of traditional ab initio approaches.
- Machine learning systems take time to train (weeks).
- Once trained, however, they can predict structures almost faster than proteins can fold.
- Predict or search protein structures on a genomic or bioengineering scale.
122DAG-RNNs APPROACH
- Two steps:
- 1. Build a relevant DAG to connect inputs, outputs, and hidden variables.
- 2. Use a deterministic (neural network) parameterization together with appropriate stationarity assumptions/weight sharing; the overall model remains probabilistic (a minimal sketch follows this list).
- Process structured data of variable size, topology, and dimension efficiently.
- Sequences, trees, 2D lattices, graphs, etc.
- Convergence theorems.
- Other applications.
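A minimal sketch of step 2 for the simplest DAG, a left-to-right chain, with one shared weight set reused at every node (Python/NumPy; the sizes and data are arbitrary; the 2D models used for contact maps run four such hidden planes over the (i,j) grid, one per corner, with the same weight-sharing idea):
  import numpy as np
  rng = np.random.default_rng(1)
  n_in, n_hid = 4, 8
  W_in = rng.normal(scale=0.1, size=(n_hid, n_in))   # shared ("stationary")
  W_rec = rng.normal(scale=0.1, size=(n_hid, n_hid)) # weights, reused everywhere
  def chain_dag_rnn(inputs):
      # propagate hidden vectors along the chain: h_t = tanh(W_in x_t + W_rec h_{t-1})
      h, hidden = np.zeros(n_hid), []
      for x in inputs:               # variable-length input handled naturally
          h = np.tanh(W_in @ x + W_rec @ h)
          hidden.append(h)
      return hidden
  states = chain_dag_rnn(rng.normal(size=(10, n_in)))  # a length-10 toy sequence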
123(No Transcript)
124ACKNOWLEDGMENTS
- UCI
- Gianluca Pollastri, Michal Rosen-Zvi
- Arlo Randall, Pierre-Francois Baisnee, S. Josh Swamidass, Jianlin Cheng, Yimeng Dou, Yann Pecout, Mike Sweredoski, Alessandro Vullo, Lin Wu
- James Nowick, Luis Villareal
- DTU: Soren Brunak
- Columbia: Burkhard Rost
- U of Florence: Paolo Frasconi
- U of Bologna: Rita Casadio, Piero Fariselli
- www.igb.uci.edu/tools.htm
- www.ics.uci.edu/pfbaldi