Title: Bioinformatics
1Bioinformatics
- The Machine Learning Approach
- Pierre Baldi
- School of Information and Computer Science
- Dept. Biological Chemistry
- Institute for Genomics and Bioinformatics
- University of California, Irvine
2(No Transcript)
3(No Transcript)
4OUTLINE
- INTRODUCTION
- BIOINFORMATICS: THE DATA
- MACHINE LEARNING: PROBABILISTIC MODELING
- EXAMPLES OF MODELS
- MARKOV MODELS
- NEURAL NETWORKS
- HIDDEN MARKOV MODELS
- PROTEIN APPLICATIONS
- DNA APPLICATIONS
- SOFTWARE DEMONSTRATION
- GRAPHICAL MODELS
- PROTEIN STRUCTURE PREDICTION (GMs and RNNs)
- STOCHASTIC GRAMMARS
- RNA MODELING
- DNA MICROARRAYS
- SYSTEMS BIOLOGY
- DISCUSSION
5- tggaagggctaattcactcccaacgaagacaagatatccttgatctgtgg
atctaccacacacaaggctacttccctgattagcagaactacacaccagg
[... complete genome of a virus, roughly 9.7 kb of raw DNA sequence, continues for some 190 further lines; truncated here ...]
gtctgttgtgtgactctggtaactagagatccctcagacccttttagtca
gtgtggaaaatctctagca
6(No Transcript)
7(No Transcript)
8(No Transcript)
9(No Transcript)
10SCALES
Organism        Genome           Genes
virus           10-100,000 b     10-100s
bacteria        5 Mb             4,000
single cell     15 Mb            6,000
simple animal   100 Mb           15,000
homo            3,000 Mb         30-40,000
11(No Transcript)
12(No Transcript)
13(No Transcript)
14(No Transcript)
15(No Transcript)
16Examples of Computational Problems
- Physical and Genetic Maps
- Genome assembly
- Pairwise and Multiple Alignments
- Motif Detection/Discrimination/Classification
- Database Searches and Mining
- Phylogenetic Tree Reconstruction
- Gene Finding and Gene Parsing
- Protein Secondary Structure Prediction
- Protein Tertiary Structure Prediction
- Protein Function Prediction
- Comparative genomics and evolution
- DNA microarray analysis
- Molecular docking/Drug design
- Gene regulation/regulatory networks
- Systems biology
17Machine Learning
- Extract information from the data automatically (inference) via a process of model fitting (learning from examples).
- Model Selection: Neural Networks, Hidden Markov Models, Stochastic Grammars, Bayesian Networks.
- Model Fitting: gradient methods, Monte Carlo methods, etc.
- Machine learning approaches are most useful in areas where there is a lot of data but little theory.
18Three Key Factors for Expansion
- Data Mining/Machine Learning expansion is fueled by:
- Progress in sensors, data storage, and data management.
- Computing power.
- Theoretical framework.
19Intuitive Approach
- Look at ALL available data, background information, and hypotheses.
- Use probabilities to express PRIOR knowledge.
- Use probabilities for inference, model selection, model comparison, etc., by computing POSTERIOR distributions and deriving UNIQUE answers.
20Deduction and Induction
- DEDUCTION:
- If A => B and A is true,
- then B is true.
- INDUCTION:
- If A => B and B is true,
- then A is more plausible.
21Bayesian Statistics
- Bayesian framework for induction: we start with a hypothesis space and wish to express relative preferences in terms of background information (the Cox-Jaynes axioms).
- Axiom 0: Transitivity of preferences.
- Theorem 1: Preferences can be represented by a real number p(A).
- Axiom 1: There exists a function f such that p(not A) = f(p(A)).
- Axiom 2: There exists a function F such that p(A,B) = F(p(A), p(B|A)).
- Theorem 2: There is always a rescaling w such that p*(A) = w(p(A)) lies in [0,1] and satisfies the sum and product rules.
22Probability as Degree of Belief
- Sum Rule:
- P(not A) = 1 - P(A)
- Product Rule:
- P(A and B) = P(A) P(B|A)
- Bayes' Theorem:
- P(B|A) = P(A|B) P(B) / P(A)
- Induction Form:
- P(M|D) = P(D|M) P(M) / P(D)
- Equivalently:
- log P(M|D) = log P(D|M) + log P(M) - log P(D)
- Recursive Form:
- P(M|D1,D2,...,Dn+1) = P(Dn+1|M) P(M|D1,...,Dn) / P(Dn+1|D1,...,Dn)
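A minimal numerical sketch of the induction and recursive forms above (Python; the two candidate models and all probability values are invented for illustration):
  # Posterior over two candidate models via P(M|D) = P(D|M) P(M) / P(D).
  priors = {"M1": 0.5, "M2": 0.5}          # P(M), assumed uniform
  likelihoods = {"M1": 0.20, "M2": 0.05}   # P(D|M), illustrative values
  evidence = sum(likelihoods[m] * priors[m] for m in priors)  # P(D)
  posterior = {m: likelihoods[m] * priors[m] / evidence for m in priors}
  # -> {'M1': 0.8, 'M2': 0.2}
  # Recursive form: today's posterior is tomorrow's prior when D_{n+1} arrives.
  new_lik = {"M1": 0.10, "M2": 0.30}       # P(D_{n+1}|M), illustrative
  evidence2 = sum(new_lik[m] * posterior[m] for m in priors)
  posterior2 = {m: new_lik[m] * posterior[m] / evidence2 for m in priors}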
23Learning
- MODEL FITTING AND MODEL COMPARISON
- MAXIMUM LIKELIHOOD AND MAXIMUM A POSTERIORI
24Paradox
- A non-probabilistic model is NOT a scientific
model.
25EXAMPLES OF NON-SCIENTIFIC MODELS
- F = ma
- E = mc2
- etc.
- These are only first-order approximations and do not fit the data exactly (the likelihood of the data is zero).
- Correction: (F + εF) = (m + εm)(a + εa), with explicit noise terms.
26Paradox
- TO CHOOSE A SIMPLE MODEL BECAUSE DATA IS
SCARCE IS LIKE SEARCHING FOR THE KEY UNDER THE
LIGHT IN THE PARKING LOT AT NIGHT
27Different Levels of Bayesian Inference
- Level 1: find the best model w.
- Level 2: integrate over models.
28Model Classes
- NEURAL NETWORKS
- MARKOV MODELS
- HIDDEN MARKOV MODELS
- STOCHASTIC GRAMMARS
- DECISION TREES
- BAYESIAN NETWORKS
- GRAPHICAL MODELS
- KERNEL METHODS (SVMs, Gaussian kernels, spectral kernels, etc.)
29PRIORS
- NON-INFORMATIVE PRIORS (UNIFORM, MAXIMUM ENTROPY, SYMMETRIES)
- STANDARD PRIORS: GAUSSIAN, DIRICHLET, CONJUGATE, ETC.
30LEARNING ALGORITHMS
- Minimize -log p(M|D) (toy sketch below)
- Gradient methods (gradient descent, conjugate gradient, back-propagation).
- Monte Carlo methods (Metropolis, Gibbs sampling, simulated annealing).
- Other methods: EM (Expectation-Maximization), GEM, etc.
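A toy sketch of the first two bullets (Python/NumPy; a one-parameter Gaussian model whose data, prior width tau, and learning rate are all invented):
  import numpy as np
  # Minimize -log p(M|D) = -log p(D|M) - log p(M) + const for a single
  # parameter mu: Gaussian likelihood (sigma = 1), Gaussian prior N(0, tau^2).
  data = np.array([1.8, 2.2, 1.9, 2.4])
  tau, mu, lr = 2.0, 0.0, 0.05
  for _ in range(200):
      grad = -(data - mu).sum() + mu / tau**2  # gradient of -log posterior
      mu -= lr * grad                          # plain gradient descent step
  # mu converges to the MAP estimate: data.mean() shrunk toward the prior mean 0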
31OTHER ASPECTS
- Model complexity
- VC dimension
- Minimum description length
- Validation and cross validation
- Early stopping
- Ensembles
- Boosting, bagging, etc.
- Second-order methods (Hessian, Fisher information matrix)
- etc.
32(No Transcript)
33AXIOMATIC HIERARCHY
- GAME THEORY
- DECISION THEORY
- BAYESIAN STATISTICS
- GRAPHICAL MODELS
34- In general it is wise to model biological data probabilistically, for several reasons:
- 1. Measurement noise (arrays vs. sequences)
- 2. Variability (evolutionary tinkering)
- 3. High-dimensional complex systems (hidden variables)
35MARKOV MODELS
36(No Transcript)
37(No Transcript)
38- Problem of long-range dependencies
- Generalization: graphical models
39ARTIFICIAL NEURAL NETWORKS
40(No Transcript)
41MODEL NEURON
out = f(Σi wi ui - t)
f: threshold function, sigmoidal function, etc.
[Diagram: inputs ui reach the unit through weights wi; t is the threshold]
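A direct transcription of the model neuron in code (Python/NumPy; the weights, inputs, and threshold values are arbitrary):
  import numpy as np
  def neuron(u, w, t, f):
      # out = f(sum_i w_i * u_i - t)
      return f(np.dot(w, u) - t)
  sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))  # sigmoidal transfer function
  step = lambda x: float(x > 0)                 # threshold transfer function
  u = np.array([0.5, 1.0, -0.3])                # inputs u_i (arbitrary)
  w = np.array([0.8, -0.2, 0.4])                # weights w_i (arbitrary)
  print(neuron(u, w, 0.1, sigmoid), neuron(u, w, 0.1, step))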
42(No Transcript)
43NEURAL NETWORKS
- Early Neural Networks: layered feed-forward, one hidden layer, sigmoidal transfer functions, LMS error.
- Modern Neural Networks: a general class of Bayesian statistical models.
- Probabilistic Framework: the network N is deterministic, the data D = (x,y) is stochastic. The output of the network F(x) is the average or expectation of y given x.
- Regression: Gaussian model, linear transfer functions in the output layer; error function LMS = negative log-likelihood.
- Classification: multinomial model, normalized exponential transfer functions in the output layer; error function KL distance = relative entropy = negative log-likelihood.
- Classification (special case): binomial model, sigmoidal transfer functions in the output layer; error function KL distance = relative entropy = negative log-likelihood.
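A small numerical sketch of the error-function/negative-log-likelihood correspondences above (Python/NumPy; the targets and outputs are made-up numbers):
  import numpy as np
  # Regression: LMS error = Gaussian negative log-likelihood (up to constants).
  y, f = np.array([1.2, -0.4]), np.array([1.0, -0.1])
  lms = 0.5 * np.sum((y - f) ** 2)
  # Classification: relative entropy to a one-hot target = multinomial
  # negative log-likelihood of the normalized-exponential (softmax) outputs.
  target = np.array([1.0, 0.0, 0.0])
  output = np.array([0.7, 0.2, 0.1])
  kl = -np.sum(target * np.log(output))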
44UNIVERSAL APPROXIMATION PROPERTIES
- Any reasonable input-output function can be approximated to any degree of precision by a neural network.
- Trivial for Boolean functions and threshold units.
- Simple proof for continuous functions.
- The real issues are:
- (1) Complexity (the size of the network).
- (2) Learning (how to find the network).
45(No Transcript)
46BACK-PROPAGATION
ERROR: E = F(w)
[Diagram: feed-forward network with unit i in the output layer connected by weight wij to unit j toward the input layer]
GRADIENT DESCENT: Δwij = µ outj δi
µ = learning rate
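A compact sketch of this update rule with one hidden layer (Python/NumPy; the network sizes, toy data, and µ are arbitrary, and LMS error with sigmoid units is assumed):
  import numpy as np
  rng = np.random.default_rng(0)
  X = rng.normal(size=(8, 3))                      # 8 examples, 3 inputs
  Y = (X.sum(axis=1, keepdims=True) > 0) * 1.0     # toy targets
  W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))
  mu = 0.5                                         # learning rate
  sig = lambda x: 1 / (1 + np.exp(-x))
  for _ in range(500):
      H = sig(X @ W1)                              # hidden activities out_j
      O = sig(H @ W2)                              # network outputs
      d_out = (O - Y) * O * (1 - O)                # deltas at the output layer
      d_hid = (d_out @ W2.T) * H * (1 - H)         # deltas back-propagated
      W2 -= mu * H.T @ d_out                       # delta w_ij = µ out_j delta_i
      W1 -= mu * X.T @ d_hid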
47HIDDEN MARKOV MODELS
48Hidden Markov Models
- A first-order Hidden Markov Model is completely defined by:
- A set of states.
- An alphabet of symbols.
- A transition probability matrix T = (tij).
- An emission probability matrix E = (ei(X)).
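A minimal forward-algorithm sketch for this definition (Python/NumPy; a toy two-state DNA model whose probabilities are all invented):
  import numpy as np
  alphabet = {"a": 0, "c": 1, "g": 2, "t": 3}
  T = np.array([[0.9, 0.1], [0.2, 0.8]])          # transition matrix (t_ij)
  E = np.array([[0.35, 0.15, 0.15, 0.35],         # emission matrix (e_i(X)):
                [0.15, 0.35, 0.35, 0.15]])        # AT-rich vs GC-rich state
  pi = np.array([0.5, 0.5])                       # initial state distribution
  def forward(seq):
      # P(sequence | model) by the forward recursion
      f = pi * E[:, alphabet[seq[0]]]
      for ch in seq[1:]:
          f = (f @ T) * E[:, alphabet[ch]]
      return f.sum()
  print(forward("acgtgcgc"))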
49(No Transcript)
50Basic Ideas
- As in speech recognition, use Hidden Markov Models (HMMs) to model a family of related primary sequences.
- As in speech recognition, in general use a left-to-right HMM: once the system leaves a state it can never reenter it. The basic architecture consists of a main backbone chain of main states, and two side chains of insert and delete states.
- The parameters of the model are the transition and emission probabilities. These parameters are adjusted during training from examples.
- After learning, the model can be used in a variety of tasks including multiple alignments, detection of motifs, classification, and database searches.
51HMM APPLICATIONS
- MULTIPLE ALIGNMENTS
- DATABASE SEARCHES AND DISCRIMINATION/CLASSIFICATION
- STRUCTURAL ANALYSIS AND PATTERN DISCOVERY
52Multiple Alignments
- No precise definition of what a good alignment is (low entropy, detection of motifs).
- The multiple alignment problem is NP-complete (related to finding the longest common subsequence).
- Pairwise alignment can be solved efficiently by dynamic programming in O(N^2) steps (sketched below).
- For K sequences of average length N, dynamic programming scales like O(N^K), i.e., exponentially in the number of sequences.
- Problem of variable scores and gap penalties.
53HMMs of Protein Families
- Globins
- Immunoglobulins
- Kinases
- G-Protein-Coupled Receptors
- Pfam is a database of protein domains
54HMMs of DNA
- coding/non-coding regions (E. coli)
- exons/introns/acceptor sites
- promoter regions
- gene finding/gene parsing
55(No Transcript)
56IMMUNOGLOBULINS
- 294 sequences (V regions) with minimum length 90, average length 117, and maximum length 254
- Linear model of length 117 trained with a random subset of 150 sequences
57(No Transcript)
58IG MODEL ENTROPY
59IG EMISSIONS
60IG Viterbi Path
61IG MULTIPLE ALIGNMENT
62G-PROTEIN-COUPLED RECEPTORS
- 145 sequences with minimum length 310, average length 430, and maximum length 764.
- Model trained with 143 sequences (3 sequences contained undefined symbols) using Viterbi learning.
63GPCR ENTROPY
64(No Transcript)
65GPCR SCORING
66(No Transcript)
67Linear Architecture
68Loop Architecture
69Wheel Architecture
70(No Transcript)
71(No Transcript)
72(No Transcript)
73PROMOTER ENTROPY
74PROMOTER BENDABILITY
75(No Transcript)
76(No Transcript)
77(No Transcript)
78GRAPHICAL MODELS
79GRAPHICAL MODELS
- Bayesian statistics and modeling lead to very high-dimensional distributions P(D,H,M) which are typically intractable.
- Need for factorization into independent clusters of variables that reflect the local (Markovian) dependencies of the world and the data.
- Hence the general theory of graphical models.
- Directed models reflect temporal and causal relationships: NNs, HMMs, Bayesian networks, etc.
- Directed models are used, for instance, in expert systems.
- Undirected models reflect correlations: Markov Random Fields, Boltzmann machines, etc.
- Undirected models are used, for instance, in image modeling problems.
- Mixed directed/undirected and other models are possible.
80GRAPHICAL MODELS: BAYESIAN NETWORKS
- X1, ..., Xn: random variables associated with the vertices of a DAG (Directed Acyclic Graph).
- The local conditional distributions P(Xi | Xj : j parent of i) are the parameters of the model. They can be represented by look-up tables (costly) or by other more compact parameterizations (sigmoidal belief networks, noisy-OR, etc.).
- The global distribution is the product of the local characteristics: P(X1,...,Xn) = Πi P(Xi | Xj : j parent of i)
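A tiny look-up-table instance of the product formula (Python; a three-node chain DAG A -> B -> C over binary variables, with invented numbers):
  # P(A,B,C) = P(A) P(B|A) P(C|B)
  P_A = {1: 0.3, 0: 0.7}
  P_B_given_A = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}  # indexed [a][b]
  P_C_given_B = {1: {1: 0.5, 0: 0.5}, 0: {1: 0.4, 0: 0.6}}  # indexed [b][c]
  def joint(a, b, c):
      return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]
  # sanity check: the joint sums to 1 over all 8 configurations
  total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
  assert abs(total - 1.0) < 1e-12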
81(No Transcript)
82(No Transcript)
83(No Transcript)
84(No Transcript)
85Inference and Learning
- Visible and Hidden Nodes
- Bayes' rule
- Trees and Polytrees
- General DAGs and the Junction Tree Algorithm
- Belief propagation in DAGs
- Approximation methods (e.g., variational methods, etc.)
86PROTEIN STRUCTURE PREDICTION (GMs AND RNNs)
87PROTEINS
- [Backbone diagram: the repeating N-Cα-C units of the polypeptide chain, with side chains R1, R2, R3 attached at the Cα atoms]
88(No Transcript)
89Utility of Structural Information
(Baker and Sali, 2001)
90CAVEAT
91REMARKS
- Structure/Folding
- Backbone/Full Atom
- Homology Modeling
- Protein Threading
- Ab Initio (Physical Potentials/Molecular Dynamics, Statistical Mechanics/Lattice Models)
- Statistical/Machine Learning (Training Sets, SS prediction)
- Mixtures: ab initio with statistical potentials, machine learning with profiles, etc.
92PROTEIN STRUCTURE PREDICTION
93(No Transcript)
94Helices
- 1GRJ (GreA transcript cleavage factor from Escherichia coli)
95Antiparallel β-sheets
- 1MSC (bacteriophage MS2 unassembled coat protein dimer)
96Parallel β-sheets
97Contact map
98Secondary structure prediction
99(No Transcript)
100(No Transcript)
101(No Transcript)
102(No Transcript)
103DATA PREPARATION
- Starting point: the PDB database (filtering pipeline sketched below).
- Remove sequences not determined by X-ray diffraction.
- Remove sequences where DSSP crashes.
- Remove proteins with physical chain breaks (neighboring AAs with distances exceeding 4 Angstroms).
- Remove sequences with resolution worse than 2.5 Angstroms.
- Remove chains with fewer than 30 AAs.
- Remove redundancy (Hobohm's algorithm, Smith-Waterman, PAM 120, etc.).
- Build multiple alignments (BLAST, PSI-BLAST, etc.)
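A hypothetical sketch of this filtering pipeline (Python; the record fields and the toy raw_chains list are assumptions for illustration, not a real PDB interface; redundancy reduction and alignment building would follow as separate passes):
  MAX_BREAK_DIST, MAX_RESOLUTION, MIN_LENGTH = 4.0, 2.5, 30
  def keep(chain):
      if chain["method"] != "X-ray diffraction": return False
      if chain["dssp_failed"]: return False
      if any(d > MAX_BREAK_DIST for d in chain["neighbor_distances"]):
          return False                            # physical chain break
      if chain["resolution"] > MAX_RESOLUTION: return False
      if chain["length"] < MIN_LENGTH: return False
      return True
  raw_chains = [  # two made-up records for illustration
      {"method": "X-ray diffraction", "dssp_failed": False,
       "neighbor_distances": [3.8, 3.7], "resolution": 1.9, "length": 120},
      {"method": "NMR", "dssp_failed": False,
       "neighbor_distances": [3.8], "resolution": 0.0, "length": 90},
  ]
  dataset = [c for c in raw_chains if keep(c)]    # keeps only the first record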
104SECONDARY STRUCTURE PROGRAMS
- DSSP (Kabsch and Sander, 1983): works by assigning potential backbone hydrogen bonds (based on the 3D coordinates of the backbone atoms) and subsequently by identifying repetitive bonding patterns.
- STRIDE (Frishman and Argos, 1995): in addition to hydrogen bonds, it also uses dihedral angles.
- DEFINE (Richards and Kundrot, 1988): uses difference distance matrices for evaluating the match of interatomic distances in the protein to those from idealized SS.
105SECONDARY STRUCTURE ASSIGNMENTS
- DSSP classes:
- H = alpha helix
- E = sheet (extended strand)
- G = 3-10 helix
- S = bend (a kind of turn)
- T = hydrogen-bonded turn
- B = beta bridge
- I = pi helix (very rare)
- C = the rest
- CASP (harder) assignment:
- α = H and G
- β = E and B
- γ = the rest (coil)
- Alternative assignment:
- α = H
- β = B
- γ = the rest (coil)
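The CASP-style class reduction written out as a look-up table (Python sketch; class labels a/b/c stand for α/β/coil):
  casp_map = {"H": "a", "G": "a",                      # helix classes
              "E": "b", "B": "b",                      # strand classes
              "S": "c", "T": "c", "I": "c", "C": "c"}  # everything else
  def reduce_ss(dssp_string):
      return "".join(casp_map.get(ch, "c") for ch in dssp_string)
  print(reduce_ss("CCHHHHGGGEEETTSC"))  # -> ccaaaaaaabbbcccc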
106ENSEMBLES
107(No Transcript)
108(No Transcript)
109(No Transcript)
110FUNDAMENTAL LIMITATIONS
- 100% CORRECT RECOGNITION IS PROBABLY IMPOSSIBLE FOR SEVERAL REASONS:
- SOME PROTEINS DO NOT FOLD SPONTANEOUSLY OR MAY NEED CHAPERONES
- QUATERNARY STRUCTURE: BETA-STRAND PARTNERS MAY BE ON A DIFFERENT CHAIN
- STRUCTURE MAY DEPEND ON OTHER VARIABLES: ENVIRONMENT, pH
- DYNAMICAL ASPECTS
- FUZZINESS OF DEFINITIONS AND ERRORS IN DATABASES
111(No Transcript)
112(No Transcript)
113BB-RNNs
1142D RNNs
1152D INPUTS
- AA at positions i and j
- Profiles at positions i and j
- Correlated profiles at positions i and j
- Secondary Structure, Accessibility, etc.
116(No Transcript)
117Protein Reconstruction
Using predicted secondary structure and predicted contact map
PDB ID 1HCR, chain A (52 residues)
Sequence: GRPRAINKHEQEQISRLLEKGHPRQQLAIIFGIGVSTLYRYFPASSIKKRMN
True SS:  CCCCCCCCHHHHHHHHHHHCCCCHHHHHHHCECCHHHHHHHCCCCCCCCCCC
Pred SS:  CCCCCCCHHHHHHHHHHHHCCCCHHHHEEHECHHHHHHHHCCCHHHHHHHCC
Model 147, RMSD 3.47 Å
118Protein Reconstruction
Using predicted secondary structure and predicted contact map
PDB ID 1BC8, chain C (93 residues)
Sequence: MDSAITLWQFLLQLLQKPQNKHMICWTSNDGQFKLLQAEEVARLWGIRKNKPNMNYDKLSRALRYYYVKNIIKKVNGQKFVYKFVSYPEILNM
True SS:  CCCCCCHHHHHHHHCCCHHHCCCCEECCCCCEEECCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHCCEEECCCCCCEEEECCCCHHHCC
Pred SS:  CCCHHHHHHHHHHHHHCCCCCCEEEEECCCEEEEECCHHHHHHHHHHHCCCCCCCHHHHHHHHHHHHHCCCEEECCCCEEEEEEECCHHHHCC
Model 1714, RMSD 4.21 Å
119STRUCTURAL PROTEOMICS SUITE
- www.igb.uci.edu
- SSpro: secondary structure
- SSpro8: secondary structure (8 classes)
- ACCpro: accessibility
- CONpro: contact number
- DI-pro: disulphide bridges
- BETA-pro: beta partners
- CMAP-pro: contact map
- CCMAP-pro: coarse contact map
- CON23D-pro: contact map to 3D
- 3D-pro: 3D structure
120(No Transcript)
121Advantage of Machine Learning
- Pitfalls of traditional ab initio approaches.
- Machine learning systems take time to train (weeks).
- Once trained, however, they can predict structures almost faster than proteins can fold.
- Predict or search protein structures on a genomic or bioengineering scale.
122DAG-RNNs APPROACH
- Two steps:
- 1. Build a relevant DAG to connect inputs, outputs, and hidden variables.
- 2. Use a deterministic (neural network) parameterization together with appropriate stationarity assumptions/weight sharing; the overall model remains probabilistic (a minimal sketch follows this list).
- Process structured data of variable size, topology, and dimension efficiently.
- Sequences, trees, 2D lattices, graphs, etc.
- Convergence theorems.
- Other applications.
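A minimal sketch of step 2 for the simplest DAG, a left-to-right chain, with one shared weight set reused at every node (Python/NumPy; the sizes and data are arbitrary; the 2D models used for contact maps run four such hidden planes over the (i,j) grid, one per corner, with the same weight-sharing idea):
  import numpy as np
  rng = np.random.default_rng(1)
  n_in, n_hid = 4, 8
  W_in = rng.normal(scale=0.1, size=(n_hid, n_in))   # shared ("stationary")
  W_rec = rng.normal(scale=0.1, size=(n_hid, n_hid)) # weights, reused everywhere
  def chain_dag_rnn(inputs):
      # propagate hidden vectors along the chain: h_t = tanh(W_in x_t + W_rec h_{t-1})
      h, hidden = np.zeros(n_hid), []
      for x in inputs:               # variable-length input handled naturally
          h = np.tanh(W_in @ x + W_rec @ h)
          hidden.append(h)
      return hidden
  states = chain_dag_rnn(rng.normal(size=(10, n_in)))  # a length-10 toy sequence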
123(No Transcript)
124ACKNOWLEDGMENTS
- UCI
- Gianluca Pollastri, Michal Rosen-Zvi
- Arlo Randall, Pierre-Francois Baisnee, S. Josh Swamidass, Jianlin Cheng, Yimeng Dou, Yann Pecout, Mike Sweredoski, Alessandro Vullo, Lin Wu
- James Nowick, Luis Villareal
- DTU: Soren Brunak
- Columbia: Burkhard Rost
- U of Florence: Paolo Frasconi
- U of Bologna: Rita Casadio, Piero Fariselli
- www.igb.uci.edu/tools.htm
- www.ics.uci.edu/pfbaldi