Bioinformatics - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Bioinformatics


1
Bioinformatics
  • The Machine Learning Approach
  • Pierre Baldi
  • School of Information and Computer Science
  • Dept. Biological Chemistry
  • Institute for Genomics and Bioinformatics
  • University of California, Irvine

2
(No Transcript)
3
(No Transcript)
4
OUTLINE
  • INTRODUCTION
  • BIOINFORMATICS: THE DATA
  • MACHINE LEARNING: PROBABILISTIC MODELING
  • EXAMPLES OF MODELS
  • MARKOV MODELS
  • NEURAL NETWORKS
  • HIDDEN MARKOV MODELS
  • PROTEIN APPLICATIONS
  • DNA APPLICATIONS
  • SOFTWARE DEMONSTRATION
  • GRAPHICAL MODELS
  • PROTEIN STRUCTURE PREDICTION (GMs and RNNs)

  • STOCHASTIC GRAMMARS
  • RNA MODELING
  • DNA MICROARRAYS
  • SYSTEMS BIOLOGY
  • DISCUSSION

5
  • tggaagggctaattcactcccaacgaagacaagatatccttgatctgtgg
    atctaccacacacaaggctacttccctgattagcagaactacacaccagg
    gccagggatcagatatccactgacctttggatggtgctacaagctagtac
    cagttgagccagagaagttagaagaagccaacaaaggagagaacaccagc
    ttgttacaccctgtgagcctgcatggaatggatgacccggagagagaagt
    gttagagtggaggtttgacagccgcctagcatttcatcacatggcccgag
    agctgcatccggagtacttcaagaactgctgacatcgagcttgctacaag
    ggactttccgctggggactttccagggaggcgtggcctgggcgggactgg
    ggagtggcgagccctcagatcctgcatataagcagctgctttttgcctgt
    actgggtctctctggttagaccagatctgagcctgggagctctctggcta
    actagggaacccactgcttaagcctcaataaagcttgccttgagtgcttc
    aagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctc
    agacccttttagtcagtgtggaaaatctctagcagtggcgcccgaacagg
    gacctgaaagcgaaagggaaaccagaggagctctctcgacgcaggactcg
    gcttgctgaagcgcgcacggcaagaggcgaggggcggcgactggtgagta
    cgccaaaaattttgactagcggaggctagaaggagagagatgggtgcgag
    agcgtcagtattaagcgggggagaattagatcgatgggaaaaaattcggt
    taaggccagggggaaagaaaaaatataaattaaaacatatagtatgggca
    agcagggagctagaacgattcgcagttaatcctggcctgttagaaacatc
    agaaggctgtagacaaatactgggacagctacaaccatcccttcagacag
    gatcagaagaacttagatcattatataatacagtagcaaccctctattgt
    gtgcatcaaaggatagagataaaagacaccaaggaagctttagacaagat
    agaggaagagcaaaacaaaagtaagaaaaaagcacagcaagcagcagctg
    acacaggacacagcaatcaggtcagccaaaattaccctatagtgcagaac
    atccaggggcaaatggtacatcaggccatatcacctagaactttaaatgc
    atgggtaaaagtagtagaagagaaggctttcagcccagaagtgataccca
    tgttttcagcattatcagaaggagccaccccacaagatttaaacaccatg
    ctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagac
    catcaatgaggaagctgcagaatgggatagagtgcatccagtgcatgcag
    ggcctattgcaccaggccagatgagagaaccaaggggaagtgacatagca
    ggaactactagtacccttcaggaacaaataggatggatgacaaataatcc
    acctatcccagtaggagaaatttataaaagatggataatcctgggattaa
    ataaaatagtaagaatgtatagccctaccagcattctggacataagacaa
    ggaccaaaggaaccctttagagactatgtagaccggttctataaaactct
    aagagccgagcaagcttcacaggaggtaaaaaattggatgacagaaacct
    tgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattg
    ggaccagcggctacactagaagaaatgatgacagcatgtcagggagtagg
    aggacccggccataaggcaagagttttggctgaagcaatgagccaagtaa
    caaattcagctaccataatgatgcagagaggcaattttaggaaccaaaga
    aagattgttaagtgtttcaattgtggcaaagaagggcacacagccagaaa
    ttgcagggcccctaggaaaaagggctgttggaaatgtggaaaggaaggac
    accaaatgaaagattgtactgagagacaggctaattttttagggaagatc
    tggccttcctacaagggaaggccagggaattttcttcagagcagaccaga
    gccaacagccccaccagaagagagcttcaggtctggggtagagacaacaa
    ctccccctcagaagcaggagccgatagacaaggaactgtatcctttaact
    tccctcaggtcactctttggcaacgacccctcgtcacaataaagataggg
    gggcaactaaaggaagctctattagatacaggagcagatgatacagtatt
    agaagaaatgagtttgccaggaagatggaaaccaaaaatgatagggggaa
    ttggaggttttatcaaagtaagacagtatgatcagatactcatagaaatc
    tgtggacataaagctataggtacagtattagtaggacctacacctgtcaa
    cataattggaagaaatctgttgactcagattggttgcactttaaattttc
    ccattagccctattgagactgtaccagtaaaattaaagccaggaatggat
    ggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcatt
    agtagaaatttgtacagagatggaaaaggaagggaaaatttcaaaaattg
    ggcctgaaaatccatacaatactccagtatttgccataaagaaaaaagac
    agtactaaatggagaaaattagtagatttcagagaacttaataagagaac
    tcaagacttctgggaagttcaattaggaataccacatcccgcagggttaa
    aaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttca
    gttcccttagatgaagacttcaggaagtatactgcatttaccatacctag
    tataaacaatgagacaccagggattagatatcagtacaatgtgcttccac
    agggatggaaaggatcaccagcaatattccaaagtagcatgacaaaaatc
    ttagagccttttagaaaacaaaatccagacatagttatctatcaatacat
    ggatgatttgtatgtaggatctgacttagaaatagggcagcatagaacaa
    aaatagaggagctgagacaacatctgttgaggtggggacttaccacacca
    gacaaaaaacatcagaaagaacctccattcctttggatgggttatgaact
    ccatcctgataaatggacagtacagcctatagtgctgccagaaaaagaca
    gctggactgtcaatgacatacagaagttagtggggaaattgaattgggca
    agtcagatttacccagggattaaagtaaggcaattatgtaaactccttag
    aggaaccaaagcactaacagaagtaataccactaacagaagaagcagagc
    tagaactggcagaaaacagagagattctaaaagaaccagtacatggagtg
    tattatgacccatcaaaagacttaatagcagaaatacagaagcaggggca
    aggccaatggacatatcaaatttatcaagagccatttaaaaatctgaaaa
    caggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaa
    ttaacagaggcagtgcaaaaaataaccacagaaagcatagtaatatgggg
    aaagactcctaaatttaaactgcccatacaaaaggaaacatgggaaacat
    ggtggacagagtattggcaagccacctggattcctgagtgggagtttgtt
    aatacccctcccttagtgaaattatggtaccagttagagaaagaacccat
    agtaggagcagaaaccttctatgtagatggggcagctaacagggagacta
    aattaggaaaagcaggatatgttactaatagaggaagacaaaaagttgtc
    accctaactgacacaacaaatcagaagactgagttacaagcaatttatct
    agctttgcaggattcgggattagaagtaaacatagtaacagactcacaat
    atgcattaggaatcattcaagcacaaccagatcaaagtgaatcagagtta
    gtcaatcaaataatagagcagttaataaaaaaggaaaaggtctatctggc
    atgggtaccagcacacaaaggaattggaggaaatgaacaagtagataaat
    tagtcagtgctggaatcaggaaagtactatttttagatggaatagataag
    gcccaagatgaacatgagaaatatcacagtaattggagagcaatggctag
    tgattttaacctgccacctgtagtagcaaaagaaatagtagccagctgtg
    ataaatgtcagctaaaaggagaagccatgcatggacaagtagactgtagt
    ccaggaatatggcaactagattgtacacatttagaaggaaaagttatcct
    ggtagcagttcatgtagccagtggatatatagaagcagaagttattccag
    cagaaacagggcaggaaacagcatattttcttttaaaattagcaggaaga
    tggccagtaaaaacaatacatactgacaatggcagcaatttcaccggtgc
    tacggttagggccgcctgttggtgggcgggaatcaagcaggaatttggaa
    ttccctacaatccccaaagtcaaggagtagtagaatctatgaataaagaa
    ttaaagaaaattataggacaggtaagagatcaggctgaacatcttaagac
    agcagtacaaatggcagtattcatccacaattttaaaagaaaagggggga
    ttggggggtacagtgcaggggaaagaatagtagacataatagcaacagac
    atacaaactaaagaattacaaaaacaaattacaaaaattcaaaattttcg
    ggtttattacagggacagcagaaatccactttggaaaggaccagcaaagc
    tcctctggaaaggtgaaggggcagtagtaatacaagataatagtgacata
    aaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaaca
    gatggcaggtgatgattgtgtggcaagtagacaggatgaggattagaaca
    tggaaaagtttagtaaaacaccatatgtatgtttcagggaaagctagggg
    atggttttatagacatcactatgaaagccctcatccaagaataagttcag
    aagtacacatcccactaggggatgctagattggtaataacaacatattgg
    ggtctgcatacaggagaaagagactggcatttgggtcagggagtctccat
    agaatggaggaaaaagagatatagcacacaagtagaccctgaactagcag
    accaactaattcatctgtattactttgactgtttttcagactctgctata
    agaaaggccttattaggacacatagttagccctaggtgtgaatatcaagc
    aggacataacaaggtaggatctctacaatacttggcactagcagcattaa
    taacaccaaaaaagataaagccacctttgcctagtgttacgaaactgaca
    gaggatagatggaacaagccccagaagaccaagggccacagagggagcca
    cacaatgaatggacactagagcttttagaggagcttaagaatgaagctgt
    tagacattttcctaggatttggctccatggcttagggcaacatatctatg
    aaacttatggggatacttgggcaggagtggaagccataataagaattctg
    caacaactgctgtttatccattttcagaattgggtgtcgacatagcagaa
    taggcgttactcgacagaggagagcaagaaatggagccagtagatcctag
    actagagccctggaagcatccaggaagtcagcctaaaactgcttgtacca
    attgctattgtaaaaagtgttgctttcattgccaagtttgtttcataaca
    aaagccttaggcatctcctatggcaggaagaagcggagacagcgacgaag
    agctcatcagaacagtcagactcatcaagcttctctatcaaagcagtaag
    tagtacatgtaacgcaacctataccaatagtagcaatagtagcattagta
    gtagcaataataatagcaatagttgtgtggtccatagtaatcatagaata
    taggaaaatattaagacaaagaaaaatagacaggttaattgatagactaa
    tagaaagagcagaagacagtggcaatgagagtgaaggagaaatatcagca
    cttgtggagatgggggtggagatggggcaccatgctccttgggatgttga
    tgatctgtagtgctacagaaaaattgtgggtcacagtctattatggggta
    cctgtgtggaaggaagcaaccaccactctattttgtgcatcagatgctaa
    agcatatgatacagaggtacataatgtttgggccacacatgcctgtgtac
    ccacagaccccaacccacaagaagtagtattggtaaatgtgacagaaaat
    tttaacatgtggaaaaatgacatggtagaacagatgcatgaggatataat
    cagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactct
    gtgttagtttaaagtgcactgatttgaagaatgatactaataccaatagt
    agtagcgggagaatgataatggagaaaggagagataaaaaactgctcttt
    caatatcagcacaagcataagaggtaaggtgcagaaagaatatgcatttt
    tttataaacttgatataataccaatagataatgatactaccagctataag
    ttgacaagttgtaacacctcagtcattacacaggcctgtccaaaggtatc
    ctttgagccaattcccatacattattgtgccccggctggttttgcgattc
    taaaatgtaataataagacgttcaatggaacaggaccatgtacaaatgtc
    agcacagtacaatgtacacatggaattaggccagtagtatcaactcaact
    gctgttaaatggcagtctagcagaagaagaggtagtaattagatctgtca
    atttcacggacaatgctaaaaccataatagtacagctgaacacatctgta
    gaaattaattgtacaagacccaacaacaatacaagaaaaagaatccgtat
    ccagagaggaccagggagagcatttgttacaataggaaaaataggaaata
    tgagacaagcacattgtaacattagtagagcaaaatggaataacacttta
    aaacagatagctagcaaattaagagaacaatttggaaataataaaacaat
    aatctttaagcaatcctcaggaggggacccagaaattgtaacgcacagtt
    ttaattgtggaggggaatttttctactgtaattcaacacaactgtttaat
    agtacttggtttaatagtacttggagtactgaagggtcaaataacactga
    aggaagtgacacaatcaccctcccatgcagaataaaacaaattataaaca
    tgtggcagaaagtaggaaaagcaatgtatgcccctcccatcagtggacaa
    attagatgttcatcaaatattacagggctgctattaacaagagatggtgg
    taatagcaacaatgagtccgagatcttcagacctggaggaggagatatga
    gggacaattggagaagtgaattatataaatataaagtagtaaaaattgaa
    ccattaggagtagcacccaccaaggcaaagagaagagtggtgcagagaga
    aaaaagagcagtgggaataggagctttgttccttgggttcttgggagcag
    caggaagcactatgggcgcagcctcaatgacgctgacggtacaggccaga
    caattattgtctggtatagtgcagcagcagaacaatttgctgagggctat
    tgaggcgcaacagcatctgttgcaactcacagtctggggcatcaagcagc
    tccaggcaagaatcctggctgtggaaagatacctaaaggatcaacagctc
    ctggggatttggggttgctctggaaaactcatttgcaccactgctgtgcc
    ttggaatgctagttggagtaataaatctctggaacagatttggaatcaca
    cgacctggatggagtgggacagagaaattaacaattacacaagcttaata
    cactccttaattgaagaatcgcaaaaccagcaagaaaagaatgaacaaga
    attattggaattagataaatgggcaagtttgtggaattggtttaacataa
    caaattggctgtggtatataaaattattcataatgatagtaggaggcttg
    gtaggtttaagaatagtttttgctgtactttctatagtgaatagagttag
    gcagggatattcaccattatcgtttcagacccacctcccaaccccgaggg
    gacccgacaggcccgaaggaatagaagaagaaggtggagagagagacaga
    gacagatccattcgattagtgaacggatccttggcacttatctgggacga
    tctgcggagcctgtgcctcttcagctaccaccgcttgagagacttactct
    tgattgtaacgaggattgtggaacttctgggacgcagggggtgggaagcc
    ctcaaatattggtggaatctcctacagtattggagtcaggaactaaagaa
    tagtgctgttagcttgctcaatgccacagccatagcagtagctgagggga
    cagatagggttatagaagtagtacaaggagcttgtagagctattcgccac
    atacctagaagaataagacagggcttggaaaggattttgctataagatgg
    gtggcaagtggtcaaaaagtagtgtgattggatggcctactgtaagggaa
    agaatgagacgagctgagccagcagcagatagggtgggagcagcatctcg
    agacctggaaaaacatggagcaatcacaagtagcaatacagcagctacca
    atgctgcttgtgcctggctagaagcacaagaggaggaggaggtgggtttt
    ccagtcacacctcaggtacctttaagaccaatgacttacaaggcagctgt
    agatcttagccactttttaaaagaaaaggggggactggaagggctaattc
    actcccaaagaagacaagatatccttgatctgtggatctaccacacacaa
    ggctacttccctgattagcagaactacacaccagggccaggggtcagata
    tccactgacctttggatggtgctacaagctagtaccagttgagccagata
    agatagaagaggccaataaaggagagaacaccagcttgttacaccctgtg
    agcctgcatgggatggatgacccggagagagaagtgttagagtggaggtt
    tgacagccgcctagcatttcatcacgtggcccgagagctgcatccggagt
    acttcaagaactgctgacatcgagcttgctacaagggactttccgctggg
    gactttccagggaggcgtggcctgggcgggactggggagtggcgagccct
    cagatcctgcatataagcagctgctttttgcctgtactgggtctctctgg
    ttagaccagatctgagcctgggagctctctggctaactagggaacccact
    gcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgccc
    gtctgttgtgtgactctggtaactagagatccctcagacccttttagtca
    gtgtggaaaatctctagca

6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
SCALES
Organism         Genome size        Number of genes
virus            10-100,000 bases   10s-100s
bacteria         5 Mb               4,000
single cell      15 Mb              6,000
simple animal    100 Mb             15,000
Homo sapiens     3,000 Mb           30,000-40,000

11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
(No Transcript)
15
(No Transcript)
16
Examples of Computational Problems
  • Physical and Genetic Maps
  • Genome assembly
  • Pairwise and Multiple Alignments
  • Motif Detection/Discrimination/Classification
  • Database Searches and Mining
  • Phylogenetic Tree Reconstruction
  • Gene Finding and Gene Parsing
  • Protein Secondary Structure Prediction
  • Protein Tertiary Structure Prediction
  • Protein Function Prediction
  • Comparative genomics and evolution
  • DNA microarray analysis
  • Molecular docking/Drug design
  • Gene regulation/regulatory networks
  • Systems biology

17
Machine Learning
  • Extract information from the data automatically
    (inference) via a process of model fitting
    (learning from examples).
  • Model Selection: Neural Networks, Hidden Markov
    Models, Stochastic Grammars, Bayesian Networks.
  • Model Fitting: Gradient Methods, Monte Carlo
    Methods, etc.
  • Machine learning approaches are most useful in
    areas where there is a lot of data but little
    theory.

18
Three Key Factors for Expansion
  • Data Mining/Machine Learning expansion is fueled
    by:
  • Progress in sensors, data storage, and data
    management.
  • Computing power.
  • Theoretical framework.

19
Intuitive Approach
  • Look at ALL available data, background
    information, and hypotheses.
  • Use probabilities to express PRIOR knowledge
  • Use probabilities for inference, model selection,
    model comparison, etc. by computing POSTERIOR
    distributions and deriving UNIQUE answers

20
Deduction and Induction
  • DEDUCTION
  • If A → B and A is true,
  • then B is true.
  • INDUCTION
  • If A → B and B is true,
  • then A is more plausible.

21
Bayesian Statistics
  • Bayesian framework for induction: we start with a
    hypothesis space and wish to express relative
    preferences in terms of background information
    (the Cox-Jaynes axioms).
  • Axiom 0: Transitivity of preferences.
  • Theorem 1: Preferences can be represented by a
    real number p(A).
  • Axiom 1: There exists a function f such that
    p(not A) = f(p(A)).
  • Axiom 2: There exists a function F such that
    p(A, B) = F(p(A), p(B|A)).
  • Theorem 2: There is always a rescaling w such that
    p'(A) = w(p(A)) is in [0,1] and satisfies the sum
    and product rules.

22
Probability as Degree of Belief
  • Sum Rule:
  • P(not A) = 1 - P(A)
  • Product Rule:
  • P(A and B) = P(A) P(B|A)
  • Bayes' Theorem:
  • P(B|A) = P(A|B) P(B) / P(A)
  • Induction Form:
  • P(M|D) = P(D|M) P(M) / P(D)
  • Equivalently:
  • log P(M|D) = log P(D|M) + log P(M) - log P(D)
  • Recursive Form (see the sketch below):
  • P(M|D1,...,Dn+1) = P(Dn+1|M) P(M|D1,...,Dn) /
    P(Dn+1|D1,...,Dn)
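
A hedged Python sketch (not from the presentation) of the recursive form:
two invented coin models, with the posterior over models reweighted by the
likelihood of each new observation.

    # Sequential Bayesian update: P(M|D1..Dn+1) ∝ P(Dn+1|M) P(M|D1..Dn).
    # The two models and the data below are invented for illustration.
    def update(posterior, likelihoods, x):
        unnorm = {m: likelihoods[m](x) * p for m, p in posterior.items()}
        z = sum(unnorm.values())            # = P(Dn+1 | D1..Dn)
        return {m: v / z for m, v in unnorm.items()}

    likelihoods = {
        "fair":   lambda x: 0.5,                      # P(x | fair coin)
        "biased": lambda x: 0.8 if x == 1 else 0.2,   # P(x | biased coin)
    }
    posterior = {"fair": 0.5, "biased": 0.5}          # uniform prior P(M)
    for flip in [1, 1, 0, 1, 1, 1]:                   # data D1..Dn
        posterior = update(posterior, likelihoods, flip)
    print(posterior)  # mass shifts toward the better-fitting model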

23
Learning
  • MODEL FITTING AND MODEL COMPARISON
  • MAXIMUM LIKELIHOOD AND MAXIMUM A POSTERIORI

24
Paradox
  • A non-probabilistic model is NOT a scientific
    model.

25
EXAMPLES OF NON-SCIENTIFIC MODELS
  • F = ma
  • E = mc²
  • etc.
  • These are only first-order approximations and do
    not fit the data exactly (the likelihood is zero).
  • Correction: (F + εF) = (m + εm)(a + εa), with ε
    denoting noise terms.

26
Paradox
  • TO CHOOSE A SIMPLE MODEL BECAUSE DATA IS
    SCARCE IS LIKE SEARCHING FOR THE KEY UNDER THE
    LIGHT IN THE PARKING LOT AT NIGHT

27
Different Levels of Bayesian Inference
  • Level 1: Find the best model w.
  • Level 2: Integrate over models.

28
Model Classes
  • NEURAL NETWORKS
  • MARKOV MODELS
  • HIDDEN MARKOV MODELS
  • STOCHASTIC GRAMMARS
  • DECISION TREES
  • BAYESIAN NETWORKS
  • GRAPHICAL MODELS
  • KERNEL METHODS (SVMs, Gaussian kernels, spectral
    kernels, etc.)

29
PRIORS
  • NON-INFORMATIVE PRIORS (UNIFORM, MAXIMUM ENTROPY,
    SYMMETRIES)
  • STANDARD PRIORS: GAUSSIAN, DIRICHLET, CONJUGATE,
    ETC.

30
LEARNING ALGORITHMS
  • Minimize -log P(M|D) (see the sketch below).
  • Gradient methods (gradient descent, conjugate
    gradient, back-propagation).
  • Monte Carlo methods (Metropolis, Gibbs sampling,
    simulated annealing).
  • Other methods: EM (Expectation-Maximization),
    GEM, etc.
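
A hedged sketch (not from the presentation): plain gradient descent on
-log P(M|D) for a toy model with invented numbers, data ~ N(mu, 1) and a
Gaussian prior mu ~ N(0, 10).

    import random

    random.seed(1)
    data = [random.gauss(2.0, 1.0) for _ in range(100)]

    def grad_neg_log_posterior(mu):
        # -log P(mu|D) = sum_i (x_i - mu)^2 / 2 + mu^2 / 20 + const
        return sum(mu - x for x in data) + mu / 10.0

    mu, rate = 0.0, 0.005       # initial guess, learning rate
    for _ in range(200):
        mu -= rate * grad_neg_log_posterior(mu)
    print(mu)                   # approaches the MAP estimate (near 2)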

31
OTHER ASPECTS
  • Model complexity
  • VC dimension
  • Minimum description length
  • Validation and cross validation
  • Early stopping
  • Ensembles
  • Boosting, bagging
  • Second order methods (Hessian, Fisher information
    matrix)
  • etc.

32
(No Transcript)
33
AXIOMATIC HIERARCHY
  • GAME THEORY
  • DECISION THEORY
  • BAYESIAN STATISTICS
  • GRAPHICAL MODELS

34
  • In general it is wise to model biological data
    probabilistically, for several reasons:
  • 1. Measurement noise (arrays vs sequences)
  • 2. Variability (evolutionary tinkering)
  • 3. High-dimensional complex systems (hidden
    variables)

35
MARKOV MODELS
36
(No Transcript)
37
(No Transcript)
38
  • Problem of long-range dependencies.
  • Generalization: graphical models (a first-order
    Markov chain sketch follows below).
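
A hedged Python sketch (not from the presentation) of a plain first-order
Markov model of DNA: transition probabilities estimated from dinucleotide
counts, then a sequence scored by its log-likelihood. The training strings
are invented.

    import math
    from collections import defaultdict

    def train(seqs):
        counts = defaultdict(lambda: defaultdict(int))
        for s in seqs:
            for a, b in zip(s, s[1:]):   # count dinucleotide a -> b
                counts[a][b] += 1
        return {a: {b: c / sum(row.values()) for b, c in row.items()}
                for a, row in counts.items()}

    def log_likelihood(model, s):
        # first-order chain: log prod_i P(s[i+1] | s[i]); initial term omitted
        return sum(math.log(model.get(a, {}).get(b, 1e-9))
                   for a, b in zip(s, s[1:]))

    model = train(["acgtacgtgacg", "acgtgtgtacga"])
    print(log_likelihood(model, "acgtacg"))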

39
ARTIFICIAL NEURAL NETWORKS
40
(No Transcript)
41
MODEL NEURON
out = f(Σi wi ui - t)
f = threshold function, sigmoidal function, etc.
wi = synaptic weights, ui = inputs, t = threshold
(A sketch follows below.)
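
A hedged Python sketch of the model neuron above, assuming the sigmoidal
case; the weights, inputs, and threshold are invented numbers.

    import math

    def neuron(u, w, t):
        # out = f(sum_i w_i * u_i - t), with f = logistic sigmoid
        s = sum(wi * ui for wi, ui in zip(w, u)) - t
        return 1.0 / (1.0 + math.exp(-s))

    print(neuron(u=[0.5, 1.0, -0.2], w=[0.8, -0.3, 0.5], t=0.1))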
42
(No Transcript)
43
NEURAL NETWORKS
  • Early Neural Networks: layered feed-forward, one
    hidden layer, sigmoidal transfer functions, LMS
    error.
  • Modern Neural Networks: a general class of
    Bayesian statistical models.
  • Probabilistic Framework: the network N is
    deterministic, the data D = (x, y) are stochastic.
    The output of the network F(x) is the average or
    expectation of y given x.
  • Regression: Gaussian model, linear transfer
    functions in the output layer, error function LMS
    = negative log-likelihood.
  • Classification: Multinomial model, normalized
    exponential transfer functions in the output
    layer, error function KL distance (relative
    entropy) = negative log-likelihood.
  • Classification (special case): Binomial model,
    sigmoidal transfer functions in the output layer,
    error function KL distance (relative entropy) =
    negative log-likelihood.

44
UNIVERSAL APPROXIMATION PROPERTIES
  • Any reasonable input-output function can be
    approximated to any degree of precision by a
    neural network.
  • Trivial for Boolean functions and threshold
    units.
  • Simple proof for continuous functions.
  • The real issues are:
  • (1) Complexity (the size of the network).
  • (2) Learning (how to find the network).

45
(No Transcript)
46
BACK-PROPAGATION
Error E = F(w), computed at the output layer and
propagated back toward the input layer through the
weights wij (connecting unit j to unit i).
Gradient descent: Δwij = µ outj δi
µ = learning rate, outj = output of unit j,
δi = back-propagated error signal at unit i.
(A sketch follows below.)
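
A hedged Python sketch (not the presentation's code): back-propagation for
a tiny one-hidden-layer sigmoidal network with LMS error, matching the
update Δwij = µ outj δi above. The XOR data, network size, and learning
rate are invented.

    import math, random

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    random.seed(0)
    # last entry of each weight row is a bias (threshold) weight
    W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
    W2 = [random.uniform(-1, 1) for _ in range(4)]
    mu = 0.5                                        # learning rate

    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

    def forward(x):
        xb = x + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1]
        out = sigmoid(sum(w * v for w, v in zip(W2, h + [1.0])))
        return xb, h, out

    for _ in range(10000):
        for x, y in data:
            xb, h, out = forward(x)
            d_out = (y - out) * out * (1 - out)     # output error signal
            d_h = [d_out * W2[i] * h[i] * (1 - h[i]) for i in range(3)]
            for i in range(4):                      # Δw = µ * out_j * δ_i
                W2[i] += mu * (h + [1.0])[i] * d_out
            for i in range(3):
                for j in range(3):
                    W1[i][j] += mu * xb[j] * d_h[i]

    print([round(forward(x)[2], 2) for x, _ in data])  # typically near [0, 1, 1, 0]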
47
HIDDEN MARKOV MODELS
48
Hidden Markov Models
  • A first-order Hidden Markov Model is completely
    defined by:
  • A set of states.
  • An alphabet of symbols.
  • A transition probability matrix T = (tij).
  • An emission probability matrix E = (ei(X)).
    (A toy example follows below.)
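
As an illustration (not from the slides), a hedged forward-algorithm sketch
for a toy two-state DNA HMM; the states and the T and E values are
invented.

    states = ["AT-rich", "GC-rich"]
    T = {"AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},          # (tij)
         "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9}}
    E = {"AT-rich": {"a": 0.4, "t": 0.4, "c": 0.1, "g": 0.1},  # ei(X)
         "GC-rich": {"a": 0.1, "t": 0.1, "c": 0.4, "g": 0.4}}
    start = {"AT-rich": 0.5, "GC-rich": 0.5}

    def forward(seq):
        # alpha[i] = P(symbols so far, current state = i)
        alpha = {i: start[i] * E[i][seq[0]] for i in states}
        for x in seq[1:]:
            alpha = {j: E[j][x] * sum(alpha[i] * T[i][j] for i in states)
                     for j in states}
        return sum(alpha.values())   # likelihood P(seq | model)

    print(forward("acgtgcgcgc"))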

49
(No Transcript)
50
Basic Ideas
  • As in speech recognition, use Hidden Markov
    Models (HMMs) to model a family of related primary
    sequences.
  • As in speech recognition, in general use a
    left-to-right HMM: once the system leaves a state
    it can never reenter it. The basic architecture
    consists of a main backbone chain of main states,
    and two side chains of insert and delete states.
  • The parameters of the model are the transition
    and emission probabilities. These parameters are
    adjusted during training from examples.
  • After learning, the model can be used in a
    variety of tasks, including multiple alignments,
    detection of motifs, classification, and database
    searches.

51
HMM APPLICATIONS
  • MULTIPLE ALIGNMENTS
  • DATABASE SEARCHES AND
    DISCRIMINATION/CLASSIFICATION
  • STRUCTURAL ANALYSIS AND
    PATTERN DISCOVERY

52
Multiple Alignments
  • No precise definition of what a good alignment is
    (low entropy, detection of motifs).
  • The multiple alignment problem is NP-complete
    (finding the longest common subsequence).
  • Pairwise alignment can be solved efficiently by
    dynamic programming in O(N²) steps (see the
    sketch below).
  • For K sequences of average length N, dynamic
    programming scales like O(N^K), i.e. exponentially
    in the number of sequences.
  • Problem of variable scores and gap penalties.
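
A hedged sketch (not from the presentation) of the O(N²) pairwise dynamic
program (Needleman-Wunsch global alignment score); the match, mismatch,
and gap scores are invented.

    def align_score(s, t, match=1, mismatch=-1, gap=-2):
        n, m = len(s), len(t)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            D[i][0] = i * gap                          # gaps down column 0
        for j in range(1, m + 1):
            D[0][j] = j * gap                          # gaps across row 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = match if s[i - 1] == t[j - 1] else mismatch
                D[i][j] = max(D[i - 1][j - 1] + sub,   # (mis)match
                              D[i - 1][j] + gap,       # gap in t
                              D[i][j - 1] + gap)       # gap in s
        return D[n][m]

    print(align_score("gattaca", "gcatgcg"))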

53
HMMs of Protein Families
  • Globins
  • Immunoglobulins
  • Kinases
  • G-Protein-Coupled Receptors
  • Pfam is a database of protein domains.

54
HMMs of DNA
  • coding/non-coding regions (E. coli)
  • exons/introns/acceptor sites
  • promoter regions
  • gene finding/gene parsing

55
(No Transcript)
56
IMMUNOGLOBULINS
  • 294 sequences (V regions) with minimum length 90,
    average length 117, and maximum length 254.
  • Linear model of length 117 trained with a random
    subset of 150 sequences.

57
(No Transcript)
58
IG MODEL ENTROPY
59
IG EMISSIONS
60
IG Viterbi Path
61
IG MULTIPLE ALIGNMENT
62
G-PROTEIN-COUPLED RECEPTORS
  • 145 sequences with minimum length 310, average
    length 430, and maximum length 764.
  • Model trained with 143 sequences (3 sequences
    contained undefined symbols) using Viterbi
    learning (a decoding sketch follows below).
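
A hedged Viterbi-decoding sketch (not from the presentation), reusing the
same invented two-state DNA HMM as above; Viterbi learning re-estimates
the parameters from counts along such most-probable state paths.

    import math

    states = ["AT-rich", "GC-rich"]
    T = {"AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
         "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9}}
    E = {"AT-rich": {"a": 0.4, "t": 0.4, "c": 0.1, "g": 0.1},
         "GC-rich": {"a": 0.1, "t": 0.1, "c": 0.4, "g": 0.4}}

    def viterbi(seq):
        v = {i: math.log(0.5) + math.log(E[i][seq[0]]) for i in states}
        back = []
        for x in seq[1:]:
            step, ptr = {}, {}
            for j in states:
                best = max(states, key=lambda i: v[i] + math.log(T[i][j]))
                ptr[j] = best
                step[j] = v[best] + math.log(T[best][j]) + math.log(E[j][x])
            back.append(ptr)
            v = step
        path = [max(states, key=lambda i: v[i])]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])   # follow back-pointers
        return list(reversed(path))

    print(viterbi("aattgcgcgc"))  # expect AT-rich prefix, GC-rich suffix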

63
GPCR ENTROPY
64
(No Transcript)
65
GPCR SCORING
66
(No Transcript)
67
Linear Architecture
68
Loop Architecture
69
Wheel Architecture
70
(No Transcript)
71
(No Transcript)
72
(No Transcript)
73
PROMOTER ENTROPY
74
PROMOTER BENDABILITY
75
(No Transcript)
76
(No Transcript)
77
(No Transcript)
78
GRAPHICAL MODELS
79
GRAPHICAL MODELS
  • Bayesian statistics and modeling lead to very
    high-dimensional distributions P(D,H,M) which are
    typically intractable.
  • Need for factorization into independent clusters
    of variables that reflect the local (Markovian)
    dependencies of the world and the data.
  • Hence the general theory of graphical models.
  • Directed models reflect temporal and causality
    relationships: NNs, HMMs, Bayesian networks, etc.
  • Directed models are used, for instance, in expert
    systems.
  • Undirected models reflect correlations: Markov
    Random Fields, Boltzmann machines, etc.
  • Undirected models are used, for instance, in image
    modeling problems.
  • Directed/undirected and other hybrid models are
    also possible.

80
GRAPHICAL MODELS BAYESIAN NETWORKS
  • X1, ..., Xn: random variables associated with the
    vertices of a DAG (Directed Acyclic Graph).
  • The local conditional distributions P(Xi | Xj : j
    parent of i) are the parameters of the model.
    They can be represented by look-up tables
    (costly) or other more compact parameterizations
    (Sigmoidal Belief Networks, XOR, etc.).
  • The global distribution is the product of the
    local characteristics: P(X1, ..., Xn) =
    ∏i P(Xi | Xj : j parent of i).
    (A toy example follows below.)
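
A hedged toy example (not from the presentation) of the
product-of-local-conditionals factorization, using the classic
rain/sprinkler/wet-grass network with invented probabilities.

    # Joint: P(R, S, W) = P(R) * P(S | R) * P(W | R, S)
    P_rain = {True: 0.2, False: 0.8}
    P_sprinkler = {True: {True: 0.01, False: 0.99},      # P(S | R)
                   False: {True: 0.4, False: 0.6}}
    P_wet = {(True, True): {True: 0.99, False: 0.01},    # P(W | R, S)
             (True, False): {True: 0.8, False: 0.2},
             (False, True): {True: 0.9, False: 0.1},
             (False, False): {True: 0.0, False: 1.0}}

    def joint(r, s, w):
        return P_rain[r] * P_sprinkler[r][s] * P_wet[(r, s)][w]

    # marginal P(W = true), summing the factorized joint over R and S
    print(sum(joint(r, s, True) for r in (True, False) for s in (True, False)))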

81
(No Transcript)
82
(No Transcript)
83
(No Transcript)
84
(No Transcript)
85
Inference and Learning
  • Visible and Hidden Nodes
  • Bayes rule
  • Trees and Polytrees
  • General DAGs and the Junction Tree Algorithm
  • Belief propagation in DAGs
  • Approximation methods (e.g. variational methods,
    etc.)

86
PROTEIN STRUCTURE PREDICTION (GMs AND RNNs)
87
PROTEINS
  • Backbone diagram: a repeating chain of N and Ca
    backbone atoms, with side chains R1, R2, R3, ...
    branching from the Ca atoms via Cß.
88
(No Transcript)
89
Utility of Structural Information
(Baker and Sali, 2001)
90
CAVEAT
91
REMARKS
  • Structure/Folding
  • Backbone/Full Atom
  • Homology Modeling
  • Protein Threading
  • Ab Initio (Physical Potentials/Molecular
    Dynamics, Statistical Mechanics/Lattice Models)
  • Statistical/Machine Learning (Training Sets, SS
    prediction)
  • Mixtures: ab initio with statistical potentials,
    machine learning with profiles, etc.

92
PROTEIN STRUCTURE PREDICTION
93
(No Transcript)
94
Helices
  • 1GRJ (GreA transcript cleavage factor from
    Escherichia coli)

95
Antiparallel ß-sheets
  • 1MSC (bacteriophage MS2 unassembled coat protein
    dimer)

96
Parallel ß-sheets
  • 1FUE (Flavodoxin)

97
Contact map
98
Secondary structure prediction
99
(No Transcript)
100
(No Transcript)
101
(No Transcript)
102
(No Transcript)
103
DATA PREPARATION
  • Starting point: the PDB database.
  • Remove sequences not determined by X-ray
    diffraction.
  • Remove sequences where DSSP crashes.
  • Remove proteins with physical chain breaks
    (neighboring AAs with distances exceeding 4
    Angstroms; see the sketch below).
  • Remove sequences with resolution worse than 2.5
    Angstroms.
  • Remove chains with fewer than 30 AAs.
  • Remove redundancy (Hobohm's algorithm,
    Smith-Waterman, PAM 120, etc.).
  • Build multiple alignments (BLAST, PSI-BLAST,
    etc.).
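
A hedged sketch of the chain-break filter above, assuming consecutive
C-alpha coordinates per chain and the 4 Angstrom cutoff; the coordinates
below are invented.

    import math

    def has_chain_break(ca_coords, max_dist=4.0):
        # consecutive residues farther apart than max_dist Angstroms => break
        return any(math.dist(p, q) > max_dist
                   for p, q in zip(ca_coords, ca_coords[1:]))

    print(has_chain_break([(0, 0, 0), (3.8, 0, 0), (12.0, 0, 0)]))  # True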

104
SECONDARY STRUCTURE PROGRAMS
  • DSSP (Kabsch and Sander, 1983): works by
    assigning potential backbone hydrogen bonds
    (based on the 3D coordinates of the backbone
    atoms) and subsequently by identifying repetitive
    bonding patterns.
  • STRIDE (Frishman and Argos, 1995): in addition
    to hydrogen bonds, it also uses dihedral angles.
  • DEFINE (Richards and Kundrot, 1988): uses
    difference distance matrices for evaluating the
    match of interatomic distances in the protein to
    those from idealized SS.

105
SECONDARY STRUCTURE ASSIGNMENTS
  • DSSP classes:
  • H = alpha helix
  • E = sheet
  • G = 3-10 helix
  • S = kind of turn
  • T = beta turn
  • B = beta bridge
  • I = pi-helix (very rare)
  • C = the rest
  • CASP (harder) assignment:
  • α = H and G
  • β = E and B
  • γ = the rest
  • Alternative assignment:
  • α = H
  • β = B
  • γ = the rest
  • (A mapping sketch follows below.)
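
A hedged sketch of collapsing the 8 DSSP classes to the CASP-style
3-class alphabet above (α = H, G; β = E, B; γ = the rest).

    DSSP_TO_CASP3 = {"H": "a", "G": "a",                      # alpha
                     "E": "b", "B": "b",                      # beta
                     "T": "g", "S": "g", "I": "g", "C": "g"}  # gamma (rest)

    def three_class(dssp):
        return "".join(DSSP_TO_CASP3[c] for c in dssp)

    print(three_class("CCHHHHEEBTSC"))  # -> "ggaaaabbbggg"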

106
ENSEMBLES
107
(No Transcript)
108
(No Transcript)
109
(No Transcript)
110
FUNDAMENTAL LIMITATIONS
  • 100% CORRECT RECOGNITION IS PROBABLY IMPOSSIBLE
    FOR SEVERAL REASONS:
  • SOME PROTEINS DO NOT FOLD SPONTANEOUSLY OR MAY
    NEED CHAPERONES
  • QUATERNARY STRUCTURE: BETA-STRAND PARTNERS MAY BE
    ON A DIFFERENT CHAIN
  • STRUCTURE MAY DEPEND ON OTHER VARIABLES:
    ENVIRONMENT, pH
  • DYNAMICAL ASPECTS
  • FUZZINESS OF DEFINITIONS AND ERRORS IN DATABASES

111
(No Transcript)
112
(No Transcript)
113
BB-RNNs
114
2D RNNs
115
2D INPUTS
  • AA at positions i and j
  • Profiles at positions i and j
  • Correlated profiles at positions i and j
  • Secondary Structure, Accessibility, etc.

116
(No Transcript)
117
Protein Reconstruction
Using predicted secondary structure and predicted
contact map
PDB ID 1HCR, chain A (52 residues)
Sequence:
GRPRAINKHEQEQISRLLEKGHPRQQLAIIFGIGVSTLYRYFPASSIKKRMN
True SS:
CCCCCCCCHHHHHHHHHHHCCCCHHHHHHHCECCHHHHHHHCCCCCCCCCCC
Pred SS:
CCCCCCCHHHHHHHHHHHHCCCCHHHHEEHECHHHHHHHHCCCHHHHHHHCC
Model 147, RMSD 3.47 Å
118
Protein Reconstruction
Using predicted secondary structure and predicted
contact map
PDB ID 1BC8, chain C (93 residues)
Sequence:
MDSAITLWQFLLQLLQKPQNKHMICWTSNDGQFKLLQAEEVARLWGIRKN
KPNMNYDKLSRALRYYYVKNIIKKVNGQKFVYKFVSYPEILNM
True SS:
CCCCCCHHHHHHHHCCCHHHCCCCEECCCCCEEECCCHHHHHHHH
HHHHCCCCCCHHHHHHHHHHHHHHCCEEECCCCCCEEEECCCCHHHCC
Pred SS:
CCCHHHHHHHHHHHHHCCCCCCEEEEECCCEEEEECCHHHH
HHHHHHHCCCCCCCHHHHHHHHHHHHHCCCEEECCCCEEEEEEECCHHHH
CC
Model 1714, RMSD 4.21 Å
119
STRUCTURAL PROTEOMICS SUITE
  • www.igb.uci.edu
  • SSpro: secondary structure
  • SSpro8: secondary structure (8 classes)
  • ACCpro: accessibility
  • CONpro: contact number
  • DI-pro: disulphide bridges
  • BETA-pro: beta partners
  • CMAP-pro: contact map
  • CCMAP-pro: coarse contact map
  • CON23D-pro: contact map to 3D
  • 3D-pro: 3D structure

120
(No Transcript)
121
Advantage of Machine Learning
  • Pitfalls of traditional ab initio approaches.
  • Machine learning systems take time to train
    (weeks).
  • Once trained, however, they can predict structures
    almost faster than proteins can fold.
  • Predict or search protein structures on a genomic
    or bioengineering scale.

122
DAG-RNNs APPROACH
  • Two steps:
  • 1. Build the relevant DAG to connect inputs,
    outputs, and hidden variables.
  • 2. Use a deterministic (neural network)
    parameterization together with appropriate
    stationarity assumptions/weight sharing; the
    overall model remains probabilistic.
  • Process structured data of variable size,
    topology, and dimension efficiently:
  • sequences, trees, d-lattices, graphs, etc.
  • Convergence theorems.
  • Other applications.

123
(No Transcript)
124
ACKNOWLEDGMENTS
  • UCI:
  • Gianluca Pollastri, Michal Rosen-Zvi
  • Arlo Randall, Pierre-Francois Baisnee, S. Josh
    Swamidass, Jianlin Cheng, Yimeng Dou, Yann
    Pecout, Mike Sweredoski, Alessandro Vullo, Lin
    Wu
  • James Nowick, Luis Villareal
  • DTU: Soren Brunak
  • Columbia: Burkhard Rost
  • U of Florence: Paolo Frasconi
  • U of Bologna: Rita Casadio, Piero Fariselli
  • www.igb.uci.edu/tools.htm
  • www.ics.uci.edu/~pfbaldi