1
BCB 444/544
  • Lecture 18
  • More details on HMMs
  • Protein Motifs & Domain Prediction
  • Maybe: Protein Structure - The Basics
  • 18_Oct03

2
Required Reading (before lecture)
  • Mon Oct 1 - Lecture 17
  • Protein Motifs & Domain Prediction
  • Chp 7 - pp 85-96
  • Wed Oct 3 - Lecture 18
  • Protein Structure: The Basics (Note chg in
    lecture Schedule!)
  • Chp 12 - pp 173-186
  • Thurs Oct 4 - Lab 6
  • Protein Structure Databases & Visualization
  • Fri Oct 5 - Lecture 19
  • Protein Structure Classification & Comparison
  • Chp 13 - pp 187-199

3
Assignments & Announcements
  • HW544Extra 1 -
  • Due: Task 1.1 - Mon Oct 1 (today) by noon
  • Task 1.2 & Task 2 - Mon Oct 8 by 5 PM
  • HomeWork 3 - posted online
  • Due Mon Oct 8 by 5 PM

4
BCB 544 - Extra Required Reading
  • Mon Sept 24
  • BCB 544 Extra Required Reading Assignment
  • Pollard KS, Salama SR, Lambert N, Lambot MA,
    Coppens S, Pedersen JS, Katzman S, King B,
    Onodera C, Siepel A, Kern AD, Dehay C, Igel H,
    Ares M Jr, Vanderhaeghen P, Haussler D. (2006) An
    RNA gene expressed during cortical development
    evolved rapidly in humans. Nature 443: 167-172.
  • http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html
  • doi:10.1038/nature05113
  • PDF available on class website - under Required
    Reading Link

5
A few Online Resources for Cell & Molecular
Biology
  • NCBI Science Primer: What is a cell?
  • http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
  • NCBI Science Primer: What is a genome?
  • http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
  • BioTech's Life Science Dictionary
  • http://biotech.icmb.utexas.edu/search/dict-search.html
  • NCBI Bookshelf
  • http://www.ncbi.nlm.nih.gov/sites/entrez?db=books

6
Statistics References
  • Statistical Inference (Hardcover)
  • George Casella, Roger L. Berger

  • StatWeb: A Guide to Basic Statistics for
    Biologists
  • http://www.dur.ac.uk/stat.web/
  • Basic Statistics (correlations, tests,
    frequencies, etc.)
  • http://www.statsoft.com/textbook/stbasic.html
  • Electronic Statistics Textbook, StatSoft (from
    basic statistics to ANOVA to discriminant
    analysis, clustering, regression, data mining,
    machine learning, etc.)
  • http://www.statsoft.com/textbook/stathome.html

7
Extra Credit Questions 2-6
  • What is the size of the dystrophin gene (in kb)?
  • Is it still the largest known human
    protein?
  • What is the largest protein encoded in human
    genome (i.e., longest single polypeptide
    chain)?
  • What is the largest protein complex for which a
    structure is known (for any organism)?
  • What is the most abundant protein (naturally
    occurring) on earth?
  • Which state in the US has the largest number of
    mobile genetic elements (transposons) in its
    living population?
  • For 1 pt total (0.2 pt each) Answer all
    questions correctly
  • submit by to terrible_at_iastate.edu
  • For 2 pts total Prepare a PPT slide with all
    correct answers
  • submit to ddobbs_at_iastate.edu before 9 AM on
    Mon Oct 1
  • Choose one option - you can't earn 3 pts!
  • Partial credit for incorrect answers? only if
    they are truly amusing!

8
Extra Credit Questions 7 & 8
  • Given that each male attending our BCB 444/544
    class on a typical day is healthy (let's assume
    MH = 7), and is generating sperm at a rate equal to
    the average normal rate for reproductively
    competent males (dSp/dT = ? per minute)
  • 7a. How many rounds of meiosis will occur during
    our 50 minute class period?
  • 7b. How many total sperm will be produced by our
    BCB 444/544 class during that class period?
  • 8. How many rounds of meiosis will occur in
    the reproductively competent females in our
    class? (assume FH = 5)
  • For 0.6 pts total (0.2 pt each) Answer all
    questions correctly
  • submit by to terrible_at_iastate.edu
  • For 1 pt total Prepare a PPT slide with all
    correct answers
  • submit to ddobbs_at_iastate.edu before 9 AM on
    Mon Oct 1
  • Choose one option - you can't earn more than 1
    pt for this!
  • Partial credit for incorrect answers? only if
    they are truly amusing!

9
Answers?
10
Chp 6 - Profiles & Hidden Markov Models
  • SECTION II: SEQUENCE ALIGNMENT
  • Xiong: Chp 6
  • Profiles & HMMs
  • Position Specific Scoring Matrices (PSSMs)
  • PSI-BLAST
  • Profiles
  • Markov Models & Hidden Markov Models

11
Statistical Models for Representing Biological
Sequences
  • 3 types of probabilistic models, all of which:
  • Are based on MSA
  • Capture both observed frequencies & predicted
    frequencies of unobserved characters
  • In order of "sensitivity":
  • PSSM - scoring table derived from an ungapped
    MSA; stores frequencies (log odds scores) for
    each amino acid in each position of a protein
    sequence
  • Profile - a PSSM with gaps, based on a gapped MSA
    with penalties for insertions & deletions
  • HMM - hidden Markov Model - more complex
    mathematical model (than PSSMs or Profiles)
    because it also differentiates between insertions
    and deletions
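
A minimal sketch of the PSSM idea above, not taken from the slides: column frequencies from an ungapped MSA are converted to log2 odds scores against a background frequency. The toy DNA alignment, the uniform background, the pseudocount of 1 and the function names build_pssm and score are all illustrative assumptions.

```python
import math
from collections import Counter

def build_pssm(msa, alphabet, background=None, pseudocount=1.0):
    """Return one dict per alignment column: residue -> log2 odds score."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    n_seqs = len(msa)
    pssm = []
    for column in zip(*msa):  # iterate over alignment columns
        counts = Counter(column)
        scores = {}
        for a in alphabet:
            # Pseudocount keeps unobserved residues from getting log(0).
            freq = (counts[a] + pseudocount) / (n_seqs + pseudocount * len(alphabet))
            scores[a] = math.log2(freq / background[a])
        pssm.append(scores)
    return pssm

def score(pssm, seq):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(col[ch] for col, ch in zip(pssm, seq))

toy_msa = ["ACGT", "ACGA", "ACTT", "AGGT"]  # hypothetical ungapped MSA
pssm = build_pssm(toy_msa, "ACGT")
print(round(score(pssm, "ACGT"), 2), round(score(pssm, "TTAA"), 2))
```

A sequence that follows the conserved columns should score higher than one that does not; because the table has no gap states, it can only score sequences of the same length as the alignment.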

12
HMMs for Biological Sequences?
  • HMMs originally developed for speech recognition
  • Now widely used in bioinformatics
  • Many applications (motif/domain detection,
    sequence alignment, phylogenetic
  • HMMs are "machine learning" algorithms - must be
    "trained" to obtain optimal statistical
    parameters
  • For Biological sequences
  • each character of a sequence is considered a
    state in a Markov process

13
But, What is a Markov Model?
  • Markov Model (or Markov chain)
  • mathematical model used to describe a sequence
    of events that occur one after another in a chain
  • a process that moves in one direction from one
    state to the next with a certain transition
    probability
  • For biological sequences:
  • each letter = a state
  • linked together by transition probabilities

14
Different Types of Markov Models
  • Zero-order Markov Model: probability of current
    state is independent of previous state(s)
  • e.g., random sequence, each residue with equal
    frequency
  • First-order MM: probability of current state is
    determined by the previous state
  • e.g., frequencies of two linked residues
    (dimers) occurring simultaneously
  • Second-order MM: describes situation in which
    probability of current state is determined by
    the previous two states
  • e.g., frequencies of three linked residues
    (trimers) occurring simultaneously, as in a
    codon
  • Higher orders? Also possible, later
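
To make the difference between model orders concrete, here is a small sketch (an illustration, not from the lecture) that estimates zero-order residue frequencies and first-order transition probabilities from a single sequence. The example sequence and the function names zero_order and first_order are hypothetical.

```python
from collections import Counter, defaultdict

def zero_order(seq):
    """P(x): frequency of each residue, ignoring its neighbours."""
    counts = Counter(seq)
    return {s: c / len(seq) for s, c in counts.items()}

def first_order(seq):
    """P(current | previous): estimated from adjacent residue pairs (dimers)."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        pair_counts[prev][cur] += 1
    return {
        prev: {cur: c / sum(cnt.values()) for cur, c in cnt.items()}
        for prev, cnt in pair_counts.items()
    }

seq = "ACGCGCGTATATCGCG"  # hypothetical DNA sequence
p0 = zero_order(seq)
p1 = first_order(seq)
print(p0["C"], p1["C"]["G"])
```

In this toy sequence every C is followed by G, so the first-order estimate of P(G | C) is much higher than the zero-order frequency of G; a second-order model would condition on the previous two residues in the same way.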

15
So, What is a hidden Markov Model?
  • Hidden Markov Model (HMM)
  • a more sophisticated model in which some of
    the states are hidden
  • some "unobserved" factors influence the state
    transition probabilities
  • MM which combines 2 or more Markov chains
  • only 1 chain is made up of observed states
  • other chains are made up of unobserved or
    "hidden" states

16
Hidden Markov Models - HMMs
  • Goal: Find most likely explanation for observed
    variables
  • Components
  • States - composed of a number of elements or
    "symbols" (e.g., A,C,G,T)
  • Observed variables - sequence (or outcome) we can
    "see"
  • Hidden variables - insertions/deletions/transition
    probabilities that can't be "seen"
  • Emission probability - probability value
    associated with each "symbol" in each state
  • Transition probability - probability of going
    from one state to another
  • Special graphical representation used to
    illustrate relationships

17
An HMM for CpG Islands?
Emission probabilities are 0 or 1, e.g., e_G-(G) = 1,
e_G-(T) = 0
See Durbin et al., Biological Sequence Analysis,
Cambridge, 1998
18
HMM example from Eddy HMM paper: Toy HMM for
Splice Site Prediction
This is a new slide
19
An HMM for Occasionally Dishonest Casino
  • Transition probabilities
  • Prob(Fair → Loaded) = 0.01
  • Prob(Loaded → Fair) = 0.2

20
Calculating Different Paths to an Observed
Sequence
This slide has been changed
Calculations such as those shown below are used
to fill a matrix with probability values for
every state at every position
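
A sketch of the kind of path calculation described here, using the occasionally dishonest casino. Only the two switching probabilities (Fair → Loaded 0.01, Loaded → Fair 0.2) come from the slides; the start probabilities, the emission table (fair die 1/6 per face, loaded die 0.5 for a six) and the function name joint_probability are assumptions for illustration.

```python
STATES = ("F", "L")  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}  # assumed start probabilities
TRANS = {
    "F": {"F": 0.99, "L": 0.01},  # Fair -> Loaded = 0.01 (from the slides)
    "L": {"F": 0.20, "L": 0.80},  # Loaded -> Fair = 0.2 (from the slides)
}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},               # assumed fair die
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},    # assumed loaded die
}

def joint_probability(rolls, path):
    """P(rolls, path): multiply start, transition and emission terms."""
    p = START[path[0]] * EMIT[path[0]][rolls[0]]
    for i in range(1, len(rolls)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][rolls[i]]
    return p

rolls = "266"
for path in ("FFF", "LLL", "FLL"):
    print(path, joint_probability(rolls, path))
```

Each path through the hidden states gives a different probability for the same observed rolls; the matrix-filling algorithms on the following slides avoid enumerating every path explicitly.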
21
Calculating the Most Probable Path, using
Viterbi algorithm (using traceback as in DP)
This slide has been changed
Path within HMM that matches query sequence
with highest probability
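
A minimal Viterbi sketch with traceback for the casino HMM, worked in log space to avoid underflow. As before, only the two switching probabilities are from the slides; the start and emission values and the function name viterbi are illustrative assumptions.

```python
import math

STATES = ("F", "L")  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}  # assumed start probabilities
TRANS = {
    "F": {"F": 0.99, "L": 0.01},  # from the slides
    "L": {"F": 0.20, "L": 0.80},  # from the slides
}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},               # assumed fair die
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},    # assumed loaded die
}

def viterbi(rolls):
    """Return (most probable state path, log probability of that path)."""
    # First column: start probability times emission of the first roll.
    v = [{s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for r in rolls[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda q: v[-1][q] + math.log(TRANS[q][s]))
            row[s] = v[-1][best] + math.log(TRANS[best][s]) + math.log(EMIT[s][r])
            ptr[s] = best
        v.append(row)
        back.append(ptr)
    # Traceback from the best final state, as in dynamic programming alignment.
    last = max(STATES, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[-1][last]

print(viterbi("1234512345"))
print(viterbi("6666666666"))
```

With these assumed parameters, a run without sixes is best explained by the fair die throughout and a run of sixes by the loaded die throughout.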
22
Calculating the Total Probability
This slide has been changed
Note: This is not the same as the matrix on the
previous slide! Here, last column contains sums
for each row
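
The total probability of an observed sequence sums over all paths rather than taking the best one. The sketch below uses the forward algorithm for this and checks it against a brute-force sum over every path. Only the two switching probabilities are from the slides; the start and emission values and the function names forward and brute_force are assumptions.

```python
import itertools

STATES = ("F", "L")  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}  # assumed start probabilities
TRANS = {
    "F": {"F": 0.99, "L": 0.01},  # from the slides
    "L": {"F": 0.20, "L": 0.80},  # from the slides
}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},               # assumed fair die
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},    # assumed loaded die
}

def forward(rolls):
    """P(rolls): like Viterbi, but sum over predecessors instead of max."""
    f = {s: START[s] * EMIT[s][rolls[0]] for s in STATES}
    for r in rolls[1:]:
        f = {s: EMIT[s][r] * sum(f[p] * TRANS[p][s] for p in STATES)
             for s in STATES}
    return sum(f.values())

def brute_force(rolls):
    """Same quantity by enumerating every path (only feasible for short input)."""
    total = 0.0
    for path in itertools.product(STATES, repeat=len(rolls)):
        p = START[path[0]] * EMIT[path[0]][rolls[0]]
        for i in range(1, len(rolls)):
            p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][rolls[i]]
        total += p
    return total

rolls = "31566"
print(forward(rolls), brute_force(rolls))
```

The two numbers should agree up to floating-point rounding; the forward recursion needs work proportional to the sequence length, while the brute-force sum doubles with every extra roll.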
23
Estimating the Probabilities or Training
the HMM
This slide has been changed
  • Calculate frequencies in each column of MSA built
    from set of related sequences
  • Use frequency values to fill the emission and
    transition probabilities in the model (use two
    matrices for this)
  • Viterbi training
  • Derive probable paths for training data using
    Viterbi algorithm
  • Re-estimate transition probabilities based on
    Viterbi path
  • Iterate until paths stop changing
  • Other algorithms can be used
  • e.g., "forward" & "backward" algorithms
  • (see text - or see Wikipedia re HMMs)

24
Profile HMMs
  • Used to model a family of related sequences
  • (or motif or domain)
  • Derived from a MSA of family members
  • Transition & emission probabilities are
    position-specific
  • Set parameters of model so that total probability
    peaks at members of family
  • Sequences can be tested for family membership
    using Viterbi algorithm to evaluate match
    against profile

25
Profile HMM represents a gapped MSA
This slide has been changed
Character in alignment can be in one of 3 states:
  • Match - observed
  • Insert - hidden
  • Delete - hidden
[Figure labels: Hidden chains, Observed chain]
26
Example: Pfam Protein Families
http://pfam.sanger.ac.uk/
  • A comprehensive collection of protein domains
    and families, with a range of well-established
    uses including genome annotation.
  • Pfam: clans, web tools and services. R.D. Finn,
    A. Bateman (2006) Nucleic Acids Res Database
    Issue 34: D247-D5
  • Each family is represented by
  • 2 MSAs
  • 2 Hidden Markov Models (profile-HMMs)
  • cf. Superfamily - from Lab 5
  • similar collection of curated MSAs & HMMs,
    focuses on superfamily level

27
A few more Details re Profiles & HMMs
  • Smoothing or "Regularization" - method used to
    avoid "over-fitting"
  • Common problem in machine learning (data-driven)
    approaches
  • Limited training sample size causes
    over-representation of observed characters while
    "ignoring" unobserved characters
  • Result? Miss members of family not yet sampled
  • (too many false negative hits)
  • Pseudocounts - adding artificial values for
    'extra' amino acid(s) not observed in the
    training set
  • Treated as 'real' values in calculating
    probabilities
  • Improve predictive power of profiles & HMMs
  • Dirichlet mixture - commonly used mathematical
    model to simulate the aa distribution in a
    sequence alignment
  • To "correct" problems in an observed alignment
    based on limited number of sequences
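
A small sketch of the pseudocount idea, assuming a single alignment column and a flat pseudocount of 1 per amino acid (a simpler stand-in for a Dirichlet mixture, which is not implemented here). The column contents and the function name column_probabilities are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probabilities(column, pseudocount=0.0):
    """Emission probabilities for one MSA column, with optional pseudocounts."""
    counts = Counter(column)
    # Each of the 20 amino acids gets `pseudocount` extra observations.
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

column = "LLLIV"  # hypothetical column from a 5-sequence training alignment
raw = column_probabilities(column)
smoothed = column_probabilities(column, pseudocount=1.0)
print(raw["M"], smoothed["M"])
print(raw["L"], smoothed["L"])
```

Without pseudocounts, methionine gets probability zero at this position, so any family member with M there would be rejected outright; with pseudocounts it gets a small non-zero probability while leucine remains the most likely residue.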

28
Applications (of PSSMs, Profiles, HMMs)
  • HMMer - for building & using HMMs
  • developed by Sean Eddy's group
  • Not a web-based server; must download the
    software
  • 9 related programs
  • but check out the site - it's fun!
  • Psi-BLAST - you've heard enough about this!
  • Uses Profiles (not actually PSSMs) - iteratively
  • In previous lab used SuperFam (HMMs)
  • http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/
  • Prosite - includes patterns (regular expressions)
    & profiles for motifs & domains
  • http://ca.expasy.org/prosite
  • Pfam (MSAs & HMMs)
  • http://pfam.sanger.ac.uk/ (new URL)
  • Many others

29
Chp 7 - Protein Motifs & Domain Prediction
  • SECTION II: SEQUENCE ALIGNMENT
  • Xiong: Chp 7
  • Protein Motifs and Domain Prediction
  • Identification of Motifs & Domains in MSAs
  • Motif & Domain Databases Using Regular
    Expressions
  • Motif & Domain Databases Using Statistical Models
  • Protein Family Databases
  • Motif Discovery in Unaligned Sequences
  • Sequence Logos

30
Motifs & Domains
  • Motif - short conserved sequence pattern
  • Associated with distinct function in protein or
    DNA
  • Avg 10 residues (usually 6-20 residues)
  • e.g., zinc finger motif - in protein
  • e.g., TATA box - in DNA
  • Domain - "longer" conserved sequence pattern,
    defined as an independent functional and/or
    structural unit
  • Avg 100 residues (range from 40-700 in
    proteins)
  • e.g., kinase domain or transmembrane domain - in
    protein
  • Domains may (or may not) include motifs

31
2 Approaches for Representing "Consensus"
Information in Motifs & Domains
  • Regular expression - reduce information from MSA
  • e.g., protein phosphorylation site motif
    [S,T]-X-[R,K]
  • Symbols represent specific or unspecified
    residues, spaces, etc.
  • 2 mechanisms for matching
  • Exact
  • "Fuzzy" (inexact, approximate) - flexible, more
    permissive to detect "near matches"
  • Statistical model - includes probability
    information derived from MSA
  • e.g., PSSM, Profile or HMM
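
As an illustration of exact regular-expression matching, the sketch below scans a protein sequence for the phosphorylation-site motif from this slide (S or T, then any residue, then R or K) with Python's re module. The test sequence and the function name find_sites are made up for the example; a lookahead is used so that overlapping sites are all reported.

```python
import re

# (S or T) - any residue - (R or K); the lookahead allows overlapping hits.
PATTERN = re.compile(r"(?=([ST].[RK]))")

def find_sites(protein):
    """Return (1-based start position, matched tripeptide) for every site."""
    return [(m.start() + 1, m.group(1)) for m in PATTERN.finditer(protein)]

print(find_sites("MSTRKASGKLTPR"))  # hypothetical sequence
```

An exact pattern like this either matches or does not; it carries none of the position-specific probability information that a PSSM, profile or HMM would attach to each residue.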

32
Motif & Domain Databases
  • Based on regular expressions
  • Prosite (Interpro)
  • Emotif
  • Limitation: these don't take probability info
    into account
  • Based on statistical models
  • PRINTS
  • BLOCKS
  • ProDom
  • Pfam
  • SMART
  • CDART
  • Reverse PsiBLAST
  • READ your textbook & try some of these at home;
    there are distinct advantages/disadvantages
    associated with each
  • TAKE HOME LESSON
  • Always try several methods!
  • (not just one!)

33
Chp 12 - Protein Structure Basics
  • SECTION V: STRUCTURAL BIOINFORMATICS
  • Xiong: Chp 12
  • Protein Structure Basics
  • Introduction to the Protein DataBank - PDB
  • NEXT lecture!