Title: BCB 444/544
1 BCB 444/544
- Lecture 18
- More details on HMMs
- Protein Motifs & Domain Prediction
- Maybe: Protein Structure - The Basics
- 18_Oct03
2 Required Reading (before lecture)
- Mon Oct 1 - Lecture 17
- Protein Motifs & Domain Prediction
- Chp 7 - pp 85-96
- Wed Oct 3 - Lecture 18
- Protein Structure: The Basics (Note change in lecture schedule!)
- Chp 12 - pp 173-186
- Thurs Oct 4 - Lab 6
- Protein Structure Databases & Visualization
- Fri Oct 5 - Lecture 19
- Protein Structure Classification & Comparison
- Chp 13 - pp 187-199
3 Assignments & Announcements
- HW544Extra 1
- Due: Task 1.1 - Mon Oct 1 (today) by noon
- Task 1.2 & Task 2 - Mon Oct 8 by 5 PM
- HomeWork 3 - posted online
- Due: Mon Oct 8 by 5 PM
4 BCB 544 - Extra Required Reading
- Mon Sept 24
- BCB 544 Extra Required Reading Assignment
- Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M Jr, Vanderhaeghen P, Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443:167-172.
- http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html
- doi:10.1038/nature05113
- PDF available on class website - under Required Reading Link
5 A few Online Resources for Cell & Molecular Biology
- NCBI Science Primer: What is a cell?
- http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
- NCBI Science Primer: What is a genome?
- http://www.ncbi.nlm.nih.gov/About/primer/genetics_genome.html
- BioTech's Life Science Dictionary
- http://biotech.icmb.utexas.edu/search/dict-search.html
- NCBI Bookshelf
- http://www.ncbi.nlm.nih.gov/sites/entrez?db=books
6 Statistics References
- Statistical Inference (Hardcover)
- George Casella, Roger L. Berger
- StatWeb: A Guide to Basic Statistics for Biologists
- http://www.dur.ac.uk/stat.web/
- Basic Statistics (correlations, tests, frequencies, etc.)
- http://www.statsoft.com/textbook/stbasic.html
- Electronic Statistics Textbook, StatSoft (from basic statistics to ANOVA to discriminant analysis, clustering, regression, data mining, machine learning, etc.)
- http://www.statsoft.com/textbook/stathome.html
7 Extra Credit Questions 2-6
- What is the size of the dystrophin gene (in kb)?
- Is it still the largest known human protein?
- What is the largest protein encoded in the human genome (i.e., longest single polypeptide chain)?
- What is the largest protein complex for which a structure is known (for any organism)?
- What is the most abundant protein (naturally occurring) on earth?
- Which state in the US has the largest number of mobile genetic elements (transposons) in its living population?
- For 1 pt total (0.2 pt each): Answer all questions correctly
- submit to terrible_at_iastate.edu
- For 2 pts total: Prepare a PPT slide with all correct answers
- submit to ddobbs_at_iastate.edu before 9 AM on Mon Oct 1
- Choose one option - you can't earn 3 pts!
- Partial credit for incorrect answers? Only if they are truly amusing!
8 Extra Credit Questions 7 & 8
- Given that each male attending our BCB 444/544 class on a typical day is healthy (let's assume MH = 7), and is generating sperm at a rate equal to the average normal rate for reproductively competent males (dSp/dT = ? per minute):
- 7a. How many rounds of meiosis will occur during our 50 minute class period?
- 7b. How many total sperm will be produced by our BCB 444/544 class during that class period?
- 8. How many rounds of meiosis will occur in the reproductively competent females in our class? (assume FH = 5)
- For 0.6 pts total (0.2 pt each): Answer all questions correctly
- submit to terrible_at_iastate.edu
- For 1 pt total: Prepare a PPT slide with all correct answers
- submit to ddobbs_at_iastate.edu before 9 AM on Mon Oct 1
- Choose one option - you can't earn more than 1 pt for this!
- Partial credit for incorrect answers? Only if they are truly amusing!
9 Answers?
10 Chp 6 - Profiles & Hidden Markov Models
- SECTION II: SEQUENCE ALIGNMENT
- Xiong: Chp 6
- Profiles & HMMs
- Position Specific Scoring Matrices (PSSMs)
- PSI-BLAST
- Profiles
- Markov Models & Hidden Markov Models
11 Statistical Models for Representing Biological Sequences
- 3 types of probabilistic models, all of which:
- Are based on an MSA
- Capture both observed frequencies & predicted frequencies of unobserved characters
- In order of "sensitivity":
- PSSM - scoring table derived from an ungapped MSA; stores frequencies (log-odds scores) for each amino acid at each position of a protein sequence
- Profile - a PSSM with gaps; based on a gapped MSA, with penalties for insertions & deletions
- HMM - hidden Markov Model - more complex mathematical model (than PSSMs or Profiles) because it also differentiates between insertions and deletions
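The PSSM idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy DNA alignment, the uniform background frequencies, the add-one pseudocount (used here just to avoid log of zero), and the function names `build_pssm` and `score` are all invented for the example, not taken from the lecture or the textbook.

```python
import math
from collections import Counter

def build_pssm(alignment, background, pseudocount=1.0):
    """Build log2-odds scores, one dict per column, from an ungapped MSA."""
    alphabet = sorted(background)
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):          # iterate over alignment columns
        counts = Counter(column)
        scores = {}
        for symbol in alphabet:
            # Smoothed column frequency (pseudocount is an assumption)
            freq = (counts[symbol] + pseudocount) / (n_seqs + pseudocount * len(alphabet))
            scores[symbol] = math.log2(freq / background[symbol])
        pssm.append(scores)
    return pssm

def score(pssm, segment):
    """Sum the position-specific scores for a segment of the same length."""
    return sum(column[symbol] for column, symbol in zip(pssm, segment))

# Hypothetical toy alignment (4 sequences, 4 columns)
alignment = ["ATGC", "ATGA", "ATCC", "TTGC"]
background = {base: 0.25 for base in "ACGT"}
pssm = build_pssm(alignment, background)

print(score(pssm, "ATGC"))   # consensus-like segment scores high
print(score(pssm, "CGAT"))   # unrelated segment scores low
```

A segment resembling the alignment gets a positive total score; one built from residues never seen in each column gets a negative score.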
12 HMMs for Biological Sequences?
- HMMs originally developed for speech recognition
- Now widely used in bioinformatics
- Many applications (motif/domain detection, sequence alignment, phylogenetic analysis)
- HMMs are "machine learning" algorithms - must be "trained" to obtain optimal statistical parameters
- For biological sequences:
- each character of a sequence is considered a state in a Markov process
13 But, What is a Markov Model?
- Markov Model (or Markov chain):
- mathematical model used to describe a sequence of events that occur one after another in a chain
- a process that moves in one direction from one state to the next with a certain transition probability
- For biological sequences:
- each letter = a state
- states are linked together by transition probabilities
14 Different Types of Markov Models
- Zero-order Markov Model: probability of current state is independent of previous state(s)
- e.g., random sequence, each residue with equal frequency
- First-order MM: probability of current state is determined by the previous state
- e.g., frequencies of two linked residues (dimers) occurring simultaneously
- Second-order MM: probability of current state is determined by the previous two states
- e.g., frequencies of three linked residues (trimers) occurring simultaneously, as in a codon
- Higher orders? Also possible, later
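To make the difference between a zero-order and a first-order model concrete, here is a small sketch. The training string, the uniform start frequencies, and the function names are assumptions chosen for illustration; transitions never seen in training are simply absent here (a real model would smooth them).

```python
import math
from collections import Counter

def zero_order_logprob(seq, freqs):
    """Zero-order: each residue is independent of its neighbours."""
    return sum(math.log(freqs[ch]) for ch in seq)

def first_order_params(training):
    """Estimate P(next | previous) from dimer counts in a training string."""
    pair_counts = Counter(zip(training, training[1:]))
    prev_counts = Counter(training[:-1])
    return {(a, b): n / prev_counts[a] for (a, b), n in pair_counts.items()}

def first_order_logprob(seq, start_freqs, trans):
    """First-order: each residue depends on the one before it."""
    logp = math.log(start_freqs[seq[0]])
    for a, b in zip(seq, seq[1:]):
        logp += math.log(trans[(a, b)])   # KeyError if the dimer was never observed
    return logp

uniform = {base: 0.25 for base in "ACGT"}
trans = first_order_params("ACGCGCGT")    # hypothetical training sequence

print(zero_order_logprob("CGCG", uniform))
print(first_order_logprob("CGCG", uniform, trans))
```

Because the training string is rich in CG dimers, the first-order model assigns "CGCG" a higher probability than the zero-order model with equal residue frequencies.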
15 So, What is a hidden Markov Model?
- Hidden Markov Model (HMM):
- a more sophisticated model in which some of the states are hidden
- some "unobserved" factors influence the state transition probabilities
- MM which combines 2 or more Markov chains:
- only 1 chain is made up of observed states
- other chains are made up of unobserved or "hidden" states
16 Hidden Markov Models - HMMs
- Goal: Find most likely explanation for observed variables
- Components:
- States - composed of a number of elements or "symbols" (e.g., A,C,G,T)
- Observed variables - sequence (or outcome) we can "see"
- Hidden variables - insertions/deletions/transition probabilities that can't be "seen"
- Emission probability - probability value associated with each "symbol" in each state
- Transition probability - probability of going from one state to another
- Special graphical representation used to illustrate relationships
17 An HMM for CpG Islands?
Emission probabilities are 0 or 1, e.g., e_G-(G) = 1, e_G-(T) = 0
See Durbin et al., Biological Sequence Analysis, Cambridge, 1998
18 HMM example from Eddy HMM paper: Toy HMM for Splice Site Prediction
This is a new slide
19 An HMM for Occasionally Dishonest Casino
- Transition probabilities:
- Prob(Fair → Loaded) = 0.01
- Prob(Loaded → Fair) = 0.2
20 Calculating Different Paths to an Observed Sequence
This slide has been changed
Calculations such as those shown below are used to fill a matrix with probability values for every state at every position
21 Calculating the Most Probable Path, using Viterbi algorithm (using traceback as in DP)
This slide has been changed
Path within HMM that matches query sequence with highest probability
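A minimal sketch of the Viterbi recursion with traceback, using the occasionally dishonest casino as the model. The two switching probabilities (0.01 and 0.2) come from the casino slide; the start probabilities (0.5 each) and the emission probabilities (fair die 1/6 per face; loaded die 0.5 for a six, 0.1 otherwise) are assumptions borrowed from the usual textbook version of this example, not values given in the lecture. Log probabilities are used to avoid numerical underflow.

```python
import math

STATES = ("F", "L")                               # Fair, Loaded
START = {"F": 0.5, "L": 0.5}                      # assumed
TRANS = {"F": {"F": 0.99, "L": 0.01},             # F->L = 0.01 (from the slide)
         "L": {"F": 0.2, "L": 0.8}}               # L->F = 0.2  (from the slide)
EMIT = {"F": {face: 1 / 6 for face in "123456"},  # assumed
        "L": {**{face: 0.1 for face in "12345"}, "6": 0.5}}

def viterbi(rolls):
    """Return (most probable state path, its log probability)."""
    # v[i][s] = log probability of the best path ending in state s at roll i
    v = [{s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}]
    pointers = []
    for roll in rolls[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            row[s] = v[-1][best_prev] + math.log(TRANS[best_prev][s]) + math.log(EMIT[s][roll])
            ptr[s] = best_prev
        v.append(row)
        pointers.append(ptr)
    # Traceback from the best final state, as in dynamic programming alignment
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(pointers):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path)), max(v[-1].values())

path, logp = viterbi("1234512366666666")
print(path)
```

With these assumed parameters, a run of sixes long enough to outweigh the small Fair → Loaded switching probability is labelled as coming from the loaded die.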
22 Calculating the Total Probability
This slide has been changed
Note: This is not the same as the matrix on the previous slide! Here, last column contains sums for each row
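Where Viterbi keeps only the best path into each state, the total probability sums over all paths (the "forward" algorithm). The sketch below uses the dishonest casino again: the switching probabilities 0.01 and 0.2 are from the casino slide, while the start and emission probabilities are assumed textbook-style values, labelled as such in the code.

```python
STATES = ("F", "L")                               # Fair, Loaded
START = {"F": 0.5, "L": 0.5}                      # assumed
TRANS = {"F": {"F": 0.99, "L": 0.01},             # from the slide
         "L": {"F": 0.2, "L": 0.8}}               # from the slide
EMIT = {"F": {face: 1 / 6 for face in "123456"},  # assumed
        "L": {**{face: 0.1 for face in "12345"}, "6": 0.5}}

def forward(rolls):
    """Total probability of the observed rolls, summed over all state paths."""
    # f[s] = probability of the rolls so far, with the last roll emitted by state s
    f = {s: START[s] * EMIT[s][rolls[0]] for s in STATES}
    for roll in rolls[1:]:
        # Sum (not max) over previous states: this is the only change from Viterbi
        f = {s: EMIT[s][roll] * sum(f[p] * TRANS[p][s] for p in STATES)
             for s in STATES}
    return sum(f.values())

print(forward("166"))
```

For long sequences the raw products underflow, so practical implementations rescale each column or work in log space; that refinement is omitted here.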
23 Estimating the Probabilities or Training the HMM
This slide has been changed
- Calculate frequencies in each column of MSA built from set of related sequences
- Use frequency values to fill the emission and transition probabilities in the model (use two matrices for this)
- Viterbi training:
- Derive probable paths for training data using Viterbi algorithm
- Re-estimate transition probabilities based on Viterbi path
- Iterate until paths stop changing
- Other algorithms can be used
- e.g., "forward" & "backward" algorithms
- (see text - or see Wikipedia re: HMMs)
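The re-estimation step can be sketched as simple counting: given observed symbols and a state path (known, or produced by Viterbi), count transitions and emissions and normalise each row. The symbols, the path, and the function name `estimate_parameters` below are hypothetical; a full Viterbi training loop would alternate this step with re-running Viterbi until the paths stop changing.

```python
from collections import Counter, defaultdict

def estimate_parameters(symbols, path):
    """Fill the two matrices (transition, emission) by counting along a state path."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for state, symbol in zip(path, symbols):
        emit_counts[state][symbol] += 1
    for prev_state, next_state in zip(path, path[1:]):
        trans_counts[prev_state][next_state] += 1

    def normalise(table):
        # Convert each row of counts into probabilities that sum to 1
        return {state: {key: n / sum(row.values()) for key, n in row.items()}
                for state, row in table.items()}

    return normalise(trans_counts), normalise(emit_counts)

# Hypothetical training data: die rolls labelled Fair (F) or Loaded (L)
symbols = "12366664"
path    = "FFFLLLLF"
trans, emit = estimate_parameters(symbols, path)
print(trans)
print(emit)
```

Note that anything never observed in the training path gets no entry at all (an implicit probability of zero), which is the over-fitting problem that pseudocounts address.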
24 Profile HMMs
- Used to model a family of related sequences
- (or motif or domain)
- Derived from an MSA of family members
- Transition & emission probabilities are position-specific
- Set parameters of model so that total probability peaks at members of family
- Sequences can be tested for family membership using Viterbi algorithm to evaluate match against profile
25 Profile HMM represents a gapped MSA
This slide has been changed
Character in alignment can be in one of 3 states:
- Match - observed
- Insert - hidden
- Delete - hidden
Hidden chains
Observed chain
26 Example: Pfam Protein Families http://pfam.sanger.ac.uk/
- A comprehensive collection of protein domains and families, with a range of well-established uses including genome annotation.
- Pfam: clans, web tools and services. R.D. Finn, A. Bateman (2006) Nucleic Acids Res Database Issue 34:D247-D251
- Each family is represented by:
- 2 MSAs
- 2 Hidden Markov Models (profile-HMMs)
- cf. Superfamily - from Lab 5
- similar collection of curated MSAs & HMMs, focuses on superfamily level
27 A few more Details re: Profiles & HMMs
- Smoothing or "Regularization" - method used to avoid "over-fitting"
- Common problem in machine learning (data-driven) approaches
- Limited training sample size causes over-representation of observed characters while "ignoring" unobserved characters
- Result? Miss members of family not yet sampled
- (too many false negative hits)
- Pseudocounts - adding artificial values for 'extra' amino acid(s) not observed in the training set
- Treated as 'real' values in calculating probabilities
- Improve predictive power of profiles & HMMs
- Dirichlet mixture - commonly used mathematical model to simulate the aa distribution in a sequence alignment
- To "correct" problems in an observed alignment based on limited number of sequences
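The effect of pseudocounts is easy to see on a single alignment column. The sketch below uses the simplest scheme, adding the same constant to every amino acid count; the example column and the function name are assumptions, and Dirichlet mixtures (mentioned above) are a more sophisticated alternative that is not implemented here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probabilities(column, alphabet=AMINO_ACIDS, pseudocount=0.0):
    """Residue probabilities for one ungapped MSA column, with optional pseudocounts."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {aa: (counts[aa] + pseudocount) / total for aa in alphabet}

column = "LLLLIV"   # hypothetical column from a 6-sequence alignment

raw = column_probabilities(column)
smoothed = column_probabilities(column, pseudocount=1.0)

print(raw["M"], smoothed["M"])   # unobserved residue: zero vs small positive
print(raw["L"], smoothed["L"])   # observed residue: slightly down-weighted
```

Without smoothing, a family member carrying methionine at this position would receive probability zero and be missed (a false negative); with a pseudocount it is merely penalised.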
28 Applications (of PSSMs, Profiles, & HMMs)
- HMMer - for building & using HMMs
- developed by Sean Eddy's group
- Not a web-based server; must download the software
- 9 related programs
- but check out the site - it's fun!
- Psi-BLAST - you've heard enough about this!
- Uses Profiles (not actually PSSMs) - iteratively
- In previous lab: used SuperFam (HMMs)
- http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/
- Prosite - includes patterns (regular expressions) & profiles for motifs & domains
- http://ca.expasy.org/prosite
- Pfam (MSAs & HMMs)
- http://pfam.sanger.ac.uk/ (new URL)
- Many others
29 Chp 7 - Protein Motifs & Domain Prediction
- SECTION II: SEQUENCE ALIGNMENT
- Xiong: Chp 7
- Protein Motifs and Domain Prediction
- Identification of Motifs & Domains in MSAs
- Motif & Domain Databases Using Regular Expressions
- Motif & Domain Databases Using Statistical Models
- Protein Family Databases
- Motif Discovery in Unaligned Sequences
- Sequence Logos
30 Motifs & Domains
- Motif - short conserved sequence pattern
- Associated with distinct function in protein or DNA
- Avg 10 residues (usually 6-20 residues)
- e.g., zinc finger motif - in protein
- e.g., TATA box - in DNA
- Domain - "longer" conserved sequence pattern, defined as an independent functional and/or structural unit
- Avg 100 residues (range from 40-700 in proteins)
- e.g., kinase domain or transmembrane domain - in protein
- Domains may (or may not) include motifs
31 2 Approaches for Representing "Consensus" Information in Motifs & Domains
- Regular expression - reduces information from MSA
- e.g., protein phosphorylation site motif: [S,T]-X-[R,K]
- Symbols represent specific or unspecified residues, spaces, etc.
- 2 mechanisms for matching:
- Exact
- "Fuzzy" (inexact, approximate) - flexible, more permissive to detect "near matches"
- Statistical model - includes probability information derived from MSA
- e.g., PSSM, Profile or HMM
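Exact matching of the phosphorylation site motif (S or T, then any residue, then R or K) maps directly onto a standard regular expression. The protein sequence and the function name below are made up for illustration; only exact matching is shown, since Python's built-in `re` module has no fuzzy matching.

```python
import re

# S or T, any residue, R or K. The lookahead (?=...) lets overlapping sites be reported.
MOTIF = re.compile(r"(?=([ST].[RK]))")

def find_motif(sequence):
    """Return (1-based start position, matched residues) for every exact match."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]

print(find_motif("MASARKTPRSS"))   # → [(3, 'SAR'), (7, 'TPR')]
```

A pattern this short matches by chance in many proteins, which illustrates why regular expressions alone carry no probability information.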
32 Motif & Domain Databases
- Based on regular expressions:
- Prosite (Interpro)
- Emotif
- Limitation: these don't take probability info into account
- Based on statistical models:
- PRINTS
- BLOCKS
- ProDom
- Pfam
- SMART
- CDART
- Reverse PSI-BLAST
- READ your textbook & try some of these at home; there are distinct advantages/disadvantages associated with each
- TAKE HOME LESSON:
- Always try several methods!
- (not just one!)
33 Chp 12 - Protein Structure Basics
- SECTION V: STRUCTURAL BIOINFORMATICS
- Xiong: Chp 12
- Protein Structure Basics
- Introduction to the Protein DataBank - PDB
- NEXT lecture!