From PSIBLAST to HMMer - PowerPoint PPT Presentation

1
From PSI-BLAST to HMMer
  • Professor Mark Pallen
  • Credits: Stephanie Minnema, University of Calgary;
    David Wishart, University of Alberta

2
Advanced BLAST Methods
  • The NCBI BLAST pages have several advanced BLAST
    methods available
  • PSI-BLAST
  • PHI-BLAST
  • RPS-BLAST
  • All are powerful methods based on protein
    similarities

3
Position-Specific Iterated BLAST (PSI-BLAST)
  • Intuition
  • substitution matrices should be specific to a
    particular site.
  • e.g. penalize alanine-to-glycine substitutions more in a helix
  • Idea
  • Use BLAST with high stringency to get a set of
    closely related sequences.
  • Align those sequences to create a new
    substitution matrix for each position.
  • Then use that matrix to find additional sequences
  • Cycling/iterative method
  • Gives increased sensitivity for detecting
    distantly related proteins
  • Can give insight into functional relationships
  • Very refined statistical methods
  • Fast: still based on BLAST methods
  • Simple to use

4
PSI-BLAST Principle
  • First, a standard blastp is performed
  • The highest scoring hits are used to generate a
    multiple alignment
  • A PSSM is generated from the multiple alignment.
  • Highly conserved residues get high scores
  • Less conserved residues get lower scores
  • Another similarity search is performed, this time
    using the new PSSM
  • Steps 2-4 can be repeated until convergence,
    i.e. until no new sequences appear after an iteration
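
The PSSM step above can be sketched in a few lines of Python. This is a minimal illustration, not PSI-BLAST's actual procedure: it assumes an ungapped toy alignment, a uniform background frequency and a flat pseudocount, whereas the real program also weights sequences and uses more refined pseudocounts. The names `build_pssm` and `score_window` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_window(pssm, window):
    """Sum the position-specific scores of an ungapped window."""
    return sum(column[res] for column, res in zip(pssm, window))

# Toy alignment, invented for illustration.
toy_alignment = ["GHVGD", "GHIGD", "GHLGE", "GHVGD"]
pssm = build_pssm(toy_alignment)
print(round(pssm[0]["G"], 2), round(pssm[2]["V"], 2))
```

The fully conserved first column scores G higher than the variable third column scores V, which is the "highly conserved residues get high scores" behaviour described above.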

5
Example: Aminoacyl-tRNA Synthetases
  • 20 enzymes for 20 amino acids
  • Each is very different
  • Big, small, monomers, tetramers
  • All bind to their appropriate tRNAs and amino
    acids, with high specificity
  • TrpRS and TyrRS share only 13% sequence identity
  • BUT, overall structures of TrpRS and TyrRS are
    similar
  • Structure-function relationship

6
Same SCOP family based on catalytic domain
7
So is there sequence similarity between TyrRS
and TrpRS?
  • Given structural similarities, we would expect to
    find sequence similarity
  • BUT!
  • blastp of E. coli TyrRS against bacterial
    sequences in SwissProt does NOT show similarity
    with TrpRS at an E-value cutoff of 10

8
No TrpRS!?
9
Try Using PSI-BLAST
  • PSI-BLAST available from BLAST main page
  • Query form just like for blastp
  • BUT one extra formatting option must be used
  • Format for PSI-BLAST: activate the tick box!
  • A second E-value cutoff determines which
    alignments will be used for the PSSM build: the
    threshold for inclusion
  • First search using TyrRS as query
  • Db: SwissProt, limited to Bacteria [ORGN]
  • Threshold for inclusion: 0.005

12
After A Few Iterations
13
TyrRS Similarity to TrpRS!
14
Power of PSI-BLAST
  • We knew TyrRS and TrpRS were similar,
    both functionally and structurally
  • BLASTP gave no indication
  • PSI-BLAST was able to detect their weak sequence
    similarity
  • Words of caution
  • be sure to inspect and think about the results
    included in the PSSM build
  • include/exclude sequences on the basis of biological
    knowledge: you are in the driving seat!
  • PSI-BLAST performance varies according to choice
    of matrix, filter, statistics, etc., just like BLASTP

15
Why (not) PSI-BLAST
  • If the sequences used to construct the Position
    Specific Scoring Matrices (PSSMs) are all
    homologous, the sensitivity at a given
    specificity improves significantly
  • However, if non-homologous sequences are included
    in the PSSMs, the PSSMs are corrupted. They then
    pull in more non-homologous sequences and become
    worse than generic substitution matrices

16
Does the query really have a relationship with
the results? One way to check is to run the
search in the opposite direction, using a result
as the query, but searches are often not
reversible even when the homology is real
17
PSI-BLAST caveats
  • Increased ability to find distant homologues
  • Cost of additional required care to prevent
    non-homologous sequences from being included in
    the PSSM calculation
  • When in doubt, leave it out!
  • Examine sequences with moderate similarity
    carefully.
  • Be particularly cautious about matches to
    sequences with highly biased amino acid content
  • Low complexity regions, transmembrane regions and
    coiled-coil regions often display significant
    similarity without homology
  • Screen them out of your query sequences!
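
One rough way to flag low-complexity stretches before searching is to measure the Shannon entropy of residues in a sliding window. The sketch below is only an illustration under stated assumptions: the window size and entropy threshold are arbitrary choices, the function names are hypothetical, and dedicated filters are more sophisticated than this.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def low_complexity_windows(seq, size=12, max_entropy=2.2):
    """Return start positions of windows whose entropy is at or below max_entropy."""
    return [
        start
        for start in range(len(seq) - size + 1)
        if window_entropy(seq[start:start + size]) <= max_entropy
    ]

# Toy sequence, invented for illustration: a poly-Q run followed by a diverse stretch.
toy_sequence = "QQQQQQQQQQQQ" + "ACDEFGHIKLMNPRSTVWY"
print(low_complexity_windows(toy_sequence))
```

Windows that overlap the poly-Q run are reported, while the compositionally diverse tail is not.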

18
PSI-BLAST on the command line
  • as with simple BLAST searches, using PSI-BLAST on
    the command line gives the user more power
  • opens up additional options, e.g.
  • PSI-BLASTing over nucleotide databases
  • automating number of iterations
  • trying out lots of different settings in parallel
  • inputting multiple sequences
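
As a sketch of "trying out lots of different settings", the snippet below builds command lines for several inclusion thresholds. It assumes the `psiblast` executable from NCBI BLAST+ (the slides may have had an older tool in mind); the file name `tyrRS.fasta` and database name `swissprot` are hypothetical, and nothing is executed here.

```python
def build_psiblast_command(query_fasta, database, iterations=5,
                           inclusion_ethresh=0.005, pssm_out=None):
    """Return an argument list for BLAST+ psiblast (nothing is run here)."""
    command = [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),           # automate the iterations
        "-inclusion_ethresh", str(inclusion_ethresh),  # threshold for inclusion
    ]
    if pssm_out is not None:
        command += ["-out_pssm", pssm_out]  # save the final PSSM
    return command

# Hypothetical file and database names, for illustration only.
thresholds = [0.001, 0.005, 0.01]
commands = [
    build_psiblast_command("tyrRS.fasta", "swissprot", inclusion_ethresh=t)
    for t in thresholds
]
for command in commands:
    print(" ".join(command))
```

Each list could be handed to `subprocess.run(command, check=True)` on a machine where BLAST+ and the database are installed.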

19
PHI-BLAST
  • Pattern Hit Initiated BLAST
  • PHI-BLAST principle
  • Same method as PSI-BLAST
  • Starts the first search with the query sequence plus
    a pattern for a motif in the query
  • PHI-BLAST finds sequences containing the motif
    and having significant sequence similarity in the
    vicinity of the motif occurrence
  • Highly specific

20
Example: TyrRS
  • TyrRS contains the aaRS class-I signature
  • Want to find sequences containing that motif, and
    regional similarity to TyrRS
  • First get the Prosite pattern for the class-I
    signature
  • Prosite: database of protein families and domains

21
http://ca.expasy.org/prosite
22
P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]
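
To see what this pattern matches, it can be translated into a regular expression. The sketch below assumes standard PROSITE syntax, in which each set of alternative residues is written in square brackets, and handles only the common elements (`x`, `[...]`, `{...}`, repeat counts and the `<`/`>` anchors). The function name and the test peptide are hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # x is the wildcard; amino-acid letters are upper case, so this is safe.
        element = element.replace("x", ".")
        # (n) or (n,m) are repeat counts.
        element = element.replace("(", "{").replace(")", "}")
        # < and > anchor the pattern to the sequence termini.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)

CLASS_I_SIGNATURE = (
    "P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]"
)
regex = prosite_to_regex(CLASS_I_SIGNATURE)
print(regex)

# Invented peptide containing a motif-like stretch, for illustration only.
toy_peptide = "MKPYANGSLHLGHMAA"
match = re.search(regex, toy_peptide)
print(match.start() if match else None)
```

PHI-BLAST itself takes the pattern in PROSITE syntax; the regular expression is only a convenient way to check by eye where the motif occurs in a sequence.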
23
Insert Query Sequence
Insert PHI Pattern
24
PHI-BLAST Results
  • After first search, PHI-BLAST functions same as
    PSI-BLAST
  • Result page is the same
  • Can iterate in same way
  • Try it later if you like

25
The Key to PHI- and PSI-BLAST
  • Generating the multiple alignments to create
    PSSMs
  • Refines scoring in searches
  • Annotated collections of multiple alignments
    defining domains exist
  • Conserved domain database (CDD)
  • Contains 18039 alignments (10013 last year)
  • Can search the CDD using CD search
  • Uses RPS-BLAST

26
RPS-BLAST
  • Reverse Position Specific BLAST
  • Opposite of PSI-BLAST
  • CDD multiple alignments converted to PSSMs
  • PSSMs are processed and turned into a searchable
    database
  • Queries are searched against PSSMs using
    RPS-BLAST
  • Output indicates conserved domains within the
    query sequence
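
The "reverse" direction can be pictured as one query scanned against a library of PSSMs. The sketch below is a toy under stated assumptions: the PSSMs are hand-written and ungapped, hits are ranked by raw score rather than E-value, and the names `best_hit`, `search_domains` and the two domains are hypothetical. RPS-BLAST itself uses BLAST heuristics and proper statistics.

```python
def best_hit(query, pssm, default=-1):
    """Slide an ungapped PSSM along the query; return (best score, start)."""
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Residues not listed in a column get the default (mismatch) score.
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        best = max(best, (score, start))
    return best

def search_domains(query, library, min_score):
    """Return (domain name, start, score) for every PSSM scoring at least min_score."""
    hits = []
    for name, pssm in library.items():
        score, start = best_hit(query, pssm)
        if score >= min_score:
            hits.append((name, start, score))
    return sorted(hits, key=lambda hit: -hit[2])

# Toy PSSM library, invented for illustration.
toy_library = {
    "domain_A": [{"G": 3}, {"H": 3}, {"V": 2, "I": 2, "L": 2}, {"G": 3}],
    "domain_B": [{"C": 4}, {"P": 2}, {"C": 4}],
}
hits = search_domains("MKGHIGDLLA", toy_library, min_score=8)
print(hits)
```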

27
Example: CRADD protein
28
Click on picture to see CDD multiple alignment
Click to see alignment with query
29
Profile Hidden Markov Models
  • statistical models of multiple sequence
    alignments
  • capture position-specific information about:
  • how conserved each column of the alignment is
  • which residues are likely
  • use position-specific scores for amino acids (or
    nucleotides)
  • position specific penalties for opening and
    extending an insertion or deletion.

30
Advantages of using HMMs
  • HMMs have a formal probabilistic basis
  • use probability theory to guide how all the
    scoring parameters should be set
  • can do things that more heuristic methods cannot
    do easily
  • For example, a profile HMM can be trained from
    unaligned sequences, if a trusted alignment isn't
    yet known
  • HMMs have a consistent theory behind gap and
    insertion scores

31
Advantages of using HMMs
  • In most details, profile HMMs are a slight
    improvement over a carefully constructed profile
  • but less skill and manual intervention are
    necessary to use profile HMMs
  • HMMs can produce true global alignments, unlike
    BLAST

32
Limitations of HMMs
  • do not capture any higher-order correlations
  • assume that the identity of a particular
    position is independent of the identity of all
    other positions
  • make poor models of RNAs because an HMM cannot
    describe base pairs.
  • cf. protein threading methods,
    which usually include scoring terms for nearby
    amino acids in a three-dimensional protein
    structure.
  • slower than and less user-friendly than PSI-BLAST

33
Applications of profile HMMs
  • Database searching for weak homologies
  • Alternative to PSI-BLAST
  • Automated annotation of the domain structure of
    proteins

34
Applications of profile HMMs
  • Useful for organizing sequences into
    evolutionarily related families
  • Databases like Pfam constructed by distinguishing
    between
  • a stable curated seed alignment of a small
    number of representative sequences
  • full alignments of all detectable homologs
  • HMMER used to
  • make a model of the seed
  • search the database for homologs
  • automatically produce the full alignment by
    aligning every sequence to the seed consensus

35
Constructing a profile HMM
  • multiple sequence alignment is made of known
    members of a given protein family
  • quality of alignment, number and diversity of the
    sequences crucial for success
  • profile HMM of family built from the alignment
  • model-building program uses the alignment
    together with its prior knowledge of the general
    nature of proteins
  • model-scoring program used to assign a score with
    respect to the model to any sequence of interest
  • better the score, the higher the chance that
    query sequence is homologous to protein family in
    the model.
  • each sequence in a database scored to find the
    members of the family present in the database.
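
The model-building step can be sketched for the emission part of a profile HMM. This is a simplified illustration, not HMMER's algorithm: it assumes that columns with at most 50% gaps become match states and uses a flat pseudocount as a crude stand-in for the "prior knowledge of the general nature of proteins". Transition probabilities are left out, and all names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of alignment columns treated as match states."""
    n_sequences = len(alignment)
    return [
        index
        for index, column in enumerate(zip(*alignment))
        if column.count("-") / n_sequences <= max_gap_fraction
    ]

def match_emissions(alignment, pseudocount=1.0):
    """Emission probabilities for each match state, smoothed with pseudocounts."""
    columns = list(zip(*alignment))
    emissions = []
    for index in match_columns(alignment):
        counts = Counter(res for res in columns[index] if res != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        emissions.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return emissions

# Toy alignment, invented for illustration; the third column is mostly gaps.
toy_alignment = ["GH-VGD", "GH-IGD", "GHALGE", "GH-VGD"]
print(match_columns(toy_alignment))
emissions = match_emissions(toy_alignment)
print(round(emissions[0]["G"], 3))
```

The mostly gapped third column is treated as an insertion rather than a match state, which is how a profile HMM gets its position-specific handling of insertions.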

36
Profile HMM programs: HMMER
  • developed by Sean Eddy
  • freely available under GNU General Public License
  • includes model-building and model-scoring
    programs relevant to homology detection
  • contains a program that calibrates a model by:
  • scoring it against a set of random sequences
  • fitting an extreme value distribution to the
    resultant raw scores
  • parameters of this distribution then used to
    calculate accurate E-values for sequences of
    interest.
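
The calibration idea can be illustrated with a small simulation. The sketch below swaps in a method-of-moments fit of a Gumbel (extreme value) distribution for simplicity, which is not necessarily the fitting procedure HMMER uses, and it draws the "random sequence" scores directly from a known Gumbel distribution rather than scoring real sequences. All names and numbers are assumptions.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Method-of-moments estimate of Gumbel location (mu) and scale (beta)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta

def e_value(score, mu, beta, database_size):
    """Expected number of random hits scoring at least `score` in the database."""
    # P(S >= score) = 1 - exp(-exp(-(score - mu) / beta)), computed stably.
    p_value = -math.expm1(-math.exp(-(score - mu) / beta))
    return database_size * p_value

rng = random.Random(0)
# Stand-in for raw scores against random sequences: Gumbel(mu=-10, beta=3)
# samples generated by inverse-transform sampling.
random_scores = [-10.0 - 3.0 * math.log(-math.log(rng.random())) for _ in range(5000)]
mu, beta = fit_gumbel(random_scores)
print(round(mu, 2), round(beta, 2))
print(e_value(20.0, mu, beta, database_size=100000))
```

The fitted parameters recover the simulated distribution closely, and higher raw scores map to smaller E-values.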

37
Programs in the HMMER 2 package
  • hmmalign
  • Align sequences to existing model
  • hmmbuild
  • Build a model from multiple sequence alignment.
  • hmmcalibrate
  • Takes an HMM and empirically determines
    parameters used to make searches more sensitive
    by calculating more accurate E-values
  • hmmconvert
  • Convert a model file into different formats,
    including a compact HMMER 2 binary format, and
    best effort emulation of GCG profiles.
  • hmmemit
  • Emit sequences probabilistically from a profile
    HMM.
  • hmmfetch
  • Get a single model from an HMM database.
  • hmmindex
  • Index an HMM database.
  • hmmpfam
  • Search an HMM database for matches to a query
    sequence.
  • hmmsearch
  • Search a sequence database for matches to an HMM.
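
A typical order in which these HMMER 2 programs are chained is build, calibrate, then search. The sketch below only assembles the command lines; the file names are hypothetical, options are omitted, and nothing is executed.

```python
def hmmer2_workflow(alignment_file, hmm_file, sequence_db):
    """Return HMMER 2 command lines for build, calibrate and search, in order."""
    return [
        ["hmmbuild", hmm_file, alignment_file],  # model from a multiple alignment
        ["hmmcalibrate", hmm_file],              # calibrate E-values for the model
        ["hmmsearch", hmm_file, sequence_db],    # search sequences with the model
    ]

# Hypothetical file names, for illustration only.
steps = hmmer2_workflow("family.aln", "family.hmm", "swissprot.fasta")
for step in steps:
    print(" ".join(step))
```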

38
Profile HMM programs: SAM
  • Developed by the bioinformatics group at the
    University of California, Santa Cruz
  • not open source, but free for academic use
  • does not include a model-calibration program
  • model-scoring program calculates E-values
    directly using a theoretical function that takes
    as its argument the difference between raw scores
    of the query sequence and its reverse
  • important component is target99 script, which
    generates a multiple sequence alignment suitable
    for model building

39
Clash of the Titans: PSI-BLAST v. HMMer v. SAM!
  • Nucleic Acids Research, 2002, Vol. 30 No. 19 4321
  • SAM consistently produces better models than
    HMMER
  • relative performance of the model-scoring
    components varies
  • HMMER is 1-3x faster than SAM with large databases
  • SAM faster with small ones
  • both methods have effective low complexity and
    repeat sequence masking
  • accuracy of their E-values was comparable.
  • SAM T99 iterative database search procedure
    outperforms PSI-BLAST
  • BUT scoring of PSI-BLAST profiles is > 30x faster
    than scoring of SAM models.

41
Summary
  • PSI-BLAST
  • Input: SEQUENCE
  • Database: SEQUENCES
  • Algorithm: Constructs a PSSM from an initial pass
    and uses this in the next pass
  • Output: Distantly related sequences
  • sensitive, -specific
  • PHI-BLAST
  • Input: PROFILE and SEQUENCE
  • Database: SEQUENCES
  • Algorithm: Same as PSI-BLAST except start with a
    profile
  • Output: Sequences containing the domain and that
    are similar in the domain region
  • sensitive, -gt -specific
  • RPS-BLAST
  • Input: SEQUENCE
  • Database: DOMAINS
  • Output: Domains found in the sequence
  • sensitive, specific
  • HMMs
  • More sensitive
  • But less user-friendly than PSI-BLAST and slower