Design and creation of multiple sequence alignments Unit 15

Provided by: IreneGab1
1
Design and creation of multiple sequence
alignments: Unit 15
  • BIOL221T Advanced Bioinformatics for
    Biotechnology

Irene Gabashvili, PhD
2
IPA 6.0 license
  • Need a list of e-mails to create accounts
  • Will have a 6-week license (instead of 2 weeks)
  • Problem Set 3 is Pathway Analysis; the March 19
    lab will also use IPA

3
Problem Set 2 Review
  • Sensitivity and Specificity
  • Parameters for Multiple Alignment (Databases,
    Search Terms, Scores)
  • Transfac
  • Dotplots

4
Gene prediction flowchart
Fig 5.15 Baxevanis & Ouellette 2005
5
Evaluation of Splice Site Prediction
What do measures really mean?
Fig 5.11 Baxevanis & Ouellette 2005
Note typo in B&O
6
ROC curves (plots of Sn vs (1-Sp))
  • A receiver operating characteristic (ROC), or
    simply ROC curve, is a graphical plot of the
    sensitivity vs. (1 - specificity) for a binary
    classifier system as its discrimination threshold
    is varied.
  • The sensitivity and specificity of a diagnostic
    test depend on more than just the "quality" of
    the test; they also depend on the definition of
    what constitutes an abnormal test.
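A minimal sketch of how the points of a ROC curve can be computed by sweeping the discrimination threshold. The scores and labels are hypothetical illustration data, and the function names `roc_points` and `auc` are not from the lecture.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per threshold.

    Assumes labels are 1 (true site) or 0 (not a site) and that both
    classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical scores for six candidate sites; label 1 = real site, 0 = not.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1), which is why the same predictor can report very different Sn and Sp depending on where "abnormal" is defined.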

7
Evaluation of Splice Site Prediction
8
Careful: different definitions for "Specificity"
Brendel definitions
  • Specificity

cf. Guigó definitions
Sn (Sensitivity) = TP/(TP+FN)
Sp (Specificity) = TN/(TN+FP)
AC (Approximate Correlation) = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1
Other measures? Predictive Values, Correlation
Coefficient
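As a worked example, the sketch below computes sensitivity Sn = TP/(TP+FN), specificity Sp = TN/(TN+FP), and the approximate correlation AC = 0.5 x (TP/(TP+FN) + TP/(TP+FP) + TN/(TN+FP) + TN/(TN+FN)) - 1 from confusion-matrix counts. The counts are hypothetical, and the function names are chosen here for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of real sites that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-sites correctly rejected: TN / (TN + FP)."""
    return tn / (tn + fp)


def approximate_correlation(tp, tn, fp, fn):
    """AC = 0.5 * (sum of the four conditional proportions) - 1."""
    total = (tp / (tp + fn) + tp / (tp + fp)
             + tn / (tn + fp) + tn / (tn + fn))
    return 0.5 * total - 1


# Hypothetical counts for one splice-site predictor.
tp, tn, fp, fn = 80, 90, 10, 20
print(sensitivity(tp, fn))
print(specificity(tn, fp))
print(approximate_correlation(tp, tn, fp, fn))
```

Note that some authors instead call TP/(TP+FP) "specificity", which is the reason for the warning on this slide; it is worth checking which definition a paper uses before comparing numbers.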
9
Best measures for comparing different methods?
  • ROC curves (Receiver Operating
    Characteristic?!!)
  • http://www.anaesthetist.com/mnm/stats/roc/
  • "The Magnificent ROC" - has fun applets &
    quotes
  • "There is no statistical test, however intuitive
    and simple, which will not be abused by medical
    researchers"
  • Correlation Coefficient
  • Matthews correlation coefficient (MCC)
  • MCC = 1 for a perfect prediction
  • 0 for a completely random assignment
  • -1 for a "perfectly incorrect" prediction

Just FYI
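The three reference values of the MCC can be checked with a short sketch using the standard formula MCC = (TP x TN - FP x FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Returning 0 when the denominator is zero is a common convention assumed here, not something stated in the lecture.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denominator


print(mcc(10, 10, 0, 0))   # perfect prediction
print(mcc(5, 5, 5, 5))     # no better than chance
print(mcc(0, 0, 10, 10))   # every call is wrong
```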
10
Promoters: What signals are there? Simple
ones in prokaryotes
11
Prokaryotic promoters
  • RNA polymerase complex recognizes promoter
    sequences located very close to, and on the 5' side
    (upstream) of, the initiation site
  • RNA polymerase complex binds directly to these,
    with no requirement for transcription factors
  • Prokaryotic promoter sequences are highly
    conserved
  • -10 region
  • -35 region

12
Simpler view of complex promoters in eukaryotes
Fig 5.12 Baxevanis & Ouellette 2005
13
Eukaryotic genes are transcribed by 3 different
RNA polymerases
Recognize different types of promoters &
enhancers
14
Eukaryotic promoters & enhancers
  • Promoters located relatively close to
    initiation site
  • (but can be located within gene,
    rather than upstream!)
  • Enhancers also required for regulated
    transcription
  • (these control expression in specific cell
    types, developmental stages, in response to
    environment)
  • RNA polymerase complexes do not specifically
    recognize promoter sequences directly
  • Transcription factors bind first and serve as
    landmarks for recognition by RNA polymerase
    complexes

15
Eukaryotic transcription factors
  • Transcription factors (TFs) are DNA binding
    proteins that also interact with RNA polymerase
    complex to activate or repress transcription
  • TFs contain characteristic DNA binding motifs
  • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039
  • TFs recognize specific short DNA sequence motifs:
    "transcription factor binding sites"
  • Several databases for these, e.g. TRANSFAC
  • http://www.gene-regulation.com/cgi-bin/pub/databases/transfac

16
Zinc finger-containing transcription factors
  • Common in eukaryotic proteins
  • Estimated 1% of mammalian genes encode
    zinc-finger proteins
  • In C. elegans, there are 500!
  • Can be used as highly specific DNA binding
    modules
  • Potentially valuable tools for directed genome
    modification (esp. in plants) & human gene
    therapy

17
Promoter prediction: Eukaryotes vs prokaryotes
Promoter prediction is easier in microbial genomes
Why?
  • Highly conserved
  • Simpler gene structures
  • More sequenced genomes! (for comparative approaches)
Methods?
  • Previously mostly HMM-based
  • Now similarity-based comparative methods, because
    so many genomes are available
18
Predicting promoters: Steps & Strategies
  • Closely related to gene prediction!
  • Obtain genomic sequence
  • Use sequence-similarity based comparison
    (BLAST, MSA) to find related genes
  • But "regulatory" regions are much less
    well-conserved than coding regions
  • Locate ORFs
  • Identify TSS (if possible!)
  • Use promoter prediction programs
  • Analyze motifs, etc. in sequence (TRANSFAC)

19
Predicting promoters: Steps & Strategies
  • Identify TSS (if possible?)
  • One of biggest problems is determining exact
    TSS!
  • Not very many full-length cDNAs!
  • Good starting point? (human & vertebrate genes)
  • Use FirstEF
  • found within UCSC Genome Browser
  • or submit to FirstEF web server

Fig 5.10 Baxevanis & Ouellette 2005
20
Automated promoter prediction strategies
  • Pattern-driven algorithms
  • Sequence-driven algorithms
  • Combined "evidence-based"
  • BEST RESULTS? Combined, sequential

21
Promoter Prediction: Pattern-driven algorithms
  • Success depends on availability of collections of
    annotated binding sites (TRANSFAC & PROMO)
  • Tend to produce huge numbers of FPs
  • Why?
  • Binding sites (BS) for specific TFs are often
    variable
  • Binding sites are short (typically 5-15 bp)
  • Interactions between TFs (& other proteins)
    influence affinity & specificity of TF binding
  • One binding site is often recognized by multiple TFs
  • Biology is complex: promoters are often specific to

22
Promoter Prediction: Pattern-driven algorithms
  • Solutions to problem of too many FP predictions?
  • Take sequence context/biology into account
  • Eukaryotes: clusters of TFBSs are common
  • Prokaryotes: knowledge of σ factors helps
  • Probability of "real" binding site increases if
    annotated transcription start site (TSS) nearby
  • But: what about enhancers? (no TSS nearby!)
  • Only a small fraction of TSSs have been
    experimentally mapped
  • Do the wet lab experiments!
  • But: promoter-bashing is tedious

23
Promoter Prediction: Sequence-driven algorithms
  • Assumption: common functionality can be deduced
    from sequence conservation
  • Alignments of co-regulated genes should highlight
    elements involved in regulation
  • Careful: how to determine co-regulation?
  • Orthologous genes from different species
  • Genes experimentally determined to be
    co-regulated (using microarrays??)
  • Comparative promoter prediction
  • "Phylogenetic footprinting" - more later.

24
Promoter Prediction: Sequence-driven algorithms
  • Problems
  • Need sets of co-regulated genes
  • For comparative (phylogenetic) methods:
  • Must choose appropriate species
  • Different genomes evolve at different rates
  • Classical alignment methods have trouble with
    translocations, inversions in order of
    functional elements
  • If background conservation of the entire region is
    high, comparison is useless
  • Not enough data (Prokaryotes >>> Eukaryotes)
  • Biology is complex: many (most?) regulatory
    elements are not conserved across species!

25
Examples of promoter prediction/characterization
software
Lab used: MATCH, MatInspector, TRANSFAC, MEME &
MAST, BLAST, etc.
Others? FirstEF, Dragon Promoter Finder (also see
Dragon Genome Explorer, which has specialized promoter
software for GC-rich DNA, finding CpG islands, etc.),
JASPAR
26
TRANSFAC matrix entry for TATA box
  • Fields
  • Accession ID
  • Brief description
  • TFs associated with this entry
  • Weight matrix
  • Number of sites used to build (How many here?)
  • Other info

Fig 5.13 Baxevanis & Ouellette 2005
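To make the "weight matrix" field concrete, here is a minimal sketch of building a log-odds weight matrix from aligned binding sites and scanning a sequence with it. The four sites are hypothetical TATA-like examples, not the TRANSFAC entry in the figure, and the pseudocount and uniform background are assumptions made for illustration.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds weight matrix: one dict of base -> score per site column."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for column in zip(*sites):
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum of the column scores for one window of the same width as the matrix."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Highest-scoring window as (score, start position)."""
    width = len(pwm)
    return max((score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Hypothetical aligned sites used to build the matrix.
sites = ["TATAAA", "TATATA", "TATAAG", "TATAAA"]
pwm = build_pwm(sites)
top_score, position = best_hit(pwm, "GGCGCTATAAAGGCGC")
print(top_score, position)
```

The number of sites used to build the matrix matters: with only a handful of sites, as here, the pseudocount dominates the scores of rarely seen bases.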
27
Global alignment of human & mouse obese gene
promoters (200 bp upstream from TSS)
Fig 5.14 Baxevanis & Ouellette 2005
28
GenBank IDs and Accessions
  • http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions
    (Accession Formats, RefSeq)
  • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
    (GenBank Sample Record)

29
Why do we do multiple alignments?
  • Help predict the secondary and tertiary
    structures of new sequences
  • Preliminary step in molecular evolution analysis
    using phylogenetic methods for constructing
    phylogenetic trees.

30
An example of Multiple Alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
31
Visualization example
32
Other multiple alignment programs
ClustalW / ClustalX, pileup, multalign, multal, saga,
hmmt, DIALIGN, SBpima, MLpima, T-Coffee, ...
34
ClustalW- for multiple alignment
  • ClustalW can create multiple alignments,
    manipulate existing alignments, do profile
    analysis and create phylogenetic trees.
  • Alignment can be done by 2 methods
  • - slow/accurate
  • - fast/approximate

35
Running ClustalW
clustalw

CLUSTAL W (1.7) Multiple Sequence Alignments

    1. Sequence Input From Disc
    2. Multiple Alignments
    3. Profile / Structure Alignments
    4. Phylogenetic trees

    S. Execute a system command
    H. HELP
    X. EXIT (leave program)

Your choice:
36
Running ClustalW
The input file for ClustalW is a file containing
all sequences in one of the following
formats: NBRF/PIR, EMBL/SwissProt, Pearson
(Fasta), GDE, Clustal, GCG/MSF, RSF.
37
Using ClustalW
MULTIPLE ALIGNMENT MENU

    1. Do complete multiple alignment now (Slow/Accurate)
    2. Produce guide tree file only
    3. Do alignment using old guide tree file
    4. Toggle Slow/Fast pairwise alignments = SLOW
    5. Pairwise alignment parameters
    6. Multiple alignment parameters
    7. Reset gaps between alignments? = OFF
    8. Toggle screen display = ON
    9. Output format options

    S. Execute a system command
    H. HELP
    or press RETURN to go back to main menu

Your choice:
38
Output of ClustalW
CLUSTAL W (1.7) multiple sequence alignment

HSTNFR      GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
SYNTNFTRP   GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
CFTNFA      -------------------------------------------TGTCCAG------ACAG
CATTNFAA    GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
RABTNFM     AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
RNTNFAA     AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
OATNFA1     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
OATNFAR     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
BSPTNFA     GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
CEU14683    GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC

39
ClustalW options
Your choice: 5

PAIRWISE ALIGNMENT PARAMETERS

    Slow/Accurate alignments:
    1. Gap Open Penalty: 15.00
    2. Gap Extension Penalty: 6.66
    3. Protein weight matrix: BLOSUM30
    4. DNA weight matrix: IUB

    Fast/Approximate alignments:
    5. Gap penalty: 5
    6. K-tuple (word) size: 2
    7. No. of top diagonals: 4
    8. Window size: 4

    9. Toggle Slow/Fast pairwise alignments = SLOW
    H. HELP

Enter number (or RETURN to exit):
40
ClustalW options
Your choice: 6

MULTIPLE ALIGNMENT PARAMETERS

    1. Gap Opening Penalty: 15.00
    2. Gap Extension Penalty: 6.66
    3. Delay divergent sequences: 40
    4. DNA Transitions Weight: 0.50
    5. Protein weight matrix: BLOSUM series
    6. DNA weight matrix: IUB
    7. Use negative matrix: OFF
    8. Protein Gap Parameters
    H. HELP

Enter number (or RETURN to exit):
41
Blocks database and tools
  • Blocks are multiply aligned ungapped segments
    corresponding to the most highly conserved
    regions of proteins.
  • The Blocks web server tools are Block
    Searcher, Get Blocks and Block Maker. These are
    aids to detection and verification of protein
    sequence homology.
  • They compare a protein or DNA sequence to a
    database of protein blocks, retrieve blocks, and
    create new blocks, respectively.

42
The BLOCKS web server
  • At URL http://blocks.fhcrc.org/
  • The BLOCKS WWW server can be used to create
    blocks of a group of sequences, or to compare a
    protein sequence to a database of blocks.
  • The Blocks Searcher tool should be used for
    multiple alignment of distantly related protein
    sequences.

43
The Blocks Searcher tool
  • For searching a database of blocks, the first
    position of the sequence is aligned with the
    first position of the first block, and a score
    for that amino acid is obtained from the profile
    column corresponding to that position. Scores are
    summed over the width of the alignment, and then
    the block is aligned with the next position.
  • This procedure is carried out exhaustively for
    all positions of the sequence for all blocks in
    the database, and the best alignments between a
    sequence and entries in the BLOCKS database are
    noted. If a particular block scores highly, it is
    possible that the sequence is related to the
    group of sequences the block represents.
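The sliding procedure described above can be sketched as follows. This is a simplified illustration, not the Blocks Searcher implementation: the profile here is just the per-column residue frequency of each block, whereas the real tool uses calibrated scoring. The block names, block segments, and query sequence are all hypothetical.

```python
from collections import Counter


def column_profile(block):
    """One dict of residue -> frequency per column of an ungapped block."""
    n = len(block)
    return [{res: count / n for res, count in Counter(column).items()}
            for column in zip(*block)]


def best_alignment(profile, sequence):
    """Slide the block along the sequence; return (best score, start)."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(col.get(res, 0.0) for col, res in zip(profile, window))
        best = max(best, (total, start))
    return best


def search_blocks(blocks, sequence):
    """Score the sequence exhaustively against every block in the database."""
    return {name: best_alignment(column_profile(segments), sequence)
            for name, segments in blocks.items()}


# Hypothetical two-block "database" and query sequence.
blocks = {
    "blockA": ["GKGSFG", "GRGSFG", "GKGAFG"],
    "blockB": ["HRDLKP", "HRDIKP", "HRDLKL"],
}
query = "MAAGKGSFGTTTHRDLKPEN"
results = search_blocks(blocks, query)
print(results)
```

When several blocks from the same family each score highly, and in the expected order along the query, the case for membership in that family is stronger than any single hit would suggest.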

44
The Blocks Searcher tool
  • Typically, a group of proteins has more than one
    region in common and their relationship is
    represented as a series of blocks separated by
    unaligned regions. If a second block for a group
    also scores highly in the search, the evidence
    that the sequence is related to the group is
    strengthened, and is further strengthened if a
    third block also scores highly, and so on.

45
The BLOCKS Database
  • The blocks for the BLOCKS database are made
    automatically by looking for the most highly
    conserved regions in groups of proteins
    represented in the PROSITE database. These blocks
    are then calibrated against the SWISS-PROT
    database to obtain a measure of the chance
    distribution of matches. It is these calibrated
    blocks that make up the BLOCKS database.

46
The Block Maker Tool
  • Block Maker finds conserved blocks in a group of
    two or more unaligned protein sequences, which
    are assumed to be related, using two different
    algorithms.
  • Input file must contain at least 2 sequences.
  • Input sequences must be in FastA format.
  • Results are returned by e-mail.

47
Progressive Approaches
  • CLUSTALW
  • Perform pairwise alignments
  • Construct a tree, joining most similar sequences
    first (guide tree)
  • Align sequences sequentially, using the
    phylogenetic tree
  • PILEUP
  • Similar to CLUSTALW
  • Uses UPGMA to produce tree (chapter 6)

48
Clustal method
  • Higgins and Sharp 1988
  • ref: CLUSTAL: a package for performing multiple
    sequence alignment on a microcomputer. Gene, 73,
    237-244.
  • Progressive alignment method
  • An approximation strategy (heuristic algorithm)
    yields a possible alignment, but not necessarily
    the best one

49
First step
A B C D
Compute the pairwise alignments for all against
all (6 pairwise alignments for 4 sequences). The
similarities are stored in a table
D C B A
A
11 B
1 3 C
10 2 2 D
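A minimal sketch of the all-against-all step: n sequences give n(n-1)/2 pairs, so 4 sequences give 6. The scoring scheme here (match +1, mismatch -1, linear gap -1) and the four sequences are hypothetical simplifications; ClustalW itself uses substitution matrices and separate gap opening and extension penalties.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score with a linear gap penalty (two-row version)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def similarity_table(sequences):
    """Score every unordered pair of sequences exactly once."""
    return {(x, y): global_alignment_score(sequences[x], sequences[y])
            for x, y in combinations(sorted(sequences), 2)}


# Hypothetical sequences standing in for A, B, C and D.
sequences = {"A": "GATTACA", "B": "GATTACC", "C": "CCGGTT", "D": "GATCACA"}
table = similarity_table(sequences)
for pair, value in table.items():
    print(pair, value)
```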
50
Second step
D C B A
A
11 B
1 3 C
10 2 2 D
  • cluster the sequences to create a tree (guide
    tree)
  • Represents the order in which pairs of sequences
    are to be aligned
  • Highly similar sequences are neighbors in the
    tree
  • Highly distant sequences are distant from each
    other in the tree
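The clustering step can be sketched with simple average linkage on the similarity values. Two assumptions are made loudly here: the flattened table is read as A-B = 11, A-C = 1, B-C = 3, A-D = 10, B-D = 2, C-D = 2, which is one plausible reading of the slide; and average linkage (UPGMA-style) is used for simplicity, which is not necessarily the tree-building method ClustalW applies.

```python
from itertools import combinations


def guide_tree(similarity):
    """Repeatedly join the two clusters with the highest average similarity.

    Returns the tree as nested tuples plus the order in which joins were made.
    """
    names = sorted({name for pair in similarity for name in pair})
    members = {name: [name] for name in names}  # cluster -> leaves it contains

    def average_similarity(c1, c2):
        values = [similarity[frozenset((x, y))]
                  for x in members[c1] for y in members[c2]]
        return sum(values) / len(values)

    order = []
    while len(members) > 1:
        c1, c2 = max(combinations(list(members), 2),
                     key=lambda pair: average_similarity(*pair))
        order.append((c1, c2))
        members[(c1, c2)] = members.pop(c1) + members.pop(c2)
    return next(iter(members)), order


# Similarities under the assumed reading of the slide's table.
similarity = {
    frozenset("AB"): 11, frozenset("AC"): 1, frozenset("BC"): 3,
    frozenset("AD"): 10, frozenset("BD"): 2, frozenset("CD"): 2,
}
tree, order = guide_tree(similarity)
print(tree)
print(order)
```

Under this reading A and B are joined first, D is added next, and C joins last, so the most divergent sequence is aligned only after the closer ones have been combined.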

51
Third step
Align most similar pairs
Align the alignments as if each of them were a
single sequence (with the use of a consensus
sequence or a profile)
52
Clustal programs
  • ClustalV
  • ClustalW
  • Thompson et al., 1994
  • Uses sequence weighting, position-specific gap
    penalties and weight matrix choice
  • W stands for weighting (of sequences)
  • ClustalX - window-based (graphical) implementation

53
ClustalW method rules: (1) sequence weighting
  • Each sequence is weighted according to how
    different it is from the other sequences.
  • This compensates for cases where one specific
    subfamily is overrepresented in the data

54
ClustalW method rules: (2) weight matrix choice
  • The substitution matrix used for each alignment
    step depends on the similarity of the sequences.

55
ClustalW method rules: (3) position-specific gap
penalties
  • Gaps found in initial alignments remain fixed
    through the process (ends gap)
  • Hydrophobic residues have higher gap penalties
    than hydrophilic
  • they are more likely to be in the hydrophobic
    core, where gaps should not occur.

56
ClustalW method shortcomings
  • (1) Sequences that are similar only in sub-
    regions
  • ClustalW forces a global alignment, not local.
  • (2) A sequence that contains a large
    insertion/deletion compared to the rest will
    strongly affect the alignment
  • (again global not local).

57
ClustalW method shortcomings
  • (3) A sequence that contains a repetitive
    element (such as a domain), whereas all
    other sequences only contain one copy.

58
Comments
  • Pairwise alignment uses an optimal algorithm
  • Multiple alignment is not optimal, only a
    heuristic. Better alignments may exist!
  • The algorithm yields a possible alignment, but
    not necessarily the best one.

59
ClustalW in the web server
  • Global multiple sequence alignment program for
    DNA or proteins
  • Available from a number of sites
  • EMBL-EBI

60
Results
61
Results
62
Alignment with colors
identity
similarity
63
CLUSTAL format
CLUSTAL W(1.82) multiple sequence alignment

YPK1        SQLSWKRLLMKGYIPPYKPAVSD-Q--NSMDTSNFDEEFTR--SEKPIDSVVDEYLSESV
YPK2        KDISWKKLLLKGYIPPYKPIVKDTQ--SEIDTANFDQEFTK---EKPIDSVVDEYLSASI
KPCA_HUMAN  RRIDWEKLENREIQPPFKPKVC------GKGAENFDKFFTR---GQPVLTPPDQLVIANI
KPCZ_HUMAN  RSIDWDLLEKKQALPPFQPQIT---M-DDYGLDNFDTQFTS---EPVQLTPDDEDAIKRI
KAPA        KEVVWEKLLSRNIETPYEPPIQ----QGQGDTSQFDKYPE----EDINYGVQGEDPYADL
KAPC        NEVIWEKLLARYIETPYEPPIQ----QGQGDTSQFDRYPE-EVDEEFNYGIQGEDPYMDL
KAPB        SEVVWERLLAKDIETPYEPPIT----SGIGDTSLFDQYPE-DV-EQLDYGIQGDDPYAEY
KS6_HUMAN   RHINWEELLARKVEPPFKPLLQ-----SEEDVSQFDSKFTR-V-QTPVDSP-DDSTLSES
.
YPK1        -----MQKQF
YPK2        ----N-QKQF
KPCA_HUMAN  D--O--QSDF
KPCZ_HUMAN  D-----QSEF
KAPA        -D----FRDF

64
ClustalW at EMBL - Jalview
Jalview is a multiple alignment editor
conservation
65
Jalview
  • color menu
  • Taylor colors (each amino acid is colored
    differently)
  • Zappo colors (amino acids are colored according
    to their physico-chemical properties)
  • Hydrophobicity colors (colors amino acids
    according to a certain score scale that
    represents hydrophobicity)
  • Coloring residues above a percentage identity
    threshold
  • User defined color schemes

66
Example - Zappo colors
  • physico-chemical properties color-code

67
Guide Tree
68
ClustalX
  • ClustalX provides a window-based user interface
    to the ClustalW program.
  • It uses the Vibrant multi-platform user interface
    development library, developed by the NCBI as part
    of their NCBI Software Development Toolkit.

69
T-coffee
  • Another MSA program
  • Protein & nucleotide MSA program
  • Uses principles similar to ClustalW
  • More accurate but longer running times
  • Limits the number of sequences it can align
    (100)
  • T-coffee at EMBnet

71
T-coffee results
72
Phylip format
 5 99
Cabd_199509   PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKIIGGI
JCSB1_199401  PQITLWQRPIVTIKIGGQLKEALLDTGAD---LEEM-NPGRWKPKIIGGI
JCSB2_199401  PQITLWQRPVT-IK-GG-QLEALLDTGADDTL-EEI-LPGRW-PKMIGGI
JCSB4_199401  PQITLWQRPVT--K-GG-LKEALLDTGADDTE-----DPGRWKPKMIGGI
JCSB5_199401  PQITLWQRPIVTIKVGGQLKEALLDTGADDTVL-EMNLPGRWKPKMIGGI

              GGFIKVRQYDQVPIEICGHKAIGTVLVGPTPSNIIGRNLLTQLGCTLNF
              GGFVKVRQYDQIPIDICGHKVIGTVL-GPTPANVIGRNLLTQIGCTLNF
              GGFVKVR-YDQVPIEICGH--IGTVLVGPTPANIIGRNLMTQLGCTLNF
              GGFLKVRQYDQIPVEICGHKAIGTVL-GPTPANIIGRNLLTQIG-TLNF
              GGFVKVRQYDQIPIEICGHKAIGTVLVGPTPANIVGRNLLTQIGCTLNF
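A minimal sketch of reading an interleaved PHYLIP file like the one above. The first line gives the number of sequences and the number of alignment columns (5 and 99 in the slide); the first block carries the names, and later blocks continue the sequences in the same order. The parser assumes the "relaxed" variant in which the name is separated from the sequence by whitespace, since the names in the slide are longer than the 10 characters of strict PHYLIP. The function name and the small example are hypothetical.

```python
def parse_interleaved_phylip(text):
    """Return {name: sequence} from relaxed, interleaved PHYLIP text."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_seqs, n_cols = (int(value) for value in lines[0].split())
    names, chunks = [], []
    for index, line in enumerate(lines[1:]):
        if index < n_seqs:
            # First block: name, then the first stretch of sequence.
            name, *parts = line.split()
            names.append(name)
            chunks.append("".join(parts))
        else:
            # Later blocks: sequence only, in the same order as the names.
            chunks[index % n_seqs] += "".join(line.split())
    alignment = dict(zip(names, chunks))
    for name, sequence in alignment.items():
        if len(sequence) != n_cols:
            raise ValueError(f"{name}: expected {n_cols} columns, got {len(sequence)}")
    return alignment


# Small hypothetical example: 2 sequences, 12 columns, two blocks of 6.
example = """2 12
seq_one   ACGTAC
seq_two   AC-TAC

ACGTTT
ACG-TT
"""
alignment = parse_interleaved_phylip(example)
print(alignment)
```

Checking every sequence against the column count in the header is a cheap way to catch a truncated or mis-wrapped file before it reaches a phylogeny program.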

73
The Biology WorkBench
  • http://workbench.sdsc.edu/
  • http://www.ngbw.org/
  • Nucleic Acid Sequence Tools, including BLAST,
    CLUSTALW, MFOLD, PRIMER3

74
Muscle
  • Protein & nucleotide MSA program
  • Improvements in both accuracy and speed
  • exploiting a range of existing and new
    algorithmic techniques
  • combination of progressive and iterative
    alignment strategies
  • details of the method
  • web server
  • downloads Windows, Linux, Mac

75
Muscle web server
76
Editing MSA
  • There are a variety of tools that can be used to
    modify a multiple alignment (SeaView, BioEdit,
    JalView)
  • These programs can be very useful in formatting
    and annotating an alignment for publication.
  • An editor can also be used to make modifications
    by hand to improve biologically significant
    regions in a multiple alignment created by one of
    the automated alignment programs.

77
MSA approaches
  • Progressive approach: CLUSTALW (CLUSTALX),
    PileUp, T-COFFEE, MAFFT, MUSCLE
  • Iterative approach: repeatedly realign subsets
    of sequences. MultAlin, DIALIGN, MAFFT,
    MUSCLE, ProbCons
  • Genetic algorithm: SAGA
  • Graph algorithm: POA

78
Conclusion
  • There is no single method that always generates
    the best alignment
  • It may thus be wise to use more than one method
  • Alignment editors can be used to correct the
    alignments