Algorithms for Local Sequence Alignments BLAST, FASTA - PowerPoint PPT Presentation

Slides: 16
Provided by: adrianbr
Transcript and Presenter's Notes

1
Algorithms for Local Sequence Alignments: BLAST,
FASTA
  • A. Brüngger, Lab Head Bioinformatics, Novartis
    Pharma AG
  • adrian.bruengger_at_pharma.novartis.com

2
Algorithms for Local Sequence Alignments: BLAST,
FASTA
  • Sequence Similarity and Homology
  • Origins of homology
  • Sequence alignment
  • Global Alignment
  • Local Alignment
  • Content of Sequence DBs
  • GenBank, SwissProt, RefSeq
  • Size of sequence DB requires special search tools
  • Algorithms for searching Sequence Databases
  • Basics of sequence DB searches
  • Efficient detection of identical k-mers
  • BLAST2 improvements
  • Statistical significance of hits

Outline follows David W. Mount, "Bioinformatics:
Sequence and Genome Analysis", Cold Spring
Harbor Laboratory Press, 2001. Online:
http://www.bioinformaticsonline.org
3
Rationale for Sequence Analysis, Origins of
Sequence Similarity
Similar sequence leads to similar function.
Sequence analysis is the basic tool to discover
functional, structural, and evolutionary information
in biological sequences.
[Figure: Sequence A and Sequence B, each joined to a
common ancestor sequence by a branch of x or y steps]
Evolutionary relationship between two similar
sequences and a possible common ancestor. The
number of steps to convert one sequence into the
other is the "evolutionary" distance between the
sequences (x + y). Usually, the ancestor sequence
is not available, so only (x + y) can be computed.
4
Origins of Homology: Significance of Sequence
Alignments
  • Possible Origins of Sequence Homology
      • orthologs (panel A and B): a1 in species I and a1
        in species II (same ancestor!)
      • paralogs (panel A and B): a1 and a2 (arose from
        gene duplication event)
      • analogs (panel C): different genes converge to
        same function by different evolutionary paths
      • transfer of genetic material (panel D) between
        different species
  • Homology vs. Similarity
      • Similarity can be computed (by sequence
        alignments)
      • Homology is deduced (e.g. from similarity, but
        also from other evidence!)

5
Definition of Sequence Alignment
  • Computational procedure (algorithm) for
    comparing two/many sequences
  • identify series of identical residues or patterns
    of identical residues that appear in the same
    order in the sequences
  • visualized by writing sequences as follows
  • sequence alignment is an optimization
    problem: bringing as many identical residues as
    possible into corresponding positions

Pairwise Global Alignment (over whole length of
sequences):
MLGPSSKQTGKGS-SRIWDN
MLN-ITKSAGKGAIMRLGDA
Pairwise Local Alignment (similar parts of
sequences):
GKG
GKG
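The quantity being optimized on this slide can be made concrete with a minimal sketch that counts identical columns in an alignment that has already been written out. The function name `count_identities` and the use of "-" as the gap symbol are assumptions made for this illustration; the two rows are the global alignment shown on the slide.

```python
def count_identities(row_a: str, row_b: str, gap: str = "-") -> int:
    """Count alignment columns in which both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column counts only if the residues agree and it is not a gap column.
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)


# The pairwise global alignment from the slide.
global_a = "MLGPSSKQTGKGS-SRIWDN"
global_b = "MLN-ITKSAGKGAIMRLGDA"

identities = count_identities(global_a, global_b)
print(identities, "identical columns out of", len(global_a))
```

An alignment algorithm searches over all possible gap placements for the arrangement that maximizes a score of this kind; this sketch only evaluates one given arrangement.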
6
Content of Sequence Databases
  • Sequencing efforts during the last 15 years led
    to a wealth of sequence DBs
  • GenBank (NCBI)
      • Direct submission of DNA/RNA sequences by the
        researchers
      • Uncurated: varying quality of sequences (ESTs,
        mRNAs, genomic DNA)
      • Entries are hardly ever changed (even if there
        are obvious mistakes!)
      • Highly redundant
      • 30'000'000 sequences
  • SwissProt (EBI, SIB)
      • Protein database
      • Curated: ongoing effort to improve data quality
        by human curation
      • Annotations: controlled vocabulary, structured
        information
      • minimally redundant
      • 30'000 sequences
  • RefSeq (NCBI)
      • manually and computationally annotated/curated
        set of genes, mRNAs, and proteins
      • minimally redundant
      • captures relationships between DNA, RNA, and proteins
        (splice variants, SNPs etc.)
      • 100'000 sequences

7
Growth of DBs requires specific algorithms to
search
  • Given my sequence of interest ("query"):
      • Is the query contained in a sequence DB?
      • Are there orthologs, paralogs, analogs to the
        query in a sequence DB?
      • Are there sequences sharing high/medium/low
        degree of similarity with query parts?
      • Are there other sequence variants (splice
        variants, SNPs) in the DB?

8
Searching a sequence DB with a query sequence
  • Approach 1
      • global/local pairwise sequence alignment of query
        with each DB sequence
      • dynamic programming requires O(nm) space and
        time
      • space/time complexity is such that the resulting
        computation time is prohibitive
      • Smith-Waterman or similar is not feasible!
  • Approach 2
      • fast identification of identical subsequences in
        DB ("seeds")
      • extension around seed to construct local
        alignment
  • General difficulty
      • small query, large database
      • in some cases, an identified "hit" may happen by
        pure chance
      • assign statistical significance to a "hit"
  • Example
      • DB: human genomic DNA, 3 × 10^9 b
      • query: tggtacaaatgttct (glucocorticoid response
        element GRE)
9
Basics of Sequence DB Searches: Detection of
identical k-mers
  • Idea: identify identical k-mers in DB and q
    (seed), then expand the alignment from the seed
    in both directions
  • Example

q:  MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS

DB: ...MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTLTRPRYGKRAEEENTGGLP...
  • BLAST
      • HSP, high scoring pair
      • gapped alignment
      • starting extension also from similar (and not
        only identical) seeds
10
Basics of Sequence DB Searches: Detection of
identical k-mers
  • Precompute position of all k-mers in DB sequence
  • Indexing all peptides of length k in database
  • Example

0        1         2         3
1234567890123456789012345678901234
MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP

Peptides of length k = 5:
MAAAR  AAARL  AARLC  ....  APLEP

Sorted:
AAARL   2
AARLC   3
APLEP  30
...
MAAAR   1
...
VALLL  17
  • For each peptide of length k in the query,
    identical peptides in "database" are detected
    efficiently (binary search in sorted list)
  • For identical pairs, extension step in both
    directions is performed
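A minimal sketch of the sorted k-mer list and its binary-search lookup, using the example sequence from this slide with 1-based positions as on the slide. The function names are assumptions for illustration; the standard-library `bisect` module performs the binary search.

```python
from bisect import bisect_left


def build_sorted_index(db: str, k: int):
    """All (k-mer, 1-based position) pairs of db, sorted by k-mer."""
    return sorted((db[i:i + k], i + 1) for i in range(len(db) - k + 1))


def lookup(index, word):
    """Positions at which word occurs, found by binary search in the sorted list."""
    # (word, 0) sorts before every real entry for word, since positions start at 1.
    i = bisect_left(index, (word, 0))
    hits = []
    while i < len(index) and index[i][0] == word:
        hits.append(index[i][1])
        i += 1
    return hits


db = "MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP"
index = build_sorted_index(db, 5)
print(index[0])                 # first entry of the sorted list
print(lookup(index, "APLEP"))   # position of APLEP
print(lookup(index, "VALLL"))   # position of VALLL
```

Each lookup costs O(log N) comparisons for a list of N k-mers, instead of a scan over the whole database sequence.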

11
Basics of Sequence DB Searches: Detection of
identical k-mers
  • Indexing all peptides of length k in database:
    some refinements

0        1         2         3
1234567890123456789012345678901234
MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP
----------------8-----------------   Pointers to previous occurrences

List of all words of length k and last occurrence in query:
AAAAA   -
AAAAC   -
AAAAD   -
....
AARLC   3
....
APLEP  30
...
VALLL  17
...
  • Simple, yet efficient data structure
      • array of integers (size = length of db)
      • array of integers (size = number of words with
        length k)
  • Book-keeping
      • more than one db/query sequence
      • build database chunks that fit into main memory
        (speeds up computation 1000x)
  • Extension step optimizations
  • For each peptide of length k in the query, the
    position in the wordlist can be easily computed
    (no binary search!)
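The two integer arrays described on this slide can be sketched as below. This is one possible reading of the slide, with hypothetical names: `last` has one slot per possible word and stores the last position at which that word occurs, `prev` has one slot per sequence position and points to the previous occurrence of the same word. A word's slot is computed directly by reading it as a base-20 number over the alphabetically ordered one-letter amino acid codes (so AAAAA is slot 0, AAAAC slot 1, AAAAD slot 2), which is why no binary search is needed. Position 0 stands for the "-" entries on the slide.

```python
AMINO = "ACDEFGHIKLMNPQRSTVWY"              # 20 standard residues, alphabetical
CODE = {aa: i for i, aa in enumerate(AMINO)}


def word_code(word: str) -> int:
    """Slot of a word in the word list: the word read as a base-20 number."""
    code = 0
    for aa in word:
        code = code * len(AMINO) + CODE[aa]
    return code


def build_index(seq: str, k: int):
    last = [0] * (len(AMINO) ** k)   # last[w]: last 1-based position of word w, 0 = none
    prev = [0] * (len(seq) + 1)      # prev[p]: earlier position of the word at p, 0 = none
    for p in range(1, len(seq) - k + 2):
        w = word_code(seq[p - 1:p - 1 + k])
        prev[p] = last[w]
        last[w] = p
    return last, prev


def positions(word: str, last, prev):
    """All positions of word, newest first, by following the prev pointers."""
    p = last[word_code(word)]
    hits = []
    while p:
        hits.append(p)
        p = prev[p]
    return hits


seq = "MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP"   # sequence from this slide
last, prev = build_index(seq, 5)
print(positions("VALLL", last, prev))
```

For k = 5 the word list has 20^5 = 3,200,000 slots regardless of the sequence length, which is the trade-off for constant-time lookup.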

12
Improvement of sensitivity/selectivity in BLAST
Query: A W T V A S A V R T S I
  • (optional) filtering for low complexity regions in
    query
  • all query words of length 3 are listed
  • to each word, 50 'high scoring' additional words
    are added
  • matching words are identified in DB (as described
    before)
  • ungapped alignment constructed from word matches,
    'HSP'
  • statistics determine whether an HSP is significant
  • SW-alignment for significant HSPs

Query words of length 3:
AWT   VAS   AVR   TSI
 WTV   ASA   VRT
  TVA   SAV   RTS

Query words with additional high-scoring words:
AWT   VAS   AVR   ...
AWA   IAS   TVR   ...
TWA   LAS   AIR   ...
ART   ITS   AVS   ...
...   ...   ...
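The word-listing and word-expansion steps can be sketched with a toy scoring scheme. Everything about the scoring here is an assumption for illustration: a reduced alphabet, hypothetical similarity groups, and made-up scores (5 for identity, 2 within a group, -3 otherwise). BLAST itself scores candidate words with a substitution matrix such as BLOSUM62 and keeps those above a threshold, so the words produced below will not match the ones on the slide.

```python
from itertools import product

ALPHABET = "AILVSTRKW"                                   # reduced alphabet (assumption)
GROUPS = [set("AILV"), set("ST"), set("RK"), set("W")]   # hypothetical similarity groups


def toy_score(a: str, b: str) -> int:
    """Made-up residue score: identity 5, same group 2, otherwise -3."""
    if a == b:
        return 5
    if any(a in g and b in g for g in GROUPS):
        return 2
    return -3


def query_words(q: str, k: int = 3):
    """All overlapping words of length k in the query."""
    return [q[i:i + k] for i in range(len(q) - k + 1)]


def neighborhood(word: str, threshold: int):
    """Words over ALPHABET scoring at least threshold against word, best first."""
    out = []
    for cand in product(ALPHABET, repeat=len(word)):
        s = sum(toy_score(x, y) for x, y in zip(word, cand))
        if s >= threshold:
            out.append(("".join(cand), s))
    return sorted(out, key=lambda t: (-t[1], t[0]))


words = query_words("AWTVASAVRTSI")
print(words)
print(neighborhood("AWT", 12))
```

Lowering the threshold admits more neighbouring words per query word, which raises sensitivity at the cost of more seed hits to extend.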
13
An example: comparing mus/rat/hum chr X
Each dot: conserved stretch of AA (HSP, high
scoring pair). Sequence lengths: > 140 Mbp
14
Significance of matches: DNA case
  • Issue: searching with a short query vs. a large
    database: a found match could have occurred by
    pure chance
  • assume equal distribution of c, g, a, t
  • what is ...
      • the probability p that sequence q (len = m) is
        contained in sequence t (len = n)?
      • the expected length of the longest common
        subsequence of two sequences?
      • the expected score of the best local alignment of
        two sequences?
      • the expected score distribution when locally
        aligning two sequences?
  • Example
      • q: tggtacaaatgttct (glucocorticoid response
        element GRE)
      • t: 10000 bp (promoter, upstream DNA to start
        codon)
      • if the promoter sequence was random, how often do we
        expect to find a GRE?
  • Probability that q (len = m) is contained in t
    (len = n)
      • a. There are (n-m) 'words' of length m in
        sequence t
      • b. In total, there are 4^m sequences of length m
      • c. p = (n-m) / 4^m

This is wrong. Why?
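One way to approach the question, offered as a sketch rather than as the presenter's intended answer: the ratio in step c behaves like an expected number of occurrences, not a probability, and it exceeds 1 once t is long enough. The code below computes the expected count for equally likely bases on a single strand, using n - m + 1 start positions, and a Poisson approximation for the probability of at least one occurrence. The approximation ignores the dependence between overlapping start positions; the function names are hypothetical.

```python
import math


def expected_hits(n: int, m: int, alphabet: int = 4) -> float:
    """Expected number of exact occurrences of one fixed word of length m
    in a random sequence of length n with equally likely letters."""
    return max(n - m + 1, 0) / alphabet ** m


def approx_p_at_least_one(n: int, m: int, alphabet: int = 4) -> float:
    """Poisson approximation of P(at least one occurrence).

    Treats start positions as independent, which overlapping words are not.
    """
    return 1.0 - math.exp(-expected_hits(n, m, alphabet))


gre = "tggtacaaatgttct"
for n in (10_000, 3 * 10**9):      # promoter from the slide, then a genome-sized t
    e = expected_hits(n, len(gre))
    p = approx_p_at_least_one(n, len(gre))
    print(f"n = {n}: expected hits = {e:.3g}, approx. p = {p:.3g}")
```

For the 10000 bp promoter the expected count is far below 1, so the expected count and the probability nearly coincide; for a sequence of 3 × 10^9 bases the expected count is above 1, which a probability cannot be.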
15
Conclusions and Outlook
  • types of sequence alignments: pairwise, multiple,
    query vs. database
      • local and global alignment
      • optimal alignment is in practical terms only
        feasible for pairwise alignment
  • sequence database searches
      • have become the single most important tool in
        sequence analysis
      • basic tool for hypothesis building about function
        of unknown sequence
  • basic algorithm to identify local alignments in
    sequence databases
      • efficiently find occurrence of query k-mers in db
        sequences (seeds)
      • expand (ungapped or gapped) HSP from seeds
      • attach statistical significance
  • outlook
      • some more about statistical significance
      • multiple sequence alignments
      • profile based sequence searches
      • construction of phylogenetic trees from sequence
        alignments