Indexing Genome Sequences

Description:

State of the art. Dynamic Programming. Slow but accurate. Never ... Not suited for average DNA/Protein query lengths. IITB - Bioinformatics Workshop 2001 ... – PowerPoint PPT presentation

1
Indexing Genome Sequences
  • Srikanta B. J.
  • Database Systems Lab (DSL)
  • Indian Institute of Science

2
Background
  • Sequences
  • DNA (Deoxyribonucleic Acid)
  • Proteins
  • Similarity of sequences
  • The extent to which nucleotide or protein
    sequences are related
  • Percent sequence identity, and/or Conservation

3
Genome Sequence Analysis
  • Hypothesize
  • Function of Proteins
  • Phylogenetic trees
  • Causes of Diseases
  • First step in unraveling the mystery of Life!
  • Sequence Similarity → Structural Similarity →
    Functional Similarity

4
Sequence Similarity
  • Alignment
  • between two sequences, S1 and S2 (perhaps of
    unequal length)
  • Insert spaces into, or at the ends of, S1 and S2
  • Place them so that every character or space in
    either string is opposite a unique
    character/space in the other. E.g.,
      q a c - d b d
      q a w x - b -
  • Global and Local Alignments
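A minimal sketch of how an alignment like the example on this slide can be summarized as percent sequence identity. The function name `percent_identity` is hypothetical, and dividing by the number of alignment columns is an assumption; other definitions divide by the length of the shorter sequence.

```python
def percent_identity(row1, row2, gap="-"):
    """Percent of alignment columns in which both rows hold the same character."""
    # Both rows include their inserted spaces, so they must be equally long.
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    if not row1:
        return 0.0
    # A column counts as a match only if the characters agree and are not spaces.
    matches = sum(1 for a, b in zip(row1, row2) if a == b and a != gap)
    return 100.0 * matches / len(row1)


# The two rows of the slide's example, with '-' standing for an inserted space.
identity = percent_identity("qac-dbd", "qawx-b-")
```

Three of the seven columns (q/q, a/a, b/b) match in this example.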

5
Alignment
  • Global
  • Given two sequences, find best alignment over
    full length
  • E.g., between (agtcacaaaact, actcgga)
      a g t c a c a a a a c t
      a c t c g g a - - - - -
  • Local
  • Look for islands of high similarity
  • E.g., between (agtcacaaaact, actcgga)
      a g t c a c a a a a c t
      a c t c g g a
  • O(mn) with Dynamic Programming

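The O(mn) bound comes from filling a table with one cell per pair of prefixes, where m and n are the two sequence lengths. The sketch below computes only the best score, not the alignment itself, for both the global and the local case. The match and mismatch values are the DNA-BLAST ones quoted in this deck; the linear gap penalty of -4 and the function name `align_score` are assumptions made for illustration.

```python
def align_score(s, t, match=5, mismatch=-4, gap=-4, local=False):
    """Best alignment score of s and t by dynamic programming, O(len(s) * len(t))."""
    n = len(t)
    # prev[j] is the best score for the previous prefix of s against t[:j].
    # Global alignment pays for leading gaps; local alignment may start anywhere.
    prev = [0 if local else j * gap for j in range(n + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0 if local else i * gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment can be restarted whenever the score drops below 0.
                score = max(score, 0)
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[n]


global_score = align_score("agtcacaaaact", "actcgga")
local_best = align_score("agtcacaaaact", "actcgga", local=True)
```

Only two rows of the table are kept, so memory is O(n) while time stays O(mn). Recovering the aligned rows themselves would require keeping the full table or traceback pointers.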
6
Scoring the Alignments
  • Scoring Schemes
  • Value for aligning character x against character
    y
  • Provided as scoring matrix, for alphabet Σ
  • E.g.,
  • BLOSUM
  • PAM - 120
  • DNA-BLAST (5 for match, -4 for mismatch)
  • Optimizing alignments
  • E.g., Edit Distance
  • Scoring scheme: Insert = 1, Delete = 1, 0
    otherwise
  • => edit_distance(surgery, surgeon) = 4
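The value 4 follows from the scoring scheme above, which charges 1 per insertion and 1 per deletion and has no substitution operation (matching characters cost 0). A minimal sketch under that reading is given below; with a unit-cost substitution added, as in Levenshtein distance, the same pair would score 2 instead.

```python
def edit_distance(s, t, insert_cost=1, delete_cost=1):
    """Edit distance using insertions and deletions only; matches cost 0."""
    n = len(t)
    # prev[j] is the cost of turning the empty prefix of s into t[:j].
    prev = [j * insert_cost for j in range(n + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * delete_cost] + [0] * n
        for j in range(1, n + 1):
            # Either delete s[i-1] or insert t[j-1].
            best = min(prev[j] + delete_cost, curr[j - 1] + insert_cost)
            if s[i - 1] == t[j - 1]:
                # Equal characters can be paired at no cost.
                best = min(best, prev[j - 1])
            curr[j] = best
        prev = curr
    return prev[n]


distance = edit_distance("surgery", "surgeon")  # 4: delete r, y; insert o, n
```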

7
Search Process
  • Given sequence to be studied
  • Want all similar (global/local) known sequences
  • Collections of sequences
  • NCBI-GenBank, SwissProt etc.
  • Contain millions of sequences

8
State of the art
  • Dynamic Programming
  • Slow but accurate
  • Never misses a significant alignment
  • FastA
  • Faster than Dynamic Programming
  • Uses statistical heuristics
  • Reduced sensitivity → False dismissals
  • BLAST
  • Fastest and most popular
  • Lower sensitivity than FastA
  • Requires whole database in memory!

9
BLAST - on a $1,000 Budget!
  • BODHI Experience (DSL, 2001)
  • 51,000 DNA sequences in database
  • CAFÉ Experience (Williams and Zobel, 2001)
  • 120,000 DNA sequences in memory
  • Time - 67.1 seconds/BLAST

→ 10.6 seconds / BLAST

10
NCBI GenBank Growth
  • Doubles every 13 months
  • In 1998, estimated 40,000 sequence similarity
    queries per day
  • That was 3 years ago!!
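To put the doubling time in perspective, the arithmetic can be sketched as below. It assumes steady exponential growth at the quoted rate; `growth_factor` is a hypothetical helper name, and the result describes database size only, not query volume.

```python
def growth_factor(elapsed_months, doubling_months=13):
    """Multiplicative growth after elapsed_months, given a fixed doubling time."""
    return 2 ** (elapsed_months / doubling_months)


# Three years at a 13-month doubling time: a bit under three doublings.
three_year_factor = growth_factor(36)
```

Under that assumption the database grows by a factor of roughly 6.8 in three years.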

11
We Need Indexes for Sequence Similarity Searching NOW!!

12
Indexed Searching
  • Inverted Indexes
  • RAMdb (Fondrat and Dessen, 1995)
  • CAFÉ (Williams and Zobel, 2001)
  • FLASH (Califano and Rigoutsos, 1993)
  • Multi-Dimensional Indexes
  • MRS-indexing (Kahveci and Singh, 2001)
  • Persistent Prefix Tree (Hunt et al., 2001)

13
RAMdb (Rapid Access Motif db)
  • Each sequence in the repository is indexed by
    its constituent overlapping subsequences
  • 800-fold speedup over Dynamic Programming
  • Prohibitive index size
  • No ranking (goodness) of alignments
  • False dismissals

ACTC → Seq1, Seq2, ...
CTCG → Seq1, Seq4, ...

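The idea of indexing each sequence by its overlapping subsequences can be sketched as a small inverted index. This is an illustration only, not RAMdb's actual data structure: the window length k = 4, the function names, and the toy sequences are all assumptions, chosen so that the entries for ACTC and CTCG resemble the ones pictured on this slide.

```python
from collections import defaultdict


def build_kmer_index(sequences, k=4):
    """Map every overlapping length-k substring to the ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def lookup(index, query, k=4):
    """Count, per sequence id, how many of the query's k-mers it contains."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        # .get avoids creating empty entries for k-mers absent from the index.
        for seq_id in index.get(query[i:i + k], ()):
            hits[seq_id] += 1
    return dict(hits)


# Hypothetical toy repository.
sequences = {"seq1": "ACTCG", "seq2": "GACTC", "seq4": "CTCGA"}
index = build_kmer_index(sequences)
hits = lookup(index, "ACTCG")
```

Every position of every sequence contributes an index entry, which is one way to see why the index size becomes prohibitive, and a sequence that shares no exact k-mer with the query is never reported, which is the source of false dismissals.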
14
CAFÉ
  • Partitioned Search
  • Coarse searching with compressed inverted index
  • Fine searching in small fraction of database,
    with ranking
  • 14-fold speedup over BLAST
  • Compression reduces the index size
  • Distant sequence relationships are lost
  • Lower retrieval effectiveness
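The two-phase search described on this slide can be sketched as follows: a coarse pass over an inverted index selects a few candidates, and only those are aligned and ranked. This is not CAFÉ's actual algorithm. The index here is uncompressed, candidates are chosen by a simple count of shared k-mers, and the scoring values (5, -4, gap -4), the function names, and the toy database are assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_index(db, k):
    """Map each length-k substring to the set of sequence ids containing it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def coarse_search(index, query, k, top):
    """Coarse phase: keep the `top` ids that share the most query k-mers."""
    votes = Counter()
    for i in range(len(query) - k + 1):
        for seq_id in index.get(query[i:i + k], ()):
            votes[seq_id] += 1
    return [seq_id for seq_id, _ in votes.most_common(top)]


def local_score(s, t, match=5, mismatch=-4, gap=-4):
    """Fine phase scorer: best local alignment score by dynamic programming."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def partitioned_search(db, query, k=3, top=2):
    """Align the query only against coarse-phase candidates, ranked by score."""
    index = build_index(db, k)
    candidates = coarse_search(index, query, k, top)
    scored = ((local_score(query, db[sid]), sid) for sid in candidates)
    return sorted(scored, reverse=True)


# Hypothetical toy database; s2 shares no 3-mer with the query.
db = {"s1": "agtcacaaaact", "s2": "ttttgggg", "s3": "actcgga"}
results = partitioned_search(db, "actcgga")
```

The expensive alignment runs on at most `top` sequences, which is where the speedup comes from. A sequence such as `s2` that shares no k-mer with the query is never aligned at all, which illustrates how distant relationships can be lost.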

15
MRS - Indexing
  • Uses progressive wavelet coefficients to
    represent sequence

16
MRS-Indexing (contd.)
  • Builds a hierarchy of Multi-Dimensional Indexes
  • Only for edit distances - no general scoring
    schemes
  • Not suited for average DNA/Protein query lengths

17
Summary
  • Rapid growth in sequence databases
  • Existing algorithms do not scale
  • Indexed approach to Sequence Similarity is
    necessary
  • Improvements needed in Indexed Searching methods