Title: q-gram Based Database Searching Using A Suffix Array (QUASAR)
1. q-gram Based Database Searching Using a Suffix Array (QUASAR)
- S. Burkhardt
- A. Crauser
- H-P. Lenhof
- E. Rivals
- P. Ferragina
- M. Vingron
Max-Planck-Institut für Informatik, Saarbrücken
Deutsches Krebsforschungszentrum, Heidelberg
2. Outline
- Existing Work
- Motivation
- Problem
- Algorithm
- Results
3. Existing Work
- Examples
  - BLAST
  - FASTA
- Linear Scan (No Index)
- Good Sensitivity
4. Motivation
- Today: New Applications
- Examples
  - EST Clustering
  - Large-Scale Shotgun Assembly
- Low Sensitivity Suffices
- Multiple Searches
- Specialized Algorithms Needed
5. Problem Definition
- Local Alignment, minimum Length w (example: w = 8)
- Low Error Rate (< 10% Edit Distance)
6. The Algorithm
- Filter Step
  - Identify Hotspots
- Scan Step
  - Scan Hotspots with BLAST
7. The Algorithm
- q-gram Filtration
- Block Addressing
- Suffix Array
- Window Shifting
Query P: T C G A T T A C A G T G A A T
Window (w = 8): T C G A T T A C
q = 3; number of q-grams: $|P| - q + 1$
Database: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
Edit distance e: at least $t = |P| - q + 1 - qe$ common q-grams
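
The q-gram bound on this slide can be checked on the slide's own numbers (window T C G A T T A C, q = 3, e = 1). The following is a minimal Python sketch, not the authors' implementation; the helper names `qgrams` and `qgram_threshold` and the mutated string are illustrative assumptions.

```python
from collections import Counter


def qgrams(s, q):
    """Return all overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(length, q, e):
    """Lower bound on shared q-grams for a match with at most e errors."""
    return length - q + 1 - q * e


window = "TCGATTAC"          # the 8-character window from the slide
q, e = 3, 1

grams = qgrams(window, q)    # 8 - 3 + 1 = 6 q-grams
t = qgram_threshold(len(window), q, e)

# Hypothetical copy of the window with one substitution (A -> C at index 3).
mutated = "TCGCTTAC"
shared = sum((Counter(grams) & Counter(qgrams(mutated, q))).values())

print(grams)   # ['TCG', 'CGA', 'GAT', 'ATT', 'TTA', 'TAC']
print(t)       # 3
print(shared)  # 3, which meets the threshold t
```

A single error can destroy at most q overlapping q-grams, which is why the bound subtracts q times e from the total number of q-grams.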
8. The Algorithm
- q-gram Filtration
- Block Addressing
- Suffix Array
- Window Shifting
Window: T C G A T T A C
- Count matching q-grams per Block
- Scan Blocks with counter ≥ t
How to find the matching q-grams?
Database: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
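
The per-block counting described on this slide might look like the sketch below. It is an illustration only: the block size of 10 is a hypothetical value chosen for the 30-character example database, the blocks do not overlap (so a match straddling a block boundary could be missed), and a naive `str.find` scan stands in for the index that the slide's closing question leads to.

```python
def count_hits_per_block(db, pattern, q, block_size):
    """Count, for each database block, the q-gram hits of the pattern."""
    n_blocks = (len(db) + block_size - 1) // block_size
    counters = [0] * n_blocks
    for i in range(len(pattern) - q + 1):
        gram = pattern[i:i + q]
        pos = db.find(gram)              # naive search, no index yet
        while pos != -1:
            counters[pos // block_size] += 1
            pos = db.find(gram, pos + 1)
    return counters


DB = "GCATTCGATGGACTGGACTAGTGAATCAGT"   # database string from the slide
WINDOW = "TCGATTAC"                     # window from the slide
Q, T = 3, 3                             # q and threshold t from the slides
BLOCK_SIZE = 10                         # assumption, not from the slides

counters = count_hits_per_block(DB, WINDOW, Q, BLOCK_SIZE)
candidates = [b for b, c in enumerate(counters) if c >= T]

print(counters)    # [4, 0, 0]
print(candidates)  # [0]
```

Only block 0 reaches the threshold, so only that part of the database would be passed on to the scan step.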
9. The Algorithm
- q-gram Filtration
- Block Addressing
- Suffix Array
- Window Shifting
Window: T C G A T T A C
Database (positions 0 to 29): G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
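
As a rough illustration of how a suffix array answers "where does this q-gram occur?", the sketch below builds a suffix array for the example database and locates q-grams by binary search. It is a simplified stand-in under stated assumptions: the construction is the naive sort of all suffixes (fine for 30 characters, far too slow for a real database), and the function names are hypothetical. The slides do not say how QUASAR looks up q-grams in its index, so the binary search here should not be read as the authors' method.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_qgram(text, sa, gram):
    """Return all positions where gram occurs, via two binary searches."""
    q = len(gram)

    # Lower bound: first suffix whose length-q prefix is >= gram.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + q] < gram:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-q prefix is > gram.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + q] <= gram:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


DB = "GCATTCGATGGACTGGACTAGTGAATCAGT"   # database string from the slide
sa = build_suffix_array(DB)

print(find_qgram(DB, sa, "TCG"))  # [4]
print(find_qgram(DB, sa, "GAC"))  # [10, 15]
print(find_qgram(DB, sa, "TTA"))  # []
```

All suffixes that start with the same q-gram are adjacent in the suffix array, so each lookup returns one contiguous range of positions.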
10. The Algorithm
- q-gram Filtration
- Block Addressing
- Suffix Array
- Window Shifting
Parameters: q = 3, w = 8, e = 1, t = 3
Query: T C G A T T A C A G T G A A T
Current window: T C G A T T A C
- Mark full Blocks (counter ≥ t) for each Window
Database: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
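
One way to picture window shifting is to keep the block counters alive while the window slides over the query: the q-gram entering the window increments its blocks, the q-gram leaving decrements them, and any block that reaches t is marked. The sketch below follows that idea with the parameters from this slide. The block size of 10, the non-overlapping blocks, the naive `occurrences` helper, and all function names are assumptions made for illustration, not details taken from the slides.

```python
from collections import defaultdict


def occurrences(db, gram):
    """All start positions of gram in db (naive scan, stand-in for an index)."""
    out, pos = [], db.find(gram)
    while pos != -1:
        out.append(pos)
        pos = db.find(gram, pos + 1)
    return out


def mark_blocks(db, query, q, w, e, block_size):
    """Return the set of blocks whose counter reaches t for some window."""
    t = w - q + 1 - q * e            # q-gram threshold per window
    per_window = w - q + 1           # number of q-grams in one window
    counters = defaultdict(int)
    marked = set()
    for j in range(len(query) - q + 1):
        # q-gram entering the window on the right
        for pos in occurrences(db, query[j:j + q]):
            counters[pos // block_size] += 1
        # q-gram leaving the window on the left
        if j >= per_window:
            k = j - per_window
            for pos in occurrences(db, query[k:k + q]):
                counters[pos // block_size] -= 1
        # once a full window is covered, mark blocks at or above t
        if j >= per_window - 1:
            marked.update(b for b, c in counters.items() if c >= t)
    return marked


DB = "GCATTCGATGGACTGGACTAGTGAATCAGT"   # database string from the slide
QUERY = "TCGATTACAGTGAAT"               # query string from the slide

marked = mark_blocks(DB, QUERY, q=3, w=8, e=1, block_size=10)
print(sorted(marked))  # [0, 2]
```

Block 0 is marked by the first windows of the query, block 2 by the later ones; block 1 never collects enough q-gram hits.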
11. Results
- Influence of the Block Size
- Sensitivity
- Running Times
- Overhead for loading the Index
Benchmark System: UltraSPARC Processor, 333 MHz, 4 GB RAM
12. Results
Influence of Block Size
13. Results
Sensitivity
- 1000 Queries
- BLAST Cutoff E = 0.00001
- Fraction of identical hitlists
  - Mouse EST DB: 91.4%
  - Human EST DB: 97.1%
- QUASAR finds many Hits below selected Error Level
14. Results
Running Times
- Test Parameters
  - 6% Error
  - w = 50
  - q = 11
  - block size 2048
  - scan with BLAST
- time averaged for 1000 queries
- 30 times faster than BLAST
15. Results
Overhead for Loading the Index
- 1000 queries
- Human EST DB, 280 Mbp
- BLAST Test Run
  - 5 seconds Load Time
  - 13,270 seconds Search Time
- QUASAR Test Run
  - 90 seconds Load Time
  - 380 seconds Search Time
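
The figures on this slide can be turned into speed-up ratios with a few lines of arithmetic. The sketch assumes that the BLAST search time is thirteen thousand two hundred seventy seconds, i.e. that the separator in the slide's figure is a thousands separator; that reading is an assumption, although it is the one consistent with the roughly 30-fold speed-up reported for the running times.

```python
# Times in seconds for 1000 queries, as listed on the slide.
blast_load, blast_search = 5, 13270     # assumes a thousands separator
quasar_load, quasar_search = 90, 380

search_speedup = blast_search / quasar_search
total_speedup = (blast_load + blast_search) / (quasar_load + quasar_search)

print(round(search_speedup, 1))  # 34.9
print(round(total_speedup, 1))   # 28.2
```

Under that assumption, loading the index costs QUASAR about a fifth of its total run time, yet the overall run is still close to 30 times faster than BLAST.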