Title: Indexing genomic sequences
1Indexing genomic sequences
2Outline
- Introduction
- Unique markers
- Multi-layer unique markers
- Locating SNP on genome
- Aligning EST to genome
- Future work and conclusion
3Introduction
4Why indexing?
5How to locate homologous sequences?
6How to locate a SNP sequence?
7K-mer table
8K-mer table
- Small k (14) → too many combinations
- Large k (21) → huge table
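A minimal sketch of a k-mer table, to make the trade-off concrete. It assumes the table simply maps each k-mer to the positions where it occurs; the function names and the toy sequence are illustrative and not taken from the slides. With small k a single k-mer tends to occur at many positions, while the number of possible k-mers, 4^k, grows quickly with k.

```python
from collections import defaultdict


def build_kmer_table(genome, k):
    """Map every k-mer in `genome` to the list of its 0-based start positions."""
    table = defaultdict(list)
    for i in range(len(genome) - k + 1):
        table[genome[i:i + k]].append(i)
    return dict(table)


def possible_kmers(k):
    """Number of distinct k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k


toy_genome = "ACGTACGTTACG"          # hypothetical toy sequence
table = build_kmer_table(toy_genome, 3)

print(table["ACG"])                  # a short k-mer can have several positions
print(possible_kmers(14))            # 268435456 possible 14-mers
print(possible_kmers(21))            # 4398046511104 possible 21-mers
```

A direct-address table with one slot per possible k-mer is feasible for k = 14 but not for k = 21, which is one way to read the "huge table" remark.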
9What is a unique marker?
10Problem
11UniMarker (UM) method
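The transcript does not preserve the definition, so the following sketch rests on an assumption: a unique marker is taken to be a k-mer that occurs exactly once in the genome. The function name and the toy sequence are hypothetical, and a real implementation would also have to deal with the reverse strand and with memory limits.

```python
from collections import Counter


def unique_markers(genome, k):
    """Return {k-mer: position} for k-mers that occur exactly once in `genome`.

    Assumption: "unique marker" means a k-mer with a single occurrence.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


toy_genome = "ACGTACGTTACG"          # hypothetical toy sequence
markers = unique_markers(toy_genome, 3)
print(markers)
```

Because each marker has a single position, a hit in this table identifies one location on the genome without further disambiguation.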
13Numbers of UMs with different lengths
Mer   No. of UMs      Density (bp)
14    40,261,150      150
21    2,194,639,732   2.7
28    2,374,581,178   2.5
15Locating SNP on genome
16Trade-off between short and long UMs
- Short UMs
  - Faster
  - Low hit ratio
  - 14-mer → 40,261,150 (table size < 512 Mb)
  - 57%
- Long UMs
  - Slow
  - High hit ratio
  - 28-mer → 2,374,581,178 (table size > 10 Gb)
  - 97%
17Properties of UM
- A sequence that contains a unique marker is also a unique marker.
- One U14 → eight U21
(Figure labels: U14, U21)
18Multi-Layer Unique Markers
- First layer U14
- Second layer U21
- Third layer U28
- The first layer is also an index of the second layer.
- The second layer is also an index of the third layer.
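One possible reading of the layering, sketched under stated assumptions: a longer marker that already contains a shorter marker does not need to be stored, because the shorter marker locates it. Each layer then keeps only the markers that no earlier layer implies. This is an interpretation rather than the authors' data structure; the names `build_layers` and `locate`, and the toy lengths 3 and 4 standing in for 14, 21 and 28, are hypothetical.

```python
from collections import Counter


def unique_kmers(genome, k):
    """{k-mer: position} for k-mers occurring exactly once (assumed UM definition)."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def build_layers(genome, ks):
    """Build one table per length in `ks` (ascending), storing in each layer
    only the unique k-mers that contain no shorter unique k-mer."""
    layers = []        # (k, stored table) per layer
    full_tables = []   # (all unique k-mers, k), used only during construction
    for k in ks:
        full = unique_kmers(genome, k)
        stored = {}
        for kmer, pos in full.items():
            implied = any(
                kmer[j:j + prev_k] in prev
                for prev, prev_k in full_tables
                for j in range(k - prev_k + 1)
            )
            if not implied:
                stored[kmer] = pos
        layers.append((k, stored))
        full_tables.append((full, k))
    return layers


def locate(query, offset, layers):
    """Try the layers from shortest to longest at query[offset:].
    Returns (k, genome_position) of the first hit, or None."""
    for k, table in layers:
        kmer = query[offset:offset + k]
        if len(kmer) == k and kmer in table:
            return k, table[kmer]
    return None


toy_genome = "ACGTACGATCGT"          # hypothetical toy sequence
layers = build_layers(toy_genome, (3, 4))
print(layers[1])                     # only "ACGT" needs the second layer here
print(locate("ACGTAC", 0, layers))   # found in the second layer
print(locate("GTAC", 0, layers))     # found in the first layer
```

In this toy case the second layer shrinks from nine unique 4-mers to one, which mirrors the kind of saving the size comparison for the third layer suggests. A scan over all query offsets is still needed, since an implied longer marker is found through the shorter marker at an inner offset.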
19Multi-Layer Unique Markers
20Example of 2-layer unique markers
21Size of Multi-layer UM table
Layer          Multi-Layer UM28 Size   One-Layer UM28 Size
First Layer    3,634,095,330           3,634,095,330
Second Layer   40,266,215,370          40,976,086,518
Third Layer    3,367,681,380           42,833,743,434
22Search UMs on SNP
23Elongating
Next position
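The slide keeps only the keywords, so this sketch assumes that "elongating" means extending a matched marker base by base in both directions while the query and the genome agree, and that the search then resumes at the next query position after the extended match. The function name and the toy sequences are hypothetical.

```python
def elongate(query, genome, q, g, k):
    """Extend a seed match of length k (query[q:q+k] == genome[g:g+k])
    to the left and right while the bases agree.

    Returns (query_start, genome_start, length) of the extended match.
    """
    left = 0
    while (q - left - 1 >= 0 and g - left - 1 >= 0
           and query[q - left - 1] == genome[g - left - 1]):
        left += 1
    right = k
    while (q + right < len(query) and g + right < len(genome)
           and query[q + right] == genome[g + right]):
        right += 1
    return q - left, g - left, left + right


genome = "TTACGTACGATCGTGG"   # hypothetical toy sequences
query = "GTACGATT"
q_start, g_start, length = elongate(query, genome, q=1, g=5, k=3)
next_position = q_start + length   # where the marker search would resume
print(q_start, g_start, length, next_position)
```

Only exact extension is shown; tolerating a mismatch at the SNP position itself would need an extra rule that the transcript does not describe.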
25Unique marker clusters
26Maximal matched increasing unique markers
27Locating SNP on Genome
- Find matched unique markers
- Search on the 3-layer unique markers
- Elongating
- Find maximal matched increasing unique markers for each cluster
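The last two steps above can be sketched as follows, assuming that a cluster is a group of marker hits lying close together on the genome and that "maximal matched increasing" means the longest chain of hits whose query and genome positions both increase. The `max_gap` threshold, the function names and the toy hits are assumptions made for illustration.

```python
def cluster_hits(hits, max_gap):
    """Group (query_pos, genome_pos) hits whose genome positions lie
    within `max_gap` of the previous hit (hits sorted by genome position)."""
    clusters = []
    for hit in sorted(hits, key=lambda h: h[1]):
        if clusters and hit[1] - clusters[-1][-1][1] <= max_gap:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    return clusters


def longest_increasing_chain(cluster):
    """Longest chain of hits with strictly increasing query and genome
    positions, found with a simple O(n^2) dynamic programme."""
    hits = sorted(cluster)
    n = len(hits)
    if n == 0:
        return []
    best = [1] * n      # best[i]: length of the longest chain ending at hit i
    prev = [-1] * n     # back-pointers for reconstructing the chain
    for i in range(n):
        for j in range(i):
            if (hits[j][0] < hits[i][0] and hits[j][1] < hits[i][1]
                    and best[j] + 1 > best[i]):
                best[i] = best[j] + 1
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(hits[end])
        end = prev[end]
    return chain[::-1]


# Hypothetical hits: (position in SNP sequence, position on genome)
hits = [(0, 100), (5, 105), (10, 103), (15, 115), (3, 5000), (8, 5005)]
clusters = cluster_hits(hits, max_gap=1000)
chains = [longest_increasing_chain(c) for c in clusters]
print(chains)
```

The hit (10, 103) is dropped from the first cluster because it breaks the increasing order; the cluster with the longest chain would be the natural candidate location.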
29What is an Expressed Sequence Tag?
(Figure: a gene on the DNA coding strand with 5' UTR, Exon 1, Exon 2, Exon 3, Exon 4 (non-coding), introns and 3' UTR; the primary transcript; the mRNA; and the ESTs.)
30Performance of using multi-layer UM table
Method   Chr 10 (sec.)   Whole genome (sec.)
Sim4     66              1508
Mugup    3.6             5.2
ratio    18.3            290
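The ratio row is the Sim4 time divided by the Mugup time. A short check of that arithmetic, using only the timings from the table (the variable and function names are illustrative):

```python
# (Sim4 seconds, Mugup seconds) as listed in the table
timings = {"Chr 10": (66, 3.6), "Whole genome": (1508, 5.2)}


def speedup(sim4_sec, mugup_sec):
    """How many times faster Mugup is than Sim4 for one target."""
    return sim4_sec / mugup_sec


ratios = {name: round(speedup(*secs), 1) for name, secs in timings.items()}
print(ratios)
```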
- We have installed them on personal computers.
31Quality of aligning 103 human mRNAs in HMR195 by using Squall and Mugup
Outcome                                       Squall (NCBI human genome, Build 28)   Mugup (NCBI human genome, Build 31)
The exactly correct alignment is computed     83                                     101
Positions of one or two exons are incorrect   8                                      0
More than two exons are located incorrectly   6                                      0
No alignments are calculated                  6                                      2
32How about repetitive regions?
- 10
- First search on UM table,
- Then search on k-mer table
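The two-step search above can be sketched as a lookup with a fallback. It assumes that the UM table maps each unique k-mer to its single position and that the k-mer table maps every k-mer to all of its positions, so that sequences from repetitive regions, which have no unique marker, can still be located. All names and the toy sequence are hypothetical.

```python
from collections import defaultdict


def build_tables(genome, k):
    """Return (um_table, kmer_table) for `genome`.

    kmer_table: k-mer -> list of all positions
    um_table:   k-mer -> position, only for k-mers occurring exactly once
    """
    kmer_table = defaultdict(list)
    for i in range(len(genome) - k + 1):
        kmer_table[genome[i:i + k]].append(i)
    um_table = {kmer: pos[0] for kmer, pos in kmer_table.items() if len(pos) == 1}
    return um_table, dict(kmer_table)


def locate_with_fallback(kmer, um_table, kmer_table):
    """Search the UM table first; fall back to the k-mer table on a miss."""
    if kmer in um_table:
        return [um_table[kmer]]
    return list(kmer_table.get(kmer, []))


toy_genome = "ACGTACGTTACG"          # hypothetical toy sequence
um_table, kmer_table = build_tables(toy_genome, 3)
print(locate_with_fallback("GTA", um_table, kmer_table))   # unique marker hit
print(locate_with_fallback("ACG", um_table, kmer_table))   # repeated k-mer, fallback
```

The fallback returns several candidate positions, so a repeated k-mer still needs the surrounding hits to decide between them.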
33Future work and conclusion
34UM Spacing
Arabidopsis thaliana
35Homo sapiens
36Marker Frequency
37Low frequency high density index problem