Indexing genomic sequences - PowerPoint PPT Presentation

Provided by: frh2

1
Indexing genomic sequences

2
Outline
  • Introduction
  • Unique markers
  • Multi-layer unique markers
  • Locating SNP on genome
  • Aligning EST to genome
  • Future work and conclusion

3
Introduction
4
Why indexing ?
  • Long genomic sequences

5
How to locate homologous sequences?
6
How to locate a SNP sequence?
7
K-mer table
8
K-mer table
  • Small k (14) -> too many combinations
  • Large k (21) -> huge table
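
A minimal sketch of a k-mer table, assuming it maps each k-mer to the list of positions where it occurs. The slides do not show the actual data structure, so `build_kmer_table` and `possible_kmers` are illustrative names. The last two lines print the number of possible k-mers, 4^k, for the two values of k on this slide.

```python
from collections import defaultdict

def build_kmer_table(seq, k):
    """Map every k-mer in seq to the list of its 0-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return dict(table)

def possible_kmers(k):
    """Number of distinct k-mers over the alphabet A, C, G, T."""
    return 4 ** k

toy = "ACGTACGTTA"                 # toy sequence, not real genomic data
table = build_kmer_table(toy, 4)
print(table["ACGT"])               # [0, 4]
print(possible_kmers(14))          # 268435456
print(possible_kmers(21))          # 4398046511104
```

The table for k = 21 has roughly 16,000 times as many possible keys as the one for k = 14.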

9
What is a unique marker?
10
Problem
11
UniMarker (UM) method
12
(No Transcript)
13
Numbers of UMs with different lengths
Mer   No. of UMs       Density (bp)
14    40,261,150       150
21    2,194,639,732    2.7
28    2,374,581,178    2.5
14
(No Transcript)
15
Locating SNP on genome
16
Trade-off between short and long UMs
  • Short UMs
    • Faster
    • Low hit ratio
    • 14-mer -> 40,261,150 (table size < 512Mb)
    • 57%
  • Long UMs
    • Slow
    • High hit ratio
    • 28-mer -> 2,374,581,178 (table size > 10Gb)
    • 97%

17
Properties of UM
  • A sequence that contains a unique marker is also a
    unique marker.
  • One U14 -> eight U21
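
The two properties above can be checked on a toy sequence. This is a sketch under assumptions: a unique marker is taken to be a k-mer that occurs exactly once on the forward strand of a random test sequence, and `unique_markers` and `containing_windows` are illustrative names that do not come from the slides.

```python
import random
from collections import Counter

def unique_markers(seq, k):
    """Return {k-mer: position} for k-mers occurring exactly once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {seq[i:i + k]: i
            for i in range(len(seq) - k + 1)
            if counts[seq[i:i + k]] == 1}

def containing_windows(seq, pos, short_k, long_k):
    """All length-long_k substrings of seq that fully contain seq[pos:pos+short_k]."""
    first = max(0, pos - (long_k - short_k))
    last = min(pos, len(seq) - long_k)
    return [seq[s:s + long_k] for s in range(first, last + 1)]

rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(2000))  # random toy "genome"
u14 = unique_markers(genome, 14)
u21 = unique_markers(genome, 21)

# Pick a unique 14-mer far enough from both ends to have all its 21-mer windows.
marker, pos = next((m, p) for m, p in u14.items() if 7 <= p <= len(genome) - 21)
windows = containing_windows(genome, pos, 14, 21)
print(len(windows))                    # 8
print(all(w in u21 for w in windows))  # True
```

A 14-mer can start at any of 21 - 14 + 1 = 8 offsets inside a 21-mer, which is where the count of eight comes from.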

U14
U21
18
Multi-Layer Unique Markers
  • First layer U14
  • Second layer U21
  • Third layer U28
  • The first layer is also an index of the second
    layer.
  • The second layer is also an index of the third
    layer.
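
One possible reading of "the first layer is also an index of the second layer" is that a longer marker needs to be stored only if no shorter marker inside it is already stored. The sketch below follows that assumption. It is not taken from the slides, and `build_layers` and `locate` are hypothetical names. Short marker lengths (3, 5, 7) stand in for 14, 21 and 28 so the toy sequence contains repeats.

```python
import random
from collections import Counter

def unique_kmers(seq, k):
    """Return {k-mer: position} for k-mers occurring exactly once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {seq[i:i + k]: i
            for i in range(len(seq) - k + 1)
            if counts[seq[i:i + k]] == 1}

def locate(query, layers, ks):
    """Return the implied start position of query, or None if no marker hits.

    Layers are tried from shortest to longest marker length.
    """
    for k, layer in zip(ks, layers):
        for off in range(len(query) - k + 1):
            pos = layer.get(query[off:off + k])
            if pos is not None:
                return pos - off
    return None

def build_layers(seq, ks):
    """Layer j keeps only unique ks[j]-mers not already found via earlier layers."""
    layers = []
    for k in ks:
        layer = {kmer: pos
                 for kmer, pos in unique_kmers(seq, k).items()
                 if locate(kmer, layers, ks) is None}
        layers.append(layer)
    return layers

rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(300))
ks = (3, 5, 7)
layers = build_layers(genome, ks)
one_layer = unique_kmers(genome, ks[-1])
print([len(layer) for layer in layers], len(one_layer))
```

Under this assumption every marker in the one-layer table of the longest length can still be located through the layers, while the last layer itself stores only a subset of them.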

19
Multi-Layer Unique Markers
20
Example of 2-layer unique markers
21
Size of Multi-layer UM table
               Multi-Layer UM28 Size   One-Layer UM28 Size
First Layer    3,634,095,330           3,634,095,330
Second Layer   40,266,215,370          40,976,086,518
Third Layer    3,367,681,380           42,833,743,434
22
Search UMs on SNP
23
Elongating
Next position
24
(No Transcript)
25
Unique marker clusters
26
Maximal matched increasing unique markers
27
Locating SNP on Genome
  • Find matched unique markers
  • Search on the 3-layer unique markers
  • Elongating
  • Find maximal matched increasing unique markers
    for each cluster
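
The last two steps can be sketched as follows, assuming that "maximal matched increasing unique markers" means the longest chain of matches whose positions increase both on the SNP sequence and on the genome (a longest increasing subsequence). The clustering rule, the `max_gap` parameter, and the function names are assumptions made for illustration. The match list is invented toy data.

```python
from bisect import bisect_left

def cluster_matches(matches, max_gap):
    """Group (query_pos, genome_pos) matches by genome proximity."""
    clusters = []
    for m in sorted(matches, key=lambda m: m[1]):
        if clusters and m[1] - clusters[-1][-1][1] <= max_gap:
            clusters[-1].append(m)
        else:
            clusters.append([m])
    return clusters

def longest_increasing_chain(cluster):
    """Longest chain with strictly increasing query and genome positions."""
    # Ties on query position are ordered by decreasing genome position,
    # so two matches at the same query position never share a chain.
    ordered = sorted(cluster, key=lambda m: (m[0], -m[1]))
    tails, tail_idx = [], []          # best chain end for each chain length
    prev = [None] * len(ordered)      # back-pointers for reconstruction
    for i, (_, g) in enumerate(ordered):
        j = bisect_left(tails, g)
        if j == len(tails):
            tails.append(g)
            tail_idx.append(i)
        else:
            tails[j] = g
            tail_idx[j] = i
        prev[i] = tail_idx[j - 1] if j > 0 else None
    chain = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]

matches = [(0, 1000), (10, 1010), (20, 5000), (30, 1030), (40, 1025), (50, 1050)]
clusters = cluster_matches(matches, max_gap=100)
chain = longest_increasing_chain(clusters[0])
print(len(clusters))   # 2
print(chain)           # [(0, 1000), (10, 1010), (40, 1025), (50, 1050)]
```

The match at genome position 5000 forms its own cluster, and within the first cluster one of the two out-of-order matches (1030, 1025) is dropped from the chain.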

28
(No Transcript)
29
What is an Expressed Sequence Tag?
[Figure: a gene on the DNA coding strand (5' UTR, Exon 1, Exon 2, Exon 3, Exon 4 (non-coding), introns, 3' UTR), its primary transcript, the mRNA, and the ESTs derived from it]
30
Performance of using multi-layer UM table
        Chr 10 (sec.)   Whole genome (sec.)
Sim4    66              1508
Mugup   3.6             5.2
Ratio   18.3            290
  • We have installed them on personal computers.

31
Quality of aligning 103 human mRNAs in HMR195 using Squall and Mugup
Squall: NCBI human genome (Build 28); Mugup: NCBI human genome (Build 31)
Result                                        Squall   Mugup
Exactly correct alignment computed            83       101
Positions of one or two exons incorrect       8        0
More than two exons located incorrectly       6        0
No alignment calculated                       6        2
32
How about repetitive regions?
  • 10
  • First search on UM table,
  • Then search on k-mer table
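
A minimal sketch of this two-step search, assuming the UM table maps each unique marker to its single position and the k-mer table maps each k-mer to all of its positions. The function name `locate_with_fallback` and the toy tables are illustrative and do not come from the slides.

```python
from collections import defaultdict

def build_tables(seq, k):
    """Return (um_table, kmer_table) for one marker length k."""
    kmer_table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer_table[seq[i:i + k]].append(i)
    um_table = {kmer: hits[0] for kmer, hits in kmer_table.items() if len(hits) == 1}
    return um_table, dict(kmer_table)

def locate_with_fallback(query, um_table, kmer_table, k):
    """Candidate start positions of query: unique markers first, k-mers second."""
    for off in range(len(query) - k + 1):
        pos = um_table.get(query[off:off + k])
        if pos is not None:
            return [pos - off]            # one unambiguous location
    candidates = set()
    for off in range(len(query) - k + 1):
        for pos in kmer_table.get(query[off:off + k], []):
            candidates.add(pos - off)     # repetitive region: several candidates
    return sorted(candidates)

genome = "ACACACGGT"                      # toy sequence with a short repeat
um_table, kmer_table = build_tables(genome, 3)
print(locate_with_fallback("CGGT", um_table, kmer_table, 3))   # [5]
print(locate_with_fallback("ACAC", um_table, kmer_table, 3))   # [0, 2]
```

The second query lies entirely inside the repeat, so no unique marker hits and the k-mer table returns both candidate positions.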

33
Future work and conclusion
  • Speed
  • Space
  • Accuracy

34
UM Spacing
Arabidopsis thaliana
35
Homo sapiens
36
Marker Frequency
37
Low frequency high density index problem