72x48 Poster Template - PowerPoint PPT Presentation

About This Presentation
Title:

72x48 Poster Template

Description:

SSAHA: Search with Speed Nick Altemose, Kelvin Gu, Tiffany Lin, Kevin Tao, Owen Astrachan Duke University Focus Program: The Genome Revolution and Its Impact on Society – PowerPoint PPT presentation

Number of Views:56
Avg rating:3.0/5.0
Slides: 2
Provided by: A681
Category:

less

Transcript and Presenter's Notes

Title: 72x48 Poster Template


1
SSAHA Search with Speed Nick Altemose, Kelvin
Gu, Tiffany Lin, Kevin Tao, Owen Astrachan Duke
University Focus Program The Genome Revolution
and Its Impact on Society
How Does a Hash Table Work?
Introduction
APT k-tuple Coordinates in a Hash Table
SSAHA Step-by-Step
  • The Human Genome contains an immense amount of
    information 3 Billion base pairs, 3 Gigabytes of
    storage space. In fact, if you were to read it
    aloud continuously at the alarming rate of 10
    base pairs per second, it would take you 9.5
    years to get through it all.
  • When you want to search for a sequence in the
    genome, you must use smart search methods like
    those used by internet search engines. If you
    were to search through the entire genome one base
    pair at a time, it could take hours to get back
    any results, even with the best computers.
  • SSAHA is a search method that was developed with
    one goal in mind speed.
  • SSAHA takes a database of data, like the 3
    Gigabase human genome reference sequence, and
    organizes it into a fast searchable index, known
    as a hash table.
  • This organization is the key to SSAHAs speed,
    and the origin of its name Sequence Search and
    Alignment by Hashing Algorithm.
  • The hash table is like a dictionary to the
    English language, or an index to an encyclopedia.
    It allows for fast lookup of organized
    information. After making this hash table, it is
    then very easy to search for a given sequence.
  • Problem Statement
  • In this APT, you will generate a string similar
    to a single hash table entry. Given an array of
    DNA strings (our database) and a dimer string
    (our 2-tuple), output a string containing the
    positions of every occurrence of that 2-tuple in
    the database (not limited to a single reading
    frame). Positions will be in the form of
    (0-initiated array element index, 0-initiated
    character index) and should be ordered ascending
    numerically first by array element index, then by
    character index. Note that these positions will
    be 0-initiated and not 1-initiated like actual
    SSAHA positions.
  • Notes
  • Positions should be space-delimited in the
    string. For example, an output string containing
    5 positions should have this format
  • (0,2) (1,3) (2,5) (5,3) (6,7)
  • Constraints
  • database can contain strings of any length,
    including 0 and 1.
  • All input strings will contain only the
    characters A, G, C,or T in any
    combination.
  • twotuple will always have exactly 2 characters.
  • Examples
  • 1. ATTCGT, TACGG, GAATCA, G
  • CG
  • Returns (0, 3) (1, 2)
  • 2. TTCGT, AAATC, ATATATTTA, TCGAAATG
  • AT
  • Returns (1, 2) (2, 0) (2, 2) (2, 4)
    (3, 6)
  • 3. TTTTT, ATTCG
  • TT
  • Returns (0,0) (0, 1) (0, 2) (0, 3)
    (1, 1)

1) Break sequences in database into
k-tuples. 2) Make a hash table. Ideally,
each cell in the hash table is allotted for one
particular unique k-tuple. Each cell
contains the locations where that particular
unique k-tuple occurs. The location specifies the
sequence where the k-tuple occurs, and the index
in the sequence where the k-tuple occurs. 3) The
query sequence is also broken into
k-tuples. 4) These query k-tuples are
matched to k-tuples in the database. Matches are
noted. 5) SSAHA returns search results,
listing sequences from the database that best
match the query sequence. The degree of
similarity is based on the amount of matching
k-tuples, the successiveness in which they match
(this involves pathingnot explained here) etc.
and attempts to account for insertions, deletions
and SNPs.
  • The hash table stores positions at which
    different short sequences, called k-tuples, occur
    in the database. A position looks like this
    (index, offset), where index identifies which DNA
    strand contains the k-tuple, and offset
    identifies at what position in the DNA strand the
    k-tuple begins.
  • In lay terms, SSAHA likes to keep things
    organized, like an anal retentive woman who puts
    everything in labeled Tupperware, which she then
    organizes into labeled drawers. The index is like
    the labeled drawer, and the offset is like the
    labeled Tupperware. If you tell her to get
    something from the third drawer, fourth
    Tupperware, she can quickly and easily locate
    what youre looking for. Or, if you ask her where
    you can find something, she can easily tell you
    in which drawer and Tupperware youll find it.
  • You only have to make a hash table once for a
    given database, kind of like you only have to
    tidy your room up once, then you can find things
    extremely easily thereafter.

Visual of Hash Table Search
How Does SSAHA Code for Base Pairs?
  • SSAHA stores individual nucleotide base
    information as 2-bits of information per base
  • A is stored as 00
  • C is stored as 01
  • G is stored as 10
  • T is stored as 11
  • This allows for fast and efficient storage and
    processing, using less memory and less processor
    power
  • However, the drawback is that only 4 characters
    are possible, and sometimes real DNA data has
    gaps and ambiguous characters
  • Other search engines treat these ambiguities as
    separate letters, like R or N (N means it could
    be any base)
  • SSAHA reverts all ambiguities to A, since AAAAA
    is the most common word in the genome, and the
    search can easily recognize it as uninformative

Contact Information
Nick Altemose, Student, Duke University Trinity
College of Arts and Sciences email
nick.altemose_at_duke.edu Owen Astrachan,
Professor, Duke University Department of Computer
Science email ola_at_cs.duke.edu Kelvin
Gu, Student, Duke University Trinity College of
Arts and Sciences email
kelvin.gu_at_duke.edu Tiffany Lin, Student, Duke
University Trinity College of Arts and Sciences
email tiffany.lin_at_duke.edu Kevin Tao,
Student, Duke University Pratt School of
Engineering email kevin.tao_at_duke.edu
Works Referenced
Genome Size Data from Department of Energy Website http//www.ornl.gov/sci/techresources/Human_Genome/faq/faqs1.shtml SSAHA data from Genome Research 2001 volume 11, pages 1725-1729 SSAHA A Fast Search Method for Large DNA Databases
Write a Comment
User Comments (0)
About PowerShow.com