Title: Genehackers
3Design Overview
- Virus detection system
- Receives nucleotide sequence from DNA sequencer
- Analyzes sequence and compares to known viral pathogens
- Informs user of matches
- Additional alerts for viruses considered particularly dangerous
4Design Use Case
- Aid worker in the field
- Needs access to large amount of data
- Has limited resources
- Far from base station
- Portable devices
- Must be small/light
- Do not want to sacrifice function and ability
5Design Specialization
- Compress virus database
- We will be compressing the virus database, allowing us to store more of it in less space.
- Will likely be using a combination of compression algorithms, and decompressing over multiple hardware components
6Design Exploration Architecture
- Store Compressed Data on the Flash
- Data compressed on the PC and stored to Flash with separate hardware/software
- Store compressed data on USB stick
- Data compressed on PC, communicate with USB stick using FPGA or ARM
- Allow in-system data compression
- Makes most sense with Flash
7Design Exploration Architecture
- Decompression
- Could be done on ARM, FPGA, or both
- ARM would decompress in software
- FPGA would decompress in hardware
- BLASTP
- Could be done in software and/or hardware
- Optimize certain algorithms in software
- Custom hardware in FPGA for fast calculations
8Design Exploration Architecture
9Design Exploration Architecture
10Design Exploration Architecture
- Mixed hardware/software decompression seems best
- In-system compression to Flash makes most sense (allows for large database and future updates)
- Loss of a team member forced us to simplify our goals
11Design Exploration Compression
- LZW
- Huffman
- More Complex/Effective Algorithms
12Design Exploration Compression
- LZW Algorithm
- LZW compression replaces strings of characters with single codes.
- Adds every new string of characters it sees to a table of strings.
- Compression occurs when a single code is output instead of a string of characters.
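The table-building behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names and the four-letter `ACGT` starting alphabet are assumptions, and codes are kept as plain integers rather than packed into bits.

```python
def lzw_compress(text, alphabet="ACGT"):
    """Return a list of integer codes; text must use only symbols in alphabet."""
    table = {ch: i for i, ch in enumerate(alphabet)}  # initial single-symbol strings
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in table:
            current = candidate               # keep growing the known string
        else:
            codes.append(table[current])      # one code replaces a whole string
            table[candidate] = len(table)     # remember the new string
            current = ch
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes, alphabet="ACGT"):
    """Rebuild the text; the decoder regrows the same table as the encoder."""
    if not codes:
        return ""
    table = {i: ch for i, ch in enumerate(alphabet)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # The code refers to the string that is being defined right now.
            entry = previous + previous[0]
        out.append(entry)
        table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(out)


sample = "ACGTACGTACGTACGT"
codes = lzw_compress(sample)
print(len(sample), "symbols ->", len(codes), "codes")
```

Note that the decoder never needs the table to be stored with the data, since it reconstructs it from the code stream.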
13Design Exploration Compression
- LZW Compression Perks
- Great for repetitive data where a certain chunk is repeated in multiple places
- Can achieve anywhere from 50% to 90% reduction in file size on standard text
- Very fast
- LZW Compression Dangers
- If data is not repeated often, the compressed file can be nearly as large as, or even larger than, the original
14Design Exploration Compression
- Huffman Algorithm
- Data is read from input file, and characters are stored in a tree organized by frequency of occurrence
- Requires two reads of the data
- one to construct the Huffman tree
- the other to map each character to its code and write the data.
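The two passes can be sketched as follows. This is an illustrative Python version under stated assumptions: the function names are hypothetical, the output is a string of '0'/'1' characters rather than packed bits, and the code table is returned alongside the bits because the decoder needs it.

```python
import heapq
from collections import Counter


def build_huffman_codes(data):
    """Pass 1: count symbol frequencies and derive a prefix code."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}        # a lone symbol still needs one bit
    # Heap entries: (count, unique tie-breaker, {symbol: code so far}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, low = heapq.heappop(heap)      # the two least frequent subtrees
        n2, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (n1 + n2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_encode(data):
    codes = build_huffman_codes(data)               # first read of the data
    bits = "".join(codes[sym] for sym in data)      # second read: map and write
    return codes, bits


def huffman_decode(codes, bits):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:                # prefix code: first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)


codes, bits = huffman_encode("AAAAAAAACCCGGT")
print(codes, len(bits), "bits")
```

Frequent symbols end up with short codes and rare ones with long codes, which is where the savings come from.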
15Design Exploration Compression
- Huffman Compression Perks
- Data where certain codes are substantially more frequent than others will display excellent compression
- Huffman Compression Dangers
- Resulting file may be larger than the original if most codes occur with similar frequency
- Requires a table of the codes to be stored with the data, which adds overhead
16Design Exploration Compression
- Other Algorithms
- Adaptive Huffman
- Requires only one pass of the data; can be done dynamically
- Tree is continually updated and reshaped each time a new character is read. This allows different parts of the file to have different encodings depending on character frequency: the code adapts to the data
17Design Exploration Compression
- Other Algorithms
- Deflate
- Algorithm used in gzip/other ZIP variants
- Combines LZ77 (a sliding-window relative of LZW) and Huffman encoding
- Compressed data set consists of a series of blocks, corresponding to successive blocks of input data.
- Each block consists of two parts: Huffman code trees to describe the compressed data, and the compressed data itself.
- Compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous window of input, 32 KB in Deflate), and pointers to duplicated strings.
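Deflate does not need to be written from scratch to be evaluated on the PC side. As a quick sketch, Python's standard `zlib` module (which wraps a Deflate stream in a small zlib header and checksum) can be used to try it on sample data. The helper name and the repeated test sequence below are made up for illustration and are not real viral data.

```python
import zlib


def deflate_round_trip(data: bytes, level: int = 9):
    """Compress with Deflate (zlib container) and decompress again."""
    compressed = zlib.compress(data, level)   # level 9 = best compression, slowest
    restored = zlib.decompress(compressed)
    return compressed, restored


# Hypothetical, highly repetitive nucleotide-like input.
sample = b"ATGGCGTACGTTAGC" * 200
compressed, restored = deflate_round_trip(sample)
print(len(sample), "bytes ->", len(compressed), "bytes")
```

Real sequence data is far less repetitive than this sample, so the ratio printed here says nothing about what the actual database would achieve.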
18Design Exploration Compression
- Compression Analysis
- Comparisons among the algorithms still need to be done.
- We know that Deflate can achieve about 10% better compression than LZW, but can be about 10X slower.
- Data needs to be analyzed to see what type of compression should be used as the base; then we can see if that initial compressed data can be compressed more efficiently with another algorithm on top.
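One possible first step in that data analysis is to measure the symbol entropy of the input. This is only a suggestion, not the team's stated method, and the function name is hypothetical. A value well below log2 of the alphabet size suggests frequency-based coding (Huffman) will pay off; a value near the maximum suggests any gain must come from repeated strings (LZW-style) instead.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(data):
    """Shannon entropy of the symbol frequencies, in bits per symbol."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform = "ACGT" * 10          # every symbol equally likely: 2 bits per symbol
skewed = "A" * 37 + "CGT"      # one dominant symbol: far fewer bits needed
print(entropy_bits_per_symbol(uniform))
print(entropy_bits_per_symbol(skewed))
```

Entropy only looks at single-symbol frequencies, so it ignores repeated substrings; it is a lower bound for Huffman-style coding, not for LZW.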
19Design Exploration Compression
- Compression Analysis
- Performing compression in hardware is a complicated task.
- Determination of how to abstract working with the huge tables in hardware. All of the compression algorithms involve storing data in tables. In hardware, this means sharing the memory on the FPGA between BLASTP, the database, and the tables.
- Also, need to look at ways of searching the tables: how to abstract hashing/quick search methods into hardware.
20Matlab Profile on BLASTP
OPTIMIZATION NEEDED!!!
21Searching Algorithm on BLASTP
- Divide and Conquer
- Cooperation between Hardware and Software
- Software looks for possible hits and extends them for further matching with the virus database
- Hardware looks for highest-scoring pairs
- Possible Improvement on BLASTP Choices
- Implementing Hash table
- Adaptive neighborhood word sizes
- Correlation
22Searching Algorithm on BLASTP
- Implementing Hash table
- Table 1: To determine the optimal Neighborhood Word Size for query sequences based on database
- Table 2: To determine the Neighborhood words from the computed neighborhood word size
- Implemented in DSP
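To show why a hash table speeds up the word lookup, here is a simplified Python sketch. It indexes every fixed-size word of a database sequence and then looks up the query's words in constant time each. It matches exact words only, whereas real BLASTP also admits similar "neighborhood" words scored with a substitution matrix; the function names and the short sequences are illustrative assumptions.

```python
from collections import defaultdict


def build_word_index(sequence, word_size):
    """Hash table: word -> list of positions where it occurs in the sequence."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def find_exact_seeds(query, db_index, word_size):
    """Return (query_pos, db_pos) pairs where a query word occurs in the database."""
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for db_pos in db_index.get(word, []):   # one hash lookup per query word
            seeds.append((q_pos, db_pos))
    return seeds


database = "MKVLAAGIVGLK"      # made-up protein fragment
index = build_word_index(database, word_size=3)
print(find_exact_seeds("AAGIV", index, word_size=3))
# prints [(0, 4), (1, 5), (2, 6)]
```

A larger word size produces fewer, more specific seeds, which is the trade-off the neighborhood word size controls.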
23Searching Algorithm on BLASTP
- Adaptive Neighborhood Word Size
- Since the Neighborhood Word Size controls the number of loop iterations, having an adaptive word size would speed up matching
- Do statistical analyses on individual scores of the members of the sequence to determine a reasonable neighborhood word size for the query sequence
- Store the neighborhood sizes in a hash table; the key will be an individual character or a sequence of characters
- Implemented in NIOS (C)
24Searching Algorithm on BLASTP
- Correlation
- To improve find_seeds and find_hsps
- Convert and arrange the query sequence and Database smartly into 1-0 matrices
- Correlate the 1-0 matrix templates (masks) with the database and get the highest scored matches
- Implemented on FPGA (Verilog) for better/faster performance
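The correlation idea can be illustrated in software before committing it to Verilog. In this sketch, each sequence becomes a 1-0 matrix with one row per symbol, and the query mask is slid along the database; the score at each offset is the number of positions where both matrices hold a 1. The function names and sequences are hypothetical, and the scoring is plain match counting rather than a substitution-matrix score.

```python
def one_hot(sequence, alphabet):
    """1-0 matrix: one row per symbol, one column per sequence position."""
    return [[1 if ch == sym else 0 for ch in sequence] for sym in alphabet]


def correlate(query, database, alphabet):
    """Slide the query mask along the database; return one score per offset."""
    q_rows = one_hot(query, alphabet)
    d_rows = one_hot(database, alphabet)
    width = len(query)
    scores = []
    for offset in range(len(database) - width + 1):
        score = 0
        for q_row, d_row in zip(q_rows, d_rows):
            window = d_row[offset:offset + width]
            score += sum(q * d for q, d in zip(q_row, window))  # AND, then count
        scores.append(score)
    return scores


database = "MKVLAAGIVGLK"      # made-up protein fragment
query = "GIV"
alphabet = sorted(set(database + query))
scores = correlate(query, database, alphabet)
best = scores.index(max(scores))
print(best, scores[best])
```

Each multiply-and-sum reduces to AND gates feeding an adder, which is why this formulation maps naturally onto an FPGA.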
25Implementation Tasks
- Communication between all devices
- Ensuring that we can send a message through all required paths
- Using four-phase handshake to accommodate clock differences between devices
- Analog/digital conversion on DSP
- DSP will be converting analog nucleotide signals into digital codons
- Will involve translating to five-bit codon
- BLASTP algorithm on the FPGA
- Examine different search algorithms using MATLAB to obtain the fastest matching search algorithm
- Implementing search algorithm in software and hardware
- Aiming to speed up certain calculations by using hardware, and optimizing the more complicated functions through software
27Implementation Tasks
- Reading in database to PC
- Proper identification through parsing
- Compressing Data on the PC
- Will apply LZW encoding on the codons
- Will encode result with Huffman compression
- Huffman Decompression on ARM
- Will be written in C
28Implementation Tasks
- LZW Decompression on FPGA
- Will be written in Verilog
- Testing, Integration
- Making sure everything works
29Verification Testing
- DSP Verification
- Verification by comparison with Matlab outputs
- FPGA BLASTP Verification
- Will double check against Matlab searches
- Compression Testing
- Will check against written C code
- Each team member's code verified by someone else
30Division of Labor
- Chris Thomas
- General architecture concerns
- Communication with flash
- Huffman decoding on the ARM
- Charles-Christopher Onyeama
- LZW decoding on the FPGA
- LZW and Huffman compression
- Assist BLASTP
- Mark Pimentel
- Matlab code optimization
- Translating analog signals to digital on the DSP
- BLASTP implementation on the FPGA
31Demo Deliverables
- Demo 1
- Some communication between processors
- (Chris, Mark)
- Compression analyses and sample code
- (Charles)
32Demo Deliverables
- Demo 2
- Huffman compression/decompression completed
- (Charles, Chris)
- Analog signals processed and translated
- (Mark)
- BLASTP implementation underway
- (Mark)
- Communication with storage/Storage management
- (Chris,Charles)
33Demo Deliverables
- Demo 3
- Everything completed and working.
- (Team GeneHackers!)
34Updated Schedule