2
Handling Large Amounts of Biological Data
Session id: 40364

Xiaobin Guan, Ph.D.
Senior Oracle DBA/Bioinformatician
National Institutes of Health

3
Introduction

Bioinformatics
In Silico
Large Database
DNA Sequence
Using CLOB
Using Partition Tables

4
NISC Database Environment

NIH Intramural Sequencing Center
Established in 1997
A multi-disciplinary genomics facility
Large-scale DNA sequencing
Applied Biosystems (ABI) DNA Analyzers
Produce 10,000 DNA sequences per day

5
NISC Pipeline

The Laboratory Information Management System (LIMS).
Sequencing data is moved from each PC to a partition (/area1) on our main Unix server.
A Perl script then validates the trace name and run folder name and checks for duplicates. The data is then moved to another partition (/area2).
Phred is run on each trace file to trim the low-quality bases at the beginning and end of each read.
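
The end-trimming step can be illustrated with a minimal Python sketch. It assumes a read is available as a base string plus one integer quality score per base, and it uses a plain threshold scan from each end. This is a simplification for illustration, not Phred's own trimming algorithm; the function name and the cutoff of 20 are hypothetical.

```python
def trim_low_quality_ends(bases, quals, min_q=20):
    """Return (bases, quals) with low-quality ends removed.

    Scans inward from each end and stops at the first base whose quality
    is at least min_q. Low-quality bases in the interior are kept.
    This is a simple threshold trim, not Phred's own algorithm.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    start = 0
    while start < len(quals) and quals[start] < min_q:
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


# Hypothetical read: two poor bases at each end, one poor base inside.
read = "nnacgtacgtnn"
scores = [5, 8, 30, 35, 40, 12, 38, 36, 33, 25, 9, 4]
trimmed_bases, trimmed_quals = trim_low_quality_ends(read, scores)
print(trimmed_bases)  # acgtacgt
```

Only the ends are trimmed, so the interior base with score 12 stays in the read.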

6
NISC Pipeline

Vector screening is then performed on each read, and the regions where vector is found are masked out.
Contaminant checking uses BLAST to screen for contaminants. The information about contamination is then stored in the database.
A QC report is generated to show the quality and other information.
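
As a rough illustration of what masking means, the Python sketch below replaces every exact occurrence of a vector fragment in a read with a mask character. Real vector screening relies on sequence alignment rather than exact string matching; the function name, the mask character, and the example sequences are all hypothetical.

```python
def mask_vector(read, vector, mask_char="x"):
    """Replace every exact, case-insensitive occurrence of vector in read
    with mask_char, keeping the read length unchanged."""
    if not vector:
        return read
    read_lower = read.lower()
    vector_lower = vector.lower()
    masked = list(read)
    pos = read_lower.find(vector_lower)
    while pos != -1:
        for i in range(pos, pos + len(vector)):
            masked[i] = mask_char
        # Continue one position later so overlapping hits are masked too.
        pos = read_lower.find(vector_lower, pos + 1)
    return "".join(masked)


read = "ggatccacgtacgtttgaattc"
print(mask_vector(read, "ggatcc"))  # xxxxxxacgtacgtttgaattc
```

Masking rather than deleting keeps base positions stable, so coordinates recorded for the read stay valid.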

7
Why CLOB?

To store DNA sequences
Sequences are character strings made up of A, C, G, and T
Sequence length can be above or below 4 KB, so a 4,000-byte VARCHAR2 column is not always large enough

8
LOBs vs. LONG/LONG RAW

                                 LONG, LONG RAW  LOBs
Number of LOB columns per table  1               Multiple
LOB capacity                     Up to 2 GB      Up to 4 GB
Data stored out-of-line          No              Yes
Object type support              No              Yes
Random piece-wise access         No              Yes

9
A Simple Create Table Statement

CREATE TABLE dna_sequence1
  (base_id       NUMBER(6),
   base_sequence CLOB)
TABLESPACE example;

10
Specify the Segment Name, and LOB Storage

CREATE TABLE dna_sequence2
  (base_id       NUMBER(6),
   base_sequence CLOB)
LOB (base_sequence) STORE AS dna_seq_lob
  (TABLESPACE lob_seg_ts)
TABLESPACE example;

11
Specify the Index Name and Index Storage

CREATE TABLE dna_sequence3
  (base_id       NUMBER(6),
   base_sequence CLOB)
LOB (base_sequence) STORE AS dna_seq_lob1
  (TABLESPACE lob_seg_ts
   INDEX dna_seq_clob_idx (TABLESPACE nisc_index))
TABLESPACE example;

12
Check Segment and Index Name

SELECT table_name, column_name,
       segment_name, index_name
FROM user_lobs;

TABLE_NAME     COLUMN_NAME    SEGMENT_NAME             INDEX_NAME
-------------  -------------  -----------------------  ----------------------
DNA_SEQUENCE1  BASE_SEQUENCE  SYS_LOB0000040338C00002  SYS_IL0000040338C00002
DNA_SEQUENCE2  BASE_SEQUENCE  DNA_SEQ_LOB              SYS_IL0000040341C00002
DNA_SEQUENCE3  BASE_SEQUENCE  DNA_SEQ_LOB1             DNA_SEQ_CLOB_IDX

13
Query the Table

SELECT *
FROM dna_sequence
WHERE base_id = 20;
20 actcggtactgggacccatgtggtggatttctatccttgaagctgc
acgtaaagacccggtttttgcgggtatctctgataatgccaccgctcaaa
tcgctacagcgtgggcaagtgcactggctgactacgccgcagcacataaa
tctatgccgcgtccggaaattctggcctcctgccaccagacgctggaaaa
ctgcctgatagagtccacccgcaatagcatggatgccactaataaagcga
tgctggaatctgtcgcagcagagatgatgagcgtttctgacggtgttatg
cgtctgcctttattcctcgcgatgatcctgcctgttcagttgggggcagc
taccgctgatgcgtgtaccttcattccggttacgcgtgaccagtccgaca
tctatgaagtctttaacgtggcaggttcatcttttggttcttatgctgct
ggtgatgttctggacatgcaatccgtcggtgtgtacagccagttacgtcg
ccgctatgtgctggtggcaagctccgatggcaccagcaaaaccgcaacct
tcaagatggaagacttcgaaggccagaatgtaccaatccgaaaaggtcgc
actaacatctacgttaaccgtattaagtctgttgttgataacggttccgg
cagcctacttcactcgtttactaatgctgctggtgagcaaatcactgtta
cctgctctctgaactacaacattggtcagattgccctgtcgttctccaaa
gcgccggataaaagcactgagatcgcaattgagacggaaatcaatattga
agccggctctgagctgatcccgctgatcacca

14
In-line or Out-of-line Storage

In-line: ENABLE STORAGE IN ROW
Out-of-line: DISABLE STORAGE IN ROW
Tablespaces

15
CLOB Usage

Table structure
This table contains two CLOB columns:
BASECALLS stores DNA sequences
BASEQUALS stores the quality score of each sequence
The length of both fields varies from a few hundred to as many as 6,000 characters

16
Test Protocol

Create tablespaces
Four for the tables and two for LOB storage
Create four test tables
T1, in-line, one tablespace
T2, in-line, two tablespaces
T3, out-of-line, one tablespace
T4, out-of-line, two tablespaces

17
Test Table 1 (T1)

CREATE TABLE T1
  (CALL_ID   NUMBER(10) NOT NULL,
   TRACE_ID  NUMBER(10) NOT NULL,
   BASECALLS CLOB NOT NULL,
   BASEQUALS CLOB)
TABLESPACE "TEST_CALL1"
LOB ("BASECALLS") STORE AS (TABLESPACE "TEST_CALL1"
  ENABLE STORAGE IN ROW)
LOB ("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL1"
  ENABLE STORAGE IN ROW);

18
Test Table 2 (T2)

CREATE TABLE T2
  (CALL_ID   NUMBER(10) NOT NULL,
   TRACE_ID  NUMBER(10) NOT NULL,
   BASECALLS CLOB NOT NULL,
   BASEQUALS CLOB)
TABLESPACE "TEST_CALL2"
LOB ("BASECALLS") STORE AS (TABLESPACE "TEST_CALL_LOB1"
  ENABLE STORAGE IN ROW)
LOB ("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL_LOB1"
  ENABLE STORAGE IN ROW);

19
Test Table 3 (T3)

CREATE TABLE T3
  (CALL_ID   NUMBER(10) NOT NULL,
   TRACE_ID  NUMBER(10) NOT NULL,
   BASECALLS CLOB NOT NULL,
   BASEQUALS CLOB)
TABLESPACE "TEST_CALL3"
LOB ("BASECALLS") STORE AS (TABLESPACE "TEST_CALL3"
  DISABLE STORAGE IN ROW)
LOB ("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL3"
  DISABLE STORAGE IN ROW);

20
Test Table 4 (T4)

CREATE TABLE T4
  (CALL_ID   NUMBER(10) NOT NULL,
   TRACE_ID  NUMBER(10) NOT NULL,
   BASECALLS CLOB NOT NULL,
   BASEQUALS CLOB)
TABLESPACE "TEST_CALL4"
LOB ("BASECALLS") STORE AS (TABLESPACE "TEST_CALL_LOB2"
  DISABLE STORAGE IN ROW)
LOB ("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL_LOB2"
  DISABLE STORAGE IN ROW);

21
Results

In-line/out-of-line                     IN-LINE  IN-LINE  OUT-OF-LINE  OUT-OF-LINE
Tablespace usage                        One TS   Two TS   One TS       Two TS
Table name                              T1       T2       T3           T4
Initial space used (MB)                 6        7(25)    6            7(25)
Space used after 10000 row insert (MB)  46       47(425)  162          163(2161)
Total insert time (sec)                 10       11       47           48
Ranking                                 1        2        3            4

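
To make the comparison concrete, the short Python calculation below works out the space growth per inserted row and the insert-time ratio for the two single-tablespace tables, T1 (in-line) and T3 (out-of-line). It uses only the figures in the results above. T2 and T4 are left out because the slide does not explain the figures in parentheses. The conversion 1 MB = 1024 KB is an assumption.

```python
ROWS_INSERTED = 10_000


def growth_kb_per_row(initial_mb, final_mb, rows=ROWS_INSERTED):
    """Average space growth per inserted row, in KB (assumes 1 MB = 1024 KB)."""
    return (final_mb - initial_mb) * 1024 / rows


# Space figures (MB) and insert times (sec) for T1 and T3 from the results.
t1_kb = growth_kb_per_row(6, 46)
t3_kb = growth_kb_per_row(6, 162)
space_ratio = t3_kb / t1_kb
time_ratio = 47 / 10

print(f"T1 in-line:     {t1_kb:.1f} KB per row")
print(f"T3 out-of-line: {t3_kb:.1f} KB per row")
print(f"Out-of-line used {space_ratio:.1f}x the space and {time_ratio:.1f}x the insert time")
```

On these figures, out-of-line storage grew by roughly 16 KB per row against roughly 4 KB per row in-line, and the insert took 4.7 times as long.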
22
DBMS_LOB Package
23
Functions/Procedures to Read or Return LOB Values

Subprogram      F/P  Description
COMPARE()       F    Compares the value of two LOBs
GETCHUNKSIZE()  F    Gets the chunk size used when reading and writing. This only works on internal LOBs and does not apply to external LOBs (BFILEs).
GETLENGTH()     F    Gets the length of the LOB value
INSTR()         F    Returns the matching position of the nth occurrence of the pattern in the LOB
READ()          P    Reads data from the LOB starting at the specified offset
SUBSTR()        F    Returns part of the LOB value starting at the specified offset

24
Functions/Procedures to Write LOB Values

Subprogram          F/P  Description
APPEND()            P    Appends the LOB value to another LOB
COPY()              P    Copies all or part of a LOB to another LOB
ERASE()             P    Erases part of a LOB, starting at a specified offset
LOADFROMFILE()      P    Loads BFILE data into an internal LOB
LOADCLOBFROMFILE()  P    Loads character data from a file into a LOB
LOADBLOBFROMFILE()  P    Loads binary data from a file into a LOB
TRIM()              P    Trims the LOB value to the specified shorter length
WRITE()             P    Writes data to the LOB at a specified offset
WRITEAPPEND()       P    Writes data to the end of the LOB

25
Functions/Procedures for BFILEs

Subprogram      F/P  Description
FILECLOSE()     P    Closes the file. Use CLOSE() instead.
FILECLOSEALL()  P    Closes all previously opened files
FILEEXISTS()    F    Checks if the file exists on the server
FILEGETNAME()   P    Gets the directory alias and file name
FILEISOPEN()    F    Checks if the file was opened using the input BFILE locators. Use ISOPEN() instead.
FILEOPEN()      P    Opens a file. Use OPEN() instead.

26
Call Functions in SQL

SELECT dbms_lob.getlength(base_sequence)
FROM dna_sequence1;

DBMS_LOB.GETLENGTH(BASE_SEQUENCE)
---------------------------------
878
1269
893
872
961
807
806
808
833
837
10 rows selected.

27
Call procedures in PL/SQL

DECLARE
  v_dna_seq    CLOB;
  v_seq_amt    BINARY_INTEGER := 10;
  v_seq_buffer VARCHAR2(10);
BEGIN
  v_dna_seq := 'atctcgagtagctgaagctccaatgntggtggaattcacgagttgctt';
  -- Read v_seq_amt characters, starting at offset 1, into the buffer
  DBMS_LOB.READ(v_dna_seq, v_seq_amt, 1, v_seq_buffer);
  DBMS_OUTPUT.PUT_LINE('The first 10 bases for this DNA sequence are ' || v_seq_buffer);
END;
/
The first 10 bases for this DNA sequence are atctcgagta
PL/SQL procedure successfully completed.

28
Substr vs. dbms_lob.substr

SUBSTR(the_string, from_character, number_of_characters)
DBMS_LOB.SUBSTR(the_string, number_of_characters, from_character)

29
Substr vs. dbms_lob.substr

CREATE TABLE substring (str VARCHAR2(20), lob CLOB);

INSERT INTO substring
VALUES ('Oracle10G', 'Oracle10G');

SELECT substr(str, 7, 3),
       dbms_lob.substr(lob, 7, 3) lob
FROM substring;
ow03@NISCDEV.NHGRI.NIH.GOV>
SUB LOB
--- ----------
10G acle10G
10G acle10G

SELECT substr(str, 7, 3),
       dbms_lob.substr(lob, 3, 7) lob
FROM substring;
ow03@NISCDEV.NHGRI.NIH.GOV>
SUB LOB
--- ----------
10G 10G
10G 10G
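
The swapped argument order is easy to get wrong, so here is a small Python sketch that mimics both calls on the same value. The two helper functions are hypothetical stand-ins that only cover 1-based positive offsets; they are not part of any Oracle client library.

```python
def sql_substr(text, start, length):
    """Mimic SUBSTR(text, start, length) for start >= 1 (1-based)."""
    return text[start - 1:start - 1 + length]


def dbms_lob_substr(lob, amount, offset=1):
    """Mimic DBMS_LOB.SUBSTR(lob, amount, offset) for offset >= 1.

    Note the order: the amount comes first and the offset second.
    """
    return lob[offset - 1:offset - 1 + amount]


value = "Oracle10G"
print(sql_substr(value, 7, 3))       # 10G
print(dbms_lob_substr(value, 7, 3))  # acle10G (7 characters from position 3)
print(dbms_lob_substr(value, 3, 7))  # 10G     (3 characters from position 7)
```

The three results match the query output on the slide: passing the SUBSTR arguments unchanged to DBMS_LOB.SUBSTR returns seven characters starting at position 3 instead of three characters starting at position 7.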

30
Lob Usage Limitation

A LOB column cannot be used:
In an ORDER BY or GROUP BY clause, or in an aggregate function.
In a SELECT... DISTINCT or SELECT... UNIQUE statement, or in a join.
In ANALYZE... COMPUTE or ANALYZE... ESTIMATE statements.
As a primary key column.
In a SELECT through a database link (ORA-22992: cannot use LOB locators selected from remote tables).

31
Partitioning and Its Usage Scenarios at NISC

32
Partition Method

Range partitioning, introduced in Oracle 8.
Hash partitioning, introduced in 8i.
List partitioning, introduced in 9i Release 1.
Composite partitioning. Range-hash partitioning was introduced in 8i, and range-list partitioning in 9i Release 2.
This is a good example of how Oracle adds functionality with each new release.

33
Benefit of Partitioning

The time for each operation can be significantly reduced because each segment is smaller.
Query performance improves, and I/O can be balanced across disks.
Downtime is reduced.
Part of the table can be put into read-only mode.
Easy to implement.

34
When to Partition

When a table becomes large. 2 GB is a general guideline.
When data is added incrementally, so that new data goes into a new partition.

35
Work with Range Partition

Create table with range partitioning.
Convert a non-partitioned table to a partitioned table.
Merge/split partition.
Tablespace usage with partition.
Maintain range partition.

36
Partitioning Usage Examples

Create tablespace
Create table
Add partition
Drop partition
Exchange partition
Move partition
Merge partition
Split partition
Truncate partition
Rename partition

37
Create Partitioned Table

CREATE TABLE dna_sequence
  (base_id       NUMBER(6),
   base_sequence CLOB)
LOB (base_sequence) STORE AS dna_seq_lob2
TABLESPACE example
PARTITION BY RANGE (base_id)
  (PARTITION dna_sequence1 VALUES LESS THAN (100) TABLESPACE dna_sequence_p1,
   PARTITION dna_sequence2 VALUES LESS THAN (200) TABLESPACE dna_sequence_p2,
   PARTITION dna_sequence3 VALUES LESS THAN (300) TABLESPACE dna_sequence_p3);

38
Query the Partitioned Table

SELECT table_name, partition_name,
       tablespace_name, high_value
FROM user_tab_partitions
ORDER BY partition_name;

TABLE_NAME    PARTITION_NAME  TABLESPACE_NAME  HIGH_VALUE
------------  --------------  ---------------  ----------
DNA_SEQUENCE  DNA_SEQUENCE1   DNA_SEQUENCE_P1  100
DNA_SEQUENCE  DNA_SEQUENCE2   DNA_SEQUENCE_P2  200
DNA_SEQUENCE  DNA_SEQUENCE3   DNA_SEQUENCE_P3  300
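
How a row finds its partition follows from the HIGH_VALUE column: each partition holds keys strictly below its own high value and at or above the previous one. The Python sketch below models that lookup with a binary search. The function name is hypothetical and the partition list simply copies the three rows listed above.

```python
from bisect import bisect_right

# (high_value, partition_name), ordered by high value as in the listing above.
PARTITIONS = [
    (100, "DNA_SEQUENCE1"),
    (200, "DNA_SEQUENCE2"),
    (300, "DNA_SEQUENCE3"),
]


def partition_for(base_id, partitions=PARTITIONS):
    """Return the name of the range partition that would hold base_id.

    VALUES LESS THAN is exclusive, so a key equal to a high value
    belongs to the next partition.
    """
    high_values = [high for high, _ in partitions]
    index = bisect_right(high_values, base_id)
    if index == len(partitions):
        # Oracle rejects such a row with ORA-14400.
        raise ValueError(f"base_id {base_id} does not map to any partition")
    return partitions[index][1]


print(partition_for(99))   # DNA_SEQUENCE1
print(partition_for(100))  # DNA_SEQUENCE2
print(partition_for(299))  # DNA_SEQUENCE3
```

A base_id of 300 or more has no home in this layout, which is why a new partition must be added before such rows are inserted.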

39
Add Partition

ALTER TABLE dna_sequence
  ADD PARTITION dna_sequence4 VALUES LESS THAN (400)
  TABLESPACE dna_sequence_p1;

TABLE_NAME    PARTITION_NAME  TABLESPACE_NAME  HIGH_VALUE
------------  --------------  ---------------  ----------
DNA_SEQUENCE  DNA_SEQUENCE1   DNA_SEQUENCE_P1  100
DNA_SEQUENCE  DNA_SEQUENCE2   DNA_SEQUENCE_P2  200
DNA_SEQUENCE  DNA_SEQUENCE3   DNA_SEQUENCE_P3  300
DNA_SEQUENCE  DNA_SEQUENCE4   DNA_SEQUENCE_P1  400

40
Drop Partition

ALTER TABLE dna_sequence DROP PARTITION dna_sequence4;

Run partition.sql

TABLE_NAME    PARTITION_NAME  TABLESPACE_NAME  HIGH_VALUE
------------  --------------  ---------------  ----------
DNA_SEQUENCE  DNA_SEQUENCE1   DNA_SEQUENCE_P1  100
DNA_SEQUENCE  DNA_SEQUENCE2   DNA_SEQUENCE_P2  200
DNA_SEQUENCE  DNA_SEQUENCE3   DNA_SEQUENCE_P3  300

41
Exchange Partition

-- Create an empty table with the same structure (WHERE 1 = 2 selects no rows)
CREATE TABLE dna_sep03
AS SELECT *
FROM dna_sequence
WHERE 1 = 2;

ALTER TABLE dna_sequence
  EXCHANGE PARTITION dna_sequence3 WITH TABLE dna_sep03;

42
Move Partition

ALTER TABLE dna_sequence
  MOVE PARTITION dna_sequence4 TABLESPACE dna_sequence_p2 NOLOGGING;

43
Split Partition

ALTER TABLE dna_sequence
  SPLIT PARTITION dna_sequence4 AT (350)
  INTO (
    PARTITION dna_sequence4 TABLESPACE dna_sequence_p1,
    PARTITION dna_sequence5 TABLESPACE dna_sequence_p2)
  PARALLEL (DEGREE 5);

TABLE_NAME    PARTITION_NAME  TABLESPACE_NAME  HIGH_VALUE
------------  --------------  ---------------  ----------
DNA_SEQUENCE  DNA_SEQUENCE1   DNA_SEQUENCE_P1  100
DNA_SEQUENCE  DNA_SEQUENCE2   DNA_SEQUENCE_P2  200
DNA_SEQUENCE  DNA_SEQUENCE3   DNA_SEQUENCE_P3  300
DNA_SEQUENCE  DNA_SEQUENCE4   DNA_SEQUENCE_P1  350
DNA_SEQUENCE  DNA_SEQUENCE5   DNA_SEQUENCE_P2  400
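
The effect of the split on the partition bounds can be modeled in a few lines of Python. This is only a sketch of the bookkeeping, with a hypothetical function name, and it assumes the table starts with four partitions whose high values are 100, 200, 300 and 400.

```python
def split_partition(partitions, name, at_value, low_name, high_name):
    """Return a new (name, high_value) list with one partition split in two.

    The lower piece takes at_value as its high value and the upper piece
    keeps the original high value. at_value must fall strictly inside
    the range of the partition being split.
    """
    result = []
    previous_high = None
    found = False
    for part_name, high in partitions:
        if part_name == name:
            found = True
            above_lower_bound = previous_high is None or at_value > previous_high
            if not (above_lower_bound and at_value < high):
                raise ValueError("split value must fall inside the partition")
            result.append((low_name, at_value))
            result.append((high_name, high))
        else:
            result.append((part_name, high))
        previous_high = high
    if not found:
        raise KeyError(name)
    return result


before = [
    ("DNA_SEQUENCE1", 100),
    ("DNA_SEQUENCE2", 200),
    ("DNA_SEQUENCE3", 300),
    ("DNA_SEQUENCE4", 400),
]
after = split_partition(before, "DNA_SEQUENCE4", 350,
                        "DNA_SEQUENCE4", "DNA_SEQUENCE5")
for part_name, high in after:
    print(part_name, high)
```

The last two entries come out as DNA_SEQUENCE4 with high value 350 and DNA_SEQUENCE5 with high value 400, matching the listing above.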

44
Truncate Partition

ALTER TABLE dna_sequence
  TRUNCATE PARTITION dna_sequence4 DROP STORAGE;

45
Rename Partition/Table

Rename partition
ALTER TABLE dna_sequence
  RENAME PARTITION dna_sequence4 TO dna_sequence5;
Rename table
ALTER TABLE dna_sequence
  RENAME TO dna_seq;
RENAME dna_seq TO dna_sequence;

46
Conclusion

With proper use of Oracle features such as CLOBs and partitioned tables, a database containing large amounts of biological data becomes much easier to manage.

47
Major Benefits using CLOB and Partitioning at NISC