Title: Jerry Held
1(No Transcript)
2Handling Large Amounts of Biological Data
Session id40364
- Xiaobin Guan, Ph.D.Senior Oracle
DBA/Bioinformatician - National Institutes of Health
3Introduction
- Bioinformatics
- In Silico
- Large Database
- DNA Sequence
- Using CLOB
- Using Partition Tables
4NISC Database Environment
- NIH Intramural Sequencing Center
- Established in 1997
- A multi-disciplinary genomics facility
- Large-scale DNA sequencing
- Applied Biosystems (ABI) DNA Analyzers
- Produce 10,000 DNA sequences per day
5NISC Pipeline
- The Laboratory Information Management System
(LIMS). - Move the sequencing data from each PC to a
partition (/area1) on our main Unix Server. - A Perl script is then running to validate the
trace name and run folder name, and also check
for duplicates. Then, moved to another partition
(/area2). - Phred is run on each trace file to get rid of the
low quality bases at the beginning and end of
each read.
6NISC Pipeline
- Vector Screening is then performed on each read,
and masked out where the vector is. - Contaminant Checking is to use BLAST to screen
any contaminants. The information about
contamination is then stored in the database. - QC Report is generated to show the quality and
other information.
7Why CLOB?
- To store DNA sequences
- Combination of ACGT character strings
- The length can be more or less than 4KB
8LOBs vs. Long/Long Raw
LONG, LONG RAW LOBs
Number of LOB columns per table 1 Multiple
LOB Capacity Up to 2 GB Up to 4 GB
Data stored out-of-line No Yes
Object type support No Yes
Random piece-wise access No Yes
9A Simple Create Table Statement
- CREATE TABLE dna_sequence1
- (base_id NUMBER(6),
- base_sequence CLOB)
- TABLESPACE example
10Specify the Segment Name, and LOB Storage
- CREATE TABLE dna_sequence2
- (base_id NUMBER(6),
- base_sequence CLOB)
- LOB (base_sequence) STORE AS
- dna_seq_lob
- (TABLESPACE lob_seg_ts)
- TABLESPACE example
11Specify the Index Name and Index Storage
- CREATE TABLE dna_sequence3
- (base_id NUMBER(6),
- base_sequence CLOB)
- LOB (base_sequence) STORE AS
- dna_seq_lob1
- (TABLESPACE lob_seg_ts
- INDEX dna_seq_clob_idx (
- TABLESPACE nisc_index))
- TABLESPACE example
12Check Segment and Index Name
- SELECT table_name, column_name,
segment_name, index_name - FROM user_lobs
- TABLE_NAME COLUMN_NAME SEGMENT_NAME
INDEX_NAME - --------------- ---------------
--------------------------- ----------------------
-- - DNA_SEQUENCE1 BASE_SEQUENCE
SYS_LOB0000040338C00002 SYS_IL0000040338C00002
- DNA_SEQUENCE2 BASE_SEQUENCE DNA_SEQ_LOB
SYS_IL0000040341C00002 - DNA_SEQUENCE3 BASE_SEQUENCE DNA_SEQ_LOB1
DNA_SEQ_CLOB_IDX
13Query the Table
- SELECT
- FROM dna_sequence
- WHERE base_id 20
- 20 actcggtactgggacccatgtggtggatttctatccttgaagctgc
acgtaaagacccggtttttgcgggtatctctgataatgccaccgctcaaa
tcgctacagcgtgggcaagtgcactggctgactacgccgcagcacataaa
tctatgccgcgtccggaaattctggcctcctgccaccagacgctggaaaa
ctgcctgatagagtccacccgcaatagcatggatgccactaataaagcga
tgctggaatctgtcgcagcagagatgatgagcgtttctgacggtgttatg
cgtctgcctttattcctcgcgatgatcctgcctgttcagttgggggcagc
taccgctgatgcgtgtaccttcattccggttacgcgtgaccagtccgaca
tctatgaagtctttaacgtggcaggttcatcttttggttcttatgctgct
ggtgatgttctggacatgcaatccgtcggtgtgtacagccagttacgtcg
ccgctatgtgctggtggcaagctccgatggcaccagcaaaaccgcaacct
tcaagatggaagacttcgaaggccagaatgtaccaatccgaaaaggtcgc
actaacatctacgttaaccgtattaagtctgttgttgataacggttccgg
cagcctacttcactcgtttactaatgctgctggtgagcaaatcactgtta
cctgctctctgaactacaacattggtcagattgccctgtcgttctccaaa
gcgccggataaaagcactgagatcgcaattgagacggaaatcaatattga
agccggctctgagctgatcccgctgatcacca
14In-line or Out-of-line Storage
- In-line
- Out-of-line
- Enable storage in row
- Disable storage in row
- Tablespaces
15CLOB Usage
- Table structure
- This table contains two CLOB columns
- BASECALLS stores DNA sequences
- BASEQUALS stores the quality score of each
sequence - The length of both fields varies between a few
hundred to up to 6 thousand characters
16Test Protocol
- Create tablespaces
- Four for 4 tables, and two for LOB storage
- Create four test tables
- T1, in-line, one tablespace
- T2, in-line, two tablespaces
- T3, out-of-line, one tablespace
- T4, out-of-line, two tablespaces
17Test Table 1 (T1)
- CREATE TABLE T1
- (CALL_ID NUMBER(10) NOT NULL,
- TRACE_ID NUMBER(10) NOT NULL,
- BASECALLS CLOB NOT NULL,
- BASEQUALS CLOB)
- TABLESPACE "TEST_CALL1"
- LOB("BASECALLS") STORE AS (TABLESPACE
"TEST_CALL1" - ENABLE STORAGE IN ROW)
- LOB("BASEQUALS") STORE AS (TABLESPACE
"TEST_CALL1" - ENABLE STORAGE IN ROW)
18Test Table 2 (T2)
- CREATE TABLE T2
- (CALL_ID NUMBER(10) NOT NULL,
- TRACE_ID NUMBER(10) NOT NULL,
- BASECALLS CLOB NOT NULL,
- BASEQUALS CLOB)
- TABLESPACE "TEST_CALL2"
- LOB("BASECALLS") STORE AS (TABLESPACE
"TEST_CALL_LOB1" - ENABLE STORAGE IN ROW)
- LOB("BASEQUALS") STORE AS (TABLESPACE
"TEST_CALL_LOB1" - ENABLE STORAGE IN ROW)
19Test Table 3 (T3)
- CREATE TABLE T3
- (CALL_ID NUMBER(10) NOT NULL,
- TRACE_ID NUMBER(10) NOT NULL,
- BASECALLS CLOB NOT NULL,
- BASEQUALS CLOB)
- TABLESPACE "TEST_CALL3"
- LOB("BASECALLS") STORE AS (TABLESPACE
"TEST_CALL3" - DISABLE STORAGE IN ROW)
- LOB("BASEQUALS") STORE AS (TABLESPACE
"TEST_CALL3" - DISABLE STORAGE IN ROW)
20Test Table 4 (T4)
- CREATE TABLE T4
- (CALL_ID NUMBER(10) NOT NULL,
- TRACE_ID NUMBER(10) NOT NULL,
- BASECALLS CLOB NOT NULL,
- BASEQUALS CLOB)
- TABLESPACE "TEST_CALL4"
- LOB("BASECALLS") STORE AS (TABLESPACE
"TEST_CALL_LOB2" - DISABLE STORAGE IN ROW)
- LOB("BASEQUALS") STORE AS (TABLESPACE
"TEST_CALL_LOB2" - DISABLE STORAGE IN ROW)
21Results
In-line/out-of-line IN-LINE IN-LINE OUT-OF-LINE OUT-OF-LINE
Tablespace usage One TS Two TS One TS Two TS
Table name T1 T2 T3 T4
Initial space used (MB) 6 7(25) 6 7(25)
Space used after 10000 row insert (MB) 46 47(425) 162 163(2161)
Total insert time (sec) 10 11 47 48
Ranking 1 2 3 4
22DBMS_LOB Package
23Functions/Procedures to Read or Return LOB Values
Subprogram F/P Description
COMPARE() F Compares the value of two LOBs
GETCHUNKSIZE() F Gets the chunk size used when reading and writing. This only works on internal LOBs and does not apply to external LOBs (BFILEs).
GETLENGTH() F Gets the length of the LOB value
INSTR() F Returns the matching position of the nth occurrence of the pattern in the LOB
READ() P Reads data from the LOB starting at the specified offset
SUBSTR() F Returns part of the LOB value starting at the specified offset
24Functions/Procedures to Write LOB Values
Subprogram F/P Description
APPEND() P Appends the LOB value to another LOB
COPY() P Copies all or part of a LOB to another LOB
ERASE() P Erases part of a LOB, starting at a specified offset
LOADFROMFILE() P Load BFILE data into an internal LOB
LOADCLOBFROMFILE() P Load character data from a file into a LOB
LOADBLOBFROMFILE() P Load binary data from a file into a LOB
TRIM() P Trims the LOB value to the specified shorter length
WRITE() P Writes data to the LOB at a specified offset
WRITEAPPEND() P Writes data to the end of the LOB
25Functions/Procedures for BFILEs
Subprogram F/P Description
FILECLOSE() P Closes the file. Use CLOSE() instead.
FILECLOSEALL() P Closes all previously opened files
FILEEXISTS() F Checks if the file exists on the server
FILEGETNAME() P Gets the directory alias and file name
FILEISOPEN() F Checks if the file was opened using the input BFILE locators. Use ISOPEN() instead.
FILEOPEN() P Opens a file. Use OPEN() instead.
26Call Functions in SQL
- SELECT dbms_lob.getlength(base_sequence)
- FROM dna_sequence1
- DBMS_LOB.GETLENGTH(BASE_SEQUENCE)
- ---------------------------------
- 878
- 1269
- 893
- 872
- 961
- 807
- 806
- 808
- 833
- 837
- 10 rows selected.
27Call procedures in PL/SQL
- DECLARE
- v_dna_seq CLOB
- v_seq_amt BINARY_INTEGER 10
- v_seq_buffer VARCHAR2(10)
- BEGIN
- v_dna_seq 'atctcgagtagctgaagctccaatgntggtg
gaattcacgagttgctt' - DBMS_LOB.READ (v_dna_seq, v_seq_amt, 1,
v_seq_buffer) - DBMS_OUTPUT.PUT_LINE('The first 10 bases for
this DNA sequence are ' v_seq_buffer) - END
- /
- The first 10 bases for this DNA sequence are
atctcgagta - PL/SQL procedure successfully completed.
28Substr vs. dbms_lob.substr
- Substr(the_string, from_character,
number_of_characters) - Dbms_lob.substr(the_string, number_of_characters,
from_character).
29Substr vs. dbms_lob.substr
- CREATE table substring (str varchar2(20), lob
clob) - INSERT INTO substring
- VALUES ('Oracle10G', 'Oracle10G')
- SELECT substr (str, 7, 3),
- dbms_lob.substr(lob, 7, 3) lob
- FROM substring
- ow03_at_NISCDEV.NHGRI.NIH.GOVgt
- SUB LOB
- --- ----------
- 10G acle10G
- 10G acle10G
- SELECT substr (str, 7, 3),
- dbms_lob.substr(lob, 3, 7) lob
- FROM substring
- ow03_at_NISCDEV.NHGRI.NIH.GOVgt
- SUB LOB
- --- ----------
- 10G 10G
- 10G 10G
30Lob Usage Limitation
- Not in the ORDER BY, or GROUP BY or in an
aggregate function. - Not in a SELECT... DISTINCT or SELECT... UNIQUE
statement or in a join. - Not in ANALYZE... COMPUTE or ANALYZE... ESTIMATE
statements. - Not as a primary key column.
- Not select a LOB column through dblink.
ORA-22992 cannot use LOB locators selected from
remote tables.
31- Partitioning and Its
- Usage Scenarios at NISC
32Partition Method
- Range Partitioning, introduced in Oracle 8.
- Hash Partitioning, introduced in 8i.
- List Partitioning, introduced in 9i release 1.
- Composite Partitioning. The range-hash partition
was introduced in 8i, and the range-list
partition was introduced in 9i release 2. - This is a good example how Oracle adds
functionalities to the new release.
33Benefit of Partitioning
- The amount of time for each operation can be
significantly reduced because of the small
segment. - Improve query performance. The I/O will be
balanced among disks. - Reduce the downtime.
- Part of the table can be put to read only mode.
- Easy to implement.
34When to Partition
- When table becomes large. 2GB is considered as a
general guideline. - When the data is kind of adding on, meaning new
data will go to the new partition.
35Work with Range Partition
- Create table with range partitioning.
- Convert a non-partition table to a partition
table. - Merge/split partition.
- Tablespace usage with partition.
- Maintain range partition.
36Partitioning Usage Examples
- Create tablespace
- Create table
- Add partition
- Drop partition
- Exchange partition
- Move partition
- Merge partition
- Split partition
- Truncate partition
- Rename partition
37Create Partitioned Table
- CREATE TABLE dna_sequence
- (base_id NUMBER(6),
- base_sequence CLOB)
- LOB (base_sequence) STORE AS
- dna_seq_lob2
- TABLESPACE example
- PARTITION BY RANGE (BASE_ID)
- (partition dna_sequence1 values less than (100)
tablespace dna_sequence_p1, - partition dna_sequence2 values less than (200)
tablespace dna_sequence_p2, - partition dna_sequence3 values less than (300)
tablespace dna_sequence_p3)
38Query the Partitioned Table
- SELECT table_name, partition_name,
tablespace_name, high_value - FROM user_tab_partitions
- ORDER BY partition_name
-
- TABLE_NAME PARTITION_NAME
TABLESPACE_NAME HIGH_VALUE - ---------------- --------------------
-------------------- ---------- - DNA_SEQUENCE DNA_SEQUENCE1
DNA_SEQUENCE_P1 100 - DNA_SEQUENCE DNA_SEQUENCE2
DNA_SEQUENCE_P2 200 - DNA_SEQUENCE DNA_SEQUENCE3
DNA_SEQUENCE_P3 300
39Add Partition
- ALTER TABLE dna_sequence
- ADD PARTITION dna_sequence4 VALUES LESS THAN
(400) - TABLESPACE dna_sequence_p1
- TABLE_NAME PARTITION_NAME TABLESPACE_NAME
HIGH_VALUE - --------------- -----------------
-------------------- ---------- - DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1
100 - DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2
200 - DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3
300 - DNA_SEQUENCE DNA_SEQUENCE4 DNA_SEQUENCE_P1
400
40Drop Partition
- ALTER TABLE dna_sequence DROP PARTITION
dna_sequence4 - Run partition.sql
- TABLE_NAME PARTITION_NAME
TABLESPACE_NAME HIGH_VALUE - ---------------- -------------------
-------------------- --------- - DNA_SEQUENCE DNA_SEQUENCE1
DNA_SEQUENCE_P1 100 - DNA_SEQUENCE DNA_SEQUENCE2
DNA_SEQUENCE_P2 200 - DNA_SEQUENCE DNA_SEQUENCE3
DNA_SEQUENCE_P3 300
41Exchange Partition
- CREATE TABLE dna_sep03
- AS SELECT
- FROM dna_sequence
- WHERE 12
- ALTER TABLE dna_sequence
- EXCHANGE PARTITION dna_sequence3 WITH TABLE
dna_sep03
42Move Partition
- ALTER TABLE dna_sequence
- MOVE PARTITION dna_sequence4 TABLESPACE
dna_sequence_p2 NOLOGGING
43Split Partition
- ALTER TABLE dna_sequence
- SPLIT PARTITION dna_sequence4 AT (350)
- INTO (
- PARTITION dna_sequence4 TABLESPACE
dna_sequence_p1, - PARTITION dna_sequence5 TABLESPACE
dna_sequence_p2) - PARALLEL ( DEGREE 5 )
- TABLE_NAME PARTITION_NAME
TABLESPACE_NAME HIGH_VALUE - ----------------- --------------------
-------------------- ---------- - DNA_SEQUENCE DNA_SEQUENCE1
DNA_SEQUENCE_P1 100 - DNA_SEQUENCE DNA_SEQUENCE2
DNA_SEQUENCE_P2 200 - DNA_SEQUENCE DNA_SEQUENCE3
DNA_SEQUENCE_P3 300 - DNA_SEQUENCE DNA_SEQUENCE4
DNA_SEQUENCE_P1 350 - DNA_SEQUENCE DNA_SEQUENCE5
DNA_SEQUENCE_P2 400
44Truncate Partition
- ALTER TABLE dna_sequence
- TRUNCATE PARTITION dna_sequence4 DROP STORAGE
45Rename Partition/Table
- Rename partition
- ALTER TABLE dna_sequence
- RENAME PARTITION dna_sequence4 TO
dna_sequence5 - Rename table
- ALTER TABLE dna_sequence
- RENAME TO dna_seq
- RENAME dna_seq TO dna_sequence
46Conclusion
- By proper use of the Oracle features such as
CLOB, and partitioning table, it becomes a lot
easier to manage the database containing large
amounts of biological data.
47Major Benefits using CLOB and Partitioning at NISC
- Space Savings Proper use of CLOB
- Better performance Put big tables into smaller
segments - Better Maintenance Easier backup and recovery
Less down time
48A
49Reminder please complete the OracleWorld
online session surveyThank you.Xiaobin Guan,
Ph.D.NISC/NIHXiaobin_Guan_at_nih.gov