Jerry Held - PowerPoint PPT Presentation

About This Presentation
Title:

Jerry Held

Description:

Handling Large Amounts of Biological Data Xiaobin Guan, Ph.D. Senior Oracle DBA/Bioinformatician National Institutes of Health Introduction Bioinformatics In Silico ... – PowerPoint PPT presentation

Number of Views:45
Avg rating:3.0/5.0
Slides: 50
Provided by: Analy7
Category:
Tags: held | jerry | sequencing

less

Transcript and Presenter's Notes

Title: Jerry Held


1
(No Transcript)
2
Handling Large Amounts of Biological Data
Session id40364
  • Xiaobin Guan, Ph.D.Senior Oracle
    DBA/Bioinformatician
  • National Institutes of Health

3
Introduction
  • Bioinformatics
  • In Silico
  • Large Database
  • DNA Sequence
  • Using CLOB
  • Using Partition Tables

4
NISC Database Environment
  • NIH Intramural Sequencing Center
  • Established in 1997
  • A multi-disciplinary genomics facility
  • Large-scale DNA sequencing
  • Applied Biosystems (ABI) DNA Analyzers
  • Produce 10,000 DNA sequences per day

5
NISC Pipeline
  • The Laboratory Information Management System
    (LIMS).
  • Move the sequencing data from each PC to a
    partition (/area1) on our main Unix Server.
  • A Perl script is then running to validate the
    trace name and run folder name, and also check
    for duplicates. Then, moved to another partition
    (/area2).
  • Phred is run on each trace file to get rid of the
    low quality bases at the beginning and end of
    each read.

6
NISC Pipeline
  • Vector Screening is then performed on each read,
    and masked out where the vector is.
  • Contaminant Checking is to use BLAST to screen
    any contaminants. The information about
    contamination is then stored in the database.
  • QC Report is generated to show the quality and
    other information.

7
Why CLOB?
  • To store DNA sequences
  • Combination of ACGT character strings
  • The length can be more or less than 4KB

8
LOBs vs. Long/Long Raw
LONG, LONG RAW LOBs
Number of LOB columns per table 1 Multiple
LOB Capacity Up to 2 GB Up to 4 GB
Data stored out-of-line No Yes
Object type support No Yes
Random piece-wise access No Yes
9
A Simple Create Table Statement
  • CREATE TABLE dna_sequence1
  • (base_id NUMBER(6),
  • base_sequence CLOB)
  • TABLESPACE example

10
Specify the Segment Name, and LOB Storage
  • CREATE TABLE dna_sequence2
  • (base_id NUMBER(6),
  • base_sequence CLOB)
  • LOB (base_sequence) STORE AS
  • dna_seq_lob
  • (TABLESPACE lob_seg_ts)
  • TABLESPACE example

11
Specify the Index Name and Index Storage
  • CREATE TABLE dna_sequence3
  • (base_id NUMBER(6),
  • base_sequence CLOB)
  • LOB (base_sequence) STORE AS
  • dna_seq_lob1
  • (TABLESPACE lob_seg_ts
  • INDEX dna_seq_clob_idx (
  • TABLESPACE nisc_index))
  • TABLESPACE example

12
Check Segment and Index Name
  • SELECT table_name, column_name,
    segment_name, index_name
  • FROM user_lobs
  • TABLE_NAME COLUMN_NAME SEGMENT_NAME
    INDEX_NAME
  • --------------- ---------------
    --------------------------- ----------------------
    --
  • DNA_SEQUENCE1 BASE_SEQUENCE
    SYS_LOB0000040338C00002 SYS_IL0000040338C00002
  • DNA_SEQUENCE2 BASE_SEQUENCE DNA_SEQ_LOB
    SYS_IL0000040341C00002
  • DNA_SEQUENCE3 BASE_SEQUENCE DNA_SEQ_LOB1
    DNA_SEQ_CLOB_IDX

13
Query the Table
  • SELECT
  • FROM dna_sequence
  • WHERE base_id 20
  • 20 actcggtactgggacccatgtggtggatttctatccttgaagctgc
    acgtaaagacccggtttttgcgggtatctctgataatgccaccgctcaaa
    tcgctacagcgtgggcaagtgcactggctgactacgccgcagcacataaa
    tctatgccgcgtccggaaattctggcctcctgccaccagacgctggaaaa
    ctgcctgatagagtccacccgcaatagcatggatgccactaataaagcga
    tgctggaatctgtcgcagcagagatgatgagcgtttctgacggtgttatg
    cgtctgcctttattcctcgcgatgatcctgcctgttcagttgggggcagc
    taccgctgatgcgtgtaccttcattccggttacgcgtgaccagtccgaca
    tctatgaagtctttaacgtggcaggttcatcttttggttcttatgctgct
    ggtgatgttctggacatgcaatccgtcggtgtgtacagccagttacgtcg
    ccgctatgtgctggtggcaagctccgatggcaccagcaaaaccgcaacct
    tcaagatggaagacttcgaaggccagaatgtaccaatccgaaaaggtcgc
    actaacatctacgttaaccgtattaagtctgttgttgataacggttccgg
    cagcctacttcactcgtttactaatgctgctggtgagcaaatcactgtta
    cctgctctctgaactacaacattggtcagattgccctgtcgttctccaaa
    gcgccggataaaagcactgagatcgcaattgagacggaaatcaatattga
    agccggctctgagctgatcccgctgatcacca

14
In-line or Out-of-line Storage
  • In-line
  • Out-of-line
  • Enable storage in row
  • Disable storage in row
  • Tablespaces

15
CLOB Usage
  • Table structure
  • This table contains two CLOB columns
  • BASECALLS stores DNA sequences
  • BASEQUALS stores the quality score of each
    sequence
  • The length of both fields varies between a few
    hundred to up to 6 thousand characters

16
Test Protocol
  • Create tablespaces
  • Four for 4 tables, and two for LOB storage
  • Create four test tables
  • T1, in-line, one tablespace
  • T2, in-line, two tablespaces
  • T3, out-of-line, one tablespace
  • T4, out-of-line, two tablespaces

17
Test Table 1 (T1)
  • CREATE TABLE T1
  • (CALL_ID NUMBER(10) NOT NULL,
  • TRACE_ID NUMBER(10) NOT NULL,
  • BASECALLS CLOB NOT NULL,
  • BASEQUALS CLOB)
  • TABLESPACE "TEST_CALL1"
  • LOB("BASECALLS") STORE AS (TABLESPACE
    "TEST_CALL1"
  • ENABLE STORAGE IN ROW)
  • LOB("BASEQUALS") STORE AS (TABLESPACE
    "TEST_CALL1"
  • ENABLE STORAGE IN ROW)

18
Test Table 2 (T2)
  • CREATE TABLE T2
  • (CALL_ID NUMBER(10) NOT NULL,
  • TRACE_ID NUMBER(10) NOT NULL,
  • BASECALLS CLOB NOT NULL,
  • BASEQUALS CLOB)
  • TABLESPACE "TEST_CALL2"
  • LOB("BASECALLS") STORE AS (TABLESPACE
    "TEST_CALL_LOB1"
  • ENABLE STORAGE IN ROW)
  • LOB("BASEQUALS") STORE AS (TABLESPACE
    "TEST_CALL_LOB1"
  • ENABLE STORAGE IN ROW)

19
Test Table 3 (T3)
  • CREATE TABLE T3
  • (CALL_ID NUMBER(10) NOT NULL,
  • TRACE_ID NUMBER(10) NOT NULL,
  • BASECALLS CLOB NOT NULL,
  • BASEQUALS CLOB)
  • TABLESPACE "TEST_CALL3"
  • LOB("BASECALLS") STORE AS (TABLESPACE
    "TEST_CALL3"
  • DISABLE STORAGE IN ROW)
  • LOB("BASEQUALS") STORE AS (TABLESPACE
    "TEST_CALL3"
  • DISABLE STORAGE IN ROW)

20
Test Table 4 (T4)
  • CREATE TABLE T4
  • (CALL_ID NUMBER(10) NOT NULL,
  • TRACE_ID NUMBER(10) NOT NULL,
  • BASECALLS CLOB NOT NULL,
  • BASEQUALS CLOB)
  • TABLESPACE "TEST_CALL4"
  • LOB("BASECALLS") STORE AS (TABLESPACE
    "TEST_CALL_LOB2"
  • DISABLE STORAGE IN ROW)
  • LOB("BASEQUALS") STORE AS (TABLESPACE
    "TEST_CALL_LOB2"
  • DISABLE STORAGE IN ROW)

21
Results
In-line/out-of-line IN-LINE IN-LINE OUT-OF-LINE OUT-OF-LINE
Tablespace usage One TS Two TS One TS Two TS
Table name T1 T2 T3 T4
Initial space used (MB) 6 7(25) 6 7(25)
Space used after 10000 row insert (MB) 46 47(425) 162 163(2161)
Total insert time (sec) 10 11 47 48
Ranking 1 2 3 4
22
DBMS_LOB Package
23
Functions/Procedures to Read or Return LOB Values
Subprogram F/P Description
COMPARE() F Compares the value of two LOBs
GETCHUNKSIZE() F Gets the chunk size used when reading and writing. This only works on internal LOBs and does not apply to external LOBs (BFILEs).
GETLENGTH() F Gets the length of the LOB value
INSTR() F Returns the matching position of the nth occurrence of the pattern in the LOB
READ() P Reads data from the LOB starting at the specified offset
SUBSTR() F Returns part of the LOB value starting at the specified offset
24
Functions/Procedures to Write LOB Values
Subprogram F/P Description
APPEND() P Appends the LOB value to another LOB
COPY() P Copies all or part of a LOB to another LOB
ERASE() P Erases part of a LOB, starting at a specified offset
LOADFROMFILE() P Load BFILE data into an internal LOB
LOADCLOBFROMFILE() P Load character data from a file into a LOB
LOADBLOBFROMFILE() P Load binary data from a file into a LOB
TRIM() P Trims the LOB value to the specified shorter length
WRITE() P Writes data to the LOB at a specified offset
WRITEAPPEND() P Writes data to the end of the LOB
25
Functions/Procedures for BFILEs
Subprogram F/P Description
FILECLOSE() P Closes the file. Use CLOSE() instead.
FILECLOSEALL() P Closes all previously opened files
FILEEXISTS() F Checks if the file exists on the server
FILEGETNAME() P Gets the directory alias and file name
FILEISOPEN() F Checks if the file was opened using the input BFILE locators. Use ISOPEN() instead.
FILEOPEN() P Opens a file. Use OPEN() instead.
26
Call Functions in SQL
  • SELECT dbms_lob.getlength(base_sequence)
  • FROM dna_sequence1
  • DBMS_LOB.GETLENGTH(BASE_SEQUENCE)
  • ---------------------------------
  • 878
  • 1269
  • 893
  • 872
  • 961
  • 807
  • 806
  • 808
  • 833
  • 837
  • 10 rows selected.

27
Call procedures in PL/SQL
  • DECLARE
  • v_dna_seq CLOB
  • v_seq_amt BINARY_INTEGER 10
  • v_seq_buffer VARCHAR2(10)
  • BEGIN
  • v_dna_seq 'atctcgagtagctgaagctccaatgntggtg
    gaattcacgagttgctt'
  • DBMS_LOB.READ (v_dna_seq, v_seq_amt, 1,
    v_seq_buffer)
  • DBMS_OUTPUT.PUT_LINE('The first 10 bases for
    this DNA sequence are ' v_seq_buffer)
  • END
  • /
  • The first 10 bases for this DNA sequence are
    atctcgagta
  • PL/SQL procedure successfully completed.

28
Substr vs. dbms_lob.substr
  • Substr(the_string, from_character,
    number_of_characters)
  • Dbms_lob.substr(the_string, number_of_characters,
    from_character).

29
Substr vs. dbms_lob.substr
  • CREATE table substring (str varchar2(20), lob
    clob)
  • INSERT INTO substring
  • VALUES ('Oracle10G', 'Oracle10G')
  • SELECT substr (str, 7, 3),
  • dbms_lob.substr(lob, 7, 3) lob
  • FROM substring
  • ow03_at_NISCDEV.NHGRI.NIH.GOVgt
  • SUB LOB
  • --- ----------
  • 10G acle10G
  • 10G acle10G
  • SELECT substr (str, 7, 3),
  • dbms_lob.substr(lob, 3, 7) lob
  • FROM substring
  • ow03_at_NISCDEV.NHGRI.NIH.GOVgt
  • SUB LOB
  • --- ----------
  • 10G 10G
  • 10G 10G

30
Lob Usage Limitation
  • Not in the ORDER BY, or GROUP BY or in an
    aggregate function.
  • Not in a SELECT... DISTINCT or SELECT... UNIQUE
    statement or in a join.
  • Not in ANALYZE... COMPUTE or ANALYZE... ESTIMATE
    statements.
  • Not as a primary key column.
  • Not select a LOB column through dblink.
    ORA-22992 cannot use LOB locators selected from
    remote tables.

31
  • Partitioning and Its
  • Usage Scenarios at NISC

32
Partition Method
  • Range Partitioning, introduced in Oracle 8.
  • Hash Partitioning, introduced in 8i.
  • List Partitioning, introduced in 9i release 1.
  • Composite Partitioning. The range-hash partition
    was introduced in 8i, and the range-list
    partition was introduced in 9i release 2.
  • This is a good example how Oracle adds
    functionalities to the new release.

33
Benefit of Partitioning
  • The amount of time for each operation can be
    significantly reduced because of the small
    segment.
  • Improve query performance. The I/O will be
    balanced among disks.
  • Reduce the downtime.
  • Part of the table can be put to read only mode.
  • Easy to implement.

34
When to Partition
  • When table becomes large. 2GB is considered as a
    general guideline.
  • When the data is kind of adding on, meaning new
    data will go to the new partition.

35
Work with Range Partition
  • Create table with range partitioning.
  • Convert a non-partition table to a partition
    table.
  • Merge/split partition.
  • Tablespace usage with partition.
  • Maintain range partition.

36
Partitioning Usage Examples
  • Create tablespace
  • Create table
  • Add partition
  • Drop partition
  • Exchange partition
  • Move partition
  • Merge partition
  • Split partition
  • Truncate partition
  • Rename partition

37
Create Partitioned Table
  • CREATE TABLE dna_sequence
  • (base_id NUMBER(6),
  • base_sequence CLOB)
  • LOB (base_sequence) STORE AS
  • dna_seq_lob2
  • TABLESPACE example
  • PARTITION BY RANGE (BASE_ID)
  • (partition dna_sequence1 values less than (100)
    tablespace dna_sequence_p1,
  • partition dna_sequence2 values less than (200)
    tablespace dna_sequence_p2,
  • partition dna_sequence3 values less than (300)
    tablespace dna_sequence_p3)

38
Query the Partitioned Table
  • SELECT table_name, partition_name,
    tablespace_name, high_value
  • FROM user_tab_partitions
  • ORDER BY partition_name
  • TABLE_NAME PARTITION_NAME
    TABLESPACE_NAME HIGH_VALUE
  • ---------------- --------------------
    -------------------- ----------
  • DNA_SEQUENCE DNA_SEQUENCE1
    DNA_SEQUENCE_P1 100
  • DNA_SEQUENCE DNA_SEQUENCE2
    DNA_SEQUENCE_P2 200
  • DNA_SEQUENCE DNA_SEQUENCE3
    DNA_SEQUENCE_P3 300

39
Add Partition
  • ALTER TABLE dna_sequence
  • ADD PARTITION dna_sequence4 VALUES LESS THAN
    (400)
  • TABLESPACE dna_sequence_p1
  • TABLE_NAME PARTITION_NAME TABLESPACE_NAME
    HIGH_VALUE
  • --------------- -----------------
    -------------------- ----------
  • DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1
    100
  • DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2
    200
  • DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3
    300
  • DNA_SEQUENCE DNA_SEQUENCE4 DNA_SEQUENCE_P1
    400

40
Drop Partition
  • ALTER TABLE dna_sequence DROP PARTITION
    dna_sequence4
  • Run partition.sql
  • TABLE_NAME PARTITION_NAME
    TABLESPACE_NAME HIGH_VALUE
  • ---------------- -------------------
    -------------------- ---------
  • DNA_SEQUENCE DNA_SEQUENCE1
    DNA_SEQUENCE_P1 100
  • DNA_SEQUENCE DNA_SEQUENCE2
    DNA_SEQUENCE_P2 200
  • DNA_SEQUENCE DNA_SEQUENCE3
    DNA_SEQUENCE_P3 300

41
Exchange Partition
  • CREATE TABLE dna_sep03
  • AS SELECT
  • FROM dna_sequence
  • WHERE 12
  • ALTER TABLE dna_sequence
  • EXCHANGE PARTITION dna_sequence3 WITH TABLE
    dna_sep03

42
Move Partition
  • ALTER TABLE dna_sequence
  • MOVE PARTITION dna_sequence4 TABLESPACE
    dna_sequence_p2 NOLOGGING

43
Split Partition
  • ALTER TABLE dna_sequence
  • SPLIT PARTITION dna_sequence4 AT (350)
  • INTO (
  • PARTITION dna_sequence4 TABLESPACE
    dna_sequence_p1,
  • PARTITION dna_sequence5 TABLESPACE
    dna_sequence_p2)
  • PARALLEL ( DEGREE 5 )
  • TABLE_NAME PARTITION_NAME
    TABLESPACE_NAME HIGH_VALUE
  • ----------------- --------------------
    -------------------- ----------
  • DNA_SEQUENCE DNA_SEQUENCE1
    DNA_SEQUENCE_P1 100
  • DNA_SEQUENCE DNA_SEQUENCE2
    DNA_SEQUENCE_P2 200
  • DNA_SEQUENCE DNA_SEQUENCE3
    DNA_SEQUENCE_P3 300
  • DNA_SEQUENCE DNA_SEQUENCE4
    DNA_SEQUENCE_P1 350
  • DNA_SEQUENCE DNA_SEQUENCE5
    DNA_SEQUENCE_P2 400

44
Truncate Partition
  • ALTER TABLE dna_sequence
  • TRUNCATE PARTITION dna_sequence4 DROP STORAGE

45
Rename Partition/Table
  • Rename partition
  • ALTER TABLE dna_sequence
  • RENAME PARTITION dna_sequence4 TO
    dna_sequence5
  • Rename table
  • ALTER TABLE dna_sequence
  • RENAME TO dna_seq
  • RENAME dna_seq TO dna_sequence

46
Conclusion
  • By proper use of the Oracle features such as
    CLOB, and partitioning table, it becomes a lot
    easier to manage the database containing large
    amounts of biological data.

47
Major Benefits using CLOB and Partitioning at NISC
  • Space Savings Proper use of CLOB
  • Better performance Put big tables into smaller
    segments
  • Better Maintenance Easier backup and recovery
    Less down time

48
A
49
Reminder please complete the OracleWorld
online session surveyThank you.Xiaobin Guan,
Ph.D.NISC/NIHXiaobin_Guan_at_nih.gov
Write a Comment
User Comments (0)
About PowerShow.com