Title: Bioinformatics Data Representation and Integration
Table of Contents
- Introduction to Bioinformatics
- Proteins and Sequences
- Bioinformatics Tools
- The databases
- Blast Functions
- Bioindexing
- Conclusion
What Is Bioinformatics
- Bioinformatics is the use of computers to study and handle biological information.
- It can be viewed as an integration of computer science and biology that supports the study of biological data, which has proven to be very extensive.
- The role of computer science in this interdisciplinary field is to store the data (via databases) for later analysis with biological tools.
- The field includes, but is not limited to, the study of genes, DNA sequences and protein structures.
Proteins and Sequences
- Biological proteins are made up of 20 amino acids (name - three-letter code - one-letter code):
- Alanine - Ala - A
- Arginine - Arg - R
- Asparagine - Asn - N
- Aspartic acid - Asp - D
- Cysteine - Cys - C
- Glutamine - Gln - Q
- Glutamic acid - Glu - E
- Glycine - Gly - G
- Histidine - His - H
- Isoleucine - Ile - I
- Leucine - Leu - L
- Lysine - Lys - K
- Methionine - Met - M
- Phenylalanine - Phe - F
- Proline - Pro - P
- Serine - Ser - S
- Threonine - Thr - T
- Tryptophan - Trp - W
- Tyrosine - Tyr - Y
- Valine - Val - V
Proteins and Sequences
- Combinations of these amino acids make up protein structures and sequences.
- The PDB database contains numerous protein structures, many of which are related to one another by sequence alignment or fold recognition.
- Bioinformatics studies the differences and similarities between these protein structures based on sequence similarity.
- A sequence is a combination of amino acids.
- These sequences carry biological data that can be used to denote information about families of proteins.
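As a rough illustration of what "sequence similarity" can mean, the sketch below checks that a string uses only the 20 one-letter amino acid codes and computes the percent identity of two sequences. It assumes the two sequences are already aligned to the same length with no gaps; the function names are illustrative and not taken from any tool named in these slides.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_protein_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the 20 standard codes."""
    return len(seq) > 0 and set(seq.upper()) <= AMINO_ACIDS


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions holding the same residue in both sequences.

    Assumes a and b are already aligned, gap-free and of equal length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / len(a)
```

Real alignment tools also handle gaps and score substitutions between similar residues, which this sketch deliberately leaves out.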
Bioinformatics Tools
- Mage
- Used to display individual protein structures
- RasMol
- Used to display protein 3D structure
- LALIGN
- Used for pairwise sequence alignment
- ClustalW
- Used for multiple sequence alignment
- AMMP
- Used for molecular modeling
- Sequence alignment tools
- FASTA
- BLAST (examined in detail later)
Biological Databases
- There are over 5000 public biological databases.
- These databases contain genomic, proteomic and microarray data.
- The data consist of gene sequences or the amino acid sequences of proteins.
- Biological databases have become very useful to scientists. They are important for understanding and explaining a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species.
- This knowledge helps in the fight against diseases, assists in the development of medications, and helps uncover basic relationships among species in the history of life.
- Biological knowledge is distributed among many different general and specialized databases. This sometimes makes it difficult to keep the information consistent.
- Biological databases cross-reference other databases by accession number as one way of linking related knowledge together.
- Bioinformatics databases can be divided into two groups: generalized databases and specialized databases.
- Generalized databases
- Primary sequence databases (EMBL, GenBank, DDBJ)
- Protein sequence databases (Swiss-Prot, UniProt, UniRef)
- Carbohydrate databases (CarbBank)
- 3D structure databases (PDB, EBI-MSD, NDB)
Specialized Databases
- Specialized databases
- Specialized sequence databases
- Genome databases
- Specialized protein sequence databases
- Specialized structure databases
- Microarray databases
- The main focus here is on the generalized databases.
Primary Sequence Databases
- Primary sequence databases
- EMBL (European Molecular Biology Laboratory nucleotide sequence database at EBI, Hinxton, UK)
- GenBank (at the National Center for Biotechnology Information, NCBI, Bethesda, MD, USA)
- DDBJ (DNA Data Bank of Japan at CIB, Mishima, Japan)
Protein Sequence Databases
- Protein sequence databases
- SWISS-PROT (Swiss Institute of Bioinformatics, SIB, Geneva, CH)
- TrEMBL (Translated EMBL, computer-annotated protein sequence database at EBI, UK)
- PIR-PSD (PIR-International Protein Sequence Database, annotated protein database by PIR, MIPS and JIPID at NBRF, Georgetown University, USA)
- UniProt (joined data from Swiss-Prot, TrEMBL and PIR)
- UniRef (UniProt NREF (Non-redundant REFerence) database at EBI, UK)
- IPI (International Protein Index, human, rat and mouse proteome database at EBI, UK)
Other Databases
- Carbohydrate databases
- CarbBank (former complex carbohydrate structure database)
- 3D structure databases
- PDB (Protein Data Bank, curated by RCSB, USA)
- EBI-MSD (Macromolecular Structure Database at EBI, UK)
- NDB (Nucleic Acid Database at Rutgers, The State University of New Jersey, USA)
BLAST
- BLAST is a heuristic algorithm for detecting sequence similarity and is optimized for speed. It is suitable for large-scale analysis.
- BLAST matches a query sequence to particular positions in the database sequences.
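The idea of matching a query to positions in database sequences can be sketched with a simplified seeding step. This is only an illustration under stated assumptions: it looks for exact shared words of length 3 (the default word size quoted later in these slides), whereas real BLAST also accepts similar "neighborhood" words scored with a substitution matrix and then extends each seed into an alignment. The function names and the toy database are hypothetical.

```python
from collections import defaultdict


def build_word_index(db, word_size=3):
    """Map each word of length word_size to the (seq_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in db.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos))
    return index


def find_seeds(query, index, word_size=3):
    """Return (query_pos, seq_id, db_pos) for every exact word shared with the database."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for seq_id, dpos in index.get(word, []):
            seeds.append((qpos, seq_id, dpos))
    return seeds


# Toy database (made-up sequences, 0-based positions).
db = {"s1": "MKVLDISG", "s2": "GGGGGG"}
seeds = find_seeds("DLGDISGI", build_word_index(db))
```

Because the index is built once and each query word is a single lookup, the search never compares the query against every database position, which is where much of the speed of a heuristic like this comes from.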
Quick Diversion
- BLAST example
- Sequence to be queried:
- TSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQ
- Sequences producing significant alignments:
  Sequence                                                         Score (bits)  E value
  pdb|2FXP|A  Chain A, Solution Structure Of The Sars-Coronaviru...        82.4    3e-17
  pdb|2BEZ|F  Chain F, Structure Of A Proteolitically Resistant ...        81.6    5e-17
  pdb|1WNC|A  Chain A, Crystal Structure Of The Sars-Cov Spike P...        77.8    7e-16
  pdb|1WYY|A  Chain A, Post-Fusion Hairpin Conformation Of The S...        76.6    1e-15
  pdb|2BEQ|D  Chain D, Structure Of A Proteolytically Resistant ...        69.7    2e-13
  pdb|1ZVA|A  Chain A, A Structure-Based Mechanism Of Sars Virus...        68.6    5e-13
  pdb|1ZV7|A  Chain A, A Structure-Based Mechanism Of Sars Virus...        65.9    3e-12
  pdb|1ZV8|B  Chain B, A Structure-Based Mechanism Of Sars Virus...        65.5    4e-12
  pdb|1WDG|A  Chain A, Crystal Structure Of Mhv Spike Protein Fu...        25.4    4.7
  pdb|2A11|A  Chain A, Crystal Structure Of Nuclease Domain Of R...        24.3    9.1
BLAST Functions in Databases
- BLAST is one of the most heavily used data analysis tools available, so systems for large-scale data analysis need to support BLAST functions.
- BLAST support is achieved by defining a set of user-defined functions that return BLAST results as a table.
- Many databases support BLAST functions.
- The two major BLAST functions are:
- BLAST_MATCH
- BLAST_ALIGN
The BLAST Functions
  FUNCTION BLASTP_MATCH (
      query_seq              CLOB,
      seqdb_cursor           REF CURSOR,
      subsequence_from       NUMBER   DEFAULT 1,
      subsequence_to         NUMBER   DEFAULT -1,
      filter_low_complexity  BOOLEAN  DEFAULT false,
      mask_lower_case        BOOLEAN  DEFAULT false,
      sub_matrix             VARCHAR2 DEFAULT 'BLOSUM62',
      expect_value           NUMBER   DEFAULT 10,
      open_gap_cost          NUMBER   DEFAULT 11,
      extend_gap_cost        NUMBER   DEFAULT 1,
      word_size              NUMBER   DEFAULT 3,
      x_dropoff              NUMBER   DEFAULT 15,
      final_x_dropoff        NUMBER   DEFAULT 25)
  RETURN TABLE OF ROW (t_seq_id VARCHAR2, score NUMBER, expect NUMBER)
Parameter Description
- query_seq: The query sequence to search. A sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input.
- seqdb_cursor: The cursor parameter supplied by the user when calling the function. Each row it returns should have two columns, the sequence identifier and the sequence string.
- subsequence_from: Start position of the region of the query sequence to be used for the search. The default is 1.
- subsequence_to: End position of the region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as subsequence_to. The default is -1.
- filter_low_complexity: TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is applied only to the query sequence. The default value is FALSE.
- mask_lower_case: TRUE or FALSE. If TRUE, you can specify the query sequence in upper-case characters and mark the areas to be filtered out in lower case. This customizes what is filtered from the sequence. The default value is FALSE.
- sub_matrix: Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The options are PAM30, PAM70, BLOSUM80, BLOSUM62 and BLOSUM45. The default is BLOSUM62.
- expect_value: The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior.
- open_gap_cost: The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior.
- extend_gap_cost: The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior.
- word_size: The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior.
- x_dropoff: Dropoff for BLAST extensions, in bits. The default value is 15. Specifying 0 invokes default behavior.
- final_x_dropoff: The final X dropoff value for gapped alignments, in bits. The default value is 25. Specifying 0 invokes default behavior.
- t_seq_id: The sequence identifier of the returned match.
- score: The score of the returned match.
- expect: The expect value of the returned match.
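The subsequence_from and subsequence_to rules above can be sketched as a small helper. This is an illustration only, not the database's implementation: it assumes positions are 1-based and inclusive, which the slides imply (default start of 1) but do not state outright, and the function name is hypothetical.

```python
def query_region(query_seq: str, subsequence_from: int = 1, subsequence_to: int = -1) -> str:
    """Return the part of query_seq that would be searched.

    Positions are treated as 1-based and inclusive (an assumption);
    subsequence_to == -1 means "up to the end of the sequence".
    """
    end = len(query_seq) if subsequence_to == -1 else subsequence_to
    if not 1 <= subsequence_from <= end <= len(query_seq):
        raise ValueError("region must lie within the query sequence")
    return query_seq[subsequence_from - 1:end]
```

With the defaults (1 and -1) the whole query is used, which matches the way the example queries in these slides pass `1, -1`.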
How the Whole System Works
- Sequences that need to be searched are inserted into a query table:
  INSERT INTO query_db VALUES (1, 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGT');
How Does It Work
  SELECT t_seq_id, score, expect AS evalue
  FROM TABLE(BLASTP_MATCH(
      (SELECT sequence FROM query_db),             -- query_seq
      CURSOR(SELECT seq_id, seq_data
             FROM swissprot
             WHERE organism = 'Homo sapiens (Human)'),  -- seqdb_cursor
      1,             -- subsequence_from
      -1,            -- subsequence_to
      0,             -- filter_low_complexity
      0,             -- mask_lower_case
      'BLOSUM62',    -- sub_matrix
      10,            -- expect_value
      0,             -- open_gap_cost
      0,             -- extend_gap_cost
      0,             -- word_size
      0,             -- x_dropoff
      0)) t          -- final_x_dropoff
  WHERE t.score > 25
The Search Procedure
  SELECT t.t_seq_id, t.score, t.expect, p.name
  FROM PROT_DB p, TABLE(
      BLASTP_MATCH (
          (SELECT sequence FROM query_db WHERE sequence_id = 2),
          CURSOR(SELECT seq_id, sequence FROM PROT_DB),
          1,
          -1,
          0,
          0,
          'BLOSUM62',
          10,
          0,
          0,
          0,
          0,
          0)
      ) t
  WHERE t.t_seq_id = p.seq_id AND t.score > 25
  ORDER BY t.expect
Output Results
  SEQ_ID        SCORE      EVALUE
  --------  ---------- ----------
  P31946           205 5.8977E-18
  Q04917           198 3.8228E-17
  P31947           169 8.8130E-14
  P27348           198 3.8228E-17
  P58107            49 7.24297332
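The filtering and ordering that the search query asks for (score above 25, ordered by expect value) can be mimicked on the rows above. This is a sketch for illustration only; the function name and the optional `max_expect` cutoff are additions that do not appear in the slides.

```python
# Result rows copied from the output table: (seq_id, score, expect).
rows = [
    ("P31946", 205, 5.8977e-18),
    ("Q04917", 198, 3.8228e-17),
    ("P31947", 169, 8.8130e-14),
    ("P27348", 198, 3.8228e-17),
    ("P58107", 49, 7.24297332),
]


def significant_hits(rows, min_score=25, max_expect=None):
    """Keep rows scoring above min_score, optionally capped by max_expect, best E-value first."""
    kept = [
        r for r in rows
        if r[1] > min_score and (max_expect is None or r[2] <= max_expect)
    ]
    # sorted() is stable, so rows with equal E-values keep their input order.
    return sorted(kept, key=lambda r: r[2])
```

P58107 passes the score filter, but its expect value of about 7.2 means roughly seven hits this good would be expected by chance, so a tighter cutoff such as `max_expect=0.01` drops it.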
The Databases and Why
- The ability to perform genome-wide and cross-genome data analysis can reduce the time required for new biological discoveries.
- Since traditional databases are not built to support location datatypes, researchers have had to find ways for these databases to manage biological information so that it can be queried with a modern database system.
- This research has led to a concept called bioindexing.
Bioindexing
- An index in this context is basically a way of providing a mapping between information entities.
- In a traditional database, an index is an auxiliary structure that speeds up data retrieval by providing a mapping between a record key and the physical disk address of the records containing that key.
- Bioindexing provides functionality similar to a database index but also facilitates data integration.
- Biological features are generally attached to locations, and locations are also the basis for maps (a map in this context is an association of features with a sequence alignment), alignments (relationships between two genomic sequence segments) and other complex relationships.
The BLAST Database and Bioindexing
- Bioindexing is essentially an infrastructure for representing and managing biological knowledge in a large-scale database system using index constructs.
- Bioindexing uses the location datatype and BLAST joins to handle and query the large amount of data efficiently.
- In short, bioindexing is a scheme for connecting and querying information in modern database systems through the use of indexes.
Types of Indexing
- Intrinsic indexing: indexable bioinformatics datatypes. Intrinsic indexing permits both the representation and the management of biological mappings.
- Extrinsic indexing: an efficient way of integrating data from heterogeneous sources such as relational tables, XML files, standard sequence formats and other sources.
- Extrinsic indexing concerns the functions and algorithms used to access and connect this information, even when it is not stored locally.
Location (How It Is Represented)
- Without proper abstraction, users have to implement their own code to handle location operations.
- A location consists of a sequence identifier and an interval range.
- Integer intervals are modeled as a (lower, upper) structure.
- Identifiers are character strings or accession numbers that denote a particular sequence; the interval range is a pair of positive integers that denotes the sub-range within that sequence.
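The representation described above (a sequence identifier plus a pair of positive integers) can be sketched as a small value type. This is an illustrative model, not the database's own datatype: the class name, the method names and the choice of an inclusive upper bound are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    seq_id: str  # sequence identifier or accession number
    lower: int   # start of the interval (positive integer)
    upper: int   # end of the interval, treated as inclusive (assumption)

    def __post_init__(self):
        if not 0 < self.lower <= self.upper:
            raise ValueError("expected 0 < lower <= upper")

    def overlaps(self, other: "Location") -> bool:
        """True if both locations are on the same sequence and share a position."""
        return (self.seq_id == other.seq_id
                and self.lower <= other.upper
                and other.lower <= self.upper)

    def contains(self, other: "Location") -> bool:
        """True if other lies entirely within this location on the same sequence."""
        return (self.seq_id == other.seq_id
                and self.lower <= other.lower
                and other.upper <= self.upper)
```

Putting the identifier check inside each operation is the kind of abstraction the first bullet refers to: callers compare locations directly instead of repeating identifier and bound comparisons in every query.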
Complexity (WHERE Clauses) without Location Datatypes
- EST sequences need to be grouped over consecutive overlapping EST fragments:
  SELECT DISTINCT A.id, A.lower, B.upper
  FROM ESTs AS A, ESTs AS B
  WHERE A.unigene_clusterid = B.unigene_clusterid
    AND A.lower < B.upper
    AND NOT EXISTS
      (SELECT *
       FROM ESTs AS C
       WHERE C.unigene_clusterid = A.unigene_clusterid
         AND A.lower < C.lower AND C.lower < B.upper
         AND NOT EXISTS
           (SELECT * FROM ESTs AS D
            WHERE D.unigene_clusterid = A.unigene_clusterid
              AND D.lower < C.lower AND C.lower < D.upper))
    AND NOT EXISTS
      (SELECT *
       FROM ESTs AS E
       WHERE E.unigene_clusterid = A.unigene_clusterid
         AND ((E.lower < A.lower AND A.lower < E.upper) OR
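For comparison, the grouping that the query above is working toward, merging consecutive overlapping fragments within each cluster, can be sketched procedurally. This is not a translation of the SQL (which is cut off on the slide) but an illustration of the same goal. The tuple layout `(cluster_id, lower, upper)` and the choice to treat fragments that merely touch as overlapping are assumptions.

```python
from collections import defaultdict


def merge_overlapping(ests):
    """Merge overlapping intervals per cluster.

    ests: iterable of (cluster_id, lower, upper) tuples.
    Returns a list of merged (cluster_id, lower, upper) tuples.
    """
    by_cluster = defaultdict(list)
    for cluster_id, lower, upper in ests:
        by_cluster[cluster_id].append((lower, upper))

    merged = []
    for cluster_id in sorted(by_cluster):
        current_lower, current_upper = None, None
        for lower, upper in sorted(by_cluster[cluster_id]):
            if current_lower is None:
                current_lower, current_upper = lower, upper
            elif lower <= current_upper:
                # Overlaps (or touches) the running interval: extend it.
                current_upper = max(current_upper, upper)
            else:
                # Gap found: close the running interval and start a new one.
                merged.append((cluster_id, current_lower, current_upper))
                current_lower, current_upper = lower, upper
        merged.append((cluster_id, current_lower, current_upper))
    return merged
```

The sort-and-sweep takes a few lines here, while the relational version needs nested NOT EXISTS subqueries, which is the complexity the slide title refers to.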
Location Datatype
- A straightforward representation of a location is a sequence identifier as a character string plus the location interval as a (start, end) pair of integers.
- Other representations are possible, such as integer codes for sequence identifiers and/or a (start, length) interval representation.
- Most databases use the sequence identifier with a (start, end) pair of integers. Why? Because it is simple.
Simplicity Using the Location Datatype: Creation and Insertion
  CREATE TABLE features ( location loc, description text );
  -- The Prader-Willi/Angelman syndrome region on chromosome 15
  INSERT INTO features VALUES ( 'NG_002690:1..755217', 'Prader-Willi/Angelman syndrome region' );
  INSERT INTO features VALUES ( 'NG_002690:1..174707', 'AC090602.16' );
  INSERT INTO features VALUES ( 'NG_002690:174707..324834', 'AC124312.5' );
  INSERT INTO features VALUES ( 'NG_002690:324835..478258', 'AC124303.5' );
  INSERT INTO features VALUES ( 'NG_002690:478259..606120', 'AC100774.2' );
  INSERT INTO features VALUES ( 'NG_002690:606121..755217', 'AC124997.4' );
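To make the textual form of a location concrete, the sketch below parses a string into its identifier and range. It assumes the notation `ID:start..end`, in which a colon separates the accession from the range; the separator is an inference from the accession NG_002690 and the ranges shown in the inserts above, and the function name is hypothetical.

```python
import re

# Assumed notation: <identifier>:<start>..<end>
_LOCATION_RE = re.compile(r"^\s*([^:\s]+):(\d+)\.\.(\d+)\s*$")


def parse_location(text):
    """Parse 'ID:start..end' into a (seq_id, start, end) tuple."""
    m = _LOCATION_RE.match(text)
    if m is None:
        raise ValueError(f"not a location: {text!r}")
    seq_id, start, end = m.group(1), int(m.group(2)), int(m.group(3))
    if not 0 < start <= end:
        raise ValueError(f"invalid range in {text!r}")
    return seq_id, start, end
```

Keeping identifier and range in a single literal is what lets each INSERT above describe a feature's position with one value instead of three separate columns.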
- Introducing a location datatype not only provides a natural and intuitive way to represent biological information, but also boosts system performance.
- A further performance increase can be achieved by supporting a location index scheme.
- Support for indexing schemes in traditional relational database systems is very limited and inflexible.
- It is limited to a few well-known index structures, such as B-tree, hash and R-tree, which can be used only with a limited set of native datatypes for (in)equality and range queries.
- The location datatype supports a set of operations and functions.
- A major proportion of these functions relate to interval operations.
- More than 30 interval operations are defined, including Allen's interval logic [15] (which includes after, before, contains, during, equals, overlaps, overlapped by, finishes, finished by, meets, met by, starts and started by).
- Optimization information (such as ordering, commutativity or negation) is also provided to permit optimization of important operations like merge-join, hash-join or general theta-join.
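The thirteen Allen relations listed above can be told apart with a handful of endpoint comparisons. The sketch below is an illustration only: it assumes intervals are `(lower, upper)` pairs with `lower < upper`, and the function name is hypothetical rather than one of the datatype's actual operators.

```python
def allen_relation(a, b):
    """Return the name of the Allen relation that interval a has to interval b.

    Both intervals are (lower, upper) pairs with lower < upper.
    """
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Only partial overlaps remain at this point.
    return "overlaps" if a1 < b1 else "overlapped by"
```

Exactly one relation holds for any pair of intervals, and swapping the arguments gives the inverse relation (for example "before" becomes "after"), which is the sort of commutativity and negation information an optimizer can use.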
Why the Location Datatype Is Needed
- Here is a simple example that demonstrates the power of location datatype support. It shows a session that painfully attempts to locate alternatively spliced exon intervals that intersect with known homology intervals, and to associate them with known protein features from the Pfam and Swiss-Prot databases.
Complexity without Locations
  CREATE TABLE alt_splice_homology_map AS
  SELECT o.*, d.swiss_id, d.query_start, d.query_end,
         -- column aliases assumed; the second query refers to o.hit_end
         d.hit_start + (o.seq_start - d.query_start)/3 AS hit_start,
         d.hit_start + (o.seq_end - d.query_start)/3 AS hit_end
  FROM alt_splice_exon_obs o, alt_splice_homology d
  WHERE o.ug_id = d.ug_id
    AND o.seq_start > d.query_start
    AND o.seq_start < d.query_end
    AND d.e_value < 0.01
  GROUP BY o.ug_id, o.seq_start;

  SELECT o.*, f.type, f.start, f.end
  FROM alt_splice_homology_map o, swiss_feature f
  WHERE o.swiss_id = f.swiss_id
    AND o.hit_end > f.start
    AND o.hit_end < f.end;
Simplicity Using Locations
  CREATE TABLE alt_splice_homology_map AS
  SELECT o.*, d.location,
         range_start(d.query) + (o.location - range_start(d.hit))/3
  FROM alt_splice_exon_obs o, alt_splice_homology d
  WHERE o.location @ d.location  -- contained
    AND d.e_value < 0.01
  GROUP BY o;

  SELECT o.*, f.type, f.location
  FROM alt_splice_homology_map o, swiss_feature f
  WHERE o.location < f.location  -- left overlap
Location Support
- Supporting location indexing in a traditional database implies the need to support interval indexing.
- However, interval indexing is not supported in traditional databases, and standard join operations cannot handle intervals efficiently. This has led to extensive research on interval indexing.
- This is where GiST comes in.
GiST
- GiST (Generalized Search Tree) is an efficient solution to the problem of ineffective interval indexing in traditional databases.
- A GiST is basically a balanced search tree in which keys are maintained hierarchically. A search key may be any arbitrary predicate, but that predicate must hold true for all the data stored below the key.
- GiST searches by traversing the tree depth-first. If the query predicate is consistent with a given search key, the search continues into the subtree below that key.
GiST Implementation
- GiST is implemented using bounding intervals that cover the range of:
- Identifier integers (id_lower, id_upper)
- and
- Intervals in the subtree (lower, upper)
- Under the GiST architecture, interval predicates such as left, right, overlap, overleft, overright, contains, contained and equal are all supported.
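The search behavior described for GiST (bounding intervals as keys, depth-first descent into subtrees whose key is consistent with the query) can be sketched with a small hand-built tree. This is a simplified illustration and not the GiST implementation: it covers only the (lower, upper) interval bound and the overlap predicate, omits the (id_lower, id_upper) identifier bound as well as insertion and balancing, and all names are hypothetical. The leaf intervals reuse the clone ranges from the feature table earlier in the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Interval = Tuple[int, int]


@dataclass
class Node:
    key: Interval                      # bounding interval covering everything below
    children: List["Node"] = field(default_factory=list)
    value: Optional[str] = None        # set only on leaf entries


def consistent(key: Interval, query: Interval) -> bool:
    """A subtree can hold a match only if its bounding interval overlaps the query."""
    return key[0] <= query[1] and query[0] <= key[1]


def search(node: Node, query: Interval) -> List[str]:
    """Depth-first search that skips subtrees whose key is inconsistent with the query."""
    if not consistent(node.key, query):
        return []
    if not node.children:
        return [node.value] if node.value is not None else []
    hits: List[str] = []
    for child in node.children:
        hits.extend(search(child, query))
    return hits


tree = Node((1, 755217), [
    Node((1, 324834), [
        Node((1, 174707), value="AC090602.16"),
        Node((174707, 324834), value="AC124312.5"),
    ]),
    Node((324835, 755217), [
        Node((324835, 478258), value="AC124303.5"),
        Node((478259, 606120), value="AC100774.2"),
        Node((606121, 755217), value="AC124997.4"),
    ]),
])
```

A query for positions 170000 to 180000 never descends into the second subtree, because its bounding interval starts at 324835; that pruning is what makes the index faster than scanning every interval.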
What GiST Location Does
Conclusion
- Bioinformatics databases are being modeled and queried using functions (as seen in Oracle and IBM DB2).
- An efficient way of modeling these databases is bioindexing (as seen in the PostgreSQL database).
- Using an index structure as in bioindexing, where locations are organized in a tree that is searched depth-first, leads to less complexity.
- This location index structure leads to faster searching of the databases.
- Speed is very important in bioinformatics.
- Using a GiST architecture leads to less complex queries and a more confined search space for the queried information.
References
- The Index as a First-Class Construct in Relational Database Systems. D. Stott Parker, Edwin Mach.
- Algorithms and Databases in Bioinformatics: Towards a Proteomic Ontology. Mario Cannataro, Pietro Hiram Guzzi, Tommaso Mazza, Giuseppe Tradigo and Pierangelo Veltri.
- Oracle Data Mining.
- Mobile Access to Biological Databases on the Internet. Pentti Riikonen, Jorma Boberg, Tapio Salakoski and Mauno Vihinen.
- Utilizing Multiple Bioinformatics Information Sources: An XML Database Approach. Raymond K. Wong, William M. Shui.
- Support for BioIndexing in BLASTgres. Ruey-Lung Hsiao, D. Stott Parker and Hung-chih Yang.