Bioinformatics Data Representation and Integration

Slides: 46

Provided by: Big76

1
Bioinformatics Data Representation and Integration

By
Ngozi Oleleh

2
Table of Contents

Introduction to Bioinformatics
Proteins and Sequences
Bioinformatics Tools
The databases
Blast Functions
Bioindexing
Conclusion

3
What is Bioinformatics

Bioinformatics is the use of computers to study and handle biological information.
Bioinformatics can be viewed as an integration of computer science and biology that helps enhance the study of biological data, which has proven to be very extensive.
The role of computer science in this interdisciplinary field is to store the data (via databases) for future analysis with biological tools.
The field includes, but is not limited to, the study of genes, DNA sequences and protein structures.

4
Protein and Sequences

Biological proteins are made up of 20 amino acids:
alanine - ala - A
arginine - arg - R
asparagine - asn - N
aspartic acid - asp - D
cysteine - cys - C
glutamine - gln - Q
glutamic acid - glu - E
glycine - gly - G
histidine - his - H
isoleucine - ile - I
leucine - leu - L
lysine - lys - K
methionine - met - M
phenylalanine - phe - F
proline - pro - P
serine - ser - S
threonine - thr - T
tryptophan - trp - W
tyrosine - tyr - Y
valine - val - V
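
The three-letter and one-letter codes listed above lend themselves to a simple lookup table. The following is a minimal Python sketch, assuming the standard alphabet of 20 amino acids; the names `THREE_TO_ONE` and `to_one_letter` are illustrative and do not come from the presentation.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
    "gln": "Q", "glu": "E", "gly": "G", "his": "H", "ile": "I",
    "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
    "ser": "S", "thr": "T", "trp": "W", "tyr": "Y", "val": "V",
}


def to_one_letter(residues):
    """Convert an iterable of three-letter codes into a one-letter sequence string."""
    return "".join(THREE_TO_ONE[r.lower()] for r in residues)
```

For example, `to_one_letter(["Thr", "Ser", "Pro", "Asp"])` yields the four-letter string that opens the query sequence used in the BLAST example later in the deck.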

5
Proteins and Sequences

Combinations of these amino acids make up protein sequences and structures.
The PDB database contains numerous protein structures, many of which are similar by sequence alignment or fold recognition.
Bioinformatics studies the differences and similarities of these protein structures based on sequence similarity.
A sequence is an ordered combination of amino acids.
These sequences can carry biological data that can be used to denote information about families of proteins.

6
Bioinformatics Tools

Mage: used to display protein singular structures
RasMol: used to display protein 3D structure
LALIGN: used for pairwise sequence alignment
ClustalW: used for multiple sequence alignment
AMMP: molecular modeling
Sequence alignment tools:
FASTA
BLAST (will be looked at extensively)

8
Biological Databases

There are over 5000 public biological databases.
These databases contain genomic, proteomic and microarray data.
This data is made up of the sequences of genes or the amino acid sequences of proteins.
Biological databases have become very useful to scientists. They are important in understanding and explaining a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species.

This knowledge helps facilitate the fight against
diseases, assists in the development of
medications and in discovering basic
relationships amongst species in the history of
life.
The biological knowledge is distributed amongst
many different general and specialized databases.
This sometimes makes it difficult to ensure the
consistency of information.
Biological databases cross-reference other
databases with accession numbers as one way of
linking their related knowledge together.

Bioinformatics databases can be grouped into two groups: generalized databases and specialized databases.
Generalized databases
Primary sequence databases (EMBL, GenBank, DDBJ)
Protein sequence databases (Swiss-Prot, UniProt, UniRef)
Carbohydrate databases (CarbBank)
3D structure databases (PDB, EBI-MSD, NDB)
11
Specialized Databases

Specialized databases
Specialized sequence databases
Genome databases
Specialized protein sequence databases
Specialized structure databases
Microarray databases
The main focus here is on the generalized databases

12
Primary Sequence Database

Primary sequence databases
EMBL (European Molecular Biology Laboratory nucleotide sequence database at EBI, Hinxton, UK)
GenBank (at the National Center for Biotechnology Information, NCBI, Bethesda, MD, USA)
DDBJ (DNA Data Bank of Japan at CIB, Mishima, Japan)

13
Protein Sequence Database

Protein sequence databases
SWISS-PROT (Swiss Institute of Bioinformatics, SIB, Geneva, CH)
TrEMBL (Translated EMBL: computer-annotated protein sequence database at EBI, UK)
PIR-PSD (PIR-International Protein Sequence Database: annotated protein database by PIR, MIPS and JIPID at NBRF, Georgetown University, USA)
UniProt (joined data from Swiss-Prot, TrEMBL and PIR)
UniRef (UniProt NREF (Non-redundant REFerence) database at EBI, UK)
IPI (International Protein Index: human, rat and mouse proteome database at EBI, UK)

14
Other Databases

Carbohydrate databases
CarbBank (former complex carbohydrate structure database)
3D structure databases
PDB (Protein Data Bank, curated by RCSB, USA)
EBI-MSD (Macromolecular Structure Database at EBI, UK)
NDB (Nucleic Acid Database at Rutgers, The State University of New Jersey, USA)

15
Blast

BLAST is a heuristic algorithm for detecting sequence similarity and is optimized for speed. It is suitable for large-scale analysis.
What BLAST does is match a query sequence to certain positions of database sequences.

16
Quick Diversion

BLAST Example
Sequence to be queried:
TSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQ

Sequences producing significant alignments:
Sequence                                                           Score (Bits)  E Value
pdb|2FXP|A  Chain A, Solution Structure Of The Sars-Coronaviru...  82.4          3e-17
pdb|2BEZ|F  Chain F, Structure Of A Proteolitically Resistant ...  81.6          5e-17
pdb|1WNC|A  Chain A, Crystal Structure Of The Sars-Cov Spike P...  77.8          7e-16
pdb|1WYY|A  Chain A, Post-Fusion Hairpin Conformation Of The S...  76.6          1e-15
pdb|2BEQ|D  Chain D, Structure Of A Proteolytically Resistant ...  69.7          2e-13
pdb|1ZVA|A  Chain A, A Structure-Based Mechanism Of Sars Virus...  68.6          5e-13
pdb|1ZV7|A  Chain A, A Structure-Based Mechanism Of Sars Virus...  65.9          3e-12
pdb|1ZV8|B  Chain B, A Structure-Based Mechanism Of Sars Virus...  65.5          4e-12
pdb|1WDG|A  Chain A, Crystal Structure Of Mhv Spike Protein Fu...  25.4          4.7
pdb|2A11|A  Chain A, Crystal Structure Of Nuclease Domain Of R...  24.3          9.1
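
An E-value estimates how many hits with at least that score would be expected by chance, so smaller values indicate more significant matches. The sketch below, in Python, keeps only the hits from the list above whose E-value falls under a cutoff. The cutoff of 0.001 and the names `hits` and `significant` are assumptions chosen for illustration; they are not part of the BLAST output.

```python
# (PDB entry, chain, bit score, E-value) copied from the result list above.
hits = [
    ("2FXP", "A", 82.4, 3e-17),
    ("2BEZ", "F", 81.6, 5e-17),
    ("1WNC", "A", 77.8, 7e-16),
    ("1WYY", "A", 76.6, 1e-15),
    ("2BEQ", "D", 69.7, 2e-13),
    ("1ZVA", "A", 68.6, 5e-13),
    ("1ZV7", "A", 65.9, 3e-12),
    ("1ZV8", "B", 65.5, 4e-12),
    ("1WDG", "A", 25.4, 4.7),
    ("2A11", "A", 24.3, 9.1),
]


def significant(hit_list, max_evalue=1e-3):
    """Return the hits whose E-value does not exceed max_evalue (assumed cutoff)."""
    return [hit for hit in hit_list if hit[3] <= max_evalue]
```

With this cutoff the first eight hits are kept, while the last two (E-values 4.7 and 9.1) are dropped as likely chance matches.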

18
Blast Functions in Databases

BLAST is one of the most heavily used data analysis tools available, so systems for large-scale data analysis need to support BLAST functions.
BLAST support is achieved by defining a set of user-defined functions that return BLAST results as a table.
Many databases support BLAST functions.
BLAST's two major functions are:
BLAST_MATCH
BLAST_ALIGN

19
The Blast Functions

function BLASTP_MATCH (
  query_seq CLOB,
  seqdb_cursor REF CURSOR,
  subsequence_from NUMBER default 1,
  subsequence_to NUMBER default -1,
  filter_low_complexity BOOLEAN default false,
  mask_lower_case BOOLEAN default false,
  sub_matrix VARCHAR2 default 'BLOSUM62',
  expect_value NUMBER default 10,
  open_gap_cost NUMBER default 11,
  extend_gap_cost NUMBER default 1,
  word_size NUMBER default 3,
  x_dropoff NUMBER default 15,
  final_x_dropoff NUMBER default 25)
return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER)

20
Parameter Description

query_seq: The query sequence to search. A sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input.
seqdb_cursor: The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row: the sequence identifier and the sequence string.
subsequence_from: Start position of a region of the query sequence to be used for the search. The default is 1.
subsequence_to: End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as subsequence_to. The default is -1.
filter_low_complexity: TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence. The default value is FALSE.
mask_lower_case: TRUE or FALSE. If TRUE, you can specify a sequence in upper case characters as the query sequence and denote areas to be filtered out with lower case. This customizes what is filtered from the sequence. The default value is FALSE.
sub_matrix: Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are PAM30, PAM70, BLOSUM80, BLOSUM62, and BLOSUM45. The default is BLOSUM62.
expect_value: The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior.
open_gap_cost: The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior.
extend_gap_cost: The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior.
word_size: The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior.
x_dropoff: Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior.
final_x_dropoff: The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior.
t_seq_id: The sequence identifier of the returned match.
score: The score of the returned match.
expect: The expect value of the returned match.
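
To make the word_size parameter more concrete, here is a toy Python sketch of the seeding idea: the query is divided into overlapping words of length 3, and each word is looked up in a database sequence. This is a deliberate simplification and not the BLASTP_MATCH implementation. Real BLAST also considers similar "neighborhood" words scored with the substitution matrix and then extends each seed, which is omitted here. The function names are hypothetical.

```python
def query_words(query, word_size=3):
    """Return every overlapping word of length word_size with its query offset."""
    return [(i, query[i:i + word_size])
            for i in range(len(query) - word_size + 1)]


def exact_seed_hits(query, subject, word_size=3):
    """Find (query_offset, subject_offset) pairs where a query word occurs exactly."""
    hits = []
    for q_off, word in query_words(query, word_size):
        s_off = subject.find(word)
        while s_off != -1:
            hits.append((q_off, s_off))
            s_off = subject.find(word, s_off + 1)  # keep looking further right
    return hits
```

A larger word size produces fewer, more specific seeds and therefore a faster but less sensitive search, which is the trade-off the parameter exposes.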

22
How the whole system Works

Sequences that need to be searched are inserted into a query table:
INSERT INTO query_db VALUES (1, 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGT');

23
How does it work

SELECT T_SEQ_ID, score, EXPECT AS evalue
FROM TABLE(
  BLASTP_MATCH (
    (SELECT sequence FROM query_db),             -- query_sequence
    CURSOR(SELECT seq_id, seq_data
           FROM swissprot
           WHERE organism = 'Homo sapiens (Human)'),  -- seqdb_cursor
    1,            -- subsequence_from
    -1,           -- subsequence_to
    0,            -- FILTER_LOW_COMPLEXITY
    0,            -- MASK_LOWER_CASE
    'BLOSUM62',   -- SUB_MATRIX
    10,           -- EXPECT_VALUE
    0,            -- OPEN_GAP_COST
    0,            -- EXTEND_GAP_COST
    0,            -- WORD_SIZE
    0,            -- X_DROPOFF
    0)            -- FINAL_X_DROPOFF
  ) t
WHERE t.score > 25;

24
The Search Procedure

SELECT t.t_seq_id, t.score, t.expect, p.name
FROM PROT_DB p, TABLE(
  BLASTP_MATCH (
    (SELECT sequence FROM query_db WHERE sequence_id = 2),
    CURSOR(SELECT seq_id, sequence FROM PROT_DB),
    1,
    -1,
    0,
    0,
    'BLOSUM62',
    10,
    0,
    0,
    0,
    0,
    0)
  ) t
WHERE t.t_seq_id = p.seq_id AND t.score > 25
ORDER BY t.expect;

25
Output Results

SEQ_ID SCORE EVALUE
-------- ---------- ----------
P31946 205 5.8977E-18
Q04917 198 3.8228E-17
P31947 169 8.8130E-14
P27348 198 3.8228E-17
P58107 49 7.24297332

26
The Databases and Why

The ability to perform genome-wide and cross-genome data analysis can reduce the time required for new biological discoveries.
Since traditional databases are not built to support location datatypes, researchers have had to find ways for these databases to manage biological information so that it can be queried with a modern database system.
This research has led to a concept called bioindexing.

27
Bioindexing

An index in this context is basically a way of providing a mapping between information entities.
In a traditional database, an index is an auxiliary structure that speeds up data retrieval by providing a mapping between a record key and the physical disk address of the records containing the key.
Bioindexing provides functionality similar to a database index but also facilitates data integration.
Biological features are generally attached to locations, and locations are also the basis for maps (a map in this context is an association of features with a sequence alignment), alignments (relationships between two genomic sequence segments) and other complex relationships.
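
The idea of an index as a mapping from a record key to where the matching records live can be sketched in a few lines of Python. In this illustration a list position stands in for the physical disk address, and the record layout and the name `build_index` are assumptions made for the example rather than anything defined in the presentation.

```python
from collections import defaultdict


def build_index(records, key):
    """Map each key value to the positions of the records that contain it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[key]].append(position)
    return index


# Hypothetical feature records keyed by the sequence they are attached to.
features = [
    {"seq_id": "NG_002690", "description": "AC090602.16"},
    {"seq_id": "NG_002690", "description": "AC124312.5"},
    {"seq_id": "P31946", "description": "example protein feature"},
]
by_sequence = build_index(features, "seq_id")
```

A lookup such as `by_sequence["NG_002690"]` then returns the matching positions directly instead of scanning every record, which is the speed-up an index provides.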

28
The Blast Database and Bioindexing

Bioindexing is essentially an infrastructure for representing and managing biological knowledge in a large-scale database system using index constructs.
Bioindexing uses the location datatype and BLAST joins to efficiently handle and query large amounts of data.
In short, bioindexing is a scheme for connecting and querying information in modern database systems through the use of indexes.

29
Types of Indexing

Intrinsic indexing: indexable bioinformatics datatypes. Intrinsic indexing permits both the representation and the management of biological mappings.
Extrinsic indexing: an efficient way of integrating data from heterogeneous sources such as relational tables, XML files, standard sequence formats and other sources.
Extrinsic indexing concerns the functions and algorithms used to access and connect this information, even when it is not stored locally.

31
Location (How it is represented)

Without proper abstraction, users have to implement their own code to handle location operations.
A location consists of a sequence identifier and an interval range.
Integer intervals are modeled as a (lower, upper) structure.
Identifiers are character strings or accession numbers that denote a particular sequence; the interval range is a pair of positive integers that denotes the sub-range within the given sequence.
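
The location abstraction described above, an identifier plus a (lower, upper) pair of positive integers, can be sketched as a small Python type. The class name `Location`, the helper `parse_location` and the textual form `seq_id:lower..upper` are assumptions made for this illustration; they are not taken from any particular database system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    seq_id: str   # character string or accession number
    lower: int    # first position of the sub-range
    upper: int    # last position of the sub-range

    def __post_init__(self):
        if not (0 < self.lower <= self.upper):
            raise ValueError("expected positive integers with lower <= upper")


def parse_location(text):
    """Parse 'seq_id:lower..upper' (textual form assumed for illustration)."""
    seq_id, _, interval = text.strip().rpartition(":")
    lower, _, upper = interval.partition("..")
    return Location(seq_id, int(lower), int(upper))
```

Keeping the validation inside the type means every operation built on top of it can rely on well-formed intervals, which is the kind of abstraction users would otherwise have to reimplement themselves.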

32
Complexity (WHERE Clauses) Without Location Datatypes
EST sequences need to be grouped over consecutive overlapping EST fragments

SELECT DISTINCT A.id, A.lower, B.upper
FROM ESTs AS A, ESTs AS B
WHERE A.unigene_clusterid = B.unigene_clusterid
AND A.lower < B.upper
AND NOT EXISTS
(SELECT *
FROM ESTs AS C
WHERE C.unigene_clusterid = A.unigene_clusterid
AND A.lower < C.lower AND C.lower < B.upper
AND NOT EXISTS
(SELECT * FROM ESTs AS D
WHERE D.unigene_clusterid = A.unigene_clusterid
AND D.lower < C.lower AND C.lower < D.upper))
AND NOT EXISTS
(SELECT *
FROM ESTs AS E
WHERE E.unigene_clusterid = A.unigene_clusterid
AND ((E.lower < A.lower AND A.lower < E.upper) OR
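
For comparison, the grouping that the query above appears to express (collapsing consecutive overlapping EST fragments within a cluster into maximal intervals) takes only a sort and a single pass in procedural code. The Python sketch below assumes each EST is a `(cluster_id, lower, upper)` tuple and treats touching intervals as overlapping. It illustrates the task and is not a translation of the SQL.

```python
from collections import defaultdict


def merge_overlapping(ests):
    """Merge overlapping (cluster_id, lower, upper) intervals within each cluster."""
    by_cluster = defaultdict(list)
    for cluster_id, lower, upper in ests:
        by_cluster[cluster_id].append((lower, upper))

    merged = []
    for cluster_id, intervals in sorted(by_cluster.items()):
        intervals.sort()
        cur_lower, cur_upper = intervals[0]
        for lower, upper in intervals[1:]:
            if lower <= cur_upper:          # overlaps or touches the current run
                cur_upper = max(cur_upper, upper)
            else:                           # gap: close the run and start a new one
                merged.append((cluster_id, cur_lower, cur_upper))
                cur_lower, cur_upper = lower, upper
        merged.append((cluster_id, cur_lower, cur_upper))
    return merged
```

The contrast with the nested NOT EXISTS clauses is the point of the slide: interval logic is awkward to express with plain integer columns, which motivates a dedicated location datatype.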

33
Location Datatype

A straightforward representation of a location is a sequence identifier as a character string plus the location interval as a (start, end) pair of integers.
Other representations are possible, such as integer codes for sequence identifiers and/or a (start, length) interval representation.
Most databases use the sequence identifier with a (start, end) pair of integers.
Why? Because of its simplicity.

34
Simplicity Using the Location Datatype: Creation and Insertion

CREATE TABLE features ( location loc, description text );
-- The Prader-Willi/Angelman syndrome region on chromosome 15
INSERT INTO features VALUES ( 'NG_002690:1..755217', 'Prader-Willi/Angelman syndrome region' );
INSERT INTO features VALUES ( 'NG_002690:1..174707', 'AC090602.16' );
INSERT INTO features VALUES ( 'NG_002690:174707..324834', 'AC124312.5' );
INSERT INTO features VALUES ( 'NG_002690:324835..478258', 'AC124303.5' );
INSERT INTO features VALUES ( 'NG_002690:478259..606120', 'AC100774.2' );
INSERT INTO features VALUES ( 'NG_002690:606121..755217', 'AC124997.4' );

The introduction of the location datatype not only provides a natural and intuitive way to represent biological information, but also boosts system performance.
An additional performance increase can be achieved by supporting a location index scheme.
Support for indexing schemes in traditional relational database systems is very limited and inflexible.
It is restricted to a few well-known index structures, such as B-tree, hash and R-tree, which can be used only with a limited set of native datatypes for (in)equality and range queries.

The location datatype supports a set of operations and functions.
A major proportion of these functions are related to interval operations.
More than 30 interval operations are defined, including Allen's interval logic (which includes after, before, contains, during, equals, overlaps, overlapped by, finishes, finished by, meets, met by, starts and started by).
Optimization information (such as ordering, commutativity or negation) is also provided to permit optimization of important operations like merge-join, hash-join or general theta-join.
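
The thirteen Allen relations named above can be written out as simple endpoint comparisons. The Python sketch below classifies a pair of intervals given as `(lower, upper)` tuples with `lower < upper`, treating the endpoints as points on a line. The function name `allen_relation` is illustrative, and a database implementation over integer sequence positions may define "meets" and the other boundary cases differently.

```python
def allen_relation(a, b):
    """Return the name of the Allen relation that interval a has to interval b."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    if a_hi < b_lo:
        return "before"
    if b_hi < a_lo:
        return "after"
    if a_hi == b_lo:
        return "meets"
    if b_hi == a_lo:
        return "met by"
    if (a_lo, a_hi) == (b_lo, b_hi):
        return "equals"
    if a_lo == b_lo:
        return "starts" if a_hi < b_hi else "started by"
    if a_hi == b_hi:
        return "finishes" if a_lo > b_lo else "finished by"
    if b_lo < a_lo and a_hi < b_hi:
        return "during"
    if a_lo < b_lo and b_hi < a_hi:
        return "contains"
    # Only partial overlap remains; the side of the lower bound decides which.
    return "overlaps" if a_lo < b_lo else "overlapped by"
```

Exactly one relation holds for any pair of intervals, and each relation has an inverse (before/after, contains/during, and so on). That pairing is the kind of commutativity and negation information a query optimizer can exploit.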

37
Why location datatype is Needed

Here is a simple example to demonstrate the power
of location datatype support. This example shows
a session that painfully attempts to locate
alternatively spliced exon intervals which
intersect with known homology intervals and
associate them with known protein features from
the Pfam and Swissprot databases.

38
Complexity without locations

CREATE TABLE alt_splice_homology_map AS
SELECT o.*, d.swiss_id, d.query_start, d.query_end,
  d.hit_start + (o.seq_start - d.query_start)/3 AS hit_start,
  d.hit_start + (o.seq_end - d.query_start)/3 AS hit_end
FROM alt_splice_exon_obs o, alt_splice_homology d
WHERE o.ug_id = d.ug_id
  AND o.seq_start > d.query_start
  AND o.seq_start < d.query_end
  AND d.e_value < 0.01
GROUP BY o.ug_id, o.seq_start;

SELECT o.*, f.type, f.start, f.end
FROM alt_splice_homology_map o, swiss_feature f
WHERE o.swiss_id = f.swiss_id
  AND o.hit_end > f.start
  AND o.hit_end < f.end;

39
Simplicity using locations

CREATE TABLE alt_splice_homology_map AS
SELECT o.*, d.location,
  range_start(d.query) + (o.location - range_start(d.hit))/3
FROM alt_splice_exon_obs o, alt_splice_homology d
WHERE o.location @ d.location -- contained
  AND d.e_value < 0.01
GROUP BY o;

SELECT o.*, f.type, f.location
FROM alt_splice_homology_map o, swiss_feature f
WHERE o.location < f.location -- left overlap

40
Location Support

Supporting location indexing in a traditional database implies the need to support interval indexing.
However, interval indexing is not supported in traditional databases, and standard join operations cannot handle intervals efficiently. This has led to extensive research on interval indexing.
Here lies the need for a concept called GiST.

41
GIST

GiST (Generalized Search Tree) is an efficient solution to the problem of ineffective interval indexing in traditional databases.
GiST is basically a balanced search tree in which keys are maintained in a hierarchical manner. A search key in GiST may be any arbitrary predicate, but the predicate must hold true for all the data stored below the key.
GiST searches by traversing the tree depth-first. If the query predicate is consistent with a given search key, GiST continues to search the subtree below that key.
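
The search strategy described above can be sketched in Python with bounding intervals as the keys: each node's key covers everything stored beneath it, and the search descends only into subtrees whose key is consistent with the query. This is a simplified illustration under assumed names (`Node`, `overlaps`, `search`) and does not reproduce PostgreSQL's GiST interface. It also omits insertion, node splitting and balancing.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lower: int                                     # bounding interval of the subtree
    upper: int
    children: list = field(default_factory=list)   # inner node: child Nodes
    payload: object = None                         # leaf entry: the stored value


def overlaps(node, lower, upper):
    """Consistency check: could anything under node overlap [lower, upper]?"""
    return node.lower <= upper and lower <= node.upper


def search(node, lower, upper, found=None):
    """Depth-first search that descends only into consistent subtrees."""
    if found is None:
        found = []
    if not overlaps(node, lower, upper):
        return found                               # prune: nothing below can match
    if not node.children:
        found.append(node.payload)                 # consistent leaf entry
    for child in node.children:
        search(child, lower, upper, found)
    return found
```

Because a key must hold for all data below it, a failed consistency check at an inner node rules out its entire subtree, which is what keeps the search confined to a small part of the tree.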

42
Gist Implementation

GiST is implemented using bounding intervals that cover the range of
identifier integers (id_lower, id_upper)
and
intervals in the subtree (lower, upper).
Under the GiST architecture, interval predicates such as left, right, overlap, overleft, overright, contains, contained and equal are all supported.

43
What gist location does
44
Conclusion

Bioinformatics databases are being modeled and queried using functions (as seen in Oracle and IBM DB2).
An efficient way of modeling these databases is bioindexing (as seen in the PostgreSQL database).
The use of an index structure as seen in bioindexing, where a location is modeled using a tree structure searched depth-first (DFS), leads to less complexity.
This location index structure leads to faster searching of the databases.
This speed is very important in bioinformatics.
Using a GiST architecture leads to less complex queries and a more confined search space for query information.
