1
Bioinformatics Data Representation and Integration
  • By
  • Ngozi Oleleh

2
Table of Contents
  • Introduction to Bioinformatics
  • Proteins and Sequences
  • Bioinformatics Tools
  • The databases
  • Blast Functions
  • Bioindexing
  • Conclusion

3
What is Bioinformatics?
  • Bioinformatics is the use of computers to study
    and handle biological information.
  • It can be viewed as an integration of computer
    science and biology that supports the study of
    biological data, which has proven to be very
    extensive.
  • The role of computer science in this
    interdisciplinary field is to store the data (via
    databases) for later analysis with biological
    tools.
  • The field includes, but is not limited to, the
    study of genes, DNA sequences and protein
    structures.

4
Proteins and Sequences
  • Biological proteins are made up of 20 standard
    amino acids
  • alanine - ala - A
  • arginine - arg - R
  • asparagine - asn - N
  • aspartic acid - asp - D
  • cysteine - cys - C
  • glutamine - gln - Q
  • glutamic acid - glu - E
  • glycine - gly - G
  • histidine - his - H
  • isoleucine - ile - I
  • leucine - leu - L
  • lysine - lys - K
  • methionine - met - M
  • phenylalanine - phe - F
  • proline - pro - P
  • serine - ser - S
  • threonine - thr - T
  • tryptophan - trp - W
  • tyrosine - tyr - Y
  • valine - val - V
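
As a small illustration of the codes listed above, the following Python sketch converts three-letter codes into a one-letter sequence string. The names THREE_TO_ONE and to_one_letter are made up for this example, and the table includes tryptophan, tyrosine and valine, which complete the set of 20 standard amino acids.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
    "gln": "Q", "glu": "E", "gly": "G", "his": "H", "ile": "I",
    "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
    "ser": "S", "thr": "T", "trp": "W", "tyr": "Y", "val": "V",
}


def to_one_letter(residues):
    """Convert an iterable of three-letter codes into a one-letter sequence."""
    return "".join(THREE_TO_ONE[r.lower()] for r in residues)


print(to_one_letter(["Thr", "Ser", "Pro", "Asp", "Val"]))  # prints TSPDV
```

The one-letter form is the one used for sequence strings such as BLAST queries.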

5
Proteins and Sequences
  • Combinations of these amino acids make up
    protein structures and sequences.
  • The PDB database contains numerous protein
    structures whose similarity can be established by
    sequence alignment or fold recognition.
  • Bioinformatics studies the differences and
    similarities of these protein structures based on
    sequence similarity.
  • A sequence is a combination of amino acids.
  • These sequences carry biological data that can be
    used to derive information about families of
    proteins.

6
Bioinformatics Tools
  • Mage: used to display protein singular
    structures
  • RasMol: used to display protein 3D structure
  • LALIGN: pairwise sequence alignment
  • ClustalW: multiple sequence alignment
  • AMMP: molecular modeling
  • Sequence alignment tools: FASTA and BLAST (BLAST
    will be looked at extensively)

8
Biological Databases
  • There are over 5000 public biological databases.
  • These databases contain genomic, proteomic and
    microarray data.
  • This data consists of gene sequences or the
    amino acid sequences of proteins.
  • Biological databases have become very useful to
    scientists. They are important in understanding
    and explaining a host of biological phenomena,
    from the structure of biomolecules and their
    interactions, to the whole metabolism of
    organisms, to the evolution of species.

9
  • This knowledge helps facilitate the fight against
    diseases, assists in the development of
    medications and in discovering basic
    relationships amongst species in the history of
    life.
  • The biological knowledge is distributed amongst
    many different general and specialized databases.
    This sometimes makes it difficult to ensure the
    consistency of information.
  • Biological databases cross-reference other
    databases with accession numbers as one way of
    linking their related knowledge together.

10
  • Bioinformatics databases can be divided into two
    groups: generalized databases and specialized
    databases
  • Generalized databases
  • Primary sequence databases (EMBL, GenBank, DDBJ)
  • Protein sequence databases (Swiss-Prot, UniProt,
    UniRef)
  • Carbohydrate databases (CarbBank)
  • 3D structure databases (PDB, EBI-MSD, NDB)

11
Specialized Databases
  • Specialized sequence databases
  • Genome databases
  • Specialized protein sequence databases
  • Specialized structure databases
  • Microarray databases
  • The main focus here is the generalized databases

12
Primary Sequence Database
  • Primary sequence databases
  • EMBL (European Molecular Biology Laboratory
    nucleotide sequence database at EBI, Hinxton, UK)
  • GenBank (at the National Center for Biotechnology
    Information, NCBI, Bethesda, MD, USA)
  • DDBJ (DNA Data Bank of Japan at CIB, Mishima,
    Japan)

13
Protein Sequence Database
  • Protein sequence databases
  • SWISS-PROT (Swiss Institute of Bioinformatics,
    SIB, Geneva, CH)
  • TrEMBL (Translated EMBL computer annotated
    protein sequence database at EBI, UK)
  • PIR-PSD (PIR-International Protein Sequence
    Database, annotated protein database by PIR, MIPS
    and JIPID at NBRF, Georgetown University, USA)
  • UniProt (Joined data from Swiss-Prot, TrEMBL and
    PIR)
  • UniRef (UniProt NREF (Non-redundant REFerence)
    database at EBI, UK)
  • IPI (International Protein Index human, rat and
    mouse proteome database at EBI, UK)

14
Other Databases
  • Carbohydrate databases
  • CarbBank (former complex carbohydrate structure
    database)
  • 3D structure databases
  • PDB (Protein Data Bank, curated by RCSB, USA)
  • EBI-MSD (Macromolecular Structure Database at
    EBI, UK)
  • NDB (Nucleic Acid structure Database at Rutgers,
    The State University of New Jersey, USA)

15
BLAST
  • BLAST is a heuristic algorithm for detecting
    sequence similarity and is optimized for speed.
    It is suitable for large-scale analysis.
  • BLAST matches a query sequence against positions
    in the database sequences.
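
The matching idea can be sketched in Python as a simplified seed-and-extend search. This is not the real BLAST implementation: it seeds on exact word matches only and extends in one direction without gaps, whereas BLAST also scores similar words with a substitution matrix, extends both ways, allows gaps and computes statistics. The function names, the +1/-1 scoring and the subject string are assumptions made for the example.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=3, match=1, mismatch=-1, x_dropoff=2):
    """Extend a seed to the right without gaps.

    Extension stops once the running score falls x_dropoff below the best
    score seen; the best score and the matching query segment are returned.
    """
    score = best = word_size * match
    best_len = word_size
    k = word_size
    while i + k < len(query) and j + k < len(subject):
        score += match if query[i + k] == subject[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score >= x_dropoff:
            break
    return best, query[i:i + best_len]


query = "TSPDVDLG"
subject = "AAASPDVDLQQ"          # made-up database sequence
seeds = find_seeds(query, subject)
print(seeds)                     # [(1, 3), (2, 4), (3, 5), (4, 6)]
print(extend_seed(query, subject, *seeds[0]))  # (6, 'SPDVDL')
```

Indexing the words of the database sequence first is what makes the lookup fast: each query word is found with a dictionary lookup instead of a scan.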

16
Quick Diversion
  • Blast Example
  • Sequence to be queried
  • TSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQ

17
  • Sequences producing significant alignments:

    Sequence                                                           Score (bits)  E value
    pdb|2FXP|A  Chain A, Solution Structure Of The Sars-Coronaviru...      82.4      3e-17
    pdb|2BEZ|F  Chain F, Structure Of A Proteolitically Resistant ...      81.6      5e-17
    pdb|1WNC|A  Chain A, Crystal Structure Of The Sars-Cov Spike P...      77.8      7e-16
    pdb|1WYY|A  Chain A, Post-Fusion Hairpin Conformation Of The S...      76.6      1e-15
    pdb|2BEQ|D  Chain D, Structure Of A Proteolytically Resistant ...      69.7      2e-13
    pdb|1ZVA|A  Chain A, A Structure-Based Mechanism Of Sars Virus...      68.6      5e-13
    pdb|1ZV7|A  Chain A, A Structure-Based Mechanism Of Sars Virus...      65.9      3e-12
    pdb|1ZV8|B  Chain B, A Structure-Based Mechanism Of Sars Virus...      65.5      4e-12
    pdb|1WDG|A  Chain A, Crystal Structure Of Mhv Spike Protein Fu...      25.4      4.7
    pdb|2A11|A  Chain A, Crystal Structure Of Nuclease Domain Of R...      24.3      9.1
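
The bit score and the E value reported for each hit are related. A common approximation, assumed here rather than taken from the slides, is E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total length of the database. The Python sketch below uses a hypothetical function name and made-up lengths, and it ignores the length corrections real BLAST applies, so it is not expected to reproduce the values listed above exactly.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value: E = m * n * 2**(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Made-up sizes chosen so the arithmetic is easy to follow.
print(expect_value(10, 4, 256))   # 4 * 256 / 1024 = 1.0
print(expect_value(11, 4, 256))   # one more bit halves the E value: 0.5
```

Under this approximation every additional bit of score halves the number of hits expected by chance, which is why high-scoring hits have very small E values while low-scoring hits have E values near or above 1.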

18
BLAST Functions in Databases
  • BLAST is one of the most heavily used data
    analysis tools available, so systems for
    large-scale data analysis need to support BLAST
    functions.
  • BLAST support is achieved by defining a set of
    user-defined functions that return BLAST results
    as a table.
  • Many databases support BLAST functions.
  • The two major BLAST functions are:
  • BLAST_MATCH
  • BLAST_ALIGN

19
The BLAST Functions
  • function BLASTP_MATCH (
        query_seq             CLOB,
        seqdb_cursor          REF CURSOR,
        subsequence_from      NUMBER default 1,
        subsequence_to        NUMBER default -1,
        filter_low_complexity BOOLEAN default false,
        mask_lower_case       BOOLEAN default false,
        sub_matrix            VARCHAR2 default 'BLOSUM62',
        expect_value          NUMBER default 10,
        open_gap_cost         NUMBER default 11,
        extend_gap_cost       NUMBER default 1,
        word_size             NUMBER default 3,
        x_dropoff             NUMBER default 15,
        final_x_dropoff       NUMBER default 25)
    return table of row (t_seq_id VARCHAR2, score
    NUMBER, expect NUMBER)

20
Parameter Description
  • query_seq: The query sequence to search. A
    sequence is just lines of sequence data. Blank
    lines are not allowed in the middle of bare
    sequence input.
  • seqdb_cursor: The cursor parameter supplied by
    the user when calling the function. It should
    return two columns in its returning row, the
    sequence identifier and the sequence string.
  • subsequence_from: Start position of a region of
    the query sequence to be used for the search. The
    default is 1.
  • subsequence_to: End position of a region of the
    query sequence to be used for the search. If -1
    is specified, the sequence length is taken as
    subsequence_to. The default is -1.
  • filter_low_complexity: TRUE or FALSE. If TRUE,
    the search masks off segments of the query
    sequence that have low compositional complexity.
    Filtering can eliminate statistically significant
    but biologically uninteresting regions, leaving
    the more biologically interesting regions of the
    query sequence available for specific matching
    against database sequences. Filtering is only
    applied to the query sequence. The default value
    is FALSE.
  • mask_lower_case: TRUE or FALSE. If TRUE, you can
    specify a sequence in upper case characters as
    the query sequence and denote areas to be
    filtered out with lower case. This customizes
    what is filtered from the sequence. The default
    value is FALSE.

21
  • sub_matrix: Specifies the substitution matrix
    used to assign a score for aligning any possible
    pair of residues. The options are PAM30, PAM70,
    BLOSUM80, BLOSUM62, and BLOSUM45. The default is
    BLOSUM62.
  • expect_value: The statistical significance
    threshold for reporting matches against database
    sequences. The default value is 10. Specifying 0
    invokes default behavior.
  • open_gap_cost: The cost of opening a gap. The
    default value is 11. Specifying 0 invokes default
    behavior.
  • extend_gap_cost: The cost of extending a gap. The
    default value is 1. Specifying 0 invokes default
    behavior.
  • word_size: The word size used for dividing the
    query sequence into subsequences during the
    search. The default value is 3. Specifying 0
    invokes default behavior.
  • x_dropoff: Dropoff for BLAST extensions in bits.
    The default value is 15. Specifying 0 invokes
    default behavior.
  • final_x_dropoff: The final X dropoff value for
    gapped alignments in bits. The default value is
    25. Specifying 0 invokes default behavior.
  • t_seq_id: The sequence identifier of the returned
    match.
  • score: The score of the returned match.
  • expect: The expect value of the returned match.
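
To make the two gap parameters concrete, here is a minimal Python sketch of affine gap scoring. It assumes the convention in which a gap of k residues costs open_gap_cost + k * extend_gap_cost; whether the database function uses exactly this convention is not stated in the slides. The function names and the toy substitution scores (+4 for a match, -1 for a mismatch) are illustrative and are not BLOSUM62 values.

```python
def gap_cost(length, open_gap_cost=11, extend_gap_cost=1):
    """Cost of one gap of `length` residues under the assumed affine model."""
    if length <= 0:
        return 0
    return open_gap_cost + extend_gap_cost * length


def alignment_score(aligned_query, aligned_subject, substitution,
                    open_gap_cost=11, extend_gap_cost=1):
    """Sum substitution scores over aligned pairs and subtract gap costs.

    Both strings must have the same length; '-' marks a gap position.
    """
    score = 0
    gap_side, gap_len = None, 0   # which row holds the current gap run, and its length
    for q, s in zip(aligned_query, aligned_subject):
        side = "query" if q == "-" else "subject" if s == "-" else None
        if side != gap_side:      # a gap run ended or moved to the other row
            score -= gap_cost(gap_len, open_gap_cost, extend_gap_cost)
            gap_side, gap_len = side, 0
        if side is None:
            score += substitution(q, s)
        else:
            gap_len += 1
    return score - gap_cost(gap_len, open_gap_cost, extend_gap_cost)


def toy_scores(a, b):
    """Toy substitution scores for illustration only (not BLOSUM62)."""
    return 4 if a == b else -1


print(gap_cost(1))                                    # 12
print(alignment_score("ACDEF", "AC--F", toy_scores))  # 4 + 4 - 13 + 4 = -1
```

With the defaults of 11 and 1, opening a gap is far more expensive than lengthening one, so alignments with one long gap score better than alignments with many short gaps.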

22
How the Whole System Works
  • Sequences that need to be searched are inserted
    into a query table
  • INSERT INTO query_db VALUES (1,
    'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGT');

23
How Does It Work?
    SELECT t_seq_id, score, expect AS evalue
    FROM TABLE(BLASTP_MATCH (
        (SELECT sequence FROM query_db),  -- query_sequence
        CURSOR(SELECT seq_id, seq_data
               FROM swissprot
               WHERE organism = 'Homo sapiens (Human)'),  -- seqdb_cursor
        1,           -- subsequence_from
        -1,          -- subsequence_to
        0,           -- FILTER_LOW_COMPLEXITY
        0,           -- MASK_LOWER_CASE
        'BLOSUM62',  -- SUB_MATRIX
        10,          -- EXPECT_VALUE
        0,           -- OPEN_GAP_COST
        0,           -- EXTEND_GAP_COST
        0,           -- WORD_SIZE
        0,           -- X_DROPOFF
        0)) t        -- FINAL_X_DROPOFF
    WHERE t.score > 25;

24
The Search Procedure
    SELECT t.t_seq_id, t.score, t.expect, p.name
    FROM PROT_DB p, TABLE(
        BLASTP_MATCH (
            (SELECT sequence FROM query_db WHERE sequence_id = 2),
            CURSOR(SELECT seq_id, sequence FROM PROT_DB),
            1,
            -1,
            0,
            0,
            'BLOSUM62',
            10,
            0,
            0,
            0,
            0,
            0)
        ) t
    WHERE t.t_seq_id = p.seq_id AND t.score > 25
    ORDER BY t.expect;

25
Output Results
    SEQ_ID        SCORE      EVALUE
    --------  ----------  ----------
    P31946          205  5.8977E-18
    Q04917          198  3.8228E-17
    P31947          169  8.8130E-14
    P27348          198  3.8228E-17
    P58107           49  7.24297332

26
The Databases and Why
  • The ability to perform genome-wide and
    cross-genome data analysis can reduce the time
    required for new biological discoveries.
  • Because traditional databases are not built to
    support location datatypes, researchers have had
    to find ways for these databases to manage
    biological information so that it can be queried
    with a modern database system.
  • This research has led to a concept called
    bioindexing.

27
Bioindexing
  • An index in this context is basically a way of
    providing a mapping between information entities.
  • In a traditional database, an index is an
    auxiliary structure that speeds up data retrieval
    by mapping a record key to the physical disk
    address of the records containing that key.
  • Bioindexing provides functionality similar to a
    database index but also facilitates data
    integration.
  • Biological features are generally attached to
    locations, and locations are also the basis for
    maps (a map in this context is an association of
    features with a sequence alignment), alignments
    (relationships between two genomic sequence
    segments) and other complex relationships.
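
The traditional notion of an index, a mapping from a record key to the physical position of the record, can be sketched in Python over a FASTA-style file. This only illustrates the key-to-address idea and is not how any particular database stores its indexes; the function names and the records seq1 and seq2 are made up.

```python
import io


def build_offset_index(handle):
    """Map each record key (first word after '>') to the byte offset of its header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            key = line[1:].split()[0].decode("ascii")
            index[key] = offset
    return index


def fetch_sequence(handle, index, key):
    """Seek straight to a record and return its sequence without scanning the file."""
    handle.seek(index[key])
    handle.readline()                 # skip the header line
    parts = []
    for line in handle:
        if line.startswith(b">"):     # next record begins
            break
        parts.append(line.strip())
    return b"".join(parts).decode("ascii")


# An in-memory stand-in for a sequence file on disk.
data = io.BytesIO(b">seq1 example\nTSPDVDLG\nDISG\n>seq2 example\nAGCTTTTC\n")
index = build_offset_index(data)
print(index)                                # {'seq1': 0, 'seq2': 28}
print(fetch_sequence(data, index, "seq2"))  # AGCTTTTC
```

The index is built once with a single pass; afterwards each lookup costs one seek, which is the speed-up an index provides.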

28
The Blast Database and Bioindexing
  • Bioindexing is essentially an infrastructure for
    representing and managing biological knowledge in
    a large-scale database system using index
    constructs
  • Bioindexing uses location datatype and BLAST
    JOINS to efficiently handle and query the large
    amount of data.
  • Bioindexing is essentially a scheme for
    connecting and querying information with modern
    database systems WITH THE USE OF INDEXES

29
Types of Indexing
  • Intrinsic indexing: indexable bioinformatics
    datatypes. Intrinsic indexing permits both the
    representation and management of biological
    mappings.
  • Extrinsic indexing: an efficient way of
    integrating data from heterogeneous sources such
    as relational tables, XML files, standard
    sequence formats and other sources.
  • Extrinsic indexing concerns the functions and
    algorithms used to access and connect this
    information, even when it is not stored locally.

31
Location (How It Is Represented)
  • Without proper abstraction, users have to
    implement their own code to handle location
    operations.
  • A location consists of a sequence identifier and
    an interval range.
  • Integer intervals are modeled as a (lower, upper)
    structure.
  • Identifiers are character strings or accession
    numbers that denote a particular sequence; the
    interval range is a pair of positive integers
    that denotes the sub-range within the given
    sequence.
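
A location as described above can be sketched in Python as a small value type. The class name, the method names and the text form 'identifier:lower..upper' are assumptions made for this example, not the datatype's actual definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """A sequence identifier plus a (lower, upper) interval of positive integers."""
    seq_id: str
    lower: int
    upper: int

    def __post_init__(self):
        if not (0 < self.lower <= self.upper):
            raise ValueError("expected positive integers with lower <= upper")

    @classmethod
    def parse(cls, text):
        """Parse 'seq_id:lower..upper' (an assumed literal syntax)."""
        seq_id, _, interval = text.strip().rpartition(":")
        lower, _, upper = interval.partition("..")
        return cls(seq_id, int(lower), int(upper))

    def overlaps(self, other):
        """True if both locations are on the same sequence and share a position."""
        return (self.seq_id == other.seq_id
                and self.lower <= other.upper and other.lower <= self.upper)

    def contains(self, other):
        """True if `other` lies entirely within this location."""
        return (self.seq_id == other.seq_id
                and self.lower <= other.lower and other.upper <= self.upper)


region = Location.parse("NG_002690:1..755217")
clone = Location.parse("NG_002690:174707..324834")
print(region.contains(clone))   # True
print(clone.overlaps(Location("NG_002690", 324835, 478258)))  # False
```

Bundling the identifier with the interval means every comparison checks the sequence first, which is exactly the bookkeeping users would otherwise repeat in each query.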

32
Complexity (WHERE Clauses) Without Location
Datatypes: Grouping EST Sequences over Consecutive
Overlapping EST Fragments
    SELECT DISTINCT A.id, A.lower, B.upper
    FROM ESTs AS A, ESTs AS B
    WHERE A.unigene_clusterid = B.unigene_clusterid
      AND A.lower < B.upper
      AND NOT EXISTS
        (SELECT *
         FROM ESTs AS C
         WHERE C.unigene_clusterid = A.unigene_clusterid
           AND A.lower < C.lower AND C.lower < B.upper
           AND NOT EXISTS
             (SELECT * FROM ESTs AS D
              WHERE D.unigene_clusterid = A.unigene_clusterid
                AND D.lower < C.lower AND C.lower < D.upper))
      AND NOT EXISTS
        (SELECT *
         FROM ESTs AS E
         WHERE E.unigene_clusterid = A.unigene_clusterid
           AND ((E.lower < A.lower AND A.lower < E.upper) OR
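
The query above appears to group EST fragments of one cluster into maximal runs of consecutive overlapping intervals. For comparison, the same grouping can be sketched procedurally in Python. The function name and the cluster identifiers are made up, and treating intervals that merely touch as overlapping is an assumption, since the comparison operators of the original query are not fully recoverable.

```python
from collections import defaultdict


def merge_overlapping(ests):
    """Merge overlapping intervals within each cluster.

    ests: iterable of (cluster_id, lower, upper) tuples.
    Returns {cluster_id: [(lower, upper), ...]} with merged, sorted intervals.
    """
    by_cluster = defaultdict(list)
    for cluster_id, lower, upper in ests:
        by_cluster[cluster_id].append((lower, upper))

    merged = {}
    for cluster_id, intervals in by_cluster.items():
        intervals.sort()
        runs = [list(intervals[0])]
        for lower, upper in intervals[1:]:
            if lower <= runs[-1][1]:        # overlaps (or touches) the current run
                runs[-1][1] = max(runs[-1][1], upper)
            else:                           # a gap starts a new run
                runs.append([lower, upper])
        merged[cluster_id] = [tuple(run) for run in runs]
    return merged


ests = [("Hs.1", 1, 100), ("Hs.1", 50, 200), ("Hs.1", 300, 400), ("Hs.2", 10, 20)]
print(merge_overlapping(ests))
# {'Hs.1': [(1, 200), (300, 400)], 'Hs.2': [(10, 20)]}
```

The nested NOT EXISTS clauses in the SQL express the same conditions declaratively: no gap inside a run, and no fragment extending the run on either side.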

33
Location Datatype
  • A straightforward representation of a location
    is a sequence identifier as a character string
    and the location interval as a (start, end) pair
    of integers.
  • Other representations are possible, such as
    integer codes for sequence identifiers and/or a
    (start, length) interval representation.
  • Most databases use the sequence identifier with a
    (start, end) pair of integers because of its
    simplicity.

34
Simplicity Using the Location Datatype: Creation and
Insertion
    CREATE TABLE features ( location loc, description text );
    -- The Prader-Willi/Angelman syndrome region on chromosome 15
    INSERT INTO features VALUES ( 'NG_002690:1..755217', 'Prader-Willi/Angelman syndrome region' );
    INSERT INTO features VALUES ( 'NG_002690:1..174707', 'AC090602.16' );
    INSERT INTO features VALUES ( 'NG_002690:174707..324834', 'AC124312.5' );
    INSERT INTO features VALUES ( 'NG_002690:324835..478258', 'AC124303.5' );
    INSERT INTO features VALUES ( 'NG_002690:478259..606120', 'AC100774.2' );
    INSERT INTO features VALUES ( 'NG_002690:606121..755217', 'AC124997.4' );

35
  • The introduction of location datatype not only
    provides a natural and intuitive way to represent
    biological information, but also boosts system
    performance.
  • Additional performance increase could be achieved
    by supporting the location index scheme.
  • Support for indexing schemes in traditional
    relational database systems is very limited and
    inflexible.
  • It is restricted to a few well-known index
    structures, such as B-tree, hash and R-tree,
    which can be used only with a limited set of
    native datatypes for (in)equality and range
    queries.

36
  • The location datatype supports a set of
    operations and functions.
  • A major proportion of these functions are related
    to interval operations.
  • More than 30 interval operations are defined,
    including Allen's interval logic (after, before,
    contains, during, equals, overlaps, overlapped
    by, finishes, finished by, meets, met by, starts
    and started by).
  • Optimization information (such as ordering,
    commutativity or negation) is also provided to
    permit optimization of important operations like
    merge-join, hash-join or general theta-join.
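
The thirteen relations of Allen's interval logic named above can be sketched as a single Python function that classifies a pair of intervals. The function name is made up, and intervals are taken as (start, end) pairs with start < end under the usual continuous-endpoint convention; a datatype built on integer positions may define "meets" differently.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Remaining case: the intervals intersect properly and share no endpoint.
    return "overlaps" if a1 < b1 else "overlapped by"


print(allen_relation((1, 6), (5, 8)))   # overlaps
print(allen_relation((6, 7), (5, 8)))   # during
```

Exactly one relation holds for any pair of intervals, which is what lets a query optimizer reason about ordering, commutativity and negation of these predicates.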

37
Why location datatype is Needed
  • Here is a simple example to demonstrate the power
    of location datatype support. This example shows
    a session that painfully attempts to locate
    alternatively spliced exon intervals which
    intersect with known homology intervals and
    associate them with known protein features from
    the Pfam and Swissprot databases.

38
Complexity Without Locations
    CREATE TABLE alt_splice_homology_map AS
    SELECT o.*, d.swiss_id, d.query_start, d.query_end,
           -- aliases assumed; the second query reads o.hit_end
           d.hit_start + (o.seq_start - d.query_start)/3 AS hit_start,
           d.hit_start + (o.seq_end - d.query_start)/3 AS hit_end
    FROM alt_splice_exon_obs o, alt_splice_homology d
    WHERE o.ug_id = d.ug_id
      AND o.seq_start > d.query_start
      AND o.seq_start < d.query_end
      AND d.e_value < 0.01
    GROUP BY o.ug_id, o.seq_start;

    SELECT o.*, f.type, f.start, f.end
    FROM alt_splice_homology_map o, swiss_feature f
    WHERE o.swiss_id = f.swiss_id
      AND o.hit_end > f.start
      AND o.hit_end < f.end;

39
Simplicity Using Locations
    CREATE TABLE alt_splice_homology_map AS
    SELECT o.*, d.location,
           range_start(d.query) + (o.location - range_start(d.hit))/3
    FROM alt_splice_exon_obs o, alt_splice_homology d
    WHERE o.location @ d.location   -- contained
      AND d.e_value < 0.01
    GROUP BY o;

    SELECT o.*, f.type, f.location
    FROM alt_splice_homology_map o, swiss_feature f
    WHERE o.location &< f.location   -- left overlap

40
Location Support
  • Supporting location indexing in a traditional
    database implies the need to support interval
    indexing.
  • However, interval indexing is not supported in
    traditional databases, and standard join
    operations cannot handle intervals efficiently.
    This has led to extensive research on interval
    indexing.
  • Here lies the need for a concept called GiST
    (Generalized Search Tree).

41
GiST
  • GiST is an efficient solution to the problem of
    ineffective interval indexing in traditional
    databases.
  • A GiST is basically a balanced search tree in
    which keys are maintained hierarchically. A
    search key may be an arbitrary predicate, but the
    predicate must hold true for all data stored
    below the key.
  • GiST searches the tree depth-first: if the query
    predicate is consistent with a given search key,
    the search continues into the subtree below that
    key.

42
GiST Implementation
  • GiST is implemented using bounding intervals that
    cover the range of:
  • identifier integers (id_lower, id_upper), and
  • intervals in the subtree (lower, upper).
  • Under the GiST architecture, interval predicates
    such as left, right, overlap, overleft,
    overright, contains, contained and equal are all
    supported.
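
The search behaviour can be sketched in Python with a tiny hand-built tree in which every node carries a bounding interval over everything below it. This is only an illustration of the consistency check. A real GiST is balanced, stored in disk pages and built through pluggable methods; the sketch also leaves out the identifier range (id_lower, id_upper) and handles a single sequence. The class and function names are made up, and the intervals are the clone ranges of the Prader-Willi/Angelman region used as sample data.

```python
class Node:
    """A tree node whose key is a bounding interval covering everything below it."""

    def __init__(self, children=None, entries=None):
        self.children = children or []   # internal node: child nodes
        self.entries = entries or []     # leaf node: (lower, upper, payload) tuples
        spans = ([(c.lower, c.upper) for c in self.children]
                 + [(lo, hi) for lo, hi, _ in self.entries])
        self.lower = min(lo for lo, _ in spans)
        self.upper = max(hi for _, hi in spans)


def overlaps(lower, upper, q_lower, q_upper):
    """Consistency test for an overlap query against an interval key."""
    return lower <= q_upper and q_lower <= upper


def search(node, q_lower, q_upper, visited=None):
    """Depth-first search that descends only into subtrees consistent with the query."""
    if visited is not None:
        visited.append((node.lower, node.upper))
    hits = [p for lo, hi, p in node.entries if overlaps(lo, hi, q_lower, q_upper)]
    for child in node.children:
        if overlaps(child.lower, child.upper, q_lower, q_upper):
            hits.extend(search(child, q_lower, q_upper, visited))
    return hits


leaf1 = Node(entries=[(1, 174707, "AC090602.16"), (174707, 324834, "AC124312.5")])
leaf2 = Node(entries=[(324835, 478258, "AC124303.5"),
                      (478259, 606120, "AC100774.2"),
                      (606121, 755217, "AC124997.4")])
root = Node(children=[leaf1, leaf2])

visited = []
print(search(root, 500000, 620000, visited))  # ['AC100774.2', 'AC124997.4']
print(visited)                                # [(1, 755217), (324835, 755217)]
```

The list of visited nodes shows the pruning: the first leaf is never opened because its bounding interval cannot overlap the query.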

43
What gist location does
44
Conclusion
  • Bioinformatics databases are being modeled and
    queried using functions (as seen in Oracle and
    IBM DB2).
  • An efficient way of modeling these databases is
    bioindexing (as seen in the PostgreSQL database).
  • Using an index structure as in bioindexing, where
    locations are indexed with a tree structure
    searched depth-first (DFS), leads to less
    complexity.
  • This location index structure leads to faster
    searching of the databases.
  • Speed is very important in bioinformatics.
  • Using a GiST architecture leads to less complex
    queries and a more confined search space for
    query information.

45
References
  • D. Stott Parker, Edwin Mach. The Index as a
    First-Class Construct in Relational Database
    Systems.
  • Mario Cannataro, Pietro Hiram Guzzi, Tommaso
    Mazza, Giuseppe Tradigo and Pierangelo Veltri.
    Algorithms and Databases in Bioinformatics:
    Towards a Proteomic Ontology.
  • Oracle Data Mining.
  • Pentti Riikonen, Jorma Boberg, Tapio Salakoski
    and Mauno Vihinen. Mobile Access to Biological
    Databases on the Internet.
  • Raymond K. Wong, William M. Shui. Utilizing
    Multiple Bioinformatics Information Sources: An
    XML Database Approach.
  • Ruey-Lung Hsiao, D. Stott Parker and Hung-chih
    Yang. Support for BioIndexing in BLASTgres.