Data Representation in Bioinformatics - PowerPoint PPT Presentation

Slides: 30

Provided by: marily221

1
Data Representation in Bioinformatics

S. Sudarshan
Computer Science and Engg. Dept.
I.I.T. Bombay

2
Data Representation

Goal: represent data in an intuitive and convenient manner
Without unnecessary replication of information
Making it easy to write queries to find required information
Supporting efficient retrieval of required information
Data models:
Ad-hoc file formats (not really data models!)
XML (Extensible Markup Language)
Relational data model
Entity-relationship (ER) data model
Object-relational data model
Object-oriented data model

3
Data Representation in Genomics

Most common approach: text files
E.g. GenBank (see the GenBank example)
Advantage:
Easy to export data to others (integrating datasets is not my problem!)
Drawbacks:
Makes it hard to integrate information from different sources
Integration is essential for many applications, e.g. comparative studies
Multiplicity of formats makes interoperation difficult
Reading a particular file format requires a program designed to parse that file format
No standard query language
Complex queries are needed to integrate data from different sources
Several efforts to create standard file formats are based on a tag language called XML

4
GenBank Example

LOCUS       AB020037     300 bp    mRNA            EST       11-MAY-1999
DEFINITION  AB020037 Phaseolus vulgaris library (Watanabe T) cDNA, mRNA
            sequence.
ACCESSION   AB020037
VERSION     AB020037.1  GI:4783241
KEYWORDS    EST.
SOURCE      Phaseolus vulgaris.
  ORGANISM  Phaseolus vulgaris
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta
REFERENCE   1  (bases 1 to 300)
  AUTHORS   Watanabe,T., Watanabe,T, .
  TITLE     Partial cDNA G.max calnexin homologue from P.vulgaris
  JOURNAL   Unpublished (1999)
FEATURES             Location/Qualifiers
     source          1..300
                     /organism="Phaseolus vulgaris"
                     /db_xref="taxon:3885"
                     /clone_lib="Phaseolus vulgaris library (Watanabe T)"
BASE COUNT       92 a     50 c     82 g     76 t
ORIGIN
        1 gacctgcgat cttctacgaa tcattcgatg aggattttca agatcgttgg atcgtgtctc
       61 agaaagagga atacagtggt gtctggaaac atgccaagag tgagggacat gatgatcatg
      121 gtcttcttgt cagtgagaaa gcaagaaaat atgccatagt gaaggaactt gacaaggcag
      181 tgagtctcag ggatggaact gttgttctcc agtttgaaac tcggcttcag aatggacttg
      241 aatgtgaagg agcatatata aaatatctcc gaccacaggg atgctggatg ggaactctaa
//
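
The point that each flat-file format needs its own parser can be made concrete with a small sketch. The code below assumes only the layout visible in the GenBank example (top-level keywords start in column 0, continuation lines are indented, and `//` ends the record). The helper name `parse_genbank_header` is hypothetical and the record is abbreviated; this is an illustration, not a full GenBank parser.

```python
# Minimal sketch: pull top-level fields out of a GenBank-style flat-file record.
# parse_genbank_header is a hypothetical helper, not part of any library.

RECORD = """\
LOCUS       AB020037     300 bp    mRNA            EST       11-MAY-1999
DEFINITION  AB020037 Phaseolus vulgaris library (Watanabe T) cDNA, mRNA
            sequence.
ACCESSION   AB020037
KEYWORDS    EST.
SOURCE      Phaseolus vulgaris.
ORIGIN
        1 gacctgcgat cttctacgaa
//
"""


def parse_genbank_header(text):
    """Map each column-0 keyword to its value; indented lines continue it."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line[:1].isalpha():           # a new keyword starts in column 0
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None:        # continuation of the previous keyword
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


header = parse_genbank_header(RECORD)
print(header["ACCESSION"])
print(header["DEFINITION"])
```

Even this small parser encodes format-specific rules (column positions, the `//` terminator), which is why every additional file format means another program of this kind.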

5
XML: Extensible Markup Language

Simple XML example
E.g.
<faculty>
  <faculty-member facid="12349">
    <name> S.Sudarshan </name>
    <email> sudarsha@cse.iitb.ac.in </email>
  </faculty-member>
  <faculty-member facid="12987">
    <name> Pramod Wangikar </name>
    <email> pramodw@che.iitb.ac.in </email>
  </faculty-member>
</faculty>
Each piece of text enclosed by matching tags <xyz> ... </xyz> is called an element
Elements may have attributes (such as facid in the example above)
A DTD (Document Type Definition) specifies the allowed elements, the attributes of each element, and which elements may appear within each element (and how many times and in what order).
Each application defines a standard set of elements (including how they are nested) and attributes for each element
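
As a rough illustration of elements and attributes, the sketch below reads the faculty document with Python's standard `xml.etree.ElementTree` module. The variable names are illustrative only, and no DTD validation is performed (ElementTree does not validate).

```python
import xml.etree.ElementTree as ET

# The faculty document from the slide, with whitespace inside elements trimmed.
FACULTY_XML = """
<faculty>
  <faculty-member facid="12349">
    <name>S.Sudarshan</name>
    <email>sudarsha@cse.iitb.ac.in</email>
  </faculty-member>
  <faculty-member facid="12987">
    <name>Pramod Wangikar</name>
    <email>pramodw@che.iitb.ac.in</email>
  </faculty-member>
</faculty>
"""

root = ET.fromstring(FACULTY_XML.strip())

# Each <faculty-member> is an element; facid is an attribute of that element,
# while <name> and <email> are nested child elements.
members = [
    (member.get("facid"), member.findtext("name"), member.findtext("email"))
    for member in root.findall("faculty-member")
]

for facid, name, email in members:
    print(facid, name, email)
```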

6
XML Representation (Cont.)

Ad-hoc file representations are being replaced by standard XML representations (see e.g. http://i3c.open-bio.org)
Examples:
Gene Expression Markup Language (GEML) (http://www.geml.org)
(GEML 2.0 white paper: http://www.geml.org/docs/GEML2_0.pdf)
Bioinformatic Sequence Markup Language (BSML) (http://www.labbook.com/products/xmlbsml.asp), and many others
Earlier GenBank example in XML (BSML)
Benefits
Standardization will simplify inter-operation and
data sharing
XML tagged datasets are easy to read and
comprehend
Parsing of datasets is simple with XML
Problems
Standards take time to develop (for
human/political reasons)
More than one standard may evolve
People may not adopt standards, sticking to old
formats
Support for querying on XML data is still poor
(but will improve)

7
GenBank Example in XML (BSML)

<?xml version="1.0" ?>
<records>
  <record>
    <locus name="AB020037" bp="300" strands="" molecule="mRNA"
           geometry="linear" division="EST" date="11-MAY-1999"/>
    <definition>
      <![CDATA[AB020037 Phaseolus vulgaris library (Watanabe T) Phaseolus vulgaris cDNA, mRNA sequence]]>
    </definition>
    <accession name="AB020037"/>
    <version accession="AB020037.1" gi="4783241"/>
    <keywords> EST </keywords>
    ...
8
Present vs. Future

XML databases are coming but are not quite here yet
In alpha versions at best
Some relational databases provide support for storing XML data, but no support or poor support for querying complex XML data
The XML query language (XQuery) is still being standardized
Initial XML query implementations are likely to be poor compared to relational query implementations, which are mature
Interesting query execution/optimization problems remain to be solved, even ignoring bioinformatics
Relational data can be viewed as a special case of XML data
The issues we describe in the next few slides are also applicable to XML representation
XML is good for data exchange
Simple XML data can easily be converted to relations
Perhaps a few years down the road we can use XML for querying genomics data
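
The claim that simple XML data converts easily to relations can be sketched as follows, using the faculty document from the earlier XML slide and Python's built-in `sqlite3` module. The table layout `faculty(facid, name, email)` is an assumption made for this example: one row per element, one column per attribute or child element.

```python
import sqlite3
import xml.etree.ElementTree as ET

XML_DATA = """
<faculty>
  <faculty-member facid="12349">
    <name>S.Sudarshan</name>
    <email>sudarsha@cse.iitb.ac.in</email>
  </faculty-member>
  <faculty-member facid="12987">
    <name>Pramod Wangikar</name>
    <email>pramodw@che.iitb.ac.in</email>
  </faculty-member>
</faculty>
"""

conn = sqlite3.connect(":memory:")
# Assumed mapping: attribute and child elements each become a column.
conn.execute("CREATE TABLE faculty (facid TEXT PRIMARY KEY, name TEXT, email TEXT)")

rows = [
    (m.get("facid"), m.findtext("name"), m.findtext("email"))
    for m in ET.fromstring(XML_DATA.strip()).iter("faculty-member")
]
conn.executemany("INSERT INTO faculty VALUES (?, ?, ?)", rows)

# Once loaded, the data can be queried with ordinary SQL.
names = [row[0] for row in conn.execute("SELECT name FROM faculty ORDER BY facid")]
print(names)
```

This flat mapping only works because the document is regular; deeply nested or irregular XML would need several relations or a different approach.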

9
What are Relations?

The faculty relation:

Name        E-mail            Department
Pramod      pw@yahoo.com      Chem. Engg.
Seshadri    sesh@em.com       Mech. Engg.
Uday        uday@msn.com      Elec. Engg.
Sudarshan   sud@iitb.ac.in    Comp. Sci.

Attributes are the columns; tuples are the rows.
10
Relational Representation

The relational data model is widely used and
supported by all the popular commercial database
systems
Allows 1) information to be broken up into
logical units, and then 2) recombined in
different ways as required
Great for queries involving information from
multiple original sources
Can easily gather related information
e.g. information about a particular gene from
multiple datasets/experiments
Entity Relationship (E-R) Model
Higher level model than the relational model
Often used for design, and then converted
(automatically or manually) into a relational
schema
Has several diagrammatical representations
Widely used

11
Entities and Relationships

A database can be modeled as:
a collection of entities,
relationships among entities.
An entity is an object that exists and is distinguishable from other objects.
Example: gene, protein, experiment, organism, person
Entities have attributes
An entity set is a set of entities of the same type that share the same properties.
Example: set of all persons, companies, trees, holidays
Relationships provide connections between two or more entities
E.g. which genes were involved in which experiment

12
Example ER Diagram for Microarray Data

Entities represented by boxes, (binary)
relationships by lines with names and optional
attributes
See www.bioinf.man.ac.uk for a more realistic
version (the MaxD database)

[ER diagram. Relationships shown: Expt-Exptr, Expt-Sample, Expt-Array. Notation legend: many-to-one]
13
Schema Diagrams for MicroArray Data

Schema diagrams show multiple relations and their
interconnections
Lines link foreign key with referenced relation

Experimenter(Experimenter-Id, Name, E-mail, Dept., Institution)
Experiment(Experiment-Id, Date, Experimenter-Id, Sample-Id, Array-Id, Image)
Sample(Sample-Id, Organism, Cell-type, Drug-Ids)
Drug-Ids is a multivalued attribute
Expression-Value(Experiment-Id, Gene-Id, value)
Array(Array-Id, Manufacturer, Type, Batch)
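
One way to realize this schema diagram is sketched below with Python's built-in `sqlite3` module. Several choices are assumptions rather than part of the slides: underscores replace hyphens (hyphens are not valid in unquoted SQL identifiers), the table `microarray` stands in for Array, the column types are guesses, and the multivalued Drug-Ids attribute is moved into a separate `sample_drug` relation.

```python
import sqlite3

SCHEMA = """
CREATE TABLE experimenter (
    experimenter_id INTEGER PRIMARY KEY,
    name TEXT, email TEXT, dept TEXT, institution TEXT
);
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    organism TEXT, cell_type TEXT
);
-- Assumption: the multivalued Drug-Ids attribute becomes its own relation.
CREATE TABLE sample_drug (
    sample_id INTEGER REFERENCES sample(sample_id),
    drug_id TEXT,
    PRIMARY KEY (sample_id, drug_id)
);
CREATE TABLE microarray (
    array_id INTEGER PRIMARY KEY,
    manufacturer TEXT, array_type TEXT, batch TEXT
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    expt_date TEXT,
    experimenter_id INTEGER REFERENCES experimenter(experimenter_id),
    sample_id INTEGER REFERENCES sample(sample_id),
    array_id INTEGER REFERENCES microarray(array_id),
    image BLOB
);
CREATE TABLE expression_value (
    experiment_id INTEGER REFERENCES experiment(experiment_id),
    gene_id TEXT,
    value REAL,
    PRIMARY KEY (experiment_id, gene_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript(SCHEMA)

tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))

# An experiment that points at a non-existent sample should be rejected.
try:
    conn.execute("INSERT INTO experiment (experiment_id, sample_id) VALUES (1, 999)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

print(tables, fk_enforced)
```

Each `REFERENCES` clause corresponds to one line in the schema diagram: the foreign key attribute is stored explicitly in the referencing relation.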
14
Modeling Protein Data (from Paton and Goble)
15
Schema Diagrams vs. ER Notation

Don't confuse ER diagrams with schema diagrams
Differences:
In ER diagrams
Lines have names
There are no explicit foreign key attributes
In schema diagrams
Lines don't have names, but represent foreign key relationships
Foreign key attributes must be explicitly represented
Relationships in ER diagrams get converted to separate relations and/or foreign key relationships (more on this later)

16
Query Languages

A query language is a language in which a user requests information from the database.
Categories of languages:
Procedural
E.g. C/C++/Java
Advantage: powerful, can specify any query by programming
Disadvantage: interfacing directly to the database is cumbersome
Non-procedural
Web forms!
SQL
Advantage:
Can specify a query declaratively and let the database system figure out the best way of finding answers
Supports queries of medium complexity
Specialized languages
More complex queries (e.g. data mining such as classification and clustering) are implemented in a procedural language, with SQL acting as the interface to the database
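
The contrast between the two styles can be sketched on a toy table. The example below assumes a hypothetical `expression_value(experiment_id, gene_id, value)` table with made-up rows, held in an in-memory SQLite database; the same question ("which measurements exceed 2?") is answered once by looping in application code and once by a declarative SQL condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression_value (experiment_id INTEGER, gene_id TEXT, value REAL)")
conn.executemany(
    "INSERT INTO expression_value VALUES (?, ?, ?)",
    [(1, "g1", 2.5), (1, "g2", 0.4), (2, "g1", 3.1), (2, "g3", 1.0)],  # made-up data
)

# Procedural style: fetch every row and spell out the filtering step by step.
procedural = []
for experiment_id, gene_id, value in conn.execute(
        "SELECT experiment_id, gene_id, value FROM expression_value"):
    if value > 2:
        procedural.append((experiment_id, gene_id))
procedural.sort()

# Declarative style: state the condition and let the database decide how to
# evaluate it (for example, whether to use an index).
declarative = conn.execute(
    "SELECT experiment_id, gene_id FROM expression_value "
    "WHERE value > 2 ORDER BY experiment_id, gene_id"
).fetchall()

print(procedural == declarative)
```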

17
Problems of Diversity

Many different databases
Multiple databases for each of genome, proteome, transcriptome, metabolome (and perhaps any other "-ome" you choose to add!)
Need to cross-reference between these databases
Need an ontology to ensure consistent and unique
names
Instability
Names, data, even models keep changing
Modeling secondary information
Annotations, typically text based

18
Problems in Querying

Querying
What query languages to use? (AceDB (SGD), Icarus
(SRS), SQL?)
OO API (CORBA-based interfaces proposed by OMG/EMBL)
Querying and text mining on annotations
Queries that combine multiple databases and
paradigms
E.g. genome, proteome and annotations (text data)
Browsing and visualization
Generate hyperlinks in data automatically for
browsing
Visualization for sequence data, protein
structures, to depict correlations, etc

19
Problems of Scale and Distribution

Problems of scale
Genome: hundreds of gigabytes to terabytes (10^12 bytes)
Transcriptome (microarray):
Each chip has 10,000 measurements + image
Millions of experiments on different species/individuals/cells/conditions
Total: 1 petabyte/annum (10^15 bytes)
Bottom line: too big to hold everything locally
Ideally provide integrated view of all data, and
fetch actual data on demand
Limited access patterns
Can usually access data only via predefined Web
forms

20
Problems of Database Representation

Efficiency and flexibility of use are often at odds
E.g. the Expression-Value table in our schema can be huge
Array representation may be better but less convenient for users
Alternative: use one attribute for each gene
No database efficiently supports relations with thousands of attributes
But this is natural to lay users
Similarly, a user may want one relation for each of millions of experiments
Ideal:
a flexible view combined with an efficient implementation underneath, plus
query languages that offer metadata capabilities
E.g. for all relations whose name is in table N
21
References

Online information
Heaps and heaps of sites, many with actual data
Freely available data may be worth what you paid for it!
Tutorial on Information Management for Genome Level Bioinformatics, Paton and Goble, at VLDB 2001: http://www.dia.uniroma3.it/vldbproc/tut
European Molecular Biology Network: http://www.embnet.org/
Univ. Manchester site (with relational version of microarray data representation, and links to other sites): http://www.bioinf.man.ac.uk
Database textbook with absolutely no bioinformatics coverage (shameless sales pitch):
Database System Concepts, 4th Ed., by Silberschatz, Korth and Sudarshan (should come out in Indian edition in a few months)

22
End of Talk
23
Relational Schema Design Problems

Many flat file formats have lots of columns
E.g. Drug-effect

            Drug1   Drug2   Drug3   ...   Drug-n
Cancer1
Cancer2
Cancer3
...
Cancer-m

Beware:
Such structures (called crosstabs) are nice for humans to read, BUT
most databases cannot support relations with many columns!
And querying data with such columns is more complicated
Solution: use a schema drug-effect(cancer-type, drug, effect)
Alternative solution: use arrays to represent some such information (supported by some databases)
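
The conversion from a crosstab to the schema drug-effect(cancer-type, drug, effect) can be sketched in a few lines. The effect values below are invented placeholders; only the shape of the transformation is the point.

```python
# A small crosstab: one row per cancer type, one column per drug.
# The effect values are made up for illustration.
crosstab_header = ["cancer_type", "Drug1", "Drug2", "Drug3"]
crosstab_rows = [
    ["Cancer1", "strong", "none", "weak"],
    ["Cancer2", "none", "weak", "strong"],
]

# Unpivot: every cell becomes one (cancer_type, drug, effect) tuple.
drug_effect = [
    (row[0], drug, effect)
    for row in crosstab_rows
    for drug, effect in zip(crosstab_header[1:], row[1:])
]

# Adding Drug4 now means adding rows, not a new column, and a query such as
# "which drugs act strongly on Cancer2?" no longer needs to name each drug.
strong_on_cancer2 = [
    drug for cancer_type, drug, effect in drug_effect
    if cancer_type == "Cancer2" and effect == "strong"
]

print(drug_effect)
print(strong_on_cancer2)
```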

24
Relational Schema Design Problems (Cont.)

Another common mistake: having many relations with the same attributes
E.g. one relation for each cancer type, or one relation for each drug
Cancer1(...), Cancer2(...), ..., Cancer-n(...)
Most databases can handle only hundreds or a few thousand relations efficiently
Querying becomes more complicated when there are many relations
Solution: replace the many relations with the same attributes by a single relation with the same attributes, plus an extra attribute storing the name
Cancer(Type, ...)
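
A minimal sketch of this merge, using SQLite from Python. The slides do not say which attributes the per-type relations carry, so `patient_id` and `stage` are hypothetical, as are the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The mistake: one relation per cancer type, all with the same attributes.
# patient_id and stage are hypothetical attributes for this example.
per_type_tables = ("cancer1", "cancer2")
for table in per_type_tables:
    conn.execute(f"CREATE TABLE {table} (patient_id INTEGER, stage TEXT)")
conn.execute("INSERT INTO cancer1 VALUES (1, 'II')")
conn.execute("INSERT INTO cancer2 VALUES (2, 'I')")
conn.execute("INSERT INTO cancer2 VALUES (3, 'II')")

# The fix: a single relation with an extra attribute holding the former
# relation name.
conn.execute("CREATE TABLE cancer (type TEXT, patient_id INTEGER, stage TEXT)")
for table in per_type_tables:
    conn.execute(
        f"INSERT INTO cancer SELECT ?, patient_id, stage FROM {table}", (table,))

# One query now covers every cancer type; before, it had to be repeated
# (or UNIONed) once per relation.
stage_two = conn.execute(
    "SELECT type, patient_id FROM cancer WHERE stage = 'II' ORDER BY patient_id"
).fetchall()
print(stage_two)
```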

25
Alternative E-R Notations

Crow's feet notation
Total participation (each entity participates in at least one relationship) is indicated by an extra bar

[Diagram: relationships R1 and R2 drawn in crow's feet notation]
26
E-R Diagram For Our Example

[E-R diagram. Labels in the figure: Gene, Expression-Value, Value, Experimenter, Experimenter-Id, E-mail, Dept., Institution, Experiment, Image, Sample, Drugs, Array, Expt-Exptr, Expt-Sample, Expt-Array]

27
Relational Schema Design Principles

Redundancy
E.g. Array-genes(..., fragment-seq, gene-seq, gene-mutations, ...)
is better represented as
Array-genes(..., fragment-seq, gene-id)
Gene(gene-id, gene-seq, gene-mutations)
Otherwise data is replicated unnecessarily
I.e. mutation information is stored multiple times
Redundancy can be useful for better query performance, but should be used in a thought-out manner, not by accident
Inability to express information
E.g. if a gene is not stored in Array-genes we cannot store its mutation information
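
Both problems can be seen in a small sketch of the decomposed schema. It assumes the two relations Array-genes and Gene as given above (with underscores in place of hyphens), made-up sequences, and mutation information stored as plain text for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id TEXT PRIMARY KEY,
    gene_seq TEXT,
    gene_mutations TEXT
);
CREATE TABLE array_genes (
    fragment_seq TEXT,
    gene_id TEXT REFERENCES gene(gene_id)
);
""")

# Gene g2 appears on no array, yet its mutation information can be stored.
conn.execute("INSERT INTO gene VALUES ('g1', 'atgc', 'm1')")
conn.execute("INSERT INTO gene VALUES ('g2', 'ggcc', 'm7')")

# Two fragments of g1: the gene's data is stored once, not once per fragment.
conn.executemany(
    "INSERT INTO array_genes VALUES (?, ?)", [("at", "g1"), ("gc", "g1")])

# A single-row update is seen by every fragment of the gene.
conn.execute("UPDATE gene SET gene_mutations = 'm1,m2' WHERE gene_id = 'g1'")

joined = conn.execute(
    "SELECT a.fragment_seq, g.gene_mutations "
    "FROM array_genes a JOIN gene g ON a.gene_id = g.gene_id "
    "ORDER BY a.fragment_seq"
).fetchall()
gene_count = conn.execute("SELECT count(*) FROM gene").fetchone()[0]
print(joined, gene_count)
```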

28
Basic SQL Queries

Find the image for experiment number 1345:
select image
from experiment
where experiment-id = 1345
Find the experiment-id and image of all experiments involving e-coli:
select experiment-id, image
from experiment, sample
where experiment.sample-id = sample.sample-id
  and sample.organism = 'e-coli'
All combinations of rows from the relations in the from clause are considered, and those that satisfy the where conditions are output
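
The two queries can be tried out against a toy database. The sketch below uses Python's built-in `sqlite3` module with invented rows and image file names; identifiers use underscores because hyphenated names such as experiment-id are not valid unquoted SQL identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, organism TEXT);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    sample_id INTEGER,
    image TEXT
);
INSERT INTO sample VALUES (1, 'e-coli'), (2, 'yeast');
INSERT INTO experiment VALUES
    (1345, 1, 'img1345.tif'),
    (1346, 2, 'img1346.tif'),
    (1347, 1, 'img1347.tif');
""")

# Query 1: the image for experiment number 1345.
image = conn.execute(
    "SELECT image FROM experiment WHERE experiment_id = ?", (1345,)
).fetchone()[0]

# Query 2: experiment id and image of all experiments involving e-coli.
# The FROM clause pairs every experiment row with every sample row; the
# WHERE clause keeps only the matching pairs.
ecoli = conn.execute(
    "SELECT experiment.experiment_id, experiment.image "
    "FROM experiment, sample "
    "WHERE experiment.sample_id = sample.sample_id "
    "  AND sample.organism = ? "
    "ORDER BY experiment.experiment_id",
    ("e-coli",),
).fetchall()

print(image)
print(ecoli)
```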

29
Complex Queries and Views

A view consisting of experiments with their number of active genes:
create view expt-active-genes as
  select experiment.experiment-id, count(gene-id) as active-cnt
  from experiment, expression-value
  where expression-value.experiment-id = experiment.experiment-id
    and value > 2
  group by experiment.experiment-id
Find the number of active genes in experiment E-123:
select active-cnt
from expt-active-genes
where experiment-id = 'E-123'
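
A runnable sketch of the view, again using SQLite from Python with invented data. It assumes that "active" means a value greater than 2, as in the view definition, and uses underscores in identifiers in place of hyphens.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment (experiment_id TEXT PRIMARY KEY);
CREATE TABLE expression_value (experiment_id TEXT, gene_id TEXT, value REAL);

INSERT INTO experiment VALUES ('E-123'), ('E-124');
INSERT INTO expression_value VALUES
    ('E-123', 'g1', 2.5),
    ('E-123', 'g2', 3.0),
    ('E-123', 'g3', 0.1),
    ('E-124', 'g1', 4.2);

-- One row per experiment, counting the genes whose value exceeds 2.
CREATE VIEW expt_active_genes AS
    SELECT experiment.experiment_id AS experiment_id,
           count(gene_id) AS active_cnt
    FROM experiment, expression_value
    WHERE expression_value.experiment_id = experiment.experiment_id
      AND value > 2
    GROUP BY experiment.experiment_id;
""")

# The view is queried like an ordinary relation.
active = conn.execute(
    "SELECT active_cnt FROM expt_active_genes WHERE experiment_id = ?",
    ("E-123",),
).fetchone()[0]
print(active)
```

Note that an experiment with no active genes does not appear in this view at all, because the join and the filter leave it with no rows to count.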