Bioinformatics Databases: Fundamental Concepts of Database Technology

1
Bioinformatics Databases: Fundamental Concepts of Database Technology
Data Organization
  • Kristen Anton
  • Director of BioInformatics
  • Dartmouth Medical School

BioInformatics @ Dartmouth Medical School
2
How can data be organized?
  • Paper (e.g. in notebooks)
  • Flat files
      • Collection of data records
      • Minimal structure, no metadata
      • Application program must contain relationship
        information
  • Database
      • Hierarchical
      • Network
      • Relational
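
To make the flat-file point concrete, here is a minimal sketch in which the link between two files lives only in the application code. The file contents, identifiers and lengths are invented placeholders, not data from the presentation.

```python
import csv
import io

# Two hypothetical headerless flat files (all values are made-up placeholders).
# Nothing in the files says what the columns mean, or that the second column
# of SEQUENCES refers to the first column of GENES.
GENES = "G1,geneA\nG2,geneB\n"
SEQUENCES = "S1,G1,1200\nS2,G2,800\nS3,G1,450\n"


def read_rows(text):
    """Parse a headerless comma-separated flat file into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


def sequences_with_gene_names(genes_text, sequences_text):
    # The relationship (sequence field 1 -> gene field 0, zero-based) is
    # hard-coded here because the flat files carry no metadata.
    gene_name = {row[0]: row[1] for row in read_rows(genes_text)}
    return [(seq_id, gene_name[gene_id], int(length))
            for seq_id, gene_id, length in read_rows(sequences_text)]


result = sequences_with_gene_names(GENES, SEQUENCES)
print(result)
```

If the column order of either file changes, the program silently breaks; that is the cost of keeping relationship information outside the data.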


5
What is a relational database?
"A database composed of relations and
conforming to a set of principles governing how
such relations are supposed to behave (Codd's 12
Rules). There are many database systems that use
tables but don't conform to all of the
principles. These are often called
semirelational systems."
(from Understanding SQL, Martin Gruber)

6
Practically speaking...
  • A database is a body of information stored in two
    dimensions (rows and columns)
  • Rows are records
  • Columns are attributes of those record entities
    (usually!)
  • The groups of rows and columns, or tables, are
    largely independent of each other
  • The power of the database lies in the
    relationships that you construct among the tables
  • A database is self-describing: it contains
    metadata, which is a description of its own
    structure
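
As an illustration of "self-describing", the sketch below asks a database for its own structure. It assumes SQLite through Python's built-in sqlite3 module, which is not a system named in the presentation; the table name Sequence is borrowed from the query shown later in the deck, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sequence (sname TEXT PRIMARY KEY, length INTEGER)")

# The catalog table sqlite_master lists the objects the database contains.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2])
           for row in conn.execute("PRAGMA table_info(Sequence)")]

print(tables)
print(columns)
conn.close()
```

Other DBMSs expose the same idea through their own catalogs, so the exact queries differ from system to system.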


7
What is a Database Management System (DBMS)?
  • A set of programs which define, administer and
    process databases and their associated
    applications
  • A scalable DBMS can run on multiple platforms
    (varying sizes)
  • A DBMS that supports interoperability uses
    industry-standard language and standard ways of
    exchanging data


Examples: Oracle, Sybase, 4D, MS Access
8
Features of a Relational Database
  • Rows (records) are in no particular order
  • Columns (fields) are ordered, numbered and named;
    names should indicate the content of the field
  • The primary key uniquely identifies each row; it
    ensures that no row is empty, and that every row
    is different from every other row
  • Two-step commit process
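
The primary-key behaviour can be sketched as follows, again assuming SQLite via Python's sqlite3 module. The table and its rows are invented for illustration. The commit and rollback calls show an ordinary single-database transaction only; they are not meant as an implementation of the commit process named in the last bullet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE KnownGenes (gname TEXT PRIMARY KEY NOT NULL, length INTEGER)")
conn.execute("INSERT INTO KnownGenes VALUES ('geneA', 1200)")
conn.commit()


def try_insert(gname, length):
    """Attempt an insert; return True if committed, False if it was rejected."""
    try:
        conn.execute("INSERT INTO KnownGenes VALUES (?, ?)", (gname, length))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # undo the failed change
        return False


duplicate_ok = try_insert("geneA", 999)  # same key as an existing row
missing_ok = try_insert(None, 500)       # no key at all
new_ok = try_insert("geneB", 800)        # a new, distinct key
row_count = conn.execute("SELECT COUNT(*) FROM KnownGenes").fetchone()[0]
print(duplicate_ok, missing_ok, new_ok, row_count)
```

The explicit NOT NULL is deliberate: SQLite, unlike most systems, tolerates NULL in a non-integer primary key unless told otherwise.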

9
Features of a Relational Database
  • A view is a subset of the database that an
    application (or user) can process
  • The database schema is the structure of the
    entire database
  • A constraint is a condition you apply to an
    attribute of a table
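
A short sketch of a view and a constraint, assuming SQLite via Python's sqlite3 module. The view name LongSequence, the 1000-unit threshold and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Sequence (
        sname  TEXT PRIMARY KEY NOT NULL,
        length INTEGER CHECK (length > 0)   -- constraint on one attribute
    )""")
conn.executemany("INSERT INTO Sequence VALUES (?, ?)",
                 [("seq1", 300), ("seq2", 4500), ("seq3", 12000)])

# A view: a named subset of the database that a user or application can query.
conn.execute("""
    CREATE VIEW LongSequence AS
    SELECT sname, length FROM Sequence WHERE length >= 1000""")
long_names = [r[0] for r in conn.execute(
    "SELECT sname FROM LongSequence ORDER BY sname")]

# The constraint rejects a row whose length is not positive.
try:
    conn.execute("INSERT INTO Sequence VALUES ('bad', -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(long_names, rejected)
```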

10
Relationships between tables
  • One-to-One, Many-to-One, Many-to-Many
  • A join is an operation that combines data from
    multiple tables into a single result table
  • An E-R (entity-relationship) diagram is the basic
    graphic to describe the structure of a database

SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
FROM Sequence, KnownGenes
WHERE KnownGenes.length = Sequence.length
-- the comparison operator is missing in the transcript; "=" is assumed
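
The join can be tried end to end with the sketch below, which assumes SQLite via Python's sqlite3 module. The table and column names come from the query above; the rows are invented, and the equality comparison between the two length columns is an assumption, since the transcript does not preserve the operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sequence   (sname TEXT PRIMARY KEY, length INTEGER);
    CREATE TABLE KnownGenes (gname TEXT PRIMARY KEY, length INTEGER);
    INSERT INTO Sequence   VALUES ('seq1', 1200), ('seq2', 800), ('seq3', 450);
    INSERT INTO KnownGenes VALUES ('geneA', 1200), ('geneB', 450), ('geneC', 9000);
""")

# Each result row combines one Sequence row with one KnownGenes row
# whose length matches (assumed comparison).
rows = conn.execute("""
    SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
    FROM Sequence
    JOIN KnownGenes ON KnownGenes.length = Sequence.length
    ORDER BY Sequence.sname
""").fetchall()
print(rows)
```

The explicit JOIN ... ON form is equivalent to listing both tables in FROM and filtering in WHERE; it only makes the join condition easier to spot.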
11
E-R Diagram
12
The tool for communicating with relational
databases: SQL
  • Structured Query Language (SQL)
  • A query is a question you ask the database, and
    SQL retrieves the appropriate answer set
  • Interactive SQL (command line) vs. RAD tool/GUI
  • Standardization issue: ANSI (American National
    Standards Institute)

13
Data Types
  • A field's data type determines which operations
    (functions) are possible on it and between related fields
  • Each field is assigned one data type (imposes
    structure on data)
  • Examples: text (CHAR, VARCHAR), number (INT,
    DEC), date, time, money, binary
  • Standardization issue: ANSI (American National
    Standards Institute)
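
A small sketch of why the declared type matters, assuming SQLite via Python's sqlite3 module. The table t and its values are hypothetical. SQLite maps VARCHAR(10) to text storage and INT to integer storage; other systems enforce types more strictly, which is part of the standardization issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (as_text VARCHAR(10), as_number INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("9", 9), ("10", 10), ("100", 100)])

# Text values sort character by character; numbers sort by magnitude.
text_order = [r[0] for r in conn.execute(
    "SELECT as_text FROM t ORDER BY as_text")]
number_order = [r[0] for r in conn.execute(
    "SELECT as_number FROM t ORDER BY as_number")]

# Arithmetic is meaningful on the numeric field.
total = conn.execute("SELECT SUM(as_number) FROM t").fetchone()[0]
print(text_order, number_order, total)
```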

14
A word about database design
  • Designing a database is not trivial
  • The value is not in the data, but in the
    structure
  • Design to facilitate the retrieval and
    interpretation of the data

16
Design database for data extraction: think it
through
  • Relationships ease extraction and/or reporting of
    data from the system
  • Redundancy
  • Concept of attributes in rows instead of columns
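
One common reading of "attributes in rows instead of columns" is the entity-attribute-value layout; that interpretation is an assumption, as the slide gives no detail. A minimal sketch using SQLite via Python's sqlite3 module, with an invented SampleAttribute table and placeholder values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per (sample, attribute) pair instead of one column per attribute.
    CREATE TABLE SampleAttribute (
        sample_id TEXT NOT NULL,
        attribute TEXT NOT NULL,
        value     TEXT,
        PRIMARY KEY (sample_id, attribute)
    );
    INSERT INTO SampleAttribute VALUES
        ('s1', 'tissue', 'colon'), ('s1', 'age', '61'),
        ('s2', 'tissue', 'colon'), ('s2', 'age', '47'), ('s2', 'stage', 'II');
""")


def attributes_of(sample_id):
    """Collect one sample's attribute rows back into a column-like dict."""
    cur = conn.execute(
        "SELECT attribute, value FROM SampleAttribute WHERE sample_id = ?",
        (sample_id,))
    return dict(cur.fetchall())


s1 = attributes_of("s1")
s2 = attributes_of("s2")
print(s1, s2)
```

The trade-off: a new attribute needs no change to the table definition, but every value is stored as text and reports must reassemble rows into columns.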

19
Example: BioInformatics Core Technology
  • Reusable core modules, with customizable
    components
  • Standard business logic framework controls
    transactions (middle layer)
  • Metadata-based back-end data storage (facilitates
    data sharing)

21
Data Security: High Priority
HIPAA, FIPS 140-2 (VA), IRB requirements
22
Life science has become a field which generates
an enormous amount of un-integrated data.
How can methods for data organization help to
solve this problem?
23
What is Data Integration?
  • Creating a system which allows the extraction of
    a piece or set of information (query result)
    across multiple domains (possibly disparate data
    sources - flat files, databases, spreadsheets,
    URLs...)

24
Sample integration problem: Cancer Biomarker
Discovery
  • Clinical center collects blood samples from 1000
    individuals with colon cancer
  • Expression analysis reveals that protein x is
    over-expressed in these samples, relative to
    controls
  • Could this be a colon cancer biomarker?

25
Understanding transcription factors for protein
x production
Show me all genes in the public literature that
are putatively related to protein x, have more
than 4-fold expression differential between
affected and normal tissue, and are homologous to
known transcription factors.
Q1 (SEQUENCE): Find homologs
Q2 (EXPRESSION): Find genes with 4-fold differential
Q3 (LITERATURE): Show me genes in public literature
(Q1 ∩ Q2 ∩ Q3)
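
The three sub-queries each run against a different source, and the answer is whatever all three agree on. Reading the combination as a set intersection is an assumption based on the wording of the question. A minimal sketch with invented placeholder gene identifiers:

```python
# Hypothetical result sets, one per source; identifiers are placeholders.
q1_homologs = {"geneA", "geneB", "geneC", "geneE"}     # from SEQUENCE
q2_differential = {"geneB", "geneC", "geneD"}          # from EXPRESSION
q3_literature = {"geneA", "geneB", "geneC", "geneF"}   # from LITERATURE


def integrate(*result_sets):
    """Return the genes present in every result set (set intersection)."""
    if not result_sets:
        return set()
    return set.intersection(*map(set, result_sets))


candidates = sorted(integrate(q1_homologs, q2_differential, q3_literature))
print(candidates)
```

The hard part in practice is not the intersection but making sure the three sources use the same identifier for the same gene.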
26
Key components to integration
  • Accessing without modifying original data sources
  • Handling redundant, conflicting, missing,
    changing (versions) data
  • Normalizing analytical data from different data
    sources
  • Conforming terminology to industry standards
  • Accessing the integrated data as a single
    repository
  • Including metadata in repository

27
Approaches to Integration: where are the key
issues addressed?
  • Federated database (poses constraints on original
    data sources; fragility in reliance on source
    systems)
  • Data warehousing (ETL layer, original data
    sources untouched, requires understanding of
    domain, sophisticated update/archive processes)
  • Integrating data source profiles
  • Indexed flat files
  • Others
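
For the data-warehousing approach, the extract-transform-load idea can be sketched in a few lines. Everything here is hypothetical: the two sources, their field names and units, and the warehouse table. SQLite via Python's sqlite3 module stands in for the warehouse.

```python
import sqlite3

# Two hypothetical sources describing the same kind of record differently.
source_a = [{"gene": "geneA", "len_bp": 1200}]       # lengths in base pairs
source_b = [{"GeneName": "GENEB", "len_kb": 4.5}]    # lengths in kilobases


def extract_transform():
    """Read both sources (left untouched) and map them to one common shape."""
    for rec in source_a:
        yield rec["gene"].lower(), int(rec["len_bp"]), "source_a"
    for rec in source_b:
        # Conform naming and convert kilobases to base pairs.
        yield rec["GeneName"].lower(), int(rec["len_kb"] * 1000), "source_b"


warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE gene (name TEXT, length_bp INTEGER, origin TEXT)")
warehouse.executemany("INSERT INTO gene VALUES (?, ?, ?)", extract_transform())
warehouse.commit()

loaded = warehouse.execute(
    "SELECT name, length_bp, origin FROM gene ORDER BY name").fetchall()
print(loaded)
```

The origin column records where each row came from, a small example of keeping metadata in the repository.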

28
Data Warehousing
32
Metadata: one key to success
  • Describes data types, relationships, histories,
    etc.
  • Back-end (supports developers), front-end
    (supports users and application)

Data value: 55
Metadata values:
  Data element name: vehicle speed
  Unit: miles per hour
  Description: the average velocity of a vehicle
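
The vehicle-speed example can be written down as a small data structure. This is only a sketch: the class and field names are invented, and only the four values come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    element_name: str
    unit: str
    description: str


@dataclass(frozen=True)
class DataElement:
    value: float
    metadata: Metadata

    def describe(self) -> str:
        """Render the bare value together with the metadata that explains it."""
        return f"{self.metadata.element_name}: {self.value} {self.metadata.unit}"


speed = DataElement(
    55,
    Metadata("vehicle speed", "miles per hour",
             "the average velocity of a vehicle"),
)
print(speed.describe())
```

On its own, 55 could be an age, a temperature or a count; the metadata is what makes it interpretable.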
33
Standards: the final frontier
  • Naming conventions
  • Standard coordinate systems
  • Unify interpretations of single object types
  • Unify software solutions to the same problem
    (also data formats)
  • Standards for metadata (incompatible or missing
    metadata)

34
Developing Standards for Life Sciences Research
  • Discovery science does not lend itself well to
    constraints (especially system constraints)
  • Decentralized data management infrastructure,
    competition
  • Wildly varying skill levels for data and
    information management

Several groups (Bio-Ontologies, HGNC, OMG, etc.)
and national research initiatives (EDRN, caBIG,
etc.) are taking the lead in the effort to create
workable standards.
35
New approach to integration: Cancer Biomarker
Discovery
  • Network of distributed data silos (does not
    perturb data sources)
  • Centralized query and business logic servers,
    accessed through web interface
  • CORBA framework manages XML profile definitions
    across the web
  • A profile is a set of resource definitions
    implemented in XML for data sources residing in
    one or more distributed systems
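
To give a feel for what a profile might look like, here is a purely hypothetical sketch. The presentation does not show the actual profile format, so every element name, attribute name and value below is invented; only the idea of "a set of resource definitions implemented in XML" comes from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: element and attribute names are invented.
PROFILE_XML = """
<profile id="example-profile">
  <resource name="expression" type="database" location="site-a"/>
  <resource name="literature" type="flatfile" location="site-b"/>
</profile>
"""


def load_profile(xml_text):
    """Parse a profile into (profile id, list of resource-definition dicts)."""
    root = ET.fromstring(xml_text.strip())
    return root.get("id"), [dict(res.attrib) for res in root.findall("resource")]


profile_id, resources = load_profile(PROFILE_XML)
print(profile_id, resources)
```

A central query server could read such definitions to learn which distributed source holds which kind of data without touching the sources themselves.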
