Title: Effective design and analysis of bioinformation Unit 3
1 Effective design and analysis of
bioinformation, Unit 3
- BIOL221T Advanced Bioinformatics for
Biotechnology
Irene Gabashvili, PhD igabashvili_at_yahoo.com
2 Course availability
- Lectures and Lab: every Wednesday, Duncan Hall,
Room 550, 6:00 pm to 9:45 pm
- Office hours: Wednesday, 4pm-6pm (Room 554,
phone 92404831) and by appointment
- Lecture notes will be posted at
http://home.comcast.net/igabashvili/221T.htm
- Or the SJSU page --
- The user name is ewok\biostudents (don't enter
quotation marks)
- And the password is 4biolecture (don't enter
quotation marks).
3 In the News: Consumer genomics gets crowded
- http://www.seqwright.com/ SOLiD, ABI
- http://www.decodeme.com/ Illumina
- https://www.23andme.com/ Illumina
- http://www.navigenics.com/ Affymetrix
- http://www.knome.com/ ABI, Amersham, Illumina
4 https://www.23andme.com/experts/letters/science/
5 List from deCODE genetics
- Our current list of diseases includes
Age-related Macular Degeneration, Asthma,
Alzheimer's Disease, Atrial Fibrillation, Breast
Cancer, Celiac Disease, Colorectal Cancer,
Exfoliation Glaucoma XFG, Crohn's Disease,
Multiple Sclerosis, Myocardial Infarction,
Obesity, Prostate Cancer, Psoriasis, Restless
Legs, Rheumatoid Arthritis, Type 1 Diabetes and
Type 2 Diabetes.
6 Three important sub-disciplines within
bioinformatics
- The development of new algorithms and statistics
with which to assess relationships among members
of large data sets
- The analysis and interpretation of various types
of data, including nucleotide and amino acid
sequences, protein domains, and protein
structures
- The development and implementation of tools that
enable efficient access and management of
different types of biological information.
7 Main tasks of biomedical informatics
- Storage, analysis, visualization and management
of biomedical data
- Mining for new knowledge, hypothesis formulation
and testing
- Development of tools and resources for the above
8 Brief History of Bioinformatics
- 1920 - The term "genome" is introduced by H. Winkler
to denote the complete set of chromosomal and
extrachromosomal genes
- 1933 - A new technique, electrophoresis, is
introduced by Tiselius for separating proteins in
solution.
- 1951 - Pauling and Corey propose the structure
for the alpha-helix and beta-sheet
9 Brief History of Bioinformatics
- 1953 - Watson and Crick propose the double helix
model for DNA (data by Franklin and Wilkins)
- 1954 - Perutz's group develops methods to solve
the phase problem in protein crystallography.
- 1955 - The sequence of the first protein to be
analyzed, bovine insulin, is announced by F. Sanger.
- 1956 - The first protein sequence reported was
that of bovine insulin, consisting of 51 residues
10 Brief History of Bioinformatics
- 1962 - Pauling's theory of molecular evolution
- 1965 - M. Dayhoff's Atlas of Protein Sequences
- 1970 - Needleman-Wunsch algorithm
- 1972 - The Protein Data Bank
- 1980 - The first complete gene sequence for an
organism (ΦX174): 5,386 bp, nine proteins.
- 1981 - The Smith-Waterman algorithm. IBM
introduces its PC to the market. The concept of
a sequence motif (Doolittle)
11 Brief History of Bioinformatics
- 1983 - Sequence DB searching (Wilbur-Lipman)
- 1986 - Human Genome Initiative announcement
- 1987 - SWISS-PROT protein sequence database
- 1988 - NCBI created at NIH/NLM (databases)
- 1988 - FASTA by Pearson and Lipman. EMBL
establishes sequence database network
- 1990 - BLAST by Altschul et al.
- 2003 - Human Genome Project completion
12 The data of biomedical informatics
- Public and private databases store biological data
in various formats
- Sequences: DNA, RNA, proteins
- Structures: X-ray, NMR, microscopy
- Expression: microarrays, gels
- Interaction: two-hybrid, mass spec
- Metabolism: GC-MS, NMR
- Physiology: medical images, PK/PD
13 Search Engines
- AND, OR, NOT
- Specifying database fields (Organism, Author)
- Order of words: neonatal pre/3 screening
(neonatal within 3 words before screening)
- Spaces wom?n cats
14 Search and Download
- Entrez: integrated, text-based search and
retrieval system for PubMed, Nucleotide and
Protein Sequences, Protein Structures, Complete
Genomes, Taxonomy, etc.; batch download
- http://www.ncbi.nlm.nih.gov/sites/batchentrez
- term[field] OPERATOR term[field]
- 1:10[ESTC] AND Homo sapiens[ORGN] AND
deafness[dis] (BSND: Bartter syndrome,
infantile, with sensorineural deafness)
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=unigene
- More on the course's website
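A minimal sketch of how a fielded Entrez query of the form term[field] OPERATOR term[field] could be assembled as a string. The helper names `entrez_term` and `entrez_query` are hypothetical, and the TITL (title) tag is chosen only for illustration; the set of valid field tags depends on the database being searched.

```python
def entrez_term(term, field):
    """Format one search term with its field tag, e.g. deafness[TITL]."""
    # Quote multi-word terms so they are searched as a phrase.
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def entrez_query(pairs, operator="AND"):
    """Join (term, field) pairs with a single Boolean operator."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(entrez_term(t, f) for t, f in pairs)


query = entrez_query([("Homo sapiens", "ORGN"), ("deafness", "TITL")])
print(query)  # "Homo sapiens"[ORGN] AND deafness[TITL]
```

The resulting string can be pasted into the Entrez search box; the sketch only builds text and does not contact NCBI.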
15 DATA FORMATS AND DATA INTEGRATION
- It is widely recognized that successful data
integration is one of the keys to improved
productivity in biopharmaceutical R&D
- Success in most bioinformatics-related
activities, from functional characterization of
genomic sequences to prioritization of drug
targets, requires an integrated view of all
relevant data in a drug discovery R&D program
- Bioinformatics data sources often have large,
complex data structures, reflecting the richness
of the scientific concepts they model. Many
bioinformatics data sources cover similar
domains, such as genes, proteins, sequence
annotations or microarray results.
16 Database design links
- http://www.devx.com/ibm/Article/20702
- http://www.campus.ncl.ac.uk/databases/design/
- http://www.dbazine.com/mullins_datamodel.shtml
- http://www.extropia.com/tutorials/sql/toc.html
- http://www.surfermall.com/relational/lesson_1.htm
17 Database Definition
- A collection of data that
- is organized
- usually computer-based
- represents repetitive information implicitly
- supports retrieval
- A set of rules to manipulate data
- A method to mold information into knowledge
18 Database applications
- Who uses computerized databases?
- Stores to keep track of inventory
- Hospitals to keep track of patient info
- Travel agents to keep up with their customers
and reservations
- Biologists to efficiently manage and manipulate
their data
- DATA → INFORMATION → KNOWLEDGE
19 Paper Database as Expert System
20 HISTORY
- 1960's: Two main data models are developed:
network model (CODASYL) and hierarchical (IMS). A
user would need to know the physical structure of
the database in order to query for information.
SABRE IBM/AA.
- 1970-72: E.F. Codd proposes the relational model. He
disconnects the schema (logical organization) of
a database from the physical storage methods.
21 HISTORY
- 1970's
- Ingres (UCB) → Ingres Corp., Sybase, MS SQL
Server, Britton-Lee, Wang's PACE. This system
used QUEL as query language.
- System R (IBM) → IBM's SQL/DS and DB2, Oracle, HP's
Allbase, Tandem's Non-Stop SQL. This system used
SEQUEL as query language.
- The term Relational Database Management System
(RDBMS) is coined
22 HISTORY
- 1976: P. Chen proposes the Entity-Relationship
(ER) model for database design
- Early 1980's: Commercialization of relational
systems begins as a boom
- Mid-1980's: SQL (Structured Query Language)
becomes "intergalactic standard". DB2 becomes
IBM's flagship product. Network and hierarchical
models fade into the background
23 HISTORY
- Early 1990's: Application and personal
productivity tool development: PowerBuilder
(Sybase), Oracle Developer, VB (Microsoft),
Excel/Access (MS) and ODBC. First Object Database
Management Systems (ODBMS) prototypes.
- Mid-1990's: Internet/WWW. Web/DB grows
exponentially, usable for average users
24 HISTORY
- Late-1990's: Boom for Web/Internet/DB connectors.
Open source solutions with widespread use of gcc,
CGI, Apache, MySQL, etc. Online transaction
processing (OLTP) and online analytic processing
(OLAP) come of age
- Early 21st century: Burst of the dot-com bubble, but
solid growth of DB applications. PDAs, POS
transactions, IBM, Microsoft, Oracle.
25 FUTURE
- Terabyte and petabyte databases of everything
- Mobile databases
- Semantic Web
- Object Oriented Everything, including databases
- Object Database Management Group (ODMG) standards
are proposed and accepted
- Security issues
26 Database advantages
- Advantages of a database program:
- Can find a specific file quickly
- Can easily add records
- Can alphabetize and sort data faster than most
people
- Is as accurate as the data that is entered
- Can make many different types of reports
- Is invaluable for large amounts of data
27 Database Parts
- Parts of a relational database
- Fields: categories of information
- <table>
- Entry: data in a field
- Record: all of the information about one item
(row)
- File: document of all of the records
- To sort: pick a field, ascending or descending (Excel, Works)
28 Database types
- Flat (spreadsheet)
- Hierarchical
- Network (two fundamental constructs, called
records and sets)
- Relational
29 Relational Databases
- Relational databases started to get to be a big
deal in the 1970's, and they're still a big deal
today, which is a little peculiar, because
they're a 1960's technology.
- A relational database is a bunch of rectangular
tables. Each row of a table is a record about one
person or thing; the record contains several
pieces of information called fields.
30 Entities and Relationships
Entities: things we store information about
Relationships: links between the entities
Many-to-many, One-to-one, One-to-many
31 A Table is a Relation
Columns, Fields, Attributes
Rows, Records, Tuples, Entities
Records of data, comprised of fields, stored in tables
32 Keys and Functional Dependencies
- Key field (superkey, key) - a field that uniquely
identifies a record
- If there is a functional dependency between
column A and column B in a given table
(A → B), then the value of column A determines
the value of column B. (employeeID → name)
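As a rough illustration, a functional dependency can be tested against sample rows: A determines B if no value of A ever appears with two different values of B. The function name `holds` and the employee rows are made up for this sketch, and passing the check on sample data does not prove the dependency holds in general.

```python
def holds(rows, a, b):
    """Return True if column a determines column b in these rows."""
    seen = {}
    for row in rows:
        # setdefault records the first b-value seen for each a-value.
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True


employees = [
    {"employeeID": 1, "name": "Ann", "job": "Programmer"},
    {"employeeID": 2, "name": "Bob", "job": "Programmer"},
    {"employeeID": 3, "name": "Ann", "job": "DBA"},
]

print(holds(employees, "employeeID", "name"))  # True
print(holds(employees, "name", "job"))         # False: two Anns, two jobs
```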
33 Schema
- Database schema is the structure or design of the
database, a blueprint for the data in the
database.
- employee(employeeID, name, job, cube,
departmentID)
- What information needs to be stored? (things or
entities)
- What questions will we ask of the database?
(queries)
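One way the employee schema above might be declared and queried, sketched with Python's built-in sqlite3 module. The column types, the sample row, and the question asked are assumptions; the slide gives only the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The schema: what information is stored about each employee.
conn.execute("""
    CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        job          TEXT,
        cube         TEXT,
        departmentID INTEGER
    )
""")
conn.execute(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    (1, "Ann", "Programmer", "C-12", 42),
)

# A question we might ask of the database: who works in department 42?
names = [row[0] for row in conn.execute(
    "SELECT name FROM employee WHERE departmentID = ?", (42,))]
print(names)  # ['Ann']
```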
34 Flawed schemas
This schema design leads to redundancies
Employee(employee ID, name, job, department ID)
Department(Department ID, Department name)
35 Flawed schemas
Insertion Anomaly
Deletion Anomaly
Update Anomaly
36 Avoid Null Values
37 Normalization
Unnormalized table: lists instead of atomic
values. This violates the rules of first normal
form
38 Normalization
This schema is in first normal form, 1NF
39 Second Normal Form, 2NF
2NF: Attributes must depend on the whole key
40 3NF and BCNF (Boyce-Codd)
3NF: Attributes must depend on nothing but the
key
BCNF: all the functional dependencies must
have a superkey on the left side
41 Concepts
- Entities are things, and relationships are the
links between them.
- Relations or tables hold a set of data in tabular
form.
- Columns belonging to tables describe the
attributes that each data item possesses.
- Rows in tables hold data items with values for
each column in a table.
- Keys are used to identify a single row.
- Functional dependencies identify which attributes
determine the values of other attributes.
- Schemas are the blueprints for a database.
42 Design Principles
- Minimize redundancy without losing data.
- Insertion, deletion, and update anomalies are
problems that occur when trying to insert,
delete, or update data in a table with a flawed
structure.
- Avoid designs that will lead to large quantities
of null values.
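A small sketch of an update anomaly, using sqlite3. The table `employee_flat` and its rows are hypothetical: the department name is deliberately repeated in every employee row, which is the kind of redundancy the principles above warn against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Flawed design (hypothetical): department name stored once per employee.
conn.execute("""
    CREATE TABLE employee_flat (
        employeeID     INTEGER PRIMARY KEY,
        name           TEXT,
        departmentID   INTEGER,
        departmentName TEXT
    )
""")
conn.executemany(
    "INSERT INTO employee_flat VALUES (?, ?, ?, ?)",
    [(1, "Ann", 42, "Genomics"), (2, "Bob", 42, "Genomics")],
)

# Update anomaly: renaming the department via one row leaves the other stale.
conn.execute(
    "UPDATE employee_flat SET departmentName = 'Genome Sciences' "
    "WHERE employeeID = 1"
)
names = {row[0] for row in conn.execute(
    "SELECT departmentName FROM employee_flat WHERE departmentID = 42")}
print(len(names))  # 2: the same department now has two different names
```

Storing the department name once, in its own table keyed by department ID, would leave only one place to update.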
43 Normalization
- Normalization is a formal process for improving
database design.
- First normal form (1NF) means atomic column or
attribute values.
- Second normal form (2NF) means that all
attributes outside the key must depend on the
whole key.
- Third normal form (3NF) means no transitive
dependencies.
- Boyce-Codd normal form (BCNF) means that all
attributes must be functionally determined by a
superkey.
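To make the first two steps concrete, here is a minimal sketch in plain Python. The `skills` column and the sample rows are invented for illustration. The list-valued column breaks 1NF; splitting it into one row per value gives atomic values, and keeping `name` in its own table means it depends only on employeeID rather than on part of the (employeeID, skill) key.

```python
# Unnormalized: the skills column holds a list, not an atomic value.
unnormalized = [
    {"employeeID": 1, "name": "Ann", "skills": ["Perl", "SQL"]},
    {"employeeID": 2, "name": "Bob", "skills": ["Java"]},
]

# employee(employeeID, name): name depends on the whole key, employeeID.
employee = [
    {"employeeID": r["employeeID"], "name": r["name"]}
    for r in unnormalized
]

# employee_skill(employeeID, skill): one atomic skill per row.
employee_skill = [
    {"employeeID": r["employeeID"], "skill": s}
    for r in unnormalized
    for s in r["skills"]
]

print(len(employee), len(employee_skill))  # 2 3
```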
44 Hierarchical Databases
1234567
  Sandiego, Carmen
  123 Main Street
  Labs
    Chem7
      Na 142
      K 3.9
    Chem7
      Na 136
      K 4.3
45 Hierarchical Databases
- Easy to use
- Efficient storage
- Tree walking is fast
- Queries across trees are slow
- Flexible
- Too flexible chaos is allowed
- Too easy to modify
- Difficult to document complex structures
46 Hierarchical Databases
- EMR(1234567) = Sandiego, Carmen
- EMR(1234567, Address) = 123 Main Street
- EMR(1234567, Chem7, 2/2/02, Na) = 136
- EMR(1234567, Chem7, 2/2/02, K) = 4.3
- EMR(1234567, Chem7, 2/3/02, Na) = 142
- EMR(1234567, Chem7, 2/3/02, K) = 3.9
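The subscripted EMR entries above can be mimicked with nested Python dictionaries. This is only a sketch of the idea, not how a hierarchical DBMS stores data. The `"name"` key and the function `patients_with_low_potassium` are assumptions introduced for the example. It shows why walking one tree is a chain of direct lookups, while a question across patients must visit every tree.

```python
emr = {
    1234567: {
        "name": "Sandiego, Carmen",  # assumed key for the node's own value
        "Address": "123 Main Street",
        "Chem7": {
            "2/2/02": {"Na": 136, "K": 4.3},
            "2/3/02": {"Na": 142, "K": 3.9},
        },
    }
}

# Tree walking: follow one path down from the root.
print(emr[1234567]["Chem7"]["2/3/02"]["K"])  # 3.9


def patients_with_low_potassium(db, threshold):
    """Query across trees: scan every patient and every test date."""
    hits = []
    for patient_id, record in db.items():
        for date, panel in record.get("Chem7", {}).items():
            if "K" in panel and panel["K"] < threshold:
                hits.append((patient_id, date, panel["K"]))
    return hits


print(patients_with_low_potassium(emr, 4.0))  # [(1234567, '2/3/02', 3.9)]
```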
47 Hierarchical Chaos
1234567
Admissions
Admission 1
Admit Date 2/2/02
Primary DX CHF
Other DX
AODM
A Fib
Flag S
Flag P
48 Network Databases
1234567
Gyn Clinic
2 Main St.
Sandiego
305-2500
Secretary
Gyn Clinic
8AM-5PM
Ms Smith
305-1000
Service
Pap
Dr. Jones
Gyn Visit
Beeper 34
49 Extensible Markup Language (XML) Databases
- SGML is a metalanguage
- SGML is used to write Document Type Definitions
(DTDs) that define languages
- HTML is a language with an SGML DTD
- Tags are for formatting/presentation syntax
- XML is a proper subset of SGML
- XML defines tags that convey semantics
- We could write Health Markup Language (HML)
in XML (if we could agree on the semantics and
tags)
- Tags may or may not be stored with data
50 <document> </document>
<document.id>CXR001</document.id>
<doc.date>19991101</doc.date>
<document.type> </document.type>
<document.body> </document.body>
<identifier>P5-00010</identifier>
<text>Chest X-Ray</text>
<findings>No infiltrate, cardiac shadow not
enlarged...</findings>
<impression>Normal X-ray</impression>
51 <patient> </patient>
<patient.id> </patient.id>
<patient.name> </patient.name>
<patient.dob>19230113</patient.dob>
<patient.sex value="male"/>
<inpatient/>
<id.value>1234789</id.value>
<family.name>Sandiego</family.name>
<given.name>Carmen</given.name>
<suffix>M.D.</suffix>
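A short sketch of reading such a patient record with Python's standard xml.etree.ElementTree module. The nesting of the elements inside `<patient>` is an assumption made for this example; the slide lists the tags but does not show how they nest.

```python
import xml.etree.ElementTree as ET

# Assumed nesting: id.value inside patient.id, name parts inside patient.name.
xml_text = """
<patient>
  <patient.id><id.value>1234789</id.value></patient.id>
  <patient.name>
    <family.name>Sandiego</family.name>
    <given.name>Carmen</given.name>
    <suffix>M.D.</suffix>
  </patient.name>
  <patient.dob>19230113</patient.dob>
  <patient.sex value="male"/>
  <inpatient/>
</patient>
"""

root = ET.fromstring(xml_text)

# The tags carry the meaning: each value is looked up by name, not position.
family = root.findtext("patient.name/family.name")
sex = root.find("patient.sex").get("value")        # value stored as attribute
is_inpatient = root.find("inpatient") is not None  # empty tag used as a flag

print(family, sex, is_inpatient)  # Sandiego male True
```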
52 Extensible Markup Language (XML) Databases
- Strengths
- Flexibility to represent wide range of data
- Data carries its field assignment
- Sparse data handled compactly
- Tags can have platform-specific display
- Weaknesses
- Immature database tools
- Verbose
- I/O intensive
- A trade-off of decreased efficiency for increased
flexibility → scalability
53 Relational Databases - Advantages
- Comprehensible
- Multiple views possible
- Easy to modify
- New elements don't break programs
- Database management systems (DBMS)
- Referential integrity
- Reorg for efficiency
- Access control
- Locking for multiple simultaneous use
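Referential integrity, one of the DBMS services listed above, can be sketched with sqlite3. The two tables and their rows are hypothetical. Note that SQLite enforces foreign keys only after the `PRAGMA foreign_keys = ON` statement; other DBMSs enable this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

conn.execute(
    "CREATE TABLE department (departmentID INTEGER PRIMARY KEY, name TEXT)"
)
conn.execute("""
    CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,
        name         TEXT,
        departmentID INTEGER REFERENCES department(departmentID)
    )
""")
conn.execute("INSERT INTO department VALUES (42, 'Genomics')")
conn.execute("INSERT INTO employee VALUES (1, 'Ann', 42)")  # accepted

try:
    # Department 99 does not exist, so the DBMS refuses the row.
    conn.execute("INSERT INTO employee VALUES (2, 'Bob', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```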
54 Relational Databases - Disadvantages
- Storage overhead
- I/O-intense
- Cost