Effective design and analysis of bioinformation Unit 3

About This Presentation


Slides: 55

Provided by: irenegab


Transcript and Presenter's Notes


1
Effective design and analysis of
bioinformation Unit 3

BIOL221T Advanced Bioinformatics for
Biotechnology

Irene Gabashvili, PhD igabashvili_at_yahoo.com
2
Course availability

Lectures & Lab: every Wednesday, Duncan Hall,
Room 550, 6:00 pm to 9:45 pm
Office hours: Wednesday, 4pm-6pm (Room 554,
phone 92404831) and by appointment
Lecture notes will be posted at
http://home.comcast.net/igabashvili/221T.htm
Or the SJSU page --
The user name is ewok\biostudents (don't enter
quotation marks)
And the password is 4biolecture (don't enter
quotation marks).

3
Consumer genomics gets crowded
In the News

http://www.seqwright.com/ (SOLiD, ABI)
http://www.decodeme.com/ (Illumina)
https://www.23andme.com/ (Illumina)
http://www.navigenics.com/ (Affymetrix)
http://www.knome.com/ (ABI, Amersham, Illumina)

4
https://www.23andme.com/experts/letters/science/
5
List from deCODE genetics

Our current list of diseases includes:
Age-related Macular Degeneration, Asthma,
Alzheimer's Disease, Atrial Fibrillation, Breast
Cancer, Celiac Disease, Colorectal Cancer,
Exfoliation Glaucoma (XFG), Crohn's Disease,
Multiple Sclerosis, Myocardial Infarction,
Obesity, Prostate Cancer, Psoriasis, Restless
Legs, Rheumatoid Arthritis, Type 1 Diabetes and
Type 2 Diabetes.

6
Three important sub-disciplines within
bioinformatics

the development of new algorithms and statistics
with which to assess relationships among members
of large data sets
the analysis and interpretation of various types
of data including nucleotide and amino acid
sequences, protein domains, and protein
structures
the development and implementation of tools that
enable efficient access and management of
different types of biological information.

7
Main tasks of biomedical informatics

Storage, Analysis, Visualization and Management
of biomedical data
Mining for new knowledge, hypothesis formulation
and testing
Development of tools and resources for the above

8
Brief History of Bioinformatics

1920 - The term genome was introduced by H. Winkler
to denote the complete set of chromosomal and
extrachromosomal genes
1933 - A new technique, electrophoresis, is
introduced by Tiselius for separating proteins in
solution.
1951 - Pauling and Corey propose the structure
for the alpha-helix and beta-sheet

9
Brief History of Bioinformatics

1953 - Watson & Crick propose the double helix
model for DNA (data by Franklin & Wilkins)
1954 - Perutz's group develops methods to solve
the phase problem in protein crystallography.
1955 - The sequence of the first protein to be
analyzed, bovine insulin, announced by F. Sanger.
1956 - The first protein sequence reported was
that of bovine insulin, consisting of 51 residues

10
Brief History of Bioinformatics

1962 - Pauling's theory of molecular evolution
1965 - M. Dayhoff's Atlas of Protein Sequences
1970 - Needleman-Wunsch algorithm
1972 - The Protein Data Bank
1980 - The first complete gene sequence for an
organism (ΦX174): 5,386 bp, nine proteins.
1981 - The Smith-Waterman algorithm. IBM
introduces its PC to the market. The concept of
a sequence motif (Doolittle)

11
Brief History of Bioinformatics

1983 - Sequence DB searching (Wilbur-Lipman)
1986 - Human Genome Initiative announcement
1987 - SWISSPROT protein sequence database
1988 - NCBI created at NIH/NLM (databases)
1988 - FASTA by Pearson and Lipman. EMBL
establishes sequence database network
1990 - BLAST by Altschul et al.
2003 - Human Genome Project completion

12
The data of biomedical informatics

Public & private databases store biological data
in various formats
Sequences: DNA, RNA, proteins
Structures: X-ray, NMR, microscopy
Expression: microarrays, gels
Interaction: two-hybrid, mass spec
Metabolism: GC-MS, NMR
Physiology: medical images, PK/PD

13
Search Engines

AND, OR, NOT
Specifying database fields (Organism, Author)
Order of words: neonatal pre/3 screening
(neonatal within 3 words before screening)
Spaces wom?n cats
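
The proximity and single-character wildcard operators above can be sketched in a few lines. This is only an illustration, assuming that pre/n means the second word appears among the n words that follow the first (the exact rule differs between search engines); the function names are hypothetical.

```python
import fnmatch
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def matches_pre(text, first, second, n):
    """True if `second` appears among the n words that follow `first`."""
    words = tokenize(text)
    for i, word in enumerate(words):
        if word == first and second in words[i + 1 : i + 1 + n]:
            return True
    return False


def matches_wildcard(text, pattern):
    """True if any word matches `pattern`, where ? stands for one character."""
    return any(fnmatch.fnmatchcase(word, pattern) for word in tokenize(text))
```

For example, matches_pre("neonatal hearing screening", "neonatal", "screening", 3) accepts the phrase, while the reversed word order is rejected.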

14
Search & Download

Entrez: integrated, text-based search and
retrieval system for PubMed, Nucleotide and
Protein Sequences, Protein Structures, Complete
Genomes, Taxonomy, etc. Batch download:
http://www.ncbi.nlm.nih.gov/sites/batchentrez
term[field] OPERATOR term[field]
110[ESTC] AND Homo sapiens[ORGN] AND
deafness[dis] (BSND: Bartter syndrome,
infantile, with sensorineural deafness)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=unigene
More on the course's website
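
Field-qualified queries of this kind are written in Entrez as term[field] OPERATOR term[field]. A minimal sketch of assembling such a query string follows; the helper names, the choice of the protein database, and the [ALL] field tag are assumptions made for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode


def field_term(term, field):
    """Format one search term with its field tag, e.g. Homo sapiens[ORGN]."""
    return f"{term}[{field}]"


def build_query(terms, operator="AND"):
    """Join (term, field) pairs with a single Boolean operator."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(field_term(t, f) for t, f in terms)


query = build_query([("Homo sapiens", "ORGN"), ("deafness", "ALL")])

# URL-encode the query for the NCBI E-utilities esearch endpoint (not fetched here).
url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
       + urlencode({"db": "protein", "term": query}))
```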

15
DATA FORMATS AND DATA INTEGRATION

It is widely recognized that successful data
integration is one of the keys to improved
productivity in biopharmaceutical R&D
Success in most bioinformatics-related
activities, from functional characterization of
genomic sequences to prioritization of drug
targets, requires an integrated view of all
relevant data in a drug discovery R&D program
Bioinformatics data sources often have large,
complex data structures, reflecting the richness
of the scientific concepts they model. Many
bioinformatics data sources cover similar
domains, such as genes, proteins, sequence
annotations or microarray results.

16
Database design links

http://www.devx.com/ibm/Article/20702
http://www.campus.ncl.ac.uk/databases/design/
http://www.dbazine.com/mullins_datamodel.shtml
http://www.extropia.com/tutorials/sql/toc.html
http://www.surfermall.com/relational/lesson_1.htm

17
Database Definition

A collection of data that
is organized
usually computer-based
represents repetitive information implicitly
supports retrieval
A set of rules to manipulate data
A method to mold information into knowledge

18
Database applications

Who uses computerized databases?
Stores to keep track of inventory
Hospitals to keep track of patient info
Travel agents to keep up with their customers
and reservations
Biologists to efficiently manage and manipulate
their data
DATA → INFORMATION → KNOWLEDGE

19
Paper Database as Expert System
20
HISTORY

1960's: Two main data models are developed:
network model (CODASYL) and hierarchical (IMS). A
user would need to know the physical structure of
the database in order to query for information.
SABRE: IBM/AA.
1970-72: E.F. Codd proposed the relational model. He
disconnects the schema (logical organization) of
a database from the physical storage methods.

21
HISTORY

1970's
Ingres (UCB) → Ingres Corp., Sybase, MS SQL
Server, Britton-Lee, Wang's PACE. This system
used QUEL as query language.
System R (IBM) → IBM's SQL/DS & DB2, Oracle, HP's
Allbase, Tandem's Non-Stop SQL. This system used
SEQUEL as query language.
The term Relational Database Management System
(RDBMS) is coined

22
HISTORY

1976: P. Chen proposed the Entity-Relationship
(ER) model for database design
Early 1980's: Commercialization of relational
systems begins as a boom
Mid-1980's: SQL (Structured Query Language)
becomes "intergalactic standard". DB2 becomes
IBM's flagship product. Network and hierarchical
models fade into the background

23
HISTORY

Early 1990's: Application and personal
productivity tool development: PowerBuilder
(Sybase), Oracle Developer, VB (Microsoft),
Excel/Access (MS) and ODBC. First Object Database
Management Systems (ODBMS) prototypes.
Mid-1990's: Internet/WWW. Web/DB grows
exponentially, usable for average users

24
HISTORY

Late-1990's: Boom for Web/Internet/DB connectors.
Open source solutions with widespread use of gcc,
cgi, Apache, MySQL, etc. Online transaction
processing (OLTP) and online analytic processing
(OLAP) come of age
Early 21st century: Burst of .com, but solid growth
of DB applications. PDAs, POS transactions, IBM,
Microsoft, Oracle.

25
FUTURE

Terabyte and Petabyte databases of everything
Mobile databases
Semantic Web
Object Oriented Everything, includes databases
Object Database Management Group (ODMG) standards
are proposed and accepted
Security issues

26
Database advantages

An advantage of a database program is
Can find a specific file quickly
Can easily add records
Can alphabetize and sort data faster than most
people
Is as accurate as the data that is entered
Can make many different types of reports
Is invaluable for large amounts of data

27
Database Parts

Parts of a relational database
Fields: categories of information
Entry: data in a field
Record: all of the information about one item
(row)
File: document of all of the records
To sort a field: ascending or descending (Excel, Works)

28
Database types

Flat (spreadsheet)
Hierarchical
Network (two fundamental constructs, called
records and sets)
Relational

29
Relational Databases

Relational databases started to get to be a big
deal in the 1970's, and they're still a big deal
today, which is a little peculiar, because
they're a 1960's technology.
A relational database is a bunch of rectangular
tables. Each row of a table is a record about one
person or thing; the record contains several
pieces of information called fields.
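
A rectangular table of records and fields can be tried out with Python's built-in sqlite3 module. This is a minimal sketch; the employee rows are made-up sample data, not part of the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# One table; each column is a field.
conn.execute(
    "CREATE TABLE employee (employeeID INTEGER PRIMARY KEY, name TEXT, job TEXT)"
)

# Each tuple is one record (row) about one person.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        (7513, "Nora Edwards", "Programmer"),
        (9842, "Ben Smith", "DBA"),
        (6651, "Ajay Patel", "Programmer"),
    ],
)

# Sort the records on one field in ascending order.
rows = conn.execute("SELECT name, job FROM employee ORDER BY name ASC").fetchall()
```

Replacing ASC with DESC returns the same records in descending order.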

30
Entities and Relationships
Entities: things we store information about
Relationships: links between the entities
Many-to-many
One-to-one
One-to-many
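
A many-to-many relationship is usually stored in a third, linking table. The sketch below assumes hypothetical gene, trait, and gene_trait tables, filled with gene and trait names mentioned in this course; in principle one gene may link to several traits and one trait to several genes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene  (geneID  INTEGER PRIMARY KEY, symbol TEXT UNIQUE);
CREATE TABLE trait (traitID INTEGER PRIMARY KEY, name   TEXT UNIQUE);
-- Junction table: one row per (gene, trait) link.
CREATE TABLE gene_trait (
    geneID  INTEGER REFERENCES gene(geneID),
    traitID INTEGER REFERENCES trait(traitID),
    PRIMARY KEY (geneID, traitID)
);
INSERT INTO gene  VALUES (1, 'TAS2R38'), (2, 'ABCC11'), (3, 'LCT');
INSERT INTO trait VALUES (1, 'Bitter Taste Perception'),
                         (2, 'Earwax Type'),
                         (3, 'Lactose Intolerance');
INSERT INTO gene_trait VALUES (1, 1), (2, 2), (3, 3);
""")


def traits_for(symbol):
    """Follow the links from one gene to all of its traits."""
    cursor = conn.execute(
        """SELECT trait.name FROM gene
           JOIN gene_trait ON gene.geneID = gene_trait.geneID
           JOIN trait ON trait.traitID = gene_trait.traitID
           WHERE gene.symbol = ? ORDER BY trait.name""",
        (symbol,),
    )
    return [row[0] for row in cursor]
```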
31
A Table is a Relation
Columns, Fields, Attributes
Rows, Records, Tuples, Entities
Records of data, comprised of fields, stored in tables
32
Keys and Functional Dependencies

Key field (superkey, key): a field that uniquely
identifies a record
If there is a functional dependency between
column A and column B in a given table
(A → B), then the value of column A determines
the value of column B (employeeID → name).
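
Whether a functional dependency A → B holds in a given set of rows can be checked directly: no value of A may appear with two different values of B. A minimal sketch, with a hypothetical helper name and made-up sample rows:

```python
def holds(rows, lhs, rhs):
    """Return True if each value of column `lhs` maps to exactly one value of `rhs`."""
    seen = {}
    for row in rows:
        a, b = row[lhs], row[rhs]
        # setdefault stores b the first time a is seen, then returns the stored value.
        if seen.setdefault(a, b) != b:
            return False
    return True


employees = [
    {"employeeID": 1, "name": "Ann", "job": "Programmer"},
    {"employeeID": 2, "name": "Bob", "job": "Programmer"},
    {"employeeID": 3, "name": "Cy", "job": "DBA"},
]
```

Here employeeID → name holds, but job → name does not, because two different names share the job Programmer. Note that a check on sample data can only refute a dependency; whether it holds in general is a design decision.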

33
Schema

Database schema is the structure or design of the
database, a blueprint for the data in the
database.
employee(employeeID, name, job, cube,
departmentID)
What information needs to be stored? (things or
entities)
What questions will we ask of the database?
(queries.)

34
Flawed schemas
This schema design leads to redundancies:
Employee(employee ID, name, job, department ID)
Department(Department ID, Department name)
35
Flawed schemas
Insertion Anomaly
Deletion Anomaly
Update Anomaly
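
The update and deletion anomalies can be reproduced in a few lines. This sketch assumes the redundancy comes from keeping employee and department attributes together in one table; the table name employeeDepartment and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employeeDepartment (
    employeeID INTEGER PRIMARY KEY, name TEXT, job TEXT,
    departmentID INTEGER, departmentName TEXT)""")
conn.executemany("INSERT INTO employeeDepartment VALUES (?, ?, ?, ?, ?)", [
    (1, "Ann", "Programmer", 10, "Research"),
    (2, "Bob", "DBA", 10, "Research"),
    (3, "Cy", "Analyst", 20, "Sales"),
])

# Update anomaly: renaming the department in only one row
# leaves department 10 with two different names.
conn.execute(
    "UPDATE employeeDepartment SET departmentName = 'R and D' WHERE employeeID = 1"
)
names_for_10 = {row[0] for row in conn.execute(
    "SELECT departmentName FROM employeeDepartment WHERE departmentID = 10")}

# Deletion anomaly: removing the only Sales employee
# also erases every trace of the Sales department.
conn.execute("DELETE FROM employeeDepartment WHERE employeeID = 3")
sales_rows = conn.execute(
    "SELECT COUNT(*) FROM employeeDepartment WHERE departmentID = 20").fetchone()[0]
```

The insertion anomaly is the mirror image: in this layout a new department cannot be recorded until it has at least one employee, unless the employee columns are left null.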
36
Avoid Null Values
37
Normalization
Unnormalized table: lists instead of atomic
values. This violates the rules of first normal
form
38
Normalization
This schema is in first normal form, 1NF
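
Moving to first normal form amounts to replacing each list-valued entry with one row per atomic value. A minimal sketch, assuming a hypothetical skills column that holds a list:

```python
unnormalized = [
    {"employeeID": 1, "name": "Ann", "skills": ["C", "Perl", "Java"]},
    {"employeeID": 2, "name": "Bob", "skills": ["SQL"]},
]


def to_first_normal_form(rows, key, list_column):
    """Emit one (key value, atomic value) pair per element of the list column."""
    return [(row[key], item) for row in rows for item in row[list_column]]


# Each pair becomes one row of a separate employeeSkill(employeeID, skill) table.
employee_skill = to_first_normal_form(unnormalized, "employeeID", "skills")
```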
39
Second Normal Form, 2NF
2NF: Attributes must depend on the whole key
40
3NF and BCNF (Boyce-Codd)
3NF: Attributes must depend on nothing but the
key
BCNF: all the functional dependencies must
have a superkey on the left side
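
A typical 3NF violation is a transitive dependency such as employeeID → departmentID → departmentName. The sketch below, using made-up rows and hypothetical function names, splits such a table in two and checks that joining the parts gives back the original rows.

```python
rows = [
    {"employeeID": 1, "name": "Ann", "departmentID": 10, "departmentName": "Research"},
    {"employeeID": 2, "name": "Bob", "departmentID": 10, "departmentName": "Research"},
    {"employeeID": 3, "name": "Cy", "departmentID": 20, "departmentName": "Sales"},
]


def decompose(rows):
    """Split off the attribute that depends on departmentID rather than on the key."""
    employees = [
        {"employeeID": r["employeeID"], "name": r["name"],
         "departmentID": r["departmentID"]}
        for r in rows
    ]
    departments = {r["departmentID"]: r["departmentName"] for r in rows}
    return employees, departments


def rejoin(employees, departments):
    """Join the two tables back together on departmentID."""
    return [dict(e, departmentName=departments[e["departmentID"]]) for e in employees]


employees, departments = decompose(rows)
```

Each department name is now stored once, and rejoin(employees, departments) reproduces the original rows, so nothing was lost in the split.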
41
Concepts

Entities are things, and relationships are the
links between them.
Relations or tables hold a set of data in tabular
form.
Columns belonging to tables describe the
attributes that each data item possesses.
Rows in tables hold data items with values for
each column in a table.
Keys are used to identify a single row.
Functional dependencies identify which attributes
determine the values of other attributes.
Schemas are the blueprints for a database.

42
Design Principles

Minimize redundancy without losing data.
Insertion, deletion, and update anomalies are
problems that occur when trying to insert,
delete, or update data in a table with a flawed
structure.
Avoid designs that will lead to large quantities
of null values.

43
Normalization

Normalization is a formal process for improving
database design.
First normal form (1NF) means atomic column or
attribute values.
Second normal form (2NF) means that all
attributes outside the key must depend on the
whole key.
Third normal form (3NF) means no transitive
dependencies.
Boyce-Codd normal form (BCNF) means that all
attributes must be functionally determined by a
superkey.

44
Hierarchical Databases
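[Figure: tree rooted at patient 1234567 with branches for name (Sandiego, Carmen), address (123 Main Street) and Labs; Labs holds two Chem7 panels with the values K 3.9, Na 142 and K 4.3, Na 136]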
45
Hierarchical Databases

Easy to use
Efficient storage
Tree walking is fast
Queries across trees are slow
Flexible
Too flexible: chaos is allowed
Too easy to modify
Difficult to document complex structures

46
Hierarchical Databases

EMR(1234567) = Sandiego, Carmen
EMR(1234567, Address) = 123 Main Street
EMR(1234567, Chem7, 2/2/02, Na) = 136
EMR(1234567, Chem7, 2/2/02, K) = 4.3
EMR(1234567, Chem7, 2/3/02, Na) = 142
EMR(1234567, Chem7, 2/3/02, K) = 3.9
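
The EMR entries above can be mimicked with a Python dictionary keyed by the path from the root. This is only a sketch: a plain dictionary scans all keys, whereas a real hierarchical store keeps keys sorted so that walking one subtree is fast. The function names are hypothetical.

```python
emr = {
    (1234567,): "Sandiego, Carmen",
    (1234567, "Address"): "123 Main Street",
    (1234567, "Chem7", "2/2/02", "Na"): 136,
    (1234567, "Chem7", "2/2/02", "K"): 4.3,
    (1234567, "Chem7", "2/3/02", "Na"): 142,
    (1234567, "Chem7", "2/3/02", "K"): 3.9,
}


def subtree(store, prefix):
    """Tree walk: every entry whose key path starts with `prefix`."""
    return {path: value for path, value in store.items()
            if path[:len(prefix)] == prefix}


def find_analyte(store, analyte, threshold):
    """Cross-tree query: check every path for `analyte` values above `threshold`."""
    return sorted(
        path for path, value in store.items()
        if path[-1] == analyte
        and isinstance(value, (int, float))
        and value > threshold
    )
```

Pulling one patient's panel for one date only needs the matching prefix, while asking which results anywhere show K above 4.0 has to visit every patient's tree.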

47
Hierarchical Chaos
[Figure: tree for patient 1234567: Admissions, Admission 1, with Admit Date 2/2/02, Primary DX CHF and Other DX (AODM, A Fib), plus nodes labeled Flag S and Flag P]
48
Network Databases
[Figure: network diagram linking patient 1234567 (Sandiego) with records labeled Gyn Clinic, 2 Main St., 305-2500, Secretary, 8AM-5PM, Ms Smith, 305-1000, Service, Pap, Dr. Jones, Gyn Visit, Beeper 34]
49
Extensible Markup Language (XML) Databases

SGML is a metalanguage
SGML is used to write Document Type Definitions
(DTDs) that define languages
HTML is a language with an SGML DTD
Tags are for formatting/presentation syntax
XML is a proper subset of SGML
XML defines tags that convey semantics
We could write Health Markup Language (HML)
in XML (if we could agree on the semantics and
tags)
Tags may or may not be stored with data

50
<document> </document>
<document.id>CXR001</document.id>
<doc.date>19991101</doc.date>
<document.type> </document.type>
<document.body> </document.body>
<identifier>P5-00010</identifier>
<text>Chest X-Ray</text>
<findings>No infiltrate, cardiac shadow not enlarged...</findings>
<impression>Normal X-ray</impression>
51
<patient> </patient>
<patient.id> </patient.id>
<patient.name> </patient.name>
<patient.dob>19230113</patient.dob>
<patient.sex value="male"/>
<inpatient/>
<id.value>1234789</id.value>
<family.name>Sandiego</family.name>
<given.name>Carmen</given.name>
<suffix>M.D.</suffix>
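
Because the tags carry the field names, a program can pull values out of such a record by tag. The sketch below uses Python's xml.etree.ElementTree; the way the patient elements are nested here is an assumption, since the slide shows the tags without their hierarchy.

```python
import xml.etree.ElementTree as ET

record = """
<patient>
  <patient.id><id.value>1234789</id.value></patient.id>
  <patient.name>
    <family.name>Sandiego</family.name>
    <given.name>Carmen</given.name>
    <suffix>M.D.</suffix>
  </patient.name>
  <patient.dob>19230113</patient.dob>
  <patient.sex value="male"/>
  <inpatient/>
</patient>
""".strip()

root = ET.fromstring(record)


def first_text(element, tag):
    """Text of the first descendant with the given tag, or None if absent."""
    found = next(element.iter(tag), None)
    return found.text if found is not None else None


family_name = first_text(root, "family.name")
sex = next(root.iter("patient.sex")).get("value")    # value stored as an attribute
is_inpatient = next(root.iter("inpatient"), None) is not None  # empty tag as a flag
```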
52
Extensible Markup Language (XML) Databases

Strengths
Flexibility to represent wide range of data
Data carries its field assignment
Sparse data handled compactly
Tags can have platform-specific display
Weaknesses
Immature database tools
Verbose
I/O intensive
A trade-off of decreased efficiency for increased
flexibility → scalability

53
Relational Databases - Advantages

Comprehensible
Multiple views possible
Easy to modify
New elements don't break programs
Database management systems (DBMS)
Referential integrity
Reorg for efficiency
Access control
Locking for multiple simultaneous use
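
Referential integrity means the DBMS itself refuses rows that point to nothing. A minimal sketch with SQLite, which enforces foreign keys only after the PRAGMA below is issued; the tables, rows, and the try_insert helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("CREATE TABLE department (departmentID INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE employee (
    employeeID INTEGER PRIMARY KEY,
    name TEXT,
    departmentID INTEGER NOT NULL REFERENCES department(departmentID))""")
conn.execute("INSERT INTO department VALUES (10, 'Research')")


def try_insert(employee_id, name, department_id):
    """Return True if the row is accepted, False if the DBMS rejects it."""
    try:
        conn.execute(
            "INSERT INTO employee VALUES (?, ?, ?)",
            (employee_id, name, department_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

An employee in department 10 is accepted, while one that refers to a nonexistent department 99 is rejected by the database rather than by application code.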
