Lecture 1: Biological information database and data mining

Provided by: cad79
1
Lecture 1: Biological information database and data mining
  • Biology as an information intensive science
  • Typical databases
  • Introduction to data mining
  • Data mining in biology

2
Biology as an information intensive science
Organization of living systems: Ecosystems >
Communities > Populations > Organisms > Organ
systems > Organs > Tissues > Cells > Molecules.

  • Ecosystem: All living things in a particular
    area (such as an island) and all non-living,
    physical components of the environment that
    affect living things (such as air, soil, water,
    sunlight).
  • Community: All living things in an ecosystem
    (such as all animals, plants, bacteria, fungi,
    viruses etc. in a rain forest).
  • Population: A group of interbreeding
    individuals of one species (such as all flying
    squirrels in a rain forest).
  • Organism: An individual living thing (such as
    one flying squirrel).
  • Organ system: A group of related body
    components that perform a specific type of
    function (such as the CNS).
  • Organ: A functional unit of an organ system
    (such as the brain).
3
Biology as an information intensive science

Fundamental Theory: Evolution
Simple molecules > Organic molecules > RNA-based
life systems > Single cells > Multicellular
organisms > Higher organisms

Molecular Basis of Life: DNA (Genes) > RNAs >
Proteins
  • Structural organization
  • Chemical reaction, synthesis and destruction of
    molecules
  • Signal transduction
  • Transportation of molecules
  • Regulation
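The flow DNA (Genes) > RNAs > Proteins can be sketched in a few lines of code. This is a minimal illustration, not part of the original slides: it assumes the input is the coding strand of a gene, uses the standard genetic code, and the function names `transcribe` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """DNA coding strand -> mRNA (T replaced by U)."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    dna = rna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)

rna = transcribe("ATGGCCTAA")   # toy sequence, hypothetical
protein = translate(rna)
```

Real genes need more care (reading frame, introns, alternative genetic codes); the sketch only shows the direction of information flow.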
4
Biology as an information intensive science
Cell Organization and Function
  • Structural organization
  • Chemical reaction, synthesis and destruction of
    molecules
  • Signal transduction
  • Transportation of molecules
  • Regulation

5
Biology as an information intensive science

Information (Molecular Level)
  • DNA: 30,000 - 100,000 genes for human (many
    with unknown functions); 3 x 10^9 base pairs
    for human DNA (< 10% coding region).
  • Protein: 60,000 - 100,000 proteins for human.
  • Individual level: sequence, 3D structure,
    molecular function.
  • Group level: pathways, cellular location,
    collective function.

Classification
  • Family: superfamily, family, subfamily (based
    on evolution and function).
  • Type: receptor, ion channel, enzyme, carrier,
    regulator, structure.
  • Function: physiological function, diseases,
    therapeutics, toxicity, pharmacokinetics,
    agriculture, plant, environmentally relevant.
6
Typical Databases
  • Category
  • General
  • Sequence
  • 3D structure
  • Protein function, proteomics, and pathways.
  • Pharmainformatics
  • Medical informatics and disease information
  • Reference
  • Nucleic Acids Res. 30, 1-12 (2002).
  • Internet links
  • http://www.cz3.nus.edu.sg/yzchen/database.html

7
Typical Databases

General
  • The National Center for Biotechnology
    Information (NCBI).
    (http://www3.ncbi.nlm.nih.gov/) Integrated
    ENTREZ retrieval software and databases for
    genetics, gene and protein sequences, 3D
    structures, and the on-line PubMed library.
  • CAM (Complementary and Alternative Medicine) on
    PubMed.
  • Pedro's BioMolecular Research Tools. A
    collection of WWW links to information and
    services useful to molecular biologists. Other
    mirror sites in Germany and Switzerland.
  • The CMS Molecular Biology Resource. A
    compendium of electronic and
    Internet-accessible tools and resources for
    Molecular Biology, Biotechnology, Molecular
    Evolution, Biochemistry, and Biomolecular
    Modeling. Other mirror sites in Japan, Canada,
    France, Germany, Italy, and the UK.
8
Typical Databases
  • Sequence
  • GenBank DataBase (GenBank).
    (http://www.ncbi.nih.gov/Genbank/)
  • The GenBank database contains and distributes
    publicly available DNA sequences from more than
    130,000 different organisms. It contains DNA
    sequences, their derived protein sequences, and
    annotations describing biological, structural,
    and other relevant features. It currently
    contains 27213748 loci and 33865022251 bases,
    from 27213748 reported sequences.
  • SWISS-PROT (http://us.expasy.org/sprot/)
  • Annotated protein sequence database. Information
    includes the description of the function of a
    protein, its domain structure,
    post-translational modifications, variants, etc.
  • Release 42.0 of 10-Oct-2003 of Swiss-Prot
    contains 135850 sequence entries, comprising
    50046799 amino acids abstracted from 109694
    references.

9
Typical Databases

Sequence-related knowledge databases
  • Online Mendelian Inheritance in Man (OMIM).
    (http://www3.ncbi.nlm.nih.gov/omim/) A
    database that catalogs human genes and genetic
    disorders. Located at NCBI. It currently
    contains 14831 entries.
  • Pfam. Protein families database of alignments
    and HMMs.
    (http://www.sanger.ac.uk/Software/Pfam/) A
    large collection of multiple sequence
    alignments and hidden Markov models covering
    many common protein domains. In this way,
    proteins are grouped into domain-based
    families. It currently covers 6190 families.
10
Typical Databases

Structure
  • Protein Data Bank (PDB).
    (http://www.rcsb.org/pdb/) 3D crystal and NMR
    structures of proteins, DNA, RNA and
    ligand-bound complexes. Official mirror site in
    Singapore, with others in China, Japan, Taiwan
    and several places in the USA (Boston, North
    Carolina). It currently contains 22874
    structures.
  • Nucleic Acids Database (NDB). 3D crystal
    structures of DNA and RNA. Mirror sites in the
    UK, Japan, and the USA (San Diego).
11
Typical Databases

Structure-derived knowledge databases
  • SCOP. Structural classification of proteins.
    Mirror sites in Singapore, China, the U.S., and
    Japan.
  • CATH. Protein Structure Classification. A
    hierarchical domain classification of protein
    structures in PDB.
  • MODBASE. A database of Comparative Protein
    Structure Models. Models were generated by
    PSI-BLAST and MODELLER. As of Aug 2000, there
    are 3,379 reliable models for domains in 2,220
    proteins, and 5,433 reliable fold assignments
    for domains in 3,083 proteins.
12
Typical Databases
Function and pathways
  • GeneCards. A database of human genes, their
    products and their involvement in diseases. It
    offers concise information about the functions
    of all human genes that have an approved
    symbol, as well as selected other genes.
  • PROSITE. Protein families and domains. It
    consists of biologically significant sites,
    patterns and profiles that help to reliably
    identify to which known protein family (if any)
    a new sequence belongs. Mirror sites in
    Australia, Canada, China, Taiwan.
  • PRINTS. Protein fingerprint database. A
    fingerprint is a group of conserved motifs used
    to characterise a protein family.
  • PROCAT. A database of 3D enzyme active site
    templates. It can be thought of as the 3D
    equivalent of the 1D templates found in
    sequence motif databases such as PROSITE and
    PRINTS.
  • KEGG. Kyoto Encyclopedia of Genes and Genomes.
    The site contains Pathway Info, Disease
    Catalogs, Cell Catalogs, Molecule Catalog, and
    Genomic Info. It also provides links to pathway
    and other databases.
  • SPAD. Signaling Pathway Database. An integrated
    database for genetic information and signal
    transduction systems. Divided into four
    categories based on the extracellular signal
    molecules (Growth factor, Cytokine, and
    Hormone) and stress that initiate the
    intracellular signaling pathway.

13
Typical Databases
Pharmainformatics
  • TTD. Therapeutic Target Database. A database
    that provides information about the known and
    newly proposed therapeutic protein and nucleic
    acid targets, the targeted disease, pathway
    information and the corresponding drugs/ligands
    directed at each of these targets. Links to
    relevant databases are also provided.
  • MedChem/Biobyte QSAR Database. A collection of
    10,000 QSAR datasets that covers both
    biological and physical-organic chemistry.
  • The NCI Drug Information System 3D Database. A
    collection of 3D structures for over 400,000
    drugs, built and maintained by the
    Developmental Therapeutics Program, Division of
    Cancer Treatment, National Cancer Institute.
    The database is an extension of the NCI Drug
    Information System.
  • Drug Discovery Databases. Compiled by the
    Biophysical Pharmacology Group at NCI. The site
    has links to several therapeutics program
    databases and tools, and a 2D-Gel protein
    expression database.
  • Pharmaceutical Information Network. A
    comprehensive information database about drugs
    and diseases.
  • U.S. Food and Drug Administration Center for
    Drug Evaluation and Research.

14
Introduction to Data Mining
  • Main Objective
  • Pattern identification, Classification,
    Extraction of related data (character) set.
  • Tasks
  • Generation of association rules.
  • Classification and clustering.
  • Pre-processing and post-processing of relevant
    dataset.
  • General Procedure
  • Understanding of application domain.
  • Data source identification and data selection.
  • Pre-processing: feature selection,
    discretization, data cleaning.
  • Data mining: pattern extraction and model
    building.
  • Post-processing: identification of
    interesting/useful/novel patterns/rules.
  • Incorporation of patterns in real world tasks.
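
Two of the pre-processing steps named above, data cleaning and discretization, can be illustrated with a small sketch. The slides do not prescribe a method; the example assumes missing values are marked as `None` and uses equal-width binning, and the function names and toy measurements are hypothetical.

```python
def clean(values):
    """Data cleaning: drop missing entries (marked here as None)."""
    return [v for v in values if v is not None]

def discretize(values, n_bins):
    """Equal-width discretization: map each value to a bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal
    # The maximum value would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

raw = [1.0, None, 4.0, 7.0, 10.0]   # toy measurements with one missing value
cleaned = clean(raw)
bins = discretize(cleaned, 3)
```

Other choices (imputing missing values, equal-frequency bins) are equally valid; which one fits depends on the application domain identified in the first step.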


15
Introduction to Data Mining
Example: Generation of association rules

Record of customer purchases
  • John: Jacket, Boots
  • Alfred: Milk, Cheese, Bread, Shoes
  • Green: Milk, Bread
  • Brown: Milk, Bread, Shoes, Greeting Cards, Pork
  • Eric: Cheese, Milk, Shoes, Beef
  • Bob: Jacket, Boots, Ski Pants

Form of association rules: Item A => Item B (sup,
conf)
  • sup: support = fraction of records containing
    both item A and item B
  • conf: confidence = sup / (fraction of records
    containing item A)
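
The support and confidence of a rule can be computed directly from the purchase records listed on this slide. The sketch below is illustrative only: the helper names are made up, and confidence is taken in its conventional sense, the number of records containing both items divided by the number of records containing the rule's left-hand item A.

```python
# Purchase records from the slide, one set of items per customer
records = {
    "John": {"Jacket", "Boots"},
    "Alfred": {"Milk", "Cheese", "Bread", "Shoes"},
    "Green": {"Milk", "Bread"},
    "Brown": {"Milk", "Bread", "Shoes", "Greeting Cards", "Pork"},
    "Eric": {"Cheese", "Milk", "Shoes", "Beef"},
    "Bob": {"Jacket", "Boots", "Ski Pants"},
}

def count(items):
    """Number of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in records.values() if items <= basket)

def support(a, b):
    """Fraction of all records containing both item a and item b."""
    return count({a, b}) / len(records)

def confidence(a, b):
    """Of the records containing item a, the fraction that also contain b."""
    return count({a, b}) / count({a})

milk_bread = (support("Milk", "Bread"), confidence("Milk", "Bread"))
jacket_boots = (support("Jacket", "Boots"), confidence("Jacket", "Boots"))
```

For Milk => Bread, three of the six records contain both items and four contain Milk, so the support is 0.5 and the confidence is 0.75. For Jacket => Boots, both records with a Jacket also contain Boots, so the confidence is 1.0.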

16
Data Mining in Biology
  • Types of Tasks
  • Search for similar pattern in a subsection of
    each member of datasets (e.g. protein sequence
    motifs).
  • Classification of datasets into groups (e.g.
    proteins into families).
  • Search for a dataset matching given
    characteristics (e.g. alignment of a protein
    sequence against all entries in a protein
    sequence database).
  • Extraction of particular information from
    literature (e.g. drugs that bind to a particular
    protein).
  • Proc. Natl. Acad. Sci. USA 95, 10710-10715 (1998)
  • Structure 7, 1099-1112 (1999)
  • Bioinformatics 17, 721-728 (2001)
  • Bioinformatics 17, 155-161 (2001); 17, 359-363
    (2001)
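
The first task type, searching for a similar pattern in each member of a dataset, can be sketched with a regular expression. The example assumes the PROSITE-style pattern N-{P}-[ST]-{P} (the N-glycosylation site) as the motif; the two sequences and the function name are hypothetical and only serve to show the mechanics.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a regular expression.
# The lookahead lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")

def find_motif(sequence):
    """Return (1-based position, matched residues) for every motif hit."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]

# Toy protein sequences (made up for illustration)
sequences = {
    "toy_protein_1": "MKNGSALLNPTQ",
    "toy_protein_2": "MAAAKLLV",
}

hits = {name: find_motif(seq) for name, seq in sequences.items()}
```

Here `toy_protein_1` has one hit at position 3 (`NGSA`); the `N` at position 9 is followed by `P` and is therefore excluded. Motif databases such as PROSITE and PRINTS apply the same idea with curated patterns and profiles.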


17
Homework
  • Write a very short report about a database
    assigned to you.
  • Can you give at least two more examples of each
    type of task in biological data mining?
  • Read the reference about typical biological
    databases and get a broad picture of the
    current status of publicly accessible
    bioinformatics databases.
  • Read at least one of the references about data
    mining in biology and be prepared to give a brief
    description about the paper.