Bioinformatics: an overview

Ming-Jing Hwang, Institute of Biomedical Sciences, Academia Sinica, http://gln.ibms.sinica.edu.tw/

Slides: 71

1
Bioinformatics: an overview

Ming-Jing Hwang
Institute of Biomedical Sciences
Academia Sinica

http://gln.ibms.sinica.edu.tw/
2
The human genome project
Year 2001
3
Promises

More will happen in biology in the next 10
years than in the past 50
(Craig Venter, Celera Genomics).
We should be able to uncover the major
hereditary contributions to common
illnesses like diabetes and mental illness,
probably in the next three to five years
(Francis Collins, head of HGP).

4
Genetics & Genomics: From DNA to population
Source: GSK
5
What makes us human ?

The difference between you and a chimp is 1.24%
The difference between you and Maggie is 0.1%

6
Hunting for disease genes
Source: GSK
7
Genes and Diseases
Penetrance: the likelihood that a person carrying
a particular mutant gene will have an altered
phenotype
Source: GSK
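As a rough illustration of the penetrance definition above, the likelihood can be estimated as the fraction of carriers who show the altered phenotype. This is a minimal sketch; the function name and the cohort numbers are hypothetical and do not come from the slides.

```python
def penetrance(carriers_affected: int, carriers_total: int) -> float:
    """Estimate penetrance as the fraction of mutation carriers
    who show the altered phenotype."""
    if carriers_total <= 0:
        raise ValueError("need at least one carrier")
    if not 0 <= carriers_affected <= carriers_total:
        raise ValueError("affected carriers must be between 0 and the total")
    return carriers_affected / carriers_total


# Hypothetical cohort: 40 of 50 carriers show the altered phenotype.
estimate = penetrance(40, 50)
print(f"estimated penetrance: {estimate:.2f}")
```

A value of 1.0 would correspond to complete penetrance; anything lower means some carriers do not show the phenotype.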
8
Phenotype and genotype

Many different genotypes can have the same phenotype
Many genotype differences do not change the phenotype
One phenotype could be due to many different
genotypes
-- statistical genetics

9
The common variant/common disease (CV-CD)
hypothesis
It is believed that most polygenic contributions
to disease susceptibility will arise from
variants that are relatively common in the
susceptible population.
10
Genetic variations

SNPs constitute 90% of human genetic variation
Other forms of variations include insertion,
deletion, and differences in the copy number of
tandem repeats or large genomic segments, etc.

11
Three phases of human genome sequencing

The genome map (draft in 2001, finished in
2003)
The SNP map (TSC, 2001)
The haplotype map (HapMap, 2005)

12
Source: GSK
13
Source: GSK
14
pharmacogenomics
8/2, 10:00 pm on PTS (CH13)
15
(Nature, 2004)
(PNAS, 2005)
16
Common SNPs
Kruglyak & Nickerson, 2001
17
dbSNP summary (NCBI build 124)
July, 2005
18
Haplotype structure of the human genome
Goldstein, 2001
19
(No Transcript)
20
Rationale of HapMap
In a given population, 55 percent of people may
have one version of a haplotype, 30 percent may
have another, 8 percent may have a third, and the
rest may have a variety of less common
haplotypes. The International HapMap Project is
identifying these common haplotypes in four
populations from different parts of the world. It
also is identifying "tag" SNPs that uniquely
identify these haplotypes. By testing an
individual's tag SNPs (a process known as
genotyping), researchers will be able to identify
the collection of haplotypes in a person's DNA.
The number of tag SNPs that contain most of the
information about the patterns of genetic
variation is estimated to be about 300,000 to
600,000, which is far fewer than the 10 million
common SNPs.
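The idea of tag SNPs can be sketched with a toy example: given a handful of common haplotypes, look for the smallest set of SNP positions whose alleles still tell every haplotype apart. The code below is only an illustrative brute-force search under that simplified definition; the function name and the five-SNP haplotypes are hypothetical, and real tag SNP selection works from linkage disequilibrium statistics on genotype data.

```python
from itertools import combinations


def tag_snps(haplotypes):
    """Return the smallest tuple of SNP positions whose alleles
    distinguish every haplotype in the list (brute-force search)."""
    if not haplotypes:
        return ()
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must cover the same SNP positions")
    if len(set(haplotypes)) != len(haplotypes):
        raise ValueError("duplicate haplotypes cannot be told apart")
    # Try ever larger sets of positions until one separates all haplotypes.
    for size in range(n_sites + 1):
        for sites in combinations(range(n_sites), size):
            signatures = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return sites
    raise AssertionError("unreachable for distinct haplotypes")


# Hypothetical haplotypes over five SNPs: positions 0 and 1 always vary
# together, as do positions 2 and 3, and position 4 never varies.
common_haplotypes = ["ATCGA", "ATTAA", "GCCGA", "GCTAA"]
print(tag_snps(common_haplotypes))
```

In this toy case two positions carry all the information in five, which is the same kind of reduction the slide describes at genome scale.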
21
Beyond genome
22
Chemical genomics
23
(No Transcript)
24
Bioinformatics has many sub-disciplines

Genome Informatics (DNA sequence)
Transcriptome Informatics (expression)
Proteome informatics (ID, post-transl. mod.)
Protein Informatics (protein struct./funct.)
Evolutionary Informatics
Biomedical Informatics (human disease)

25
Briefings in bioinformatics (Mar 2005)

The many faces of sequence alignment (Batzoglou)
Bioinformatics analysis of alternative splicing
(Lee & Wang)
Putting microarrays in a context: Integrated
analysis of diverse biological data (Troyanskaya)
Bioinformatics approaches and resources for
single nucleotide polymorphism functional
analysis (Mooney)
A survey of current work in biomedical text
mining (Cohen & Hersh)
Current efforts in the analysis of RNAi and RNAi
target genes (Bengert & Dandekar)

26
Sequence alignment

The problem is still not solved.
Sequence alignment methodology and tool
development continue to grow, indicating that the
alignment problem is still not solved.
How can that be, after nearly forty years of
research and literally hundreds of available
tools?
Why should alignments remain an open problem?
It is not a single problem but rather a
collection of many quite diverse questions that
all have in common the search for sequence
similarity
The exponential expansion of biological sequence
databases is faster than Moore's law

Batzoglou, 2005
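For readers new to the topic, the simplest member of this family of problems is global pairwise alignment. The sketch below computes a Needleman-Wunsch style alignment score by dynamic programming. It is an illustration only, not a method taken from the slides; the scoring values (match +1, mismatch -1, linear gap -2) are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of sequences a and b
    (Needleman-Wunsch recurrence with a linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            gap_in_b = prev[j] + gap      # ca aligned to a gap
            gap_in_a = curr[j - 1] + gap  # cb aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

The table has one cell per pair of positions, so time grows with the product of the two sequence lengths. That cost is one reason speed stays a concern as databases grow.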
27
Sequence alignment challenges

Sensitivity and specificity
Speed
Evaluation
Low similarity
Rearrangements
Orthology detection
Multiple (genome) alignments
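Sensitivity and specificity, the first challenge listed above, can be made concrete for a similarity search: compare the hits a tool reports against a set of known true homologs. This is a minimal sketch; the function name and the sequence identifiers are hypothetical.

```python
def sensitivity_specificity(predicted, truth, universe):
    """Return (sensitivity, specificity) of a set of predicted hits,
    given the true homologs and the full set of database sequences."""
    predicted, truth, universe = set(predicted), set(truth), set(universe)
    if not (predicted <= universe and truth <= universe):
        raise ValueError("predicted and truth must be subsets of universe")
    tp = len(predicted & truth)             # true homologs that were found
    fn = len(truth - predicted)             # true homologs that were missed
    fp = len(predicted - truth)             # reported hits that are not homologs
    tn = len(universe - predicted - truth)  # non-homologs correctly left out
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true homolog and one non-homolog")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical database of ten sequences, four of them true homologs.
database = {f"seq{i}" for i in range(1, 11)}
homologs = {"seq1", "seq2", "seq3", "seq4"}
reported = {"seq1", "seq2", "seq3", "seq5"}
sens, spec = sensitivity_specificity(reported, homologs, database)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```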

28
Evolution of functionally important regions over
time
Miller et al., 2005
29
Evolutionary Informatics/ Comparative genomics
(Ureta-Vidal, Ettwiller, Birney, 2003)
30
Schema of genome alignment
(2003)
31
Genome alignment: recent reviews

An Applications-Focused Review of Comparative
Genomics Tools: Capabilities, Limitations and
Future Challenges (Chain et al., Briefings in
Bioinformatics, 2003)
The many faces of sequence alignment (Batzoglou,
Briefings in Bioinformatics, 2005)
Comparative genomics (Miller et al., Annu. Rev.
Genomics Hum. Genet., 2004)

32
RNAi: post-transcriptional gene regulation

Computational identification of miRNAs
Computational prediction of miRNA targets
miRNA data resources

33
Transcriptomics: tools for understanding the body
plan
34
Microarray: integrated analysis
Troyanskaya, 2005
35
Proteomics
Initial goal: identification of all proteins
expressed by a cell or tissue
36
From 1D to 3D: the Holy Grail of Structural
Bioinformatics
MADWVTGKVTKVQNWTDALFSLTVHAPVLPFTAGQFTKLGLEIDGERVQR
AYSYVNSPDNPDLEFYLVTVPDGKLSPRLAALKPGDEVQVVSEAAGFFVL
DEVPHCETLWMLATGTAIGPYLSILRLGKDLDRFKNLVLVHAARYAADLS
YLPLMQELEKRYEGKLRIQTVVSRETAAGSLTGRIPALIESGELESTIGL
PMNKETSHVMLCGNPQMVRDTQQLLKETRQMTKHLRRRPGHMTAEHYW
37
(No Transcript)
38
Structural Bioinformatics: Sequence/Structure
Relationship
[Figure: percent identity axis running from 100 to 0. Labels: all possible sequences of amino acids; protein sequences observed in nature; protein structures observed in nature; twilight zone; midnight zone]
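The percent identity axis on this slide can be computed from a pairwise alignment. The sketch below counts identical residues over the aligned columns that contain no gap. This is one common convention; other definitions divide by the alignment length or by the shorter sequence length. The first sequence uses the opening residues of the protein shown on slide 36; the second sequence and the function name are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the gap-free columns
    of two already aligned, equal-length sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# "MADWVTGK" starts the sequence on slide 36; "MADW-TGR" is made up.
print(round(percent_identity("MADWVTGK", "MADW-TGR"), 1))
```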
39
Structure Prediction Methods
[Figure: homology modeling, fold recognition, and ab initio methods placed along a sequence identity axis running from 0 to 100]
40
CASP Experiments
41
Some CASP4 successes
Baker's group
42
3D to 1D?
Science 2003
43
A computer-designed protein (93 aa) with 1.2 Å
resolution
44
Sequence/Structure Gap
Sequence
Structure
45
Structural Genomics: solving fold representatives
Baker & Sali, 2001
46
Structural Genomics overview

When: 1997, by Barry Honig, Wayne Hendrickson and
colleagues in a proposal for the DOE's Advanced
Photon Source (APS)
Goals: 10,000 structures (100-200 str/center/yr),
each representing a protein family, in 5 (10)
years
Enabling factors: genome sequences, technology
advancement (synchrotron MAD, etc.)
Cost: reducing the current US$200,000/str to
US$10,000/str (est. US$1.5-5 billion)
Players: academic & industry

47
Flowchart of a SG project
Burley et al., 1999
48
PSI phase I (pilot) centers

Berkeley Structural Genomics Center focused on two bacterial species with extremely small genomes to study proteins essential for independent life. Principal investigator: Sung-Hou Kim, Lawrence Berkeley National Laboratory

Center for Eukaryotic Structural Genomics, based in Wisconsin, focused on protein production, characterization, and structure determination from Arabidopsis thaliana, a plant that is frequently used in laboratory research and that has many genes in common with humans and animals. Principal investigator: John Markley, University of Wisconsin, Madison

Joint Center for Structural Genomics, based in California, focused on novel structures from thermophilic microorganisms and on human proteins thought to be involved in cell signaling. Principal investigator: Ian Wilson, The Scripps Research Institute

Midwest Center for Structural Genomics, based in Illinois, selected bacterial targets related to disease and proteins from all three kingdoms of life. The emphasis was on previously unknown folds and on proteins from disease-causing organisms. Principal investigator: Andrzej Joachimiak, Argonne National Laboratory

New York Structural Genomics Research Consortium solved protein structures for disease-related proteins from eukaryotes and bacteria. Principal investigator: Stephen K. Burley, Structural GenomiX, Inc.

Northeast Structural Genomics Consortium, based in New Jersey, focused on target proteins from various model organisms, including the fruit fly, yeast, and roundworm. It used both X-ray crystallography and NMR spectroscopy. Principal investigator: Gaetano Montelione, Rutgers University

The Southeast Collaboratory for Structural Genomics, based in Georgia, determined structures from the prokaryotic model organism Pyrococcus furiosus and the eukaryotic model organism C. elegans, as well as some human proteins. Principal investigator: Bi-Cheng Wang, University of Georgia

Structural Genomics of Pathogenic Protozoa Consortium, based in Washington, solved protein structures from organisms known as protozoans, many species of which cause deadly diseases such as sleeping sickness, malaria, and Chagas' disease. Principal investigator: Wim G. J. Hol, University of Washington

TB Structural Genomics Consortium, based in New Mexico, analyzed protein structures from Mycobacterium tuberculosis. Principal investigator: Thomas Terwilliger, Los Alamos National Laboratory

49
PSI Pilot Phase Facts at a Glance

Goal: To develop new approaches and tools needed
to streamline and automate the steps of protein
structure determination, and to incorporate those
methods into high-throughput pipelines that use
DNA sequence information to generate
three-dimensional protein structure models
Project period: September 2000 to June 2005
Funding: $270 million (funded largely by the
National Institute of General Medical Sciences,
with additional support from the National
Institute of Allergy and Infectious Diseases)
Number of centers: 9 (6 survived to phase II)
Solved protein structures: more than 1,100
Unique structures solved (structures sharing less
than 30 percent of their sequence with other
known proteins): more than 700

50
PDB content growth (May 2005)
51
Many bottlenecks remain: target tracking by PDB
(Sep 2002)
52
Current (phase II) PSI centers
53
Hybrid approach for solving macromolecular
complex structures
54
Protein network: an integrated approach
Aloy et al., 2004
55
Bioinformatics and Drug Design
Scientific American, 2000
56
Yeast protein interaction network
Nat Rev Genet. 2004
57
Network parameters

Degree (connectivity): k
Degree distribution: P(k), the probability that a
selected node has exactly k links
Scale-free network: the degree distribution
approximates a power law, $P(k) \sim k^{-\gamma}$
($\gamma$: degree exponent)

[Figure: log P(k) versus log k]
Barabási & Oltvai, Nat Rev Genet. 2004
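The network parameters on this slide can be illustrated with a short sketch: compute P(k) from an edge list, then estimate the degree exponent as the negative slope of a straight line fitted to log P(k) versus log k. The function names and the toy star network are hypothetical, and an ordinary least-squares fit on a log-log plot is a crude estimator that is used here only because it mirrors the plot.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)} for an undirected edge list, where P(k) is the
    fraction of nodes with exactly k links (isolated nodes are not seen)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_nodes = len(degree)
    counts = Counter(degree.values())
    return {k: c / n_nodes for k, c in sorted(counts.items())}


def fit_degree_exponent(pk):
    """Estimate the degree exponent as minus the least-squares slope
    of log P(k) against log k."""
    xs = [math.log(k) for k in pk]
    ys = [math.log(p) for p in pk.values()]
    if len(xs) < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Toy star network: one hub (node 0) linked to four leaves.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
pk = degree_distribution(star)
print(pk, fit_degree_exponent(pk))
```

With only two distinct degrees the fit is trivially exact; on real interaction data the tail of the distribution is noisy, so the estimate depends strongly on how the data are binned.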
58
Network models
Barabási & Oltvai, Nat Rev Genet. 2004
59
Scale-free networks
$P(k) \sim k^{-\gamma}$ ($\gamma_{in}$: in-degree exponent; $\gamma_{out}$: out-degree
exponent)
Albert & Barabási, Reviews of Modern Physics, 2002
60
Challenges in network biology

Network databases
Information integration
Organization characteristics and principles
Design rules
Evolution mechanisms
Validation

61
Neuroinformatics: neuroscience and bioinformatics
The Human Brain Project, UC Davis: http://nir.cs.ucdavis.edu/index.jsp
http://ncmir.ucsd.edu/NCDB/
62
Bioinformatics Journals

Bioinformatics
Nucleic Acids Research
BMC Bioinformatics
Briefings in Bioinformatics
Proteins
J. Mol. Biol.

PNAS
PLoS Computational Biology
Genome Research

63
Scope of bioinformatics

Genome analysis
Sequence analysis
Phylogenetics
Structural bioinformatics
Gene expression
Genetic and population analysis
Systems biology
Data and text mining
Databases and ontologies

64
Sample articles of a recent issue

Exon-domain correlation and its corollaries
Functional annotation from predicted protein
interaction networks
HYPROSP II: a knowledge-based hybrid method for
protein secondary structure prediction based on
local prediction confidence
Comparative interactomics analysis of protein
family interaction networks using PSIMAP (protein
structural interactome map)
Semi-supervised protein classification using
cluster kernels
A new progressive-iterative algorithm for
multiple structure alignment
Practical FDR-based sample size calculations in
microarray experiments
Mining genetic epidemiology data with Bayesian
networks I: Bayesian networks and example
application (plasma apoE levels)
Inferring protein-protein interactions through
high-throughput interaction data from diverse
organisms
A latent variable model for chemogenomic
profiling

65
NAR Database issue (Jan. 2005)

Categories | Count
1. Nucleotide Sequence Databases | 53
2. RNA Sequence Databases | 34
3. Protein Sequence Databases | 105
4. Structure Databases | 64
5. Genomic Databases (non-human) | 134
6. Metabolic Enzyme Pathways & Signals Pathways | 36
7. Human & Other Vertebrate Genomes | 64
8. Human Genes & Diseases | 69
9. Microarray Data & Other Gene Expression Databases | 42
10. Proteomics Resources | 7
11. Other Molecular Biology Databases | 17
12. Organelle Databases | 18
13. Plant Databases | 48
14. Immunological Databases | 20
Total | 711
http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D5/TBL1
66
NAR Web Server Issue (July 2005)
Year of publication
2004 129 (137)
2005 166
Total 295
67
Computer Related (2): Bio-* Programming Tools (1); Statistics (1)
DNA (57): Annotations (9); Gene Prediction (4); Mapping and Assembly (1); Phylogeny Reconstruction (4); Sequence Feature Detection (16); Sequence Polymorphisms (8); Sequence Retrieval and Submission (3); Tools For the Bench (12)
Education (1): Directories and Portals (1)
Expression (48): cDNA, EST, SAGE (8); Gene Regulation (22); Microarrays (16); Splicing (2)
Human Genome (13): Annotations (3); Health and Disease (3); Other Resources (2); Sequence Polymorphisms (5)
Model Organisms (9): Microbes (4); Mouse and Rat (2); Plants (1); Yeast (2)
Other Molecules (2): Carbohydrates (2)
Protein (131): 2-D Structure Prediction (10); 3-D Structure Prediction, Comparison (34); 3-D Structure Retrieval, Viewing (6); Biochemical Features (8); Domains and Motifs (25); Function (10); Interactions, Pathways, Enzymes (13); Localization and Targeting (7); Phylogeny Reconstruction (5); Proteomics (2); Sequence Features (6); Sequence Retrieval (5)
RNA (15): Functional RNAs (5); Motifs (3); Sequence Retrieval (2); Structure Prediction, Visualization, and Design (5)
Sequence Comparison (29): Alignment Editing and Visualization (2); Analysis of Aligned Sequences (12); Comparative Genomics (7); Multiple Sequence Alignments (2); Pairwise Sequence Alignments (2); Similarity Searching (4)
Literature (5): Search Tools (3); Text Mining (2)
http://bioinformatics.ubc.ca/resources/links_directory/narweb2005/
68
NAR database issue (Jan. 2005)
Categories | Total | URL not available/not working | Not recommended | Recommended
1. Nucleotide Sequence Databases | 53 | 3 | 41 | 9
2. RNA Sequence Databases | 34 | 4 | 23 | 7
3. Protein Sequence Databases | 104 | 4 | 66 | 34
4. Structure Databases | 64 | 7 | 9 | 48
5. Genomic Databases (non-human) | 134 | 11 | 105 | 17
6. Metabolic Enzyme Pathways & Signals Pathways | 36 | 3 | 20 | 13
7. Human & Other Vertebrate Genomes | 62 | 2 | 42 | 18
8. Human Genes & Diseases | 69 | 4 | 54 | 11
9. Microarray Data & Other Gene Expression Databases | 42 | 4 | 31 | 7
10. Proteomics Resources | 7 | 0 | 5 | 2
11. Other Molecular Biology Databases | 17 | 1 | 10 | 6
12. Organelle Databases | 18 | 1 | 13 | 4
13. Plant Databases | 48 | 11 | 36 | 1
14. Immunological Databases | 20 | 0 | 17 | 3
Total | 708 | 55 | 474 | 179
138 can be downloaded
69
An example of our curation
70
Two keys in bioinformatics research

Solve a significant biological question
Develop a must-use application tool
