Bioinformatics: an overview - PowerPoint PPT Presentation

About This Presentation
Title:

Bioinformatics: an overview

Description:

Bioinformatics: an overview. Ming-Jing Hwang, Institute of Biomedical Sciences, Academia Sinica. http://gln.ibms.sinica.edu.tw/


Transcript and Presenter's Notes


1
Bioinformatics: an overview
  • Ming-Jing Hwang
  • Institute of Biomedical Sciences
  • Academia Sinica

http://gln.ibms.sinica.edu.tw/
2
The human genome project
Year 2001
3
Promises
  • More will happen in biology in the next 10
    years than in the past 50
    (Craig Venter, Celera Genomics).
  • We should be able to uncover the major
    hereditary contributions to common
    illnesses like diabetes and mental illness,
    probably in the next three to five years
    (Francis Collins, head of HGP).

4
Genetics & Genomics: From DNA to population
Source: gsk
5
What makes us human ?
  • The difference between you and a chimp is 1.24%
  • The difference between you and Maggie is 0.1%

6
Hunting for disease genes
Source: gsk
7
Genes and Diseases
Penetrance: the likelihood that a person carrying
a particular mutant gene will have an altered
phenotype
Source: gsk
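As a small illustration of the definition above, penetrance can be estimated as the fraction of mutation carriers who show the altered phenotype. This is a minimal sketch; the function name and the cohort numbers are hypothetical and do not come from the slides.

```python
def penetrance(carriers_affected: int, carriers_total: int) -> float:
    """Estimate penetrance as the fraction of carriers with the altered phenotype."""
    if carriers_total <= 0:
        raise ValueError("need at least one carrier")
    if not 0 <= carriers_affected <= carriers_total:
        raise ValueError("affected carriers must be between 0 and the total")
    return carriers_affected / carriers_total


# Hypothetical cohort: 80 of 100 carriers show the altered phenotype.
estimate = penetrance(80, 100)
print(f"estimated penetrance: {estimate:.2f}")
```

A value below 1.0 corresponds to incomplete penetrance: some carriers of the mutant gene do not show the phenotype.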
8
Phenotype and genotype
  • Many different genotypes can have the same phenotype
  • Many genotypes do not change the phenotype
  • One phenotype could be due to many different
    genotypes
  • -- statistical genetics

9
The common variant/common disease (CV-CD)
hypothesis
It is believed that most polygenic contributions
to disease susceptibility will arise from
variants that are relatively common in the
susceptible population.
10
Genetic variations
  • SNPs constitute 90% of human genetic variation
  • Other forms of variations include insertion,
    deletion, and differences in the copy number of
    tandem repeats or large genomic segments, etc.

11
Three phases of human genome sequencing
  • The genome map (draft in 2001, finished in
    2003)
  • The SNP map (TSC, 2001)
  • The haplotype map (HapMap, 2005)

12
Source: gsk
13
Source: gsk
14
pharmacogenomics
8/2, 10:00 pm on PTS (CH13)
15
(Nature, 2004)
(PNAS, 2005)
16
Common SNPs
Kruglyak & Nickerson, 2001
17
dbSNP summary (NCBI build 124)
July, 2005
18
Haplotype structure of the human genome
Goldstein, 2001
20
Rationale of HapMap
In a given population, 55 percent of people may
have one version of a haplotype, 30 percent may
have another, 8 percent may have a third, and the
rest may have a variety of less common
haplotypes. The International HapMap Project is
identifying these common haplotypes in four
populations from different parts of the world. It
also is identifying "tag" SNPs that uniquely
identify these haplotypes. By testing an
individual's tag SNPs (a process known as
genotyping), researchers will be able to identify
the collection of haplotypes in a person's DNA.
The number of tag SNPs that contain most of the
information about the patterns of genetic
variation is estimated to be about 300,000 to
600,000, which is far fewer than the 10 million
common SNPs.
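The tag SNP idea described above can be sketched in a few lines: given a set of known haplotypes, look for the smallest subset of SNP positions whose alleles tell every haplotype apart, then identify a haplotype from the alleles observed at those positions only. This is a toy brute-force sketch under simplifying assumptions: the six-site haplotypes are made up, the function names are hypothetical, and a real individual is diploid and carries two haplotypes, which this example ignores.

```python
from itertools import combinations


def find_tag_snps(haplotypes):
    """Return the smallest tuple of SNP positions that distinguishes all haplotypes.

    Brute force over position subsets; only practical for toy inputs.
    """
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            signatures = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return sites
    raise ValueError("haplotypes are not all distinct")


def identify_haplotype(haplotypes, tag_sites, observed):
    """Match alleles observed at the tag sites to a known haplotype, if any."""
    for h in haplotypes:
        if tuple(h[i] for i in tag_sites) == tuple(observed):
            return h
    return None


# Hypothetical common haplotypes over six SNP sites.
HAPLOTYPES = ["ACGTAC", "ACGCAT", "GTGTAC", "GTGCGT"]

tags = find_tag_snps(HAPLOTYPES)
print("tag SNP positions:", tags)
print("alleles G,T at the tag sites ->", identify_haplotype(HAPLOTYPES, tags, "GT"))
```

Here two of the six sites are enough to separate the four haplotypes, which mirrors the point of the slide: genotyping a few hundred thousand tag SNPs can stand in for millions of common SNPs.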
21
Beyond genome
22
Chemical genomics
24
Bioinformatics has many sub-disciplines
  • Genome Informatics (DNA sequence)
  • Transcriptome Informatics (expression)
  • Proteome informatics (ID, post-transl. mod.)
  • Protein Informatics (protein struct./funct.)
  • Evolutionary Informatics
  • Biomedical Informatics (human disease)

25
Briefings in Bioinformatics (Mar 2005)
  • The many faces of sequence alignment (Batzoglou)
  • Bioinformatics analysis of alternative splicing
    (Lee & Wang)
  • Putting microarrays in a context: integrated
    analysis of diverse biological data (Troyanskaya)
  • Bioinformatics approaches and resources for
    single nucleotide polymorphism functional
    analysis (Mooney)
  • A survey of current work in biomedical text
    mining (Cohen & Hersh)
  • Current efforts in the analysis of RNAi and RNAi
    target genes (Bengert & Dandekar)

26
Sequence alignment
  • The problem is still not solved: sequence
    alignment methodology and tool development
    continue to grow.
  • How can that be, after nearly forty years of
    research and literally hundreds of available
    tools?
  • Why should alignment remain an open problem?
  • It is not a single problem but rather a
    collection of many quite diverse questions that
    all have in common the search for sequence
    similarity.
  • Biological sequence databases are expanding
    exponentially, faster than Moore's law.

Batzoglou, 2005
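To make "the search for sequence similarity" concrete, here is a minimal sketch of one classic formulation, global alignment scoring by dynamic programming (Needleman-Wunsch). The slides do not name a specific algorithm, so this is only an illustration; the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best global alignment of a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

The table costs time proportional to the product of the two sequence lengths, which is one reason speed appears among the alignment challenges once databases grow faster than hardware does.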
27
Sequence alignment challenges
  • Sensitivity and specificity
  • Speed
  • Evaluation
  • Low similarity
  • Rearrangements
  • Orthology detection
  • Multiple (genome) alignments

28
Evolution of functionally important regions over
time
Miller et al., 2005
29
Evolutionary Informatics/ Comparative genomics
(Ureta-Vidal, Ettwiller, Birney, 2003)
30
Schema of genome alignment
(2003)
31
Genome alignment: recent reviews
  • An Applications-Focused Review of Comparative
    Genomics Tools: Capabilities, Limitations and
    Future Challenges (Chain et al., Briefings in
    Bioinformatics, 2003)
  • The many faces of sequence alignment (Batzoglou,
    Briefings in Bioinformatics, 2005)
  • Comparative genomics (Miller et al., Annu. Rev.
    Genomics Hum. Genet., 2004)

32
RNAi: post-transcriptional gene regulation
  • Computational identification of miRNAs
  • Computational prediction of miRNA targets
  • miRNA data resources

33
Transcriptomics: tools for understanding the body
plan
34
Microarray: integrated analysis
Troyanskaya, 2005
35
Proteomics
Initial goal: identification of all proteins
expressed by a cell or tissue
36
From 1D to 3D: The Holy Grail of Structural
Bioinformatics
MADWVTGKVTKVQNWTDALFSLTVHAPVLPFTAGQFTKLGLEIDGERVQR
AYSYVNSPDNPDLEFYLVTVPDGKLSPRLAALKPGDEVQVVSEAAGFFVL
DEVPHCETLWMLATGTAIGPYLSILRLGKDLDRFKNLVLVHAARYAADLS
YLPLMQELEKRYEGKLRIQTVVSRETAAGSLTGRIPALIESGELESTIGL
PMNKETSHVMLCGNPQMVRDTQQLLKETRQMTKHLRRRPGHMTAEHYW
37
(No Transcript)
38
Structural Bioinformatics: Sequence/Structure
Relationship
[Figure: percent identity scale from 100 to 0, with the twilight zone and midnight zone marked. Labels: all possible sequences of amino acids; protein sequences observed in nature; protein structures observed in nature]
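The percent identity axis in this figure is computed from a pairwise alignment. Below is a minimal sketch of one common convention: identical residues divided by the number of aligned columns that contain no gap. Other conventions use a different denominator, and the slides do not say which one the figure uses. The second sequence in the example is a made-up variant of the first eight residues shown on the "From 1D to 3D" slide.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# One substitution in eight aligned residues (hypothetical variant).
print(percent_identity("MADWVTGK", "MADWATGK"))
# Gap columns are skipped under this convention.
print(percent_identity("MA-DW", "MAKDW"))
```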
39
Structure Prediction Methods
[Figure: homology modeling, fold recognition, and ab initio methods placed along a sequence identity axis running from 0 to 100]
40
CASP Experiments
41
Some CASP4 successes
Baker's group
42
3D to 1D?
Science 2003
43
A computer-designed protein (93 aa) with 1.2 Å
resolution
44
Sequence/Structure Gap
Sequence
Structure
45
Structural Genomics: solving fold representatives
Baker & Sali, 2001
46
Structural Genomics: overview
  • When: 1997, by Barry Honig, Wayne Hendrickson and
    colleagues in a proposal for the DOE's Advanced
    Photon Source (APS)
  • Goals: 10,000 structures (100-200 structures/center/year),
    each representing a protein family, in 5 (10)
    years
  • Enabling factors: genome sequences, technology
    advancement (synchrotron MAD, etc.)
  • Cost: reducing the current US$200,000/structure to
    $10,000/structure (est. US$1.5-5 billion)
  • Players: academia & industry

47
Flowchart of a SG project
Burley et al., 1999
48
PSI phase I (pilot) centers
  • Berkeley Structural Genomics Center focused on
    two bacterial species with extremely small
    genomes to study proteins essential for
    independent life. Principal investigator:
    Sung-Hou Kim, Lawrence Berkeley National
    Laboratory
  • Center for Eukaryotic Structural Genomics, based
    in Wisconsin, focused on protein production,
    characterization, and structure determination
    from Arabidopsis thaliana, a plant that is
    frequently used in laboratory research and that
    has many genes in common with humans and animals.
    Principal investigator: John Markley, University
    of Wisconsin, Madison
  • Joint Center for Structural Genomics, based in
    California, focused on novel structures from
    thermophilic microorganisms and on human proteins
    thought to be involved in cell signaling.
    Principal investigator: Ian Wilson, The Scripps
    Research Institute
  • Midwest Center for Structural Genomics, based in
    Illinois, selected bacterial targets related to
    disease and proteins from all three kingdoms of
    life. The emphasis was on previously unknown
    folds and on proteins from disease-causing
    organisms. Principal investigator: Andrzej
    Joachimiak, Argonne National Laboratory
  • New York Structural Genomics Research Consortium
    solved protein structures for disease-related
    proteins from eukaryotes and bacteria. Principal
    investigator: Stephen K. Burley, Structural
    GenomiX, Inc.
  • Northeast Structural Genomics Consortium, based
    in New Jersey, focused on target proteins from
    various model organisms, including the fruit fly,
    yeast, and roundworm. It used both X-ray
    crystallography and NMR spectroscopy. Principal
    investigator: Gaetano Montelione, Rutgers
    University
  • The Southeast Collaboratory for Structural
    Genomics, based in Georgia, determined structures
    from the prokaryotic model organism Pyrococcus
    furiosus and the eukaryotic model organism C.
    elegans, as well as some human proteins.
    Principal investigator: Bi-Cheng Wang, University
    of Georgia
  • Structural Genomics of Pathogenic Protozoa
    Consortium, based in Washington, solved protein
    structures from organisms known as protozoans,
    many species of which cause deadly diseases such
    as sleeping sickness, malaria, and Chagas'
    disease. Principal investigator: Wim G. J. Hol,
    University of Washington
  • TB Structural Genomics Consortium, based in New
    Mexico, analyzed protein structures from
    Mycobacterium tuberculosis. Principal
    investigator: Thomas Terwilliger, Los Alamos
    National Laboratory

49
PSI Pilot Phase: Facts at a Glance
  • Goal: To develop new approaches and tools needed
    to streamline and automate the steps of protein
    structure determination, and to incorporate those
    methods into high-throughput pipelines that use
    DNA sequence information to generate
    three-dimensional protein structure models
  • Project period: September 2000 to June 2005
  • Funding: $270 million (funded largely by the
    National Institute of General Medical Sciences,
    with additional support from the National
    Institute of Allergy and Infectious Diseases)
  • Number of centers: 9 (6 survived to phase II)
  • Solved protein structures: more than 1,100
  • Unique structures solved (structures sharing less
    than 30 percent of their sequence with other
    known proteins): more than 700

50
PDB content growth (May 2005)
51
Many bottlenecks remain: target tracking by PDB
(Sep 2002)
52
Current (phase II) PSI centers
53
Hybrid approach for solving macromolecular
complex structures
54
Protein network: an integrated approach
Aloy et al., 2004
55
Bioinformatics and Drug Design
Scientific American, 2000
56
Yeast protein interaction network
Nat Rev Genet. 2004
57
Network parameters
  • Degree (connectivity) k
  • Degree distribution P(k), probability that a
    selected node has exactly k links.
  • Scale-free network: degree distribution
    approximates a power law, $P(k) \sim k^{-\gamma}$
    ($\gamma$: degree exponent)

[Plot: log P(k) versus log k]
Barabási & Oltvai, Nat Rev Genet. 2004
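The network parameters above can be computed directly from an edge list. The sketch below derives the degree k of each node, the degree distribution P(k), and a rough estimate of the degree exponent (called gamma in the code) from the slope of log P(k) against log k. The toy interaction list and the function names are hypothetical, and a least-squares fit on a log-log plot is only a crude estimator that real analyses usually replace with more careful methods.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)}, the fraction of nodes that have exactly k links."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n_nodes = len(degree)  # nodes with no edges are not represented
    return {k: c / n_nodes for k, c in sorted(counts.items())}


def estimate_gamma(pk):
    """Estimate gamma in P(k) ~ k^(-gamma) by least squares on log-log values."""
    xs = [math.log(k) for k in pk]
    ys = [math.log(p) for p in pk.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct degrees")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the slope is -gamma


# Hypothetical toy interaction network with one hub.
EDGES = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]

pk = degree_distribution(EDGES)
print("P(k):", pk)
print("rough gamma estimate:", estimate_gamma(pk))
```

A network this small says nothing about scale-free behavior; the point is only to show how k, P(k), and the exponent relate.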
58
Network models
Barabási & Oltvai, Nat Rev Genet. 2004
59
Scale-free networks
$P(k) \sim k^{-\gamma}$ ($\gamma_{in}$: in-degree exponent; $\gamma_{out}$: out-degree
exponent)
Albert & Barabási, Reviews of Modern Physics, 2002
60
Challenges in network biology
  • Network databases
  • Information integration
  • Organization characteristics and principles
  • Design rules
  • Evolution mechanisms
  • Validation

61
Neuroinformatics: neuroscience + bioinformatics
The human brain project - UC Davis: http://nir.cs.ucdavis.edu/index.jsp
http://ncmir.ucsd.edu/NCDB/
62
Bioinformatics Journals
  • Bioinformatics
  • Nucleic Acids Research
  • BMC Bioinformatics
  • Briefings in Bioinformatics
  • Proteins
  • J. Mol. Biol.
  • PNAS
  • PLoS Computational Biology
  • Genome Research

63
Scope of bioinformatics
  • Genome analysis
  • Sequence analysis
  • Phylogenetics
  • Structural bioinformatics
  • Gene expression
  • Genetic and population analysis
  • Systems biology
  • Data and text mining
  • Databases and ontologies

64
Sample articles of a recent issue
  • Exon-domain correlation and its corollaries
  • Functional annotation from predicted protein
    interaction networks
  • HYPROSP II-A knowledge-based hybrid method for
    protein secondary structure prediction based on
    local prediction confidence
  • Comparative interactomics analysis of protein
    family interaction networks using PSIMAP (protein
    structural interactome map)
  • Semi-supervised protein classification using
    cluster kernels
  • A new progressive-iterative algorithm for
    multiple structure alignment
  • Practical FDR-based sample size calculations in
    microarray experiments
  • Mining genetic epidemiology data with Bayesian
    networks I: Bayesian networks and example
    application (plasma apoE levels)
  • Inferring protein-protein interactions through
    high-throughput interaction data from diverse
    organisms
  • A latent variable model for chemogenomic
    profiling

65
NAR Database issue (Jan. 2005)

Categories                                             Count
 1. Nucleotide Sequence Databases                         53
 2. RNA Sequence Databases                                34
 3. Protein Sequence Databases                           105
 4. Structure Databases                                   64
 5. Genomic Databases (non-human)                        134
 6. Metabolic Enzyme Pathways Signals Pathways            36
 7. Human & Other Vertebrate Genomes                      64
 8. Human Genes & Diseases                                69
 9. Microarray Data & Other Gene Expression Databases     42
10. Proteomics Resources                                   7
11. Other Molecular Biology Databases                     17
12. Organelle Databases                                   18
13. Plant Databases                                       48
14. Immunological Databases                               20
Total                                                    711

http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D5/TBL1
66
NAR Web Server Issue (July 2005)
Year of publication
2004 129 (137)
2005 166
Total 295
67
  • Computer Related (2): Bio-* Programming Tools (1); Statistics (1)
  • DNA (57): Annotations (9); Gene Prediction (4); Mapping and Assembly (1);
    Phylogeny Reconstruction (4); Sequence Feature Detection (16);
    Sequence Polymorphisms (8); Sequence Retrieval and Submission (3);
    Tools For the Bench (12)
  • Education (1): Directories and Portals (1)
  • Expression (48): cDNA, EST, SAGE (8); Gene Regulation (22);
    Microarrays (16); Splicing (2)
  • Human Genome (13): Annotations (3); Health and Disease (3);
    Other Resources (2); Sequence Polymorphisms (5)
  • Model Organisms (9): Microbes (4); Mouse and Rat (2); Plants (1); Yeast (2)
  • Other Molecules (2): Carbohydrates (2)
  • Protein (131): 2-D Structure Prediction (10);
    3-D Structure Prediction, Comparison (34);
    3-D Structure Retrieval, Viewing (6); Biochemical Features (8);
    Domains and Motifs (25); Function (10);
    Interactions, Pathways, Enzymes (13); Localization and Targeting (7);
    Phylogeny Reconstruction (5); Proteomics (2); Sequence Features (6);
    Sequence Retrieval (5)
  • RNA (15): Functional RNAs (5); Motifs (3); Sequence Retrieval (2);
    Structure Prediction, Visualization, and Design (5)
  • Sequence Comparison (29): Alignment Editing and Visualization (2);
    Analysis of Aligned Sequences (12); Comparative Genomics (7);
    Multiple Sequence Alignments (2); Pairwise Sequence Alignments (2);
    Similarity Searching (4)
  • Literature (5): Search Tools (3); Text Mining (2)

http://bioinformatics.ubc.ca/resources/links_directory/narweb2005/
68
NAR database issue (Jan. 2005)

Categories                                             Total  URL not available/not working  Not recommended  Recommended
 1. Nucleotide Sequence Databases                         53                              3               41            9
 2. RNA Sequence Databases                                34                              4               23            7
 3. Protein Sequence Databases                           104                              4               66           34
 4. Structure Databases                                   64                              7                9           48
 5. Genomic Databases (non-human)                        134                             11              105           17
 6. Metabolic Enzyme Pathways Signals Pathways            36                              3               20           13
 7. Human & Other Vertebrate Genomes                      62                              2               42           18
 8. Human Genes & Diseases                                69                              4               54           11
 9. Microarray Data & Other Gene Expression Databases     42                              4               31            7
10. Proteomics Resources                                   7                              0                5            2
11. Other Molecular Biology Databases                     17                              1               10            6
12. Organelle Databases                                   18                              1               13            4
13. Plant Databases                                       48                             11               36            1
14. Immunological Databases                               20                              0               17            3
Total                                                    708                             55              474          179

138 can be downloaded
69
An example of our curation
70
Two keys in bioinformatics research
  • Solve a significant biological question
  • Develop a must-use application tool