Title: Summation: Principles of Bioinformatics
1Summation: Principles of Bioinformatics
- Review the key ingredients of the Recipe for Bioinformatics.
- Use the Human Genome results as examples for understanding the importance of these ingredients in future genomics and bioinformatics problems.
- Integrate these principles with all the specifics you've learned this quarter; these principles are present in everything we've done in this class.
2What is Genomics? G = (MB)^n
- Genomics is any (molecular) biology experiment taken to the whole-genome scale.
- Ideally in a single experiment.
- E.g. genome sequencing.
- E.g. DNA microarray analysis of gene expression.
- E.g. mass spectrometry protein mixture analyses: quantity, phosphorylation, etc.
3Genomics Foundation: High-Throughput Technology
- Automation: any human step is a bottleneck.
- Multiplexing: parallelization.
- Miniaturization.
- Read-out: speed, sensitivity.
- GMP: Q/A, reproducibility, production-line mindset.
5In Genomics every question is really an information problem
- In molecular biology, experiments are small and designed to test a specific hypothesis clearly and directly.
- In genomics, experiments are massive and not designed for a single hypothesis.
- Every biology question about genomics data corresponds to a computer science problem: how to find the desired pattern in a dataset.
6Human Genome Sequencing
- The experimental part (the actual sequencing) was easy. It was the information problem that was hard.
- Assembly: the high frequency of repeats in the human genome can fool you into joining the wrong fragments.
7Purely Sequence-Based Assembly
Celera believed they could assemble the whole
human genome from shotgun sequence fragments in
this way. But this approach failed. They had to
use the public domain map data to resolve
problems in their assembly.
8Genome Annotation
- Genes are what biologists really want, not just the genome sequence.
- Unfortunately, most of the annotation of the 32,000 genes is based on gene prediction, not measured experimental evidence.
- It is likely that 50% of the reported genes are wrong in details (individual exons, boundaries) or entirely.
- The Drosophila annotation has recently been shown to be deeply flawed.
- An information problem that is still not solved.
9Definition of Bioinformatics
- Bioinformatics is the study of the inherent structure of biological information.
- Data-driven: let the data speak for themselves.
- Non-random patterns in the data.
- Measure significance of patterns as evidence for competing hypotheses.
10Computational Challenges
- Cluster genes by expression pattern over the course of the cell cycle.
- Identify groups of genes that are co-expressed, co-regulated.
- Identify regulatory elements common to the promoters of these genes that cause them to be expressed at the same time.
11Solving the Information Problem
- Modeling the problem: choosing what to include, and how to describe it.
- Relating this to known information problems.
- Algorithms for solution.
- Complexity: the amount of time and memory the algorithm requires.
12Completeness Changes Everything
- In molecular biology, cleverness is finding a way to answer a question definitively by ignoring 99.99% of genes. You can't see them, so the experiment must exclude them.
- In genomics, cleverness is discovering what becomes possible when you can see everything.
- Have to switch our deepest assumptions.
13What specifically can you learn from Everything?
- E.g. Protein function prediction
- genomic neighbors method
- phylogenetic profiles
- domain fusion (Rosetta Stone method)
- Microarray gene expression analysis
- meaningful signal not in just a few genes
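As a rough illustration of the phylogenetic profile idea listed above: genes that show the same presence/absence pattern across many genomes are candidates for a shared function. The sketch below is a minimal one under stated assumptions. The gene names, genome names, and the exact-match grouping rule are all hypothetical and are not taken from the lecture.

```python
from collections import defaultdict

# Hypothetical presence/absence calls: gene -> genomes where a homolog is found.
presence = {
    "geneA": {"E_coli", "B_subtilis", "S_cerevisiae"},
    "geneB": {"E_coli", "B_subtilis", "S_cerevisiae"},
    "geneC": {"E_coli"},
    "geneD": {"B_subtilis", "S_cerevisiae"},
}
genomes = ["E_coli", "B_subtilis", "S_cerevisiae"]


def profile(gene):
    """Encode a gene as a tuple of 0/1 flags, one per genome."""
    return tuple(int(g in presence[gene]) for g in genomes)


def group_by_profile(genes):
    """Group genes whose profiles match exactly (the simplest possible rule)."""
    groups = defaultdict(list)
    for gene in genes:
        groups[profile(gene)].append(gene)
    # Keep only profiles shared by at least two genes.
    return {p: sorted(gs) for p, gs in groups.items() if len(gs) > 1}


linked = group_by_profile(presence)
print(linked)
```

A real analysis would need many more genomes and a statistical measure of how surprising a shared profile is, in line with the emphasis on measures of evidence elsewhere in these slides.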
14Example: Ortholog Prediction
- Orthologs: two genes related by speciation events alone. The same gene in two species; typically, same function.
- Paralogs: two genes related by at least one gene-duplication and divergence event.
- Homology: an ortholog or a paralog?
- Experimentally very hard to answer.
15Genomics Requires Statistical Measures of Evidence
- Evaluate competing hypotheses under uncertainty (automatically?)
- based on statistical tendencies, not proofs
- false positives, false negatives
- the need for cross-validation
- the need for experimental validation
- best role: experiment interpretation and planning
16Measures of Evidence
- SNP identification from sequence data
- Genome annotation: gene evidence?
- Keys:
- Explicit, realistic likelihood models, priors measured from tons of real data.
- Explicit evaluation of alternative models.
- Real posteriors, w/ measures of uncertainty.
17Integrating Independent Evidence
- Typically, a calculation works with one kind of data; it is hard to integrate very different data.
- Likelihood models provide an easy way to integrate many different types of data: if they are really different (independent), just multiply them.
- Independence: Factorization!
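The "just multiply them" point can be sketched as follows, assuming the data types are conditionally independent given the model. All numbers, model names, and data-type names here are hypothetical; the multiplication is done as a sum of logs to avoid underflow when many factors are combined.

```python
import math

# Hypothetical likelihoods p(data_k | model) for three independent data types.
likelihoods = {
    "gene":     {"sequence": 0.8, "expression": 0.6, "homology": 0.9},
    "not_gene": {"sequence": 0.3, "expression": 0.4, "homology": 0.1},
}
priors = {"gene": 0.5, "not_gene": 0.5}  # assumed flat prior


def log_joint(model):
    """log p(model) + sum over k of log p(data_k | model)."""
    return math.log(priors[model]) + sum(
        math.log(p) for p in likelihoods[model].values()
    )


def posteriors():
    """Normalize the joint probabilities over all competing models."""
    joint = {m: math.exp(log_joint(m)) for m in likelihoods}
    total = sum(joint.values())
    return {m: j / total for m, j in joint.items()}


post = posteriors()
print(post)
```

If the data types are not actually independent, the product overstates the evidence, so the independence assumption itself needs checking.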
18Statistical Problems
- Microarray analysis: hierarchical clustering?
- Genome annotation: gene prediction?
- COGs: no statistics at all
- Protein function prediction: distance metrics instead of probabilistic modeling, no posteriors.
19From Reductionism to Systems Analysis
- Mol. Biol.: dissect a complex phenomenon into its smallest pieces; characterize each.
- Very hard to put the pieces back together again. Given AB, AC: ABC?
- Genomics: the cell as test-tube. Able to see ABC (DE) working together.
- Study how all the components work together as a system. Study system behavior.
20Cell-Cycle Regulated Genes by Whole-Genome Microarray
21Automated Discovery of Cell Cycle Regulatory
Elements
23Pathways Detected by the Domain Fusion (Rosetta Stone) Method
[Figure: pathway diagrams in panels A to D. Proteins labeled: AroA, AroB, AroC, AroD, AroE, AroF, AroG, AroH, AroK, AroL, YDIB, PurA, PurB, PurE, PurF, PurH, PurK, PurL, PurT, PurU, Pur2, Pur3, Pur5, Pur7, GuaA, GuaB.]
Marcotte et al., Science, 285, pp. 751-753 (1999)
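The domain fusion (Rosetta Stone) logic can be sketched roughly as follows: if two domains that occur in separate proteins in one genome are found fused into a single protein in another genome, the two separate proteins are predicted to be functionally linked. The protein names, domain labels, and data structures below are hypothetical, and the sketch omits any statistical filtering of promiscuous domains.

```python
from itertools import combinations

# Hypothetical domain annotations: protein -> list of domain families.
target_genome = {"protX": ["D1"], "protY": ["D2"], "protZ": ["D3"]}
other_genomes = {"fusedP": ["D1", "D2"], "singleQ": ["D3"]}


def rosetta_stone_links(target, others):
    """Link target proteins whose domains co-occur in one protein elsewhere."""
    fused_pairs = set()
    for domains in others.values():
        # Record every pair of distinct domains found in the same protein.
        for a, b in combinations(sorted(set(domains)), 2):
            fused_pairs.add((a, b))
    links = []
    for (p, dp), (q, dq) in combinations(sorted(target.items()), 2):
        if any(
            tuple(sorted((a, b))) in fused_pairs
            for a in dp
            for b in dq
            if a != b
        ):
            links.append((p, q))
    return links


links = rosetta_stone_links(target_genome, other_genomes)
print(links)
```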
24PHYLOGENETIC PROFILE METHOD
25From Hypothesis-Driven to Data-Driven Science
- Mol. Biol.: can't see 99.99% of genes, so use black-box logic based on controls: keep everything the same except for one small change. Isolate a specific cause-effect.
- In reality you rarely have the perfect control.
- Hypothesis-driven: can only see what you look for: a few genes, a few controls.
- Interpretable: ask a YES-NO question.
26From Hypothesis-Driven to Data-Driven Science
- Genomics: measure all genes at once.
- Don't have to assume a hypothesis as basis for designing the experiment.
- Objective: let the data speak for themselves.
- Reality: vast amounts of data, very complex, hard to interpret.
- System Science or just Stupid Science?
27Stupid Science: Data-driven Science Done Wrong
- No hypothesis.
- Assumptions: alternative models not explicitly enumerated, weighed.
- Statistical basis of model either neglected or only implicit (and therefore poor).
- No cross-validation: just one form of evidence.
- Greedy algorithms, sensitive to noise.
- Measures of significance weak or absent, both computationally and experimentally.
28Data-driven Science Done Right
- Multiple competing hypotheses.
- Alternative models explicitly included, computed, to eliminate assumptions.
- Statistical models clear, well-justified.
- Multiple, independent types of evidence.
- Robust algorithms w/ well-demonstrated convergence to global optimum.
- Rigorous posterior probability calculated for all possible models of the data. Priors derived from data. False positive and false negative rates measured.
29Implications of Data-driven Science
- To get strong posteriors that can distinguish multiple models, you need LOTS of data.
- Genomics is creating an unprecedented avalanche of data, opportunities.
- A change in the nature of data: lost data in old notebooks, journals, heads vs. electronic databases that can be queried, analyzed.
- The end of (purely) human analysis.
- Don't confuse observations with interpretations.
30Bioinformatics as Prediction
- Given a protein sequence, bioinformatics would seek to predict its fold.
- Given a genome sequence, bioinformatics would seek to predict the locations and exon-intron structures of its genes.
- The ultimate test: make a blind prediction (when no experimental data are known) and test it experimentally.
31A new kind of Bioinformatics
- The massive experimental data produced by genomics projects has created a demand for a fundamentally different kind of bioinformatics, which we can characterize (with some exaggeration) as a mix of three principles:
32CHEAT
- Don't even try to predict anything.
- Just say, "Give us all your experimental data that contain the answer to this question, and THEN we'll tell you what we think the answer is!"
- Focus is on statistically accurate measurement of the strength of the evidence for different interpretations of the experimental data.
33Steal Other People's Data
- The massive amount of data being produced in the public domain is an opportunity for heavy-duty data-mining, using statistics to expose patterns that would otherwise be missed in this huge dataset.
34Trust No One
- What kind of data do we want?
- RAW EXPERIMENTAL DATA, ideally straight from the sequencing machines.
- INTERPRETED DATA is untrustworthy.
- Actually, bioinformatics PREDICTIONS are contaminating the experimental databases!
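35Chromatographic Evidence
[Figure: sequencing chromatograms for two ESTs covering the same site. HsS785496 (zu42c08.r1): G G T G G T C, with G at the variable position. HsS1065649 (oz03ho7.x1): G G T G A T C, with A at the variable position.]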
36Science by Computer?
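- No human scientist will ever look at all these data.
- To make discoveries in these data, scientific judgment of evidence must be formalized as a computation.
- Computational inference about hidden states H from observable states O, using Bayes' Law:
$$p(H \mid O) = \frac{p(O \mid H)\, p(H)}{\sum_{H'} p(O \mid H')\, p(H')}$$
where p(H | O) is the posterior, p(O | H) is the likelihood, p(H) is the prior, and the denominator is the sum over all hidden states.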
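The inference about hidden states H from observable states O on this slide can be written as a short computation. This is a minimal sketch for a discrete set of hidden states; the state names, observation names, and probabilities are hypothetical and chosen only to make the arithmetic visible.

```python
def posterior(prior, likelihood, obs):
    """Bayes' Law: p(H | O) = p(O | H) p(H) / sum over H' of p(O | H') p(H')."""
    joint = {h: likelihood[h][obs] * prior[h] for h in prior}
    evidence = sum(joint.values())  # sum over all hidden states
    return {h: j / evidence for h, j in joint.items()}


# Hypothetical example: is an observed base difference a true SNP or an error?
prior = {"snp": 0.001, "error": 0.999}
likelihood = {
    "snp":   {"clean_peak": 0.9, "noisy_peak": 0.1},
    "error": {"clean_peak": 0.01, "noisy_peak": 0.99},
}

post = posterior(prior, likelihood, "clean_peak")
print(post)
```

Even with a likelihood that strongly favors the SNP, the small prior keeps the posterior modest, which is why priors measured from real data matter.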
37Diversity Kills Bayes' Law
- Posterior probability p(M | obs) assumes all observations came from one model.
- E.g. gene prediction: predict the best gene structure. Completely ignores the possibility of alternative splicing.
- What if there are multiple, different models in reality? Observations will appear contradictory.
- Must treat the world as a hidden mixture of models: don't know how many, don't know their weights.
38When data is Mixed Up
[Figure: scatter plot of weight vs. height. Annotation: "No correlation?"]
39Real Results are Hidden
[Figure: scatter plot of weight vs. height, with baseball players and basketball players as separate groups. Annotation: "Good correlation within each group!"]
40...or Completely Wrong
[Figure: scatter plot of weight vs. height, with sumo wrestlers and basketball players as separate groups, and an overall correlation line drawn through both.]
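The weight vs. height slides can be reproduced numerically. In the sketch below, each group has a perfect positive correlation on its own, yet pooling the two groups gives a strongly negative overall correlation. The heights (cm) and weights (kg) are invented for illustration and are not real measurements.

```python
def pearson(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    vx = sum((x - mx) ** 2 for x, _ in points)
    vy = sum((y - my) ** 2 for _, y in points)
    return cov / (vx * vy) ** 0.5


# Hypothetical (height, weight) pairs for two hidden groups.
sumo = [(170, 140), (175, 145), (180, 150)]
basketball = [(200, 95), (205, 100), (210, 105)]

r_sumo = pearson(sumo)
r_basketball = pearson(basketball)
r_pooled = pearson(sumo + basketball)
print(r_sumo, r_basketball, r_pooled)
```

The pooled result is an artifact of ignoring the hidden mixture, not a property of either group.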
41Mixture Evidence must be convincing!
[Figure: scatter plot of weight vs. height.]
42Mixture Evidence must be convincing!
Or we could arbitrarily split up the data any way we like, to generate any desired (ridiculous) conclusion!
[Figure: scatter plot of weight vs. height.]
43Not just one Genome! A Hidden Mixture
- Generalize linear model S to a partial order graph with hidden edge probabilities r_ij.
[Figure: partial order graph over nodes S1 to S11, with edges labeled r67 and r69 and features labeled "SNP" and "splice".]
Cf. gene prediction: assume only one model is possible (no alternative splicing), so treat r_ij as binary (0 or 1).
44Evidence Confidence
[Figure: sequence graph with edge probabilities r and 1-r over the bases C A G G T C T A G G C G.]
- Odds ratio that a feature exists: log p(r > 0 | obs)
- Does the probability of the observations drop catastrophically when we eliminate a given model feature (i.e. set its r = 0)?
- That means there are some observations that cannot be explained well any other way! This is strong evidence.
45[Figure: model feature with edge probabilities t and 1 - t.]
p(obs | t > 0) = 10^-3
p(obs | t = 0) = 10^-7.2
Evidence confidence: LOD value of 4.2
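The LOD value on this slide follows directly from the two probabilities shown: it is the base-10 log of the ratio between the probability of the observations with the feature allowed and with the feature removed. A minimal sketch, where the function name `lod` is an illustrative choice:

```python
import math


def lod(p_with_feature, p_without_feature):
    """log10 of p(obs | feature allowed) / p(obs | feature eliminated)."""
    return math.log10(p_with_feature / p_without_feature)


# Values from the slide: p(obs | t > 0) = 10^-3, p(obs | t = 0) = 10^-7.2.
score = lod(10 ** -3, 10 ** -7.2)
print(round(score, 1))
```

In practice these probabilities are often tiny, so it is safer to subtract log-probabilities directly (here -3 - (-7.2) = 4.2) than to form the ratio first.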
46Sorting out EST complexity
- Need to allow for real divergences within an alignment, e.g. chimeras, paralogs, alt. splicing.
- Detect clustering errors by dividing alignment into groups of divergent sequences (possible paralogs).
- Apply graph theory (mathematics of branching structures) to deal with this.
- Developed new multiple sequence alignment method to do this: Partial Order Alignment.
- Two cases: with genomic sequence, or without (harder).
47Linear Alignment assumes NO Structural Divergences
Cluster AA702884: C vs. T polymorphism. Novel SNP, not previously identified.
48Major Divergences within an Alignment: Partial Order
Linear MSA
- simple assembly
- substitutions
- simple indels
Branching
- multiple domains
- chimeric sequences
- paralogous genes
- alternative splicing or polyadenylation
Loops
- not simple indels
- alternative splicing
- paralogous genes
- multiple domains
49Find Optimal Traversals
Use graph theory to find the minimum number of
traversals needed to completely encode the
alignment.
Completely encoded by a single traversal
Can only be encoded by two distinct traversals
Assign each EST to the traversal that encodes it
50Separation of Paralogous Groups
Most Unigene clusters contain mutually
inconsistent sequences (e.g. branching,
correlated substitutions suggesting paralogs,
etc.)
51Pairwise Multiple Sequence Alignment
Compute O(N^2) pairwise distances, find the minimum spanning tree, then iteratively align via the shortest edges.
[Figure: tree connecting sequences 1, 2, 3, and 4.]
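The ordering step described on this slide (all pairwise distances, then a minimum spanning tree, then alignment along the shortest edges first) can be sketched as below. This only computes the merge order and performs no alignment. The Hamming distance on equal-length strings and the use of Kruskal's algorithm are assumptions made for the sketch; the slide does not say which distance or which spanning tree algorithm was used.

```python
from itertools import combinations


def hamming(a, b):
    """Toy distance: number of mismatches between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def mst_alignment_order(seqs, dist=hamming):
    """Return minimum spanning tree edges (distance, i, j), shortest first."""
    # O(N^2) pairwise distances, sorted so the shortest edges come first.
    edges = sorted(
        (dist(seqs[i], seqs[j]), i, j)
        for i, j in combinations(range(len(seqs)), 2)
    )
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge joins two separate groups, so keep it
            parent[ri] = rj
            tree.append((d, i, j))
    return tree


seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TTGAACGA"]
order = mst_alignment_order(seqs)
print(order)
```

Each returned edge names the next pair of sequences (or groups containing them) to merge, starting with the most similar.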
52Multiple Domains
Proteins may share a domain, but differ elsewhere.
One domain may be observed in many proteins. One protein may contain many domains.
As a linear MSA: should it be stored as...
In linear MSAs, indel placement is often arbitrary. This leads to arbitrary differences in gap penalties charged in later alignments...
Or as...
53PO Alignment of Multi-Domain Sequences
[Figure: partial order alignment graph of MATK (M), ABL1 (A), GRB2 (G), and CRKL (C), with domain nodes labeled KINASE, SH2, and SH3.]
54Data Convergence: the Genome is the Glue
- Biology has become highly specialized, fragmented: a natural result of reductionism.
- But ultimately most activities attach to a gene or group of genes.
- Because of evolution, genes connect to each other through orthology (across species) and paralogy (duplication events).
- Discovery is connecting previously unrelated facts about phenotypes and causes.
55Examples
- Human proteins work in yeast. Use yeast to figure out protein-protein interactions.
- Drosophila get high on cocaine. Use them to study addiction, therapies.
- C. elegans on Prozac...
- Spiders build crazy webs on caffeine...
57[Figure labels: Human, Mouse]
58Expansion of Gene Families
59Star Topologies: O(N^2) power in O(N) links
[Figure: star topology diagram. Labels: activities, regulatory sites, mutants, ligands, phenotypes, expression, domains, alternative splicing, model organisms, genes, proteins, polymorphisms, folds, screens, mapping, motifs, disease associations, development.]
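The "O(N^2) power in O(N) links" claim can be illustrated with a small sketch. It assumes that genes sit at the hub of the star, in the spirit of the "Genome is the Glue" slide: each of N data types is linked only to gene identifiers, yet any two data types can then be joined through the shared genes, giving N(N-1)/2 possible pairwise connections. All gene names and annotations are hypothetical.

```python
from itertools import combinations

# Hypothetical annotations: each data type is linked only to gene identifiers.
links = {
    "phenotypes":    {"gene1": "slow growth", "gene2": "wild type"},
    "expression":    {"gene1": "up in S phase", "gene2": "constant"},
    "domains":       {"gene1": "kinase", "gene2": "SH3"},
    "polymorphisms": {"gene1": "C/T SNP"},
}


def join(type_a, type_b):
    """Connect two data types through the genes they both annotate."""
    shared = links[type_a].keys() & links[type_b].keys()
    return {g: (links[type_a][g], links[type_b][g]) for g in sorted(shared)}


n_links = len(links)                          # N links into the hub
n_pairs = len(list(combinations(links, 2)))   # N(N-1)/2 pairwise joins
print(n_links, n_pairs)
print(join("phenotypes", "polymorphisms"))
```

No direct phenotype-to-polymorphism table was ever built; the connection comes for free from the two links to the hub.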
60Genomics and Bioinformatics
- Incredible increase in experimental data production is making possible entirely new analyses.
- Bioinformatics required for interpretation of the meaning of the raw data.
- A lot of discovery is possible, but...
61Evidence Matters!
- The genome is a very complex place.
- Every possible trap for naïve bioinformatics analysis is actually there: repeats, paralogs, polymorphisms...
- Rigorous statistical measurement of our confidence is the only thing that will keep us from making silly mistakes.
62Sources of Uncertainty
Modrek & Lee, Nature Genetics (in press).
63Bioinformatics Factors
64Biological Interpretation Factors