Summation: Principles of Bioinformatics

Provided by: christo138

Learn more at: http://voh.chem.ucla.edu

1
Summation: Principles of Bioinformatics

Review the key ingredients of the Recipe for
Bioinformatics.
Use the Human Genome results as examples for
understanding the importance of these ingredients
in future genomics and bioinformatics problems.
Integrate these principles with all the specifics
you've learned this quarter: these principles are
present in everything we've done in this class.

2
What is Genomics? G = (MB)^n

Genomics is any (molecular) biology experiment
taken to the whole genome scale.
Ideally in a single experiment.
E.g. genome sequencing.
E.g. DNA microarray analysis of gene expression.
E.g. mass spectrometry protein mixture analyses:
quantity, phosphorylation, etc.

3
Genomics Foundation: High Throughput Technology

Automation: any human step is a bottleneck.
Multiplexing: parallelization.
Miniaturization.
Read-out: speed, sensitivity.
GMP: Q/A, reproducibility, production line
mindset.

4
(No Transcript)
5
In Genomics every question is really an
information problem

In molecular biology, experiments are small and
designed to test a specific hypothesis clearly
and directly.
In genomics, experiments are massive and not
designed for a single hypothesis.
Every biology question about genomics data
corresponds to a computer science problem: how to
find the desired pattern in a dataset.

6
Human Genome Sequencing

The experimental part (the actual sequencing) was
easy. It was the information problem that was
hard.
Assembly: the high frequency of repeats in the
human genome can fool you into joining the wrong
fragments.

7
Purely Sequence-Based Assembly
Celera believed they could assemble the whole
human genome from shotgun sequence fragments in
this way. But this approach failed. They had to
use the public domain map data to resolve
problems in their assembly.
8
Genome Annotation

Genes are what biologists really want, not just
the genome sequence.
Unfortunately, most of the 32,000 gene annotations
are based on gene prediction, not measured
experimental evidence.
It is likely that 50% of the reported genes are
wrong, either in details (individual exons,
boundaries) or entirely.
The Drosophila annotation has recently been shown
to be deeply flawed.
An information problem that is still not solved.

9
Definition of Bioinformatics

Bioinformatics is the study of the inherent
structure of biological information.
Data-driven: let the data speak for themselves.
Look for non-random patterns in the data.
Measure significance of patterns as evidence for
competing hypotheses.

10
Computational Challenges

Cluster genes by expression pattern over the
course of the cell cycle.
Identify groups of genes that are co-expressed,
co-regulated.
Identify regulatory elements common to the
promoters of these genes that cause them to be
expressed at the same time.

11
Solving the Information Problem

Modeling the problem: choosing what to include,
and how to describe it.
Relating this to known information problems.
Algorithms for solution.
Complexity: amount of time and memory the
algorithm requires.

12
Completeness Changes Everything

In molecular biology, cleverness is finding a way
to answer a question definitively by ignoring
99.99% of genes. You can't see them, so the
experiment must exclude them.
In genomics, cleverness is discovering what
becomes possible when you can see everything.
We have to switch our deepest assumptions.

13
What specifically can you learn from Everything?

E.g. protein function prediction:
genomic neighbors method
phylogenetic profiles
domain fusion (Rosetta Stone method)
Microarray gene expression analysis:
meaningful signal not in just a few genes

14
Example: Ortholog Prediction

Orthologs: two genes related by speciation events
alone. "The same gene" in two species;
typically, same function.
Paralogs: two genes related by at least one
gene-duplication and divergence event.
Homology: an ortholog or a paralog?
Experimentally very hard to answer.

15
Genomics Requires Statistical Measures of
Evidence

Evaluate competing hypotheses under
uncertainty--automatically?
based on statistical tendencies, not proofs
false positives, false negatives
the need for cross-validation
the need for experimental validation
best role: experiment interpretation and planning

16
Measures of Evidence

SNP identification from sequence data
Genome annotation: gene evidence?
Keys:
explicit, realistic likelihood models, priors
measured from tons of real data.
Explicit evaluation of alternative models.
Real posteriors, w/ measures of uncertainty.

17
Integrating Independent Evidence

Typically, a calculation works with one kind of
data; it is hard to integrate very different data.
Likelihood models provide an easy way to integrate
many different types of data: if they are really
different (independent), just multiply their
likelihoods.
Independence means factorization!
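
The "just multiply them" rule can be sketched in a few lines. This is only an illustration: the hypothesis names and the per-source likelihood values below are invented, and the sum of logs is valid only under the independence assumption stated on the slide.

```python
import math

def combined_log_likelihood(likelihoods):
    """Multiply independent likelihoods by summing their log10 values."""
    return sum(math.log10(p) for p in likelihoods)

# Hypothetical p(data_k | hypothesis) for three independent data sources.
evidence = {
    "gene": [0.8, 0.6, 0.9],
    "not_gene": [0.2, 0.5, 0.3],
}

log_scores = {h: combined_log_likelihood(ps) for h, ps in evidence.items()}

# Log10 odds in favor of "gene" after combining all three sources.
log_odds = log_scores["gene"] - log_scores["not_gene"]
print(f"log10 likelihood ratio (gene vs. not_gene): {log_odds:.3f}")
```

Working in log space avoids numerical underflow when many small likelihoods are multiplied together.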

18
Statistical Problems

Microarray analysis: hierarchical clustering?
Genome annotation: gene prediction?
COGs: no statistics at all.
Protein function prediction: distance metrics
instead of probabilistic modeling, no posteriors.

19
From Reductionism to Systems Analysis

Mol. Biol.: dissect a complex phenomenon into its
smallest pieces; characterize each.
Very hard to put the pieces back together again.
Given AB, AC: ABC?
Genomics: the cell as test-tube. Able to see
ABC (DE) working together.
Study how all the components work together as a
system. Study system behavior.

20
Cell-Cycle Regulated Genes by Whole-Genome Microarray
21
Automated Discovery of Cell Cycle Regulatory
Elements
22
(No Transcript)
23
Pathways Detected by the Domain Fusion (Rosetta
Stone) Method

[Figure, panels A-D: networks of proteins linked by domain fusion, including the Aro proteins (AroA, AroB, AroC, AroD, AroE, AroF, AroG, AroH, AroK, AroL), the Pur proteins (Pur2, Pur3, Pur5, Pur7, PurA, PurB, PurE, PurF, PurH, PurK, PurL, PurT, PurU), GuaA, GuaB and YDIB.]

Marcotte et al., Science, 285, pp. 751-753 (1999)
24
PHYLOGENETIC PROFILE METHOD
25
From Hypothesis-Driven to Data-Driven Science

Mol. Biol.: can't see 99.99% of genes, so use
black-box logic based on controls: keep
everything the same except for one small change.
Isolate a specific cause-effect.
In reality you rarely have the perfect control.
Hypothesis-driven: can only see what you look
for, i.e. a few genes, a few controls.
Interpretable: ask a YES-NO question.

26
From Hypothesis-driven to Data-driven Science

Genomics: measure all genes at once.
Don't have to assume a hypothesis as basis for
designing the experiment.
Objective: let the data speak for themselves.
Reality: vast amounts of data, very complex, hard
to interpret.
System Science or just Stupid Science?

27
Stupid Science: Data-driven Science Done Wrong

No hypothesis.
Assumptions and alternative models not explicitly
enumerated, weighed.
Statistical basis of model either neglected or
only implicit (and therefore poor).
No cross-validation: just one form of evidence.
Greedy algorithms, sensitive to noise.
Measures of significance weak or absent, both
computationally and experimentally.

28
Data-driven Science Done Right

Multiple competing hypotheses.
Alternative models explicitly included, computed,
to eliminate assumptions.
Statistical models clear, well-justified.
Multiple, independent types of evidence.
Robust algorithms w/ well demonstrated
convergence to global optimum.
Rigorous posterior probability calculated for all
possible models of the data. Priors derived from
data. False positives and false negatives measured.

29
Implications of Data-driven Science

To get strong posteriors that can distinguish
multiple models, you need LOTS of data.
Genomics is creating an unprecedented avalanche
of data, opportunities.
A change in the nature of data: lost data in
old notebooks, journals, heads vs. electronic
databases that can be queried, analyzed.
The end of (purely) human analysis.
Don't confuse observations and interpretations.

30
Bioinformatics as Prediction

Given a protein sequence, bioinformatics would
seek to predict its fold.
Given a genome sequence, bioinformatics would
seek to predict the locations and exon-intron
structures of its genes.
The ultimate test: make a blind prediction (when
no experimental data are known) and test it
experimentally.

31
A new kind of Bioinformatics

The massive experimental data produced by
genomics projects has created a demand for a
fundamentally different kind of bioinformatics,
which we can characterize (with some
exaggeration) as a mix of three principles

32
CHEAT

Don't even try to predict anything.
Just say, "Give us all your experimental data
that contain the answer to this question, and
THEN we'll tell you what we think the answer is!"
Focus is on statistically accurate measurement of
the strength of the evidence for different
interpretations of the experimental data.

33
Steal Other People's Data

The massive amount of data being produced in the
public domain is an opportunity for heavy duty
data-mining, using statistics to expose patterns
that would otherwise be missed in this huge
dataset.

34
Trust No One

What kind of data do we want?
RAW EXPERIMENTAL DATA, ideally straight from the
sequencing machines.
INTERPRETED DATA is untrustworthy.
Actually, bioinformatics PREDICTIONS are
contaminating the experimental databases!

35
Chromatographic Evidence
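[Figure: two sequencing chromatograms. One trace (labels HsS785496, zu42c08.r1) reads G G T G G T C; the other (labels HsS1065649, oz03ho7.x1) reads G G T G A T C, so the two differ at a single G/A position.]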
36
Science by Computer?

No human scientist will ever look at all these
data.
To make discoveries in these data, scientific
judgment of evidence must be formalized as a
computation.
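Computational inference about hidden states H
from observable states O, using Bayes Law:

p(H | O) = p(O | H) p(H) / Σ_H' p(O | H') p(H')

Here p(H | O) is the posterior, p(H) the prior and
p(O | H) the likelihood; the denominator is the
sum over all hidden states H'.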
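
As a rough sketch of how this judgment of evidence becomes a computation, the function below applies Bayes Law to a discrete set of hidden states. The two states and all probabilities are made-up numbers chosen for illustration, not values from the lecture.

```python
def posterior(prior, likelihood):
    """Return p(H | O) for every hidden state H.

    prior:      dict mapping H -> p(H)
    likelihood: dict mapping H -> p(O | H) for the observed data O
    """
    # Denominator: sum over all hidden states of p(O | H) * p(H).
    evidence = sum(likelihood[h] * prior[h] for h in prior)
    return {h: likelihood[h] * prior[h] / evidence for h in prior}

# Hypothetical example: is a base mismatch a real SNP or a sequencing error?
prior = {"SNP": 0.001, "sequencing_error": 0.999}
likelihood = {"SNP": 0.9, "sequencing_error": 0.01}

post = posterior(prior, likelihood)
for state, p in post.items():
    print(f"p({state} | obs) = {p:.4f}")
```

Note how a strong likelihood for the SNP state can still give a modest posterior when the prior is small, which is why priors measured from real data matter.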
37
Diversity Kills Bayes Law

Posterior probability p(M | obs) assumes all
observations came from one model.
E.g. gene prediction: predict the best gene
structure. Completely ignores the possibility of
alternative splicing.
What if there are multiple, different models in
reality? Observations will appear contradictory.
Must treat the world as a hidden mixture of models:
we don't know how many, and don't know their weights.

38
When data is Mixed Up
[Figure: scatter plot of weight vs. height for the pooled data. Annotation: No correlation?]
39
Real Results are Hidden
[Figure: weight vs. height plot separating baseball players from basketball players. Annotation: Good correlation within each group!]
40
or Completely Wrong
[Figure: weight vs. height plot for sumo wrestlers and basketball players, with the overall correlation line drawn through both groups.]
41
Mixture Evidence must be convincing!
[Figure: weight vs. height scatter plot.]
42
Mixture Evidence must be convincing!
Or we could arbitrarily split up the data any way
we like, to generate any desired (ridiculous)
conclusion!
[Figure: weight vs. height scatter plot.]
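
The effect on these slides can be reproduced with a toy calculation. The heights and weights below are invented solely to make the point, and the group labels are assumed to be known, which is exactly what a real hidden mixture does not give you.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (height cm, weight kg) samples for two hidden groups.
groups = {
    "sumo": ([170, 175, 180], [140, 145, 150]),
    "basketball": ([200, 205, 210], [90, 95, 100]),
}

# Correlation inside each group separately.
within = {name: pearson(h, w) for name, (h, w) in groups.items()}

# Correlation after pooling both groups into one dataset.
all_h = [h for hs, _ in groups.values() for h in hs]
all_w = [w for _, ws in groups.values() for w in ws]
pooled = pearson(all_h, all_w)

print("within-group:", within)
print("pooled:", pooled)
```

Within each group weight rises with height, yet the pooled line slopes the other way. The caveat from the slide still applies: a split is only meaningful if the evidence for the mixture is itself statistically convincing, since an arbitrary split can produce any conclusion.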
43
Not just one Genome! a Hidden Mixture

Generalize linear model S to partial order graph
with hidden edge probabilities r_ij.

[Figure: partial order graph over segments S1 to S11, with edge probabilities such as r67 and r69; branches are labeled as a SNP and a splice.]

Cf. gene prediction: assume only one model
possible (no alternative splicing), so treat r_ij
as binary (0 or 1).
44
Evidence Confidence
[Figure: nucleotide graph in which one branch is taken with probability r and the alternative with probability 1-r.]

Odds ratio that a feature exists: log p(r > 0 |
obs)
Does the probability of the observations drop
catastrophically when we eliminate a given model
feature (i.e. set its r = 0)?
That means there are some observations that
cannot be explained well any other way! This is
strong evidence.

45
[Figure: graph branch with edge probabilities t and 1 - t.]
p(obs | t > 0) = 10^-3
p(obs | t = 0) = 10^-7.2
Evidence confidence: LOD value of 4.2
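
The LOD value on this slide follows from the two probabilities as a base-10 log odds ratio. A minimal sketch, assuming that is how the score is defined (the function name `lod` is just a label chosen here):

```python
import math

def lod(p_obs_with_feature, p_obs_without_feature):
    """Log10 odds: how much better the data are explained with the feature."""
    return math.log10(p_obs_with_feature / p_obs_without_feature)

# Values taken from the slide: p(obs | t > 0) and p(obs | t = 0).
lod_value = lod(10 ** -3, 10 ** -7.2)
print(f"LOD = {lod_value:.1f}")
```

In words: removing the feature makes the observations about 10^4.2 times less probable.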
46
Sorting out EST complexity

Need to allow for real divergences within an
alignment, e.g. chimeras, paralogs, alt. splicing.
Detect clustering errors by dividing alignment
into groups of divergent sequences (possible
paralogs).
Apply graph theory (mathematics of branching
structures) to deal with this.
Developed new multiple sequence alignment method
to do this: Partial Order Alignment.
Two cases: with genomic sequence, or without
(harder).

47
Linear Alignment assumes NO Structural
Divergences
Cluster AA702884: C vs. T polymorphism. Novel SNP,
not previously identified.
48

Major Divergences within an Alignment: Partial
Order

Linear MSA:
simple assembly
substitutions
simple indels

Branching:
multiple domains
chimeric sequences
paralogous genes
alternative splicing or polyadenylation

Loops:
not simple indels
alternative splicing
paralogous genes
multiple domains
49
Find Optimal Traversals
Use graph theory to find the minimum number of
traversals needed to completely encode the
alignment.
Completely encoded by a single traversal
Can only be encoded by two distinct traversals
Assign each EST to the traversal that encodes it
50
Separation of Paralogous Groups
Most Unigene clusters contain mutually
inconsistent sequences (e.g. branching,
correlated substitutions suggesting paralogs,
etc.)
51
Pairwise Multiple Sequence Alignment
O(N^2) pairwise distances: find the minimum
spanning tree, then iteratively align via the
shortest edges.
[Figure: four sequences, numbered 1 to 4, connected by a spanning tree.]
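
The ordering step can be sketched as follows. This is not the actual alignment code: Hamming distance on equal-length toy strings stands in for whatever pairwise distance is really used, and Kruskal's algorithm is one standard way to build the minimum spanning tree. The tree edges, shortest first, give a plausible order in which to merge sequences.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions (assumes equal-length strings)."""
    return sum(x != y for x, y in zip(a, b))

def minimum_spanning_tree(n, edges):
    """Kruskal's algorithm; edges are (distance, i, j) tuples.

    Returns the tree edges in increasing order of distance.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for dist, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:  # edge joins two separate components, so keep it
            parent[ri] = rj
            tree.append((dist, i, j))
    return tree

# Hypothetical toy sequences.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TCGAACCA"]

# All N*(N-1)/2 pairwise distances: the O(N^2) step.
edges = [(hamming(seqs[i], seqs[j]), i, j)
         for i, j in combinations(range(len(seqs)), 2)]

mst = minimum_spanning_tree(len(seqs), edges)
for dist, i, j in mst:
    print(f"align sequence {i} with sequence {j} (distance {dist})")
```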
52
Multiple Domains
Proteins may share a domain, but differ elsewhere
One domain may be observed in many proteins. One
protein may contain many domains.
As a linear MSA: should it be stored as...
In linear MSAs, indel placement is often
arbitrary. This leads to arbitrary differences
in gap penalties charged in later alignments...
Or as...
53
PO Alignment of Multi Domain Sequences
[Figure: partial order alignment graph of four proteins, MATK (M), ABL1 (A), GRB2 (G) and CRKL (C), with shared nodes for the KINASE, SH2 and SH3 domains.]
54
Data Convergence: the Genome is the Glue

Biology has become highly specialized and
fragmented: a natural result of reductionism.
But ultimately most activities attach to a gene
or group of genes.
Because of evolution, genes connect to each other
through orthology (across species) and paralogy
(duplication events).
Discovery is connecting previously unrelated
facts about phenotypes and causes

55
Examples

Human proteins work in yeast. Use yeast to
figure out protein-protein interactions
Drosophila get high on cocaine. Use them to
study addiction, therapies.
C. elegans on Prozac...
Spiders build crazy webs on caffeine...

56
(No Transcript)
57
Human
Mouse
58
Expansion of Gene Families
59
Star Topologies: O(N^2) power in O(N) links
activities
Regulatory sites
mutants
ligands
phenotypes
expression
domains
Alternative splicing
Model organisms
genes
proteins
polymorphisms
folds
screens
mapping
motifs
Disease associations
development
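
The counting argument behind the slide title can be made concrete. The sketch assumes the hub of the star is the gene, as the "Genome is the Glue" slide suggests, and uses a handful of the data types listed above.

```python
from itertools import combinations

# A few of the data types from the slide, each linked only to the gene hub.
data_types = ["expression", "phenotypes", "domains", "polymorphisms", "ligands"]

# Stored links: one per data type, so O(N).
hub_links = [("genes", t) for t in data_types]

# Pairs of data types that can be joined through the hub: N*(N-1)/2, so O(N^2).
joinable_pairs = list(combinations(data_types, 2))

print(len(hub_links), "links give", len(joinable_pairs), "joinable pairs")
```

Each new data type costs one extra link to the hub but becomes queryable against every data type already attached.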
60
Genomics + Bioinformatics

Incredible increase in experimental data
production is making possible entirely new
analyses.
Bioinformatics required for interpretation of the
meaning of the raw data.
A lot of discovery is possible, but

61
Evidence Matters!

The genome is a very complex place.
Every possible trap for naïve bioinformatics
analysis is actually there: repeats, paralogs,
polymorphisms.
Rigorous statistical measurement of our
confidence is the only thing that will keep us
from making silly mistakes.

62
Sources of Uncertainty
Modrek & Lee, Nature Genetics (in press).
63
Bioinformatics Factors
64
Biological Interpretation Factors
