Summation: Principles of Bioinformatics - PowerPoint PPT Presentation

Provided by: christo138
Learn more at: http://voh.chem.ucla.edu

Transcript and Presenter's Notes


1
Summation: Principles of Bioinformatics
  • Review the key ingredients of the Recipe for
    Bioinformatics.
  • Use the Human Genome results as examples for
    understanding the importance of these ingredients
    in future genomics and bioinformatics problems.
  • Integrate these principles with all the specifics
    you've learned this quarter; these principles are
    present in everything we've done in this class.

2
What is Genomics? G = (MB)^n
  • Genomics is any (molecular) biology experiment
    taken to the whole genome scale.
  • Ideally in a single experiment.
  • E.g. genome sequencing.
  • E.g. DNA microarray analysis of gene expression.
  • E.g. mass spectrometry protein mixture analyses:
    quantity, phosphorylation, etc.

3
Genomics Foundation: High Throughput Technology
  • Automation: any human step is a bottleneck.
  • Multiplexing: parallelization.
  • Miniaturization.
  • Read-out: speed, sensitivity.
  • GMP: Q/A, reproducibility, production line
    mindset.

4
(No Transcript)
5
In Genomics every question is really an
information problem
  • In molecular biology, experiments are small and
    designed to test a specific hypothesis clearly
    and directly.
  • In genomics, experiments are massive and not
    designed for a single hypothesis.
  • Every biology question about genomics data
    corresponds to a computer science problem: how to
    find the desired pattern in a dataset.

6
Human Genome Sequencing
  • The experimental part (the actual sequencing) was
    easy. It was the information problem that was
    hard.
  • Assembly: the high frequency of repeats in the
    human genome can fool you into joining the wrong
    fragments.

7
Purely Sequence-Based Assembly
Celera believed they could assemble the whole
human genome from shotgun sequence fragments in
this way. But this approach failed. They had to
use the public domain map data to resolve
problems in their assembly.
8
Genome Annotation
  • Genes are what biologists really want, not just
    the genome sequence.
  • Unfortunately, most of the annotation of the
    32,000 genes is based on gene prediction, not
    measured experimental evidence.
  • It is likely that 50% of the reported genes are
    wrong in details (individual exons, boundaries)
    or entirely.
  • The Drosophila annotation has recently been shown
    to be deeply flawed.
  • An information problem that is still not solved.

9
Definition of Bioinformatics
  • Bioinformatics is the study of the inherent
    structure of biological information.
  • Data-driven: let the data speak for themselves.
  • Non-random patterns in the data.
  • Measure significance of patterns as evidence for
    competing hypotheses.

10
Computational Challenges
  • Cluster genes by expression pattern over the
    course of the cell cycle.
  • Identify groups of genes that are co-expressed,
    co-regulated.
  • Identify regulatory elements common to the
    promoters of these genes that cause them to be
    expressed at the same time.
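
One way to picture the first two challenges is to correlate expression profiles and group genes whose profiles rise and fall together. The sketch below is only an illustration under stated assumptions: the gene names, expression values, the 0.9 threshold, and the single-linkage grouping rule are all hypothetical and are not taken from the slides.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpressed_groups(profiles, threshold=0.9):
    """Single-linkage grouping: a gene joins every group that contains
    at least one member correlated with it at or above the threshold."""
    groups = []
    for gene, profile in profiles.items():
        linked = [g for g in groups
                  if any(pearson(profile, profiles[m]) >= threshold for m in g)]
        merged = {gene}
        for g in linked:
            merged |= g
            groups.remove(g)
        groups.append(merged)
    return groups

# Hypothetical expression levels at six time points of the cell cycle.
profiles = {
    "geneA": [1.0, 2.0, 4.0, 2.0, 1.0, 0.5],
    "geneB": [1.1, 2.1, 3.9, 2.2, 0.9, 0.6],
    "geneC": [4.0, 2.0, 1.0, 2.0, 4.0, 5.0],
    "geneD": [3.8, 2.1, 1.2, 1.9, 4.1, 4.9],
}
groups = coexpressed_groups(profiles)
print(groups)
```

With these made-up values, geneA and geneB peak together and geneC and geneD peak together, so two groups result. A fixed threshold gives no measure of significance, which is the weakness the later slides on statistical evidence point at.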

11
Solving the Information Problem
  • Modeling the problem: choosing what to include,
    and how to describe it.
  • Relating this to known information problems.
  • Algorithms for solution.
  • Complexity: amount of time and memory the algorithm
    requires.

12
Completeness Changes Everything
  • In molecular biology, cleverness is finding a way
    to answer a question definitively by ignoring
    99.99% of genes. You can't see them, so the
    experiment must exclude them.
  • In genomics, cleverness is discovering what
    becomes possible when you can see
    everything.
  • Have to switch our deepest assumptions.

13
What specifically can you learn from Everything?
  • E.g. Protein function prediction
  • genomic neighbors method
  • phylogenetic profiles
  • domain fusion (Rosetta Stone method)
  • Microarray gene expression analysis
  • meaningful signal not in just a few genes

14
Example: Ortholog Prediction
  • Orthologs: two genes related by speciation events
    alone. "The same gene" in two species;
    typically, same function.
  • Paralogs: two genes related by at least one
    gene-duplication and divergence event.
  • Homology: an ortholog or a paralog?
  • Experimentally very hard to answer.

15
Genomics Requires Statistical Measures of
Evidence
  • Evaluate competing hypotheses under
    uncertainty--automatically?
  • based on statistical tendencies, not proofs
  • false positives, false negatives
  • the need for cross-validation
  • the need for experimental validation
  • best role: experiment interpretation and planning

16
Measures of Evidence
  • SNP identification from sequence data
  • Genome annotation: gene evidence?
  • Keys:
  • Explicit, realistic likelihood models; priors
    measured from tons of real data.
  • Explicit evaluation of alternative models.
  • Real posteriors, w/ measures of uncertainty.

17
Integrating Independent Evidence
  • Typically, a calculation works with one kind of
    data; it is hard to integrate very different data.
  • Likelihood models provide an easy way to integrate
    many different types of data: if they are really
    different (independent), just multiply them.
  • Independence means factorization!
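
The "just multiply them" point can be sketched as a product of likelihood ratios, or equivalently a sum of logs. This is a minimal illustration, assuming the evidence sources are truly independent; the evidence names and all probabilities below are hypothetical numbers chosen for the example.

```python
from math import log10

def combined_log_odds(evidence):
    """Sum the log10 likelihood ratios of several evidence types.
    Summing logs equals multiplying likelihoods, which is valid only
    if the evidence types are independent given each hypothesis."""
    return sum(log10(p_gene / p_no_gene)
               for p_gene, p_no_gene in evidence.values())

# Hypothetical pairs: (p(data | gene), p(data | no gene)) per data type.
evidence = {
    "est_alignment": (0.8, 0.1),
    "codon_bias": (0.6, 0.3),
    "cross_species_conservation": (0.5, 0.05),
}
total = combined_log_odds(evidence)
print(round(total, 2))
```

Here the three ratios are 8, 2 and 10, so together they favor the gene hypothesis by a factor of 160. If two data types were in fact correlated, multiplying them would count the same evidence twice.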

18
Statistical Problems
  • Microarray analysis: hierarchical clustering?
  • Genome annotation: gene prediction?
  • COGs: no statistics at all.
  • Protein function prediction: distance metrics
    instead of probabilistic modeling, no posteriors.

19
From Reductionism to Systems Analysis
  • Mol. Biol.: dissect a complex phenomenon into its
    smallest pieces; characterize each.
  • Very hard to put the pieces back together again.
    Given AB and AC, what is ABC?
  • Genomics: the cell as test-tube. Able to see
    ABC (DE) working together.
  • Study how all the components work together as a
    system. Study system behavior.

20
Cell-Cycle Regulated Genes by Whole-Genome Microarray
21
Automated Discovery of Cell Cycle Regulatory
Elements
22
(No Transcript)
23
Pathways Detected by the Domain Fusion (Rosetta
stone) Method
[Figure: pathway diagrams, panels A to D, linking proteins through domain fusions: AroA, AroB, AroC, AroD, AroE, AroF, AroG, AroH, AroK, AroL, YDIB, PurA, PurB, PurE, PurF, PurH, PurK, PurL, PurT, PurU, Pur2, Pur3, Pur5, Pur7, GuaA, GuaB.]

Marcotte et al., Science, 285, pp. 751-753 (1999)
24
PHYLOGENETIC PROFILE METHOD
25
From Hypothesis-Driven to Data-Driven Science
  • Mol. Biol.: can't see 99.99% of genes, so use
    black-box logic based on controls: keep
    everything the same except for one small change.
    Isolate a specific cause-effect.
  • In reality you rarely have the perfect control.
  • Hypothesis-driven: can only see what you look
    for: a few genes, a few controls.
  • Interpretable: ask a YES-NO question.

26
From Hypothesis-driven to Data-driven Science
  • Genomics: measure all genes at once.
  • Don't have to assume a hypothesis as basis for
    designing the experiment.
  • Objective: let the data speak for themselves.
  • Reality: vast amounts of data, very complex, hard
    to interpret.
  • System Science or just Stupid Science?

27
Stupid Science: Data-driven Science Done Wrong
  • No hypothesis.
  • Assumptions and alternative models not explicitly
    enumerated, weighed.
  • Statistical basis of model either neglected or
    only implicit (and therefore poor).
  • No cross-validation: just one form of evidence.
  • Greedy algorithms, sensitive to noise.
  • Measures of significance weak or absent, both
    computationally and experimentally.

28
Data-driven Science Done Right
  • Multiple competing hypotheses.
  • Alternative models explicitly included, computed,
    to eliminate assumptions.
  • Statistical models clear, well-justified.
  • Multiple, independent types of evidence.
  • Robust algorithms w/ well demonstrated
    convergence to global optimum.
  • Rigorous posterior probability calculated for all
    possible models of the data. Priors derived from
    data. False positives and false negatives measured.

29
Implications of Data-driven Science
  • To get strong posteriors that can distinguish
    multiple models, you need LOTS of data.
  • Genomics is creating an unprecedented avalanche
    of data, opportunities.
  • A change in the nature of data: lost data in
    old notebooks, journals, heads vs. electronic
    databases that can be queried, analyzed.
  • The end of (purely) human analysis.
  • Don't confuse observations with interpretations.

30
Bioinformatics as Prediction
  • Given a protein sequence, bioinformatics would
    seek to predict its fold.
  • Given a genome sequence, bioinformatics would
    seek to predict the locations and exon-intron
    structures of its genes.
  • The ultimate test: make a blind prediction (when
    no experimental data are known) and test it
    experimentally.

31
A new kind of Bioinformatics
  • The massive experimental data produced by
    genomics projects has created a demand for a
    fundamentally different kind of bioinformatics,
    which we can characterize (with some
    exaggeration) as a mix of three principles:

32
CHEAT
  • Don't even try to predict anything.
  • Just say, "Give us all your experimental data
    that contain the answer to this question, and
    THEN we'll tell you what we think the answer is!"
  • Focus is on statistically accurate measurement of
    the strength of the evidence for different
    interpretations of the experimental data.

33
Steal Other People's Data
  • The massive amount of data being produced in the
    public domain is an opportunity for heavy duty
    data-mining, using statistics to expose patterns
    that would otherwise be missed in this huge
    dataset.

34
Trust No One
  • What kind of data do we want?
  • RAW EXPERIMENTAL DATA, ideally straight from the
    sequencing machines.
  • INTERPRETED DATA is untrustworthy.
  • Actually, bioinformatics PREDICTIONS are
    contaminating the experimental databases!

35
Chromatographic Evidence
[Figure: sequencing chromatograms for two ESTs, zu42c08.r1 (HsS785496) and oz03ho7.x1 (HsS1065649). The reads GGTGGTC and GGTGATC differ at a single base, G vs. A.]
36
Science by Computer?
  • No human scientist will ever look at all these
    data.
  • To make discoveries in these data, scientific
    judgment of evidence must be formalized as a
    computation.
  • Computational Inference about hidden states H
    from observable states O

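Bayes' Law:
$$p(H \mid O) = \frac{p(O \mid H)\, p(H)}{\sum_{H'} p(O \mid H')\, p(H')}$$
where $p(H \mid O)$ is the posterior, $p(O \mid H)$ the likelihood, $p(H)$ the prior, and the denominator is the sum over all hidden states.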
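
The inference step named on this slide (posterior from prior and likelihood, normalized by a sum over all hidden states) can be sketched in a few lines. The hidden states and every probability below are hypothetical values picked for illustration, not measurements from the lecture.

```python
def posterior(prior, likelihood):
    """Return p(H | O) for every hidden state H, given the prior p(H)
    and the likelihood p(O | H) of the observation under each state."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())  # sum over all hidden states
    return {h: j / evidence for h, j in joint.items()}

# Hypothetical question: is an observed base mismatch a real SNP
# or a sequencing error?
prior = {"snp": 0.001, "sequencing_error": 0.999}
likelihood = {"snp": 0.9, "sequencing_error": 0.01}
post = posterior(prior, likelihood)
print(post)
```

With these numbers the mismatch is still most likely a sequencing error, because the low prior for a SNP outweighs its higher likelihood. More independent reads showing the same base would shift the posterior.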
37
Diversity Kills Bayes' Law
  • Posterior probability $p(M \mid \text{obs})$ assumes all
    observations came from one model.
  • E.g. gene prediction: predict the best gene
    structure. Completely ignores the possibility of
    alternative splicing.
  • What if there are multiple, different models in
    reality? Observations will appear contradictory.
  • Must treat the world as a hidden mixture of models:
    don't know how many, don't know their weights.

38
When data is Mixed Up
[Plot: weight vs. height. No correlation?]
39
Real Results are Hidden
[Plot: weight vs. height for baseball players and basketball players. Good correlation within each group!]
40
or Completely Wrong
[Plot: weight vs. height for sumo wrestlers and basketball players, with the overall correlation line.]
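
The effect in these plots can be reproduced numerically: a correlation computed over a mixture of groups can disagree with the correlation inside each group. The heights and weights below are synthetic values invented for this sketch, not real measurements.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic (height cm, weight kg) pairs for two hidden groups.
sumo = [(170, 140), (175, 145), (180, 150), (185, 155)]
basketball = [(195, 90), (200, 95), (205, 100), (210, 105)]

def corr(points):
    heights, weights = zip(*points)
    return pearson(heights, weights)

r_sumo = corr(sumo)
r_basketball = corr(basketball)
r_pooled = corr(sumo + basketball)  # ignores the hidden group labels
print(r_sumo, r_basketball, r_pooled)
```

Within each synthetic group, weight rises with height, yet the pooled correlation is negative because the taller group is lighter overall. The pooled line describes neither group.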
41
Mixture Evidence must be convincing!
[Plot: weight vs. height.]
42
Mixture Evidence must be convincing!
Or we could arbitrarily split up the data any way
we like, to generate any desired (ridiculous)
conclusion!
[Plot: weight vs. height.]
43
Not just one Genome! A Hidden Mixture
  • Generalize linear model S to a partial order graph
    with hidden edge probabilities $r_{ij}$.

[Figure: partial order graph over segments S1 to S11, with edge probabilities such as r67 and r69; branches are labeled SNP and splice.]
Cf. gene prediction: it assumes only one model is
possible (no alternative splicing), so it treats $r_{ij}$
as binary (0 or 1).
44
Evidence Confidence
[Figure: branching alignment of the bases C A G G T C T A G G C G, with edge probabilities r and 1-r on the two branches.]
  • Odds ratio that a feature exists: $\log p(r > 0 \mid \text{obs})$
  • Does the probability of the observations drop
    catastrophically when we eliminate a given model
    feature (i.e. set its $r = 0$)?
  • That means there are some observations that
    cannot be explained well any other way! This is
    strong evidence.

45
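[Figure: two alternative branches with edge probabilities t and 1 - t.]
$p(\text{obs} \mid t > 0) = 10^{-3}$
$p(\text{obs} \mid t = 0) = 10^{-7.2}$
Evidence Confidence: LOD value of 4.2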
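
The LOD value on this slide follows from the two probabilities, read here as $p(\text{obs} \mid t > 0) = 10^{-3}$ and $p(\text{obs} \mid t = 0) = 10^{-7.2}$: it is the base-10 log of their ratio. A short check, where the function name `lod_score` is just a label chosen for this sketch:

```python
from math import log10

def lod_score(p_obs_with_feature, p_obs_without_feature):
    """Log10 odds: how much better the observations are explained
    when the feature is allowed than when it is removed."""
    return log10(p_obs_with_feature / p_obs_without_feature)

lod = lod_score(10 ** -3, 10 ** -7.2)
print(round(lod, 1))
```

The observations are about 10^4.2 times more probable with the feature than without it, which is the "catastrophic drop" the previous slide describes.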
46
Sorting out EST complexity
  • Need to allow for real divergences within an
    alignment, e.g. chimeras, paralogs, alt. splicing.
  • Detect clustering errors by dividing alignment
    into groups of divergent sequences (possible
    paralogs).
  • Apply graph theory (mathematics of branching
    structures) to deal with this.
  • Developed a new multiple sequence alignment method
    to do this: Partial Order Alignment.
  • Two cases: with genomic sequence, or without
    (harder).

47
Linear Alignment assumes NO Structural
Divergences
Cluster AA702884: C vs. T polymorphism. Novel SNP,
not previously identified.
48
Major Divergences within an Alignment: Partial
Order

Linear MSA
  • simple assembly
  • substitutions
  • simple indels

Branching
  • multiple domains
  • chimeric sequences
  • paralogous genes
  • alternative splicing or polyadenylation

Loops
  • not simple indels
  • alternative splicing
  • paralogous genes
  • multiple domains
49
Find Optimal Traversals
Use graph theory to find the minimum number of
traversals needed to completely encode the
alignment.
Completely encoded by a single traversal
Can only be encoded by two distinct traversals
Assign each EST to the traversal that encodes it
50
Separation of Paralogous Groups
Most Unigene clusters contain mutually
inconsistent sequences (e.g. branching,
correlated substitutions suggesting paralogs,
etc.)
51
Pairwise Multiple Sequence Alignment
Compute O(N^2) pairwise distances; find the minimum spanning
tree; iteratively align via shortest edges.
[Figure: tree with elements labeled 1 to 4.]
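
The ordering step described on this slide (all pairwise distances, then a minimum spanning tree, then alignment along the shortest edges first) can be sketched with Kruskal's algorithm. Kruskal's algorithm is a choice made for this illustration; the slides do not say which spanning-tree method was used. The sequence names and distances are hypothetical, and the alignment itself is not implemented.

```python
def mst_alignment_order(names, dist):
    """Return minimum spanning tree edges, shortest first, using
    Kruskal's algorithm with a small union-find structure."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the root, compressing the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # All O(N^2) pairs, sorted by distance.
    edges = sorted(
        (dist[a][b], a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
    order = []
    for d, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # edge joins two separate groups
            parent[root_a] = root_b
            order.append((a, b, d))
    return order

# Hypothetical pairwise distances (upper triangle only).
names = ["seq1", "seq2", "seq3", "seq4"]
dist = {
    "seq1": {"seq2": 0.1, "seq3": 0.6, "seq4": 0.7},
    "seq2": {"seq3": 0.5, "seq4": 0.8},
    "seq3": {"seq4": 0.2},
}
order = mst_alignment_order(names, dist)
print(order)
```

With these distances the closest pairs (seq1 with seq2, then seq3 with seq4) would be aligned first, and the two groups joined last through the seq2 to seq3 edge.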
52
Multiple Domains
Proteins may share a domain, but differ elsewhere.
One domain may be observed in many proteins. One
protein may contain many domains.
As a linear MSA: should it be stored as...
In linear MSAs, indel placement is often
arbitrary. This leads to arbitrary differences
in gap penalties charged in later alignments...
Or as...
53
PO Alignment of Multi Domain Sequences
[Figure: partial order alignment graph of MATK (M), ABL1 (A), GRB2 (G), and CRKL (C), whose paths pass through shared SH3, SH2, and KINASE domain nodes.]
54
Data Convergence: the Genome is the Glue
  • Biology has become highly specialized,
    fragmented: a natural result of reductionism.
  • But ultimately most activities attach to a gene
    or group of genes.
  • Because of evolution, genes connect to each other
    through orthology (across species) and paralogy
    (duplication events).
  • Discovery is connecting previously unrelated
    facts about phenotypes and causes.

55
Examples
  • Human proteins work in yeast. Use yeast to
    figure out protein-protein interactions
  • Drosophila get high on cocaine. Use them to
    study addiction, therapies.
  • C. elegans on Prozac...
  • Spiders build crazy webs on caffeine...

56
(No Transcript)
57
[Figure: Human vs. Mouse.]
58
Expansion of Gene Families
59
Star Topologies: O(N^2) power in O(N) links
[Figure: star diagram linking data types: activities, regulatory sites, mutants, ligands, phenotypes, expression, domains, alternative splicing, model organisms, genes, proteins, polymorphisms, folds, screens, mapping, motifs, disease associations, development.]
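
The counting argument in this slide's title can be sketched as follows. The sketch assumes the gene is the hub of the star, which matches the earlier "the Genome is the Glue" slide but is not stated on this one. The table contents and the gene name `geneX` are invented for illustration.

```python
from itertools import combinations

# Hypothetical annotation tables, each keyed only by gene (the hub).
links = {
    "expression": {"geneX": "up in S phase"},
    "phenotypes": {"geneX": "slow growth"},
    "domains": {"geneX": "SH3"},
    "disease associations": {"geneX": "hypothetical disorder"},
}

def connect_through_gene(links, type_a, type_b):
    """Join two data types that were never linked directly,
    using the genes they have in common."""
    shared = links[type_a].keys() & links[type_b].keys()
    return {g: (links[type_a][g], links[type_b][g]) for g in shared}

n_links = len(links)                          # N data types, N links to the hub
n_pairs = len(list(combinations(links, 2)))   # N(N-1)/2 connectable pairs
print(n_links, n_pairs)
print(connect_through_gene(links, "domains", "phenotypes"))
```

Four links to the hub make six pairwise connections available, and the gap widens quadratically as more data types are attached to genes.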
60
Genomics Bioinformatics
  • An incredible increase in experimental data
    production is making possible entirely new
    analyses.
  • Bioinformatics is required for interpretation of the
    meaning of the raw data.
  • A lot of discovery is possible, but...

61
Evidence Matters!
  • The genome is a very complex place.
  • Every possible trap for naïve bioinformatics
    analysis is actually there: repeats, paralogs,
    polymorphisms.
  • Rigorous statistical measurement of our
    confidence is the only thing that will keep us
    from making silly mistakes.

62
Sources of Uncertainty
Modrek & Lee, Nature Genetics (in press).
63
Bioinformatics Factors
64
Biological Interpretation Factors