Title: What is functional genomics
1What is functional genomics?
2Functional genomics is the determination of
genome function by whatever means necessary.
This area is described as 'the development and
application of global (genome-wide or
system-wide) experimental approaches to assess
gene function by making use of the information
and reagents provided by genome sequencing and
mapping' (Hieter and Boguski, Science,
1997).
3To some it is only the use of microarray
technology.
4To others it is the integration of an immense
amount of heterogeneous data to understand
the complexity of life.
5To others it is a strategy of using wet-lab and in
silico tools to address specific questions in a
given organism.
6Bioinformatics and Functional Genomics
Jonathan Pevsner
Functional Genomics
Chapter 6 Gene Expression
Chapter 7 Microarray Data Analysis
Chapter 8 Protein Analysis and Proteomics
Chapter 9 Protein Structure
Chapter 10 Multiple Sequence Alignment
Chapter 11 Molecular Phylogeny and Evolution
7What is functional genomics?
One component is the prediction of protein
function, interaction and structure from
analyzing primary sequence information.
8These studies in functional genomics reveal
duplication, divergence, rearrangement and
structure prediction all based on primary
sequence analysis.
McClure, M.A. (2001) "Evolution of the DUT Gene:
Horizontal Transfer between Host and Pathogen in
all Three Domains of Life," Current Protein and
Peptide Science, 2, 313-324.
Baldo, A.M. and McClure, M.A. (1999) "Evolutionary
History of the dUTPase Enzyme: Horizontal Exchange
between Viral Pathogens and Their Hosts," Journal
of Virology, 73, 7710-7721.
9dUTPase
[Reaction diagram: dUTP → dUMP + 2 Pi]
- What is dUTPase?
- Deoxyuridine triphosphatase converts dUTP into
dUMP.
- Why is dUTPase important?
- Excessive amounts of dUTP cause the incorporation
of uracil into growing DNA chains.
- All organisms examined to date possess a
uracil-DNA glycosylase (UNG) to excise uracil
misincorporated into DNA.
- In the presence of a large dUTP pool, UNG repair
causes DNA strand breakage, which can lead to
cell death.
11Questions regarding dUTPase
- What is the relationship and distribution of
dUTPase in viruses, eukaryotes, eubacteria, and
archaea?
- Is the genome location of dUTPase conserved?
- What kinds of motif arrangements exist?
- How might these arrangements assemble?
12Three observed types of Motif Arrangements
Single gene copy: one ordered series of motifs, five
conserved motifs
I II III IV V
(amino to carboxyl)
Tandem duplication: two ordered series of motifs,
five conserved motifs
I II ? IV V
? ? III ? -
(regions labeled amino, middle, carboxyl)
Tandem triplication: three ordered series of motifs,
15 conserved motifs
I II III IV V
I II III IV V
I II III IV V
13Representative Sample of dUTPase HMM Alignment
Legend: Eukaryotes, Eubacteria, Archaea,
Retroviruses, Herpesviruses, Other DNA Viruses
14[Figure: gene maps showing the position of dUTPase
(dUTP) relative to neighboring genes.
Retroviruses (gag and pol genes): HTLV-like viruses,
RSV, other retroviruses, MMTV-like viruses,
non-primate lentiviruses, HIV. Gene labels: PR, RT,
RH, IN.
Herpesviruses (alpha, gamma). Gene labels: RRRP,
PMASE.
Poxviruses (Vaccinia, Suid Pox). Gene labels: TIF,
RR, UH.
Bacteria (E. coli, P. aeruginosa, C. burnetii).
Gene labels: DSF, PPMM, PIB.
Unlabeled neighbors are marked "?".]
15Additive Distance Tree for 90 dUTPase sequences
[Figure: additive distance tree with numeric branch
labels ranging from 59 to 100.
Legend: Eukaryotes, Eubacteria, Archaea,
Retroviruses, Herpesviruses, Other DNA Viruses.
Taxon labels shown: HSAP, ORFV, VARI, CPOX, VACW,
VACL, SPOX, AVAD, CELA, CELM, CELC, LESC, CALB,
SCER, PBCV, BPT5, CTRA, CDIF, MTUB, BPRT, BPSP,
MJAN, SIRV, DAMB, IHER, SHER, MIAP, JSRV, SARV,
HER1, MMTC, ASFV, ONPH, HH8A, AH1A, EH4A, SH1A.]
17Assembly and Active Site location in H. sapiens
dUTPase
Mol, C. D. et al. (1996) Structure 4(9),
1077-1092.
18Hypothesized dUTPase folding and assembly
H. sapiens (single)
Herpes (double)
C. elegans (triple)
19Summary
- dUTPase is found in:
- 8 virus families (Retroviruses and DNA viruses)
- Eukaryotes (multicellular and unicellular)
- Eubacteria (Proteobacteria, Firmicutes,
Chlamydiales, Spirochaetales)
- Archaea
- Three kinds of motif arrangements exist
(Single, Double, and Triple)
- It is hypothesized these arrangements assemble:
- α, γ Herpesvirus dUTPases (Double) as trimers
- C. elegans dUTPases (Triple) as monomers.
- The genome location of dUTPase is not conserved
in viruses and eubacteria.
- 6 virus groups have dUTPases similar to those of
hosts.
20What is functional genomics?
Basically it is the prediction of protein
function and interaction from analyzing sequence
information. How can we know anything about
protein interactions without protein structure?
21New work
A Functional Genomics Approach to Inferring Amino
Acid Contacts Among the L, P and N proteins of
the Replication/Transcription Complex of the
Order Mononegavirales
1) Protein disorder
- Low hydrophobicity and high mean net charge are
good indicators of natively unfolded proteins
- Predictor of Naturally Disordered Regions
(PONDR) utilizes neural networks to distinguish
disordered from ordered regions
2) Evolutionary Dynamic Approaches
A) Intermolecular compensatory mutations (Pazos and
Valencia)
1) predicting interacting partners
2) detecting correlated mutations between two
interacting proteins
3) extending to three interacting partners
B) Evolutionary-Structure-Function (ESF) (Simon
and Sidow): determines numbers of amino acid
replacements given a fixed phylogenetic topology,
ranking constrained regions
C) Intramolecular compensatory mutations (Pollock):
calculates likelihood estimates, allowing for rate
variation, and robustly discriminates coevolution
of intramolecular sites from random effects
3) Use experimental results to model and validate
expectations
4) Test the predicted structure for the Ebola virus
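The first disorder indicator on this slide (low hydrophobicity, high mean net charge) can be sketched in a few lines of Python. This is only an illustration, not the PONDR neural network. It assumes the Kyte-Doolittle hydropathy scale rescaled to the range 0 to 1, and the linear charge-hydropathy boundary R = 2.785 H - 1.151 commonly attributed to Uversky and co-workers. Neither the scale nor the boundary appears on the slides, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale, not named on the slides).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean hydropathy <H>, with each residue value rescaled to [0, 1]."""
    values = [(KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq]
    return sum(values) / len(values)


def mean_net_charge(seq):
    """Absolute mean net charge <R>: |#(K, R) - #(D, E)| / length."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return abs(positive - negative) / len(seq)


def looks_natively_unfolded(seq):
    """True if the sequence lies above the assumed boundary line
    <R> = 2.785 * <H> - 1.151 in charge-hydropathy space."""
    h = mean_hydropathy(seq)
    r = mean_net_charge(seq)
    return r > 2.785 * h - 1.151


# Toy sequences: one highly charged, one highly hydrophobic.
charged = "EEEEKEEEDD"
hydrophobic = "LLLLIIIIVV"
flags = (looks_natively_unfolded(charged), looks_natively_unfolded(hydrophobic))
```

In practice such a score would be computed per protein or per sliding window and treated as one piece of evidence alongside the PONDR predictions.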
22VSV Transcription
[Figure: VSV transcription (leader, N, read-through,
5' and 3' ends) and VSV replication, showing the L,
P and N proteins and their co-assembly.]
23Rhabdoviridae Genome: VSV
Paramyxoviridae Genome: Sendai
24Heterogeneous Data to Infer Protein-Protein
Contacts
[Flowchart]
Inputs:
- Multiple alignment of N, P and L sequences
- ALL experimental information regarding positions
of functions and interactions of L, N and P
Evolutionary Dynamics Analysis:
- Phylogenetic reconstruction
- Inter-CM analysis
- ESF analysis
- Intra-CM analysis
Predict regions of disorder:
- PONDR
- Calculate H/R
Integration of Heterogeneous Data Sources in a
Bayesian Framework
Most Probable Amino Acid Contact Points
25N, P and L Proteins
required for replication
[Figure: domain maps for Sendai virus and VSV.
N protein: Sendai 1-524, VSV 1-422. Labeled regions
include RNA-BS, PPBS, PCS.
P protein: Sendai 1-568, VSV 1-265. Labeled regions
include oligomerization domain, NPBS, LPBS, RSR,
RES, GTP binding.
L protein: Sendai 1-2228, VSV 1-2109. Conserved
domains I-VI. Labeled regions include RSR, PPBS,
MT, RNA-BS.]
26MTase of Ebola virus
27VSV Transcription
[Figure: VSV transcription (leader, N, read-through,
5' and 3' ends) and VSV replication, showing the L,
P and N proteins and their co-assembly.]
28Can data like these help infer L and P regions of
potential contact?
29Can data like these help infer N and P regions of
potential contact?
30Analysis of Evolutionary Dynamics
1) Evolutionary-Structure-Function (ESF) analysis
provides likelihood estimates of rates of change
within a protein. The approach maintains the
phylogenetic tree topology while calculating the
number of amino acid replacements in a fixed-size
window moved across the entire multiple sequence
alignment. Plotting these values as a function of
position provides a rate profile for the protein.
A heuristic algorithm identifies and ranks the
evolutionarily constrained regions of the
sequence, thereby providing a relative measure of
the importance of each region. This approach has
been demonstrated to be consistent with known
experimental data for a number of proteins (Simon,
A. L., Stone, E. A. and Sidow, A. (2002). Inference
of functional regions in proteins by quantification
of evolutionary constraints. Proc Natl Acad Sci U S
A 99 (5). 2912-2917.)
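The windowed rate profile described above can be illustrated with a short Python sketch. It is a deliberate simplification and not the published ESF method: it assumes the number of amino acid replacements per alignment column has already been estimated on the fixed tree (the input list below is made up), and it replaces the heuristic ranking with a plain sort from slowest to fastest window. Function names are hypothetical.

```python
def rate_profile(replacements, window):
    """Mean number of replacements per site in each fixed-size window.

    `replacements[i]` is the (assumed, precomputed) number of amino acid
    replacements inferred for alignment column i on a fixed tree topology.
    """
    n = len(replacements)
    if window < 1 or window > n:
        raise ValueError("window must be between 1 and the alignment length")
    total = sum(replacements[:window])
    profile = [total / window]
    # Slide the window one column at a time, updating the running total.
    for i in range(window, n):
        total += replacements[i] - replacements[i - window]
        profile.append(total / window)
    return profile


def rank_constrained_windows(profile, top=3):
    """Window start positions ordered from most to least constrained
    (lowest rate first); ties are broken by position."""
    order = sorted(range(len(profile)), key=lambda i: (profile[i], i))
    return order[:top]


# Made-up replacement counts for a 10-column alignment.
counts = [5, 4, 0, 0, 1, 6, 7, 0, 0, 0]
profile = rate_profile(counts, window=3)
most_constrained = rank_constrained_windows(profile, top=3)
```

Plotting `profile` against window position gives the kind of rate profile the slide refers to; the low points are the candidate constrained regions.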
2) Predicting intramolecular compensatory
mutations also uses likelihood estimates, and
does not rely on the accuracy of inferring
ancestral nodes in phylogenetic reconstruction.
This approach also allows for variation in the
rate of evolution along tree branches. This
method can robustly discriminate coevolution
of intramolecular sites from correlations due to
random effects (Pollock, D. D., Taylor, W. R. and
Goldman, N. (1999). Coevolving protein residues:
maximum likelihood identification and
relationship to structure. Journal of Molecular
Biology 287 (1). 187-198.)
31What about predicting intermolecular compensatory
mutations?
Two different methods exist to predict
protein-protein interactions from intermolecular
compensatory mutation analysis. The first
predicts the interacting partners (Pazos, F.,
Helmer-Citterich, M., Ausiello, G. and Valencia,
A. (1997). Correlated mutations contain
information about protein-protein interaction.
J. Mol. Biol. 271 511-523), while the other
determines the actual interaction sites via
compensatory mutation analysis (Pazos et al.,
1997; Ouzounis, C., Perez-Irratxeta, C., Sander,
C. and Valencia, A. (1998). Are binding residues
conserved? Pac Symp Biocomput 401-412.)
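A minimal Python sketch of the correlated-mutation idea follows. It is an illustration under stated assumptions, not the published implementation: both alignments are assumed to list the same organisms in the same order, residue similarity is reduced to identity (1) versus difference (0) where published methods use a residue similarity matrix, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def column_similarity(column):
    """Similarity (1.0 identical, 0.0 different) for every pair of
    sequences at one alignment position."""
    return [1.0 if a == b else 0.0 for a, b in combinations(column, 2)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def intermolecular_correlations(aln_a, aln_b):
    """Correlation score for every (position in A, position in B) pair.

    aln_a and aln_b are lists of aligned sequences; row k of each must
    come from the same organism.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("alignments must contain the same organisms")
    sims_a = [column_similarity(col) for col in zip(*aln_a)]
    sims_b = [column_similarity(col) for col in zip(*aln_b)]
    return {
        (i, j): pearson(sa, sb)
        for i, sa in enumerate(sims_a)
        for j, sb in enumerate(sims_b)
    }


# Toy alignments of two proteins from four organisms.
protein_a = ["AK", "AK", "GE", "GE"]
protein_b = ["DL", "DV", "RL", "RV"]
scores = intermolecular_correlations(protein_a, protein_b)
```

In the toy data, position 0 of protein A changes in step with position 0 of protein B, so that pair scores 1.0, while position 1 of protein B varies independently and scores lower. Pairs with high scores are the candidate contact sites.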
How can we use these methods?
32What is a Bayesian Inference Network?
What is Bayesian Inference?
Bayesian inference dates back to the 18th century
(Bayes' essay was published in 1763).
Bayesian inference has always been controversial.
33Bayesian inference is a different way of thinking
about probability. Bayesian inference is a
subjective interpretation of probability. When
the probability of an occurrence is unknown, an
opinion can be expressed about what is unknown as
a prior probability. What is a prior
probability? It is the probability distribution
that expresses the degree of belief an observer
holds before seeing any data. After
observing data, one can alter that opinion
about the values assigned in the prior
probability. This new probability distribution,
called the posterior distribution, is
calculated by Bayes' rule. All of the
observer's knowledge, from both the prior and the
data, is contained in the posterior distribution,
and statistical inferences are made by summarizing
this distribution. Bayes' rule turns prior
probabilities into posterior probabilities.
Posterior probabilities incorporate the observed
data.
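The prior-to-posterior update can be made concrete with a small Python sketch. The scenario is invented for illustration: three candidate values for an unknown probability, a uniform prior over them, and an observation of 4 successes in 5 trials. None of these numbers come from the slides, and the function names are hypothetical.

```python
from math import comb


def binomial_likelihood(theta, successes, trials):
    """P(data | theta) for `successes` out of `trials` independent trials."""
    return comb(trials, successes) * theta**successes * (1 - theta) ** (trials - successes)


def bayes_update(prior, likelihoods):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("the data are impossible under every hypothesis")
    return [u / total for u in unnormalized]


# Hypothetical candidate values for the unknown probability.
thetas = [0.2, 0.5, 0.8]
# Prior: no value is favored before seeing data.
prior = [1 / 3, 1 / 3, 1 / 3]
# Data: 4 successes in 5 trials.
likelihoods = [binomial_likelihood(t, 4, 5) for t in thetas]
posterior = bayes_update(prior, likelihoods)
```

After the update, most of the belief has moved to the candidate value 0.8, which is the one most compatible with the data. A different prior would give a different posterior, and that dependence is the source of the controversy discussed next.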
So what is so controversial about Bayesian
inference?
34There is no agreement on how much weight
should be placed on beliefs and opinions about
unknown events. Furthermore, there is the
issue of whether or not a prior probability on
an unknown event can even exist. This is a
philosophical question, not a scientific one.
35The Bayesian approach to heterogeneous data
integration tries to fit the data to a model
using a prior distribution of the values of what
is believed to be good data.
[Figure: The Simple Network. Nodes:
sequence-based experiments; xy contact in virus 1.]
36The Complexity of Two Viruses in the Network
[Figure: network nodes: sequence-based experiments;
replication and transcription (shown twice);
xy contact in virus type 1; xy contact in virus
type 2.]
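One simple way to picture the integration step is a naive Bayes combination, sketched below in Python. This is an assumption made for illustration and not the network used in the work described here: each evidence source (for example a correlated-mutation score or an experimental observation) is summarized by how likely the observation is if residues x and y are in contact versus if they are not, and the sources are treated as conditionally independent. All numbers and names are hypothetical.

```python
def posterior_contact_probability(prior, evidence):
    """Combine independent evidence sources into P(contact | all evidence).

    prior:    prior probability that residues x and y are in contact.
    evidence: list of (P(obs | contact), P(obs | no contact)) pairs,
              one per evidence source, with the second value > 0.
    """
    if not 0 < prior < 1:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1 - prior)
    for p_contact, p_no_contact in evidence:
        # Each source multiplies the odds by its likelihood ratio.
        odds *= p_contact / p_no_contact
    return odds / (1 + odds)


# Hypothetical inputs: a 10% prior and two supporting evidence sources.
evidence = [(0.8, 0.2), (0.6, 0.3)]
p_contact = posterior_contact_probability(0.1, evidence)
```

With these made-up numbers the posterior is 8/17, about 0.47. Two moderately informative sources raise a 10% prior to nearly even odds. Extending the network to a second virus would add evidence nodes that are not independent of the first, which is where a full Bayesian network becomes necessary.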