Title: Computational Virology
1Computational Virology
Lectures in Bioinformatic Studies on the Evolution, Structure
and Function of RNA-based Life Forms
Marcella A. McClure, Ph.D.
Department of Microbiology and the Center for Computational Biology
Montana State University, Bozeman, MT
mars_at_parvati.msu.montana.edu
2Summary Lecture I
- 1) Introduction to RNA-based life forms
- 2) Methods to test the hypothesis
- 3) Testing the hypothesis
- 4) Predicting protein contacts
3The World of Viruses
DNA viruses
RNA viruses
RdDp
ssRNA
dsRNA
ssDNA
dsDNA
RdRp
host Pol II
ssRNA
- ssRNA
Does the RT domain of the RdDp share common
ancestry with the RdRp of negative and positive
polarity, single-stranded viruses?
4Rhabdoviridae
Paramyxoviridae
Filoviridae
Retroviridae
Picornaviridae
5Retroid Agents
Retroviruses, retrotransposons,
pararetroviruses, retroposons, retroplasmids,
retrointrons, and retrons
reverse transcriptase mediated replication or
transposition
RNA viruses e.g., Ebola, rabies, influenza, polio
All cellular systems, most DNA viruses
RNA
DNA
transcription
Replication by DNA-dependent DNA polymerase
Replication by RNA-dependent RNA Polymerase
translation
snRNAs, ribozymes, tRNA, rRNA
PROTEIN SYNTHESIS
McClure, 2000
6Mononegavirales
OLD FOES: rabies (Rhabdoviridae); measles, RSV, mumps (Paramyxoviridae)
EMERGING THREATS: Ebola, Marburg (Filoviridae); equine morbillivirus, Nipah virus (Paramyxoviridae)
MODEL AGENT: vesicular stomatitis virus (Rhabdoviridae)
7Roles of Retroid Agents
1) Disease
   a) retroviruses
      1) exogenous infectious: HIV, HTLV
      2) endogenous associations: breast cancer, testicular tumors, insulin dependent diabetes, multiple sclerosis, rheumatoid arthritis, schizophrenia and systemic lupus erythematosus
   b) LINEs insertional mutagenesis
      1) Hemophilia A
      2) muscular dystrophies: Duchenne and Fukuyama-congenital type
      3) X-linked disorders: Alport Syndrome-Diffuse Leiomyomatosis and Chronic Granulomatous Disease
2) Regulation of cellular genes and reproduction
3) Telomere maintenance
4) Repair of broken dsDNA
5) Exchange of genetic information among and between organisms
8Plus-strand RNA Virus Families and Human Diseases
Togaviridae - Rift Valley Fever
Flaviviridae - Dengue Fever virus, West Nile virus
Coronaviridae - Infectious Bronchitis
Caliciviridae - Hepatitis E virus
Picornaviridae - Human poliovirus, Hepatitis A
9VSV Transcription
[Figure: VSV Transcription and VSV Replication. Labels: leader, N, P, L, 5', 3', read through, CO-ASSEMBLY.]
10RNA Template
11Replication
12Model of a poliovirus polymerase-dsRNA complex
HIV-1 Reverse Transcriptase
Poliovirus Polymerase
Poliovirus Polymerase Oligomer
Model of a poliovirus polymerase-dsRNA complex
based on the structure of HIV-1 RT complexed to
dsDNA (Huang et al., 1998).
13 Rhabdoviridae Genome
Paramyxoviridae Genome
Filoviridae Genome
N VP35 VP40 G
VP30 VP24 RdRp
MMLV Genome
Picornaviridae Genome
RdRp
VPg
Poly(A)
L P4 P2 P3 P1 2A 2B 2C
3A 3B 3C 3D
14RdRp of Plus strand viruses
GDD
RdRp of Mononegavirales
GDNQ
RdDp
FADDM
RT
RH
HYPOTHESIS The Reverse Transcriptase domain of
the RNA-dependent DNA Polymerase shares common
ancestry with the RNA-dependent RNA Polymerase
of the Order Mononegavirales and Plus Strand RNA
viruses.
15Biological Patterns
Whether randomness can be measured is a
difficult problem. One cannot judge the absence
of pattern without specifying which pattern, and
what is a pattern to you may not be a pattern to
me.
McClure, 2000
16Basic Strategy
Search Databases
Annotation and Preparation of Sequences
Multiple Alignment of Sequences
Refined Multiple Alignment
Analysis of Multiple Alignment
McClure, 2000
17What is an ordered series of motifs (OSM)?
An OSM, which may span hundreds of residues, is defined as a set of conserved or semi-conserved motifs (1-9 contiguous amino acid residues) found in the same arrangement relative to one another in all sequences of a protein family. The amino acids of these patterns are involved in catalysis or structural integrity. The spacing between motifs, or motif intervening regions (MIRs), can be highly variable, reflecting the regions of a protein that are less restricted by functional or structural constraints. MIRs may evolve more rapidly and be more subject to insertion/deletion events and duplications than the OSM.
Why is OSM identification important?
The OSM of a protein family can be used to predict function. The identification of an OSM common among protein sequences with as little as 8% amino acid identity has led to successful prediction of function. If a multiple alignment method (be it global or local) cannot correctly identify the highly conserved residues of a given sequence that are critical for function and structure, then it is of little value.
McClure 2002
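The OSM definition above can be illustrated with a minimal sketch: it checks whether a list of motifs occurs in the same order in every sequence, with variable spacing (the MIRs) allowed between them. The function name `find_osm`, the toy sequences and the toy motif patterns are all hypothetical and are not taken from the lecture; taking the leftmost match of each motif is a simplifying assumption.

```python
import re

def find_osm(sequence, motifs):
    """Return the start position of each motif if all motifs occur in the
    given order (leftmost match first), otherwise return None."""
    positions = []
    search_from = 0
    for pattern in motifs:
        match = re.compile(pattern).search(sequence, search_from)
        if match is None:
            return None  # one motif is missing or out of order
        positions.append(match.start())
        search_from = match.end()  # the next motif must lie downstream
    return positions

# Hypothetical toy family and toy motif series (illustration only).
toy_family = {
    "seq_a": "MKTLDAAGHSGDDLLVPRK",
    "seq_b": "MDTAQQQQGKSGDDCIVWRK",
    "seq_c": "MKTLGDDAAADSAA",
}
toy_osm = ["D[AT]", "SGDD", "[LI]V"]

osm_hits = {name: find_osm(seq, toy_osm) for name, seq in toy_family.items()}
for name, hit in osm_hits.items():
    print(name, hit)
```

In this toy set the first two sequences carry the full series at different spacings, while the third lacks one motif and is reported as `None`.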
18Levels of Sequence Comparisons
McClure, 2000
19Example of local subsequences or OSM
McClure, 2000
20Strategy for Assessing Protein Sequence Homology
Protein Sequence Data
SEQUENCE COMPARISON
>30% identical: homology
<30% identical
MOTIF DETECTION
Support for homology Statistical tests
OSM present functionally equivalent
likely homologue
Functional identification, Phylogenetic
analysis, Structural prediction
Support for homology Gene order and size,
common function
McClure, 2000
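The first branch of the strategy above (sequence comparison against a 30% identity cutoff) can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a pair of already aligned sequences, identity is counted only over columns without gaps, and a value of exactly 30% is sent to motif detection. The function names and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns in which neither sequence has a gap.
    The choice of denominator is an assumption; other conventions exist."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

def next_step(identity, threshold=30.0):
    """Route a comparison as in the flow above: above the threshold the pair is
    treated as homologous, otherwise it goes on to motif detection."""
    return "homology" if identity > threshold else "motif detection"

# Hypothetical aligned pairs (illustration only).
close_pair = percent_identity("MKT-LDG", "MRTALEG")
distant_pair = percent_identity("ACDEFGHIKL", "AYYYYYYYYL")
print(round(close_pair, 1), next_step(close_pair))
print(round(distant_pair, 1), next_step(distant_pair))
```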
21Do RNA-Dependent Polymerases Share Common Ancestry?
22Experimental Design for Testing Motif Detection
Methods
Methods: appropriateness, availability, assumptions, limitations, user-specific parameters
Benchmark Sequences: biologically informative markers, sequence length distribution, evolutionary distribution, set size
Parameter Range Tests
Types of Test Data
Evaluate Results for Correct Identification of
Biologically Informative Marker
Method(s) that Accurately Identify Biologically
Informative Marker
RdRp and RdDp sequences
Test hypothesis: RdRp shares common ancestry with
RdDp
23RdRp of Plus strand viruses
GDD
RdRp of Mononegavirales
GDNQ
RdDp
FADDM
RT
RH
HYPOTHESIS The Reverse Transcriptase domain of
the RNA-dependent DNA Polymerase shares common
ancestry with the RNA-dependent RNA Polymerase of
the Order Mononegavirales and Plus Strand RNA
viruses.
25Sequence Length, Percent Identity and Distance
Values
26Small Dataset Output
27Large Dataset Output
28New work
A Functional Genomics Approach to Inferring Amino
Acid Contacts Among the L, P and N proteins of
the Replication/Transcription Complex of the
Order Mononegavirales
1) Protein disorder
   - Low hydrophobicity and high mean net charge are good indicators of natively unfolded proteins
   - Predictors of Natural Disordered Regions (PONDR): utilizes neural networks to distinguish disordered from ordered regions
2) Evolutionary Dynamic Approaches
   A) Intermolecular compensatory mutations (Pazos and Valencia)
      1) predicting interacting partners
      2) detecting correlated mutations between two interacting proteins
      3) extending to three interacting partners
   B) Evolutionary-Structure Function (ESF) (Simon and Sidow): determines numbers of amino acid replacements given a fixed phylogenetic topology, ranking constrained regions
   C) Intramolecular compensatory mutations (Pollock): calculates likelihood estimates allowing for rate variation and robustly discriminates coevolution of intra-sites versus random effects
3) Use experimental results to model and validate expectations
4) Test the predicted structure for the Ebola
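The protein disorder item above notes that low hydrophobicity combined with high mean net charge indicates a natively unfolded protein. A minimal sketch of that calculation follows. It assumes the Kyte-Doolittle hydropathy scale rescaled to 0..1, counts only K/R as positive and D/E as negative (histidine treated as neutral), and uses the linear boundary R = 2.785 H - 1.151 commonly attributed to Uversky and colleagues. None of these choices is stated in the lecture, and the test sequences are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(sequence):
    """Mean hydropathy H, with each residue rescaled from [-4.5, 4.5] to [0, 1]."""
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in sequence) / len(sequence)

def mean_net_charge(sequence):
    """Mean net charge R: |positive - negative| residues divided by length."""
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return abs(positive - negative) / len(sequence)

def looks_natively_unfolded(sequence):
    """True when the sequence lies on the high-charge, low-hydropathy side of
    the assumed boundary R = 2.785 * H - 1.151."""
    h = mean_hydropathy(sequence)
    r = mean_net_charge(sequence)
    return r > 2.785 * h - 1.151

# Hypothetical test sequences (illustration only).
for seq in ("DEDEDEDEKK", "LLVVIIAAFF"):
    print(seq, round(mean_hydropathy(seq), 3), round(mean_net_charge(seq), 3),
          looks_natively_unfolded(seq))
```

A whole-protein average like this is much coarser than a neural-network predictor such as PONDR, which scores regions along the sequence; the sketch only illustrates the charge and hydropathy indicator itself.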
29Figure 1. Schematic of VSV RNA Synthesis. The L and P proteins interact with the ribonucleoprotein complex (N-RNA), and the 5 individual RNA messages of the genome are transcribed. The same complex also replicates nascent genomes that undergo co-assembly with the N protein. (Figure from J. Perrault, personal communication.)
30Rhabdoviridae Genome
VSV
Paramyxoviridae Genome
Sendai
31N, P and L Proteins
required for replication
[Figure: maps of the N, P and L proteins of Sendai virus and VSV.]
N protein: Sendai, residues 1-524; VSV, residues 1-422. Labels: RNA-BS, PPBS, PCS.
P protein: Sendai, residues 1-568; VSV, residues 1-265. Labels: Oligomerization domain, NPBS, LPBS, RSR, RES, GTP binding.
L protein: Sendai, residues 1-2228; VSV, residues 1-2109. Labels: regions I-VI, RSR, PPBS, MT, RNA-BS.
32Mtase of Ebola virus
33Update Mononegavirales Sequence
Update Mononegavirales Sequence and Literature
Database
Annotated N, P, L protein maps with ALL
information regarding positions of
experimentally determined functions and
interactions
N, P and L sequences
Multiple Alignment
Evolutionary Dynamics Analysis
Predict regions of disorder
Inter-CM analysis
Phylogenetic reconstruction
PONDR
Calculate H/R
ESF-analysis
Intra-CM analysis
Integration of heterogeneous data in Bayesian
Inference Network
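The Inter-CM step in the flow above looks for correlated changes between positions of two different proteins. As a rough illustration, the sketch below scores every pair of columns from two alignments by mutual information. This is a simpler stand-in and not the Pazos and Valencia method; it assumes both alignments list the same viruses in the same row order, and the toy alignments are hypothetical.

```python
import math
from collections import Counter

def column(alignment, index):
    """Residues at one alignment position, one per sequence (row)."""
    return [sequence[index] for sequence in alignment]

def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns whose rows
    are paired by organism."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    score = 0.0
    for (x, y), joint in count_xy.items():
        p_xy = joint / n
        score += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return score

def rank_column_pairs(alignment_a, alignment_b):
    """Score every inter-protein column pair; the highest score comes first."""
    scores = []
    for i in range(len(alignment_a[0])):
        for j in range(len(alignment_b[0])):
            mi = mutual_information(column(alignment_a, i), column(alignment_b, j))
            scores.append((mi, i, j))
    return sorted(scores, reverse=True)

# Hypothetical toy alignments: row k of each comes from the same virus.
toy_p = ["AKL", "AKL", "ARL", "ARL"]
toy_n = ["GDS", "GDT", "GES", "GET"]

ranked_pairs = rank_column_pairs(toy_p, toy_n)
print(ranked_pairs[0])
```

With so few sequences, high scores can arise by chance or from shared phylogeny alone, which is one reason the approaches listed in the lecture condition on a tree or allow for rate variation.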
34sequence-based experiments
replication
transcription
xy contact in virus 1
Fig. 4. A small Bayesian network for inferring
protein-protein contact for one virus.
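To make the idea of Fig. 4 concrete, here is a minimal sketch of inference in a small network with one hidden "contact" node and three evidence nodes (sequence-based, replication, transcription). The structure is an assumption: the evidence nodes are treated as conditionally independent given contact, which may not match the network in the figure. Every probability below is a hypothetical placeholder, not a measured value.

```python
def posterior_contact(prior, likelihoods, observations):
    """Posterior probability of contact given binary evidence.

    likelihoods:  name -> (P(positive | contact), P(positive | no contact))
    observations: name -> True (positive result) or False (negative result)
    Evidence sources are assumed conditionally independent given contact.
    """
    p_contact = prior
    p_no_contact = 1.0 - prior
    for name, observed in observations.items():
        p_pos_contact, p_pos_none = likelihoods[name]
        p_contact *= p_pos_contact if observed else 1.0 - p_pos_contact
        p_no_contact *= p_pos_none if observed else 1.0 - p_pos_none
    return p_contact / (p_contact + p_no_contact)

# Hypothetical numbers for illustration only.
toy_likelihoods = {
    "sequence_based": (0.7, 0.2),
    "replication": (0.8, 0.3),
    "transcription": (0.6, 0.4),
}
toy_observations = {
    "sequence_based": True,
    "replication": True,
    "transcription": False,
}

toy_posterior = posterior_contact(0.1, toy_likelihoods, toy_observations)
print(round(toy_posterior, 3))
```

Extending this to several viruses, as in the multi-virus version of the network, would mean sharing parameters or a common parent node across the per-virus contact nodes; that extension is not sketched here.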
35sequence-based experiments
replication
transcription
replication
transcription
xy contact in virus type 1
xy contact in virus type 2
Fig. 5. A Bayesian network representing multiple
instances of protein-protein contact inference
for more than one virus.
361
2
Integration of heterogeneous data into a Bayesian
Framework
Update Sequences, literature and construct
structure/function maps
Determine disordered regions
Construct multiple alignments and initiate
Inter-CM analysis
Initiate phylogenetic reconstruction
Conduct ESF and Intra-CM analysis
Figure 5. Two-year project timeline for proposed
studies.