Title: Synopsis
2 Synopsis
- It has been estimated that at least 40% of the
total human genome sequence consists of the
integrated fragments of genomic parasites
- Retroviruses, retrotransposons, DNA transposons,
and parvoviruses can efficiently insert new
sequence into the human genome
- These integrating elements can be powerful tools
for discovering . . .
3 What genomic features affect integration?
- Each element shows a different pattern of
favorable integration sites
- Favored specific nucleotide sequences can be
detected in the target DNA at the point of
integration for most of these elements
- Post-integration genomic DNA is harvested, and
the DNA flanking the integrated element is cloned
and sequenced
4 Intention
- Present a comprehensive statistical comparison
of the factors influencing integration frequency
by annotating each base pair in the human genome
for its likelihood of hosting integration events
5 Framework
- 7 types of integrating elements
- 17 different integration complexes (datasets)
- 200 variables (genomic features)
- 10,000 integration sites
6 Previous research provided extensive insertion
site data
- HIV favors integration in active transcription
units (TUs)
- MLV favors integration near gene 5' ends
- ASLV integration is mostly random, but TUs seem
to be favored slightly
TUs are defined as regions of transcribed DNA
7 Previous research provided extensive
insertion site data
- SFV integration is mostly random, but is favored
slightly near CpG islands
- SB favors integration in transcription units
- AAV-based vectors show a modest preference for
regions near transcription start sites
- Experiments concerning whether LINEs prefer to
integrate within TUs have been inconclusive
8 Some Variables (Genomic Features)
- Genes and Exons: indicator variables for whether
the site falls into a gene or an exon
- Gene or Expression Density: the number of genes
or expressed genes per base pair in the region
surrounding the integration site
- DNase I Site Density: the number or density of
DNase I sites in the region surrounding the
integration site
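As a rough illustration of the indicator and density variables above, here is a minimal sketch. The function names, the (start, end) interval representation, and the choice to center the window on the site and count any overlapping gene are assumptions made for this example, not details taken from the study.

```python
def in_any(site, intervals):
    """Indicator variable: 1 if the site falls inside any interval, else 0.

    intervals: list of (start, end) pairs, half-open, on one chromosome.
    """
    return int(any(start <= site < end for start, end in intervals))


def gene_density(site, genes, window):
    """Genes per base pair in a window centered on the site.

    Assumption: a gene is counted if it overlaps the window at all.
    """
    half = window // 2
    lo, hi = site - half, site + half
    count = sum(1 for start, end in genes if start < hi and end > lo)
    return count / (hi - lo)
```

For example, with genes at (100, 200) and (1000, 2000), a site at position 150 has indicator 1 and, in a 200 bp window, a density of one gene per 200 bp.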
9 Some Variables (Genomic Features)
- GC Content: the GC percent in the 5 kb region
containing the site
- CpG Islands: whether the site is in a CpG island
- CpG Island Density: the number or density of CpG
islands in the region surrounding the site
- Transcription Start/Stop Features: the relation
of the site to transcription start/stop positions
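The GC content variable could be computed along the following lines. This is only a sketch: the slide says the 5 kb region contains the site, and centering the window on the site, clipping at sequence ends, and skipping uncalled bases (N) are assumptions of this example.

```python
def gc_percent(sequence, site, window=5000):
    """GC percent in a window around the site (0-based index into sequence)."""
    half = window // 2
    lo = max(0, site - half)              # clip at the start of the sequence
    hi = min(len(sequence), site + half)  # clip at the end of the sequence
    region = sequence[lo:hi].upper()
    called = [base for base in region if base in "ACGT"]  # ignore N and gaps
    if not called:
        return 0.0
    gc = sum(1 for base in called if base in "GC")
    return 100.0 * gc / len(called)
```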
10 Some Variables (Genomic Features)
- Positional Weight in Flanking Sequence: the
log-likelihood for integration versus control site
at each position in twenty bases of flanking
sequence (10 upstream and 10 downstream), and
their sum
- Log-likelihood is defined as the log of the ratio
of the frequency of each of the four bases at each
position to the frequency in the controls
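The positional weights and their sum can be sketched as follows. The function names and the pseudocount (added so that a base never seen at a position does not produce a log of zero) are assumptions of this example; the flanks are assumed to be equal-length strings containing only A, C, G, and T.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_weights(integration_flanks, control_flanks, pseudocount=1.0):
    """Per-position log ratio of base frequency: integration vs. control."""
    length = len(integration_flanks[0])
    n_int = len(integration_flanks) + 4 * pseudocount
    n_ctl = len(control_flanks) + 4 * pseudocount
    weights = []
    for i in range(length):
        hits = Counter(seq[i] for seq in integration_flanks)
        ctrl = Counter(seq[i] for seq in control_flanks)
        column = {}
        for base in BASES:
            f_int = (hits[base] + pseudocount) / n_int
            f_ctl = (ctrl[base] + pseudocount) / n_ctl
            column[base] = math.log(f_int / f_ctl)
        weights.append(column)
    return weights


def flank_score(flank, weights):
    """Sum of the positional log ratios for one flanking sequence."""
    return sum(column[base] for base, column in zip(flank, weights))
```

With 20-base flanks (10 upstream and 10 downstream), flank_score gives the summed quantity described above; a positive score means the flank looks more like the integration sites than the controls.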
11 Integration Complexes (Datasets)
12 Control Site Generation
- Each dataset has one of two types of control
- Matched (preferred): the integration sites were
created using a restriction enzyme. The control
site matches the integration site's distance
from the nearest restriction site in the
direction of transcription
- Random: the control site is merely a random
sequence from the genome
13 The ROC Curve
- Used to analyze the effects of genomic features
on integration
- Provides a measure of a predictor variable's
ability to discriminate between two classes of
events
- The area under the curve can be interpreted as
the probability that a randomly drawn integration
site will have a value for its genomic feature
that exceeds that of a randomly drawn control
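The probability interpretation can be checked directly by comparing every integration site with every control. This is a minimal sketch; the function name is made up for this example, and counting ties as half a win is the usual convention rather than something stated on the slide.

```python
def roc_area(integration_values, control_values):
    """Probability that an integration site's value exceeds a control's.

    Compares all pairs; ties count as half.
    """
    wins = 0.0
    for x in integration_values:
        for y in control_values:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(integration_values) * len(control_values))
```

If every integration site scores higher than every control the result is 1.0, the reverse gives 0.0, and identical distributions give 0.5.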
14 The ROC Curve
- The area under the ROC curve is taken as a
measure of the association between genomic
feature and the likelihood of an integration event
15 The ROC Curve
- The area under the curve is 1.0 when all
integration events have higher values for the
feature than any control event, and 0.0 for the
opposite case.
16 The ROC Curve
- Values very near 1.0 occur when higher values of
the feature predict integration, and values very
near 0.0 occur when lower values of the feature
predict integration
17 The ROC Curve
- When the area is 0.50, an integration site and a
control are equally likely to have the higher
value
- Values near 0.50 are consistent with the feature
having no predictive value
18 ROC Curve Construction
- Values for the integration sites are tallied to
create a histogram and its upper tail areas,
which show the fraction of integration sites
(vertical axis) that have values for the feature
that exceed a given value (horizontal axis)
19 ROC Curve Construction
- Repeat this same procedure using data from the
control sites
- Rotate this histogram and upper tail areas graph
90° clockwise
- The ROC curve is constructed from the collection
of true and false positive rates
20 ROC Curve Construction
- For every possible cutpoint, plot the True
Positive Rate on the y-axis and the False
Positive Rate on the x-axis
- A cutpoint is defined as any value of a predictor
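The construction can be sketched in a few lines. The function names are assumptions for this example; a site is called positive when its value is at or above the cutpoint, and the area is taken with the trapezoid rule.

```python
def roc_points(integration_values, control_values):
    """(false positive rate, true positive rate) at every cutpoint."""
    cutpoints = sorted(set(integration_values) | set(control_values),
                       reverse=True)
    points = [(0.0, 0.0)]  # cutpoint above every value: nothing is positive
    for c in cutpoints:
        tpr = sum(v >= c for v in integration_values) / len(integration_values)
        fpr = sum(v >= c for v in control_values) / len(control_values)
        points.append((fpr, tpr))
    return points


def area_under(points):
    """Trapezoid-rule area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

The curve runs from (0, 0) at the strictest cutpoint to (1, 1) at the loosest, and its area matches the pairwise probability interpretation of the ROC area.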
21 A Compact Representation of these Associations
- The absolute difference between the area and 0.50
is plotted
- Values around 0.0 indicate no useful predictive
information in the feature
- Values near 0.50 indicate that the feature is
nearly perfect in separating integration sites
from the controls
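In code the compact measure is a one-liner. The function name is hypothetical; note that this measure discards the direction of the association, so areas of 0.2 and 0.8 give the same value.

```python
def association_strength(roc_area):
    """Absolute difference between a ROC area and 0.50 (range 0.0 to 0.5)."""
    return abs(roc_area - 0.5)
```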
22 Color-coded Heat Maps
- Color-coded heat maps are matrices displaying
associations for each type of genomic feature
using rows of the matrix for features and columns
for data sets
23 Color-coded Heat Maps
- Bright green represents ROC curve areas near 0.0
- Black represents ROC curve areas of 0.50
- Bright red represents ROC curve areas near 1.0
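One way to turn a ROC area into a heat map cell color is sketched below. Only the three anchor colors come from the slide; the linear ramp between them, the RGB encoding, and the function name are assumptions of this example.

```python
def heat_color(roc_area):
    """Map a ROC area in [0, 1] to an (r, g, b) tuple with 0-255 channels.

    0.0 -> bright green, 0.50 -> black, 1.0 -> bright red.
    """
    if not 0.0 <= roc_area <= 1.0:
        raise ValueError("ROC area must lie between 0 and 1")
    intensity = round(255 * abs(roc_area - 0.5) / 0.5)
    if roc_area > 0.5:
        return (intensity, 0, 0)   # higher feature values favor integration
    if roc_area < 0.5:
        return (0, intensity, 0)   # lower feature values favor integration
    return (0, 0, 0)               # no association
```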
24 Effects of Nucleotide Sequence of the 20 Base
Pairs Surrounding the Point of Integration
- To determine how important different features are
in directing integration towards a region, each
base in the interval is treated as the edge of an
integration site
25 Effects of Nucleotide Sequence of the 20 Base
Pairs Surrounding the Point of Integration
- Each region is then scored for the expected
number of integration events over the interval,
and these interval scores are summed
26 Effects of Nucleotide Sequence of the 20 Base
Pairs Surrounding the Point of Integration
- The summed values are then tested for their
ability to sort experimental integration sites
from controls
27 Effects of Nucleotide Sequence of the 20 Base
Pairs Surrounding the Point of Integration
[Figure: axes labeled Interval Size and Integrating Elements]
- Results are presented as areas under the ROC
curve for this variable
28 Integration in Transcription Units and the Effect
of Gene Activity
- Analysis of DNA integration within TUs and exons
29
- HIV (red) is positively correlated with TUs
- Others varied from slightly negative (green) to
indistinguishable from controls (black)
30
- This figure summarizes the effects of gene
density in differently sized genomic intervals
(100 kb to 4 Mb)
- Affymetrix arrays were used for transcriptional
profiling
- The expression scores for all genes in an
interval are divided by the interval width
- All datasets were weakly positive for
insertion in at least one interval
- "There was no clear pattern of interval size,
type of gene call, or expression level."
- Suggests that gene density features were most
significant
- Strong effects seen in HIV and MLV datasets
- Weakest response from non-dividing cells or
macrophages
31 How Do G/C Content and Proximity to CpG Islands
Affect Integration?
- On average, high G/C content implies:
- Gene rich
- Short introns
- High frequencies of Alu repeats
- Low frequencies of LINEs
- High frequency of CpGs
32
- 2 MLV datasets where integration was positively
correlated with G/C content
- 3 HIV datasets that were negatively correlated
(A/T preference)
- Other datasets showed weaker and less consistent
responses
33 Whoa!? I Thought HIV Integrated in
Gene-Enriched Regions?
34
Fig. 3 A
Fig. 4 A
A/T preference of HIV integrase-binding protein
35
- CpG island density
- Increasing interval length, 1 kb to 32 Mb
- Correlates with gene density
- Within short regions, proximity to CpG islands
correlates with proximity to regulatory regions
- Long intervals span many genes
36 DNase I Cleavage Sites
- DNase I preferentially cleaves chromatin at sites
where transcription factors bind, which coincide
with CpG islands and gene control regions
37 Integration Near Transcription Factor Binding
Motifs
- Summarizes how integration is affected by its
proximity to transcription factor binding sites
- TRANSFAC PWM (position weight matrix): scores how
well the integration site or control matches a
PWM; this score generates an ROC curve describing
the effects of that PWM
- Effects were weak when analyzed with other factors
38 Proximity to Transcription Start and Stop Features
- Compares integration frequency relative to
transcription start and stop positions for
experimental sites and matched random controls,
expressed as ROC areas (Fig. 4C)
39
- Boundary.dx: distance from the 5' or 3' end
- Start.dx: distance to the nearest gene start
site
- Closer to the start (green)
- Signed.dx: high probability at the start sites
(red)
- General.width: length of introns
40 Improved Models Incorporating Score.20 Together
with Other Genomic Features
- Score.20 was the most effective variable for
distinguishing the site selection of the
different vehicles
- Other variables were added to accentuate our
results
- Non-redundant
- Lack of correlation
41 Increase in ROC Area by the Addition of a Genomic
Feature
- Histogram: found little correlation of score.20
with other features
- Predictors of integration targeting can be
constructed based on score.20 and another feature
- The fitting process leads to values that rank
higher than matched random controls
42
Fig. 5 D
43 A Single Model!
- Regression models would be too complex
- Want to analyze various features
- Bayesian Model Averaging (BMA)
- Reinforces that score.20 and other features are
independent
- Models with high posterior probability were
collected and used to evaluate the importance of
various features
- Random sites are scored for the log odds
of integration with BMA models
44 Hierarchical clustering
- Major grouping of the retrovirus HIV datasets
- Among our 17 datasets, the different element
types were resolved onto separate branches
- Verifies that integration site selection is
dominated by element-encoded recombination enzymes
45 What genomic features influence integration of
new DNA?
- What we've learned about each integrating element
- HIV favors integration in active transcription
units (TUs)
- MLV favors integration near gene 5' ends
- ASLV integration is mostly random, but TUs seem
to be favored slightly
- HIV: Found to be weakly attracted to integration
sites near DNase I cleavage sites over long
intervals, probably because both HIV insertion
sites and DNase I cut sites correlate with
gene-dense regions. Also revealed a strong
preference for integration in A/T-rich sequences,
contradicting previous presumptions that
correlated insertion with G/C-dense areas.
- MLV: Integration associations with CpG islands
and DNase I hypersensitive sites were found to be
amplified when a larger scale of interest is
used. The influence of the local nucleotide
sequence also increased with a larger interval.
Strong correlation for integration near areas of
gene expression.
- ASLV: Integration near DNase I sites over long
genomic intervals is favored.
46 What genomic features influence integration of
new DNA?
- What we've learned about each integrating element
- SFV integration is mostly random, but is favored
slightly near CpG islands
- SB favors integration in transcription units
- AAV-based vectors show a modest preference for
regions near transcription start sites
- Experiments concerning whether LINEs prefer to
integrate within TUs have been inconclusive.
Specific sequence is known to have an effect on
integration.
- SFV: Cell-specific integration influences.
Integration near CpG islands and proximity to
DNase I cut sites were more evident in stem cells
than in fibroblasts.
- SB: Contradictory results regarding proximity
to CpG islands and gene density, possibly because
of cell-type-specific integration influences.
- AAV: Of all vectors, integration was found to be
least favorable into TUs, contradicting previous
mouse liver studies.
- L1: Supports previous studies suggesting strong
integration site nucleotide relationships.
47 What genomic features influence integration of
new DNA?
When asking this question, the scale of interest
is very important because it can influence the
results.
For example, suppose you use a vector that you
think integrates near the sequence GATTACA. When
you focus on a 20 bp segment, it can be very easy
to predict where the vector will integrate.
Conversely, if that same vector is integrated
into a 1 kb segment, a 20 kb segment, or the
3 billion base pair genome, the integration site
is going to be harder to predict, especially if
there are other, less understood influences
acting in concert, as in our case. Other factors
were seen to increase their influence with
increased interval size, as seen in MLV and ASLV.
48 Future Studies
With this catalog of vector-feature interactions,
we can better understand novel insertion
influences as they're identified. They can be
studied and compared alongside the current
comprehensive predictive models, which
incorporate all currently known genomic
features. In doing so, we will gain better
insertion prediction abilities with each new
independent genomic feature discovered.
One such new feature could be the relative
locations of nucleosomes, or other epigenetic
factors, like DNA methylation or histone
acetylation.
http://en.wikipedia.org/wiki/Nucleosome
49 Future Studies
This paper mentioned many potential future
studies for each individual insertion vector,
for example, SB cell-specific integration and
the likelihood of AAV insertion into TUs.
Many other areas of research could build
upon the findings presented in this article.
Stronger mathematical modeling systems could be
of great value.
http://www.bioscience.heacademy.ac.uk/network/sigs/numeracy/
50 Future Studies
A different approach, using advances in
proteomics to isolate and identify some of the
functional proteins used by these potential
insertion vectors, could also expand our
understanding of the mechanisms involved. A
bioinformatics database could then be used to
see whether any DNA-binding proteins,
chromatin-related proteins, DNases, DNA ligases,
etc. were found.
http://www.dartmouth.edu/toxmetal/TXQAas.shtml
51 Future Studies
A second novel use of the vector-feature
interaction library is as a reference for the
feature in question. If you were working with
CpG islands, you could look up which insertion
vectors are likely to insert near your CpG
island of interest.
http://www.pb.ethz.ch/research/chromatin_technics/TDI.jpg/image
52 Big Future Studies
The purpose of this research was to better
understand the factors influencing various vector
insertions. This is useful in the hope of
creating a reliable, predictable vehicle for
integrating DNA elements into humans. This
innovation could turn gene therapy into a
plausible reality. We need to be able to insert
desired segments with pinpoint accuracy, as
illustrated at the beginning of this paper. A
previous study successfully treated human
X-SCID while also indirectly causing leukemia in
three of the patients. Unlike in mice, in humans
it has to work the first try, every try.
53 Gene Therapy
Typically, gene therapy is most successful when
used to treat a single-gene (monogenic) genetic
disorder
- Cystic Fibrosis
- Sickle Cell Anemia
- Marfan Syndrome
- Huntington's Disease
- Hereditary Hemochromatosis
- Ornithine Transcarbamylase Deficiency (OTCD)
- X-linked Severe Combined Immunodeficiency Disease
(X-SCID), "bubble baby syndrome"
http://www.annasslant.com/doctor-shot.jpg
For more information about gene therapy, visit
http://www.ornl.gov/sci/techresources/Human_Genome/medicine/assist.shtml