Overview

Slides: 44

Provided by: CeleraE7

Transcript and Presenter's Notes

Title: Overview

1
Overview

Biological motivation
Methods in gene prediction
Mapping of large EST data sets
Applications of EST data mining

2
ESTomics

Sorin Istrail

3
Biological motivation

Model of eukaryotic gene transcription and
translation

[Figure labels: RNA polymerase II promoter; upstream binding sites (Sp1, Oct1, C/EBP); TATA box; initiator; gene; DNA coding strand]
4
Biological motivation

Model of eukaryotic gene transcription and
translation

[Figure labels: RNA polymerase II promoter; upstream binding sites (Sp1, Oct1, C/EBP); TATA box; initiator; gene; DNA coding strand; transcription; primary transcript with cap, 5' UTR, exon 1, intron (GT ... AG), exon 2, 3' UTR, AAUAAA, (A)n]
5
Biological motivation

Model of eukaryotic gene transcription and
translation

Expressed Sequence Tags (ESTs) are cDNA fragments
500 bp long on average
may span one or more exons
cDNA: single-stranded DNA complementary to an
RNA, synthesized from it by reverse transcription

[Figure labels: gene on the DNA coding strand with 5' UTR and 3' UTR; primary transcript with exons 1, 2, 3, exon 4 (non-coding), and three introns; mRNA; ESTs]
8
Overview

Biological motivation
Methods in gene prediction
Mapping of large EST data sets
Applications of EST data mining

9
Methods in gene finding

Ab initio analysis of genomic sequences
(GenScan, Burge and Karlin 1997; HMMer, Haussler
et al. 1993, Krogh et al. 1994; FGenesH, Solovyev
and Salamov 1994)
Comparison of protein and genomic sequences
(Procrustes, Gelfand et al. 1996; Genewise,
Birney and Durbin)
Comparison of expressed DNA (ESTs, cDNA, mRNA)
and genomic sequences (EST_GENOME, Mott 1997;
SIM4, Florea et al. 1998)
Cross-species genomic sequence comparisons
(ROSETTA, Batzoglou et al. 2000; CEM, Bafna and
Huson 2000)

10
Ab initio gene finders

Use information embedded in the genomic sequence
to predict the exon model
polyadenylation signal (AATAAA)
differential codon usage in coding versus
non-coding sections of the gene
upstream regulatory signals (TATA boxes) and
local characteristics of the sequence (CpG
islands)
splice recognition signals (e.g., GT-AG)
Markov models are the predominant predictive
method
Caveats
not effective in detecting alternatively spliced
forms, interleaved or overlapping genes

11
The GenScan method

High-level organization
each of the basic functional units of a gene is
associated with a state in the HMM
Lower-level organization
separate sequence prediction module for each of
the higher-level elements
exons (marginal, internal, phase-specific) -
inhomogeneous 3-periodic fifth order Markov model
introns and intergenic regions - homogeneous 5th
order Markov model
5' and 3' UTRs - homogeneous 5th order Markov
model
polyadenylation signal
donor and acceptor splice sites - WAM and the
Maximal Dependence Decomposition (MDD), i.e., a
decision tree-based weighted position matrix
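
The Markov-model content sensors listed above can be illustrated with a much simpler stand-in. The sketch below is not GenScan's implementation: it trains two homogeneous k-th order Markov models (one on example coding sequence, one on non-coding sequence) and scores a window by its log-odds. The function names, the add-one smoothing, and the toy training strings are assumptions made for illustration; GenScan's exon model is additionally 3-periodic, which this sketch omits.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=5, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Returns a function prob(context, base); unseen events are smoothed
    with a pseudocount (an assumption of this sketch).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order=5):
    """Log-probability of seq under the model, skipping the first `order` bases."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

def log_odds(seq, coding_prob, noncoding_prob, order=5):
    """Positive values favor the coding model, negative the non-coding one."""
    return (log_likelihood(seq, coding_prob, order)
            - log_likelihood(seq, noncoding_prob, order))

# Toy usage with order 2 so that the tiny training strings are sufficient.
coding = train_markov(["ATGATGATGATGATGATG"], order=2)
noncoding = train_markov(["ACACACACACACACACAC"], order=2)
print(log_odds("ATGATGATG", coding, noncoding, order=2) > 0)
```

A fifth-order model conditions on 4^5 = 1024 contexts, so realistic training needs far more sequence than the toy strings used here.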

12
GenScan's HMM for sequence generation
[Figure: state diagram of the GenScan HMM. Forward (+) strand states: P+ (prom), F+ (5' UTR), Esngl+ (single-exon gene), Einit+, E0+, E1+, E2+, I0+, I1+, I2+, Eterm+, T+ (3' UTR), A+ (polyA signal). Reverse (-) strand states: P-, F-, Esngl-, Einit-, E0-, E1-, E2-, I0-, I1-, I2-, Eterm-, T-, A-. N (intergenic region).]
(Burge and Karlin (1997) Prediction of complete gene structures in human
genomic DNA, JMB 268, p. 86)
13
Protein-genomic sequence comparisons

Use sequence similarity between the protein and
the protein-coding regions of the genomic
sequence for gene model prediction
Algorithmic techniques
dynamic programming-based sequence alignment
algorithms
specialized recognition modules for splice
junction prediction
profile HMMs
Examples
Procrustes (Gelfand et al. 1996)
combinatorial pairing of putative splice
junctions to form introns
uses protein-genomic sequence similarity to
validate the correct pairings
Genewise (Durbin and Birney)
HMM-based sequence profiles
uses similarity between the query protein and a
database of protein families organized in
profiles (Pfam)
Caveats
prediction limited to coding regions (excluding
5' and 3' UTRs)

14
cDNA-genomic sequence comparisons

Use similarities between the cDNA (ESTs, mRNAs)
and the genomic sequences to predict the gene
model.
Algorithmic techniques
dynamic programming-based sequence alignment
algorithms
specialized module for splice junction detection
(pattern matching techniques, or statistical
modeling)
Examples
EST_GENOME (Mott 1997)
dynamic programming alignment with an affine
scoring scheme
uniform scoring for large indels (introns)
SIM4 (Florea et al. 1998)
incremental exon detection and refinement with
blast-like and greedy sequence comparison
techniques
pattern matching prediction of splice junctions
Caveats
accuracy depends on the quality of the data
source (e.g., cannot detect genomic contamination
by unspliced introns, or spurious priming)
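
As a small illustration of the pattern-matching approach to splice junctions mentioned above, the sketch below checks whether each intron implied by an exon model starts with GT and ends with AG. It is not taken from SIM4 or EST_GENOME; the function name, the 0-based half-open coordinates, and the forward-strand assumption are all choices made for this example.

```python
def intron_signals(genome, exons):
    """Report donor/acceptor dinucleotides for introns implied by an exon model.

    genome: forward-strand genomic sequence (string).
    exons:  sorted list of (start, end) pairs, 0-based, end exclusive.
    Returns one tuple per intron:
    (intron_start, intron_end, donor, acceptor, is_canonical).
    """
    report = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = genome[prev_end:next_start]
        donor, acceptor = intron[:2], intron[-2:]
        canonical = donor == "GT" and acceptor == "AG"
        report.append((prev_end, next_start, donor, acceptor, canonical))
    return report

# Toy usage: two 6-bp exons separated by a 12-bp intron.
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
for row in intron_signals(genome, [(0, 6), (18, 24)]):
    print(row)
```

A real aligner would also shift the exon boundaries by a few bases to search for a better-supported junction; this check only reports what the given coordinates imply.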

15
Cross-species genomic sequence comparison

Use the sequence similarity and the ordering of
homologous regions between genomic sequences from
related organisms to infer their common gene
model.
Algorithmic techniques
dynamic programming-based sequence comparison
algorithms
statistical modeling of the splice junctions and
other common transcriptional elements
Examples
ROSETTA (Batzoglou et al. 2000), CEM (Conserved
Exon Method; Bafna and Huson 2000)
progressive sequence alignment between the
various categories of orthologous regions (based
on the expected sequence similarity)
statistical methods for splice signal recognition
(?)
Caveats
accuracy depends on the specificity of sequence
similarity and the presence of delimiting
transcriptional signals at that locus (similarity
may extend past the gene boundaries)

16
Automatic gene annotation with Otto
17
Components of the automatic gene annotation

Bn - blastn (dbEST, CHGI, CMGI, RefSeq)
S4 - SIM4 (dbEST, CHGI, CMGI, RefSeq)
Genewise (nr)
GenScan
FGenesH
repeat - RepeatMasker
etc.
Otto - automatic gene predictions by
Otto
Promoted - curated transcripts

18
Overview

Biological motivation
Methods in gene prediction
Mapping of large EST data sets
Applications of EST data mining

19
Using large EST data sets for gene prediction
20
Using large EST data sets for gene prediction

Each EST may span one or more of a gene's exons
Overlapping ESTs and mRNAs on the genome can be
used to infer gene models
Large data sets must be used for completeness
dbEST (3.7 million ESTs)
UniGene (90,000 ESTs and mRNA transcripts,
grouped by similarity)
proprietary data sets (LifeSeq, CHGI)
Analyzing such large data sets is time- and
resource-consuming
Strategy for EST data mining
determine the occurrences of a large set of cDNA
sequences in a target genome (mapping)
group the overlapping EST matches on the genome
to infer the underlying gene model (clustering)
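
The clustering step of this strategy can be sketched as interval merging along the genomic axis. This is a minimal illustration, not the procedure used in the talk: the function name and the input layout (one tuple per EST match, all on the same strand of the same sequence, 0-based half-open coordinates) are assumptions.

```python
def cluster_matches(matches):
    """Group EST matches whose genomic spans overlap.

    matches: iterable of (est_id, start, end), end exclusive.
    Returns a list of clusters, each a dict with the merged span and
    the EST ids it contains, ordered by genomic position.
    """
    clusters = []
    for est_id, start, end in sorted(matches, key=lambda m: (m[1], m[2])):
        if clusters and start < clusters[-1]["end"]:
            # Overlaps the current cluster: extend it.
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
            clusters[-1]["ests"].append(est_id)
        else:
            clusters.append({"start": start, "end": end, "ests": [est_id]})
    return clusters

# Toy usage.
matches = [("est1", 100, 600), ("est2", 500, 900),
           ("est3", 2000, 2400), ("est4", 850, 1200)]
for cluster in cluster_matches(matches):
    print(cluster)
```

Merging whole spans is the simplest choice; clustering on shared exons or splice junctions instead would avoid joining a gene with another gene nested in one of its introns.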

21
Mapping ESTs to a target genome

Mapping: determine, for a given EST, the exact
genomic location(s) and exon model(s), i.e.,
exon coordinates in the genomic sequence
genomic match strand (forward, or reverse
complement)
percent sequence identity values (at the exon and
EST levels)
spliced EST-genomic sequence alignment
Validation: criteria for validating putative EST
occurrences on the genome
EST coverage
similarity between the EST and genomic sequences
e.g., >80% of the EST must match the genome, at
>90% sequence identity
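
The example thresholds above (more than 80% of the EST covered, at more than 90% identity) translate into a simple filter. The sketch below is one possible reading; the function and parameter names are hypothetical, and identity is computed here over the aligned EST bases only.

```python
def is_valid_match(est_length, aligned_bases, matching_bases,
                   min_coverage=0.80, min_identity=0.90):
    """Accept an EST-genome match only if coverage and identity are high enough.

    est_length:     total length of the EST in bases.
    aligned_bases:  number of EST bases included in the alignment.
    matching_bases: number of aligned bases identical to the genome.
    """
    if est_length <= 0 or aligned_bases <= 0:
        return False
    coverage = aligned_bases / est_length
    identity = matching_bases / aligned_bases
    return coverage > min_coverage and identity > min_identity

# Toy usage: a 500-bp EST with 450 bases aligned, 430 of them identical.
print(is_valid_match(500, 450, 430))
```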

22
Technical challenges

cDNA
Sequencing errors and polymorphisms
Interspecies contamination
Low quality EST data
Gene model
Multiple gene homologues
Alternative splicing
Interleaving and overlapping of genes
Genomic sequence
Repetitive elements
Genomic contamination
Genomic sequence representation
Large data size
3 billion bp in the human genome
2.8 billion bp in dbEST

23
Source: primary cDNA data
24
Source: underlying gene model

Multiple gene homologues
generate multiple EST matches
need to distinguish the true match based on
sequence similarity
complicated by sequencing errors in cDNA data

[Figure labels: EST; ortholog (true match); paralogs 1, 2, 3]
25
Source: underlying gene model

Alternative splicing
a single gene gives rise to more than one mRNA
sequence and protein product
may occur as a result of tissue specificity, or
to activate different regulatory pathways
cannot be identified by ab initio methods

[Figure labels: genomic sequence; mRNA transcript 1; mRNA transcript 2]
26
Source: underlying gene model

Interleaving and overlapping of genes
genes located in the introns of another gene
overlapping exons from different genes
difficult to detect with ab initio methods

[Figure labels: Gene 1; Gene 2]
27
Source: genomic sequence

Repetitive elements
classes
LINEs (Long Interspersed Nuclear Elements) --
7,000 bp
SINEs (Short Interspersed Nuclear Elements) --
300 bp -- e.g., Alu
low complexity regions -- e.g., ACACACACACACACAC
tandem repeats -- e.g., CAGCAGCAGCAG
occur in large numbers in the genome
considerably increase the size of the computation

28
Source: genomic sequence

Genomic contamination
unspliced introns (A)
internal priming (B)
these artifacts can only be resolved by
clustering the ESTs on the genomic axis, or in
conjunction with other prediction methods

[Figure: (A) EST with an unspliced intron, shown against the genome; (B) EST arising from a false (non-genic) primer, shown against the genome; labeled sequence: AATATAAA]
29
Source: genomic sequence

Genomic sequence representation
ideal view: one sequence per chromosome

public sequences: BACs, contigs, ordered and
oriented to approximate full chromosomes
possible mis-ordering and mis-orienting
incomplete genomic sequence

[Figure label: Gap]
30
Source: genomic sequence

Celera genome assembly
generated using the Whole Genome Shotgun (WGS)
method and a compartmentalized sequence assembler
sequence: partially ordered and oriented
collection of scaffolds
scaffolds: ordered and oriented collection of
contigs
known mean and distribution of gap lengths

[Figure labels: fragments; BACs (finished or unordered collections of contigs); shared fragments; contig ordering and orienting with mate-pairs; scaffolds; Gap(μ, σ²); example sequence ...ACCGATCACGTATCTAGCGATCTTAAGGCTATCCCATGCGAGACTTAGCTTACGGNNNCATTCGAGCGGATCTATCTGAGCT....]
31
Source: genomic sequence
[Figure labels: scaffold; contigs; BACtigs; genomic sequence; fragments]
32
Strategies for large scale EST mapping

Direct mapping with an exact cDNA-genomic
sequence alignment method (SIM4, EST_GENOME)
divide the genome into n overlapping fragments
align the EST against each of the genomic
fragments

Time required
SIM4 - 0.3 s per EST per Mb (1 EST vs. genome in
15 minutes)
EST_GENOME - even slower
Too expensive!
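
The cost estimate above can be reproduced with a few lines of arithmetic. The rate (0.3 s per EST per Mb), the genome size (3 billion bp), and the dbEST size (3.7 million ESTs) are the figures quoted in this talk; the single-CPU assumption and the function name are added here for illustration.

```python
SECONDS_PER_EST_PER_MB = 0.3   # SIM4 rate quoted on this slide
GENOME_MB = 3_000              # 3 billion bp human genome
DBEST_SIZE = 3_700_000         # dbEST size quoted earlier in the talk

def direct_mapping_seconds(n_ests, genome_mb=GENOME_MB,
                           rate=SECONDS_PER_EST_PER_MB):
    """Total alignment time if every EST is aligned against the whole genome."""
    return n_ests * genome_mb * rate

one_est_minutes = direct_mapping_seconds(1) / 60
all_ests_cpu_years = direct_mapping_seconds(DBEST_SIZE) / (3600 * 24 * 365)
print(f"one EST: {one_est_minutes:.0f} min")
print(f"all of dbEST: {all_ests_cpu_years:.0f} CPU-years")
```

One EST takes about 15 minutes, as stated on the slide, and mapping all of dbEST this way would take on the order of a hundred CPU-years, which is why direct mapping is ruled out.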

33
Strategies for large scale EST mapping

Mapping of ESTs to the genome via the (predicted)
mRNA transcripts
map each of the ESTs on the set of (predicted)
mRNA transcripts, or genes with known genomic
locations
align the EST against the genomic fragment
containing the gene for the EST with an exact
alignment method

Faster than exact mapping
Can be used to improve existing gene models, but
not to discover new ones

34
Strategies for large scale EST mapping

Two-stage mapping of ESTs to the genome
detect potential EST matches on the genome with a
fast similarity search program (signal finding)
blastn, MUMer, tfastx
align the EST against the bounded genomic region
containing the signal with an exact alignment
method (polishing)
SIM4, EST_GENOME

[Figure: steps 1 and 2 of the two-stage mapping; labels: EST; EST signal; genome; bounded genomic regions containing the EST signal]
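
A toy version of the signal-finding stage is sketched below. It replaces blastn or MUMmer with exact k-mer lookup, which is a deliberate simplification, and it assumes the EST maps to a single locus so that all hits can be enclosed in one bounded region. The function names, the flank size, and the minimum hit count are hypothetical parameters. The returned region is what would then be passed to an exact aligner such as SIM4 for polishing.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def find_signal_region(est, genome_len, index, k, flank=1000, min_hits=2):
    """Stage 1: locate a bounded genomic region containing the EST signal.

    Returns (start, end), end exclusive, padded by `flank` on each side,
    or None if fewer than `min_hits` k-mers of the EST occur in the genome.
    """
    hits = sorted(pos
                  for j in range(len(est) - k + 1)
                  for pos in index.get(est[j:j + k], ()))
    if len(hits) < min_hits:
        return None
    start = max(0, hits[0] - flank)
    end = min(genome_len, hits[-1] + k + flank)
    return start, end

# Toy usage: a two-exon EST whose exons are 30 bp apart on the genome.
exon1, exon2 = "ACGATCGACTAG", "CATGCATCGGAC"
genome = "T" * 20 + exon1 + "G" * 30 + exon2 + "T" * 20
index = build_kmer_index(genome, k=8)
print(find_signal_region(exon1 + exon2, len(genome), index, k=8, flank=5))
```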
35
Repeat detection and resolution

Repeats represent 40% of the sequence of the
human genome
Some repeats can be found in the 3' UTRs of the
genes
Spurious priming can produce repetitive ESTs
In tests using dbEST, 1% of the ESTs found
accounted for 99% of the EST signals
Resolution Strategies
repeat mask the genome prior to mapping using,
e.g., RepeatMasker
repeat mask the EST data prior to mapping
selectively mask only those ESTs with large
numbers of occurrences, during mapping
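
The third strategy, masking only ESTs with large numbers of occurrences, could look like the sketch below. The input layout (one (est_id, genomic_position) pair per signal), the threshold, and the function name are assumptions; the talk does not say what cutoff was used.

```python
from collections import Counter

def repetitive_ests(signals, max_occurrences=10):
    """Return the ids of ESTs whose signal count exceeds the threshold.

    signals: iterable of (est_id, genomic_position) pairs produced by
             the signal-finding stage.
    """
    counts = Counter(est_id for est_id, _ in signals)
    return {est_id for est_id, n in counts.items() if n > max_occurrences}

# Toy usage: estA hits 50 places, estB and estC only a few.
signals = ([("estA", pos) for pos in range(50)]
           + [("estB", 100), ("estC", 200), ("estC", 900)])
print(repetitive_ests(signals))
```

ESTs flagged this way can then be repeat-masked or set aside, while the remaining ESTs proceed to polishing unchanged.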

36
Overview

Biological motivation
Methods in gene prediction
Mapping of large EST data sets
Applications of EST data mining

37
EST data mining

Gene prediction by genomic EST clustering
(previously discussed)
Generation of gene indices by EST clustering and
assembly
5' and 3' UTR reconstruction
Detection of alternatively spliced gene variants

38
Gene indices

Quality and vector trim the EST sequences
Cluster the ESTs in groups based on sequence
similarity
Assemble the ESTs in each cluster using a
multiple alignment program
For each cluster, select a consensus sequence
(EST assembly)
Each EST assembly is a potential mRNA transcript
Detect potential splice variants by pairwise
comparisons between highly similar EST assemblies

39
5' and 3' UTR reconstruction

Map the ESTs on the genomic axis
Cluster the EST matches along the genomic axis in
the area surrounding the predicted transcripts,
in a manner consistent with the GenBank
annotation
Determine putative 3' mRNA transcript ends in the
vicinity of the 3'-most EST-genomic alignments
Use genomic information (e.g., polyadenylation
signals, AATAAA) to validate the 3' UTR ends
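
The last validation step can be sketched as a search for the AATAAA hexamer just upstream of a putative 3' end. This is a simplified illustration for forward-strand transcripts only; the function name and the 40-bp search window are assumptions, not values given in the talk.

```python
def has_polya_signal(genome, end_pos, window=40, signal="AATAAA"):
    """Check for a polyadenylation signal shortly before a putative 3' end.

    genome:  forward-strand genomic sequence.
    end_pos: 0-based position just past the last base of the transcript.
    window:  number of bases upstream of end_pos to search.
    """
    start = max(0, end_pos - window)
    return signal in genome[start:end_pos]

# Toy usage: a signal 20 bp upstream of the putative transcript end.
genome = "C" * 100 + "AATAAA" + "C" * 20 + "G" * 50
print(has_polya_signal(genome, end_pos=126))
```

A reverse-strand transcript would need the reverse complement (TTTATT) searched downstream of the alignment start instead.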

40
Detection of alternative splices

Using EST consensus information
cluster the ESTs to create gene indices
determine the consensus sequence for each cluster
compare highly similar consensus sequences to
detect putative alternatively spliced exons
(indel blocks)
Using the EST-genomic sequence alignments
cluster the EST matches along the genomic axis to
infer possible exon models
determine (internal) exons that are present in
some, but not all, ESTs in the cluster
(alternatively spliced)
collect EST evidence for alternatively spliced
variants
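
The genomic-alignment approach can be sketched as follows: an exon counts as alternatively spliced if it is internal in at least one EST and falls entirely inside an intron of another EST in the same cluster. The function name, the input layout, and the requirement that exon coordinates agree exactly between ESTs are simplifying assumptions of this example.

```python
def skipped_exons(est_models):
    """Find internal exons that some ESTs in a cluster skip.

    est_models: dict mapping est_id to a sorted list of (start, end)
                exon coordinates on the genomic axis, end exclusive.
    Returns a dict mapping each skipped exon to the sorted list of
    EST ids that splice over it.
    """
    internal = {exon
                for exons in est_models.values()
                for exon in exons[1:-1]}
    result = {}
    for exon in sorted(internal):
        skipping = sorted(
            est_id
            for est_id, exons in est_models.items()
            if exon not in exons
            # The exon lies within one of this EST's introns.
            and any(a[1] <= exon[0] and exon[1] <= b[0]
                    for a, b in zip(exons, exons[1:]))
        )
        if skipping:
            result[exon] = skipping
    return result

# Toy usage: est2 skips the middle exon seen in est1 and est3.
models = {
    "est1": [(100, 200), (300, 400), (500, 600)],
    "est2": [(100, 200), (500, 600)],
    "est3": [(150, 200), (300, 400), (500, 550)],
}
print(skipped_exons(models))
```

The number of ESTs that include or skip each exon gives a rough measure of the EST evidence for each variant.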

41
References

Lewin B (2000) Genes VII, Oxford University Press
Inc., New York, ISBN 0-19-879276-X.
Burge C, and Karlin S. (1997) Prediction of
complete gene structures in human genomic DNA, J
Mol Biol. 268(1):78-94.
Kulp D, Haussler D, Reese MG, and Eeckman FH.
(1996) A generalized hidden Markov model for the
recognition of human genes in DNA, Proc Int Conf
Intell Syst Mol Biol. 4:134-42.
Krogh A, Mian IS, and Haussler D. (1994) A hidden
Markov model that finds genes in E. coli DNA,
Nucleic Acids Res. 22(22):4768-78.
Solovyev VV, Salamov AA, and Lawrence CB. (1994)
Predicting internal exons by oligonucleotide
composition and discriminant analysis of
spliceable open reading frames, Nucleic Acids
Res. 22(24):5156-63.
Salamov AA, and Solovyev VV. (2000) Ab initio
gene finding in Drosophila genomic DNA, Genome
Res. 10(4):516-22.

42
References

Gelfand MS, Mironov AA, and Pevzner PA. (1996)
Gene recognition via spliced sequence alignment,
Proc Natl Acad Sci USA 93(17):9061-6.
Mott R. (1997) EST_GENOME: a program to align
spliced DNA sequences to unspliced genomic DNA,
Comput Appl Biosci. 13(4):477-8.
Florea L, Hartzell G, Zhang Z, Rubin GM, and
Miller W. (1998) A computer program for
aligning a cDNA sequence with a genomic DNA
sequence, Genome Res. 8(9):967-74.
Florea L, and Walenz B. (in preparation)
ESTMapper: Massive EST Mapping.
Batzoglou S, Pachter L, Mesirov JP, Berger B, and
Lander ES. (2000) Human and mouse gene
structure: comparative analysis and application
to exon prediction, Genome Res. 10(7):950-8.
Bafna V, and Huson DH. (2000) The conserved exon
method for gene finding, Proc Int Conf Intell
Syst Mol Biol. 8:3-12.
Quackenbush J, Liang F, Holt I, Pertea G, and
Upton J. (2000) The TIGR gene indices:
reconstruction and representation of expressed
gene sequences, Nucleic Acids Res. 28(1):141-5.

43
References

Venter JC, Adams MD, Myers EW, Li PW, Mural RJ,
Sutton GG, Smith HO, Yandell M, Evans CA, Holt
RA, et al. (2001) The sequence of the human
genome, Science 291(5507):1304-51.
Gautheret D, Poirot O, Lopez F, Audic S, and
Claverie JM. (1998) Alternate polyadenylation in
human mRNAs: a large-scale analysis by EST
clustering, Genome Res. 8(5):524-30.
Kan Z, Rouchka EC, Gish WR, and States DJ. (2001)
Gene structure prediction and alternative
splicing analysis using genomically aligned ESTs,
Genome Res. 11(5):889-900.
Kan Z, Gish W, Rouchka E, Glasscock J, and States
D. (2000) UTR reconstruction and analysis using
genomically aligned EST sequences, Proc Int Conf
Intell Syst Mol Biol. 8:218-27.
Ji H, Zhou Q, Wen F, Xia H, Lu X, and Li Y.
(2001) AsMamDB: an alternative splice database of
mammals, Nucleic Acids Res. 29(1):260-3.
