Computational Molecular Biology

Slides: 57
Provided by: stefa98

1
DNA sequence analysis: Gene prediction
  • Gene prediction methods
  • Gene indices
  • Mapping cDNA on genomic DNA
  • Applications

2
DNA sequence analysis: Gene prediction
[Figure: gene structure. Labels: promoter; 5' UTR; exon 1; exon 2; exon n-1; exon n; 3' UTR; protein coding sequence]
3
Gene prediction: Strategies for detecting ORFs / exons
  • Distribution of Stop-codons
  • Codon usage
  • Hexamer frequencies
  • Prediction of the coding frame
  • Splice site recognition (eukaryotes only)
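The first strategy in the list, the distribution of stop codons, can be illustrated with a minimal sketch. It scans the three forward reading frames and reports stop-free stretches, which are candidate ORFs. The function name `open_stretches` and the `min_codons` cutoff are assumptions made for this illustration; real gene finders combine this signal with codon usage, hexamer frequencies and splice site models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_stretches(seq, min_codons=3):
    """Return (frame, start, end) for stop-free stretches in frames 1-3.

    start and end are 0-based, end-exclusive; only the forward strand is scanned.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        stretch_start = frame
        last = frame
        for pos in range(frame, len(seq) - 2, 3):
            last = pos + 3
            if seq[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current open stretch.
                if (pos - stretch_start) // 3 >= min_codons:
                    found.append((frame + 1, stretch_start, pos))
                stretch_start = pos + 3
        # Stretch that runs to the end of the sequence without a stop.
        if (last - stretch_start) // 3 >= min_codons:
            found.append((frame + 1, stretch_start, last))
    return found

if __name__ == "__main__":
    for frame, start, end in open_stretches("ATGAAACCCGGGTAAATG"):
        print(f"frame {frame}: {start}..{end}")
```

In non-coding sequence, stop codons occur by chance roughly every 21 codons, so unusually long stop-free stretches in one frame hint at a coding region in that frame.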

4
Gene prediction: Codon usage (single exon)
[Figure: codon usage plot for Frame 1, Frame 2 and Frame 3, with coding and non-coding levels marked]
5
Gene prediction: Codon usage (single exon)
[Figure: codon usage plot for Frame 1, Frame 2 and Frame 3, with coding and non-coding levels and the correct start marked]
6
Gene prediction: Codon usage (multiple exons)
Exons: 208..295, 1029..1349, 1500..1688,
2686..2934, 3326..3444, 3573..3680, 4135..4309,
4708..4846, 4993..5096, 7301..7389, 7860..8013,
8124..8405, 8553..8713, 9089..9225, 13841..14244
[Figure: codon usage plot for Frame 1, Frame 2 and Frame 3, with coding and non-coding levels and splice sites marked]
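As a small worked example, the exon coordinates listed on this slide can be turned into intron intervals, whose ends are where the splice sites lie. This sketch assumes the coordinates are 1-based and inclusive (as in database feature tables); the helper names are illustrative only.

```python
# Exon coordinates from the slide, read as 1-based inclusive (start, end) pairs.
EXONS = [
    (208, 295), (1029, 1349), (1500, 1688), (2686, 2934), (3326, 3444),
    (3573, 3680), (4135, 4309), (4708, 4846), (4993, 5096), (7301, 7389),
    (7860, 8013), (8124, 8405), (8553, 8713), (9089, 9225), (13841, 14244),
]

def introns_between(exons):
    """Return the intron intervals lying between consecutive exons."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]

def exon_lengths(exons):
    """Length in bp of each exon (inclusive coordinates)."""
    return [end - start + 1 for start, end in exons]

if __name__ == "__main__":
    for start, end in introns_between(EXONS):
        print(f"intron {start}..{end} ({end - start + 1} bp)")
    print("total exon length:", sum(exon_lengths(EXONS)), "bp")
```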
7
Gene prediction: Codon usage (multiple exons)
Exons: 208..295, 1029..1349, 1500..1688,
2686..2934, 3326..3444, 3573..3680, 4135..4309,
4708..4846, 4993..5096, 7301..7389, 7860..8013,
8124..8405, 8553..8713, 9089..9225, 13841..14244
[Figure: codon usage plot for Frame 1, Frame 2 and Frame 3, with coding and non-coding levels and splice sites marked]
8
Gene prediction: Additional criteria
  • Detection of start codons
  • Detection of potential promoter elements
  • Detection of repetitive sequences (mostly
    untranslated)
  • Homology to known genes of related organisms

9
Gene prediction: Software
  • GENSCAN (C. Burge, S. Karlin)
  • GRAIL (neural network; Uberbacher et al.)
  • MZEF (M. Zhang, 1997)
  • FGENEH, Hexon (V. Solovyev et al., 1994)
  • Genie, etc.
  • All programs use dynamic programming to detect
    the optimal solution
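To make the dynamic programming idea concrete, here is a toy sketch that picks the highest-scoring chain of non-overlapping exon candidates. It is not the algorithm of any of the programs listed above, whose models are far richer; the candidate scores and the function name `best_exon_chain` are hypothetical.

```python
from bisect import bisect_left

def best_exon_chain(candidates):
    """Pick non-overlapping exon candidates with maximal total score.

    candidates: list of (start, end, score) with inclusive coordinates.
    Returns (total_score, chain), the chain being ordered along the sequence.
    """
    exons = sorted(candidates, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i candidates.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier candidates that end before this one starts.
        j = bisect_left(ends, start, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]

if __name__ == "__main__":
    # Hypothetical candidate exons with made-up scores.
    candidates = [(208, 295, 5.0), (250, 400, 6.0),
                  (1029, 1349, 4.0), (1300, 1688, 3.0)]
    print(best_exon_chain(candidates))
```

Each candidate is either taken (added to the best chain that ends before it starts) or skipped, so the optimum is found without enumerating every combination of candidates.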

10
DNA sequences in public databases
  • Human: 4 million ESTs, 130,000 RNAs
  • Mouse: 2.7 million ESTs, 30,000 RNAs

11
Expressed sequence tags (EST)
  • Reverse transcriptase stops randomly

[Figure: mRNA]
12
Expressed sequence tags (EST)
[Figure: EST generation. Labels: clone (mRNA fragment); vector (known sequence); 3' primer; deciphered sequence (EST); < 700 bp; average 1500 bp]
13
Expressed sequence tags (EST)
  • Isolation of mRNAs from tissue(s)
  • Generation of cDNAs reflecting parts of the RNAs
  • Cloning of cDNAs into a vector (often random
    orientation)
  • End sequencing of the clones

14
Generation of ESTs: basecalling problems
[Figure: basecalling problems close to the 5' end of the EST, close to the 3' end of the EST, and missing bases]
15
Coverage of an mRNA by ESTs
[Figure: ESTs covering a putative mRNA. Labels: 5' UTR; exon 1; exon 2; 3' UTR; poly(A) tail (AAAAAA...)]
16
Characteristics of ESTs
  • Highly redundant
  • Low sequence quality
  • (Cheap)
  • Reflect expressed genes
  • May be tissue/stage specific

17
Gene indices
Clustering of the EST and mRNA sequences of an
organism to reduce redundancy in the sequence
data. Goal: each cluster represents one gene or
mRNA.
  • UniGene (NCBI)
  • TIGR Gene Indices
  • STACK (SANBI)
  • GeneNest (DKFZ,MPI)

18
Gene indices: GeneNest workflow
[Figure: workflow. EMBL database and UniGene database, each followed by quality clipping; then BLAST/QUASAR search and clustering; then assembly and consensus sequences; then visualization]
19
Gene indices: Quality clipping
In order to cluster based on gene-specific
sequence data, the following steps have to be
performed:
  • Removal of vector sequence
  • Masking of repetitive sequences (e.g. Alu)
  • Removal of terminal sequences of low quality
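Two of these steps can be sketched in a few lines, assuming per-base quality values (for example Phred scores) and repeat positions are already available from upstream tools. The threshold of 20 and both function names are assumptions for illustration. Vector removal is not shown because it needs an alignment against the known vector sequence.

```python
def clip_low_quality_ends(seq, quals, min_quality=20):
    """Trim terminal bases whose quality is below min_quality.

    Returns the clipped sequence and the kept (start, end) range,
    0-based and end-exclusive.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have equal length")
    left = 0
    while left < len(seq) and quals[left] < min_quality:
        left += 1
    right = len(seq)
    while right > left and quals[right - 1] < min_quality:
        right -= 1
    return seq[left:right], (left, right)

def mask_regions(seq, regions, mask_char="N"):
    """Replace the given 0-based, end-exclusive regions (e.g. Alu hits) by mask_char."""
    chars = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)

if __name__ == "__main__":
    print(clip_low_quality_ends("ACGTACGT", [5, 10, 30, 30, 30, 30, 8, 3]))
    print(mask_regions("ACGTACGT", [(2, 5)]))
```

Masking instead of deleting repeats keeps the coordinates of the read intact while preventing repeat-only matches from joining unrelated sequences.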

20
Gene indices: Clustering
Sequences are usually clustered if the matching
part between two sequences fulfills several
(empirical) criteria:
  • Minimal identity (e.g. > 95%)
  • Minimal length of match (e.g. > 40 bp)
  • No internal matches (TIGR gene indices)
  • Same origin of tissue (only STACK)
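A minimal sketch of the first two criteria, assuming pairwise matches (for example from a BLAST search) are already given as tuples of two sequence IDs, percent identity and match length. Sequences connected by any qualifying match end up in the same cluster (single linkage, implemented with union-find). The internal-match and tissue-origin criteria are not modelled, and all names here are illustrative.

```python
def cluster_sequences(ids, matches, min_identity=95.0, min_length=40):
    """Single-linkage clustering of sequences from pairwise matches.

    matches: iterable of (id_a, id_b, percent_identity, match_length).
    Returns a sorted list of sorted clusters.
    """
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, length in matches:
        if identity > min_identity and length > min_length:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for seq_id in ids:
        clusters.setdefault(find(seq_id), []).append(seq_id)
    return sorted(sorted(members) for members in clusters.values())

if __name__ == "__main__":
    ids = ["est1", "est2", "est3", "est4"]
    matches = [
        ("est1", "est2", 98.0, 120),
        ("est2", "est3", 96.5, 55),
        ("est3", "est4", 99.0, 30),   # too short
        ("est1", "est4", 90.0, 200),  # identity too low
    ]
    print(cluster_sequences(ids, matches))
```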

21
Gene indices: Assembly
Sequences in a cluster are assembled to group
those sequences which are globally similar,
resulting in:
  • Contigs, reflecting parts of different
    transcripts
  • One consensus sequence per contig
  • A relative order of the sequences (alignment)

22
Gene indices: Consensus sequences
  • Reduced error rate
  • Consensus is often longer than any single
    contributing sequence
  • Efficient database search
  • Detection of exon/intron boundaries and
    alternative splice variants
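The first two points can be seen in a small sketch that derives a consensus by majority vote over the columns of an existing alignment. It assumes the rows are already aligned and padded with "-" where a sequence does not cover a column; assemblers such as those named in this talk use more elaborate, quality-aware schemes. The function name is illustrative.

```python
from collections import Counter

def majority_consensus(aligned_rows, gap="-"):
    """Majority-vote consensus over the columns of a padded alignment."""
    width = max(len(row) for row in aligned_rows)
    consensus = []
    for col in range(width):
        bases = [row[col] for row in aligned_rows
                 if col < len(row) and row[col] != gap]
        if bases:
            # Most frequent base in this column wins.
            consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)

if __name__ == "__main__":
    rows = [
        "ACGTAC----",
        "--GTTCGA--",  # carries one deviating base at column 4
        "----ACGATT",
    ]
    print(majority_consensus(rows))
```

In this example each read covers six bases while the consensus spans ten, and the single deviating base in the second read is outvoted by the other two.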

23
Gene indices: Alignment
[Figure: alignment of the sequences of a contig with the consensus]
24
Gene indices: Alignment software
  • Phrap (Phil Green)
  • CAP3 (X. Huang)
  • TIGR assembler
  • GAP4 (R. Staden)

25
GeneNest visualization (http://genenest.molgen.mpg.de)
26
GeneNest visualization (http://genenest.molgen.mpg.de)
27
TIGR Gene Indices (http://www.tigr.org/)
[Figure: alignment scheme]
28
UniGene (http://www.ncbi.nlm.nih.gov/UniGene)
29
UniGene (http://www.ncbi.nlm.nih.gov/UniGene)
30
Mapping of consensus sequences on genomic DNA
genomic sequence
31
Mapping cDNA on genomic DNA
32
Gene indicesApplications
  • Detection of exon/intron boundaries
  • Detection of alternative splicing
  • Detection of Single Nucleotide Polymorphisms
  • Genome annotation
  • Analysis of gene expression
  • Genome-genome comparison

33
Alternative Splicing
hnRNA
34
Alignment of EST consensus sequences and genomic target
[Figure: EST consensus sequences aligned to the genomic sequence]
35
Detection of the appropriate genomic target sequence
Local similarity of EST consensus and genomic
DNA: > 96% identity
[Figure: genomic sequence]
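The identity threshold can be checked with a short sketch once a local alignment between the EST consensus and a genomic region is available as two gapped strings of equal length. The function names and the convention of counting gap columns as mismatches are assumptions made for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns of a pairwise alignment.

    Columns with a gap in one sequence count as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

def is_genomic_target(aligned_consensus, aligned_genomic, min_identity=96.0):
    """True if the local match exceeds the identity threshold from the slide."""
    return percent_identity(aligned_consensus, aligned_genomic) > min_identity

if __name__ == "__main__":
    print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
    print(is_genomic_target("ACGTACGTAC", "ACGTTCGTAC"))
```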
36
Cutting out genomic target sequence
genomic sequence
37
Alternative Splicing (mapping on genomic DNA)
[Figure: genomic sequence]
38
SpliceNest (http://SpliceNest.molgen.mpg.de)
39
Alternative Splicing (additional exon)
Splice variants of adenylosuccinate lyase
[Figure labels: unspliced?; skipped exon; gene prediction errors?]
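A skipped exon of the kind labelled on this slide can be found by comparing the exon chains of two variants after both have been mapped to the same genomic sequence. The sketch below assumes exons are given as (start, end) pairs with identical coordinates where the variants agree; the function name and the example coordinates are hypothetical.

```python
def skipped_exons(reference, variant):
    """Exons of `reference` that `variant` lacks although it contains
    exons on both sides of them.

    Both arguments are lists of (start, end) genomic coordinates,
    ordered along the genomic sequence.
    """
    present = set(variant)
    shared = [i for i, exon in enumerate(reference) if exon in present]
    if len(shared) < 2:
        return []
    first, last = shared[0], shared[-1]
    return [reference[i] for i in range(first + 1, last)
            if reference[i] not in present]

if __name__ == "__main__":
    # Hypothetical coordinates for two variants of one gene.
    full_length = [(100, 200), (300, 400), (500, 600), (700, 800)]
    shorter = [(100, 200), (500, 600), (700, 800)]
    print(skipped_exons(full_length, shorter))
```

Requiring flanking exons on both sides separates a genuinely skipped exon from a variant that is merely incomplete at its 5' or 3' end, which is common for EST-derived consensus sequences.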
40
Alternative Splicing
Splice variants of the APECED gene
[Figure labels: alternative variants; number of sequences; genomic sequence]
41
Alternative splicing
42
Alternative Splicing (alternative donor site)
43
Alternative Splicing
44
Alternative Splicing (alternative exons)
45
SpliceNest (hypothetical gene Hs16936)
46
Single Nucleotide Polymorphisms (SNP)
  • SNPs are single base differences within one
    species
  • Several million SNPs detected in humans
  • SNPs may be related to diseases

47
Single Nucleotide Polymorphisms (SNP)
SNP or basecalling error?
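One simple heuristic for this question is to require that the minor base in an alignment column is supported by more than one sequence, since an isolated deviation is more likely a basecalling error. The sketch below assumes equal-length aligned rows; the threshold `min_support=2` and the function name are assumptions for illustration, and real pipelines would also use base quality values.

```python
from collections import Counter

def snp_candidates(aligned_rows, min_support=2, gap="-"):
    """Return (column, major_base, minor_base) for candidate SNP columns.

    A column qualifies if its second most frequent base occurs in at
    least min_support sequences.
    """
    width = min(len(row) for row in aligned_rows)
    hits = []
    for col in range(width):
        counts = Counter(row[col] for row in aligned_rows if row[col] != gap)
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[1][1] >= min_support:
            hits.append((col, ranked[0][0], ranked[1][0]))
    return hits

if __name__ == "__main__":
    rows = ["ACGTA", "ACGTA", "ATGTA", "ATGTC", "ACGTA"]
    # Column 1 has C three times and T twice; column 4 has a single deviating C.
    print(snp_candidates(rows))
```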
48
Genome Annotation / Ensembl (http://www.ensembl.org)
49
Analysis of gene expression: tissue specificity
  • Counting the frequency of ESTs derived from a
    specific tissue within one sequence cluster
  • Searching for clusters/contigs which are tissue
    specific (e.g. tumor)
  • Searching for alternative splice variants which
    are potentially tissue specific
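The counting approach in the first two points can be sketched as follows, assuming each EST carries a tissue annotation from its source library. The cutoffs (`min_fraction`, `min_ests`), the data layout and the function names are assumptions for illustration, not values from the talk.

```python
from collections import Counter

def tissue_counts(cluster_ests, tissue_of):
    """Count the ESTs of one cluster per tissue of origin."""
    return Counter(tissue_of[est] for est in cluster_ests)

def tissue_specific_clusters(clusters, tissue_of, tissue,
                             min_fraction=0.9, min_ests=3):
    """Names of clusters whose ESTs come predominantly from `tissue`."""
    hits = []
    for name, ests in clusters.items():
        counts = tissue_counts(ests, tissue_of)
        total = sum(counts.values())
        if total == 0 or total < min_ests:
            continue  # too few ESTs for a meaningful statement
        if counts[tissue] / total >= min_fraction:
            hits.append(name)
    return sorted(hits)

if __name__ == "__main__":
    # Hypothetical clusters and tissue annotations.
    clusters = {"c1": ["e1", "e2", "e3"], "c2": ["e4", "e5", "e6", "e7"]}
    tissue_of = {"e1": "tumor", "e2": "tumor", "e3": "tumor",
                 "e4": "liver", "e5": "tumor", "e6": "brain", "e7": "liver"}
    print(tissue_counts(clusters["c2"], tissue_of))
    print(tissue_specific_clusters(clusters, tissue_of, "tumor"))
```

Because EST libraries differ greatly in size and are often normalized, raw counts like these are only a rough indication of expression and are best compared against the total number of ESTs per tissue.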

50
Analysis of gene expression: PDZ-domain containing
protein PDZK1 (Hs.15456)
51
Analysis of gene expression: small muscular
protein, SMPX (Hs.88492)
52
Analysis of gene expression: hypothetical protein
(Hs.32343)
53
Analysis of gene expression: non-redundant gene set
  • Selection of optimal clones
  • Generation of gene-specific PCR-products

54
Analysis of gene expression: optimal clones
  • clone availability
  • type of clone library
  • length of the clone
  • relative position to the consensus sequence
  • homology to other genes
  • existence of repetitive elements

55
Analysis of gene expression: gene-specific
PCR-products
[Figure: putative gene with exon A, exon B and exon C, and the consensus sequence]
56
Analysis of gene expression: optimal gene-specific
PCR-product
  • minimal similarity to other genes
  • minimal content of repetitive sequences
  • not spanning several exons
  • +/- constant length of the PCR-products of
    different genes