Title: Course Comparative Genomics
1Course Comparative Genomics
- Techniques (How do you deal with a question?)
- Questions (What kind of interesting questions can be asked?)
- This is not about computers or webservers, but about how to solve certain questions, and how to make up new ones.
2Bioinformatics (III) is not (only) about "Question A should be solved by this technique and Question B should be solved by that technique", but about understanding the data and the techniques to a level where YOU can decide what would be a good way of answering question A (if you already have a question, that is).
3The omics sciences (predictions / validation / biological interpretation wanted)
- Genomics: DNA sequencing. Prediction of genes, proteins, protein function, pathways, regulation
- Transcriptomics: expression analysis. Prediction of coregulation, promoters
- ChIP-omics: analysis of which proteins bind where on the DNA. Prediction of regulatory elements
- Proteomics: analysis of the protein complement, protein complexes
- Metabolomics: analysis of the metabolite concentrations. Pathway prediction
4Intrinsic value of omics data: detecting large-scale and genomic trends in molecular evolution
- Comparative genomics: evolution of gene content (gene loss, gene duplication, horizontal transfer etc.)
- Comp. transcriptomics: evolution of gene regulation
- Comp. proteomics: evolution of protein-protein interactions
- ? Comp. metabolomics
5Why comparative genomics, evolution? I
"Everything put together, sooner or later falls apart, there is nothing to it, nothing to it. You can cry, you can lie, for all the good it'll do you, you can die." (Paul Simon)
- Things that are functional fall apart at a less alarming rate than things that are not
- Things that are conserved tend to be conserved because they have some selective value
- Selective value <-> function
6II
- Genomics data are very noisy; examining the overlap between genomics data sets makes the conclusions that can be drawn more reliable.
7III
- Genomics data are one-dimensional (e.g. sequence, expression, structure, protein-protein interaction). By combining them we might be able to approach a molecular understanding of the cell.
8Day 1-4: Sequence analysis
- gene prediction
- homology prediction
- orthology prediction
- finding signals in the DNA
Day 5-7: Genome comparison, pathway prediction
- comparative genome analysis
- pathway prediction
- comparative genome analysis for pathway prediction
Day 8: Networks
Day 9: Human genome data browsing
Day 10: Systems biology, ChIP data
Day 11: Exam
Course days: April 6, 7, 8, 9, 15, 16, 17, 20, 21. Exam: 22
9Day 1: Gene prediction
Predicting which DNA is coding and which is not
ATGATGCAGTTCTCGAAAATGCATGGCCTTGGCAACGATTTTATGGTCGT
CGACGCGGTAACGCAGAATGTCTTTTTTTCACCGGAGCTGATTCGTCGCC
TGGCTGATCGGCACCTGGGGGTAGGGTTTGACCAACTGCTGGTGGTTGAG
CCGCCGTATGATCCTGAACTGGATTTTCACTATCGCATTTTCAATGCTGA
TGGCAGTGAAGTGGCGCAGTGCGGCAACGGTGCGCGCTGCTTTGCCCGTT
TTGTGCGTCTGAAAGGACTGACCAATAAGCGTGATATCCGCGTCAGCACC
GCCAACGGGCGGATGGTTCTGACCGTCACCGATGATGATCTGGTCCGCGT
AAATATGGGCGAACCCAACTTCGAACCTTCCGCCGTGCCGTTTCGCGCTA
ACAAAGCGGAAAAGACCTATATTATGCGCGCCGCCGAGCAGACAATCTTA
TGCGGCGTGGTGTCGATGGGAAATCCGCATTGCGTGATTCAGGTCGATGA
TGTCGATACCGCGGCGGTAGAAACGCTTGGTCCTGTTCTGGAAAGCCACG
AGCGTTTTCCGGAGCGCGCCAATATCGGTTTTATGCAAGTGGTTAAGCGC
GAGCATATTCGTTTACGCGTTTATGAGCGTGGGGCAGGAGAAACCCAGGC
CTGCGGCAGCGGCGCGTGTGCGGCGGTTGCAGTAGGGATTCAGCAAGGTT
TGCTGGCCGAAGAAGTACGCGTGGAACTCCCCGGCGGTCGTCTTGATATC
GCCTGGAAAGGTCCGGGTCACCCGTTATATATGACTGGCCCGGCGGTACA
TGTCTACGACGGATTTATTCATCTATGA
10Day 2: Homology prediction (what type of activity does it have?)
>CEK12D12_1 K12D12.1 11849346..11854492 Chromosome II
MSDSDSEFSIEDSPKKKTAPKKEKASPKKKKDDANE
SMVMTEEDRNVFTSIDKKGGGSKQMAIEDIYQKKSQLEHILLRPDTYIGS
VEHTEKTPMWVYNMEESKLEQRDISYVPGLYKIYDEILVNAADNKQRDPK
MNTIKITINKEKNEISVYNNGKGIPVTQHKVEKVYVPELIFGTLLTSSNY
NDDEKKVTGGRNGYGAKLCNIFSTKFTLETSSRDYKSAFKQTWIKNMTRD
EEPKIVKSTDEDFTKITFSPDLAKFKMKELDDDICHLMARRAYDVAGSSK
GVAVFLNGKRIPIKGFEDYVQMYTSQFNNEGEPLKIAYEQVGDRWQVALA
LSEKGFQQVSFVNSIATTKGGRHVDYVADQMVAKFIDSIKRKLTKTSMNI
KPFQIKNHMWVFVNCLIENPTFDSQTKETMTLQQKQFGSTCVLSEKFSKA
ASSVGITDAVMSWVRFKQMDDLNKKCSKTKTSKLKGIPKLEDANDAGTKN
SQQCTLILTEGDSAKTLAVSGLSVVGRDKYGVFPLRGKLLNVREGNMKQI
ADNAEVNAMIKILGLQYKKKYETEDDFKTLRYGKLMVMADQDQDGSHIKG
LVINFIHHFWPSLIQRNFVEEFITPIVKATKGKEEVSFYSLPEYSEWRMN
TDNWKSYKIKYYKGLGTSTSKEAKEYFLDMVRHRIRFKYNGADDDNAVDM
AFSKKKIEERKDWLSKWMREKKDRKQQGLAEEYLYNKDTRFVTFKDFVNR
ELVLFSNLDNERSIPCLVDGFKPGQRKVLFACFKRADKREVKVAQLAGAV
AEISAYHHGEQSLMGTIVNLAQDYVGSNNINLLLPIGQFGTRLQGGKDSA
SARYIFTQLSPVTRTLFPAHDDNVLRFLYEENQRIEPEWYCPIIPMVLVN
GAQGIGTGWSTNIPNYNPRELVKNIKRLIAGEPQKALAPWYKNFRGKIIQ
IDPSRFACYGEVAVLDDNTIEITELPIKQWTQDYKEKVLEGLMESSDKKS
PVIVDYKEYHTDTTVKFVVKLSPGKLRELERGQDLHQVFKLQAVINTTCM
VLFDAAGCLRTYTSPEAITQEFYDSRQEKYVQRKEYLLGVLQAQSKRLTN
QARFILAKINNEIVLENKKKAAIVDVLIKMKFDADPVKKWKEEQKLKELR
ESGEIELDEDDLAAVAVEEDEAISSAAKAVETKLSDYDYLVGMALIKLSE
EEKNKLIKESEEKMAEVRVIEKKTWQDLWHEDLDNFVSELDKQEAREKAD
QDASIKNAAKKLAADAKTGRGPKKNVCTEVLPSKDGQRIEPMLDAATKAK
YEKMSQPKKERVKKEPKEPKEPKKVKKEGQDIKKFMSPAAPKTAKKEKSD
GFNSDMSEESDVEFDEGIDFDSDDDGVEREDVVSKPKPRTGKGAAKAEVI
DLSDDDEVPAKKPAPAKKAAPKKKKSEFSDLSGGDSDEEAEKKPSTSKKP
SPKKAAPKTAEPKSKAVTDFFGASKKNGKKAAGSDDEDDESFVVAPREKS
GRARKAPPTYDVDSGSDSDQPKKKRGRVVDSDSD
DNA topoisomerase II DNA topoisomerase
IV
11Day 3: Orthology prediction (what is the biological function?)
>CEK12D12_1 K12D12.1 11849346..11854492 Chromosome II
MSDSDSEFSIEDSPKKKTAPKKEKASPKKKKDDANE
SMVMTEEDRNVFTSIDKKGGGSKQMAIEDIYQKKSQLEHILLRPDTYIGS
VEHTEKTPMWVYNMEESKLEQRDISYVPGLYKIYDEILVNAADNKQRDPK
MNTIKITINKEKNEISVYNNGKGIPVTQHKVEKVYVPELIFGTLLTSSNY
NDDEKKVTGGRNGYGAKLCNIFSTKFTLETSSRDYKSAFKQTWIKNMTRD
EEPKIVKSTDEDFTKITFSPDLAKFKMKELDDDICHLMARRAYDVAGSSK
GVAVFLNGKRIPIKGFEDYVQMYTSQFNNEGEPLKIAYEQVGDRWQVALA
LSEKGFQQVSFVNSIATTKGGRHVDYVADQMVAKFIDSIKRKLTKTSMNI
KPFQIKNHMWVFVNCLIENPTFDSQTKETMTLQQKQFGSTCVLSEKFSKA
ASSVGITDAVMSWVRFKQMDDLNKKCSKTKTSKLKGIPKLEDANDAGTKN
SQQCTLILTEGDSAKTLAVSGLSVVGRDKYGVFPLRGKLLNVREGNMKQI
ADNAEVNAMIKILGLQYKKKYETEDDFKTLRYGKLMVMADQDQDGSHIKG
LVINFIHHFWPSLIQRNFVEEFITPIVKATKGKEEVSFYSLPEYSEWRMN
TDNWKSYKIKYYKGLGTSTSKEAKEYFLDMVRHRIRFKYNGADDDNAVDM
AFSKKKIEERKDWLSKWMREKKDRKQQGLAEEYLYNKDTRFVTFKDFVNR
ELVLFSNLDNERSIPCLVDGFKPGQRKVLFACFKRADKREVKVAQLAGAV
AEISAYHHGEQSLMGTIVNLAQDYVGSNNINLLLPIGQFGTRLQGGKDSA
SARYIFTQLSPVTRTLFPAHDDNVLRFLYEENQRIEPEWYCPIIPMVLVN
GAQGIGTGWSTNIPNYNPRELVKNIKRLIAGEPQKALAPWYKNFRGKIIQ
IDPSRFACYGEVAVLDDNTIEITELPIKQWTQDYKEKVLEGLMESSDKKS
PVIVDYKEYHTDTTVKFVVKLSPGKLRELERGQDLHQVFKLQAVINTTCM
VLFDAAGCLRTYTSPEAITQEFYDSRQEKYVQRKEYLLGVLQAQSKRLTN
QARFILAKINNEIVLENKKKAAIVDVLIKMKFDADPVKKWKEEQKLKELR
ESGEIELDEDDLAAVAVEEDEAISSAAKAVETKLSDYDYLVGMALIKLSE
EEKNKLIKESEEKMAEVRVIEKKTWQDLWHEDLDNFVSELDKQEAREKAD
QDASIKNAAKKLAADAKTGRGPKKNVCTEVLPSKDGQRIEPMLDAATKAK
YEKMSQPKKERVKKEPKEPKEPKKVKKEGQDIKKFMSPAAPKTAKKEKSD
GFNSDMSEESDVEFDEGIDFDSDDDGVEREDVVSKPKPRTGKGAAKAEVI
DLSDDDEVPAKKPAPAKKAAPKKKKSEFSDLSGGDSDEEAEKKPSTSKKP
SPKKAAPKTAEPKSKAVTDFFGASKKNGKKAAGSDDEDDESFVVAPREKS
GRARKAPPTYDVDSGSDSDQPKKKRGRVVDSDSD
DNA gyrase A, DNA gyrase B: DNA replication, repair
12Day 4
Finding of signals in the DNA, RNA
13Day 5: Comparative genome analysis. Which genes do two genomes share, which do they not share, and can we relate that to their phenotype?
14Day 6: Pathway prediction and pathway evolution
15Day 7: Comparative genome analysis for pathway prediction
[Figure: deoxyribonucleoside catabolism pathway. Metabolites: deoxycytidine; deoxyuridine, deoxythymidine; purine deoxyribonucleosides; deoxyribose-1-P; deoxyribose-5-P; glyceraldehyde-3-P, acetaldehyde. Enzymes: Cdd, DeoA, DeoD, deoB, deoC. Genome panel (M. genitalium, M. tuberculosis): deoD deoC deoA cdd pmm; deoB marked "?".]
16Day 8: Networks
17Day 9
- EST data, SNPs etc.
- Human genome data, human genome browsing
Day 10
- Systems biology
- Analyzing ChIP data
18Daily schedule
- Two lectures per day (starting at 9:00)
- Exercises and assignments (computer, pen and paper)
- Not only being able to use sequence search engines, but also understanding how they work and how to interpret the results.
- DNA, information -> biological knowledge, understanding
19Day 1a: Predicting genes from DNA sequences
What is a gene?
- Classical notion: a gene encodes a function (pea-flower color)
- More modern: a gene encodes an RNA, which encodes a protein
- how about introns, exons, 3' and 5' UTRs?
- how about promoters?
- how about alternative splicing?
- how about tRNAs, rRNAs, miRNAs etc.?
For the moment: predicting the protein-coding pieces in the DNA
20Gene prediction: intrinsic signals
1) Length of the open reading frame (ORF): the probability of an ORF of at least 300 bases/100 AA arising by chance is $(61/64)^{100} \approx 0.008$
Caveats:
1) it is a statistical argument: 8 out of 1000 will be there by chance alone
2) variations in nucleotide frequencies: TAA, TAG and TGA are GC-poor -> GC-rich sequences have many long ORFs
3) it misses short proteins/exons
Watch out for:
1) variations in the genetic code (start codons, codon reassignment, e.g. Mycoplasma TGA -> Trp)
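The chance argument above can be checked numerically. The sketch below assumes independent nucleotides; the function names and the simple GC-content model (A and T equally frequent, G and C equally frequent) are illustrative assumptions, not part of the course material.

```python
def p_stop(gc: float = 0.5) -> float:
    """Probability that a random codon is TAA, TAG or TGA.

    Assumes independent positions with P(G) = P(C) = gc / 2
    and P(A) = P(T) = (1 - gc) / 2 (an illustrative model).
    """
    at = (1.0 - gc) / 2.0
    g = gc / 2.0
    # TAA + TAG + TGA
    return at * at * at + at * at * g + at * g * at


def p_orf_by_chance(n_codons: int, gc: float = 0.5) -> float:
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1.0 - p_stop(gc)) ** n_codons


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        print(f"GC = {gc:.1f}: P(100 codons without stop) = {p_orf_by_chance(100, gc):.4f}")
```

At 50% GC this reproduces the (61/64)^100 figure; raising the GC content lowers the stop-codon frequency, which is why GC-rich sequences contain many long spurious ORFs.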
21ORF length: comparison of annotated and spurious ORFs in S. cerevisiae
Basrai MA, Hieter P, and Boeke J. Genome Research 1997, 7:768-771
22Gene prediction: intrinsic signals
2) The sequence is coding for amino acids
- bias in nucleotide usage relative to non-coding (coding sequences are more constrained, e.g. at the second codon position)
- period-three/six etc. pattern
- amino-acid bias, codon bias
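One way to see the position-dependent nucleotide bias is to tabulate nucleotide frequencies separately for the three codon positions of a candidate reading frame. This is a minimal sketch; the function name `position_composition` is a hypothetical helper, and it only looks at frame 0 of the forward strand.

```python
from collections import Counter


def position_composition(seq: str) -> list[dict[str, float]]:
    """Nucleotide frequencies at codon positions 1, 2 and 3 (frame 0)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    result = []
    for pos in range(3):
        counts = Counter(seq[pos:usable:3])
        total = sum(counts.values())
        # Counter returns 0 for nucleotides that were never seen
        result.append({nt: (counts[nt] / total if total else 0.0) for nt in "ACGT"})
    return result


if __name__ == "__main__":
    for i, freqs in enumerate(position_composition("ATGATGCAGTTCTCGAAAATGCATGGCCTT"), start=1):
        print(i, freqs)
```

In a real coding sequence the three rows tend to differ from each other, whereas in non-coding DNA they look alike; that difference is the period-three signal.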
24Codon bias (relative to non-coding) is partly an
amino-acid bias
25Other Signals
- ribosome binding sites (AGGAGG)
- transcription termination signals
- polyadenylation signals (AAUAAA..10-30N..CT.10-20NGT)
- promoters
- intron/exon boundaries (exon GT...intron...AG exon)
26Combining the signals into a probabilistic model: the Hidden Markov Model (HMM)
A Markov model is a mathematical model of a series of events/things (e.g. nucleotides) in which the probability of the next event is determined by the last X events -> Markov model of the Xth order. Generally one speaks of a Markov chain when the next event in the sequence is solely determined by the previous event. Protein-coding sequences are modelled as Markov models of the second/fifth/eighth etc. order (because of codon bias).
27Markov Chains as Models of Sequence Generation
- 0th-order
- 1st-order
- 2nd-order
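The difference between the orders can be made concrete by estimating the conditional probabilities from a training sequence. The sketch below is a plain maximum-likelihood count without pseudocounts; `train_markov` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def train_markov(seq: str, order: int) -> dict[str, dict[str, float]]:
    """Estimate P(next symbol | previous `order` symbols) from one sequence."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for i in range(order, len(seq)):
        context = seq[i - order:i]  # empty string for a 0th-order model
        counts[context][seq[i]] += 1
    return {
        context: {sym: n / sum(nxt.values()) for sym, n in nxt.items()}
        for context, nxt in counts.items()
    }


if __name__ == "__main__":
    dna = "ATGATGCAGTTCTCGAAAATGCATGGCCTTGGCAACGATTTTATGGTCGT"
    print(train_markov(dna, 0))        # plain nucleotide frequencies
    print(train_markov(dna, 1)["A"])   # what follows an A
    print(train_markov(dna, 2)["AT"])  # what follows AT
```

A 0th-order model has a single (empty) context, a 1st-order model has up to 4 contexts, a 2nd-order model up to 16, and so on; higher orders therefore need much more training data.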
28The Hidden in Hidden Markov Models refers to
the fact that there are multiple Markov Models
(coding, non-coding) and that we cannot see
which one we are dealing with
29The main idea behind the usage of HMMs (Bayes' theorem)
$P(S \mid M) = \prod_{i=1}^{n} P(n_i)$
(The probability of any sequence S, given a model M, is the product of the probabilities P (in M) of every position $n_i$ in S.)
$P(M_j \mid S) = \dfrac{P(S \mid M_j)\,P(M_j)}{\sum_{r=1}^{N} P(S \mid M_r)\,P(M_r)}$
(The probability of any model $M_j$ for the sequence S is the probability of the sequence S given that model, times the probability of that model, divided by the sum of the same product over all N models.)
In practice: find the model that gives the highest $P(S \mid M_j)$.
30- In the simplest example, we can simply compare
two models, and reduce it to a (log) odds ratio
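As a minimal sketch of the two-model case: with equal prior probabilities the comparison reduces to summing, per position, the log of the ratio of the two emission probabilities. The function name and the toy 0th-order models below are illustrative assumptions.

```python
import math


def log_odds(seq: str, model_a: dict[str, float], model_b: dict[str, float]) -> float:
    """log2 [ P(seq | model_a) / P(seq | model_b) ] for 0th-order models."""
    return sum(math.log2(model_a[sym] / model_b[sym]) for sym in seq)


if __name__ == "__main__":
    # Toy nucleotide frequencies, chosen only for illustration
    gc_rich = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    score = log_odds("GCGCGGCCAT", gc_rich, uniform)
    print(f"log-odds = {score:.2f} bits")  # > 0 favours gc_rich, < 0 favours uniform
```

Working in log space avoids numerical underflow for long sequences, because the product of many small probabilities becomes a sum.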
31A simple Hidden Markov Model (HMM): Who's in goal?
Save pct Boruc: 75%
Save pct van der Sar: 92%
- Sequence (X = save, O = goal):
- XOXXXXXXOXXXXXXXXXXXXXOXXXXXXXOXXXOXOXXOXXXOXXOXXO
- Total: 50 shots, 40 saves -> save pct 80%
- Assuming only one goalie for the whole sequence (simple Markov chain), the binomial probability of 40 saves in 50 shots is P(Boruc) = 0.099 and P(van der Sar) = 0.004, so P(Boruc)/P(van der Sar) = 25
- What if the goalie can change during the sequence?
- The goalie identity on each shot is the hidden variable (the state)
- HMM algorithms give probabilities for the sequence of goalies, given
- the sequence of hits/misses,
- the probability of changing the goalie,
- the probability of hits and misses (save percentage) of each goalie
- XOXXXXXXOXXXXXXXXXXXXXOXXXXXXXOXXXOXOXXOXXXOXXOXXO
- bbssssssssssssssssssssssssssssbbbbbbbbbbbbbbbbbbbb
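The goalie example can be sketched in a few lines. The first function scores the whole sequence under a single goalie; the second is a standard Viterbi decoder for the case where the goalie may change. The switching probability of 0.1 and the equal starting probabilities are assumptions made for illustration only, so the decoded path need not match the labelling shown on the slide.

```python
import math

SHOTS = "XOXXXXXXOXXXXXXXXXXXXXOXXXXXXXOXXXOXOXXOXXXOXXOXXO"  # X = save, O = goal
SAVE_PCT = {"b": 0.75, "s": 0.92}  # b = Boruc, s = van der Sar


def emission(goalie: str, shot: str) -> float:
    """Probability of the observed outcome for a given goalie."""
    p_save = SAVE_PCT[goalie]
    return p_save if shot == "X" else 1.0 - p_save


def log_likelihood(goalie: str, shots: str) -> float:
    """Log probability of the shot sequence if one goalie played throughout."""
    return sum(math.log(emission(goalie, s)) for s in shots)


def viterbi(shots: str, p_switch: float = 0.1) -> str:
    """Most probable goalie sequence; p_switch is an assumed value."""
    states = list(SAVE_PCT)

    def trans(a: str, b: str) -> float:
        return math.log(p_switch if a != b else 1.0 - p_switch)

    # Assumed equal starting probabilities for both goalies
    score = {g: math.log(0.5) + math.log(emission(g, shots[0])) for g in states}
    back = []
    for shot in shots[1:]:
        prev, score, pointers = score, {}, {}
        for g in states:
            best = max(states, key=lambda h: prev[h] + trans(h, g))
            pointers[g] = best
            score[g] = prev[best] + trans(best, g) + math.log(emission(g, shot))
        back.append(pointers)
    # Trace back from the best final state
    state = max(states, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    ratio = math.exp(log_likelihood("b", SHOTS) - log_likelihood("s", SHOTS))
    print(f"Likelihood ratio Boruc / van der Sar: {ratio:.1f}")
    print(SHOTS)
    print(viterbi(SHOTS))
```

With 40 saves in 50 shots the single-goalie likelihood ratio comes out at about 25 in favour of the 75% goalie; changing `p_switch` shows how strongly the decoded path depends on the assumed probability of a goalie change.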
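32A simple Hidden Markov Model for coding/non-coding areas
[Diagram: two states, Non-Coding and Coding, connected by transition probabilities P, 1-P, Q and 1-Q.]
Non-coding codon probabilities: CGT 1/64, CGG 1/64, AGG 1/64, TAG 1/64, etc.
Coding codon probabilities: CGT (Arg) 1/32, CGG (Arg) 1/128, AGG (Arg) < 1/400, TAG (stop) 1/300, etc.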
33Of course, the number of alternative sequences
of events of coding/non-coding or goalie
A/goalie B increases exponentially with the
length of the sequences. We will not discuss how
the software decides which sequence of events
has the highest probability
34Questionable HMM assumptions: constant probability of a stop codon
35A more complicated HMM for gene prediction (Borodovsky et al.)
36Another HMM issue: variation in GC frequency and codon usage within a genome, among others due to genes recently acquired through horizontal gene transfer (in this case the genes that are derived from a bacteriophage (blue) have a higher GC content than the other genes). One could in principle solve this by deriving multiple HMMs. How?
37Classical Hidden Markov Models for gene prediction use hexamer frequencies (5th-order Markov models, 2 codons; hexamer frequencies were introduced by Jim Fickett), with 7 models (three coding frames, three anti-coding frames, one non-coding).
The Interpolated Markov Model (GLIMMER) uses 8th-order Markov models (three codons) when enough 8-mer sequences are available. If not, it uses 7th, 6th etc. order Markov models.
Major shortcoming of all intrinsic-signal protein-coding prediction: short proteins/exons (< 100 amino acids)
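The idea of falling back to a lower order when a long context is too rare can be sketched as a simple back-off scheme. This is a deliberate simplification and not GLIMMER's actual interpolation, which weights the different orders rather than picking one; the function names and the `min_count` threshold are assumptions.

```python
from collections import Counter, defaultdict


def count_contexts(seqs: list[str], max_order: int) -> dict[str, Counter]:
    """Count which nucleotide follows every context of length 0..max_order."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for seq in seqs:
        for i in range(len(seq)):
            for k in range(min(max_order, i) + 1):
                counts[seq[i - k:i]][seq[i]] += 1
    return counts


def backoff_prob(counts: dict[str, Counter], context: str, nt: str,
                 min_count: int = 10) -> float:
    """P(nt | context), shortening the context until it was seen often enough."""
    while True:
        seen = counts.get(context)
        total = sum(seen.values()) if seen else 0
        if total >= min_count or context == "":
            # Uniform fallback if there is no training data at all
            return seen[nt] / total if total else 0.25
        context = context[1:]  # drop the oldest base: order k -> k - 1


if __name__ == "__main__":
    training = ["ATGGCGAAAGCGCTGGCGTAA", "ATGAAAGCGGCGCTGAAATAA"]
    counts = count_contexts(training, max_order=5)
    print(backoff_prob(counts, "GGCGA", "A", min_count=3))
```

The trade-off is the one named above: long contexts capture codon-level dependencies but need many training sequences, so rare contexts have to borrow strength from shorter ones.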
38Gene prediction in eukaryotes
Relative to prokaryotes, the major complication is the intron/exon structure of genes. The most successful gene-prediction program is GenScan, which includes:
- typical gene density,
- the typical number of exons per gene,
- the distribution of exon sizes for different types of exon,
- the reading frame-specific hexamer composition of coding regions vs the (reading frame-independent) hexamer composition of introns and intergenic regions,
- position-specific composition of the translation initiation (Kozak) and termination signals,
- TATA box, cap site and poly-adenylation signals.
39An example of a well-annotated genome region of Plasmodium (exp. data): the B9 locus of Plasmodium encodes linked, conserved gametocyte-specific genes
[Figure: map of the B9 locus in P. berghei (chromosome 5, scale 0-13 kb) and P. falciparum (chromosome 10), showing OMPDC (housekeeping) and ORF 1, ORF 2, ORF 3, ORF 4/B9 and ORF 5, annotated with "KO performed", "KO failed", "(?)", "Gametocyte specific", "300bp gametocyte specific boundary" and "bi-directional promoter".]
v Lin et al. (2001) NAR. Most P. falciparum 1° sequence data courtesy of sequence releases of TIGR
40Glimmer M prediction of a locus containing split genes has some inaccuracies
[Figure: Glimmer M predictions compared with the actual coding regions (OMP, Orf 1, Orf 2, Orf 3, Orf 4, Orf 5) for P. berghei chr. 5 and P. falciparum chr. 10, scale 0-14 kb; gametocyte-specific transcription is indicated.]
Analysis courtesy of S. Salzberg, TIGR
41Calculating Gene Prediction accuracy
Specificity and Sensitivity
42Gene prediction accuracy measures for eukaryotes
- Basic accuracy measures introduced by Burset and Guigo (1996): Sensitivity (Sn, coverage) and Specificity (Sp, accuracy)
- AC: approximate correlation, $AC = \frac{1}{2}\left(Sn + Sp + \frac{TN}{TN+FP} + \frac{TN}{TN+FN}\right) - 1$
- WE: wrong exons (false positives)
- ME: missing exons (false negatives, not predicted exons)
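These measures are easy to compute from nucleotide-level counts of true/false positives and negatives. The sketch below follows the Burset and Guigo convention in which "specificity" is TP/(TP+FP); the function name and the example counts are made up for illustration, and no guard against empty classes (zero denominators) is included.

```python
def gene_prediction_accuracy(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float, float]:
    """Return (Sn, Sp, AC) from nucleotide-level counts.

    Sn = TP / (TP + FN)   fraction of true coding bases that were predicted
    Sp = TP / (TP + FP)   fraction of predicted coding bases that are real
    AC = approximate correlation, ranging from -1 to 1
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    # Average conditional probability over the four conditional fractions
    acp = (sn + sp + tn / (tn + fp) + tn / (tn + fn)) / 4.0
    ac = 2.0 * (acp - 0.5)
    return sn, sp, ac


if __name__ == "__main__":
    sn, sp, ac = gene_prediction_accuracy(tp=800, fp=200, tn=8800, fn=200)
    print(f"Sn = {sn:.2f}, Sp = {sp:.2f}, AC = {ac:.2f}")
```

Reporting Sn and Sp together matters because either one alone can be made perfect trivially: predicting everything as coding gives Sn = 1, predicting almost nothing gives a high Sp.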
43Accuracy of ab initio gene prediction (Burset and Guigo, 1996)
44Homology-based gene prediction
- Check whether DNA encodes a potential protein that is similar to a known protein (BlastX)
- Check for the presence of a potential protein in a DNA sequence (e.g. a genome): tBlastN
- Compare two DNA sequences for potentially conserved protein-coding sequences (tBlastX)
Complications in eukaryotic DNA: splicing -> spliced alignment (Gelfand et al.)
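All three translated searches rest on the same preprocessing step: conceptually translating DNA in all six reading frames before comparing at the protein level. The sketch below illustrates that step only (it is not how BLAST is implemented) and uses the standard genetic code; organisms with a different code, such as Mycoplasma, would need a different table.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate frame 0; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict[str, str]:
    """Translations of the three forward and three reverse frames."""
    dna = dna.upper()
    strands = (("+", dna), ("-", reverse_complement(dna)))
    return {f"{sign}{frame + 1}": translate(strand[frame:])
            for sign, strand in strands for frame in range(3)}


if __name__ == "__main__":
    for frame, protein in six_frames("ATGATGCAGTTCTCGAAAATGCATGGCCTTGGC").items():
        print(frame, protein)
```

Each of the six translations can then be compared with known protein sequences; frames that are interrupted by many stop codons (`*`) are unlikely to be coding.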
45Using sequence conservation to detect functionally constrained sequences (in this case, using the conservation of exons relative to introns in gene prediction). Comparison of a human gene with its counterparts from mouse, chicken and pufferfish shows that mouse might actually be too close to human to use it for gene prediction. Chicken gives a better discrimination between introns and exons. (From R. Guigo et al., unpubl.)
46Gene prediction accuracy: sensitivity (coverage) and specificity (selectivity), including measures of (protein-coding) sequence conservation. Genewise uses information from sequence alignments (homology-based prediction of coding regions) and sequence information about splice sites etc.
47An example of using sequence conservation to predict (protein-coding) genes: conservation of intron/exon structure (Guigo et al., PNAS 2003)
An example of predictions with aligned introns. RT-PCR-positive predicted protein 3B1 (a novel homolog of Dystrophin) is aligned with its predicted human ortholog (N-terminal regions shown; upper of each row: mouse, lower of each row: human). Each color indicates one coding exon. Three of four predicted splice boundaries (color boundaries) align perfectly. Gaps in the alignment (shown as dashes) may indicate mispredicted regions.
(The problem in using mouse-human to improve gene prediction is that large conserved regions do not code for proteins -> sequence conservation alone is not sufficient. State-of-the-art approaches use a combination of ab initio gene prediction with conservation of certain features.)
48Comparing gene prediction methods: evidence-based works best.
Ab initio: what does that tell us about how well we understand the cell?
49Alternative splicing, genomics approaches: sampling exons to increase the diversity of proteins (Walter Gilbert, 1977)
Example from an Fcε receptor β chain homolog
50Types of alternative splicing
Effects on the proteins
51An example of how sequence analysis of public data can increase our knowledge and understanding (?) of biology. Also an example of typical omics research: low quality in terms of individual data points (in general no experimental follow-up), (presumably) high quality because of the sheer amount of data analyzed. (Be careful of systematic errors though.)
- Before omics, the fraction of genes estimated to be alternatively spliced was 5%; now it stands at 40-60%.
- Discuss the various methodologies and data used.
52EST (Expressed Sequence Tag)
- short, contiguous reverse-transcribed segment from a spliced mRNA, containing 5' and/or coding exons, and/or 3' sequences
- large-scale (4.3 million for man, 2.5 million for mouse, 190K for C. elegans in 2003), low-quality (often single read, possible contamination from genomic sequences etc.) sets of sequences
53EST evidence for alternative splicing
54Subsequent alignment with the genomic DNA
- Extra quality checks
- Presence of the sequence itself in the genomic DNA
- Presence of consensus intronic splice-junction sequences (GT...AG) at the intron/exon boundaries in the genomic DNA
- Check for mutually exclusive splicing to get rid of potential intronic sequences.
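The splice-junction check in the list above is simple to express once an EST-to-genome alignment has proposed intron coordinates. A minimal sketch, assuming 0-based, end-exclusive coordinates on the forward strand; the function name and the toy sequence are hypothetical.

```python
def has_consensus_splice_sites(genomic: str, intron_start: int, intron_end: int) -> bool:
    """True if the proposed intron starts with GT and ends with AG.

    Coordinates are 0-based and end-exclusive (Python slice convention),
    and the intron is assumed to lie on the forward strand.
    """
    intron = genomic[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


if __name__ == "__main__":
    # exon "ATGGCC" + intron "GTAAGTTTTCAG" + exon "GAT" (toy example)
    genomic = "ATGGCCGTAAGTTTTCAGGAT"
    print(has_consensus_splice_sites(genomic, 6, 18))  # proposed intron with GT...AG ends
    print(has_consensus_splice_sites(genomic, 5, 18))  # shifted by one base
```

Gaps in an EST alignment that fail this test are more likely to be alignment or sequencing artefacts than real introns, which is why the check works as a quality filter.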
55Finding alt. splicing by
- Aligning ESTs to genomic DNA (Mironov et al. (1999) aligned ESTs to 392 genes, found alt. splicing in 133 cases)
- Aligning ESTs to proteins (with tBLASTn; Hanke et al. (1999) found alt. splicing in 162 of 475 tested genes, 10 genes exp. control)
- Aligning ESTs to (presumably) introns (Croft et al. (2000) found EST matches to introns in 582 genes, suggesting alt. splicing)
- Aligning ESTs to a set of known mRNAs (Brett et al. (2000) identified 3011 alternatively spliced genes)
- Aligning ESTs to genomic DNA (chromosome 22) (Human Genome Consortium 2001, 145 alt. spliced genes)
- Aligning ESTs and mRNAs to the whole draft genome (Modrek et al. (2001): 6,201 alt. splices in 2,272 genes)
- Examining alternative splicing in the ENCODE regions (Tress et al., PNAS 2007)
56Biology is not perfect. Is the spliceosome?
- What about non-functional splice forms (e.g. if the catalytic domain is missing)?
- What is functional (if the catalytic domain is missing, the protein could still have a regulatory role)?
- What is the functional relevance of the alternative splicing?
57Who cares? Or, the functional impact of alternative splicing
Graveley, B.R., Trends Genet, 2001
58With respect to proteins, the substitution
(mutually exclusive splicing) of internal
elements in a protein is relatively rare, most of
the substitutions occur at N/C termini
Results from the ENCODE project (Tress et al., PNAS 2007)
59Alternative N-termini or C-termini of proteins can affect intra-cellular targeting
TMH = transmembrane helix, NLS = nuclear localization signal
60Alternative splicing tends to affect entire
domains, rather than only parts. When parts of
domains are affected, it targets annotated
functional sites.
61The domains that are not alternatively spliced
out (Constitutive Splicing) tend to be the ones
whose boundaries do not correspond to intron-exon
boundaries, in contrast to the ones that are
alternatively spliced. Alternative Splicing tends
to cut between domains
63Alternative splicing preferentially tends to occur between domains rather than within, but the fraction of broken domains remains high.
- Splicing events occur within Pfam-A (24) hand-curated functional domains in 46.5% of sequence-distinct isoforms, and the figure rises to 71% when all Pfam-defined domains are considered. Although this is a surprisingly high figure, it is still considerably less than might be anticipated. If the same number of splicing events in each alternative isoform were to happen randomly at any of the exon boundaries, they would be expected to fall inside Pfam-A domains in 59.8% of isoforms (84.8% in all Pfam-defined domains). As shown (25), this effect is not due to any correlation between domain and exon boundaries. We found no such correlation.
- Although these results do suggest that there is some favorable selection against splicing events that affect functional domains, the proportion of splicing events that occur inside domains is still high, and, as a result, many of these transcripts are likely to code for proteins with drastically altered structure and function.
Tress et al., PNAS 2007
64Alternative splicing has, not unexpectedly, large
effects on protein structure. Alternative
structures are hard to predict with 3D modeling
techniques
Purple: alternatively spliced part of the protein
65Does alternative splicing really solve the conundrum of why we are so much more complicated than C. elegans?
Caenorhabditis elegans: < 1 mm
66Is alternative splicing an explanation of the
high complexity of man (20.000 genes) compared to
that of the worm (19.000 genes) ? (is there more
alternative splicing in man than in worm)
Brett et al., Nature Gen. 2002
67Dependency of the observed amount of alt.
splicing on the number of available ESTs
68Further reading
- Gene prediction: Mathe, Sagot, Schiex and Rouze (2002) Current methods of gene prediction, their strengths and weaknesses. Nucl. Acids Res. 30, 4103-4117
- Alternative splicing: Kriventseva EV, Koch I, Apweiler R, Vingron M, Bork P, Gelfand MS, Sunyaev S. Increase of functional diversity by alternative splicing. Trends Genet. 2003 Mar;19(3):124-8