Title: 1-Month Practical Master Course
11-Month Practical Master Course Genome Analysis
Jaap Heringa
Centre for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit Amsterdam, The Netherlands
www.ibivu.cs.vu.nl heringa_at_cs.vu.nl
3Biological Sequence Analysis
- Pair-wise sequence alignment
- Residue exchange matrices
- Multiple sequence alignment
- Phylogeny
4DNA sequence
.....acctc ctgtgcaaga acatgaaaca nctgtggttc
tcccagatgg gtcctgtccc aggtgcacct gcaggagtcg
ggcccaggac tggggaagcc tccagagctc aaaaccccac
ttggtgacac aactcacaca tgcccacggt gcccagagcc
caaatcttgt gacacacctc ccccgtgccc acggtgccca
gagcccaaat cttgtgacac acctccccca tgcccacggt
gcccagagcc caaatcttgt gacacacctc ccccgtgccc
ccggtgccca gcacctgaac tcttgggagg accgtcagtc
ttcctcttcc ccccaaaacc caaggatacc cttatgattt
cccggacccc tgaggtcacg tgcgtggtgg tggacgtgag
ccacgaagac ccnnnngtcc agttcaagtg gtacgtggac
ggcgtggagg tgcataatgc caagacaaag ctgcgggagg
agcagtacaa cagcacgttc cgtgtggtca gcgtcctcac
cgtcctgcac caggactggc tgaacggcaa ggagtacaag
tgcaaggtct ccaacaaagc aaccaagtca gcctgacctg
cctggtcaaa ggcttctacc ccagcgacat cgccgtggag
tgggagagca atgggcagcc ggagaacaac tacaacacca
cgcctcccat gctggactcc gacggctcct tcttcctcta
cagcaagctc accgtggaca agagcaggtg gcagcagggg
aacatcttct catgctccgt gatgcatgag gctctgcaca
accgctacac gcagaagagc ctctc.....
5Genome size
- Organism: Number of base pairs
- φX-174 virus: 5,386
- Epstein-Barr virus: 172,282
- Mycoplasma genitalium: 580,000
- Haemophilus influenzae: 1.8 × 10^6
- Yeast (S. cerevisiae): 12.1 × 10^6
- Human: 3.2 × 10^9
- Wheat: 16 × 10^9
- Lilium longiflorum: 90 × 10^9
- Salamander: 100 × 10^9
- Amoeba dubia: 670 × 10^9
6Three main principles
- DNA makes RNA makes Protein
- Structure more conserved than sequence
- Sequence → Structure → Function
7Regulation, signalling cascades, chaperonins,
compartmentalisation
8How to go from DNA to protein sequence
A piece of double-stranded DNA:
5' attcgttggcaaatcgcccctatccggc 3'
3' taagcaaccgtttagcggggataggccg 5'
DNA direction is from 5' to 3'
9How to go from DNA to protein sequence
6-frame translation using the codon table (last lecture)
5' attcgttggcaaatcgcccctatccggc 3'
3' taagcaaccgtttagcggggataggccg 5'
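The 6-frame translation of the strand shown above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code; the function names (`reverse_complement`, `translate`, `six_frame_translation`) are chosen here for illustration and do not come from the lecture. Codons containing an unknown base are translated as X, and trailing bases that do not fill a codon are ignored.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate one reading frame; '*' marks a stop codon, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate three frames on the given strand and three on its reverse complement."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame_translation("attcgttggcaaatcgcccctatccggc").items():
    print(name, protein)
```

Frame +1 of the example strand reads att cgt tgg caa atc gcc cct atc cgg, which translates to IRWQIAPIR.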
10Evolution and three-dimensional protein structure
information
Isocitrate dehydrogenase: the distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution).
Dean, A. M. and G. B. Golding, Pacific Symposium on Biocomputing 2000
11Protein Sequence-Structure-Function
Sequence → Structure → Function
- Ab initio prediction and folding
- Threading
- Function prediction from structure
- Homology searching (BLAST)
12Widely used tool for homology detection: PSI-BLAST
- Heuristic tool to cut down computations required for database searching (1M sequences in DB)
- Sensitivity gained by iteratively finding hits (local alignments) and repeating search
(Diagram labels: Q, hits, DB, T, PSSM)
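To make the PSSM step more concrete, here is a deliberately simplified sketch of how a position-specific scoring matrix could be derived from a set of hits. It is not PSI-BLAST's actual procedure: it assumes the hits are already aligned, ungapped and of equal length, uses a uniform background frequency, and adds a pseudocount of 1 per residue. The names `build_pssm` and `score_segment` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Return one dict of log-odds scores (log2) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(aligned_hits[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned_hits:
            if seq[col] in counts:  # skip gaps and unknown symbols
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_segment(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


hits = ["MDAG", "MDAA", "MEAG"]  # made-up example hits
pssm = build_pssm(hits)
print(round(score_segment(pssm, "MDAG"), 2))
```

In an iterative search, a matrix of this kind would replace the query in the next round, so that residues conserved among the hits are rewarded more strongly than a single query sequence would allow.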
13Threading
Template sequence
Compatibility score
Query sequence
Template structure
15Fold recognition by threading
Fold 1 Fold 2 Fold 3 Fold N
Query sequence
Compatibility scores
16Bioinformatics
- "Nothing in Biology makes sense except in the light of evolution" (Theodosius Dobzhansky (1900-1975))
- "Nothing in bioinformatics makes sense except in the light of Biology"
17Divergent evolution
- Ancestral sequence: ABCD
- Descendant sequences: ACCD (B → C, mutation) and ABD (C → ø, deletion)
- Pairwise alignment:
  ACCD      ACCD
  AB-D  or  A-BD
18Divergent evolution
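- Ancestral sequence: ABCD
- Descendant sequences: ACCD (B → C, mutation) and ABD (C → ø, deletion)
- Pairwise alignment:
  ACCD      ACCD
  AB-D  or  A-BD
- True alignment: ACCD / AB-D, because the gap then falls at the position of the deleted C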
19Mutations under divergent evolution
(Figure: panels (a)-(d), each showing an ancestral sequence diverging into Sequence 1 and Sequence 2)
- One substitution - one visible
- Two substitutions - one visible
- Two substitutions - none visible
- Back mutation - not visible
1 ACCTGTAATC
2 ACGTGCGATC
D = 3/10 (fraction of different sites (nucleotides))
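The fraction of different sites D for the two example sequences can be computed directly. The sketch below assumes the sequences are already aligned and of equal length; the function name `p_distance` is chosen here for illustration.

```python
def p_distance(seq1, seq2):
    """Fraction of aligned sites at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


print(p_distance("ACCTGTAATC", "ACGTGCGATC"))  # 0.3
```

Because multiple substitutions at one site and back mutations are not visible, D only counts observed differences and can underestimate the number of substitutions that actually occurred.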
20Convergent evolution
- Often with shorter motifs (e.g. active sites)
- Motif (function) has evolved more than once independently, e.g. starting with two very different sequences adopting different folds
- Sequences and associated structures remain different, but (functional) motif can become identical
- Classical example: serine proteinase and chymotrypsin
21Serine proteinase (subtilisin) and chymotrypsin
- Different evolutionary origins, no sequence similarity
- Similarities in the reaction mechanisms. Chymotrypsin, subtilisin and carboxypeptidase C have a catalytic triad of serine, aspartate and histidine in common: serine acts as a nucleophile, aspartate as an electrophile, and histidine as a base.
- The geometric orientations of the catalytic residues are similar between families, despite different protein folds.
- The linear arrangements of the catalytic residues reflect different family relationships. For example, the catalytic triad in the chymotrypsin clan (SA) is ordered HDS, but is ordered DHS in the subtilisin clan (SB) and SDH in the carboxypeptidase clan (SC).
22A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
23What can sequence tell us about structure (HSSP)
Sander and Schneider, 1991
24Searching for similarities
What is the function of the new gene?
The lazy investigation (i.e., no biological experiments, just bioinformatics techniques):
- Find a set of similar protein sequences to the unknown sequence
- Identify similarities and differences
- For long proteins: identify domains first
25Evolutionary and functional relationships
- Reconstruct evolutionary relation
  - Based on sequence
    - Identity (simplest method)
    - Similarity
    - Homology (common ancestry: the ultimate goal)
  - Other (e.g., 3D structure)
- Functional relation
  - Sequence → Structure → Function
26Searching for similarities
Common ancestry is more interesting: it makes it more likely that genes share the same function.
Homology: sharing a common ancestor. It is a binary property (yes/no) and a nice tool: when (an unknown) gene X is homologous to (a known) gene G, we gain a lot of information on X, because what we know about G can be transferred to X as a good suggestion.
27Biological definitions for related sequences
- Homologues are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues.
- Orthologues are similar sequences in two different organisms that have arisen due to a speciation event. Orthologues typically retain identical or similar functionality throughout evolution.
- Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event.
- Xenologues are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.
28How to evolve
- Important distinction:
  - Orthologues: homologous proteins in different species (all deriving from same ancestor)
  - Paralogues: homologous proteins in same species (internal gene duplication)
- In practice, to recognise orthology, the bi-directional best hit is used in conjunction with a database search program (this is called an operational definition)
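The bi-directional best hit criterion can be sketched as follows. This is a minimal illustration, assuming the database search has already produced a similarity score for each (query, subject) pair in both directions; the gene names and scores below are made up, and the function names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical search results between species A and species B.
a_vs_b = {("A1", "B1"): 250, ("A1", "B2"): 90, ("A2", "B1"): 120, ("A2", "B2"): 80}
b_vs_a = {("B1", "A1"): 245, ("B1", "A2"): 118, ("B2", "A1"): 88, ("B2", "A2"): 85}
print(bidirectional_best_hits(a_vs_b, b_vs_a))  # [('A1', 'B1')]
```

Here A2 also hits B1 best, but B1 prefers A1, so only the pair A1/B1 is reported as putative orthologues.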
29So this means
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
30Example today: Pairwise sequence alignment needs sense of evolution
Global dynamic programming
Sequences: MDAGSTVILCFVG and, after evolution, MDAASTILCGS
The search matrix is filled using an Amino Acid Exchange Matrix and Gap penalties (open, extension)
Resulting alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS
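Global dynamic programming can be sketched as below. To keep the example short it makes two simplifying assumptions that differ from the slide: residues are scored with a plain match/mismatch scheme instead of an amino acid exchange matrix, and gaps use a single linear penalty per gap position instead of separate open and extension penalties. The parameter values are illustrative, so the alignment it returns for the slide's sequences need not be identical to the one shown.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_seq1, aligned_seq2)."""
    n, m = len(seq1), len(seq2)
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align two residues
                              score[i - 1][j] + gap,      # gap in seq2
                              score[i][j - 1] + gap)      # gap in seq1

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned1, aligned2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned1)), "".join(reversed(aligned2))


total, row1, row2 = needleman_wunsch("MDAGSTVILCFVG", "MDAASTILCGS")
print(total)
print(row1)
print(row2)
```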
31How to determine similarity
Frequent evolutionary events at the DNA level:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion
We will restrict ourselves to these events
32A DNA sequence alignment (nucleotide one-letter code)
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
A protein sequence alignment (amino acid one-letter code)
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
33Dynamic programming: Scoring alignments
- Substitution (or match/mismatch): DNA, proteins
- Gap penalty:
  - Linear: gp(k) = ak
  - Affine: gp(k) = b + ak
  - Concave, e.g.: gp(k) = log(k)
The score for an alignment is the sum of the scores over all alignment columns
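34Dynamic programming: Scoring alignments
S(a,b) = (sum of substitution scores s over all aligned residue pairs) - (sum of gp(k) over all gaps of length k)
gp(k) = gapinit + k × gapextension (affine gap penalties)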
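The gap penalty functions named in this part of the lecture (linear, affine and concave, with k the gap length) can be compared numerically. The parameter values a = 1 and b = 10 below are assumptions picked for illustration, and the function names are hypothetical.

```python
import math


def linear_gap(k, a=1.0):
    """gp(k) = a*k: every gap position costs the same."""
    return a * k


def affine_gap(k, b=10.0, a=1.0):
    """gp(k) = b + a*k: a fixed cost to open a gap plus a cost per position."""
    return b + a * k


def concave_gap(k):
    """gp(k) = log(k): each additional gap position costs less than the previous one."""
    return math.log(k)


for k in (1, 2, 5, 10):
    print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 2))
```

With the affine form, one gap of length 2 costs b + 2a, whereas two separate gaps of length 1 cost 2b + 2a, so the scheme favours fewer, longer gaps.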
35DNA: define a score for match/mismatch of letters
Simple:
      A     C     G     T
A     1    -1    -1    -1
C    -1     1    -1    -1
G    -1    -1     1    -1
T    -1    -1    -1     1
Used in genome alignments:
      A     C     G     T
A    91  -114   -31  -123
C  -114   100  -125   -31
G   -31  -125   100  -114
T  -123   -31  -114    91
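As an illustration of how the two DNA matrices behave, the sketch below scores an ungapped alignment column by column. The two sequences are the pair used earlier for the fraction of different sites; the function names are chosen here for illustration.

```python
NUCLEOTIDES = "ACGT"

SIMPLE = [
    [1, -1, -1, -1],
    [-1, 1, -1, -1],
    [-1, -1, 1, -1],
    [-1, -1, -1, 1],
]

GENOME = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]


def as_lookup(matrix):
    """Turn a 4x4 matrix (rows and columns in A, C, G, T order) into a dict."""
    return {(a, b): matrix[i][j]
            for i, a in enumerate(NUCLEOTIDES)
            for j, b in enumerate(NUCLEOTIDES)}


def ungapped_score(seq1, seq2, matrix):
    """Sum of substitution scores over all columns of an ungapped alignment."""
    lookup = as_lookup(matrix)
    return sum(lookup[a, b] for a, b in zip(seq1, seq2))


print(ungapped_score("ACCTGTAATC", "ACGTGCGATC", SIMPLE))  # 4
print(ungapped_score("ACCTGTAATC", "ACGTGCGATC", GENOME))  # 477
```

In the second matrix the A/G and C/T exchanges (transitions) are penalised with -31, far less than the other mismatches, whereas the simple matrix treats every mismatch alike.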
36Dynamic programming: Scoring alignments
T D W V T A L K
T D W L - - I K
Amino Acid Exchange Matrix (20×20)
Affine gap penalties (open, extension): 10, 1
Score = s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px + s(L,I) + s(K,K)
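The score of this example alignment can be worked through in code. The exchange scores below are illustrative assumptions (BLOSUM62-like values for just the residue pairs that occur), since the slide does not state which matrix is used; the gap-opening penalty Po = 10 and extension penalty Px = 1 follow the values shown on the slide. The function names are hypothetical.

```python
# Assumed exchange scores for the residue pairs in this example only.
EXCHANGE = {("T", "T"): 5, ("D", "D"): 6, ("W", "W"): 11,
            ("V", "L"): 1, ("L", "I"): 2, ("K", "K"): 5}


def pair_score(a, b):
    """Symmetric lookup; raises KeyError for pairs missing from the toy table."""
    return EXCHANGE[(a, b)] if (a, b) in EXCHANGE else EXCHANGE[(b, a)]


def alignment_score(row1, row2, gap_open=10, gap_ext=1):
    """Sum of exchange scores minus an affine penalty for each run of gap columns.

    Simplification: consecutive gap columns count as one gap, regardless of
    which row carries the gap symbol.
    """
    total = 0
    gap_len = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_len += 1
            continue
        if gap_len:
            total -= gap_open + gap_ext * gap_len
            gap_len = 0
        total += pair_score(a, b)
    if gap_len:  # gap running to the end of the alignment
        total -= gap_open + gap_ext * gap_len
    return total


print(alignment_score("TDWVTALK", "TDWL--IK"))  # 18
```

With these assumed values the terms are 5 + 6 + 11 + 1 - 10 - 2×1 + 2 + 5 = 18.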