Title: Bioinformatics For MNW 2nd Year
1Bioinformatics For MNW 2nd Year
- Jaap Heringa
- FEW/FALW
- Integrative Bioinformatics Institute VU (IBIVU)
- heringa_at_cs.vu.nl
2Current Bioinformatics Unit
- Jens Kleinjung (1/11/02)
- Victor Simossis - PhD (1/12/02)
- Radek Szklarczyk - PhD (1/01/03)
- John Romein (1/12/02, Henri Bal)
3Bioinformatics course 2nd year MNW spring 2003
- Pattern recognition
- Supervised/unsupervised learning
- Types of data, data normalisation, lacking data
- Search image
- Similarity tables
- Clustering
- Principal component analysis
- Discriminant analysis
4Bioinformatics course 2nd year MNW spring 2003
- Protein
- Folding
- Structure and function
- Protein structure prediction
- Secondary structure
- Tertiary structure
- Function
- Post-translational modification
- Prot.-Prot. Interaction -- Docking algorithm
- Molecular dynamics/Monte Carlo
5Bioinformatics course 2nd year MNW spring 2003
- Sequence analysis
- Pairwise alignment
- Dynamic programming (NW, SW, shortcuts)
- Multiple alignment
- Combining information
- Database/homology searching (FASTA, BLAST, statistical issues: E/P values)
6Bioinformatics course 2nd year MNW spring 2003
- Gene structure and gene finding algorithm
- Omics
- DNA makes RNA makes protein
- Expression data, nucleus to ribosome, translation, etc.
- Metabolomics
- Physiomics
- Databases
- DNA, EST
- Protein sequence
- Protein structure
7Bioinformatics course 2nd year MNW spring 2003
- Microarray data
- Protein structure (PDB)
- Proteomics
- Mass spectrometry/NMR/X-ray?
8Bioinformatics course 2nd year MNW spring 2003
- Bioinformatics method development
- IPR issues
- Programming and scripting languages
- Web solutions
- Computational issues
- NP-complete problems
- CPU, memory, storage problems
- Parallel computing
- Bioinformatics method usage/application
- Molecular viewers (RasMol, MolMol, etc.)
9Gathering knowledge
- Anatomy, architecture
- Dynamics, mechanics
- Informatics
- (Cybernetics Wiener, 1948)
- (Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems)
- Genomics, bioinformatics
Rembrandt, 1632
Newton, 1726
10Bioinformatics
Chemistry
Biology Molecular biology
Mathematics Statistics
Bioinformatics
Computer Science Informatics
Medicine
Physics
11Bioinformatics
- Studying informational processes in biological systems (Hogeweg, early 1970s)
- No computers necessary
- Back of envelope OK
Information technology applied to the management
and analysis of biological data (Attwood and
Parry-Smith)
Applying algorithms with mathematical formalisms
in biology (genomics) -- USA
12Bioinformatics in the olden days
- Close to Molecular Biology
- (Statistical) analysis of protein and nucleotide structure
- Protein folding problem
- Protein-protein and protein-nucleotide interaction
- Many essential methods were created early on (BG era)
- Protein sequence analysis (pairwise and multiple alignment)
- Protein structure prediction (secondary, tertiary structure)
13Bioinformatics in the olden days (Cont.)
- Evolution was studied and methods created
- Phylogenetic reconstruction (clustering, NJ method)
14The Human Genome -- 26 June 2000
15The Human Genome -- 26 June 2000
Dr. Craig Venter Celera Genomics -- Shotgun method
Sir John Sulston Human Genome Project
16Human DNA
- There are about 3bn (3 × 10^9) nucleotides in the nucleus of almost all of the trillions (3.5 × 10^12) of cells of a human body (an exception is, for example, red blood cells, which have no nucleus and therefore no DNA): a total of about 10^22 nucleotides!
- Many DNA regions code for proteins, and are called genes (1 gene codes for 1 protein in principle)
- Human DNA contains 30,000 expressed genes
- Deoxyribonucleic acid (DNA) comprises 4 different types of nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G). These nucleotides are sometimes also called bases
17Human DNA (Cont.)
- All people are different, but the DNA of different people varies by only 0.2% or less. So, only 2 letters in 1000 are expected to be different. Over the whole genome of 3 × 10^9 letters, this means that up to about 6 million letters would differ between individuals.
- The structure of DNA is the so-called double helix, discovered by Watson and Crick in 1953, where the two helices are cross-linked by A-T and C-G base pairs (nucleotide pairs: so-called Watson-Crick base pairing).
18Up to here 3/2 10.45-12.30
19DNA compositional biases
- Base composition of genomes
- E. coli: 25% A, 25% C, 25% G, 25% T
- P. falciparum (malaria parasite): 82% A+T
- Translation initiation
- ATG is the near universal motif indicating the
start of translation in DNA coding sequence.
20Some facts about human genes
- Comprise about 3% of the genome
- Average gene length 8,000 bp
- Average of 5-6 exons/gene
- Average exon length 200 bp
- Average intron length 2,000 bp
- 8% of genes have a single exon
- Some exons can be as small as 1 or 3 bp.
- HUMFMR1S is not atypical: 17 exons, 40-60 bp long, comprising 3% of a 67,000 bp gene
21Genetic diseases
- Many diseases run in families and are a result of genes which predispose such family members to these illnesses
- Examples are Alzheimer's disease, cystic fibrosis (CF), breast or colon cancer, or heart diseases.
- Some of these diseases can be caused by a problem within a single gene, such as with CF.
22Genetic diseases (Cont.)
- For other illnesses, like heart disease, at least 20-30 genes are thought to play a part, and it is still unknown which combination of problems within which genes is responsible.
- A problem within a gene means that a single nucleotide, or a combination of nucleotides, within the gene causes the disease (or prevents the body from fighting the disease sufficiently).
- Persons with different combinations of these nucleotides could then be unaffected by these diseases.
23Genetic diseases (Cont.): Cystic Fibrosis
- Known since very early on (Celtic gene)
- Inherited autosomal recessive condition (Chr. 7)
- Symptoms
- Clogging and infection of lungs (early death)
- Intestinal obstruction
- Reduced fertility and (male) anatomical anomalies
- CF gene CFTR has a 3-bp deletion leading to Del508 (Phe) in the 1480 aa protein (epithelial Cl- channel): the protein is degraded in the ER instead of being inserted into the cell membrane
24Genomic Data Sources
- DNA/protein sequence
- Expression (microarray)
- Proteome (X-ray, NMR, mass spectrometry)
- Metabolome
- Physiome (spatial, temporal)
Integrative bioinformatics
25Genomic Data Sources Vertical Genomics
genome
transcriptome
proteome
metabolome
physiome
Dinner discussion Integrative Bioinformatics
Genomics VU
26A gene codes for a protein
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
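The three lines above (coding DNA, mRNA, protein) can be reproduced with a minimal sketch. It assumes the standard genetic code and treats the DNA line as the coding strand, so transcription is just T to U; the names `transcribe`, `translate` and `CODON_TABLE` are illustrative, not from the slides.

```python
# Standard genetic code, written compactly in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
protein = translate(mrna)
```

Here `mrna` is the second line of the slide and `protein` spells out the third.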
27Humans have spliced genes
28DNA makes RNA makes Protein
29Remark
- The problem of identifying (annotating) human genes is considerably harder than the early success story for β-globin might suggest.
- The human factor VIII gene (whose mutations cause hemophilia A) is spread over 186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus 9 kb of exon and 177 kb of intron.
- The biggest human gene yet is for dystrophin. It has > 30 exons and is spread over 2.4 million bp.
30DNA makes RNA makes Protein: Expression data
- More copies of mRNA for a gene lead to more protein
- mRNA can now be measured for all the genes in a cell at once through microarray technology
- Can have 60,000 spots (genes) on a single gene chip
- Colour change gives intensity of gene expression (over- or under-expression)
32Metabolic networks: Glycolysis and Gluconeogenesis
KEGG database (Japan)
33High-throughput Biological Data
- Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming
- genomic sequences
- gene expression data
- mass spec. data
- protein-protein interaction
- protein structures
- ......
34Protein structural data explosion
Protein Data Bank (PDB): 14500 structures (6 March 2001): 10900 X-ray crystallography, 1810 NMR, 278 theoretical models, others...
35Dickerson's formula: equivalent to Moore's law
$n = e^{0.19(y-1960)}$, with $y$ the year.
On 27 March 2001 there were 12,123 3D protein structures in the PDB; Dickerson's formula predicts 12,066 (within 0.5%)!
36Sequence versus structural data
- Despite structural genomics efforts, growth of the PDB slowed down in 2001-2002 (i.e. did not keep up with Dickerson's formula)
- More than 100 completely sequenced genomes
- Increasing gap between structural and sequence data
37Bioinformatics
Bioinformatics
Figure: sciences arranged from large, external (integrative) to small, internal (individual): Human Science, Planetary Science, Cultural Anthropology, Population Biology, Sociology, Sociobiology, Psychology, Systems Biology, Biology, Medicine, Molecular Biology, Chemistry, Physics
38Bioinformatics
- Offers an ever more essential input to
- Molecular Biology
- Pharmacology (drug design)
- Agriculture
- Biotechnology
- Clinical medicine
- Anthropology
- Forensic science
- Chemical industries (detergent industries, etc.)
39High-throughput Biological Data: The data deluge
- Hidden in these data is information that reflects
- existence, organization, activity, functionality
of biological machineries at different levels
in living organisms
Most effectively utilising this information will
prove to be essential for Integrative
Bioinformatics
40Data Issues
- Data collection: getting the data
- Data representation: data standards, data normalisation, ...
- Data organisation and storage: database issues, ...
- Data analysis and data mining: discovering knowledge, patterns/signals, from data, establishing associations among data patterns
- Data utilisation and application: from data patterns/signals to models for bio-machineries
- Data visualization: viewing complex data
- Data transmission: data collection, retrieval, ...
41Up to here 5/2
42Bioinformatics
- "Nothing in Biology makes sense except in the light of evolution" (Theodosius Dobzhansky (1900-1975))
- "Nothing in bioinformatics makes sense except in the light of Biology"
43Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
2 sequences of 300 a.a.: ~10^88 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
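A small sketch of how fast this count grows, assuming the quantity on the slide is the central binomial coefficient (2n)!/(n!)^2 with its Stirling estimate 2^(2n)/sqrt(pi n). The function names are illustrative.

```python
import math


def log10_central_binomial(n: int) -> float:
    """log10 of C(2n, n) = (2n)! / (n!)^2, using exact integer arithmetic."""
    return math.log10(math.comb(2 * n, n))


def log10_stirling_estimate(n: int) -> float:
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


exact_1000 = log10_central_binomial(1000)
approx_1000 = log10_stirling_estimate(1000)
```

For n = 1000 both values lie just above 600, i.e. on the order of 10^600 alignments, which is why exhaustive enumeration is not an option.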
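44Dynamic programming: Scoring alignments
$S_{a,b}$: score of the alignment of sequences $a$ and $b$ (exchange scores of the aligned residue pairs minus the gap penalties)
$gp(k) = p_i + k \cdot p_e$ (affine gap penalties for a gap of length $k$)
$p_i$ and $p_e$ are the penalties for gap initialisation and extension, respectively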
45Dynamic programming: Scoring alignments
T D W V T A L K
T D W L - - I K
Amino Acid Exchange Matrix (20×20)
Gap penalties (open, extension): 10, 1
Score = s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px + s(L,I) + s(K,K)
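The score of this example alignment can be worked through with a minimal sketch. The exchange function `toy_exchange` (+2 for identical residues, -1 otherwise) is a hypothetical stand-in for a real 20×20 matrix; the gap penalties 10 (open) and 1 (extension) are taken as an assumption from the slide.

```python
def score_alignment(row_a, row_b, s, p_open, p_ext):
    """Score a gapped pairwise alignment with affine gap penalties.

    A gap of length k costs p_open + k * p_ext.
    """
    total = 0
    gap_owner = None  # row holding the current run of '-' (None, 'a' or 'b')
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            owner = "a" if x == "-" else "b"
            if owner != gap_owner:  # a new gap starts in this column
                total -= p_open
                gap_owner = owner
            total -= p_ext
        else:
            total += s(x, y)
            gap_owner = None
    return total


def toy_exchange(x, y):
    """Hypothetical exchange scores: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1


score = score_alignment("TDWVTALK", "TDWL--IK", toy_exchange, p_open=10, p_ext=1)
```

With these toy values the four identities contribute 8, the two substitutions -2 and the single gap of length two -(10 + 2), so `score` is -6.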
46Pairwise sequence alignment: Global dynamic programming
MDAGSTVILCFVG
Evolution
M D A A S T I L C G S
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open,extension)
MDAGSTVILCFVG-
MDAAST-ILC--GS
47Global dynamic programming
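$$S_{i,j} = s_{i,j} + \max \begin{cases} S_{i-1,j-1} \\ \max_{0<x<i-1} \left( S_{x,j-1} - P_{\text{i}} - (i-x-1) P_{\text{x}} \right) \\ \max_{0<y<j-1} \left( S_{i-1,y} - P_{\text{i}} - (j-y-1) P_{\text{x}} \right) \end{cases}$$
with $P_{\text{i}}$ and $P_{\text{x}}$ the gap initialisation and extension penalties; the three cases look back along the diagonal, along column $j-1$ and along row $i-1$ of the search matrix.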
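A hedged sketch of global dynamic programming with affine gaps. Instead of scanning over all x and y as in the recurrence on this slide, it uses the equivalent-in-spirit three-matrix (Gotoh-style) formulation, which is a swapped-in technique chosen for brevity. The exchange function is a hypothetical toy, and only the optimal score is returned (no traceback).

```python
NEG_INF = float("-inf")


def toy_exchange(x, y):
    """Hypothetical exchange scores: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1


def global_affine_score(a, b, s, p_open, p_ext):
    """Best global alignment score; a gap of length k costs p_open + k * p_ext."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] opposite a gap; Y: b[j-1] opposite a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):  # leading gap in b
        X[i][0] = -(p_open + i * p_ext)
    for j in range(1, m + 1):  # leading gap in a
        Y[0][j] = -(p_open + j * p_ext)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = s(a[i - 1], b[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # open a new gap (pay p_open) or extend the running one
            X[i][j] = max(M[i - 1][j] - p_open, Y[i - 1][j] - p_open, X[i - 1][j]) - p_ext
            Y[i][j] = max(M[i][j - 1] - p_open, X[i][j - 1] - p_open, Y[i][j - 1]) - p_ext
    return max(M[n][m], X[n][m], Y[n][m])


best = global_affine_score("TDWVTALK", "TDWLIK", toy_exchange, p_open=10, p_ext=1)
```

The work is proportional to the product of the sequence lengths, in contrast to the astronomically large number of alignments that exhaustive enumeration would have to visit.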
48Global dynamic programming
49Global dynamic programming
50Up to here 17/02/03
51Local dynamic programming (Smith Waterman,
1981)
LCFVMLAGSTVIVGTR
E D A S T I L C G S
Negative numbers
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open, extension)
AGSTVIVG
A-STILCG
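52Local dynamic programming (Smith Waterman, 1981)
$$S_{i,j} = \max \begin{cases} s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0<x<i-1} \left( S_{x,j-1} - P_{\text{i}} - (i-x-1) P_{\text{x}} \right) \\ s_{i,j} + \max_{0<y<j-1} \left( S_{i-1,y} - P_{\text{i}} - (j-y-1) P_{\text{x}} \right) \\ 0 \end{cases}$$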
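A minimal sketch of local alignment scoring. As a simplification it uses a linear gap penalty (each gap position costs `gap`) rather than the affine scheme of the slide, and a hypothetical toy exchange function; the essential local ingredient, the extra 0 in the maximum, is kept.

```python
def toy_exchange(x, y):
    """Hypothetical exchange scores: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1


def smith_waterman_score(a, b, s, gap):
    """Best local alignment score with a linear gap penalty (simplification)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,                                        # start a fresh local alignment
                H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                H[i - 1][j] - gap,                        # gap in b
                H[i][j - 1] - gap,                        # gap in a
            )
            best = max(best, H[i][j])
    return best


local_best = smith_waterman_score("AAAGSTVVV", "CCGSTCC", toy_exchange, gap=2)
```

Because negative prefixes are reset to 0, the exchange scores must contain negative numbers; otherwise every local alignment would extend to the ends of the sequences.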
53Local dynamic programming
54Sequence database searching: Homology searching
- DP too slow for repeated database searches
- FASTA
- BLAST and PSI-BLAST
- QUEST
- HMMER
- SAM-T98
Fast heuristics
Hidden Markov modelling
55FASTA
- Compares a given query sequence with a library of sequences and calculates for each pair the highest scoring local alignment
- Speed is obtained by delaying application of the dynamic programming technique to the moment where the most similar segments are already identified by faster and less sensitive techniques
- FASTA routine operates in four steps
56FASTA
- Operates in four steps
- Rapid searches for identical words of a user specified length occurring in query and database sequence(s) (Wilbur and Lipman, 1983, 1984). For each target sequence the 10 regions with the highest density of ungapped common words are determined.
- These 10 regions are rescored using the Dayhoff PAM-250 residue exchange matrix (Dayhoff et al., 1983) and the best scoring region of the 10 is reported under init1 in the FASTA output.
- Regions scoring higher than a threshold value and being sufficiently near each other in the sequence are joined, now allowing gaps. The highest score of these new fragments can be found under initn in the FASTA output.
- Full dynamic programming alignment (Chao et al., 1992) over the final region, which is widened by 32 residues at either side; its score is written under opt in the FASTA output.
57FASTA output example
DE METAL RESISTANCE PROTEIN YCF1 (YEAST CADMIUM FACTOR 1). . . .
SCORES Init1: 161 Initn: 161 Opt: 162 z-score: 229.5 E(): 3.4e-06
Smith-Waterman score: 162; 35.1% identity in 57 aa overlap
10 20 30 test.seq
MQRSPLEKASVVSKLFFSW
TRPILRKGYRQRLE
YCFI_YEAST CASILLLEALPKKPLMPHQHIHQTLTRRKPNPY
DSANIFSRITFSWMSGLMKTGYEKYLV 180
190 200 210 220 230
40 50 60
test.seq LSDIYQIPSVDSADNLSEKLEREWDRE
YCFI_YEAST EADLYKLPRNFSSEELSQKLEKNWENELKQKSN
PSLSWAICRTFGSKMLLAAFFKAIHDV 240
250 260 270 280 290
58FASTA
- (1) Rapid identical word searches
- Searching for k-tuples of a certain size within a specified bandwidth along search matrix diagonals.
- For not-too-distant sequences (> 35% residue identity), little sensitivity is lost while speed is greatly increased.
- Technique employed is known as hash coding or hashing: a lookup table is constructed for all words in the query sequence, which is then used to compare all encountered words in each database sequence.
59FASTA
- The k-tuple length is user-defined and is usually 1 or 2 for protein sequences (i.e. either the positions of each of the individual 20 amino acids or the positions of each of the 400 possible dipeptides are located).
- For nucleic acid sequences, the k-tuple is 5-20, and should be longer because short k-tuples are much more common due to the 4 letter alphabet of nucleic acids. The larger the k-tuple chosen, the more rapid, but less thorough, a database search.
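The hashing step can be sketched as follows: a lookup table of all k-tuples in the query, then one pass over the database sequence counting word hits per search-matrix diagonal. This is an illustration of the idea only, not FASTA's actual implementation; the function names are hypothetical and the sequences are the two from the global alignment slide.

```python
from collections import Counter, defaultdict


def build_lookup(query, k):
    """Map every k-tuple of the query to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def diagonal_hits(query, target, k):
    """Count identical k-tuple hits per diagonal (query position minus target position)."""
    table = build_lookup(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            hits[i - j] += 1
    return hits


hits = diagonal_hits("MDAGSTVILCFVG", "MDAASTILCGS", k=2)
```

Diagonals with many hits (here diagonals 0 and 1) mark the regions that are worth rescoring with an exchange matrix in the next step.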
60BLAST
- blastp compares an amino acid query sequence against a protein sequence database
- blastn compares a nucleotide query sequence against a nucleotide sequence database
- blastx compares the six-frame conceptual protein translation products of a nucleotide query sequence against a protein sequence database
- tblastn compares a protein query sequence against a nucleotide sequence database translated in six reading frames
- tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
61BLAST
- Generates all tripeptides from a query sequence and derives for each of those a table of similar tripeptides: their number is only a fraction of the total number possible.
- Quickly scans a database of protein sequences for ungapped regions showing high similarity, which are called high-scoring segment pairs (HSP), using the tables of similar peptides. The initial search is done for a word of length W that scores at least the threshold value T when compared to the query using a substitution matrix.
- Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of S, and as far as the cumulative alignment score can be increased.
62BLAST
- Extension of the word hits in each direction is halted
- when the cumulative alignment score falls off by the quantity X from its maximum achieved value
- when the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments
- upon reaching the end of either sequence
- The T parameter is the most important for the speed and sensitivity of the search resulting in the high-scoring segment pairs
- A Maximal-scoring Segment Pair (MSP) is defined as the highest scoring of all possible segment pairs produced from two sequences.
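The three halting conditions can be made concrete with a small sketch of ungapped extension to the right of a word hit. It illustrates the rule as stated above and is not BLAST's own code; the exchange function and the value of X are hypothetical.

```python
def toy_exchange(x, y):
    """Hypothetical exchange scores: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1


def extend_right(query, target, qi, ti, s, x_drop):
    """Extend an ungapped hit rightwards from (qi, ti).

    Returns (best cumulative score, number of residue pairs giving that score).
    """
    score = best = 0
    best_len = 0
    k = 0
    # halting condition 3: stop at the end of either sequence
    while qi + k < len(query) and ti + k < len(target):
        score += s(query[qi + k], target[ti + k])
        k += 1
        if score > best:
            best, best_len = score, k
        # halting conditions 2 and 1: score at or below zero, or fallen off by X
        if score <= 0 or best - score >= x_drop:
            break
    return best, best_len


result = extend_right("GSTVIL", "GSTAAA", 0, 0, toy_exchange, x_drop=2)
```

Here the extension keeps the three identical pairs and stops two mismatches later, once the cumulative score has dropped by X = 2 from its maximum.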
63PSI-BLAST
- Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; these are excluded from alignment.
- The program then initially operates on a single query sequence by performing a gapped BLAST search
- Then, the program takes the significant local alignments found, constructs a multiple alignment and abstracts a position specific scoring matrix (PSSM) from this alignment.
- The database is rescanned in a subsequent round to find more homologous sequences. Iteration continues until the user decides to stop or the search has converged
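How a PSSM can be abstracted from a multiple alignment is sketched below in a deliberately simplified form: per-column residue counts with a flat pseudocount, converted to log-odds against a uniform background. Both choices are assumptions made for illustration; PSI-BLAST itself uses sequence weighting and a more refined pseudocount scheme. The toy alignment is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict (residue -> log2-odds score) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        counts = Counter(r for r in column if r in AMINO_ACIDS)  # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            r: math.log2((counts[r] + pseudocount) / total / background)
            for r in AMINO_ACIDS
        })
    return pssm


pssm = build_pssm(["MDAGST", "MDAAST", "MEA-ST"])
```

Conserved residues get positive scores at their position (M in the first column), residues never seen there get negative ones, and the next search round scores database residues against these columns instead of against the single query sequence.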
64PSI-BLAST iteration
Iteration scheme: query sequence Q -> gapped BLAST search -> database hits -> PSSM (residue columns A C D .. Y plus gap penalties Pi, Px) -> gapped BLAST search with the PSSM -> database hits -> new PSSM, and so on
65PSI-BLAST output example
66Multiple alignment profiles (Gribskov et al. 1987)
Profile position i:
A    C    D    ...  W    Y
0.3  0.1  0    ...  0.3  0.3
Gap penalties: 0.5, 1.0
Position dependent gap penalties
67Normalised sequence similarity
The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences. This probability follows the Poisson distribution (Waterman and Vingron, 1994):
$P(x, n) = 1 - e^{-n \cdot P(S \ge x)}$,
where n is the number of sequences in the database. P depends on x and n (fixed)
68Normalised sequence similarity: Statistical significance
The E-value is defined as the expected number of non-homologous sequences with score greater than or equal to a score x in a database of n sequences:
$E(x, n) = n \cdot P(S \ge x)$
If the E-value = 0.01, then the expected number of random hits with score S ≥ x is 0.01, which means that this E-value is expected by chance only once in 100 independent searches over the database.
If the E-value of a hit is 5, then five fortuitous hits with S ≥ x are expected within a single database search, which renders the hit not significant.
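The two definitions are linked by P = 1 - e^(-E), which a short sketch makes explicit. The per-sequence probability `p_single` (the chance that one unrelated sequence scores at least x) and the database size are hypothetical inputs chosen to reproduce the two cases discussed above.

```python
import math


def e_value(n, p_single):
    """Expected number of unrelated sequences scoring >= x in a database of n sequences."""
    return n * p_single


def p_value(n, p_single):
    """Probability of at least one unrelated score >= x (Poisson approximation)."""
    return 1.0 - math.exp(-e_value(n, p_single))


n = 100_000                      # hypothetical database size
e_small = e_value(n, 1e-7)       # E = 0.01
p_small = p_value(n, 1e-7)
e_large = e_value(n, 5e-5)       # E = 5
p_large = p_value(n, 5e-5)
```

For small E the two numbers nearly coincide (`p_small` is just under 0.01), whereas for E = 5 the probability of at least one chance hit is above 0.99.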
69Normalised sequence similarity: Statistical significance
- Database searching is commonly performed using an E-value threshold between 0.1 and 0.001.
- Low E-value thresholds decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.
70HMM-based homology searching
- Most widely used HMM-based profile searching tools currently are SAM-T98 (Karplus et al., 1998) and HMMER2 (Eddy, 1998)
- formal probabilistic basis and consistent theory behind gap and insertion scores
- HMMs good for profile searches, bad for alignment
- HMMs are slow
71The HMM algorithms
- Questions
- What is the most likely die (predicted) sequence? Viterbi
- What is the probability of the observed sequence? Forward
- What is the probability that the 3rd state is B, given the observed sequence? Backward
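The first two questions can be illustrated with a small two-state die model, a fair die F and a loaded die L. All probabilities below are hypothetical toy values, and the code is a plain-probability sketch (no log-space or scaling), so it is only suitable for short observation sequences.

```python
STATES = ("F", "L")  # fair die, loaded die
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def forward(obs):
    """Probability of the observed sequence, summed over all state paths."""
    f = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        f = {s: EMIT[s][o] * sum(f[r] * TRANS[r][s] for r in STATES) for s in STATES}
    return sum(f.values())


def viterbi(obs):
    """Most likely state (die) sequence for the observed rolls."""
    v = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    pointers = []
    for o in obs[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: v[r] * TRANS[r][s])
            ptr[s] = prev
            nxt[s] = v[prev] * TRANS[prev][s] * EMIT[s][o]
        pointers.append(ptr)
        v = nxt
    path = [max(STATES, key=lambda s: v[s])]
    for ptr in reversed(pointers):  # trace back from the best final state
        path.append(ptr[path[-1]])
    return path[::-1]


sixes_path = viterbi([6] * 10)
mixed_path = viterbi([1, 2, 3, 4, 5] * 2)
```

A run of sixes is decoded as the loaded die throughout, a run without sixes as the fair die; the forward function replaces the maximum by a sum to obtain the total probability of the rolls.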
72HMM-based homology searching
Transition probabilities and emission probabilities. Gapped HMMs also have insertion and deletion states
73Profile HMM: m = match state, I = insert state, d = delete state; go from left to right. I and m states output amino acids; d states are silent.
74Homology-derived Secondary Structure of Proteins (HSSP) (Sander & Schneider, 1991)
75Up to here 17/02/03
76Bio-Data Analysis and Data Mining
- Existing/emerging bio-data analysis and mining tools for
- DNA sequence assembly
- Genetic map construction
- Sequence comparison and database searching
- Gene finding
- ...
- Gene expression data analysis
- Phylogenetic tree analysis to infer horizontally-transferred genes
- Mass spec. data analysis for protein complex characterization
- ...
- Current mode of work
Often enough developing ad hoc tools for each individual application
77Bio-Data Analysis and Data Mining
- As the amount and types of data and their cross connections increase rapidly
- the number of analysis tools needed will go up exponentially
- blast, blastp, blastx, blastn, ... from BLAST family of tools
- gene finding tools for human, mouse, fly, rice, cyanobacteria, ...
- tools for finding various signals in genomic sequences, protein-binding sites, splice junction sites, translation start sites, ...
78Bio-Data Analysis and Data Mining
Many of these data analysis problems are
fundamentally the same problem(s) and can be
solved using the same set of tools e.g.
clustering or optimal segmentation by Dynamic
Programming
Developing ad hoc tools for each application (by
each group of individual researchers) may soon
become inadequate as bio-data production
capabilities further ramp up
79Bio-data Analysis, Data Mining and Integrative
Bioinformatics
To have analysis capabilities covering a wide range of problems, we need to discover the common fundamental structures of these problems. HOWEVER: in biology one size does NOT fit all
Goal is development of a data analysis
infrastructure in support of Genomics and beyond
80Algorithms in bioinformatics
- string algorithms
- dynamic programming
- machine learning (NN, k-NN, SVM, GA, ..)
- Markov chain models
- hidden Markov models
- Markov Chain Monte Carlo (MCMC) algorithms
- stochastic context free grammars
- EM algorithms
- Gibbs sampling
- clustering
- tree algorithms
- text analysis
- hybrid/combinatorial techniques and more
81Sequence analysis and homology searching
82Finding genes and regulatory elements
83Expression data
84Functional genomics
Monte Carlo
85Protein translation
86Example of algorithm reuse: Data clustering
- Many biological data analysis problems can be formulated as clustering problems
- microarray gene expression data analysis
- identification of regulatory binding sites (similarly, splice junction sites, translation start sites, ......)
- (yeast) two-hybrid data analysis (for inference of protein complexes)
- phylogenetic tree clustering (for inference of horizontally transferred genes)
- protein domain identification
- identification of structural motifs
- prediction reliability assessment of protein structures
- NMR peak assignments
- ......
87Data Clustering Problems
- Clustering: partition a data set into clusters so that data points of the same cluster are similar and points of different clusters are dissimilar
- Cluster identification: identifying clusters with significantly different features than the background
88Application Examples
- Regulatory binding site identification: CRP (CAP) binding site
- Two hybrid data analysis
- Gene expression data analysis
Are all solvable by the same algorithm!
89Other Application Examples
- Phylogenetic tree clustering analysis
- Protein sidechain packing prediction
- Assessment of prediction reliability of protein structures
- Protein secondary structures
- Protein domain prediction
- NMR peak assignments
90Integrative bioinformatics @ VU
- Studying informational processes at biological system level
- From gene sequence to intercellular processes
- Computers necessary
- We have biology, statistics, computational intelligence (AI), HTC, ...
- VUMC microarray facility
- Enabling technology: new glue to integrate
- New integrative algorithms
- Goals: understanding cells in terms of genomes, fighting disease (VUMC)
91Bioinformatics @ VU
- Progression
- DNA: gene prediction, predicting regulatory elements
- mRNA expression
- Proteins: docking, domain prediction
- Metabolic pathways: metabolic control
- Cell-cell communication
93Protein structure and function can be complex
Pyruvate kinase (Phosphotransferase)
β barrel regulatory domain; α/β barrel catalytic substrate binding domain; α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
94Bioinformatics @ VU
- Qualitative challenges
- High quality alignments (alternative splicing)
- In-silico structural genomics
- In-silico functional genomics: reliable annotation
- Protein-protein interactions.
- Metabolic pathways: assign the edges in the networks
- Cell-cell communication: find membrane associated components
- New algorithms
95Bioinformatics @ VU
- Quantitative challenges
- Understanding mRNA expression levels
- Understanding resulting protein activity
- Time dependencies
- Spatial constraints, compartmentalisation
- Are classical differential equation models adequate or do we need more individual modeling (e.g. macromolecular crowding and activity at oligomolecular level)?
- Metabolic pathways: calculate fluxes through time
- Cell-cell communication: tissues, hormones, innervations
Need complete experimental data for good biological model system to learn to integrate
96Bioinformatics @ VU
- VUMC
- Neuropeptide addiction
- Oncogenes: disease patterns
- Rheumatic disease
- CNCR
- From synapses to higher order behaviour
- Addiction
- FPP
- Genetic psychology: twin data bank
97Integrative Genomics
98Recurrent theme: Integration from molecule to health
Leiden-VU-TNO (Centre for Medical Systems Biology)
CRCS
VUMC
99genome
transcriptome
proteome
metabolome
physiome
100Integrative bioinformatics
- Calculate from sequence to molecular behaviour
- Calculate from molecular behaviour and interactions to cells
- Calculate from cellular interactions to tissues
- Calculate from tissue to organism
- Calculate from organisms to ecosystem and society
- Do this in conjunction with data analysis at all levels
- AND CALCULATE BACK (induction)
101Bioinformatics @ VU
- Quantitative challenges
- How much protein produced from single gene?
- What time dependencies?
- What spatial constraints (compartmentalisation)?
- Metabolic pathways: assign the edges in the networks
- Cell-cell communication: find membrane associated components
102Integrative bioinformatics
- Integrate data sources
- Integrate methods
- Integrate data through method integration
(biological model)
103Bioinformatics tool
Algorithm
Data
tool
Biological Interpretation (model)
104Bioinformatics
- "Nothing in Biology makes sense except in the light of evolution" (Theodosius Dobzhansky (1900-1975))
- "Nothing in Bioinformatics makes sense except in the light of Biology"
105Pair-wise sequence alignment (more than just
string matching)
Global dynamic programming
MDAGSTVILCFVG
Evolution
M D A A S T I L C G S
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open,extension)
MDAGSTVILCFVG-
MDAAST-ILC--GS
106Pair-wise alignment search explosions
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
2 sequences of 300 a.a.: ~10^88 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
107Global dynamic programming
108This talk: own kitchen
- Three integrative methods to predict protein structural aspects
- Iterative multiple alignment and protein secondary structure (Praline)
- Intermezzo: 2½-D structure prediction of flavodoxin fold by hand
- Protein domain delineation based on consistency of multiple ab initio model tertiary structures (SnapDRAGON)
- Protein domain delineation based on combining homology searching with domain prediction (Domaination)
109Comparing sequences - Similarity Score -
- Many properties can be used
- Nucleotide or amino acid composition
- Isoelectric point
- Molecular weight
- Morphological characters
110Multivariate statistics Cluster analysis
Raw table (objects 1-5 by characters C1, C2, C3, ...) -> similarity criterion -> similarity matrix (scores, 5×5) -> cluster criterion -> phylogenetic tree
111Human Evolution
112Comparing sequences - Similarity Score -
- Many properties can be used
- Nucleotide or amino acid composition
- Isoelectric point
- Molecular weight
- Morphological characters
- But molecular evolution through sequence
alignment
113Multivariate statistics Cluster analysis
Multiple alignment (sequences 1-5) -> similarity criterion -> similarity matrix (scores, 5×5) -> phylogenetic tree
114Lactate dehydrogenase multiple alignment
Distance matrix
                     1     2     3     4     5     6     7     8     9    10    11    12    13
 1 Human         0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637
 2 Chicken       0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651
 3 Dogfish       0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655
 4 Lamprey       0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669
 5 Barley        0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575
 6 Maize         0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587
 7 Lacto_casei   0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501
 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495
 9 Lacto_plant   0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485
10 Therma_mari   0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598
11 Bifido        0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614
12 Thermus_aqua  0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641
13 Mycoplasma    0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000
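To show how such a distance matrix leads to a tree, here is a minimal sketch of agglomerative clustering with average linkage (UPGMA-style), applied to the first six species of the matrix. This is a stand-in chosen for brevity, not the neighbour-joining method used for the tree on the slides; the function name is illustrative.

```python
NAMES = ["Human", "Chicken", "Dogfish", "Lamprey", "Barley", "Maize"]
DIST = [  # upper-left 6 x 6 block of the lactate dehydrogenase distance matrix
    [0.000, 0.112, 0.128, 0.202, 0.378, 0.346],
    [0.112, 0.000, 0.155, 0.214, 0.382, 0.348],
    [0.128, 0.155, 0.000, 0.196, 0.389, 0.337],
    [0.202, 0.214, 0.196, 0.000, 0.426, 0.356],
    [0.378, 0.382, 0.389, 0.426, 0.000, 0.171],
    [0.346, 0.348, 0.337, 0.356, 0.171, 0.000],
]


def average_linkage(names, dist):
    """Repeatedly merge the two clusters with the smallest average distance."""
    clusters = {i: [i] for i in range(len(names))}
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        best = None
        for p, a in enumerate(keys):
            for b in keys[p + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(([names[i] for i in clusters[a]],
                       [names[j] for j in clusters[b]], d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


merges = average_linkage(NAMES, DIST)
```

The merge order (Human with Chicken first, then Dogfish, then Barley with Maize, then Lamprey) is the branching order of the resulting tree, read from the leaves inwards.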
116Multiple sequence alignment: Why?
- It is the most important means to assess relatedness of a set of sequences
- Gain information about the structure/function of a query sequence (conservation patterns)
- Construct a phylogenetic tree
- Putting together a set of sequenced fragments (Fragment assembly)
- Comparing a segment sequenced by two different labs
- Many bioinformatics methods depend on it (e.g. secondary/tertiary structure prediction)
117Flavodoxin fold: aligning 13 Flavodoxins + cheY
5(βα) fold
118Flavodoxin-cheY multiple alignment (Praline with pre-processing)
1fx1        -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
FLAV_DESDE  MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
FLAV_DESVH  MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
FLAV_DESSA  MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
FLAV_DESGI  MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
2fcr        --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_AZOVI  -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
FLAV_ENTAG  MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
FLAV_ANASP  SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf
FLAV_ECOLI  -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf
4fxn        -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF
FLAV_MEGEL  MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf
FLAV_CLOAB  -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf
3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM

1fx1        GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
FLAV_DESDE  ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
FLAV_DESVH  GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
119Flavodoxin-cheY NJ tree
120Integrating secondary structure prediction in multiple alignment (Victor Simossis)
- Praline multiple alignment method
- (Heringa, Comp. Chem. 23, 341-364, 1999; Comp. Chem. 26, 459-477, 2002; Kleinjung, Douglas & Heringa, Bioinformatics, in press, 2002)
- Combining sequence data and secondary structure prediction (Heringa, Curr. Prot. Pept. Sci., 1 (3), 273-301, 2000)
- Secondary structure methods: PHD, Predator, PSIPred, Jpred, SSPRED, ...
121Using secondary structure in multiple alignment
- Structure more conserved than sequence
122Protein structure hierarchical levels
123Protein structure hierarchical levels
124Secondary structure-induced alignment
125Using secondary structure in multiple alignment
Dynamic programming search matrix
Sequence 1: MDAGSTVILCFV
Predicted:  HHHCCCEEEEEE
Sequence 2: MDAASTILCGS
Predicted:  HHHHCCEEECC
Amino acid exchange weights matrices: H/H, C/C, E/E, Default
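A minimal sketch of the idea on this slide, assuming that a state-specific exchange matrix is used when both residues have the same predicted state and the default matrix otherwise. The weights are toy match/mismatch values, not real exchange matrices, and the names `toy_matrix`, `cell_score` and `search_matrix` are hypothetical.

```python
# Toy sketch: fill a dynamic programming search matrix, choosing the
# exchange matrix per cell from the predicted secondary structure.

def toy_matrix(match, mismatch):
    """Stand-in for a 20x20 exchange matrix (illustrative values only)."""
    return lambda a, b: match if a == b else mismatch

# Assumed weights: one matrix per state plus a default.
MATRICES = {
    "H": toy_matrix(4, -1),
    "E": toy_matrix(5, -2),
    "C": toy_matrix(3, 0),
    "default": toy_matrix(2, -1),
}

def cell_score(res_a, ss_a, res_b, ss_b):
    """Score one residue pair with the matrix matching both states."""
    key = ss_a if ss_a == ss_b and ss_a in MATRICES else "default"
    return MATRICES[key](res_a, res_b)

def search_matrix(seq_a, ss_a, seq_b, ss_b):
    """Return the matrix of pair scores (rows: seq_a, columns: seq_b)."""
    return [[cell_score(a, sa, b, sb) for b, sb in zip(seq_b, ss_b)]
            for a, sa in zip(seq_a, ss_a)]

# The two sequences and predictions shown on the slide.
m = search_matrix("MDAGSTVILCFV", "HHHCCCEEEEEE",
                  "MDAASTILCGS", "HHHHCCEEECC")
```

The filled matrix would then be passed to the usual alignment recursion; only the choice of exchange weights changes.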
126Flavodoxin-cheY predicted secondary
structure (PREDATOR)
1fx1 -PK-ALIVYGSTTGNTEYTAETIARQLANAG-YE
VDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFD
S-LEETGAQGRKVACF e eeee b
ssshhhhhhhhhhhhhhttt eeeee stt tttttt seeee b
ee sss ee ttthhhhtt ttss tt
eeeee FLAV_DESVH MPK-ALIVYGSTTGNTEYTaETIARELA
DAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDD
FIPLFDS-LEETGAQGRKVACf e eeeeee
hhhhhhhhhhhhhhh eeeeee eeeeee
hhhhhh
eeeee FLAV_DESGI MPK-ALIVYGSTTGNTEGVaEAIAKTLN
SEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQED
FVPLYED-LDRAGLKDKKVGVf e eeeeee
hhhhhhhhhhhhhh eeeeee hhhhhh eeeeeee
hhhhhh
eeeeee FLAV_DESSA MSK-SLIVYGSTTGNTETAaEYVAEAF
ENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQD
DFIPLYDS-LENADLKGKKVSVf
eeeeee hhhhhhhhhhhhhh eeeee
eeeee hhhhhhh h
eeeee FLAV_DESDE MSK-VLIVFGSSTGNTESIaQKLEELIA
AGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDD
FLSLFEE-FNRFGLAGRKVAAf eeee
hhhhhhhhhhhhhh eeeee hhhhhhhhhhheeeee
hhhhhhh hh eeeee 2fcr
--K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVT
DPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKD
LPVAIF eeeee
ssshhhhhhhhhhhhhggg b eeggg s gggggg seeeeeee
stt s s s sthhhhhhhtggg tt
eeeee FLAV_ANASP SKK-IGLFYGTQTGKTESVaEIIRDEFG
ND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSD
WEGLYSE-LDDVDFNGKLVAYf eeeee
hhhhhhhhhhhh eee hhh hhhhhhheeeeee
hhhhhhhhh
eeeeee FLAV_ECOLI -AI-TGIFFGSDTGNTENIaKMIQKQL
GKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA--------QC
DWDDFFPT-LEEIDFNGKLVALf eee
hhhhhhhhhhhh eee hhh hhhhhhheeeee
hhhhh
eeeeee FLAV_AZOVI -AK-IGLFFGSNTGKTRKVaKSIKKRF
DDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENE
SWEEFLPK-IEGLDFSGKTVALf eee
hhhhhhhhhhhhh hhh hhhhhhheeeee
hhhhhhhhh
eeeeee FLAV_ENTAG MAT-IGIFFGSDTGQTRKVaKLIHQKL
DG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYD
SWQEFTNT-LSEADLTGKTVALf eeee
hhhhhhhhhhhh hhh hhhhhhheeeee
hhhhh eeeee 4fxn
----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDV
NIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KIS
GKKVALF eeeee
ssshhhhhhhhhhhhhhhtt eeeettt sttttt seeeeee
btttb ttthhhhhhh hst t tt
eeeee FLAV_MEGEL M---VEIVYWSGTGNTEAMaNEIEAAVK
AAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSV
VEPFFTD-LAP-KLKGKKVGLf
hhhhhhhhhhhhhh eeeee hhhhhhhh eeeee
eeeee FLAV_CLOAB M-K-ISILYSSKTGKTERVaKLIEEGVK
RSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI--------SWE
MKKWIDE-SSEFNLEGKLGAAf eee
hhhhhhhhhhhhhh eeeeee hhhhhhhhhh eeee
hhhhhhhhh eeeee 3chy
ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DAL
NKLQAGGYGFVISD---WNMPNM----------DGLELLKTIRADGAMSA
LPVLMV tt eeee s
hhhhhhhhhhhhhht eeeesshh hhhhhhhh eeeee s
sss hhhhhhhhhh ttttt eeee 1fx1
GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD----------
-----------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
eee s ss sstthhhhhhhhhhhttt ee s
eeees gggghhhhhhhhhhhhhh FLAV
_DESVH GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD-----
----------------GLRIDGD--PRAARDDIVGwAHDVRGAI------
-- eee hhhhhhhhhhhh
eeeee eeeee
hhhhhhhhhhhhhh FLAV_DESGI GCGDS-SY-TYFCGAVDVI
EKKAEELgATLVAS---------------------SLKIDGE--P--DSA
EVLDwAREVLARV-------- eee
hhhhhhhhhhhh eeeee
hhhhhhhhhhh FLAV_DESSA
GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD-----------------
----SLKIDGD--P--ERDEIVSwGSGIADKI--------
hhhhhhhhhhhh eeeee
e eee FLAV_DESDE
ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE-----------------
----GLKMEGD--ASNDPEAVASfAEDVLKQL--------
e hhhhhhhhhhhhhh eeeee
ee hhhhhhhhhhh 2fcr
GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSV
RD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
eee ttt ttsttthhhhhhhhhhhtt eee b gggs
s tteet teesseeeettt ss hhhhhhhhhhhhhhhht FLAV_A
NASP GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYD
FNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
hhhhhhhhhhhhhh
eeee
hhhhhhhhhhhhhhhh FLAV_ECOLI
GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADD
DHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
hhhhhhhhhhhhhh eeee
hhhhhhhhhhhhhhhhhh FLAV_AZOVI
GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESS
EAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
e hhhhhhhhhhhhhh eeeee
hhhhhhhhhhh FLAV_ENTA
G GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSF
SAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
hhhhhhhhhhhhhhh eeee
hhhhhhh hhhhhhhhhhhh 4fxn
G-----SYGWGDGKWMRDFEERMNGYGCVVVET---------
------------PLIVQNE--PDEAEQDCIEFGKKIANI---------
e eesss shhhhhhhhhhhhtt ee s
eeees ggghhhhhhhhhhhht FLAV
_MEGEL G-----SYGWGSGEWMDAWKQRTEDTgATVIGT-----
-----------------AIVNEM--PDNAPE-CKElGEAAAKA-------
-- hhhhhhhhhhh
eeeee eeee h
hhhhhhhh FLAV_CLOAB STANSIA-GGSDIALLTILNHLMVK
-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfG
ERiANkV--KQIF--
hhhhhhhhhhhhhh eeeee
hhhh hhh hhhhhhhhhhhh h 3chy
-----------TAEAKKENIIAAAQAGASGY-------------------
------VVK----P-FTAATLEEKLNKIFEKLGM------
ess hhhhhhhhhtt see
ees s hhhhhhhhhhhhhhht
G
Enough to predict 5(βα) topology
127Secondary structure-induced alignment
128Iteration
Convergence
Limit cycle
Divergence
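The three outcomes can be told apart by remembering every state the iteration has visited. A minimal sketch, assuming each state (for example the alignment or the set of predicted secondary structures) is hashable; `classify_iteration` and `step` are hypothetical names, and "divergence" here simply means no repeat within the iteration budget.

```python
def classify_iteration(step, start, max_iter=20):
    """Run state = step(state) and report how the iteration behaves."""
    seen = {start: 0}          # state -> iteration at which it first appeared
    state = start
    for i in range(1, max_iter + 1):
        state = step(state)
        if state in seen:
            period = i - seen[state]
            if period == 1:
                return ("convergence", i)      # same state twice in a row
            return ("limit cycle", period)     # returned to an older state
        seen[state] = i
    return ("divergence", max_iter)            # no repeat within the budget

# Toy step functions standing in for one align/predict round.
halving = classify_iteration(lambda x: x // 2, 8)
cycling = classify_iteration(lambda x: (x + 1) % 3, 0)
growing = classify_iteration(lambda x: x + 1, 0, max_iter=10)
```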
129Flavodoxin-cheY multiple alignment/ secondary
structure iteration cheY SSEs
3chy-AA SEQUENCE AA ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP
3chy-ITERATION-0 PHD EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE
3chy-ITERATION-1 PHD EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE
3chy-ITERATION-2 PHD EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE
3chy-ITERATION-3 PHD EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE
3chy-ITERATION-4 PHD EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE
3chy-ITERATION-5 PHD EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE
3chy-ITERATION-6 PHD EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE
3chy-ITERATION-7 PHD EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE
3chy-ITERATION-8 PHD EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE
3chy-ITERATION-9 PHD EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE

3chy-AA SEQUENCE AA NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM
3chy-ITERATION-0 PHD HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH
3chy-ITERATION-1 PHD HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-2 PHD HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-3 PHD HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-4 PHD HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-5 PHD HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-6 PHD HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH
3chy-ITERATION-7 PHD HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-8 PHD HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-9 PHD HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH
130
4fxn-AA SEQUENCE AA MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEV
4fxn-ITERATION-0 PHD EEEEE HHHHHHHHHHHHHHH EEE EEEEE
4fxn-ITERATION-1 PHD EEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-2 PHD EEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-3 PHD EEEEE HHHHHHHHHHHHHHH E EEEEE
4fxn-ITERATION-4 PHD EEEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-5 PHD EEEEEE HHHHHHHHHHHHHHH EE EEEEE
4fxn-ITERATION-6 PHD EEEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-7 PHD EEEEEE HHHHHHHHHHHHHHH EE EEEEE
4fxn-ITERATION-8 PHD EEEEEE HHHHHHHHHHHHHHH EEE EEEEE
4fxn-ITERATION-9 PHD EEEEE HHHHHHHHHHHHHHH EEE EEEEE

4fxn-AA SEQUENCE AA LEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE
4fxn-ITERATION-0 PHD EEEEE HHHHHHHHHHHHHHHHH EEE EEE
4fxn-ITERATION-1 PHD HHHH EEEEE HHHHHHHHHHHHHHH EEE EE
4fxn-ITERATION-2 PHD HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHH EEE EE
4fxn-ITERATION-3 PHD HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHH EEE EE
4fxn-ITERATION-4 PHD HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-5 PHD HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-6 PHD HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-7 PHD HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-8 PHD HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-9 PHD HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE E

4fxn-AA SEQUENCE AA PDEAEQDCIEFGKKIANI
4fxn-ITERATION-0 PHD HHHHHHHHHHHHH
4fxn-ITERATION-1 PHD HHHHHHHHHHHHH
4fxn-ITERATION-2 PHD HHHHHHHHHHHHH
4fxn-ITERATION-3 PHD HHHHHHHHHHHHH
4fxn-ITERATION-4 PHD HHHHHHHHHHHH
4fxn-ITERATION-5 PHD HHHHHHHHHHHHH
4fxn-ITERATION-6 PHD HHHHHHHHHHHH
4fxn-ITERATION-7 PHD HHHHHHHHHHHHH
4fxn-ITERATION-8 PHD HHHHHHHHHHHHH
4fxn-ITERATION-9 PHD HHHHHHHHHHHH
131Predicting sec. struct. with PHD, etc.
132Secondary structure prediction using MA (SymSS)
1 2 3 4
2 1 3 4
3 1 2 4
4 1 2 3
1 1 1 1
EEEEE HHHHHH EEEEE HH EEEE? ?HHHHH
EEE H EEEEE HHHHH? ??EE
HH EEEEEE ?HHHHH EEEE HH
EEEEE HHHHHH EEE HH EEEE? ?HHHHH
EEE H EEEEE HHHHH? ??EE HH EEEEE
?HHHHH EEEE HH
EEEEE HHHH EEE HH EEEE? ?HHH EEE
H EEEEE HHH? ??EE HH EEEEE HHH?
EEEE HH
EEEEE HHHHHH EEE HHHH EEEE? ?HHHHH
EEE ?HHH EEEEE HHHHH? ??EE HHHH EEEEE
?HHHHH EEEE HHHH
EEEEE HHHHH EEE H
EEEE HHHH EE HHH
EEEE HHHHH EEE H
EEEE HHH EEE HH
133Flavodoxin-cheY
3chy ------------ GYVVKPFTAATLEEKLNKIFEKLGM------
PHD --------------- hhhhhhhhhhhhhh ------
13 -> 0 ee ??hhhhhhhhhhh?
13 -> 1 ee ??hhhhhhhhhhh??
13 -> 2 ee ??hhhhhhhhhhh?
13 -> 3 eee ?hhhhhhhhhhh?
13 -> 4 eee ?hhhhhhhhhhh?
13 -> 5 eee h?hhhhhhhhhhh
13 -> 6 eee hh hhhhhhhhhhh
13 -> 7 e eeeeeee hhhhhhhhhhhhh??
13 -> 8 eeeeeee hhhhhhhhhhhhh??
13 -> 9 eeeeeee hhhhhhhhhhhhh?? ?????
13 -> 10 eeeeeee hhhhhhhhhhhhh??
13 -> 11 e eeeeeeee hhhhhhhhhhhhh???
13 -> 12 eeeeeee hhhhhhhhhh
13 -> 13 hhhhhhhhhhhhhh h
DSSP ...............EEEESS HHHHHHHHHHHHHHHT ......
134Optimal segmentation of predicted secondary
structures
Each sequence within an alignment gives rise to
a library of n secondary structure predictions,
where n is the number of sequences in the
alignment. The predictions are recorded by
secondary structure type and region position in a
single matrix
1 2 3 4
1->1 1->2 1->3 1->4
EEEEE HHHHHH EEEEE HH
EEEE? ?HHHHH EEE H
EEEEE HHHHH? ??EE HH
EEEEEE ?HHHHH EEEE HH
C E H
H score  0 0 0 0 0 ...
E score  3 4 4 4 3 ...
C score  1 0 0 0 0 ...
? score  0 0 0 0 1 ...
Region   0 1 1 1 0 ...
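A minimal sketch of how such a matrix could be recorded from a prediction library. The four toy predictions are chosen to reproduce the first five columns of the scores shown above. The meaning of the Region row is not stated on the slide; the code assumes it marks positions where all predictions agree. `tally_library` is a hypothetical name.

```python
def tally_library(predictions, states="HEC?"):
    """Count, per position, how many predictions give each state."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have equal length")
    scores = {s: [0] * length for s in states}
    for pred in predictions:
        for pos, state in enumerate(pred):
            scores[state][pos] += 1
    # Assumption: 'Region' is 1 where every prediction gives the same state.
    region = [int(max(scores[s][pos] for s in states) == len(predictions))
              for pos in range(length)]
    return scores, region

# Toy library: four predictions for the same five positions.
library = ["EEEEE", "EEEE?", "EEEEE", "CEEEE"]
scores, region = tally_library(library)
```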
135Optimal segmentation of predicted secondary
structures by Dynamic Programming
The recorded values are combined in a weighted function, according to their secondary structure type, that gives each position a window-specific score. The more probable the secondary structure element, the higher the score.
Restrictions: H only if ws > 4; E only if ws > 2 (ws = window size)
Segmentation score = total score of each path
[Figure: score matrix with rows H score, E score, C score, ? score and Region; dynamic programming matrix of window size against sequence position; each cell stores Max score, Offset and Label]
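A minimal sketch of segmentation by dynamic programming under the slide's restrictions (H only for windows longer than 4, E only for windows longer than 2). The slide does not give the weighted function, so the code assumes a segment scores the plain sum of its label's per-position values and ignores the "?" row. `segment` and `MIN_LEN` are hypothetical names.

```python
# Minimum window size per label: H needs ws > 4, E needs ws > 2.
MIN_LEN = {"H": 5, "E": 3, "C": 1}

def segment(scores):
    """Return (best total score, label string) for per-position scores."""
    n = len(scores["C"])
    # Prefix sums give the score of any window in constant time.
    prefix = {s: [0] * (n + 1) for s in MIN_LEN}
    for s in MIN_LEN:
        for i, value in enumerate(scores[s]):
            prefix[s][i + 1] = prefix[s][i] + value
    best = [0] + [float("-inf")] * n   # best[j]: best score for positions < j
    back = [None] * (n + 1)            # back[j]: (offset, label) of last window
    for j in range(1, n + 1):
        for label, min_len in MIN_LEN.items():
            for i in range(0, j - min_len + 1):
                cand = best[i] + prefix[label][j] - prefix[label][i]
                if cand > best[j]:
                    best[j] = cand
                    back[j] = (j - i, label)
    # Trace back through the stored offsets and labels.
    parts, j = [], n
    while j > 0:
        offset, label = back[j]
        parts.append(label * offset)
        j -= offset
    return best[n], "".join(reversed(parts))

toy = {
    "H": [0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 0],
    "E": [3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0],
    "C": [1, 0, 0, 0, 4, 0, 0, 0, 0, 0, 4],
}
total, labels = segment(toy)

# A two-residue helix signal is too short for an H window.
short = {"H": [4, 4, 0, 0], "E": [0, 0, 0, 0], "C": [1, 1, 1, 1]}
short_total, short_labels = segment(short)
```

Because coil windows of length 1 are always allowed, every position is reachable and the traceback always finds a stored window.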
136Example of an optimally segmented secondary
structure prediction library for sequence 3chy
3chy ---------------GYVV----- KPFTAATLEEKLNKIFEKLGM------
3chy <- 1fx1 ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESDE ??????????????? ee ?? hhhhhhhhhhhhhhh ????????
3chy <- FLAV_DESVH ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESGI ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- FLAV_DESSA ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- 4fxn ??????????????? eee ?? hhhhhhhhhhhhh ?????????
3chy <- FLAV_MEGEL ????????????????eee ?? hh?hhhhhhhhhhh ?????????
3chy <- 2fcr e ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ANASP ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ECOLI eeeeeee hhhhhhhhhhhhhhh hhhhh
3chy <- FLAV_AZOVI ? eeeeeee hhhhhhhhhhhhhhh ????
3chy <- FLAV_ENTAG e eeeeeeee hhhhhhhhhhhhhhhh? ??????
3chy <- FLAV_CLOAB eeeeeee hhhhhhhhhh ???????????
3chy <- 3chy --------------- ----- hhhhhhhhhhhhhh ------
Consensus ---------------EEEE----- HHHHHHHHHHHHH ------
Consensus-DSSP ....................xx..... .
PHD --------------- ----- HHHHHHHHHHHHHH ------
PHD-DSSP ...............xxxx..... x......
DSSP ...............EEEE.....SS HHHHHHHHHHHHHHHT ......
LumpDSSP ...............EEEE..... HHHHHHHHHHHHHHH ......
137What to do with a multiple alignment?
- Use it to eyeball and detect structural/functional features
- Use it to make a profile and search a database for homologs
- Give it to other bioinformatics methods and predict secondary structure, functional residues, correlated mutations, phylogenetic trees, etc.
138Rules of thumb when looking at a multiple
alignment (MA)
- Hydrophobic residues are internal
- Gly (Thr, Ser) in loops
- MA: hydrophobic block -> internal β-strand
- MA: alternating (1-1) hydrophobic/hydrophilic -> edge β-strand
- MA: alternating 2-2 (or 3-1) periodicity -> α-helix
- MA: gaps in loops
- MA: conserved column -> functional? -> active site
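The pattern rules above can be turned into a rough classifier. This is a sketch only: the hydrophobic residue set, the majority vote per column and the minimum block lengths are assumptions, and `column_pattern` and `guess_element` are hypothetical names.

```python
# Assumed hydrophobic set; other definitions are equally defensible.
HYDROPHOBIC = set("AVLIMFWC")

def column_pattern(alignment):
    """Label each MA column 'h' (mostly hydrophobic) or 'p' (otherwise)."""
    pattern = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]   # ignore gaps
        n_h = sum(r in HYDROPHOBIC for r in residues)
        pattern.append("h" if residues and 2 * n_h > len(residues) else "p")
    return "".join(pattern)

def guess_element(pattern):
    """Apply the rules of thumb to one block of the h/p pattern."""
    if len(pattern) < 4:
        return "unknown"
    if set(pattern) == {"h"}:
        return "internal beta-strand"            # hydrophobic block
    if all(a != b for a, b in zip(pattern, pattern[1:])):
        return "edge beta-strand"                # 1-1 alternation
    unit = pattern[:4]
    repeats = all(c == unit[i % 4] for i, c in enumerate(pattern))
    if len(pattern) >= 8 and repeats and unit.count("h") in (2, 3):
        return "alpha-helix"                     # 2-2 or 3-1 periodicity
    return "unknown"

buried = guess_element(column_pattern(["VILVA", "ILLVA", "VIVLG"]))
edge = guess_element(column_pattern(["VSLT", "ITVS", "L-IK"]))
helix = guess_element("hhpphhpp")
```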
139Rules of thumb when looking at a multiple
alignment (MA)
- Active site residues are together in 3D structure
- Helices often cover up core of strands
- Helices less extended than strands -> more residues to cross protein
- β-α-β motif is right-handed in >95% of cases (with parallel strands)
- MA: inconsistent alignment columns and match errors!
- Secondary structures have local anomalies, e.g. β-bulges