Title: PowerPoint Presentation
1Why sequence comparison?
The sequence determines the properties of the
macromolecule
- Sequence comparison is used
- To discover structural, functional and evolutionary relationships
- Similar sequence → similar structure/function of the protein
- To identify conserved patterns
- To find known structural and functional domains in unknown proteins
A comparison may be the basis of further
experimental investigations
2What is a sequence alignment?
Procedure of comparing two (pairwise) or more
(multiple) sequences by searching for a series of
individual characters that are in the same order
in the sequences
3Examples comparing some strings
What have we learnt? To compare the names (strings) we mentally searched for the alignment that maximises the number of identities → if this number is high the names are similar, otherwise they are different
What have we learnt? To choose the best alignment among many candidates we have to score the results
What have we learnt? In some cases, to find the best alignment we have to insert gaps
What have we learnt? Gaps must be penalized
4Comparing sequences
The algorithms that compare sequences perform the same operations we have just done. They try many alignments and select the one with the highest score as the best. In some cases they insert gaps to increase the quality of the alignment
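The idea of selecting the highest-scoring alignment, with gaps where they help, can be sketched in a few lines. This is a minimal illustration and not the method of any particular program: it uses the Needleman-Wunsch dynamic programming scheme, which reaches the same optimum as trying every alignment without enumerating them. The function name and the match, mismatch and gap values are assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (illustrative values)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best_score, row_a, row_b = global_align("ACGT", "AGT")
print(best_score, row_a, row_b)  # 1 ACGT A-GT
```

With these values one gap costs 2 and three identities earn 3, so the gapped alignment scores 1, which is better than any alignment that avoids pairing the shared letters.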
A different score
Comparing amino acid or nucleic acid sequences differs from comparing names because each letter stands for a molecule with its own steric hindrance and physico-chemical properties, which influence how readily one residue replaces another in evolution
5An example of sequence comparison
An algorithm tries all the possible solutions (there are lots of possible alignments) and chooses the one with the highest score
D A R I A E S K - A
The best alignment is
The score of the alignment is
$$\text{score} = \frac{\sum_s m(a_s, b_s) - (\text{gap insertion penalty} \times \text{number of gaps}) - (\text{gap extension penalty} \times \text{length of gaps})}{\text{length of alignment}}$$
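The score can be computed directly from a finished alignment. The sketch below assumes that both gap penalties are subtracted from the summed substitution scores m(a_s, b_s) and that the result is divided by the alignment length; the function name, the "-" gap symbol and the toy identity scoring are assumptions made for illustration.

```python
def alignment_score(row_a, row_b, m, gap_insertion, gap_extension):
    """Length-normalised alignment score (assumed reading of the formula)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    substitution_sum = 0.0
    n_gaps = 0        # number of separate gaps (runs of '-')
    gap_length = 0    # total number of gap positions
    previous = None   # which row carried a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-":
            current = "a"
        elif y == "-":
            current = "b"
        else:
            current = None
            substitution_sum += m(x, y)
        if current is not None:
            gap_length += 1
            if current != previous:
                n_gaps += 1  # a new gap starts in this column
        previous = current
    raw = substitution_sum - gap_insertion * n_gaps - gap_extension * gap_length
    return raw / len(row_a)


def identity(x, y):
    # Toy substitution score: 1 for identical letters, 0 otherwise
    return 1 if x == y else 0


print(alignment_score("AC--GT", "ACTTGT", identity, 2, 0.5))
```

Here four identities contribute 4, one gap of length two costs 2 + 2 × 0.5 = 3, and the remainder is divided by the six alignment columns.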
6Scoring matrices
Scoring matrices reflect
- Probabilities of mutual substitutions
- The probability of occurrence of each residue
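One common way to combine these two quantities is a log-odds score: the observed frequency of a residue pair is divided by the frequency expected from the individual residue frequencies, and the logarithm is taken. The sketch below assumes this form with a scale factor of 2 and base-2 logarithms (half-bit units); the frequencies are made-up numbers for illustration, not values from a real matrix.

```python
import math


def log_odds(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score for one residue pair.

    pair_freq: observed frequency of the pair in trusted alignments
    freq_a, freq_b: background frequencies of the two residues
    """
    expected = freq_a * freq_b  # pair frequency expected by chance
    return round(scale * math.log2(pair_freq / expected))


# Made-up frequencies: both residues occur with frequency 0.1
print(log_odds(0.02, 0.1, 0.1))    # pair seen twice as often as chance: positive
print(log_odds(0.0025, 0.1, 0.1))  # pair seen four times less often: negative
```

A positive score marks a substitution seen more often than chance, a negative score one seen less often, and a score near zero a pair that occurs about as often as expected.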
7Protein Score Matrix
PAM240 (Percent Accepted Mutation)
BLOSUM62 (BLOcks SUbstitution Matrix)
The BLOSUM model is designed to find conserved domains of proteins. The BLOSUM statistics are based on the BLOCKS library (http://bioinformatics.weizmann.ac.il/blocks/process_blocks.html), a collection of multiple alignments of protein fragments without gaps. The numbers in the matrix reflect the frequency of substitution of one residue by another in these alignments, where sequences more than 62% identical are clustered and counted as a single sequence
The PAM model is designed to track the evolutionary origin of proteins. The numbers in the matrix reflect the probability that a residue is substituted by another after 240 evolutionary steps
GONNET250. The Gonnet matrices are similar to PAM but much more up to date and based on a far larger data set. They appear to be more sensitive.
PAM: the higher the index, the greater the distance of the matrix from the identity matrix
BLOSUM: the lower the index, the greater the distance of the matrix from the identity matrix
8How good is my alignment?
The output parameters
- Score: the value calculated for the alignment using the substitution matrix and the gap penalties.
- Percent identity: percentage of exactly matching residues in the alignment.
- Percent similarity: percentage of similar residues aligned (depends on the definition of similarity, but is biologically more significant).
- Percent gaps: percentage of gap positions in the alignment.
- Expected value (E): number of matches with a score at least this high that would be expected by chance, for example when comparing random sequences.
NOTE: different systems use different forms of these statistics.
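The three percentages can be read straight off an aligned pair. The sketch below is one possible convention and not the definition used by any specific program: it divides by the full alignment length, and it treats two residues as similar when they fall in the same group of a hypothetical grouping (`SIMILAR_GROUPS`) invented for this example.

```python
# Hypothetical similarity groups; real programs derive similarity
# from the substitution matrix in use.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def alignment_stats(row_a, row_b, groups=SIMILAR_GROUPS):
    """Percent identity, similarity and gaps over the alignment length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = similar = gaps = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in g and y in g for g in groups):
            similar += 1
    length = len(row_a)
    return {
        "identity": 100 * identical / length,
        "similarity": 100 * similar / length,
        "gaps": 100 * gaps / length,
    }


stats = alignment_stats("MKV-LDE", "MRVALEE")
print({name: round(value, 1) for name, value in stats.items()})
```

In this toy pair four of seven columns are identical, two more are similar (K/R and D/E) and one column holds a gap.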
9Multiple Sequence Alignment
- Compare all sequences pairwise.
- Perform cluster analysis on the pairwise data to generate a hierarchy for alignment (guide tree).
- Build the multiple alignment step by step according to the guide tree: first align the most similar pair of sequences, then add another sequence or another pairwise alignment.
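The first two steps can be sketched as follows. This is a simplified illustration under stated assumptions: pairwise similarity is approximated with Python's `difflib.SequenceMatcher` ratio instead of a real pairwise alignment score, clusters are merged by average linkage, and the sequence names and toy sequences are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations


def guide_order(seqs):
    """Return pairwise distances and the order in which clusters merge."""
    names = list(seqs)
    # Step 1: compare all sequences pairwise (n * (n - 1) / 2 comparisons)
    dist = {
        frozenset(pair): 1 - SequenceMatcher(None, seqs[pair[0]], seqs[pair[1]]).ratio()
        for pair in combinations(names, 2)
    }

    def average_distance(c1, c2):
        # Mean distance between members of two clusters (average linkage)
        return sum(dist[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    # Step 2: repeatedly merge the two closest clusters
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return dist, merges


toy = {"s1": "AAAAAAAA", "s2": "AAAAAAAC", "s3": "GGGGGGGG", "s4": "GGGGGGTT"}
distances, merges = guide_order(toy)
print(len(distances))  # 6 pairwise comparisons for 4 sequences
for left, right in merges:
    print(left, "+", right)
```

The merge list plays the role of the guide tree: the most similar pair is joined first, and the final merge aligns two existing alignments with each other.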
10Steps in Multiple Alignment
(1) Pairwise alignment (prepare guide tree): 6 pairwise alignments, then cluster analysis
(2) Multiple alignment following the tree from step (1): align pairs, then align alignments, preserving gaps
11- We can deduce a pairwise alignment for any two sequences in the multiple alignment (projected pairwise alignment)
- The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences.
Best Pairwise alignment
Projected Pairwise alignment
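A projected pairwise alignment is obtained by taking two rows of the multiple alignment and dropping the columns in which both rows hold a gap. The sketch below assumes "-" as the gap symbol; the function name and the toy alignment are invented for illustration.

```python
def project_pair(msa, name_a, name_b):
    """Extract the pairwise alignment of two rows from a multiple alignment."""
    columns = [
        (x, y)
        for x, y in zip(msa[name_a], msa[name_b])
        if not (x == "-" and y == "-")  # drop columns that are gaps in both rows
    ]
    row_a = "".join(x for x, _ in columns)
    row_b = "".join(y for _, y in columns)
    return row_a, row_b


toy_msa = {
    "s1": "AC-GT-A",
    "s2": "A--GTCA",
    "s3": "ACTGT-A",
}
print(project_pair(toy_msa, "s1", "s2"))  # ('ACGT-A', 'A-GTCA')
```

The gaps that remain were placed to accommodate the other sequences, which is why the projection can score lower than an alignment computed for the two sequences alone.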
12Isopenicillin N Synthase
- Mononuclear iron proteins: electron carrier proteins. Iron atoms are bound to amino acid side chains.
- In IPNS the metal ion is coordinated by three protein ligands
13IsoPenicillin N Synthase
- IPNS is involved in biosynthesis of penicillin
[Reaction scheme: ACV + O2 → isopenicillin N; labels: Fe, ascorbate]
14Research IPNS
- Goal: identify Fe2+-binding residues.
- Possible solutions:
- Empirical approach (Alanine walk)
- Bioinformatic approach (comparing different IPNS
sequences).
15Step 1
- Multiple alignment of known IPNS
16PileUp - output
!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @ipns.fil
Symbol comparison table: GenRunData:blosum62.cmp  CompCheck: 1102
GapWeight: 8
GapLengthWeight: 2
ipns.msf  MSF: 338  Type: P  March 14, 2002 09:29  Check: 7631 ..
Name: IPNS_STRJU  Len: 338  Check: 6344  Weight: 1.00
Name: IPNS_STRCL  Len: 338  Check: 4249  Weight: 1.00
Name: IPNS_NOCLA  Len: 338  Check: 7020  Weight: 1.00
Name: IPNS_CEPAC  Len: 338  Check: 18  Weight: 1.00
//
            1                                                  50
IPNS_STRJU  MPILMPSAE VPTIDISPLS GDDAKAKQRV AQEINKAARG SGFFYASNHG
IPNS_STRCL  MPVLMPSAH VPTIDISPLF GTDAAAKKRV AEEIHGACRG SGFFYATNHG
IPNS_NOCLA  MKMPSAE VPTIDVSPLF GDDAQEKVRV GQEINKACRG SGFFYAANHG
IPNS_CEPAC  MGSVPVPVAN VPRIDVSPLF GDDKEKKLEV ARAIDAASRD TGFFYAVNHG
17MA bacteria and fungi
- Multiple Sequence alignment of IPNS
Not enough variation
18Step 2
- Add more enzymes, similar to IPNS
19Isopenicillin N Synthase
- Alignment of IPNSs, hydroxylases and expandases
(same biochemical pathway)
20Isopenicillin N Synthase
- New multiple alignment, narrowing down the
possibilities
21Simple multiple alignment
- The known IPNS sequences are very similar.
- The sequences of closely related enzymes are also quite similar.
- Not enough variability to categorize the active sites.
- We need to obtain even more distant sequences.
22Step 3
- Using the multiple alignment for further searches
23Consensus Sequence
- We can deduce a consensus sequence from the
multiple sequence alignment. The consensus
sequence holds the most frequent character of the
alignment at each column.
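Deriving a consensus is a column-by-column majority vote. In the minimal sketch below the function name and the toy alignment are invented; gap characters are counted like any other character, and ties are resolved in favour of the character seen first in the column, which is an arbitrary choice.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of an alignment."""
    return "".join(
        Counter(column).most_common(1)[0][0]  # top character of this column
        for column in zip(*rows)              # iterate over columns
    )


print(consensus(["ACGT", "ACGA", "TCGT"]))  # ACGT
```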
24Profile
- We can deduce a statistical model describing the
multiple sequence alignment. A Profile holds
statistical information about characters in
alignment at each column.
25Profile vs. Consensus
- Consensus: each position reflects the most common character found at that position.
- Profile: each position reflects the frequency of each character found at that position.
26Profile vs. Consensus
- The following multiple alignments will have the
same consensus
27Profile vs. Consensus
- But have a different profile
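The difference is easy to reproduce with two toy alignments. In this sketch a profile is reduced to per-column character frequencies, which is an assumption made for brevity (real profiles also carry scores and gap penalties); the function names and alignments are invented.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def profile(rows):
    """Per-column frequency of every character that occurs in the column."""
    n = len(rows)
    return [
        {char: count / n for char, count in Counter(col).items()}
        for col in zip(*rows)
    ]


conserved = ["ACGT", "ACGT", "ACGT", "ACGT"]
variable = ["ACGT", "ACGT", "TCGT", "AGGA"]

print(consensus(conserved), consensus(variable))  # ACGT ACGT
print(profile(conserved)[0])  # {'A': 1.0}
print(profile(variable)[0])   # {'A': 0.75, 'T': 0.25}
```

Both alignments collapse to the same consensus, but only the profile records that the first column of the second alignment tolerates a T.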
28ProfileMake and ProfileSearch
- ProfileMake creates a profile: a position-specific scoring table.
- The profile is constructed from a multiple sequence alignment.
- profilemake alignment.msf -beg -end
29ProfileSearch
- ProfileSearch searches for sequences in the database that match the profile.
- profilesearch profile.prf
30Close enzymes
- IPNS, Hydroxylase, Expandase
- Ethylene forming enzyme (EFE, ACCO)
- Hyoscyamine 6 hydroxylase
- Flavanone-3-hydroxylases
- Flavonol synthases
- Anthocyanidin hydroxylases
- Anthocyanidin synthases
- Gibberellin A20 oxidases
- Gibberellin 3β oxidases
- Gibberellin 2β, 3β hydroxylase
- Gibberellin 7-oxidase
- Desacetoxyvindoline 4-hydroxylase
- L-proline 3-hydroxylases
- Prolyl 4-hydroxylases
- Lysyl hydroxylases
31Isopenicillin N Synthase
- Common to these enzymes is their involvement in
secondary metabolism, such as the production of
penicillin and cephalosporin antibiotics in
bacteria and fungi, gibberellins, alkaloids,
ethylene, anthocyanidins and flavonoids in
plants, and the modification of collagen.
The HXD(53-57)XH motif plays a role in binding
the iron in the active site.
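A motif of this kind can be located with a simple scan. The sketch below assumes that HXD(53-57)XH means His, any residue, Asp, then 53 to 57 arbitrary residues, then His. The function name is invented, and the test sequence is synthetic: it is padded with alanines so that the motif residues fall at positions 212, 214 and 268.

```python
def find_motif(seq, min_spacer=53, max_spacer=57):
    """Return 1-based positions (His, Asp, His) of every motif occurrence."""
    hits = []
    for i in range(len(seq) - 2):
        if seq[i] == "H" and seq[i + 2] == "D":
            # Try every allowed spacer length between Asp and the second His
            for spacer in range(min_spacer, max_spacer + 1):
                k = i + 3 + spacer
                if k < len(seq) and seq[k] == "H":
                    hits.append((i + 1, i + 3, k + 1))
    return hits


# Synthetic sequence, not a real protein
synthetic = "A" * 211 + "H" + "A" + "D" + "A" * 53 + "H" + "A" * 10
print(find_motif(synthetic))  # [(212, 214, 268)]
```

Scanning explicitly, instead of using a single greedy regular expression, reports every spacer length that fits when more than one histidine lies in the allowed window.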
32Isopenicillin N Synthase
- Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe2+ in IPNS.

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       38.8           96.9
His48Ala    16                  0.56      7.5            13.4
His63Ala    31                  1.0       14.2           14.2
His114Ala   28                  0.85      12.5           14.7
His124Ala   48                  0.84      32.1           38.1
His135Ala   22                  0.59      11.7           19.8
His212Ala   <0.007              n.d.      n.d.
His268Ala   <0.003              n.d.      n.d.

Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      23.8           52.8
Asp131Ala   68                  0.48      36.3           75.5
Asp203Ala   32                  0.91      12.3           13.5
Asp214Ala   <0.004              n.d.      n.d.
33Searching Databases with Multiple Alignments
- Pros
- Using representatives of multiple sequence alignment data in database searches uses more information, resulting in higher sensitivity
- Cons
- Searches take longer and are often more difficult to interpret
34Psi Blast
- PSI-BLAST (Position-Specific Iterated BLAST) is an automatic profile-like search
- The program first performs a gapped BLAST search of the database. The information from the significant alignments is then used to construct a position-specific score matrix. This matrix replaces the query sequence in the next round of database searching
- The program may be iterated until no new significant alignments are found
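The iteration can be illustrated with a toy model. This is not the PSI-BLAST algorithm itself: the sketch assumes ungapped sequences of equal length, a uniform background, a simple pseudocount and a fixed score threshold in place of real alignment statistics. All names, sequences and parameter values are invented.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1 / len(ALPHABET)  # uniform background (simplifying assumption)


def build_pssm(seqs, pseudocount=1.0):
    """Position-specific log-odds scores from equal-length sequences."""
    n = len(seqs)
    pssm = []
    for column in zip(*seqs):
        scores = {}
        for aa in ALPHABET:
            freq = (column.count(aa) + pseudocount * BACKGROUND) / (n + pseudocount)
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def pssm_score(pssm, seq):
    return sum(column[aa] for column, aa in zip(pssm, seq))


def iterated_search(query, database, threshold, max_rounds=10):
    """Rebuild the matrix from the current hits until no new hit appears."""
    hits = {query}
    for _ in range(max_rounds):
        pssm = build_pssm(sorted(hits))
        found = {s for s in database if pssm_score(pssm, s) >= threshold}
        if found <= hits:
            break  # converged: nothing new above the threshold
        hits |= found
    return hits


database = ["HADGHR", "HSDGHR", "HSDAHR", "WWWWWW"]
print(sorted(iterated_search("HADGHK", database, threshold=10)))
print(sorted(iterated_search("HADGHK", database, threshold=10, max_rounds=1)))
```

In this toy run the sequence HSDAHR shares only three of six residues with the query and is missed in the first round, but it is picked up once the matrix has been rebuilt from the first-round hits.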
35Psi-Blast Output
- Detected by Psi-Blast
- IPNS
- Deacetoxy cephalosporin C synthase (expandase)
- Deacetyl cephalosporin C synthase (hydroxylase)
- Ethylene Forming Enzyme (EFE, ACCO)
- Hyoscyamine 6 hydroxylase
- Flavanone-3-hydroxylases
- Flavonol synthases
- Anthocyanidin hydroxylases
- Gibberellin A20 oxidases
- Desacetoxyvindoline 4-hydroxylase
- Not detected by Psi-Blast
- (Prolyl 4-hydroxylases and Lysyl hydroxylases)