Title: Protein Sequence Analysis
1 Protein Sequence Analysis
- By Rashmi Shrivastava, Lecturer, School of Biotechnology, Devi Ahilya Vishwavidyalaya, Indore
2 Introduction
- The genomes of many organisms have been deciphered.
- The next step is to identify key regions, especially protein-coding regions.
- Assigning functions to individual proteins.
- Predicting the molecular structures of the proteins.
- Developing protein interaction networks.
- Using the information obtained for structure-based drug design, discovering new drug targets, creating mutations to alter existing properties or introduce desired properties in proteins, and so on.
3
- By themselves, the letters (amino acid sequence or genome sequence) have no meaning.
- Our aim is to create sentences (proteins) from words (motifs) by recognizing patterns and signatures.
- There are two approaches to investigating the meaning of sequences:
- Pattern recognition techniques: detect similarity between sequences.
- Ab initio prediction methods: predict the structure and, from it, the function.
4 Protein databases (the source of information)
- Primary and secondary databases
- Primary sequence databases:
- Entrez Protein
- PIR (developed at NBRF)
- Swiss-Prot
- TrEMBL
5 Secondary databases
- Results of analysis of primary databases.
- PROSITE/InterPro: protein families characterized by the presence of the single most conserved motif (domain), found by multiple sequence alignment.
- PRINTS: protein families characterized by several conserved motifs, which together form a fingerprint or signature for a particular family.
- BLOCKS and Pfam
- Profiles: the variable regions between conserved motifs contain information about insertions and deletions, which helps detect distant sequence relationships.
- ENZYME and KEGG: functional classification.
6 Structure classification databases
- SCOP (Structural Classification of Proteins): classifies proteins in a hierarchy of family, superfamily and fold.
- CATH (Class, Architecture, Topology, Homology): hierarchical domain classification of proteins.
- C: gross secondary structure content
- A: arrangement of secondary structure elements
- T: overall shape and connectivity
- H: > 35% sequence identity
- Protein Data Bank (PDB)
7 Pairwise
Multiple
8 Pairwise Sequence Alignment
9 Sequence alignment
Global sequence alignment
Local sequence alignment
10 Algorithm
- Global sequence alignment:
- Needleman-Wunsch
- Local sequence alignment:
- Smith-Waterman
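The difference between the two algorithms can be sketched with one dynamic-programming routine. This is a minimal illustration, not a production aligner: the match, mismatch and gap values are arbitrary assumptions, gaps are scored linearly, and only the score (not the alignment itself) is returned.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False follows the Needleman-Wunsch (global) recurrence.
    local=True follows the Smith-Waterman (local) recurrence.
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                table[i - 1][j - 1] + pair,  # align the two residues
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
            if local:
                # Local alignment: a negative prefix is discarded.
                score = max(score, 0)
                best = max(best, score)
            table[i][j] = score
    # Global: score of the full-length alignment. Local: best cell anywhere.
    return best if local else table[-1][-1]


global_score = alignment_score("ATCAGAGTC", "TTCAGTC")
local_score = alignment_score("ATCAGAGTC", "TTCAGTC", local=True)
```

The only differences are the initialisation of the first row and column, the floor at zero, and where the final score is read from.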
11 Identity and Similarity
In an alignment, the sequence that is already in the database is known as the subject, and the sequence being aligned against it is termed the query or probe sequence. If an aligned probe residue is the same as the subject residue, the two are identical; if they differ but are of the same chemical nature (for example, glutamate and aspartate), they are similar.
VLSPADKTNVKAAWGKVGAHAGYEG
|||  .   |   | || |     |
VLSEGEWQLVLHVWAKVEADVAGHG
Total residues: 25
Identical residues: 09
Similar (not identical) residues: 01
Gaps: 00
Percent similarity: 40.000 (| and ., that is identity + similarity)
Percent identity: 36.000 (| only)
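The counts above can be reproduced with a short sketch. The similarity rule here is an assumption made for illustration: only the acidic pair D/E mentioned on the slide is treated as similar, whereas real tools derive similarity from a substitution matrix.

```python
# Illustrative assumption: only aspartate/glutamate count as "similar".
SIMILAR_GROUPS = [{"D", "E"}]


def identity_and_similarity(query, subject, gap="-"):
    """Count identical, similar and gapped columns of an existing alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = gaps = 0
    for q, s in zip(query, subject):
        if q == gap or s == gap:
            gaps += 1
        elif q == s:
            identical += 1
        elif any(q in group and s in group for group in SIMILAR_GROUPS):
            similar += 1
    total = len(query)
    return {
        "total": total,
        "identical": identical,
        "similar": similar,
        "gaps": gaps,
        "percent_identity": 100.0 * identical / total,
        "percent_similarity": 100.0 * (identical + similar) / total,
    }


stats = identity_and_similarity(
    "VLSPADKTNVKAAWGKVGAHAGYEG",
    "VLSEGEWQLVLHVWAKVEADVAGHG",
)
```

For the two sequences on the slide this gives 9 identical and 1 similar residue out of 25, matching the 36% identity and 40% similarity quoted.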
12 Alignment
ATCAGAGTC
TTC--AGTC

ATCAGAGTC
TTCAG--TC

ATCAGAGTC
TTCA--GTC
13 Aligning Sequences
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
14 Gap Insertion
V K L A W A A K G N E A A P A K A A V D H Y V A A
V K A W A A K G N E A E G L S A A P D J K V A A P
Total residues: 25; identical residues: 04; gaps: 00; percent identity: 16.00

V K L A W A A K G N E A A P A K A A V D H Y V A A
V K _ A W A A K G N E A E G L S A A P D J K V A A
Total residues: 25; identical residues: 18; gaps: 01; percent identity: 72.00
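The effect of a single well-placed gap can be sketched by brute force. The two short sequences below are hypothetical fragments chosen for illustration (they are not the slide's full sequences), and the helper names are made up for this example.

```python
def percent_identity(a, b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return 100.0 * matches / len(a)


def best_single_gap(longer, shorter, gap="-"):
    """Try one gap at every position of `shorter` and keep the best placement."""
    if len(longer) != len(shorter) + 1:
        raise ValueError("sequences must differ in length by exactly one")
    candidates = [
        shorter[:i] + gap + shorter[i:] for i in range(len(shorter) + 1)
    ]
    best = max(candidates, key=lambda c: percent_identity(longer, c))
    return best, percent_identity(longer, best)


# Hypothetical example: the second sequence lacks the leucine at position 3.
gapped, identity = best_single_gap("VKLAWAAKGN", "VKAWAAKGN")
```

Placing the gap opposite the missing residue brings every downstream residue back into register, which is why identity jumps so sharply on the slide.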
15 Scoring System
The same protein can differ even between closely related organisms. Some substitutions are more frequent than others. Chemically similar amino acids can replace one another without severely affecting the protein's function and structure.
16 Matrices formed to score alignments
- Sparse matrices
- Based on identical residue matching
- Problems faced:
- 1. Diagnostic power is relatively poor, as all identical matches carry equal weighting.
- 2. Mathematically significant but biologically insignificant.
17 To solve this problem
Scoring matrices have been devised that weight matches between non-identical residues according to observed substitution rates across large evolutionary distances. These scoring matrices are mathematically insignificant but biologically significant, especially for aligning sequences of very low identity.
19 Percent Accepted Mutation (PAM or Dayhoff) Matrices
- Similar sequences organized into phylogenetic trees
- Number of amino acid changes counted
- Relative mutabilities evaluated
- 20 x 20 amino acid substitution matrix calculated
20
- PAM 1: 1 accepted mutation event per 100 amino acids. PAM 250: 250 mutation events per 100 amino acids.
- The PAM 1 matrix can be multiplied by itself N times to give transition matrices for sequences that have undergone N mutation events per 100 amino acids.
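The repeated multiplication can be sketched on a toy matrix. The 3 x 3 transition matrix below is invented purely for illustration (the real PAM 1 matrix is 20 x 20 and derived from observed data); only the mechanics of raising it to the Nth power carry over.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, n):
    """Multiply matrix m by itself n times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM 1"-like matrix over a 3-letter alphabet.
# Each row sums to 1: the probability that a residue stays or is replaced.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.010, 0.980, 0.010],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
```

After many multiplications the diagonal shrinks and the off-diagonal entries grow, reflecting the larger evolutionary distance that a higher PAM number stands for.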
21
- Derived from global alignments of closely related sequences.
- Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
- The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
- Does not take into account different evolutionary rates between conserved and non-conserved regions.
22 PAM 1
23 PAM 250
25 Scoring
A K W T N L K - - - - W A K V - A D V A G H - G
A K - T N V K A K L P W G K V G G H V A G E Y G
- The score of the alignment in this system is:
- the sum of the matrix values at the aligned pairs, (A,A) + (K,K) + (T,T) + (K,K) + (W,W) + (A,G) and so on for every aligned pair,
- minus (penalty for gap insertion/deletion) x (number of gaps),
- minus (penalty for gap extension) x (total length of all gaps).
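The scoring rule above can be sketched as follows. The pair-scoring function and the two penalty values are placeholders chosen for illustration; in practice the pair scores would come from a PAM or BLOSUM matrix.

```python
def score_alignment(top, bottom, pair_score, gap_open, gap_extend, gap="-"):
    """Score an existing alignment: pair scores minus gap penalties.

    Returns (score, number_of_gaps, total_gap_length).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    pair_total = 0
    n_gaps = 0       # number of separate gap runs
    gap_length = 0   # total number of gapped columns
    previous = None  # which row was gapped in the previous column
    for a, b in zip(top, bottom):
        if a == gap:
            current = "top"
        elif b == gap:
            current = "bottom"
        else:
            current = None
            pair_total += pair_score(a, b)
        if current is not None:
            gap_length += 1
            if current != previous:
                n_gaps += 1  # a new gap run starts here
        previous = current
    score = pair_total - gap_open * n_gaps - gap_extend * gap_length
    return score, n_gaps, gap_length


def toy_pair_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 2 if a == b else -1


result = score_alignment(
    "AKWTNLK----WAKV-ADVAGH-G",
    "AK-TNVKAKLPWGKVGGHVAGEYG",
    toy_pair_score,
    gap_open=2,
    gap_extend=1,
)
```

With these placeholder values the slide's alignment has 4 gaps of total length 7, and 17 aligned pairs of which 12 are identical.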
27
- Henikoff, S. and Henikoff, J.G. (1992)
- Uses blocks of protein sequence fragments from different families (the BLOCKS database).
- Amino acid pair frequencies are calculated by summing over all possible pairs in each block.
- Different evolutionary distances are incorporated into this scheme with a clustering procedure (sequences with identity over a particular threshold are placed in the same cluster).
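The pair-counting step can be sketched as below. This is a simplified illustration on a made-up three-sequence block; it leaves out the clustering and weighting that the BLOSUM procedure applies, and the function name is an assumption.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        # Every pair of sequences contributes one residue pair per column.
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical block of three aligned fragments, two columns wide.
pairs = count_pairs(["AC", "AC", "GC"])
```

Here the first column contributes one (A,A) pair and two (A,G) pairs, and the second contributes three (C,C) pairs. These raw counts are the starting point for the observed pair frequencies.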
28
- Target frequencies are identified directly instead of by extrapolation.
- Sequences more than x% identical within the block where substitutions are being counted are grouped together and treated as a single sequence.
- BLOSUM 50: > 50% identity
- BLOSUM 62: > 62% identity
29 BLOSUM
    A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V
A   4
B  -2  6
C   0 -3  9
D  -2  6 -3  6
E  -1  2 -4  2  5
F  -2 -3 -2 -3 -3  6
G   0 -1 -3 -1 -2 -3  6
H  -2 -1 -3 -1  0 -1 -2  8
I  -1 -3 -1 -3 -3  0 -4 -3  4
K  -1 -1 -3 -1  1 -3 -2 -1 -3  5
L  -1 -4 -1 -4 -3  0 -4 -3  2 -2  4
M  -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5
N  -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6
P  -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7
Q  -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5
R  -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5
S   1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4
T   0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5
V   0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4
31 Rules of thumb
Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences. Higher PAMs and lower BLOSUMs find longer, weaker local alignments.
32 PAM vs. BLOSUM
Given the basic assumptions behind each matrix and the way each is constructed, the PAM model is designed to track the evolutionary origin of proteins, whereas the BLOSUM model is designed to find conserved domains of proteins.
34 Protein Structure
- Primary structure: the linear sequence of amino acids in a protein molecule.
- Secondary structure: regions of local regularity within a protein fold (α-helices, β-strands, turns, etc.).
- Supersecondary structure: the arrangement of α-helices and/or β-strands into discrete folding units (β-barrels, β-α-β units, Greek key motifs, etc.).
- Tertiary structure: the overall fold of a protein sequence, formed by packing of its secondary and/or supersecondary structure elements.
- Quaternary structure: the arrangement of separate protein chains in a protein molecule.
35 From the primary sequence to protein properties
- Predicting protein localization or secretory nature from the presence of a signal peptide and localization signals.
- Transmembrane helix prediction to identify membrane proteins.
- Calculation of physicochemical properties: pI, molecular weight.
- Identification of coiled-coil regions.
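As one example of a property computed directly from the sequence, molecular weight can be estimated by summing residue masses and adding one water for the free termini. The values below are commonly tabulated average residue masses in daltons, quoted here from memory as an assumption; check them against a reference table before relying on the numbers.

```python
# Approximate average residue masses in Da (amino acid minus one water).
# Assumption: standard tabulated values; verify before serious use.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01524  # one water added for the N- and C-terminal groups


def molecular_weight(sequence):
    """Estimate the average molecular weight of an unmodified peptide chain."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS


glycylglycine = molecular_weight("GG")
```

The sketch ignores post-translational modifications, disulfide bonds and non-standard residues, all of which change the true mass.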
36 Post-translational modification prediction
www.expasy.org
39 Kyte-Doolittle hydrophobicity plot
- Describes the nature of the amino acids: hydrophilic or hydrophobic.
- A window of 9-20 amino acids is taken.
- A value greater than 0 means hydrophobic.
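The sliding-window calculation behind such a plot can be sketched as below, using the published Kyte-Doolittle hydropathy scale. The window size of 9 and the test sequence are arbitrary choices for illustration, and the function name is an assumption.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Average hydropathy over each window of consecutive residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if window > len(values):
        return []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical sequence: a hydrophobic stretch followed by an acidic one.
profile = hydropathy_profile("IIIIIIIII" + "DDDDDDDDD", window=9)
```

Plotting the profile against the window position gives the hydrophobicity plot; stretches that stay above 0 are candidate hydrophobic segments.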
40 From Sequence to Structure
- Secondary structure prediction: GOR, PredictProtein, nnPredict
- Domain prediction: SBASE, ProDom
41 Importance of protein secondary structure prediction
43 Basis of secondary structure prediction
- Conservation in the multiple sequence alignment
- Hidden Markov models and neural networks
- 70-80% accuracy is achieved.
44 Method used
45 Key features of secondary structure prediction
46 Chou-Fasman Algorithm
48 GOR
Multiple Sequence
49 Some sites