Provided by: davvbio


Title: Protein Sequence Analysis

1
Protein Sequence Analysis

By
Rashmi Shrivastava
Lecturer
School of Biotechnology
Devi Ahilya Vishwavidyalaya
Indore

2
Introduction

The genomes of many organisms have been deciphered.
The next step is to identify key regions,
especially protein-coding regions.
Assigning functions to individual proteins.
Predicting the molecular structures of the proteins.
Developing protein interaction networks.
Utilizing the information obtained for structure-based
drug design, discovering new drug targets,
creating mutations to alter properties or create
a desired property in proteins, and so on.

By themselves, the letters (amino acid sequence/
genome sequence) have no meaning.
Our aim is to read them as a language:
sentences = proteins
words = motifs (recognizable patterns and signatures)
To investigate the meaning of sequences there are
two approaches:
Pattern recognition techniques - detect
similarity between sequences.
Ab initio prediction methods - predict the
structure and thus the function.

4
Protein databases(The source of information)

Primary and Secondary databases
Primary sequence databases-
Entrez-protein
PIR- Developed at NBRF
Swiss-Prot
TrEMBL

5
Secondary database

- Contain the results of analysis of primary databases.
- PROSITE/InterPro - protein families are characterized
by the presence of the single most conserved motif
(domain), identified by multiple sequence alignment.
- PRINTS - protein families are characterized by
several conserved motifs that together form a fingerprint
or signature for a particular family.
- BLOCKS and Pfam
- Profiles - the variable regions between conserved
motifs contain information about insertions and
deletions; useful for detecting distant sequence relationships.
- ENZYME and KEGG - functional classification

6
Structure classification databases

SCOP (Structural Classification of Proteins)
classifies proteins in a hierarchy: family, superfamily and
fold
CATH (Class, Architecture, Topology,
Homology) - hierarchical domain classification of
proteins
C - gross secondary structure content
A - arrangement of secondary structure elements
T - overall shape and connectivity
H - > 35% sequence identity
Protein Data Bank (PDB)

Sequence alignment

Pairwise
Multiple
8
Pairwise Sequence Alignment
9
Sequence alignment
Global Sequence Alignment
Local sequence alignment
10
Algorithm

Global sequence alignment -
Needleman-Wunsch
Local sequence alignment -
Smith-Waterman
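
The slide only names the two algorithms. As a rough sketch of how they differ, the function below fills the same dynamic-programming table for both: the global (Needleman-Wunsch) variant charges for leading gaps and reads the score from the last cell, while the local (Smith-Waterman) variant floors every cell at zero and reports the best cell anywhere. The match/mismatch/gap values and the name alignment_score are illustrative assumptions, not taken from the slides, and only the score is computed (no traceback).

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b with a linear gap penalty.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                       # global
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))  # local
```

With these toy values the global call aligns ACGT against A-GT (three matches, one gap), and the local call picks out the shared ACG segment while ignoring the unrelated flanks.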

11
Identity and Similarity
In an alignment, the sequence that is already in the
database is known as the subject, and the sequence being
aligned against it is termed the query or probe sequence.
If an aligned probe residue is the same as the subject
residue, the two are identical; if they differ but are of
the same nature (e.g. glutamate and aspartate), they are
similar.

VLSPADKTNVKAAWGKVGAHAGYEG
|||  .   |   | || |     |
VLSEGEWQLVLHVWAKVEADVAGHG

Total residues: 25
Identical residues (|): 09
Similar, not identical (.): 01
Gaps: 00
Percent similarity: 40.000 (| and ., identity + similarity)
Percent identity: 36.000 (| only)
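
The two percentages on this slide can be recomputed with a short sketch. The similarity groups below are an assumption chosen for illustration (the slide only gives glutamate/aspartate as an example of residues of the same nature), and the helper names are hypothetical.

```python
# Assumed similarity groups; only D/E is named on the slide.
SIMILAR_GROUPS = ["DE", "KRH", "ILVM", "FYW", "ST", "NQ"]


def is_similar(x, y):
    """True if x and y differ but fall in the same assumed group."""
    return x != y and any(x in g and y in g for g in SIMILAR_GROUPS)


def identity_similarity(query, subject):
    """Return (percent identity, percent similarity) over all aligned columns."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    columns = len(query)
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    similar = sum(1 for q, s in zip(query, subject) if is_similar(q, s))
    return 100 * identical / columns, 100 * (identical + similar) / columns


print(identity_similarity("VLSPADKTNVKAAWGKVGAHAGYEG",
                          "VLSEGEWQLVLHVWAKVEADVAGHG"))
```

For the pair shown on the slide this gives 9 identical columns and 1 similar column (D/E) out of 25, i.e. 36% identity and 40% similarity.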
12
Alignment

ATCAGAGTC
TTC--AGTC

ATCAGAGTC
TTCAG--TC

ATCAGAGTC
TTCA--GTC

13
Aligning Sequences
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
14
Gap Insertion
V K L A W A A K G N E A A P A K A A V D H Y V A A
V K A W A A K G N E A E G L S A A P D J K V A A P
Total residues: 25; identical residues: 04;
gaps: 00; percent identity: 16.00

V K L A W A A K G N E A A P A K A A V D H Y V A A
V K _ A W A A K G N E A E G L S A A P D J K V A A
Total residues: 25; identical residues: 18;
gaps: 01; percent identity: 72.00
15
Scoring System
Proteins can differ even between closely related
organisms. Some substitutions are more frequent than
others. Chemically similar amino acids can replace one
another without severely affecting the protein's
function and structure.
16
Matrices formed to score alignment

Sparse matrices
Based on identical residue matching
Problems faced
1. Diagnostic power is relatively poor, as all
identical matches carry equal weighting.
2. Matches may be mathematically significant but
biologically insignificant.

17
To solve this problem
Scoring matrices have been devised that weight
matches between non-identical residues
according to observed substitution rates across
large evolutionary distances. These scoring
matrices are mathematically insignificant
but biologically significant, especially for
aligning sequences of very low identity.
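
A small sketch of the contrast, assuming a handful of substitution scores copied from the BLOSUM table on slide 29 and an ungapped toy alignment made up for illustration (the function names are hypothetical): an identity-only scheme sees nothing in common between the two peptides, while the weighted scheme rewards the conservative D/E, I/V and K/R replacements.

```python
# A few entries copied from the BLOSUM table on slide 29 (symmetric lookups).
WEIGHTED = {
    ("D", "D"): 6, ("E", "E"): 5, ("D", "E"): 2,
    ("I", "I"): 4, ("V", "V"): 4, ("I", "V"): 3,
    ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2,
}


def identity_score(x, y):
    """Sparse (identity) scoring: 1 for a match, 0 otherwise."""
    return 1 if x == y else 0


def weighted_score(x, y):
    """Look up a pair in the small table above, in either order."""
    key = (x, y) if (x, y) in WEIGHTED else (y, x)
    return WEIGHTED[key]


def score_ungapped(a, b, scorer):
    """Sum the per-column scores of two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(scorer(x, y) for x, y in zip(a, b))


print(score_ungapped("DIK", "EVR", identity_score))  # no identical residues
print(score_ungapped("DIK", "EVR", weighted_score))  # conservative changes score positively
```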
19
Percent Accepted Mutation (PAM or Dayhoff)
Matrices

Similar sequences organized into phylogenetic
trees
Number of amino acid changes counted
Relative mutabilities evaluated
20 x 20 amino acid substitution matrix calculated

PAM 1 = 1 accepted mutation event per 100 amino
acids; PAM 250 = 250 mutation events per 100.
The PAM 1 matrix can be multiplied by itself N times
to give transition matrices for sequences that
have undergone N mutations per 100 amino acids.
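
The repeated multiplication can be sketched with a toy example. The 3-letter alphabet and the probabilities in TOY_PAM1 are invented for illustration (a real PAM 1 matrix is 20 x 20 and derived from observed mutations); only the idea of raising the one-step matrix to the N-th power comes from the slide.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM 1" for a 3-letter alphabet: each residue is
# unchanged with probability 0.99 (about 1 change per 100 residues).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Each row of the result still sums to 1, but the diagonal has dropped from 0.99 towards the uniform value of 1/3, which is how the extrapolated matrices come to describe more distant relationships.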

Derived from global alignments of closely related
sequences.
Matrices for greater evolutionary distances are
extrapolated from those for lesser ones.
The number with the matrix (PAM40, PAM100) refers
to the evolutionary distance; greater numbers are
greater distances.
Does not take into account the different evolutionary
rates of conserved and non-conserved
regions.

22
PAM 1
23
PAM 250
25
Scoring
A K W T N L K - - - - W A K V - A D V A G H - G
A K - T N V K A K L P W G K V G G H V A G E Y G

The score of the alignment in this system is:
[matrix value at (A,A) + (K,K) + (T,T) + (K,K)
+ (W,W) + (A,G) + ...]
- (penalty for gap insertion/deletion) x (number of gaps)
- (penalty for gap extension) x (total length of
all gaps)
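
The scoring rule above can be written out as a short sketch. The substitution function (match +2, mismatch -1) and the penalties (2 per gap, 1 per gapped position) are placeholder assumptions so that the example is self-contained; in practice the pair scores would come from a PAM or BLOSUM matrix. A gap is counted as one uninterrupted run of "-" in one sequence.

```python
def alignment_score(seq_a, seq_b, substitution, gap_open, gap_extend):
    """Sum of pair scores minus gap_open per gap and gap_extend per gapped column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pair_total = 0
    n_gaps = 0       # number of separate gaps (runs of '-')
    gap_length = 0   # total length of all gaps
    previous = None  # which sequence carried a gap in the previous column
    for x, y in zip(seq_a, seq_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if side != previous:
                n_gaps += 1
            gap_length += 1
            previous = side
        else:
            pair_total += substitution(x, y)
            previous = None
    return pair_total - gap_open * n_gaps - gap_extend * gap_length


def toy_substitution(x, y):
    """Placeholder scores; a real run would look these up in a matrix."""
    return 2 if x == y else -1


a = "AKWTNLK----WAKV-ADVAGH-G"
b = "AK-TNVKAKLPWGKVGGHVAGEYG"
print(alignment_score(a, b, toy_substitution, gap_open=2, gap_extend=1))
```

For the alignment on this slide there are 12 identical columns, 5 mismatched columns, 4 gaps and 7 gapped positions, so the placeholder scheme gives 24 - 5 - 8 - 7 = 4.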

27

Henikoff, S. and Henikoff, J.G. (1992)
Uses blocks of protein sequence fragments from
different families (the BLOCKS database).
Amino acid pair frequencies are calculated by summing
over all possible pairs in each block.
Different evolutionary distances are incorporated
into this scheme with a clustering procedure
(identity over a particular threshold = same
cluster).

Target frequencies are identified directly
instead of by extrapolation.
Sequences more than x% identical within the
block where substitutions are being counted are
grouped together and treated as a single sequence.
BLOSUM 50: > 50% identity
BLOSUM 62: > 62% identity
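
The pair-counting step can be sketched as follows. The three-sequence block is made up for illustration, and the sketch leaves out the clustering of near-identical sequences and the conversion of frequencies into log-odds scores; it only shows what "summing over all possible pairs in a block" means column by column.

```python
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


# Hypothetical block: three aligned fragments, three columns.
toy_block = ["AKV", "AKL", "SKV"]
counts = pair_counts(toy_block)
total = sum(counts.values())
for pair, n in sorted(counts.items()):
    print(pair, n, round(n / total, 3))
```

With three sequences each column contributes three pairs, so the toy block yields nine pairs in total; dividing each count by that total gives the observed pair frequencies.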

29
BLOSUM

   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V
A  4
B -2  6
C  0 -3  9
D -2  6 -3  6
E -1  2 -4  2  5
F -2 -3 -2 -3 -3  6
G  0 -1 -3 -1 -2 -3  6
H -2 -1 -3 -1  0 -1 -2  8
I -1 -3 -1 -3 -3  0 -4 -3  4
K -1 -1 -3 -1  1 -3 -2 -1 -3  5
L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4
M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5
N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6
P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7
Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5
R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5
S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4
T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5
V  0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4

31
Thumb rules
Lower PAMs and higher BLOSUMs find short local
alignments of highly similar sequences. Higher
PAMs and lower BLOSUMs find longer, weaker local
alignments.
32
PAM vs. BLOSUM
Based on the basic assumptions and the
construction of each matrix: the PAM model is
designed to track the evolutionary origin of
proteins; the BLOSUM model is designed to find
conserved domains of proteins.
34
Protein Structure

Primary structure - the linear sequence
of amino acids in a protein molecule.
Secondary structure - regions of local
regularity within a protein fold (α-helices,
β-strands, turns, etc.).
Supersecondary structure - the arrangement of
α-helices and/or β-strands into discrete folding
units (β-barrels, β-α-β units, Greek key motifs,
etc.).
Tertiary structure - the overall fold of a protein
sequence, formed by packing of its secondary
and/or supersecondary structure elements.
Quaternary structure - the arrangement of separate
protein chains in a protein molecule.

35
From the Primary sequence to protein properties

Predicting protein localization/secretory nature
from the presence of a signal peptide and
localization signals
Transmembrane helix prediction to identify
membrane proteins
Calculation of physicochemical properties: pI, molecular weight (Mwt)
Identification of coiled-coil regions
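
As one concrete case of a property computed directly from the primary sequence, the sketch below estimates molecular weight by summing average residue masses and adding one water for the free termini. The mass table holds commonly quoted average values in daltons and should be treated as approximate; the function name is hypothetical, and pI (which needs pKa values) is not covered here.

```python
# Average residue masses in Da (amino acid minus one water); approximate.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153


def molecular_weight(sequence):
    """Approximate average molecular weight of an unmodified peptide chain."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


print(round(molecular_weight("GG"), 2))  # glycylglycine, about 132.12 Da
print(round(molecular_weight("VLSPADKTNVKAAWGKVGAHAGYEG"), 1))
```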