Title: Characterization of Secondary Structure of Proteins using Different Vocabularies
1. Characterization of Secondary Structure of Proteins using Different Vocabularies
- Madhavi K. Ganapathiraju
- Language Technologies Institute
- Advisors: Raj Reddy, Judith Klein-Seetharaman, Roni Rosenfeld
2nd Biological Language Modeling Workshop, Carnegie Mellon University, May 13-14, 2003
2. Presentation overview
- Classification of Protein Segments by their Secondary Structure types
- Document Processing Techniques
- Choice of Vocabulary in Protein Sequences
- Application of Latent Semantic Analysis
- Results
- Discussion
3. Secondary Structure of Protein
Sample Protein:
MEPAPSAGAELQPPLFANASDAYPSACPSAGANASGPPGARSASSLALAIAITALYSAVCAVGLLGNVLVMFGIVRYTKMKTATNIYIFNLALADALATSTLPFQSA
4. Application of Text Processing
- Letters → Words → Sentences
- Letter counts in languages
- Word counts in Documents
- Residues → Secondary Structure → Proteins → Genomes
Can unigrams distinguish Secondary Structure Elements from one another?
5. Unigrams for Document Classification
- The Word-Document matrix represents documents in terms of their word unigrams
- This is a bag-of-words model, since the positions of words in the document are not taken into account (a small illustrative sketch follows)
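As a concrete illustration of such a word-document matrix, a minimal Python sketch is given below; the documents and vocabulary are toy examples of my own, not taken from the slides.

```python
from collections import Counter

# Toy document collection; each "document" is just a list of words.
docs = {
    "Doc-1": "the cat sat on the mat".split(),
    "Doc-2": "the dog sat on the log".split(),
    "Doc-3": "cats and dogs".split(),
}

# Vocabulary = all distinct words across the collection.
vocab = sorted({w for words in docs.values() for w in words})

# Word-document matrix: rows are words, columns are documents,
# entry (i, j) = count of word i in document j (word positions ignored).
counts = {name: Counter(words) for name, words in docs.items()}
matrix = [[counts[name][w] for name in docs] for w in vocab]

for w, row in zip(vocab, matrix):
    print(f"{w:>5}: {row}")
```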
6. Word-Document Matrix
7-11. Document Vectors
[figure, built up over slides 7-11: each column of the word-document matrix is a document vector: Doc-1, Doc-2, Doc-3, ..., Doc-N]
12-14. Document Comparison
- Documents can be compared to one another in terms of the dot product of their document vectors (a small sketch follows below)
- Formal modeling of documents is presented in the next few slides
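A minimal sketch of such a comparison, using the plain dot product and its length-normalized form (cosine similarity); the two document vectors are illustrative only.

```python
import math

# Two toy document vectors (unigram counts over the same vocabulary).
doc_a = [2, 0, 1, 3]
doc_b = [1, 1, 0, 2]

# Dot product: large when the documents share frequent words.
dot = sum(a * b for a, b in zip(doc_a, doc_b))

# Cosine similarity: dot product of length-normalized vectors,
# so long and short documents can be compared on an equal footing.
norm_a = math.sqrt(sum(a * a for a in doc_a))
norm_b = math.sqrt(sum(b * b for b in doc_b))
cosine = dot / (norm_a * norm_b)

print(f"dot product = {dot}, cosine similarity = {cosine:.3f}")
```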
15. Vector Space Model construction
- Document vectors in the word-document matrix are normalized
  - by word counts in the entire document collection
  - by document lengths
- This gives a Vector Space Model (VSM) of the set of documents
- Equations for normalization follow on the next slide
16. Word count normalization
w_{i,j} = (1 - ε_i) · c_{i,j} / n_j
where
- c_{i,j} is the count of word i in document j (word count in document)
- n_j is the length of document j (document length)
- ε_i is the normalized entropy of word i across the corpus (depends on word counts in the corpus):
  ε_i = -(1 / log N) · Σ_j (c_{i,j} / t_i) · log(c_{i,j} / t_i)
- t_i is the total number of times word i occurs in the corpus, and N is the number of documents
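The quantities named above (word count in document, document length, and a corpus-dependent factor built from t_i) match the entropy-based weighting of Bellegarda (2000), so a sketch under that assumption is given below; the exact weighting used in this work may differ in detail.

```python
import numpy as np

def normalize_word_document_matrix(C):
    """Entropy-weighted normalization of a word-document count matrix C
    (rows = words, columns = documents), in the style of Bellegarda (2000):
        w[i, j] = (1 - eps_i) * c[i, j] / n_j
    where n_j is the length of document j and eps_i is the normalized
    entropy of word i across the corpus (computed from t_i, the total
    count of word i in the corpus)."""
    C = np.asarray(C, dtype=float)
    n_docs = C.shape[1]
    doc_len = C.sum(axis=0)              # n_j: document lengths
    t = C.sum(axis=1, keepdims=True)     # t_i: corpus count of each word

    p = np.divide(C, t, out=np.zeros_like(C), where=t > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    eps = -plogp.sum(axis=1) / np.log(n_docs)   # normalized entropy per word

    return (1.0 - eps)[:, None] * C / doc_len[None, :]

# Example: 4 words x 3 documents.
C = [[2, 0, 1],
     [1, 1, 1],
     [0, 3, 0],
     [1, 1, 2]]
print(normalize_word_document_matrix(C))
```

A word that is spread evenly over all documents gets weight close to zero, while a word concentrated in few documents keeps most of its count.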
17. Word-Document Matrix
[figure: the word-document matrix and the normalized word-document matrix]
18. Document vectors after normalization
[figure: the document vectors after normalization]
19. Use of Vector Space Model
- A query document is also represented as a vector
- It is normalized by corpus word counts
- Documents related to the query document are identified by measuring the similarity of document vectors to the query document vector
20. Application to Protein Secondary Structure Prediction
21. Protein Secondary Structure
- DSSP (Dictionary of Protein Secondary Structure): annotation of each residue with its structure
- Based on hydrogen-bonding patterns and geometrical constraints
- 7 DSSP labels for protein secondary structure:
  - Helix types: H, G
  - Strand types: E, B
  - Coil types: T, S, I
22. Example
Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP:     ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT
Key to DSSP labels:
- T, S, I, _ → Coil
- E, B → Strand
- H, G → Helix
23. Reference Model
- Proteins are segmented into structural segments
- A normalized word-document matrix is constructed from the structural segments
24-25. Example
Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP:     ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT
Structural segments obtained from the given sequence:
PKPPVKFN RRIFLLNTQNVI NG YVKWAI ND VSL A LPPTP YLGAMKY NLLH
Unigrams are then counted within each structural segment.
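One way to reproduce such a segmentation is to map the 7 DSSP labels to the 3 classes using the key from slide 22 and cut the sequence wherever the class changes; the sketch below does exactly that, though the authors' exact segmentation rule may differ slightly at segment boundaries, so the printed segments need not match the slide's list one-for-one.

```python
from itertools import groupby

# Key to DSSP labels (slide 22): T, S, I, _ -> Coil; E, B -> Strand; H, G -> Helix.
DSSP_TO_CLASS = {
    "H": "Helix", "G": "Helix",
    "E": "Strand", "B": "Strand",
    "T": "Coil", "S": "Coil", "I": "Coil", "_": "Coil",
}

def segment_by_structure(residues, dssp):
    """Cut the residue string into structural segments wherever the
    3-class structure label changes; returns (segment, class) pairs."""
    assert len(residues) == len(dssp)
    classes = [DSSP_TO_CLASS[label] for label in dssp]
    segments, start = [], 0
    for cls, run in groupby(classes):
        length = len(list(run))
        segments.append((residues[start:start + length], cls))
        start += length
    return segments

residues = "PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH"
dssp     = "____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT"
for seg, cls in segment_by_structure(residues, dssp):
    print(f"{cls:>6}: {seg}")
```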
26-27. Structural Segments
[figure: Amino-acid / Structural-Segment matrix; rows are the 20 amino acids, columns are the structural segments]
This matrix is similar to the Word-Document Matrix.
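Counting amino-acid unigrams within each structural segment yields a 20 × (number of segments) matrix; a minimal sketch, using the segments from the example slide:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

# Structural segments from the example on slides 24-25.
segments = ["PKPPVKFN", "RRIFLLNTQNVI", "NG", "YVKWAI", "ND",
            "VSL", "A", "LPPTP", "YLGAMKY", "NLLH"]

# Amino-acid / structural-segment matrix: rows are the 20 amino acids,
# columns are segments, entry (i, j) = count of amino acid i in segment j.
counts = [Counter(seg) for seg in segments]
matrix = [[c[aa] for c in counts] for aa in AMINO_ACIDS]

for aa, row in zip(AMINO_ACIDS, matrix):
    print(aa, row)
```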
28-29. Document Vectors
[figure: document vectors and word vectors of the reference model; slide 29 adds a query vector in the same space]
30. Data Set used for PSSP
- JPred data
  - 513 protein sequences in all
  - < 25% homology between sequences
  - Residues and corresponding DSSP annotations are given
- We used
  - 50 sequences for model construction (training)
  - 30 sequences for testing
31. Classification
- Proteins from the test set are segmented into structural elements, called query segments
- Segment vectors are constructed
- For each query segment
  - the n most similar reference segment vectors are retrieved
  - the query segment is assigned the same structure as that of the majority of the retrieved segments
This is k-nearest neighbour classification (a sketch follows below).
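A minimal sketch of this k-nearest-neighbour step, assuming cosine similarity between segment vectors and majority voting over the retrieved labels; the function and variable names, the value of k, and the toy vectors are my own.

```python
import numpy as np
from collections import Counter

def knn_assign_structure(query_vec, ref_vecs, ref_labels, k=5):
    """Assign a structure type to one query segment vector:
    retrieve the k reference segment vectors with highest cosine
    similarity and take a majority vote over their structure labels."""
    q = query_vec / np.linalg.norm(query_vec)
    R = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    sims = R @ q                                  # cosine similarities
    nearest = np.argsort(sims)[::-1][:k]          # indices of top-k references
    votes = Counter(ref_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny illustrative example: 6 reference segments, 4-dimensional vectors.
rng = np.random.default_rng(0)
ref_vecs = rng.random((6, 4))
ref_labels = ["Helix", "Helix", "Strand", "Strand", "Coil", "Coil"]
query = rng.random(4)
print(knn_assign_structure(query, ref_vecs, ref_labels, k=3))
```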
32. Structure type assignment to Query Vector
[figure: the query vector compared against the reference model vectors; key: Helix, Strand, Coil]
Hence the structure type assigned to the query vector is Coil.
33. Choice of Vocabulary in Protein Sequences
- Amino Acids
- But amino acids are not all distinct
  - similarity is primarily due to chemical composition
- So:
  - represent protein segments in terms of types of amino acids
  - represent them in terms of chemical composition
34. Representation in terms of types of AA
- Classify based on electronic properties (a re-encoding sketch follows below):
  - e- donors: D, E, A, P
  - weak e- donors: I, L, V
  - ambivalent: G, H, S, W
  - weak e- acceptors: T, M, F, Q, Y
  - e- acceptors: K, R, N
  - C (by itself, another group)
- Use Chemical Groups
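Re-encoding a protein segment over this reduced, 6-letter vocabulary could look like the sketch below; the six classes are taken directly from the groups listed above (cf. Dwyer, 2001), while the group names and the function name are my own.

```python
# Amino-acid classes by electronic properties (Dwyer, 2001), as listed above.
ELECTRONIC_CLASS = {}
for group, members in {
    "donor":         "DEAP",
    "weak_donor":    "ILV",
    "ambivalent":    "GHSW",
    "weak_acceptor": "TMFQY",
    "acceptor":      "KRN",
    "cysteine":      "C",      # C kept as a group of its own
}.items():
    for aa in members:
        ELECTRONIC_CLASS[aa] = group

def reencode(segment):
    """Rewrite a protein segment over the 6-letter 'types of amino acid'
    vocabulary instead of the 20-letter amino-acid vocabulary."""
    return [ELECTRONIC_CLASS[aa] for aa in segment]

print(reencode("YVKWAI"))
# ['weak_acceptor', 'weak_donor', 'acceptor', 'ambivalent', 'donor', 'weak_donor']
```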
35. Representation using Chemical Groups
36. Results of Classification with AA as words
[results figure: leave-one-out testing of reference vectors, and unseen query segments]
37. Results with chemical groups as words
- The VSM is built using both reference segments and test segments
- Structure labels of reference segments are known
- Structure labels of query segments are unknown
38. Modification to Word-Document matrix
- Latent Semantic Analysis
- The word-document matrix is transformed by Singular Value Decomposition:
  W ≈ U_k Σ_k V_kᵀ (rank-k approximation; see the sketch below)
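A sketch of the LSA step, assuming a standard truncated SVD of the normalized word-document matrix and a conventional fold-in of query vectors into the reduced space; the rank k, the toy matrix, and the function names are illustrative, not taken from the slides.

```python
import numpy as np

def lsa_model(W, k=2):
    """Truncated SVD of the (normalized) word-document matrix W:
    W ~= U_k * diag(s_k) * Vt_k.  Columns of diag(s_k) @ Vt_k are the
    document vectors in the k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(query_vec, U_k, s_k):
    """Project a (normalized) query document vector into the latent space."""
    return np.diag(1.0 / s_k) @ U_k.T @ query_vec

# Toy normalized word-document matrix: 5 words x 4 documents.
W = np.array([[0.2, 0.0, 0.1, 0.0],
              [0.1, 0.1, 0.0, 0.2],
              [0.0, 0.3, 0.0, 0.1],
              [0.3, 0.0, 0.2, 0.0],
              [0.0, 0.1, 0.1, 0.3]])
U_k, s_k, Vt_k = lsa_model(W, k=2)
doc_vectors = (np.diag(s_k) @ Vt_k).T        # one row per document
query = fold_in(W[:, 0], U_k, s_k)           # fold in the first document as a query
print(doc_vectors.round(3))
print(query.round(3))
```

Similarity between documents (or between a query and the references) is then measured in the k-dimensional latent space instead of the full word space.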
40. Results with AA as words, using LSA
41. Results with types of AA as words, using LSA
42. Results with chemical groups as words, using LSA
43. LSA results for Different Vocabularies
[comparison figure: Amino acids + LSA, Types of amino acid + LSA, Chemical groups + LSA]
44. Model construction using all data
Matrix models are constructed using both reference and query documents together. This gives better models, both for normalization and for construction of the latent semantic model.
[results figure: Amino acid, Chemical groups, Amino acid types]
45. Applications
- Complement other methods for protein structure prediction
- Segmentation approaches
- Protein classification into all-alpha, all-beta, alpha+beta or alpha/beta types
- Automatically assigning new proteins to SCOP families
46. References
- Kabsch, W. and Sander, C., Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983, 22(12): 2577-2637.
- Dwyer, D.S., Electronic properties of the amino acid side chains contribute to the structural preferences in protein folding. J Biomol Struct Dyn, 2001, 18(6): 881-892.
- Bellegarda, J., Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 2000, 88(8): 1279-1296.
47. Thank you!
48. Use of SVD
- Representation of training and test segments is very similar to that in the VSM
- Structure type assignment goes through the same process, except that it is done with the LSA matrices
49. Classification of Query Document
- A query document is also represented as a vector
- It is normalized by corpus word counts
- Documents related to the query are identified by measuring the similarity of document vectors to the query document vector
- The query document is assigned the same structure as that of the documents retrieved by the similarity measure (majority voting)
This is k-nearest neighbour classification.
50. Notes
- Results described are per-segment
- The normalized word-document matrix does not preserve document lengths
- Hence per-residue accuracies of structure assignments cannot be computed