Title: Characterization of Secondary Structure of Proteins using Different Vocabularies
1. Characterization of Secondary Structure of Proteins using Different Vocabularies
- Madhavi K. Ganapathiraju
- Language Technologies Institute
- Advisors: Raj Reddy, Judith Klein-Seetharaman, Roni Rosenfeld
2nd Biological Language Modeling Workshop, Carnegie Mellon University, May 13-14, 2003
2. Presentation overview
- Classification of Protein Segments by their Secondary Structure types
- Document Processing Techniques
- Choice of Vocabulary in Protein Sequences
- Application of Latent Semantic Analysis
- Results
- Discussion
3. Secondary Structure of Protein
Sample Protein:
MEPAPSAGAELQPPLFANASDAYPSACPSAGANASGPPGARSASSLALAIAITALYSAVCAVGLLGNVLVMFGIVRYTKMKTATNIYIFNLALADALATSTLPFQSA
4. Application of Text Processing
- Letters → Words → Sentences
- Letter counts in languages
- Word counts in Documents
- Residues → Secondary Structure → Proteins → Genomes
Can unigrams distinguish Secondary Structure Elements from one another?
5. Unigrams for Document Classification
- The Word-Document matrix represents documents in terms of their word unigrams
- This is a bag-of-words model, since the positions of words in the document are not taken into account (a small illustrative sketch follows)
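As a concrete illustration of such a word-document matrix, a minimal Python sketch is given below; the documents and vocabulary are toy examples of my own, not taken from the slides.

```python
from collections import Counter

# Toy document collection; each "document" is just a list of words.
docs = {
    "Doc-1": "the cat sat on the mat".split(),
    "Doc-2": "the dog sat on the log".split(),
    "Doc-3": "cats and dogs".split(),
}

# Vocabulary = all distinct words across the collection.
vocab = sorted({w for words in docs.values() for w in words})

# Word-document matrix: rows are words, columns are documents,
# entry (i, j) = count of word i in document j (word positions ignored).
counts = {name: Counter(words) for name, words in docs.items()}
matrix = [[counts[name][w] for name in docs] for w in vocab]

for w, row in zip(vocab, matrix):
    print(f"{w:>5}: {row}")
```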
6. Word-Document Matrix
7-11. Document Vectors
[figure, built up over slides 7-11: each column of the word-document matrix is a document vector: Doc-1, Doc-2, Doc-3, ..., Doc-N]
12-14. Document Comparison
- Documents can be compared to one another in terms of the dot product of their document vectors (a small sketch follows below)
- Formal modeling of documents is presented in the next few slides
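A minimal sketch of such a comparison, using the plain dot product and its length-normalized form (cosine similarity); the two document vectors are illustrative only.

```python
import math

# Two toy document vectors (unigram counts over the same vocabulary).
doc_a = [2, 0, 1, 3]
doc_b = [1, 1, 0, 2]

# Dot product: large when the documents share frequent words.
dot = sum(a * b for a, b in zip(doc_a, doc_b))

# Cosine similarity: dot product of length-normalized vectors,
# so long and short documents can be compared on an equal footing.
norm_a = math.sqrt(sum(a * a for a in doc_a))
norm_b = math.sqrt(sum(b * b for b in doc_b))
cosine = dot / (norm_a * norm_b)

print(f"dot product = {dot}, cosine similarity = {cosine:.3f}")
```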
15. Vector Space Model construction
- Document vectors in the word-document matrix are normalized
  - by word counts in the entire document collection
  - by document lengths
- This gives a Vector Space Model (VSM) of the set of documents
- Equations for normalization follow on the next slide
16. Word count normalization
w_{i,j} = (1 - ε_i) · c_{i,j} / n_j
where
- c_{i,j} is the count of word i in document j (word count in document)
- n_j is the length of document j (document length)
- ε_i is the normalized entropy of word i across the corpus (depends on word counts in the corpus):
  ε_i = -(1 / log N) · Σ_j (c_{i,j} / t_i) · log(c_{i,j} / t_i)
- t_i is the total number of times word i occurs in the corpus, and N is the number of documents
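The quantities named above (word count in document, document length, and a corpus-dependent factor built from t_i) match the entropy-based weighting of Bellegarda (2000), so a sketch under that assumption is given below; the exact weighting used in this work may differ in detail.

```python
import numpy as np

def normalize_word_document_matrix(C):
    """Entropy-weighted normalization of a word-document count matrix C
    (rows = words, columns = documents), in the style of Bellegarda (2000):
        w[i, j] = (1 - eps_i) * c[i, j] / n_j
    where n_j is the length of document j and eps_i is the normalized
    entropy of word i across the corpus (computed from t_i, the total
    count of word i in the corpus)."""
    C = np.asarray(C, dtype=float)
    n_docs = C.shape[1]
    doc_len = C.sum(axis=0)              # n_j: document lengths
    t = C.sum(axis=1, keepdims=True)     # t_i: corpus count of each word

    p = np.divide(C, t, out=np.zeros_like(C), where=t > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    eps = -plogp.sum(axis=1) / np.log(n_docs)   # normalized entropy per word

    return (1.0 - eps)[:, None] * C / doc_len[None, :]

# Example: 4 words x 3 documents.
C = [[2, 0, 1],
     [1, 1, 1],
     [0, 3, 0],
     [1, 1, 2]]
print(normalize_word_document_matrix(C))
```

A word that is spread evenly over all documents gets weight close to zero, while a word concentrated in few documents keeps most of its count.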
17. Word-Document Matrix
[figure: the word-document matrix and the normalized word-document matrix]
18. Document vectors after normalization
[figure: the document vectors after normalization]
19. Use of Vector Space Model
- A query document is also represented as a vector
- It is normalized by corpus word counts
- Documents related to the query document are identified by measuring the similarity of document vectors to the query document vector
20. Application to Protein Secondary Structure Prediction
21. Protein Secondary Structure
- DSSP (Dictionary of Protein Secondary Structure): annotation of each residue with its structure
- Based on hydrogen-bonding patterns and geometrical constraints
- 7 DSSP labels for protein secondary structure:
  - Helix types: H, G
  - Strand types: E, B
  - Coil types: T, S, I
22. Example
Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP:     ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT
Key to DSSP labels:
- T, S, I, _ → Coil
- E, B → Strand
- H, G → Helix
23. Reference Model
- Proteins are segmented into structural segments
- A normalized word-document matrix is constructed from the structural segments
24-25. Example
Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP:     ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT
Structural segments obtained from the given sequence:
PKPPVKFN RRIFLLNTQNVI NG YVKWAI ND VSL A LPPTP YLGAMKY NLLH
Unigrams are then counted within each structural segment.
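One way to reproduce such a segmentation is to map the 7 DSSP labels to the 3 classes using the key from slide 22 and cut the sequence wherever the class changes; the sketch below does exactly that, though the authors' exact segmentation rule may differ slightly at segment boundaries, so the printed segments need not match the slide's list one-for-one.

```python
from itertools import groupby

# Key to DSSP labels (slide 22): T, S, I, _ -> Coil; E, B -> Strand; H, G -> Helix.
DSSP_TO_CLASS = {
    "H": "Helix", "G": "Helix",
    "E": "Strand", "B": "Strand",
    "T": "Coil", "S": "Coil", "I": "Coil", "_": "Coil",
}

def segment_by_structure(residues, dssp):
    """Cut the residue string into structural segments wherever the
    3-class structure label changes; returns (segment, class) pairs."""
    assert len(residues) == len(dssp)
    classes = [DSSP_TO_CLASS[label] for label in dssp]
    segments, start = [], 0
    for cls, run in groupby(classes):
        length = len(list(run))
        segments.append((residues[start:start + length], cls))
        start += length
    return segments

residues = "PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH"
dssp     = "____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT"
for seg, cls in segment_by_structure(residues, dssp):
    print(f"{cls:>6}: {seg}")
```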
26-27. Structural Segments
[figure: Amino-acid / Structural-Segment matrix; rows are the 20 amino acids, columns are the structural segments]
This matrix is similar to the Word-Document Matrix.
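Counting amino-acid unigrams within each structural segment yields a 20 × (number of segments) matrix; a minimal sketch, using the segments from the example slide:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

# Structural segments from the example on slides 24-25.
segments = ["PKPPVKFN", "RRIFLLNTQNVI", "NG", "YVKWAI", "ND",
            "VSL", "A", "LPPTP", "YLGAMKY", "NLLH"]

# Amino-acid / structural-segment matrix: rows are the 20 amino acids,
# columns are segments, entry (i, j) = count of amino acid i in segment j.
counts = [Counter(seg) for seg in segments]
matrix = [[c[aa] for c in counts] for aa in AMINO_ACIDS]

for aa, row in zip(AMINO_ACIDS, matrix):
    print(aa, row)
```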
28-29. Document Vectors
[figure: document vectors and word vectors of the reference model; slide 29 adds a query vector in the same space]
30. Data Set used for PSSP
- JPred data
  - 513 protein sequences in all
  - < 25% homology between sequences
  - Residues and corresponding DSSP annotations are given
- We used
  - 50 sequences for model construction (training)
  - 30 sequences for testing
31. Classification
- Proteins from the test set are segmented into structural elements, called query segments
- Segment vectors are constructed
- For each query segment
  - the n most similar reference segment vectors are retrieved
  - the query segment is assigned the same structure as that of the majority of the retrieved segments
This is k-nearest neighbour classification (a sketch follows below).
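A minimal sketch of this k-nearest-neighbour step, assuming cosine similarity between segment vectors and majority voting over the retrieved labels; the function and variable names, the value of k, and the toy vectors are my own.

```python
import numpy as np
from collections import Counter

def knn_assign_structure(query_vec, ref_vecs, ref_labels, k=5):
    """Assign a structure type to one query segment vector:
    retrieve the k reference segment vectors with highest cosine
    similarity and take a majority vote over their structure labels."""
    q = query_vec / np.linalg.norm(query_vec)
    R = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    sims = R @ q                                  # cosine similarities
    nearest = np.argsort(sims)[::-1][:k]          # indices of top-k references
    votes = Counter(ref_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny illustrative example: 6 reference segments, 4-dimensional vectors.
rng = np.random.default_rng(0)
ref_vecs = rng.random((6, 4))
ref_labels = ["Helix", "Helix", "Strand", "Strand", "Coil", "Coil"]
query = rng.random(4)
print(knn_assign_structure(query, ref_vecs, ref_labels, k=3))
```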
32. Structure type assignment to Query Vector
[figure: the query vector compared against the reference model vectors; key: Helix, Strand, Coil]
Hence the structure type assigned to the query vector is Coil.
33. Choice of Vocabulary in Protein Sequences
- Amino Acids
- But amino acids are not all distinct
  - similarity is primarily due to chemical composition
- So:
  - represent protein segments in terms of types of amino acids
  - represent them in terms of chemical composition
34. Representation in terms of types of AA
- Classify based on electronic properties (a re-encoding sketch follows below):
  - e- donors: D, E, A, P
  - weak e- donors: I, L, V
  - ambivalent: G, H, S, W
  - weak e- acceptors: T, M, F, Q, Y
  - e- acceptors: K, R, N
  - C (by itself, another group)
- Use Chemical Groups
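Re-encoding a protein segment over this reduced, 6-letter vocabulary could look like the sketch below; the six classes are taken directly from the groups listed above (cf. Dwyer, 2001), while the group names and the function name are my own.

```python
# Amino-acid classes by electronic properties (Dwyer, 2001), as listed above.
ELECTRONIC_CLASS = {}
for group, members in {
    "donor":         "DEAP",
    "weak_donor":    "ILV",
    "ambivalent":    "GHSW",
    "weak_acceptor": "TMFQY",
    "acceptor":      "KRN",
    "cysteine":      "C",      # C kept as a group of its own
}.items():
    for aa in members:
        ELECTRONIC_CLASS[aa] = group

def reencode(segment):
    """Rewrite a protein segment over the 6-letter 'types of amino acid'
    vocabulary instead of the 20-letter amino-acid vocabulary."""
    return [ELECTRONIC_CLASS[aa] for aa in segment]

print(reencode("YVKWAI"))
# ['weak_acceptor', 'weak_donor', 'acceptor', 'ambivalent', 'donor', 'weak_donor']
```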
35. Representation using Chemical Groups
36. Results of Classification with AA as words
[results figure: leave-one-out testing of reference vectors, and unseen query segments]
37. Results with chemical groups as words
- The VSM is built using both reference segments and test segments
- Structure labels of reference segments are known
- Structure labels of query segments are unknown
38. Modification to Word-Document matrix
- Latent Semantic Analysis
- The word-document matrix is transformed by Singular Value Decomposition:
  W ≈ U_k Σ_k V_kᵀ (rank-k approximation; see the sketch below)
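A sketch of the LSA step, assuming a standard truncated SVD of the normalized word-document matrix and a conventional fold-in of query vectors into the reduced space; the rank k, the toy matrix, and the function names are illustrative, not taken from the slides.

```python
import numpy as np

def lsa_model(W, k=2):
    """Truncated SVD of the (normalized) word-document matrix W:
    W ~= U_k * diag(s_k) * Vt_k.  Columns of diag(s_k) @ Vt_k are the
    document vectors in the k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(query_vec, U_k, s_k):
    """Project a (normalized) query document vector into the latent space."""
    return np.diag(1.0 / s_k) @ U_k.T @ query_vec

# Toy normalized word-document matrix: 5 words x 4 documents.
W = np.array([[0.2, 0.0, 0.1, 0.0],
              [0.1, 0.1, 0.0, 0.2],
              [0.0, 0.3, 0.0, 0.1],
              [0.3, 0.0, 0.2, 0.0],
              [0.0, 0.1, 0.1, 0.3]])
U_k, s_k, Vt_k = lsa_model(W, k=2)
doc_vectors = (np.diag(s_k) @ Vt_k).T        # one row per document
query = fold_in(W[:, 0], U_k, s_k)           # fold in the first document as a query
print(doc_vectors.round(3))
print(query.round(3))
```

Similarity between documents (or between a query and the references) is then measured in the k-dimensional latent space instead of the full word space.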
40. Results with AA as words, using LSA
41. Results with types of AA as words, using LSA
42. Results with chemical groups as words, using LSA
43. LSA results for Different Vocabularies
[comparison figure: Amino acids + LSA, Types of amino acid + LSA, Chemical groups + LSA]
44. Model construction using all data
Matrix models are constructed using both reference and query documents together. This gives better models, both for normalization and for construction of the latent semantic model.
[results figure: Amino acid, Chemical groups, Amino acid types]
45. Applications
- Complement other methods for protein structure prediction
- Segmentation approaches
- Protein classification into all-alpha, all-beta, alpha+beta or alpha/beta types
- Automatically assigning new proteins to SCOP families
46. References
- Kabsch, W. and Sander, C., Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983, 22(12): 2577-2637.
- Dwyer, D.S., Electronic properties of the amino acid side chains contribute to the structural preferences in protein folding. J Biomol Struct Dyn, 2001, 18(6): 881-892.
- Bellegarda, J., Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 2000, 88(8): 1279-1296.
47. Thank you!
48. Use of SVD
- Representation of training and test segments is very similar to that in the VSM
- Structure type assignment goes through the same process, except that it is done with the LSA matrices
49. Classification of Query Document
- A query document is also represented as a vector
- It is normalized by corpus word counts
- Documents related to the query are identified by measuring the similarity of document vectors to the query document vector
- The query document is assigned the same structure as that of the documents retrieved by the similarity measure (majority voting)
This is k-nearest neighbour classification.
50. Notes
- Results described are per-segment
- The normalized word-document matrix does not preserve document lengths
- Hence per-residue accuracies of structure assignments cannot be computed