1
Ubiquitination Sites Prediction
  • Dah Mee Ko
  • Advisor: Dr. Predrag Radivojac
  • School of Informatics
  • Indiana University
  • May 22, 2009

2
Outline
  • Ubiquitination
  • Machine Learning
  • Decision Tree
  • Support Vector Machines
  • Prediction of ubiquitination sites
  • Influence of sequence
  • Influence of structure
  • Influence of evolutionary information

3
Ubiquitin
  • A small protein that occurs in all eukaryotic
    cells.
  • Highly conserved among eukaryotic species.
  • Consists of 76 amino acids and has a molecular
    mass of 8.5 kDa.
  • Key features
  • its C-terminal tail and Lys residues
  • Human ubiquitin sequence
  • MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQL
    EDGRTLSDYNIQKESTLHLVLRLRGG

4
Ubiquitination
  • Post-translational modification of a protein
  • Covalent attachment of one or more ubiquitin
    monomers to Lys residues
  • Reversible
  • Target proteins for degradation by the proteasome

5
Functions of Ubiquitination
  • Monoubiquitination
  • Histone regulation
  • DNA repair
  • Endocytosis
  • Budding of retroviruses from the plasma membrane
  • Polyubiquitination
  • Protein kinase activation

6
Machine Learning
  • Machine learning is programming computers to
    optimize a performance criterion using data and
    past experience.
  • Learn general models from a data set of
    particular examples.
  • Build a model that is a good and useful
    approximation to the data.

7
Machine Learning
  • Supervised learning
  • Learn input/output mappings from examples whose
    correct outputs are given.
  • Split the data into a training set and a test set.
  • Train a model on the training data.
  • Evaluate performance on the test data (see the
    sketch below).
  • Unsupervised learning
  • Learn patterns in the inputs without known outputs.
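A minimal sketch of this supervised workflow, using scikit-learn and synthetic data (the data set, model, and split sizes here are illustrative, not those used in this work):

# Supervised learning on synthetic data: split, train, evaluate.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic example: 200 samples, 10 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a model on the training data.
model = LogisticRegression().fit(X_train, y_train)

# Evaluate performance on the held-out test data.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))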

8
Machine Learning: Decision Tree
  • A classification algorithm
  • Each internal node tests the value of a feature
    and branches according to the result of the test.
  • Each leaf node assigns a class label (see the
    sketch below).
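A minimal decision-tree sketch, using scikit-learn's DecisionTreeClassifier on its bundled iris data (an illustrative data set, not the ubiquitination data):

# A small decision tree: internal nodes test feature values, leaves assign classes.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned feature tests (internal nodes) and class labels (leaves).
print(export_text(tree))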

9
Machine Learning: Random Forest
  • A machine learning ensemble classifier
  • Consists of many decision trees
  • Each tree is constructed using a bootstrap sample
    of training data.
  • After a large number of trees are generated, each
    tree casts a unit vote for the most popular class
    (see the sketch below).
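A minimal random-forest sketch under the same assumptions (scikit-learn, synthetic data): with bootstrap=True each tree is fit on a bootstrap sample of the training data, and the forest aggregates the trees' predictions.

# Random forest: an ensemble of decision trees built on bootstrap samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X, y)

# Averaged per-class tree predictions (a soft version of the majority vote).
print(forest.predict_proba(X[:1]))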

10
Machine Learning: Support Vector Machines
  • Viewing the input data as two sets of vectors in an
    n-dimensional space, a support vector machine
    constructs a separating hyperplane in that
    space.
  • The hyperplane maximizes the margin between the
    two data sets.

11
Machine Learning: Support Vector Machines
  • In the slide's figure of three candidate hyperplanes
  • H3 does not separate the classes.
  • H1 separates them, but only with a small margin.
  • H2 separates them with the maximum margin.
  • If a data set is not linearly separable, map it into
    a higher-dimensional space using the kernel approach
    (see the sketch below).
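A minimal sketch of the kernel idea, assuming scikit-learn and a synthetic data set (concentric circles) that is not linearly separable in its original two dimensions:

# Linear vs. RBF-kernel SVM on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # looks for a flat separating hyperplane
rbf_svm = SVC(kernel="rbf").fit(X, y)         # implicit map to a higher-dimensional space

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))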

12
Data Sets for Prediction
  • 334 protein sequences from yeast
  • Positive and negative sites are 25-amino-acid
    windows centered at a lysine residue.
  • Remove all positive and negative sites that share
    more than 40% identity within the data sets
    (a sketch of such a filter follows).
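One possible form of such a redundancy filter is sketched below; the exact identity computation and filtering procedure used in this work are not given here, so the function names and the greedy filtering order are assumptions.

# Hypothetical redundancy filter for fixed-length 25-residue windows:
# drop any site sharing more than 40% identical positions with a kept site.
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length windows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def filter_redundant(windows, threshold=0.40):
    kept = []
    for w in windows:
        if all(identity(w, k) <= threshold for k in kept):
            kept.append(w)
    return kept

# Toy usage with 25-residue windows centered at lysine (K).
sites = ["A" * 12 + "K" + "A" * 12,
         "A" * 12 + "K" + "A" * 11 + "G",   # nearly identical to the first
         "G" * 12 + "K" + "G" * 12]
print(filter_redundant(sites))              # keeps the first and the last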

13
Features: Sequence Information
  • Relative amino acid frequencies, Entropy, Net
    charge, Total charge, Aromatics,
    Charge-hydrophobicity ratio, Protein disorder
    probability, Vihinen's flexibility, Hydrophobic
    moments, B-factors
  • → 64 × 4 = 256 features
  • Relative amino acid frequencies
  • Window size 11 (computed as in the sketch after
    the table below)
  • A 1/11   G 0/11   M 0/11   S 0/11
  • C 0/11   H 0/11   N 1/11   T 1/11
  • D 2/11   I 0/11   P 3/11   V 1/11
  • E 0/11   K 1/11   Q 0/11   W 0/11
  • F 0/11   L 0/11   R 0/11   Y 1/11
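The frequency computation itself is straightforward; the sketch below uses a hypothetical 11-residue window (chosen only so that its counts match the example above, with the lysine at the center):

# Relative amino-acid frequencies over an 11-residue window centered at K.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_frequencies(window: str) -> dict:
    """Relative frequency of each of the 20 standard amino acids."""
    return {aa: window.count(aa) / len(window) for aa in AMINO_ACIDS}

window = "DPANDKPPTVY"   # hypothetical 11-mer: D x2, P x3, A, N, K, T, V, Y
freqs = aa_frequencies(window)
print(f"D: {freqs['D']:.3f}  P: {freqs['P']:.3f}  A: {freqs['A']:.3f}")
# D: 0.182  P: 0.273  A: 0.091   (i.e. 2/11, 3/11, 1/11)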

14
Features: Evolutionary Information
  • Position Specific Scoring Matrix (PSSM)
  • → 21 × 4 = 84 features
  • Window size 11
  • 256 (Seq) + 84 (Evol) = 340 features
    (one plausible PSSM window computation is sketched
    below)
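How the PSSM is condensed into 21 × 4 = 84 features is not spelled out in the slides. The sketch below only illustrates one plausible step: extracting an 11-row window from a PSSM (assumed here to be a NumPy array with 21 columns, one row per residue) and taking a column-wise summary.

# One plausible PSSM window feature: column-wise mean over an 11-row window.
import numpy as np

def pssm_window_features(pssm: np.ndarray, center: int, window: int = 11) -> np.ndarray:
    """Column-wise mean of the PSSM rows inside the window centered at `center`."""
    half = window // 2
    rows = pssm[max(0, center - half): center + half + 1]
    return rows.mean(axis=0)

# Toy usage with a random stand-in "PSSM" for a 100-residue sequence.
pssm = np.random.default_rng(0).normal(size=(100, 21))
print(pssm_window_features(pssm, center=50).shape)   # (21,)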

15
Features: Structure Information
  • BLAST each sequence against the PDB database.
  • Select alignments with greater than 30% identity.
  • For each mapped site, five shells with 1.5, 3,
    4.5, 6, 7.5 Å radial boundaries are constructed
    around the residue's alpha-carbon atom, using X, Y,
    Z coordinates from the PDB.
  • Amino acid at the center site → 20 features
  • e.g. K → A C D E F G H I K L M N P Q R S T V W Y
  •         0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  • Each shell contributes 24 features
  • 4 for counts of C, N, O, S atoms and 20 for counts
    of amino acids
  • 20 + 24 × 5 = 140 features
  • 60 of the 245 positive sites could be mapped → 24%
  • 3239 of the 12906 negative sites could be mapped → 25%
  • A 1 × 140 zero vector is used for all other sites.
  • 256 (Seq) + 84 (Evol) + 140 (Str) = 480 features
    (a sketch of the shell counts follows)
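A hypothetical sketch of the per-shell counts follows. The exact counting rules (for example, which atoms place a residue type in a shell) are assumptions, and parsing the PDB coordinates into the `atoms` list is taken as already done elsewhere (e.g. with Biopython).

# Counts of C/N/O/S atoms and of the 20 amino-acid types in five radial
# shells (1.5, 3, 4.5, 6, 7.5 Å) around the site's alpha-carbon.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ELEMENTS = "CNOS"
SHELLS = [1.5, 3.0, 4.5, 6.0, 7.5]   # outer radius of each shell, in Å

def shell_features(ca_xyz, atoms):
    """atoms: list of (element, residue_one_letter, xyz) tuples.
    Returns 5 shells x (4 + 20) counts = 120 features."""
    feats = np.zeros((len(SHELLS), 4 + 20))
    inner = 0.0
    for s, outer in enumerate(SHELLS):
        for element, res_aa, xyz in atoms:
            d = np.linalg.norm(np.asarray(xyz) - np.asarray(ca_xyz))
            if inner < d <= outer:
                if element in ELEMENTS:
                    feats[s, ELEMENTS.index(element)] += 1
                if res_aa in AMINO_ACIDS:
                    feats[s, 4 + AMINO_ACIDS.index(res_aa)] += 1
        inner = outer
    return feats.ravel()

# Toy usage: a single nitrogen atom from an alanine residue 2.0 Å away.
feats = shell_features((0.0, 0.0, 0.0), [("N", "A", (2.0, 0.0, 0.0))])
print(feats.shape, feats.sum())   # (120,) 2.0

Concatenated with the 20-dimensional one-hot encoding of the central residue, this gives the 20 + 24 × 5 = 140 structure features.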

16
Prediction Results: Random Forest
17
Prediction Results: Random Forest
18
Prediction Results: SVM
19
Prediction Results: SVM
20
Feature Selection
  • Rank features using correlation coefficients (r),
    as in the sketch below.
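A minimal sketch of this ranking, assuming a plain Pearson correlation between each feature column and the binary class label:

# Rank features by the absolute Pearson correlation with the label.
import numpy as np

def rank_features_by_correlation(X: np.ndarray, y: np.ndarray):
    """Return feature indices sorted by |r| (largest first) and the r values."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(r)), r

# Toy usage: feature 0 is informative, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
X = np.column_stack([y + 0.3 * rng.normal(size=200), rng.normal(size=200)])
order, r = rank_features_by_correlation(X, y)
print(order, np.round(r, 2))   # feature 0 ranked first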

21
Conclusions
  • Ubiquitination sites are predictable.
  • The accuracy is modest. Likely reasons:
  • Long-range interactions
  • Flexibility of the structure
  • Noise in the positive sites
  • Small data set
  • The sequence features are the most important.

22
Acknowledgements
  • Prof. Predrag Radivojac
  • Wyatt Clark
  • Arunima Ram
  • Nils Schimmelmann
  • Prof. Sun Kim
  • Linda Hostetter
  • School of Informatics

23
Thank you!