Machine Learning Challenges in Location Proteomics


Learn more at: http://www.cs.cmu.edu


1
Machine Learning Challenges in Location Proteomics

Robert F. Murphy
Departments of Biological Sciences and Biomedical
Engineering
Center for Automated Learning and Discovery
Carnegie Mellon University

2
Protein characteristics relevant to systems
approach

sequence
structure
expression level
activity
partners

location

3
Subcellular locations from major protein databases

Giantin
Entrez: /note="a new 376kD Golgi complex outher membrane protein"
SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE.
GPP130
Entrez: /note="GPP130 type II Golgi membrane protein"
SwissProt: nothing

4
More questions than answers

We learned that Giantin and GPP130 are both Golgi
proteins, but do we know:
What part (i.e., cis, medial, trans) of the Golgi
complex they each are found in?
If they have the same subcellular distribution?
If they also are found in other compartments?

5
Vocabulary is part of the problem

Different investigators may use different terms
to refer to the same pattern or the same term to
refer to different patterns
Efforts to create restricted vocabularies (e.g.,
Gene Ontology consortium) for location have been
made

6
SWALL entries for giantin and gpp130

ID   GIAN_HUMAN     STANDARD;      PRT;  3259 AA.
AC   Q14789; Q14398;
GN   GOLGB1.
DR   GO; GO:0000139; C:Golgi membrane; TAS.
DR   GO; GO:0005795; C:Golgi stack; TAS.
DR   GO; GO:0016021; C:integral to membrane; TAS.
DR   GO; GO:0007030; P:Golgi organization and biogenesis; TAS.

ID   O00461         PRELIMINARY;   PRT;   696 AA.
AC   O00461;
GN   GPP130.
DR   GO; GO:0005810; C:endocytotic transport vesicle; TAS.
DR   GO; GO:0005801; C:Golgi cis-face; TAS.
DR   GO; GO:0005796; C:Golgi lumen; TAS.
DR   GO; GO:0016021; C:integral to membrane; TAS.

7
Words are not enough

Still don't know how similar the location
patterns of these proteins are
Restricted vocabularies do not provide the
necessary complexity and specificity

8
Needed: Systematic Approach

Need to advance past cartoon view of
subcellular location
Need systematic, quantitative approach to protein
location

Need new methods for accurately and objectively
determining the subcellular location pattern of
all proteins
Distinct from drug screening by low-resolution
microscopy

9
First Decision Point

Classification by direct (pixel-by-pixel)
comparison of individual images to known patterns
is not useful, since
different cells have different shapes, sizes,
orientations
organelles within cells are not found in fixed
locations

Therefore, use feature-based methods rather than
(pixel) model-based methods
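
A minimal sketch of the argument above, using an invented 5x5 toy image: rotating a pattern by 90 degrees changes most of its pixel-by-pixel match with the original, while a simple feature (here the intensity-weighted mean distance to the center of fluorescence, chosen only for illustration) is unchanged. All names and values are hypothetical.

```python
import math

def rotate90(img):
    """Rotate a square image (list of rows) by 90 degrees clockwise."""
    n = len(img)
    return [[img[n - 1 - c][r] for c in range(n)] for r in range(n)]

def pixel_agreement(a, b):
    """Fraction of pixels with identical values in two same-sized images."""
    total = sum(len(row) for row in a)
    same = sum(1 for ra, rb in zip(a, b) for x, y in zip(ra, rb) if x == y)
    return same / total

def mean_distance_to_cof(img):
    """Intensity-weighted mean distance of pixels to the center of fluorescence."""
    total = sum(sum(row) for row in img)
    cy = sum(r * v for r, row in enumerate(img) for v in row) / total
    cx = sum(c * v for row in img for c, v in enumerate(row)) / total
    return sum(v * math.hypot(r - cy, c - cx)
               for r, row in enumerate(img) for c, v in enumerate(row)) / total

# Invented toy "cell": a bright bar near the top and a dim spot at the bottom.
cell = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 3, 0, 0],
]
rotated = rotate90(cell)

agreement = pixel_agreement(cell, rotated)        # drops below 1.0 after rotation
feature_before = mean_distance_to_cof(cell)
feature_after = mean_distance_to_cof(rotated)     # same value up to rounding
```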

10
Input Images

Created 2D image database for HeLa cells
Ten classes covering all major subcellular
structures: Golgi, ER, mitochondria, lysosomes,
endosomes, nuclei, nucleoli, microfilaments,
microtubules
Included classes that are similar to each other

11
Example 2D Images of HeLa
12
Features: SLF

Developed sets of Subcellular Location Features
(SLF) containing features of different types
Motivated in part by descriptions used by
biologists (e.g., punctate, perinuclear)
First type of features derived from morphological
image processing - finding objects by automated
thresholding

13
Features: Morphological

Number of fluorescent objects per cell
Variance of the object sizes
Ratio of the largest object to the smallest
Average distance of objects to the center of
fluorescence
Average roundness of objects

14
Features: Haralick texture

Give information on correlations in intensity
between adjacent pixels to answer questions like:
is the pattern more like a checkerboard or
alternating stripes?
is the pattern highly organized (ordered) or more
scattered (disordered)?
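
To make the two questions above concrete, here is a small sketch of a gray-level co-occurrence matrix with two Haralick-style statistics, contrast and entropy. The binary toy patterns are invented; real images would first be quantized to a fixed number of gray levels, and the full Haralick set contains more statistics than these two.

```python
import math
from collections import Counter

def cooccurrence(img, dy=0, dx=1):
    """Symmetric, normalized co-occurrence probabilities for pixel pairs offset by (dy, dx)."""
    counts = Counter()
    rows, cols = len(img), len(img[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[(img[r][c], img[r2][c2])] += 1
                counts[(img[r2][c2], img[r][c])] += 1   # count both directions
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def contrast(p):
    """Sum of p(i, j) * (i - j)^2: large when neighbours differ."""
    return sum(v * (a - b) ** 2 for (a, b), v in p.items())

def entropy(p):
    """Entropy in bits: low for ordered patterns, high for disordered ones."""
    return -sum(v * math.log2(v) for v in p.values())

checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
stripes = [[c % 2 for c in range(4)] for _ in range(4)]   # vertical stripes
disordered = [
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
]

# Vertical neighbours: stripes never change, a checkerboard always does.
stripe_contrast = contrast(cooccurrence(stripes, dy=1, dx=0))
checker_contrast = contrast(cooccurrence(checkerboard, dy=1, dx=0))

# Horizontal neighbours: the disordered pattern uses all four pair types.
ordered_entropy = entropy(cooccurrence(checkerboard))
disordered_entropy = entropy(cooccurrence(disordered))
```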

15
Example: Difference detected by texture feature
"entropy"
16
Features: Zernike moment

Measure degree to which pattern matches a
particular Zernike polynomial
Give information on basic nature of pattern
(e.g., circle, donut) and sizes (frequencies)
present in pattern
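
A sketch of how the magnitude of one Zernike moment might be computed for a square image mapped onto the unit disk. The pixel-to-disk mapping and the normalization are assumptions made for illustration and may differ from the SLF code. The magnitude is used because it does not change when the pattern is rotated, which the example checks with a 90-degree rotation of an invented image.

```python
import cmath
import math

def radial_poly(n, l, r):
    """Zernike radial polynomial R_nl(r); requires |l| <= n and n - |l| even."""
    l = abs(l)
    total = 0.0
    for s in range((n - l) // 2 + 1):
        num = (-1) ** s * math.factorial(n - s)
        den = (math.factorial(s)
               * math.factorial((n + l) // 2 - s)
               * math.factorial((n - l) // 2 - s))
        total += num / den * r ** (n - 2 * s)
    return total

def zernike_magnitude(img, n, l):
    """|Z_nl| of a square image; pixels outside the unit disk are ignored."""
    size = len(img)
    z = 0j
    for i, row in enumerate(img):
        for j, value in enumerate(row):
            # Map pixel centers onto [-1, 1] x [-1, 1].
            y = (2 * i + 1 - size) / size
            x = (2 * j + 1 - size) / size
            r = math.hypot(x, y)
            if r > 1.0:
                continue
            theta = math.atan2(y, x)
            # Project onto the complex conjugate of V_nl = R_nl(r) * exp(i*l*theta).
            z += value * radial_poly(n, l, r) * cmath.exp(-1j * l * theta)
    return abs(z) * (n + 1) / math.pi

def rotate90(img):
    """Rotate a square image by 90 degrees clockwise."""
    size = len(img)
    return [[img[size - 1 - c][r] for c in range(size)] for r in range(size)]

# Invented, deliberately asymmetric 6x6 test pattern.
pattern = [[(7 * i + 3 * j) % 5 for j in range(6)] for i in range(6)]
turned = rotate90(pattern)

magnitudes = {nl: zernike_magnitude(pattern, *nl) for nl in [(2, 0), (4, 4), (10, 6)]}
magnitudes_turned = {nl: zernike_magnitude(turned, *nl) for nl in magnitudes}
```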

17
Examples of Zernike Polynomials
Z(2,0)
Z(4,4)
Z(10,6)
18
Subcellular Location Features: 2D

Morphological features
Haralick texture features
Zernike moment features
Geometric features
Edge features

19
2D Classification Results
Rows: true class. Columns: output of the classifier (values in percent).

True   DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA     99    1    0    0    0    0    0    0    0    0
ER       0   97    0    0    0    2    0    0    0    1
Gia      0    0   91    7    0    0    0    0    2    0
Gpp      0    0   14   82    0    0    2    0    1    0
Lam      0    0    1    0   88    1    0    0   10    0
Mit      0    3    0    0    0   92    0    0    3    3
Nuc      0    0    0    0    0    0   99    0    1    0
Act      0    0    0    0    0    0    0  100    0    0
TfR      0    1    0    0   12    2    0    1   81    2
Tub      1    2    0    0    0    1    0    0    1   95

Overall accuracy: 92% (95% for major patterns)
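
The overall figure for the 2D confusion matrix above can be checked by averaging the diagonal. This sketch assumes every class contributes equally, which the slide does not state; it also lists the largest off-diagonal entries, which show where the classifier confuses similar patterns.

```python
classes = ["DNA", "ER", "Gia", "Gpp", "Lam", "Mit", "Nuc", "Act", "TfR", "Tub"]

# Rows: true class; columns: classifier output (percent), copied from the slide.
confusion = [
    [99, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 97, 0, 0, 0, 2, 0, 0, 0, 1],
    [0, 0, 91, 7, 0, 0, 0, 0, 2, 0],
    [0, 0, 14, 82, 0, 0, 2, 0, 1, 0],
    [0, 0, 1, 0, 88, 1, 0, 0, 10, 0],
    [0, 3, 0, 0, 0, 92, 0, 0, 3, 3],
    [0, 0, 0, 0, 0, 0, 99, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 100, 0, 0],
    [0, 1, 0, 0, 12, 2, 0, 1, 81, 2],
    [1, 2, 0, 0, 0, 1, 0, 0, 1, 95],
]

def mean_per_class_accuracy(matrix):
    """Average of the diagonal of a row-percentage confusion matrix."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)

def most_confused(matrix, names, top=2):
    """Largest off-diagonal entries as (percent, true class, predicted class)."""
    pairs = [(matrix[i][j], names[i], names[j])
             for i in range(len(matrix)) for j in range(len(matrix)) if i != j]
    return sorted(pairs, reverse=True)[:top]

accuracy = mean_per_class_accuracy(confusion)   # 92.4, consistent with the reported 92
worst = most_confused(confusion, classes)       # Gpp->Gia and TfR->Lam lead the list
```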
20
Human Classification Results
Overall accuracy: 83% (92% for major patterns)
21
Computer vs. Human
22
Extending to 3D: Labeling approach

Total protein labeled with Cy5 reactive dye
DNA labeled with PI
Specific proteins labeled with primary Ab +
Alexa488-conjugated secondary Ab

23
3D Image Set
Giantin
Nuclear
ER
Lysosomal
gpp130
Actin
Mitoch.
Nucleolar
Tubulin
Endosomal
24
New features to measure z asymmetry

2D features treated x and y equivalently
For 3D images, while it makes sense to treat x
and y equivalently (cells don't have a "left" and
"right"), z should be treated differently ("top"
and "bottom" are not the same)
We designed features to separate distance
measures into x-y component and z component
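
One way to read the last point: instead of a single 3D distance from the center of fluorescence, keep a horizontal (x-y) distance and a separate vertical (z) offset. The sketch below is an assumption about how such a split could look (including the choice of a signed z offset), not the actual 3D feature definitions; the voxel values are invented.

```python
import math

def center_of_fluorescence(voxels):
    """Intensity-weighted centroid of (x, y, z, intensity) voxels."""
    total = sum(v[3] for v in voxels)
    return tuple(sum(v[k] * v[3] for v in voxels) / total for k in range(3))

def split_distance(point, center):
    """Return (x-y distance, signed z offset) from center to point."""
    dx, dy, dz = (p - c for p, c in zip(point, center))
    return math.hypot(dx, dy), dz

# Invented voxels: (x, y, z, intensity).
voxels = [(0, 0, 0, 1), (4, 0, 0, 1), (0, 0, 6, 2)]
center = center_of_fluorescence(voxels)

# Rotating the cell about the z axis leaves both components unchanged,
# while flipping top and bottom changes the sign of the z component.
xy_part, z_part = split_distance((4, 0, 0), center)
```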

25
Classification Results for 3D images
Overall accuracy: 97%
26
How to do even better

Biologists interpreting images of protein
localization typically view many cells before
reaching a conclusion
Can simulate this by classifying sets of cells
from the same microscope slide
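
A minimal sketch of classifying a set of cells, assuming a simple plurality vote over single-cell predictions; the slides do not say which combination rule was used. The label list is invented.

```python
from collections import Counter

def classify_set(single_cell_labels):
    """Return the label predicted most often for the cells in one set."""
    return Counter(single_cell_labels).most_common(1)[0][0]

# Hypothetical single-cell predictions for nine cells from one slide.
predictions = ["Gia", "Gpp", "Gia", "Gia", "Gpp", "Gia", "Gia", "TfR", "Gia"]
set_label = classify_set(predictions)   # occasional single-cell errors are outvoted
```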

27
Classification of Sets of 3D Images
Set size: 9; overall accuracy: 99.7%
28
First Conclusion

Description of subcellular locations for systems
biology should be implemented using a data-driven
approach rather than a knowledge-capture
approach, but

29
Subcellular Location Image Finder

(Have automated system for finding images in
on-line journal articles that match a particular
pattern - enables connection between new images
and previously published results)

30
Image Similarity

Classification power of features implies that
they capture essential characteristics of protein
patterns
Can be used to measure similarity between patterns

31
Clustering by Image Similarity

Ability to measure similarity of protein patterns
allows us for the first time to create a
systematic, objective framework for describing
subcellular locations
Ideal for database references
One way is by creating a Subcellular Location
Tree
Illustration: Build hierarchical dendrogram
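
As an illustration of building such a tree, the sketch below performs average-linkage agglomerative clustering on feature vectors and returns the merge order as nested tuples. The linkage rule, the pattern names A to E, and their two-feature vectors are all assumptions made for the example.

```python
import math

def average_linkage_tree(points):
    """Agglomerative clustering of {name: feature vector}.

    Returns nested tuples giving the merge order (a dendrogram without heights).
    """
    clusters = {name: [vec] for name, vec in points.items()}

    def dist(a, b):
        # Average pairwise Euclidean distance between two clusters.
        return sum(math.dist(u, v) for u in a for v in b) / (len(a) * len(b))

    while len(clusters) > 1:
        keys = list(clusters)
        a, b = min(((p, q) for i, p in enumerate(keys) for q in keys[i + 1:]),
                   key=lambda pq: dist(clusters[pq[0]], clusters[pq[1]]))
        # Merge the closest pair; the new key records the tree structure.
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))

# Hypothetical patterns described by two features each.
patterns = {
    "A": (1.0, 1.0),
    "B": (1.2, 0.9),
    "C": (5.0, 5.0),
    "D": (5.5, 4.8),
    "E": (20.0, 0.0),
}
tree = average_linkage_tree(patterns)
```

Here A and B merge first, then C and D, then those two groups, and the outlier E joins last.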

32
Subcellular Location Tree for 10 classes in HeLa
cells
33
Do this for all proteins: Location Proteomics

Can use CD-tagging (developed by Dr. Jonathan
Jarvik) to randomly tag many proteins
Infect population of cells with a retrovirus
carrying a DNA sequence that will produce a tag
in a random gene in each cell
Isolate separate clones, each of which expresses
one tagged protein
Use RT-PCR to identify tagged gene in each clone
Collect images of many cells for each clone using
fluorescence microscopy

34
Example images of CD-tagged clones

Glut1 gene (type 1 glucose transporter)
Tmpo gene (thymopoietin)
tuba1 gene (α-tubulin)
Cald gene (caldesmon 1)
Ncl gene (nucleolin)
Rps11 gene (ribosomal protein S11)
Hmga1 gene (high mobility group AT-hook 1)
Col1a2 gene (procollagen type I α2)
Atp5a1 gene (ATP synthase isoform 1)

35
Proof of principle

Cluster 46 clones expressing different tagged
proteins based on their subcellular location
patterns

36
Feature selection

Use Stepwise Discriminant Analysis to rank
features based on their ability to distinguish
proteins
Use increasing numbers of features to train
neural network classifiers and evaluate
classification accuracy over all 46 clones
Best performance obtained with 10 features
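
The ranking step can be illustrated with a much simpler stand-in for Stepwise Discriminant Analysis: scoring each feature on its own by a one-way ANOVA F ratio (between-class over within-class variance). Unlike SDA, this ignores correlations between features, and the feature names and values below are invented.

```python
from statistics import mean

def f_ratio(groups):
    """One-way ANOVA F statistic for one feature.

    `groups` holds one list of values per class; assumes non-zero within-class variance.
    """
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups) / (len(groups) - 1)
    within = (sum((v - mean(g)) ** 2 for g in groups for v in g)
              / (len(all_values) - len(groups)))
    return between / within

def rank_features(data):
    """Sort feature names so the best class-separating feature comes first."""
    return sorted(data, key=lambda name: f_ratio(data[name]), reverse=True)

# Hypothetical measurements: two classes, three cells each.
data = {
    "mean_intensity": [[5, 6, 7], [5, 7, 6]],   # same mean in both classes
    "object_count": [[2, 3, 2], [9, 10, 11]],   # clearly different
}
ranking = rank_features(data)
```

Classifiers would then be trained on the top 1, 2, 3, ... ranked features to find where accuracy peaks.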

37
Tree building

Therefore use these 10 features with a z-scored
Euclidean distance function to build the
Subcellular Location Tree (SLT)
Find optimal number of clusters using k-means
clustering and the Akaike Information Criterion (AIC)
Find consensus hierarchical trees by randomly
dividing the images for each protein in half and
keeping branches conserved between both halves
(repeat for 50 random divisions)
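
A sketch of the z-scored Euclidean distance mentioned above: each feature is standardized across proteins before distances are taken, so features measured on large scales do not dominate. Using the population standard deviation is an assumption, and the example values are invented.

```python
import math
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each feature (column) to zero mean and unit standard deviation.

    Assumes every feature varies across rows (non-zero standard deviation).
    """
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def distance_matrix(rows):
    """Pairwise Euclidean distances between z-scored feature vectors."""
    z = zscore_columns(rows)
    return [[math.dist(a, b) for b in z] for a in z]

# Hypothetical mean feature vectors for three proteins; the second
# feature is on a scale 100 times larger than the first.
protein_features = [
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 200.0],
]
distances = distance_matrix(protein_features)
```

The resulting matrix can be passed to any hierarchical clustering routine to build the tree.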

38
Consensus Subcellular Location Tree
39
Examples from major clusters
40
Significance

Proteins clustered by location analogous to
proteins clustered by sequence (e.g., PFAM)
Can subdivide clusters by observing response to
drugs, oncogenes, etc.
These represent protein location states
Base knowledge required for modeling
Can be used to filter protein interactions

41
From patterns to causes

Machine learning approaches have been previously
used to find localization motifs in protein
sequences, but the set of locations used was
limited to major organelles
High-resolution subcellular location trees can be
used to discover (recursively) new motifs that
determine location of each group
Can include post-translational modifications

42
More Conclusions

Organized data collection approach is required to
capture high-resolution information on the
subcellular location of all proteins
Prohibitive combinatorial complexity makes
colocalization approach infeasible, so major
effort should focus on one protein at a time
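
The combinatorial argument can be made concrete with a quick count: imaging every pair of proteins together grows quadratically with the number of proteins, while imaging one protein at a time grows linearly. The proteome size below is a hypothetical round number, not a figure from the talk.

```python
import math

def pairwise_experiments(n_proteins):
    """Number of distinct protein pairs, n * (n - 1) / 2."""
    return math.comb(n_proteins, 2)

n = 10_000                          # hypothetical number of proteins
single = n                          # one protein at a time
pairs = pairwise_experiments(n)     # 49,995,000 pairwise colocalization experiments
```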

43
Center for Bioimage Informatics

$2.75 M CMU funding from NSF ITR
Joint with UCSB and collaborators at Berkeley and
MIT
R. Murphy (CALD/Biomed.Eng./Biol.Sci.)
Jelena Kovacevic (Biomedical Engineering)
Tom Mitchell (CALD)
Christos Faloutsos (CALD)

44
Acknowledgments

Former students
Michael Boland, Mia Markey, William Dirks,
Gregory Porreca, Edward Roques, Meel Velliste
Current grad students
Kai Huang, Xiang Chen, Ting Zhao, Yanhua Hu,
Elvira Garcia Osuna, Zhenzhen Kou, Juchang Hua
Funding
NSF, NIH, Rockefeller Bros. Fund, PA. Tobacco
Settlement Fund
Collaborators/Consultants
Simon Watkins, David Casasent, Tom Mitchell,
Christos Faloutsos, Jon Jarvik, Peter Berget