Title: Machine Learning Challenges in Location Proteomics
1Machine Learning Challenges in Location Proteomics
- Robert F. Murphy
- Departments of Biological Sciences and Biomedical
Engineering - Center for Automated Learning and Discovery
- Carnegie Mellon University
2Protein characteristics relevant to systems
approach
- sequence
- structure
- expression level
- activity
- partners
3Subcellular locations from major protein databases
- Giantin
- Entrez /note"a new 376kD Golgi complex outher
membrane protein" - SwissProt INTEGRAL MEMBRANE PROTEIN. GOLGI
MEMBRANE. - GPP130
- Entrez /note"GPP130 type II Golgi membrane
protein - SwissProt nothing
4More questions than answers
- We learned that Giantin and GPP130 are both Golgi
proteins, but do we know - What part (i.e., cis, medial, trans) of the Golgi
complex they each are found in? - If they have the same subcellular distribution?
- If they also are found in other compartments?
5Vocabulary is part of the problem
- Different investigators may use different terms
to refer to the same pattern or the same term to
refer to different patterns - Efforts to create restricted vocabularies (e.g.,
Gene Ontology consortium) for location have been
made
6SWALL entries for giantin and gpp130
- ID GIAN_HUMAN STANDARD PRT 3259 AA.
- AC Q14789 Q14398
- GN GOLGB1.
- DR GO GO0000139 CGolgi membrane TAS.
- DR GO GO0005795 CGolgi stack TAS.
- DR GO GO0016021 Cintegral to membrane TAS.
- DR GO GO0007030 PGolgi organization and
biogenesis TAS. - ID O00461 PRELIMINARY PRT 696 AA.
- AC O00461
- GN GPP130.
- DR GO GO0005810 Cendocytotic transport
vesicle TAS. - DR GO GO0005801 CGolgi cis-face TAS.
- DR GO GO0005796 CGolgi lumen TAS.
- DR GO GO0016021 Cintegral to membrane TAS.
7Words are not enough
- Still dont know how similar the locations
patterns of these proteins are - Restricted vocabularies do not provide the
necessary complexity and specificity
8Needed Systematic Approach
- Need to advance past cartoon view of
subcellular location - Need systematic, quantitative approach to protein
location
- Need new methods for accurately and objectively
determining the subcellular location pattern of
all proteins - Distinct from drug screening by low-resolution
microscopy
9First Decision Point
- Classification by direct (pixel-by-pixel)
comparison of individual images to known patterns
is not useful, since - different cells have different shapes, sizes,
orientations - organelles within cells are not found in fixed
locations
- Therefore, use feature-based methods rather than
(pixel) model-based methods
10Input Images
- Created 2D image database for HeLa cells
- Ten classes covering all major subcellular
structures Golgi, ER, mitochondria, lysosomes,
endosomes, nuclei, nucleoli, microfilaments,
microtubules - Included classes that are similar to each other
11Example 2D Images of HeLa
12Features SLF
- Developed sets of Subcellular Location Features
(SLF) containing features of different types - Motivated in part by descriptions used by
biologists (e.g., punctate, perinuclear) - First type of features derived from morphological
image processing - finding objects by automated
thresholding
13Features Morphological
- Number of fluorescent objects per cell
- Variance of the object sizes
- Ratio of the largest object to the smallest
- Average distance of objects to the center of
fluorescence - Average roundness of objects
14Features Haralick texture
- Give information on correlations in intensity
between adjacent pixels to answer questions like - is the pattern more like a checkerboard or
alternating stripes? - is the pattern highly organized (ordered) or more
scattered (disordered)?
15Example Difference detected by texture feature
entropy
16Features Zernike moment
- Measure degree to which pattern matches a
particular Zernike polynomial - Give information on basic nature of pattern
(e.g., circle, donut) and sizes (frequencies)
present in pattern
17Examples of Zernike Polynomials
Z(2,0)
Z(4,4)
Z(10,6)
18Subcellular Location Features 2D
- Morphological features
- Haralick texture features
- Zernike moment features
- Geometric features
- Edge features
192D Classification Results
True Class Output of the Classifier Output of the Classifier Output of the Classifier Output of the Classifier Output of the Classifier Output of the Classifier Output of the Classifier Output of the Classifier Output of the Classifier Output of the Classifier
True Class DNA ER Gia Gpp Lam Mit Nuc Act TfR Tub
DNA 99 1 0 0 0 0 0 0 0 0
ER 0 97 0 0 0 2 0 0 0 1
Gia 0 0 91 7 0 0 0 0 2 0
Gpp 0 0 14 82 0 0 2 0 1 0
Lam 0 0 1 0 88 1 0 0 10 0
Mit 0 3 0 0 0 92 0 0 3 3
Nuc 0 0 0 0 0 0 99 0 1 0
Act 0 0 0 0 0 0 0 100 0 0
TfR 0 1 0 0 12 2 0 1 81 2
Tub 1 2 0 0 0 1 0 0 1 95
Overall accuracy 92 (95 for major patterns)
20Human Classification Results
Overall accuracy 83 (92 for major patterns)
21Computer vs. Human
22Extending to 3D Labeling approach
- Total protein labeled with Cy5 reactive dye
- DNA labeled with PI
- Specific Proteins labeled with primary Ab
Alexa488 conjugated secondary Ab
233D Image Set
Giantin
Nuclear
ER
Lysosomal
gpp130
Actin
Mitoch.
Nucleolar
Tubulin
Endosomal
24New features to measure z asymmetry
- 2D features treated x and y equivalently
- For 3D images, while it makes sense to treat x
and y equivalently (cells dont have a left and
right, z should be treated differently (top
and bottom are not the same) - We designed features to separate distance
measures into x-y component and z component
25Classification Results for 3D images
Overall accuracy 97
26How to do even better
- Biologists interpreting images of protein
localization typically view many cells before
reaching a conclusion - Can simulate this by classifying sets of cells
from the same microscope slide
27Classification of Sets of 3D Images
Set size 9, Overall accuracy 99.7
28First Conclusion
- Description of subcellular locations for systems
biology should be implemented using a data-driven
approach rather than a knowledge-capture
approach, but
29Subcellular Location Image Finder
- (Have automated system for finding images in
on-line journal articles that match a particular
pattern - enables connection between new images
and previously published results)
30Image Similarity
- Classification power of features implies that
they capture essential characteristics of protein
patterns - Can be used to measure similarity between patterns
31Clustering by Image Similarity
- Ability to measure similarity of protein patterns
allows us for the first time to create a
systematic, objective, framework for describing
subcellular locations - Ideal for database references
- One way is by creating a Subcellular Location
Tree - Illustration Build hierarchical dendrogram
32Subcellular Location Tree for 10 classes in HeLa
cells
33Do this for all proteins Location Proteomics
- Can use CD-tagging (developed by Dr. Jonathan
Jarvik) to randomly tag many proteins Infect
population of cells with a retrovirus carrying a
DNA sequence that will produce a tag in a
random gene in each cell - Isolate separate clones, each of which produces
express one tagged protein - Use RT-PCR to identify tagged gene in each clone
- Collect images of many cells for each clone using
fluorescence microscopy
34Example images of CD-tagged clones
- Glut1 gene (type 1 glucose transporter)
- Tmpo gene (thymopoietin ??
- tuba1 gene (?-tubulin)
- Cald gene (caldesmon 1)
- Ncl gene (nucleolin)
- Rps11 gene (ribosomal protein S11)
- Hmga1 gene (high mobility group AT-hook 1)
- Col1a2 gene (procollagen type I ?2)
- Atp5a1 gene (ATP synthase isoform 1)
35Proof of principle
- Cluster 46 clones expressing different tagged
proteins based on their subcellular location
patterns
36Feature selection
- Use Stepwise Discriminant Analysis to rank
features based on their ability to distinguish
proteins - Use increasing numbers of features to train
neural network classifiers and evaluate
classification accuracy over all 46 clones - Best performance obtained with 10 features
37Tree building
- Therefore use these 10 features with z-scored
Euclidean distance function to build SLT - Find optimal number of clusters using k-means
clustering and AIC - Find consensus hierarchical trees by randomly
dividing the images for each protein in half and
keeping branches conserved between both halves
(repeat for 50 random divisions)
38Consensus Subcellular Location Tree
39Examples from major clusters
40Significance
- Proteins clustered by location analogous to
proteins clustered by sequence (e.g., PFAM) - Can subdivide clusters by observing response to
drugs, oncogenes, etc. - These represent protein location states
- Base knowledge required for modeling
- Can be used to filter protein interactions
41From patterns to causes
- Machine learning approaches have been previously
used to find localization motifs in protein
sequences, but the set of locations used was
limited to major organelles - High-resolution subcellular location trees can be
used to discover (recursively) new motifs that
determine location of each group - Can include post-translational modifications
42More Conclusions
- Organized data collection approach is required to
capture high-resolution information on the
subcellular location of all proteins - Prohibitive combinatorial complexity make
colocalization approach infeasible, so major
effort should focus on one protein at a time
43Center for Bioimage Informatics
- 2.75 M CMU funding from NSF ITR
- Joint with UCSB and collaborators at Berkeley and
MIT - R. Murphy (CALD/Biomed.Eng./Biol.Sci.)
- Jelena Kovacevic (Biomedical Engineering)
- Tom Mitchell (CALD)
- Christos Faloutsos (CALD)
44Acknowledgments
- Former students
- Michael Boland, Mia Markey, William Dirks,
Gregory Porreca, Edward Roques, Meel Velliste - Current grad students
- Kai Huang, Xiang Chen, Ting Zhao, Yanhua Hu,
Elvira Garcia Osuna, Zhenzhen Kou, Juchang Hua - Funding
- NSF, NIH, Rockefeller Bros. Fund, PA. Tobacco
Settlement Fund - Collaborators/Consultants
- Simon Watkins, David Cassasent, Tom Mitchell,
Christos Faloutsos, Jon Jarvik, Peter Berget