Machine Learning Challenges in Location Proteomics

1
Machine Learning Challenges in Location Proteomics
  • Robert F. Murphy
  • Departments of Biological Sciences and Biomedical
    Engineering
  • Center for Automated Learning and Discovery
  • Carnegie Mellon University

2
Protein characteristics relevant to systems
approach
  • sequence
  • structure
  • expression level
  • activity
  • partners
  • location

3
Subcellular locations from major protein databases
  • Giantin
  • Entrez: /note="a new 376kD Golgi complex outher
    membrane protein"
  • SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI
    MEMBRANE.
  • GPP130
  • Entrez: /note="GPP130 type II Golgi membrane
    protein"
  • SwissProt: nothing

4
More questions than answers
  • We learned that Giantin and GPP130 are both Golgi
    proteins, but do we know:
  • What part (i.e., cis, medial, trans) of the Golgi
    complex they each are found in?
  • If they have the same subcellular distribution?
  • If they also are found in other compartments?

5
Vocabulary is part of the problem
  • Different investigators may use different terms
    to refer to the same pattern or the same term to
    refer to different patterns
  • Efforts to create restricted vocabularies (e.g.,
    Gene Ontology consortium) for location have been
    made

6
SWALL entries for giantin and gpp130
  • ID   GIAN_HUMAN  STANDARD;  PRT;  3259 AA.
  • AC   Q14789; Q14398;
  • GN   GOLGB1.
  • DR   GO; GO:0000139; C:Golgi membrane; TAS.
  • DR   GO; GO:0005795; C:Golgi stack; TAS.
  • DR   GO; GO:0016021; C:integral to membrane; TAS.
  • DR   GO; GO:0007030; P:Golgi organization and
    biogenesis; TAS.
  • ID   O00461  PRELIMINARY;  PRT;  696 AA.
  • AC   O00461;
  • GN   GPP130.
  • DR   GO; GO:0005810; C:endocytotic transport
    vesicle; TAS.
  • DR   GO; GO:0005801; C:Golgi cis-face; TAS.
  • DR   GO; GO:0005796; C:Golgi lumen; TAS.
  • DR   GO; GO:0016021; C:integral to membrane; TAS.

7
Words are not enough
  • Still don't know how similar the location
    patterns of these proteins are
  • Restricted vocabularies do not provide the
    necessary complexity and specificity

8
Needed: Systematic Approach
  • Need to advance past cartoon view of
    subcellular location
  • Need systematic, quantitative approach to protein
    location
  • Need new methods for accurately and objectively
    determining the subcellular location pattern of
    all proteins
  • Distinct from drug screening by low-resolution
    microscopy

9
First Decision Point
  • Classification by direct (pixel-by-pixel)
    comparison of individual images to known patterns
    is not useful, since
  • different cells have different shapes, sizes,
    orientations
  • organelles within cells are not found in fixed
    locations
  • Therefore, use feature-based methods rather than
    (pixel) model-based methods
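
A toy illustration of the argument above, not taken from the slides: the same small pattern placed at two positions has no pixel overlap with itself, while a position-independent feature is unchanged. The two 4x4 binary "cells" and the `bright_fraction` feature are hypothetical stand-ins chosen for brevity; they are not part of the SLF sets.

```python
def pixel_overlap(a, b):
    """Jaccard overlap of bright pixels between two equally sized binary images."""
    both = sum(pa and pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    either = sum(pa or pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return both / either if either else 1.0


def bright_fraction(img):
    """A simple feature that ignores where the pattern sits in the image."""
    n_pixels = sum(len(row) for row in img)
    return sum(map(sum, img)) / n_pixels


# The same 2x2 blob, placed in opposite corners (made-up data).
cell_a = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]
cell_b = [[0, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 1, 1]]

print(pixel_overlap(cell_a, cell_b))                     # direct comparison sees no match
print(bright_fraction(cell_a), bright_fraction(cell_b))  # the feature values agree
```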

10
Input Images
  • Created 2D image database for HeLa cells
  • Ten classes covering all major subcellular
    structures: Golgi (two proteins, giantin and
    gpp130), ER, mitochondria, lysosomes, endosomes,
    nuclei, nucleoli, microfilaments, microtubules
  • Included classes that are similar to each other

11
Example 2D Images of HeLa
12
Features: SLF
  • Developed sets of Subcellular Location Features
    (SLF) containing features of different types
  • Motivated in part by descriptions used by
    biologists (e.g., punctate, perinuclear)
  • First type of features derived from morphological
    image processing - finding objects by automated
    thresholding

13
Features: Morphological
  • Number of fluorescent objects per cell
  • Variance of the object sizes
  • Ratio of the largest object to the smallest
  • Average distance of objects to the center of
    fluorescence
  • Average roundness of objects
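
A minimal sketch of how four of the features listed above could be computed, assuming a fixed intensity threshold and 4-connected objects. The slides mention automated thresholding but give no details, so the `threshold` parameter, the connectivity, and the function names are assumptions. Roundness is left out to keep the sketch short, and the image must contain at least one object.

```python
from collections import deque
from math import hypot
from statistics import pvariance


def find_objects(img, threshold):
    """Group above-threshold pixels into 4-connected objects (lists of (row, col))."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] > threshold and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] > threshold and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                objects.append(pixels)
    return objects


def morphological_features(img, threshold):
    objects = find_objects(img, threshold)
    sizes = [len(obj) for obj in objects]
    # Center of fluorescence: intensity-weighted centroid of the whole image.
    total = sum(map(sum, img))
    cof_y = sum(r * v for r, row in enumerate(img) for v in row) / total
    cof_x = sum(c * v for row in img for c, v in enumerate(row)) / total
    dists = []
    for obj in objects:
        cy = sum(y for y, _ in obj) / len(obj)
        cx = sum(x for _, x in obj) / len(obj)
        dists.append(hypot(cy - cof_y, cx - cof_x))
    return {
        "n_objects": len(objects),
        "size_variance": pvariance(sizes),
        "largest_to_smallest": max(sizes) / min(sizes),
        "mean_dist_to_cof": sum(dists) / len(dists),
    }


# Made-up 5x5 image: one 4-pixel object and one 1-pixel object.
example = [[9, 9, 0, 0, 0],
           [9, 9, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 9],
           [0, 0, 0, 0, 0]]
features = morphological_features(example, threshold=0)
```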

14
Features: Haralick texture
  • Give information on correlations in intensity
    between adjacent pixels to answer questions like
  • is the pattern more like a checkerboard or
    alternating stripes?
  • is the pattern highly organized (ordered) or more
    scattered (disordered)?
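
As a hedged sketch of one such texture feature, the code below builds a gray-level co-occurrence table for horizontally adjacent pixels and computes its entropy. The single offset, the symmetric counting, and the base-2 logarithm are choices made here for illustration; the slides do not specify them.

```python
from collections import Counter
from math import log2


def cooccurrence(img, dy=0, dx=1):
    """Symmetric, normalized co-occurrence probabilities for the offset (dy, dx)."""
    counts = Counter()
    rows, cols = len(img), len(img[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                a, b = img[y][x], img[ny][nx]
                counts[(a, b)] += 1
                counts[(b, a)] += 1  # count each pair in both directions
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def texture_entropy(probabilities):
    """Entropy of the co-occurrence table: low for ordered, high for disordered patterns."""
    return -sum(p * log2(p) for p in probabilities.values())


# Made-up two-level images: a checkerboard and a less regular pattern.
ordered = [[0, 1, 0, 1],
           [1, 0, 1, 0],
           [0, 1, 0, 1],
           [1, 0, 1, 0]]
scattered = [[0, 0, 1, 1],
             [0, 1, 1, 0],
             [1, 1, 0, 0],
             [1, 0, 0, 1]]
ordered_entropy = texture_entropy(cooccurrence(ordered))
scattered_entropy = texture_entropy(cooccurrence(scattered))
```

In the checkerboard only two kinds of neighbor pair occur, so its entropy is lower than that of the second image, where all four pairs occur equally often.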

15
Example: Difference detected by texture feature
entropy
16
Features: Zernike moment
  • Measure degree to which pattern matches a
    particular Zernike polynomial
  • Give information on basic nature of pattern
    (e.g., circle, donut) and sizes (frequencies)
    present in pattern
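
A minimal sketch of a Zernike moment magnitude, using the standard radial polynomial and a square image mapped onto the unit disk. The pixel-to-disk mapping and the lack of any intensity or pixel-area normalization are simplifications assumed here, so absolute values are not comparable with published feature values. It requires |m| <= n with n - |m| even, and an image at least 2 pixels wide.

```python
import cmath
from math import atan2, factorial, hypot, pi


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_n^|m|(rho)."""
    m = abs(m)
    return sum(
        (-1) ** s * factorial(n - s)
        / (factorial(s) * factorial((n + m) // 2 - s) * factorial((n - m) // 2 - s))
        * rho ** (n - 2 * s)
        for s in range((n - m) // 2 + 1)
    )


def zernike_magnitude(img, n, m):
    """|Z_nm| of a square image; pixels falling outside the unit disk are ignored."""
    half = (len(img) - 1) / 2
    acc = 0j
    for r, row in enumerate(img):
        for c, value in enumerate(row):
            y, x = (r - half) / half, (c - half) / half
            rho = hypot(x, y)
            if rho <= 1.0:
                theta = atan2(y, x)
                # Project onto the complex conjugate of the basis function V_nm.
                acc += value * radial_poly(n, m, rho) * cmath.exp(-1j * m * theta)
    return abs(acc) * (n + 1) / pi
```

Taking the magnitude discards the phase, which is what makes the feature insensitive to rotation of the cell in the image plane.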

17
Examples of Zernike Polynomials
Panels: Z(2,0), Z(4,4), Z(10,6)
18
Subcellular Location Features: 2D
  • Morphological features
  • Haralick texture features
  • Zernike moment features
  • Geometric features
  • Edge features

19
2D Classification Results
Rows: true class. Columns: output of the classifier. Values in percent.
True class  DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA          99    1    0    0    0    0    0    0    0    0
ER            0   97    0    0    0    2    0    0    0    1
Gia           0    0   91    7    0    0    0    0    2    0
Gpp           0    0   14   82    0    0    2    0    1    0
Lam           0    0    1    0   88    1    0    0   10    0
Mit           0    3    0    0    0   92    0    0    3    3
Nuc           0    0    0    0    0    0   99    0    1    0
Act           0    0    0    0    0    0    0  100    0    0
TfR           0    1    0    0   12    2    0    1   81    2
Tub           1    2    0    0    0    1    0    0    1   95
Overall accuracy: 92% (95% for major patterns)
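
The overall figure can be checked from the matrix itself. This sketch assumes the rows are percentages and that every class carries equal weight (class sizes are not given in the slides). Under that assumption the mean of the diagonal is 92.4, consistent with the quoted 92. The figure for major patterns depends on how classes are merged, which the slides do not state, so it is not reproduced here.

```python
classes = ["DNA", "ER", "Gia", "Gpp", "Lam", "Mit", "Nuc", "Act", "TfR", "Tub"]
confusion = [
    [99, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 97, 0, 0, 0, 2, 0, 0, 0, 1],
    [0, 0, 91, 7, 0, 0, 0, 0, 2, 0],
    [0, 0, 14, 82, 0, 0, 2, 0, 1, 0],
    [0, 0, 1, 0, 88, 1, 0, 0, 10, 0],
    [0, 3, 0, 0, 0, 92, 0, 0, 3, 3],
    [0, 0, 0, 0, 0, 0, 99, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 100, 0, 0],
    [0, 1, 0, 0, 12, 2, 0, 1, 81, 2],
    [1, 2, 0, 0, 0, 1, 0, 0, 1, 95],
]


def mean_per_class_accuracy(matrix):
    """Average of the diagonal, i.e. accuracy with all classes weighted equally."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


def most_confused(matrix, names):
    """Largest off-diagonal entry as (true class, predicted class, percent)."""
    value, i, j = max(
        (matrix[i][j], i, j)
        for i in range(len(matrix))
        for j in range(len(matrix))
        if i != j
    )
    return names[i], names[j], value


accuracy = mean_per_class_accuracy(confusion)
worst = most_confused(confusion, classes)
```

The largest confusion is between the two Golgi proteins, with Lam and TfR the next most confused pair.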
20
Human Classification Results
Overall accuracy: 83% (92% for major patterns)
21
Computer vs. Human
22
Extending to 3D: Labeling approach
  • Total protein labeled with Cy5 reactive dye
  • DNA labeled with PI
  • Specific proteins labeled with primary Ab +
    Alexa488-conjugated secondary Ab

23
3D Image Set
Panels: Giantin, Nuclear, ER, Lysosomal, gpp130, Actin, Mitoch., Nucleolar, Tubulin, Endosomal
24
New features to measure z asymmetry
  • 2D features treated x and y equivalently
  • For 3D images, while it makes sense to treat x
    and y equivalently (cells don't have a left and
    right), z should be treated differently (top
    and bottom are not the same)
  • We designed features to separate distance
    measures into x-y component and z component
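
One way to read the last bullet, sketched under assumptions: the displacement of an object from a reference point (for example the center of fluorescence) is split into an unsigned in-plane distance and a signed z offset. Keeping the sign of z is an assumption made here; the slides only say that the two components are separated.

```python
from math import hypot


def split_distance(point, reference):
    """Split the offset between two (x, y, z) positions into (xy distance, signed z)."""
    dx, dy, dz = (p - r for p, r in zip(point, reference))
    return hypot(dx, dy), dz


# Made-up coordinates: an object 5 units away in the plane and 2 units below.
xy_distance, z_offset = split_distance((3.0, 4.0, -2.0), (0.0, 0.0, 0.0))
```

Rotating the cell in the x-y plane leaves both values unchanged, while flipping it top to bottom changes the sign of `z_offset`.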

25
Classification Results for 3D images
Overall accuracy: 97%
26
How to do even better
  • Biologists interpreting images of protein
    localization typically view many cells before
    reaching a conclusion
  • Can simulate this by classifying sets of cells
    from the same microscope slide
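
A minimal sketch of set classification, assuming the set label is chosen by plurality vote over single-cell predictions. The slides do not say how the set decision is made, so both the voting rule and the use of `None` for ties are assumptions.

```python
from collections import Counter


def classify_set(single_cell_labels):
    """Return the most common single-cell label, or None if the top count is tied."""
    counts = Counter(single_cell_labels)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else None


# Nine cells from one hypothetical slide; four are misclassified individually.
set_label = classify_set(["Gia"] * 5 + ["Gpp"] * 4)
```

Even with four of nine cells misclassified, the set as a whole is still assigned to the majority class.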

27
Classification of Sets of 3D Images
Set size: 9; overall accuracy: 99.7%
28
First Conclusion
  • Description of subcellular locations for systems
    biology should be implemented using a data-driven
    approach rather than a knowledge-capture
    approach, but

29
Subcellular Location Image Finder
  • (Have automated system for finding images in
    on-line journal articles that match a particular
    pattern - enables connection between new images
    and previously published results)

30
Image Similarity
  • Classification power of features implies that
    they capture essential characteristics of protein
    patterns
  • Can be used to measure similarity between patterns

31
Clustering by Image Similarity
  • Ability to measure similarity of protein patterns
    allows us, for the first time, to create a
    systematic, objective framework for describing
    subcellular locations
  • Ideal for database references
  • One way is by creating a Subcellular Location
    Tree
  • Illustration: build hierarchical dendrogram
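
A small sketch of building such a tree by agglomerative clustering: start with one cluster per protein and repeatedly merge the two closest clusters. Average linkage and plain Euclidean distance are assumptions made for this example, and the four labelled feature vectors are made up.

```python
from itertools import combinations
from math import dist


def build_tree(named_vectors):
    """Merge the closest clusters until one remains; returns the tree as nested tuples."""
    # Each cluster is (subtree, member feature vectors).
    clusters = [(name, [vec]) for name, vec in named_vectors.items()]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist(u, v) for u in a[1] for v in b[1]) / (len(a[1]) * len(b[1]))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical two-feature descriptions of four patterns.
tree = build_tree({
    "A": (1.0, 1.0),
    "B": (1.2, 0.9),
    "C": (5.0, 5.0),
    "D": (5.5, 4.8),
})
```

The nested tuple groups A with B and C with D before joining the two pairs, which is the structure a dendrogram would draw.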

32
Subcellular Location Tree for 10 classes in HeLa
cells
33
Do this for all proteins: Location Proteomics
  • Can use CD-tagging (developed by Dr. Jonathan
    Jarvik) to randomly tag many proteins: infect a
    population of cells with a retrovirus carrying a
    DNA sequence that will produce a tag in a
    random gene in each cell
  • Isolate separate clones, each of which
    expresses one tagged protein
  • Use RT-PCR to identify tagged gene in each clone
  • Collect images of many cells for each clone using
    fluorescence microscopy

34
Example images of CD-tagged clones
  1. Glut1 gene (type 1 glucose transporter)
  2. Tmpo gene (thymopoietin)
  3. tuba1 gene (α-tubulin)
  4. Cald gene (caldesmon 1)
  5. Ncl gene (nucleolin)
  6. Rps11 gene (ribosomal protein S11)
  7. Hmga1 gene (high mobility group AT-hook 1)
  8. Col1a2 gene (procollagen type I α2)
  9. Atp5a1 gene (ATP synthase isoform 1)

35
Proof of principle
  • Cluster 46 clones expressing different tagged
    proteins based on their subcellular location
    patterns

36
Feature selection
  • Use Stepwise Discriminant Analysis to rank
    features based on their ability to distinguish
    proteins
  • Use increasing numbers of features to train
    neural network classifiers and evaluate
    classification accuracy over all 46 clones
  • Best performance obtained with 10 features
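
The shape of this procedure (rank features, then evaluate growing subsets and keep the best) can be sketched with simpler stand-ins. Here a univariate between/within variance ratio replaces Stepwise Discriminant Analysis, and a leave-one-out nearest-centroid classifier replaces the neural networks; both substitutions, the function names, and the toy data are assumptions for illustration only. Each class needs at least two samples.

```python
from math import dist
from statistics import mean, pvariance


def rank_features(samples, labels):
    """Order feature indices by between-class over within-class variance, best first."""
    classes = sorted(set(labels))
    scores = []
    for f in range(len(samples[0])):
        groups = [[s[f] for s, l in zip(samples, labels) if l == c] for c in classes]
        between = pvariance([mean(g) for g in groups])
        within = mean([pvariance(g) for g in groups])
        scores.append(between / (within + 1e-12))
    return sorted(range(len(scores)), key=lambda f: -scores[f])


def loo_accuracy(samples, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen features."""
    classes = sorted(set(labels))
    hits = 0
    for i, (sample, label) in enumerate(zip(samples, labels)):
        centroids = {}
        for c in classes:
            rows = [[t[f] for f in features]
                    for j, (t, m) in enumerate(zip(samples, labels))
                    if m == c and j != i]  # hold out sample i
            centroids[c] = [mean(col) for col in zip(*rows)]
        point = [sample[f] for f in features]
        guess = min(classes, key=lambda c: dist(centroids[c], point))
        hits += guess == label
    return hits / len(samples)


def best_feature_subset(samples, labels):
    """Try the top 1, 2, ... ranked features and keep the smallest best-scoring subset."""
    ranked = rank_features(samples, labels)
    accuracies = [loo_accuracy(samples, labels, ranked[:k])
                  for k in range(1, len(ranked) + 1)]
    best_k = accuracies.index(max(accuracies)) + 1
    return ranked[:best_k], accuracies


# Made-up data: feature 0 separates the classes, feature 1 is noise.
toy_samples = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 4.0), (1.1, 8.0), (0.9, 2.0)]
toy_labels = ["a", "a", "a", "b", "b", "b"]
selected, accuracies = best_feature_subset(toy_samples, toy_labels)
```

On the toy data, adding the noisy feature lowers accuracy, so only the first-ranked feature is kept.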

37
Tree building
  • Therefore use these 10 features with z-scored
    Euclidean distance function to build SLT
  • Find optimal number of clusters using k-means
    clustering and AIC
  • Find consensus hierarchical trees by randomly
    dividing the images for each protein in half and
    keeping branches conserved between both halves
    (repeat for 50 random divisions)
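
A minimal sketch of the z-scored Euclidean distance named in the first bullet: each feature is standardized across all samples before distances are taken, so features on large numeric scales do not dominate. Using the population standard deviation, and treating constant features as having unit spread, are assumptions made here.

```python
from math import dist
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant feature has zero spread; use 1.0 to avoid division by zero.
    spreads = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, spreads)] for row in rows]


def zscored_distance_matrix(rows):
    """Pairwise Euclidean distances between the standardized feature vectors."""
    z = zscore_columns(rows)
    return [[dist(a, b) for b in z] for a in z]


# Made-up feature vectors whose second feature is on a much larger scale.
distances = zscored_distance_matrix([(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)])
```

After standardization both features contribute equally, even though their raw ranges differ by a factor of 100.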

38
Consensus Subcellular Location Tree
39
Examples from major clusters
40
Significance
  • Proteins clustered by location analogous to
    proteins clustered by sequence (e.g., PFAM)
  • Can subdivide clusters by observing response to
    drugs, oncogenes, etc.
  • These represent protein location states
  • Base knowledge required for modeling
  • Can be used to filter protein interactions

41
From patterns to causes
  • Machine learning approaches have been previously
    used to find localization motifs in protein
    sequences, but the set of locations used was
    limited to major organelles
  • High-resolution subcellular location trees can be
    used to discover (recursively) new motifs that
    determine location of each group
  • Can include post-translational modifications

42
More Conclusions
  • Organized data collection approach is required to
    capture high-resolution information on the
    subcellular location of all proteins
  • Prohibitive combinatorial complexity makes the
    colocalization approach infeasible, so major
    effort should focus on one protein at a time
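
The combinatorial argument can be made concrete with a short calculation. The proteome size of 10,000 is a hypothetical round number chosen for illustration, not a figure from the slides.

```python
from math import comb


def experiments_needed(n_proteins, proteins_per_experiment):
    """Number of distinct labelings if each experiment images a fixed-size group of proteins."""
    return comb(n_proteins, proteins_per_experiment)


n = 10_000  # hypothetical number of proteins
one_at_a_time = experiments_needed(n, 1)
pairwise = experiments_needed(n, 2)
```

Imaging every pair requires n(n-1)/2 experiments, so the pairwise count grows quadratically while the one-protein-at-a-time count grows linearly.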

43
Center for Bioimage Informatics
  • $2.75 M CMU funding from NSF ITR
  • Joint with UCSB and collaborators at Berkeley and
    MIT
  • R. Murphy (CALD/Biomed.Eng./Biol.Sci.)
  • Jelena Kovacevic (Biomedical Engineering)
  • Tom Mitchell (CALD)
  • Christos Faloutsos (CALD)

44
Acknowledgments
  • Former students
  • Michael Boland, Mia Markey, William Dirks,
    Gregory Porreca, Edward Roques, Meel Velliste
  • Current grad students
  • Kai Huang, Xiang Chen, Ting Zhao, Yanhua Hu,
    Elvira Garcia Osuna, Zhenzhen Kou, Juchang Hua
  • Funding
  • NSF, NIH, Rockefeller Bros. Fund, PA. Tobacco
    Settlement Fund
  • Collaborators/Consultants
  • Simon Watkins, David Casasent, Tom Mitchell,
    Christos Faloutsos, Jon Jarvik, Peter Berget