Understanding Databases of Microscope Images

1
Understanding Databases of Microscope Images
  • Kai Huang
  • Department of Biological Sciences and
  • Center for Automated Learning and Discovery
  • Carnegie Mellon University

2
A Little Biology
  • All advanced living organisms comprise many cells
  • Each cell has a mechanism to control how and when
    it develops and functions
  • The mystery is coded in the sequence of genes
    that are packed tightly to form chromosomes in
    the cell nucleus
  • The genome of an organism is its set of
    chromosomes, containing all of its genes and
    associated DNA

3
Central Dogma
http://www.accessexcellence.org/AB/GG/central.html
4
Genomics
Comparative Genomics
Gene Prediction
Genome Sequencing
RNA Secondary Structure
Genome Analysis
5
Genomics
  • Both human and mouse genome drafts are in hand,
    plus those of hundreds of smaller organisms
  • The building blocks of a cell are proteins that
    are coded in the genome
  • Finding how proteins function with each other is
    the ultimate goal of life science
  • Next step: Proteomics

6
Proteomics
  • The set of proteins expressed in a given cell
    type or tissue is its proteome
  • Protein differences between cell types are
    responsible for the different behaviors of those
    cell types

7
Proteomics
  • Things to learn about proteins
  • sequence
  • structure
  • expression level
  • activity
  • partners
  • location

8
Proteomics
  • Things to learn about proteins
  • sequence
  • structure
  • expression level
  • activity
  • partners
  • location

Image source: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/A/AnimalCell.gif
9
Proteomics
  • Things to learn about proteins
  • sequence
  • structure
  • expression level
  • activity
  • partners
  • location - critical to understanding function

10
Subcellular Location
  • How do we determine the subcellular location of
    a protein?
  • By looking it up
  • By doing an experiment

11
Looking it up: Example
  • Giantin
  • Entrez: /note="a new 376kD Golgi complex outher
    membrane protein"
  • SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI
    MEMBRANE.
  • GPP130
  • Entrez: /note="GPP130 type II Golgi membrane
    protein"
  • SwissProt: nothing

12
Looking it up: Example
  • We learned that Giantin and GPP130 are both Golgi
    proteins, but do we know:
  • What part (i.e., cis, medial, trans) of the Golgi
    complex they each are found in?
  • If they have the same subcellular distribution?
  • If they also are found in other compartments?

13
Words are not enough
  • Different investigators may use different terms
    to refer to the same pattern or the same term to
    refer to different patterns
  • Some efforts using restricted vocabularies (e.g.,
    Yeast Protein Database, Gene Ontology consortium)
    for location have been made but these do not
    provide the necessary complexity and specificity

14
Need to advance past cartoon view of
subcellular location
http://www.cellsalive.net/cells/animcell.htm
  • Need a systematic, quantitative approach to
    protein location
  • Need new methods for accurately and objectively
    determining the subcellular location pattern of
    all proteins

15
Fluorescence Microscopy
  • Cells and proteins have almost no natural
    contrast under light
  • Proteins can be tagged with fluorescent dyes
  • Fluorescence microscopy provides high-resolution
    observation of protein subcellular location

16
Initial Goal
  • Classification by direct (pixel-by-pixel)
    comparison of individual images to known patterns
    is not useful, since:
  • different cells have different shapes, sizes,
    orientations
  • organelles within cells are not found in fixed
    locations

17
Supervised Learning Approach
  • 1. Create sets of images showing the localization
    of many different proteins (each set defines one
    class of pattern)
  • 2. Reduce each image to a set of numerical values
    (features) that are insensitive to position and
    rotation of the cell
  • 3. Use statistical classification methods to
    learn how to distinguish each class using the
    features

18
Input Images
  • Created image database for HeLa cells
  • Ten classes covering all major subcellular
    structures: Golgi (two proteins, giantin and
    gpp130), ER, mitochondria, lysosomes, endosomes,
    nuclei, nucleoli, microfilaments, microtubules
  • Includes classes that are similar to each other

19
Example Images
  • Patterns that might be easily confused

Endoplasmic Reticulum (ER)
Mitochondria
20
Example Images
  • Patterns that might be easily confused

Lysosomes (LAMP2)
Endosomes (TfR)
21
Example Images
  • Patterns that might be easily confused

F-actin
Tubulin
22
Example Images
  • Classes expected to be indistinguishable

Golgi (Giantin)
Golgi (gpp130)
23
Features: Haralick texture
  • Give information on correlations in intensity
    between adjacent pixels to answer questions like:
  • is the pattern more like a checkerboard or
    alternating stripes?
  • is the pattern highly organized (ordered) or more
    scattered (disordered)?
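
Presenter's note (illustrative sketch, not the code used for the results in this talk): Haralick features are computed from a gray-level co-occurrence matrix, which counts how often each pair of intensities occurs at neighboring pixels. The sketch below computes one such feature, entropy, for a single horizontal offset. The function name, the offset parameters, and the two tiny test patterns are assumptions made for illustration.

```python
import math
from collections import Counter

def cooccurrence_entropy(image, dx=1, dy=0):
    """Entropy (in bits) of the gray-level co-occurrence matrix for one offset."""
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                a, b = image[y][x], image[ny][nx]
                # Count the pair in both orders so the matrix is symmetric.
                counts[(a, b)] += 1
                counts[(b, a)] += 1
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical 4x4 binary patterns: ordered stripes versus a scattered layout.
stripes = [[0, 1, 0, 1] for _ in range(4)]
scattered = [[0, 0, 1, 0], [1, 1, 0, 1], [0, 1, 1, 0], [1, 0, 0, 0]]

print(cooccurrence_entropy(stripes))
print(cooccurrence_entropy(scattered))
```

The ordered pattern produces only two kinds of neighbor pairs, so its entropy is lower than that of the scattered pattern, which is the ordered-versus-disordered distinction described above.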

24
Example: Difference detected by texture feature
entropy
25
Features: Zernike moment
  • Measure degree to which pattern matches a
    particular Zernike polynomial
  • Give information on basic nature of pattern
    (e.g., circle, donut) and sizes (frequencies)
    present in pattern
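
Presenter's note (illustrative sketch under stated assumptions): a Zernike moment is the projection of the image, mapped onto the unit disk, onto one Zernike polynomial. Its magnitude does not change when the image is rotated, which is why it suits cells with arbitrary orientation. The mapping of pixels to the disk (centered on the image, radius of half the image width) and the omission of a pixel-area factor are simplifications chosen here, not details taken from the talk.

```python
import cmath
import math

def radial_poly(n, m, r):
    """Zernike radial polynomial R_n^m(r); requires n >= |m| and n - |m| even."""
    m = abs(m)
    total = 0.0
    for s in range((n - m) // 2 + 1):
        coeff = ((-1) ** s * math.factorial(n - s)
                 / (math.factorial(s)
                    * math.factorial((n + m) // 2 - s)
                    * math.factorial((n - m) // 2 - s)))
        total += coeff * r ** (n - 2 * s)
    return total

def zernike_magnitude(image, n, m):
    """|Z_nm| of a square image, up to a constant pixel-area factor."""
    size = len(image)
    center = (size - 1) / 2.0
    radius = size / 2.0
    moment = 0j
    for y in range(size):
        for x in range(size):
            u, v = (x - center) / radius, (y - center) / radius
            r = math.hypot(u, v)
            if r <= 1.0:  # only pixels inside the unit disk contribute
                theta = math.atan2(v, u)
                moment += image[y][x] * radial_poly(n, m, r) * cmath.exp(-1j * m * theta)
    return abs(moment) * (n + 1) / math.pi

def rotate90(image):
    """Rotate a square image by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]

# Hypothetical 8x8 image with three bright pixels.
image = [[0] * 8 for _ in range(8)]
image[2][3], image[2][4], image[5][1] = 5, 3, 2

print(zernike_magnitude(image, 2, 2))
print(zernike_magnitude(rotate90(image), 2, 2))  # same magnitude after rotation
```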

26
Examples of Zernike Polynomials
Z axis shows intensity
27
Features: SLF
  • Developed additional features (SLF, for
    Subcellular Location Features)
  • Motivated by descriptions of patterns used by
    biologists (e.g., punctate, perinuclear)
  • Combined with Zernike and Haralick features to
    give 84 features used to describe each image

28
Example Features from SLF1
  • Number of fluorescent objects per cell
  • Variance of the object sizes
  • Ratio of the largest object to the smallest
  • Average distance of objects to the center of
    fluorescence
  • Fraction of convex hull occupied by fluorescence

29
Subcellular Location Features: 2D
  • Haralick texture features
  • Zernike moment features
  • Morphological features
  • Geometric features
  • Edge features
  • Gabor wavelet features
  • Daubechies 4 wavelet features

30
Feature Reduction
  • Remove non-discriminative features
  • Remove redundant features
  • Combine features
  • Benefits
  • Speed
  • Accuracy
  • Multimedia indexing

31
Feature Reduction
  • Feature Recombination
  • PCA (Principal Component Analysis)
  • NLPCA (Nonlinear PCA)
  • KPCA (Kernel PCA)
  • ICA (Independent Component Analysis)
  • Feature Selection
  • Classification Tree (Gain ratio)
  • Fractal Dimensionality Reduction
  • Genetic Algorithm
  • Stepwise Discriminant Analysis

32
Feature Reduction Results
33
Classifier: Supervised Learning
  • Neural Network
  • Support Vector Machine
  • Linear kernel
  • Polynomial kernel
  • Radial basis kernel
  • Exponential radial basis kernel
  • Ensemble Classifiers
  • AdaBoost
  • Bagging
  • Mixtures-of-Experts
  • Majority-voting classifier combining the above
    classifiers

34
Majority-voting Classifier
  • Neural Network
  • Linear-kernel SVM
  • Exponential-rbf-kernel SVM
  • Polynomial-kernel SVM
  • AdaBoost
  • Pairwise-Classifier-Error Correlation
    Coefficients
  • Mean 0.10, STD 0.07
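
Presenter's note (illustrative sketch): majority voting takes the label that most of the member classifiers predict for an image. The tie-breaking rule below (first label in the list wins) is an assumption made for this example, and the vote list is made up.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties go to the tied label that appears first in `predictions`
    (an arbitrary rule chosen for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

# Hypothetical outputs of five classifiers for one image.
votes = ["Golgi", "Golgi", "lysosome", "Golgi", "endosome"]
print(majority_vote(votes))  # prints "Golgi"
```

Voting helps most when the member classifiers make different mistakes, which is what the low pairwise error correlation above indicates.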

35
2D Classification Results
Overall accuracy: 92.34%
36
Human Classification Results
Overall accuracy: 83%
37
Extending to 3D: Labeling approach
  • Total protein labeled with Cy5 reactive dye
  • DNA labeled with PI
  • Specific proteins labeled with primary Ab and
    Alexa488-conjugated secondary Ab

38
3D Image Set
Giantin
Nuclear
ER
Lysosomal
gpp130
Actin
Mitoch.
Nucleolar
Tubulin
Endosomal
39
Features to measure z asymmetry
  • 2D features treated x and y equivalently
  • For 3D images, while it makes sense to treat x
    and y equivalently (cells don't have a left and
    right), z should be treated differently (top
    and bottom are not the same)
  • We designed features to separate distance
    measures into x-y component and z component

40
Classification Results for 3D images
Overall accuracy: 97%
41
How to do even better
  • Biologists interpreting images of protein
    localization typically view many cells before
    reaching a conclusion
  • Can simulate this by classifying sets of cells
    from the same microscope slide
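
Presenter's note (worked arithmetic under simplifying assumptions): if each cell in a set is classified correctly with probability p, independently of the others, then the chance that more than half of the set is correct is a binomial tail sum. That is a lower bound on the accuracy of a plurality vote over the set. Independence and a single p shared by all classes are assumptions made here; real errors depend on the class, so this will not reproduce a measured set accuracy exactly.

```python
from math import comb

def majority_correct_probability(p, set_size):
    """P(more than half of `set_size` independent cells are classified correctly)."""
    needed = set_size // 2 + 1
    return sum(comb(set_size, k) * p ** k * (1 - p) ** (set_size - k)
               for k in range(needed, set_size + 1))

# Single-cell accuracy of 0.97 and a set size of 9, as on the neighboring slides.
print(majority_correct_probability(0.97, 9))
```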

42
Classification of Sets of 3D Images
Set size: 9; overall accuracy: 99.7%
43
Next Experiment Interpretation
  • Classification results demonstrate the value of
    the SLF feature sets for describing subcellular
    patterns
  • The validation of the features suggests that they
    can be used for other applications, such as
    testing of hypotheses using image sets
  • Enabling concept: image similarity

44
Searching databases
  • Sequence databases allow search by similarity
  • The same is true for protein structure databases

GSNWLAMQLT
45
Basic Method for Sequence Comparison
M A T N W G S L L Q
M D T N P V S L L R
Similarity Matrix
5 -1 3 2 -9 4 2 1 1 -3
25.7
46
Extension to location?
  • Use SLF to find similar patterns

Database
47
Goal: Typical Image Selection
  • To develop automated methods for selecting a
    representative image from a set of images
    obtained by fluorescence microscopy

48
TypIC - Typical Image Chooser
Image Set
49
Approach
  • Calculate numerical features that contain
    information about each image (just like when
    classifying images)
  • Calculate the mean and covariance matrix for the
    set (usually after automated elimination of
    outliers)
  • Rank the images by their distance to the mean
    (centroid) of the population (usually using
    Mahalanobis distance, which weights according to
    the covariance matrix)
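
Presenter's note (illustrative sketch, not the TypIC code): the ranking step can be written as a Mahalanobis distance from each image's feature vector to the mean of the set. The sketch skips outlier elimination, assumes the covariance matrix is invertible (more images than features, no constant feature), and uses made-up two-feature data. All function names are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance_matrix(rows, mean):
    """Sample covariance matrix (divides by n - 1)."""
    n, p = len(rows), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def solve(matrix, vector):
    """Solve matrix @ x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [list(matrix[i]) + [vector[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def rank_by_typicality(features):
    """Return image indices ordered from most to least typical."""
    mean = mean_vector(features)
    cov = covariance_matrix(features, mean)

    def squared_distance(row):
        d = [x - m for x, m in zip(row, mean)]
        # d^T * cov^-1 * d, computed without forming the inverse explicitly.
        return sum(di * si for di, si in zip(d, solve(cov, d)))

    return sorted(range(len(features)), key=lambda i: squared_distance(features[i]))

# Hypothetical feature vectors for five images; image 4 sits at the centroid.
features = [[0.0, 0.0], [4.0, 0.0], [0.0, 2.0], [4.0, 2.0], [2.0, 1.0]]
print(rank_by_typicality(features))
```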

50
Goal: Image Set Comparison
  • A common paradigm in molecular cell biology is to
    compare the distribution of a protein with and
    without the addition of a potential perturbing
    agent (e.g., drug, overexpressed protein)
  • Such experiments are usually assayed by visual
    examination
  • We have explored automating such comparisons

51
SImEC - Statistical Imaging Experiment Comparator
Image Set 2
Image Set 1
52
Method
  • Calculate feature matrix for each set of images
  • Compare feature matrices using a multivariate
    hypothesis test called the Hotelling T²-test
  • Test returns an F value that can be compared to a
    critical value for a given confidence level
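
Presenter's note (illustrative sketch, not the SImEC code): for two sets with n1 and n2 images and p features, the standard two-sample Hotelling statistic is T² = n1·n2/(n1+n2) · d' S⁻¹ d, where d is the difference of the mean feature vectors and S is the pooled covariance matrix. It converts to F = T² · (n1+n2-p-1) / (p · (n1+n2-2)), which is compared with the critical value of the F distribution with p and n1+n2-p-1 degrees of freedom. The sketch assumes S is invertible and leaves the critical-value lookup to a table or statistics package; the data are made up.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def solve(matrix, vector):
    """Solve matrix @ x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [list(matrix[i]) + [vector[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def hotelling_f(set1, set2):
    """F value of the two-sample Hotelling T-squared test."""
    n1, n2, p = len(set1), len(set2), len(set1[0])
    mean1, mean2 = mean_vector(set1), mean_vector(set2)
    # Pooled covariance: summed squared deviations of both sets over n1 + n2 - 2.
    pooled = [[0.0] * p for _ in range(p)]
    for rows, mean in ((set1, mean1), (set2, mean2)):
        for row in rows:
            d = [x - m for x, m in zip(row, mean)]
            for i in range(p):
                for j in range(p):
                    pooled[i][j] += d[i] * d[j] / (n1 + n2 - 2)
    diff = [a - b for a, b in zip(mean1, mean2)]
    t2 = n1 * n2 / (n1 + n2) * sum(d * s for d, s in zip(diff, solve(pooled, diff)))
    return t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))

# Hypothetical feature matrices (rows are images, columns are features).
set_a = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
set_b = [[4.0, 0.0], [4.0, 2.0], [3.0, 1.0], [5.0, 1.0]]
print(hotelling_f(set_a, set_b))
```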

53
F values for comparison of all pairs of classes
using 65 features
  • 95% confidence critical values are approximately
    1.4 for all comparisons (depends on number of
    images)

54
Comparison of two sets drawn randomly from the
same class
  • Sets compared: TfR, Phal
  • Average F: 1.05 (TfR), 1.05 (Phal)
  • Critical F (0.95): 1.63 (TfR), 1.61 (Phal)
  • Number of failing sets out of 1000: 47 (TfR), 45
    (Phal)

Expected result obtained: about 95% of randomly drawn
sets are considered to be the same
55
Image Databasing
  • Our automated tools facilitate interpretation of
    large numbers of images
  • Ideal for use with image databases
  • We therefore began building an image database
    system in 1997 by first developing a database
    schema to describe all aspects of fluorescence
    microscope images
  • Fluorescence Microscope Annotation Schema (FMAS)
    http://murphylab.web.cmu.edu/services/FMAS
  • For Cell Biology labs to store and analyze
    fluorescence microscope images

56
Protein Subcellular Location Image Database
  • Implemented image database incorporating
  • Full experimental annotation (FMAS)
  • SLF numerical features
  • Queries can be done by text or image content
  • Results can be fed to TypIC, SImEC, SLIC, SLIF

K. Huang, J. Lin, J.A. Gajnak, and R.F. Murphy
(2002). Image Content-based Retrieval and
Automated Interpretation of Fluorescence
Microscope Images via the Protein Subcellular
Location Image Database. Proceedings of the 2002
IEEE International Symposium on Biomedical
Imaging (ISBI 2002), pp. 325-328.
57
(No Transcript)
58
(No Transcript)
59
Clustering by Image Similarity
  • The ability to measure similarity of protein
    patterns allows us, for the first time, to create
    a systematic, objective framework for describing
    subcellular locations
  • Ideal for database references
  • One way is by creating a Subcellular Location
    Tree
  • Start tree with the two proteins whose patterns
    are most similar, keep adding branches for less
    and less similar patterns
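
Presenter's note (illustrative sketch, not the tree-building code used here): the procedure described above is agglomerative clustering. The sketch merges the two closest groups repeatedly, using Euclidean distance between mean feature vectors and average linkage. The linkage rule, the protein names, and the feature values are all assumptions made for the example.

```python
import math

def build_tree(names, vectors):
    """Agglomerative clustering with average linkage on Euclidean distance.

    Returns a nested tuple of names; the innermost pairs were merged first.
    """
    clusters = [(name, [vec]) for name, vec in zip(names, vectors)]

    def average_distance(a, b):
        return sum(math.dist(u, v) for u in a for v in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: average_distance(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical mean feature vectors for four proteins.
names = ["protein_A", "protein_B", "protein_C", "protein_D"]
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 6.0]]
print(build_tree(names, vectors))
```

With these made-up values, protein_A and protein_B are joined first, then protein_C and protein_D, and the two pairs are joined last.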

60
Subcellular Location Tree for 10 classes in HeLa
cells
61
Location Proteomics
  • Tag all proteins in a cell line randomly
  • Examine many cells, each of which expected to
    express one tagged protein, using fluorescence
    microscopy to determine the subcellular location
    of that protein

62
Example images of randomly tagged clones
  • Glut1 gene (type 1 glucose transporter)
  • Tmpo gene (thymopoietin)
  • tuba1 gene (α-tubulin)
  • Cald gene (caldesmon 1)
  • Ncl gene (nucleolin)
  • Rps11 gene (ribosomal protein S11)
  • Hmga1 gene (high mobility group AT-hook 1)
  • Col1a2 gene (procollagen type I α2)
  • Atp5a1 gene (ATP synthase isoform 1)

63
Goal
  • Cluster 46 clones expressing different tagged
    proteins based on their subcellular location
    patterns

64
Outlier removal
  • Have 9 to 30 images per protein
  • Use Q test or t test on individual features to
    remove outlier images
  • Calculate mean feature vector for each protein
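
Presenter's note (illustrative sketch): one common form of the Q test (Dixon's Q) divides the gap between the most extreme value and its nearest neighbor by the full range of the sample, and rejects the value when that ratio exceeds a tabulated critical value. The sketch applies it to one feature at a time. The critical value is passed in by the caller and must come from a published table for the sample size and confidence level. Tables for larger samples are usually based on modified ratios, so this basic ratio is a simplifying assumption for the set sizes mentioned above. The data and threshold are made up.

```python
def dixon_q(values):
    """Return (Q, suspect) for the most extreme value; needs at least 3 values."""
    ordered = sorted(values)
    spread = ordered[-1] - ordered[0]
    if spread == 0:
        return 0.0, ordered[0]
    low_gap = ordered[1] - ordered[0]
    high_gap = ordered[-1] - ordered[-2]
    if high_gap >= low_gap:
        return high_gap / spread, ordered[-1]
    return low_gap / spread, ordered[0]

def remove_outlier(values, critical_q):
    """Drop the most extreme value if its Q exceeds `critical_q`."""
    q, suspect = dixon_q(values)
    kept = list(values)
    if q > critical_q:
        kept.remove(suspect)
    return kept

# Hypothetical values of one feature across six images of the same protein.
feature_values = [10.1, 10.3, 9.9, 10.2, 10.0, 14.0]
print(dixon_q(feature_values))
# 0.6 is an illustrative threshold only; look up the tabulated critical value.
print(remove_outlier(feature_values, critical_q=0.6))
```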

65
Feature selection
  • Feature set optimization is NP-complete
  • Use Stepwise Discriminant Analysis
    (backward/forward method) to rank features based
    on their ability to distinguish proteins
  • Use increasing numbers of features to train
    neural network classifiers and evaluate
    classification accuracy over all 46 clones
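
Presenter's note (simplified stand-in, not Stepwise Discriminant Analysis): SDA adds and removes features using a multivariate criterion. As a much simpler illustration of ranking features by their ability to distinguish classes, the sketch below scores each feature on its own with a one-way ANOVA F ratio (between-class variance over within-class variance) and sorts by that score. It ignores redundancy between features, which SDA accounts for, and it assumes every feature varies within at least one class. The data are made up.

```python
def f_ratio(groups):
    """One-way ANOVA F statistic for one feature; `groups` holds one value list per class."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g) / (n - k)
    return between / within

def rank_features(data):
    """Rank feature indices, most discriminative first.

    `data` maps a class label to a list of feature vectors.
    """
    num_features = len(next(iter(data.values()))[0])
    scores = [
        f_ratio([[row[f] for row in rows] for rows in data.values()])
        for f in range(num_features)
    ]
    return sorted(range(num_features), key=lambda f: scores[f], reverse=True)

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
data = {
    "clone_A": [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0]],
    "clone_B": [[3.0, 4.5], [3.1, 3.5], [2.9, 4.0]],
}
print(rank_features(data))  # prints [0, 1]
```

The ranked list can then be truncated at increasing lengths to train and evaluate classifiers, as in the last bullet above.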

66
Classifier accuracy
67
Tree building
  • Best performance using between 10 and 15 features
  • Calculate Euclidean distance matrix for best 10
    features
  • Build SLT
  • Using classifier results to cut tree
  • Sort confusion matrix in order of tree
  • Group proteins with more than 25% confusion

68
Detailed view of some proteins
[Figure: classification result versus real classes]
69
Hmga1-1 (nucleus)
Hmga1-2 (nucleus)
Unknown-9 (nucleus)
Hmgn2-1 (nucleus)
Unknown-8 (nucleus)
Rpl32 (Nucleolus)
70
SLT with best 10 features
71
Conclusions: Location Proteomics
  • New frontier of automated cell biology opening
    for analysis of large numbers of 2D through 5D
    fluorescence microscope images
  • Random tagging provides tool for determining
    patterns for many proteins
  • Can construct Subcellular Location Trees to
    systematically represent knowledge about location
  • Can be built for many cell types and can reflect
    dynamic properties of proteins (changes with
    time, drugs, oncogenes, etc.)

72
Acknowledgments
  • Prof. Robert F. Murphy
  • Current grad students
  • Kai Huang
  • Xiang Chen
  • Yanhua Hu
  • Elvira Garcia Osuna
  • Ting Zhao
  • Juchang Hua
  • Former students
  • Meel Velliste
  • Michael Boland
  • Mia Markey
  • Gregory Porreca
  • Edward Roques
  • Jie Yao
  • Funding
  • NSF, NCI, Merck, Rockefeller Bros. Fund
  • Collaborators/Consultants
  • Simon Watkins
  • David Casasent
  • Tom Mitchell
  • Christos Faloutsos
  • Jon Jarvik
  • Peter Berget