Title: Towards the Goal of Searchable Clinical Image Repositories
1Towards the Goal of Searchable Clinical Image
Repositories
- Ulysses J. Balis, M.D.
- Director of Clinical Informatics
- Co-Director, Division of Pathology Informatics
- Department of Pathology
- University of Michigan
- ulysses_at_umich.edu
2Learning Objectives
- Overview of the salient history of the underlying digital imaging technology of histopathology repositories
- Recognize digital representation of images as the key transformative element enabling Digital Microscopy
- Familiarity with the topic of dimensional reduction and its utility in reducing the search complexity associated with large repositories
- Familiarity with some candidate algorithmic / heuristic approaches to image search / content-based image retrieval (CBIR)
- Followed by representative real-time demonstrations of clinically relevant CBIR
3(No Transcript)
4(No Transcript)
5(No Transcript)
6Some Observations
- Moore's Law is equally applicable to Pathology Informatics as it is to semiconductor scaling
- We (Pathologists) must operate as the manifest stewards of our own data (or be rendered moot in the overall enterprise IT equation)
- New modalities are heavily dependent on IT understanding and support
- High-throughput molecular testing platforms
- All-digital signout model made possible by whole slide imaging
- Shift from qualitative approaches to quantitative ones, as we shift from clinical to pre-clinical diagnostic arenas (e.g. whole-slide analysis and multispectral analysis)
- Unsustainability of the current trajectory of monolithic EHR architectures.
8The CCD: the fundamental transformative technology enabling creation of wide-field datasets
9(No Transcript)
10Digital Representation of Images as the Key
Transformative Element Enabling Digital Microscopy
- Without the image data in digital format, there is no cogent question that can be asked, as there is no dataset available to query.
- With the advent of increasingly comprehensive digital image repositories, we encounter an entirely different situation: essentially an embarrassment of riches, as we now have more data than is easily parsed by conventional linear programming.
- This is a transformative enabling step, nonetheless.
- As a confirming reality check: Radiology has already firmly entered the realm of investigation of computer-aided diagnosis (CAD), although it is cogent to recognize that their current datasets are much smaller than those now possible with digital whole-slide imaging.
- And as such, the question becomes one of algorithmic and heuristic development.
- Hint: we know this is possible, as the human brain carries out real-time CBIR with high sensitivity and specificity.
- Caveat: recognizing that the human brain is massively parallel in construction, recapitulating this with current computational technology may be impractical.
11Some Observations Concerning Slide data Density
- Characteristics
- 2.5 by 7.5 cm
- 1/3 used for label
- 2.5 x 5.0 cm for tissue display
- Typical light microscopy is diffraction-limited to 0.25 microns
- Yields an effective required pixel count of 100k by 200k pixels (2.3 Gb), or a 20k MPixel image
- This is the same thing as saying that one would need to capture 20,000 images with a 1 MPixel camera to obtain a single slide
- Linear programming on datasets of this size is costly, in terms of time and storage.
Figure labels: slide is 2.5 cm x 7.5 cm overall, with a 2.5 cm x 5 cm tissue area
(1000 x 25) / 0.25 microns = 100,000 linear pixels
(1000 x 50) / 0.25 microns = 200,000 linear pixels
This is a 20 GPixel image, vs. a relatively insignificant 4 MPixel image
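The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the calculation as stated (a 25 mm by 50 mm tissue area sampled at 0.25 microns per pixel); the constant and function names are illustrative and do not come from the presentation.

```python
# Slide geometry and resolution as quoted on the slide.
TISSUE_WIDTH_MM = 25    # 2.5 cm
TISSUE_LENGTH_MM = 50   # 5.0 cm
RESOLUTION_UM = 0.25    # diffraction limit, microns per pixel


def linear_pixels(length_mm, resolution_um=RESOLUTION_UM):
    """Pixels needed to span length_mm at the given resolution."""
    return round(length_mm * 1000 / resolution_um)


width_px = linear_pixels(TISSUE_WIDTH_MM)
length_px = linear_pixels(TISSUE_LENGTH_MM)
total_px = width_px * length_px

print(width_px, "x", length_px, "linear pixels")
print(total_px / 1e9, "GPixel")
print(total_px // 1_000_000, "frames from a 1 MPixel camera")
```

The same function can be reused to see how quickly the total grows if a finer sampling interval is assumed.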
12Compelling Use Cases for Image Query
- Diagnostic decision support
- Longitudinal evaluation
- Differential diagnosis generation
- Detection of rare events
- Teaching
- Discovery
13Current World View of Pathology Imagery
Repositories
- Model 1: Relational Database
- Image metadata associated with case-level data
- Entire schema required to carry out discovery
- Text-based
- Image data is a passive component of the query
- Model 2: Metadata-tagged Images
- Image metadata associated with each image
- Image becomes a self-contained dataset available for discovery
- Text-based
- Image data is a passive component of the query
Entry in master accession table
Associated case and image descriptors
14Highly Desirable World View of Pathology Imagery
Repositories (Future State)
- Model 3: Metadata-tagged surface map
- Image metadata exists at the image level and is spatially coupled to underlying digital imagery
- Discovery can be carried out on the image-space itself, with retrieved metadata classifiers available for generating search result sets (e.g. differential diagnosis generation)
- Image-based
- Model 4: Surface discovery
- Non-metadata-associated digital imagery is spatially probed for statistical convergence with an image-based query set
- Imagery becomes a self-contained dataset available for discovery
- Image-based
15Lop Nor
Vector quantization: a forgotten algorithm.
16Attributes of an ideal search system
- Self-training, domain-independent image segmentation / classification tool.
- Allows for at least two novel image search modalities
- Region of interest: query by example (image-space search, not text-based)
- Retrieve diagnostic information associated with previously classified fields, enabling dynamic generation of a differential diagnosis
- Useful as a bridge for exploration of stochastics of multi-dimensional image space data when queried in tandem with high-dimensionality data set types (genomics, proteomics, etc.)
- i.e. Morphogenomics
- Ability to carry out real-time assessment of regions of interest against Terascale / Petascale image repositories.
17On the prospect of analyzing 1000s of Gigabytes
of data in real-time
18What is an Image Vector?
2 x 2 vector: 256^4 = 4,294,967,296 possible values in a four-dimensional space
Typically, vectors have ordinality of 8 x 8 or greater
1.415461031044954789001553027745e9864
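The size of these vector spaces can be estimated with a short calculation. The sketch below assumes 8 bits (256 levels) per vector component, which is consistent with the 256^4 figure on the slide; the helper names are illustrative. Logarithms are used so that very large counts never have to be printed in full.

```python
import math

GRAY_LEVELS = 256  # assumption: 8 bits per vector component


def state_space_log10(side):
    """log10 of the number of distinct side x side vectors."""
    return side * side * math.log10(GRAY_LEVELS)


def scientific(side):
    """Return (mantissa, exponent) of 256 ** (side * side)."""
    log_value = state_space_log10(side)
    exponent = math.floor(log_value)
    return 10 ** (log_value - exponent), exponent


two_by_two = GRAY_LEVELS ** (2 * 2)
print("2 x 2:", two_by_two)

for side in (8, 64):
    mantissa, exponent = scientific(side)
    print(f"{side} x {side}: about {mantissa:.4f}e{exponent}")
```

Under this assumption the 64 x 64 case comes out near 1.4155e9864, which agrees with the long figure on the slide. That the figure refers to a 64 x 64 kernel is an inference from the arithmetic, not something the slide states.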
19General Approaches to Image Analysis
- Supervised Learning
- Algorithm interacts with expert or another training data source such that features of interest are actively selected and classified during the training stage
- Time consuming
- Potential to converge to a solution with smaller training sets
- Variable robustness of predictive power when convergence is detected
- Unsupervised Learning
- Algorithm parses data autonomously, without user/expert intervention
- Faster / suitable for turnkey automation
- Slower convergence (if ever) on a solution set.
- Need for higher-dimensional systems
- Statistically robust when convergence is identified
20General Approaches to Image Analysis
- Conventional Image Analysis
- Algorithms based upon spatial-, frequency- or phase-space data present in image
- Length-scale hypothesis in effect: structural elements are usually the target
- Often requires manual length-scale and magnitude-scale optimization to enhance detection accuracy
- Some expertise in algorithm operation desirable
- Unstructured Classification
- Classification of vectors in high-dimensional space based upon all-comers hypothesis
- No tuning required
- No expertise required
- Approach leads to a plurality of classifiers for every atomic spatial element, which must then be annealed to a superclass (this can require manual vector sorting).
21Candidate Algorithmic / Heuristic Approaches to
Image Search / Content-based Image Retrieval
(CBIR)
- Principal component analysis (PCA)
- Bayesian Belief Networks
- Support Vector Engines coupled to multi-parametric conventional image analysis
- Dimensional reduction via manifold projection techniques, where high-dimensional distinctions of statistical significance are preserved in the low-dimensional projection.
- Vector Quantization
- Galois Field Manifold Basis operators as an inductive extrapolative technique of probable (but unspecified) adjacency characteristics of low-dimensional candidate manifolds (manifold extrapolation)
- Many others.
- All of the above approaches have strengths and weaknesses; there is currently no one best solution.
22An Issue of Dimensional Reduction
- Problem: With the prospect of a typical 100x100 kernel (a 10,000-dimensional space), computational approaches carried out on raw data sets can take millions of years to complete, even with our fastest current supercomputers (bad for turn-around time).
- Fortunately, there are mathematical operations that can sidestep this computational annoyance.
- Support Vector Engines
- K-means approaches
- Bayesian Networks
- Vector Quantization
- Galois Field Manifold Projection / Tensor Integration
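As one illustration of the listed approaches, a K-means pass replaces a large set of raw vectors with a handful of centroids, so that later queries compare against k centroids rather than every stored vector. The following is a minimal sketch of plain Lloyd's algorithm on toy 4-dimensional vectors; the data, the seeding rule (first k vectors), and all function names are assumptions made for illustration, not the method used in the presentation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def assign(vector, centroids):
    """Index of the centroid closest to vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def k_means(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; the first k vectors seed the centroids."""
    centroids = [tuple(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[assign(v, centroids)].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean_vector(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Toy data: three dark and three bright 2 x 2 patches, flattened.
vectors = [
    (10, 12, 11, 9), (200, 210, 205, 198), (12, 10, 9, 11),
    (205, 199, 202, 207), (11, 11, 10, 10), (198, 204, 207, 201),
]
centroids = k_means(vectors, k=2)
labels = [assign(v, centroids) for v in vectors]
print(centroids)
print(labels)
```

In this toy case six vectors collapse to two centroids; the same idea is what makes the other listed techniques attractive for kernels with thousands of dimensions.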
23Pythagorean Theorem
On all PCs and high-end workstations (and most Macs), 9 + 16 does indeed result in 25
[Figure: right triangle with sides labeled a and b, and squares of 3 x 3, 4 x 4 and 5 x 5 on its sides]
24Vector Quantization
Original Image
Division of image into local domains
Extraction of Local Domain Composite Vectors
?
VKSLx0y0Order , LxnymOrder
Vectorization of each local kernel
Individual assessment of each vector dimension
25Vector Quantization
VKSLx0y0Order , LxnymOrder
Established Vocabulary
Query Against library (Vocabulary) of established
Galois Vectors
Novel Vector
Previously Identified Vector
Assignment of a unique serial number and
inclusion into global vocabulary
Assembly of compressed dataset
38857448643
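The vocabulary-building step described on these two slides can be sketched as follows. This is a simplified illustration that assumes exact matching of block vectors and a single-channel image whose dimensions are multiples of the block size; a practical system would presumably match approximately, and the names `blocks`, `encode` and `vocabulary` are hypothetical.

```python
def blocks(image, size):
    """Yield (block_row, block_col, vector) for each size x size block."""
    for r in range(0, len(image), size):
        for c in range(0, len(image[0]), size):
            vector = tuple(image[r + dr][c + dc]
                           for dr in range(size) for dc in range(size))
            yield r // size, c // size, vector


def encode(image, size, vocabulary):
    """Replace each block by a serial number, adding novel vectors
    to the shared vocabulary as they are encountered."""
    rows, cols = len(image) // size, len(image[0]) // size
    codes = [[None] * cols for _ in range(rows)]
    for br, bc, vector in blocks(image, size):
        if vector not in vocabulary:             # novel vector
            vocabulary[vector] = len(vocabulary)  # next serial number
        codes[br][bc] = vocabulary[vector]        # previously identified
    return codes


image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [9, 9, 0, 0],
    [9, 9, 0, 0],
]
vocabulary = {}
codes = encode(image, 2, vocabulary)
print(codes)
print(len(vocabulary), "vocabulary entries")
```

Passing the same `vocabulary` dictionary to further calls of `encode` lets it grow across images, which is the sense in which the vocabulary is global.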
26VQ-Based Image Compression: a fantastic opportunity for automated search
Raw Data
Restored Data
Compressed data (preserved spatial organization
of original data)
Depending on the selected compression ratio, restored lossy-compression imagery may or may not be of diagnostic quality.
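A small round trip makes the loss concrete. The sketch below assumes a fixed two-entry codebook and nearest-codeword matching on 2 x 2 blocks of a single-channel image; the codebook, the sample data and the function names are all illustrative. Note that the grid of codes keeps the spatial layout of the original blocks.

```python
def nearest_code(vector, codebook):
    """Index of the codeword with the smallest squared distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(vector, codebook[i])))


def compress(image, size, codebook):
    """Map every size x size block to the index of its nearest codeword."""
    codes = []
    for r in range(0, len(image), size):
        row_codes = []
        for c in range(0, len(image[0]), size):
            vector = [image[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            row_codes.append(nearest_code(vector, codebook))
        codes.append(row_codes)
    return codes


def restore(codes, size, codebook):
    """Rebuild an image by pasting each codeword back into its block."""
    image = [[0] * (len(codes[0]) * size) for _ in range(len(codes) * size)]
    for br, row_codes in enumerate(codes):
        for bc, code in enumerate(row_codes):
            for i, value in enumerate(codebook[code]):
                image[br * size + i // size][bc * size + i % size] = value
    return image


def mean_abs_error(a, b):
    """Mean absolute per-pixel difference between two images."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


codebook = [(0, 0, 0, 0), (255, 255, 255, 255)]  # hypothetical vocabulary
raw = [
    [3, 0, 250, 255],
    [1, 2, 251, 254],
]
codes = compress(raw, 2, codebook)
restored = restore(codes, 2, codebook)
print(codes)
print(restored)
print(mean_abs_error(raw, restored))
```

A larger codebook lowers the restoration error at the cost of a lower compression ratio, which is the trade-off the slide alludes to.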
27(No Transcript)
28Galois Field Theory
29A Typical Dimensional Reduction Galois Field
Question
- What is the mean densitometrically-weighted
distance of a single test vector to a statistical
manifold of established centroids (thus
establishing similarity or difference)?
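One way to read this question is as a weighted mean of distances from the test vector to each centroid. The sketch below takes that reading and should be treated as an assumption: the slide does not define "densitometrically-weighted", so the weights here are hypothetical (for example, the number of training vectors behind each centroid), and plain Euclidean distance is used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_mean_distance(test_vector, centroids, weights):
    """Weighted mean of the distances from test_vector to each centroid."""
    total_weight = sum(weights)
    weighted_sum = sum(w * euclidean(test_vector, c)
                       for c, w in zip(centroids, weights))
    return weighted_sum / total_weight


centroids = [(0.0, 0.0), (3.0, 4.0)]
weights = [1.0, 3.0]  # hypothetical density weights, one per centroid
score = weighted_mean_distance((0.0, 0.0), centroids, weights)
print(score)
```

A small score would suggest the test vector lies close to the densely populated part of the established vocabulary; a large one suggests dissimilarity.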
30(No Transcript)
31What are the boundary conditions?
32General Form
What is the integral of the Galois Field?
?
33Which, after integration by parts, yields
34Initial n by n sub-region of image, indexed by location:
1,1 1,2 .. 1,n
2,1 2,2 .. 2,n
. . .
n,1 n,2 .. n,n
For every location: each location is an RGB triplet; hence, each vector component is itself a triplet sub-vector.
Resultant Input Vector Kernel of n x n x 3 dimensionality: 1,1 1,2 2,1 .. n,n
Canonical V.Q. Tensor
Galois Field Transform
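The construction of the input vector from an image sub-region can be sketched in a few lines. This example assumes row-major ordering of locations, which may differ from the ordering drawn on the slide, and the function name is illustrative.

```python
def region_to_vector(region):
    """Flatten an n x n grid of (R, G, B) triplets into a single
    vector of n * n * 3 components, in row-major order."""
    n = len(region)
    if any(len(row) != n for row in region):
        raise ValueError("region must be square")
    return [channel for row in region for pixel in row for channel in pixel]


region = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 20, 30)],
]
vector = region_to_vector(region)
print(len(vector), vector)
```

For a 2 x 2 region this gives 12 components; an 8 x 8 region would give 192.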
35Typical Galois Field mapped to the even
Jacobian/Chebyshev tensor polynomials manifested
on the edge of the complexity transition
- On Galois Fields
- Not merely a clustering algorithm
- The resulting field is a non-linear N-space manifold selected for its distinctiveness from all other modular functions in the Galois set space
- Fields may have local minima and local extrema
- Any Galois manifold is exclusive of any other Galois set
- Non-trivial to calculate; trivial to query
36Local Islands in Galois Field Space of
statistical convergence and near-convergence to
high-probability feature matches using support
vector analysis
37Convergence with increasing Vocabulary Size
38Regions of a typical Galois manifold with no
correlation to established vocabulary tensors are
easily recognized as exhibiting chaotic behavior
and are therefore excluded.
39How does this approach differ from traditional
N-space cluster analysis?
- Conventional
- Algorithms are custom designed for a narrow recognition task
- Often requires customization with expert programming
- Low tolerance to variability in source format
- VQ-Galois
- General matching algorithm agnostic to input data format
- No end-user customization required
- Designed to improve with increased data pool size (self-training)
40(No Transcript)
41(No Transcript)
42Some Demonstrations
43Summary
- Increasing availability of whole slide digital data creates at least the possibility to carry out CBIR for basic clinical tasks
- Similar case retrieval
- Differential diagnosis generation
- Grading / staging decision support
- Rare event identification
- Much effort is still required to increase the speed and accuracy of the current generation of both supervised and unsupervised approaches; for the time being, these algorithms should be viewed as investigational use only, unless otherwise stated.
- Initial reports in this field suggest that the computational challenge can be solved.
- Pilot toolsets will be available for investigative use via the internet, within the year if not sooner.