Fingerprint Clustering Comparative Study

1
Fingerprint Clustering Comparative Study


606 Project Presentation by
Dean Cheng

2
Outline

A Brief Background Introduction
Some Definitions
Clustering Algorithms
Results
Conclusion and Future directions

3
Oligonucleotide Fingerprinting (1)

DNA clone: long active linear sequence of nucleotides (unknown)
Probe: short linear sequence of 6-8 nucleotides (known)

Hybridizations: [figure: base-pairing diagrams of a probe against a clone, showing total, half, and no hybridization]

4
Oligonucleotide Fingerprinting (2)

Fluorescently labeled probes: higher intensity means better hybridization.

5
Oligonucleotide Fingerprinting (3)

Fingerprint vector: a vector of the intensity values of all probes for a clone.
Example: c1 has a fingerprint of 1 0 1 0 0.

6
Oligonucleotide Fingerprinting (4)

Applications
DNA sequencing
Gene expression
DNA clone classification
Clustering
Similarity/Distance functions
Algorithms
Focus on binary (ternary) domain

7
Project Specifications

Three algorithms and two fingerprint representations:
UPGMA-0-1, FREQ-0-1, UPGMA-0-1-N, GCP-0-1-N
0-1-N: two thresholds
Intensity above the positive control: 1
Intensity below the negative control: 0
Anything in between: N (unknown)
Distance: Hamming distance. For 0-1-N fingerprints, positions with N are ignored.
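
The 0-1-N encoding and the N-ignoring Hamming distance described above can be sketched as follows. This is a minimal illustration, not the project's code: the function names and the sample intensities are hypothetical, and treating values exactly on a threshold as N is an assumption.

```python
def discretize(intensities, neg_threshold, pos_threshold):
    """Map raw probe intensities to a 0-1-N string using two thresholds.

    Assumption: values exactly equal to a threshold count as N.
    """
    symbols = []
    for value in intensities:
        if value > pos_threshold:
            symbols.append("1")
        elif value < neg_threshold:
            symbols.append("0")
        else:
            symbols.append("N")
    return "".join(symbols)


def hamming_ignore_n(a, b):
    """Hamming distance that skips every position where either vector has N."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(1 for x, y in zip(a, b) if x != "N" and y != "N" and x != y)


# Hypothetical intensities, using the Bacteria 1 thresholds 0.35 and 0.50.
fingerprint = discretize([0.6, 0.2, 0.4, 0.5], 0.35, 0.50)
distance = hamming_ignore_n("10N1", "1101")
```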

8
Binary Clustering with Missing Values (BCMV)

Two 0-1-N vectors are compatible if they do not differ at any position, or if they differ only at positions where one or both of them have N.
A 0-1-N vector is resolved once every N has been assigned a value (0 or 1).
The challenge is to cluster only mutually compatible 0-1-N fingerprint vectors, so that all fingerprint vectors within a cluster can be resolved in the same way.
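
A small sketch of the two definitions above: a compatibility test, and a consensus that resolves a mutually compatible cluster. The function names are hypothetical, and leaving a position as N when every member has N there is an assumption of this sketch.

```python
def compatible(a, b):
    """True if a and b agree wherever neither of them has N."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return all(x == y or x == "N" or y == "N" for x, y in zip(a, b))


def resolve(cluster):
    """Merge mutually compatible 0-1-N vectors into one consensus vector.

    A position stays N only if every vector in the cluster has N there.
    """
    consensus = []
    for i in range(len(cluster[0])):
        known = {v[i] for v in cluster if v[i] != "N"}
        if len(known) > 1:
            raise ValueError("cluster is not mutually compatible")
        consensus.append(known.pop() if known else "N")
    return "".join(consensus)


cluster = ["111001NN", "11100N1N", "11100N11", "1110N11N"]
consensus = resolve(cluster)
```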

9
Graphs and 0-1

For 0-1 fingerprint vectors, we can try to find a resolve vector that is within Hamming distance 1 of every vector in the set.
For example, the set 100, 010, 001 has the resolve vector 000.
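
One way to check the example is a brute-force search: any vector within Hamming distance 1 of every member must be a member itself or a single-bit flip of one, so those are the only candidates. The sketch below assumes that reading of "resolve vector"; the function names are hypothetical.

```python
def hamming(a, b):
    """Plain Hamming distance between two equal-length binary strings."""
    return sum(x != y for x, y in zip(a, b))


def find_resolve_vectors(vectors):
    """Return all binary vectors within Hamming distance 1 of every input."""
    candidates = set()
    for v in vectors:
        candidates.add(v)
        for i in range(len(v)):
            flipped = "1" if v[i] == "0" else "0"
            candidates.add(v[:i] + flipped + v[i + 1:])
    return sorted(
        c for c in candidates if all(hamming(c, v) <= 1 for v in vectors)
    )


resolve_vectors = find_resolve_vectors(["100", "010", "001"])
```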

10
GCP (Greedy Clique Partition)

The algorithm first finds unique maximal cliques
and removes them from the graph.
Then a greedy action is used to find and remove
maximum cliques from the graph until all vertices
in the graph belong to some clique.
Implementation from Zhipeng Cai.
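
For illustration only, here is a simplified greedy clique partition over the compatibility graph. It is not Zhipeng Cai's implementation: it skips the first phase (unique maximal cliques), and it grows each clique with a degree-based heuristic, so the cliques it finds are maximal but not guaranteed to be maximum. All names are hypothetical.

```python
def compatible(a, b):
    """True if a and b agree wherever neither of them has N."""
    return all(x == y or x == "N" or y == "N" for x, y in zip(a, b))


def greedy_clique_partition(fingerprints):
    """Partition fingerprint indices into cliques of mutually compatible vectors."""
    n = len(fingerprints)
    adj = {
        i: {j for j in range(n)
            if j != i and compatible(fingerprints[i], fingerprints[j])}
        for i in range(n)
    }
    remaining = set(range(n))
    cliques = []
    while remaining:
        # Seed with the remaining vertex of highest remaining degree
        # (ties broken by lowest index).
        seed = max(sorted(remaining), key=lambda v: len(adj[v] & remaining))
        clique = {seed}
        candidates = adj[seed] & remaining
        while candidates:
            # Add the candidate adjacent to the most other candidates.
            nxt = max(sorted(candidates), key=lambda v: len(adj[v] & candidates))
            clique.add(nxt)
            candidates &= adj[nxt]  # keep only vertices adjacent to all members
        cliques.append(sorted(clique))
        remaining -= clique
    return cliques


fps = ["111001NN", "11100N1N", "11100N11", "1110N11N", "1110N10N"]
cliques = greedy_clique_partition(fps)
```

Because every vertex ends up in exactly one clique, each cluster contains only mutually compatible fingerprints, which is the property the 0-1-N clustering needs.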

11
UPGMA

Given a distance matrix, join the two nodes most similar to each other and update the distance matrix with the average distance to the two joined nodes. Repeat until one node remains. The output is a tree.
PHYLIP's neighbor-joining program (neighbor) also provides UPGMA.
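
The merge-and-average loop can be sketched in a few lines. This is a minimal pure-Python illustration, not PHYLIP's code; it weights the average by cluster size (the usual UPGMA update) and returns the tree as nested tuples without branch lengths. The labels and distances are hypothetical.

```python
def upgma(dist, labels):
    """Build a UPGMA tree (nested tuples) from a symmetric distance matrix."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Closest pair of clusters (ties broken by cluster ids).
        a, b = min(d, key=lambda pair: (d[pair], pair))
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average distance to the merged cluster.
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hamming distances between the hypothetical fingerprints 111, 110, 000.
tree = upgma([[0, 1, 3], [1, 0, 2], [3, 2, 0]], ["c1", "c2", "c3"])
```

To obtain flat clusters from the tree, one would still need to cut it at some distance; that step is not shown here.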

12
FREQ (1)

Frequent itemset mining: find itemsets that have a support above a minimum support.

Support of {1, 2, 3} is 40%. Support of {1, 2} is 60%.
Apriori: if an itemset is frequent, then all of its subsets must also be frequent.
E.g. {1}: 80%, {2}: 60%, {3}: 40%, {1, 2}: 60%, {2, 3}: 40%
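
A compact Apriori sketch makes the support numbers concrete. The slide's transaction table is not in the transcript, so the five transactions below are hypothetical, chosen only so that they reproduce the supports quoted above; the function names are also assumptions.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= frozenset(t) for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Return {frequent itemset: support} for all itemsets with support >= min_support."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        # Join frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level_set = set(level)
        # Apriori pruning: every k-subset of a candidate must be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in level_set for sub in combinations(c, k))
                 and support(c, transactions) >= min_support]
        k += 1
    return frequent


# Hypothetical transactions matching the supports on the slide.
transactions = [{1, 2, 3}, {1, 2, 3}, {1, 2}, {1}, {4}]
frequent = apriori(transactions, min_support=0.4)
```
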
13
FREQ (2)

Transform fingerprints into transactions: each transaction contains a fingerprint itself and the fingerprints compatible with it.
E.g.
1. 111001NN   T1 = {1, 2, 3, 4, 5}
2. 11100N1N   T2 = {1, 2, 3, 4}
3. 11100N11   T3 = {1, 2, 3, 4}
4. 1110N11N   T4 = {1, 2, 3, 4}
5. 1110N10N   T5 = {1, 5}
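
The transformation can be reproduced with a short sketch; the function names are hypothetical, and fingerprints are numbered from 1 as in the example.

```python
def compatible(a, b):
    """True if a and b agree wherever neither of them has N."""
    return all(x == y or x == "N" or y == "N" for x, y in zip(a, b))


def build_transactions(fingerprints):
    """One transaction per fingerprint: itself plus all compatible fingerprints."""
    return [
        {j + 1 for j, other in enumerate(fingerprints) if compatible(fp, other)}
        for fp in fingerprints
    ]


fingerprints = ["111001NN", "11100N1N", "11100N11", "1110N11N", "1110N10N"]
transactions = build_transactions(fingerprints)
```

Running this on the five fingerprints gives the transactions T1 to T5 listed above.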

14
FREQ (3)

The same can be done for 0-1 vectors by using a Hamming distance of 2 as the compatibility criterion (e.g. 010, 100, 001 have the resolve vector 000).
Experiment: preprocess the fingerprints first and use only distinct fingerprints, dropping identical copies (Hamming distance 0). This is an implementation issue.
E.g. 101, 101, 111, 111, 100, 010, 001.
Take the longest frequent itemset containing un-clustered fingerprints as a cluster.
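
The cluster-extraction step can be sketched as below. It assumes one particular reading of the rule: an itemset is eligible only if all of its fingerprints are still un-clustered, and ties in length are broken arbitrarily but deterministically. The function name and the sample itemsets are hypothetical.

```python
def extract_clusters(frequent_itemsets):
    """Repeatedly take the longest itemset made up only of un-clustered fingerprints."""
    clustered = set()
    clusters = []
    # Longest first; sorting the members makes tie-breaking deterministic.
    ordered = sorted(frequent_itemsets, key=lambda s: (-len(s), sorted(s)))
    for itemset in ordered:
        if itemset and clustered.isdisjoint(itemset):
            clusters.append(set(itemset))
            clustered |= set(itemset)
    return clusters


# Hypothetical frequent itemsets over fingerprint ids 1..5.
itemsets = [{1, 2, 3, 4}, {1, 2, 3}, {1, 5}, {5}, {1, 2}]
clusters = extract_clusters(itemsets)
```

Fingerprints that appear in no selected itemset would remain singletons under this reading.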

15
Validation methods

Number of clusters with size > 2. Number of singletons.
Average homogeneity and average separation.
Euclidean distance and Hamming distance are used.
Minkowski measure and Jaccard's coefficient, using GCP's result as the standard.
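
Both pair-counting measures compare a clustering against a reference (here, GCP's result). The sketch below uses the common definitions, which are assumed rather than taken from the slides: with n11 the pairs together in both clusterings, n10 the pairs together only in the reference, and n01 the pairs together only in the solution, Jaccard = n11 / (n11 + n10 + n01) and Minkowski = sqrt((n10 + n01) / (n11 + n10)). The label lists are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_counts(reference, solution):
    """Count item pairs by co-membership in the reference and the solution."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_sol = solution[i] == solution[j]
        if same_ref and same_sol:
            n11 += 1
        elif same_ref:
            n10 += 1
        elif same_sol:
            n01 += 1
    return n11, n10, n01


def jaccard(reference, solution):
    """Jaccard coefficient: higher means closer to the reference (max 1)."""
    n11, n10, n01 = pair_counts(reference, solution)
    return n11 / (n11 + n10 + n01)


def minkowski(reference, solution):
    """Minkowski measure: lower means closer to the reference (min 0)."""
    n11, n10, n01 = pair_counts(reference, solution)
    return sqrt((n10 + n01) / (n11 + n10))


# Cluster labels per fingerprint (hypothetical).
reference = [0, 0, 0, 1, 1]
solution = [0, 0, 1, 1, 1]
j = jaccard(reference, solution)
m = minkowski(reference, solution)
```

Both functions divide by a pair count, so they are undefined when the reference contains only singletons; a real evaluation would need to handle that case.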

16
Datasets and thresholds

Bacteria dataset: 1491 clones and 27 probes.
Fungi dataset: 1507 clones and 26 probes.
Thresholds:
Bacteria 1: 0.35-0.50 and 0.425
Bacteria 2: 0.35-0.55 and 0.45
Fungi: 425-775 and 600

17
Results
18
Discussion (1)

GCP-0-1-N performs the best and is guaranteed to find mutually compatible clusters.
UPGMA-0-1-N also returns good results, but with fewer larger-size clusters.
Both GCP-0-1-N and UPGMA-0-1-N outperform UPGMA-0-1 and FREQ-0-1. UPGMA-0-1-N is superior to UPGMA-0-1. 0-1-N is a better representation.
UPGMA finds more singletons and fewer larger-size clusters.

19
Discussion (2)

FREQ-0-1 is better than UPGMA-0-1, but its clustering quality is unstable. FREQ-0-1 is comparable to GCP-0-1-N and UPGMA-0-1-N on Bacteria 1.
Constraints must be added to FREQ. Incrementally decreasing the support can help efficiency and quality.
Minkowski measure and Jaccard's coefficient indicate that the clustering solutions are different.
UPGMA-0-1-N > FREQ-0-1 > UPGMA-0-1.

20
Conclusion and Future Directions

Used cluster size, homogeneity, separation, Minkowski measure and Jaccard's coefficient to evaluate cluster quality.
0-1-N is a better representation. Use UPGMA-0-1-N instead of UPGMA-0-1.
GCP performs the best.
FREQ works in both 0-1 and 0-1-N but needs more testing. Adding constraints is key. Other frequent itemset mining algorithms, such as Max-Miner, can be explored. Is assigning a fingerprint to multiple clusters a benefit?