Title: Fingerprint Clustering Comparative Study
1Fingerprint Clustering Comparative Study
- 606 Project Presentation by
- Dean Cheng
- A Brief Background Introduction
- Some Definitions
- Clustering Algorithms
- Results
- Conclusion and Future directions
3Oligonucleotide Fingerprinting (1)
- DNA clone: a long, active linear sequence of nucleotides (unknown).
- Probe: a short linear sequence of 6-8 nucleotides.
- Hybridizations: [figure: base-pairing examples (A-T, G-C) illustrating total, half, and no hybridization]
4Oligonucleotide Fingerprinting (2)
- Fluorescently labeled probes: higher intensity means better hybridization.
5Oligonucleotide Fingerprinting (3)
- Fingerprint vector: a vector of the intensity values of all probes for a clone.
- Example: c1 has a fingerprint of 1 0 1 0 0.
6Oligonucleotide Fingerprinting (4)
- Applications
- DNA sequencing
- Gene expression
- DNA clone classification
- Clustering
- Similarity/Distance functions
- Algorithms
- Focus on binary (ternary) domain
7Project Specifications
- Three algorithms and two fingerprint representations.
- UPGMA-0-1, FREQ-0-1, UPGMA-0-1-N, GCP-0-1-N
- 0-1-N: 2 thresholds
- Intensity above positive control: 1
- Intensity below negative control: 0
- Anything in between: N (unknown)
- Hamming distance. For 0-1-N fingerprint, ignore
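The two-threshold encoding and the distance function might look like the sketch below. It rests on assumptions: the function names are hypothetical, the thresholds 0.35 and 0.50 are borrowed from the Bacteria 1 range on the datasets slide purely as example values, and "ignore" is read as skipping every position where either fingerprint has N.

```python
def discretize(intensity, low, high):
    """Map one probe intensity to '1', '0', or 'N' using two thresholds."""
    if intensity > high:   # above the positive-control threshold
        return "1"
    if intensity < low:    # below the negative-control threshold
        return "0"
    return "N"             # anything in between is unknown


def to_fingerprint(intensities, low, high):
    """Build a 0-1-N fingerprint string from a list of probe intensities."""
    return "".join(discretize(x, low, high) for x in intensities)


def hamming(a, b):
    """Count differing positions, skipping positions where either side is N.

    Assumption: this is what "ignore" means for 0-1-N fingerprints.
    For plain 0-1 fingerprints it is the ordinary Hamming distance.
    """
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y and "N" not in (x, y))


fingerprint = to_fingerprint([0.60, 0.20, 0.40, 0.50, 0.35], low=0.35, high=0.50)
print(fingerprint)                 # 10NNN
print(hamming("10N1", "11N0"))     # 2
```

Values that fall exactly on a threshold are treated as N here; that boundary choice is also an assumption.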
8Binary Clustering with Missing Values (BCMV)
- Two 0-1-N vectors are compatible if they do not differ at any position, or if they differ only at positions where one or both of them has N.
- A 0-1-N vector is resolved once the value of every N is determined (0 or 1).
- The challenge is to cluster only mutually compatible 0-1-N fingerprint vectors, so that the fingerprint vectors within a cluster can be resolved in the same way.
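The compatibility test and the resolution of a cluster can be sketched as follows. The function names are hypothetical, and leaving a position as N when no member of the cluster determines it is an assumption of this sketch.

```python
def compatible(a, b):
    """True if a and b differ only at positions where at least one has N."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))


def resolve(cluster):
    """Merge mutually compatible vectors into a single vector.

    Each position takes the one known value (0 or 1) found in the cluster.
    A position stays N if every member has N there (assumption).
    """
    resolved = []
    for column in zip(*cluster):
        known = {c for c in column if c != "N"}
        if len(known) > 1:
            raise ValueError("cluster is not mutually compatible")
        resolved.append(known.pop() if known else "N")
    return "".join(resolved)


print(compatible("1N0", "10N"))   # True
print(resolve(["1N0", "10N"]))    # 100

# Compatibility is not transitive, which is why clusters must be
# mutually compatible rather than merely chained together.
print(compatible("0N", "NN"), compatible("NN", "1N"), compatible("0N", "1N"))
# True True False
```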
9Graphs and 0-1
- For 0-1 fingerprint vectors, we can try to find a resolve vector within Hamming distance 1 of each vector.
- For example, the set 100, 010, 001 has a resolve vector of 000.
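A brute-force search for such a resolve vector could look like this. It is only a sketch with hypothetical function names. It relies on the observation that any vector within distance 1 of the first input is either that input or a one-bit flip of it, so only those candidates need checking.

```python
def hamming01(a, b):
    """Ordinary Hamming distance between two 0-1 strings."""
    return sum(x != y for x, y in zip(a, b))


def find_resolve_vector(vectors):
    """Return a 0-1 vector within Hamming distance 1 of every input, or None.

    If several such vectors exist, the first candidate found is returned.
    """
    first = vectors[0]
    candidates = [first]
    for i, bit in enumerate(first):
        flipped = "1" if bit == "0" else "0"
        candidates.append(first[:i] + flipped + first[i + 1:])
    for candidate in candidates:
        if all(hamming01(candidate, v) <= 1 for v in vectors):
            return candidate
    return None


print(find_resolve_vector(["100", "010", "001"]))   # 000
print(find_resolve_vector(["111", "000"]))          # None
```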
10GCP (Greedy Clique Partition)
- The algorithm first finds unique maximal cliques and removes them from the graph.
- Then a greedy step finds and removes maximum cliques from the graph until every vertex belongs to some clique.
- Implementation from Zhipeng Cai.
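To make the clique-partition idea concrete, here is a simplified greedy sketch. It is not Zhipeng Cai's implementation: it skips the unique-maximal-clique step, and its greedy growth yields maximal cliques that are not guaranteed to be maximum. Vertices are fingerprints and edges join compatible pairs; the function names are hypothetical.

```python
from itertools import combinations


def compatible(a, b):
    """True if a and b differ only at positions where at least one has N."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))


def greedy_clique_partition(fingerprints):
    """Partition fingerprint indices into cliques of the compatibility graph."""
    n = len(fingerprints)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if compatible(fingerprints[i], fingerprints[j]):
            adj[i].add(j)
            adj[j].add(i)

    remaining = set(range(n))
    clusters = []
    while remaining:
        # Seed with the vertex that has the most remaining neighbours.
        seed = max(sorted(remaining), key=lambda v: len(adj[v] & remaining))
        clique = [seed]
        candidates = adj[seed] & remaining
        while candidates:
            # Add the candidate that keeps the most other candidates alive.
            nxt = max(sorted(candidates), key=lambda v: len(adj[v] & candidates))
            clique.append(nxt)
            candidates &= adj[nxt]
        clusters.append(sorted(clique))
        remaining -= set(clique)
    return clusters


fps = ["111001NN", "11100N1N", "11100N11", "1110N11N", "1110N10N"]
print(greedy_clique_partition(fps))   # [[0, 1, 2, 3], [4]]
```

Because every cluster is a clique, its members are mutually compatible by construction, which is the guarantee the 0-1-N setting needs.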
11UPGMA
- Given a distance matrix, join the two nodes most similar to each other and update the distance matrix with the average distance to the two joined nodes. The output is a tree.
- PHYLIP's neighbor-joining program also provides UPGMA.
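The join-and-average procedure (UPGMA) can be sketched in plain Python as below. This is an illustration, not the PHYLIP implementation: the function name and the nested-tuple tree format are assumptions, branch lengths are omitted, and averages are weighted by cluster size as in standard UPGMA.

```python
def upgma(dist, labels):
    """Build a UPGMA tree from a symmetric distance matrix (list of lists).

    Returns the tree as nested tuples of labels.
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}   # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the new cluster to each other cluster is the
        # size-weighted average of the two old distances.
        new = {}
        for c in clusters:
            d_ac = d[tuple(sorted((a, c)))]
            d_bc = d[tuple(sorted((b, c)))]
            new[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(dist, ["A", "B", "C", "D"]))   # ('D', ('C', ('A', 'B')))
```

Clusters are obtained from such a tree by cutting it at a chosen height; that step is not shown here.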
12FREQ (1)
- Frequent itemset mining: find itemsets whose support is above a minimum support.
- Support of {1, 2, 3} is 40%. Support of {1, 2} is 60%.
- Apriori: if an itemset is frequent, then all of its subsets must be frequent.
- E.g., {1}: 80%, {2}: 60%, {3}: 40%, {1, 2}: 60%, {2, 3}: 40%
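Support counting and the Apriori pruning rule can be sketched as follows. The five transactions are hypothetical, chosen only so that the supports match the figures quoted on this slide; the function names are also assumptions. The sketch treats "frequent" as support at or above the minimum, which is a boundary choice.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Return {frozenset: support} for all frequent itemsets, level by level."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset, transactions)
        size = len(level[0]) + 1
        previous = set(level)
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size}
        # Apriori pruning: keep a candidate only if all of its
        # (size - 1)-subsets were frequent at the previous level.
        level = [c for c in candidates
                 if all(frozenset(s) in previous
                        for s in combinations(c, size - 1))
                 and support(c, transactions) >= min_support]
    return frequent


# Hypothetical transactions that reproduce the supports on the slide.
transactions = [{1, 2, 3}, {1, 2, 3}, {1, 2}, {1}, {4}]
print(support({1, 2, 3}, transactions))   # 0.4
print(support({1, 2}, transactions))      # 0.6
print(len(apriori(transactions, 0.4)))    # 7
```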
13FREQ (2)
- Transform fingerprints into transactions: each transaction contains a fingerprint itself and its compatible fingerprints.
- E.g.:
- 1. 111001NN  T1 = {1, 2, 3, 4, 5}
- 2. 11100N1N  T2 = {1, 2, 3, 4}
- 3. 11100N11  T3 = {1, 2, 3, 4}
- 4. 1110N11N  T4 = {1, 2, 3, 4}
- 5. 1110N10N  T5 = {1, 5}
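The transformation above can be reproduced with a few lines of code. This is a sketch with hypothetical function names; fingerprints are numbered from 1 to match the slide.

```python
def compatible(a, b):
    """True if a and b differ only at positions where at least one has N."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))


def to_transactions(fingerprints):
    """Transaction i holds the 1-based ids of fingerprint i and of every
    fingerprint compatible with it (a fingerprint is compatible with itself)."""
    return [
        {j + 1 for j, other in enumerate(fingerprints) if compatible(fp, other)}
        for fp in fingerprints
    ]


fingerprints = ["111001NN", "11100N1N", "11100N11", "1110N11N", "1110N10N"]
for i, t in enumerate(to_transactions(fingerprints), start=1):
    print(f"T{i} = {sorted(t)}")
# T1 = [1, 2, 3, 4, 5]
# T2 = [1, 2, 3, 4]
# T3 = [1, 2, 3, 4]
# T4 = [1, 2, 3, 4]
# T5 = [1, 5]
```

In this example the itemset {1, 2, 3, 4} occurs in four of the five transactions, so it is the kind of long frequent itemset that FREQ would report as a cluster candidate.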
14FREQ (3)
- The same can be done for 0-1 vectors using a Hamming distance of 2 (e.g., 010, 100, 001 have a resolve vector of 000).
- Experiment: preprocess the fingerprints first, and use only fingerprints that are not identical (Hamming distance 0) to another fingerprint. This is an implementation issue. E.g., 101, 101, 111, 111, 100, 010, 001.
- Take the longest frequent itemset containing un-clustered fingerprints as a cluster.
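The preprocessing and the 0-1 variant of the transaction step might be sketched as follows. Two readings are assumed here and are not confirmed by the slides: duplicates are collapsed to a single representative, and two 0-1 fingerprints share a transaction when their Hamming distance is at most 2. The function names are hypothetical.

```python
def hamming01(a, b):
    """Ordinary Hamming distance between two 0-1 strings."""
    return sum(x != y for x, y in zip(a, b))


def dedupe(fingerprints):
    """Keep the first copy of each fingerprint, preserving input order."""
    return list(dict.fromkeys(fingerprints))


def to_transactions_01(fingerprints, max_dist=2):
    """Transaction i holds the 1-based ids of all fingerprints within
    max_dist of fingerprint i (including fingerprint i itself)."""
    return [
        {j + 1 for j, other in enumerate(fingerprints)
         if hamming01(fp, other) <= max_dist}
        for fp in fingerprints
    ]


unique = dedupe(["101", "101", "111", "111", "100", "010", "001"])
print(unique)   # ['101', '111', '100', '010', '001']
for i, t in enumerate(to_transactions_01(unique), start=1):
    print(f"T{i} = {sorted(t)}")
# T1 = [1, 2, 3, 5]
# T2 = [1, 2, 3, 4, 5]
# T3 = [1, 2, 3, 4, 5]
# T4 = [2, 3, 4, 5]
# T5 = [1, 2, 3, 4, 5]
```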
15Validation methods
- Number of clusters with size > 2. Number of singletons.
- Average homogeneity and average separation. Euclidean distance and Hamming distance are used.
- Minkowski measure and Jaccard's coefficient, using GCP's result as the standard.
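The two pair-counting measures can be computed as in the sketch below, using their usual definitions. With n11 the number of item pairs placed together in both clusterings, n10 the pairs together only in the reference, and n01 the pairs together only in the evaluated clustering, Jaccard is n11 / (n11 + n10 + n01) and Minkowski is sqrt((n10 + n01) / (n11 + n10)). The function names and the dict-based input format are assumptions.

```python
from itertools import combinations
from math import sqrt


def pair_counts(reference, candidate):
    """Count item pairs by co-membership.

    reference, candidate: dicts mapping each item to its cluster label.
    Returns (n11, n10, n01).
    """
    n11 = n10 = n01 = 0
    for a, b in combinations(sorted(reference), 2):
        same_ref = reference[a] == reference[b]
        same_cand = candidate[a] == candidate[b]
        if same_ref and same_cand:
            n11 += 1
        elif same_ref:
            n10 += 1
        elif same_cand:
            n01 += 1
    return n11, n10, n01


def jaccard(reference, candidate):
    """Higher is better; 1.0 means identical clusterings."""
    n11, n10, n01 = pair_counts(reference, candidate)
    return n11 / (n11 + n10 + n01)


def minkowski(reference, candidate):
    """Lower is better; 0.0 means identical clusterings."""
    n11, n10, n01 = pair_counts(reference, candidate)
    return sqrt((n10 + n01) / (n11 + n10))


reference = {"a": 0, "b": 0, "c": 0, "d": 1}   # e.g. the GCP result
candidate = {"a": 0, "b": 0, "c": 1, "d": 1}   # clustering being evaluated
print(jaccard(reference, candidate))     # 0.25
print(minkowski(reference, candidate))   # 1.0
```

Both functions divide by zero when the reference contains no co-clustered pairs; a real evaluation would need to handle that case.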
16Datasets and thresholds
- Bacteria dataset: 1491 clones and 27 probes. Fungi dataset: 1507 clones and 26 probes.
- Thresholds
- Bacteria 1: 0.35-0.50 and 0.425
- Bacteria 2: 0.35-0.55 and 0.45
- Fungi: 425-775 and 600
18Discussion (1)
- GCP-0-1-N performs the best and is guaranteed to find mutually compatible clusters.
- UPGMA-0-1-N also returns good results, but with fewer larger-size clusters.
- Both GCP-0-1-N and UPGMA-0-1-N outperform UPGMA-0-1 and FREQ-0-1. UPGMA-0-1-N is superior to UPGMA-0-1. 0-1-N is a better representation.
- UPGMA finds more singletons and fewer larger-size clusters.
19Discussion (2)
- FREQ-0-1 is better than UPGMA-0-1, but its clustering quality is unstable. FREQ-0-1 is comparable to GCP-0-1-N and UPGMA-0-1-N on Bacteria 1.
- Constraints must be added to FREQ. Incrementally decreasing the support can help efficiency and quality.
- The Minkowski measure and Jaccard's coefficient indicate that the clustering solutions differ: UPGMA-0-1-N > FREQ-0-1 > UPGMA-0-1.
20Conclusion Future Directions
- Cluster size, homogeneity, separation, the Minkowski measure, and Jaccard's coefficient were used to evaluate cluster quality.
- 0-1-N is a better representation. Use UPGMA-0-1-N instead of UPGMA-0-1.
- GCP performs the best.
- FREQ works in both 0-1 and 0-1-N but needs more testing. Adding constraints is key. Other frequent itemset mining algorithms, such as Max-Miner, can be explored. Is assigning a fingerprint to multiple clusters a benefit?