Title: Artificial Intelligence 15-381 Unsupervised Machine Learning Methods
1 Artificial Intelligence 15-381: Unsupervised Machine Learning Methods
- Jaime Carbonell
- 1-November-2001
- OUTLINE
- What is unsupervised learning?
- Similarity computations
- Clustering Algorithms
- Other kinds of unsupervised learning
2 Unsupervised Learning
- Definition of Unsupervised Learning
- Learning useful structure without labeled
classes, optimization criterion, feedback signal,
or any other information beyond the raw data
3 Unsupervised Learning
- Examples
- Find natural groupings of Xs (X = human languages, stocks, gene sequences, animal species, ...)?
- Prelude to discovery of underlying properties
- Summarize the news for the past month?
- Cluster first, then report centroids.
- Sequence extrapolation, e.g., predict cancer incidence next decade; predict rise in antibiotic-resistant bacteria
- Methods
- Clustering (n-link, k-means, GAC, ...)
- Taxonomy creation (hierarchical clustering)
- Novelty detection ("meaningful" outliers)
- Trend detection (extrapolation from multivariate partial derivatives)
4 Similarity Measures in Data Analysis
- General Assumptions
- Each data item is a tuple (vector)
- Values of tuple are nominal, ordinal or numerical
- $\text{Similarity} = (\text{Distance})^{-1}$
- Pure Numerical Tuples
- $\text{Sim}(d_i, d_j) = \sum_k d_{i,k}\, d_{j,k}$
- $\text{sim}(d_i, d_j) = \cos(d_i, d_j)$
- and many more (slide after next)
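The inner-product and cosine measures for purely numerical tuples can be sketched as follows. This is a minimal illustration, not the course's reference code; the function names and the sample vectors are assumptions introduced here.

```python
import math

def dot_similarity(di, dj):
    """Inner product: sum over k of d_i,k * d_j,k."""
    return sum(a * b for a, b in zip(di, dj))

def cosine_similarity(di, dj):
    """Cosine of the angle between the two vectors (0.0 if either is all zeros)."""
    norm_i = math.sqrt(dot_similarity(di, di))
    norm_j = math.sqrt(dot_similarity(dj, dj))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot_similarity(di, dj) / (norm_i * norm_j)

# Hypothetical example tuples.
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1, twice the length
d3 = [0.0, 0.0, 3.0]   # orthogonal to d1
```

The inner product grows with vector length, while the cosine depends only on direction, so `d1` and `d2` get the maximal cosine even though their magnitudes differ.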
5 Similarity Measures in Data Analysis
- For Ordinal Values
- E.g. "small," "medium," "large," "X-large"
- Convert to numerical assuming constant spacing $\Delta$ on a normalized $[0,1]$ scale, where $\max(v) = 1$, $\min(v) = 0$, others interpolate
- E.g. "small" = 0, "medium" = 0.33, etc.
- Then, use numerical similarity measures
- Or, use similarity matrix (see next slide)
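The ordinal-to-numerical conversion can be sketched like this, assuming evenly spaced values on the normalized scale. The helper names are hypothetical, and the "1 minus distance" similarity is only one possible choice of numerical measure.

```python
def ordinal_to_numeric(ordered_values):
    """Map ordered labels to evenly spaced numbers in [0, 1]."""
    n = len(ordered_values)
    if n == 1:
        return {ordered_values[0]: 0.0}
    # Adjacent labels are separated by the constant step 1 / (n - 1).
    return {label: i / (n - 1) for i, label in enumerate(ordered_values)}

sizes = ordinal_to_numeric(["small", "medium", "large", "X-large"])

def ordinal_similarity(a, b, scale=sizes):
    """Assumed measure: 1 minus the distance between the two values on the scale."""
    return 1.0 - abs(scale[a] - scale[b])
```

With four labels the step is 1/3, which gives "small" = 0 and "medium" = 0.33 as on the slide.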
6 Similarity Measures (cont.)
- For Nominal Values
- E.g. "Boston", "LA", "Pittsburgh", or "male", "female", or "diffuse", "globular", "spiral", "pinwheel"
- Binary rule: if $d_{i,k} = d_{j,k}$, then sim = 1, else 0
- Use underlying semantic property, e.g. $\text{Sim}(\text{Boston}, \text{LA}) \propto \text{dist}(\text{Boston}, \text{LA})^{-1}$, or $\text{Sim}(\text{Boston}, \text{LA}) \propto |\text{size}(\text{Boston}) - \text{size}(\text{LA})|^{-1}$
- Use similarity matrix
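A minimal sketch of the first two options for nominal values. Averaging the binary rule over attributes and adding a small `eps` to avoid division by zero are assumptions made here, not part of the slide; the sample tuples and distances are hypothetical.

```python
def nominal_similarity(di, dj):
    """Apply the binary rule per attribute and return the fraction that match."""
    matches = [1 if a == b else 0 for a, b in zip(di, dj)]
    return sum(matches) / len(matches)

def inverse_distance_similarity(distance, eps=1e-9):
    """Similarity proportional to 1 / distance; eps (assumed) guards against zero."""
    return 1.0 / (distance + eps)

# Hypothetical tuples: (city, sex, galaxy shape).
t1 = ("Boston", "female", "spiral")
t2 = ("LA", "female", "spiral")
```

Here `t1` and `t2` agree on two of three attributes, so the averaged binary rule gives 2/3.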
7 Similarity Matrix
         tiny   little  small  medium  large  huge
tiny     1.0    0.8     0.7    0.5     0.2    0.0
little          1.0     0.9    0.7     0.3    0.1
small                   1.0    0.7     0.3    0.2
medium                         1.0     0.5    0.3
large                                  1.0    0.8
huge                                          1.0
- Diagonal must be 1.0
- Monotonicity property must hold
- Triangle inequality must hold
- Transitive property need not hold
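The matrix above can be stored as an upper triangle and checked mechanically. The sketch below assumes one reading of the monotonicity property: for ordered values i < j < k, the outer pair (i, k) is never more similar than either inner pair. All function names are hypothetical.

```python
LABELS = ["tiny", "little", "small", "medium", "large", "huge"]
UPPER = [
    [1.0, 0.8, 0.7, 0.5, 0.2, 0.0],
    [1.0, 0.9, 0.7, 0.3, 0.1],
    [1.0, 0.7, 0.3, 0.2],
    [1.0, 0.5, 0.3],
    [1.0, 0.8],
    [1.0],
]

def full_matrix(upper):
    """Expand the upper triangle into a symmetric square matrix."""
    n = len(upper)
    sim = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(upper):
        for offset, value in enumerate(row):
            j = i + offset
            sim[i][j] = sim[j][i] = value
    return sim

def diagonal_is_one(sim):
    return all(sim[i][i] == 1.0 for i in range(len(sim)))

def is_monotone(sim):
    """Assumed definition: sim(i, k) <= sim(i, j) and sim(i, k) <= sim(j, k) for i < j < k."""
    n = len(sim)
    return all(
        sim[i][k] <= sim[i][j] and sim[i][k] <= sim[j][k]
        for i in range(n) for j in range(i + 1, n) for k in range(j + 1, n)
    )

SIM = full_matrix(UPPER)

def lookup(a, b):
    """Similarity between two labels, in either order."""
    return SIM[LABELS.index(a)][LABELS.index(b)]
```

Under that reading, the slide's matrix passes both the diagonal and the monotonicity checks.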
8 Document Clustering Techniques
- Similarity or Distance Measure: Alternative Choices
- Cosine similarity
- Euclidean distance
- Kernel functions, e.g.,
- Language Modeling: $P(y \mid \text{model}_x)$ where x and y are documents
9 Document Clustering Techniques
- Kullback-Leibler distance ("relative entropy")
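Two of the listed measures, Euclidean distance and Kullback-Leibler divergence, can be sketched as follows using their standard textbook definitions. Turning term counts into a distribution with add-one smoothing is an assumption made here so that the logarithm is always defined; the slide does not specify it.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def to_distribution(counts, smoothing=1.0):
    """Normalize term counts to probabilities with additive smoothing (assumed)."""
    total = sum(counts) + smoothing * len(counts)
    return [(c + smoothing) / total for c in counts]

p = [0.5, 0.5]
q = [0.9, 0.1]
```

Note that `kl_divergence(p, q)` and `kl_divergence(q, p)` differ, so relative entropy is not a true distance metric; a symmetrized version is sometimes used when a symmetric measure is needed.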
10 Incremental Clustering Methods
- Given n data items D: D1, D2, ..., Di, ..., Dn
- And given minimal similarity threshold Smin
- Cluster data incrementally as follows
- Procedure Singlelink(D)
- Let CLUSTERS = {D1}
- For i = 2 to n
- Let Dc = Argmax over j < i of Sim(Di, Dj)
- If Sim(Di, Dc) > Smin, add Di to Dc's cluster
- Else Append(CLUSTERS, {Di}) [new cluster]
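A minimal Python sketch of the Singlelink procedure. It reads the threshold test as "similarity of the new item to its most similar earlier item exceeds Smin"; the 1-D data and the `similarity` function are hypothetical.

```python
def similarity(a, b):
    """Assumed similarity for 1-D points: 1 / (1 + absolute difference)."""
    return 1.0 / (1.0 + abs(a - b))

def single_link_incremental(items, sim, s_min):
    """Return clusters as lists of item indices, in order of creation."""
    clusters = [[0]]       # CLUSTERS starts with the first item alone
    cluster_of = {0: 0}    # item index -> index of its cluster
    for i in range(1, len(items)):
        # Most similar previously seen item (argmax over j < i).
        c = max(range(i), key=lambda j: sim(items[i], items[j]))
        if sim(items[i], items[c]) > s_min:
            cluster_of[i] = cluster_of[c]
            clusters[cluster_of[c]].append(i)
        else:
            cluster_of[i] = len(clusters)
            clusters.append([i])
    return clusters

points = [0.0, 0.5, 10.0, 10.4, 1.0]
result = single_link_incremental(points, similarity, s_min=0.5)
```

The last point (1.0) joins the first cluster because it is close to 0.5, even though it is a full unit away from 0.0; this chaining through nearest neighbors is characteristic of single-link clustering.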
11 Incremental Clustering (cont.)
- Procedure Averagelink(D)
- Let CLUSTERS = {D1}
- For i = 2 to n
- Let Cmax = Argmax over C in CLUSTERS of Sim(Di, centroid(C))
- If Sim(Di, centroid(Cmax)) > Smin, add Di to cluster Cmax
- Else Append(CLUSTERS, {Di}) [new cluster]
- Observations
- Single pass over the data → easy to cluster new data incrementally
- Requires arbitrary Smin threshold
- $O(N^2)$ time, $O(N)$ space
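The Averagelink procedure can be sketched in the same style, comparing each new item with cluster centroids rather than with individual items. The similarity function and the sample points are assumptions introduced for illustration.

```python
import math

def centroid(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def similarity(x, y):
    """Assumed similarity: 1 / (1 + Euclidean distance)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + dist)

def average_link_incremental(items, s_min):
    """Return clusters as lists of items, in order of creation."""
    clusters = [[items[0]]]
    for item in items[1:]:
        # Cluster whose centroid is most similar to the new item.
        best = max(clusters, key=lambda c: similarity(item, centroid(c)))
        if similarity(item, centroid(best)) > s_min:
            best.append(item)
        else:
            clusters.append([item])
    return clusters

data = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (0.0, 1.0), (11.0, 10.0)]
clusters = average_link_incremental(data, s_min=0.3)
```

This sketch recomputes each centroid from scratch for clarity; keeping a running sum and count per cluster would avoid that repeated work.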
12 Document Clustering Techniques
- Example: group documents based on similarity
- Similarity matrix
- Thresholding at similarity value of .9 yields
- complete graph C1 = {1,4,5}, namely Complete Linkage
- connected graph C2 = {1,4,5,6}, namely Single Linkage
- For clustering we need three things
- A similarity measure for pairwise comparison between documents
- A clustering criterion (Complete Link, Single Link, ...)
- A clustering algorithm
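The difference between the two graphs can be illustrated with a small sketch. The slide's similarity matrix is not reproduced in this transcript, so the values below are hypothetical, chosen only so that thresholding at 0.9 gives the same two groups as in the example.

```python
from itertools import combinations

# Hypothetical pairwise similarities for documents 1..6; unlisted pairs count as 0.0.
SIM = {
    (1, 4): 0.95, (1, 5): 0.92, (4, 5): 0.91,
    (5, 6): 0.93, (1, 6): 0.40, (4, 6): 0.35,
    (2, 3): 0.50,
}
DOCS = [1, 2, 3, 4, 5, 6]

def linked(a, b, threshold):
    """True if the pair's similarity reaches the threshold."""
    return SIM.get((min(a, b), max(a, b)), 0.0) >= threshold

def connected_components(docs, threshold):
    """Single linkage: documents joined by any chain of above-threshold edges."""
    remaining, components = set(docs), []
    while remaining:
        start = min(remaining)
        remaining.discard(start)
        stack, component = [start], {start}
        while stack:
            node = stack.pop()
            for other in list(remaining):
                if linked(node, other, threshold):
                    remaining.discard(other)
                    component.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components

def is_complete(group, threshold):
    """Complete linkage: every pair in the group must be above the threshold."""
    return all(linked(a, b, threshold) for a, b in combinations(group, 2))
```

In this data document 6 is linked only to document 5, so it belongs to the connected graph {1,4,5,6} but not to the complete graph {1,4,5}.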
13 Document Clustering Techniques
- Clustering Criterion: Alternative Linkages
- Single-link ("nearest neighbor")
- Complete-link
- Average-link ("group average clustering", or GAC)
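In terms of pairwise similarities, the three criteria can be sketched as the maximum, minimum, and mean over cross-cluster pairs. This is a common formulation rather than a transcription of the slide's formulas; some definitions of group average also include within-cluster pairs, which this sketch does not.

```python
def pairwise(cluster_a, cluster_b, sim):
    """Similarities of all pairs with one member from each cluster."""
    return [sim(a, b) for a in cluster_a for b in cluster_b]

def single_link(cluster_a, cluster_b, sim):
    """Similarity of the closest pair (nearest neighbor)."""
    return max(pairwise(cluster_a, cluster_b, sim))

def complete_link(cluster_a, cluster_b, sim):
    """Similarity of the farthest pair."""
    return min(pairwise(cluster_a, cluster_b, sim))

def average_link(cluster_a, cluster_b, sim):
    """Mean similarity over all cross-cluster pairs."""
    scores = pairwise(cluster_a, cluster_b, sim)
    return sum(scores) / len(scores)

def similarity(a, b):
    """Assumed similarity for 1-D points: 1 / (1 + absolute difference)."""
    return 1.0 / (1.0 + abs(a - b))

A = [0.0, 1.0]
B = [3.0, 4.0]
```

For any pair of clusters the three scores are ordered complete-link ≤ average-link ≤ single-link, which is why single-link merges most eagerly and complete-link most conservatively.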
14 Non-hierarchical Clustering Methods
- A Single-Pass Algorithm
- 1. Treat the first document as the first cluster (singleton cluster).
- 2. Compare each subsequent document to all the clusters processed so far.
- 3. Add this new document to the closest cluster if the intercluster similarity is above the (predetermined) similarity threshold; otherwise, leave the new document alone as a new cluster.
- 4. Repeat Steps 2 and 3 until all the documents are processed.
- $O(n^2)$ time and $O(n)$ space (worst-case complexity)
15 Non-hierarchical Methods (cont.)
- Multi-pass K-means ("reallocation method")
- 1. Select K initial centroids (the "seeds")
- 2. Assign each document to the closest centroid, resulting in K clusters.
- 3. Recompute the centroid for each of the K clusters.
- 4. Repeat Steps 2 and 3 until the centroids are stabilized.
- $O(nK)$ time and $O(K)$ space per pass
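The four steps map directly onto a short loop. The sketch below uses Euclidean distance, takes the seeds as an argument, and keeps the old centroid when a cluster becomes empty; those choices and the sample data are assumptions, not part of the slide.

```python
import math

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, seeds, max_passes=100):
    """Return (centroids, groups) once the centroids stop changing."""
    centroids = list(seeds)                      # Step 1: initial seeds
    groups = []
    for _ in range(max_passes):
        # Step 2: assign each point to its closest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: distance(p, centroids[c]))
            groups[nearest].append(p)
        # Step 3: recompute centroids (an empty cluster keeps its old centroid).
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        # Step 4: stop when the centroids have stabilized.
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, groups = k_means(points, seeds=[(0.0, 0.0), (9.0, 9.0)])
```

The result depends on the seeds: a poor choice can converge to a different, worse partition, which is why seed selection heuristics matter in practice.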
16 Hierarchical Agglomerative Clustering Methods
- Generic Agglomerative Procedure (Salton '89)
- Results in nested clusters via iterations
- 1. Compute all pairwise document-document similarity coefficients
- 2. Place each of n documents into a class of its own
- 3. Merge the two most similar clusters into one
- replace the two clusters by the new cluster
- compute intercluster similarity scores w.r.t. the new cluster
- 4. Repeat the above step until only one cluster is left
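A minimal sketch of the generic agglomerative procedure, using average similarity over cross-cluster pairs as the intercluster score (an assumption; any linkage could be substituted). It recomputes scores on every iteration for readability, which is slower than caching them as the procedure describes; the data are hypothetical.

```python
def hac(items, sim):
    """Return the merge history as (cluster_a, cluster_b, score) triples of index tuples."""
    n = len(items)
    # Step 1: all pairwise item-item similarities.
    pair_sim = {(i, j): sim(items[i], items[j]) for i in range(n) for j in range(i + 1, n)}
    # Step 2: every item starts in a cluster of its own.
    clusters = [(i,) for i in range(n)]

    def group_average(a, b):
        total = sum(pair_sim[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        # Step 3: find and merge the two most similar clusters.
        candidates = ((x, y) for idx, x in enumerate(clusters) for y in clusters[idx + 1:])
        a, b = max(candidates, key=lambda pair: group_average(*pair))
        merges.append((a, b, group_average(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

def similarity(x, y):
    """Assumed similarity for 1-D points: 1 / (1 + absolute difference)."""
    return 1.0 / (1.0 + abs(x - y))

merges = hac([0.0, 1.0, 5.0, 6.5, 20.0], similarity)
```

The merge history is the nested clustering: reading it in order gives the binary hierarchy from singletons up to the single root cluster.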
17 Hierarchical Agglomerative Clustering Methods (cont.)
- Heuristic Approaches to Speedy Clustering
- Reallocation methods with k selected seeds ($O(kn)$ time)
- k is the desired number of clusters; n is the number of documents
- Buckshot: random sampling (of $\sqrt{kn}$ documents) plus global HAC
- Fractionation: Divide and Conquer
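The Buckshot idea can be sketched as: sample about √(kn) items, run HAC on the sample only, and use the resulting k centroids as seeds for assigning all n items. The centroid-distance linkage, the single assignment pass, and the sample data below are assumptions made for brevity.

```python
import math
import random

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def hac_to_k(points, k):
    """Merge the two clusters with the closest centroids until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: distance(mean(clusters[ij[0]]), mean(clusters[ij[1]])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]   # safe because j > i
    return clusters

def buckshot(points, k, rng):
    """Return (seeds, groups): HAC on a random sample, then one assignment pass."""
    sample_size = min(len(points), max(k, math.ceil(math.sqrt(k * len(points)))))
    sample = rng.sample(points, sample_size)
    seeds = [mean(c) for c in hac_to_k(sample, k)]
    groups = [[] for _ in seeds]
    for p in points:
        nearest = min(range(len(seeds)), key=lambda c: distance(p, seeds[c]))
        groups[nearest].append(p)
    return seeds, groups

points = [(float(x), 0.0) for x in range(10)] + [(float(x), 0.0) for x in range(100, 110)]
seeds, groups = buckshot(points, k=2, rng=random.Random(0))
```

Because HAC is quadratic in the number of items it sees, running it on √(kn) items costs on the order of kn, which matches the cost of the reallocation passes that follow.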
18 Creating Taxonomies
- Hierarchical Clustering
- GAC trace creates binary hierarchy
- Incremental-link → Hierarchical version
- 1. Cluster data with high Smin → 1st hierarchical level
- 2. Decrease Smin (stop at Smin = 0)
- 3. Treat cluster centroids as data tuples and recluster, creating next level of hierarchy, then repeat steps 2 and 3.
- K-means → Hierarchical k-means
- 1. Cluster data with large k
- 2. Decrease k (stop at k = 1)
- 3. Treat cluster centroids as data tuples and recluster, creating next level of hierarchy, then repeat steps 2 and 3.
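The hierarchical k-means variant can be sketched as repeated k-means over the previous level's centroids with a decreasing k. The seed choice (evenly spaced picks from the input), the k schedule, and the 1-D data are all assumptions for illustration; centroids are treated as unweighted data tuples, as the steps above describe.

```python
import math

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(points, k, max_passes=100):
    """Plain k-means returning centroids; seeds are evenly spaced picks (assumed)."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(max_passes):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: distance(p, centroids[c]))
            groups[nearest].append(p)
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def hierarchical_k_means(points, k_schedule):
    """Return one list of centroids per level, from the finest level to the root."""
    levels, current = [], list(points)
    for k in k_schedule:
        # Recluster the previous level's centroids with a smaller k.
        current = k_means(current, min(k, len(current)))
        levels.append(current)
    return levels

data = [(0.0,), (1.0,), (4.0,), (5.0,), (20.0,), (21.0,), (24.0,), (25.0,)]
levels = hierarchical_k_means(data, k_schedule=[4, 2, 1])
```

Each entry of `levels` is one layer of the taxonomy; the final level with k = 1 is the root.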
19 Taxonomies (cont.)
- Postprocess Taxonomies
- Eliminate "no-op" levels
- Agglomerate "skinny" levels
- Label meaningful levels manually or with centroid summary