Title: Data Mining For Hypertext: A Tutorial Survey
1 Data Mining for Hypertext: A Tutorial Survey
11/11/01
SDBI Winter 2001
- Based on a paper by
- Soumen Chakrabarti
- Indian Institute of Technology, Bombay
- Soumen_at_cse.iitb.ernet.in
- Lecture by
- Noga Kashti
- Efrat Daum
2 Let's start with definitions
- Hypertext - a collection of documents (or "nodes") containing cross-references or "links" which, with the aid of an interactive browser program, allow the reader to move easily from one document to another.
- Data Mining - analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.
3 Two Ways For Getting Information From The Web
- Clicking On Hyperlinks
- Searching Via Keyword Queries
4 Some History
- Before the popular Web, hypertext was already in use by the ACM, SIGIR, SIGLINK/SIGWEB and Digital Libraries communities.
- The old IR (Information Retrieval) deals with documents, whereas the Web deals with semi-structured data.
5 Some Numbers...
- The Web exceeds 800 million HTML pages on about three million servers.
- Almost a million pages are added daily.
- A typical page changes in a few months.
- Several hundred gigabytes change every month.
6 Difficulties With Accessing Information On The Web
- The usual problems of text search (synonymy, polysemy, context sensitivity) become much more severe.
- Semi-structured data.
- Sheer size and flux.
- No consistent standard or style.
7 The Old Search Process Is Often Unsatisfactory!
- Deficiency of scale.
- Poor accuracy (low recall and low precision).
8 Better Solutions: Data Mining And Machine Learning
- Natural language (NL) techniques.
- Statistical techniques for learning structure in various forms from text, hypertext and semi-structured data.
9 Issues We'll Discuss
- Models
- Supervised learning
- Unsupervised learning
- Semi-supervised learning
- Social network analysis
10 Models For Text
- Representations for text with statistical analyses only (bag-of-words):
- The vector space model
- The binary model
- The multinomial model
11 Models For Text (cont.)
- The vector space model
- Documents → tokens → canonical forms.
- Each canonical token is an axis in a Euclidean space.
- The t-th coordinate of d is n(d,t), the number of occurrences of t in d, where
- t is a term
- d is a document
12 The Vector Space Model: Normalize The Document Length To 1
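- A minimal Python sketch (not from the original slides) of building such document vectors and normalizing their length to 1; the whitespace tokenizer and the toy document are illustrative assumptions:

    import math
    from collections import Counter

    def to_unit_vector(text):
        # n(d,t): raw term counts for document d
        counts = Counter(text.lower().split())
        # normalize the document length (Euclidean norm) to 1
        norm = math.sqrt(sum(c * c for c in counts.values()))
        return {t: c / norm for t, c in counts.items()}

    print(to_unit_vector("data mining for hypertext hypertext"))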
13 More Models For Text
- The binary model: a document is a set of terms, which is a subset of the lexicon. Word counts are not significant.
- The multinomial model: a die with |T| faces. Every face (term) has a probability θt of showing up when tossed. Having decided on the total word count, the author tosses the die and writes down the term that shows up.
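- A small sketch (not from the slides) of the multinomial "author tosses a die" generative model; the lexicon and the face probabilities θt below are made-up values:

    import random

    lexicon = ["data", "mining", "hypertext", "web", "link"]
    theta = [0.35, 0.25, 0.20, 0.15, 0.05]   # one face probability per term, sums to 1

    def generate_document(length):
        # the author decides the total word count, then tosses the die once per
        # word and writes down the term that shows up
        return random.choices(lexicon, weights=theta, k=length)

    print(generate_document(10))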
14 Models For Hypertext
- Hypertext = text with hyperlinks.
- Varying levels of detail.
- Example: a directed graph (D, L)
- D: the set of nodes/documents/pages
- L: the set of links
15 Models For Semi-structured Data
- A point of convergence for the Web (documents) and database (data) communities.
16 Models For Semi-structured Data (cont.)
- E.g., topic directories with tree-structured hierarchies.
- Examples: Open Directory Project, Yahoo!
- Another representation: XML.
17 Supervised Learning (classification)
- Algorithm initialization: training data, where each item is marked with a label or class from a discrete finite set.
- Input: unlabeled data.
- Algorithm role: guess the data labels.
18 Supervised Learning (cont.)
- Example: topic directories.
- Advantages: help structure and restrict keyword search, can enable powerful searches.
19 Probabilistic Models For Text Learning
- Let c1,...,cm be m classes or topics with some training documents Dc.
- Prior probability of a class: Pr(c) = |Dc| / Σc' |Dc'|
- T: the universe of terms in all the training documents.
20 Probabilistic Models For Text Learning (cont.)
- Naive Bayes classification
- Assumption: for each class c there is a binary text generator model.
- Model parameters: Φc,t - the probability that a document in class c will mention term t at least once.
21 Naive Bayes classification (cont.)
- Pr(d|c) = Πt∈d Φc,t · Πt∉d (1 - Φc,t)
- Problems:
- Short documents are discouraged.
- The Pr(d|c) estimate is likely to be greatly distorted.
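- A minimal sketch of classification under the binary model; the classes, priors and Φc,t values below are made-up assumptions rather than estimates from real training data:

    import math

    # assumed parameters: Phi[c][t] = Pr(a class-c document mentions term t at least once)
    Phi = {
        "sports": {"game": 0.8, "team": 0.7, "stock": 0.05},
        "finance": {"game": 0.1, "team": 0.1, "stock": 0.9},
    }
    prior = {"sports": 0.5, "finance": 0.5}

    def classify(doc_terms):
        doc = set(doc_terms)                      # word counts are not significant
        scores = {}
        for c, phi in Phi.items():
            # log Pr(c) + log of the product over present and absent lexicon terms
            score = math.log(prior[c])
            for t, p in phi.items():
                score += math.log(p) if t in doc else math.log(1.0 - p)
            scores[c] = score
        return max(scores, key=scores.get)

    print(classify(["stock", "team"]))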
22 Naive Bayes classification (cont.)
- With the multinomial model: Pr(d|c) ∝ Πt θc,t^n(d,t)
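- A matching sketch for the multinomial model, scoring each class by log Pr(c) + Σt n(d,t)·log θc,t; again the θc,t values are illustrative assumptions:

    import math
    from collections import Counter

    # assumed parameters: theta[c][t] = Pr(the class-c die shows term t)
    theta = {
        "sports": {"game": 0.5, "team": 0.4, "stock": 0.1},
        "finance": {"game": 0.1, "team": 0.1, "stock": 0.8},
    }
    prior = {"sports": 0.5, "finance": 0.5}

    def classify(doc_terms):
        counts = Counter(doc_terms)               # n(d,t)
        scores = {c: math.log(prior[c]) +
                     sum(n * math.log(theta[c][t]) for t, n in counts.items())
                  for c in theta}
        return max(scores, key=scores.get)

    print(classify(["stock", "stock", "game"]))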
23 Naive Bayes classification (cont.)
- Problems:
- Again, short documents are discouraged.
- Inter-term correlation is ignored.
- Multiplicative Φc,t "surprise" factor.
- Conclusion:
- Both models are effective.
24 More Probabilistic Models For Text Learning
- Parameter smoothing and feature selection.
- Limited dependence modeling.
- The maximum entropy technique.
- Support vector machines (SVMs).
- Hierarchies over class labels.
25 Learning Relations
- Classification extension: a combination of statistical and relational learning.
- Improves accuracy.
- The ability to invent predicates.
- Can represent hyperlink graph structure and word statistics of neighbor documents.
- Learned rules will not be dependent on specific keywords.
26 Unsupervised learning
- Input: hypertext documents.
- Output: a hierarchy among the documents.
- What is a good clustering?
27 Basic clustering techniques
- Techniques for clustering:
- k-means
- hierarchical agglomerative clustering
28 Basic clustering techniques
- Document representations:
- unweighted vector space
- TFIDF vector space
- Similarity between two documents:
- cos(θ), where θ is the angle between their corresponding vectors
- the distance between the (length-normalized) vectors
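- A short sketch (assumed implementation over a toy corpus) of TFIDF vectors and the cosine similarity between two documents:

    import math
    from collections import Counter

    docs = ["data mining on the web", "mining hypertext data", "cooking recipes for the web"]

    def tfidf(doc, corpus):
        tf = Counter(doc.split())                 # term frequencies
        n = len(corpus)
        # weight each term by log(n / document frequency)
        return {t: c * math.log(n / sum(t in d.split() for d in corpus))
                for t, c in tf.items()}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    vecs = [tfidf(d, docs) for d in docs]
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))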
29 k-means clustering
- The k-means algorithm
- Input:
- d1,...,dn - a set of n documents
- k - the number of clusters desired (k ≤ n)
- Output:
- C1,...,Ck - k clusters covering the n documents
30 k-means clustering
- The k-means algorithm (cont.)
- Initial guess: k initial means m1,...,mk
- Until there are no changes in any mean:
- For each document d: d is in Ci if ||d - mi|| is the minimum of all the k distances.
- For 1 ≤ i ≤ k: replace mi with the mean of all the documents assigned to Ci.
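- A minimal Python sketch of this loop, run on toy 2-dimensional points instead of real document vectors (an illustrative assumption):

    import math, random

    def kmeans(points, k, iters=100):
        means = random.sample(points, k)                  # initial guess: k initial means
        clusters = [[] for _ in range(k)]
        for _ in range(iters):
            # assignment: d is in C_i if ||d - m_i|| is the minimum distance
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: math.dist(p, means[j]))
                clusters[i].append(p)
            # update: replace m_i with the mean of the documents in C_i
            new_means = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[i]
                         for i, cl in enumerate(clusters)]
            if new_means == means:                        # no changes in any mean
                break
            means = new_means
        return means, clusters

    pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
    print(kmeans(pts, k=2)[0])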
31 k-means clustering
- The k-means algorithm - Example: clusterings with K=2 and K=3 (figure).
32 k-means clustering (cont.)
- Problem:
- high dimensionality
- e.g., if each of 30,000 dimensions has only two possible values, the vector space size is 2^30000
- Solution:
- project out some dimensions
33 Agglomerative clustering
- Documents are merged into superdocuments or groups until only one group is left.
- Some definitions:
- s(d1,d2) - the similarity between documents d1 and d2
- s(A) - the self-similarity of group A (the average pairwise similarity s(d1,d2) over distinct d1, d2 ∈ A)
34 Agglomerative clustering
- The agglomerative clustering algorithm
- Input:
- d1,...,dn - a set of n documents
- Output:
- G - the final group, with a nested hierarchy
35 Agglomerative clustering (cont.)
- The agglomerative clustering algorithm
- Initially G = {G1,...,Gn}, where Gi = {di}
- While |G| > 1:
- Find A and B in G such that s(A ∪ B) is maximized
- G = (G \ {A,B}) ∪ {A ∪ B}
- Time: O(n^2)
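- A minimal sketch of this loop, using the average pairwise cosine similarity as s(A); the 2-dimensional toy vectors are made up:

    import math
    from itertools import combinations

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    def self_similarity(group):
        pairs = list(combinations(group, 2))
        return sum(cos(u, v) for u, v in pairs) / len(pairs)

    def agglomerate(docs):
        G = [[d] for d in docs]                        # initially each document is its own group
        merges = []
        while len(G) > 1:
            # find A, B in G such that s(A ∪ B) is maximized
            A, B = max(combinations(G, 2), key=lambda ab: self_similarity(ab[0] + ab[1]))
            G.remove(A); G.remove(B); G.append(A + B)  # G = (G \ {A,B}) ∪ {A ∪ B}
            merges.append(A + B)
        return merges

    docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
    print(agglomerate(docs))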
36 Agglomerative clustering (cont.)
- The agglomerative clustering algorithm - Example (figure).
37 Techniques from linear algebra
- Documents and terms are represented by vectors in Euclidean space.
- Applications of linear algebra to text analysis:
- Latent semantic indexing (LSI)
- Random projections
38 Co-occurring terms
39 Latent semantic indexing (LSI)
- Vector space model of documents:
- Let m = |T|, the lexicon size
- Let n = the number of documents
- Define A, the m×n term-by-document matrix,
- where aij = the number of occurrences of term i in document j.
40 Latent semantic indexing (LSI)
41 Singular Value Decomposition (SVD)
- Let A ∈ R^(m×n), m ≥ n, be a matrix.
- The singular value decomposition of A is the factorization A = U·D·V^T, where
- U and V are orthogonal: U^T·U = V^T·V = In
- D = diag(σ1,...,σn) with σi ≥ 0, 1 ≤ i ≤ n
- then,
- U = [u1,...,un]; u1,...,un are the left singular vectors
- V = [v1,...,vn]; v1,...,vn are the right singular vectors
- σ1,...,σn are the singular values of A.
42 Singular Value Decomposition (SVD)
- A·A^T = (U·D·V^T)·(V·D^T·U^T) = U·D·I·D·U^T = U·D^2·U^T
- ⇒ A·A^T·U = U·D^2 = [σ1^2·u1,...,σn^2·un]
- for 1 ≤ i ≤ n: A·A^T·ui = σi^2·ui
- ⇒ the columns of U are the eigenvectors of A·A^T.
- Similarly, A^T·A = V·D^2·V^T
- ⇒ the columns of V are the eigenvectors of A^T·A.
- The eigenvalues of A·A^T (or A^T·A) are σ1^2,...,σn^2.
43 Singular Value Decomposition (SVD)
- Let Ak = Σ_{i=1..k} σi·ui·vi^T be the k-truncated SVD.
- rank(Ak) = k
- ||A - Ak||2 ≤ ||A - Mk||2 for any matrix Mk of rank k.
44 Singular Value Decomposition (SVD)
45 LSI with SVD
- Define q ∈ R^m, a query vector.
- qi ≠ 0 if term i is a part of the query.
- Then A^T·q ∈ R^n is the answer vector.
- (A^T·q)j ≠ 0 if document j contains one or more terms in the query.
- How to do it better?
46 LSI with SVD
- Use Ak instead of A
- ⇒ calculate Ak^T·q
- Now, a query on "car" will return a document containing the word "auto".
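- A small numpy sketch (assumed implementation with a toy term-by-document matrix) of answering a query with the k-truncated Ak instead of A:

    import numpy as np

    terms = ["car", "auto", "engine", "recipe"]
    # A: m x n term-by-document matrix, a_ij = occurrences of term i in document j
    A = np.array([[2, 0, 1, 0],
                  [0, 2, 1, 0],
                  [1, 1, 2, 0],
                  [0, 0, 0, 3]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U D V^T
    k = 2
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # k-truncated SVD

    q = np.array([1.0, 0.0, 0.0, 0.0])                 # query on "car"
    print("A^T q :", A.T @ q)                          # scores only exact matches
    print("Ak^T q:", Ak.T @ q)                         # also scores the "auto" document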
47 Random projections
- Theorem:
- let v ∈ R^n be a unit vector
- H - a randomly oriented l-dimensional subspace through the origin
- X - the random variable measuring the square of the length of the projection of v on H
- then
- E[X] = l/n
- and X is sharply concentrated around l/n: for a suitably chosen ε, the probability that X deviates from l/n by more than a factor of (1 ± ε) is exponentially small in l.
48 Random projections
- A projection of a set of points to a randomly oriented subspace:
- small distortion in inter-point distances.
- The technique:
- reduces the dimensionality of the points
- speeds up the distance computations
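- A small numpy sketch (an assumed implementation, using a Gaussian random matrix rather than an explicitly sampled random subspace) showing that a random projection roughly preserves inter-point distances:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, l = 50, 1000, 100                     # 50 points, 1000 dims, project to 100 dims
    X = rng.normal(size=(n, d))

    # entries ~ N(0,1); dividing by sqrt(l) keeps projected lengths close to the originals
    R = rng.normal(size=(d, l)) / np.sqrt(l)
    Y = X @ R

    orig = np.linalg.norm(X[0] - X[1])
    proj = np.linalg.norm(Y[0] - Y[1])
    print(orig, proj, proj / orig)              # the ratio should be close to 1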
49 Semi-supervised learning
- Real-life applications typically have:
- a few labeled documents
- many unlabeled documents
- Between supervised and unsupervised learning.
50 Learning from labeled and unlabeled documents
- Expectation Maximization (EM) Algorithm:
- Initially, train a naive Bayes classifier using only the labeled data.
- Repeat EM iterations until near convergence:
- E-step: assign class probabilities Pr(c|d) to all unlabeled documents using the current θc,t estimates.
- M-step: re-estimate θc,t from all documents, weighting the unlabeled ones by their class probabilities.
- Classification error is reduced by a third in the best cases.
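- A compact sketch of this EM loop (a simplified, assumed implementation: multinomial naive Bayes with Laplace smoothing over a toy labeled/unlabeled split):

    import math
    from collections import Counter

    classes = ["sports", "finance"]
    labeled = [(["game", "team"], "sports"), (["stock", "market"], "finance")]
    unlabeled = [["game", "score"], ["stock", "price"]]
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}

    def train(weighted_docs):
        # weighted_docs: list of (terms, {class: weight}); Laplace-smoothed estimates
        theta, prior = {}, {}
        for c in classes:
            tc, wc = Counter(), 0.0
            for terms, w in weighted_docs:
                for t in terms:
                    tc[t] += w[c]
                wc += w[c]
            total = sum(tc.values())
            theta[c] = {t: (tc[t] + 1) / (total + len(vocab)) for t in vocab}
            prior[c] = (wc + 1) / (len(weighted_docs) + len(classes))
        return theta, prior

    def posteriors(terms, theta, prior):
        logs = {c: math.log(prior[c]) + sum(math.log(theta[c][t]) for t in terms)
                for c in classes}
        m = max(logs.values())
        exp = {c: math.exp(v - m) for c, v in logs.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}

    # initially: train a naive Bayes classifier using only the labeled data
    data = [(d, {c: 1.0 if c == lab else 0.0 for c in classes}) for d, lab in labeled]
    theta, prior = train(data)

    for _ in range(10):                                   # EM iterations
        # E-step: class probabilities Pr(c|d) for the unlabeled documents
        soft = [(d, posteriors(d, theta, prior)) for d in unlabeled]
        # M-step: re-estimate theta and priors from labeled + softly labeled documents
        theta, prior = train(data + soft)

    print(posteriors(["game", "price"], theta, prior))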
51 Relaxation labeling
- The hypertext model
- documents are nodes in a hypertext graph.
- There are other sources of information induced by
the links.
52 Relaxation labeling
- c = class, t = term, N = neighbors
- In supervised learning: Pr(t(d) | c)
- In hypertext, using the neighbors' terms: Pr(t(d), t(N(d)) | c)
- A better model, using the neighbors' classes: Pr(t(d), c(N(d)) | c)
- Circularity!
53 Relaxation labeling
- Resolving the circularity:
- Initially, assign Pr(0)(c|d) to each document d ∈ N(d1), where d1 is a test document (using the text-only classifier).
- Iterate, re-estimating the class probabilities of each document from its neighbors' current estimates.
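- A simplified sketch of such an iteration (all of it illustrative assumptions: the toy graph, the text-only probabilities, and a single "coupling" weight standing in for the full neighbor-class model):

    graph = {"d1": ["d2", "d3"], "d2": ["d1"], "d3": ["d1"]}   # node -> neighbors
    classes = ["sports", "finance"]

    # Pr(0)(c|d): initial class probabilities from the text-only classifier (made up)
    text_prob = {"d1": {"sports": 0.5, "finance": 0.5},
                 "d2": {"sports": 0.9, "finance": 0.1},
                 "d3": {"sports": 0.8, "finance": 0.2}}

    coupling = 0.7   # assumed probability that a neighbor shares the node's class

    prob = {d: dict(p) for d, p in text_prob.items()}
    for _ in range(20):                                        # relaxation iterations
        new = {}
        for d in graph:
            scores = {}
            for c in classes:
                s = text_prob[d][c]
                for nb in graph[d]:
                    # a neighbor supports class c in proportion to its current belief in c
                    s *= coupling * prob[nb][c] + (1 - coupling) * (1 - prob[nb][c])
                scores[c] = s
            z = sum(scores.values())
            new[d] = {c: v / z for c, v in scores.items()}
        prob = new

    print(prob["d1"])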
54 Social network analysis
- Social networks:
- between academics, by coauthoring and advising.
- between movie personnel, by directing and acting.
- between people, by making phone calls.
- between web pages, by hyperlinking to other web pages.
- Applications:
- Google (PageRank)
- HITS
55 Google (PageRank)
- PR(u) = ε/N + (1 - ε) · Σ_{v→u} PR(v) / OutDegree(v)
- where
- v→u means v links to u
- N = total number of nodes in the Web graph
- ε = the probability of a random jump
- Simulates a random walk on the web graph.
- Used as a score of popularity.
- The popularity score is precomputed, independent of the query.
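- A minimal sketch of computing these scores by power iteration over a toy web graph; ε = 0.15 is a conventional choice, not taken from the slides:

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}   # u -> pages u links to
    N = len(graph)
    eps = 0.15

    pr = {u: 1.0 / N for u in graph}
    for _ in range(50):
        new = {u: eps / N for u in graph}
        for v, outlinks in graph.items():
            for u in outlinks:
                # v -> u: v passes on its current score, split over its out-links
                new[u] += (1 - eps) * pr[v] / len(outlinks)
        pr = new

    print(pr)   # precomputed popularity scores, independent of any query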
56 Hyperlink induced topic search (HITS)
- Depends on a search engine (to obtain the initial set of pages).
- For each node u in the graph, calculate an authority score (au) and a hub score (hu):
- Initialize hu = au = 1
- Repeat until convergence:
- au = Σ_{v→u} hv
- hu = Σ_{u→v} av
- The score vectors a and h are normalized to 1 after each iteration.
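- A minimal sketch of the HITS iteration over a toy graph (the graph is a made-up example; in practice the nodes would come from a search-engine result set and its link neighborhood):

    import math

    graph = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["C", "B"]}   # u -> pages u links to
    nodes = list(graph)

    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(50):                                 # repeat until (near) convergence
        # a_u = sum of h_v over pages v that link to u
        auth = {u: sum(hub[v] for v in nodes if u in graph[v]) for u in nodes}
        # h_u = sum of a_v over pages v that u links to
        hub = {u: sum(auth[v] for v in graph[u]) for u in nodes}
        # normalize both score vectors to length 1
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values()))
            for u in nodes:
                scores[u] /= norm

    print(auth, hub)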
57 - Interesting pages include links to other interesting pages.
- The goal:
- many relevant pages
- few irrelevant pages
- fast
58 Conclusion
- Supervised learning
- Probabilistic models
- Unsupervised learning
- Techniques for clustering
- k-means (top-down)
- agglomerative (bottom-up)
- Techniques for dimensionality reduction
- LSI with SVD
- Random projections
- Semi-supervised learning
- The EM algorithm
- Relaxation labeling
59 References
- http://www.engr.sjsu.edu/knapp/HCIRDFSC/C/k_means.htm
- http://ei.cs.vt.edu/cs5604/cs5604cnCL/CL-illus.html
- http://www.cs.utexas.edu/users/inderjit/Datamining
- Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections (Cutting, Karger, Pedersen, Tukey)