Title: Data Mining For Hypertext: A Tutorial Survey
1 Data Mining for Hypertext: A Tutorial Survey
11/11/01
SDBI Winter 2001
- Based on a paper by
- Soumen Chakrabarti
- Indian Institute of Technology, Bombay
- Soumen_at_cse.iitb.ernet.in
- Lecture by
- Noga Kashti
- Efrat Daum
2 Let's start with definitions
- Hypertext - a collection of documents (or "nodes") containing cross-references or "links" which, with the aid of an interactive browser program, allow the reader to move easily from one document to another.
- Data Mining - analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.
3 Two Ways For Getting Information From The Web
- Clicking On Hyperlinks
- Searching Via Keyword Queries
4 Some History
- Before the popular Web, hypertext was already in use by the ACM, SIGIR, SIGLINK/SIGWEB and Digital Libraries communities.
- The old IR (Information Retrieval) deals with documents, whereas the Web deals with semi-structured data.
5 Some Numbers...
- The Web exceeds 800 million HTML pages on about three million servers.
- Almost a million pages are added daily.
- A typical page changes in a few months.
- Several hundred gigabytes change every month.
6 Difficulties With Accessing Information On The Web
- The usual problems of text search (synonymy, polysemy, context sensitivity) become much more severe.
- Semi-structured data.
- Sheer size and flux.
- No consistent standard or style.
7 The Old Search Process Is Often Unsatisfactory!
- Deficiency of scale.
- Poor accuracy (low recall and low precision).
8 Better Solutions: Data Mining And Machine Learning
- Natural language (NL) techniques.
- Statistical techniques for learning structure in various forms from text, hypertext and semi-structured data.
9 Issues We'll Discuss
- Models
- Supervised learning
- Unsupervised learning
- Semi-supervised learning
- Social network analysis
10 Models For Text
- Representations for text with statistical analyses only (bag-of-words):
- The vector space model
- The binary model
- The multinomial model
11 Models For Text (cont.)
- The vector space model
- Documents → tokens → canonical forms.
- Each canonical token is an axis in a Euclidean space.
- The t-th coordinate of d is n(d,t), the number of occurrences of t in d, where
- t is a term
- d is a document
12 The Vector Space Model: Normalize The Document Length To 1
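- A minimal Python sketch (not from the original slides) of building such document vectors and normalizing their length to 1; the whitespace tokenizer and the toy document are illustrative assumptions:

    import math
    from collections import Counter

    def to_unit_vector(text):
        # n(d,t): raw term counts for document d
        counts = Counter(text.lower().split())
        # normalize the document length (Euclidean norm) to 1
        norm = math.sqrt(sum(c * c for c in counts.values()))
        return {t: c / norm for t, c in counts.items()}

    print(to_unit_vector("data mining for hypertext hypertext"))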
13 More Models For Text
- The binary model: a document is a set of terms, which is a subset of the lexicon. Word counts are not significant.
- The multinomial model: a die with |T| faces. Every face (term) has a probability θt of showing up when tossed. Having decided on the total word count, the author tosses the die and writes down the term that shows up.
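- A small sketch (not from the slides) of the multinomial "author tosses a die" generative model; the lexicon and the face probabilities θt below are made-up values:

    import random

    lexicon = ["data", "mining", "hypertext", "web", "link"]
    theta = [0.35, 0.25, 0.20, 0.15, 0.05]   # one face probability per term, sums to 1

    def generate_document(length):
        # the author decides the total word count, then tosses the die once per
        # word and writes down the term that shows up
        return random.choices(lexicon, weights=theta, k=length)

    print(generate_document(10))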
14 Models For Hypertext
- Hypertext = text with hyperlinks.
- Varying levels of detail.
- Example: a directed graph (D, L)
- D: the set of nodes/documents/pages
- L: the set of links
15 Models For Semi-structured Data
- A point of convergence for the Web (documents) and database (data) communities.
16 Models For Semi-structured Data (cont.)
- E.g., topic directories with tree-structured hierarchies.
- Examples: Open Directory Project, Yahoo!
- Another representation: XML.
17 Supervised Learning (classification)
- Algorithm initialization: training data, where each item is marked with a label or class from a discrete finite set.
- Input: unlabeled data.
- Algorithm role: guess the data labels.
18 Supervised Learning (cont.)
- Example: topic directories.
- Advantages: help structure and restrict keyword search, can enable powerful searches.
19 Probabilistic Models For Text Learning
- Let c1,...,cm be m classes or topics with some training documents Dc.
- Prior probability of a class: Pr(c) = |Dc| / Σc' |Dc'|
- T: the universe of terms in all the training documents.
20 Probabilistic Models For Text Learning (cont.)
- Naive Bayes classification
- Assumption: for each class c there is a binary text generator model.
- Model parameters: Φc,t - the probability that a document in class c will mention term t at least once.
21 Naive Bayes classification (cont.)
- Pr(d|c) = Πt∈d Φc,t · Πt∉d (1 - Φc,t)
- Problems:
- Short documents are discouraged.
- The Pr(d|c) estimate is likely to be greatly distorted.
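- A minimal sketch of classification under the binary model; the classes, priors and Φc,t values below are made-up assumptions rather than estimates from real training data:

    import math

    # assumed parameters: Phi[c][t] = Pr(a class-c document mentions term t at least once)
    Phi = {
        "sports": {"game": 0.8, "team": 0.7, "stock": 0.05},
        "finance": {"game": 0.1, "team": 0.1, "stock": 0.9},
    }
    prior = {"sports": 0.5, "finance": 0.5}

    def classify(doc_terms):
        doc = set(doc_terms)                      # word counts are not significant
        scores = {}
        for c, phi in Phi.items():
            # log Pr(c) + log of the product over present and absent lexicon terms
            score = math.log(prior[c])
            for t, p in phi.items():
                score += math.log(p) if t in doc else math.log(1.0 - p)
            scores[c] = score
        return max(scores, key=scores.get)

    print(classify(["stock", "team"]))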
22 Naive Bayes classification (cont.)
- With the multinomial model: Pr(d|c) ∝ Πt θc,t^n(d,t)
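- A matching sketch for the multinomial model, scoring each class by log Pr(c) + Σt n(d,t)·log θc,t; again the θc,t values are illustrative assumptions:

    import math
    from collections import Counter

    # assumed parameters: theta[c][t] = Pr(the class-c die shows term t)
    theta = {
        "sports": {"game": 0.5, "team": 0.4, "stock": 0.1},
        "finance": {"game": 0.1, "team": 0.1, "stock": 0.8},
    }
    prior = {"sports": 0.5, "finance": 0.5}

    def classify(doc_terms):
        counts = Counter(doc_terms)               # n(d,t)
        scores = {c: math.log(prior[c]) +
                     sum(n * math.log(theta[c][t]) for t, n in counts.items())
                  for c in theta}
        return max(scores, key=scores.get)

    print(classify(["stock", "stock", "game"]))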
23 Naive Bayes classification (cont.)
- Problems:
- Again, short documents are discouraged.
- Inter-term correlation is ignored.
- Multiplicative Φc,t "surprise" factor.
- Conclusion:
- Both models are effective.
24 More Probabilistic Models For Text Learning
- Parameter smoothing and feature selection.
- Limited dependence modeling.
- The maximum entropy technique.
- Support vector machines (SVMs).
- Hierarchies over class labels.
25 Learning Relations
- Classification extension: a combination of statistical and relational learning.
- Improves accuracy.
- The ability to invent predicates.
- Can represent hyperlink graph structure and word statistics of neighbor documents.
- Learned rules will not be dependent on specific keywords.
26 Unsupervised learning
- Input: hypertext documents.
- Output: a hierarchy among the documents.
- What is a good clustering?
27 Basic clustering techniques
- Techniques for clustering:
- k-means
- hierarchical agglomerative clustering
28 Basic clustering techniques
- Document representations:
- unweighted vector space
- TFIDF vector space
- Similarity between two documents:
- cos(θ), where θ is the angle between their corresponding vectors
- the distance between the (length-normalized) vectors
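- A short sketch (assumed implementation over a toy corpus) of TFIDF vectors and the cosine similarity between two documents:

    import math
    from collections import Counter

    docs = ["data mining on the web", "mining hypertext data", "cooking recipes for the web"]

    def tfidf(doc, corpus):
        tf = Counter(doc.split())                 # term frequencies
        n = len(corpus)
        # weight each term by log(n / document frequency)
        return {t: c * math.log(n / sum(t in d.split() for d in corpus))
                for t, c in tf.items()}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    vecs = [tfidf(d, docs) for d in docs]
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))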
29 k-means clustering
- The k-means algorithm
- Input:
- d1,...,dn - a set of n documents
- k - the number of clusters desired (k ≤ n)
- Output:
- C1,...,Ck - k clusters covering the n documents
30 k-means clustering
- The k-means algorithm (cont.)
- Initial guess: k initial means m1,...,mk
- Until there are no changes in any mean:
- For each document d: d is in Ci if ||d - mi|| is the minimum of all the k distances.
- For 1 ≤ i ≤ k: replace mi with the mean of all the documents assigned to Ci.
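- A minimal Python sketch of this loop, run on toy 2-dimensional points instead of real document vectors (an illustrative assumption):

    import math, random

    def kmeans(points, k, iters=100):
        means = random.sample(points, k)                  # initial guess: k initial means
        clusters = [[] for _ in range(k)]
        for _ in range(iters):
            # assignment: d is in C_i if ||d - m_i|| is the minimum distance
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: math.dist(p, means[j]))
                clusters[i].append(p)
            # update: replace m_i with the mean of the documents in C_i
            new_means = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[i]
                         for i, cl in enumerate(clusters)]
            if new_means == means:                        # no changes in any mean
                break
            means = new_means
        return means, clusters

    pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
    print(kmeans(pts, k=2)[0])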
31 k-means clustering
- The k-means algorithm - Example: clusterings with K=2 and K=3 (figure).
32 k-means clustering (cont.)
- Problem:
- high dimensionality
- e.g., if each of 30,000 dimensions has only two possible values, the vector space size is 2^30000
- Solution:
- project out some dimensions
33 Agglomerative clustering
- Documents are merged into superdocuments or groups until only one group is left.
- Some definitions:
- s(d1,d2) - the similarity between documents d1 and d2
- s(A) - the self-similarity of group A (the average pairwise similarity s(d1,d2) over distinct d1, d2 ∈ A)
34 Agglomerative clustering
- The agglomerative clustering algorithm
- Input:
- d1,...,dn - a set of n documents
- Output:
- G - the final group, with a nested hierarchy
35 Agglomerative clustering (cont.)
- The agglomerative clustering algorithm
- Initially G = {G1,...,Gn}, where Gi = {di}
- While |G| > 1:
- Find A and B in G such that s(A ∪ B) is maximized
- G = (G \ {A,B}) ∪ {A ∪ B}
- Time: O(n^2)
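- A minimal sketch of this loop, using the average pairwise cosine similarity as s(A); the 2-dimensional toy vectors are made up:

    import math
    from itertools import combinations

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    def self_similarity(group):
        pairs = list(combinations(group, 2))
        return sum(cos(u, v) for u, v in pairs) / len(pairs)

    def agglomerate(docs):
        G = [[d] for d in docs]                        # initially each document is its own group
        merges = []
        while len(G) > 1:
            # find A, B in G such that s(A ∪ B) is maximized
            A, B = max(combinations(G, 2), key=lambda ab: self_similarity(ab[0] + ab[1]))
            G.remove(A); G.remove(B); G.append(A + B)  # G = (G \ {A,B}) ∪ {A ∪ B}
            merges.append(A + B)
        return merges

    docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
    print(agglomerate(docs))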
36 Agglomerative clustering (cont.)
- The agglomerative clustering algorithm - Example (figure).
37 Techniques from linear algebra
- Documents and terms are represented by vectors in Euclidean space.
- Applications of linear algebra to text analysis:
- Latent semantic indexing (LSI)
- Random projections
38 Co-occurring terms
39 Latent semantic indexing (LSI)
- Vector space model of documents:
- Let m = |T|, the lexicon size
- Let n = the number of documents
- Define A, the m×n term-by-document matrix,
- where aij = the number of occurrences of term i in document j.
40 Latent semantic indexing (LSI)
41 Singular Value Decomposition (SVD)
- Let A ∈ R^(m×n), m ≥ n, be a matrix.
- The singular value decomposition of A is the factorization A = U·D·V^T, where
- U and V are orthogonal: U^T·U = V^T·V = In
- D = diag(σ1,...,σn) with σi ≥ 0, 1 ≤ i ≤ n
- then,
- U = [u1,...,un]; u1,...,un are the left singular vectors
- V = [v1,...,vn]; v1,...,vn are the right singular vectors
- σ1,...,σn are the singular values of A.
42 Singular Value Decomposition (SVD)
- A·A^T = (U·D·V^T)·(V·D^T·U^T) = U·D·I·D·U^T = U·D^2·U^T
- ⇒ A·A^T·U = U·D^2 = [σ1^2·u1,...,σn^2·un]
- for 1 ≤ i ≤ n: A·A^T·ui = σi^2·ui
- ⇒ the columns of U are the eigenvectors of A·A^T.
- Similarly, A^T·A = V·D^2·V^T
- ⇒ the columns of V are the eigenvectors of A^T·A.
- The eigenvalues of A·A^T (or A^T·A) are σ1^2,...,σn^2.
43 Singular Value Decomposition (SVD)
- Let Ak = Σ_{i=1..k} σi·ui·vi^T be the k-truncated SVD.
- rank(Ak) = k
- ||A - Ak||2 ≤ ||A - Mk||2 for any matrix Mk of rank k.
44 Singular Value Decomposition (SVD)
45 LSI with SVD
- Define q ∈ R^m, a query vector.
- qi ≠ 0 if term i is a part of the query.
- Then A^T·q ∈ R^n is the answer vector.
- (A^T·q)j ≠ 0 if document j contains one or more terms in the query.
- How to do it better?
46 LSI with SVD
- Use Ak instead of A
- ⇒ calculate Ak^T·q
- Now, a query on "car" will return a document containing the word "auto".
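- A small numpy sketch (assumed implementation with a toy term-by-document matrix) of answering a query with the k-truncated Ak instead of A:

    import numpy as np

    terms = ["car", "auto", "engine", "recipe"]
    # A: m x n term-by-document matrix, a_ij = occurrences of term i in document j
    A = np.array([[2, 0, 1, 0],
                  [0, 2, 1, 0],
                  [1, 1, 2, 0],
                  [0, 0, 0, 3]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U D V^T
    k = 2
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # k-truncated SVD

    q = np.array([1.0, 0.0, 0.0, 0.0])                 # query on "car"
    print("A^T q :", A.T @ q)                          # scores only exact matches
    print("Ak^T q:", Ak.T @ q)                         # also scores the "auto" document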
47 Random projections
- Theorem:
- let v ∈ R^n be a unit vector
- H - a randomly oriented l-dimensional subspace through the origin
- X - the random variable measuring the square of the length of the projection of v on H
- then
- E[X] = l/n
- and X is sharply concentrated around l/n: for a suitably chosen ε, the probability that X deviates from l/n by more than a factor of (1 ± ε) is exponentially small in l.
48 Random projections
- A projection of a set of points to a randomly oriented subspace:
- small distortion in inter-point distances.
- The technique:
- reduces the dimensionality of the points
- speeds up the distance computations
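- A small numpy sketch (an assumed implementation, using a Gaussian random matrix rather than an explicitly sampled random subspace) showing that a random projection roughly preserves inter-point distances:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, l = 50, 1000, 100                     # 50 points, 1000 dims, project to 100 dims
    X = rng.normal(size=(n, d))

    # entries ~ N(0,1); dividing by sqrt(l) keeps projected lengths close to the originals
    R = rng.normal(size=(d, l)) / np.sqrt(l)
    Y = X @ R

    orig = np.linalg.norm(X[0] - X[1])
    proj = np.linalg.norm(Y[0] - Y[1])
    print(orig, proj, proj / orig)              # the ratio should be close to 1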
49 Semi-supervised learning
- Real-life applications typically have:
- a few labeled documents
- many unlabeled documents
- Between supervised and unsupervised learning.
50 Learning from labeled and unlabeled documents
- Expectation Maximization (EM) Algorithm:
- Initially, train a naive Bayes classifier using only the labeled data.
- Repeat EM iterations until near convergence:
- E-step: assign class probabilities Pr(c|d) to all unlabeled documents using the current θc,t estimates.
- M-step: re-estimate θc,t from all documents, weighting the unlabeled ones by their class probabilities.
- Classification error is reduced by a third in the best cases.
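- A compact sketch of this EM loop (a simplified, assumed implementation: multinomial naive Bayes with Laplace smoothing over a toy labeled/unlabeled split):

    import math
    from collections import Counter

    classes = ["sports", "finance"]
    labeled = [(["game", "team"], "sports"), (["stock", "market"], "finance")]
    unlabeled = [["game", "score"], ["stock", "price"]]
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}

    def train(weighted_docs):
        # weighted_docs: list of (terms, {class: weight}); Laplace-smoothed estimates
        theta, prior = {}, {}
        for c in classes:
            tc, wc = Counter(), 0.0
            for terms, w in weighted_docs:
                for t in terms:
                    tc[t] += w[c]
                wc += w[c]
            total = sum(tc.values())
            theta[c] = {t: (tc[t] + 1) / (total + len(vocab)) for t in vocab}
            prior[c] = (wc + 1) / (len(weighted_docs) + len(classes))
        return theta, prior

    def posteriors(terms, theta, prior):
        logs = {c: math.log(prior[c]) + sum(math.log(theta[c][t]) for t in terms)
                for c in classes}
        m = max(logs.values())
        exp = {c: math.exp(v - m) for c, v in logs.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}

    # initially: train a naive Bayes classifier using only the labeled data
    data = [(d, {c: 1.0 if c == lab else 0.0 for c in classes}) for d, lab in labeled]
    theta, prior = train(data)

    for _ in range(10):                                   # EM iterations
        # E-step: class probabilities Pr(c|d) for the unlabeled documents
        soft = [(d, posteriors(d, theta, prior)) for d in unlabeled]
        # M-step: re-estimate theta and priors from labeled + softly labeled documents
        theta, prior = train(data + soft)

    print(posteriors(["game", "price"], theta, prior))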
51 Relaxation labeling
- The hypertext model
- documents are nodes in a hypertext graph.
- There are other sources of information induced by
the links.
52 Relaxation labeling
- c = class, t = term, N = neighbors
- In supervised learning: Pr(t(d) | c)
- In hypertext, using the neighbors' terms: Pr(t(d), t(N(d)) | c)
- A better model, using the neighbors' classes: Pr(t(d), c(N(d)) | c)
- Circularity!
53 Relaxation labeling
- Resolving the circularity:
- Initially, assign Pr(0)(c|d) to each document d ∈ N(d1), where d1 is a test document (using the text-only classifier).
- Iterate, re-estimating the class probabilities of each document from its neighbors' current estimates.
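- A simplified sketch of such an iteration (all of it illustrative assumptions: the toy graph, the text-only probabilities, and a single "coupling" weight standing in for the full neighbor-class model):

    graph = {"d1": ["d2", "d3"], "d2": ["d1"], "d3": ["d1"]}   # node -> neighbors
    classes = ["sports", "finance"]

    # Pr(0)(c|d): initial class probabilities from the text-only classifier (made up)
    text_prob = {"d1": {"sports": 0.5, "finance": 0.5},
                 "d2": {"sports": 0.9, "finance": 0.1},
                 "d3": {"sports": 0.8, "finance": 0.2}}

    coupling = 0.7   # assumed probability that a neighbor shares the node's class

    prob = {d: dict(p) for d, p in text_prob.items()}
    for _ in range(20):                                        # relaxation iterations
        new = {}
        for d in graph:
            scores = {}
            for c in classes:
                s = text_prob[d][c]
                for nb in graph[d]:
                    # a neighbor supports class c in proportion to its current belief in c
                    s *= coupling * prob[nb][c] + (1 - coupling) * (1 - prob[nb][c])
                scores[c] = s
            z = sum(scores.values())
            new[d] = {c: v / z for c, v in scores.items()}
        prob = new

    print(prob["d1"])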
54 Social network analysis
- Social networks:
- between academics, by coauthoring and advising.
- between movie personnel, by directing and acting.
- between people, by making phone calls.
- between web pages, by hyperlinking to other web pages.
- Applications:
- Google (PageRank)
- HITS
55 Google (PageRank)
- PR(u) = ε/N + (1 - ε) · Σ_{v→u} PR(v) / OutDegree(v)
- where
- v→u means v links to u
- N = total number of nodes in the Web graph
- ε = the probability of a random jump
- Simulates a random walk on the web graph.
- Used as a score of popularity.
- The popularity score is precomputed, independent of the query.
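- A minimal sketch of computing these scores by power iteration over a toy web graph; ε = 0.15 is a conventional choice, not taken from the slides:

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}   # u -> pages u links to
    N = len(graph)
    eps = 0.15

    pr = {u: 1.0 / N for u in graph}
    for _ in range(50):
        new = {u: eps / N for u in graph}
        for v, outlinks in graph.items():
            for u in outlinks:
                # v -> u: v passes on its current score, split over its out-links
                new[u] += (1 - eps) * pr[v] / len(outlinks)
        pr = new

    print(pr)   # precomputed popularity scores, independent of any query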
56 Hyperlink induced topic search (HITS)
- Depends on a search engine (to obtain the initial set of pages).
- For each node u in the graph, calculate an authority score (au) and a hub score (hu):
- Initialize hu = au = 1
- Repeat until convergence:
- au = Σ_{v→u} hv
- hu = Σ_{u→v} av
- The score vectors a and h are normalized to 1 after each iteration.
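- A minimal sketch of the HITS iteration over a toy graph (the graph is a made-up example; in practice the nodes would come from a search-engine result set and its link neighborhood):

    import math

    graph = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["C", "B"]}   # u -> pages u links to
    nodes = list(graph)

    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(50):                                 # repeat until (near) convergence
        # a_u = sum of h_v over pages v that link to u
        auth = {u: sum(hub[v] for v in nodes if u in graph[v]) for u in nodes}
        # h_u = sum of a_v over pages v that u links to
        hub = {u: sum(auth[v] for v in graph[u]) for u in nodes}
        # normalize both score vectors to length 1
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values()))
            for u in nodes:
                scores[u] /= norm

    print(auth, hub)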
57 - Interesting pages include links to other interesting pages.
- The goal:
- many relevant pages
- few irrelevant pages
- fast
58 Conclusion
- Supervised learning
- Probabilistic models
- Unsupervised learning
- Techniques for clustering
- k-means (top-down)
- agglomerative (bottom-up)
- Techniques for dimensionality reduction
- LSI with SVD
- Random projections
- Semi-supervised learning
- The EM algorithm
- Relaxation labeling
59 References
- http://www.engr.sjsu.edu/knapp/HCIRDFSC/C/k_means.htm
- http://ei.cs.vt.edu/cs5604/cs5604cnCL/CL-illus.html
- http://www.cs.utexas.edu/users/inderjit/Datamining
- Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections (Cutting, Karger, Pedersen, Tukey)