Web-based Information Architectures - PowerPoint PPT Presentation

About This Presentation

Title:

Web-based Information Architectures

Description:

Web-based Information Architectures Jian Zhang – PowerPoint PPT presentation

Number of Views:74

Avg rating:3.0/5.0

Slides: 30

Provided by: cji1

Learn more at: https://www.andrew.cmu.edu

Category:

more less

Transcript and Presenter's Notes

Title: Web-based Information Architectures

1
Web-based Information Architectures

Jian Zhang

2
Todays Topics

Term Weighting Scheme
Vector Space Model GVSM
Evaluation of IR
Rocchio Feedback
Web Spider Algorithm
Text Mining Named Entity Identification
Data Mining
Text Categorization (kNN)

3
Term Weighting Scheme

TW TF IDF
TF part f1(tf(term, doc))
IDF part f2(idf(term)) f2(N/df(term))
E.g., f1(tf) normalized_tf tf/max_tf
f2(idf) log2(idf)
E.g, f1(tf) tf f2(idf) 1
NOTE definition of DF!

4
Document Query Representation

Bag of words, Vector Space Model(VSM)
Word Normalization
Stopwords removal
Stemming
Proximity phrases
Each element of the vector is the Term Weight of
that term w.r.t the document/query.

5
Similarity Measure

Dot Product

6
Similarity Measure

Cosine Similarity

7
Information Retrieval

Basic assumption Shared words between query and
document
Similarity measures
Dot product
Cosine similarity (normalized)

8
Evaluation

Recall a/(ac)
Precision a/(ab)
F12.0recallprecision / (recallprecision)
Accuracy Bad for IR,

9
Refinement of VSM

Query expansion
Relevance Feedback
Rocchio Formula
Alpha, beta, gamma and their meanings

10
Generalized Vector Space Model

Given a collection of training data, present each
term as a n-dimensional vector

D1 D2 Dj Dn
T1 w11 w12 w1j w1n
T2 w21 w22 w2j w2n

Ti wi1 wi2 wij win

Tm wm1 wm2 wmj wmn
11
GVSM (2)

Define similarity between term ti and tj
Sim(ti, tj) cos(ti, tj)
Similarity between qury and document is based on
the term-term similarity
For each query term qi, find the term tD in the
document D that is most similar to qi. This value
viD, can be considered as the similarity between
a sigle word query qi and the document D.
Sum up the similarities between each query term
and the document D. This is considered the
similarity between the query and the document D.

12
GVSM (3)

Sim(Q,D) SiMaxj(sim(qi, dj)
or normalizing for document query length
Simnorm(Q, D)

13
Maximal Marginal Relevance

Redundancy reduction
Getting more novel things
Formula
MMR(Q, C, R)
Argmaxkdi in C?S(Q, di) - (1-?)maxdj in R (S(di,
dj))

14
MMR Example (Summarization)
Full Text
S1
Summary
Query
S2
S1
S3
S3
S4
S4
S5
S6
15
MMR Example (Summarization)Select first
sentence ?0.7
Full Text
0.4
S1
Query
S2
0.3
Summary
0.6
S3
S3
0.2
S4
0.2
S5
Sim(Q, S) Q . S / (QS)
0.3
S6
16
MMR Example (Summarization)Select second sentence
Full Text
S1
Query
S2
Summary
0.1
S3
0.15
S1
S3
0.2
S4
S3
0.5
S5
0.5
S6
17
MMR Example (Summarization)Select third sentence
Full Text
S1
Query
S2
Summary
0.2
S3
S1
S1
0.1
S4
S3
0.4
S5
S4
0.6
S6
18
Text Categorization

Task
You want to classify a document to some
categories automatically. For example, the
categories are "weather" and "sport".
To do that, you can use kNN algorithm.
To use kNN, you need a collection of documents,
each of them is labeled to some categories by
human.

19
Text Categorization

Procedure
Using VSM represent each document in the training
data
Using VSM represent the document to be
categorized (new document).
Use cosine (or some other measures, but cosine is
good here, why) find top k documents (k nearest
neighbors ) in the training data that are similar
to the new document.
Decide from the k nearest neighbors what are the
categories for the new document

20
Web Spider

The web graph at any instant of time contains
k-connected subgraphs
The spider algorithm given in class is a depth
first search through a web subgraph
Avoiding respidering the same page
Completeness is not guaranteed. Partial solution
is to get seed URLs as diverse as possible.

21
Web Spider