1
Web-based Information Architectures
  • Jian Zhang

2
Today's Topics
  • Term Weighting Scheme
  • Vector Space Model, GVSM
  • Evaluation of IR
  • Rocchio Feedback
  • Web Spider Algorithm
  • Text Mining: Named Entity Identification
  • Data Mining
  • Text Categorization (kNN)

3
Term Weighting Scheme
  • TW = TF × IDF
  • TF part: f1(tf(term, doc))
  • IDF part: f2(idf(term)) = f2(N/df(term))
  • E.g., f1(tf) = normalized_tf = tf/max_tf,
    f2(idf) = log2(idf)
  • E.g., f1(tf) = tf, f2(idf) = 1
  • NOTE the definition of DF (document frequency)!
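As a concrete illustration, a minimal Python sketch of the TW = TF × IDF scheme with the normalized-TF and log2 variants above; the toy corpus and whitespace tokenization are hypothetical:

import math
from collections import Counter

def term_weights(doc_tokens, df, n_docs):
    """TW = f1(tf) * f2(idf), with f1 = tf/max_tf and f2 = log2(N/df)."""
    tf = Counter(doc_tokens)
    max_tf = max(tf.values())
    return {term: (freq / max_tf) * math.log2(n_docs / df[term])
            for term, freq in tf.items()}

# Hypothetical toy corpus; df counts how many documents contain each term.
docs = [["rain", "wind", "rain"], ["rain", "goal"], ["goal", "match"]]
df = Counter(t for d in docs for t in set(d))
print(term_weights(docs[0], df, n_docs=len(docs)))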

4
Document / Query Representation
  • Bag of words: Vector Space Model (VSM)
  • Word normalization
  • Stopword removal
  • Stemming
  • Proximity phrases
  • Each element of the vector is the term weight of
    that term w.r.t. the document/query.
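A minimal sketch of this normalization pipeline; the stopword list and suffix stripper below are illustrative stand-ins (a real system would use a full stopword list and a Porter-style stemmer):

STOPWORDS = {"the", "a", "of", "is", "and", "to"}  # illustrative subset

def crude_stem(word):
    # Toy stemmer: strip a few common suffixes; a stand-in for Porter.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def normalize(text):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(normalize("The spider is crawling the linked pages"))
# -> ['spider', 'crawl', 'link', 'page']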

5
Similarity Measure
  • Dot Product: Sim(Q, D) = Q · D = Σi (qi × di)

6
Similarity Measure
  • Cosine Similarity: Sim(Q, D) = (Q · D) / (|Q| × |D|)

7
Information Retrieval
  • Basic assumption: shared words between the query
    and the document
  • Similarity measures:
  • Dot product
  • Cosine similarity (normalized)
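Both measures, sketched over sparse term-weight vectors (a dict-of-weights representation is assumed here):

import math

def dot(q, d):
    """Sim(Q, D) = sum of qi * di over the shared terms."""
    return sum(w * d[t] for t, w in q.items() if t in d)

def cosine(q, d):
    """Dot product normalized by the two vector lengths."""
    norm = math.sqrt(sum(w * w for w in q.values())) * \
           math.sqrt(sum(w * w for w in d.values()))
    return dot(q, d) / norm if norm else 0.0

q = {"rain": 0.6, "wind": 0.8}
d = {"rain": 0.5, "goal": 0.5}
print(dot(q, d), cosine(q, d))  # 0.3 and roughly 0.42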

8
Evaluation
  • Recall = a/(a+c)
  • Precision = a/(a+b)
  • F1 = 2.0 × recall × precision / (recall + precision)
  • (a = relevant retrieved, b = non-relevant
    retrieved, c = relevant not retrieved)
  • Accuracy: bad for IR, because almost all documents
    are non-relevant to any query, so a system that
    retrieves nothing scores near-perfect accuracy.
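A sketch of these measures computed from the contingency counts above, under the usual convention that a = relevant and retrieved, b = retrieved but not relevant, c = relevant but missed:

def evaluate(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # relevant and retrieved
    b = len(retrieved - relevant)   # retrieved but not relevant
    c = len(relevant - retrieved)   # relevant but missed
    recall = a / (a + c) if a + c else 0.0
    precision = a / (a + b) if a + b else 0.0
    f1 = (2.0 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    return recall, precision, f1

print(evaluate(retrieved=["d1", "d2", "d3"], relevant=["d1", "d4"]))
# -> (0.5, 0.3333..., 0.4)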

9
Refinement of VSM
  • Query expansion
  • Relevance Feedback
  • Rocchio Formula
  • Alpha, beta, gamma and their meanings (the weights
    on the original query, the relevant documents, and
    the non-relevant documents, respectively)
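A sketch of the Rocchio update over sparse vectors; the alpha, beta, gamma defaults below are common illustrative values, not ones prescribed by the slides:

from collections import defaultdict

def rocchio(query, rel_docs, nonrel_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Q' = alpha*Q + beta*mean(relevant) - gamma*mean(non-relevant)."""
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for d in rel_docs:
        for t, w in d.items():
            new_q[t] += beta * w / len(rel_docs)
    for d in nonrel_docs:
        for t, w in d.items():
            new_q[t] -= gamma * w / len(nonrel_docs)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

print(rocchio({"rain": 1.0},
              rel_docs=[{"rain": 1.0, "storm": 0.8}],
              nonrel_docs=[{"goal": 1.0}]))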

10
Generalized Vector Space Model
  • Given a collection of training data, represent each
    term as an n-dimensional vector

        D1    D2   ...   Dj   ...   Dn
  T1    w11   w12  ...   w1j  ...   w1n
  T2    w21   w22  ...   w2j  ...   w2n
  ...
  Ti    wi1   wi2  ...   wij  ...   win
  ...
  Tm    wm1   wm2  ...   wmj  ...   wmn
11
GVSM (2)
  • Define the similarity between terms ti and tj:
  • Sim(ti, tj) = cos(ti, tj)
  • Similarity between query and document is based on
    the term-term similarity:
  • For each query term qi, find the term tD in the
    document D that is most similar to qi. This value,
    viD, can be considered the similarity between a
    single-word query qi and the document D.
  • Sum up the similarities between each query term
    and the document D. This is considered the
    similarity between the query and the document D.

12
GVSM (3)
  • Sim(Q, D) = Σi maxj sim(qi, dj)
  • or, normalizing for document/query length,
  • Simnorm(Q, D)
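A sketch of the unnormalized form, assuming each term's vector is its row of the term-by-document matrix from slide 10; the toy vectors below are hypothetical:

import math

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def gvsm_sim(query_terms, doc_terms, term_vectors):
    """Sim(Q, D) = sum over query terms qi of max_j cos(qi, dj)."""
    return sum(max(cos(term_vectors[qi], term_vectors[dj])
                   for dj in doc_terms)
               for qi in query_terms)

# Hypothetical rows of the term-by-document weight matrix.
term_vectors = {"storm": [0.9, 0.1], "rain": [0.8, 0.2], "goal": [0.1, 0.9]}
print(gvsm_sim(["storm"], ["rain", "goal"], term_vectors))  # ~0.99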

13
Maximal Marginal Relevance
  • Redundancy reduction
  • Getting more novel results
  • Formula:
  • MMR(Q, C, R) =
    Argmax_{di in C} [ λ·S(Q, di) - (1-λ)·max_{dj in R} S(di, dj) ]
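A sketch of the greedy MMR selection loop; C is the candidate set, R (here selected) is what has already been chosen, sim stands in for S (e.g., cosine), and lam is λ:

def mmr_select(query, candidates, k, sim, lam=0.7):
    """Greedily pick k items maximizing
    lam*S(Q, di) - (1-lam)*max over dj in R of S(di, dj)."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(di):
            redundancy = max((sim(di, dj) for dj in selected), default=0.0)
            return lam * sim(query, di) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected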

14
MMR Example (Summarization)
[Diagram: a Query, a Full Text consisting of sentences S1-S6, and a
Summary to be built from them.]
15
MMR Example (Summarization): Select first sentence, λ = 0.7
[Diagram: the query is compared against each sentence of the full text,
with Sim(Q, S) = Q · S / (|Q| |S|); the scores are S1 = 0.4, S2 = 0.3,
S3 = 0.6, S4 = 0.2, S5 = 0.2, S6 = 0.3. S3 scores highest and becomes
the first sentence of the summary.]
16
MMR Example (Summarization): Select second sentence
[Diagram: each remaining sentence is scored against the query and
penalized by its similarity to the already-selected S3 (similarity
values shown: 0.1, 0.15, 0.2, 0.5, 0.5); S1 wins the MMR trade-off,
so the summary becomes S3, S1.]
17
MMR Example (Summarization): Select third sentence
[Diagram: the remaining sentences are scored against the query and the
current summary S3, S1 (similarity values shown: 0.2, 0.1, 0.4, 0.6);
S4 is added, so the summary becomes S3, S1, S4.]
18
Text Categorization
  • Task:
  • You want to classify a document into some
    categories automatically. For example, the
    categories are "weather" and "sport".
  • To do that, you can use the kNN algorithm.
  • To use kNN, you need a collection of documents,
    each of them labeled with its categories by
    humans.

19
Text Categorization
  • Procedure:
  • Using VSM, represent each document in the
    training data.
  • Using VSM, represent the document to be
    categorized (the new document).
  • Using cosine similarity (or other measures, but
    cosine works well here; why?), find the top k
    documents (the k nearest neighbors) in the
    training data that are most similar to the new
    document.
  • Decide from the k nearest neighbors what the
    categories of the new document are; a sketch
    follows below.
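A sketch of this procedure over sparse VSM vectors, deciding by simple majority vote among the neighbors (weighted voting is another common choice):

import math
from collections import Counter

def cosine(q, d):
    num = sum(w * d.get(t, 0.0) for t, w in q.items())
    den = math.sqrt(sum(w * w for w in q.values())) * \
          math.sqrt(sum(w * w for w in d.values()))
    return num / den if den else 0.0

def knn_categorize(new_doc, training, k=3):
    """training is a list of (vector, category) pairs."""
    neighbors = sorted(training, key=lambda pair: cosine(new_doc, pair[0]),
                       reverse=True)[:k]
    votes = Counter(category for _, category in neighbors)
    return votes.most_common(1)[0][0]

training = [({"rain": 1.0}, "weather"), ({"wind": 1.0}, "weather"),
            ({"goal": 1.0}, "sport")]
print(knn_categorize({"rain": 0.7, "wind": 0.3}, training))  # weather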

20
Web Spider
  • The web graph at any instant of time contains
    k-connected subgraphs.
  • The spider algorithm given in class is a
    depth-first search through a web subgraph.
  • Avoid re-spidering the same page.
  • Completeness is not guaranteed. A partial solution
    is to choose seed URLs as diverse as possible.

21
Web Spider
  • PROCEDURE SPIDER4(G, SEEDS)
  • Initialize COLLECTION <big file of URL-page pairs>
  • Initialize VISITED <big hash-table>
  • For every ROOT in SEEDS
  • Initialize STACK <stack data structure>
  • Let STACK = push(ROOT, STACK)
  • While STACK is not empty,
  • Do URLcurr = pop(STACK)
  • Until URLcurr is not in VISITED
  • insert-hash(URLcurr, VISITED)
  • PAGE = look-up(URLcurr)
  • STORE(<URLcurr, PAGE>, COLLECTION)
  • For every URLi in PAGE,
  • push(URLi, STACK)
  • Return COLLECTION
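A Python rendering of SPIDER4 under the same logic; look_up and extract_urls are caller-supplied stand-ins, since the slide leaves page fetching and link extraction abstract:

def spider4(seeds, look_up, extract_urls):
    """Depth-first crawl from each seed, skipping visited URLs."""
    collection = []   # the <big file of URL-page pairs>
    visited = set()   # the <big hash-table>
    for root in seeds:
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:
                continue          # pop until URLcurr is not in VISITED
            visited.add(url)
            page = look_up(url)
            collection.append((url, page))
            stack.extend(extract_urls(page))
    return collection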

22
Text Mining
  • Components of Text Mining:
  • Categorization by topic or genre
  • Fact extraction from text
  • Data mining from DBs or extracted facts

23
Fact extraction from text
  • Named Entity Identification
  • FSA/FST, HMM
  • Role-Situated Named Entities
  • Apply context information
  • Information Extraction
  • Template matching

24
Named Entity Identification
  • Definition of a Finite State Acceptor (FSA):
  • With an input source (e.g., a string of words)
  • Outputs "YES" or "NO"
  • Definition of a Finite State Transducer (FST):
  • An FSA with variable binding
  • Outputs "NO", or "YES" plus variable bindings
  • Variable bindings encode the recognized entity
  • e.g., "YES <firstname Hideto> <lastname Suzuki>"

25
Named Entity Identification
  • Example: identify numbers
  • 1, 2.0, -3.22, 3e2, 4e-5
  • D = {0,1,2,3,4,5,6,7,8,9}

[State diagram: from Start, an optional '-', then one or more digits D,
optionally followed by '.' and more digits, optionally followed by 'e',
an optional '-', and more digits.]
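One standard way to realize this acceptor is a regular expression over the same alphabet; the pattern below mirrors the diagram (optional sign, digits, optional fraction, optional exponent):

import re

# Optional '-', one or more digits, optional '.digits', optional 'e[-]digits'.
NUMBER = re.compile(r"-?\d+(\.\d+)?(e-?\d+)?$")

def accept(s):
    """FSA-style acceptor: outputs YES or NO for the whole input string."""
    return "YES" if NUMBER.match(s) else "NO"

for s in ["1", "2.0", "-3.22", "3e2", "4e-5", "abc"]:
    print(s, accept(s))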
26
Data Mining
  • Learning by caching
  • What/when to cache
  • When to use/invalidate/update the cache
  • Learning from Examples
  • (a.k.a. "Supervised" learning)
  • Labeled examples for training
  • Learn the mapping from examples to labels
  • E.g., Naive Bayes, Decision Trees, ...
  • Text Categorization (using kNN or other means)
    is a learning-from-examples task.

27
Data Mining
  • "Speedup" Learning
  • Tuning search heuristics from experience
  • Inducing explicit control knowledge
  • Analogical learning (generalized instances)
  • Optimization "policy" learning
  • Predicting continuous objective function
  • E.g. Regression, Reinforcement, ...
  • New Pattern Discovery
  • (a.k.a. "Unsupervised" Learning)
  • Finding meaningful correlations in data
  • E.g. association rules, clustering, ...

28
Generalize vs. Specialize
  • Generalize:
  • First, each record in your database is a RULE.
  • Then, generalize (how? when to stop?)
  • Specialize:
  • First, give a very general rule (almost useless).
  • Then, specialize (how? when to stop?)

29
Methods for Supervised DM
  • Classifiers
  • Linear Separators (regression)
  • Naive Bayes (NB)
  • Decision Trees (DTs)
  • k-Nearest Neighbor (kNN)
  • Decision rule induction
  • Support Vector Machines (SVMs)
  • Neural Networks (NNs) ...