Title: CS 277 Data Mining, Lecture 2: Text Mining and Information Retrieval (David Newman, UC Irvine)

1. CS 277 Data Mining
Lecture 2: Text Mining and Information Retrieval
- David Newman
- Department of Computer Science
- University of California, Irvine
2. Notices
- Homework 1 available. Due Tuesday Oct 9.
- Added to list of ideas for projects (link on class website)
- In 2 weeks (Oct 16): project proposal due
3. Lecture Topics in Text Mining
- Information Retrieval
- Text Classification
- Text Clustering
- Information Extraction
4. Text Mining Applications
- Information Retrieval
  - Query-based search of large text archives, e.g., the Web
- Text Classification
  - Automated assignment of topics to Web pages, e.g., Yahoo, Google
  - Automated classification of email into spam and non-spam (ham)
- Text Clustering
  - Automated organization of search results into categories in real time
  - Discovery of clusters and trends in technical literature (e.g., CiteSeer)
- Information Extraction
  - Extracting standard fields from free text
    - extracting names and places from reports, newspapers
    - extracting resume information automatically from resumes
5. Text Mining
- Information Retrieval
- Text Classification
- Text Clustering
- Information Extraction
6. General Concepts in Information Retrieval
- Representation language
  - Typically a vector of W features
  - Images: set of color, intensity, texture, gradient features characterizing images
  - Text: word counts
- Data: set of D objects
  - Typically represented as a D x W matrix
- Query q
  - User poses a query to search the data set
  - Query is expressed in the same representation language as the data
    - each text document is a set of words that occur in the document
    - query q is also expressed as a set of words, e.g., "data" and "mining"
7. Query by Content
- Traditional DB query: exact matches
  - query q: level = MANAGER AND age < 30
- Or, Boolean match on text
  - query "Irvine AND fun": return all docs with Irvine and fun
  - Not useful when there are many matches
    - "data mining" in Google returns 60 million documents
- Query-by-content: query more general / less precise
  - What record is most similar to a query q?
  - For text data, often called information retrieval (IR)
  - Can also be used for images, sequences, video
  - q can itself be an object (a document) or a shorter version (1 word)
- Goal
  - Match query q to the D objects in the database
  - Return a ranked list of the most relevant objects in the data set given q
8. Issues in Query by Content
- What representation language to use?
- How to measure similarity between q and each d in D?
- How to compute the results in real time?
- How to rank the results for the user?
- Allowing user feedback (query modification)
- How to evaluate and compare different IR algorithms/systems?
9. Text Retrieval
- document = book, paper, WWW page, ...
- term = word, word-pair, phrase, ... (often W ~ 100,000)
- query q = set of terms, e.g., "data mining"
- NLP (natural language processing) is too hard, so
  - want a (vector) representation for text which
    - retains maximum useful semantics
    - supports efficient distance computations between docs and q
- term weights
  - Boolean (term in document or not): bag of words
  - real-valued (frequency of term in doc relative to all docs), ...
- notice: loses word order, sentence structure
10. Processing one document
- Pipeline: tokenize -> stem -> vocab filter
11. Practical Issues
- Tokenization (see the sketch below)
  - Convert document to word counts
  - word token = any nonempty sequence of characters
  - for HTML (etc.) need to remove formatting
- Canonical forms, stopwords, stemming
  - Remove capitalization
  - Stopwords
    - remove very frequent words (the, and); can use a standard list
    - can also remove very rare words
  - Stemming (next slide)
- Data representation
  - 3-column <docID termID position>
  - Inverted index (faster)
    - list of sorted <termID docID> pairs; useful for finding docs containing certain terms
    - equivalent to a sparse representation of the term x doc matrix
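A minimal Python sketch of these steps (tokenize, lowercase, drop stopwords, count), assuming an illustrative regex tokenizer and a tiny stopword list rather than the course-provided tools:

    import re
    from collections import Counter

    # Tiny illustrative stopword list; a real system would use a standard one.
    STOPWORDS = {"the", "and", "a", "an", "of", "to", "in", "is"}

    def tokenize(text):
        """Strip HTML tags, lowercase, and split into word tokens."""
        text = re.sub(r"<[^>]+>", " ", text)           # crude HTML removal
        return re.findall(r"[a-z0-9]+", text.lower())  # token = run of alphanumerics

    def to_word_counts(text, min_count=1):
        """Convert one document to {term: count}, dropping stopwords and rare terms."""
        counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
        return {t: c for t, c in counts.items() if c >= min_count}

    print(to_word_counts("<p>The Internet is an engineering feat of no small magnitude.</p>"))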
12. Stemming
- May want to reduce all morphological variants of a word to a single index term
  - a document containing the words "fish" and "fisher" will not be retrieved by a query containing "fishing" (no "fishing" explicitly contained in the document)
- Stemming: reduce words to their root form
  - "fish" becomes a new index term
- Porter stemming algorithm (1980) (see the sketch below)
  - relies on a preconstructed suffix list with associated rules
    - e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
    - BINARIZATION -> BINARIZE
  - Not always desirable: university, universal -> univers (in Porter's)
- WordNet: dictionary-based approach
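One convenient way to experiment with Porter stemming (the homework uses porter.pl; this sketch assumes the Python NLTK package is installed instead):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # Try the slide's examples; e.g. "fishing" reduces to "fish".
    for word in ["fish", "fisher", "fishing", "binarization", "university", "universal"]:
        print(word, "->", stemmer.stem(word))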
13. Porter Stemmer
- Aside: You will use the Porter Stemmer in Homework 1. I have provided you with porter.pl
- If you don't already have access to perl, it is available on your unix account
14. Porter Stemmer (example)

Input:
- The Internet is an engineering feat of no small magnitude. It operates using a method of data transmission invented in the 1960s called packet switching. Send e-mail, for instance, and your message is broken into any number of little bundles each takes a separate route and reunites with the others at the destination.

Output:
- the internet is an engin feat of no small magnitud. it oper us a method of data transmiss invent in the 1960s call packet switch. send e mail, for instanc, and your messag is broken into ani number of littl bundl each take a separ rout and reunit with the other at the destin.

Q: Do you think Google uses stemming?
Q: What might stemming be good for?
15. Toy example of a document-term matrix
16. Inverted index
- Queries (see the sketch below)
  - q1: "database"
  - q2: "database schema"
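A small sketch of an inverted index in Python, with made-up documents; queries q1 and q2 become intersections over posting lists:

    from collections import defaultdict

    # Made-up toy documents (docID -> text).
    docs = {
        1: "database systems and database design",
        2: "the relational database schema",
        3: "text mining and information retrieval",
    }

    # Inverted index: term -> set of docIDs containing that term.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    def boolean_and(query):
        """Return docIDs containing every query term (Boolean AND)."""
        postings = [index.get(t, set()) for t in query.lower().split()]
        return sorted(set.intersection(*postings)) if postings else []

    print(boolean_and("database"))         # q1 -> [1, 2]
    print(boolean_and("database schema"))  # q2 -> [2]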
17. Distance metric
- Vector-space model
  - each document is a 1 x W vector d
- What is a suitable distance metric between q and d?
  - -> whiteboard
18. Distance metric: cosine similarity
- Go with cosine similarity (see the sketch below)
  - sim(q, d) = <q, d> / (||q|| ||d||) = cos(theta)
  - slightly unnatural: using inner products and norms from L2
- Bias
  - favors shorter documents?
  - or longer documents?
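A direct Python transcription of this formula (pure Python, no external libraries assumed):

    import math

    def cosine_similarity(q, d):
        """sim(q, d) = <q, d> / (||q|| * ||d||), for equal-length vectors."""
        dot = sum(qi * di for qi, di in zip(q, d))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        norm_d = math.sqrt(sum(di * di for di in d))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

    # Binary query for terms t1 and t3 against d1's term-frequency row (toy data):
    print(round(cosine_similarity([1, 0, 1, 0, 0, 0], [24, 21, 9, 0, 0, 3]), 2))  # ~0.70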
19. Distance matrices for toy document-term data

TF doc-term matrix:

        t1  t2  t3  t4  t5  t6
  d1    24  21   9   0   0   3
  d2    32  10   5   0   3   0
  d3    12  16   5   0   0   0
  d4     6   7   2   0   0   0
  d5    43  31  20   0   3   0
  d6     2   0   0  18   7  16
  d7     0   0   1  32  12   0
  d8     3   0   0  22   4   2
  d9     1   0   0  34  27  25
  d10    6   0   0  17   4  23

[Figure: 10 x 10 Euclidean distance matrix and cosine distance matrix between the documents]
20. q = "database schema"
21. TF-IDF Term Weighting Schemes
- Not all terms in a query or document may be equally important...
- TF (term frequency): term weight = number of times the term occurs in that document
  - problem: a term common to many docs gives low discrimination
- IDF (inverse document frequency of a term)
  - n_j documents contain term j, D documents in total
  - IDF_j = log(D / n_j)
  - favors terms that occur in relatively few documents
- TF-IDF weight = TF(term) x IDF(term)
- No real theoretical basis, but works well empirically and widely used
22. TF-IDF Example

TF doc-term matrix with per-term IDF weights:

        t1   t2   t3   t4   t5   t6
  d1    24   21    9    0    0    3
  d2    32   10    5    0    3    0
  d3    12   16    5    0    0    0
  d4     6    7    2    0    0    0
  d5    43   31   20    0    3    0
  d6     2    0    0   18    7   16
  d7     0    0    1   32   12    0
  d8     3    0    0   22    4    2
  d9     1    0    0   34   27   25
  d10    6    0    0   17    4   23
  idf  0.1  0.7  0.5  0.7  0.4  0.7

TF-IDF doc-term matrix (first rows):

        t1    t2    t3   t4   t5   t6
  d1   2.5  14.6   4.6    0    0  2.1
  d2   3.4   6.9   2.6    0  1.1    0
  d3   1.3  11.1   2.6    0    0    0
  d4   0.6   4.9   1.0    0    0    0
  d5   4.5  21.5  10.2    0  1.1    0
  ...

IDF weights = log(D / n_j) = (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)
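A short sketch that recomputes these weights from the TF matrix, assuming the natural-log form IDF_j = log(D / n_j) used above:

    import math

    tf = [
        [24, 21,  9,  0,  0,  3], [32, 10,  5,  0,  3,  0],
        [12, 16,  5,  0,  0,  0], [ 6,  7,  2,  0,  0,  0],
        [43, 31, 20,  0,  3,  0], [ 2,  0,  0, 18,  7, 16],
        [ 0,  0,  1, 32, 12,  0], [ 3,  0,  0, 22,  4,  2],
        [ 1,  0,  0, 34, 27, 25], [ 6,  0,  0, 17,  4, 23],
    ]
    D, W = len(tf), len(tf[0])

    # n_j = number of documents containing term j
    n = [sum(1 for d in range(D) if tf[d][j] > 0) for j in range(W)]
    idf = [math.log(D / n_j) for n_j in n]
    tfidf = [[tf[d][j] * idf[j] for j in range(W)] for d in range(D)]

    print([round(x, 1) for x in idf])       # approx (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)
    print([round(x, 1) for x in tfidf[0]])  # d1 row, approx (2.5, 14.6, 4.6, 0, 0, 2.1)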
23. Baseline Document Querying System
- Queries q: binary term vectors
- Documents: represented by TF-IDF weights
- Cosine distance used for retrieval and ranking
24. Baseline Document Querying System

TF doc-term matrix:

        t1  t2  t3  t4  t5  t6
  d1    24  21   9   0   0   3
  d2    32  10   5   0   3   0
  d3    12  16   5   0   0   0
  d4     6   7   2   0   0   0
  d5    43  31  20   0   3   0
  d6     2   0   0  18   7  16
  d7     0   0   1  32  12   0
  d8     3   0   0  22   4   2
  d9     1   0   0  34  27  25
  d10    6   0   0  17   4  23

TF-IDF doc-term matrix (first rows):

        t1    t2    t3   t4   t5   t6
  d1   2.5  14.6   4.6    0    0  2.1
  d2   3.4   6.9   2.6    0  1.1    0
  d3   1.3  11.1   2.6    0    0    0
  d4   0.6   4.9   1.0    0    0    0
  d5   4.5  21.5  10.2    0  1.1    0
  ...

Query q = (1, 0, 1, 0, 0, 0)

Cosine similarity of q to each document:

        TF   TF-IDF
  d1   0.70   0.32
  d2   0.77   0.51
  d3   0.58   0.24
  d4   0.60   0.23
  d5   0.79   0.43
  ...
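Continuing the sketches above (reusing cosine_similarity() and the tfidf matrix), the baseline system ranks documents by cosine similarity to the binary query vector:

    q = [1, 0, 1, 0, 0, 0]  # binary query containing terms t1 and t3

    ranking = sorted(
        ((cosine_similarity(q, row), f"d{i + 1}") for i, row in enumerate(tfidf)),
        reverse=True,
    )
    for score, doc in ranking[:5]:
        print(doc, round(score, 2))
    # With TF-IDF weighting, d2 and d5 should come out near the top for this query,
    # matching the TF-IDF column in the table above.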
25. Precision versus Recall
- Rank documents (numerically) with respect to the query
- Compute precision and recall by thresholding the rankings
- Precision
  - fraction of retrieved objects that are relevant
- Recall
  - retrieved relevant objects / total relevant objects
- Tradeoff: high precision <-> low recall, and vice versa
- Very similar to ROC in concept
- For multiple queries, precision for specific ranges of recall can be averaged (so-called interpolated precision)
26. Precision versus Recall
- Chakrabarti, p. 55
- recall(k) = (1/R) * sum_{i=1..k} r_i
- precision(k) = (1/k) * sum_{i=1..k} r_i
- where r_i = 1 if the i-th ranked document is relevant, and R = total number of relevant documents (here R = 5); see the sketch below
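A small sketch of these two formulas, with a made-up relevance list:

    def precision_at_k(r, k):
        """Fraction of the top-k retrieved documents that are relevant."""
        return sum(r[:k]) / k

    def recall_at_k(r, R, k):
        """Fraction of all R relevant documents retrieved in the top k."""
        return sum(r[:k]) / R

    r = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical relevance flags for ranks 1..10
    R = 5                                # total relevant documents, as on the slide
    for k in (1, 5, 10):
        print(k, precision_at_k(r, k), recall_at_k(r, R, k))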
27. Precision-Recall Curve (a form of ROC)
- Alternative (single-point) summary values: precision where recall = precision, or precision for a fixed number of retrievals, or average precision over multiple recall levels
- [Figure: precision-recall curves for systems A, B, C; C is universally worse than A and B]
28. TREC evaluations
- Text REtrieval Conference (TREC)
  - Web site: trec.nist.gov
- Annual impartial evaluation of IR systems
  - e.g., D = 1 million documents
- TREC organizers supply contestants with several hundred queries q
- Each competing system provides its ranked list of documents
- Union of the top 100 or so ranked documents from each system is then manually judged to be relevant or non-relevant for each query q
- Precision, recall, etc., are then calculated and systems compared
29. Other Examples of Evaluation Data Sets
- Cranfield data
  - Number of documents: 1400
  - 225 queries: medium length, manually constructed test questions
  - Relevance determined by expert committee (from 1968)
- Newsgroups
  - Articles from 20 Usenet newsgroups
  - Queries: randomly selected documents
  - Relevance: is the document d in the same category as the query doc?
30. Performance on Cranfield Document Set
31. Performance on Newsgroups Data
32. Related Types of Data
- Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining
- Transaction data
  - Rows: customers
  - Columns: products
- Web log data (ignoring sequence)
  - Rows: Web surfers
  - Columns: Web pages
- Recommender systems
  - Given some products from user i, suggest other products to the user
  - e.g., Amazon.com's book recommender
- Collaborative filtering
  - use k-nearest individuals as the basis for predictions (see the sketch below)
- Many similarities with querying and information retrieval
  - use of cosine distance to normalize vectors
33. Web-based Retrieval
- Additional information in Web documents
  - Link structure (e.g., PageRank, to be discussed later)
  - HTML structure
  - Link/anchor text
  - Title text
  - This information can be leveraged for better retrieval
- Additional issues in Web retrieval
  - Scalability: size of corpus is huge (10 to 100 billion docs)
  - Constantly changing
    - Crawlers to update document-term information
    - Need schemes for efficiently updating indices
  - Evaluation is more difficult: how is relevance measured? How many documents in total are relevant?
34. Further Reading
- Chakrabarti, Chapter 3
- General reference on text and language modeling
  - Foundations of Statistical Natural Language Processing, C. Manning and H. Schutze, MIT Press, 1999.
- Very useful reference on indexing and searching text
  - Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Morgan Kaufmann, 1999.
- Web-related document search
  - Information on how real Web search engines work: http://searchenginewatch.com/