Title: CS 277 Data Mining, Lecture 2: Text Mining and Information Retrieval (David Newman, UC Irvine)

1. CS 277 Data Mining
Lecture 2: Text Mining and Information Retrieval
- David Newman
- Department of Computer Science
- University of California, Irvine
2. Notices
- Homework 1 available. Due Tuesday Oct 9.
- Added to list of ideas for projects (link on class website)
- In 2 weeks (Oct 16): project proposal due
3. Lecture Topics in Text Mining
- Information Retrieval
- Text Classification
- Text Clustering
- Information Extraction
4. Text Mining Applications
- Information Retrieval
  - Query-based search of large text archives, e.g., the Web
- Text Classification
  - Automated assignment of topics to Web pages, e.g., Yahoo, Google
  - Automated classification of email into spam and non-spam (ham)
- Text Clustering
  - Automated organization of search results into categories in real time
  - Discovery of clusters and trends in technical literature (e.g., CiteSeer)
- Information Extraction
  - Extracting standard fields from free text
    - extracting names and places from reports, newspapers
    - extracting resume information automatically from resumes
5. Text Mining
- Information Retrieval
- Text Classification
- Text Clustering
- Information Extraction
6. General Concepts in Information Retrieval
- Representation language
  - Typically a vector of W features
  - Images: set of color, intensity, texture, gradient features characterizing images
  - Text: word counts
- Data: set of D objects
  - Typically represented as a D x W matrix
- Query q
  - User poses a query to search the data set
  - Query is expressed in the same representation language as the data
    - each text document is a set of words that occur in the document
    - query q is also expressed as a set of words, e.g., "data" and "mining"
7. Query by Content
- Traditional DB query: exact matches
  - query q: level = MANAGER AND age < 30
- Or, Boolean match on text
  - query "Irvine AND fun": return all docs with Irvine and fun
  - Not useful when there are many matches
    - "data mining" in Google returns 60 million documents
- Query-by-content: query more general / less precise
  - What record is most similar to a query q?
  - For text data, often called information retrieval (IR)
  - Can also be used for images, sequences, video
  - q can itself be an object (a document) or a shorter version (1 word)
- Goal
  - Match query q to the D objects in the database
  - Return a ranked list of the most relevant objects in the data set given q
8. Issues in Query by Content
- What representation language to use?
- How to measure similarity between q and each d in D?
- How to compute the results in real time?
- How to rank the results for the user?
- Allowing user feedback (query modification)
- How to evaluate and compare different IR algorithms/systems?
9. Text Retrieval
- document = book, paper, WWW page, ...
- term = word, word-pair, phrase, ... (often W ~ 100,000)
- query q = set of terms, e.g., "data mining"
- NLP (natural language processing) is too hard, so
  - want a (vector) representation for text which
    - retains maximum useful semantics
    - supports efficient distance computations between docs and q
- term weights
  - Boolean (term in document or not): bag of words
  - real-valued (frequency of term in doc relative to all docs), ...
- notice: loses word order, sentence structure
10. Processing one document
- Pipeline: tokenize -> stem -> vocab filter
11. Practical Issues
- Tokenization (see the sketch below)
  - Convert document to word counts
  - word token = any nonempty sequence of characters
  - for HTML (etc.) need to remove formatting
- Canonical forms, stopwords, stemming
  - Remove capitalization
  - Stopwords
    - remove very frequent words (the, and); can use a standard list
    - can also remove very rare words
  - Stemming (next slide)
- Data representation
  - 3-column <docID termID position>
  - Inverted index (faster)
    - list of sorted <termID docID> pairs; useful for finding docs containing certain terms
    - equivalent to a sparse representation of the term x doc matrix
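A minimal Python sketch of these steps (tokenize, lowercase, drop stopwords, count), assuming an illustrative regex tokenizer and a tiny stopword list rather than the course-provided tools:

    import re
    from collections import Counter

    # Tiny illustrative stopword list; a real system would use a standard one.
    STOPWORDS = {"the", "and", "a", "an", "of", "to", "in", "is"}

    def tokenize(text):
        """Strip HTML tags, lowercase, and split into word tokens."""
        text = re.sub(r"<[^>]+>", " ", text)           # crude HTML removal
        return re.findall(r"[a-z0-9]+", text.lower())  # token = run of alphanumerics

    def to_word_counts(text, min_count=1):
        """Convert one document to {term: count}, dropping stopwords and rare terms."""
        counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
        return {t: c for t, c in counts.items() if c >= min_count}

    print(to_word_counts("<p>The Internet is an engineering feat of no small magnitude.</p>"))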
12. Stemming
- May want to reduce all morphological variants of a word to a single index term
  - a document containing the words "fish" and "fisher" will not be retrieved by a query containing "fishing" (no "fishing" explicitly contained in the document)
- Stemming: reduce words to their root form
  - "fish" becomes a new index term
- Porter stemming algorithm (1980) (see the sketch below)
  - relies on a preconstructed suffix list with associated rules
    - e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
    - BINARIZATION -> BINARIZE
  - Not always desirable: university, universal -> univers (in Porter's)
- WordNet: dictionary-based approach
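One convenient way to experiment with Porter stemming (the homework uses porter.pl; this sketch assumes the Python NLTK package is installed instead):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # Try the slide's examples; e.g. "fishing" reduces to "fish".
    for word in ["fish", "fisher", "fishing", "binarization", "university", "universal"]:
        print(word, "->", stemmer.stem(word))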
13. Porter Stemmer
- Aside: You will use the Porter Stemmer in Homework 1. I have provided you with porter.pl
- If you don't already have access to perl, it is available on your unix account
14. Porter Stemmer (example)

Input:
- The Internet is an engineering feat of no small magnitude. It operates using a method of data transmission invented in the 1960s called packet switching. Send e-mail, for instance, and your message is broken into any number of little bundles each takes a separate route and reunites with the others at the destination.

Output:
- the internet is an engin feat of no small magnitud. it oper us a method of data transmiss invent in the 1960s call packet switch. send e mail, for instanc, and your messag is broken into ani number of littl bundl each take a separ rout and reunit with the other at the destin.

Q: Do you think Google uses stemming?
Q: What might stemming be good for?
15. Toy example of a document-term matrix
16. Inverted index
- Queries (see the sketch below)
  - q1: "database"
  - q2: "database schema"
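A small sketch of an inverted index in Python, with made-up documents; queries q1 and q2 become intersections over posting lists:

    from collections import defaultdict

    # Made-up toy documents (docID -> text).
    docs = {
        1: "database systems and database design",
        2: "the relational database schema",
        3: "text mining and information retrieval",
    }

    # Inverted index: term -> set of docIDs containing that term.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    def boolean_and(query):
        """Return docIDs containing every query term (Boolean AND)."""
        postings = [index.get(t, set()) for t in query.lower().split()]
        return sorted(set.intersection(*postings)) if postings else []

    print(boolean_and("database"))         # q1 -> [1, 2]
    print(boolean_and("database schema"))  # q2 -> [2]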
17. Distance metric
- Vector-space model
  - each document is a 1 x W vector d
- What is a suitable distance metric between q and d?
  - -> whiteboard
18. Distance metric: cosine similarity
- Go with cosine similarity (see the sketch below)
  - sim(q, d) = <q, d> / (||q|| ||d||) = cos(theta)
  - slightly unnatural: using inner products and norms from L2
- Bias
  - favors shorter documents?
  - or longer documents?
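A direct Python transcription of this formula (pure Python, no external libraries assumed):

    import math

    def cosine_similarity(q, d):
        """sim(q, d) = <q, d> / (||q|| * ||d||), for equal-length vectors."""
        dot = sum(qi * di for qi, di in zip(q, d))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        norm_d = math.sqrt(sum(di * di for di in d))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

    # Binary query for terms t1 and t3 against d1's term-frequency row (toy data):
    print(round(cosine_similarity([1, 0, 1, 0, 0, 0], [24, 21, 9, 0, 0, 3]), 2))  # ~0.70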
19. Distance matrices for toy document-term data

TF doc-term matrix:

        t1  t2  t3  t4  t5  t6
  d1    24  21   9   0   0   3
  d2    32  10   5   0   3   0
  d3    12  16   5   0   0   0
  d4     6   7   2   0   0   0
  d5    43  31  20   0   3   0
  d6     2   0   0  18   7  16
  d7     0   0   1  32  12   0
  d8     3   0   0  22   4   2
  d9     1   0   0  34  27  25
  d10    6   0   0  17   4  23

[Figure: 10 x 10 Euclidean distance matrix and cosine distance matrix between the documents]
20. q = "database schema"
21. TF-IDF Term Weighting Schemes
- Not all terms in a query or document may be equally important...
- TF (term frequency): term weight = number of times the term occurs in that document
  - problem: a term common to many docs gives low discrimination
- IDF (inverse document frequency of a term)
  - n_j documents contain term j, D documents in total
  - IDF_j = log(D / n_j)
  - favors terms that occur in relatively few documents
- TF-IDF weight = TF(term) x IDF(term)
- No real theoretical basis, but works well empirically and widely used
22. TF-IDF Example

TF doc-term matrix with per-term IDF weights:

        t1   t2   t3   t4   t5   t6
  d1    24   21    9    0    0    3
  d2    32   10    5    0    3    0
  d3    12   16    5    0    0    0
  d4     6    7    2    0    0    0
  d5    43   31   20    0    3    0
  d6     2    0    0   18    7   16
  d7     0    0    1   32   12    0
  d8     3    0    0   22    4    2
  d9     1    0    0   34   27   25
  d10    6    0    0   17    4   23
  idf  0.1  0.7  0.5  0.7  0.4  0.7

TF-IDF doc-term matrix (first rows):

        t1    t2    t3   t4   t5   t6
  d1   2.5  14.6   4.6    0    0  2.1
  d2   3.4   6.9   2.6    0  1.1    0
  d3   1.3  11.1   2.6    0    0    0
  d4   0.6   4.9   1.0    0    0    0
  d5   4.5  21.5  10.2    0  1.1    0
  ...

IDF weights = log(D / n_j) = (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)
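A short sketch that recomputes these weights from the TF matrix, assuming the natural-log form IDF_j = log(D / n_j) used above:

    import math

    tf = [
        [24, 21,  9,  0,  0,  3], [32, 10,  5,  0,  3,  0],
        [12, 16,  5,  0,  0,  0], [ 6,  7,  2,  0,  0,  0],
        [43, 31, 20,  0,  3,  0], [ 2,  0,  0, 18,  7, 16],
        [ 0,  0,  1, 32, 12,  0], [ 3,  0,  0, 22,  4,  2],
        [ 1,  0,  0, 34, 27, 25], [ 6,  0,  0, 17,  4, 23],
    ]
    D, W = len(tf), len(tf[0])

    # n_j = number of documents containing term j
    n = [sum(1 for d in range(D) if tf[d][j] > 0) for j in range(W)]
    idf = [math.log(D / n_j) for n_j in n]
    tfidf = [[tf[d][j] * idf[j] for j in range(W)] for d in range(D)]

    print([round(x, 1) for x in idf])       # approx (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)
    print([round(x, 1) for x in tfidf[0]])  # d1 row, approx (2.5, 14.6, 4.6, 0, 0, 2.1)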
23. Baseline Document Querying System
- Queries q: binary term vectors
- Documents: represented by TF-IDF weights
- Cosine distance used for retrieval and ranking
24. Baseline Document Querying System

TF doc-term matrix:

        t1  t2  t3  t4  t5  t6
  d1    24  21   9   0   0   3
  d2    32  10   5   0   3   0
  d3    12  16   5   0   0   0
  d4     6   7   2   0   0   0
  d5    43  31  20   0   3   0
  d6     2   0   0  18   7  16
  d7     0   0   1  32  12   0
  d8     3   0   0  22   4   2
  d9     1   0   0  34  27  25
  d10    6   0   0  17   4  23

TF-IDF doc-term matrix (first rows):

        t1    t2    t3   t4   t5   t6
  d1   2.5  14.6   4.6    0    0  2.1
  d2   3.4   6.9   2.6    0  1.1    0
  d3   1.3  11.1   2.6    0    0    0
  d4   0.6   4.9   1.0    0    0    0
  d5   4.5  21.5  10.2    0  1.1    0
  ...

Query q = (1, 0, 1, 0, 0, 0)

Cosine similarity of q to each document:

        TF   TF-IDF
  d1   0.70   0.32
  d2   0.77   0.51
  d3   0.58   0.24
  d4   0.60   0.23
  d5   0.79   0.43
  ...
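Continuing the sketches above (reusing cosine_similarity() and the tfidf matrix), the baseline system ranks documents by cosine similarity to the binary query vector:

    q = [1, 0, 1, 0, 0, 0]  # binary query containing terms t1 and t3

    ranking = sorted(
        ((cosine_similarity(q, row), f"d{i + 1}") for i, row in enumerate(tfidf)),
        reverse=True,
    )
    for score, doc in ranking[:5]:
        print(doc, round(score, 2))
    # With TF-IDF weighting, d2 and d5 should come out near the top for this query,
    # matching the TF-IDF column in the table above.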
25. Precision versus Recall
- Rank documents (numerically) with respect to the query
- Compute precision and recall by thresholding the rankings
- Precision
  - fraction of retrieved objects that are relevant
- Recall
  - retrieved relevant objects / total relevant objects
- Tradeoff: high precision <-> low recall, and vice versa
- Very similar to ROC in concept
- For multiple queries, precision for specific ranges of recall can be averaged (so-called interpolated precision)
26. Precision versus Recall
- Chakrabarti, p. 55
- recall(k) = (1/R) * sum_{i=1..k} r_i
- precision(k) = (1/k) * sum_{i=1..k} r_i
- where r_i = 1 if the i-th ranked document is relevant, and R = total number of relevant documents (here R = 5); see the sketch below
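A small sketch of these two formulas, with a made-up relevance list:

    def precision_at_k(r, k):
        """Fraction of the top-k retrieved documents that are relevant."""
        return sum(r[:k]) / k

    def recall_at_k(r, R, k):
        """Fraction of all R relevant documents retrieved in the top k."""
        return sum(r[:k]) / R

    r = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical relevance flags for ranks 1..10
    R = 5                                # total relevant documents, as on the slide
    for k in (1, 5, 10):
        print(k, precision_at_k(r, k), recall_at_k(r, R, k))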
27. Precision-Recall Curve (a form of ROC)
- Alternative (single-point) summary values: precision where recall = precision, or precision for a fixed number of retrievals, or average precision over multiple recall levels
- [Figure: precision-recall curves for systems A, B, C; C is universally worse than A and B]
28. TREC evaluations
- Text REtrieval Conference (TREC)
  - Web site: trec.nist.gov
- Annual impartial evaluation of IR systems
  - e.g., D = 1 million documents
- TREC organizers supply contestants with several hundred queries q
- Each competing system provides its ranked list of documents
- Union of the top 100 or so ranked documents from each system is then manually judged to be relevant or non-relevant for each query q
- Precision, recall, etc., are then calculated and systems compared
29. Other Examples of Evaluation Data Sets
- Cranfield data
  - Number of documents: 1400
  - 225 queries: medium length, manually constructed test questions
  - Relevance determined by expert committee (from 1968)
- Newsgroups
  - Articles from 20 Usenet newsgroups
  - Queries: randomly selected documents
  - Relevance: is the document d in the same category as the query doc?
30. Performance on Cranfield Document Set
31. Performance on Newsgroups Data
32. Related Types of Data
- Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining
- Transaction data
  - Rows: customers
  - Columns: products
- Web log data (ignoring sequence)
  - Rows: Web surfers
  - Columns: Web pages
- Recommender systems
  - Given some products from user i, suggest other products to the user
  - e.g., Amazon.com's book recommender
- Collaborative filtering
  - use k-nearest individuals as the basis for predictions (see the sketch below)
- Many similarities with querying and information retrieval
  - use of cosine distance to normalize vectors
33. Web-based Retrieval
- Additional information in Web documents
  - Link structure (e.g., PageRank, to be discussed later)
  - HTML structure
  - Link/anchor text
  - Title text
  - This information can be leveraged for better retrieval
- Additional issues in Web retrieval
  - Scalability: size of corpus is huge (10 to 100 billion docs)
  - Constantly changing
    - Crawlers to update document-term information
    - Need schemes for efficiently updating indices
  - Evaluation is more difficult: how is relevance measured? How many documents in total are relevant?
34. Further Reading
- Chakrabarti, Chapter 3
- General reference on text and language modeling
  - Foundations of Statistical Natural Language Processing, C. Manning and H. Schutze, MIT Press, 1999.
- Very useful reference on indexing and searching text
  - Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Morgan Kaufmann, 1999.
- Web-related document search
  - Information on how real Web search engines work: http://searchenginewatch.com/