Title: Text Mining
1Text Mining
Dr. Eamonn Keogh
Computer Science & Engineering Department, University of California - Riverside, Riverside, CA 92521
eamonn_at_cs.ucr.edu
2Text Mining/Information Retrieval
- Task Statement
- Build a system that retrieves documents that
users are likely to find relevant to their
queries. - This assumption underlies the field of
Information Retrieval.
3(IR system overview diagram: information need, collections, query construction, text input, text processing)
4Terminology
Token: a natural language word, e.g., Swim, Simpson, 92513, etc.
Document: usually a web page, but more generally any file.
5Some IR History
- Roots in the scientific "Information Explosion" following WWII
- Interest in computer-based IR from the mid 1950s
- H.P. Luhn at IBM (1958)
- Probabilistic models at Rand (Maron & Kuhns) (1960)
- Boolean system development at Lockheed (60s)
- Vector Space Model (Salton at Cornell 1965)
- Statistical weighting methods and theoretical advances (70s)
- Refinements and advances in application (80s)
- User interfaces, large-scale testing and application (90s)
6Relevance
- In what ways can a document be relevant to a query?
- Answer a precise question precisely.
- Who is Homer's boss? Montgomery Burns.
- Partially answer a question.
- Where does Homer work? The Power Plant.
- Suggest a source for more information.
- What is Bart's middle name? Look in Issue 234 of the Fanzine.
- Give background information.
- Remind the user of other knowledge.
- Others ...
8(IR system overview diagram: information need, collections, query construction, text input, text processing)
The section that follows is about Content Analysis (transforming raw text into a computationally more manageable form)
9Document Processing Steps
10Stemming and Morphological Analysis
- Goal: normalize similar words
- Morphology ("form" of words)
- Inflectional Morphology
- E.g., inflect verb endings and noun number
- Never changes grammatical class
- dog, dogs
- Bike, Biking
- Swim, Swimmer, Swimming
What about build, building?
11Examples of Stemming (using Porter's algorithm)
Original words: consign, consigned, consigning, consignment, consist, consisted, consistency, consistent, consistently, consisting, consists
Stemmed words: consign, consign, consign, consign, consist, consist, consist, consist, consist, consist, consist
Porter's algorithm is available in Java, C, Lisp, Perl, Python, etc. from http://www.tartarus.org/martin/PorterStemmer/
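As a small illustration (added here, not part of the original slides), NLTK ships an implementation of the Porter stemmer; a minimal sketch, assuming the nltk package is installed:

```python
# Minimal sketch using NLTK's Porter stemmer implementation
# (assumes `pip install nltk`); it reproduces the conflation shown above.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["consign", "consigned", "consigning", "consignment",
         "consist", "consisted", "consistency", "consistent",
         "consistently", "consisting", "consists"]

for w in words:
    print(w, "->", stemmer.stem(w))   # e.g. consigned -> consign
```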
12Errors Generated by Porter Stemmer (Krovetz 93)
13Statistical Properties of Text
- Token occurrences in text are not uniformly
distributed - They are also not normally distributed
- They do exhibit a Zipf distribution
14Government documents, 157,734 tokens, 32,259 unique
Top of the ranking: 8164 the, 4771 of, 4005 to, 2834 a, 2827 and, 2802 in, 1592 The, 1370 for, 1326 is, 1324 s, 1194 that, 973 by
Further down the ranking: 969 on, 915 FT, 883 Mr, 860 was, 855 be, 849 Pounds, 798 TEXT, 798 PUB, 798 PROFILE, 798 PAGE, 798 HEADLINE, 798 DOCNO
Tokens occurring once: 1 ABC, 1 ABFT, 1 ABOUT, 1 ACFT, 1 ACI, 1 ACQUI, 1 ACQUISITIONS, 1 ACSIS, 1 ADFT, 1 ADVISERS, 1 AE
15Plotting Word Frequency by Rank
- Main idea: count how many times each token occurs in the text, over all texts in the collection
- Now rank the tokens according to how often they occur; this is called the rank.
16Rank, frequency, and stem for the top 20 terms:
1 37 system, 2 32 knowledg, 3 24 base, 4 20 problem, 5 18 abstract, 6 15 model, 7 15 languag, 8 15 implem, 9 13 reason, 10 13 inform, 11 11 expert, 12 11 analysi, 13 10 rule, 14 10 program, 15 10 oper, 16 10 evalu, 17 10 comput, 18 10 case, 19 9 gener, 20 9 form
The Corresponding Zipf Curve
17Zipf Distribution
- The Important Points
- a few elements occur very frequently
- a medium number of elements have medium frequency
- many elements occur very infrequently
18Zipf Distribution
- The product of the frequency of a word (f) and its rank (r) is approximately constant
- Rank = order of words by frequency of occurrence
- Another way to state this is with an approximately correct rule of thumb
- Say the most common term occurs C times
- The second most common occurs C/2 times
- The third most common occurs C/3 times
- ...
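Written out as a formula (a standard statement of Zipf's law, added here for reference):

f \cdot r \approx C, \qquad \text{equivalently} \qquad f(r) \approx \frac{C}{r}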
19Zipf Distribution (linear and log scale)
20What Kinds of Data Exhibit a Zipf Distribution?
- Words in a text collection
- Virtually any language usage
- Library book checkout patterns
- Incoming Web Page Requests
- Outgoing Web Page Requests
- Document Size on Web
- City Sizes
21Consequences of Zipf
- There are always a few very frequent tokens that are not good discriminators.
- Called "stop words" in IR
- English examples: to, from, on, and, the, ...
- There are always a large number of tokens that occur once and can mess up algorithms.
- Medium-frequency words are the most descriptive.
22Word Frequency vs. Resolving Power (from van
Rijsbergen 79)
The most frequent words are not the most
descriptive.
23Statistical Independence
- Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together.
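In symbols:

P(x, y) = P(x)\,P(y)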
24Statistical Independence and Dependence
- What are examples of things that are
statistically independent? - What are examples of things that are
statistically dependent?
25Lexical Associations
- Subjects write the first word that comes to mind
- doctor/nurse, black/white (Palermo & Jenkins 64)
- Text corpora yield similar associations
- One measure: Mutual Information (Church and Hanks 89)
- If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
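The formula itself does not appear above; the association ratio Church and Hanks use is the (pointwise) mutual information between words x and y:

I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}

If x and y were independent, the ratio would be 1 and I(x, y) = 0; strongly associated pairs such as doctor/nurse score well above 0.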
26Statistical Independence
- Compute co-occurrence statistics within a sliding window of words
(figure: windows w1, w11, w21 positioned over the token sequence a b c d e f g h i j k l m n o p)
27Interesting Associations with "Doctor" (AP Corpus, N = 15 million, Church & Hanks 89)
28Un-Interesting Associations with "Doctor" (AP Corpus, N = 15 million, Church & Hanks 89)
These associations were likely to happen because the non-doctor words shown here are very common, and therefore likely to co-occur with any noun.
29Associations Are Important Because
- We may be able to discover phrases that should be treated as a single word, e.g., "data mining".
- We may be able to automatically discover synonyms, e.g., "bike" and "bicycle".
30Content Analysis Summary
- Content Analysis: transforming raw text into more computationally useful forms
- Words in text collections exhibit interesting statistical properties
- Word frequencies have a Zipf distribution
- Word co-occurrences exhibit dependencies
- Text documents are transformed into vectors
- Pre-processing includes tokenization, stemming, collocations/phrases
32(IR system overview diagram: information need, collections, text input, index construction)
The section that follows is about Index Construction
33Inverted Index
- This is the primary data structure for text
indexes - Main Idea
- Invert documents into a big index
- Basic steps
- Make a dictionary of all the tokens in the
collection - For each token, list all the docs it occurs in.
- Do a few things to reduce redundancy in the data
structure
34How Are Inverted Files Created?
- Documents are parsed to extract tokens. These are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
35How Inverted Files are Created
- After all documents have been parsed the inverted
file is sorted alphabetically.
36How Inverted Files are Created
- Multiple term entries for a single document are
merged. - Within-document term frequency information is
compiled.
37How Inverted Files are Created
- Then the file can be split into
- A Dictionary file
- and
- A Postings file
38How Inverted Files are Created
39Inverted Indexes
- Permit fast search for individual terms
- For each term, you get a list consisting of
- document ID
- frequency of term in doc (optional)
- position of term in doc (optional)
- These lists can be used to solve Boolean queries
- country -> d1, d2
- manor -> d2
- country AND manor -> d2
- Also used for statistical ranking algorithms
40How Inverted Files are Used
Query on "time" AND "dark": 2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file; 1 doc with "dark" in the dictionary -> ID 2 from the postings file. Therefore, only doc 2 satisfies the query.
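To make the walk-through concrete, here is a small sketch (added for illustration, not from the slides) that builds a dictionary-plus-postings structure for the two example documents and answers the "time AND dark" query; the tokenizer and data layout are simplifying assumptions.

```python
# Minimal inverted-index sketch for the two example documents above.
# Postings map each term to {doc_id: within-document frequency}.
from collections import defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

index = defaultdict(dict)                                    # term -> {doc_id: freq}
for doc_id, text in docs.items():
    for token in text.lower().replace(".", "").split():      # crude tokenizer
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def boolean_and(term1, term2):
    """Intersect the postings lists of two terms."""
    return sorted(set(index[term1]) & set(index[term2]))

print(index["time"])                  # {1: 1, 2: 1}
print(index["dark"])                  # {2: 1}
print(boolean_and("time", "dark"))    # [2]  -> only doc 2 satisfies the query
```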
42(IR system overview diagram: information need, collections, text input, index construction)
The section that follows is about Querying (and ranking)
43Simple query language Boolean
- Terms + Connectors (or operators)
- terms
- words
- normalized (stemmed) words
- phrases
- connectors
- AND
- OR
- NOT
- NEAR (Pseudo Boolean)
Example document: contains Cat (x) and Collar (x), but not Dog or Leash
44Boolean Queries
- Cat
- Cat OR Dog
- Cat AND Dog
- (Cat AND Dog)
- (Cat AND Dog) OR Collar
- (Cat AND Dog) OR (Collar AND Leash)
- (Cat OR Dog) AND (Collar OR Leash)
45Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- Each of the following combinations works
- Cat x x x x
- Dog x x x x x
- Collar x x x x
- Leash x x x x
46Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- None of the following combinations work
- Cat x x
- Dog x x
- Collar x x
- Leash x x
47Boolean Searching
Information need: "Measurement of the width of cracks in prestressed concrete beams"
Formal query: cracks AND beams AND width_measurement AND prestressed_concrete
Relaxed query (any three of the four concepts): (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
(Venn diagram: Cracks, Beams, Width measurement, Prestressed concrete)
48Ordering of Retrieved Documents
- Pure Boolean has no ordering
- In practice
- order chronologically
- order by total number of hits on query terms
- What if one term has more hits than others?
- Is it better to have one of each term, or many of one term?
49Boolean Model
- Advantages
- simple queries are easy to understand
- relatively easy to implement
- Disadvantages
- difficult to specify what is wanted
- too much returned, or too little
- ordering not well determined
- Dominant language in commercial Information
Retrieval systems until the WWW
Since the Boolean model is limited, let's consider a generalization
50Vector Model
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
- A vector is like an array of floating-point numbers
- Has direction and magnitude
- Each vector holds a place for every term in the collection
- Therefore, most vectors are sparse
- "Smithers secretly loves Monty Burns"
- "Monty Burns secretly loves Smithers"
- Both map to
- Burns, loves, Monty, secretly, Smithers
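A tiny sketch (an illustration added here, not from the slides) of why both sentences map to the same bag-of-words vector over the alphabetically ordered vocabulary:

```python
# Two word orderings, one bag-of-words representation.
from collections import Counter

vocab = ["Burns", "Monty", "Smithers", "loves", "secretly"]  # one slot per term

def to_vector(sentence):
    counts = Counter(sentence.split())
    return [counts[term] for term in vocab]

s1 = "Smithers secretly loves Monty Burns"
s2 = "Monty Burns secretly loves Smithers"
print(to_vector(s1))   # [1, 1, 1, 1, 1]
print(to_vector(s2))   # [1, 1, 1, 1, 1]  -> identical: word order is lost
```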
51Document Vectors: One location for each word
Term slots: nova, galaxy, heat, hwood, film, role, diet, fur
Document ids A through I, one row of term weights per document:
A: 10 5 3
B: 5 10
C: 10 8 7
D: 9 10 5
E: 10 10
F: 9 10
G: 5 7 9
H: 6 10 2 8
I: 7 5 1 3
52We Can Plot the Vectors
(figure: documents plotted on a "Star" axis vs. a "Diet" axis; a doc about movie stars, a doc about astronomy, and a doc about mammal behavior fall in different regions)
53Documents in 3D Vector Space
(figure: documents D1 through D11 plotted in a 3-D space with term axes t1, t2, t3)
54Vector Space Model
Note that the query is projected into the same
vector space as the documents. The query here is
for Marge. We can use a vector similarity
model to determine the best match to our query
(details in a few slides). But what weights
should we use for the terms?
55Assigning Weights to Terms
- Binary Weights
- Raw term frequency
- tf x idf
- Recall the Zipf distribution
- Want to weight terms highly if they are
- frequent in relevant documents BUT
- infrequent in the collection as a whole
56Binary Weights
- Only the presence (1) or absence (0) of a term is
included in the vector
We have already seen and discussed this model.
57Raw Term Weights
- The frequency of occurrence for the term in each
document is included in the vector
This model is open to exploitation by websites: a page can inflate its raw weight for a term simply by repeating it over and over ("sex sex sex sex sex ...").
Counts can be normalized by document lengths.
58tf x idf Weights
- tf x idf measure
- term frequency (tf)
- inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Goal: assign a tf x idf weight to each term in each document
59tf x idf
60Inverse Document Frequency
- IDF provides high values for rare words and low
values for common words
For a collection of 10000 documents
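The exact formula and table do not appear above; a common formulation (an assumption about the precise variant the slides used) weights term t in document d as

w_{t,d} = \mathrm{tf}_{t,d} \times \log_{10}\!\left(\frac{N}{n_t}\right)

where N is the number of documents in the collection and n_t is the number of documents containing term t. For a collection of N = 10,000 documents: a term appearing in 1 document gets idf = log10(10000/1) = 4; in 100 documents, idf = 2; in all 10,000 documents, idf = 0.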
61Similarity Measures
- Simple matching (coordination level match)
- Dice's Coefficient
- Jaccard's Coefficient
- Cosine Coefficient
- Overlap Coefficient
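The slide's formulas are not reproduced above; the standard set-based forms, for a query term set Q and a document term set D, are:

\text{Simple matching} = |Q \cap D|, \quad
\text{Dice} = \frac{2\,|Q \cap D|}{|Q| + |D|}, \quad
\text{Jaccard} = \frac{|Q \cap D|}{|Q \cup D|}, \quad
\text{Cosine} = \frac{|Q \cap D|}{\sqrt{|Q|\,|D|}}, \quad
\text{Overlap} = \frac{|Q \cap D|}{\min(|Q|, |D|)}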
62Cosine
(figure: plot on axes scaled 0.0 to 1.0 illustrating the cosine measure between document vectors)
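For weighted (non-binary) vectors, the cosine coefficient is the normalized dot product; a small sketch added here for illustration:

```python
# Cosine similarity between two term-weight vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = [1, 0, 1, 0]        # e.g. weights for (star, diet, film, fur)
doc_a = [10, 0, 8, 0]       # a doc about movie stars
doc_b = [0, 7, 0, 9]        # a doc about mammal behavior
print(cosine(query, doc_a)) # close to 1: nearly the same direction
print(cosine(query, doc_b)) # 0: no shared terms
```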
63Problems with Vector Space
- There is no real theoretical basis for the assumption of a term space
- it is more for visualization than having any real basis
- most similarity measures work about the same regardless of model
- Terms are not really orthogonal dimensions
- Terms are not independent of all other terms
64Probabilistic Models
- Rigorous formal model attempts to predict the
probability that a given document will be
relevant to a given query - Ranks retrieved documents according to this
probability of relevance (Probability Ranking
Principle) - Rely on accurate estimates of probabilities
66Relevance Feedback
- Main Idea
- Modify existing query based on relevance judgements
- Query Expansion: extract terms from relevant documents and add them to the query
- Term Re-weighting: and/or re-weight the terms already in the query
- Two main approaches
- Automatic (pseudo-relevance feedback)
- Users select relevant documents
- Users/system select terms from an automatically-generated list
67Definition: Relevance Feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query.
Suppose you are interested in bovine agriculture on the banks of the river Jordan
(feedback loop: Search -> Display Results -> Gather Feedback -> Update Weights -> Search ...)
68Rocchio Method
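Slide 68's formula is not reproduced above; the standard Rocchio update (a sketch of the usual formulation) modifies the query vector using the set of judged-relevant documents D_r and judged-non-relevant documents D_nr:

\vec{q}_{new} = \alpha\,\vec{q}_{orig} + \frac{\beta}{|D_r|}\sum_{\vec{d} \in D_r}\vec{d} \;-\; \frac{\gamma}{|D_{nr}|}\sum_{\vec{d} \in D_{nr}}\vec{d}

Here alpha, beta, and gamma control how much weight is given to the original query, the relevant documents, and the non-relevant documents, respectively.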
69Rocchio Illustration
Although we usually work in vector space for text, it is easier to visualize in Euclidean space
Original Query
Term Re-weighting: note that both the location of the center and the shape of the query have changed
Query Expansion
70Rocchio Method
- Rocchio automatically
- re-weights terms
- adds in new terms (from relevant docs)
- have to be careful when using negative terms
- Rocchio is not a machine learning algorithm
- Most methods perform similarly
- results heavily dependent on test collection
- Machine learning methods are proving to work
better than standard IR approaches like Rocchio
71Using Relevance Feedback
- Known to improve results
- People don't seem to like giving feedback!
72Relevance Feedback for Time Series
The original query
The weight vector. Initially, all weights are the same.
Note: in this example we are using a piecewise linear approximation of the data. We will learn more about this representation later.
73The initial query is executed, and the five best matches are shown (in the dendrogram)
One by one the 5 best matching sequences will appear, and the user will rank them from very bad (-3) to very good (3)
74Based on the user feedback, both the shape and the weight vector of the query are changed.
The new query can be executed. The hope is that the query shape and weights will converge to the optimal query.
Two papers consider relevance feedback for time series.
Query Expansion: L. Wu, C. Faloutsos, K. Sycara, T. Payne. FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB 2000: 297-306.
Term Re-weighting: Keogh, E. & Pazzani, M. Relevance feedback retrieval of time series data. In Proceedings of SIGIR 99.
76Document Space has High Dimensionality
- What happens beyond 2 or 3 dimensions?
- Similarity still has to do with how many tokens are shared in common.
- More terms -> harder to understand which subsets of words are shared among similar documents.
- One approach to handling high dimensionality: Clustering
77Text Clustering
- Finds overall similarities among groups of
documents. - Finds overall similarities among groups of
tokens. - Picks out some themes, ignores others.
78Scatter/Gather
- Hearst & Pedersen 95
- Cluster sets of documents into general themes,
like a table of contents (using K-means) - Display the contents of the clusters by showing
topical terms and typical titles - User chooses subsets of the clusters and
re-clusters the documents within - Resulting new groups have different themes
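Scatter/Gather's clustering step can be approximated with off-the-shelf tools; a minimal sketch (my illustration, assuming scikit-learn is installed, and not the Scatter/Gather implementation itself):

```python
# Cluster a few toy documents into themes with tf-idf vectors + K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "star of the film won an award for the role",
    "hollywood film stars attend the premiere",
    "the star emits light studied by astronomers",
    "astrophysics of stellar evolution and galaxies",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 1 1]: a film/tv theme and an astronomy theme
```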
79S/G Example query on star
- Encyclopedia text
- 14 sports
- 8 symbols 47 film, tv
- 68 film, tv (p) 7 music
- 97 astrophysics
- 67 astronomy(p) 12 stellar phenomena
- 10 flora/fauna 49 galaxies, stars
- 29 constellations
- 7 miscellaneous
- Clustering and re-clustering is entirely
automated
81Ego Surfing!
http://vivisimo.com/
83(IR system overview diagram: information need, collections, text input, index construction)
The section that follows is about Evaluation
84Evaluation
- Why Evaluate?
- What to Evaluate?
- How to Evaluate?
85Why Evaluate?
- Determine if the system is desirable
- Make comparative assessments
- Others?
86What to Evaluate?
- How much of the information need is satisfied.
- How much was learned about a topic.
- Incidental learning
- How much was learned about the collection.
- How much was learned about other topics.
- How inviting the system is.
87What to Evaluate?
- What can be measured that reflects users' ability to use the system? (Cleverdon 66)
- Coverage of Information
- Form of Presentation
- Effort required/Ease of Use
- Time and Space Efficiency
- Recall: proportion of relevant material actually retrieved
- Precision: proportion of retrieved material actually relevant
- Recall and precision together measure effectiveness
88Relevant vs. Retrieved
(Venn diagram: the set of retrieved documents and the set of relevant documents, within the set of all docs)
89Precision vs. Recall
(same diagram: precision and recall are defined by the overlap of the retrieved and relevant sets)
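In symbols (restating the definitions from the "What to Evaluate?" slide):

\text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}, \qquad \text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}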
90Why Precision and Recall?
- Intuition
- Get as much good stuff while at the same time
getting as little junk as possible.
91Retrieved vs. Relevant Documents
Very high precision, very low recall
92Retrieved vs. Relevant Documents
Very low precision, very low recall (0 in fact)
93Retrieved vs. Relevant Documents
High recall, but low precision
94Retrieved vs. Relevant Documents
High precision, high recall (at last!)
95Precision/Recall Curves
- There is a tradeoff between Precision and Recall
- So measure Precision at different levels of
Recall - Note this is an AVERAGE over MANY queries
(figure: precision plotted against recall; the averaged curve passes through measured points marked x)
96Precision/Recall Curves
- Difficult to determine which of these two
hypothetical results is better
(figure: two hypothetical precision-recall curves that cross each other)
97Precision/Recall Curves
98Recall under various retrieval assumptions
99Precision under various assumptions
(figure: precision vs. proportion of documents retrieved, for a collection of 1000 documents with 100 relevant, under perfect, tangent parabolic, parabolic recall, random, and perverse retrieval assumptions)
100Document Cutoff Levels
- Another way to evaluate
- Fix the number of documents retrieved at several
levels - top 5
- top 10
- top 20
- top 50
- top 100
- top 500
- Measure precision at each of these levels
- Take (weighted) average over results
- This is a way to focus on how well the system
ranks the first k documents.
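A sketch (added for illustration) of precision at a document cutoff level k:

```python
# Precision at cutoff k: fraction of the top-k ranked documents that are relevant.
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d3", "d4"}
print(precision_at_k(ranking, relevant, 5))   # 3 of the top 5 are relevant -> 0.6
```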
101Problems with Precision/Recall
- Can't know the true recall value
- except in small collections
- Precision/Recall are related
- A combined measure is sometimes more appropriate
- Assumes batch mode
- Interactive IR is important and has different criteria for successful searches
- Assumes a strict rank ordering matters.
102Relation to Contingency Table
                      Doc is Relevant   Doc is NOT relevant
Doc is retrieved            a                   b
Doc is NOT retrieved        c                   d
- Accuracy = (a+d) / (a+b+c+d)
- Precision = a / (a+b)
- Recall = a / (a+c)
- Why don't we use Accuracy for IR?
- (Assuming a large collection)
- Most docs aren't relevant
- Most docs aren't retrieved
- Inflates the accuracy value
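A quick numeric sketch (hypothetical counts, added for illustration) of why accuracy is misleading here:

```python
# Hypothetical contingency counts for one query over a large collection.
a, b, c, d = 10, 40, 10, 9940   # retrieved & relevant, retrieved & not, missed, the rest

accuracy  = (a + d) / (a + b + c + d)   # 0.995 -- looks great
precision = a / (a + b)                 # 0.20
recall    = a / (a + c)                 # 0.50
print(accuracy, precision, recall)
# Returning nothing at all would still score accuracy (0 + 9980) / 10000 = 0.998,
# so accuracy rewards doing nothing; precision and recall do not.
```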
103The E-Measure
- Combines Precision and Recall into one number (van Rijsbergen 79)
P = precision, R = recall, b = measure of the relative importance of P or R. For example, b = 0.5 means the user is twice as interested in precision as recall.
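The formula itself does not appear above; van Rijsbergen's E-measure is usually written as (a standard form, stated here as an assumption about the slide's exact notation):

E = 1 - \frac{(1 + b^2)\,P R}{b^2 P + R}

Its complement, 1 - E, is the familiar F-measure; with b = 1, precision and recall are weighted equally.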
104How to Evaluate?Test Collections
105Test Collections
- Cranfield 2
- 1400 Documents, 221 Queries
- 200 Documents, 42 Queries
- INSPEC: 542 Documents, 97 Queries
- UKCIS: > 10,000 Documents, multiple sets, 193 Queries
- ADI: 82 Documents, 35 Queries
- CACM: 3204 Documents, 50 Queries
- CISI: 1460 Documents, 35 Queries
- MEDLARS (Salton): 273 Documents, 18 Queries
106TREC
- Text REtrieval Conference/Competition
- Run by NIST (National Institute of Standards and Technology)
- 2002 (November) will be the 11th year
- Collection: > 6 gigabytes (5 CD-ROMs), > 1.5 million docs
- Newswire and full-text news (AP, WSJ, Ziff, FT)
- Government documents (Federal Register, Congressional Record)
- Radio transcripts (FBIS)
- Web subsets
107TREC (cont.)
- Queries & Relevance Judgments
- Queries devised and judged by Information Specialists
- Relevance judgments done only for those documents retrieved -- not the entire collection!
- Competition
- Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
- Results judged on precision and recall, going up to a recall level of 1000 documents
108TREC
- Benefits
- made research systems scale to large collections
(pre-WWW) - allows for somewhat controlled comparisons
- Drawbacks
- emphasis on high recall, which may be unrealistic
for what most users want - very long queries, also unrealistic
- comparisons still difficult to make, because
systems are quite different on many dimensions - focus on batch ranking rather than interaction
- no focus on the WWW
109TREC is changing
- Emphasis on specialized tracks
- Interactive track
- Natural Language Processing (NLP) track
- Multilingual tracks (Chinese, Spanish)
- Filtering track
- High-Precision
- High-Performance
- http://trec.nist.gov/
110What to Evaluate?
- Effectiveness
- Difficult to measure
- Recall and Precision are one way
- What might be others?