Title: Text Mining
1Text Mining
Dr. Eamonn Keogh
Computer Science & Engineering Department, University of California - Riverside, Riverside, CA 92521
eamonn_at_cs.ucr.edu
2Text Mining/Information Retrieval
- Task Statement
- Build a system that retrieves documents that
users are likely to find relevant to their
queries. - This assumption underlies the field of
Information Retrieval.
3(IR system overview diagram: information need, collections, query construction, text input, text processing)
4Terminology
Token: a natural language word, e.g., Swim, Simpson, 92513, etc.
Document: usually a web page, but more generally any file.
5Some IR History
- Roots in the scientific "Information Explosion" following WWII
- Interest in computer-based IR from the mid 1950s
- H.P. Luhn at IBM (1958)
- Probabilistic models at Rand (Maron & Kuhns) (1960)
- Boolean system development at Lockheed (60s)
- Vector Space Model (Salton at Cornell 1965)
- Statistical weighting methods and theoretical advances (70s)
- Refinements and advances in application (80s)
- User interfaces, large-scale testing and application (90s)
6Relevance
- In what ways can a document be relevant to a query?
- Answer a precise question precisely.
- Who is Homer's boss? Montgomery Burns.
- Partially answer a question.
- Where does Homer work? The Power Plant.
- Suggest a source for more information.
- What is Bart's middle name? Look in Issue 234 of the Fanzine.
- Give background information.
- Remind the user of other knowledge.
- Others ...
8(IR system overview diagram: information need, collections, query construction, text input, text processing)
The section that follows is about Content Analysis (transforming raw text into a computationally more manageable form)
9Document Processing Steps
10Stemming and Morphological Analysis
- Goal: normalize similar words
- Morphology ("form" of words)
- Inflectional Morphology
- E.g., inflect verb endings and noun number
- Never changes grammatical class
- dog, dogs
- Bike, Biking
- Swim, Swimmer, Swimming
What about build, building?
11Examples of Stemming (using Porter's algorithm)
Original words: consign, consigned, consigning, consignment, consist, consisted, consistency, consistent, consistently, consisting, consists
Stemmed words: consign, consign, consign, consign, consist, consist, consist, consist, consist, consist, consist
Porter's algorithm is available in Java, C, Lisp, Perl, Python, etc. from http://www.tartarus.org/martin/PorterStemmer/
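As a small illustration (added here, not part of the original slides), NLTK ships an implementation of the Porter stemmer; a minimal sketch, assuming the nltk package is installed:

```python
# Minimal sketch using NLTK's Porter stemmer implementation
# (assumes `pip install nltk`); it reproduces the conflation shown above.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["consign", "consigned", "consigning", "consignment",
         "consist", "consisted", "consistency", "consistent",
         "consistently", "consisting", "consists"]

for w in words:
    print(w, "->", stemmer.stem(w))   # e.g. consigned -> consign
```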
12Errors Generated by Porter Stemmer (Krovetz 93)
13Statistical Properties of Text
- Token occurrences in text are not uniformly
distributed - They are also not normally distributed
- They do exhibit a Zipf distribution
14Government documents, 157,734 tokens, 32,259 unique
Top of the ranking: 8164 the, 4771 of, 4005 to, 2834 a, 2827 and, 2802 in, 1592 The, 1370 for, 1326 is, 1324 s, 1194 that, 973 by
Further down the ranking: 969 on, 915 FT, 883 Mr, 860 was, 855 be, 849 Pounds, 798 TEXT, 798 PUB, 798 PROFILE, 798 PAGE, 798 HEADLINE, 798 DOCNO
Tokens occurring once: 1 ABC, 1 ABFT, 1 ABOUT, 1 ACFT, 1 ACI, 1 ACQUI, 1 ACQUISITIONS, 1 ACSIS, 1 ADFT, 1 ADVISERS, 1 AE
15Plotting Word Frequency by Rank
- Main idea: count how many times each token occurs in the text, over all texts in the collection
- Now rank the tokens according to how often they occur; this is called the rank.
16Rank, frequency, and stem for the top 20 terms:
1 37 system, 2 32 knowledg, 3 24 base, 4 20 problem, 5 18 abstract, 6 15 model, 7 15 languag, 8 15 implem, 9 13 reason, 10 13 inform, 11 11 expert, 12 11 analysi, 13 10 rule, 14 10 program, 15 10 oper, 16 10 evalu, 17 10 comput, 18 10 case, 19 9 gener, 20 9 form
The Corresponding Zipf Curve
17Zipf Distribution
- The Important Points
- a few elements occur very frequently
- a medium number of elements have medium frequency
- many elements occur very infrequently
18Zipf Distribution
- The product of the frequency of a word (f) and its rank (r) is approximately constant
- Rank = order of words by frequency of occurrence
- Another way to state this is with an approximately correct rule of thumb
- Say the most common term occurs C times
- The second most common occurs C/2 times
- The third most common occurs C/3 times
- ...
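Written out as a formula (a standard statement of Zipf's law, added here for reference):

f \cdot r \approx C, \qquad \text{equivalently} \qquad f(r) \approx \frac{C}{r}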
19Zipf Distribution (linear and log scale)
20What Kinds of Data Exhibit a Zipf Distribution?
- Words in a text collection
- Virtually any language usage
- Library book checkout patterns
- Incoming Web Page Requests
- Outgoing Web Page Requests
- Document Size on Web
- City Sizes
21Consequences of Zipf
- There are always a few very frequent tokens that are not good discriminators.
- Called "stop words" in IR
- English examples: to, from, on, and, the, ...
- There are always a large number of tokens that occur once and can mess up algorithms.
- Medium-frequency words are the most descriptive.
22Word Frequency vs. Resolving Power (from van
Rijsbergen 79)
The most frequent words are not the most
descriptive.
23Statistical Independence
- Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together.
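In symbols:

P(x, y) = P(x)\,P(y)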
24Statistical Independence and Dependence
- What are examples of things that are
statistically independent? - What are examples of things that are
statistically dependent?
25Lexical Associations
- Subjects write the first word that comes to mind
- doctor/nurse, black/white (Palermo & Jenkins 64)
- Text corpora yield similar associations
- One measure: Mutual Information (Church and Hanks 89)
- If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
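The formula itself does not appear above; the association ratio Church and Hanks use is the (pointwise) mutual information between words x and y:

I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}

If x and y were independent, the ratio would be 1 and I(x, y) = 0; strongly associated pairs such as doctor/nurse score well above 0.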
26Statistical Independence
- Compute co-occurrence statistics within a sliding window of words
(figure: windows w1, w11, w21 positioned over the token sequence a b c d e f g h i j k l m n o p)
27Interesting Associations with "Doctor" (AP Corpus, N = 15 million, Church & Hanks 89)
28Un-Interesting Associations with "Doctor" (AP Corpus, N = 15 million, Church & Hanks 89)
These associations were likely to happen because the non-doctor words shown here are very common, and therefore likely to co-occur with any noun.
29Associations Are Important Because
- We may be able to discover phrases that should be treated as a single word, e.g., "data mining".
- We may be able to automatically discover synonyms, e.g., "bike" and "bicycle".
30Content Analysis Summary
- Content Analysis: transforming raw text into more computationally useful forms
- Words in text collections exhibit interesting statistical properties
- Word frequencies have a Zipf distribution
- Word co-occurrences exhibit dependencies
- Text documents are transformed into vectors
- Pre-processing includes tokenization, stemming, collocations/phrases
32(IR system overview diagram: information need, collections, text input, index construction)
The section that follows is about Index Construction
33Inverted Index
- This is the primary data structure for text
indexes - Main Idea
- Invert documents into a big index
- Basic steps
- Make a dictionary of all the tokens in the
collection - For each token, list all the docs it occurs in.
- Do a few things to reduce redundancy in the data
structure
34How Are Inverted Files Created?
- Documents are parsed to extract tokens. These are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
35How Inverted Files are Created
- After all documents have been parsed the inverted
file is sorted alphabetically.
36How Inverted Files are Created
- Multiple term entries for a single document are
merged. - Within-document term frequency information is
compiled.
37How Inverted Files are Created
- Then the file can be split into
- A Dictionary file
- and
- A Postings file
38How Inverted Files are Created
39Inverted Indexes
- Permit fast search for individual terms
- For each term, you get a list consisting of
- document ID
- frequency of term in doc (optional)
- position of term in doc (optional)
- These lists can be used to solve Boolean queries
- country -> d1, d2
- manor -> d2
- country AND manor -> d2
- Also used for statistical ranking algorithms
40How Inverted Files are Used
Query on "time" AND "dark": 2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file; 1 doc with "dark" in the dictionary -> ID 2 from the postings file. Therefore, only doc 2 satisfies the query.
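To make the walk-through concrete, here is a small sketch (added for illustration, not from the slides) that builds a dictionary-plus-postings structure for the two example documents and answers the "time AND dark" query; the tokenizer and data layout are simplifying assumptions.

```python
# Minimal inverted-index sketch for the two example documents above.
# Postings map each term to {doc_id: within-document frequency}.
from collections import defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

index = defaultdict(dict)                                    # term -> {doc_id: freq}
for doc_id, text in docs.items():
    for token in text.lower().replace(".", "").split():      # crude tokenizer
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def boolean_and(term1, term2):
    """Intersect the postings lists of two terms."""
    return sorted(set(index[term1]) & set(index[term2]))

print(index["time"])                  # {1: 1, 2: 1}
print(index["dark"])                  # {2: 1}
print(boolean_and("time", "dark"))    # [2]  -> only doc 2 satisfies the query
```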
42(IR system overview diagram: information need, collections, text input, index construction)
The section that follows is about Querying (and ranking)
43Simple query language Boolean
- Terms + Connectors (or operators)
- terms
- words
- normalized (stemmed) words
- phrases
- connectors
- AND
- OR
- NOT
- NEAR (Pseudo Boolean)
Example document: contains Cat (x) and Collar (x), but not Dog or Leash
44Boolean Queries
- Cat
- Cat OR Dog
- Cat AND Dog
- (Cat AND Dog)
- (Cat AND Dog) OR Collar
- (Cat AND Dog) OR (Collar AND Leash)
- (Cat OR Dog) AND (Collar OR Leash)
45Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- Each of the following combinations works
- Cat x x x x
- Dog x x x x x
- Collar x x x x
- Leash x x x x
46Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- None of the following combinations work
- Cat x x
- Dog x x
- Collar x x
- Leash x x
47Boolean Searching
Information need: "Measurement of the width of cracks in prestressed concrete beams"
Formal query: cracks AND beams AND width_measurement AND prestressed_concrete
Relaxed query (any three of the four concepts): (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
(Venn diagram: Cracks, Beams, Width measurement, Prestressed concrete)
48Ordering of Retrieved Documents
- Pure Boolean has no ordering
- In practice
- order chronologically
- order by total number of hits on query terms
- What if one term has more hits than others?
- Is it better to have one of each term, or many of one term?
49Boolean Model
- Advantages
- simple queries are easy to understand
- relatively easy to implement
- Disadvantages
- difficult to specify what is wanted
- too much returned, or too little
- ordering not well determined
- Dominant language in commercial Information
Retrieval systems until the WWW
Since the Boolean model is limited, let's consider a generalization
50Vector Model
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
- A vector is like an array of floating-point numbers
- Has direction and magnitude
- Each vector holds a place for every term in the collection
- Therefore, most vectors are sparse
- "Smithers secretly loves Monty Burns"
- "Monty Burns secretly loves Smithers"
- Both map to
- Burns, loves, Monty, secretly, Smithers
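A tiny sketch (an illustration added here, not from the slides) of why both sentences map to the same bag-of-words vector over the alphabetically ordered vocabulary:

```python
# Two word orderings, one bag-of-words representation.
from collections import Counter

vocab = ["Burns", "Monty", "Smithers", "loves", "secretly"]  # one slot per term

def to_vector(sentence):
    counts = Counter(sentence.split())
    return [counts[term] for term in vocab]

s1 = "Smithers secretly loves Monty Burns"
s2 = "Monty Burns secretly loves Smithers"
print(to_vector(s1))   # [1, 1, 1, 1, 1]
print(to_vector(s2))   # [1, 1, 1, 1, 1]  -> identical: word order is lost
```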
51Document Vectors: One location for each word
Term slots: nova, galaxy, heat, hwood, film, role, diet, fur
Document ids A through I, one row of term weights per document:
A: 10 5 3
B: 5 10
C: 10 8 7
D: 9 10 5
E: 10 10
F: 9 10
G: 5 7 9
H: 6 10 2 8
I: 7 5 1 3
52We Can Plot the Vectors
(figure: documents plotted on a "Star" axis vs. a "Diet" axis; a doc about movie stars, a doc about astronomy, and a doc about mammal behavior fall in different regions)
53Documents in 3D Vector Space
(figure: documents D1 through D11 plotted in a 3-D space with term axes t1, t2, t3)
54Vector Space Model
Note that the query is projected into the same
vector space as the documents. The query here is
for Marge. We can use a vector similarity
model to determine the best match to our query
(details in a few slides). But what weights
should we use for the terms?
55Assigning Weights to Terms
- Binary Weights
- Raw term frequency
- tf x idf
- Recall the Zipf distribution
- Want to weight terms highly if they are
- frequent in relevant documents BUT
- infrequent in the collection as a whole
56Binary Weights
- Only the presence (1) or absence (0) of a term is
included in the vector
We have already seen and discussed this model.
57Raw Term Weights
- The frequency of occurrence for the term in each
document is included in the vector
This model is open to exploitation by websites: a page can inflate its raw weight for a term simply by repeating it over and over ("sex sex sex sex sex ...").
Counts can be normalized by document lengths.
58tf x idf Weights
- tf x idf measure
- term frequency (tf)
- inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Goal: assign a tf x idf weight to each term in each document
59tf x idf
60Inverse Document Frequency
- IDF provides high values for rare words and low
values for common words
For a collection of 10000 documents
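The exact formula and table do not appear above; a common formulation (an assumption about the precise variant the slides used) weights term t in document d as

w_{t,d} = \mathrm{tf}_{t,d} \times \log_{10}\!\left(\frac{N}{n_t}\right)

where N is the number of documents in the collection and n_t is the number of documents containing term t. For a collection of N = 10,000 documents: a term appearing in 1 document gets idf = log10(10000/1) = 4; in 100 documents, idf = 2; in all 10,000 documents, idf = 0.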
61Similarity Measures
- Simple matching (coordination level match)
- Dice's Coefficient
- Jaccard's Coefficient
- Cosine Coefficient
- Overlap Coefficient
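The slide's formulas are not reproduced above; the standard set-based forms, for a query term set Q and a document term set D, are:

\text{Simple matching} = |Q \cap D|, \quad
\text{Dice} = \frac{2\,|Q \cap D|}{|Q| + |D|}, \quad
\text{Jaccard} = \frac{|Q \cap D|}{|Q \cup D|}, \quad
\text{Cosine} = \frac{|Q \cap D|}{\sqrt{|Q|\,|D|}}, \quad
\text{Overlap} = \frac{|Q \cap D|}{\min(|Q|, |D|)}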
62Cosine
(figure: plot on axes scaled 0.0 to 1.0 illustrating the cosine measure between document vectors)
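For weighted (non-binary) vectors, the cosine coefficient is the normalized dot product; a small sketch added here for illustration:

```python
# Cosine similarity between two term-weight vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = [1, 0, 1, 0]        # e.g. weights for (star, diet, film, fur)
doc_a = [10, 0, 8, 0]       # a doc about movie stars
doc_b = [0, 7, 0, 9]        # a doc about mammal behavior
print(cosine(query, doc_a)) # close to 1: nearly the same direction
print(cosine(query, doc_b)) # 0: no shared terms
```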
63Problems with Vector Space
- There is no real theoretical basis for the assumption of a term space
- it is more for visualization than having any real basis
- most similarity measures work about the same regardless of model
- Terms are not really orthogonal dimensions
- Terms are not independent of all other terms
64Probabilistic Models
- Rigorous formal model attempts to predict the
probability that a given document will be
relevant to a given query - Ranks retrieved documents according to this
probability of relevance (Probability Ranking
Principle) - Rely on accurate estimates of probabilities
66Relevance Feedback
- Main Idea
- Modify existing query based on relevance judgements
- Query Expansion: extract terms from relevant documents and add them to the query
- Term Re-weighting: and/or re-weight the terms already in the query
- Two main approaches
- Automatic (pseudo-relevance feedback)
- Users select relevant documents
- Users/system select terms from an automatically-generated list
67Definition: Relevance Feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query.
Suppose you are interested in bovine agriculture on the banks of the river Jordan
(feedback loop: Search -> Display Results -> Gather Feedback -> Update Weights -> Search ...)
68Rocchio Method
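Slide 68's formula is not reproduced above; the standard Rocchio update (a sketch of the usual formulation) modifies the query vector using the set of judged-relevant documents D_r and judged-non-relevant documents D_nr:

\vec{q}_{new} = \alpha\,\vec{q}_{orig} + \frac{\beta}{|D_r|}\sum_{\vec{d} \in D_r}\vec{d} \;-\; \frac{\gamma}{|D_{nr}|}\sum_{\vec{d} \in D_{nr}}\vec{d}

Here alpha, beta, and gamma control how much weight is given to the original query, the relevant documents, and the non-relevant documents, respectively.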
69Rocchio Illustration
Although we usually work in vector space for text, it is easier to visualize in Euclidean space
Original Query
Term Re-weighting: note that both the location of the center and the shape of the query have changed
Query Expansion
70Rocchio Method
- Rocchio automatically
- re-weights terms
- adds in new terms (from relevant docs)
- have to be careful when using negative terms
- Rocchio is not a machine learning algorithm
- Most methods perform similarly
- results heavily dependent on test collection
- Machine learning methods are proving to work
better than standard IR approaches like Rocchio
71Using Relevance Feedback
- Known to improve results
- People don't seem to like giving feedback!
72Relevance Feedback for Time Series
The original query
The weight vector. Initially, all weights are the same.
Note: in this example we are using a piecewise linear approximation of the data. We will learn more about this representation later.
73The initial query is executed, and the five best matches are shown (in the dendrogram)
One by one the 5 best matching sequences will appear, and the user will rank them from very bad (-3) to very good (3)
74Based on the user feedback, both the shape and the weight vector of the query are changed.
The new query can be executed. The hope is that the query shape and weights will converge to the optimal query.
Two papers consider relevance feedback for time series.
Query Expansion: L. Wu, C. Faloutsos, K. Sycara, T. Payne. FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB 2000: 297-306.
Term Re-weighting: Keogh, E. & Pazzani, M. Relevance feedback retrieval of time series data. In Proceedings of SIGIR 99.
76Document Space has High Dimensionality
- What happens beyond 2 or 3 dimensions?
- Similarity still has to do with how many tokens are shared in common.
- More terms -> harder to understand which subsets of words are shared among similar documents.
- One approach to handling high dimensionality: Clustering
77Text Clustering
- Finds overall similarities among groups of
documents. - Finds overall similarities among groups of
tokens. - Picks out some themes, ignores others.
78Scatter/Gather
- Hearst & Pedersen 95
- Cluster sets of documents into general themes,
like a table of contents (using K-means) - Display the contents of the clusters by showing
topical terms and typical titles - User chooses subsets of the clusters and
re-clusters the documents within - Resulting new groups have different themes
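Scatter/Gather's clustering step can be approximated with off-the-shelf tools; a minimal sketch (my illustration, assuming scikit-learn is installed, and not the Scatter/Gather implementation itself):

```python
# Cluster a few toy documents into themes with tf-idf vectors + K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "star of the film won an award for the role",
    "hollywood film stars attend the premiere",
    "the star emits light studied by astronomers",
    "astrophysics of stellar evolution and galaxies",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 1 1]: a film/tv theme and an astronomy theme
```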
79S/G Example query on star
- Encyclopedia text
- 14 sports
- 8 symbols 47 film, tv
- 68 film, tv (p) 7 music
- 97 astrophysics
- 67 astronomy(p) 12 stellar phenomena
- 10 flora/fauna 49 galaxies, stars
- 29 constellations
- 7 miscellaneous
- Clustering and re-clustering is entirely
automated
81Ego Surfing!
http://vivisimo.com/
83(IR system overview diagram: information need, collections, text input, index construction)
The section that follows is about Evaluation
84Evaluation
- Why Evaluate?
- What to Evaluate?
- How to Evaluate?
85Why Evaluate?
- Determine if the system is desirable
- Make comparative assessments
- Others?
86What to Evaluate?
- How much of the information need is satisfied.
- How much was learned about a topic.
- Incidental learning
- How much was learned about the collection.
- How much was learned about other topics.
- How inviting the system is.
87What to Evaluate?
- What can be measured that reflects users' ability to use the system? (Cleverdon 66)
- Coverage of Information
- Form of Presentation
- Effort required/Ease of Use
- Time and Space Efficiency
- Recall: proportion of relevant material actually retrieved
- Precision: proportion of retrieved material actually relevant
- Recall and precision together measure effectiveness
88Relevant vs. Retrieved
(Venn diagram: the set of retrieved documents and the set of relevant documents, within the set of all docs)
89Precision vs. Recall
(same diagram: precision and recall are defined by the overlap of the retrieved and relevant sets)
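In symbols (restating the definitions from the "What to Evaluate?" slide):

\text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}, \qquad \text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}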
90Why Precision and Recall?
- Intuition
- Get as much good stuff while at the same time
getting as little junk as possible.
91Retrieved vs. Relevant Documents
Very high precision, very low recall
92Retrieved vs. Relevant Documents
Very low precision, very low recall (0 in fact)
93Retrieved vs. Relevant Documents
High recall, but low precision
94Retrieved vs. Relevant Documents
High precision, high recall (at last!)
95Precision/Recall Curves
- There is a tradeoff between Precision and Recall
- So measure Precision at different levels of
Recall - Note this is an AVERAGE over MANY queries
(figure: precision plotted against recall; the averaged curve passes through measured points marked x)
96Precision/Recall Curves
- Difficult to determine which of these two
hypothetical results is better
(figure: two hypothetical precision-recall curves that cross each other)
97Precision/Recall Curves
98Recall under various retrieval assumptions
99Precision under various assumptions
(figure: precision vs. proportion of documents retrieved, for a collection of 1000 documents with 100 relevant, under perfect, tangent parabolic, parabolic recall, random, and perverse retrieval assumptions)
100Document Cutoff Levels
- Another way to evaluate
- Fix the number of documents retrieved at several
levels - top 5
- top 10
- top 20
- top 50
- top 100
- top 500
- Measure precision at each of these levels
- Take (weighted) average over results
- This is a way to focus on how well the system
ranks the first k documents.
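A sketch (added for illustration) of precision at a document cutoff level k:

```python
# Precision at cutoff k: fraction of the top-k ranked documents that are relevant.
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d3", "d4"}
print(precision_at_k(ranking, relevant, 5))   # 3 of the top 5 are relevant -> 0.6
```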
101Problems with Precision/Recall
- Can't know the true recall value
- except in small collections
- Precision/Recall are related
- A combined measure is sometimes more appropriate
- Assumes batch mode
- Interactive IR is important and has different criteria for successful searches
- Assumes a strict rank ordering matters.
102Relation to Contingency Table
                      Doc is Relevant   Doc is NOT relevant
Doc is retrieved            a                   b
Doc is NOT retrieved        c                   d
- Accuracy = (a+d) / (a+b+c+d)
- Precision = a / (a+b)
- Recall = a / (a+c)
- Why don't we use Accuracy for IR?
- (Assuming a large collection)
- Most docs aren't relevant
- Most docs aren't retrieved
- Inflates the accuracy value
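A quick numeric sketch (hypothetical counts, added for illustration) of why accuracy is misleading here:

```python
# Hypothetical contingency counts for one query over a large collection.
a, b, c, d = 10, 40, 10, 9940   # retrieved & relevant, retrieved & not, missed, the rest

accuracy  = (a + d) / (a + b + c + d)   # 0.995 -- looks great
precision = a / (a + b)                 # 0.20
recall    = a / (a + c)                 # 0.50
print(accuracy, precision, recall)
# Returning nothing at all would still score accuracy (0 + 9980) / 10000 = 0.998,
# so accuracy rewards doing nothing; precision and recall do not.
```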
103The E-Measure
- Combines Precision and Recall into one number (van Rijsbergen 79)
P = precision, R = recall, b = measure of the relative importance of P or R. For example, b = 0.5 means the user is twice as interested in precision as recall.
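The formula itself does not appear above; van Rijsbergen's E-measure is usually written as (a standard form, stated here as an assumption about the slide's exact notation):

E = 1 - \frac{(1 + b^2)\,P R}{b^2 P + R}

Its complement, 1 - E, is the familiar F-measure; with b = 1, precision and recall are weighted equally.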
104How to Evaluate?Test Collections
105Test Collections
- Cranfield 2
- 1400 Documents, 221 Queries
- 200 Documents, 42 Queries
- INSPEC: 542 Documents, 97 Queries
- UKCIS: > 10,000 Documents, multiple sets, 193 Queries
- ADI: 82 Documents, 35 Queries
- CACM: 3204 Documents, 50 Queries
- CISI: 1460 Documents, 35 Queries
- MEDLARS (Salton): 273 Documents, 18 Queries
106TREC
- Text REtrieval Conference/Competition
- Run by NIST (National Institute of Standards and Technology)
- 2002 (November) will be the 11th year
- Collection: > 6 gigabytes (5 CD-ROMs), > 1.5 million docs
- Newswire and full-text news (AP, WSJ, Ziff, FT)
- Government documents (Federal Register, Congressional Record)
- Radio transcripts (FBIS)
- Web subsets
107TREC (cont.)
- Queries & Relevance Judgments
- Queries devised and judged by Information Specialists
- Relevance judgments done only for those documents retrieved -- not the entire collection!
- Competition
- Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
- Results judged on precision and recall, going up to a recall level of 1000 documents
108TREC
- Benefits
- made research systems scale to large collections
(pre-WWW) - allows for somewhat controlled comparisons
- Drawbacks
- emphasis on high recall, which may be unrealistic
for what most users want - very long queries, also unrealistic
- comparisons still difficult to make, because
systems are quite different on many dimensions - focus on batch ranking rather than interaction
- no focus on the WWW
109TREC is changing
- Emphasis on specialized tracks
- Interactive track
- Natural Language Processing (NLP) track
- Multilingual tracks (Chinese, Spanish)
- Filtering track
- High-Precision
- High-Performance
- http://trec.nist.gov/
110What to Evaluate?
- Effectiveness
- Difficult to measure
- Recall and Precision are one way
- What might be others?