Title: Text Preprocessing
1 2Unstructured (text) vs. structured (database)
data in 1996
3Unstructured (text) vs. structured (database)
data in 2006
4Tasks on a collection of documents
- Document retrieval
- Document clustering
- Document categorization
- All of these tasks require text preprocessing
5Steps in Text Preprocessing
Identification of all unique words
Removal of stop words
- non-informative words
- e.g. the, and, when, more
Word stemming
- removal of suffixes to generate the word stem
- groups related words
- increases relevance
- e.g. walker, walking → walk
Naive terms
Term weighting
- importance of a term in a document
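The first two steps above (unique words, stop word removal) can be sketched as follows. This is a minimal illustration, not the deck's own implementation: the regular-expression tokenizer, the tiny `STOP_WORDS` set, and the function names are assumptions made for the example.

```python
import re

# Tiny illustrative stop list (assumption); real stop lists are much longer.
STOP_WORDS = {"the", "and", "when", "more"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def unique_terms(text, stop_words=STOP_WORDS):
    """Return the sorted unique words of a text with stop words removed."""
    return sorted({tok for tok in tokenize(text) if tok not in stop_words})

terms = unique_terms("The walker and the dog: when walking, the dog runs more.")
print(terms)
```

Note that "walker" and "walking" remain separate terms here; merging them is the job of the stemming step.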
6Word stemming
- The process of reducing inflected (or sometimes derived) words to their stem, base, or root form.
- It is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
- Stemming programs are commonly referred to as stemming algorithms or stemmers.
- The most popular is the Porter stemmer.
7Example of stemming (English)
- "cats", "catlike", "catty" → "cat"
- "stemmer", "stemming", "stemmed" → "stem"
- "fishing", "fished", "fish", "fisher" → "fish"
8Types of stemming algorithms
- Brute force algorithms
- Suffix stripping algorithms
9Brute force algorithms
- Employ a lookup table which contains relations between root forms and inflected forms.
- To stem a word, the table is queried to find a matching inflection. If a matching inflection is found, the associated root form is returned.
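A minimal sketch of the lookup-table approach is shown below. The table contents and the choice to return the word unchanged when no entry exists are assumptions made for illustration.

```python
# Hypothetical lookup table mapping inflected forms to root forms.
LOOKUP = {
    "fishing": "fish",
    "fished": "fish",
    "cats": "cat",
    "stemming": "stem",
    "stemmed": "stem",
}

def brute_force_stem(word, table=LOOKUP):
    """Return the root form stored for `word`, or the word itself if absent."""
    return table.get(word.lower(), word)

print(brute_force_stem("fished"))
print(brute_force_stem("walking"))  # not in the table, returned unchanged
```

The second call illustrates the main limitation: a word that is missing from the table cannot be stemmed.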
10Suffix Stripping Algorithms
- A typically smaller list of "rules" is stored, which provides a path for the algorithm, given an input word form, to find its root form.
- Examples of rules:
- if the word ends in 'ed', remove the 'ed'
- if the word ends in 'ing', remove the 'ing'
- if the word ends in 'ly', remove the 'ly'
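The three rules above can be sketched as a small function. The `min_stem_len` guard is an assumption added so that very short words such as "red" are left alone; it is not one of the listed rules.

```python
# The three example rules, checked in the order listed.
SUFFIX_RULES = ("ed", "ing", "ly")

def strip_suffix(word, rules=SUFFIX_RULES, min_stem_len=2):
    """Remove the first matching suffix, keeping at least `min_stem_len` letters."""
    for suffix in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["walking", "fished", "quickly", "red", "stemming"]:
    print(w, "->", strip_suffix(w))
```

With only these rules, "stemming" becomes "stemm" rather than "stem". Fuller rule sets such as the Porter stemmer add further steps to handle cases like doubled consonants.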
11Affix Stemmers
- The term affix refers to either a prefix or a suffix.
- In addition to dealing with suffixes, several approaches also attempt to remove common prefixes.
12Document Representation
$d = (w_1, w_2, \ldots, w_t) \in \mathbb{R}^t$
$w_i$ is the weight of the $i$th term in document $d$.
13tf - Term Frequency weighting
$w_{ij} = \mathrm{Freq}_{ij}$
$\mathrm{Freq}_{ij}$: the number of times the $j$th term occurs in document $D_i$.
Drawback: does not reflect the importance of a term for document discrimination.
14tf·idf - Inverse Document Frequency
- $w_{ij} = \mathrm{Freq}_{ij} \cdot \log(N / \mathrm{DocFreq}_j)$
- $N$: the number of documents in the document collection.
- $\mathrm{DocFreq}_j$: the number of documents in which the $j$th term occurs.
- Assumption: terms with low DocFreq are better discriminators than ones with high DocFreq in the document collection.
     A    B    K    O    Q    R    S    T    W    X
D1   0    0    0    0.3  0    0    0    0    0.3  0
D2   0    0    0.3  0    0    0    0    0    0    0
Ref13
Ref1122
15- Advantage: reflects the importance of a term for document discrimination.
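The weighting w_ij = Freq_ij · log(N / DocFreq_j) can be sketched as below. The base of the logarithm is not stated in the text; base 10 is assumed here because a weight of about 0.3 is what log10(2) gives for a term that occurs once in one of two documents. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # DocFreq: number of documents in which each term occurs.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)  # Freq: raw term counts in this document
        weights.append(
            {t: f * math.log10(n_docs / doc_freq[t]) for t, f in freq.items()}
        )
    return weights

docs = [["a", "b", "a"], ["b", "c"]]
w = tf_idf(docs)
print(w)
```

A term that appears in every document ("b" here) gets weight 0, which is the discrimination effect that plain term frequency lacks.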
16Entropy weighting
The weight is based on the average entropy of the ith term, which is
- -1 if the word occurs once in every document
- 0 if the word occurs in only one document
Ref13
Ref1122
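The entropy formula itself is not given in the text above, so the sketch below assumes a commonly used normalized form, (1 / log N) · Σ p log p, where p is the share of the term's occurrences that fall in each document. That assumed form reproduces both boundary values stated above; the function name is hypothetical.

```python
import math

def average_entropy(term_counts):
    """term_counts: occurrences of one term in each of the N documents.

    Assumed form: (1 / log N) * sum(p * log p), with p = count / total count.
    """
    n_docs = len(term_counts)
    total = sum(term_counts)
    if n_docs < 2 or total == 0:
        return 0.0
    s = 0.0
    for count in term_counts:
        if count > 0:  # terms with zero count contribute nothing
            p = count / total
            s += p * math.log(p)
    return s / math.log(n_docs)

print(average_entropy([1, 1, 1, 1]))  # once in every document
print(average_entropy([5, 0, 0, 0]))  # only in one document
```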
17Dimension Reduction
- Document frequency thresholding
- χ²-statistic
18DocFreq Thresholding
Input: naive terms
- Calculate DocFreq(w) for each word w
- Set a threshold θ
- Remove all words with DocFreq(w) < θ
Output: feature terms
Ref11202127
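The thresholding procedure above can be sketched in a few lines. The function name, the toy documents, and the threshold value are assumptions for illustration only.

```python
from collections import Counter

def doc_freq_filter(docs, threshold):
    """Keep terms whose document frequency is at least `threshold`.

    docs: list of token lists (the naive terms of each document).
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    return {term for term, df in doc_freq.items() if df >= threshold}

docs = [["text", "mining", "rare"], ["text", "mining"], ["text", "index"]]
features = doc_freq_filter(docs, threshold=2)
print(sorted(features))
```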
19χ²-statistic
- Assumption: a pre-defined category set for a training collection D.
- Goal: estimate the independence between term and category.
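20χ²-statistic
Input: naive terms
- Compute the term-category score for each word w and category c_j, using the counts
  $A = |\{d \mid d \in c_j \wedge w \in d\}|$
  $B = |\{d \mid d \notin c_j \wedge w \in d\}|$
  $C = |\{d \mid d \in c_j \wedge w \notin d\}|$
  $D = |\{d \mid d \notin c_j \wedge w \notin d\}|$
  $N = |\{d \mid d \in \mathcal{D}\}|$, where $\mathcal{D}$ is the training collection
- Set a threshold θ
- Remove all words with χ²max(w) < θ
Output: feature terms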
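The score formula is not spelled out in the text above. The sketch below assumes the widely used form χ²(w, c) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), with A counting documents in the category that contain the word, B documents outside it that contain the word, C documents in it without the word, and D documents outside it without the word. χ²max(w) is taken as the maximum over categories. All names and the toy data are hypothetical.

```python
def chi_square(docs, labels, term, category):
    """docs: list of token sets; labels: the category of each document."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_cat = label == category
        if in_cat and has_term:
            a += 1
        elif not in_cat and has_term:
            b += 1
        elif in_cat and not has_term:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:  # term or category is degenerate; treat as independent
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def chi_square_max(docs, labels, term):
    """Maximum score of `term` over all categories."""
    return max(chi_square(docs, labels, term, cat) for cat in set(labels))

docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "match"}]
labels = ["sport", "sport", "politics", "politics"]
print(chi_square_max(docs, labels, "goal"))
print(chi_square_max(docs, labels, "match"))
```

In this toy data "goal" occurs only in sport documents and scores high, while "match" is spread evenly across both categories and scores 0, so a threshold between the two would keep "goal" and drop "match".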
21Text Preprocessing using RapidMiner
22Vector creation
23Vector creation
24Document Retrieval
- Document retrieval is the retrieval of documents relevant to user requests, commonly called queries.
- Document retrieval research and development efforts focus on both efficient and accurate search techniques.
25Text Indexing Lucene
- Lucene is a Java library that adds text indexing and searching capabilities to an application.
- The following is an example of Lucene usage in a search application.
26[Figure: example of Lucene usage in a search application]
27Measure of Accuracy
28Example
Precision 50 Recall ?
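For reference, precision and recall under their standard definitions can be computed as below. The document IDs are hypothetical and are not the data of the example slide; they are chosen only so that precision comes out at 50%.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 3 relevant in the collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(p, r)
```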
29Document Clustering
- Groups together conceptually related documents.
- Enables identification of duplicates and near-duplicates.
- Also provides metadata characterizing the contents of a given document cluster.
30Document Similarity
- How to determine if two documents are similar?
- Euclidean distance
- Cosine similarity
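Both measures can be sketched on plain term-weight vectors as follows. The vectors are made up for illustration, and returning 0.0 for a zero vector in the cosine case is a convention assumed here.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]  # same term proportions as d1, twice the length
print(euclidean_distance(d1, d2))
print(cosine_similarity(d1, d2))
```

The two documents have identical term proportions, so their cosine similarity is essentially 1 even though their Euclidean distance is not 0. This is why cosine similarity is often preferred when documents differ in length.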
31Document Clustering Carrot2
- Open-source search results clustering engine.
- It can automatically organize small collections of documents, e.g. search results, into thematic categories.
- Website:
- http://project.carrot2.org/
- http://search.carrot2.org/stable/search
32Document Categorization
- Automatically organizes documents into
user-defined categories or taxonomies.