Text Preprocessing - PowerPoint PPT Presentation

About This Presentation

Title:

Text Preprocessing

Provided by: ewin7

Transcript and Presenter's Notes

Title: Text Preprocessing

1

Text Preprocessing

2
Unstructured (text) vs. structured (database)
data in 1996
3
Unstructured (text) vs. structured (database)
data in 2006
4
Tasks on a collection of documents

Document retrieval
Document clustering
Document categorization
All of these tasks require text preprocessing

5
Steps in Text Preprocessing
Identification of all unique words
Removal of stop words: non-informative words, e.g. the, and, when, more
Word stemming: removal of suffixes to generate word stems, grouping related words and increasing relevance, e.g. walker, walking → walk
Naive terms
Term weighting: importance of a term in a document
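
The first two steps can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the regular-expression tokenizer, the function names, and the sample sentence are assumptions, and the stop-word list contains only the four example words from the slide.

```python
import re

# Illustrative stop-word list: just the four example words from the slide.
STOP_WORDS = {"the", "and", "when", "more"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

def unique_terms(tokens):
    """Return the distinct tokens in sorted order."""
    return sorted(set(tokens))

# Hypothetical input sentence.
tokens = tokenize("The walker and the dog: when walking, walk more.")
terms = unique_terms(remove_stop_words(tokens))
print(terms)
```

Note that "walk", "walker" and "walking" remain separate terms at this point; merging them is the job of the stemming step.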
6
Word stemming

The process for reducing inflected (or sometimes
derived) words to their stem, base or root form
It is usually sufficient that related words map
to the same stem, even if this stem is not in
itself a valid root.
Stemming programs are commonly referred to as
stemming algorithms or stemmers
Most popular: the Porter stemmer

7
Example of stemming (English)

"cats", "catlike", "catty" → "cat"
"stemmer", "stemming", "stemmed" → "stem"
"fishing", "fished", "fish", "fisher" → "fish"

8
Types of stemming algorithms

Brute force algorithms
Suffix stripping algorithms

9
Brute force algorithms

Employ a lookup table which contains relations
between root forms and inflected forms.
To stem a word, the table is queried to find a
matching inflection. If a matching inflection is
found, the associated root form is returned.
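
A minimal sketch of the lookup-table approach. The table contents and the function name are hypothetical; returning the word unchanged when no entry is found is an assumption, since the slide does not say what happens on a miss.

```python
# Hypothetical lookup table mapping inflected forms to root forms.
LOOKUP = {
    "cats": "cat",
    "fishing": "fish",
    "fished": "fish",
    "stemming": "stem",
    "stemmed": "stem",
}

def brute_force_stem(word, table=LOOKUP):
    """Return the root form stored for `word`, or the word itself if absent."""
    return table.get(word.lower(), word)

print(brute_force_stem("fished"))
print(brute_force_stem("walker"))  # not in the table, returned unchanged
```

The table has to list every inflected form explicitly, so its coverage is limited to the words someone has entered.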

10
Suffix Stripping Algorithms

A typically smaller list of "rules" is stored, which
provides a path for the algorithm, given an
input word form, to find its root form.
Examples of rules:
if the word ends in 'ed', remove the 'ed'
if the word ends in 'ing', remove the 'ing'
if the word ends in 'ly', remove the 'ly'
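
The three example rules can be sketched directly. The function name and the `min_stem_len` guard (which keeps very short words such as "red" intact) are assumptions added for illustration; they are not part of the slide.

```python
# The three example rules from the slide, checked in order.
SUFFIX_RULES = ("ed", "ing", "ly")

def strip_suffix(word, rules=SUFFIX_RULES, min_stem_len=2):
    """Remove the first matching suffix, keeping at least `min_stem_len` letters."""
    for suffix in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ("fished", "fishing", "quickly", "stemming"):
    print(w, "->", strip_suffix(w))
```

These naive rules turn "stemming" into "stemm" rather than "stem"; full stemmers such as the Porter stemmer add further rules for cases like doubled consonants.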

11
Affix Stemmers

The term affix refers to either a prefix or a
suffix.
In addition to dealing with suffixes, several
approaches also attempt to remove common
prefixes.

12
Document Representation

Vector space model

$d = (w_1, w_2, \dots, w_t) \in \mathbb{R}^t$
$w_i$ is the weight of the ith term in document d.
13
tf - Term Frequency weighting
$w_{ij} = \mathrm{Freq}_{ij}$
$\mathrm{Freq}_{ij}$: the number of times the jth term occurs in document $D_i$.
Drawback: does not reflect a term's importance for document discrimination.
[Figure: example documents D1 and D2]
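
A small sketch of term-frequency weighting in the vector space model: each document becomes a vector of raw term counts over a fixed vocabulary. The vocabulary, the sample tokens, and the function name are hypothetical.

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Return the raw term counts of `tokens`, one entry per vocabulary term."""
    freq = Counter(tokens)
    return [freq[term] for term in vocabulary]

# Hypothetical vocabulary and tokenized document.
vocabulary = ["cat", "fish", "stem"]
vec = tf_vector(["fish", "cat", "fish"], vocabulary)
print(vec)
```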
14
tf-idf - Inverse Document Frequency

$w_{ij} = \mathrm{Freq}_{ij} \times \log(N / \mathrm{DocFreq}_j)$
N: the number of documents in the document
collection.
$\mathrm{DocFreq}_j$: the number of documents in
which the jth term occurs.
Assumption: terms with low DocFreq are better
discriminators than terms with high DocFreq in the
document collection.

     A    B    K    O    Q    R    S    T    W    X
D1   0    0    0    0.3  0    0    0    0    0.3  0
D2   0    0    0.3  0    0    0    0    0    0    0
Ref13
Ref1122
15

Advantage: reflects a term's importance
for document discrimination.
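
The tf-idf weight (term frequency multiplied by the log of N over the term's document frequency) can be sketched as below. The two documents are hypothetical, chosen so that O and W occur only in D1 and K only in D2. The base of the logarithm is not stated on the slide; base 10 is assumed here because, with N = 2, it gives log10(2) ≈ 0.3 for a term that occurs once in a single document, which is consistent with the 0.3 entries in the table.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute Freq_ij * log10(N / DocFreq_j) for each tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        freq = Counter(doc)
        weights.append(
            {t: f * math.log10(n_docs / doc_freq[t]) for t, f in freq.items()}
        )
    return weights

# Hypothetical documents over the terms A, B, K, O, Q, R, S, T, W, X.
d1 = ["A", "B", "O", "Q", "R", "S", "T", "W", "X"]
d2 = ["A", "B", "K", "Q", "R", "S", "T", "X"]
weights = tf_idf([d1, d2])
print({t: round(w, 1) for t, w in weights[0].items()})
```

Terms that occur in every document get weight 0, so only the discriminating terms keep a non-zero weight.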

16
Entropy weighting
where the entropy term is the average entropy of the ith term:
-1 if the word occurs once in every document
0 if the word occurs in only one document
Ref13
Ref1122
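
The formula on this slide was not transcribed. The sketch below uses one common definition of average entropy, (1 / log N) times the sum over documents of p log p, where p is the share of the term's occurrences that fall in each document. This definition is an assumption, chosen because it reproduces the two boundary values stated above; the slide's exact formula may differ.

```python
import math

def average_entropy(term_counts):
    """(1 / log N) * sum_j p_j * log(p_j), with p_j = count_j / total count.

    `term_counts` holds one term's frequency in each of the N documents.
    """
    n_docs = len(term_counts)
    total = sum(term_counts)
    if n_docs < 2 or total == 0:
        return 0.0
    s = sum((c / total) * math.log(c / total) for c in term_counts if c > 0)
    return s / math.log(n_docs)

print(average_entropy([1, 1, 1, 1]))  # term occurs once in every document
print(average_entropy([3, 0, 0, 0]))  # term occurs in only one document
```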
17
Dimension Reduction

Document frequency thresholding
χ²-statistic

18
DocFreq Thresholding
Naive Terms
Calculates DocFreq(w)
Sets a threshold
Removes all words with DocFreq below the threshold
Feature Terms
Ref11202127
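
Document frequency thresholding can be sketched as follows. The sample documents, the threshold value, and the function name are hypothetical.

```python
from collections import Counter

def doc_freq_threshold(docs, threshold):
    """Keep the terms whose document frequency is at least `threshold`."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    return sorted(t for t, df in doc_freq.items() if df >= threshold)

# Hypothetical tokenized documents.
docs = [
    ["text", "mining", "text"],
    ["text", "index"],
    ["mining", "index", "lucene"],
]
features = doc_freq_threshold(docs, threshold=2)
print(features)
```

Here "lucene" occurs in only one document and is dropped; repeated occurrences within one document do not raise a term's document frequency.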
19
χ²-statistic

Assumption: a pre-defined category set for a
training collection D
Goal: estimate the independence between a term and a
category

20
χ²-statistic
Naive Terms
Term categorical score
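Sets a threshold
$A = |\{d : d \in c_j \wedge w \in d\}|$
$B = |\{d : d \notin c_j \wedge w \in d\}|$
$C = |\{d : d \in c_j \wedge w \notin d\}|$
$D = |\{d : d \notin c_j \wedge w \notin d\}|$
$N = |\{d : d \in D\}|$
Removes all words with $\chi^2_{\max}(w)$ below the threshold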
FEATURE TERMS
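
The scoring formula itself does not appear in the transcript. The sketch below assumes the standard two-by-two contingency form of the statistic, N(AD - CB)² / ((A + C)(B + D)(A + B)(C + D)), with the four counts defined as in the docstring, and takes the maximum over categories as the term's score. Function names and the example counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term/category pair from its 2x2 contingency counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def chi_square_max(counts_per_category):
    """Maximum chi-square score of one term over all categories."""
    return max(chi_square(*counts) for counts in counts_per_category)

# Hypothetical counts: the term is independent of the first category
# and perfectly associated with the second.
score = chi_square_max([(5, 5, 5, 5), (10, 0, 0, 10)])
print(score)
```

A score of 0 means the term and category look independent; terms whose maximum score stays below the threshold are removed.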
21
Text Preprocessing using RapidMiner
22
Vector creation
23
Vector creation
24
Document Retrieval

Document retrieval is the retrieval of documents
relevant to user requests, commonly called
queries.
Document retrieval research and development
efforts focus on both efficient and accurate
search techniques

25
Text Indexing: Lucene

Lucene is a Java library that adds text indexing
and searching capabilities to an application.
The following is an example of Lucene usage in a search
application.

26
(No Transcript)
27
Measure of Accuracy
28
Example
Precision = 50%, Recall = ?
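
Precision and recall for a single query can be sketched as below. The document identifiers are hypothetical and are not the data behind the slide's example; they are chosen only so that precision comes out at 50%.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved documents / all retrieved documents
    recall    = relevant retrieved documents / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 5 relevant overall.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5", "d6", "d7"],
)
print(p, r)
```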
29
Document Clustering

Groups together conceptually related documents,
enabling identification of duplicates and
near-duplicates. It also provides metadata
characterizing the contents of a given document
cluster.

30
Document Similarity

How to determine if two documents are similar?
Euclidean distance
Cosine similarity
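
Both measures can be sketched on plain lists of term weights. The function names and sample vectors are hypothetical; returning 0 for an all-zero vector in the cosine case is an assumption made to avoid division by zero.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical document vectors: the second is the first scaled by two.
print(euclidean_distance([1, 2], [2, 4]))
print(cosine_similarity([1, 2], [2, 4]))
```

Cosine similarity ignores vector length, so a document and a longer document with the same term proportions count as maximally similar, while their Euclidean distance is non-zero.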

31
Document Clustering: Carrot2

Open source search results clustering engine.
It can automatically organize small collections
of documents, e.g. search results, into thematic
categories.
Website
http://project.carrot2.org/
http://search.carrot2.org/stable/search

32
Document Categorization