Text Preprocessing - PowerPoint PPT Presentation

Provided by: ewin7
Transcript and Presenter's Notes

Title: Text Preprocessing


1
  • Text Preprocessing

2
Unstructured (text) vs. structured (database)
data in 1996
3
Unstructured (text) vs. structured (database)
data in 2006
4
Tasks on a collection of documents
  • Document retrieval
  • Document clustering
  • Document categorization
  • All these tasks require text preprocessing

5
Steps in Text Preprocessing
  • Identification of all unique words
  • Removal of stop words: non-informative words,
    e.g. the, and, when, more
  • Word stemming: removal of suffixes to generate
    word stems, grouping related words and increasing
    relevance, e.g. walker, walking → walk
  • Naive terms: the terms that remain after stemming
  • Term weighting: the importance of a term in a
    document
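
The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function name `unique_terms` and the regular-expression tokenizer are assumptions, and the stop-word list contains only the four examples from the slide.

```python
import re

# Illustrative stop-word list: only the four examples given on the slide.
STOP_WORDS = {"the", "and", "when", "more"}

def unique_terms(text, stop_words=STOP_WORDS):
    """Return the sorted unique lowercase words of `text`, minus stop words."""
    # Assumed tokenizer: runs of ASCII letters, case-folded.
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({t for t in tokens if t not in stop_words})

print(unique_terms("The walker and the dog walk more when the sun shines"))
# ['dog', 'shines', 'sun', 'walk', 'walker']
```

Note that "walk" and "walker" are still separate terms here; merging them is the job of the stemming step.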
6
Word stemming
  • The process for reducing inflected (or sometimes
    derived) words to their stem, base or root form
  • It is usually sufficient that related words map
    to the same stem, even if this stem is not in
    itself a valid root.
  • Stemming programs are commonly referred to as
    stemming algorithms or stemmers
  • Most popular: the Porter stemmer

7
Example of stemming (English)
  • "cats", "catlike", "catty" → "cat"
  • "stemmer", "stemming", "stemmed" → "stem"
  • "fishing", "fished", "fish", "fisher" → "fish"

8
Types of stemming algorithms
  • Brute force algorithms
  • Suffix stripping algorithms

9
Brute force algorithms
  • Employ a lookup table which contains relations
    between root forms and inflected forms.
  • To stem a word, the table is queried to find a
    matching inflection. If a matching inflection is
    found, the associated root form is returned.
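
A brute force stemmer reduces to a dictionary lookup. The sketch below assumes a tiny hand-made table; the name `LOOKUP`, its entries, and the choice to return unknown words unchanged are illustrative assumptions.

```python
# Hypothetical lookup table: inflected form -> root form.
LOOKUP = {
    "cats": "cat",
    "fishing": "fish",
    "fished": "fish",
    "stemming": "stem",
    "stemmed": "stem",
}

def brute_force_stem(word, table=LOOKUP):
    """Return the root form stored for `word`, or `word` itself if absent."""
    return table.get(word.lower(), word)

print(brute_force_stem("Fished"))   # fish
print(brute_force_stem("walking"))  # walking (not in the table)
```

The second call shows the main limitation: any inflection missing from the table is left unstemmed.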

10
Suffix Stripping Algorithms
  • A typically smaller list of "rules" is stored,
    which provides a path for the algorithm, given an
    input word form, to find its root form
  • Examples of rules:
  • if the word ends in 'ed', remove the 'ed'
  • if the word ends in 'ing', remove the 'ing'
  • if the word ends in 'ly', remove the 'ly'
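
The three example rules can be applied directly. This is a sketch of the rule idea only, not the Porter stemmer; the rule order and the `min_stem` guard (which keeps very short words such as "red" intact) are assumptions added for the illustration.

```python
# The three suffixes from the slide; the order is an assumption.
SUFFIX_RULES = ("ing", "ed", "ly")

def strip_suffix(word, rules=SUFFIX_RULES, min_stem=3):
    """Remove the first matching suffix if at least `min_stem` letters remain."""
    for suffix in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("fishing", "fished", "quickly", "red", "stemming"):
    print(w, "->", strip_suffix(w))
# fishing -> fish
# fished -> fish
# quickly -> quick
# red -> red
# stemming -> stemm
```

The last result, "stemm", is not a valid root, which is acceptable as long as related words map to the same stem; practical stemmers add further rules for cases like doubled consonants.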

11
Affix Stemmers
  • The term affix refers to either a prefix or a
    suffix.
  • In addition to dealing with suffixes, several
    approaches also attempt to remove common
    prefixes.

12
Document Representation
  • Vector space model

$d = (w_1, w_2, \ldots, w_t) \in \mathbb{R}^t$
where $w_i$ is the weight of the $i$-th term in document $d$.
13
tf - Term Frequency weighting
  • $w_{ij} = \mathrm{Freq}_{ij}$
  • $\mathrm{Freq}_{ij}$: the number of times the $j$-th
    term occurs in document $D_i$.
  • Drawback: does not reflect a term's importance
    for document discrimination.
[Figure: example documents D1 and D2]
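
Term-frequency weighting amounts to counting. The sketch below builds one count vector per document over a shared vocabulary; the function name `term_frequencies` and the two sample documents are hypothetical.

```python
from collections import Counter

def term_frequencies(docs):
    """docs: list of token lists. Returns (vocab, matrix), matrix[i][j] = Freq_ij."""
    vocab = sorted({t for doc in docs for t in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[t] for t in vocab])
    return vocab, matrix

# Hypothetical, already preprocessed documents.
docs = [["fish", "walk", "fish"], ["walk", "stem"]]
vocab, tf = term_frequencies(docs)
print(vocab)  # ['fish', 'stem', 'walk']
print(tf)     # [[2, 0, 1], [0, 1, 1]]
```

"walk" gets weight 1 in both documents even though it does nothing to tell them apart, which is the drawback named above.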
14
tf×idf - Inverse Document Frequency
  • $w_{ij} = \mathrm{Freq}_{ij} \cdot \log(N / \mathrm{DocFreq}_j)$
  • $N$: the number of documents in the document
    collection.
  • $\mathrm{DocFreq}_j$: the number of documents in
    which the $j$-th term occurs.
  • Assumption: terms with low DocFreq are better
    discriminators than ones with high DocFreq in the
    document collection.

      A    B    K    O    Q    R    S    T    W    X
D1    0    0    0    0.3  0    0    0    0    0.3  0
D2    0    0    0.3  0    0    0    0    0    0    0
Ref13
Ref1122
15
  • Advantage: reflects a term's importance for
    document discrimination.
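
The tf×idf weight can be computed as below. Two assumptions are made for the sketch: the logarithm is taken to base 10 (the slide does not state the base, but log10(2/1) ≈ 0.3 matches the values in the D1/D2 table), and the two documents are hypothetical ones chosen so that O, W and K each occur in a single document.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns (vocab, weights), weights[i][j] = w_ij."""
    n_docs = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # DocFreq_j: number of documents containing term j.
    doc_freq = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    weights = []
    for doc in docs:
        counts = Counter(doc)
        # Base-10 logarithm is an assumption (see the note above).
        weights.append([counts[t] * math.log10(n_docs / doc_freq[t]) for t in vocab])
    return vocab, weights

# Hypothetical documents: A and B occur in both, K, O and W in one only.
vocab, w = tf_idf([["A", "B", "O", "W"], ["A", "B", "K"]])
print(vocab)                                   # ['A', 'B', 'K', 'O', 'W']
print([[round(x, 1) for x in row] for row in w])
# [[0.0, 0.0, 0.0, 0.3, 0.3], [0.0, 0.0, 0.3, 0.0, 0.0]]
```

Terms that occur in every document get weight 0, so only the discriminating terms survive.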

16
Entropy weighting
  • The weight depends on the average entropy of the
    i-th term, which equals
  • -1 if the word occurs once in every document
  • 0 if the word occurs in only one document
Ref13
Ref1122
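
The formula itself is not in the transcript. The sketch below assumes the common log-entropy scheme, weight = log(Freq + 1) * (1 + average entropy), with average entropy = (1 / log N) * Σ p log p and p the share of the term's occurrences that fall in each document. This choice is an assumption; it is used here because it reproduces the two boundary values stated on the slide (-1 and 0). The function names are hypothetical.

```python
import math

def average_entropy(term_counts):
    """term_counts: occurrences of one term in each of the N documents.

    Returns (1 / log N) * sum(p * log p); the value lies between -1 and 0.
    """
    n_docs = len(term_counts)
    total = sum(term_counts)
    if n_docs < 2 or total == 0:
        return 0.0
    s = sum((c / total) * math.log(c / total) for c in term_counts if c > 0)
    return s / math.log(n_docs)

def log_entropy_weight(freq, term_counts):
    """Assumed log-entropy weight of a term occurring `freq` times in a document."""
    return math.log(freq + 1) * (1 + average_entropy(term_counts))

print(round(average_entropy([1, 1, 1, 1]), 6))  # -1.0 (once in every document)
print(average_entropy([3, 0, 0, 0]))            # 0.0  (only one document)
```

Under this scheme a term spread evenly over all documents gets weight 0, while a term concentrated in one document keeps its full log-frequency weight.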
17
Dimension Reduction
  • Document frequency thresholding
  • $\chi^2$-statistic

18
DocFreq Thresholding
  • Input: naive terms
  • Calculate DocFreq(w) for every word w
  • Set a threshold $\theta$
  • Remove all words with DocFreq(w) < $\theta$
  • Output: feature terms
Ref11202127
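
Document frequency thresholding is a counting pass followed by a filter. A minimal sketch, with a hypothetical function name and sample collection:

```python
def doc_freq_threshold(docs, threshold):
    """Keep the terms that occur in at least `threshold` documents."""
    doc_freq = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df >= threshold)

# Hypothetical collection of preprocessed documents.
docs = [["fish", "walk"], ["fish", "stem"], ["fish", "walk", "cat"]]
print(doc_freq_threshold(docs, 2))  # ['fish', 'walk']
```

"stem" and "cat" each occur in one document only, so they fall below the threshold of 2 and are removed.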
19
$\chi^2$-statistic
  • Assumption: a pre-defined category set for a
    training collection D
  • Goal: estimate the independence between term and
    category

20
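$\chi^2$-statistic
  • Input: naive terms
  • Compute the term-category score from the counts
    $A = |\{d \mid d \in c_j \wedge w \in d\}|$
    $B = |\{d \mid d \notin c_j \wedge w \in d\}|$
    $C = |\{d \mid d \in c_j \wedge w \notin d\}|$
    $D = |\{d \mid d \notin c_j \wedge w \notin d\}|$
    $N$ = the number of documents in the training
    collection
  • Set a threshold $\theta$
  • Remove all words with $\chi^2_{\max}(w) < \theta$
  • Output: feature terms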
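
The score formula is not in the transcript. The sketch below assumes the form commonly used for term selection, $\chi^2(w, c_j) = N (AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D))$, with the counts defined as: A documents in the category containing the word, B documents outside the category containing it, C documents in the category without it, D documents outside the category without it. The function names and the labelled sample documents are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score from the four document counts (assumed formula)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def chi_square_max(term, docs, labels):
    """Maximum chi-square score of `term` over all categories."""
    scores = []
    for cat in set(labels):
        pairs = list(zip(docs, labels))
        a = sum(1 for doc, lab in pairs if lab == cat and term in doc)
        b = sum(1 for doc, lab in pairs if lab != cat and term in doc)
        c = sum(1 for doc, lab in pairs if lab == cat and term not in doc)
        d = sum(1 for doc, lab in pairs if lab != cat and term not in doc)
        scores.append(chi_square(a, b, c, d))
    return max(scores)

# Hypothetical labelled training collection.
docs = [{"goal", "team"}, {"goal", "match"}, {"vote", "team"}, {"vote", "law"}]
labels = ["sport", "sport", "politics", "politics"]
print(chi_square_max("goal", docs, labels))  # 4.0
print(chi_square_max("team", docs, labels))  # 0.0
```

"goal" occurs only in the sport documents and scores highest; "team" is spread evenly over both categories, scores 0, and would be removed by any positive threshold.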
21
Text Preprocessing using RapidMiner
22
Vector creation
23
Vector creation
24
Document Retrieval
  • Document retrieval is the retrieval of documents
    relevant to user requests, commonly called
    queries.
  • Document retrieval research and development
    efforts focus on both efficient and accurate
    search techniques

25
Text Indexing: Lucene
  • Lucene is a Java library that adds text indexing
    and searching capabilities to an application.
  • The following is an example of Lucene usage in a
    search application

26
(No Transcript)
27
Measure of Accuracy
28
Example
Precision = 50%, Recall = ?
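
Precision is the share of retrieved documents that are relevant; recall is the share of relevant documents that were retrieved. The sketch below uses hypothetical document IDs, which are not the slide's own example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 5 relevant, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5", "d6", "d7"]
print(precision_recall(retrieved, relevant))  # (0.5, 0.4)
```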
29
Document Clustering
  • Groups together conceptually related documents.
  • Enables identification of duplicates and
    near-duplicates. It also provides metadata
    characterizing the contents of a given document
    cluster.

30
Document Similarity
  • How to determine if two documents are similar?
  • Euclidean distance
  • Cosine similarity
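
Both measures operate on the term-weight vectors of two documents. A minimal sketch with hypothetical function names; returning 0.0 for an all-zero vector is a convention chosen for this illustration.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(euclidean_distance([0, 3], [4, 0]))  # 5.0
print(cosine_similarity([1, 0], [0, 1]))   # 0.0
```

Cosine similarity ignores vector length, so a document and a longer document with the same term proportions score close to 1, whereas their Euclidean distance grows with the length difference.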

31
Document Clustering: Carrot2
  • Open source search results clustering engine.
  • It can automatically organize small collections
    of documents, e.g. search results, into thematic
    categories.
  • Websites:
  • http://project.carrot2.org/
  • http://search.carrot2.org/stable/search

32
Document Categorization
  • Automatically organizes documents into
    user-defined categories or taxonomies.