Title: Text Processing
1. Text Processing
- Purpose: prepare text for indexing and retrieval
- Standard text preparation processes
  - Stopping
  - Stemming
  - Collocations
- Advanced text preparation processes
  - Tagging
  - Parsing (identifying structure)
  - HM (head-modifier) pairs
  - Concept extraction
  - Co-references and cross-references
2. Text Preparation
(Figure: text passes through Text Processing into an INDEX, which Search queries, e.g. "What recent disasters occurred in tunnels?")
3. Typical Text Processing Steps
(Figure: pipeline from raw text through stopping, stemming, collocations, tagging, parsing, HM pairs, concepts, and names)
4. Stopping
- Elimination of stopwords
  - Not used in indexing
  - Not considered content words
  - Standard list: http://www.uspto.gov/patft/stopword.htm
- Elimination of common words and symbols
  - Domain dependent, e.g., "today" in news
- Elimination of annotations, symbols
5. Stemming
- Reducing words to root forms
- Eliminate morphological variations
  - retrieval, retrieved, retrieving → retriev
- Break/unbreak multi-word compounds
  - stop-words, stop words, stopwords
- Detect negations
  - relevant, non-relevant, irrelevant, not relevant
- Done in order to increase retrieval probability (recall)
  - Variants considered synonymous (?)
  - Statistics more accurate
6. Stemming Approaches
- Standard word cutters (e.g., Porter's)
  - Use a list of standard word endings
    - -ing, -s, -es, -ed, -ally, ...
  - Usually cut off the longest matching suffix
  - But make sure the remaining stem is not too short
- Morphological
  - Performs morphological analysis of each word
  - Requires part-of-speech information (why?)
- Dictionary-based
  - Uses a lexicon to reduce words to root forms (rather than cut suffixes)
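A standard word cutter can be sketched as a longest-suffix stripper with a minimum stem length (the suffix list and threshold are illustrative, not Porter's actual rules):

```python
# Toy "standard word cutter": strip the longest matching suffix, but only
# if the remaining stem is not too short. Suffixes are a small sample.
SUFFIXES = sorted(["ing", "s", "es", "ed", "ally"], key=len, reverse=True)

def cut(word, min_stem=3):
    for suf in SUFFIXES:  # longest suffixes tried first
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word

print(cut("retrieving"))  # retriev
print(cut("stresses"))    # stress
print(cut("is"))          # is (stem would be too short)
```

Note this toy still maps "forbes" to "forb", one of the normalization problems listed on the next slide.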
7. Common Problems
- Insufficient normalization
  - stress → stres, stresses → stresse
  - stresses → stress, but not forbes → forb
- Excessive normalization
  - wander → wand, but sander → sand
  - probate → prob or probate → probe, not both!
8. Dictionary-based Stemmer
- Described in (Strzalkowski, 1994)
- Uses an on-line dictionary (MRD)
- For each word:
  1. Determine part of speech: N, V, ADJ, ADV, ...
  2. Determine inflexion patterns → legal suffixes
  3. Cut off the longest matching suffix
  4. Add on the (verb) root-form ending if required
  5. Verify the root form against the dictionary
  6. If step 5 fails, repeat steps 3 through 5 for other suffixes
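A toy rendition of these steps, with a small hand-made lexicon standing in for the on-line dictionary and illustrative suffix tables per part of speech:

```python
# Sketch of the dictionary-based stemmer above. The lexicon and the
# per-POS suffix tables are tiny illustrative stand-ins.
LEXICON = {"retrieve", "stress", "process"}
SUFFIXES = {"N": ["es", "s", "al"], "V": ["ed", "en", "ing", "es", "s"]}

def dict_stem(word, pos):
    # Steps 3-6: try legal suffixes, longest first.
    for suf in sorted(SUFFIXES[pos], key=len, reverse=True):
        if not word.endswith(suf):
            continue
        stem = word[: -len(suf)]
        # Step 4-5: optionally add a root-form ending ("e"), then verify
        # the candidate root against the dictionary.
        for root in (stem, stem + "e"):
            if root in LEXICON:
                return root
    return word  # no legal suffix yielded a dictionary root

print(dict_stem("retrieval", "N"))  # retrieve
print(dict_stem("retrieved", "V"))  # retrieve
```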
9. Stemming Example
- retrieval → retrieve
  - retrieval (N)
  - Legal suffixes: -es, -s, -al, ...
  - retrieval → retriev
  - Add root-form ending: + e
  - retriev + e
  - Verified OK: retrieve
10. Stemming Example
- retrieved → retrieve
  - retrieved (VBN, VBD)
  - Legal suffixes: -ed, -en
  - retrieved → retriev
  - Add root-form ending: + e
  - retriev + e
  - Verified OK: retrieve
11. Collocations
- Identifying words that frequently come together, because they may
  - Denote concepts: White House, senior citizen, joint venture
  - Predict the presence of the other word in text
- Can be used as units in indexing
  - Should the component words be used as well?
- The Mutual Information formula is useful
- Collocations may be specific to domains/text genres
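The mutual-information idea can be made concrete as pointwise mutual information over corpus counts (the counts below are invented):

```python
import math

# Pointwise mutual information for a candidate collocation:
#   PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
# High PMI means the pair co-occurs far more often than chance predicts.
def pmi(count_xy, count_x, count_y, n_tokens):
    p_xy = count_xy / n_tokens
    p_x = count_x / n_tokens
    p_y = count_y / n_tokens
    return math.log2(p_xy / (p_x * p_y))

# Invented counts for "white house" in a 1M-token corpus.
print(round(pmi(100, 500, 200, 1_000_000), 2))  # 9.97
```

Under independence the ratio is 1 and PMI is 0; frequent co-occurrence pushes it well above 0, flagging the pair as a collocation candidate.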
12. Part-of-Speech Tagging
- Goal: tag all words in text with POS info
- Part-of-speech classes in English
  - Nouns: cat, dog, retrieval, ...
  - Verbs: buy, walk, argued, processing, ...
  - Adjectives: red, white, happy, ...
  - Adverbs: fast, slowly, carefully, ...
  - Conjunctions: and, or, but, ...
  - Determiners: the, this, some, ...
- Automated systems usually use a more detailed tagset
13. Why POS Tagging?
- The POS tag depends upon word use in context
  - They drive (V) very fast.
  - My disk drive (N) crashed.
- Tagging removes some ambiguity that arises from treating words separately
- Tagged text can be analysed for phrases and other compounds
14. Example of POS-Tagged Text
- For McCaw, it would have hurt the company's strategy of building a seamless national cellular network.
- For/in McCaw/pn, it/pp would/md have/vb hurt/vbn the/dt company/nn 's/pos strategy/nn of/in building/vbg a/dt seamless/jj national/jj cellular/jj network/nn
15. Phrase Identification
- POS tags can be used to identify phrases, e.g., a noun-phrase pattern:
  - NP → (dt) ((rb) (jj))* (nn pos)* (nn | nns | pn)
- For/in McCaw/pn, it/pp would/md have/vb hurt/vbn the/dt company/nn 's/pos strategy/nn of/in building/vbg a/dt seamless/jj national/jj cellular/jj network/nn
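A toy version of tag-pattern matching: render the tokens as word/tag strings and chunk with a much simpler NP pattern, (dt)? (jj)* (nn|pn)+:

```python
import re

# Toy tag-pattern chunker over "word/tag" strings. The NP pattern here,
# (dt)? (jj)* (nn|pn)+, is a simplified stand-in for the slide's pattern.
TAGGED = [("the", "dt"), ("seamless", "jj"), ("national", "jj"),
          ("cellular", "jj"), ("network", "nn"), ("of", "in"),
          ("McCaw", "pn")]

NP = re.compile(r"((?:\S+/dt )?(?:\S+/jj )*(?:\S+/(?:nn|pn) ?)+)")

text = " ".join(f"{w}/{t}" for w, t in TAGGED) + " "
phrases = [m.strip() for m in NP.findall(text)]
print(phrases)
# ['the/dt seamless/jj national/jj cellular/jj network/nn', 'McCaw/pn']
```

The "of/in" token breaks the match, so the determiner-adjective-noun run and the proper noun come out as two separate phrases.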
16. How POS Tagging Works
- Stochastic approaches (e.g., Kupiec)
  - Use an HMM over word trigrams
  - Requires training; accuracy up to 98%
- Rule-based approaches (e.g., Brill)
  - Initial tags from a lexicon (ambiguous)
  - Some empirical rules, e.g., no vb after dt
  - Supervised error-driven learning; accuracy up to 98%
    - Learn tag preferences for words (tag ranking)
    - Learn tagging rules (e.g., nn preferred after jj)
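A tiny Viterbi decoder illustrates the stochastic approach; this sketch uses a bigram HMM (the trigram version works the same way with a larger state), and all probabilities are invented:

```python
# Toy bigram-HMM tagger decoded with Viterbi. Probabilities are invented;
# unseen transitions/emissions get a tiny smoothing value.
TAGS = ["pp", "dt", "nn", "vb", "vbd"]
TRANS = {("<s>", "pp"): 0.4, ("<s>", "dt"): 0.4, ("pp", "vb"): 0.6,
         ("dt", "nn"): 0.7, ("nn", "nn"): 0.3, ("nn", "vb"): 0.5,
         ("nn", "vbd"): 0.4}
EMIT = {("pp", "they"): 0.3, ("vb", "drive"): 0.2, ("dt", "my"): 0.3,
        ("nn", "disk"): 0.1, ("nn", "drive"): 0.1, ("vbd", "crashed"): 0.3}

def viterbi(words, eps=1e-6):
    best = {"<s>": (1.0, [])}  # tag -> (best path probability, path)
    for w in words:
        nxt = {}
        for prev, (p, path) in best.items():
            for t in TAGS:
                q = p * TRANS.get((prev, t), eps) * EMIT.get((t, w), eps)
                if q > nxt.get(t, (0.0, None))[0]:
                    nxt[t] = (q, path + [t])
        best = nxt
    return max(best.values())[1]

print(viterbi(["they", "drive"]))                   # ['pp', 'vb']
print(viterbi(["my", "disk", "drive", "crashed"]))  # ['dt', 'nn', 'nn', 'vbd']
```

The two decodings show context-driven disambiguation of "drive": a verb after a pronoun, a noun between a noun and a past-tense verb.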
17. Error-Driven Learning
- A powerful paradigm for supervised machine learning
- Proposed by Eric Brill in his PhD work (1992)
- Applications in routing and classification
- General idea:
  - Assume the initial classification could be a guess
  - Manually correct errors
  - Have the system correct its behavior to accommodate the corrections
  - Use unbiased training data
18. Parsing
- Use English (or other language) grammar to derive the full (or approximate) syntactic structure of text
  - Hand-constructed grammars (generic)
  - Stochastic grammars (derived from training texts)
- Identify phrases, word dependencies
- Normalize structural variants
19. Parsing Example
(assert
  (will_aux) (perf have)
  (verb hurt)
  (subject (np (n it)))
  (object (np (n strategy) (t_pos the)
    (n_pos poss (n company))
    (of (verb build)
      (subject anyone)
      (object (np (n network) (t_pos a)
        (adj seamless) (adj national) (adj cellular)))))))
20. Graphical Parsing
(Figure: the same parse drawn as a tree, with assert at the root, aux "will", perf "have", predicate/verb "hurt", subject "it", and an object NP headed by "strategy" with tpos and npos branches)
21. Head-Modifier Dependencies
(Figure: the parse from slide 19, with head and modifier roles marked on selected dependencies, e.g., head "hurt" with modifier "strategy")
22. Head-Modifier Structures
- HM dependencies extracted from this parse (head+modifier format):
  - hurt+strategy, strategy+company
  - build+network
  - network+cellular, network+national, network+seamless
- Can be used as indexing terms
  - More refined than simple phrases
  - Normalization across all syntactic forms (problems?)
  - Can be nested or un-nested pairs
- Order information is important
  - venetian+blind ≠ blind+venetian
  - college+freshman ≠ freshman+college
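Real systems derive the pairs from a full parse; as a toy illustration, head+modifier terms for a flat adjective+noun NP, taking the final noun as head:

```python
# Toy head-modifier extraction for a flat NP: the last noun is the head,
# preceding adjectives are its modifiers. A real system would read the
# (head, modifier) pairs off the parse tree instead.
def np_to_hm(tokens):
    """tokens: [(word, tag), ...] for one noun phrase."""
    head = [w for w, t in tokens if t == "nn"][-1]
    return [f"{head}+{w}" for w, t in tokens if t == "jj"]

np = [("a", "dt"), ("seamless", "jj"), ("national", "jj"),
      ("cellular", "jj"), ("network", "nn")]
print(np_to_hm(np))
# ['network+seamless', 'network+national', 'network+cellular']
```

Because each term is a directed head+modifier pair, "venetian+blind" and "blind+venetian" remain distinct index entries.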
23. Stream Model
(Figure: search topics are matched against parallel streams of phrases, names, and HM pairs, producing a merged ranked output)
24. Stream Model IR
(Figure: a query is run by search engines against separate indexes for words, phrases, people, locations, and weapons, built from text by NLP; the per-stream results are fused, then summarized and presented, with feedback to the query)
25. Stream Model Evaluation
- Compare the performance of different indexing approaches
- Determine the contribution of each stream to the overall result
- Uncertainty factors
  - Rankings fusion
  - Cross-stream dependencies
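Rankings fusion can be sketched with CombSUM-style score addition, one common choice (the slides do not fix a particular fusion method, and the scores below are invented):

```python
# CombSUM-style fusion sketch: sum each document's scores across the
# per-stream rankings, then rank by the fused score.
def fuse(streams):
    """streams: list of {doc_id: score} dicts, one per stream."""
    merged = {}
    for ranking in streams:
        for doc, score in ranking.items():
            merged[doc] = merged.get(doc, 0.0) + score
    return sorted(merged, key=merged.get, reverse=True)

words_stream  = {"d1": 0.9, "d2": 0.5, "d3": 0.3}
phrase_stream = {"d2": 0.8, "d3": 0.4}
pairs_stream  = {"d3": 0.7}
print(fuse([words_stream, phrase_stream, pairs_stream]))
# ['d3', 'd2', 'd1']  (fused scores 1.4, 1.3, 0.9)
```

This is where the uncertainty factors bite: the fused order depends on how comparable the per-stream scores are and on cross-stream dependencies between the evidence they count.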
26. Stream Model Evaluation (TREC-5)
27. Concept Extraction
- Explicit identification of references to
  - Named entities: people, organizations, locations
  - Events and relationships
- Detection of small text regions that
  - Have features indicating the presence of concepts
- No explicit extraction → can't use in the index (why?)
- Used for query expansion / relevance feedback
28. Concept-Based Indexing
- Index documents using concepts such as
  - Entities, events, relations
  - E.g., "disasters in tunnels"
- But how to represent "disaster", "tunnel", etc.?
- But how to recognize "disaster", "tunnel" in text?
- How to represent documents with concepts?
  - Weighted keys?
  - Semantic maps?
29. Using Concepts to Enrich BOW
- Add compound terms and concepts to documents
- Treat them as tokens in the bag-of-words
- Weigh them just as other tokens
  - tf·idf based on distribution
  - A function of the weights of component words
  - Ad hoc
- Neither approach satisfactory (why not?)
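The first weighting option, tf·idf over the concept token's own distribution, in miniature (the documents and the DISASTER_EVENT concept token are invented):

```python
import math

# Concept tokens appended to the bag-of-words and weighted by tf.idf
# computed from their own distribution, just like ordinary tokens.
docs = [["tunnel", "fire", "DISASTER_EVENT"],
        ["tunnel", "toll", "traffic"],
        ["earthquake", "DISASTER_EVENT"]]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)      # document frequency
    return tf * math.log(len(docs) / df)

print(round(tfidf("DISASTER_EVENT", docs[0], docs), 3))  # 0.135
```

The weight only reflects how often the recognizer fired, not how reliable the recognition was, which is one reason the slide calls neither approach satisfactory.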
30. Detecting Concept Presence
- Supervised machine learning
  - Use human-annotated text for training
  - Extract context cues that indicate concepts of a given kind
  - Construct first-cut recognizers
- Unsupervised fitting
  - Apply the recognizer to new, un-annotated training data
  - Learn more context cues from where the concepts occur
  - Revise recognizer rules; iterate until stable
31. Self-Learning Concept Spotter
- Proposed by Strzalkowski & Wang, 1996
- Start from seed descriptions/naïve rules
- Bootstrap from examples found in text
- Very effective, converges quickly
- Accuracy rivals human-made grammars
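A toy bootstrapping loop in this spirit, with regexes standing in for the seed and learned rules (the patterns and the learned rule are simplified assumptions, not the paper's actual grammar):

```python
import re

# Toy S-LCS-style bootstrap: a seed rule (PN + "Co."/"Inc.") finds company
# names; a context rule learned around those hits ("president of" + PN)
# then spots companies the seed rule misses.
TEXT = ("Henry Kaufman is president of Henry Kaufman Co. Claude Rosenberg "
        "is named president of Skandinaviska Enskilda Banken")

# Seed rule: COMPANY -> PN + "Co." | PN + "Inc."
seed = re.compile(r"([A-Z]\w+(?: [A-Z]\w+)*) (?:Co\.|Inc\.)")
companies = set(seed.findall(TEXT))          # {'Henry Kaufman'}

# Learned context rule: COMPANY -> "president of" + PN.
# The (?!\.) lookahead keeps the "Co." designator out of the name.
learned = re.compile(r"president of ([A-Z]\w+(?: [A-Z]\w+)*)(?!\.)")
companies |= set(learned.findall(TEXT))
print(sorted(companies))
# ['Henry Kaufman', 'Skandinaviska Enskilda Banken']
```

One more iteration would generalize from the new hit to a designator rule such as PN + "Banken", matching the rule-growth shown on the next slides.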
32. S-LCS Example
- Seed rules: COMPANY → NP + "Co." | NP + "Inc."
- Example text snippets:
  - "Henry Kaufman is president of Henry Kaufman Co., a"
  - "Gabelli, chairman of Gabelli Funds, Inc."
  - "Claude N. Rosenberg is named president of Skandinaviska Enskilda Banken"
  - "become vice chairman of the state-owned electronics giant Thompson S.A."
  - "banking group, said the former"
  - "merger of Skanska Banken into water maker Source Perrier S.A., according to French stock"
33. S-LCS Example
- Seed rules: COMPANY → PN + "Co." | PN + "Inc."
- Add: COMPANY → "president/chairman of" + np(PN)
- (Applied to the same text snippets as on the previous slide)
34. S-LCS Example
- Seed rules: COMPANY → PN + "Co." | PN + "Inc."
- Add: COMPANY → "president/chairman of" + np(PN)
- Add: COMPANY → PN + "S.A." | PN + "Banken"
- (Applied to the same text snippets as on the previous slides)
35. Co- and Cross-References
- Co-references: usually pronouns, definite descriptions
  - Tracking needed to get counts right
  - Not an easy problem in general
- Cross-references are across documents
  - Is this the same person, place, event?
  - Generalizes to topic detection
36. XDC Approach
- Proposed by Amit Bagga and Breck Baldwin
- Disambiguate entities/events across documents
- Done by looking at the context around the entity/event
  - Expected to differ for Michael Jordan (NBA) and Prof. Michael I. Jordan (UC Berkeley)
- Context extracted in the form of a summary for each entity/event
  - Sentence selection
- Contexts are compared, currently using the Vector Space Model
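The comparison step in miniature: context summaries as term-frequency vectors scored with cosine similarity under the Vector Space Model (the vectors below are invented):

```python
import math

# Cosine similarity between two context vectors (term -> frequency).
# Low similarity suggests the two mentions refer to different entities.
def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Invented context summaries for two "Michael Jordan" mentions.
nba = {"jordan": 2, "bulls": 1, "nba": 1, "basketball": 1}
ucb = {"jordan": 2, "berkeley": 1, "machine": 1, "learning": 1}
print(round(cosine(nba, ucb), 2))  # 0.57: overlap comes only from the name
```

With a threshold set above this score, the two mentions would be kept as distinct entities; identical contexts score 1.0.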