Title: Text Processing
1. Text Processing
- Purpose: prepare text for indexing and retrieval
- Standard text preparation processes
  - Stopping
  - Stemming
  - Collocations
- Advanced text preparation processes
  - Tagging
  - Parsing (identifying structure)
  - HM (head-modifier) pairs
  - Concept extraction
  - Co-references and cross-references
2. Text Preparation
(Figure: text passes through Text Processing into an INDEX, which Search queries, e.g. "What recent disasters occurred in tunnels?")
3. Typical Text Processing Steps
(Figure: pipeline from raw text through stopping, stemming, collocations, tagging, parsing, HM pairs, concepts, and names)
4. Stopping
- Elimination of stopwords
  - Not used in indexing
  - Not considered content words
  - Standard list: http://www.uspto.gov/patft/stopword.htm
- Elimination of common words and symbols
  - Domain dependent, e.g., "today" in news
- Elimination of annotations, symbols
5. Stemming
- Reducing words to root forms
- Eliminate morphological variations
  - retrieval, retrieved, retrieving → retriev
- Break/unbreak multi-word compounds
  - stop-words, stop words, stopwords
- Detect negations
  - relevant, non-relevant, irrelevant, not relevant
- Done in order to increase retrieval probability (recall)
  - Variants considered synonymous (?)
  - Statistics more accurate
6. Stemming Approaches
- Standard word cutters (e.g., Porter's)
  - Use a list of standard word endings
    - -ing, -s, -es, -ed, -ally, ...
  - Usually cut off the longest matching suffix
  - But make sure the remaining stem is not too short
- Morphological
  - Performs morphological analysis of each word
  - Requires part-of-speech information (why?)
- Dictionary-based
  - Uses a lexicon to reduce words to root forms (rather than cut suffixes)
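A standard word cutter can be sketched as a longest-suffix stripper with a minimum stem length (the suffix list and threshold are illustrative, not Porter's actual rules):

```python
# Toy "standard word cutter": strip the longest matching suffix, but only
# if the remaining stem is not too short. Suffixes are a small sample.
SUFFIXES = sorted(["ing", "s", "es", "ed", "ally"], key=len, reverse=True)

def cut(word, min_stem=3):
    for suf in SUFFIXES:  # longest suffixes tried first
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word

print(cut("retrieving"))  # retriev
print(cut("stresses"))    # stress
print(cut("is"))          # is (stem would be too short)
```

Note this toy still maps "forbes" to "forb", one of the normalization problems listed on the next slide.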
7. Common Problems
- Insufficient normalization
  - stress → stres, stresses → stresse
  - stresses → stress, but not forbes → forb
- Excessive normalization
  - wander → wand, but sander → sand
  - probate → prob or probate → probe, not both!
8. Dictionary-based Stemmer
- Described in (Strzalkowski, 1994)
- Uses an on-line dictionary (MRD)
- For each word:
  1. Determine part of speech: N, V, ADJ, ADV, ...
  2. Determine inflexion patterns → legal suffixes
  3. Cut off the longest matching suffix
  4. Add on the (verb) root-form ending if required
  5. Verify the root form against the dictionary
  6. If step 5 fails, repeat steps 3 through 5 for other suffixes
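A toy rendition of these steps, with a small hand-made lexicon standing in for the on-line dictionary and illustrative suffix tables per part of speech:

```python
# Sketch of the dictionary-based stemmer above. The lexicon and the
# per-POS suffix tables are tiny illustrative stand-ins.
LEXICON = {"retrieve", "stress", "process"}
SUFFIXES = {"N": ["es", "s", "al"], "V": ["ed", "en", "ing", "es", "s"]}

def dict_stem(word, pos):
    # Steps 3-6: try legal suffixes, longest first.
    for suf in sorted(SUFFIXES[pos], key=len, reverse=True):
        if not word.endswith(suf):
            continue
        stem = word[: -len(suf)]
        # Step 4-5: optionally add a root-form ending ("e"), then verify
        # the candidate root against the dictionary.
        for root in (stem, stem + "e"):
            if root in LEXICON:
                return root
    return word  # no legal suffix yielded a dictionary root

print(dict_stem("retrieval", "N"))  # retrieve
print(dict_stem("retrieved", "V"))  # retrieve
```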
9. Stemming Example
- retrieval → retrieve
  - retrieval (N)
  - Legal suffixes: -es, -s, -al, ...
  - retrieval → retriev
  - Add root-form ending: + e
  - retriev + e
  - Verified OK: retrieve
10. Stemming Example
- retrieved → retrieve
  - retrieved (VBN, VBD)
  - Legal suffixes: -ed, -en
  - retrieved → retriev
  - Add root-form ending: + e
  - retriev + e
  - Verified OK: retrieve
11. Collocations
- Identifying words that frequently come together, because they may
  - Denote concepts: White House, senior citizen, joint venture
  - Predict the presence of the other word in text
- Can be used as units in indexing
  - Should the component words be used as well?
- The Mutual Information formula is useful
- Collocations may be specific to domains/text genres
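The mutual-information idea can be made concrete as pointwise mutual information over corpus counts (the counts below are invented):

```python
import math

# Pointwise mutual information for a candidate collocation:
#   PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
# High PMI means the pair co-occurs far more often than chance predicts.
def pmi(count_xy, count_x, count_y, n_tokens):
    p_xy = count_xy / n_tokens
    p_x = count_x / n_tokens
    p_y = count_y / n_tokens
    return math.log2(p_xy / (p_x * p_y))

# Invented counts for "white house" in a 1M-token corpus.
print(round(pmi(100, 500, 200, 1_000_000), 2))  # 9.97
```

Under independence the ratio is 1 and PMI is 0; frequent co-occurrence pushes it well above 0, flagging the pair as a collocation candidate.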
12. Part-of-Speech Tagging
- Goal: tag all words in text with POS info
- Part-of-speech classes in English
  - Nouns: cat, dog, retrieval, ...
  - Verbs: buy, walk, argued, processing, ...
  - Adjectives: red, white, happy, ...
  - Adverbs: fast, slowly, carefully, ...
  - Conjunctions: and, or, but, ...
  - Determiners: the, this, some, ...
- Automated systems usually use a more detailed tagset
13. Why POS Tagging?
- The POS tag depends upon word use in context
  - They drive (V) very fast.
  - My disk drive (N) crashed.
- Tagging removes some ambiguity that arises from treating words separately
- Tagged text can be analysed for phrases and other compounds
14. Example of POS-Tagged Text
- For McCaw, it would have hurt the company's strategy of building a seamless national cellular network.
- For/in McCaw/pn, it/pp would/md have/vb hurt/vbn the/dt company/nn 's/pos strategy/nn of/in building/vbg a/dt seamless/jj national/jj cellular/jj network/nn
15. Phrase Identification
- POS tags can be used to identify phrases, e.g., a noun-phrase pattern:
  - NP → (dt) ((rb) (jj))* (nn pos)* (nn | nns | pn)
- For/in McCaw/pn, it/pp would/md have/vb hurt/vbn the/dt company/nn 's/pos strategy/nn of/in building/vbg a/dt seamless/jj national/jj cellular/jj network/nn
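A toy version of tag-pattern matching: render the tokens as word/tag strings and chunk with a much simpler NP pattern, (dt)? (jj)* (nn|pn)+:

```python
import re

# Toy tag-pattern chunker over "word/tag" strings. The NP pattern here,
# (dt)? (jj)* (nn|pn)+, is a simplified stand-in for the slide's pattern.
TAGGED = [("the", "dt"), ("seamless", "jj"), ("national", "jj"),
          ("cellular", "jj"), ("network", "nn"), ("of", "in"),
          ("McCaw", "pn")]

NP = re.compile(r"((?:\S+/dt )?(?:\S+/jj )*(?:\S+/(?:nn|pn) ?)+)")

text = " ".join(f"{w}/{t}" for w, t in TAGGED) + " "
phrases = [m.strip() for m in NP.findall(text)]
print(phrases)
# ['the/dt seamless/jj national/jj cellular/jj network/nn', 'McCaw/pn']
```

The "of/in" token breaks the match, so the determiner-adjective-noun run and the proper noun come out as two separate phrases.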
16. How POS Tagging Works
- Stochastic approaches (e.g., Kupiec)
  - Use an HMM over word trigrams
  - Requires training; accuracy up to 98%
- Rule-based approaches (e.g., Brill)
  - Initial tags from a lexicon (ambiguous)
  - Some empirical rules, e.g., no vb after dt
  - Supervised error-driven learning; accuracy up to 98%
    - Learn tag preferences for words (tag ranking)
    - Learn tagging rules (e.g., nn preferred after jj)
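A tiny Viterbi decoder illustrates the stochastic approach; this sketch uses a bigram HMM (the trigram version works the same way with a larger state), and all probabilities are invented:

```python
# Toy bigram-HMM tagger decoded with Viterbi. Probabilities are invented;
# unseen transitions/emissions get a tiny smoothing value.
TAGS = ["pp", "dt", "nn", "vb", "vbd"]
TRANS = {("<s>", "pp"): 0.4, ("<s>", "dt"): 0.4, ("pp", "vb"): 0.6,
         ("dt", "nn"): 0.7, ("nn", "nn"): 0.3, ("nn", "vb"): 0.5,
         ("nn", "vbd"): 0.4}
EMIT = {("pp", "they"): 0.3, ("vb", "drive"): 0.2, ("dt", "my"): 0.3,
        ("nn", "disk"): 0.1, ("nn", "drive"): 0.1, ("vbd", "crashed"): 0.3}

def viterbi(words, eps=1e-6):
    best = {"<s>": (1.0, [])}  # tag -> (best path probability, path)
    for w in words:
        nxt = {}
        for prev, (p, path) in best.items():
            for t in TAGS:
                q = p * TRANS.get((prev, t), eps) * EMIT.get((t, w), eps)
                if q > nxt.get(t, (0.0, None))[0]:
                    nxt[t] = (q, path + [t])
        best = nxt
    return max(best.values())[1]

print(viterbi(["they", "drive"]))                   # ['pp', 'vb']
print(viterbi(["my", "disk", "drive", "crashed"]))  # ['dt', 'nn', 'nn', 'vbd']
```

The two decodings show context-driven disambiguation of "drive": a verb after a pronoun, a noun between a noun and a past-tense verb.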
17. Error-Driven Learning
- A powerful paradigm for supervised machine learning
- Proposed by Eric Brill in his PhD work (1992)
- Applications in routing and classification
- General idea:
  - Assume the initial classification could be a guess
  - Manually correct errors
  - Have the system correct its behavior to accommodate the corrections
  - Use unbiased training data
18. Parsing
- Use English (or other language) grammar to derive the full (or approximate) syntactic structure of text
  - Hand-constructed grammars (generic)
  - Stochastic grammars (derived from training texts)
- Identify phrases, word dependencies
- Normalize structural variants
19. Parsing Example
(assert
  (will_aux) (perf have)
  (verb hurt)
  (subject (np (n it)))
  (object (np (n strategy) (t_pos the)
    (n_pos poss (n company))
    (of (verb build)
      (subject anyone)
      (object (np (n network) (t_pos a)
        (adj seamless) (adj national) (adj cellular)))))))
20. Graphical Parsing
(Figure: the same parse drawn as a tree, with assert at the root, aux "will", perf "have", predicate/verb "hurt", subject "it", and an object NP headed by "strategy" with tpos and npos branches)
21. Head-Modifier Dependencies
(Figure: the parse from slide 19, with head and modifier roles marked on selected dependencies, e.g., head "hurt" with modifier "strategy")
22. Head-Modifier Structures
- HM dependencies extracted from this parse (head+modifier format):
  - hurt+strategy, strategy+company
  - build+network
  - network+cellular, network+national, network+seamless
- Can be used as indexing terms
  - More refined than simple phrases
  - Normalization across all syntactic forms (problems?)
  - Can be nested or un-nested pairs
- Order information is important
  - venetian+blind ≠ blind+venetian
  - college+freshman ≠ freshman+college
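Real systems derive the pairs from a full parse; as a toy illustration, head+modifier terms for a flat adjective+noun NP, taking the final noun as head:

```python
# Toy head-modifier extraction for a flat NP: the last noun is the head,
# preceding adjectives are its modifiers. A real system would read the
# (head, modifier) pairs off the parse tree instead.
def np_to_hm(tokens):
    """tokens: [(word, tag), ...] for one noun phrase."""
    head = [w for w, t in tokens if t == "nn"][-1]
    return [f"{head}+{w}" for w, t in tokens if t == "jj"]

np = [("a", "dt"), ("seamless", "jj"), ("national", "jj"),
      ("cellular", "jj"), ("network", "nn")]
print(np_to_hm(np))
# ['network+seamless', 'network+national', 'network+cellular']
```

Because each term is a directed head+modifier pair, "venetian+blind" and "blind+venetian" remain distinct index entries.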
23. Stream Model
(Figure: search topics are matched against parallel streams of phrases, names, and HM pairs, producing a merged ranked output)
24. Stream Model IR
(Figure: a query is run by search engines against separate indexes for words, phrases, people, locations, and weapons, built from text by NLP; the per-stream results are fused, then summarized and presented, with feedback to the query)
25. Stream Model Evaluation
- Compare the performance of different indexing approaches
- Determine the contribution of each stream to the overall result
- Uncertainty factors
  - Rankings fusion
  - Cross-stream dependencies
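Rankings fusion can be sketched with CombSUM-style score addition, one common choice (the slides do not fix a particular fusion method, and the scores below are invented):

```python
# CombSUM-style fusion sketch: sum each document's scores across the
# per-stream rankings, then rank by the fused score.
def fuse(streams):
    """streams: list of {doc_id: score} dicts, one per stream."""
    merged = {}
    for ranking in streams:
        for doc, score in ranking.items():
            merged[doc] = merged.get(doc, 0.0) + score
    return sorted(merged, key=merged.get, reverse=True)

words_stream  = {"d1": 0.9, "d2": 0.5, "d3": 0.3}
phrase_stream = {"d2": 0.8, "d3": 0.4}
pairs_stream  = {"d3": 0.7}
print(fuse([words_stream, phrase_stream, pairs_stream]))
# ['d3', 'd2', 'd1']  (fused scores 1.4, 1.3, 0.9)
```

This is where the uncertainty factors bite: the fused order depends on how comparable the per-stream scores are and on cross-stream dependencies between the evidence they count.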
26. Stream Model Evaluation (TREC-5)
27. Concept Extraction
- Explicit identification of references to
  - Named entities: people, organizations, locations
  - Events and relationships
- Detection of small text regions that
  - Have features indicating the presence of concepts
- No explicit extraction → can't use in the index (why?)
- Used for query expansion / relevance feedback
28. Concept-Based Indexing
- Index documents using concepts such as
  - Entities, events, relations
  - E.g., "disasters in tunnels"
- But how to represent "disaster", "tunnel", etc.?
- But how to recognize "disaster", "tunnel" in text?
- How to represent documents with concepts?
  - Weighted keys?
  - Semantic maps?
29. Using Concepts to Enrich BOW
- Add compound terms and concepts to documents
- Treat them as tokens in the bag-of-words
- Weigh them just as other tokens
  - tf·idf based on distribution
  - A function of the weights of component words
  - Ad hoc
- Neither approach satisfactory (why not?)
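The first weighting option, tf·idf over the concept token's own distribution, in miniature (the documents and the DISASTER_EVENT concept token are invented):

```python
import math

# Concept tokens appended to the bag-of-words and weighted by tf.idf
# computed from their own distribution, just like ordinary tokens.
docs = [["tunnel", "fire", "DISASTER_EVENT"],
        ["tunnel", "toll", "traffic"],
        ["earthquake", "DISASTER_EVENT"]]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)      # document frequency
    return tf * math.log(len(docs) / df)

print(round(tfidf("DISASTER_EVENT", docs[0], docs), 3))  # 0.135
```

The weight only reflects how often the recognizer fired, not how reliable the recognition was, which is one reason the slide calls neither approach satisfactory.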
30. Detecting Concept Presence
- Supervised machine learning
  - Use human-annotated text for training
  - Extract context cues that indicate concepts of a given kind
  - Construct first-cut recognizers
- Unsupervised fitting
  - Apply the recognizer to new, un-annotated training data
  - Learn more context cues from where the concepts occur
  - Revise recognizer rules; iterate until stable
31. Self-Learning Concept Spotter
- Proposed by Strzalkowski & Wang, 1996
- Start from seed descriptions/naïve rules
- Bootstrap from examples found in text
- Very effective, converges quickly
- Accuracy rivals human-made grammars
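A toy bootstrapping loop in this spirit, with regexes standing in for the seed and learned rules (the patterns and the learned rule are simplified assumptions, not the paper's actual grammar):

```python
import re

# Toy S-LCS-style bootstrap: a seed rule (PN + "Co."/"Inc.") finds company
# names; a context rule learned around those hits ("president of" + PN)
# then spots companies the seed rule misses.
TEXT = ("Henry Kaufman is president of Henry Kaufman Co. Claude Rosenberg "
        "is named president of Skandinaviska Enskilda Banken")

# Seed rule: COMPANY -> PN + "Co." | PN + "Inc."
seed = re.compile(r"([A-Z]\w+(?: [A-Z]\w+)*) (?:Co\.|Inc\.)")
companies = set(seed.findall(TEXT))          # {'Henry Kaufman'}

# Learned context rule: COMPANY -> "president of" + PN.
# The (?!\.) lookahead keeps the "Co." designator out of the name.
learned = re.compile(r"president of ([A-Z]\w+(?: [A-Z]\w+)*)(?!\.)")
companies |= set(learned.findall(TEXT))
print(sorted(companies))
# ['Henry Kaufman', 'Skandinaviska Enskilda Banken']
```

One more iteration would generalize from the new hit to a designator rule such as PN + "Banken", matching the rule-growth shown on the next slides.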
32. S-LCS Example
- Seed rules: COMPANY → NP + "Co." | NP + "Inc."
- Example text snippets:
  - "Henry Kaufman is president of Henry Kaufman Co., a"
  - "Gabelli, chairman of Gabelli Funds, Inc."
  - "Claude N. Rosenberg is named president of Skandinaviska Enskilda Banken"
  - "become vice chairman of the state-owned electronics giant Thompson S.A."
  - "banking group, said the former"
  - "merger of Skanska Banken into water maker Source Perrier S.A., according to French stock"
33. S-LCS Example
- Seed rules: COMPANY → PN + "Co." | PN + "Inc."
- Add: COMPANY → "president/chairman of" + np(PN)
- (Applied to the same text snippets as on the previous slide)
34. S-LCS Example
- Seed rules: COMPANY → PN + "Co." | PN + "Inc."
- Add: COMPANY → "president/chairman of" + np(PN)
- Add: COMPANY → PN + "S.A." | PN + "Banken"
- (Applied to the same text snippets as on the previous slides)
35. Co- and Cross-References
- Co-references: usually pronouns, definite descriptions
  - Tracking needed to get counts right
  - Not an easy problem in general
- Cross-references are across documents
  - Is this the same person, place, event?
  - Generalizes to topic detection
36. XDC Approach
- Proposed by Amit Bagga and Breck Baldwin
- Disambiguate entities/events across documents
- Done by looking at the context around the entity/event
  - Expected to differ for Michael Jordan (NBA) and Prof. Michael I. Jordan (UC Berkeley)
- Context extracted in the form of a summary for each entity/event
  - Sentence selection
- Contexts are compared, currently using the Vector Space Model
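The comparison step in miniature: context summaries as term-frequency vectors scored with cosine similarity under the Vector Space Model (the vectors below are invented):

```python
import math

# Cosine similarity between two context vectors (term -> frequency).
# Low similarity suggests the two mentions refer to different entities.
def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Invented context summaries for two "Michael Jordan" mentions.
nba = {"jordan": 2, "bulls": 1, "nba": 1, "basketball": 1}
ucb = {"jordan": 2, "berkeley": 1, "machine": 1, "learning": 1}
print(round(cosine(nba, ucb), 2))  # 0.57: overlap comes only from the name
```

With a threshold set above this score, the two mentions would be kept as distinct entities; identical contexts score 1.0.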