Title: N-gram Tokenization for Indian Language Text Retrieval
1N-gram Tokenization for Indian Language Text Retrieval
- Paul McNamee
- paul.mcnamee_at_jhu.edu
- 13 December 2008
2Talk Outline
- Introduction
- Monolingual Experiments from CLEF 2000-2007
- Words
- Stemmed words (Snowball)
- Character n-grams (n = 4, 5)
- N-gram stems
- Automatically segmented words (Morfessor algorithm)
- Skipgrams (n-grams with skips)
- Why are n-grams effective?
- Bilingual Experiments (CLEF)
- FIRE Results
- Summary
3Morphological Processes
- Inflection
- box, boxes (plural); actor (male), actress (female)
- Conjugation
- write, written, writing; swim, swam, swum
- Derivation
- sleep, sleepy; play (verb), player (noun), playful (adjective)
- Word Formation
- Compounding: news + paper -> newspaper; air + port -> airport
- Clipping: professor -> prof; facsimile -> fax
- Acronyms: GOI = Government of India
4Why Do We Normalize Text?
- It seems desirable to group related words together for query/document processing
- Why?
- To make lexicographers happy?
- To improve system performance?
- If performance is the goal, then it ought not to matter whether the indexing terms look like morphemes or not
5Rule-Based Stemming Snowball
- Applicable to alphabetic languages
- An approximation to lemmatization
- Identify a root morpheme by chopping off prefixes and suffixes
- Used for Dutch, English, Finnish, French, German, Italian, Spanish, and Swedish
- Snowball rulesets also exist for Hungarian and Portuguese
- No Indian language support
Most stemmers are rule-based (ε denotes the empty string):
-ing -> ε: juggling -> juggl
-es -> ε: juggles -> juggl
-le -> -l: juggle -> juggl
The Snowball project provides high-quality, rule-based stemmers for many European languages
http://snowball.tartarus.org/
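The three suffix rules on this slide can be tried out with a few lines of Python. This is only a toy illustration of rule-based suffix stripping using the slide's own rules; it is not the Snowball algorithm, and the names `RULES` and `toy_stem` are made up for this sketch.

```python
# Toy suffix rules copied from the slide: (suffix, replacement).
# This is NOT Snowball; real stemmers apply many ordered, conditioned rules.
RULES = [("ing", ""), ("es", ""), ("le", "l")]


def toy_stem(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["juggling", "juggles", "juggle"]:
    print(w, "->", toy_stem(w))
```

All three surface forms collapse to the same stem, which is the conflation effect that stemming aims for.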
6N-gram Tokenization
- Represent text as overlapping substrings
- A fixed length n of 4 or 5 is effective in alphabetic languages
- For text of length m, there are m-n+1 n-grams
s w i m m e r s
_ s w i m
s w i m m
w i m m e
i m m e r
m m e r s
m e r s _
- Advantages: simple, addresses morphology, surrogate for short phrases, robust against spelling and diacritical errors, language independence
- Disadvantages: conflation (e.g., simmer, slimmer, glimmer, immerse); n-grams incur both speed and disk usage penalties
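A minimal sketch of overlapping character n-gram generation follows. It assumes, as in the "swimmers" example, that the text is padded with one underscore on each side before the n-grams are taken; the function name `char_ngrams` is chosen here for illustration.

```python
def char_ngrams(text, n=5, pad="_"):
    """Return the overlapping character n-grams of text, padded on both sides."""
    padded = pad + text + pad
    if len(padded) < n:
        # Too short to yield a full n-gram: keep the padded text as a single term.
        return [padded]
    # A padded string of length m yields m - n + 1 n-grams.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("swimmers"))
print(char_ngrams("feet"))
```

With padding, "swimmers" has length 10, so 10 - 5 + 1 = 6 five-grams are produced, matching the example above.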
7Single N-gram Stemming
- Traditional (rule-based) stemming attempts to remove the morphologically variable portion of words
- Negative effects from over- and under-conflation
Hungarian        Bulgarian
_hun (20547)     _bul (10222)
hung (4329)      bulg (963)
unga (1773)      ulga (1955)
ngar (1194)      lgar (1480)
gari (2477)      gari (2477)
aria (11036)     aria (11036)
rian (18485)     rian (18485)
ian_ (49777)     ian_ (49777)
Short n-grams covering affixes occur frequently; those around the morpheme tend to occur less often. This motivates the following approach:
(1) For each word, choose the least frequently occurring character 4-gram (using a 4-gram index)
(2) Benefits of n-grams with the run-time efficiency of stemming
Continues work in Mayfield and McNamee, Single N-gram Stemming, SIGIR 2003
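The selection step can be sketched as follows, using the 4-gram frequencies shown on this slide as a stand-in for a real 4-gram index. The dictionary `FREQ` and the function `single_ngram_stem` are illustrative names, and the handling of ties or unseen n-grams in the actual system is not specified here.

```python
# 4-gram collection frequencies copied from the slide (a stand-in for a real index).
FREQ = {
    "_hun": 20547, "hung": 4329, "unga": 1773, "ngar": 1194,
    "_bul": 10222, "bulg": 963, "ulga": 1955, "lgar": 1480,
    "gari": 2477, "aria": 11036, "rian": 18485, "ian_": 49777,
}


def single_ngram_stem(word, freq, n=4):
    """Pick the least frequent character n-gram of a word as its pseudo-stem."""
    padded = "_" + word.lower() + "_"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if not grams:
        return padded
    # Assumes every n-gram of the word is present in the frequency table.
    return min(grams, key=lambda g: freq[g])


print(single_ngram_stem("Hungarian", FREQ))
print(single_ngram_stem("Bulgarian", FREQ))
```

Each word is indexed by one term, so the index is no larger than a stemmed index, while the chosen 4-gram tends to sit on the distinctive part of the word rather than on the shared suffix.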
8Statistical Segmentation
- Morfessor Algorithm
- Given a dictionary list, learns to split words into segments
- A form of statistical stemming based on Minimum Description Length (MDL)
- > 70% of world languages have concatenative morphology
- Creutz & Lagus, ACL-2002
- http://www.cis.hut.fi/projects/morpho
- 2007 Morphology Challenge
- Successful on an IR task
- Multiple segments per word are generated
- Examples
- affectionate
- authorized
- juggled
- jugglers
- seagulls
See McNamee, Nicholas, Mayfield, Don't Have a Stemmer? Be Un+concern+ed, SIGIR 2008
9Character Skipgrams
- Character n-grams: robust matching technique
- Skipgrams: super robust matching
- Some letters are omitted (essentially a wildcard match)
- swm matches swim / swam / swum
- ft matches foot / feet
- Skip bi-grams for fuzzy matching
- Pirkola et al. (2002): learning cross-lingual translation mappings in related languages
- Mustafa (2004): monolingual Arabic retrieval
- Example: (4,2) skipgrams for Hopkins
- 4 letters, 2 skips
- hkin, hpin, hpkn, hoin, hokn, hopn
- oins, okns, okis, opns, opis, opks
- Note: more skipgrams than plain n-grams
- Slight gains in Czech, Hungarian, Persian
- Application to OCR'd docs?
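The Hopkins example is consistent with the following reading: slide a window of n + k letters over the word, always keep the first and last letter of the window, and drop exactly k of the letters in between. That rule is inferred from the twelve skipgrams listed above rather than taken from a stated definition, and the function name `skipgrams` is illustrative.

```python
from itertools import combinations


def skipgrams(word, n=4, k=2):
    """Generate (n, k) skipgrams: n kept letters spanning n + k positions.

    Assumption inferred from the slide's example: the first and last letters
    of each window are always kept and exactly k inner letters are skipped.
    Requires n >= 2.
    """
    span = n + k
    result = []
    for start in range(len(word) - span + 1):
        window = word[start:start + span]
        inner_positions = range(1, span - 1)
        # Choose which n - 2 inner letters to keep, preserving their order.
        for keep in combinations(inner_positions, n - 2):
            middle = "".join(window[i] for i in keep)
            result.append(window[0] + middle + window[-1])
    return result


print(skipgrams("hopkins", n=4, k=2))
```

For "hopkins" this yields 12 skipgrams from 2 windows, compared with 4 plain 4-grams for the unpadded word, which illustrates the growth in index terms noted above.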
10Generating Indexing Terms
Word Snowball Morfessor 5-grams
authored author authored _auth, autho, uthor, thore, hored, ored_
authorized author authorized _auth, autho, uthor, thori, horiz, orize, rized, ized_
authorship authorship authorship _auth, autho, uthor, thors, horsh, orshi, rship, ship_
reauthorization reauthor reauthorization _reau, reaut, eauth, autho, uthor, thori, horiz, oriza, rizat, izati, zatio, ation, tion_
afoot afoot afoot _afoo, afoot, foot_
footballs footbal footballs _foot, footb, ootba, otbal, tball, balls, alls_
footloose footloos footloose _foot, footl, ootlo, otloo, tloos, loose, oose_
footprint footprint footprint _foot, footp, ootpr, otpri, tprin, print, rint_
feet feet feet _feet, feet_
juggle juggl juggle _jugg, juggl, uggle, ggle_
juggled juggl juggled _jugg, juggl, uggle, ggled, gled_
jugglers juggler jugglers _jugg, juggl, uggle, ggler, glers, lers_
11JHU/APL HAIRCUT System
- The Hopkins Automatic Information Retriever for Combing Unstructured Text (HAIRCUT)
- Uses state-of-the-art statistical language model
- Ponte & Croft, A Language Modeling Approach to Information Retrieval, SIGIR-98
- Miller, Leek, and Schwartz, A Hidden Markov Model Information Retrieval System, SIGIR-99
- Typically set the smoothing parameter to 0.5
- Language-neutral
- Supports large dictionaries
- Used at TREC (10x), CLEF (9x), NTCIR(2x)
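As a rough sketch of the kind of language-model scoring cited on this slide, the code below ranks a document by the log-probability of the query under a mixture of document and collection term statistics. It assumes the parameter set to 0.5 is a linear interpolation weight, called `alpha` here; the function name and data layout are hypothetical, and this is not the HAIRCUT implementation.

```python
import math
from collections import Counter


def lm_score(query_terms, doc_terms, collection_counts, collection_len, alpha=0.5):
    """Log-probability of the query under a smoothed unigram document model.

    Each query term t contributes log(alpha * P(t|doc) + (1 - alpha) * P(t|collection)).
    alpha is assumed here to be the mixture weight that the slide sets to 0.5.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0
        p_col = collection_counts.get(t, 0) / collection_len
        p = alpha * p_doc + (1 - alpha) * p_col
        if p == 0.0:
            return float("-inf")  # term unseen anywhere in the collection
        log_p += math.log(p)
    return log_p


docs = [["oil", "price", "rose"], ["economy", "reacted"]]
collection = Counter(t for d in docs for t in d)
total = sum(collection.values())
for d in docs:
    print(d, lm_score(["oil"], d, collection, total))
```

The terms can be words, stems, or character n-grams; the scoring itself does not depend on which tokenization produced them, which is what makes the approach language-neutral.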
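12CLEF Ad Hoc Test Sets (2000-2007)
Topics per year; years with no topics for a language are omitted from its row, and the last value in each row is the total.
language docs size 00 01 02 03 04 05 06 07 total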
Bulgarian (BG) 69 k 213 MB 49 50 50 149
Czech (CS) 82 k 178 MB 50 50
Dutch (NL) 190 k 540 MB 50 50 56 156
English (EN) 170 k 580 MB 33 47 42 54 42 50 49 50 367
Finnish (FI) 55 k 137 MB 30 45 45 120
French (FR) 178 k 470 MB 34 49 50 52 49 50 49 333
German (DE) 295 k 660 MB 37 49 50 56 192
Hungarian (HU) 50 k 105 MB 50 48 50 148
Italian (IT) 157 k 363 MB 34 47 49 51 181
Portuguese (PT) 107 k 340 MB 46 50 50 146
Russian (RU) 17 k 68 MB 28 34 62
Spanish (ES) 453 k 1086 MB 49 50 57 156
Swedish (SV) 143 k 352 MB 49 53 102
13Tokenization Alternatives
- Stemming
- Effective in Romance languages
- Not always available
- N-grams
- Language-neutral
- Large gains in complex languages
- Other techniques
- Statistical stemming beats words
- Segmentation
- Single n-gram stems
- No run-time penalty
14Monolingual Tokenization
                  words   stems   morf    4-stem  4-grams 5-grams
BG Bulgarian      0.2164  -       0.2703  0.2822  0.3105  0.2820
CS Czech          0.2270  -       0.3215  0.2567  0.3294  0.3223
DE German         0.3303  0.3695  0.3994  0.3464  0.4098  0.4201
EN English        0.4060  0.4373  0.4018  0.4176  0.3990  0.4152
ES Spanish        0.4396  0.4846  0.4451  0.4485  0.4597  0.4609
FI Finnish        0.3406  0.4296  0.4018  0.3995  0.4989  0.5078
FR French         0.3638  0.4019  0.3680  0.3882  0.3844  0.3930
HU Hungarian      0.1976  -       0.2921  0.2836  0.3746  0.3624
IT Italian        0.3749  0.4178  0.3474  0.3741  0.3738  0.3997
NL Dutch          0.3813  0.4003  0.4053  0.3836  0.4219  0.4243
PT Portuguese     0.3162  -       0.3287  0.3418  0.3358  0.3524
RU Russian        0.2671  -       0.3307  0.2875  0.3406  0.3330
SV Swedish        0.3387  0.3756  0.3738  0.3638  0.4236  0.4271
Average           0.3230  -       0.3605  0.3518  0.3894  0.3923
% change                  -       11.6    8.9     20.5    21.4
Average-8         0.3719  0.4146  0.3928  0.3902  0.4214  0.4310
% change                  11.5    5.6     4.9     13.3    15.9
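The change rows are relative gains over the words baseline. A small sketch of that arithmetic, applied to the all-language averages, is given below; it assumes the five averages correspond to words, morf, 4-stem, 4-grams and 5-grams (the stems column having no all-language average), and the helper name `pct_change` is made up here.

```python
def pct_change(baseline, value):
    """Relative change of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# All-language averages from the table; "words" is the baseline.
averages = {"words": 0.3230, "morf": 0.3605, "4-stem": 0.3518,
            "4-grams": 0.3894, "5-grams": 0.3923}

for name, value in averages.items():
    if name != "words":
        print(f"{name}: {pct_change(averages['words'], value):+.1f}%")
```

Because the averages shown are already rounded to four places, recomputed percentages can differ from the table in the last digit.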
15IR Language Family
- 5-gram Gains
- Tied to morphological complexity
- Small improvements in Romance family
- Estimating Complexity
- Mean word length
- Spearman rho 0.77
- Information-theoretic approach
- Spearman rho 0.67
- Kettunen et al., Juola
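[Two plots of 5-gram gain against estimated complexity; labeled points: HU, FI, CS, BG, DE, RU, SV (first plot) and HU, FI, CS, DE, SV, NL (second plot)]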
16Why are N-grams Effective?
- (1) Spelling
- N-grams localize single letter spelling errors
- In news about 1 in 2000 words is misspelled
- (2) Phrasal Clues
- Word spanning n-grams hint at phrases
- Only slight differences observed
17(3) Because of Morphological Variation?
- N-grams might gain their power by controlling for morphological variation
- N-grams focused on root morphemes tend to match across inflected forms
- Juola (1998) and Kettunen (2006) did experiments removing morphology from language
- Such as replacing each surface form with a 6-digit number
- I compared words and 5-grams under normal and permuted letter conditions
- golfer -> legfro
- golfed -> dofegl
- golfing -> ligfron
18Source of N-gram Power
- Idea: remove morphology from a language
- Letter order of words was randomly permuted
- golfer -> legfro, team -> eamt
- golfing, golfer, golfed no longer share a morpheme
- 4 conditions: {words, 5-grams} x {normal, shuffled}
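One way to set up the shuffled condition is sketched below. It assumes that each distinct surface form is always mapped to the same permutation, so that exact word matches between queries and documents survive while shared substrings between related forms do not; how the permutation was actually drawn in the experiments is not stated, and `shuffle_word` is an illustrative name.

```python
import random


def shuffle_word(word, seed=0):
    """Return a fixed pseudo-random permutation of the letters of word.

    The generator is seeded from the word itself, so the same surface form
    always maps to the same permuted string (an assumption of this sketch).
    """
    rng = random.Random(f"{seed}:{word}")
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)


for w in ["golfer", "golfed", "golfing", "golfer"]:
    print(w, "->", shuffle_word(w))
```

After shuffling, word-based retrieval is unaffected because identical words still match, whereas 5-grams can no longer exploit the shared root "golf".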
19Corpus-Based Translation
- Given aligned parallel texts and a particular term to translate
- Find the set of documents (sentences) in the source language containing the term
- Examine corresponding foreign documents
- Extract good candidate(s)
- Goodness can be based on term similarity measures (Dice, MI, IBM Model 1, etc.)
The Rosetta Stone was discovered in 1799 by Napoleonic forces in Egypt. British physicist Thomas Young determined that cartouches were names of royalty. In 1821 Jean-François Champollion began deciphering hieroglyphics using parallel data in Demotic and Greek.
The price of oil increased yesterday. The economy reacted sharply.
El precio del petróleo aumentó ayer. La economía reaccionó agudamente.
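The procedure above can be sketched with the Dice coefficient over sentence-aligned text. The two sentence pairs from this slide serve as a toy corpus; the function names, the whitespace tokenizer, and the choice of Dice over the other listed measures are assumptions of this sketch.

```python
from collections import defaultdict


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [t.strip(".,;:!?") for t in sentence.lower().split() if t.strip(".,;:!?")]


def build_postings(sentences):
    """Map each term to the set of sentence ids in which it occurs."""
    postings = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for term in tokenize(sentence):
            postings[term].add(i)
    return postings


def translate(term, src_postings, tgt_postings, top=3):
    """Rank target terms by Dice = 2|S & T| / (|S| + |T|) over aligned sentence ids."""
    src_ids = src_postings.get(term, set())
    scores = {}
    for candidate, tgt_ids in tgt_postings.items():
        overlap = len(src_ids & tgt_ids)
        if overlap:
            scores[candidate] = 2 * overlap / (len(src_ids) + len(tgt_ids))
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


english = ["The price of oil increased yesterday.", "The economy reacted sharply."]
spanish = ["El precio del petróleo aumentó ayer.", "La economía reaccionó agudamente."]
src, tgt = build_postings(english), build_postings(spanish)
print(translate("economy", src, tgt, top=10))
```

With only two sentence pairs, every term in the aligned Spanish sentence ties at 1.0; with a large corpus, repeated co-occurrence separates the true translation from incidental neighbors. The same code works unchanged if the tokenizer emits character n-grams instead of words.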
20N-gram Translations
- Character n-grams can be statistically translated, just like words
- N-grams (such as n = 4, 5) are smaller than words
- May capture affixes and morphological roots
- work (from working) maps to abaj (as in trabajaba)
- yrup (from syrup) maps to rabe (as in jarabe)
- Suitable with proper nouns
- therl (from Netherlands) maps to ses b (as in Países Bajos)
           German                 Italian
word       milch                  latte
stem       milch                  latt
4-grams    milc, ilch             latt, latt
5-grams    _milc, milch, ilch_    _latt, _latt, latte
           French                 Dutch
word       lait                   melk
stem       lait                   melk
4-grams    lait                   melk
5-grams    _lait, lait_           _melk, melk_
21Parallel Sources
Corpus       Size   Genre                  CLEF Languages
Bible        785k   Religious              CZ, DE, EN, ES, FI, FR, IT, NL, PT, RU, SV
JRC/Acquis   32M    EU Law                 BG, CZ, DE, EN, ES, FI, FR, HU, IT, NL, PT, RU, SV
Europarl     33M    Parliamentary Debate   DE, EN, ES, FI, FR, IT, NL, PT, SV
OJEU         84M    Governmental Affairs   DE, EN, ES, FI, FR, IT, NL, PT, SV
Bible: Therefore was the name of it called Babel because Jehovah did there confound the language of all the earth and from thence did Jehovah scatter them abroad upon the face of all the earth.
Acquis: (24) In order to contribute to the conservation of octopus and in particular to protect the juveniles, it is necessary to establish, in 2006, a minimum size of octopus from the maritime waters under the sovereignty or jurisdiction of third countries and situated in the CECAF region pending the adoption of a regulation amending Regulation (EC) No 850/98.
Europarl: Mr President, the tsunami tragedy should be no less significant to the world's leaders and to Europe than 11 September.
OJEU: 11. Trafficking in women for sexual exploitation. A4-0372/97. Resolution on the Communication from the Commission to the Council and the European Parliament on trafficking in women for the purpose of sexual exploitation (COM(96)0567 - C4-0638/96). The European Parliament,
22Effectiveness Corpus Size
English queries translated using the Europarl corpus, sub-sampled from 1% to 100%.
23Effectiveness by size (2)
24FIRE Index Characteristics
              docs      uniq words   uniq 5-grams   text size (gzip)
BN Bengali    123,040   34,985       1,321,876      151 MB
EN English    125,516   247,592      839,103        122 MB
HI Hindi      95,213    19,403       741,915        110 MB
MR Marathi    99,359    47,940       1,580,775      104 MB
- Vocabulary size in Indian languages (ILs) seems abnormally small
- Possibly a bug in my pre-processing or tokenization, perhaps related to Unicode (e.g., continuation or modification characters)
Neuchâtel     docs      uniq words
BN Bengali    123,047   249,215
HI Hindi      95,215    127,658
MR Marathi    99,357    511,550
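One hypothetical way such a Unicode problem could arise is a tokenizer that treats only characters in the letter categories as word characters. In Indic scripts, dependent vowel signs and the virama are combining marks (categories Mc and Mn), so a letters-only rule breaks every word at each mark and collapses the vocabulary toward bare consonants. The sketch below illustrates that failure mode; it is not a diagnosis of the actual pre-processing, and the function name `tokenize` is made up here.

```python
import unicodedata
from itertools import groupby


def tokenize(text, keep_categories):
    """Split text into maximal runs of characters whose major Unicode
    category (first letter of the category code) is in keep_categories."""
    def is_word_char(ch):
        return unicodedata.category(ch)[0] in keep_categories

    return ["".join(run) for ok, run in groupby(text, key=is_word_char) if ok]


# The word "Hindi" in Devanagari: HA, vowel sign I, NA, virama, DA, vowel sign II.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

print(tokenize(hindi, {"L"}))            # letters only: word is fragmented
print(tokenize(hindi, {"L", "M", "N"}))  # letters, marks, digits: word is intact
```

Under the letters-only rule, many distinct words reduce to the same few single-consonant tokens, which would be consistent with an abnormally small count of unique words.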
25Tokenization for FIRE 2008
words 4-grams 5-grams sk41 Top at FIRE
BN Bengali 0.1231 0.3280 0.3582 0.3352 0.4719
EN English 0.5495 0.5241 0.5415 0.5264 0.5572
HI Hindi 0.0672 0.2820 0.3487 0.2746 0.3487
MR Marathi 0.1735 0.3740 0.3675 0.3478 0.4483
Average Average 0.2283 0.3834 0.4040 0.3710 0.4565
- Difficult to interpret results with anomalous vocabulary
- Need failure analysis
- Performance using words in ILs seems quite depressed
- Hindi 5-gram run had good relative performance
- Difference vs. 4-grams much larger than typically seen
26Relative Gains w/ Relevance Feedback
words 4-grams 5-grams sk41
BN Bengali 29.4 36.3 46.4 41.3
EN English 22.0 21.1 21.4 19.6
HI Hindi -4.0 18.6 20.3 4.0
MR Marathi 19.5 23.9 27.4 21.0
- Query expansion using top 10 documents
- 50 terms (words), 150 terms (4/5-grams), 400 terms (sk41)
- Fairly effective: 20-40% gains
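A generic pseudo-relevance feedback step with these settings might look like the sketch below. The slide gives only the number of feedback documents and expansion terms; the term-weighting formula (feedback document frequency times a log inverse collection frequency) and all names are assumptions made for illustration.

```python
import math
from collections import Counter


def expand_query(query_terms, ranked_docs, doc_freq, num_docs, top_docs=10, num_terms=50):
    """Add the best-scoring terms from the top-ranked documents to the query.

    ranked_docs: documents (lists of terms) in rank order from an initial search.
    doc_freq: collection document frequency of each term.
    The weighting used here is an assumed, idf-style heuristic.
    """
    feedback_df = Counter()
    for doc in ranked_docs[:top_docs]:
        feedback_df.update(set(doc))

    def weight(term):
        return feedback_df[term] * math.log(num_docs / doc_freq.get(term, 1))

    candidates = sorted(feedback_df, key=lambda t: (-weight(t), t))
    expansion = [t for t in candidates if t not in query_terms][:num_terms]
    return list(query_terms) + expansion


ranked = [["oil", "price", "opec"], ["oil", "opec", "barrel"], ["oil", "market"]]
df = {"oil": 50, "price": 40, "opec": 5, "barrel": 8, "market": 60}
print(expand_query(["oil"], ranked, df, num_docs=100, num_terms=2))
```

The larger term budgets for n-grams and skipgrams on this slide fit the fact that one word expands into many overlapping index terms.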
27In Conclusion
- Compared several forms of representing text
- In European languages n-grams obtain a 20% gain over words
- Rule-based stemming good in Romance languages
- Morfessor segments, n-gram stems better than words, not as good as Snowball stemmer
- N-gram gains
- Greatest in morphologically richer languages
- Lost when morphology removed from language
- FIRE
- N-grams and RF also effective in ILs
- Must resolve vocabulary issue
- Difficulty finding parallel text, but would like to investigate bilingual retrieval