Human Language Technology - PowerPoint PPT Presentation

Slides: 32
Provided by: MikeR207
Transcript and Presenter's Notes

Title: Human Language Technology


1
Human Language Technology
  • Conflation Algorithms

2
Acknowledgements
  • John Repici (2002), http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
  • Porter, M.F., 1980, An algorithm for suffix
    stripping, reprinted in Sparck Jones, Karen, and
    Peter Willett, 1997, Readings in Information
    Retrieval, San Francisco: Morgan Kaufmann, ISBN
    1-55860-454-4. Vince has a copy of this.
  • Jurafsky & Martin, Appendix B, pp. 833-836.

3
Conflation
COMPUT
COMPUTE
COMPUTER
COMPUTES
COMPUTING
COMPUTABILITY
COMPUTATION
4
Types of Conflation Algorithm
  • Stemming: process based, e.g. affix stripping
  • Lemmatisation: attempts to map to the same lemma;
    POS dependent
  • Morphological analysis: includes morpho-syntactic
    information

5
Word Conflation Algorithms
  • Morphological analysis versus conflation
  • Notion of word class used is application
    dependent
  • Genealogy: phonetic similarity
  • Information Retrieval: semantic similarity
  • Based on written language (not phonetic
    transcription)
  • Well-known algorithms: Soundex, Porter

6
Soundex: Problems with Names
  • Names can be misspelt: Rossner
  • The same name can be spelt in different ways:
    Kirkop, Chircop
  • The same name appears differently in different
    cultures: Tchaikovsky, Chaicowski
  • To solve this problem, we need phonetically
    oriented algorithms which can find similar
    sounding terms and names.
  • Just such a family of algorithms exists. They are
    called SoundExes, after the first patented
    version.

7
The Soundex Algorithm
  • A Soundex algorithm takes a word as input and
    produces a character string which identifies a
    set of words that are (roughly) phonetically
    alike.
  • It is very handy for searching large databases.
  • Originally developed in 1918 by Margaret K. Odell
    and Robert C. Russell of the US Bureau of
    Archives, to simplify census-taking.

8
Soundex Algorithm 1
  • The Soundex Algorithm uses the following
    steps to encode a word:
  • The first character of the word is retained as
    the first character of the Soundex code.
  • The following letters are discarded:
    a, e, i, o, u, h, w and y.
  • Remaining consonants are given a code number.
  • If consonants having the same code number appear
    consecutively, the number will only be coded
    once (e.g. "B233" becomes "B23").

9
Code Numbers
Letters                   Code
b, p, f, v                1
c, s, k, g, j, q, x, z    2
d, t                      3
l                         4
m, n                      5
r                         6
10
Soundex Algorithm Example
  • Encoding the word ROSNER with the Soundex steps:
  • The first character of the word is retained as
    the first character of the Soundex code: R
  • The following letters are discarded:
    a, e, i, o, u, h, w and y: RSNR
  • Remaining consonants are given a code number:
    R256
  • If consonants having the same code number appear
    consecutively, the number will only be coded
    once (e.g. "B233" becomes "B23"): R256

11
Soundex Algorithm 2
  • The resulting code is modified so that it becomes
    exactly four characters long. If it is less than
    4 characters, zeroes are added to the end (e.g.
    "B2" becomes "B200").
  • If it is more than 4 characters, the code is
    truncated (e.g. "B2435" becomes "B243").
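
The Soundex steps on these slides can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: the names `soundex` and `CODES` are illustrative, and the order of operations follows the slides (discard vowels first, then collapse repeated codes). Some published Soundex variants collapse repeats before discarding vowels, so they can give different codes for some names.

```python
# Code numbers from the "Code Numbers" slide.
CODES = {}
for letters, digit in [("bpfv", "1"), ("cskgjqxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Encode a word following the steps on the slides (illustrative sketch)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    # Step 1: keep the first character as it is.
    first = word[0].upper()
    # Steps 2 and 3: a, e, i, o, u, h, w, y have no code and are dropped;
    # every remaining consonant is replaced by its code number.
    digits = [CODES[ch] for ch in word[1:] if ch in CODES]
    # Step 4: consecutive identical code numbers are coded once.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Step 5: pad with zeroes or truncate to exactly four characters.
    return (first + "".join(collapsed) + "000")[:4]


print(soundex("ROSNER"))   # R256, as on the example slide
print(soundex("Rossner"))  # R256, so the misspelling conflates
```

Because the first letter is kept unchanged, Kirkop and Chircop still receive different codes under this scheme.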

12
Uses for the Soundex Code
  • Airline reservations - The soundex code for a
    passenger's surname is often recorded to avoid
    confusion when trying to pronounce it.
  • U.S. Census - As is noted above, the U.S. Census
    Department was a frequent user of the Soundex
    algorithm while trying to compile a listing of
    families around the turn of the century.
  • Genealogy - In genealogy, the Soundex code is
    most often used to avoid problems when dealing
    with names that might have alternate spellings.

13
Improvements
  • Preprocessing before applying the basic
    algorithm, e.g. identification of:
  • DG with G
  • GH with H
  • GN with N (not 'ng')
  • KN with N
  • PH with F
  • Question: where to stop?
  • Question: how to evaluate?
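
The preprocessing idea can be sketched as a simple rewrite pass that runs before Soundex encoding. The names `DIGRAPHS` and `preprocess` are hypothetical, and applying the substitutions in the order listed on the slide is an assumption; the slide does not specify an order.

```python
# Digraph substitutions listed on the slide, applied in the order given
# (the ordering is an assumption of this sketch).
DIGRAPHS = [("DG", "G"), ("GH", "H"), ("GN", "N"), ("KN", "N"), ("PH", "F")]


def preprocess(name):
    """Rewrite the listed digraphs before encoding (hypothetical helper)."""
    result = name.upper()
    for old, new in DIGRAPHS:
        result = result.replace(old, new)
    return result


print(preprocess("Phillips"))  # FILLIPS
print(preprocess("Knight"))    # NIHT
print(preprocess("sing"))      # SING ('ng' is left alone)
```

With this pass, names such as Phillips and Fillips start with the same letter and can therefore share a code.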

14
IR Applications
  • Information Retrieval: Query → ? → Relevant
    Documents
  • Bag of Terms document model
  • What is a single term?

15
Why Stemming is Necessary
  • Frequently we get collections of words of the
    following kind in the same document: compute,
    computer, computing, computation, computability.
  • Performance of an IR system will be improved if
    all of these terms are conflated.
  • Fewer terms to worry about
  • More accurate statistics

16
Issues
  • Is a dictionary available (of stems, of affixes)?
  • Motivation: linguistic credibility or engineering
    performance?
  • When to remove an affix versus when to leave it
    alone
  • Porter (1980): W1 and W2 should be conflated if
    there appears to be no difference between the
    statements "this document is about W1" and "this
    document is about W2", e.g. relate/relativity vs.
    radioactive/radioactivity

17
Consonants and Vowels
  • A consonant is a letter other than a, e, i, o, u and
    other than y preceded by a consonant (e.g. the y in
    sky is a vowel; nb. the y in toy is regarded as a
    consonant).
  • If a letter is not a consonant it is a vowel.
  • A sequence of consonants (cc..c) or vowels
    (vv..v) will be represented by C or V
    respectively.
  • For example, the word troubles maps to C V C V C.
  • Any word or part of a word therefore has one of
    the following forms: CVCV...C, CVCV...V, VCVC...C,
    VCVC...V

18
Measure
  • All the above patterns can be replaced by the
    following regular expression: [C](VC)^m[V], where
    square brackets denote an optional part and (VC)^m
    denotes VC repeated m times.
  • m is called the measure of any word or word part.
  • m=0: tr, ee, tree, y, by
  • m=1: trouble, oats, trees, ivy
  • m=2: troubles, private
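
The C/V mapping and the measure m can be sketched in Python as follows. The function names `is_consonant`, `cv_form` and `measure` are illustrative, and the treatment of y follows Porter's definition (y counts as a consonant unless it is preceded by a consonant).

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense (word in lower case)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_form(word):
    """Collapse runs of consonants and vowels into single C and V symbols."""
    word = word.lower()
    form = []
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not form or form[-1] != symbol:
            form.append(symbol)
    return "".join(form)


def measure(word):
    """Number of VC pairs in the collapsed form, i.e. m in [C](VC)^m[V]."""
    return cv_form(word).count("VC")


print(cv_form("troubles"))  # CVCVC
for w in ["tree", "by", "trouble", "ivy", "troubles", "private"]:
    print(w, measure(w))    # 0, 0, 1, 1, 2, 2
```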

19
Rules
  • Rules for removing a suffix are given in the
    form (condition) S1 → S2
  • i.e. if a word ends with suffix S1, and the stem
    before S1 satisfies the condition, then S1 is
    replaced with S2. Example: (m > 1) EMENT → ε,
    where ε is the empty string.
  • Example: enlargement → enlarg

20
Conditions
  • *S - stem ends with s
  • *Z - stem ends with z
  • *T - stem ends with t
  • *v* - stem contains a vowel
  • *d - stem ends with a double consonant
  • *o - stem ends cvc, where the second c is not w, x
    or y, e.g. -wil, -hop
  • In conditions, Boolean operators are possible,
    e.g. (m > 1 and (*S or *T))
  • Sets of rules are applied in 7 steps. Within each
    step, the rule matching the longest suffix applies.
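
The stem conditions can be sketched as small Python predicates. The function names are illustrative, stems are assumed to be lower case, and `is_consonant` encodes Porter's definition of a consonant.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense (word in lower case)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    """Condition v: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """Condition d: the stem ends with a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem):
    """Condition o: the stem ends cvc and the final c is not w, x or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


print(contains_vowel("plaster"), contains_vowel("bl"))      # True False
print(ends_double_consonant("hopp"), ends_cvc("hop"))        # True True
```

A compound condition is then an ordinary Boolean expression over such predicates, for example `stem.endswith("s") or stem.endswith("t")` for the (S or T) part.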

21
Organisation
Step 1: Plurals and Third Person Singular Verbs (-s)
Step 2: Verbal Past Tense and Progressive (-ed, -ing)
Step 3: Y to I, Noun Inflections (fly/flies)
Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)
Step 6: Derivational Morphology, Single Suffixes
Step 7: Cleanup
22
Step 1: Plural Nouns and 3rd Person Singular Verbs
Condition   Rewrite       Example
(none)      SSES → SS     caresses → caress
(none)      IES → I       ponies → poni
(none)      SS → SS       caress → caress
(none)      S → ε         cats → cat
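
The Step 1 rules, together with the "longest matching suffix wins" convention, can be sketched as follows. The names `STEP1_RULES` and `apply_longest` are illustrative, and words are assumed to be lower case.

```python
# (S1, S2) pairs from the Step 1 table; none of them carries a condition.
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_longest(word, rules):
    """Apply the rule whose suffix S1 is the longest one matching the word."""
    matches = [(s1, s2) for s1, s2 in rules if word.endswith(s1)]
    if not matches:
        return word
    s1, s2 = max(matches, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(s1)] + s2


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_longest(w, STEP1_RULES))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```

The SS → SS rule looks redundant, but it is what stops the shorter S rule from turning caress into cares.
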
23
Step 2a: Verbal Past Tense and Progressive Forms
Condition   Rewrite       Example
(m > 0)     EED → EE      feed → feed; agreed → agree
(*v*)       ED → ε        plastered → plaster; bled → bled
(*v*)       ING → ε       killing → kill; sing → sing
24
Step 2b: Cleanup (applied if the 2nd or 3rd rule of the last step succeeds)
Condition                        Rewrite            Example
(none)                           AT → ATE           generat → generate
(none)                           BL → BLE           troubl → trouble
(none)                           IZ → IZE           capsiz → capsize
(*d and not (*L or *S or *Z))    → single letter    hopp → hop; hiss → hiss
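
Steps 2a and 2b can be sketched together, since 2b only runs when ED or ING was actually removed. The function names are illustrative and words are assumed to be lower case. The EED rule is coded with the condition m > 0, which is what the "agreed → agree" example requires (the stem "agr" has measure 1).

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count VC pairs in the collapsed consonant/vowel form of the stem."""
    form = ""
    for i in range(len(stem)):
        symbol = "C" if is_consonant(stem, i) else "V"
        if not form.endswith(symbol):
            form += symbol
    return form.count("VC")


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step2a(word):
    """Return (word, removed); removed is True if ED or ING was stripped."""
    if word.endswith("eed"):
        stem = word[:-3]
        return (stem + "ee", False) if measure(stem) > 0 else (word, False)
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return (stem, True) if contains_vowel(stem) else (word, False)
    return word, False


def step2b(word):
    """Cleanup after ED/ING removal: restore a final e or undouble."""
    for s1, s2 in (("at", "ate"), ("bl", "ble"), ("iz", "ize")):
        if word.endswith(s1):
            return word[: -len(s1)] + s2
    if (len(word) >= 2 and word[-1] == word[-2]
            and is_consonant(word, len(word) - 1)
            and word[-1] not in "lsz"):
        return word[:-1]
    return word


def step2(word):
    word, removed = step2a(word)
    return step2b(word) if removed else word


for w in ["feed", "agreed", "plastered", "bled", "killing", "sing", "hopping"]:
    print(w, "->", step2(w))
```
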
25
Step 3: Y to I
Condition   Rewrite       Example
(*v*)       Y → I         happy → happi; cry → cry
26
Step 4: Derivational Morphology I: Multiple Suffixes (excerpt)
Condition   Rewrite           Example
(m > 0)     ATIONAL → ATE     relational → relate
(m > 0)     TIONAL → TION     conditional → condition
(m > 0)     ENCI → ENCE       valenci → valence
(m > 0)     ABLI → ABLE       comfortabli → comfortable
(m > 0)     OUSLI → OUS       analogousli → analogous
(m > 0)     IZER → IZE        digitizer → digitize
(m > 0)     IZATION → IZE     vietnamization → vietnamize
(m > 0)     ATION → ATE       generation → generate
(m > 0)     ATOR → ATE        operator → operate
(m > 0)     ALISM → AL        formalism → formal
(m > 0)     IVENESS → IVE     pensiveness → pensive
(m > 0)     FULNESS → FUL     hopefulness → hopeful
(m > 0)     OUSNESS → OUS     callousness → callous
(m > 0)     ALITI → AL        formaliti → formal
(m > 0)     BILITI → BLE      possibiliti → possible
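
A conditional step such as Step 4 can be sketched by combining longest-suffix matching with the measure of the remaining stem. The names `STEP4_RULES` and `step4` are illustrative, the rule list is only the excerpt from the slide, and input is assumed to be lower case and already passed through Step 3 (so final y has become i).

```python
STEP4_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("abli", "able"),
    ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"), ("ator", "ate"),
    ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("biliti", "ble"),
]


def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count VC pairs in the collapsed consonant/vowel form of the stem."""
    form = ""
    for i in range(len(stem)):
        symbol = "C" if is_consonant(stem, i) else "V"
        if not form.endswith(symbol):
            form += symbol
    return form.count("VC")


def step4(word):
    """Pick the longest matching suffix; rewrite only if the stem has m > 0."""
    matches = [rule for rule in STEP4_RULES if word.endswith(rule[0])]
    if not matches:
        return word
    s1, s2 = max(matches, key=lambda rule: len(rule[0]))
    stem = word[: len(word) - len(s1)]
    return stem + s2 if measure(stem) > 0 else word


for w in ["relational", "conditional", "generation", "rational"]:
    print(w, "->", step4(w))
# relational -> relate, conditional -> condition,
# generation -> generate, rational -> rational (stem "r" has m = 0)
```
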
27
Step 6: Derivational Morphology III: Single Suffixes
Condition                    Rewrite       Example
(m > 1)                      AL → ε        revival → reviv
(m > 1)                      ANCE → ε      allowance → allow
(m > 1)                      ENCE → ε      inference → infer
(m > 1)                      ER → ε        airliner → airlin
(m > 1)                      IC → ε        gyroscopic → gyroscop
(m > 1)                      ABLE → ε      adjustable → adjust
(m > 1)                      ANT → ε       irritant → irrit
(m > 1)                      EMENT → ε     replacement → replac
(m > 1)                      MENT → ε      adjustment → adjust
(m > 1)                      ENT → ε       dependent → depend
(m > 1 and (*S or *T))       ION → ε       adoption → adopt
(m > 1)                      OU → ε        homologou → homolog
(m > 1)                      ISM → ε       formalism → formal
(m > 1)                      ATE → ε       activate → activ
                             ITI → ε
28
Porter Example
  • INPUT: in the first focus area, integrated
    projects shall help develop, principally, common
    open platforms for software and services
    supporting a distributed information and decision
    systems for risk and crisis management

29
Porter Output
Original Word   Stemmed Word
first           first
focus           focu
area            area
integrated      integr
projects        project
help            help
develop         develop
principally     princip
common          common
open            open
platforms       platform
software        softwar
services        servic
supporting      support
distributed     distribut
information     inform
decision        decis
systems         system
risk            risk
crisis          crisi
management      manag
30
Stemming Errors
  • Under-stemming: the error of taking off too small a
    suffix, e.g. croulons → croulon, since croulons is
    a form of the verb crouler
  • Over-stemming: the error of taking off too much,
    e.g. croûtons → croût, since croûtons is the
    plural of croûton
  • Mis-stemming: taking off what looks like an ending,
    but is really part of the stem, e.g. reply → rep

31
Summary
  • Conflation serves different purposes
  • Generally, motivation is to achieve an
    engineering goal rather than linguistic fidelity.
  • This can cause errors in the bag of words model.
  • Soundex and Porter are very well established and
    easily available.