Title: CC384 Natural Language Engineering
1. CC384 - Natural Language Engineering
2. The basic tasks in text processing
- TOKENIZATION: identify tokens in text
- WORD COUNTING: count words and their frequencies
- SEARCHING FOR WORDS
- NORMALIZATION
  - MASSIMO POESIO, massimo poesio, masimo peosio -> Massimo Poesio
  - Oct 20, 20th of October, ... -> 20/10/2009
- STEMMING
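The first two tasks, plus the simplest kind of normalization, can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (`WordCounter`, `countWords`) are hypothetical, tokens are taken to be runs of letters and digits, and normalization is reduced to lower-casing, so misspellings such as "masimo peosio" and date variants are not handled.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCounter {
    // Lower-case the text (a very crude normalization), split it into tokens
    // at runs of characters that are neither letters nor digits, and count
    // how often each token occurs.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        String normalized = text.toLowerCase(Locale.ROOT);
        for (String token : normalized.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first element if the text starts with a separator
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {massimo=2, poesio=2}
        System.out.println(countWords("MASSIMO POESIO, massimo poesio"));
    }
}
```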
3. Regular Expressions and Finite State Automata
- A central language technology
- Regular expressions: a way to express powerful SEARCH PATTERNS that can be implemented efficiently
  - Implemented in Perl, Java 1.4, Emacs, search engines
- Finite state automata: the computational model underlying regular expressions
- The regular expressions model can be expanded to specify SUBSTITUTIONS as well, implementable as FINITE STATE TRANSDUCERS
- In fact, Finite State Transducers are powerful enough to be usable for PARSING
  - Simpler cases of parsing: tokenization, normalization
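A small sketch of pattern-based SUBSTITUTION, using the replacement facility of `java.util.regex` rather than an explicit finite state transducer. The two rewrite rules and the names `SpellingNormalizer` and `normalize` are assumptions made for illustration; they map the spelling variants mentioned in these slides onto one form.

```java
import java.util.regex.Pattern;

class SpellingNormalizer {
    // Hypothetical rule 1: one or two m's after "acco" are rewritten to the standard spelling.
    private static final Pattern ACCOMMODATION = Pattern.compile("\\baccomm?odation\\b");
    // Hypothetical rule 2: "center"/"Center" becomes "centre"/"Centre" (group 1 keeps the case).
    private static final Pattern CENTER = Pattern.compile("\\b([Cc])enter\\b");

    static String normalize(String text) {
        String result = ACCOMMODATION.matcher(text).replaceAll("accommodation");
        return CENTER.matcher(result).replaceAll("$1entre");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Center for student accomodation"));
    }
}
```

The word boundaries (`\b`) keep the rules from firing inside longer words such as "centers".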
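4. Searching text for words
    char[] text;      // the text being searched
    // ...
    int leftMargin;   // position in text where the comparison starts

    boolean matchWord(String word) {
        boolean retval = true;
        for (int i = 0; i < word.length() && retval == true; i++)
            if (leftMargin + i >= text.length   // ran off the end of text: mismatch
                    || word.charAt(i) != text[leftMargin + i])
                retval = false;
        return (retval);
    }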
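The routine above only tests one starting position. To search a whole text, the same comparison can be repeated at every possible left margin. The following is a minimal sketch; the class name `WordSearch` and the helpers `matchWordAt` and `indexOfWord` are hypothetical, and the text and margin are passed as parameters instead of being fields.

```java
class WordSearch {
    // True if word occurs in text starting exactly at position leftMargin.
    static boolean matchWordAt(char[] text, int leftMargin, String word) {
        if (leftMargin + word.length() > text.length) {
            return false; // not enough characters left in the text
        }
        for (int i = 0; i < word.length(); i++) {
            if (word.charAt(i) != text[leftMargin + i]) {
                return false;
            }
        }
        return true;
    }

    // Slide the left margin over the text; return the first matching
    // position, or -1 if the word does not occur.
    static int indexOfWord(char[] text, String word) {
        for (int leftMargin = 0; leftMargin + word.length() <= text.length; leftMargin++) {
            if (matchWordAt(text, leftMargin, word)) {
                return leftMargin;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfWord("how to stop tension".toCharArray(), "top ten"));
    }
}
```

This naive scan compares up to `word.length()` characters at each of the text positions, which is one reason compiled patterns are attractive for larger searches.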
5. Searching text for patterns
- Most common case: searching using Google or similar
  - Simpler case: just looking for web pages containing a word (accommodation)
  - More complex cases
    - Different spellings
      - accomodation OR accommodation
      - Centre OR Center Cognitive Science
    - Patterns only occurring in certain contexts
- But also to validate a string entered by the user
  - E.g., checking whether the string entered is a phone number
    - (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234
    - But not (44+)020-12341234, 12341234(+020)
  - A regular email address
    - asmith@mactec.com, foo12@foo.edu, bob.smith@foo.tv
    - But not asmith, @mactech.com, a@a
  - A post code
    - G1 1AA, EH10 2QQ, SW1 1ZZ
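As an illustration of validation, here is a deliberately simplified post code check. The pattern is an assumption chosen only to cover the three examples above (one or two letters, one or two digits, a space, one digit, two letters); it is not the full set of valid UK post code formats. The names `PostCodeCheck` and `looksLikePostCode` are hypothetical.

```java
import java.util.regex.Pattern;

class PostCodeCheck {
    // Simplified, assumed shape: 1-2 letters, 1-2 digits, a space, 1 digit, 2 letters.
    private static final Pattern POST_CODE =
            Pattern.compile("[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}");

    // matches() requires the whole input to fit the pattern, which is what
    // validation needs (as opposed to finding the pattern somewhere inside).
    static boolean looksLikePostCode(String s) {
        return POST_CODE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"G1 1AA", "EH10 2QQ", "SW1 1ZZ", "12345"}) {
            System.out.println(s + " -> " + looksLikePostCode(s));
        }
    }
}
```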
6. Regular Expressions: a formalism for expressing search patterns
- Because matching is a very common problem, over the years computer scientists have identified a set of patterns that
  - Are very common
  - Can be searched for efficiently
- The language of REGULAR EXPRESSIONS has been developed to characterize these patterns
- Many programming languages (Perl, Java 1.4, TCL, Python, ...) / web search tools / software systems (awk, sed, emacs) allow users to use regular expressions to specify what they are searching for; these REs are then compiled into efficient code
  - You do not need to write the code yourselves!
7. Regular Expressions: the basic case
- The simplest form of regular expression: a SEQUENCE OF SYMBOLS
  - /can/
  - Matches any string which contains can: can, canterbury, scanning
- Whitespace can be included: /top ten/
  - Also matches how to stop tension
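In Java, this "contains" behaviour corresponds to `Matcher.find()`, while `Matcher.matches()` requires the whole string to fit the pattern. A minimal sketch (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

class SequenceMatch {
    // True if the pattern occurs anywhere inside the input (search semantics).
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True only if the entire input fits the pattern.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("can", "canterbury"));             // true
        System.out.println(wholeMatch("can", "canterbury"));                // false
        System.out.println(containsMatch("top ten", "how to stop tension")); // true
    }
}
```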
8. More complex types of regular expressions
- Disjunction
  - /centre|center/
  - /accomodation|accommodation/
- Also
  - /[Cc]entre/
  - /acco(m|mm)odation/
- Repetitions
  - Any number greater than 0: +
    - /YES+!/
    - Matches YES!, YESS!, YESSS!
    - E.g., any binary number: /[01]+/
  - 0 or more: *
    - /ab*/
    - Matches a, ab, abb, abbb
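The operators on this slide can be tried out with a small helper that keeps only the candidate strings a pattern accepts in full. This is a sketch: `OperatorDemo` and `accepted` are hypothetical names, and the patterns are written with `|` for disjunction, `[...]` for a character class, `+` for one or more and `*` for zero or more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class OperatorDemo {
    // Return the candidates that the pattern matches as a whole string.
    static List<String> accepted(String regex, String... candidates) {
        Pattern p = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String c : candidates) {
            if (p.matcher(c).matches()) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(accepted("centre|center", "centre", "center", "centor"));
        System.out.println(accepted("YES+!", "YE!", "YES!", "YESSS!"));
        System.out.println(accepted("ab*", "a", "abbb", "b"));
    }
}
```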
9. Software that includes an implementation of REs
- Pure REs: awk, egrep, lex
- Extended REs: perl, Java
10. Regular expressions in Java (from 1.4)
- Standard library: java.util.regex
- Tutorial (very good): http://java.sun.com/docs/books/tutorial/extra/regex/index.html
- Alternative: http://www.javaworld.com/javaworld/jw-07-2001/jw-0713-regex-p2.html
- Main classes
  - PATTERN (= compiled form of a RE)
    - Pattern rePattern = Pattern.compile("ab*");
  - MATCHER (= analyze a string using a pattern)
    - Matcher pm = rePattern.matcher(string);
    - pm.find(): find the next substring that matches
    - pm.group(): the substring found by find()
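Put together, the two classes are typically used in a loop: compile once, then call `find()` repeatedly and read each match with `group()`. A minimal sketch, with `FindAll` and `findAll` as hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAll {
    // Collect every non-overlapping substring of input that matches regex,
    // scanning from left to right.
    static List<String> findAll(String regex, String input) {
        Pattern rePattern = Pattern.compile(regex);
        Matcher pm = rePattern.matcher(input);
        List<String> found = new ArrayList<>();
        while (pm.find()) {          // advance to the next match, if any
            found.add(pm.group());   // the substring just found
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [a, ab, abb]
        System.out.println(findAll("ab*", "a ab abb"));
    }
}
```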
11. Grep in Java 1.4 (cc384/code/java)

    // ...
    import java.io.File;
    import java.nio.CharBuffer;
    import java.util.regex.*;

    public class Grep {
        // ...

        // Pattern used to parse lines
        private static Pattern linePattern = Pattern.compile(".*\r?\n");

        // The input pattern that we're looking for
        private static Pattern pattern;

        // Compile the pattern from the command line
        private static void compile(String pat) {
            try {
                pattern = Pattern.compile(pat);
            } catch (PatternSyntaxException x) {
                System.err.println(x.getMessage());
                System.exit(1);
            }
        }

        // Use the linePattern to break the given CharBuffer into lines,
        // applying the input pattern to each line to see if we have a match
        private static void grep(File f, CharBuffer cb) {
            Matcher lm = linePattern.matcher(cb);   // Line matcher
            Matcher pm = null;                      // Pattern matcher
            int lines = 0;
            while (lm.find()) {
                lines++;
                CharSequence cs = lm.group();       // The current line
                if (pm == null)
                    pm = pattern.matcher(cs);
                else
                    pm.reset(cs);
                if (pm.find())
                    System.out.print(f + ":" + lines + ":" + cs);
                if (lm.end() == cb.limit())
                    break;
            }
        }

        // ...
    }
12. Regular expressions in Perl
- Example: print lines containing the string can (a simple version of the grep program)

    while (<STDIN>) {
        if (/can/) {
            print $_;
        }
    }
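For comparison, a rough Java counterpart of the Perl loop. To keep the sketch self-contained it filters a collection of lines that are already in memory rather than reading standard input; `SimpleGrep` and `grep` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleGrep {
    // Keep the lines in which the pattern occurs somewhere.
    static List<String> grep(String regex, Iterable<String> lines) {
        // One Matcher is created and then reset for each line, as in the Grep class.
        Matcher pm = Pattern.compile(regex).matcher("");
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (pm.reset(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("yes we can", "no match here", "scanning text");
        for (String line : grep("can", input)) {
            System.out.println(line);
        }
    }
}
```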
13. Even more complex cases and more metacharacters (Perl- and Java-specific)
- Other forms of disjunction
  - Range: /textfile0[2-4]/
    - Will match textfile02, textfile03, textfile04
- Metacharacters (in Perl / Java)
  - \d (any digit): /a\d+z/ matches a0z, a123z, a456z
  - \w (letter, digit, or underscore _)
  - \s (any whitespace)
- Any character: . (period)
  - /cyclo.*ane/ matches
    - cyclodecane, cyclohexane, cyclones drive me insane
- Zero or one times: ?
  - /accomm?odation/ matches accomodation and accommodation
- Negation: [^abc]
  - /textfile[^0268]/ matches textfile1, textfile3, ...
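To see what these metacharacters actually consume, it helps to print the matched substring rather than just a yes/no answer. The sketch below does that; `MetacharDemo` and `firstMatch` are hypothetical names, and the patterns are written with the quantifiers and brackets spelled out in Java string syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetacharDemo {
    // Return the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("textfile0[2-4]", "see textfile03"));
        System.out.println(firstMatch("a\\d+z", "xa123zy"));
        // The greedy .* runs on to the last "ane" in the input.
        System.out.println(firstMatch("cyclo.*ane", "cyclones drive me insane"));
        System.out.println(firstMatch("textfile[^0268]", "textfile2 textfile3"));
    }
}
```

The third call shows why /cyclo.*ane/ is usually too permissive for finding chemical names: the pattern is satisfied by any text that contains "cyclo" followed later by "ane".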
14. Applications of more complex REs
- Web pages about Centres and Centers
  - /[Cc]entre|[Cc]enter/
- Regular expression to validate phone numbers
  - (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234
  - But not (44+)020-12341234, 12341234(+020)
  - ^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
- Validating email addresses
  - asmith@mactec.com, foo12@foo.edu, bob.smith@foo.tv
  - But not asmith, @mactech.com, a@a
  - ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
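The two validation patterns can be checked against the examples with a few lines of Java. This is a sketch, not a recommended validator: the patterns are written out here with Java string escaping, and the exact placement of brackets and quantifiers should be treated as an assumption. `InputValidators`, `isPhone` and `isEmail` are hypothetical names.

```java
import java.util.regex.Pattern;

class InputValidators {
    // Optional prefix such as (+44) or +44, then digits, spaces, hyphens,
    // underscores and parentheses in any order.
    private static final Pattern PHONE =
            Pattern.compile("^(\\(?\\+?[0-9]*\\)?)?[0-9_\\- \\(\\)]*$");

    // Local part, "@", then either a bracketed numeric address or one or
    // more dot-terminated labels followed by a 2-4 letter suffix.
    private static final Pattern EMAIL = Pattern.compile(
            "^([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|"
            + "(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)$");

    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    static boolean isEmail(String s) {
        return EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPhone("+44 (0) 1234-1234"));
        System.out.println(isEmail("bob.smith@foo.tv"));
    }
}
```

Note that the phone pattern is very permissive: every part of it is optional, so it also accepts the empty string. A real form check would normally add a minimum number of digits.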
15. Notational Variants
- Different programming languages tend to use different notations for expressing REs.
- In FSA,
  - Sequence: [d,o,g]
  - Disjunction: {[c,a,t],[d,o,g]} (instead of cat|dog)
  - Range: a..z (instead of [a-z])
  - Any symbol whatsoever: ? (instead of .)
  - Optional character: E^ (instead of E?)
16. Notational variants: advanced search in Google
- CAPITALIZATION, etc.
  - Google search is not case-sensitive
- OR search
  - vacation london OR paris
- NUMRANGE search
  - DVD player 250..350
- WILDCARD search
  - "Sony Vaio * laptop"
- For more tips: http://www.google.com/help/refinesearch.html
17. Readings
- Jurafsky and Martin, chapter 2
- The regular expressions library
  - http://www.regxlib.com/
- The Java tutorial at Sun, section on regular expressions
  - http://java.sun.com/docs/books/tutorial/extra/regex/index.html
- The sections of the Perl manual on regular expressions (perlre)
- Jeffrey Friedl, Understanding Regular Expressions, The Perl Journal
18. Acknowledgments
- Some material borrowed from Gosse Bouma