Title: Project Discussion
1Project Discussion
- Java
- Installation
- First Program
- Lazy Program
- Project Example
- Interfaces
- Packages
- Implementation Strategies
- Procedural
- State Based
- Machine Learning
- General Issues
2Java Installation
- For this class we'll use Java 1.4.1
- Start at
- http://java.sun.com/j2se/1.4.1/download.html
- Download the SDK for your platform
3Verifying Installation
- Go to a command prompt or shell and enter
    java
    Usage: java [-options] class [args...]
               (to execute a class)
       or  java -jar [-options] jarfile [args...]
               (to execute a jar file)
    where options include:
        -client       to select the "client" VM
        -server       to select the "server" VM
        -hotspot      is a synonym for the "client" VM [deprecated]
                      The default VM is client.
        -cp -classpath <directories and zip/jar files separated by : (or ; on Windows)>
                      set search path for application classes and resources
        -D<name>=<value>
                      set a system property
        ..
- You can now write your first program
4Your 1st or 100th Java Program
- Create a file Hello.java containing the following text:
    public class Hello {
        public static void main(String[] args) {
            System.out.println("Hello world!");
        }
    }
- Compile the program via
- javac Hello.java
- Run the program via
- java Hello
- Hello world!
5Your 1st or 100th Java Program
    public class Hello {
        public static void main(String[] args) {
            System.out.println("Hello world!");
        }
    }
[Figure: callout labels A, B, C, D, E, F, I, J point at parts of the code above]
- A--Creating a class called Hello; the name must match the filename Hello.java
- B--Declaring a public method
- C--Method is a class method (not an instance method)
- D--Method returns void
- E--Method name is main
- F--Method takes a single param named args of type array of String
- G--Invoking static method
- J--Actual parameter is a String
6A Larger Lazy Example
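    public class Lazy {
        private float[] m_myArray;
        private Float m_average;   // stays null until average() is first called

        public Lazy(float[] contents) {
            m_myArray = contents;
        }

        public float average() {
            // Compute the average only on first use, then reuse the cached value
            if (m_average == null) {
                float average = 0;
                for (int i = 0; i < m_myArray.length; i++) {
                    average += m_myArray[i];
                }
                average /= m_myArray.length;
                m_average = new Float(average);
            }
            return m_average.floatValue();
        }
    }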
7Invoking Lazy
    public static void main(String[] args) {
        float[] initialArray = {1.0f, 2.0f, 3.0f, 3.0f};
        Lazy lazy = new Lazy(initialArray);
        System.out.println("average " + lazy.average());
    }
8Java Class Libraries
- http://java.sun.com/j2se/1.4.1/docs/api/
9Interfaces
- An interface defines the methods that a class implementing the interface must support
[Class diagram: TokenList; <<Interface>> LexicalUnit with getContents(): String; <<Interface>> WordToken; <<Interface>> SentenceToken with getWordTokens(): List; multiplicity labels 1, 1.., 1]
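The interface relationships named on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's actual source: DemoWord, DemoSentence, and describe are hypothetical names, the assumption that SentenceToken extends LexicalUnit is inferred from the diagram, and the sketch uses generics from a current JDK where the Java 1.4.1 slides use a plain List.

```java
import java.util.ArrayList;
import java.util.List;

public class InterfaceSketch {
    // Interfaces named on the slide; method lists are limited to what the slide shows.
    interface LexicalUnit { String getContents(); }
    interface WordToken extends LexicalUnit { }
    // Assumption: SentenceToken also extends LexicalUnit.
    interface SentenceToken extends LexicalUnit { List<WordToken> getWordTokens(); }

    // Hypothetical implementations, for illustration only.
    static class DemoWord implements WordToken {
        private final String word;
        DemoWord(String word) { this.word = word; }
        public String getContents() { return word; }
    }

    static class DemoSentence implements SentenceToken {
        private final String contents;
        private final List<WordToken> words = new ArrayList<>();
        DemoSentence(String contents) { this.contents = contents; }
        void addWord(WordToken word) { words.add(word); }
        public String getContents() { return contents; }
        public List<WordToken> getWordTokens() { return words; }
    }

    // Code written against the interface works for any implementing class.
    static String describe(LexicalUnit unit) {
        return "[" + unit.getContents() + "]";
    }

    public static void main(String[] args) {
        DemoSentence sentence = new DemoSentence("Wow.");
        sentence.addWord(new DemoWord("Wow"));
        sentence.addWord(new DemoWord("."));
        System.out.println(describe(sentence));
        for (WordToken word : sentence.getWordTokens()) {
            System.out.println(describe(word));
        }
    }
}
```

The point of the sketch is that describe never needs to know which concrete class it receives, only that the class supports getContents().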
10Example Implementing WordToken
[Class diagram: SimpleWordToken implements <<Interface>> WordToken]
    public class SimpleWordToken implements WordToken {
        private String m_word;

        public SimpleWordToken(String contents) {
            m_word = contents;
        }

        public String getContents() {
            return m_word;
        }

        public String toString() {
            return m_word;
        }
    }
11Current Project Structure
Structure may change
12Packages
- Used to carve up the namespace of classes
- Used to support visibility
- WordToken
    package edu.seattleu.se514.lexicalUnits;
    public interface WordToken extends LexicalUnit {
    }
- SimpleWordToken
    package edu.seattleu.se514.exampleImplementation;
    import edu.seattleu.se514.lexicalUnits.WordToken;
    public class SimpleWordToken implements WordToken
13Example Implementation Highlights
    import java.util.ArrayList;
    import java.util.List;

    public class SimpleSentenceToken implements SentenceToken {
        private String m_contents;
        private List m_wordTokens;

        public SimpleSentenceToken(String contents) {
            m_wordTokens = new ArrayList();
            m_contents = contents;
        }

        public List getWordTokens() {
            return m_wordTokens;
        }

        public void addWord(WordToken wordToken) {
            m_wordTokens.add(wordToken);
        }

        public String getContents() {
            return m_contents;
        }
    }
14Very Simple Tokenizer
    public TokenList tokenize(String text) {
        TokenList tokenList = new TokenList();
        int startOfSentence = 0;
        // A sentence ends at a period followed by a space
        int endOfSentence = text.indexOf(". ", startOfSentence);
        while (endOfSentence > 0) {
            String sentenceContents = text.substring(startOfSentence, endOfSentence + 1);
            SimpleSentenceToken sentenceToken = new SimpleSentenceToken(sentenceContents);
            tokenList.addSentenceToken(sentenceToken);
            // Split on periods and spaces, returning the delimiters as tokens too
            StringTokenizer tokenizer = new StringTokenizer(sentenceContents, ". ", true);
            while (tokenizer.hasMoreElements()) {
                String wordContents = tokenizer.nextToken();
                if (!wordContents.equals(" ")) {
                    SimpleWordToken wordToken = new SimpleWordToken(wordContents);
                    sentenceToken.addWord(wordToken);
                }
            }
            // Skip past the ". " and look for the next sentence
            startOfSentence = endOfSentence + 2;
            endOfSentence = text.indexOf(". ", startOfSentence);
        }
        return tokenList;
    }
15Invocation of Tokenizer
    public static void main(String[] args) {
        SimpleTokenizer tokenizer = new SimpleTokenizer();
        System.out.println(tokenizer.tokenize(args[0]));
    }

    java -cp project.jar edu.seattleu.se514.exampleImplementation.SimpleTokenizer "Wow. This might work. "
    Wow. Thismightwork.

- jar -xf project.jar will unpack the contents of the .jar file
- jar -tf project.jar will list the contents of the .jar file
16Test Harness
- Incomplete YOYO harness for testing tokenizer.
- Class is TokenizationTest
- Invoked with two command line params
- Filename
- ClassName of Tokenizer
- Filename looks like
- This is a simple test. Try it first.
- Thisisasimpletest.Tryitfirst.
- Richard M. Nixon doesn't properly tokenize.
- RichardM.Nixondoesn'tproperlytokenize.
- Invoked Via
- java -cp project.jar edu.seattleu.se514.testHarnesses.TokenizationTest tokenizationtests.txt edu.seattleu.se514.exampleImplementation.SimpleTokenizer
17Three Approaches to Tokenization
- Procedural
- Have random access to array elements
- Use if/then statements to write boundary rules
- Finite State Machine
- Generate an FSM; character classes (e.g., .?!) represent the transition alphabet
- Outputs identify sentence and word boundaries
- Machine Learning
- Create a program to learn sentence boundaries
- Train the program through a collection of
properly tokenized examples
18Procedural Fragment
- Find sentence boundaries fragment
    private boolean isSentenceBoundary(String text, int location) {
        // Make sure we don't go dancing off end of array;
        // the checks below read characters up to location + 2
        if (location >= 0 && location + 2 < text.length()) {
            // Find position of termination (e.g., ., ?, !)
            if (text.charAt(location) == '.'
                    || text.charAt(location) == '?'
                    || text.charAt(location) == '!') {
                // See if next char is a space followed by a cap
                if (text.charAt(location + 1) == ' '
                        && Character.isUpperCase(text.charAt(location + 2))) {
                    return true;
                }
            }
            return false;
        }
        return false;
    }
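To see how this rule behaves on the two test-file sentences shown earlier, here is a small self-contained sketch. The class name ProceduralBoundaryDemo and the helper boundaries are hypothetical, and the bounds check assumes the rule must be able to read two characters past the terminator; it is written for a current JDK rather than 1.4.1.

```java
import java.util.ArrayList;
import java.util.List;

public class ProceduralBoundaryDemo {
    // Same rule as the fragment: terminator, then a space, then a capital letter.
    static boolean isSentenceBoundary(String text, int location) {
        if (location < 0 || location + 2 >= text.length()) {
            return false; // not enough characters left to apply the rule
        }
        char c = text.charAt(location);
        boolean terminator = c == '.' || c == '?' || c == '!';
        return terminator
                && text.charAt(location + 1) == ' '
                && Character.isUpperCase(text.charAt(location + 2));
    }

    // Hypothetical helper: collect every index the rule accepts.
    static List<Integer> boundaries(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (isSentenceBoundary(text, i)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [21]: the period after "test"
        System.out.println(boundaries("This is a simple test. Try it first."));
        // prints [9]: the period after the initial "M" is wrongly accepted
        System.out.println(boundaries("Richard M. Nixon doesn't properly tokenize."));
    }
}
```

Note that under this rule the final period of a text is never reported, since nothing follows it, and an initial such as "M." is indistinguishable from a real sentence end.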
19FSM based approach
- First remember that FSMs are equivalent to regular expressions
- ([.?!] [A-Z])
[State diagram: states S1 (start) and S2; transition labels ".?!", "A-Z" (output: sentence), and two "else" transitions]
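One way to turn this idea into code is a hand-written state machine. The sketch below is an assumption-laden illustration rather than the slide's exact machine: it uses three states (the extra SAW_SPACE state handles the space that appears in the regular expression), and the names FsmBoundaryFinder and findBoundaries are hypothetical. It targets a current JDK, since enums are not available in Java 1.4.1.

```java
import java.util.ArrayList;
import java.util.List;

public class FsmBoundaryFinder {
    private enum State { START, SAW_TERMINATOR, SAW_SPACE }

    private static boolean isTerminator(char c) {
        return c == '.' || c == '?' || c == '!';
    }

    // Returns the index of each terminator that is followed by a space and A-Z.
    public static List<Integer> findBoundaries(String text) {
        List<Integer> boundaries = new ArrayList<>();
        State state = State.START;
        int terminatorIndex = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (state) {
                case START:
                    if (isTerminator(c)) {
                        state = State.SAW_TERMINATOR;
                        terminatorIndex = i;
                    }
                    break;
                case SAW_TERMINATOR:
                    if (c == ' ') {
                        state = State.SAW_SPACE;
                    } else if (isTerminator(c)) {
                        terminatorIndex = i; // e.g. "?!": remember the latest one
                    } else {
                        state = State.START;
                    }
                    break;
                case SAW_SPACE:
                    if (c >= 'A' && c <= 'Z') {
                        boundaries.add(terminatorIndex); // output: sentence boundary
                        state = State.START;
                    } else if (isTerminator(c)) {
                        state = State.SAW_TERMINATOR;
                        terminatorIndex = i;
                    } else {
                        state = State.START;
                    }
                    break;
            }
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // prints [3, 17]
        System.out.println(findBoundaries("Wow! That was fun. Next one."));
    }
}
```

Each character is read exactly once and the only memory is the current state plus one index, which is what distinguishes this from the random-access procedural version.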
20Machine Learning
[Diagram: Pre-Tokenized Text -> Learning Engine -> Tokenization Rules -> Tokenizer; Non-Tokenized Text -> Tokenizer -> Tokenized Text]
21Generating features
- Identify relevant features
- a) previous character capitalized?
- b) current character a ?!.
- c) next character whitespace
- d) next character capitalized
- Generate some training data
- Wow! That was fun. --> F,T,F,T
- Wow! That was fun. --> F,F,F,F
- Richard M. Nixon --> T,T,F,T
22Mining the features
- F,T,F,T yes
- F,F,F,F no
- T,T,F,T no
- F,T,F,T yes
- F,T,F,T yes
- T,T,T,T no
- ..
- T,T,F,T yes
- Could conclude that FTFT is the only pattern that indicates an end of sentence OR
- 10% of the time TTFT indicates the end of a sentence
- Good news from I.B.M. They..
23Using Bayes Formula
- P(S|C) = P(C|S) P(S) / P(C)
- Where
- S is sentence boundary
- C is context
- P(S|C) is the probability of S given C
- We can use the same features and calculate probabilities (e.g., F,T,F,T)
- P(S): the a priori probability of any location being a sentence boundary
- P(C|S): if this is a sentence boundary, what's the chance that F,T,F,T is the pattern?
- P(C): what's the overall a priori probability of F,T,F,T?
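As a rough illustration of the calculation, the sketch below estimates the three probabilities by counting and then applies Bayes' formula. It uses only the seven pattern/label rows that are visible on the "Mining the features" slide; the elided rows are not included, so the numbers it produces are not the ones the full data set would give. The names BayesBoundary, OBSERVATIONS, and posterior are hypothetical.

```java
public class BayesBoundary {
    // The seven visible rows: {feature pattern, "yes" if sentence boundary}.
    // The slide elides further rows, so this is an illustrative subset only.
    static final String[][] OBSERVATIONS = {
        {"FTFT", "yes"}, {"FFFF", "no"}, {"TTFT", "no"}, {"FTFT", "yes"},
        {"FTFT", "yes"}, {"TTTT", "no"}, {"TTFT", "yes"}
    };

    // Estimates P(S|C) = P(C|S) * P(S) / P(C) from counts for one pattern C.
    static double posterior(String pattern) {
        int total = OBSERVATIONS.length;
        int boundaries = 0;          // rows that are sentence boundaries
        int patternCount = 0;        // rows showing this pattern
        int patternAndBoundary = 0;  // rows that are both
        for (String[] row : OBSERVATIONS) {
            boolean isBoundary = row[1].equals("yes");
            boolean matches = row[0].equals(pattern);
            if (isBoundary) boundaries++;
            if (matches) patternCount++;
            if (isBoundary && matches) patternAndBoundary++;
        }
        if (patternCount == 0 || boundaries == 0) {
            return 0.0; // pattern never seen, or no boundaries: nothing to estimate
        }
        double pS = (double) boundaries / total;
        double pCgivenS = (double) patternAndBoundary / boundaries;
        double pC = (double) patternCount / total;
        return pCgivenS * pS / pC;
    }

    public static void main(String[] args) {
        System.out.println("P(S | FTFT) = " + posterior("FTFT"));
        System.out.println("P(S | TTFT) = " + posterior("TTFT"));
        System.out.println("P(S | FFFF) = " + posterior("FFFF"));
    }
}
```

On this subset FTFT always marks a boundary and TTFT does so half the time; with more training rows the estimates would shift, which is the point of gathering a larger tokenized corpus.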
24Lightweight Machine Learning
- The period is the most ambiguous separator of sentences
- A particularly difficult problem is the use of periods in abbreviations
- e.g., ex-Gov. Davis was
- Can use a corpus to discover likely abbreviations
- a string of letters, terminated by a period
- followed by a comma, semicolon, question mark, lowercase letter, or by a capital letter followed by a period
- The ex-Gov. R. M. Smith
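The discovery heuristic above can be approximated with a regular expression. This is a minimal sketch under stated assumptions: optional whitespace is allowed between the period and the following character, only the run of letters directly before the period is captured (so "ex-Gov." yields "Gov"), and the names AbbreviationFinder and discover are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AbbreviationFinder {
    // Letters, then a period, then (after optional whitespace) one of:
    // comma, semicolon, question mark, lowercase letter, or capital + period.
    private static final Pattern CANDIDATE =
            Pattern.compile("([A-Za-z]+)\\.(?=\\s*(?:[,;?a-z]|[A-Z]\\.))");

    // Returns the candidate abbreviations found in the corpus, sorted.
    public static Set<String> discover(String corpus) {
        Set<String> found = new TreeSet<>();
        Matcher matcher = CANDIDATE.matcher(corpus);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String corpus = "The ex-Gov. R. M. Smith met ex-Gov. Davis today. He left.";
        // prints [Gov, R]
        System.out.println(discover(corpus));
    }
}
```

In the sample text "M." and the second "ex-Gov." are not matched by the rule on their own, because each is followed by a capitalized word; once "Gov" has been discovered elsewhere in the corpus, though, a tokenizer can treat every occurrence as an abbreviation.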
25General Tokenization Issues
- Need to establish what kind of text you are parsing
- There's an apartment for rent. Please contact Norman Bates
- THERE'S AN APARTMENT FOR RENT. PLEASE CONTACT NORMAN BATES
- there's an apartment for rent. please contact norman bates
- apt 4 rent.cntct norman bates
- We'll assume normal, well-written text (e.g., from news stories).
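A tokenizer that relies on capitalization could first check which kind of text it has been given. The sketch below is one possible heuristic, not part of the project: it only looks at letter case, so it separates the first three examples above but cannot recognize telegraphic text such as the fourth. The names TextStyleGuess, Style, and guess are hypothetical.

```java
public class TextStyleGuess {
    enum Style { ALL_UPPER, ALL_LOWER, MIXED_CASE, NO_LETTERS }

    // Classifies text by the case of its letters; digits and punctuation are ignored.
    static Style guess(String text) {
        int upper = 0;
        int lower = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isUpperCase(c)) {
                upper++;
            } else if (Character.isLowerCase(c)) {
                lower++;
            }
        }
        if (upper == 0 && lower == 0) return Style.NO_LETTERS;
        if (lower == 0) return Style.ALL_UPPER;
        if (upper == 0) return Style.ALL_LOWER;
        return Style.MIXED_CASE;
    }

    public static void main(String[] args) {
        System.out.println(guess("There's an apartment for rent. Please contact Norman Bates"));
        System.out.println(guess("THERE'S AN APARTMENT FOR RENT. PLEASE CONTACT NORMAN BATES"));
        System.out.println(guess("there's an apartment for rent. please contact norman bates"));
    }
}
```

When the result is ALL_UPPER or ALL_LOWER, a rule such as "terminator, space, capital letter" carries no information, so a tokenizer would need to fall back on other evidence.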
26Summary
- Java
- Installation
- Creating classes
- Compiling
- Running
- Example Programs
- Lazy Eval
- Project
- Required interfaces
- Example implementation
- Three Tokenization Strategies
- Procedural
- State Based
- Machine Learning