Project Discussion - PowerPoint PPT Presentation

About This Presentation
Title:

Project Discussion

Description:

http://java.sun.com/j2se/1.4.1/download.html. Download the SDK for ... or java -jar [-options] jarfile [args...] (to execute a jar file) where options include: ... – PowerPoint PPT presentation

Slides: 27
Provided by: rob1108

Transcript and Presenter's Notes

Title: Project Discussion


1
Project Discussion
  • Java
      • Installation
      • First Program
      • Lazy Program
  • Project Example
      • Interfaces
      • Packages
  • Implementation Strategies
      • Procedural
      • State Based
      • Machine Learning
  • General Issues

2
Java Installation
  • For this class we'll use Java 1.4.1
  • Start at
    http://java.sun.com/j2se/1.4.1/download.html
  • Download the SDK for your platform

3
Verifying Installation
  • Go to a command prompt or shell and enter

    java
    Usage: java [-options] class [args...]
               (to execute a class)
       or  java -jar [-options] jarfile [args...]
               (to execute a jar file)

    where options include:
        -client       to select the "client" VM
        -server       to select the "server" VM
        -hotspot      is a synonym for the "client" VM  [deprecated]
                      The default VM is client.

        -cp -classpath <directories and zip/jar files separated by ; or :>
                      set search path for application classes and resources
        -D<name>=<value>
                      set a system property
        ...
  • You can now write your first program

4
Your 1st or 100th Java Program
  • Create a file Hello.java containing the
    following text

    public class Hello {
        public static void main(String[] args) {
            System.out.println("Hello world!");
        }
    }

  • Compile the program via
    javac Hello.java
  • Run the program via
    java Hello
    Hello world!

5
Your 1st or 100th Java Program
    public class Hello {                             // A
        public static void main(String[] args) {     // B, C, D, E, F
            System.out.println("Hello world!");      // G, J
        }
    }

  • A--Creating a class called Hello; the class name must
    match the filename Hello.java
  • B--Declaring a public method
  • C--Method is a class method (not an instance
    method)
  • D--Method returns void
  • E--Method name is main
  • F--Method takes a single param named args of type
    array of String
  • G--Invoking the println method on System.out (a static
    field of class System)
  • J--Actual parameter is a String

6
A Larger Lazy Example
    public class Lazy {
        private float[] m_myArray;
        private Float m_average;   // stays null until first requested

        public Lazy(float[] contents) {
            m_myArray = contents;
        }

        public float average() {
            if (m_average == null) {
                // Compute once, on first use, then cache the result
                float average = 0;
                for (int i = 0; i < m_myArray.length; i++) {
                    average += m_myArray[i];
                }
                average /= m_myArray.length;
                m_average = new Float(average);
            }
            return m_average.floatValue();
        }
    }
7
Invoking Lazy
    public static void main(String[] args) {
        float[] initialArray = {1.0f, 2.0f, 3.0f, 3.0f};
        Lazy lazy = new Lazy(initialArray);
        System.out.println("average " + lazy.average());
    }
8
Java Class Libraries
  • http://java.sun.com/j2se/1.4.1/docs/api/

9
Interfaces
  • An interface defines the methods that any class
    implementing the interface must support

[Class diagram: TokenList; <<Interface>> LexicalUnit with getContents(): String; <<Interface>> WordToken; <<Interface>> SentenceToken with getWordTokens(): List; multiplicities 1, 1..*, 1]
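
The interfaces in the diagram might be declared roughly as follows. This is a minimal sketch, not the project's actual source: getContents() and getWordTokens() come from the slide, but that SentenceToken extends LexicalUnit is an assumption, and the class InterfaceSketch with its helpers word(), joinWords() and demo() is hypothetical. It avoids generics so that it stays compatible with Java 1.4.1.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: in the project each interface would be public,
// in its own file and package.
interface LexicalUnit {
    String getContents();
}

interface WordToken extends LexicalUnit {
}

// Assumption: SentenceToken also extends LexicalUnit.
interface SentenceToken extends LexicalUnit {
    List getWordTokens(); // a List of WordToken objects
}

public class InterfaceSketch {
    // Hypothetical minimal WordToken used only for this demonstration.
    static WordToken word(final String contents) {
        return new WordToken() {
            public String getContents() { return contents; }
        };
    }

    // Code written against the interfaces works with any implementation.
    static String joinWords(SentenceToken sentence) {
        StringBuffer sb = new StringBuffer();
        List words = sentence.getWordTokens();
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) sb.append('|');
            sb.append(((WordToken) words.get(i)).getContents());
        }
        return sb.toString();
    }

    static String demo() {
        final List words = new ArrayList();
        words.add(word("Wow"));
        words.add(word("."));
        SentenceToken sentence = new SentenceToken() {
            public String getContents() { return "Wow."; }
            public List getWordTokens() { return words; }
        };
        return joinWords(sentence);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

joinWords() only ever sees the interface types, which is the point of the design: any tokenizer implementation that honors the interfaces can be plugged in.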
10
Example Implementing WordToken
[Diagram: SimpleWordToken implements <<Interface>> WordToken]

    public class SimpleWordToken implements WordToken {
        private String m_word;

        public SimpleWordToken(String contents) { m_word = contents; }

        public String getContents() { return m_word; }

        public String toString() { return m_word; }
    }
11
Current Project Structure
Structure may change
12
Packages
  • Used to carve up the namespace of classes
  • Used to support visibility
  • WordToken
      package edu.seattleu.se514.lexicalUnits;
      public interface WordToken extends LexicalUnit
  • SimpleWordToken
      package edu.seattleu.se514.exampleImplementation;
      import edu.seattleu.se514.lexicalUnits.WordToken;
      public class SimpleWordToken implements WordToken


13
Example Implementation Highlights
    import java.util.ArrayList;
    import java.util.List;

    public class SimpleSentenceToken implements SentenceToken {
        private String m_contents;
        private List m_wordTokens;

        public SimpleSentenceToken(String contents) {
            m_wordTokens = new ArrayList();
            m_contents = contents;
        }

        public List getWordTokens() { return m_wordTokens; }

        public void addWord(WordToken wordToken) { m_wordTokens.add(wordToken); }

        public String getContents() { return m_contents; }
    }
14
Very Simple Tokenizer
    // requires: import java.util.StringTokenizer;
    public TokenList tokenize(String text) {
        TokenList tokenList = new TokenList();
        int startOfSentence = 0;
        // A sentence ends only where a period is followed by a space
        int endOfSentence = text.indexOf(". ", startOfSentence);
        while (endOfSentence > 0) {
            String sentenceContents = text.substring(startOfSentence, endOfSentence + 1);
            SimpleSentenceToken sentenceToken = new SimpleSentenceToken(sentenceContents);
            tokenList.addSentenceToken(sentenceToken);
            // Split on periods and spaces, returning the delimiters as tokens too
            StringTokenizer tokenizer = new StringTokenizer(sentenceContents, ". ", true);
            while (tokenizer.hasMoreElements()) {
                String wordContents = tokenizer.nextToken();
                if (!wordContents.equals(" ")) {
                    SimpleWordToken wordToken = new SimpleWordToken(wordContents);
                    sentenceToken.addWord(wordToken);
                }
            }
            startOfSentence = endOfSentence + 2;
            endOfSentence = text.indexOf(". ", startOfSentence);
        }
        return tokenList;
    }
15
Invocation of Tokenizer
    public static void main(String[] args) {
        SimpleTokenizer tokenizer = new SimpleTokenizer();
        System.out.println(tokenizer.tokenize(args[0]));
    }

    java -cp project.jar edu.seattleu.se514.exampleImplementation.SimpleTokenizer "Wow. This might work."
    Wow. Thismightwork.

  • jar -xf project.jar will unpack the contents of the .jar file
  • jar -tf project.jar will list the contents of the .jar file
16
Test Harness
  • Incomplete YOYO harness for testing the tokenizer.
  • Class is TokenizationTest
  • Invoked with two command-line params
      • Filename
      • Class name of the tokenizer
  • The file looks like
      This is a simple test. Try it first.
      Thisisasimpletest.Tryitfirst.
      Richard M. Nixon doesn't properly tokenize.
      RichardM.Nixondoesn'tproperlytokenize.
  • Invoked via
      java -cp project.jar edu.seattleu.se514.testHarnesses.TokenizationTest tokenizationtests.txt edu.seattleu.se514.exampleImplementation.SimpleTokenizer

17
Three Approaches to Tokenization
  • Procedural
      • Have random access to array elements
      • Use if/then statements to write boundary rules
  • Finite State Machine
      • Generate an FSM; character classes (e.g., .?!)
        make up the transition alphabet
      • Outputs identify sentence and word boundaries
  • Machine Learning
      • Create a program to learn sentence boundaries
      • Train the program on a collection of
        properly tokenized examples

18
Procedural Fragment
  • Find sentence boundaries fragment

    private boolean isSentenceBoundary(String text, int location) {
        // Make sure we don't go dancing off end of array
        // (the rule below reads up to location + 2)
        if (location >= 0 && location + 2 < text.length()) {
            // Find position of termination (e.g., ., ?, !)
            if (text.charAt(location) == '.'
                    || text.charAt(location) == '?'
                    || text.charAt(location) == '!') {
                // See if next char is a space followed by a cap
                if (text.charAt(location + 1) == ' '
                        && Character.isUpperCase(text.charAt(location + 2))) {
                    return true;
                }
            }
        }
        return false;
    }
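
To see how such a rule behaves on real input, the following minimal sketch applies it at every position and cuts the text at each detected boundary. The class ProceduralSketch and the helper splitSentences() are hypothetical names introduced here, and the rule is restated so the block is self-contained. It avoids generics to stay compatible with Java 1.4.1.

```java
import java.util.ArrayList;
import java.util.List;

public class ProceduralSketch {
    // Same rule as the fragment: terminator, then a space, then a capital.
    static boolean isSentenceBoundary(String text, int location) {
        if (location < 0 || location + 2 >= text.length()) {
            return false; // not enough characters left to apply the rule
        }
        char c = text.charAt(location);
        return (c == '.' || c == '?' || c == '!')
                && text.charAt(location + 1) == ' '
                && Character.isUpperCase(text.charAt(location + 2));
    }

    // Hypothetical helper: cuts the text after every detected boundary.
    static List splitSentences(String text) {
        List sentences = new ArrayList();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isSentenceBoundary(text, i)) {
                sentences.add(text.substring(start, i + 1));
                start = i + 2; // skip the space after the terminator
            }
        }
        if (start < text.length()) {
            sentences.add(text.substring(start)); // whatever follows the last boundary
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(splitSentences("This is a simple test. Try it first."));
        System.out.println(splitSentences("Richard M. Nixon doesn't properly tokenize."));
    }
}
```

The second call splits after "Richard M.", which illustrates why a purely local rule mishandles initials and abbreviations.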
19
FSM based approach
  • First remember that FSMs are equivalent to regular
    expressions
  • ([.?!] [A-Z])

[State diagram: states S1 (start) and S2; transition labels .?!, A-Z (output: sentence), and else (twice)]
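
A state-based recognizer for this pattern can be sketched as a loop over the characters with an explicit state variable. This is a sketch under assumptions, not the slide's machine: the diagram shows only S1 and S2, while the regular expression also contains a space, so the code adds a third state S3 for "terminator followed by a space". The class FsmSketch and the method findBoundaries() are hypothetical names.

```java
public class FsmSketch {
    static final int S1 = 0; // start: scanning ordinary text
    static final int S2 = 1; // just saw one of . ? !
    static final int S3 = 2; // terminator followed by a space (assumed extra state)

    // Returns the indexes of the terminators that end a sentence, comma separated.
    static String findBoundaries(String text) {
        StringBuffer out = new StringBuffer();
        int state = S1;
        int terminator = -1; // index of the most recent terminator
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isTerminator = (c == '.' || c == '?' || c == '!');
            if (state == S3 && Character.isUpperCase(c)) {
                // Output action: the remembered terminator is a sentence boundary.
                if (out.length() > 0) out.append(',');
                out.append(terminator);
                state = S1;
            } else if (isTerminator) {
                terminator = i;
                state = S2;
            } else if (state == S2 && c == ' ') {
                state = S3;
            } else {
                state = S1; // every "else" transition falls back to the start state
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(findBoundaries("Wow! That was fun. Really."));
    }
}
```

Unlike the procedural fragment, the machine reads each character exactly once and never looks ahead, which is why the boundary is reported only when the capital letter arrives.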
20
Machine Learning
[Diagram: PreTokenized Text -> Learning Engine -> Tokenization Rules -> Tokenizer; Non-Tokenized Text -> Tokenizer -> Tokenized Text]
21
Generating features
  • Identify relevant features
      • a) previous character capitalized?
      • b) current character one of ?!.
      • c) next character whitespace?
      • d) next character capitalized?
  • Generate some training data
      • Wow! That was fun. --> F,T,F,T
      • Wow! That was fun. --> F,F,F,F
      • Richard M. Nixon --> T,T,F,T
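
Feature generation for one character position could look like the sketch below. It is one possible reading of features a) to d), not the deck's definition: in particular it assumes that "next character" in d) means the next non-whitespace character, so the vectors it prints need not match the training rows shown on the slide. FeatureSketch, features() and flag() are hypothetical names.

```java
public class FeatureSketch {
    // Returns the four features for the character at index i, e.g. "F,T,T,T".
    static String features(String text, int i) {
        // a) previous character capitalized?
        boolean prevCap = i > 0 && Character.isUpperCase(text.charAt(i - 1));
        // b) current character one of ? ! .
        char c = text.charAt(i);
        boolean isTerminator = (c == '?' || c == '!' || c == '.');
        // c) next character whitespace?
        boolean nextSpace = i + 1 < text.length()
                && Character.isWhitespace(text.charAt(i + 1));
        // d) Assumption: look at the next non-whitespace character.
        int j = i + 1;
        while (j < text.length() && Character.isWhitespace(text.charAt(j))) {
            j++;
        }
        boolean nextCap = j < text.length() && Character.isUpperCase(text.charAt(j));
        return flag(prevCap) + "," + flag(isTerminator) + ","
                + flag(nextSpace) + "," + flag(nextCap);
    }

    static String flag(boolean b) {
        return b ? "T" : "F";
    }

    public static void main(String[] args) {
        String text = "Wow! That was fun.";
        System.out.println(features(text, 3));   // the '!'
        System.out.println(features(text, 17));  // the final '.'
        System.out.println(features("Richard M. Nixon", 9)); // the '.' after M
    }
}
```

Each position in a pre-tokenized corpus would be turned into such a vector and paired with a yes/no label to form the training data.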

22
Mining the features
  • F,T,F,T yes
  • F,F,F,F no
  • T,T,F,T no
  • F,T,F,T yes
  • F,T,F,T yes
  • T,T,T,T no
  • ..
  • T,T,F,T yes
  • Could conclude that F,T,F,T is the only pattern that
    indicates an end of sentence, OR
  • that 10% of the time T,T,F,T indicates the end of a
    sentence
      • Good news from I.B.M. They..
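
Mining the table amounts to counting, for each pattern, how often it was labeled "yes". The sketch below does this for the rows visible on the slide only; the elided rows are not reproduced, so the fraction it reports for T,T,F,T differs from the figure quoted above. MiningSketch and boundaryRate() are hypothetical names.

```java
public class MiningSketch {
    // Fraction of training rows with the given pattern that were labeled "yes".
    static double boundaryRate(String[] patterns, boolean[] isBoundary, String pattern) {
        int seen = 0;
        int yes = 0;
        for (int i = 0; i < patterns.length; i++) {
            if (patterns[i].equals(pattern)) {
                seen++;
                if (isBoundary[i]) {
                    yes++;
                }
            }
        }
        return seen == 0 ? 0.0 : (double) yes / seen; // 0.0 when the pattern never occurs
    }

    public static void main(String[] args) {
        // Only the rows visible on the slide; the elided rows are omitted.
        String[] patterns = {
            "F,T,F,T", "F,F,F,F", "T,T,F,T", "F,T,F,T", "F,T,F,T", "T,T,T,T", "T,T,F,T"
        };
        boolean[] isBoundary = {true, false, false, true, true, false, true};
        System.out.println("F,T,F,T: " + boundaryRate(patterns, isBoundary, "F,T,F,T"));
        System.out.println("T,T,F,T: " + boundaryRate(patterns, isBoundary, "T,T,F,T"));
    }
}
```

A hard rule would accept only patterns whose rate is 1.0; a probabilistic tokenizer would keep the rate itself and apply a threshold.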

23
Using Bayes Formula
  • P(S|C) = P(C|S) P(S) / P(C)
  • Where
      • S is sentence boundary
      • C is context
      • P(S|C) is the probability of S given C
  • We can use the same features and calculate
    probabilities (e.g., for F,T,F,T)
      • P(S): the a priori probability of any location
        being a sentence boundary
      • P(C|S): if this is a sentence boundary, what's the
        chance that F,T,F,T is the pattern?
      • P(C): what's the overall a priori probability of
        F,T,F,T?
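
The three quantities can be estimated from labeled examples by counting and then combined with Bayes' formula. The sketch below is illustrative only: it uses the seven rows visible on the earlier "Mining the features" slide as data, and BayesSketch and posterior() are hypothetical names.

```java
public class BayesSketch {
    // P(S|C) = P(C|S) * P(S) / P(C), with each probability estimated from counts.
    static double posterior(String[] contexts, boolean[] isBoundary, String c) {
        int n = contexts.length;
        int s = 0;       // rows that are sentence boundaries
        int cCount = 0;  // rows whose context equals c
        int cAndS = 0;   // rows that are both
        for (int i = 0; i < n; i++) {
            boolean matches = contexts[i].equals(c);
            if (isBoundary[i]) s++;
            if (matches) cCount++;
            if (matches && isBoundary[i]) cAndS++;
        }
        if (s == 0 || cCount == 0) {
            return 0.0; // no evidence for this context or for any boundary
        }
        double pS = (double) s / n;            // a priori P(S)
        double pCgivenS = (double) cAndS / s;  // P(C|S)
        double pC = (double) cCount / n;       // a priori P(C)
        return pCgivenS * pS / pC;
    }

    public static void main(String[] args) {
        String[] contexts = {
            "F,T,F,T", "F,F,F,F", "T,T,F,T", "F,T,F,T", "F,T,F,T", "T,T,T,T", "T,T,F,T"
        };
        boolean[] isBoundary = {true, false, false, true, true, false, true};
        System.out.println("P(S | F,T,F,T) = " + posterior(contexts, isBoundary, "F,T,F,T"));
        System.out.println("P(S | T,T,F,T) = " + posterior(contexts, isBoundary, "T,T,F,T"));
    }
}
```

When all three probabilities are estimated from the same table, the result equals the plain fraction of "yes" rows for that context; the formula becomes more useful when P(S), P(C|S) and P(C) come from different sources or are smoothed separately.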

24
Lightweight Machine Learning
  • Period is the most ambiguous separator of sentences
  • A particularly difficult problem is the use of a period
    in abbreviations
      • e.g., ex-Gov. Davis was
  • Can use a corpus to discover likely abbreviations
      • a string of letters terminated by a period
      • followed by a comma, semicolon, question mark,
        lowercase letter, or by a capital letter followed
        by a period
      • The ex-Gov. R. M. Smith
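
The corpus heuristic can be written as a single regular expression with a lookahead, using java.util.regex (available since Java 1.4). This is a minimal sketch under assumptions: optional whitespace is allowed between the period and what follows, and the class AbbreviationSketch, the method findAbbreviations() and the sample corpus are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AbbreviationSketch {
    // Letters, then a period, then (after optional whitespace) a comma,
    // semicolon, question mark or lowercase letter, or a capital letter
    // that is itself followed by a period.
    static final Pattern ABBREVIATION =
            Pattern.compile("([A-Za-z]+)\\.(?=\\s*(?:[,;?a-z]|[A-Z]\\.))");

    // Returns the distinct candidate abbreviations in order of first appearance.
    static List findAbbreviations(String corpus) {
        List found = new ArrayList();
        Matcher m = ABBREVIATION.matcher(corpus);
        while (m.find()) {
            String candidate = m.group(1);
            if (!found.contains(candidate)) {
                found.add(candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String corpus = "The ex-Gov. R. M. Smith spoke. He cited Jones et al., then left.";
        System.out.println(findAbbreviations(corpus));
    }
}
```

Note that "M." is not reported here because it is followed by "Smith", a capitalized word without a period; the heuristic deliberately trades recall for not mistaking ordinary sentence ends such as "spoke." for abbreviations.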

25
General Tokenization Issues
  • Need to establish what kind of text you are parsing
      • There's an apartment for rent. Please contact
        Norman Bates
      • THERE'S AN APARTMENT FOR RENT. PLEASE CONTACT
        NORMAN BATES
      • there's an apartment for rent. please contact
        norman bates
      • apt 4 rent.cntct norman bates
  • We'll assume normal, well-written text (e.g., from
    news stories).

26
Summary
  • Java
  • Installation
  • Creating classes
  • Compiling
  • Running
  • Example Programs
  • Lazy Eval
  • Project
  • Required interfaces
  • Example implementation
  • Three Tokenization Strategies
  • Procedural
  • State Based
  • Machine Learning