CS4705 - PowerPoint PPT Presentation

About This Presentation

Title: CS4705

Description: CS4705 Corpus Linguistics and Machine Learning Techniques

Slides: 21

Provided by: juliah154

Learn more at: http://www.cs.columbia.edu

Transcript and Presenter's Notes

1
CS4705

Corpus Linguistics and Machine Learning Techniques

2
Review

What do we know about so far?
Words (stems and affixes, roots and templates, ...)
Ngrams (simple word sequences)
POS (e.g. nouns, verbs, adverbs, adjectives,
determiners, articles, ...)
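
As a reminder of what "simple word sequences" means in practice, here is a minimal sketch of n-gram extraction. The function name ngrams and the whitespace tokenization are assumptions made for illustration, not part of the course material.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace splitting is a simplification; real tokenizers do more.
tokens = "Clinton won easily but".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A sequence shorter than n simply yields no n-grams, so the function returns an empty list in that case.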

3
Some Additional Things We Could Find

Named Entities
Persons
Company Names
Locations
Dates

4
What useful things can we do with this knowledge?

Find sentence boundaries, abbreviations
Find Named Entities (person names, company names,
telephone numbers, addresses, ...)
Find topic boundaries and classify articles into
topics
Identify a document's author and their opinion on
the topic, pro or con
Answer simple questions (factoids)
Do simple summarization/compression

5
But first, we need corpora

Online collections of text and speech
Some examples
Brown Corpus
Wall Street Journal and AP News
ATIS, Broadcast News
TDTN
Switchboard, Call Home
TRAINS, FM Radio, BDC Corpus
Hansards parallel corpus of French and English
And many private research collections

6
Next, we pose a question: the dependent variable

Binary questions
Is this word followed by a sentence boundary or
not?
A topic boundary?
Does this word begin a person name? End one?
Should this word or sentence be included in a
summary?
Classification
Is this document about medical issues? Politics?
Religion? Sports?
Predicting continuous variables
How loud or high should this utterance be
produced?

7
Finding a suitable corpus and preparing it for
analysis

Which corpora can answer my question?
Do I need to get them labeled to do so?
Dividing the corpus into training and test
corpora
To develop a model, we need a training corpus
An overly narrow corpus doesn't generalize
An overly general corpus doesn't reflect the task or
domain
To demonstrate how general our model is, we need
a test corpus to evaluate the model
Development test set vs. held out test set
To evaluate our model we must choose an
evaluation metric
Accuracy
Precision, recall, F-measure, ...
Cross validation
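
The metrics and the cross-validation split named above can be sketched as follows. This is an illustrative sketch only: the function names, the "y"/"n" label convention, and the round-robin fold assignment are assumptions, and a real experiment would usually shuffle the data before splitting.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def precision_recall_f1(gold, predicted, positive="y"):
    """Precision, recall and balanced F-measure for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    # Round-robin assignment: item i goes to fold i % k.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


# Hypothetical gold labels and system output for six tokens.
gold = list("nnynyn")
pred = list("nyynnn")
scores = precision_recall_f1(gold, pred)
splits = list(k_fold_splits(6, 3))
```

Each item lands in exactly one test fold, so every example is used for testing once and for training k-1 times.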

8
Then we build the model

Identify the dependent variable: what do we want
to predict or classify?
Does this word begin a person name? Is this word
within a person name?
Is this document about sports? The weather?
International news? ???
Identify the independent variables: what features
might help to predict the dependent variable?
What is this word's POS? What is the POS of the
word before it? After it?
Is this word capitalized? Is it followed by a
"."?
Does "hockey" appear in this document?
How far is this word from the beginning of its
sentence?
Extract the values of each variable from the
corpus by some automatic means
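
One way the automatic extraction step might look is sketched below. The sketch assumes the sentence has already been tokenized and POS-tagged; the function names, the dictionary keys, and the "<s>" and "</s>" padding symbols are all hypothetical choices made for this example.

```python
def extract_features(tagged_sentence):
    """Build one feature dict per token from a list of (word, pos) pairs."""
    rows = []
    last = len(tagged_sentence) - 1
    for i, (word, pos) in enumerate(tagged_sentence):
        prev_pos = tagged_sentence[i - 1][1] if i > 0 else "<s>"
        next_word, next_pos = tagged_sentence[i + 1] if i < last else ("</s>", "</s>")
        rows.append({
            "word": word,
            "pos": pos,
            "prev_pos": prev_pos,
            "next_pos": next_pos,
            "capitalized": word[:1].isupper(),
            "period_after": next_word == ".",
            "dist_from_start": i + 1,  # 1-based distance from sentence start
        })
    return rows


def mentions(document_tokens, term):
    """Document-level feature: does the term occur anywhere in the document?"""
    return term.lower() in (tok.lower() for tok in document_tokens)


# Hypothetical hand-tagged input.
sentence = [("Clinton", "N"), ("won", "V"), ("easily", "Adv"), (".", ".")]
rows = extract_features(sentence)
```

Word-level questions (POS, capitalization, position) become per-token fields, while a question such as whether "hockey" appears is answered once per document.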

9
A Sample Feature Vector for Sentence-Ending
Detection
WordID   POS   Cap?  "," After?  Dist/Sbeg  End?
Clinton  N     y     n           1          n
won      V     n     n           2          n
easily   Adv   n     y           3          n
but      Conj  n     n           4          n
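
Most learners expect numbers rather than symbols, so rows like these are typically encoded before training. The following sketch shows one possible encoding of the four rows above. The one-hot treatment of POS, the decision to drop the word itself, and all names are assumptions made for illustration.

```python
# The four rows of the sample table: word, POS, Cap?, comma after?,
# distance from sentence beginning, sentence end?
ROWS = [
    ("Clinton", "N", "y", "n", 1, "n"),
    ("won", "V", "n", "n", 2, "n"),
    ("easily", "Adv", "n", "y", 3, "n"),
    ("but", "Conj", "n", "n", 4, "n"),
]

# Tag inventory limited to the tags seen in the table (an assumption).
POS_TAGS = ["N", "V", "Adv", "Conj"]


def encode(row):
    """Turn one table row into (feature_vector, label)."""
    _word, pos, cap, comma_after, dist, end = row
    one_hot = [1 if pos == tag else 0 for tag in POS_TAGS]
    features = one_hot + [int(cap == "y"), int(comma_after == "y"), dist]
    label = int(end == "y")  # the dependent variable
    return features, label


encoded = [encode(row) for row in ROWS]
```

The last column is kept apart from the features because it is the value the model is supposed to predict.
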
10
An Example: Finding Caller Names in Voicemail
(SCANMail)

Motivated by interviews, surveys and usage logs
of heavy users
Hard to scan new msgs to find those you need to
deal with quickly
Hard to find msg you want in archive
Hard to locate information you want in any msg
How could we help?

11
SCANMail Architecture
[Figure: architecture diagram; surviving labels: Caller, SCANMail Subscriber]
12
Corpus Collection

Recordings collected from 138 AT&T Labs
employees' mailboxes
100 hours, 10K msgs, 2500 speakers
Gender balanced, 12 non-native speakers
Mean message duration 36.4 secs, median 30.0 secs
Hand-transcribed and annotated with caller id,
gender, age, entity demarcation (names, dates,
telnos)
Also recognized using ASR engine

13
Transcription and Bracketing

Greeting hi R CallerID it's me give me
a call um right away cos there's .hn I
guess there's some .hn change Date
tomorrow with the nursery school and they um
.hn anyway they had this idea cos since
I think J's the only one staying Date tomorrow
for play club so they wanted to they suggested
that .hn well J2 actually offered to take J
home with her and then would she

would meet you back at the synagogue at Time
five thirty to pick her up .hn uh so I
don't know how you feel about that otherwise M_
and one other teacher would stay and take care of
her till Date five thirty tomorrow but if
you .hn I wanted to know how you feel before
I tell her one way or the other so call me .hn
right away cos I have to get back to her in
about an hour so .hn okay Closing bye
.nhn .onhk

15
SCANMail Demo
http://www.avatarweb.com/scanmail/
Audix extension: demo; Audix password: (null)
16
Information Extraction (Martin Jansche and Steve
Abney)

Goals: extract key information from msgs to
present in headers
Approach
Supervised learning from transcripts (phone numbers,
caller self-ids)
Combine Machine Learning techniques with simpler
alternatives, e.g. hand-crafted rules
Two-stage approaches

Features exploit structure of key elements (e.g.
length of phone numbers) and of surrounding
context (e.g. self-ids tend to occur at beginning
of msg)
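
To make the two kinds of features concrete, here is a small hand-crafted sketch. It is not the method of Jansche and Abney. It assumes that digits appear in the transcript as spoken words, that runs of 7 or 10 digits are plausible phone numbers, and that position in the message can be expressed as a fraction. All names are hypothetical.

```python
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


def find_phone_candidates(tokens, valid_lengths=(7, 10)):
    """Return (start_index, digit_string) for digit-word runs of a valid length."""
    candidates = []
    run_start, digits = None, []
    # The trailing None is a sentinel that flushes a run ending the message.
    for i, tok in enumerate(list(tokens) + [None]):
        digit = DIGIT_WORDS.get(tok.lower()) if tok is not None else None
        if digit is not None:
            if run_start is None:
                run_start = i
            digits.append(digit)
        else:
            if run_start is not None and len(digits) in valid_lengths:
                candidates.append((run_start, "".join(digits)))
            run_start, digits = None, []
    return candidates


def relative_position(token_index, n_tokens):
    """Context feature: 0.0 at the start of the message, 1.0 at the end."""
    return token_index / max(n_tokens - 1, 1)


msg = "hi it's me call me at five five five one two one two okay bye".split()
found = find_phone_candidates(msg)
```

The length check plays the role of a structural feature, while relative_position gives a learner a way to pick up that self-identifications tend to come early in a message.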

18
Telephone Number Identification