Title: CS4705
1CS4705
- Corpus Linguistics and Machine Learning Techniques
2Review
- What do we know about so far?
- Words (stems and affixes, roots and templates, ...)
- Ngrams (simple word sequences)
- POS (e.g. nouns, verbs, adverbs, adjectives, determiners, articles, ...)
3Some Additional Things We Could Find
- Named Entities
- Persons
- Company Names
- Locations
- Dates
4What useful things can we do with this knowledge?
- Find sentence boundaries, abbreviations
- Find Named Entities (person names, company names, telephone numbers, addresses, ...)
- Find topic boundaries and classify articles into topics
- Identify a document's author and their opinion on the topic, pro or con
- Answer simple questions (factoids)
- Do simple summarization/compression
5But first, we need corpora
- Online collections of text and speech
- Some examples
- Brown Corpus
- Wall Street Journal and AP News
- ATIS, Broadcast News
- TDTN
- Switchboard, Call Home
- TRAINS, FM Radio, BDC Corpus
- Hansards: parallel corpus of French and English
- And many private research collections
6Next, we pose a question: the dependent variable
- Binary questions
- Is this word followed by a sentence boundary or not?
- A topic boundary?
- Does this word begin a person name? End one?
- Should this word or sentence be included in a summary?
- Classification
- Is this document about medical issues? Politics? Religion? Sports?
- Predicting continuous variables
- How loud or high should this utterance be produced?
7Finding a suitable corpus and preparing it for
analysis
- Which corpora can answer my question?
- Do I need to get them labeled to do so?
- Dividing the corpus into training and test corpora
- To develop a model, we need a training corpus
- An overly narrow corpus doesn't generalize
- An overly general corpus doesn't reflect the task or domain
- To demonstrate how general our model is, we need a test corpus to evaluate the model
- Development test set vs. held-out test set
- To evaluate our model we must choose an evaluation metric
- Accuracy
- Precision, recall, F-measure, ...
- Cross validation
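
The corpus split and the metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own tooling: the function names, the 10% development and test fractions, and the fixed random seed are all assumptions made for the example.

```python
import random


def split_corpus(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle items and split them into (train, dev, test) lists.

    The fractions and the seed are illustrative defaults, not values
    prescribed by the slides.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_dev = int(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold) if gold else 0.0


def precision_recall_f1(gold, predicted, positive=True):
    """Precision, recall and balanced F-measure for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

The development set is used while tuning features; the held-out test set is scored only once, at the end, so that the reported numbers reflect unseen data.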
8Then we build the model
- Identify the dependent variable: what do we want to predict or classify?
- Does this word begin a person name? Is this word within a person name?
- Is this document about sports? The weather? International news? ???
- Identify the independent variables: what features might help to predict the dependent variable?
- What is this word's POS? What is the POS of the word before it? After it?
- Is this word capitalized? Is it followed by a "."?
- Does "hockey" appear in this document?
- How far is this word from the beginning of its sentence?
- Extract the values of each variable from the corpus by some automatic means
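
Extracting these independent variables automatically might look like the following Python sketch. It assumes the input is one tokenized sentence and that POS tags have already been produced by some external tagger; `token_features`, `document_has_word`, and the `<s>` / `</s>` boundary markers are hypothetical names chosen for the example.

```python
def token_features(tokens, pos_tags, i):
    """Return a feature dictionary for the word at index i of one sentence.

    pos_tags is assumed to be aligned with tokens and supplied by an
    external POS tagger; no tagging is done here.
    """
    return {
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "<s>",
        "next_pos": pos_tags[i + 1] if i + 1 < len(tokens) else "</s>",
        "capitalized": tokens[i][:1].isupper(),
        "followed_by_period": i + 1 < len(tokens) and tokens[i + 1] == ".",
        # 1-based distance from the beginning of the sentence
        "dist_from_sentence_start": i + 1,
    }


def document_has_word(document_tokens, word):
    """Binary document-level feature: does the word occur anywhere?"""
    return word.lower() in {t.lower() for t in document_tokens}
```

For example, `document_has_word(doc_tokens, "hockey")` would supply one feature for a sports-versus-other classifier, while `token_features` supplies the per-word features used for name or boundary detection.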
9A Sample Feature Vector for Sentence-Ending
Detection
WordID    POS    Cap?   , After?   Dist/Sbeg   End?
Clinton   N      y      n          1           n
won       V      n      n          2           n
easily    Adv    n      y          3           n
but       Conj   n      n          4           n
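
As a rough illustration, the rows above can be rebuilt from the raw text "Clinton won easily, but" with a few lines of Python. The POS tags and the End? labels are passed in by hand here, standing in for a tagger and for human annotation; `feature_rows` is a hypothetical helper, not part of any toolkit.

```python
def feature_rows(text, pos_tags, end_labels):
    """Build (WordID, POS, Cap?, ", After?", Dist/Sbeg, End?) rows.

    text is split on whitespace, so a comma stays attached to the word
    that precedes it; pos_tags and end_labels are supplied by the caller.
    """
    rows = []
    for i, raw in enumerate(text.split()):
        word = raw.rstrip(",")
        rows.append((
            word,
            pos_tags[i],
            "y" if word[:1].isupper() else "n",   # Cap?
            "y" if raw.endswith(",") else "n",    # comma after?
            i + 1,                                # distance from sentence start
            end_labels[i],                        # gold label: sentence end?
        ))
    return rows


rows = feature_rows(
    "Clinton won easily, but",
    pos_tags=["N", "V", "Adv", "Conj"],
    end_labels=["n", "n", "n", "n"],
)
for row in rows:
    print(*row)
```

Each row is one training instance: the first five columns are independent variables and the last column is the dependent variable the classifier learns to predict.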
10An Example: Finding Caller Names in Voicemail (SCANMail)
- Motivated by interviews, surveys and usage logs of heavy users
- Hard to scan new msgs to find those you need to deal with quickly
- Hard to find msg you want in archive
- Hard to locate information you want in any msg
- How could we help?
11SCANMail Architecture
[Figure: SCANMail architecture diagram; surviving labels: Caller, SCANMail, Subscriber]
12Corpus Collection
- Recordings collected from 138 AT&T Labs employees' mailboxes
- 100 hours; 10K msgs; 2500 speakers
- Gender balanced; 12% non-native speakers
- Mean message duration 36.4 secs, median 30.0 secs
- Hand-transcribed and annotated with caller id, gender, age, entity demarcation (names, dates, telnos)
- Also recognized using ASR engine
13Transcription and Bracketing
- Greeting hi R CallerID it's me give me
a call um right away cos there's .hn I
guess there's some .hn change Date
tomorrow with the nursery school and they um
.hn anyway they had this idea cos since
I think J's the only one staying Date tomorrow
for play club so they wanted to they suggested
that .hn well J2 actually offered to take J
home with her and then would she
14- would meet you back at the synagogue at Time
five thirty to pick her up .hn uh so I
don't know how you feel about that otherwise M_
and one other teacher would stay and take care of
her till Date five thirty tomorrow but if
you .hn I wanted to know how you feel before
I tell her one way or the other so call me .hn
right away cos I have to get back to her in
about an hour so .hn okay Closing bye
.nhn .onhk
15SCANMail Demo
http://www.avatarweb.com/scanmail/
Audix extension: demo; Audix password: (null)
16Information Extraction (Martin Jansche and Steve
Abney)
- Goals: extract key information from msgs to present in headers
- Approach
- Supervised learning from transcripts (phone #s, caller self-ids)
- Combine Machine Learning techniques with simpler alternatives, e.g. hand-crafted rules
- Two stage approaches
17- Features exploit structure of key elements (e.g.
length of phone numbers) and of surrounding
context (e.g. self-ids tend to occur at beginning
of msg)
18Telephone Number Identification
- Rules convert all numbers to standard digit format
- Predict start of phone number with rules
- This step over-generates
- Prune with decision-tree classifier
- Best features
- Position in msg
- Lexical cues
- Length of digit string
- Performance
- .94 F on human-labeled transcripts
- .95 F on ASR
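
The two-stage idea (over-generate with rules, then prune) can be sketched in Python as follows. This is only an illustration under stated assumptions: the digit-word table, the function names, and the length filter are invented for the example, and the length filter is a crude stand-in for the decision-tree classifier, which would also use position in the message and lexical cues.

```python
# Spoken digit words mapped to digit characters. Treating "oh" as zero
# is an assumption made for this sketch.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def normalize_digits(tokens):
    """Stage 0: convert spoken digit words to a standard digit format."""
    return [DIGIT_WORDS.get(t.lower(), t) for t in tokens]


def candidate_numbers(tokens):
    """Stage 1: propose every maximal run of digits as a phone number.

    Returns (start_index, digit_string) pairs. This deliberately
    over-generates: times, dates and counts are proposed too.
    """
    norm = normalize_digits(tokens)
    candidates = []
    i = 0
    while i < len(norm):
        if norm[i].isdigit():
            j = i
            while j < len(norm) and norm[j].isdigit():
                j += 1
            candidates.append((i, "".join(norm[i:j])))
            i = j
        else:
            i += 1
    return candidates


def prune(candidates, valid_lengths=(7, 10)):
    """Stage 2 stand-in: keep candidates whose digit string has a
    plausible length. The lengths 7 and 10 are an assumption; the slides
    describe a trained decision tree here, not a fixed rule."""
    return [c for c in candidates if len(c[1]) in valid_lengths]
```

For the tokens of "call me at five five five one two one two by five thirty", stage 1 proposes both the seven-digit string and the lone "5" from the time expression, and the pruning step keeps only the former.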
19Caller Self-Identifications
- Predict start of id with classifier
- 97% of ids begin 1-7 words into msg
- Then predict length of phrase
- Majority are only 2-4 words long
- Avoid risk of relying on correct speech recognition for names
- Best cues to end of phrase are a few common words
- I, could, please
- No actual names: they over-fit the data
- Performance
- .71 F on human-labeled
- .70 F on ASR
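
The second stage, deciding where a self-identification ends, can be illustrated with a small rule-based Python sketch. It is not the classifier described above: it assumes the start index has already been predicted, uses only the three end cues listed on the slide, and caps the phrase at four words because most ids are 2-4 words long. The function name and the example message are hypothetical.

```python
# End-of-phrase cue words taken from the slide; deliberately no names.
END_CUES = {"i", "could", "please"}


def self_id_phrase(tokens, start, max_len=4):
    """Extend a self-id from a predicted start index until an end cue
    word is reached or max_len words have been collected.

    The max_len default of 4 is an assumption based on the observation
    that most ids are 2-4 words long.
    """
    end = start
    while (end < len(tokens)
           and end - start < max_len
           and tokens[end].lower() not in END_CUES):
        end += 1
    return tokens[start:end]


message = "hi this is Pat please call me back".split()
print(self_id_phrase(message, start=1))
```

Because the cue list contains only common function words, the rule does not depend on the recognizer getting the caller's name right, which matches the motivation given on the slide.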
20Introduction to Weka