CS4705 - PowerPoint PPT Presentation
1
CS4705
  • Corpus Linguistics and Machine Learning Techniques

2
Review
  • What do we know about so far?
  • Words (stems and affixes, roots and templates, …)
  • Ngrams (simple word sequences)
  • POS (e.g. nouns, verbs, adverbs, adjectives,
    determiners, articles, …)
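The Ngram bullet above can be made concrete with a short sketch (illustrative Python, not from the course materials):

```python
# A minimal sketch of extracting word n-grams from a token sequence.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Clinton won easily but".split()
print(ngrams(tokens, 2))
# [('Clinton', 'won'), ('won', 'easily'), ('easily', 'but')]
```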

3
Some Additional Things We Could Find
  • Named Entities
  • Persons
  • Company Names
  • Locations
  • Dates

4
What useful things can we do with this knowledge?
  • Find sentence boundaries, abbreviations
  • Find Named Entities (person names, company names,
    telephone numbers, addresses, …)
  • Find topic boundaries and classify articles into
    topics
  • Identify a document's author and their opinion on
    the topic, pro or con
  • Answer simple questions (factoids)
  • Do simple summarization/compression

5
But first, we need corpora
  • Online collections of text and speech
  • Some examples
  • Brown Corpus
  • Wall Street Journal and AP News
  • ATIS, Broadcast News
  • TDTN
  • Switchboard, Call Home
  • TRAINS, FM Radio, BDC Corpus
  • Hansards parallel corpus of French and English
  • And many private research collections

6
Next, we pose a question: the dependent variable
  • Binary questions
  • Is this word followed by a sentence boundary or
    not?
  • A topic boundary?
  • Does this word begin a person name? End one?
  • Should this word or sentence be included in a
    summary?
  • Classification
  • Is this document about medical issues? Politics?
    Religion? Sports?
  • Predicting continuous variables
  • How loud or high should this utterance be
    produced?

7
Finding a suitable corpus and preparing it for
analysis
  • Which corpora can answer my question?
  • Do I need to get them labeled to do so?
  • Dividing the corpus into training and test
    corpora
  • To develop a model, we need a training corpus
  • an overly narrow corpus doesn't generalize
  • an overly general corpus doesn't reflect the task
    or domain
  • To demonstrate how general our model is, we need
    a test corpus to evaluate the model
  • Development test set vs. held out test set
  • To evaluate our model we must choose an
    evaluation metric
  • Accuracy
  • Precision, recall, F-measure, …
  • Cross validation
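The evaluation metrics listed above can be computed directly from gold vs. predicted binary labels; a hypothetical sketch (not from the course materials):

```python
# Compute accuracy, precision, recall, and F-measure for binary labels.
def evaluate(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g and p)        # true positives
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)    # false positives
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)    # false negatives
    accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f

gold = [1, 0, 1, 1, 0, 0]   # e.g. "is this word a sentence boundary?"
pred = [1, 0, 0, 1, 1, 0]
print(evaluate(gold, pred))
```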

8
Then we build the model
  • Identify the dependent variable: what do we want
    to predict or classify?
  • Does this word begin a person name? Is this word
    within a person name?
  • Is this document about sports? The weather?
    International news? ???
  • Identify the independent variables: what features
    might help to predict the dependent variable?
  • What is this word's POS? What is the POS of the
    word before it? After it?
  • Is this word capitalized? Is it followed by a
    "."?
  • Does "hockey" appear in this document?
  • How far is this word from the beginning of its
    sentence?
  • Extract the values of each variable from the
    corpus by some automatic means

9
A Sample Feature Vector for Sentence-Ending
Detection
WordID    POS    Cap?   ","After?   Dist/SBeg   End?
Clinton   N      y      n           1           n
won       V      n      n           2           n
easily    Adv    n      y           3           n
but       Conj   n      n           4           n
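These feature vectors can be derived automatically from a POS-tagged token sequence; a sketch (assumed, not the course's actual code), where punctuation tokens supply the comma/period context but get no row of their own:

```python
# Build (WordID, POS, Cap?, ","After?, Dist/SBeg, End?) rows for each word.
def feature_rows(tokens, tags):
    rows, dist = [], 0
    for i, (word, tag) in enumerate(zip(tokens, tags)):
        if word in {",", "."}:
            continue
        dist += 1  # Dist/SBeg: word's distance from the sentence start
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        rows.append((word, tag,
                     "y" if word[0].isupper() else "n",  # Cap?
                     "y" if nxt == "," else "n",         # ","After?
                     dist,
                     "y" if nxt == "." else "n"))        # End?
    return rows

tokens = ["Clinton", "won", "easily", ",", "but"]
tags   = ["N", "V", "Adv", ",", "Conj"]
for row in feature_rows(tokens, tags):
    print(row)
```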
10
An Example: Finding Caller Names in Voicemail
(SCANMail)
  • Motivated by interviews, surveys and usage logs
    of heavy users
  • Hard to scan new msgs to find those you need to
    deal with quickly
  • Hard to find msg you want in archive
  • Hard to locate information you want in any msg
  • How could we help?

11
SCANMail Architecture
(Architecture diagram connecting Caller, SCANMail, and Subscriber)
12
Corpus Collection
  • Recordings collected from 138 AT&T Labs
    employees' mailboxes
  • 100 hours, 10K msgs, 2500 speakers
  • Gender balanced; 12 non-native speakers
  • Mean message duration 36.4 secs, median 30.0 secs
  • Hand-transcribed and annotated with caller id,
    gender, age, entity demarcation (names, dates,
    telnos)
  • Also recognized using ASR engine

13
Transcription and Bracketing
  • [Greeting] hi R [CallerID] it's me give me a call
    um right away cos there's .hn I guess there's
    some .hn change [Date] tomorrow with the nursery
    school and they um .hn anyway they had this idea
    cos since I think J's the only one staying [Date]
    tomorrow for play club so they wanted to they
    suggested that .hn well J2 actually offered to
    take J home with her and then would she

14
  • would meet you back at the synagogue at [Time]
    five thirty to pick her up .hn uh so I don't know
    how you feel about that otherwise M_ and one
    other teacher would stay and take care of her
    till [Date] five thirty tomorrow but if you .hn I
    wanted to know how you feel before I tell her one
    way or the other so call me .hn right away cos I
    have to get back to her in about an hour so .hn
    okay [Closing] bye .nhn .onhk

15
SCANMail Demo
http://www.avatarweb.com/scanmail/
Audix extension: demo; Audix password: (null)
16
Information Extraction (Martin Jansche and Steve
Abney)
  • Goals: extract key information from msgs to
    present in headers
  • Approach
  • Supervised learning from transcripts (phone #s,
    caller self-ids)
  • Combine Machine Learning techniques with simpler
    alternatives, e.g. hand-crafted rules
  • Two-stage approaches

17
  • Features exploit structure of key elements (e.g.
    length of phone numbers) and of surrounding
    context (e.g. self-ids tend to occur at beginning
    of msg)

18
Telephone Number Identification
  • Rules: convert all numbers to standard digit
    format
  • Predict start of phone number with rules
  • This step over-generates
  • Prune with decision-tree classifier
  • Best features
  • Position in msg
  • Lexical cues
  • Length of digit string
  • Performance
  • .94 F on human-labeled transcripts
  • .95 F on ASR transcripts
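The two-stage pipeline above (an over-generating rule followed by a pruning classifier) can be sketched as a toy version; this is a hypothetical illustration in which a hand-written filter on digit-string length and lexical cues stands in for the actual decision-tree classifier:

```python
import re

# Stage 1 (over-generates): propose every span of consecutive digit
# tokens as a candidate phone number.
def candidates(tokens):
    spans, start = [], None
    for i, tok in enumerate(tokens + ["<end>"]):
        if re.fullmatch(r"\d+", tok):
            start = i if start is None else start
        elif start is not None:
            spans.append((start, i))
            start = None
    return spans

# Stage 2 (prunes): keep spans whose total digit count looks like a
# phone number, or that follow a lexical cue word.
def prune(tokens, spans):
    kept = []
    for s, e in spans:
        ndigits = sum(len(t) for t in tokens[s:e])
        cue = s > 0 and tokens[s - 1].lower() in {"call", "number"}
        if ndigits in (7, 10) or cue:
            kept.append((s, e))
    return kept

msg = "call me back at 9 7 3 5 5 5 1 2 1 2 i leave at 5".split()
spans = prune(msg, candidates(msg))
print([" ".join(msg[s:e]) for s, e in spans])
# ['9 7 3 5 5 5 1 2 1 2']  (the lone trailing "5" is pruned)
```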

19
Caller Self-Identifications
  • Predict start of id with classifier
  • 97% of ids begin 1-7 words into msg
  • Then predict length of phrase
  • Majority are only 2-4 words long
  • Avoid risk of relying on correct speech
    recognition for names
  • Best cues to end of phrase are a few common words
  • "I", "could", "please"
  • No actual names; they over-fit the data
  • Performance
  • .71 F on human-labeled
  • .70 F on ASR
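The two-step strategy above (predict the id's start near the beginning of the message, then end the phrase at a common word rather than a recognized name) can be illustrated with a hedged sketch; the trigger words and cue set here are assumptions, not the actual SCANMail classifier:

```python
# Common words that tend to end a self-id phrase (per the slide, names
# themselves over-fit, so only high-frequency words are used as cues).
END_CUES = {"i", "could", "please", "call"}

def find_self_id(tokens, max_start=7, max_len=4):
    """Return the self-id phrase, or None if no trigger is found early on."""
    for start in range(min(max_start, len(tokens))):
        # Trigger patterns like "it's ..." / "this is ..." stand in for
        # the start-of-id classifier.
        if tokens[start].lower() in {"it's", "this"}:
            end = start + 1
            while (end < len(tokens) and end - start < max_len
                   and tokens[end].lower() not in END_CUES):
                end += 1
            return tokens[start:end]
    return None

msg = "hi R it's me could you call back".split()
print(find_self_id(msg))
# ["it's", 'me']
```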

20
Introduction to Weka