Probabilistic Language Processing - PowerPoint PPT Presentation

1
Probabilistic Language Processing
  • Chapter 23

2
Probabilistic Language Models
  • Goal: define a probability distribution over a set
    of strings
  • Unigram, bigram, n-gram models
  • Count from a corpus, but the counts need smoothing
  • Add-one smoothing
  • Linear interpolation
  • Evaluate with the perplexity measure
  • E.g., segment "segmentwordswithoutspaces" into words with Viterbi
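
The add-one scheme above can be sketched in a few lines (a toy model over a tiny made-up corpus; the function name and data are illustrative, not from the slides):

```python
from collections import Counter

def train_bigram_addone(corpus):
    """Train an add-one (Laplace) smoothed bigram model from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)
    def prob(w_prev, w):
        # Add-one smoothing: every bigram gets one pseudo-count,
        # so unseen bigrams keep nonzero probability.
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab)
    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
p = train_bigram_addone(corpus)
print(round(p("the", "cat"), 3))  # seen bigram: 0.25
print(round(p("the", "the"), 3))  # unseen bigram gets smoothed mass: 0.125
```

Linear interpolation would mix this bigram estimate with a unigram estimate instead of relying on add-one alone.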

3
PCFGs
  • Rewrite rules have probabilities.
  • The probability of a string is the sum of the
    probabilities of its parse trees.
  • Context-freedom means no lexical constraints.
  • Tends to prefer short sentences (fewer rule applications).
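
The per-tree part of that definition can be sketched directly: a tree's probability is the product of the probabilities of the rules used in it (the grammar and its rule probabilities below are made up for illustration):

```python
# Toy PCFG: rule probabilities for each left-hand side sum to 1 (assumed grammar).
pcfg = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.6,
    ("NP", ("fish",)): 0.4,
    ("VP", ("eats", "NP")): 1.0,
}

def tree_prob(tree):
    """Probability of a parse tree = product of its rules' probabilities.
    A tree is (label, children) for a nonterminal, a plain string for a word."""
    if isinstance(tree, str):
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = pcfg[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

t = ("S", [("NP", ["she"]), ("VP", ["eats", ("NP", ["fish"])])])
print(tree_prob(t))  # 1.0 * 0.6 * 1.0 * 0.4 = 0.24
```

Summing `tree_prob` over all parses of an ambiguous string gives the string's probability.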

4
Learning PCFGs
  • Parsed corpus -- count rule uses in the trees.
  • Unparsed corpus:
  • Rule structure known -- use EM (the inside-outside
    algorithm)
  • Rules unknown -- learn in Chomsky normal form, which is problematic in practice.

5
Information Retrieval
  • Goal (think Google): find docs relevant to the
    user's needs.
  • An IR system has a document collection, a query in some
    language, a set of results, and a presentation of the
    results.
  • Ideally we would parse docs into a knowledge base -- too hard.

6
IR 2
  • Boolean keyword model -- each doc is in or out
  • Problem -- a single bit of relevance
  • Boolean combinations are a bit mysterious to users
  • How to compute P(R = true | D, Q)?
  • Estimate a language model for each doc, compute the
    prob of the query given the model.
  • Can rank documents by the odds P(R = true | D, Q) / P(R = false | D, Q)
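
The query-likelihood ranking can be sketched as follows (a toy unigram model with add-one smoothing; the documents, query, and function name are invented for the demo):

```python
from collections import Counter

def query_likelihood(query, doc, vocab_size):
    """P(query | doc's unigram model), with add-one smoothing so
    unseen query words don't zero out the whole score."""
    counts = Counter(doc)
    p = 1.0
    for w in query:
        p *= (counts[w] + 1) / (len(doc) + vocab_size)
    return p

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "dogs chase cats in the park".split(),
}
vocab = len({w for d in docs.values() for w in d})
query = "cat mat".split()
ranked = sorted(docs, key=lambda d: query_likelihood(query, docs[d], vocab),
                reverse=True)
print(ranked)  # d1 contains both query words, so it ranks first
```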

7
IR3
  • For this we need a model of how queries are related
    to docs: bag of words (frequency of words in the doc),
    naïve Bayes.
  • Good example on pp. 842-843.

8
Evaluating IR
  • Precision is the proportion of results that are
    relevant.
  • Recall is the proportion of relevant docs that are in
    the results.
  • ROC curve (there are several varieties): the standard
    is to plot false negatives vs. false positives.
  • More practical for the web: reciprocal rank of the
    first relevant result, or just time to answer.
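
These three measures are simple enough to spell out (doc ids and relevance judgments below are made up):

```python
def precision_recall(results, relevant):
    """Precision: fraction of returned results that are relevant.
    Recall: fraction of relevant docs that were returned."""
    results, relevant = set(results), set(relevant)
    hits = results & relevant
    return len(hits) / len(results), len(hits) / len(relevant)

def reciprocal_rank(results, relevant):
    """1/rank of the first relevant result (0 if none) -- the web-style metric."""
    for rank, doc in enumerate(results, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

prec, rec = precision_recall(["d1", "d2", "d3", "d4"], {"d2", "d5", "d6"})
print(prec, round(rec, 3))  # 0.25 0.333
print(reciprocal_rank(["d1", "d2", "d3"], {"d2"}))  # 0.5
```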

9
IR Refinements
  • Case
  • Stems
  • Synonyms
  • Spelling correction
  • Metadata -- keywords

10
IR Presentation
  • Give the list in order of relevance; deal with
    duplicates
  • Cluster results into classes
  • Agglomerative
  • K-means
  • How to describe automatically generated clusters?
    A word list? The title of the centroid doc?
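
The k-means option can be sketched in plain Python (a toy version: 2-d points stand in for document vectors, Euclidean distance stands in for the tf-idf/cosine setup a real system would use, and the init is deterministic for the demo):

```python
def kmeans(vectors, k, iters=10):
    """Plain k-means on dense vectors."""
    centroids = [list(v) for v in vectors[:k]]  # deterministic init for the demo
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assign each vector to its nearest centroid.
        for v in vectors:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(v, centroids[i])))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return clusters

# Two obvious groups of points standing in for doc vectors.
points = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```

Agglomerative clustering would instead start with one cluster per doc and repeatedly merge the closest pair.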

11
IR Implementation
  • CSC172!
  • Lexicon with a stop list
  • Inverted index: where each word occurs
  • Match with vectors: the vector of word frequencies dotted
    with the query terms.
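
The lexicon-plus-inverted-index pair can be sketched like this (the stop list and documents are made up for illustration):

```python
from collections import defaultdict

STOP = {"the", "a", "of", "in", "on"}  # toy stop list

def build_index(docs):
    """Inverted index: word -> set of doc ids containing it (stop words dropped)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP:
                index[word].add(doc_id)
    return index

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat in the yard",
}
index = build_index(docs)
print(sorted(index["sat"]))  # ['d1', 'd2']
print(sorted(index["cat"]))  # ['d1']
```

Ranking would then dot each candidate doc's frequency vector with the query terms, but the index is what makes finding the candidates fast.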

12
Information Extraction
  • Goal: create database entries from docs.
  • Emphasis on massive data, speed, stylized
    expressions
  • Regular expression grammars are OK if the text is stylized enough
  • Cascaded finite-state transducers -- stages of
    grouping and structure-finding
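
A single regular-expression stage can be sketched as below (the pattern and sentences are invented; a real cascade would run several such stages, each grouping the output of the last):

```python
import re

# A stylized pattern for merger announcements: "X acquires/buys Y".
MERGER = re.compile(r"(?P<buyer>[A-Z]\w+) (?:acquires|buys) (?P<target>[A-Z]\w+)")

text = "Acme acquires Widgetco for 2 billion. Later, Foo buys Bar."
records = [m.groupdict() for m in MERGER.finditer(text)]
print(records)
# [{'buyer': 'Acme', 'target': 'Widgetco'}, {'buyer': 'Foo', 'target': 'Bar'}]
```

Each dict is a candidate database entry, which is exactly the "docs to database entries" goal above.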

13
Machine Translation Goals
  • Rough translation (e.g., p. 851)
  • Restricted domain (mergers, weather)
  • Pre-edited (Caterpillar or Xerox English)
  • Literary translation -- not yet!
  • Interlingua -- a canonical semantic
    representation like Conceptual Dependency
  • Basic problem: different languages, different categories

14
MT in Practice
  • Transfer -- uses a database of rules for
    translating small units of language
  • Memory-based -- memorize sentence pairs
  • Good diagram on p. 853

15
Statistical MT
  • Bilingual corpus
  • Find the most likely translation given the corpus.
  • argmax_F P(F | E) = argmax_F P(E | F) P(F)
  • P(F) is the language model
  • P(E | F) is the translation model
  • Lots of interesting problems: fertility ("home" vs.
    "a la maison").
  • Horribly drastic simplifications and hacks work
    pretty well!
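
The noisy-channel argmax can be sketched with toy tables (all probabilities below are invented for illustration; a real system would sum over alignments and fertilities instead of looking up whole phrases):

```python
def best_translation(english, candidates, lm, tm):
    """Noisy-channel MT: pick the F maximizing P(E | F) * P(F).
    lm[f] plays the language model P(F); tm[(e, f)] plays the
    translation model P(E | F)."""
    return max(candidates,
               key=lambda f: tm.get((english, f), 0.0) * lm.get(f, 0.0))

# Made-up numbers: "chez soi" translates "home" slightly better,
# but "a la maison" is more probable French, and the product decides.
lm = {"a la maison": 0.02, "chez soi": 0.01}
tm = {("home", "a la maison"): 0.3, ("home", "chez soi"): 0.4}
print(best_translation("home", ["a la maison", "chez soi"], lm, tm))
# a la maison
```

This also shows why both models are needed: the translation model alone would pick the other candidate.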

16
Learning and MT
  • Statistical MT needs a language model, a fertility model,
    a word-choice model, and an offset model.
  • Millions of parameters
  • Counting, estimation, EM.