Detecting Erroneous Sentences using Automatically Mined Sequential Patterns

1
Detecting Erroneous Sentences using
Automatically Mined Sequential Patterns
Advisor: Hsin-Hsi Chen    Reporter: Chi-Hsin Yu
Date: 2007.12.04
2
Outline
  • Introduction
  • Related Work
  • Proposed Technique
  • Experimental Evaluation
  • Conclusions and Future Work

3
Introduction
  • Summary
  • Problem: Identifying erroneous/correct sentences
  • Algorithm: Classification (SVM, NB)
  • Approach: Sequential patterns (Data Mining)
  • Applications
  • Providing feedback for writers of English as a
    Second Language (ESL)
  • Controlling the quality of parallel bilingual
    sentences mined from the Web
  • Evaluating the MT results

4
Introduction (cont.)
  • The common mistakes (Yukio et al., 2001; Gui and
    Yang, 2003) made by ESL learners
  • spelling, verb formation
  • lexical collocation, tense, agreement, wrong
    Part-Of-Speech (POS), article usage
  • sentence structure (grammar structure)
  • Example
  • If Maggie will go to supermarket, she will buy a
    bag for you.
  • The erroneous pattern if...will (the correct form
    uses would)
  • N-grams consider only continuous sequences of
    words and are very expensive if N > 3

5
Related Work
  • Category 1: the use of hand-crafted rules
  • (Heidorn, 2000; Michaud et al., 2000; Bender et
    al., 2004)
  • Difficulties
  • Expensive to write rules manually
  • Difficult to produce and maintain a large number
    of non-conflicting rules to cover a wide range of
    grammatical errors
  • Learners with different first-language backgrounds
    and skill levels make different errors
  • Hard to write rules for some grammatical errors

6
Related Work (cont.)
  • Category 2: statistical approaches
  • (Chodorow and Leacock, 2000; Izumi et al., 2003;
    Brockett et al., 2006; Nagata et al., 2006)
  • Problems
  • Focusing only on some pre-defined errors
  • The reported results are not attractive
  • Errors need to be specified and tagged in the
    training sentences
  • Parallel tagged data is needed

7
Proposed Technique
  • Classification model
  • Using SVM (SVMlight)
  • Features
  • Labeled Sequential Patterns (LSP): 1 feature
  • Complementary features
  • Lexical Collocation (LC): 3 features
  • Perplexity from Language Model (PLM): 2 features
  • Syntactic Score (SC): 1 feature
  • Function Word Density (FWD): 5 features

8
Proposed Technique LSP (1)
  • A labeled sequential pattern (LSP), p, is of the
    form <LHS, c>
  • LHS is a sequence <a1, ..., am>
  • Each ai is called an item
  • c is a class label (correct/incorrect here)
  • Sequence database D
  • The collection of labeled sequences (one tuple
    per sentence)

9
Proposed Technique LSP (2)
  • Contain relation (subsequence)
  • A sequence s1 = <a1, ..., am> is contained in a
    sequence s2 = <b1, ..., bn> if there exist
    integers i1, ..., im such that 1 <= i1 < i2 < ... <
    im <= n and aj = bij for all j in 1, ..., m
  • A = <a,b,c,d,e,f,g,h> has a subsequence
    B = <b,d,e,g>
  • A contains B.
  • An LSP p1 is contained by p2 if the sequence
    p1.LHS is contained by p2.LHS and p1.c = p2.c
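
A minimal Python sketch of the contain check above (the function name is ours; the iterator idiom scans the outer sequence once, so items of the candidate subsequence must appear in order, with gaps allowed):

```python
def contains(seq, sub):
    # True if `sub` is a subsequence of `seq`: every item of `sub`
    # occurs in `seq` in the same relative order, gaps allowed.
    it = iter(seq)
    return all(item in it for item in sub)

print(contains("abcdefgh", "bdeg"))  # True: A contains B
print(contains("abcdefgh", "bda"))   # False: order matters
```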

10
Proposed Technique LSP (3)
  • An LSP p is associated with two measures, support
    and confidence
  • The support of p (the generality of pattern p)
  • Denoted by sup(p)
  • The percentage of tuples in database D that
    contain p.LHS
  • The confidence of p (the predictive ability of p)
  • Denoted by conf(p)
  • Computed as conf(p) = (percentage of tuples in D
    that contain p.LHS and have class label p.c) / sup(p)

11
Proposed Technique LSP (4)
  • Example
  • t1 = (<a, d, e, f>, E)
  • t2 = (<a, f, e, f>, E)
  • t3 = (<d, a, f>, C)
  • One example LSP: p1 = (<a, e, f>, E)
  • Its LHS is contained in t1 and t2
  • sup(p1) = 2/3 = 66.7%
  • conf(p1) = (2/3)/(2/3) = 100%
  • LSP p2 = (<a, f>, E)
  • sup(p2) = 3/3 = 100%
  • conf(p2) = (2/3)/(3/3) = 66.7%
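
The example numbers can be reproduced in a few lines of Python (a sketch; `sup_conf` is our name, and, matching the figures above, support counts tuples whose sequence contains the LHS regardless of label):

```python
def contains(seq, sub):
    # Subsequence check: items of `sub` appear in `seq` in order.
    it = iter(seq)
    return all(item in it for item in sub)

def sup_conf(db, lhs, label):
    # sup: fraction of tuples whose sequence contains lhs
    # conf: among those, the fraction whose class label matches
    hits = [c for s, c in db if contains(s, lhs)]
    return len(hits) / len(db), hits.count(label) / len(hits)

db = [(("a", "d", "e", "f"), "E"),
      (("a", "f", "e", "f"), "E"),
      (("d", "a", "f"), "C")]
print(sup_conf(db, ("a", "e", "f"), "E"))  # p1: sup = 2/3, conf = 1.0
print(sup_conf(db, ("a", "f"), "E"))       # p2: sup = 1.0, conf = 2/3
```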

12
Proposed Technique LSP (5)
  • Generating the sequence database
  • Applying a Part-Of-Speech (POS) tagger to tag each
    training sentence
  • MXPOST (Maximum Entropy POS Tagger Toolkit) for
    POS tags
  • Keeping function words and time words
  • Each sentence together with its label becomes a
    database tuple
  • In the past, John was kind to his sister
  • → In the past, NNP was JJ to his NN
  • LSP examples
  • (<a, NNS>, Error), where NNS denotes a plural noun
  • (<yesterday, is>, Error)
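
This preprocessing step can be sketched as below; the tiny keep-list and the function name are ours, and a real system would use the tagger's full function-word and time-word inventories:

```python
# Toy function/time word list, sized for the example sentence only.
KEEP = {"in", "the", "past", ",", "was", "to", "his"}

def to_db_tuple(tagged_sentence, label):
    # Keep function/time words as-is; replace content words
    # with their POS tags, then attach the sentence label.
    seq = [w if w.lower() in KEEP else pos for w, pos in tagged_sentence]
    return seq, label

tagged = [("In", "IN"), ("the", "DT"), ("past", "NN"), (",", ","),
          ("John", "NNP"), ("was", "VBD"), ("kind", "JJ"),
          ("to", "TO"), ("his", "PRP$"), ("sister", "NN")]
print(to_db_tuple(tagged, "Correct")[0])
# ['In', 'the', 'past', ',', 'NNP', 'was', 'JJ', 'to', 'his', 'NN']
```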

13
Proposed Technique LSP (6)
  • Mining LSPs
  • Adapting the frequent-sequence mining algorithm
    of (Pei et al., 2001)
  • Setting minimum support at 0.1% and minimum
    confidence at 75%
  • Converting LSPs to features
  • The corresponding feature is set to 1 if a
    sentence contains an LSP
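
The conversion step can be sketched as one binary feature per mined LSP (names are ours):

```python
def contains(seq, sub):
    # Subsequence check: items of `sub` appear in `seq` in order.
    it = iter(seq)
    return all(item in it for item in sub)

def to_features(sentence_seq, lsps):
    # One binary feature per mined LSP: 1 if the sentence's
    # word/POS sequence contains the pattern's LHS, else 0.
    return [int(contains(sentence_seq, lhs)) for lhs, _label in lsps]

lsps = [(("past", "is"), "Error"), (("would", "VB"), "Correct")]
seq = ["In", "the", "past", "NNP", "is", "JJ"]
print(to_features(seq, lsps))  # [1, 0]
```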

14
Proposed Technique LSP (7)
  • LSPs for erroneous sentences
  • <this, NNS> (this books is stolen.)
  • <past, is> (in the past, John is kind to his
    sister.)
  • <one, of, NN> (it is one of important working
    language.)
  • <although, but> (although he likes it, but he
    can't buy it.)
  • <only, if, I, am> (only if my teacher has
    given permission, I am allowed to enter this
    room.)
  • LSPs for correct sentences
  • <would, VB> (he would buy it.)
  • <VBD, yesterday> (I bought this book
    yesterday.)

15
Proposed Technique Other Linguistic Features (1)
  • Lexical Collocation (LC)
  • Lexical collocation (strong tea/??, not
    powerful tea)
  • collecting five types of collocations
  • verb-object, adjective-noun, verb-adverb,
    subject-verb, and preposition-object from a
    general English corpus
  • Correct LCs
  • extracting collocations of high frequency
  • Erroneous LC candidates
  • generated by replacing the word in correct
    collocations with its confusion words, obtained
    from WordNet
  • Consulted by experts to see if a candidate is a
    true erroneous collocation

16
Proposed Technique Other Linguistic Features (2)
  • Computing three LC features for each sentence
  • (1) a probability-weighted score over the correct
    LCs in the sentence
  • m is the number of correct LCs
  • n is the number of collocations in the sentence
  • The probability p(coi) of each correct LC coi is
    calculated using the method of (Lu and Zhou, 2004)
  • (2) the ratio of the number of unknown
    collocations (neither correct nor erroneous
    LCs) to the number of collocations in the sentence
  • (3) the ratio of the number of erroneous LCs to
    the number of collocations in the sentence
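
A sketch of the three features; the exact form of feature (1) as a p(coi)-weighted sum divided by n is our reading of the slide, and the probability values here are placeholders:

```python
def lc_features(collocs, correct_lcs, erroneous_lcs, p):
    # collocs: collocations extracted from one sentence (n of them)
    # p: probability p(co_i) for each known-correct collocation
    n = len(collocs)
    correct = [c for c in collocs if c in correct_lcs]      # m of them
    erroneous = [c for c in collocs if c in erroneous_lcs]
    unknown = n - len(correct) - len(erroneous)
    f1 = sum(p[c] for c in correct) / n  # (1) weighted correct-LC score
    f2 = unknown / n                     # (2) unknown-collocation ratio
    f3 = len(erroneous) / n              # (3) erroneous-LC ratio
    return f1, f2, f3

collocs = [("strong", "tea"), ("powerful", "tea"), ("drink", "fast")]
print(lc_features(collocs, {("strong", "tea")}, {("powerful", "tea")},
                  {("strong", "tea"): 0.9}))  # f1 ≈ 0.3, f2 = f3 = 1/3
```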

17
Proposed Technique Other Linguistic Features (3)
  • Perplexity from Language Model (PLM)
  • Extracted from a trigram language model
  • Using the SRILM (SRI Language Modeling) Toolkit
    (Stolcke, 2002)
  • Calculating two values for each sentence
  • Lexicalized trigram perplexity
  • POS trigram perplexity
  • Erroneous sentences are expected to have higher
    perplexity

18
Proposed Technique Other Linguistic Features (4)
  • Syntactic Score (SC)
  • Using a statistical parser toolkit (Collins,
    1997)
  • Assigning each sentence the parser's score,
    i.e., the log probability of the parse
  • Assuming that erroneous sentences with
    undesirable sentence structures are more likely
    to receive lower scores

19
Proposed Technique Other Linguistic Features (5)
  • Function Word Density (FWD)
  • The ratio of function words to content words
  • Inspired by the work of (Corston-Oliver et al.,
    2001)
  • Shown to be effective in distinguishing human
    references from machine outputs
  • Seven kinds of function words are used
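
The density itself is a simple ratio; a sketch with a toy function-word inventory of our own (a real system would use the full inventory for each of the seven kinds):

```python
def fw_density(tokens, function_words):
    # Ratio of function words to content words in a tokenized sentence.
    fw = sum(1 for t in tokens if t.lower() in function_words)
    content = len(tokens) - fw
    return fw / content if content else 0.0

FUNCTION_WORDS = {"the", "on", "a", "of"}  # toy inventory, ours
print(fw_density(["The", "cat", "sat", "on", "the", "mat"],
                 FUNCTION_WORDS))  # 1.0 (3 function / 3 content words)
```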

20
Experimental Evaluation (1) Experimental setup
  • Classification model SVM
  • For a non-binary feature X, its value x is
    normalized by z-score: z = (x - mean) / std
  • Two data sets Japanese Corpus (JC) and Chinese
    Corpus (CC)
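
The z-score normalization mentioned above, as a one-liner (the standard formula, not specific to this paper; mean and standard deviation are computed per feature over the training data):

```python
def z_score(x, mean, std):
    # Normalize a non-binary feature value: distance from the
    # feature's mean in units of its standard deviation.
    return (x - mean) / std

print(z_score(5.0, 3.0, 2.0))  # 1.0
```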

21
Experimental Evaluation (2)
22
Experimental Evaluation (3)
  • ALEK (Chodorow and Leacock, 2000) from
    Educational Testing Service (ETS)
  • 694 parallel sentences; 1,671 non-parallel
    sentences
  • Different cultures (Japanese/Chinese as first
    language)
23
Experimental Evaluation (4)
  • Two LDC data sets: low-ranked and high-ranked
    MT outputs
  • 14,604 low-ranked (score 1-3) MT outputs
  • 808 high-ranked (score 3-5) MT outputs
  • Both with corresponding human reference
    translations
  • Human references treated as correct, MT outputs
    as erroneous

24
Conclusions and Future Work
  • Conclusions
  • This paper proposed to mine LSPs as the input of
    classification models.
  • LSPs were shown to be much more effective than
    the other linguistic features.
  • Other features were also beneficial.
  • Future work
  • To use LSPs to provide detailed feedback for ESL
    learners
  • To integrate the features effectively
  • To further investigate the application for MT
    evaluation

25
Thanks!!