Title: Detecting Erroneous Sentences using Automatically Mined Sequential Patterns
1 Detecting Erroneous Sentences using Automatically Mined Sequential Patterns
Advisor: Hsin-Hsi Chen
Reporter: Chi-Hsin Yu
Date: 2007.12.04
2 Outline
- Introduction
- Related Work
- Proposed Technique
- Experimental Evaluation
- Conclusions and Future Work
3 Introduction
- Summary
  - Problem: identifying erroneous/correct sentences
  - Algorithm: classification (SVM, NB)
  - Approach: sequential patterns (data mining)
- Applications
  - Providing feedback for writers of English as a Second Language (ESL)
  - Controlling the quality of parallel bilingual sentences mined from the Web
  - Evaluating MT results
4 Introduction (cont.)
- Common mistakes made by ESL learners (Yukio et al., 2001; Gui and Yang, 2003)
  - spelling, verb formation
  - lexical collocation, tense, agreement, wrong Part-Of-Speech (POS), article usage
  - sentence structure (grammar structure)
- Example
  - If Maggie will go to supermarket, she will buy a bag for you.
  - The pattern if...will...will signals the error (the if-clause should not use will)
- N-grams consider only continuous sequences of words and are very expensive if N > 3
5 Related Work
- Category 1: the use of hand-crafted rules
  - (Heidorn, 2000; Michaud et al., 2000; Bender et al., 2004)
- Difficulties
  - Expensive to write rules manually
  - Difficult to produce and maintain a large number of non-conflicting rules covering a wide range of grammatical errors
  - Learners with different first-language backgrounds and skill levels make different errors
  - Hard to write rules for some grammatical errors
6 Related Work (cont.)
- Category 2: statistical approaches
  - (Chodorow and Leacock, 2000; Izumi et al., 2003; Brockett et al., 2006; Nagata et al., 2006)
- Problems
  - Focusing only on some pre-defined errors
  - The reported results are not attractive
  - Errors must be specified and tagged in the training sentences
  - The need for parallel tagged data
7 Proposed Technique
- Classification model
  - SVM (SVM-light)
- Features
  - Labeled Sequential Patterns (LSP): 1 feature
  - Complementary features
    - Lexical Collocation (LC): 3 features
    - Perplexity from Language Model (PLM): 2 features
    - Syntactic Score (SC): 1 feature
    - Function Word Density (FWD): 5 features
8 Proposed Technique: LSP (1)
- A labeled sequential pattern (LSP), p, has the form <LHS, c>
  - LHS is a sequence <a1, ..., am>
    - each ai is called an item
  - c is a class label (correct/incorrect here)
- Sequence database D
  - the collection of labeled sequence tuples from which LSPs are mined
9 Proposed Technique: LSP (2)
- Contain relation (subsequence)
  - A sequence s1 = <a1, ..., am> is contained in a sequence s2 = <b1, ..., bn> if there exist integers i1, ..., im such that 1 ≤ i1 < i2 < ... < im ≤ n and aj = b_ij for all j in 1, ..., m.
  - A = <a b c d e f g h> has the subsequence B = <b d e g>
  - A contains B.
- An LSP p1 is contained by p2 if the sequence p1.LHS is contained by p2.LHS and p1.c = p2.c.
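The contain relation above is a standard non-contiguous subsequence test. A minimal Python sketch (not from the paper; the iterator idiom below consumes matches in order):

```python
def contains(seq, sub):
    """True if sub is a (possibly non-contiguous) subsequence of seq,
    i.e. seq contains sub in the sense of the LSP contain relation."""
    it = iter(seq)
    # `x in it` advances the iterator, so matched items must occur in order
    return all(x in it for x in sub)

A = list("abcdefgh")
print(contains(A, list("bdeg")))  # A contains B -> True
print(contains(A, list("gb")))    # order violated -> False
```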
10 Proposed Technique: LSP (3)
- An LSP p is attached with two measures: support and confidence.
- The support of p (the generality of the pattern p)
  - denoted by sup(p)
  - the percentage of tuples in database D that contain the LSP p
- The confidence of p (the predictive ability of p)
  - denoted by conf(p)
  - computed as conf(p) = sup(p) / sup(p.LHS)
11 Proposed Technique: LSP (4)
- Example
  - t1 = (<a, d, e, f>, E)
  - t2 = (<a, f, e, f>, E)
  - t3 = (<d, a, f>, C)
- One example LSP: p1 = (<a, e, f>, E)
  - is contained in t1 and t2
  - sup(p1) = 2/3 = 66.7%
  - conf(p1) = (2/3)/(2/3) = 100%
- LSP p2 = (<a, f>, E)
  - sup(p2) = 3/3 = 100%
  - conf(p2) = (2/3)/(3/3) = 66.7%
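The toy database above can be checked mechanically. A minimal sketch, assuming the slide's convention that support counts tuples whose sequence contains the LHS and confidence is the fraction of those tuples carrying the pattern's label:

```python
def contains(seq, sub):
    # non-contiguous subsequence test (the contain relation)
    it = iter(seq)
    return all(x in it for x in sub)

db = [
    (("a", "d", "e", "f"), "E"),  # t1
    (("a", "f", "e", "f"), "E"),  # t2
    (("d", "a", "f"), "C"),       # t3
]

def sup(lhs):
    # fraction of tuples whose sequence contains lhs (labels ignored)
    return sum(contains(s, lhs) for s, _ in db) / len(db)

def conf(lhs, label):
    # among lhs-containing tuples, the fraction carrying `label`
    labels = [c for s, c in db if contains(s, lhs)]
    return labels.count(label) / len(labels)

print(sup(("a", "e", "f")), conf(("a", "e", "f"), "E"))  # 0.666..., 1.0
print(sup(("a", "f")), conf(("a", "f"), "E"))            # 1.0, 0.666...
```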
12 Proposed Technique: LSP (5)
- Generating the sequence database
  - applying a Part-Of-Speech (POS) tagger to each training sentence
    - MXPOST (Maximum Entropy Part-of-Speech Tagger Toolkit) for POS tags
  - keeping function words and time words; other words are replaced by their POS tags
  - each sentence together with its label becomes a database tuple
    - In the past, John was kind to his sister
    - → In the past, NNP was JJ to his NN
- LSP examples
  - (<a, NNS>, Error), NNS = plural noun
  - (<yesterday, is>, Error)
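The sentence-to-tuple conversion can be sketched as follows. The word lists here are tiny illustrative stand-ins; the paper uses MXPOST output plus full function-word and time-word lists:

```python
# Illustrative subsets only, not the paper's actual lists.
FUNCTION_WORDS = {"in", "the", "was", "to", "his"}
TIME_WORDS = {"past", "yesterday"}

def to_sequence(tagged_tokens):
    """Keep function words and time words; replace other words with POS tags."""
    keep = FUNCTION_WORDS | TIME_WORDS
    return [w if w.lower() in keep else tag for w, tag in tagged_tokens]

tagged = [("In", "IN"), ("the", "DT"), ("past", "NN"), ("John", "NNP"),
          ("was", "VBD"), ("kind", "JJ"), ("to", "TO"), ("his", "PRP$"),
          ("sister", "NN")]
print(" ".join(to_sequence(tagged)))  # In the past NNP was JJ to his NN
```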
13 Proposed Technique: LSP (6)
- Mining LSPs
  - adapting the frequent sequence mining algorithm in (Pei et al., 2001)
  - setting minimum support at 0.1% and minimum confidence at 75%
- Converting LSPs to features
  - the corresponding feature is set to 1 if a sentence contains the LSP
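One common reading of the feature conversion above is a binary vector with one entry per mined LSP. A sketch under that assumption (the LSPs and sequence below are illustrative):

```python
def contains(seq, sub):
    # non-contiguous subsequence test (the contain relation)
    it = iter(seq)
    return all(x in it for x in sub)

def lsp_features(sentence_seq, lsps):
    """One binary feature per mined LSP:
    1 if the sentence sequence contains the LSP's LHS, else 0."""
    return [int(contains(sentence_seq, lhs)) for lhs, _label in lsps]

lsps = [(("this", "NNS"), "Error"),
        (("would", "VB"), "Correct")]
seq = ["this", "NNS", "VBZ", "VBN"]   # "this books is stolen"
print(lsp_features(seq, lsps))        # [1, 0]
```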
14 Proposed Technique: LSP (7)
- LSPs for erroneous sentences
  - <this, NNS> (this books is stolen.)
  - <past, is> (in the past, John is kind to his sister.)
  - <one, of, NN> (it is one of important working language.)
  - <although, but> (although he likes it, but he can't buy it.)
  - <only, if, I, am> (only if my teacher has given permission, I am allowed to enter this room.)
- LSPs for correct sentences
  - <would, VB> (he would buy it.)
  - <VBD, yesterday> (I bought this book yesterday.)
15 Proposed Technique: Other Linguistic Features (1)
- Lexical Collocation (LC)
  - e.g., strong tea, not powerful tea
  - collecting five types of collocations: verb-object, adjective-noun, verb-adverb, subject-verb, and preposition-object, from a general English corpus
- Correct LCs
  - extracted as collocations of high frequency
- Erroneous LC candidates
  - generated by replacing a word in a correct collocation with its confusion words, obtained from WordNet
  - experts are consulted to judge whether a candidate is a true erroneous collocation
16 Proposed Technique: Other Linguistic Features (2)
- Computing three LC features for each sentence
  - (1) the probability-weighted ratio of correct LCs: (p(co_1) + ... + p(co_m)) / n
    - m is the number of correct LCs matched in the sentence
    - n is the number of collocations in the sentence
    - the probability p(co_i) of each correct LC co_i is calculated using the method of (Lu and Zhou, 2004)
  - (2) the ratio of the number of unknown collocations (neither correct LCs nor erroneous LCs) to the number of collocations in the sentence
  - (3) the ratio of the number of erroneous LCs to the number of collocations in the sentence
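Assuming feature (1) is the probability-weighted ratio of matched correct collocations (the formula is reconstructed from the slide's variable definitions, since the original equation image is lost), the three features can be sketched as:

```python
def lc_features(collocations, correct_probs, erroneous):
    """Three LC features for one sentence.
    collocations  : (word, word) pairs extracted from the sentence
    correct_probs : known-correct collocations -> probability p(co_i)
    erroneous     : set of known-erroneous collocations
    """
    n = len(collocations)
    matched = [correct_probs[c] for c in collocations if c in correct_probs]
    n_err = sum(c in erroneous for c in collocations)
    n_unknown = n - len(matched) - n_err
    f1 = sum(matched) / n   # (1) probability-weighted correct-LC ratio
    f2 = n_unknown / n      # (2) unknown-collocation ratio
    f3 = n_err / n          # (3) erroneous-LC ratio
    return f1, f2, f3

cols = [("strong", "tea"), ("powerful", "tea"), ("make", "decision")]
print(lc_features(cols, {("strong", "tea"): 0.8}, {("powerful", "tea")}))
```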
17 Proposed Technique: Other Linguistic Features (3)
- Perplexity from Language Model (PLM)
  - extracted from a trigram language model
  - using the SRILM (SRI Language Modeling) Toolkit (Stolcke, 2002)
- Calculating two values for each sentence
  - lexicalized trigram perplexity
  - POS trigram perplexity
- Erroneous sentences are expected to have higher perplexity
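As a reminder of what the toolkit reports, perplexity is the inverse geometric mean of the per-token model probabilities. A minimal sketch of the definition (not SRILM's implementation):

```python
import math

def perplexity(token_probs):
    """Perplexity of a sentence given its per-token trigram probabilities
    p(w_i | w_{i-2}, w_{i-1}); lower probabilities -> higher perplexity."""
    n = len(token_probs)
    log2_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log2_sum / n)

print(perplexity([0.25] * 6))        # uniform 1/4 per token -> 4.0
print(perplexity([0.1, 0.2, 0.05]))  # a less likely sentence -> higher value
```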
18 Proposed Technique: Other Linguistic Features (4)
- Syntactic Score (SC)
  - using a statistical parser toolkit (Collins, 1997)
  - assigning each sentence the parser's score, i.e., the related log probability of parsing
  - assuming that erroneous sentences with undesirable sentence structures are more likely to receive lower scores
19 Proposed Technique: Other Linguistic Features (5)
- Function Word Density (FWD)
  - the ratio of function words to content words
  - inspired by (Corston-Oliver et al., 2001), where it was effective in distinguishing human references from machine outputs
  - seven kinds of function words are used
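The density itself is a simple ratio; a sketch with an illustrative (deliberately incomplete) function-word list, not the paper's seven categories:

```python
# Illustrative subset; the paper distinguishes seven kinds of function words.
FUNCTION_WORDS = {"the", "a", "an", "to", "of", "in", "he", "it", "is"}

def fwd(tokens):
    """Ratio of function words to content words in a tokenized sentence."""
    n_func = sum(t.lower() in FUNCTION_WORDS for t in tokens)
    n_content = len(tokens) - n_func
    return n_func / n_content

print(fwd("he bought the book in the store".split()))  # 4 function / 3 content
```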
20 Experimental Evaluation (1): Experimental Setup
- Classification model: SVM
- For a non-binary feature X, its value x is normalized by z-score.
- Two data sets: Japanese Corpus (JC) and Chinese Corpus (CC)
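The z-score normalization mentioned above, sketched with the population standard deviation (the slide does not specify population vs. sample):

```python
def z_score(values):
    """Normalize a feature column: x -> (x - mean) / std (population std)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

print(z_score([1.0, 2.0, 3.0]))  # symmetric values map symmetrically around 0
```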
21 Experimental Evaluation (2)
22 Experimental Evaluation (3)
- ALEK (Chodorow and Leacock, 2000) from Educational Testing Service (ETS)
- 694 parallel sentences; 1,671 non-parallel sentences
- Different cultures (Japanese/Chinese as first language)
23 Experimental Evaluation (4)
- Two LDC data sets: low-ranked and high-ranked
  - 14,604 low-ranked (score 1-3) MT outputs
  - 808 high-ranked (score 3-5) MT outputs
  - both with corresponding human reference translations
- Human references labeled Correct; MT outputs labeled Erroneous
24 Conclusions and Future Work
- Conclusions
  - This paper proposed mining LSPs as input to classification models.
  - LSPs were shown to be much more effective than the other linguistic features.
  - The other features were also beneficial.
- Future work
  - to use LSPs to provide detailed feedback for ESL learners
  - to integrate the features more effectively
  - to further investigate the application to MT evaluation
25 Thanks!!