Title: Detecting Erroneous Sentences using Automatically Mined Sequential Patterns
1 Detecting Erroneous Sentences using Automatically Mined Sequential Patterns
Advisor: Hsin-Hsi Chen
Reporter: Chi-Hsin Yu
Date: 2007.12.04
2 Outline
- Introduction
- Related Work
- Proposed Technique
- Experimental Evaluation
- Conclusions and Future Work
3 Introduction
- Summary
  - Problem: identifying erroneous/correct sentences
  - Algorithm: classification (SVM, NB)
  - Approach: sequential patterns (data mining)
- Applications
  - Providing feedback for writers of English as a Second Language (ESL)
  - Controlling the quality of parallel bilingual sentences mined from the Web
  - Evaluating MT results
4 Introduction (cont.)
- Common mistakes made by ESL learners (Yukio et al., 2001; Gui and Yang, 2003)
  - spelling, verb formation
  - lexical collocation, tense, agreement, wrong Part-Of-Speech (POS), article usage
  - sentence structure (grammar structure)
- Example
  - If Maggie will go to supermarket, she will buy a bag for you.
  - The pattern if...will...will signals the error (the if-clause should not use will)
- N-grams consider only continuous sequences of words and are very expensive if N > 3
5 Related Work
- Category 1: the use of hand-crafted rules
  - (Heidorn, 2000; Michaud et al., 2000; Bender et al., 2004)
- Difficulties
  - Expensive to write rules manually
  - Difficult to produce and maintain a large number of non-conflicting rules covering a wide range of grammatical errors
  - Learners with different first-language backgrounds and skill levels make different errors
  - Hard to write rules for some grammatical errors
6 Related Work (cont.)
- Category 2: statistical approaches
  - (Chodorow and Leacock, 2000; Izumi et al., 2003; Brockett et al., 2006; Nagata et al., 2006)
- Problems
  - Focusing only on some pre-defined errors
  - The reported results are not attractive
  - Errors must be specified and tagged in the training sentences
  - The need for parallel tagged data
7 Proposed Technique
- Classification model
  - SVM (SVM-light)
- Features
  - Labeled Sequential Patterns (LSP): 1 feature
  - Complementary features
    - Lexical Collocation (LC): 3 features
    - Perplexity from Language Model (PLM): 2 features
    - Syntactic Score (SC): 1 feature
    - Function Word Density (FWD): 5 features
8 Proposed Technique: LSP (1)
- A labeled sequential pattern (LSP), p, has the form <LHS, c>
  - LHS is a sequence <a1, ..., am>
    - each ai is called an item
  - c is a class label (correct/incorrect here)
- Sequence database D
  - the collection of labeled sequence tuples from which LSPs are mined
9 Proposed Technique: LSP (2)
- Contain relation (subsequence)
  - A sequence s1 = <a1, ..., am> is contained in a sequence s2 = <b1, ..., bn> if there exist integers i1, ..., im such that 1 ≤ i1 < i2 < ... < im ≤ n and aj = b_ij for all j in 1, ..., m.
  - A = <a b c d e f g h> has the subsequence B = <b d e g>
  - A contains B.
- An LSP p1 is contained by p2 if the sequence p1.LHS is contained by p2.LHS and p1.c = p2.c.
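The contain relation above is a standard non-contiguous subsequence test. A minimal Python sketch (not from the paper; the iterator idiom below consumes matches in order):

```python
def contains(seq, sub):
    """True if sub is a (possibly non-contiguous) subsequence of seq,
    i.e. seq contains sub in the sense of the LSP contain relation."""
    it = iter(seq)
    # `x in it` advances the iterator, so matched items must occur in order
    return all(x in it for x in sub)

A = list("abcdefgh")
print(contains(A, list("bdeg")))  # A contains B -> True
print(contains(A, list("gb")))    # order violated -> False
```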
10 Proposed Technique: LSP (3)
- An LSP p is attached with two measures: support and confidence.
- The support of p (the generality of the pattern p)
  - denoted by sup(p)
  - the percentage of tuples in database D that contain the LSP p
- The confidence of p (the predictive ability of p)
  - denoted by conf(p)
  - computed as conf(p) = sup(p) / sup(p.LHS)
11 Proposed Technique: LSP (4)
- Example
  - t1 = (<a, d, e, f>, E)
  - t2 = (<a, f, e, f>, E)
  - t3 = (<d, a, f>, C)
- One example LSP: p1 = (<a, e, f>, E)
  - is contained in t1 and t2
  - sup(p1) = 2/3 = 66.7%
  - conf(p1) = (2/3)/(2/3) = 100%
- LSP p2 = (<a, f>, E)
  - sup(p2) = 3/3 = 100%
  - conf(p2) = (2/3)/(3/3) = 66.7%
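The toy database above can be checked mechanically. A minimal sketch, assuming the slide's convention that support counts tuples whose sequence contains the LHS and confidence is the fraction of those tuples carrying the pattern's label:

```python
def contains(seq, sub):
    # non-contiguous subsequence test (the contain relation)
    it = iter(seq)
    return all(x in it for x in sub)

db = [
    (("a", "d", "e", "f"), "E"),  # t1
    (("a", "f", "e", "f"), "E"),  # t2
    (("d", "a", "f"), "C"),       # t3
]

def sup(lhs):
    # fraction of tuples whose sequence contains lhs (labels ignored)
    return sum(contains(s, lhs) for s, _ in db) / len(db)

def conf(lhs, label):
    # among lhs-containing tuples, the fraction carrying `label`
    labels = [c for s, c in db if contains(s, lhs)]
    return labels.count(label) / len(labels)

print(sup(("a", "e", "f")), conf(("a", "e", "f"), "E"))  # 0.666..., 1.0
print(sup(("a", "f")), conf(("a", "f"), "E"))            # 1.0, 0.666...
```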
12 Proposed Technique: LSP (5)
- Generating the sequence database
  - applying a Part-Of-Speech (POS) tagger to each training sentence
    - MXPOST (Maximum Entropy Part-of-Speech Tagger Toolkit) for POS tags
  - keeping function words and time words; other words are replaced by their POS tags
  - each sentence together with its label becomes a database tuple
    - In the past, John was kind to his sister
    - → In the past, NNP was JJ to his NN
- LSP examples
  - (<a, NNS>, Error), NNS = plural noun
  - (<yesterday, is>, Error)
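The sentence-to-tuple conversion can be sketched as follows. The word lists here are tiny illustrative stand-ins; the paper uses MXPOST output plus full function-word and time-word lists:

```python
# Illustrative subsets only, not the paper's actual lists.
FUNCTION_WORDS = {"in", "the", "was", "to", "his"}
TIME_WORDS = {"past", "yesterday"}

def to_sequence(tagged_tokens):
    """Keep function words and time words; replace other words with POS tags."""
    keep = FUNCTION_WORDS | TIME_WORDS
    return [w if w.lower() in keep else tag for w, tag in tagged_tokens]

tagged = [("In", "IN"), ("the", "DT"), ("past", "NN"), ("John", "NNP"),
          ("was", "VBD"), ("kind", "JJ"), ("to", "TO"), ("his", "PRP$"),
          ("sister", "NN")]
print(" ".join(to_sequence(tagged)))  # In the past NNP was JJ to his NN
```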
13 Proposed Technique: LSP (6)
- Mining LSPs
  - adapting the frequent sequence mining algorithm in (Pei et al., 2001)
  - setting minimum support at 0.1% and minimum confidence at 75%
- Converting LSPs to features
  - the corresponding feature is set to 1 if a sentence contains the LSP
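One common reading of the feature conversion above is a binary vector with one entry per mined LSP. A sketch under that assumption (the LSPs and sequence below are illustrative):

```python
def contains(seq, sub):
    # non-contiguous subsequence test (the contain relation)
    it = iter(seq)
    return all(x in it for x in sub)

def lsp_features(sentence_seq, lsps):
    """One binary feature per mined LSP:
    1 if the sentence sequence contains the LSP's LHS, else 0."""
    return [int(contains(sentence_seq, lhs)) for lhs, _label in lsps]

lsps = [(("this", "NNS"), "Error"),
        (("would", "VB"), "Correct")]
seq = ["this", "NNS", "VBZ", "VBN"]   # "this books is stolen"
print(lsp_features(seq, lsps))        # [1, 0]
```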
14 Proposed Technique: LSP (7)
- LSPs for erroneous sentences
  - <this, NNS> (this books is stolen.)
  - <past, is> (in the past, John is kind to his sister.)
  - <one, of, NN> (it is one of important working language.)
  - <although, but> (although he likes it, but he can't buy it.)
  - <only, if, I, am> (only if my teacher has given permission, I am allowed to enter this room.)
- LSPs for correct sentences
  - <would, VB> (he would buy it.)
  - <VBD, yesterday> (I bought this book yesterday.)
15 Proposed Technique: Other Linguistic Features (1)
- Lexical Collocation (LC)
  - e.g., strong tea, not powerful tea
  - collecting five types of collocations: verb-object, adjective-noun, verb-adverb, subject-verb, and preposition-object, from a general English corpus
- Correct LCs
  - extracted as collocations of high frequency
- Erroneous LC candidates
  - generated by replacing a word in a correct collocation with its confusion words, obtained from WordNet
  - experts are consulted to judge whether a candidate is a true erroneous collocation
16 Proposed Technique: Other Linguistic Features (2)
- Computing three LC features for each sentence
  - (1) the probability-weighted ratio of correct LCs: (p(co_1) + ... + p(co_m)) / n
    - m is the number of correct LCs matched in the sentence
    - n is the number of collocations in the sentence
    - the probability p(co_i) of each correct LC co_i is calculated using the method of (Lu and Zhou, 2004)
  - (2) the ratio of the number of unknown collocations (neither correct LCs nor erroneous LCs) to the number of collocations in the sentence
  - (3) the ratio of the number of erroneous LCs to the number of collocations in the sentence
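Assuming feature (1) is the probability-weighted ratio of matched correct collocations (the formula is reconstructed from the slide's variable definitions, since the original equation image is lost), the three features can be sketched as:

```python
def lc_features(collocations, correct_probs, erroneous):
    """Three LC features for one sentence.
    collocations  : (word, word) pairs extracted from the sentence
    correct_probs : known-correct collocations -> probability p(co_i)
    erroneous     : set of known-erroneous collocations
    """
    n = len(collocations)
    matched = [correct_probs[c] for c in collocations if c in correct_probs]
    n_err = sum(c in erroneous for c in collocations)
    n_unknown = n - len(matched) - n_err
    f1 = sum(matched) / n   # (1) probability-weighted correct-LC ratio
    f2 = n_unknown / n      # (2) unknown-collocation ratio
    f3 = n_err / n          # (3) erroneous-LC ratio
    return f1, f2, f3

cols = [("strong", "tea"), ("powerful", "tea"), ("make", "decision")]
print(lc_features(cols, {("strong", "tea"): 0.8}, {("powerful", "tea")}))
```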
17 Proposed Technique: Other Linguistic Features (3)
- Perplexity from Language Model (PLM)
  - extracted from a trigram language model
  - using the SRILM (SRI Language Modeling) Toolkit (Stolcke, 2002)
- Calculating two values for each sentence
  - lexicalized trigram perplexity
  - POS trigram perplexity
- Erroneous sentences are expected to have higher perplexity
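As a reminder of what the toolkit reports, perplexity is the inverse geometric mean of the per-token model probabilities. A minimal sketch of the definition (not SRILM's implementation):

```python
import math

def perplexity(token_probs):
    """Perplexity of a sentence given its per-token trigram probabilities
    p(w_i | w_{i-2}, w_{i-1}); lower probabilities -> higher perplexity."""
    n = len(token_probs)
    log2_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log2_sum / n)

print(perplexity([0.25] * 6))        # uniform 1/4 per token -> 4.0
print(perplexity([0.1, 0.2, 0.05]))  # a less likely sentence -> higher value
```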
18 Proposed Technique: Other Linguistic Features (4)
- Syntactic Score (SC)
  - using a statistical parser toolkit (Collins, 1997)
  - assigning each sentence the parser's score, i.e., the related log probability of parsing
  - assuming that erroneous sentences with undesirable sentence structures are more likely to receive lower scores
19 Proposed Technique: Other Linguistic Features (5)
- Function Word Density (FWD)
  - the ratio of function words to content words
  - inspired by (Corston-Oliver et al., 2001), where it was effective in distinguishing human references from machine outputs
  - seven kinds of function words are used
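The density itself is a simple ratio; a sketch with an illustrative (deliberately incomplete) function-word list, not the paper's seven categories:

```python
# Illustrative subset; the paper distinguishes seven kinds of function words.
FUNCTION_WORDS = {"the", "a", "an", "to", "of", "in", "he", "it", "is"}

def fwd(tokens):
    """Ratio of function words to content words in a tokenized sentence."""
    n_func = sum(t.lower() in FUNCTION_WORDS for t in tokens)
    n_content = len(tokens) - n_func
    return n_func / n_content

print(fwd("he bought the book in the store".split()))  # 4 function / 3 content
```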
20 Experimental Evaluation (1): Experimental Setup
- Classification model: SVM
- For a non-binary feature X, its value x is normalized by z-score.
- Two data sets: Japanese Corpus (JC) and Chinese Corpus (CC)
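The z-score normalization mentioned above, sketched with the population standard deviation (the slide does not specify population vs. sample):

```python
def z_score(values):
    """Normalize a feature column: x -> (x - mean) / std (population std)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

print(z_score([1.0, 2.0, 3.0]))  # symmetric values map symmetrically around 0
```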
21 Experimental Evaluation (2)
22 Experimental Evaluation (3)
- ALEK (Chodorow and Leacock, 2000) from Educational Testing Service (ETS)
- 694 parallel sentences; 1,671 non-parallel sentences
- Different cultures (Japanese/Chinese as first language)
23 Experimental Evaluation (4)
- Two LDC data sets: low-ranked and high-ranked
  - 14,604 low-ranked (score 1-3) MT outputs
  - 808 high-ranked (score 3-5) MT outputs
  - both with corresponding human reference translations
- Human references labeled Correct; MT outputs labeled Erroneous
24 Conclusions and Future Work
- Conclusions
  - This paper proposed mining LSPs as input to classification models.
  - LSPs were shown to be much more effective than the other linguistic features.
  - The other features were also beneficial.
- Future work
  - to use LSPs to provide detailed feedback for ESL learners
  - to integrate the features more effectively
  - to further investigate the application to MT evaluation
25 Thanks!!