1
A Robust Shallow Parser for Swedish
  • Ola Knutsson, Johnny Bigert, Viggo Kann
  • KTH Nada
  • Royal Institute of Technology, Sweden

2
What is robustness?
  • Robust against noisy, ill-formed and partial
    natural language data

3
Shallow parsing
  • Many NLP applications do not need full parsing
  • Shallow parsing can be seen as:
  • A parsing approach
  • Pre-processing for full parsing
  • A collection of techniques
  • Abney: finite-state cascades (1991)
  • Currently, machine learning (ML) receives a lot of attention
  • Well suited to modularization

4
Common modules in a shallow parser
  • Tokenizer
  • PoS-tagger
  • Chunker
  • Phrase identifier
  • Grammatical function identifier

5
Chunking
  • [NP Den mycket gamla mannen] [VC gillade] [NP mat]

Phrase identification
  • [NP Den [AP mycket gamla] mannen] [VC gillade] [NP mat]

6
Other parsers for Swedish
  • Full parsers: UCP (Sågvall Hein), SLE (Gambäck)
  • Shallow parsers (phrase structure): Cass-Swe
    (Kokkinakis), Megyesi using ML
  • Dependency parsers: CG (Birn), FDG (Voutilainen)

7
Granska Text Analyzer (GTA)
  • Hand-crafted rules
  • Context-free backbone
  • Partly object-oriented notation

8
Major phrase categories
  • NP: Han såg den lilla mannen på bänken
  • VC: Han har spelat kort hela natten
  • PP: Han såg spår i sanden
  • AP: Han ogillade små vita lögner
  • ADVP: Han vill inte gå på bio.
  • INFP: Han tycker om att spela

9
Clause boundary identification
  • Based on Ejerhed's algorithm
  • Context-sensitive rules
  • Using only PoS information

10
Different kinds of rules
  • GTA contains 260 rules
  • 200 phrase structure identification rules
  • 20 clause boundary identification rules
  • 40 disambiguation rules

11
Example rule, NP den lilla bilen
  • NPmin@
  • X(wordcl=dt | wordcl=rg),    den
  • X2(wordcl=ab)?,
  • Y(wordcl=jj),                lilla
  • Z(wordcl=nn)                 bilen
  • -->
  • action(help, wordcl:=Z.wordcl, pnf:=undef,
    gender:=Z.gender, num:=Z.num,
    spec:=Z.spec, case:=Z.case)
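
Read as a pattern over word classes, the rule matches a determiner or cardinal (dt/rg), an optional adverb (ab), an adjective (jj) and a noun (nn). The following is a minimal Python sketch of that matching idea only. It is not GTA's rule engine: the function name `find_np_min` and the (word, word class) token representation are assumptions made for illustration, and the action part that copies features from the noun is not modelled.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, word class), e.g. ("den", "dt"); hypothetical representation

def find_np_min(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, matching (dt|rg) ab? jj nn."""
    spans = []
    n = len(tokens)
    i = 0
    while i < n:
        j = i
        if tokens[j][1] in ("dt", "rg"):          # X: determiner or cardinal
            j += 1
            if j < n and tokens[j][1] == "ab":    # X2: optional adverb
                j += 1
            # Y: adjective, followed by Z: noun
            if j + 1 < n and tokens[j][1] == "jj" and tokens[j + 1][1] == "nn":
                spans.append((i, j + 2))
                i = j + 2
                continue
        i += 1
    return spans

sentence = [("han", "pn"), ("såg", "vb"), ("den", "dt"), ("lilla", "jj"), ("bilen", "nn")]
print(find_np_min(sentence))
```

For the example sentence the sketch reports one span covering "den lilla bilen".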

12
Clause boundary rule
  • cl@
  • V(sed!=sen & text!="som" & wordcl!=sn),
  • X((wordcl=pn & pnf=sub) | (wordcl=pm & case=nom) |
    (wordcl=nn & case=nom & V.case!=gen) | wordcl=ab),
  • ---endleftcontext---,
  • Y(wordcl=kn),
  • ---beginrightcontext---,
  • Y2(((wordcl=pn & pnf=sub) | (wordcl=pm & case=nom) |
    (wordcl=nn & case=nom) | wordcl=ab) & wordcl=X.wordcl),
  • Z(wordcl=vb & (vbf=prs | vbf=prt | vbf=imp))
  • -->
  • action(help, wordcl:=Y.wordcl)

13
The Tetris algorithm

Candidate phrases:
  • PP: till general
  • PP: till general Claes
  • NP: general Claes Olsson
  • NP: boken
  • NP: Fänrik Ax
  • VC: gav
  • PP: till general Claes Olsson
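
The slide lists overlapping candidate phrases but does not state how one analysis is chosen among them. As a rough illustration, the Python sketch below assumes that longer candidates win over shorter overlapping ones. That criterion, the function name `select_longest`, the token indices and the flat (non-nested) output are all assumptions, so this should not be read as the actual Tetris algorithm.

```python
from typing import List, Tuple

Candidate = Tuple[str, int, int]  # (phrase type, start, end), end exclusive

def select_longest(candidates: List[Candidate]) -> List[Candidate]:
    """Greedy choice: longest candidates first, skipping any that overlap a chosen one."""
    chosen: List[Candidate] = []
    # Sort by length (descending), then by start position (ascending)
    for cand in sorted(candidates, key=lambda c: (-(c[2] - c[1]), c[1])):
        _, start, end = cand
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda c: c[1])

words = "Fänrik Ax gav boken till general Claes Olsson".split()
candidates = [
    ("PP", 4, 6),  # till general
    ("PP", 4, 7),  # till general Claes
    ("NP", 5, 8),  # general Claes Olsson
    ("NP", 3, 4),  # boken
    ("NP", 0, 2),  # Fänrik Ax
    ("VC", 2, 3),  # gav
    ("PP", 4, 8),  # till general Claes Olsson
]
for label, start, end in select_longest(candidates):
    print(label, " ".join(words[start:end]))
```

Under this assumption the shorter PP candidates and the NP inside the PP are discarded; a version that keeps nested phrases would need an extra containment check.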
14
The IOB format
  • Ramshaw and Marcus (1995)
  • A phrase/clause tag contains two parts:
  • Phrase/clause type, e.g. NP, PP
  • One of two tags:
  • 1. I: inside a phrase/clause
  • 2. B: beginning of a phrase/clause
  • When a word does not belong to a phrase:
  • 3. O: outside
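
The encoding can be sketched in Python as a conversion from phrase spans to per-word tags. The tag shape (type followed by B or I, as in NPB and NPI) follows the examples in these slides. The `|` separator for nested phrases, the innermost-first ordering, the function name `to_iob` and the span representation are assumptions for illustration.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (phrase type, start, end), end exclusive

def to_iob(n_tokens: int, spans: List[Span]) -> List[str]:
    """Turn phrase spans into one tag string per token; tokens in no span get "O"."""
    parts: List[List[str]] = [[] for _ in range(n_tokens)]
    # Shorter (inner) spans are handled first so their tags precede the outer ones
    for label, start, end in sorted(spans, key=lambda s: s[2] - s[1]):
        for i in range(start, end):
            parts[i].append(label + ("B" if i == start else "I"))
    return ["|".join(p) if p else "O" for p in parts]

words = "Den mycket gamla mannen gillade mat".split()
spans = [("NP", 0, 4), ("AP", 1, 3), ("VC", 4, 5), ("NP", 5, 6)]
for word, tag in zip(words, to_iob(len(words), spans)):
    print(word, tag)
```

With these spans, "mycket" begins the AP while sitting inside the NP, so it receives the combined tag APB|NPI.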

15
Disagreement error
  • De           dt.utr/neu.plu.def              NPB      CLB
  • gamla        jj.pos.utr/neu.plu.ind/def.nom  APB|NPI  CLI
  • äppelträdet  nn.neu.sin.def.nom              NPI      CLI
  • kan          vb.prs.akt.mod                  VCB      CLI
  • bli          vb.inf.akt.kop                  VCI      CLI
  • som          kn                              O        CLI
  • nya          jj.pos.utr/neu.plu.ind/def.nom  APB      CLI
  • .            mad                             O        CLI
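
Each line of this output holds a word, its PoS tag, its phrase tags and its clause tag. As a hedged illustration of how such lines could be consumed, the Python sketch below parses them and collects the number feature of the words inside the noun phrase. The class name `TaggedWord`, the function `parse_line`, and the assumption that nested phrase tags are separated by `|` are illustrative and not part of GTA.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaggedWord:
    word: str
    pos: str                # dotted PoS tag, e.g. "nn.neu.sin.def.nom"
    phrase_tags: List[str]  # e.g. ["APB", "NPI"]
    clause_tag: str         # "CLB" or "CLI"

def parse_line(line: str) -> TaggedWord:
    """Split one whitespace-separated output line into its four fields."""
    word, pos, phrases, clause = line.split()
    return TaggedWord(word, pos, phrases.split("|"), clause)

lines = [
    "De dt.utr/neu.plu.def NPB CLB",
    "gamla jj.pos.utr/neu.plu.ind/def.nom APB|NPI CLI",
    "äppelträdet nn.neu.sin.def.nom NPI CLI",
]
rows = [parse_line(line) for line in lines]

# Words that carry an NP tag, and the number values (sin/plu) found in their PoS tags
np_words = [r for r in rows if any(t.startswith("NP") for t in r.phrase_tags)]
numbers = {f for r in np_words for f in r.pos.split(".") if f in ("sin", "plu")}
print(sorted(numbers))
```

The noun phrase is still found as one unit even though its words carry both plu and sin, which is the disagreement the slide illustrates.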

16
Partial input
  • Arrangör             nn.utr.sin.ind.nom  NPB      CLB
  • var                  vb.prt.akt.kop      VCB      CLI
  • Järfälla             pm.gen              NPB|NPB  CLI
  • naturskyddsförening  nn.utr.sin.ind.nom  NPB|NPI  CLI
  • där                  ab                  ADVPB    CLI
  • är                   vb.prs.akt.kop      VCB      CLI
  • medlem               nn.utr.sin.ind.nom  NPB      CLI
  • .                    mad                 O        CLI

17
Noisy data
  • Inte         ab                       APB            CLB
  • så           ab                       ADVPB|APB|API  CLI
  • tjck         jj.pos.utr.sin.ind.nom   APB|API|API    CLI
  • som          ha                       O              CLB
  • det          pn.neu.sin.def.sub/obj   NPB            CLI
  • ofta         ab.pos                   ADVPB          CLI
  • står         vb.prs.akt               VCB            CLI
  • i            pp                       PPB            CLI
  • lärobökerna  nn.utr.plu.def.nom       NPB|PPI        CLI
  •              mid                      O              CLI

18
Word order violation
  • Ympkvisten  nn.utr.sin.ind.nom      NPB        CLB
  • inte        ab                      ADVPB      CLI
  • ska         vb.prs.akt.mod          VCB        CLI
  • vara        vb.inf.akt.kop          VCI        CLI
  • sådär       ab                      ADVPB|APB  CLI
  • lång        jj.pos.utr.sin.ind.nom  APB        CLI
  • ,           mid                     O          CLI

19
Evaluation
  • Manually corrected output from GTA
  • Untuned GTA in the evaluation
  • 15 000 words from SUC (Stockholm-Umeå Corpus)
  • 5 genres

20
F-scores for individual phrase types
Type    Accuracy   Count
ADVP    81.9       1008
AP      91.3       1332
INFP    81.9        512
NP      91.4       6895
O       94.4       2449
PP      95.3       3886
VC      92.9       2562
Total   88.7

21
F-score for clause boundary identification
Tagger    F-score
UNIGRAM   84.2
BRILL     87.3
TNT       88.3

The F-score for a baseline identifier was 69.0.
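
The slides do not say how the F-score was computed. The Python sketch below assumes the standard balanced F-measure (the harmonic mean of precision and recall) over clause boundary positions. The function name `f_score` and the example positions are made up for illustration and are not taken from the evaluation.

```python
from typing import Set

def f_score(gold: Set[int], predicted: Set[int]) -> float:
    """Balanced F-measure between gold and predicted boundary positions."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical token indices where a clause begins (tag CLB)
gold_boundaries = {0, 5, 9}
predicted_boundaries = {0, 5, 7, 12}
print(round(100 * f_score(gold_boundaries, predicted_boundaries), 1))
```

Here two of the four predicted boundaries are correct and two of the three gold boundaries are found, so precision is 1/2 and recall is 2/3.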
22
Applications with GTA
  • We are using GTA in:
  • grammar checking, both statistical and rule-based
  • clustering of medical texts
  • CALL systems
  • What do you want to do with GTA?

23
More information
  • www.nada.kth.se/theory/projects/xcheck
  • Contact Ola Knutsson
  • knutsson_at_nada.kth.se