Treebanks and Parsing

1
Treebanks and Parsing
  • Jan Hajic
  • Institute of Formal and Applied Linguistics
  • School of Computer Science
  • Faculty of Mathematics and Physics
  • Charles University, Prague
  • Czech Republic

2
Questions Covered
  • Q1: What do we care about when building a parser?
  • Q2: What works, what doesn't?
  • Q3: What info is useful, what not?
  • Q4: How does grammar writing interact with
    treebank building (TB)?
  • Q5: Methodological lessons learned from TB?
  • Q6: (Dis)advantages of pre-parsing for TB?
  • Q7: Phrase structure vs. dependency?

3
Q1: What do we really care about when building a parser?
  • What will its output be used for?
      • Deep (semantic structure) parsing
      • Translation
      • Question answering
      • etc.
  • Conversion of annotation into features
      • Locality is good (with today's parsers)
  • Accuracy
  • Size, speed (the practical things)

4
Q3: What info is useful?
  • Hard to say
  • MST (McDonald), Collins, Charniak: surface syntax
    parsers (Czech)
  • No function tags used
  • Reduced tagset (1100 -> 43)
  • Hand-made reduction worked best! (POS, case if
    possible)
  • Lemmatization, word forms used
  • Empty categories, co-indexation not used (not
    present)
  • Adjunct/argument distinction not used
  • Subcat frames not used (not present)
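
The slides do not list the 43 reduced tags, so the following is only a minimal sketch of the "POS, case if possible" idea. It assumes 15-character positional tags in the style of the Prague Dependency Treebank, with part of speech in position 1 and case in position 5; the function name `reduce_tag` and the sample tags are illustrative, not taken from the presentation.

```python
# Assumption: positional tags where index 0 holds the part of speech and
# index 4 holds the case (a digit when case applies, "-" or "X" otherwise).
POS_INDEX = 0
CASE_INDEX = 4


def reduce_tag(full_tag: str) -> str:
    """Collapse a full positional tag to POS, plus case when one is marked."""
    pos = full_tag[POS_INDEX]
    case = full_tag[CASE_INDEX] if len(full_tag) > CASE_INDEX else "-"
    if case.isdigit():
        return pos + case  # e.g. noun in nominative -> "N1"
    return pos             # no usable case value: keep POS only


# Illustrative tags: noun, adjective, verb, preposition, punctuation.
tags = [
    "NNFS1-----A----",
    "AAFS1----1A----",
    "VB-S---3P-AA---",
    "RR--4----------",
    "Z:-------------",
]
reduced = [reduce_tag(t) for t in tags]
print(reduced)
```

A reduction like this yields a few dozen coarse tags at most; the hand-made set reported on the slide may have kept or merged other distinctions.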

5
Q5: Lessons learned
  • ! For parsing only !
  • Separating surface and deep annotation is good
  • Even then, Czech parsing lags behind English
  • A 1-million-word treebank is far from enough
      • for languages with rich inflection, that is
  • Need for tagset reduction
  • Local information helps
      • Often can be extracted automatically from the
        annotated treebank
  • Lexicalized PS/dependency
      • not much difference (so far)
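
As one possible reading of "local information extracted automatically from the annotated treebank", the sketch below counts head-tag / dependent-tag / direction triples in a toy dependency treebank. The data layout (form, coarse tag, 1-based head index with 0 for the root), the sentences, and the function name are all hypothetical; the slides do not say which local features were actually used.

```python
from collections import Counter

# Hypothetical toy treebank: each token is (form, coarse_tag, head),
# where head is a 1-based position in the sentence and 0 marks the root.
toy_treebank = [
    [("Peter", "N1", 2), ("sees", "V", 0), ("Mary", "N4", 2)],
    [("Mary", "N1", 2), ("sleeps", "V", 0)],
]


def local_attachment_counts(treebank):
    """Count (head_tag, dependent_tag, direction) triples over all tokens."""
    counts = Counter()
    for sentence in treebank:
        for position, (_form, tag, head) in enumerate(sentence, start=1):
            if head == 0:
                counts[("ROOT", tag, "-")] += 1
                continue
            head_tag = sentence[head - 1][1]
            # "left" means the dependent precedes its head in the sentence.
            direction = "left" if position < head else "right"
            counts[(head_tag, tag, direction)] += 1
    return counts


counts = local_attachment_counts(toy_treebank)
for triple, n in counts.most_common():
    print(triple, n)
```

Counts of this kind need no annotation beyond what the treebank already contains, which is the point the slide makes.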

6
Q6: (Dis)advantages of pre-parsing (surface)
  • Speed
      • Up to 50% faster (100% increase in throughput)
      • therefore cheaper
  • Consistency better
  • Labeling
      • Color codes for uncertainty of label assignment
  • Disadvantage
      • Strange errors
      • Can be checked for automatically with
        cross-checking
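
The two speed figures are consistent if "50% faster" is read as 50% less annotation time per sentence; that reading is an assumption, not something the slide states. Under it, throughput scales as the inverse of the remaining time, as this small sketch shows (the function name is illustrative):

```python
def throughput_increase(time_reduction: float) -> float:
    """Relative throughput gain when time per item drops by `time_reduction`.

    Throughput is items per unit time, so cutting the time per item to
    (1 - time_reduction) of its old value multiplies throughput by
    1 / (1 - time_reduction).
    """
    if not 0 <= time_reduction < 1:
        raise ValueError("time_reduction must be in [0, 1)")
    return 1.0 / (1.0 - time_reduction) - 1.0


# Halving the time per sentence doubles the number of sentences per hour.
print(f"{throughput_increase(0.5):.0%}")
```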

7
Q7: Phrase structure vs. dependency
  • If
      • (phrase structure) has heads marked
      • AND (dependency) has tags suitable for phrase
        labels and no non-projectivity
  • Then
      • essentially the same thing
  • Else...
      • ?? determining heads, branching, labels,
        projectivization
      • Done on Czech: Collins parser '98, ACL '99
      • Dependency -> lexicalized PS (parsing) -> Dep.
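
The "If ... Then" condition can be illustrated with a minimal sketch: a phrase-structure tree with marked heads converts directly to dependencies, and a dependency tree can be tested for non-projectivity. This is a textbook-style illustration, not the conversion used for the Czech experiments; the `Node` class, the function names, and the example sentence are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    head_child: int = 0             # index of the head among children
    position: Optional[int] = None  # 1-based word position, leaves only


def lexical_head(node: Node) -> int:
    """Follow head children down to the word that heads this phrase."""
    if node.position is not None:
        return node.position
    return lexical_head(node.children[node.head_child])


def to_dependencies(root: Node) -> Dict[int, int]:
    """Return {dependent: head}; the head of the sentence root is 0."""
    heads = {lexical_head(root): 0}

    def visit(node: Node) -> None:
        if node.position is not None:
            return
        governor = lexical_head(node)
        for i, child in enumerate(node.children):
            if i != node.head_child:
                # Each non-head child attaches to the phrase's head word.
                heads[lexical_head(child)] = governor
            visit(child)

    visit(root)
    return heads


def is_projective(heads: Dict[int, int]) -> bool:
    """True if every word between a head and its dependent is below that head."""
    def dominated_by(word: int, ancestor: int) -> bool:
        while word != 0:
            word = heads[word]
            if word == ancestor:
                return True
        return False

    for dep, head in heads.items():
        if head == 0:
            continue
        lo, hi = sorted((dep, head))
        if any(not dominated_by(w, head) for w in range(lo + 1, hi)):
            return False
    return True


# "Peter sees Mary": S -> NP VP (head VP), VP -> V NP (head V).
tree = Node("S", [
    Node("NP", [Node("N", position=1)]),
    Node("VP", [Node("V", position=2), Node("NP", [Node("N", position=3)])]),
], head_child=1)

deps = to_dependencies(tree)
print(deps, is_projective(deps))
```

Dependencies derived this way are always projective, which is why the reverse direction (dependency to phrase structure) needs the "no non-projectivity" condition or an explicit projectivization step.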