FipsRomanian: Towards a Romanian Version of the Fips Syntactic Parser
Violeta Seretan, Eric Wehrli, Luka Nerima, Gabriela Soare
LATL - Language Technology Laboratory
{violeta.seretan, eric.wehrli, luka.nerima, gabriela.soare}@unige.ch
Extending Fips to Romanian: two main tasks
Romanian language
- Lexicon construction
- list of headwords (DEX, 1998)
- morphological generation: given a base word form, generates all its forms according to the appropriate inflection paradigm
- manual and semi-automatic insertion
- manual insertion for verbs (specific information: subcategorization, selectional features, thematic function, ...)
- Current status
- simple entries: 60K lexemes / 380K words (10K proper nouns)
- complex entries: multi-word expressions (compounds and collocations)
- de jur împrejurul 'around'
- problemă - a se pune 'problem - to arise'
- Grammar implementation
- Specifications (Soare, 2005)
- Customisation of the FipsRomanian grammar for standard operations (syntactic transformations: relativization, interrogation, passivization, ...)
- Similarities and differences. Examples:
- clitic system
- wh-fronting
- Attachment rules: constraints on the main parser operation, Merge, which combines two adjacent structures into a larger structure
- Current status: about 100 rules specified, nearly half implemented and tested
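The role of attachment rules as constraints on Merge can be pictured with a minimal Python sketch. It is not the Fips implementation (which is written in Component Pascal): the `Structure` and `AttachmentRule` classes, the category labels, and the agreement constraint are all hypothetical, chosen only to show a merge of two adjacent structures that succeeds when some rule licenses it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Structure:
    """A constituent covering the token span [start, end)."""
    category: str
    start: int
    end: int
    features: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


@dataclass
class AttachmentRule:
    """Hypothetical rule: which categories may combine, and under what condition."""
    left: str
    right: str
    result: str
    constraint: Callable[[Structure, Structure], bool]


def agree(left: Structure, right: Structure) -> bool:
    """Illustrative constraint: gender and number must match."""
    return all(left.features.get(k) == right.features.get(k)
               for k in ("gender", "number"))


# Toy rule set (an assumption): a noun may take an agreeing adjective to its right.
RULES: List[AttachmentRule] = [AttachmentRule("N", "A", "NP", agree)]


def merge(left: Structure, right: Structure,
          rules: List[AttachmentRule] = RULES) -> Optional[Structure]:
    """Combine two adjacent structures if an attachment rule licenses it."""
    if left.end != right.start:          # only adjacent structures can merge
        return None
    for rule in rules:
        if (rule.left, rule.right) == (left.category, right.category) \
                and rule.constraint(left, right):
            return Structure(rule.result, left.start, right.end,
                             dict(left.features), [left, right])
    return None


noun = Structure("N", 0, 1, {"gender": "f", "number": "sg"})
adj_sg = Structure("A", 1, 2, {"gender": "f", "number": "sg"})
adj_pl = Structure("A", 1, 2, {"gender": "f", "number": "pl"})
merged = merge(noun, adj_sg)     # licensed: categories match and features agree
rejected = merge(noun, adj_pl)   # blocked by the agreement constraint
```

Keeping each rule as data plus a predicate makes it easy to count how many rules are specified versus implemented, which is how the current status above is reported.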
- Vocabulary
- Latin origin (fundamental vocabulary)
- Slavic origin
- Neologisms: French, Italian, ...
- Loanwords: Turkish, Greek, Hungarian, Albanian, ...
- Morphology
- Case system inherited from Latin
- nominative-accusative, genitive-dative, vocative
- Three grammatical genders
- masculine, feminine, neuter
- Rich inflection of determiners, nouns, adjectives, and verbs
- e.g., about 35 forms for a verb
- The definite article is enclitic, i.e., suffixed to nouns and adjectives
- casă/house, casa/house-the
- mare/big, marea/big-the
Europe - Romance languages
- Orthography
- phonemic Latin alphabet (since 1859)
- Diacritics: ă, â, î; cedilla: ş, ţ
- Syntax
- VSO language, relatively free word order
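The enclitic definite article and the case system described above mean that one base form yields many surface forms, which is why the lexicon relies on paradigm-based generation. The following Python sketch illustrates the idea for a single noun class. The paradigm table, the slot labels, and the function name are assumptions made for illustration and are not taken from the Fips lexicon; the vocative is left out for brevity.

```python
# Hypothetical paradigm for feminine nouns ending in "-ă" (e.g. "casă").
# Keys are (number, case, definiteness); values are endings added to the stem.
FEM_A_PARADIGM = {
    ("sg", "nom-acc", "indef"): "ă",
    ("sg", "nom-acc", "def"): "a",      # enclitic definite article
    ("sg", "gen-dat", "indef"): "e",
    ("sg", "gen-dat", "def"): "ei",
    ("pl", "nom-acc", "indef"): "e",
    ("pl", "nom-acc", "def"): "ele",
    ("pl", "gen-dat", "indef"): "e",
    ("pl", "gen-dat", "def"): "elor",
}


def generate_forms(base_form, paradigm, base_ending="ă"):
    """Strip the base ending and attach every ending of the paradigm."""
    if not base_form.endswith(base_ending):
        raise ValueError(f"{base_form!r} does not fit this paradigm")
    stem = base_form[: -len(base_ending)]
    return {slot: stem + ending for slot, ending in paradigm.items()}


forms = generate_forms("casă", FEM_A_PARADIGM)
distinct_forms = sorted(set(forms.values()))
```

Several slots share the same surface form (here "case"), so the number of distinct word forms is smaller than the number of paradigm slots.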
FipsRomanian: Sample results
Fips: a multilingual parsing architecture (Wehrli, 2007)
- Output
- Rich sentence representation
- constituent structure
- predicate-argument table
- co-indexation chains
- intra-sentential pronoun resolution
- Underlying theory
- Generative Grammar (Chomsky, 1995)
- Similarities
- Simpler Syntax (Culicover and Jackendoff, 2005)
- Lexical Functional Grammar (Bresnan, 2001)
Sample parse tree produced by Fips
- Implementation
- Left-to-right, bottom-up tabular parsing algorithm, relying on detailed lexical information
- Language-independent core; language-specific implementation
- Component Pascal, OOP paradigm, BlackBox IDE
- Supported languages: French, English, German, Spanish, Italian, Greek; others in progress
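As a rough illustration of what "left-to-right, bottom-up tabular parsing" means, here is a small generic chart-style sketch in Python. It is a textbook-like simplification under stated assumptions, not the Fips algorithm: the toy lexicon, the binary rule table, and the item format `(category, start, end)` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy lexicon and binary rules; not taken from Fips.
LEXICON = {"the": "D", "cat": "N", "sleeps": "V"}
RULES = {("D", "N"): "DP", ("DP", "V"): "S"}


def parse(tokens):
    """Return every (category, start, end) item that can be built bottom-up."""
    ends_at = defaultdict(set)   # table: end position -> items ending there
    chart = set()
    for i, token in enumerate(tokens):            # read words left to right
        # An unknown word raises KeyError in this toy version.
        agenda = [(LEXICON[token], i, i + 1)]
        while agenda:
            item = agenda.pop()
            if item in chart:
                continue
            chart.add(item)
            category, start, end = item
            ends_at[end].add(item)
            # Try to combine with every stored item ending where this one starts.
            for left_category, left_start, _ in list(ends_at[start]):
                result = RULES.get((left_category, category))
                if result is not None:
                    agenda.append((result, left_start, end))
    return chart


chart = parse(["the", "cat", "sleeps"])
full_parse_found = ("S", 0, 3) in chart
```

When no item spans the whole input, the table still holds the partial constituents that were built, which is the sense in which partial parses are reported in the results below.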
Preliminary results
Screen captures
- Parsing experiment
- data: journalistic texts, 1.05M words
- average sentence length: 26.9 tokens
- 16.2% full parses (FipsFrench, FipsEnglish: about 80%)
- average length of partial parses: 5.3 tokens
- unknown words: 6.5% (of which 39.2% proper nouns)
- satisfactory lexical coverage
- grammatical coverage needs to be improved (work in progress!)
- Task-based evaluation
- Collocation extraction from parsed data (Seretan, 2008)
- Collocations are half idioms (idioms of encoding, but not of decoding)
- Used by the parser and by the in-house rule-based machine translation system
- Precision for top 2000 results: 30.3%
- (Precision for French data: 65.9%, top 500 results)
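Precision over the top-ranked results, as reported above, is the share of the n highest-ranked candidates that are judged to be true collocations. A minimal Python sketch of that computation follows; the function name and the toy candidate list are hypothetical and do not reproduce the evaluation data.

```python
def precision_at_n(ranked_candidates, true_collocations, n):
    """Percentage of the top-n ranked candidates that are true collocations."""
    top = ranked_candidates[:n]
    if not top:
        raise ValueError("no candidates to evaluate")
    correct = sum(1 for candidate in top if candidate in true_collocations)
    return 100.0 * correct / len(top)


# Toy data for illustration only (not results from the experiment).
ranked = [
    ("problemă", "a se pune"),
    ("casă", "mare"),
    ("de jur", "împrejurul"),
    ("casă", "a se pune"),
]
judged_true = {("problemă", "a se pune"), ("de jur", "împrejurul")}
score = precision_at_n(ranked, judged_true, n=4)
```

Because the two figures above use different cut-offs (top 2000 versus top 500), they are not directly comparable: precision usually drops as n grows.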
Lexicon interface
Fips interface
Sample collocations extracted
References
- Bresnan, J. 2001. Lexical-Functional Syntax. Blackwell, Oxford.
- Chomsky, N. 1995. The Minimalist Program. MIT Press, Cambridge, Mass.
- Calacean, M. and J. Nivre. 2009. A data-driven dependency parser for Romanian. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT 7), pages 65-76, Groningen, Holland.
- 1998. DEX: Dicţionarul explicativ al limbii române. Academia Română, Bucharest.
- Seretan, V. 2008. Collocation extraction based on syntactic parsing. Ph.D. thesis, University of Geneva.
- Soare, G. 2005. Romanian syntax. Technical report, University of Geneva.
- Wehrli, E. 2007. Fips, a deep linguistic multilingual parser. In ACL 2007 Workshop on Deep Linguistic Processing, pages 120-127, Prague, Czech Republic.
Related work and useful resources
- Data-driven dependency parser for Romanian based on the MaltParser; learns dependencies from manual annotations (Calacean and Nivre, 2009). Problem: reduced treebank size and grammatical coverage (simple structures, no subordination, average sentence length only 9 words).
- Sketch Engine for Romanian: shallow parsing (POS patterns), http://www.sketchengine.co.uk/
- Dependency treebank construction, work in progress at the University of Iasi, Romania
- Text processing webservices, RACAI: Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania. http://www.racai.ro/webservices/TextProcessing.aspx
- A repository of tools for Romanian: ConsILR, Consortium for the Romanian Language Resources and Tools; research groups from Iasi, Bucharest and Chisinau. http://consilr.info.uaic.ro/
Faculté des Lettres, Département de Linguistique