Language Technologies Institute

1
Pre-processing of Bilingual Corpora for
Mandarin-English EBMT
  • Ying Zhang, Ralf Brown,
  • Robert Frederking, Alon Lavie
  • (http://www.cs.cmu.edu/~joy)

2
Background
  • The Example-Based Machine Translation system,
    EBMT (Brown 96; Brown 99)
  • A shallow match system
  • Extract statistical dictionary from bitext
  • Word-level alignment
  • Dictionary and glossary are used to fill the gaps
  • Uses a target-language trigram model to generate the
    best translation (Hogan & Frederking 1998)
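
The last bullet can be illustrated with a toy example. The following is a minimal sketch, not the system's actual component: it assumes an add-one smoothed trigram model trained on a tiny invented English corpus, and simply picks the candidate with the highest log-probability. The function names, the smoothing choice, and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts (hypothetical helper)."""
    tri, ctx = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx

def log_prob(sentence, tri, ctx, vocab_size):
    """Add-one smoothed trigram log-probability of a whole sentence."""
    toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((tri[(a, b, c)] + 1) / (ctx[(a, b)] + vocab_size))
        for a, b, c in zip(toks, toks[1:], toks[2:])
    )

# Invented target-language training data and candidate translations.
corpus = ["the court may make an order", "the court may grant leave"]
tri, ctx = train_trigram_counts(corpus)
vocab_size = len({tok for sent in corpus for tok in sent.split()} | {"</s>"})

candidates = ["the court may make an order", "court the may order an make"]
best = max(candidates, key=lambda s: log_prob(s, tri, ctx, vocab_size))
print(best)
```

The fluent word order wins because all of its trigrams were seen in training, while the scrambled candidate is scored almost entirely from smoothed zero counts.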

3
Data Used
  • Hong Kong Legal Code
  • Chinese: 23 MB
  • English: 37.8 MB
  • Hong Kong News (after cleaning): 7,622 documents
  • Dev-test size: 1,331,915 bytes, 4,992
    sentence pairs
  • Final-test size: 1,329,764 bytes, 4,866
    sentence pairs
  • Training size: 25,720,755 bytes, 95,752
    sentence pairs
  • Corpus Cleaning
  • Converted from Big5 to GB
  • Divided into training set (90%), dev-test (5%)
    and test set (5%)
  • Sentence-level alignment, using the Church & Gale
    method (by ISI)
  • Cleaned
  • Convert two-byte Chinese characters to their
    cognates
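
As a quick arithmetic check, the sentence-pair counts listed on this slide can be turned into shares of the whole Hong Kong News corpus. The short sketch below uses only the three counts given above; the variable names are illustrative.

```python
# Sentence-pair counts as listed on the slide.
pairs = {"training": 95_752, "dev-test": 4_992, "final-test": 4_866}

total = sum(pairs.values())
# Share of each subset, in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in pairs.items()}

print(total)   # 105610
print(shares)  # {'training': 90.7, 'dev-test': 4.7, 'final-test': 4.6}
```

The shares are close to a 90/5/5 division, though not exactly equal to it when measured in sentence pairs.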

4
Chinese Segmentation
  • Our EBMT system is word based
  • Written Chinese has no spaces between words

5
Chinese Segmentation (2)
  • Why not just use characters?
  • The mismatch between Chinese and English would be
    worse

6
Chinese Segmentation (3)
  • Segmentation Problem
  • Given a sentence with no spaces, break it into
    words.
  • Segmentation Approaches
  • Statistical approach
  • Dictionary based approach
  • Combination of dictionary and linguistic
    knowledge
  • We used forward/backward maximum match, with
    LDC's frequency dictionary, as the baseline
  • The baseline suffered from the dictionary's
    incomplete coverage of the corpus
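
Forward and backward maximum match can be sketched as follows. This is a minimal illustration, not the baseline segmenter itself: the tiny dictionary, the `max_len` limit, and the fallback to single characters for unknown text are assumptions, and the slides do not say how the two directions are combined.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or cand in dictionary:
                words.append(cand)
                i += size
                break
    return words

def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            cand = text[j - size:j]
            if size == 1 or cand in dictionary:
                words.append(cand)
                j -= size
                break
    words.reverse()
    return words

# Toy dictionary (an assumption; the slides use LDC's frequency dictionary).
toy_dict = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"

print(forward_max_match(sentence, toy_dict))   # ['研究生', '命', '起源']
print(backward_max_match(sentence, toy_dict))  # ['研究', '生命', '起源']
```

The two directions disagree on this sentence, which is the kind of ambiguity that a purely dictionary-based segmenter has to resolve. Both directions also degrade to single characters wherever the dictionary has no entry, which is how incomplete coverage shows up in the output.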

7
Goal
  • Extract Chinese terms from the corpus and add
    them to the frequency dictionary for segmentation
  • Result of pre-processing
  • A segmented/bracketed bilingual corpus
  • A statistical dictionary

8
Definitions
  • Vague definitions of Chinese words
  • Definition used in this paper
  • Chinese Characters
  • The smallest unit in written Chinese is a
    character, which is represented by 2 bytes in
    GB-2312 code.
  • Chinese Words
  • A word in natural language is the smallest
    reusable unit which can be used in isolation.
  • Chinese Phrases
  • We define a Chinese phrase as a sequence of
    Chinese words. For each word in the phrase, the
    meaning of this word is the same as the meaning
    when the word appears by itself.
  • Terms
  • A term is a meaningful constituent. It can be
    either a word or a phrase.

9
Tokenization Techniques (1)
  • Collocation measure
  • For two adjacent terms w1 and w2:
    [equation not preserved in this transcript]
  • where VMI(w1w2) is a variant of average mutual
    information
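
For reference, the textbook quantities that such a measure builds on are given below. These are the standard definitions of pointwise and average mutual information, not the VMI variant used in the slides, whose exact form is not recoverable from this transcript. Here P(w_1 w_2) is the probability that w_1 is immediately followed by w_2, and a bar denotes the absence of that term at the position in question.

```latex
% Pointwise mutual information of the adjacent pair (w_1, w_2)
I(w_1, w_2) = \log \frac{P(w_1 w_2)}{P(w_1)\,P(w_2)}

% Average mutual information over the four presence/absence events
\mathrm{AMI}(w_1, w_2) =
  \sum_{x \in \{w_1, \bar{w}_1\}} \; \sum_{y \in \{w_2, \bar{w}_2\}}
  P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)}
```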

10
Tokenization Techniques (2)
  • Dual-threshold for segmenting
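
The slide gives no detail beyond the name, so the following is only a sketch of one plausible dual-threshold rule, stated as an assumption: adjacent tokens are always joined when their collocation score reaches the upper threshold, never joined when it falls below the lower threshold, and in the band between the two are joined only if the score is a local maximum among its neighbours. The function name, the tie-breaking rule, and the scores are all hypothetical.

```python
def dual_threshold_join(tokens, scores, upper, lower):
    """Group adjacent tokens into terms.

    scores[i] is the collocation score between tokens[i] and tokens[i + 1].
    The in-between rule (local maximum) is an assumption for illustration.
    """
    if not tokens:
        return []
    if len(scores) != len(tokens) - 1:
        raise ValueError("need one score per adjacent token pair")
    terms, current = [], [tokens[0]]
    for i, s in enumerate(scores):
        if s >= upper:
            join = True
        elif s < lower:
            join = False
        else:
            left = scores[i - 1] if i > 0 else float("-inf")
            right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
            join = s > left and s > right
        if join:
            current.append(tokens[i + 1])
        else:
            terms.append("".join(current))
            current = [tokens[i + 1]]
    terms.append("".join(current))
    return terms

# Invented tokens and scores, purely for illustration.
tokens = ["香港", "特别", "行政区", "政府", "公布"]
scores = [0.9, 0.85, 0.1, 0.5]
print(dual_threshold_join(tokens, scores, upper=0.8, lower=0.2))
# ['香港特别行政区', '政府公布']
```

Two thresholds leave a middle band where the decision depends on context rather than on the score alone, which is the main design difference from a single cut-off.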

11
Tokenization Procedure
  • Tokenizing at the character level cannot produce a
    highly accurate segmentation
  • Cross-boundary problem
  • Instead, tokenize the corpus after it has been
    segmented with LDC's segmenter

12
Feedback from Statistical Dictionary
  • Monolingual tokenization may lead to
    over-segmentation
  • The statistical dictionary was built from the
    segmented corpus
  • Use the results of the statistical dictionary to
    adjust the segmentation

13
Flowchart of Pre-processing

14
Results
  • With proper parameters for the two thresholds
  • Average length of Chinese terms increased by 60%,
    and by 10% for English
  • Statistical dictionary gained a 30% increase in
    coverage (with the same precision)
  • Small boost in EBMT overall performance
  • Automatic evaluation metrics
  • Human evaluations

15
Ongoing and Future Work
  • Adding word-clustering and grammar induction
    features
  • Improving the sub-sentential alignment model by
    utilizing the bilingual collocation information
  • Changing the thresholds dynamically according to
    the current segmentation

16
References (partial)
  • Ralf D. Brown. 1996. Example-Based Machine
    Translation in the PanGloss System. In
    Proceedings of the Sixteenth International
    Conference on Computational Linguistics, pages
    169-174, Copenhagen, Denmark.
    http://www.cs.cmu.edu/~ralf/papers.html
  • Ralf D. Brown. 1997. Automated Dictionary
    Extraction for "Knowledge-Free" Example-Based
    Translation. In Proceedings of the Seventh
    International Conference on Theoretical and
    Methodological Issues in Machine Translation,
    pages 111-118, Santa Fe, July 23-25.
  • Ralf D. Brown. 1999. Adding Linguistic Knowledge
    to a Lexical Example-Based Translation System. In
    Proceedings of the Eighth International
    Conference on Theoretical and Methodological
    Issues in Machine Translation (TMI-99), pages
    22-32, Chester, England, August.
    http://www.cs.cmu.edu/~ralf/papers.html
  • Ralf D. Brown. 2000. Automated Generalization of
    Translation Examples. In Proceedings of the
    Eighteenth International Conference on
    Computational Linguistics (COLING-2000), pages
    125-131.
  • Tom Emerson. Segmentation of Chinese Text. In
    MultiLingual Computing & Technology, #38,
    Volume 12, Issue 2. MultiLingual Computing,
    Inc.
  • Christopher Hogan and Robert E. Frederking. 1998.
    An Evaluation of the Multi-engine MT
    Architecture. In Machine Translation and the
    Information Soup: Proceedings of the Third
    Conference of the Association for Machine
    Translation in the Americas (AMTA '98), volume
    1529 of Lecture Notes in Artificial Intelligence,
    pages 113-123. Springer-Verlag, Berlin, October.

17
The End
  • Questions and Comments?