The Corpógrafo: Theory and Practice


1
The Corpógrafo: Theory and Practice
  • Belinda Maia and Luís Sarmento
  • LINGUATECA - PoloFLUP

2
A bit of history
  • PALC '97 - 'Do-it-yourself corpora ... with a
    little bit of help from your friends!'
  • CULT 1998 - Making corpora a learning process
  • Contrastive corpora: linguist / translation
    teacher
  • General > specific language

3
A bit of history
  • 2000 - First Masters in Terminology and
    Translation at FLUP
  • PALC 2001 - Training Translators in Terminology
    and Information Retrieval using Comparable and
    Parallel Corpora
  • Specialized translation and terminology
  • Contact with domain experts
  • Importance of IT > panic at lack of technical
    help for more ambitious students!

4
A bit of history
  • LREC 2002 - Corpora for terminology extraction:
    the differing perspectives and objectives of
    researchers, teachers and language services
    providers
  • 2002 - Second Masters in Terminology and
    Translation at FLUP
  • Plea for help to Diana Santos
  • October 2002
  • LINGUATECA - Polo FLUP

5
LINGUATECA Polo FLUP
  • See http://www.linguateca.pt
  • Leader: Diana Santos (SINTEF, Oslo)
  • Objective - to create resources and tools for the
    computational processing of Portuguese
  • Poles at Lisbon, Braga and Porto
  • Porto - Polo FLUP. Main objectives: creation of
    comparable corpora and tools for terminology
    extraction

6
More history
  • 2003 - Poster of the GC at CL2003
  • 2003 - What are comparable corpora? (CL2003)
  • (2003 - Experimentation with evaluation of
    Machine Translation)
  • 2003 - Experimentation with GC
  • 2003 - Third Masters in Terminology and
    Translation at FLUP

7
GC: Integrated Web Environment for Corpora
Linguistics
  • Motivation
  • Lack of Comprehensive, wide-scope Corpora Tools
  • Commercial Packages are usually difficult to
    Integrate/Customize
  • Tools are not prepared to support cooperative
    work.
  • Linguistic knowledge is not usually integrated
    in tools.

[Diagram: GC architecture. Labels: corpora (BNC, CETEM Público, COMPARA, others), each with a custom interface; Tool Pool (concordance engine, taggers, semi-automatic aligner, corpora bot, statistics, custom tools); Terminology DB; Personal Corpora; Inter-user Communication; Virtual Desktop; Terminology Extraction Tool (auto/semi-auto); roles DEV, ADM, USER; file formats PDF, PS, RTF, TXT, HTML, DOC; Internet.]

8
Prescriptive v descriptive terminology
  • Paper > digital form
  • Static > dynamic resources
  • Democratization of terminology
  • ISO standards > socioterminology
  • Knowledge structures increasingly recognized as
    structured but dynamic - ask Gerhard Budin to
    explain this to you.

9
Perspectives of terminology users
  • Domain experts and vested interests
  • Translators
  • Information retrieval
  • Knowledge engineering
  • Standardized terminology
  • Getting the right word
  • Finding information
  • Perfecting Google
  • Structuring knowledge
  • Finding it fast

10
Bridging the Gap
  • Translation teachers
  • General linguists
  • Translation students
  • Corpus linguists
  • Computational linguists
  • Computer engineers
  • Computer-phobia
  • Computer-worship

11
The Corpógrafo tries to combine
  • Terminology, translation and language study and
    research (Belinda)
  • Terminology databases (Domain experts)
  • Computational linguistics research and production
    of resources (Diana)
  • Information retrieval and artificial intelligence
    (Luís)
  • Discussions on priorities!

12
Focus of Corpógrafo
  • Design priorities are to
  • See the Big Picture
  • Create the Overall Framework
  • Get feedback from users to see their needs
  • Develop certain areas according to real research
    needs
  • Fill in the details and improve techniques later

13
Corpógrafo
  • 1. File Manager - area where each individual or
    group can:
  • convert various text formats to .txt
  • upload texts to their space on server
  • clean them of unnecessary material
  • check tokenization and sentence divisions
  • consult wordlists: alphabetical, frequency, etc.
  • group texts into corpora
  • register full information on source, domain and
    text type
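
As a rough illustration of the wordlist step above (not the Corpógrafo's own code), here is a minimal Python sketch that tokenizes a cleaned .txt text and produces an alphabetical and a frequency wordlist. The function names `tokenize` and `wordlists` and the tokenization rule are assumptions made for this example only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens.

    Assumption: a token is a run of Unicode letters, optionally joined by
    hyphens (so accented Portuguese words and compounds stay whole).
    """
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())

def wordlists(text):
    """Return (alphabetical wordlist, frequency wordlist) for one text."""
    counts = Counter(tokenize(text))
    # Plain code-point order; a real tool would use locale-aware collation.
    alphabetical = sorted(counts)
    # Most frequent first; ties are broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency

if __name__ == "__main__":
    alpha, freq = wordlists("O corpus é um corpus de textos. Os textos são curtos.")
    print(alpha)
    print(freq[:3])
```

Checking tokenization by eye, as the slide suggests, still matters: a rule this simple will mishandle abbreviations, numbers and clitics.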

14
Corpógrafo
  • 2. Corpora analysis area
  • Concordancing tools allowing for
  • KWIC concordancing
  • KWIC concordancing sorted according to the word
    to the left or right
  • N-gram tool
  • N-grams
  • Term-candidates
  • With filters for PT
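
The tools listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Corpógrafo's implementation: the function names, the tiny Portuguese stopword list and the rule "a term candidate is a repeated n-gram that neither starts nor ends with a stopword" are all assumptions chosen to show the general idea of a language-specific filter.

```python
from collections import Counter

def kwic(tokens, keyword, width=3, sort_by=None):
    """Return (left context, keyword, right context) for each hit.

    sort_by="left" sorts hits by the word immediately to the left of the
    keyword, sort_by="right" by the word immediately to the right.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    if sort_by == "left":
        hits.sort(key=lambda h: h[0][-1] if h[0] else "")
    elif sort_by == "right":
        hits.sort(key=lambda h: h[2][0] if h[2] else "")
    return hits

def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical, deliberately tiny stopword list standing in for a PT filter.
STOPWORDS_PT = {"a", "o", "de", "da", "do", "e", "em", "um", "uma"}

def term_candidates(tokens, n=2, min_freq=2):
    """Keep repeated n-grams that do not start or end with a stopword."""
    return [
        (gram, freq)
        for gram, freq in ngrams(tokens, n).most_common()
        if freq >= min_freq
        and gram[0] not in STOPWORDS_PT
        and gram[-1] not in STOPWORDS_PT
    ]

if __name__ == "__main__":
    toks = "a erosão costeira afeta a costa e a erosão costeira aumenta".split()
    for left, kw, right in kwic(toks, "erosão", width=2):
        print(" ".join(left), f"[{kw}]", " ".join(right))
    print(term_candidates(toks))
```

A frequency threshold plus a stopword filter is only a first pass; the output is a list of candidates for a terminologist to accept or reject, not a finished term list.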

15
Corpógrafo
  • 3. Terminology database
  • Terms
  • Definitions
  • Examples
  • Morphology
  • Multilingual equivalents
  • Sources and text details of corpora used
  • Semantic relations: further complexity
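
One way to picture a database entry with the fields listed above is a simple record type. The Python sketch below is purely illustrative: the class name `TermEntry`, the field names and the example relation are assumptions for this example, not the Corpógrafo's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TermEntry:
    """One hypothetical terminology record mirroring the fields on the slide."""
    term: str
    language: str
    definitions: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)
    morphology: str = ""
    # Multilingual equivalents: language code -> equivalent terms.
    equivalents: Dict[str, List[str]] = field(default_factory=dict)
    # Sources and text details of the corpora the term was found in.
    sources: List[str] = field(default_factory=list)
    # Semantic relations as (relation name, related term) pairs.
    relations: List[Tuple[str, str]] = field(default_factory=list)

    def add_equivalent(self, lang: str, term: str) -> None:
        """Record an equivalent term in another language."""
        self.equivalents.setdefault(lang, []).append(term)

if __name__ == "__main__":
    entry = TermEntry(term="erosão costeira", language="pt")
    entry.add_equivalent("en", "coastal erosion")
    entry.relations.append(("is_a", "natural hazard"))  # illustrative relation
    print(entry)
```

Keeping relations as open (name, term) pairs leaves room for the "further complexity" the slide mentions, at the cost of no built-in validation of relation names.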

16
[Diagram. Labels: Internet; Corpora; Corpora Analysis; Terminology Database; Text details (three times).]

17
Future developments - general policy
  • General testing and improvement of the Corpógrafo
  • Experimentation with ideas from other projects,
    e.g. WordNet, FrameNet
  • Experimentation with theories of semantic
    primitives, human universals, etc.
  • Development of new ideas or functions using
    isomorphic relationships between researchers'
    needs and our possibilities

18
Future developments - File Manager
  • Creation of overall framework, perhaps UDC-based,
    for:
  • consultation of research available to public
  • information on ongoing research
  • Coordination of individual corpus projects into
    bigger projects, when possible or necessary

19
File Manager - Theoretical questions
  • Domain organization: UDC or ?
  • Categorization of text by genre: how many
    genres?
  • Reliability of texts from the Internet: how does
    one guarantee quality?
  • Is a translator or linguist able to distinguish a
    good text?
  • Should the domain specialist choose the texts?

20
Corpora construction - theoretical questions /
problems
  • How large is a good domain corpus?
  • No domain corpus will produce EVERY term in the
    area
  • Comparable corpora v. Parallel corpora
  • Aligning comparable corpora at term level

21
Future developments - Corpora analysis
  • Development of finer-grained concordancing
  • Experimentation with finding definitions in
    context
  • Semi-automatic creation of keyword shortlists for
    further text retrieval

22
Corpora Analysis - Theoretical questions
  • How far can one rely on the computational
    linguist or computer engineer to produce analyses
    of corpora?
  • If (semi-)automated processes produce 80%
    possible results, should the linguist /
    translator rubbish these processes?
  • Can we leave it all to the computer engineer?

23
Future developments - terminology databases
  • Refinement of terminology fields
  • Development of further multi-lingual functions
  • Development of organized and robust set of
    semantic relations
  • Semi-automatic visualizing of semantic relations

24
Terminology databases - Theory
  • How much information does a database need?
  • How much does the user of a database need?
  • Is it reasonable to hope that all our databases
    could one day communicate with each other and
    help us with translation / information retrieval
    or whatever?

25
How is the Corpógrafo being used at present?
  • Masters in Terminology and Translation
  • Terminology projects with the support of domain
    specialists in:
  • Engineering - Electronics, Mechanical Engineering
  • Geography - Population Geography, Natural Hazards
    (Fire, Floods, Earthquakes, Coastal Erosion)
  • Medicine - Kidney support machines, Neurology
  • Science - Genetics
  • Translation and Localization

26
How is the Corpógrafo being used at present?
  • Dissertations completed on
  • Definitions for different purposes: pedagogical
    glossary for Corrosion, Electrical engineering -
    http://www.fe.up.pt/cdm/QAE/QAE_gloss_b.htm
  • Socioterminology in the area of Composite
    Materials
  • Graphical representation of Conceptual systems
  • Terminology and Metaphors
  • Football Metaphors

27
How is the Corpógrafo being used at present?
  • Ongoing dissertations on aspects of
  • Terminology databases for different uses,
    neologisms, conceptual analysis
  • Corpora: text analysis, corpora construction
  • Translation and localization terminology
  • Technical writing > Electrical Appliances
  • Terminology in documentaries

28
Pedagogical applications of the Corpógrafo
  • Undergraduate courses: only possible if both
    teachers and students are trained to use it
  • Postgraduate research
  • Terminology and translation (Belinda and domain
    experts)
  • Computational linguistics (Diana)
  • Information retrieval (Luís)
  • Long live team work!

29
To what extent is the Corpógrafo available to
others?
  • Linguateca's policy is to make all resources and
    tools available online
  • Primary users are expected to be Portuguese and
    Brazilian, as most of the resources and tools are
    for Portuguese
  • PoloFLUP's main objective: comparable corpora
    and terminology tools

30
To what extent is the Corpógrafo available to
others?
  • PoloFLUP is, by definition, bi- or multi-lingual
    in interest
  • The Corpógrafo is therefore available for
    experiments on a small scale to the general
    public
  • In the future we hope to be able to work on
    projects with users from other universities and
    other countries

31
Contacts
  • If you are interested in finding out more, please
    contact me
  • Belinda Maia
  • bmaia_at_mail.telepac.pt
  • The Corpógrafo can be used
  • (with a username and password) at
  • http://www.linguateca.pt and
  • http://poloclup.linguateca.pt/ferramentas/gc