Title: The Corpógrafo: Theory and Practice
1 The Corpógrafo: Theory and Practice
- Belinda Maia, Luís Sarmento
- LINGUATECA PoloFLUP
2 A bit of history
- PALC 97: 'Do-it-yourself corpora ... with a little bit of help from your friends!'
- CULT 1998: Making corpora a learning process
- Contrastive corpora linguist / translation teacher
- General > specific language
3 A bit of history
- 2000: First Masters in Terminology and Translation at FLUP
- PALC 2001: Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora
- Specialized translation and terminology
- Contact with domain experts
- Importance of IT > panic at lack of technical help for more ambitious students!
4 A bit of history
- LREC 2002: Corpora for terminology extraction: the differing perspectives and objectives of researchers, teachers and language services providers
- 2002: Second Masters in Terminology and Translation at FLUP
- Plea for help to Diana Santos
- October 2002
- LINGUATECA - Polo FLUP
5 LINGUATECA Polo FLUP
- See http://www.linguateca.pt
- Leader: Diana Santos (SINTEF Oslo)
- Objective: to create resources and tools for the computational processing of Portuguese
- Poles at Lisbon, Braga and Porto
- Porto Polo FLUP main objectives: creation of comparable corpora and tools for terminology extraction
6 More history
- 2003: Poster of the GC at CL2003
- 2003: What are comparable corpora? CL2003
- (2003: Experimentation with evaluation of Machine Translation)
- 2003: Experimentation with GC
- 2003: Third Masters in Terminology and Translation at FLUP
7 GC: Integrated Web Environment for Corpora Linguistics
- Motivation
- Lack of comprehensive, wide-scope corpora tools
- Commercial packages are usually difficult to integrate/customize
- Tools are not prepared to support cooperative work.
- Linguistic knowledge is not usually integrated in tools.
[Diagram: GC architecture. Labels: external corpora (BNC, CETEM Público, COMPARA, others) with custom interfaces; Tool Pool (concordance engine, taggers, semi-automatic aligner, corpora bot, statistics, custom tools); Terminology Extraction Tool (auto/semi-auto); Terminology DB; Personal Corpora; Inter-user Communication; Virtual Desktop; Internet; roles DEV, ADM, USER; file formats PDF, PS, RTF, TXT, HTML, DOC.]
8 Prescriptive v. descriptive terminology
- Paper > digital form
- Static > dynamic resources
- Democratization of terminology
- ISO standards > socioterminology
- Knowledge structures increasingly recognized as structured but dynamic (ask Gerhard Budin to explain this to you).
9 Perspectives of terminology users
- Domain experts and vested interests
- Translators
- Information retrieval
- Knowledge engineering
- Standardized terminology
- Getting the right word
- Finding information
- Perfecting Google
- Structuring knowledge
- Finding it fast
10 Bridging the Gap
- Translation teachers
- General linguists
- Translation students
- Corpus linguists
- Computational linguists
- Computer engineers
- Computer-phobia
- Computer-worship
11 The Corpógrafo tries to combine
- Terminology, translation and language study and research (Belinda)
- Terminology databases (Domain experts)
- Computational linguistics research and production of resources (Diana)
- Information retrieval and artificial intelligence (Luís)
- Discussions on priorities!
12 Focus of Corpógrafo
- Design priorities are to:
- See the Big Picture
- Create the Overall Framework
- Get feedback from users to see their needs
- Develop certain areas according to real research needs
- Fill in the details and improve techniques later
13 Corpógrafo
- 1. File Manager - area where each individual or group can:
- convert various text formats to .txt
- upload texts to their space on server
- clean them of unnecessary material
- check tokenization and sentence divisions
- consult wordlists: alphabetical, frequency etc.
- group texts into corpora
- register full information on source, domain and text type
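The wordlist step listed above (alphabetical and frequency lists) can be illustrated with a short sketch. This is not the Corpógrafo's own code: the function names `tokenize` and `wordlists` and the naive regular-expression tokenizer are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a deliberately naive tokenizer; hyphenated words are kept whole.
    """
    return re.findall(r"\w+(?:-\w+)*", text.lower())


def wordlists(text):
    """Return (alphabetical word list, frequency list) for a plain-text document."""
    counts = Counter(tokenize(text))
    alphabetical = sorted(counts)
    # Most frequent first; ties are broken alphabetically for a stable listing.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency


alphabetical, by_frequency = wordlists("The corpus is small. The corpus grows.")
print(alphabetical)
print(by_frequency)
```

A real tool would also need language-specific tokenization rules, which is why the slide lists checking tokenization and sentence divisions as a separate step.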
14 Corpógrafo
- 2. Corpora analysis area
- Concordancing tools allowing for
- KWIC concordancing
- KWIC concordancing sorted according to word to left or right
- N-gram tool
- N-grams
- Term-candidates
- With filters for PT
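The corpora analysis functions listed above (KWIC concordancing, sorting by the word to the left or right, n-grams and filtered term candidates) can be sketched roughly as follows. Everything here is a hypothetical illustration, not the Corpógrafo implementation: the function names, the `width` and `min_freq` parameters, and the tiny `STOPWORDS_PT` set standing in for the "filters for PT" are all assumptions.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword set; the real PT filters are not described in the slides.
STOPWORDS_PT = {"a", "o", "de", "da", "do", "e", "em", "que", "para", "com"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, width=3, sort_by=None):
    """Return (left context, keyword, right context) for each hit.

    sort_by may be "left" or "right" to sort by the word adjacent to the keyword.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            lines.append((tokens[max(0, i - width):i], tok, tokens[i + 1:i + 1 + width]))
    if sort_by == "left":
        lines.sort(key=lambda line: line[0][-1] if line[0] else "")
    elif sort_by == "right":
        lines.sort(key=lambda line: line[2][0] if line[2] else "")
    return lines


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def term_candidates(tokens, n=2, min_freq=2, stopwords=STOPWORDS_PT):
    """Keep n-grams seen at least min_freq times that neither start nor end with a stopword."""
    return {
        gram: count
        for gram, count in ngrams(tokens, n).items()
        if count >= min_freq and gram[0] not in stopwords and gram[-1] not in stopwords
    }


tokens = tokenize("a erosão costeira e a erosão costeira do litoral")
print(kwic(tokens, "erosão", width=1))
print(term_candidates(tokens))
```

Filtering on the first and last word of each n-gram is only one simple heuristic; the resulting candidates would still need to be checked by a person, in line with the semi-automatic approach described in the slides.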
15 Corpógrafo
- 3. Terminology database
- Terms
- Definitions
- Examples
- Morphology
- Multilingual equivalents
- Sources and text details of corpora used
- Semantic relations: further complexity
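The fields listed above suggest a record structure along the following lines. This is only a sketch under stated assumptions: the class name `TermEntry`, the field names and types, and the `add_relation` helper are hypothetical and do not describe the actual Corpógrafo database schema.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """One hypothetical terminology record, mirroring the fields named on the slide."""
    term: str
    language: str
    definitions: list = field(default_factory=list)
    examples: list = field(default_factory=list)
    morphology: str = ""
    equivalents: dict = field(default_factory=dict)   # language code -> equivalent term
    sources: list = field(default_factory=list)       # details of the corpus texts used
    relations: list = field(default_factory=list)     # (relation type, related term) pairs

    def add_relation(self, relation_type, other_term):
        """Record a semantic relation to another term."""
        self.relations.append((relation_type, other_term))


entry = TermEntry(term="erosão costeira", language="pt")
entry.equivalents["en"] = "coastal erosion"
entry.add_relation("is_a", "risco natural")
print(entry)
```

Keeping relations as simple typed pairs is the easiest starting point; a richer model would be needed for the more complex semantic relations the slide mentions.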
16 [Diagram labels: Internet; Corpora; Corpora Analysis; Terminology Database; Text details]
17 Future developments: general policy
- General testing and improvement of the Corpógrafo
- Experimentation with ideas from other projects, e.g. WordNet, FrameNet
- Experimentation with theories of semantic primitives, human universals etc.
- Development of new ideas or functions using isomorphic relationships between researchers' needs and our possibilities
18 Future developments: File Manager
- Creation of overall framework, perhaps UDC based, for
- consultation of research available to public
- information on ongoing research
- Coordination of individual corpus projects into bigger projects, when possible or necessary
19 File Manager: Theoretical questions
- Domain organization: UDC or ?
- Categorization of text by genre: how many genres?
- Reliability of texts from Internet: how does one guarantee quality?
- Is a translator or linguist able to distinguish a good text?
- Should the domain specialist choose the texts?
20 Corpora construction: theoretical questions / problems
- How large is a good domain corpus?
- No domain corpus will produce EVERY term in the area
- Comparable corpora v. Parallel corpora
- Aligning comparable corpora at term level
21 Future developments: Corpora analysis
- Development of finer-grained concordancing
- Experimentation with finding definitions in context
- Semi-automatic creation of keyword shortlists for further text retrieval
22 Corpora Analysis: Theoretical questions
- How far can one rely on the computational linguist or computer engineer to produce analyses of corpora?
- If (semi-) automated processes produce 80% possible results, should the linguist / translator rubbish these processes?
- Can we leave it all to the computer engineer?
23 Future developments: terminology databases
- Refinement of terminology fields
- Development of further multi-lingual functions
- Development of organized and robust set of semantic relations
- Semi-automatic visualizing of semantic relations
24 Terminology databases: Theory
- How much information does a database need?
- How much does the user of a database need?
- Is it reasonable to hope that all our databases
could one day communicate with each other and
help us with translation / information retrieval
or whatever?
25 How is the Corpógrafo being used at present?
- Masters in Terminology and Translation
- Terminology projects with the support of domain specialists in
- Engineering: Electronics, Mechanical Engineering
- Geography: Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
- Medicine: Kidney support machines, Neurology
- Science: Genetics
- Translation and Localization
26 How is the Corpógrafo being used at present?
- Dissertations completed on
- Definitions for different purposes: pedagogical glossary for Corrosion, Electrical engineering http://www.fe.up.pt/cdm/QAE/QAE_gloss_b.htm
- Socioterminology in the area of Composite Materials
- Graphical representation of Conceptual systems
- Terminology and Metaphors
- Football Metaphors
27 How is the Corpógrafo being used at present?
- Ongoing dissertations on aspects of
- Terminology databases for different uses, neologisms, conceptual analysis
- Corpora: text analysis, corpora construction
- Translation and localization terminology
- Technical writing > Electrical Appliances
- Terminology in documentaries
28 Pedagogical applications of the Corpógrafo
- Undergraduate courses: only possible if both teachers and students are trained to use it
- Postgraduate research
- Terminology and translation (Belinda and domain experts)
- Computational linguistics (Diana)
- Information retrieval (Luís)
- Long live team work!
29 To what extent is the Corpógrafo available to others?
- Linguateca's policy is to make all resources and tools available online
- Primary users are expected to be Portuguese and Brazilian, as most of the resources and tools are for Portuguese
- PoloFLUP's main objective: comparable corpora and terminology tools
30 To what extent is the Corpógrafo available to others?
- PoloFLUP is, by definition, bi- or multi-lingual in interest
- The Corpógrafo is therefore available for experiments on a small scale to the general public
- In the future we hope to be able to work on projects with users from other universities and other countries
31 Contacts
- If you are interested in finding out more, please contact me
- Belinda Maia
- bmaia_at_mail.telepac.pt
- The Corpógrafo can be used
- (with a username and password) at
- http://www.linguateca.pt and
- http://poloclup.linguateca.pt/ferramentas/gc