Title: Parallel corpora and contrastive studies
1Parallel corpora and contrastive studies
- Hilde Hasselgård
- University of Oslo
2From monolingual to multilingual corpus
linguistics
- Corpus linguistics: the study of language by
means of large(ish), structured databases of text
compiled and prepared for use in linguistic
research.
- Largely developed within English linguistics,
with the Brown corpus as the first (1960s),
followed by the Lancaster-Oslo/Bergen (LOB)
corpus.
- Greatly facilitated access to material.
- Opened up new possibilities for quantitative
studies and variation studies.
- Parallel corpora: a more recent development
(1990s), requiring new technology and new
research methods.
3Structure of talk
- Multilingual corpus linguistics
- Multilingual corpora
- The English-Norwegian Parallel Corpus
- Contrastive analysis
- The use of parallel corpora in contrastive
studies
- The contribution of parallel corpora
- Methodology
- The Oslo Multilingual Corpus and the work of
Språk i Kontrast (Languages in Contrast) in
Oslo
- Case study: two future-referring expressions
- Summing up
4What is a parallel corpus?
- original texts with translations into one or more
other languages → a translation corpus
- comparable original texts in different languages
→ a comparable corpus
- bi-directional translation corpus → parallel
corpus
5Translation corpus
- A corpus that contains the same texts in more
than one language, in other words a corpus with
both original and translated texts.
6Comparable corpus
- a corpus that contains original texts in more
than one language and where the texts in each
language have been selected according to the same
criteria (genre, content, publication date etc.)
Language 1: criterion A, criterion B, criterion C, criterion D
Language 2: criterion A, criterion B, criterion C, criterion D
Language 3: criterion A, criterion B, criterion C, criterion D
7Parallel corpus (ENPC model)
- Combination of translation and comparable corpus
- The original texts are comparable (genre, number
of words)
- The translations go in both directions: a
bidirectional translation corpus
8The English-Norwegian Parallel Corpus (ENPC)
Some facts
- Started as a research project at the Department
of British and American Studies in 1994 and
completed in 1997. Prof. Stig Johansson initiated
and directed the project.
- Original texts with translations
(English-Norwegian and Norwegian-English)
- Fiction and non-fiction
- Compiled for use in applied and theoretical
linguistic research
- Development of software for alignment of the
texts (Knut Hofland, UiB) and for searching the
corpus (Jarle Ebeling, UiO)
- Sister projects: the English-Swedish Parallel
Corpus (Lund/Göteborg) and the English-Finnish Parallel
Corpus (Jyväskylä/Savonlinna/Tampere); same
principle of compilation, to some extent also
shared texts.
- Other corpora built on the ENPC model in Germany
(Chemnitz), France/Belgium (Poitiers/Louvain-la-Neuve:
the PLECI corpus), Spain (University of
León).
9Contrastive analysis
- Contrastive analysis is the systematic comparison
of two or more languages, with the aim of
describing their similarities and differences.
(Johansson 2007: 1)
- CA (contrastive analysis) is a linguistic
enterprise aimed at producing inverted (i.e.
contrastive, not comparative) two-valued
typologies (a CA is always concerned with a pair
of languages), and founded on the assumption that
languages can be compared. (James 1980: 3)
- Executing a CA involves two steps, description
and comparison, and the steps are taken in that
order. (James 1980: 63)
10Contrastive analysis
- A CA presupposes a tertium comparationis, i.e. a
measure by which we can be fairly certain we are
comparing like with like.
- The items to be compared across languages are
selected on the basis of perceived similarity
(Chesterman 1998), such as translation
equivalence, semantic/etymological similarity,
grammatical or functional categories.
- A frequently suggested tertium comparationis is
translation equivalence (e.g. James 1980,
Chesterman 1998), which implies that the items in
the two languages convey (more or less) the same
meaning.
11What can multilingual corpora contribute?
- They give insights into the languages compared,
insights that are likely to go unnoticed in
studies of monolingual corpora.
- They can be used for a range of comparative
purposes and increase our understanding of
language-specific, typological and cultural
differences, as well as of universal features.
- They illuminate differences between source texts
and translations, and between native and
non-native texts.
- They can be used for a number of practical
applications, e.g. in lexicography, language
teaching, and translation.
(Aijmer & Altenberg 1996: 12)
12Other benefits of a parallel corpus such as the
ENPC
- Ready access to (relatively) large quantities of
bilingual data
- Sentence alignment
- Comparable original and translated texts in both
languages
- Control for translation bias
- In-built tertium comparationis through
translation equivalence and text comparability
- The paired texts reveal the interlingual
identifications made by translators (Johansson
1999: 117)
13Methodology: Classifying correspondences
- A correspondence is either expressed or zero
- An expressed correspondence is either congruent
(same realisation type) or divergent (different
realisation type)
Example: English correspondences of imidlertid
(however) in the ENPC
- Alle "innrømmelsene" hadde imidlertid en pris. (GL1)
→ However, all these "concessions" had a price.
- Det endte imidlertid godt (...) (UD1)
→ But it ended well (...)
- Reguleringstiltakene har imidlertid gitt
resultater (...). (ABJH1)
→ The regulations have shown results (...).
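The three-way classification can be sketched in a few lines of Python. This is only an illustration: the function name and the word-class labels ("adverb", "conjunction") are assumptions chosen for the imidlertid examples above, not categories or tools taken from the ENPC itself.

```python
from typing import Optional


def classify_correspondence(source_category: str,
                            target_category: Optional[str]) -> str:
    """Classify one aligned pair as 'zero', 'congruent' or 'divergent'.

    target_category is None when the translation has no expressed
    counterpart of the source item.
    """
    if target_category is None:
        return "zero"
    if source_category == target_category:
        return "congruent"   # same realisation type
    return "divergent"       # different realisation type


# Hypothetical manual annotation of the three imidlertid examples.
examples = [
    ("adverb", "adverb"),       # imidlertid -> however
    ("adverb", "conjunction"),  # imidlertid -> but
    ("adverb", None),           # imidlertid -> no counterpart
]
labels = [classify_correspondence(src, tgt) for src, tgt in examples]
print(labels)
```

In practice the realisation types would come from manual analysis or tagging of each aligned pair; the sketch only shows how the labels follow once those categories are known.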
14Paradigms of correspondences
- Swedish translations of however
- emellertid (51; 47%)
- men (but) (36; 33%)
- dock (14; 13%)
- ändå (2)
- däremot (1)
- i alla fall (1)
- Ø (4)
- English translations of emellertid
- however (83; 81%)
- but (3)
- yet (3)
- anyway (1)
- Ø (13)
(Altenberg 1999)
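The percentages in such a paradigm follow directly from the raw counts. A minimal sketch, using the counts listed above (the function and variable names are illustrative only):

```python
def paradigm_percentages(counts):
    """Convert raw correspondence counts into whole-number percentages."""
    total = sum(counts.values())
    return {form: round(100 * n / total) for form, n in counts.items()}


# Raw counts from the two paradigms (Altenberg 1999); "Ø" is zero correspondence.
however_to_swedish = {
    "emellertid": 51, "men": 36, "dock": 14, "ändå": 2,
    "däremot": 1, "i alla fall": 1, "Ø": 4,
}
emellertid_to_english = {
    "however": 83, "but": 3, "yet": 3, "anyway": 1, "Ø": 13,
}

swedish_pct = paradigm_percentages(however_to_swedish)
english_pct = paradigm_percentages(emellertid_to_english)
print(sum(however_to_swedish.values()), swedish_pct)
print(sum(emellertid_to_english.values()), english_pct)
```

The totals (109 occurrences of however, 103 of emellertid) are the source-text frequencies that enter the mutual correspondence calculation.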
15Mutual correspondence (MC) (Altenberg 1999)
- The frequency with which different (grammatical,
semantic and lexical) expressions are translated
into each other.
- Calculated and expressed as a percentage by means
of the formula
MC = (At + Bt) × 100 / (As + Bs)
where At and Bt are the frequencies of the compared
items in the translations, and As and Bs their
frequencies in the source texts.
- The MC of however and emellertid in the ESPC is
thus (51 + 83) × 100 / (109 + 103) = 63.2
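As a worked check of the formula, a short Python sketch (the function name and parameter names are assumptions made for this illustration):

```python
def mutual_correspondence(a_t: int, b_t: int, a_s: int, b_s: int) -> float:
    """Mutual correspondence as a percentage.

    a_t, b_t: how often A is translated by B, and B by A.
    a_s, b_s: how often A and B occur in the source texts.
    """
    return (a_t + b_t) * 100 / (a_s + b_s)


# however -> emellertid: 51 of 109; emellertid -> however: 83 of 103
mc = mutual_correspondence(51, 83, 109, 103)
print(round(mc, 1))  # 63.2
```

A value of 100 would mean that the two items are always translated by each other, and 0 that they never are.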
16Lexicogrammar
- Paradigms of correspondence highlight the fuzzy
borderlines between lexis and grammar and between
grammar and discourse.
- Example: A modal verb will have a wide range of
correspondences
- Norwegian kan (can)
- Valget av tidspunkt kan også inneholde et stenk
av egoisme. (KH1)
- Maybe his choice of timing also contained a touch
of egotism.
Modal aux: can, could, may, might, 'll, will,
would, should
Other verbs: know, enable, have,
have to, had better
Adjectives: possible, able, capable
Adverbs: maybe, perhaps
Suffix: -able
(Løken 2007)
17From ENPC to OMC under the SPRIK umbrella (SPRåk
I Kontrast)
- New languages have been added, first (mainly)
German, then French
- Focus on English-Norwegian-German in the
first phase of the SPRIK project: original texts
in each language with translations into the other
two.
- Same principles for text selection, text sampling
and preparation as for the ENPC (exception: even
more biased towards fiction because of the lack
of translated non-fiction)
- Same (or later versions of same) software for
alignment, searching etc.
- Expanded search facilities and research
possibilities
- Three-way comparison of translations and
originals
- Possibilities of investigating two different
translations of the same text (translation
strategies, translationese)
18Current stock of multilingual corpora at Oslo
- OMC
- Parallel corpora: English-Norwegian,
French-Norwegian, German-Norwegian; three-way
English-German-Norwegian.
- Translation corpora: Norwegian-English-French-German,
Norwegian-French-German,
English-Dutch, English-Portuguese.
- Multiple translations corpus (English-Norwegian)
- Outside OMC
- Russian-English-Norwegian (RuN)
- Multilingual corpora of historical texts (two
projects)
19Trilingual parallel corpus model
20Searching in the OMC (En-Ge-No)
Search terms: however in English originals, doch in German
originals
- Now, however, our father wears jackets and ties
and white shirts, and a tweed overcoat and a
scarf. (MA1)
- Jetzt jedoch trägt unser Vater Jacken und
Krawatten und weiße Hemden und einen Tweedmantel
und einen Schal. (MA1TD)
- Nå går faren vår imidlertid med jakker og slips
og hvite skjorter og tweedfrakk og skjerf.
(MA1TN)
- "However, the ex-Royal Family will be protected
by the laws of the land. (ST1)
- Doch die Ex-Königsfamilie genießt auch den Schutz
der Gesetze dieses Landes. (ST1TD)
- Ikke desto mindre vil den eks-kongelige familie
kunne påberope seg beskyttelse under landets lov.
(ST1TN)
- Und er war doch noch da. (ME1)
- And after all he was still there. (ME1TE)
- Og han var jo fortsatt til. (ME1TN)
21Translation corpus with four languages
No-En-Fr-Ge
22Searching in No-En-Fr-Ge
- Jeg kommer til å si det til ham likevel. (KF1)
- Ich werde es ihm sowieso sagen. (KF1TD)
- I 'll tell him about it anyway. (KF1TE)
- De toute façon, je le lui dirai. (KF1TF)
- "You're going to have a book reissued (BHH1TE)
- Du skal få en bok trykt opp igjen ... (BHH1)
- "Ein Buch von dir wird neu aufgelegt, ...
(BHH1TD)
- Un de tes livres va être réédité ... (BHH1TF)
23Using the ENPC/OMC for research
- Particularly well suited for studies of lexis /
lexico-grammar (or phenomena that can take lexis
as their starting point)
- A broad range of phenomena have been (are being)
investigated, e.g. the use of individual verbs
(bli, få, take, give, see), modality, particular
syntactic constructions, connectives, sentence
openings and other discourse phenomena.
- The methodology is not tied to any particular
theoretical approach
- A range of theoretical approaches have been used,
e.g. SFL, cognitive linguistics, pattern grammar,
lexis-based approach à la Sinclair, traditional
grammar / basic linguistic theory.
24Limitations
- (As with corpus linguistics in general) you can
only search for something that is explicit in the
text
- Restricted to texts / text types that have been
translated
- The size of the corpus restricts studies of less
frequent lexical/grammatical constructions
- Faulty and less successful translations
- The corpus has been word-class tagged, but not
parsed (syntactically annotated), i.e. it is not
possible to search directly for grammatical
constructions, patterns of word order etc.
- Tagging errors
25Ways around the limitations?
- Identify typical (and searchable!) expressions of
a grammatical construction, e.g. presentatives,
clefting, phrasal verbs, inversion.
- Use a combination of word class tagging, filters
and wildcards. Example: tense/aspect,
participle clauses (e.g. BE + Ving).
- In any case: a lot of work involved in tidying
up the search results (precision).
- Possibility of searching with regular expressions
- Errors in the tagging: never possible to make
sure that you have found all the relevant
instances (recall).
- Errors/idiosyncrasies in the translation: weed
out? Ignore translations that occur only once, or
in only one text?
- Manual searches in running text, e.g. for Theme,
subjects.
- Supplement results of parallel corpus study with
(larger) monolingual corpora.
- Supplement corpus study with e.g. experimental
data.
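The precision and recall problems mentioned above can be made concrete with a rough regular-expression search for BE followed by a word in -ing. This is a sketch in plain Python over untagged text, not the OMC search interface; the pattern is an assumption, and the third test sentence is invented for the illustration (the first two are corpus examples quoted in this talk).

```python
import re

# A form of BE immediately followed by a word ending in -ing.
BE_VING = re.compile(
    r"\b(am|is|are|was|were|be|been|being)\s+(\w+ing)\b",
    re.IGNORECASE,
)

sentences = [
    "Neither of them knew what was going to happen.",  # relevant hit
    "What are we going to do, says Ruth,",             # missed: subject intervenes
    "That is nothing new.",                            # invented; irrelevant hit
]

hits = [(s, m.group(0)) for s in sentences for m in BE_VING.finditer(s)]
for sentence, match in hits:
    print(f"{match!r} in {sentence!r}")
```

The search returns "was going" (wanted) and "is nothing" (unwanted, a precision problem) but misses "are we going" (a recall problem), which is why the results need manual tidying and why completeness can never be guaranteed.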
26Examples of studies based on ENPC/ OMC / ESPC
- Bengt Altenberg: Work on adverbial connectors,
sentence openings, subject selection etc. in
English and Swedish.
- Karin Aijmer: Work on modality and discourse
markers in English and Swedish.
- Åke Viberg: Work on verbs of motion and cognition
in English and Swedish.
- Helge Dyvik: Translations as semantic mirrors;
ENPC as basis for bilingual wordnet.
- Jarle Ebeling (2000): Presentative constructions
in English and Norwegian: a corpus-based
contrastive study (PhD, University of Oslo)
- Mats Johansson (2002): Clefts in English and
Swedish: A contrastive study of IT-clefts and
WH-clefts in original texts and translations.
(PhD, Lund University)
- Signe Oksefjell Ebeling (2003): The Norwegian
verbs bli and få and their correspondences in
English: a corpus-based contrastive study (PhD,
University of Oslo)
27
- Berit Løken: Beyond modals: A corpus-based study
of English and Norwegian expressions of
possibility (PhD, Oslo, 2007)
- Lene Nordrum: English lexical nominalizations in
a Norwegian-Swedish contrastive perspective.
(PhD, Göteborg, 2007)
- Wiebke Ramm: Sentence boundary adjustments in
translation (German / Norwegian): Consequences on
information distribution and discourse structure
(PhD, Oslo, ongoing)
- Astrid Nome: Ongoing PhD work on connectors in
Norwegian and French. (Oslo)
- Cathrine Fabricius-Hansen et al.: Big Events,
Small Clauses. The Grammar of Elaboration.
(Forthcoming book with multiple authors and
multiple languages)
- Master theses (English, German, French) studying
individual verbs, syntactic constructions,
connectors, metaphor
28My own contrastive work
- 2009. A textual perspective on the pragmatic
markers in fact and faktisk. In S. Slembrouck,
M. Taverniers, M. Van Herreweghe (eds.) From will
to well: Studies in Linguistics offered to
Anne-Marie Simon-Vandenbergen. Ghent: Academia
Press.
- 2007. Using the ENPC and the ESPC as a parallel
translation corpus: adverbs of frequency and
usuality. Nordic Journal of English Studies 6(1),
http://ojs.ub.gu.se/ojs/index.php/njes/issue/view/6
- 2006. "Not now": on non-correspondence between
the cognate adverbs now and nå. In K. Aijmer &
A.-M. Simon-Vandenbergen (eds.) Pragmatic Markers
in Contrast. Elsevier, 93-114.
- 2005. Theme in Norwegian. In K.L. Berge, E.
Maagerø (eds.) Semiotics from the North: Nordic
Approaches to Systemic Functional Linguistics.
Oslo: Novus, 35-48.
- 2004. Spatial linking in English and Norwegian.
In K. Aijmer & H. Hasselgård (eds.) Translation
and Corpora. Göteborg: Acta Universitatis
Gothoburgensis, 163-188.
- 2004. Thematic choice in English and Norwegian.
Functions of Language 11(2), 187-212.
- 2000. English multiple Themes in translation. In
A. Klinge (ed.) Contrastive Studies in Syntax.
Special issue of Copenhagen Studies in Language,
Vol 25. Copenhagen: Samfundslitteratur, 11-38.
29Case study: be going to and komme til å (come
to)
- Future-referring expressions based on motion verb
+ infinitive
- Both described in grammars as common expressions,
though less common than expressions with English
will, Norwegian skal
30Meanings
- be going to
- future fulfilment of the present: present
intention or present cause (Quirk et al 1985)
- associated with present intention or arrangement;
was going to quite often has an implicature of
non-actualisation. (Huddleston & Pullum 2002)
- Two meanings: futurish, linked to a present
situation, and future tense, simply expressing
future time reference. (Declerck 2006)
- komme til å
- the speaker predicts what will happen based on
his knowledge at the moment of speaking (Faarlund
et al 1997)
- Past tense kom til å V: also "accidentally V" or
"was led to V" / "grew to V" (Vannebo 1979 and
Engelsk Stor Ordbok)
31Examples
- I know what he's going to say even before he says
it. (FW1)
- Jeg vet hva han kommer til å si selv før han sier
det. (FW1T)
- "I was going to wait until another time we met,
but I may as well tell you now. (AH1)
- Meningen var å vente til en annen gang, men jeg
kan like godt si det nå. (AH1T)
- Ingen av dem visste hva som kom til å skje.
(TTH1)
- Neither of them knew what was going to happen.
(TTH1T)
- Kanskje hun kom til å svelge dem ved et uhell?
(LSC1)
- Maybe she happened to swallow them by accident?
(LSC1T)
- Og siden ble det jeg som kom til å se mest til
henne. (EHA1)
- And then I became the one who ended up seeing her
most often. (EHA1T)
32be going to and komme til å in ENPC fiction (raw
frequencies)
33Preliminary observations
- Be going to is more common than komme til å in
original texts
- Be going to is more common in original texts
than in translations
- Komme til å is less common in original texts than
in translations
- I.e. translations in both directions can be
assumed to be coloured by the source texts.
- The frequency differences between originals and
translations (particularly with komme til å)
indicate that the two expressions can often be
used in the same contexts, but tend not to
be.
34Correspondences of be going to (percentages)
35Correspondences of komme til å (percentages)
36Correspondences
- The mutual correspondence between be going to and
komme til å is surprisingly low: 12.6%
- The correspondence is asymmetrical
- 15% of be going to are translated as komme til å
- 7% of komme til å are translated as be going to
- Komme til å has meanings not covered by be going
to (accidentally, grow to, be led to).
- The present cause/intention meaning works
differently for the two expressions; apparently
so do speaker certainty and non-actualisation.
- What are we going to do, says Ruth, (BV2T)
- Hva skal vi gjøre, sier Rut (BV2)
- Hun kommer bare til å bli redd." (THA1)
- She 'll only be frightened." (THA1T)
- "Are you going to run a hotel?" enquired
Frederick reasonably, (DL1)
- "Har dere tenkt å drive hotell?" spurte Frederick
fornuftig, (DL1T)
Uncertain outcome, no intentionality
Confident prediction speaker knowledge
Intention, but uncertain outcome
37
- Thus, in spite of shared meanings, English be
going to and Norwegian komme til å differ as
to
- The frequency with which the item is chosen
- The extent to which they compete with other
future-referring expressions
- The extent to which they convey confident
predictions, present intention and actualised
future in past.
- Some other explanations may be
- Translators in both directions tend to normalize
be going to / komme til å into a more common
future-referring expression (will/would + INF and
skal/skulle + INF). Will/would and skal/skulle are
also the most common sources of komme til å / be
going to
- Sometimes more lexically explicit forms have been
used to translate be going to / komme til å: ha
tenkt å / intend to (subject's intention), was to
(was led/destined to)
- Be going to may be needed for syntactic reasons,
as English modals lack non-finite forms and do
not show tense clearly.
- Norwegian modal auxiliaries are more flexible,
having non-finite and tensed forms → skal/skulle
+ INF fits into more syntactic environments than
will/would + INF
38The verb forms
39
- The present tense be going to occurs to a great
extent in direct speech.
- The meanings "accidentally do" and "grow to" /
"be led to" of komme til å occur mainly with the
past tense, the former also with modalisation.
- Hun kjenner at hun er søvnig, at hun kan komme
til å sovne mot fars jakke, hun vil ikke det.
(BV2)
- She feels that she is sleepy, that she might fall
asleep against father's jacket, but she doesn't
want to do that. (BV2T)
- og at den kvinnen jeg leter efter egentlig var
et barn den gangen hun kom til å bety noe for
meg. (FC1)
- and that the woman I'm searching for was really
a child when she came to mean something to me.
(FC1T)
40Some reflections on findings and further work
- The picture of correspondence is a complex one,
in spite of the rather similar descriptions in
grammars of be going to and komme til å.
- Syntactic differences between will/skal-future
expressions may go some way towards explaining
the difference in distribution.
- Correspondence types will have to be correlated
with tense forms.
- Subtle differences of meaning regarding speaker
certainty and present cause/intention come to the
surface when studying correspondences.
- Be going to is closer to a neutral future meaning
than komme til å, i.e. further grammaticalized as a
future tense.
41Summing up
- Parallel corpora enhance contrastive studies in a
number of ways
- by ensuring that observations are based on
authentic language use
- by yielding paradigms and patterns of
correspondences
- thus often revealing meanings and nuances we
might not have thought of
- and showing how the same meaning may be expressed
by means of different linguistic categories
- by providing quantitative data
- thus also giving insights into preferred ways
of putting things
- (if the corpus is bidirectional) by providing
control for translation bias
- (if the corpus is representative) by controlling
for the idiosyncrasies of individual
authors/translators
42Why undertake corpus-based contrastive
investigations?
- The importance of multilingual corpora extends
beyond contrastive studies. It is up to the user
to define fruitful research questions and use the
corpora creatively. In this process we learn not
only about individual languages and their
relationships, about translation and
foreign-language acquisition, but also about
language in general, provided that the study
becomes truly multilingual. Seeing through
corpora we can see through language.
Stig Johansson (2007: 316)
43Information on the OMC / ENPC
- About the corpora
- OMC: www.hf.uio.no/ilos/english/originalfiler/services/omc/
- ENPC: www.hf.uio.no/ilos/english/originalfiler/services/omc/enpc/
- www.helsinki.fi/varieng/CoRD/corpora/ENPC/
- About publications based on the OMC (up to 2006)
- www.hf.uio.no/ilos/forskning/prosjekter/sprik/english/publications/
44References
- Aijmer, K. & B. Altenberg. 1996. Introduction. In
K. Aijmer, B. Altenberg & M. Johansson (eds.)
Languages in Contrast. Lund University Press,
11-16.
- Altenberg, B. 1999. Adverbial connectors in
English and Swedish: Semantic and lexical
correspondences. In Hasselgård & Oksefjell (eds.)
Out of Corpora. Amsterdam: Rodopi, 249-268.
- Berglund, Y. 2005. Expressions of Future in
Present-day English. A Corpus-based Approach.
Uppsala University.
- Chesterman, A. 1998. Contrastive Functional
Analysis. Amsterdam/Philadelphia: John Benjamins
Publishing Company.
- Declerck, R. 2006. The Grammar of the English
Verb Phrase, Vol. 1. Berlin: Mouton de Gruyter.
- Faarlund, J. T., S. Lie & K. I. Vannebo. 1997.
Norsk Referansegrammatikk. Oslo:
Universitetsforlaget.
- Huddleston, R. & G. K. Pullum. 2002. The
Cambridge Grammar of the English Language.
Cambridge: Cambridge University Press.
- James, C. 1980. Contrastive Analysis. London:
Longman.
- Johansson, S. 1999. Corpora and contrastive
studies. In P. Pietilä & O-P. Salo (eds.)
Multiple Languages - Multiple Perspectives.
AFinLA Yearbook 1999 / No. 57, 116-125.
- Johansson, S. 2007. Seeing through multilingual
corpora. Amsterdam: Benjamins.
- Quirk, R., S. Greenbaum, G. Leech & J. Svartvik.
1985. A Comprehensive Grammar of the English
Language. London: Longman.
- Vannebo, K. I. 1979. Tempus og tidsreferanse.
Oslo: Novus.