1 Aims
The GerManC project involves the compilation of a
representative corpus of German texts for the
period 1650-1800. It is designed to parallel
historical corpora of English (e.g. ARCHER,
Helsinki) for this period in order to facilitate
comparative diachronic study of the two
languages.
Design
The corpus will consist of 2000-word extracts
from eight text types:
  orally-oriented: drama, newspapers, sermons,
  letters
  print-oriented: narrative prose, academic texts,
  medical texts, legal texts
To ensure representativeness there will be an
equal number of extracts from:
  three sub-periods: 1650-1700, 1701-1750,
  1751-1800
  five regions: North, West Central, East Central,
  South-West, South-East
This will result in a corpus of about
800,000 words and will be the first
representative corpus of German for this period.
It will further the diachronic study of the
development of German syntax and lexis in the
early modern period, and also provide material
for investigating the process of standardization
in German. Regional representativeness is vital
for this, as these 150 years saw the decline of
local linguistic norms and the emergence of a
supraregional standard accepted throughout the
Holy Roman Empire.
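The sampling design can be written out as a simple cross product of text type, sub-period and region. The sketch below is illustrative only. The number of extracts per cell is not stated for the full corpus, so the value 3 is an assumption read off the pilot (45 newspaper extracts across 15 period-region cells). The function names are hypothetical.

```python
from itertools import product

TEXT_TYPES = {
    "orally-oriented": ["drama", "newspapers", "sermons", "letters"],
    "print-oriented": ["narrative prose", "academic texts",
                       "medical texts", "legal texts"],
}
PERIODS = ["1650-1700", "1701-1750", "1751-1800"]
REGIONS = ["North", "West Central", "East Central",
           "South-West", "South-East"]
WORDS_PER_EXTRACT = 2000


def design_cells():
    """Return every (text type, period, region) combination in the design."""
    types = [t for group in TEXT_TYPES.values() for t in group]
    return list(product(types, PERIODS, REGIONS))


def corpus_size(extracts_per_cell):
    """Total word count if every cell holds the same number of extracts."""
    return len(design_cells()) * extracts_per_cell * WORDS_PER_EXTRACT


print(len(design_cells()))  # 8 text types x 3 periods x 5 regions = 120
print(corpus_size(3))       # assumed 3 extracts per cell -> 720000
```

Under that assumption the total is 720,000 words, of the same order as the figure of about 800,000 given above.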
6. Analytical tools
A major objective was to adapt and develop
programs for tagging and lemmatizing the corpus.
The difficulties which have to be overcome for
this are:
(a) Orthographic variation in a pre-standardized
language variety
(b) The morphological structure of early modern
German, with much lexeme-dependent allomorphy and
the prevalence of vowel changes as well as
affixes to mark morphosyntactic categories.
We adapted the Stuttgart-Tübingen tagset, and
this produced good results, with some 80% of word
forms tagged and lemmatized accurately. The
orthographic variation
was found to be relatively systematic, with each
variable tending to have a discrete set of
variants. These significant regularities could be
exploited in order to automate assigning basic
leading forms for specific variants for each
text, with a stoplist of exceptions. In this way
we developed programs to normalize variant
spellings, capturing the relationship between the
variants and a standardized form and establishing
an overall lexicon of variant forms for each
lemma. This is a significant improvement on
existing corpus tools, which tended to treat each
variant separately, necessitating manual matching
of variant spellings to normalized forms.
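The idea of systematic rewrite rules combined with a stoplist of exceptions can be sketched as follows. This is a minimal illustration, not the project's programs: the rules, the stoplist entries and the function names are all assumptions chosen for the example, and the lexicon is keyed by normalized spelling as a stand-in for the lemma.

```python
import re
from collections import defaultdict

# Hypothetical rewrite rules (pattern, replacement); illustrative only.
RULES = [
    (re.compile(r"^v(?=n)"), "u"),  # vnd -> und
    (re.compile(r"th"), "t"),       # thun -> tun, Theil -> Teil
    (re.compile(r"ey"), "ei"),      # seyn -> sein
    (re.compile(r"ff$"), "f"),      # auff -> auf
]

# Hypothetical stoplist: forms the rules must leave untouched.
STOPLIST = {"thron", "schiff"}


def normalize(token):
    """Map a variant spelling to a normalized form, respecting the stoplist."""
    key = token.lower()
    if key in STOPLIST:
        return token
    form = key
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    # Restore an initial capital if the variant had one.
    return form.capitalize() if token[:1].isupper() else form


def build_lexicon(tokens):
    """Group the attested variants under their (lower-cased) normalized form."""
    lexicon = defaultdict(set)
    for token in tokens:
        lexicon[normalize(token).lower()].add(token)
    return {form: sorted(variants) for form, variants in lexicon.items()}


print(normalize("Theil"))                     # Teil
print(normalize("Schiff"))                    # Schiff (stoplist exception)
print(build_lexicon(["vnd", "und", "Vnd"]))   # {'und': ['Vnd', 'und', 'vnd']}
```

In practice the rule set would presumably be tuned for each text, since the passage above describes assigning leading forms for specific variants text by text.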
7. Application
A number of further programs were
developed for use with the corpus, e.g. to
generate frequency lists for word forms with
lists of the first and last occurrences of all
word forms and of all forms with unique
occurrence. A concordance program allows one to
search for words or patterns (e.g. all words
ending in -keit) and to show these in context.
Another program allows the search for particular
tag sequences. Thus by searching for sequences of
determiner + adjective + noun it has been
possible to generate lists to show the inflection
of adjectives within the noun phrase. In the
nominative/accusative plural this was subject to
considerable variation at this time, and the
corpus shows the gradual elimination of one
variant, leaving only the one which was
eventually adopted into the standard language.
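A pattern search in context and a tag-sequence search of the kind described above might look like the sketch below. It is an illustration under stated assumptions, not the project's software: the function names are hypothetical, the sample sentence is invented, and the tags are hand-assigned STTS-style labels, with ART standing in for the wider class of determiners.

```python
import re


def concordance(tokens, pattern, width=3):
    """Yield (left context, hit, right context) for tokens matching a regex."""
    regex = re.compile(pattern)
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            yield left, token, right


def find_tag_sequences(tagged, tags):
    """Return the words of every run of (word, tag) pairs matching `tags`."""
    n = len(tags)
    hits = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if [tag for _, tag in window] == list(tags):
            hits.append([word for word, _ in window])
    return hits


# Invented sample with hand-assigned STTS-style tags (assumption).
tagged = [("die", "ART"), ("gute", "ADJA"), ("Leute", "NN"),
          ("loben", "VVFIN"), ("die", "ART"), ("grosse", "ADJA"),
          ("Freundlichkeit", "NN")]
words = [word for word, _ in tagged]

# All words ending in -keit, shown in context.
print(list(concordance(words, r".*keit")))
# Determiner + adjective + noun sequences.
print(find_tag_sequences(tagged, ("ART", "ADJA", "NN")))
```

Collecting the adjective from each hit would then give the raw material for a list of adjective endings by period and region.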
8. Further developments
In the course of the
proposed extended project, with the compilation
of the complete corpus of 800,000 words, it is
intended that further tools should be developed,
in particular to parse the corpus. Given the
complexity of German syntax at this period, this
presents a considerable challenge. In this
context it would also be desirable to identify
not only the part-of-speech of each word-form,
but also its morphosyntactic properties. A start
has been made with a program which identifies
singular and plural nouns and their cases with a
reasonable degree of accuracy (ca. 75%).
GerManC
an annotated, spatialised, multi-genre corpus of
Early Modern German
Martin Durrell, Astrid Ensslin, Paul Bennett
Methods
stage 1 - digitization
For the pilot project 45 extracts from German
newspapers of this period were digitized by
double-keying, i.e. entered independently by two
people and the results compared and checked
against the original to eliminate mistakes.
Scanning (apart from being potentially more prone
to error) was not feasible, as there is no
reliable OCR program for black letter (Gothic)
typefaces.
stage 2 - annotation
The corpus was then annotated
according to the standards of the Text Encoding
Initiative (TEI). Each text was supplied with
administrative metadata (header information,
etc.) and marked for significant textual features
using the TEI tagset. The TEI conventions were
applied rigorously, and as this corpus consists
of newspapers with a wealth of relevant detail it
required a very intensive level of annotation.
It was marked for loan words, passages in
languages other than German, proper names (of
places, people, organizations etc.), numbers,
dates, times, abbreviations with expansions,
special characters and other diacritics,
illustrations and text decorations and any
formatting conventions. Exchanger XML was used
as editing software, and CLaRK for automatic
conformance checking in line with TEI P5
standards. Each stage of corpus construction and
annotation was documented in detail and any
deviations from and modifications of existing TEI
standards were noted and accounted for.
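The comparison step in the double-keying procedure of stage 1 can be illustrated with a token-level diff of the two independent transcriptions. This is only a sketch using Python's standard difflib module. The function name and the sample line are invented for the example, and every discrepancy it reports would still have to be resolved by a person against the original print.

```python
import difflib


def keying_discrepancies(first, second):
    """List the token spans where two independent transcriptions differ.

    Each result is (token index in first keying, first reading, second reading).
    """
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    issues = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            issues.append((i1, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return issues


# Invented sample line, keyed twice with one disagreement.
keyer_a = "Auß Wien vom 12. Martii"
keyer_b = "Auß Wien vom 12. Martij"
print(keying_discrepancies(keyer_a, keyer_b))  # [(4, 'Martii', 'Martij')]
```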
Analytical tools
A major objective was to
develop programs for tagging and lemmatizing the
corpus. The Stuttgart-Tübingen tagset was
adapted and this produced good results, with some
80% of word forms tagged and lemmatized
accurately. Significant regularities could be
exploited to automate assigning basic leading
forms for specific variants for each text.
Programs were developed to normalize variant
spellings, capturing the relationship between the
variants and a standardized form and establishing
an overall lexicon of variant forms for each
lemma.
Application
Further programs were developed, e.g. to allow
searches for particular tag sequences. Thus, by
searching for sequences of determiner +
adjective + noun, lists can be generated to show
the inflection of adjectives within the noun
phrase. This was subject to considerable
variation at this time, and the corpus shows the
elimination of one variant to leave only the one
which was eventually adopted into the standard
language.
Further developments
In the proposed
extended project, with the compilation of the
complete corpus of 800,000 words, further tools
will be developed, in particular to parse the
corpus. It would also be desirable to
identify the morphosyntactic properties of each
word-form. A start has been made with a program
identifying singular and plural nouns and their
cases with a reasonable degree of accuracy (ca.
75%).
Pilot
The project was piloted by the compilation of a
corpus of 100,000 words from one text type
(newspapers) with this design, i.e. with an equal
number of texts from the three sub-periods and
five regions. This was
completed with an ESRC grant (RES-000-22-1609)
between March 2006 and March 2007. A bid for
funding of the complete project, which will
include the other text types, is currently
awaiting decision.