Title: Geocoding Multilingual Texts: Recognition, Disambiguation and Visualisation
1Multilingual and cross-lingual topic detection and tracking
SEPLN 2007, Sevilla, Spain, 12 September 2007
Ralf Steinberger with Bruno Pouliquen, Erik van der Goot, Olivier Deguernel, Camelia Ignat
European Commission Joint Research Centre (JRC)
http://langtech.jrc.it/
http://press.jrc.it/NewsExplorer
2DG JRC - Who we are
3Agenda
- Background
- Introduction to the terms Topic Detection (TD) and Topic Tracking (TT) → TDT
- Known approaches to cross-lingual document similarity calculation
- NewsBrief: multi-monolingual live clustering (TDT) (http://press.jrc.it/)
- Demo of NewsExplorer (http://press.jrc.it/NewsExplorer/)
- Technical details on how NewsExplorer works
- NER: person and organisation names
- NER: recognition and disambiguation of geographical references
- Multilingual subject domain categorisation
- NewsExplorer: multi-monolingual daily clustering (TDT)
- Cross-lingual cluster linking
- Conclusion
4Background
5Topic Detection and Tracking (TDT) - Background
- US-American DARPA program TDT (1997-2004).
- TDT refers to automatic techniques for locating topically related material in streams of data such as newswire and broadcast news. (Wayne 2000)
- Topic: e.g. the Oklahoma City bombing in 1995, incl. memorial services, investigations, prosecution, etc.
- Topic ≠ category (bombing)!
- Since 2000 part of Translingual Information Detection, Extraction, and Summarization (TIDES).
- The goal of the TIDES program is to enable English-speaking users to access, correlate, and interpret multilingual sources of real-time information and to share the essence of this information with collaborators. (English, Chinese, Arabic, with some research on Korean and Spanish)
- TDT and TIDES explanations and images borrowed from http://www.nist.gov/speech/tests/tdt/tides.htm
65 TDT Sub-tasks (each formally evaluated)
7TDT summary of approaches (Wayne 2000)
- Means used by participants
- Stop words, stemming, TF.IDF weighting
- Using single documents or clusters of documents
- (Incremental) vector space models
- Using cosine or Okapi, k-NN clustering
- Various types of normalisation across sources, languages and topics
- Cross-lingual topic tracking
- Chinese to English, later also Arabic to English → focus on English as target language
- Chinese Machine Translation (MT, Systran) results were given to participants
- Participants could use other means instead; some experimented with bilingual dictionaries.
8Insights from past research
- TDT Program Insights (Wayne 2000)
- TDT techniques can work well in languages very different from English: similar performance for monolingual Chinese and English
- Lower performance for cross-lingual tracking (performance impacted by translation errors)
- Making use of named entities (people, organisations, locations) helped (Chen & Ku 2002).
- Larkey et al. (2004), native language hypothesis: Topic Tracking works better in the original language than in (MT-)translated collections.
9Approaches to cross-lingual document similarity
calculation (1)
- How to find out whether two texts in different languages are related?
- Most common approach (until today): use MT or bilingual dictionaries to translate into English, then use monolingual methods to calculate similarity.
- Using MT (e.g. Leek et al. 1999 for Chinese-Mandarin to English): 50% performance loss when using MT
- Using bilingual dictionaries (e.g. Wactlar 1999 for Serbo-Croatian to English; Urizar & Loinaz 2007 for Basque, Spanish and English)
- In TDT 1999, the better results were achieved using MT
10Approaches to cross-lingual document similarity
calculation (2)
- Automatically produce bilingual lexical space for bilingual document representation and document similarity calculation, e.g.
- Bilingual Latent Semantic Analysis (LSA) (Landauer & Littman 1991)
- Kernel Canonical Correlation Analysis (KCCA) (Vinokourov et al., 2002)
- Achieved results are relatively good
- Bilingual approach is restricted to a few languages (OK for English as target lang.)
- Language pairs: (N² - N) / 2 (N = number of languages)
- EU: 22 official languages → 231 language pairs (462 language pair directions)!
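The pair count above can be checked with a few lines of Python. This is only an arithmetic illustration; the function names are made up for this example.

```python
def language_pairs(n_languages):
    """Number of unordered language pairs: (N^2 - N) / 2."""
    return (n_languages * n_languages - n_languages) // 2


def language_pair_directions(n_languages):
    """Each unordered pair can be used in two translation directions."""
    return 2 * language_pairs(n_languages)


print(language_pairs(22), language_pair_directions(22))
```

The quadratic growth is the reason why pair-specific resources become hard to maintain as the number of languages rises.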
11Approaches to cross-lingual document similarity
calculation (3)
- Alternative: use entities as anchors
- Names of persons and organisations
- Names of locations
- Units of measurement
- Time
- Speed
- Temperature
- Acceleration
- Multilingual specialist dictionaries (MeSH for medicine, etc.)
- Normalise these expressions
- → Use as a kind of interlingua: no language pair-specific resource needed
- Steinberger Ralf, Pouliquen Bruno & Camelia Ignat (2004). Providing cross-lingual information access with knowledge-poor methods. Informatica 28-4, pp. 415-423.
12The EMM news data
13Europe Media Monitor (EMM) News Aggregation
- Best et al. 2005
- External site (http://press.jrc.it/)
- Scrapes 1000 news portals world-wide for new news articles
- Up to every 10 minutes
- Standardises input format (UTF-8-encoded RSS format)
- 35,000 news articles per day
- Articles in 34 languages
- 3 public systems
- NewsBrief
- Medical Information System MedISys
- NewsExplorer
14EMM NewsBrief
- Best et al. 2005
- Public site: http://press.jrc.it/
- Uses all EMM news data
- Categorises news into 600 categories, using Boolean search word combinations (plus optional weights plus vicinity operators)
- Clusters and tracks news live (multi-monolingually)
- Sends out email notifications for each category
- Detects breaking news
- Short-term story tracking
15EMM Medical Information System (MedISys)
- Fuart et al. 2007; Yangarber et al. 2007
- Public site: http://medusa.jrc.it/
- Uses all EMM news data
- Selects articles of relevance to Public Health (diseases, symptoms, health organisations)
- Categorises news into 250 categories, using EMM functionality
- Detects breaking news for each category and country
- All other EMM functionality
16MedISys Automatic Email Alert
17EMM NewsExplorer
- Steinberger et al. 2005
- http://press.jrc.it/NewsExplorer
- Uses public EMM news data
- Publicly accessible news aggregation and analysis system
- Clusters related news once per day in 19 languages
- Links clusters over time and across languages → event time lines
- Extracts references to locations, persons, and other entities
- Collects historical information about named entities across languages
18Multilingual Clustering & Tracking
19NewsBrief live clustering (multi-monolingual)
- Multilingual, language-independent algorithm
- Live clustering of incoming news every ten minutes (Topic Detection)
- All articles that fall in a sliding 4-hour window (up to 8 hours when < 200 articles)
- Using 100 stop words (most frequent words among last 50,000 articles)
- No word normalisation
- Document representation: word frequency (except stop words) of first 200 words only
- Similarity measure: cosine
- Hierarchical clustering with group-averaging
- Minimum size: two non-identical documents from two different sources
- Similarity threshold: 0.6 or 0.8 or 0.9, depending on vector sparseness (En: 0.6)
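The document representation and similarity measure listed above can be sketched as follows. This is a minimal illustration, not the NewsBrief code: the tokeniser, the choice to cut to 200 tokens before removing stop words, and all function names are assumptions.

```python
import math
import re
from collections import Counter


def document_vector(text, stop_words, max_words=200):
    """Word-frequency vector over the first `max_words` tokens, minus stop words.

    No stemming or other word normalisation is applied.
    """
    tokens = re.findall(r"\w+", text)[:max_words]
    return Counter(tok for tok in tokens if tok not in stop_words)


def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


stop_words = {"the", "a"}
doc_a = document_vector("the jury reached a verdict", stop_words)
doc_b = document_vector("the jury reached a verdict today", stop_words)
print(cosine(doc_a, doc_b))
```

Two articles would then be candidates for the same cluster when their cosine exceeds the language-dependent threshold (0.6 for English according to the slide).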
20NewsBrief live clustering (multi-monolingual) (2)
- Cluster linking (into a story)
- Link clusters if at least 10% overlap of articles
- Inherit articles that would fall out due to window time constraint, or due to shifting word-base
- Story ends if no new articles in window
- Longest stories last a few days and have a few hundred articles
- Big stories have 50 current articles plus inherited ones
- Stories can merge (inherit from both previous clusters)
- Stories cannot currently split
- Approx. 75 English clusters for a 24-hour period
21NewsBrief live clustering (multi-monolingual) (3)
- Story finalisation
- If cluster is not linked in current window
- Breaking news detection (red dots → email alerts)
- If story is less than 1 hour old
- If at least 3 articles from 3 different sources
- If at least 75% of articles from the last hour came in during the last 30 minutes
- Breaking news update (red dots)
- If story is older than 1 hour
- If at least 3 articles from 3 different sources
- If at least 75% of articles from the last 60 minutes came in during the last 30 minutes
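The breaking-news conditions above can be expressed as a small rule check. This is a sketch under assumptions: articles are modelled as (timestamp, source) pairs, the "3 articles from 3 different sources" test is applied to the whole story, and the function name and return labels are invented for this example.

```python
from datetime import datetime, timedelta


def breaking_news_status(articles, story_start, now):
    """Return "breaking", "update" or None for a story.

    articles: list of (timestamp, source) tuples belonging to the story.
    """
    sources = {source for _, source in articles}
    if len(articles) < 3 or len(sources) < 3:
        return None
    last_hour = [t for t, _ in articles if now - t <= timedelta(hours=1)]
    last_half_hour = [t for t in last_hour if now - t <= timedelta(minutes=30)]
    # At least 75% of the last hour's articles must be from the last 30 minutes.
    if not last_hour or len(last_half_hour) / len(last_hour) < 0.75:
        return None
    if now - story_start < timedelta(hours=1):
        return "breaking"
    return "update"


now = datetime(2007, 9, 12, 12, 0)
story = [
    (now - timedelta(minutes=10), "source-a"),
    (now - timedelta(minutes=20), "source-b"),
    (now - timedelta(minutes=25), "source-c"),
    (now - timedelta(minutes=50), "source-a"),
]
print(breaking_news_status(story, now - timedelta(minutes=50), now))
```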
22NewsBrief live clustering (4) Observations
- Clustering instead of dealing with single stories (first story detection, topic tracking, ...)
- Language bias: articles sometimes cluster by country (UK vs. US, NL vs. BE)
- Stories in languages other than world languages are likely to die overnight (En, It, Es)
- Language-independent ? multilingual
- Interesting cross-lingual comparisons
23EMM NewsExplorer Live Demo
- http://press.jrc.it/NewsExplorer
24NewsExplorer Technical details
25NewsExplorer - Cross-lingual cluster linking
- Language-independent features for multilingual document representation
- No MT or bilingual dictionaries
- 19 languages
- Sim1 (40%): Multilingual Eurovoc subject domains
- Sim2 (30%): Geo-locations
- Sim3 (20%): Name variants
- Sim4 (10%): Cognates and numbers
26NER: PER & ORG
27Multilingual name recognition and variant merging
28NER: Known person & organisation names
- Lookup of known names from database
- Currently over 630,000 names
- 135,000 variants
- Only 50,000 have been found in five different clusters or more
- Pre-generate morphological variants (Slovene example)
- Tony(a|o|u|om|em|m|ju|jem|ja)?\sBlair(a|o|u|om|em|m|ju|jem|ja)
Live name variants
29NER: New person & organisation names (1)
- Guessing names using empirically-derived lexical patterns
- Identification of about 450 unknown names per day
- 50 of those are automatically merged with known names
- Trigger word(s) + Uppercase Words (+ name particles: von, van, de la, abu, bin, ...)
- President, Minister, Head of State, Sir, American
- death of, [0-9]+-year-old, ...
- Known first names (John, Jean, Hans, Giovanni, Johan, ...)
- Combinations: 56-year-old former prime minister Kurmanbek Bakiyev
- Use bootstrapping to produce a trigger word list for a new language
- Small initial trigger word list
- Produce frequency list of contexts of known names
- Manual selection
For details, see Steinberger & Pouliquen (LI 30.1, 2007)
30NER Most frequent trigger words across languages
31NER Inflection of trigger words
- Inflection of trigger words for person names, using regular expressions (Slovene example)
- kandidat(a|u|om)?
- legend(a|e|i|o)
- milijarder(ja|ju|jem)?
- predsednik(a|u|om|em)?
- predsednic(a|e|i|o)
- ministric(a|e|i|o)
- sekretar(ja|ju|jom|jem)?
- diktator(ja|ju|jem)?
- playboy(a|u|om|em)?
+ uppercase words
verskega voditelja Moktade al Sadra je z notranjim → Muqtada al-Sadr (ID236)
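The "trigger word + uppercase words" pattern can be sketched with Python's `re` module. The trigger list below is a small excerpt of the Slovene patterns on the slide; the uppercase-word pattern, the particle list, the function name and the test sentence are illustrative assumptions rather than the system's actual rules.

```python
import re

# Excerpt of inflected trigger-word patterns (Slovene example).
TRIGGER_PATTERNS = [
    r"kandidat(?:a|u|om)?",
    r"predsednik(?:a|u|om|em)?",
    r"ministric(?:a|e|i|o)",
    r"diktator(?:ja|ju|jem)?",
]
UPPERCASE_WORD = r"[A-ZČŠŽ][\w-]+"       # assumed definition of an uppercase word
NAME_PARTICLE = r"(?:al|bin|abu|von|van|de)"  # assumed particle excerpt

NAME_GUESS = re.compile(
    r"\b(?:" + "|".join(TRIGGER_PATTERNS) + r")\s+"
    r"(" + UPPERCASE_WORD
    + r"(?:\s+(?:" + NAME_PARTICLE + r"\s+)?" + UPPERCASE_WORD + r")*)"
)


def guess_names(text):
    """Return uppercase word sequences that directly follow a trigger word."""
    return [match.group(1) for match in NAME_GUESS.finditer(text)]


# Made-up test sentence in the spirit of the slide example.
print(guess_names("izjava predsednika Moktade al Sadra je bila"))
```

A guessed string such as "Moktade al Sadra" would still need to be matched against known variants before it can be linked to an existing name entry.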
32Name Variants
- Adding names from web sources
- Merging NewsExplorer name variants
- Transliteration
- Normalisation
- Similarity measure
33Name Recognition - Evaluation results (2005)
For details, see Steinberger & Pouliquen (LI 30.1, 2007)
34Name transliteration
- Currently, EMM NewsExplorer transliterates from Arabic, Farsi, Greek, Russian and Bulgarian
- Transliterate each character, or sequence of characters, by a Latin correspondent
- ψ → ps
- λ → l
- μπ → b
- Hard-code some common transliterations: Джордж (DJORDJ) → "George", Джеймс → "James", ...
- Examples of transliterations
- Κόφι Ανάν, Greek → Kofi Anan
- Кофи Аннан, Russian → Kofi Anan
- Кофи Анан, Bulgarian → Kofi Anan
- كوفي عنان, Arabic → Kufi Anan
- कोफ़ी अन्नान, Hindi → Kofi Anan
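Character-by-character transliteration with multi-character sequences can be sketched as a longest-match table lookup. The table below is a tiny hypothetical excerpt for Greek, just large enough for the example; stripping accents before the lookup and capitalising the result are assumptions made for this sketch.

```python
import unicodedata

# Hypothetical excerpt of a Greek-to-Latin table; multi-character keys win.
GREEK_TABLE = {
    "μπ": "b", "ψ": "ps", "λ": "l", "κ": "k", "ο": "o", "φ": "f",
    "ι": "i", "α": "a", "ν": "n", "μ": "m", "π": "p",
}
MAX_KEY_LENGTH = max(len(key) for key in GREEK_TABLE)


def strip_marks(text):
    """Remove combining marks (accents) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def transliterate(name, table=GREEK_TABLE):
    """Greedy longest-match transliteration; unknown characters pass through."""
    source = strip_marks(name).lower()
    output = []
    i = 0
    while i < len(source):
        for size in range(MAX_KEY_LENGTH, 0, -1):
            chunk = source[i:i + size]
            if chunk in table:
                output.append(table[chunk])
                i += len(chunk)
                break
        else:
            output.append(source[i])
            i += 1
    return "".join(output).title()


print(transliterate("Κόφι Ανάν"))
```

Trying the longest key first is what lets a sequence such as μπ map to "b" instead of "mp".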
35Name normalisation: Why?
- Transliteration rules depend on the target language, e.g.
- Владимир Устинов (Russian)
- Vladimir Ustinov (English)
- Wladimir Ustinow (German)
- Vladimir Oustinov (French)
- Various ways to represent the same sound: sh, sch, ch, š, e.g.
- Bašar al Assad
- Baschar al Assad
- Bachar al Assad
- Diacritics, e.g.
- Wałęsa → Walesa
- Saïd → Said
- Schröder → Schroder
- → Edit distance is large for naturally occurring word variants
- "Rafik Harriri" vs. "Rafiq Hariri" → 2
- "Rfk Hrr" vs. "Rafiq Hariri" → 6
36Name normalisation (2) - 30 Rules
- Latin normalisation: Malik al-Saïdoullaïev
- accented character → non-accented equivalent: Malik al-Saidoullaiev
- double consonant → single consonant: Malik al-Saidoulaiev
- ou → u: Malik al-Saidulaiev
- al- → (removed): Malik Saidulaiev
- wl (beginning of name) → vl
- ow (end of name) → ov
- ck → k
- ph → f
- ? j
- ? sh
- x → ks
- Remove vowels
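A handful of the normalisation rules can be sketched in Python. This covers only the rules that are legible on the slide, applied in the listed order; lowercasing, the explicit mapping for ł (which has no Unicode decomposition), and the function names are assumptions of this sketch.

```python
import re
import unicodedata

# Letters that do not decompose under NFD need an explicit mapping (assumed excerpt).
EXTRA_MAP = str.maketrans({"ł": "l", "Ł": "L"})


def strip_accents(text):
    """Replace accented characters with their non-accented equivalents."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA_MAP))
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalise_name(name):
    """Apply a subset of the normalisation rules, in slide order."""
    s = strip_accents(name).lower()
    s = re.sub(r"([b-df-hj-np-tv-z])\1", r"\1", s)  # double consonant -> single
    s = s.replace("ou", "u")
    s = re.sub(r"\bal-", "", s)                      # drop the particle "al-"
    s = re.sub(r"\bwl", "vl", s)                     # wl at the beginning of a name
    s = re.sub(r"ow\b", "ov", s)                     # ow at the end of a name
    return s.replace("ck", "k").replace("ph", "f").replace("x", "ks")


def consonant_skeleton(normalised):
    """Remove vowels from an already normalised name."""
    return re.sub(r"[aeiou]", "", normalised)


print(normalise_name("Malik al-Saïdoullaïev"))
print(normalise_name("Wladimir Ustinow"), "|", normalise_name("Vladimir Oustinov"))
print(consonant_skeleton(normalise_name("Rafik Harriri")))
```

The German and French spellings of the same Russian name collapse to one string, which is the point of normalising before comparing.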
37Similarity measure for name merging
- To compare 450 new names every day with 800,000 known name variants
- Only if the transliterated, normalised form with vowels removed is identical
- Calculate edit distance variant similarity using two different representations
[Figure labels: 20%, 80%, Condition]
38Merging name variants: some results
- Threshold: 0.94 (100% precision in test set)
- NewsExplorer: 450 new names every day
- 50 are automatically merged (11%)
- 42 are saved for expert judgment (9%)
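The merging decision can be sketched as a weighted combination of two edit-distance similarities behind a consonant-skeleton check. The slides give the weights 20 and 80 and the threshold 0.94 but not which representation receives which weight, nor how distance is turned into similarity; both are assumptions here, as are all function names.

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed conversion: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b), 1)
    return 1.0 - edit_distance(a, b) / longest


def variant_similarity(first_pair, second_pair, weights=(0.8, 0.2)):
    """Weighted similarity over two representations of the same name pair."""
    return weights[0] * similarity(*first_pair) + weights[1] * similarity(*second_pair)


def should_merge(skeleton_a, skeleton_b, first_pair, second_pair, threshold=0.94):
    """Compare only names whose vowel-free normalised forms are identical."""
    if skeleton_a != skeleton_b:
        return False
    return variant_similarity(first_pair, second_pair) >= threshold


print(edit_distance("Rafik Harriri", "Rafiq Hariri"))
```

The skeleton check acts as a cheap filter, so the quadratic edit-distance computation only runs on a small fraction of the known variants.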
39Person name recognition and variant merging: Result
[Figure labels: Name variants; Trigger words; live]
40Geo-coding
41Previous work
- Aim: multilingual text to unique identifier
- MUC-7, etc.: identify and classify (PER vs. LOC)
- Leidner (2007)
- Geo-CLEF
- For toponym disambiguation, people work on 1-2 languages
- MetaCarta (http://www.metacarta.com): commercial, English only
- Mikheev (1999): gazetteer needed
For details, see Pouliquen et al. (LREC 2006)
42NER Geographical Locations
- Aim: multilingual text to map
- Major challenges
- Solution: procedure using 6 different heuristics, and their interaction
- Evaluation and results
43Major challenges for geo-coding (1)
- Place homographs with common words
44Major challenges for geo-coding (2)
- Place homographs with people's first and last names
45Major challenges for geo-coding (3)
46Major challenges for geo-coding (4)
- Completeness of gazetteer: multilinguality (exonyms), endonyms, historical variants, e.g.
- Санкт-Петербург, Saint Petersburg, Saint Pétersbourg, ...
- Leningrad, Petrograd, ...
- Morphological variation / Inflection
- Romanian: Parisului (of Paris)
- Estonian: Londonit (London), New Yorgile (New York)
- Arabic: albaRiziun (the Paris inhabitants)
47Proposed solutions Multilingual gazetteer
- We combined three different sources
- Global Discovery database of place names (GeoNet)
- > 500,000 place names
- In English and in local language (but in Roman script)
- Contains 6 size classes
- KNAB database of exonyms and historical variants
- Institute of the Estonian Language
- Venezia, Venice, Venise, Venedig,
- Istanbul, Constantinople, Istamboul, Istanbul,
- European Commission-internal document
- in the 11 languages of the pre-Enlargement EU
- Country name
- Capital name
- Inhabitant name
- Currency name
- Country adjective
48Proposed solutions morphological variation
- English: London
- Finnish: Lontoo (nominative case)
- Lontoossa (in London)
- Lontoosta (from London)
- Lontoon (London's)
- Lontoolaisen (Londoner, of London)
- Lontooseen (to London)
- 3 Options
- Use morphological analyser software
- Pre-generate all inflection forms, using suffix replacement rules
- Strip/Replace suffixes of uppercase words not found in gazetteer and check again
- e.g. Finnish Lontoosta → Lontoo
Tony(a|o|u|om|em|m|ju|jem|ja)?\sBlair(a|o|u|om|em|m|ju|jem|ja)
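The third option (strip or replace suffixes of uppercase words not found in the gazetteer, then look again) can be sketched as follows. The gazetteer contents and the Finnish suffix rules are a toy excerpt invented for this example, not the rules used in the system.

```python
# Toy gazetteer: local form -> canonical place name (illustrative only).
GAZETTEER = {"Lontoo": "London", "Pariisi": "Paris"}

# Hypothetical Finnish suffix rules: (suffix to strip, replacement).
SUFFIX_RULES = [("ssa", ""), ("sta", ""), ("seen", ""), ("n", "")]


def lookup_place(word):
    """Look up an uppercase word, retrying with suffixes stripped or replaced."""
    if not word[:1].isupper():
        return None
    if word in GAZETTEER:
        return GAZETTEER[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)] + replacement
            if candidate in GAZETTEER:
                return GAZETTEER[candidate]
    return None


for form in ("Lontoo", "Lontoossa", "Lontoosta", "Lontoon", "Lontooseen"):
    print(form, "->", lookup_place(form))
```

Compared with pre-generating every inflected form, this keeps the gazetteer small at the cost of a few extra lookups per unknown uppercase word.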
49Proposed solutions Disambiguation heuristics (1)
- Language-independent heuristics
- Using language-specific resources
- 2 Binary filters
- Several preference rules
- Formula that combines them all
- Geo-stop words
- 5489 English Geo-stop words
- binary filter
50Proposed solutions Disambiguation heuristics (2)
- Location only if not part of a person name
- binary filter
- e.g. Kofi Annan, Annan
- Size class information
- Bigger places preferred
- Weight → preference rule
51Proposed solutions Disambiguation heuristics (3)
- Country context
- Default: publication place of newspaper
- Two-level approach
- Identify unambiguous places of levels 0, 1 or 2
- Identify places of levels 3 to 6 only if in countries identified in step 1.
- Trigger words for locations
- Using simple rules (city/village/town of Ispra) did not produce useful results.
- → test ML methods
52Proposed solutions Disambiguation heuristics (4)
- Kilometric distance
- If one of the homographic places is near non-ambiguous places, prefer this over other homographic places.
- E.g. from Warsaw to Brest
- Brest (France): 2000 km from Warsaw
- Brest (Belarus): 200 km from Warsaw
- For calculation of minimum kilometric distance, use formula by Sinnott (1984)
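Sinnott (1984) is the usual reference for the haversine great-circle formula, so the distance heuristic can be sketched as below. The coordinates are rounded approximations chosen for this example, and the Earth radius and function names are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_distance_km(candidate, unambiguous_places):
    """Minimum distance between a candidate and all unambiguous places."""
    return min(haversine_km(*candidate, *place) for place in unambiguous_places)


# Approximate coordinates (latitude, longitude).
WARSAW = (52.23, 21.01)
BREST_CANDIDATES = {"Brest (France)": (48.39, -4.49), "Brest (Belarus)": (52.10, 23.70)}

nearest = min(BREST_CANDIDATES, key=lambda name: min_distance_km(BREST_CANDIDATES[name], [WARSAW]))
print(nearest)
```

In a text that mentions Warsaw unambiguously, the Belarusian Brest is the closer candidate and would be preferred by this heuristic.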
53Proposed solutions Combination of rules (1)
- Apply binary rules
- Ignore uppercase words that are name parts (Mr. Kofi Annan → Annan)
- Ignore Geo-stopwords
- For remaining ambiguous place names, calculate a score. The highest score wins.
- Parameters were empirically derived to perform optimally on a given test set
For details, see Kimler (2004)
54Proposed solutions Combination of rules (2)
- kilometricWeight = Arc-Cotangent(kilometricDistance): this distance is weighted using the arc-cotangent formula (Bronstein et al., 1999), with an inflexion point set to 300 kilometres, as shown in the equation
- kilometricDistance d: minimum distance between the place and all unambiguous places (according to formula by Sinnott 1984)
- Example: from Warsaw to Brest
- Brest (France): 2000 km from Warsaw
- Brest (Belarus): 200 km from Warsaw
- Both Brest are size 3 (classScore 30).
- Brest (FR) has kilometricWeight of 0.05
- Brest (BY) has kilometricWeight of 0.85
Observation: Distance < 200 km: very significant. Distance > 500 km: does not make a difference. → Inflexion point 300 km
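The equation itself is not reproduced in this transcript, so the following is only one plausible arc-cotangent weighting with an inflexion point at 300 km. The 100 km scale constant and the normalisation to 1 at distance 0 are assumptions, and the sketch is not guaranteed to reproduce the weights quoted on the slide.

```python
import math


def arccot(x):
    """Arc-cotangent with range (0, pi), continuous through x = 0."""
    return math.pi / 2 - math.atan(x)


def kilometric_weight(distance_km, inflexion_km=300.0, scale_km=100.0):
    """Assumed weighting: close to 1 for nearby places, close to 0 far away.

    The curve has its inflexion point at `inflexion_km`; `scale_km` controls
    how quickly the weight drops around that point (hypothetical value).
    """
    shifted = (distance_km - inflexion_km) / scale_km
    return arccot(shifted) / arccot(-inflexion_km / scale_km)


for d in (0, 200, 300, 500, 2000):
    print(d, round(kilometric_weight(d), 2))
```

Whatever the exact constants, the shape matches the observation on the slide: distances below roughly 200 km weigh heavily, and beyond roughly 500 km the weight is small and nearly flat.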
55Evaluation of geocoding test set
- Document test set with many smaller and ambiguous place names (161 documents)
- Comparable news stories in 5 languages, taken from NewsExplorer application (http://press.jrc.it/NewsExplorer)
56Evaluation of geocoding results
By disambiguation technique
By language
Difficult test set: Pouliquen et al. (2004) on the same test set: F = 0.38
Same algorithm on 48 average English news texts: F = 0.94
57Eurovoc
58Eurovoc Thesaurus http://eurovoc.europa.eu/
- Over 6000 classes
- Covering many different subject domains (wide coverage)
- Multilingual (over 20 languages, one-to-one translations)
- Developed by the European Parliament and others
- Actively used to manually index and retrieve documents in large collections (fine-grained classification and cataloguing system)
- Freely available for research purposes
59Eurovoc categorisation Major challenges
- Eurovoc is a conceptual thesaurus
- → categorisation vs. term extraction
- Large number of classes (6000)
- Very unevenly distributed
- Various text types (heterogeneous training set)
- Multi-label categorisation (both for training and
assignment)
- E.g.
- SPORT
- PROTECTION OF MINORITIES
- CONSTRUCTION AND TOWN PLANNING
- RADIOACTIVE MATERIALS
60Eurovoc categorisation Approach
- Profile-based, category ranking task
- Training: identification of most significant words for each class
- Assignment: combination of measures to calculate similarity between profiles and new document
- Empirical refinement of parameter settings
- Training
- Stop words
- Lemmatisation
- Multi-word terms
- Consider number of classes of each training document
- Thresholds for training document length and number of training documents per class
- Methods to determine significant words per document (log-likelihood vs. chi-square, etc.)
- Choice of reference corpus
- ...
- Assignment
- Selection and combination of similarity measures (cosine, okapi, ...)
- ...
- For details, see Pouliquen et al. (Eurolan 2003)
61Assignment Result (Example)
Title Legislative resolution embodying
Parliament's opinion on the proposal for a
Council Regulation amending Regulation No 2847/93
establishing a control system applicable to the
common fisheries policy (COM(95)0256 - C4-0272/95
- 95/ 0146(CNS)) (Consultation procedure)
62Results of automatic evaluation across languages (F1 per document at rank 6)
Human evaluation (correct descriptors, compared to inter-annotator agreement): English 83%, Spanish 80%
With pre-processing (French → only stop words)
Without pre-processing
63Eurovoc indexing Result
- Ranked list of (100) Eurovoc descriptor codes
found for each news cluster
64The JRC-Acquis parallel corpus in 22 languages
- Freely available for research purposes on our web site http://langtech.jrc.it/JRC-Acquis.html
- For details, see Steinberger et al. (2006, LREC)
- Total of over one billion words
- Pair-wise alignment for all 231 language pairs!
- Most documents have been Eurovoc-classified manually
- Useful for
- Training of multilingual subject domain classifiers.
- Creation of multilingual lexical space (LSA, KCCA)
- Training of automatic systems for Statistical Machine Translation.
- Producing multilingual lexical or semantic resources such as dictionaries or ontologies.
- Training and testing multilingual information extraction software.
- Automatic translation consistency checking.
- Testing and benchmarking alignment software (sentences, words, etc.), across a larger variety of language pairs.
- All types of multilingual and cross-lingual research.
65Monolingual TDT
66Clustering Monolingual document representation
- Vector of keywords and their keyness using
log-likelihood test (Dunning 1993)
Original cluster: Michael Jackson Jury Reaches Verdicts
Keyness   Keyword
109.24    jackson
41.54     neverland
37.93     santa
32.61     molestation
24.51     boy
24.43     pop
20.68     documentary
18.79     accuser
13.59     courthouse
11.12     jury
10.08     ranch
9.60      california
9.39      verdict
7.56      testimony
6.50      maria
4.09      michael
1.73      reached
1.68      ap
1.05      appeared
0.53      child
0.50      trial
0.45      monday
0.26      children
0.09      family
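Keyness values of this kind can be computed with the log-likelihood ratio (G²) of Dunning (1993), comparing a word's frequency in the text with its frequency in a reference corpus. The sketch below uses the standard 2x2 contingency-table form; the function name and the toy counts are illustrative, and the system's exact variant and reference corpus are not stated on the slide.

```python
import math


def log_likelihood(k_text, n_text, k_ref, n_ref):
    """G^2 statistic for a word seen k_text times in n_text tokens of the text
    and k_ref times in n_ref tokens of the reference corpus."""
    observed = [[k_text, n_text - k_text], [k_ref, n_ref - k_ref]]
    total = n_text + n_ref
    row_totals = [n_text, n_ref]
    col_totals = [k_text + k_ref, total - k_text - k_ref]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            obs = observed[i][j]
            if obs > 0:  # the term 0 * ln(0) is taken as 0
                expected = row_totals[i] * col_totals[j] / total
                g2 += obs * math.log(obs / expected)
    return 2.0 * g2


# Toy counts: a word that is far more frequent in the text than in the reference.
print(round(log_likelihood(10, 100, 1, 1000), 2))
```

Words whose relative frequency equals that of the reference corpus score 0, so ranking by this value surfaces the words that characterise the cluster.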
67Calculation of a text's Country Score
- Aim: show to what extent a text talks about a certain country
- Sum of references to a country, normalised using the log-likelihood test
- Add country score vector to keyword vector
Keyness    Keyword
109.2478   jackson
41.5450    neverland
37.9347    santa
32.6105    molestation
24.5193    boy
24.4351    pop
20.6824    documentary
18.7973    accuser
13.5945    courthouse
11.1224    jury
10.4184    us
10.0838    ranch
9.6021     california
9.3905     verdict
7.5620     testimony
6.5014     maria
4.0957     michael
1.7368     reached
1.6857     ap
1.5610     gb
1.5610     il
1.5610     br
1.0520     appeared
0.5384     child
0.5045     trial
0.4502     monday
0.2647     children
0.0946     family
68Multi-monolingual news clustering
- Input: vectors consisting of keywords and country score
- Similarity measure: cosine
- Method: bottom-up group average unsupervised clustering
- Build the binary hierarchical clustering tree (dendrogram)
- Retain only big nodes in the tree with a high cohesion (empirically refined minimum intra-node similarity: 45%)
- Use the title of the cluster's medoid as the cluster title
- For details, see Pouliquen et al. (CoLing 2004)
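Bottom-up group-average clustering over cosine similarities can be sketched in a few lines. This is a simplification: instead of building the full dendrogram and then retaining cohesive nodes, it stops merging once the best group-average similarity drops below the threshold. The data layout (dicts of term weights) and function names are assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def group_average(cluster_a, cluster_b, vectors):
    """Mean cosine similarity over all cross-cluster document pairs."""
    sims = [cosine(vectors[i], vectors[j]) for i in cluster_a for j in cluster_b]
    return sum(sims) / len(sims)


def cluster_documents(vectors, min_similarity=0.45):
    """Greedy agglomerative clustering; returns lists of document indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        (a, b), best = max(
            (((a, b), group_average(clusters[a], clusters[b], vectors))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < min_similarity:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


docs = [
    {"jackson": 3.0, "jury": 1.0},
    {"jackson": 2.0, "verdict": 1.0},
    {"election": 2.0, "vote": 1.0},
]
print(cluster_documents(docs))
```

The pairwise search is quadratic per merge step, which is acceptable for a sketch but would need a similarity cache for a full day of news.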
69Monolingual cluster linking - Evaluation
- Link clusters historically if
- Link within 7 days
- Cosine cluster similarity > 0.5
- Evaluation results depending on similarity threshold
- Details: Pouliquen et al. (CoLing 2004)
70Cross-lingual Tracking
71Cross-lingual cluster linking: combination of 4 ingredients
- CLDS (using cosine) based on these representations
- CLDS = αS1 + βS2 + γS3 + δS4
- Ranked list of Eurovoc classes (40%)
- Country score (30%)
- Names frequency (20%)
- Monolingual cluster representation without country score (10%)
- → establish cross-lingual link if combined similarity > 0.3
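The weighted combination and the linking threshold can be written out directly. The weights and the 0.3 threshold come from the slide; the function names and the example similarity values are made up for illustration.

```python
# Weights for Eurovoc classes, country score, names, monolingual representation.
WEIGHTS = (0.4, 0.3, 0.2, 0.1)
LINK_THRESHOLD = 0.3


def clds(s1, s2, s3, s4, weights=WEIGHTS):
    """Cross-lingual document similarity as a weighted sum of four cosines."""
    return sum(w * s for w, s in zip(weights, (s1, s2, s3, s4)))


def should_link(s1, s2, s3, s4):
    """Establish a cross-lingual link if the combined similarity exceeds 0.3."""
    return clds(s1, s2, s3, s4) > LINK_THRESHOLD


print(round(clds(0.5, 0.4, 0.0, 0.2), 2), should_link(0.5, 0.4, 0.0, 0.2))
```

Because the weights sum to 1, two clusters with no shared names can still be linked when subject domains and countries agree strongly.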
72Cross-lingual cluster linking evaluation
- Evaluation results depending on similarity threshold
- Ingredients: 40/30/30 (names not yet considered)
- Evaluation for EN → FR and EN → IT (136 EN clusters)
Recall at 15% similarity threshold: 100%
For details, see Pouliquen et al. (CoLing 2004)
73Filter out bad links by exploiting all
cross-lingual links
Assumption: If EN is linked to FR, ES, IT, ..., FR should also be linked to ES, IT, ... If not: lower link likelihood
74Filter out bad links by exploiting all
cross-lingual links
- Build a second similarity, based on the first. It uses the following input
- 1) the number of links between the set of clusters in the other languages
- 2) the strength (or similarity level) of these links
- 3) the number of potential links between the set of clusters in the other languages (which means all the links minus those between clusters in the same language)
- Empirical formula
- similarity_2 = similarity_1 (number_of_links / number_of_potential_links) square_root(number_of_potential_links)
- Result: elimination of some wrong links
- (No formal evaluation results available)
75Conclusion Topic Detection and Tracking
- (Multi-)Monolingual TDT is a relatively well explored area
- Use vector space models
- Use named entities, etc.
- Presentation of two highly multilingual TDT applications
- NewsBrief: live clustering (vector space)
- NewsExplorer: daily clustering (vector space enhanced with geographical information)
76Conclusion
77Conclusion Cross-lingual linking of documents
(clusters)
- State-of-the-art approaches to cross-lingual document similarity calculation
- Use Machine Translation
- Use bilingual dictionaries
- Use bilingual document space (LSA, KCCA)
- → restricted to small number of languages
- Alternative proposal: link documents across languages via anchors
- Use different entity types (persons, organisations, locations)
- Use subject domain classification
- Exploit cognates
- Further anchors: measurement expressions
- time,
- speed,
- volume,
- ...
- Terminology from specialist dictionaries
78Conclusion Cross-lingual linking of documents
(2)
- Performance is text type-dependent
- In order to use named entities (and measurement expressions, etc.) as anchors, they frequently need normalising
- Different writing systems
- Transliteration variants
- Morphological variants
- Spelling variants (even monolingually)
79Conclusion Multilinguality
- In a highly multilingual context, it is an advantage
- To use language-independent rules (with language-specific resources, if needed)
- To use simple rules that can easily be adapted to new languages
- To avoid language pair-specific resources
- Downside: lower performance than best-performing monolingual systems
- Advantage: highly multilingual applications are made possible without too much effort
80Current and future work
- Improve each of the individual components of the system
- Re-implement some of the tools more efficiently
- Concentrate on extracting more structured information
- Relations between persons (co-occurrence → criticise, support, family relationship, ...)
- Events: Who did What to Whom, Where and When?
- Not possible with simple bag-of-word approaches
- More language-specific effort needed → restricted to fewer languages
81Thank you!