Geocoding Multilingual Texts: Recognition, Disambiguation and Visualisation


1
Multilingual and cross-lingual topic detection and tracking
SEPLN 2007, Sevilla, Spain, 12 September 2007
Ralf Steinberger, with Bruno Pouliquen, Erik van der Goot, Olivier Deguernel, Camelia Ignat
European Commission Joint Research Centre (JRC)
http://langtech.jrc.it/
http://press.jrc.it/NewsExplorer
2
DG JRC - Who we are
3
Agenda

Background
Introduction to the terms Topic Detection (TD) and Topic Tracking (TT) → TDT
Known approaches to cross-lingual document similarity calculation
NewsBrief: multi-monolingual live clustering (TDT) (http://press.jrc.it/)
Demo of NewsExplorer (http://press.jrc.it/NewsExplorer/)
Technical details on how NewsExplorer works
NER: person and organisation names
NER: recognition and disambiguation of geographical references
Multilingual subject domain categorisation
NewsExplorer: multi-monolingual daily clustering (TDT)
Cross-lingual cluster linking
Conclusion

4
Background: TDT, CL
5
Topic Detection and Tracking (TDT) - Background

US-American DARPA program TDT (1997-2004).
"TDT refers to automatic techniques for locating topically related material in streams of data such as newswire and broadcast news." (Wayne 2000)
Topic: e.g. the Oklahoma City bombing in 1995, incl. memorial services, investigations, prosecution, etc.
Topic ≠ category (bombing)!
Since 2000, part of Translingual Information Detection, Extraction, and Summarization (TIDES).
"The goal of the TIDES program is to enable English-speaking users to access, correlate, and interpret multilingual sources of real-time information and to share the essence of this information with collaborators." (English, Chinese, Arabic, with some research on Korean and Spanish)
TDT and TIDES explanations and images borrowed from http://www.nist.gov/speech/tests/tdt/tides.htm

6
5 TDT Sub-tasks (each formally evaluated)
7
TDT summary of approaches (Wayne 2000)

Means used by participants
Stop words, stemming, TF.IDF weighting
Using single documents or clusters of documents
(Incremental) vector space models
Using cosine or Okapi, k-NN clustering
Various types of normalisation across sources, languages and topics
Cross-lingual topic tracking
Chinese to English, later also Arabic to English → focus on English as target language
Chinese Machine Translation (MT, Systran) results were given to participants
Participants could use other means instead; some experimented with bilingual dictionaries.

8
Insights from past research

TDT Program Insights (Wayne 2000)
TDT techniques can work well in languages very different from English: similar performance for monolingual Chinese and English
Lower performance for cross-lingual tracking (performance impacted by translation errors)
Making use of named entities (people, organisations, locations) helped (Chen & Ku 2002).
Larkey et al. (2004), native language hypothesis: Topic Tracking works better in the original language than in (MT)-translated collections.

9
Approaches to cross-lingual document similarity
calculation (1)

How to find out whether two texts in different languages are related?
Most common approach (until today): use MT or bilingual dictionaries to translate into English, then use monolingual methods to calculate similarity.
Using MT (e.g. Leek et al. 1999 for Chinese-Mandarin to English): 50% performance loss when using MT
Using bilingual dictionaries (e.g. Wactlar 1999 for Serbo-Croatian to English; Urizar & Loinaz 2007 for Basque, Spanish and English)
In TDT 1999, the better results were achieved using MT

10
Approaches to cross-lingual document similarity
calculation (2)

Automatically produce a bilingual lexical space for bilingual document representation and document similarity calculation, e.g.
Bilingual Latent Semantic Analysis (LSA) (Landauer & Littman 1991)
Kernel Canonical Correlation Analysis (KCCA) (Vinokourov et al., 2002)
Achieved results are relatively good
Bilingual approach is restricted to a few languages (OK for English as target lang.)
Language pairs: (N² − N) / 2 (N = number of languages)
EU: 22 official languages → 231 language pairs (462 language pair directions)!
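
The pair count on this slide can be checked with a few lines of Python. This is only a worked instance of the (N² − N) / 2 formula; the function names are made up for illustration.

```python
def language_pairs(n: int) -> int:
    """Unordered language pairs among n languages: (n^2 - n) / 2."""
    return (n * n - n) // 2


def language_pair_directions(n: int) -> int:
    """Ordered (source, target) pairs: twice the unordered count."""
    return n * (n - 1)


print(language_pairs(22), language_pair_directions(22))  # 231 462
```

The quadratic growth is the argument against pair-specific resources: every added language requires a new resource for each language already covered.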

11
Approaches to cross-lingual document similarity
calculation (3)

Alternative: use entities as anchors
Names of persons and organisations
Names of locations
Units of measurements
Time
Speed
Temperature
Acceleration
Multilingual specialist dictionaries (MeSH for medicine, etc.)
Normalise these expressions
→ Use as a kind of interlingua: no language pair-specific resource needed
Steinberger Ralf, Pouliquen Bruno & Camelia Ignat (2004). Providing cross-lingual information access with knowledge-poor methods. Informatica 28-4, pp. 415-423.

12
The EMM news data
13
Europe Media Monitor (EMM) News Aggregation

Best et al. 2005
External site (http://press.jrc.it/)
Scrapes 1000 news portals world-wide for new news articles
Up to every 10 minutes
Standardises input format (UTF-8-encoded RSS format)
35,000 news articles per day
Articles in 34 languages
3 public systems
NewsBrief
Medical Information System MedISys
NewsExplorer

14
EMM NewsBrief

Best et al. 2005
Public site: http://press.jrc.it/
Uses all EMM news data
Categorises news into 600 categories, using Boolean search word combinations (plus optional weights plus vicinity operators)
Clusters and tracks news live (multi-monolingually)
Sends out email notifications for each category
Detects breaking news
Short-term story tracking

15
EMM Medical Information System (MedISys)

Fuart et al. 2007; Yangarber et al. 2007
Public site: http://medusa.jrc.it/
Uses all EMM news data
Selects articles of relevance to Public Health
(diseases, symptoms, health organisations)
Categorises news into 250 categories, using EMM
functionality
Detects breaking news for each category and
country
All other EMM functionality

16
MedISys Automatic Email Alert
17
EMM NewsExplorer

Steinberger et al. 2005
http://press.jrc.it/NewsExplorer
Uses public EMM news data
Publicly accessible news aggregation and analysis system
Clusters related news once per day in 19 languages
Links clusters over time and across languages → event time lines
Extracts references to locations, persons, and other entities
Collects historical information about named entities across languages

18
Multilingual Clustering & Tracking
19
NewsBrief live clustering (multi-monolingual)

Multilingual, language-independent algorithm
Live clustering of incoming news every ten minutes (Topic Detection)
All articles that fall in a sliding 4-hour window (up to 8 hours when < 200 articles)
Using 100 stop words (most frequent words among last 50,000 articles)
No word normalisation
Document representation: word frequency (except stop words) of first 200 words only
Similarity measure: cosine
Hierarchical clustering with group-averaging
Minimum size: two non-identical documents from two different sources
Similarity threshold: 0.6 or 0.8 or 0.9, depending on vector sparseness (En: 0.6)
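
The ingredients listed on this slide can be put together roughly as follows. This is a minimal sketch, not the NewsBrief implementation: the function names, whitespace tokenisation and the quadratic search are assumptions, and the minimum-size and multiple-source rules are left out.

```python
import math
from collections import Counter


def vectorise(text, stop_words, max_words=200):
    """Word-frequency vector over the first max_words tokens, minus stop words."""
    words = text.lower().split()[:max_words]
    return Counter(w for w in words if w not in stop_words)


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def group_average(c1, c2, vecs):
    """Average pairwise cosine between the members of two clusters."""
    sims = [cosine(vecs[i], vecs[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def cluster(vecs, threshold=0.6):
    """Bottom-up merging until no pair of clusters reaches the threshold."""
    clusters = [[i] for i in range(len(vecs))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = group_average(clusters[x], clusters[y], vecs)
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] += clusters.pop(y)
    return clusters


docs = [
    "quake hits city rescue teams arrive",
    "rescue teams arrive after quake hits city",
    "football cup final tonight",
]
vecs = [vectorise(d, stop_words={"after"}) for d in docs]
print(cluster(vecs))  # [[0, 1], [2]]
```

Because nothing here depends on the language beyond the stop word list, the same code can run once per language, which is what "multi-monolingual" refers to.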

20
NewsBrief live clustering (multi-monolingual) (2)

Cluster linking (into a story)
Link clusters if at least 10% overlap of articles
Inherit articles that would fall out due to window time constraint, or due to shifting word-base
Story ends if no new articles in window
Longest stories last a few days and have a few hundred articles
Big stories have 50 current articles plus inherited ones
Stories can merge (inherit from both previous clusters)
Stories cannot currently split
Approx. 75 English clusters for a 24-hour period

21
NewsBrief live clustering (multi-monolingual) (3)

Story finalisation
If cluster is not linked in current window
Breaking news detection (red dots → email alerts)
If story is less than 1 hour old
If at least 3 articles from 3 different sources
If at least 75% of articles from the last hour came in during the last 30 minutes
Breaking news update (red dots)
If story is older than 1 hour
If at least 3 articles from 3 different sources
If at least 75% of articles from the last 60 minutes came in during the last 30 minutes
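
The breaking-news conditions above can be read as a small rule set. The sketch below is one possible reading under stated assumptions: articles are `(timestamp, source)` tuples, "at least 3 articles from 3 different sources" is applied to the articles of the last hour, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def _burst(articles, now):
    """True if >= 3 articles from >= 3 sources arrived in the last hour
    and at least 75% of them arrived in the last 30 minutes."""
    last_hour = [(t, s) for t, s in articles if now - t <= timedelta(hours=1)]
    if len(last_hour) < 3 or len({s for _, s in last_hour}) < 3:
        return False
    recent = [t for t, _ in last_hour if now - t <= timedelta(minutes=30)]
    return len(recent) / len(last_hour) >= 0.75


def classify(articles, story_start, now):
    """Return 'breaking', 'update' or None for a story."""
    if not _burst(articles, now):
        return None
    return "breaking" if now - story_start < timedelta(hours=1) else "update"


now = datetime(2007, 9, 12, 12, 0)
arts = [(now - timedelta(minutes=m), src)
        for m, src in [(35, "a"), (20, "b"), (10, "c"), (5, "d")]]
print(classify(arts, now - timedelta(minutes=40), now))  # breaking
print(classify(arts, now - timedelta(hours=3), now))     # update
```

The only difference between the two alert types is the age of the story; the burst test is shared.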

22
NewsBrief live clustering (4) Observations

Clustering instead of dealing with single stories (first story detection, topic tracking, ...)
Language bias: articles sometimes cluster by country (UK vs. US, NL vs. BE)
Stories in languages other than world languages are likely to die overnight (En, It, Es)

Language-independent → multilingual
Interesting cross-lingual comparisons

23
EMM NewsExplorer Live Demo

http://press.jrc.it/NewsExplorer

24
NewsExplorer Technical details
25
NewsExplorer - Cross-lingual cluster linking

Language-independent features for multilingual document representation
No MT or bilingual dictionaries
19 languages
Sim1 (40%): Multilingual Eurovoc subject domains
Sim2 (30%): Geo-locations
Sim3 (20%): Names + variants
Sim4 (10%): Cognates and numbers

26
NER: PER & ORG
27
Multilingual name recognition and variant merging
28
NER: Known person & organisation names

Lookup of known names from database
Currently over 630,000 names
135,000 variants
Only 50,000 have been found in five different clusters or more
Pre-generate morphological variants (Slovene example):
Tony(a|o|u|om|em|m|ju|jem|ja)?\sBlair(a|o|u|om|em|m|ju|jem|ja)
Live name variants
29
NER: New person & organisation names (1)

Guessing names using empirically-derived lexical patterns
Identification of about 450 unknown names per day
50 of those are automatically merged with known names
Trigger word(s) + Uppercase Words (+ name particles: von, van, de la, abu, bin, ...)
President, Minister, Head of State, Sir, American, ...
death of, [0-9]+-year-old, ...
Known first names (John, Jean, Hans, Giovanni, Johan, ...)
Combinations: 56-year-old former prime minister Kurmanbek Bakiyev
Use bootstrapping to produce a trigger word list for a new language
Small initial trigger word list
Produce frequency list of contexts of known names
Manual selection

For details, see Steinberger & Pouliquen (LI 30.1, 2007)
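
A trigger-word pattern of the kind described on this slide might look like the sketch below. The trigger and particle lists are tiny illustrative samples taken from the slide, the regular expression is an assumption rather than the NewsExplorer rule set, and `[A-Z][a-z]+` only covers unaccented Latin names.

```python
import re

# Illustrative samples only; a real list would be far longer.
TRIGGERS = ["president", "prime minister", "minister", "sir"]
PARTICLES = ["von", "van", "de la", "abu", "bin"]

# Longest triggers first so "prime minister" wins over "minister".
trigger = "|".join(re.escape(t) for t in sorted(TRIGGERS, key=len, reverse=True))
particle = "|".join(re.escape(p) for p in PARTICLES)
cap = r"[A-Z][a-z]+"

NAME = re.compile(
    rf"\b(?i:(?:\d+-year-old|former)\s+)*(?i:{trigger})\s+"
    rf"({cap}(?:\s+(?:(?:{particle})\s+)?{cap})*)"
)


def guess_names(text):
    """Return the uppercase word sequences that follow a trigger word."""
    return [m.group(1) for m in NAME.finditer(text)]


print(guess_names("The 56-year-old former prime minister Kurmanbek Bakiyev said"))
# ['Kurmanbek Bakiyev']
```

Triggers are matched case-insensitively while the name itself must be capitalised, which is what separates the name from the words that follow it.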
30
NER: Most frequent trigger words across languages
31
NER: Inflection of trigger words

Inflection of trigger words for person names, using regular expressions (Slovene example)
kandidat(a|u|om)?
legend(a|e|i|o)
milijarder(ja|ju|jem)?
predsednik(a|u|om|em)?
predsednic(a|e|i|o)
ministric(a|e|i|o)
sekretar(ja|ju|jom|jem)?
diktator(ja|ju|jem)?
playboy(a|u|om|em)?

uppercase words
verskega voditelja Moktade al Sadra je z notranjim → Muqtada al-Sadr (ID236)
32
Name Variants

Adding names from web sources
Merging NewsExplorer name variants
Transliteration
Normalisation
Similarity measure

33
Name Recognition - Evaluation results (2005)
For details, see Steinberger & Pouliquen (LI 30.1, 2007)
34
Name transliteration

Currently, EMM NewsExplorer transliterates from Arabic, Farsi, Greek, Russian and Bulgarian
Transliterate each character, or sequence of characters, by a Latin correspondent
ψ → ps
λ → l
μπ → b
Hard-code some common transliterations: Джордж (DJORDJ) → "George", Джеймс → "James", ...
Examples of transliterations
Κόφι Ανάν, Greek → Kofi Anan
Кофи Аннан, Russian → Kofi Anan
Кофи Анан, Bulgarian → Kofi Anan
كوفي عنان, Arabic → Kufi Anan
कोफी अन्नान, Hindi → Kofi Anan
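
Character-by-character transliteration with multi-character sequences can be sketched as a longest-match table lookup. Only the mappings for ψ, λ and μπ come from the slide; the other table entries, the Greek spelling of the example name and the function name are assumptions made for illustration.

```python
import unicodedata

GREEK_TO_LATIN = {
    "μπ": "b", "ψ": "ps", "λ": "l",              # rules shown on the slide
    "κ": "k", "ο": "o", "ό": "o", "φ": "f",      # illustrative additions
    "ι": "i", "α": "a", "ά": "a", "ν": "n",
}


def transliterate(text, table=GREEK_TO_LATIN):
    """Replace each source character (or sequence) by its Latin correspondent.

    Longer sequences are tried first; unknown characters pass through unchanged.
    """
    table = {unicodedata.normalize("NFC", k): v for k, v in table.items()}
    keys = sorted(table, key=len, reverse=True)
    text = unicodedata.normalize("NFC", text)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text[i:i + len(k)].lower() == k:
                rep = table[k]
                out.append(rep.capitalize() if text[i].isupper() else rep)
                i += len(k)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("Κόφι Ανάν"))  # Kofi Anan
```

Hard-coded whole-name transliterations, as mentioned on the slide, would simply be looked up before this function is called.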

35
Name normalisation: Why?

Transliteration rules depend on the target language, e.g.
Владимир Устинов (Russian)
Vladimir Ustinov (English)
Wladimir Ustinow (German)
Vladimir Oustinov (French)
Various ways to represent the same sound: sh, sch, ch, š, e.g.
Bašar al Assad
Baschar al Assad
Bachar al Assad
Diacritics, e.g.
Wałęsa → Walesa
Saïd → Said
Schröder → Schroder
→ Edit distance is large for naturally occurring word variants
"Rafik Harriri" vs. "Rafiq Hariri" → 2
"Rfk Hrr" vs. "Rafiq Hariri" → 6
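
The two edit distances quoted on this slide can be reproduced with a standard Levenshtein implementation (insertions, deletions and substitutions each cost 1). The slide does not say which variant of edit distance was used, so this choice is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


print(edit_distance("Rafik Harriri", "Rafiq Hariri"))  # 2
print(edit_distance("Rfk Hrr", "Rafiq Hariri"))        # 6
```

A distance of 6 on a 12-character name shows why raw edit distance alone is a poor merging criterion and why normalisation comes first.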

36
Name normalisation (2) - 30 Rules

Latin normalisation: Malik al-Saïdoullaïev
accented character → non-accented equivalent: Malik al-Saidoullaiev
double consonant → single consonant: Malik al-Saidoulaiev
ou → u: Malik al-Saidulaiev
al- → (removed): Malik Saidulaiev
wl (beginning of name) → vl
ow (end of name) → ov
ck → k
ph → f
ž → j
š → sh
x → ks
Remove vowels
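
A handful of the normalisation rules above can be chained as plain string rewrites. The sketch covers only the rules whose both sides are legible on the slide; the rule order, the lowercasing and the function names are assumptions, and the full set of about 30 rules is not reproduced.

```python
import re
import unicodedata


def strip_accents(s):
    """Accented character -> non-accented equivalent."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalise(name):
    s = strip_accents(name).lower()
    s = re.sub(r"([b-df-hj-np-tv-z])\1", r"\1", s)  # double consonant -> single
    s = s.replace("ou", "u")
    s = re.sub(r"\bal-", "", s)                     # drop the particle "al-"
    s = re.sub(r"\bwl", "vl", s)                    # wl at beginning of name
    s = re.sub(r"ow\b", "ov", s)                    # ow at end of name
    return s.replace("ck", "k").replace("ph", "f").replace("x", "ks")


def consonant_skeleton(name):
    """Normalised form with vowels removed, used as a cheap comparison key."""
    return re.sub(r"[aeiou]", "", normalise(name))


print(normalise("Malik al-Saïdoullaïev"))           # malik saidulaiev
print(consonant_skeleton("Malik al-Saïdoullaïev"))  # mlk sdlv
print({normalise(n) for n in
       ["Vladimir Ustinov", "Wladimir Ustinow", "Vladimir Oustinov"]})
# {'vladimir ustinov'}
```

The three spellings of Ustinov from the previous slide collapse to a single form, which is the purpose of the rules.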

37
Similarity measure for name merging

To compare 450 new names every day with 800,000 known name variants:
Compare only if the transliterated, normalised form with vowels removed is identical
Calculate edit distance variant similarity using two different representations

[Diagram labels: 20%, 80%, Condition]
38
Merging name variants: some results

Threshold 0.94 (100% precision in test set)
NewsExplorer: 450 new names every day
50 are automatically merged (11%)
42 are saved for expert judgment (9%)

39
Person name recognition and variant merging: Result
Name variants
Trigger words
live
40
Geo-coding
41
Previous work

Aim: multilingual text to unique identifier
MUC-7, etc.: identify and classify (PER vs. LOC)
Leidner (2007)
Geo-CLEF
For toponym disambiguation, people work on 1-2 languages
MetaCarta (http://www.metacarta.com): commercial, English only
Mikheev (1999): gazetteer needed

For details, see Pouliquen et al. (LREC 2006)
42
NER: Geographical Locations

Aim: multilingual text to map
Major challenges
Solution: procedure using 6 different heuristics, and their interaction
Evaluation and results

43
Major challenges for geo-coding (1)

Place homographs with common words

44
Major challenges for geo-coding (2)

Place homographs with people's first and last names

45
Major challenges for geo-coding (3)

Homographic place names

46
Major challenges for geo-coding (4)

Completeness of gazetteer: multilinguality (exonyms, endonyms), historical variants, e.g.
Санкт-Петербург, Saint Petersburg, Saint Pétersbourg, Leningrad, Petrograd, ...
Morphological variation / inflection
Romanian: Parisului (of Paris)
Estonian: Londonit (London), New Yorgile (New York)
Arabic: albaRiziun (the Paris inhabitants)

47
Proposed solutions: Multilingual gazetteer

We combined three different sources
Global Discovery database of place names (GeoNet)
> 500,000 place names
In English and in local language (but in Roman script)
Contains 6 size classes
KNAB database of exonyms and historical variants
Institute of the Estonian Language
Venezia, Venice, Venise, Venedig,
Istanbul, Constantinople, Istamboul, Istanbul,
European Commission-internal document
in the 11 languages of the pre-Enlargement EU
Country name
Capital name
Inhabitant name
Currency name
Country adjective

48
Proposed solutions: morphological variation

English London - Finnish Lontoo (nominative case)
Lontoossa (in London), Lontoosta (from London), Lontoon (London's), Lontoolaisen (Londoner, of London), Lontooseen (to London)
3 options
Use morphological analyser software
Pre-generate all inflection forms, using suffix replacement rules
Strip/replace suffixes of uppercase words not found in gazetteer and check again, e.g. Finnish Lontoosta → Lontoo

Tony(a|o|u|om|em|m|ju|jem|ja)?\sBlair(a|o|u|om|em|m|ju|jem|ja)
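
The third option (strip suffixes of uppercase words not found in the gazetteer and look them up again) can be sketched as follows. The two-entry gazetteer and the suffix list are hypothetical stand-ins chosen to cover the Finnish forms on this slide, not the actual resources.

```python
GAZETTEER = {"Lontoo", "Pariisi"}                         # hypothetical mini-gazetteer
FINNISH_SUFFIXES = ["laisen", "seen", "ssa", "sta", "n"]  # illustrative subset


def lookup(word, gazetteer=GAZETTEER, suffixes=FINNISH_SUFFIXES):
    """Return the gazetteer entry for word, stripping one suffix if needed."""
    if not word[:1].isupper():
        return None
    if word in gazetteer:
        return word
    for suffix in sorted(suffixes, key=len, reverse=True):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and stem in gazetteer:
            return stem
    return None


for form in ["Lontoo", "Lontoossa", "Lontoosta", "Lontoon", "Lontooseen"]:
    print(form, "->", lookup(form))  # each form maps to Lontoo
```

Compared with pre-generating every inflected form, this keeps the gazetteer small, at the cost of a second lookup for unknown words.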
49
Proposed solutions: Disambiguation heuristics (1)

Language-independent heuristics
Using language-specific resources
2 Binary filters
Several preference rules
Formula that combines them all
Geo-stop words
5489 English Geo-stop words
binary filter

50
Proposed solutions: Disambiguation heuristics (2)

Location only if not part of a person name
binary filter
e.g. Kofi Annan, Annan
Size class information
Bigger places preferred
Weight → preference rule

51
Proposed solutions: Disambiguation heuristics (3)

Country context
Default: publication place of newspaper
Two-level approach
Identify unambiguous places of levels 0, 1 or 2
Identify places of levels 3 to 6 only if in countries identified in step 1.
Trigger words for locations
Using simple rules (city/village/town of Ispra) did not produce useful results.
→ test ML methods

52
Proposed solutions: Disambiguation heuristics (4)

Kilometric distance
If one of the homographic places is near non-ambiguous places, prefer it over the other homographic places.
E.g. "from Warsaw to Brest"
Brest (France): 2000 km from Warsaw
Brest (Belarus): 200 km from Warsaw
For calculation of minimum kilometric distance, use formula by Sinnott (1984)
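
Sinnott (1984) is the usual reference for the haversine great-circle formula, so the distance heuristic can be sketched as below. The coordinates are rough values assumed for illustration, and `nearest_candidate` is a hypothetical helper, not the system's code.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate (latitude, longitude) values, assumed for this example.
WARSAW = (52.23, 21.01)
CANDIDATES = {"Brest (France)": (48.39, -4.49), "Brest (Belarus)": (52.10, 23.70)}


def nearest_candidate(candidates, unambiguous):
    """Pick the candidate with the smallest distance to any unambiguous place."""
    def min_dist(coords):
        return min(haversine_km(*coords, *u) for u in unambiguous)
    return min(candidates, key=lambda name: min_dist(candidates[name]))


print(nearest_candidate(CANDIDATES, [WARSAW]))  # Brest (Belarus)
```

With these coordinates the two distances come out at roughly 180 km and 1,850 km, in line with the rounded figures on the slide.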

53
Proposed solutions: Combination of rules (1)

Apply binary rules
Ignore uppercase words that are name parts (Mr. Kofi Annan → Annan)
Ignore Geo-stopwords
For remaining ambiguous place names, calculate a score. The highest score wins.
Parameters were empirically derived to perform optimally on a given test set

For details, see Kimler (2004)
54
Proposed solutions: Combination of rules (2)

kilometricWeight = Arc-Cotangent(kilometricDistance): this distance is weighted using the arc-cotangent formula (Bronstein et al., 1999), with an inflexion point set to 300 kilometres, as shown in the equation [equation not preserved in the transcript]
kilometricDistance d: minimum distance between the place and all unambiguous places (according to the formula by Sinnott 1984)

Example: "from Warsaw to Brest"
Brest (France): 2000 km from Warsaw
Brest (Belarus): 200 km from Warsaw
Both Brests are size 3 (classScore 30).
Brest (FR) has a kilometricWeight of 0.05
Brest (BY) has a kilometricWeight of 0.85

Observation: distances < 200 km are very significant; distances > 500 km do not make a difference → inflexion point at 300 km
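
The slide's equation is not available in this transcript, so the following is only one plausible arc-cotangent weighting with an inflexion point at 300 km. The 100 km scale and the normalisation to 1 at distance 0 are assumptions, and the output is not guaranteed to reproduce the 0.05 and 0.85 quoted above.

```python
import math


def arccot(x):
    """Arc-cotangent with range (0, pi), decreasing in x."""
    return math.pi / 2 - math.atan(x)


def kilometric_weight(d_km, inflexion_km=300.0, scale_km=100.0):
    """Assumed form: close to 1 near 0 km, steepest at the inflexion, near 0 far away."""
    return arccot((d_km - inflexion_km) / scale_km) / arccot(-inflexion_km / scale_km)


for d in (0, 200, 300, 500, 2000):
    print(d, round(kilometric_weight(d), 2))
```

Whatever the exact constants, the shape matches the observation on the slide: distances below about 200 km keep a high weight and distances beyond about 500 km are all close to zero.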
55
Evaluation of geocoding: test set

Document test set with many smaller and ambiguous place names (161 documents)
Comparable news stories in 5 languages, taken from the NewsExplorer application (http://press.jrc.it/NewsExplorer)

56
Evaluation of geocoding: results
By disambiguation technique
By language
Difficult test set: Pouliquen et al. (2004) on the same test set: F = 0.38. Same algorithm on 48 average English news texts: F = 0.94
57
Eurovoc
58
Eurovoc Thesaurus http://eurovoc.europa.eu/

Over 6000 classes
Covering many different subject domains (wide
coverage)
Multilingual (over 20 languages, one-to-one
translations)
Developed by the European Parliament and others
Actively used to manually index and retrieve documents in large collections (fine-grained classification and cataloguing system)
Freely available for research purposes

59
Eurovoc categorisation: Major challenges

Eurovoc is a conceptual thesaurus
→ categorisation vs. term extraction
Large number of classes (over 6000)
Very unevenly distributed
Various text types (heterogeneous training set)
Multi-label categorisation (both for training and
assignment)

E.g.
SPORT
PROTECTION OF MINORITIES
CONSTRUCTION AND TOWN PLANNING
RADIOACTIVE MATERIALS

60
Eurovoc categorisation: Approach

Profile-based, category ranking task
Training: identification of most significant words for each class
Assignment: combination of measures to calculate similarity between profiles and new document

Empirical refinement of parameter settings
Training
Stop words
Lemmatisation
Multi-word terms
Consider number of classes of each training document
Thresholds for training document length and number of training documents per class
Methods to determine significant words per document (log-likelihood vs. chi-square, etc.)
Choice of reference corpus
Assignment
Selection and combination of similarity measures (cosine, Okapi, ...)
...
For details, see Pouliquen et al. (Eurolan 2003)

61
Assignment Result (Example)
Title Legislative resolution embodying
Parliament's opinion on the proposal for a
Council Regulation amending Regulation No 2847/93
establishing a control system applicable to the
common fisheries policy (COM(95)0256 - C4-0272/95
- 95/ 0146(CNS)) (Consultation procedure)
62
Results of automatic evaluation across languages (F1 per document at rank 6)
Human evaluation (correct descriptors, compared to inter-annotator agreement): English 83%, Spanish 80%
With pre-processing (French: only stop words)
Without pre-processing
63
Eurovoc indexing Result

Ranked list of (100) Eurovoc descriptor codes
found for each news cluster

64
The JRC-Acquis parallel corpus in 22 languages

Freely available for research purposes on our web site: http://langtech.jrc.it/JRC-Acquis.html
For details, see Steinberger et al. (2006, LREC)
Total of over one billion words
Pair-wise alignment for all 231 language pairs!
Most documents have been Eurovoc-classified manually
Useful for:
Training of multilingual subject domain
classifiers.
Creation of multilingual lexical space (LSA,
KCCA)
Training of automatic systems for Statistical
Machine Translation.
Producing multilingual lexical or semantic
resources such as dictionaries or ontologies.
Training and testing multilingual information
extraction software.
Automatic translation consistency checking.
Testing and benchmarking alignment software
(sentences, words, etc.), across a larger variety
of language pairs.
All types of multilingual and cross-lingual
research.

65
Monolingual TDT
66
Clustering: Monolingual document representation

Vector of keywords and their keyness using log-likelihood test (Dunning 1993)

Original cluster: Michael Jackson Jury Reaches Verdicts

Keyness  Keyword
109.24   jackson
 41.54   neverland
 37.93   santa
 32.61   molestation
 24.51   boy
 24.43   pop
 20.68   documentary
 18.79   accuser
 13.59   courthouse
 11.12   jury
 10.08   ranch
  9.60   california
  9.39   verdict
  7.56   testimony
  6.50   maria
  4.09   michael
  1.73   reached
  1.68   ap
  1.05   appeared
  0.53   child
  0.50   trial
  0.45   monday
  0.26   children
  0.09   family
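
Keyness values of this kind are commonly computed with the two-term form of the log-likelihood statistic, comparing a word's frequency in the text with its frequency in a reference corpus. The sketch below uses that common form; whether NewsExplorer uses exactly this variant, and which reference corpus it uses, is not stated on the slide, and the counts in the example are invented.

```python
import math


def log_likelihood(a, b, c, d):
    """Keyness of a word.

    a: frequency in the text, b: frequency in the reference corpus,
    c: size of the text, d: size of the reference corpus (in words).
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the text
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    g2 = 0.0
    if a:
        g2 += a * math.log(a / e1)
    if b:
        g2 += b * math.log(b / e2)
    return 2 * g2


# Invented counts: a word seen 50 times in a 1,000-word cluster
# and 100 times in a 100,000-word reference corpus.
print(round(log_likelihood(50, 100, 1000, 100000), 2))
```

The statistic is zero when the relative frequencies match and grows with the size of the difference; it does not indicate the direction, so over-represented words have to be selected separately.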
67
Calculation of a text's Country Score

Aim: show to what extent a text talks about a certain country
Sum of references to a country, normalised using the log-likelihood test
Add country score vector to keyword vector

Keyness   Keyword
109.2478  jackson
 41.5450  neverland
 37.9347  santa
 32.6105  molestation
 24.5193  boy
 24.4351  pop
 20.6824  documentary
 18.7973  accuser
 13.5945  courthouse
 11.1224  jury
 10.4184  us
 10.0838  ranch
  9.6021  california
  9.3905  verdict
  7.5620  testimony
  6.5014  maria
  4.0957  michael
  1.7368  reached
  1.6857  ap
  1.5610  gb
  1.5610  il
  1.5610  br
  1.0520  appeared
  0.5384  child
  0.5045  trial
  0.4502  monday
  0.2647  children
  0.0946  family
68
Multi-monolingual news clustering

Input: vectors consisting of keywords and country score
Similarity measure: cosine
Method: bottom-up group average unsupervised clustering
Build the binary hierarchical clustering tree (dendrogram)
Retain only big nodes in the tree with a high cohesion (empirically refined minimum intra-node similarity: 45%)
Use the title of the cluster's medoid as the cluster title
For details, see Pouliquen et al. (CoLing 2004)

69
Monolingual cluster linking - Evaluation

Link clusters historically if
Link within 7 days
Cosine cluster similarity > 0.5
Evaluation results depending on similarity threshold
Details: Pouliquen et al. (CoLing 2004)

70
Cross-lingual Tracking
71
Cross-lingual cluster linking: combination of 4 ingredients

CLDS (using cosine) based on these representations
CLDS = α·S1 + β·S2 + γ·S3 + δ·S4
Ranked list of Eurovoc classes (40%)
Country score (30%)
Names + frequency (20%)
Monolingual cluster representation without country score (10%)
→ establish cross-lingual link if combined similarity > 0.3
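
The weighted combination can be sketched directly from the percentages and the 0.3 threshold given above. The dictionary layout of a cluster, the key names and the toy values are assumptions made for this example.

```python
import math

# Weights from the slide: Eurovoc 40%, country score 30%, names 20%, keywords 10%.
WEIGHTS = {"eurovoc": 0.4, "country": 0.3, "names": 0.2, "keywords": 0.1}


def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def clds(c1, c2, weights=WEIGHTS):
    """Weighted sum of the four per-representation cosine similarities."""
    return sum(w * cosine(c1[k], c2[k]) for k, w in weights.items())


def linked(c1, c2, threshold=0.3):
    return clds(c1, c2) > threshold


# Toy clusters: same subject domains and countries, different names and words.
en = {"eurovoc": {"d1": 2, "d2": 1}, "country": {"us": 3},
      "names": {"id1": 1}, "keywords": {"verdict": 1}}
fr = {"eurovoc": {"d1": 2, "d2": 1}, "country": {"us": 3},
      "names": {"id2": 1}, "keywords": {"verdict": 0, "jury": 1}}
print(round(clds(en, fr), 2), linked(en, fr))  # 0.7 True
```

Since the first three representations consist of language-independent identifiers (descriptor codes, country codes, name IDs), no language pair-specific resource is needed; only the keyword vectors rely on surface overlap.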

72
Cross-lingual cluster linking: evaluation

Evaluation results depending on similarity threshold
Ingredients: 40/30/30 (names not yet considered)
Evaluation for EN → FR and EN → IT (136 EN clusters)

Recall at 15% similarity threshold: 100%
For details, see Pouliquen et al. (CoLing 2004)
73
Filter out bad links by exploiting all
cross-lingual links
Assumption: If EN is linked to FR, ES, IT, ..., FR should also be linked to ES, IT, ... If not: lower link likelihood
74
Filter out bad links by exploiting all
cross-lingual links

Build a second similarity, based on the first. It uses the following input:
1) the number of links between the set of clusters in the other languages
2) the strength (or similarity level) of these links
3) the number of potential links between the set of clusters in the other languages (which means all the links minus those between clusters in the same language)
Empirical formula:
similarity_2 = similarity_1 * (number_of_links / number_of_potential_links) * square_root(number_of_potential_links)
Result: elimination of some wrong links
(No formal evaluation results available)

75
Conclusion: Topic Detection and Tracking

(Multi)-Monolingual TDT is a relatively well explored area
Use vector space models
Use named entities, etc.
Presentation of two highly multilingual TDT applications
NewsBrief: live clustering (vector space)
NewsExplorer: daily clustering (vector space enhanced with geographical information)

76
Conclusion
77
Conclusion: Cross-lingual linking of documents (clusters)

State-of-the-art approaches to cross-lingual document similarity calculation
Use Machine Translation
Use bilingual dictionaries
Use bilingual document space (LSA, KCCA)
→ restricted to small number of languages
Alternative proposal: link documents across languages via anchors
Use different entity types (persons, organisations, locations)
Use subject domain classification
Exploit cognates
Further anchors: measurement expressions (time, speed, volume, ...)
Terminology from specialist dictionaries

78
Conclusion: Cross-lingual linking of documents (2)

Performance is text type-dependent
In order to use named entities (and measurement
expressions, etc.) as anchors, they frequently
need normalising
Different writing systems
Transliteration variants
Morphological variants
Spelling variants (even monolingually)

79
Conclusion: Multilinguality

In a highly multilingual context, it is an advantage
To use language-independent rules (with language-specific resources, if needed)
To use simple rules that can easily be adapted to new languages
To avoid language pair-specific resources
Downside: lower performance than best-performing monolingual systems
Advantage: highly multilingual applications are made possible without too much effort

80
Current and future work

Improve each of the individual components of the
system
Re-implement some of the tools more efficiently
Concentrate on extracting more structured
information
Relations between persons (co-occurrence → criticise, support, family relationship, ...)
Events: Who did What to Whom, Where and When?
Not possible with simple bag-of-word approaches
More language-specific effort needed → restricted to fewer languages

81
Thank you!
