Geocoding Multilingual Texts: Recognition, Disambiguation and Visualisation

1
Multilingual and cross-lingual topic detection and tracking
SEPLN 2007, Sevilla, Spain, 12 September 2007
Ralf Steinberger, with Bruno Pouliquen, Erik van der Goot, Olivier Deguernel, Camelia Ignat
European Commission Joint Research Centre (JRC)
http://langtech.jrc.it/
http://press.jrc.it/NewsExplorer
2
DG JRC - Who we are
3
Agenda
  • Background
  • Introduction to the terms Topic Detection (TD)
    and Topic Tracking (TT) → TDT
  • Known approaches to cross-lingual document
    similarity calculation
  • NewsBrief: multi-monolingual live clustering (TDT)
    (http://press.jrc.it/)
  • Demo of NewsExplorer
    (http://press.jrc.it/NewsExplorer/)
  • Technical details on how NewsExplorer works
  • NER: person and organisation names
  • NER: recognition and disambiguation of
    geographical references
  • Multilingual subject domain categorisation
  • NewsExplorer: multi-monolingual daily clustering
    (TDT)
  • Cross-lingual cluster linking
  • Conclusion

4
Background: TDT, CL
5
Topic Detection and Tracking (TDT) - Background
  • US-American DARPA program TDT (1997-2004).
  • TDT refers to automatic techniques for locating
    topically related material in streams of data
    such as newswire and broadcast news. (Wayne
    2000)
  • Topic: e.g. the Oklahoma City bombing in 1995, incl.
    memorial services, investigations, prosecution,
    etc.
  • Topic ≠ category (bombing)!
  • Since 2000 part of Translingual Information
    Detection, Extraction, and Summarization (TIDES).
  • The goal of the TIDES program is to enable
    English-speaking users to access, correlate, and
    interpret multilingual sources of real-time
    information and to share the essence of this
    information with collaborators. (English,
    Chinese, Arabic, with some research on Korean and
    Spanish)
  • TDT and TIDES explanations and images borrowed
    from http://www.nist.gov/speech/tests/tdt/tides.htm

6
5 TDT Sub-tasks (each formally evaluated)
7
TDT summary of approaches (Wayne 2000)
  • Means used by participants
  • Stop words, stemming, TF.IDF weighting
  • Using single documents or clusters of documents
  • (Incremental) vector space models
  • Using cosine or Okapi, k-NN clustering
  • Various types of normalisation across sources,
    languages and topics
  • Cross-lingual topic tracking
  • Chinese to English, later also Arabic to English
    → focus on English as target language
  • Chinese Machine Translation (MT, Systran) results
    were given to participants
  • Participants could use other means instead; some
    experimented with bilingual dictionaries.

8
Insights from past research
  • TDT Program Insights (Wayne 2000)
  • TDT techniques can work well in languages very
    different from English: similar performance for
    monolingual Chinese and English
  • Lower performance for cross-lingual tracking
    (performance impacted by translation errors)
  • Making use of named entities (people,
    organisations, locations) helped (Chen & Ku
    2002).
  • Larkey et al. (2004), native language
    hypothesis: Topic Tracking works better in the
    original language than in (MT)-translated
    collections.

9
Approaches to cross-lingual document similarity
calculation (1)
  • How to find out whether two texts in different
    languages are related?
  • Most common approach (until today): use MT or
    bilingual dictionaries to translate into
    English, then use monolingual methods to
    calculate similarity.
  • Using MT (e.g. Leek et al. 1999 for
    Chinese-Mandarin to English): 50% performance
    loss when using MT
  • Using bilingual dictionaries (e.g. Wactlar 1999
    for Serbo-Croatian to English; Urizar & Loinaz
    2007 for Basque, Spanish and English)
  • In TDT 1999, the better results were achieved
    using MT

10
Approaches to cross-lingual document similarity
calculation (2)
  • Automatically produce bilingual lexical space for
    bilingual document representation and document
    similarity calculation, e.g.
  • Bilingual Latent Semantic Analysis
    (LSA) (Landauer & Littman 1991)
  • Kernel Canonical Correlation Analysis
    (KCCA) (Vinokourov et al., 2002)
  • Achieved results are relatively good
  • Bilingual approach is restricted to a few
    languages (OK for English as target lang.)
  • Language pairs: $(N^2 - N) / 2$ (N = number of
    languages)
  • EU: 22 official languages → 231 language pairs
    (462 language pair directions)!
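
The pair count on this slide is easy to check. The following minimal sketch assumes nothing beyond the formula itself; the function names are illustrative only.

```python
def language_pairs(n: int) -> int:
    """Unordered language pairs among n languages: (n^2 - n) / 2."""
    return (n * n - n) // 2


def pair_directions(n: int) -> int:
    """Ordered pairs (source -> target): twice the unordered count."""
    return n * n - n


print(language_pairs(22), pair_directions(22))  # 231 462
```

The quadratic growth is the argument against language pair-specific resources: each added language requires a resource for every language already covered.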

11
Approaches to cross-lingual document similarity
calculation (3)
  • Alternative: use entities as anchors
  • Names of persons and organisations
  • Names of locations
  • Units of measurements
  • Time
  • Speed
  • Temperature
  • Acceleration
  • Multilingual specialist dictionaries (MeSH for
    medicine, etc.)
  • Normalise these expressions
  • → Use as a kind of interlingua: no language
    pair-specific resource needed
  • Steinberger Ralf, Pouliquen Bruno & Camelia Ignat
    (2004). Providing cross-lingual information
    access with knowledge-poor methods. Informatica
    28-4, pp. 415-423.

12
The EMM news data
13
Europe Media Monitor (EMM) News Aggregation
  • Best et al. 2005
  • External site (http://press.jrc.it/)
  • Scrapes 1000 news portals world-wide for new
    news articles
  • Up to every 10 minutes
  • Standardises input format (UTF-8-encoded RSS
    format)
  • 35,000 news articles per day
  • Articles in 34 languages
  • 3 public systems
  • NewsBrief
  • Medical Information System MedISys
  • NewsExplorer

14
EMM NewsBrief
  • Best et al. 2005
  • Public site: http://press.jrc.it/
  • Uses all EMM news data
  • Categorises news into 600 categories, using
    Boolean search word combinations (plus optional
    weights plus vicinity operators)
  • Clusters and tracks news live
    (multi-monolingually)
  • Sends out email notifications for each category
  • Detects breaking news
  • Short-term story tracking

15
EMM Medical Information System (MedISys)
  • Fuart et al. 2007; Yangarber et al. 2007
  • Public site: http://medusa.jrc.it/
  • Uses all EMM news data
  • Selects articles of relevance to Public Health
    (diseases, symptoms, health organisations)
  • Categorises news into 250 categories, using EMM
    functionality
  • Detects breaking news for each category and
    country
  • All other EMM functionality

16
MedISys Automatic Email Alert
17
EMM NewsExplorer
  • Steinberger et al. 2005
  • http://press.jrc.it/NewsExplorer
  • Uses public EMM news data
  • Publicly accessible news aggregation and analysis
    system
  • Clusters related news once per day in 19
    languages
  • Links clusters over time and across languages →
    event time lines
  • Extracts references to locations, persons, and
    other entities
  • Collects historical information about named
    entities across languages

18
Multilingual Clustering & Tracking
19
NewsBrief live clustering (multi-monolingual)
  • Multilingual, language-independent algorithm
  • Live clustering of incoming news every ten
    minutes (Topic Detection)
  • All articles that fall in a sliding 4-hour window
    (up to 8 hours when < 200 articles)
  • Using 100 stop words (most frequent words among
    last 50,000 articles)
  • No word normalisation
  • Document representation: word frequency (except
    stop words) of first 200 words only
  • Similarity measure: cosine
  • Hierarchical clustering with group-averaging
  • Minimum size: two non-identical documents from
    two different sources
  • Similarity threshold: 0.6 or 0.8 or 0.9,
    depending on vector sparseness (En: 0.6)
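
The document representation and similarity measure listed above can be sketched as follows. This is a minimal illustration, not the NewsBrief implementation: the tokeniser, the decision to cut to 200 words before removing stop words, and the function names are assumptions.

```python
import math
import re
from collections import Counter


def doc_vector(text, stop_words, max_words=200):
    """Word-frequency vector over the first `max_words` tokens.

    No lowercasing or stemming is applied, in line with
    "no word normalisation" on the slide.
    """
    words = re.findall(r"\w+", text)[:max_words]
    return Counter(w for w in words if w not in stop_words)


def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
        math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


stop = {"in", "the", "of"}
a = doc_vector("Jury reaches verdict in Jackson trial", stop)
b = doc_vector("Jackson jury reaches verdict", stop)
print(round(cosine(a, b), 2))
```

Because no normalisation is applied, "Jury" and "jury" count as different words here, which is one reason sparse vectors may need a lower similarity threshold.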

20
NewsBrief live clustering (multi-monolingual) (2)
  • Cluster linking (into a story)
  • Link clusters if at least 10% overlap of articles
  • Inherit articles that would fall out due to
    window time constraint, or due to shifting
    word-base
  • Story ends if no new articles in window
  • Longest stories last a few days and have a few
    hundred articles
  • Big stories have 50 current articles plus
    inherited ones
  • Stories can merge (inherit from both previous
    clusters)
  • Stories cannot currently split
  • Approx. 75 English clusters for a 24-hour period

21
NewsBrief live clustering (multi-monolingual) (3)
  • Story finalisation
  • if cluster is not linked in current window
  • Breaking news detection (red dots → email
    alerts)
  • If story is less than 1 hour old
  • If at least 3 articles from 3 different sources
  • If at least 75% of articles from the last hour
    came in during the last 30 minutes
  • Breaking news update (red dots)
  • If story is older than 1 hour
  • If at least 3 articles from 3 different sources
  • If at least 75% of articles from the last 60 minutes
    came in during the last 30 minutes
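
The two rule sets above share the same burst test and differ only in the age of the story. A minimal sketch, assuming that "75" means 75% and that each article is available as a (source, publication time) pair; all names are hypothetical:

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)
HALF_HOUR = timedelta(minutes=30)


def has_burst(articles, now):
    """At least 3 articles from 3 sources, and at least 75% of the
    last hour's articles arrived within the last 30 minutes."""
    if len(articles) < 3 or len({src for src, _ in articles}) < 3:
        return False
    last_hour = [t for _, t in articles if now - t <= HOUR]
    last_half = [t for t in last_hour if now - t <= HALF_HOUR]
    return bool(last_hour) and len(last_half) / len(last_hour) >= 0.75


def is_breaking_news(articles, story_start, now):
    """New story (less than 1 hour old) with a burst of articles."""
    return now - story_start < HOUR and has_burst(articles, now)


def is_breaking_update(articles, story_start, now):
    """Existing story (1 hour or older) with a fresh burst of articles."""
    return now - story_start >= HOUR and has_burst(articles, now)


now = datetime(2007, 9, 12, 12, 0)
story = [("a", now - timedelta(minutes=50)), ("b", now - timedelta(minutes=20)),
         ("c", now - timedelta(minutes=10)), ("a", now - timedelta(minutes=5))]
print(is_breaking_news(story, now - timedelta(minutes=50), now))
```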

22
NewsBrief live clustering (4) Observations
  • Clustering instead of dealing with single stories
    (first story detection, topic tracking, ...)
  • Language bias: articles sometimes cluster by
    country (UK vs. US, NL vs. BE)
  • Stories in languages other than world languages
    are likely to die over night (En, It, Es)
  • Language-independent ? multilingual
  • Interesting cross-lingual comparisons

23
EMM NewsExplorer Live Demo
  • http://press.jrc.it/NewsExplorer

24
NewsExplorer Technical details
25
NewsExplorer - Cross-lingual cluster linking
  • Language-independent features for multilingual
    document representation
  • No MT or bilingual dictionaries
  • 19 languages
  • Sim1 (40%): Multilingual Eurovoc subject
    domains
  • Sim2 (30%): Geo-locations
  • Sim3 (20%): Name variants
  • Sim4 (10%): Cognates and numbers

26
NER: PER & ORG
27
Multilingual name recognition and variant merging
28
NER: Known person & organisation names
  • Lookup of known names from database
  • Currently over 630,000 names
  • 135,000 variants
  • Only 50,000 have been found in five different
    clusters or more
  • Pre-generate morphological variants (Slovene
    example)
  • Tony(a|o|u|om|em|m|ju|jem|ja)?\sBlair(a|o|u|om|em|m|ju|jem|ja)?
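
The pre-generated variant pattern can be tried out directly. The sketch below assumes the suffix list reads a, o, u, om, em, m, ju, jem, ja (the alternation bars are not preserved in the transcript) and that the same suffixes apply to both name parts; `name_pattern` is a hypothetical helper.

```python
import re

# Assumed Slovene inflection suffixes for person names.
SUFFIXES = "a|o|u|om|em|m|ju|jem|ja"


def name_pattern(first, last):
    """Regex matching a two-part name with optional suffixes on each part."""
    return re.compile(
        rf"\b{re.escape(first)}(?:{SUFFIXES})?\s"
        rf"{re.escape(last)}(?:{SUFFIXES})?\b"
    )


pattern = name_pattern("Tony", "Blair")
for text in ["Tony Blair", "Tonyja Blaira", "Tonyjem Blairom", "Tony Blairs"]:
    print(text, bool(pattern.search(text)))
```

The trailing `\b` keeps the pattern from accepting strings where the name is followed by an unlisted suffix.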

Live name variants
29
NER: New person & organisation names (1)
  • Guessing names using empirically-derived lexical
    patterns
  • Identification of about 450 unknown names per day
  • 50 of those are automatically merged with known
    names
  • Trigger word(s) + Uppercase Words (+ name
    particles: von, van, de la, abu, bin, ...)
  • President, Minister, Head of State, Sir, American
  • death of, [0-9]+-year-old, ...
  • Known first names (John, Jean, Hans, Giovanni,
    Johan, ...)
  • Combinations: 56-year-old former prime
    minister Kurmanbek Bakiyev
  • Use bootstrapping to produce a trigger word list
    for a new language
  • Small initial trigger word list
  • Produce frequency list of contexts of known names
  • Manual selection

For details, see Steinberger & Pouliquen (LI
30.1, 2007)
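
The "trigger word(s) + uppercase words" idea can be sketched with a single regular expression. This is an illustration only: the trigger and particle lists are tiny samples taken from the slide, and the requirement of at least two uppercase words is an assumption, not the system's actual rule set.

```python
import re

# Sample triggers from the slide; a real list would be much longer.
TRIGGERS = ["president", "prime minister", "minister", "head of state",
            "sir", "former", r"[0-9]+-year-old"]
PARTICLES = ["von", "van", "de la", "abu", "bin"]

trigger_alt = "|".join(sorted(TRIGGERS, key=len, reverse=True))
particle_alt = "|".join(PARTICLES)
upper = r"[A-Z][\w'-]+"
# A name: uppercase word, then more uppercase words, optionally
# separated by a lowercase name particle.
name = rf"{upper}(?:\s(?:(?:{particle_alt})\s)?{upper})+"

# Triggers are matched case-insensitively; name words are not.
PATTERN = re.compile(rf"\b(?:(?i:{trigger_alt})\s)+({name})")


def guess_names(text):
    """Return candidate person names that follow one or more triggers."""
    return [m.group(1) for m in PATTERN.finditer(text)]


print(guess_names("56-year-old former prime minister Kurmanbek Bakiyev said"))
```

Chaining triggers with `(?: ... \s)+` is what lets combinations such as "56-year-old former prime minister" introduce a single name.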
30
NER: Most frequent trigger words across languages
31
NER: Inflection of trigger words
  • Inflection of trigger words for person names,
    using regular expressions (Slovene example)
  • kandidat(a|u|om)?
  • legend(a|e|i|o)
  • milijarder(ja|ju|jem)?
  • predsednik(a|u|om|em)?
  • predsednic(a|e|i|o)
  • ministric(a|e|i|o)
  • sekretar(ja|ju|jom|jem)?
  • diktator(ja|ju|jem)?
  • playboy(a|u|om|em)?

Trigger + uppercase words, example:
"verskega voditelja Moktade al Sadra je z
notranjim" → Muqtada al-Sadr (ID 236)
32
Name Variants
  • Adding names from web sources
  • Merging NewsExplorer name variants
  • Transliteration
  • Normalisation
  • Similarity measure

33
Name Recognition - Evaluation results (2005)
For details, see Steinberger & Pouliquen (LI
30.1, 2007)
34
Name transliteration
  • Currently, EMM NewsExplorer transliterates from
    Arabic, Farsi, Greek, Russian and Bulgarian
  • Transliterate each character, or sequence of
    characters, into its Latin equivalent
  • ψ → ps
  • λ → l
  • μπ → b
  • Hard-code some common transliterations: Джордж
    (DJORDJ) → "George", Джеймс → "James", ...
  • Examples of transliterations
  • Κόφι Ανάν, Greek → Kofi Anan
  • Кофи Аннан, Russian → Kofi Anan
  • Кофи Анан, Bulgarian → Kofi Anan
  • كوفي عنان, Arabic → Kufi Anan
  • कोफ़ी अन्नान, Hindi → Kofi Anan
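
Character-sequence transliteration with a hard-coded exception list can be sketched as below. The mapping table is a small illustrative sample for Greek (only ps, l and b come from the slide; the other entries and the Cyrillic spellings in the exception list are assumptions), and longest-match-first is one plausible way to handle multi-character sequences.

```python
import unicodedata

# Illustrative sample; a real table covers the whole alphabet.
TABLE = {"μπ": "b", "ψ": "ps", "λ": "l", "κ": "k", "ο": "o",
         "φ": "f", "ι": "i", "α": "a", "ν": "n"}
HARD_CODED = {"Джордж": "George", "Джеймс": "James"}
MAX_KEY = max(len(key) for key in TABLE)


def strip_marks(text):
    """Drop accents (e.g. the Greek tonos) via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate_word(word):
    if word in HARD_CODED:
        return HARD_CODED[word]
    plain = strip_marks(word.lower())
    out, i = [], 0
    while i < len(plain):
        # Try the longest character sequence first (e.g. "μπ" before "μ").
        for size in range(MAX_KEY, 0, -1):
            chunk = plain[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += len(chunk)
                break
        else:
            out.append(plain[i])  # unknown character: keep as-is
            i += 1
    return "".join(out).capitalize()


def transliterate(text):
    return " ".join(transliterate_word(w) for w in text.split())


print(transliterate("Κόφι Ανάν"))
```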

35
Name normalisation: Why?
  • Transliteration rules depend on the target
    language, e.g.
  • Владимир Устинов (Russian)
  • Vladimir Ustinov (English)
  • Wladimir Ustinow (German)
  • Vladimir Oustinov (French)
  • Various ways to represent the same sound (sh,
    sch, ch, š, ...), e.g.
  • Bašar al Assad
  • Baschar al Assad
  • Bachar al Assad
  • Diacritics, e.g.
  • Wałęsa → Walesa
  • Saïd → Said
  • Schröder → Schroder
  • → Edit distance is large for naturally occurring
    word variants
  • "Rafik Harriri" vs. "Rafiq Hariri" → 2
  • "Rfk Hrr" vs. "Rafiq Hariri" → 6

36
Name normalisation (2) - 30 Rules
  • Latin normalisation: Malik al-Saïdoullaïev
  • accented character → non-accented
    equivalent: Malik al-Saidoullaiev
  • double consonant → single consonant: Malik
    al-Saidoulaiev
  • ou → u: Malik al-Saidulaiev
  • al- → (removed): Malik Saidulaiev
  • wl (beginning of name) → vl
  • ow (end of name) → ov
  • ck → k
  • ph → f
  • ž → j
  • š → sh
  • x → ks
  • Remove vowels
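
A handful of the normalisation rules can be chained as in the following sketch. It covers only the rules legible on this slide, lowercases the input for simplicity, and applies the rules in an assumed order; the real system uses about 30 rules whose exact form is not given here.

```python
import re
import unicodedata


def strip_accents(text):
    """Replace accented characters by their non-accented equivalents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(name):
    s = name.lower()                      # assumption: case is ignored
    s = s.replace("ž", "j").replace("š", "sh")   # before accent stripping
    s = strip_accents(s)
    s = re.sub(r"([b-df-hj-np-tv-z])\1", r"\1", s)  # double consonant
    s = s.replace("ou", "u")
    s = re.sub(r"\bal-", "", s)           # drop the particle "al-"
    s = re.sub(r"\bwl", "vl", s)          # wl at beginning of name
    s = re.sub(r"ow\b", "ov", s)          # ow at end of name
    s = s.replace("ck", "k").replace("ph", "f").replace("x", "ks")
    return s


def consonant_signature(name):
    """Normalised form with vowels removed."""
    return re.sub(r"[aeiou]", "", normalise(name))


print(normalise("Malik al-Saïdoullaïev"))
print(consonant_signature("Malik al-Saïdoullaïev"))
```

With these rules the German and French spellings from the previous slide, "Wladimir Ustinow" and "Vladimir Oustinov", both reduce to the same normalised string.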

37
Similarity measure for name merging
  • To compare 450 new names every day with 800,000
    known name variants
  • Only if the transliterated, normalised form with
    vowels removed is identical
  • Calculate edit distance variant similarity using
    two different representations

20 80 Condition
38
Merging name variants some results
  • Threshold: 0.94 (100% precision on the test set)
  • NewsExplorer: 450 new names every day
  • 50 are automatically merged (11%)
  • 42 are saved for expert judgment (9%)
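
An edit-distance-based merge decision could look like the sketch below. The Levenshtein distance is standard; the similarity formula (1 minus distance over the longer length), the 0.2/0.8 weighting of the two representations, and which representation receives which weight are assumptions made for illustration. Only the 0.94 threshold is taken from the slide.

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def merge_score(first_repr, second_repr, w_first=0.2, w_second=0.8):
    """Weighted similarity over two representations of a name pair.

    Each argument is a (variant_a, variant_b) tuple in one representation.
    """
    return (w_first * similarity(*first_repr)
            + w_second * similarity(*second_repr))


def should_merge(first_repr, second_repr, threshold=0.94):
    return merge_score(first_repr, second_repr) >= threshold


print(edit_distance("Rafik Harriri", "Rafiq Hariri"))  # 2
print(edit_distance("Rfk Hrr", "Rafiq Hariri"))        # 6
```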

39
Person name recognition and variant merging
Result
Name variants
Trigger words
live
40
Geo-coding
41
Previous work
  • Aim: multilingual text to unique identifier
  • MUC-7, etc.: identify and classify (PER vs. LOC)
  • Leidner (2007)
  • Geo-CLEF
  • For toponym disambiguation, people work on 1-2
    languages
  • MetaCarta (http://www.metacarta.com): commercial,
    English only
  • Mikheev (1999): gazetteer needed

For details, see Pouliquen et al. (LREC 2006)
42
NER Geographical Locations
  • Aim: multilingual text to map
  • Major challenges
  • Solution: procedure using 6 different heuristics,
    and their interaction
  • Evaluation and results

43
Major challenges for geo-coding (1)
  • Place homographs with common words

44
Major challenges for geo-coding (2)
  • Place homographs with people's first and last
    names

45
Major challenges for geo-coding (3)
  • Homographic place names

46
Major challenges for geo-coding (4)
  • Completeness of gazetteer: multilinguality
    (exonyms), endonyms, historical variants, e.g.
  • Санкт-Петербург, Saint Petersburg, Saint
    Pétersbourg, ...
  • Leningrad, Petrograd, ...
  • Morphological variation / Inflection
  • Romanian: Parisului (of Paris)
  • Estonian: Londonit (London),
    New Yorgile (New York)
  • Arabic: albaRiziun (the Paris inhabitants)

47
Proposed solutions Multilingual gazetteer
  • We combined three different sources
  • Global Discovery database of place names
    (GeoNet)
  • > 500,000 place names
  • In English and in local language (but in Roman
    script)
  • Contains 6 size classes
  • KNAB database of exonyms and historical variants
  • Institute of the Estonian Language
  • Venezia, Venice, Venise, Venedig, ...
  • Istanbul, Constantinople, Istamboul, Istanbul, ...
  • European Commission-internal document
  • in the 11 languages of the pre-Enlargement EU
  • Country name
  • Capital name
  • Inhabitant name
  • Currency name
  • Country adjective

48
Proposed solutions morphological variation
  • English: London; Finnish: Lontoo (nominative
    case)
  • Lontoossa (in London), Lontoosta (from
    London)
  • Lontoon (London's), Lontoolaisen
    (Londoner, of London)
  • Lontooseen (to London)
  • 3 options
  • Use morphological analyser software
  • Pre-generate all inflection forms, using suffix
    replacement rules
  • Strip/replace suffixes of uppercase words not
    found in gazetteer and check again
  • e.g. Finnish Lontoosta → Lontoo
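
The third option (strip or replace suffixes, then look the word up again) can be sketched as follows. The gazetteer and the suffix rules are toy data chosen to cover the Finnish and Romanian examples on these slides; they are not the system's actual rule set.

```python
# Toy gazetteer and suffix rules, for illustration only.
GAZETTEER = {"Lontoo", "Paris"}
SUFFIX_RULES = [("ssa", ""), ("sta", ""), ("seen", ""), ("ului", ""), ("n", "")]


def lookup(word, gazetteer=GAZETTEER, rules=SUFFIX_RULES):
    """Return the gazetteer entry for `word`, or None if nothing matches."""
    if word in gazetteer:
        return word
    if not word[:1].isupper():
        return None  # only uppercase words are candidate place names
    for suffix, replacement in rules:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in gazetteer:
                return candidate
    return None


for form in ["Lontoossa", "Lontoosta", "Lontoon", "Parisului", "Berlin"]:
    print(form, "->", lookup(form))
```

Checking the gazetteer after each candidate keeps over-stripping harmless: "Berlin" loses its final "n" under the last rule, but "Berli" is not a known place, so no match is returned.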

Tony(a|o|u|om|em|m|ju|jem|ja)?\sBlair(a|o|u|om|em|m|ju|jem|ja)?
49
Proposed solutions Disambiguation heuristics (1)
  • Language-independent heuristics
  • Using language-specific resources
  • 2 Binary filters
  • Several preference rules
  • Formula that combines them all
  • Geo-stop words
  • 5489 English Geo-stop words
  • binary filter

50
Proposed solutions Disambiguation heuristics (2)
  • Location only if not part of a person name
  • binary filter
  • e.g. Kofi Annan, Annan
  • Size class information
  • Bigger places preferred
  • Weight → preference rule

51
Proposed solutions Disambiguation heuristics (3)
  • Country context
  • Default: publication place of newspaper
  • Two-level approach
  • Identify unambiguous places of levels 0, 1 or 2
  • Identify places of levels 3 to 6 only if in
    countries identified in step 1.
  • Trigger words for locations
  • Using simple rules (city/village/town of Ispra)
    did not produce useful results.
  • → test ML methods

52
Proposed solutions Disambiguation heuristics (4)
  • Kilometric distance
  • If one of the homographic places is nearby
    non-ambiguous places, prefer this over other
    homographic places.
  • E.g. from Warsaw to Brest
  • Brest (France) 2000 km from Warsaw
  • Brest (Belarus) 200 km from Warsaw
  • For calculation of minimum kilometric distance,
    use formula by Sinnott (1984)
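
Sinnott (1984) is the usual reference for the haversine great-circle formula, so the distance step can be sketched as below. The coordinates are approximate values supplied for illustration and the Earth radius is the common mean value; neither comes from the slides.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates (latitude, longitude), for illustration only.
WARSAW = (52.23, 21.01)
BREST_BY = (52.10, 23.70)
BREST_FR = (48.39, -4.49)

print(round(haversine_km(*WARSAW, *BREST_BY)))  # roughly 200 km
print(round(haversine_km(*WARSAW, *BREST_FR)))  # roughly 2000 km
```

In a text mentioning Warsaw, the much shorter distance favours the Belarusian Brest over the French one.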

53
Proposed solutions Combination of rules (1)
  • Apply binary rules
  • Ignore uppercase words that are name parts (Mr.
    Kofi Annan → Annan)
  • Ignore Geo-stopwords
  • For remaining ambiguous place names, calculate a
    score. The highest score wins.
  • Parameters were empirically derived to perform
    optimally on a given test set

For details, see Kimler (2004)
54
Proposed solutions Combination of rules (2)
  • kilometricWeight = Arc-Cotangent(kilometricDistance):
    this distance is weighted using the arc-cotangent
    formula (Bronstein et al., 1999), with an inflexion
    point set to 300 kilometres, as shown in the equation
  • kilometricDistance d: minimum distance between
    the place and all unambiguous places (according
    to formula by Sinnott 1984)
  • Example: from Warsaw to Brest
  • Brest (France): 2000 km from Warsaw
  • Brest (Belarus): 200 km from Warsaw
  • Both Brest are size 3 (classScore 30).
  • Brest (FR) has kilometricWeight of 0.05
  • Brest (BY) has kilometricWeight of 0.85

Observation: distance < 200 km is very
significant; distances > 500 km do not make a
difference → inflection point at 300 km
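
The equation itself is not preserved in this transcript. The sketch below shows one arc-cotangent weighting that is consistent with the description (inflection point at 300 km, weight close to 1 for nearby places and close to 0 beyond 500 km). The 100 km scale factor and the normalisation by the value at distance 0 are assumptions, so the numbers need not match the slide's 0.85 and 0.05 exactly.

```python
import math


def arccot(x):
    """Arc-cotangent with values in (0, pi)."""
    return math.pi / 2 - math.atan(x)


def kilometric_weight(distance_km, inflection_km=300.0, scale_km=100.0):
    """Distance weight in (0, 1], equal to 1 at distance 0.

    `scale_km` and the normalisation are illustrative assumptions.
    """
    return (arccot((distance_km - inflection_km) / scale_km)
            / arccot(-inflection_km / scale_km))


for d in (0, 200, 300, 500, 2000):
    print(d, round(kilometric_weight(d), 2))
```

Under these assumptions the weight at 200 km comes out a little above 0.8 and the weight at 2000 km is close to zero, which reproduces the qualitative behaviour described in the observation.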
55
Evaluation of geocoding test set
  • Document test set with many smaller and ambiguous
    place names (161 documents)
  • Comparable news stories in 5 languages, taken
    from NewsExplorer application
    (http://press.jrc.it/NewsExplorer)

56
Evaluation of geocoding results
By disambiguation technique
By language
Difficult test set: Pouliquen et al. (2004) on
the same test set: F = 0.38. Same algorithm on
48 average English news texts: F = 0.94.
57
Eurovoc
58
Eurovoc Thesaurus (http://eurovoc.europa.eu/)
  • Over 6000 classes
  • Covering many different subject domains (wide
    coverage)
  • Multilingual (over 20 languages, one-to-one
    translations)
  • Developed by the European Parliament and others
  • Actively used to manually index and retrieve
    documents in large collections (fine-grained
    classification and cataloguing system)
  • Freely available for research purposes

59
Eurovoc categorisation Major challenges
  • Eurovoc is a conceptual thesaurus
  • → categorisation vs. term extraction
  • Large number of classes (> 6000)
  • Very unevenly distributed
  • Various text types (heterogeneous training set)
  • Multi-label categorisation (both for training and
    assignment)
  • E.g.
  • SPORT
  • PROTECTION OF MINORITIES
  • CONSTRUCTION AND TOWN PLANNING
  • RADIOACTIVE MATERIALS

60
Eurovoc categorisation Approach
  • Profile-based, category ranking task
  • Training: identification of most significant
    words for each class
  • Assignment: combination of measures to calculate
    similarity between profiles and new document
  • Empirical refinement of parameter settings
  • Training
  • Stop words
  • Lemmatisation
  • Multi-word terms
  • Consider number of classes of each training
    document
  • Thresholds for training document length and
    number of training documents per class
  • Methods to determine significant words per
    document (log-likelihood vs. chi-square, etc.)
  • Choice of reference corpus
  • Assignment
  • Selection and combination of similarity measures
    (cosine, okapi, ...)
  • ...
  • For details, see Pouliquen et al. (Eurolan 2003)

61
Assignment Result (Example)
Title Legislative resolution embodying
Parliament's opinion on the proposal for a
Council Regulation amending Regulation No 2847/93
establishing a control system applicable to the
common fisheries policy (COM(95)0256 - C4-0272/95
- 95/ 0146(CNS)) (Consultation procedure)
62
Results of automatic evaluation across languages
(F1 per document at rank 6)
Human evaluation (correct descriptors, compared
to inter-annotator agreement): English 83%,
Spanish 80%
With pre-processing (French: only stop words)
Without pre-processing
63
Eurovoc indexing Result
  • Ranked list of (100) Eurovoc descriptor codes
    found for each news cluster

64
The JRC-Acquis parallel corpus in 22 languages
  • Freely available for research purposes on our web
    site: http://langtech.jrc.it/JRC-Acquis.html
  • For details, see Steinberger et al. (2006, LREC)
  • Total of over one billion words
  • Pair-wise alignment for all 231 language pairs!
  • Most documents have been Eurovoc-classified
    manually
  • useful for
  • Training of multilingual subject domain
    classifiers.
  • Creation of multilingual lexical space (LSA,
    KCCA)
  • Training of automatic systems for Statistical
    Machine Translation.
  • Producing multilingual lexical or semantic
    resources such as dictionaries or ontologies.
  • Training and testing multilingual information
    extraction software.
  • Automatic translation consistency checking.
  • Testing and benchmarking alignment software
    (sentences, words, etc.), across a larger variety
    of language pairs.
  • All types of multilingual and cross-lingual
    research.

65
Monolingual TDT
66
Clustering Monolingual document representation
  • Vector of keywords and their keyness using
    log-likelihood test (Dunning 1993)

Original cluster: Michael Jackson Jury Reaches Verdicts
Keyness   Keyword
109.24    jackson
 41.54    neverland
 37.93    santa
 32.61    molestation
 24.51    boy
 24.43    pop
 20.68    documentary
 18.79    accuser
 13.59    courthouse
 11.12    jury
 10.08    ranch
  9.60    california
  9.39    verdict
  7.56    testimony
  6.50    maria
  4.09    michael
  1.73    reached
  1.68    ap
  1.05    appeared
  0.53    child
  0.50    trial
  0.45    monday
  0.26    children
  0.09    family
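
A keyness score of this kind can be computed with a log-likelihood (G²) statistic. The sketch below uses the two-cell corpus-comparison form that is commonly applied to keyword extraction; whether the system uses this simplified form or Dunning's full four-cell contingency table is not stated, and the frequencies in the example are invented.

```python
import math


def log_likelihood(freq_doc, freq_ref, size_doc, size_ref):
    """G2 keyness of a word.

    freq_doc / size_doc: word count and total tokens in the document
    (or cluster); freq_ref / size_ref: the same for a reference corpus.
    """
    total = freq_doc + freq_ref
    expected_doc = size_doc * total / (size_doc + size_ref)
    expected_ref = size_ref * total / (size_doc + size_ref)
    g2 = 0.0
    if freq_doc:
        g2 += freq_doc * math.log(freq_doc / expected_doc)
    if freq_ref:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * g2


# Invented counts: a word used 10 times in a 100-token text but only
# 10 times in a 100,000-token reference corpus is highly "key".
print(round(log_likelihood(10, 10, 100, 100_000), 1))
print(round(log_likelihood(10, 100, 100, 1_000), 1))  # proportional use
```

A word used in the same proportion in both collections scores zero, so only words that are unusually frequent in the cluster end up with high keyness.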
67
Calculation of a text's Country Score
  • Aim: show to what extent a text talks about a
    certain country
  • Sum of references to a country, normalised using
    the log-likelihood test
  • Add country score vector to keyword vector

Keyword vector with country scores (us, gb, il, br) added
Keyness    Keyword
109.2478   jackson
 41.5450   neverland
 37.9347   santa
 32.6105   molestation
 24.5193   boy
 24.4351   pop
 20.6824   documentary
 18.7973   accuser
 13.5945   courthouse
 11.1224   jury
 10.4184   us
 10.0838   ranch
  9.6021   california
  9.3905   verdict
  7.5620   testimony
  6.5014   maria
  4.0957   michael
  1.7368   reached
  1.6857   ap
  1.5610   gb
  1.5610   il
  1.5610   br
  1.0520   appeared
  0.5384   child
  0.5045   trial
  0.4502   monday
  0.2647   children
  0.0946   family
68
Multi-monolingual news clustering
  • Input: vectors consisting of keywords and country
    score
  • Similarity measure: cosine
  • Method: bottom-up group average unsupervised
    clustering
  • Build the binary hierarchical clustering tree
    (dendrogram)
  • Retain only big nodes in the tree with a high
    cohesion (empirically refined minimum intra-node
    similarity: 45%)
  • Use the title of the cluster's medoid as the
    cluster title
  • For details, see Pouliquen et al. (CoLing 2004)
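
Bottom-up group-average clustering over cosine similarities can be sketched in a few lines. This simplified version stops merging once the best pair falls below the threshold instead of building the full dendrogram and then selecting nodes, and it treats "45" as a similarity of 0.45; both choices are assumptions. The toy vectors are invented.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def group_average(c1, c2, vectors):
    """Mean pairwise similarity between members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def cluster(vectors, threshold=0.45):
    """Agglomerative clustering; returns lists of document indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, (ia, ib) = max(
            (group_average(a, b, vectors), (ia, ib))
            for (ia, a), (ib, b) in combinations(enumerate(clusters), 2)
        )
        if best < threshold:
            break
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return clusters


docs = [{"jackson": 3.0, "verdict": 1.0},
        {"jackson": 2.0, "jury": 1.0},
        {"football": 2.0, "goal": 1.0}]
print(sorted(sorted(c) for c in cluster(docs)))
```

The cluster title would then be taken from the medoid, that is, the member with the highest average similarity to the other members.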

69
Monolingual cluster linking - Evaluation
  • Link clusters historically if
  • Link within 7 days
  • Cosine cluster similarity > 0.5
  • Evaluation results depending on similarity
    threshold
  • Details: Pouliquen et al. (CoLing 2004)

70
Cross-lingual Tracking
71
Cross-lingual cluster linking combination of 4
ingredients
  • CLDS (using cosine) based on these
    representations
  • $\mathrm{CLDS} = \alpha S_1 + \beta S_2 + \gamma S_3 + \delta S_4$
  • Ranked list of Eurovoc classes (40%)
  • Country score (30%)
  • Names frequency (20%)
  • Monolingual cluster representation without
    country score (10%)
  • → establish cross-lingual link if combined
    similarity > 0.3
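
The combined cross-lingual similarity can be sketched as a weighted sum of four cosine similarities. The sketch assumes the figures 40/30/20/10 are percentage weights and that each cluster is available as four sparse vectors; the dictionary keys, descriptor codes and name identifiers are hypothetical.

```python
import math

# Assumed weights for the four ingredients (40% / 30% / 20% / 10%).
WEIGHTS = {"eurovoc": 0.4, "country": 0.3, "names": 0.2, "keywords": 0.1}


def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def clds(cluster_a, cluster_b, weights=WEIGHTS):
    """Weighted sum of per-representation cosine similarities."""
    return sum(w * cosine(cluster_a.get(key, {}), cluster_b.get(key, {}))
               for key, w in weights.items())


def is_linked(cluster_a, cluster_b, threshold=0.3):
    return clds(cluster_a, cluster_b) > threshold


# Hypothetical English and French clusters about the same event.
en = {"eurovoc": {"descriptor_1": 1.0}, "country": {"us": 1.0},
      "names": {"name_id_1": 2.0}, "keywords": {"trial": 5.0}}
fr = {"eurovoc": {"descriptor_1": 0.5}, "country": {"us": 3.0},
      "names": {"name_id_1": 1.0}, "keywords": {"procès": 4.0}}
print(round(clds(en, fr), 2), is_linked(en, fr))
```

In this toy example the keyword vectors share nothing across the two languages, yet the language-independent ingredients (subject domains, countries, name identifiers) still produce a link.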




72
Cross-lingual cluster linking evaluation
  • Evaluation results depending on similarity
    threshold
  • Ingredients: 40/30/30 (names not yet considered)
  • Evaluation for EN → FR and EN → IT (136 EN
    clusters)

Recall at 15% similarity threshold: 100%
For details, see Pouliquen et al. (CoLing 2004)
73
Filter out bad links by exploiting all
cross-lingual links
Assumption: if EN is linked to FR, ES, IT, ...,
then FR should also be linked to ES, IT, ...
If not: lower link likelihood
74
Filter out bad links by exploiting all
cross-lingual links
  • Build a second similarity, based on the first. It
    uses the following input:
  • 1) the number of links between the set of
    clusters in the other languages
  • 2) the strength (or similarity level) of these
    links
  • 3) the number of potential links between the set
    of clusters in the other languages (which means
    all the links minus those between clusters in the
    same language)
  • Empirical formula
  • similarity_2 similarity_1
    (number_of_links / number_of_potential_links)
    square_root(number_of_potential_links)
  • Result elimination of some wrong links
  • (No formal evaluation results available)

75
Conclusion Topic Detection and Tracking
  • (Multi)-Monolingual TDT is a relatively well
    explored area
  • Use vector space models
  • Use named entities, etc.
  • Presentation of two highly multilingual TDT
    applications
  • NewsBrief live clustering (vector space)
  • NewsExplorer daily clustering (vector space
    enhanced with geographical information)

76
Conclusion
77
Conclusion Cross-lingual linking of documents
(clusters)
  • State-of-the-art approaches to cross-lingual
    document similarity calculation
  • Use Machine Translation
  • Use bilingual dictionaries
  • Use bilingual document space (LSA, KCCA)
  • → restricted to small number of languages
  • Alternative proposal: link documents across
    languages via anchors
  • Use different entity types (persons,
    organisations, locations)
  • Use subject domain classification
  • Exploit cognates
  • Further anchors: measurement expressions
  • time,
  • speed,
  • volume,
  • Terminology from specialist dictionaries

78
Conclusion Cross-lingual linking of documents
(2)
  • Performance is text type-dependent
  • In order to use named entities (and measurement
    expressions, etc.) as anchors, they frequently
    need normalising
  • Different writing systems
  • Transliteration variants
  • Morphological variants
  • Spelling variants (even monolingually)

79
Conclusion Multilinguality
  • In a highly multilingual context, it is an
    advantage
  • To use language-independent rules (with
    language-specific resources, if needed)
  • To use simple rules that can easily be adapted to
    new languages
  • To avoid language pair-specific resources
  • Downside: lower performance than best-performing
    monolingual systems
  • Advantage: highly multilingual applications are
    made possible without too much effort

80
Current and future work
  • Improve each of the individual components of the
    system
  • Re-implement some of the tools more efficiently
  • Concentrate on extracting more structured
    information
  • Relations between persons (co-occurrence →
    criticise, support, family relationship, ...)
  • Events: Who did What to Whom, Where and When?
  • Not possible with simple bag-of-word approaches
  • More language-specific effort needed → restricted
    to fewer languages

81
Thank you!