Kernel Canonical Correlation Analysis Language Independent Document Representation


1
Kernel Canonical Correlation Analysis (Language
Independent Document Representation)

Blaz Fortuna
Marko Grobelnik
Dunja Mladenic
Jozef Stefan Institute, Ljubljana

2
Outline

What is KCCA: intuition and theory
Preliminary results for the Acquis Communautaire (AC) corpora
Applications of KCCA
Related approaches

3
What is KCCA about?

KCCA makes it possible to represent documents in a
language-neutral way.
Intuition behind KCCA:
Given a parallel corpus (such as Acquis),
first, we automatically identify language-independent
semantic concepts from the text,
then, we re-represent the documents with the
identified concepts,
finally, we can perform cross-language
statistical operations (such as retrieval,
classification, clustering).

4
[Figure: documents in Slovenian, Slovak, German, English, Czech, French, Hungarian, Spanish, Greek, Italian, Danish, Finnish, Lithuanian, Swedish and Dutch are mapped by KCCA to a Language Independent Document Representation. A new document represented as text in any of the above languages is transformed into a new document represented in a language-neutral way, which enables cross-lingual retrieval, categorization, clustering, ...]
5
Input for KCCA

The input is a set of aligned documents.
For each document we have a version in each
language.
Documents are represented as bag-of-words vectors.

[Figure: a pair of aligned documents, one in the bag-of-words space for the English language and one in the bag-of-words space for the German language]
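
As a minimal sketch of this input, the snippet below turns a tiny aligned English-German corpus into bag-of-words vectors, one vocabulary per language. The sentences, the helper names (`build_vocabulary`, `bag_of_words`) and the whitespace tokenization are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

# Toy aligned corpus (made up for illustration): each pair is one document
# and its translation.
aligned_pairs = [
    ("customs tariff for goods", "zolltarif fuer waren"),
    ("export licence for goods", "ausfuhrlizenz fuer waren"),
]


def build_vocabulary(texts):
    """Sorted list of distinct lower-cased tokens; fixes the vector dimensions."""
    return sorted({word for text in texts for word in text.lower().split()})


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# One bag-of-words space per language.
english_vocab = build_vocabulary(en for en, _ in aligned_pairs)
german_vocab = build_vocabulary(de for _, de in aligned_pairs)

# Row i of each list describes the same document in a different language.
english_vectors = [bag_of_words(en, english_vocab) for en, _ in aligned_pairs]
german_vectors = [bag_of_words(de, german_vocab) for _, de in aligned_pairs]
```

The alignment is carried only by position: the i-th English vector and the i-th German vector belong to the same document.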
6
The Output from KCCA
The goal: find pairs of semantic dimensions that
co-appear in documents and their translations
with high correlation.
A semantic dimension is a weighted set of words.
Each pair consists of two vectors, one from e.g. the
English bag-of-words space and one from the German
bag-of-words space.

[Figure: a semantic dimension; a pair of semantic dimensions]
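
To make "a weighted set of words" concrete, here is a small sketch that lists the most important words of a semantic dimension by the absolute size of their weights. The vocabularies and weights are invented for illustration, and `top_words` is a hypothetical helper, not part of any KCCA software.

```python
def top_words(direction, vocabulary, k=3):
    """Return the k words with the largest absolute weight in the direction."""
    ranked = sorted(zip(vocabulary, direction),
                    key=lambda pair: abs(pair[1]), reverse=True)
    return [word.upper() for word, _ in ranked[:k]]


# Illustrative vocabularies and weights (not real KCCA output).
english_vocab = ["customs", "goods", "milk", "tariff", "wine"]
french_vocab = ["douane", "lait", "marchandises", "tarif", "vin"]
x = [0.7, 0.5, 0.0, 0.6, -0.1]  # direction in the English bag-of-words space
y = [0.6, 0.1, 0.5, 0.7, 0.0]   # paired direction in the French space

english_top = top_words(x, english_vocab)
french_top = top_words(y, french_vocab)
```

Reading the two word lists side by side is how a pair of dimensions can be interpreted as one language-independent concept.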
7
The Algorithm: Theory (1/2)

Formally, KCCA solves
$\max_{x, y} \mathrm{Corr}(\langle x, d_E \rangle, \langle y, d_G \rangle)$
x, y: semantic directions for English and German
$(d_E, d_G)$ is a pair of aligned documents
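
The slides do not say how the maximization is solved. As a hedged sketch, the code below uses one common regularized dual formulation with linear kernels: find the top eigenvector of (Kx + kI)^-1 Ky (Ky + kI)^-1 Kx. The function names, the regularization value `kappa` and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def linear_kernel(docs):
    """Gram matrix of mean-centered document vectors (rows are documents)."""
    centered = docs - docs.mean(axis=0)
    return centered @ centered.T


def kcca_top_pair(Kx, Ky, kappa=1.0):
    """Top pair of dual directions for regularized kernel CCA.

    Solves (Kx + kappa*I)^-1 Ky (Ky + kappa*I)^-1 Kx alpha = rho^2 alpha.
    This is one standard formulation, assumed here, not confirmed by the slides.
    """
    identity = np.eye(Kx.shape[0])
    inner = np.linalg.solve(Ky + kappa * identity, Kx)
    M = np.linalg.solve(Kx + kappa * identity, Ky @ inner)
    eigvals, eigvecs = np.linalg.eig(M)
    top = np.argmax(eigvals.real)
    rho = np.sqrt(max(eigvals.real[top], 0.0))
    alpha = eigvecs[:, top].real
    beta = np.linalg.solve(Ky + kappa * identity, Kx @ alpha) / rho
    return alpha, beta, rho


# Synthetic "aligned corpus": both views share one hidden topic signal.
rng = np.random.default_rng(0)
n_docs = 60
topic = rng.normal(size=(n_docs, 1))
english = topic @ rng.normal(size=(1, 8)) + 0.1 * rng.normal(size=(n_docs, 8))
german = topic @ rng.normal(size=(1, 6)) + 0.1 * rng.normal(size=(n_docs, 6))

Kx, Ky = linear_kernel(english), linear_kernel(german)
alpha, beta, rho = kcca_top_pair(Kx, Ky)

# Projections of the training documents onto the paired directions.
u, v = Kx @ alpha, Ky @ beta
correlation = np.corrcoef(u, v)[0, 1]
```

With a linear kernel, the word-space direction x can be recovered from the dual vector as the centered document matrix transposed times `alpha`; the regularization keeps the problem from fitting noise when documents outnumber the shared concepts.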

8
The Algorithm: Theory (2/2)
9
Examples of Semantic Dimensions from Acquis
corpus English-French (1/2)

Most important words from semantic dimensions
automatically generated from 2000 documents

Veterinary, Transport
English: DIRECTIVE, DECISION, VEHICLES, AGREEMENT, EC, VETERINARY, PRODUCTS, HEALTH, MEAT
French: DIRECTIVE, DECISION, VEHICULES, PRESENTE, RESIDUS, ACCORD, PRODUITS, ANIMAUX, SANITAIRE

Customs
English: NOMENCLATURE, COMBINED, COLUMN, GOODS, TARIFF, CLASSIFICATION, CUSTOMS
French: NOMENCLATURE, COMBINEE, COLONNE, MARCHANDISES, CLASSEMENT, TARIF, TARIFAIRES

Veterinary
English: EMBRYOS, ANIMALS, OVA, SEMEN, ANIMAL, CONVENTION, BOVINE, DECISION, FEEDINGSTUFFS
French: EMBRYONS, ANIMAUX, OVULES, CONVENTION, SPERME, EQUIDES, DECISION, BOVINE, ADDITIFS

Agriculture
English: SUGAR, CONVENTION, ADDITIVES, PIGMEAT, PRICE, PRICES, FEEDINGSTUFFS, SEED
French: SUCRE, CONVENTION, PORC, ADDITIFS, PRIX, ALIMENTATION, SEMENCES, DECISION

Export Licences
English: EXPORT, LICENCES, LICENCE, REFUND, VEHICLES, FISHERY, CONVENTION, CERTIFICATE, ISSUED
French: EXPORTATION, CERTIFICATS, CERTIFICAT, PECHE, VEHICULES, LAIT, CONVENTION
10
Examples of Semantic Dimensions from Acquis
corpus English-Slovene (2/2)

Most important words from semantic dimensions
automatically generated from 2000 documents

Agriculture
English: OLIVE, OIL, AID, SUGAR, PRICE, STATE, MILK, LICENCES, OR, EXPORT, INTERVENTION
Slovene: OLJA, OLJCNEGA, POMOCI, SLADKORJA, POMOC, OLJK, SLADKOR, ALI, DOVOLJENJA, CE

Customs
English: NOMENCLATURE, COLUMN, COMBINED, GOODS, TARIFF, CLASSIFICATION, ST, ANNEXED, INVOKED
Slovene: NOMENKLATURO, STOLPCU, NOMENKLATURE, KOMBINIRANO, KOMBINIRANE, CARINSKI, BLAGA

Energy
English: QUOTAS, TARIFF, SEED, CUSTOMS, COLUMN, ENERGY, INVOKED, ATOMIC, QUOTA, OPENING
Slovene: KVOT, TARIFNE, SEMENA, KVOTE, TARIFNIH, CARINSKI, ATOMSKO, ENERGIJO, ODPRTJU

Agriculture protection
English: DESIGNATIONS, GEOGRAPHICAL, INDICATIONS, EURATOM, PROTECTED, ECSC, NAMES, ORIGIN
Slovene: OZNACB, EURATOM, GEOGRAFSKIH, POREKLA, ESPJ, ZASCITENIH, OZNACBE, IMEN, REGISTER

Wine
English: WINE, WINES, ALCOHOL, DRINKS, DISTILLATION, POULTRYMEAT, ICEWINE, ANALYSIS
Slovene: VINO, VINA, VIN, VINSKEM, VINSKEGA, ALKOHOL, NAMIZNEGA, DESTILACIJO, DESTILACIJE
11
Applications of KCCA

Cross-lingual document retrieval: retrieved
documents depend only on the meaning of the query
and not on its language.
Automatic document categorization: only one
classifier is learned, rather than a separate
classifier for each language.
Document clustering: documents should be grouped
into clusters based on their content, not on the
language they are written in.
Cross-media information retrieval: in the same
way we correlate two languages, we can correlate
text to images, text to video, text to sound, ...
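
Cross-lingual retrieval in the language-neutral space can be sketched as plain cosine ranking over concept vectors. The example assumes every document and the query have already been projected onto the semantic dimensions; the concept vectors, document ids and function names below are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def retrieve(query_concepts, documents, k=2):
    """Rank documents by similarity to the query, ignoring their language."""
    ranked = sorted(documents,
                    key=lambda doc: cosine(query_concepts, doc["concepts"]),
                    reverse=True)
    return [doc["id"] for doc in ranked[:k]]


# Hypothetical language-neutral representations (3 semantic dimensions).
documents = [
    {"id": "en-customs", "concepts": [0.9, 0.1, 0.0]},
    {"id": "fr-customs", "concepts": [0.8, 0.2, 0.1]},
    {"id": "en-wine", "concepts": [0.0, 0.1, 0.9]},
    {"id": "sl-wine", "concepts": [0.1, 0.0, 0.8]},
]

# A query about wine, written in any language, projected the same way.
query = [0.05, 0.0, 0.95]
results = retrieve(query, documents)
```

Because language never enters the ranking, the same index can serve queries in any language covered by the parallel corpus. A single classifier or clustering run over the same concept vectors would follow the same pattern.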

12
Example of cross-lingual information retrieval on
Reuters news corpus using KCCA
[Figure: retrieval example; surviving labels: "Borse", "Stock Exchange"]
13
Related approaches

The usual approach to modelling cross-language
information retrieval is Latent Semantic Indexing
(LSI/SVD) on parallel corpora.
The measured performance of KCCA is significantly
better than that of LSI
(Vinokourov et al., 2002).
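
For comparison, the LSI baseline on a parallel corpus can be sketched as follows: concatenate the bag-of-words vectors of each aligned pair, take a truncated SVD, and fold monolingual documents into the resulting space. This is a minimal illustration under assumed toy data, not the setup evaluated in the cited work; `train_cross_language_lsi` and `fold_in` are hypothetical names.

```python
import numpy as np


def train_cross_language_lsi(english, german, k):
    """SVD of merged aligned documents; returns per-language term bases."""
    merged = np.hstack([english, german])  # docs x (English + German terms)
    _, _, vt = np.linalg.svd(merged, full_matrices=False)
    basis = vt[:k].T                       # terms x k latent dimensions
    n_english_terms = english.shape[1]
    return basis[:n_english_terms], basis[n_english_terms:]


def fold_in(doc_vector, term_basis):
    """Project a monolingual bag-of-words vector into the latent space."""
    return doc_vector @ term_basis


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy aligned corpus: terms 0-1 are about customs, terms 2-3 about wine.
english = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]], float)
german = np.array([[1, 2, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]], float)

english_basis, german_basis = train_cross_language_lsi(english, german, k=2)

english_customs = fold_in(english[0], english_basis)
german_customs = fold_in(german[0], german_basis)
german_wine = fold_in(german[2], german_basis)

same_topic = cosine(english_customs, german_customs)
other_topic = cosine(english_customs, german_wine)
```

Unlike KCCA, this baseline finds directions of maximum variance in the merged space rather than directions of maximum correlation between the two languages.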

14
Availability/Scalability

KCCA is available within the Text-Garden text-mining
software environment,
available at http://www.textmining.net
The current version processes up to 10,000 documents.
The next version (incremental) will be able to
process up to 100,000 documents.

15
Questions?
