Title: Doctoral thesis
1 Cross-language experiments with IR-n system
CLEF-2003
Departamento de Lenguajes y Sistemas Informáticos
2 IR-n system Introduction
- IR-n system is a passage retrieval system.
- IR-n system participated in the CLEF-2001 and CLEF-2002 conferences in the Spanish monolingual task.
- This year we participated in the following tasks:
  - Monolingual (Spanish, French, German, Italian).
  - Bilingual (Italian-Spanish).
  - Multilingual (4 languages).
3 Index
Passage Retrieval Systems
IR-n System
Monolingual Experiments
Bilingual Experiments
Multilingual Experiments
Conclusions
5 Passage Retrieval Systems Relevance measures
General Custer was Civil War Union Major
soldier. One of the most famous and controversial
figures in United States Military history.
Graduated last in his West Point Class (June
1861). Spent first part of the Civil War as a
courier and staff officer. Promoted from Captain
to Brigadier General of Volunteers just prior to
the Battle of Gettysburg, and was given command
of the Michigan "Wolverines" Cavalry brigade.
He helped defeat General Stuart's attempt to
make a cavalry strike behind Union lines on the
3rd Day of the Battle (July 3, 1863), thus
markedly contributing to the Army of the
Potomac's victory (a large monument to his
Brigade now stands in the East Cavalry Field in
Gettysburg). Participated in nearly every cavalry
action in Virginia from that point until the end
of the war, always performing boldly, most often
brilliantly, and always seeking publicity for
himself and his actions. Ended the war as a Major
General of Volunteers and a Brevet Major General
in the Regular Army. Upon Army reorganization
in 1866, he was appointed Lieutenant Colonel of
the soon to be renowned 7th United States Cavalry.
Fought in the various actions against the Western
Indians, often with a singular brutality
(exemplified by his wiping out of a Cheyenne
village on the Washita in November 1868). His
exploits on the Plains were romanticized by
Eastern United States newspapermen, and he was
elevated to legendary status in his time. The
death of his friend, Lucarelli, changed his life.
The death of General Custer
6 Passage Retrieval Systems IR systems based on whole document
7 Passage Retrieval Systems Definition
- Use short fragments of documents, instead of whole documents, to evaluate relevance or similarity.
- These fragments are called passages.
- Each document is divided into passages before its relevance is calculated.
8 Passage Retrieval Systems Definition (II)
Steps
1 Define the passages
2 Evaluate the relevance of each passage
3 Evaluate the relevance of the document as a function of the relevance of its passages
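The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the IR-n implementation: passages here are hypothetical fixed-size word windows, the passage score is a plain count of distinct query terms, and the document score is taken as the maximum passage score. The two sample texts are invented for the example.

```python
from typing import List


def split_into_passages(words: List[str], size: int) -> List[List[str]]:
    """Step 1: cut a tokenized document into consecutive windows of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)] or [[]]


def passage_score(passage: List[str], query: List[str]) -> int:
    """Step 2: count how many distinct query terms occur in the passage (assumed measure)."""
    return len(set(passage) & set(query))


def document_score(text: str, query: str, size: int = 8) -> int:
    """Step 3: derive the document score from its passage scores (here: the maximum)."""
    words = text.lower().split()
    terms = query.lower().split()
    return max(passage_score(p, terms) for p in split_into_passages(words, size))


# Invented sample texts: the query terms are adjacent in `near` and scattered in `far`.
near = "the death of general custer shocked the nation in 1876"
far = "the death of his friend changed him " + "filler " * 20 + "general custer rode on"
query = "death general custer"

scores = {"near": document_score(near, query), "far": document_score(far, query)}
```

A whole-document comparison finds all three query terms in both texts, while the passage-based score is higher for `near`, where the terms occur close together.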
9 Passage Retrieval Systems Advantages
- Add the concept of proximity to the calculation of similarity between document and query.
- Allow short relevant fragments to be located in otherwise non-relevant documents.
- Avoid the difficulties of comparing documents of different lengths.
11 IR-n system Definition
Steps
1 Define the passages
2 Evaluate the relevance of each passage
3 Evaluate the relevance of the document as a function of the relevance of its passages
12 IR-n system Passage concept
- IR-n system uses the sentence to define passages.
- Every passage has the same number of sentences.
- This number depends on:
  - The collection of documents
  - The size of the query
- A sentence expresses an idea in the document.
- There are algorithms that identify sentences with high precision.
- Sentences are complete units, so they can be shown to users as understandable information or passed on to a subsequent system.
13 IR-n system Passage concept (II)
IR-n system defines the passages in the following way:
[Figure: the sample document split into SENTENCE 1 through SENTENCE 15]
1 Obtain the sentences of the document
2 Define the passages based on a fixed number of sentences (5)
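A minimal sketch of these two steps follows. The regular-expression sentence splitter is a naive stand-in (the slides do not say which sentence detection algorithm IR-n uses), and the passages are built here as consecutive, non-overlapping groups, which is an assumption made for simplicity.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Step 1: naive sentence splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def fixed_passages(sentences: List[str], size: int = 5) -> List[List[str]]:
    """Step 2: group the sentences into consecutive passages of `size` sentences each."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


# A 15-sentence placeholder document, mirroring the figure on this slide.
text = " ".join(f"Sentence {n}." for n in range(1, 16))
sentences = split_sentences(text)
passages = fixed_passages(sentences, size=5)
```

With 15 sentences and a passage size of 5, this grouping yields three passages.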
15 IR-n system Similarity Measure Query-Passage
- This year we changed the original similarity measure of IR-n system, improving the results.
- IR-n uses:
  - The number of appearances of each term in the query and in the passage
  - The number of different documents that contain each term
- IR-n does not use:
  - Normalization measures that depend on passage size, because all passages have the same size (the same number of sentences)
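The slides do not give the formula itself, so the following is only a sketch of a measure built from the ingredients listed above: term counts in the query and in the passage, and the number of documents containing each term. The logarithmic document-frequency weight is an assumption borrowed from common tf-idf practice, not the IR-n measure. No length normalization is applied, in line with the last bullet.

```python
import math
from collections import Counter
from typing import Dict, List


def passage_query_similarity(
    passage: List[str],
    query: List[str],
    doc_freq: Dict[str, int],
    n_docs: int,
) -> float:
    """Sum, over shared terms, of passage count * query count * log(N / df).

    The weighting is an assumed tf-idf style choice; there is no
    normalization by passage length.
    """
    passage_tf = Counter(passage)
    query_tf = Counter(query)
    score = 0.0
    for term, query_count in query_tf.items():
        df = doc_freq.get(term, 0)
        if term in passage_tf and df > 0:
            score += passage_tf[term] * query_count * math.log(n_docs / df)
    return score


# Hypothetical collection statistics, for illustration only.
doc_freq = {"custer": 10, "death": 100}
score = passage_query_similarity(
    ["custer", "death", "custer"], ["death", "custer"], doc_freq, n_docs=1000
)
```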
17 IR-n system Similarity measure Document-query
- Based on the best passage similarity of the document
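Read literally, this means a document is scored by its best passage. A small sketch, assuming the document score is simply the maximum passage score (the slide does not say whether anything else is combined with it); the function names are hypothetical:

```python
from typing import Dict, List


def document_similarity(passage_scores: List[float]) -> float:
    """Score a document by its best passage; a document without passages scores 0."""
    return max(passage_scores, default=0.0)


def rank_documents(scores_by_doc: Dict[str, List[float]]) -> List[str]:
    """Order document ids by decreasing best-passage score."""
    return sorted(
        scores_by_doc,
        key=lambda doc_id: document_similarity(scores_by_doc[doc_id]),
        reverse=True,
    )


ranking = rank_documents({"a": [0.1, 0.9], "b": [0.5, 0.5, 0.5], "c": []})
```

Document `a` ranks first even though most of its passages score low, which is how a short relevant fragment inside a largely non-relevant document can still be found.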
18 IR-n system Other aspects
- IR-n system uses:
  - Overlapping passages
  - Relevance feedback based on passages
19 IR-n system Passage overlapping
- IR-n system uses overlapping in the definition of passages.
- In this way, a fragment of a document can belong to more than one passage.
- IR-n system uses the sentence as the unit of overlapping.
20 IR-n system Passage overlapping (II)
Definition of passages using overlapping
[Figure: the sample document split into SENTENCE 1 through SENTENCE 8]
1 Obtain the sentences of the document
2 Define the passages using the passage size and the degree of overlapping
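Overlapping passages can be sketched as a sliding window over the sentence list. The parameter names `size` and `overlap` are illustrative, and expressing the degree of overlapping as a number of shared sentences is an assumption; the slides do not state how IR-n parameterizes it.

```python
from typing import List


def overlapping_passages(sentences: List[str], size: int, overlap: int) -> List[List[str]]:
    """Build passages of `size` sentences in which consecutive passages share `overlap` sentences."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    passages = []
    for start in range(0, len(sentences), step):
        passages.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the last sentence is covered; avoid trailing partial windows
    return passages


# Eight placeholder sentences, mirroring the figure on this slide.
sentences = [f"S{n}" for n in range(1, 9)]
passages = overlapping_passages(sentences, size=4, overlap=2)
```

With 8 sentences, a size of 4 and an overlap of 2, sentences 3 to 6 each appear in two passages.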
21 IR-n system Passage overlapping (III)
- Defining passages this way increases the number of passages to evaluate.
- However, the architecture of IR-n system allows overlapping passages to be used without a considerable increase in processing time.
22 IR-n system Relevance Feedback using passages
24 Monolingual Experiments Training
- Training
  - Test collections: CLEF-2002
- Objectives
  - Determine the passage size for each collection
- Resources
  - Stop-word lists (CLEF)
  - Stemmer (CLEF)
25 Monolingual Experiments Determining the best size of passages
Characteristics              Spanish   English   French   German    Italian
Number of documents          215,738   113,005   87,191   225,371   108,578
Avg. sentences per document  9.30      27.34     16.86    17.70     16.28
Best size (sentences)        14        9         14       14        8
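One way such a best size could be chosen during training is to run the system with several candidate passage sizes and keep the size with the highest average precision (AvgP). The slides do not describe the selection procedure, so the sketch below is an assumption; `runs` is a hypothetical mapping from passage size to the ranked list retrieved with that size for a single query.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document is retrieved."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def best_passage_size(runs: Dict[int, List[str]], relevant: Set[str]) -> int:
    """Return the candidate size whose ranking has the highest average precision."""
    return max(runs, key=lambda size: average_precision(runs[size], relevant))


# Invented rankings for one query, for illustration only.
relevant = {"a", "b"}
runs = {8: ["x", "a", "b"], 14: ["a", "x", "b"]}
chosen = best_passage_size(runs, relevant)
```

In practice the average precision would be averaged over all training queries before the sizes are compared.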
27 Monolingual Experiments Conclusions of training
- First conclusions
  - Good results for Spanish, French and English
  - Bad results for German and Italian
  - Query expansion improves AvgP by more than 10
- Problems with German
  - We do not have an algorithm to split compound nouns
  - Solution
    - Use a list of the most frequent compound nouns (200,000 terms). Using this list improves AvgP by more than 19.7
- Problems with Italian
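A list-based workaround of this kind can be sketched as a lookup table from known compounds to their components. The slides do not say how the list is structured or applied, so the table format, the two sample entries and the function name are all hypothetical.

```python
from typing import Dict, List

# Hypothetical excerpt of a compound list: compound -> component words.
COMPOUNDS: Dict[str, List[str]] = {
    "bundesregierung": ["bund", "regierung"],
    "umweltschutz": ["umwelt", "schutz"],
}


def expand_compounds(terms: List[str], compounds: Dict[str, List[str]] = COMPOUNDS) -> List[str]:
    """Keep every term and append the components of any term found in the compound list."""
    expanded: List[str] = []
    for term in terms:
        expanded.append(term)
        expanded.extend(compounds.get(term.lower(), []))
    return expanded


expanded = expand_compounds(["Umweltschutz", "in", "Europa"])
```

Keeping the original compound alongside its components lets a term match documents that use either form.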
28 Monolingual Experiments Conclusions of training
- Problems with Italian
  - ?
30 Bilingual Experiments Training
- Training
  - Test collections: CLEF-2002
- Objectives
  - Determine how to translate the queries
- Resources
  - PowerTranslator
  - FreeTranslator
  - BabelFish
  - Google
31 Bilingual Experiments Conclusions
- Google was the translator with the worst results.
- Power Translator was the translator with the best results.
- However, using three translators (Power, Free and Babel) at the same time gave better results (5) than Power alone.
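The slides do not say how the three translations are combined. One simple possibility, shown here purely as an assumption, is to concatenate the outputs into a single query so that terms on which the translators agree occur more often. The translators are represented by stub functions; no real translation API is called, and the sample query and translations are invented.

```python
from collections import Counter
from typing import Callable, List


def combined_query(query: str, translators: List[Callable[[str], str]]) -> List[str]:
    """Concatenate the terms produced by every translator into one query term list."""
    terms: List[str] = []
    for translate in translators:
        terms.extend(translate(query).lower().split())
    return terms


# Stub translators returning fixed strings (hypothetical outputs, for illustration only).
def stub_a(query: str) -> str:
    return "muerte del general Custer"


def stub_b(query: str) -> str:
    return "la muerte del general Custer"


def stub_c(query: str) -> str:
    return "fallecimiento del general Custer"


terms = combined_query("la morte del generale Custer", [stub_a, stub_b, stub_c])
counts = Counter(terms)
```

Terms produced by all three stubs get a count of 3, while a term produced by only one gets a count of 1, so agreement between translators acts as an implicit weight.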
33 Multilingual Experiments Training
- Training
  - Test collections: CLEF-2002
- Objectives
  - Determine how to generate the multilingual document list
34 Multilingual Experiments Training
- Method
  - We are working on a dictionary-based model to generate the main list.
  - However, we could not finish developing this model in time for this conference.
- Method used
  - Translate the original query (English) into each language, using three translators.
  - Process each query separately.
  - Merge the four document lists, using a formula to normalize each document.
  - We tested several formulas and obtained the best results with: [formula not preserved in this transcript]
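The normalization formula is not available in this transcript. As a stand-in, the sketch below merges per-language result lists after dividing each score by the top score of its own list, a common normalization choice that is an assumption here and not necessarily the formula the authors selected. Document ids and scores are invented.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, float]  # (document id, similarity score)


def merge_ranked_lists(lists: Dict[str, List[Result]]) -> List[Result]:
    """Normalize each list by its top score (assumed formula), then merge by normalized score."""
    merged: List[Result] = []
    for results in lists.values():
        if not results:
            continue
        top = max(score for _, score in results)
        for doc_id, score in results:
            merged.append((doc_id, score / top if top > 0 else 0.0))
    # sorted() is stable, so tied documents keep their per-language input order.
    return sorted(merged, key=lambda item: item[1], reverse=True)


# Invented per-language results, for illustration only.
lists = {
    "es": [("es1", 10.0), ("es2", 5.0)],
    "fr": [("fr1", 2.0), ("fr2", 1.5)],
}
merged = merge_ranked_lists(lists)
```

Without normalization the Spanish documents would outrank every French one simply because their raw scores lie on a larger scale.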
35 Multilingual Experiments Training (II)
- Method
  - For each language we use two kinds of passages:
    - The best passage size from the monolingual experiments
    - The same passage size for all collections (11 sentences)
  - We obtained similar results (better with the first).
  - We want to continue exploring the use of the same size for all collections, which may be better for comparing the collections.
37 Conclusions Comparison with the CLEF Average. Monolingual
Language   Inc.
Spanish    8.75
French     5.88
German     7.48
Italian    -2.06
38 Conclusions Comparison with the CLEF Average. Cross-lingual
Language           Inc.
Italian/Spanish    25.78
Multilingual (4)   22.71
39 Conclusions Work in progress
- Improve the efficiency of IR-n system in the retrieval task
- Develop an algorithm to split compound nouns (German)
- Continue developing our dictionary-based method for multilingual retrieval
- If you are interested in using IR-n system (llopis_at_dlsi.ua.es)