2ID10: Information Retrieval Lecture 10: Assignments

1
2ID10: Information Retrieval Lecture 10
Assignments

Lora Aroyo
30th May 2006

2
Course Topics

Basic Information Retrieval Terminology
Query Languages & Operations
Precision & Recall in Search Engines
Relevance Feedback
Language modeling for IR
Search engines
Reference Structures in IR
Multimedia Information Retrieval
Publishing of enriched, structured content

3
Assignment Submission

Final Assignment: 7th July 2006
Submit to IR@listserver.tue.nl
Register at http://listserver.tue.nl/mailman/listinfo/IR
Subject: Group Assignment
Files: group.ass.title.extension
URL: http://...../group/assignment
URL with running application and documentation
In all files include your Group and Assignment,
as well as the names of all group members

4
Assignment Submission

Each assignment should include
a detailed report on the modeling, algorithmic and
implementation aspects (if applicable) of the solution
clear problem description
clear solution description and justification
literature and relevant material that you used to
solve the problem
the significance and benefit of your solution
URL with report, running implementation and
source code

5
Assignment 1 How to improve an existing NL
parser

Related Lecture 9 - C-content
The behaviour of the NL parser needs to be tuned,
based upon the results of a pre-defined set of
queries on a pre-defined document collection.
Identify problematic queries or query types
leading to low precision/recall results.

6
Assignment 1 Improving an existing NL parser

What do we have?
A natural language parser, which does
Spelling checker
Term identification
Search term checking
Syntactic analysis
Semantic expansion
Query base generation
What is the problem?
The behavior needs to be tuned, based upon the
results of a pre-defined set of queries on a
pre-defined document collection

7
Assignment 1 Improving an existing NL parser

Assignment
Identifying problematic queries or query types
(Requires at least one Dutch native speaker in
the team)
Type of problems
Low precision
Low recall
Misunderstood queries
E.g., does "information on cars and traffic jams"
mean that only documents containing both cars
and traffic jams should be found, or also
documents containing either cars or traffic
jams?

8
Assignment 1 Improving an existing NL parser

Proposed steps
Define a set of natural language queries,
including expected query results
define the Boolean query to be executed
list the documents to be found by the query
Run queries through the on-line website
Identify mismatch areas

9
Assignment 1 Improving an existing NL parser

Proposed steps
Define a set of natural language queries,
including expected query results
Run queries through the on-line website
Perform the queries on the information portal
Identify mismatch areas

10
Assignment 1 Improving an existing NL parser

Proposed steps
Define a set of natural language queries,
including expected query results
Run queries through the on-line website
Identify mismatch areas
keep track of precision and recall percentages
if these percentages are low, what is wrong with
the Boolean query?
how could this be improved?
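The precision and recall bookkeeping in the last step could be sketched as follows. This is only an illustration, not part of the assignment material: the function name `precision_recall` and the document ids are hypothetical, and it assumes the expected results of each query are available as a set of document identifiers.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
expected = {"doc1", "doc2", "doc3", "doc4"}  # documents the query should find
returned = ["doc2", "doc3", "doc9"]          # documents the website returned

p, r = precision_recall(returned, expected)
print(f"precision={p:.2f} recall={r:.2f}")
```

Queries whose precision or recall falls below a chosen threshold are candidates for the mismatch analysis.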

11
Assignment 2 Identify context/metadata relation
between documents in legal domain

Related Lecture 9 - C-content
What do we have?
Document collections, which
Contain metadata
Have contextual relations between documents
What is the problem?
How to construct relations between documents
based on existing metadata?

12
Assignment 2 Identify context/metadata relation
between documents in legal domain

Assignment
Design and implement the algorithm to identify
the possible relation between documents,
including a link certainty indicator
(Requires at least one Dutch native speaker in
the team)
Basic information
An ideal document collection relevant to one
starting document
A mixed collection of ideal and noise documents
Description of metadata that can be used to
identify the relation

13
Assignment 2 Identify context/metadata relation
between documents in legal domain

Proposed Steps
Study description and the document collections
delivered.
Design the algorithm to solve the problem
Implement the algorithm in a simple programming
module
Analyze possible problems due to ambiguity in the
context/metadata

14
Assignment 4 User Alert Service

Related Lecture 9 - C-content
What do we have?
Document collections, which
Contain metadata
Have contextual relations between documents
Structure of the context/metadata relation
(assignment 2)
What is the problem?
How to construct the user alert service based on
the user profile information and the resulting
document relations?

15
Assignment 4 User Alert Service

Assignment
Design and implement the user profile structure
and the user alert service
Proposed Steps
Design the user profile structure
Design the algorithm for the user alert service
Implement the algorithm in a simple programming
module
Analyze the complexity of the user profile
structure and its consequences for the user alert
service

16
Assignment 3 Using Relevance Feedback during
Information Retrieval

Related lecture 3 by Theo van der Weide
Consider the cooperators of the Informatics
Department.
User feedback for query modification with the
Rocchio technique
Download the Perlfect Search engine sources
(http://www.perlfect.com/freescripts/search/).
After a query, offer the searcher the top 10
ranking documents, and ask for feedback.
Use this feedback to construct the modified query
using the Rocchio technique.
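A minimal sketch of the Rocchio update, assuming queries and documents are represented as term-weight dictionaries. The function name `rocchio`, the example vectors and the default values of `alpha`, `beta` and `gamma` are assumptions for illustration; they are not taken from the lecture or from the Perlfect Search sources.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query as a dict term -> weight.

    q' = alpha * q + beta * centroid(relevant) - gamma * centroid(non_relevant)
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue  # no feedback of this kind, nothing to add
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Assumption: terms with non-positive weight are dropped from the query.
    return {t: w for t, w in new_query.items() if w > 0}


# Hypothetical feedback on the top-ranked documents.
query = {"traffic": 1.0}
marked_relevant = [{"traffic": 1.0, "jam": 1.0}, {"traffic": 1.0, "road": 1.0}]
marked_non_relevant = [{"jam": 1.0, "fruit": 1.0}]

modified = rocchio(query, marked_relevant, marked_non_relevant)
print(modified)
```

The modified query is then submitted to the search engine in place of the original one.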

17
Assignment 3 Using Relevance Feedback during
Information Retrieval
18
Assignment 5 Language identification is an
important basic component for multilingual search
engines

Related lecture 8 by Wessel Kraaij (TNO)
Develop a language identification module using
generative character models for the 21 EU
languages.
Develop and test the language identification
module on the EU constitution corpus.
Test the language identification module on some
non-EU languages
(e.g. taken from the KDE corpus on the same site).
Adjust the classifier in such a way that it can
recognize unknown languages.
Optional: produce a similarity matrix between the
languages using the identification module.
Use the similarity matrix to build a tree using
hierarchical agglomerative techniques. (related
languages will be grouped).

19
Assignment 5

Download the EU constitution in 21 languages from
http://logos.uio.no/opus/
Split each language corpus into a training, test
and evaluation set.
Train classifiers for each language using
character trigram models smoothed with character
bigram models.
Optimize the smoothing parameter on the test
data.
Compute the accuracy of the classifier, by taking
100 sentences from the evaluation set of each
language.
After adjusting the classifier, test the accuracy
of the classifier on the same evaluation set plus
100 sentences from 5 non-European languages.
Present results per language.
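The training and classification steps could be sketched as below. The slide does not specify the smoothing scheme, so this sketch assumes linear interpolation between the trigram and bigram estimates; the class name `CharTrigramModel`, the parameters `lam` and `floor`, and the toy training strings are all hypothetical.

```python
import math
from collections import Counter


class CharTrigramModel:
    """Character trigram model linearly interpolated with a bigram model."""

    def __init__(self, text, lam=0.7, floor=1e-6):
        self.lam = lam      # weight of the trigram estimate (assumed value)
        self.floor = floor  # probability floor for unseen events (assumption)
        padded = "  " + text  # two spaces of left padding
        n = len(text)
        self.tri = Counter(padded[i:i + 3] for i in range(n))
        self.tri_ctx = Counter(padded[i:i + 2] for i in range(n))
        self.bi = Counter(padded[i + 1:i + 3] for i in range(n))
        self.bi_ctx = Counter(padded[i + 1] for i in range(n))

    def log_prob(self, sentence):
        """Log-probability of the sentence under this language model."""
        padded = "  " + sentence
        total = 0.0
        for i in range(len(sentence)):
            gram = padded[i:i + 3]
            ctx2, ctx1 = gram[:2], gram[1]
            p_tri = self.tri[gram] / self.tri_ctx[ctx2] if self.tri_ctx[ctx2] else 0.0
            p_bi = self.bi[gram[1:]] / self.bi_ctx[ctx1] if self.bi_ctx[ctx1] else 0.0
            p = self.lam * p_tri + (1 - self.lam) * p_bi
            total += math.log(max(p, self.floor))
        return total


def identify(sentence, models):
    """Return the language whose model assigns the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(sentence))


# Toy training data; the real models would be trained on the corpus.
models = {
    "en": CharTrigramModel("the quick brown fox jumps over the lazy dog and the cat"),
    "nl": CharTrigramModel("de snelle bruine vos springt over de luie hond en de kat"),
}
print(identify("the dog", models), identify("de kat", models))
```

Optimizing the smoothing parameter then amounts to repeating the classification on the test set for several values of `lam` and keeping the most accurate one.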

20
Assignment 5 Language identification is an
important basic component for multilingual search
engines

Optional: produce a similarity matrix, using the
previous test data by averaging probabilities
across test sentences and averaging P(A|B) and
P(B|A) (the similarity matrix must be symmetric).
Use CLUTO to cluster the languages, e.g. by
applying hierarchical agglomerative clustering.
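The symmetrization step could look like the sketch below. It assumes that `avg_prob[(a, b)]` holds the average probability that the model of language `a` assigns to test sentences of language `b`; this reading of P(A|B), the function name `symmetrize` and the numbers are illustrative assumptions. The clustering itself is left to CLUTO and is not shown.

```python
def symmetrize(avg_prob):
    """Average the two directed scores of each language pair."""
    langs = sorted({a for a, _ in avg_prob})
    return {
        (a, b): (avg_prob[(a, b)] + avg_prob[(b, a)]) / 2
        for a in langs
        for b in langs
    }


# Hypothetical averaged scores for two languages.
avg_prob = {
    ("nl", "nl"): 0.9, ("nl", "de"): 0.4,
    ("de", "nl"): 0.2, ("de", "de"): 0.8,
}
similarity = symmetrize(avg_prob)
print(similarity[("nl", "de")], similarity[("de", "nl")])
```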

21
Assignment 6 A general assumption in CLIR
research is that parallel corpora can be
exploited to improve monolingual search

Related lecture 8 by Wessel Kraaij (TNO)
Test the assumption by training statistical
thesauri and performing experiments on public IR
test collections.
Investigate whether a combination of statistical
thesauri trained on different parallel texts can
be used to improve retrieval performance.

22
Assignment 6 A general assumption in CLIR
research is that parallel corpora can be
exploited to improve monolingual search

Choose three European languages other than
English: X, Y, Z. Train statistical thesauri
English>X|Y|Z>English on several parallel
corpora available through the OPUS website, by
cascading two translation models trained with the
EGYPT toolkit (choose IBM Model 1).

23
Assignment 6 A general assumption in CLIR
research is that parallel corpora can be
exploited to improve monolingual search

Construct various combined thesauri e.g. by
interpolating three individual thesauri trained
on a single corpus or by combining thesauri
trained on different corpora. To be effective
such thesauri usually are interpolated with the
identity matrix. (Xu et al. 2002, TREC2001)
Implement a CLIR system using LEMUR and
http://www.lemurproject.org/doxygen/lemur-3.1/html/classXLingRetMethod.html
as retrieval method.
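Combining thesauri and interpolating with the identity matrix could be sketched as follows, assuming each thesaurus is stored as a nested dictionary `source term -> {target term -> probability}`. The function names, the weights and the toy entries are hypothetical and are not part of LEMUR or the EGYPT toolkit.

```python
def combine(thesauri, weights):
    """Linear interpolation of several thesauri with the given weights."""
    combined = {}
    for thesaurus, weight in zip(thesauri, weights):
        for source, row in thesaurus.items():
            target_row = combined.setdefault(source, {})
            for target, prob in row.items():
                target_row[target] = target_row.get(target, 0.0) + weight * prob
    return combined


def interpolate_with_identity(thesaurus, mu=0.5):
    """Mix each row with the identity: a term translates to itself with weight mu."""
    result = {}
    for source, row in thesaurus.items():
        new_row = {target: (1 - mu) * prob for target, prob in row.items()}
        new_row[source] = new_row.get(source, 0.0) + mu
        result[source] = new_row
    return result


# Toy thesauri, as if trained via two different pivot languages.
via_x = {"car": {"automobile": 0.6, "vehicle": 0.4}}
via_y = {"car": {"automobile": 0.2, "car": 0.8}}

mixed = combine([via_x, via_y], [0.5, 0.5])
final = interpolate_with_identity(mixed, mu=0.5)
print(final["car"])
```

If every input row sums to one and the weights sum to one, each output row again sums to one.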

24
Assignment 6 A general assumption in CLIR
research is that parallel corpora can be
exploited to improve monolingual search

Perform monolingual IR experiments (measure mean
average precision) on the CACM and CRANFIELD
collections. Experiments must include
a baseline run (standard generative language
model)
individual thesauri (with and without
interpolation with the identity matrix)
combined thesauri (with and without interpolation
with the identity matrix)
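Mean average precision over a set of queries could be computed as in this sketch. The function names and the toy rankings are hypothetical; in the experiments the rankings would come from the retrieval runs and the relevance judgments from the test collection.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the rank of each relevant document."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranking, relevant documents) pairs, one per query."""
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)


# Two toy queries with hypothetical document ids.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
map_score = mean_average_precision(runs)
print(f"MAP = {map_score:.4f}")
```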

25
Assignment 7 Self-learning Dutch meta-search

Related lecture 10 by Nils Rooijmans (ilse
media)
Build a meta-search
Learn which algorithm works best for which type
of queries
single VS multi-word queries
general queries VS specific queries (roughly
based on the number of results in the resultset)
Improve ranking

26
Assignment 7 Self-learning Dutch meta-search

Build website based on APIs
Ilse API (mail searchengine@ilse.net for
documentation)
Google API http://www.google.com/apis/index.html
Yahoo API http://developer.yahoo.net/web/V1/webSearch.html
Merge resultsets (handle duplicates, hide
originating API(s))
Build feedback mechanism
Score different APIs for different query types
Use score in merging of results
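One possible way to merge resultsets with per-API scores is a weighted reciprocal-rank sum, sketched below. This is an assumption about how the scores might be used, not the method prescribed by the assignment; the API names, the URLs, the `query_type` rule and the score table are hypothetical, and no real search API is called.

```python
from collections import defaultdict


def normalize(url):
    """Crude duplicate detection: ignore case and trailing slashes."""
    return url.lower().rstrip("/")


def query_type(query):
    """Assumed query typing: single-word versus multi-word."""
    return "single" if len(query.split()) == 1 else "multi"


def merge(query, results_by_api, api_scores):
    """Merge ranked URL lists; the originating API is hidden in the output."""
    qtype = query_type(query)
    fused = defaultdict(float)
    for api, urls in results_by_api.items():
        weight = api_scores.get((api, qtype), 1.0)
        for rank, url in enumerate(urls, start=1):
            fused[normalize(url)] += weight / rank
    # Highest fused score first; ties broken alphabetically.
    return sorted(fused, key=lambda u: (-fused[u], u))


# Hypothetical results and learned scores.
results = {
    "api_a": ["http://x.nl/", "http://y.nl"],
    "api_b": ["http://y.nl", "http://z.nl"],
}
scores = {("api_a", "multi"): 1.0, ("api_b", "multi"): 2.0}
merged = merge("dutch weather", results, scores)
print(merged)
```

The feedback mechanism would update the score table, for example by raising the score of the APIs that returned a result the user clicked on.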

27
Assignment 8 Web-services for Multimedia
Indexing

Related lecture 8 by Arjen de Vries (CWI)
8a) implement the EM algorithm to train a model
on a given image
8b) make a web-service that takes an image URL
and uses this program to train the model (include
8a)
8c) make a web-service that visualizes a trained
model on a given image, where results for each
component are presented in SVG as an overlay of
their pixel blocks with transparency depending on
the block likelihood (include 8a and 8b)
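The slide does not say which model is trained with EM. The sketch below assumes a Gaussian mixture and, to stay short, fits it to one-dimensional values such as a single feature per pixel block. The function names, the initial parameters and the data are hypothetical.

```python
import math


def gaussian(x, mean, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            likes = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
            total = sum(likes) or 1e-300  # guard against underflow to zero
            resp.append([like / total for like in likes])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # empty component, keep its old parameters
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)
    return means, variances, weights


# Hypothetical block features forming two clusters.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, weights)
```

For 8c, the responsibilities computed in the E-step are one candidate for the per-block likelihood that drives the transparency of the overlay.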

28
Assignment 9 Digital Accessibility on Dutch
Species

Related lecture 6 by Trezorix
www.soortenregister.nl
www.nederlandsesoorten.nl (NSR)
www.w3.org/2004/02/skos (SKOS)
www.w3.org/TR/owl-features (OWL)
www.openrdf.org (Sesame)
lucene.apache.org (Lucene)

29
Assignment 9 Digital Accessibility on Dutch
Species

The Dutch Species Register (Nederlands
Soortenregister, NSR)
thesaurus-structure with information on Dutch
plants, animals and fungi.
has about 40,000 biological concepts (taxa).
the web version of the NSR is linked to various
databases
like a preservation status database
an image library
article libraries, etc.
the editorial maintenance of the site is done by
the Dutch National Museum of Natural History
Naturalis
the technical infrastructure is developed and
maintained by Trezorix.

30
Assignment 9 Digital Accessibility on Dutch
Species

Make a website for digital access to (part of)
the collection of the NSR.
The site should illustrate the use of reference
structures for browsing and searching digital
collections.
The NSR-structure is a complicated structure of
thesaurus-like semantic relations combined with
naming data for each concept.

31
Assignment 9 Digital Accessibility on Dutch
Species

Describe a NSR datamodel.
Represent the structure part (thesaurus) of the
NSR in SKOS.
Represent as much of the extra data (see below)
as possible in SKOS.
Represent those data which do not fit into SKOS
in OWL.
Use the open source RDF-framework Sesame for
storage of structures.
Use the open source search engine Lucene for
findability of NSR elements
Explain how you deal with the complexity of the

32
Assignment 9 Digital Accessibility on Dutch
Species

Applicable extra data
Change record data - these data represent a
history per concept of mutations of naming data.
Data for species counter - bottom up accumulative
data about the number of species under a certain
node.
Data about the availability of photographs of
species.
What will be supplied by Trezorix?
A limited NSR dataset, for instance songbirds,
including scientific names, synonyms, change
record data, etc.
Extra data: data for species counter and data
about availability of photographs.
Relevant photographs.
Technical background.
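The bottom-up accumulation behind the species counter could be sketched as follows. It assumes the thesaurus is a tree given as a mapping from each node to its narrower concepts and that every leaf is a species; the function name `species_counts` and the small taxonomy fragment are illustrative and are not taken from the NSR dataset.

```python
def species_counts(children, root):
    """Return a dict node -> number of species (leaves) below that node."""
    counts = {}

    def visit(node):
        kids = children.get(node, [])
        # A node without narrower concepts is counted as one species.
        counts[node] = 1 if not kids else sum(visit(kid) for kid in kids)
        return counts[node]

    visit(root)
    return counts


# Illustrative fragment of a songbird taxonomy.
children = {
    "Passeriformes": ["Paridae", "Turdidae"],
    "Paridae": ["Parus major", "Cyanistes caeruleus"],
    "Turdidae": ["Turdus merula"],
}
counts = species_counts(children, "Passeriformes")
print(counts["Passeriformes"], counts["Paridae"])
```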