Title: 2ID10: Information Retrieval Lecture 10: Assignments
1. 2ID10 Information Retrieval, Lecture 10: Assignments
2. Course Topics
- Basic Information Retrieval Terminology
- Query Languages and Operations
- Precision and Recall in Search Engines
- Relevance Feedback
- Language modeling for IR
- Search engines
- Reference Structures in IR
- Multimedia Information Retrieval
- Publishing of enriched, structured content
3. Assignment Submission
- Final assignment: 7th July 2006
- Submit to: IR_at_listserver.tue.nl
- Register at: http://listserver.tue.nl/mailman/listinfo/IR
- Subject: Group Assignment
- Files: group.ass.title.extension
- URL: http://...../group/assignment
- URL with running application and documentation
- In all files, include your group and assignment, as well as the names of all group members
4. Assignment Submission
- Each assignment should include:
  - a detailed report of the modeling, algorithms and implementation (if applicable) of the solution
  - a clear problem description
  - a clear solution description and justification
  - the literature and relevant material that you used to solve the problem
  - the significance and benefit of your solution
  - a URL with the report, running implementation and source code
5. Assignment 1: How to improve an existing NL parser
- Related: Lecture 9 - C-content
- The behaviour of the NL parser needs to be tuned, based upon the results of a pre-defined set of queries on a pre-defined document collection.
- Identify problematic queries or query types leading to low precision/recall results.
6. Assignment 1: Improving an existing NL parser
- What do we have?
  - A natural language parser, which performs:
    - Spelling checking
    - Term identification
    - Search term checking
    - Syntactic analysis
    - Semantic expansion
    - Query base generation
- What is the problem?
  - The behavior needs to be tuned, based upon the results of a pre-defined set of queries on a pre-defined document collection
7. Assignment 1: Improving an existing NL parser
- Assignment
  - Identify problematic queries or query types
  - (Requires at least one Dutch native speaker in the team)
- Types of problems
  - Low precision
  - Low recall
  - Misunderstood queries
    - E.g., does "information on cars and traffic jams" mean that only documents containing both cars and traffic jams should be found, or also documents containing either cars or traffic jams?
8. Assignment 1: Improving an existing NL parser
- Proposed steps
  - Define a set of natural language queries, including expected query results
    - define the Boolean query to be executed
    - list the documents to be found by the query
  - Run queries through the on-line website
  - Identify mismatch areas
9. Assignment 1: Improving an existing NL parser
- Proposed steps
  - Define a set of natural language queries, including expected query results
  - Run queries through the on-line website
    - Perform the queries on the information portal
  - Identify mismatch areas
10. Assignment 1: Improving an existing NL parser
- Proposed steps
  - Define a set of natural language queries, including expected query results
  - Run queries through the on-line website
  - Identify mismatch areas
    - keep track of precision and recall percentages
    - if these percentages are low, what is wrong with the Boolean query?
    - how could this be improved?
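The bookkeeping in the last step can be sketched as follows. This is a minimal illustration, assuming that for every test query you have the set of documents you expected and the list the website actually returned; the query string and document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: expected results per query vs. what the site returned.
expected = {"cars AND traffic jams": {"d1", "d2", "d3", "d4"}}
returned = {"cars AND traffic jams": ["d1", "d2", "d7"]}

for query, relevant in expected.items():
    p, r = precision_recall(returned[query], relevant)
    print(f"{query}: precision={p:.0%} recall={r:.0%}")
```

Queries whose precision or recall stays low across runs are the candidates for a closer look at the generated Boolean query.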
11. Assignment 2: Identify context/metadata relation between documents in legal domain
- Related: Lecture 9 - C-content
- What do we have?
  - Document collections, which
    - contain metadata
    - have contextual relations between documents
- What is the problem?
  - How to construct relations between documents based on existing metadata?
12. Assignment 2: Identify context/metadata relation between documents in legal domain
- Assignment
  - Design and implement the algorithm to identify the possible relation between documents, including a link certainty indicator
  - (Requires at least one Dutch native speaker in the team)
- Basic information
  - An ideal document collection relevant to one starting document
  - A mixed collection of ideal and noise documents
  - Description of metadata that can be used to identify the relation
13. Assignment 2: Identify context/metadata relation between documents in legal domain
- Proposed steps
  - Study the description and the document collections delivered.
  - Design the algorithm to solve the problem
  - Implement the algorithm in a simple programming module
  - Analyze possible problems due to ambiguity in the context/metadata
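One possible shape for a link certainty indicator is sketched below. It is only an illustration, not the expected solution: it assumes each document's metadata is a dict of field name to list of values, and it scores a pair of documents by the weighted Jaccard overlap of those values. The field names, weights and values are all hypothetical.

```python
def link_certainty(meta_a, meta_b, weights):
    """Weighted Jaccard overlap of metadata values, in the range 0..1."""
    total = sum(weights.values())
    score = 0.0
    for field, weight in weights.items():
        a = set(meta_a.get(field, ()))
        b = set(meta_b.get(field, ()))
        if a or b:
            score += weight * len(a & b) / len(a | b)
    return score / total if total else 0.0


# Hypothetical metadata fields and values.
weights = {"cited_laws": 2.0, "keywords": 1.0}
doc_a = {"cited_laws": ["law-1"], "keywords": ["liability", "tort"]}
doc_b = {"cited_laws": ["law-1"], "keywords": ["tort", "damages"]}

print(round(link_certainty(doc_a, doc_b, weights), 3))
```

Ambiguous metadata (for example, one keyword used in two senses) inflates this kind of score, which is exactly the problem the last step asks you to analyze.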
14. Assignment 4: User Alert Service
- Related: Lecture 9 - C-content
- What do we have?
  - Document collections, which
    - contain metadata
    - have contextual relations between documents
  - Structure of the context/metadata relation (assignment 2)
- What is the problem?
  - How to construct the user alert service based on the user profile information and the resulting document relations?
15. Assignment 4: User Alert Service
- Assignment
  - Design and implement the user profile structure and the user alert service
- Proposed steps
  - Design the user profile structure
  - Design the algorithm for the user alert service
  - Implement the algorithm in a simple programming module
  - Analyze the complexity level of the user profile structure and the consequences for the user alert service
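To make the steps concrete, here is a minimal sketch of one possible profile structure and alert rule. Everything in it is an assumption for illustration: a profile holds a set of topics and a set of followed documents, and a user is alerted when a new document matches a topic or is related (with sufficient certainty) to a followed document.

```python
def alerts_for(new_doc, profiles, relations, min_certainty=0.5):
    """Return the users to alert about new_doc."""
    related = {
        other
        for other, certainty in relations.get(new_doc["id"], {}).items()
        if certainty >= min_certainty
    }
    alerted = []
    for profile in profiles:
        topical = bool(profile["topics"] & set(new_doc["keywords"]))
        follows = bool(profile["followed_docs"] & related)
        if topical or follows:
            alerted.append(profile["user"])
    return alerted


# Hypothetical profiles, document and relation certainties.
profiles = [
    {"user": "anna", "topics": {"tax"}, "followed_docs": set()},
    {"user": "bob", "topics": set(), "followed_docs": {"doc-1"}},
    {"user": "carl", "topics": {"labour"}, "followed_docs": {"doc-9"}},
]
new_doc = {"id": "doc-2", "keywords": ["tax", "vat"]}
relations = {"doc-2": {"doc-1": 0.8, "doc-9": 0.2}}

print(alerts_for(new_doc, profiles, relations))
```

A richer profile structure (weights per topic, alert frequency, and so on) makes the matching rule more complex, which is the trade-off the last step asks about.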
16. Assignment 3: Using Relevance Feedback during Information Retrieval
- Related: lecture 3 by Theo van der Weide
- Consider the cooperators of the Informatics Department.
- User feedback for query modification with the Rocchio technique
- Download the Perlfect Search engine sources (http://www.perlfect.com/freescripts/search/).
- After a query, offer the searcher the top 10 ranking documents, and ask for feedback.
- Use this feedback to construct the modified query using the Rocchio technique.
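The query modification step can be sketched as below, using the standard Rocchio update: the new query is alpha times the old query, plus beta times the mean of the relevant document vectors, minus gamma times the mean of the non-relevant ones. The sketch is in Python rather than Perl, represents vectors as plain term-to-weight dicts, and uses commonly quoted default values for alpha, beta and gamma; these choices are assumptions, not part of the assignment.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector as a dict of term -> weight."""
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the mean relevant vector, subtract the mean non-relevant vector.
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Terms that end up with a non-positive weight are dropped.
    return {t: w for t, w in new_query.items() if w > 0}


# Hypothetical feedback on the top-ranked documents.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "car": 1.0}]
non_relevant = [{"jaguar": 1.0, "cat": 1.0}]
print(rocchio(query, relevant, non_relevant))
```

In this toy case the term "car" enters the query and "cat" is pushed out, which is the intended effect of the feedback.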
18. Assignment 5: Language identification is an important basic component for multilingual search engines
- Related: lecture 8 by Wessel Kraaij (TNO)
- Develop a language identification module using generative character models for the 21 EU languages.
- Develop and test the language identification module on the EU constitution corpus.
- Test the language identification module on some non-EU languages
  - (e.g. taken from the KDE corpus on the same site)
  - adjust the classifier in such a way that it can recognize unknown languages.
- Optional: produce a similarity matrix between the languages using the identification module.
- Use the similarity matrix to build a tree using hierarchical agglomerative techniques (related languages will be grouped).
19. Assignment 5
- Download the EU constitution in 21 languages from http://logos.uio.no/opus/
- Split each language corpus into a training, test and evaluation set.
- Train classifiers for each language using character trigram models smoothed with character bigram models.
- Optimize the smoothing parameter on the test data.
- Compute the accuracy of the classifier by taking 100 sentences from the evaluation set of each language.
- After adjusting the classifier, test the accuracy of the classifier on the same evaluation set plus 100 sentences from 5 non-European languages.
- Present results per language.
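The training and classification steps can be sketched as follows. This is a toy illustration under several assumptions: linear interpolation of trigram and bigram estimates with a single parameter `lam`, a small probability floor for unseen events, and two one-sentence "corpora" that stand in for the real training sets. Names such as `CharTrigramModel` and `identify` are hypothetical.

```python
import math
from collections import Counter


class CharTrigramModel:
    """Character trigram model, linearly interpolated with a bigram model."""

    def __init__(self, text, lam=0.7):
        self.lam = lam  # smoothing parameter, to be tuned on the test set
        padded = "  " + text
        self.tri = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
        self.bi = Counter(padded[i:i + 2] for i in range(len(padded) - 1))
        self.uni = Counter(padded)

    def log_prob(self, sentence):
        padded = "  " + sentence
        total = 0.0
        for i in range(2, len(padded)):
            context, prev, char = padded[i - 2:i], padded[i - 1], padded[i]
            p_tri = self.tri[context + char] / self.bi[context] if self.bi[context] else 0.0
            p_bi = self.bi[prev + char] / self.uni[prev] if self.uni[prev] else 0.0
            p = self.lam * p_tri + (1 - self.lam) * p_bi
            total += math.log(max(p, 1e-6))  # floor avoids log(0) for unseen events
        return total


def identify(models, sentence):
    """Return the language whose model gives the sentence the highest score."""
    return max(models, key=lambda lang: models[lang].log_prob(sentence))


# Hypothetical miniature training texts.
models = {
    "en": CharTrigramModel("the quick brown fox jumps over the lazy dog and the cat"),
    "nl": CharTrigramModel("de snelle bruine vos springt over de luie hond en de kat"),
}
print(identify(models, "the dog"), identify(models, "de hond"))
```

Recognizing unknown languages would need an extra decision rule on top of this, for example a threshold on the best per-character score; that part is left open here.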
20. Assignment 5: Language identification is an important basic component for multilingual search engines
- Optional: produce a similarity matrix, using the previous test data, by averaging probabilities across test sentences and averaging P(A|B) and P(B|A) (the similarity matrix must be symmetric). Use CLUTO to cluster the languages, e.g. by applying hierarchical agglomerative clustering.
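The symmetrization can be sketched as follows, assuming `scores[a][b]` holds the average (log-)probability that the model for language `a` assigns to the test sentences of language `b`. That interpretation of P(A|B), and the numbers used, are assumptions for illustration.

```python
def symmetric_similarity(scores):
    """Average scores[a][b] and scores[b][a] so the matrix is symmetric."""
    langs = sorted(scores)
    return {
        a: {b: (scores[a][b] + scores[b][a]) / 2 for b in langs}
        for a in langs
    }


# Hypothetical average log-probabilities per character.
scores = {
    "nl": {"nl": -1.0, "de": -2.0},
    "de": {"nl": -3.0, "de": -1.0},
}
similarity = symmetric_similarity(scores)
print(similarity["nl"]["de"], similarity["de"]["nl"])
```

The resulting matrix can then be exported in whatever input format the clustering tool expects.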
21. Assignment 6: A general assumption in CLIR research is that parallel corpora can be exploited to improve monolingual search
- Related: lecture 8 by Wessel Kraaij (TNO)
- Test the assumption by training statistical thesauri and performing experiments on public IR test collections.
- Investigate whether a combination of statistical thesauri trained on different parallel texts can be used to improve retrieval performance.
22. Assignment 6: A general assumption in CLIR research is that parallel corpora can be exploited to improve monolingual search
- Choose three European languages other than English: X, Y, Z. Train statistical thesauri English -> {X, Y, Z} -> English on several parallel corpora available through the OPUS website, by cascading two translation models trained with the EGYPT toolkit (choose IBM model 1).
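Cascading two translation models amounts to summing over the pivot language: P(e2 | e1) = sum over x of P(x | e1) * P(e2 | x). The sketch below assumes each trained model has been loaded into a nested dict of source word to target word to probability; the words and probabilities are made up, and loading the actual EGYPT output is not shown.

```python
from collections import defaultdict


def cascade(en_to_x, x_to_en):
    """Build an English -> English thesaurus via the pivot language X."""
    thesaurus = {}
    for e1, translations in en_to_x.items():
        row = defaultdict(float)
        for x, p_x in translations.items():
            for e2, p_e2 in x_to_en.get(x, {}).items():
                row[e2] += p_x * p_e2  # marginalize over the pivot word x
        thesaurus[e1] = dict(row)
    return thesaurus


# Hypothetical translation probabilities with Dutch as pivot language.
en_to_nl = {"car": {"auto": 0.8, "wagen": 0.2}}
nl_to_en = {
    "auto": {"car": 0.9, "automobile": 0.1},
    "wagen": {"car": 0.5, "wagon": 0.5},
}
print(cascade(en_to_nl, nl_to_en))
```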
23. Assignment 6: A general assumption in CLIR research is that parallel corpora can be exploited to improve monolingual search
- Construct various combined thesauri, e.g. by interpolating three individual thesauri trained on a single corpus or by combining thesauri trained on different corpora. To be effective, such thesauri usually are interpolated with the identity matrix. (Xu et al. 2002, TREC2001)
- Implement a CLIR system using LEMUR and http://www.lemurproject.org/doxygen/lemur-3.1/html/classXLingRetMethod.html as retrieval method.
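Combining thesauri and interpolating with the identity matrix can be sketched as below. The sketch assumes each thesaurus is a nested dict of term to related term to probability, that the combination is a weighted sum with weights adding up to one, and that the interpolation is T' = lam * I + (1 - lam) * T. The weights, `lam` and the toy rows are hypothetical.

```python
def combine(thesauri, weights):
    """Weighted sum of several thesauri (weights are assumed to sum to 1)."""
    combined = {}
    for thesaurus, weight in zip(thesauri, weights):
        for term, row in thesaurus.items():
            target = combined.setdefault(term, {})
            for related, p in row.items():
                target[related] = target.get(related, 0.0) + weight * p
    return combined


def interpolate_with_identity(thesaurus, lam=0.5):
    """Return lam * I + (1 - lam) * T, which keeps mass on the original term."""
    result = {}
    for term, row in thesaurus.items():
        new_row = {related: (1 - lam) * p for related, p in row.items()}
        new_row[term] = new_row.get(term, 0.0) + lam
        result[term] = new_row
    return result


# Hypothetical thesauri trained on two different corpora.
t1 = {"car": {"car": 0.8, "wagon": 0.2}}
t2 = {"car": {"car": 0.6, "automobile": 0.4}}
combined = combine([t1, t2], [0.5, 0.5])
print(interpolate_with_identity(combined, lam=0.5))
```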
24. Assignment 6: A general assumption in CLIR research is that parallel corpora can be exploited to improve monolingual search
- Perform monolingual IR experiments (measure mean average precision) on the CACM and CRANFIELD collections. Experiments must include:
  - a baseline run (standard generative language model)
  - individual thesauri (with and without interpolation with the identity matrix)
  - combined thesauri (with and without interpolation with the identity matrix)
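For reference, mean average precision can be computed as in the sketch below. It assumes each run is a ranked list of document identifiers together with the set of relevant documents for that query; the identifiers are hypothetical. In practice an existing evaluation tool may be preferable, since this sketch only illustrates the measure.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(round(mean_average_precision(runs), 4))
```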
25. Assignment 7: Self-learning Dutch meta-search
- Related: lecture 10 by Nils Rooijmans (ilse media)
- Build a meta-search
- Learn which algorithm works best for which type of queries
  - single vs. multi-word queries
  - general queries vs. specific queries (roughly based on the number of results in the result set)
- Improve ranking
26. Assignment 7: Self-learning Dutch meta-search
- Build website based on APIs
  - Ilse API (mail searchengine_at_ilse.net for documentation)
  - Google API: http://www.google.com/apis/index.html
  - Yahoo API: http://developer.yahoo.net/web/V1/webSearch.html
- Merge result sets (handle duplicates, hide originating API(s))
- Build feedback mechanism
  - Score different APIs for different query types
  - Use score in merging of results
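The merging and feedback steps can be sketched as below. The sketch makes several assumptions: result lists have already been fetched from each API (no real API calls are shown), queries are typed only as single or multi-word, each result contributes the engine's score divided by its rank, and a click raises the score of every engine that returned the clicked URL. The engine names "A" and "B" and all numbers are hypothetical.

```python
from collections import defaultdict


def query_type(query):
    return "single" if len(query.split()) == 1 else "multi"


def merge(query, results_per_engine, engine_scores):
    """Merge ranked URL lists; duplicates collapse into one summed score."""
    qtype = query_type(query)
    merged = defaultdict(float)
    for engine, urls in results_per_engine.items():
        weight = engine_scores[engine][qtype]
        for rank, url in enumerate(urls, start=1):
            merged[url] += weight / rank
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(merged, key=lambda url: (-merged[url], url))


def record_click(query, url, results_per_engine, engine_scores, step=0.1):
    """Feedback: reward every engine that returned the clicked URL."""
    qtype = query_type(query)
    for engine, urls in results_per_engine.items():
        if url in urls:
            engine_scores[engine][qtype] += step


engine_scores = {
    "A": {"single": 1.0, "multi": 1.0},
    "B": {"single": 2.0, "multi": 1.0},
}
results = {"A": ["u1", "u2"], "B": ["u2", "u3"]}
print(merge("fiets", results, engine_scores))
```

Because the merged list contains only URLs, the originating API stays hidden from the user while still influencing the ranking.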
27. Assignment 8: Web-services for Multimedia Indexing
- Related: lecture 8 by Arjen de Vries (CWI)
- 8a) implement the EM algorithm to train a model on a given image
- 8b) make a web-service that takes an image URL and uses this program to train the model (include 8a)
- 8c) make a web-service that visualizes a trained model on a given image, where results for each component are presented in SVG as an overlay of their pixel blocks with transparency depending on the block likelihood (include 8a and 8b)
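As a reminder of how EM alternates between its two steps, here is a minimal sketch for a one-dimensional Gaussian mixture. It assumes the model is a Gaussian mixture and that each pixel block has been reduced to a single feature value (for example mean brightness); the slides do not specify either, and a real solution would use multi-dimensional block features. The data values are hypothetical.

```python
import math


def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, k=2, iterations=50):
    """Fit a k-component (k >= 2) 1-D Gaussian mixture with EM."""
    lo, hi = min(data), max(data)
    means = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # spread over the range
    variances = [((hi - lo) / k) ** 2 or 1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            likes = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(likes) or 1e-300
            resp.append([like / total for like in likes])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            if n_j == 0:
                continue
            weights[j] = n_j / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
    return weights, means, variances


# Hypothetical block features: a dark and a bright region.
blocks = [0.9, 1.0, 1.1, 1.0, 4.9, 5.0, 5.1, 5.0]
weights, means, variances = em_gmm_1d(blocks)
print([round(m, 2) for m in means])
```

The per-block responsibilities computed in the E-step are the kind of likelihood values that 8c would map to SVG transparency.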
28. Assignment 9: Digital Accessibility on Dutch Species
- Related: lecture 6 by Trezorix
- www.soortenregister.nl
- www.nederlandsesoorten.nl (NSR)
- www.w3.org/2004/02/skos (SKOS)
- www.w3.org/TR/owl-features (OWL)
- www.openrdf.org (Sesame)
- lucene.apache.org (Lucene)
29. Assignment 9: Digital Accessibility on Dutch Species
- The Dutch Species Register (Nederlands Soortenregister, NSR)
  - is a thesaurus structure with information on Dutch plants, animals and fungi.
  - has about 40,000 biological concepts (taxa).
  - the web version of the NSR is linked to various databases
    - like a preservation status database
    - an image library
    - article libraries, etc.
  - the editorial maintenance of the site is done by the Dutch National Museum of Natural History, Naturalis
  - the technical infrastructure is developed and maintained by Trezorix.
30. Assignment 9: Digital Accessibility on Dutch Species
- Make a website for digital access to (part of) the collection of the NSR.
- The site should illustrate the use of reference structures for browsing and searching digital collections.
- The NSR structure is a complicated structure of thesaurus-like semantic relations combined with naming data for each concept.
31. Assignment 9: Digital Accessibility on Dutch Species
- Describe an NSR data model.
- Represent the structure part (thesaurus) of the NSR in SKOS.
- Represent as much of the extra data (see below) as possible in SKOS.
- Represent those data which do not fit into SKOS in OWL.
- Use the open source RDF framework Sesame for storage of structures.
- Use the open source search engine Lucene for findability of NSR elements
- Explain how you deal with the complexity of the
32. Assignment 9: Digital Accessibility on Dutch Species
- Applicable extra data
  - Change record data: these data represent a history, per concept, of mutations of naming data.
  - Data for species counter: bottom-up accumulative data about the number of species under a certain node.
  - Data about the availability of photographs of species.
- What will be supplied by Trezorix?
  - A limited NSR dataset, for instance songbirds, including scientific names, synonyms, change record data, etc.
  - Extra data: data for species counter and data about availability of photographs.
  - Relevant photographs.
  - Technical background.
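The species counter described above is a bottom-up accumulation over the taxonomy. A minimal sketch, assuming the hierarchy is available as a dict from each node to its narrower concepts and that every leaf is one species, could look like this. The tiny songbird tree is illustrative only and is not taken from the NSR dataset.

```python
def count_species(tree, node, counts=None):
    """Fill counts[n] with the number of leaf species under each node n."""
    counts = {} if counts is None else counts
    children = tree.get(node, [])
    if not children:
        counts[node] = 1  # a leaf is assumed to be a single species
    else:
        counts[node] = sum(count_species(tree, c, counts)[c] for c in children)
    return counts


# Illustrative hierarchy: order -> families -> species.
tree = {
    "Passeriformes": ["Paridae", "Turdidae"],
    "Paridae": ["Parus major", "Cyanistes caeruleus"],
    "Turdidae": ["Turdus merula"],
}
counts = count_species(tree, "Passeriformes")
print(counts["Passeriformes"], counts["Paridae"], counts["Turdidae"])
```

In a SKOS representation the same traversal would follow the narrower/broader relations between concepts; whether the counts themselves fit into SKOS or need OWL is part of the modeling question.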
33. Course Goal
- dimensions of the IR "problem"
- functions of an IR system
- components of an IR system
- factors which optimize the IR process
- examine current research issues in IR
- explore examples of industrial research IR applications
- form a broad picture of the IR field
- build experience to work with IR systems