2ID10: Information Retrieval Lecture 10: Assignments

1
2ID10 Information Retrieval, Lecture 10
Assignments
  • Lora Aroyo
  • 30th May 2006

2
Course Topics
  • Basic Information Retrieval Terminology
  • Query Languages & Operations
  • Precision & Recall in Search Engines
  • Relevance Feedback
  • Language modeling for IR
  • Search engines
  • Reference Structures in IR
  • Multimedia Information Retrieval
  • Publishing of enriched, structured content

3
Assignment Submission
  • Final Assignment: 7th July 2006
  • Submit to IR_at_listserver.tue.nl
  • Register at http://listserver.tue.nl/mailman/listinfo/IR
  • Subject: Group, Assignment
  • Files: group.ass.title.extension
  • URL: http://...../group/assignment
  • URL with running application and documentation
  • In all files include your Group and Assignment,
    as well as the names of all group members

4
Assignment Submission
  • Each assignment should include
  • detailed report on the modeling, algorithmic and
    implementation aspects (if applicable) of the solution
  • clear problem description
  • clear solution description and justification
  • literature and relevant material that you used to
    solve the problem
  • the significance and benefit of your solution
  • URL with report, running implementation and
    source code

5
Assignment 1: How to improve an existing NL
parser
  • Related: Lecture 9 - C-content
  • The behaviour of the NL parser needs to be tuned,
    based upon the results of a pre-defined set of
    queries on a pre-defined document collection.
  • Identify problematic queries or query types
    leading to low precision/recall results.

6
Assignment 1: Improving an existing NL parser
  • What do we have?
  • A natural language parser, which provides
  • Spelling checker
  • Term identification
  • Search term checking
  • Syntactic analysis
  • Semantic expansion
  • Query base generation
  • What is the problem?
  • The behavior needs to be tuned, based upon the
    results of a pre-defined set of queries on a
    pre-defined document collection

7
Assignment 1: Improving an existing NL parser
  • Assignment
  • Identifying problematic queries or query types
  • (Requires at least one Dutch native speaker in
    the team)
  • Type of problems
  • Low precision
  • Low recall
  • Misunderstood queries
  • E.g., does "information on cars and traffic jams"
    mean that only documents containing both cars
    and traffic jams should be found, or all
    documents containing either cars or traffic
    jams?
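
The two readings of the example query can be made concrete with a small sketch. The toy inverted index and the document IDs below are hypothetical; the point is only that the AND reading and the OR reading return different result sets, so the expected Boolean query has to be fixed before precision and recall can be judged.

```python
# Hypothetical toy inverted index: term -> set of document IDs.
index = {
    "cars": {"d1", "d2", "d3"},
    "traffic jams": {"d2", "d3", "d4"},
}

def boolean_and(index, terms):
    """Documents that contain every one of the terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, terms):
    """Documents that contain at least one of the terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.union(*postings) if postings else set()

terms = ["cars", "traffic jams"]
and_result = boolean_and(index, terms)
or_result = boolean_or(index, terms)
print(sorted(and_result))  # strict reading: both terms required
print(sorted(or_result))   # loose reading: either term suffices
```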

8
Assignment 1: Improving an existing NL parser
  • Proposed steps
  • Define a set of natural language queries,
    including expected query results
  • define the Boolean query to be executed
  • list the documents to be found by the query
  • Run queries through the on-line website
  • Identify mismatch areas

9
Assignment 1: Improving an existing NL parser
  • Proposed steps
  • Define a set of natural language queries,
    including expected query results
  • Run queries through the on-line website
  • Perform the queries on the information portal
  • Identify mismatch areas

10
Assignment 1: Improving an existing NL parser
  • Proposed steps
  • Define a set of natural language queries,
    including expected query results
  • Run queries through the on-line website
  • Identify mismatch areas
  • keep track of precision and recall percentages
  • if these percentages are low, what is wrong with
    the Boolean query?
  • how could this be improved?
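
Tracking precision and recall per query can be sketched as follows. This is a minimal illustration, assuming each test query comes with a hand-made list of expected documents and the list actually returned by the website; the document IDs are hypothetical.

```python
def precision_recall(expected, retrieved):
    """Return (precision, recall) for one query, given document-ID collections."""
    expected, retrieved = set(expected), set(retrieved)
    hits = len(expected & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall

# Hypothetical judgments and results for one test query.
expected = {"d1", "d2", "d3", "d4"}
retrieved = ["d1", "d2", "d7"]
p, r = precision_recall(expected, retrieved)
print(f"precision={p:.2f} recall={r:.2f}")
```

Queries with a low value on either measure are the candidates for a closer look at the generated Boolean query.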

11
Assignment 2: Identify context/metadata relation
between documents in legal domain
  • Related: Lecture 9 - C-content
  • What do we have?
  • Document collections, which
  • Contain metadata
  • Have contextual relations between documents
  • What is the problem?
  • How to construct relations between documents
    based on existing metadata?

12
Assignment 2: Identify context/metadata relation
between documents in legal domain
  • Assignment
  • Design and implement the algorithm to identify
    the possible relation between documents,
    including a link certainty indicator
  • (Requires at least one Dutch native speaker in
    the team)
  • Basic information
  • An ideal document collection relevant to one
    starting document
  • A mixed collection of ideal and noise documents
  • Description of metadata that can be used to
    identify the relation

13
Assignment 2: Identify context/metadata relation
between documents in legal domain
  • Proposed Steps
  • Study the description and the document collections
    delivered.
  • Design the algorithm to solve the problem
  • Implement the algorithm in a simple programming
    module
  • Analyze possible problems due to ambiguity in the
    context/metadata
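
One possible shape for a link certainty indicator is a weighted overlap of metadata values. The sketch below is an assumption, not the method prescribed by the assignment: the field names (`references`, `keywords`), the weights and the use of Jaccard overlap are all hypothetical choices.

```python
def link_certainty(meta_a, meta_b, weights):
    """Weighted Jaccard overlap across metadata fields, as a value in [0, 1]."""
    total_weight = sum(weights.values())
    score = 0.0
    for field, weight in weights.items():
        a = set(meta_a.get(field, ()))
        b = set(meta_b.get(field, ()))
        if a or b:
            score += weight * len(a & b) / len(a | b)
    return score / total_weight if total_weight else 0.0

# Hypothetical metadata for two documents.
weights = {"references": 2.0, "keywords": 1.0}
doc_a = {"references": ["A1"], "keywords": ["liability", "damages"]}
doc_b = {"references": ["A1"], "keywords": ["liability", "tort"]}
certainty = link_certainty(doc_a, doc_b, weights)
print(round(certainty, 3))
```

Ambiguous metadata (for example, the same keyword used in two senses) inflates such a score, which is one of the problems the last step asks you to analyze.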

14
Assignment 4: User Alert Service
  • Related: Lecture 9 - C-content
  • What do we have?
  • Document collections, which
  • Contain metadata
  • Have contextual relations between documents
  • Structure of the context/metadata relation
    (assignment 2)
  • What is the problem?
  • How to construct the user alert service based on
    the user profile information and the resulting
    document relations?

15
Assignment 4: User Alert Service
  • Assignment
  • Design and implement the user profile structure
    and the user alert service
  • Proposed Steps
  • Design the user profile structure
  • Design the algorithm for the user alert service
  • Implement the algorithm in a simple programming
    module
  • Analyze the complexity level of the user profile
    structure and its consequences for the user alert
    service
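
A very small alert service could look like the sketch below. It assumes a profile that lists the documents a user follows plus a minimum link certainty, and it assumes that each new document arrives with a list of (related document, certainty) pairs. The `UserProfile` fields and the `alerts_for` function are hypothetical names for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    name: str
    watched: set = field(default_factory=set)  # document IDs the user follows
    min_certainty: float = 0.5                 # ignore weaker links

def alerts_for(new_doc_id, relations, profiles):
    """Return (user, new document, matching relations) for every user to alert.

    relations: list of (related_doc_id, certainty) for the new document.
    """
    alerts = []
    for profile in profiles:
        matches = [(doc, c) for doc, c in relations
                   if doc in profile.watched and c >= profile.min_certainty]
        if matches:
            alerts.append((profile.name, new_doc_id, matches))
    return alerts

profiles = [
    UserProfile("alice", {"doc-1"}, min_certainty=0.6),
    UserProfile("bob", {"doc-2"}, min_certainty=0.9),
]
alerts = alerts_for("doc-9", [("doc-1", 0.8), ("doc-2", 0.7)], profiles)
print(alerts)
```

A richer profile (topics, relation types, time windows) makes the matching step more expensive, which is the trade-off the last step asks about.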

16
Assignment 3: Using Relevance Feedback during
Information Retrieval
  • Related: lecture 3 by Theo van der Weide
  • Consider the cooperators of the Informatics
    Department.
  • User feedback for query modification with the
    Rocchio technique
  • Download the Perlfect Search engine sources
    (http://www.perlfect.com/freescripts/search/).
  • After a query, offer the searcher the top 10
    ranking documents, and ask for feedback.
  • Use this feedback to construct the modified query
    using the Rocchio technique.
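
The Rocchio technique moves the query vector towards the centroid of the documents judged relevant and away from the centroid of those judged non-relevant: q' = alpha * q + beta * mean(relevant) - gamma * mean(non-relevant). The sketch below works on sparse term-weight dictionaries. The default values of alpha, beta and gamma are common textbook choices, not values given in the assignment, and dropping negative weights is likewise an assumption.

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query as a dict term -> weight."""
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the relevant centroid, subtract the non-relevant centroid.
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Terms that end up with a non-positive weight are dropped.
    return {t: w for t, w in new_query.items() if w > 0}

# Hypothetical feedback on two of the top-10 documents.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "car": 1.0}]
nonrelevant = [{"jaguar": 1.0, "cat": 1.0}]
modified = rocchio(query, relevant, nonrelevant)
print(modified)
```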

17
Assignment 3: Using Relevance Feedback during
Information Retrieval

18
Assignment 5: Language identification is an
important basic component for multilingual search
engines
  • Related: lecture 8 by Wessel Kraaij (TNO)
  • Develop a language identification module using
    generative character models for the 21 EU
    languages.
  • Develop and test the language identification
    module on the EU constitution corpus.
  • Test the language identification module on some
    non-EU languages
  • (e.g. taken from the KDE corpus on the same site)
  • Adjust the classifier in such a way that it can
    recognize unknown languages.
  • Optional: produce a similarity matrix between the
    languages using the identification module.
  • Use the similarity matrix to build a tree using
    hierarchical agglomerative techniques. (related
    languages will be grouped).

19
Assignment 5
  • Download the EU constitution in 21 languages from
    http://logos.uio.no/opus/
  • Split each language corpus into a training, test
    and evaluation set.
  • Train classifiers for each language using
    character trigram models smoothed with character
    bigram models.
  • Optimize the smoothing parameter on the test
    data.
  • Compute the accuracy of the classifier, by taking
    100 sentences from the evaluation set of each
    language.
  • After adjusting the classifier, test the accuracy
    of the classifier on the same evaluation set plus
    100 sentences from 5 non-European languages.
  • Present results per language.
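
The training and classification steps above can be sketched as follows. This is a minimal illustration under several assumptions: the interpolation P(c | ab) = lam * P_tri(c | ab) + (1 - lam) * P_bi(c | b), the small probability `floor` for unseen events, the rejection threshold for unknown languages and the two tiny training strings are all hypothetical choices. In the assignment, `lam` would be tuned on the test split.

```python
import math
from collections import Counter

class CharTrigramModel:
    """Character trigram model interpolated with a character bigram model."""

    def __init__(self, text, lam=0.7, floor=1e-6):
        self.lam, self.floor = lam, floor
        padded = "  " + text
        n = len(padded)
        self.tri = Counter(padded[i:i + 3] for i in range(n - 2))
        self.bi = Counter(padded[i:i + 2] for i in range(n - 1))
        # Context counts: how often each history was followed by a character.
        self.ctx2 = Counter(padded[i:i + 2] for i in range(n - 2))
        self.ctx1 = Counter(padded[i] for i in range(n - 1))

    def log_prob(self, sentence):
        padded = "  " + sentence
        total = 0.0
        for i in range(2, len(padded)):
            h2, h1, c = padded[i - 2:i], padded[i - 1], padded[i]
            p_tri = self.tri[h2 + c] / self.ctx2[h2] if self.ctx2[h2] else 0.0
            p_bi = self.bi[h1 + c] / self.ctx1[h1] if self.ctx1[h1] else 0.0
            total += math.log(self.lam * p_tri + (1 - self.lam) * p_bi + self.floor)
        return total

def identify(models, sentence, min_avg_logprob=None):
    """Pick the best-scoring language, or 'unknown' below the threshold."""
    scores = {lang: m.log_prob(sentence) for lang, m in models.items()}
    best = max(scores, key=scores.get)
    if min_avg_logprob is not None and sentence:
        if scores[best] / len(sentence) < min_avg_logprob:
            return "unknown"
    return best

# Hypothetical miniature training data; real models need far more text.
models = {
    "en": CharTrigramModel("the quick brown fox jumps over the lazy dog and the cat"),
    "nl": CharTrigramModel("de snelle bruine vos springt over de luie hond en de kat"),
}
print(identify(models, "the dog"), identify(models, "de hond"))
```

Accuracy per language is then the fraction of the 100 evaluation sentences for which `identify` returns the correct label.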

20
Assignment 5: Language identification is an
important basic component for multilingual search
engines
  • Optional: produce a similarity matrix, using the
    previous test data by averaging probabilities
    across test sentences and averaging P(A|B) and
    P(B|A) (the similarity matrix must be symmetric).
    Use CLUTO to cluster the languages, e.g. by
    applying hierarchical agglomerative clustering.
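
The symmetrization and clustering steps can be sketched in plain Python. The slide names CLUTO for the clustering; the average-link procedure below is only a stand-in to show the idea, and the score values are hypothetical. Here `scores[a][b]` is taken to be the average probability that the model for language `b` assigns to test sentences of language `a`.

```python
def symmetrize(scores):
    """Average the two directions so that sim[a][b] == sim[b][a]."""
    langs = sorted(scores)
    return {a: {b: (scores[a][b] + scores[b][a]) / 2 for b in langs} for a in langs}

def agglomerate(sim):
    """Average-link agglomerative clustering; returns the list of merges."""
    clusters = [(lang,) for lang in sorted(sim)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                avg = sum(sim[a][b] for a, b in pairs) / len(pairs)
                if best is None or avg > best[0]:
                    best = (avg, i, j)
        avg, i, j = best
        merges.append((clusters[i], clusters[j], avg))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical averaged scores for three languages.
scores = {
    "nl": {"nl": 1.0, "de": 0.6, "fi": 0.1},
    "de": {"nl": 0.4, "de": 1.0, "fi": 0.1},
    "fi": {"nl": 0.2, "de": 0.1, "fi": 1.0},
}
sim = symmetrize(scores)
merges = agglomerate(sim)
for left, right, avg in merges:
    print(left, right, round(avg, 3))
```

The sequence of merges is the tree: the pair merged first is the most similar pair of languages.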

21
Assignment 6: A general assumption in CLIR
research is that parallel corpora can be
exploited to improve monolingual search
  • Related: lecture 8 by Wessel Kraaij (TNO)
  • Test the assumption by training statistical
    thesauri and performing experiments on public IR
    test collections.
  • Investigate whether a combination of statistical
    thesauri trained on different parallel texts can
    be used to improve retrieval performance.

22
Assignment 6: A general assumption in CLIR
research is that parallel corpora can be
exploited to improve monolingual search
  • Choose three European languages other than
    English: X, Y, Z. Train statistical thesauri
    English->X/Y/Z->English on several parallel
    corpora available through the OPUS website, by
    cascading two translation models trained with the
    EGYPT toolkit (choose IBM model 1).
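
Cascading two translation models amounts to summing over the pivot language: P(e2 | e1) = sum over x of P(e2 | x) * P(x | e1). The sketch below assumes the two trained models are already available as nested dictionaries of translation probabilities; the words and probabilities are hypothetical, and reading the model output of the toolkit is not shown.

```python
from collections import defaultdict

def cascade(src_to_pivot, pivot_to_src):
    """Build a monolingual thesaurus P(word2 | word1) via a pivot language."""
    thesaurus = {}
    for word, pivots in src_to_pivot.items():
        row = defaultdict(float)
        for pivot, p_forward in pivots.items():
            for back, p_backward in pivot_to_src.get(pivot, {}).items():
                row[back] += p_forward * p_backward
        thesaurus[word] = dict(row)
    return thesaurus

# Hypothetical English->Dutch and Dutch->English translation tables.
en_to_nl = {"car": {"auto": 0.8, "wagen": 0.2}}
nl_to_en = {
    "auto": {"car": 0.9, "automobile": 0.1},
    "wagen": {"car": 0.5, "wagon": 0.5},
}
thesaurus = cascade(en_to_nl, nl_to_en)
print(thesaurus["car"])
```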

23
Assignment 6: A general assumption in CLIR
research is that parallel corpora can be
exploited to improve monolingual search
  • Construct various combined thesauri e.g. by
    interpolating three individual thesauri trained
    on a single corpus or by combining thesauri
    trained on different corpora. To be effective
    such thesauri usually are interpolated with the
    identity matrix. (Xu et al. 2002, TREC2001)
  • Implement a CLIR system using LEMUR and
    http://www.lemurproject.org/doxygen/lemur-3.1/html/classXLingRetMethod.html
    as retrieval method.
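
Combining thesauri and interpolating with the identity matrix can both be written as simple mixtures of probability rows. The sketch below is independent of LEMUR and uses hypothetical names and numbers: `combine` takes a weighted sum of several thesauri, and `interpolate_with_identity` reserves a share `mu` of the probability mass for the word itself.

```python
def combine(thesauri, weights):
    """Weighted sum of thesauri; weights are assumed to sum to 1."""
    combined = {}
    for thesaurus, weight in zip(thesauri, weights):
        for word, row in thesaurus.items():
            target = combined.setdefault(word, {})
            for term, p in row.items():
                target[term] = target.get(term, 0.0) + weight * p
    return combined

def interpolate_with_identity(thesaurus, mu):
    """P'(t | w) = mu * [t == w] + (1 - mu) * P(t | w)."""
    result = {}
    for word, row in thesaurus.items():
        new_row = {term: (1 - mu) * p for term, p in row.items()}
        new_row[word] = new_row.get(word, 0.0) + mu
        result[word] = new_row
    return result

# Hypothetical thesauri trained via two different pivot languages.
via_x = {"car": {"car": 0.8, "automobile": 0.2}}
via_y = {"car": {"car": 0.6, "vehicle": 0.4}}
combined = combine([via_x, via_y], [0.5, 0.5])
smoothed = interpolate_with_identity(combined, mu=0.5)
print(smoothed["car"])
```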

24
Assignment 6: A general assumption in CLIR
research is that parallel corpora can be
exploited to improve monolingual search
  • Perform monolingual IR experiments (measure mean
    average precision) on the CACM and CRANFIELD
    collections. Experiments must include:
  • a baseline run (standard generative language
    model)
  • individual thesauri (with and without
    interpolation with the identity matrix)
  • combined thesauri (with and without interpolation
    with the identity matrix)
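
Mean average precision, the measure requested for these runs, can be computed as in the sketch below. The rankings and relevance judgments are hypothetical; for real experiments the judgments come with the test collections.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at the rank of each relevant document."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked result list, relevant document set), one per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Hypothetical results for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

Computing this value once per run (baseline, individual thesauri, combined thesauri) gives the comparison table the assignment asks for.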

25
Assignment 7: Self-learning Dutch meta-search
  • Related: lecture 10 by Nils Rooijmans (ilse
    media)
  • Build a meta-search
  • Learn which algorithm works best for which type
    of queries
  • single vs. multi-word queries
  • general queries vs. specific queries (roughly
    based on the number of results in the resultset)
  • Improve ranking

26
Assignment 7: Self-learning Dutch meta-search
  • Build website based on APIs
  • Ilse API (mail searchengine_at_ilse.net for
    documentation)
  • Google API: http://www.google.com/apis/index.html
  • Yahoo API:
    http://developer.yahoo.net/web/V1/webSearch.html
  • Merge resultsets (handle duplicates, hide
    originating API(s))
  • Build feedback mechanism
  • Score different APIs for different query types
  • Use score in merging of results
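
Merging the resultsets with per-API scores can be sketched as a weighted rank fusion. Everything specific here is an assumption: the `normalize` rule for spotting duplicates, the weight-divided-by-rank scoring, and the example URLs and weights. No real search API is called; the input is the ranked URL list each API is presumed to have returned.

```python
from collections import defaultdict

def normalize(url):
    """Crude duplicate detection: ignore case and a trailing slash."""
    return url.lower().rstrip("/")

def merge_results(results_by_api, api_scores):
    """Merge ranked URL lists; each hit contributes weight / rank.

    api_scores holds the learned score of each API for this query type.
    The originating API is not part of the output.
    """
    scores = defaultdict(float)
    for api, urls in results_by_api.items():
        weight = api_scores.get(api, 1.0)
        for rank, url in enumerate(urls, start=1):
            scores[normalize(url)] += weight / rank
    return sorted(scores, key=lambda u: (-scores[u], u))

def record_feedback(api_scores, api, clicked, rate=0.1):
    """Nudge an API's score up after a click on its result, down otherwise."""
    api_scores[api] = api_scores.get(api, 1.0) + (rate if clicked else -rate)

# Hypothetical ranked results and learned scores for multi-word queries.
results = {
    "ilse": ["http://a.nl/", "http://b.nl"],
    "google": ["http://B.nl", "http://c.nl"],
}
api_scores = {"ilse": 1.0, "google": 0.8}
merged = merge_results(results, api_scores)
print(merged)
```

Keeping one `api_scores` table per query type (single vs. multi-word, general vs. specific) is one way to let the feedback drive the ranking.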

27
Assignment 8: Web-services for Multimedia
Indexing
  • Related: lecture 8 by Arjen de Vries (CWI)
  • 8a) implement the EM algorithm to train a model
    on a given image
  • 8b) make a web-service that takes an image URL
    and uses this program to train the model (include
    8a)
  • 8c) make a web-service that visualizes a trained
    model on a given image, where results for each
    component are presented in SVG as an overlay of
    their pixel blocks with transparency depending on
    the block likelihood (include 8a and 8b)
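
The slide does not say which model is trained with EM. As a sketch under that caveat, the code below fits a one-dimensional Gaussian mixture to a list of numbers, which could stand for one feature value per pixel block; the feature choice, the initialization and the sample values are hypothetical. The responsibilities computed in the E-step are the per-component block likelihoods that step 8c would map to transparency.

```python
import math

def em_gmm_1d(values, k=2, iterations=50):
    """Fit a k-component 1-D Gaussian mixture with EM."""
    lo, hi = min(values), max(values)
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    variances = [((hi - lo) / k) ** 2 or 1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # component has collapsed; leave it unchanged
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, 1e-6)
    return weights, means, variances

# Hypothetical block intensities: a dark region and a bright region.
blocks = [10, 12, 11, 9, 200, 205, 198, 202]
weights, means, variances = em_gmm_1d(blocks)
print([round(m, 2) for m in sorted(means)])
```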

28
Assignment 9: Digital Accessibility on Dutch
Species
  • Related: lecture 6 by Trezorix
  • www.soortenregister.nl
  • www.nederlandsesoorten.nl (NSR)
  • www.w3.org/2004/02/skos (SKOS)
  • www.w3.org/TR/owl-features (OWL)
  • www.openrdf.org (Sesame)
  • lucene.apache.org (Lucene)

29
Assignment 9: Digital Accessibility on Dutch
Species
  • The Dutch Species Register (Nederlands
    Soortenregister, NSR)
  • thesaurus-structure with information on Dutch
    plants, animals and fungi.
  • has about 40,000 biological concepts (taxa).
  • the web version of the NSR is linked to various
    databases
  • like a preservation status database
  • an image library
  • article libraries, etc.
  • the editorial maintenance of the site is done by
    the Dutch National Museum of Natural History
    Naturalis
  • the technical infrastructure is developed and
    maintained by Trezorix.

30
Assignment 9: Digital Accessibility on Dutch
Species
  • Make a website for digital access to (part of)
    the collection of the NSR.
  • The site should illustrate the use of reference
    structures for browsing and searching digital
    collections.
  • The NSR-structure is a complicated structure of
    thesaurus-like semantic relations combined with
    naming data for each concept.

31
Assignment 9: Digital Accessibility on Dutch
Species
  • Describe an NSR data model.
  • Represent the structure part (thesaurus) of the
    NSR in SKOS.
  • Represent as much of the extra data (see below)
    as possible in SKOS.
  • Represent those data which do not fit into SKOS
    in OWL.
  • Use the open source RDF-framework Sesame for
    storage of structures.
  • Use the open source search engine Lucene for
    findability of NSR elements
  • Explain how you deal with the complexity of the
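
Representing the thesaurus part in SKOS could start from something like the sketch below, which writes one concept as Turtle text. The SKOS namespace and the properties `skos:prefLabel`, `skos:altLabel` and `skos:broader` are standard; the base URI, the concept identifiers, the language tags and the decision to store the Dutch name as an alternative label are hypothetical modeling choices. Loading the output into Sesame is not shown.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/nsr/"  # hypothetical base URI for NSR concepts

def concept_to_turtle(concept_id, pref_label, alt_labels=(), broader=None):
    """Serialize one taxon as a skos:Concept in Turtle syntax."""
    lines = [f"<{BASE}{concept_id}> a skos:Concept ;",
             f'    skos:prefLabel "{pref_label}"@la']
    for label in alt_labels:
        lines[-1] += " ;"
        lines.append(f'    skos:altLabel "{label}"@nl')
    if broader:
        lines[-1] += " ;"
        lines.append(f"    skos:broader <{BASE}{broader}>")
    lines[-1] += " ."
    return "\n".join(lines)

turtle = concept_to_turtle("parus-major", "Parus major", ["koolmees"], "paridae")
print(f"@prefix skos: <{SKOS}> .")
print(turtle)
```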

32
Assignment 9: Digital Accessibility on Dutch
Species
  • Applicable extra data
  • Change record data - these data represent a
    history per concept of mutations of naming data.
  • Data for species counter - bottom up accumulative
    data about the number of species under a certain
    node.
  • Data about the availability of photographs of
    species.
  • What will be supplied by Trezorix?
  • A limited NSR dataset, for instance songbirds,
    including scientific names, synonyms, change
    record data, etc.
  • Extra data: data for the species counter and data
    about the availability of photographs.
  • Relevant photographs.
  • Technical background.
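
The species counter described above is a bottom-up accumulation over the taxonomy tree. A minimal sketch, assuming the tree is available as a parent-to-children mapping and that species-level nodes are known; the three-species sample is illustrative and is not the dataset supplied by Trezorix.

```python
def species_counts(children, species):
    """Return node -> number of species at or below that node."""
    counts = {}

    def visit(node):
        total = 1 if node in species else 0
        for child in children.get(node, ()):
            total += visit(child)
        counts[node] = total
        return total

    all_children = {c for kids in children.values() for c in kids}
    for root in set(children) - all_children:
        visit(root)
    return counts

# Illustrative songbird fragment: order -> families -> species.
children = {
    "Passeriformes": ["Paridae", "Turdidae"],
    "Paridae": ["Parus major", "Cyanistes caeruleus"],
    "Turdidae": ["Turdus merula"],
}
species = {"Parus major", "Cyanistes caeruleus", "Turdus merula"}
counts = species_counts(children, species)
print(counts["Passeriformes"], counts["Paridae"], counts["Turdidae"])
```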

33
Course Goal
  • dimensions of the IR "problem"
  • functions of an IR system
  • components of an IR system
  • factors which optimize the IR process
  • examine current research issues in IR
  • explore examples of industrial research IR
    applications
  • form a broad picture of the IR field
  • build experience in working with IR systems