1
Web Based Collection Selection Using Singular
Value Decomposition
  • Authors
  • John King and Yuefeng Li
  • Queensland University of Technology
  • Brisbane, Australia

2
Introduction
  • As the number of electronic data collections
    available on the internet increases, so does the
    difficulty of finding the right collection for a
    given query.
  • NOTE: All quotes in this presentation are fully
    referenced in the paper

3
Introduction Cont.
  • Often the first-time user will be overwhelmed by
    the array of options available, and will waste
    time hunting through pages of collection names
    and then reading results pages after an ad-hoc
    search.

4
Collection Selection
  • Automatic collection selection methods try to
    solve this problem by suggesting the best subset
    of collections to search based on a query.

5
Collection Selection Cont.
  • This is of importance to:
  • Fields containing large numbers of electronic
    collections which undergo frequent change
  • Collections that cannot be fully indexed using
    traditional methods such as spiders.

6
Collection Selection Cont.
  • This paper presents a solution to these problems
    by selecting the best collections and reducing
    the number of collections that need to be
    searched.

7
Collection Selection Aims
  • Reduce search costs
  • Make searching multiple collections appear as
    seamless as searching a single collection
  • Learn which collections contain relevant
    information and which collections contain no
    relevant information

8
Collection Selection Aims Cont.
  • If only a small, high-quality subset of the
    available collections is searched, then savings
    can be made in time, bandwidth, and computation.

9
Significance
  • As the internet grows the number of internet
    based collections grows
  • It is now impossible to manually track and index
    all collections as they number in the thousands
  • More and more information resources are made
    available electronically

10
Differences
  • There are a number of differences between
    traditional and web based collection selection

11
Differences Cont.
  • Ordered vs. unordered results
  • Difficult to estimate the size and density of Web
    based collections
  • Commonly used metrics such as recall cannot be
    used with Web based collections

12
Differences Cont.
  • Many collections on the Web cannot be indexed by
    traditional indexing methods, as they present
    only a search interface and no way of indexing
    the contents

13
Differences Cont.
  • Many Web based collections also change
    frequently, meaning that traditional indexing
    methods have to be run frequently in order to
    keep index information valid.

14
Difficult Problems
  • Reducing expenses
  • Increasing search speed
  • Learning to adapt to change in the search
    environment
  • Learning to adapt to the user's preferences

15
Full vs. Sampled
  • There are two main approaches to collection
    selection.
  • Generate a full index of every term in every
    available collection.
  • Take samples from each collection.

16
Our Solution
  • The solution presented uses a novel application
    of singular value decomposition to find
    relationships between samples of collections, and
    user precision feedback to find user-preferred
    collections.

17
Our Solution Cont.
  • The solution does not require large amounts of
    meta-data about each server, and does not require
    any of the servers to keep meta-data about
    themselves.

18
Our Solution Cont.
  • The anticipated outcome is a system that can
    sample heterogeneous collections and rank the
    collections based on relevance to a query.
  • The system will also not suffer the problems of
    synonymy and polysemy faced by traditional
    information retrieval systems.

19
Expected Gains
  • Well-planned collection selection can have a
    large influence on the efficiency of a query.
  • Lu et al. and Choi et al. state that there is a
    high correlation between the search performance
    and the distribution of queries among
    collections.

20
Full Index
  • Creating a full index of every collection is the
    more accurate (and expensive) approach.
  • In the case of the Web full indexing would mean
    that every one of the thousands of available
    collections would be fully crawled and indexed.

21
Full Index Cont.
  • Creating these indexes consumes large amounts of
    computation, bandwidth, and memory, which means
    that this approach is available only to a few.

22
Sampled Index
  • Because of the size and rate of change of many
    collections, it is often difficult and expensive
    to create an index of each collection used
  • Thus small samples of the highest ranked (or most
    representative) documents can be taken from each
    collection instead, and these samples are treated
    as being representative of the entire collection.

23
Sampled Index Cont.
  • This would mean that a short query could be
    broadcast to the entire set of collections, and
    the top results returned from these collections
    could then be ranked in order of relevance to the
    query.
  • This sampling technique works well with large
    sets of collections.
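
A minimal Python sketch of this sampling step. The per-collection
search functions are hypothetical stand-ins for each collection's
search interface, not something described in the paper:

def build_samples(collections, probe_query, k=40):
    # collections: {name: search_fn}, where each (hypothetical)
    # search_fn wraps one collection's search interface and returns
    # documents ranked by relevance to the query.
    samples = {}
    for name, search_fn in collections.items():
        results = search_fn(probe_query)
        samples[name] = results[:k]  # top-k results stand in for the collection
    return samples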

24
Search Strategy
  • Collections are typically searched in parallel,
    as some collections may have a higher response
    time than other collections due to network load,
    database access times, and other factors out of
    control of the searcher.
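
A rough sketch of that parallel fan-out, again assuming hypothetical
per-collection search functions; a slow or unresponsive server simply
misses the deadline instead of blocking the others:

import concurrent.futures

def search_all(collections, query, deadline=10.0):
    results = {name: [] for name in collections}  # default: no results
    pool = concurrent.futures.ThreadPoolExecutor()
    futures = {pool.submit(fn, query): name
               for name, fn in collections.items()}
    try:
        for done in concurrent.futures.as_completed(futures,
                                                    timeout=deadline):
            try:
                results[futures[done]] = done.result()
            except Exception:
                pass  # a failed collection contributes nothing
    except concurrent.futures.TimeoutError:
        pass  # collections that missed the deadline keep empty results
    pool.shutdown(wait=False, cancel_futures=True)
    return results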

25
Related Work
  • CORI (Collection Retrieval Inference Network)
  • A Bayesian probabilistic inference network
  • The best collections are the ones that contain
    the most documents related to the query
  • GlOSS (Glossary of Servers Server)
  • Uses a server which contains all the relevant
    information about the other collections
  • Periodically updated by a collector program which
    gets information from each collection
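
For background, the CORI belief score for a single query term is
commonly given in the following form (constants are from the standard
formulation in the literature; this is context for the comparison on
the next slide, not part of this paper's own method):

import math

def cori_belief(df, cf, cw, avg_cw, num_collections, b=0.4):
    # df: document frequency of the term in this collection
    # cf: number of collections containing the term
    # cw: word count of this collection; avg_cw: mean across collections
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return b + (1 - b) * t * i  # belief the collection satisfies the term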

26
CORI and GlOSS comparison
  • In a comparison of CORI and GlOSS (Craswell 2000)
    it was found that CORI was the best method, and
    that a selection of a small number of collections
    could outperform selecting all the servers and a
    central index.
  • Probe queries are a good method for evaluating
    collections without having full knowledge of the
    collection contents, with 50,000 documents
    evaluated instead of 250,000 documents.

27
CORI and GlOSS comparison Cont.
  • These conclusions are important to our research
    because they show that a high quality subset of
    collections will be as effective as a full set of
    collections, and that probes of a collection are
    an effective method of ranking an entire
    collection.

28
Singular Value Decomposition
  • Singular Value Decomposition is a vector space
    model for analysing relationships within a
    matrix.
  • A statistical method of reducing the noise in a
    matrix and associating related concepts together.
  • An advanced form of pattern matching

29
Singular Value Decomposition Cont.
  • Used with a term-collection matrix, Singular
    Value Decomposition takes the background
    structure of word usage and removes the noise.
  • This noise reduction allows the higher-order
    relationships between terms and documents to be
    clearly seen.
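
In matrix terms, this noise reduction is a rank-k approximation of
the term-collection matrix. A minimal numpy sketch (the choice of k
is a tuning parameter, not specified on the slide):

import numpy as np

def reduce_noise(A, k):
    # A: term-collection matrix (rows = terms, columns = collections).
    # Keeping only the k largest singular values discards noise while
    # preserving the dominant structure of word usage.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]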

30
Singular Value Decomposition Cont.
  • Polysemy is a word having multiple meanings
  • Synonymy is multiple words having the same
    meaning
  • Latent Semantic Analysis helps with these
    problems; in fact, a term does not even have to
    occur within a collection for Latent Semantic
    Analysis to find that the collection is relevant
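
One way to see why: queries and collections are compared in the
reduced concept space rather than by literal term overlap. A sketch
using the standard LSA query fold-in:

import numpy as np

def rank_collections(A, q, k):
    # A: term-collection matrix; q: query vector over the same terms.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    q_hat = q @ U[:, :k] / s[:k]   # fold the query into concept space
    C = Vt[:k, :].T                # one concept-space row per collection
    sims = C @ q_hat / (np.linalg.norm(C, axis=1)
                        * np.linalg.norm(q_hat) + 1e-12)
    # A collection can score highly even if it never contains the
    # literal query terms, because related terms share dimensions.
    return np.argsort(-sims)       # collections, most relevant first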

31
Singular Value Decomposition Cont.
  • This is a valuable feature, given how popular
    off-the-page rankings are with search engines at
    present.

32
Singular Value Decomposition Cont.
  • Chen et al. showed that Singular Value
    Decomposition consistently improves both recall
    and precision
  • The Singular Value Decomposition method has
    equalled or outperformed standard vector
    retrieval methods in almost every case, and gives
    up to 30% better retrieval performance

33
Experiment
  • The experiment will attempt to demonstrate two
    things:
  • A sampled set of data can approximate the
    retrieval of a full set of data when used with
    Singular Value Decomposition
  • Singular Value Decomposition is an effective
    method of selecting collections of data

34
IEEE Data Set
  • Using the IEEE Computer Society XML Retrieval
    Research Collection
  • Scientific articles
  • Marked up in XML

35
IEEE Data Set Cont.
  • Approximately 500 megabytes
  • Contains over twelve thousand articles
  • From 18 magazines/transactions
  • Covering the period 1995-2002
  • An article on average consists of 1500 XML nodes.
  • (data from http://www.is.informatik.uni-duisburg.de/projects/inex03/)

36
Experiment
  • For the first part of the experiment, we will try
    to prove that sampled data can approximate a full
    set of data
  • This will mean that this method can be extended
    to search engines and collections which cannot be
    fully indexed such as databases which only expose
    a search interface.

37
The queries
  • The queries used for this experiment are:
  • Data Mining
  • Knowledge Representation
  • Intelligent Agents
  • Pattern Recognition

38
Full Set of Data
  • The entire IEEE collection was indexed and
    converted into an 18-column inverted index, each
    column being a journal
  • The queries were then added to the matrix
  • The matrix size was 18 x 48,890
  • SVD was run on the matrix, taking less than 3
    seconds to calculate the collection matrix
  • The results were sorted by the query columns
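
A hedged reconstruction of that pipeline in numpy (the slide does not
give the term weighting scheme, so raw term counts are assumed; rows
are the 48,890 terms, columns the 18 journals plus the query):

import numpy as np

def rank_journals(term_journal, query_col):
    # term_journal: terms x 18 count matrix; query_col: query vector
    # appended as an extra "collection" column, as on the slide.
    A = np.hstack([term_journal, query_col.reshape(-1, 1)])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cols = (np.diag(s) @ Vt).T   # concept-space coordinates per column
    q, J = cols[-1], cols[:-1]   # query column vs. 18 journal columns
    sims = J @ q / (np.linalg.norm(J, axis=1)
                    * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)     # journals sorted by the query column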

39
Sampled Set of Data
  • The IEEE data was transformed into an inverted
    index and stored in MS Access
  • The database is approximately 1.8 GB in size
  • A custom built search engine (GP-XOR) was used
    for finding the top results from the database
  • The top 40 and 60 results from each journal were
    taken

40
Sampled Set of Data
  • The returned documents were indexed and
    transformed into a matrix.
  • The query was then added to the matrix
  • SVD was performed on the matrix
  • The results were ordered by the query
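
The document-level results then have to be rolled up to journals. One
plausible aggregation (the slides do not spell out the exact scheme):
each sampled document votes for its source journal with its SVD
similarity to the query:

from collections import defaultdict

def score_journals(ranked_docs, doc_to_journal):
    # ranked_docs: [(doc_id, similarity), ...] from the document-level
    # SVD; doc_to_journal: source journal of each sampled document.
    scores = defaultdict(float)
    for doc_id, sim in ranked_docs:
        scores[doc_to_journal[doc_id]] += sim
    return sorted(scores.items(), key=lambda kv: -kv[1])  # best first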

41
Sampled Set of Data
  • Good results were found with sample sizes of 40
    documents or more
  • Attempts to use smaller sample sizes such as 10
    and 20 articles from each journal did not succeed

42
Discussion
  • The sample size of 40 from each of the 18
    journals is still far cheaper than indexing over
    12,000 documents
  • This sampling method is also more efficient when
    collection data changes or grows frequently, and
    when full indexing is not possible, as with
    search engines.

43
SVD Results - Intelligent Agents
44
Search Engine Results - Intelligent Agents
  • 9.22685024 IEEE INTELLIGENT SYSTEMS
  • 4.31090544 COMPUTER
  • 2.89284444 IEEE TRANSACTIONS ON KNOWLEDGE AND
    DATA ENGINEERING
  • 2.79830704 IEEE INTERNET COMPUTING
  • 1.47478344 IEEE MULTIMEDIA
  • 1.09663384 IEEE TRANSACTIONS ON PATTERN ANALYSIS
    AND MACHINE INTELLIGENCE
  • 1.00209644 IEEE TRANSACTIONS ON SOFTWARE
    ENGINEERING
  • 0.96428148 IEEE CONCURRENCY
  • 0.86974408 IEEE COMPUTER GRAPHICS AND
    APPLICATIONS
  • 0.77520668 IEEE TRANSACTIONS ON COMPUTERS
  • 0.52940944 IEEE MICRO
  • 0.49159448 IEEE TRANSACTIONS ON PARALLEL AND
    DISTRIBUTED SYSTEMS
  • 0.472687 IEEE SOFTWARE
  • 0.39705708 COMPUTING IN SCIENCE & ENGINEERING
  • 0.3781496 IT PROFESSIONAL
  • 0.34033464 IEEE DESIGN & TEST OF COMPUTERS
  • 0.20798228 IEEE ANNALS OF THE HISTORY OF
    COMPUTING
  • 0.11344488 IEEE TRANSACTIONS ON VISUALIZATION AND
    COMPUTER GRAPHICS

45
Intelligent Agents Distance Between Search
Engine and SVD
46
All queries SVD vs. Search Engine
47
Conclusion
  • Preliminary results indicate that the technique
    is suited to the task of selecting the most
    relevant collections and learning user
    preferences in collections
  • The approach uses short queries and samples of
    data and is thus suitable for use on the Web
  • The approach also reduces the need for
    ontologies and thesauri.

48
Conclusion Cont.
  • The system proved to be fast, with the actual
    indexing of the data taking the longest time.

49
Further Work
  • Find the optimal term weighting scheme
  • Find the optimal sample size taken from each
    collection
  • Compare the system with CORI/GlOSS or another
    functioning collection selection system using
    the IEEE journals