1
Web Based Collection Selection Using Singular
Value Decomposition
  • Authors
  • John King and Yuefeng Li
  • Queensland University of Technology
  • Brisbane, Australia

2
Introduction
  • As the number of electronic data collections
    available on the internet increases, so does the
    difficulty of finding the right collection for a
    given query.
  • NOTE: All quotes in this presentation are fully
    referenced in the paper

3
Introduction Cont.
  • Often the first-time user will be overwhelmed by
    the array of options available, and will waste
    time hunting through pages of collection names
    and then reading results pages after an ad-hoc
    search.

4
Collection Selection
  • Automatic collection selection methods try to
    solve this problem by suggesting the best subset
    of collections to search based on a query.

5
Collection Selection Cont.
  • This is of importance to:
  • Fields containing large numbers of electronic
    collections which undergo frequent change
  • Collections that cannot be fully indexed using
    traditional methods such as spiders.

6
Collection Selection Cont.
  • This paper presents a solution to these problems
    by selecting the best collections and reducing
    the number of collections that need to be
    searched.

7
Collection Selection Aims
  • Reduce search costs
  • Make searching multiple collections appear as
    seamless as searching a single collection
  • Learn which collections contain relevant
    information and which collections contain no
    relevant information

8
Collection Selection Aims Cont.
  • If only a small, high-quality subset of the
    available collections is searched, then savings
    can be made in time, bandwidth, and computation.

9
Significance
  • As the internet grows the number of internet
    based collections grows
  • It is now impossible to manually track and index
    all collections as they number in the thousands
  • More and more information resources are made
    available electronically

10
Differences
  • There are a number of differences between
    traditional and web based collection selection

11
Differences Cont.
  • Ordered vs. unordered results
  • Difficult to estimate the size and density of Web
    based collections
  • Commonly used metrics such as recall cannot be
    used with Web based collections

12
Differences Cont.
  • Many collections on the Web cannot be indexed by
    traditional indexing methods, as they present
    only a search interface and no way of indexing
    the contents

13
Differences Cont.
  • Many Web based collections also change
    frequently, meaning that traditional indexing
    methods have to be run frequently in order to
    keep index information valid.

14
Difficult Problems
  • Reducing expenses
  • Increasing search speed
  • Learning to adapt to change in the search
    environment
  • Learning to adapt to the user's preferences

15
Full vs. Sampled
  • There are two main approaches to collection
    selection.
  • Generate a full index of every term in every
    available collection.
  • Take samples from each collection.

16
Our Solution
  • The solution presented uses a novel application
    of singular value decomposition to find
    relationships between samples of collections, and
    user precision feedback to find user-preferred
    collections.

17
Our Solution Cont.
  • The solution does not require large amounts of
    meta-data about each server, and does not require
    any of the servers to keep meta-data about
    themselves.

18
Our Solution Cont.
  • The anticipated outcome is a system that can
    sample heterogeneous collections and rank the
    collections based on relevance to a query.
  • The system will also not suffer the problems of
    synonymy and polysemy faced by traditional
    information retrieval systems.

19
Expected Gains
  • Well-planned collection selection can have a
    large influence on the efficiency of a query.
  • Lu et al. and Choi et al. state that there is a
    high correlation between the search performance
    and the distribution of queries among
    collections.

20
Full Index
  • Creating a full index of every collection is the
    more accurate (and expensive) approach.
  • In the case of the Web full indexing would mean
    that every one of the thousands of available
    collections would be fully crawled and indexed.

21
Full Index Cont.
  • Creating these indexes consumes large amounts of
    computation, bandwidth, and memory, which means
    that this approach is available only to a few.

22
Sampled Index
  • Because of the size and rate of change of many
    collections, it is often difficult and expensive
    to create an index of each collection used
  • Thus small samples of the highest ranked (or most
    representative) documents can be taken from each
    collection instead, and these samples are treated
    as being representative of the entire collection.

23
Sampled Index Cont.
  • This would mean that a short query could be
    broadcast to the entire set of collections, and
    the top results returned from these collections
    could then be ranked in order of relevance to the
    query.
  • This sampling technique works well with large
    sets of collections.
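
A minimal Python sketch of this sampling step. The per-collection
search functions are hypothetical stand-ins for each collection's
search interface, not something described in the paper:

def build_samples(collections, probe_query, k=40):
    # collections: {name: search_fn}, where each (hypothetical)
    # search_fn wraps one collection's search interface and returns
    # documents ranked by relevance to the query.
    samples = {}
    for name, search_fn in collections.items():
        results = search_fn(probe_query)
        samples[name] = results[:k]  # top-k results stand in for the collection
    return samples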

24
Search Strategy
  • Collections are typically searched in parallel,
    as some collections may have a higher response
    time than other collections due to network load,
    database access times, and other factors out of
    control of the searcher.
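
A rough sketch of that parallel fan-out, again assuming hypothetical
per-collection search functions; a slow or unresponsive server simply
misses the deadline instead of blocking the others:

import concurrent.futures

def search_all(collections, query, deadline=10.0):
    results = {name: [] for name in collections}  # default: no results
    pool = concurrent.futures.ThreadPoolExecutor()
    futures = {pool.submit(fn, query): name
               for name, fn in collections.items()}
    try:
        for done in concurrent.futures.as_completed(futures,
                                                    timeout=deadline):
            try:
                results[futures[done]] = done.result()
            except Exception:
                pass  # a failed collection contributes nothing
    except concurrent.futures.TimeoutError:
        pass  # collections that missed the deadline keep empty results
    pool.shutdown(wait=False, cancel_futures=True)
    return results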

25
Related Work
  • CORI (Collection Retrieval Inference Network)
  • A Bayesian probabilistic inference network
  • The best collections are the ones that contain
    the most documents related to the query
  • GlOSS (Glossary of Servers Server)
  • Uses a server which contains all the relevant
    information about the other collections
  • Periodically updated by a collector program which
    gets information from each collection
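
For background, the CORI belief score for a single query term is
commonly given in the following form (constants are from the standard
formulation in the literature; this is context for the comparison on
the next slide, not part of this paper's own method):

import math

def cori_belief(df, cf, cw, avg_cw, num_collections, b=0.4):
    # df: document frequency of the term in this collection
    # cf: number of collections containing the term
    # cw: word count of this collection; avg_cw: mean across collections
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return b + (1 - b) * t * i  # belief the collection satisfies the term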

26
CORI and GlOSS comparison
  • In a comparison of CORI and GlOSS (Craswell 2000)
    it was found that CORI was the best method, and
    that a selection of a small number of collections
    could outperform selecting all the servers and a
    central index.
  • Probe queries are a good method for evaluating
    collections without having full knowledge of the
    collection contents, with 50,000 documents
    evaluated instead of 250,000 documents.

27
CORI and GlOSS comparison Cont.
  • These conclusions are important to our research
    because they show that a high quality subset of
    collections will be as effective as a full set of
    collections, and that probes of a collection are
    an effective method of ranking an entire
    collection.

28
Singular Value Decomposition
  • Singular Value Decomposition is a vector space
    model for analysing relationships within a
    matrix.
  • A statistical method of reducing the noise in a
    matrix and associating related concepts together.
  • An advanced form of pattern matching

29
Singular Value Decomposition Cont.
  • Used with a term-collection matrix, Singular
    Value Decomposition takes the background
    structure of word usage and removes the noise.
  • This noise reduction allows the higher-order
    relationships between terms and documents to be
    clearly seen.
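
In matrix terms, this noise reduction is a rank-k approximation of
the term-collection matrix. A minimal numpy sketch (the choice of k
is a tuning parameter, not specified on the slide):

import numpy as np

def reduce_noise(A, k):
    # A: term-collection matrix (rows = terms, columns = collections).
    # Keeping only the k largest singular values discards noise while
    # preserving the dominant structure of word usage.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]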

30
Singular Value Decomposition Cont.
  • Polysemy is a word having multiple meanings
  • Synonymy is multiple words having the same
    meaning
  • Latent Semantic Analysis helps with these
    problems; in fact, a term does not even have to
    occur within a collection for Latent Semantic
    Analysis to find that the collection is relevant
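
One way to see why: queries and collections are compared in the
reduced concept space rather than by literal term overlap. A sketch
using the standard LSA query fold-in:

import numpy as np

def rank_collections(A, q, k):
    # A: term-collection matrix; q: query vector over the same terms.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    q_hat = q @ U[:, :k] / s[:k]   # fold the query into concept space
    C = Vt[:k, :].T                # one concept-space row per collection
    sims = C @ q_hat / (np.linalg.norm(C, axis=1)
                        * np.linalg.norm(q_hat) + 1e-12)
    # A collection can score highly even if it never contains the
    # literal query terms, because related terms share dimensions.
    return np.argsort(-sims)       # collections, most relevant first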

31
Singular Value Decomposition Cont.
  • This is a valuable feature, given how popular
    off-the-page rankings are with search engines at
    present.

32
Singular Value Decomposition Cont.
  • Chen et al. showed that Singular Value
    Decomposition consistently improves both recall
    and precision
  • The Singular Value Decomposition method has
    equalled or outperformed standard vector
    retrieval methods in almost every case, and gives
    up to 30% better retrieval performance

33
Experiment
  • The experiment will attempt to demonstrate two
    things:
  • A sampled set of data can approximate the
    retrieval of a full set of data when used with
    Singular Value Decomposition
  • Singular Value Decomposition is an effective
    method of selecting collections of data

34
IEEE Data Set
  • Using the IEEE Computer Society XML Retrieval
    Research Collection
  • Scientific articles
  • Marked up in XML

35
IEEE Data Set Cont.
  • Approximately 500 megabytes
  • Contains over twelve thousand articles
  • From 18 magazines/transactions
  • Covering the period 1995-2002
  • An article on average consists of 1500 XML nodes.
  • (data from http://www.is.informatik.uni-duisburg.de/projects/inex03/)

36
Experiment
  • For the first part of the experiment, we will try
    to prove that sampled data can approximate a full
    set of data
  • This will mean that this method can be extended
    to search engines and collections which cannot be
    fully indexed such as databases which only expose
    a search interface.

37
The queries
  • The queries used for this experiment are:
  • Data Mining
  • Knowledge Representation
  • Intelligent Agents
  • Pattern Recognition

38
Full Set of Data
  • The entire IEEE collection was indexed and
    converted into an 18-column inverted index, each
    column being a journal
  • The queries were then added to the matrix
  • The matrix size was 18 x 48,890
  • SVD was run on the matrix, taking less than 3
    seconds to calculate the collection matrix
  • The results were sorted by the query columns
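
A hedged reconstruction of that pipeline in numpy (the slide does not
give the term weighting scheme, so raw term counts are assumed; rows
are the 48,890 terms, columns the 18 journals plus the query):

import numpy as np

def rank_journals(term_journal, query_col):
    # term_journal: terms x 18 count matrix; query_col: query vector
    # appended as an extra "collection" column, as on the slide.
    A = np.hstack([term_journal, query_col.reshape(-1, 1)])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cols = (np.diag(s) @ Vt).T   # concept-space coordinates per column
    q, J = cols[-1], cols[:-1]   # query column vs. 18 journal columns
    sims = J @ q / (np.linalg.norm(J, axis=1)
                    * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)     # journals sorted by the query column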

39
Sampled Set of Data
  • The IEEE data was transformed into an inverted
    index and stored in MS Access
  • The database is approximately 1.8 GB in size
  • A custom built search engine (GP-XOR) was used
    for finding the top results from the database
  • The top 40 and 60 results from each journal were
    taken

40
Sampled Set of Data
  • The returned documents were indexed and
    transformed into a matrix.
  • The query was then added to the matrix
  • SVD was performed on the matrix
  • The results were ordered by the query
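
The document-level results then have to be rolled up to journals. One
plausible aggregation (the slides do not spell out the exact scheme):
each sampled document votes for its source journal with its SVD
similarity to the query:

from collections import defaultdict

def score_journals(ranked_docs, doc_to_journal):
    # ranked_docs: [(doc_id, similarity), ...] from the document-level
    # SVD; doc_to_journal: source journal of each sampled document.
    scores = defaultdict(float)
    for doc_id, sim in ranked_docs:
        scores[doc_to_journal[doc_id]] += sim
    return sorted(scores.items(), key=lambda kv: -kv[1])  # best first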

41
Sampled Set of Data
  • Good results were found with sample sizes of 40
    documents or more
  • Attempts to use smaller sample sizes such as 10
    and 20 articles from each journal did not succeed

42
Discussion
  • The sample size of 40 from each of the 18
    journals is still far cheaper than indexing over
    12,000 documents
  • This sampling method is also more efficient when
    collection data changes or grows frequently, and
    when full indexing is not possible, as with
    search engines.

43
SVD Results - Intelligent Agents
44
Search Engine Results - Intelligent Agents
  • 9.22685024 IEEE INTELLIGENT SYSTEMS
  • 4.31090544 COMPUTER
  • 2.89284444 IEEE TRANSACTIONS ON KNOWLEDGE AND
    DATA ENGINEERING
  • 2.79830704 IEEE INTERNET COMPUTING
  • 1.47478344 IEEE MULTIMEDIA
  • 1.09663384 IEEE TRANSACTIONS ON PATTERN ANALYSIS
    AND MACHINE INTELLIGENCE
  • 1.00209644 IEEE TRANSACTIONS ON SOFTWARE
    ENGINEERING
  • 0.96428148 IEEE CONCURRENCY
  • 0.86974408 IEEE COMPUTER GRAPHICS AND
    APPLICATIONS
  • 0.77520668 IEEE TRANSACTIONS ON COMPUTERS
  • 0.52940944 IEEE MICRO
  • 0.49159448 IEEE TRANSACTIONS ON PARALLEL AND
    DISTRIBUTED SYSTEMS
  • 0.472687 IEEE SOFTWARE
  • 0.39705708 COMPUTING IN SCIENCE & ENGINEERING
  • 0.3781496 IT PROFESSIONAL
  • 0.34033464 IEEE DESIGN & TEST OF COMPUTERS
  • 0.20798228 IEEE ANNALS OF THE HISTORY OF
    COMPUTING
  • 0.11344488 IEEE TRANSACTIONS ON VISUALIZATION AND
    COMPUTER GRAPHICS

45
Intelligent Agents Distance Between Search
Engine and SVD
46
All queries SVD vs. Search Engine
47
Conclusion
  • Preliminary results indicate that the technique
    is suited to the task of selecting the most
    relevant collections and learning user
    preferences in collections
  • The approach uses short queries and samples of
    data and is thus suitable for use on the Web
  • The approach also reduces the need for
    ontologies and thesauri.

48
Conclusion Cont.
  • The system proved to be fast, with the actual
    indexing of the data taking the longest time.

49
Further Work
  • Find the optimal term weighting scheme
  • Find the optimal sample size taken from each
    collection
  • Compare the system with CORI/GlOSS or another
    functioning collection selection system using
    the IEEE journals