Title: Web Based Collection Selection Using Singular Value Decomposition
1. Web Based Collection Selection Using Singular Value Decomposition
- Authors: John King, Yuefeng Li
- Queensland University of Technology
- Brisbane, Australia
2. Introduction
- As the number of electronic data collections available on the internet increases, so does the difficulty of finding the right collection for a given query.
- NOTE: All quotes in this presentation are fully referenced in the paper.
3. Introduction Cont.
- Often the first-time user will be overwhelmed by the array of options available, and will waste time hunting through pages of collection names, followed by time reading results pages after doing an ad-hoc search.
4. Collection Selection
- Automatic collection selection methods try to
solve this problem by suggesting the best subset
of collections to search based on a query.
5. Collection Selection Cont.
- This is of importance to:
  - Fields containing large numbers of electronic collections which undergo frequent change
  - Collections that cannot be fully indexed using traditional methods such as spiders
6. Collection Selection Cont.
- This paper presents a solution to these problems of selecting the best collections and reducing the number of collections that need to be searched.
7. Collection Selection Aims
- Reduce search costs
- Make searching multiple collections appear as seamless as searching a single collection
- Learn which collections contain relevant information and which contain no relevant information
8. Collection Selection Aims Cont.
- If only a small, high-quality subset of the available collections is searched, then savings can be made in time, bandwidth, and computation.
9. Significance
- As the internet grows, the number of internet-based collections grows
- It is now impossible to manually track and index all collections, as they number in the thousands
- More and more information resources are made available electronically
10. Differences
- There are a number of differences between
traditional and web based collection selection
11. Differences Cont.
- Ordered vs. unordered results
- Difficult to estimate the size and density of Web-based collections
- Commonly used metrics such as recall cannot be used with Web-based collections
12. Differences Cont.
- Many collections on the Web cannot be indexed by traditional indexing methods, as they only present a search interface and no way of indexing the contents
13. Differences Cont.
- Many Web based collections also change
frequently, meaning that traditional indexing
methods have to be run frequently in order to
keep index information valid.
14. Difficult Problems
- Reducing expenses
- Increasing search speed
- Learning to adapt to change in the search environment
- Learning to adapt to the user's preferences
15. Full vs. Sampled
- There are two main approaches to collection selection:
  - Generate a full index of every term in every available collection
  - Take samples from each collection
16. Our Solution
- The solution presented uses a novel application of singular value decomposition to find relationships between samples of collections, and uses user precision feedback to find user-preferred collections.
17. Our Solution Cont.
- The solution does not require large amounts of
meta-data about each server, and does not require
any of the servers to keep meta-data about
themselves.
18. Our Solution Cont.
- The anticipated outcome is a system that can sample heterogeneous collections and rank the collections based on relevance to a query.
- The system will also not suffer the problems of synonymy and polysemy faced by traditional information retrieval systems.
19. Expected Gains
- Well-planned collection selection can have a large influence on the efficiency of a query.
- Lu et al. and Choi et al. state that there is a high correlation between search performance and the distribution of queries among collections.
20. Full Index
- Creating a full index of every collection is the more accurate (and expensive) approach.
- In the case of the Web, full indexing would mean that every one of the thousands of available collections would be fully crawled and indexed.
21. Full Index Cont.
- Creating these indexes consumes large amounts of computation, bandwidth, and memory, which means that this approach is available only to a few.
22. Sampled Index
- Because of the size and rate of change of many collections, it is often difficult and expensive to create an index of each collection used.
- Thus small samples of the highest-ranked (or most representative) documents can be taken from each collection instead, and these samples are treated as representative of the entire collection.
23. Sampled Index Cont.
- This would mean that a short query could be broadcast to the entire set of collections, and the top results returned by these collections could then be ranked in order of relevance to the query.
- This sampling technique works well with large sets of collections.
24. Search Strategy
- Collections are typically searched in parallel, as some collections may have a higher response time than others due to network load, database access times, and other factors outside the searcher's control (see the sketch below).
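As a minimal sketch of this parallel strategy (in Python, with search_collection as a hypothetical stand-in for each collection's own query interface), slow or failing servers are simply dropped rather than allowed to stall the whole search:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

def search_collection(collection, query):
    """Hypothetical placeholder for a collection's own search interface."""
    raise NotImplementedError

def search_all(collections, query, timeout=5.0):
    """Broadcast the query to all collections in parallel and keep
    whatever answers arrive within the timeout."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(collections)))
    futures = {pool.submit(search_collection, c, query): c for c in collections}
    results = {}
    try:
        for done in as_completed(futures, timeout=timeout):
            try:
                results[futures[done]] = done.result()
            except Exception:
                pass  # a failing server is skipped, not retried
    except TimeoutError:
        pass          # servers still busy at the deadline are skipped
    pool.shutdown(wait=False, cancel_futures=True)
    return results
```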
25. Related Work
- CORI (Collection Retrieval Inference Network)
  - A Bayesian probabilistic inference network
  - The best collections are the ones that contain the most documents related to the query (a sketch of its scoring formula follows this list)
- GlOSS (Glossary of Servers Server)
  - Uses a server which contains all the relevant information about the other collections
  - Periodically updated by a collector program which gets information from each collection
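For concreteness, a sketch of the per-term belief score usually cited for CORI (following Callan et al.; the constants 50 and 150 and the default belief of 0.4 come from that literature, and a query's score for a collection is typically the mean belief over the query's terms):

```python
import math

def cori_belief(df, cf, cw, avg_cw, n_collections, b=0.4):
    """Belief that one collection satisfies one query term (after Callan et al.).

    df: documents in this collection containing the term
    cf: number of collections containing the term
    cw: word count of this collection; avg_cw: mean word count over all collections
    """
    t = df / (df + 50 + 150 * cw / avg_cw)        # term-frequency component
    i = (math.log((n_collections + 0.5) / cf)
         / math.log(n_collections + 1.0))         # inverse collection frequency
    return b + (1 - b) * t * i
```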
26. CORI and GlOSS comparison
- In a comparison of CORI and GlOSS (Craswell 2000) it was found that CORI was the best method, and that selecting a small number of collections could outperform selecting all the servers and a central index.
- Probe queries are a good method for evaluating collections without having full knowledge of the collection contents, with 50,000 documents evaluated instead of 250,000 documents.
27. CORI and GlOSS comparison
- These conclusions are important to our research because they show that a high-quality subset of collections can be as effective as the full set of collections, and that probes of a collection are an effective method of ranking an entire collection.
28. Singular Value Decomposition
- Singular Value Decomposition is a vector space model for analysing relationships within a matrix
- A statistical method of reducing the noise in a matrix and associating related concepts together
- An advanced form of pattern matching
29. Singular Value Decomposition Cont.
- Used with a term-collection matrix, Singular Value Decomposition takes the background structure of word usage and removes the noise.
- This noise reduction allows the higher-order relationships between terms and documents to be seen clearly (a minimal sketch follows).
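A minimal numpy sketch of this noise reduction, using a toy term-by-collection count matrix (the matrix values and the rank k are illustrative assumptions, not data from the paper):

```python
import numpy as np

# Toy term-by-collection count matrix (rows: terms, columns: collections).
A = np.array([[3, 0, 1, 0],
              [2, 0, 0, 1],
              [0, 4, 0, 2],
              [0, 3, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                        # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A

# The discarded small singular values are treated as noise; what remains
# exposes the latent term/collection structure used by Latent Semantic Analysis.
```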
30. Singular Value Decomposition Cont.
- Polysemy is one word having multiple meanings
- Synonymy is multiple words having the same meaning
- Latent Semantic Analysis helps with these problems; in fact, a term does not have to occur within a collection for Latent Semantic Analysis to find that the collection is relevant
31. Singular Value Decomposition Cont.
- This is a great feature with off-the-page
rankings being so popular with search engines at
present.
32. Singular Value Decomposition Cont.
- Chen et al. proved that Singular Value Decomposition consistently improves both recall and precision
- The Singular Value Decomposition method has equalled or outperformed standard vector retrieval methods in almost every case, and gives up to 30% better retrieval
33. Experiment
- The experiment will attempt to prove two things:
  - A sampled set of data can approximate the retrieval of a full set of data when used with Singular Value Decomposition
  - Singular Value Decomposition is an effective method of selecting collections of data
34. IEEE Data Set
- Using the IEEE Computer Society XML Retrieval Research Collection
- Scientific articles
- Marked up in XML
35. IEEE Data Set
- Approximately 500 megabytes
- Contains over twelve thousand articles
- From 18 magazines/transactions
- Covering the period 1995-2002
- An article on average consists of 1500 XML nodes
- (data from http://www.is.informatik.uni-duisburg.de/projects/inex03/)
36. Experiment
- For the first part of the experiment, we will try to prove that sampled data can approximate a full set of data
- This will mean that this method can be extended to search engines and collections which cannot be fully indexed, such as databases which only expose a search interface.
37. The queries
- The queries used for this experiment are:
  - Data Mining
  - Knowledge Representation
  - Intelligent Agents
  - Pattern Recognition
38. Full Set of Data
- The entire IEEE collection was indexed and converted into an 18-column inverted index, each column being a journal
- The queries were then added to the matrix
- The matrix size was 18 x 48,890
- SVD was run on the matrix; SVD took less than 3 seconds to calculate the collection matrix
- The results were sorted by the query columns (a sketch of this pipeline follows)
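One plausible reading of this pipeline as code, as a sketch only: the query is appended as an extra column before the SVD, and journals are then ranked by cosine similarity to the query column in the rank-k latent space. The toy matrix and the rank k stand in for the real 48,890-term index:

```python
import numpy as np

def rank_collections(term_matrix, query_vec, k=2):
    """Append the query as an extra column, run SVD, then rank the journal
    columns by cosine similarity to the query column in rank-k space."""
    A = np.column_stack([term_matrix, query_vec])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cols = (np.diag(s[:k]) @ Vt[:k, :]).T      # one latent row per column of A
    q, journals = cols[-1], cols[:-1]
    sims = journals @ q / (np.linalg.norm(journals, axis=1)
                           * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)                   # journal indices, best first

# Toy example: 5 terms x 3 journals; the query contains terms 0 and 1.
M = np.array([[2., 0., 1.],
              [3., 0., 0.],
              [0., 4., 1.],
              [0., 2., 0.],
              [1., 0., 3.]])
print(rank_collections(M, np.array([1., 1., 0., 0., 0.])))
```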
39. Sampled Set of Data
- The IEEE data was transformed into an inverted index and stored in MS Access
- The database is approximately 1.8 GB in size
- A custom-built search engine (GP-XOR) was used for finding the top results from the database
- The top 40 and 60 results from each journal were taken
40. Sampled Set of Data
- Each of the returned documents was indexed and transformed into a matrix
- The query was then added to the matrix
- SVD was performed on the matrix
- The results were ordered by the query (a sketch of the sampling step follows)
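The sampled variant only changes how the matrix is built before the same SVD ranking is applied. A sketch, where engine.top_results is a hypothetical wrapper around a per-collection search interface such as GP-XOR, and documents are assumed to arrive as plain-text strings:

```python
from collections import Counter

def build_sample_matrix(engine, journals, query, n=40):
    """Build a terms x journals count matrix from each journal's top-n results."""
    vocab, columns = {}, []
    for journal in journals:
        counts = Counter()
        for doc in engine.top_results(journal, query, n):  # hypothetical API
            for term in doc.lower().split():
                counts[term] += 1
                vocab.setdefault(term, len(vocab))
        columns.append(counts)
    matrix = [[col[term] for col in columns] for term in vocab]
    return matrix, vocab  # feed the matrix to the SVD ranking step above
```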
41. Sampled Set of Data
- Good results were found with sample sizes of 40 documents and more
- Attempts to use smaller sample sizes, such as 10 and 20 articles from each journal, did not succeed
42. Discussion
- The sample size of 40 from each of the 18 journals is still far cheaper than indexing over 12,000 documents
- This sampling method is also more efficient when collection data changes or grows frequently, and when full indexing is not possible, such as in search engines.
43. SVD Results - Intelligent Agents
44. Search Engine Results - Intelligent Agents
- 9.22685024 IEEE INTELLIGENT SYSTEMS
- 4.31090544 COMPUTER
- 2.89284444 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
- 2.79830704 IEEE INTERNET COMPUTING
- 1.47478344 IEEE MULTIMEDIA
- 1.09663384 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
- 1.00209644 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
- 0.96428148 IEEE CONCURRENCY
- 0.86974408 IEEE COMPUTER GRAPHICS AND APPLICATIONS
- 0.77520668 IEEE TRANSACTIONS ON COMPUTERS
- 0.52940944 IEEE MICRO
- 0.49159448 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
- 0.472687 IEEE SOFTWARE
- 0.39705708 COMPUTING IN SCIENCE & ENGINEERING
- 0.3781496 IT PROFESSIONAL
- 0.34033464 IEEE DESIGN & TEST OF COMPUTERS
- 0.20798228 IEEE ANNALS OF THE HISTORY OF COMPUTING
- 0.11344488 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS
45. Intelligent Agents - Distance Between Search Engine and SVD
46. All queries - SVD vs. Search Engine
47. Conclusion
- Preliminary results indicate that the technique is suited to the task of selecting the most relevant collections and learning user preferences in collections
- The approach uses short queries and samples of data, and is thus suitable for use on the Web
- The approach also reduces the need for ontologies and thesauri.
48. Conclusion Cont.
- The system proved to be fast, with the actual
indexing of the data taking the longest time.
49. Further Work
- Find the optimal term weighting scheme (one candidate is sketched below)
- Find the optimal sample size taken from each collection
- Compare the system with CORI/GlOSS or another functioning collection selection system using the IEEE journals
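On the first point, one weighting scheme commonly paired with Latent Semantic Analysis, and therefore a natural candidate here (an assumption, not a scheme named in the paper), is log-entropy weighting. A sketch, assuming every term occurs in at least one collection:

```python
import numpy as np

def log_entropy(tf):
    """Log-entropy weighting for a terms x collections count matrix."""
    n = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)    # global frequency of each term
    p = np.where(tf > 0, tf / gf, 1.0)    # p = 1 makes p*log(p) = 0 below
    entropy = 1 + (p * np.log(p)).sum(axis=1) / np.log(n)
    return np.log1p(tf) * entropy[:, np.newaxis]
```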