Title: Human Information Access with Fuzzy Searching
1Human Information Access with Fuzzy Searching
Chris Demwell
Communication Networks Laboratory
School of Engineering Science
Simon Fraser University
2Introduction
- Language is a process of negotiated meaning
- Ambiguity can lead to search engine error
- Fuzzy logic has been shown to act like approximate reasoning in control systems
- Why not apply fuzzy logic to searching?
- The Jasmine fuzzy searching framework could be used to explore searching with fuzzy logic
- Based on [RUBIN98], "FuzzyBase: an information-intelligent retrieval system"
3Road Map
- Searching as Human Information Access
- Introduction to Fuzzy Logic
- Fuzzy Logic and Searching
- Existing Search Engine Implementations
- Recent Research
- The Jasmine Search Framework
- References and Questions
4Searching as Human Information Access (1)
- Find documents similar to a known key
- First, characterize the key and the documents
- Stop words removed: the, a, and, ...
- Count word frequency in document
- Complex data mining / computational linguistics techniques
- Second, compare the key's characterization against the documents' characterizations
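The two steps above (characterize, then compare) can be sketched roughly as follows. This is a minimal illustration, not a method from the talk: the stop-word list contains only the three words named on the slide, and the `overlap` comparison is an assumed, deliberately crude measure.

```python
import re
from collections import Counter

# Assumption: only the three stop words named on the slide are listed here.
STOP_WORDS = {"the", "a", "and"}

def characterize(text: str) -> Counter:
    """Lowercase, tokenize, drop stop words, and count word frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def overlap(key: Counter, doc: Counter) -> int:
    """Crude comparison: number of key-word occurrences shared with the document."""
    return sum((key & doc).values())

key = characterize("the cast away tires")
doc1 = characterize("A shop that sells tires and cast wheels")
doc2 = characterize("The cast of the film and a review")
print(overlap(key, doc1), overlap(key, doc2))  # prints: 2 1
```

Here the first document shares two key words ("cast", "tires") and the second shares one, so the first would rank higher under this simple measure.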
5Searching as Human Information Access (2)
- We want high precision: all documents returned should be relevant
- We also want high recall (a.k.a. coverage): we should find all relevant documents
- Problem 1: Searcher is not looking for an exact match
- Problem 2: Searcher does not know exactly what is wanted
- Problem 3: Search key is very small and ambiguous
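Precision and recall as defined above can be computed from the set of returned documents and the set of relevant ones. The document identifiers below are made up for illustration.

```python
def precision_recall(returned: set, relevant: set) -> tuple:
    """Precision = hits / returned; recall = hits / relevant."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set: 4 documents returned, 3 are actually relevant.
returned = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```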
7Introduction to Fuzzy Logic (1)
- Fuzzy set membership is not simply true or false; it is a matter of degree
- Sets characterized by fuzzy membership functions
- Fuzzy sets better represent human reasoning
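As a small illustration of a fuzzy membership function, the sketch below uses a triangular shape, a common textbook choice. The "warm" set and its breakpoints (15, 25, 35 degrees C) are assumptions picked for the example, not values from the talk.

```python
def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Degree of membership in [0, 1]: 0 outside (left, right), 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def warm(temp_c: float) -> float:
    """Hypothetical fuzzy set 'warm', peaking at 25 degrees C."""
    return triangular(temp_c, 15.0, 25.0, 35.0)

for t in (10, 20, 25, 30):
    print(t, warm(t))
```

A temperature of 20 is "warm" to degree 0.5 rather than being forced into a yes/no answer.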
8Introduction to Fuzzy Logic (2)
- Fuzzy logic operations typically have three phases
- Fuzzification
- Fuzzy set processing
- Defuzzification
- This maps well onto characterization, matching, and retrieval
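The three phases can be sketched end to end as below. Everything specific here is an assumption made for illustration: the two input scores, the "low"/"high" sets, the two rules (min for AND, max for OR), and the weighted-average defuzzification with representative output values.

```python
def fuzzify(x: float) -> dict:
    """Phase 1: map a crisp value in [0, 1] to degrees in two fuzzy sets."""
    x = min(max(x, 0.0), 1.0)
    return {"low": 1.0 - x, "high": x}

def process(term: dict, title: dict) -> dict:
    """Phase 2: apply two rules using min for AND and max for OR.
    Rule A: term high AND title high -> relevance high.
    Rule B: term low OR title low    -> relevance low."""
    return {
        "high": min(term["high"], title["high"]),
        "low": max(term["low"], title["low"]),
    }

# Assumed representative crisp value for each output set.
OUTPUT_PEAKS = {"low": 0.2, "high": 0.9}

def defuzzify(strengths: dict) -> float:
    """Phase 3: weighted average of the output peaks by rule strength."""
    total = sum(strengths.values())
    if total == 0:
        return 0.0
    return sum(s * OUTPUT_PEAKS[name] for name, s in strengths.items()) / total

score = defuzzify(process(fuzzify(0.8), fuzzify(0.6)))
print(round(score, 2))  # prints: 0.62
```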
9Fuzzy Logic and Searching (1)
- Fuzzy logic can apply to searching in each phase
- Query subtask: submit a key with weighted parts
- Characterization subtask: each document is described by membership in fuzzy sets
- Pattern-matching subtask: fuzzily match the key to the documents
Query: cast(.8) away(.1) tires(.9)
10Fuzzy Logic and Searching (1)
Characterization example:
Document 1: 1 .2 .5 .3 .8 .1
Document 2: .2 1 .6 .4 .7 .2
11Fuzzy Logic and Searching (1)
Pattern-matching example:
µ1(key) = 0.3, where µ1 defines "like document 1"
µ2(key) = 0.8, where µ2 defines "like document 2"
µ is the symbol commonly used to represent a fuzzy membership function
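One way a weighted key might be matched fuzzily against document characterizations is sketched below. The query string is the one from the slides; the per-document term memberships and the weighted-average formula are assumptions for illustration and do not reproduce the 0.3 and 0.8 values shown in the example.

```python
import re

def parse_query(text: str) -> dict:
    """Parse 'term(weight)' pairs such as 'cast(.8) away(.1) tires(.9)'."""
    return {term: float(w) for term, w in re.findall(r"(\w+)\(([\d.]+)\)", text)}

def membership(key: dict, doc: dict) -> float:
    """Assumed measure: weighted average of the document's term memberships,
    using the key's weights. Returns a degree in [0, 1]."""
    total = sum(key.values())
    if total == 0:
        return 0.0
    return sum(w * doc.get(term, 0.0) for term, w in key.items()) / total

key = parse_query("cast(.8) away(.1) tires(.9)")

# Hypothetical characterizations: degree to which each term describes a document.
doc_a = {"cast": 0.2, "away": 0.5, "tires": 0.1}
doc_b = {"cast": 0.7, "away": 0.2, "tires": 0.9}

print(round(membership(key, doc_a), 2), round(membership(key, doc_b), 2))
```

Because "away" carries a weight of only .1, a document that matches it strongly but misses "cast" and "tires" still scores low.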
13Fuzzy Logic and Searching (2)
- Only the characterization and pattern-matching subtasks are considered for building Jasmine
- Forcing user involvement when it is not required must be avoided
- The software should do as much interpretation as possible
- Search through fuzzy data characterizations
- Fuzzily match keys to documents
- Possibly both
15Existing Search Engine Implementations (1)
- Four of the main search methods used for searching the World Wide Web:
- Term-based search engines
- Popularity-based search engines
- Semantics-based search engines
- Clustering-based search engines
- Note: this ignores many subtleties beyond scope
16Existing Search Engine Implementations (2)
- Term-based engines use term existence to find similar documents
- A document is considered more similar the more often a word from the key appears in it
- More complex methods, e.g. cosine distance measure, boolean logic
- Ambiguity simply ignored
- Results dependent on query construction
Excerpt from Altavista.com search results
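The cosine measure and boolean matching mentioned above can be sketched on simple term-count vectors. This is a generic textbook formulation, not a description of how any particular engine works; the example documents are invented.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (1.0 = same direction)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def boolean_and(terms, doc: Counter) -> bool:
    """Boolean AND query: every term must be present in the document."""
    return all(t in doc for t in terms)

key = Counter("ice cream".split())
doc1 = Counter("ice cream ice cream shop".split())
doc2 = Counter("cream nightclub liverpool".split())

print(round(cosine(key, doc1), 3), round(cosine(key, doc2), 3))
print(boolean_and(key, doc1), boolean_and(key, doc2))  # True False
```

Both measures treat "cream" as the same term in every document, which is the sense in which ambiguity is simply ignored.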
17Existing Search Engine Implementations (3)
- Popularity-based searching considers how popular a document is
- More commonly searched-for documents are more likely to be appropriate
- Implicit endorsement through hyperlinks
- Still does not address ambiguity: finds authoritative sites, but on the wrong topic!
www.cream.co.uk
You really need to get a browser with javascript (or turn it on if you already have) Skip Intro.
Description: Liverpool night club.
Category: Regional > Europe > ... > Arts and Entertainment > Music > Clubs and
www.cream.co.uk/ - 4k - Cached - Similar pages

Ben & Jerry's Ice Cream
United States United Kingdom The Netherlands France Japan Company Info Page Blank ...
Description: Vermont's Finest All Natural Ice Cream, Frozen Yogurt and Sorbet. Overnight delivery.
Category: Shopping > Food > Confectionery > Frozen
www.benjerry.com/ - 3k - Cached - Similar pages

TV Cream
Category: Arts > Television > Theme Songs
tv.cream.org/ - 1k - Cached - Similar pages

Excerpt from google.com's search results
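"Implicit endorsement through hyperlinks" can be illustrated with a simplified PageRank-style iteration, in which a page's score is fed by the scores of pages linking to it. This is a generic sketch with an assumed damping factor of 0.85 and a made-up four-page link graph; it is not claimed to match any engine's actual ranking.

```python
def link_rank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Iteratively distribute each page's score across its outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Page with no outgoing links: spread its score evenly.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical link graph: pages a, b and d all link to c.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = link_rank(links)
print(max(rank, key=rank.get))  # prints: c
```

Note that nothing in the scores depends on the query's meaning, so a heavily linked page on the wrong topic still ranks highly.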
18Existing Search Engine Implementations (4)
- Semantic engines attempt to find the meaning of the query
- Often means matching query words against an ontology to find context
- When ambiguity is found, the engine asks the user to clarify
- Intuitively a better document model
- Difficult to automate ontology generation
- Does not solve the searching problem, just disambiguates!
Excerpt from Simpli.com's search results. Note: Simpli.com no longer appears to provide this service (as of 08/30/2001)
19Existing Search Engine Implementations (5)
- Clustering-based engines cluster results statistically
- Leverage existing data mining techniques
- Hope that statistical groups match semantic groups
- Helps user ignore irrelevancies instead of discarding them automatically
- No ontology needed
- Must still do full search!
- Clusters may be useless
Excerpt from vivisimo.com's search results
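A very small example of statistical result clustering is sketched below: a greedy single-pass grouping of result snippets by word overlap (Jaccard similarity). The threshold and the snippets are assumptions for illustration; production engines use considerably more sophisticated techniques.

```python
def jaccard(a: set, b: set) -> float:
    """Shared words divided by all words across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(snippets: list, threshold: float = 0.2) -> list:
    """Assign each snippet to the first cluster whose seed is similar enough,
    otherwise start a new cluster. Returns lists of snippet indices."""
    clusters = []  # each entry: (seed word set, member indices)
    for i, text in enumerate(snippets):
        words = set(text.lower().split())
        for seed, members in clusters:
            if jaccard(words, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((words, [i]))
    return [members for _, members in clusters]

snippets = [
    "ice cream frozen yogurt",
    "frozen ice cream delivery",
    "night club music liverpool",
    "liverpool music club tickets",
]
print(cluster(snippets))  # prints: [[0, 1], [2, 3]]
```

The groups emerge purely from word statistics, with no ontology involved, which is also why they may or may not line up with meaningful topics.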
21Recent Research - Key Phrases and Meaning
- It is difficult to process plain text: must extract keyphrases
- [FRANK99] describes an automatic keyphrase extraction technique
- Split document into phrases; use Bayesian methods to classify phrases as key or not
- Accuracy increases with domain knowledge
- We could merge this with an ontology to infer the meaning of keyphrases
- Now, we can match against key concepts in the document
22Recent Research - Ontology Construction
- To construct an ontology, we must:
- disambiguate any ambiguous words
- find their place within the tree
- Unreasonable to do by hand
- [KARKALETSIS99]
- decision tree containing about 1000 nodes
- precision about 90%, recall of 60% in disambiguation task
- 60% recall a training artifact? Study used many negative examples
- Iteratively comparing word usage could allow this to help build an entire ontology [KROHN01]
23Recent Research - Flat and Hyper Texts
- Many text databases are not hypertexts, nor do they contain much metadata
- Hypertexts are useful for determining authority [CHAKRABARTI98] [CHAKRABARTI99]
- [KIM99] and [FRANK99] demonstrate a method to automatically construct a hypertext using existing thesauri
- Once a hypertext is constructed, it can also be used to find mostly distinct communities of documents
24Recent Research - Agents and Clustering
- [ALLOWAY97] describes a successful project to build a set of ontologies for the University of Michigan Digital Library
- Used a system of distributed intelligent agents
- A technique described in [VELING98] automatically disambiguates queries based on statistical clustering
- Words are considered similar (linked) if they often appear close together in the database
- Finds some non-intuitive clusters
- Fast! 10,000 documents/min on a 200 MHz x86
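The idea of linking words that often appear close together can be sketched with a sliding-window co-occurrence count. The window size, the link threshold, and the toy documents are assumptions for illustration; this is not the algorithm of the cited paper.

```python
from collections import Counter

def cooccurrence(documents: list, window: int = 3) -> Counter:
    """Count unordered word pairs where the second word falls within
    the next (window - 1) positions after the first."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for i, word in enumerate(words):
            for other in words[i + 1 : i + window]:
                if other != word:
                    counts[frozenset((word, other))] += 1
    return counts

def linked(counts: Counter, min_count: int = 2) -> set:
    """Word pairs that co-occur at least min_count times are considered linked."""
    return {pair for pair, c in counts.items() if c >= min_count}

docs = ["jaguar car engine", "jaguar car dealer", "jaguar cat jungle"]
counts = cooccurrence(docs)
print(linked(counts))  # prints: {frozenset({'jaguar', 'car'})} (element order may vary)
```

With more data, separate groups of linked words around an ambiguous term (here "jaguar") could suggest its different senses.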
25Recent Research - Fuzzy Queries
- There is a dearth of work regarding the application of fuzzy logic to the searching task!
- Wolski and Bouaziz proposed in [BOUAZIZ98] a method to replace crisp database triggers with fuzzy ones
- Mostly beyond scope
- Bulk of the work seems similar
27The Jasmine Search Framework (1)
- Jasmine is designed to accommodate research into the key components of a fuzzy-logic-based search engine
- Assumption: fuzzy logic is used for fuzzy characterizations, fuzzy pattern matching, or both
- Modular design permits specification of the Jasmine framework without specifying fuzzy components
- Fuzzy components may be swapped out to comparatively test algorithms without changing engines
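The modular idea above, an engine specified independently of its fuzzy components, might look something like the following. All class and method names here (`Characterizer`, `Matcher`, `TermPresence`, `MinOverlap`, `SearchEngine`) are hypothetical; the slides do not specify Jasmine's actual interfaces.

```python
from typing import Dict, List, Protocol, Tuple

class Characterizer(Protocol):
    """Hypothetical interface: turn text into fuzzy term memberships."""
    def characterize(self, text: str) -> Dict[str, float]: ...

class Matcher(Protocol):
    """Hypothetical interface: degree to which a key matches a document."""
    def match(self, key: Dict[str, float], doc: Dict[str, float]) -> float: ...

class TermPresence:
    """Simplest possible characterizer: membership 1.0 for every word present."""
    def characterize(self, text: str) -> Dict[str, float]:
        return {word: 1.0 for word in text.lower().split()}

class MinOverlap:
    """Matcher using fuzzy intersection (min), averaged over the key's terms."""
    def match(self, key: Dict[str, float], doc: Dict[str, float]) -> float:
        if not key:
            return 0.0
        return sum(min(m, doc.get(t, 0.0)) for t, m in key.items()) / len(key)

class SearchEngine:
    """Engine that depends only on the two interfaces, not on their internals."""
    def __init__(self, characterizer: Characterizer, matcher: Matcher) -> None:
        self.characterizer = characterizer
        self.matcher = matcher
        self.index: Dict[str, Dict[str, float]] = {}

    def add(self, name: str, text: str) -> None:
        self.index[name] = self.characterizer.characterize(text)

    def search(self, query: str) -> List[Tuple[str, float]]:
        key = self.characterizer.characterize(query)
        scored = [(name, self.matcher.match(key, doc)) for name, doc in self.index.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

engine = SearchEngine(TermPresence(), MinOverlap())
engine.add("d1", "fuzzy logic search")
engine.add("d2", "ice cream")
print(engine.search("fuzzy logic"))  # prints: [('d1', 1.0), ('d2', 0.0)]
```

Swapping in a different characterizer or matcher only requires passing another object with the same method to `SearchEngine`, which is what makes side-by-side comparison of algorithms straightforward.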
28The Jasmine Search Framework (2)
29The Jasmine Search Framework (3)
30The Jasmine Search Framework (4)
- Future extensions
- Key phrase extraction and ontology creation
- Hypertext induction on plain text databases
- Hypertext clustering for authority and community
- Distributed, intelligent architecture
- Complex Metadata
- MPEG 7 [WEB3]
- Dublin Core [WEB4]
- Use model data mining
- [COOLEY00], [GREENBERG97]: Mining hypertext usage patterns can yield useful information about relevance
- [HOFMANN99]: Aggregated user feedback is powerful
31References (1)
[RUBIN98] S. Rubin, M. H. Smith, and Lj. Trajkovic, "FuzzyBase: an information-intelligent retrieval system," Proc. 1998 IEEE Int. Conf. on Systems, Man, and Cybernetics, San Diego, CA, Oct. 1998, TA11, pp. 2797-2802.
[FRANK99] E. Frank, G. Paynter, I. Witten, C. Gutwin, and C. Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proc. 16th Joint Int. Conf. on Artificial Intelligence (IJCAI'99), pp. 668-673, Stockholm, Sweden, 1999.
[KARKALETSIS99] Vangelis Karkaletsis, Georgios Paliouras, and Constantine D. Spyropoulos. Learning Rules for Large Vocabulary Word Sense Disambiguation. 16th Joint Int. Conf. on Artificial Intelligence (IJCAI'99), pp. 674-679, Stockholm, Sweden, 1999.
[KROHN01] Fred Krohn, conversation at ASI exchange, 2000.
[CHAKRABARTI98] S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext classification using hyper-links. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pp. 307-318, Seattle, Washington, June 1998.
[CHAKRABARTI99] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. M. Kleinberg. Mining the web's link structure. COMPUTER, 32:60-67, 1999.
32References (2)
[KIM99] Munseok Kim, Sejin Nam, and Dongwook Shin. Hypertext Construction using statistical and semantic similarity. 16th Joint Int. Conf. on Artificial Intelligence (IJCAI'99), pp. 57-63, Stockholm, Sweden, 1999.
[ALLOWAY97] Gene Alloway and Peter Weinstein. Seed Ontologies: growing digital libraries as distributed, intelligent systems. Proceedings of the second ACM International Conference on Digital Libraries, pp. 83-91, Philadelphia, USA, 1997.
[VELING98] Anne Veling and Peter van der Weerd. Conceptual grouping in word co-occurrence networks. 16th Joint Int. Conf. on Artificial Intelligence (IJCAI'99), pp. 694-699, Stockholm, Sweden, 1999.
[BOUAZIZ98] Tarik Bouaziz and Anton Wolski. Fuzzy Triggers: Incorporating Imprecise Reasoning into Active Databases. Proc. IEEE 14th International Conference on Data Engineering. 1998.
[WEB3] http://www.darmstadt.gmd.de/mobile/MPEG7/, The MPEG 7 web page. MPEG 7 is a proposed standard for metadata description of multimedia information of varying kinds.
33References (3)
[WEB4] http://dublincore.org/documents/, recommendations of the Dublin Core Metadata Initiative, an open forum concerned with "development of interoperable online metadata standards that support a broad range of purposes and business models".
[COOLEY00] R. Cooley, M. Deshpande, J. Srivastava, and P. N. Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1:12-23, 2000.
[GREENBERG97] L. Tauscher and S. Greenberg. How people revisit web pages: Empirical findings and implications for the design of history systems. International Journal of Human Computer Studies, Special issue on World Wide Web Usability, 47:97-138, 1997.
[HOFMANN99] Thomas Hofmann and Jan Puzicha. Latent Class Models for Collaborative Filtering. 16th Joint Int. Conf. on Artificial Intelligence (IJCAI'99), pp. 688-693, Stockholm, Sweden, 1999.