Title: Organizing the Web: Semiautomatic Construction of a Faceted Scheme
1 Organizing the Web: Semi-automatic Construction of a Faceted Scheme
- Kiduk Yang, Elin K. Jacob, Aaron Loehrlein, Seungmin Lee, Ning Yu
- School of Library and Information Science, Indiana University, USA
2 Outline
- Introduction
- Construction of a Faceted Scheme
- Generalized Approach to Constructing a Faceted
Vocabulary
3 Introduction: Background
- WIDIT (Web Information Discovery Integrated Tool)
- IU-SLIS research group
- Research area
- Information Retrieval, Classification, Fusion
- Major Projects
- CSKD (Classification-based Search and Knowledge Discovery)
- DGov (Digital Government)
- TREC (Text REtrieval Conference)
- VCoB (Virtual Collection Builder)
- http://widit.slis.indiana.edu/
4 Introduction: CSKD Overview
- Aim
- Dynamic, flexible, and effective information retrieval and knowledge discovery
- Assumptions
- Path to knowledge is not deterministic
- Individual weaknesses and complementary strengths of evidence, methods, and processes
- Approaches
- Leveraging of multiple sources of evidence
- Integration of information retrieval and knowledge organization methods
- Combination of automatic and manual processes
5 Introduction: CSKD Architecture
[Diagram: CSKD architecture components]
- Metadata Scheme Construction
- Metadata Indexing
- Content Indexing
- Faceted Vocabulary Construction
- Manual Classification
- Heuristics Discovery
- Inverted Index Creation
- RDF Scheme Creation
- Knowledge Base Harvesting
- Automatic Classification
- Hybrid Classification
- Free Text Search
- Database Search
- Static Ontology Search and Browse
- Dynamic Ontology Search and Browse
- Dynamic Query Refinement
- Integrated Search and Browse
- Flexible Knowledge Organization
6 Faceted Scheme: Intro
- Problems with unstructured Web search
- too many or too few search results
- failure of free-text searching to retrieve by concept
- Renewed interest in traditional systems of representation and organization
- classification/categorization
- thesauri/controlled vocabularies
- metadata (e.g. digital libraries)
- ontologies (e.g. Semantic Web)
7 Faceted Scheme: Intro
- Impossibility of organizing entire Web
- quantity, diversity, and dynamic nature of resources
- scale-up problem of machine learning approaches
- text categorization based on static classification scheme
- Requires an organizational approach that
- provides for flexibility of representation
- accommodates dynamic nature of human knowledge
- responds to the information needs of diverse and
interdisciplinary searchers
8 Faceted Scheme: Intro
- Problems with traditional enumerative systems
- top-down (data-independent) approach
- fixed groupings (definition, membership)
- inability to respond to dynamic nature of Web collections
- Advantages of faceted systems
- bottom-up (data-driven) approach
- groupings created on an as-needed basis
- dynamic and responsive to change
9 Faceted Scheme: Construction
- Overview of Faceted Scheme construction
- Faceted Vocabulary
- identify characteristics (color) and values (red, green) relevant to a domain
- organize characteristics (facets) and associated values (isolates) as independent concept hierarchies
- determine relationships between concept hierarchies
- Faceted Classification
- establish citation order for combining facets
- create classes (and class structure) based on resource collection
- Dynamic Faceted Classification
- modify or adapt citation order to create customized classes (class structure) based on user's immediate information need
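To illustrate how citation order shapes the class structure, here is a minimal Python sketch. Only the color facet with the values red and green comes from the slide; the `shape` facet, the resource names, and the `classify` function are hypothetical and are not part of the presented system.

```python
def classify(resources, citation_order):
    """Group resources into classes keyed by facet values in citation order.

    resources: dict mapping a resource name to a dict of facet -> value.
    citation_order: list of facet names; its order fixes the class structure.
    """
    classes = {}
    for name, facets in resources.items():
        # The class key is the tuple of facet values, in citation order.
        key = tuple(facets.get(facet) for facet in citation_order)
        classes.setdefault(key, []).append(name)
    return classes


# Hypothetical resource collection for illustration only.
resources = {
    "r1": {"color": "red", "shape": "round"},
    "r2": {"color": "red", "shape": "square"},
    "r3": {"color": "green", "shape": "round"},
}

static_classes = classify(resources, ["color"])
dynamic_classes = classify(resources, ["shape", "color"])
```

With the citation order `["color"]`, `r1` and `r2` share the class `("red",)`. Changing the order to `["shape", "color"]` regroups the same resources without touching the vocabulary, which is the sense in which a dynamic faceted classification adapts to an immediate information need.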
10 Faceted Scheme: Construction
- Obstacles to Faceted Scheme construction
- time-consuming
- intellectually demanding
- lack of standardized procedures
- Research objectives
- to reduce intellectual resources required to construct faceted schemes
- to identify faceted scheme construction procedures that can be standardized
- Approach
- to develop semi-automatic process that augments
the cognitive strengths of the human with the
automatic processing capabilities of the machine
11 Faceted Vocabulary: Methodology
- Create lexicon base
- Manually construct the faceted vocabulary
- Analyze the manual construction process to identify cognitive strategies used
- Construct heuristics for automating cognitive strategies
- Suffix Heuristic
- WordNet Heuristic
- Concept Pairs Heuristic
- Evaluate, modify, and validate heuristics
- Devise a semi-automatic approach to constructing
the faceted vocabulary that combines automatic
(machine) and manual (human) processes.
12 Faceted Vocabulary: Heuristics
- Input: lexicon base
- Output: groupings of conceptually related terms
- Suffix Heuristic
- organizes terms based on common word endings
- steps
- identify suffixes and meanings in dictionary
- identify domain-specific suffixes and meanings
- organize and conflate suffixes by meaning
- apply suffix structure to generate groupings
- validate output and refine heuristic if needed
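The suffix steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the suffix table is inferred from the groupings shown on slide 17, the meaning labels are shortened, and the function name `group_by_suffix` is hypothetical.

```python
# Hypothetical suffix table (suffix -> meaning). A real table would be
# compiled from a dictionary plus domain-specific suffixes.
SUFFIX_MEANINGS = {
    "ide": "binary chemical compounds",
    "ium": "chemical elements",
    "yl": "chemical radicals",
    "ene": "unsaturated carbon compounds",
    "ylene": "unsaturated hydrocarbons",
}


def group_by_suffix(terms, suffix_meanings=SUFFIX_MEANINGS):
    """Return (groups, unmatched): terms grouped by the meaning of their suffix."""
    # Try longer suffixes first so that "-ylene" wins over "-ene".
    suffixes = sorted(suffix_meanings, key=len, reverse=True)
    groups, unmatched = {}, []
    for term in terms:
        stem = term.lower()
        # Crude plural handling (an assumption): "cyanides" is matched as "cyanide".
        if stem.endswith("s"):
            stem = stem[:-1]
        for suffix in suffixes:
            if stem.endswith(suffix) and len(stem) > len(suffix):
                groups.setdefault(suffix_meanings[suffix], []).append(term)
                break
        else:
            # No suffix matched; leave the term for manual handling.
            unmatched.append(term)
    return groups, unmatched


lexicon = ["Bromide", "Cyanides", "Ride", "Cadmium", "Methyl", "Biphenyls",
           "Benzene", "Scene", "Trichloroethylene", "Emission"]
groups, unmatched = group_by_suffix(lexicon)
```

Pure suffix matching places "Ride" with the binary compounds and "Scene" with the unsaturated carbon compounds, as on slide 17. Such false positives are what the validation step is meant to catch.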
13 Faceted Vocabulary: Heuristics
- WordNet Heuristic
- groups terms by their position in the WordNet category hierarchy
- act → action → change → change of magnitude
- activity → occupation → accountancy
- steps
- submit terms from lexicon base to WordNet
- group terms based on common WordNet category
- validate output and refine heuristic if needed
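As a rough illustration of grouping by a shared category, the sketch below uses a small hand-written hypernym table in place of WordNet. The table, the terms "reduction" and "farming", and the function names are hypothetical; only the two chains shown on the slide are taken from the source. Real WordNet lookups also involve multiple senses and multiple hypernyms per term, which this sketch ignores.

```python
# Toy stand-in for WordNet (child -> parent). NOT real WordNet output.
TOY_HYPERNYMS = {
    "action": "act",
    "change": "action",
    "change of magnitude": "change",
    "occupation": "activity",
    "accountancy": "occupation",
    "reduction": "change of magnitude",  # hypothetical entry
    "farming": "occupation",             # hypothetical entry
}


def path_from_root(term, hypernym_of):
    """Return the chain of categories from the top-level category down to term."""
    path = [term]
    while path[-1] in hypernym_of:
        path.append(hypernym_of[path[-1]])
    return list(reversed(path))


def group_by_category(terms, hypernym_of, depth):
    """Group terms by their ancestor at the given depth (0 = top-level category)."""
    groups = {}
    for term in terms:
        path = path_from_root(term, hypernym_of)
        # Fall back to the term itself if its path is shorter than depth.
        category = path[min(depth, len(path) - 1)]
        groups.setdefault(category, []).append(term)
    return groups


terms = ["accountancy", "farming", "reduction", "change of magnitude"]
wordnet_groups = group_by_category(terms, TOY_HYPERNYMS, depth=1)
```

The `depth` parameter controls how coarse the groupings are. Choosing it is a judgment call that would fall under the validate-and-refine step.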
14 Faceted Vocabulary: Heuristics
- Concept Pairs Heuristic
- groups pairs of terms from noun phrases that share a common term
- air pollution, water pollution, soil pollution → air, water, soil
- pollution control, pollution monitoring → control, monitoring
- steps
- identify noun phrases from the lexicon base
- group noun phrases sharing a common term based on position
- strip out the common term
- validate output and refine heuristic if needed
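The concept pairs steps can be sketched as below, using the slide's own example phrases. The function name `concept_pairs`, the restriction to two-word phrases, and the `min_group_size` threshold are assumptions made for illustration.

```python
from collections import defaultdict


def concept_pairs(noun_phrases, min_group_size=2):
    """Group two-word noun phrases by a shared term and strip that term.

    Keys are (first, second) patterns, with "*" marking the stripped position.
    """
    by_last = defaultdict(list)   # common term in the last position
    by_first = defaultdict(list)  # common term in the first position
    for phrase in noun_phrases:
        words = phrase.lower().split()
        if len(words) != 2:
            continue  # this sketch only handles two-word phrases
        first, second = words
        by_last[second].append(first)
        by_first[first].append(second)

    groups = {}
    for term, rest in by_last.items():
        if len(rest) >= min_group_size:
            groups[("*", term)] = rest
    for term, rest in by_first.items():
        if len(rest) >= min_group_size:
            groups[(term, "*")] = rest
    return groups


phrases = ["air pollution", "water pollution", "soil pollution",
           "pollution control", "pollution monitoring"]
pairs = concept_pairs(phrases)
```

Keeping the position in the key separates the terms that modify "pollution" (air, water, soil) from those that "pollution" modifies (control, monitoring).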
15 Faceted Vocabulary Construction: Generalized Model
16 Questions?
17 Organize suffixes/terms by specific meaning
- entities
- chemicals, chemical compounds
- Carbon
- Chlorofluorocarbons
- Hydrochlorofluorocarbons
- binary chemical compounds, compounds regarded as binary
- Bromide
- Chloride
- Cyanide
- Cyanides
- Monoxide
- Oxides
- Radionuclides
- Ride
- chemical elements, chemical radicals, ions having a positive charge
- Cadmium
- Uranium
- chemical radicals
- Biphenyls
- Butyl
- Methyl
- unsaturated carbon compounds
- Benzene
- Ethylbenzene
- Scene
- Styrene
- Toluene
- unsaturated hydrocarbons, bivalent radicals
- Dichloroethylene
- Perchloroethylene
- Tetrachloroethylene
- Trichloroethylene
18 Conflate suffixes/terms by general meaning
- entities
- chemicals, chemical compounds
- Benzene
- Biphenyls
- Bromide
- Butyl
- Cadmium
- Carbon
- Chloride
- Chlorofluorocarbons
- Cyanide
- Cyanides
- Dichloroethylene
- Ethylbenzene
- Hydrochlorofluorocarbons
- Methyl
- Monoxide
- Oxides
- Perchloroethylene
- Radionuclides
- Ride
- Scene
- Styrene
- Tetrachloroethylene
- Toluene
- Trichloroethylene
- Uranium
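Conflation from specific to general meanings, as in the move from slide 17 to slide 18, might look like the following sketch. The mapping from specific to general meaning and the function name `conflate` are assumptions, and the group labels are shortened from the slide.

```python
def conflate(groups, general_meaning):
    """Merge specific-meaning groups into general-meaning groups.

    groups: dict mapping a specific meaning to a list of terms.
    general_meaning: dict mapping a specific meaning to its general meaning;
    meanings missing from it are kept as their own group.
    """
    merged = {}
    for specific, terms in groups.items():
        general = general_meaning.get(specific, specific)
        merged.setdefault(general, set()).update(terms)
    # Sort each merged group alphabetically, as on the slide.
    return {general: sorted(terms) for general, terms in merged.items()}


specific_groups = {
    "binary chemical compounds": ["Chloride", "Bromide"],
    "chemical radicals": ["Methyl", "Butyl"],
    "chemical elements": ["Uranium", "Cadmium"],
}
# Hypothetical mapping: every specific meaning rolls up to one general meaning.
to_general = {name: "chemicals, chemical compounds" for name in specific_groups}
conflated = conflate(specific_groups, to_general)
```

Conflation keeps every term, including false positives such as "Ride" and "Scene" on the slide, so validation is still needed after this step.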