Title: Castanet: Using WordNet to Build Facet Hierarchies
1Castanet: Using WordNet to Build Facet Hierarchies
- Emilia Stoica and Marti Hearst
- School of Information, UC Berkeley
2Motivation
- Want to assign labels from multiple hierarchies
3Motivation
- Hot and Sweet Chicken: 1 pepper, 2 apricots, 1 pound chicken breast, 1 Tbsp gingerroot
Meat Chicken
4Castanet
- Carves out a structure from the hypernym (IS-A) relations within WordNet
- Produces surprisingly good results for a wide range of subjects
- - e.g., arts, medicine, recipes, math, news, bibliographical records
5WordNet Challenges
- A word may have more than one sense
- - Fine granularity of word sense distinctions
- - e.g., newspaper (1): daily publication on folded sheets
- - newspaper (3): physical object
- - Ambiguity for the same sense
6WordNet Challenges (cont.)
- The hypernym path may be quite long (e.g., sense 3 of tuna has 14 nodes)
- Sparse coverage of proper names and noun phrases (not addressed)
7Algorithm Goals
- Build a set of facet hierarchies
- Balance depth and breadth
- Avoid skinny paths
- Don't go too deep or too broad
- Choose understandable labels
- Disambiguate words
- Currently a word can take on only one sense
8Our Approach
Documents
91. Select Terms
- Select well-distributed terms from the collection
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold
- - (default top 10)
[Pipeline diagram: Documents -> Select terms -> Build core tree -> Augment core tree -> Compress tree -> Remove top-level categories, with WordNet as an input]
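The term-selection step can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes a term's "distribution" is the number of documents containing it, and the stopword list, the whitespace tokenizer, and the `min_docs` threshold are hypothetical stand-ins.

```python
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "of", "the", "with"}

def select_terms(docs, min_docs):
    """Return {term: document frequency} for terms found in more than min_docs documents."""
    doc_freq = Counter()
    for text in docs:
        # Count each term once per document, ignoring stopwords.
        terms = {w for w in text.lower().split() if w not in STOPWORDS}
        doc_freq.update(terms)
    return {t: n for t, n in doc_freq.items() if n > min_docs}

docs = [
    "hot and sweet chicken with apricots",
    "chicken breast with pepper",
    "apricots and gingerroot",
]
selected = select_terms(docs, min_docs=1)
```

Here only "chicken" and "apricots" occur in more than one document, so only they are retained.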
102. Build Core Tree
- Build a backbone
- Create paths from unambiguous terms only
- Bias the structure towards appropriate senses of
words
- Get hypernym path if term
- - has only one sense, or
- - matches a pre-selected WordNet domain
- Adding a new term increases a count at each node on its path by the number of docs with the term.
112. Build Core Tree (cont.)
- Merge hypernym paths to build a tree
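Merging hypernym paths while accumulating document counts can be sketched as follows. The paths and document counts are hypothetical, abbreviated examples typed in by hand rather than looked up in WordNet, and the `Node` class and `add_path` helper are illustrative names, not part of Castanet.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.count = 0      # documents accounted for by terms below this node
        self.children = {}  # child name -> Node

def add_path(root, path, doc_count):
    """Merge one root-to-term hypernym path into the tree, adding doc_count at each node."""
    node = root
    node.count += doc_count
    for name in path:
        # Reuse an existing child so that shared prefixes are merged.
        node = node.children.setdefault(name, Node(name))
        node.count += doc_count

# Hypothetical single-sense terms: term -> (hypernym path, number of docs with the term).
paths = {
    "sundae": (["food", "dessert", "frozen dessert", "sundae"], 5),
    "parfait": (["food", "dessert", "frozen dessert", "parfait"], 10),
    "chicken": (["food", "meat", "chicken"], 40),
}

root = Node("ROOT")
for term, (path, n_docs) in paths.items():
    add_path(root, path, n_docs)
```

After the loop, the shared prefix "food" carries the counts of all three terms, and "frozen dessert" has the two children "sundae" and "parfait".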
123. Augment Core Tree
- Attach to Core tree the terms with more than one sense
- Favor the more common path over other alternatives
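One way to read "favor the more common path" is to score each candidate sense by how much of its hypernym path overlaps heavily used parts of the core tree. The sketch below makes that assumption explicit; the scoring function, the core-tree counts, and the two sense paths for "turkey" are all hypothetical and are not taken from the slides.

```python
def path_score(core_counts, path):
    """Sum the core-tree counts of every prefix of `path` that is already in the core tree."""
    score = 0
    for i in range(1, len(path) + 1):
        score += core_counts.get(tuple(path[:i]), 0)
    return score

def choose_sense(core_counts, candidate_paths):
    """Pick the candidate path that overlaps the most heavily used part of the core tree."""
    return max(candidate_paths, key=lambda p: path_score(core_counts, p))

# Hypothetical core tree, stored as path prefix -> document count.
core_counts = {
    ("food",): 55,
    ("food", "dessert"): 15,
    ("food", "meat"): 40,
}

# Hypothetical hypernym paths for two senses of the ambiguous term "turkey".
candidates = [
    ["location", "country", "turkey"],
    ["food", "meat", "turkey"],
]
best = choose_sense(core_counts, candidates)
```

In a recipe collection the food path wins because it shares the well-populated "food" and "meat" nodes with the core tree.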
13Augment Core Tree (cont.)
14Optional Step: Domains
- To disambiguate, use Domains
- WordNet has 212 Domains
- medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- A better collection has been developed by Magnini (2000)
- Assigns a domain to every noun synset
- Automatically scan the collection to see which domains apply
- The user selects which of the suggested domains to use or may add their own
- Paths for terms that match the selected domains are added to the core tree
15Using Domains
dip glosses
- Sense 1: A depression in an otherwise level surface
- Sense 2: The angle that a magnetic needle makes with the horizon
- Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms
- Sense 1: solid > concave shape > depression
- Sense 2: shape, form > space > angle
- Sense 3: food > ingredient, fixings > flavorer
Given domain food, choose sense 3
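The "dip" example can be sketched as a simple filter over sense-to-domain assignments. The hypernym paths are the ones listed above; the domain labels for senses 1 and 2 are hypothetical (only "food" comes from the slide), and `paths_for_domains` is an illustrative helper, not a WordNet API.

```python
# sense number -> (domain, hypernym path). Domains for senses 1 and 2 are assumed.
DIP_SENSES = {
    1: ("geology", ["solid", "concave shape", "depression"]),
    2: ("physics", ["shape, form", "space", "angle"]),
    3: ("food", ["food", "ingredient, fixings", "flavorer"]),
}

def paths_for_domains(senses, selected_domains):
    """Keep only the senses whose domain was selected by the user."""
    return {
        sense: path
        for sense, (domain, path) in senses.items()
        if domain in selected_domains
    }

chosen = paths_for_domains(DIP_SENSES, {"food"})
```

With the single selected domain "food", only sense 3 survives, so its path is the one added to the core tree.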
164. Compress Tree
- Rule 1
- Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1 * maxdist
[Figure: example tree with nodes dessert, frozen dessert, ice cream sundae, sundae, parfait, "sherbet, sorbet", sherbet]
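Rule 1 can be sketched on the dessert example as follows. The tree shape, the document counts, k = 2, and the choice of maxdist as the largest count in the tree are assumptions made for illustration; the sketch also assumes that an eliminated parent's children are promoted to its own parent, and that children are compressed before their parent is examined.

```python
class Node:
    def __init__(self, name, count, children=None):
        self.name = name
        self.count = count
        self.children = children or []

def max_count(node):
    """Largest document count anywhere in the tree (used here as maxdist)."""
    return max([node.count] + [max_count(c) for c in node.children])

def compress(node, k, min_count, is_root=True):
    """Rule 1: splice out parents with fewer than k children, promoting their children."""
    new_children = []
    for child in node.children:
        new_children.extend(compress(child, k, min_count, is_root=False))
    node.children = new_children
    is_parent = bool(node.children)
    if is_parent and not is_root and len(node.children) < k and node.count <= min_count:
        return node.children  # this node is eliminated; its children move up
    return [node]

# Assumed shape of the example tree, with made-up document counts.
tree = Node("dessert", 100, [
    Node("frozen dessert", 40, [
        Node("ice cream sundae", 5, [Node("sundae", 5)]),
        Node("parfait", 10),
        Node("sherbet, sorbet", 8, [Node("sherbet", 8)]),
    ]),
])

(root,) = compress(tree, k=2, min_count=0.1 * max_count(tree))
kept = [c.name for c in root.children[0].children]
```

With these numbers "ice cream sundae" and "sherbet, sorbet" each have a single child and a small count, so both are removed and "frozen dessert" ends up with the children sundae, parfait, and sherbet. The root keeps its single child because the rule exempts it.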
174. Compress Tree (cont.)
- Rule 2
- Eliminate a child whose name appears within the parent's name
[Figure: example tree with nodes dessert, frozen dessert, sundae, parfait, sherbet]
185. Divide into Facets
195. Divide into Facets (Remove top levels)
- Rule 1
- Eliminate very general categories (e.g., entity, abstraction). If no paths are longer than threshold t, then done.
- Rule 2 (otherwise)
- Undo the first step. Then eliminate all top levels until the maximum length of any path in the resulting hierarchy is t.
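The two rules can be sketched on a small nested-dictionary tree. Several details here are assumptions rather than facts from the slides: path length is counted in nodes, removed general categories have their children promoted, node names are unique, and the example tree and the `general` set are hypothetical.

```python
def depth(forest):
    """Longest root-to-leaf path, counted in nodes, over a forest {name: subtree}."""
    return max((1 + depth(sub) for sub in forest.values()), default=0)

def drop_general(forest, general):
    """Rule 1: remove nodes named in `general`, promoting their children."""
    result = {}
    for name, sub in forest.items():
        sub = drop_general(sub, general)
        if name in general:
            result.update(sub)
        else:
            result[name] = sub
    return result

def drop_top_level(forest):
    """Remove one whole top level: every root is replaced by its children."""
    result = {}
    for sub in forest.values():
        result.update(sub)
    return result

def divide_into_facets(forest, general, t):
    pruned = drop_general(forest, general)
    if depth(pruned) <= t:
        return pruned
    # Rule 2: discard the Rule 1 result and strip whole top levels instead.
    while depth(forest) > t:
        forest = drop_top_level(forest)
    return forest

tree = {"entity": {
    "abstraction": {"measure": {"cup": {}, "tablespoon": {}}},
    "physical entity": {"food": {
        "dessert": {"sundae": {}, "parfait": {}},
        "meat": {"chicken": {}},
    }},
}}
general = {"entity", "abstraction", "physical entity"}

facets_t3 = divide_into_facets(tree, general, t=3)
facets_t2 = divide_into_facets(tree, general, t=2)
```

With t = 3, Rule 1 alone is enough and the facets are "measure" and "food". With t = 2 the food path is still too long after Rule 1, so Rule 2 strips top levels from the original tree until the roots are cup, tablespoon, dessert, and meat.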
20Example Recipes (3500 docs)
21Castanet Output (shown in Flamenco)
22Castanet Output
23Castanet Output
24Castanet Output
25Castanet Output
27Castanet Evaluation
- This is a tool for information architects, so information architects did the evaluation
- We compared output on
- - Recipes
- - Biomedical journal titles
- We compared to two state-of-the-art algorithms
- - LDA (Blei et al. 04)
- - Subsumption (Sanderson & Croft 99)
28Subsumption Output
29Subsumption Output
30Subsumption Output
31Subsumption Output
32LDA Output
33LDA Output
34LDA Output
35Evaluation Method
- Information architects assessed the category systems
- For each of 2 systems' output
- - Examined and commented on top-level
- - Examined and commented on two sub-levels
- Then commented on overall properties
- - Meaningful?
- - Systematic?
- - Likely to use in your work?
36Evaluation (cont.)
- Sample questions for top level categories
- - Would you add/remove/rename any category?
- - Did this category match your expectations?
- Sample questions for a specific category
- - Would you add/move/remove any sub-categories?
- - Would you promote any sub-category to top level?
- General questions
- - Would you use Castanet?
- - Would you use LDA?
- - Would you use Subsumption?
- - Would you use list of most frequent terms?
37Evaluation Results
- Results on recipes collection for "Would you use this system in your work?"
- Answered "Yes in some cases" or "Yes, definitely":
- - Castanet 29/34
- - LDA 0/18
- - Subsumption 6/16
- - Baseline 25/34
- Average response to questions about quality (4 = strongly agree)
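Because the denominators differ across systems, the "would you use" counts are easier to compare as percentages. The short calculation below uses only the counts reported above.

```python
# (positive answers, total answers) as reported for the recipes collection.
results = {
    "Castanet": (29, 34),
    "LDA": (0, 18),
    "Subsumption": (6, 16),
    "Baseline": (25, 34),
}

percent = {
    name: round(100 * yes / total, 1)
    for name, (yes, total) in results.items()
}
```

This gives roughly 85.3% for Castanet, 73.5% for the baseline, 37.5% for Subsumption, and 0% for LDA.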
38Evaluation Results
- Average responses for top-level categories
- 4 = no changes, 1 = change many
- Average responses for 2 subcategories
39Needed Improvements
- Take spelling variations and morphological variants into account
- Use verbs and adjectives, not just nouns
- Normalize noun phrases
- Allow terms to have more than one sense
- Improve algorithm for assigning documents to
categories.
40Opportunities for Tagging
- New opportunity: tagging, folksonomies
- - (flickr, del.icio.us)
- People are creating facets in a decentralized manner
- They are assigning multiple facets to items
- This is done on a massive scale
- This leads naturally to meaningful associations
41Conclusions
- Flexible application of hierarchical faceted metadata is a proven approach for navigating large information collections.
- Midway in complexity between simple hierarchies and deep knowledge representation.
- Currently in use on e-commerce sites; spreading to other domains
- Systems are needed to help create faceted metadata structures
- Our WordNet-based algorithm, while not perfect, seems like it will be a useful tool for Information Architects.
42Conclusions
- Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet.
- The method has been tested on various domains
- - medicine, recipes, math, news, arts, bibliographical records
- Usability study shows
- - Castanet is preferred to other state-of-the-art solutions.
- - Information architects want to use the tool in their work.
43Learn More
- Funding
- - This work supported in part by NSF (IIS-9984741)
- For more information
- - Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL/HLT 2007
- - See http://flamenco.berkeley.edu