Title: Semi-Automated Creation of Facet Hierarchies
1Semi-Automated Creation of Facet Hierarchies
- Marti Hearst
- School of Information, UC Berkeley
- Joint work with Dr. Emilia Stoica
2Outline
- Faceted Metadata
- Definition
- Advantages
- Flamenco
- Search Interface Design using Faceted Metadata
- Castanet
- (Semi) Automated Tool for Creation of Category Systems
- Comparison to State-of-the-Art Alternatives
- Conclusions
3Focus: Search and Navigation of Large Collections
Shopping Sites
Digital Libraries
E-Government Sites
Image Collections
4Problems with Site Search
- Study by Vividence in 2001 on 69 Sites
- 70% eCommerce
- 31% Service
- 21% Content
- 2% Community
- Poorly organized search results
- Frustration and wasted time
- Poor information architecture
- Confusion
- Dead ends
- "back and forthing"
- Forced to search
5What we want to Achieve
- Integrate browsing and searching seamlessly
- Support exploration and learning
- Avoid dead-ends, pogoing, and lostness
6Main Idea
- Use hierarchical faceted metadata
- Design the interface to
- Allow flexible navigation
- Provide previews of next steps
- Organize results in a meaningful way
- Support both expanding and refining the search
7The Problem With Hierarchy
- Most things can be classified in more than one way.
- Most organizational systems do not handle this well.
- Example: Animal Classification
[Figure: the animals otter, penguin, robin, salmon, wolf, cobra, and bat, each classifiable by Skin Covering, Locomotion, and Diet]
8The Problem with Hierarchy
- Inflexible
- Forces the user to start with a particular category
- What if I don't know the animal's diet, but the interface makes me start with that category?
- Wasteful
- Have to repeat combinations of categories
- Makes for extra clicking and extra coding
- Difficult to modify
- To add a new category type, must duplicate it
everywhere or change things everywhere
9The Problem With Hierarchy
[Figure: a single hierarchy rooted at "start" that branches first by locomotion (swim, fly, run, slither), then by skin covering (fur, scales, feathers), then by diet (fish, rodents, insects), with the animals salmon, bat, robin, and wolf at the leaves. The skin covering and diet levels are repeated under every branch.]
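The duplication in the single hierarchy pictured above can be made concrete with a little arithmetic. The following is a minimal Python sketch, assuming the tree is fully expanded with the locomotion, skin covering, and diet values from the figure; the variable names are made up for illustration.

```python
from itertools import product

# Values taken from the figure on this slide.
locomotion = ["swim", "fly", "run", "slither"]
skin = ["fur", "scales", "feathers"]
diet = ["fish", "rodents", "insects"]

# Single hierarchy: one leaf category for every combination of values,
# so "fish", "rodents", and "insects" are repeated under every branch.
leaf_categories = list(product(locomotion, skin, diet))

# Faceted alternative: each value is stated once, in its own facet.
facet_labels = len(locomotion) + len(skin) + len(diet)

print(len(leaf_categories))  # 36
print(facet_labels)          # 10
```

Adding a fourth dimension with n values would multiply the number of leaf categories by n in the single hierarchy, but would only add n labels in the faceted arrangement.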
10The Idea of Facets
- Facets are a way of labeling data
- A kind of Metadata (data about data)
- Can be thought of as properties of items
- Facets vs. Categories
- Items are placed INTO a category system
- Multiple facet labels are ASSIGNED TO items
11The Idea of Facets
- Create INDEPENDENT categories (facets)
- Each facet has labels (sometimes arranged in a hierarchy)
- Assign labels from the facets to every item
- Example: recipe collection
Ingredient: Chicken, Bell Pepper
Cooking Method: Stir-fry, Curry
Course: Main Course
Cuisine: Thai
12The Idea of Facets
- Break out all the important concepts into their own facets
- Sometimes the facets are hierarchical
- Assign labels to items from any level of the hierarchy
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy (Ice Cream, Sorbet, Flan)
Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
13Using Facets
- Now there are multiple ways to get to each item
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy (Ice Cream, Sherbet, Flan)
Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
Fruit > Pineapple, Dessert > Cake, Preparation > Bake
Dessert > Dairy > Sherbet, Fruit > Berries > Strawberries, Preparation > Freeze
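The two label sets above can be read as paths in separate facet hierarchies, and an item is reachable from any prefix of any of its paths. Here is a minimal Python sketch of that idea; the item names are hypothetical and the tuple-based path representation is an assumption, not the Flamenco data model.

```python
# Hypothetical items; each label is a path from a facet's root down to the assigned label.
ITEMS = {
    "pineapple cake": [
        ("Fruit", "Pineapple"),
        ("Dessert", "Cake"),
        ("Preparation", "Bake"),
    ],
    "strawberry sherbet": [
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    ],
}

def is_under(path, prefix):
    """True if `path` equals `prefix` or lies below it in the hierarchy."""
    return path[:len(prefix)] == prefix

def find(prefix):
    """Names of all items with at least one label at or below `prefix`."""
    return sorted(
        name for name, paths in ITEMS.items()
        if any(is_under(p, prefix) for p in paths)
    )

print(find(("Fruit",)))               # ['pineapple cake', 'strawberry sherbet']
print(find(("Fruit", "Berries")))     # ['strawberry sherbet']
print(find(("Preparation", "Bake")))  # ['pineapple cake']
```

Because every label is stored as a full path, selecting a high-level label such as Fruit automatically includes items assigned to deeper labels such as Strawberries.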
14Example: Nobel Prize Winners Collection (Before and After Facets)
15Only One Way to View Laureates
16First, Choose Prize Type
17Next, view the list!
The user must first choose an award type (literature), then browse through the laureates in chronological order. No choice is given to, say, organize by year and then award, or by country, then decade, then award, etc.
18Flamenco Interface Using Hierarchical Faceted Metadata
19Opening View: Select literature from PRIZE facet
20Group results by YEAR facet
21Select 1920s from YEAR facet
22Current query is PRIZE > literature AND YEAR > 1920s. Now remove PRIZE > literature
23Now Group By YEAR > 1920s
24Hierarchy Traversal: Group By YEAR > 1920s, and drill down to 1921
25Select an individual item
26Use Endgame to expand out
27Use Endgame to expand out
28Or use "More like this" to find similar items
29Start a new search using keyword "California"
30Note that category structure remains after the
keyword search
31 The query is now a keyword ANDed with a facet
subhierarchy
32Using Facets
- The system only shows the labels that correspond to the current set of items
- Start with all items and all facets
- The user then selects a label within a facet
- This reduces the set of items (only those that have been assigned to the subcategory label are displayed)
- This also eliminates some subcategories from the view.
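The narrowing behavior described in the bullets above can be sketched in a few lines of Python. This is an illustrative sketch only: the records are made up, each item has exactly one flat label per facet, and the function names `refine` and `previews` are hypothetical.

```python
from collections import Counter

FACETS = ("PRIZE", "YEAR", "COUNTRY")

# Made-up records for illustration; one label per facet per item.
ITEMS = [
    {"id": 1, "PRIZE": "literature", "YEAR": "1920s", "COUNTRY": "Ireland"},
    {"id": 2, "PRIZE": "literature", "YEAR": "1930s", "COUNTRY": "USA"},
    {"id": 3, "PRIZE": "physics", "YEAR": "1920s", "COUNTRY": "Germany"},
    {"id": 4, "PRIZE": "peace", "YEAR": "1920s", "COUNTRY": "USA"},
]

def refine(current, facet, label):
    """Keep only the items that carry the selected label."""
    return [item for item in current if item[facet] == label]

def previews(current):
    """Count, per facet, the labels that still occur in the current item set.
    Labels with no matching items are simply absent, so every label that is
    offered leads to a non-empty result."""
    return {facet: Counter(item[facet] for item in current) for facet in FACETS}

selected = refine(ITEMS, "YEAR", "1920s")
print(dict(previews(ITEMS)["YEAR"]))      # {'1920s': 3, '1930s': 1}
print(dict(previews(selected)["PRIZE"]))  # {'literature': 1, 'physics': 1, 'peace': 1}
print("1930s" in previews(selected)["YEAR"])  # False
```

Recomputing the previews after every selection is what removes labels such as 1930s from the view once they no longer match any item.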
33Advantages of Facets
- Can't end up with empty result sets
- (except with keyword search)
- Helps avoid feelings of being lost.
- Easier to explore the collection.
- Helps users infer what kinds of things are in the collection.
- Evokes a feeling of "browsing the shelves"
- Is preferred over standard search for collection browsing in usability studies.
- (Interface must be designed properly)
34Advantages of Facets
- Seamless to add new facets and subcategories
- Seamless to add new items.
- Helps with "categorization wars"
- Don't have to agree exactly where to place something
- Interaction can be implemented using a standard relational database.
- May be easier for automatic categorization
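To illustrate the point about relational databases, here is a minimal sketch using Python's built-in `sqlite3` module. The two-table schema, the slash-separated path encoding, and the sample rows are all assumptions made for this example; they are not Flamenco's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item  (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE label (item_id INTEGER, facet TEXT, path TEXT);
""")
con.executemany("INSERT INTO item VALUES (?, ?)",
                [(1, "item one"), (2, "item two"), (3, "item three")])
con.executemany("INSERT INTO label VALUES (?, ?, ?)", [
    (1, "PRIZE", "literature"), (1, "YEAR", "1920s/1921"),
    (2, "PRIZE", "literature"), (2, "YEAR", "1930s/1932"),
    (3, "PRIZE", "physics"),    (3, "YEAR", "1920s/1921"),
])

# Conjunctive query: PRIZE > literature AND YEAR > 1920s (the label itself or anything below it).
matches = [row[0] for row in con.execute("""
    SELECT i.id FROM item i
    JOIN label p ON p.item_id = i.id AND p.facet = 'PRIZE' AND p.path = 'literature'
    JOIN label y ON y.item_id = i.id AND y.facet = 'YEAR'
                AND (y.path = '1920s' OR y.path LIKE '1920s/%')
    ORDER BY i.id
""")]

# Preview counts for the YEAR facet over the whole collection.
year_counts = con.execute("""
    SELECT path, COUNT(*) FROM label
    WHERE facet = 'YEAR' GROUP BY path ORDER BY path
""").fetchall()

print(matches)      # [1]
print(year_counts)  # [('1920s/1921', 2), ('1930s/1932', 1)]
```

Each selected facet label becomes one join condition, and the preview counts are plain GROUP BY queries, which is why no special-purpose engine is required.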
35Information previews
- Use the metadata to show where to go next
- More flexible than canned hyperlinks
- Less complex than full search
- Help users see and return to previous steps
- Reduces mental work
- Recognition over recall
- Suggests alternatives
- More clicks are OK only if (J. Spool):
- The "scent" of the target does not weaken
- Users feel they are going towards, rather than away from, their target.
36Facets vs. Hierarchy
- Early Flamenco studies compared allowing multiple hierarchical facets vs. just one facet.
- Multiple facets were preferred and more successful.
37Limitation of Facets
- Do not naturally capture MAIN THEMES
- Facets do not show RELATIONS explicitly
Colors: Aquamarine, Red, Orange
Objects: Door, Doorway, Wall
- Which color is associated with which object?
Photo by J. Hearst, jhearst.typepad.com
38Terminology Clarification
- Facets vs. Attributes
- Facets are shown independently in the interface
- Attributes are just associated with individual items
- E.g., ID number, Source, Affiliation
- However, can always convert an attribute to a facet
- Facets vs. Labels
- Labels are the names used within facets
- These are organized into subhierarchies
- Synonyms
- There should be alternate names for the category labels
- Currently (in Flamenco) this is done with subcategories
- E.g., Deer has subcategories stag, fawn, doe
39Usability Study Results
40Flamenco Usability Studies
- Usability studies done on 3 collections
- Recipes (epicurious) 13,000 items
- Architecture Images 40,000 items
- Fine Arts Images 35,000 items
- Conclusions
- Users like and are successful with the dynamic faceted hierarchical metadata, especially for browsing tasks
- Very positive results, in contrast with studies on earlier iterations.
41Most Recent Usability Study
- Participants and Collection
- 32 Art History Students
- 35,000 images from SF Fine Arts Museum
- Study Design
- Within-subjects
- Each participant sees both interfaces
- Balanced in terms of order and tasks
- Participants assess each interface after use
- Afterwards they compare them directly
- Data recorded in behavior logs, server logs, and paper surveys; one or two experienced testers at each trial.
- Used 9-point Likert scales.
- Session took about 1.5 hours; pay was $15/hour
42Post-Interface Assessments
All significant at p < .05 except "simple" and "overwhelming"
43Post-Test Comparison
Which interface is preferable for (Faceted vs. Baseline):
- Find images of roses
- Find all works from a given period
- Find pictures by 2 artists in same media
Overall assessment (Faceted vs. Baseline):
- More useful for your tasks
- Easiest to use
- Most flexible
- More likely to result in dead ends
- Helped you learn more
- Overall preference
44How to Create Facet Hierarchies?
45Example: Recipes (3500 docs)
46Castanet Output (shown in Flamenco)
47Castanet Output (shown in Flamenco)
48Castanet Output (shown in Flamenco)
49Castanet Output (shown in Flamenco)
50Castanet Output (shown in Flamenco)
51Our Approach: Leverage the structure of WordNet
52Our Approach
- Leverage the structure of WordNet
[Figure: pipeline diagram starting from Documents]
53 1. Select Terms
- Select well-distributed terms from collection
[Figure: pipeline diagram with the stages Documents, Select terms, Get hypernym paths (using WordNet), Build tree, Comp. tree]
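The slide does not say how "well distributed" is measured. One plausible reading is to keep terms whose document frequency falls inside a band, which the following Python sketch illustrates. The thresholds `min_df` and `max_df_ratio`, the stopword list, and the toy documents are all assumptions, not Castanet's actual criteria.

```python
from collections import Counter

# Toy collection; a real run would use the document texts or titles.
DOCS = [
    "stir fry chicken with bell pepper",
    "bake chicken with lemon",
    "freeze strawberry sherbet",
    "bake pineapple cake",
]

def select_terms(docs, min_df=2, max_df_ratio=0.9, stopwords=frozenset({"with"})):
    """Return terms that occur in at least `min_df` documents but in no more
    than `max_df_ratio` of the collection (assumed selection criterion)."""
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        df.update(set(doc.lower().split()) - stopwords)
    limit = max_df_ratio * len(docs)
    return sorted(term for term, n in df.items() if min_df <= n <= limit)

print(select_terms(DOCS))  # ['bake', 'chicken']
```

Terms that occur only once give the hierarchy little to organize, while terms that occur almost everywhere do not discriminate between items, which is the motivation for using a band rather than a single cutoff.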
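54 2. Get Hypernym Path
[Figure: hypernym paths for the terms red and blue]
55 3. Build Tree
[Figure: pipeline diagram with the stages Documents, Select terms, Get hypernym paths (using WordNet), Build tree, Comp. tree; example tree built from the paths for red and blue]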
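Building the tree amounts to merging the hypernym paths of the selected terms so that shared ancestors are stored once. Below is a minimal Python sketch using nested dictionaries. The paths are written by hand, borrowing the node names that appear on the Compress Tree slide; a real run would read them from WordNet, and the exact path shape is an assumption.

```python
# Hand-written root-to-term paths (assumed shape); not fetched from WordNet here.
PATHS = {
    "red":  ["color", "chromatic color", "red, redness", "red"],
    "blue": ["color", "chromatic color", "blue, blueness", "blue"],
}

def build_tree(paths):
    """Merge root-to-term paths into one nested-dict tree.
    Shared prefixes such as color > chromatic color are stored only once."""
    tree = {}
    for path in paths.values():
        node = tree
        for label in path:
            node = node.setdefault(label, {})
    return tree

tree = build_tree(PATHS)
print(tree)
```

The result is a single tree rooted at color whose chromatic color node has two branches, one for each term.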
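56 4. Compress Tree
[Figure: pipeline diagram with the stages Documents, Select terms, Get hypernym paths (using WordNet), Build tree, Comp. tree; example tree with color at the root, chromatic color below it, then the nodes "red, redness", "blue, blueness", and "green, greenness", and the terms red, blue, and green]
57 4. Compress Tree (cont.)
[Figure: same pipeline diagram; two example trees, one with chromatic color between color and the terms red, blue, and green, and one with red, blue, and green directly under color]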
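The slides show the effect of compression but not its rules. The Python sketch below reproduces the two steps visible in the figures under two assumed rules: (1) a synonym-set node whose only child is one of its own synonyms collapses into that child, and (2) a child whose name contains its parent's name is dropped and its children are promoted. Both rules are inferred from the example and may differ from what Castanet actually does.

```python
def compress(label, children):
    """Return the compressed (label, children) pair for one subtree (bottom-up)."""
    new_children = {}
    for child_label, grandchildren in children.items():
        c_label, c_children = compress(child_label, grandchildren)
        if label in c_label.split() and c_children:
            # Assumed rule 2: "chromatic color" under "color" adds little,
            # so drop it and promote its children.
            new_children.update(c_children)
        else:
            new_children[c_label] = c_children
    if len(new_children) == 1:
        only, sub = next(iter(new_children.items()))
        synonyms = [s.strip() for s in label.split(",")]
        if only in synonyms:
            # Assumed rule 1: "red, redness" > "red" collapses to "red".
            return only, sub
    return label, new_children

TREE = {
    "chromatic color": {
        "red, redness": {"red": {}},
        "blue, blueness": {"blue": {}},
        "green, greenness": {"green": {}},
    }
}

print(compress("color", TREE))
# ('color', {'red': {}, 'blue': {}, 'green': {}})
```

Processing the subtrees first means the synonym-set nodes are collapsed before the parent is examined, which matches the order of the two slides: the synset level disappears first, then chromatic color.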
58 5. Divide into Facets
59Disambiguation
- Ambiguity in
- Word senses
- Paths up the hypernym tree
60How to Select the Right Senses and Paths?
- First build core tree
- (1) Create paths for words with only one sense
- (2) Use Domains
- WordNet has 212 Domains
- medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- Automatically scan the collection to see which domains apply
- The user selects which of the suggested domains to use or may add own
- Paths for terms that match the selected domains are added to the core tree
- Then add remaining terms to the core tree.
61Using Domains
dip glosses
Sense 1: A depression in an otherwise level surface
Sense 2: The angle that a magnet needle makes with horizon
Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms
Sense 1: solid > concave shape > depression
Sense 2: shape, form > space > angle
Sense 3: food > ingredient, fixings > flavorer
Given domain "food", choose sense 3
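The "dip" example can be expressed as a small selection function. In this Python sketch the hypernym paths are typed in by hand as read from the slide, and a sense is kept when a selected domain name appears on its path. Matching domain names against path nodes is an assumption made for illustration; the slides do not spell out how domains are attached to senses.

```python
# Hypernym paths for the three senses of "dip" (root first), typed in by hand.
DIP_SENSES = {
    1: ["solid", "concave shape", "depression"],
    2: ["shape, form", "space", "angle"],
    3: ["food", "ingredient, fixings", "flavorer"],
}

def senses_in_domain(sense_paths, domains):
    """Return the sense numbers whose hypernym path mentions a selected domain."""
    chosen = []
    for sense, path in sense_paths.items():
        # A node such as "ingredient, fixings" lists several names; split them.
        names = {name.strip() for node in path for name in node.split(",")}
        if names & set(domains):
            chosen.append(sense)
    return chosen

print(senses_in_domain(DIP_SENSES, {"food"}))      # [3]
print(senses_in_domain(DIP_SENSES, {"medicine"}))  # []
```

When no sense matches any selected domain, the term would be left for the later step that adds the remaining terms to the core tree.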
62Castanet Evaluation
63Castanet Evaluation
- This is a tool for information architects, so people of this type did the evaluation
- We compared output on
- Recipes
- Biomedical journal titles
- We compared to two state-of-the-art algorithms
- LDA (Blei et al. 04)
- Subsumption (Sanderson & Croft 99)
64Subsumption Output (shown in Flamenco)
65Subsumption Output (shown in Flamenco)
66Subsumption Output (shown in Flamenco)
67Subsumption Output (shown in Flamenco)
68LDA Output (shown in Flamenco)
69LDA Output (shown in Flamenco)
70LDA Output (shown in Flamenco)
71Evaluation Method
- Information architects assessed the category systems
- For each of 2 systems' output
- Examined and commented on top-level
- Examined and commented on two sub-levels
- Then commented on overall properties
- Meaningful?
- Systematic?
- Likely to use in your work?
72Evaluation Results
- Results on recipes collection for "Would you use this system in your work?"
- "Yes in some cases" or "yes definitely"
- Pine (Castanet) 29/34
- Oak (LDA) 0/18
- Birch (Subsumption) 6/16
- Results on quality of categories
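For easier comparison, the yes-counts reported above can be converted to percentages. This short Python sketch only restates the fractions from the slide; no additional data is introduced.

```python
# (yes answers, total answers) as reported on the slide.
RESULTS = {
    "Pine (Castanet)": (29, 34),
    "Oak (LDA)": (0, 18),
    "Birch (Subsumption)": (6, 16),
}

shares = {name: round(100 * yes / total, 1) for name, (yes, total) in RESULTS.items()}
print(shares)
# {'Pine (Castanet)': 85.3, 'Oak (LDA)': 0.0, 'Birch (Subsumption)': 37.5}
```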
73Opportunities for Tagging
- New opportunity: Tagging, folksonomies
- (flickr, del.icio.us)
- People are creating facets in a decentralized manner
- They are assigning multiple facets to items
- This is done on a massive scale
- This leads naturally to meaningful associations
74Conclusions
- Flexible application of hierarchical faceted metadata is a proven approach for navigating large information collections.
- Midway in complexity between simple hierarchies and deep knowledge representation.
- Currently in use on e-commerce sites; spreading to other domains
- Systems are needed to help create faceted metadata structures
- Our WordNet-based algorithm, while not perfect, seems like it will be a useful tool for Information Architects.
75Acknowledgements
- Flamenco Team
- Brycen Chun, Ame Elliott, Jennifer English, Kevin Li, Rashmi Sinha, Emilia Stoica, Kirsten Swearingen, Ka-Ping Yee
- Castanet
- Emilia Stoica
- Funding
- This work supported in part by NSF (IIS-9984741)
76For more information: flamenco.berkeley.edu
- Thank you!
- Marti Hearst and Emilia Stoica