Title: SIMS 247: Information Visualization and Presentation Marti Hearst
1SIMS 247: Information Visualization and Presentation
Marti Hearst
Nov 2 and Nov 7, 2005
2Outline
- Why Text is Tough
- Single-document Visualization
- Visualizing Concept Spaces
- Clusters
- Category Hierarchies
- Visualizing Query Specifications
- Visualizing Retrieval Results
- Usability Study Meta-Analysis
3Why Visualize Text?
- To help with Information Retrieval
- give an overview of a collection
- show user what aspects of their interests are present in a collection
- help user understand why documents were retrieved as a result of a query
- Text Data Mining
- Mainly clustering and nodes-and-links
- Software Engineering
- not really text, but has some similar properties
4Why Text is Tough
- Text is not pre-attentive
- Text consists of abstract concepts
- which are difficult to visualize
- Text represents similar concepts in many different ways
- space ship, flying saucer, UFO, figment of imagination
- Text has very high dimensionality
- Tens or hundreds of thousands of features
- Many subsets can be combined together
5Why Text is Tough
The Dog.
6Why Text is Tough
The Dog.
The dog cavorts.
The dog cavorted.
7Why Text is Tough
The man.
The man walks.
8Why Text is Tough
The man walks the cavorting dog.
So far, we can sort of show this in pictures.
9Why Text is Tough
As the man walks the cavorting dog,
thoughts arrive unbidden of the previous spring,
so unlike this one, in which walking was marching
and dogs were baleful sentinels outside unjust
halls.
How do we visualize this?
10Why Text is Tough
- Abstract concepts are difficult to visualize
- Combinations of abstract concepts are even more difficult to visualize
- time
- shades of meaning
- social and psychological concepts
- causal relationships
11Why Text is Tough
- Language only hints at meaning
- Most meaning of text lies within our minds and common understanding
- How much is that doggy in the window?
- "how much" implies a social system of barter and trade (not the size of the dog)
- "doggy" implies childlike, plaintive, probably cannot do the purchasing on their own
- "in the window" implies behind a store window, not really inside a window; requires the notion of window shopping
12Why Text is Tough
- General categories have no standard ordering (nominal data)
- Categorization of documents by single topics misses important distinctions
- Consider an article about
- NAFTA
- The effects of NAFTA on truck manufacture
- The effects of NAFTA on productivity of truck manufacture in the neighboring cities of El Paso and Juarez
13Why Text is Tough
- Other issues about language
- ambiguous (many different meanings for the same words and phrases)
- different combinations imply different meanings
14Why Text is Tough
- I saw Pathfinder on Mars with a telescope.
- Pathfinder photographed Mars.
- The Pathfinder photograph mars our perception of a lifeless planet.
- The Pathfinder photograph from Ford has arrived.
- The Pathfinder forded the river without marring its paint job.
15Why Text is Easy
- Text is highly redundant
- When you have lots of it
- Pretty much any simple technique can pull out phrases that seem to characterize a document
- Instant summary
- Extract the most frequent words from a text
- Remove the most common English words
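The "instant summary" recipe above can be sketched in a few lines of Python. This is only an illustration: the stopword list is a tiny made-up stand-in for a real list of common English words, and the function name instant_summary is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def instant_summary(text, top_n=5, stopwords=STOPWORDS):
    """Return the top_n most frequent non-stopword tokens with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(top_n)

sample = ("The dog cavorts. The dog cavorted. The man walks. "
          "The man walks the cavorting dog.")
print(instant_summary(sample, top_n=3))
```

Run over a long text, a list like this is what produces a word-count table such as the one on the next slide.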
16Guess the Text
- 478 said
- 233 god
- 201 father
- 187 land
- 181 jacob
- 160 son
- 157 joseph
- 134 abraham
- 121 earth
- 119 man
- 118 behold
- 113 years
- 104 wife
- 101 name
- 94 pharaoh
17Visualizing Individual Documents
- Early approach: SuperBook
- Showing term occurrences: TextArc
18Superbook (http://superbook.bellcore.com/SB)
19TextArc (www.textarc.org)
20SeeSoft: Showing Text Content using a linear
representation and brushing and linking (Eick and Wills 95)
21Virtual Shakespeare (Small 96)
22Text Collection Overviews
- How can we show an overview of the contents of a text collection?
- Show info external to the docs
- e.g., date, author, source, number of inlinks
- does not show what they are about
- Show the meanings or topics in the docs
- a list of titles
- results of clustering words or documents
- organize according to categories (next time)
23The Need to Group
- Interviews with lay users often reveal a desire for better organization of retrieval results
- Useful for suggesting where to look next
- People prefer links over generating search terms
- But only when the links are for what they want
- Three main approaches for text and images
- Group items according to pre-defined categories
- Group items into automatically-created clusters
- Group items according to common keywords
Ojakaar and Spool, Users Continue After Category Links, UIETips Newsletter, http://world.std.com/uieweb/Articles/, 2001
24Categories
- Human-created
- But often automatically assigned to items
- Arranged in hierarchy, network, or facets
- Can assign multiple categories to items
- Or place items within categories
- Usually restricted to a fixed set
- So help reduce the space of concepts
- Intended to be readily understandable
- To those who know the underlying domain
- Provide a novice with a conceptual structure
- There are many already made up!
- However, until recently, their use in interfaces has
- Been under-investigated
- Not met their promise
25Clustering
- The art of finding groups in data
- Kaufman and Rousseeuw
- Groups are formed according to associations and commonalities among the data's features.
- There are dozens of algorithms, more all the time
- Most need a way of determining similarity or difference between a pair of items
- In text clustering, documents are usually represented as a vector of weighted features which are some transformation on the words
- Similarity between documents is a weighted measure of feature overlap
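One common way to realize "weighted feature vectors" and "weighted feature overlap" is tf-idf weighting with cosine similarity. The sketch below assumes that choice (the slide does not name a specific weighting), and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse {term: tf * idf} vectors."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "space ship lands on mars".split(),
    "flying saucer lands on mars".split(),
    "the dog walks the man".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Note that "space ship" and "flying saucer" contribute nothing to the similarity here: the two documents are judged similar only through the words they literally share, which is the synonymy problem raised earlier.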
26Clustering
- Potential benefits
- Find the main themes in a set of documents
- Potentially useful if the user wants a summary of the main themes in the subcollection
- Potentially harmful if the user is interested in less dominant themes
- More flexible than pre-defined categories
- There may be important themes that have not been anticipated
- Disambiguate ambiguous terms
- ACL
- Clustering retrieved documents tends to group those relevant to a complex query together
Hearst and Pedersen, Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, SIGIR 96
27Scatter/Gather Clustering
- Developed at PARC in the late 80s/early 90s
- Top-down approach
- Start with k seeds (documents) to represent k clusters
- Each document assigned to the cluster with the most similar seeds
- To choose the seeds
- Cluster in a bottom-up manner
- Hierarchical agglomerative clustering
- Start with n documents, compare all by pairwise similarity, combine the two most similar documents to make a cluster
- Now compare both clusters and individual documents to find the most similar pair to combine
- Continue until k clusters remain
- Use the centroid of each of these as seeds
- Centroid: average of the weighted vectors
- Can recluster a cluster to produce a hierarchy of clusters
Pedersen, Cutting, Karger, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, SIGIR 1992
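The seed-selection and assignment steps on this slide can be sketched as follows. This is a simplified illustration, not the published algorithm: it runs agglomerative clustering over all documents, compares clusters by the cosine of their centroids (an assumption), and uses small dense vectors. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise average of the weighted vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def choose_seeds(vectors, k):
    """Bottom-up: merge the two most similar clusters until k remain,
    then return each cluster's centroid as a seed."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]

def assign(vectors, seeds):
    """Top-down: give each document the index of its most similar seed."""
    return [max(range(len(seeds)), key=lambda s: cosine(v, seeds[s]))
            for v in vectors]

docs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
seeds = choose_seeds(docs, k=2)
labels = assign(docs, seeds)
print(labels)
```

The all-pairs comparison is quadratic in the number of documents on every merge, so a sketch like this is only practical for small inputs.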
28Scatter/Gather
29Northern Light Web Search: Started out with
clustering. Then integrated with categories.
Then did not do web search and used only
categories.
32Visualizing Clustering Results
- Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
- Use dimension reduction and then project these onto a 2D/3D graphical representation
33Clustering Multi-Dimensional Document Space
(image from Wise et al 95)
34Clustering Multi-Dimensional Document Space
(image from Wise et al 95)
35Kohonen Feature Maps on Text
(from Chen et al., JASIS 49(7))
36Is it useful?
- 4 Clustering Visualization Usability Studies
37Clustering for Search Study 1
- This study compared
- a system with 2D graphical clusters
- a system with 3D graphical clusters
- a system that shows textual clusters
- Novice users
- Only textual clusters were helpful (and they were
difficult to use well)
Kleiboemer, Lazear, and Pedersen. Tailoring a
retrieval system for naive users. SDAIR96
38Clustering Study 2: Kohonen Feature Maps
- Comparison: Kohonen Map and Yahoo
- Task
- Window shop for interesting home page
- Repeat with other interface
- Results
- Starting with map: could repeat in Yahoo (8/11)
- Starting with Yahoo: unable to repeat in map (2/14)
Chen, Houston, Sewell, Schatz, Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques. JASIS 49(7): 582-603 (1998)
39Kohonen Feature Maps (Lin 92, Chen et al. 97)
40Study 2 (cont.)
- Participants liked
- Correspondence of region size to documents
- Overview (but also wanted zoom)
- Ease of jumping from one topic to another
- Multiple routes to topics
- Use of category and subcategory labels
41Study 2 (cont.)
- Participants wanted
- hierarchical organization
- other ordering of concepts (alphabetical)
- integration of browsing and search
- correspondence of color to meaning
- more meaningful labels
- labels at same level of abstraction
- fit more labels in the given space
- combined keyword and category search
- multiple category assignment (sports+entertain)
- (These can all be addressed with faceted hierarchical categories)
42Clustering Study 3: NIRVE
- Each rectangle is a cluster. Larger clusters
closer to the pole. Similar clusters near one
another. Opening a cluster causes a projection
that shows the titles.
43Study 3
- This study compared
- 3D graphical clusters
- 2D graphical clusters
- textual clusters
- 15 participants, between-subject design
- Tasks
- Locate a particular document
- Locate and mark a particular document
- Locate a previously marked document
- Locate all clusters that discuss some topic
- List more frequently represented topics
Visualization of search results: a comparative evaluation of text, 2D, and 3D interfaces. Sebrechts, Cugini, Laskowski, Vasilakis and Miller, SIGIR 99.
44Study 3
- Results (time to locate targets)
- Text clusters fastest
- 2D next
- 3D last
- With practice (6 sessions) 2D neared text results; 3D still slower
- Computer experts were just as fast with 3D
- Certain tasks equally fast with 2D and text
- Find particular cluster
- Find an already-marked document
- But anything involving text (e.g., find title) much faster with text.
- Spatial location rotated, so users lost context
- Helpful viz features
- Color coding (helped text too)
- Relative vertical locations
45Clustering Study 4
- Compared several factors
- Findings
- Topic effects dominate (this is a common finding)
- Strong difference in results based on spatial ability
- No difference between librarians and other people
- No evidence of usefulness for the cluster visualization
Aspect windows, 3-D visualizations, and indirect comparisons of information retrieval systems, Swan and Allan, SIGIR 1998.
46Summary: Visualizing for Search Using Clusters
- Huge 2D maps may be inappropriate focus for information retrieval
- cannot see what the documents are about
- space is difficult to browse for IR purposes
- (tough to visualize abstract concepts)
- Perhaps more suited for pattern discovery and
gist-like overviews
47Category Combinations
- Let's show categories instead of clusters
48DynaCat (Pratt, Hearst, Fagan 99)
49DynaCat (Pratt 97)
- Decide on important question types in advance
- What are the adverse effects of drug D?
- What is the prognosis for treatment T?
- Make use of MeSH categories
- Retain only those types of categories known to be
useful for this type of query.
50DynaCat Study
- Design
- Three queries
- 24 cancer patients
- Compared three interfaces
- ranked list, clusters, categories
- Results
- Participants strongly preferred categories
- Participants found more answers using categories
- Participants took same amount of time with all
three interfaces
51MultiTrees (Furnas and Zacks 94)
52Cat-a-Cone: Multiple Simultaneous Categories
- Key Ideas
- Separate documents from category labels
- Show both simultaneously
- Link the two for iterative feedback
- Distinguish between
- Searching for Documents vs.
- Searching for Categories
53Cat-a-Cone Interface
54Cat-a-Cone
- Catacomb
- (definition 2b, online Webster's)
- A complex set of interrelated things
- Makes use of earlier PARC work on 3D and animation
- Rooms (Henderson and Card 86)
- IV: Cone Tree (Robertson, Card, Mackinlay 93)
- Web Book (Card, Robertson, York 96)
55(Diagram labels: search, browse, query terms, Category Hierarchy, Collection, Retrieved Documents)
56ConeTree for Category Labels
- Browse/explore category hierarchy
- by search on label names
- by growing/shrinking subtrees
- by spinning subtrees
- Affordances
- learn meaning via ancestors, siblings
- disambiguate meanings
- all cats simultaneously viewable
57Virtual Book for Result Sets
- Categories on Page (Retrieved Document) linked to Categories in Tree
- Flipping through Book Pages causes some Subtrees to Expand and Contract
- Most Subtrees remain unchanged
- Book can be Stored for later Re-Use
58Improvements over Standard Category Interfaces
- Integrate category selection with viewing of categories
- Show all categories and context
- Show relationship of retrieved documents to the category structure
- But do users understand and like the 3D?
59The FLAMENCO Project
- Basic idea similar to Cat-a-Cone
- But use familiar HTML interaction to achieve similar goals
- Usability results are very strong for users who care about the collection.
60Co-Citation Analysis
- Has been around since the 50s. (Small, Garfield, White and McCain)
- Used to identify core sets of
- authors, journals, articles for particular fields
- Not for general search
- Main Idea
- Find pairs of papers that are cited together by third papers
- Look for commonalities
- A nice demonstration by Eugene Garfield at
- http://165.123.33.33/eugene_garfield/papers/mapsciworld.html
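Co-citation in the usual sense (two papers count as co-cited each time some third paper cites both) can be tallied with a short sketch like the one below. The input format and the paper identifiers are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citations):
    """citations maps each citing paper to the papers it cites.
    Returns a Counter over alphabetically ordered pairs of co-cited papers."""
    counts = Counter()
    for cited in citations.values():
        # Every pair in one reference list is co-cited once by that paper.
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical citing papers P1..P3 and cited papers A..C.
citations = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["B", "C"],
}
counts = cocitation_counts(citations)
print(counts.most_common())
```

Pairs with high counts are candidates for the "core sets" of a field; the counts can also serve as edge weights in a co-citation map.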
61Co-citation analysis (From Garfield 98)
62Co-citation analysis (From Garfield 98)
63Co-citation analysis (From Garfield 98)
64Query Specification
65Command-Based Query Specification
- command attribute value connector
- find pa shneiderman and tw user
- What are the attribute names?
- What are the command names?
- What are allowable values?
66Form-Based Query Specification (Altavista)
67Form-Based Query Specification (Melvyl)
68Form-based Query Specification (Infoseek)
69Direct Manipulation Spec.: VQUERY (Jones 98)
70Menu-based Query Specification (Young and Shneiderman 93)
71Context
72Putting Results in Context
- Visualizations of Query Term Distribution
- KWIC, TileBars, SeeSoft
- Visualizing Shared Subsets of Query Terms
- InfoCrystal, VIBE, Lattice Views
- Table of Contents as Context
- Superbook, Cha-Cha, DynaCat
- Organizing Results with Tables
- Envision, SenseMaker
- Using Hyperlinks
- WebCutter
73Putting Results in Context
- Interfaces should
- give hints about the roles terms play in the collection
- give hints about what will happen if various terms are combined
- show explicitly why documents are retrieved in response to the query
- summarize compactly the subset of interest
74KWIC (Keyword in Context)
- An old standard, ignored until recently by internet search engines
- used in some intranet engines, e.g., Cha-Cha
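A keyword-in-context listing is simple to produce. The sketch below is a minimal version that matches whole words case-insensitively and shows a fixed number of words on each side; the function name kwic and the width parameter are hypothetical.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = "The man walks the cavorting dog. The dog cavorts."
for left, kw, right in kwic(text, "dog", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} [{kw}] {right}")
```

Aligning the keyword in a column, as the print loop does, is what makes the surrounding context quick to scan.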
75Highlighting Keywords in Context
77Superbook (Remde et al. 89)
- Hyper-media software manual
- Functions
- Word Lookup
- Table of Contents: Dynamic fisheye view of the hierarchical topics list
- Page of Text: show selected page and highlighted search terms
- Hypertext features: linking through search words rather than page links
78Display of Retrieval Results
- Goal: minimize time/effort for deciding which documents to examine in detail
- Idea: show the roles of the query terms in the retrieved documents, making use of document structure
79TileBars
- Graphical Representation of Term Distribution and Overlap
- Simultaneously Indicate
- relative document length
- query term frequencies
- query term distributions
- query term overlap
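The data behind a TileBar-style display can be sketched as a small grid: one row per query term set, one column per text segment, and a count in each cell. The sketch below makes two simplifying assumptions: segments are fixed-size runs of words rather than topical passages, and query terms are single words. The function names and the character "shades" are illustrative only.

```python
import re

def tilebar(text, term_sets, segment_size=5):
    """Return a grid of hit counts: rows are term sets, columns are segments."""
    tokens = re.findall(r"\w+", text.lower())
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens), segment_size)]
    return [[sum(tok in terms for tok in seg) for seg in segments]
            for terms in term_sets]

SHADES = " .:#"  # stand-ins for gray levels: 0 hits, 1, 2, 3 or more

def render(grid):
    """Draw each row as a string, darker characters meaning more hits."""
    return ["".join(SHADES[min(c, len(SHADES) - 1)] for c in row)
            for row in grid]

doc = ("osteoporosis research shows bone loss can be slowed "
       "prevention starts early and prevention of bone loss matters")
grid = tilebar(doc, [{"osteoporosis", "bone"}, {"prevention", "cure"}],
               segment_size=4)
for row in render(grid):
    print("|" + row + "|")
```

The row length reflects document length, the cell values reflect term frequency and distribution, and a column that is non-empty in both rows marks a passage where the term sets overlap.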
82Exploiting Visual Properties
- Variation in gray scale saturation imposes a universal, perceptual order (Bertin et al. 83)
- Varying shades of gray show varying quantities better than color (Tufte 83)
- Differences in shading should align with the values being presented (Kosslyn et al. 83)
83Key Aspect: Faceted Queries
- Conjunct of disjuncts
- Each disjunct is a concept
- osteoporosis, bone loss
- prevention, cure
- research, Mayo clinic, study
- User does not have to specify which are main topics, which are subtopics
- Ranking algorithm gives higher weight to overlap of topics
- This kind of query works better at high-precision queries than similarity search (Hearst 95)
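A faceted query as a conjunct of disjuncts can be sketched as a list of term sets, where a document matches only if it hits every set. The ranking below, which sorts first by how many facets are hit and then by total hits, is an assumed stand-in for "higher weight to overlap of topics", not the ranking used in the cited work. Terms are single words for simplicity, and all names are hypothetical.

```python
def facet_hits(doc_tokens, facets):
    """Number of distinct terms from each facet that occur in the document."""
    tokens = set(doc_tokens)
    return [len(tokens & set(facet)) for facet in facets]

def matches(doc_tokens, facets):
    """Conjunct of disjuncts: at least one term from every facet."""
    return all(h > 0 for h in facet_hits(doc_tokens, facets))

def rank(docs, facets):
    """Order document names by (facets covered, total hits), best first."""
    def score(name):
        hits = facet_hits(docs[name], facets)
        return (sum(h > 0 for h in hits), sum(hits))
    return sorted(docs, key=score, reverse=True)

facets = [{"osteoporosis"}, {"prevention", "cure"}, {"research", "study"}]
docs = {
    "d1": "osteoporosis prevention study".split(),
    "d2": "osteoporosis osteoporosis research research".split(),
    "d3": "dog walks man".split(),
}
print(rank(docs, facets))
```

Here d2 mentions the query words more often than d1 does, yet d1 ranks first because it covers all three facets.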
84TileBars Summary
- Preliminary User Studies
- users understand them
- find them helpful in some situations, but probably slower than just reading titles
- sometimes terms need to be disambiguated
85More Recent Attempts
- Analyzing retrieval results
- KartOO http://www.kartoo.com/
- Grokker http://www.groxis.com/service/grok
90Query Term Subsets
- Show which subsets of query terms occur in which subsets of retrieved documents
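The grouping behind such displays can be sketched by bucketing each retrieved document under the exact subset of query terms it contains. The document names and contents below are made up, and group_by_term_subset is a hypothetical helper.

```python
from collections import defaultdict

def group_by_term_subset(docs, query_terms):
    """Map each subset of query terms to the documents containing exactly
    those query terms. docs maps a document name to its set of words."""
    groups = defaultdict(list)
    for name, tokens in docs.items():
        present = frozenset(t for t in query_terms if t in tokens)
        groups[present].append(name)
    return dict(groups)

docs = {
    "d1": {"dog", "man", "walks"},
    "d2": {"dog", "cavorts"},
    "d3": {"man", "walks"},
    "d4": {"pharaoh"},
}
groups = group_by_term_subset(docs, ["dog", "man"])
for subset, names in groups.items():
    print(sorted(subset), names)
```

With n query terms there are up to 2^n buckets, so the number of regions a display must show grows quickly as terms are added.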
91Term Occurrences in Results Sets
- Show how often each query term occurs in retrieved documents
- VIBE (Korfhage 91)
- InfoCrystal (Spoerri 94)
- Problems
- can't see overlap of terms within docs
- quantities not represented graphically
- more than 4 terms hard to handle
- no help in selecting terms to begin with
92InfoCrystal (Spoerri 94)
93VIBE (Olson et al. 93, Korfhage 93)
94Term Occurrences in Results Sets
- Problems
- can't see overlap of terms within docs
- quantities not represented graphically
- more than 4 terms hard to handle
- no help in selecting terms to begin with
95DLITE (Cousins 97)
- Supporting the Information Seeking Process
- UI to a digital library
- Direct manipulation interface
- Workcenter approach
- experts create workcenters
- lots of tools for one task
- contents persistent
96DLITE (Cousins 97)
- Drag and Drop interface
- Reify queries, sources, retrieval results
- Animation to keep track of activity
97IR Infovis Meta-Analysis (Chen and Yu 00)
- Goal
- Find invariant underlying relations suggested collectively by empirical findings from many different studies
- Procedure
- Examine the literature of empirical infoviz studies
- 35 studies between 1991 and 2000
- 27 focused on information retrieval tasks
- But due to wide differences in the conduct of the studies and the reporting of statistics, could use only 6 studies
98IR Infovis Meta-Analysis (Chen and Yu 00)
- Conclusions
- IR Infoviz studies not reported in a standard format
- Individual cognitive differences had the largest effect
- Especially on accuracy
- Somewhat on efficiency
- Holding cognitive abilities constant, users did better with simpler visual-spatial interfaces
- The combined effect of visualization is not statistically significant
- Misc
- Tilebars and Scatter/Gather are well-known enough to not require citations!!
99Summary: Search and Doc Viz
- Visualization still has yet to prove its usefulness for search and documents
- Needs to integrate with more accurate dialogue systems