Title: Visualization of Web Search Results
1. Visualization of Web Search Results
- Zhigang Li
- A Web Technology Presentation, Oct 2001, UTD
2. Outline
- The Challenge: An Exploding Web
- Web Directories
  - Examples and demos
  - Advanced presentation techniques
- Search Engines
  - Spiders
  - Page relevance and ranking
  - Metasearch
  - Major tools
- Visualizing Search Results
  - Factors for choosing visualizations
  - Synchronized alternative visualization
  - Examples and demos
3. Types of Search Services
- Subject Directory
- Search Engine
- Metasearch
Information Visualization (IV)
- The use of computer-supported, interactive, visual representations of abstract data to amplify cognition.
- IV joins the human's capacity for visual thinking with the computer's capacity for analytical computing, thereby building a bidirectional visual and interactive interface between the human user and the information resources.
4. The Big Picture: An Exploding Web
- 9.6 million web servers as of Dec 1999
- 72.4 million web sites as of Jan 2000
- 275 million people online as of Mar 2000
- 800 million publicly indexable pages
- 180 million images
- 30% of web pages are copied or mirrored
- 1 billion hyperlinks
5. The Challenge
- Huge number
  - No single search engine indexes more than 16% of web sites
  - All search engines combined cover only 42%
- Extreme heterogeneity
  - Variable information value
  - Variable length
  - Often containing grammatical mistakes and typos
  - Content may be outdated, false, or unreliable
  - Multiple data formats
  - Multiple languages and alphabets
- Need for speed
  - 15,000 to 20,000 search queries requested per minute
6. History
- Archie (1990)
  - First Internet search engine
  - Directory service for anonymous FTP files
- WWW Wanderer (1993)
  - First to gather the Internet's content
- Aliweb (1993)
  - Indexes web pages on a server, which must register with Aliweb
  - Simple retrieval program to search the collected indices
- JumpStation (1993)
  - Used a web robot (spider) to gather information
  - Used exhaustive match to retrieve pages
- RBSE Spider (1993)
  - First to implement ranked-relevance retrieval
  - Used WAIS (Wide Area Information Servers)
7. Subject Trees (Directories)
Small coverage of the web, high quality of web
links
8. Subject Trees: Inadequate UI
- How to find Yahoo.com from Yahoo's listing
  - Visit http://www.yahoo.com
  - Select "Computers and Internet" from the 14 main categories
  - Select "Internet" from the 24 subcategories
  - Select "World Wide Web" from the 44 sub-subcategories
  - Select "Searching the Web" from the 36 sub-sub-subcategories
  - Select "Search Engines and Directories" from the 9 subcategories
  - Find Yahoo there (out of 200 peers)
- Problems
  - Users can't view the whole information structure
  - Navigation is a slow process
  - Users easily get lost
9. New Interfaces: Hyperbolic Tree
- Designed for exploring hierarchical data
- Focus + context
- Intuitive and interactive
- A Xerox PARC invention
10. New Interfaces: Web Map
- Multi-level visual directory of 2 million web sites
- Hierarchical categories are represented by irregular polygons
- Zooming brings more details into view
- Websites shown as symbols
- The closer any two documents or categories are in terms of content, the closer they appear on the map
- Topography represents rating
11. New Interfaces: Mapuccino
- Multiple layout schemes for the tree structure, including fisheye
- Nodes represent HTML pages
- An IBM product
- Zooming / panning
12. Search Engines
Databases of Internet content
13. Sizes of Search Engines
Legend: GG = Google, FAST = FAST, AV = AltaVista, INK = Inktomi, WT = WebTop.com, NL = Northern Light, EX = Excite
14. Spiders for Search Engines
- Where to explore next?
  - Depth-first: high load on servers
  - Breadth-first: favors smaller web servers
  - Best-first: based on a popularity heuristic
- What information to keep?
  - Titles/headers vs. whole document
  - Manual description vs. automated abstracts
[Diagram: crawler loop. Create a queue of pages to be explored; choose a page; fetch the page content and extract all links; add the links to the queue; process the page to extract information; store it in the database.]
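The crawler loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FAKE_WEB` is a hypothetical in-memory stand-in for the network, the regular expressions are a crude substitute for a real HTML parser, and the function name `crawl` is invented here. The spider keeps titles only, one of the "what to keep" options listed above.

```python
from collections import deque
import re

# Hypothetical in-memory "web" (URL -> HTML); a real spider would fetch over HTTP.
FAKE_WEB = {
    "http://a.example/": '<title>A</title><a href="http://b.example/">b</a>'
                         '<a href="http://c.example/">c</a>',
    "http://b.example/": '<title>B</title><a href="http://c.example/">c</a>',
    "http://c.example/": '<title>C</title><a href="http://a.example/">a</a>',
}

LINK_RE = re.compile(r'href="([^"]+)"')
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)


def crawl(seed, fetch=FAKE_WEB.get, max_pages=100):
    """Breadth-first crawl; returns {url: title} in visiting order."""
    queue = deque([seed])      # queue of pages to be explored
    seen = {seed}              # never enqueue the same URL twice
    database = {}              # what the spider keeps (titles only here)
    while queue and len(database) < max_pages:
        url = queue.popleft()  # popleft() is breadth-first; pop() would be depth-first
        html = fetch(url)
        if html is None:       # dead link: nothing to process
            continue
        match = TITLE_RE.search(html)
        database[url] = match.group(1) if match else ""
        for link in LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database
```

A best-first spider would replace the deque with a priority queue ordered by whatever popularity heuristic is chosen.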
15. Page Relevance/Ranking
- A common complaint: they return too many pages (the search engines didn't rank the pages very well)
- Google uses PageRank, based on the linkage structure of the Internet
- DirectHit uses popularity data (number of visitors of a specific link)
- More and more search engines are providing rankings based on comprehensive analysis.
- Some are offering advanced features like
16. PageRank Algorithm of Google
- PageRank is computed from:
  - dl: number of pages pointing to the document
  - nl: number of outgoing links
  - a: damping factor
- PageRank is the probability of a random user visiting a page.
- The damping factor is the probability that the user gets bored at that page and requests another random page.
[Diagram: a document labeled with dl and nl]
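The formula itself did not survive on this slide, so the following Python sketch uses the common normalized random-surfer formulation as an assumption, not necessarily the exact form the slide showed: each page receives (1 - follow)/N plus `follow` times the rank of every page linking to it, divided by that page's number of outgoing links. The names `pagerank`, `follow`, and `iterations` are illustrative. In the slide's wording, the chance of getting bored and jumping to a random page corresponds to `1 - follow` here.

```python
def pagerank(links, follow=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}; ranks sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - follow) / n + follow * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each outgoing link passes on an equal share of the page's rank.
                new_rank[target] += follow * rank[page] / len(targets)
        rank = new_rank
    return rank
```

For a three-page graph such as `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, page c ends up ranked highest because it is pointed to by both other pages.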
17. Stockpiling and Retrieval
- Databases must
  - Allow efficient insertion of new documents
  - Allow efficient update of documents when the spider revisits a page
  - Allow random access to records required during the retrieval phase
  - Be efficient in terms of storage space
- Retrieval
  - Boolean keyword query
  - Regular expression matching
  - No relevance order: results are presented in database order
  - Scanning the entire database is computationally expensive
  - Vector space (statistical) retrieval
  - Inverted file indexing: unique word → list of documents (and positions)
  - Relevance: frequency, proximity, position
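As a minimal sketch of inverted file indexing, the Python below maps each unique word to the documents and positions where it occurs, then answers a Boolean AND query ranked by term frequency alone. The function names `build_index` and `search` and the scoring rule are assumptions made for illustration; proximity and position, also named on the slide, are not scored here.

```python
from collections import defaultdict
import re


def build_index(docs):
    """docs: {doc_id: text}. Returns {word: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def search(index, query):
    """Boolean AND query, ranked by total term frequency (highest first)."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    postings = [index.get(w, {}) for w in words]
    # A document is a hit only if every query word occurs in it.
    hits = set(postings[0]).intersection(*postings[1:])

    def score(doc_id):
        return sum(len(p[doc_id]) for p in postings)

    return sorted(hits, key=lambda d: (-score(d), d))
```

Because lookups go through the index, a query touches only the posting lists of its own words rather than scanning the entire database.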
18. Metasearch: Searching the Search Engines
Accessing various databases from the web
19. MetaCrawler Softbot
- Design Issues
  - Provides a single unified interface
  - Performs tasks as quickly as possible
  - Adapts to a rapidly changing environment
- Interacting with Search Engines
  - Must formulate the query formats
  - Must understand query results
- Preprocess References
  - Download pages for analysis
- Results collation
  - Domain, path, title comparison
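Results collation by domain, path, and title comparison might look like the Python sketch below. The normalization rules (dropping a leading "www.", ignoring a trailing slash, case-folding the title) and the names `reference_key` and `collate` are assumptions for illustration, not MetaCrawler's documented behavior.

```python
from urllib.parse import urlsplit


def reference_key(url, title):
    """Normalize a hit to (domain, path, title) for duplicate detection."""
    parts = urlsplit(url)
    domain = (parts.hostname or "").lower()
    if domain.startswith("www."):
        domain = domain[4:]
    path = parts.path.rstrip("/") or "/"
    return domain, path, " ".join(title.lower().split())


def collate(result_lists):
    """Merge per-engine lists of (url, title); keep the first copy of each duplicate."""
    merged, seen = [], set()
    for results in result_lists:
        for url, title in results:
            key = reference_key(url, title)
            if key not in seen:
                seen.add(key)
                merged.append((url, title))
    return merged
```

Two engines returning `http://www.example.org/` and `http://example.org` with the same title would therefore contribute a single entry to the merged list.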
20. Metasearch: Strengths and Weaknesses
- Advantages
  - Ability to combine results of multiple search engines
  - Ability to provide a consistent user interface
- Deficiencies
  - Difficulty in ranking the list of results
  - Limited coverage, poor precision
  - Subject to outdated database info in major search engines
21. Inquirus and Specific Expressive Forms
- Inquirus is a metasearch engine from NEC
- How Inquirus overcomes the deficiencies of metasearching
  - Downloads and analyzes the individual documents, rather than working with the list of summaries returned by search engines
  - Identifies pages that no longer exist or no longer contain the queried terms
  - Generates more useful summaries
  - Improves document ranking using proximity information
  - Shows local context of query terms
- Specific Expressive Forms (SEF)
  - A technique that transforms queries into specific forms
  - Used by Inquirus
  - Example: "What does NASDAQ stand for?" → "NASDAQ stands for", "NASDAQ is an abbreviation", "NASDAQ means"
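The NASDAQ example can be reproduced with a small rule table, sketched here in Python. The single rule in `SEF_RULES` and the function name `expressive_forms` are hypothetical; the slide does not say how Inquirus stores or matches its rules.

```python
import re

# Hypothetical rewrite rule: question pattern -> phrase templates.
SEF_RULES = [
    (re.compile(r"^what does (?P<x>.+?) stand for\??$", re.I),
     ["{x} stands for", "{x} is an abbreviation", "{x} means"]),
]


def expressive_forms(query):
    """Return phrase queries for a question, or the query itself if no rule matches."""
    text = query.strip()
    for pattern, templates in SEF_RULES:
        match = pattern.match(text)
        if match:
            return [t.format(x=match.group("x")) for t in templates]
    return [text]
```

Each returned phrase would then be sent to the underlying search engines as an exact-phrase query, so pages that literally answer the question tend to surface first.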
22. ResearchIndex.org from NEC
- Autonomous citation indexing
- Provides reference linking
- Shows citation context
- Lists related and similar documents
- Query-sensitive summaries
- Page images, PS, PDF
- Autonomous location of articles
23. New Search Tools: Vivisimo
- Information clustering: Vivisimo is a metasearch engine that categorizes summaries returned by other search engines and groups pages accordingly.
- A hierarchy of categories is provided automatically.
24. More Search Tools
- Features results ranking
- Understands natural language and boolean query
- Results clustering and ranking
- Understands boolean search query
SearchServer
- Comprehensive, w/ ranking, but slow
- Understands natural language query
- Can use comprehensive boolean queries
- Results integrated and ranked
25. Results Visualization
- Phase 1, Formulation: expressing the search
- Phase 2, Initiation of action: launching the search
- Phase 3, Review of results: reading messages and outcomes
  - Set level: representation of the whole set
  - Web site level: the structure of a website
  - Document level: specific URLs
- Phase 4, Refinement: formulating the next step
26. Factors for Choosing Visualizations
- 4-T environment
  - Target user group
  - Type and number of data
  - Task to be done
  - Technical possibilities
- There is no single best visualization for all use cases
- Synchronized alternative visualizations are encouraged
[Figure: visualizations by increasing level of detail: vector, scatter plot, bar graph, list, tilebars, relevance curve, thumbnails]
27. Synchronized Alternative Visualizations
- Scatter Plot and Document Vector
  - Scatter plot shows document clusters
  - Document vector plots list them in 1-D
  - User can highlight data points
28. Synchronized Alternative Visualizations
- Bar Graphs
  - Bar graph shows the relevance for each keyword
  - List can be ordered by a column
- Tilebars and Relevance Curves
  - Each row in the tilebar stands for one keyword
  - Length of tilebar stands for document size
  - Darkness of tile stands for relevance
  - Relevance curve plots the same information
[Figure: tilebars and relevance curve]
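To make the tilebar encoding concrete, here is a rough text-mode sketch in Python: one row per keyword, one tile per fixed-size segment of the document (so row length tracks document size), and a darker character for a higher keyword count in that segment. The segment size, the `SHADES` character ramp, and the function name `tilebar` are assumptions for illustration, and raw term counts stand in for the relevance measure.

```python
import re

SHADES = " .:#"  # lightest to darkest; the index grows with term frequency


def tilebar(text, keywords, segment_words=5):
    """Return one text row per keyword, with one tile per document segment."""
    words = re.findall(r"\w+", text.lower())
    segments = [words[i:i + segment_words]
                for i in range(0, len(words), segment_words)]
    rows = []
    for keyword in keywords:
        counts = [segment.count(keyword.lower()) for segment in segments]
        # Clamp counts so very frequent terms still map to the darkest tile.
        rows.append("".join(SHADES[min(c, len(SHADES) - 1)] for c in counts))
    return rows
```

Reading the rows side by side shows where in the document the query terms co-occur, which is the pattern information a single relevance score hides.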
29. Result Display: Current Practice
- Ranked list of titles
- Number of hits of each term
- Highlighted digest
- Inter-document similarity
[Screenshot: example result list in which each hit shows a title, a text digest, a score, its URL, and a "More Like This" link]
30. Gallery: Envision (Matrix of Icons)
- Sorted by index terms
- Ranked by relevance
- Color, icon size, and icon shape carry different information
- User rating supported
31. Gallery: Lighthouse (Flying Stars)
- Ranked by relevance, distributed according to similarity
- Animated rotation of the cluster reveals all spheres
32. Gallery: MarketMap
- Trade companies grouped into sectors
- Neighboring stocks have historically similar movements
- Color and brightness indicate price changes
- Size corresponds to market capitalization
- Headliners, gainers, and losers just a click away
33. Gallery: TileBar Patterns
- Length indicates document size
- Shade of tile means frequency of queried terms
- Distribution shown by the tile pattern