Title: Structural Web Search Using a Graph-Based Discovery System
1. Structural Web Search Using a Graph-Based Discovery System
- Nitish Manocha, Diane J. Cook, and Lawrence B. Holder
- University of Texas at Arlington
- cook_at_cse.uta.edu
- http://www-cse.uta.edu/cook
2. Structured Web Search
- Existing search engines use linear feature match
- Web contains structural information as well
- Hyperlink information
- Web viewed as a graph (Kleinberg)
- Subdue searches based on structure
- Use as foundation of a structural search engine
- Incorporation of WordNet allows for synonym match
3. SUBDUE
- Discovers structural patterns in input graphs
- A substructure is a connected subgraph
- An instance of a substructure is a subgraph that is isomorphic to the substructure definition
- Pattern discovery, classification, clustering
[Figure: Input Database, Substructure S1 (graph form), and Compressed Database. Labels shown: triangle, square, object, shape, on, C1, R1, S1.]
4. Subdue Algorithm
- Start with individual vertices
- Keep only best substructures on queue
- Expand substructure by adding edge/vertex
- Compress graph and repeat to generate hierarchical description
- Optional use of background knowledge
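The first three steps above can be illustrated with a deliberately simplified sketch. It is not the Subdue implementation. It rests on several assumptions: instances are grouped by the sorted list of their labeled edges rather than by a true isomorphism test, the score is simply (number of instances) x (number of edges) rather than Subdue's own evaluation measure, and the compression step and background knowledge are left out. The names `discover`, `pattern_key`, `beam_width`, and `max_edges` are hypothetical.

```python
from collections import defaultdict

def pattern_key(vertices, edges, inst_v, inst_e):
    """Approximate signature of an instance: its sorted labeled edges,
    or the vertex label when the instance is a single vertex."""
    if not inst_e:
        (v,) = inst_v
        return ("v", vertices[v])
    return tuple(sorted((vertices[edges[i][0]], edges[i][2], vertices[edges[i][1]])
                        for i in inst_e))

def discover(vertices, edges, beam_width=3, max_edges=2):
    """vertices: {id: label}; edges: list of (src, dst, label)."""
    # Start with individual vertices, grouped by label.
    groups = defaultdict(set)
    for v in vertices:
        inst = (frozenset([v]), frozenset())
        groups[pattern_key(vertices, edges, *inst)].add(inst)

    def score(item):
        key, insts = item
        return len(insts) * len(key)  # assumed stand-in for Subdue's evaluation

    best = None
    for _ in range(max_edges):
        # Expand every instance by one adjacent edge (and its far vertex).
        expanded = defaultdict(set)
        for insts in groups.values():
            for inst_v, inst_e in insts:
                for i, (u, w, _label) in enumerate(edges):
                    if i not in inst_e and (u in inst_v or w in inst_v):
                        new = (inst_v | {u, w}, inst_e | {i})
                        expanded[pattern_key(vertices, edges, *new)].add(new)
        if not expanded:
            break
        # Keep only the best substructures on the queue.
        ranked = sorted(expanded.items(), key=lambda it: (-score(it), it[0]))
        groups = dict(ranked[:beam_width])
        if best is None or score(ranked[0]) > score(best):
            best = ranked[0]
    return best

# Toy input: two "triangle on square" pairs plus one extra circle.
vertices = {1: "triangle", 2: "square", 3: "triangle", 4: "square", 5: "circle"}
edges = [(1, 2, "on"), (3, 4, "on"), (5, 1, "on")]
best_key, best_instances = discover(vertices, edges)
```

On this toy input the repeated "triangle on square" edge outscores the one-off two-edge pattern, which is the kind of pattern a compression step would then replace with a single vertex.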
5. Inexact Graph Match
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph to make it isomorphic to another
- Match if cost/size < threshold
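A minimal sketch of the thresholded test, under stated assumptions: every vertex relabel and every edge insertion or deletion costs 1, both graphs have the same number of vertices, the best vertex mapping is found by brute force, and "size" is taken as vertices plus edges of the larger graph. Subdue's own matcher and cost settings are more elaborate; the function names here are hypothetical.

```python
from itertools import permutations

def transform_cost(g1, g2):
    """Minimum number of unit-cost edits that make g1 isomorphic to g2.
    Each graph is (labels, edges): labels maps vertex -> label and
    edges is a set of (src, dst, label)."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    ids1, ids2 = sorted(labels1), sorted(labels2)
    if len(ids1) != len(ids2):
        raise ValueError("this sketch only compares graphs with equal vertex counts")
    best = None
    for perm in permutations(ids2):
        mapping = dict(zip(ids1, perm))
        # One edit per vertex whose label must change.
        cost = sum(labels1[v] != labels2[mapping[v]] for v in ids1)
        # One edit per edge that must be deleted or inserted.
        mapped = {(mapping[s], mapping[d], lab) for s, d, lab in edges1}
        cost += len(mapped ^ edges2)
        if best is None or cost < best:
            best = cost
    return best

def graph_size(g):
    labels, edges = g
    return len(labels) + len(edges)

def inexact_match(g1, g2, threshold):
    """Match if cost/size < threshold (size of the larger graph is an assumption)."""
    size = max(graph_size(g1), graph_size(g2))
    return transform_cost(g1, g2) / size < threshold

shape = {(1, 2, "hyperlink"), (2, 3, "word")}
query = ({1: "page", 2: "page", 3: "jobs"}, shape)
found = ({1: "page", 2: "page", 3: "work"}, shape)
```

Here the two graphs differ by a single vertex label, so the cost is 1 against a size of 5; they match at a threshold of 0.25 but not at 0.1.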
6. Application Domains
- Protein data
- Human Genome DNA data
- Spatial-temporal domains
- Earthquake data
- Aircraft Safety and Reporting System
- Telecommunications data
- Program source code
- Web data
7. Represent Web as Graph
- Breadth-first search of domain to generate graph
- Nodes represent pages / documents
- Edges represent hyperlinks
- Additional nodes represent document keywords
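[Figure: example web graph. Page vertices are joined by hyperlink edges and linked by word edges to keyword vertices such as subdue, texas, projects, university, work, parallel, group, learning, robotics, planning.]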
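The graph construction described on this slide might look like the following sketch. It assumes a caller-supplied `fetch(url)` that returns `(text, outgoing_links)` or `None` for pages outside the domain, a tiny illustrative stopword set, and one keyword vertex per word per page. None of these details come from the slides, and the example URLs are placeholders.

```python
from collections import deque

STOPWORDS = {"the", "a", "of", "and", "at"}  # illustrative, not the authors' list

def build_web_graph(start_url, fetch, max_pages=100):
    """Breadth-first crawl that returns (vertices, edges, page_id)."""
    vertices = {1: "page"}   # vertex id -> label ("page" or a keyword)
    edges = []               # (src id, dst id, label)
    page_id = {start_url: 1}
    queue = deque([start_url])
    next_id = 2
    while queue:
        url = queue.popleft()
        result = fetch(url)
        if result is None:          # unreachable or out-of-domain page
            continue
        text, links = result
        # Additional nodes represent document keywords.
        for word in sorted(set(text.lower().split()) - STOPWORDS):
            vertices[next_id] = word
            edges.append((page_id[url], next_id, "word"))
            next_id += 1
        # Edges represent hyperlinks; unseen targets join the BFS queue.
        for link in links:
            if link not in page_id:
                if len(page_id) >= max_pages:
                    continue
                page_id[link] = next_id
                vertices[next_id] = "page"
                next_id += 1
                queue.append(link)
            edges.append((page_id[url], page_id[link], "hyperlink"))
    return vertices, edges, page_id

# In-memory stand-in for a two-page site, so the sketch runs without a network.
site = {
    "http://example.edu/": ("Texas university", ["http://example.edu/projects.html"]),
    "http://example.edu/projects.html": ("Subdue projects", ["http://example.edu/"]),
}
vertices, edges, page_id = build_web_graph("http://example.edu/", site.get)
```

Passing `fetch` as a parameter keeps the crawl logic separate from HTTP and HTML parsing, which would be needed in a real crawler.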
8. WebSubdue's Structural Search
- Formulate query as graph
- Use Subdue's predefined substructure option to search for instances of the query
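As an illustration of searching for instances of a query graph, here is a small backtracking matcher. It is a sketch under assumptions, not Subdue's predefined substructure code: an instance is taken to be a one-to-one, label-preserving vertex mapping under which every query edge exists in the data graph. The name `find_instances` and the toy graph are hypothetical.

```python
def find_instances(query, graph):
    """Yield mappings {query vertex -> graph vertex} for every exact instance.
    Both graphs are (labels, edges) with edges as a set of (src, dst, label)."""
    q_labels, q_edges = query
    g_labels, g_edges = graph
    order = sorted(q_labels)

    def consistent(mapping):
        # Every query edge whose endpoints are both mapped must exist in the graph.
        return all((mapping[s], mapping[d], lab) in g_edges
                   for s, d, lab in q_edges
                   if s in mapping and d in mapping)

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        qv = order[len(mapping)]
        for gv, label in g_labels.items():
            if label == q_labels[qv] and gv not in mapping.values():
                mapping[qv] = gv
                if consistent(mapping):
                    yield from extend(mapping)
                del mapping[qv]

    yield from extend({})

# Query: a page that links to a page containing the word "subdue".
query = ({1: "page", 2: "page", 3: "subdue"},
         {(1, 2, "hyperlink"), (2, 3, "word")})
graph = ({1: "page", 2: "texas", 4: "page", 6: "subdue", 7: "page"},
         {(1, 2, "word"), (1, 4, "hyperlink"), (4, 6, "word"),
          (4, 1, "hyperlink"), (7, 4, "hyperlink")})
linking_pages = sorted(m[1] for m in find_instances(query, graph))
```

In the toy graph, pages 1 and 7 both link to page 4, which carries the keyword, so both are reported as linking pages.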
9. Query: Find all pages which link to a page containing the term "Subdue"
- Subgraph vertices
- 1 page
- URL http://cygnus.uta.edu
- 7 page
- URL http://cygnus.uta.edu/projects.html
- Subdue
- 1->7 hyperlink
- 7->8 word
[Figure: query graph: page -hyperlink-> page -word-> Subdue]
/ Vertex ID Label /
s
v 1 page
v 2 page
v 3 Subdue
/ Edge Vertex 1 Vertex 2 Label /
d 1 2 hyperlink
d 2 3 word
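The `v` and `d` records above can be generated from a query held in memory. The sketch below only reproduces the line shapes visible on this slide (`s`, then `v <id> <label>`, then `d <src> <dst> <label>`). The helper name `to_subdue_lines` is hypothetical, and the full Subdue file format may include details not shown here.

```python
def to_subdue_lines(vertices, edges, header="s"):
    """vertices: list of (id, label); edges: list of (src, dst, label).
    Returns the text lines in the order shown on the slide."""
    lines = [header] if header else []
    lines += [f"v {vid} {label}" for vid, label in vertices]
    lines += [f"d {src} {dst} {label}" for src, dst, label in edges]
    return lines

query_lines = to_subdue_lines(
    vertices=[(1, "page"), (2, "page"), (3, "Subdue")],
    edges=[(1, 2, "hyperlink"), (2, 3, "word")],
)
print("\n".join(query_lines))
```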
10. Search for Presentation Pages
[Figure: query graph of four page vertices connected by five hyperlink edges]
- AltaVista
- Query: host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif
- 12 instances
11. Search for Reference Pages
[Figure: query graph of page vertices connected by hyperlink edges]
- Search for pages with at least 35 in-links
- WebSubdue found 5 pages in www-cse
- AltaVista cannot perform this type of search
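WebSubdue poses this search as a graph query. As a rough cross-check on the same page/hyperlink representation, the in-link condition can also be tested by counting incoming hyperlink edges directly. This sketch assumes edges are `(src, dst, label)` tuples and counts each distinct linking page once; the function name is hypothetical.

```python
from collections import Counter

def pages_with_min_in_links(edges, min_in_links):
    """Return sorted page ids that have at least min_in_links distinct
    pages pointing to them through "hyperlink" edges."""
    links = {(src, dst) for src, dst, label in edges
             if label == "hyperlink" and src != dst}
    in_degree = Counter(dst for _src, dst in links)
    return sorted(page for page, n in in_degree.items() if n >= min_in_links)

# Toy graph: page 4 is linked from pages 1, 2, and 3 (the duplicate is ignored).
toy_edges = [(1, 4, "hyperlink"), (2, 4, "hyperlink"), (3, 4, "hyperlink"),
             (4, 1, "hyperlink"), (1, 4, "hyperlink"), (4, 9, "word")]
reference_pages = pages_with_min_in_links(toy_edges, 3)
```

With the slide's setting the call would use `min_in_links=35`.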
12. Inclusion of WordNet
- When generating graph
- Use common stopword list
- When searching for subgraph instances
- Morphology functions
- October -> Oct
- teaching -> teach
- Synsets
- Optional allowance of synonyms
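The three steps above (stopword removal, morphological reduction, optional synonym matching) can be sketched with small hand-written tables standing in for WordNet. The tables, the `jobs` entry, and the function names are illustrative assumptions; a real system would query WordNet rather than these dictionaries.

```python
STOPWORDS = {"the", "of", "in", "and", "a"}               # illustrative only
BASE_FORMS = {"october": "oct", "teaching": "teach",      # the slide's two examples
              "jobs": "job"}                              # assumed extra entry
SYNONYMS = {"job": {"employment", "work", "task"}}        # toy synset table

def normalize(word):
    """Lower-case a word and reduce it to its base form when one is known."""
    w = word.lower()
    return BASE_FORMS.get(w, w)

def keywords(text):
    """Keyword labels for a page: stopwords dropped, remaining words normalized."""
    return [normalize(w) for w in text.split() if w.lower() not in STOPWORDS]

def labels_match(query_word, page_word, allow_synonyms=False):
    """Compare two word labels, optionally accepting one level of synonyms."""
    q, p = normalize(query_word), normalize(page_word)
    if q == p:
        return True
    return allow_synonyms and (p in SYNONYMS.get(q, set())
                               or q in SYNONYMS.get(p, set()))

page_keywords = keywords("Teaching in October")
```

Keeping the synonym check behind a flag mirrors the "optional allowance" on the slide: exact search stays strict, and synonym matching is an explicit choice per query.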
13. Search for pages on jobs in computer science
- Inexact match: allow one level of synonyms
- WebSubdue found 33 matches
- Words include employment, work, job, problem, task
- AltaVista found 2 matches
[Figure: query graph: a page vertex with word edges to jobs, computer, and science]
14. Search for hub and authority pages
- WebSubdue found 3 hub (and 3 authority) pages
- AltaVista cannot perform this type of search
- Inexact match applied with threshold 0.2 (4.2 transformations allowed)
- WebSubdue found 13 matches
15. Subdue Learning from Web Data
- Distinguish professors' and students' web pages
- Learned concept (professors have box in address field)
- Distinguish online stores' and professors' web pages
- Learned concept (stores have more levels in graph)
[Figure: graph of page vertices]
16. Conclusions
- WebSubdue can be used to search for structural web data
- Could be enhanced with additional WordNet features such as synset path length
- Efficient structural search necessary for future of web search tools
17. To Learn More
cygnus.uta.edu/subdue
cook_at_cse.uta.edu
http://www-cse.uta.edu/cook