Title: ?? ???(sw@cs.ccu.edu.tw)
1??????
- ?????????????
- ?? ???(sw_at_cs.ccu.edu.tw)
2What is a search engine?
- A web service site where Internet users find
information in the Internet cyberspace
- The software that provides the web search service
3Use of search engines?
- Search for the url of a company/website
- Look for the contact info of a person or an
organization
- Search for information related to a term, e.g., to
collect information about ?????
- Look for news regarding XXX
- Treat the search engine as a big dictionary
- ...
4Types of search engines
- Directory browse/search
- Web pages search
- USENET news search
- Ftp search
- People/organization search
- Daily-life information search
- Library search
- Commercial product search
5Example search engines
- Yahoo,
- Google,
- AltaVista,
- MSN,
- Excite,
- Lycos, ...
- YAM, Kimo, PCHome
- GAIS, Openfind, ...
- DejaNews,
- Archie, ...
6 Portal Services
- Directory / Search
- Daily information: weather, maps, TV, ...
- Free email, free web pages, calendar
- Personalized services, channel subscription
- Web Chat,
- E-Commerce,
- Content Aggregation
- ...
7Directory implementation
- Each url data is a record
- The url data is managed by a database system
- A search function is provided over the data in
the directory tree
8Directory implementation
- The search is generally for locating a website
or a category of websites.
- Data is entered through manual registration by
the website owner or a surfer
- Managing the directory tree requires intensive
manual work by people who are familiar with the
relevant domain knowledge
9 The Advantages/Disadvantages of Directory
search engine
- Advantages
- The data is manually maintained, and
thus contains less noise and is more precise.
- The search output can be categorized and can
be better organized
- Can support search within a category
10 The Advantages/Disadvantages of Directory
search engine
- Disadvantages
- The data coverage is limited, and sometimes the
wanted item cannot be found
- Does not support relevance ranking
- Labor intensive
11Implementation of Webpage search engine
- 1.Feature consideration
- 2.Data Gathering
- 3.Data Preprocessing
- 4.Data Indexing
- 5.Query Processing
- 6. Interaction
- 7.Service tools
- 8.Personalization
12Requirements for WebPage search engines
- 0. The quality of the search result in a search
engine basically depends on
- a. the quality of the underlying data
- b. the search techniques, such as the ranking technique
- 1. Data coverage should be large enough
- 2. Data needs to be filtered, such as removing
redundant pages
13Requirements for WebPage search engines
- 3. Full text search capability should be provided
- 4. Relevance Ranking mechanism should be provided
- 5. Search Speed should be fast enough
- 6. Search features
- I.e., the evaluation points are
- Quality, speed, scale, robustness, features
14Data Gatherer
- Also known as spider, crawler, robot, ...
- Periodically traverses the web space to collect web
pages
- Needs a list manager to decide which pages to
collect and when
- Needs a link analyzer to generate the new URL list
- Needs to decide what to collect and what not to.
15Data Gatherer
- A get-file function over the HTTP protocol is the
basic function
- A webpage parser module extracts link info
from a retrieved page
- A URL bank manager module manages the URLs to be
fetched.
- A robot-controller module manages the data
collection using multiple clients
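The modules listed above can be sketched in a few lines. This is only an illustration, not the implementation behind these slides: `LinkParser`, `extract_links`, `crawl`, and the in-memory `FAKE_WEB` table are hypothetical names, and the `fetch` callable stands in for the real HTTP get-file function so the sketch runs without network access. The robot-controller (multiple clients) is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Webpage parser: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed, fetch, limit=100):
    """URL bank manager: a FIFO queue plus a 'seen' set."""
    bank, seen, pages = deque([seed]), {seed}, {}
    while bank and len(pages) < limit:
        url = bank.popleft()
        html = fetch(url)  # get-file step (HTTP in a real robot)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                bank.append(link)
    return pages


# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html#top">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "no links here",
}
pages = crawl("http://example.com/", FAKE_WEB.get)
```

Because the URL bank is a FIFO queue, pages are visited in breadth-first order starting from the seed.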
16Issues of Robot
- Site Based vs URL based
- Site-based robots are popular, such as wget and Teleport
- robots.txt is easier to implement in a site-based
robot
- A URL-based robot is more appropriate for large-scale
search engines
- Retrieval schedule: BFS (breadth-first search) is better
- Incremental Retrieval
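One reason robots.txt is harder for a URL-based robot is that URLs from many sites are interleaved, so the rules for each site have to be cached and looked up per URL. A minimal sketch of such a cache, using Python's standard `urllib.robotparser`; the `RobotsCache` class, the `fetch_robots_txt` callable, and the agent name `demo-robot` are assumptions made for illustration:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Keeps one parsed robots.txt per site for a URL-based robot."""

    def __init__(self, fetch_robots_txt):
        self._fetch = fetch_robots_txt  # site -> robots.txt text
        self._rules = {}

    def allowed(self, agent, url):
        site = urlsplit(url).netloc
        if site not in self._rules:
            rules = RobotFileParser()
            rules.parse(self._fetch(site).splitlines())
            self._rules[site] = rules
        return self._rules[site].can_fetch(agent, url)


# Hypothetical fetcher: every site returns the same rules here.
fetch_count = []

def fake_fetch(site):
    fetch_count.append(site)
    return "User-agent: *\nDisallow: /private/\n"

cache = RobotsCache(fake_fetch)
ok = cache.allowed("demo-robot", "http://example.com/index.html")
blocked = cache.allowed("demo-robot", "http://example.com/private/x.html")
```

A site-based robot needs none of this bookkeeping: it reads robots.txt once when it starts on a site and finishes that site before moving on.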
17Robot Issues
- What to gather and what not to?
- Hidden web data collection
- Focused crawling
- targeting specialized content of web pages
- suitable for special search engines
- evaluated by precision and recall
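For reference, precision is the fraction of collected pages that are on-topic, and recall is the fraction of on-topic pages that were collected. A small sketch with a hypothetical helper `precision_recall` and made-up page IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of page IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The crawler collected 4 pages, 3 of them on-topic; 6 on-topic pages exist.
p, r = precision_recall(
    retrieved=["p1", "p2", "p3", "x1"],
    relevant=["p1", "p2", "p3", "p4", "p5", "p6"],
)
```

In this made-up case the precision is 3/4 and the recall is 3/6.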
18Data Preprocessing
- Remove redundant pages
- Transform the page into internal data format.
- Perform web cross-link analysis to generate a URL
databank.
- Filter the data to remove data that should not be
indexed
- Partition the data space
19Redundancy removal
- 15% to 20% of web pages are replicated on
different websites, e.g., tutorials for
Java, Perl, Python, ...
- Can be implemented by partitioned hashing or
external sorting
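A rough sketch of the partitioned-hashing variant follows. It assumes that "replicated" means identical text after whitespace and case normalization, which the slides do not specify; `fingerprint` and `remove_duplicates` are hypothetical names. Each page is reduced to a digest, digests are split into partitions so that each partition can be processed separately (for example, one at a time in memory), and only the first URL per digest is kept.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    # Assumption: pages count as replicas if they match after
    # collapsing whitespace and lowercasing.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def remove_duplicates(pages, partitions=4):
    """pages: iterable of (url, text). Returns the URLs to keep."""
    buckets = defaultdict(list)
    for url, text in pages:
        digest = fingerprint(text)
        # Equal digests always land in the same partition.
        buckets[int(digest, 16) % partitions].append((digest, url))
    kept = []
    for bucket in buckets.values():
        seen = set()
        for digest, url in bucket:
            if digest not in seen:
                seen.add(digest)
                kept.append(url)
    return kept


kept = remove_duplicates([
    ("http://a.example/java", "Java  Tutorial"),
    ("http://b.example/java", "java tutorial"),
    ("http://c.example/perl", "Perl tutorial"),
])
```

The external-sorting alternative would instead sort the (digest, URL) pairs on disk and drop adjacent pairs with equal digests.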
20Ranking the URLs
- Link analysis is done to count the mutual
references between web pages
- A URL receiving a higher number of references will
get a higher score
- weighted link
- discount internal links, such as "back to home"
- Order the web pages by score so that a
page with a higher rank gets a lower ID
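The scoring described above can be sketched as a weighted in-link count. The weight of 0.1 for internal links is an arbitrary value chosen for illustration, and `score_urls` and `assign_ids` are hypothetical names; the slides do not give the actual weights.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def score_urls(links, internal_weight=0.1):
    """links: iterable of (source_url, target_url) pairs."""
    scores = defaultdict(float)
    for source, target in links:
        scores[source] += 0.0  # make sure pages without in-links appear
        same_site = urlsplit(source).netloc == urlsplit(target).netloc
        # Internal links (e.g. "back to home") count far less.
        scores[target] += internal_weight if same_site else 1.0
    return dict(scores)


def assign_ids(scores):
    """Higher score gets a lower ID; ties are broken by URL."""
    ordered = sorted(scores, key=lambda url: (-scores[url], url))
    return {url: doc_id for doc_id, url in enumerate(ordered)}


links = [
    ("http://a.com/x", "http://b.com/"),
    ("http://c.com/", "http://b.com/"),
    ("http://b.com/p", "http://b.com/"),  # internal link, discounted
    ("http://a.com/x", "http://c.com/"),
]
ids = assign_ids(score_urls(links))
```

Giving better pages lower IDs means that index lists sorted by document ID are already in rank order, so a query can stop reading a list early.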
21Data Partition
- The data is partitioned by language type
- The language partition can be done as follows
- for each known language, collect a certain amount
of webpages in that language
- build a high-frequency term set for each language
from the analysis of the sample data
- determine the language type by term analysis
22Indexer
- In general, an inverted file is used to generate the
index
- Needs large data space for the indexing task.
- For each indexed term, an index list is generated
to record the files/locations in which the term
appears.
- Needs about the same space as the original
data, or more
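A minimal in-memory sketch of an inverted file, assuming whitespace tokenization and integer document IDs (both simplifications); `build_index` is a hypothetical name:

```python
from collections import defaultdict


def build_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, position), ...]}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id].lower().split()):
            # One posting per occurrence: which file, and where in it.
            index[term].append((doc_id, position))
    return dict(index)


index = build_index({
    1: "web search engine",
    2: "search the web",
})
```

Since every occurrence of every term produces a posting, it is easy to see why an uncompressed index needs about as much space as the original data.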
23Indexer - implementation issue
- A data filter module is used to cope with different
data sources
- The inversion module is the kernel module
- Needs to be scalable to handle continuously growing
data size.
- Hundreds of gigabytes
- Terabytes
- Distributed/Concurrent Indexing
24Indexer - implementation issue
- Temporary space minimization
- Index speed is crucial
- Memory can be utilized to improve the index
performance
- Hashing and sorting are the key!
25Query Processing
- Use a dictionary/stop-list to preprocess the query
string
- Parse the query into expressions of tokens
- Use the index structure to locate the matched
documents
- Use a TF-IDF type technique to score the matched
documents
- Combine URL scores to rank the result
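The pipeline above can be sketched end to end on a toy collection. Several things here are assumptions rather than the method used in these slides: the stop list, the particular TF-IDF formula (raw term frequency times `log(N/df) + 1`), the linear combination with `url_weight`, and the names `tokenize` and `search`. For brevity the sketch scans per-document term counts instead of reading a real inverted index.

```python
import math
from collections import Counter

STOP_LIST = {"the", "a", "of", "for"}  # hypothetical stop list


def tokenize(text):
    return [t for t in text.lower().split() if t not in STOP_LIST]


def search(query, docs, url_scores=None, url_weight=0.5):
    """docs: {url: text}. Returns matching URLs, best first."""
    url_scores = url_scores or {}
    doc_terms = {url: Counter(tokenize(text)) for url, text in docs.items()}
    n_docs = len(docs)
    results = {}
    for term in tokenize(query):
        matched = [url for url, tf in doc_terms.items() if term in tf]
        if not matched:
            continue
        idf = math.log(n_docs / len(matched)) + 1.0
        for url in matched:
            results[url] = results.get(url, 0.0) + doc_terms[url][term] * idf
    # Combine the text score with the link-based URL score.
    for url in results:
        results[url] += url_weight * url_scores.get(url, 0.0)
    return sorted(results, key=lambda url: (-results[url], url))


docs = {
    "u1": "search engine for the web",
    "u2": "engine repair manual",
    "u3": "cooking recipes",
}
by_text = search("the search engine", docs)
by_text_and_links = search("the search engine", docs, url_scores={"u2": 10.0})
```

With text scores alone `u1` ranks first, because it matches both query terms; a large enough URL score for `u2` reverses the order.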
26Search CGI programs
- search agent CGI
- parse the query and fork a searcher process to do
the search (or use IPC to query the searcher)
- when the searcher returns, analyze and process
the result for formatted output
- process the result and store it in a temporary result
store
- log the query and some status info
- cgi for view-next-page
- showmatch cgi
27Output control
- Site grouping
- group the pages from same website together
- Title grouping
- group the pages with similar title
- Sort the output according to certain criteria
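Site grouping, for example, can be sketched by keying each result on its host name. `group_by_site` is a hypothetical helper, and treating the URL's network location as "the website" is an assumption.

```python
from urllib.parse import urlsplit


def group_by_site(ranked_urls):
    """Group a ranked URL list by site, keeping rank order inside each group.

    Sites appear in the order of their best-ranked page.
    """
    groups = {}
    for url in ranked_urls:
        groups.setdefault(urlsplit(url).netloc, []).append(url)
    return list(groups.items())


grouped = group_by_site([
    "http://a.example/1",
    "http://b.example/x",
    "http://a.example/2",
])
```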
28Interaction
- Term Suggestion
- Related terms
- thesaurus
- term-expansion
- error correction
- phonetic
- spelling
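As one possible illustration of spelling correction, the sketch below suggests vocabulary terms that are close to a mistyped query term, using `difflib.get_close_matches` from the Python standard library. The vocabulary and the `suggest` wrapper are made up; a real engine would draw candidates from its index terms or query log, and the slides do not say which similarity measure is used.

```python
import difflib

# Hypothetical vocabulary, e.g. terms taken from the index.
VOCABULARY = ["search", "engine", "server", "research"]


def suggest(term, vocabulary=VOCABULARY, limit=3):
    """Return up to `limit` vocabulary terms similar to `term`, best first."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=limit, cutoff=0.6)


suggestions = suggest("serch")
```

The best suggestion for the mistyped "serch" is "search".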
29Personalization
- Keep track of a user's interests so that the
search result can be tuned to improve the
user's satisfaction
- Query tracking and classification
30Service tools
- Query cache to improve the performance of the
search for queries that have already been served.
- Use a memory cache file system to reduce the disk
access overhead
- Mechanism for special case handling
- Log analyzer
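A query cache can be sketched as a small least-recently-used (LRU) map in front of the searcher. The LRU eviction policy, the query normalization, and the `QueryCache` class are assumptions made for illustration; the slides only say that served queries are cached.

```python
from collections import OrderedDict


class QueryCache:
    """Serves repeated queries from memory; evicts the least recently used."""

    def __init__(self, search_fn, capacity=1000):
        self._search = search_fn
        self._capacity = capacity
        self._cache = OrderedDict()
        self.hits = 0

    def get(self, query):
        key = " ".join(query.lower().split())  # normalize case and spacing
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
            self.hits += 1
            return self._cache[key]
        result = self._search(key)
        self._cache[key] = result
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # drop the oldest entry
        return result


calls = []

def slow_search(query):  # stand-in for the real searcher
    calls.append(query)
    return ["result for " + query]

cache = QueryCache(slow_search, capacity=2)
first = cache.get("Search  Engine")
second = cache.get("search engine")  # served from the cache
```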
31Research Issues
- Hidden Web data collection
- Distributed index/search
- Index minimization, incremental Indexing
- Smart robot
- Intelligent Retrieval
- Output result auto classification/clustering
- Data source clustering/classification
- classifying/clustering the whole web
32Conclusion
- Size does matter
- Is still searching for a better engine!