?? ???(sw@cs.ccu.edu.tw)

Slides: 33

Provided by: Gais8

1
??????

?????????????
?? ???(sw_at_cs.ccu.edu.tw)

2
What is a search engine?

A web service site where Internet users can find
information on the Internet
The software that provides the web search service

3
Use of search engines?

Search for the URL of a company or website
Look for the contact info of a person or an
organization
Search for information related to a term, e.g., to
collect information about ?????
Look for news regarding XXX
Treat the search engine as a big dictionary
...

4
Types of search engines

Directory browse/search
Web pages search
USENET news search
FTP search
People/organization search
Daily-life information search
Library search
Commercial product search

5
Example search engines

Yahoo,
Google,
AltaVista,
MSN,
Excite,
Lycos, ...

YAM, Kimo, PCHome
GAIS, Openfind, ...
DejaNews,
Archie, ...

6
Portal Services

Directory / Search
Daily information: weather, maps, TV, ...
Free email, free web pages, calendar
Personalized services, channel subscription
Web Chat,
E-Commerce,
Content Aggregation
...

7
Directory implementation

Each URL entry is a record
The URL records are managed by a database system
A search function is provided for searching the
data in the directory tree
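
A minimal sketch of the record-per-URL idea, assuming an in-memory SQLite database. The table name, column names, sample rows, and the search_directory helper are all hypothetical and only illustrate a keyword search that can be limited to one branch of the directory tree.

```python
import sqlite3

# One record per registered URL; the category column holds the tree path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (url TEXT PRIMARY KEY, title TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?)",
    [
        ("http://example.org/py", "Python tutorial", "Computers/Programming"),
        ("http://example.org/zoo", "Python snakes at the zoo", "Science/Animals"),
        ("http://example.org/java", "Java tutorial", "Computers/Programming"),
    ],
)


def search_directory(conn, keyword, category_prefix=""):
    """Return URLs whose title contains keyword, optionally within a category."""
    rows = conn.execute(
        "SELECT url FROM sites WHERE title LIKE ? AND category LIKE ? ORDER BY url",
        (f"%{keyword}%", f"{category_prefix}%"),
    )
    return [url for (url,) in rows]


everywhere = search_directory(conn, "python")
in_computers = search_directory(conn, "python", "Computers/")
```

Storing the category as a path string keeps "search within a category" a simple prefix match; a real directory would more likely use a separate category table.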

8
Directory implementation

The search is generally for locating a website
or a category of websites.
Data is entered through manual registration by
the website owner or a web surfer
Managing the directory tree requires
labor-intensive work by people who have the
relevant domain knowledge

9
The Advantages/Disadvantages of Directory
search engine

Advantages
The data is manually maintained, and
thus contains less noise, and is more precise.
The output of search can be categorized and can
be more organized
Can support search within a category

10
The Advantages/Disadvantages of Directory
search engine

Disadvantages
Data coverage is limited, so the wanted site
sometimes cannot be found
Does not support relevance ranking
Labor intensive

11
Implementation of Webpage search engine

1. Feature consideration
2. Data Gathering
3. Data Preprocessing
4. Data Indexing
5. Query Processing
6. Interaction
7. Service tools
8. Personalization

12
Requirements for WebPage search engines

0. The quality of the search result in a search
engine basically depends on
a. the quality of the underlying data
b. the search techniques, such as ranking techniques
1. Data coverage should be large enough
2. Data needs to be filtered, such as removing
redundant pages

13
Requirements for WebPage search engines

3. Full text search capability should be provided
4. Relevance Ranking mechanism should be provided
5. Search should be fast enough
6. Search features
In other words, the evaluation points are
quality, speed, scale, robustness, and features

14
Data Gatherer

Also known as spider, crawler, robot, ...
Periodically traverses the web to collect web
pages
Needs list management to decide which pages to
collect and when
Needs a link analyzer to generate the list of new URLs
Needs to decide what to collect and what not to.

15
Data Gatherer

Fetching a file over the HTTP protocol is the
basic function
A webpage parser module extracts link info
from a retrieved page
A URL bank manager module manages the URLs to be
fetched.
A robot-controller module manages the data
collection using multiple clients
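
A rough sketch of the parser and URL bank modules, using only the Python standard library. The class and function names (LinkParser, UrlBank, extract_links) and the sample page are illustrative assumptions; the actual HTTP fetch is left out so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Webpage parser: collects absolute http(s) targets of <a href=...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


class UrlBank:
    """URL bank manager: FIFO queue in which each URL is queued only once."""

    def __init__(self):
        self.pending = deque()
        self.seen = set()

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def next_url(self):
        return self.pending.popleft() if self.pending else None


def extract_links(base_url, html):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


page = ('<a href="/a.html">A</a> <a href="b.html#top">B</a> '
        '<a href="mailto:x@example.org">M</a>')
bank = UrlBank()
for link in extract_links("http://example.org/dir/index.html", page):
    bank.add(link)
```

In a full gatherer, the robot controller would hand out bank.next_url() values to several fetching clients and feed each retrieved page back through extract_links.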

16
Issues of Robot

Site-based vs. URL-based
Site-based robots are popular, e.g., wget, Teleport
robots.txt handling is easier to implement in a
site-based robot
A URL-based robot is more appropriate for
large-scale search engines
Retrieval schedule: BFS (breadth-first search) is better
Incremental retrieval
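
A small sketch of a breadth-first retrieval schedule that respects robots.txt, using the standard urllib.robotparser module. The LINKS dictionary stands in for fetched pages, and the rules, URLs, and the agent name "demo-robot" are made-up assumptions so the example needs no network access.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules from text instead of downloading them.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Hypothetical link graph: page URL -> URLs found on that page.
LINKS = {
    "http://example.org/": ["http://example.org/a", "http://example.org/private/x"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": [],
}


def bfs_crawl(seed, agent="demo-robot"):
    """Visit pages level by level, skipping URLs that robots.txt disallows."""
    order, seen, queue = [], {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = bfs_crawl("http://example.org/")
```

A URL-based robot would keep one such rule set per host; this sketch assumes a single site for brevity.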

17
Robot Issues

What to gather and what not to?
Hidden web data collection
Focused crawling
targeting specialized content of web pages
suitable for special search engines
evaluated by precision and recall
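
As a reminder of how the two measures are computed for a focused crawl, here is a small sketch. The page identifiers are invented, and "relevant" is assumed to be a hand-labelled set of on-topic pages.

```python
def precision_recall(fetched, relevant):
    """precision = hits / pages fetched, recall = hits / relevant pages."""
    fetched, relevant = set(fetched), set(relevant)
    hits = len(fetched & relevant)
    precision = hits / len(fetched) if fetched else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The crawler fetched four pages; three pages are truly on topic.
precision, recall = precision_recall(
    fetched=["p1", "p2", "p3", "p4"],
    relevant=["p1", "p2", "p5"],
)
```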

18
Data Preprocessing

Remove redundant pages
Transform the page into internal data format.
Perform web cross-link analysis to generate a URL
databank.
Filter out data that should not be
indexed
Partition the data space

19
Redundancy removal

15 to 20% of web pages are replicated on
different websites, e.g., tutorials for
Java, Perl, Python, ...
Can be implemented by partitioned hashing or
external sorting
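
One possible reading of partitioned hashing, sketched in memory: hash each page's normalized text, route the hash to a partition, then deduplicate inside each partition. The normalization rule, the partition count, and the sample pages are assumptions, and only exact copies (after normalization) are caught.

```python
import hashlib


def content_key(text):
    """Hash of the page text with whitespace and letter case normalized."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).digest()


def remove_duplicates(pages, partitions=4):
    """Keep the first URL seen for each distinct content hash."""
    # Equal hashes always land in the same bucket, so each bucket can be
    # deduplicated on its own (e.g., one bucket file at a time on disk).
    buckets = [[] for _ in range(partitions)]
    for url, text in pages:
        key = content_key(text)
        buckets[key[0] % partitions].append((key, url))
    kept = []
    for bucket in buckets:
        seen = set()
        for key, url in bucket:
            if key not in seen:
                seen.add(key)
                kept.append(url)
    return sorted(kept)


pages = [
    ("http://a.example/java", "Java   Tutorial"),
    ("http://b.example/java", "java tutorial"),
    ("http://c.example/perl", "Perl tutorial"),
]
unique_urls = remove_duplicates(pages)
```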

20
Ranking the URLs

Link analysis is done to count the references
between web pages
A URL receiving more references
gets a higher score
Links are weighted
Internal links are discounted, e.g., "back to home" links
Order the web pages by score so that a
page with a higher rank gets a lower ID
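
A minimal sketch of this scoring, assuming a link is "internal" when source and target share a host name. The weights 1.0 and 0.1 are arbitrary illustrative values, not figures from the talk.

```python
from collections import defaultdict
from urllib.parse import urlsplit

EXTERNAL_WEIGHT = 1.0  # assumed weight for a link from another site
INTERNAL_WEIGHT = 0.1  # assumed discount for same-site links


def score_urls(links):
    """links: iterable of (source_url, target_url) pairs."""
    scores = defaultdict(float)
    for source, target in links:
        scores[source] += 0.0  # make sure pages without in-links get a score
        same_site = urlsplit(source).netloc == urlsplit(target).netloc
        scores[target] += INTERNAL_WEIGHT if same_site else EXTERNAL_WEIGHT
    return dict(scores)


def assign_ids(scores):
    """Higher score -> lower ID; ties are broken by URL for determinism."""
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return {url: doc_id for doc_id, url in enumerate(ranked)}


links = [
    ("http://a.org/x", "http://b.org/"),
    ("http://c.org/", "http://b.org/"),
    ("http://b.org/page", "http://b.org/"),  # internal "back to home" link
    ("http://a.org/x", "http://a.org/"),     # internal link
]
scores = score_urls(links)
ids = assign_ids(scores)
```

Giving better-ranked pages lower IDs means an index list sorted by ID is already roughly sorted by link score.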

21
Data Partition

The data is partitioned by language type
The language partition can be done as follows
for each known language, collect a certain number
of web pages in that language
build a high-frequency term set for each language
from an analysis of the sample data
determine the language of a page by term analysis
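
The three steps can be sketched as follows. The tiny sample texts, the top_n value, and the function names are illustrative assumptions; whitespace tokenization is used for simplicity and would not work for languages written without spaces, such as Chinese.

```python
from collections import Counter


def frequent_terms(samples, top_n=3):
    """Build the high-frequency term set from sample pages of one language."""
    counts = Counter(word for text in samples for word in text.lower().split())
    return {word for word, _count in counts.most_common(top_n)}


def detect_language(text, term_sets):
    """Pick the language whose frequent terms occur most often in the text."""
    words = text.lower().split()
    hits = {lang: sum(w in terms for w in words) for lang, terms in term_sets.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


term_sets = {
    "english": frequent_terms(["the cat and the dog",
                               "the house and the garden of the city"]),
    "german": frequent_terms(["der hund und die katze",
                              "die stadt und der garten"]),
}
guess_en = detect_language("the garden of the king", term_sets)
guess_de = detect_language("die katze und der hund", term_sets)
```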

22
Indexer

In general, an inverted file is used as the
index
A large data space is needed for the indexing task.
For each indexed term, an index list is generated
to record the files and locations in which the term
appears.
The index needs about as much space as the original
data, or more
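
A minimal in-memory sketch of an inverted file: each term maps to a list of (document ID, word position) pairs. The two sample documents and the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: dict of doc_id -> text. Returns term -> [(doc_id, position)]."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for position, term in enumerate(documents[doc_id].lower().split()):
            index[term].append((doc_id, position))
    return dict(index)


documents = {0: "web search engine", 1: "search the web"}
index = build_inverted_index(documents)
```

Because every word occurrence produces one posting, the index size grows with the data size, which matches the space estimate above.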

23
Indexer - implementation issue

Data filter module is used to cope with different
data sources
Inversion module is the kernel module
Must scale to handle continuously growing
data sizes:
hundreds of gigabytes
terabytes
Distributed/Concurrent Indexing

24
Indexer - implementation issue

Temporary space minimization
Index speed is crucial
Memory can be utilized to improve the index
performance
Hashing and sorting are the key!

25
Query Processing

Use a dictionary/stop-list to preprocess the query
string
Parse the query into expressions of tokens
Use the index structure to locate the matching documents
Use a TF-IDF-type technique to score the matching
documents
Combine with URL scores to rank the results
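
A compact sketch of these steps, under several assumptions: a four-word stop-list, whitespace tokenization, a simple tf * log(1 + N/df) weight, and a linear blend with the URL score controlled by a made-up url_weight parameter. For brevity it scans all documents instead of using an index structure.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "for"}  # illustrative stop-list


def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def search(query, documents, url_scores, url_weight=0.1):
    """Return matching doc IDs, best first (TF-IDF plus weighted URL score)."""
    terms = tokenize(query)
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    results = []
    for doc_id, counts in doc_terms.items():
        score = 0.0
        for term in terms:
            df = sum(1 for c in doc_terms.values() if term in c)
            if df and counts[term]:
                score += counts[term] * math.log(1 + n_docs / df)
        if score > 0:
            total = score + url_weight * url_scores.get(doc_id, 0.0)
            results.append((total, doc_id))
    return [doc_id for _total, doc_id in sorted(results, key=lambda r: (-r[0], r[1]))]


documents = {
    "u1": "search engine for the web",
    "u2": "web search search tips",
    "u3": "cooking recipes",
}
by_text = search("the search", documents, {"u1": 1.0})
by_links = search("the search", documents, {"u1": 10.0})
```

With a small URL score the page that mentions the term twice wins; a large enough URL score lets the well-linked page overtake it.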

26
Search CGI programs

search agent CGI
parse the query and fork a searcher process to do
the search (or use IPC to query the searcher)
when the searcher returns, analyze and process
the result for formatted output
process the result and store it in a temporary result
store
log the query and some status info
CGI for view-next-page
showmatch CGI

27
Output control

Site grouping
group the pages from the same website together
Title grouping
group the pages with similar titles
Sort the output according to certain criteria
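
Site grouping can be sketched in a few lines, assuming the host name identifies the website and the input list is already in rank order. The function name and URLs are illustrative.

```python
from urllib.parse import urlsplit


def group_by_site(ranked_urls):
    """Group results by host, keeping the rank order of sites and of pages."""
    groups = {}
    for url in ranked_urls:
        groups.setdefault(urlsplit(url).netloc, []).append(url)
    return list(groups.items())


grouped = group_by_site([
    "http://a.org/1",
    "http://b.org/x",
    "http://a.org/2",
])
```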

28
Interaction

Term Suggestion
Related terms
thesaurus
term-expansion
error correction
phonetic
spelling
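
For the spelling part of error correction, one simple approach is approximate string matching against the indexed vocabulary. This sketch uses difflib.get_close_matches from the Python standard library; the vocabulary and the 0.75 cutoff are assumptions, and phonetic matching is not covered.

```python
import difflib

VOCABULARY = ["search", "engine", "crawler", "index", "ranking"]


def suggest(term, vocabulary=VOCABULARY, limit=3):
    """Suggest known terms that look like a misspelled query term."""
    if term in vocabulary:
        return []  # already a known term, nothing to correct
    return difflib.get_close_matches(term, vocabulary, n=limit, cutoff=0.75)


suggestions = suggest("serach")
```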

29
Personalization

Keep track of a user's interests so that
search results can be tuned to improve
user satisfaction
Query Tracking and classification

30
Service tools

A query cache improves search performance
for queries that have already been served.
Use an in-memory cache file system to reduce disk
access overhead
Mechanism for special case handling
Log analyzer
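
A small sketch of a query cache with least-recently-used eviction. The class name, the capacity, the query normalization, and the stand-in fake_search function are assumptions made for illustration.

```python
from collections import OrderedDict


class QueryCache:
    """Remember results of recent queries; evict the least recently used."""

    def __init__(self, search_fn, capacity=2):
        self.search_fn = search_fn
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0

    def search(self, query):
        key = " ".join(query.lower().split())  # normalize case and spacing
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        result = self.search_fn(key)
        self.entries[key] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the oldest entry
        return result


calls = []


def fake_search(query):
    """Stand-in for the real searcher; records each query it has to serve."""
    calls.append(query)
    return [f"result for {query}"]


cache = QueryCache(fake_search, capacity=2)
first = cache.search("Web  Search")
second = cache.search("web search")  # served from the cache
```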

31
Research Issues

Hidden Web data collection
Distributed index/search
Index minimization, incremental Indexing
Smart robot
Intelligent Retrieval
Output result auto classification/clustering
Data source clustering/classification
classifying/clustering the whole web

32
Conclusion

Size does matter
We are still searching for a better engine!
