swcs.ccu.edu.tw



Slides: 37

Provided by: Gai8


Transcript and Presenter's Notes


1
??????

?????????????
?? ???(sw_at_cs.ccu.edu.tw)

2
What is a search engine?

A web service site where Internet users find
information in Internet cyberspace
The software that provides the web search service

3
Use of search engines?

Look for the contact info about a person or an
organization
Search for information related to a term, e.g., to
collect information about ?????
Search for the url of a company/website
Look for news regarding XXX
Treat the search engine as a big dictionary
Search for products/movies/Travel
Search for games/software,
...

4
Types of search engines

Directory browse/search
Web pages search
Image Search/Multimedia Object Search
News Search
EC Search
Newsgroup/BBS search
Ftp search
People/organization search
Daily-life information search
Library search/Literature Search
Dictionary Search
.

5
Example search engines

Yahoo,
Google,
AltaVista,
MSN,
Excite,
Lycos, ...

YAM, Kimo, PCHome
GAIS, Openfind, ...
DejaNews,
Archie, ...

6
Portal Services

Directory / Search
Daily information: weather, maps, TV, ...
Free Emails, Free Pages, Calendar
Personalized services, channel subscription
Web Chat,
E-Commerce,
Content Aggregation
...

7
Directory implementation

Each URL entry is stored as a record
The URL records are managed by a database system
A search function is provided for searching the
data in the directory tree
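
The record-per-URL design above can be sketched as follows. This is only an illustration under stated assumptions: the `UrlRecord` fields, the sample records, and the `search_directory` helper are hypothetical, and an in-memory list stands in for the database system.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    """One directory entry (hypothetical fields)."""
    url: str
    title: str
    category: tuple  # path in the directory tree, e.g. ("Computers", "Search")


# Hypothetical sample data standing in for the directory database.
RECORDS = [
    UrlRecord("http://example.com/a", "Example search site", ("Computers", "Search")),
    UrlRecord("http://example.org/b", "Example news site", ("News",)),
    UrlRecord("http://example.net/c", "Search tips", ("Computers", "Tutorials")),
]


def search_directory(records, keyword, within=()):
    """Return records whose title contains keyword, limited to one subtree."""
    keyword = keyword.lower()
    prefix = tuple(within)
    return [
        record for record in records
        if record.category[:len(prefix)] == prefix
        and keyword in record.title.lower()
    ]


all_hits = search_directory(RECORDS, "search")
tutorial_hits = search_directory(RECORDS, "search", within=("Computers", "Tutorials"))
```

Passing `within` restricts the match to one branch of the tree, which is the "search within a category" feature a directory can offer.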

8
Directory implementation

The search is in general for locating a website
or a category of websites.
Data is entered through manual registration by
the website owner or a web surfer
Managing the directory tree is labor-intensive
and requires people who are familiar
with the relevant domain knowledge

9
The Advantages/Disadvantages of Directory
search engine

Advantages
The data is manually maintained, and
thus contains less noise, and is more precise.
The output of search can be categorized and can
be more organized
Can support search within a category

10
The Advantages/Disadvantages of Directory
search engine

Disadvantages
Data coverage is limited, so the wanted item
sometimes cannot be found
Does not support relevance ranking
Labor intensive

11
Topics

Automatic Classification
Error detection of the classification tree
Consistency Checking
Link Evolutions
Link Revolutions

12
Implementation of Webpage search engine

Data Gathering Subsystem
Data Preprocessing Subsystem
Indexing Subsystem
Query Processing Subsystem
Service Management

13
Search Engines Evaluations

0. The quality of the search result in a search
engine basically depends on
a. the quality of the underlying data
b. the search techniques, such as ranking techniques
1. Data coverage should be large enough
2. Data needs to be filtered, such as removing
redundant pages

3. Should be able to find a page if it exists
4. Quality of ranking
5. Speed and scalability
6. Search features
7. Intelligence
I.e., evaluation points
Quality, speed, scale, robustness, features,
Intelligence
???, ???, ???, ???

15
Evaluation and Comparison

How to compare different engines fairly?
How to evaluate a search engine in a fair and
scientific way?
How about a model of satisfaction degree
calculation?
How to test the correctness?

16
Data Gatherer

Also known as spider, crawler, robot, ...
Periodically traverses the web to collect web
pages
Needs a list manager to decide which pages to
collect and when
Needs a link analyzer to generate the new URL list
Needs to decide what to collect and what not to

17
Data Gatherer

A get-file function over the HTTP protocol is the
basic function
A webpage parser module extracts link info
from a retrieved page
A URL bank manager module manages the URLs to be
fetched
A robot-controller module manages data
collection across multiple clients
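
A minimal sketch of how the parser and URL bank modules listed above might fit together, using only the Python standard library. The class names, the `crawl` loop, and the `FAKE_WEB` dictionary are assumptions made for illustration; `get_file` is a stand-in for a real HTTP fetch so the example runs without network access, and the robot-controller with multiple clients is not modelled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Webpage parser: collects absolute link targets of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


class UrlBank:
    """URL bank manager: FIFO queue that never queues the same URL twice."""

    def __init__(self, seeds):
        self.queue = deque()
        self.seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next(self):
        return self.queue.popleft() if self.queue else None


def crawl(seeds, get_file, max_pages=100):
    """Fetch pages breadth-first; get_file(url) returns HTML text or None."""
    bank = UrlBank(seeds)
    pages = {}
    while len(pages) < max_pages:
        url = bank.next()
        if url is None:
            break
        html = get_file(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser(url)
        parser.feed(html)
        for link in parser.links:
            bank.add(link)
    return pages


# Hypothetical three-page site used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html#top">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "no links here",
}
pages = crawl(["http://example.com/"], FAKE_WEB.get)
```

In a real gatherer, `get_file` would issue an HTTP GET and the URL bank would live on disk rather than in memory.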

18
Robot Issues

Site-based vs. URL-based
Site-based robots are popular, e.g., wget, Teleport
robots.txt is easier to implement in a site-based
robot
A URL-based robot is more appropriate for
large-scale search engines
Retrieval scheduling
Incremental retrieval
robots.txt processing
DoS prevention
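
As an illustration of robots.txt processing and DoS prevention, the sketch below uses Python's standard `urllib.robotparser` and adds a simple per-host delay. The sample robots.txt text, the `demo-bot` agent name, the `PolitenessGate` class, and the 5-second delay are all assumptions, not values from the slides.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real robot would download it per site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_rules(robots_text):
    """Parse robots.txt text into a rule object."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


class PolitenessGate:
    """Tracks the earliest time each host may be contacted again."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}

    def ready(self, url, now):
        """Return True and reserve the host if it may be fetched at time now."""
        host = urlsplit(url).netloc
        if now < self.next_allowed.get(host, 0.0):
            return False
        self.next_allowed[host] = now + self.min_delay
        return True


rules = make_rules(ROBOTS_TXT)
gate = PolitenessGate(min_delay_seconds=5.0)
```

A URL-based robot would consult both checks before each fetch: `rules.can_fetch(agent, url)` for permission and `gate.ready(url, now)` to avoid hammering one host.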

19
Robot Issues

What to gather and what not to?
Hidden web data collection
JavaScript
CGI
Focused crawling
targeting specialized content of web pages
suitable for special search engines
evaluated by precision and recall
Spam detection,
Crawling optimization
Scale, speed, quality, efficiency,
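
Since focused crawling is evaluated by precision and recall, here is a small sketch of the two measures. The function name and the sample URL labels are hypothetical; `relevant` is assumed to be a known set of on-topic pages, which in practice is hard to obtain.

```python
def precision_recall(collected, relevant):
    """Precision: share of collected pages that are relevant.
    Recall: share of relevant pages that were collected."""
    collected, relevant = set(collected), set(relevant)
    hits = len(collected & relevant)
    precision = hits / len(collected) if collected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical crawl: 4 pages collected, 3 of them among 6 relevant pages.
collected_pages = ["u1", "u2", "u3", "u4"]
relevant_pages = ["u1", "u2", "u3", "u5", "u6", "u7"]
precision, recall = precision_recall(collected_pages, relevant_pages)
```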

20
Robot Program ????

Watch out for the traps
From the colo manager:
"One more complaint call and I will shut down all
your servers!"
"This is XYZ's legal office. I have received a
complaint from one of our customers that your
sites are launching a DoS attack that has caused
serious damage to their business."
Guess how many alias names a site can have
"Your bot deleted the content of my site!"

21
Data Preprocessing

Remove redundant pages
Transform the page into internal data format.
Perform web cross-link analysis to generate a URL
database and a linkage database.
Partition the data space
Language classification
Knowledge classification
Data Ranking
Data typing (EC, News, BBS, ...)
Time detection
Hub identification

22
Data Preprocessing

Approximate File Detection
Keyterms generation
Redundancy removal, (Data Optimization)
Data Filtering
Essential Body identification
Automatic summary
Automatic dictionary generation
Thesaurus dictionary generation
WNS analysis
Name Selection
Data compression

23
Redundancy removal

15 to 20% of web pages are replicated on
different websites, e.g., tutorials for
Java, Perl, Python, ...
Can be implemented by partitioned hashing or
external sorting
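
One possible reading of the partitioned-hashing approach, sketched under assumptions: each page is reduced to a content hash, pages are split into partitions by hash value so that each partition can be deduplicated on its own, and the first copy seen is kept. The function names, the whitespace/case normalization, and the partition count are illustrative choices; this only catches exact copies, not approximate ones.

```python
import hashlib
from collections import defaultdict


def content_key(text):
    """Hash of the page text after collapsing whitespace and case."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def remove_redundant(pages, num_partitions=4):
    """pages: list of (url, text). Returns URLs to keep, first copy wins."""
    partitions = defaultdict(list)
    for position, (url, text) in enumerate(pages):
        key = content_key(text)
        # Identical pages always land in the same partition.
        partitions[int(key, 16) % num_partitions].append((key, position, url))

    kept = []
    for bucket in partitions.values():  # each bucket fits in memory on its own
        seen = set()
        for key, position, url in bucket:
            if key not in seen:
                seen.add(key)
                kept.append((position, url))
    return [url for _position, url in sorted(kept)]


# Hypothetical mirrored tutorial pages.
sample_pages = [
    ("http://a.example/java.html", "Java  Tutorial"),
    ("http://b.example/mirror.html", "java tutorial"),
    ("http://c.example/perl.html", "Perl tutorial"),
]
unique_urls = remove_redundant(sample_pages)
```

At web scale the partitions would be files on disk, each small enough to load and deduplicate in memory.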

24
Ranking the URLs

Link analysis counts the mutual
references between web pages
A URL receiving a higher number of references
gets a higher score
Links are weighted
Internal links are discounted, e.g., "back to home" links
Web pages are ordered by score so that a
page with a higher rank gets a lower ID
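
The scoring and ID assignment described above could look roughly like this. The 0.1 weight for internal links, the same-host test for "internal", and the function names are assumptions for illustration; the slides do not give the actual weights.

```python
from urllib.parse import urlsplit


def score_urls(links, internal_weight=0.1):
    """links: iterable of (source_url, target_url). Returns {url: score}."""
    scores = {}
    for source, target in links:
        scores.setdefault(source, 0.0)
        scores.setdefault(target, 0.0)
        # A link within the same host counts far less than an external one.
        same_site = urlsplit(source).netloc == urlsplit(target).netloc
        scores[target] += internal_weight if same_site else 1.0
    return scores


def assign_ids(scores):
    """Higher score gets a lower ID; ties are broken by URL for stability."""
    ordered = sorted(scores, key=lambda url: (-scores[url], url))
    return {url: doc_id for doc_id, url in enumerate(ordered)}


# Hypothetical link graph.
sample_links = [
    ("http://a.example/", "http://b.example/"),
    ("http://c.example/", "http://b.example/"),
    ("http://b.example/page", "http://b.example/"),  # internal "back to home"
    ("http://b.example/", "http://a.example/"),
]
url_scores = score_urls(sample_links)
url_ids = assign_ids(url_scores)
```

Giving well-referenced pages low IDs means index lists sorted by ID are already in rank order.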

25
Data Partition

The data is partitioned by language type
The language partition can be done as follows:
for each known language, collect a certain amount
of web pages in that language
build a high-frequency term set for each language
from analysis of the sample data
determine the language type of a page by term analysis
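
The three steps above can be sketched as below. The tiny English and German samples, the `top_n` cutoff, and the whitespace tokenizer are assumptions made for the example; languages without word spaces, such as Chinese, would need a different term extractor.

```python
from collections import Counter


def build_term_sets(samples, top_n=3):
    """samples: {language: [text, ...]}. Returns {language: frequent terms}."""
    term_sets = {}
    for language, texts in samples.items():
        counts = Counter(word for text in texts for word in text.lower().split())
        term_sets[language] = {term for term, _count in counts.most_common(top_n)}
    return term_sets


def detect_language(text, term_sets):
    """Pick the language whose frequent terms occur most often in text."""
    words = text.lower().split()
    hits = {
        language: sum(word in terms for word in words)
        for language, terms in term_sets.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


# Hypothetical sample pages per language.
SAMPLES = {
    "english": ["the cat and the dog", "the search of the web and the index"],
    "german": ["der hund und die katze", "die suche und der index und das web"],
}
term_sets = build_term_sets(SAMPLES)
```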

26
Indexer

In general, an inverted file is used as the
index
The indexing task needs a large data space
For each indexed term, an index list is generated
to record the files/locations in which the term
appears
The index needs about the same space as the original
data, or more
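
A minimal in-memory sketch of the inverted file described above: for each term, a list of (document ID, word position) pairs. The function name, the sample documents, and the whitespace tokenizer are assumptions; a production indexer would write these lists to disk.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: {doc_id: text}. Returns {term: [(doc_id, position), ...]}."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for position, term in enumerate(documents[doc_id].lower().split()):
            index[term].append((doc_id, position))
    return dict(index)


# Hypothetical two-document collection.
sample_docs = {0: "web search engine", 1: "search the web"}
inverted = build_inverted_index(sample_docs)
```

Because every word occurrence produces one entry, the index grows with the text itself, which is why it needs about as much space as the original data.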

27
Indexer issues

A data filter/converter module is used to cope with
different data sources
The index can be decomposed into two parts:
Page index
Inverted index
Needs to be scalable to handle continuously growing
data size:
Hundreds of gigabytes
Terabytes
Distributed/Concurrent Indexing

28
Indexer issues

Temporary space minimization
Index speed optimization (sequential, and
concurrent)
Memory can be utilized to improve indexing
performance
Index size optimization
Hashing and sorting are the key!

29
Query Processing

Use a dictionary/stop-list to preprocess the query
string
Parse the query into expressions of tokens
Use the index structure to locate the matched
documents
Use a TF-IDF-type technique to score the matched
documents
Combine URL scores to rank the result:
PageRank, WNS, BNS
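
The query steps above, sketched end to end on a toy collection. The stop-list, the particular TF-IDF variant (`tf * log(1 + N/df)`), the linear mix with a URL score, and all names are assumptions; the slides do not specify the formula, and a real engine would read term counts from the index instead of rescanning documents.

```python
import math
from collections import Counter

STOP_LIST = {"the", "a", "of", "for"}  # hypothetical stop-list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop-list words."""
    return [word for word in text.lower().split() if word not in STOP_LIST]


def rank(query, documents, url_scores=None, url_weight=0.0):
    """Return doc IDs matching any query term, best score first."""
    url_scores = url_scores or {}
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    results = []
    for doc_id, counts in doc_terms.items():
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue
            df = sum(1 for other in doc_terms.values() if term in other)
            score += tf * math.log(1 + n_docs / df)  # rarer terms weigh more
        if score > 0:
            # Combine the text score with a precomputed URL score.
            score += url_weight * url_scores.get(doc_id, 0.0)
            results.append((score, doc_id))
    results.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _score, doc_id in results]


# Hypothetical documents.
DOCS = {
    "d1": "the web search engine",
    "d2": "search for the search engine index",
    "d3": "cooking recipes",
}
by_text = rank("search index", DOCS)
by_text_and_url = rank("search index", DOCS, url_scores={"d1": 10.0}, url_weight=1.0)
```

With `url_weight` above zero, a strongly linked page can overtake one that matches the query text better, which is the effect of combining URL scores with text scores.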

30
Search CGI programs

search agent CGI
parse the query and fork a searcher process to do
the search (or use IPC to query the searcher)
when the searcher returns, analyze and process
the result for formatted output
process the result and store it in a temporary
result store
log query and some status info
What portion (matched area) to display
Show cache
Similar pages

31
Output control

Site grouping
group the pages from same website together
Title grouping
group the pages with similar title
Output Clustering, (classification)
Ontology guided clustering
Ranking and Ordering
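
Site grouping, the first item above, can be illustrated in a few lines. Treating the host part of the URL as "the same website" is an assumption, as are the function name and sample URLs; groups keep the position of their best-ranked page.

```python
from urllib.parse import urlsplit


def group_by_site(ranked_urls):
    """Group a ranked URL list by host, keeping rank order inside groups."""
    groups = {}
    for url in ranked_urls:
        groups.setdefault(urlsplit(url).netloc, []).append(url)
    return list(groups.items())


# Hypothetical ranked result list.
ranked = ["http://a.example/1", "http://b.example/x", "http://a.example/2"]
grouped = group_by_site(ranked)
```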

32
Interaction

Term Suggestion
Related terms
thesaurus
term-expansion
error correction
phonetic
spelling
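
For the spelling part of error correction, one simple option is to suggest vocabulary terms that are close to the typed term. The sketch uses `difflib.get_close_matches` from the Python standard library; the vocabulary, the 0.6 cutoff, and the `suggest` helper are assumptions, and phonetic matching is not covered here.

```python
import difflib


def suggest(term, vocabulary, limit=3):
    """Return up to limit vocabulary terms similar to a misspelled term."""
    term = term.lower()
    if term in vocabulary:
        return []  # spelled correctly, nothing to suggest
    return difflib.get_close_matches(term, sorted(vocabulary), n=limit, cutoff=0.6)


# Hypothetical vocabulary, e.g. taken from the index dictionary.
VOCABULARY = {"search", "engine", "index"}
suggestions = suggest("serch", VOCABULARY)
```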

33
Personalization

Keeping track of a user's interests so that the
search result can be tuned to improve
user satisfaction
Query Tracking and classification
Personalized ranking

34
Service tools

A query cache improves the performance of
search for queries that have been served before
Use a memory-cache file system to reduce disk
access overhead
Mechanism for special case handling
Log analyzer
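
A query cache like the one mentioned above might be sketched as a small least-recently-used store in front of the searcher. The `QueryCache` class, its capacity, the query normalization, and the `slow_search` stand-in are all hypothetical.

```python
from collections import OrderedDict


class QueryCache:
    """LRU cache of query results in front of a slower search function."""

    def __init__(self, capacity, search_fn):
        self.capacity = capacity
        self.search_fn = search_fn
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def search(self, query):
        key = " ".join(query.lower().split())  # normalize case and spacing
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        result = self.search_fn(key)
        self.entries[key] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return result


backend_calls = []


def slow_search(query):
    """Stand-in for the real searcher; records each call it receives."""
    backend_calls.append(query)
    return [query + "-result"]


cache = QueryCache(capacity=2, search_fn=slow_search)
first = cache.search("Web  Search")
second = cache.search("web search")  # served from the cache
```

The hit and miss counters are the kind of status info a log analyzer could use to size the cache.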

35
More topics