Title: swcs.ccu.edu.tw
1??????
- ?????????????
- ?? ???(sw_at_cs.ccu.edu.tw)
2 What is a search engine?
- A web service site where Internet users find information in cyberspace
- The software that provides the web search service
3 Uses of search engines
- Look up contact information for a person or an organization
- Search for information related to a term, e.g. to collect information about ?????
- Search for the URL of a company or website
- Look for news regarding XXX
- Treat the search engine as a big dictionary
- Search for products, movies, or travel information
- Search for games or software
- ...
4 Types of search engines
- Directory browse/search
- Web page search
- Image search/multimedia object search
- News search
- E-commerce (EC) search
- Newsgroup/BBS search
- FTP search
- People/organization search
- Daily-life information search
- Library search/literature search
- Dictionary search
- ...
5 Example search engines
- Yahoo
- Google
- AltaVista
- MSN
- Excite
- Lycos, ...
- YAM, Kimo, PCHome
- GAIS, Openfind, ...
- DejaNews
- Archie, ...
6 Portal Services
- Directory / Search
- Daily information: weather, maps, TV, ...
- Free email, free web pages, calendar
- Personalized services, channel subscription
- Web chat
- E-commerce
- Content Aggregation
- ...
7 Directory implementation
- Each URL entry is a record
- The URL data is managed by a database system
- A search function supports searching the data in the directory tree
8 Directory implementation (continued)
- The search is generally for locating a website or a category of websites
- Data is entered through manual registration by the website owner or a surfer
- Managing the directory tree requires intensive manual work by people who are familiar with the relevant domain knowledge
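
The directory described above (URL records stored in a category tree, with search over the tree) can be sketched as follows. This is only a minimal in-memory illustration: the slides say a database system manages the records, and the class name `Category`, the functions `register` and `search`, and the example URLs are all hypothetical.

```python
class Category:
    """One node of the directory tree; holds URL records and sub-categories."""

    def __init__(self, name):
        self.name = name
        self.children = {}   # sub-category name -> Category
        self.records = []    # list of (title, url) records

    def child(self, name):
        """Return the named sub-category, creating it if needed."""
        if name not in self.children:
            self.children[name] = Category(name)
        return self.children[name]


def register(root, path, title, url):
    """Manually register a site under the category path, e.g. ["Computers", "Search"]."""
    node = root
    for name in path:
        node = node.child(name)
    node.records.append((title, url))


def search(node, keyword):
    """Return every record at or below `node` whose title contains the keyword."""
    keyword = keyword.lower()
    hits = [(t, u) for t, u in node.records if keyword in t.lower()]
    for sub in node.children.values():
        hits.extend(search(sub, keyword))
    return hits


root = Category("root")
register(root, ["Computers", "Search"], "Example search engine", "http://engine.example/")
register(root, ["News"], "Search for news daily", "http://news.example/")
```

Calling `search(root, "search")` scans the whole tree, while `search(root.child("Computers"), "search")` restricts the search to one category.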
9 Advantages and disadvantages of directory search engines
- Advantages
- The data is manually maintained, so it contains less noise and is more precise
- Search output can be categorized and better organized
- Search within a category can be supported
10 Advantages and disadvantages of directory search engines (continued)
- Disadvantages
- Data coverage is limited, so the wanted item sometimes cannot be found
- Relevance ranking is not supported
- Maintenance is labor intensive
11 Topics
- Automatic Classification
- Error detection of the classification tree
- Consistency Checking
- Link Evolutions
- Link Revolutions
12 Implementation of a web page search engine
- Data Gathering Subsystem
- Data Preprocessing Subsystem
- Indexing Subsystem
- Query Processing Subsystem
- Service Management
13 Search engine evaluation
- 0. The quality of a search engine's results basically depends on
- a. the quality of the underlying data
- b. the search techniques, such as ranking techniques
- 1. Data coverage should be large enough
- 2. Data needs to be filtered, e.g. by removing redundant pages
14 Search engine evaluation (continued)
- 3. Should be able to find a page if it exists
- 4. Quality of ranking
- 5. Speed and scalability
- 6. Search features
- 7. Intelligence
- I.e., evaluation points
- Quality, speed, scale, robustness, features, intelligence
- ???, ???, ???, ???
15 Evaluation and comparison
- How to compare different engines fairly?
- How to evaluate a search engine in a fair and scientific way?
- How about a model for calculating the degree of user satisfaction?
- How to test correctness?
16 Data Gatherer
- Also known as a spider, crawler, robot, ...
- Periodically traverses the web to collect web pages
- Needs list management to decide which pages to collect and when
- Needs a link analyzer to generate the new URL list
- Needs to decide what to collect and what not to
17 Data Gatherer (continued)
- A get-file function over the HTTP protocol is the basic function
- A web page parser module extracts link information from a retrieved page
- A URL bank manager module manages the URLs to be fetched
- A robot-controller module manages data collection using multiple clients
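
The parser and URL bank modules listed above can be sketched with the Python standard library. This is a minimal illustration, not the lecturer's implementation: the names `LinkExtractor`, `extract_links`, and `UrlBank` are hypothetical, and the actual HTTP fetch and multi-client controller are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Web page parser: collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)   # resolve relative links
                absolute, _fragment = urldefrag(absolute)  # drop "#section" parts
                self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


class UrlBank:
    """URL bank: hands out each URL at most once, in first-in first-out order."""

    def __init__(self):
        self._seen = set()
        self._queue = deque()

    def add(self, url):
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next(self):
        return self._queue.popleft() if self._queue else None
```

A crawl loop would repeatedly call `bank.next()`, fetch that page, and feed the result of `extract_links` back into `bank.add`.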
18 Robot Issues
- Site based vs. URL based
- Site-based robots are popular, e.g. wget, Teleport
- robots.txt is easier to implement in a site-based robot
- A URL-based robot is more appropriate for large-scale search engines
- Retrieval scheduling
- Incremental retrieval
- robots.txt processing
- DoS prevention
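
As a rough illustration of robots.txt processing, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the user-agent name `examplebot`, and the URLs are made-up examples; a real robot would download the file from each site before fetching pages there.

```python
from urllib.robotparser import RobotFileParser


def make_policy(robots_txt):
    """Parse the text of a robots.txt file into a queryable policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content for one site.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

policy = make_policy(rules)
blocked = policy.can_fetch("examplebot", "http://site.example/private/data.html")
allowed = policy.can_fetch("examplebot", "http://site.example/index.html")
delay = policy.crawl_delay("examplebot")  # seconds to wait between requests
```

Honoring the crawl delay, or a default per-host pause when none is given, is one simple way to keep the robot from looking like a DoS attack.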
19 Robot Issues (continued)
- What to gather and what not to?
- Hidden web data collection
- JavaScript
- CGI
- Focused crawling
- Targets specialized content of web pages
- Suitable for special-purpose search engines
- Evaluated by precision and recall
- Spam detection
- Crawling optimization
- Scale, speed, quality, efficiency
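
Precision and recall, the two measures named above for focused crawling, can be computed as in this small sketch. The function name and the sample URL sets are hypothetical; in practice the hard part is deciding which pages count as relevant.

```python
def precision_recall(fetched, relevant):
    """Precision: share of fetched pages that are relevant.
    Recall: share of relevant pages that were fetched."""
    fetched, relevant = set(fetched), set(relevant)
    hits = fetched & relevant
    precision = len(hits) / len(fetched) if fetched else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


fetched_pages = {"u1", "u2", "u3", "u4"}   # what the crawler collected
relevant_pages = {"u1", "u2", "u5"}        # what it should have collected
precision, recall = precision_recall(fetched_pages, relevant_pages)
# precision is 2/4 and recall is 2/3 for these sets
```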
20 Robot Program ????
- Watch out for the traps
- From the colo manager:
- "One more complaint call and I will shut down all your servers!"
- "This is XYZ's legal office. I have received a complaint from one of our customers that your sites are launching a DoS attack that has caused serious damage to their business."
- Guess how many alias names a site can have
- "Your bot deleted the content of my site!"
21 Data Preprocessing
- Remove redundant pages
- Transform each page into the internal data format
- Perform web cross-link analysis to generate a URL database and a linkage database
- Partition the data space
- Language classification
- Knowledge classification
- Data ranking
- Data typing (EC, news, BBS, ...)
- Time detection
- Hub identification
22 Data Preprocessing (continued)
- Approximate file detection
- Key term generation
- Redundancy removal (data optimization)
- Data Filtering
- Essential Body identification
- Automatic summary
- Automatic dictionary generation
- Thesaurus dictionary generation
- WNS analysis
- Name Selection
- Data compression
23 Redundancy removal
- 15 to 20% of web pages are replicated on different websites, e.g. tutorials for Java, Perl, Python, ...
- Can be implemented by partitioned hashing or external sorting
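
One way to read "partitioned hashing" is sketched below: hash each page's content, use the hash to pick a partition, then remove duplicates inside each partition independently so that no single table has to hold every digest. This is an assumption about the technique, not the lecturer's code. It only catches exact copies (after whitespace and case normalization), and the function names and sample pages are hypothetical.

```python
import hashlib


def content_digest(text):
    """Digest of the page text with whitespace and case normalized."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def partition_by_digest(pages, num_partitions):
    """Split (url, text) pairs into partitions keyed by their content digest."""
    partitions = [[] for _ in range(num_partitions)]
    for url, text in pages:
        digest = content_digest(text)
        partitions[int(digest, 16) % num_partitions].append((digest, url))
    return partitions


def remove_redundant(pages, num_partitions=4):
    """Keep the first URL seen for each distinct content digest."""
    kept = []
    for partition in partition_by_digest(pages, num_partitions):
        seen = set()  # only needs to hold the digests of one partition
        for digest, url in partition:
            if digest not in seen:
                seen.add(digest)
                kept.append(url)
    return kept


pages = [
    ("http://a.example/java.html", "Java  Tutorial\nLesson 1"),
    ("http://b.example/mirror/java.html", "java tutorial lesson 1"),
    ("http://c.example/perl.html", "Perl tutorial"),
]
kept = remove_redundant(pages)
```

Identical pages always land in the same partition because the partition is derived from the digest, so each partition could be written to disk and processed separately.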
24 Ranking the URLs
- Link analysis counts the mutual references between web pages
- A URL that receives more references gets a higher score
- Weighted links
- Discount internal links, such as "back to home"
- Order the web pages by score so that a page with a higher rank gets a lower ID
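
The reference-counting idea above might look like the following sketch. The weight of 0.1 for internal links, the use of the host name to decide what counts as "internal", and the function names are all assumptions chosen for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def score_urls(links, internal_weight=0.1):
    """links: iterable of (source_url, target_url) pairs found by link analysis."""
    scores = defaultdict(float)
    for source, target in links:
        scores[source] += 0.0  # list the source even if nothing links to it
        if source == target:
            continue  # ignore self-references
        same_site = urlsplit(source).hostname == urlsplit(target).hostname
        # Internal links (e.g. "back to home") count far less than external ones.
        scores[target] += internal_weight if same_site else 1.0
    return dict(scores)


def assign_ids(scores):
    """Give lower IDs to higher-scoring pages; ties are broken by URL."""
    ordered = sorted(scores, key=lambda url: (-scores[url], url))
    return {url: page_id for page_id, url in enumerate(ordered)}


links = [
    ("http://a.example/x", "http://b.example/"),
    ("http://c.example/y", "http://b.example/"),
    ("http://b.example/p", "http://b.example/"),   # internal link, discounted
    ("http://a.example/x", "http://c.example/y"),
]
scores = score_urls(links)
ids = assign_ids(scores)
```

Here `http://b.example/` collects two external references plus one discounted internal one, so it receives ID 0.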
25 Data Partition
- The data is partitioned by language
- The language partition can be done as follows:
- For each known language, collect a certain number of web pages in that language
- Build a high-frequency term set for each language from analysis of the sample data
- Determine the language of a page by term analysis
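
The three steps above can be sketched as follows. The sketch assumes whitespace-separated words, which does not hold for languages such as Chinese, where character n-grams would be a more natural "term"; the function names and the tiny sample texts are hypothetical.

```python
from collections import Counter


def build_term_sets(samples, top_n=50):
    """samples: dict mapping a language name to a list of sample page texts."""
    term_sets = {}
    for language, texts in samples.items():
        counts = Counter(word for text in texts for word in text.lower().split())
        term_sets[language] = {word for word, _count in counts.most_common(top_n)}
    return term_sets


def detect_language(text, term_sets):
    """Pick the language whose high-frequency terms occur most often in the text."""
    words = text.lower().split()
    best_language, best_hits = None, 0
    for language, terms in term_sets.items():
        hits = sum(1 for word in words if word in terms)
        if hits > best_hits:
            best_language, best_hits = language, hits
    return best_language  # None when no known term occurs at all


samples = {
    "english": ["the cat and the dog", "the house is on the hill"],
    "german": ["der hund und die katze", "das haus ist auf dem berg"],
}
term_sets = build_term_sets(samples)
```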
26 Indexer
- In general, an inverted file is used to build the index
- The indexing task needs a large data space
- For each indexed term, an index list records the files and locations in which the term appears
- The index needs about the same space as the original data, or more
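
A minimal in-memory sketch of an inverted file: for each term it records the documents and word positions where the term appears. Real indexers work on disk and at far larger scale; the function name and sample documents are illustrative only.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: dict mapping doc_id to text.
    Returns a dict mapping each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


documents = {
    1: "web search engine",
    2: "search the web with a search engine",
}
index = build_inverted_index(documents)
# index["search"] is {1: [1], 2: [0, 5]}
```

Because every word occurrence gets an entry, the index naturally grows to a size comparable to the original text.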
27 Indexer issues
- A data filter/converter module copes with different data sources
- The index can be decomposed into two parts
- Page index
- Inverted index
- Must scale to handle continuously growing data sizes
- Hundreds of gigabytes
- Terabytes
- Distributed/concurrent indexing
28 Indexer issues (continued)
- Temporary space minimization
- Index speed optimization (sequential and concurrent)
- Memory can be used to improve indexing performance
- Index size optimization
- Hashing and sorting are the key!
29 Query Processing
- Use a dictionary/stop list to preprocess the query string
- Parse the query into expressions of tokens
- Use the index structure to locate the matched documents
- Use a TF-IDF type technique to score the matched documents
- Combine URL scores to rank the result
- PageRank, WNS, BNS
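
The query pipeline above can be sketched in a few lines. For brevity this version scans the documents directly instead of using an index, uses one common TF-IDF variant (term frequency times log(1 + N/df)), and adds a precomputed URL score with a simple weight; the stop list, the weighting, and all names are assumptions.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "for"}  # hypothetical stop list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def rank(query, documents, url_scores=None, url_weight=0.0):
    """documents: dict url -> text. Returns [(url, score)], best first."""
    url_scores = url_scores or {}
    doc_terms = {url: Counter(tokenize(text)) for url, text in documents.items()}
    n_docs = len(documents)
    results = []
    for url, counts in doc_terms.items():
        score, matched = 0.0, False
        for term in tokenize(query):
            tf = counts[term]  # a Counter returns 0 for missing terms
            if tf == 0:
                continue
            matched = True
            df = sum(1 for other in doc_terms.values() if term in other)
            score += tf * math.log(1 + n_docs / df)
        if matched:
            # Combine the text score with the link-based URL score.
            results.append((url, score + url_weight * url_scores.get(url, 0.0)))
    results.sort(key=lambda item: (-item[1], item[0]))
    return results


documents = {
    "u1": "the search engine ranks pages",
    "u2": "a recipe for apple pie",
    "u3": "search search engine index",
}
by_text = rank("the search engine", documents)
by_text_and_links = rank("the search engine", documents,
                         url_scores={"u1": 10.0}, url_weight=1.0)
```

With text scores alone `u3` comes first because it mentions "search" twice; once the URL score is mixed in, the well-referenced `u1` moves ahead.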
30 Search CGI programs
- Search agent CGI
- Parse the query and fork a searcher process to do the search (or use IPC to query the searcher)
- When the searcher returns, analyze and process the result for formatted output
- Process the result and store it in a temporary result store
- Log the query and some status information
- What portion (matched area) to display
- Show cache
- Similar pages
31 Output control
- Site grouping
- Group the pages from the same website together
- Title grouping
- Group the pages with similar titles
- Output clustering (classification)
- Ontology-guided clustering
- Ranking and ordering
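
Site grouping can be sketched as below: results arrive in rank order, and pages from the same host are collected under the position of that host's best-ranked page. Treating the host name as the "site" is a simplifying assumption, and the function name and URLs are hypothetical.

```python
from urllib.parse import urlsplit


def group_by_site(ranked_urls):
    """Group a ranked URL list by host, keeping the original rank order."""
    groups = {}  # dicts keep insertion order, so the best-ranked host stays first
    for url in ranked_urls:
        host = urlsplit(url).hostname or ""
        groups.setdefault(host, []).append(url)
    return list(groups.items())


ranked = ["http://a.example/1", "http://b.example/x", "http://a.example/2"]
grouped = group_by_site(ranked)
# [("a.example", ["http://a.example/1", "http://a.example/2"]),
#  ("b.example", ["http://b.example/x"])]
```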
32 Interaction
- Term Suggestion
- Related terms
- thesaurus
- term-expansion
- error correction
- phonetic
- spelling
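
Spelling correction for query terms can be approximated with the standard library's `difflib.get_close_matches`, which ranks dictionary words by string similarity. The dictionary, the similarity cutoff of 0.7, and the function name are assumptions; a production engine would more likely draw suggestions from its own term dictionary and query logs.

```python
import difflib


def suggest_terms(query_term, dictionary, max_suggestions=3):
    """Return close dictionary words for a possibly misspelled query term."""
    if query_term in dictionary:
        return []  # the term is known, so nothing to correct
    return difflib.get_close_matches(
        query_term, dictionary, n=max_suggestions, cutoff=0.7
    )


dictionary = ["search", "engine", "crawler", "index", "ranking"]
suggestions = suggest_terms("serach", dictionary)
```

For the misspelling "serach" the only dictionary word above the cutoff is "search".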
33 Personalization
- Keep track of a user's interests so that search results can be tuned to improve user satisfaction
- Query tracking and classification
- Personalized ranking
34 Service tools
- Query cache to improve search performance for queries that have already been served
- Use a memory-cached file system to reduce disk access overhead
- Mechanism for special-case handling
- Log analyzer
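
A query cache for already-served queries might be sketched as follows. The least-recently-used eviction policy, the capacity, and the query normalization are assumptions made for the example; the slides do not say which policy is used.

```python
from collections import OrderedDict


class QueryCache:
    """Small in-memory cache mapping a normalized query to its result list."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._entries = OrderedDict()

    @staticmethod
    def _normalize(query):
        # Treat "Search  Engine" and "search engine" as the same query.
        return " ".join(query.lower().split())

    def get(self, query):
        key = self._normalize(query)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key]

    def put(self, query, results):
        key = self._normalize(query)
        self._entries[key] = results
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used


cache = QueryCache(capacity=2)
cache.put("Search  Engine", ["u1"])
cache.put("crawler", ["u2"])
first_hit = cache.get("search engine")  # refreshes this entry
cache.put("index", ["u3"])              # evicts "crawler"
```

The search CGI would check the cache before contacting the searcher and store the result afterwards.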
35 More topics
- Query log mining/analysis
- Hot terms
- Event mining
- Dictionary, thesaurus, synonym generation
- Session analysis
- User behavior analysis (what do they want?)
- Query transformation and optimization
- Natural language processing
- Q&A
- Semantic indexing
- Dead link minimization
- On-the-fly query attack detection and service protection
- Intelligent proxy
- Co-related search
36 Conclusion
- Size does matter
- We are still searching for a better engine!