?? ???(sw@cs.ccu.edu.tw)


1
??????
  • ?????????????
  • ?? ???(sw_at_cs.ccu.edu.tw)

2
What is a search engine?
  • A web service site where Internet users find
    information on the Internet
  • The software to provide web search service

3
Use of search engines?
  • Search for the url of a company/website
  • Look for the contact info about a person or an
    organization
  • Search for information related to a term, e.g., to
    collect information about ?????
  • Look for news regarding XXX
  • Treat the search engine as a big dictionary
  • ...

4
Types of search engines
  • Directory browse/search
  • Web pages search
  • USENET news search
  • Ftp search
  • People/organization search
  • Daily-life information search
  • Library search
  • Commercial product search

5
Example search engines
  • Yahoo,
  • Google,
  • AltaVista,
  • MSN,
  • Excite,
  • Lycos, ...
  • YAM, Kimo, PCHome
  • GAIS, Openfind, ...
  • DejaNews,
  • Archie, ...

6
Portal Services
  • Directory / Search
  • Daily information: weather, maps, TV, ...
  • Free Emails, Free Pages, Calendar
  • Personalized services, channel subscription
  • Web Chat,
  • E-Commerce,
  • Content Aggregation
  • ...

7
Directory implementation
  • Each url data is a record
  • The url data is managed by a database system
  • Search function is supported for searching the
    data in the directory tree

8
Directory implementation
  • The search is in general for locating a website
    or a category of web sites.
  • Data is entered through manual registration by
    the website owner or the surfer
  • Managing the directory tree is labor intensive
    and requires people who are familiar with the
    relevant domain knowledge
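
To make the record-per-URL idea concrete, here is a minimal sketch that stores directory entries in an in-memory SQLite table and searches them, optionally within one category. The table layout, the function names, and the path-style category strings are assumptions made for illustration; the slides do not say which database system or schema is used.

```python
import sqlite3


def build_directory(records):
    """Store (url, title, category) records; the schema is hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE site (url TEXT PRIMARY KEY, title TEXT, category TEXT)"
    )
    conn.executemany("INSERT INTO site VALUES (?, ?, ?)", records)
    return conn


def search_directory(conn, keyword, category_prefix=""):
    """Return URLs whose title contains keyword, limited to a category subtree.

    LIKE wildcards in keyword are not escaped, to keep the sketch short.
    """
    rows = conn.execute(
        "SELECT url FROM site WHERE title LIKE ? AND category LIKE ? ORDER BY url",
        (f"%{keyword}%", f"{category_prefix}%"),
    )
    return [url for (url,) in rows]
```

Passing a category prefix such as "Computers" restricts the search to that branch of the directory tree, which is the kind of category-scoped search a directory can offer.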

9
The Advantages/Disadvantages of Directory
search engine
  • Advantages
  • The data is manually maintained, and
    thus contains less noise, and is more precise.
  • The output of search can be categorized and can
    be more organized
  • Can support search within a category

10
The Advantages/Disadvantages of Directory
search engine
  • Disadvantages
  • The data coverage is limited, so a wanted site
    sometimes cannot be found
  • Does not support relevance ranking
  • Labor intensive

11
Implementation of Webpage search engine
  • 1. Feature consideration
  • 2. Data Gathering
  • 3. Data Preprocessing
  • 4. Data Indexing
  • 5. Query Processing
  • 6. Interaction
  • 7. Service tools
  • 8. Personalization

12
Requirements for WebPage search engines
  • 0. The quality of the search result in a search
    engine basically depends on
  • a. the quality of the underlying data
  • b. the search techniques, such as ranking techniques
  • 1. Data coverage should be large enough
  • 2. Data needs to be filtered, such as removing
    redundant pages

13
Requirements for WebPage search engines
  • 3. Full text search capability should be provided
  • 4. Relevance Ranking mechanism should be provided
  • 5. Search Speed should be fast enough
  • 6. Search features
  • Evaluation points: quality, speed, scale,
    robustness, features

14
Data Gatherer
  • Also known as spider, crawler, robot, ...
  • Periodically traverses the web to collect web
    pages
  • Needs a URL list manager to decide which pages to
    collect and when
  • Needs a link analyzer to generate the new URL list
  • Needs to decide what to collect and what not to

15
Data Gatherer
  • Fetching a file over HTTP is the basic function
  • A webpage parser module extracts link information
    from a retrieved page
  • A URL bank manager module manages the URLs to be
    fetched
  • A robot-controller module manages the data
    collection using multiple clients
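
A minimal sketch of how the parser and URL bank could fit together, assuming a single client and a caller-supplied fetch function (so the example runs without network access). The names LinkParser, crawl, and fetch are hypothetical, and the robot-controller for multiple clients is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Webpage parser: collects absolute link targets from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop "#fragment" parts.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def crawl(seed, fetch, max_pages=100):
    """URL bank as a FIFO queue; fetch(url) returns HTML text or None."""
    bank = deque([seed])
    seen = {seed}
    pages = {}
    while bank and len(pages) < max_pages:
        url = bank.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        pages[url] = html
        parser = LinkParser(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                bank.append(link)
    return pages
```

In a real gatherer, fetch would issue an HTTP GET; here it can be any callable, for example the get method of a dictionary that maps URLs to page text.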

16
Issues of Robot
  • Site-based vs. URL-based
  • Site-based robots are popular, e.g., wget, Teleport
  • robots.txt is easier to implement in a site-based
    robot
  • A URL-based robot is more appropriate for large
    scale search engines
  • Retrieval schedule: breadth-first search (BFS) is
    preferred
  • Incremental retrieval
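
One reason robots.txt handling is simple for a site-based robot is that a single rule file covers every URL the robot will visit on that site. A minimal sketch using Python's standard urllib.robotparser, with the rule text passed in as a string so no network access is needed; the function name allowed_urls is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Keep only the URLs that the site's robots.txt lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

A URL-based robot would instead need to look up (and cache) the rules of whichever site each queued URL belongs to.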

17
Robot Issues
  • What to gather and what not to?
  • Hidden web data collection
  • Focused crawling
  • targeting specialized content of web pages
  • suitable for special search engines
  • evaluated by precision and recall
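
As a reminder of what the two measures mean for a focused crawler: precision is the share of collected pages that are on topic, and recall is the share of on-topic pages that were collected. A minimal sketch, assuming both sets of URLs are known (the function name is hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collected pages vs. on-topic pages."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a crawler that collects four pages of which two are on topic, when three on-topic pages exist, has precision 2/4 and recall 2/3.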

18
Data Preprocessing
  • Remove redundant pages
  • Transform the page into internal data format.
  • Perform web cross-link analysis to generate a URL
    databank.
  • Filter out data that is better left unindexed
  • Partition the data space

19
Redundancy removal
  • 15 to 20% of web pages are replicated on
    different websites, e.g., tutorials for Java,
    Perl, Python
  • Can be implemented by partitioned-hashing or
    external sorting
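
A minimal sketch of the partitioned-hashing approach, assuming that "redundant" means identical text after whitespace and case normalization (near-duplicates are not detected). The digest function, the partition count, and the function names are illustrative choices.

```python
import hashlib
from collections import defaultdict


def content_digest(text):
    """Hash the page text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def remove_duplicates(pages, num_partitions=4):
    """pages: iterable of (url, text). Returns the URLs kept, sorted.

    Digests are first split into partitions so that each partition can be
    deduplicated on its own (on a large collection, one partition at a time
    would fit in memory even if the whole set does not).
    """
    partitions = defaultdict(list)
    for url, text in pages:
        digest = content_digest(text)
        partitions[int(digest, 16) % num_partitions].append((digest, url))

    unique = []
    for bucket in partitions.values():
        seen = set()
        for digest, url in bucket:
            if digest not in seen:  # keep the first URL seen for each digest
                seen.add(digest)
                unique.append(url)
    return sorted(unique)
```

Equal pages always hash to the same partition, so duplicates never need to be compared across partitions.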

20
Ranking the URLs
  • Link analysis counts the references between web
    pages
  • A URL that receives a higher number of references
    gets a higher score
  • Links can be weighted
  • Discount internal links, such as links back to
    the home page
  • Order the web pages by score so that a page with
    a higher rank gets a lower ID
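
A minimal sketch of this scoring scheme, assuming that "internal" means source and target share a host name and that an internal link counts for a fixed fraction of an external one. The weight value and the function name are assumptions, not figures from the slides.

```python
from collections import Counter
from urllib.parse import urlsplit


def rank_by_inlinks(links, internal_weight=0.1):
    """links: iterable of (source_url, target_url). Returns {url: doc_id}.

    Higher-scoring URLs receive lower IDs; ties are broken alphabetically.
    """
    scores = Counter()
    urls = set()
    for source, target in links:
        urls.update((source, target))
        if source == target:  # ignore self-references
            continue
        same_site = urlsplit(source).netloc == urlsplit(target).netloc
        scores[target] += internal_weight if same_site else 1.0
    ordered = sorted(urls, key=lambda url: (-scores[url], url))
    return {url: doc_id for doc_id, url in enumerate(ordered)}
```

Assigning small IDs to high-scoring pages means that an index list sorted by ID is already roughly sorted by link score.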

21
Data Partition
  • The data is partitioned by language type
  • The language partition can be done as follows
  • for each known language, collect a sample of
    web pages in that language
  • build a set of high-frequency terms for each
    language from an analysis of the sample data
  • determine the language of a page by term analysis
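
The procedure above can be sketched as follows, assuming whitespace-delimited languages and simple word tokens; languages such as Chinese would need a different term unit, for example character n-grams. The function names, the profile size, and the overlap-count score are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercased runs of letters (Unicode-aware, digits excluded)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_profiles(samples, top_n=50):
    """samples: {language: [sample documents]} -> {language: frequent terms}."""
    profiles = {}
    for language, docs in samples.items():
        counts = Counter(tokenize(" ".join(docs)))
        profiles[language] = {term for term, _ in counts.most_common(top_n)}
    return profiles


def detect_language(text, profiles):
    """Pick the language whose frequent terms cover the most tokens."""
    tokens = tokenize(text)
    scores = {
        language: sum(token in terms for token in tokens)
        for language, terms in profiles.items()
    }
    best = max(sorted(scores), key=scores.get)
    return best if scores[best] > 0 else None
```

Pages for which no profile matches return None and could be routed to a default partition.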

22
Indexer
  • In general, an inverted file is used as the
    index
  • The indexing task needs a large working space
  • For each indexed term, an index list records the
    files/locations in which the term
    appears
  • The index needs about as much space as the
    original data, or more
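
A minimal in-memory sketch of an inverted file, assuming word tokens and integer document IDs; a production indexer would build the lists on disk and compress them. The function name is hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text} -> {term: [(doc_id, position), ...]}.

    Each index list is ordered by doc_id, then by position in the document.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = re.findall(r"\w+", docs[doc_id].lower())
        for position, term in enumerate(terms):
            index[term].append((doc_id, position))
    return dict(index)
```

Because every term occurrence produces one entry, the index grows with the data, which is why its size is comparable to the original text.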

23
Indexer - implementation issue
  • Data filter module is used to cope with different
    data sources
  • Inversion module is the kernel module
  • Needs to be scalable to handle continuously
    growing data size
  • Hundreds of gigabytes
  • Terabytes
  • Distributed/Concurrent Indexing

24
Indexer - implementation issue
  • Temporary space minimization
  • Index speed is crucial
  • Memory can be utilized to improve the index
    performance
  • Hashing and sorting are the key!

25
Query Processing
  • Use dictionary/stop-list to preprocess the query
    string
  • Parse the query into expressions of tokens
  • Use the index structure to locate the matched
    documents
  • Use a TF-IDF-style technique to score the matched
    documents
  • Combine URL scores to rank the result
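
A minimal sketch of these steps on a small in-memory collection. It assumes a tiny illustrative stop list, one particular TF-IDF variant (term count times log(1 + N/df)), and a linear combination with a per-URL score; none of these choices come from the slides. For brevity it scans the documents instead of consulting an index.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # illustrative stop list


def parse_query(query):
    """Lowercase, tokenize, and drop stop words."""
    return [t for t in re.findall(r"\w+", query.lower()) if t not in STOP_WORDS]


def search(query, docs, url_scores=None, url_weight=0.0):
    """docs: {url: text}. Returns matching URLs, best first."""
    terms = parse_query(query)
    counts = {url: Counter(re.findall(r"\w+", text.lower()))
              for url, text in docs.items()}
    n_docs = len(docs)
    results = []
    for url, term_counts in counts.items():
        score = 0.0
        for term in terms:
            df = sum(1 for c in counts.values() if term in c)
            if term_counts[term]:
                score += term_counts[term] * math.log(1 + n_docs / df)
        if score > 0:  # keep only documents that match at least one term
            if url_scores:
                score += url_weight * url_scores.get(url, 0.0)
            results.append((score, url))
    return [url for score, url in sorted(results, key=lambda r: (-r[0], r[1]))]
```

Raising url_weight lets a well-linked page outrank a page that merely repeats the query term more often.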

26
Search CGI programs
  • search agent CGI
  • parse the query and fork a searcher process to do
    the search (or use IPC to query the searcher)
  • when the searcher returns, analyze and process
    the result for formatted output
  • process the result and store it in tmp result
    store
  • log query and some status info
  • cgi for view-next-page
  • showmatch cgi

27
Output control
  • Site grouping
  • group the pages from same website together
  • Title grouping
  • group the pages with similar title
  • Sort the output according to certain criteria
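
Site grouping can be sketched as follows, assuming the host name identifies the website and the input list is already in rank order; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def group_by_site(ranked_urls):
    """Group ranked URLs by host, keeping groups in order of their best hit."""
    groups = {}
    for url in ranked_urls:
        groups.setdefault(urlsplit(url).netloc, []).append(url)
    return list(groups.items())
```

Each site appears once, at the position of its highest-ranked page, with its other pages listed under it.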

28
Interaction
  • Term Suggestion
  • Related terms
  • thesaurus
  • term-expansion
  • error correction
  • phonetic
  • spelling
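
As one possible form of spelling correction, a misspelled query term can be matched against the indexed vocabulary by string similarity. This sketch uses difflib.get_close_matches from Python's standard library; the function name, the cutoff, and the idea of using the index vocabulary as the dictionary are assumptions.

```python
import difflib


def suggest_terms(word, vocabulary, max_suggestions=3):
    """Return vocabulary terms that look similar to word, best match first."""
    return difflib.get_close_matches(
        word.lower(), vocabulary, n=max_suggestions, cutoff=0.6
    )
```

Phonetic correction and thesaurus-based term expansion would need additional resources and are not covered by this sketch.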

29
Personalization
  • Keep track of a user's interests so that the
    search results can be tuned to improve user
    satisfaction
  • Query Tracking and classification

30
Service tools
  • Query cache to improve search performance for
    queries that have already been served
  • Use a memory-cached file system to reduce the disk
    access overhead
  • Mechanism for special case handling
  • Log analyzer
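
A minimal sketch of a query cache, assuming a least-recently-used eviction policy and whitespace/case normalization of the query string. The class name, the capacity, and the eviction policy are assumptions; the slides only state that served queries are cached.

```python
from collections import OrderedDict


class QueryCache:
    """Wraps a search function and remembers results of recent queries."""

    def __init__(self, search_fn, capacity=1000):
        self._search_fn = search_fn
        self._capacity = capacity
        self._cache = OrderedDict()
        self.hits = 0

    def search(self, query):
        key = " ".join(query.lower().split())  # normalize case and spacing
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._cache[key]
        result = self._search_fn(key)
        self._cache[key] = result
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return result
```

The hits counter gives a simple figure that a log analyzer could report alongside query volume.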

31
Research Issues
  • Hidden Web data collection
  • Distributed index/search
  • Index minimization, incremental Indexing
  • Smart robot
  • Intelligent Retrieval
  • Output result auto classification/clustering
  • Data source clustering/classification
  • classifying/clustering the whole web

32
Conclusion
  • Size does matter
  • We are still searching for a better engine!