swcs.ccu.edu.tw

Provided by: Gai8

Transcript and Presenter's Notes
1
(Title slide; original Chinese title and presenter name lost in transcription)
  • Contact: sw_at_cs.ccu.edu.tw

2
What is a search engine?
  • A web service site that helps Internet users find
    information in cyberspace
  • The software that provides the web search service

3
Uses of search engines
  • Look up contact info for a person or an
    organization
  • Search for information related to a term, e.g., to
    collect information about a topic of interest
  • Search for the URL of a company/website
  • Look for news regarding XXX
  • Treat the search engine as a big dictionary
  • Search for products/movies/travel
  • Search for games/software
  • ...

4
Types of search engines
  • Directory browse/search
  • Web pages search
  • Image Search/Multimedia Object Search
  • News Search
  • EC Search
  • Newsgroup/BBS search
  • FTP search
  • People/organization search
  • Daily-life information search
  • Library search/Literature search
  • Dictionary search
  • ...

5
Example search engines
  • Yahoo,
  • Google,
  • AltaVista,
  • MSN,
  • Excite,
  • Lycos, ...
  • YAM, Kimo, PCHome
  • GAIS, Openfind, ...
  • DejaNews,
  • Archie, ...

6
Portal Services
  • Directory / Search
  • Daily information: weather, maps, TV, ...
  • Free email, free web pages, calendar
  • Personalized services, channel subscription
  • Web Chat,
  • E-Commerce,
  • Content Aggregation
  • ...

7
Directory implementation
  • Each URL entry is stored as a record
  • The URL records are managed by a database system
  • A search function is provided for searching the
    records in the directory tree
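The record-per-URL scheme above can be sketched in a few lines of Python; the category paths and sample records below are made up for illustration, and a real directory would of course live in a database system:

```python
# Toy directory tree: each URL record is filed under a category path.
# Category names and records below are illustrative only.
DIRECTORY = {
    "Computers/Internet/Search_Engines": [
        {"url": "http://gais.cs.ccu.edu.tw", "title": "GAIS search engine"},
    ],
    "Computers/Internet/Portals": [
        {"url": "http://www.yam.com", "title": "YAM portal"},
    ],
}

def search_directory(term, category_prefix=""):
    """Search URL records, optionally restricted to one category subtree."""
    hits = []
    for path, records in DIRECTORY.items():
        if not path.startswith(category_prefix):
            continue  # stay inside the requested category subtree
        hits.extend(r for r in records if term.lower() in r["title"].lower())
    return hits
```

Searching for "engine" with no prefix finds the GAIS record, while restricting the same search to the Portals subtree finds nothing — exactly the "search within a category" behavior directories support.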

8
Directory implementation
  • The search is generally for locating a website
    or a category of websites
  • Data is entered through manual registration by
    the website owner or the surfer
  • Managing the directory tree requires
    intensive labor by people familiar
    with the relevant domain knowledge

9
The Advantages/Disadvantages of Directory
Search Engines
  • Advantages
  • The data is manually maintained, so it
    contains less noise and is more precise
  • Search output can be categorized and
    better organized
  • Search within a category is supported

10
The Advantages/Disadvantages of Directory
search engine
  • Disadvantages
  • Data coverage is limited; sometimes the
    wanted page cannot be found
  • Does not support relevance ranking
  • Labor intensive

11
Topics
  • Automatic Classification
  • Error detection of the classification tree
  • Consistency Checking
  • Link Evolutions
  • Link Revolutions

12
Implementation of Webpage search engine
  • Data Gathering Subsystem
  • Data Preprocessing Subsystem
  • Indexing Subsystem
  • Query Processing Subsystem
  • Service Management

13
Search Engines Evaluations
  • 0. The quality of a search engine's results
    basically depends on
  • a. the quality of the underlying data
  • b. the search techniques, such as ranking
  • 1. Data coverage should be large enough
  • 2. Data needs to be filtered, e.g., removing
    redundant pages

14
  • 3. Should be able to find a page if it exists
  • 4. Quality of ranking
  • 5. Speed and scalability
  • 6. Search features
  • 7. Intelligence
  • I.e., the evaluation points are
  • Quality, speed, scale, robustness, features,
    intelligence

15
Evaluation and Comparison
  • How to compare different engines fairly?
  • How to evaluate a search engine in a fair and
    scientific way?
  • How about a model for calculating the degree of
    user satisfaction?
  • How to test the correctness?

16
Data Gatherer
  • Also known as spider, crawler, robot, ...
  • Periodically traverses the web to collect web
    pages
  • Needs list management to decide which pages to
    collect and when
  • Needs a link analyzer to generate new URL lists
  • Needs to decide what to collect and what not to

17
Data Gatherer
  • A get-file function over the HTTP protocol is the
    basic building block
  • A webpage parser module extracts link info
    from each retrieved page
  • A URL bank manager module manages the URLs to be
    fetched
  • A robot-controller module manages data
    collection across multiple clients
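The modules above combine into a single fetch-parse-enqueue loop. A minimal sketch: to stay self-contained, the HTTP get-file step is replaced by a small in-memory "web" (the example URLs and pages are invented); a real gatherer would use an HTTP client plus robots.txt checks:

```python
import re
from collections import deque

# Stand-in for real HTTP fetches: url -> page body (illustrative only).
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a>',
    "http://c.example/": '',
}

def extract_links(html):
    """Webpage-parser module: pull href targets out of a page."""
    return re.findall(r'href="([^"]+)"', html)

def crawl(seed, limit=10):
    """URL-bank manager: breadth-first fetch with URL de-duplication."""
    seen, queue, pages = set(), deque([seed]), {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        if url in seen or url not in FAKE_WEB:
            continue
        seen.add(url)
        pages[url] = FAKE_WEB[url]          # the "get-file" step
        for link in extract_links(pages[url]):
            if link not in seen:
                queue.append(link)          # new URLs into the bank
    return pages

pages = crawl("http://a.example/")
print(sorted(pages))  # all three pages reached
```

A robot-controller would run many such loops concurrently against a shared URL bank; the `limit` parameter here stands in for real retrieval scheduling.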

18
Robot Issues
  • Site-based vs. URL-based
  • Site-based is popular, e.g., wget, Teleport
  • robots.txt is easier to honor in a site-based
    robot
  • A URL-based robot is more appropriate for
    large-scale search engines
  • Retrieval scheduling
  • Incremental retrieval
  • robots.txt processing
  • DoS prevention
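robots.txt processing need not be written from scratch: Python's standard `urllib.robotparser` can evaluate the rules once the file has been fetched. The rules and URLs below are made-up examples:

```python
import urllib.robotparser

# Parse a robots.txt body directly (assumes the crawler already
# fetched the file; the rules here are illustrative only).
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The robot consults these answers before each fetch.
print(rp.can_fetch("mybot", "http://example.com/index.html"))  # True
print(rp.can_fetch("mybot", "http://example.com/private/x"))   # False
```

A polite robot also rate-limits per host, which doubles as the DoS-prevention measure listed above.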

19
Robot Issues
  • What to gather and what not to?
  • Hidden-web data collection
  • JavaScript
  • CGI
  • Focused crawling
  • targeting specialized web-page content
  • suitable for special-purpose search engines
  • evaluated by precision and recall
  • Spam detection
  • Crawling optimization
  • Scale, speed, quality, efficiency, ...
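The precision and recall used to evaluate a focused crawler are simple set computations over the pages fetched and the pages actually relevant to the target topic:

```python
def precision_recall(fetched, relevant):
    """Precision: fraction of fetched pages that are relevant.
    Recall: fraction of relevant pages that were fetched."""
    fetched, relevant = set(fetched), set(relevant)
    true_positives = len(fetched & relevant)
    precision = true_positives / len(fetched) if fetched else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a crawl that fetches 4 pages of which 2 are relevant, out of 3 relevant pages overall, scores precision 0.5 and recall 2/3.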

20
Robot Program Traps
  • Watch out for the traps
  • From the colo manager:
  • "One more complaint call and I will shut down all
    your servers!"
  • "This is XYZ's legal office. We have received a
    complaint from one of our customers that your
    sites are launching a DoS attack that has caused
    serious damage to their business"
  • Guess how many alias names a site can have
  • "Your bot deleted the content of my site!"

21
Data Preprocessing
  • Remove redundant pages
  • Transform each page into an internal data format
  • Perform cross-link analysis to generate a URL
    database and a linkage database
  • Partition the data space
  • Language classification
  • Knowledge classification
  • Data ranking
  • Data typing (EC, News, BBS, ...)
  • Time detection
  • Hub identification

22
Data Preprocessing
  • Approximate file detection
  • Key-term generation
  • Redundancy removal (data optimization)
  • Data filtering
  • Essential-body identification
  • Automatic summarization
  • Automatic dictionary generation
  • Thesaurus dictionary generation
  • WNS analysis
  • Name selection
  • Data compression

23
Redundancy removal
  • 15 to 20% of web pages are replicated across
    different websites, e.g., popular tutorials for
    Java, Perl, Python, ...
  • Can be implemented by partitioned hashing or
    external sorting
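A minimal hash-based sketch of redundancy removal (the page bodies and URLs are invented): pages whose bodies hash to the same digest are treated as replicas, and only the first copy is kept. A production system would partition the hash space so the comparison scales.

```python
import hashlib

def remove_redundant(pages):
    """pages: url -> body. Keep one copy per distinct content hash."""
    seen_digests, unique = set(), {}
    for url, body in pages.items():
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        if digest not in seen_digests:   # first copy of this content wins
            seen_digests.add(digest)
            unique[url] = body
    return unique

pages = {
    "http://x.example/java": "java tutorial text",
    "http://y.example/java": "java tutorial text",   # replica, dropped
    "http://x.example/perl": "perl tutorial text",
}
print(len(remove_redundant(pages)))  # 2
```

Exact hashing only catches byte-identical replicas; the "approximate file detection" item above covers near-duplicates, which need fuzzier fingerprints.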

24
Ranking the URLs
  • Link analysis counts the mutual
    references between web pages
  • A URL receiving a higher number of references
    gets a higher score
  • weighted links
  • discounted internal links // such as "back to home"
  • Order the web pages by score so that a
    page with higher rank has a lower ID
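A minimal reference-counting ranker along these lines (the link pairs are invented; internal links are discounted entirely in this sketch, where a real system would merely down-weight them):

```python
from collections import Counter
from urllib.parse import urlparse

def rank_urls(links):
    """links: (source, target) URL pairs. Score = external in-links;
    higher-ranked pages receive lower IDs."""
    score = Counter()
    for src, dst in links:
        if urlparse(src).netloc == urlparse(dst).netloc:
            continue  # discount internal links, e.g. "back to home"
        score[dst] += 1
    ordered = sorted(score, key=score.get, reverse=True)
    return {url: page_id for page_id, url in enumerate(ordered)}

ids = rank_urls([
    ("http://a.example/", "http://b.example/"),
    ("http://c.example/", "http://b.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/page", "http://b.example/"),  # internal, ignored
])
print(ids["http://b.example/"])  # 0 (two external references, top rank)
```

Numbering pages so the best-ranked one gets ID 0 lets later stages (index and result lists) emit high-quality pages first by construction.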

25
Data Partition
  • The data is partitioned by language type
  • Language partitioning can be done as follows:
  • for each known language, collect a certain amount
    of web pages in that language
  • build a high-frequency term set for each language
    from analysis of the sample data
  • determine the language type by term analysis
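The three steps above can be sketched directly: build a high-frequency term profile per language from sample pages, then classify new text by term overlap. The tiny samples here are stand-ins for real page collections:

```python
from collections import Counter

def build_profiles(samples, top_n=20):
    """samples: language -> list of sample texts. Profile = top-N terms."""
    profiles = {}
    for lang, docs in samples.items():
        freq = Counter(w for doc in docs for w in doc.lower().split())
        profiles[lang] = {term for term, _ in freq.most_common(top_n)}
    return profiles

def classify_language(text, profiles):
    """Pick the language whose high-frequency terms overlap the text most."""
    words = set(text.lower().split())
    return max(profiles, key=lambda lang: len(words & profiles[lang]))

profiles = build_profiles({
    "english": ["the quick brown fox jumps over the lazy dog",
                "to be or not to be that is the question"],
    "french":  ["le renard brun saute par dessus le chien",
                "etre ou ne pas etre telle est la question"],
})
print(classify_language("the dog and the fox", profiles))  # english
```

For languages without whitespace word boundaries (such as Chinese), the same idea applies to character n-grams rather than space-separated terms.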

26
Indexer
  • In general, an inverted file is used as the
    index
  • Indexing requires a large data space
  • For each indexed term, an index list records
    which files/locations the term
    appears in
  • Needs about the same space as, or more than, the
    original data
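A minimal inverted file along these lines: for each term, a postings list of (document ID, position) pairs. The documents are invented, and positions here are word offsets:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: doc_id -> text. Returns term -> [(doc_id, position), ...]."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index

index = build_inverted_index({
    1: "web search engine",
    2: "search the web",
})
print(index["search"])  # [(1, 1), (2, 0)]
```

The postings lists are why the index needs space comparable to the data itself: every term occurrence in every document produces one entry.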

27
Indexer issues
  • A data filter/converter module copes with
    different data sources
  • Can be decomposed into two parts
  • Page index
  • Inverted index
  • Needs to be scalable to handle continuously
    growing data sizes
  • Hundreds of gigabytes
  • Terabytes
  • Distributed/concurrent indexing

28
Indexer issues
  • Temporary-space minimization
  • Index-speed optimization (sequential and
    concurrent)
  • Memory can be used to improve indexing
    performance
  • Index-size optimization
  • Hashing and sorting are the key!

29
Query Processing
  • Use a dictionary/stop list to preprocess the query
    string
  • Parse the query into expressions of tokens
  • Use the index structure to locate the matches
  • Use a TF-IDF-type technique to score the matched
    documents
  • Combine URL scores to rank the results
  • PageRank, WNS, BNS
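A toy version of the TF-IDF scoring step (the documents are invented, and a real engine would combine these scores with link-based URL scores such as PageRank rather than use them alone):

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """docs: doc_id -> text. Rank matched documents by TF-IDF score."""
    n_docs = len(docs)
    terms = query.lower().split()
    # document frequency: in how many documents each query term appears
    df = {t: sum(t in text.lower().split() for text in docs.values())
          for t in terms}
    scores = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        tf = Counter(words)
        score = sum((tf[t] / len(words)) * math.log(n_docs / df[t])
                    for t in terms if df[t])
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "search search the web", 2: "cooking recipes",
        3: "web search engine news"}
print(tfidf_rank("search", docs))  # [1, 3]
```

Document 1 mentions "search" twice and so outranks document 3; document 2 contains no query term and is not returned at all, matching the "locate the matches, then score" flow above.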

30
Search CGI programs
  • search-agent CGI
  • parses the query and forks a searcher process to
    do the search (or uses IPC to query the searcher)
  • when the searcher returns, analyzes and processes
    the result for formatted output
  • processes the result and stores it in a temporary
    result store
  • logs the query and some status info
  • What portion (matched area) to display?
  • Show cached copies
  • Similar pages

31
Output control
  • Site grouping
  • group pages from the same website together
  • Title grouping
  • group pages with similar titles
  • Output clustering (classification)
  • Ontology-guided clustering
  • Ranking and ordering
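Site grouping reduces to bucketing result URLs by hostname; a minimal sketch (the URLs are invented):

```python
from urllib.parse import urlparse

def group_by_site(result_urls):
    """Group result pages from the same website together."""
    groups = {}
    for url in result_urls:
        groups.setdefault(urlparse(url).netloc, []).append(url)
    return groups

groups = group_by_site([
    "http://a.example/page1",
    "http://b.example/index",
    "http://a.example/page2",
])
print(groups["a.example"])  # both a.example pages, in result order
```

Title grouping follows the same pattern with a fuzzier key (a normalized or fingerprinted title) in place of the hostname.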

32
Interaction
  • Term Suggestion
  • Related terms
  • thesaurus
  • term-expansion
  • error correction
  • phonetic
  • spelling
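Spelling-type error correction is commonly done with edit distance against the engine's dictionary. A minimal sketch (the vocabulary below is a made-up stand-in for an automatically generated dictionary):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance, rolling one row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete from a
                                     dp[j - 1] + 1,    # insert into a
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def suggest(term, vocabulary, max_dist=2):
    """Offer spelling corrections within max_dist edits of the query term."""
    return sorted(w for w in vocabulary
                  if 0 < edit_distance(term, w) <= max_dist)

print(suggest("serch", {"search", "engine", "query"}))  # ['search']
```

Phonetic correction, also listed above, uses a sound-based key (e.g. a Soundex-style code) instead of edit distance, so "rite" can suggest "right".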

33
Personalization
  • Keep track of a user's interests so that
    search results can be tuned to improve that
    user's satisfaction
  • Query tracking and classification
  • Personalized ranking

34
Service tools
  • A query cache improves search
    performance for queries that have already been
    served
  • Use a memory-cached file system to reduce
    disk-access overhead
  • Mechanisms for special-case handling
  • Log analyzer
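The query cache can be as simple as a memo table in front of the searcher; the toy searcher below is a placeholder for the real search call:

```python
def make_cached_search(search_fn):
    """Wrap a searcher with a query cache for already-served queries."""
    cache = {}
    stats = {"hits": 0, "misses": 0}
    def cached(query):
        if query in cache:
            stats["hits"] += 1          # served from cache, no search
        else:
            stats["misses"] += 1
            cache[query] = search_fn(query)
        return cache[query]
    cached.stats = stats
    return cached

search = make_cached_search(lambda q: [q.upper()])  # stand-in searcher
search("web search")
search("web search")  # repeat query, served from cache
print(search.stats)   # {'hits': 1, 'misses': 1}
```

In production the cache also needs an eviction policy and invalidation when the index is rebuilt, which this sketch omits.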

35
More topics
  • Query-log mining/analysis
  • Hot terms
  • Event mining
  • Dictionary, thesaurus, synonym generation
  • Session analysis
  • User behavior analysis (what do they want?)
  • Query transformation and optimization
  • Natural language processing
  • Q&A
  • Semantic indexing
  • Dead-link minimization
  • On-the-fly query-attack detection and service
    protection
  • Intelligent proxy
  • Correlated search

36
Conclusion
  • Size does matter
  • We are still searching for a better engine!