Document Clustering and Collection Selection - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Document Clustering and Collection Selection
  • Diego Puppin
  • Web Mining, 2006-2007

2
Web Search Engines: Parallel Architecture
  • To improve throughput, latency, and result quality
  • A broker acts as the single interface
  • Composed of several computing servers
  • Queries are routed to a subset of the servers
  • The broker collects and merges the results (see the sketch below)
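A minimal sketch of the broker's role, assuming a hypothetical server object with a search(query, k) method returning (doc_id, score) pairs; none of these names come from the slides:

  # Hypothetical sketch: the broker fans a query out to the computing servers
  # and merges their partial result lists by score.
  from concurrent.futures import ThreadPoolExecutor

  def broker_search(servers, query, k=10):
      # Ask every selected server for its top-k local results in parallel.
      with ThreadPoolExecutor() as pool:
          partials = pool.map(lambda s: s.search(query, k), servers)
      # Merge all partial lists and keep the k best hits overall.
      merged = [hit for part in partials for hit in part]   # hit = (doc_id, score)
      merged.sort(key=lambda hit: hit[1], reverse=True)
      return merged[:k]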

3-6
(No transcript: figure slides)
7
Doc-Partitioned Approach
  • The document base is split among servers
  • Each server indexes and manages queries for its
    own documents
  • Each server knows all the terms of its own documents
  • Better scalability of indexing and search
  • Each server is independent
  • Documents can be easily added or removed (see the sketch below)
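A minimal sketch of document partitioning, assuming an in-memory corpus as a dict of doc_id to text; the hash-based assignment is only one illustrative policy, not something the slides prescribe:

  # Hypothetical sketch: split the document base among servers and let each
  # server build an inverted index over its own shard only.
  from collections import defaultdict

  def partition_documents(docs, n_servers):
      shards = [dict() for _ in range(n_servers)]
      for doc_id, text in docs.items():
          shards[hash(doc_id) % n_servers][doc_id] = text   # e.g. hash/random assignment
      return shards

  def build_local_index(shard):
      index = defaultdict(set)                  # term -> local doc ids
      for doc_id, text in shard.items():
          for term in text.lower().split():
              index[term].add(doc_id)
      return index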

8
Term-Partitioned Approach
  • The dictionary is split among servers
  • Each server stores the index for some terms
  • It knows documents where its terms occur
  • Potential for load reduction
  • Poor load balancing (though some work addresses this; see the sketch below)
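By contrast, a sketch of term partitioning under the same illustrative assumptions, showing why only the servers holding the query's terms need to be contacted:

  # Hypothetical sketch: split the dictionary among servers, so each server
  # stores the posting lists for a subset of the terms.
  def term_to_server(term, n_servers):
      return hash(term) % n_servers

  def servers_for_query(query_terms, n_servers):
      # Only the servers that own at least one query term are hit.
      return sorted({term_to_server(t, n_servers) for t in query_terms})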

9
Some considerations
  • Every time you add/remove a doc, you must update MANY servers
  • With queries, only the relevant servers are queried, but servers with hot terms are overloaded

10
How To Load Balance
  • Put related terms together
  • This minimizes the number of servers hit per query
  • Try to put together groups of documents with similar overall frequency
  • Servers shouldn't be overloaded
  • Query logs could be used to predict load (see the sketch below)
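One possible reading of the last point, as a sketch: estimate each term's load from a query log, then assign terms to servers greedily so the predicted load stays balanced. The greedy strategy is my assumption, not the slides':

  # Hypothetical sketch: predict per-term load from a query log, then assign
  # the heaviest terms first to the currently least-loaded server.
  from collections import Counter
  import heapq

  def balanced_term_assignment(query_log, n_servers):
      load = Counter(t for q in query_log for t in q.lower().split())
      heap = [(0, i, []) for i in range(n_servers)]   # (predicted load, server id, terms)
      heapq.heapify(heap)
      for term, freq in load.most_common():
          total, i, terms = heapq.heappop(heap)
          terms.append(term)
          heapq.heappush(heap, (total + freq, i, terms))
      return {i: terms for _, i, terms in heap}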

11
Multiple indexes
  • Term-based index: query-vector or bag-of-words
  • Hot-text index: titles, anchors, bold text, etc.
  • Link-based index: can find related pages, etc.
  • Key-phrase index: for some idioms

12
(Diagram: the broker routes query terms to the multiple indexes; labels: query terms, terms, links, hot, referrer, related)
13
How To Doc-Partition?
  • Random doc assignment + query broadcast
  • Collection selection for independent collections (meta-search)
  • Smart doc assignment + collection selection
  • Random assignment + random selection

14
1. Random Broadcast
  • Used by commercial WSEs
  • No computing effort for doc clustering
  • Very high scalability
  • Low latency on each server
  • Result collection and merging is the heaviest part

15
Distributed/Replicated Documents
16
2. Independent collections
  • The WSE uses data from several sources
  • It routes the query to the most authoritative
    collection(s)
  • It collects the results according to independent
    ranking choices (HARD)
  • Example: Biology, News, Law

17
(No Transcript)
18
3. Assignment Selection
  • WSE creates document groups
  • Each server holds one group
  • The broker has knowledge of group placement
  • The selection strategy routes the query accordingly (see the sketch below)
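A minimal sketch of the broker side, assuming it keeps per-group term statistics. The scoring here is a naive document-frequency sum, purely to illustrate routing; the following slides discuss real selection functions such as CORI:

  # Hypothetical sketch: score each document group for the query and route
  # the query only to the top-ranked groups.
  def select_groups(query_terms, group_stats, top_k=3):
      # group_stats: {group_id: {term: document frequency of that term in the group}}
      scores = {g: sum(df.get(t, 0) for t in query_terms)
                for g, df in group_stats.items()}
      return sorted(scores, key=scores.get, reverse=True)[:top_k]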

19
(No Transcript)
20
4. Random + Random
  • If data (pages, resources...) are replicated,
    interchangeable, hard to index
  • Data are stored in the server that publishes them
  • We query a few servers hoping to get something

21
(No Transcript)
22
CORI
  • The Effect of Database Size Distribution on
    Resource Selection Algorithms, Luo Si and Jamie
    Callan
  • Extends the concept of TF.IDF to collections (see the formula sketch below)
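For reference, a sketch of the CORI selection formula as usually reported in the literature; the constants b = 0.4, 50 and 150 are the commonly cited defaults, and the per-term beliefs are then combined (typically averaged) over the query terms, so double-check against the paper:

  % CORI: belief that collection c_i answers query term r_k
  T = \frac{df}{df + 50 + 150 \cdot cw_i / \overline{cw}}
  \qquad
  I = \frac{\log\!\left((|C| + 0.5)/cf\right)}{\log(|C| + 1.0)}
  \qquad
  p(r_k \mid c_i) = b + (1 - b) \cdot T \cdot I
  % df: number of documents in c_i containing r_k; cf: number of collections containing r_k;
  % cw_i: word count of c_i; \overline{cw}: average word count; |C|: number of collections; b = 0.4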

23-24
(No transcript: figure slides)
25
CORI
  • Needs deep collaboration from the collections
  • Data about terms, documents, size
  • Unfeasible with independent collections
  • Statistical sampling, query-based sampling
  • Term-based: no links, no anchors
  • Very large footprint

26
(No Transcript)
27
Querylog-based Collection Selection
  • Using Query Logs to Establish Vocabularies in
    Distributed Information Retrieval
  • Milad Shokouhi, Justin Zobel, S.M.M. Tahaghoghi, Falk Scholer
  • The collections are sampled using data from a query log (see the sketch below)
  • Before: queries over a dictionary (QBS)
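A sketch of the idea, assuming each collection exposes a hypothetical search(query, k) method returning document texts. The interface is made up; the point is only that the collection profile is built from logged queries rather than dictionary probes:

  # Hypothetical sketch: describe each collection by the terms of the documents
  # it returns for queries drawn from a real query log.
  from collections import Counter, defaultdict

  def sample_with_query_log(collections, query_log, per_query=10):
      profile = defaultdict(Counter)               # collection id -> term counts
      for query in query_log:
          for cid, engine in collections.items():
              for doc_text in engine.search(query, per_query):
                  profile[cid].update(doc_text.lower().split())
      return profile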

28
  • Recall: the fraction of the relevant documents that are found (not usable for the web, where the full relevant set is unknown)
  • Precision at X (P@X): the number of relevant documents among the first X results (X = 5, 10, 20, 100)
  • Average precision: the average of the precision values for increasing X
  • MAP (Mean Average Precision): the average precision averaged over all queries (see the code sketch below)
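A sketch of these measures in code; note that average_precision below uses the standard formulation (precision averaged at each relevant document retrieved), which is a slightly more precise statement of the slide's "average for increasing X":

  # ranked: list of doc ids returned by the system; relevant: set of relevant doc ids
  def precision_at(ranked, relevant, x):
      return sum(1 for d in ranked[:x] if d in relevant) / x

  def average_precision(ranked, relevant):
      hits, precisions = 0, []
      for i, d in enumerate(ranked, start=1):
          if d in relevant:
              hits += 1
              precisions.append(hits / i)          # precision at this recall point
      return sum(precisions) / len(relevant) if relevant else 0.0

  def mean_average_precision(runs):
      # runs: list of (ranked, relevant) pairs, one per query
      return sum(average_precision(r, rel) for r, rel in runs) / len(runs)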

29
(No Transcript)
30
And now for something completely different!
31
Important features
  • It can use any underlying WSE
  • Links, snippets, anchors...
  • As good or bad as the WSE it uses!
  • Small footprint
  • It can be added as another index

32
(Diagram: as slide 12, with an added query-vector index and a query-mapping step at the broker; labels: query terms, terms, links, hot, query-vector, referrer, related, query mapping)
33
Developments
  • Query suggestions: the system finds related queries
  • Result grouping: documents are already organized into groups
  • Query expansion: can find more complex queries that still match

34-36
(No transcript: figure slides)
37
Still missing
  • Using the QV model as a full IR model
  • Is it possible to perform queries over this
    representation?
  • New query terms cannot be found
  • CORI should be used for unseen queries
  • Better testing with topic shift

38
Possible seminars/projects
  • Advanced collection selection
  • Statistical sampling, Query-based sampling
  • Partitioning a LARGE collection (1 TB)
  • Load balancing for doc- and term-partitioning
  • Topic shift
  • Query log analysis