Algorithms and Data Structures


Title: Topical clustering of search results
Author: Ugo
Last modified by: Rossano Venturini
Document presentation format: On-screen show (4:3)


1

Algorithms and Data Structures
for Massive Datasets
(Acube Lab)

Rossano Venturini Dipartimento di
Informatica Università di Pisa
Paolo Ferragina Giuseppe Prencipe Marco
Cornolti Andrea Farruggia Giovanni Micale
Francesco Piccinno Giorgio Audrito
2
A3 Lab (acube.di.unipi.it)

Algorithms and data structures for massive datasets
Data Compression
Compressed Indexing
Web or arbitrary texts
Storage and analysis of massive graphs
Information Retrieval on news, tweets, ...

Submitted US patents: 3 with Yahoo, 1 with NYU
Accepted US patents: 1 with U. Rutgers, 1 with AT&T-Lucent
3
Social Networks and Social Data

Given an idea, you need the right platform to
implement it
HW and SW (IT Center)
Algorithms (our Lab)

Graph structure and textual content
Nodes: users (1 billion)
Explicit edges: friend, follower, retweet, +1, ... (10 billion)
Implicit edges: similarity, co-occurrence, click, ... (100 billion)

4
NoSQL
2006
Hadoop
Cassandra
HyperTable
Cosmos
5
Storage and access to Labeled Graphs

Compress the graph structure
Compress the node and edge labels
Guarantee fast access, dynamicity and search
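
One common way to compress a graph structure is to sort each adjacency list, store the gaps between consecutive neighbor ids, and write each gap with a variable-length byte code. The sketch below only illustrates that general idea; it is not taken from the slides, and the function names are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer with 7 payload bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_adjacency(neighbors: list[int]) -> bytes:
    """Sort the neighbor ids, then store each gap to the previous id as a varint."""
    out = bytearray()
    prev = 0
    for v in sorted(neighbors):
        out += encode_varint(v - prev)
        prev = v
    return bytes(out)


def decompress_adjacency(data: bytes) -> list[int]:
    """Invert compress_adjacency: decode the gaps and prefix-sum them."""
    neighbors, cur, shift, prev = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += cur
            neighbors.append(prev)
            cur, shift = 0, 0
    return neighbors
```

With neighbors 1000000, 1000003, 1000010 and 1000500, the gaps are 1000000, 3, 7 and 490, so the list fits in 7 bytes. Decoding is sequential, so fast random access and dynamic updates would need extra structure on top of this.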

6
Data Compression: Theory and Engineering

J. ACM 05
ACM-SIAM SODA 09-14
ACM WSDM 10
ESA 11-14
Algorithmica 12
SIAM J. Computing 13

Key issue
Minimize space occupancy
Maximize decompression speed

Compressors on DBLP:

Compressor    Compressed space (MB)    Decompression time (secs)
Gzip          191                      11.6
bzip2         121                      49
Snappy        323                      2.1
LZ4           215                      1.9
Our result    130 to 149               2.9 to 1.9
A new algorithmic concept: multi-objective design of compressors

Two interesting scenarios
- Energy-efficiency issues
- Cloud computing

Can we fix the space occupancy and minimize the
decompression time? Or, vice versa?
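
The two questions can be read as constrained optimizations: bound one resource and minimize the other. As a rough sketch, the code below applies both to the off-the-shelf compressors and the DBLP figures reported on this slide. It only selects among fixed compressors, which is a simplification and not the multi-objective compressor design itself; the type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Profile(NamedTuple):
    name: str
    space_mb: float         # compressed size of DBLP
    decompress_secs: float  # decompression time


# Figures copied from the DBLP table on this slide.
DBLP_PROFILES = [
    Profile("Gzip", 191, 11.6),
    Profile("bzip2", 121, 49),
    Profile("Snappy", 323, 2.1),
    Profile("LZ4", 215, 1.9),
]


def fastest_within_space(profiles, max_space_mb) -> Optional[Profile]:
    """Fix the space budget, minimize the decompression time."""
    feasible = [p for p in profiles if p.space_mb <= max_space_mb]
    return min(feasible, key=lambda p: p.decompress_secs, default=None)


def smallest_within_time(profiles, max_secs) -> Optional[Profile]:
    """Fix the time budget, minimize the space occupancy."""
    feasible = [p for p in profiles if p.decompress_secs <= max_secs]
    return min(feasible, key=lambda p: p.space_mb, default=None)
```

With a 200 MB budget the first function returns Gzip; with a 3-second budget the second returns LZ4. Both return None when no compressor meets the bound.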
7
Compressed Indexing: Theory and Engineering

J. ACM 05
ACM SIGIR 07
J. ACM 09
ACM Trans. Algo. 10
ESA 13
ACM-SIAM SODA 13
and many others

Key issue
Minimize space occupancy
Maximize substring-search throughput

Suffix-array compressible - Bzip searchable

Performance over hundreds of MBs on a commodity PC
Count(P) takes 5 microsecs/char, taking about bzip's space
Locate(P) outputs 100K occ/sec, taking 10% space
This may be 4x faster than IL, within <35% space occupancy
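
To make the two operations concrete, here is a minimal sketch of Count(P) and Locate(P) over a plain, uncompressed suffix array. A compressed index supports the same queries in far less space; this version only shows what they return. The function names are hypothetical, and the construction is the naive one, not a linear-time algorithm.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _pattern_range(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Half-open range [lo, hi) of suffix-array rows that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first row whose m-char prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first row whose m-char prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(text: str, sa: list[int], pattern: str) -> int:
    """Number of occurrences of pattern in text."""
    lo, hi = _pattern_range(text, sa, pattern)
    return hi - lo


def locate(text: str, sa: list[int], pattern: str) -> list[int]:
    """Sorted start positions of all occurrences of pattern in text."""
    lo, hi = _pattern_range(text, sa, pattern)
    return sorted(sa[lo:hi])
```

On the text "mississippi", count for "ssi" is 2 and locate returns positions 2 and 5.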

8
Compressed Indexing: Theory and Engineering
NoSQL DB
The <key,value> problem

Trie: 14x more space than input data.
Front-coding, two-level indexing
110% of input data
4 microsecs/char
Our Compressed Permuterm
< 25% of input data, i.e. close to bzip2
10 to 60 microsecs/char
So, time close to FC but one-fourth of its space
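
For reference, front coding (FC) stores each key of a sorted dictionary as the length of the prefix it shares with the previous key, plus the remaining suffix. The sketch below shows only that basic step, with hypothetical function names; the two-level indexing mentioned above, which restarts the coding in buckets to allow fast lookups, is left out.

```python
def front_encode(sorted_keys: list[str]) -> list[tuple[int, str]]:
    """Replace each key by (shared-prefix length with previous key, rest of key)."""
    encoded = []
    prev = ""
    for key in sorted_keys:
        lcp = 0
        limit = min(len(prev), len(key))
        while lcp < limit and prev[lcp] == key[lcp]:
            lcp += 1
        encoded.append((lcp, key[lcp:]))
        prev = key
    return encoded


def front_decode(encoded: list[tuple[int, str]]) -> list[str]:
    """Rebuild the keys by copying lcp characters from the previous key."""
    keys = []
    prev = ""
    for lcp, suffix in encoded:
        key = prev[:lcp] + suffix
        keys.append(key)
        prev = key
    return keys
```

For the keys "algorithm", "algorithmic", "algorithms", "data", "database" the encoding is (0, "algorithm"), (9, "ic"), (9, "s"), (0, "data"), (4, "base").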

Under Y!-patenting
9
We know how to manage everything
10
Information Retrieval

Diego Maradona won against Mexico

Dictionary: against, Diego, Maradona, Mexico, won
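
The step from the sentence to its dictionary can be sketched as follows: split the text into terms, drop duplicates, and sort them. This assumes whitespace tokenization and case-insensitive ordering, neither of which is stated on the slide; the function name is hypothetical.

```python
def build_dictionary(text: str) -> list[str]:
    """Distinct whitespace-separated terms of text, sorted ignoring case."""
    terms = set(text.split())
    return sorted(terms, key=str.lower)
```

Applied to "Diego Maradona won against Mexico" this yields against, Diego, Maradona, Mexico, won. Word order and the link between "Diego" and "Maradona" are lost.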
11
Topic Annotators

Diego Maradona won against Mexico

Detect mentions and annotate them with
entities/topics extracted from a catalog:
Wikipedia!
We serve about 170k requests/day
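
As a toy illustration of "detect mentions and annotate them", the sketch below matches the longest known mention at each position against a small catalog. The catalog entries and function name are hypothetical, and the approach is a plain dictionary lookup; a real annotator also has to disambiguate each mention using its context, which is not attempted here.

```python
# Hypothetical toy catalog: lowercase mention -> catalog entry title.
TOY_CATALOG = {
    "diego maradona": "Diego Maradona",
    "mexico": "Mexico national football team",
}


def annotate(text: str, catalog: dict[str, str], max_len: int = 3) -> list[tuple[str, str]]:
    """Greedy longest-match annotation of whitespace-separated tokens."""
    tokens = text.split()
    annotations = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate mention first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            entity = catalog.get(mention.lower())
            if entity is not None:
                annotations.append((mention, entity))
                i += n
                break
        else:
            i += 1  # no mention starts at this token
    return annotations
```

On "Diego Maradona won against Mexico" this returns the pairs ("Diego Maradona", "Diego Maradona") and ("Mexico", "Mexico national football team").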
12
A new scenario
obama asks iran for RQ-170 sentinel drone back
us president issues Ahmadinejad ultimatum
13
The literature
Many commercial software tools: AlchemyAPI, DBpedia
Spotlight, Extractiv, Lupedia, OpenCalais, Saplo,
SemiTags, TextRazor, Wikimeta, Yahoo! Content
Analysis, Zemanta.
Paper at WWW 2013; we serve about 170k
requests/day
14
Paper at ACM WSDM 2012
Paper at IEEE Software 2012
Details at http://acube.di.unipi.it/tagme
Paper at ECIR 2012
