Title: Algorithms and Data Structures
1 Algorithms and Data Structures for Massive Datasets (Acube Lab)
Rossano Venturini, Dipartimento di Informatica, Università di Pisa
Paolo Ferragina, Giuseppe Prencipe, Marco Cornolti, Andrea Farruggia, Giovanni Micale, Francesco Piccinno, Giorgio Audrito
2 A3 Lab (acube.di.unipi.it)
- Algorithms and data structures for massive datasets
- Data Compression
- Compressed Indexing
- Web or arbitrary texts
- Storage and analysis of massive graphs
- Information Retrieval on news, tweets, ...
Submitted US patents: 3 with Yahoo, 1 with NYU
Accepted US patents: 1 with U. Rutgers, 1 with ATT-Lucent
3 Social Networks and Social Data
- Given an idea, you need the right platform to implement it
- HW and SW (IT Center)
- Algorithms (our Lab)
- Graph structure and textual content
- Nodes: users (1 bil)
- Edges, explicit: friend, follower, retweet, 1, ... (10 bil)
- Edges, implicit: similarity, co-occurrence, click, ... (100 bil)
4 No SQL
2006
Hadoop
Cassandra
HyperTable
Cosmos
5 Storage and access to Labeled Graphs
- Compress the graph structure
- Compress the node and edge labels
- Guarantee fast access, dynamicity and search
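One common way to compress a graph's structure is to sort each adjacency list, store the gaps between consecutive neighbor ids, and write each gap with a variable-byte code. The sketch below illustrates that general idea only; it is not necessarily the scheme used by the Lab, and the function names are hypothetical. It assumes node ids are non-negative integers.

```python
def encode_varint(n):
    """Variable-byte code: 7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte of this integer
            return bytes(out)


def compress_adjacency(neighbors):
    """Sorted neighbor ids -> gaps -> variable-byte codes."""
    out, prev = bytearray(), 0
    for v in sorted(neighbors):
        out += encode_varint(v - prev)  # small gaps take a single byte
        prev = v
    return bytes(out)


def decompress_adjacency(blob):
    """Invert compress_adjacency, returning the sorted neighbor ids."""
    neighbors, prev, value, shift = [], 0, 0, 0
    for byte in blob:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value            # undo the gap encoding
            neighbors.append(prev)
            value, shift = 0, 0
    return neighbors
```

Lists whose neighbors have nearby ids produce small gaps, so most entries cost one byte instead of a fixed-width integer.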
6 Data Compression: Theory and Engineering
- J. ACM 05
- ACM-SIAM Soda 09-14
- ACM WSDM 10
- ESA 11-14
- Algorithmica 12
- SIAM J. Computing 13
- Key issue
- Minimize space occupancy
- Maximize decompression speed
Compressor on DBLP   Compressed space (MB)   Decompression time (secs)
Gzip                 191                     11.6
bzip2                121                     49
Snappy               323                     2.1
LZ4                  215                     1.9
Our result           130 to 149              2.9 to 1.9
A new algorithmic concept: multi-objective design of compressors
- Two interesting scenarios
- - Energy-efficiency issues
- - Cloud computing
Can we fix the space occupancy and minimize the decompression time? Or vice versa?
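The two questions above can be illustrated with a small experiment: measure compressed size and decompression time for a few off-the-shelf codecs, then pick the best one under a budget on the other resource. This sketch only selects among existing Python standard-library codecs (not the ones in the table); it is not the Lab's multi-objective compressor, and the function names are hypothetical.

```python
import bz2
import lzma
import time
import zlib

# Candidate codecs from the Python standard library (an assumption of this sketch).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def profile(data):
    """Return {codec name: (compressed size in bytes, decompression time in seconds)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        blob = compress(data)
        start = time.perf_counter()
        restored = decompress(blob)
        elapsed = time.perf_counter() - start
        if restored != data:
            raise ValueError(f"{name} failed the round trip")
        results[name] = (len(blob), elapsed)
    return results


def fastest_within_space(results, max_bytes):
    """Fix the space budget, minimize decompression time."""
    feasible = {n: r for n, r in results.items() if r[0] <= max_bytes}
    if not feasible:
        return None
    return min(feasible, key=lambda n: feasible[n][1])


def smallest_within_time(results, max_seconds):
    """Fix the time budget, minimize compressed space."""
    feasible = {n: r for n, r in results.items() if r[1] <= max_seconds}
    if not feasible:
        return None
    return min(feasible, key=lambda n: feasible[n][0])
```

Timings depend on the machine and the input, so the chosen codec may differ from run to run; only the selection logic is the point here.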
7 Compressed Indexing: Theory and Engineering
- J. ACM 05
- ACM SIGIR 07
- J. ACM 09
- ACM Trans. Algo. 10
- ESA 13
- ACM-SIAM SODA 13
- and many others
- Key issue
- Minimize space occupancy
- Maximize substring-search throughput
Suffix-array compressible, Bzip searchable
- Performance over hundreds of MBs and commodity PC
- Count(P) takes 5 microsecs/char, taking about bzip's space
- Locate(P) outputs 100K occ/sec, taking 10% space
- This may be 4x faster than IL, within < 35% space occupancy
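To make Count(P) and Locate(P) concrete, here is a minimal sketch of backward search over the Burrows-Wheeler transform, the textbook mechanism behind BWT-based compressed indexes. It is deliberately naive and uncompressed: the suffix array is kept in full and rank queries scan the string, whereas a real compressed index replaces both with succinct structures. All function names are hypothetical, and the text is assumed not to contain the NUL character used as terminator.

```python
def bwt_index(text):
    """Build a naive index: suffix array, BWT string, and the C table."""
    text += "\0"  # unique terminator, smaller than every other symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # symbol preceding each sorted suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total  # number of symbols strictly smaller than ch
        total += counts[ch]
    return sa, bwt, c_table


def occ(bwt, ch, i):
    """Occurrences of ch in bwt[:i]; done here by a linear scan."""
    return bwt.count(ch, 0, i)


def backward_search(bwt, c_table, pattern):
    """Half-open suffix-array range [lo, hi) of the suffixes prefixed by pattern."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ(bwt, ch, lo)
        hi = c_table[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(index, pattern):
    """Number of occurrences of pattern in the indexed text."""
    _, bwt, c_table = index
    lo, hi = backward_search(bwt, c_table, pattern)
    return hi - lo


def locate(index, pattern):
    """Sorted starting positions of pattern in the indexed text."""
    sa, bwt, c_table = index
    lo, hi = backward_search(bwt, c_table, pattern)
    return sorted(sa[lo:hi])


index = bwt_index("abracadabra")
print(count(index, "abra"))   # 2
print(locate(index, "abra"))  # [0, 7]
```

Count performs one step per pattern character, which is why its cost is quoted per character, while Locate additionally pays for each reported occurrence.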
8 Compressed Indexing: Theory and Engineering
No SQL DB
The <key,value> problem
- Trie: 14x more space than input data
- Front coding with two-level indexing
- - 110% of input data
- - 4 microsecs/char
- Our Compressed Permuterm
- - < 25% of input data, i.e. close to bzip2
- - 10 to 60 microsecs/char
- So, time close to FC but one-fourth of its space
Under Y!-patenting
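As an illustration of the front-coding baseline, the sketch below splits a sorted key set into buckets, stores each bucket's first key verbatim, and stores every other key as the length of the prefix it shares with its predecessor plus the remaining suffix. Lookups binary-search the bucket heads (the first level) and then decode a single bucket (the second level). This is a generic textbook version under assumed names and an assumed bucket size, not the exact layout measured on the slide.

```python
from bisect import bisect_right


def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_code(sorted_keys, bucket_size=4):
    """Return (heads, buckets): verbatim bucket heads and front-coded remainders."""
    heads, buckets = [], []
    for start in range(0, len(sorted_keys), bucket_size):
        chunk = sorted_keys[start:start + bucket_size]
        heads.append(chunk[0])
        coded, prev = [], chunk[0]
        for key in chunk[1:]:
            k = lcp(prev, key)
            coded.append((k, key[k:]))  # shared-prefix length + new suffix
            prev = key
        buckets.append(coded)
    return heads, buckets


def decode_bucket(head, coded):
    """Rebuild the keys of one bucket from its head and (length, suffix) pairs."""
    keys, prev = [head], head
    for k, suffix in coded:
        prev = prev[:k] + suffix
        keys.append(prev)
    return keys


def contains(index, key):
    """Binary search over bucket heads, then a scan of one decoded bucket."""
    heads, buckets = index
    b = bisect_right(heads, key) - 1
    if b < 0:
        return False
    return key in decode_bucket(heads[b], buckets[b])
```

A larger bucket size saves space, since fewer keys are stored verbatim, at the price of a longer scan per lookup.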
9 We know how to manage everything
10 Information Retrieval
- Diego Maradona won against Mexico
Dictionary: against, Diego, Maradona, Mexico, won
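The dictionary above is what a classic bag-of-words pipeline extracts from the sentence: the set of distinct terms, in sorted order, with no notion of which words belong together or what they refer to. A minimal sketch, assuming whitespace tokenization and case-insensitive ordering (the function name is hypothetical):

```python
def build_dictionary(documents):
    """Collect the distinct whitespace-separated terms of all documents, sorted ignoring case."""
    terms = set()
    for doc in documents:
        terms.update(doc.split())
    return sorted(terms, key=str.lower)


print(build_dictionary(["Diego Maradona won against Mexico"]))
# ['against', 'Diego', 'Maradona', 'Mexico', 'won']
```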
11 Topic Annotators
- Diego Maradona won against Mexico
Detect mentions and annotate them with an entity/topic extracted from a catalog
Wikipedia!
we serve about 170k requests/day
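The mention-detection step can be sketched as a greedy longest-match lookup against a catalog of known surface forms. The catalog below is a two-entry toy dictionary invented for illustration, and the function name and `max_len` parameter are hypothetical. A real annotator also has to disambiguate among several candidate entities per mention, which this sketch does not attempt.

```python
def annotate(text, catalog, max_len=3):
    """Greedy longest-match spotting of catalog mentions in a whitespace-tokenized text.

    catalog maps a lowercased surface form to an entity title.
    Returns a list of (mention, entity) pairs in text order.
    """
    tokens = text.split()
    annotations, i = [], 0
    while i < len(tokens):
        # Try the longest candidate mention starting at token i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            entity = catalog.get(mention.lower())
            if entity is not None:
                annotations.append((mention, entity))
                i += n
                break
        else:
            i += 1  # no catalog entry starts at this token
    return annotations


# Toy catalog (assumption): surface form -> entity title.
catalog = {
    "diego maradona": "Diego Maradona",
    "mexico": "Mexico national football team",
}
print(annotate("Diego Maradona won against Mexico", catalog))
# [('Diego Maradona', 'Diego Maradona'), ('Mexico', 'Mexico national football team')]
```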
12 A new scenario
obama asks iran for RQ-170 sentinel drone back
us president issues Ahmadinejad ultimatum
13 The literature
Many commercial software tools: AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, SemiTags, TextRazor, Wikimeta, Yahoo! Content Analysis, Zemanta.
Paper at WWW 2013; we serve about 170k requests/day
14 Paper at ACM WSDM 2012
Paper at IEEE Software 2012
Details on... http://acube.di.unipi.it/tagme
Paper at ECIR 2012