Impala and BigQuery - PowerPoint PPT Presentation

About This Presentation

Title:

Impala and BigQuery

Description:

Impala and BigQuery, by David Gruzman (BigDataCraft.com). While Dremel converts data into its own format, Impala supports multiple formats.

Transcript and Presenter's Notes

Title: Impala and BigQuery

1
Impala and BigQuery
By David Gruzman BigDataCraft.com
2
Impala and BigQuery

BigQuery is Google's database service based on Dremel. BigQuery is hosted by Google.
Impala is an open source database inspired by the Dremel paper. Impala is part of the Cloudera Hadoop distribution.

by David Gruzman
3
Today's agenda

Overview of Dremel as a technology
Overview of BigQuery
A few words about Impala
DG Mediamind use case
Deeper insights into Impala
Conclusions
Q&A

4
What is Dremel?

It can be viewed as a kind of database technology / architecture.
The closest known examples are MPP databases.
The main difference from them is in-memory processing, with the following consequences:
Only small-to-big table joins (in the first releases).
Small result sizes.
No operations like external sorts.

5
Dremel's Philosophy

Let's do a SQL subset that has a fast and scalable implementation.
It is somewhat similar to other NoSQLs: we do what we can do VERY FAST and at scale. The rest is the application's problem.

6
Why Dremel?

Google was the first to have MapReduce.
Google was the first to face MapReduce's main problem: latency. The problem also propagated to the engines built on top of MapReduce.
It is logical that Google was the first to approach it, by developing a real-time query capability for big data.

7
How Dremel is used at Google

Dremel is not a replacement for MapReduce or Tenzing but complements them. (Tenzing is Google's Hive.)
An analyst can run many fast queries using Dremel.
After getting a good idea of what is needed, run slow MapReduce (or SQL based on MapReduce) to get precise results.

8
Why Dremel is unique

Dremel, with BigQuery built on top of it, is probably the only interactive big data query engine today.
I mean that it is the only engine capable of producing results over terabytes of data in seconds!
The main idea (my guess) is to harness a huge cluster of machines for a single query.

9
Dremel as technology

Novel Hierarchical columnar format.
LLVM based code generation.
Distributed aggregation Tree
In-situ data processing. (inside the storage)

10
Dremel Aggregation tree
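
The picture for this slide is missing from the transcript, so here is a minimal sketch of the idea of a distributed aggregation tree. It is not Dremel's implementation: the function names, the fan-out parameter and the sample shards are all assumptions made for illustration. Leaf servers compute partial sums over their local shard, and each level above only merges the partial results of its children.

```python
from collections import Counter

def leaf_aggregate(rows):
    """Leaf server: partial SUM(value) GROUP BY key over its local shard."""
    partial = Counter()
    for key, value in rows:
        partial[key] += value
    return partial

def merge(partials):
    """Intermediate or root server: combine the partial sums of its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the counts per key
    return total

def aggregate_tree(shards, fanout=2):
    """Run the query bottom-up; `fanout` (hypothetical) is children per node."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_aggregate(shard) for shard in shards]
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0] if level else Counter()

# Three shards, as if stored on three different machines (made-up data).
shards = [
    [("il", 3), ("us", 5)],
    [("us", 2)],
    [("il", 1), ("de", 4)],
]
result = aggregate_tree(shards)
print(dict(result))
```

Only small partial results travel up the tree, which is why this works well when the query reduces the data a lot.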
11
Dremel Nested columnar format
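
The picture for this slide is also missing. The sketch below shows only the basic idea of striping nested records into columns, one column per leaf field. It is deliberately simplified: the format described in the Dremel paper also stores repetition and definition levels so that records can be reassembled exactly, and this sketch leaves them out. The function name and the sample records are assumptions.

```python
from collections import defaultdict

def stripe(records):
    """Split nested records into one column per leaf field path.

    Each column is a list of (record_index, value) pairs.
    """
    columns = defaultdict(list)

    def walk(node, path, record_index):
        if isinstance(node, dict):
            for name, child in node.items():
                walk(child, path + (name,), record_index)
        elif isinstance(node, list):
            # Repeated field: every element goes to the same column.
            for child in node:
                walk(child, path, record_index)
        else:
            columns[".".join(path)].append((record_index, node))

    for index, record in enumerate(records):
        walk(record, (), index)
    return dict(columns)

# Made-up nested records: a page with repeated click events.
records = [
    {"url": "a.com", "clicks": [{"country": "il", "cost": 2},
                                {"country": "us", "cost": 3}]},
    {"url": "b.com", "clicks": [{"country": "us", "cost": 5}]},
]
columns = stripe(records)

# An aggregation over one nested field touches a single column only.
total_cost = sum(value for _, value in columns["clicks.cost"])
print(sorted(columns), total_cost)
```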
12
BigQuery

A service built by Google on top of the Dremel engine.
The only query engine as a service (known to me) that works with big data.
Query time does not depend on data size.

13
BigQuery main capabilities

Aggregations
Join of big table to small table.
Join of two big tables (recently added)
Hierarchical data format. It makes
pre-aggregations cheaper.

14
Main limitations

Small results size
Intermediate results should not exceed memory
size.
No external tables

15
Pricing model

The pricing is per gigabyte of processed data.
The price is high: $35 per TB.
In my view it is costly because it is hyper-elastic.
You can do the same processing on Amazon, but it will take a few hours (and cost much less money).
On Amazon you cannot get the required CPU power for just a few seconds.

16
Why BigQuery is not popular
17
So, why is BigQuery not popular?

Data is not created in Google's cloud. It is hard and not practical to move big data. It is heavy, after all.
Google tends to change its APIs. BigQuery has also changed during the last years. It is hard to build a business on it.
Many companies in Internet-related businesses are wary of sharing data with Google.
It is expensive. $35 per TB can lead to bills of thousands of dollars per day.

18
Dremel
19
At the same time, it is technically good

I got references from a company doing serious testing.
Martin Fowler's company also tested it and gave very good feedback.

20
Question to all of you

Why did your organization decide not to use Google's BigQuery?

21
Where we can find Impala
22
Impala
23
What is Impala

A massively parallel processing (MPP) database engine, developed by Cloudera.
Integrated into the Hadoop stack at the same level as MapReduce, and not above it (as Hive and Pig are).

(Diagram: Hive and Pig sit on top of MapReduce; Impala sits beside MapReduce; both MapReduce and Impala run on HDFS.)
24
Why Impala

Data has gravity.
Today a lot of data lives in HDFS.
It is not practical to move big data.
It is practical to bring the engine to the data.
At the same time, MapReduce is not a must.
Impala processes data in the Hadoop cluster without using MapReduce.

25
MapReduce bypass

Several other modern database engines have also realized the opportunity to bypass MapReduce and work directly with HDFS.
They take various approaches.

26
MapReduce Bypass

Existing MPP databases, like Greenplum, store their external tables in HDFS.

27
MapReduce bypass

JethroData stores data in its own format on HDFS and also works with it without the MR layer.
Its proprietary format enables full indexing of the data together with columnar efficiency. For highly selective queries this approach has serious advantages.

28
Use case from DG

I think it will be a typical case in the future.
DG is using Hadoop and Hive.
It is evaluating Impala to do part of the work more efficiently.
After their case presentation we will come back to discuss the insights into Impala.

29
Again: Impala has a different place than Pig and Hive

(Diagram: Hive and Pig sit on top of MapReduce; Impala sits beside MapReduce; both MapReduce and Impala run on HDFS.)
30
Impala architecture
31
Impala Dremel traces

LLVM code generation
It is really fast
C++ as the implementation language (not Java...)
A simple query engine. It actually does only things which can be done in memory.
Broadcast join algorithm is implemented
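
To make the last point concrete, here is a minimal sketch of a broadcast join in plain Python. It is not Impala's code: the function, the column names and the sample tables are assumptions for illustration. The small table is turned into an in-memory hash table that every node would receive a copy of, while each partition of the big table is scanned where it lives.

```python
def broadcast_join(big_partitions, small_table, big_key, small_key):
    """Join every partition of the big table against a broadcast small table."""
    # Build the hash table once; in a cluster a copy would be sent to every node.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[small_key], []).append(row)

    # Each partition is scanned locally and probed against the hash table.
    for partition in big_partitions:
        for row in partition:
            for match in lookup.get(row[big_key], []):
                yield {**match, **row}

# Made-up data: a big fact table in two partitions and a small dimension table.
big_partitions = [
    [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 5}],
    [{"user_id": 1, "amount": 7}, {"user_id": 3, "amount": 1}],
]
small_table = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]

joined = list(broadcast_join(big_partitions, small_table, "user_id", "user_id"))
for row in joined:
    print(row)
```

The approach only works while the small table fits in the memory of every node, which matches the "small to big table joins" limitation mentioned earlier.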

32
LLVM code generation

Assume you write custom code for one specific query. It will be super efficient.
Code generation automates this process for each query.
We actually need to super-optimize the inner loop doing the filtering (WHERE) and GROUP BY.
LLVM enables us to compile into native code in a fraction of a second.
LLVM enables us to use new CPU capabilities like SSE in a portable way.
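
The idea can be imitated in a few lines of Python. This is only an analogy and not how Impala does it: Impala generates LLVM IR and compiles it to native code, while the sketch below generates Python source for one specific query and compiles it with the built-in compile() and exec(). The function name, the query shape and the column names are assumptions.

```python
def generate_query(filter_col, threshold, group_col, sum_col):
    """Emit and compile a function specialised for one query.

    Query shape (assumed for illustration):
    SELECT group_col, SUM(sum_col) WHERE filter_col > threshold GROUP BY group_col
    """
    # Column names and the constant are baked into the inner loop as literals,
    # so the generated code has no per-row interpretation of a query plan.
    source = f"""
def run(rows):
    totals = {{}}
    for row in rows:
        if row[{filter_col!r}] > {threshold!r}:
            key = row[{group_col!r}]
            totals[key] = totals.get(key, 0) + row[{sum_col!r}]
    return totals
"""
    namespace = {}
    exec(compile(source, "<generated-query>", "exec"), namespace)
    return namespace["run"], source

# Made-up rows and a made-up query.
rows = [
    {"flag": "A", "qty": 10, "price": 5},
    {"flag": "B", "qty": 3, "price": 1},
    {"flag": "A", "qty": 4, "price": 9},
]
run, source = generate_query("price", 2, "flag", "qty")
result = run(rows)
print(source)
print(result)
```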

33
Why is code generation interesting?

If you develop your own engine, or some piece of code responsible for processing serious data volumes, code generation may give you an order of magnitude boost.
I had cases where the use of such technology was game changing.

34
Impala Hive Traces

While Dremel converts data into its own format, Impala supports multiple formats. It is a kind of schema on read.
Impala shares the metastore with Hive, which makes adoption very simple.
Internally, Impala has a well-defined way to add new formats.

35
Impala unique things

Impala's format adapters, called scanners, have predicate pushdown capability.
Probably the only open source MPP engine.
Today we have no other means to run hundreds of CPU cores in one query efficiently without an expensive license.
Hive gives us the same, but not efficiently.
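
A minimal sketch of what predicate pushdown in a scanner means, assuming a hypothetical CSV scanner written with Python's standard csv module. It is not Impala's scanner interface; the function name, its parameters and the sample data are made up. The predicate is evaluated inside the scanner, so rejected rows never reach the rest of the engine.

```python
import csv
import io

def scan_csv(text, columns, predicate=None):
    """Hypothetical scanner: yield only the requested columns of matching rows."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        if predicate is not None and not predicate(row):
            continue  # dropped inside the scanner, never handed to the engine
        yield {name: row[name] for name in columns}

# Made-up input file content.
data = "country,clicks\nil,3\nus,5\nil,4\n"

# Roughly: SELECT SUM(clicks) FROM data WHERE country = 'il'
rows = list(scan_csv(data, ["clicks"], predicate=lambda r: r["country"] == "il"))
total = sum(int(r["clicks"]) for r in rows)
print(rows, total)
```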

36
Impala vs MPP

It usually takes many years to create an MPP database.
There are serious simplifications:
The data is read-only.
There is actually no DBMS, only a query engine.
No serious resource management, only measurement (all over the code).

37
Impala: a Hive killer?

Not so quickly.
Hive does things Impala cannot do yet, like joins between several big tables.
Hive has convenient Java UDFs, while Impala does not.
Impala does not have intra-query fault tolerance.
At the same time, MapReduce is not a good framework for a database engine.

38
Impala Data Formats

There are scanners for the following types
RCFile
Parquet (columnar format modeled on Dremel's)
CSV
Avro
SequenceFile

39
Impala future

Will get closer to other MPP engines
Support more formats
More advanced scheduling and resource management

40
Basic benchmark

TPC-H, Q1, SF10
4 EC2 large instances
4 seconds, while Hive takes about 1 minute.
This number means a group-by speed of about 235 MB/sec per core.

41
Impala price per GB

One large instance costs $0.24 per hour.
The cluster of 4 instances costs $0.96 per hour.
The cost of 1 second is $0.96 / 3600.
Such a cluster processes 1.75 GB per second.
So the cost of processing 1 TB is about $0.15.
That is more than 200 times cheaper than BigQuery at 35 per TB.
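
The arithmetic on this slide can be reproduced with a few lines of Python. The inputs are the numbers from the slides; the only assumption added here is that 1 TB is counted as 1000 GB, and the constant and function names are made up for the sketch.

```python
INSTANCE_PRICE_PER_HOUR = 0.24   # one large EC2 instance, from the slide
INSTANCES = 4                    # cluster size used in the benchmark
THROUGHPUT_GB_PER_SEC = 1.75     # whole-cluster throughput, from the slide
BIGQUERY_PRICE_PER_TB = 35.0     # BigQuery price quoted earlier in the deck
GB_PER_TB = 1000                 # assumption: decimal terabyte

def impala_cost_per_tb():
    """Cost of keeping the cluster busy for as long as 1 TB takes to process."""
    cluster_price_per_second = INSTANCE_PRICE_PER_HOUR * INSTANCES / 3600
    seconds_per_tb = GB_PER_TB / THROUGHPUT_GB_PER_SEC
    return cluster_price_per_second * seconds_per_tb

cost = impala_cost_per_tb()
ratio = BIGQUERY_PRICE_PER_TB / cost
print(f"Impala: {cost:.3f} per TB, BigQuery is {ratio:.0f} times more expensive")
```

The comparison only holds while the cluster is fully busy; an idle cluster still costs the hourly price.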

42
Performance - summary

It is fast when data reduction is big.
It is fast when data is hot.
It should benefit from fast storage / SSD. My measurements show about 200 MB/sec per core for group-by processing.
Always faster than Hive, by at least 10 times.

43
What about clouds?
44
Impala in cloud is not elastic

To be elastic we need to create the cluster when we need it.
Even if we accept per-hour resolution, storage will be a problem.
S3 will not give us hundreds of MB per second per instance.
Data stored in the local file system is transient.

45
Impala - conclusions

It is the first time I remember that we can put our hands on a free MPP database.
There is no risk in trying it side by side with Hive.
It is possible to offload part of the work to Impala and do the rest with Hive.
It is part of the Cloudera Hadoop distribution and is easily installed with Cloudera Manager.

46
Materials used

Benchmarks
http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-20121208-15536323
https://amplab.cs.berkeley.edu/benchmark/
Architecture
http://www.slideshare.net/scottleber/impala-19176906
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
POC
http://martinfowler.com/articles/bigQueryPOC.html

47
Material used - comparisons

To Hive: http://www.quora.com/Cloudera/Does-Cloudera-Impala-have-any-drawbacks-when-compared-with-Hive
To Vertica: http://www.quora.com/Cloudera-Impala/How-does-Cloudera-Impala-compare-to-Vertica
To Dremel: http://www.quora.com/Cloudera-Impala/How-does-Clouderas-Impala-compare-to-Googles-Dremel

48
Thank you!!!