Title: Exploiting MySQL 5.1
1- Exploiting MySQL 5.1
- For Advanced Business Intelligence Applications
Matt Casters, Chief Architect, Data Integration, and Kettle Project Founder
MySQL User Conference, Tuesday April 24, 2007
2- Agenda
- Great news
- Pentaho Data Integration introduction
- Pentaho Data Integration MySQL support
- Table partitioning
- Database partitioning
- Clustering
- Metadata to the rescue
- Pentaho company overview and platform demo
- Questions and closing
3- Great news
- The San Francisco Giants beat the Arizona Diamondbacks last Saturday
- Barry Zito pitched a great game
- The hot-dogs were great
4- Pentaho news
- Mondrian 2.3 was released a few weeks ago
- Detailed cache control (clearing)
- Sun Microsystems and Pentaho announced that Pentaho Reporting will be used to drive Business Intelligence requirements in Open Office 2.3, to be released in the second half of this year.
- Tomorrow we'll do another one.
5- Pentaho Data Integration introduction
- From source systems
- to the data warehouse
- to reports
- to analyses
- to dashboard reports
- to better information
6- Data Transformation and Integration Examples
- Data filtering
- Is not null, greater than, less than, includes
- Field manipulation
- Trimming, padding, upper and lowercase conversion
- Data calculations
- +, -, x, /, average, absolute value, arctangent, natural logarithm
- Date manipulation
- First day of month, last day of month, add months, week of year, day of year
- Data type conversion
- String to number, number to string, date to number
- Merging fields and splitting fields
- Looking up data
- Look up in a database, in a text file, an Excel sheet, ...
7- Pentaho Data Integration (Kettle) Components
- Spoon
- Connect to data sources
- Define transformation rules and design target schema(s)
- Graphical job execution workflow engine for defining multi-stage and conditional transformation jobs
- Pan
- Command-line execution of single, pre-defined transformation jobs
- Kitchen
- Scheduler for multi-stage jobs
- Carte
- Remote execution slave server with embedded web server
- Pentaho BI Platform
- Integrated scheduling of transformations or jobs
- Ability to call real-time transformations and use output in reports and dashboards
8- Data Transformation and Integration Examples
- Demo time: load 1M rows into a database table
9- Case Study: Pentaho Data Integration
- Organization: Flemish Government Traffic Centre
- Use case: Monitoring the state of the road network
- Application requirement: Integrate minute-by-minute data from 570 highway locations for analysis (1550 lanes)
- Technical challenges: Large volume of data, more than 2.5 billion rows
- Business usage: Users can now compare traffic speeds based on weather conditions, time of day, date, season
- Best practices
- Clearly understand business user requirements first
- There are often multiple ways to solve data integration problems, so consider the long-term need when choosing the right way
10- Summary and Resources
- Pentaho and MySQL can help you manage your data infrastructure
- Extraction, Transformation and Loading for Data Warehousing and Data Migration
- kettle.pentaho.org
- Kettle project homepage
- kettle.javaforge.com
- Kettle community website: forum, source, documentation, tech tips, samples, ...
- www.pentaho.org/download/
- All Pentaho modules, pre-configured with sample data
- Developer forums, documentation
- Ventana Research Open Source BI Survey
- www.mysql.com
- White paper: http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
- Kettle webinar: http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
- Roland Bouman's blog on Pentaho Data Integration and MySQL: http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html
11- Pentaho Data Integration: MySQL support
- Reading: MySQL result streaming (cursor emulation) support
- Writing: MySQL dialects for data types
- Job entry: bulk loader of text files for MySQL
- Job entry: bulk writer to a text file for MySQL
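Under the hood, a MySQL bulk load of a text file boils down to LOAD DATA INFILE; a minimal hand-written sketch of what such a job entry drives (the file path is illustrative, and the column list assumes the sales table shown later in this deck):

```sql
-- Bulk-load a tab-separated text file into the sales table.
-- '/tmp/sales.txt' is an illustrative path, not from the slides.
LOAD DATA LOCAL INFILE '/tmp/sales.txt'
INTO TABLE sales
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(sales_date, ordernr, customernr, productnr, discount_pct);
```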
12- Table partitioning
- Challenges
- More data is being gathered all the time
- Data is coming from more sources than ever
- Faster access to stored information is becoming more important
- More people require concurrent access to the data
- The old solution
- Split up the data into different tables
- SALES_2003
- SALES_2004
-
- SALES_2006
- Hardcode reports to use one or the other table
- Lots of database management issues
- High report maintenance costs
- Nothing is automatic, everything is manual
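With the per-year tables above, every report has to stitch the years back together by hand, for example (column names assumed to match the sales table shown later):

```sql
-- Manual "partitioning": the report must UNION the yearly tables itself,
-- and must be edited every time a SALES_xxxx table is added or dropped.
SELECT sales_date, ordernr FROM SALES_2003
UNION ALL
SELECT sales_date, ordernr FROM SALES_2004
UNION ALL
SELECT sales_date, ordernr FROM SALES_2006;
```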
13- Table partitioning
- The new solution
- MySQL 5.1 partitioned tables
- Automatic split up of data into smaller chunks
- SQL engine that can automatically pull results
back together
14- Table partitioning
Sales table
15- Table partitioning
- Advantages
- Reduces query time by reducing the amount of data to plough through.
- Increases performance by
- Pruning the list of partitions to search automatically. This is done by the MySQL query optimizer based on the query that's being issued.
- Massive reduction in I/O
- Smaller partitioned indexes leading to faster index tree traversal
- Allowing parallel access to the different partitions
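Pruning can be checked with EXPLAIN PARTITIONS in MySQL 5.1; a sketch against the sales table created later in this deck, where only P2001 should appear in the partitions column:

```sql
-- The optimizer should prune P2000 away, since the WHERE clause
-- only touches rows whose YEAR(sales_date) is 2001.
EXPLAIN PARTITIONS
SELECT ordernr, customernr
FROM sales
WHERE sales_date >= '2001-01-01'
  AND sales_date <  '2002-01-01';
```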
16- Table partitioning
- How do I create a partitioned table?

CREATE TABLE sales
(
  sales_date TIMESTAMP,
  ordernr INT,
  customernr INT,
  productnr INT,
  discount_pct TINYINT
)
PARTITION BY RANGE( YEAR(sales_date) )
(
  PARTITION P2000 VALUES LESS THAN (2001),
  PARTITION P2001 VALUES LESS THAN (2002)
)
17- Table partitioning
- How do I add another partition to the table?

ALTER TABLE sales
ADD PARTITION
(
  PARTITION P2002 VALUES LESS THAN (2003)
)
18- Table partitioning
- How do I drop a partition from the table?

ALTER TABLE sales
DROP PARTITION P2002
19- Table partitioning
- Retrieving partition information

mysql> desc information_schema.partitions;
+-------------------------------+--------------+------+-----+---------+-------+
| Field                         | Type         | Null | Key | Default | Extra |
+-------------------------------+--------------+------+-----+---------+-------+
| TABLE_CATALOG                 | varchar(512) | YES  |     | NULL    |       |
| TABLE_SCHEMA                  | varchar(64)  | NO   |     |         |       |
| TABLE_NAME                    | varchar(64)  | NO   |     |         |       |
| PARTITION_NAME                | varchar(64)  | YES  |     | NULL    |       |
| SUBPARTITION_NAME             | varchar(64)  | YES  |     | NULL    |       |
| PARTITION_ORDINAL_POSITION    | bigint(21)   | YES  |     | NULL    |       |
| SUBPARTITION_ORDINAL_POSITION | bigint(21)   | YES  |     | NULL    |       |
| PARTITION_METHOD              | varchar(12)  | YES  |     | NULL    |       |
| SUBPARTITION_METHOD           | varchar(12)  | YES  |     | NULL    |       |
| PARTITION_EXPRESSION          | longtext     | YES  |     | NULL    |       |
| SUBPARTITION_EXPRESSION       | longtext     | YES  |     | NULL    |       |
| PARTITION_DESCRIPTION         | longtext     | YES  |     | NULL    |       |
| TABLE_ROWS                    | bigint(21)   | NO   |     | 0       |       |
+-------------------------------+--------------+------+-----+---------+-------+
20- Table partitioning
- Retrieving partition information

SELECT partition_name
FROM information_schema.partitions
WHERE table_name = 'sales';

+----------------+
| partition_name |
+----------------+
| P2000          |
| P2001          |
| P2002          |
| P2003          |
| P2004          |
| P2005          |
| P2006          |
| P2007          |
| P2008          |
| P2009          |
| P2010          |
+----------------+
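The same catalog view also exposes per-partition row counts, handy for spotting skew; a small follow-up sketch (table_rows is only an estimate for some storage engines):

```sql
-- Approximate row count per partition of the sales table.
SELECT partition_name, table_rows
FROM information_schema.partitions
WHERE table_schema = DATABASE()
  AND table_name = 'sales'
ORDER BY partition_ordinal_position;
```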
21- Table partitioning
- Use-case: Flemish Traffic Center
- Medium sized data set: 514M rows
- Partitioned by week
- PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) )
- 80GB of data in hundreds of partitions
- 1 row of data: speed, number of vehicles (5 classes), etc.
- Per minute
- Per lane of traffic
- For 1550 detectors
- → 815M rows of data per year
- A test query aggregating speed counts per road position per minute
- Gave back 642,319 rows
- 9 seconds for MySQL to get the result
- 29 seconds to pass the data to the client over JDBC
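A hedged sketch of that weekly scheme on a hypothetical traffic table (every column except mtime is invented for illustration; YEAR(mtime)*100 + WEEK(mtime) yields values like 200701 for week 1 of 2007):

```sql
-- Weekly range partitions via the YEAR*100 + WEEK trick from the slide.
CREATE TABLE traffic_minute
(
  mtime       DATETIME,
  detector_id INT,
  lane        TINYINT,
  speed       SMALLINT
)
PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) )
(
  PARTITION W200701 VALUES LESS THAN (200702),
  PARTITION W200702 VALUES LESS THAN (200703)
);
```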
22- Table partitioning
- Should we do table partitioning? Yes!
- However
- Even though you can do sub-partitioning, you need to partition the data in one certain way: by range, by list, ...
- The entry point for optimization is always going to be that partitioning method
- Pruning is always going to be based on that partitioning method
- Queries across the whole data set don't always get a benefit (nor a penalty)
- More detailed information
- Partitioning in MySQL 5.1 and Onwards
- Mikael Ronström, Senior Software Architect, MySQL AB
- → Tomorrow 11:50am - 12:35pm, Ballroom A
- How can we improve upon this?
23- Database partitioning
- Challenge
- Dealing with even greater data sets
- The need to present results even quicker
- The old solution
- Large boxes
- Huge amounts of RAM
- Very fast disk systems (Fibre attached, SAN, ...)
- Multi-CPU monsters
- → Very expensive and complicated systems
- The new solution
- 2 or more MySQL servers
- Basic, off the shelf, hardware
- Standard software
- Simple configurations
- Pentaho Data Integration and MySQL 5.1
24- Database partitioning
DB1
DB2
DB3
DB4
25- Database partitioning
- Split the data over several databases
- Pick a splitting mechanism in advance
- Divide the load over the databases
- Make the databases smaller
- Increase the query performance
- Increase the load performance
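One simple splitting mechanism is a modulo over a stable key; a hedged sketch of the idea (the MOD-3 rule and the table and column names are illustrative, not PDI's actual implementation):

```sql
-- The same table is created on each of the databases DB1..DB3.
CREATE TABLE traffic_minute
(
  mtime       DATETIME,
  detector_id INT,
  lane        TINYINT,
  speed       SMALLINT
);
-- The loader then routes each row to instance (detector_id MOD 3),
-- so a read can target one instance when the key is known, and the
-- merge across all instances is what the PDI partitioning schema automates.
```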
26- Database partitioning
- How do you set it up?
- First create a partitioned / clustered database connection
- Then create a Partitioning Schema
27- Database partitioning
- Demo time
- Creating partitions partitioned
- Loading data partitioned
- Reading back data partitioned
- Reading back data partitioned and ordered
28- Database partitioning
- Use-case: Flemish Traffic Center
- Medium sized data set: 514M rows
- Same partitioning as before
- Same table layout, engine and configuration
- Partitioned over 3 databases on detector (per lane)
- A test query aggregating speed counts per road position per minute
- Gave back 642,319 rows
- 3 seconds for MySQL to get the result
- 10 seconds to pass the data to the client over JDBC
- → Demonstrating almost linear scalability!
29- Database partitioning
- Should we do database partitioning? Yes!
- However
- The scalability is limited to the raw CPU power that is available on a single system.
- This poses a scalability problem
- when dealing with larger numbers of partitions
- Reading back large volumes of data
- → Enter Pentaho Data Integration Clustering
30- Pentaho Data Integration Clustering
Server X
31- Pentaho Data Integration Clustering
- The challenge is to eliminate CPU bottlenecks
- Pentaho Data Integration Carte comes to the rescue
- Can be started up on any system, turning it into a Slave Server
- The combination of database partitioning and PDI Clustering
- Will split up the number of database partitions among the slave servers
- Will split the total CPU load over the slave servers
- Will split the total I/O load over the database partitions (MySQL instances)
32- Pentaho Data Integration Clustering
- Demo-time
- Start up 2 slave servers
- Run a step across the 2 servers
- Monitor
33- Pentaho Metadata to the rescue
- The problem
- Reporting becomes harder on a database partitioned system
- Instead of 1 database you read from a bunch of them
- This is where our new project Pentaho Metadata comes in
- Due in a few months
- Auto-generates SQL now
- Will auto-generate PDI transformations in the near future
- Will auto-deliver the data to your report in the platform
- Will make it very easy for your users to create drag-and-drop reports
34- Pentaho Introduction
- World's most popular enterprise open source BI suite
- 2 million lifetime downloads, averaging 100K / month
- Founded in 2004: pioneer in professional open source BI
- Key Projects
- JFreeReport: Reporting
- Kettle: Data Integration
- Mondrian: OLAP
- Pentaho BI Platform
- Weka: Data Mining
- Management and Board
- Proven BI veterans from Business Objects, Cognos, Hyperion, SAS, Oracle
- Open source leaders: Larry Augustin, New Enterprise Associates, Index Ventures
- MySQL Gold Partner
35- Pentaho Introduction
(diagram: from operational data, such as Sales, Marketing, Inventory, Financial and Production, up through Reports, Aggregates and Analysis to strategic Scorecards)
36- Pentaho Introduction
37- Questions and Closing
38- Birds of a Feather
- MySQL Data Warehousing and BI
- Tonight at 7:30
- Camino Real
- With
- Chief Geek James Dixon, Senior Architect and CTO, Pentaho
- Julian Hyde, OLAP Architect and Mondrian Project Founder, Pentaho
- Brian Miezejewski, Principal Consultant, MySQL
- Roland Bouman, Certification Developer, MySQL
- Myself
- A few bottles of beer