Title: GORDA Kickoff meeting INRIA
1. GORDA Kickoff meeting - INRIA Sardes project
- Emmanuel Cecchet
- Sara Bouchenak
2. Outline
- INRIA, ObjectWeb, Sardes
- GORDA
3. INRIA key figures
A public scientific and technological research institute in computer science and control, under the dual authority of the Ministry of Research and the Ministry of Industry.
A scientific force of 3,000 (Jan. 2003):
- 900 permanent staff: 400 researchers, 500 engineers, technical and administrative staff
- 450 researchers from other organizations
- 700 Ph.D. students
- 200 external collaborators
- 750 trainees, post-doctoral students and visiting researchers from abroad (universities or industry)
INRIA Rhône-Alpes
6 Research Units
Budget: 120 M€ (tax not incl.)
4. iCluster 2
- Itanium-2 processors
- 104 nodes (dual 64-bit 900 MHz processors, 3 GB memory, 72 GB local disk) connected through a Myrinet network
- 208 processors, 312 GB memory, 7.5 TB disk
- Connected to the GRID
- Linux OS (RedHat Advanced Server)
- First Linpack experiments at INRIA (Aug. 03) reached 560 GFlop/s
- Applications: Grid computing, classical scientific computing, high performance Internet servers, ...
5. ObjectWeb key figures
- Open source middleware development
- Based on open standards: J2EE, CORBA, OSGi
- International consortium founded by INRIA, Bull and France Telecom R&D in 2001
- Academic partners: European universities and research centers
- Industrial partners: RedHat, Suse, MySQL, NEC, Bull, France Telecom, Dassault, Cap Gemini, ...
6. Common Software Architecture for Component Based Development
[Diagram of the ObjectWeb code base: JMOB, JOnAS, OSCAR, OpenCCM, ProActive, Speedo, RUBiS, JORAM, DotNetJ, CAROL, Enhydra XMLC, JORM/MEDOR, JOTM, Kilim, Zeus, C-JDBC, Fractal, Jonathan, RmiJdbc, Bonita, Think, JAWE, Octopus]
7. Sardes project
- Distributed Systems group
- Main research themes
- Reflective component technology
- Autonomous systems management
- Application areas
- high-availability J2EE servers
- dynamic monitoring, configuration and resource management in large scale distributed systems
- (embedded system networks, ubiquitous computing)
- Result dissemination by ObjectWeb
8. Outline
- INRIA, ObjectWeb, Sardes
- GORDA
9. Sardes experiences
- Component-based open source middleware
- ObjectWeb (http://www.objectweb.org)
- J2EE application servers
- JOnAS clustering (http://jonas.objectweb.org)
- Database replication middleware
- C-JDBC (http://c-jdbc.objectweb.org)
- Benchmarking
- RUBiS (http://rubis.objectweb.org)
- TPC-W (http://jmob.objectweb.org)
- CLIF (http://clif.objectweb.org)
- Monitoring
- LeWYS (http://lewys.objectweb.org)
10. Common scalability practice
- Cons
- Cost
- Scalability limit
[Diagram: Internet - Web frontend - App. server - database]
11. Replication with shared disks
- Cons
- still expensive hardware
- availability
[Diagram: Internet - Web frontend - App. server - database nodes sharing disks ("another well-known database vendor")]
12. Master/Slave replication
- Cons
- consistency
- failover time on master failure
- scalability
[Diagram: Internet - Web frontend - App. server - master database with slaves]
13. Atomic broadcast-based replication
- Database tier should be
- scalable
- highly available
- without modifying the client application
- database vendor independent
- on commodity hardware
[Diagram: Internet-facing tier with database replicas synchronized by atomic broadcast]
14. C-JDBC
- JDBC compliant (no client application modification), see the sketch below
- database vendor independent
- JDBC driver required
- heterogeneity support
- no 2PC, no group communication between databases
- group communication for controller replication only
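Since C-JDBC is exposed as a regular JDBC driver, a client only swaps the driver class and the connection URL. A minimal sketch, assuming the driver class name org.objectweb.cjdbc.driver.Driver and the URL form shown later on the controller replication slide (both should be checked against the C-JDBC distribution):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CjdbcClientSketch {
    public static void main(String[] args) throws Exception {
        // Load the C-JDBC driver instead of the database vendor's driver
        // (class name assumed here, check the C-JDBC documentation).
        Class.forName("org.objectweb.cjdbc.driver.Driver");

        // The URL names a C-JDBC controller and a virtual database,
        // not a physical database instance. Table name is hypothetical.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:c-jdbc://node1:25322/myDB", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id FROM items")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id"));
            }
        }
    }
}
```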
15. RAIDb - Definition
- Redundant Array of Inexpensive Databases
- better performance and fault tolerance than a single database, at a low cost, by combining multiple database instances into an array of databases
- RAIDb levels offer various tradeoffs between performance and fault tolerance
16. RAIDb
- Redundant Array of Inexpensive Databases
- better performance and fault tolerance than a single database, at a low cost, by combining multiple database instances into an array of databases
- RAIDb controller
- gives the view of a single database to the client
- balances the load on the database backends
- RAIDb levels
- RAIDb-0: full partitioning
- RAIDb-1: full mirroring
- RAIDb-2: partial replication
- composition possible
17. C-JDBC Key ideas
- Middleware implementing RAIDb
- Two components
- generic JDBC 2.0 driver (C-JDBC driver)
- C-JDBC Controller
- C-JDBC Controller provides
- performance scalability
- high availability
- failover
- caching, logging, monitoring, ...
- Supports heterogeneous databases
18. C-JDBC Overview
19. Heterogeneity support
- unload a single Oracle DB with several MySQL databases
- RAIDb-2 for partial replication
20. Inside the C-JDBC Controller
[Diagram: client and backend connections over sockets, administration via JMX]
21. C-JDBC features
- unified authentication management
- tunable concurrency control
- automatic schema detection
- tunable replication: full partitioning, partial replication, full replication
- caching: metadata, parsing, results with various invalidation granularities
- various load balancing strategies
- on-the-fly query rewriting for macros and heterogeneity support
- recovery log for dynamic backend addition and failure recovery
- database backup/restore using Octopus
- JMX based monitoring and administration
- graphical administration console
22. Functional overview
23. Functional overview
24. Failures
- No 2-phase commit
- parallel transactions
- failed nodes are automatically disabled
[Diagram: an INSERT INTO t statement executed across backends]
25. Controller replication
jdbc:c-jdbc://node1:25322,node2:12345/myDB
- Prevent the controller from being a single point of failure
- Group communication for controller synchronization
- C-JDBC driver supports multiple controllers with automatic failover (see the sketch below)
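With the multi-controller URL form above, the driver can fail over between controllers transparently. A minimal sketch (URL syntax taken from the slide; class name assumed, behaviour as described rather than verified against a specific C-JDBC release):

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class MultiControllerSketch {
    // Listing both controllers in the URL lets the C-JDBC driver pick an
    // available one and fail over automatically if it becomes unreachable.
    private static final String URL =
            "jdbc:c-jdbc://node1:25322,node2:12345/myDB";

    public static Connection open() throws Exception {
        Class.forName("org.objectweb.cjdbc.driver.Driver"); // assumed class name
        return DriverManager.getConnection(URL, "user", "password");
    }
}
```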
26. Controller replication
27. Mixing horizontal and vertical scalability
28. Lessons learned
- SQL parsing cannot be generic
- many discrepancies in JDBC implementations
- minimize the use of group communications
- IP multicast does not scale
- notification infrastructure needed
- users want
- no single point of failure
- control (monitoring, pluggable recovery policies, ...)
- no database vendor lock-in
- no database modification
- need for an exhaustive test suite
- benchmarking accurately is very difficult
- load injection requires resources
- monitoring and exploiting results is tricky
29. Sardes role in GORDA
- provide input
- GORDA APIs
- group communication requirements
- monitoring and management requirements
- middleware implementation based on C-JDBC
- dissemination effort
- ObjectWeb
- possible participation in the JCP for JDBC extensions
- hardware resources for experiments
- eCommerce benchmarks
30. Other interests
- LeWYS (http://lewys.objectweb.org)
- monitoring infrastructure
- generic hardware/kernel probes for Linux/Windows
- software probes: JMX, SNMP, ...
- monitoring repository
- autonomic behavior
- building supervision loops
- self-healing clusters
- self-sizing (expand or shrink)
- SLAs
31. Q without A
- do we consider distributed query execution?
- XA support?
- cluster size targeted?
- do we target grids or cluster of clusters?
- reconciliation
- consistency/caching
- network architecture considered?
- are relaxed or loose consistency models an option?
- what will the GRI really cover?
- do we impose a specific way of doing replication?
- access to read-set/write-set is difficult to implement with legacy databases
- which workloads are considered?
- which WP deals with backup/recovery?
- licensing issues?
32. Q&A
Thanks to all users and contributors ...
33. Bonus slides
34. INTERNALS
35. Virtual Database
- gives the view of a single database
- establishes the mapping between the database name used by the application and the backend specific settings
- backends can be added and removed dynamically
- configured using an XML configuration file
36. Authentication Manager
- Matches the real login/password used by the application with backend specific login/password
- Administrator login to manage the virtual database
37. Scheduler
- Manages concurrency control (see the sketch below)
- Specific implementations for Single DB, RAIDb-0, 1 and 2
- Query-level
- Optimistic and pessimistic transaction level
- uses the database schema that is automatically fetched from the backends
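As an illustration of what a pessimistic transaction-level scheduler can do with the automatically fetched schema, here is a minimal sketch (not C-JDBC code; class and method names are invented for the example): writes take exclusive per-table locks derived from the statement, so conflicting updates are serialized before they reach the load balancer.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical pessimistic scheduler: one lock per table of the fetched schema. */
public class PessimisticSchedulerSketch {
    private final Map<String, ReentrantLock> tableLocks = new ConcurrentHashMap<>();

    /** Acquire locks on every table the write statement touches (in a fixed
     *  order to avoid deadlocks), then let the caller dispatch the query. */
    public void scheduleWrite(List<String> tables, Runnable dispatch) {
        List<String> ordered = tables.stream().sorted().distinct().toList();
        ordered.forEach(t ->
                tableLocks.computeIfAbsent(t, k -> new ReentrantLock()).lock());
        try {
            dispatch.run(); // e.g. hand the statement to the load balancer
        } finally {
            for (int i = ordered.size() - 1; i >= 0; i--) {
                tableLocks.get(ordered.get(i)).unlock();
            }
        }
    }
}
```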
38. Request cache
- caches results from SQL requests
- improved SQL statement analysis to limit cache invalidations
- table based invalidations
- column based invalidations
- single-row SELECT optimization
- request parsing possible in the C-JDBC driver
- offloads the controller
- parsing caching in the driver
39. Load balancer 1/2
- RAIDb-0
- query directed to the backend having the needed tables
- RAIDb-1
- reads executed by the current thread
- writes executed in parallel by a dedicated thread per backend (see the sketch below)
- result returned if one, a majority or all commit
- if one node fails but others succeed, the failing node is disabled
- RAIDb-2
- same as RAIDb-1 except that writes are sent only to nodes owning the written tables
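A minimal sketch of the RAIDb-1 write path described above (names are invented for the example, not C-JDBC's): the write is pushed to every backend in parallel, the call returns once the configured number of backends has acknowledged, and backends that fail are disabled rather than aborting the whole statement.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical RAIDb-1 write dispatcher: one task per backend. */
public class ParallelWriteSketch {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    /** Executes the write on all backends; returns once 'required' backends
     *  succeed. Failing backends are reported so they can be disabled. */
    public void write(List<Connection> backends, String sql, int required)
            throws InterruptedException {
        CountDownLatch acks = new CountDownLatch(required);
        for (Connection backend : backends) {
            pool.submit(() -> {
                try (Statement stmt = backend.createStatement()) {
                    stmt.executeUpdate(sql);
                    acks.countDown();           // success on this backend
                } catch (SQLException e) {
                    disableBackend(backend, e); // others may still succeed
                }
            });
        }
        // "one", "majority" or "all" is just the value of 'required';
        // a real implementation would also detect when too many backends fail.
        acks.await();
    }

    private void disableBackend(Connection backend, SQLException cause) {
        System.err.println("Disabling backend after write failure: " + cause);
    }
}
```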
40. Load balancer 2/2
- Static load balancing policies
- Round-Robin (RR)
- Weighted Round-Robin (WRR)
- Least Pending Requests First (LPRF)
- request sent to the node that has the shortest pending request queue (see the sketch below)
- efficient if backends are homogeneous in terms of performance
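A minimal sketch of the LPRF policy as described above (illustrative only, not the C-JDBC implementation): pick the backend with the fewest in-flight requests, counting a request in while it executes.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical Least Pending Requests First selection over named backends. */
public class LprfBalancerSketch {
    private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    /** Chooses the backend with the shortest pending request queue. */
    public String choose(List<String> backends) {
        return backends.stream()
                .min(Comparator.comparingInt(
                        b -> pending.computeIfAbsent(b, k -> new AtomicInteger()).get()))
                .orElseThrow();
    }

    /** Call around each request so the pending counters stay accurate. */
    public void execute(String backend, Runnable request) {
        AtomicInteger count = pending.computeIfAbsent(backend, k -> new AtomicInteger());
        count.incrementAndGet();
        try {
            request.run();
        } finally {
            count.decrementAndGet();
        }
    }
}
```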
41. Connection Manager
- Connection pooling for a backend
- Simple: no pooling
- RandomWait: blocking pool
- FailFast: non-blocking pool
- VariablePool: dynamic pool
- Connection pools defined on a per-login basis (see the sketch below)
- resource management per login
- dedicated connections for admin
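For illustration, a minimal sketch of a blocking pool in the spirit of the RandomWait/VariablePool managers listed above (invented names; the real C-JDBC connection managers are configured per backend and per login):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical blocking pool: grows lazily up to a maximum, then blocks. */
public class BlockingPoolSketch {
    private final LinkedBlockingQueue<Connection> idle = new LinkedBlockingQueue<>();
    private final String url, user, password;
    private final int max;
    private int created = 0;

    public BlockingPoolSketch(String url, String user, String password, int max) {
        this.url = url; this.user = user; this.password = password; this.max = max;
    }

    public Connection acquire() throws SQLException, InterruptedException {
        Connection c = idle.poll();
        if (c != null) return c;
        synchronized (this) {
            if (created < max) {        // grow the pool lazily
                created++;
                return DriverManager.getConnection(url, user, password);
            }
        }
        return idle.take();             // pool exhausted: block until a release
    }

    public void release(Connection c) {
        idle.offer(c);                  // hand the connection back to waiters
    }
}
```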
42. Recovery Log
- Checkpoints are associated with database dumps
- Records all updates and transaction markers since a checkpoint
- Used to resynchronize a database from a checkpoint
- JDBCRecoveryLog (see the sketch below)
- stores information in a database
- can be re-injected in a C-JDBC cluster for fault tolerance
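A minimal sketch of what a JDBC-backed recovery log might look like (table layout and names are assumptions for illustration, not C-JDBC's schema): every write and transaction marker is appended with an increasing id, so replaying from a checkpoint is an ordered scan.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical JDBC-backed recovery log. */
public class JdbcRecoveryLogSketch {
    private final Connection logDb; // connection to the database holding the log

    public JdbcRecoveryLogSketch(Connection logDb) throws SQLException {
        this.logDb = logDb;
        try (Statement s = logDb.createStatement()) {
            // DDL is illustrative and dialect-dependent.
            s.execute("CREATE TABLE IF NOT EXISTS recovery_log ("
                    + "id BIGINT GENERATED ALWAYS AS IDENTITY, "
                    + "tx_id BIGINT, login VARCHAR(64), sql_text VARCHAR(4096))");
        }
    }

    /** Appends a write statement or transaction marker (begin/commit/rollback). */
    public void append(long txId, String login, String sql) throws SQLException {
        try (PreparedStatement ps = logDb.prepareStatement(
                "INSERT INTO recovery_log (tx_id, login, sql_text) VALUES (?, ?, ?)")) {
            ps.setLong(1, txId);
            ps.setString(2, login);
            ps.setString(3, sql);
            ps.executeUpdate();
        }
    }

    /** Replays every logged statement after the given checkpoint id on a backend.
     *  In a real log, transaction markers would be interpreted, not executed. */
    public void replaySince(long checkpointId, Connection backend) throws SQLException {
        try (PreparedStatement ps = logDb.prepareStatement(
                "SELECT sql_text FROM recovery_log WHERE id > ? ORDER BY id")) {
            ps.setLong(1, checkpointId);
            try (ResultSet rs = ps.executeQuery(); Statement st = backend.createStatement()) {
                while (rs.next()) {
                    st.executeUpdate(rs.getString("sql_text"));
                }
            }
        }
    }
}
```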
43. SCALABILITY
44. C-JDBC scalability
- Horizontal scalability
- prevents the controller from being a Single Point Of Failure (SPOF)
- distributes the load among several controllers
- uses group communications for synchronization
- C-JDBC Driver
- multiple controllers with automatic failover
- jdbc:c-jdbc://node1:25322,node2:12345/myDB
- connection caching
- URL parsing/controller lookup caching
45. C-JDBC scalability
- Vertical scalability
- allows nested RAIDb levels
- allows tree architectures for scalable write broadcast
- necessary with a large number of backends
- C-JDBC driver re-injected in the C-JDBC controller
46. C-JDBC vertical scalability
- RAIDb-1-1 with C-JDBC
- no limit to composition depth
47. C-JDBC vertical scalability
48. CHECKPOINTING
49. Fault tolerant recovery log
[Diagram: handling of an UPDATE statement]
50. Checkpointing
- Octopus is an ETL tool
- Use Octopus to store a dump of the initial database state
- Currently done by the user using the database specific dump tool
51. Checkpointing
- Backend is enabled
- All database updates are logged (SQL statement, user, transaction, ...)
52. Checkpointing
- Add new backends while the system is online
- Restore the dump corresponding to the initial checkpoint with Octopus
53. Checkpointing
- Replay updates from the log
54. Checkpointing
- Enable backends when done
55. Making new checkpoints
- Disable one backend to have a coherent snapshot
- Mark the new checkpoint entry in the log
- Use Octopus to store the dump
56. Making new checkpoints
- Replay missing updates from the log
57. Making new checkpoints
- Re-enable backend when done
58. Recovery
- A node fails!
- Automatically disabled, but should be fixed or replaced by the administrator
59. Recovery
- Restore latest dump with Octopus
60. Recovery
- Replay missing updates from the log
61. Recovery
- Re-enable backend when done
62. HORIZONTAL SCALABILITY
63. Horizontal scalability
- JGroups for controller synchronization
- Group messages for writes only
64. Horizontal scalability
- Centralized write approach issues
- Issues with transactions assigned to connections
65. Horizontal scalability
- General case for a write query
- 3 multicasts + 2n unicasts
66. Horizontal scalability
- Solution: no backend sharing
- 1 multicast + n unicasts + 1 multicast
67. Horizontal scalability
- Issues with JGroups
- resources needed by a channel
- instability of throughput with UDP
- performance scalability
- TCP better than UDP but
- unable to disable reliability on top of TCP
- unable to disable garbage collection
- ordering implementation is sub-optimal
- Need for a new group communication layer optimized for clusters
68. Horizontal scalability
- JGroups performance on UDP/Fast Ethernet
69. USE CASES
70. Budget High Availability
- High availability infrastructure on a budget
- Typical eCommerce setup
- http://www.budget-ha.com
71. OpenUSS University Support System
- eLearning
- High availability
- Portability
- Linux, HP-UX, Windows
- InterBase, Firebird, PostgreSQL, HypersonicSQL
- http://openuss.sourceforge.net
72. Flood alert system
- Disaster recovery
- Independent nodes synchronized with C-JDBC
- VPN for security issues
- http://floodalert.org
73. J2EE benchmarking
- Large scale J2EE clusters
- http://jmob.objectweb.org
74. PERFORMANCE
75. TPC-W
76. TPC-W
77. TPC-W
78. Result cache
- Cache contains a list of SQL -> ResultSet mappings
- Policy defined by queryPattern -> Policy
- 3 policies (see the sketch below)
- EagerCaching: variable granularities for invalidations
- RelaxedCaching: invalidations based on timeout
- NoCaching: never cached
RUBiS bidding mix with 450 clients:
                      No cache   Coherent cache   Relaxed cache
Throughput (rq/min)   3892       4184             4215
Avg response time     801 ms     284 ms           134 ms
Database CPU load     100%       85%              20%
C-JDBC CPU load       -          15%              7%
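A minimal sketch of the queryPattern -> Policy idea with a RelaxedCaching-style timeout (invented names; C-JDBC's actual cache keys off the parsed request and supports the finer invalidation granularities listed above):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical result cache with a timeout-based (relaxed) policy. */
public class RelaxedCacheSketch<R> {
    private record Entry<R>(R result, long expiresAtMillis) {}

    private final Map<String, Entry<R>> cache = new LinkedHashMap<>();
    private final long timeoutMillis;

    public RelaxedCacheSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns the cached result for this SQL text, or executes and caches it. */
    public synchronized R get(String sql, Supplier<R> execute) {
        Entry<R> e = cache.get(sql);
        if (e != null && e.expiresAtMillis() > System.currentTimeMillis()) {
            return e.result();                       // served from the cache
        }
        R fresh = execute.get();                     // run the query on a backend
        cache.put(sql, new Entry<>(fresh, System.currentTimeMillis() + timeoutMillis));
        return fresh;
    }
}
```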
79. Outline
- Motivations
- RAIDb
- C-JDBC
- Performance
- Lessons learned
- Conclusion
80. Open problems
- Partitioning of clusters
- Users want control over the failure policy
- Reconciliation must also be user controlled
81. LeWYS overview
[Diagram: observers connected to a monitoring repository through DREAM channels]
82. LeWYS
83. LeWYS components
- Library of probes
- hardware resources: cpu, memory, disk, network
- generic sensors: SNMP, JMX, JVMPI, ...
- Monitoring pump
- dynamic deployment of sensors
- manages monitoring leases
- Event channels (see the sketch below)
- propagate monitored events to interested observers
- allow for filtering, aggregation, content-based processing, ...
- Optional monitoring repository
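As an illustration of the event channel role (a plain Java sketch with invented types; LeWYS actually builds its channels out of DREAM components): probes push events into a channel, which applies a filter before forwarding them to interested observers.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical monitoring event and a filtering event channel. */
public class EventChannelSketch {
    public record MonitoringEvent(String probe, String host, long timestamp, double value) {}

    private final List<Consumer<MonitoringEvent>> observers = new CopyOnWriteArrayList<>();
    private final Predicate<MonitoringEvent> filter;

    public EventChannelSketch(Predicate<MonitoringEvent> filter) {
        this.filter = filter;
    }

    public void subscribe(Consumer<MonitoringEvent> observer) {
        observers.add(observer);
    }

    /** Called by the monitoring pump for every probe reading. */
    public void publish(MonitoringEvent event) {
        if (filter.test(event)) {                    // e.g. only forward CPU spikes
            observers.forEach(o -> o.accept(event));
        }
    }
}
```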
84. LeWYS design choices
- Component-based framework
- probes, monitoring pump, event channels
- provides (re)configurability capabilities
- Minimize intrusiveness on monitored nodes
- No global clock
- timestamp generated locally by pump
- Information processing in DREAM channels
85. Centralized monitoring using a monitoring repository (1)
86. Centralized monitoring using a monitoring repository (2)
- Monitoring repository
- stores monitoring information
- service to retrieve monitoring information
- Pros
- a DB allows for storing large amounts of data
- powerful queries
- correlate data from various probes at different locations
- resynchronize clocks
- browsing history to diagnose failures
- use history for system provisioning
- Cons
- requires a DB (heavyweight solution)
87. Outline
- J2EE Cluster
- Group communications
- Monitoring
- motivations
- LeWYS
- implementation
- Status & Perspectives
88. Monitoring pump implementation
[Component diagram: ProbeManager, Probes and CachedProbes with a Cache, Probe Repository, Binding Controller, MonitoringPumpManager, Monitoring Pump Thread, TimeStamp, Pull/Push Multiplexer, OutputManagers (ChannelOut, RMI)]
89. Hardware Probes
- Pure Java probes
- using /proc (see the sketch below)
- cost: 0.01 ms/call (Linux)
[Diagram: cpu, mem, disk, net and kernel probes over /proc on Linux; C probes behind JNI (.DLL) on Solaris and Windows]
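A minimal sketch of such a pure Java probe (illustrative, not LeWYS code): it reads the aggregate CPU line of /proc/stat twice and derives utilization from the jiffy counters.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pure Java CPU probe reading /proc/stat (Linux only). */
public class ProcStatCpuProbeSketch {
    /** Returns {idle jiffies, total jiffies} from the first line of /proc/stat. */
    private static long[] sample() throws IOException {
        String cpuLine = Files.readAllLines(Path.of("/proc/stat")).get(0);
        String[] fields = cpuLine.trim().split("\\s+"); // "cpu user nice system idle ..."
        long idle = Long.parseLong(fields[4]);
        long total = 0;
        for (int i = 1; i < fields.length; i++) {
            total += Long.parseLong(fields[i]);
        }
        return new long[] { idle, total };
    }

    public static void main(String[] args) throws Exception {
        long[] before = sample();
        Thread.sleep(1000);                      // sampling interval
        long[] after = sample();
        double idleDelta = after[0] - before[0];
        double totalDelta = after[1] - before[1];
        System.out.printf("CPU utilization: %.1f%%%n", 100.0 * (1.0 - idleDelta / totalDelta));
    }
}
```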
90. Software Probes
- Application level monitoring (see the sketch below)
- JMX
- ad-hoc
- JVM
[Diagram: SNMP and ad-hoc probes, JVM probes (JVMPI), JMX based probes on top of the JVM and hardware resources]
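For illustration, a minimal JMX-style software probe using the standard platform MXBeans (not LeWYS code; LeWYS wraps such readings in probe components):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

/** Hypothetical software probe reading JVM metrics through platform MXBeans. */
public class JmxProbeSketch {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        // The same values could be published as monitoring events to an event channel.
        System.out.println("Heap used (bytes): " + memory.getHeapMemoryUsage().getUsed());
        System.out.println("Live threads     : " + threads.getThreadCount());
        System.out.println("System load avg  : " + os.getSystemLoadAverage());
    }
}
```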