Large RAC Benchmark Lessons Learned
Hanan Hit
NOCOUG, Feb 14, 2006
Today's Agenda
- SUT Definitions
- Benchmark Goals
- RAC Mode Used
- OS (HP-UX) best practices
- HMP - Hyper Messaging Protocol
- UDP Usage
- OS Scheduling
- Deadlocks
- Startup / Shutdown
- Execution Plans
- ORA Parameters
- Multiple Block Sizes
The Tested Application
- Type 1 - PowerBuilder front end on Windows, Tuxedo middleware, and an Oracle backend.
- Type 2 - Java on Sun servers using WebLogic as the middle tier.
- Type 3 - Batch processing written in Pro*C / Pro*COBOL.
- All the Oracle database servers run on HP servers, running Oracle 9.2.0.5 on HP-UX 11i v2.
SUT
SUT (Cont.)
- Database size: 30 TB
- Raw devices
- Non-shared Oracle Home
- Two full XP 12K storage arrays from HP
  - 82 TB each (1,152 disks, 73 GB, 15K RPM)
  - 128 GB cache
  - 120,000 IOPS (maximum random disk performance)
Machines
Machines (Cont.)
Machines (Cont.)
Benchmark Goals
- Manual load balancing
- Online only - 6K Type 1, 3.5K Type 2 (Test 1)
  - All the connections to the RAC instances should be balanced.
  - The throughput should be comparable with an earlier non-RAC test environment.
- Online + batch - 6K Type 1, 3.5K Type 2, 500 Type 3 (Test 2)
  - All the connections to the RAC instances should be balanced.
  - The throughput should be comparable with the earlier non-RAC tests.
  - Type 3 timings should be comparable with the non-RAC tests.
- Online only - 12K Type 1, 7K Type 2 (Test 3)
  - All the connections to the RAC instances should be balanced.
  - The throughput should increase by 95% compared with the earlier Test 1.
Benchmark Goals (Cont.)
- RAC load balancing
  - Online only - 6K Type 1, 3.5K Type 2 (Test 1)
  - All the connections to the RAC instances should be balanced.
  - The throughput should be comparable with an earlier non-RAC test environment.
- Configuration / setup
  - Amend the /etc/tnsnames.ora with the RAC load-balancing connect strings (see the sketch below).
- RAC failover testing
  - Online only - 6K Type 1, 3.5K Type 2 (Test 1)
  - All the database connections that were connected to the RAC3 instance should have successfully failed over to the surviving RAC1 instance.
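For reference, a RAC load-balancing / failover entry in tnsnames.ora looks roughly like the sketch below. This is a generic Oracle Net example, not the connect string used in the benchmark; the alias, host names, and service name (RACDB, rachost1-3, racdb) are invented for illustration.

    RACDB =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (LOAD_BALANCE = ON)
          (FAILOVER = ON)
          (ADDRESS = (PROTOCOL = TCP)(HOST = rachost1)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = rachost2)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = rachost3)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = racdb)
          (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC))
        )
      )

LOAD_BALANCE=ON makes the client pick the addresses at random, while FAILOVER=ON plus the FAILOVER_MODE clause gives the TAF behavior exercised in the failover test.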
RAC Mode
OS / Port-Specific Best Practices
- IPC protocols:
  - HP's HMP
  - HP Tru64's RDG (Reliable Datagram)
  - Sun's RSM (Remote Shared Memory)
  - VIA on Intel
- In general, the recommendation is to use a low-latency user-mode IPC rather than UDP or TCP.
What is HyperFabric?
- HyperFabric is a high-speed cluster interconnect fabric that supports both industry-standard TCP/UDP over IP and HP's proprietary Hyper Messaging Protocol (HMP).
- HyperFabric extends the scalability and reliability of TCP/UDP by providing transparent load balancing of connection traffic across multiple network interface cards (NICs) and transparent failover of traffic from one card to another without invoking MC/ServiceGuard.
- The HyperFabric NIC incorporates a network processor that implements HP's Hyper Messaging Protocol and provides lower latency and lower host CPU utilization for standard TCP/UDP benchmarks over HyperFabric when compared to Gigabit Ethernet.
- Hewlett-Packard released HyperFabric in 1998 with a link rate of 2.56 Gbps over copper. In 2001, Hewlett-Packard released HyperFabric 2 with a link rate of 4.0 Gbps over fiber and compatibility with the copper HyperFabric interface. Both HyperFabric products support clusters of up to 64 nodes.
HyperFabric Switches
- Hewlett-Packard provides a very fast cluster interconnect via its proprietary HyperFabric switches.
- The latest product is HyperFabric 2, a new set of hardware components with fiber connectors that enable a low-latency, high-bandwidth system interconnect.
- With fiber interfaces, HyperFabric 2 provides speeds of up to 4 Gbps in full duplex over distances of up to 200 meters.
- HyperFabric 2 also scales well, supporting up to 16 hosts via point-to-point connectivity and up to 64 hosts via fabric switches. It is backward compatible with previous versions of HyperFabric and is available on IA-64 and PA-RISC servers.
Hyper Messaging Protocol (HMP)
- HMP, also known as "Lowfat", is HP's reliable user-mode IPC protocol.
- Hyper Messaging Protocol significantly expands on the feature set provided by TCP/UDP by providing a true reliable-datagram model for both remote direct memory access (RDMA) and traditional message semantics.
- Coupled with OS-bypass capability and the hardware support for protocol offload provided by HyperFabric, HMP provides high bandwidth, low latency, and extremely low CPU utilization, with an interface and feature set optimized for business-critical parallel applications such as Oracle9i RAC.
How to Relink Oracle with HMP
- OS configuration:
  - setprivgrp dba MLOCK
  - To persist over reboots, add "dba MLOCK" to /etc/privgroup
  - The kernel parameter MAXFILES must be at least 1024
- HMP configuration lives in /opt/clic/lib/skgxp/skclic.conf
- HMP relink:
  - make -f ins_rdbms.mk rac_on ipc_hms ioracle
- UDP relink (back to the default):
  - make -f ins_rdbms.mk rac_on ipc_udp ioracle
- A consolidated sketch of the procedure follows below.
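Put together, the relink procedure looks roughly like this (a sketch assuming a standard $ORACLE_HOME layout and the make targets quoted above):

    # as root: grant the MLOCK privilege to the dba group and make it persistent
    /usr/sbin/setprivgrp dba MLOCK
    echo "dba MLOCK" >> /etc/privgroup

    # as the oracle software owner: relink with HMP
    cd $ORACLE_HOME/rdbms/lib
    make -f ins_rdbms.mk rac_on ipc_hms ioracle

    # to fall back to UDP (the default) later
    make -f ins_rdbms.mk rac_on ipc_udp ioracle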
HMP Diagnostics
- Check the alert.log and verify that you are using the HMP protocol.
- netstat -in - check that the clicN interfaces are up on all nodes and that the MTU is at least 32K.
- The cluster_interconnects init.ora (spfile) parameter will not show the HMP interconnect (it takes IP addresses) and shouldn't be used.
- oradebug setmypid
- oradebug ipc
- A sample check session follows below.
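A quick way to confirm which IPC protocol the instance actually picked up (an illustrative session; the exact trace wording varies by release):

    $ netstat -in                 # clicN interfaces up on every node, MTU at least 32K?

    SQL> oradebug setmypid
    SQL> oradebug ipc
    SQL> oradebug tracefile_name  # the dumped trace shows the SKGXP / IPC protocol in use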
HMP Findings
- Each Oracle shadow process requires 300-400 KB of additional memory.
- The system could not scale the number of concurrent connections beyond 1K, and no additional connections could be established to either of the instances once this threshold had been reached.
- The maximum (single-card) memory was 4 GB (during the benchmark timeframe).
- We could not start more than a single instance once the above threshold was reached; the other instances kept waiting.
- Direct connections could still be established to the single node that was active.
HMP Findings (Cont.)
- netstat may show that most of the traffic is going to a single interface, but this is not really a concern: the traffic is load balanced in the fabric and the command is simply unable to see it.
- Roll back to UDP over HyperFabric, as this is usually more than good enough.
UDP Findings
- Use UDP over HyperFabric, as it is able to use multiple cards.
- Because of the failover scenarios, we had to be able to cope with at least twice the per-node load on a single node, hence a minimum of 20K concurrent connections.
- The initial setup failed with ORA-27504 errors pointing to OS error 227.
- The cause turned out to be an OS limitation on the number of UDP ports, which defaults to 16380 (49152 through 65535).
- The solution is to lower the starting port from 49152 to 20000 (see the sketch below for making it persistent):
  - /usr/bin/ndd -set /dev/udp udp_smallest_anon_port 20000
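To see the current range and make the change survive a reboot, something like the following works on HP-UX (the nddconf index [0] is only an example; use the next free slot in your file):

    # check the current anonymous UDP port range
    /usr/bin/ndd -get /dev/udp udp_smallest_anon_port
    /usr/bin/ndd -get /dev/udp udp_largest_anon_port

    # change it online
    /usr/bin/ndd -set /dev/udp udp_smallest_anon_port 20000

    # make it persistent by adding an entry to /etc/rc.config.d/nddconf
    TRANSPORT_NAME[0]=udp
    NDD_NAME[0]=udp_smallest_anon_port
    NDD_VALUE[0]=20000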
Scheduling - Process Prioritization
- Most UNIX operating systems implement time-sharing (TS) and real-time (RT) scheduling.
- The default for user processes is TS.
- With the regular (TS) scheduler, the longer a process runs, the weaker its priority gets and the greater the chance that it will be preempted.
- The cost of a context switch is likely to be very high, especially for the LMD/LMS processes in RAC.
- The SCHED_NOAGE scheduler allows processes to hold the CPU until they give it up.
- Starting all processes at the same, highest priority of this scheduler removes the possibility that they will steal the CPU from each other.
SCHED_NOAGE
- Unlike the normal TS policy, a process scheduled with SCHED_NOAGE will not have its priority increased or decreased, nor will it be preempted.
- How to implement:
  - As root:
    - setprivgrp -g dba RTSCHED RTPRIO
    - Enter "dba RTSCHED RTPRIO" in /etc/privgroup so it persists across reboots
  - In init.ora (spfile) enter:
    - hpux_sched_noage=178
  - Check (with ps) that the Oracle background processes use priority 178 (see the check below).
  - In RAC, make sure this is true on all nodes.
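A minimal verification sketch (the PRI column position assumes HP-UX ps -efl output; adjust the grep pattern to your instance names):

    # init.ora / spfile entry on every instance
    hpux_sched_noage = 178

    # the 7th column (PRI) should show 178 for the background processes
    ps -efl | grep ora_ | grep -v grep | awk '{print $7, $NF}'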
Deadlocks etc.
- Deadlock situations can escalate into a serious issue in a RAC environment.
- Please make sure to check for these issues prior to any RAC implementation.
- A hint that might help:
  - event = "7445 trace name ERRORSTACK level 3"
- We encountered a situation in which an instance crashed due to deadlocks: RAC1 crashed while the databases on the other nodes remained up.
Deadlocks etc. (Cont.)
- In a RAC environment, deadlock detection works differently than in non-RAC. In non-RAC, Oracle updates the alert.log with ORA-60 and generates a user trace file; in a RAC environment, it puts the deadlock graph only in the LMD trace file (background_dump_dest, no user trace file).
- Since we are dealing with global resources in RAC, the deadlock detection mechanism differs from a non-RAC environment. In RAC, the LMD process periodically checks for deadlocks on the database. How often it checks is controlled by the parameter _lm_dd_interval. The default is 60 seconds on most versions / platforms and can be checked with the following query:

    select x.ksppinm, y.ksppstvl
    from   x$ksppi x, x$ksppcv y
    where  x.indx = y.indx
    and    x.ksppinm like '%lm_dd%'
    order  by x.ksppinm;

- Setting _lm_dd_interval too low can cause LMD to consume too much CPU. Setting it too high can cause applications to hang because deadlocks are not detected quickly enough. The default is sufficient for most cases.
Deadlocks etc. (Cont.)
- Deadlocks may occur on the same node or across different nodes in the cluster.
- Diagnosing ORA-60 errors on a single instance is easier than when the deadlock spans multiple instances (getting the SQL statement).
- For the multiple-instance case, starting from 9.2.0.6 / 10.1.0.3 and above, Oracle dumps additional diagnostics for deadlocks in a RAC environment.
- On a single instance you can use the following event (after checking the LMD trace file), set on all instances in the spfile (see the sketch below):
  - event = "60 trace name errorstack level 3; name systemstate level 10"
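With an spfile the event can be pushed to all instances in one statement; a sketch (EVENT is a static parameter, so it only takes effect after the instances are restarted):

    alter system set event =
      '60 trace name errorstack level 3; name systemstate level 10'
      scope=spfile sid='*';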
Startup / Shutdown
- Be cautious while performing startup and shutdown operations on such large SGAs.
- Shutdown may take a very long time (half an hour).
- While performing the procedure below we got a few Oracle errors:
  - srvctl start database -d
- Bug 2540942 (LMS may spin in kjctr_rksxp() during simultaneous shutdown).
- During simultaneous shutdown of several instances with large buffer caches, there is a chance of running out of tickets due to the number of messages that need to be sent for cleanup purposes. The most visible side effect is LMS spinning in kjctr_rksxp(); the ORA-600 [504] is a side effect of this.
Different Execution Plans

SQL statement (hash value 2601636153) as captured on node 1:

    select members.MEMBER_ID into :b0:b1
    from (select t.MEMBER_ID
          from   UFMI_INV u, RM_MEMBER_ID t
          where  u.URBAN_ID(+) = :b2
          and    u.FLEET_ID(+) = :b3
          and    u.MEMBER_ID(+) = t.MEMBER_ID
          and    u.MEMBER_ID is null
          and    t.MEMBER_ID = :b4
          and    t.MEMBER_ID = t.MEMBER_ID) members
    where  ROWNUM ...

Plan on node 1 (12 rows selected; operations and Starts / A-Rows values as captured):

    INDEX RANGE SCAN  UFMI_INV_PK        4
    SORT JOIN                            4
    INDEX RANGE SCAN  RM_MEMBER_ID_PK    142K
    MERGE JOIN OUTER
    FILTER
    FILTER
    VIEW                                 142K
    COUNT STOPKEY
    SELECT STATEMENT
Different Execution Plans (Cont.)

The same statement (hash value 2601636153) on node 2 produced a different plan.

Plan on node 2 (11 rows selected; operations and Starts / A-Rows values as captured):

    INDEX RANGE SCAN  UFMI_INV_PK        1
    INDEX RANGE SCAN  RM_MEMBER_ID_PK    1
    NESTED LOOPS OUTER
    FILTER
    FILTER
    VIEW                                 1
    COUNT STOPKEY
    SELECT STATEMENT
What Might Be the Reason?
- The first set of binds on node 1 returns a larger result set.
- Bind peeking:
  - Versions 9i / 10g use bind variable peeking: the CBO assumes that the bind variable values used for the first execution of a SQL statement are representative of the values to be used in future executions of the same statement.
  - Bind variable peeking occurs every time the cursor is parsed or loaded.
  - If the data is skewed, the execution plan may change when, for example, the shared pool is flushed or the underlying objects are analyzed.
- The optimizer is making different estimates because of the different parameters set on each node.
- Histograms built on these columns.
How to Solve the Issue?
- Turn off bind peeking (_optim_peek_user_binds=false)
  - Pros: eliminates any bind peeking.
  - Cons: affects all statements, some of which could already be benefiting from bind peeking.
- Hints (if the data is skewed or if the result set of the binds varies a lot)
  - Pros: always get the same plan, regardless of the peeked binds.
  - Cons: unfortunately requires an application change.
- Use stored outlines (we chose this option; see the sketch below)
  - Pros: no need to change the application, and it applies immediately.
  - Cons: extra DBA maintenance, and any change to the query text will disable the use of the outline.
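A minimal stored-outline workflow, assuming a user with the CREATE ANY OUTLINE privilege; the category name bench_plans is invented for the example:

    -- capture outlines for the statements executed in this session
    alter session set create_stored_outlines = bench_plans;
    -- run the problematic statement once while it gets the desired plan, then stop capturing
    alter session set create_stored_outlines = false;

    -- enable use of the captured plans (run on each instance; this setting is not stored in the spfile)
    alter system set use_stored_outlines = bench_plans;

    -- the captured outlines are visible in DBA_OUTLINES / DBA_OUTLINE_HINTS
    select name, category, used from dba_outlines;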
Hidden Parameters and More
- _row_cr
  - The "global cache cr request" wait event is generated when an instance has requested a consistent-read data block (or undo segment header) and the block to be transferred has not yet arrived at the requesting instance. Please make sure to check V$CR_BLOCK_SERVER (see the queries below).
  - Using _row_cr=TRUE (default FALSE) may reduce the number of CR rollbacks and avoid a costly cleanup/rollback in RAC. Instead of performing block cleanup/rollback, Oracle will try to generate a CR copy for the given row.
  - Once we used this parameter we hit another bug, 3577772.8, which we found during a Direct Path export.
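For context, these are the kinds of queries that expose the CR traffic discussed above; a sketch with column names as they appear in 9.2 (verify against your release):

    -- time spent waiting for CR blocks shipped from other instances
    select event, total_waits, time_waited
    from   v$system_event
    where  event = 'global cache cr request';

    -- CR / current block serving activity on the holding instance
    select cr_requests, current_requests, data_requests, undo_requests
    from   v$cr_block_server;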
Hidden Parameters and More (Cont.)
- FAST_START_MTTR_TARGET
  - This parameter ensures that instance or crash recovery completes within the given time frame; when set to a non-zero value it enables incremental checkpoints. The default value is 300 seconds (5 minutes).
  - The customer was using 0 seconds; once we set this parameter to the Oracle default, we reduced the inter-node communication as well as the overall I/O on the system. Please make sure to adjust it to suit your site requirements (see the sketch below).
- _BUMP_HIGHWATER_MARK_COUNT
  - HW enqueue (high water mark): the high water mark of a table may be raised constantly as the table grows and runs out of blocks for new inserts on the free list. The default value for bumping up the HWM is 0, which may be too low on a busy system or for the average insert size.
  - We increased it to 5 and nearly eliminated this enqueue altogether.
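Setting the parameter back to the default and watching the effect might look like this (illustrative; fast_start_mttr_target is dynamic, so it can be changed online):

    alter system set fast_start_mttr_target = 300 scope=both sid='*';

    -- compare the target with what the instance currently estimates
    select target_mttr, estimated_mttr
    from   v$instance_recovery;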
Multiple Block Sizes?
- The single-instance system was using a KEEP buffer pool nearly the same size as the DEFAULT buffer pool.
- We tried once to segregate the indexes from the data into a smaller buffer pool (2K for indexes).
- ORA-00379: no free buffers available in buffer pool DEFAULT for block size 2K
- Because of the asymmetric machines, we hit the above error when the proper buffer pool was not defined on all instances (see the example below).
- The recommendation: combine the different buffer pool caches into one; there is no need to use KEEP and RECYCLE. Oracle can manage them well.
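For completeness, "defining the proper buffer pool on all instances" means every instance that can serve the 2K tablespaces needs a matching cache, along the lines of the illustrative line below (the 256M size is made up); the simpler fix, as recommended above, is to avoid the extra block size altogether.

    -- required on every instance that may touch the 2K tablespaces
    alter system set db_2k_cache_size = 256M scope=spfile sid='*';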
Summary
- Address application deadlocks. This is especially important when moving to RAC.
- Combine the different buffer caches into one; there is no need to use KEEP and RECYCLE. Oracle can manage them well.
- Move to locally managed tablespaces with ASSM (Automatic Segment Space Management).
- Don't consider using HMP when going to production on very large systems; using the fabric over UDP should be enough.
- Try to use hints once you encounter different execution plans (binds).
- If possible, partition the applications across a few physical nodes (only for really high concurrency volumes).
- All the bugs mentioned have been fixed in newer versions of Oracle.
- Take extra precautions when performing startup / shutdown in such large environments.
References
- HP Cluster Install
- Oracle 9i Best Practices - Performance, Diagnostics & Tuning
- http://www.oracle.com/technology/documentation/index.html (and search for HMP)