Large RAC Benchmark Lessons Learned
Hanan Hit
NOCOUG, Feb 14, 2006
Today's Agenda
- SUT Definitions
- Benchmark Goals
- RAC Mode Used
- OS (HP-UX) best practices
- HMP - Hyper Messaging Protocol
- UDP Usage
- OS Scheduling
- Deadlocks
- Startup / Shutdown
- Execution Plans
- ORA Parameters
- Multiple Block Sizes
The Tested Application
- Type 1 - PowerBuilder front end on Windows, Tuxedo middleware, and an Oracle backend.
- Type 2 - Java on Sun servers using WebLogic as the middle tier.
- Type 3 - Batch processing written in Pro*C / Pro*COBOL.
- All the Oracle database servers run on HP servers, running Oracle 9.2.0.5 on HP-UX 11i v2.
SUT
SUT (Cont.)
- Database size: 30 TB
- Raw devices
- Non-shared Oracle Home
- Two full XP 12K storage arrays from HP
  - 82 TB each (1,152 disks, 73 GB, 15K RPM)
  - 128 GB cache
  - 120,000 IOPS (maximum random disk performance)
Machines
Machines (Cont.)
Machines (Cont.)
Benchmark Goals
- Manual load balancing
- Online only - 6K Type 1, 3.5K Type 2 (Test 1)
  - All the connections to the RAC instances should be balanced.
  - The throughput should be comparable with an earlier non-RAC test environment.
- Online + batch - 6K Type 1, 3.5K Type 2, 500 Type 3 (Test 2)
  - All the connections to the RAC instances should be balanced.
  - The throughput should be comparable with the earlier non-RAC tests.
  - Type 3 timings should be comparable with the non-RAC tests.
- Online only - 12K Type 1, 7K Type 2 (Test 3)
  - All the connections to the RAC instances should be balanced.
  - The throughput should increase by 95% compared with the earlier Test 1.
Benchmark Goals (Cont.)
- RAC load balancing
  - Online only - 6K Type 1, 3.5K Type 2 (Test 1)
  - All the connections to the RAC instances should be balanced.
  - The throughput should be comparable with an earlier non-RAC test environment.
- Configuration / setup
  - Amend the /etc/tnsnames.ora with the RAC load-balancing connect strings (see the sketch below).
- RAC failover testing
  - Online only - 6K Type 1, 3.5K Type 2 (Test 1)
  - All the database connections that were connected to the RAC3 instance should have successfully failed over to the surviving RAC1 instance.
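For reference, a RAC load-balancing / failover entry in tnsnames.ora looks roughly like the sketch below. This is a generic Oracle Net example, not the connect string used in the benchmark; the alias, host names, and service name (RACDB, rachost1-3, racdb) are invented for illustration.

    RACDB =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (LOAD_BALANCE = ON)
          (FAILOVER = ON)
          (ADDRESS = (PROTOCOL = TCP)(HOST = rachost1)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = rachost2)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = rachost3)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = racdb)
          (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC))
        )
      )

LOAD_BALANCE=ON makes the client pick the addresses at random, while FAILOVER=ON plus the FAILOVER_MODE clause gives the TAF behavior exercised in the failover test.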
RAC Mode
OS / Port-Specific Best Practices
- IPC protocols:
  - HP's HMP
  - HP Tru64's RDG (Reliable Datagram)
  - Sun's RSM (Remote Shared Memory)
  - VIA on Intel
- In general, the recommendation is to use a low-latency user-mode IPC rather than UDP or TCP.
What is HyperFabric?
- HyperFabric is a high-speed cluster interconnect fabric that supports both industry-standard TCP/UDP over IP and HP's proprietary Hyper Messaging Protocol (HMP).
- HyperFabric extends the scalability and reliability of TCP/UDP by providing transparent load balancing of connection traffic across multiple network interface cards (NICs) and transparent failover of traffic from one card to another without invoking MC/ServiceGuard.
- The HyperFabric NIC incorporates a network processor that implements HP's Hyper Messaging Protocol and provides lower latency and lower host CPU utilization for standard TCP/UDP benchmarks over HyperFabric when compared to Gigabit Ethernet.
- Hewlett-Packard released HyperFabric in 1998 with a link rate of 2.56 Gbps over copper. In 2001, Hewlett-Packard released HyperFabric 2 with a link rate of 4.0 Gbps over fiber and compatibility with the copper HyperFabric interface. Both HyperFabric products support clusters of up to 64 nodes.
HyperFabric Switches
- Hewlett-Packard provides a very fast cluster interconnect via its proprietary HyperFabric switches.
- The latest product is HyperFabric 2, a new set of hardware components with fiber connectors that enable a low-latency, high-bandwidth system interconnect.
- With fiber interfaces, HyperFabric 2 provides speeds of up to 4 Gbps in full duplex over distances of up to 200 meters.
- HyperFabric 2 also scales well, supporting up to 16 hosts via point-to-point connectivity and up to 64 hosts via fabric switches. It is backward compatible with previous versions of HyperFabric and is available on IA-64 and PA-RISC servers.
Hyper Messaging Protocol (HMP)
- HMP, also known as "Lowfat", is HP's reliable user-mode IPC protocol.
- Hyper Messaging Protocol significantly expands on the feature set provided by TCP/UDP by providing a true reliable-datagram model for both remote direct memory access (RDMA) and traditional message semantics.
- Coupled with OS-bypass capability and the hardware support for protocol offload provided by HyperFabric, HMP provides high bandwidth, low latency, and extremely low CPU utilization, with an interface and feature set optimized for business-critical parallel applications such as Oracle9i RAC.
How to Relink Oracle with HMP
- OS configuration:
  - setprivgrp dba MLOCK
  - To persist over reboots, add "dba MLOCK" to /etc/privgroup
  - The kernel parameter MAXFILES must be at least 1024
- HMP configuration lives in /opt/clic/lib/skgxp/skclic.conf
- HMP relink:
  - make -f ins_rdbms.mk rac_on ipc_hms ioracle
- UDP relink (back to the default):
  - make -f ins_rdbms.mk rac_on ipc_udp ioracle
- A consolidated sketch of the procedure follows below.
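Put together, the relink procedure looks roughly like this (a sketch assuming a standard $ORACLE_HOME layout and the make targets quoted above):

    # as root: grant the MLOCK privilege to the dba group and make it persistent
    /usr/sbin/setprivgrp dba MLOCK
    echo "dba MLOCK" >> /etc/privgroup

    # as the oracle software owner: relink with HMP
    cd $ORACLE_HOME/rdbms/lib
    make -f ins_rdbms.mk rac_on ipc_hms ioracle

    # to fall back to UDP (the default) later
    make -f ins_rdbms.mk rac_on ipc_udp ioracle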
HMP Diagnostics
- Check the alert.log and verify that you are using the HMP protocol.
- netstat -in - check that the clicN interfaces are up on all nodes and that the MTU is at least 32K.
- The cluster_interconnects init.ora (spfile) parameter will not show the HMP interconnect (it takes IP addresses) and shouldn't be used.
- oradebug setmypid
- oradebug ipc
- A sample check session follows below.
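A quick way to confirm which IPC protocol the instance actually picked up (an illustrative session; the exact trace wording varies by release):

    $ netstat -in                 # clicN interfaces up on every node, MTU at least 32K?

    SQL> oradebug setmypid
    SQL> oradebug ipc
    SQL> oradebug tracefile_name  # the dumped trace shows the SKGXP / IPC protocol in use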
HMP Findings
- Each Oracle shadow process requires 300-400 KB of additional memory.
- The system could not scale the number of concurrent connections beyond 1K, and no additional connections could be established to either of the instances once this threshold had been reached.
- The maximum (single-card) memory was 4 GB (during the benchmark timeframe).
- We could not start more than a single instance once the above threshold was reached; the other instances kept waiting.
- Direct connections could still be established to the single node that was active.
HMP Findings (Cont.)
- netstat may show that most of the traffic is going to a single interface, but this is not really a concern: the traffic is load balanced in the fabric and the command is simply unable to see it.
- Roll back to UDP over HyperFabric, as this is usually more than good enough.
UDP Findings
- Use UDP over HyperFabric, as it is able to use multiple cards.
- Because of the failover scenarios, we had to be able to cope with at least twice the per-node load on a single node, hence a minimum of 20K concurrent connections.
- The initial setup failed with ORA-27504 errors pointing to OS error 227.
- The cause turned out to be an OS limitation on the number of UDP ports, which defaults to 16380 (49152 through 65535).
- The solution is to lower the starting port from 49152 to 20000 (see the sketch below for making it persistent):
  - /usr/bin/ndd -set /dev/udp udp_smallest_anon_port 20000
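To see the current range and make the change survive a reboot, something like the following works on HP-UX (the nddconf index [0] is only an example; use the next free slot in your file):

    # check the current anonymous UDP port range
    /usr/bin/ndd -get /dev/udp udp_smallest_anon_port
    /usr/bin/ndd -get /dev/udp udp_largest_anon_port

    # change it online
    /usr/bin/ndd -set /dev/udp udp_smallest_anon_port 20000

    # make it persistent by adding an entry to /etc/rc.config.d/nddconf
    TRANSPORT_NAME[0]=udp
    NDD_NAME[0]=udp_smallest_anon_port
    NDD_VALUE[0]=20000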
Scheduling - Process Prioritization
- Most UNIX operating systems implement time-sharing (TS) and real-time (RT) scheduling.
- The default for user processes is TS.
- With the regular (TS) scheduler, the longer a process runs, the weaker its priority gets and the greater the chance that it will be preempted.
- The cost of a context switch is likely to be very high, especially for the LMD/LMS processes in RAC.
- The SCHED_NOAGE scheduler allows processes to hold the CPU until they give it up.
- Starting all processes at the same, highest priority of this scheduler removes the possibility that they will steal the CPU from each other.
SCHED_NOAGE
- Unlike the normal TS policy, a process scheduled with SCHED_NOAGE will not have its priority increased or decreased, nor will it be preempted.
- How to implement:
  - As root:
    - setprivgrp -g dba RTSCHED RTPRIO
    - Enter "dba RTSCHED RTPRIO" in /etc/privgroup so it persists across reboots
  - In init.ora (spfile) enter:
    - hpux_sched_noage=178
  - Check (with ps) that the Oracle background processes use priority 178 (see the check below).
  - In RAC, make sure this is true on all nodes.
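A minimal verification sketch (the PRI column position assumes HP-UX ps -efl output; adjust the grep pattern to your instance names):

    # init.ora / spfile entry on every instance
    hpux_sched_noage = 178

    # the 7th column (PRI) should show 178 for the background processes
    ps -efl | grep ora_ | grep -v grep | awk '{print $7, $NF}'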
Deadlocks etc.
- Deadlock situations can escalate into a serious issue in a RAC environment.
- Please make sure to check for these issues prior to any RAC implementation.
- A hint that might help:
  - event = "7445 trace name ERRORSTACK level 3"
- We encountered a situation in which an instance crashed due to deadlocks: RAC1 crashed while the databases on the other nodes remained up.
Deadlocks etc. (Cont.)
- In a RAC environment, deadlock detection works differently than in non-RAC. In non-RAC, Oracle updates the alert.log with ORA-60 and generates a user trace file; in a RAC environment, it puts the deadlock graph only in the LMD trace file (background_dump_dest, no user trace file).
- Since we are dealing with global resources in RAC, the deadlock detection mechanism differs from a non-RAC environment. In RAC, the LMD process periodically checks for deadlocks on the database. How often it checks is controlled by the parameter _lm_dd_interval. The default is 60 seconds on most versions / platforms and can be checked with the following query:

    select x.ksppinm, y.ksppstvl
    from   x$ksppi x, x$ksppcv y
    where  x.indx = y.indx
    and    x.ksppinm like '%lm_dd%'
    order  by x.ksppinm;

- Setting _lm_dd_interval too low can cause LMD to consume too much CPU. Setting it too high can cause applications to hang because deadlocks are not detected quickly enough. The default is sufficient for most cases.
Deadlocks etc. (Cont.)
- Deadlocks may occur on the same node or across different nodes in the cluster.
- Diagnosing ORA-60 errors on a single instance is easier than when the deadlock spans multiple instances (getting the SQL statement).
- For the multiple-instance case, starting from 9.2.0.6 / 10.1.0.3 and above, Oracle dumps additional diagnostics for deadlocks in a RAC environment.
- On a single instance you can use the following event (after checking the LMD trace file), set on all instances in the spfile (see the sketch below):
  - event = "60 trace name errorstack level 3; name systemstate level 10"
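With an spfile the event can be pushed to all instances in one statement; a sketch (EVENT is a static parameter, so it only takes effect after the instances are restarted):

    alter system set event =
      '60 trace name errorstack level 3; name systemstate level 10'
      scope=spfile sid='*';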
Startup / Shutdown
- Be cautious while performing startup and shutdown operations on such large SGAs.
- Shutdown may take a very long time (half an hour).
- While performing the procedure below we got a few Oracle errors:
  - srvctl start database -d
- Bug 2540942 (LMS may spin in kjctr_rksxp() during simultaneous shutdown).
- During simultaneous shutdown of several instances with large buffer caches, there is a chance of running out of tickets due to the number of messages that need to be sent for cleanup purposes. The most visible side effect is LMS spinning in kjctr_rksxp(); the ORA-600 [504] is a side effect of this.
Different Execution Plans

SQL statement (hash value 2601636153) as captured on node 1:

    select members.MEMBER_ID into :b0:b1
    from (select t.MEMBER_ID
          from   UFMI_INV u, RM_MEMBER_ID t
          where  u.URBAN_ID(+) = :b2
          and    u.FLEET_ID(+) = :b3
          and    u.MEMBER_ID(+) = t.MEMBER_ID
          and    u.MEMBER_ID is null
          and    t.MEMBER_ID = :b4
          and    t.MEMBER_ID = t.MEMBER_ID) members
    where  ROWNUM ...

Plan on node 1 (12 rows selected; operations and Starts / A-Rows values as captured):

    INDEX RANGE SCAN  UFMI_INV_PK        4
    SORT JOIN                            4
    INDEX RANGE SCAN  RM_MEMBER_ID_PK    142K
    MERGE JOIN OUTER
    FILTER
    FILTER
    VIEW                                 142K
    COUNT STOPKEY
    SELECT STATEMENT
Different Execution Plans (Cont.)

The same statement (hash value 2601636153) on node 2 produced a different plan.

Plan on node 2 (11 rows selected; operations and Starts / A-Rows values as captured):

    INDEX RANGE SCAN  UFMI_INV_PK        1
    INDEX RANGE SCAN  RM_MEMBER_ID_PK    1
    NESTED LOOPS OUTER
    FILTER
    FILTER
    VIEW                                 1
    COUNT STOPKEY
    SELECT STATEMENT
What Might Be the Reason?
- The first set of binds on node 1 returns a larger result set.
- Bind peeking:
  - Versions 9i / 10g use bind variable peeking: the CBO assumes that the bind variable values used for the first execution of a SQL statement are representative of the values to be used in future executions of the same statement.
  - Bind variable peeking occurs every time the cursor is parsed or loaded.
  - If the data is skewed, the execution plan may change when, for example, the shared pool is flushed or the underlying objects are analyzed.
- The optimizer is making different estimates because of the different parameters set on each node.
- Histograms built on these columns.
How to Solve the Issue?
- Turn off bind peeking (_optim_peek_user_binds=false)
  - Pros: eliminates any bind peeking.
  - Cons: affects all statements, some of which could already be benefiting from bind peeking.
- Hints (if the data is skewed or if the result set of the binds varies a lot)
  - Pros: always get the same plan, regardless of the peeked binds.
  - Cons: unfortunately requires an application change.
- Use stored outlines (we chose this option; see the sketch below)
  - Pros: no need to change the application, and it applies immediately.
  - Cons: extra DBA maintenance, and any change to the query text will disable the use of the outline.
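A minimal stored-outline workflow, assuming a user with the CREATE ANY OUTLINE privilege; the category name bench_plans is invented for the example:

    -- capture outlines for the statements executed in this session
    alter session set create_stored_outlines = bench_plans;
    -- run the problematic statement once while it gets the desired plan, then stop capturing
    alter session set create_stored_outlines = false;

    -- enable use of the captured plans (run on each instance; this setting is not stored in the spfile)
    alter system set use_stored_outlines = bench_plans;

    -- the captured outlines are visible in DBA_OUTLINES / DBA_OUTLINE_HINTS
    select name, category, used from dba_outlines;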
Hidden Parameters and More
- _row_cr
  - The "global cache cr request" wait event is generated when an instance has requested a consistent-read data block (or undo segment header) and the block to be transferred has not yet arrived at the requesting instance. Please make sure to check V$CR_BLOCK_SERVER (see the queries below).
  - Using _row_cr=TRUE (default FALSE) may reduce the number of CR rollbacks and avoid a costly cleanup/rollback in RAC. Instead of performing block cleanup/rollback, Oracle will try to generate a CR copy for the given row.
  - Once we used this parameter we hit another bug, 3577772.8, which we found during a Direct Path export.
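For context, these are the kinds of queries that expose the CR traffic discussed above; a sketch with column names as they appear in 9.2 (verify against your release):

    -- time spent waiting for CR blocks shipped from other instances
    select event, total_waits, time_waited
    from   v$system_event
    where  event = 'global cache cr request';

    -- CR / current block serving activity on the holding instance
    select cr_requests, current_requests, data_requests, undo_requests
    from   v$cr_block_server;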
Hidden Parameters and More (Cont.)
- FAST_START_MTTR_TARGET
  - This parameter ensures that instance or crash recovery completes within the given time frame; when set to a non-zero value it enables incremental checkpoints. The default value is 300 seconds (5 minutes).
  - The customer was using 0 seconds; once we set this parameter to the Oracle default, we reduced the inter-node communication as well as the overall I/O on the system. Please make sure to adjust it to suit your site requirements (see the sketch below).
- _BUMP_HIGHWATER_MARK_COUNT
  - HW enqueue (high water mark): the high water mark of a table may be raised constantly as the table grows and runs out of blocks for new inserts on the free list. The default value for bumping up the HWM is 0, which may be too low on a busy system or for the average insert size.
  - We increased it to 5 and nearly eliminated this enqueue altogether.
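Setting the parameter back to the default and watching the effect might look like this (illustrative; fast_start_mttr_target is dynamic, so it can be changed online):

    alter system set fast_start_mttr_target = 300 scope=both sid='*';

    -- compare the target with what the instance currently estimates
    select target_mttr, estimated_mttr
    from   v$instance_recovery;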
Multiple Block Sizes?
- The single-instance system was using a KEEP buffer pool nearly the same size as the DEFAULT buffer pool.
- We tried once to segregate the indexes from the data into a smaller buffer pool (2K for indexes).
- ORA-00379: no free buffers available in buffer pool DEFAULT for block size 2K
- Because of the asymmetric machines, we hit the above error when the proper buffer pool was not defined on all instances (see the example below).
- The recommendation: combine the different buffer pool caches into one; there is no need to use KEEP and RECYCLE. Oracle can manage them well.
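For completeness, "defining the proper buffer pool on all instances" means every instance that can serve the 2K tablespaces needs a matching cache, along the lines of the illustrative line below (the 256M size is made up); the simpler fix, as recommended above, is to avoid the extra block size altogether.

    -- required on every instance that may touch the 2K tablespaces
    alter system set db_2k_cache_size = 256M scope=spfile sid='*';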
Summary
- Address application deadlocks. This is especially important when moving to RAC.
- Combine the different buffer caches into one; there is no need to use KEEP and RECYCLE. Oracle can manage them well.
- Move to locally managed tablespaces with ASSM (Automatic Segment Space Management).
- Don't consider using HMP when going to production on very large systems; using the fabric over UDP should be enough.
- Try to use hints once you encounter different execution plans (binds).
- If possible, partition the applications across a few physical nodes (only for really high concurrency volumes).
- All the bugs mentioned have been fixed in newer versions of Oracle.
- Take extra precautions when performing startup / shutdown in such large environments.
References
- HP Cluster Install
- Oracle 9i Best Practices - Performance, Diagnostics & Tuning
- http://www.oracle.com/technology/documentation/index.html (and search for HMP)