1. ASE112: Adaptive Server Enterprise Performance Tuning on Next Generation Architecture
Prasanta Ghosh, Sr. Manager, Performance Development
pghosh_at_sybase.com
August 15-19, 2004
2. The Enterprise. Unwired.
3. The Enterprise. Unwired.
Industry and Cross Platform Solutions
Unwire People
Unwire Information
Manage Information
- Adaptive Server Enterprise
- Adaptive Server Anywhere
- Sybase IQ
- Dynamic Archive
- Dynamic ODS
- Replication Server
- OpenSwitch
- Mirror Activator
- PowerDesigner
- Connectivity Options
- EAServer
- Industry Warehouse Studio
- Unwired Accelerator
- Unwired Orchestrator
- Unwired Toolkit
- Enterprise Portal
- Real Time Data Services
- SQL Anywhere Studio
- M-Business Anywhere
- Pylon Family (Mobile Email)
- Mobile Sales
- XcelleNet Frontline Solutions
- PocketBuilder
- PowerBuilder Family
- AvantGo
- Sybase Workspace
4. What will we learn?
- Processor Trends
- Relevant to the Database world
- Present architectural issues
- Compiler technology
- ASE Architecture
- Adapting to new processors
- Keeping up with OLTP performance
- Discuss some of the hot performance-related topics
- Questions
- Discussions
- Interactive
5. Processors: CISC, RISC and EPIC
- CISC (Complex Instruction Set Computing)
- Intel and AMD's x86 processor sets
- RISC (Reduced Instruction Set Computing)
- Goal: optimize performance with simpler instructions
- EPIC (Explicitly Parallel Instruction Computing)
- Goal: move beyond RISC performance bounds with explicit parallel instruction streams
6. Processor Speed
- Is a higher clock speed better?
- Not always
- e.g., 3.0GHz Xeon vs. 1.5GHz Itanium2

Processor        Clock speed
Ultra SPARC IV   1.2GHz
PA-RISC 8800     1.0GHz
Xeon/Opteron     2.4-3.0GHz
Power5           1.5-1.9GHz
Itanium2         1.5-1.7GHz
7. Processor Speed: ASE behavior
- Obviously
- faster processing
- better response time
- Plus
- more context switches, e.g., 112,296 vs. 522,115 per minute
- not when the engines are idling
- demands more from disk IO performance
8. Processor Architecture: 64-bit Processing
- 64-bit data and addressing: better performance
- A must for large database environments
- 2 versions of the OS kernel and ASE for the same platform

Processor        32-bit or 64-bit?
Ultra SPARC IV   Both
PA-RISC 8800     Both
Xeon             32-bit
Opteron          Both
Power5           Both
Itanium2         64-bit only
- Do I need to use 64-bit if I don't need > 4GB memory access?
9. ASE Network and Engine Affinity
- Network Affinity
- User connection to ASE
- The idling or least-loaded engine picks up the incoming connection
- Network IO for the user task is performed by that engine
- Engine Affinity
- Related to process scheduling
- soft binding is automatic (user transparent)
- Can use application partitioning to do hard binding
- A task runs on that engine as long as it can
- Network affinity remains unchanged
- unless that engine is taken offline
- Engine affinity changes
- due to the stealing algorithm
- or critical resource contention
10. ASE Engine Affinity
- Scheduling
- Engine-local runnable queue
- Global runnable queue
- Tasks are mostly in the engine's runnable queue
- Occasionally in the global runnable queue
- Engine Stealing (sketched in C below)
[Diagram: per-engine runnable queues (Engine 0 queue, Engine 1 queue) plus the global kernel queue]
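To make the order above concrete, here is a hedged C sketch (not ASE source; every name and the queue representation are hypothetical) of an engine looking for its next task: local runnable queue first, then the global kernel queue, then stealing from a sibling engine.

    #include <stddef.h>

    #define MAX_ENGINES 32

    typedef struct task { struct task *next; int spid; } task_t;
    typedef struct { task_t *head; } run_queue_t;   /* simple FIFO of runnable tasks */

    static run_queue_t local_q[MAX_ENGINES];        /* one runnable queue per engine */
    static run_queue_t global_q;                    /* the global (kernel) queue */
    static int n_engines = 4;                       /* illustrative engine count */

    static task_t *dequeue(run_queue_t *q) {        /* pop head; NULL if queue is empty */
        task_t *t = q->head;
        if (t != NULL)
            q->head = t->next;
        return t;
    }

    task_t *next_task(int engine) {
        task_t *t;
        /* 1. Tasks are mostly found in the engine's own runnable queue. */
        if ((t = dequeue(&local_q[engine])) != NULL)
            return t;
        /* 2. Occasionally a task is waiting in the global runnable queue. */
        if ((t = dequeue(&global_q)) != NULL)
            return t;
        /* 3. Engine stealing: take a runnable task from another engine's queue. */
        for (int e = 0; e < n_engines; e++)
            if (e != engine && (t = dequeue(&local_q[e])) != NULL)
                return t;
        return NULL;                                /* nothing runnable: engine idles */
    }

A real scheduler would add locking around each queue; the point here is only the lookup order.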
11. Processor Architecture: some more
- Hyper-threading
- Intel Xeon processors
- HyperTransport
- high-speed, low-latency, point-to-point link
- Data throughput 22.4GB/sec
- Dual Core
- PA-8800, Power5
- Chip Multithreading Technology (CMT)
- Sun Ultra SPARC IV
- Non-Uniform Memory Access (NUMA)
- Critical for large database applications using huge memory
- Large Register Set
- Itanium2 has 128 registers
12. Hyper-threading and ASE
- Should I enable hyper-threading for ASE?
- Our experience
- On a single-CPU system, hyper-threading helps
- On SMP systems, hyper-threading does not always help
- Linux AS 2.1 has a scheduling issue
- which is fixed in RHEL 3.0
- Does not help on a highly active system where engines are fully utilized
- Haven't seen a 30% gain for an ASE configuration
13. Processor Architecture: Limits and EPIC Solutions
- Problem: Memory/CPU latency is already large and growing
- Solution: Speculative loads for data and instructions
- Problem: Increasing amount of conditional and/or unpredictable branches in code
- Solution: Predication and prediction of branches and conditionals, orchestrated by the compiler to use the EPIC architecture
- Problem: Complexity of multiple pipelines is too great for effective on-chip scheduling
- Solution: Compiler handles scheduling and produces code to take advantage of the on-chip resources
- Problem: Registers and chip resource availability limit parallelism
- Solution: Increase the number of registers by 4X (32 → 128)
14. Traditional Architecture Limiters
[Diagram: the compiler parallelizes the original source code but must emit sequential machine code; the hardware's multiple functional units then have to rediscover the parallelism, so the available execution units are used inefficiently]
Today's processors are often 60% idle
15. Explicit Parallelism
- Instruction Level Parallelism (ILP) is the ability to execute multiple instructions at the same time
- Explicitly Parallel Instruction Computing (EPIC) allows the compiler or assembler to specify the parallelism
- Compiler specifies Instruction Groups, a list of instructions with no dependencies that can be executed in parallel
- A stop bit or taken branch indicates an instruction group boundary
- Instructions are packed in bundles of 3 instructions each
- A template field directly maps each instruction to an execution unit, allowing easy parallel dispatch of the instructions
[Diagram: 128-bit bundle layout: a 5-bit template field plus three 41-bit instruction slots, with stop bits marking instruction group boundaries; expressed as C bit widths below]
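As a hedged illustration, the bundle layout above can be written down as C bit widths. This is descriptive only, not a usable decoder (bitfield packing is compiler-defined):

    struct ia64_bundle {                       /* 128 bits in total */
        unsigned long long template_bits : 5;  /* maps each slot to an execution unit */
        unsigned long long slot0 : 41;         /* instruction 1 */
        unsigned long long slot1 : 41;         /* instruction 2 */
        unsigned long long slot2 : 41;         /* instruction 3: 5 + 3*41 = 128 */
    };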
16. Processor Architecture: TLB miss
- Translation Lookaside Buffer (TLB)
- A fixed-size table
- The processor uses it to locate data in the local cache
- Large memory configurations
- Common to database applications
- more chances of a TLB miss
- Locking the shared memory
- Variable OS page size
- 4KB vs. 8MB or 16MB (see the Linux sketch below)
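Both remedies can be sketched together on Linux (an assumption; the deck does not name an OS API): back the shared memory segment with huge pages via SHM_HUGETLB, so one TLB entry covers megabytes instead of 4KB, and pin it with SHM_LOCK so it is never paged out. Sizes are illustrative.

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void) {
        size_t size = 256UL << 20;                 /* 256MB segment, for illustration */
        int id = shmget(IPC_PRIVATE, size,
                        IPC_CREAT | SHM_HUGETLB | SHM_R | SHM_W);
        if (id < 0) { perror("shmget(SHM_HUGETLB)"); return 1; }

        void *base = shmat(id, NULL, 0);           /* map the segment */
        if (base == (void *)-1) { perror("shmat"); return 1; }

        shmctl(id, SHM_LOCK, NULL);                /* lock the segment in RAM */

        /* ... a server would carve its caches out of 'base' here ... */

        shmdt(base);                               /* detach */
        shmctl(id, IPC_RMID, NULL);                /* mark the segment for removal */
        return 0;
    }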
17. Processor Speed vs. Memory Access
- CPU speed doubles every 1.5 years
- Memory speed doubles every 10 years
- High-speed CPUs
- Mostly underutilized
18. Reduce Memory Latency
- Internal cache
- L1, L2, L3 caches
- memory closer to the processor
- On-chip or off-chip
- Shared by CPUs
- Separate data/instruction caches

Processor        L1, L2, L3 size
Ultra SPARC IV   64KB-D, 32KB-I, 16MB
PA-RISC 8800     1.5MB, 32MB, N/A
Opteron          128KB, 1MB, N/A
Xeon             16KB, 512KB, 4MB
Power5           64KB, 1.5MB, 36MB
Itanium2         32KB, 256KB, 6-9MB
19. Internal Cache
- ASE is optimized to make use of the L1/L2/L3 caches
- Database applications are memory intensive
- New systems: what to watch for?
- Higher clock speed
- Higher front-side bus speed
- Large L1/L2/L3 caches
- Lower memory latency
- Follow OEM guidelines
- e.g., same-speed memory DIMMs
20. Internal Cache: Separate L1/L2 Cache
21. Internal Cache: Shared L2/L3 Cache
- Level 2 cache boosts performance
- Size (32MB) and proximity of the L2 cache to the processors increase performance for many workloads
- More than the CPUs inside the processor module
- On-chip cache controller speeds access and protects data
- On-chip tags help the cache controller quickly locate and send data to the CPU
- ECC protection for data tags, cached data and in-flight data
22. Internal Cache: ASE optimizations
- Smaller footprint
- Avoid random access of memory
- Only a few OS processes
- Structure alignments
- Minimize cross-engine data access
- Compiler optimization to pre-fetch data (see the sketch below)
- Better branch prediction
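The pre-fetch bullet can be illustrated with GCC's __builtin_prefetch; the row layout and the stride of 8 are assumptions, and a compiler performs the same transformation itself when it can predict the access pattern:

    struct row { long value; char payload[56]; };     /* hypothetical row, about one cache line */

    long sum_values(const struct row *r, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&r[i + 8], 0, 1);  /* read access, low temporal locality */
            total += r[i].value;                      /* r[i] is likely in cache by now */
        }
        return total;
    }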
23. ASE FBO Server: Speculation
- Allows the compiler to issue an operation early, before a dependency
- Removes the latency of the operation from the critical path
- Helps hide long-latency memory operations
- Two types of speculation
- Control Speculation: the execution of an operation before the branch which guards it
- Data Speculation: the execution of a memory load prior to a preceding store which may alias with it
24. ASE FBO Server: Predication
- Allows instructions to be conditionally executed
- A predicate register operand controls execution
- Removes branches and associated mispredict penalties
- Creates larger basic blocks and simplifies compiler optimizations
- Example:
- cmp.eq p1,p2 = r1,r2
- (p1) add r1 = r2, 4
- (p2) ld8.sa r7 = [r8], 8
- If p1 is true, the add is performed; else it acts as a nop
- If p2 is true, the ld8 is performed; else it acts as a nop
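For readers who do not speak IA-64 assembly, here is a C analogue of the example (illustrative only: on the hardware both predicated instructions issue with no branch at all, and a false predicate merely squashes the instruction's effect):

    #include <stdint.h>

    void predicated_example(uint64_t *r1, uint64_t r2,
                            uint64_t *r7, const uint64_t **r8) {
        int p1 = (*r1 == r2);        /* cmp.eq p1,p2 = r1,r2 */
        int p2 = !p1;
        if (p1)
            *r1 = r2 + 4;            /* (p1) add r1 = r2, 4 */
        if (p2) {
            *r7 = **r8;              /* (p2) ld8.sa r7 = [r8], 8: 8-byte load ... */
            *r8 += 1;                /* ... with post-increment of r8 by 8 bytes */
        }
    }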
25. ASE FBO Server: Optimizations!
- Profile Guided Optimizations
- Also known as FBO or PBO
- Runs a typical load using an instrumented server
- Collects data on execution profiling
- Generates highly optimized code!
- Anywhere between 10-40% performance gain
26. ASE Architecture: High-level view
27. Cacheline and Data Structure
- Main memory to internal cache transfers happen in chunks
- 32-byte, 64-byte or 128-byte
- In database applications, load misses consume almost 90% of CPU cycles
- Avoid load misses by rearranging the fields in a structure (sketched below), grouping:
- Write-only fields
- Read-only fields
- Fields accessed simultaneously

struct Process {
    int  id;
    char name[200];
    int  state;
};
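A hedged sketch of that rearrangement, assuming a 64-byte cache line and that state is the write-hot field: the read-mostly fields share cache lines, while the frequently written field is isolated on its own line (using the GCC aligned attribute) so writers stop invalidating the line readers need.

    #define CACHE_LINE 64   /* assumption: transfer chunks are 32, 64 or 128 bytes */

    struct ProcessOpt {
        /* read-mostly fields, packed together */
        int  id;
        char name[200];
        /* write-hot field isolated on its own cache line */
        int  state __attribute__((aligned(CACHE_LINE)));
    };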
28. Spin Lock Optimizations
- Lightweight synchronization mechanism
- Effective only when running with more than 1 ASE engine
- An inefficient algorithm can waste CPU cycles
- Must fit within one cache line boundary
- Varies from platform to platform
- Multiple spin lock structures in a single cache line
- cause too many unnecessary dirty flushes among the CPUs
- Hyper-threading and Intel's pause instruction (see the sketch below)
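Pulling the last three bullets together, an illustrative sketch (not ASE source) of a spin lock padded to exactly one cache line, with Intel's pause instruction in the wait loop; the GCC builtins and the x86-only inline assembly are assumptions of this sketch:

    #include <stdint.h>

    #define CACHE_LINE 64   /* assumption: varies from platform to platform */

    typedef struct {
        volatile uint32_t held;                     /* 0 = free, 1 = held */
        char pad[CACHE_LINE - sizeof(uint32_t)];    /* fill out the cache line */
    } __attribute__((aligned(CACHE_LINE))) padded_spinlock_t;

    static inline void spin_acquire(padded_spinlock_t *s) {
        while (__sync_lock_test_and_set(&s->held, 1)) {
            /* read-only wait: failed atomic writes would dirty the line for every CPU */
            while (s->held)
                __asm__ __volatile__("pause");      /* be polite to the hyper-thread sibling */
        }
    }

    static inline void spin_release(padded_spinlock_t *s) {
        __sync_lock_release(&s->held);              /* store 0 with release semantics */
    }

Because each lock owns a full line, two hot locks can never share one, avoiding the dirty-flush ping-pong described above.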
29. Spin Lock Decomposition
30. ASE Architecture: Storage Device
- Efficient IO is critical for system performance
- Process scheduling and interrupt handling are important
- SCSI or Fibre Channel
- Disk spindle RPM
- Controller cache
- RAID 0, RAID 1 or RAID 5
- Synchronous vs. asynchronous IO
31. ASE Architecture: File System vs. Raw
- Raw for log IO
- RAID 0 for data devices
- RAID 0+1 for log devices
- File system for better management
- For 32-bit platforms, where memory is limited, file system for data devices is recommended
- 4GB memory access limit
- the OS allocates the rest of the memory to the file system cache
- Mostly read intensive and not heavily write intensive
- This results in better read response for applications
32. ASE Architecture: File System vs. Raw Devices
- Use a mix of file system and raw devices
33. ASE Architecture: Journaling on or off?
- EXT3 with journaling disabled
34. ASE Architecture: Large Memory Support
- Xeon has the PAE architecture
- Allows applications to address up to 64GB of memory
- ASE on Linux, as of the 12.5.2 release, can support up to 64GB of memory
- Easy configuration to set up the large memory feature
35. Large Memory Support on Linux 32-bit
- Intel has the PAE architecture
- Allows applications to address up to 64GB of memory
- Memory usage in ASE
- Most of the memory on a given system is used for data caches
- Avoids expensive disk reads and writes
- File system devices cache the data in the OS/FS cache
- Double-copying problem, resulting in wasted memory (see the direct IO sketch below)
- Writes are very expensive
- Increased CPU bandwidth on Xeon is underutilized by not having large memory support
- Most production environments have raw devices for ASE
- which causes underutilization of the system memory
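The deck does not name an API, but on Linux the double-copy problem can be sidestepped even on file system devices with direct IO. A minimal sketch, assuming Linux and the sector-aligned buffers that O_DIRECT requires:

    #define _GNU_SOURCE     /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>

    /* Open a data device bypassing the OS page cache, so a page of data lives
       once in the server's own cache rather than twice (again in the FS cache). */
    int open_direct(const char *path) {
        return open(path, O_RDWR | O_DIRECT | O_DSYNC);
    }

    /* O_DIRECT transfers need aligned buffers; 4096 covers common sector sizes. */
    void *alloc_io_buffer(size_t bytes) {
        void *p = NULL;
        return posix_memalign(&p, 4096, bytes) == 0 ? p : NULL;
    }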
36. ASE Architecture: Large Memory Support
37. Myth: # of ASE Engines vs. # of CPUs
- Can I have more engines than the # of CPUs?
- Single server installation
- no need to have more engines
- Multiple ASE servers on a single system
- total number of engines exceeding the # of CPUs
- No simple Yes/No answer
38. Myth: ASE taking most of the CPU cycles
- ASE always looks for work
- Consumes CPU cycles when idling, but only for a fraction of a millisecond
- With increasing CPU clock speed
- the problem seems more severe
- ASE is being improved to release CPU cycles as soon as possible
- while ensuring that the user's response time is not affected
- Typical ASE tuning
- Number of spins before releasing the CPU (sketched below)
- Active IO and idling
- Network and disk IO checks
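An illustrative sketch (not ASE source) of the first tuning knob: the engine spins a configurable number of times looking for runnable tasks and pending IO before yielding the CPU. The two helper functions are hypothetical stand-ins for the checks named above.

    #include <sched.h>
    #include <stdbool.h>

    bool runnable_task_pending(void);      /* hypothetical: any task queued? */
    bool network_or_disk_io_ready(void);   /* hypothetical: any IO completion? */

    void engine_idle_loop(int search_count) {
        int spins = 0;
        for (;;) {
            if (runnable_task_pending() || network_or_disk_io_ready())
                return;                    /* found work: go run it */
            if (++spins >= search_count) {
                sched_yield();             /* release the CPU to other processes */
                spins = 0;                 /* then resume searching */
            }
        }
    }

A larger search count keeps response time low at the cost of idle CPU burn, which is the trade-off the slide describes.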
39. Summary
- Processor technology continues to improve
- Higher clock speeds
- Dual-core chips
- EPIC architecture
- A lot more improvement to expect for memory latency
- More internal cache
- Parallel execution engines
- Parallelism pushed to compiler technology
- ASE architecture makes use of the new technology
- Best OLTP engine
- New optimizer and execution engine
- Efficient handling of large data sets
40. Questions