1. ASE112: Adaptive Server Enterprise Performance Tuning on Next Generation Architecture
Prasanta Ghosh, Sr. Manager, Performance Development
pghosh_at_sybase.com
August 15-19, 2004
2. The Enterprise. Unwired.
3. The Enterprise. Unwired.
Industry and Cross Platform Solutions
Unwire People
Unwire Information
Manage Information
- Adaptive Server Enterprise
- Adaptive Server Anywhere
- Sybase IQ
- Dynamic Archive
- Dynamic ODS
- Replication Server
- OpenSwitch
- Mirror Activator
- PowerDesigner
- Connectivity Options
- EAServer
- Industry Warehouse Studio
- Unwired Accelerator
- Unwired Orchestrator
- Unwired Toolkit
- Enterprise Portal
- Real Time Data Services
- SQL Anywhere Studio
- M-Business Anywhere
- Pylon Family (Mobile Email)
- Mobile Sales
- XcelleNet Frontline Solutions
- PocketBuilder
- PowerBuilder Family
- AvantGo
- Sybase Workspace
4. What will we learn?
- Processor Trends
- Relevant to the Database world
- Present architectural issues
- Compiler technology
- ASE Architecture
- Adapting to new processors
- Keeping up with OLTP performance
- Discuss some of the hot performance-related topics
- Questions
- Discussions
- Interactive
5. Processors: CISC, RISC and EPIC
- CISC (Complex Instruction Set Computing)
- Intel and AMD's x86 processor sets
- RISC (Reduced Instruction Set Computing)
- Goal: optimize performance with simpler instructions
- EPIC (Explicitly Parallel Instruction Computing)
- Goal: move beyond RISC performance bounds with explicit parallel instruction streams
6. Processor Speed
- Is a higher clock speed better?
- Not always
- e.g., 3.0GHz Xeon vs. 1.5GHz Itanium2

Processor        Clock speed
Ultra SPARC IV   1.2GHz
PA-RISC 8800     1.0GHz
Xeon/Opteron     2.4-3.0GHz
Power5           1.5-1.9GHz
Itanium2         1.5-1.7GHz
7. Processor Speed: ASE behavior
- Obviously
- faster processing
- better response time
- Plus
- more context switches, e.g., 112,296 vs. 522,115 per minute
- not when the engines are idling
- demands more from disk IO performance
8. Processor Architecture: 64-bit Processing
- 64-bit data and addressing: better performance
- A must for large database environments
- 2 versions of the OS kernel and ASE for the same platform

Processor        32-bit or 64-bit?
Ultra SPARC IV   Both
PA-RISC 8800     Both
Xeon             32-bit
Opteron          Both
Power5           Both
Itanium2         64-bit only
- Do I need to use 64-bit if I don't need > 4GB memory access?
9. ASE Network and Engine Affinity
- Network Affinity
- User connection to ASE
- The idling or least-loaded engine picks up the incoming connection
- Network IO for the user task is performed by that engine
- Engine Affinity
- Related to process scheduling
- soft binding is automatic (user transparent)
- Can use application partitioning to do hard binding
- A task runs on that engine as long as it can
- Network affinity remains unchanged
- unless that engine is taken offline
- Engine affinity changes
- due to the stealing algorithm
- or critical resource contention
10. ASE Engine Affinity
- Scheduling
- Engine-local runnable queue
- Global runnable queue
- Tasks are mostly in the engine's runnable queue
- Occasionally in the global runnable queue
- Engine Stealing (sketched in C below)
[Diagram: per-engine runnable queues (Engine 0 queue, Engine 1 queue) plus the global kernel queue]
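To make the order above concrete, here is a hedged C sketch (not ASE source; every name and the queue representation are hypothetical) of an engine looking for its next task: local runnable queue first, then the global kernel queue, then stealing from a sibling engine.

    #include <stddef.h>

    #define MAX_ENGINES 32

    typedef struct task { struct task *next; int spid; } task_t;
    typedef struct { task_t *head; } run_queue_t;   /* simple FIFO of runnable tasks */

    static run_queue_t local_q[MAX_ENGINES];        /* one runnable queue per engine */
    static run_queue_t global_q;                    /* the global (kernel) queue */
    static int n_engines = 4;                       /* illustrative engine count */

    static task_t *dequeue(run_queue_t *q) {        /* pop head; NULL if queue is empty */
        task_t *t = q->head;
        if (t != NULL)
            q->head = t->next;
        return t;
    }

    task_t *next_task(int engine) {
        task_t *t;
        /* 1. Tasks are mostly found in the engine's own runnable queue. */
        if ((t = dequeue(&local_q[engine])) != NULL)
            return t;
        /* 2. Occasionally a task is waiting in the global runnable queue. */
        if ((t = dequeue(&global_q)) != NULL)
            return t;
        /* 3. Engine stealing: take a runnable task from another engine's queue. */
        for (int e = 0; e < n_engines; e++)
            if (e != engine && (t = dequeue(&local_q[e])) != NULL)
                return t;
        return NULL;                                /* nothing runnable: engine idles */
    }

A real scheduler would add locking around each queue; the point here is only the lookup order.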
11. Processor Architecture: some more
- Hyper-threading
- Intel Xeon processors
- HyperTransport
- high-speed, low-latency, point-to-point link
- Data throughput 22.4GB/sec
- Dual Core
- PA-8800, Power5
- Chip Multithreading Technology (CMT)
- Sun Ultra SPARC IV
- Non-Uniform Memory Access (NUMA)
- Critical for large database applications using huge memory
- Large Register Set
- Itanium2 has 128 registers
12. Hyper-threading and ASE
- Should I enable hyper-threading for ASE?
- Our experience
- On a single-CPU system, hyper-threading helps
- On SMP systems, hyper-threading does not always help
- Linux AS 2.1 has a scheduling issue
- which is fixed in RHEL 3.0
- Does not help on a highly active system where engines are fully utilized
- Haven't seen a 30% gain for an ASE configuration
13. Processor Architecture: Limits and EPIC Solutions
- Problem: Memory/CPU latency is already large and growing
- Solution: Speculative loads for data and instructions
- Problem: Increasing amount of conditional and/or unpredictable branches in code
- Solution: Predication and prediction of branches and conditionals, orchestrated by the compiler to use the EPIC architecture
- Problem: Complexity of multiple pipelines is too great for effective on-chip scheduling
- Solution: Compiler handles scheduling and produces code to take advantage of the on-chip resources
- Problem: Registers and chip resource availability limit parallelism
- Solution: Increase the number of registers by 4X (32 → 128)
14. Traditional Architecture Limiters
[Diagram: the compiler parallelizes the original source code but must emit sequential machine code; the hardware's multiple functional units then have to rediscover the parallelism, so the available execution units are used inefficiently]
Today's processors are often 60% idle
15. Explicit Parallelism
- Instruction Level Parallelism (ILP) is the ability to execute multiple instructions at the same time
- Explicitly Parallel Instruction Computing (EPIC) allows the compiler or assembler to specify the parallelism
- Compiler specifies Instruction Groups, a list of instructions with no dependencies that can be executed in parallel
- A stop bit or taken branch indicates an instruction group boundary
- Instructions are packed in bundles of 3 instructions each
- A template field directly maps each instruction to an execution unit, allowing easy parallel dispatch of the instructions
[Diagram: 128-bit bundle layout: a 5-bit template field plus three 41-bit instruction slots, with stop bits marking instruction group boundaries; expressed as C bit widths below]
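As a hedged illustration, the bundle layout above can be written down as C bit widths. This is descriptive only, not a usable decoder (bitfield packing is compiler-defined):

    struct ia64_bundle {                       /* 128 bits in total */
        unsigned long long template_bits : 5;  /* maps each slot to an execution unit */
        unsigned long long slot0 : 41;         /* instruction 1 */
        unsigned long long slot1 : 41;         /* instruction 2 */
        unsigned long long slot2 : 41;         /* instruction 3: 5 + 3*41 = 128 */
    };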
16. Processor Architecture: TLB miss
- Translation Lookaside Buffer (TLB)
- A fixed-size table
- The processor uses it to locate data in the local cache
- Large memory configurations
- Common to database applications
- more chances of a TLB miss
- Locking the shared memory
- Variable OS page size
- 4KB vs. 8MB or 16MB (see the Linux sketch below)
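Both remedies can be sketched together on Linux (an assumption; the deck does not name an OS API): back the shared memory segment with huge pages via SHM_HUGETLB, so one TLB entry covers megabytes instead of 4KB, and pin it with SHM_LOCK so it is never paged out. Sizes are illustrative.

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void) {
        size_t size = 256UL << 20;                 /* 256MB segment, for illustration */
        int id = shmget(IPC_PRIVATE, size,
                        IPC_CREAT | SHM_HUGETLB | SHM_R | SHM_W);
        if (id < 0) { perror("shmget(SHM_HUGETLB)"); return 1; }

        void *base = shmat(id, NULL, 0);           /* map the segment */
        if (base == (void *)-1) { perror("shmat"); return 1; }

        shmctl(id, SHM_LOCK, NULL);                /* lock the segment in RAM */

        /* ... a server would carve its caches out of 'base' here ... */

        shmdt(base);                               /* detach */
        shmctl(id, IPC_RMID, NULL);                /* mark the segment for removal */
        return 0;
    }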
17. Processor Speed vs. Memory Access
- CPU speed doubles every 1.5 years
- Memory speed doubles every 10 years
- High-speed CPUs
- Mostly underutilized
18. Reduce Memory Latency
- Internal cache
- L1, L2, L3 caches
- memory closer to the processor
- On-chip or off-chip
- Shared by CPUs
- Separate data/instruction caches

Processor        L1, L2, L3 size
Ultra SPARC IV   64KB-D, 32KB-I, 16MB
PA-RISC 8800     1.5MB, 32MB, N/A
Opteron          128KB, 1MB, N/A
Xeon             16KB, 512KB, 4MB
Power5           64KB, 1.5MB, 36MB
Itanium2         32KB, 256KB, 6-9MB
19. Internal Cache
- ASE is optimized to make use of the L1/L2/L3 caches
- Database applications are memory intensive
- New systems: what to watch for?
- Higher clock speed
- Higher front-side bus speed
- Large L1/L2/L3 caches
- Lower memory latency
- Follow OEM guidelines
- e.g., same-speed memory DIMMs
20. Internal Cache: Separate L1/L2 Cache
21. Internal Cache: Shared L2/L3 Cache
- Level 2 cache boosts performance
- Size (32MB) and proximity of the L2 cache to the processors increase performance for many workloads
- More than the CPUs inside the processor module
- On-chip cache controller speeds access and protects data
- On-chip tags help the cache controller quickly locate and send data to the CPU
- ECC protection for data tags, cached data and in-flight data
22. Internal Cache: ASE optimizations
- Smaller footprint
- Avoid random access of memory
- Only a few OS processes
- Structure alignments
- Minimize cross-engine data access
- Compiler optimization to pre-fetch data (see the sketch below)
- Better branch prediction
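The pre-fetch bullet can be illustrated with GCC's __builtin_prefetch; the row layout and the stride of 8 are assumptions, and a compiler performs the same transformation itself when it can predict the access pattern:

    struct row { long value; char payload[56]; };     /* hypothetical row, about one cache line */

    long sum_values(const struct row *r, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&r[i + 8], 0, 1);  /* read access, low temporal locality */
            total += r[i].value;                      /* r[i] is likely in cache by now */
        }
        return total;
    }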
23. ASE FBO Server: Speculation
- Allows the compiler to issue an operation early, before a dependency
- Removes the latency of the operation from the critical path
- Helps hide long-latency memory operations
- Two types of speculation
- Control Speculation: the execution of an operation before the branch which guards it
- Data Speculation: the execution of a memory load prior to a preceding store which may alias with it
24. ASE FBO Server: Predication
- Allows instructions to be conditionally executed
- A predicate register operand controls execution
- Removes branches and associated mispredict penalties
- Creates larger basic blocks and simplifies compiler optimizations
- Example:
- cmp.eq p1,p2 = r1,r2
- (p1) add r1 = r2, 4
- (p2) ld8.sa r7 = [r8], 8
- If p1 is true, the add is performed; else it acts as a nop
- If p2 is true, the ld8 is performed; else it acts as a nop
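For readers who do not speak IA-64 assembly, here is a C analogue of the example (illustrative only: on the hardware both predicated instructions issue with no branch at all, and a false predicate merely squashes the instruction's effect):

    #include <stdint.h>

    void predicated_example(uint64_t *r1, uint64_t r2,
                            uint64_t *r7, const uint64_t **r8) {
        int p1 = (*r1 == r2);        /* cmp.eq p1,p2 = r1,r2 */
        int p2 = !p1;
        if (p1)
            *r1 = r2 + 4;            /* (p1) add r1 = r2, 4 */
        if (p2) {
            *r7 = **r8;              /* (p2) ld8.sa r7 = [r8], 8: 8-byte load ... */
            *r8 += 1;                /* ... with post-increment of r8 by 8 bytes */
        }
    }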
25. ASE FBO Server: Optimizations!
- Profile Guided Optimizations
- Also known as FBO or PBO
- Runs a typical load using an instrumented server
- Collects data on execution profiling
- Generates highly optimized code!
- Anywhere between 10-40% performance gain
26. ASE Architecture: High-level view
27. Cacheline and Data Structure
- Main memory to internal cache transfers happen in chunks
- 32-byte, 64-byte or 128-byte
- In database applications, load misses consume almost 90% of CPU cycles
- Avoid load misses by rearranging the fields in a structure (sketched below), grouping:
- Write-only fields
- Read-only fields
- Fields accessed simultaneously

struct Process {
    int  id;
    char name[200];
    int  state;
};
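A hedged sketch of that rearrangement, assuming a 64-byte cache line and that state is the write-hot field: the read-mostly fields share cache lines, while the frequently written field is isolated on its own line (using the GCC aligned attribute) so writers stop invalidating the line readers need.

    #define CACHE_LINE 64   /* assumption: transfer chunks are 32, 64 or 128 bytes */

    struct ProcessOpt {
        /* read-mostly fields, packed together */
        int  id;
        char name[200];
        /* write-hot field isolated on its own cache line */
        int  state __attribute__((aligned(CACHE_LINE)));
    };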
28. Spin Lock Optimizations
- Lightweight synchronization mechanism
- Effective only when running with more than 1 ASE engine
- An inefficient algorithm can waste CPU cycles
- Must fit within one cache line boundary
- Varies from platform to platform
- Multiple spin lock structures in a single cache line
- cause too many unnecessary dirty flushes among the CPUs
- Hyper-threading and Intel's pause instruction (see the sketch below)
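Pulling the last three bullets together, an illustrative sketch (not ASE source) of a spin lock padded to exactly one cache line, with Intel's pause instruction in the wait loop; the GCC builtins and the x86-only inline assembly are assumptions of this sketch:

    #include <stdint.h>

    #define CACHE_LINE 64   /* assumption: varies from platform to platform */

    typedef struct {
        volatile uint32_t held;                     /* 0 = free, 1 = held */
        char pad[CACHE_LINE - sizeof(uint32_t)];    /* fill out the cache line */
    } __attribute__((aligned(CACHE_LINE))) padded_spinlock_t;

    static inline void spin_acquire(padded_spinlock_t *s) {
        while (__sync_lock_test_and_set(&s->held, 1)) {
            /* read-only wait: failed atomic writes would dirty the line for every CPU */
            while (s->held)
                __asm__ __volatile__("pause");      /* be polite to the hyper-thread sibling */
        }
    }

    static inline void spin_release(padded_spinlock_t *s) {
        __sync_lock_release(&s->held);              /* store 0 with release semantics */
    }

Because each lock owns a full line, two hot locks can never share one, avoiding the dirty-flush ping-pong described above.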
29. Spin Lock Decomposition
30. ASE Architecture: Storage Device
- Efficient IO is critical for system performance
- Process scheduling and interrupt handling are important
- SCSI or Fibre Channel
- Disk spindle RPM
- Controller cache
- RAID 0, RAID 1 or RAID 5
- Synchronous vs. asynchronous IO
31. ASE Architecture: File System vs. Raw
- Raw for log IO
- RAID 0 for data devices
- RAID 0+1 for log devices
- File system for better management
- For 32-bit platforms, where memory is limited, file system for data devices is recommended
- 4GB memory access limit
- the OS allocates the rest of the memory to the file system cache
- Mostly read intensive and not heavily write intensive
- This results in better read response for applications
32. ASE Architecture: File System vs. Raw Devices
- Use a mix of file system and raw devices
33. ASE Architecture: Journaling on or off?
- EXT3 with journaling disabled
34. ASE Architecture: Large Memory Support
- Xeon has the PAE architecture
- Allows applications to address up to 64GB of memory
- ASE on Linux, as of the 12.5.2 release, can support up to 64GB of memory
- Easy configuration to set up the large memory feature
35. Large Memory Support on Linux 32-bit
- Intel has the PAE architecture
- Allows applications to address up to 64GB of memory
- Memory usage in ASE
- Most of the memory on a given system is used for data caches
- Avoids expensive disk reads and writes
- File system devices cache the data in the OS/FS cache
- Double-copying problem, resulting in wasted memory (see the direct IO sketch below)
- Writes are very expensive
- Increased CPU bandwidth on Xeon is underutilized by not having large memory support
- Most production environments have raw devices for ASE
- which causes underutilization of the system memory
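The deck does not name an API, but on Linux the double-copy problem can be sidestepped even on file system devices with direct IO. A minimal sketch, assuming Linux and the sector-aligned buffers that O_DIRECT requires:

    #define _GNU_SOURCE     /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>

    /* Open a data device bypassing the OS page cache, so a page of data lives
       once in the server's own cache rather than twice (again in the FS cache). */
    int open_direct(const char *path) {
        return open(path, O_RDWR | O_DIRECT | O_DSYNC);
    }

    /* O_DIRECT transfers need aligned buffers; 4096 covers common sector sizes. */
    void *alloc_io_buffer(size_t bytes) {
        void *p = NULL;
        return posix_memalign(&p, 4096, bytes) == 0 ? p : NULL;
    }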
36. ASE Architecture: Large Memory Support
37. Myth: # of ASE Engines vs. # of CPUs
- Can I have more engines than the # of CPUs?
- Single server installation
- no need to have more engines
- Multiple ASE servers on a single system
- total number of engines exceeding the # of CPUs
- No simple Yes/No answer
38. Myth: ASE taking most of the CPU cycles
- ASE always looks for work
- Consumes CPU cycles when idling, but only for a fraction of a millisecond
- With increasing CPU clock speed
- the problem seems more severe
- ASE is being improved to release CPU cycles as soon as possible
- while ensuring that the user's response time is not affected
- Typical ASE tuning
- Number of spins before releasing the CPU (sketched below)
- Active IO and idling
- Network and disk IO checks
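An illustrative sketch (not ASE source) of the first tuning knob: the engine spins a configurable number of times looking for runnable tasks and pending IO before yielding the CPU. The two helper functions are hypothetical stand-ins for the checks named above.

    #include <sched.h>
    #include <stdbool.h>

    bool runnable_task_pending(void);      /* hypothetical: any task queued? */
    bool network_or_disk_io_ready(void);   /* hypothetical: any IO completion? */

    void engine_idle_loop(int search_count) {
        int spins = 0;
        for (;;) {
            if (runnable_task_pending() || network_or_disk_io_ready())
                return;                    /* found work: go run it */
            if (++spins >= search_count) {
                sched_yield();             /* release the CPU to other processes */
                spins = 0;                 /* then resume searching */
            }
        }
    }

A larger search count keeps response time low at the cost of idle CPU burn, which is the trade-off the slide describes.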
39. Summary
- Processor technology continues to improve
- Higher clock speeds
- Dual-core chips
- EPIC architecture
- A lot more improvement to expect for memory latency
- More internal cache
- Parallel execution engines
- Parallelism pushed to compiler technology
- ASE architecture makes use of the new technology
- Best OLTP engine
- New optimizer and execution engine
- Efficient handling of large data sets
40. Questions