1
ASE112: Adaptive Server Enterprise Performance
Tuning on Next Generation Architecture
Prasanta Ghosh, Sr. Manager - Performance
Development, pghosh@sybase.com, August 15-19, 2004
2
The Enterprise. Unwired.
3
The Enterprise. Unwired.
Industry and Cross Platform Solutions
Unwire People
Unwire Information
Manage Information
  • Adaptive Server Enterprise
  • Adaptive Server Anywhere
  • Sybase IQ
  • Dynamic Archive
  • Dynamic ODS
  • Replication Server
  • OpenSwitch
  • Mirror Activator
  • PowerDesigner
  • Connectivity Options
  • EAServer
  • Industry Warehouse Studio
  • Unwired Accelerator
  • Unwired Orchestrator
  • Unwired Toolkit
  • Enterprise Portal
  • Real Time Data Services
  • SQL Anywhere Studio
  • M-Business Anywhere
  • Pylon Family (Mobile Email)
  • Mobile Sales
  • XcelleNet Frontline Solutions
  • PocketBuilder
  • PowerBuilder Family
  • AvantGo

Sybase Workspace
4
What will we learn?
  • Processor Trends
  • Relevant to the Database world
  • Present architectural issues
  • Compiler technology
  • ASE Architecture
  • Adapting to new processors
  • Keeping up with OLTP performance
  • Discuss some of the hot performance related
    topics
  • Questions
  • Discussions
  • Interactive

5
Processor: CISC, RISC and EPIC
  • CISC (Complex Instruction Set Computing)
  • Intel and AMD's x86 processor set
  • RISC (Reduced Instruction Set Computing)
  • Goal: optimize performance with simpler
    instructions
  • EPIC (Explicitly Parallel Instruction Computing)
  • Goal: move beyond RISC performance bounds with
    explicit parallel instruction streams

6
Processor Speed
  • Is Higher clock speed better?
  • Not always
  • 3.0GHz Xeon vs. 1.5GHz Itanium2

Processor         Speed
Ultra SPARC IV    1.2GHz
PA-RISC 8800      1.0GHz
Xeon/Opteron      2.4-3.0GHz
Power5            1.5-1.9GHz
Itanium2          1.5-1.7GHz
7
Processor Speed: ASE behavior
  • Obviously
  • faster processing
  • better response time
  • Plus
  • more context switches, e.g. 112,296 vs. 522,115 per
    minute
  • not when the engines are idling
  • demands more from disk IO performance

8
Processor Architecture: 64-bit Processing
  • 64-bit data and addressing gives better performance
  • Must for large database environments
  • 2 versions of OS kernel and ASE for the same
    platform

Processor         32-bit or 64-bit?
Ultra SPARC IV    Both
PA-RISC 8800      Both
Xeon              32-bit
Opteron           Both
Power5            Both
Itanium2          64-bit only
  • Do I need to use 64-bit if I don't need > 4GB
    memory access?

9
ASE Network and Engine Affinity
  • Network Affinity
  • User connection to ASE
  • Idling or least loaded engine picks up the
    incoming connection
  • Network IO for the user task is performed by that
    engine
  • Engine Affinity
  • Related to process scheduling
  • soft binding is automatic (user transparent)
  • Can use application partitioning to do hard
    binding
  • Runs on that engine as long as it can
  • Network affinity remains unchanged
  • unless that engine is made offline
  • Engine affinity changes
  • Due to stealing algorithm
  • Critical resource contention

10
ASE Engine Affinity
  • Scheduling
  • Engine Local runnable queue
  • Global runnable queue
  • Tasks are mostly in engine runnable queue
  • Occasionally in global runnable queue
  • Engine Stealing (see the sketch below)

(Diagram: per-engine runnable queues for Engine 0 and Engine 1, plus the
global kernel queue.)
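A conceptual sketch in C of the scheduling order described above: an engine
takes work from its local runnable queue first, falls back to the global
queue, and finally tries to steal from a peer engine. The names (engine_t,
task_t, pick_next_task) are illustrative only and are not ASE internals;
real code would also lock each queue.

/* Illustrative sketch only -- not ASE source code. */
#include <stddef.h>

typedef struct task {
    struct task *next;            /* intrusive runnable-queue link */
} task_t;

typedef struct engine {
    task_t        *local_queue;   /* engine-local runnable queue */
    struct engine *peers;         /* other online engines        */
    int            npeers;
} engine_t;

static task_t *dequeue(task_t **q)
{
    task_t *t = *q;               /* no locking shown; real code needs it */
    if (t != NULL)
        *q = t->next;
    return t;
}

task_t *pick_next_task(engine_t *self, task_t **global_queue)
{
    task_t *t;

    if ((t = dequeue(&self->local_queue)) != NULL)    /* common case      */
        return t;
    if ((t = dequeue(global_queue)) != NULL)          /* occasional case  */
        return t;
    for (int i = 0; i < self->npeers; i++)            /* engine stealing  */
        if ((t = dequeue(&self->peers[i].local_queue)) != NULL)
            return t;
    return NULL;                                      /* nothing runnable */
}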
11
Processor Architecture: some more
  • Hyper threading
  • Intel Xeon processor
  • Hyper Transport
  • high-speed, low latency, point-to-point link
  • Data throughput 22.4GB/sec
  • Dual Core
  • PA-8800, Power5
  • Chip Multithreading Technology (CMT)
  • Sun Ultra Sparc IV
  • Non Uniform Memory Access (NUMA)
  • Critical for large database applications using
    huge memory
  • Large Register Set
  • Itanium2 has 128 registers

12
Hyper threading and ASE
  • Should I enable hyper threading for ASE?
  • Our Experience
  • On single cpu system, hyper threading helps
  • On SMP systems, hyper threading does not always
    help
  • Linux AS 2.1 has a scheduling issue
  • which is fixed in RHEL 3.0
  • Does not help on a highly active system where
    engines are fully utilized
  • Haven't seen a 30% gain for ASE configurations

13
Processor Architecture Limits and EPIC Solutions
  • Problem: Memory/CPU latency is already large and
    growing
  • Solution: Speculative loads for data and instructions
  • Problem: Increasing amount of conditional and/or
    unpredictable branches in code
  • Solution: Predication and prediction of branches and
    conditionals, orchestrated by the compiler to use the
    EPIC architecture
  • Problem: Complexity of multiple pipelines is too great
    for effective on-chip scheduling
  • Solution: Compiler handles scheduling and produces code
    to take advantage of the on-chip resources
  • Problem: Registers and chip resource availability limit
    parallelism
  • Solution: Increase the number of registers by 4X
    (32 to 128)
14
Traditional Architecture Limiters
(Diagram: the compiler parallelizes the original source code into machine
code and the hardware schedules it across multiple functional units;
execution units are available but used inefficiently, so today's processors
are often 60% idle.)
15
Explicit Parallelism
  • Instruction Level Parallelism (ILP) is the ability to
    execute multiple instructions at the same time
  • Explicitly Parallel Instruction Computing (EPIC)
    allows the compiler or assembler to specify the
    parallelism
  • Compiler specifies Instruction Groups, a list of
    instructions with no dependencies that can be
    executed in parallel
  • Stop bit or taken branch indicates instruction
    group boundary
  • Instructions are packed in bundles of 3
    instructions each
  • Template field directly maps each instruction to
    an execution unit allowing easy parallel dispatch
    of the instructions

(Bundle layout: three 41-bit instructions plus a 5-bit template field; stop
bits mark instruction group boundaries.)
16
Processor Architecture: TLB miss
  • Translation Lookaside Buffer (TLB)
  • It's a fixed-size table
  • The processor uses it to translate addresses when
    searching for data in the local cache
  • Large memory configuration
  • Common to database applications
  • more chances of TLB miss
  • Locking the shared memory
  • Variable OS page size
  • 4KB vs. 8MB or 16MB (see the sketch after this list)
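As a hedged illustration of the variable page size point above: on Linux,
large (huge) OS pages can be requested for a shared memory segment with the
SHM_HUGETLB flag, so far fewer TLB entries are needed to map a large data
cache. This is a generic sketch, not the exact mechanism or sizes ASE uses,
and it assumes the administrator has reserved huge pages.

/* Sketch: requesting huge pages for a shared memory segment on Linux.
 * Generic illustration, not ASE's allocation code. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000                 /* Linux-specific flag */
#endif

int main(void)
{
    size_t size = 256UL * 1024 * 1024;    /* 256MB segment */

    /* Try huge pages first; fall back to normal 4KB pages on failure. */
    int shmid = shmget(IPC_PRIVATE, size,
                       IPC_CREAT | SHM_HUGETLB | SHM_R | SHM_W);
    if (shmid == -1)
        shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | SHM_R | SHM_W);
    if (shmid == -1) {
        perror("shmget");
        return 1;
    }

    void *addr = shmat(shmid, NULL, 0);   /* attach the segment */
    if (addr == (void *)-1) {
        perror("shmat");
        return 1;
    }
    printf("shared memory attached at %p\n", addr);
    return 0;
}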

17
Processor speed vs. Memory Access
  • Cpu speed doubles every 1.5 years
  • Memory speed doubles every 10 years
  • High speed cpu
  • Mostly underutilized

18
Reduce Memory Latency
  • Internal cache
  • L1, L2, L3 cache
  • memory closer to processor
  • On chip or off chip
  • Shared by CPUs
  • Data/Instruction cache separate

Processor         L1, L2, L3 size
Ultra SPARC IV    64KB-D, 32KB-I, 16MB
PA-RISC 8800      1.5MB, 32MB, NA
Opteron           128KB, 1MB, NA
Xeon              16KB, 512KB, 4MB
Power5            64KB, 1.5MB, 36MB
Itanium2          32KB, 256KB, 6-9MB
19
Internal Cache
  • ASE is optimized to make use of the L1/L2/L3
    cache
  • Database applications are memory intensive
  • New systems What to watch for?
  • Higher clock speed
  • Higher front side bus speed
  • Large L1/L2/L3 cache
  • Lower memory latency
  • Follow OEM guidelines
  • e.g. same-speed memory DIMMs

20
Internal Cache: Separate L1/L2 Cache
21
Internal Cache: Shared L2/L3 Cache
  • Level 2 Cache Boosts Performance
  • Size (32MB) and proximity of the L2 cache to the
    processors increase performance for many workloads

More than the CPUs: Inside the Processor Module
  • On-Chip Cache Controller Speeds Access and
    Protects Data
  • On-chip tags help the cache controller quickly
    locate and send data to CPU
  • ECC protection for data tags, cached data and
    in-flight data

22
Internal Cache: ASE optimizations
  • Smaller footprint
  • Avoid random access of memory
  • Only few OS processes
  • Structure alignments
  • Minimize cross engine data access
  • Compiler optimization to pre-fetch data (see the
    sketch after this list)
  • Better branch prediction
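To make the pre-fetch bullet concrete, here is a small hedged sketch using
GCC's __builtin_prefetch to pull the next node of a list toward the cache
while the current one is still being processed. The structure and workload
are invented for illustration and are not ASE code.

/* Sketch: software prefetching during a list walk.  __builtin_prefetch
 * (a GCC intrinsic) issues a non-blocking prefetch so the next node is
 * already in cache by the time it is needed.  Illustrative only. */
#include <stddef.h>

struct row {
    struct row *next;
    long        key;
    long        value;
};

long sum_values(struct row *head)
{
    long total = 0;
    for (struct row *r = head; r != NULL; r = r->next) {
        if (r->next != NULL)
            __builtin_prefetch(r->next, 0, 1);  /* read, low temporal locality */
        total += r->value;
    }
    return total;
}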

23
ASE FBO Server: Speculation
  • Allows the compiler to issue an operation early,
    before a dependency
  • Removes the latency of the operation from the critical
    path
  • Helps hide long-latency memory operations
  • Two types of speculation
  • Control Speculation, which is the execution of an
    operation before the branch which guards it
  • Data Speculation, which is the execution of a
    memory load prior to a preceding store which may
    alias with it

24
ASE FBO Server: Predication
  • Allows instructions to be conditionally executed
  • Predicate register operand controls execution
  • Removes branches and associated mispredict
    penalties
  • Creates larger basic blocks and simplifies
    compiler optimizations
  • Example
  • cmp.eq p1, p2 = r1, r2
  • (p1) add r1 = r2, 4
  • (p2) ld8.sa r7 = [r8], 8
  • If p1 is true, the add is performed, else it
    acts as a nop
  • If p2 is true, the ld8 is performed, else it
    acts as a nop

25
ASE FBO Server: Optimizations!
  • Profile Guided Optimizations
  • Also known as FBO or PBO
  • Runs a typical load using an instrumented server
  • Collects execution profiling data
  • Generates highly optimized code!
  • Anywhere between 10-40% performance gain

26
ASE architecture: High level view
27
Cacheline and Data structure
  • Main memory to internal cache transfer happens in
    chunks
  • 32byte, 64byte or 128byte
  • In database applications, load misses consume
    almost 90% of CPU cycles
  • Avoid load misses by rearranging the fields in a
    structure (see the rearranged sketch below)
  • Write only fields
  • Read only fields
  • Fields accessed simultaneously

struct Process {
    int  id;
    char name[200];
    int  state;
};
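A hedged sketch of the rearrangement idea, using the Process structure
above: group fields by access pattern and align the frequently written
field onto its own cache line, so a write does not dirty the line holding
the read-mostly fields. A 64-byte cache line and GCC attributes are
assumed; this is not ASE's actual layout.

/* Sketch: the same structure rearranged for cache-line friendliness. */
#define CACHE_LINE 64

struct ProcessArranged {
    /* read-mostly fields, accessed together */
    int  id;
    char name[200];

    /* frequently written field on its own cache line, so writes do not
     * dirty the line that holds the read-mostly fields */
    int  state __attribute__((aligned(CACHE_LINE)));
};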
28
Spin lock optimizations
  • Light weight synchronization mechanism
  • Effective only when running with more than one ASE
    engine
  • An inefficient algorithm can waste CPU cycles
  • Must fit in one cache line boundary
  • Varies from platform to platform
  • Multiple spin lock structures in single cache
    line
  • Too many unnecessary dirty flushes among the CPUs
  • Hyper threading and Intel's pause instruction (see the
    sketch after this list)
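A minimal sketch of the two points above, assuming a 64-byte cache line, an
x86 target and GCC builtins: the lock is padded so two spinlocks can never
share a cache line (avoiding unnecessary dirty flushes between CPUs), and
the wait loop spins on a plain read with Intel's pause hint before retrying
the atomic operation. This is illustrative C, not ASE's spinlock
implementation.

/* Sketch: cache-line-aligned spinlock with a pause-based wait loop. */
#define CACHE_LINE 64

typedef struct {
    volatile int locked;
    char pad[CACHE_LINE - sizeof(int)];   /* one lock per cache line */
} __attribute__((aligned(CACHE_LINE))) spinlock_t;

static inline void cpu_pause(void)
{
    __asm__ __volatile__("pause");        /* hint for hyper-threaded CPUs */
}

static void spin_lock(spinlock_t *lk)
{
    while (__sync_lock_test_and_set(&lk->locked, 1)) {
        /* spin on a plain read; retry the atomic only when it looks free */
        while (lk->locked)
            cpu_pause();
    }
}

static void spin_unlock(spinlock_t *lk)
{
    __sync_lock_release(&lk->locked);
}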

29
Spin lock decomposition
30
ASE Architecture: Storage Device
  • Efficient IO is critical for system performance
  • Process scheduling and interrupt handling are
    important
  • SCSI or Fiber Channel
  • Disk spindle RPM
  • Controller Cache
  • RAID 0 or RAID 1 or RAID 5
  • Synchronous vs. Asynchronous IO

31
ASE Architecture: File System vs. Raw
  • Raw for log IO
  • RAID 0 for data devices
  • RAID 0+1 for log devices
  • File System for better management
  • For 32-bit platforms where memory is limited, a
    file system for data devices is recommended
  • 4GB memory access limit
  • OS allocates the rest of the memory for the File
    System cache
  • Mostly read intensive and not heavily write
    intensive
  • This results in better read response for
    applications

32
ASE Architecture: File system vs. Raw devices
  • Use mix of File System and Raw devices.

33
ASE Architecture: Journaling on or off?
  • EXT3 with journaling disabled

34
ASE Architecture: Large Memory Support
  • Xeon has the PAE architecture
  • Allows applications to address up to 64GB of
    memory
  • ASE on Linux as of 12.5.2 release can support up
    to 64GB memory
  • Easy configuration to set up the large memory
    feature

35
Large Memory Support on Linux 32 bit
  • Intel has the PAE architecture
  • Allows applications to address up to 64GB of
    memory
  • Memory usage in ASE
  • Most of the memory on a given system is used for
    data caches
  • Avoid expensive disk reads and writes
  • File system devices cache the data in OS/FS cache
  • Double copying problem, resulting in wasted memory
    (see the sketch after this list)
  • Writes are very expensive
  • Increased CPU bandwidth on Xeon is underutilized
    by not having large memory support
  • Most production environments have raw devices
    for ASE
  • causes underutilization of the system memory
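As a hedged aside on the double-copying point above: on Linux, a database
can bypass the OS/file-system cache by opening its device files with
O_DIRECT, at the cost of using suitably aligned buffers. This is a generic
sketch of the technique with an invented file path, not a statement about
how ASE configures its devices.

/* Sketch: direct I/O to a device file, bypassing the OS/FS cache and so
 * avoiding the double copy (OS cache + database cache).  O_DIRECT needs
 * the buffer, offset and length aligned (here to 4KB).  Illustrative only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/demo.dat";          /* hypothetical device file */
    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0600);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)   /* page-aligned 4KB buffer */
        return 1;
    memset(buf, 0, 4096);

    ssize_t n = pwrite(fd, buf, 4096, 0);        /* write bypasses the FS cache */
    printf("wrote %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}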

36
ASE Architecture: Large Memory Support
37
Myth: ASE Engines vs. # of CPUs
  • Can I have more engines than # of CPUs?
  • Single server installation
  • no need to have more engines
  • Multiple ASE servers on a single system
  • total number of engines exceeding the # of CPUs
  • No simple Yes/No answer

38
Myth: ASE taking most of the CPU cycles
  • ASE always looks for work
  • Consumes CPU cycles when idling, but only for a
    fraction of a millisecond
  • With increasing CPU clock speed
  • The problem seems more severe
  • ASE is being improved to release CPU cycles as
    soon as possible
  • But ensure that the user's response time is not
    affected
  • Typical ASE tuning
  • Number of spins before releasing the CPU
  • Active IO and idling
  • Network and disk IO checks

39
Summary
  • Processor technology continues to improve
  • Higher clock speed
  • Dual core chip
  • EPIC architecture
  • A lot more improvement to expect for memory latency
  • More internal cache
  • Parallel execution engines
  • Parallelism pushed to compiler technology
  • ASE architecture makes use of new technology
  • Best OLTP engine
  • New optimizer and execution engine
  • Efficient handling of large data sets

40
Questions