Title: The Thoroughly Modern Mainframe
1. The Thoroughly Modern Mainframe
- Dr. Michael Salsburg
- NTSMF Users' Group
- Dec 9, 2002
2. Agenda
- Large Scale WINTEL Servers
- Disruptive technology or trend?
- Scale Up or Scale Out?
- A workload-motivated discussion of SMP and CC-NUMA
- PCI-Based I/O
- Consolidation
- Emerging Technologies
3. Server Industry Trends
Source: IDC
Intel will dominate server chip market
Windows 2000 will be pervasive server OS
4. The x440 Competition
Source: Gartner, Oct 2002

| System | Processors Supported | Max Procs | Max Mem | Max PCI Slots |
|---|---|---|---|---|
| Unisys ES7000 Aries 230 | Intel Xeon 1.4, 1.6 GHz | 16 | 32 GB | 48 |
| Unisys ES7000 Orion | Intel Xeon 1.4, 1.6 GHz | 32 | 64 GB | 96 |
| Egenera Blade Frame | Intel Xeon 1.4, 1.6 GHz | 96 | 288 GB | 12 |
| HP ProLiant DL760 | Intel Xeon Pentium 700, 900 MHz | 8 | 16 GB | 10 |
| HP rp8400 | HP PA-8700 750, 875 MHz | 16 | 64 GB | 16 |
| IBM eServer x440 | Intel Xeon 1.4, 1.5, 1.6 GHz | 8 (16 by 2002) | 64 GB | 72 |
5. A Comparison Using Moore's Law
- Comparison of CPU speeds / tpmC for 4x CPU WINTEL systems
6. TPC-C Top 10
8. Scale Up or Scale Out?
- Two of the three tiers in current application architectures use scale-out for growth
  - Increase of Web servers
  - Increase of Application Servers
- Database back end cannot be scaled out
- Scale up is needed for large database applications
- Scale out has some inherent downsides
  - Additional administrative/management attention
  - More headroom needed for heavy traffic
9. SMP / NUMA: Workload Discussion
- As code executes on the processor, memory is referenced. This can be broken into three regions:
  - High Locality of Reference: memory is immediately re-referenced (> 95%)
  - Working Set: the set of addresses on which the software primarily focuses
  - Persistent Storage: addresses that are stored on physical devices
10. Scale Out: SMP or NUMA? Workload Interference
- When two processes are running on the same system, their memory references will interfere.
- It is preferable to only interfere at the persistent storage level
- Interference at higher levels can decrease cache efficiency and slow down processing, effectively reducing the CPU power
11. SMP / NUMA: SMP Topology
- A bank of CPUs shares a bank of memory
- Each CPU has a local cache to optimize high locality of reference
- A cache miss has uniform latency time to get data from memory
- Dirty memory references require fetching the updated memory from another CPU's cache
- The CPU can stall waiting for a memory reference
12. SMP / NUMA: Workload Discussion
- Percentages of references based on TPC-C workload profile
- Relative time units show orders of magnitude between cache hit and persistent storage
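The point of this slide can be sketched as a weighted average: each region contributes its share of references multiplied by its relative access cost. The transcript does not preserve the slide's actual percentages or time units, so every number below is a hypothetical placeholder (only the cache-hit share above 95% echoes the earlier slide), and the names are illustrative.

```python
# Hypothetical reference mix and relative costs. These are NOT the slide's
# TPC-C figures, which did not survive in the transcript.
REGIONS = {
    # name: (fraction of references, relative time units per reference)
    "cache hit (high locality)": (0.96, 1),
    "main memory (working set)": (0.0399, 100),
    "persistent storage": (0.0001, 1_000_000),
}

def average_cost(regions):
    """Weighted-average cost per reference, in relative time units."""
    total_fraction = sum(fraction for fraction, _ in regions.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * cost for fraction, cost in regions.values())

def time_share(regions):
    """Share of total time spent in each region."""
    total = average_cost(regions)
    return {name: fraction * cost / total
            for name, (fraction, cost) in regions.items()}

if __name__ == "__main__":
    print(f"average cost: {average_cost(REGIONS):.2f} time units")
    for name, share in time_share(REGIONS).items():
        print(f"{name}: {share:.1%} of total time")
```

Even with made-up numbers of this shape, the rare references to persistent storage dominate total time, which is what an orders-of-magnitude gap implies.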
13. SMP / NUMA: NUMA (Non-Uniform Memory Access)
- Overcome bus congestion and physical fabrication limitations found in a single bus architecture
- Two memory latencies: near and far
- The NUMA ratio is the ratio of far latency over near latency
  - Originally 30, now it is around 3
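To see why the drop in NUMA ratio matters, here is a minimal sketch of average memory latency as a mix of near and far references. The ratios 30 and 3 come from the slide; the 20% far-reference share and the function name are assumptions chosen only for illustration.

```python
def effective_latency(near_latency, numa_ratio, far_fraction):
    """Average memory latency when a fraction of references go to far memory.

    numa_ratio is far latency divided by near latency, as defined on the slide.
    """
    if not 0.0 <= far_fraction <= 1.0:
        raise ValueError("far_fraction must be between 0 and 1")
    far_latency = near_latency * numa_ratio
    return (1.0 - far_fraction) * near_latency + far_fraction * far_latency

# Assumption: 20% of memory references are far; near latency is 1 unit.
FAR_FRACTION = 0.20
for ratio in (30, 3):  # the two NUMA ratios quoted on the slide
    latency = effective_latency(1.0, ratio, FAR_FRACTION)
    print(f"NUMA ratio {ratio}: average latency = {latency:.1f}x near latency")
```

Under that assumed mix, the average latency is 6.8 times the near latency at a ratio of 30 and 1.4 times at a ratio of 3.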
14. SMP / NUMA: Hybrid (Unisys ES7000)
- Another level of cache is introduced
- Memory accesses can be non-uniform when comparing Next Level Cache hits to memory references
- Overcomes the fabrication/congestion problems of a single bus architecture
15. PCI-Based I/O
Cellular MultiProcessing (CMP) Architecture
16. PCI-Based I/O
[Figure: I/O bandwidth roadmap, 2001 and earlier through 2005. Vertical axis: GB/sec (max bus or per direction), 1 to 4. Technologies shown: PCI, PCI-X, PCI-Express (1X, 4X, 8X, 16X, or 8X at 5Gb), HyperTransport, and Scalability Port (SP1, SP2). Other labels: 66, 133, 266, 533MHz; 0.8 GB; 6.4 GB.]
17. Enterprise-Level Backup / Restore
18. Enterprise-Level Backup / Restore
- Complete recovery of a 2.5 terabyte database
  - From tape, the database was recovered in only 88 minutes with a sustained throughput during restore of 2.2 TB/hr.
  - From the hardware snapshot, the same database was recovered in only 11 minutes.
- Complete backup of a 2.5 terabyte database
  - Backup to tape took only 68 minutes with minimal impact on online operations and sustained throughput of 2.6 TB/hr.
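As a rough cross-check, the elapsed times above can be turned into whole-run average rates. This is only a sketch: the function name is illustrative, and a whole-run average is not the same measure as the sustained throughput quoted on the slide (the slides do not say how the sustained figure was taken), so the two need not agree.

```python
def average_rate_tb_per_hr(size_tb, minutes):
    """Whole-run average throughput in TB/hr."""
    return size_tb / (minutes / 60.0)

DATABASE_TB = 2.5
RUNS = {  # elapsed minutes as quoted on the slide
    "restore from tape": 88,
    "restore from hardware snapshot": 11,
    "backup to tape": 68,
}
for name, minutes in RUNS.items():
    rate = average_rate_tb_per_hr(DATABASE_TB, minutes)
    print(f"{name}: {rate:.1f} TB/hr average")
```

The averages come out at about 1.7 TB/hr for the tape restore, 13.6 TB/hr for the snapshot restore, and 2.2 TB/hr for the tape backup.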
19. Consolidation
- "Our servers were multiplying like rabbits," says Jeff Smith, manager of corporate network services at La-Z-Boy Inc., a Monroe, Mich.-based residential furniture producer that just completed a Windows NT server consolidation project. "Our distributed environment was becoming more and more difficult to manage."
- "Thinning The Server Ranks," Computerworld, Aug 26, 2002
20. Consolidation
- How do you stuff over 130 CPUs' worth of workload into a 32x CPU system?
  - Veeerrrry carefully
- Why are current server farms filled with under-utilized servers?
  - Web Hosting Sites
    - New web servers are installed when peak CPU utilization reaches above 35%.
    - "Speed and reliability are very important to your web site. All of our servers are maintained at less than 15% CPU utilization. This ensures that your web site downloads as fast as possible!"
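The arithmetic behind the opening question can be sketched as follows. The 130 CPUs, the 32x target, and the 15% and 35% utilization levels come from the slide; treating all CPUs as equally fast and ignoring consolidation overhead are simplifying assumptions, and the function name is illustrative.

```python
def consolidated_utilization(source_cpus, source_utilization, target_cpus):
    """Utilization of the target system if all source work moved onto it.

    Assumes equal-speed CPUs and no consolidation overhead (both simplifications).
    """
    busy_cpu_equivalents = source_cpus * source_utilization
    return busy_cpu_equivalents / target_cpus

# 130 CPUs and the 32x target come from the slide; the utilization levels are
# the 15% and 35% figures quoted for web hosting sites.
for utilization in (0.15, 0.35):
    result = consolidated_utilization(130, utilization, 32)
    print(f"130 CPUs at {utilization:.0%} -> 32x system at {result:.0%}")
```

At 15% utilization the combined work fills about 61% of the 32x system. If every server hit a 35% peak at the same moment, the work would exceed the system's capacity (about 142%), which is one reading of why the answer is "very carefully".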
21. Consolidation: Responsive Consolidation
- Which would you prefer: an average queue size of 0.2 on a 1x or a 32x system?
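One way to explore this question is with a textbook M/M/c queue, asking how busy each system can be before the average number of waiting jobs reaches 0.2. The queueing model (Poisson arrivals, exponential service times) and the reading of "queue size" as jobs waiting rather than jobs in service are assumptions of this sketch, not something the slide states.

```python
def erlang_c(servers, utilization):
    """Probability that an arriving job must wait in an M/M/c queue."""
    offered_load = servers * utilization
    blocking = 1.0  # Erlang B with zero servers
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    return blocking / (1.0 - utilization * (1.0 - blocking))

def mean_queue_length(servers, utilization):
    """Average number of jobs waiting (not in service), Lq."""
    return erlang_c(servers, utilization) * utilization / (1.0 - utilization)

def utilization_for_queue(servers, target_queue, tolerance=1e-9):
    """Bisect for the per-CPU utilization at which Lq equals target_queue."""
    low, high = 0.0, 1.0 - 1e-12
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if mean_queue_length(servers, middle) < target_queue:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0

for cpus in (1, 32):
    busy = utilization_for_queue(cpus, 0.2)
    print(f"{cpus:>2}x system: queue of 0.2 reached at {busy:.0%} utilization")
```

Under this model a 1x system reaches an average queue of 0.2 at roughly 36% utilization, while a 32x system reaches the same queue only at a much higher utilization, so the larger system delivers the same responsiveness with far less idle headroom.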
22. Consolidation: Benefits
- Simplified Management / Administration
- Higher Utilization (less headroom)
- Less Variability of Service
- Less Overall CPU Overhead
- Fewer software licenses
23. Emerging CPU Technologies: 32x INTEL CPU TPC-C Results

| Date Published | tpm-C | Chip | Speed | Cache | Memory |
|---|---|---|---|---|---|
| 11/11/2001 | 165,218 | Pentium III Xeon | 900 MHz | 2 MB | 64 GB |
| 9/9/2002 | 308,620 | Itanium II | 1 GHz | 3 MB | 256 GB |
| 11/4/2002 | 203,518 | Pentium IV Xeon | 1.6 GHz | 1 MB | 64 GB |
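For a quick comparison, the three results can be normalized per CPU and against the earliest entry. The figures are copied from the table; dividing by 32 CPUs follows the slide title, and the per-CPU view is only a rough lens, since the systems also differ in cache and memory size.

```python
RESULTS = [  # (date published, chip, tpm-C) copied from the table
    ("11/11/2001", "Pentium III Xeon 900MHz", 165_218),
    ("9/9/2002", "Itanium II 1GHz", 308_620),
    ("11/4/2002", "Pentium IV Xeon 1.6 GHz", 203_518),
]
CPUS = 32  # all three are 32x systems per the slide title

def summarize(results, cpus):
    """Return (chip, tpm-C per CPU, ratio to the first result) per entry."""
    baseline = results[0][2]
    return [(chip, tpmc / cpus, tpmc / baseline) for _, chip, tpmc in results]

for chip, per_cpu, relative in summarize(RESULTS, CPUS):
    print(f"{chip}: {per_cpu:,.0f} tpm-C per CPU, {relative:.2f}x the 2001 result")
```

The Itanium II result is about 1.87 times the 2001 Pentium III Xeon result, and the Pentium IV Xeon result about 1.23 times.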
24. Itanium II: What's so great about 64 bits?
- For transaction processing, memory addressing is increased and therefore the amount of main memory increases
- The top 5 TPC-C results were achieved using 64 bit computing
- TPC-C is a large database application; this is a sweet spot for 64 bit commercial computing
- Bigger is DEFINITELY Better!!
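The link between address width and memory size is simple arithmetic, sketched below. The 36-bit row is context added here rather than taken from the slides: it is the physical address width of 32-bit x86 with PAE, which matches the 64 GB maximum listed for the Xeon systems. The 64-bit row is the architectural limit of a 64-bit address, not the memory any system actually installs. Function names are illustrative.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits

def human(n_bytes):
    """Format a power-of-two byte count using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    value = n_bytes
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value} {unit}"
        value //= 1024

for bits in (32, 36, 64):  # 36 bits: x86 PAE physical addressing (added context)
    print(f"{bits}-bit addresses: {human(addressable_bytes(bits))}")
```

This gives 4 GiB for 32 bits, 64 GiB for 36 bits, and 16 EiB for 64 bits.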