ASM without HW RAID
Transcript and Presenter's Notes

Title: ASM without HW RAID


1
Implementing ASM Without HW RAID, A User's
Experience
Luca Canali, CERN; Dawid Wojcik, CERN
UKOUG, Birmingham, December 2008
2
Outline
  • Introduction to ASM
  • Disk groups, fail groups, normal redundancy
  • Scalability and Performance of the solution
  • Possible pitfalls, sharing experiences
  • Implementation details, monitoring, and tools to
    ease ASM deployment

3
Architecture and main concepts
  • Why ASM?
  • Provides the functionality of a volume manager
    and a cluster file system
  • Raw access to storage for performance
  • Why ASM-provided mirroring?
  • Allows the use of lower-cost storage arrays
  • Allows mirroring across storage arrays
  • Arrays are not single points of failure
  • Array (HW) maintenance can be done in a rolling
    fashion
  • Stretch clusters

4
ASM and cluster DB architecture
  • Oracle architecture of redundant low-cost
    components

5
Files, extents, and failure groups
  • Files and extent pointers
  • Failgroups and ASM mirroring

6
ASM disk groups
  • Example HW: 4 disk arrays with 8 disks each
  • An ASM diskgroup is created using all available
    disks (see the SQL sketch below)
  • The end result is similar to a file system on
    RAID 10
  • ASM allows mirroring across storage arrays
  • Oracle RDBMS processes directly access the
    storage
  • Raw disk access

Diagram: an ASM diskgroup mirrored across Failgroup1 and Failgroup2, with striping inside each failgroup.
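A diskgroup of this kind can be created with standard ASM SQL along the following lines; this is only a minimal sketch, with a hypothetical diskgroup name and device paths, not the exact commands used at CERN:

  -- Run against the ASM instance. Each FAILGROUP clause groups the disks of
  -- one storage array, so NORMAL redundancy mirrors every extent across arrays.
  CREATE DISKGROUP data1 NORMAL REDUNDANCY
    FAILGROUP array1 DISK '/dev/mpath/array1_disk1',
                          '/dev/mpath/array1_disk2'
    FAILGROUP array2 DISK '/dev/mpath/array2_disk1',
                          '/dev/mpath/array2_disk2';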
7
Performance and scalability
  • ASM with normal redundancy
  • Stress tested for CERN's use
  • Scales and performs

8
Case Study: the largest cluster I have ever
installed, RAC5
  • The test used 14 servers

9
Multipathed Fibre Channel
  • 8 FC switches, 4 Gbps (10 Gbps uplink)

10
Many spindles
  • 26 storage arrays (16 SATA disks each)

11
Case Study: I/O metrics for the RAC5 cluster
  • Measured, sequential I/O
  • Read: 6 GB/s
  • Read-Write: 3 + 3 GB/s
  • Measured, small random I/O
  • Read: 40K IOPS (8 KB read ops)
  • Note
  • 410 SATA disks, 26 HBAs on the storage arrays
  • Servers: 14 x 4 Gbps HBAs, 112 cores, 224 GB of
    RAM

12
How the test was run
  • A custom SQL-based DB workload
  • IOPS: randomly probe a large table (several TBs)
    via several parallel query slaves (each reads a
    single block at a time)
  • MBPS: read a large (several TBs) table with
    parallel query
  • The test table used for the RAC5 cluster was 5 TB
    in size
  • created inside a disk group of 70 TB
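The slides do not show the workload SQL itself; a minimal sketch of the two access patterns, assuming a hypothetical test table TESTBIG with a numeric primary key ID, an index TESTBIG_PK and a PAYLOAD column, could look like this:

  -- MBPS test: parallel full scan of the multi-TB table,
  -- driving large sequential reads.
  SELECT /*+ FULL(t) PARALLEL(t, 16) */ COUNT(*)
  FROM   testbig t;

  -- IOPS test: run many concurrent sessions, each probing random keys,
  -- so every fetch is a single-block random read.
  SELECT /*+ INDEX(t testbig_pk) */ payload
  FROM   testbig t
  WHERE  id = TRUNC(DBMS_RANDOM.VALUE(1, 1e9));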

13
Possible pitfalls
  • Production Stories
  • Sharing experiences
  • 3 years in production, 550 TB of raw capacity

14
Rebalancing speed
  • Rebalancing is performed (and mandatory) after
    space management operations
  • Typically after HW failures (restore mirror)
  • Goal: balanced space allocation across disks
  • Not based on performance or utilization
  • ASM instances are in charge of rebalancing
  • Scalability of rebalancing operations?
  • In 10g serialization wait events can limit
    scalability
  • Even at maximum speed rebalancing is not always
    I/O bound
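Rebalancing speed is set through the rebalance power (0 to 11 in 10g); for example, to run a pending rebalance at maximum speed on a hypothetical diskgroup DATA1:

  -- POWER 0 postpones rebalancing; POWER 11 uses the maximum number
  -- of rebalance slaves in 10g.
  ALTER DISKGROUP data1 REBALANCE POWER 11;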

15
Rebalancing, an example
16
VLDB and rebalancing
  • Rebalancing operations can move more data than
    expected
  • Example:
  • 5 TB (allocated), 100 disks, 200 GB each
  • A disk is replaced (diskgroup rebalance)
  • The total I/O workload is 1.6 TB (8x the disk
    size!)
  • How to see this: query v$asm_operation; the
    column EST_WORK keeps growing during rebalance
  • The issue: excessive repartnering
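The effect can be watched from the ASM instance with the standard v$asm_operation view; on an affected system the EST_WORK column keeps growing while the rebalance runs:

  -- Progress of ongoing rebalance operations.
  SELECT group_number, operation, state, power,
         sofar, est_work, est_rate, est_minutes
  FROM   v$asm_operation;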

17
Rebalancing issues wrap-up
  • Rebalancing can be slow
  • Many hours for very large disk groups
  • Associated risk:
  • 2nd disk failure while rebalancing
  • Worst case: loss of the diskgroup if partner
    disks fail

18
Fast Mirror Resync
  • ASM 10g with normal redundancy does not allow
    part of the storage to be taken offline
  • A transient error in a storage array can cause
    several hours of rebalancing to drop and re-add
    the disks
  • It is a limiting factor for scheduled
    maintenance
  • 11g has a new feature: fast mirror resync
  • A great feature for rolling interventions on HW
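A sketch of how fast mirror resync is used in 11g; the diskgroup name, failgroup name and repair time are examples, and the diskgroup compatibility attributes must be at least 11.1:

  -- Allow offlined disks to stay offline for up to 12 hours
  -- before ASM drops them (the default is 3.6h).
  ALTER DISKGROUP data1 SET ATTRIBUTE 'disk_repair_time' = '12h';

  -- Take one storage array (failgroup) offline for a rolling intervention;
  -- ASM tracks the changed extents instead of dropping the disks.
  ALTER DISKGROUP data1 OFFLINE DISKS IN FAILGROUP array1;

  -- After the maintenance, resynchronize only the stale extents.
  ALTER DISKGROUP data1 ONLINE DISKS IN FAILGROUP array1;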

19
ASM and filesystem utilities
  • Only a few tools can access ASM
  • asmcmd, dbms_file_transfer, XDB, FTP
  • Limited operations (no copy, rename, etc.)
  • Require open DB instances
  • File operations are difficult in 10g
  • 11g asmcmd has the cp (copy) command
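For example, in 10g DBMS_FILE_TRANSFER can copy a file between ASM and a regular file system through an open instance; a sketch with hypothetical directory objects and file names:

  -- One directory object points inside ASM, the other to a local file system.
  CREATE DIRECTORY asm_dir AS '+DATA1/mydb/datafile';
  CREATE DIRECTORY os_dir  AS '/tmp/asm_copies';

  BEGIN
    -- Copies the ASM file to the local file system.
    DBMS_FILE_TRANSFER.COPY_FILE(
      source_directory_object      => 'ASM_DIR',
      source_file_name             => 'users01.dbf',
      destination_directory_object => 'OS_DIR',
      destination_file_name        => 'users01.dbf');
  END;
  /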

20
ASM and corruption
  • ASM metadata corruption
  • Can be caused by bugs
  • One case in production, after a disk eviction
  • Physical data corruption
  • ASM automatically fixes most corruption on the
    primary extent
  • Typically when doing a full backup
  • Secondary extent corruption goes undetected
    until a disk failure or rebalance exposes it
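Not ASM-specific, but the blocks flagged by an RMAN full backup or validate run can be listed from SQL; with normal redundancy, primary-extent corruptions are typically repaired from the mirror copy when the block is re-read:

  -- Populated by RMAN backup/validate operations.
  SELECT file#, block#, blocks, corruption_type
  FROM   v$database_block_corruption;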

21
Disaster recovery
  • Corruption issues were fixed by using a physical
    standby to move to fresh storage
  • For HA, our experience is that disaster recovery
    is needed
  • Standby DB
  • On-disk (flash) copy of DB

22
Implementation details
23
Storage deployment
  • Current storage deployment for Physics Databases
    at CERN
  • SAN, FC (4 Gb/s) storage enclosures with SATA
    disks (8 or 16)
  • Linux x86_64, no ASMLib, device mapper instead
    (naming persistence, HA)
  • Over 150 FC storage arrays (production,
    integration and test) and 2000 LUNs exposed
  • Biggest DB over 7 TB (more to come when the LHC
    starts; estimated growth up to 11 TB/year)

24
Storage deployment
  • ASM implementation details
  • Storage in JBOD configuration (1 disk -> 1 LUN)
  • Each disk partitioned at the OS level
  • 1st partition: 45% of the disk size, the faster
    part of the disk (short stroke)
  • 2nd partition: the rest, the slower part (full
    stroke)

Diagram: outer sectors = short stroke (faster), inner sectors = full stroke (slower).
25
Storage deployment
  • Two diskgroups created for each cluster
  • DATA: data files and online redo logs, on the
    outer part of the disks
  • RECO: flash recovery area destination (archived
    redo logs and on-disk backups), on the inner part
    of the disks
  • One failgroup per storage array

Diagram: diskgroups DATA_DG1 and RECO_DG1, each spanning Failgroup1 through Failgroup4 (one failgroup per storage array).
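The one-failgroup-per-array layout can be checked from the ASM instance with a query of this shape (standard views; only a sketch):

  -- Disk count and capacity per failgroup; with one failgroup per storage
  -- array, each diskgroup should show one row per array.
  SELECT dg.name AS diskgroup, d.failgroup,
         COUNT(*)                       AS disks,
         ROUND(SUM(d.total_mb) / 1024)  AS total_gb
  FROM   v$asm_disk d
         JOIN v$asm_diskgroup dg ON d.group_number = dg.group_number
  GROUP  BY dg.name, d.failgroup
  ORDER  BY dg.name, d.failgroup;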
26
Storage management
  • SAN set-up in a JBOD configuration: many steps,
    can be time consuming
  • Storage level:
  • logical disks
  • LUNs
  • mappings
  • FC infrastructure: zoning
  • OS: creating the device mapper configuration
  • multipath.conf (name persistency, HA)

27
Storage management
  • Storage manageability
  • DBAs set up the initial configuration
  • ASM: extra maintenance in case of storage
    maintenance (disk failure)
  • Problems:
  • How to quickly set up the SAN configuration
  • How to manage disks and keep track of the
    mappings: physical disk -> LUN -> Linux disk ->
    ASM disk

SCSI 1:0:1:3, 2:0:1:3 -> /dev/sdn,
/dev/sdax -> /dev/mpath/rstor901_3 -> ASM
TEST1_DATADG1_0016
28
Storage management
  • Solution
  • Configuration DB - repository of FC switches,
    port allocations and of all SCSI identifiers for
    all nodes and storages
  • Big initial effort
  • Easy to maintain
  • High ROI
  • Custom tools
  • Tools to identify
  • SCSI (block) devices <-> device mapper device <->
    physical storage and FC port
  • Device mapper device <-> ASM disk
  • Automatic generation of device mapper
    configuration

29
Storage management
  • lssdisks.py
  The following storages are connected
  Host interface 1
    Target ID 100 - WWPN 210000D0230BE0B5 - Storage rstor316, Port 0
    Target ID 101 - WWPN 210000D0231C3F8D - Storage rstor317, Port 0
    Target ID 102 - WWPN 210000D0232BE081 - Storage rstor318, Port 0
    Target ID 103 - WWPN 210000D0233C4000 - Storage rstor319, Port 0
    Target ID 104 - WWPN 210000D0234C3F68 - Storage rstor320, Port 0
  Host interface 2
    Target ID 200 - WWPN 220000D0230BE0B5 - Storage rstor316, Port 1
    Target ID 201 - WWPN 220000D0231C3F8D - Storage rstor317, Port 1
    Target ID 202 - WWPN 220000D0232BE081 - Storage rstor318, Port 1
    Target ID 203 - WWPN 220000D0233C4000 - Storage rstor319, Port 1
    Target ID 204 - WWPN 220000D0234C3F68 - Storage rstor320, Port 1

  SCSI Id   Block DEV  MPath name  MP status  Storage  Port
  --------  ---------  ----------  ---------  -------  ----
  0:0:0:0   /dev/sda   -           -          -        -

Custom-made script
30
Storage management
  • listdisks.py
  DISK          NAME               GROUP_NAME    FG        H_STATUS  MODE    MOUNT_S  STATE   TOTAL_GB  USED_GB
  ------------  -----------------  ------------  --------  --------  ------  -------  ------  --------  -------
  rstor401_1p1  RAC9_DATADG1_0006  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
  rstor401_1p2  RAC9_RECODG1_0000  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     119.9      1.7
  rstor401_2p1  --                 --            --        UNKNOWN   ONLINE  CLOSED   NORMAL     111.8    111.8
  rstor401_2p2  --                 --            --        UNKNOWN   ONLINE  CLOSED   NORMAL     120.9    120.9
  rstor401_3p1  RAC9_DATADG1_0007  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.6
  rstor401_3p2  RAC9_RECODG1_0005  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
  rstor401_4p1  RAC9_DATADG1_0002  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
  rstor401_4p2  RAC9_RECODG1_0002  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
  rstor401_5p1  RAC9_DATADG1_0001  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
  rstor401_5p2  RAC9_RECODG1_0006  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
  rstor401_6p1  RAC9_DATADG1_0005  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
  rstor401_6p2  RAC9_RECODG1_0007  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
  rstor401_7p1  RAC9_DATADG1_0000  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.6
  rstor401_7p2  RAC9_RECODG1_0001  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8

Custom-made script
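listdisks.py is a custom CERN script; the ASM side of the listing above could be produced by a query of roughly this shape (a sketch; the script additionally resolves the device mapper aliases):

  -- ASM view of member and candidate disks, similar to the listing above.
  SELECT d.path, d.name, dg.name AS group_name, d.failgroup,
         d.header_status, d.mode_status, d.mount_status, d.state,
         ROUND(d.total_mb / 1024, 1)               AS total_gb,
         ROUND((d.total_mb - d.free_mb) / 1024, 1) AS used_gb
  FROM   v$asm_disk d
         LEFT JOIN v$asm_diskgroup dg ON d.group_number = dg.group_number
  ORDER  BY d.path;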
31
Storage management
  • gen_multipath.py
  multipath default configuration for PDB
  defaults {
      udev_dir            /dev
      polling_interval    10
      selector            "round-robin 0"
      . . .
  }
  multipaths {
      multipath {
          wwid   3600d0230006c26660be0b5080a407e00
          alias  rstor916_CRS
      }
      multipath {
          wwid   3600d0230006c26660be0b5080a407e01
          alias  rstor916_1
      }
      . . .
  }

Custom made script
device mapper alias: naming persistency and
multipathing (HA)
SCSI 1:0:1:3, 2:0:1:3 -> /dev/sdn,
/dev/sdax -> /dev/mpath/rstor916_1
32
Storage monitoring
  • ASM-based mirroring means:
  • Oracle DBAs need to be alerted of disk failures
    and evictions
  • Dashboard for a global overview: custom solution,
    RACMon
  • ASM-level monitoring:
  • Oracle Enterprise Manager Grid Control
  • RACMon: alerts on missing disks and failgroups,
    plus a dashboard (see the query sketch below)
  • Storage-level monitoring:
  • RACMon: LUN health and storage configuration
    details on a dashboard
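RACMon is a custom CERN tool; a simplified sketch of the kind of check it performs against the ASM instance to alert on missing or offlined disks could look like this:

  -- Disks that are missing, offlined, or no longer recognized as members:
  -- candidates for a "missing disk / failgroup" alert.
  SELECT dg.name AS diskgroup, d.failgroup, d.path,
         d.header_status, d.mount_status, d.mode_status
  FROM   v$asm_disk d
         LEFT JOIN v$asm_diskgroup dg ON d.group_number = dg.group_number
  WHERE  d.mount_status = 'MISSING'
     OR  d.mode_status  = 'OFFLINE'
     OR  d.header_status IN ('UNKNOWN', 'FORMER');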

33
Storage monitoring
  • ASM instance level monitoring
  • Storage level monitoring

new failing disk on RSTOR614
new disk installed on RSTOR903, slot 2
34
Conclusions
  • Oracle ASM diskgroups with normal redundancy
  • Used at CERN instead of HW RAID
  • Performance and scalability are very good
  • Allows the use of low-cost HW
  • Requires more admin effort from the DBAs than
    high-end storage
  • 11g has important improvements
  • Custom tools to ease administration

35
Q&A
  • Thank you
  • Links
  • http://cern.ch/phydb
  • http://www.cern.ch/canali