Title: Yottabytes and Beyond
1Yottabytes and Beyond
- Demystifying Storage and
- Building large Storage Networks
- Part I
- by Bhavin Turakhia, CEO, Directi
- bhavin.t_at_directi.com
- (shared under Creative Commons Attribution
Share-alike License incorporated herein by
reference)
- (http://creativecommons.org/licenses/by-sa/3.0/)
2Why is storage important?
- Web 2.0 applications are an extension of your
Desktop
- SaaS is here and growing
- Broadband is a reality
- Storage costs are dropping
- Everyone expects near-unlimited storage online
- YouTube, Flickr, Facebook et al. are storing your
life online
- (... and yes, let's not forget your personal
BitTorrent collection)
It would take 1400 TB to store your entire life
in video. 5700 TB if you want to know what was
happening around you. Another 73 TB for the audio
files of everything you heard (MP3 quality).
That's about 6000 TB for a copy of your life
3Agenda
- Hard disks
- SATA, SAS, FC, Solid State
- RAID
- DAS
- SAN
4- Large scale storage requires careful planning
5- Choosing your Hard Disk
- (SATA, FC, SAS, SCSI, Solid State)
6Introduction to Hard Drives
- Basic physical storage unit (aka physical block
device)
- Variables to consider when selecting a drive
- Type (SAS, SATA, FC)
- RPM
- Capacity
- MTBF (Mean Time between Failures)
- Life Expectancy
7Hard Disk types
8Hard Disk types
9Hard Disk Conclusions
- For high-IOPS workloads, database applications and
low-storage requirements you have a choice between FC and
SAS
- SAS currently seems like the better option
- Future SAS standards promise to be faster than FC
(though it is likely they will remain neck and
neck)
- For high-storage requirements (video servers, file
servers, photo storage, archival, mail servers,
backup servers) SATA is the way to go
- One may combine SAS and SATA to reduce average
cost and still meet your goals, especially since
the backplanes are cross-compatible
- Read the spec sheets of the hard drives you plan
to use to determine specifics
10Solid State Drives
- Uses solid state memory to store persistent data
- Eliminates mechanical parts
- Useful for creating efficient in-between caches
or storing small to mid-sized high performance
databases
11Solid State Drives
- References
- Intro - http://en.wikipedia.org/wiki/Solid_state_disk
- RAM vs Flash based - http://www.storagesearch.com/ssd-ram-v-flash.html
- SSD based SAN!!! ? - http://www.superssd.com/
12- RAID Primer
- (0, 1, 2, 3, 4, 5, 6, TP, 01, 10, 50, 60)
13Introduction to RAID
- Allows multiple disks to appear as a single
contiguous physical block device
- Provides redundancy / high availability
- A RAID group appears as a single physical block
device
(Diagram: a RAID group built from HD1 and HD2)
14Comparison of Single RAID Levels
15Comparison of Single RAID Levels
16Comparison of Single RAID Levels
17Comparison of Single RAID Levels
18Understanding the Parity Penalty
- RAID 5 and RAID 6 store parity information
alongside the data so that it can be rebuilt
- Single parity can be calculated using a simple
XOR
- E.g. abcdefghijkl on a 4 disk RAID 5 array
- If Disk 2 fails then the data B can be
recalculated as (01000001 XOR 01000011 XOR
01000000) = 01000010 = B
- Here 01000001 is A, 01000011 is C and 01000000 is
the parity P (A XOR B XOR C)
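The XOR arithmetic above can be checked with a short Python sketch. The helper name xor_parity and the use of one-byte blocks are assumptions made for illustration; a real controller works on whole stripe units in hardware or firmware.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# One stripe from the example: data blocks A, B, C on three disks,
# parity P on the fourth.
a, b, c = b"A", b"B", b"C"
p = xor_parity([a, b, c])

# Disk 2 (holding B) fails: XOR the surviving blocks with the parity.
rebuilt = xor_parity([a, c, p])

print(format(p[0], "08b"))  # parity byte as bits
print(rebuilt)              # recovered contents of Disk 2
```

The same function serves for both computing parity and rebuilding a lost block, which is why single parity can only survive one failed disk per stripe.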
19Understanding the Parity Penalty
- Steps to change B to X on Disk 2
- Read A, C and P
- Recalculate P as A XOR X XOR C
- Write X and P
- A single update requires 3 reads and 2 writes
- Random writes in RAID 5 and RAID 6 are very very
expensive
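The steps on this slide can be sketched in Python to make the I/O count explicit. The function update_block and the list-of-integers stripe are hypothetical stand-ins for real disks; the sketch follows the slide's sequence (read A, C and P, recompute, write X and P) and is not a description of how any particular controller schedules its I/O.

```python
def update_block(data, index, new_value):
    """Rewrite one data block in a single-parity stripe, counting disk I/O.

    data is a list of ints, one per data disk (an in-memory stand-in for
    real disks). Returns (new_data, new_parity, reads, writes).
    """
    reads = writes = 0

    # Read every other data block (A and C in the slide's example).
    others = [value for i, value in enumerate(data) if i != index]
    reads += len(others)

    # The slide also counts a read of the old parity block P.
    reads += 1

    # Recalculate P as the XOR of the new block and the untouched blocks.
    new_parity = new_value
    for value in others:
        new_parity ^= value

    # Write the new data block and the new parity block.
    new_data = list(data)
    new_data[index] = new_value
    writes += 2

    return new_data, new_parity, reads, writes


stripe = [ord("A"), ord("B"), ord("C")]
new_data, new_parity, reads, writes = update_block(stripe, 1, ord("X"))
print(reads, "reads,", writes, "writes")
```

A one-block change turns into five disk operations here, which is the penalty that makes random writes on parity RAID so costly compared with a plain mirror.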
20Understanding the Parity Penalty
- Rebuilding in RAID 5 and RAID 6 is expensive
- The cost increases with the number of
disks
- As if this isn't enough, there is an additional
penalty
- All the writes after the computation (i.e. parity
and the changed block) must be simultaneous
(involving a two-phase commit operation)
- The impact can be marginally reduced through
write-back caching
21Comparison of Nested RAID Levels
22Comparison of Nested RAID Levels
23Comparison of Nested RAID Levels
24Nested RAID Misc Notes
- RAID 10 is faster and better than RAID 01 for
the same cost
- RAID 60 is similar to RAID 50 except that the
striped sets with parity contain dual parity
- Ideally RAID 10 and RAID 50 will be the only
nested RAID levels you will use
25RAID Considerations
- Select your stripe size by empirical testing
- A smaller stripe size increases transfer
performance and decreases positioning performance,
and vice versa
- Ideal stripe sizes depend on your application,
the typical amount of data fetched per read, sequential vs random
reads etc.
- Try to select hard drives from separate
production batches
- Maintain sufficient spares in a large array
(typically 1 per 10-15 disks is sufficient)
- Use global spares across RAID groups if your
controller supports it
26RAID Considerations
- Use hardware RAID unless performance is not a
consideration
- Nested RAID levels and parity-based RAID in
particular consume more CPU cycles and increase
rebuild time if implemented in software
- General rule about controller cache: the higher
the better
- Ensure the controller has battery backup to
retain its cache in case of power failure
- For internal RAID controller cards use faster PCI
buses (PCI-X)
27- The fun starts: let's build our storage system
28- Passive Disk Enclosure based Direct Attached
Storage (PDE based DAS)
29Passive Disk Enclosure based DAS
- DAS: Direct Attached Storage
- RAID controller inside host machine
- External chassis is simply a JBOD (Just a Bunch Of
Disks)
- (or what I'd like to call a Passive Disk Enclosure
or PDE)
- A PDE enables stringing a larger number of drives
together as compared to an internal RAID array
- E.g. Dell PowerVault MD1000
30Passive Disk Enclosure based DAS
- A Passive Disk Enclosure can consist of SAS, SATA
or FC drives
- Passive Disk Enclosure to RAID controller
connectivity can be SAS, FC or SCSI (possibly
different from the backplane)
- Multiple PDEs can be daisy chained if they
support it
- RAID card is a single point of failure
- Only one host machine supported
- Array of disks can be divided into multiple RAID
groups
31Passive Disk Enclosure based DAS
- Array of disks can be divided into multiple
heterogeneous RAID groups
- Size and type of a RAID group depends on the RAID
card
- PDE may have multiple paths to the system, with the
possibility of multiplexing for increased speed
- Global spares can be defined on the RAID card
- Maximum storage size = maximum number of PDEs
that can be daisy chained x drives per PDE x size of drives
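As a rough worked example of the maximum storage size calculation, here is a small Python sketch. The function name and the figures (3 enclosures, 15 drives each, 0.75 TB drives) are illustrative assumptions, not the limits of any specific product; take the real values from your enclosure and RAID card spec sheets.

```python
def max_raw_capacity_tb(enclosures, drives_per_enclosure, drive_size_tb):
    """Raw (pre-RAID) capacity of a daisy chain of identical enclosures."""
    return enclosures * drives_per_enclosure * drive_size_tb


# Illustrative numbers only.
raw_tb = max_raw_capacity_tb(
    enclosures=3,
    drives_per_enclosure=15,
    drive_size_tb=0.75,
)
print(raw_tb, "TB raw")
```

Usable capacity will be lower once hot spares and the chosen RAID level's mirror or parity overhead are taken out.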
32Passive Disk Enclosure based DAS
- Performance Considerations
- Drives
- RAID configuration
- PDE Interconnect
- PDE to RAID Card connect
- RAID card config (cache etc)
- PCI bus
33- Active Disk Enclosure based Direct Attached
Storage (ADE based DAS)
34Active Disk Enclosure based DAS
- ADE difference: the RAID card is not in the host
machine but in the enclosure
- Host machine has a SAS/FC Host Bus Adaptor (HBA)
depending on ADE to host connectivity support
- Some ADEs may support multiple connection
protocols
- ADE may support SAS/FC/SATA drives
- ADE can support daisy-chaining PDEs
- Examples of ADEs: Dell MD3000, Infortrend EonStor
devices, Nexsan SATABeast and SATABoy etc.
35Active Disk Enclosure based DAS
- ADE may support dual RAID controllers
- RAID controllers can be used as active-active
(in case of multiple RAID groups), otherwise as
active-passive
- RAID controller to HBA connectivity can be
multiplexed, if supported, for higher
throughput
- ADEs are wrongly but commonly referred to as a SAN
("SAN device" would still be alright)
36- Partitioning and Mounting
37Logical Volumes
- A RAID Group is a physical unit of storage
- At the operating system level a Logical Group can be
created out of multiple RAID Groups
- Each Logical Group can be further divided into
Logical Volumes
- Each Logical Volume represents a mountable block
device
- In Linux this is done using LVM (where the Logical
Group is called a Volume Group)
- In LVM Logical Volumes are resizable
38- SAN (Storage Area Network)
39SAN
- Multiple host machines connected to an ADE
through a SAN switch
- SAN refers to the interconnect, switch, ADE and
PDE taken together
- Switch and HBA can be SAS / FC depending on
interconnect type supported by the ADE
- ADE would support creation of volumes
- These can be mounted onto a client and further
subdivided
40SAN
- Care must be taken to mount each Logical Volume
onto a single client (unless you are running a
Clustered File System)
- This can be achieved by host masking supported by
the ADE and/or the switch
- Without careful host masking and mounting, data
corruption can take place
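One way to sanity-check a masking plan before applying it is to look for volumes exposed to more than one host. The Python sketch below assumes the plan is available as a list of (host, volume) pairs; the function name and the host and volume names are made up for illustration, and real ADEs and switches each have their own way of exporting this information.

```python
from collections import defaultdict


def find_shared_volumes(mappings):
    """Return {volume: [hosts]} for every volume visible to more than one host."""
    hosts_by_volume = defaultdict(set)
    for host, volume in mappings:
        hosts_by_volume[volume].add(host)
    return {
        volume: sorted(hosts)
        for volume, hosts in hosts_by_volume.items()
        if len(hosts) > 1
    }


# Hypothetical masking plan: vol_b is exposed to two hosts by mistake.
plan = [("web1", "vol_a"), ("web2", "vol_b"), ("db1", "vol_b")]
conflicts = find_shared_volumes(plan)
print(conflicts)
```

An empty result means every volume has exactly one client, which is the safe default unless a clustered file system is in use.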
41SAN
- Complex SAN configs include multiple hosts and
multiple ADEs connected to active-active switches
with multiplexed connections
- Client hosts can run heterogeneous operating
systems
- (Funnily, ADE to PDE paths are sometimes not
multiplexed)
42SAN
- While this looks complex, just think of it as
removing hard disks from the machine and hosting
them outside in separate enclosures
- Each machine mounts an independent partition from
the SAN
43SAN
- Performance Considerations
- All variables we covered before
- Switch config
- Ensure that the switch / HBA / interconnect does not
become the bottleneck and the full HDD throughput can
be utilized
44Throughput Calculations
- Hard disk performance: type, RPM etc.
- Data distribution and type of data access
- RAID performance: number of drives, RAID type
- RAID card performance: cache, active-active
config etc.
- ADE to switch connection speed
- Switch to HBA connection speed
- HBA to PCI bus speed
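A first-order way to combine these factors is to treat the data path as a chain and take the slowest stage as the ceiling. The Python sketch below does exactly that; the stage names mirror the list above, but every MB/s figure is a made-up placeholder, and the model ignores caching, access patterns and RAID overhead.

```python
def find_bottleneck(stages):
    """Return (stage_name, throughput) for the slowest stage in the chain."""
    name = min(stages, key=stages.get)
    return name, stages[name]


# Placeholder throughputs in MB/s; substitute measured or spec-sheet values.
stages = {
    "drives (14 x 70 MB/s)": 14 * 70,
    "RAID controller": 800,
    "ADE to switch link": 400,
    "switch to HBA link": 200,
    "HBA to PCI bus": 1000,
}
bottleneck = find_bottleneck(stages)
print(bottleneck)
```

In this made-up case the drives could deliver far more than the switch to HBA link can carry, so faster disks would not help until that link is upgraded or multiplexed.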
45- That's all Folks
- Let's go build out our Yottabyte arrays and fill
'em up
Considerably exaggerated hyperbole given that
the combined space of all computers in the world
today (2007) doesn't add up to 1 Yottabyte (2^80
bytes). In fact the entire world's storage is
projected to hit 988 exabytes (1 exabyte = 2^60 bytes) by 2010
6th Sep 2007 - http://www.networkworld.com/newsletters/stor/2007/0903stor2.html - Nanotech
breakthrough could put entire YouTube contents on
an iPod-size device
46Part II sneak preview
- Complex SAN configurations
- iSCSI
- NAS
- Clustered Storage
- GFS
- Backups
- Storage Monitoring
- Storage Benchmarking
- Some Commercial storage vendors
47Shameless HR Propaganda Slide
- Directi builds cool Web products
- Deployed on distributed architecture
- Using terabytes of storage
- Used by millions of users
- Generating billions of pageviews and
transactions
- Spanning every possible software engineering
technology
http://careers.directi.com http://wiki.directi.com http://cosmos.directi.com
Personal Blog: http://bhavin.directi.com Mail:
bhavin.t_at_directi.com