Title: STORAGE ARCHITECTURE/ GETTING STARTED: SAN SCHOOL 101
1STORAGE ARCHITECTURE/GETTING STARTED: SAN SCHOOL 101
- Marc Farley
- President of Building Storage, Inc
- Author, Building Storage Networks, Inc.
2Agenda
- Lesson 1 Basics of SANs
- Lesson 2 The I/O path
- Lesson 3 Storage subsystems
- Lesson 4 RAID, volume management and virtualization
- Lesson 5 SAN network technology
- Lesson 6 File systems
3Basics of storage networking
Lesson 1
4Connecting
5Connecting
- Networking or bus technology
- Cables and connectors
- System adapters and network device drivers
- Network devices such as hubs, switches, routers
- Virtual networking
- Flow control
- Network security
6Storing
7Storing
- Device (target) command and control
- Drives, subsystems, device emulation
- Block storage address space manipulation (partition management)
- Mirroring
- RAID
- Striping
- Virtualization
- Concatenation
8Filing
9Filing
- Namespace presents data to end users and applications as files and directories (folders)
- Manages use of storage address spaces
- Metadata for identifying data
- file name
- owner
- dates
10Connecting, storing and filing as a complete
storage system
Connecting
11NAS and SAN analysis
- NAS is filing over a network
- SAN is storing over a network
- NAS and SAN are independent technologies
- They can be implemented independently
- They can co-exist in the same environment
- They can both operate and provide services to the same users/applications
12Protocol analysis for NAS and SAN
Filing
NAS SAN Network
Storing
Connecting
13Integrated SAN/NAS environment
NAS ServerSAN Initiator NAS Head
Storing
Filing
Connecting
Connecting
14Common wiring with NAS and SAN
NAS Head
Filing
Storing
Connecting
15The I/O path
Lesson 2
16Host hardware path components
Memory Bus
System I/O Bus
Storage Adapter (HBA)
Processor
Memory
17Host software path components
Application
Multi-Pathing
Volume Manager
Device Driver
Filing System
Cache Manager
Operating System
18Network hardware path components
- Switches, hubs, routers, bridges, gateways
- Port buffers, processors
- Backplane, bus, crossbar, mesh, memory
- Cabling
- Fiber optic
- Copper
19Network software path components
Access and Security
Virtual Networking
Flow Control
Fabric Services
Routing
20Subsystem path components
Access and Security
Network Ports
Cache
Resource Manager
Internal Bus or Network
21Device and media path components
Disk drives
Tape drives
Solid state devices
Tape Media
22The end to end I/O path picture
Storage Adapter (HBA)
Device Driver
System I/O Bus
Cache Manager
Memory Bus
Memory
Volume Manager
Filing System
Multi-Pathing
Operating System
Processor
App
Virtual Networking
Flow Control
Access and Security
Routing
Fabric Services
Network Systems
Cabling
Subsystem Network Port
Access and Security
Disk drives
Tape drives
Resource Manager
Internal Bus or Network
Cache
23Storage subsystems
Lesson 3
24Generic storage subsystem model
Controller (logic, processors)
Access control
Resource manager
Storage Resources
Network Ports
Internal Bus or Network
Cache Memory
Power
25Redundancy for high availability
- Multiple hot swappable power supplies
- Hot swappable cooling fans
- Data redundancy via RAID
- Multi-path support
- Network ports to storage resources
26Physical and virtual storage
Subsystem Controller Resource Manager (RAID,
mirroring, etc.)
Hot Spare Device
27SCSI communications architectures determine SAN operations
- SCSI communications are independent of connectivity
- SCSI initiators (HBAs) generate I/O activity
- They communicate with targets
- Targets have communications addresses
- Targets can have many storage resources
- Each resource is a single SCSI logical unit (LU) with a universally unique ID (UUID), sometimes referred to as a serial number
- An LU can be represented by multiple logical unit numbers (LUNs)
- Provisioning associates LUNs with LUs and subsystem ports
- A storage resource is not a LUN; it's an LU
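The LU versus LUN distinction can be sketched in a few lines of Python. This is only an illustration under assumed names: `LogicalUnit`, `Subsystem`, `provision` and the port labels are hypothetical, not part of any SCSI or vendor API. It shows one LU being exported under a LUN on two different subsystem ports.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogicalUnit:
    """A storage resource (LU), identified by a unique ID, not by a LUN."""
    name: str
    lu_id: str


@dataclass
class Subsystem:
    # (port, LUN) -> LU. One LU may appear under several (port, LUN) pairs.
    lun_map: dict = field(default_factory=dict)

    def provision(self, port: str, lun: int, lu: LogicalUnit) -> None:
        """Associate a LUN on a subsystem port with an LU."""
        key = (port, lun)
        if key in self.lun_map:
            raise ValueError(f"LUN {lun} is already in use on port {port}")
        self.lun_map[key] = lu

    def resolve(self, port: str, lun: int) -> LogicalUnit:
        """Return the LU that an initiator reaches through (port, LUN)."""
        return self.lun_map[(port, lun)]

    def paths_to(self, lu: LogicalUnit) -> list:
        """List every (port, LUN) pair that represents the given LU."""
        return sorted(k for k, v in self.lun_map.items() if v == lu)


array = Subsystem()
vol_a = LogicalUnit("vol_a", str(uuid.uuid4()))
array.provision("S1", 0, vol_a)   # same LU exported on two ports
array.provision("S4", 0, vol_a)
```

Both `(S1, 0)` and `(S4, 0)` resolve to the same LU, which is what makes multipathing to a single storage resource possible.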
28Provisioning storage
LUN 0
LUN 1
Port S1
LUN 1
LUN 2
Port S2
LUN 2
Port S3
LUN 3
LUN 3
Port S4
LUN 0
Controller functions
29Multipathing
LUN X
MP SW
LUN X
30Caching
Write Caches
1. Write Through (to disk)
2. Write Back (from cache)
Read Caches
1. Recently Used
2. Read Ahead
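The two write cache policies can be contrasted with a minimal sketch. The `WriteCache` class and its dictionaries are assumptions made for illustration; a real subsystem cache also handles eviction, mirroring of cache memory and power loss, none of which is modeled here.

```python
class WriteCache:
    """Toy model of a subsystem write cache (hypothetical, for illustration)."""

    def __init__(self, write_back: bool):
        self.write_back = write_back
        self.cache = {}      # block address -> data held in cache memory
        self.disk = {}       # block address -> data on stable storage
        self.dirty = set()   # addresses cached but not yet written to disk

    def write(self, addr, data):
        self.cache[addr] = data
        if self.write_back:
            # Write back: acknowledge from cache, destage to disk later.
            self.dirty.add(addr)
        else:
            # Write through: data reaches disk before the write is acknowledged.
            self.disk[addr] = data

    def flush(self):
        """Destage all dirty blocks from cache to disk."""
        for addr in sorted(self.dirty):
            self.disk[addr] = self.cache[addr]
        self.dirty.clear()


wt = WriteCache(write_back=False)
wt.write(5, b"x")        # already on disk

wb = WriteCache(write_back=True)
wb.write(5, b"x")        # only in cache until flush() runs
```

Write back gives lower write latency; the cost is that dirty data exists only in cache until it is flushed.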
31Tape subsystems
Tape Drive
Tape Drive
Tape Subsystem Controller
Tape Drive
Tape Drive
Robot
Tape Slots
32Subsystem management
Now with SMI-S
Management station browser-based network mgmt software
Ethernet/TCP/IP
Out-of-band management port
In-band management
Exported Storage Resource
Storage Subsystem
33Data redundancy
- Duplication: 2n
- Parity: n+1
- Difference: d(x) = f(x) - f(x-1)
34Duplication redundancy with mirroring
Host-based
Within a subsystem
I/O Path A
Mirroring Operator
I/O Path
I/O Path B
- Terminate I/O and regenerate new I/Os
- Error recovery/notification
35Duplication redundancy with remote copy
Host
Uni-directional (writes only)
A
B
A
A
A
36Subsystem Snapshot
Point-in-time snapshot
Host
A
B
A
A
A
C
37RAID, volume management and virtualization
Lesson 4
38RAID parity redundancy
- Duplication: 2n
- Parity: n+1
- Difference: d(x) = f(x) - f(x-1)
39History of RAID
- Late 1980s R&D project at UC Berkeley
- David Patterson
- Garth Gibson (independent)
- Redundant array of inexpensive disks
- Striping without redundancy (RAID 0) was not defined
- Original goals were to reduce the cost and increase the capacity of large disk storage
40Benefits of RAID
- Capacity scaling
- Combine multiple address spaces as a single virtual address space
- Performance through parallelism
- Spread I/Os over multiple disk spindles
- Reliability/availability with redundancy
- Disk mirroring (striping to 2 disks)
- Parity RAID (striping to more than 2 disks)
41Capacity scaling
Storage extent 1
Storage extent 2
Storage extent 3
Storage extent 4
Combined extents 1 - 12
Storage extent 5
Storage extent 6
Storage extent 7
Storage extent 8
Storage extent 9
Storage extent 10
Storage extent 11
Storage extent 12
RAID Controller (resource manager)
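Combining extents into one address space amounts to a lookup from a virtual block address to an extent and an offset inside it. The sketch below assumes simple concatenation of 12 equally sized extents; the function names and the 100-block extent size are made up for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def build_map(extent_sizes):
    """Return the cumulative end address of each extent in the combined space."""
    return list(accumulate(extent_sizes))


def locate(ends, vblock):
    """Map a virtual block address to (extent number, offset within extent)."""
    if not 0 <= vblock < ends[-1]:
        raise IndexError("virtual block address outside the combined extents")
    idx = bisect_right(ends, vblock)          # first extent ending after vblock
    start = ends[idx - 1] if idx else 0       # where that extent begins
    return idx + 1, vblock - start            # extents are numbered from 1


# Twelve extents of 100 blocks each, presented as one 1200-block address space.
ends = build_map([100] * 12)
```

Extents do not have to be the same size for this lookup to work, since only the cumulative end addresses are consulted.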
42Performance
RAID controller (microsecond performance)
Disk drive
Disk drive
Disk drive
Disk drive
Disk drive
Disk drive
Disk drives (millisecond performance) from rotational latency and seek time
43Parity redundancy
- RAID arrays use XOR for calculating parity
Operand 1   Operand 2   XOR Result
False       False       False
False       True        True
True        False       True
True        True        False
- XOR is the inverse of itself
- Apply XOR in the table above from right to left
- Apply XOR to any two columns to get the third
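The "any two columns give the third" property can be checked mechanically. This short Python sketch builds the truth table and verifies the property for every row; the variable names are illustrative only.

```python
from itertools import product

# Each row is (operand 1, operand 2, XOR result).
table = [(a, b, a ^ b) for a, b in product([False, True], repeat=2)]

for a, b, result in table:
    # XOR is its own inverse: combining the result with either operand
    # recovers the other operand.
    assert result ^ b == a
    assert result ^ a == b
```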
44Reduced mode operations
- When a member is missing, data that is accessed must be reconstructed with XOR
- An array that is reconstructing data is said to be operating in reduced mode
- System performance during reduced mode operations can be significantly reduced
XOR M1 M2 M3 P
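A reduced mode read can be sketched as follows, assuming a stripe of three data members (M1, M2, M3) plus parity P as in the figure. The `xor_blocks` helper and the two-byte sample blocks are hypothetical; real arrays work on much larger strips.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Normal operation: parity is the XOR of all data members.
m1, m2, m3 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
p = xor_blocks(m1, m2, m3)

# Reduced mode: M2 is missing, so its data is reconstructed on access
# from the surviving members and the parity.
rebuilt_m2 = xor_blocks(m1, m3, p)
```

Every access to the missing member costs reads from all surviving members, which is why performance drops in reduced mode.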
45RAID Parity Rebuild
Parity rebuild
- The process of recreating data on a replacement member is called a parity rebuild
- Parity rebuilds are often scheduled for non-production hours because performance disruptions can be so severe
XOR M1 M2 M3 P
46Hybrid RAID 0+1
RAID 0+1, 10
RAID Controller
Disk drive
Disk drive
Disk drive
Disk drive
Disk drive
Disk drive
Disk drive
Disk drive
Disk drive
Disk drive
1
2
3
4
5
Mirrored pairs of striped members
47Volume management and virtualization
- Storing level functions
- Provide RAID-like functionality in host systems and SAN network systems
- Aggregation of storage resources for
- scalability
- availability
- cost / efficiency
- manageability
48OS kernel
File system
Volume management
Volume Manager
Volume Manager
- RAID and partition management
- Device driver layer between the kernel and
storage I/O drivers
HBA drivers
HBAs
49Server system
Volume managers can use all available connections
and resources and can span multiple SANs as well
as SCSI and SAN resources
Virtual Storage
SCSI disk resource
SCSI HBA
SCSI Bus
SAN disk resources
HBA drivers
SAN cable
SAN Switch
SAN HBA
50SAN storage virtualization
- RAID and partition management in SAN systems
- Two architectures
- In-band virtualization (synchronous)
- Out-of-band virtualization (asynchronous)
51In-band virtualization
Exported virtual storage
I/O Path
System(s), switch or router
Disk subsystems
52Out-of-band virtualization
- Distributed volume management
- Virtualization agents are managed from a central
system in the SAN
Virtualization agents
Disk subsystems
53SAN networks
Lesson 5
54Fibre channel
- The first major SAN networking technology
- Very low latency
- High reliability
- Fiber optic cables
- Copper cables
- Extended distance
- 1, 2 or 4 Gb transmission speeds
- Strongly typed
55Fibre channel
- A Fibre Channel fabric presents a consistent interface and set of services across all switches in a network
- Host and subsystems all 'see' the same resources
Storage Subsystem
Storage Subsystem
Storage Subsystem
56Fibre channel port definitions
- FC ports are defined by their network role
- N-ports: end node ports connecting to fabrics
- L-ports: end node ports connecting to loops
- NL-ports: end node ports connecting to fabrics or loops
- F-ports: switch ports connecting to N-ports
- FL-ports: switch ports connecting to N-ports or NL-ports in a loop
- E-ports: switch ports connecting to other switch ports
- G-ports: generic switch ports that can be F, FL or E ports
57Ethernet / TCP / IP SAN technologies
- Leveraging the installed base of Ethernet and TCP/IP networks
- iSCSI: native SAN over IP
- FC/IP: FC SAN extensions over IP
58iSCSI
- Native storage I/O over TCP/IP
- New industry standard
- Locally over Gigabit Ethernet
- Remotely over ATM, SONET, 10Gb Ethernet
iSCSI
TCP
IP
MAC
PHY
59iSCSI equipment
- Storage NICs (HBAs)
- SCSI drivers
- Cables
- Copper and fiber
- Network systems
- Switches/routers
- Firewalls
60FC/IP
- Extending FC SANs over TCP/IP networks
- FCIP gateways operate as virtual E-port connections
- FCIP creates a single fabric where all resources appear to be local
TCP/IP LAN, MAN or WAN
FCIP Gateway
FCIP Gateway
E-port
E-port
61SAN switching fabrics
- High-end SAN switches have latencies of 1 - 3 µsec
- Transaction processing requires lowest latency
- Most other applications do not
- Transaction processing requires non-blocking switches
- No internal delays preventing data transfers
62Switches and directors
- Switches
- 8 to 48 ports
- Redundant power supplies
- Single system supervisor
- Directors
- 64 ports
- HA redundancy
- Dual system supervisor
- Live SW upgrades
63SAN topologies
- Star
- Simplest
- single hop
- Dual star
- Simple network redundancy
- Single hop
- Independent or integrated fabric(s)
64SAN topologies
- N-wide star
- Scalable
- Single hop
- Independent or integrated fabric(s)
- Core - edge
- Scalable
- 1 to 3 hops
- integrated fabric
65SAN topologies
- Ring
- Scalable
- integrated fabric
- 1 to N/2 hops
- Ring + Star
- Scalable
- integrated fabric
- 1 to 3 hops
66Lesson 6
File systems
67File system functions
68Storing
Filing
69Think of the storage address space as a sequence
of storage locations (a flat address space)
70Superblocks
- Superblocks are known addresses used to find file
system roots (and mount the file system)
71Filing and Scaling
- File systems must have a known and dependable address space
- The fine print in scalability: how does the filing function know about the new storing address space?
Storing
? ? ?
Storing
Filing
72SCSI's role in storage networking
Lesson 2
- Legacy open systems server storage
- Physical parallel bus
- Independent master/slave protocol
- Storing in SANs
- Compatibility requirements with system software force the use of the SCSI protocol
- Storing and wiring in NAS
- SCSI and ATA (IDE) used with NAS
73Parallel SCSI bus technologies
A bus, with address lines and data lines
- 8-bit and 16-bit (narrow and wide)
- Single ended, differential, low voltage differential (LVD) electronics
- 5, 10, 20, 40, 80, 160 and 320 MB/sec
- Ultra320 SCSI is 320 MB/sec
- Distances vary from 3 to 25 meters
- Current LVD SCSI is 12 meters
74SCSI command protocol
- Master/slave relationships
- host master, device slave
- Independent of physical connectivity
- CDBs: Command Descriptor Blocks
- Command format
- Used for both device operations and data transfers
- Serial SCSI standard created and implemented as
- Fibre Channel Protocol (FCP)
- iSCSI
75SCSI addressing model
Host system
LUN
Target storage subsystem
16 bus addresses with LUN sub-addressing
76SCSI daisy chain connectivity
Target devices or subsystems
Host system
Storage interface
Storage interface
Storage interface
Storage interface
In / Out
In / Out
In / Out
In /
77SCSI arbitration
Host system ID 7
Target IDs 6 5 4 3 2 1 0
The highest number address 'wins' arbitration to
access the bus next
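The arbitration rule on this slide reduces to picking the highest contending ID. The sketch below assumes the narrow, 8-ID bus shown above (IDs 0 to 7); the function name is hypothetical, and the different priority order used on 16-ID wide buses is not modeled.

```python
def arbitrate(contending_ids):
    """Return the SCSI ID that wins the bus on a narrow (8-ID) bus."""
    ids = set(contending_ids)
    if not ids:
        return None                      # nobody is asking for the bus
    if not ids <= set(range(8)):
        raise ValueError("narrow bus IDs must be in the range 0 to 7")
    return max(ids)                      # highest address wins


winner = arbitrate({2, 5, 7})            # the host adapter at ID 7 wins
```

This is why host adapters are conventionally given ID 7: the initiator always wins arbitration against its targets.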
78SCSI resource discovery
SCSI inquiry CDB: "tell me your resources"
Host system
LUN
Target storage subsystem
There is no domain server concept in SCSI
79SCSI performance capabilities
write
- Overlapped I/O
- Tagged command queuing (Reshuffled I/Os)
status
read
80Parallel SCSI bus shortcomings
- Bus length
- servers and storage tied together
- Single initiator
- access to data depends on server
- A standard full of variations
- change is the only constant
81Disk drives
Lesson 4
- Disk drive components
- Areal density
- Rotational latency
- Seek time
- Buffer memory
- Dual porting
82Disk drives
- Complex electro-mechanical devices
- Media
- Motor and speed control logic
- Bearings and suspension
- Actuator (arm)
- Read/write heads
- Read/write channels
- I/O controller (external interface and internal operations)
- Buffer memory
- Power
83Disk drive areal density
- Amount of signal per unit area of media
- Keeps pace with Moore's law
- Areal density doubles approximately every 18 months
- Increasingly smaller magnetic particles
- Continued refinement of head technology
- Electro-magnetic physics research
84Rotational latency
- Time for data on media to rotate underneath heads
- Faster rotational speed = lower rotational latency
- 2 to 10 milliseconds are common
- Application level I/O operations can generate multiple disk accesses, each impacted by rotational latency
Memory       nanoseconds    10^-9
SAN switch   microseconds   10^-6
Disk drive   milliseconds   10^-3
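Rotational latency follows directly from spindle speed: one rotation takes 60,000 / RPM milliseconds, and on average the data is half a rotation away. The helper names below are assumptions for this worked example.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000 / rpm


def avg_rotational_latency_ms(rpm):
    """Average wait for data to arrive under the head: half a rotation."""
    return rotation_ms(rpm) / 2


for rpm in (5_400, 7_200, 10_000, 15_000):
    print(f"{rpm:>6} RPM: average {avg_rotational_latency_ms(rpm):.2f} ms, "
          f"worst case {rotation_ms(rpm):.2f} ms")
```

At 15,000 RPM the average is 2 ms, and at 5,400 RPM a full rotation takes about 11 ms, which brackets the 2 to 10 millisecond range quoted above.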
85Rotational latency and filing systems
- Filing systems determine contiguous data lengths
- (file systems and databases)
- Block size definitions
- 512
- 2k
- 4k
- 16k
- 512k
- 2M
Disk media
86Seek Time
- Time needed to position the actuator over the track
- Equivalent to rotational latency in time
Disk head
Disk actuator
Disk media
87Disk drive buffer memory
- FIFO memory for data transfers
- not cache
- Overcome mechanical latencies with faster memory storage
- Enables overlapped I/Os to multiple drives
- Performance metrics
- Burst transfer rate: transfer in/out of buffer memory
- Sustained transfer rate: transfer with track changes
88Dual-ported disk drives
- Redundant connectivity interfaces
- Only FC to date
Controller A
Controller B
89Forms of data redundancy
- Duplication: 2n
- Parity: n+1
- Difference: d(x) = f(x) - f(x-1)
90Business Continuity
- 24 x 7 data access is the goal
- 5 nines through planning and luck
- There are many potential threats
- People
- Power
- Natural disasters
- Fires
- Redundancy is the key
- Multiple techniques cover different threats
91Backup and recovery
Lesson 8
- Removable media, usually tape
- removable redundancy
- Backup systems
- Backup operations
- Media rotation
- Backup metadata
- Backup challenges
92Forms of data redundancy in backup
- Duplication: 2n
- Parity: n+1
- Difference: d(x) = f(x) - f(x-1)
93Backup and recovery tape media
- Magnetic 'ribbon'
- multiple layers of backing, adhesive, magnetic particles and lube/coating
- corrodes and cracks
- requires near-perfect conditions for long-term storage
- Sequential access
- slow load and seek times
- reasonable transfer rates
- can hold multiple versions of files
94Tape drives
- Two primary geometries
- Longitudinal tracking
- Helical tracking
- Highly differentiated
- Speeds (3MB/s to 30MB/s)
- Capacities (20GB to 160GB)
- Physical formats (layouts)
- Compatibility is a constant issue
- Mostly parallel SCSI
95Tape drive formats
- Two primary geometries
- Longitudinal tracking
- Helical tracking
- Highly differentiated
- Speeds (3MB/s to 30MB/s)
- Capacities (20GB to 160GB)
- Physical formats (layouts)
- 4mm, 8mm, ¼ inch, DLT, LTO, 19mm
- Cartridge construction, tape lengths
- Compatibility is a constant issue
- Mostly parallel SCSI
96Longitudinal tracking
- Parallel data tracks written lengthwise on tape
by a 'stack' of heads
Data tracks
Tape heads
Technologies DLT, SDLT, LTO, QIC
97Helical tracking
- Single data tracks written diagonally across tape
by a rotating cylindrical head assembly
Tape head
Data tracks
Tape
Technologies 4mm, 8mm, 19mm
98Tape subsystems
- Tape libraries and autoloaders
Tape Subsystem Controller
Tapes
Robot
Tape drive
Tape drive
Tape drive
99Generic backup system components
- Tape subsystems
- I/O bus/network subsystem
- Work scheduler manager
- Data mover
- Metadata (database or catalog)
- Media manager (rotation scheduler)
- File system and database backup agents
100Generic Network Backup System
File server
Web server
DB server
APP server
Backup agent
Backup agent
Backup agent
Backup agent
Ethernet network
Work scheduler, data mover, metadata system, media manager
SCSI bus
Tape drive(s) or Tape subsystem
Backup server
101Backup operations
- Full (all data)
- Longest backup operations
- Usually done over/on weekends
- Easiest recovery with 1 tape set
- Incremental (changed data)
- Shortest backup operation
- Often done on weekdays
- Most involved recovery
- Differential (accumulated changed data)
- Compromise for easier backups and recovery
- Max 2 tape set restore
102Backup operations and data redundancy
- Full
- Duplication redundancy
- One backup for complete redundancy
- Incremental
- Difference redundancy
- Multiple backups for complete redundancy
- Differential
- Difference redundancy
- Two backups for complete redundancy
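The number of backups needed for a complete restore under each scheme can be worked out with a small sketch. It assumes a schedule that uses only one scheme (incremental or differential) after each full backup; the function name and the 'full' / 'incr' / 'diff' labels are hypothetical.

```python
def restore_set(history):
    """Return indexes of the backups needed to restore to the latest state.

    history is a chronological list of 'full', 'incr' or 'diff' entries.
    """
    # Position of the most recent full backup.
    last_full = len(history) - 1 - history[::-1].index("full")
    later = list(range(last_full + 1, len(history)))
    kinds = {history[i] for i in later}
    if kinds <= {"incr"}:
        # Incremental: the full backup plus every incremental since.
        return [last_full] + later
    if kinds == {"diff"}:
        # Differential: the full backup plus only the newest differential.
        return [last_full, later[-1]]
    raise ValueError("mixed schedules are outside the scope of this sketch")


weekly_incr = ["full", "incr", "incr", "incr"]
weekly_diff = ["full", "diff", "diff", "diff"]
```

With these schedules, the incremental restore needs four backups while the differential restore needs two, matching the trade-off listed above.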
103Media rotations
- Change of tapes with common names and purposes
- Tape sets - not individual tapes
- Backup job schedules anticipate certain tapes
- Monday, Tuesday, Wednesday, etc.
- Even days, odd days
- 1st Friday, 2nd Friday, etc.
- January, February, March, etc.
- 1st Qtr, 2nd Qtr, etc.
104Media rotation problems
- What happens when wrong tapes are used by mistake?
- Say you use last Friday's tape on the next Tuesday?
- Data you might need to restore sometime can be overwritten!
- Backup system logic may have to choose between
- Not completing backup (restore will fail)
- Deleting older backup files (restore might fail)
105Backup metadata
- A database for locating data on tape
- Version, create/modify date, size
- Date/time of backup job
- Tape names, backup job ID on tape
- Owner
- Delete records (don't restore deleted data!)
- Transaction processing during backup
- Many small files create heavy processor loads
- This is where backup fails to scale
- Backup databases need to be pruned
- Performance and capacity problems
106Traditional backup challenges
- Completing backups within the backup window
- Backup window: time allotted for daily backups
- Starts after daily processing finishes
- Ends before next day's processing begins
- Media management and administration
- Thousands of tapes to manage
- Audit requirements are increasing
- On/offsite movement for disaster protection
- Balancing backup time against restore complexity
107LAN-free backup in SANs
Tape drives or tape subsystem
SAN
SAN switch
Backup software
Backup software
Backup software
Backup software
File server
Web server
DB server
APP server
Ethernet client network
LAN
108Advantages of LAN-free backup
- Consolidated resources (especially media)
- Centralized administration
- Performance
- Offloads LAN traffic
- Platform optimization
SAN
109Path management
- Dual pathing
- Zoning
- LUN masking
- Reserve / release
- Routing
- Virtual networking
110Dual pathing
- System software for redundant paths
- Path management is a super-driver process
- Redirects I/O traffic over a different path to the same storage resource
- Typically invoked after SCSI timeout errors
- Active / active or active / passive
- Static load balancing only
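The failover behavior described here can be sketched as a thin layer that retries an I/O on the next path after a timeout. This is a simplified active/passive model under assumed names (`MultipathDevice`, `PathTimeout`, `path_a`, `path_b`); it is not the interface of any real multipathing driver.

```python
class PathTimeout(Exception):
    """Stands in for a SCSI timeout error on one path."""


class MultipathDevice:
    """Presents one storage resource reachable over several paths."""

    def __init__(self, paths):
        self.paths = list(paths)   # callables that carry an I/O to the LU
        self.active = 0            # index of the path currently in use

    def submit(self, io):
        # Try each path at most once, starting with the active one.
        for _ in range(len(self.paths)):
            try:
                return self.paths[self.active](io)
            except PathTimeout:
                # Redirect traffic to the next path to the same resource.
                self.active = (self.active + 1) % len(self.paths)
        raise PathTimeout("no working path to the storage resource")


def path_a(io):
    raise PathTimeout("path A timed out")   # simulated failed path


def path_b(io):
    return f"B:{io}"                         # simulated healthy path


lun_x = MultipathDevice([path_a, path_b])
result = lun_x.submit("read block 42")
```

After the first timeout the device stays on path B, so later I/Os do not pay the timeout penalty again.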
111Zoning 1
- I/O segregation
- Switch function that restricts forwarding
- Zone membership is based on port or address
Address zoning
Port zoning
- Zone 1
- Addr 1
- Addr 2
- Zone 2
- Addr 3
- Addr 4
- Zone 3
- Addr 5
- Addr 6
112Zoning 2
- Address zoning allows nodes to belong to more than one zone
- For example, tape subsystems can belong to all zones
- Zone 1
- Addr 1 (server A)
- Addr 2 (disk subsystem port target address A)
- Addr 7 (tape subsystem port target address A)
- Zone 2
- Addr 3 (server B)
- Addr 4 (disk subsystem port target address B)
- Addr 7 (tape subsystem port target address A)
- Zone 3
- Addr 5 (server C)
- Addr 6 (disk subsystem port target address C)
- Addr 7 (tape subsystem port target address A)
[Figure: three overlapping zones. Zone 1 holds Addr 1, 2 and 7; zone 2 holds Addr 3, 4 and 7; zone 3 holds Addr 5, 6 and 7]
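The zone memberships listed on this slide can be expressed as sets, with the switch forwarding traffic only between addresses that share a zone. The dictionary layout and the `can_communicate` helper are assumptions for illustration, not a switch configuration format.

```python
# Zone memberships from the example above; Addr 7 is the shared tape subsystem.
zones = {
    "zone1": {1, 2, 7},   # server A, disk port A, tape port A
    "zone2": {3, 4, 7},   # server B, disk port B, tape port A
    "zone3": {5, 6, 7},   # server C, disk port C, tape port A
}


def can_communicate(zones, addr_a, addr_b):
    """True if the two addresses share at least one zone."""
    return any(addr_a in members and addr_b in members
               for members in zones.values())


# Server A reaches its own disk and the tape subsystem, but not disk B.
a_to_disk_a = can_communicate(zones, 1, 2)
a_to_disk_b = can_communicate(zones, 1, 4)
a_to_tape = can_communicate(zones, 1, 7)
```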
113Zoning 3
- Zones (or zone memberships) can be 'swapped' to reflect different operating environments
Changing zones
114LUN masking
- Restricts subsystem access to defined servers
- Target or LUN level masking
- Non-response to SCSI inquiry CDB
- Can be used with zoning for multi-level control
No response to SCSI inquiry
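LUN masking can be pictured as a table the subsystem consults before answering an inquiry. The table layout, server names and `respond_to_inquiry` function below are hypothetical; they only illustrate the "no response to servers that are not defined" behavior.

```python
# (subsystem port, LUN) -> initiators allowed to see it (names are made up)
masking_table = {
    ("S1", 0): {"server_a"},
    ("S1", 1): {"server_a", "server_b"},
}


def respond_to_inquiry(initiator, port, lun):
    """Return True if the subsystem answers this initiator's SCSI inquiry."""
    allowed = masking_table.get((port, lun), set())
    return initiator in allowed


# server_b is masked from LUN 0, so its inquiry gets no response.
visible_to_a = respond_to_inquiry("server_a", "S1", 0)
visible_to_b = respond_to_inquiry("server_b", "S1", 0)
```

Because masking is enforced in the subsystem and zoning in the switch, the two can be layered for multi-level control.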
115Reserve / Release
- SCSI function
- Typically implemented in SCSI/SAN storage routers
- Used to reserve tape resources during backups
- tape drives
- robotics
1st access
2nd access blocked
Reserved
Storage router
116Routing
- Path decisions made by switches
- Large TCP/IP networks require routing in switches instead of in end nodes
- Looping is avoided by spanning tree algorithms that ensure a single path
- FSPF (Fabric Shortest Path First), modeled on OSPF, is the path selection technology for Fibre Channel
- Routing is not HA failover technology
117Name Space
- The Name Space is the representation of data to end users and applications
- Identification and searching
- Organizational structure
- Directories or folders in file systems
- Rows and columns in databases
- Associations of data
- Database indexing
- File system linking
118Metadata and Access Control (Security)
- Metadata is the description of data
- Intrinsic information and accounting information
- Access control determines how (or if) a user or application can use the data
- for example, read-only
- Access control is often incorporated with
metadata but can be a separate function
119- Data has attributes that describe it
- Storage is managed based on data attributes
- Activity info
- Owner info
- Capacity info
- Whatever info
- Data can have security associated with it.
- Data can be erased, copied, renamed, etc.
120Locking
- Managing multiple users or applications with concurrent access to data
- Locking has been done in multi-user systems for decades
- Locking in NAS has been a central issue
- NFS advisory locks provide no guarantees
- CIFS oplocks are enforced
- Lock persistence
121File systems organize data in blocks
- Blocks are SCSI's address abstraction layer
- Filing functions use block addresses to communicate with storing level entities
- Filing systems manage the utilization of block address spaces (space management)
- Block address structures typically are uniform
- Block address boundaries are static for efficient and error-free space management
122Journaling
- File system structure has to be verified when mounting (FSCHECK)
- FSCHECK can take hours on large file systems
- Journaling file systems keep a log of file system updates
- Like a database log file, journal updates can be checked against actual structures
- Incomplete updates can be rolled forward or backward to maintain system integrity
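Journal recovery can be sketched as a replay pass over the log at mount time. The record layout (`txn`, `op`, `key`, `value`) and the `replay` function are assumptions made for this example, modeling a simple redo-only journal rather than any specific file system's on-disk format.

```python
def replay(journal, structures):
    """Bring file system structures up to date from the journal.

    Updates belonging to committed transactions are rolled forward;
    updates from transactions that never committed are ignored, which
    is equivalent to rolling them back in this redo-only model.
    """
    committed = {rec["txn"] for rec in journal if rec["op"] == "commit"}
    for rec in journal:
        if rec["op"] == "update" and rec["txn"] in committed:
            structures[rec["key"]] = rec["value"]
    return structures


journal = [
    {"txn": 1, "op": "update", "key": "inode7.size", "value": 4096},
    {"txn": 1, "op": "commit"},
    {"txn": 2, "op": "update", "key": "inode9.size", "value": 512},
    # System failed before transaction 2 committed.
]
recovered = replay(journal, {"inode7.size": 0, "inode9.size": 0})
```

Only the journal has to be scanned, not the whole file system, which is why mounting after a failure is so much faster than a full structure check.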
123V/VM and Filing
- Filing is a filing function
- Virtualization / volume management (V/VM) is a storing function
- V/VM manipulates block addresses and creates real and virtual address spaces
- Filing manages the placement of data in the address spaces exported by virtualization