Mirrored Storage - PowerPoint PPT Presentation

About This Presentation
Title:

Mirrored Storage

Description:

Ten binary arrays named bitmaps. Buddy System Algorithm ... The map field points to a bitmap. Each bit of the bitmap of the kth entry of the free_area array ... – PowerPoint PPT presentation

Slides: 48
Provided by: dasanSe

Transcript and Presenter's Notes

Title: Mirrored Storage


1
Mirrored Storage
  • Mirrored storage replicates data over two or more
    plexes of the same size
  • Mirrored storage with two mirrors corresponds to
    RAID 1
  • Bandwidth and I/O rate of mirrored storage
    depends on the direction of data flow
  • Performance for read operations is additive:
    mirrored storage that uses n plexes gives n times
    the bandwidth and I/O rate of a single plex for
    read requests
  • Write bandwidth and I/O rate are a bit lower than
    those of a single plex
  • If write requests cannot be issued in parallel,
    but happen one after the other, a write takes
    about n times as long as on a single plex

2
Mirrored Storage
  • Algorithms when servicing a read request
  • Round robin
  • Preferred Mirror
  • Least Busy
  • The forte of the mirrored storage is increased
    reliability, whereas striped or concatenated
    storage gives decreased reliability
  • In case a disk fails, it can be hot-swapped
    (manually replaced on-line with a new working
    disk). Alternatively, a hot standby disk can be
    deployed
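
The read policies listed above can be sketched in a few lines of C. This is only an illustration: the struct plex type, its pending counter, and the function names are assumptions made for the example, not part of any particular volume manager. A preferred-mirror policy would simply return a configured index, so it is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-plex state: I/O requests currently outstanding. */
struct plex {
    unsigned pending;
};

/* Round robin: hand each new read to the next plex in turn. */
size_t pick_round_robin(size_t nplexes, size_t *cursor)
{
    assert(nplexes > 0);
    size_t chosen = *cursor % nplexes;
    *cursor = (chosen + 1) % nplexes;
    return chosen;
}

/* Least busy: choose the plex with the fewest outstanding requests
 * (ties go to the lowest index). */
size_t pick_least_busy(const struct plex *plexes, size_t nplexes)
{
    assert(nplexes > 0);
    size_t best = 0;
    for (size_t i = 1; i < nplexes; i++)
        if (plexes[i].pending < plexes[best].pending)
            best = i;
    return best;
}
```

With three plexes and a cursor starting at 0, successive calls to pick_round_robin() return 0, 1, 2, 0, and so on.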

3
RAID Storage
  • In RAID 3, a stripe spans n subdisks; each stripe
    stores data on n-1 subdisks and parity on the
    last
  • RAID 5 differs from RAID 3 in that the parity is
    distributed over different subdisks for different
    stripes, and a stripe can be read or written
    partially

5
RAID Storage
  • RAID 3 storage capacity equals n-1 subdisks,
    since one subdisk capacity is used up for storing
    parity data
  • Bandwidth and I/O rate of an n-way RAID 3 storage
    is equivalent to (n-1)-way striped storage
  • The minimum unit of I/O for RAID 3 is equal to
    one stripe.
  • If a write request spans one stripe exactly,
    performance is least impacted. The only overhead
    is computing contents of one parity block and
    writing it, thus n I/Os are required instead of
    n-1 I/Os for an equivalent (n-1)-way striped
    storage
  • A small write request must be handled as a
    read-modify-write sequence for the whole stripe,
    which makes it 2n I/Os
  • RAID 3 storage provides protection against one
    disk failure
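
Protection against one disk failure follows from parity being the XOR of the data blocks in a stripe. A minimal sketch, assuming byte-wise XOR parity and a hypothetical helper named xor_except (not taken from any RAID implementation): the same routine computes the parity block and rebuilds the block of a failed subdisk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR the blocks of every subdisk except `skip` into `out`.
 * With skip set to the parity subdisk this computes the parity block;
 * with skip set to a failed subdisk it rebuilds that subdisk's block
 * from the surviving ones. */
void xor_except(uint8_t *blocks[], size_t nsubdisks, size_t skip,
                size_t block_len, uint8_t *out)
{
    assert(skip < nsubdisks);
    for (size_t b = 0; b < block_len; b++) {
        uint8_t acc = 0;
        for (size_t d = 0; d < nsubdisks; d++)
            if (d != skip)
                acc ^= blocks[d][b];
        out[b] = acc;
    }
}
```

For a 4-way stripe, calling xor_except() with skip = 3 fills in the parity block, and calling it later with skip = 1 reproduces the contents of subdisk 1 without reading it.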

6
Chained Declustering
Server0  Server1  Server2  Server3
D0       D1       D2       D3
D3       D0       D1       D2
D4       D5       D6       D7
D7       D4       D5       D6

7
Chained Declustering Server Failure
[Diagram: the four-server layout of the previous slide with one server failed]
Server failed, but all data is still available

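
In the layout on the Chained Declustering slide, each block Dk appears once on server k mod 4 and once on the next server in the chain. The sketch below encodes that placement rule and shows why a single server failure loses no data. The function names are hypothetical and the rule is inferred from the slide, not from a specific system.

```c
#include <assert.h>

/* First copy of block k lives on server k mod n. */
unsigned primary_server(unsigned block, unsigned nservers)
{
    assert(nservers >= 2);
    return block % nservers;
}

/* Second copy lives on the next server in the chain. */
unsigned backup_server(unsigned block, unsigned nservers)
{
    return (primary_server(block, nservers) + 1) % nservers;
}

/* A server that can still serve `block` while `failed` is down. */
unsigned surviving_server(unsigned block, unsigned nservers, unsigned failed)
{
    unsigned p = primary_server(block, nservers);
    return p != failed ? p : backup_server(block, nservers);
}
```

With four servers and Server1 down, block D1 is served from Server2 and block D0 from Server0, so every block D0 to D7 stays reachable.
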
8
RAID Storage
  • RAID 5 storage capacity equals n-1 subdisks, since
    one subdisk's worth of capacity is used up for
    storing parity data
  • Bandwidth and I/O rate of an n-way RAID 5 storage
    is equivalent to n-way striped storage, because
    the parity blocks are distributed over all disks
  • RAID 5 works the same as RAID 3 when write
    requests span one or more full stripes. However,
    for a small write RAID 5 requires only four disk
    I/Os
  • Read 1: old data
  • Read 2: old parity
  • Compute new parity: XOR sum of old data, old
    parity, and new data
  • Write 3: new data
  • Write 4: new parity
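
The parity computation in the four-I/O sequence can be written as a short C sketch. The function name raid5_new_parity is an assumption for illustration. The point it shows is that the other data blocks of the stripe never have to be read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* new parity = old data XOR old parity XOR new data, byte by byte. */
void raid5_new_parity(const uint8_t *old_data, const uint8_t *old_parity,
                      const uint8_t *new_data, uint8_t *new_parity,
                      size_t len)
{
    assert(old_data && old_parity && new_data && new_parity);
    for (size_t i = 0; i < len; i++)
        new_parity[i] = old_data[i] ^ old_parity[i] ^ new_data[i];
}
```

The result equals the parity obtained by XOR-ing all data blocks of the updated stripe, because XOR-ing the old data into the old parity cancels its contribution.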

9
RAID Storage
  • As is the case with mirrored storage, RAID
    storage is also vulnerable with respect to host
    computer crashes while write requests are in
    flight to disks
  • A single logical request can result in two to n
    physical write requests; parity is always updated
  • If some writes succeed and some do not, the
    stripe becomes inconsistent

10
Compound Storage
  • Mirrored Stripes (RAID 0+1)
  • Two striped storage plexes of equal capacity can
    be mirrored to form a single volume. Each plex
    would be resident on a separate disk array
  • Striped Mirrors (RAID 1+0, also called RAID 10)
  • Multiple plexes, each containing a pair of
    mirrored subdisks, can be aggregated using
    striping to form a single volume
  • Each plex provides reliability.

11
Compound Storage
  • In both cases, storage cost is doubled due to
    two-way mirroring
  • Mirrored-striped storage
  • If a disk fails in mirrored stripe storage, one
    whole plex is declared failed.
  • After the failure is repaired, the whole plex
    must be rebuilt by copying from the good plex
  • Storage is vulnerable to a second disk failure in
    the good plex until the mirror is rebuilt
  • Striped-mirror storage
  • If a disk fails in striped-mirror storage, no
    plex is failed.
  • After the disk is repaired, only data of that one
    disk needs to be rebuilt from the other disk
  • Storage is vulnerable to a second disk failure
    only if it hits the mirror partner of the failed
    disk
  • Thus, striped mirrors are preferable over
    mirrored stripes

14
Dynamic multipathing
  • If the I/O path from host computer to disk
    storage fails due to host bus adapter card
    failure, storage availability is completely lost
  • Redundant I/O channels are added to the hardware
    configuration by putting in extra HBAs that
    connect to independent I/O cables
  • Disk arrays must also support multiple I/O ports
    to plug in multiple cables
  • Once redundant I/O paths are available in the
    hardware, a volume manager can utilize these
    paths to provide protection against I/O channel
    failure

16
Dynamic Multipathing
  • Active/passive
  • Only one port is active for I/O at a time, while
    the other port is passive and does not serve I/O
  • In case the I/O path to the active port fails,
    the passive port is activated by a special
    command issued on the I/O path to the passive
    port
  • Active/active
  • Allows I/O requests to be sent to the disks down
    both I/O paths concurrently
  • The volume manager sends I/O requests through one
    active I/O channel, or it may balance I/O traffic
    over multiple active channels

17
(No Transcript)
18
Issues with server failure
  • Out of sync
  • After a server crash the mirrors may be out of
    sync. The simple remedy is to distrust all
    mirrors except one and copy all data from that
    mirror to the remaining ones
  • It can take hours to rebuild out-of-sync mirrored
    storage. Such storage should not be accessed
    until the mirror rebuild completes
  • Dirty Region Logging (DRL)
  • Logs the addresses that undergo writes
  • It divides the whole volume into a number of
    regions. If an I/O request falls within a region,
    that region is marked dirty, and its identity is
    logged
  • Since at the most a few hundred I/O requests
    would be in flight when the server crashed, the
    number of blocks that are truly inconsistent is
    much smaller than the total number of blocks on
    the mirrored storage
  • An alternative to DRL is to use a full-fledged
    transaction mechanism to log all intended writes
    to separate stable storage before initiating any
    physical write
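
A minimal in-memory sketch of dirty region logging, assuming a fixed count of 64 regions per volume and a 64-bit bitmap. The struct drl type, DRL_REGIONS, and the function names are hypothetical. A real volume manager would also write the log to stable storage before issuing the data write, which is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

#define DRL_REGIONS 64          /* hypothetical number of regions per volume */

struct drl {
    uint64_t region_blocks;     /* blocks per region */
    uint64_t dirty;             /* one bit per region */
};

void drl_init(struct drl *d, uint64_t volume_blocks)
{
    assert(volume_blocks > 0);
    /* Round up so that DRL_REGIONS regions cover the whole volume. */
    d->region_blocks = (volume_blocks + DRL_REGIONS - 1) / DRL_REGIONS;
    d->dirty = 0;
}

/* Mark every region touched by a write of `count` blocks at `start`. */
void drl_mark(struct drl *d, uint64_t start, uint64_t count)
{
    assert(count > 0);
    uint64_t first = start / d->region_blocks;
    uint64_t last = (start + count - 1) / d->region_blocks;
    assert(last < DRL_REGIONS);
    for (uint64_t r = first; r <= last; r++)
        d->dirty |= UINT64_C(1) << r;
}

/* After a crash, only dirty regions need to be copied between mirrors. */
int drl_needs_resync(const struct drl *d, uint64_t region)
{
    assert(region < DRL_REGIONS);
    return (int)((d->dirty >> region) & 1);
}
```

On a 6400-block volume each region covers 100 blocks, so a write of 100 blocks starting at block 250 dirties regions 2 and 3 and leaves the other 62 regions clean.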

19
Page descriptors
  • The kernel must keep track of the current status
    of each page frame
  • State information of a page frame is kept in a
    page descriptor of type struct page

20
Page Descriptors
  • struct list_head list
  • struct address_space *mapping
  • Used when the page is inserted into the page cache
  • unsigned long index
  • The position of the data stored in the page
  • struct page *next_hash
  • atomic_t count
  • unsigned long flags
  • PG_locked, PG_referenced, PG_uptodate, ...
  • struct list_head lru
  • wait_queue_head_t wait
  • struct page **pprev_hash
  • struct buffer_head *buffers
  • void *virtual
  • struct zone_struct *zone

21
Memory zones
  • ZONE_DMA
  • Contains pages of memory below 16MB
  • Used by the DMA processors for ISA buses
  • ZONE_NORMAL
  • Contains pages of memory at and above 16MB and
    below 896MB
  • ZONE_HIGHMEM
  • Contains pages of memory at and above 896MB
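
The zone boundaries above can be expressed as a small classification function. This is a user-space sketch, not kernel code: it assumes 4 KB page frames and a 32-bit physical address space, and the name zone_of_frame is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

enum zone { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

#define PAGE_SHIFT 12                 /* assumes 4 KB page frames */
#define MB (1024u * 1024u)

/* Zone of the page frame with number `pfn`. */
enum zone zone_of_frame(uint32_t pfn)
{
    assert(pfn < (1u << (32 - PAGE_SHIFT)));
    uint32_t addr = pfn << PAGE_SHIFT;
    if (addr < 16u * MB)
        return ZONE_DMA;
    if (addr < 896u * MB)
        return ZONE_NORMAL;
    return ZONE_HIGHMEM;
}
```

Under these assumptions frame 4095 is the last frame of ZONE_DMA, frame 4096 the first of ZONE_NORMAL, and frame 229376 the first of ZONE_HIGHMEM.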

22
Memory zones
  • The ZONE_DMA and ZONE_NORMAL zones include the
    normal pages of memory that can be directly
    accessed by the kernel through the linear mapping
    in the fourth gigabyte of the linear address
    space
  • The ZONE_HIGHMEM includes pages of memory that
    cannot be directly accessed by the kernel through
    the linear mapping
  • Each memory zone has its own descriptor of type
    struct zone_struct
  • P220-221

23
Non-Uniform Memory Access (NUMA)
  • Linux 2.4 supports the Non-Uniform Memory Access
    (NUMA) model, in which the access time for
    different memory locations from a given CPU may
    vary
  • The physical memory of the system is partitioned
    in several nodes.
  • The time needed by any given CPU to access pages
    within a single node is the same

24
Non-Uniform Memory Access(NUMA)
  • The physical memory inside each node can be split
    in several zones
  • If NUMA support is not compiled in the kernel,
    Linux makes use of a single node that includes
    all system physical memory

25
Memory initialization
  • paging_init() invokes free_area_init(), which
  • Computes the total number of page frames in RAM,
    and stores the result in totalpages
  • Initializes the active_list and inactive_list
    lists of page descriptors
  • Allocates space for the mem_map array of page
    descriptors
  • Initializes some fields of the node descriptor
    contig_page_data
  • contig_page_data.node_size = totalpages
  • contig_page_data.node_start_paddr = 0x00000000
  • contig_page_data.node_start_mapnr = 0

26
Memory initialization
  • Initializes some fields of all page descriptors
  • for (p = mem_map; p < mem_map + totalpages; p++)
  • p->count = 0
  • SetPageReserved(p)
  • init_waitqueue_head(&p->wait)
  • p->list.next = p->list.prev = p
  • Initializes some fields of the memory zone
    descriptor in the zone local variable
  • zone->name = zone_name[j]
  • zone->size = zone_size[j]
  • zone->lock = SPIN_LOCK_UNLOCKED
  • zone->zone_pgdat = &contig_page_data
  • zone->free_pages = 0
  • zone->need_balance = 0
  • Initializes the node_zonelists array of the
    contig_page_data node descriptors

27
Memory initialization
  • mem_init()
  • Initializes the value of num_physpages, the total
    number of page frames present in the system
  • For each page descriptor, sets the count field to
    1
  • Resets the PG_reserved flag
  • Sets the PG_highmem flag if the page belongs to
    ZONE_HIGHMEM
  • Calls free_page() to release the page frame

28
Requesting and releasing
  • alloc_pages(gfp_mask, order)
  • Used to request 2^order contiguous page frames. It
    returns the address of the descriptor of the
    first allocated page frame
  • alloc_page(gfp_mask)
  • Used to get a single page frame. It returns the
    address of the descriptor of the allocated page
    frame
  • __get_free_pages(gfp_mask, order)
  • Similar to alloc_pages(), but it returns the
    linear address of the first allocated page

29
Requesting and releasing
  • __GFP_WAIT
  • The kernel is allowed to block the current
    process waiting for free page frames
  • __GFP_HIGH
  • The kernel is allowed to access the pool of free
    page frames left for recovering from very low
    memory conditions
  • __GFP_IO
  • The kernel is allowed to perform I/O transfers on
    low memory pages in order to free page frames

30
Requesting and releasing
  • __GFP_HIGHIO
  • The kernel is allowed to perform I/O transfers on
    high memory pages in order to free page frames
  • __GFP_DMA
  • The requested page frames must be included in the
    ZONE_DMA zone
  • __GFP_HIGHMEM
  • The requested page frames can be included in the
    ZONE_HIGHMEM zone

31
Kernel mappings of High-Memory Page Frames
  • Allocations of high-memory page frames must be
    done only through the alloc_pages() function and
    its single-page variant alloc_page()
  • Once allocated, a high-memory page frame has to
    be mapped into the fourth gigabyte of the linear
    address space
  • Permanent kernel mappings
  • Temporary kernel mappings
  • Noncontiguous memory allocation

32
Buddy system algorithm
  • A suitable technique to solve external
    fragmentation is to avoid as much as possible the
    need to split up a large free block
  • DMA ignores the paging circuitry and accesses the
    address bus directly
  • Reduction of translation lookaside buffer (TLB) misses

33
Buddy System Algorithm
  • All free page frames are grouped into 10 lists of
    blocks that contain groups of 1, 2, 4, 8, 16, 32,
    64, 128, 256, and 512 contiguous page frames,
    respectively
  • The physical address of the first page frame of a
    block is a multiple of the group size

34
Buddy System Algorithm
  • Assume there is a request for a group of 128
    contiguous page frames
  • Checks first to see whether a free block in the
    128-page-frame list exists
  • If there is no such block, the algorithm looks
    for the next larger block, a free block in the
    256-page-frame list
  • If such a block exists, the kernel allocates 128
    of the 256 page frames and inserts the remaining
    128 page frames into the list of free
    128-page-frame blocks

35
Buddy System Algorithm
  • If there is no free 256-page block, the kernel
    looks for the next larger block
  • If such a block exists, it allocates 128 of the
    512 page frames, inserts the first 256 of the
    remaining 384 page frames into the list of free
    256-page-frame blocks, and inserts the last 128
    of the remaining 384 page frames into the list of
    free 128-page-frame blocks
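
The search-and-split procedure described on the last two slides can be sketched with one counter per block size standing in for the kernel's free lists. The struct free_lists type and buddy_alloc() are hypothetical names, and the sketch tracks only how many free blocks of each order exist, not their addresses.

```c
#include <assert.h>

#define MAX_ORDER 10    /* orders 0..9, i.e. blocks of 1..512 page frames */

/* Number of free blocks of each order (a stand-in for the real lists). */
struct free_lists {
    unsigned count[MAX_ORDER];
};

/* Allocate a block of 2^order page frames.
 * Returns 0 on success, -1 if no free block is large enough. */
int buddy_alloc(struct free_lists *fl, unsigned order)
{
    assert(order < MAX_ORDER);
    unsigned found = order;
    while (found < MAX_ORDER && fl->count[found] == 0)
        found++;                 /* look in the next larger list */
    if (found == MAX_ORDER)
        return -1;
    fl->count[found]--;
    /* Split down to the requested size: each split puts one half
     * on the smaller list and keeps working with the other half. */
    while (found > order) {
        found--;
        fl->count[found]++;
    }
    return 0;
}
```

Starting with a single free 512-frame block, a request for 128 frames (order 7) leaves one free 256-frame block and one free 128-frame block, matching the example on the slide.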

36
Buddy System Algorithm
  • Two blocks of size b are considered buddies if
  • Both blocks have the same size, say b
  • They are allocated in contiguous physical
    addresses
  • The physical address of the first page frame of
    the first block is a multiple of 2 x b x 2^12
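
In terms of page frame numbers, the three conditions mean that the buddy of a block differs from it in exactly one bit. A small sketch, with hypothetical function names, where idx is the frame number of a block's first page frame and the block holds 2^order frames:

```c
#include <assert.h>

/* Frame number of the buddy of the block starting at `idx`. */
unsigned long buddy_of(unsigned long idx, unsigned order)
{
    /* A block is always aligned to its own size. */
    assert((idx & ((1UL << order) - 1)) == 0);
    return idx ^ (1UL << order);
}

/* First frame of the block of 2^(order+1) frames formed by merging
 * the block at `idx` with its buddy. */
unsigned long merged_start(unsigned long idx, unsigned order)
{
    return idx & ~(1UL << order);
}
```

For blocks of 8 frames, the blocks at frames 0 and 8 are buddies, while those at 8 and 16 are adjacent but not buddies, since 8 is not a multiple of 16.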

37
Buddy System Algorithm
  • Main data structure
  • mem_map array
  • An array having 10 elements of type free_area_t,
    one element for each group size
  • typedef struct free_area_struct {
  • struct list_head free_list;
  • unsigned long *map;
  • } free_area_t;
  • Ten binary arrays named bitmaps

38
Buddy System Algorithm
  • The kth element of the free_area array in the
    zone descriptor is associated with a doubly
    linked circular list of blocks of size 2^k; each
    member of such a list is the descriptor of the
    first page frame of a block
  • The map field points to a bitmap. Each bit of the
    bitmap of the kth entry of the free_area array
    describes the status of two buddy blocks of size
    2^k page frames.

39
Buddy System Algorithm
  • A zone including 128MB of RAM
  • 32768 single pages, 16384 groups of 2 pages each,
    and so on up to 64 groups of 512 pages each
  • The bitmap of free_area[0] consists of 16384 bits,
    one for each pair of the 32768 existing page
    frames; the bitmap of free_area[1] consists of
    8192 bits, one for each pair of blocks of two
    consecutive page frames
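
The figures for the 128MB zone follow from simple arithmetic, sketched below under the assumption of 4 KB page frames (128MB / 4KB = 32768 frames). The function names are illustrative only, and the sketch ignores any rounding a real kernel applies to bitmap sizes.

```c
#include <assert.h>

/* Number of blocks of 2^order page frames in a zone of `frames` frames. */
unsigned long blocks_of_order(unsigned long frames, unsigned order)
{
    assert(order < 10);
    return frames >> order;
}

/* One bitmap bit per pair of buddy blocks of that order. */
unsigned long bitmap_bits(unsigned long frames, unsigned order)
{
    return blocks_of_order(frames, order) / 2;
}
```

For 32768 frames this gives 16384 bits for order 0 and 8192 bits for order 1, as on the slide, and 64 blocks of 512 frames at order 9.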

40
Memory Area Management
  • Need a scheme to satisfy the requests for small
    memory areas, a few tens or hundreds of bytes
  • Need a scheme to avoid internal fragmentation
  • Mismatch between the size of the memory request
    and the size of the memory area allocated to
    satisfy the request

41
Slab Allocator
  • View the memory areas as objects consisting of
    both a set of data structures and a couple of
    functions called the constructor and destructor
  • Linux uses the slab allocator to reduce the
    number of calls to the buddy system allocator
  • The slab allocator does not discard the objects
    but instead saves them in memory.
  • When a new object is requested, it can be taken
    from memory without having to be reinitialized

42
Slab allocator
  • The slab allocator groups objects into caches
  • Each cache is a store of objects of the same type
  • The area of main memory that contains a cache is
    divided into slabs; each slab consists of one or
    more contiguous page frames that contain both
    allocated and free objects
  • The slab allocator never releases the page frames
    of an empty slab on its own

43
Slab Allocator
[Diagram: a cache is made up of slabs, and each slab holds objects]

44
Slab Allocator
  • Each cache is described by a table of type
    struct kmem_cache_s
  • Each slab of a cache has its own descriptor of
    type struct slab_s
  • Caches are divided into two types: general and
    specific

45
Slab Allocator
  • The general caches are
  • The first cache contains the cache descriptors of
    the remaining caches used by the kernel.
  • Twenty-six additional caches contain
    geometrically distributed memory areas. The
    table, called cache_sizes, points to the 26 cache
    descriptors associated with memory areas of 13
    sizes: 32, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536, and 131072 bytes. For
    each size there are two caches, one for ISA DMA
    allocations and one for normal allocations
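
Choosing a general cache for a request amounts to finding the smallest listed size that fits. A minimal sketch, with a hypothetical function name, that walks the power-of-two sizes from 32 to 131072 bytes instead of consulting the kernel's cache_sizes table:

```c
#include <assert.h>
#include <stddef.h>

/* Smallest general-cache object size that can hold `size` bytes,
 * or 0 if the request is larger than the biggest general cache. */
size_t general_cache_size(size_t size)
{
    assert(size > 0);
    for (size_t cs = 32; cs <= 131072; cs <<= 1)
        if (size <= cs)
            return cs;
    return 0;
}
```

A 100-byte request lands in the 128-byte cache, so 28 bytes are lost to internal fragmentation. With power-of-two sizes the waste always stays below half of the allocated area.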

46
Slab Allocator
  • The kmem_cache_init() and kmem_cache_sizes_init()
    functions are invoked during system
    initialization to set up the general caches
  • The kmem_cache_destroy() function destroys a
    cache
  • The kmem_cache_shrink() function destroys all
    slabs in a cache by invoking kmem_slab_destroy()
    iteratively.

47
Slab Allocator
[Diagram: three cache descriptors and their slab descriptors]