Aligned IO for filesystems on Linux

Why?

Aligned IO is much faster on almost every RAID system, and on parity RAID schemes (RAID5/RAID6) in particular.

How?

The goal here is to align the file system writes with the "native" write size of the underlying raid device.

Unfortunately we have a number of factors working against us:

  • There is massive confusion regarding the terms stripe/chunk size and what they mean. Sometimes they refer to the per-disk unit, and sometimes to the sum over all data disks, which is also the optimum IO size.
  • Hardware raid management is not standardized, so every hardware raid manufacturer has their own way of doing things.
  • Partitions, especially the DOS partition table, tend to throw the alignment off.

Hardware RAID

We need to figure out what the stripe size and layout of your hardware RAID is, and maybe change it to something that suits your needs.

Unfortunately no one has bothered to standardize hardware RAID management, so how to check this varies depending on the driver/utility available.

Caveat regarding per-disk or per-stripe units

Many hardware controllers and utilities refer to the per-disk strip size when talking about the stripe size, sometimes even contradicting most of their own documentation, which refers to the entire stripe width.

In most RAID schemes the amount of data written to each disk in a stripe must be identical. One way to figure out whether a reported size is the per-disk strip or the full-width stripe is to consider the following:

  • Disk blocks are usually 512 bytes (or a multiple up to 4096 bytes).
  • The reported stripe size must make sense compared to the number of data disks in your RAID layout.

For example, 64k in 512-byte blocks can only be evenly split between 1, 2, 4, 8, 16, 32, 64 or 128 data disks. If you have a RAID layout consisting of 6 data disks and the RAID hardware reports 64k stripes, you can be rather sure that 64k is the per-disk strip size.
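
A quick shell-arithmetic sketch of that check, using the numbers from this example (a reported 64k and 6 data disks): convert the reported size to 512-byte blocks and see whether it divides evenly over the data disks.

# echo $(( 64 * 1024 / 512 % 6 ))
2

The non-zero remainder means 64k cannot be an even full-width stripe over 6 data disks, so it is almost certainly the per-disk strip size.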

HP/Compaq Smart Array

# hpacucli ctrl all show config detail|egrep 'Logical.Drive:|Strip'

Logical Drive: 1
Strip Size: 64 KB
Full Stripe Size: 64 KB
Logical Drive: 2
Strip Size: 64 KB
Full Stripe Size: 640 KB

Newer versions of hpacucli/hpssacli/ssacli have clarified the situation by using the term Strip for the per-disk size.

In this example (logical drive 2) we have a 64kB per-disk strip size, and hpacucli helpfully prints the full stripe size of 640kB too.

To verify this we also need information about the physical layout; it's usually easiest to look at the specific unit:

# hpacucli ctrl slot=2 ld all show

Smart Array P812 in Slot 2

array A

logicaldrive 1 (19.5 GB, RAID 6 (ADG), OK)
logicaldrive 2 (18.2 TB, RAID 6 (ADG), OK)

# hpacucli ctrl slot=2 pd all show

Smart Array P812 in Slot 2

array A

physicaldrive 5I:1:1 (port 5I:box 1:bay 1, SATA, 2 TB, OK)
physicaldrive 5I:1:2 (port 5I:box 1:bay 2, SATA, 2 TB, OK)
physicaldrive 5I:1:3 (port 5I:box 1:bay 3, SATA, 2 TB, OK)
physicaldrive 5I:1:4 (port 5I:box 1:bay 4, SATA, 2 TB, OK)
physicaldrive 5I:1:5 (port 5I:box 1:bay 5, SATA, 2 TB, OK)
physicaldrive 5I:1:6 (port 5I:box 1:bay 6, SATA, 2 TB, OK)
physicaldrive 5I:1:7 (port 5I:box 1:bay 7, SATA, 2 TB, OK)
physicaldrive 5I:1:8 (port 5I:box 1:bay 8, SATA, 2 TB, OK)
physicaldrive 5I:1:9 (port 5I:box 1:bay 9, SATA, 2 TB, OK)
physicaldrive 5I:1:10 (port 5I:box 1:bay 10, SATA, 2 TB, OK)
physicaldrive 5I:1:11 (port 5I:box 1:bay 11, SATA, 2 TB, OK)
physicaldrive 5I:1:12 (port 5I:box 1:bay 12, SATA, 2 TB, OK)

OK, so we have RAID6 (ie. double parity) on 12 disks, which means that we have 10 data disks.
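
As a quick sanity check (a shell-arithmetic sketch using the numbers above), 10 data disks times the 64kB strip size gives exactly the 640kB full stripe size that hpacucli reported:

# echo $(( (12 - 2) * 64 ))
640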

ARECA

# cli64 vsf info

# cli64 vsf info vol=n | egrep 'Level|Strip|Member'

Raid Level      : Raid5
Stripe Size     : 32K
Member Disks    : 6

In this example we have RAID5 on 6 disks with a 32kB strip size. That means we have 5 data disks.
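
The same sketch as before confirms the full stripe width, 5 data disks times 32kB:

# echo $(( (6 - 1) * 32 ))
160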

LSI/Avago MegaRAID, Dell PERC

The LSI/Avago MegaRAID controllers are known under many names. As of this writing the utility commonly used is storcli (or the renamed perccli on Dell machines).

# /opt/MegaRAID/perccli/perccli64 /call/vall show all

This yields a lot of output. To make our example easier, we can filter the output and present it as follows:

# /opt/MegaRAID/perccli/perccli64 /call/vall show all | egrep '/v[0-9]|^[0-9]*/[0-9]|VD[0-9]|Strip Size|Drives Per Span|SCSI NAA Id'

/c0/v2 :
1/2   RAID6 Optl  RW     Yes     RWBD  -   OFF 101.875 TB dCacheVD 
VD2 Properties :
Strip Size = 512 KB
Number of Drives Per Span = 16
SCSI NAA Id = 614187703f1f52001dce0d8d6f55610f

The RAID6 unit listed consists of 16 drives, with a strip size of 512 KiB. As RAID6 is double parity, the number of data drives is 16 - 2 = 14.

The SCSI NAA Id is the WWN of the unit, and thus it's accessible as /dev/disk/by-id/wwn-0x614187703f1f52001dce0d8d6f55610f, which is less error-prone to use than /dev/sdX in a multi-controller environment.
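
For example (assuming the usual udev-created wwn symlink exists on your system), readlink resolves it to whatever /dev/sdX name the unit currently has:

# readlink -f /dev/disk/by-id/wwn-0x614187703f1f52001dce0d8d6f55610f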

Software RAID

As this is standardized on Linux, utilities for creating file systems can read the raid geometry and automagically do the right thing, unless you partition the device.
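
For md devices you can also inspect the geometry yourself; the block layer exports the values that the mkfs utilities pick up. A sketch, assuming an md RAID device named /dev/md0:

# mdadm --detail /dev/md0 | egrep 'Level|Devices|Chunk'
# cat /sys/block/md0/queue/minimum_io_size /sys/block/md0/queue/optimal_io_size

For RAID5/RAID6 md devices, minimum_io_size is the per-disk chunk size and optimal_io_size is the full stripe (chunk size times the number of data disks).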

DOS partitions

DOS partitions are a sure way of messing up RAID alignment. If you really need them, check the partition table with:

# sfdisk -l -uS /dev/your-disk-device

An example output:

# sfdisk -l -uS /dev/foo

Disk /dev/foo: 17562 cylinders, 255 heads, 32 sectors/track
Units = sectors of 512 bytes, counting from 0

Device Boot    Start       End   #sectors  Id  System
/dev/foo1   *        32    791519     791488  83  Linux
/dev/foo2        791520  33553919   32762400  82  Linux swap / Solaris
/dev/foo3      33553920 143305919  109752000  8e  Linux LVM
/dev/foo4             0         -          0   0  Empty

If we create a partition on that disk, for msdos-compatibility reasons it is usually aligned so that it skips the first track, leaving it for the bootloader/partition info. In this case that would make the first partition start 32 sectors/track * 512 bytes = 32*512/1024 = 16kB into the device, which would ruin our stripe alignment big time.

Another common issue is OS installers creating a partition table based on 63 sectors per track, which would make the first partition start 63*512=32256 bytes in. That doesn't align with anything. Newer versions of the FAI setup-storage component can be convinced otherwise by using the align-at directive in the disk config file.
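
A quick way to check an existing partition is to see whether its start offset in bytes is an exact multiple of the full stripe size. A shell-arithmetic sketch using the start sector of /dev/foo2 above and a 640kB full stripe (both just illustrative numbers):

# echo $(( 791520 * 512 % (640 * 1024) ))
245760

A non-zero result means the partition start is not stripe-aligned.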

So avoid DOS partitions unless you really need them.

LVM

If you need partitioning, use LVM and place the physical volume (PV) directly on the disk device:

# pvcreate /dev/cciss/c0d1

# pvs -o name,pe_start
PV                1st PE 
/dev/cciss/c0d0p3 192.00K
/dev/cciss/c0d1   192.00K

shows that the first Physical Extent (i.e. allocation block) would start 192K into the device.

For example, a device with 32k per-disk strips and 6 data drives has a stripe width of 6 * 32k = 192kB, which matches this 1st PE offset.

If we had needed, for example, 1024kB stripes, we would need to change the alignment.

Use the --dataalignment option to change this, e.g. --dataalignment=1024k to align for 1024kB stripes:

# pvcreate --dataalignment=1024k /dev/cciss/c0d1

Older versions of the LVM utilities don't have this option, but you can achieve a similar result using the --metadatasize option to pvcreate; for example --metadatasize=192k gives a 1st PE at 256k. Thus, double-check to verify that you get the result you expected.
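
After the pvcreate above it is worth re-running the earlier check; with --dataalignment=1024k the 1st PE column should now report an offset that is a multiple of 1024kB:

# pvs -o name,pe_start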

# vgcreate faivg /dev/cciss/c0d1

# vgs -o name,vg_extent_size
VG     Ext  
faivg  4.00M
rootvg 4.00M

shows that every PE allocated in our VGs is 4MB in size. Divide this by your alignment to verify that the sizes match.

If they don't align, provide a suitable extent size to vgcreate.
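
For example (a sketch; 4MB happens to be the default, and it is an even multiple of a 1024kB stripe), the extent size can be set explicitly with the -s option:

# vgcreate -s 4M faivg /dev/cciss/c0d1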

Filesystems

To get really good performance, we also need to align the filesystem IO with the stripe size.

To test and demonstrate, we create a test LV:

# lvcreate -L 5G -n testlv faivg

XFS

XFS has two tunables that are really interesting at this stage. They come in two flavors; use the one that is easiest given the units you have:

  • sunit (per-disk stripe unit, in 512-byte blocks)
  • swidth (full stripe width, in 512-byte blocks)

OR

  • su (per-disk stripe unit, in bytes; can be suffixed with k)
  • sw (stripe width, as the number of data disks, i.e. a multiple of su)

More about both can be found in the mkfs.xfs man page.

So, for a RAID device with a 64k per-disk strip size and 10 data disks, the command would look like:

# mkfs.xfs -f -d su=64k,sw=10 /dev/devicename

Remember that the printout you get from mkfs.xfs is based on the filesystem block size, not 512-byte units. With the default 4k blocks, the su=64k,sw=10 example above (sunit=128, swidth=1280 in 512-byte units) is printed as "sunit=16 swidth=160 blks".
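
To verify the values on an existing XFS filesystem you can use xfs_info, which also reports sunit/swidth in filesystem blocks (the mount point is just a placeholder):

# xfs_info /mountpoint | grep sunit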

EXT3/EXT4

# mkfs.extX -E stride=16,stripe-width=160 /dev/devicename

stride is the number of filesystem blocks on each data disk in a stripe. ext3 and ext4 normally use 4096-byte blocks, but this can vary depending on options and heuristics.

stripe-width is the total stripe width in filesystem blocks, usually stride * number-of-data-disks.
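
For the same example as before (64kB per-disk strip, 10 data disks, 4096-byte blocks) the arithmetic works out to stride = 64k / 4k = 16 and stripe-width = 16 * 10 = 160, which is where the numbers in the mkfs.extX command come from. On a filesystem created this way, tune2fs should report the values back (a sketch; the device name is a placeholder):

# tune2fs -l /dev/devicename | egrep -i 'stride|stripe'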

See the manpage for mkfs.ext3 and/or mkfs.ext4 for more information.