Alternative flash architectural implementation options for AIX with detailed instructions on implementing LVM mirroring with preferred read from flash.
Glenn H Fujimoto
Advanced Technical Sales
AIX; IBM eServer pSeries; Power4; Power5; POWER6; POWER7; PowerHA SystemMirror; Power Systems I/O
Abstract: This document discusses different ways to use flash/SSD storage with AIX, and provides detailed steps to implement LVM mirroring with preferred read from flash/SSD.
This document discusses the benefits and positioning of using AIX LVM mirroring with preferred read from Flash/SSD and provides detailed implementation steps to set it up. The solution would look like:
This document focuses on implementing AIX LVM mirroring. Some applications can also mirror the data themselves, and if they provide a preferred read capability, that approach can be used as well. There are also other approaches to using flash storage.
If you are considering investing in flash or architecting a flash storage solution, then this will be of interest to you. If you are mostly interested in how to implement this solution, go to the section of this document titled Implementing AIX LVM Mirroring with Preferred Read From Flash.
Let's first discuss the difference between flash and solid state disk (SSD).
Flash vs. SSD
SSD and flash both contain flash storage. The difference is that SSD is typically packaged so that it can be plugged into traditional disk bays. E.G., on Power, we offer a range of SSDs which are plugged into SAS disk bays, attached to one or two SAS adapters, and configured into RAID arrays which then appear to the host as hdisks. A FlashSystem is a fibre channel attached disk subsystem, where you can set up RAID arrays and create LUNs of the desired size. Because SSDs are designed to be placed into disk bays, they may use IO infrastructure and packaging more suited to the IO capabilities of HDDs.
Thus, a FlashSystem has more flexibility in choice and number of logical disks or LUNs, and they can be assigned to more LPARs than SAS storage attached to a pair of adapters allows. The use of VIO with SSD does provide more flexibility in that each RAID array, or hdisk (since they are the same thing with SSD) can each be assigned to different LPARs on the server, but we still don't have the flexibility the FlashSystem offers in creating many LUNs of different sizes for different LPARs on many different systems. Thus, flash solutions include both SSDs and solutions such as the IBM FlashSystems.
Flash storage has developed as an alternative persistent storage to hard disk drives (HDDs) because HDD performance hasn't kept pace with processor and memory performance, which has improved at an exponential rate. HDDs, being mechanical, are limited to mechanical speeds, and the time to do an IO to a HDD remains (for a well tuned disk subsystem) typically in the range of 5 to 10 ms, while we can do reads from flash in a fraction of a millisecond.
Flash can also do significantly more IOPS since it's entirely electronic and has no moving parts. A typical SSD can easily do over 10,000 IOPS while a single HDD can do around 200 IOPS at a reasonable service time for a HDD. IBM FlashSystems can achieve up to 500,000 IOPS from a single 1U device. It's actually difficult to find application data with sufficiently high access density (as specified and measured in IOPS/GB) to push flash to its IOPS limits, while many customers have learned they need to purchase enough HDDs to get the IOPS bandwidth they need, often ending up with far more space than needed. E.G., say a 300 GB SSD can achieve 15,000 IOPS, then our application data would need an access density of 15,000/300 = 50 IOPS/GB. Most commercial application data has access densities far below this.
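The access density arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the 200 IOPS per HDD figure comes from the text, while the 300 GB HDD capacity and the sample workload (10,000 IOPS over 1,000 GB) are assumptions chosen only for the example.

```python
import math

def access_density(iops: float, capacity_gb: float) -> float:
    """Access density in IOPS per GB."""
    return iops / capacity_gb

def hdds_needed(required_iops: float, required_gb: float,
                hdd_iops: float = 200, hdd_gb: float = 300) -> int:
    """Drives needed to satisfy BOTH the IOPS and the capacity requirement.

    hdd_gb=300 is an assumed drive size, not a figure from the document.
    """
    for_iops = math.ceil(required_iops / hdd_iops)
    for_space = math.ceil(required_gb / hdd_gb)
    return max(for_iops, for_space)

# The example from the text: a 300 GB SSD doing 15,000 IOPS.
ssd_density = access_density(15_000, 300)

# Hypothetical workload: 10,000 IOPS over 1,000 GB of data.
drives = hdds_needed(10_000, 1_000)
print(ssd_density, drives)
```

In the hypothetical workload the drive count is driven by IOPS rather than by space, which is the over-purchasing of HDD capacity described above.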
Application bottlenecks have moved to disk storage as CPU and memory performance has increased (thanks to Moore's Law). Flash also offers significant energy, cooling and space savings. It can even lower software licensing costs because it helps reduce CPU IO wait (i.e. the CPU is idle but it would have some work to do if only some outstanding IO completes) and reduce the number of CPUs needed to run an application at a specified application performance level. Flash can also reduce the need for system RAM when it's used to cache data and avoid IOs from persistent storage: there's a tradeoff between cost and latency here, since RAM costs much more than flash per GB but is also orders of magnitude faster in terms of latency.
While flash was expensive just a few years ago, the price has been coming down as flash technology also benefits from Moore's Law, making flash more competitive with HDDs in terms of price per GB, though currently still more expensive. We have now reached a tipping point where flash storage is justified by all its benefits, including improved performance, low power consumption, savings from fewer HDDs to meet IOPS requirements, and space efficiency.
Other Approaches to Using Flash
IBM, and other vendors, are finding different ways to take advantage of flash. This section discusses some of them and compares the alternatives.
Similar to LVM mirroring with preferred read, we can allow the SVC/V7000 to mirror a VDisk:
Here AIX sees a single hdisk, and the SVC is doing the mirroring.
Another alternative is to use a disk subsystem that uses EasyTier, placing the frequently accessed data (or hot data) on flash, with the less frequently accessed data on HDDs as shown in Figure 3. Note there is only one copy of the data here:
Another recently announced alternative is the IBM EasyTier Server solution, which uses SAS attached SSD as a read cache for DS8000 data, with the contents of the SSDs placed there under the direction of EasyTier on the DS8000:
Another approach is used in the XIV Gen3 disk subsystem, where it uses flash as an extension to the disk cache in the XIV RAM:
And some customers are doing this:
Here the customer manually places some data on flash, and some on HDDs (locally attached or in a disk subsystem). Some customers using this place temporary data on the flash (that they can live without in case the flash or system fails), and use a disk subsystem to mirror the data to a remote site, where in the event of a disaster, they can recover their application. Other customers manually place hot data on flash and cold data on the HDDs. E.G., one can measure LVM logical partition access density with lvmstat and put those logical partitions with the highest access density on flash.
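As a rough sketch of the manual hot/cold placement just described: since all logical partitions in a VG have the same size, ranking them by IOPS is the same as ranking them by access density. The sample statistics below are invented for illustration, and this sketch deliberately does not parse lvmstat output; how you collect the per-partition IOPS is left to you.

```python
def hottest_partitions(lp_iops, pp_size_mb, flash_mb):
    """Return the logical partitions to place on flash, hottest first.

    lp_iops    : dict mapping (lv_name, lp_number) -> measured IOPS
    pp_size_mb : partition size of the VG (all partitions are equal size)
    flash_mb   : flash space available for hot data
    """
    budget = flash_mb // pp_size_mb          # how many partitions fit on flash
    ranked = sorted(lp_iops.items(), key=lambda kv: kv[1], reverse=True)
    return [lp for lp, _ in ranked[:budget]]

# Hypothetical measurements, NOT real lvmstat output.
sample = {
    ("datalv", 1): 12.0,
    ("datalv", 2): 340.0,
    ("datalv", 3): 95.5,
    ("loglv", 1): 410.0,
}
chosen = hottest_partitions(sample, pp_size_mb=256, flash_mb=512)
print(chosen)
```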
And IBM has announced a statement of direction to provide this capability:
Here EasyTier automatically moves the hot data to SAS attached SSDs. Note that the EXP30 is attached to a GX slot, but contains two integrated SAS adapters, so it appears as SAS connected.
Feature Differences Among the Alternatives
There are cost benefit tradeoffs among these alternatives, and a full comparison is outside the scope of this document. It's worth pointing out the major differences.
All these solutions reduce IO latency by doing reads from flash rather than HDDs. But not all solutions will have the same IO latency when reading from flash from the host's point of view. The more hardware and code the IO has to traverse between the host disk driver and the flash, the greater the latency of the IO. So flash in a disk subsystem behind an SVC will have higher IO latencies than SAS attached SSDs, or flash in a directly connected disk subsystem.
Not all flash has the same IO latency. The technology has been changing, starting with SLC and MLC and now eMLC technologies. The algorithms used with the flash, such as those for wear leveling, over provisioning, bad block relocation, compression and data protection, have also been changing. Also, a nice feature of the IBM FlashSystem is that it handles IO processing in hardware rather than software, which is faster, and as a result it provides excellent IO latency for a fibre channel attached storage subsystem. Some SAS adapters also offer protected write cache while others do not. Writes to adapter cache are faster than writes to flash.
Some of these solutions require sufficient flash to contain a complete copy of the data, while others do not. So this is a cost performance tradeoff. Some of these solutions do not require that we protect the data on flash from flash failure, while others do. To protect the flash from failure one typically implements some form of RAID providing the protection, which results in higher costs for protected space since space is consumed for RAID parity, a cost availability tradeoff.
For example, we compared the performance of a 10 TB database using LVM preferred read mirroring with 10 TB of flash storage to an Easy Tier solution using only 2 TB of protected flash space, and achieved similar performance.
While SAS attached SSD provides the best IO latency, one loses advanced disk subsystem features such as IBM's FlashCopy or MetroMirror if the data isn't also on a disk subsystem supporting that function. However, with the OS/application mirroring of the data as shown above in Figure 1 or Figure 4, we retain that capability.
The read preferred mirroring solution has a side benefit in that this reduces the IO workload of the disks in the SAN, by the current read IO rate. So the existing HDDs only handle writes and no longer have to handle read requests. This often results in better average IO latency for the IOs to the disk subsystem. And it allows further consolidation of workloads on fewer HDDs, while significantly improving read IO latency.
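To make the read offload concrete, here is a small sketch of how the IO splits once reads are preferred from flash. It assumes a healthy mirror in which every read is served by the flash copy and every write goes to both copies; the 8,000 IOPS workload and the 75% read fraction are illustrative numbers only.

```python
def io_split_with_preferred_read(total_iops: float, read_fraction: float) -> dict:
    """Split a workload between the flash copy and the HDD copy.

    Assumes all reads are served from the flash copy and every write
    is issued to both copies (healthy mirror, no stale partitions).
    """
    reads = total_iops * read_fraction
    writes = total_iops - reads
    return {
        "flash_iops": reads + writes,  # flash sees every read and every write
        "hdd_iops": writes,            # HDDs are left with writes only
        "hdd_iops_removed": reads,     # read load taken off the SAN disks
    }

split = io_split_with_preferred_read(total_iops=8_000, read_fraction=0.75)
print(split)
```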
Tiering is implemented such that hot data (frequently accessed data with high access density, as specified in IOPS/GB) is automatically moved to flash while cold data remains on HDDs. Application IO is typically skewed such that most of the IOs occur in a small fraction of the allocated data space. This leads to cost effective use of the SSDs in a mixed flash and HDD environment.
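The effect of IO skew on a tiered setup can be illustrated with a short calculation: given per-extent IO rates, what share of the IO lands on flash if only the hottest extents are placed there? The extent rates below are made up to show a skewed workload; this is a sketch of the idea, not a description of how any particular tiering product chooses extents.

```python
def io_captured_by_flash(extent_iops, flash_extents):
    """Fraction of total IO served from flash if the hottest
    `flash_extents` extents (all equal size) are placed on flash."""
    total = sum(extent_iops)
    if total == 0:
        return 0.0
    hottest = sorted(extent_iops, reverse=True)[:flash_extents]
    return sum(hottest) / total

# Hypothetical skewed workload: ten equal-size extents, IOPS per extent.
rates = [500, 300, 100, 40, 30, 10, 10, 5, 3, 2]
captured = io_captured_by_flash(rates, flash_extents=2)
print(captured)
```

With this invented data, flash sized for 20% of the space serves 80% of the IO.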
There are also differences here in the cost and ability to bring in flash where higher disk performance is needed, since some of these solutions require specific disk subsystems.
Implementing AIX LVM Mirroring with Preferred Read From Flash
The high level steps are:
- Plan a RAID level based on your availability requirements
- Plan a flash LUN design considering your LVM volume group (VG) design
- Plan for any LVM changes
- Attach and configure the flash LUNs
- Configure the flash hdisks on AIX
- Mirror the VG and set the LV scheduling policy to parallel/sequential
- Turn off quorums
How one implements this solution also depends on whether you are implementing a new solution, or bringing in flash to improve application performance in an existing setup.
RAID and LVM Planning
When mirroring to flash, choosing a RAID protection scheme on the flash is a belt and suspenders approach to availability. But some customers may choose to implement RAID protection nonetheless. The additional IOPS overhead for RAID is a factor, but typically flash has more IOPS bandwidth than applications can fill. The main disadvantage of RAID protection is that it reduces the total flash space that can be used for data. If customers want the additional availability, RAID 5 offers the least reduction of total space. For other customers, RAID 0 will be the choice since no protection is needed.
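The space cost of RAID protection mentioned above is easy to quantify. The sketch below covers only the two levels discussed here, and uses the usual simplification that RAID 5 gives up one drive's worth of space to parity; the drive count and size are illustrative.

```python
def usable_gb(level: str, drives: int, drive_gb: float) -> float:
    """Usable space of a single array, ignoring formatting overhead."""
    if level == "raid0":
        return drives * drive_gb            # no protection, no space lost
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_gb      # one drive's worth of parity
    raise ValueError(f"unsupported level: {level}")

# Hypothetical array: six 400 GB flash devices.
raid0_space = usable_gb("raid0", 6, 400)
raid5_space = usable_gb("raid5", 6, 400)
print(raid0_space, raid5_space)
```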
In mirroring a VG, we can run into LVM limits. E.G., we may not be able to expand the VG. So there is some LVM planning to do. The main LVM limits to be aware of are the maximum number of physical volumes (PVs) and physical partitions (PPs) in a VG, which varies by VG type, and the maximum number of LPs in an LV which is 32,512 for a standard VG, but not limited for big and scalable VGs. The VG limits are as follows:
We don't have to create a set of LUNs equivalent to the existing LUNs; we only need to create LUNs with enough space to hold a copy of the VG. If necessary we can change a VG's type, provided we have enough free space on the disks in the VG to allow the VGDA area to expand. We can't change the PP size of an existing VG, and adding a mirror doubles the number of physical partitions tracked in the VGDA, slowing LVM operations (but not production performance unless production work includes LVM operations).
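A quick check like the following can show whether a VG will still fit within its limits once the flash PVs are added. The limit values in the table are the commonly documented maximums for each VG type as I understand them; treat them as an assumption and confirm against the AIX documentation for your level. The PP and PV counts in the example are hypothetical.

```python
# Assumed maximums per VG type (verify for your AIX level).
VG_LIMITS = {
    "standard": {"max_pvs": 32,   "max_pps": 32_512},
    "big":      {"max_pvs": 128,  "max_pps": 130_048},
    "scalable": {"max_pvs": 1024, "max_pps": 2_097_152},
}

def vg_types_that_fit(existing_pps, existing_pvs, flash_pps, flash_pvs):
    """VG types whose PV and PP limits can hold the VG after the
    flash PVs (and their physical partitions) are added."""
    total_pps = existing_pps + flash_pps
    total_pvs = existing_pvs + flash_pvs
    return [
        name for name, lim in VG_LIMITS.items()
        if total_pps <= lim["max_pps"] and total_pvs <= lim["max_pvs"]
    ]

# Hypothetical VG: 9 LUNs with 20,000 PPs, mirrored onto 9 flash LUNs.
fits = vg_types_that_fit(existing_pps=20_000, existing_pvs=9,
                         flash_pps=20_000, flash_pvs=9)
print(fits)
```

In this example a standard VG would have to be converted before the mirror could be added.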
Some customers may decide to take the opportunity to adjust the LVM setup for data layout and performance purposes. E.G., if some LVs weren't spread across hdisks, and the storage is such that doing so would more evenly balance the IOs across the physical disks, then we can use the opportunity to correct the LV setup. Walking the mirror back to HDDs provides the opportunity to correct the LVM data layout.
Most customers will want to use LVM's strict storage pools to ensure a complete copy of the data will reside in each pool (one pool being flash, the other HDDs or SAN storage). Use of LVM mirror pools also helps assure that LVs are mirrored correctly. See the official documentation for LVM mirror pools at http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.baseadmn%2Fdoc%2Fbaseadmndita%2Fmirrorpools.htm, or for a short white paper on using them with an example of the commands used see http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102387.
One will also need to turn the VG quorums off to ensure that if we lose the flash, or the HDDs, the VG will remain online.
Once you've evaluated your LVM setup, then it's time to decide on the number and size of flash LUNs to create to provide sufficient storage to hold a copy of the data. Theoretically you could just create one flash LUN to hold a copy of the data, or you could create more LUNs than you already have. Since AIX can have many in-flight IOs to a single LUN (based on its queue_depth attribute), the author generally prefers fewer larger LUNs as compared to a larger number of smaller LUNs, and generally uses scalable VGs with a manageable number of LUNs. But we can have too few LUNs as well, as the hdisk driver is single threaded. So when two IOs are simultaneously requested for a hdisk, the second one handled will have a small bit of latency added to the IO while it waits to be handled by the hdisk driver. If you plan your LUNs so that they don't exceed 5,000 IOPS each, then the single threaded hdisk driver won't be an issue.
Another consideration in choosing your flash LUN sizes is performance in the event of the failure of a flash hdisk. Assuming a flash hdisk might fail while the rest of the flash storage continues working, then from a performance perspective under a flash failure scenario, we do better with more flash hdisks than with fewer. Consider the example with one flash hdisk; if the flash fails we'll be doing all our IO from the HDDs. If we have several flash hdisks and only one fails, then we'll still be doing reads from the remaining flash, with a lower performance impact.
One has more flexibility in choosing LUN sizes from an IBM FlashSystem than when using SAS attached SSDs where an entire RAID array is a LUN. If using SAS attached SSDs, using single disk RAID 0 arrays is often a good choice for the flexibility it provides.
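The 5,000 IOPS per LUN guideline translates into a minimum LUN count as sketched below. The sketch assumes IO is spread evenly across the LUNs, which depends on how the LVs are laid out; the peak IOPS and capacity figures are illustrative.

```python
import math

def plan_flash_luns(peak_iops: float, capacity_gb: float,
                    max_iops_per_lun: float = 5_000):
    """Minimum LUN count that keeps each LUN at or under the IOPS
    guideline, and the LUN size needed to hold one copy of the data.

    Assumes IO is spread evenly across the LUNs.
    """
    lun_count = max(1, math.ceil(peak_iops / max_iops_per_lun))
    lun_size_gb = math.ceil(capacity_gb / lun_count)
    return lun_count, lun_size_gb

# Hypothetical VG: 10 TB of data peaking at 42,000 IOPS.
lun_count, lun_size_gb = plan_flash_luns(peak_iops=42_000, capacity_gb=10_240)
print(lun_count, lun_size_gb)
```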
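The failure scenario above can be put into rough numbers. This sketch assumes equal-size flash hdisks with reads spread evenly across them, and uses illustrative service times of 0.5 ms for flash and 5 ms for HDD (the text only says "a fraction of a millisecond" and "5 to 10 ms").

```python
def avg_read_latency_ms(flash_hdisks: int, failed: int = 1,
                        flash_ms: float = 0.5, hdd_ms: float = 5.0) -> float:
    """Average read latency when `failed` of the flash hdisks are lost
    and their reads fall back to the HDD copy.

    Assumes reads are spread evenly over equal-size flash hdisks.
    """
    if flash_hdisks < 1 or not 0 <= failed <= flash_hdisks:
        raise ValueError("need 0 <= failed <= flash_hdisks and flash_hdisks >= 1")
    on_flash = (flash_hdisks - failed) / flash_hdisks
    return on_flash * flash_ms + (1 - on_flash) * hdd_ms

one_lun = avg_read_latency_ms(flash_hdisks=1)     # all reads fall back to HDD
eight_luns = avg_read_latency_ms(flash_hdisks=8)  # only 1/8 of reads fall back
print(one_lun, eight_luns)
```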
You may also want to eliminate unallocated space in the VG so we don't waste flash space to mirror it.
Finally, if using VSCSI to map flash hdisks to VIOCs, you'll want a dedicated VSCSI/vhost adapter for each hdisk if you expect high IOPS rates to it. A flash hdisk can do such a high rate of IOPS that it can utilize the entire bandwidth of the VSCSI/vhost adapter.
Attaching and Configuring the Flash
Attaching new storage to an LPAR requires that the filesets supporting the hardware are installed. If the flash is SAN attached, then the SAN/storage/host administrators need to design the paths (via the SAN zoning and the storage LUN masking) so there aren't too many of them.
Then one attaches the flash to the host, and configures the RAID arrays and LUNs. If SAN attached, the storage administrator creates the host object (a list of host FC port WWPNs to which LUNs are connected - the terminology varies across disk subsystems), creates the LUNs, then assigns the LUNs to the host.
The AIX administrator then runs cfgmgr to configure the flash hdisks on AIX.
LVM Mirroring and Walking the Mirror
Finally the AIX administrator adds the hdisks to the VG, mirrors the data and sets the LV scheduling policy to parallel/sequential indicating writes will be done to both copies in parallel, and reads will come from the first LVM copy which should be on flash.
When synchronizing the mirrors, this creates additional IO workload for the server and might affect application performance due to the additional read IOs from the existing storage. If this is a concern, one can throttle the mirroring by specifying the number of physical partitions to sync in parallel, via the -P flag of the syncvg command. Once mirroring starts you can look at the write rate to the new disks to estimate the time to fully synchronize the mirrors. And one can interrupt the mirroring and restart it later with a different number of physical partitions to sync in parallel, to minimize the impact to production or to get the data synchronized sooner. If taking this approach, it's important to create the mirrors specifying to not synchronize them. By default, LVM synchronizes one PP at a time; thus, specifying to create the mirrors and synchronize the copies will be the slowest approach to getting the mirrors synchronized.
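Estimating the remaining synchronization time from the observed write rate is simple arithmetic, sketched below. It assumes the write rate to the new disks stays roughly constant, which will not hold if production load changes or the number of partitions synced in parallel is adjusted; the sizes and rate are illustrative.

```python
def sync_seconds_remaining(total_gb: float, synced_gb: float,
                           write_mb_per_s: float) -> float:
    """Seconds left to synchronize, given the write rate observed
    on the new mirror copy (assumed constant)."""
    if write_mb_per_s <= 0:
        raise ValueError("write rate must be positive")
    remaining_mb = (total_gb - synced_gb) * 1024
    return remaining_mb / write_mb_per_s

# Hypothetical: 1,500 GB to mirror, 500 GB done, writing at 100 MB/s.
seconds = sync_seconds_remaining(total_gb=1_500, synced_gb=500,
                                 write_mb_per_s=100)
print(f"{seconds / 3600:.1f} hours remaining")
```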
For new implementations, one would create the LVM structures on flash first (to get the first LVM copy on it), then mirror to the HDDs. When bringing in flash to improve existing performance, the first LVM copy already resides on HDDs (or SAN disk). So we perform a procedure referred to as LVM mirror walking, where we have to move the first LVM copy to flash, then mirror it back to the HDDs (or SAN disk). LVM requires that we stop using an LV to change its scheduling policy.
Another approach for customers who always want a complete copy to reside on their original disk subsystem, is to use this procedure:
- Create a second LVM copy on the FlashSystem
- Create a third LVM copy on the original disk subsystem
- Remove the first LVM copy on the original disk subsystem
This results in the first copy residing on the FlashSystem as desired, while a complete copy of the data always resides on the original disk subsystem.
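The three-copy procedure above can be written down as a dry-run plan. The sketch below only builds the command strings and runs nothing. The VG, LV and hdisk names are hypothetical, placing the third copy on newly added LUNs from the original subsystem is an assumption, and the command names and flags (extendvg, mklvcopy, syncvg -P, rmlvcopy, chlv -d ps, chvg -Q n) are given as I recall them from the AIX documentation; check each against the man pages on your system before use.

```python
def mirror_walk_plan(vg, lv, flash_pvs, new_orig_pvs, old_orig_pvs,
                     parallel_pps=8):
    """Build (but do not run) the commands for one LV.

    All names passed in are hypothetical; flags should be verified
    against the AIX man pages before anything is executed.
    """
    flash = " ".join(flash_pvs)
    new_orig = " ".join(new_orig_pvs)
    old_orig = " ".join(old_orig_pvs)
    return [
        f"extendvg {vg} {flash} {new_orig}",   # add the new PVs to the VG
        f"mklvcopy {lv} 2 {flash}",            # second copy on flash, not synced yet
        f"mklvcopy {lv} 3 {new_orig}",         # third copy on the original subsystem
        f"syncvg -P {parallel_pps} -l {lv}",   # sync, several partitions in parallel
        f"rmlvcopy {lv} 2 {old_orig}",         # drop the old first copy
        f"chlv -d ps {lv}",                    # parallel write, sequential read (LV not in use)
        f"chvg -Q n {vg}",                     # turn quorum off
    ]

plan = mirror_walk_plan("datavg", "datalv",
                        flash_pvs=["hdisk10"],
                        new_orig_pvs=["hdisk20"],
                        old_orig_pvs=["hdisk2", "hdisk3"])
for command in plan:
    print(command)
```

Generating the plan as text first makes it easy to review the order of operations, and to adjust the number of partitions synced in parallel, before touching a production VG.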
Glenn Fujimoto has provided an excellent presentation, which follows, that covers two topics: migrating data from one storage frame to another using LVM mirroring, and bringing in flash storage as a read preferred mirror for improved performance. The first topic is geared toward LVM data migration, but also covers reducing the number of LUNs when moving to XIV storage, first reducing the number of HDD backed LUNs from 9 to 2. For XIV storage a smaller number of larger LUNs is usually better, and this consolidation often occurs when migrating from other storage. The second topic then covers creating 9 flash LUNs and walking the first copy back to the flash for read preferred performance enhancement. The presentation shows the AIX commands to do it, so it's a good example of LUN consolidation, bringing in flash, the LVM concepts and the LVM commands involved.
LVM Mirror Walking - Flash Read Preferred-4.pptx
Installation and Migration
IBM Power Systems; IBM System p Family; IBM System Storage
AIX Power FlashSystem LVM SSD