Storage QoS – How Hard Can It Be?


Following up from yesterday’s post and the comments received, I thought it was worth considering how hard it would actually be to implement quality of service within a storage array.

Almost all storage arrays available today (whether physical or virtual) work on the assumption that I/O should be delivered as fast as possible.  This is not an unreasonable premise, especially considering previous history, where storage was the slowest component in the architecture – why bother slowing things down deliberately?  Where different service levels were required, this could be implemented by deploying cheaper hardware or different tiers (speeds/capacities) of disk.  Each tier provides a set performance level in terms of IOPS and response time and of course is matched by a different cost.

Cloud environments are different.  There’s a desire to virtualise everything (network, compute, storage) because fine levels of granularity result in a more efficient service and make it possible to charge the customer for every increment in their use of resources.  It would be impractical to design a private or public cloud with many fixed storage tiers, as even with economies of scale there would be significant wastage (a storage array in the cloud context could simply mean a server with multiple disks in it).  It would also be poor service to expect a customer to migrate their server to another tier every time they needed another 100 IOPS to cope with growth.  So, the answer is to create an infrastructure that offers variable IOPS and response times.  How?  Let’s start by thinking about the I/O process itself.

 

The I/O Process

Storage arrays have evolved to cater for the slowest component in the architecture – the hard drive.  As a result, we see techniques employed to manage unpredictable response times that are an order of magnitude (or more) greater than those of the processor and memory within the array itself.  A typical I/O will be received on a front-end storage port (FC, iSCSI, FCoE, it doesn’t matter) and added to a queue in cache.  The cache performs a number of functions: it batches up the requests that are pending to disk; it enables I/O acknowledgements to go to the host before data is committed to disk; it holds data that has been read so that repeat requests can be serviced faster out of memory (and may prefetch reads too); and in the latest architectures the cache manages compression and de-duplication before data is permanently stored.  I’ve simplified the functionality here for clarity – obviously I/O processing involves many other tasks.

From end-to-end, an I/O is received into cache, queued, eventually gets to be processed, is read from or written to disk, then put back into cache for forwarding to the originating host.  During that time, I/O may be processed out of sequence in order to get better disk throughput.  I/O can also be delayed by local and remote replication and of course as already mentioned, deduplication and compression.
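To make the out-of-sequence processing concrete, here is a minimal sketch of one well-known reordering approach, an elevator-style sweep over the pending block addresses.  This is only an illustration: the post doesn’t say which algorithm any particular array uses, and the names elevator_order and seek_distance, along with the block addresses, are made up for the example.

```python
def elevator_order(pending_lbas, head):
    """Reorder pending I/Os elevator-style (hypothetical helper).

    Service everything at or beyond the current head position in
    ascending order, then sweep back through the rest in descending order.
    """
    ahead = sorted(lba for lba in pending_lbas if lba >= head)
    behind = sorted((lba for lba in pending_lbas if lba < head), reverse=True)
    return ahead + behind


def seek_distance(order, head):
    """Total head movement needed to service I/Os in the given order."""
    total = 0
    for lba in order:
        total += abs(lba - head)
        head = lba
    return total


arrival_order = [90, 10, 50, 30]   # made-up block addresses, in arrival order
head = 40                          # assumed current head position

reordered = elevator_order(arrival_order, head)
print(reordered)                              # [50, 90, 30, 10]
print(seek_distance(arrival_order, head))     # 190
print(seek_distance(reordered, head))         # 130
```

The reordered queue moves the head less in total, but the I/O that arrived second (address 10) is now serviced last, which is exactly the kind of variability a QoS layer has to account for.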

 

Implementing QoS

So how could we implement QoS?  Firstly for block storage, QoS could be applied to a LUN and tracked at that level.  As the I/O comes in, either it is delayed before processing or delayed before confirmation to the user.  It seems to be more logical to delay before processing, as any pending I/O wouldn’t be committed to disk and require back out in the case of cache failure.  What that means is the QoS component would need to delay processing of the I/O until the prescribed time interval had elapsed, minus the processing time.  For example, if a 5ms response time was desired and processing the I/O takes 1ms, then the I/O would be processed 4ms after being received.
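The arithmetic in that example can be sketched in a few lines.  This is a rough illustration under simple assumptions rather than how any array actually does it: hold_time_ms is a hypothetical helper, and it assumes the processing time is known (or can be estimated) before the I/O starts.

```python
def hold_time_ms(target_ms, service_ms):
    """How long to hold an I/O before processing it (hypothetical helper).

    target_ms  -- response time the LUN's QoS policy asks for
    service_ms -- estimated time to process the I/O once it starts
    """
    # If processing already takes longer than the target, there is
    # nothing left to delay, so the hold time is clamped at zero.
    return max(0.0, target_ms - service_ms)


print(hold_time_ms(5.0, 1.0))   # 4.0 -> process 4ms after receipt
print(hold_time_ms(5.0, 7.0))   # 0.0 -> target already missed, don't add delay
```

The second call shows the limit of this approach: delaying can only slow an I/O down to a target, it can’t speed up one that is already too slow.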

With spinning disks, the ability to guarantee that an I/O could be processed consistently would be difficult.  Hard drives don’t deliver consistent I/O response, especially with mixed sequential and random workloads.  Solid state devices however, are more predictable and have response times significantly faster than disk media.  An SSD array would be much more suited to delivering QoS, when an I/O only takes microseconds to complete and that process is 99.999% guaranteed to complete in a predictable time.

One final consideration: delaying I/O when the host is capable of overwhelming the storage array means careful management is required.  Fibre Channel implements queue depth processing; no more I/O can be started to a LUN once the queue is full – the same can’t be said for iSCSI.
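As a sketch of what queue depth processing means in practice, the toy class below refuses new I/O to a LUN once a fixed number are outstanding.  LunQueue, its depth parameter and the I/O names are all hypothetical, and real protocol behaviour is considerably more involved.

```python
from collections import deque


class LunQueue:
    """Toy per-LUN queue with a fixed depth (hypothetical, for illustration)."""

    def __init__(self, depth):
        self.depth = depth
        self.outstanding = deque()

    def submit(self, io_id):
        """Accept the I/O if there is room, otherwise refuse it."""
        if len(self.outstanding) >= self.depth:
            return False          # queue full: the host must wait and retry
        self.outstanding.append(io_id)
        return True

    def complete(self):
        """Finish the oldest outstanding I/O, freeing a slot."""
        return self.outstanding.popleft()


lun = LunQueue(depth=2)
print(lun.submit("io-1"))   # True
print(lun.submit("io-2"))   # True
print(lun.submit("io-3"))   # False, queue is full
lun.complete()              # io-1 finishes
print(lun.submit("io-3"))   # True, a slot is free again
```

If a QoS layer deliberately holds I/Os, they occupy queue slots for longer, so the depth has to be sized with the added delay in mind.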

Summary

As we know, SolidFire have implemented QoS.  It appears that NexGen have also implemented a QoS feature called ioControl in their all-flash arrays (thanks Arjan).  QoS could be yet another good reason to move away from hard drives and implement all-flash devices.

About Chris M Evans

  • http://twitter.com/jungledave Dave Wright

    Using SSDs instead of Disk certainly makes QoS easier, but there are additional complexities to consider. For example, if there is a DRAM cache, you need to make sure that it can’t be monopolized (based on IO pattern) by a small number of clients. If a lightly loaded array serves most IO cached, but then gets more heavily loaded, performance will degrade significantly.
    Another issue is one of hot-spots – if you are sharing disks between multiple apps, and a certain app has heavy IO to a certain disk, it will almost certainly cause performance issues to other apps doing IO against that disk (and this can apply to SSD or disk).
    IO patterns (and IO sizes) can also result in very different performance out of the storage system, and need to be compensated for in order to provide consistent performance.

    For these and other reasons, most “QoS” features in storage systems today (including NexGen, 3PAR, EMC, etc) are priority based, meaning they can give more or less resources to high (or low) priority apps, but don’t necessarily guarantee a specific level of performance to anyone. That’s useful for traditional enterprise scenarios where you know (relatively) how important one application is versus another, but virtually useless in a cloud where applications are owned by different tenants (either public or private).

    SolidFire’s QoS is unique in its ability to give consistent performance at specific levels across many apps at the same time. Doing QoS right requires architectural design around it, not just a layer of functionality on top.

  • http://twitter.com/StorageOlogist Lee Johns

    The problem with QoS is that it tends to assume that you know what QoS you want for the applications that you have. In most cases customers don’t. (I agree with David that this is different in a service provider environment where you have multiple customers or tenants.) In most mid-size enterprises the issue is how to confidently consolidate applications, and vendors who are only providing prioritization on top of an iSCSI system are not doing that for them. Their issues span NAS and SAN, virtual and physical, IOP oriented and throughput oriented, read and write intensive, sequential and random workloads, and they do not have a good way to understand these applications. That is why at Starboard Storage we have settled on providing predictable performance for mixed workloads with our hybrid storage by thin provisioning performance. People understand thin provisioning of disk, but thin provisioning of performance is a newer concept. Basically we enable a thin layer of SSD to act as an accelerator for applications based on the real time workload. The customer does not have to understand all the differing metrics of every application. However, to make sure that we prioritize resources for their most important apps, our caching algorithms have the concept of “Quality” associated with them to ensure that high priority apps are kept in cache longer. You use less SSD, get better $/IOP and $/GB and can confidently consolidate all of your workloads. You can read more about quality metrics in caching algorithms on my blog http://blog.starboardstorage.com/blog/bid/233265/Storage-SSD-Caching-Explained

  • http://twitter.com/dpironet Didier Pironet

    Great write-up.
    Pillar Data Systems, now Oracle Pillar, has been doing QoS for many years now…
