Architecture Matters – Part II

This blog is the second in a series that focuses on architecture, specifically architecture for scale-out storage. In the last blog, I told the story of ‘Galloping Gertie’, the Tacoma Narrows Bridge. If you didn’t have a chance to read it, please do; you’ll find it here. The thesis was that scale-out is not just a single namespace, but a true single filesystem and true scaling of resources as nodes are added. The key difference between a true scale-out design and those designs that merely appear to be scale-out – just as Galloping Gertie appeared to be a stable suspension bridge but in reality was flawed – is that the additional nodes contribute to all functions for all other nodes and all entities (files) within the architecture.

In this blog, I’ll focus on the underpinning of any scale-out storage design – the management and handling of disk drives, specifically rotating drives (hard disk, a.k.a. HDD). In another blog to come, I’ll talk about the management of solid state drives (SSD) – that’s another highly interesting topic. But back to HDD.

As we all learned in college, the de facto standard technique to protect block data against drive failure is RAID. Patterson, Gibson and Katz wrote their seminal paper back in 1988. The paper described the technique of using block-level parity and outlined several implementations, in particular RAID-4 and RAID-5, both still in use today. Here’s a URL – you should absolutely read it. It’s fundamental to understanding where we are today, and how we got here.

Fast forward to today – do you believe that the authors could have foreseen, 23 years ago, the advent of the 3TB HDD? Let alone 4TB, 6TB or 8TB – which should make their debut 12-18 months from now? If they did, my hat is off to them. However, what few in the industry anticipated was the negative business impact of performing RAID rebuilds on drives of that size. Another impact of large drives is the reality of uncorrectable read errors, or UREs for short (often pronounced ‘yoo-rees’), during rebuilds. I will blog on the subject of UREs later, and their potentially devastating impact on pseudo-scale-out designs that incorporate file controllers (heads) over RAID.

As many of you know, EMC Isilon does not use a heads-over-RAID design; in fact, it does not use legacy RAID at all. At EMC Isilon, we use file-level forward error correction based on a Reed-Solomon approach. In layman’s terms, we protect your files – which is what matters to enterprises. Drives will fail, nodes will lose power, but files live forever. In the EMC Isilon scale-out architecture, you can lose up to 144 drives simultaneously – 4 entire nodes’ worth – and still have files protected and accessible. Oh, and just for the record, that’s at 80% storage efficiency – not 50% as the RAID-10 advocates would have you believe is necessary to avoid business risk and survive a massive drive failure.
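For those who like to see the arithmetic, here is a minimal sketch of that efficiency comparison. The 16+4 layout below is purely an illustrative k+m example I chose for this post, not a description of how any particular protection level lays out its stripes:

```python
# Illustrative only: usable-capacity efficiency of a generic k+m erasure
# code versus RAID-10 mirroring. The 16+4 layout is a hypothetical example.

def efficiency(data_units: int, protection_units: int) -> float:
    """Fraction of raw capacity available for user data."""
    return data_units / (data_units + protection_units)

print(f"16+4 erasure code: {efficiency(16, 4):.0%} usable")    # 80% usable
print(f"RAID-10 (1+1 mirror): {efficiency(1, 1):.0%} usable")  # 50% usable
```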

So, you say, how do large drives affect system reliability and hence business risk? Let’s do some math. If I have a 4TB drive @ 140 MB/sec (a typical maximum throughput for large SATA drives), it takes 28,571 seconds to rebuild it entirely. That’s about 8 hours (7.94, for those keeping score). Two hours per terabyte. Not bad, you say? Well, consider that’s just writing the rebuilt data to that drive. In reality, I have to read at least 4TB to rebuild that drive – and that’s in RAID-1, a mirrored pair. For single and double parity schemes, I have to read N*4 TB, where N is the RAID group factor – for example, 8 in an 8+1 RAID-5 or an 8+2 RAID-6. Both of those are very common RAID groups. That means I have to read 32TB to rebuild 4TB, and at the same 140 MB/sec that takes 8*8=64 hours. Plus, I’m being generous – because that’s not considering the computation (CPU) time required to recalculate parity.
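Here is that ‘perfect world’ arithmetic as a small sketch, assuming the rebuild streams at a single drive’s 140 MB/sec and that the full contents of the RAID group must be read back to reconstruct the failed drive:

```python
# 'Perfect world' rebuild time: read raid_group_factor * drive_tb of data
# at a single drive's streaming rate. No seek overhead, no parity math,
# no competing workload.

def perfect_rebuild_hours(drive_tb: float, raid_group_factor: int,
                          throughput_mb_s: float = 140.0) -> float:
    tb_to_read = drive_tb * raid_group_factor               # e.g. 8 for 8+1 / 8+2
    seconds = (tb_to_read * 1_000_000) / throughput_mb_s    # 1 TB = 1,000,000 MB
    return seconds / 3600

print(perfect_rebuild_hours(4, 1))   # ~7.9 hours: the 4TB write (or RAID-1 read)
print(perfect_rebuild_hours(4, 8))   # ~63.5 hours: the 32TB read for an 8+1 or 8+2 group
```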

Still, you say, 64 hours, that’s less than 3 days, and I use RAID-6. No problem, right? Well, not right – the 64-hour figure assumes the array is doing nothing else but reading those particular 32 TB of data to rebuild a particular 4TB hot spare drive. In the real world, rebuilds often take anywhere from 3-5x the ‘perfect’ time. That means your 4TB rebuild might well take 192 hours…or more. Two days per terabyte. That’s eight days…up to twelve or thirteen days. This is why, at EMC Isilon, we repair files into free space – the remaining drives. The more drives you have in the cluster, the faster file repair goes. (N-1) drives reading, (N-1) drives writing. Parallel and concurrent, real scale-out. This is not the case with legacy heads-over-RAID. You’re stuck in a small RAID group. It doesn’t scale.
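To see where the eight-to-thirteen-day figure comes from, and why repairing into free space changes the picture, here is a deliberately simplified model. The assumption that repair bandwidth grows linearly with the number of surviving drives is an idealization for illustration, not a measured Isilon number:

```python
# Apply the 3-5x real-world multiplier to the 64-hour 'perfect' rebuild,
# then contrast with a toy model of repairing into free space where the
# surviving drives read and write concurrently.

PERFECT_HOURS = 64   # 4TB drive, 8+2 group, from the arithmetic above

for factor in (3, 5):
    hours = PERFECT_HOURS * factor
    print(f"{factor}x real-world: {hours} hours (~{hours / 24:.0f} days)")

def parallel_repair_hours(data_tb: float, surviving_drives: int,
                          per_drive_mb_s: float = 140.0) -> float:
    """Toy model: aggregate repair bandwidth scales with surviving drives."""
    aggregate_mb_s = per_drive_mb_s * surviving_drives
    return (data_tb * 1_000_000) / aggregate_mb_s / 3600

# Repairing 4TB of affected file data across 35 surviving drives (toy numbers):
print(f"{parallel_repair_hours(4, 35):.2f} hours")   # ~0.23 hours in this idealized model
```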

Yes, I can hear it now – “Rob, sheesh, we don’t rebuild entire drives anymore, we rebuild only the ‘black space’”, i.e. space that has been written to previously. OK, I’ll buy that. Let’s say you only consume 50% of the 4TB drive, or 2TB. Your rebuild still takes four to six days in a real-world scenario. Plus, your enterprise is now in the situation where it has wasted half its money on capacity – after all, you’re only consuming half of what you paid for. This is classic legacy thinking: short-stroking drives not for performance reasons but for reliability and business risk reasons. Yes, it’s come to that for legacy RAID.
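The ‘four to six days’ figure follows from the same arithmetic, scaled by utilization, under the assumption that rebuild work is proportional to the used fraction of the drive:

```python
# Rebuild time when only the used fraction of the drive is reconstructed.

def used_space_rebuild_days(drive_tb, used_fraction, raid_group_factor,
                            real_world_factor, throughput_mb_s=140.0):
    tb_to_read = drive_tb * used_fraction * raid_group_factor
    perfect_hours = (tb_to_read * 1_000_000) / throughput_mb_s / 3600
    return perfect_hours * real_world_factor / 24

print(round(used_space_rebuild_days(4, 0.5, 8, 3), 1))   # ~4.0 days at 50% utilization
print(round(used_space_rebuild_days(4, 0.5, 8, 5), 1))   # ~6.6 days at 50% utilization
```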

There is little mystery why some NAS vendors only offer 2TB drives today – they are, in plain English, scared to death of rebuilds. It’s a dirty little secret in storage. You should be worried too, if you use NAS that calls itself ‘clustered’ or ‘scale-out’ but is really heads-over-RAID. NAS vendors with integrity will call it like it is – scale-up. Don’t get me wrong – there’s nothing wrong with scale-up, at small scale. But small scale is not where enterprises are today. In my personal travels, I visit IT and business executives at enterprises every week, and the conversation is almost always in petabytes; at minimum, several hundreds of terabytes. Sometimes the conversation is in tens of petabytes. Enter the 8TB drive.

I bet you get the point by now – the light bulb is on, and you see the architectural risk of large drives in RAID-based NAS. Now, wait until the 8TB drives come. Oh, and the answer is not triple parity – that is a Hobson’s choice: the 8+3 RAID group, even more inefficiency, even more wasted space. OK, do a more efficient 13+3, you say? Now you are reading 13TB to rebuild each terabyte. The math is even worse. At the same 3-5x real-world factor, it will take anywhere from 26 to 43 days to finish an 8TB rebuild. Now enter the UREs – the subject of my next blog.
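Running the same arithmetic for a hypothetical 8TB drive in a 13+3 group, again at 140 MB/sec with the 3-5x real-world multiplier applied:

```python
# 8TB drive, 13+3 group: read 13 x 8TB = 104TB to rebuild one drive.

def rebuild_days(drive_tb, raid_group_factor, real_world_factor,
                 throughput_mb_s=140.0):
    tb_to_read = drive_tb * raid_group_factor
    perfect_hours = (tb_to_read * 1_000_000) / throughput_mb_s / 3600
    return perfect_hours * real_world_factor / 24

print(round(rebuild_days(8, 13, 3)))   # ~26 days at the optimistic end
print(round(rebuild_days(8, 13, 5)))   # ~43 days at the pessimistic end
```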

Einstein reminds us that the same thinking that created the problems at hand cannot solve those problems – and you know the rest of that saying. We at EMC Isilon have learned and continue to learn from the design flaws of legacy architectures past and from the thinly veiled continuation of those same flaws today (e.g. overlays, concatenation, groups of pairs), and we have used design innovation to overcome them. It is my hope that consumers of scale-out storage take heed of those lessons and do not mistake today’s architectures that appear to be, or are touted to be, scale-out for the real thing.

About the Author: Robert Peglar