Software Defined Storage Availability (Part 1): Why Do With Three What You Can Do With Two?

Must an enterprise deploy three or more replicas of data, or some form of erasure coding, to achieve enterprise-class availability (99.9999% uptime)? Several software-defined storage (SDS) vendors insist that it must. But is this true? We will discuss this complex and important topic methodically in a series of three blog posts.

Aircraft manufacturing is a great example of an industry that goes to great lengths to ensure maximum passenger safety. When you boarded your last flight, did you board an aircraft powered by two jet engines or four? Do you generally feel unsafe when you fly on aircraft with two jet engines? Passengers, airlines and airline industry experts all agree that today’s two-engine aircraft provide the same level of safety as aircraft with four jet engines. Why is this so? It is because innovations in engine and aircraft technologies have made a single engine very powerful and efficient: powerful enough for the aircraft to take off, cruise or land on just one engine. And once safety is ensured, superior economics determines the winner. Because two engines are more economical than four, two-engine aircraft are hugely popular with airlines (lower costs) and customers (lower ticket prices).

The aircraft analogy applies well to enterprise storage: on a well-architected system, enterprise customers can enjoy high levels of availability with fewer copies of data, at much lower cost, and without any form of erasure coding. Dell EMC’s industry-leading ScaleIO is one such system. Designed for demanding enterprises, ScaleIO is a data center grade software-defined storage solution that regularly meets and exceeds six or even seven nines of availability.

5-9’s or 6-9’s? Definition of Enterprise Availability SLAs

The availability of a system is defined in terms of how much time the system is up and running throughout the year. A popular measure of availability is the percentage of uptime:

Availability | Downtime per year
90% (“one nine”) | 36.5 days
99% (“two nines”) | 3.65 days
99.9% (“three nines”) | 8.76 hours
99.99% (“four nines”) | 52.56 minutes
99.999% (“five nines”) | 5.26 minutes
99.9999% (“six nines”) | 31.5 seconds
99.99999% (“seven nines”) | 3.15 seconds

For example, 6-9’s of availability means that a system is unavailable for no more than about 31.5 seconds in a year.
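The downtime figures above follow directly from the availability percentage. Here is a minimal sketch of the arithmetic, assuming a 365-day year; the helper name `downtime_seconds_per_year` is ours and is used only for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap, 365-day year


def downtime_seconds_per_year(nines: int) -> float:
    """Return the yearly downtime budget, in seconds, for a given number of nines."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    unavailability = 10.0 ** (-nines)  # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    # Print the downtime budget for one through seven nines.
    for n in range(1, 8):
        print(f"{n} nines: {downtime_seconds_per_year(n):,.2f} seconds per year")
```

Dividing the result by 60, 3,600 or 86,400 gives the same budget in minutes, hours or days.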

The ScaleIO Magic

Using ScaleIO you can easily build a cluster with 6-9’s of availability or more. ScaleIO’s declustered RAID technology, together with its ability to detect failures quickly and rebuild fast, gives our customers predictable and consistent performance and 6-9’s or more availability with 33% less raw storage capacity (and proportionately lower cost) than other vendors’ solutions that demand 3 copies of data.
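The 33% figure is simple arithmetic on raw capacity: storing two copies instead of three needs one third less raw storage for the same usable data. A minimal sketch, assuming spare capacity and metadata overhead are ignored (real sizing reserves additional space); the function names are hypothetical:

```python
def raw_capacity_tb(usable_tb: float, copies: int) -> float:
    """Raw capacity needed to hold `usable_tb` of data with `copies` full replicas."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return usable_tb * copies


def capacity_savings(fewer_copies: int, more_copies: int) -> float:
    """Fraction of raw capacity saved by keeping fewer full replicas."""
    return 1 - fewer_copies / more_copies


if __name__ == "__main__":
    usable = 500.0  # illustrative usable capacity in TB, not a figure from this post
    print(f"2 copies: {raw_capacity_tb(usable, 2):.0f} TB raw")
    print(f"3 copies: {raw_capacity_tb(usable, 3):.0f} TB raw")
    print(f"Savings:  {capacity_savings(2, 3):.0%}")
```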

The ScaleIO architecture utilizes multiple mechanisms to achieve enterprise grade availability:

  • The ScaleIO data protection scheme is based on full replicas deployed in a declustered RAID scheme. This scheme offers superior recovery time with enterprise availability and provides consistent performance for customers’ applications even during a drive or node rebuild.
  • ScaleIO’s rebuild process utilizes an efficient many-to-many scheme. The rebuild is invoked in the event of a drive or node failure, and because many devices work on it in parallel, it completes very quickly.
  • ScaleIO uses ALL SDS devices in the storage pool for rebuild operations. For example, if a pool has 200 drives and one drive fails, the other 199 drives are utilized to rebuild the data of the failed drive. This results in extremely quick rebuilds with minimum impact on application performance.
  • ScaleIO detects a disk failure in seconds. This time includes the time it takes the Operating System to detect the issue plus the time it takes ScaleIO to start the rebuild process. ScaleIO starts the rebuild immediately after detecting the disk failure while some software-defined storage solutions wait a considerable amount of time before starting the rebuild process (typically tens of minutes and some have a default of an hour). As will be shown later, the inability to detect failure quickly, and then start the rebuild process immediately, significantly reduces the availability of those solutions.
  • ScaleIO architectural innovations such as Protection Domains, Storage Pools and Fault Sets help customers manage failure domains very flexibly and in some cases improve availability.
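To see why detection time and rebuild parallelism matter, consider a deliberately simplified model of the exposure window after a drive failure. This is our own sketch, not the model used by the ScaleIO sizing tool: it assumes that rebuild work is spread evenly across all surviving drives, that drives fail independently at a constant rate, and that every number in the example (rebuild throughput, detection delays, annual failure rate) is illustrative rather than measured.

```python
import math

HOURS_PER_YEAR = 365 * 24


def exposure_window_hours(drive_tb: float, helpers: int,
                          per_drive_mb_s: float, detect_delay_s: float) -> float:
    """Detection delay plus rebuild time, in hours (simplified model)."""
    data_mb = drive_tb * 1_000_000  # decimal TB -> MB
    # Many-to-many rebuild: each helper drive handles an equal share of the data.
    rebuild_s = data_mb / (helpers * per_drive_mb_s)
    return (detect_delay_s + rebuild_s) / 3600


def second_failure_probability(surviving_drives: int, afr: float,
                               window_hours: float) -> float:
    """Chance that at least one more drive fails inside the window,
    assuming independent failures at a constant rate derived from the AFR."""
    rate_per_hour = afr / HOURS_PER_YEAR
    return 1 - math.exp(-surviving_drives * rate_per_hour * window_hours)


if __name__ == "__main__":
    # Illustrative assumptions only: 200-drive pool, 960GB drives,
    # 50 MB/s of rebuild throughput per drive, 2% annual failure rate.
    scenarios = {
        "rebuild starts after 10 seconds": 10,
        "rebuild starts after 1 hour": 3600,
    }
    for label, delay_s in scenarios.items():
        window = exposure_window_hours(0.96, 199, 50, delay_s)
        risk = second_failure_probability(199, 0.02, window)
        print(f"{label}: window = {window:.3f} h, second-failure risk = {risk:.2e}")
```

In this model a long detection delay dominates the exposure window once the rebuild itself is fast, which is the intuition behind starting the rebuild immediately.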

Read more on ScaleIO’s unique architecture here.

Since availability calculations are complex, ScaleIO provides its customers with free online tools for building hardware configurations based on ScaleIO Ready Nodes. The tools produce comprehensive availability numbers that account for multiple possible failure scenarios. We advise customers to use this tool to build a hardware configuration that meets the desired system availability target. If you are a ScaleIO customer, the tool can be accessed here.

Here are some sample configurations, each with availability of 6-9’s or higher:

Configuration Number | Configuration Detail | Availability | Full Sizer Output
1 | 30 x R640 Servers, 10 x 3.84TB SAS SSDs, 2 x 25GbE network, 1PB raw, 1 storage pool | 99.9999179% | Click here
2 | 49 x R740dx Servers, 24 x 960GB SAS SSDs, 2 x 10GbE network, one storage pool 1PB raw, 3 storage pools w/ 196 devices per pool | 99.9999669% (per storage pool) | Click here
3 | 5 x R740dx Servers, 24 x 960GB SAS SSDs, 2 x 10GbE, one storage pool, 100TB raw, 3 storage pools w/ 40 devices per pool | 99.9999968% (per storage pool) | Click here
4 | 12 x R640 Servers, 10 x 3.84TB SAS SSDs, 2 x 10GbE, 1 storage pool, 250TB raw | 99.9999846% | Click here
5 | 72 x R640 Servers, 10 x 3.84TB SAS SSDs, 2 x 25GbE, 1 storage pool, 2500TB raw | 99.9999460% | Click here

As you can see, a variety of highly customizable configurations is possible, with varying performance and capacity to meet our customers’ needs, and every configuration is assured of 6-9’s of availability or higher. So, just as modern technology lets two-engine jets provide the same safety as four-engine jets, the ScaleIO architecture is capable of delivering 6-9’s of availability with just 2 replicas of data, at 33% lower cost than 3 replicas. So: why do with three what you can do with two?

About the Author: Tamir Segal