In my last blog post, I talked about the importance of asking “why” we do things the way we do, and of rethinking our designs in light of today’s realities rather than yesterday’s constraints. Let’s apply that approach to storage.
First off, what are the usage patterns which either predominate now or are expected to predominate? The first is tied to the rise of mobile devices, tablets, ubiquitous web access, and the like – the expectation that information should be globally available, at any time, across multiple end-user devices. The information is also expected to be personalized, tailored to the individual viewing it – like my Facebook page, my Netflix preferences, my Yahoo newsfeed, etc. From a storage point of view, this means the system must be able to deliver thousands, millions, or perhaps billions of personalized Facebook pages, Netflix preferences, and the like, simultaneously, all over the globe.
Given that pattern, what storage system could match it? The system would need to support hundreds of thousands of simultaneous network connections, some fast, some slow, but none driving a tremendous amount of bandwidth individually. This calls for a scale-out architecture, where many nodes within the system cooperate to drive a massive amount of bandwidth in the aggregate, and where the overall bandwidth can be increased simply by adding nodes. The demands on each individual node are relatively minor, obviating the need for high-end hardware or specialized processors on the nodes.
The random nature of the overall traffic mix across the simultaneous demands of the users also requires spreading the data over a large disk farm. No single drive could ever sustain the transfer rate needed to feed the network demands, but thousands of drives, operating in parallel, can easily supply the bandwidth. And as more capacity or disk bandwidth is needed, more nodes with more disks can be added, allowing the storage to grow to whatever size is required.
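The arithmetic behind this parallelism is simple. As a back-of-the-envelope sketch (every number here is an illustrative assumption, not a measured figure):

```python
# Back-of-the-envelope sizing; all numbers are illustrative assumptions.
DRIVE_MB_PER_S = 100     # assumed sustained throughput of one drive, MB/s
DEMAND_GB_PER_S = 100    # assumed aggregate network demand, GB/s

demand_mb_per_s = DEMAND_GB_PER_S * 1000
drives_needed = demand_mb_per_s // DRIVE_MB_PER_S

# No single drive comes close to 100 GB/s, but a thousand drives
# working in parallel supply it comfortably.
print(drives_needed)  # 1000
```

Doubling the demand just doubles the drive count, which is exactly the "add more nodes with more disks" growth path described above.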
With such a large disk farm, there will be a steady stream of disk failures. At this scale, it is impractical to delay healing until a new drive replaces a failed one, to reserve hot spares, or to restrict data to a RAID group or similar structure. Rather, healing must begin as soon as a drive fails, and it must be able to re-protect the data anywhere in the system. This demands a degree of location independence in the user-visible name of the data: since the data will be freely reshuffled within the system, the translation from a name to its current location must be done purely within the storage system, and similar names must imply no locality relationship between them (unlike in a block system or a filesystem, where applications specifically tune their access and naming patterns to exploit data locality). A further benefit is that upgrades to newer technology (faster CPUs, more memory, faster networks) are trivially enabled, as the data can be easily and freely reshuffled within the system with no impact on the applications accessing it.
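One common way to achieve this location independence is to derive an object's placement from a hash of its name, so that placement depends only on the name and the current node set, never on any locality the application might assume. A minimal consistent-hashing sketch (the node names and virtual-node count are invented for illustration, and a real system would also track replicas and failures):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Maps object names to nodes; adding a node moves only a fraction of names."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add_node(self, node, vnodes=100):
        # Each node owns many virtual points, spreading load evenly.
        for i in range(vnodes):
            self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    def locate(self, name):
        # The object lives on the first ring point at or after its hash.
        idx = bisect(self._ring, (self._hash(name),)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.locate("photos/2024/img_0042.jpg"))
```

Two names that look "adjacent" to a human land on unrelated nodes, and when a node is added or lost, only the names whose ring segment changed need to move – which is precisely the free-reshuffling property the paragraph above calls for.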
The logical layout and organization of the data must be done in a way that makes sense from the user/application perspective, and not in a way that optimizes first for the storage system and forces the application to code to the specifics of the storage. This consequence flows from the nature of the traffic. When a user or application first starts storing data, it is unpredictable what kind of traffic will be generated or what kind of data will be stored. For instance, some Facebook users may store thousands of photos, others will only occasionally update their status, while yet others may store hundreds of photos but thousands of videos. If these users were forced into a cookie-cutter mold of storage where they each get the same sized filesystem, then the users who update their status infrequently would wind up with huge amounts of unused storage, while the users who upload videos would be constantly running out of space. All in all, it would lead to a system with huge amounts of unusable storage while users constantly complain about not being able to store their data. There is no one-size-fits-all organization of data which would match these patterns, and these patterns are not known when the storage is initially provisioned. This necessitates a design whereby the storage usage is fluid, and can be used elastically by the different users in response to their needs.
Finally, the worldwide nature of the traffic demands a storage infrastructure which spans geographies and which allows active access to data from all geographies. While most data will generally be accessed in the same location where it was originally created, the nature of our mobile society means that pictures uploaded in Japan should be instantly visible to family and friends in the United States, without the need for constant long-latency network accesses to storage in Japan. The storage system must be able to replicate data across multiple geographies, and then serve the data from the geography closest to the requester – all while preserving the user-defined name of the data (further reinforcing the location-independent nature of the name).
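The read path for such a system can be sketched in a few lines. The replica placement, region names, and latency figures below are all hypothetical; the point is only that the same user-defined name resolves to whichever replica is closest:

```python
# Hypothetical replica map and client-to-region latencies (ms); all invented.
replicas = {"photos/tokyo-trip.jpg": {"ap-northeast", "us-west"}}
latency_ms = {"ap-northeast": 180, "us-west": 20, "eu-central": 150}

def nearest_replica(name, client_latency):
    """Serve the object from whichever replica region is closest to the requester."""
    return min(replicas[name], key=client_latency.__getitem__)

# A viewer in the United States reads the Japan-uploaded photo locally.
print(nearest_replica("photos/tokyo-trip.jpg", latency_ms))  # us-west
```

The caller never learns, or cares, which geography actually served the bytes – the name stays the same everywhere.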
What does this lead to? Simply put, this is what object storage is. Object storage is a highly scalable, scale-out, distributed storage system designed to satisfy many simultaneous storage and retrieval requests across a diverse set of objects. The objects themselves are typically organized into user-defined buckets, the object names are defined by the application, and the application will typically associate some descriptive metadata with each object to identify it more clearly. The buckets themselves are not tied to individual drives; rather, the objects within a bucket can be stored anywhere in the system, in response to the changing access pattern for the bucket as well as the changing state of the storage itself, as drives or nodes fail or become temporarily inaccessible. Geographical data dispersal is not driven by filesystem or RAID group boundaries, but rather is handled based on the user-defined organization of the data and the mobility demands of the user or application, typically on a bucket-by-bucket basis (sometimes on an object-by-object basis).
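The data model just described – user-defined buckets, application-chosen names, and descriptive metadata on each object – can be captured in a toy in-memory sketch (the bucket, object name, and metadata keys are all illustrative, and a real object store would of course persist, replicate, and erasure-code the data):

```python
class ObjectStore:
    """Toy in-memory model of buckets, named objects, and per-object metadata."""

    def __init__(self):
        self._buckets = {}

    def create_bucket(self, bucket):
        self._buckets.setdefault(bucket, {})

    def put(self, bucket, name, data, metadata=None):
        # The name is chosen by the application; it implies nothing
        # about where the bytes physically live.
        self._buckets[bucket][name] = (data, metadata or {})

    def get(self, bucket, name):
        return self._buckets[bucket][name]

store = ObjectStore()
store.create_bucket("vacation-photos")
store.put("vacation-photos", "2024/tokyo/img_0042.jpg", b"...jpeg bytes...",
          {"caption": "Shibuya crossing", "taken": "2024-04-01"})
data, meta = store.get("vacation-photos", "2024/tokyo/img_0042.jpg")
print(meta["caption"])  # Shibuya crossing
```

Everything the application sees is the bucket, the name, and the metadata; placement, healing, and geographic dispersal all happen behind that interface.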
Or, even more simply, object storage is what we get when we rethink storage in terms of today’s access patterns and technologies.