The notion of “the whole being greater than the sum of its parts” holds for many implementations of technology. Take, for example, hyper-converged infrastructure (HCI) solutions like the Dell EMC VxRail. HCI combines virtualization software and software-defined storage with industry-standard servers. It ties these components together with orchestration and infrastructure management software to deliver a combined solution with operational and deployment efficiencies that, for many classes of users, would not be possible if the components were delivered separately.
However, certain challenges are best solved by separating out the parts. That is true in the case of Server Disaggregation and the potential benefits such an architecture can provide.
So, what is Server Disaggregation? It’s the idea that for data centers of a certain size, efficiencies of servers can be improved by dissecting the traditional servers’ components and grouping like components into resource pools. Once pooled, a physical server can be aggregated (i.e., built) by drawing resources on the fly, optimally sized for the application it will run. The benefits of this model are best described by examining a little history.
B.V.E. (Before the Virtualization Era)
Before virtualization became prevalent, enterprise applications were typically assigned to physical servers in a one-to-one mapping. To prevent unexpected interactions between the programs, such as one misbehaving program consuming all the bandwidth of a server component and starving the other programs, it was common to give critical enterprise applications their own dedicated server hardware.
Figure 1 describes this model. Figure 1 (a) illustrates a concept physical server with its resources separated by class type: CPU, SCM, GPU and FPGA, Network, Storage. Figure 1 (b) shows a hypothetical application deployed on the server and shows the portion of the resources the application consumed. Figure 1 (c) calls out the portion of the server’s resources that were underutilized by the application.
Figure 1 (c) highlights the problem with this model: overprovisioning. The underutilized resources were the result of overprovisioning the server hardware for the application to be run. Servers were overprovisioned for a variety of reasons, including lack of knowledge of the application’s resource needs, fear of possible dynamic changes in workload, and the need to account for anticipated application or dataset growth over time. Overprovisioning was the result of a “better safe than sorry” mindset, which was not necessarily a bad philosophy when dealing with mission-critical enterprise applications. However, this model had its costs (e.g., higher acquisition costs and greater power consumption). Also, because servers were sized for their applications when they were acquired, there was little configuration agility left to act on what was later learned about the applications’ true resource needs. Before virtualization, data center server utilization could be 15% or less.
The Virtualization Age
When virtualization first started to appear in data centers, one of its biggest value propositions was to increase server utilization. (Many people would say, and I would agree, that the operational features that virtualization environments like VMware vSphere provide are equally important: features like live migration, snapshots and rapid deployment of applications, to name a few.) Figure 2 shows how hypervisors increased server utilization by allowing multiple enterprise applications to share the same physical server hardware. After virtualization was introduced to the data center, server utilization could climb to between 50% and 70%.
Disaggregation: A Server Evolution under Development
While the improvement of utilization brought by virtualization is impressive, the amount of unutilized or underutilized resources trapped on each server starts to add up quickly. In a virtual server farm, the data center could have the equivalent of one idle server for every one to three servers deployed.
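As a back-of-the-envelope check on that estimate, the sketch below converts an average utilization figure into the number of deployed servers whose combined idle capacity equals one whole server. The function name is hypothetical, and the model assumes, purely for illustration, that every server runs at the same average utilization and that idle capacity adds up across servers.

```python
def servers_per_idle_equivalent(utilization: float) -> float:
    """Deployed servers whose combined idle capacity equals one full server.

    Illustrative assumption: all servers share the same average utilization
    and idle capacity is additive across servers.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


# Utilization figures quoted in the article: 15% before virtualization,
# 50% to 70% after.
for u in (0.15, 0.50, 0.70):
    n = servers_per_idle_equivalent(u)
    print(f"{u:.0%} utilized -> one idle server per {n:.1f} deployed")
```

At 50% to 70% utilization this works out to roughly one idle server for every two to three deployed, which falls within the range given above.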
The goals of Server Disaggregation are to further improve the utilization of data center server resources and to add to operational efficiency and agility. Figure 3 illustrates the Server Disaggregation concept. In the fully disaggregated server model, resources typically found in servers are grouped together into common resource pools. The pools are connected by one or more high-speed, high-bandwidth, low-latency fabrics. A software entity, called the Server Builder in this example, is responsible for managing the pooled resources and rack scale fabric.
When an administrator or a higher-level orchestration engine needs a server for a specific application, it sends a request to the Server Builder with the characteristics of the needed server (e.g., CPU, DRAM, persistent memory (SCM), network, and storage requirements). The Server Builder draws the necessary resources from the resource pools and configures the rack scale fabric to connect the resources together. The result is a disaggregated server as shown in Figure 3 (a), a full bare-metal, bootable server ready for the installation of an operating system, hypervisor and/or application.
The process can be repeated as long as the required unassigned resources remain in the pools, allowing new servers to be created and customized for the application to be installed. From the OS, hypervisor or application point of view, the disaggregated server is indistinguishable from a traditional server, albeit with several added benefits that will be described in the next section. In this sense, disaggregation is an evolution of server architecture, not a revolution, as it does not require a refactoring of the existing software ecosystem.
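The request-and-compose flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not an actual Server Builder interface: the class names, field names and units are all hypothetical, and the fabric configuration step is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ServerRequest:
    """Resource amounts requested for one server (names and units are illustrative)."""
    cpu_cores: int = 0
    dram_gb: int = 0
    scm_gb: int = 0
    network_gbps: int = 0
    storage_tb: int = 0


@dataclass
class ServerBuilder:
    """Hypothetical Server Builder: tracks the free capacity left in each pool."""
    pools: dict[str, int]
    servers: dict[str, dict[str, int]] = field(default_factory=dict)

    def compose(self, name: str, request: ServerRequest) -> dict[str, int]:
        needed = {k: v for k, v in vars(request).items() if v > 0}
        # Check every pool first so a failed request leaves the pools untouched.
        short = [k for k, v in needed.items() if self.pools.get(k, 0) < v]
        if short:
            raise RuntimeError(f"insufficient pooled resources: {short}")
        for k, v in needed.items():
            self.pools[k] -= v
        # A real builder would also configure the rack scale fabric here.
        self.servers[name] = needed
        return needed


builder = ServerBuilder(pools={"cpu_cores": 64, "dram_gb": 1024, "scm_gb": 512,
                               "network_gbps": 200, "storage_tb": 100})
db = builder.compose("db01", ServerRequest(cpu_cores=16, dram_gb=256, scm_gb=128,
                                           network_gbps=25, storage_tb=10))
print(db)
print(builder.pools["cpu_cores"])  # 48 cores remain for further servers
```

Checking all pools before drawing from any of them keeps a rejected request from leaving resources half-allocated, which matters once several requests compete for the same pools.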
The Benefits of Being Apart
While having all the capabilities of a traditional server, the disaggregated server has many benefits:
- Configuration Optimization: The Server Builder can deliver a disaggregated server specifically composed of the resources a given application requires.
- Liberation of Unused Resources: Unused resources are no longer trapped within the traditional server chassis. These resources are now available to all disaggregated servers for capability expansion or to be used for the creation of additional servers (see Figure 3 (b)).
- Less Need to Overprovision: Because resources can be dynamically and independently added to a disaggregated server, there will be less temptation to use a larger-than-needed server during initial deployment. Also, since unused resources are available to all existing and future configurations, spare capacity can be managed at the data center level instead of the per-server level, enabling a smaller amount of reserved resources to provide overflow capacity to more servers.
- Independent Acquisition of Resources: Resources can be purchased independently and added separately to their respective pools.
- Increased RAS (Reliability, Availability and Serviceability): High availability can be added to server resources where it was not possible or economical to do so before. For example, the rack scale fabric can be designed to add redundant paths to resources. Also, when a CPU resource fails, the other resources can be remapped to a new CPU resource and the disaggregated server rebooted.
- Increased Agility through Repurposing: When a disaggregated server is retired, its resources return to the pools, where they can be reused in new disaggregated servers. Also, as application loads change, disaggregated servers devoted to one application cluster can be reformed and dedicated to another application cluster with different resource requirements.
The above list is not exhaustive and many other benefits of this architecture exist.
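Two of the benefits above, remapping around a failed CPU and returning a retired server's resources to the pools, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Pools` class, the function names and the resource IDs are hypothetical, and a server is modeled as a plain mapping from resource class to resource ID.

```python
class Pools:
    """Toy pooled inventory: sets of free resource IDs keyed by resource class."""

    def __init__(self, free):
        self.free = {kind: set(ids) for kind, ids in free.items()}

    def take(self, kind):
        if not self.free[kind]:
            raise RuntimeError(f"no free {kind} resource in the pool")
        return self.free[kind].pop()

    def give_back(self, kind, rid):
        self.free[kind].add(rid)


def remap_failed_cpu(server, pools):
    """Swap a server's failed CPU for a pooled one; other resources stay mapped."""
    failed = server["cpu"]
    server["cpu"] = pools.take("cpu")
    # The failed part is not returned to the pool; it awaits service.
    return failed


def retire(server, pools):
    """Return every resource of a retired server to its pool."""
    for kind, rid in server.items():
        pools.give_back(kind, rid)
    server.clear()


pools = Pools({"cpu": ["cpu-1"], "scm": [], "storage": []})
server = {"cpu": "cpu-0", "scm": "scm-0", "storage": "disk-0"}

failed = remap_failed_cpu(server, pools)  # the server would then be rebooted
print(failed, server)

retire(server, pools)
print(pools.free)
```

In this model the SCM and storage mappings survive the CPU swap untouched, which is the point of the RAS benefit: only the failed resource is replaced.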
The Challenges (and Opportunities) of a Long(ish)-Distance Relationship
Full server disaggregation is not here yet, and the concept is still under development. For it to be possible, an extremely low-latency fabric is required to allow the components to be separated at the rack level. The fabric also needs to support memory semantics to be able to disaggregate SCM (Storage Class Memory). It remains to be seen whether all DRAM can be disaggregated from the CPU, but I believe that large portions can, depending on the requirements of the different classes of data used by an application. Fortunately, the industry is already developing an open standard for a fabric that is perfect for full disaggregation: Gen-Z. Information about the Gen-Z effort can be found at www.genzconsortium.org.
The software that controls resources and configures disaggregated servers, the Server Builder, still needs to be developed. Developing it also provides an opportunity to add monitoring and metric collection that can be used to dynamically manage resources in ways that were not possible with the traditional server model.
Another opportunity is the tying together of the disaggregated server infrastructure with the existing orchestration ecosystems. Server Disaggregation is in no way a competitor to existing orchestration architectures like virtualization. On the contrary, Server Disaggregation is enhancing the traditional server architecture that these orchestration environments already use.
One can imagine that the management utilities administrators use to control their orchestration environments could be augmented to communicate directly with the Server Builder to create the servers they need. The administrator may never need to interface directly with the Server Builder. The benefits of disaggregation should be additive to the benefits of the orchestration environments.
Conclusion: An Exciting Time in Server Architecture
It is an exciting time to be involved in server architecture. New technologies like SCM and rack scale, low-latency fabrics are opening new doors for server innovation. Server Disaggregation has the potential to be one of these important innovations. Indeed, we have already seen some of the benefits of the disaggregation of some of the server components in systems like the Dell EMC PowerEdge FX2 and Dell EMC PowerEdge VRTX. Server Disaggregation can build on the benefits these examples provide and lead to a more efficient and more dynamic server infrastructure environment.
 SCM – Storage Class Memory. A class of emerging persistent memory technologies with latencies lower than NAND flash.