Innovation and Technology

IOLanes has redesigned the I/O path to reduce overhead and increase isolation of workloads in large multicore systems. At the core of IOLanes technology are the following fundamental techniques in the I/O path from devices to applications:

Figure 1: Overall architecture of the IOLanes approach for SSD caching

[Devices to Host]
Dense I/O via transparent SSD caching:

NAND-flash-based SSDs are the main enabler of high I/O density in datacenter servers. In IOLanes we design an SSD-based I/O cache that operates at the block level and is transparent to existing applications, such as databases (Figure 1).

Our design provides various choices for associativity, write policies, and cache-line size, while maintaining a high degree of I/O concurrency. Our main contribution is that we explore differentiation of HDD blocks according to their expected impact on system performance. We design and analyze a two-level block selection scheme that dynamically differentiates HDD blocks and selectively places them in the limited space of the SSD cache. We implement our SSD cache in the Linux kernel and evaluate its effectiveness experimentally using a server-type platform and large problem sizes with I/O intensive workloads. Our results show that as the cache size increases, we are able to enhance I/O performance by up to 14x. Additionally, our two-level block selection scheme further enhances I/O performance compared to a typical SSD cache by up to 2x.
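The idea of two-level block selection can be illustrated with a minimal sketch. The region size, admission threshold, and eviction policy below are illustrative assumptions for exposition, not the IOLanes implementation: level 1 tracks accesses per coarse HDD region, and level 2 admits a block into the SSD cache only once its region has proven hot, so the limited SSD space goes to blocks expected to matter most.

```python
# Hypothetical sketch of a two-level block selection scheme.
from collections import defaultdict

REGION_BLOCKS = 1024    # blocks per coarse HDD region (assumed)
HOT_THRESHOLD = 8       # region accesses before admission (assumed)
CACHE_CAPACITY = 4      # SSD cache size in blocks, tiny for the demo

region_hits = defaultdict(int)   # level 1: per-region access counts
ssd_cache = {}                   # level 2: block -> data, FIFO eviction

def access(block, data):
    if block in ssd_cache:
        return ssd_cache[block]              # served from the SSD cache
    region = block // REGION_BLOCKS
    region_hits[region] += 1                 # level 1: count the access
    if region_hits[region] >= HOT_THRESHOLD:
        if len(ssd_cache) >= CACHE_CAPACITY:
            ssd_cache.pop(next(iter(ssd_cache)))  # evict oldest entry
        ssd_cache[block] = data              # level 2: admit hot block
    return data                              # served from the HDD
```

A block in a rarely touched region is never admitted, while repeated accesses to the same region quickly promote its blocks into the cache.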

Figure 2: IOLanes proposes partitioning the I/O path to reduce I/O interference and contention in system resources.

Host-level isolation via I/O partitioning:

With increasing core counts and improving storage device technology, I/O contention and interference are emerging as the main bottleneck in the current I/O path between application memory and the actual devices. In IOLanes we propose partitioning the I/O path (Figure 2) to isolate both the resources used by workloads and the kernel structures that today offer a single I/O path over multiple physical system resources. We design and implement in the Linux kernel a filesystem, a partitioned DRAM I/O cache, a partitioned journal mechanism, and an SSD caching layer that are able to support real applications.
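The partitioning idea can be sketched as follows. The lane abstraction, static workload-to-lane assignment, and cache/journal structures below are illustrative assumptions, not the kernel implementation: each workload gets a private slice of the DRAM cache and journal, so one workload's I/O cannot evict or stall another's.

```python
# Minimal sketch of host-level I/O-path partitioning into "lanes".
class IOLane:
    def __init__(self, cache_pages):
        self.cache = {}                  # private DRAM cache slice
        self.cache_pages = cache_pages   # slice capacity in pages
        self.journal = []                # private journal

class PartitionedIOPath:
    def __init__(self, lanes, pages_per_lane):
        self.lanes = [IOLane(pages_per_lane) for _ in range(lanes)]

    def lane_for(self, workload_id):
        # Static assignment of integer workload ids to lanes (assumed).
        return self.lanes[workload_id % len(self.lanes)]

    def write(self, workload_id, block, data):
        lane = self.lane_for(workload_id)
        lane.journal.append((block, data))        # journaled in-lane
        if len(lane.cache) >= lane.cache_pages:
            lane.cache.pop(next(iter(lane.cache)))  # evict within lane only
        lane.cache[block] = data                  # cached in-lane only
```

Because eviction happens only within a lane, a cache-hungry workload cannot displace another workload's cached blocks.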

We implement and evaluate our approach using transactional, analytical-processing, and file-streaming workloads on shared multicore servers, and we compare it with the cgroups mechanism in the Linux kernel. We find that in the presence of interference, workload performance can deteriorate by more than three orders of magnitude, whereas with our mitigating mechanisms the penalty is at most 5.7x in the worst case, usually much less, and always better than cgroups under conditions of high interference. Moreover, we introduce a system-level metric, cycles per I/O, that correlates negatively with observed application-level performance and serves to highlight the differences between our mechanism and cgroups. Figures 3 and 4 show that performance degradation due to consolidation can reach multiple orders of magnitude under high load; IOLanes reduces this degradation to at most 5.7x, and usually much less.
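The cycles-per-I/O metric itself is simple to compute. The sampling shape below is an illustrative assumption: per measurement interval, divide the CPU cycles spent in the I/O path by the I/Os completed in that interval; a rising value signals growing per-I/O overhead due to interference.

```python
# Illustrative computation of the cycles-per-I/O metric.
def cycles_per_io(samples):
    """samples: list of (cycles_in_io_path, ios_completed) per interval."""
    return [c / n if n else float("inf") for c, n in samples]

# Example: the same I/O rate costing 3x the cycles in the second interval
# indicates interference, even before application-level metrics degrade.
print(cycles_per_io([(2_000_000, 1000), (6_000_000, 1000)]))
```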

Figure 3 (left): Performance of datacenter applications may suffer orders of magnitude degradation on modern consolidated servers when run concurrently with other workloads
Figure 4 (right): Performance improvement under I/O interference compared to native Linux

[Host to Guest]
Low-overhead virtualization by eliminating VM exits in the I/O issue and completion paths:

Current virtualization solutions often bear an unacceptable performance cost, limiting their use in many situations, and in particular when running I/O intensive workloads. We argue that this overhead is inherent in Popek and Goldberg’s trap-and-emulate model for machine virtualization, and propose an alternative virtualization model for multi-core systems, where unmodified guests and hypervisors run on dedicated CPU cores. We propose hardware extensions to facilitate the realization of this split execution (SplitX) model and provide a limited approximation on current hardware (Figure 5). We demonstrate the feasibility and potential of a SplitX hypervisor running I/O intensive workloads with zero overheads.
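The exit-less interaction at the heart of SplitX can be approximated with a toy sketch. The queue-based protocol and operation names below are illustrative assumptions, not the proposed hardware extensions: instead of trapping synchronously on every privileged operation, the "guest" core posts requests to a shared queue that a dedicated "hypervisor" core drains asynchronously, so the guest keeps running.

```python
# Toy model of the SplitX split-execution idea using two threads as cores.
import queue
import threading

requests = queue.Queue()   # shared channel replacing synchronous exits
results = {}               # emulation results, filled by the hypervisor core

def hypervisor_core():
    # Dedicated core: polls for guest requests and emulates them.
    while True:
        req = requests.get()
        if req is None:
            break                       # demo-only shutdown signal
        req_id, op = req
        results[req_id] = f"emulated:{op}"

def guest_core():
    # Guest issues privileged I/O operations without exiting: it simply
    # enqueues them and continues executing.
    for i, op in enumerate(["outb", "mmio_write", "msr_read"]):
        requests.put((i, op))
    requests.put(None)

hv = threading.Thread(target=hypervisor_core)
hv.start()
guest_core()
hv.join()
```

The point of the model is that the guest never blocks waiting for emulation; the cost of a context switch is replaced by the cost of a cross-core message.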

Figure 5 (left): SplitX reduces the number of exits in the issue path for I/Os when using virtualization
Figure 6 (right): ELI reduces interrupt related overheads in the I/O completion path when using virtualization

Figure 7: Improvement of virtualized I/O performance compared to bare metal Linux

In the I/O completion path, direct device assignment enhances the performance of guest virtual machines by allowing them to communicate with I/O devices without host involvement. But even with device assignment, guests are still unable to approach bare-metal performance, because the host intercepts all interrupts, including those generated by assigned devices to signal the completion of guests' I/O requests.

This host involvement induces multiple unwarranted guest/host context switches, which significantly hamper the performance of I/O intensive workloads. To solve this problem, we present ELI (Exit-Less Interrupts), a software-only approach for handling interrupts within guest virtual machines directly and securely (Figure 6). By removing the host from the interrupt handling path, ELI improves the throughput and latency of unmodified, untrusted guests by 1.3x–1.6x, allowing them to reach 97%–100% of bare-metal performance even for the most demanding I/O-intensive workloads (Figure 7).

[Dynamic optimization]
Dynamic I/O scheduler for mixed workloads:

Device I/O performance is a bottleneck for many workloads, and the host-level I/O scheduler plays an important role because it determines the order in which I/Os reach each device. Each I/O scheduler behaves differently depending on the workload and the storage devices. Today, the scheduler is typically configured once by the system administrator and used by all workloads, so it is not possible to dynamically select the scheduler that best fits a specific workload and set of devices. IOLanes uses an online optimization technique to automatically select the I/O scheduler for the current workload. The selection is performed online, using a workload-analysis method that identifies common I/O patterns in small I/O traces. The proposed method works with any application and device type (RAID, HDD, SSD) for which there is a system parameter to tune, without requiring disk simulations or hardware modeling. Our dynamic mechanism adapts automatically to the best scheduler, sometimes achieving I/O performance for heterogeneous workloads beyond that of any fixed scheduler.
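Online scheduler selection of this kind can be sketched briefly. The sequentiality heuristic, thresholds, and scheduler names below are illustrative assumptions, not the IOLanes method; the sysfs path is the standard Linux interface for switching block-layer schedulers.

```python
# Hypothetical sketch of online I/O-scheduler selection from a short trace.
def sequentiality(trace):
    # Fraction of requests whose start sector immediately follows the
    # previous request (trace is a list of start sectors).
    seq = sum(1 for prev, cur in zip(trace, trace[1:]) if cur == prev + 1)
    return seq / max(len(trace) - 1, 1)

def pick_scheduler(trace, device_is_ssd=False):
    # Toy policy: SSDs gain little from reordering; on HDDs, random
    # patterns favor deadline-style scheduling, sequential ones cfq.
    if device_is_ssd:
        return "noop"
    return "deadline" if sequentiality(trace) < 0.5 else "cfq"

def apply_scheduler(device, name):
    # Switch the active scheduler via sysfs (requires root privileges).
    with open(f"/sys/block/{device}/queue/scheduler", "w") as f:
        f.write(name)
```

In a running system, a daemon would periodically capture a small trace (e.g., via blktrace), classify it, and call `apply_scheduler` only when the recommendation changes.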

Figure 8: Overall structure of Merlin

[Complete System]
I/O performance monitoring and analysis

While working on the new I/O stack, it became necessary to develop techniques for detailed profiling of real workloads on actual systems. This led to a fine-grain monitoring and analysis tool, Merlin (Figure 8), which has helped with both correctness and performance debugging. Across the stack, from application/user space down to kernel/hardware, users can extend Merlin's functionality and data set by creating their own instrumentation modules (subsystem probes) or by reusing one of the extensive set already developed within the IOLanes project for performance evaluation. As an example, for some experimental configurations in the current IOLanes prototype and testbed, Merlin probes generate more than 5 million data points per hour per physical or virtual machine. Beyond our own work, Merlin has also been used for performance debugging of distributed transactional workloads in datacenters, and it is evolving into a tool able to identify performance bottlenecks in scale-out workloads.
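The probe model can be illustrated with a minimal sketch. The decorator-style probe API and collector shown here are assumptions for exposition, not Merlin's actual interface: a probe wraps a code path, records one timestamped sample per invocation, and pushes samples to a shared collector for later analysis.

```python
# Illustrative sketch of a Merlin-style instrumentation probe.
import time

class Collector:
    def __init__(self):
        self.samples = []   # (probe_name, start_time, duration_seconds)

    def record(self, name, start, duration):
        self.samples.append((name, start, duration))

collector = Collector()

def probe(name):
    # Decorator: time the wrapped function and emit one sample per call.
    def wrap(fn):
        def inner(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                collector.record(name, t0, time.perf_counter() - t0)
        return inner
    return wrap

@probe("block.submit")
def submit_io(block):
    return block * 2        # stand-in for real I/O submission work

submit_io(21)               # produces one "block.submit" sample
```

Real subsystem probes would sit at kernel or hypervisor boundaries rather than around Python functions, but the pattern (wrap, sample, aggregate centrally) is the same.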

Questions or additional information? Contact us.

Seventh Framework Programme