IOLanes re-architects the I/O path to:
- Eliminate interference across workloads, cores, and devices.
- Eliminate host-guest virtualization overheads.
- Offer end-to-end performance monitoring, visualization, and automatic tuning.
- Demonstrate improvements with real applications.
The I/O path in modern servers exhibits a high degree of interference among different workloads, cores, and devices. For instance, a high-volume backup application can significantly degrade the response time of a transactional application running on the same server: the response time of TPC-W increases by 10x when it runs concurrently with a simple data-transfer application. Mixed I/O patterns from different applications towards shared devices in consolidated environments lead to poor device behavior.
IOLanes re-designs the I/O path to partition and isolate workloads, cores, and devices, improving I/O behavior and performance. Our design centers on a novel, partitioned DRAM cache that allows different workloads or cores to allocate individual I/O caches that can be sized and placed independently in terms of host and device resources. The partitioned cache is complemented by a partitioned journal device that reduces ordering and synchronization. This design requires changes to traditional filesystems, which are typically tightly coupled with DRAM caching and journaling. For this purpose we provide a thin filesystem that performs namespace management and is aimed mainly at hypervisor environments.
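The partitioning idea can be sketched as per-workload LRU caches that are sized and resized independently. This is a minimal illustrative model, not the actual IOLanes implementation; all class and method names here are invented for the example:

```python
from collections import OrderedDict

class CachePartition:
    """An independently sized LRU cache of I/O blocks for one workload."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block number -> data

    def get(self, blkno):
        if blkno in self.blocks:
            self.blocks.move_to_end(blkno)  # mark most recently used
            return self.blocks[blkno]
        return None

    def put(self, blkno, data):
        self.blocks[blkno] = data
        self.blocks.move_to_end(blkno)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block

class PartitionedCache:
    """DRAM cache split into isolated per-workload partitions."""
    def __init__(self):
        self.partitions = {}

    def add_workload(self, name, capacity_blocks):
        self.partitions[name] = CachePartition(capacity_blocks)

    def resize(self, name, capacity_blocks):
        # Resizing one partition never evicts blocks of another workload.
        p = self.partitions[name]
        p.capacity = capacity_blocks
        while len(p.blocks) > p.capacity:
            p.blocks.popitem(last=False)
```

The key property the sketch captures is isolation: pressure from one workload (e.g. a backup stream filling its partition) can never evict the cached blocks of another.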
Our design isolates workloads in virtualized environments all the way from application memory to the devices, dramatically reducing interference. In the TPC-W example, our approach entirely eliminates interference with the data-transfer application. In addition, our approach can reduce contention across cores on "dense" many-core servers, a problem that projected increases in core counts will only make worse.
The current I/O stack incurs high overheads due to expensive operations at various layers in the I/O path. Two main issues are (a) virtualization overheads and (b) replication of functionality. IOLanes addresses these by building a decomposed I/O stack that avoids replication across guest and host domains and reduces the cost of virtualization, caching, and recovery, in cooperation with WP1, which offers some of the required mechanisms.
In addition, we use in-memory page-cache deduplication to effectively increase the available DRAM cache size and reduce I/O overheads, especially in virtualization scenarios: 25% of data sharing across workloads results in 3.6 GBytes of additional DRAM cache space on a typical server (Figure Extra-Dedup-Memory). For applications that are sensitive to I/O caching, such as transactional workloads, this can have a dramatic impact on performance.
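The arithmetic behind that figure is simple to reproduce. The 14.4 GB cache size below is an assumption inferred from the numbers in the text (25% of 14.4 GB is 3.6 GB), not a stated configuration:

```python
def dedup_savings_gb(cache_gb, shared_fraction):
    """DRAM freed when pages duplicated across workloads are stored once."""
    return cache_gb * shared_fraction

# A server with a 14.4 GB page cache (assumed) and 25% data sharing
# across workloads frees 3.6 GB of cache, which becomes available
# as additional effective caching capacity.
extra = dedup_savings_gb(14.4, 0.25)  # 3.6 GB
```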
Understanding performance bottlenecks in modern I/O subsystems is a daunting task due to the associated complexity and the multiple layers involved from application to devices. IOLanes provides a single point for gathering statistics from all layers of a running system. In our stack, this instrumentation generates approximately 5 million data points every hour. To explore this vast amount of monitoring data, we designed Merlin, a tool for monitoring and analysis. Merlin allows quick inspection for irregularities across all system layers. Furthermore, Merlin can perform a wide range of analytical functions that collectively allow users to summarize resource behavior, explore both expected and novel correlations, and discover potential bottlenecks as the underlying system configuration is altered.
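To give a feel for the data volume and the kind of roll-up such a tool needs before anything can be visualized, here is a toy aggregation over raw (layer, metric, value) samples. This is an illustrative sketch, not Merlin's actual pipeline:

```python
from collections import defaultdict
from statistics import mean

# 5 million data points per hour works out to roughly 1,400 per second.
points_per_hour = 5_000_000
rate_per_sec = points_per_hour / 3600  # ~1389 samples/sec

def summarize(samples):
    """Collapse raw (layer, metric, value) samples into per-layer,
    per-metric means -- the kind of roll-up needed before inspecting
    a layer of the stack for irregularities."""
    by_key = defaultdict(list)
    for layer, metric, value in samples:
        by_key[(layer, metric)].append(value)
    return {k: mean(v) for k, v in by_key.items()}
```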
For more information, please see the video on Merlin, our visual data-mining tool that supports performance investigations across the I/O stack.
In addition, we design IOAnalyzer, a novel approach to selecting the best I/O scheduler for the currently running workloads. The I/O scheduler has an important impact on device performance, and with mixed workloads it is becoming impossible to statically choose the best scheduler. Our analysis shows that making the wrong choice can decrease application performance by 2-5x. Moreover, the choice of scheduler depends on a number of parameters, including the workload and the system and device characteristics (Figure IOAnalyzer). IOAnalyzer dramatically simplifies this by automatically selecting the best scheduler for a workload mix, observing the I/O patterns at runtime and examining various alternative scenarios. IOAnalyzer eliminates the related administrator costs of optimizing I/O, directly reducing TCO.
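A toy version of runtime scheduler selection makes the idea concrete: observe a property of the request stream (here, sequentiality) and pick a scheduler accordingly. The scheduler names match classic Linux I/O schedulers, but the decision rules below are illustrative stand-ins, not IOAnalyzer's actual analysis:

```python
def sequentiality(offsets):
    """Fraction of requests that immediately follow their predecessor."""
    seq = sum(1 for a, b in zip(offsets, offsets[1:]) if b == a + 1)
    return seq / max(len(offsets) - 1, 1)

def pick_scheduler(offsets, latency_sensitive):
    """Choose a scheduler from an observed window of request offsets
    (simplistic rules for illustration only)."""
    s = sequentiality(offsets)
    if latency_sensitive and s < 0.5:
        return "deadline"   # random, latency-bound: bound request age
    if s >= 0.5:
        return "cfq"        # sequential streams benefit from anticipation
    return "noop"           # random I/O on fast devices: skip reordering
```

A real system would replay the observed pattern against each candidate (or switch schedulers online and measure), rather than rely on fixed thresholds like these.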
Figure: Reduction in CPU utilization with increasing number of cores.
Today, there are increasing concerns that existing applications and infrastructures will not be able to cope with data growth, limiting our ability to process the available information. Understanding the I/O requirements of existing and emerging applications is important for driving system design. We examine the storage I/O behavior of data-centric applications as the number of cores per server grows. We configure these applications with realistic datasets and examine configuration points where they perform a significant amount of I/O. For our analysis we propose and use cycles per I/O (cpio) as a metric that abstracts many I/O subsystem configuration details. We analyze specific architectural issues pertaining to data-centric applications, including the usefulness of hyper-threading, sensitivity to memory bandwidth, and the potential impact of disruptive storage technologies.
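In its simplest form, cpio divides the CPU cycles consumed during a run by the number of I/O operations performed. The exact accounting (which cycles and which I/Os count) is our assumption for this sketch; the parameter values in the example are invented:

```python
def cycles_per_io(cpu_utilization, cores, freq_hz, runtime_s, io_count):
    """cycles per I/O (cpio): CPU cycles consumed per I/O operation,
    abstracting away I/O subsystem configuration details."""
    cycles = cpu_utilization * cores * freq_hz * runtime_s
    return cycles / io_count

# Example: 16 cores at 2 GHz, 40% utilized over 60 s, issuing 1M I/Os:
# 0.4 * 16 * 2e9 * 60 / 1e6 = 768,000 cycles per I/O.
cpio = cycles_per_io(0.4, 16, 2e9, 60, 1_000_000)
```

Because the metric folds CPU cost and I/O volume into one number, it lets different server and device configurations be compared on how efficiently they move data.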
IOLanes aims at demonstrating its results via real applications in different data-management domains. For this purpose we use three domains, online transaction processing (OLTP), online analytical processing (OLAP), and data streaming (DS), with a representative application in each: TPC-W, TariffAdvisor, and LinearRoad, respectively. For TPC-W we achieve isolation from other workloads and eliminate any impact of other workloads on transaction response time; without the partitioned I/O stack of IOLanes, TPC-W suffers a 10x increase in response time when running concurrently with other, lower-priority workloads due to I/O-path interference. TariffAdvisor, an application used extensively for plan rating in production at large mobile telco providers, achieves up to 30% improvement in processing rate by eliminating the fluctuation caused by dynamic allocation of memory resources in the I/O path. LinearRoad achieves a 10% improvement in the virtualized I/O path when using the IOLanes optimizations for reducing the number of VM exits.
Questions or additional information? Contact us.