SPAMD: Stressing I/O in modern servers

 

Evaluating the I/O subsystem of modern servers is not straightforward. With the recent trend towards data-centric (rather than compute-centric) applications, where the emphasis is on processing data rather than on in-memory computation, using realistic workloads in experimental evaluations becomes increasingly important. In this web page we discuss how to configure, set up, and run experiments with a set of applications used for servicing, profiling, analyzing, and managing data (SPAMD). The paper below discusses issues related to evaluating, and projecting into the future, the I/O requirements of these applications.


  1. Shoaib Akram, Manolis Marazakis, and Angelos Bilas. Understanding scalability and performance requirements of I/O intensive applications on future multicore servers. In Proceedings of the 20th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'12), Arlington, Virginia, USA, August 2012 [pdf].


Our experience with these applications suggests that it is not easy to run them in a mode where they perform high amounts of I/O, even though their primary goal is to process large amounts of data. Here we discuss the datasets and parameters that stress I/O. We provide reference parameters, datasets, and throughput for each application on a system with 12 GBytes of memory (configured per application with between 1 and 12 GBytes) and 24 SSDs. Table 1 summarizes, for each application, the parameters that result in high amounts of I/O on this reference machine, along with the corresponding throughput.


Table 1: Summary of application parameters and reference I/O throughput.


The details of the individual applications and the reference parameters that generate large amounts of I/O are provided below.


Psearchy

Psearchy is a file indexing application. We run Psearchy using multiple processes, where each process picks files from a shared queue of file names. Each process maintains a hash table in memory for storing BDB indices and flushes it to the storage devices once it reaches a configurable size. We modify the original Psearchy to use block-oriented reads instead of character-oriented reads to improve I/O throughput.
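
A minimal sketch of this change (an illustration, not the actual pedsort.C modification; the 64 KB block size is arbitrary) reads each file in large blocks with fread() instead of one character at a time:

    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK (64 * 1024)   /* illustrative block size */

    /* Scan a file block by block; returns bytes scanned or -1 on error. */
    static long scan_block_oriented(const char *path)
    {
        FILE *f = fopen(path, "r");
        if (!f)
            return -1;
        char *buf = malloc(BLOCK);
        long total = 0;
        size_t n;
        while ((n = fread(buf, 1, BLOCK, f)) > 0) {
            /* ... tokenize the n bytes in buf and update the per-process
               hash table of BDB indices here ... */
            total += (long)n;
        }
        free(buf);
        fclose(f);
        return total;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        printf("scanned %ld bytes\n", scan_block_oriented(argv[1]));
        return 0;
    }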

  1. The modified pedsort.C (available upon request)

  2. The setupPsearchy.sh script for creating the corpus

  3. Instructions in insts_htdl.txt for generating file paths for the HTDL dataset

  4. The replace.sh script for the HTDL dataset, which replaces each file in the Linux source tree with a 10 MB file

  5. The runPsearchy.sh script to run a single instance of Psearchy


Dedup

Dedup is a kernel from the PARSEC benchmark suite that implements a compression technique used in storage farms and data centres. The Dedup kernel has five pipeline stages: the first and the last stage perform I/O, and each of the three middle stages can use a pool of threads. We observe that the peak memory utilization of Dedup is very high for large files because the intermediate queues between pipeline stages grow very large. To drain the queues quickly, a large number of threads is required in the intermediate pipeline stages. Memory utilization is not an issue for small files because there is less intermediate state to maintain.
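
The queueing behaviour can be pictured with the following minimal sketch (our own illustration, not the Dedup code; item counts and delays are made up): a fast read stage feeds an unbounded queue that is drained by a pool of slower worker threads, and with too few workers the peak queue length, i.e., the intermediate state held in memory, grows large.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define ITEMS   5000
    #define WORKERS 2                           /* raise this to drain the queue faster */

    static int queue_len = 0, consumed = 0, max_len = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)            /* stage 1: read chunks from storage */
    {
        (void)arg;
        for (int i = 0; i < ITEMS; i++) {
            usleep(100);                        /* fast stage */
            pthread_mutex_lock(&lock);
            queue_len++;
            if (queue_len > max_len)
                max_len = queue_len;
            pthread_cond_signal(&nonempty);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *worker(void *arg)              /* middle stages: chunk/hash/compress */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (queue_len == 0 && consumed < ITEMS)
                pthread_cond_wait(&nonempty, &lock);
            if (queue_len == 0) {               /* all items processed: exit */
                pthread_mutex_unlock(&lock);
                return NULL;
            }
            queue_len--;
            consumed++;
            if (consumed == ITEMS)              /* wake workers waiting to exit */
                pthread_cond_broadcast(&nonempty);
            pthread_mutex_unlock(&lock);
            usleep(400);                        /* slow stage */
        }
    }

    int main(void)
    {
        pthread_t prod, w[WORKERS];
        pthread_create(&prod, NULL, producer, NULL);
        for (int i = 0; i < WORKERS; i++)
            pthread_create(&w[i], NULL, worker, NULL);
        pthread_join(prod, NULL);
        for (int i = 0; i < WORKERS; i++)
            pthread_join(w[i], NULL);
        printf("peak queue length: %d items\n", max_len);
        return 0;
    }

With two workers the queue should build up to a few thousand entries, while with eight or more it should stay close to empty (exact numbers depend on timer granularity).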

  1. The setupDedup.sh script for creating the dataset

  2. The runDedup.sh script to run multiple instances of Dedup


Metis

Metis is a MapReduce programming library for single nodes that is part of the Mosbench benchmark suite. Metis maps the input file in memory and assigns a portion of the file to each of the map threads. A single instance of Metis does not generate a lot of I/O and its execution time is primarily dominated by user time, so we run multiple instances of Metis and assign a different file to each instance.
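
The mmap-and-split pattern described above can be pictured with the following minimal sketch (our own illustration, not Metis code); the "map" work here is just counting word separators in each thread's slice:

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NTHREADS 4

    struct slice { const char *base; size_t len; size_t count; };

    static void *map_worker(void *arg)
    {
        struct slice *s = arg;
        size_t separators = 0;
        for (size_t i = 0; i < s->len; i++)     /* stand-in for the map function */
            if (s->base[i] == ' ')
                separators++;
        s->count = separators;
        return NULL;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <input file>\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        struct stat st;
        fstat(fd, &st);
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        pthread_t tid[NTHREADS];
        struct slice sl[NTHREADS];
        size_t chunk = st.st_size / NTHREADS;
        for (int i = 0; i < NTHREADS; i++) {    /* one contiguous slice per map thread */
            sl[i].base = data + i * chunk;
            sl[i].len = (i == NTHREADS - 1) ? st.st_size - i * chunk : chunk;
            pthread_create(&tid[i], NULL, map_worker, &sl[i]);
        }
        size_t total = 0;
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            total += sl[i].count;
        }
        printf("separators: %zu\n", total);
        munmap(data, st.st_size);
        close(fd);
        return 0;
    }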

  1. The setupMetis.sh script for creating the dataset

  2. The runMetis.sh script to run multiple instances of Metis


Borealis

Borealis is a data streaming system that processes streams of records (or tuples) stored on storage devices. Borealis is based on an event-based programming model, where tuples are packed into an event by a source and sent to a server running Borealis. The queries in a stream processing system are defined statically, while the data changes continuously. We run the client, the Borealis server, and the receiver on the same node. The client reads tuples from the storage devices and the receiver writes processed records back to the storage devices. We use a simple filter operator to process tuples. We extensively hand-tune Borealis to remove operations that are not necessary and hurt overall throughput; the modifications include proper buffer management, introducing flow-control mechanisms, removing object serialization overheads, and replacing message queues with synchronous reads/writes. We run multiple instances of Borealis to increase system utilization, because a single instance of Borealis has only three threads: send, receive, and process. Two important parameters for Borealis are the tuple size and the batching factor, where the batching factor is the number of tuples in an event.
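
To illustrate these two parameters, the following hypothetical client-side sketch (not the Borealis API; names and sizes are ours) reads batching-factor tuples of tuple-size bytes from a file and packs them into one event buffer per read:

    #include <stdio.h>
    #include <stdlib.h>

    #define TUPLE_SIZE      64      /* bytes per tuple (parameter) */
    #define BATCHING_FACTOR 512     /* tuples per event (parameter) */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <tuple file>\n", argv[0]);
            return 1;
        }
        FILE *in = fopen(argv[1], "rb");
        if (!in) {
            perror("fopen");
            return 1;
        }

        size_t event_bytes = (size_t)TUPLE_SIZE * BATCHING_FACTOR;
        char *event = malloc(event_bytes);
        size_t nevents = 0, got;
        /* each fread() builds one event out of up to BATCHING_FACTOR tuples */
        while ((got = fread(event, TUPLE_SIZE, BATCHING_FACTOR, in)) > 0) {
            /* ... hand the event of `got` tuples to the filter operator ... */
            nevents++;
        }
        printf("%zu events of up to %zu bytes each\n", nevents, event_bytes);
        free(event);
        fclose(in);
        return 0;
    }

Packing more tuples per event amortizes per-event overheads, which is presumably why the batching factor affects I/O throughput.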

  1. Modified Borealis (available upon request)

  2. Installation instructions in instsBor.txt

  3. The setupBor.sh script for creating the dataset

  4. The runBor.sh script for running multiple instances of Borealis


HBase

HBase is a NoSQL data serving system that is part of the Hadoop framework. HBase involves many operations that make it compute-heavy and inefficient on modern server hardware. We use the YCSB framework from Yahoo as the workload generator.

  1. HBase installation instructions in hbase_install.txt

  2. The runexperiment.sh script for creating an HBase database

  3. The runexperiment.sh script for running a single instance of HBase

  4. Workload configuration files: workload_r100.txt and workload_r70u30.txt. Use workload_r100 to pass the parameters that describe the records to YCSB and to create the initial data store, and then use workload_r70u30 to perform the actual test.


BDB

BDB is a library that provides support for building data stores based on key-value pairs. We use the Java implementation of BDB. The testing methodology is similar to the one we use for HBase, i.e., we use the YCSB framework to generate the workload. We observe that each time we run BDB on the same database, the total number of sectors read from and written to the devices is reduced due to database reorganization. We therefore first run BDB a number of times on the same database until the total number of sectors transferred via I/O no longer changes. Since BDB is an embedded data store, the YCSB clients and the BDB code share the same process address space.
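
The warm-up criterion above can be checked with the per-device sector counters in /proc/diskstats; the following small helper (our own, not part of BDB or YCSB) prints the cumulative sectors read and written for one block device, to be sampled before and after each run:

    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <device, e.g. sda>\n", argv[0]);
            return 1;
        }
        FILE *f = fopen("/proc/diskstats", "r");
        if (!f) {
            perror("/proc/diskstats");
            return 1;
        }
        char line[512], name[64];
        unsigned maj, mnr;
        unsigned long long rd_ios, rd_merges, rd_sectors, rd_ms;
        unsigned long long wr_ios, wr_merges, wr_sectors, wr_ms;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "%u %u %63s %llu %llu %llu %llu %llu %llu %llu %llu",
                       &maj, &mnr, name, &rd_ios, &rd_merges, &rd_sectors, &rd_ms,
                       &wr_ios, &wr_merges, &wr_sectors, &wr_ms) == 11
                && strcmp(name, argv[1]) == 0) {
                printf("%s: sectors read=%llu written=%llu\n",
                       name, rd_sectors, wr_sectors);
                fclose(f);
                return 0;
            }
        }
        fclose(f);
        fprintf(stderr, "device %s not found\n", argv[1]);
        return 1;
    }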

  1. BDB installation instructions in bdb_install.txt

  2. BDB plugin for YCSB (bdbClient.java) (available upon request)

  3. The runexperiment.sh script for creating a BDB database

  4. The runexperiment.sh script for running a single instance of BDB

  5. Workload configuration files: workload_r100.txt and workload_r70u30.txt. Use workload_r100 to pass the parameters that describe the records to YCSB and to create the initial data store, and then use workload_r70u30 to perform the actual test.


TPC-C

TPC-C is an OLTP workload that models the order-entry environment of a wholesale supplier. We run TPC-C on the MySQL database system and specify runtime parameters that result in high concurrency. We use an open-source implementation of TPC-C called Hammerora and run the Hammerora clients and the MySQL database server on the same machine. We observe that the Hammerora clients consume a very small percentage of the overall CPU utilization in our experiments. We run TPC-C with an amount of physical memory (6 GB) that results in a realistic amount of I/O for our generated database and test parameters.

  1. MySQL configuration file my.cnf

  2. TPC-C sources

  3. The runTpcc.sh script for running a single instance of TPC-C

  4. In addition, you will need to separately generate a MySQL database


TPC-E

TPC-E is an OLTP workload that models the transactions that take place in a stock brokerage firm. We run TPC-E over MySQL. To generate the dataset, we use the EGen tool, which is provided directly by the TPC consortium. To generate the workload (request stream), we use a "kit" (EGenSimpleTest) originally developed by Percona Consulting. Relevant links are the following:

  1. http://www.percona.com

  2. http://www.mysqlperformanceblog.com/2010/02/08/introducing-tpce-like-workload-for-mysql

  3. https://code.launchpad.net/percona-dev/perconatools/tpcemysql

The MySQL configuration and the script for running TPC-E are provided below.

  1. MySQL configuration file my.cnf

  2. The runTpce.sh script for running a single instance of TPC-E


Ferret

Ferret is an application used for content similarity search and is part of the PARSEC benchmark suite. Ferret is compute-intensive and performs a sustained but small amount of I/O. We fit the database of image signatures, against which the queries are run, in memory.

  1. The setupFerret.sh script for creating query images

  2. The runFerret.sh script for running a single instance of Ferret


BLAST

BLAST is an application from the domain of comparative genomics. We run multiple instances of BLAST, each running a separate set of queries against a separate database. This is a realistic scenario for facilities hosting genomic data, since the queries are typically submitted by remotely located clients. We use random query sequences of 5 KB, which is a common case in proteome/genome homology searches. BLAST is I/O intensive, but its execution time is primarily dominated by user time. We use BLAST for nucleotide-nucleotide sequence similarity search.
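
For reference, a random 5 KB nucleotide query in FASTA format can be produced with a sketch like the following (an illustration of what gen_query.sh does, not the script itself):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const char bases[] = "ACGT";
        const size_t len = 5 * 1024;            /* 5 KB query, as described above */
        srand((unsigned)time(NULL));
        printf(">random_query_5KB\n");
        for (size_t i = 0; i < len; i++) {
            putchar(bases[rand() % 4]);
            if ((i + 1) % 70 == 0)              /* conventional FASTA line width */
                putchar('\n');
        }
        putchar('\n');
        return 0;
    }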

  1. Download BLAST from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

  2. Download the databases listed in databases.txt from ftp://ftp.ncbi.nlm.nih.gov

  3. The gen_query.sh script for creating query sequences

  4. The runBlast.sh script for running multiple instances of BLAST


zmIO

zmIO is an in-house benchmark that uses the asynchronous I/O API of Linux to perform I/O operations without going through the kernel page cache. zmIO issues multiple I/O operations (a user-defined parameter) and keeps track of the status of each operation in a status queue. When the status queue is full, zmIO performs a blocking operation and waits for an I/O operation to complete; a new operation is issued after a pending operation completes. There is no setup script associated with zmIO, as it operates on raw devices rather than files.
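
The access pattern can be summarized with the following sketch (an approximation, not zmIO itself; queue depth, request size, and request count are arbitrary), which uses the Linux libaio interface with O_DIRECT to keep a fixed number of reads in flight and blocks for completions when the in-flight queue is full. Build with gcc and link against -laio.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define QDEPTH   32               /* user-defined number of in-flight operations */
    #define IOSIZE   (128 * 1024)     /* request size (multiple of 4 KB for O_DIRECT) */
    #define REQUESTS 1024             /* total reads issued */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <raw device>\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY | O_DIRECT);    /* bypass the page cache */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        io_context_t ctx = 0;
        if (io_setup(QDEPTH, &ctx) < 0) {
            fprintf(stderr, "io_setup failed\n");
            return 1;
        }

        struct iocb cbs[QDEPTH], *cbp[1];
        struct io_event events[QDEPTH];
        void *bufs[QDEPTH];
        int free_slots[QDEPTH], nfree = QDEPTH;
        for (int i = 0; i < QDEPTH; i++) {
            free_slots[i] = i;
            if (posix_memalign(&bufs[i], 4096, IOSIZE)) /* O_DIRECT needs aligned buffers */
                return 1;
        }

        long long offset = 0;
        for (int issued = 0; issued < REQUESTS; ) {
            if (nfree == 0) {                 /* status queue full: block for completions */
                int n = io_getevents(ctx, 1, QDEPTH, events, NULL);
                if (n < 1)
                    break;
                for (int i = 0; i < n; i++)   /* recover which slots completed */
                    free_slots[nfree++] = (int)(events[i].obj - cbs);
                continue;
            }
            int slot = free_slots[--nfree];   /* issue the next read into a free slot */
            io_prep_pread(&cbs[slot], fd, bufs[slot], IOSIZE, offset);
            cbp[0] = &cbs[slot];
            if (io_submit(ctx, 1, cbp) != 1)
                break;
            offset += IOSIZE;
            issued++;
        }
        while (nfree < QDEPTH) {              /* drain the remaining completions */
            int n = io_getevents(ctx, 1, QDEPTH, events, NULL);
            if (n < 1)
                break;
            nfree += n;
        }

        io_destroy(ctx);
        close(fd);
        return 0;
    }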

  1. Download zmIO

  2. The runZmio.sh script for running multiple instances of zmIO

  3. A sample machine configuration file configZmio.txt


fs_mark

fs_mark is used as a stress test for the file system. fs_mark runs a sequence of operations on the file system; we specify the sequence as create, open, write, read, and then close. To isolate the effects of concurrent transactions of different kinds, we modify fs_mark to perform all transactions of a particular type in one phase.
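
The phased behaviour can be pictured with the following minimal sketch (our own, not the modified fs_mark; file count and size are arbitrary), which first performs all create/write/close operations and only then all open/read/close operations:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NFILES   128
    #define FILESIZE (64 * 1024)

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <directory>\n", argv[0]);
            return 1;
        }
        char path[4096];
        char *buf = malloc(FILESIZE);
        memset(buf, 'x', FILESIZE);

        /* phase 1: create + write + close every file */
        for (int i = 0; i < NFILES; i++) {
            snprintf(path, sizeof(path), "%s/f%04d", argv[1], i);
            int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd < 0) {
                perror("create");
                return 1;
            }
            if (write(fd, buf, FILESIZE) != FILESIZE) {
                perror("write");
                return 1;
            }
            close(fd);
        }

        /* phase 2: open + read + close every file */
        for (int i = 0; i < NFILES; i++) {
            snprintf(path, sizeof(path), "%s/f%04d", argv[1], i);
            int fd = open(path, O_RDONLY);
            if (fd < 0) {
                perror("open");
                return 1;
            }
            while (read(fd, buf, FILESIZE) > 0)
                ;                               /* data is discarded; only the I/O matters */
            close(fd);
        }
        free(buf);
        return 0;
    }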

  1. Modified fs_mark (available upon request)

  2. The setupFsmark.sh script for creating directories needed by fs_mark

  3. The runFsmark.sh script for running a single instance of fs_mark


IOR

IOR simulates various checkpointing patterns that appear in the HPC domain.
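
The basic pattern can be sketched as follows (a simplification without MPI; transfer size, block count, and iteration count are arbitrary): each iteration dumps a checkpoint file in fixed-size transfers and makes it durable with fsync() before computation resumes.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define TRANSFER   (1024 * 1024)   /* bytes per write() call */
    #define BLOCKS     256             /* 256 MB checkpoint per iteration */
    #define ITERATIONS 4

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <checkpoint file>\n", argv[0]);
            return 1;
        }
        char *buf = malloc(TRANSFER);
        memset(buf, 0, TRANSFER);

        for (int it = 0; it < ITERATIONS; it++) {
            /* ... a compute phase would go here ... */
            int fd = open(argv[1], O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd < 0) {
                perror("open");
                return 1;
            }
            for (int b = 0; b < BLOCKS; b++)
                if (write(fd, buf, TRANSFER) != TRANSFER) {
                    perror("write");
                    return 1;
                }
            fsync(fd);                 /* the checkpoint must be durable */
            close(fd);
        }
        free(buf);
        return 0;
    }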

  1. IOR benchmark

  2. The runIor.sh script for running a single instance of IOR


Contributors

Shoaib Akram, Manolis Marazakis, Dhiraj Gulati, Markos Fountoulakis, Yannis Klonatos, Angelos Bilas


Contact Information

For any comments and questions please contact:

Shoaib Akram, s h b a k r a m @ i c s . f o r t h . g r

FORTH-ICS, Greece