The BigData Top100 List: Crafting Benchmarks for Big Data

Chaitanya K. Baru, SDSC

We will describe the BigData Top100 List initiative, a new, open, community-based effort whose objective is to rank big data systems according to a well-defined, verifiable performance metric, while also providing an accompanying efficiency metric. While the performance of traditional database systems is well understood and measured by long-established institutions such as the Transaction Processing Performance Council (TPC), there is neither a clear definition of the performance of big data systems nor a generally agreed-upon metric for comparing them. With “big data” becoming a major force of innovation across enterprises of all sizes, there is a need for an objective benchmark for comparing such platforms. The initiative aims to define an end-to-end, application-layer benchmark for big data applications. We will provide an overview of the initiative and of two proposed approaches to workload specification. The first, BigBench, extends the TPC Decision Support (TPC-DS) benchmark to include semistructured and unstructured data, as well as queries with data mining operations. The second, Deep Analytics Pipeline, models multiple steps of an event-based processing pipeline, including data ingestion, data transformation/cleaning (ELT), filtering, machine learning, and model prediction. We will discuss data generation issues for both approaches. We will also describe a related initiative, the Big Data Analytics Benchmark Suite, whose objective is to identify real-world reference datasets for benchmarking analytics routines. We actively seek community input and participation in all of these activities.