IBM Platform Computing

The leader in Clusters, Grids, and HPC Clouds

 

Breakthrough Hadoop Performance

Better performance, efficiency, and superior management and monitoring with Platform Symphony and InfoSphere BigInsights

 

IBM Technical Computing

Solutions to address your compute and data-intensive needs

 

Breakthrough Hadoop Performance

IBM completed significant big data benchmarking using IBM Platform Symphony and IBM InfoSphere BigInsights. Platform Symphony is a distributed computing and big data analytics product widely used in large-scale grid computing environments. IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. Enterprises using the two products together get the benefit of a multi-tenant, heterogeneous application cluster with higher utilization and performance. With InfoSphere BigInsights, you can gain new insights from a combination of data sources and avoid the high cost of converting unstructured data sources to a structured format.

These benchmarks included:

The Terasort benchmark was run in an IBM private cloud environment of 1,000 virtual machines, where IBM sorted 100 terabytes in 10,369 seconds, slightly less than 3 hours. This bettered a prior world-record result while using only about 10% of the cores used in the previous result.

The SWIM benchmark (Statistical Workload Injector for MapReduce) represents a real-world big data workload and was developed by the University of California at Berkeley in close cooperation with Facebook. This test provides rigorous measurements of the performance of MapReduce systems using real industry workloads. Platform Symphony Advanced Edition accelerated SWIM/Facebook workload traces by approximately 6 times.

The “sleep” benchmark, promoted at Hadoop World 2011, is used to compare the core scheduling efficiency of MapReduce frameworks. IBM showed that Platform Symphony Advanced Edition runs the sleep test 63 times faster than Apache Hadoop 1.0.1 alone, indicating that low-latency scheduling is critical to maximizing Hadoop workload performance.

 

Audited Report

UCB SWIM & Hadoop scheduling benchmark results




Get the Platform Symphony Advantage

While a major benefit of IBM Platform Symphony is its ability to support diverse applications in a multi-tenant environment while ensuring service levels, these performance tests show that IBM Platform Symphony also helps provide dramatically better performance, efficiency, and superior management and monitoring.

Clients using this technology in conjunction with InfoSphere BigInsights get a fully supported, high-performance Hadoop stack that is easy to use and offers higher productivity through built-in accelerators and management tools.

  • Deliver faster and more accurate analysis for Big Data applications by doing greater processing with less infrastructure
  • Lower costs through reduction in infrastructure and administration overhead
  • Enable business agility by supporting multiple groups and diverse workloads on a single shared cluster

These results are important not only because they demonstrate faster MapReduce job execution times, but because they show that organizations running Hadoop workloads can save a significant amount of money on computing infrastructure by using IBM Platform Symphony.

 




Terasort Benchmark

Running IBM InfoSphere BigInsights in a private cloud environment managed by IBM Platform Symphony, IBM demonstrated a 100 TB Terasort result on a cluster of 1,000 virtual machines, 200 physical nodes and 2,400 processing cores. Running the industry-standard Terasort benchmark in this private cloud, IBM beat a prior world record⁴ using 17 times fewer servers and 12 times fewer total processing cores. This result showed not only that it is straightforward to build a large-scale Hadoop environment using IBM’s cloud-based solutions, but also that big data workloads with IBM BigInsights can be run more economically using IBM Platform Symphony, providing dramatic savings in infrastructure, power and facilities.


Hardware

  • 200 IBM dx360M3 computers in iDataPlex racks
  • 2 IBM dx360M3 computers in iDataPlex racks as master hosts
  • 120 GB memory per host, 12 x 3 TB spindles per host
  • 2,400 cores

Software

  • 1,000 virtual machines
  • RHEL 6.2 with KVM
  • IBM InfoSphere BigInsights 1.3.0.1
  • IBM Platform Symphony Advanced Edition 5.2
  • IBM Platform Symphony BigInsights Integration Path for 1.3.0.1

Results

  • 100 TB sort in 10,369 seconds
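
The figures in the tables above can be turned into a few derived rates. The sketch below is illustrative arithmetic only: it assumes decimal units (1 TB = 1,000 GB), and the function and constant names are hypothetical rather than part of any IBM tooling.

```python
# Figures taken from the Terasort hardware, software and results lists.
DATA_TB = 100
ELAPSED_S = 10_369
COMPUTE_NODES = 200
VIRTUAL_MACHINES = 1_000
CORES = 2_400

# Assumption: decimal units, 1 TB = 1,000 GB and 1 GB = 1,000 MB.
GB_PER_TB = 1_000
MB_PER_GB = 1_000


def terasort_summary():
    """Derive aggregate and per-core sort rates from the published figures."""
    total_gb = DATA_TB * GB_PER_TB
    gb_per_second = total_gb / ELAPSED_S
    return {
        "hours": ELAPSED_S / 3600,
        "gb_per_second": gb_per_second,
        "mb_per_second_per_core": gb_per_second * MB_PER_GB / CORES,
        "vms_per_node": VIRTUAL_MACHINES / COMPUTE_NODES,
        "cores_per_node": CORES / COMPUTE_NODES,
    }


if __name__ == "__main__":
    for name, value in terasort_summary().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the run works out to roughly 2.9 hours and a little under 10 GB of sorted data per second across the cluster, with five virtual machines and twelve cores per compute host.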



SWIM Benchmark

Perhaps even more compelling are results obtained using real-world workloads. The SWIM benchmark, developed at the University of California, Berkeley in cooperation with Facebook, measures real-world MapReduce workloads by replaying traces of application activity captured at Facebook in 2009 and 2010. The benchmark authors view this as a more rigorous predictor of MapReduce performance.


Hardware

  • 17 IBM dx360M3 computers in iDataPlex racks
  • 2 IBM dx360M3 computers in iDataPlex racks as master hosts

Software

  • RHEL 6.3 with KVM
  • Apache Hadoop 1.0.1
  • IBM Platform Symphony Advanced Edition 5.2

Results

  • Run-time for first 20 minutes of FB workload (302 jobs) – 10,100 seconds.
  • Run-time for the same first 20 minutes using Platform Symphony – 1,700 seconds.
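
The speedup implied by these two run-times can be checked with simple arithmetic. The sketch below uses only the numbers in the results list; the helper names are hypothetical.

```python
# Run-times for the first 20 minutes of the Facebook trace (302 jobs),
# as reported in the results list.
HADOOP_RUNTIME_S = 10_100
SYMPHONY_RUNTIME_S = 1_700
JOBS = 302


def speedup(baseline_s, improved_s):
    """Ratio of baseline run-time to improved run-time."""
    return baseline_s / improved_s


def jobs_per_minute(jobs, runtime_s):
    """Average job completion rate over the whole run."""
    return jobs / runtime_s * 60


if __name__ == "__main__":
    ratio = speedup(HADOOP_RUNTIME_S, SYMPHONY_RUNTIME_S)
    print(f"speedup: {ratio:.2f}x")  # about 5.9x, i.e. "nearly six times"
    print(f"Hadoop:   {jobs_per_minute(JOBS, HADOOP_RUNTIME_S):.1f} jobs/min")
    print(f"Symphony: {jobs_per_minute(JOBS, SYMPHONY_RUNTIME_S):.1f} jobs/min")
```

The ratio is specific to this trace and this 17-node cluster; it should not be read as a general scaling factor.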



Using the SWIM benchmark, IBM demonstrated, in results audited by an independent testing organization, that augmenting a Hadoop cluster with Platform Symphony ran the simulated Facebook workloads nearly six times faster than Apache Hadoop alone. As a corollary, given the nature of the SWIM benchmark, this result indicates that equivalent performance could have been obtained with Symphony using dramatically less hardware and at lower infrastructure cost.


Sleep Benchmark


Hardware

  • 17 IBM dx360M3 computers in iDataPlex racks
  • 2 IBM dx360M3 computers in iDataPlex racks as master hosts

Software

  • RHEL 6.3 with KVM
  • Apache Hadoop 1.0.1
  • IBM Platform Symphony Advanced Edition 5.2
  • Hadoop command tested: hadoop jar examples.jar sleep -mt 1 -rt 1 -m 5000 -r 1

Results

  • Task throughput, 5,000 map tasks on Symphony – 342.94 tasks/second
  • Task throughput, 5,000 map tasks on Hadoop – 5.48 tasks/second
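
The 63x figure quoted for this test follows from the two throughput numbers. The sketch below reproduces that ratio and also derives the cluster-wide average time per completed task. This is an illustration, not a measured latency: with each map task sleeping only 1 msec, the average is dominated by scheduling overhead. The function names are hypothetical.

```python
# Throughput figures from the results list (tasks per second).
SYMPHONY_TASKS_PER_S = 342.94
HADOOP_TASKS_PER_S = 5.48
MAP_TASKS = 5_000


def throughput_ratio(fast, slow):
    """How many times higher the faster throughput is."""
    return fast / slow


def elapsed_seconds(tasks, tasks_per_s):
    """Wall-clock time to finish all tasks at a constant rate."""
    return tasks / tasks_per_s


def avg_ms_per_task(tasks_per_s):
    """Cluster-wide average time per completed task, in milliseconds."""
    return 1000 / tasks_per_s


if __name__ == "__main__":
    ratio = throughput_ratio(SYMPHONY_TASKS_PER_S, HADOOP_TASKS_PER_S)
    print(f"throughput ratio: {ratio:.1f}x")  # about 62.6x, rounded to 63x
    for name, rate in (("Symphony", SYMPHONY_TASKS_PER_S),
                       ("Hadoop", HADOOP_TASKS_PER_S)):
        print(f"{name}: {elapsed_seconds(MAP_TASKS, rate):.1f} s total, "
              f"{avg_ms_per_task(rate):.1f} ms per task on average")
```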



IBM Platform Symphony brings many advantages to distributed computing environments, including multi-tenancy, guaranteed service levels, superior management tools, and support for diverse, heterogeneous workloads. Because this test focused on performance, the Hadoop “sleep” benchmark shared at Hadoop World in 2011⁵ was also run to compare the scheduling efficiency of IBM Platform Symphony with that of competing Hadoop distributions. This was also an audited result, to be published by a third party. On this standard test, promoted as a measure of scheduling efficiency, IBM Platform Symphony ran a Hadoop 1.0.1 sleep job consisting of 5,000 map tasks of 1 msec each 63 times faster on identical infrastructure.




These results demonstrate that IBM Platform Symphony can provide dramatic performance advantages and financial savings to customers deploying big data environments. For IBM InfoSphere BigInsights users, or those considering open-source or derivative Hadoop environments, IBM Platform Symphony can help accelerate Hadoop workloads while reducing cost and improving workload reliability.




1 SWIM benchmark details

2 The actual physical cluster size was 250 nodes, but only 202 servers were included in this test: two hosts were allocated as management hosts, and 200 hosts were allocated as compute hosts. Technically, the number of hosts employed in the test was 201, because the second management host was configured only to provide transparent fail-over services in case anything went wrong with the primary host.

3 17 nodes were configured as compute hosts and there were two Symphony management hosts for a total of 19 nodes. The cores counted reflected only the cores running MapReduce workloads. Each node had two processors with six cores per CPU.

4 The result established by Yahoo in 2009 is regarded by many as the de facto standard 100 TB sort result. Clearly technology has changed in three years, but what is notable is that this 2012 result was achieved running in VMs, while the Yahoo result was achieved on bare metal, which is faster in theory.