Apache Hadoop vs. IBM Platform Symphony & InfoSphere BigInsights: see our breakthrough Hadoop performance
IBM has completed several significant big data benchmarks using IBM Platform Symphony and various Hadoop distributions, including IBM InfoSphere BigInsights. Platform Symphony is a distributed computing and big data analytics product widely used in large-scale grid computing environments. IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. Organizations using the two products together get the benefit of a multi-tenant, heterogeneous application cluster with higher utilization and performance. Using InfoSphere BigInsights, you can gain new insights from a combination of data sources and overcome the high costs of converting unstructured data sources to a structured format.
These benchmarks included:
The Terasort benchmark, run on an IBM cloud using IBM InfoSphere BigInsights. In an unaudited result on an IBM private cloud environment of 1,000 virtual machines, IBM obtained a breakthrough Terasort result, sorting 100 terabytes in 10,369 seconds, slightly less than 3 hours. This bettered a prior world-record result while requiring only 10% of the cores used in the previous result.
A Contrail bio workload. Contrail is an open-source software package used for de novo genome assembly. In a controlled test conducted in March 2013, Platform Symphony Advanced Edition was found to reduce the time required to sequence a 10K read sample from a reference E. coli bacterium genome by a factor of 3.4 on an eight-node Apache Hadoop cluster. Read the report here.
The SWIM benchmark (Statistical Workload Injector for MapReduce), a benchmark representing a real-world big data workload, developed by the University of California at Berkeley in close cooperation with Facebook. This test provides rigorous measurements of the performance of MapReduce systems using real industry workloads. In a test audited by STAC Research, Platform Symphony Advanced Edition accelerated SWIM/Facebook workload traces run using open-source Apache Hadoop by a factor of approximately 7. A STAC Report featuring this result, published in December 2012, is available for download here.
The "sleep" benchmark, promoted at Hadoop World 2011 and used to compare the core scheduling efficiency of MapReduce frameworks. IBM demonstrated that Platform Symphony Advanced Edition runs the sleep test 63 times faster than Apache Hadoop alone, showing that low-latency scheduling is critical to maximizing Hadoop workload performance. You can see a visual demonstration of this exceptional performance gain in a short video included in a blog article on low-latency Hadoop for Risk Analytics.
Get the Platform Symphony Advantage
While a major benefit of IBM Platform Symphony is its ability to support diverse applications in a multi-tenant environment while ensuring service levels, these performance tests show that IBM Platform Symphony also helps provide dramatically better performance, efficiency, and superior management and monitoring.
Clients using this technology in conjunction with InfoSphere BigInsights get a fully supported, high-performance Hadoop stack that is easy to use and raises productivity through built-in accelerators and management tools.
These results are important not only because they demonstrate faster MapReduce job execution times, but because they show that organizations running Hadoop workloads can save a significant amount of money on computing infrastructure by using IBM Platform Symphony.
Running IBM InfoSphere BigInsights on a private cloud environment managed by IBM Platform Symphony in August 2012, IBM demonstrated a 100 TB Terasort result on a cluster composed of 1,000 virtual machines, 200 physical nodes and 2,400 processing cores. Running the industry-standard Terasort benchmark in this private cloud, IBM beat a prior world record (see note 4) using 17 times fewer servers and 12 times fewer total processing cores. This result showed not only that it is straightforward to build a large-scale Hadoop environment using IBM's cloud-based solutions, but also that big data workloads with IBM InfoSphere BigInsights can be run more economically using IBM Platform Symphony, providing dramatic savings related to infrastructure, power and facilities.
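To put the Terasort figures in perspective, the following Python sketch derives the sustained throughput implied by the numbers quoted in this article (100 TB, 10,369 seconds, 200 physical nodes, 1,000 virtual machines, 2,400 cores). It is only back-of-the-envelope arithmetic. The function name is illustrative, and the use of decimal terabytes (10^12 bytes) is an assumption, since the article does not say which unit convention was used.

```python
# Figures taken from the article: 100 TB sorted in 10,369 seconds on
# 200 physical nodes, 1,000 virtual machines and 2,400 cores.
TB = 10**12  # assumption: decimal terabytes


def sort_throughput(data_tb, seconds, nodes, cores, vms):
    """Return derived throughput and density figures for a sort run."""
    total_bytes = data_tb * TB
    bytes_per_second = total_bytes / seconds
    return {
        "hours": seconds / 3600,
        "aggregate_gb_per_s": bytes_per_second / 1e9,
        "mb_per_s_per_node": bytes_per_second / nodes / 1e6,
        "mb_per_s_per_core": bytes_per_second / cores / 1e6,
        "cores_per_node": cores / nodes,
        "vms_per_node": vms / nodes,
    }


stats = sort_throughput(100, 10_369, nodes=200, cores=2_400, vms=1_000)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the run works out to roughly 2.9 hours and a little under 10 GB/s of aggregate sort throughput, with 12 cores and 5 virtual machines per physical node.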
Contrail-bio Genome Sequencing Benchmark
Contrail is an open-source software effort that leverages Hadoop MapReduce to accelerate de novo genome assembly. In March 2013, IBM conducted a series of tests to understand the performance advantage that Symphony could offer on a reference Hadoop cluster running a 10K read sample of an E. coli bacterium included as part of the Contrail software suite. In an eight-node Hadoop cluster with 108 cores dedicated to Map and Reduce tasks, Platform Symphony was found to compute results 3.4 times faster than Hadoop alone, reducing the job run time from 873 seconds to 258 seconds on the same cluster and dataset. Get the result here.
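The 3.4 times figure follows directly from the two run times quoted for the Contrail test. As a quick check, the Python sketch below recomputes the speedup and the share of run time saved. The helper names are illustrative and are not part of Contrail, Hadoop or Platform Symphony.

```python
def speedup(baseline_s, accelerated_s):
    """Ratio of baseline run time to accelerated run time."""
    return baseline_s / accelerated_s


def time_saved_pct(baseline_s, accelerated_s):
    """Percentage of the baseline run time that was eliminated."""
    return 100 * (baseline_s - accelerated_s) / baseline_s


# Run times quoted in the article for the Contrail test.
HADOOP_SECONDS = 873
SYMPHONY_SECONDS = 258

contrail_speedup = speedup(HADOOP_SECONDS, SYMPHONY_SECONDS)
contrail_saved = time_saved_pct(HADOOP_SECONDS, SYMPHONY_SECONDS)
print(f"speedup: {contrail_speedup:.1f}x")
print(f"run time saved: {contrail_saved:.0f}%")
```

A speedup of about 3.4 corresponds to a run-time reduction of roughly 70%.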
Equally compelling are results obtained using social media workloads. The SWIM benchmark, developed at the University of California, Berkeley with cooperation from Facebook, measures real-world MapReduce workloads by simulating traces of application activity captured at Facebook in 2009 and 2010. The benchmark authors view this as a more rigorous predictor of MapReduce performance.
Using this benchmark, IBM demonstrated, in results audited by an independent testing organization, that when a Hadoop cluster was augmented with Platform Symphony, the simulated Facebook workloads ran nearly six times faster than on Apache Hadoop alone. As a corollary, given the nature of the SWIM benchmark, this result demonstrated that equivalent performance could have been obtained with Symphony using dramatically less hardware and at lower infrastructure cost.
IBM Platform Symphony brings many advantages to distributed computing environments, including multi-tenancy, guaranteed service levels, superior management tools, and support for diverse, heterogeneous workloads. Because this test focused on performance, the Hadoop "sleep" benchmark shared at Hadoop World in 2011 (see note 5) was also run to demonstrate the scheduling efficiency of IBM Platform Symphony relative to competing Hadoop distributions. This was also an audited result to be published by a third party. In this standard test, promoted as a measure of scheduling efficiency, IBM Platform Symphony ran a Hadoop 1.0.1 sleep test consisting of 5000 x 1 msec map tasks 63 times faster on identical infrastructure.
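Because each of the 5,000 map tasks in the sleep test does only 1 msec of work, the job's run time is dominated by how long the scheduler takes to dispatch each task. The Python sketch below illustrates this with a deliberately simple "wave" model. The model itself, the slot count and both per-task overhead values are hypothetical assumptions chosen for illustration. They are not measurements from the audited test and do not describe the internals of either scheduler.

```python
import math


def makespan_ms(num_tasks, task_ms, slots, overhead_ms):
    """Simple wave model: tasks run in waves of `slots` at a time, and
    every task pays a fixed scheduling overhead on top of its run time."""
    waves = math.ceil(num_tasks / slots)
    return waves * (task_ms + overhead_ms)


NUM_TASKS = 5000  # from the article
TASK_MS = 1       # from the article
SLOTS = 100       # hypothetical number of concurrent map slots

# Hypothetical per-task scheduling overheads, for illustration only.
slow = makespan_ms(NUM_TASKS, TASK_MS, SLOTS, overhead_ms=500)
fast = makespan_ms(NUM_TASKS, TASK_MS, SLOTS, overhead_ms=5)
ideal = makespan_ms(NUM_TASKS, TASK_MS, SLOTS, overhead_ms=0)

print(f"ideal (no overhead): {ideal} ms")
print(f"high-overhead scheduler: {slow} ms")
print(f"low-overhead scheduler: {fast} ms")
print(f"ratio: {slow / fast:.1f}x")
```

In this toy model the useful work amounts to only 50 msec of wall-clock time, so nearly all of the difference between the two runs comes from scheduling overhead, which is why the sleep test is used as a measure of scheduler latency rather than of compute throughput.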
These results demonstrate that IBM Platform Symphony can provide dramatic performance advantages and financial savings to customers deploying big data environments. For IBM InfoSphere BigInsights users, or those considering open-source or derivative Hadoop environments, IBM Platform Symphony can help accelerate Hadoop workloads while reducing cost and improving workload reliability.
2 The actual physical cluster size was 250 nodes, but only 202 servers were included in this test: two hosts were allocated as management hosts, and 200 hosts were allocated as compute hosts. Technically, the number of hosts actively employed in the test was 201, because the second management host was configured only to provide transparent failover in case anything went wrong with the primary host.
3 Seventeen nodes were configured as compute hosts, and there were two Symphony management hosts, for a total of 19 nodes. The core count reflects only the cores running MapReduce workloads. Each node had two processors with six cores each.
4 The result established by Yahoo in 2009 is regarded by many as the de facto standard for 100 TB sort results. Clearly, technology has changed in three years, but what is notable is that this 2012 result was achieved running in VMs, while the Yahoo result was achieved on bare metal, which is faster in theory.