A Comparison of Hadoop Distributions – Cluster Installation and Management Aspects

Araya Florence
Thanisa Numnonda

Abstract

Big data is one of the most promising technologies working alongside cloud computing and the Internet of Everything. Every second, data generated by billions of devices are sent to the cloud to be analysed and possibly used for prediction or prevention in various applications. A big data platform is the foundation of such an implementation, providing an ecosystem in which data can be imported, processed and exported. This article compares three platforms, Apache Hadoop, Cloudera (Express) and Hortonworks, in terms of stability, installation and cluster management. Apache Spark was chosen to test processing on all three distributions since it is reported to be ten times faster than Hive and 100 times faster than MapReduce. HiBench was used as the testing benchmark; its results were previously reported in [1]. In terms of cluster management, the commercially based distributions are more likely to offer better tools for installation and cluster management, while Apache Hadoop is robust but lacks manageability.

Section
Research Article

References

[1] Florence A, Numnonda T. A comparison of Apache Hadoop distributions using HiBench. In: 22nd International Symposium on Artificial Life and Robotics. Japan; 2017. p. 218–222.
[2] Huang S, Huang J, Dai J, Xie T, Huang B. The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis. In: 2013 IEEE 29th International Conference on Data Engineering Workshops (ICDEW). 2013. p. 41–51.
[3] Holmes A. Hadoop in Practice, 1st Edition. Manning Publications; 2012.
[4] Julio P. Big data Analytics with Hadoop. Available from: http://www.slideshare.net/PhilippeJulio/hadoop-architecture [Accessed August 2017].
[5] Apache. Apache Hadoop Releases. Available from: http://hadoop.apache.org/releases.html [Accessed August 2017].
[6] Apache. HDFS. Available from: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html [Accessed August 2017].
[7] Wikipedia. MapReduce. Available from: https://en.wikipedia.org/wiki/MapReduce [Accessed August 2016].
[8] Apache. Spark. Available from: http://spark.apache.org/ [Accessed August 2016].
[9] Holoman J, O'Dell K. How-to: Deploy Apache Hadoop Clusters Like a Boss. 2015.
[10] Thirumala Rao B, et al. Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing. Global Journal of Computer Science and Technology. 2012; Volume XI Issue VIII May 2011.
[11] Gu R, et al. SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters. J. Parallel Distrib. Comput. 2014; 74: 2166–2179.