Hadoop

Hadoop, formally known as Apache Hadoop, is an open-source software framework for distributed storage and distributed processing of very large datasets on computer clusters built from commodity hardware. The framework takes care of distributing computation across the cluster efficiently and of handling the random hardware failures that inevitably occur at that scale. Hadoop is one of the main software solutions in the Big Data ecosystem.

The core of Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across the nodes in a cluster. To process the data, Hadoop transfers packaged code to the nodes, which then process in parallel the data they hold. This approach takes advantage of data locality (nodes manipulating the data they have access to) to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are distributed via high-speed networking.
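
The classic illustration of this model is a word count: the map step runs next to each input block and emits (word, 1) pairs, and the reduce step sums the counts for each word. The sketch below closely follows the canonical word-count example from the Hadoop MapReduce tutorial; the input and output paths are assumed to be HDFS directories supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map step: runs near each input block and emits (word, 1) for every token.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce step: receives all counts for one word and sums them.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }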

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
  • Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster (a short client-API sketch follows this list).
  • Hadoop YARN: a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
  • Hadoop MapReduce: an implementation of the MapReduce programming model for large-scale data processing.
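
Applications talk to HDFS through the org.apache.hadoop.fs.FileSystem client API. A minimal round-trip sketch, assuming the cluster address is picked up from a core-site.xml on the classpath (the file path below is illustrative):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
      public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt"); // illustrative path

        // Write a small file; larger files are split into blocks (128 MB by
        // default in recent releases), each replicated across several nodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same API.
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
          System.out.println(in.readLine());
        }

        fs.delete(file, false); // clean up (non-recursive)
      }
    }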

The term Hadoop has come to refer not just to the base modules above, but also to the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.

Apache Hadoop’s MapReduce and HDFS components were inspired by Google’s papers on MapReduce and the Google File System.

The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with “Hadoop Streaming” to implement the “map” and “reduce” parts of the user’s program. Other projects in the Hadoop ecosystem expose richer user interfaces.
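
The contract behind Hadoop Streaming is simple: a map or reduce task is any executable that reads input records on standard input and writes tab-separated key/value lines on standard output. The sketch below illustrates that contract (in Java only for consistency with the rest of this article; the class name is illustrative):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // A Streaming map task is any executable that reads lines on stdin and
    // writes "key<TAB>value" lines on stdout; Hadoop handles the rest.
    public class StreamingWordCountMapper {
      public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
          for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
              System.out.println(token + "\t1"); // emit (word, 1)
            }
          }
        }
      }
    }

A job built this way is submitted through the hadoop-streaming jar with its -input, -output, -mapper, and -reducer options; Hadoop pipes each input split through the mapper executable and the sorted intermediate pairs through the reducer, which is why the same mapper is often written as a few lines of Python or shell instead.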

See also

Computational intelligence, Mathematical optimization, Computer vision, Machine learning, Artificial intelligence, Spatial data analysis, Data analysis

Material

  • http://hadoop.apache.org/
  • http://www.ibm.com/analytics/us/en/technology/hadoop/
  • http://hortonworks.com/apache/hadoop/
  • http://www.datascienceassn.org/content/data-locality-hpc-vs-hadoop-vs-spark

Books