Hadoop is an Apache open source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
The Hadoop framework contains the following four modules:
Hadoop Common: The Java libraries and utilities required by the other Hadoop modules. These libraries contain the necessary Java files and scripts to start Hadoop and provide OS- and filesystem-level abstractions.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A YARN-based system for parallel processing of large datasets.
MapReduce
The term MapReduce actually refers to the following two different tasks that Hadoop applications perform:
The Map Task: This is the first task. It takes a set of input data and converts it into intermediate data, where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output of a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
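To make the two tasks concrete, here is a minimal sketch based on the classic word-count example, using Hadoop's org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are illustrative, not part of Hadoop itself: the map task emits a (word, 1) tuple for every word it sees, and the reduce task combines all tuples for the same word into a single count.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: split each input line into words and emit one (word, 1) tuple per word.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit an intermediate key/value pair
        }
    }
}

// Reduce task: receive all counts for one word and combine them into a single total.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        result.set(sum);
        context.write(key, result); // emit (word, total) as part of the smaller output set
    }
}
```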
Normally both the input and the output are stored in a filesystem. The framework takes care of scheduling tasks, monitoring them, and re-executing any that fail.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The slave TaskTrackers execute the tasks as directed by the master and periodically report task-status information back to it.
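From the application's point of view, all of that scheduling and tracking happens behind the scenes; the program only describes the job and submits it. A minimal driver sketch follows, assuming the illustrative WordCountMapper and WordCountReducer classes from the earlier example and input/output paths passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Tell the framework which map and reduce classes to run; scheduling,
        // tracking, and re-executing failed tasks are handled by Hadoop itself.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Both the input and the output live on the (distributed) filesystem.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```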