Do you want to know more about the Hadoop Architecture in Big Data? HDFS, MapReduce, and YARN are the three most important concepts in Hadoop. In this tutorial, you will learn the Apache Hadoop HDFS and YARN architectures in detail.
What is Hadoop?
Hadoop is an open-source framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models. It comes from Apache and is used to store, process, and analyze very large volumes of data. A summary of the Hadoop framework follows:
In a nutshell, Hadoop provides a reliable shared storage and analysis system for big data. The Hadoop Distributed File System (HDFS) is specially designed to be highly fault-tolerant.
Hadoop employs a NameNode and DataNode architecture to implement the HDFS, which provides high-performance access to data across highly scalable Hadoop clusters. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop Version 2.0 and above, employs YARN (Yet Another Resource Negotiator) Architecture, which allows different data processing methods like graph processing, interactive processing, stream processing as well as batch processing to run and process data stored in HDFS.
Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch or offline processing. It is used by Facebook, Yahoo, Twitter, LinkedIn, and many more. Moreover, it can be scaled up easily just by adding nodes to the cluster.
Key Benefits of Hadoop Architecture
Apache Hadoop was developed to provide a low-cost, redundant data store that lets organizations leverage big data analytics economically and maximize business profitability.
The four key features of Hadoop Architecture are:
- Economical: Its systems are highly economical as ordinary computers can be used for data processing.
- Reliable: It is reliable as it stores copies of the data on different machines and is resistant to hardware failure.
- Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes are enough to scale up the framework.
- Flexible: It is flexible: you can store as much structured and unstructured data as you need and decide how to use it later.
Facebook, eBay, LinkedIn, and Twitter are among the web companies that use HDFS to underpin big data analytics.
How does Hadoop work?
Hadoop runs code across a cluster of computers and performs the following four primary tasks:
- Data is initially divided into files and directories. Files are then divided into uniformly sized blocks (128 MB by default in Hadoop 2.x, 64 MB in Hadoop 1.x).
- Then, the files are distributed across various cluster nodes for further processing of data.
- The JobTracker schedules processing tasks on individual nodes.
- Once all the nodes have finished processing, the output is returned.
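The block-splitting step above can be sketched in plain Python. This is an illustrative model, not the real HDFS client; `split_into_blocks` is a hypothetical helper.

```python
# Illustrative sketch (not the real HDFS client): splitting input data
# into fixed-size blocks before they are distributed across the cluster.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (block_index, offset, length) tuples covering the file."""
    blocks = []
    offset = 0
    index = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((index, offset, length))
        offset += length
        index += 1
    return blocks

# A 300 MB file yields two full 128 MB blocks and one 44 MB tail block.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that the last block is usually smaller than the configured block size; HDFS does not pad it.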
What is Hadoop Architecture?
Hadoop follows a Master Slave architecture for the transformation and analysis of large datasets using Hadoop MapReduce paradigm. The three core Hadoop components that play a key role in the Hadoop architecture are as follows:
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Hadoop YARN (Yet Another Resource Negotiator), introduced in Version 2.0.
HDFS file system
The HDFS file system replicates, or copies, each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack than the others. In Hadoop 1.0, the batch processing framework MapReduce was closely paired with HDFS.
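The rack-aware replication described above can be sketched as follows. This is a simplified model, not HDFS's actual block placement policy; the cluster layout and helper function are hypothetical.

```python
# Illustrative sketch (not the real HDFS block placement policy): place
# replicas so that copies land on more than one server rack.

def place_replicas(nodes_by_rack, replication=3):
    """nodes_by_rack: dict mapping rack name -> list of node names.
    Returns a list of (rack, node) choices for one block."""
    racks = sorted(nodes_by_rack)
    if len(racks) < 2:
        raise ValueError("need at least two racks for rack-aware placement")
    placements = []
    # First replica on one rack, remaining replicas on a second rack,
    # in the spirit of HDFS's default policy (local, remote, remote).
    first_rack, second_rack = racks[0], racks[1]
    placements.append((first_rack, nodes_by_rack[first_rack][0]))
    for i in range(replication - 1):
        nodes = nodes_by_rack[second_rack]
        placements.append((second_rack, nodes[i % len(nodes)]))
    return placements

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas(cluster)
```

Spreading replicas across racks means the loss of an entire rack still leaves at least one live copy of every block.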
MapReduce is a programming model used for parallel computation of large data sets (larger than 1 TB). Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data. Data computed by MapReduce can come from multiple sources, such as the local file system, HDFS, and databases, but most of it comes from HDFS, whose high throughput can be used to read massive datasets. After computation, results can be stored back in HDFS.
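The map/shuffle/reduce flow can be illustrated with a single-process word count in plain Python. Real Hadoop distributes these phases across the cluster; the function names here are illustrative, not Hadoop API.

```python
# A minimal, single-process sketch of the MapReduce flow
# (map -> shuffle -> reduce); real Hadoop runs these phases on many nodes.
from collections import defaultdict

def map_phase(record):
    # Map: emit a (word, 1) pair for every word in the input record.
    return [(word, 1) for word in record.split()]

def shuffle_phase(mapped):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for each word.
    return key, sum(values)

records = ["big data big cluster", "big data"]
mapped = [pair for r in records for pair in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
# counts["big"] == 3, counts["data"] == 2, counts["cluster"] == 1
```

The same three phases apply no matter how many machines participate; only the shuffle becomes a network operation in a real cluster.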
YARN – Yet Another Resource Negotiator
YARN was designed to overcome the limitations of MapReduce 1. The fundamental idea of YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons. The Hadoop YARN architecture consists of four main components: Resource Manager, Node Manager, Application Master, and Container. The following sections discuss each of them in more detail.
Hadoop 1.0 and MapReduce Version 1
Hadoop 1.0 is also known as MapReduce Version 1 (MRV1). There was only a single master, the Job Tracker, which allocated resources, performed scheduling, and monitored the processing jobs. It assigned map and reduce tasks to a number of subordinate processes called Task Trackers, which periodically reported their progress to the Job Tracker. This design resulted in a scalability bottleneck due to the single Job Tracker.
Apart from this limitation, the utilization of computational resources was inefficient in MRV1, and the Hadoop framework was limited to the MapReduce processing paradigm alone.
To overcome these issues, YARN was introduced in Hadoop version 2.0 in 2012 by Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over the responsibilities of resource management and job scheduling. YARN gave Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.
YARN (Yet Another Resource Negotiator) is part of Hadoop version 2. YARN was initially called "MapReduce 2" since it took the original MapReduce to another level by decoupling resource management from data processing.
Hadoop YARN is designed to provide a generic and flexible framework to administer the computing resources in the Hadoop cluster. YARN allows different data processing methods like graph processing, interactive processing, stream processing as well as batch processing to run and process data stored in HDFS. Therefore YARN opens up Hadoop to other types of distributed applications beyond MapReduce.
The four key components of the YARN architecture are as follows:
- Resource Manager: Runs as a master daemon and manages resource allocation in the cluster. The Resource Manager has two components: the Scheduler and the Applications Manager.
- Node Manager: Runs as a slave daemon and is responsible for the execution of tasks on each DataNode.
- Application Master: Manages the user job lifecycle and resource needs of individual applications. It works along with the Node Manager and monitors the execution of tasks.
- Container: A package of resources, including RAM, CPU, network, HDD, and so on, on a single node.
1. Resource Manager
The Resource Manager is the core component of the entire architecture; it is responsible for managing resources, including RAM, CPU, and other resources, throughout the cluster. It is the ultimate authority in resource allocation.
On receiving processing requests, it passes parts of the requests to the corresponding Node Managers, where the actual processing takes place. It is the arbitrator of the cluster resources and decides the allocation of the available resources among competing applications. It optimizes cluster utilization (keeping all resources in use all the time) against constraints such as capacity guarantees, fairness, and SLAs. The Resource Manager has two main components: the Scheduler and the Applications Manager.
1.1 Scheduler
The Scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, and so on. It is called a pure scheduler because it does not perform any monitoring or tracking of application status; if there is an application failure or hardware failure, the Scheduler does not guarantee to restart the failed tasks. It performs scheduling based on the resource requirements of the applications, using a pluggable policy plug-in that is responsible for partitioning the cluster resources among the various applications. The Capacity Scheduler and the Fair Scheduler are the two plug-ins currently used as Schedulers in the ResourceManager.
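The idea behind fair scheduling can be sketched as follows. This is a toy model, not the actual YARN Fair Scheduler; the memory pool and application demands are made-up numbers.

```python
# Illustrative sketch of fair scheduling (not the actual YARN Fair Scheduler):
# divide a fixed pool of cluster memory evenly among running applications,
# capped by what each application actually asked for.

def fair_share(total_memory_mb, demands_mb):
    """demands_mb: dict app_name -> requested memory (MB). Returns allocations."""
    allocations = {app: 0 for app in demands_mb}
    remaining = total_memory_mb
    unsatisfied = dict(demands_mb)
    while remaining > 0 and unsatisfied:
        share = remaining // len(unsatisfied)  # equal share of what is left
        if share == 0:
            break
        progressed = False
        for app in list(unsatisfied):
            grant = min(share, unsatisfied[app])  # never exceed the demand
            allocations[app] += grant
            unsatisfied[app] -= grant
            remaining -= grant
            if unsatisfied[app] == 0:
                del unsatisfied[app]  # fully satisfied apps leave the pool
            if grant:
                progressed = True
        if not progressed:
            break
    return allocations

alloc = fair_share(8192, {"app1": 6000, "app2": 2000, "app3": 1000})
```

Small requests are satisfied in full, and the leftover capacity flows back to the applications that still want more, which is the essence of fair sharing.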
1.2 Applications Manager
The Applications Manager is responsible for accepting job submissions. It negotiates the first container from the Resource Manager for executing the application-specific Application Master, manages the running Application Masters in the cluster, and provides the service for restarting the Application Master container on failure.
2. Node Manager
The Node Manager takes care of an individual node in a Hadoop cluster and manages user jobs and workflow on that node. It registers with the Resource Manager and sends heartbeats with the health status of the node; its primary goal is to manage the application containers assigned to it by the Resource Manager. The Application Master requests an assigned container from the Node Manager by sending it a Container Launch Context (CLC), which includes everything the application needs in order to run; the Node Manager then creates the requested container process and starts it. The Node Manager also monitors the resource usage (memory, CPU) of individual containers, performs log management, and kills containers as directed by the Resource Manager.
3. Application Master
The third component of Apache Hadoop YARN is the Application Master. An application is a single job submitted to the framework; each application has a unique Application Master associated with it, which is a framework-specific entity.
It is the process that coordinates an application's execution in the cluster and also manages faults. Its task is to negotiate appropriate resource containers from the Resource Manager and to work with the Node Manager to execute and monitor the component tasks, tracking their status and progress. Once started, it periodically sends heartbeats to the Resource Manager to affirm its health and to update the record of its resource demands.
4. Container
The fourth component is the Container: a collection of physical resources, such as RAM, CPU cores, and disks, on a single node.
A YARN container is described by a Container Launch Context (CLC), a record that governs the container life cycle. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process. The container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
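The fields a CLC carries can be sketched as a plain record. All values below are hypothetical; the real CLC is a protocol-buffer record passed from the Application Master to the Node Manager, not a Python dict.

```python
# Hypothetical sketch of the fields a Container Launch Context carries;
# the real CLC is a protobuf record, and all values here are made up.

container_launch_context = {
    "environment": {"JAVA_HOME": "/usr/lib/jvm/default"},  # env variables
    "local_resources": ["hdfs:///apps/myapp/job.jar"],     # remote dependencies
    "security_tokens": b"...",                             # opaque credentials
    "service_data": {},                                    # NM service payloads
    "commands": ["java -Xmx1024m MyAppTask"],              # process to launch
}

def launch(clc, memory_mb=1024, vcores=1):
    """Pretend Node Manager launch: validate the CLC and 'start' the process."""
    required = {"environment", "local_resources", "commands"}
    missing = required - clc.keys()
    if missing:
        raise ValueError(f"incomplete CLC, missing: {missing}")
    return {"command": clc["commands"][0], "memory_mb": memory_mb, "vcores": vcores}

container = launch(container_launch_context)
```

The point is that the CLC bundles everything a node needs to start the process locally, so the Node Manager never has to ask the Application Master follow-up questions.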
HDFS distributes large datasets across multiple hosts to achieve parallel processing. It is a block-structured file system: input data is split into blocks of fixed size, which are distributed to the nodes in the cluster.
Goals of HDFS
- Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick, automatic fault detection and recovery.
- Huge datasets − HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.
- Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
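The "hardware at data" goal boils down to preferring a node that already holds a replica of the block. A minimal sketch follows; the node names and the `pick_node` helper are hypothetical.

```python
# Illustrative sketch of data locality: schedule a task on a node that
# already stores a replica of the block, falling back to any free node.

def pick_node(block_locations, free_nodes):
    """block_locations: nodes holding a replica; free_nodes: nodes with capacity."""
    local = [n for n in free_nodes if n in block_locations]
    if local:
        return local[0], "node-local"  # no network transfer needed
    return free_nodes[0], "remote"     # block must travel over the network

# Block replicas live on n2 and n5; n1, n2, n3 have free capacity.
node, locality = pick_node({"n2", "n5"}, ["n1", "n2", "n3"])
```

Real schedulers add an intermediate "rack-local" tier between node-local and remote, but the preference order is the same idea.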
What is HDFS in Hadoop?
So, what is HDFS? HDFS, the Hadoop Distributed File System, is written entirely in Java and is based on the Google File System (GFS). The Google File System is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware.
HDFS follows the master-slave architecture. Each cluster comprises a single master node and multiple slave nodes. Internally, files are divided into one or more blocks, and each block is stored on different slave machines depending on the replication factor.
User applications access the filesystem using the HDFS client, a library that exports the HDFS filesystem interface.
The building blocks of the HDFS architecture are the NameNode, DataNode, blocks, JobTracker, and TaskTracker. The Hadoop file system works as a master/slave file system in which the NameNode acts as the master and the DataNodes work as slaves. A Hadoop cluster consists of a single master node and multiple slave nodes. The master node includes the JobTracker, TaskTracker, NameNode, and DataNode, whereas a slave node includes a DataNode and a TaskTracker.
The NameNode does not contain the actual data of files. It stores metadata about the actual data, such as the file name, path, number of data blocks, block IDs, block locations, number of replicas, and other slave-related information. The NameNode acts as the master server and does the following tasks −
- Manages the file system namespace.
- Regulates clients' access to files.
- Executes file system operations such as renaming, closing, and opening files and directories.
DataNodes are responsible for storing the actual data. When a DataNode goes down, it has no effect on the Hadoop cluster, thanks to replication.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual DataNodes. These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS can read or write. The default size of an HDFS data block is 128 MB, but it can be changed in the HDFS configuration as needed.
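For example, the block size can be changed through the `dfs.blocksize` property in `hdfs-site.xml`. The value below is an example, not a recommendation:

```xml
<!-- hdfs-site.xml: raise the default block size to 256 MB -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value> <!-- 256 * 1024 * 1024 bytes -->
</property>
```

Larger blocks reduce NameNode metadata and seek overhead for very large files, at the cost of coarser parallelism for small ones.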
To sum up, we have given an overview of the growing Hadoop ecosystem architecture that handles modern big data problems. We have discussed the YARN architecture, including the Resource Manager and its components, as well as the HDFS architecture.
So now, take the next step and learn the details of Hadoop programming.