Designing a Highly Available, Fault Tolerant, Hadoop Cluster with Data Isolation

2022-09-24 03:25:55 By : Mr. xiaoxiong Chai

Live Webinar and Q&A: Top 10 Innovations in the NoSQL Cassandra Ecosystem (Live Webinar October 18, 2022) Save Your Seat

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Jordan Bragg discusses using entry-points, breadth-first scanning, and operation tagging to demystify the domain, see where to dive deeper, and uncover what technical debt may exist.

Even when designing a Minimum Viable Architecture (MVA), developers must consider resource location, especially when mobile apps are part of a distributed system. Distributing the data and processing can introduce new challenges if location is not part of the decision making criteria.

In a web-based service, a slowdown in request processing can eventually make your service unavailable. Chances are, not all requests need to be processed right away. Some of them just need an acknowledgement of receipt. Have you ever asked yourself: “Would I benefit from asynchronous processing of requests? If so, how would I make such a change in a live, large-scale mission critical system?”

Jessica Kerr considers that we should be looking at the software as part of the team, and observability in the software becomes an asset to organizing teams.

At QCon Plus November 2021, Nora Jones, CEO and founder of Jeli, talked about how to build production readiness reviews (PRR) with emphasis on context and psychological safety. Her talk focused on the particulars of a PRR process that relates to incidents.

Understand the emerging software trends you should pay attention to. Attend in-person on Oct 24-28, 2022.

Make the right decisions by uncovering how senior software developers at early adopter companies are adopting emerging trends. Register Now.

Adopt the right emerging trends to solve your complex engineering challenges. Register Now.

Your monthly guide to all the topics, technologies and techniques that every professional needs to know about. Subscribe for free.

InfoQ Homepage Articles Designing a Highly Available, Fault Tolerant, Hadoop Cluster with Data Isolation

Hadoop is no longer just a buzzword – It has become a business necessity. We always had an influx of data, but just recently we have unlocked the potential of this exponentially growing data. Modern techniques in big data analysis offer new methods to identify and rectify failures, help with data mining, provide feedback for optimization – the avenues are endless. The modern Hadoop ecosystem not only provides a reliable distributed aggregation system that seamlessly delivers data parallelism, but also allows for analytics that can provide great data insights.

In this article we will investigate the design of a highly available, fault tolerant Hadoop cluster. But first let’s dive into the core components of Apache Hadoop, after which we will walk through a few modifications to cater to a minimalistic list of design requirements that take advantage of the underlying Apache Hadoop infrastructure while adding security and data-level isolation. So, let’s talk about the core components (shown in figure 1) 

Uncover emerging trends and practices from domain experts. Attend in-person at QCon San Francisco (October 24-28, 2022).

Figure 1: Apache Hadoop Core Components

The HDFS cluster is comprised of a NameNode and multiple DataNodes in master-slave architecture as shown in figure 2. The NameNode is the data manager responsible for HDFS files and blocks and also the file system namespace. The information is stored persistently on the local drives as the namespace image and the edit log. The NameNode also stores non-persistent information such as location of all the blocks for a given file. The HDFS files are split into blocks to be replicated and stored on DataNodes. Each DataNode periodically syncs up with the NameNode with information on the blocks. The HDFS architecture takes care of the data storage, fault tolerance and loss prevention parts of Apache Hadoop.

Figure 2: HDFS Architecture – NameNode (Master) – DataNodes (Slave) Configuration

MapReduce 2.0 or YARN divvies up the responsibility of JobTracker, the job dispatcher from previous Hadoop versions, thus separating resource management from application management. It also helps separate job scheduling and monitoring from resource management.

The ResourceManager is a global master and an arbitrator of resources. The ApplicationMaster is capable of managing different user applications. There is one ApplicationMaster per application. So, for example, you will have one for a MapReduce application and another for an interactive application, and so on. There is a JobHistoryServer daemon that tracks these applications and records their completion. Finally, there is also a per-node slave called the NodeManager (akin to previous version’s TaskTracker) that tracks tasks that run on its particular node. YARN architecture takes care of the data computation and management aspects of Apache Hadoop.

Figure 3: YARN Architecture – ResourceManager (Master) – ApplicationMaster + NodeManager (Master+Slave) Configuration

Now that we have a background on the Apache Hadoop core components, let’s jot down our minimalistic set of design requirements. I will list them here first and then walk the details on how to cover the requirements with Apache Hadoop 2.

The Hadoop 2 release, established a configuration to have an active and a standby NameNode, thus avoiding a single point of failure. Anytime the conservative Failover Controller detects a failure, it lets the standby take over and brings down the active NameNode (either via “fencing” or by “shooting the other node in the head”). The standby NameNode can take over very quickly since both the active; as well as the standby NameNodes share the edit log and the DataNodes report back to both the active and standby NameNodes.

Also, ResourceManager in YARN can support high availability. The Failover Controller is a part of the ResourceManager and brings the standby to take over in case the active fails over.

Whenever we talk about security, we talk about the “layers of defense” (also known as the “rings of defense”). The layers are authentication, authorization, audit and data protection:

The most common form of authentication available in native Apache Hadoop is Kerberos. Authentication can be from user to services e.g. HTTP authentication; or can be from services to services (as a user – e.g. proxy user or as a service – e.g. client SSL certificates).

Apache Hadoop already provides Unix-like file permissions and also has access control lists for Map Reduce jobs, YARN, etc.

Audit logs provide for accountability, and native Apache Hadoop has audit logs for NameNodes that record file creation, opening, etc. Also, there are history logs for JobTracker, JobHistoryServer, and ResourceManager. The history logs record all jobs that run on a particular cluster.

Security – Data at Rest & Data in Motion Protection:

Encrypting data at rest is easily done using whatever encryption the operating system has to offer, as well as other hardware level encryptions. Data in motion on the other hand, needs to be enabled in the configuration files as detailed below:

For client interaction over RPC, SASL (Simple Authentication and Security Layer) protocol can be enabled by setting the ‘hadoop.rpc.protection=privacy’ in core-site.xml. Note: Java SASL provides different levels of data protection (also known as QOP - Quality of Protection). Depending on the desired quality, the user can set the protection parameter to “authentication” for authentication-only; “integrity” for authentication and integrity of exchanged data or “privacy” for adding encryption (with symmetric keys) and avoiding “man-in-the-middle” attacks. Both integrity checks and encryption come at a performance cost.

The Data Transfer Protocol (DTP) used by HDFS data doesn’t utilize the SASL framework for authentication, so as a direct effect there is not QOP. Hence it is necessary to wrap the DTP with SASL handshaking which can be achieved by setting ‘dfs.encrypt.data.transfer=true’ in hdfs-site.xml.

Finally, for HTTP over SSL simply setting ‘dfs.https.enable=true’ and then enabling two-way SSL by ‘dfs.client.https.need-auth=true’ in hdfs-site.xml does the trick. For MapReduce Shuffle, SSL can be enabled by setting ‘mapreduce.shuffle.ssl.enabled=true’ in mapred-site.xml.

Although Hadoop boasts of needing only commodity hardware, data traffic with Hadoop is always a big deal. Even if you have a medium size cluster, there is a lot of replication traffic and also movement of data between Mappers and Reducers. Hence, it is very important to choose your cluster hardware with great networking backbone and at the same time provide good performance and prove economical enough to meet or exceed your scale out needs.

At Servergy, we designed such a system using Freescale’s QorIQ T4240 64bit Communications Processor. The power-efficient Servergy CTS storage appliance (shown in figure 4) has two T4240 processors. Each T4240 has a security coprocessor that is used to accelerate encryption/decryption operations. The T4240 also has four 10GigE ports and a 20Gig SRIO (serial Rapid IO) port. SRIO offers a low latency, high bandwidth interconnect. Our current single cluster only uses four of the eight 10GigE ports for each Servergy CTS appliance - two 10GigE ports are connected to an active switch and the other two are connected to a redundant/standby switch to provide high availability at the switch level. This configuration is shown in figure 5. Note: Based on deployment we can bond the 10GigE ports to increase the bandwidth or use the SRIO for low latency transport.

Figure 4: Servergy CTS Storage Appliance Block Diagram

Figure 5: CTS Appliance Solution – 10x12x2 = 240 cores; 480 threads

Many systems in a Hadoop cluster not only handle computational needs but also provide data storage. Hence, if you are thinking of Hadoop-as-a-service, there is always a concern about data security. Enter virtualization! Virtualization not only provides the desired isolation, it also provides elasticity. Virtualization adds to the multi-tenancy offered by YARN and improves system utilization by maximizing resource utilization. Apart from the above, ease of deployment is also a great perk of virtualization.

Keeping up with Hadoop’s traditional layout, each Virtual Machine (VM) could run a NodeManager/ TaskTracker and a DataNode as shown in figure 6. This configuration doesn’t provide all the benefits of virtualization though. To begin with, the configuration is not really elastic; you have to pre-configure it for your (growing) needs. For example, any growth in data on a tightly configured cluster would need addition of new nodes to the cluster, but now you have spare computational resources and there is also the necessity to balance the cluster. And there is no separation of data computation from data storage.

Figure 6: Traditional Hadoop layout on Virtual Machines.

In-order to make it more elastic, we can change the per VM data + compute design configuration into a more service-oriented architecture which would be more conducive to Hadoop-as-a-service providers (and even Infrastructure-as-a-service providers). Let’s discuss two new configurations:

Figure 7: Virtualized Hadoop in a Service-Oriented Architecture Configuration

Figure 7 shows a configuration where we have a single virtualized DataNode per multiple NodeManagers/TaskTrackers. Depending on the computational needs of the cluster, more NodeManagers can be added by adding more virtual nodes to the cluster. Since, each VM runs its own NodeManager, this configuration not only provides compute and data level isolation it also provides great multi-tenant isolation. The key is to find a good balance for the total number of virtualized NodeManagers per virtualized DataNode on a given host. This will largely depend on your data, replication factor, application(s) and your cluster capacity.

Another configuration that is gaining a lot of attention in the Infrastructure-as-a-service realm is having persistent data and adding virtualized computational nodes (NodeManagers/TaskTrackers) or a combination of data nodes and computational nodes to complete your cluster. This configuration is shown in figure 8.

Figure 8: Virtualized Hadoop in an Infrastructure-As-A-Service Configuration

Here the infrastructure takes care of all the networking, load balancing and persistent data storage needs. The Cluster and VM manager controls the Hadoop cluster that is comprised of the compute VMs (TaskTrackers/ NodeManagers) and the combination of ‘DataNode + NodeManager’ VMs. This configuration helps with ease of access of HDFS data in the cloud (when needed for computational purposes) and also provides for ease with data backup to persistent storage. In such a configuration, there is no need for a long running cluster in the cloud since all the data (raw, analyzed or mined) are stored persistently.

Monica Beckwith is a Java Performance Consultant. Her past experiences include working with Oracle/Sun and AMD; optimizing the JVM for server class systems. Monica was voted a Rock Star speaker @JavaOne 2013 and was the performance lead for Garbage First Garbage Collector (G1 GC). You can follow Monica on twitter @mon_beck

Becoming an editor for InfoQ was one of the best decisions of my career. It has challenged me and helped me grow in so many ways. We'd love to have more people join our team.

A round-up of last week’s content on InfoQ sent out every Tuesday. Join a community of over 250,000 senior developers. View an example

You need to Register an InfoQ account or Login or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

A round-up of last week’s content on InfoQ sent out every Tuesday. Join a community of over 250,000 senior developers. View an example

Real-world technical talks. No product pitches. Practical ideas to inspire you and your team. QCon San Francisco - Oct 24-28, In-person. QCon San Francisco brings together the world's most innovative senior software engineers across multiple domains to share their real-world implementation of emerging trends and practices. Uncover emerging software trends and practices to solve your complex engineering challenges, without the product pitches.Save your spot now

InfoQ.com and all content copyright © 2006-2022 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with. Privacy Notice, Terms And Conditions, Cookie Policy