Big Data – part 3


In this post we will continue our series on Big Data. If you have not read the previous posts in this series, the links can be found at the end of this post, or in the menu, under the “Big Data” section.

In this post, we will discuss one of the most popular technologies today for developing Big Data solutions: Apache Hadoop.

Origin

Hadoop was created in 2005 by two developers, Doug Cutting and Mike Cafarella. The symbol of Hadoop, the famous yellow elephant, is the toy elephant of Doug’s son, and “Hadoop” is the elephant’s name. In the video below, we can see one of the co-creators in an interview, talking about the challenges of data mining:

Architecture

Speaking of the Hadoop architecture, we can separate it into two main parts, consisting of two clusters:

  1. One part consists of a cluster that implements a distributed file system, known as HDFS (Hadoop Distributed File System);
  2. The second part consists of another cluster, which provides an environment for executing programs written following the MapReduce model. If the reader does not know what MapReduce is, we addressed this point in Part 2 of our series;

Let’s examine, in general terms, each of these clusters:

HDFS

HDFS, Hadoop’s file storage system, allows files to be stored across multiple nodes (servers). When we insert a file into the cluster, either through the command line interface or its REST interface, or when MapReduce processes generate output files, HDFS “breaks” the file into several smaller parts (by default, blocks of 64 MB) and distributes them across the nodes, managing details at run time such as the number of copies each part must have, and redoing this balancing if a cluster node goes down. All this splitting is transparent to the developer, because the cluster reassembles the file for every query made through its interfaces.
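To make this concrete, below is a minimal sketch of how a file could be sent to HDFS using Hadoop’s Java FileSystem API. The NameNode address and the file paths are hypothetical examples, not values from a real cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Address of the NameNode (hypothetical host and port)
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into the cluster; HDFS transparently splits it
            // into blocks (64 MB by default) and replicates them across DataNodes
            fs.copyFromLocalFile(new Path("/tmp/sales.csv"),
                                 new Path("/data/sales.csv"));

            fs.close();
        }
    }

The same operation could also be done from the command line with hadoop fs -put; either way, the block splitting and replication remain invisible to the developer.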

An HDFS cluster consists of two components:

  1. NameNode: The central cluster component, responsible for managing the assembly metadata (used to reconstruct the original files) and for managing the files in the cluster;
  2. DataNode: The “physical” component of the cluster, responsible for reading and writing the files on disk. Each node has its own DataNode, which performs reads and writes on the disk of the server where it is running;

PS: In Hadoop 2.0, a new component called YARN (Yet Another Resource Negotiator) was added, whose main goal is to provide an additional interface layer between HDFS and its users. Thanks to this improvement, we can also have several NameNodes in the cluster managing the files, thus avoiding the possible loss of an HDFS cluster in case of unrecoverable problems with the NameNodes, as in Hadoop 1.0, where we typically had only two NameNode processes, one of them acting as a failover mechanism.

Mapreduce cluster

The MapReduce cluster performs the execution of MapReduce processes. Typically, the input and output of a MapReduce job in Hadoop live in HDFS, using the NameNode (YARN in version 2.0) to interface between the two clusters. The following are the components of a MapReduce cluster (a minimal job sketch follows the list):

  1. JobTracker: Component that interfaces the cluster with the developers. It manages process execution, checks with the NameNode of the HDFS cluster where each part of the input data is located, tells the TaskTrackers which part of the data to process, and manages the beginning and end of each stage of processing;
  2. TaskTracker: Component that receives from the JobTracker the instructions for executing jobs and the parts of the data it is responsible for processing, and reports back to the JobTracker when processing is complete;
  3. Task: The smallest cluster unit, the Task is responsible for the processing itself. Each Task can run in a JVM instance started during process execution, or it can be instantiated in a JVM that is already running other Tasks, according to the memory consumption settings specified in the cluster;
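To illustrate the model these components execute, below is a sketch of the classic word count job, written with Hadoop’s Java MapReduce API. The input and output paths are hypothetical and would normally point to directories in HDFS.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit a (word, 1) pair for every word in the input split
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce phase: sum the counts received for each word
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Hypothetical HDFS input and output directories
            FileInputFormat.addInputPath(job, new Path("/data/books"));
            FileOutputFormat.setOutputPath(job, new Path("/data/wordcount-output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In a running cluster, the JobTracker would split the input directory into blocks, schedule a map Task near each block, and the reduce Tasks would write the final counts back to HDFS.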

Hadoop complementary software

Several pieces of software were created to complement Hadoop, or were even built on top of it. A brief description of some of them:

  • Mahout: Allows the use of machine learning techniques for data analysis in Hadoop;
  • Sqoop: Allows integrating HDFS with relational databases;
  • Hive: Allows querying data in HDFS more easily, through SQL commands;
  • Hama: Allows the development of jobs in Hadoop using models other than MapReduce, such as BSP;
  • HBase: NoSQL database, built on top of HDFS;

In future posts, we will discuss these tools in more detail.

Conclusion

And so we conclude one more post in our series. With the growth of Big Data projects and solutions worldwide, Hadoop has grown a lot as a market-leading technology, already having commercial distributions from large players like Cloudera and Hortonworks. In the next post, we will address another well-known technology in the world of Big Data: Spark. Until next time.
