Big Data – part 3

In this post we continue our series on Big Data. If you have not read the previous posts in this series, the links can be found at the end of this post, or in the “Big Data” section of the menu.

In this post, we will discuss one of the most popular technologies of the moment in the development of Big Data solutions: Apache Hadoop.

Origin

Hadoop was created in 2005 by two developers, Doug Cutting and Mike Cafarella. The symbol of Hadoop, the famous yellow elephant, is modeled on a toy elephant that belonged to Doug Cutting’s son, and “Hadoop” was the toy’s name. In the video below we can see the co-creator in an interview, talking about the challenges of data mining:

Architecture

The Hadoop architecture can be separated into two main parts, each consisting of a cluster:

  1. One part is a cluster that implements a distributed file system, known as HDFS (Hadoop Distributed File System);
  2. The second part is another cluster, which provides an environment for executing programs written following the MapReduce model. If you do not know what MapReduce is, we address this point in Part 2 of our series.

Let’s examine each of these clusters in general terms:

HDFS

HDFS, Hadoop’s file storage system, allows files to be stored across multiple nodes (servers). When we insert a file into the cluster, either through the command line interface or its REST interface, or when MapReduce jobs generate output files, HDFS splits the file into several smaller blocks (by default, 64 MB each in Hadoop 1.x) and distributes the blocks across the nodes. At run time it manages details such as the number of copies each block must have, and it rebalances those copies if a cluster node fails. All this splitting is transparent to the developer, because the cluster reassembles the file on every read made through the interfaces.
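To make the splitting concrete, here is a minimal sketch of the arithmetic involved. It is not how HDFS is implemented: the function names, the node names and the round-robin placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy), and the replication factor of 3 is assumed here as a typical setting.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size mentioned above
REPLICATION = 3                # assumed number of copies per block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(replication)
        ]
    return placement


# A 200 MB file becomes three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

If a node fails, the blocks listed for it in `placement` would each have one copy fewer, which is the situation the cluster repairs by copying those blocks to other nodes.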

An HDFS cluster consists of two components:

  1. NameNode: Central component of the cluster, responsible for managing the file metadata (used to reassemble the original files from their blocks) and for managing the files in the cluster;
  2. DataNodes: “Physical” component of the cluster, responsible for reading and writing the files on disk. Each node has its own DataNode, which reads from and writes to the disk of the server it runs on.
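The division of work between the two components can be sketched with a toy model. Everything here is hypothetical (class names, method names, the tiny 4-byte block size) and there is no replication; the point is only that the NameNode holds metadata while the DataNodes hold the bytes.

```python
class DataNode:
    """Stores raw block contents, keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Keeps only metadata: which blocks form a file and where they live."""

    def __init__(self, datanode_names):
        self.datanode_names = list(datanode_names)
        self.files = {}  # path -> ordered list of (block_id, datanode name)
        self._next_id = 0

    def allocate(self, path, num_blocks):
        entries = []
        for _ in range(num_blocks):
            node = self.datanode_names[self._next_id % len(self.datanode_names)]
            entries.append((self._next_id, node))
            self._next_id += 1
        self.files[path] = entries
        return entries

    def locate(self, path):
        return self.files[path]


def put(namenode, datanodes, path, data, block_size=4):
    """Client side: ask the NameNode where to write, then write to DataNodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for (block_id, node), chunk in zip(namenode.allocate(path, len(chunks)), chunks):
        datanodes[node].write(block_id, chunk)


def get(namenode, datanodes, path):
    """Client side: look up the block list, then read and join the blocks."""
    return b"".join(
        datanodes[node].read(block_id) for block_id, node in namenode.locate(path)
    )


datanodes = {name: DataNode(name) for name in ("dn1", "dn2")}
namenode = NameNode(datanodes)
put(namenode, datanodes, "/logs/a.txt", b"hello hadoop")
restored = get(namenode, datanodes, "/logs/a.txt")
```

Note that losing the `files` dictionary would make the stored blocks useless even though every byte is still on disk, which is why the NameNode is such a critical component.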

PS: In version 2.0 of Hadoop, a new component called YARN (Yet Another Resource Negotiator) was added. Strictly speaking it is not part of HDFS: it is a resource management layer that sits between users’ applications and the cluster. Version 2.0 also allows HDFS to run more than one NameNode, which avoids losing the HDFS cluster after an unrecoverable NameNode problem. That was a risk in Hadoop 1.0, where we typically had just two NameNode processes, a single NameNode plus a Secondary NameNode, and the secondary only checkpoints the metadata rather than acting as an automatic failover.

MapReduce cluster

The MapReduce cluster is the cluster that executes MapReduce jobs. Typically, a MapReduce job in Hadoop reads its input from HDFS and writes its output back to it, consulting the NameNode to locate the data. The following are the components of a MapReduce cluster in Hadoop 1.x (in version 2.0, YARN takes over the resource management and scheduling duties):

  1. JobTracker: Component that interfaces the cluster with the developers. It manages job execution: it asks the NameNode of the HDFS cluster where each part of the input data is located, tells the TaskTrackers which part of the data to process, and manages the beginning and end of each stage of processing;
  2. TaskTracker: Component that receives from the JobTracker the instructions for executing jobs and the parts of the data it is responsible for processing, and reports to the JobTracker when processing is complete;
  3. Task: The smallest unit of the cluster, the Task is responsible for doing the processing itself. Each Task can run in a JVM instance started during job execution, or in a JVM already started by other Tasks, according to the memory consumption settings specified in the cluster.
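The interaction between these three components can be sketched in a single process with a word count job. The classes below are hypothetical stand-ins, not Hadoop’s API: the round-robin assignment is an assumption (a real JobTracker prefers TaskTrackers that are close to the data), and everything runs sequentially rather than across machines.

```python
from collections import defaultdict


def map_task(split):
    """Map phase: emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def reduce_task(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)


class TaskTracker:
    """Runs the tasks it is handed and counts how many it completed."""

    def __init__(self, name):
        self.name = name
        self.completed = 0

    def run(self, task, *args):
        result = task(*args)
        self.completed += 1  # stands in for reporting back to the JobTracker
        return result


class JobTracker:
    """Hands splits to TaskTrackers and drives the map and reduce stages."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, splits):
        # Map stage: one map task per input split, assigned round-robin.
        mapped = []
        for i, split in enumerate(splits):
            tracker = self.trackers[i % len(self.trackers)]
            mapped.extend(tracker.run(map_task, split))

        # Shuffle: group the intermediate values by key.
        grouped = defaultdict(list)
        for word, count in mapped:
            grouped[word].append(count)

        # Reduce stage: one reduce task per key, assigned round-robin.
        result = {}
        for i, word in enumerate(sorted(grouped)):
            tracker = self.trackers[i % len(self.trackers)]
            key, total = tracker.run(reduce_task, word, grouped[word])
            result[key] = total
        return result


trackers = [TaskTracker("tt1"), TaskTracker("tt2")]
job_tracker = JobTracker(trackers)
counts = job_tracker.run_job(["big data big", "data hadoop", "big hadoop"])
```

With three splits and three distinct words, the job runs three map tasks and three reduce tasks, shared between the two TaskTrackers.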

Hadoop complementary software

Several pieces of software were created to complement Hadoop, or were even built on top of it. A brief description of some of them:

  • Mahout: Allows you to use machine learning techniques for data analysis in Hadoop;
  • Sqoop: Allows you to integrate HDFS with relational databases;
  • Hive: Allows you to query data in HDFS more easily, through SQL-like commands;
  • Hama: Allows you to develop jobs in Hadoop using models other than MapReduce, such as BSP (Bulk Synchronous Parallel);
  • HBase: NoSQL database, built on top of HDFS.

In future posts, we will discuss this software in more detail.

Conclusion

And so we conclude one more post in our series. With the growth of Big Data projects and solutions worldwide, Hadoop has grown a lot as a market-leading technology, and already has commercial distributions from large players like Cloudera and Hortonworks. In the next post, we will address another well-known technology in the world of Big Data: Spark. Until next time.

Raspberry PI – Entering the world of embedded software

In this post, we will take a first look at the Raspberry Pi and a high-level view of what embedded software is. We’ll start by answering the following question: what, after all, is a Raspberry Pi?

A computer the size of a credit card

The Raspberry Pi is a single-board computer, in other words a circuit board the size of a credit card that integrates several features such as audio, video and network (the standard version of the board does not support wi-fi). Developed in the UK, its main goal is to provide a means for teaching computer science in schools, and it allows the development of solutions that combine software and hardware, such as home automation, security monitoring systems, etc.

Currently, the board is sold in two models, known as A and B:

[Image: specifications of Raspberry Pi models A and B]

Source: Wikipedia

As we can see in the specifications, the board has considerable processing power for its size. Some features, such as a camera and wi-fi, are not included from the factory, so you need to buy the official accessories separately, or use the GPIO interface (which we will discuss later) to connect the board to additional circuits such as sensors, alarms or even electronic circuits that you build yourself!

A model B Raspberry Pi

Operating system

The standard operating system for the board is Raspbian, a Debian-based distribution made especially for it. Installing the OS on the SD card is quite simple and can be learned from the tutorial video below:

Embedded Software

Embedded software refers to programs that run on low-cost hardware and/or hardware with limited resources, usually in projects that automate processes in the physical world, such as doors that close based on motion detection, curtains that open and close according to the brightness of a room, fire detectors and so on. On this site you can see several projects made with the Raspberry Pi. There are even drones!

Because the Raspberry Pi runs a complete operating system, with all the basic features of a desktop operating system, it lets you use different programming languages such as Python, Ruby, and even Java. From my experience with the board, I recommend Python, since it has more complete support in many respects, such as a good library for handling the GPIO.

GPIO

The GPIO (General Purpose Input/Output) consists of a set of electronic pins that provide input, output and ground connections for attaching other electronic components to the Pi. The figure below shows the layout of the pins. In a future post, we’ll talk specifically about this interface.
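As a small taste of what driving a pin from Python looks like, here is a sketch that blinks an LED. It assumes the RPi.GPIO library, an LED wired to pin 18 in BCM numbering (an arbitrary choice for this example), and a hypothetical `FakeGPIO` class that only records calls so the logic can be tried on any machine without the board.

```python
import time


class FakeGPIO:
    """Hypothetical stand-in that records calls, for use off the board."""

    BCM = "BCM"
    OUT = "OUT"
    HIGH = 1
    LOW = 0

    def __init__(self):
        self.calls = []

    def setmode(self, mode):
        self.calls.append(("setmode", mode))

    def setup(self, pin, direction):
        self.calls.append(("setup", pin, direction))

    def output(self, pin, value):
        self.calls.append(("output", pin, value))

    def cleanup(self):
        self.calls.append(("cleanup",))


def blink(gpio, pin, times, delay=0.5):
    """Switch an output pin on and off `times` times."""
    gpio.setmode(gpio.BCM)      # refer to pins by their BCM numbers
    gpio.setup(pin, gpio.OUT)   # configure the pin as an output
    try:
        for _ in range(times):
            gpio.output(pin, gpio.HIGH)
            time.sleep(delay)
            gpio.output(pin, gpio.LOW)
            time.sleep(delay)
    finally:
        gpio.cleanup()          # release the pins even if interrupted


# Dry run with the stand-in; no hardware is touched.
fake = FakeGPIO()
blink(fake, pin=18, times=2, delay=0)
```

On the board itself, the same function should work with the real library by running `import RPi.GPIO as GPIO` and then calling `blink(GPIO, pin=18, times=10)`, since the stand-in mirrors the names that module exposes.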

Shops

From my experience in my country (Brazil), these are the best shops for buying electronic components and the Pi itself, so I recommend them:

Lab de garagem

Farnell Newark

Multi Comercial (store on Santa Efigênia)

Conclusion

And so we conclude our first general contact with the world of embedded software and the Raspberry Pi. In future posts, we will go deeper into the many topics we touched on in this post. Until next time.