Big Data – final part


Welcome, dear reader, to the last post in our series on Big Data. If the reader has not read the previous posts in this series, the links can be found at the end of this post, or in the menus, under the “Big Data” section. In this final post, we will discuss some interesting cases of Big Data in use, to show how the market has been applying it. If the reader wants to know more about any particular case, the reference links can be found at the end of this post.

 HealthMap: preventing diseases

Driven by the need to monitor the progress of epidemics around the world, researchers in Boston created the HealthMap tool. It uses various data sources, such as social media and local news, to predict the spread of diseases across the globe. The tool was recently highlighted in the media because it predicted the emergence of Ebola in Guinea nine days before the WHO announcement.

 Google: better efficiency in the data centers

Owning the world’s largest search engine, along with other products such as cloud computing, Google runs huge data centers to support its operation. Through a Big Data solution that collects information such as power consumption, temperature, CPU load and memory usage, it was possible to quickly establish a number of measures that improved the performance of the data centers, such as adjustments to the cooling system that prevent temperature peaks, which compromise the performance of the equipment and increase energy consumption.

 Target: predicting consumer behavior

The US retailer Target has implemented an interesting case of Big Data. Using sales and browsing data from its customers, extracted through channels such as e-commerce, the retailer can trace its customers’ behavior and predict which products they would be most interested in purchasing. The case gained prominence in the media for predicting which products to offer to pregnant women: from a customer’s purchases, the solution detects that she is pregnant and sends promotional emails offering the products she is likely to want in the coming weeks of the pregnancy.

  Ford: vehicle real-time analysis

The famous car manufacturer has implemented a very innovative Big Data solution, which involves collecting data from customers’ own cars in real time. Through sensors, data from the engine and other parts of the cars are sent in real time to Ford’s data centers, which use the data for applications such as correcting the engineering design of future releases, preventive maintenance, and greater flexibility in detecting the need for recalls.


And so we conclude our first journey in the world of Big Data. It is clear, however, that we will not stop here: I promise the reader that we will keep evolving our studies in Big Data. Please be sure to follow my blog, where I intend to start a hands-on series in which we will see more of the key technologies associated with Big Data put into practice. I thank all who have accompanied me in this series, and I wish you all much success in your careers, whether in the world of Big Data or not. Thank you!


Big Data – part 4


Welcome, dear reader, to another post in our series on Big Data. If the reader has not read the previous posts in this series, the links can be found at the end of this post, or in the menus, under the “Big Data” section. In this post, we will cover a technology that has gained a lot of popularity in the world of Big Data: Apache Spark.


Created by the AMPLab at the University of California, Berkeley, Spark aims to provide a computing model that is, according to the official website, up to 100x faster than a conventional Hadoop MapReduce job. But how does it hope to achieve this performance improvement?


The gain comes from one specific point of the Hadoop MapReduce model. During the execution of a Hadoop job, there are 3 moments when the data is “stored” during processing:

  • At initial processing, before the map step;
  • In the midst of processing, when the data filtered by the map phase is being stored for later stages of sort and reduce;
  • At the end of the processing, when the final result is delivered;

In Hadoop, all three of these moments consume disk I/O, because the data is stored on disk rather than kept in memory, including in the intermediate step between map and reduce. In a production Big Data environment, it is common to have iterative jobs that run several times over a given body of data, using the result of the previous run as input for the next. It is precisely in this scenario that Spark has its biggest gain: by keeping the data in memory, reading and writing it becomes much faster, which is where the advertised gains come from. From this seemingly simple change the Spark project was born. It allows jobs to be built following the BSP (Bulk Synchronous Parallel) model and keeps as much of the data as possible in memory within a run, providing a fast and scalable computational model. The picture below shows the architecture of Spark and its subprojects, which we will discuss next.

Complementary modules

From the initial Spark project, 4 subprojects were born that complement its use. All of these modules are already part of the default Spark installation. They are:

Spark SQL: Similar to what Hive is for Hadoop, Spark SQL brings an SQL-like language for querying data on a Spark installation;

Spark Streaming: Spark Streaming allows building streaming-style applications, where data can be read and written during processing, instead of the traditional model, where the results of a process are only delivered at the end of an execution;

MLlib: Comparable to Apache Mahout, it allows the construction of machine learning processes. Machine learning is a field within computer science in which programs use statistical and logical rules to “learn” and draw their own conclusions from a mass of data provided as input, simulating human reasoning;

GraphX: Spark GraphX allows processing to be modeled as graphs, allowing problems to be solved through algorithms like Pert, BFS and DFS.

Spark & Hadoop

The reader may be wondering at this point: should I use Spark or Hadoop in my Big Data project? Like everything in the world of technology, there is no simple answer. Several factors may influence this decision, not only technical ones but also business ones, such as the absence, to date, of major players providing Spark distributions with commercial support, unlike Hadoop, which already has heavyweight commercial distributions such as Cloudera and Hortonworks. However, because of its complementary nature (Spark integrates with most of the components that make up Hadoop), it is possible that Spark will become a complementary technology rather than a competing platform. An example of this is Cloudera itself, whose Hadoop distribution also ships Spark. Thus, an increasingly common scenario is the combination of the two technologies, rather than the use of only one of them. After all, why should we use only one, if we can enjoy the best that each has to offer?


And so we come to the end of another chapter of our series. In the next and last post, we will examine some real-world cases of Big Data in use, so that we can see in practice the benefits that Big Data can offer us. Until next time.

Big Data – part 3


In this post we continue our series on Big Data. If the reader has not read the previous posts in this series, the links can be found at the end of this post, or in the menus, under the “Big Data” section.

In this post, we will discuss one of the most popular technologies of the moment for developing Big Data solutions: Apache Hadoop.


Hadoop was created in 2005 by two developers, Doug Cutting and Mike Cafarella. The symbol of Hadoop, the famous yellow elephant, is the toy elephant of Doug’s son, and “Hadoop” is the name the boy gave it. In the video below we can see the co-creator in an interview, talking about the challenges of data mining:


Speaking of the Hadoop architecture, we can separate it into 2 main parts, each consisting of a cluster:

  1. One part consists of a cluster that implements a distributed file system, known as HDFS (Hadoop Distributed File System);
  2. The second part consists of another cluster, which provides an environment for executing programs written following the MapReduce model. If the reader does not know what MapReduce is, we cover it in Part 2 of our series;

Let’s examine, in general terms, each of these clusters:


HDFS, the Hadoop file storage system, allows files to be stored across multiple nodes (servers). When we insert a file into the cluster, through the command line interface or its REST interface, or when MapReduce processes generate output files, HDFS “breaks” the file into several smaller parts (blocks of 64 MB by default) and distributes them across the nodes. At run time it manages details such as the number of copies each part must have, and it redoes this balancing when a cluster node goes down. All of this splitting is transparent to the developer, because the cluster reassembles the file on every query made through the interfaces.
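To make the idea of breaking and replicating a file more concrete, here is a minimal sketch in plain Python. It is not HDFS code: the function names, the node names and the round-robin placement are assumptions made only for illustration (real HDFS uses a more elaborate placement policy), and the replication factor of 3 is simply a parameter here.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Hypothetical round-robin placement, used only to illustrate the idea
    of keeping several copies of every block on different servers.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

blocks = split_into_blocks(200 * MB)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print([size // MB for size in blocks])  # sizes in MB: three full blocks and a smaller last one
print(placement[0])                     # the nodes holding copies of the first block
```

If one of the nodes goes down, every block it held still has copies on other nodes, which is what allows the cluster to rebuild the missing copies elsewhere.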

An HDFS cluster consists of two components:

  1. NameNodes: The central component of the cluster, responsible for managing the assembly metadata (used to reconstruct the original files) and for managing the files in the cluster;
  2. DataNodes: The “physical” component of the cluster, responsible for reading and writing the files on disk. Each node has its own DataNode, which reads and writes on the disk of the server it runs on;

PS: In version 2.0 of Hadoop, a new component was introduced, called YARN (Yet Another Resource Negotiator), whose main purpose is to manage the resources of the cluster and the scheduling of jobs. Version 2.0 also allows the cluster to have several NameNodes managing the files, avoiding the possible loss of an HDFS cluster in the case of unrecoverable problems with the NameNode, as could happen in Hadoop 1.0, where we typically had a single NameNode plus a Secondary NameNode process.

MapReduce cluster

The MapReduce cluster is the cluster that executes MapReduce processes. Typically, the input and output of a MapReduce job in Hadoop live in HDFS, and the cluster uses the NameNode to locate the data (in version 2.0, YARN takes over resource management and scheduling). The following are the components of a MapReduce cluster:

  1. JobTracker: The component that interfaces the cluster with developers. It manages process execution: it finds out from the NameNode of the HDFS cluster where each part of the input data is located, tells the TaskTrackers which part of the data to process, and manages the beginning and end of each stage of processing;
  2. TaskTracker: The component that receives from the JobTracker the instructions for executing jobs, including which parts of the data it is responsible for processing, and reports back to the JobTracker when processing is complete;
  3. Task: The smallest unit of the cluster, the Task is responsible for the processing itself. Each Task can run in a JVM instance started during process execution, or in a JVM that was already started and is running other Tasks, according to the memory consumption settings specified in the cluster;
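The division of work between these components can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `assign_tasks`, the block and node names, and the greedy "first node with a free slot" rule are all hypothetical, and a real JobTracker would queue work instead of failing when no slot is free.

```python
def assign_tasks(block_locations, free_slots):
    """Pick a node to process each block, preferring nodes that hold the block.

    block_locations: block id -> nodes holding a copy (what the NameNode knows)
    free_slots:      node -> number of tasks it can still accept
    Hypothetical greedy rule: the first local node with a free slot wins,
    otherwise the first node anywhere in the cluster with a free slot.
    """
    slots = dict(free_slots)  # work on a copy, leave the caller's dict untouched
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free task slots left in the cluster")
        chosen = candidates[0]
        slots[chosen] -= 1
        assignment[block] = chosen
    return assignment

locations = {"block0": ["node1", "node2"], "block1": ["node2"], "block2": ["node3"]}
assignment = assign_tasks(locations, {"node1": 1, "node2": 1, "node3": 0, "node4": 1})
print(assignment)
```

In this run, block0 and block1 are processed on nodes that already store them, while block2 has to be processed remotely because the only node holding it has no free slot.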

Hadoop complementary software

Several pieces of software were created to complement the use of Hadoop, or were even built on top of it. A brief description of some of them:

  • Mahout: Allows the use of machine learning techniques for data analysis in Hadoop;
  • Sqoop: Allows HDFS to be integrated with relational databases;
  • Hive: Allows HDFS to be queried more easily, through SQL commands;
  • Hama: Allows the development of Hadoop jobs in models other than MapReduce, such as BSP;
  • HBase: A NoSQL database, built on top of HDFS;

In future posts, we will discuss this software in more detail.


And so we conclude one more post of our series. With the growth of Big Data projects and solutions worldwide, Hadoop has grown a lot as a market-leading technology, and already has commercial implementations from large players like Cloudera and Hortonworks. In the next post, we will address another well-known technology in the world of Big Data: Spark. Until next time.


Big Data – part 2


This is the second part of a series of posts on Big Data. In this post, let’s talk about the two most popular distributed processing models for Big Data: MapReduce and BSP (Bulk Synchronous Parallel). A processing model is a kind of algorithmic template upon which software is developed.

MapReduce model

[Figure: the MapReduce model]

In the figure above, we can see the MapReduce model. This model is widely used in the market today, especially in companies that use Hadoop as their main Big Data technology. The model consists of two well-defined steps, called map and reduce:

  • In the step known as map, hundreds, or even thousands, of parallel processes perform a task called mapping: a large mass of data is divided into pieces, and each process filters its respective piece, producing values in key-value format. At the end of this phase there is a grouping phase, where the values for the same key are gathered to form data in the format key: {value1, value2, value3 … valueN};
  • In the step known as reduce, the data generated by the map phase is again divided into pieces and passed to hundreds or even thousands of processes, which process the pieces they receive and generate key-value output. This is the final output of the processing, which is finally gathered into a single mass of results;
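The two steps, plus the grouping in between, can be sketched on a single machine with plain Python. This is not Hadoop code: the function names are made up for the illustration, the "pieces" are just two short strings, and everything runs sequentially instead of in parallel.

```python
from collections import defaultdict

def map_phase(piece):
    """Map: emit a (word, 1) pair for every word in one piece of the input."""
    return [(word.lower(), 1) for word in piece.split()]

def group_phase(pairs):
    """Group: gather all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)

def word_count(pieces):
    # In a real cluster each piece would be mapped by a different process.
    mapped = [pair for piece in pieces for pair in map_phase(piece)]
    grouped = group_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Here the intermediate grouped data has exactly the key: {value1, value2 …} shape described above, for example big: [1, 1].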

In a future post, we’ll go hands-on with Hadoop, where we can see this processing model in practice with the WordCount example.

BSP Model (Bulk Synchronous Parallel)


Although widespread, the MapReduce model is not without its drawbacks. When the model is applied in the context of Hadoop, for example, every step in the cluster, including the assembly of the final results, goes through files on Hadoop’s file system, HDFS, which creates a performance overhead when the same processing has to be performed iteratively. Another problem is that for graph algorithms such as DFS, BFS or Pert, the MapReduce model is not satisfactory. For these scenarios, there is BSP.

In the BSP model, we have the concept of supersteps. A superstep is a generic unit of processing in which thousands of parallel processes work on a mass of data, exchange results through a global communication component, and then arrive at a “meeting point” called the synchronization barrier. At this point, the data are grouped and passed on to the next superstep in the chain. In this model it is simpler to build iterative workloads, since the same logic can be re-executed in a flow of supersteps. Another advantage pointed out by proponents of this model is a gentler learning curve for developers coming from the procedural world.
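A toy version of this flow can be written in plain Python. It is a sequential simulation, not Hama or Spark code, and the problem chosen (every worker discovering the largest value in the group), the function name `run_bsp` and the ring of neighbors are assumptions made only to show the superstep and barrier structure.

```python
def run_bsp(values, neighbors, max_supersteps=10):
    """Simulate BSP workers that each hold one value and spread the maximum.

    values:    initial value held by each worker (index = worker id)
    neighbors: worker id -> list of worker ids it sends messages to
    Returns the final values and how many supersteps changed any worker.
    """
    state = list(values)
    for superstep in range(max_supersteps):
        # Computation and communication: every worker sends its current value.
        inbox = {worker: [] for worker in range(len(state))}
        for worker, value in enumerate(state):
            for target in neighbors[worker]:
                inbox[target].append(value)
        # Synchronization barrier: messages are only consumed after all
        # workers have finished sending, so everyone sees a consistent view.
        new_state = [max([state[w]] + inbox[w]) for w in range(len(state))]
        if new_state == state:  # nothing changed, the job has converged
            return state, superstep
        state = new_state
    return state, max_supersteps

ring = {0: [1], 1: [2], 2: [3], 3: [0]}  # each worker talks to the next one
final_values, supersteps = run_bsp([3, 7, 1, 5], ring)
print(final_values, supersteps)
```

Notice that the same logic is simply re-executed in every superstep until nothing changes, which is the kind of iterative workload the model makes easy to express.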

In terms of platforms, Hadoop has Apache Hama as its implementation of this model. Spark, the main competitor of Hadoop, comes with this feature natively.


And so we conclude another part of our series on Big Data. To date, these are the main models used by Big Data platforms. As this is a booming technology, it is natural that more models may emerge in the future and gain their share of adoption. In the next parts of our series, we’ll talk about the two best-known implementations of Big Data to date: Hadoop and Spark. Until next time.


Big Data – part 1


This is the first of a series of posts intended to elucidate the concept of Big Data.

In this first part, we will start a discussion on what Big Data is. In future posts, we’ll talk about the new processing models that try to address the problem, and the new technologies that are emerging to put these concepts into practice.

My posts are based on the idea of collaboration. All who wish to contribute to the discussion, please feel free to do so, bringing more knowledge and experience to everyone.

Let’s start our series by talking about what Big Data is, after all.

 The explosion of data

Never before has the world produced so much data. According to an infographic produced by IBM, 100 terabytes of data are produced every day on Facebook alone, 294 billion emails are sent daily and 230 billion tweets are made every day! (Source)

This huge amount of data gives rise to a phenomenon known in the world of Big Data as the 5 Vs:

Volume: Huge amounts of data being produced;

Velocity: Data being produced at a very high speed;

Variety: Data being produced in different structures that may nonetheless be intrinsically related. The content a user sends by e-mail is closely related to that user’s tweets (both are produced by the same user and may refer to the same subject), yet the two have completely different structures;

Veracity: In a world where large amounts of data are produced at high speed and in different formats, it is more difficult to obtain “clean” data, free of incompleteness or even duplication. The e-mail you sent with your grandmother’s cake recipe is the same recipe you published on Facebook, just in a different format;

Value: All these data have a high value for the business, as they bring information about the behavior, beliefs and preferences of its customers;

To deal with this problem, processing models were developed using a technique called distributed processing. In the next post, we’ll talk more about them.

For those who have more interest in knowing about the “Vs”, this presentation is a good reference:
