Big Data – part 2

Standard

This is the second part of a series of posts on Big Data.On this post, let’s talk about the two most popular distributed  processing models of Big Data, the mapreduce, and the BSP (Bulk Synchronous Parallel). A process model is a kind of algorithm upon which to develop software.

Mapreduce model

Modelo map reduce

In the figure above, we can see the mapreduce model. This model is widely used in the market today, especially in companies that use Hadoop as her main Big Data technology. The model consists of two well-defined steps, called map and reduce:

  • In the step known as Map, hundreds – or even thousands – of parallel processes, called “threads”, perform a type of task called mapping, where a large mass of data is divided into pieces, and each performs a filtering process within a respective piece, creating a mass of values in the key-value format. At the end of this phase, there is a group phase, where the values for the same key are grouped to form data in the format key: {value1, value2, value3 …. valueN};
  • In the step known as reduce, the data generated by the map phase is again divided into pieces and passed to hundreds or even thousands of processes that perform processing on the received data bits and generate as a key-value output, which is the final output of the processing that is finally grouped into a mass of results;

In a future post, we’ll take a hands-on hadoop, where we can see an example of this processing model in practice with the WordCount.

BSP Model (Bulk Synchronous Parallel)

bsp

Although widespread, the mapreduce model is not without its drawbacks. When we talk about the model being applied in the context of Hadoop, for example, all of the cluster steps and mounting of the final mass with the results is done through files on the file system of Hadoop, HDFS, which generates an overhead in performance when it has to perform the same processing in a iterative manner.Another problem is that for graph algorithms such as DFS, BFS or Pert, MapReduce model is not satisfactory. For these scenarios, there is the BSP.

In the BSP algorithm, we have the concept of supersteps. A superstep consists of a unit of generic programming, which through a global communication component, makes thousands of parallel processing on a mass of data and sends it to a “meeting” called synchronization barrier. At this point, the data are grouped, and passed on to the next superstep chain. In this model, it is simpler to construct iterative workloads, since the same logic can be re-executed in a flow of supersteps. Another advantage pointed out by proponents of this model is that it has a simpler learning curve for developers coming from the procedural world.

Speaking in terms of platforms, Hadoop has the Apache Hama as implementation of this model. The main competitor of Hadoop, Spark, come with this feature natively.

Conclusion

And so we conclude another part of our series on Big Data. To date, these are the main models used by the Big Data platforms. As a technology booming, it is natural that in the future we could have more models emerging and gaining their adoption shares. In the next parts of our series, we’ll talk about the two most known implementations of Big Data to date: Hadoop and Spark. U.

Continue reading