Social media and sentiment analysis: knowing your clients


Hi, dear readers! Welcome to another post on my blog. In this post, we will talk about social media and sentiment analysis, and see how they are revolutionizing the way companies target their clients.

Social media

There is no doubt about the importance of social media in modern life. As an example of the power social media has today, take the recent protests in Brazil against the president and her government's corruption, which led thousands of people to the streets across the country, all organized on Facebook!

Today, social media has a very strong influence on the masses, reflecting the tastes and opinions of thousands of people. All of this gigantic amount of information is a gold mine for companies, just waiting to be tapped.

Just imagine that you are the director of an area responsible for developing new products for a gaming company. Now, imagine if you could use social media to analyse the players' reactions to the trailers and news your company releases on the internet. That information could be crucial to discover, for example, that your brand new shiny multiplayer mode is angering your audience because of a new weapon your development team thought would be awesome, but that the players feel is extremely unbalanced.

Now imagine that you are responsible for the public relations of an oil company. Imagine that an ecological NGO starts an "attack" on your company's image on the social networks, saying that your refineries' locations are bad for the ecosystems, despite your company's efforts on reforestation. Without a proper tool to quickly analyse the data flowing through the social networks, it may be too late to revert the damage to the company's image, with hundreds of people "flagellating" your company on the Internet. This may not seem important at first, until you realize that some companies you supply with fuel start buying less from you, because they are worried about their own image on the market when associating themselves with you. More and more, companies are realizing the importance of how positive their brand's image is in the eyes of their customers, a concept also known as "brand health".

This "brand health" metric is very important in the marketing area, and it has already led several companies to enter the social media monitoring field, providing partial or even complete brand health monitoring solutions, often on a SaaS model. Examples of companies that provide this kind of service are Datasift, Mention and Gnip.

Sentiment analysis

A very important metric in brand health monitoring is sentiment analysis. Simply stated, sentiment analysis is exactly what the name says: the analysis of the "sentiment" the author of a given text is feeling about the subject he wrote about, classified as negative, neutral or positive. Of course, it is very clear how important this metric is for most analyses, since it is the key to understanding the quality of your brand's image from the perspective of your public.

But how does this work? How is it possible to analyse someone's sentiments? This is a field still in progress, but there are already some techniques being applied to the task, such as keyword scoring (the presence of words such as curses, for example) and polarity scores, which balance the percentages of a sentence that are positive, neutral and negative in order to analyse the overall sentiment of the text. At the end of this post, there is an article from Columbia University about sentiment analysis of Twitter posts, which the reader can use as a starting point to deepen into the details of the techniques involved in the subject.
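
To make the keyword-scoring idea more concrete, here is a toy sketch (in Java; the word lists are invented for illustration, and real analysers use large lexicons and far more sophisticated models):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// A toy keyword scorer: counts "positive" and "negative" words and
// derives a polarity between -1 and 1. The word lists are made up.
public class KeywordScorer {

    private static final Set<String> POSITIVE = new HashSet<>(
            Arrays.asList("awesome", "great", "love"));
    private static final Set<String> NEGATIVE = new HashSet<>(
            Arrays.asList("terrible", "hate", "unbalanced"));

    public static double polarity(String text) {
        int positives = 0, negatives = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(token)) positives++;
            if (NEGATIVE.contains(token)) negatives++;
        }
        int total = positives + negatives;
        return total == 0 ? 0.0 : (positives - negatives) / (double) total;
    }

    public static void main(String[] args) {
        System.out.println(polarity("I love this game, the new mode is awesome")); // 1.0
        System.out.println(polarity("the new weapon is terrible and unbalanced")); // -1.0
    }
}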

Big Data

As the reader may have already guessed, we are talking about a big volume of data, which grows very fast, is unstructured, has mixed veracity – since we can have both valuable and non-valuable information in our dataset – and has an enormous potential of value for our analysis, since it holds the opinions and tastes – or the "soul" – of our customers. As we saw previously in my first post about Big Data, this data qualifies under the famous "Vs" that are always mentioned when we talk about Big Data. Indeed, generally speaking, most of the tools used in this kind of solution can be classified as Big Data solutions, since they process amounts of data with these characteristics, heavily using distributed systems concepts. Just remember: it is not because it uses social media that it is necessarily Big Data!

A practical exercise

Now, let's do a simple practical exercise, just to see a little of what we have talked about working in practice. In this hands-on, we will make a simple Python script. This script will connect to Twitter – to the public feed, to be more precise – filtering everything with the keyword "coca-cola". Then, it will run a sentiment analysis on all the tweets provided by the feed, using a library called TextBlob that provides us with Natural Language Processing (NLP) capabilities, and finally it will print all the results on the console. So, without further delay, let's begin!

Installation

In this lab, we will use Python 3. I am using Ubuntu 15.04, so Python is already installed by default. If the reader is using a different OS, you can install Python 3 by following this link.

We will also use virtualenv, a tool used to create independent Python environments on our development machine. This is useful to isolate the dependencies and library versions between Python applications, eliminating the problems of installing libraries on the OS's global Python installation. To install virtualenv, please refer to this link.

Set up

To start our set up, first, let’s create a virtual environment. To do this, we open a terminal and type:

virtualenv --python=python3.4 twitterhandson

This will create a folder called twitterhandson, where we can see that a complete Python environment was created, including executables such as pip and python itself. To use virtualenv, enter the twitterhandson folder and input:

source bin/activate

After entering the command, we can see that our command prompt gets a prefix with the name of our environment, as we can see on the screen below:

That's all we need to do in order to use virtualenv. If we want to leave the environment, we just type deactivate on the console.

Using an IDE

In this lab, I am using PyCharm, a powerful Python IDE developed by JetBrains. The IDE is not required for our lab, since any text editor will suffice, but I recommend the reader to experiment with it; I am sure you will like it!

Downloading module dependencies

In Python, we have modules. A module is a Python file where we can have definitions of variables, functions and classes that we can reuse later in more complex scripts. For our lab, we will use pip to download the dependencies. Pip is the tool recommended by Python to manage dependencies, something like what Maven does for us in the Java world. To use it, first we create, in our virtualenv root folder, a file called requirements.txt and put the following inside:

httplib2
simplejson
six
tweepy
textblob

The dependencies above are necessary to use the NLP library and the Twitter API. To make pip download the dependencies, first we activate the virtual environment we created previously and then, in the same folder as our txt file, we input:

pip3 install -r requirements.txt

After running the command above, the modules should be downloaded and enabled on our virtualenv environment.

Using sentiment analysis in other languages

In this post, we are using TextBlob, which sadly supports only English for sentiment analysis – it can translate the text from other languages using Google Translate, but of course that is not the same as an analyser specially designed to process the language. If the reader wants an alternative to run sentiment analysis in other languages as well, such as Portuguese, there is a REST API from a company called BIText – which provides the sentiment analysis solution for Salesforce's Marketing products – that I have tested and which provides very good results. The following link points to the company's API page:

BIText

Creating the Access Token

Before we start our code, there is one last thing we need to do: we need to create an access token, in order to authenticate our calls to Twitter and obtain the data from the public feed. To do this, we first need a Twitter account, created on Twitter.com. With an account created, we create an access token by following this tutorial from Twitter.

Developing the script

Well, now that all the preparations are made, let's finally code! First, we will create a file called config.py. In that file, we will create all the constants we will use in our script:

accesstoken='<access token>'
accesstokensecret='<access token secret>'
consumerkey='<consumer key>'
consumerkeysecret='<consumer key secret>'

And finally, we will create a file called twitter.py, where we will code our Python script, as follows:

from config import *
from textblob import TextBlob
from nltk import downloader
import tweepy


class MyStreamListener(tweepy.StreamListener):
    # called by tweepy for each tweet the stream delivers
    def on_status(self, status):
        print('A TWEET!')
        print(status.text)
        print('AND THE SENTIMENT PER SENTENCE IS:')
        # TextBlob splits the tweet into sentences and scores each one
        blob = TextBlob(status.text)
        for sentence in blob.sentences:
            print(sentence.sentiment.polarity)


# The OAuth handler is responsible for managing our authentication with Twitter
auth = tweepy.OAuthHandler(consumerkey, consumerkeysecret)
auth.set_access_token(accesstoken, accesstokensecret)

# Download the NLTK resources (sentence tokenizer) that TextBlob needs
downloader.download('punkt')

# Create the stream with our listener and start filtering the public feed
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth=auth, listener=myStreamListener)
myStream.filter(track=['coca cola'], languages=['en'])

The first time we run the example, the reader may notice that the script downloads some files. That is because we have to download the resources for the NLTK library, a dependency of TextBlob, which is the real NLP processor TextBlob uses under the hood. Beginning our explanation of the script, we can see that we create an OAuth handler, which is responsible for managing our authentication with Twitter. Then, we instantiate a listener, which we defined at the beginning of our script, pass it as one of the args for the creation of our stream, and finally we start the stream, filtering it to return just tweets with the words "coca cola" and in the English language. According to Twitter's documentation, it is advised to process the tweets asynchronously, because if we process them synchronously, we could lose a tweet while we are still processing its predecessor. That is why tweepy requires us to implement a listener: it collects the tweets for us and orders them to be processed by our listener implementation.

In our listener, we simply print the tweet, use the TextBlob library to run the sentiment analysis and finally print the results, which are calculated sentence by sentence. We can see the results of a run below:

A TWEET!
RT @GeorgeLudwigBiz: Coca-Cola sees a new opportunity in bottling billion-dollar #startups http://t.co/nZXpFRwQOe
AND THE SENTIMENT PER SENTENCE IS:
0.13636363636363635
A TWEET!
RT @momosdonuts: I told y’all I change things up often! Delicious, fluffy, powdered and caramel drizzled coca-cola cake. #momosdonuts http:…
AND THE SENTIMENT PER SENTENCE IS:
0.0
0.4
0.0
A TWEET!
vanilla coca-cola master race

tho i have yet to find a place where they sell imports of the british version
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @larrywhoran: CLOUDS WAS USED IN THE COCA COLA COMMERCIAL AND NO CONTROL BEING PLAYED IN RADIOS AND THEYRE NOT EVEN SINGLES YAS SLAY
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @bromleyfthood: so sei os covers e coca cola dsanvn I vote for @OTYOfficial for the @RedCarpetBiz Rising Star Award 2015 #RCBAwards
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @LiPSMACKER_UK: Today, we’re totally craving Coca-Cola! http://t.co/V140SADKok
AND THE SENTIMENT PER SENTENCE IS:
0.0
0.0
A TWEET!
RT @woodstammie8: Early production of Coca Cola contained trace amounts of coca leaves, which, when processed, render cocaine.
AND THE SENTIMENT PER SENTENCE IS:
0.1
A TWEET!
RT @designtaxi: Coca-Cola creates braille cans for the blind http://t.co/cCSvJLv7O0 http://t.co/UA0PGoheO2
AND THE SENTIMENT PER SENTENCE IS:
-0.5
A TWEET!
Instrus, weed, Coca-Cola y snacks.
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @larrywhoran: CLOUDS WAS USED IN THE COCA COLA COMMERCIAL AND NO CONTROL BEING PLAYED IN RADIOS AND THEYRE NOT EVEN SINGLES YAS SLAY
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
1 Korean Coca-Cola Bottle in GREAT CONDITION Coke Bottle Coke Coca Cola http://t.co/IHhxoJ7aMz
AND THE SENTIMENT PER SENTENCE IS:
0.8
A TWEET!
#Coca-Cola#I#♥#YOU#
Fanny#day#Good… https://t.co/5PU7L4QchC
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
Entry List for Coca-Cola 600 #NASCAR Sprint Cup Series race at Charlotte Motor Speedway is posted, 48 drivers entered http://t.co/UYXPdOP9te
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
@diannaeanderson + walk, get some Coca-Cola, and spend some time reading. Lord knows I need to de-stress.
AND THE SENTIMENT PER SENTENCE IS:
0.0
0.0
A TWEET!
Apply now to work for Coca-Cola #jobs http://t.co/ReFQUIuNeK http://t.co/KVTvyr1e6T
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @jayski: Entry List for Coca-Cola 600 #NASCAR Sprint Cup Series race at Charlotte Motor Speedway is posted, 48 drivers entered http://t.…
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @SeyiLawComedy: When you enter a fast food restaurant and see their bottle of Coca-Cola drink (35cl) is N800; You just exit like » http:…
AND THE SENTIMENT PER SENTENCE IS:
0.2
A TWEET!
Entry List for Coca-Cola 600 #NASCAR Sprint Cup Series race at @CLTMotorSpdwy is posted, 48 drivers entered http://t.co/c2wJAUzIeQ
AND THE SENTIMENT PER SENTENCE IS:
0.0

The reader may notice that the sentiment analysis of the tweets can be more or less inaccurate compared to what the author's sentiment really was, judging by our "human analysis". Indeed, as we said before, this field is still improving, so it will take some more time before we can rely 100% on this kind of analysis.

Conclusion

As we can see, it was pretty simple to construct a program that connects to Twitter, runs a sentiment analysis and prints the results. Despite some current issues with the accuracy of sentiment analysis, as we discussed previously, this kind of analysis is already a really powerful tool to explore, one that companies can use to improve their perception of how the world around them sees them. Thank you for following me on another post, until next time.


Hands-on Hazelcast: using an open source in-memory data grid


Welcome, dear readers, to another post on my blog! In this post, we will talk about Hazelcast, an open source solution for in-memory data grids, used by companies such as AT&T, HSBC, Cisco and HP. But after all, why would we want to use an in-memory data grid?

In-memory Data Grids

In-memory data grids, also known as IMDGs, are distributed data structures where the whole data is stored entirely in RAM (Random Access Memory) across a cluster. This way, we get both the advantage of using a faster resource such as RAM, as opposed to the hard disk – of course, IMDGs such as Hazelcast also use the hard disk as a persistent store for fault tolerance, but still, for the applications' usage, the major resource is RAM – and the horizontal scalability of a cluster, as opposed to the traditional vertical scalability of a relational database, for example.

Common use case scenarios for IMDGs are as a caching layer for databases, clustering web sessions for web applications and even serving as NoSQL solutions.

Talking about Hazelcast, we have a very robust architecture, which provides features such as encryption, cluster replication and partitioning, and more. The following picture, extracted from Hazelcast's documentation, shows the features of the architecture in both the free and enterprise editions.

Installation & set up

Hazelcast has a very simple installation, which basically consists of downloading a tar.gz and extracting its contents. To do this, we visit Hazelcast's download page and click on the tar.gz button – or zip, if the reader is using Windows; I am using Ubuntu 15.04. After the download, we extract the tar.gz to the folder of our preference, and that's it, we have finished our installation! The only prerequisite is to have Java installed, which the reader can get by downloading it from here.

To start a Hazelcast node, we simply run a shell script provided by the installation. On versions prior to 3.2.6, this script was called run.sh and was inside the bin folder. On newer versions, the script is called server.sh, but it is still inside the bin folder. So, to start a node, simply navigate on a terminal to the folder we extracted previously – if the reader is using the same version I am, it will be called hazelcast-3.4.2 – and type:

cd bin

./server.sh

After some seconds, we can see that the node has successfully booted:

For this lab, we will use a Hazelcast cluster composed of 3 nodes, so we open another 2 terminal windows and repeat the procedure we used to start the first node. Hazelcast uses multicast to detect and establish the participation of new nodes in the cluster, as we can see in the members list the nodes update on their consoles as we include new nodes:

That's all we need to do to create the cluster for our lab. Now, let's start using Hazelcast!

Using distributed objects

To connect to our Hazelcast cluster, there are 3 ways:

  • Programmatic configuration, using the classes of the Java API;
  • XML configuration, using a file called hazelcast.xml that we put on the classpath;
  • Integration of Hazelcast with Spring;

In this lab, we will explore the first option. So, starting our lab, let's create a new Maven project – without adding any archetype – and include the following dependencies in the pom.xml:

.

.

.

<dependencies>

    <dependency>
        <groupId>com.hazelcast</groupId>
        <artifactId>hazelcast</artifactId>
        <version>3.4.2</version>
    </dependency>

    <dependency>
        <groupId>com.hazelcast</groupId>
        <artifactId>hazelcast-client</artifactId>
        <version>3.4.2</version>
    </dependency>

</dependencies>

.

.

.

After including the dependencies, let's create a class responsible for creating a single Hazelcast instance for us, to be reused across the whole application:

public class HazelCastFactory {

    private static HazelcastInstance cluster;

    private static boolean shutDown = true;

    public static HazelcastInstance getInstance() {

        if (shutDown) {
            ClientConfig clientConfig = new ClientConfig();
            ClientNetworkConfig clientNetworkConfig = new ClientNetworkConfig();
            clientNetworkConfig.addAddress("127.0.0.1:5701");
            clientConfig.setNetworkConfig(clientNetworkConfig);
            cluster = HazelcastClient.newHazelcastClient(clientConfig);
            shutDown = false;
        }

        return cluster;

    }

    public static void shutDown() {
        cluster.shutdown();
        shutDown = true;
    }

}

In the code above, we created a client to connect to our Hazelcast cluster, pointing to the node on port 5701 as the entry point of our connection. If we want to add other addresses, for the case in which our entry node falls right at connection start, we just add more addresses with the addAddress method. There's no need to add the whole cluster to use the data grid, however: Hazelcast itself is responsible for load balancing the requests across the cluster. We also included a method to shut down our connection to the data grid, releasing the allocated resources.
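
As a sketch, assuming a second node listening on port 5702 (the addresses here are just an assumption for a local cluster), adding a fallback address would look like this:

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.client.config.ClientNetworkConfig;
import com.hazelcast.core.HazelcastInstance;

// the client tries these addresses in order until one of them answers
ClientConfig clientConfig = new ClientConfig();
ClientNetworkConfig clientNetworkConfig = new ClientNetworkConfig();
clientNetworkConfig.addAddress("127.0.0.1:5701");
clientNetworkConfig.addAddress("127.0.0.1:5702");
clientConfig.setNetworkConfig(clientNetworkConfig);
HazelcastInstance cluster = HazelcastClient.newHazelcastClient(clientConfig);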

NOTE: One interesting thing to note is that when a client connects to a Hazelcast cluster, it actually establishes a connection to the cluster, not just to the node we informed as the entry point, meaning that even if the entry node falls after the connection is established, our client will still maintain a connection with the cluster.

Let's begin by creating a distributed object on the cluster: a map data structure. Distributed objects are objects created and managed by the cluster, with their data distributed and replicated across the cluster.

To create it, all we have to do is call the getMap method on the client instance we receive from the factory class, providing a unique name on the cluster to identify the map:

public class HazelCastDistributedMap {

    public static void main(String[] args) {

        HazelcastInstance client = HazelCastFactory.getInstance();

        Map<String, String> map = client.getMap("mymap");

        HazelCastFactory.shutDown();

    }

}

As we can see, it is very simple to create a map. Using it is even simpler: all we have to do is use the methods from the Map interface, just like we do with any basic Map in a common Java program. Behind the scenes, Hazelcast is working for us, supplying an IMDG for our data. Let's demonstrate this by creating a more elaborate example. First, we create a POJO representing a client (NOTE: in order to be distributed by Hazelcast, the objects used must be serializable):

public class Client implements Serializable {

    private static final long serialVersionUID = -4870061854652654067L;

    private String name;

    private Long phone;

    private String sex;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Long getPhone() {
        return phone;
    }

    public void setPhone(Long phone) {
        this.phone = phone;
    }

    public String getSex() {
        return sex;
    }

    public void setSex(String sex) {
        this.sex = sex;
    }

}

Then, we change the example to the following:

public class HazelCastDistributedMap {

    public static void main(String[] args) {

        HazelcastInstance client = HazelCastFactory.getInstance();

        IMap<Long, Client> map = client.getMap("customers");

        Client clientData = new Client();

        clientData.setName("Alexandre Eleuterio Santos Lourenco");
        clientData.setPhone(33455676L);
        clientData.setSex("M");

        map.put(clientData.getPhone(), clientData, 5, TimeUnit.MINUTES);

        clientData = new Client();

        clientData.setName("Lucebiane Santos Lourenco");
        clientData.setPhone(456782387L);
        clientData.setSex("F");

        map.put(clientData.getPhone(), clientData, 2, TimeUnit.MINUTES);

        clientData = new Client();

        clientData.setName("Ana Carolina Fernandes do Sim");
        clientData.setPhone(345622189L);
        clientData.setSex("F");

        map.put(clientData.getPhone(), clientData, 120, TimeUnit.SECONDS);

        HazelCastFactory.shutDown();

        client = HazelCastFactory.getInstance();

        Map<Long, Client> mapPostShutDown = client.getMap("customers");

        for (Long phone : mapPostShutDown.keySet()) {
            Client cli = mapPostShutDown.get(phone);

            System.out.println(cli.getName());

        }

        System.out.println(mapPostShutDown.size());

        HazelCastFactory.shutDown();

    }

}

As we can see in the new code, we obtained a distributed map from Hazelcast, inserted some clients into it, shut down the connection, reopened it and finally iterated over the map, printing the names of the clients and the size of the map. If we run the code, we will receive, alongside logging information from the Hazelcast client, the following prints:

Ana Carolina Fernandes do Sim
Alexandre Eleuterio Santos Lourenco
Lucebiane Santos Lourenco
3

If we revisit the code, two things are noticeable. First, we used a put method which received, alongside the key and value, another 2 parameters. These last two parameters are the time-to-live (TTL), which can be set individually, like we did previously, or in a global fashion, using configuration classes like the ones we used in our factory class. The other thing we can notice is that we got our map from the Hazelcast client first through the IMap interface from Hazelcast and then through the plain classic java.util.Map. The IMap interface provides us with other features besides the traditional ones of a map, such as listeners to be executed at every put/get/delete action on the map, or even a processor to be executed for a certain key or all the keys. In the next sections, we will see examples of both of these features. This interface also extends the Map interface, which is why we can also get our map from Hazelcast as a java.util.Map.
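
As a sketch of the global alternative mentioned above (note that this is member-side configuration, so it would go on the node; the 300 seconds value and the embedded-node setup are just assumptions for illustration):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class NodeWithGlobalTtl {

    public static void main(String[] args) {
        // gives every entry of the "customers" map a default TTL of
        // 300 seconds, instead of setting it on each put call
        Config config = new Config();
        MapConfig mapConfig = config.getMapConfig("customers");
        mapConfig.setTimeToLiveSeconds(300);
        HazelcastInstance node = Hazelcast.newHazelcastInstance(config);
    }

}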

One thing that is very important to notice is that, after we retrieve a value from a map already stored on Hazelcast, changes to the object are not propagated to the cluster. Let's look at an example to understand the implications of this behavior in better detail. Let's analyse what happens if we change our code to the following:

.

.

.

clientData = new Client();

clientData.setName("Ana Carolina Fernandes do Sim");
clientData.setPhone(345622189L);
clientData.setSex("F");

map.put(clientData.getPhone(), clientData, 120, TimeUnit.SECONDS);

clientData = (Client) map.get(33455676L);

clientData.setName("Alexandre Eleuterio Santos Lourenco UPDATED!");

HazelCastFactory.shutDown();

client = HazelCastFactory.getInstance();

.

.

.

If we run this code, we will see that the client we tried to update before shutting down our connection for the first time won't be updated. The reason for this is that objects already stored on Hazelcast aren't re-serialized when we make changes to them, unless we do it explicitly, by re-inputting the data on the map. To do this, we change our code to the following:

public class HazelCastDistributedMap {

    public static void main(String[] args) {

        HazelcastInstance client = HazelCastFactory.getInstance();

        IMap<Long, Client> map = client.getMap("customers");

        Client clientData = new Client();

        clientData.setName("Alexandre Eleuterio Santos Lourenco");
        clientData.setPhone(33455676L);
        clientData.setSex("M");

        map.put(clientData.getPhone(), clientData, 5, TimeUnit.MINUTES);

        clientData = new Client();

        clientData.setName("Lucebiane Santos Lourenco");
        clientData.setPhone(456782387L);
        clientData.setSex("F");

        map.put(clientData.getPhone(), clientData, 2, TimeUnit.MINUTES);

        clientData = new Client();

        clientData.setName("Ana Carolina Fernandes do Sim");
        clientData.setPhone(345622189L);
        clientData.setSex("F");

        map.put(clientData.getPhone(), clientData, 120, TimeUnit.SECONDS);

        clientData = (Client) map.get(33455676L);

        clientData.setName("Alexandre Eleuterio Santos Lourenco UPDATED!");

        // re-put the modified object so the cluster stores the new state
        map.put(clientData.getPhone(), clientData, 5, TimeUnit.MINUTES);

        HazelCastFactory.shutDown();

        client = HazelCastFactory.getInstance();

        Map<Long, Client> mapPostShutDown = client.getMap("customers");

        for (Long phone : mapPostShutDown.keySet()) {
            Client cli = mapPostShutDown.get(phone);

            System.out.println(cli.getName());

        }

        System.out.println(mapPostShutDown.size());

        HazelCastFactory.shutDown();

    }

}

If we run the code again and check the prints, we can see that now our client is correctly updated.

Listeners

As we said before, the IMap interface – like other interfaces from Hazelcast's API – allows us to use other features besides the traditional operations of a Java collection, such as listeners. With listeners, we can create additional code to run every time an entry is added/updated/deleted/evicted (entries can be evicted automatically by a policy, in order to preserve the cluster's memory capacity), or even when the whole map is evicted or cleared. In order to attach a listener to our map, we use an interface called EntryListener and implement a class like the following:

public class MyMapEntryListener implements EntryListener<Long, Client> {

    public void entryAdded(EntryEvent<Long, Client> event) {
        System.out.println("entryAdded:" + event);
    }

    public void entryRemoved(EntryEvent<Long, Client> event) {
        System.out.println("entryRemoved:" + event);
    }

    public void entryUpdated(EntryEvent<Long, Client> event) {
        System.out.println("entryUpdated:" + event);
    }

    public void entryEvicted(EntryEvent<Long, Client> event) {
        System.out.println("entryEvicted:" + event);
    }

    public void mapEvicted(MapEvent event) {
        System.out.println("mapEvicted:" + event);
    }

    public void mapCleared(MapEvent event) {
        System.out.println("mapCleared:" + event);
    }

}

And finally, we add the listener to our map, as in the following snippet:

.

.

.

HazelcastInstance client = HazelCastFactory.getInstance();

IMap<Long, Client> map = client.getMap("customers");

map.addEntryListener(new MyMapEntryListener(), true);

.

.

.

In the code above, we added the listener using the addEntryListener method. The second parameter, a boolean, indicates whether the event object we receive as a parameter in the event methods should carry the entry's value or not. If we run the code, we will see, among the messages on the console, outputs from our listener, like the following:

entryAdded:EntryEvent{entryEventType=ADDED, member=Member [192.168.10.104]:5702, name='customers', key=33455676, oldValue=null, value=com.alexandreesl.handson.model.Client@44daa124}
entryAdded:EntryEvent{entryEventType=ADDED, member=Member [192.168.10.104]:5702, name='customers', key=456782387, oldValue=null, value=com.alexandreesl.handson.model.Client@588ec3d1}
entryAdded:EntryEvent{entryEventType=ADDED, member=Member [192.168.10.104]:5702, name='customers', key=345622189, oldValue=null, value=com.alexandreesl.handson.model.Client@4476e1bd}
entryUpdated:EntryEvent{entryEventType=UPDATED, member=Member [192.168.10.104]:5702, name='customers', key=33455676, oldValue=com.alexandreesl.handson.model.Client@63a68758, value=com.alexandreesl.handson.model.Client@1f5df98a}

One important thing to note about the EntryListener is Hazelcast's threading system. If we include a sysout to print the current thread in our listener – I included it in the entry added and updated methods, since those are the ones we are using in our examples – we can see that the executions are asynchronous, since Hazelcast creates a thread pool to serve the listener calls. The following snippet from the console shows this behavior:

thread:hz.client_0_dev.event-2
thread:hz.client_0_dev.event-1
entryUpdated:EntryEvent{entryEventType=UPDATED, member=Member [192.168.10.104]:5702, name='customers', key=33455676, oldValue=com.alexandreesl.handson.model.Client@63a68758, value=com.alexandreesl.handson.model.Client@1f5df98a}
entryUpdated:EntryEvent{entryEventType=UPDATED, member=Member [192.168.10.104]:5702, name='customers', key=456782387, oldValue=com.alexandreesl.handson.model.Client@588ec3d1, value=com.alexandreesl.handson.model.Client@564936b0}
thread:hz.client_0_dev.event-3
entryUpdated:EntryEvent{entryEventType=UPDATED, member=Member [192.168.10.104]:5702, name='customers', key=345622189, oldValue=com.alexandreesl.handson.model.Client@4ab1169b, value=com.alexandreesl.handson.model.Client@4476e1bd}
thread:hz.client_0_dev.event-2
entryUpdated:EntryEvent{entryEventType=UPDATED, member=Member [192.168.10.104]:5702, name='customers', key=33455676, oldValue=com.alexandreesl.handson.model.Client@1f4f3962, value=com.alexandreesl.handson.model.Client@e8d682e}

With that being said, there is one thing to keep an eye on when using this feature: according to Hazelcast's documentation, this thread pool is exclusive to the execution of the events, not shared with the thread that executes the action itself – meaning that, even if we run our examples creating the node in the main method, running the node embedded in our program instead of connecting to a remote cluster, the thread that runs our event still won't be the same one that manipulates the data – so if we attach too much logic to our listener, we can run into a situation where some of the calls to the listener fail, because there are no threads available to run it.
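
For reference, the thread: prints above came from a line like the following, added to the listener methods (a sketch of the change described earlier):

public void entryUpdated(EntryEvent<Long, Client> event) {
    // prints which pool thread Hazelcast used to deliver this event
    System.out.println("thread:" + Thread.currentThread().getName());
    System.out.println("entryUpdated:" + event);
}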

Processors

Another feature we talked about previously is processors. With a processor, we can add logic that we want to run against a single key on our map, or even all keys (we will sketch the single-key variant at the end of this section). One very important thing to notice about this feature, as opposed to listeners, is that it is scalable: not only does it run on the server side, it is also sent to all nodes of the cluster, executing the logic on all entries of the map, if applicable, in parallel. Let's take an example where we want to change all the phones of our clients to "888888888". We implement the following class to accomplish this task:

public class MyMapProcessor extends AbstractEntryProcessor<Long, Client> {

    private static final long serialVersionUID = 8890058180314253853L;

    @Override
    public Object process(Entry<Long, Client> entry) {

        Client client = entry.getValue();

        client.setPhone(888888888L);

        // re-set the value so the change is applied to the stored entry
        entry.setValue(client);

        System.out.println("Processing the client: " + client.getName());

        return null;
    }

}

And finally, in order to test our processor, we change the code of our main class to the following:

.

.

.

IMap<Long, Client> mapProcessors = client.getMap("customers");

mapProcessors.executeOnEntries(new MyMapProcessor());

HazelCastFactory.shutDown();

client = HazelCastFactory.getInstance();

mapProcessors = client.getMap("customers");

for (Long phone : mapProcessors.keySet()) {
    Client cli = mapProcessors.get(phone);

    System.out.println(cli.getName() + " " + cli.getPhone());

}

.

.

.

However, if we run our code the way it is, we will be greeted with the following error:

Exception in thread "main" com.hazelcast.nio.serialization.HazelcastSerializationException: java.lang.ClassNotFoundException: com.alexandreesl.handson.examples.MyMapProcessor
at com.hazelcast.nio.serialization.DefaultSerializers$ObjectSerializer.read(DefaultSerializers.java:201)
at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:44)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:309)
at com.hazelcast.nio.serialization.ByteArrayObjectDataInput.readObject(ByteArrayObjectDataInput.java:439)
at com.hazelcast.map.impl.client.MapExecuteOnAllKeysRequest.read(MapExecuteOnAllKeysRequest.java:96)
at com.hazelcast.client.impl.client.ClientRequest.readPortable(ClientRequest.java:116)
at com.hazelcast.nio.serialization.PortableSerializer.read(PortableSerializer.java:88)
at com.hazelcast.nio.serialization.PortableSerializer.read(PortableSerializer.java:30)
at com.hazelcast.nio.serialization.StreamSerializerAdapter.toObject(StreamSerializerAdapter.java:65)
at com.hazelcast.nio.serialization.SerializationServiceImpl.toObject(SerializationServiceImpl.java:260)
at com.hazelcast.client.impl.ClientEngineImpl$ClientPacketProcessor.loadRequest(ClientEngineImpl.java:364)
at com.hazelcast.client.impl.ClientEngineImpl$ClientPacketProcessor.run(ClientEngineImpl.java:340)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
at —— End remote and begin local stack-trace ——.(Unknown Source)
at com.hazelcast.client.spi.impl.ClientCallFuture.resolveResponse(ClientCallFuture.java:202)
at com.hazelcast.client.spi.impl.ClientCallFuture.get(ClientCallFuture.java:143)
at com.hazelcast.client.spi.impl.ClientCallFuture.get(ClientCallFuture.java:119)
at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:151)
at com.hazelcast.client.proxy.ClientMapProxy.executeOnEntries(ClientMapProxy.java:890)
at com.alexandreesl.handson.examples.HazelCastDistributedMap.main(HazelCastDistributedMap.java:68)
Caused by: java.lang.ClassNotFoundException: com.alexandreesl.handson.examples.MyMapProcessor
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)

.

.

.

The reason for this is that we don't have the class on the classpath of our cluster. In order to resolve this, first we will export both the Client and the MyMapProcessor classes in a jar – the reader can obtain a jar already prepared by me from the GitHub repository at the end of this post – then save the jar in the lib folder of Hazelcast's installation folder and edit the classpath in the server.sh script in Hazelcast's bin folder:

.

.

.

export CLASSPATH=$HAZELCAST_HOME/lib/hazelcast-3.4.2.jar:$HAZELCAST_HOME/lib/hazelcastcustomprocessors.jar

.

.

.

After restarting our cluster, we can see on the console that all the clients have the same phone, proving that our processor was successful:

Ana Carolina Fernandes do Sim 888888888
Alexandre Eleuterio Santos Lourenco UPDATED! 888888888
Lucebiane Santos Lourenco 888888888

Out of curiosity, one last thing to notice is that, as we also coded a sysout in our processor, we can search the consoles of our nodes and see that our sysouts are printed there, as in the screen below:

NOTE: According to Hazelcast's documentation, it is a good practice to isolate all the classes shared between client and server – in our case, our Hazelcast client and the cluster itself – in a separate project, making the structure less coupled.
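
Before moving on, here is the single-key variant of the processor mentioned at the beginning of this section, as a minimal sketch reusing the classes from our example (the phone key is one of the ones we inserted previously):

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class HazelCastSingleKeyProcessor {

    public static void main(String[] args) {

        HazelcastInstance client = HazelCastFactory.getInstance();

        IMap<Long, Client> map = client.getMap("customers");

        // runs MyMapProcessor only on the entry that owns this key,
        // on the member that holds its partition
        map.executeOnKey(33455676L, new MyMapProcessor());

        HazelCastFactory.shutDown();

    }

}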

MultiMaps

One last feature we are going to visit is multimaps. Sometimes we would like to store multiple values for a single key, like storing a list of orders by user id, for example. As a simple solution, we could just use a common map and pass a Collection, like a List or a Set, as the value. However, using this approach on a distributed system leads to 2 problems:

  1. Performance: in a distributed system, the whole value is serialized and deserialized before the operations are made. That means that if we have a list with 50 items as the value, for example, every time we want to add a new item to the list, the whole list will be deserialized, the new item included and the whole list reserialized again! Of course, this leads to an overhead, which in turn affects the performance of our map;
  2. Thread safety: the common implementations of the map collection have no concurrency controls in their methods. That means that if we use a common map with a collection as the value and we have multiple consumers updating the values, we could run into concurrency problems, such as values not being updated or even being removed;

In order to address these problems, Hazelcast provides us with a special implementation called MultiMap. To illustrate its use, let's create a class called HazelCastDistributedMultiMap and code it like the following:

public class HazelCastDistributedMultiMap {

    public static void main(String[] args) {

        HazelcastInstance client = HazelCastFactory.getInstance();

        MultiMap<String, Client> map = client.getMultiMap("multicustomers");

        Client clientData = new Client();

        clientData.setName("Alexandre Eleuterio Santos Lourenco");
        clientData.setPhone(33455676L);
        clientData.setSex("M");

        map.put(clientData.getSex(), clientData);

        clientData = new Client();

        clientData.setName("Lucebiane Santos Lourenco");
        clientData.setPhone(456782387L);
        clientData.setSex("F");

        map.put(clientData.getSex(), clientData);

        clientData = new Client();

        clientData.setName("Ana Carolina Fernandes do Sim");
        clientData.setPhone(345622189L);
        clientData.setSex("F");

        map.put(clientData.getSex(), clientData);

        HazelCastFactory.shutDown();

        client = HazelCastFactory.getInstance();

        map = client.getMultiMap("multicustomers");

        for (String key : map.keySet()) {

            for (Client cli : map.get(key)) {

                System.out.println("The Client: " + cli.getName()
                        + " for the Key: " + key);

            }

        }

        HazelCastFactory.shutDown();

    }

}

If we run the code, we can see by the sysouts that Hazelcast has successfully grouped the clients by sex, proving that our code was a success:

The Client: Alexandre Eleuterio Santos Lourenco for the Key: M
The Client: Ana Carolina Fernandes do Sim for the Key: F
The Client: Lucebiane Santos Lourenco for the Key: F

Other features

Of course, there is much more Hazelcast is capable of beyond what we saw in this post, like the ability to work with primitives on the cluster, implement asynchronous solutions with queues and topics, and even a criteria API that enables us to search our data in a "JPA-like" fashion. In the links section, there is a link to a PDF called Mastering Hazelcast, made by Hazelcast themselves, which is free and a very good source of information to go deeper into all the subjects concerning Hazelcast.

Conclusion

And this concludes another post. With the advances in hardware technologies, it has become easier and easier to develop solutions that run entirely in RAM and/or make very heavy use of parallelism, such as Big Data-related technologies. In time, distributed systems will become so common – and in a sense, they already are – that one day we will wonder how we lived before them! Thank you for following me on another post, until next time.


SouJava's April technical meeting with IBM


Another good meeting from SouJava! I am not going to attend in person this time, but I will watch via Hangouts! Don't miss it if you can!

SouJava


In the next SouJava technical meeting we will have the participation of IBM. It opens with Bruno Souza, president of SouJava, who recently took part in the InterConnect event in Las Vegas and will share some of the news with the participants. We will also have a talk by Sérgio Gama who, besides giving a general overview of cloud computing, will talk about Bluemix and Watson.

A super opportunity for learning and networking!

Schedule:

  • 19:00 Coffee break

  • 19:30 News and innovations in cloud, Bruno Souza

  • 20:00 Java application development using cognitive computing, Sergio Gama

  • 21:30 Raffles

More information and registration: http://www.globalcode.com.br/gratuitos/minicursos/minicurso-desenvolvimento-java-com-ibm-bluemix

SouJava is on Facebook. Follow @SouJava on Twitter.


Mockito & DBUnit: implementing a focused and independent mocking structure for your automated tests in Java


Hi, my dear readers! Welcome to my blog. In this post, we will do a hands-on with Mockito and DBUnit, two libraries from Java's open source ecosystem which can help us improve the focus and independence of our JUnit tests. But why is mocking so important in our unit tests?

Focusing the tests

Let's imagine a Java back-end application with a tier-like architecture. In this application, we could have 2 tiers:

  • The service tier, which holds the business rules and acts as an interface to the front-end;
  • The entity tier, which holds the logic responsible for making calls to a database, using technologies like JDBC or JPA;

Of course, in an architecture of this kind, we will have the following dependency between our tiers:

Service >>> Entity

In this kind of architecture, the most common way of building our automated tests is by creating JUnit test classes which test each tier independently, so we can run tests that reflect only the correctness of the tier we want to test. However, if we simply create the classes without any mocking, we will get problems like the following:

  • In the JUnit tests of our service tier, for example, if we have a problem in the entity tier, our tests will also fail, because the error from the entity tier will reverberate across the tiers;
  • If we have a project where different teams work on the same system, and one team is responsible for the construction of the service tier while the other is responsible for the construction of the entity tier, one team will depend on the other before the tests can be made;

To resolve such issues, we can mock the entity tier in the service tier's unit test classes, so our tests have independence and focus on the service tier, where they belong, as the sketch below illustrates.
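
As a minimal sketch of the idea with Mockito (using the ClientService and ClientDAO classes we will build later in this hands-on; the stubbed return value is made up):

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.Assert;
import org.junit.Test;

// the service is tested against a mocked entity tier, so a bug in the
// real DAO cannot make this test fail
public class ClientServiceMockSketchTest {

    @Test
    public void findDelegatesToTheEntityTier() {
        ClientDAO dao = mock(ClientDAO.class);
        Client expected = new Client();
        when(dao.find(1L)).thenReturn(expected);

        ClientService service = new ClientService();
        service.setClientDAO(dao);

        Assert.assertSame(expected, service.find(1L));
    }

}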

Independence

One point that is especially important for the independence of our JUnit test classes is the entity tier. Since, in our example, this tier is focused on connecting to and running SQL commands on a database, it breaks our independence goal, since we need a database in order to run our tests. Not only that: if a test breaks any structure that is used by the subsequent tests, all of them will also fail. This is where our other library, DBUnit, comes in.

With DBUnit, we can use embedded databases, such as HSQLDB, to make our database exclusive to the running of our tests.

So, without further delay, let’s begin our hands-on!

Hands-on

For this lab, we will create a basic CRUD for a Client entity. The structure will follow the simple example we talked about previously, with the DAO (entity) and service tiers. We will use DBUnit and JUnit to test the DAO tier, and Mockito with JUnit to test the service tier. First, let's create a Maven project, without any archetype, and include the following dependencies in the pom.xml:

.

.

.

<dependencies>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>

    <dependency>
        <groupId>org.dbunit</groupId>
        <artifactId>dbunit</artifactId>
        <version>2.5.0</version>
    </dependency>

    <dependency>
        <groupId>org.mockito</groupId>
        <artifactId>mockito-all</artifactId>
        <version>1.10.19</version>
    </dependency>

    <dependency>
        <groupId>org.hibernate</groupId>
        <artifactId>hibernate-entitymanager</artifactId>
        <version>4.3.8.Final</version>
    </dependency>

    <dependency>
        <groupId>org.hsqldb</groupId>
        <artifactId>hsqldb</artifactId>
        <version>2.3.2</version>
    </dependency>

    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-core</artifactId>
        <version>4.1.4.RELEASE</version>
    </dependency>

    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-context</artifactId>
        <version>4.1.5.RELEASE</version>
    </dependency>

    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-test</artifactId>
        <version>4.1.5.RELEASE</version>
    </dependency>

    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-tx</artifactId>
        <version>4.1.5.RELEASE</version>
    </dependency>

    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-orm</artifactId>
        <version>4.1.5.RELEASE</version>
    </dependency>

</dependencies>

.

.

.

In the previous snippet, we included not only the Mockito, DBUnit and JUnit libraries, but also Hibernate, to implement the persistence layer, and Spring 4, to use the IoC container and the transaction management. We also included the Spring Test library, which has some features we will use later in this lab. Finally, to simplify the setup and remove the need to install a database to run the code, we will use HSQLDB as our database.

Our lab will have the following structure:

  • One class will represent the application itself, as a standalone class, where we will consume the tiers, like a real application would do;
  • We will have another 2 classes, each one with JUnit tests, that will test each tier independently;

First, we define a persistence unit, where we set the name of the unit and the properties that make Hibernate create the table for us and populate it with some initial rows. The code of the persistence.xml can be seen below:

<?xml version="1.0" encoding="UTF-8"?>
<persistence xmlns="http://java.sun.com/xml/ns/persistence"
    version="1.0">
    <persistence-unit name="persistence" transaction-type="RESOURCE_LOCAL">
        <class>com.alexandreesl.handson.model.Client</class>

        <properties>
            <property name="hibernate.hbm2ddl.auto" value="create" />
            <property name="hibernate.hbm2ddl.import_files" value="sql/import.sql" />
        </properties>

    </persistence-unit>
</persistence>

And the initial data used to populate the table can be seen below:

insert into Client(id,name,sex, phone) values (1,'Alexandre Eleuterio Santos Lourenco','M','22323456');
insert into Client(id,name,sex, phone) values (2,'Lucebiane Santos Lourenco','F','22323876');
insert into Client(id,name,sex, phone) values (3,'Maria Odete dos Santos Lourenco','F','22309456');
insert into Client(id,name,sex, phone) values (4,'Eleuterio da Silva Lourenco','M','22323956');
insert into Client(id,name,sex, phone) values (5,'Ana Carolina Fernandes do Sim','F','22123456');

In order not to make the post burdensome, we will not discuss the project structure during the lab, but just show the final structure at the end. The code can be found in a GitHub repository, linked at the end of the post.

With the persistence unit defined, we can start coding! First, we create the entity class:

package com.alexandreesl.handson.model;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Table(name = "Client")
@Entity
public class Client {

    @Id
    private long id;

    @Column(name = "name", nullable = false, length = 50)
    private String name;

    @Column(name = "sex", nullable = false)
    private String sex;

    @Column(name = "phone", nullable = false)
    private long phone;

    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getSex() {
        return sex;
    }

    public void setSex(String sex) {
        this.sex = sex;
    }

    public long getPhone() {
        return phone;
    }

    public void setPhone(long phone) {
        this.phone = phone;
    }

}

In order to create the persistence-related beans that enable Hibernate and the transaction manager, alongside all the rest of the beans necessary for the application, we use a Java-based Spring configuration class. The code of the class can be seen below:

package com.alexandreesl.handson.core;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.DriverManagerDataSource;
import org.springframework.orm.jpa.JpaTransactionManager;
import org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean;
import org.springframework.orm.jpa.vendor.Database;
import org.springframework.orm.jpa.vendor.HibernateJpaDialect;
import org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter;
import org.springframework.transaction.annotation.EnableTransactionManagement;

@Configuration
@EnableTransactionManagement
@ComponentScan({ "com.alexandreesl.handson.dao",
        "com.alexandreesl.handson.service" })
public class AppConfiguration {

    @Bean
    public DriverManagerDataSource dataSource() {

        DriverManagerDataSource dataSource = new DriverManagerDataSource();

        dataSource.setDriverClassName("org.hsqldb.jdbcDriver");
        dataSource.setUrl("jdbc:hsqldb:mem://standalone");
        dataSource.setUsername("sa");
        dataSource.setPassword("");

        return dataSource;
    }

    @Bean
    public JpaTransactionManager transactionManager() {

        JpaTransactionManager transactionManager = new JpaTransactionManager();

        transactionManager.setEntityManagerFactory(entityManagerFactory()
                .getNativeEntityManagerFactory());
        transactionManager.setDataSource(dataSource());
        transactionManager.setJpaDialect(jpaDialect());

        return transactionManager;
    }

    @Bean
    public HibernateJpaDialect jpaDialect() {
        return new HibernateJpaDialect();
    }

    @Bean
    public HibernateJpaVendorAdapter jpaVendorAdapter() {
        HibernateJpaVendorAdapter jpaVendor = new HibernateJpaVendorAdapter();

        jpaVendor.setDatabase(Database.HSQL);
        jpaVendor.setDatabasePlatform("org.hibernate.dialect.HSQLDialect");

        return jpaVendor;

    }

    @Bean
    public LocalContainerEntityManagerFactoryBean entityManagerFactory() {

        LocalContainerEntityManagerFactoryBean entityManagerFactory = new LocalContainerEntityManagerFactoryBean();

        entityManagerFactory
                .setPersistenceXmlLocation("classpath:META-INF/persistence.xml");
        entityManagerFactory.setPersistenceUnitName("persistence");
        entityManagerFactory.setDataSource(dataSource());
        entityManagerFactory.setJpaVendorAdapter(jpaVendorAdapter());
        entityManagerFactory.setJpaDialect(jpaDialect());

        return entityManagerFactory;

    }

}

And finally, we create the classes that represent the tiers themselves. This is the DAO class:

package com.alexandreesl.handson.dao;

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

import com.alexandreesl.handson.model.Client;

@Component
public class ClientDAO {

    @PersistenceContext
    private EntityManager entityManager;

    @Transactional(readOnly = true)
    public Client find(long id) {

        return entityManager.find(Client.class, id);

    }

    @Transactional
    public void create(Client client) {

        entityManager.persist(client);

    }

    @Transactional
    public void update(Client client) {

        entityManager.merge(client);

    }

    @Transactional
    public void delete(Client client) {

        entityManager.remove(client);

    }

}

And this is the service class:

package com.alexandreesl.handson.service;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.alexandreesl.handson.dao.ClientDAO;
import com.alexandreesl.handson.model.Client;

@Component
public class ClientService {

    @Autowired
    private ClientDAO clientDAO;

    public ClientDAO getClientDAO() {
        return clientDAO;
    }

    public void setClientDAO(ClientDAO clientDAO) {
        this.clientDAO = clientDAO;
    }

    public Client find(long id) {

        return clientDAO.find(id);

    }

    public void create(Client client) {

        clientDAO.create(client);

    }

    public void update(Client client) {

        clientDAO.update(client);

    }

    public void delete(Client client) {

        clientDAO.delete(client);

    }

}

The reader may notice that we created a getter/setter for the DAO on the service class. This is not necessary for the Spring injection, but we made it this way to make it easier to swap the real DAO for a Mockito mock in the test classes. Finally, we code the class we talked about previously, the one that consumes the tiers:

package com.alexandreesl.handson.core;

import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;

import com.alexandreesl.handson.model.Client;
import com.alexandreesl.handson.service.ClientService;

public class App {

    public static void main(String[] args) {

        ApplicationContext context = new AnnotationConfigApplicationContext(
                AppConfiguration.class);

        ClientService service = (ClientService) context
                .getBean(ClientService.class);

        System.out.println(service.find(1).getName());

        System.out.println(service.find(3).getName());

        System.out.println(service.find(5).getName());

        Client client = new Client();

        client.setId(6);
        client.setName("Celina do Sim");
        client.setPhone(44657688);
        client.setSex("F");

        service.create(client);

        System.out.println(service.find(6).getName());

        System.exit(0);

    }

}

If we run the class, we can see that the console prints all the clients we searched for and that Hibernate is initialized properly, proving our implementation is a success:

Mar 28, 2015 1:09:22 PM org.springframework.context.annotation.AnnotationConfigApplicationContext prepareRefresh
INFO: Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@6433a2: startup date [Sat Mar 28 13:09:22 BRT 2015]; root of context hierarchy
Mar 28, 2015 1:09:22 PM org.springframework.jdbc.datasource.DriverManagerDataSource setDriverClassName
INFO: Loaded JDBC driver: org.hsqldb.jdbcDriver
Mar 28, 2015 1:09:22 PM org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean createNativeEntityManagerFactory
INFO: Building JPA container EntityManagerFactory for persistence unit 'persistence'
Mar 28, 2015 1:09:22 PM org.hibernate.jpa.internal.util.LogHelper logPersistenceUnitInformation
INFO: HHH000204: Processing PersistenceUnitInfo [
name: persistence
…]
Mar 28, 2015 1:09:22 PM org.hibernate.Version logVersion
INFO: HHH000412: Hibernate Core {4.3.8.Final}
Mar 28, 2015 1:09:22 PM org.hibernate.cfg.Environment <clinit>
INFO: HHH000206: hibernate.properties not found
Mar 28, 2015 1:09:22 PM org.hibernate.cfg.Environment buildBytecodeProvider
INFO: HHH000021: Bytecode provider name : javassist
Mar 28, 2015 1:09:22 PM org.hibernate.annotations.common.reflection.java.JavaReflectionManager <clinit>
INFO: HCANN000001: Hibernate Commons Annotations {4.0.5.Final}
Mar 28, 2015 1:09:23 PM org.hibernate.dialect.Dialect <init>
INFO: HHH000400: Using dialect: org.hibernate.dialect.HSQLDialect
Mar 28, 2015 1:09:23 PM org.hibernate.hql.internal.ast.ASTQueryTranslatorFactory <init>
INFO: HHH000397: Using ASTQueryTranslatorFactory
Mar 28, 2015 1:09:23 PM org.hibernate.tool.hbm2ddl.SchemaExport execute
INFO: HHH000227: Running hbm2ddl schema export
Mar 28, 2015 1:09:23 PM org.hibernate.tool.hbm2ddl.SchemaExport execute
INFO: HHH000230: Schema export complete
Alexandre Eleuterio Santos Lourenco
Maria Odete dos Santos Lourenco
Ana Carolina Fernandes do Sim
Celina do Sim

Now, let's move on to the tests themselves. For the DBUnit tests, we create a base class that provides the basic DB operations all of our JUnit tests will benefit from. On the @PostConstruct method – which is fired after all the injections of the Spring context are made, the reason why we couldn't use the @BeforeClass annotation, since we need Spring to instantiate and inject the EntityManager first – we use DBUnit to open a connection to our database, with the DatabaseConnection class, and populate the table using the dataset we created, passing an XML structure that represents the data used in the tests.

This population of the table is made by the DatabaseOperation class, which we use with the CLEAN_INSERT operation, which truncates the table first and then inserts the data from the dataset. Finally, we use one of JUnit's event listeners, the @After event, which is called after every test case. In our scenario, we use this event to call the clear() method on the EntityManager, which forces Hibernate to query the database again at the start of every test case, thus eliminating problems we could have between test cases because of data that differs between the second-level cache and the DB.

The code for the base class is the following:

package com.alexandreesl.handson.dao.test;

import java.io.InputStream;
import java.sql.SQLException;

import javax.annotation.PostConstruct;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.PersistenceUnit;

import org.dbunit.DatabaseUnitException;
import org.dbunit.database.DatabaseConfig;
import org.dbunit.database.DatabaseConnection;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import org.dbunit.ext.hsqldb.HsqldbDataTypeFactory;
import org.dbunit.operation.DatabaseOperation;
import org.hibernate.HibernateException;
import org.hibernate.internal.SessionImpl;
import org.junit.After;

public class BaseDBUnitSetup {

    private static IDatabaseConnection connection;
    private static IDataSet dataset;

    @PersistenceUnit
    public EntityManagerFactory entityManagerFactory;

    private EntityManager entityManager;

    @PostConstruct
    public void init() throws HibernateException, DatabaseUnitException,
            SQLException {

        entityManager = entityManagerFactory.createEntityManager();

        connection = new DatabaseConnection(
                ((SessionImpl) (entityManager.getDelegate())).connection());
        connection.getConfig().setProperty(
                DatabaseConfig.PROPERTY_DATATYPE_FACTORY,
                new HsqldbDataTypeFactory());

        FlatXmlDataSetBuilder flatXmlDataSetBuilder = new FlatXmlDataSetBuilder();
        InputStream dataSet = Thread.currentThread().getContextClassLoader()
                .getResourceAsStream("test-data.xml");
        dataset = flatXmlDataSetBuilder.build(dataSet);

        DatabaseOperation.CLEAN_INSERT.execute(connection, dataset);

    }

    @After
    public void afterTests() {
        entityManager.clear();
    }

}

The xml structure used on the test cases is the following:

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <Client id="1" name="Alexandre Eleuterio Santos Lourenco" sex="M" phone="22323456" />
    <Client id="2" name="Lucebiane Santos Lourenco" sex="F" phone="22323876" />
</dataset>

And the code of our test class of the DAO tier is the following:

package com.alexandreesl.handson.dao.test;

import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertNull;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.test.context.transaction.TransactionConfiguration;
import org.springframework.transaction.annotation.Transactional;

import com.alexandreesl.handson.core.test.AppTestConfiguration;
import com.alexandreesl.handson.dao.ClientDAO;
import com.alexandreesl.handson.model.Client;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(classes = AppTestConfiguration.class)
@TransactionConfiguration(defaultRollback = true)
public class ClientDAOTest extends BaseDBUnitSetup {

    @Autowired
    private ClientDAO clientDAO;

    @Test
    public void testFind() {

        Client client = clientDAO.find(1);
        assertNotNull(client);

        client = clientDAO.find(2);
        assertNotNull(client);

        client = clientDAO.find(3);
        assertNull(client);

        client = clientDAO.find(4);
        assertNull(client);

        client = clientDAO.find(5);
        assertNull(client);

    }

    @Test
    @Transactional
    public void testInsert() {

        Client client = new Client();

        client.setId(3);
        client.setName("Celina do Sim");
        client.setPhone(44657688);
        client.setSex("F");

        clientDAO.create(client);

    }

    @Test
    @Transactional
    public void testUpdate() {

        Client client = clientDAO.find(1);

        client.setPhone(12345678);

        clientDAO.update(client);

    }

    @Test
    @Transactional
    public void testRemove() {

        Client client = clientDAO.find(1);

        clientDAO.delete(client);

    }

}

The code is very self-explanatory, so we will just focus on explaining the annotations at the class level. The @RunWith(SpringJUnit4ClassRunner.class) annotation changes the JUnit base class that runs our test cases to one made by Spring, which enables support for the IoC container and Spring's annotations. The @TransactionConfiguration(defaultRollback = true) annotation is from Spring's test library and changes the behavior of the @Transactional annotation, making the transactions roll back after execution instead of committing. That ensures that our test cases won't change the state of the DB, so one test case won't break the execution of the ones that follow it.

The reader may notice that we changed the configuration class to another one, exclusive to the test cases. It contains essentially the same beans we created in the original configuration class, only changing the datasource bean to point to a different DB than the previous one, showing that we can change the database of our tests without breaking the code. In a real-world scenario, the configuration class of the application would point to a relational database like Oracle, DB2, etc., and the test cases would use an embedded database such as HSQLDB, which is what we use in this case:

package com.alexandreesl.handson.core.test;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.DriverManagerDataSource;
import org.springframework.orm.jpa.JpaTransactionManager;
import org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean;
import org.springframework.orm.jpa.vendor.Database;
import org.springframework.orm.jpa.vendor.HibernateJpaDialect;
import org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter;
import org.springframework.transaction.annotation.EnableTransactionManagement;

@Configuration
@EnableTransactionManagement
@ComponentScan({ "com.alexandreesl.handson.dao",
        "com.alexandreesl.handson.service" })
public class AppTestConfiguration {

    @Bean
    public DriverManagerDataSource dataSource() {

        DriverManagerDataSource dataSource = new DriverManagerDataSource();

        dataSource.setDriverClassName("org.hsqldb.jdbcDriver");
        dataSource.setUrl("jdbc:hsqldb:mem://standalone-test");
        dataSource.setUsername("sa");
        dataSource.setPassword("");

        return dataSource;
    }

    @Bean
    public JpaTransactionManager transactionManager() {

        JpaTransactionManager transactionManager = new JpaTransactionManager();

        transactionManager.setEntityManagerFactory(entityManagerFactory()
                .getNativeEntityManagerFactory());
        transactionManager.setDataSource(dataSource());
        transactionManager.setJpaDialect(jpaDialect());

        return transactionManager;
    }

    @Bean
    public HibernateJpaDialect jpaDialect() {
        return new HibernateJpaDialect();
    }

    @Bean
    public HibernateJpaVendorAdapter jpaVendorAdapter() {

        HibernateJpaVendorAdapter jpaVendor = new HibernateJpaVendorAdapter();

        jpaVendor.setDatabase(Database.HSQL);
        jpaVendor.setDatabasePlatform("org.hibernate.dialect.HSQLDialect");

        return jpaVendor;
    }

    @Bean
    public LocalContainerEntityManagerFactoryBean entityManagerFactory() {

        LocalContainerEntityManagerFactoryBean entityManagerFactory = new LocalContainerEntityManagerFactoryBean();

        entityManagerFactory
                .setPersistenceXmlLocation("classpath:META-INF/persistence.xml");
        entityManagerFactory.setPersistenceUnitName("persistence");
        entityManagerFactory.setDataSource(dataSource());
        entityManagerFactory.setJpaVendorAdapter(jpaVendorAdapter());
        entityManagerFactory.setJpaDialect(jpaDialect());

        return entityManagerFactory;
    }

}

If we run the test class, we can see that it runs the test cases successfully. If we look at the console, we can also see that transactions were created and rolled back, respecting our configuration:

.

.

.

Mar 28, 2015 2:29:55 PM org.springframework.test.context.transaction.TransactionContext startTransaction
INFO: Began transaction (1) for test context [DefaultTestContext@644abb8f testClass = ClientDAOTest, testInstance = com.alexandreesl.handson.dao.test.ClientDAOTest@1a411233, testMethod = testInsert@ClientDAOTest, testException = [null], mergedContextConfiguration = [MergedContextConfiguration@70325d20 testClass = ClientDAOTest, locations = ‘{}’, classes = ‘{class com.alexandreesl.handson.core.test.AppTestConfiguration}’, contextInitializerClasses = ‘[]’, activeProfiles = ‘{}’, propertySourceLocations = ‘{}’, propertySourceProperties = ‘{}’, contextLoader = ‘org.springframework.test.context.support.DelegatingSmartContextLoader’, parent = [null]]]; transaction manager [org.springframework.orm.jpa.JpaTransactionManager@7c2327fa]; rollback [true]
Mar 28, 2015 2:29:55 PM org.springframework.test.context.transaction.TransactionContext endTransaction
INFO: Rolled back transaction for test context [DefaultTestContext@644abb8f testClass = ClientDAOTest, testInstance = com.alexandreesl.handson.dao.test.ClientDAOTest@1a411233, testMethod = testInsert@ClientDAOTest, testException = [null], mergedContextConfiguration = [MergedContextConfiguration@70325d20 testClass = ClientDAOTest, locations = ‘{}’, classes = ‘{class com.alexandreesl.handson.core.test.AppTestConfiguration}’, contextInitializerClasses = ‘[]’, activeProfiles = ‘{}’, propertySourceLocations = ‘{}’, propertySourceProperties = ‘{}’, contextLoader = ‘org.springframework.test.context.support.DelegatingSmartContextLoader’, parent = [null]]].
Mar 28, 2015 2:29:55 PM org.springframework.test.context.transaction.TransactionContext startTransaction
INFO: Began transaction (1) for test context [DefaultTestContext@644abb8f testClass = ClientDAOTest, testInstance = com.alexandreesl.handson.dao.test.ClientDAOTest@2adddc06, testMethod = testRemove@ClientDAOTest, testException = [null], mergedContextConfiguration = [MergedContextConfiguration@70325d20 testClass = ClientDAOTest, locations = ‘{}’, classes = ‘{class com.alexandreesl.handson.core.test.AppTestConfiguration}’, contextInitializerClasses = ‘[]’, activeProfiles = ‘{}’, propertySourceLocations = ‘{}’, propertySourceProperties = ‘{}’, contextLoader = ‘org.springframework.test.context.support.DelegatingSmartContextLoader’, parent = [null]]]; transaction manager [org.springframework.orm.jpa.JpaTransactionManager@7c2327fa]; rollback [true]
Mar 28, 2015 2:29:55 PM org.springframework.test.context.transaction.TransactionContext endTransaction
INFO: Rolled back transaction for test context [DefaultTestContext@644abb8f testClass = ClientDAOTest, testInstance = com.alexandreesl.handson.dao.test.ClientDAOTest@2adddc06, testMethod = testRemove@ClientDAOTest, testException = [null], mergedContextConfiguration = [MergedContextConfiguration@70325d20 testClass = ClientDAOTest, locations = ‘{}’, classes = ‘{class com.alexandreesl.handson.core.test.AppTestConfiguration}’, contextInitializerClasses = ‘[]’, activeProfiles = ‘{}’, propertySourceLocations = ‘{}’, propertySourceProperties = ‘{}’, contextLoader = ‘org.springframework.test.context.support.DelegatingSmartContextLoader’, parent = [null]]].
Mar 28, 2015 2:29:55 PM org.springframework.test.context.transaction.TransactionContext startTransaction
INFO: Began transaction (1) for test context [DefaultTestContext@644abb8f testClass = ClientDAOTest, testInstance = com.alexandreesl.handson.dao.test.ClientDAOTest@4905c46b, testMethod = testUpdate@ClientDAOTest, testException = [null], mergedContextConfiguration = [MergedContextConfiguration@70325d20 testClass = ClientDAOTest, locations = ‘{}’, classes = ‘{class com.alexandreesl.handson.core.test.AppTestConfiguration}’, contextInitializerClasses = ‘[]’, activeProfiles = ‘{}’, propertySourceLocations = ‘{}’, propertySourceProperties = ‘{}’, contextLoader = ‘org.springframework.test.context.support.DelegatingSmartContextLoader’, parent = [null]]]; transaction manager [org.springframework.orm.jpa.JpaTransactionManager@7c2327fa]; rollback [true]
Mar 28, 2015 2:29:55 PM org.springframework.test.context.transaction.TransactionContext endTransaction
INFO: Rolled back transaction for test context [DefaultTestContext@644abb8f testClass = ClientDAOTest, testInstance = com.alexandreesl.handson.dao.test.ClientDAOTest@4905c46b, testMethod = testUpdate@ClientDAOTest, testException = [null], mergedContextConfiguration = [MergedContextConfiguration@70325d20 testClass = ClientDAOTest, locations = ‘{}’, classes = ‘{class com.alexandreesl.handson.core.test.AppTestConfiguration}’, contextInitializerClasses = ‘[]’, activeProfiles = ‘{}’, propertySourceLocations = ‘{}’, propertySourceProperties = ‘{}’, contextLoader = ‘org.springframework.test.context.support.DelegatingSmartContextLoader’, parent = [null]]].

Now let’s move on to the Service tests, with the help of Mockito.

The class to test the Service tier is very simple, as we can see below:

package com.alexandreesl.handson.service.test;

import static org.junit.Assert.assertEquals;

import org.junit.BeforeClass;
import org.junit.Test;
import org.mockito.Mockito;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.stubbing.Answer;

import com.alexandreesl.handson.dao.ClientDAO;
import com.alexandreesl.handson.model.Client;
import com.alexandreesl.handson.service.ClientService;

public class ClientServiceTest {

    private static ClientDAO clientDAO;

    private static ClientService clientService;

    @BeforeClass
    public static void beforeClass() {

        clientService = new ClientService();

        clientDAO = Mockito.mock(ClientDAO.class);

        clientService.setClientDAO(clientDAO);

        Client client = new Client();
        client.setId(0);
        client.setName("Mocked client!");
        client.setPhone(11111111);
        client.setSex("M");

        Mockito.when(clientDAO.find(Mockito.anyLong())).thenReturn(client);

        Mockito.doThrow(new RuntimeException("error on client!"))
                .when(clientDAO).delete((Client) Mockito.any());

        Mockito.doNothing().when(clientDAO).create((Client) Mockito.any());

        Mockito.doAnswer(new Answer<Object>() {
            public Object answer(InvocationOnMock invocation) {
                Object[] args = invocation.getArguments();

                Client client = (Client) args[0];

                client.setName("Mocked client has changed!");

                return client;
            }
        }).when(clientDAO).update((Client) Mockito.any());

    }

    @Test
    public void testFind() {

        Client client = clientService.find(10);

        Mockito.verify(clientDAO).find(10);

        assertEquals(client.getName(), "Mocked client!");

    }

    @Test
    public void testInsert() {

        Client client = new Client();

        client.setId(3);
        client.setName("Celina do Sim");
        client.setPhone(44657688);
        client.setSex("F");

        clientService.create(client);

        Mockito.verify(clientDAO).create(client);

    }

    @Test
    public void testUpdate() {

        Client client = clientService.find(20);

        client.setPhone(12345678);

        clientService.update(client);

        Mockito.verify(clientDAO).update(client);

        assertEquals(client.getName(), "Mocked client has changed!");

    }

    @Test(expected = RuntimeException.class)
    public void testRemove() {

        Client client = clientService.find(2);

        clientService.delete(client);

    }

}

On this test case, we didn't need Spring to inject the dependencies, because the only dependency of the service class is the DAO, which we create as a mock with Mockito on the @BeforeClass event, which executes just once, before any test case runs. On Mockito, we have the concepts of mocks and spies. With mocks, we have to define the behavior of every method we know our tests will call, because a mock never calls the original methods, even if no behavior is declared. If we want to mock just some methods and keep the original implementation for the others, we use a spy instead.
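
To make the distinction concrete, here is a minimal, hypothetical sketch (not part of the project's code), reusing the project's ClientDAO and Client classes:

// a mock never runs the real ClientDAO code: unstubbed methods just return
// defaults (null, 0, false)
ClientDAO mockedDAO = Mockito.mock(ClientDAO.class);

// a spy wraps a real instance: unstubbed methods run the original implementation
ClientDAO spiedDAO = Mockito.spy(new ClientDAO());

// on the spy, only find(99) is overridden; create/update/delete would still
// hit the real DAO (and therefore the database)
Mockito.doReturn(new Client()).when(spiedDAO).find(99);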

On our tests, we made the following mocks:

  • For the find method, we specify that any call will return a client with the name "Mocked client!";
  • For the delete method, we throw a RuntimeException with the message "error on client!";
  • For the create method, we just receive the argument and do nothing;
  • For the update method, we receive the client passed as argument and change its name to "Mocked client has changed!";

And that is it. We can also see that our test cases use Mockito's verify method, which ensures that our mock was called inside the service tier. This check is very useful to ensure, for example, that a developer didn't remove the call to the DAO tier by accident. If we run the JUnit class, we can see that everything passes, proving that our coding was successful.
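
As a side note, verify can also take a verification mode, which lets us assert how many times a call was expected. A hedged sketch of some variants, reusing the clientDAO and client from the test above (these lines are not part of the project's tests):

Mockito.verify(clientDAO, Mockito.times(1)).create(client); // called exactly once
Mockito.verify(clientDAO, Mockito.never()).delete(client); // never called at all
Mockito.verifyNoMoreInteractions(clientDAO); // nothing else was called on the mock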

This is the final structure of the project:

Conclusion

And that concludes our hands-on. Simple to use but powerful, test libraries like Mockito and DBUnit are useful tools for building a solid testing tier in our applications, improving and ensuring the quality of our code. Thank you for following me, until next time.

Encontro Técnico com participação internacional

Standard

Very good event for Java developers in Brazil, I am going!

SouJava

On March 25th, Wednesday, SouJava will host a technical meeting, the first of the year, aiming to strengthen the development culture on the Java platform. At this meeting we will have Otávio Santana, plus an international guest, Luca Garulli, author of OrientDB.

Date: March 25th, Wednesday

Time: 19:00

Location: Globalcode's auditorium

Av. Bernardino de Campos, 327 cj. 22

São Paulo, near the Paraíso subway station

Sign up HERE

Meeting program

Turbocharging your collections with Stream

Description: Get to know the feature that truly lets you turbocharge your collections: the stream. Streams let you work with your collections in a simpler, more intuitive and fluent way.

Speaker: Otávio Santana

Bio: A developer passionate about what he does. A practitioner of the agile philosophy and of polyglot development in Bahia, JUG Leader of JavaBahia, coordinator of SouJava, a helper in several JUGs around the world, and one of the promoters of the LinguÁgil group…


My ELK Series was mentioned on ElasticSearch’s official blog!

Standard

Welcome, my dear readers! It is always nice to see our work being recognized, so I want to post this in thanks to ElasticSearch's official blog, which, I found out, has praised my series about the ELK stack. Thank you very much, guys at elastic.co, you rock!

https://www.elastic.co/blog/2015-03-04-this-week-in-elasticsearch

My graduation work on SOA architecture (in Portuguese)

Standard

Hi, my dear readers! On this post, I thought of doing something a little different: in 2010, I concluded a graduate course on SOA architecture and wrote this work about SOA security. So, following my philosophy of sharing knowledge, here it is; please enjoy the reading, and if you have any questions, please ask. Oh, and sorry for it being in Portuguese, but it is my native language, hehehe.

Artigo_TCC_Seguranca_SOA

ELK: using a centralized logging architecture – final part

Standard

Welcome, dear reader! This is the last part of our series about the ELK stack, which provides a centralized architecture for the gathering and analysis of application logs. On this last part, we will see how we can construct a rich front-end for our logging solution, using Kibana.

Kibana

Also developed by Elasticsearch, Kibana provides a way to explore the data and/or build rich dashboards on top of the indexer. With a simple configuration, consisting of pointing Kibana to an ElasticSearch index – it also allows pointing to multiple indexes at once, using a wildcard configuration – it allows us to quickly set up a front-end for our stack.

So, without further delay, let’s begin our last step on the setup of our solution.

Hands-on

Installation

Beginning our lab, let’s download Kibana from the site. To do this, let’s open a terminal and type the following command:

curl https://download.elasticsearch.org/kibana/kibana/kibana-4.0.0-linux-x64.tar.gz | tar -zx

PS: this lab assumes the reader is using a 64-bit Linux OS (I am using Ubuntu 14.10). If the reader is using another type of OS, the download page for Kibana is the following:

http://www.elasticsearch.org/overview/kibana/installation/

After running the command, we can see a folder called "kibana-4.0.0-linux-x64". That's all we need to do to install Kibana for our lab. Before we start it, however, we need to start both Logstash and ElasticSearch, to bring our whole stack live. If the reader doesn't have the stack set up, please refer to part 1 and part 2.

One configuration the reader should take note of is the elasticsearch_url property, inside the "kibana.yml" file, in the config folder. Since we didn't change the default port of our elasticsearch cluster and we are running all the tools on the same machine, there is no need to change anything, but in a real-world scenario you may need to change this property to point to your elasticsearch cluster. The reader can make other configurations on this file as well, such as configuring a SSL connection between Kibana and Elasticsearch or configuring the name of the index Kibana will use to store its metadata on elasticsearch.
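
For illustration, these are the kinds of entries involved. A hedged sketch of a kibana.yml fragment, assuming a hypothetical remote cluster address:

# port where Kibana listens (5601 is the default)
port: 5601
# the ElasticSearch instance Kibana will query (hypothetical remote host)
elasticsearch_url: "http://my-es-host:9200"
# the index where Kibana stores its own metadata
kibana_index: ".kibana"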

After we start the rest of the stack, it is time to run Kibana. To run it, we navigate to the bin folder of Kibana's installation folder and type:

 ./kibana

After some seconds, we can see that Kibana has successfully booted:

Set up

Now, let’s start using Kibana. Let’s open a web browser and enter:

http://localhost:5601/

On our first run, we will be greeted with the following screen:

The reason we are seeing this screen is that we don't have any index configured. It is on this screen that we can, for example, point to multiple indexes. If we think, for example, about the default naming pattern of logstash's plugin, we can see that, for each new date, logstash will demand the creation of a new index following the pattern "logstash-%{+YYYY.MM.dd}". So, if we want Kibana to use all the indexes created this way as a single dataset, we simply fill the field with the value "logstash-*".

In our case, we changed the default naming of the index to demonstrate the possible configurations of the tool, so we fill the field with "log4jlogs". One important thing to note from logstash's plugin documentation is that the idea behind the index pattern is to make operations such as deleting old data easier, so in a real-world scenario it would be wiser to leave the default pattern enabled.
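
To see why: with one index per day, purging old data becomes a cheap drop of an entire index instead of an expensive deletion of individual documents. A hedged example, with a hypothetical index name:

curl -XDELETE 'localhost:9200/logstash-2015.01.24'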

The other field we need to set is the Time-field name. This field is used by Kibana as a primary filter, allowing it to filter the data in the dashboards and data explorations from a timeline perspective. If we look at the document type storing log information on elasticsearch, we can see it has a field called @timestamp. This field perfectly fits our time-field need, so we put "@timestamp" in the field.

After we click on the "create" button, we will be sent to a new page, where we can see all the fields from the index we pointed to, proving that our configuration was a success:

Of course, there are other settings we can configure as well, such as the precision of the numeric fields or the date mask used when showing this kind of data on the interface, all available on the "advanced" menu. For our lab needs, however, the default options will suffice, so we leave this page by clicking on the "Discover" menu.

Discovering the data

When we enter the new page, if we didn't send any data to the cluster in the last 15 minutes, we will see a "no data found" message (see below). The reason for this is that, by default, Kibana limits the results to the last 15 minutes. To change this, we can either change the time window by clicking on the value in the upper right corner of the page, or run the java program from the previous post again, so we have fresher data.

After we have some data to explore, we can see that the page changes to a data view, where we can dynamically change field filters, using the left menu, and drill the data down to the field level, as we can see below. The reader should feel free to explore the page before we move on to the next part.

Dashboards

Now that we have everything set, let's begin by creating a simple dashboard. We will create a pie chart of the messages by log level, a table view showing the count of messages by class, and a line chart of the error messages across time.

First, let's begin with the pie chart. We select "Visualize" from the menu, then follow this sequence of options on the next pages:

pie chart >> from a new search

We will enter a page like the one below:

On this page, we will make our log level pie chart. To do this, we leave the metrics area unchanged, with the "count" value as the slice size. Then, we choose "split slices" in the buckets area and select a "terms" aggregation on the field "priority". The screen below shows our pie chart born!

After we make our chart, before we leave this page, we click on the disk icon on the top menu, input a name for our chart – also called a visualization in Kibana's terminology – and save it for later use.

Now it is time for our table view. From the same menu we used to save our pie chart, we click on the leftmost icon, called "new visualization", then select "data table" and "from a new search". We will see a page similar to the pie chart one, with similar aggregation and metric options to configure. To avoid boring the reader with repetitive instructions, from now on I will keep the explanation as graphical as possible, through the screenshots.

So, our data table view is made with the configurations we can see on the left menu:

Analogous to what we did on the previous visualization, let's also save the table view and click on new visualization, selecting line chart this time. The configuration for this chart can be seen below:

That concludes our visualizations. Now, let's conclude our lab by making a dashboard that shows all the visualizations we built on the same page. To do this, first we select the "dashboard" option on the top menu. We will be greeted with the screen below:

This is the dashboard page, where we can create, edit and load dashboards. To add visualizations, we simply click on the plus icon on the top right side and select the visualization we want to add, as we can see below:

After we add the visualizations, it is pretty simple to arrange them: all we have to do is drag and drop the visualizations as we like. My final arrangement of the dashboard is below, but the reader can arrange it the way he finds best:

Finally, to save the dashboard, we simply click on the disk icon and input a name for our dashboard.

Cons

Of course, we can't make a complete overview of a tool without mentioning its noticeable cons. The most visible downside of Kibana, in my opinion, is the lack of native support for authentication and authorization. The reader may notice, from our tour of the interface, that there are no user or permission CRUDs whatsoever. Indeed, Kibana doesn't come with this feature.

One common workaround for this is to configure basic authentication on the web server Kibana runs on – if we were using Kibana 3 on our lab, we would have to manually install a web server and deploy Kibana on it; on Kibana 4, we don't have to do this, because Kibana comes with an embedded node.js server – so the authentication is made by URL patterns. This link to the official node.js documentation provides a path to set basic authentication on the embedded server Kibana comes with. The file that starts the server and needs to be edited is /src/bin/kibana.js.

Very recently, Elasticsearch has released Shield, a plugin that implements a versatile security layer on Elasticsearch and, subsequently, on Kibana. Evidently, it is a much more elegant and complete solution than the previous one, but the reader has to keep in mind that a license must be acquired to use Shield.

Conclusion

And that concludes our tour of the ELK stack. Of course, in a real-world scenario, we have some more steps to take, like the aforementioned authentication configuration, the configuration of purge policies and so on. However, we can see that the core features we need were implemented with very little effort. As we talked about at the beginning of our series, logging information is a very rich source of information that has to be better explored in the business world. Thanks to everyone that follows my blog, until next time.

ELK: using a centralized logging architecture – part 2

Standard

Welcome, dear reader, to another post of our series about the ELK stack for logging. On the last post, we talked about LogStash, a tool that allows us to integrate data from different sources to different destinations, using transformations along the way, in a stream-like form. On this post, we will talk about ElasticSearch, an indexer based on Apache Lucene, which allows us to organize our data and make textual searches on it, in a scalable infrastructure. So, let's begin by understanding how ElasticSearch is organized on the inside.

Indexes, documents and shards

On ElasticSearch, we have the concept of indexes. An index is like a repository, where we can store our data in the format of documents. A document, in ElasticSearch's terminology, consists of a structure for the data to be stored, analysed and classified, following a mapping definition composed of a series of fields. An important thing to note is that a field on ElasticSearch has the same type across the whole index, meaning that we can't have a field "phone" with the type int on one document and the type string on another.

In turn, our documents are stored in shards, which divide the data into segments based on a rule – by default, the segmentation is made by hashing the data, but it can also be manually manipulated – making the searches faster.
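
For illustration, the default hash-based segmentation can be overridden per request with the routing parameter, so that related documents land on the same shard. A hedged sketch, with a hypothetical index and data:

curl -XPUT 'localhost:9200/myindex/order/1?routing=4434554567' -d '{ "orderId" : 1, "quantity" : 2 }'

Searches that supply the same routing value can then be answered by a single shard, instead of querying all of them.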

So, in a nutshell, we can say that the order of organization of ElasticSearch is as follows:

Index >> Document (mappings/type) >> shard

This organization is used by the user on the two basic operations of the cluster: indexing and searching.

One last thing to say about documents is that they can not only be stored as independent units, but can also be mounted on a tree-like hierarchy, with links between them. This is useful in scenarios where we can make use of hierarchical searches, such as product searches based on their categories.

Indexing

Indexing is the action of inputting data from an external source to the cluster. ElasticSearch is a textual indexer, which means it can only analyse text in plain format, although we can use the cluster to store data in base64 format, using a plugin. Later in the post, we will see an example installation of a plugin, which are extensions we can aggregate to expand our cluster's usability.

When we index our data, we define which fields are to be analysed, which analyser to use – if the default ones do not suffice – and which fields we want to store on the cluster, so we can use them in the results of our searches. One important thing to note about the indexing operations is that, despite having CRUD-like operations, the data is not really updated or deleted on the cluster; instead, a new version is generated and the old version is marked as deleted.

This is an important thing to take note of, because if the cluster is not properly configured to make purges – which can be done with a configuration that breaks the shards into segments and periodically merges them, physically deleting the obsolete documents in the process – it will keep indefinitely expanding in size with the "deleted" older versions of our data, making especially the searches become really slow.
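
As an illustration of such a purge, on this version of ElasticSearch a segment merge that physically expunges the deleted documents can be forced through the optimize API. A hedged example against our log index; in normal operation, the automatic background merges usually suffice:

curl -XPOST 'localhost:9200/log4jlogs/_optimize?only_expunge_deletes=true'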

All the operations can be made with a REST API provided by ElasticSearch, that we will see later on this post.

Searching

The other, and probably most important, action on ElasticSearch is the searching of the previously indexed data. Like the indexing action, ElasticSearch also provides a REST API for the searches. The API provides a very rich range of searching possibilities, from basic term searches to more complex ones such as hierarchical searches, searches by synonyms, language detection, etc.

All the searching is based on a score system, where formulas are applied to confront the accuracy of the documents found against the query supplied. This score system can also be customized.

By default, the searching on the cluster occurs in 2 phases:

  • On the first phase, the master node sends the query to all the nodes, and subsequently shards, retrieving just the IDs and scores of the documents. Using a parameter called size, which defines the maximum number of results of a query, the master selects the most meaningful documents, based on the score;
  • On the second phase, the master sends requests to the nodes to retrieve the documents selected on the previous phase. After receiving the documents, the master finally sends the result to the client;

Alongside this search type, there are also other modes, like query_and_fetch. On this mode, the search is made simultaneously on all shards, not only retrieving the IDs and scores but also returning the data itself, limited only by the size parameter, which is applied per shard. In turn, on this mode, the maximum number of results returned will be the size parameter multiplied by the number of shards.
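
The mode is chosen per request, through the search_type parameter. A hedged example against our log index, where each of the 5 shards would return at most 5 hits:

curl -XGET 'localhost:9200/log4jlogs/log4j/_search?search_type=query_and_fetch&size=5&pretty=true'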

One interesting feature of ElasticSearch's configuration options is the ability to make some nodes exclusive to query operations, while others do the storage part – the so-called data nodes. This way, when we query, our query doesn't need to run across the whole cluster to formulate the results, making the searches faster. On the next section we will see a little more about cluster configurations.

Cluster capabilities

When we talk about a cluster, we talk about scalability, but we also talk about availability. On ElasticSearch, we can configure the replication of shards, where the data is replicated by a given factor, so we don't lose our data if a node is lost. The replication is also maintained by the cluster, so if we lose a replica, the cluster itself will distribute a new replica to another node.

Another interesting feature of the cluster is the ability to discover itself. With the default configuration, when we start a node it will use a discovery mode called Zen, which uses unicast and multicast to search for other instances on all the ports of the OS. If it finds another instance, and the name of the cluster is the same – this is another one of the cluster's configuration properties; all of these configurations can be made on the elasticsearch.yml file, in the config folder – it will communicate with the instance and establish itself as a new node of the already running cluster. There are other modes for this feature, including the discovery of nodes on other servers.
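
A hedged sketch of the elasticsearch.yml entries behind the cluster features described above; the values are illustrative, not the lab's actual configuration:

# nodes only join nodes that share the same cluster name
cluster.name: elk-lab
# a node dedicated to answering queries: stores no data and is not a master candidate
node.data: false
node.master: false
# replication factor: one replica of each shard, kept on a different node
index.number_of_replicas: 1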

Logging

The reader could be thinking: “Lol, do I need all of this to run a logging stack?”.

Of course, ElasticSearch is a very robust tool that can be used in other solutions as well. However, for our case of making a centralized logging analysis solution, the core of ElasticSearch's capabilities serves us well; after all, we are talking about the textual analysis of log texts, for use on dashboards, reports, or simply for real-time exploration of the data.

Well, that concludes the conceptual part of our post. Now, let’s move on to the practice.

Hands-on

So, without further delay, let's begin the hands-on. For this, we will use the Java program from our previous lab about LogStash. The code can be found on GitHub, on this link. On this program, we used log4j's org.apache.log4j.net.SocketAppender to send all the logging we make to LogStash. However, at that point we just printed the messages on the console, instead of sending them to ElasticSearch. Before we change this, let's first start our cluster.

To do this, first we need to download the latest version from the site and unzip the tar. Let's open a terminal and type the following command:

curl https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.tar.gz | tar -zx

After running the command, we will find a new folder called "elasticsearch-1.4.4", created on the same folder where we ran the command. For our example, we will create 2 copies of this folder inside a folder we call "elasticsearchcluster", where each copy will represent one node of the cluster. To do this, we run the following commands:

mkdir elasticsearchcluster

sudo cp -avr elasticsearch-1.4.4/ elasticsearchcluster/elasticsearch-1.4.4-node1/

sudo cp -avr elasticsearch-1.4.4/ elasticsearchcluster/elasticsearch-1.4.4-node2/

After we make our cluster structure, we don't need the original folder anymore, so we remove it:

rm -R elasticsearch-1.4.4/

Now, let’s finally start our cluster! To do this, we open a terminal, navigate to the bin folder of our first node (elasticsearch-1.4.4-node1) and type:

./elasticsearch

After some seconds, we can see our first node is on:

For curiosity's sake, we can see the name "Feral" as the node's name on the log. All the names generated by the tool are based on Marvel Comics characters. The IT world sure has some sense of humor, heh?

Now, let's start our second node. On a new terminal window, let's navigate to the folder of our second node (elasticsearch-1.4.4-node2) and type the command "./elasticsearch" again. After some seconds, we can see that the node is also started:

One interesting thing to notice is that our second node, "Ooze", has a mention of communicating with our other node, "Feral". That is the Zen discovery in action, making the 2 nodes talk to each other and form a cluster. If we look again at the terminal of our first node, we can see another evidence of this bidirectional communication, as "Feral", in his role as master node, has added "Ooze" to the cluster:

Now that we have our cluster set up, let's adjust our logstash script to send the messages to the cluster. To do this, let's change the output part of the script to the following:

input {
    log4j {
        port => 1500
        type => "log4j"
        tags => [ "technical", "log" ]
    }
}

output {
    stdout { codec => rubydebug }
    elasticsearch_http {
        host => "localhost"
        port => 9200
        index => "log4jlogs"
    }
}

As we can see, we just included another output – we kept the console output just to check how logstash is receiving the data – including the host and port where our ElasticSearch cluster responds. We also defined the name of the index where we want our logs to be stored. If this parameter is not defined, logstash will order elasticsearch to create an index with the pattern "logstash-%{+YYYY.MM.dd}".

To execute this script, we do like we did on the previous post: we put the new script on a file called "configelasticsearch.conf" on the bin folder of logstash, and run it with the command:

./logstash -f configelasticsearch.conf

PS1: On the GitHub repository, it is possible to find this config file, alongside a file containing all the commands we will send to ElasticSearch from now on.

PS2: For simplicity's sake, we will use the default mappings logstash provides for us when sending messages to the cluster. It is also possible to pass an elasticsearch mapping structure, which consists of a JSON model that logstash will use as a template. We will see the mapping of our log messages later in our lab but, to satisfy the reader's curiosity for now, this is what an elasticsearch mapping structure looks like, for example for a document type "product":

"mappings" : {
    "product" : {
        "properties" : {
            "variation" : { "type" : "integer" },
            "color" : { "type" : "string" },
            "code" : { "type" : "integer" },
            "quantity" : { "type" : "integer" }
        }
    }
}

After some seconds, we can see that LogStash booted, so our configuration was a success. Now, let’s begin sending our logs!

To do this, we run the program from our previous post, running the class com.technology.alexandreesl.LogStashProvider. We can see on the console of logstash, after starting the program, that the messages are going through the stack:

Now that we have our cluster up and running, let’s start to use it. First, let’s see the mappings of the index that ElasticSearch created for us, based on the configuration we made on LogStash. Let’s open a terminal and run the following command:

curl -XGET 'localhost:9200/log4jlogs/_mapping?pretty'

On the command above, we are using ElasticSearch's REST API. The reader will notice that, after the ip and port, the URL contains the name of the index we configured. This pattern of API calls applies to most of the actions, as we can see below:

<ip>:<port>/<index>/<doc type>/<action>?<attributes>

So, after this explanation, let’s see the result from our call:

{
    "log4jlogs" : {
        "mappings" : {
            "log4j" : {
                "properties" : {
                    "@timestamp" : {
                        "type" : "date",
                        "format" : "dateOptionalTime"
                    },
                    "@version" : {
                        "type" : "string"
                    },
                    "class" : {
                        "type" : "string"
                    },
                    "file" : {
                        "type" : "string"
                    },
                    "host" : {
                        "type" : "string"
                    },
                    "logger_name" : {
                        "type" : "string"
                    },
                    "message" : {
                        "type" : "string"
                    },
                    "method" : {
                        "type" : "string"
                    },
                    "path" : {
                        "type" : "string"
                    },
                    "priority" : {
                        "type" : "string"
                    },
                    "stack_trace" : {
                        "type" : "string"
                    },
                    "tags" : {
                        "type" : "string"
                    },
                    "thread" : {
                        "type" : "string"
                    },
                    "type" : {
                        "type" : "string"
                    }
                }
            }
        }
    }
}

As we can see, the index “log4jlogs” was created, alongside the document type “log4j”. Also, a series of fields were created, representing information from the log messages, like the thread that generated the log, the class, the log level and the log message itself.

Now, let’s begin to make some searches.

Let's begin by searching all log messages in which the priority was "INFO". We make this search by running:
curl -XGET 'localhost:9200/log4jlogs/log4j/_search?q=priority:info&pretty=true'
A fragment of the result of the query would be something like the following:

{
    "took" : 12,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "hits" : {
        "total" : 18,
        "max_score" : 1.1823215,
        "hits" : [ {
            "_index" : "log4jlogs",
            "_type" : "log4j",
            "_id" : "AUuxkDTk8qbJts0_16ph",
            "_score" : 1.1823215,
            "_source" : {"message":"STARTING DATA COLLECTION","@version":"1","@timestamp":"2015-02-22T13:53:12.907Z","type":"log4j","tags":["technical","log"],"host":"127.0.0.1:32942","path":"com.technology.alexandreesl.LogStashProvider","priority":"INFO","logger_name":"com.technology.alexandreesl.LogStashProvider","thread":"main","class":"com.technology.alexandreesl.LogStashProvider","file":"LogStashProvider.java:20","method":"main"}
        }

.

.

.

As we can see, the result is a JSON structure with the documents that met our search. The beginning of the result is not the documents themselves, but information about the search itself, such as the number of shards used, the time the search took to execute, etc. This kind of information is useful when we need to tune our searches, like manually defining the shards we wish to use on the search, for example.
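
For illustration, one such tuning knob is the preference parameter, which controls which shard copies answer the request. A hedged example that favors shards local to the node receiving the query:

curl -XGET 'localhost:9200/log4jlogs/log4j/_search?preference=_local&q=priority:info&pretty=true'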

Let's see another example. On our previous search, we received all the fields of the documents in the result, which is not always desired, since we will not always use all the information. To limit the fields we want to receive, we make our query like the following:
curl -XGET 'localhost:9200/log4jlogs/log4j/_search?pretty=true' -d '
{
    "fields" : [ "priority", "message", "class" ],
    "query" : {
        "query_string" : { "query" : "priority:info" }
    }
}'
On the query above, we asked ElasticSearch to return only the priority, message and class fields. A fragment of the result can be seen below:

.

.

.

{
    "_index" : "log4jlogs",
    "_type" : "log4j",
    "_id" : "AUuxkECZ8qbJts0_16pr",
    "_score" : 1.1823215,
    "fields" : {
        "priority" : [ "INFO" ],
        "message" : [ "CLEANING UP!" ],
        "class" : [ "com.technology.alexandreesl.LogStashProvider" ]
    }
}

.

.

.

Now, let's use the term search. On term searches, we use ElasticSearch's textual analysis to find a term inside the text of a field. Let's run the following command:
curl -XGET 'localhost:9200/log4jlogs/log4j/_search?pretty=true' -d '
{
    "fields" : [ "priority", "message", "class" ],
    "query" : {
        "term" : {
            "message" : "up"
        }
    }
}'
If we look at the result, it will contain all the log messages that contain the word "up". A fragment of the result can be seen below:

{
    "_index" : "log4jlogs",
    "_type" : "log4j",
    "_id" : "AUuxkESc8qbJts0_16pv",
    "_score" : 1.1545612,
    "fields" : {
        "priority" : [ "INFO" ],
        "message" : [ "CLEANING UP!" ],
        "class" : [ "com.technology.alexandreesl.LogStashProvider" ]
    }
}

Of course, there are a lot more search options on ElasticSearch, but the examples provided on this post are enough to make a good starting point for the reader. As a final example, we will use the "prefix" search. On this type of search, ElasticSearch will search for terms that start with a given text, on a given field. For example, to search for log messages that have words starting with "clea", part of the word "cleaning", we run the following:
curl -XGET 'localhost:9200/log4jlogs/log4j/_search?pretty=true' -d '
{
    "fields" : [ "priority", "message", "class" ],
    "query" : {
        "prefix" : {
            "message" : "clea"
        }
    }
}'
If we look at the results, we will see that they are the same as the previous search, proving that our search worked correctly.

Kopf

The reader could possibly ask "Is there another way to send my queries, without using the terminal?" or "Is there any graphical tool I can use to monitor the status of my cluster?". As a matter of fact, there is an answer for both of these questions, and the answer is the kopf plugin.

As we said before, plugins are extensions we can install to improve the capabilities of our cluster. In order to install the plugin, first let's stop both nodes of the cluster – press ctrl+c on both terminal windows to stop them – then navigate to each node's root folder and type the following:

bin/plugin -install lmenezes/elasticsearch-kopf

If the plugin was installed correctly, we should see a message like the one below on the console:

.

.

.

-> Installing lmenezes/elasticsearch-kopf...
Trying https://github.com/lmenezes/elasticsearch-kopf/archive/master.zip...
Downloading .........................................DONE
Installed lmenezes/elasticsearch-kopf into....

After installing on both nodes, we can start the nodes again, just as we did before. After the booting of the cluster, let's open a browser and type the following URL:

http://localhost:9200/_plugin/kopf

We will see the following web page of the kopf plugin, showing the status of our cluster, such as the nodes, the indexes, shard information, etc

Now, let's run our last example from the search queries on kopf. First, we select the "rest" option on the top menu. On the next screen, we select "POST" as the http method, include on the URL field the index and document type to narrow the results, and in the textarea below we include our JSON query filters. The print below shows the query made on the interface:

Conclusion

And so we conclude our post about ElasticSearch. A very powerful tool for the indexing and analysis of textual information, the cornerstone of our ELK stack for logging is a tool to be used not only on a logging analysis system, but on other solutions where its features can be useful as well.

So, our stack is almost complete. We can gather our log information, and the information is indexed on our cluster. However, a final piece remains: we need a place where we can have a friendlier interface, one that allows us not only to search the information, but also to make rich presentations of the data, such as dashboards. That's where the last part of our ELK series, and the last tool we will see, Kibana, comes in. Thank you for following me on another post, until next time.

ELK: using a centralized logging architecture – part 1

Standard

Welcome, dear reader, to another post from my blog. On this new series, we will talk about an architecture specially designed to process data from log files coming from applications, with the junction of 3 tools: Logstash, ElasticSearch and Kibana. But after all, do we really need such a structure to process log files?

Stacks of log

On a company's ecosystem, there are lots of systems, like the CRM, ERP, etc. On such environments, it is common for the systems to produce tons of logs, which provide not only a real-time analysis of the technical status of the software, but could also provide some business information too, like a log of a customer's behavior on a shopping cart, for example. To dive into this useful source of information enters the ELK architecture, whose name comes from the initials of the software involved: ElasticSearch, LogStash and Kibana. The picture below shows, in a macro vision, the flow between the tools:

As we can see, there's a clear separation of concerns between the tools, where each one has its own individual part in the processing of the log data:

  • Logstash: Responsible for collecting the data, making transformations like parsing – using regular expressions – adding fields, formatting as structures like JSON, etc., and finally sending the data to various destinations, like an ElasticSearch cluster. Later on this post we will see more detail about this useful tool;
  • ElasticSearch: A RESTful data indexer, ElasticSearch provides a clustered solution to make searches and analysis on a set of data. On the second part of our series, we will see more about this tool;
  • Kibana: A web-based application, responsible for providing a light and easy-to-use dashboard tool. On the third and last part of our series, we will see more of this tool;

So, to begin our road into the ELK stack, let's talk about the tool responsible for integrating our data: LogStash.

LogStash installation

To install, all we need to do is unzip the file we get from LogStash's site and run the binaries on the bin folder. The only pre-requisite for the tool is to have Java installed and configured in the environment. If the reader wants to follow my instructions with the same system as me, I am using Ubuntu 14.10 with Java 8, which can be downloaded from Oracle's site here.

With Java installed and configured, we begin by downloading and unzipping the file. To do this, we open a terminal and input:

curl https://download.elasticsearch.org/logstash/logstash/logstash-1.4.2.tar.gz | tar xz

After the download, we will have LogStash in a folder on the same place where we ran our 'curl' command. On the LogStash terminology, we have 4 types of configurations we can make for a stream, named:

  • input: On this configuration, we put the sources of our streams, which can range from polling files on a file system to more complex inputs such as an Amazon SQS queue and even Twitter;
  • codec: On this configuration we make transformations on the data, like turning it into a JSON structure, or grouping together lines that are semantically related, like, for example, a Java stack trace;
  • filter: On this configuration we make operations such as parsing data from/to different formats, removal of special characters and checksums for deduplication;
  • output: On this configuration we define the destinations for the processed data, such as an ElasticSearch cluster, AWS SQS, Nagios, etc;

Now that we have established LogStash's configuration structure, let's begin with our first execution. In LogStash we have two ways to configure our execution: one by providing the settings on the start command itself, and the other by providing a configuration file to the command. The simplest way to boot a LogStash stream is by setting the console itself as both input and output; to make this execution, we open a terminal, navigate to the bin folder of our LogStash installation and execute the following command:

./logstash -e 'input { stdin { } } output { stdout {} }'

As we can see after running the command, we booted LogStash setting the console as the input and the output, without any transformation or filtering. To test it, we simply input anything on the console, seeing that our message is displayed back by the tool:

Now that we got the installation out of the way, let's begin with the actual lab. Unfortunately – or not, depending on the point of view – it would take us a lot of time to show all the features of the tool, so to make a short but illustrative example, we will start 2 logstash streams, which will do the following:

1st stream:

  • The input will be made by a java program, which will produce a log file with log4j, representing technical information;
  • For now, we will just print logstash's events on the console, using the rubydebug codec. On the next part of the series, we will return to this configuration and change the output to send the events to elasticsearch;

2nd stream:

  • The input will be made by the same java program, which will produce a positional file, representing business information of customers and orders;
  • We will then use the grok filter to parse the data of the positional file into separated fields, producing the data for the output step;
  • Finally, we use the mongodb output, to save our data – filtering to only persist the orders – on a Mongodb collection;

With the streams defined, we can begin our coding. First, let's create the java program which will generate the inputs for the streams. The code for the program can be seen below:

package com.technology.alexandreesl;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.log4j.Logger;

public class LogStashProvider {

    private static Logger logger = Logger.getLogger(LogStashProvider.class);

    public static void main(String[] args) throws IOException {

        try {

            logger.info("STARTING DATA COLLECTION");

            List<String> data = new ArrayList<String>();

            Customer customer = new Customer();
            customer.setName("Alexandre");
            customer.setAge(32);
            customer.setSex('M');
            customer.setIdentification("4434554567");

            List<Order> orders = new ArrayList<Order>();

            for (int counter = 1; counter < 10; counter++) {

                Order order = new Order();

                order.setOrderId(counter);
                order.setProductId(counter);
                order.setCustomerId(customer.getIdentification());
                order.setQuantity(counter);

                orders.add(order);

            }

            logger.info("FETCHING RESULTS INTO DESTINATION");

            PrintWriter file = new PrintWriter(new FileWriter(
                    "/home/alexandreesl/logstashdataexample/data"
                            + new Date().getTime() + ".txt"));

            file.println("1" + customer.getName() + customer.getSex()
                    + customer.getAge() + customer.getIdentification());

            for (Order order : orders) {
                file.println("2" + order.getOrderId() + order.getCustomerId()
                        + order.getProductId() + order.getQuantity());
            }

            logger.info("CLEANING UP!");

            file.flush();
            file.close();

            // forcing an error to simulate stack traces
            PrintWriter fileError = new PrintWriter(new FileWriter(
                    "/etc/nopermission.txt"));

        } catch (Exception e) {

            logger.error("ERROR!", e);
        }

    }

}

As we can see, it is a very simple class that uses log4j to generate some logging and writes a positional file representing data from customers and orders. At the end, it tries to create a file on a folder where we don't have permission to write by default, "forcing" an error to produce a stack trace. The complete code for the program can be found here. Now that we have made our data generator, let's begin the configuration for LogStash. The configuration for our first example is the following:

input {
  log4j {
    port => 1500
    type => "log4j"
    tags => [ "technical", "log" ]
  }
}

output {
  stdout { codec => rubydebug }
}
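
One detail that is easy to miss: the log4j input doesn't read the log file itself; it listens on a TCP port (1500 in our configuration) for serialized logging events, so the Java program must ship its logs through log4j's SocketAppender. A minimal log4j.properties sketch for this setup would look like the lines below (the host and reconnection delay are illustrative values for a local lab):

# send all logging events to LogStash's log4j input over TCP
log4j.rootLogger=INFO, socket
log4j.appender.socket=org.apache.log4j.net.SocketAppender
# must match the host/port where the log4j input is listening
log4j.appender.socket.RemoteHost=localhost
log4j.appender.socket.Port=1500
log4j.appender.socket.ReconnectionDelay=10000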

To run this configuration, let's create a file called "config1.conf" with the content above and save it on the "bin" folder of LogStash's installation. Finally, we run it with the following command:

./logstash -f config1.conf

This will start the LogStash process with the configuration we provided. To test it, simply run the Java program we coded earlier and we will see a sequence of message events in LogStash's console window, generated by the rubydebug codec, like the one below, for example:

{
    "message" => "ERROR!",
    "@version" => "1",
    "@timestamp" => "2015-01-24T19:08:10.872Z",
    "type" => "log4j",
    "tags" => [
        [0] "technical",
        [1] "log"
    ],
    "host" => "127.0.0.1:34412",
    "path" => "com.technology.alexandreesl.LogStashProvider",
    "priority" => "ERROR",
    "logger_name" => "com.technology.alexandreesl.LogStashProvider",
    "thread" => "main",
    "class" => "com.technology.alexandreesl.LogStashProvider",
    "file" => "LogStashProvider.java:70",
    "method" => "main",
    "stack_trace" => "java.io.FileNotFoundException: /etc/nopermission.txt (Permission denied)\n\tat java.io.FileOutputStream.open(Native Method)\n\tat java.io.FileOutputStream.<init>(FileOutputStream.java:213)\n\tat java.io.FileOutputStream.<init>(FileOutputStream.java:101)\n\tat java.io.FileWriter.<init>(FileWriter.java:63)\n\tat com.technology.alexandreesl.LogStashProvider.main(LogStashProvider.java:66)"
}

Now, let's move on to the next stream. First, we create another file, called "config2.conf", on the same folder where we created the first one. On this new file, we create the following configuration:

input {
  file {
    path => "/home/alexandreesl/logstashdataexample/data*.txt"
    start_position => "beginning"
  }
}

filter {
  grok {
    match => [ "message", "(?<file_type>.{1})(?<name>.{9})(?<sex>.{1})(?<age>.{2})(?<identification>.{10})",
               "message", "(?<file_type>.{1})(?<order_id>.{1})(?<customer_id>.{10})(?<product_id>.{1})(?<quantity>.{1})" ]
  }
}

output {
  stdout { codec => rubydebug }
  if [file_type] == "2" {
    mongodb {
      collection => "testData"
      database => "mydb"
      uri => "mongodb://localhost"
    }
  }
}

With the configuration created, we can run our second example. Before we do that, however, let's dive a little into the configuration we just made. First, we used the file input, which makes LogStash keep monitoring the folder and process the files as they appear there.
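
One practical note for the lab: the file input keeps a record of how far it has read each file (a "sincedb" file), so a restarted LogStash will not re-process data it has already consumed. When testing repeatedly, a variation like the sketch below can help (sincedb_path is an option of the file input; pointing it at /dev/null makes LogStash forget its reading position between runs):

input {
  file {
    path => "/home/alexandreesl/logstashdataexample/data*.txt"
    start_position => "beginning"
    # discard the read-position bookkeeping, so every run re-reads the files
    sincedb_path => "/dev/null"
  }
}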

Next, we create a filter with the grok plugin. This filter uses combinations of regular expressions to parse the data from the input. The plugin comes with more than 100 pre-made patterns that help development. Another useful tool when working with grok is a site where we can test our expressions before using them. Both links are available on the links section at the end of this post.
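
To make the positional layout concrete, here is the kind of line our Java generator writes and how the fixed-width fields of the grok patterns slice it (values taken from the program above):

1AlexandreM324434554567   <- type "1": name (9), sex (1), age (2), identification (10)
21443455456711            <- type "2": order_id (1), customer_id (10), product_id (1), quantity (1)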

Finally, we use the mongodb output, where we point LogStash to a database and collection of a MongoDB instance, into which the data from the file will be inserted as MongoDB documents. We also use the rubydebug codec again, so we can follow the processing of the files on the console. The reader will note that we used an "if" statement before the configuration of the mongodb output. After we parse the data with grok, we can use the newly created fields to apply some logic to our stream. In this case, we filter so that only data with the type "2" is persisted, meaning that only the orders' data goes to the collection on MongoDB, instead of all the data. We could have expanded more on this example, like saving the data into two different collections, but for the purpose of giving the reader a general view of LogStash's structure, the present logic will suffice.
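
For the curious, a sketch of that expanded version could look like the output section below (the "customers" and "orders" collection names are assumptions, chosen just for illustration):

output {
  stdout { codec => rubydebug }
  if [file_type] == "1" {
    mongodb {
      # customer records go to their own collection
      collection => "customers"
      database => "mydb"
      uri => "mongodb://localhost"
    }
  } else {
    mongodb {
      # order records go to a separate collection
      collection => "orders"
      database => "mydb"
      uri => "mongodb://localhost"
    }
  }
}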

PS: This example assumes the reader has MongoDB installed and running on the default port of his environment, with a db "mydb" and a collection "testData" created. If the reader doesn't have MongoDB, the instructions can be found on the official documentation.

Finally, with everything installed and configured, we run the second stream with the following command:

./logstash -f config2.conf

After LogStash starts, if we run our program to generate a file, we will see LogStash processing the data on the console.

And finally, if we query the collection on MongoDB, we can see that the data was persisted:
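
A quick way to check is through the mongo shell, with a session like the sketch below (the document shown is illustrative and abridged; the actual events also carry fields such as @version, host and path):

$ mongo
> use mydb
> db.testData.find()
{ "_id" : ObjectId("..."), "message" : "21443455456711", "file_type" : "2", "order_id" : "1", "customer_id" : "4434554567", "product_id" : "1", "quantity" : "1" }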

Conclusion

And so we conclude the first part of our series. Even with a simple usage, LogStash proved to be a useful tool for integrating information from different formats and sources, especially log-related ones. In the next part of our series, we will dive into the next tool of our stack: ElasticSearch. Until next time!
