datasift – Technology explained

Hi, dear readers! Welcome to another post of my blog. On this post, we will talk about social media and sentimental analysis, seeing how she is revolutionizing the way companies are targeting their clients.

Social media

Undoubtedly, there is no doubt about the importance of social media on the modern life. As a example of the power that social media has today, we can say about the recent protests on Brazil against the president and her government’s corruption, which leaded thousands of people to the streets across the country, all organized by Facebook!

Today, social media has a very strong power of influence in the masses, reflecting the tastes and opinions of thousands of people. All of this gigantic amount of information is a gold mine to the companies, just waiting to be tapped.

Just imagine that you are the director of a area responsible for developing new products for a gaming company. Now, imagine if you could use the social medias to analyse the reactions of the players to the trailers and news your company releases on the internet. That information could be crucial to discover, for example, that your brand new shinning multiplayer mode is angering your audience, because of a new weapon your development team thought it would be awesome, but to the players feel extreme unbalanced.

Now imagine that you are responsible for the public relations of a oil company. Imagine that a ecological NGO start launching a “attack” at your company’s image on the social networks, saying that your refinery’s locations are bad for the ecosystems, despise your company’s efforts on reforestation. Without a proper tool to quickly analyse the data flowing on the social networks, it may be too late to revert the damage on the company’s image, with hundreds of people “flagellating” your company on the Internet. This may not seen important at first, until you realize that some companies you provide your fuel start buying less from you, because they are worried with their own image on the market, by associating themselves with you. More and more, the companies are realizing the importance of how positive is the image of their brands on the eyes of their customers, a term also known as “brand health”.

This “brand’s health” metric is very important on the marketing area, already influencing several companies to enter on the social media monitoring field, providing partial or even complete solutions to a brand’s health monitoring tool, many times on a SAAS model. Examples of companies that provide this kind of service are Datasift, Mention and Gnip.

Sentimental analysis

A very important metric on the brand’s health monitoring is the sentimental analysis. In a simple statement, sentimental analysis is exactly what the name says: is the analysis of the “sentiment” the author of a given text is feeling about the subject of a given text he wrote about, been classified as negative, neutral or positive. Of course, it is very clear how important this metric is for most of the analysis, since is the key to understand the quality of your brand’s image on the perspective of your public.

But how does this work? How is it possible to analyse someone’s sentiments? This is a field still on progress, but there’s already some techniques been applied for this task, such as keywords scoring (presence of words such as curses, for example), polarities scores to balance the percentage a sentence is positive, neutral and negative in order to analyse the overall sentiment of the text and so on. At the end of this post, there is a article from Columbia’s University about sentimental analysis of Twitter posts, that the reader can use as a starting point to deepen on the details of the techniques involved on the subject.

Big Data

As the reader may have already guessed, we are talking about a big volume of data, that grows very fast, is unstructured, has mixed veracity – since we can have both valuable and non-valuable information among our dataset – and has a enormous potential of value for our analysis, since are the opinions and tastes – or the “soul” – of our costumers. As we have see previously on my first post about Big Data, this data qualifies on the famous “Vs” that are always talked about when we heard about Big Data. Indeed, generally speaking, most of the tools used on this kind of solution can be classified as Big Data’s solutions, since they are processing amounts of data with this characteristics, heavily using distributed systems concepts. Just remember: It is not always that because it uses social media, that it is Big Data!

A practical exercise

Now, let’s see a simple practical exercise, just to see a little of the things we talked about working on practice. On this hands-on, we will make a simple Python script. This Python script will connect to Twitter, to the public feed to be more precise, filtering everything with the keyword “coca-cola”. Then, it will make a sentimental analysis on all the tweets provided by the feed, using a library called TextBlob that provides us with Natural Language Processing (NLP) capabilities and finally it will print all the results on the console. So, without further delay, let’s begin!

Installation

On this lab, we will use Python 3. I am using Ubuntu 15.04, so Python is already installed by default. If the reader is using a different OS, you can install Python 3 by following this link.

We will also use virtualenv. Virtualenv is a tool used to create independent Python’s environments on our development machine. This is useful to isolate the dependencies and versions of libraries between Python applications, eliminating the problems of installing the libraries on the global Python’s installation of the OS. To install Virtualenv, please refer to this link.

Set up

To start our set up, first, let’s create a virtual environment. To do this, we open a terminal and type:

virtualenv –python=python3.4 twitterhandson

This will create a folder called twitterhandson, where we can see that a complete Python environment was created, including executables such as pip and python itself. To use Virtualenv, enter the twitterhandson folder and input:

source bin/activate

After entering the command, we can see that our command prompt got a prefix with the name of our environment, as we can see on the screen bellow:

That’s all we need to do in order to use Virtualenv. If we want to close, just type exit on the console.

Using a IDE

On this lab, I am using Pycharm, a powerfull Python’s IDE developed by Jetbrains. The IDE is not required for our lab, since any text editor will suffice, but I recommend the reader to experiment the IDE, I am sure you will like it!

Downloading module dependencies

On Python, we have modules. A module is a python file where we can have definitions of variables, functions and classes, that we can reuse later on more complex scripts. For our lab, we will use Pip to download the dependencies. Pip is a tool recommended by Python used to manage dependencies, something like what Maven do for us in the Java World. To use it, first, we create on our virtualenv root folder a file called requirements.txt and put the following inside:

httplib2
simplejson
six
tweepy
textblob

The dependencies above are necessary to use the NLP library and use the Twitter API. To make Pip download the dependencies, first we activate the virtual environment we created previously and then, on the same folder of our txt file, we input:

pip3 install -r requirements.txt

After running the command above, the modules should be downloaded and enabled on our virtualenv environment.

Using sentimental analysis on other languages

On this post, we are using TextBlob, which sadly has only english as supported language for sentimental analysis – he can translate the text to other languages using Google translator, but of course is not the same as a analyser specially designed to process the language. If the reader wants a alternative to process sentimental analysis on other languages as well, such as Portuguese for example, is there a REST API from a company called BIText – which provides the sentimental analysis solution for Salesforce’s Marketing products – that I have tested and provides very good results. The following link points for the company’s API page:

BIText

Creating the Access Token

Before we start our code, there is one last thing we need to do: We need to create a access token, in order to authenticate our calls on Twitter to obtain the data from the public feed. In order to do this, first, we need to create a Twitter account, on Twitter.com. With a account created, we create a access token, following this tutorial from Twitter.

Developing the script

Well, now that all the preparations were made, let’s finally code! First, we will create a file called config.py. On that file, we will create all the constants we will use on our script:

accesstoken='<access token>’
accesstokensecret='<access token secret>’
consumerkey='<consumer key>’
consumerkeysecret='<consumer key secret>’

And finally, we will create a file called twitter.py, where we will code our Python script, as the following:

from config import *
from textblob import TextBlob
from nltk import downloader
import tweepy


class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print('A TWEET!')
        print(status.text)
        print('AND THE SENTIMENT PER SENTENCE IS:')
        blob = TextBlob(status.text)
        for sentence in blob.sentences:
            print(sentence.sentiment.polarity)


auth = tweepy.OAuthHandler(consumerkey, consumerkeysecret)
auth.set_access_token(accesstoken, accesstokensecret)

downloader.download('punkt')

myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth=auth, listener=myStreamListener)

stream = tweepy.Stream(auth, myStreamListener)
stream.filter(track=['coca cola'], languages=['en'])

On the first time we run the example, the reader may notice that the script will download some files. That is because we have to download the resources for the NLTK library, a dependency from TextBlob, which is the real NLP processor that TextBlob uses under the hood. Beginning our explanation of the script, we can see that we are creating a OAuth handler, which will be responsible for managing our authentication with Twitter. Then, we instantiate a listener, which we defined at the beginning of our script and pass him as one of the args for the creation of our stream and then we start the stream, filtering to return just tweets with the words “coca cola” and on the english language. According to Twitter documentation, it is advised to process the tweets asynchronously, because if we process them synchronous, we can lose a tweet while we are still processing the predecessor. That is why tweepy requires us to implement a listener, so he can collect the tweets for us and order them to be processed on our listener implementation.

On our listener, we simply print the tweet, use the TextBlob library to make the sentimental analysis and finally we print the results, which are calculated sentence by sentence. We can see the results from a run bellow:

A TWEET!
RT @GeorgeLudwigBiz: Coca-Cola sees a new opportunity in bottling billion-dollar #startups http://t.co/nZXpFRwQOe
AND THE SENTIMENT PER SENTENCE IS:
0.13636363636363635
A TWEET!
RT @momosdonuts: I told y’all I change things up often! Delicious, fluffy, powdered and caramel drizzled coca-cola cake. #momosdonuts http:…
AND THE SENTIMENT PER SENTENCE IS:
0.0
0.4
0.0
A TWEET!
vanilla coca-cola master race

tho i have yet to find a place where they sell imports of the british version
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @larrywhoran: CLOUDS WAS USED IN THE COCA COLA COMMERCIAL AND NO CONTROL BEING PLAYED IN RADIOS AND THEYRE NOT EVEN SINGLES YAS SLAY
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @bromleyfthood: so sei os covers e coca cola dsanvn I vote for @OTYOfficial for the @RedCarpetBiz Rising Star Award 2015 #RCBAwards
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @LiPSMACKER_UK: Today, we’re totally craving Coca-Cola! http://t.co/V140SADKok
AND THE SENTIMENT PER SENTENCE IS:
0.0
0.0
A TWEET!
RT @woodstammie8: Early production of Coca Cola contained trace amounts of coca leaves, which, when processed, render cocaine.
AND THE SENTIMENT PER SENTENCE IS:
0.1
A TWEET!
RT @designtaxi: Coca-Cola creates braille cans for the blind http://t.co/cCSvJLv7O0 http://t.co/UA0PGoheO2
AND THE SENTIMENT PER SENTENCE IS:
-0.5
A TWEET!
Instrus, weed, Coca-Cola y snacks.
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @larrywhoran: CLOUDS WAS USED IN THE COCA COLA COMMERCIAL AND NO CONTROL BEING PLAYED IN RADIOS AND THEYRE NOT EVEN SINGLES YAS SLAY
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
1 Korean Coca-Cola Bottle in GREAT CONDITION Coke Bottle Coke Coca Cola http://t.co/IHhxoJ7aMz
AND THE SENTIMENT PER SENTENCE IS:
0.8
A TWEET!
#Coca-Cola#I#♥#YOU#
Fanny#day#Good… https://t.co/5PU7L4QchC
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
Entry List for Coca-Cola 600 #NASCAR Sprint Cup Series race at Charlotte Motor Speedway is posted, 48 drivers entered http://t.co/UYXPdOP9te
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
@diannaeanderson + walk, get some Coca-Cola, and spend some time reading. Lord knows I need to de-stress.
AND THE SENTIMENT PER SENTENCE IS:
0.0
0.0
A TWEET!
Apply now to work for Coca-Cola #jobs http://t.co/ReFQUIuNeK http://t.co/KVTvyr1e6T
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @jayski: Entry List for Coca-Cola 600 #NASCAR Sprint Cup Series race at Charlotte Motor Speedway is posted, 48 drivers entered http://t.…
AND THE SENTIMENT PER SENTENCE IS:
0.0
A TWEET!
RT @SeyiLawComedy: When you enter a fast food restaurant and see their bottle of Coca-Cola drink (35cl) is N800; You just exit like » http:…
AND THE SENTIMENT PER SENTENCE IS:
0.2
A TWEET!
Entry List for Coca-Cola 600 #NASCAR Sprint Cup Series race at @CLTMotorSpdwy is posted, 48 drivers entered http://t.co/c2wJAUzIeQ …
AND THE SENTIMENT PER SENTENCE IS:
0.0

The reader may notice that the sentimental analysis of the tweets could be more or less inaccurate to what the sentiment of the author really was, using our “human analysis”. Indeed, as we have talked before, this field is still improving, so it will take some more time for us to rely 100% on this kind of analysis

Conclusion

As we can see, it was pretty simple to construct a program that connects to Twitter, runs a sentimental analysis and print the results. Despise some current issues with the accuracy of the sentimental analysis, as we talked about previously, this kind of analysis are already a really powerfull tool to explore, that companies can use to improve their perception of how the world around them realize their existences. Thank you for following me on another post, until next time.