What is Big Data?

Big Data is a concept that has grown enormously in recent years, although managing data has been a concern of the IT world since its inception. Big Data is a term for the software techniques used to manage huge volumes of both structured and unstructured data that are difficult to process using traditional database concepts. The word 'Data' here refers to datasets so large that they are hard to store or manage on a single computer. In today's world, Big Data covers the technology an organization needs to manage and protect its ever-increasing volumes of data and storage facilities.

The term 'Big Data' has been in use since the 1990s, and John Mashey is widely credited with popularizing the concept. To make things a little easier for newcomers, let's take weather forecasting as a case in point. Anyone monitoring worldwide weather must collect a massive set of unstructured data from all over the globe. This volume of weather data reflects the characteristics of Big Data: we need to process a huge sea of data in real time, and we need to do it in the most efficient way. That is where data analysis comes in.

Let’s go through some of the points which make Big Data so important and useful in these types of situations:

Most of the data collected is unstructured and requires specialized processing methods

Storing all this data within the traditional relational databases is a tough and, at times, impossible affair

Next Step - Big Data Analytics

So, we have our data collected in a raw, totally unstructured format. Here comes our next big step: to analyze the data and give it a proper business shape. The process of collecting, organizing, and analyzing a large set of data to extract useful information or describe patterns in the data is referred to as Big Data Analytics. The first major step is 'collecting' the data, a task commonly known as 'Data Mining'. Going deeper, Data Mining is most commonly used in science and information technology, and also by marketers who try to extract useful information from consumer data. In this article, we will learn some techniques for gathering Twitter data, which will prove useful to us in multiple ways.

Twitter Data - Why is it Effective?

Before we start mining, let's understand why Twitter data stands out in comparison with other social platforms. Primarily, Twitter data offers businesses the ability to identify patterns in customer sentiment and gauge their marketing effectiveness. With approximately 500 million tweets per day, there is a lot of data to analyze and play with: thoughts, trending worldwide news, links and pictures, promotion of events, launches of new products, advertisements, and much more. For every customer, Twitter data has something unique to share. With the rapid growth of social media in recent years, researchers are turning to Twitter data mining to analyze and understand the wants and needs of individuals, since it is a hugely sought-after medium for venting thoughts, feelings, and opinions. Even for clients with only a short history of Twitter data - around six months - here are some of the research topics that can be explored via Twitter mining:

Lists of influential users

Product feedback and surveys

The number of tweets related to your organization

Research into new ideas

Tweets filtered by location or language

Engagement rates

Trends among the most-retweeted content

Monitoring the tweets and reactions of famous people

The most effective keywords used by users

So, now we have a fair idea of why Twitter is considered one of the best social platforms for data mining. Let's move forward and learn how the Twitter API can deliver these results.

Overview of the Tools

In this article, we will use Python 2.7 as the programming language and PyCharm Community Edition as the IDE. To connect to the Twitter API, we suggest using Tweepy, an open-source Python library hosted on GitHub. It enables Python to communicate with the Twitter platform and use its API.

Once the project has been created, you can fetch your Consumer Key and Consumer Secret under the 'Keys and Tokens' tab

Create an 'Access Token' by scrolling down the page, and keep the tab open

Tweepy Installation

Tweepy is an easy-to-use Python library for working with the Twitter API. To install Tweepy, either use pip by typing 'pip install tweepy' in your terminal, or install from GitHub by running 'git clone https://github.com/tweepy/tweepy.git', then 'cd tweepy', then 'python setup.py install'.

Authentication

In order to build any application, the first step we need to take is to authenticate ourselves with our developer credentials. Once this has been done, we can use Tweepy to create an API object and access the other functions through it.
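As a hedged sketch of that flow (the credential strings are placeholders for the values from your 'Keys and Tokens' tab, and credentials_present is a small helper added here for illustration), authentication with Tweepy might look like:

```python
def make_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Authenticate with our developer credentials and return a Tweepy API object."""
    import tweepy  # imported lazily so the sanity-check helper below works even without Tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # wait_on_rate_limit tells Tweepy to sleep instead of failing when Twitter throttles us
    return tweepy.API(auth, wait_on_rate_limit=True)


def credentials_present(*tokens):
    """Cheap sanity check before going to the network: every token must be a non-empty string."""
    return all(isinstance(t, str) and t.strip() for t in tokens)


# Substitute your own values from the developer portal:
# api = make_api("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
```

The API object returned by make_api is what every later example operates on.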

API Object

Since we require the API object in every application, the above task is quite important. Now, let’s go through some examples so we can understand how the whole process works out.

Timeline - We will extract the 10 most recent tweets from our Twitter feed. To do this, we will use the API object's home_timeline function. To print the output, we will store the result in a variable and loop over it.
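A minimal sketch of that loop, assuming api is the authenticated Tweepy API object from the previous step (format_timeline is a hypothetical helper added here for readability):

```python
from types import SimpleNamespace  # used only for the offline illustration at the bottom


def format_timeline(tweets):
    """Turn Status-like objects into '<screen_name>: <text>' lines for printing."""
    return ["{0}: {1}".format(t.user.screen_name, t.text) for t in tweets]


def show_home_timeline(api, count=10):
    """Fetch the latest tweets from our own feed and print them one per line."""
    public_tweets = api.home_timeline(count=count)  # returns a list of Status objects
    for line in format_timeline(public_tweets):
        print(line)


# Offline illustration of the formatting, with a stand-in for a real Status object:
fake = SimpleNamespace(text="hello world", user=SimpleNamespace(screen_name="someone"))
print(format_timeline([fake])[0])  # prints "someone: hello world"
```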

We will get multiple tweets, each followed by its URL. Each URL redirects to the tweet itself; if we click on the first one, for example, it opens the original tweet.

If we want to print the tweet's text, it's advisable to run this through the terminal or PyCharm, otherwise it can result in formatting issues in relation to the printed text. Below are some of the other attributes:

print tweet.created_at - To find the date of a particular tweet

print tweet.user.screen_name and print tweet.user.location - To fetch the name and location details of the tweeter. These attributes are also useful if your application is dependent on Spatial Data, which is mostly used in geographical information systems and other related services.
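To make the attribute access concrete, here is an offline illustration using a stand-in object; a real Status returned by Tweepy exposes the same created_at, user.screen_name, and user.location fields (the describe helper and its sample values are ours, added for illustration):

```python
import datetime
from types import SimpleNamespace

# Stand-in for a tweepy Status object; the attribute names mirror the real ones.
tweet = SimpleNamespace(
    created_at=datetime.datetime(2019, 5, 1, 12, 30),
    user=SimpleNamespace(screen_name="example_user", location="London"),
)


def describe(tweet):
    """Summarize who tweeted, from where, and when - useful for spatial-data applications."""
    return "{0} tweeted from {1} at {2}".format(
        tweet.user.screen_name, tweet.user.location, tweet.created_at)


print(describe(tweet))  # prints "example_user tweeted from London at 2019-05-01 12:30:00"
```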

Tracking tweets from a particular user - We will pull 20 tweets from the account @NyTimes. For this, we need to use the user_timeline function, which accepts parameters such as id (the user whose timeline we want) and count (the number of tweets to return).

So, we have to use the id and count parameters in our code.
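A sketch of that call, assuming api is the authenticated Tweepy object (the texts helper is hypothetical, added here so the result is easy to print):

```python
def fetch_user_tweets(api, screen_name="NyTimes", count=20):
    """Pull the latest `count` tweets from one account's timeline."""
    # user_timeline accepts either a numeric user id or a screen name via the `id` parameter
    return api.user_timeline(id=screen_name, count=count)


def texts(tweets):
    """Keep just the text body of each Status object."""
    return [t.text for t in tweets]


# Usage, once `api` has been authenticated:
# for text in texts(fetch_user_tweets(api)):
#     print(text)
```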

Searching tweets with keywords - In this example, we will pull tweets containing a specific keyword. Here, we require the search function, which takes parameters such as q (the search query), lang (the language of the tweets), and count.

So, we set the language to English. We can also print the name of the user who wrote each tweet inside our loop.
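Under those assumptions (api.search with q and lang was Tweepy's pre-v4 call for the standard search endpoint; the query string and the print helper are ours, for illustration), the search might be sketched as:

```python
def search_tweets(api, query, language="en", count=15):
    """Search recent tweets matching `query`, restricted to one language."""
    return api.search(q=query, lang=language, count=count)


def print_authors_and_text(tweets):
    """Print who wrote each matching tweet alongside its text."""
    for t in tweets:
        print("{0}: {1}".format(t.user.screen_name, t.text))


# Usage, once `api` has been authenticated:
# print_authors_and_text(search_tweets(api, "bigdata"))
```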

Although this article provides just a preview of using the Twitter API for better data mining and achieving business goals, we need a fair knowledge of server-side scripting languages such as Python or PHP to make requests to the Twitter API. As the results are in JSON format, they are quite easy to understand and use. Each of the APIs allows developers to build and extend their applications in a better way. Since the Twitter APIs are evolving constantly, it is getting easier for developers to explore new options and reach new horizons in data mining.

Did we miss any point in this guide? Is it a point that holds the potential to change the entire plan of action? Let us know in the comments below.