AdaBoost has proven to be one of the most effective classification algorithms. It consists of an ensemble of simpler models (known as “weak learners”) that, although not very effective individually, are highly performant when combined.

The process by which these weak learners are combined is, though, more complex than simply averaging results. Very briefly, the AdaBoost training process can be described as follows:

For each weak learner:

1) Train the weak learner so that the weighted classification error is minimised

2) Update the weights, so that correctly classified cases have their weight reduced and misclassified cases have their weight increased.

3) Determine the weak learner’s weight, i.e., the total contribution of the weak learner’s result to the overall score. This is known as alpha and is calculated as 0.5 * ln((1 - error.rate) / error.rate)

As the weights are updated on each iteration, each weak learner tends to focus more on the cases that were misclassified in previous iterations.
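The loop above can be sketched in a few lines of base R (a toy illustration of the formulas, not ada’s implementation; the labels and predictions are made up):

```r
# Toy sketch of one AdaBoost iteration. y: true labels coded as -1/1;
# pred: the weak learner's -1/1 predictions (one case misclassified here).
y <- c(1, 1, -1, 1, -1)
pred <- c(1, -1, -1, 1, -1)
w <- rep(1 / length(y), length(y))   # initial case weights

# 1) weighted error rate of the trained weak learner
err <- sum(w * (pred != y))

# 3) the learner's weight (alpha)
alpha <- 0.5 * log((1 - err) / err)

# 2) update case weights: misclassified cases gain weight
w <- w * exp(-alpha * y * pred)
w <- w / sum(w)                      # renormalise
```

After this update, the misclassified case (the second one) carries a much larger weight than the others, which is exactly what pushes the next weak learner to focus on it.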

Decision stumps as weak learners

The most common weak learner used in AdaBoost is known as a decision stump and is basically a decision tree of depth 1, i.e., a model that returns an output based on a single condition, which can be summarised as “if (condition) then A else B”.

The ada package in R

Although its implementation provides very good results in terms of model performance, the ada package has two main problems:

It creates very large objects: even with moderately sized datasets (around 500k x 50), the final model object can be extremely big (5 or 6 GB) and consequently too expensive to keep in memory, which is nevertheless needed to perform any kind of prediction with the model. This is because the ada object is an ensemble of rpart objects, each of which holds a lot of other information that is not actually needed when you just want to train an AdaBoost model with stumps and then predict with it.

Its execution time is too high: as both ada and rpart objects serve many other purposes, they are not optimal for stumps in terms of timing.

Introducing adaStump

Considering that each stump consists of just a condition, two probability estimates (one for when the condition is TRUE and the other for when it is FALSE) and a multiplier (alpha), I thought it could be done much better, both in terms of size and timing, even using just R.
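Such an ensemble can be pictured as a plain table with one row per stump; the sketch below illustrates the idea with made-up feature names and numbers (it is not adaStump’s actual internal format):

```r
# Hypothetical representation of an ensemble of stumps as a plain table
# (feature names, cut points and probabilities are made up for illustration).
stumps <- data.frame(
  feature    = c("x2ybr", "xbox"),
  cutpoint   = c(2.5, 6.5),
  prob.true  = c(0.8, 0.3),   # score when the condition holds
  prob.false = c(0.1, 0.7),   # score when it does not
  alpha      = c(0.9, 0.4),   # the stump's weight in the ensemble
  stringsAsFactors = FALSE
)

# Scoring a new case is then just a weighted sum over the rows
new.case <- c(x2ybr = 3, xbox = 5)
cond <- new.case[stumps$feature] >= stumps$cutpoint
score <- sum(stumps$alpha *
             ifelse(cond, stumps$prob.true, stumps$prob.false))
```

A table like this is tiny (a few numbers per iteration) regardless of the training set size, which is the intuition behind the memory savings discussed below.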

For that reason I developed adaStump, which is a package that takes advantage of ada’s excellent training process implementation, but creates a smaller output object and generates faster predictions.

An Example

Before running the example, you need to download the package from my GitHub repo. In order to do that, you need to have devtools installed.

After the package is installed, let’s start with the example! In this case, the letters dataset is loaded, its variable names are assigned and the variable isA is created, which gets the value 1 if the letter is an “A” and 0 otherwise. That will be the class we predict.

Now we are going to create two identical models. As AdaBoost is usually run with bagging, the bag.frac parameter is set to 1 so that all the cases are included. As we want to demonstrate that we can get the same results with much less memory usage and in less time, we need to ensure that both models are identical.

Taking a look at these numbers, we can definitely say that the first problem is solved. Besides, the size of ada objects depends on the bag fraction, the sample size and the number of iterations, while the size of adaStump objects depends only on the number of iterations. There is almost no scenario where the size of an adaStump object can turn into a resource problem, even on old personal computers. Now, let’s see if the new model can also predict more quickly:

Indeed 🙂 . Let us check whether the results are the same. Contrary to ada, adaStump’s prediction method returns only the probability estimate of the class outcome (i.e., the probability that the class variable is 1). Its complement can be easily calculated as 1 minus the vector obtained.

Although there is some difference in some cases, it tends to be minimal. We can now say that we can produce almost the same outcome using 1% of the memory and running more than 2.5 times faster. However, we can do better.

Usually, AdaBoost generates stumps based on the same condition. In this case, for example, we can see by accessing the model frame (fit_with.adaStump$model) that the condition x2ybr >= 2.5 is repeated constantly. Of course, all these rows can be collapsed so that the condition is evaluated only once. For that purpose you can use pruneTree().
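The idea behind that collapsing can be sketched with aggregate() (a hypothetical stump table; pruneTree()’s actual implementation may differ):

```r
# Sketch of collapsing stumps that share the same condition
# (hypothetical stump table with pre-multiplied contributions).
stumps <- data.frame(
  condition     = c("x2ybr >= 2.5", "x2ybr >= 2.5", "xbox >= 6.5"),
  contrib.true  = c(0.72, 0.30, 0.12),  # alpha * prob when condition holds
  contrib.false = c(0.09, 0.05, 0.28),  # alpha * prob when it does not
  stringsAsFactors = FALSE
)

# Sum the contributions per unique condition, so that each condition
# has to be evaluated only once at prediction time
collapsed <- aggregate(cbind(contrib.true, contrib.false) ~ condition,
                       data = stumps, FUN = sum)
```

After this step, a condition repeated across many iterations costs a single comparison per case instead of one per stump.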

Although the difference between the two methods is much greater in this case (there is a pre-calculation step where some decimals are discarded), it is still acceptable for most use cases. So, whether to apply the reduction or not is a decision based on the level of precision needed.

“First things first”. Tautological, always true. However, some data scientists seem to ignore this: you can think of using the most sophisticated and trendy algorithms, come up with brilliant ideas and imagine the most creative visualizations but, if you do not know how to get the data and handle it in exactly the way you need, all of this becomes worthless. In other words, first things first.

In my professional experience, I have heard thousands of potentially brilliant ideas that could not even be tested because the data scientist in question did not know how to handle data according to his/her needs. This has become particularly problematic with the popularisation of JSON: despite the undeniable advantages that this data structure has in terms of data storage, replication, etc., it presents a challenge for data scientists, as most algorithms require that the input data be passed in tabular form. I have faced this problem myself and, after creating a couple of temporary solutions, felt it was high time to come up with a general one. Of course, the solution I have come up with is just one approach.

In order to understand it, two things about JSON’s structure must be considered. Firstly, a JSON document is basically an array of objects. This means that, unlike in traditional tabulated databases, there is no indication of the fields an object does not have. This might seem trivial, but if you want to tabulate JSONs it certainly becomes something to be solved. Secondly, this array of objects can contain objects which are arrays themselves. This means that, when transforming the data to a table, such values have to be converted somehow to a format appropriate for a table. The function in question contemplates these two scenarios: it is intended to be used not only to tabulate objects with identical structure but also objects whose internal structure differs.

Before diving into the code of the function, let us propose a scenario where this could be useful. Imagine we would like to tabulate certain information about professional football players. In this mini-example, we have chosen Lionel Messi, Cristiano Ronaldo and George Best. The information is retrieved from dbpedia, which exposes the corresponding JSONs of their Wikipedia entries. In the code below, the first steps:

I have to admit that this is not the simplest example, as the JSONs retrieved are extremely complex and need much more pre-processing than usual (more than most API calls, for example). The problem is that nowadays most API calls need an API key, which would make the example less reproducible.

Back to the example. Firstly, the data for each player is retrieved using getURL() from RCurl and then converted to a list using fromJSON() from rjson, which converts JSON strings or files to lists. The whole call is done inside an lapply statement, so the object returned is basically a list of JSONs converted to lists, i.e., a list of lists. After that, from each of the sub-lists (in this case 3, as there were three players) we select only the piece of information that refers to the player himself; in a few words, a bit of cleanup. Finally, the object is unlisted. unlist() is a function that turns a list (a list of lists is itself a list) into a named vector of a type to which all the elements can be coerced; in this case, character.

This is the point of manual intervention, where the variable selection has to be done. At this stage, it is highly recommended to have at least one example loaded in some JSON parser (JSON Viewer, or any browser’s JSONView add-in). The names of the vector generated by unlist() (“players.unlist” in the example) are a concatenation of the names of the lists each value was in, separated by “.”.
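A tiny made-up list shows this naming behaviour (the fields are invented, much simpler than the dbpedia objects):

```r
# A tiny nested list mimicking a parsed JSON object (made-up data)
player <- list(name  = list(value = "Messi"),
               goals = list(value = c(672, 32)))

# unlist() flattens it to a character vector whose names are the
# nested list names concatenated with "." (indices are appended when
# an inner element holds more than one value)
vec <- unlist(player)
names(vec)
```

Note that the numeric goal counts are coerced to character, since that is the only type all elements can share.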

The elements in the character vector are ordered in the same way they appeared in the lists. So, the first thing to do is to establish where each of the cases starts. The challenge is to find a common pattern among the vector names that identifies where each of the elements starts. This element does not necessarily have to end up in the final table. In this case I chose “surname\\.value$”. However, in most cases I would recommend choosing the first element that appears in the object.

Similarly, the challenge now is to find common patterns in the names of the vector elements that will define each of the variables. In order to do that, the wisest thing is to take a look at the names that you have. In this case:

By taking a look at the JSON of any of these examples, we can see that “type” refers to the type of value in “value”. So, in this case, we know that we are going to need only the names that end with “.value”. However, the JSONs are too large to do a complete scan of all their elements, so the wisest thing is to look for the desired value manually. For example, if we would like to extract the date of birth:

Now we know that there are several different birth date fields. In this case, we should take a look manually and, based upon that, choose the one that best fits our needs. In this example, I chose 6 arbitrary variables, extracted their patterns and chose suitable variable names for them. Please notice that, for example, date of death should be empty for Messi and Ronaldo, while number is empty in the case of George Best (players did not use to have a fixed number in the past). Apart from that, “goals” has multiple entries per player, as the JSON has one value per club. This is the end of the script:

Finally, we get to the function. tabulateJSON() expects four parameters: an unlisted JSON (or any character vector with characteristics similar to those produced by unlisting a JSON), a string that represents the name of the starting position of each element, a vector of characters (normally regex patterns) with the element names to be sought, and finally the names to assign to the columns generated.

Now let’s take a look at how tabulateJSON() works. Below, the entire code of the function:

How does it work? Firstly, it looks for the items that define the start of an object, based upon the pattern passed, and returns their indexes. These will be the delimiters.

start.ind <- grep(start.obj, names(json.un))

After that, it looks for the names that match the patterns passed in “columns” and returns their indexes. Those indexes have to be assigned to a position (i.e., a row number), which is done with findInterval(). This function maps a number to an interval given a set of cut points.
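A minimal illustration of findInterval() in this role (the positions are made up):

```r
# Suppose the three objects start at positions 1, 40 and 85
# of the unlisted vector
start.ind <- c(1, 40, 85)

# Indexes where a sought pattern matched
match.ind <- c(5, 42, 90, 95)

# Each match is assigned to the row (object) whose start precedes it
findInterval(match.ind, start.ind)
# 1 2 3 3  -> the last two matches belong to the third object
```

This is also how multiple matches per object (like “goals”, one per club) end up sharing the same row number.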

For each of the variables, a data frame with two columns is created, where the first is the position and the second the value itself, obtained by accessing the character vector by the matched indexes. As will be seen later, this is done this way because, for a sought pattern (which is converted to a variable), there might be more than one match per row. Of course, that becomes problematic when trying to generate a tabulated dataset. For that reason, it is possible that temp.frames contains data frames with different numbers of rows.

After this, it is necessary to collapse the variables that contain multiple values per row. Firstly, the function checks whether any of the elements in the list has more rows than there are delimiters (which define the number of rows). If there is any, it uses plyr’s ddply() to collapse by the row specifier (pos). After this process, all the data frames in temp.frames will be of equal length:
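That collapsing step can be mimicked in base R with aggregate() (toy data and separator; the function itself uses plyr’s ddply()):

```r
# Toy version of collapsing multiple values per row into a single cell.
# Player 1 has two "goals" entries (one per club), player 2 has one.
goals <- data.frame(pos   = c(1, 1, 2),
                    value = c("672", "32", "145"),
                    stringsAsFactors = FALSE)

# Paste all values sharing the same row specifier into one string
collapsed <- aggregate(value ~ pos, data = goals,
                       FUN = paste, collapse = ";")
```

After this, every variable has exactly one row per object and the data frames can be bound into a table.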

Of course, using this function directly is a bit “uncomfortable”. For that reason, and depending on your particular needs, you can include tabulateJSON() as part of a higher-level function. In this particular example, the function could receive a list of high-level variables, each of which would be mapped to a particular pattern, for example:

Probably the most common complaints against R are related to its speed, especially when handling large volumes of information. This is, in principle, true, and relies partly on the fact that R does not run in parallel… unless you tell it to do so!

R offers a wide variety of packages dedicated to parallelisation. After several tries with some of them, I found snow the best one (for me) for performing classic parallelisable operations. Among the functions this package provides, I found the parApply family the most useful and the easiest to work with to take advantage of parallelisation.

In order to use these functions, it is first necessary to have a solid knowledge of the apply-like functions in traditional R, i.e., lapply, sapply, vapply and apply. What are they exactly? In essence, they apply a function over an array of objects.

The example below returns a numeric vector consisting of 3, 4, 5 … 10 in the variable “result”.
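A sketch of the two variants being compared (reconstructed from the description: an apply-style call versus an explicit loop over 1:8):

```r
# Add 2 to each element of 1:8, producing 3, 4, 5, ..., 10
my.vec <- 1:8

# Variant 1: a vectorised apply-style call
result <- vapply(my.vec, function(x) x + 2, numeric(1))

# Variant 2: an explicit loop doing the same thing
result2 <- numeric(length(my.vec))
for (i in seq_along(my.vec)) result2[i] <- my.vec[i] + 2
```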

Of course, in this example no significant speed difference can be found between the two pieces of code shown. However, even in one-node executions, the first alternative is considerably faster, which can be appreciated when working with larger amounts of data. For a vector of length 1,000,000, the difference is 0.94 vs 3.04 secs.

On the internet, plenty of posts and tutorials about the use of lapply, sapply, vapply and apply can be found, for example here. However, it is not redundant to explain again what each function does:

a) lapply

Applies a function over a vector and returns a list. If lapply were used in the previous example and we wanted to obtain a particular element, “list-like” indexing would have to be used:

my.vec <- 1:10
result <- lapply(my.vec, function(x) x + 2)
# get the third element of result
result[[3]]

b) sapply/vapply

They are very similar to lapply but, instead of a list, they return a vector of some type (numeric, character, Date, etc.). sapply will “deduce” the class of the output elements; in vapply, the output type has to be specified. Although sapply is simpler to use, as there is no need to specify the output type, vapply is faster (0.94 secs vs 4.04) and enables the user to control the output type.

The FUN.VALUE argument is where the output type of vapply is specified, which is done by passing a “general form” to which the output should conform. If the output returned by the function does not match the specified type, R will throw an error.

Tip: you may have noticed that you can write apply-like calls with a function(…) argument or without it. Usually you use “function(…) + a function” when the attributes of the object you are passing have to do different things (as in the rowwise apply example). However, if that is not the case, you can just pass the name of the function, as in the columnwise apply example. Other functions you could use similarly are median, max, min and quantile, among others.

Parallelisation

If the use of the apply functions is clear, then parallelisation is just one small step beyond with snow. The function equivalents are:

Base R               | snow
---------------------|-------------------------
lapply               | parLapply
sapply               | parSapply
vapply               | –
apply (rowwise)      | parRapply, parApply(, 1)
apply (columnwise)   | parCapply, parApply(, 2)

The only thing you really have to keep in mind is that, when parallelising, you have to explicitly export/declare everything each node needs to perform the parallel operation. This is done mainly with the clusterExport() and clusterEvalQ() functions. This is very important because you might be able to run something as one-node traditional R code and then get errors in the same execution in parallel, simply because these declarations are missing.

From apply() rowwise to parRapply()

Below you will find a small example, very similar to the rowwise apply one above, illustrating the small changes/additions needed in order to run your code in parallel.

# Return whether the result of (x[1]+x[2])*x[3] is greater than 20 or not
library(snow)
# The same df as before
ex.df <- data.frame(a = seq(1, 100, 1), b = seq(10, 1000, 10), c = runif(100))
# Define the threshold
ths <- 20
# These 2 statements in base R are equivalent
aa <- apply(ex.df, 1, function(x) (x[1] + x[2]) * x[3] > 20)
aa <- apply(ex.df, 1, function(x) (x[1] + x[2]) * x[3] > ths)
### Equivalent parallel execution ###
# Declare the cluster object. Here we use the default settings (SOCK)
# and the number of nodes is given by the number passed
clus <- makeCluster(3)
# The equivalent for the first alternative is straightforward
aa <- parRapply(clus, ex.df, function(x) (x[1] + x[2]) * x[3] > 20)
# However, if the variable "ths" needs to be used, a line has to be added
clusterExport(clus, "ths")
aa <- parRapply(clus, ex.df, function(x) (x[1] + x[2]) * x[3] > ths)
# Release the nodes when done
stopCluster(clus)

The clusterExport() function exports an object to each node, enabling them to work in parallel. Its use, as can be appreciated, is extremely simple: you just need to pass the variable name(s) in a character vector (or a single string, as in this case).

To conclude, imagine you needed to apply a custom function to each row of your data frame. Following the same example, in base R it would be:

To perform the same action in parallel with snow, you can declare the function inside a clusterEvalQ() statement, or declare it in the base workspace and then export it. If you use clusterEvalQ(), you will not see the function in your workspace.
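A minimal sketch of both options. Note that snow’s API was later incorporated into base R’s parallel package under the same names, which is what is loaded here so the sketch runs without extra installs; my.fun is a made-up custom function:

```r
library(parallel)  # snow's cluster API lives on here with the same names

ex.df <- data.frame(a = 1:10, b = seq(10, 100, 10), c = runif(10))
clus <- makeCluster(2)

# Option 1: declare the function on each node with clusterEvalQ().
# It will exist on the workers but not in your own workspace.
clusterEvalQ(clus, my.fun <- function(x) (x[1] + x[2]) * x[3])
aa <- parRapply(clus, ex.df, function(x) my.fun(x))

# Option 2: declare it locally and export it to the nodes.
my.fun <- function(x) (x[1] + x[2]) * x[3]
clusterExport(clus, "my.fun")
bb <- parRapply(clus, ex.df, my.fun)

stopCluster(clus)
```

Both calls produce the same result; the only difference is where the function definition lives.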

Of course, the main goal of parallelisation is to reduce execution times. However, it does not make much sense to apply it to a small set (a small list, data frame, etc.), as parallelisation requires time to distribute the task among the nodes and collect the results from them. This overhead is not significant when working with large volumes of data, but with small datasets the overall execution time of a parallel run might be greater than doing it on just one core.

Although the execution time decreases considerably, it should not be assumed that it will decrease proportionally to the number of nodes introduced, i.e., if the execution time on one node is 30 seconds, it will not drop to 10 seconds with 3 nodes; it will probably be higher.

These are the execution times for 1 and 3 nodes for the example above, with data frames of 1M and 3M rows:

# of Rows | 1 Node     | 3 Nodes
----------|------------|-----------
1M        | 18.81 secs | 10.90 secs
3M        | 65.67 secs | 40.26 secs

Feel free to write any questions, suggestions, comments, etc.! I hope you liked it and find it useful.

Previously on this blog, I briefly introduced some features of the Shiny package. However, I felt the need to write a new post about dashboards for several reasons:

a) Shiny has changed most of its functions and the previous post is outdated.
b) In this case, the outputs are done with googleVis, which is a very interesting package in itself, so it is worth dedicating some lines to it.
c) I finally got space on RStudio’s hosting service to upload my dashboard, so you can see how it works live here. IMPORTANT: ignore the error message you get at the beginning.

Context:

EAHU is an acronym for “Encuesta Anual de Hogares Urbanos” (in English, Annual Survey of Urban Households), a survey conducted by the official statistics institute in Argentina, whose microdata is available here. The dashboard’s raw data consists of a processed version of the 2010 to 2012 results.

How it works:

Although the filters are almost self-explanatory, it is necessary to highlight that the first chart (in the Income tab) is not affected by the year filter, as the plot itself “slices” the data by time.

Bugs:

During the set-up, I had several issues with the encoding (Spanish uses many multi-byte characters, such as á, é or ñ) and decided to replace them with ASCII characters. The problem that turned up is that Google did not recognise 3 of the 5 provinces that have a special character in their name (Neuquén, Río Negro and Tucumán; the first two are under “Patagonia” and the last one is under “NOA”), so you will never see them plotted on the map, although they definitely appear in the rest of the charts.

The code consists mainly of three files, the standard for any dashboard written in Shiny: global.R, ui.R and server.R. You will see that global.R is not mandatory for a dashboard and, as will be explained afterwards, is not strictly necessary in this case.

As is perfectly explained here under “Global Objects”, this code could have been inserted in server.R outside the shinyServer() function, as the ui.R file does not have any component that depends on an external source. In this case, global.R is indeed not necessary because the source file is completely static, and consequently there is no need for ui.R to reference an external source that has to be loaded before the dashboard is displayed.

However, global.R can be extremely useful when working, for instance, with data coming from the Internet, which may change often and where ui.R should “adapt” itself to a source file, especially when this source file is also used by server.R. In this case, as the raw data is completely static, the ui.R filters are completely “hard coded”. If, on the contrary, your data comes from a SQL query, for example, where your filter variables change frequently, the best option would be to do all the possible preparation of the data in global.R so that ui.R and server.R can use it afterwards.

As explained in this post, Shiny works with an input/output logic based on two files that, unlike global.R, are mandatory for the dashboard to work as expected. In this sense, it is essential to keep a clear idea of what a dashboard does and how the information flows across the files to fulfil its purpose. Basically, it can be separated into three steps:

1) Some data is “given” (i.e., filters are selected, terms/numbers are entered, files are uploaded, etc.)
2) The data is processed according to the input given in 1) and results are produced (outputs)
3) The results are shown

In Shiny, this could be divided in:

1) ui.R (“input” part)
2) server.R
3) ui.R (“output” part)

Parts 1 and 3 were explained before, and they are clearly easier than server.R, where the most important work takes place. Firstly, an overview of the whole code:

Two different datasets are created; the first one is used by all the plots except the motion chart, which uses the second one. The main difference is that the motion chart does not take the year filter into account, as it is not needed: the graph itself can “slice” the data per year. Although it might not be neat to use two different subsets for plots displayed in the same dashboard, I think that in the case of the motion chart the year filter does not really play any role, because the output itself enables the user to perform the same action.

This is the first of the plots created with the googleVis package. When I first got in touch with it, it surprised me how it managed to plot 6 variables (i.e., the vertical and horizontal axes, the time, the name of the elements, and the size and colour of the bubbles) in a very clear way. Moreover, it has a menu where you can switch to time series or bar plots, so I would say it is 3 plots at the same time.

For this output, you can see that the data is processed using the cast() function, which belongs to the reshape package. This is because the data uploaded was previously “molten”. Equivalent results could have been obtained with aggregate(), which is in fact used in other outputs. This function is used to obtain a data frame with the mean income and its standard deviation.

Then, the coefficient of variation (standard deviation/mean) is calculated. These 3 measures could probably have been obtained with any visualization-specific software, such as Tableau or QlikView. However, is it possible to calculate the Gini coefficient dynamically? If it is, it probably requires programming the whole integral calculation. In R, however, you can do it in just 4 lines using the reldist package.
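For reference, the unweighted Gini computation that reldist’s gini() performs can be sketched in a few lines of base R (a simplified illustration, not reldist’s actual code):

```r
# Base-R sketch of the (unweighted) Gini coefficient, using the
# mean-absolute-difference formulation over the sorted values
gini.sketch <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n^2 * mean(x))
}

gini.sketch(c(10, 10, 10, 10))  # perfect equality -> 0
```

Because it is just a vectorised computation, it can be run inside a reactive expression and recalculated on every filter change.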

Finally, a small data frame with all the info is built and plotted with the gvisMotionChart() function (see the documentation for more details).

Please notice that you cannot call the reactive functions that filter the data, store the result in an object and use it across the output functions, because you will get an error. This is because the functions generating the plots are private, even though they are all contained in shinyServer().

This map displays the average age of each province given the selected filters. gvisGeoChart() lets you work with names that refer to geographical items (i.e., states, countries, continents, etc.) or with coordinates. In this case, it was considerably easier to work with names. The main drawback, due to the encoding issue mentioned, was that some of the provinces were not identified and were not included in the plot. In addition, I could not manage to add a title to it; it seems the geo chart does not support titles.

The main advantage of this bar plot is that the quartiles are built dynamically based upon the filters applied. The options documentation for the gvis plots is not very good in R itself; in my opinion, the best approach is to go to the Google documentation. There you will be able to see what is editable and what is not. Please note that you will have to adapt what is written in the documentation to the R syntax.

This is the code to generate tables with the googleVis package. For this kind of visualization, it might be more helpful than a normal R table because the rows can be sorted by any of the columns (ascending or descending) just by clicking on the column.

Finally, the education curve is based upon the idea that the maximum education level achieved implies having completed all prior stages. For instance, someone whose education level is “Complete High School” has achieved primary incomplete, primary complete, high school incomplete and high school complete.

Based upon this principle, you can plot a descending curve from the lowest to the highest level of education and compare graphically the drops in each province.

In addition, you can build a measure that calculates the area under those curves, which can support a graphical hypothesis with a “hard” empirical measure.

As the calculations are done on the subsetted data, the plot is entirely dynamic. This enables you to see how the curves/measures change over time or with the age of the population.

Conclusion

The last tab is a perfect example of the power of R. Far from being a mere statistics package, R is a tool in constant expansion that can perfectly replace, and even beat, the most widely used commercial specialised software. In this case, for instance, it has been shown that R can perform BI tasks with much more flexibility and potential than most of the software dedicated to that. In this context, it is just a question of time before old and expensive software in the “data world” is definitively replaced by free and open-source tools like R, among many others.

Sudoku is nowadays probably the most widespread puzzle game in the world. As such, it has an interesting variety of solving techniques, not just with paper and pencil but also with computers.

Of course, I am not the first one to treat this issue. On the Internet there is a huge variety of solutions in multiple languages. In R, there is even a package dedicated exclusively to Sudokus.

In this first part, the solution presented solves only the puzzles that always have at least one empty cell with just one possible value. However, there are some puzzles where this is not enough, i.e., there is a certain point where no empty cell has just one possible value, which results in the need for other strategies (which will be treated in future posts).

This first part covers just the cases where there is at least one empty cell with a unique possible value that can be filled. In my case, I tried to make it as “human-like” as possible. Consequently, the following approach has NO “brute force”-like code.

Therefore, I evaluated the ways in which I solve the puzzles without a computer and came up with four general ways of identifying “sure” values:

1) A cell has just one possible value

As you can see, the circled cell can only be a 2, as 4 and 7 are already in the same sector, 3 and 5 are present in the same row, and 1, 6, 8 and 9 appear in the same column.

2) A specific value is only valid in one cell of a certain row

The red circled cell is the only one in its row that can be a 3, because the other three empty cells in the row belong to a sector where there is already a 3.

3) A specific value is only valid in one cell of a certain column

The red circled cell is the only one in its column that can be a 9, as, from the other three empty cells, the first one from above has a 9 in the same sector and the last two ones have a 9 in the same row.

4) A specific value is only valid in one cell of a certain sector

The red circled cell is the only one in that sector that can be a 1: the blue squared 1s above prevent all the empty cells in that column that also belong to the sector in question from being a 1. Additionally, the other squared 1, in the bottom right sector, does not allow the centre-left empty cell of the bottom left sector to be a 1.

Below you will find the code with its step-by-step explanation, which basically implements iteratively a series of functions that respond to the four situations mentioned above.

As you can see, the puzzle (no matter how it was generated) is put into an array of 9x9x(9+1). As you can deduce from the code, the puzzle lies on the z=1 slice, and slices 2 to 10 are initialised with the number 1. The slices with z index 2:10 store the “status” of a specific number (z-1) in a specific cell, where 1 indicates that the cell can be z-1, 0 that it cannot, and 2 that it is z-1.

For instance, if data[2,3,4] is 1, it means that the cell [2,3] in the puzzle can be a 3. If it is 2, it means that the cell [2,3] is definitely a 3.
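The set-up described can be sketched as follows (a blank puzzle matrix stands in for a real one):

```r
# Sketch of the cube set-up described above: layer 1 holds the puzzle,
# layers 2:10 hold the status of digits 1:9 per cell
puzzle <- matrix(0, 9, 9)     # a blank puzzle as a stand-in
puzzle[puzzle == 0] <- NA     # the NA transformation for empty cells

data <- array(1, dim = c(9, 9, 10))
data[, , 1] <- puzzle

# e.g. data[2, 3, 4] == 1 means cell [2,3] can still be a 3;
# setting it to 2 would mean cell [2,3] IS a 3
```

With a real puzzle, the cells already filled in layer 1 would then be propagated into layers 2:10 by the update functions below.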

The NA transformation is only necessary if your original input data (i.e., the puzzle) has any 0s.

These 3 functions update the underlying cube according to the values in the z=1 layer, following the aforementioned convention (0 = [x,y] is NOT z-1; 1 = [x,y] can be z-1; 2 = [x,y] IS z-1).

They are executed at the beginning with the initial puzzle configuration and re-executed whenever the z=1 layer is updated.

Once the functions are declared, it is only necessary to insert them in a loop; as you could see before, whenever a value is updated the flag is set to 1, making the program loop once more if anything was modified and stop if there have been no changes. Please note that the value of the flag is not assigned in the usual way (with "<-" or "=") but with the assign() function, which (given an environment argument such as .GlobalEnv) is how you can modify a global variable from inside a function in R.

As stated in the introduction of this post, this script does not solve every unique-solution puzzle. In the next post dedicated to Sudokus, new functions will be presented to cover the situations where this script is not enough: basically, when there is no cell with just one possible value.

I hope you enjoyed this post. Any comments, critiques or corrections are welcome!

Social Media Analysis has become something of a new obsession in Marketing. Every company wants to engage existing customers or attract new ones through this communication channel, so they hire designers, editors, community managers, etc. However, when it comes to measurement, when the impact of all the resources spent on social media communication has to be analyzed, they either do it extremely manually (i.e., manual counts, etc.) or they spend a significant amount of money on “pre-built” solutions that might a) not meet their needs exactly, b) be too difficult to master, or c) require constant calls to Customer Service whenever a problem is found or a change must be made.

In my opinion, underlying this issue is a misunderstanding of the importance of having the right tools (and people) to analyze these results; although it might seem a field with an exclusively qualitative focus, the work of a Data Scientist is, undoubtedly, as important as the roles mentioned above for succeeding in Social Media Marketing.

Only by using the right tools with solid, sustainable analysis can we gather good insights from Social Media Marketing. Otherwise, you are being “trendy”, but you are probably losing money and time.

The objective on this occasion is to show how easy it can be to build your own Social Media tool and, at the same time, to give a very brief introduction to four very useful packages for the topic in question that can help you with your own developments:

Shiny: A package created by RStudio (http://shiny.rstudio.org/) to build web applications very easily. In this case, you will see the code to operate locally. Here you may be able to find more info on how to run it on your own or on a hosted server (I have not tried this option yet).

I would encourage you to go a step beyond with this package, as only a few features are presented here. As you will see, “front-end” possibilities are barely exploited in this example.

twitteR: A very powerful package for Twitter Monitoring. Simple, easy and very effective.

tm: As you might have already imagined, tm stands for “Text Mining”. Apart from text mining tools, it also provides very useful functions to pre-process texts.

wordcloud: A package used to draw wordcloud plots.

Below, what you were all expecting… the code! If you copy/paste it, you will see that an error message saying “Error: You must enter a query” appears twice (it comes from the twitteR package). My apologies, I could not manage to hide it (I have not found any error handler that could do it). If anyone has a solution, I would appreciate it if you could share it; I will include it and credit you in the post.

The “shiny” package requires two separate scripts, named ui.R and server.R, which must be placed in the same folder. Once finished, you run your web application with the runApp() function.
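Assuming the two scripts live in a folder called, say, myApp (the folder name here is illustrative), launching the application locally looks like this:

```r
library(shiny)

# Points at the folder containing ui.R and server.R and
# opens the application in your browser.
runApp("myApp")
```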

As you can see in the scripts, the shiny package works with an input/output logic. In order to build your own applications, you will need a very clear idea of what should be given by the user (input), what should be shown (output), and in which file (ui.R or server.R) you should place each part of the process. The steps below might help.

Before we go on to analyze all the code presented in more detail, this is how the application looks:

From the user interface perspective, it is quite simple; we just type in a word, select the number of tweets we would like to extract and finally select the language. Below you will see a screenshot of the result of entering the term “music”, selecting 200 tweets and choosing English as the language:

The menu logic is extremely easy to understand. Each of the widgets used to enter parameters takes its id as the first argument. To reference it in further actions, you write input$(id); it works like any other variable. In radioButtons(), the name of the button is entered first, followed by the value that the radioButton variable will receive if that option is chosen.

The only thing worth pointing out specially is the submitButton() function. Unless you include it, the script will process and output the results every time you change the parameters. The submitButton() function is particularly useful if the process that has to take place is expensive, or if you need the user to enter more than one parameter for the whole script to run correctly.
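Putting the widget logic above together, a ui.R along these lines could look as follows (the ids and labels are illustrative, and the layout functions follow the shiny API of the time; newer versions expose a different page-building API):

```r
library(shiny)

shinyUI(pageWithSidebar(
  headerPanel("Twitter Wordcloud"),
  sidebarPanel(
    textInput("term", "Search term:", value = ""),
    sliderInput("ntweets", "Number of tweets:",
                min = 50, max = 500, value = 200),
    radioButtons("lang", "Language:",
                 c("English" = "en", "Spanish" = "es")),
    submitButton("Search")   # results only refresh when this is pressed
  ),
  mainPanel(
    tableOutput("tabl"),     # "tabl" is a hypothetical output id
    plotOutput("wordcl")     # matches the output$wordcl mentioned below
  )
))
```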

2) Parameter values are taken by the function and processed: server.R

Step 2 is a bit more complex. Firstly, all the necessary libraries (apart from shiny) have to be loaded in server.R. It is always advisable, when possible, to load them all together.

shinyServer(function(input, output) must always be the first line of your code in server.R before working with the “input” variables (the parameters entered by the user). Otherwise, it will not work.

reactive() is a function indicating that whatever is processed inside it will be re-run whenever the parameters change (if you included submitButton() in the UI, whenever you hit it; otherwise, whenever you change the parameters). It is particularly useful for building the raw data that you will process afterwards. You can find a very good explanation of it in the documentation.

In this case, searchTwitter() uses the information entered by the user (the first argument is the term entered, the second the number of tweets and the third the language) and returns its output.

As the object returned by searchTwitter() is a bit difficult to handle, it is advisable to turn it into a data frame (twListToDF()) if you want to work, for example, with the tweets' texts.

reactiveTable() and reactivePlot() are functions that call other functions and return specific “reactive” outputs (in this case, a table and a plot) that will afterwards be displayed in the web interface. It is important to name them output$(…) because, as will be explained in Step 4, otherwise you will not be able to display them in the web application.
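A skeleton of that server.R logic could look like this (the names tweets and tabl are illustrative, and the function-style reactive() call matches the early shiny API that also provided reactiveTable(); current shiny versions use renderTable() instead):

```r
library(shiny)
library(twitteR)

shinyServer(function(input, output) {

  # Re-runs whenever the inputs change (or the submit button is pressed)
  tweets <- reactive(function() {
    twListToDF(searchTwitter(input$term, n = input$ntweets,
                             lang = input$lang))
  })

  # A "reactive" output, named output$... so the UI can display it
  output$tabl <- reactiveTable(function() {
    head(tweets()[, 1, drop = FALSE], 5)   # first 5 records, first column
  })
})
```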

The reactive Table in this example just returns the first five records of the first column of the data frame and it can be understood by just reading the code.

The reactive plot is a bit more sophisticated. Firstly, it takes the tweets and changes the encoding (enc2native()). I tried this script on three different computers and had encoding problems on one of them; this step is just to avoid that issue.

Then, it converts everything to lower case. After that, the removeWords() function from the “tm” package is used to delete common words. As you can see in the example, you can input whatever word, list of words, regex, etc. you would like removed. In this case, the stopwords from the user-entered language (input$lang) are removed, plus the term “rt”. For building a wordcloud (our final objective), this is particularly useful, as we would never want “common words” in it. After that, punctuation is also removed.

Finally, all the tweets are split into a list of words, which is turned into a (frequency) table and ordered in descending order; the first 15 entries are chosen for the plot.
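As a rough base-R sketch of that pipeline (in the post itself, tm's removeWords() and stopwords() do the actual cleaning; the toy texts and stopword list below are invented for illustration):

```r
# Toy stand-in for the tweet texts
txt <- c("I love this music",
         "rt great music tonight",
         "music, music everywhere!")

clean <- tolower(txt)
clean <- gsub("[[:punct:]]", "", clean)            # remove punctuation
words <- unlist(strsplit(clean, "\\s+"))           # split tweets into words
words <- words[!words %in% c("rt", "i", "this")]   # toy stopword list
freq  <- sort(table(words), decreasing = TRUE)     # descending frequency table
top   <- head(freq, 15)                            # keep the 15 most frequent
# wordcloud(names(top), as.numeric(top), ...)      # would then feed the plot
```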

For the wordcloud() function, we enter the labels (the words themselves) of the generated table and their frequencies (the first two arguments in the example); these determine the size of the words. The remaining arguments refer to the color order, the palette, and the maximum and minimum size of each word in the plot.

The concluding part is very simple. You just have to specify where (in this case, the main panel) and what (in this case, the table and the plot) to show. Remember that every object generated with a reactive function in the server.R file and correctly declared as output$(…) must now be called as “…”. In this case, the wordcloud was declared as output$wordcl and is called in the UI as plotOutput(“wordcl”).

That's it! I hope you enjoyed it, and sorry for the length :D. As usual, if you have any questions, critiques or corrections, please feel free to write.

The world of Data Visualization offers infinite ways to display our data. However, there is still some reluctance to exploit all the possibilities that computers give us nowadays in this field, probably not because of a rejection of novelty but due to a certain tendency to stick to a vision tied to “paper support”, i.e., to representations that could be printed on paper. Even if interactive options are included (such as menus with filters, pivot tables, etc.), the way the data is finally displayed does not differ very much from what could have been done a hundred years ago: pie charts, bars, axis plots, etc.

This, of course, has several limitations: there is no possibility of showing something dynamic, which is definitely not a merely aesthetic detail. Animations can sometimes show the effect of time on the data we are exhibiting far more easily than any static representation.

Of course, this can be done in R. Below you will find the code for a hypothetical network graph, showing how the connections (edges) occur over time. In this post, the idea is to focus on the animation features, which can be used with any other type of graph. In the future, I will dig deeper into social networks and graphs.

#This is an invented dataset describing the edges generated over time between the different nodes. To understand this example, we have to assume this is the order in which the edges were created. Read it by column: the first edge created joins 1 and 2, the second 3 and 4, the third 5 and 6, etc.

#Reset the record of frames. To ensure that what we are going to animate does not include anything else, it is always good practice to reset before starting

ani.record(reset=TRUE)

# Loop created to generate the animation. As there is no way to do it “cumulatively”, the same graph is depicted every time (with all its nodes), but some of the edges are drawn in white to give the impression of progressive connection
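A sketch of that loop, assuming a small invented layout and edge list (not the post's data) and the animation package: every frame plots all the nodes, draws the edges created so far in black and the rest in white, and records the frame with ani.record().

```r
library(animation)

# Hypothetical node layout and edge sequence, for illustration only
set.seed(1)
xy    <- cbind(runif(6), runif(6))                      # coordinates of 6 nodes
edges <- matrix(c(1, 2, 3, 4, 5, 6, 1, 3), ncol = 2, byrow = TRUE)

ani.record(reset = TRUE)                  # start from a clean frame record
for (i in seq_len(nrow(edges))) {
  plot(xy, pch = 19, cex = 2, axes = FALSE, xlab = "", ylab = "")
  for (j in seq_len(nrow(edges))) {
    col <- if (j <= i) "black" else "white"   # future edges hidden in white
    segments(xy[edges[j, 1], 1], xy[edges[j, 1], 2],
             xy[edges[j, 2], 1], xy[edges[j, 2], 2], col = col)
  }
  ani.record()                            # snapshot this frame
}
```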

#There are many possibilities to save the animation created. Below you will find only two

#This is the best option if you are going to run the animation locally or you have a server to store the generated files; I would highly recommend it. However, as I do not have a server to store the images, I could not use this option for the post, even though it is indeed considerably better.

saveHTML(ani.replay())

#This is the one I used in the post. It creates an mp4 file; therefore, you will need to download ffmpeg.
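Assuming ffmpeg is installed and on the PATH (the output file name here is illustrative), the mp4 route looks like this:

```r
library(animation)

# Replays the recorded frames and encodes them into an mp4 via ffmpeg
saveVideo(ani.replay(), video.name = "network.mp4")
```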