
In the previous article we discussed the Data Envelopment Analysis (DEA) technique and saw how it can be used as an effective non-parametric ranking algorithm. In this blog post we develop an implementation of Data Envelopment Analysis in Java and use it to evaluate the social media popularity of webpages and articles on the web. The code is open source (under the GPL v3 license) and you can download it freely from GitHub.

Update: The Datumbox Machine Learning Framework is now open-source and free to download. Check out the package com.datumbox.framework.algorithms.dea to see the implementation of Data Envelopment Analysis in Java.

Data Envelopment Analysis implementation in Java

The code is written in Java and can be downloaded directly from GitHub. It is licensed under GPLv3, so feel free to use, modify and redistribute it.

The code implements the Data Envelopment Analysis algorithm, uses the lp_solve library to solve the linear programming problems, and uses data extracted from the Web SEO Analytics index to construct a composite social media popularity metric for webpages based on their shares on Facebook, Google Plus and Twitter. The theoretical parts of the algorithm are covered in the previous article, and the source code contains detailed javadoc comments about the implementation.

Below we provide a high-level description of the architecture of the implementation:

1. lp_solve 5.5 library

To solve the various linear programming problems, we use an open-source library called lp_solve. The library is written in ANSI C, and the Java code invokes its methods through a Java wrapper. Thus, before running the code, you must install lp_solve on your system. Binaries of the library are available for both Linux and Windows, and you can find more information about the installation in the lp_solve documentation.

Please make sure that the library is installed on your system before trying to run the Java code. For any problem with installing or configuring the library, please refer to the lp_solve documentation.

2. DataEnvelopmentAnalysis Class

This is the main class of the implementation of the DEA algorithm. It exposes a public method called estimateEfficiency(), which takes a Map of records and returns their DEA scores.
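To give a feeling for what a DEA score is, here is a minimal sketch of the simplest possible case: one input and one output per record, under constant returns to scale (the CCR model). In that special case no linear program is needed, because the score reduces to the record's output/input ratio divided by the best ratio in the dataset. The class name SingleRatioDea and the {input, output} array layout are our own assumptions for illustration; this is not the DataEnvelopmentAnalysis class from the repository, which handles multiple inputs and outputs through lp_solve.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not the class shipped in the repository. */
class SingleRatioDea {

    /**
     * Scores each record in the single-input, single-output case.
     * Each value is a two-element array {input, output}; inputs must be > 0.
     * The score is (output / input) divided by the best ratio in the dataset.
     */
    static Map<String, Double> estimateEfficiency(Map<String, double[]> records) {
        // The most efficient record defines the frontier.
        double bestRatio = 0.0;
        for (double[] r : records.values()) {
            bestRatio = Math.max(bestRatio, r[1] / r[0]);
        }

        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : records.entrySet()) {
            double ratio = e.getValue()[1] / e.getValue()[0];
            scores.put(e.getKey(), bestRatio > 0 ? ratio / bestRatio : 0.0);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up numbers, for illustration only.
        Map<String, double[]> records = new LinkedHashMap<>();
        records.put("A", new double[] {2.0, 4.0}); // ratio 2.0
        records.put("B", new double[] {1.0, 4.0}); // ratio 4.0, the most efficient unit
        records.put("C", new double[] {4.0, 4.0}); // ratio 1.0
        System.out.println(estimateEfficiency(records)); // {A=0.5, B=1.0, C=0.25}
    }
}
```

With several inputs or outputs the ratio is no longer unique, which is why the general case needs one linear program per record.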

3. DeaRecord Object

The DeaRecord is a special object that stores the data of a record. Since DEA requires separating inputs from outputs, the DeaRecord object stores the two separately, in a form that DEA can handle.
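As a rough sketch of the idea, a record of this kind can be as simple as two arrays kept apart. The class below is a hypothetical stand-in that we call SimpleDeaRecord; its constructors and getters are assumptions made for illustration, and the real DeaRecord in the repository may be shaped differently.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a DEA record; the real DeaRecord may differ. */
final class SimpleDeaRecord {
    private final double[] input;
    private final double[] output;

    SimpleDeaRecord(double[] output, double[] input) {
        // Defensive copies, so later changes to the caller's arrays do not leak in.
        this.output = output.clone();
        this.input = input.clone();
    }

    /** Output-only record, for cases where every metric counts as an output. */
    SimpleDeaRecord(double[] output) {
        this(output, new double[0]);
    }

    double[] getInput() {
        return input.clone();
    }

    double[] getOutput() {
        return output.clone();
    }

    @Override
    public String toString() {
        return "SimpleDeaRecord{input=" + Arrays.toString(input)
                + ", output=" + Arrays.toString(output) + "}";
    }

    public static void main(String[] args) {
        // Made-up numbers: two inputs and three outputs for one organizational unit.
        SimpleDeaRecord depot = new SimpleDeaRecord(
                new double[] {40.0, 55.0, 30.0}, new double[] {3.0, 5.0});
        // Made-up social media counts, all treated as outputs with no inputs.
        SimpleDeaRecord page = new SimpleDeaRecord(new double[] {120.0, 45.0, 30.0});
        System.out.println(depot);
        System.out.println(page);
    }
}
```

Keeping the two vectors separate means the code that builds the linear program never has to guess which column plays which role.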

4. SocialMediaPopularity Class

The SocialMediaPopularity class is an application that uses DEA to evaluate the popularity of a page on social media networks based on its Facebook likes, Google +1s and Tweets. It implements two protected methods, calculatePopularity() and estimatePercentiles(), along with two public methods, loadFile() and getPopularity().

The calculatePopularity() method uses the DEA implementation to estimate the scores of the pages based on their social media counts. The estimatePercentiles() method takes the DEA scores and converts them into percentiles. In general, percentiles are easier to explain than DEA scores: when we say that the popularity score of a page is 70%, we mean that the page is more popular than 70% of the pages in the dataset.

To estimate the popularity of a particular page, we must have a dataset with the social media counts of other pages. This makes sense: to judge which page is popular and which is not, you must be able to compare it with other pages on the web. To do so, we use a small anonymized sample from the Web SEO Analytics index, provided in txt format. You can build your own database by extracting the social media counts of more pages on the web.

The loadFile() method loads the aforementioned statistics for use by DEA, and getPopularity() is an easy-to-use method that takes the Facebook likes, Google +1s and number of Tweets of a page and evaluates its popularity on social media.
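The conversion from scores to percentiles can be sketched in a few lines. The version below assumes that the percentile of a score is the fraction of reference scores strictly below it, which matches the "more popular than 70% of the pages" reading; the class and method names are hypothetical, and the actual estimatePercentiles() method may handle ties or boundaries differently.

```java
/** Hypothetical sketch of a score-to-percentile conversion. */
class PercentileSketch {

    /** Returns the fraction of reference scores strictly below the given score, in [0, 1]. */
    static double percentileOf(double score, double[] referenceScores) {
        if (referenceScores.length == 0) {
            return 0.0;
        }
        int below = 0;
        for (double s : referenceScores) {
            if (s < score) {
                below++;
            }
        }
        return (double) below / referenceScores.length;
    }

    public static void main(String[] args) {
        // Made-up DEA scores for five pages.
        double[] scores = {0.9, 0.1, 0.4, 0.4, 0.7};
        // Three of the five scores are below 0.7.
        System.out.println(percentileOf(0.7, scores)); // 0.6
    }
}
```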

Using the Data Envelopment Analysis Java implementation

In the DataEnvelopmentAnalysisExample class we provide two different examples of how to use the code.

The first example uses the DEA method directly to evaluate the efficiency of organizational units based on their outputs (ISSUES, RECEIPTS, REQS) and inputs (STOCK, WAGES). This example was taken from an article on DEAzone.com.

The second example uses our Social Media Popularity application to evaluate the popularity of a page using social media data such as Facebook likes, Google +1s and Tweets. All social media counts are marked as outputs, and we pass an empty input vector to DEA.
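For readers who want to see what an output-only problem can look like, here is one common way to write it down. We assume that the empty input vector behaves like a constant input of 1 for every page, which is a usual convention for output-only DEA models but is not confirmed by the code described here. The symbols are our own: y_{rj} is the count of page j on network r (Facebook, Google Plus, Twitter), u_r is the weight of network r, page 0 is the page being evaluated (itself one of the n pages), and theta_0 is its score.

```latex
\begin{aligned}
\max_{u}\quad & \theta_0 = \sum_{r=1}^{3} u_r \, y_{r0} \\
\text{subject to}\quad & \sum_{r=1}^{3} u_r \, y_{rj} \le 1, \qquad j = 1, \dots, n, \\
& u_r \ge 0, \qquad r = 1, 2, 3.
\end{aligned}
```

Under this reading each page picks the weights that flatter it the most, subject to no page in the dataset scoring above 1.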

Necessary Expansions

The provided code is just an example of how DEA can be used as a ranking algorithm. Here are a few expansions that should be made in order to improve the implementation:

1. Speeding up the implementation

This DEA implementation evaluates the DEA scores of all the records in the database. This makes the implementation slow, since it requires solving as many linear programming problems as there are records in the database. If we do not need the scores of all the records, we can speed up the execution significantly. Thus a small expansion of the algorithm can give us better control over which records should be solved and which should be used only as constraints.
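One possible shape for such an expansion is sketched below. Every record is still passed along as a constraint, but the solver is invoked only for the ids we actually want scored. The method scoreSubset and the BiFunction standing in for "build and solve one linear program" are assumptions made for illustration; the placeholder solver in main is just the single-input, single-output ratio, not a call to lp_solve.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

/** Hypothetical sketch: score a subset of records against the full reference set. */
class SelectiveScoring {

    /**
     * Scores only the ids in toSolve; all records still act as constraints.
     * The solver receives the target record and the full reference set.
     */
    static Map<String, Double> scoreSubset(
            Map<String, double[]> allRecords,
            Set<String> toSolve,
            BiFunction<double[], Collection<double[]>, Double> solver) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String id : toSolve) {
            double[] record = allRecords.get(id);
            if (record == null) {
                throw new IllegalArgumentException("Unknown record id: " + id);
            }
            // One solver call per requested record, instead of one per record in the database.
            scores.put(id, solver.apply(record, allRecords.values()));
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up {input, output} pairs.
        Map<String, double[]> all = new LinkedHashMap<>();
        all.put("A", new double[] {2.0, 4.0});
        all.put("B", new double[] {1.0, 4.0});
        all.put("C", new double[] {4.0, 4.0});

        // Placeholder solver: ratio of the target relative to the best ratio in the reference set.
        BiFunction<double[], Collection<double[]>, Double> ratioSolver = (target, reference) -> {
            double best = 0.0;
            for (double[] r : reference) {
                best = Math.max(best, r[1] / r[0]);
            }
            return (target[1] / target[0]) / best;
        };

        System.out.println(scoreSubset(all, Collections.singleton("C"), ratioSolver)); // {C=0.25}
    }
}
```

For a single new page scored against a large database, this means one solver call instead of one per stored record.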

2. Expanding the Social Media Counts Database

The provided social media counts database consists of 1111 samples from the Web SEO Analytics index. To estimate a more accurate popularity score, a larger sample is necessary. You can create your own database by collecting the social media counts of more pages on the web.

3. Adding more Social Media Networks

The implementation uses Facebook likes, Google +1s and the number of Tweets to evaluate the popularity of an article. Nevertheless, metrics from other social media networks can easily be taken into account. All you need to do is build a database with the social media counts from the networks you are interested in and expand the SocialMediaPopularity class to handle them accordingly.

Final comments on the implementation

To expand the implementation you must have a good understanding of how Data Envelopment Analysis works. This is covered in the previous article, so please make sure you read the tutorial before you make any changes. Moreover, to use the Java code you must have the lp_solve library installed on your system (see above).

If you use the implementation in an interesting project, drop us a line and we will feature your project on our blog. Also, if you like the article, please take a moment to share it on Twitter or Facebook.