Friday, October 9, 2015

HPCC Systems was designed to solve “big data” problems. It can process, analyze and find links and associations in high volumes of complex data at high speed and with a high degree of accuracy. While it was originally created by LexisNexis and is still used in-house, the HPCC Systems Project went open source four years ago. Free downloads of the software, documentation and training materials are available from our website.

This is the first time we have participated in Google Summer of Code (GSoC) and it has been a great success. As a first-time organization, we were allocated two student slots. It was quite hard to choose which proposals to accept because there were many high-quality contenders. We selected two projects that highlight areas of specific interest, not just for us but for our community and the world of big data.

Add Statistics to the Linear and Logistic Regression Modules - Sarthak Jain

Machine learning statistics are important to the big data world, providing a way to drill down into data using complex queries and produce meaningful results that help businesses maintain their competitive edge in the marketplace. The HPCC Systems Machine Learning Library has been around for a while now and we are always looking for ways to improve it. The new statistics added as part of this project provide much more information about the quality of the models created.

Slide taken from Sarthak's presentation describing some of the tasks completed

The statistics Sarthak added provide metrics which indicate the “goodness” of the model created. He completed the tasks associated with these statistics in very good time and also added three stepwise functions to the same modules: forward, backward and bidirectional. These functions find the best model by adding or taking away independent variables, using a goodness metric to select which independent variables are added to or taken away from the model.

Expand the HPCC Systems Visualization Framework (Web-Based) - Anmol Jagetia

Currently the HPCC Systems Platform has very little support for visual analytics. While there are plenty of “off the shelf” visual analytic tools and dashboard creators, none are really suitable for big data because they typically work with local datasets (think charting with a spreadsheet). The HPCC Systems Visualization Framework aims to solve the issue by bringing together existing “best of breed” visualizations as well as bespoke HPCC Systems visualizations into a consistent framework.

Anmol’s project involved adding unit tests and linting as well as adding new visualization widgets and enhancing existing ones. He used his knowledge and experience to enhance our build quality infrastructure and also added a range of new features to the existing framework, including a time lapse capability and a number of features which enable bar charts to be used as Gantt charts. The work he has done, which is already being used, significantly improves the user experience.

Below is an illustration of the work Anmol did to add range support in a column chart where there is both an upper and lower bound.

We’ve really enjoyed participating in GSoC this year and we will definitely apply to be accepted again next year. Our thanks go to the students for contributing to our project. We hope they enjoyed working with us.

By Lorraine Chapman, HPCC Systems Release Manager and GSoC Org Admin