Details

Matthew Purdy is a technology-minded businessperson. As CEO of a local software engineering services company focused on big data, his role is to grow the business through relationships. As a software engineer, his day-to-day work involves handling very large data sets with interesting constraints and complexity.

Matt has close to 20 years of experience with many different languages, software platforms, and distributed systems. His current focus is big data using Hadoop, Accumulo, and Spark; however, he is always interested in learning new technologies and listening to others to find better ways to handle complex problems.

Spark and Accumulo: handling really big data

The primary focus of data science is understanding data. As data grows large and complex, more work goes into the bookkeeping. Apache Accumulo is an implementation of Google's Bigtable with a focus on large scale and fine-grained security. Apache Spark is a next-generation compute platform. Both Accumulo and Spark can run on the Hadoop Distributed File System (HDFS), and together they become a great way to do ETL first, then follow up with analytics using Spark's MLlib or GraphX.

The goal of this presentation is to add new tools to the data scientist's toolbox. We will go through the benefits of using Accumulo and Spark as a custom data analytics platform and provide some simple examples to ease processing. Once you know the basics, you can leverage any of your data analytics skills to build on top of a solid platform.