
Last month in New York City, it was goodbye Strata+Hadoop World, welcome Strata Data Conference! While the name of one of the most important big data industry events has changed, its importance has not. Clearly, Hadoop was no longer the only (or even predominant) elephant in the room at this year’s conference, and even … Read more

Cask Tracker is a self-service CDAP extension that automatically captures rich metadata and gives users visibility into how data flows into, out of, and within a Data Lake. Tracker was first introduced in CDAP v3.4. Tracker v0.2 has just been released along with CDAP v3.5 and packs a ton of new features. Dataset … Read more

It was so great to see everyone at the Big Data Applications Meetup last week! The meetup was sponsored by Cask, the company making big data applications easy, and by Ampool, and we would like to thank Milind Bhandarkar, the Founder and CEO of Ampool, for supporting this event. For those who couldn’t join us, we … Read more

The Big Data community has long searched for a good abstraction to express data processing pipelines, and one possible answer may now have emerged. For ad-hoc querying of data, the standard is clearly SQL and, not surprisingly, SQL has found its way back into the “NoSQL” world in various incarnations of … Read more

Hadoop is a collection of 47+ components. Recently, Andreas Neumann and Merv Adrian, in their respective blogs, discussed what makes a Hadoop technology Hadoop. They both did a great job of asking the right questions and presenting the facts about what makes up Hadoop today. While Andreas focused on picking the right technologies … Read more

Cask is excited to announce easy CDAP integration for Apache Ambari users. Previously, we introduced you to integration with Cloudera Manager. This post will walk you through integration with Apache Ambari, the open source provisioning system for the Hortonworks Data Platform (HDP). Adding the CDAP service to Ambari: To install CDAP on a cluster managed by … Read more
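Ambari manages clusters through a REST API, so adding a service ultimately comes down to an authenticated POST against the cluster's services endpoint. A minimal sketch of constructing that request follows; the host, cluster name, and the registration of "CDAP" as a service name here are illustrative assumptions, not the exact steps from the post:

```python
import json

AMBARI_SERVICES_URL = "http://{host}:8080/api/v1/clusters/{cluster}/services"

def build_add_service_request(host, cluster, service):
    """Build the URL, headers, and JSON body for Ambari's
    'add service' REST call (POST /api/v1/clusters/<cluster>/services)."""
    url = AMBARI_SERVICES_URL.format(host=host, cluster=cluster)
    headers = {
        "X-Requested-By": "ambari",   # Ambari rejects requests without this header
        "Content-Type": "application/json",
    }
    body = json.dumps({"ServiceInfo": {"service_name": service}})
    return url, headers, body

# Hypothetical host and cluster; "CDAP" as the service to register
url, headers, body = build_add_service_request("ambari.example.com", "mycluster", "CDAP")
print(url)
```

The returned pieces would then be sent with any HTTP client (or an equivalent `curl -u admin:admin -X POST` call) by an operator with Ambari admin credentials.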

Coopr is a cluster provisioning system designed to manage the full cluster lifecycle in public and private clouds. In this blog, we will take an inside look at what happens when Coopr provisions a cluster. Deploying clusters can be time-consuming. For many system deployments, this work can be accomplished with a configuration management tool such … Read more

Collecting and exposing metrics is a must-have for any application platform, and it matters even more for distributed systems. In this post we will examine some aspects of designing metrics systems for a distributed application platform and take a brief look at one built for the open … Read more
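One recurring design in distributed metrics systems is to key counters by an emitting context and aggregate them into fixed-size time buckets, so queries over a time range reduce to summing buckets. A toy single-process sketch of that idea (the context names and resolution are made up, and this is not the actual system described in the post):

```python
import time
from collections import defaultdict

class MetricsCollector:
    """Toy in-memory metrics store: counters are keyed by (context, name)
    and aggregated into fixed-size time buckets."""

    def __init__(self, resolution_seconds=60):
        self.resolution = resolution_seconds
        self.buckets = defaultdict(int)  # (context, name, bucket_start) -> count

    def increment(self, context, name, value=1, ts=None):
        ts = time.time() if ts is None else ts
        bucket_start = int(ts) - int(ts) % self.resolution  # align to bucket boundary
        self.buckets[(context, name, bucket_start)] += value

    def query(self, context, name, start_ts, end_ts):
        """Sum a counter over buckets starting in [start_ts, end_ts)."""
        return sum(
            count for (c, n, b), count in self.buckets.items()
            if c == context and n == name and start_ts <= b < end_ts
        )

collector = MetricsCollector(resolution_seconds=60)
collector.increment("flow.reader", "events.in", 5, ts=120)
collector.increment("flow.reader", "events.in", 3, ts=130)   # lands in the same bucket
collector.increment("flow.reader", "events.in", 2, ts=200)   # next bucket
print(collector.query("flow.reader", "events.in", 60, 240))
```

Bucketing trades fine-grained timestamps for bounded storage, which is usually the right trade in a platform that must retain metrics for many applications at once.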

Developing features for CDAP follows a workflow similar to that of many other projects. Developers have a local checkout of the source, make modifications in a feature branch, build and test locally on their development machines, push the branch, and submit a pull request for code review. During this process, developers build CDAP clusters (for testing) … Read more

Hadoop provides specialized tools and technologies for transporting and processing huge amounts of weblog data. In this blog, we’ll explore the end-to-end process of aggregating logs, processing them, and generating analytics on Hadoop to gain insights into how users interact with your website. With the digitization of the world, generating knowledge … Read more
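The core of weblog analytics is parsing access-log lines and aggregating by some dimension, such as requested path. A minimal single-machine sketch of that step (the log lines and simplified Apache-style pattern are illustrative; at scale the same per-path aggregation would run as a MapReduce or Spark job):

```python
import re
from collections import Counter

# Simplified pattern for an Apache common/combined-style access log line
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

def aggregate_hits(lines):
    """Count requests per path, skipping lines that do not parse."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            hits[m.group("path")] += 1
    return hits

sample = [
    '1.2.3.4 - - [10/Oct/2016:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '1.2.3.4 - - [10/Oct/2016:13:55:40 -0700] "GET /products HTTP/1.1" 200 512',
    '5.6.7.8 - - [10/Oct/2016:13:56:01 -0700] "GET /index.html HTTP/1.1" 404 209',
    'garbage line that does not match',
]
print(aggregate_hits(sample).most_common(1))
```

The same map-then-count shape is what the pipeline in the post distributes across a cluster when log volume outgrows a single machine.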