2015 Top Blogs

At Cask, we are passionate about software development and developer productivity. We build software that we’d want to use, and most of the software we have built is a result of “scratching our own itch”. For instance, we built a cluster provisioning system – Coopr – to make it easy to provision clusters for internal … Read more

We introduced Cask Hydrator to provide an easy way for users to build a data lake. Users can create ETL pipelines by simply choosing a source, one or more sinks, and optional transforms. When we were designing Hydrator, we wanted to make sure that it was easy to use; users should be able to configure … Read more
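A Hydrator pipeline, then, is essentially a triple of one source, optional transforms, and one or more sinks. As a rough sketch only — the field names and plugin names below are illustrative, not Hydrator’s actual configuration format — such a pipeline could be described as:

```json
{
  "source":     { "name": "Stream", "properties": { "name": "purchases" } },
  "transforms": [
    { "name": "Projection", "properties": { "drop": "debugInfo" } }
  ],
  "sinks":      [
    { "name": "Table", "properties": { "name": "purchaseRecords" } }
  ]
}
```

The point of the design is that the user only fills in this kind of declarative description; the platform handles scheduling and execution.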

The Cask Data Application Platform (CDAP) is an open-source platform to build and deploy data applications on Apache Hadoop™. As of version 3.0, it includes a slick new user interface to help users deploy, manage and monitor their data applications. This UI provides real-time updates from the CDAP backend. Problem Statement Initiating too many HTTP … Read more

Apache Oozie is a workflow scheduler system to manage Apache Hadoop™ jobs. It is one of the most popular open-source workflow scheduler systems for Hadoop. Cask Data Application Platform (CDAP) is an open-source platform to build and deploy data applications on Hadoop. CDAP provides abstractions on top of Hadoop that enable developers to rapidly build, … Read more

Java class loading is one of the most fundamental and powerful concepts provided by the Java Platform. Understanding the class loading mechanism helps you when designing and building extensible application frameworks. It can also save you many hours of debugging exceptions such as ClassCastException and ClassNotFoundException, among others. In this post, we will talk about … Read more
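The basics that the post builds on can be seen directly from the JDK; a minimal sketch (standard Java, nothing CDAP-specific):

```java
// Demonstrates the class loader hierarchy and the ClassNotFoundException
// mentioned above. Every class records the loader that defined it.
public class ClassLoadingDemo {
    public static void main(String[] args) {
        // Application classes are defined by the application class loader.
        System.out.println("App loader: " + ClassLoadingDemo.class.getClassLoader());

        // Core classes like java.lang.String are defined by the bootstrap
        // loader, which the API represents as null.
        System.out.println("String loader: " + String.class.getClassLoader()); // null

        // Asking any loader for a class that is not on its classpath
        // fails with ClassNotFoundException.
        try {
            Class.forName("com.example.DoesNotExist");
        } catch (ClassNotFoundException e) {
            System.out.println("Not found: " + e.getMessage());
        }
    }
}
```

Extensible frameworks build on exactly these rules: a plugin loaded by one class loader is a different runtime type than the “same” class loaded by another, which is where ClassCastException often enters the picture.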

One of the many things that I love about Cask is the hackathons before every release. They are not only a way for us to dog-food new features in the CDAP platform, but also an opportunity to let our imagination run loose and implement an integration with another system; or develop an interesting … Read more

A real-time stream processing framework usually involves two fundamental constructs: processors and queues. A processor reads events from a queue, executes user code to process them, and optionally writes events to another queue for additional downstream processors to consume. Queues are provided and managed by the framework. Queues transfer data and act as a … Read more
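The processor-and-queue model above can be sketched in plain Java; the names here are illustrative, and real frameworks use durable, managed queues rather than the in-memory ones shown:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

public class ProcessorQueueSketch {
    // A processor reads events from an input queue, executes user code on
    // each event, and optionally writes results to an output queue that
    // feeds downstream processors.
    static <I, O> void process(Queue<I> in, Function<I, O> userCode, Queue<O> out) {
        I event;
        while ((event = in.poll()) != null) {
            O result = userCode.apply(event); // execute user code
            if (out != null) {
                out.add(result);              // hand off downstream
            }
        }
    }

    public static void main(String[] args) {
        Queue<String> source = new ArrayDeque<>(List.of("hello", "stream"));
        Queue<Integer> downstream = new ArrayDeque<>();
        // User code for this processor: compute the length of each event.
        process(source, String::length, downstream);
        System.out.println(downstream); // [5, 6]
    }
}
```

Chaining several such processors through intermediate queues gives the topology the framework manages for you.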

In this post, we will walk you through performing common ETL tasks on Hadoop using the open-source Cask Data Application Platform. A typical ETL pipeline consists of a data source, followed by a transformation used for filtering or cleaning data, and ending in a data sink. For example, an organization might take a snapshot of their … Read more
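The source → transform → sink shape described above can be sketched in plain Java; this is a conceptual sketch only, not CDAP’s API:

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class EtlPipelineSketch {
    // Read records from the source, filter/clean them with the
    // transformation, and emit the result to the sink.
    static List<String> run(List<String> source,
                            Predicate<String> filter,
                            Function<String, String> transform) {
        return source.stream()
                .filter(filter)
                .map(transform)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> source = List.of("alice,23", "", "bob,35");
        // Clean: drop empty records; transform: upper-case each record.
        List<String> sink = run(source,
                rec -> !rec.isEmpty(),
                String::toUpperCase);
        System.out.println(sink); // [ALICE,23, BOB,35]
    }
}
```

On Hadoop, each stage would typically be a distributed job over files or tables rather than an in-memory list, but the pipeline shape is the same.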

One of the most-cited advantages of Hadoop is that it enables a “schema-on-read” data analysis strategy. “Schema-on-read” means you do not need to know how you will use your data when you are storing it. This allows you to innovate quickly by asking different and more powerful questions after storing the data. However, few people … Read more
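As a toy illustration of schema-on-read (plain Java, not tied to any Hadoop API): the bytes are stored exactly as they arrive, and a schema is applied only when a reader parses them, which is what lets you ask new questions of old data:

```java
import java.nio.charset.StandardCharsets;

public class SchemaOnReadSketch {
    // Write time: store raw bytes; no schema is enforced here.
    static byte[] store(String rawLine) {
        return rawLine.getBytes(StandardCharsets.UTF_8);
    }

    // Read time: the reader imposes a schema (here, comma-separated
    // fields) and picks out the field it cares about. A different
    // reader could parse the same bytes with a different schema.
    static String readField(byte[] raw, int index) {
        String[] fields = new String(raw, StandardCharsets.UTF_8).split(",");
        return fields[index];
    }

    public static void main(String[] args) {
        byte[] stored = store("2015-06-01,checkout,42.50");
        // One question asked today: what event happened?
        System.out.println(readField(stored, 1)); // checkout
        // A new question asked later, against the same stored bytes.
        System.out.println(readField(stored, 2)); // 42.50
    }
}
```

Contrast this with schema-on-write systems, where the second question is only answerable if the schema anticipated it when the data was stored.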