This topic is where I actually planned to pivot to for quite a while, but either had no bandwith\time with moving to Seattle or some other excuse like that. This topic is an interesting to me as its intersection of technologies where I used to spend quite a bit of time including:

Data, including both traditional RDBMS databases and NoSQL designs

Cloud, including Microsoft Azure, AWS and GCP

Containers and container orchestration (that is area new to me)

Those who worked with me either displayed various degrees of agreement, but more often of annoyance, on me pushing this topic as growing part of data engineering discipline, but I truly believe in it. But before we start combining all these areas and do something useful in code I will spend some time here with most basic concepts of what I am about to enter here.

I will skip introduction to basic concepts of RDBMS, NoSQL or BigData\Hadoop technologies, as if you didn’t hide under the rock for last five years, you should be quite aware of those. That brings us straight to containers:

As definition states – “A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings. Available for both Linux and Windows based apps, containerized software will always run the same, regardless of the environment. Containers isolate software from its surroundings, for example differences between development and staging environments and help reduce conflicts between teams running different software on the same infrastructure.”

Containers are a solution to the problem of how to get software to run reliably when moved from one computing environment to another. This could be from a developer’s laptop to a test environment, from a staging environment into production, and perhaps from a physical machine in a data center to a virtual machine in a private or public cloud.

But why containers if I already have VMs?

VMs take up a lot of system resources. Each VM runs not just a full copy of an operating system, but a virtual copy of all the hardware that the operating system needs to run. This quickly adds up to a lot of RAM and CPU cycles. In contrast, all that a container requires is enough of an operating system, supporting programs and libraries, and system resources to run a specific program.
What this means in practice is you can put two to three times as many as applications on a single server with containers than you can with a VM. In addition, with containers you can create a portable, consistent operating environment for development, testing, and deployment.

On Linux, containers run on top of LXC. This is a userspace interface for the Linux kernel containment features. It includes an application programming interface (API) to enable Linux users to create and manage system or application containers.

Docker is an open platform tool to make it easier to create, deploy and to execute the applications by using containers.

The heart of Docker is Docker engine. The Docker engine is a part of Docker which create and run the Docker containers. The docker container is a live running instance of a docker image. Docker Engine is a client-server based application with following components :

A server which is a continuously running service called a daemon process.

A REST API which interfaces the programs to use talk with the daemon and give instruct it what to do.

A command line interface client.

The command line interface client uses the Docker REST API to interact with the Docker daemon through using CLI commands. Many other Docker applications also use the API and CLI. The daemon process creates and manage Docker images, containers, networks, and volumes.

Docker client is the primary service using which Docker users communicate with the Docker. When we use commands “docker run” the client sends these commands to dockerd, which execute them out.

You will build Docker images using Docker and deploy these into what are known as Docker registries. When we run the docker pull and docker run commands, the required images are pulled from our configured registry directory.Using Docker push command, the image can be uploaded to our configured registry directory.

Finally deploying instances of images from your registry you will deploy containers. We can create, run, stop, or delete a container using the Docker CLI. We can connect a container to more than one networks, or even create a new image based on its current state.By default, a container is well isolated from other containers and its system machine. A container defined by its image or configuration options that we provide during to create or run it.

That brings us to container orchestration engines. While the CLI meets the needs of managing one container on one host, it falls short when it comes to managing multiple containers deployed on multiple hosts. To go beyond the management of individual containers, we must turn to orchestration tools. Orchestration tools extend lifecycle management capabilities to complex, multi-container workloads deployed on a cluster of machines.

Some of the well known orchestration engines include:

Kubernetes. Almost a standard nowdays , originally developed at Google. Kubernetes’ architecture is based on a master server with multiple minions. The command line tool, called kubecfg, connects to the API endpoint of the master to manage and orchestrate the minions. The service definition, along with the rules and constraints, is described in a JSON file. For service discovery, Kubernetes provides a stable IP address and DNS name that corresponds to a dynamic set of pods. When a container running in a Kubernetes pod connects to this address, the connection is forwarded by a local agent (called the kube-proxy) running on the source machine to one of the corresponding backend containers.Kubernetes supports user-implemented application health checks. These checks are performed by the kubelet running on each minion to ensure that the application is operating correctly.

Apache Mesos. This is an open source cluster manager that simplifies the complexity of running tasks on a shared pool of servers. A typical Mesos cluster consists of one or more servers running the mesos-master and a cluster of servers running the mesos-slave component. Each slave is registered with the master to offer resources. The master interacts with deployed frameworks to delegate tasks to slaves. Unlike other tools, Mesos ensures high availability of the master nodes using Apache ZooKeeper, which replicates the masters to form a quorum. A high availability deployment requires at least three master nodes. All nodes in the system, including masters and slaves, communicate with ZooKeeper to determine which master is the current leading master. The leader performs health checks on all the slaves and proactively deactivates any that fail.

Over the next couple of posts I will create containers running SQL Server and other data stores, both RDBMS and NoSQL, deploy these on the cloud and finally attempt to orchestrate these hopefully as well. So lets move off theory and pictures into world of parctical data engine deployments.

Although I am mainly Microsoft and Azure centric guy,after reading a bit on Google Big Query it got me interested. Therefore I decided to explore it a bit more as well here.

BigQuery is a RESTful web service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. It is an Infrastructure as a Service (IaaS) that may be used complementarily with MapReduce. BigQuery (BQ) is reportedly based on Dremel,a scalable, interactive ad hoc query system for analysis of read-only nested data. To use the data in BigQuery, it first must be uploaded to Google Storage and in a second step imported using the BigQuery HTTP API. BigQuery requires all requests to be authenticated, supporting a number of Google-proprietary mechanisms as well as OAuth.

Lets start with what BigQuery is not. Its not either RDBMS nor MapReduce technology. As I stated BigQuery is based on internal Google technology called Dremel. Dremel is a query service that allows you to run SQL-like queries against very,very large data sets and get accurate results in mere seconds. You just need a basic knowledge of SQL to query extremely large datasets in an ad hoc manner. At Google, engineers and non-engineers alike, including analysts, tech support staff and technical account managers, use this technology many times a day. BigQuery provides the core set of features available in Dremel to third party developers. It does so via a REST API, a command line interface, a Web UI, access control and more, whilemaintaining the unprecedented query performance of Dremel.

Why BigQuery is so fast? The answer can be found in two core technologies which gives BQ this unprecedented performance:

Columnar Storage. Data is stored in a columnar storage fashion which makes possible to achieve very high compression ratio and scan throughput

Tree Architecture is used for dispatching queries and aggregating results across thousands of machines in a few seconds.

BigQuery stores data in its columnar storage, which means it separates a record into column values and stores each value on different storage volume, whereas traditional RDBMS normally store the whole record on one volume

Actually this technology isnt that new for anyone who dealt with DW technologies for a while, it’s a fairly hot topic today with SQL Server Column-Store Indexes and SAP HANA In-Memory Column Store for example. As you may know Column Storage has following advantages:

Traffic minimization. Only required column values on each query are scanned and transferred on query execution. For example, a query “SELECT top(title) FROM foo” would access the title column values only.

Higher compression ratio. One study reports that columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3. Because each column would have similar values, especially if the cardinality of the column (variation of possible column values) is low, it’s easier to gain higher compression ratios than row-based storage.

Columnar storage has the disadvantage of not working efficiently when updating existing records. In the case of BigQuery, it simply doesn’t support any update operations. Thus the technique has been used mainly in read-only OLAP/BI type of usage. Although the technology has been popular as a data warehouse database design, Dremel\BigQuery is one of the first implementations of a columnar storage-based analytics system that harnesses the computing power of many thousands of servers and is delivered as a cloud service.

One of the challenges Google had in designing Dremel\BigQuery was how to dispatch queries and collect results across tens of thousands of machines in a matter of seconds. The challenge was resolved by using the Tree architecture. The architecture forms a massively parallel distributed tree for pushing down a query to the tree and then aggregating the results from the leaves at a blazingly fast speed.

By leveraging this architecture, Google was able to implement the distributed design for Dremel\BigQuery and realize the vision of the massively parallel columnar-based database on the cloud platform.

BigQuery provides the core set of features available in Dremel to third party developers. It does so via a REST API, command line interface, Web UI,access control, data schema management and the integration with Google Cloud Storage. BigQuery and Dremel share the same underlying architecture and performance characteristics. Users can fully utilize the power of Dremel by using BigQuery to take advantage of Google’s massive computational infrastructure. This incorporates valuable benefits like multiple replication across regions and high data center scalability. Most importantly, this infrastructure requires no management by the developer.

So why BigQuery over MapReduce? The difference here is MapReduce is batch based programming framework for very large datasets, whereas BIgQuery is an interactive data query tool for large datasets

Ok, how do I use it? Assuming you already have Google Cloud account you will have to create a new project from the dropdown in your Google Cloud Console.

Once that is done, you can navigate to project console and enable BigQuery APIs for use with your project:

Now in you left side menu you can pick BigQuery from Big Data offerings.

Before we can run any queries, we need some data! There are a couple of options here:

Load you own data

Use Google provided public data

For now I will settle on the second choice. I will pick Shakespeare dataset here, This dataset contains the words in Shakespeare’s works, the word_count for each word, in which corpus the word appears, and the date the corpus was written. First I will issue a simple SELECT. SELECT is the most basic clause and specifies what it is that you want to be returned by the query. FROM specifies what dataset we are using.

SELECT corpus
FROM (publicdata:samples.shakespeare)
GROUP BY corpus

Let’s switch gears. Say we want to count something – say, the number of words in Shakespeare’s works. Luckily, we have word_count, which represents how many times a particular word appeared in a particular corpus. We can just sum all of these values, and we are left with the total number of words that he wrote.

SELECT SUM(word_count) AS count
FROM (publicdata:samples.shakespeare)

945,845 words! Pretty good – but there must surely be some duplicates. How would we query the number of unique words that he used?

SELECT COUNT(word) AS count, word
FROM publicdata:samples.shakespeare
GROUP BY word
ORDER BY count Desc

Here we use the COUNT function to count how many words there are, and group them by word so as not to show duplicates. 32,786 unique words. Moreover I order these by mostly used words in descending order.

Ok, lets finally add WHERE clause. For example I wonder how many times Shakespeare is using word “Sir”:

SELECT word, SUM(word_count) as count
FROM (publicdata:samples.shakespeare)
WHERE word = "Sir"
GROUP BY word
ORDER BY count DESC

From results I can state that Shakespeare was really polite guy:

This of course is pretty basic. However, BigQuery now does allow for joins. I could take my large table and join it to smaller lookup table using standard ANSI SQL join syntax

There are also some complex functions it supports like:

In the next part I am planning to upload a dataset and do some joins and more complex processing.

I realize that this post is pretty basic, however I think it may be useful to someone starting to develop Web applications on Google App Engine. In my previous post I shown you how you can use NetBeans to develop for App Engine, this time however I will use Eclipse as with real Google plug-in it’s a lot more slick, despite my preference for NetBeans as Java IDE. In general I am a server side vs. UI guy so my servlet will be very basic, but that I don’t think is the point here as servlets in general are quite easy to do and make more more functional.

So you decided to build Web App on App Engine? First you need to install Google Plug-In for Eclipse for AppEngine. Installation instructions are available here. In Eclipse Go To Help Menu –>Install Software and add proper link as per table in Google doc: If you are using latest Eclipse (Luna) full instructions are here – https://developers.google.com/eclipse/docs/install-eclipse-4.4

After you install Plug-In you will see new icon on your toolbar in Eclipse:

Next I will create a Dynamic Web Project in Eclipse to house my Servlet. Go To File –> New –>Dynamic Web Project. Resulting dialog looks like this:

After you complete this step, the project-creation wizard will create a simple servlet-based application featuring a Hello World-type servlet. After you set that up you should see your project structure:

Now on to Java Servlets. Java servlet is a Java programming language program that extends the capabilities of a server. Although servlets can respond to any types of requests, they most commonly implement applications hosted on Web servers. Such Web servlets are the Java counterpart to other dynamic Web content technologies such as PHP and ASP.NET.

servlet is a Java class that implements the Servlet interface. This interface has three methods that define the servlet’s life cycle:

public void init(ServletConfig config) throws ServletException
This method is called once when the servlet is loaded into the servlet engine, before the servlet is asked to process its first request.

public void service(ServletRequest request, ServletResponse response) throws ServletException, IOException
This method is called to process a request. It can be called zero, one or many times until the servlet is unloaded. Multiple threads (one per request) can execute this method in parallel so it must be thread safe.

public void destroy()
This method is called once just before the servlet is unloaded and taken out of service.

The init method has a ServletConfig attribute. The servlet can read its initialization arguments through the ServletConfig object. How the initialization arguments are set is servlet engine dependent but they are usually defined in a configuration file.

The doGet method has two interesting parameters: HttpServletRequest and HttpServletResponse. These two objects give you full access to all information about the request and let you control the output sent to the client as the response to the request.

The servlet gets mapped under the URI /simpleservletapp in the web.xml, as you can see below:

The project-creation wizard also provides an index.html file that has a link to the new servlet.

Before I deploy this to App Engine I probably want to debug it locally. Lets find Debug button on toolbar

Hit dropdown arrow and you can setup local server for debugging. If your application uses App Engine but not GWT, the only indication that the App Engine development server is running will be output in the Console view. App Engine-only launches will not appear in the development mode view. The console output includes the URL of the server, which by default is http://localhost:8888/. You can change the port number via Eclipse’s launch configuration dialog by selecting your Web Application launch and editing the Port value on the Main tab.

Simple result here:

Now lets upload this “application” to App Engine. Since I already configured development server I used Google Engine WPT Deploy. Right click on the project and you can pick that option:

Something I learned along the way – At this time you cannot run Java 8 on GAE. My IT department is a stickler for latest Java installed on my work laptop. Therefore I am running Java 8 SDK and JRE. However when I first build this application in Java 8 and deployed to App Engine trying to run it I got http 500 result. Looking at the logs in dashboard I can see following error:

2015-05-29 16:49:22.442
Uncaught exception from servlet
java.lang.UnsupportedClassVersionError: com/example/myproject/HelloAppEngineServletServlet : Unsupported major.minor version 52.0
at com.google.appengine.runtime.Request.process-ee32bdc226f79da9(Request.java)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:817)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.lang.ClassLoader.loadClass(ClassLoader.java:375)
at org.mortbay.util.Loader.loadClass(Loader.java:91)
at org.mortbay.util.Loader.loadClass(Loader.java:71)
at org.mortbay.jetty.servlet.Holder.doStart(Holder.java:73)
at org.mortbay.jetty.servlet.ServletHolder.doStart(ServletHolder.java:242)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:685)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1250)
at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517)
at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:467)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at com.google.tracing.TraceContext$TraceContextRunnable.runInContext(TraceContext.java:437)
at com.google.tracing.TraceContext$TraceContextRunnable$1.run(TraceContext.java:444)
at com.google.tracing.CurrentContext.runInContext(CurrentContext.java:230)
at com.google.tracing.TraceContext$AbstractTraceContextCallback.runInInheritedContextNoUnref(TraceContext.java:308)
at com.google.tracing.TraceContext$AbstractTraceContextCallback.runInInheritedContext(TraceContext.java:300)
at com.google.tracing.TraceContext$TraceContextRunnable.run(TraceContext.java:441)
at java.lang.Thread.run(Thread.java:745)

Did a bit of digging and here is what I found out.

I believe these are the current version numbers:

J2SE 7 = 51, //Note this one

J2SE 6.0 = 50,

J2SE 5.0 = 49,

JDK 1.4 = 48,

JDK 1.3 = 47,

JDK 1.2 = 46,

JDK 1.1 = 45

51.0 appears to be Java 7, which would mean in my case I am using 8 as its seeing 52. So issue here is that I compiled on higher version JDK (52 = Java 8) and then execute it on lower JRE version (GAE uses java 7). Repointing PATH to Java 7 did the trick, moreover as it was suggested elsewhere on the web I just created a batch file for this purpose as

Although I am fairly Microsoft centric and Azure centric in my interests I have decided to give a quick try to a competing alternative using language that I am familiar with – Java and IDE that I know fairly well for it – NetBeans. I wanted to try something quick and easy, but my first task was making sure that NetBeans actually can deploy code to AppEngine.

First things first. Navigate to developers.google.com and sign up to enable your Google Account for use with Google App Engine. Create a project, make sure you write down Project ID and name. Download and unzip the Google App Engine SDK to a folder on your hard drive. Obviously here I am using Java SDK.

Now, I will assume that you have NetBeans already installed. If not it can be installed from here – https://netbeans.org/ . It is my favorite IDE when it comes to non-Microsoft or Java IDEs at this time.

If you’re using NetBeans 6.9 you must also install the Java Web Applications plugin for NetBeans.

After you are done you should see plugins in your installed tab like this:

To install the Google App Engine service in NetBeans, follow these instructions:

Start NetBeans

Click on the Services tab next to Projects and Files

Right-click on Servers and click Add

Select Google App Engine and Click Next

Select the location you unzipped the Google App Engine SDK and Click Next

Unless you have another service running on port 8080 and port 8765 leave the default port values

Finally Click Finish

After you are done you should see Google AppEngine in your Servers”

Next I created really easy small JSP file that I want to deploy. First thing lets take a look at appengine-web.xml settings file. As per Google docs –“ An App Engine Java app must have a file named appengine-web.xml in its WAR, in the directory WEB-INF/. This is an XML file whose root element is .” My config looks like this:

Frankly the experience for me wasn’t as seamless as I would like (perhaps I am spoiled a bit by Visual Studio Azure SDK integration for Azure PaaS), I may have had an easier time with official Google plugin and Eclipse, but as stated I like NetBeans more.