August 22, 2016

In this blog post we describe how to tell a Kafka Streams application to reprocess its input data from scratch. This is actually a very common situation when you are implementing stream processing applications in practice, and it might be required for a number of reasons, including but not limited to: during development and testing, when addressing bugs in production, when doing A/B testing of algorithms and campaigns, when giving demos to customers or internal stakeholders, and so on.

The quick answer is that you can do this either manually (cumbersome and error-prone) or with the new application reset tool for Kafka Streams, which is an easy-to-use solution to the problem. The application reset tool is available starting with the upcoming Confluent Platform 3.0.1 and Apache Kafka 0.10.0.1.

In the first part of this post we explain how to use the new application reset tool. In the second part we discuss what is required for a proper (manual) reset of a Kafka Streams application. This part includes a deep dive into relevant Kafka Streams internals, namely internal topics, operator state, and offset commits. As you will see, these details make manually resetting an application a bit complex, hence the motivation to create an easy-to-use application reset tool.

Being able to reprocess streams is a critical part of the Kappa architecture, and this article is a nice overview of how to do that if you’re using Kafka Streams.

With CREATE permissions this isn't the case; there is a piece of the above template that isn't needed, and it was quite easy to see why once I sat down and thought about it.

Specifically, it’s this bit:

<On What>

I’m granting CREATE permissions; since I haven’t created anything, I can’t grant the permission on anything.

I like this post for the direct reason (granting certain permissions doesn't require specifying an object), but also for the implicit point: we build up internal systems of rules and processes as we act on things. This inductive reasoning tends to work well for us in most scenarios, but at some point our systems break down, and we find out either that we need to incorporate edge cases into our system, or that we were actually focusing on an edge case the entire time.
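To make the direct point concrete, here is a minimal T-SQL sketch (the principal and object names are hypothetical): database-scoped permissions such as CREATE TABLE are granted without an ON clause, while object-level permissions still name the securable.

```sql
-- CREATE TABLE is a database-scoped permission, so there is no ON clause:
-- there is nothing yet to grant it on. (TestUser is a hypothetical principal.)
GRANT CREATE TABLE TO TestUser;

-- An object-level permission, by contrast, names the securable it applies to.
GRANT SELECT ON dbo.SomeTable TO TestUser;
```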

The first thing to keep in mind is that ASDW was designed to be a cloud-based system. As such, it aims to be very flexible about resource allocation and very efficient to scale up or down. To meet those goals, the system allows you to:

Increase or decrease compute power, represented by Data Warehousing Units (see the T-SQL sketch after this list).

Grow storage as needed; it is charged independently from the compute power.

Pause compute entirely, at which point only storage is paid for.
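A minimal T-SQL sketch of the scaling piece (the database name is hypothetical; pausing is done from the Azure portal, PowerShell, or the REST API rather than from T-SQL):

```sql
-- Scale compute up or down by changing the service objective (the DWU level).
-- Run against the logical server's master database; MyDataWarehouse is a hypothetical name.
ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW400');

-- Check the current service objective for each database on the server.
SELECT d.name, dso.service_objective
FROM sys.databases AS d
JOIN sys.database_service_objectives AS dso
    ON d.database_id = dso.database_id;
```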

If you want to spin up an Apache Hadoop® cluster, you need to grapple with the question of how to attach your disks. Historically, this decision has favored direct attached storage (DAS). This approach is in keeping with the fundamental Hadoop principle of moving processing to where the data lives, thereby taking advantage of disk locality to optimize performance. Disk locality is so core to Hadoop that virtually any description of Hadoop starts with it.

The alternative is to use network attached storage (NAS). In contrast to DAS, NAS separates the compute and storage layers so that storage can be shared across a number of servers by shipping data over the network. Historically, this heavy dependence on the network made NAS an order of magnitude slower. Remember, the state of the art was 1GbE networks, and switches were slower and more expensive. I/O requirements for demanding Hadoop-based applications could only be met by DAS.

This is a very interesting discussion. In my limited experience, I’ve had trouble selling operations teams on DAS, given the increased ops effort required to keep a bunch of attached disks going. Hat tip Ari Amster.

So you are a DBA and you are in a virtual environment – VMware in particular. You are curious to know the health of the VMware hosts in terms of CPU and RAM, but you really don’t know how to get the data you need and you’re not certain if the information you are asking for is entirely accurate. Well, chances are you have access to the VMware databases themselves – if that is the case, you can create these reports based on a blog post from Jonathan Kehayias: “Querying the VMware vCenter Database (VCDB) for Performance and Configuration Information”.

I have created five reports that are based on Jonathan’s queries and you can download the RDL for the SSRS reports below – enjoy!
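To give a rough flavor of what querying the VCDB looks like, here is a purely hypothetical sketch (not one of Jonathan's queries or the report queries; the view and column names are assumptions and vary by vCenter version):

```sql
-- Hypothetical sketch: list the hosts recorded in the vCenter database (VCDB).
-- The view and column names below are assumptions and differ across vCenter versions;
-- see Jonathan's post for the exact, tested queries behind the reports.
SELECT h.NAME      AS HostName,
       h.CPU_COUNT AS CpuCount,
       h.MEM_SIZE  AS MemoryBytes
FROM dbo.VPXV_HOSTS AS h
ORDER BY h.NAME;
```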

The Linked Service for ML needs two pieces of information from the Web Service: the URL and the API key. Chances are neither of these has been committed to memory, so open up Azure ML, go to Web Services, and copy them. For the URL, look under the API Help Page grid, where there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page, Batch Execution API Document, and the URL can be found under Request URI. When copying the URL, you do not need to include any text after the word “jobs”; leave off the rest of the URL, “?api-version=2.0”, as copying the entire URL will cause an error. Going back to the Web Services page, the API Key appears on the dashboard section of Azure ML, and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory Linked Service that makes the connection to the web service, which here I called AzureMLLinkedService.

The bullet chart is a variation of a bar graph but designed to address some of the problems that gauges have.

Allows you to split chart by categories

Visuals can be vertical or horizontal

Some of the visualizations in this series have been hit-or-miss for me. I’m on the fence about bullet charts: they seem potentially useful, but also rather dense. I like my visuals to be self-explanatory, and I’d be concerned that if I showed this to management, I’d have to explain what’s going on in more detail than I’d like.

When I see those numbers in Microsoft marketing slides, I sometimes wonder if they can be real, but then I put these numbers together myself. Granted, you would get some discounts, but the fact that all of these features are built into SQL Server should convince you of the value SQL Server offers. Pricing discounts are generally similar between vendors, so that is not really a point of argument. If you are doing a really big Oracle deal you may see a larger upfront discount, but you will still be paying your 23% support fees on that very large list price. (Software Assurance from Microsoft will be around 20%, but from a much lower base.) Additionally, several of these features are available in SQL Server Standard Edition. None of these features are in Oracle’s Standard Edition.

Postgres is a really good database engine, with a rich ecosystem of developers writing code for it. SQL Server, on the other hand, is a mature product that has had a large push to support analytic performance and scale.

Additionally, this customer is leveraging the Azure ecosystem as part of their process, and that is only possible via SQL Server’s tight integration with the platform.

This isn’t a direct comparison meant to determine in some absolute sense which product is better, but rather a look at a use case from a customer that takes advantage of many of the features in SQL Server.

Page Compression is what I like to refer to as “compression for real this time,” as it goes well beyond the smart storage method of row compression and uses patterns/repeating values to condense the stored data.

First, to gain a better understanding of this method, check out a simple representation of a page of data. This is illustrated below in Figure 1. You’ll notice that there are some repeating values (e.g. SQLR) and some repeated strings of characters (e.g. SSSLL).

I really appreciate getting an idea of what kind of data does not compress well. You’d think auto-incrementing numbers would be another scenario, but Melissa explains how that’s not necessarily the case.
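If you want to try this yourself, a minimal sketch for estimating and then enabling page compression looks like the following (the table name is hypothetical):

```sql
-- Estimate how much space page compression would save (dbo.FactSales is a hypothetical table).
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'FactSales',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- Rebuild the table with page compression enabled.
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);
```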

In this case, the error message is quite clear: there is more than one row in the source (staging) that matches a single row in the target (data warehouse). When we are warehousing data, we set up key fields that allow us to match up a record in staging to a record in the data warehouse. In most systems, you can use the source system’s primary key to accomplish this. After all, most systems use an RDBMS of some sort to store data. However, in this case the source data is from a SharePoint list, and the only source key available is a list item ID.

So why are we not using that? There is a very simple answer: end users delete old data from the list, which can lead SharePoint to recycle ID values. If an ID gets recycled, then the data warehouse will either improperly overwrite data in the fact table or discard the new row as a duplicate, depending on how we configure the extract routine.
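A quick way to confirm that kind of ambiguity is to look for source keys that appear more than once in staging; this is a minimal sketch, with hypothetical table and column names:

```sql
-- Find source keys that appear more than once in staging; any row returned here
-- would make the MERGE ambiguous. Table and column names are hypothetical placeholders.
SELECT ListItemID, COUNT(*) AS RowsPerKey
FROM staging.SharePointList
GROUP BY ListItemID
HAVING COUNT(*) > 1;
```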

Figuring out the cause of the problem is a multi-step process, as Jesse shows.