Archive for: April 26th, 2018

In Qubole’s 2018 Data Activation Report, we did a deep-dive analysis of how companies are adopting and using different big data engines. As part of this research, we found some fascinating details about Hadoop that we will detail in the rest of this blog.

A common misconception in the market is that Hadoop is dying. However, when you hear people say this, they often mean that “MapReduce” as a standalone resource manager and “HDFS” as the primary storage component are dying. Beyond this, Hadoop as a framework is a core base for the entire big data ecosystem (Apache Airflow, Apache Oozie, Apache HBase, Apache Spark, Apache Storm, Apache Flink, Apache Pig, Apache Hive, Apache NiFi, Apache Kafka, Apache Sqoop…the list goes on).

I clipped this portion rather than the direct analysis because I think it’s an important point: the Hadoop ecosystem is thriving even as the focus shifts from what mattered a decade ago (batch processing of large amounts of data on servers with direct attached storage) to what matters today (a combination of batch and streaming processing of large amounts of data on virtualized and often cloud-based servers with network-attached flash storage).

When we do a transformation on any RDD, it gives us a new RDD. But it does not start the execution of those transformations. The execution is performed only when an action is performed on the new RDD and gives us a final result.

So once you perform any action on an RDD, the Spark context hands your program to the driver.

The driver creates the DAG (directed acyclic graph) or execution plan (job) for your program. Once the DAG is created, the driver divides this DAG into a number of stages. These stages are then divided into smaller tasks and all the tasks are given to the executors for execution.
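
To make the transformation-versus-action distinction concrete, here is a minimal sketch in plain Python. It is not Spark code: LazyDataset, double, and the sample numbers are hypothetical names invented for illustration, and the toy runs everything in one process rather than splitting the plan into stages and tasks across executors. It only shows that transformations are recorded and nothing runs until an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # recorded transformations, in order

    def map(self, fn):
        # Transformation: returns a new dataset, runs nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new dataset, runs nothing yet.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded plan actually executed.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every element that double() actually processes


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset([1, 2, 3, 4])
doubled = numbers.map(double)            # new dataset, no work done
big = doubled.filter(lambda x: x > 4)    # new dataset, still no work done
calls_before_action = len(calls)         # 0: nothing has executed so far
result = big.collect()                   # the action triggers execution
print(calls_before_action, result)
```

Each transformation returns a new object and leaves the original untouched, which mirrors the point that a transformation on an RDD gives you a new RDD.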

I wanted to know what he was up to, but the sql_text field only gives “xp_cmdshell”, not anything useful that might help to identify what went wrong.

So we have to go to Task Manager on the server. On the “Process Details” page, you can select which detail columns you want to see. We want to see the Command Line, as that’ll tell us if it’s some manually-launched batch job that’s failed or something else going wrong.

An alternative to using the Task Manager is to open ProcMon, part of the Sysinternals toolset. It takes a bit of getting used to, but is quite powerful once you know its ins and outs.

As discussed previously, SQL Server is not time zone aware, nor does it have to be. This is because the operating system that SQL Server runs on can have multiple custom regional settings depending on which user is logged into the server.

This holds true for the SQL Server service account as well, which is just another user on the operating system. When any of these functions is called, it is asking for the date and time from the operating system.

If you’re going to use DATETIME2 (which you generally should), take advantage of the precision that SYSUTCDATETIME() gives you over GETUTCDATE().
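
As a rough illustration of what that precision difference means, the Python sketch below emulates the older DATETIME type, which is what GETUTCDATE() returns and which is documented as rounding to increments of .000, .003, or .007 seconds. SYSUTCDATETIME() returns DATETIME2 with 100-nanosecond precision. The function name round_to_datetime_ticks and the sample timestamps are hypothetical, and the arithmetic is an approximation of that documented rounding rather than SQL Server’s actual implementation. Python’s own datetime stops at microseconds, so it cannot show DATETIME2’s seventh digit.

```python
from datetime import datetime, timedelta


def round_to_datetime_ticks(ts: datetime) -> datetime:
    """Approximate DATETIME storage: round to the nearest 1/300 of a second."""
    # Assumption: DATETIME keeps fractional seconds as 1/300 s ticks.
    ticks = round(ts.microsecond * 300 / 1_000_000)
    # Convert ticks back to whole milliseconds, giving .000/.003/.007 endings.
    millis = round(ticks * 1000 / 300)
    # timedelta handles the case where rounding carries into the next second.
    return ts.replace(microsecond=0) + timedelta(milliseconds=millis)


precise = datetime(2018, 4, 26, 12, 0, 0, 123456)  # microsecond-level value
coarse = round_to_datetime_ticks(precise)          # what DATETIME would keep

print(precise.isoformat())
print(coarse.isoformat())
```

The second value has lost everything past the millisecond, which is the detail you give up by calling GETUTCDATE() and storing the result in a DATETIME2 column.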

After making changes and testing your report, make sure to clear any slicer values before publishing. If you have row-level security on a field shown in a slicer and you leave values selected, the selected values will be shown to users when they view the report. For example, let’s say you have created a row-level security role that can only see Product A, but you can see everything, and you left Product A and Product B selected and deployed the report. A user who views the report next and is a member of that RLS role will see the two selected values in the slicer, even though they can’t see the data for Product B on the page. This may not be a big deal for an internal report. But now imagine this is for clients. You don’t want clients to see other clients in the list. This behavior is consistent in the Power BI web service and isn’t specific to embedding. It’s just important to remember this.

There are plenty of interesting notes here, so check it out if you’re thinking of a Power BI project.

My approach to teaching people to use Power Query is to always use the UI where possible. I first use the UI to do the hard work, then jump in and make small changes to the code created by the UI to meet any specific variations required. Keep this concept in mind as you read this article.

I am going to use Power BI Desktop as the tool for this, but of course Power Query for Excel will work just as well and the process is identical. In fact the calendar query at the end can easily be cut and pasted between Power BI and Power Query for Excel.

Check it out for another method for building calendar tables. I tend to build them in SQL Server because that’s what I’m most familiar with, but it’s good to know a few different ways of doing this.