The AU Analyzer looks at all the vertices (or nodes) in your job, analyzes how long each one ran and what it depended on, and then models how long the job might run if a certain number of vertices could run at the same time. Each vertex may have to wait for input or for its spot in line to run. The AU Analyzer isn’t 100% accurate, but it provides general guidance to help you choose the right number of AUs (Analytics Units) for your job.

You’ll notice that there are diminishing returns when assigning more AUs, mainly because of input dependencies and the running times of the vertices themselves. A job with 10,000 total vertices likely won’t be able to use 10,000 AUs at once, since some vertices must wait for their input or for the vertices they depend on to complete.

The graph below shows what the modeler might produce when it considers the different options. Notice that once the job is assigned 1427 AUs, assigning more won’t reduce the running time; 1427 is the “peak” number of AUs the job can actually use.
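The modeler’s internals aren’t spelled out here, but the core idea is easy to sketch. Below is a minimal Python simulation (my own illustration, not the AU Analyzer’s actual algorithm) that replays a job’s vertices against a fixed number of AUs, assuming each vertex has a known duration and a list of vertices it waits on:

```python
import heapq

def estimate_runtime(durations, deps, num_aus):
    """Estimate job runtime when at most `num_aus` vertices run at once.

    durations: dict of vertex -> running time (seconds)
    deps:      dict of vertex -> set of vertices it must wait for
    A simple list-scheduling simulation; the real AU Analyzer models more
    than this, but the diminishing-returns behavior is the same idea.
    """
    waiting_on = {v: set(deps.get(v, ())) for v in durations}
    ready = [v for v, d in waiting_on.items() if not d]   # runnable right now
    running = []                                          # min-heap of (finish_time, vertex)
    clock = 0.0

    while ready or running:
        # Start as many ready vertices as free AUs allow.
        while ready and len(running) < num_aus:
            v = ready.pop()
            heapq.heappush(running, (clock + durations[v], v))
        # Jump to the next completion and release anything that was waiting on it.
        clock, done = heapq.heappop(running)
        for v, d in waiting_on.items():
            if done in d:
                d.discard(done)
                if not d:
                    ready.append(v)
    return clock

# Toy job: B and C depend on A; D depends on both B and C.
durations = {"A": 10, "B": 30, "C": 20, "D": 5}
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
for aus in (1, 2, 4):
    print(aus, estimate_runtime(durations, deps, aus))
# 1 AU -> 65s; 2 or 4 AUs -> 45s. The "peak" here is 2 AUs, because B and C
# are the only vertices that can ever overlap.
```

Sweeping the AU count from 1 upward and plotting the estimates produces exactly the kind of curve the AU Analyzer draws, flattening out at the peak.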

I like this kind of tooling, as it provides a realistic assessment of tradeoffs.

The diagram below captures the building blocks at a high level. To tie this back to a fictitious self-flying drone company: the company collects tons of raw images from the test drones’ built-in cameras for computer vision. Those images can be stored in the Apache Hadoop data lake in a manner that is cost-effective (thanks to erasure coding) yet highly available (multiple standby NameNodes). Instead of giving each data scientist a dedicated GPU machine, GPU cards are pooled across the cluster, and the cards in each server can be isolated so that multiple users can share them.

Support for Docker containerized workloads means that data scientists and data engineers can bring their deep learning frameworks to the Apache Hadoop data lake, with no need for a separate compute/GPU cluster. GPU pooling lets them apply deep learning algorithms and train data-intensive models on the data collected in the lake at speeds almost 100x faster than regular CPUs.

If the customer wants to pool FPGA (field-programmable gate array) resources instead of GPUs, that is also possible in Apache Hadoop 3.1. Additionally, affinity and anti-affinity labels let us control how microservices are deployed across the cluster; for example, some components can be given anti-affinity so that they always run on separate physical servers, as the sketch below illustrates.
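Anti-affinity sounds fancier than it is: it’s simply a placement rule saying certain components must never share a physical server. YARN expresses the rule through its own placement APIs; the Python sketch below (with made-up component and server names) only illustrates what the constraint means, not how Hadoop implements it:

```python
def place_with_anti_affinity(components, servers, anti_affinity):
    """Greedily place components so anti-affinity members never share a server.

    components:    list of component names to place, in order
    servers:       list of server names
    anti_affinity: set of component names that must land on distinct servers
    Returns a dict of component -> server, or raises if the rule can't be met.
    """
    placement = {}
    taken = set()   # servers already hosting an anti-affinity component
    for comp in components:
        if comp in anti_affinity:
            candidates = [s for s in servers if s not in taken]
            if not candidates:
                raise RuntimeError(f"no server left for anti-affinity component {comp}")
            placement[comp] = candidates[0]
            taken.add(candidates[0])
        else:
            # Anything else just goes to the least-loaded server.
            placement[comp] = min(servers, key=lambda s: sum(1 for p in placement.values() if p == s))
    return placement

# The two "master" components never end up on the same server.
print(place_with_anti_affinity(
    ["master-1", "master-2", "worker"],
    ["server-a", "server-b", "server-c"],
    anti_affinity={"master-1", "master-2"},
))
# {'master-1': 'server-a', 'master-2': 'server-b', 'worker': 'server-c'}
```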

It’s interesting to see Hadoop evolve over time as the ecosystem solves more real-time problems instead of focusing on giant batch problems.

Switching editor modes is simple and straightforward. At the bottom of the screen, a help bar guides us through switching between the available editor modes. The modes available for instant switching are multi-line mode, toggled with F3, and Emacs mode, toggled with F4.

To run a query in multi-line mode, end it with a semicolon and then press Enter to execute it.

Use the same keys mentioned above to turn the editor modes on and off: F3 for multi-line mode and F4 for Emacs mode.

A recent post on the Power BI community website asked if it was possible to compress a group of numbers into text describing the sequential ranges contained within them. This might be a set of values such as 1, 2, 3, 4, 7, 8, 9, 12, 13 (note there are gaps), with the expected result grouping the consecutive runs together to produce text like “1-4, 7-9, 12-13”. Essentially, the task is to identify the gaps when creating the text. This seemed like an interesting challenge, and here is how I solved it using DAX.

Read on for the solution, which is conceptually very similar to the T-SQL solution but a bit different in implementation.
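Before digging into the DAX, it helps to pin the logic down in plain terms: sort the numbers, walk them in order, and close off a range every time the next value isn’t exactly one greater than the previous one. Here’s a quick Python sketch of that gap-detection idea (my own illustration for sanity-checking expected output, not the DAX from the post):

```python
def compress_ranges(values):
    """Collapse integers into range text, e.g. [1, 2, 3, 4, 7] -> "1-4, 7"."""
    nums = sorted(set(values))
    if not nums:
        return ""
    pieces = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:          # still inside the same consecutive run
            prev = n
        else:                      # gap found: close off the current run
            pieces.append(f"{start}-{prev}" if start != prev else str(start))
            start = prev = n
    pieces.append(f"{start}-{prev}" if start != prev else str(start))
    return ", ".join(pieces)

print(compress_ranges([1, 2, 3, 4, 7, 8, 9, 12, 13]))   # 1-4, 7-9, 12-13
```

The DAX version arrives at the same answer, but since DAX has no loops it has to express the “compare to the previous value” step in a set-based way.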

In this video, Christian Wade joined Adam Saxton to discuss Incremental Refresh with Power BI Premium. You can use Incremental Refresh with Power BI Premium to take your dataset beyond 1GB and avoid failures such as timeouts.