Introduction

Azure Data Lake Storage Generation 2 was introduced in the middle of 2018. With new features like hierarchical namespaces and Azure Blob Storage integration, this was something better, faster, cheaper (blah, blah, blah!) compared to its first version – Gen1.

Since then, there has been enough time to prepare the appropriate libraries, thanks to which we could connect to our data lake.

Is that right? Well, not really…

ADLS Gen2 is globally available since 7th of February 2019. Thirty-two days later, there is still no support for the BLOB API, and it means no support for az storage cli or REST API. So you gonna have problems here:

“Blob API is not yet supported for hierarchical namespace accounts”

Are you joking?

And good luck with searching for a proper module in PS Gallery. Of course, this will change in the future.

As a matter of fact – I hope, that this article will help someone to write it 🙂 (yeah, I’m too lazy or too busy or too stupid to do it myself 😛 )

So for now, there is only one way to connect to Azure Data Lake Storage Gen2… Using native REST API calls. And it’s a hell of the job to understand the specification and make it work in the code. But it’s still not rocket science 😛

And by the way. I’m just a pure mssql server boy, I need no sympathy 😛 So let me put it in that way: web development, APIs, RESTs and all of that crap are not in my field of interest. But sometimes you have to be your own hero to save the world… I needed this functionality in the project, I read all the documentation and a few blog posts. However, no source has presented how to do it for ADLS Gen2 😮

So now I’ll try to fill this gap and explain this as far as I understood it, but not from the perspective of a professional front-end/back-end developer, which I am definitely not!

GENERAL specification for authenticating connections to any Azure service endpoint using “Shared key”, documented HERE.

Knowing the first one gives you the ability to invoke proper commands with proper parameters on ADLS. Just like in console, mkdir or ls.

Knowing the second – just how to encrypt those commands. Bear in mind that no bearer token, no passwords or keys are used for communication! Basically, you prepare your request as a bunch of strictly specified headers with their proper parameters (concatenated and separated with new line sign) and then you “simply” ENCRYPT this one huge string with your Shared Access Key taken from ADLS portal (or script).

Then you just send your request in pure http(s) connection with one additional header called “Authorization” which will store this encrypted string, along with your all headers!

You may ask, whyyyyy, why we have to implement such a logic?

The answer is easy. Just because they say so 😀

But to be honest, this makes total sense.

ADLS endpoint will gonna receive your request. Then it will also ENCRYPT it using our secret Shared Key (which once again, wasn’t transferred in the communication anywhere, it’s a secret!). After encryption, it will compare the result to your “Authorization” header. If they will be the same it means two things:

you are the master of disaster, owner of the secret key that nobody else could use to encrypt the message for ADLS

there was no in-the-middle injection of any malicious code and that’s the main reason why this is so tortuous…

How to encrypt – it is just a different kettle of fish… Because you have to use keyed-Hash Message Authentication Code (HMAC) prepared with SHA256 algorithm converted to Base64 value using a hash table generated from your ADLS shared access key (which, by the way, is the Base64 generated string :D) But no worries! It’s not so complicated as it sounds, and we have functions in PowerShell that can do that for us.

Authorization Header specification

We are implementing an ADLS gen2 file system request, which belongs to “Blob, Queue, and File Services” header. If you are going to do this for Table Service or if you would like to implement it in “light” version please look for a proper paragraph in docs.

Preparing Signature String

Quick screen from docs:

We can divide it into 4 important sections.

VERB – which is the kind of HTTP operation that we are gonna invoke in REST (in.ex. GET will be used to list files, PUT for creating directories or uploading content to files)

Fixed Position Header Values – which must occur in every call to REST, obligatory for sending BUT you can leave them as an empty string (just “” with an obligatory end of line sign)

CanonicalizedHeaders – what should be there is also specified in the documentation. Basically, you have to put all available “x-ms-” headers here. And this is important – in lexicographical order by header name! For example:
"x-ms-date:$date`nx-ms-version:2018-11-09`n"

CanonicalizedResource – also specified in docs. Here come all parameters that you should pass to adls endpoint according to its own specification. They should be ordered exactly the same as it is in the invoked URI. So if you want to list your files recursively in root directory invoking endpoint at
https://[StorageAccountName].dfs.core.windows.net/[FilesystemName]?recursive=true&resource=filesystem" you need to pass them to the header just like this:
"/[StorageAccountName]/[FilesystemName]`nnrecursive:true`nresource:filesystem"

Bear in mind that every value HAS TO BE ended with new line sign, except for the last, that will appear in the canonicalized resource section.

In examples above, I’m using
`n which in PowerShell is just a replacement for
\n

So if I want to list files in the root directory first I declare parameters:

So what about fixed headers? For listing files in ADLS and with defined x-ms-date I can leave them all empty. It would be different if you would like to implement PUT and upload the content to Azure. Then you should use at least
Content-Length and
Content-Type . For now it looks like this:

PowerShell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

$stringToSign="$method$n"#VERB

$stringToSign+="$n"# Content-Encoding + "\n" +

$stringToSign+="$n"# Content-Language + "\n" +

$stringToSign+="$n"# Content-Length + "\n" +

$stringToSign+="$n"# Content-MD5 + "\n" +

$stringToSign+="$n"# Content-Type + "\n" +

$stringToSign+="$n"# Date + "\n" +

$stringToSign+="$n"# If-Modified-Since + "\n" +

$stringToSign+="$n"# If-Match + "\n" +

$stringToSign+="$n"# If-None-Match + "\n" +

$stringToSign+="$n"# If-Unmodified-Since + "\n" +

$stringToSign+="$n"# Range + "\n" +

$stringToSign+=

<# SECTION: CanonicalizedHeaders + "\n" #>

"x-ms-date:$date"+$n+

"x-ms-version:2018-11-09"+$n#

<# SECTION: CanonicalizedHeaders + "\n" #>

$stringToSign+=

<# SECTION: CanonicalizedResource + "\n" #>

"/$StorageAccountName/$FilesystemName"+$n+

"recursive:true"+$n+

"resource:filesystem"#

<# SECTION: CanonicalizedResource + "\n" #>

Implementing file upload is much more complex than only listing them. According to the docs, you have to first CREATE file then UPLOAD a content to it (and it looks like you can do this also in parallel 😀 Happy threading/forking!)

Don’t forget about the required headers. UPDATE gonna need more than listing.

Now small debug in Visual Studio Code, adding a watch for
$result variable to see a little more than in console output:

From here you can see, that
recursive=true did the trick, and now I have a list of 4 objects. Three folders in root directory and one file in FOLDER2, sized 49MB. Just rememebr that API has a limit of 5000 objects.

Easy peasy, right? 😉 Head down to the example script and try it for yourself!

Example Scripts

Three examples:

List files

This example should list the content of your root folder in Azure Data Lake Storage Gen2, along with all subdirectories and all existing files recursively.

Create directory (or path of directories)

This example should create folders declared in PathToCreate variable.

Bear in mind, that creating a path here, means creating everything declared as a path. So there is no need to create FOLDER1 as first, then FOLDER1/SUBFOLDER1 as second. You can make them all at once just providing the full path
"/FOLDER1/SUBFOLDER1/"