Azure Databricks is an Apache Spark-based analytics platform optimised for Azure. Designed in collaboration with the founders of Apache Spark, Azure Databricks is deeply integrated with Microsoft cloud services such as Azure Active Directory, Azure Data Lake Store, Power BI and more.

Azure Databricks makes it easier for users to leverage the underlying Apache Spark engine by providing managed Spark clusters that can be associated with web-based workspaces (i.e. Spark-as-a-Service). These interactive workspaces enable data scientists, data engineers, and business analysts to collaborate on Spark applications and ultimately derive value from data.

History

Getting Started

The quickest way to get started is by spinning up an Azure Databricks service within your Azure subscription and launching directly into a workspace.

1. Log in to the Azure Portal.
2. Navigate to Create a Resource > Analytics > Databricks.
3. Populate the Azure Databricks Service dialog with the appropriate values and click Create.
4. Once the workspace has been deployed, navigate to the resource and click Launch Workspace.

Note: While most of the required fields are fairly straightforward, the choice between the Standard and Premium pricing tiers may be less obvious. Although pricing tiers are typically subject to change, at this point in time the primary difference between the two is that Premium adds Role-Based Access Control for Notebooks, Clusters, Jobs, and Tables. In the first instance, I suggest starting with the Premium trial.

User Interface

Once you have launched into an Azure Databricks workspace, the next thing to do is orient yourself with the user interface. Click through the gallery to get a feel for the layout and various screens.

Cluster
A set of computation resources (Azure Linux VMs) that can be associated with a Notebook or Job in order to run Spark application code. While clusters are launched and the individual components are visible within an Azure subscription, the end-to-end lifecycle is managed via the Databricks portal.

Job
A way of running Spark application code either on-demand or on a scheduled basis. A job encapsulates the task that needs to be executed (Notebook or JAR), an association to a cluster (new or existing), and an option to specify a schedule in which the job will run.

Databricks File System (DBFS)
The Databricks File System is an abstraction layer on top of Azure Blob Storage that comes preinstalled with each Databricks runtime cluster. It contains directories, which can contain files and other sub-folders. Note: Since data is persisted to the underlying storage account, data is not lost after a cluster is terminated. Data can be accessed using the Databricks File System API, Spark API, Databricks CLI, Databricks Utilities (dbutils), or local file APIs.
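To make that list of access options concrete, here is a minimal PySpark sketch, assuming it is run from a Databricks notebook (where dbutils and spark are provided by the runtime) and that the sample dataset path shown actually exists in your workspace:

# Databricks Utilities (dbutils) - list the contents of a DBFS directory
for entry in dbutils.fs.ls("/databricks-datasets"):
    print(entry.path, entry.size)

# Spark API - read a file straight out of DBFS into a DataFrame
df = spark.read.csv("dbfs:/databricks-datasets/samples/population-vs-price/data_geo.csv",
                    header=True, inferSchema=True)
df.show(5)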

Next Steps

Exploring the Quickstart Tutorial notebook (available directly from the Databricks main screen) is a great first step in further familiarising yourself with the Azure Databricks platform. Note: Before you can run the notebook you will need to create a cluster and associate that cluster to the notebook so that it has access to a computational resource for processing.

Create a Cluster
Navigate to the ‘Create Cluster’ screen (either via the Azure Databricks main screen or Clusters > Create Cluster). Since the majority of defaults are sensible, we will make the following minimal changes.

Provide a Cluster Name (e.g. MyFirstCluster).

Uncheck Enable Autoscaling.

Change the number of Workers to 1.

Click Create Cluster.

Once the cluster is in a Running state, return to the Azure Databricks main screen and click Explore the Quickstart Tutorial. Behind the scenes, this will create a persisted version of the notebook within your private workspace (you can check this by opening the workspace panel and looking inside your user directory).

Attach and Run
In order to execute the runnable code within each of the notebook cells, we need to attach our notebook to the recently created cluster.

Click Run All.

A prompt will inform you that the notebook is currently not attached to a cluster, click Attach and Run.

That’s it. The cluster will work its way through each of the runnable cells in sequence. Each cell is able to leverage the results of the work done prior. The pop-up notifications will indicate which cell is actively being worked on until the notebook is completely processed.

By this point, you should have a basic understanding of Azure Databricks and some of the key concepts. For continued learning and points of reference, check out the list of resources below.

History

Ecosystem

Spark SQL
Spark SQL is a Spark library for structured data processing. Spark SQL brings native SQL support to Spark as well as the notion of DataFrames. Information workers are free to use either interface or toggle between both while the underlying execution engine remains the same (a short sketch of this follows the component descriptions below).

Spark Streaming
Spark Streaming can ingest and process live streams of data at scale. Since Spark Streaming is an extension of the core Spark API, streaming jobs can be expressed in the same manner as writing a batch query.

GraphX (Graph Processing)
GraphX is a Spark library that allows users to build, transform and query graph structures with properties attached to each vertex (aka node) and edge (aka relationship).

Spark Core
Spark Core is the underlying execution engine that all other functionality is built on top of. Spark Core provides basic functionality such as task scheduling, memory management, fault recovery, etc. as well as Spark's primary data abstraction - Resilient Distributed Datasets (RDDs).
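As a quick illustration of the Spark SQL point above (the same engine sits behind both the DataFrame API and SQL), here is a minimal PySpark sketch using made-up sample data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# A small DataFrame built from hypothetical sample data
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"])

# DataFrame API
people.filter(people.age > 30).show()

# The same query expressed in SQL against a temporary view
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()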

Apache Spark vs. MapReduce

TL;DR - Spark is faster and easier to use.

MapReduce (introduced back in 2004), a mature software framework, has been the mainstay programming model for large-scale data processing. While MapReduce is great for single-pass computations, it is inefficient when multiple passes over the data are required. That is not a big deal for batch processing, but MapReduce can be painfully slow in scenarios which require the sharing of intermediate results. This is quite common for certain classes of applications such as interactive ad-hoc queries, machine learning, real-time streaming, etc.

MapReduce, as was the case for many frameworks at the time, would need to write interim state to disk (i.e. a distributed file system) in order to reuse data between computations. Each pass would therefore incur a significant performance overhead due to data replication, data serialisation, disk I/O, etc., consuming a substantial amount of the overall execution time.

In contrast, Spark's programming model revolves around Resilient Distributed Datasets (RDDs), an abstraction of distributed memory (in addition to distributed disk), making the framework an order of magnitude faster for algorithms that are iterative in nature.

In addition to being performant, Spark provides high-level operators (Transformations and Actions) that can be expressed in a number of language APIs (Java, Scala, Python, SQL, and R), making Spark easy to use in comparison to MapReduce which can get quite verbose as developers are required to write low-level code.

Spark Operations

Spark supports two types of operations:

Transformations
Transformations take an input, perform some type of manipulation (e.g. map, filter, sample, distinct), and produce an output.

Since data structures within Spark are immutable (i.e. unable to be changed once created), the output of a transformation is not the results themselves but a new (transformed) data abstraction (i.e. RDD or DataFrame).

Transformations are lazy (i.e. Spark will not compute the results until an action requires the results to be returned). This allows Spark to optimise the physical execution plan right at the last minute to run as efficiently as possible.
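A small PySpark sketch of this behaviour (purely illustrative; any SparkSession will do): the two transformations below are only recorded as lineage, and nothing is computed until the count() action at the end forces execution.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-evaluation-sketch").getOrCreate()

numbers = spark.sparkContext.parallelize(range(1, 1000001))

# Transformations - lazily evaluated, no work happens yet
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Action - Spark now builds and runs the optimised physical plan
print(squares.count())   # 500000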

Once you have downloaded Apache Spark, extract the zipped contents, navigate to the Spark directory, and start the Spark shell.

For example:

cd ~/Downloads/spark-2.3.1-bin-hadoop2.7

bin/pyspark

Note: In this example I have started the Spark shell in Python; alternatively, you can use Scala by typing bin/spark-shell.
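Once the shell is up, you can try a couple of lines along the same lines as Apache's Quick Start (this assumes you launched the shell from the extracted Spark directory, which contains a README.md):

# Inside the pyspark shell a SparkSession is already available as `spark`
text = spark.read.text("README.md")   # DataFrame with one row per line
print(text.count())                    # number of lines in the file
print(text.first())                    # first line of the README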

In terms of next steps, check out Apache's Quick Start which contains some sample code for both Scala and Python. In a subsequent post, I will cover how you can tap into Spark-as-a-Service in the cloud using Azure Databricks (stay tuned).

Getting Started with Apache Spark

OCR on iOS with Workflow and Cognitive Services
Taygan Rifat | Mon, 23 Jul 2018 17:26:37 +0000
https://www.taygan.co/blog/2018/07/23/ocr-on-ios-with-workflow-and-cognitive-services

Develop your own OCR (Image to Text) application on iOS (iPhone or iPad) with zero lines of code using Workflow and Microsoft’s Cognitive Services.

Workflow is a powerful iOS task automation application that can daisy chain actions across different apps on your iPhone or iPad, and execute the combination of steps via a single tap (think along the lines of Automator for the Mac or Logic Apps on Azure). Recently, I discovered that Workflow can also consume and parse JSON from a web response which opens the door to a number of possibilities... :) one of which being OCR via Cognitive Services.

2. Navigate to the resource and copy and paste the Computer Vision API key. Preferably, copy the key across to a text editor on your iOS device (e.g. Notes) as we will need this later on to update our workflow.

3. Download the pre-created workflow by tapping on this link via an iOS device (iPhone or iPad) that has Workflow installed. Once loaded, tap Open in "Workflow".

If you created your Computer Vision resource in the West US region, you are done and can hit the play button to test the workflow. The workflow app will present a one-time warning that "...this workflow was imported from Safari. Are you sure you want to run it?", tap Run Workflow to dismiss this message.

If you created your Computer Vision resource in a different region, you will need to perform an additional step.

5. ** This is only required if your Computer Vision resource has been created outside the West US region ** Scroll down to the URL step and replace westus with your region. Note: You can check your resource endpoint via the Azure portal under the Overview section of your Computer Vision resource.

Workflow Recipe

The illustration below provides a helicopter view of all steps encompassed in the workflow.

Azure IoT Central
Taygan Rifat | Wed, 27 Jun 2018 20:25:32 +0000
https://www.taygan.co/blog/2018/06/21/azure-iot-central

Azure IoT Central is a fully managed SaaS (software-as-a-service) offering that aims to lower the barrier of entry by abstracting much of the complexity and technical debt typically required to deploy and manage an IoT solution in the enterprise.

When it comes to Azure and the Internet of Things (IoT), there are several options that can be considered in order to come up with a design that encompasses the common objectives (Device Connectivity, Data Processing & Analytics, Presentation & Business Connectivity) typically sought after in an enterprise-grade solution.

Options:

Pick and choose (a la carte) from the laundry list of Azure services to come up with something custom.

Azure IoT Central aims to lower the barrier of entry by abstracting much of the complexity and technical debt typically required to deploy and manage an IoT solution. If you are looking to accelerate time-to-value and don't require deep levels of service customization, Azure IoT Central is a great place to start.

Note: Azure IoT Central and Azure IoT Solution Accelerators are underpinned by the same services typically found at the core of any Azure IoT solution (e.g. Azure IoT Hub, Stream Analytics, Time Series Insights, etc).

UI Overview

Home
The homepage is the landing area for your IoT application. When in Design Mode (top right-hand corner), you can customize your homepage by adding new or editing existing tiles. At the moment, tiles can either be in the form of a Link (Title, URL, Description) or an Image (Title, Image, URL).

Device Explorer
Device Explorer provides a list of device templates (aka device types) and devices associated with those templates. It is within this section that you can add real or simulated devices.

Device Sets
Device sets allow us to define a grouping of related devices (i.e. a sub-group of devices within a particular device template). For example, I may define a device template (aka device type) of Raspberry Pi, but would like to sub-group devices of this type by particular characteristics to manage them as a collection (e.g. by Customer, Installation Date, Manufacturer, Location, etc).

Application Builder
Application Builder provides links a builder can use to either create a device template (Application Builder > Device Templates > Custom) or configure the home page (i.e. Homepage with Design Mode toggled ON).

Administration
The Administration section enables an Administrator of the application to configure:

Application Settings (Image, Name, URL)

Delete the application

Add or Delete Users

Extend your trial or convert to paid

Note: The image below provides a helicopter view of the primary screens when using the sample Contoso application. These sections are accessible from the side menu panel.

Hover over your newly created tile and move your mouse to the bottom right-hand corner until you see the cursor change to indicate that you can click and resize the tile. Resize the tile so that it is double the standard width.

Finally, turn Design Mode OFF.

Upon completion, your home page should look something like the below.

4. Device Template

Before we can start adding real (or simulated) devices, we need to create a device template. You can think of a device template as a way of describing a device type (e.g. Smart TV, Thermostat, Lighting) that defines the characteristics and behaviors of devices that will ultimately send data back to Azure IoT Central.

In this demo, my MacBook Pro will act as the IoT device and therefore I will create a 'Laptop' device template.

Navigate to Application Builder > Create Device Template > Custom

Enter a name (e.g. Laptop) and click Create

Click + New Measurement

Select Telemetry

Set the following property values:

Display Name: CPU Usage

Field Name: cpu

Units: Percent

Minimum Value: 0

Maximum Value: 100

Decimal Places: 1

Click Save

Jump to Properties

Toggle Design Mode ON

Click Text

Set the following property values:

Display Name: Manufacturer

Field Name: manufacturer

Click Save

5. Add a Real Device

Navigate to Device Explorer

Click on the + New dropdown menu and click Real

Click on the image to update the device logo.

Click on the text on the right-hand side of the image to rename the device (e.g. Taygan's MacBook Pro)

Copy the primary connection string (make a note of this value to be used later in the code sample, e.g. paste into a text editor).

6. Set up a Python Environment

The code sample was made to work with Python 3 (I'm specifically running 3.6.4). If you haven't already installed Python 3, do that first. In addition to the standard libraries, you will also need to install requests and psutil.

pip install requests
pip install psutil

7. Update and Execute the Code

The primary connection string that was copied in the last step within 5. Add a Real Device should have looked something like this...

As you can see in the string, there are values for HostName, DeviceId and SharedAccessKey. Extract the values to update the appropriate variables within the code. For instance, the above example would look like...
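The actual connection string and completed code aren't reproduced in this extract, but as a rough sketch of that extraction step (the connection string below is a placeholder, not a real value), you could split the string into its parts and take the CPU reading with psutil like so:

import psutil

# Placeholder only - substitute the primary connection string copied earlier
CONNECTION_STRING = "HostName=<host>.azure-devices.net;DeviceId=<device-id>;SharedAccessKey=<key>"

# Split "Key=Value;Key=Value;..." into a dictionary
parts = dict(item.split("=", 1) for item in CONNECTION_STRING.split(";"))

HOST_NAME = parts["HostName"]
DEVICE_ID = parts["DeviceId"]
SHARED_ACCESS_KEY = parts["SharedAccessKey"]

# The telemetry value sent to Azure IoT Central: CPU usage as a percentage
cpu = psutil.cpu_percent(interval=1)
print(HOST_NAME, DEVICE_ID, cpu)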

Copy and paste the code into a file within your Python environment (e.g. handler.py).

Update the variables HOST_NAME, DEVICE_ID, and SHARED_ACCESS_KEY with the values from your primary connection string.

Execute the python script (e.g. python handler.py).

Note: If successful, this will start sending CPU usage from your device to Azure IoT Central.

Finally, navigate to the Analytics section via the side menu to confirm the data is indeed being received by Azure IoT Central and automatically visualized for further analysis.

And of course, you can add additional devices by going into Device Explorer and then copying the connection details to the scripts running on the other devices (e.g. other laptops). In the final example, I have the Python script running on my MacBook Pro, Surface Laptop and Raspberry Pi. Note the ability to split the measure (CPU Usage) by a property (Manufacturer).

Getting Started with Power BI
Taygan Rifat | Tue, 12 Jun 2018 07:55:58 +0000
https://www.taygan.co/blog/2018/06/10/getting-started-with-power-bi

Power BI is a suite of analytics tools that empower users to consume, analyse and visualise data. Underpinned by the Power BI service (SaaS), authors can publish reports to the cloud and provide users with a number of different channels to consume insights.

What is Power BI?

Power BI is a suite of analytics tools that empowers users to consume, analyse and visualise data. Underpinned by the Power BI service (SaaS), authors can publish reports to the cloud and provide users with a number of different channels to consume insights, whether that be on the web via powerbi.com, on mobile (iOS or Android), within a 3rd party application (embedded) or via Cortana on Windows 10.

Typical Data-to-Insight Workflow

While the Power BI suite caters for a number of analytic scenarios, a common data-to-insight workflow would typically consist of two components:

Power BI Desktop as the authoring tool used to connect to data, model it and create reports.

Power BI Service (*) as a destination for authors to publish and share content.

(*) Note: The Power BI Service is not limited to content consumption, but also allows content creation (i.e. the authoring of reports and dashboards). To keep things simple though, particularly when just getting started, consider Power BI Desktop as the tool of choice for content creation.

Power BI Family

While we will primarily focus on Power BI Desktop and the Power BI Service, below is a short description for each product within the Power BI suite.

Power BI Service
Commonly referred to as PowerBI.com, a publicly accessible cloud-based (SaaS) service that underpins Power BI content and enables the creation of dashboards.

Power BI Desktop
A free desktop application that enables content producers to model data and author reports typically without IT involvement.

Power BI Mobile
A set of apps across iOS, Android and Windows 10 that enables the consumption of Power BI content on mobile devices.

Power BI Report Server
On-premises (aka self-hosted) offering to serve and create Power BI reports (subset of the Power BI service).

Power BI Premium
Dedicated capacity which allows for the distribution of Power BI content without per-user licensing.

Power BI Embedded
Enables ISVs and developers to integrate Power BI content (reports, dashboards and tiles) directly into an application.

Getting Started

Before you can get started, you will need to download and install Power BI Desktop.

GOTCHAS

Forecasting is missing? In Part 2 (Data Visualization and Report Authoring), if you are unable to see the Forecasting section within the Analytics pane, ensure that the [Symbol] column has been removed from the legend in addition to filtering for a specific value (e.g. BTC).

Embedded Analytics with Power BI
Taygan Rifat | Tue, 15 May 2018 06:59:13 +0000
https://www.taygan.co/blog/2018/05/14/embedded-analytics-with-power-bi

Supercharge your apps by delivering insights directly to where your users are! Power BI Embedded enables ISVs and application developers to embed analytical content (dashboards, reports and visualisations) directly within an app.

Content

What is Embedded Analytics?

Why should I care?

What is Power BI Embedded?

History

Power BI Embedded with Azure Functions (Demo)

1. What is Embedded Analytics?

2. Why should I care?

User adoption of centralised BI systems has been, and continues to be, a persistent challenge. With only a fraction of users ever leveraging some form of analytical capability, the majority of users are left ill-informed, with adoption rates stagnating between 15 - 25%. This makes sense when you consider the effort expected of consumers before they can reap any reward.

Typical Example:

User X is in the middle of working with Business App Y and would like to analyse some of the underlying data.

In order to do so, User X would need to stop what they are doing and switch over to a BI platform, usually via a centralised portal.

Once complete, switch back and resume working.

As you can appreciate, this is quite disruptive to the user's flow, and herein lies the opportunity! Embedded analytics delivers insights directly to users, significantly increasing the likelihood of adoption while accelerating the user's time to value.

3. What is Power BI Embedded?

Power BI Embedded enables ISVs and developers to integrate Power BI content (reports, dashboards and tiles) directly into an application.

You simply provision and pay for the dedicated capacity required to serve the content and meet peak usage demand.

Each individual consumer does not require a license.

There only needs to be a single "master" Power BI Pro account.

Note:

The Power BI Embedded offering is targeted at ISVs and developers (i.e. building apps for external users).

If you are looking to embed Power BI within the enterprise context (e.g. an internal portal, SharePoint, Microsoft Teams), the guidance from Microsoft is to look at Power BI Premium, as this SKU may be better suited when combined with existing internal usage of the Power BI service.

4. Power BI Embedded History

Not to dwell on the past but understanding Power BI Embedded's history to date may provide some needed clarity as the offering has evolved since its initial release.

With the above in mind...

Power BI Workspace Collections

You may still see this resource in the Azure marketplace or stumble across the term in Google; this is the legacy version of the Power BI Embedded offering prior to the API convergence that occurred in May 2017.

Power BI Workspace Collections are on the deprecation path and will be completely retired in June 2018.

The legacy pricing model was based on sessions (current model based on capacity).

Note: Power BI Embedded is billed hourly. If you pause the resource, you will not be charged.

Do I need Power BI Embedded capacity during Development / Testing?
During development and testing, you do not need Power BI Embedded dedicated capacity (i.e. you could skip this step) as you are able to generate a limited number of embed tokens (we'll get to embed tokens later) without this resource being active. In terms of the number of embed tokens that can be generated before being exhausted, this (afaik) is undocumented. That said, generally speaking, you can live without Power BI Embedded dedicated capacity until you are ready to transition to Production. With Power BI Embedded capacity you are then free to generate unlimited embed tokens.

Once complete, you should see a diamond icon next to your App Workspace to indicate that it is backed by premium capacity.

Note: You will need to publish some content into your App Workspace (i.e. Dashboards, Reports). You can do this via Power BI Desktop otherwise clicking Get Data > Samples can quickly get you started.

Step 3. Get an Embed Token (Power BI API)

The diagram below illustrates the high-level interchange of requests. In this example I am using an Azure Function to do the bulk of the work, once authenticated it will invoke the Power BI API to construct a JSON response that will include an Embed Token. The Embed Token allows the client (Power BI JavaScript Library) to embed Power BI content. There are obviously a number of ways this could be done (e.g. back-end web server logic). Using Azure Functions keeps our complexity low and should be easy to follow along.
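The function in this post is written in C# (run.csx, below), but purely to illustrate the interchange just described, here is a rough Python sketch of the same two calls: authenticate the "master" account against Azure AD, then ask the Power BI REST API to generate an embed token. The endpoint URLs, field names and accessLevel value are my assumptions about the v1 OAuth and Power BI GenerateToken APIs, so verify them against the official documentation.

import requests

# Values mirroring the Application Settings defined later in this post
CLIENT_ID = "<PBIE_CLIENT_ID>"
GROUP_ID = "<PBIE_GROUP_ID>"
REPORT_ID = "<PBIE_REPORT_ID>"
USERNAME = "<PBIE_USERNAME>"
PASSWORD = "<PBIE_PASSWORD>"

# Step 1: authenticate the "master" Power BI Pro account against Azure AD
token_response = requests.post(
    "https://login.microsoftonline.com/common/oauth2/token",
    data={
        "grant_type": "password",
        "resource": "https://analysis.windows.net/powerbi/api",
        "client_id": CLIENT_ID,
        "username": USERNAME,
        "password": PASSWORD,
    },
)
access_token = token_response.json()["access_token"]

# Step 2: exchange the AAD access token for a report embed token
embed_response = requests.post(
    "https://api.powerbi.com/v1.0/myorg/groups/{}/reports/{}/GenerateToken".format(GROUP_ID, REPORT_ID),
    headers={"Authorization": "Bearer " + access_token},
    json={"accessLevel": "View"},
)
print(embed_response.json())   # JSON containing the embed token used by the client-side JavaScript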

Azure Function App

Log on to the Azure portal.

Create a Function App.

Fill in the required property values.

App Name (e.g. pbieFunctionApp)

Subscription (valid Azure subscription)

Resource Group (e.g. PBIE)

OS (Windows)

Hosting Plan (Consumption Plan)

Location (e.g. West Central US)

Storage (Create New)

Application Insights (Off)

Click Create

Azure Function (Code)

Navigate to your Azure Function App.

Create a new C# HTTP triggered function (i.e. Webhook + API).

Navigate back to the Azure Function App (parent) and click Application Settings.

Add the following settings:

Name: PBIE_CLIENT_ID; Value: Azure AD > App Name > Application ID

Name: PBIE_GROUP_ID; Value: <Power BI Group ID> (see notes)

Name: PBIE_REPORT_ID; Value: <Power BI Report ID> (see notes)

Name: PBIE_USERNAME; Value: Your Power BI Pro Account Username

Name: PBIE_PASSWORD; Value: Your Power BI Pro Account Password

Click Save

Navigate back to the Azure Function.

Add a new file called project.json

Copy and paste the JSON code from below into the file and click Save.

Open run.csx.

Copy and paste the C# code from run.csx below and click Save.

Note: The easiest way to get the Group ID and Report ID is directly from PowerBI.com. Simply open a report from your App Workspace that you intend on embedding and you can see the values in the URL (e.g. https://powerbi.com/groups/<Group ID>/reports/<Report ID>/ReportSection). These values can also be attained programmatically but to keep this demo as simple as possible, we are going to supply them directly.

That's it. To test your API, click on the </> Get Function URL link and navigate to it in your browser. The JSON response should look something like the below.

Note: You will notice that the HTML is referring to online hosted versions of the JavaScript library dependencies (jQuery & Power BI JS). This means we can get away with a single HTML document.

That's it! In this example I'm just running the HTML code from my local machine; if you wanted to go FULL SERVERLESS, you could upload the HTML document to Blob storage and make the container public.

Image Processing with Cognitive Services
Taygan Rifat | Sat, 05 May 2018 15:59:16 +0000
https://www.taygan.co/blog/2018/4/28/image-processing-with-cognitive-services

By tapping into the Optical Character Recognition (OCR) capabilities of Microsoft’s Computer Vision API (part of the Cognitive Services suite), I will show how you can extract text from images.

Scenario

So the week before this post, I needed to get an extract of all my financial transactions for the last year. The initial thinking... log on to my bank's customer portal and request an export to CSV/XLS. Sounds simple enough? Unfortunately for me (and I suspect for many others), computer says no! The web app would only export the last 30 days, so anything prior to that could only be accessed via statements downloaded in PDF format.

Challenge

Ultimately, I needed to extract meaningful data (Date, Transaction, Amount, etc.) into a structured format (i.e. CSV or XLS) that could be further analysed to derive actionable insights. In order to do this, the content of the PDF would need to be converted into text. Microsoft Cognitive Services to the rescue!

High-Level Flow

Convert each page of each PDF into an image.

Invoke the Computer Vision API to convert each image into text.

Parse the JSON response from the API and output the results into a text file.

Note: A quick Google search will show there are a ton of ways a PDF can be split and converted into an image. FYI - In my specific example I have used the "Render PDF Pages as Images" action within a simple Automator workflow. That said, in this post I will be primarily focusing on the code used within the Python script to tap into the OCR capabilities of the Computer Vision API.

Computer Vision

Computer Vision provides developers with a number of different image processing capabilities by simply invoking an HTTP endpoint, from tagging images based on their content to celebrity recognition. To find out more, check out Microsoft's official documentation.

In order to get started we need to get access to an API key. To do this, create a Computer Vision API resource within your Azure subscription (Azure Portal > Search the marketplace for "Computer Vision API" > Create).

Note: At time of this post, you are entitled to 20 calls per minute (5,000 per month) under the free tier.

API Key and HTTP Endpoint

Once the Computer Vision API resource has been created, there are two key pieces of information we need to make note of for later use within our code. Navigate to the sections as outlined below and copy/paste the values into a text editor.

Computer Vision API > Overview > Endpoint

Computer Vision API > Keys (under Resource Management) > Key 1

Python Environment

Since we can interact with the API via an HTTP endpoint, you are free to use whatever language you like. In this post, I'll focus on how it can be done using Python. The only dependency we have is on the requests library, so you will need to install that before updating and executing any code.

Code

Finally, the code sample. To use it, simply update the three variables: API_KEY, ENDPOINT and DIR. That's it, you are good to convert images into text! The code sample will loop through all images in the directory path and dump all the extracted text into a single file. Remember, under the free tier you are limited to only 20 calls per minute.
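The original script isn't reproduced in this extract, but a minimal sketch of the approach it describes (directory loop, OCR call, single output file) might look like the following. The /vision/v1.0/ocr endpoint shape and the response structure (regions > lines > words) are my assumptions about the Computer Vision OCR API, so double-check them against your own resource's endpoint and the notes below.

import os
import time
import requests

# Update these three variables (placeholders shown)
API_KEY = "<your-computer-vision-api-key>"
ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/ocr"   # note the /ocr suffix
DIR = "/path/to/page/images"

headers = {
    "Ocp-Apim-Subscription-Key": API_KEY,
    "Content-Type": "application/octet-stream",
}

with open("output.txt", "w") as output:
    for filename in sorted(os.listdir(DIR)):
        if not filename.endswith(".jpeg"):          # adjust for JPG, PNG, etc.
            continue
        with open(os.path.join(DIR, filename), "rb") as image:
            response = requests.post(ENDPOINT, headers=headers, data=image)
        result = response.json()
        # The OCR response groups text into regions > lines > words
        for region in result.get("regions", []):
            for line in region.get("lines", []):
                output.write(" ".join(word["text"] for word in line["words"]) + "\n")
        time.sleep(3)   # stay under the free tier's 20 calls per minute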

Notes:

The loop that cycles through each file in the nominated directory currently only proceeds if the filename ends in "*.jpeg", you may also need to change this for your specific requirements (i.e. JPG, PNG, etc.).

If you completely overwrite the ENDPOINT variable in the code by copy and pasting from the Azure portal, you will need to add "/ocr" to the end of the string.

Video

Upsert to Azure SQL DB with Azure Data Factory
Taygan Rifat | Fri, 20 Apr 2018 22:08:07 +0000
https://www.taygan.co/blog/2018/04/20/upsert-to-azure-sql-db-with-azure-data-factory

Copy data from Table Storage to an Azure SQL Database with Azure Data Factory, by invoking a stored procedure within the SQL sink to alter the default behaviour from append only to UPSERT (update / insert).

While tinkering away on a passion project, I eventually reached a point where I needed to leverage the ELT capabilities of Azure Data Factory, to enable a pipeline that would copy data from Azure Table Storage to Azure SQL DB. The challenge... Azure Data Factory's built-in copy mechanism is set by default to append only (i.e. attempt to insert record; if exists: error).

For instance, after setting up a copy activity for the first time, while the initial execution of the pipeline is successful, subsequent executions would fail if a record from the source already exists in the destination environment. When this error is caught, you may encounter a message similar to that below.

In this scenario, the desired outcome is to alter the copy activity to perform the equivalent of an UPSERT (i.e. if record already exists: update, else: insert). This can be achieved in Azure Data Factory with some additional configuration to invoke a stored procedure during the copy.

Demo: Table Storage to Azure SQL Database

Note: If you are just getting up to speed with Azure Data Factory, check out my previous post which walks through the various key concepts, relationships and a jump start on the visual authoring experience.

After performing an initial load (i.e. first run of the pipeline), change some data in the source and trigger a second execution. If all is working the update should flow through without any problems and update the destination environment with the new value.

React in Real-Time with Azure Event Grid
Taygan Rifat | Thu, 22 Mar 2018 20:38:38 +0000
https://www.taygan.co/blog/2018/03/21/react-in-real-time-with-azure-event-grid

Azure Event Grid is a fully managed event routing service that provides a uniform approach to event consumption via the publish-subscribe model. Learn how you can enable event-driven solutions to react in real-time.

Having only recently become generally available, Azure Event Grid is one of the newest members to join the existing suite of Azure messaging services. Azure Event Grid is a fully managed event routing service that provides a uniform approach to event consumption via a publish-subscribe model.

The graphic below is a summary of the current state of native integration. Note: Microsoft has stated that the intention is to roll this capability out much further, with Azure Data Lake Store, Cosmos DB and Data Factory already mentioned to be coming soon...

Benefits & Features

Get started quickly with built-in support for a number of Azure services out of the box.

Ability to send custom events and support for third-party publishers via Custom Topics.

No longer need to perform scheduled or continuous polling (i.e. is it done yet? is it done yet? is it done yet?) with Event Grid pushing events as they occur.

Support for high throughput scenarios, Event Grid can scale to handle millions of events per second.

24-hour retry with exponential backoff to ensure events are delivered.

The ability to fan-out (i.e. a single published event can be processed by multiple handlers).

Subscriptions can be configured with filters (e.g. Event Type, Published Path, etc) as a way of routing events that are relevant to subscribed handlers.

Concepts

Event Publisher: Where the event took place.

Event Handler: The application or service reacting to the event.

Topics: An endpoint where Event Publishers can send events.

Subscriptions: An endpoint where events are routed to be actioned by subscribed Event Handlers.

Demo

Scenario

1. Logic App is set to be triggered by Event Grid by subscribing to events from a Storage Account.
2. Storage Accounts fire an event whenever a Blob is Created or Deleted.
3. The subscription is configured to filter on a specific Event Type (Blob Created).
4. As the diagram illustrates, if a Blob is created then subsequently deleted, despite two events being sent and ingested by Event Grid, only a single event is ultimately delivered to the Logic App.

Note: Event Grid is currently only available across regions within the US, Europe and Asia.

Azure Messaging Services

At the beginning of this post, I mentioned that Event Grid is just one of many services that Azure currently offers within the messaging domain (e.g. Service Bus, Event Hubs, IoT Hub, etc). This can make life challenging when tasked with choosing the right service for your solution. While there is some definite overlap and similarities between some of these products, each does have its own unique value proposition which is important to understand before making your decision. That said, comparing and contrasting the suite of Azure messaging services is another article in its own right so for now, I point you to some existing material that should help provide some clarity.

Streaming Sensor Data in Real-Time with Azure IoT Hub
Taygan Rifat | Mon, 12 Mar 2018 20:29:00 +0000
https://www.taygan.co/blog/2018/03/12/streaming-sensor-data-in-real-time-with-azure-iot-hub

Stream sensor data into the cloud. Using an IoT device (Raspberry Pi), temperature readings will be sent to Azure IoT Hub, consumed by Stream Analytics and visualised in real-time by Power BI.

Note: If you are just unpacking your Raspberry Pi, check out how to get up and running via a headless setup (no external keyboard or mouse required).

This post builds upon my previous demo, How to Build a Raspberry Pi Temperature Sensor. I will show how to extend the project to send temperature data into the cloud via Azure IoT Hub, then consume the incoming events to be visualised in real-time using Stream Analytics and Power BI.

Video

How to Build a Raspberry Pi Temperature Sensor
Taygan Rifat | Sat, 10 Mar 2018 12:40:06 +0000
https://www.taygan.co/blog/2018/03/10/how-to-build-a-raspberry-pi-temperature-sensor

Learn how to build a Raspberry Pi temperature sensor. All you need outside of the Raspberry Pi itself is a couple of jumper cables, a breakout board, a DS18B20 one-wire digital sensor and a laptop for remote access.

Note: If you are just unpacking your Raspberry Pi, check out my previous post on how to get up and running via a headless setup (no external keyboard or mouse required).

In this post, I'll show how you can build a temperature monitor using a Raspberry Pi and the DS18B20 one-wire digital sensor. The project build time is about 5 - 10 minutes. We will start off by putting together the actual circuit, enable the modules via SSH and finally get our temperature readings with some Python code. When all is said and done, you should end up with something similar to that below.

6. Video

That's it! In a future post I will be looking to bring all this together to explore some of the IoT capabilities within Azure, so stay tuned...

Setup a Raspberry Pi with No Keyboard or Monitor
Taygan Rifat | Thu, 08 Mar 2018 14:19:01 +0000
https://www.taygan.co/blog/2018/03/08/setup-a-raspberry-pi-with-no-keyboard-or-monitor-headless

After picking up my Raspberry Pi I stumbled onto challenge number one pretty quickly... how do I set this thing up without access to a keyboard or monitor? Fortunately, a headless setup is relatively painless. With a laptop and SD card reader, you can be up and running in no time with remote access via SSH.

So I bought myself a Raspberry Pi! After picking up the device I stumbled onto my first challenge pretty quickly... how do I set this thing up without access to a keyboard or monitor? Fortunately, a headless setup is relatively painless. With a laptop and SD card reader, you can be up and running in no time with remote access to the command line via SSH.

4. Enable SSH

Simply create an ssh file with no extension in the /Volumes/boot directory by typing: touch ssh

At this point your SD card is ready to go, unmount the disk and insert it into your Raspberry Pi.

5. Power Up

1. Turn on the Raspberry Pi by plugging in the micro USB power supply.

2. After you have given the device some time to boot up and connect to the Wi-Fi, find the Raspberry Pi's local IP address. There are multiple ways you can do this; I've personally used LanScan, but you could also use nmap or check out your router's list of connected devices.

7. Safely Shutdown

How to Build a Chatbot Without Code
Taygan Rifat | Mon, 05 Mar 2018 15:15:31 +0000
https://www.taygan.co/blog/2018/03/05/how-to-build-a-chatbot-without-code

Learn how to create an AI powered chatbot in minutes with zero lines of code. Using QnA Maker and Azure Bot Service, we build and deploy a Slack chatbot that can answer Tesla's frequently asked questions for new owners.

** Update (30th June 2018) **
Since this blog post, QnA Maker has come out of Preview and become Generally Available. One of the key changes that may catch you is the change in keys that need to be populated within the Application Settings of your Web App Bot. The new keys are QnAAuthKey, QnAEndpointHostName and QnAKnowledgebaseId.

You can find these values by logging into QnA Maker and navigating to My knowledge bases > View Code, or simply clicking on SETTINGS if you already have your Knowledge Base open. The image below illustrates how these values map to the Application Settings within your Web App Bot.

Content

What is a Chatbot?

Chatbot Categories

QnA Maker

High-Level Process

Demo: Build a QnA Chatbot

Next Steps

Video: Part 1 - Tesla Chatbot with QnA Maker

Video: Part 2 - Slack Integration with Azure Bot Service

1. What is a Chatbot?

A chatbot is an interactive computer program that has the ability to process natural language typically fed in the form of audio or text and return a response, simulating human conversation.

4. High-Level Process

5. Demo: Build a QnA Chatbot

a) Specify a Service Name (e.g. Tesla FAQ).
b) Add the URL as a data source.
c) Click Create.

3. Click Publish. Note: In the example below, I have shown how the endpoint can be tested using Postman.

6. Next Steps

So we have a functional HTTP endpoint which encompasses the intelligence aspect of our chatbot (i.e. question > understanding > response), but how do we make this service widely accessible?

While there is nothing stopping you from using QnA Maker standalone, Azure Bot Service has a number of rapid deployment templates which allow integration with a variety of channels with little to no code. Build once, deploy many.

Check out the videos below to see the process end to end from service creation using QnA Maker to channel deployment via Azure Bot Service.

7. Video: Part 1 - Tesla Chatbot with QnA Maker

8. Video: Part 2 - Slack Integration with Azure Bot Service

Azure Data Lake Series: Working with JSON - Part 3
Taygan Rifat | Fri, 02 Mar 2018 10:46:54 +0000
https://www.taygan.co/blog/2018/03/02/azure-data-lake-series-working-with-json-part-3

In part 3, we ratchet up the complexity once more to see how we handle multiple JSON files with schema structures that require extraction at multiple levels (i.e. highly hierarchical). Using U-SQL via Azure Data Lake Analytics we will transform semi-structured data into flattened CSV files to glean insights on the top restaurants across London.

Important to Note: If you are just beginning and trying to figure out how to parse JSON documents with U-SQL and Azure Data Lake Analytics, I highly recommend kicking off with Part 1 and Part 2 in this series.

Prerequisites

Exercise - Top Restaurants in London

In this final part, I'm going to illustrate how we can extract data from multiple levels of a hierarchical JSON document across multiple files. In this example, I will be using JSON documents that contain data on popular restaurants across London.

Note:

The purpose of this article is to focus on the parsing capabilities of U-SQL.

I will not be providing the Python script used to source the JSON files themselves or provide a copy of the documents. The data source has been chosen for educational purposes only.

That said, a slimmed down representation of what each document looks like is provided below.

In reality, each document contains an array of 250 restaurants. The output will be the result of parsing in excess of 20,000 restaurants spread across 87 files.

Key pieces of information for each restaurant are spread across varying levels in the hierarchy.

There is some inconsistency with the data (e.g. some fields such as cost are null).

Each restaurant can have one or many related cuisines (i.e. an array), but we want to keep the output as one row per restaurant.

Challenge 1 - Multiple Files

In Parts 1 and 2 we were only working with a single file and therefore the path was written in its absolute entirety. In this example we want to point U-SQL to a "file set" by specifying a pattern within curly braces. See official documentation for more information.

The snippet below would match a set of files that begin with document_ and end in .json (e.g. document_0, document_1, ..., document_86.json).

Challenge 2 - Multiple Levels

Fortunately, the JSON assembly includes a MultiLevelJsonExtractor which allows us to extract data from multiple JSON paths at differing levels within a single pass. See the underlying code and inline documentation over at Github.

Challenge 4 - Aggregation

Our final challenge is consolidating our output to ensure there is only one row per restaurant. The problem, each restaurant is related to an array of cuisines. The snippet below shows how we can explode the values of the array and then aggregate into a single row using ARRAY_AGG.

Video: Mastering JSON in Azure Data Lake with U-SQL

Natural Language Understanding (LUIS)
Taygan Rifat | Sun, 11 Feb 2018 13:15:59 +0000
https://www.taygan.co/blog/2018/02/11/natural-language-understanding-luis

Extract intent and key pieces of information from text with LUIS (Language Understanding Intelligent Service), a machine learning based offering that falls under Microsoft's Cognitive Services suite. Natural language understanding is a key component in enabling developers to engineer features out of text (i.e. creating data out of data) as well as building conversation based interfaces.

Content

What is LUIS?

Key Concepts

API Key Limits

Relationships

Video

Demo: Build a Language Understanding Application with Music Skills

Demo: Teach LUIS to recognise a New Entity (Song)

Demo: Train LUIS to be more accurate (Python)

1. What is LUIS?

Language Understanding Intelligent Service (LUIS) is a machine learning based offering that forms part of Microsoft's Cognitive Services suite. LUIS is capable of understanding a user's intent and deriving key pieces of information (entities) from natural language when provided input in the form of text. Combined with other services such as Azure Bot Service and the Bing Speech API, LUIS can enable conversation based interfaces (e.g. chatbots).

2. Key Concepts

Utterance
A sequence of continuous speech followed by a clear pause. In the context of LUIS, utterances are used as input to predict intent and entities.

Entities
Entities are the key pieces of information that we want to extract from utterances. From a developer's perspective, entities can be thought of as variables. For example: Input Text = "Play Billie Jean by Michael Jackson"; Entities include: Song (Billie Jean) and Artist (Michael Jackson).

Non-Interchangeable refers to phrase lists that contain values that are similar in other ways (e.g. Arsenal, Chelsea, Liverpool, Tottenham).

3. Relationships

The following diagram illustrates relationships between our core elements.

4. API Key Limits

Once a LUIS resource has been created via the Azure Portal (as I will demonstrate later in the demo), the API key needs to be added to the LUIS application so we can query our LUIS endpoint. Note: LUIS applications come with a "Starter Key" by default; the purpose of this key is for authoring only.

Basic Tier: 1 million transactions per month (maximum of 50 calls per second).

5. Demo [Video]

In this demo, we will create a LUIS application that can derive intent and entities within the music domain. Fortunately, LUIS has a number of prebuilt domains which is a great way to get started quickly.

Note: The demo has been broken into three parts. If you skip down to part 6 of this post, there are also step by step instructions available.

6. Demo: Build a Language Understanding Application with Music Skills

1. Create a Language Understanding (LUIS) resource within the Azure Portal. We will revisit this resource when we need to retrieve an API key to use in conjunction with a published application.

2. Login to https://www.luis.ai and create a new application. Note: You may need to sign up for a LUIS account.

Note: The domain will have been successfully added once the "Remove Domain" button appears; this can take up to a minute.

You'll notice that your application now contains Intents, Entities and Utterances focused on the Music domain (e.g. Play Music, Increase Volume, Genre, Artist).

4. Train the application.

Before we can execute any tests, we will need to train our application based on the content (Intents, Entities and Utterances) provided by the prebuilt domain. Once trained, we can test the application's ability to understand certain phrases (e.g. Input: "Play Drake's playlist"; Intent: Play Music; Artist: Drake).

Relationships Visualised

Our Application (Music) has many Intents.

Our Intents (e.g. Music.PlayMusic) have many Utterances.

Each Utterance (e.g. play me a blues song) can have zero or more Entities (e.g. Music.Genre).

5. Publish the application and add your API Key.

Production or Staging: Each application has two slots available; this allows developers to deploy and test a non-production endpoint.

Include all predicted intent scores: Checking this option will alter the endpoint to include "verbose=true" (i.e. all possible intents).

Enable Bing spell checker: Checking this option will alter the endpoint to include "bing-spell-check-subscription-key={YOUR_BING_KEY_HERE}". This enhances the intelligence of the application to autocorrect spelling mistakes before processing.

Resources and Keys: It is within this section that you can add your API key from the LUIS resource we created in Step 1.

Important: The region of the LUIS resource must match the region the key is being added to within the Publish screen.

7. Demo: Teach LUIS to recognise a New Entity (Song)

As you may have noticed, while our application is able to derive Genre and Artist, we are missing Song. Follow the steps below to learn how to add a new entity, associate the entity with utterances and invoke LUIS's ability to perform "active learning".

1. Create a New Entity.

2. Label an Utterance.

Navigate to Intents > Music.PlayMusic. In the text box type an example utterance that includes a song name (e.g. "Play Hotline Bling by Drake"), hit enter and label the entities by clicking on them. Once the new utterance is labeled, re-train and test the application.

Notice the difference in results between the updated application vs. the version that is currently published.

3. Active Learning

While our application can now correctly understand the phrase "Play Hotline Bling by Drake", if we test another phrase (e.g. "Play Humble by Kendrick Lamar"), the application fails to recognise the song. Fortunately, as part of LUIS's ability to actively learn, we can review any queries that have hit the endpoint and validate them via the "Review endpoint utterances" section.

While this is great, we obviously don't want to submit and review every possible song, this is not the expectation. Provided enough examples, LUIS begins to learn and establish patterns. That said, we need to provide a lot more utterances. I'll demonstrate how to hit the endpoint programmatically using Python then we'll revert to our application to review.

8. Demo: Train LUIS to be more accurate (Python)

For this demo, we will use Python to programmatically hit our endpoint using data from Shazam in two passes.

Pass 1

Query the application with the first 25 songs (1 - 25).

Review the endpoint utterances and update the entity labels.

Re-train and publish the updated model.

Pass 2

Query the application with the next 25 songs (26 - 50).

Review the endpoint utterances.

We should notice a significant improvement in the application's ability to derive the correct entities.
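The script itself isn't included in this extract; as a rough sketch of what hitting the published endpoint looks like (the endpoint shape, query parameters and the short song list below are placeholders and assumptions rather than the original Shazam data), you could do something along these lines:

import time
import requests

# Placeholders - the endpoint and key come from the Publish screen of your LUIS application
LUIS_ENDPOINT = "https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/<app-id>"
LUIS_KEY = "<your-luis-api-key>"

# A handful of example utterances standing in for the chart data
songs = [
    ("Humble", "Kendrick Lamar"),
    ("One Dance", "Drake"),
    ("Shape of You", "Ed Sheeran"),
]

for song, artist in songs:
    query = "Play {} by {}".format(song, artist)
    response = requests.get(
        LUIS_ENDPOINT,
        params={"subscription-key": LUIS_KEY, "q": query, "verbose": "true"},
    )
    result = response.json()
    print(query, "->", result.get("topScoringIntent"), result.get("entities"))
    time.sleep(1)   # keep well within the per-second call limits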

Note: As you will have gleaned, the more examples we can feed LUIS, the higher the accuracy in predicting entities and intent. There are other ways to work with LUIS more efficiently (e.g. client SDK, batch testing, etc); I won't go over these other methods in this post, but they are worth checking out.

That's it! Hopefully, this gives you a good starting point on how to get started with LUIS and potentially a thought provoker on what might be possible if this capability was to be used in conjunction with other services...

Getting Started with Speech to Text
Taygan Rifat | Fri, 09 Feb 2018 15:23:49 +0000
https://www.taygan.co/blog/2018/02/09/getting-started-with-speech-to-text

With Bing Speech API, I will show you how to convert human speech (i.e. audio) to text. Bing Speech API is part of the Azure Cognitive Services suite and shares the same speech recognition technology used by other Microsoft products such as Cortana.

** Update (8th June 2018) **
It appears the API endpoint has recently changed (this may have occurred around MS Build 2018). I have since updated the code sample. Note: To remain up to date, refer to Microsoft's official documentation.

Content

Bing Speech API

Use Cases

Key Concepts

URL Query Parameters

Pricing

Demo: Speech to Text (Python)

1. Bing Speech API

Part of Azure Cognitive Services, the Bing Speech API shares the same underlying speech recognition technology used by other Microsoft products such as Cortana.

At a high level, the API is capable of:

Converting Speech to Text

Converting Text to Speech

2. Use Cases

Transcribe and analyse customer call centre data.

Build intelligent applications that can be triggered by voice.

Increase accessibility for users with impaired vision.

3. Key Concepts

Utterance
A sequence of continuous speech followed by a clear pause.

Audio Stream
To optimise performance, audio data (e.g. speaking into a mic) is typically collected, sent and transcribed in chunks to form a stream.

API Key
This will be required to programmatically work with the API and can be attained from the Azure Portal once a Bing Speech resource has been created.

4. URL Query Parameters

Recognition Mode
The service optimises speech recognition based on which mode is specified, so it is important to define the mode most appropriate to your application. A concise summary is below; for more details, check out Microsoft's documentation.

3. Copy and paste the code sample below into a file within your virtual environment (e.g. handler.py). Ensure to update the API key which you can attain from the Azure Portal under "Bing Speech API > Resource Management > Keys".

Note: The audio file path will also need to be updated (path of least resistance, simply place the audio file into the same directory).
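The updated code sample itself isn't reproduced in this extract. Purely as an illustrative sketch (the token and recognition endpoints, locale and audio format below are my assumptions and should be checked against Microsoft's documentation), a bare-bones speech-to-text call with requests could look like this:

import requests

API_KEY = "<your-bing-speech-api-key>"   # Azure Portal > Bing Speech API > Keys
AUDIO_FILE = "speech.wav"                # assumed to be 16 kHz, 16-bit mono PCM WAV

# Step 1: exchange the API key for a short-lived access token
token = requests.post(
    "https://api.cognitive.microsoft.com/sts/v1.0/issueToken",
    headers={"Ocp-Apim-Subscription-Key": API_KEY},
).text

# Step 2: send the audio to the speech recognition endpoint
with open(AUDIO_FILE, "rb") as audio:
    response = requests.post(
        "https://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1",
        params={"language": "en-US", "format": "simple"},
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
        },
        data=audio,
    )

print(response.json())   # e.g. {"RecognitionStatus": "Success", "DisplayText": "..."}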

Azure Event Hubs, Stream Analytics and Power BI
Taygan Rifat | Sat, 03 Feb 2018 10:09:20 +0000
https://www.taygan.co/blog/2018/02/02/azure-event-hubs-stream-analytics-and-power-bi

How to get started with Azure Event Hubs and Stream Analytics. Understand the high-level architecture, key concepts, relationships and then finally dive into a demo where we will stream Bitcoin price data in real-time to Power BI.

Content

What is Azure Event Hubs?

Example Use Cases

Key Concepts and Relationships

Demo: Streaming Bitcoin Price Data in Real-Time [Video]

Reference Material

1. What is Azure Event Hubs?

Azure Event Hubs is a highly scalable event ingestion service, capable of processing millions of events per second with low latency and high reliability. Conceptually, Event Hubs can be thought of as a liaison between “event producers” and “event consumers” as depicted in the diagram below.

Note: In each scenario, the data would lose a large share of its potential value if not processed quickly.

3. Key Concepts and Relationships

Event Publisher
An entity that sends data to an Event Hub (ingress).

Event Consumer
An entity that reads data from an Event Hub (egress).

Capture
Capture is a feature that allows streaming data to be automatically stored in either Azure Blob Storage or Azure Data Lake.

Partitions
Partitions are a mechanism by which data can be organised, enabling consumers to only read a specific subset (or partition) of a stream. Partitions also enable downstream parallelism in scenarios where there are multiple consuming applications.

Partition Key
Used by publishers to map event data to a specific partition. If no key is specified, a round-robin assignment is used.

Consumer Groups
A consumer group is a view (state, position or offset) of an entire event hub. Consumer groups allow multiple consumers to have distinct views of an event stream (i.e. read at their own pace). Consumer groups can be thought of as a subscription to an event stream; if another application (i.e. consumer) would like to subscribe to the same stream but process the data differently, this would be an example of where an additional consumer group could be beneficial.

Note: Partitions can only be accessed via a consumer group (there is always a default in an event hub).

Throughput Units (TU)Capacity is defined by selecting a number of Throughput Units (TU) during the initial creation of an Event Hub namespace. The number of throughput units applies to all event hubs within a namespace.

Each unit is entitled to:

Ingress (events sent into an Event Hub): up to 1 MB per second or 1,000 events per second (whichever comes first)

Egress (events read from an Event Hub): up to 2 MB per second

The following diagram illustrates the relationships between the core elements.

While it may be easier to conceptualise the high-level flow with single entities (i.e. Producer, Hub, Consumer), in reality, there can be multiple entities sending events to an event hub and by the same token, multiple consumers reading from an event stream.

4. Demo: Streaming Bitcoin Price Data in Real-Time

In this demo, we will stream live Bitcoin price data to Power BI. We will use a timer-based Azure Function as our "event producer", Stream Analytics as our "event consumer", and an Azure Event Hub as our ingestion service.
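Before walking through the demo, the sketch below illustrates the "event producer" side in plain Python using the azure-eventhub SDK. The demo itself uses a timer-based Azure Function, so treat this purely as an illustration; the connection string, event hub name and the CoinDesk price endpoint are placeholders and assumptions.

import json
import requests
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders - copy these from your Event Hub namespace in the Azure Portal
CONNECTION_STR = 'Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<name>;SharedAccessKey=<key>'
EVENT_HUB_NAME = 'bitcoin-prices'

# Fetch the current Bitcoin price (public endpoint used here purely as an example)
price = requests.get('https://api.coindesk.com/v1/bpi/currentprice.json').json()
event_body = json.dumps({
    'time': price['time']['updatedISO'],
    'usd': price['bpi']['USD']['rate_float']
})

# Send the event to the Event Hub (the ingress side of the diagram)
producer = EventHubProducerClient.from_connection_string(CONNECTION_STR, eventhub_name=EVENT_HUB_NAME)
with producer:
    batch = producer.create_batch()
    batch.add(EventData(event_body))
    producer.send_batch(batch)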

Getting Started with Azure Data Factory
Taygan Rifat | Fri, 26 Jan 2018 | https://www.taygan.co/blog/2018/01/26/getting-started-with-azure-data-factory

Azure Data Factory is a fully managed data integration service that allows you to orchestrate and automate data movement and data transformation in the cloud. In Microsoft's latest release, ADF v2 has been updated with visual tools, enabling faster pipeline builds with less code.

Content

What is Azure Data Factory?

Key Concepts

Visual Authoring

Supported Data Stores

Demo: Blob Storage to Cosmos DB with Zero Code [Video]

What is Azure Data Factory?

Azure Data Factory (ADF) is a fully managed data integration service that enables the orchestration and automation of data movement and data transformation in the cloud. Azure Data Factory works with heterogeneous environments, enabling data-driven workflows to integrate disparate cloud and on-premise data sources.

At the time of this post, Azure Data Factory is available in only three regions (East US, East US 2 and West Europe).

That said, Azure Data Factory does not persist any data itself. All data movement and transformation activities are handled by Integration Runtimes which are available globally to ensure data compliance and efficiency.

DatasetA dataset represents the structure of a data store (e.g. table, file, document) that is intended to be used as an input or output within an activity.

Linked ServiceA linked service provides Data Factory with the information necessary to establish connectivity to an external resource (i.e. much like a connection string).

The following diagram illustrates the relationships between these core elements.

Visual Authoring

One of the most recent developments for Azure Data Factory is the release of Visual Tools, a low-code, drag-and-drop approach to create, configure, deploy and monitor data integration pipelines. If you have used Data Factory in the past, you will know that this type of capability was previously only possible programmatically, either using Azure PowerShell, a supported SDK (Python, .NET) or by invoking the ADF v2 REST API.

Demo: Blob Storage to Cosmos DB with Zero Code

In this demo we will create a pipeline that copies data from a JSON document stored in Azure Blob Storage to an Azure Cosmos DB collection. Both of these data stores have first-class support and are therefore fully configurable via the GUI.

Azure Cosmos DB Graph API with Python
Taygan Rifat | Tue, 23 Jan 2018 | https://www.taygan.co/blog/2018/01/23/azure-cosmos-db-graph-api-with-python

Explore how Azure Cosmos DB can be used to store, query and traverse graphs using Python and the Gremlin query language of Apache TinkerPop.

What is a graph database?

"In computing, a graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A key concept of the system is the graph (or edge or relationship), which directly relates data items in the store. - Wikipedia

The example below illustrates the basic concept. Two nodes (vertices) connected by a single edge (relationship).

When reading up on graph computing, you may notice that the words Node and Vertex are used interchangeably; there is no difference between them.

With that in mind, from here on in I will stick with Vertex as this is the terminology seen throughout the Gremlin query language used by Azure Cosmos DB.

Why should I use a graph database?

Highly connected data models, where relationships are as important as (if not more important than) the nodes themselves, typically benefit from being stored in and queried from a graph database. Traditional databases tend to slow down when handling relationship-intensive data (i.e. many joins), whereas graph databases remain highly performant.

1. Setting Up

1.1 Create an Azure Cosmos DB accountTo get started you will need to create an Azure Cosmos DB account with the API set to Gremlin (graph). Gremlin is the graph traversal language of Apache TinkerPop (an open-source graph computing framework).

1.2 Create a GraphOnce the resource has been successfully deployed, launch Data Explorer and create a new graph.

Azure Cosmos DB > Data Explorer > New Graph

Enter a Database ID (e.g. cosmosDb)

Enter a Graph ID (e.g. cosmosCollection)

Change the Throughput (e.g. 400)

Click OK

1.3 Python Virtual EnvironmentWe will be using the gremlinpython library to programmatically load our graph database. In this example, I am using Python 3.6.1 and gremlinpython 3.2.6 (note: there appears to be an incompatibility between Cosmos DB and the latest release of gremlinpython, which is 3.3.1 at the time of this post).

2. Example Graph

In the example below there are 6 nodes with 8 edges that describe the following:

Two people (Tim and Jonathan) work for a company called Apple.

Tim manages Jonathan.

Tim and Jonathan have Leadership as a common skill.

Jonathan is also competent in Design and Innovation.

In our Python file, we will be executing Gremlin queries. While the language syntax is quite descriptive, you may want to head over to Apache TinkerPop's Getting Started guide to gain a basic understanding of the commands.

2.1 QueriesThe code below contains the queries, as an array of strings, to create the vertices and edges described in our scenario.
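As an illustration, the sketch below shows how such an array of Gremlin query strings could be built and submitted with gremlinpython. The endpoint, key and the exact vertex/edge properties are placeholders and assumptions that follow the scenario above (not the original snippet); the database and graph IDs match the ones created in section 1.2.

from gremlin_python.driver import client

# Placeholders - Azure Cosmos DB > Keys
ENDPOINT = 'wss://<your-account>.gremlin.cosmosdb.azure.com:443/'
PRIMARY_KEY = '<your-primary-key>'

# Vertices (two people, a company, three skills) followed by the edges that connect them
QUERIES = [
    "g.addV('person').property('id', 'tim').property('name', 'Tim')",
    "g.addV('person').property('id', 'jonathan').property('name', 'Jonathan')",
    "g.addV('company').property('id', 'apple').property('name', 'Apple')",
    "g.addV('skill').property('id', 'leadership').property('name', 'Leadership')",
    "g.addV('skill').property('id', 'design').property('name', 'Design')",
    "g.addV('skill').property('id', 'innovation').property('name', 'Innovation')",
    "g.V('tim').addE('works for').to(g.V('apple'))",
    "g.V('jonathan').addE('works for').to(g.V('apple'))",
    "g.V('tim').addE('manages').to(g.V('jonathan'))",
    "g.V('tim').addE('has skill').to(g.V('leadership'))",
    "g.V('jonathan').addE('has skill').to(g.V('leadership'))",
    "g.V('jonathan').addE('has skill').to(g.V('design'))",
    "g.V('jonathan').addE('has skill').to(g.V('innovation'))"
]

# Connect to the graph created earlier (database: cosmosDb, graph: cosmosCollection)
gremlin_client = client.Client(
    ENDPOINT, 'g',
    username='/dbs/cosmosDb/colls/cosmosCollection',
    password=PRIMARY_KEY
)

# Run each query in turn and block until the server has processed it
for query in QUERIES:
    print('Running: ' + query)
    gremlin_client.submit(query).all().result()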

Text Analytics with Microsoft Cognitive Services
Taygan Rifat | Thu, 18 Jan 2018 | https://www.taygan.co/blog/2018/01/18/text-analytics-with-microsoft-cognitive-services

In this post, I show how easy it is to take advantage of Text Analytics, a cloud-based offering from Microsoft's suite of Cognitive Services. I'll demonstrate language detection, key phrase extraction, and a deeper dive into sentiment analysis, parsing 3,000 IMDB user reviews.

In a previous post (Text Mining POTUS with Python), I showed how NLTK can be used to analyse raw text input and derive linguistic features using pure Python. Today, we are going to look at how the process of text analytics can be made even easier using the readily available API that forms part of Microsoft's Cognitive Services suite.

Prerequisites

Tip: Microsoft offers a free tier which allows up to 5,000 transactions per month. Keep in mind that each document processed counts as a transaction.

Demonstration

The example code below demonstrates all three APIs. To get the example working, you will need:

Python 3

The "requests" library (pip install requests)

Update two variables with your own values: ACCESS_KEY and URL

The approach for all three APIs is identical:

Prepare the HTTP request header to include the ACCESS_KEY.

Construct the URL for the appropriate HTTP endpoint (e.g. languages, keyPhrases or sentiment).

Create a POST request which includes the JSON documents to be processed in the body.

Load the response.

The only subtle difference is the structure of the JSON document input when detecting languages (id and text) vs. extracting key phrases or obtaining sentiment (id, language and text).

"""
File Name: text_analytics.py
Author: Taygan Rifat
Python Version: 3.6.1
Date Created: 2018-01-18
"""
import json
import requests
# Azure Portal > Text Analytics API Resource > Keys
ACCESS_KEY = 'INSERT_YOUR_ACCESS_KEY_HERE'
# Text Analytics API Base URL
URL = 'https://YOUR_REGION.api.cognitive.microsoft.com/text/analytics/v2.0/'
def get_insights(api, documents):
"""
Get insights using Microsoft Cognitive Service - Text Analytics
"""
# 1. Set a Request Header to include the Access Key
headers = {'Ocp-Apim-Subscription-Key': ACCESS_KEY}
# 2. Set the HTTP endpoint
url = URL + api
# 3. Create a POST request with the JSON documents
request = requests.post(url, headers=headers, data=json.dumps(documents))
# 4. Load Response
response = json.loads(request.content)
print('------------------------------------')
print('API: ' + api)
for document in response['documents']:
print(document)
def language_detection():
"""
The API returns the detected language and a numeric score between 0 and 1 indicating certainty.
"""
documents = {
'documents': [
{"id":"1", "text":"Le renard brun rapide saute par-dessus le chien paresseux" },
{"id":"2", "text":"敏捷的棕色狐狸跳过了懒狗" },
{"id":"3", "text":"The quick brown fox jumps over the lazy dog" }
]
}
get_insights('languages', documents)
def key_phrases():
"""
The API returns a list of strings denoting the key talking points in the input text.
"""
documents = {
'documents': [
{ "id":"1", "language":"en", "text":"Apple's plan to bring home hundreds of billions of dollars in overseas cash has triggered a guessing game on Wall Street about what it might do with all that money. The tech giant could find itself with about $200 billion to spend, after taxes, if it repatriates all its overseas holdings into the U.S." },
{ "id":"2", "language":"en", "text":"Tableau Software is revamping a core part of its technology to analyse data faster, a move intended to keep up with its customers' increasing big-data needs. The Seattle company, which makes software to visualise analytics, is introducing its so-called Hyper engine in a software update Jan 17. The technology is designed to make the data-visualisation process five times faster, meaning businesses can input millions of data points and see results in seconds." },
{ "id":"3", "language":"en", "text":"Reviews of the Tesla Model 3 praise the car as a futuristic, mold-breaking car that may be the best electric vehicle at its price point. But that doesn't mean it's perfect. Overall, Tesla's first attempt at a less expensive car than their higher-end S and X models has received strong acclaim for its smooth, quiet ride, uniquely minimalist interior and dashboard, and body design." }
]
}
get_insights('keyPhrases', documents)
def sentiment():
"""
The API returns a numeric score between 0 and 1. Scores close to 1 indicate positive sentiment, and scores close to 0 indicate negative sentiment.
"""
documents = {
'documents': [
{ "id":"1", "language":"en", "text":"What a great way to run the public transport in a city ! Loved the regular frequency, clear mapping and the accessible stops. Well done Melbourne !" },
{ "id":"2", "language":"en", "text":"Boarding at Spring st, near Parliament station - initially very crowded as the previous tram broke down, the journey went half way around the city - when at the corner of Flinders and Spencer St we were ll advised to disembark - as it was the end of the drivers shift - and there was no replacement driver - over 100 people were left stranded - truly a poor example of Melbourn hospitality. - Many tourists not knowing how to get back to there original destination." },
{ "id":"3", "language":"en", "text":"What a terrific way to get around the Melbourne CBD. You can hope on any tram within the CBD area and it is free. The Number 35 tram does a complete circuit of the CBD with commentary about Melbourne landmarks but it can get very crowded. Make sure you use it." },
]
}
get_insights('sentiment', documents)
if __name__ == '__main__':
language_detection()
key_phrases()
sentiment()

Sentiment Analysis Example - IMDB User Reviews

Sentiment analysis has a number of real-world business use cases, from analysing support calls to better understand the Voice of the Customer, through to supporting trading strategies on financial markets. In this example, however, we are going to see if we can determine the quality of a movie by analysing the sentiment of its IMDB user reviews.

The Movies

Star Wars: Episode IV – A New Hope (1977)

Star Wars: Episode V – The Empire Strikes Back (1980)

Star Wars: Episode VI – Return of the Jedi (1983)

Star Wars: Episode I – The Phantom Menace (1999)

Star Wars: Episode II – Attack of the Clones (2002)

Star Wars: Episode III – Revenge of the Sith (2005)

High-Level Flow

The Python script scrapes user reviews from IMDB. The response is received as raw HTML.

The HTML is parsed and converted into JSON as an array of documents (ID, Language and Text). The HTTP POST request is made to the Text Analytics API with the JSON passed as data.

The HTTP response contains the results in JSON as an array of documents (ID, Score).

The final output is saved to CSV (Movie Name, Document ID, Score).

Code

Note: ACCESS_KEY and SENTIMENT_URL will need to be updated. The variable LIMIT acts as a kind of throttle (currently set to 5).
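As a condensed sketch of the flow described above (not the full script): the IMDB URL pattern, the review mark-up parsed with BeautifulSoup and the title IDs below are assumptions for illustration only, and the LIMIT throttle mentioned in the note is omitted for brevity.

import csv
import json
import requests
from bs4 import BeautifulSoup

# Azure Portal > Text Analytics API Resource > Keys
ACCESS_KEY = 'INSERT_YOUR_ACCESS_KEY_HERE'
SENTIMENT_URL = 'https://YOUR_REGION.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment'

# Illustrative only: movie name mapped to its IMDB title ID (add the remaining episodes here)
MOVIES = {
    'Star Wars: Episode IV - A New Hope (1977)': 'tt0076759'
}

def get_reviews(title_id):
    """Scrape the user review text for a single movie (assumed URL and mark-up)."""
    url = 'https://www.imdb.com/title/' + title_id + '/reviews'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    return [div.get_text(strip=True) for div in soup.find_all('div', class_='text')]

def get_sentiment(reviews):
    """POST the reviews to the Text Analytics API and return (id, score) pairs."""
    documents = {'documents': [
        {'id': str(i), 'language': 'en', 'text': text}
        for i, text in enumerate(reviews, 1)
    ]}
    headers = {'Ocp-Apim-Subscription-Key': ACCESS_KEY}
    response = requests.post(SENTIMENT_URL, headers=headers, data=json.dumps(documents))
    return [(doc['id'], doc['score']) for doc in response.json()['documents']]

# Write the final output to CSV (Movie Name, Document ID, Score)
with open('sentiment.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Movie Name', 'Document ID', 'Score'])
    for movie_name, title_id in MOVIES.items():
        for doc_id, score in get_sentiment(get_reviews(title_id)):
            writer.writerow([movie_name, doc_id, score])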

In order to stay within the free tier's transaction limit, results are based on ~500 reviews per movie (i.e. ~3,000 reviews in total).

In an ideal world, we would calculate sentiment based on as much data as possible, but for the purposes of this exercise (conveying proof of value), the existing data set should be sufficient.

If you do re-hash this exercise, be aware that your results may differ depending on which sample of user comments is analysed and whether Microsoft's API has since been updated.

While the sentiment score from Microsoft is provided as a value between 0 and 1, I have multiplied the results by 10 (e.g. 0.87 = 8.7) to make comparisons against IMDB ratings more digestible.

Insights:

Results in line with IMDB:

Episode V - The Empire Strikes Back (1980) is the best episode in the series.

Episode IV – A New Hope (1977) is the second best episode in the series.

Episodes II and III are poorer quality movies in comparison.

Results out of sync:

Episode I – The Phantom Menace. According to IMDB, this was the lowest-rated movie, yet the sentiment analysis gave it a favourable score of 8.0.

Lastly, the range of values for Episode V (as depicted by the boxplot visualisation) is a lot narrower. This may point to a tighter consensus amongst reviewers compared to other episodes.

Results Visualised

Finished

Hopefully this gives you a taste of how sentiment analysis can be used and just how accessible the technology is with ready-to-consume, publicly available services such as Microsoft's Text Analytics API.