Using Blockchain To Securely Transfer And Verify Files
MOOG came to CSE in hopes of building a technology solution that would prove the provenance and transfer of digital assets securely between parties. Working together, Microsoft and MOOG set out to develop a demo of this solution that leverages Azure and blockchain technology.

As 3D printing becomes an integral part of the manufacturing industry, new challenges must be solved to promote mass adoption. Primarily, how do you securely transfer accredited schematics and verify their integrity once they've been delivered? In other words, how can you ensure the schematics were not tampered with during the transfer process? This is of paramount importance in mission-critical systems.

MOOG, a multi-billion-dollar company headquartered in Buffalo, New York, sought to develop such a system. MOOG identified a need in the manufacturing industry to verify a part's authenticity before it is installed within a mission-critical system. The very real threat of compromised blueprints causing mechanical failures is a problem the manufacturing industry faces regularly.

At the time MOOG started this innovative journey, no reference or open source technologies existed that solved these complex issues in their entirety. Microsoft was able to partner with MOOG to help solve these real-world problems.

MOOG is traditionally known for manufacturing motion and control systems for the aerospace, defense, industrial, and medical industries, so this project would require development as well as research. In order to produce a usable proof of concept quickly, they chose to collaborate with Microsoft's CSE team, which specializes in such novel engagements.

Challenges and Objectives

At the outset, the biggest obstacle for our provenance solution was designing an immutable database to store the order details at every step of the transaction. The solution also needed to provide a mechanism to verify the integrity of digital blueprints on the buyer’s side.

The problem was a prime candidate for a blockchain solution. Once a block in a blockchain is created it becomes immutable, since each subsequent block incorporates its hash. Furthermore, the cryptographic algorithms used in a blockchain can verify the integrity of each block's data before it is added to the chain.
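To make the integrity property concrete, here is a minimal Python sketch (not Ethereum's actual block format) showing why tampering is detectable: each block stores the hash of its predecessor, so changing any historical block invalidates every block that follows.

import hashlib, json

def block_hash(block):
    # Hash the block's contents, including the hash of the previous block.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def verify_chain(chain):
    # True only if every block correctly references its predecessor's hash.
    return all(curr["prev_hash"] == block_hash(prev) for prev, curr in zip(chain, chain[1:]))

genesis = {"prev_hash": None, "data": "schematic hash A"}
block_1 = {"prev_hash": block_hash(genesis), "data": "schematic hash B"}

print(verify_chain([genesis, block_1]))   # True
genesis["data"] = "tampered schematic"
print(verify_chain([genesis, block_1]))   # False - the stored prev_hash no longer matches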

However, using blockchain came with a major obstacle: we immediately encountered a scarcity of proven design patterns. We knew blockchain was the best approach for verifying the authenticity of manufacturing blueprints given its established reputation in the provenance space. That said, designing an architecture in which MOOG could implement their custom part-ordering logic was a challenge.

We set out with the intention of developing a solution that would allow a “seller” to list 3D parts available for one or more “buyers” to order. Once a part has been ordered, the buyer can request that the seller transfer the file to a location the buyer dictates. After the buyer has printed the part, they log the details of the printing so that the seller has a record of all prints.

Solution

Since we needed a blockchain to back the system, we decided to use Ethereum, which importantly supports user-defined Smart Contracts. We deployed an Ethereum Proof-of-Authority Consortium[1] on Azure to develop against, since it closely matched the production scenario. On our local machines, and for testing, we often used a local Ethereum blockchain called Ganache[2].
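For local development, connecting to Ganache (or to the PoA consortium's RPC endpoint) is a one-liner with web3.py; the sketch below assumes Ganache's default local endpoint.

from web3 import Web3

# Ganache listens on http://127.0.0.1:8545 by default; swap in the PoA consortium's RPC URL for shared environments.
w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))
print(w3.is_connected())    # True when the node is reachable
print(w3.eth.block_number)  # current chain height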

In the end our solution consisted of four key components:

– Ethereum Smart Contracts that acted as the core business logic for interacting with and modifying the state of the chain.
– A Transaction Proxy to form and submit smart contract transactions, abstracting away direct interaction with the chain.
– An Agent, which provides a user-friendly API for users to interact with the solution, as well as additional business logic.
– An Oracle to read the state of the blockchain and notify the Agent of events so that it can react in real time.

A diagram of how these components interact is illustrated below:

Smart Contracts

We used the blockchain in a manner similar to how a database might be used in a non-blockchain system; information such as “who is authorized to buy a part”, “who has ordered parts”, and “what serial number is associated with what part” is all stored on the blockchain. The mechanism that allows us to do this is Ethereum Smart Contracts,[3] which not only store the data on the blockchain but also provide the core business logic for mutating that data. Thus, both the actual data of “how many parts has user X ordered” and the logic that allows “user X to order a part” exist inside a Smart Contract on the blockchain.

Designing the Smart Contracts for the system provided a unique set of challenges. Given the number of restrictions and technical considerations that must be handled when developing Solidity smart contracts, and the fact that the Smart Contracts made up the core business logic of our system, we spent quite a bit of time designing and auditing our smart contracts for security.

In developing the contracts, we decided to make one Smart Contract per “object” type to reduce the amount of code per contract, to give each contract a specific purpose, and to prevent future updates from requiring a full redeployment. Furthermore, we wanted the smart contracts to leverage data stored in other contracts to reduce the amount of data stored per contract. We accomplished this by “linking” the contracts together in a tree-like hierarchy such that they are aware of their “parent contract.”

We developed eight core contracts with the following purposes:

– Seller Root: Each seller has a root contract that contains a collection of the buyers (as Buyer Root contracts) who are registered to order and print 3D part schematics.
– Buyer Root: Each buyer has a root contract which acts as a central reference point for any contracts related to the system, such as the buyer’s Catalog, Order Registry, and Serial Number Generator.
– Catalog: Stores a list of 3D part schematic hashes available for the buyer to order. Ordering a part from the catalog creates an Order contract and registers it inside the Order Registry.
– Order Registry: Stores a list of all orders (as Order contracts) the buyer has placed, and provides other helpful business logic relating to Orders.
– Serial Number Generator: Encapsulates business logic for generating a list of serial numbers to assign to an order.
– Order: Contains details relating to an order, such as the quantity, requests to transfer the schematics, and a print log. It also contains the functionality to initiate a schematic transfer by creating a Transfer Request, and to log a print by creating a Print Job.
– Transfer Request: Stores details and state of a request to transfer a part schematic from the seller to the buyer.
– Print Job: Stores the details and state of an attempted print job of the part schematic.

This approach proved to work well. It allowed us to unit test each contract independently, stubbing out any dependencies required for testing. Additionally, our storage footprint was minimized – common data, such as a part number, would be stored once inside an Order contract, and dependent contracts such as a Print Job would reference upwards in the tree to retrieve it when needed.

Transaction Proxy

Smart Contracts, once deployed on a blockchain, can be cumbersome and difficult to interact with. When a Smart Contract is compiled, an ABI (Application Binary Interface) is created. The ABI defines what methods and variables the contract contains, as well as how to interact with them. Since this ABI is required whenever communication is attempted with the blockchain, we decided to create a single service that would hold a copy of the ABIs used in our solution and broker any interaction with the Smart Contracts.

We called this service the Transaction Proxy, and gave it three responsibilities:
– Form Transaction Payloads for Smart Contract method calls
– Submit Signed Transaction Payloads
– Read Data Contained Within Smart Contracts

This utility service meant that instead of having to update every service with a copy of the Smart Contract ABIs (so that it can communicate with the blockchain), we could store them in only one place (the Transaction Proxy), and funnel all interaction through the Transaction Proxy. Additionally, this service is application-agnostic and can be used with other projects in the future.
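As a rough illustration of what “forming a transaction payload” involves, the proxy can load a stored ABI and use it to encode an unsigned method call. The sketch below uses web3.py; the ABI path, addresses, and gas values are placeholders, not the engagement's actual configuration.

import json
from web3 import Web3

SELLER_CONTRACT_ADDRESS = "0x..."  # placeholder: real checksummed addresses go here
BUYER_WALLET_ADDRESS = "0x..."
SELLER_WALLET_ADDRESS = "0x..."

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))

# ABI fetched from the local filesystem or Azure Blob Storage.
with open("abis/Seller.json") as f:
    seller_abi = json.load(f)

seller = w3.eth.contract(address=SELLER_CONTRACT_ADDRESS, abi=seller_abi)

# Encode an unsigned transaction payload for onboardBuyer(address); the caller signs and submits it separately.
payload = seller.functions.onboardBuyer(BUYER_WALLET_ADDRESS).build_transaction({
    "from": SELLER_WALLET_ADDRESS,
    "nonce": w3.eth.get_transaction_count(SELLER_WALLET_ADDRESS),
    "gas": 3_000_000,
    "gasPrice": 0,  # typical for a private Proof-of-Authority network
})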

The implementation turned out to be pretty straightforward. First, an RPC (Remote Procedure Call) connection to a blockchain node is established, allowing the Transaction Proxy to interact with the blockchain. In our case this was the Ethereum Proof-of-Authority (PoA) consortium we used for development.
Secondly, the Transaction Proxy needs to have access to the Smart Contract ABI files so that it can properly form transaction payloads. For this engagement, we implemented support for local filesystem lookup and Azure Blob Storage lookup.
Lastly, we expose the functionality of the Transaction Proxy through a Web API using Azure Functions.

Here’s an example request that creates a transaction payload which then calls the onboardBuyer(address buyerWallet) method in the Seller contract:
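The request itself is plain HTTP; a minimal sketch using Python's requests library is shown below. The proxy URL and payload shape (beyond the "contractName" and "method" fields described later) are assumptions for illustration.

import requests

payload = {
    "contractName": "Seller",                  # which contract ABI the proxy should use
    "method": "onboardBuyer",                  # contract method to form a transaction payload for
    "arguments": ["0xBuyerWalletAddress"],     # hypothetical field for the method arguments
    "from": "0xSellerWalletAddress",           # hypothetical field for the sending account
}

# Hypothetical Azure Functions endpoint exposed by the Transaction Proxy.
resp = requests.post("https://transaction-proxy.example.net/api/transaction", json=payload)
print(resp.json())  # e.g., the unsigned transaction payload, ready to be signed and submitted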

This gave us a way of interacting with the Smart Contracts in our solution through the Transaction Proxy without needing the actual ABIs on the caller's side. The user still must know the names of the contracts they wish to interact with, as well as the names of the methods to call, but this is far easier to manage than needing an entire ABI file.

Agent

Interacting with the blockchain through the Transaction Proxy requires contextual knowledge of what Smart Contracts are being interacted with. If you didn’t know that a Smart Contract named Buyer existed and had a method named GetCatalog, then you’d never know what to input for the "contractName" and "method" fields for a request to the Transaction Proxy. We wanted to abstract away any sort of interaction like this from the user, so we created another micro-service called the Agent, with which the user would interact instead.

To accomplish this we had the Agent implement individual endpoints for each relevant Smart Contract method.
For instance, the endpoint on the Agent GET /buyer/0x123.../orders/ would return a list of all orders, fulfilled by the Transaction Proxy by calling the GetOrders() method on the OrderRegistry contract belonging to the buyer at the 0x123... address.
Likewise the endpoint POST /parts/cf0194z would create an order for the part whose hash is cf0194z.
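Conceptually, each Agent endpoint just translates a friendly route into the corresponding Transaction Proxy call. A minimal Flask-style sketch is shown below; the Agent's real implementation, routes, and field names are not specified in this post, so everything here is an assumption.

import requests
from flask import Flask, jsonify

app = Flask(__name__)
PROXY_URL = "https://transaction-proxy.example.net/api"  # hypothetical proxy endpoint

@app.route("/buyer/<address>/orders/")
def get_orders(address):
    # Translate the user-friendly route into a read call against the buyer's OrderRegistry contract.
    resp = requests.post(f"{PROXY_URL}/read", json={
        "contractName": "OrderRegistry",
        "contractAddress": address,   # hypothetical field naming
        "method": "GetOrders",
    })
    return jsonify(resp.json())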

We also realized that some operations that might take multiple calls to the Transaction Proxy (and therefore the Smart Contracts) could be simplified into a single call to the Agent.

An example is when a user asks for the details of an order. In our solution, an order has a number of details associated with it: the quantity of parts ordered, the hash of the part, how many parts have been printed, and more. Since the actual data is stored in a Smart Contract, interaction is also restricted to what Smart Contracts allow. One such restriction is that Smart Contract methods cannot return structs (or collections of data), and so each detail of the order (quantity, hash, etc.) must be retrieved individually.

We weren’t satisfied with having our API require the user to make individual calls for each detail and instead wanted to return all the details in a single call. Therefore we wrote additional business logic into the GET /order/0x321... endpoint, so that the Agent would make multiple calls to the Transaction Proxy – getting each individual detail and then combining them into a single response.
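A sketch of that aggregation logic, with hypothetical getter names and proxy routes:

import requests

PROXY_URL = "https://transaction-proxy.example.net/api"  # hypothetical proxy endpoint

def get_order_details(order_address):
    # Smart Contract getters each return a single value, so the Agent fans out one
    # proxy read per detail and combines the results into a single response.
    details = {}
    for field, method in [("quantity", "GetQuantity"), ("partHash", "GetPartHash"), ("printedCount", "GetPrintedCount")]:
        resp = requests.post(f"{PROXY_URL}/read", json={
            "contractName": "Order",
            "contractAddress": order_address,
            "method": method,
        })
        details[field] = resp.json()
    return details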

The Oracle

At this point our solution was almost complete. We had Smart Contracts that allowed us to store and interact with our data on the blockchain, a Transaction Proxy that simplified interacting with Smart Contracts, and an Agent that provided a specialized and user-friendly interface for the Transaction Proxy. The last piece of our solution was to enable the Agent to react to changes to the blockchain instead of only when a user accessed it through a Web API.

In our solution, whenever a request for a 3D part file to be transferred is created on the blockchain, we needed the Agent to automatically see this and react accordingly. To do this, we needed a way to monitor the blockchain transactions and look for events indicating a transfer was created.

To this end, we created an Oracle. An Oracle typically finds and verifies real-world occurrences and submits that information to a blockchain to be used by Smart Contracts; this can be done by listening to events or activity on the chain. For this project we decided to go with a chain-watcher approach rather than the usual event-listening approach. In other words, we wanted an Oracle that would read all the blocks on the chain and log every change that happens to a contract, as opposed to just listening for events triggered by a Smart Contract.

To implement an Oracle in this way, we decided to leverage the Torchwood project, an open-source Ethereum library for reading blocks on a chain and logging contract changes. The Torchwood project was initially started here at Microsoft, and we extended its functionality further to be able to detect and read event triggers on the chain. Torchwood also provided the ability to cache the chain data in storage, allowing for further processing.

The next obstacle was deciding how the Oracle would notify the Agent. We wanted to follow an observer design pattern to create a loosely coupled connection. Also, it was vital to ensure multiple agents could listen for the same contract changes. For these reasons, Event Hubs was the ideal choice for our needs. The Oracle would post any changes on Event Hubs topics and the agents could subscribe to any topics they were interested in.
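A minimal sketch of the Oracle's notification side using the azure-eventhub Python SDK; the connection string, hub name, and event schema are assumptions (the actual Oracle is built on Torchwood, as described above).

import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT_HUBS_CONNECTION_STRING>",
    eventhub_name="contract-changes",  # hypothetical hub name
)

def publish_contract_change(contract_address, event_name, payload):
    # Publish the detected contract change; subscribed Agents consume it from their consumer groups.
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({
        "contract": contract_address,
        "event": event_name,
        "data": payload,
    })))
    producer.send_batch(batch)

publish_contract_change("0xOrderContractAddress", "TransferRequestCreated", {"requestId": 1})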

Summary

MOOG wanted to build a technology solution that proves the provenance and transfer of digital assets securely between parties.
Through examination of available technologies, Microsoft and MOOG set out to develop a demo of this solution that leverages Azure and blockchain technology.

Along the way we discovered that many parts of our design pattern are reusable for similar projects built on blockchain.
We have open-sourced the Oracle and we’re exploring doing the same with the Transaction Proxy, hoping that it will help others designing similar solutions.

Preventing Rhino Poaching Through Microsoft Azure

In the last eight years, poaching in Africa has been occurring at an alarming rate. Currently, the continent loses three rhinos a day. As part of our team’s mission to find new ways that technology can positively impact our world, we’ve been collaborating with the Combating Wildlife Crime team from Peace Parks Foundation (PPF), in partnership with the South African conservation agency Ezemvelo KZN Wildlife (Ezemvelo). Together, we’re working to support rhino anti-poaching efforts through the power of Cloud Computing and AI on Azure. PPF facilitates the establishment and sustainable development of transfrontier conservation areas throughout southern Africa, with the aim of restoring critical ecosystems to the benefit of man and nature alike.

Figure 1: Rate of rhino poaching in recent years

Digital transformation in the AI for Good domain helps PPF and Ezemvelo take a holistic view of how to strategically plan rhino anti-poaching activities. The smart park initiative coordinates collaboration across the various players in the wildlife crime prevention space and empowers timely decision making by providing tools for data analysis.

In 2018, PPF piloted the approach of using images taken by camera traps as a source of additional insight into what’s happening in conservation areas.

This blog post describes the challenges, solutions, and technical details related to building a scalable and reliable system that can trigger rangers’ rapid response through a real-time alert announcing that a person has been detected in the monitored area. A subsequent blog post will cover the Machine Learning specifics of the problem space in more depth.

Challenges and Objectives

Our plan was to build a reliable end-to-end system that could rapidly detect suspicious human activity in the conservation areas and alert rangers. The challenges included:

– Designing a system to achieve balance between price, complexity and fault tolerance.
– Addressing privacy, compliance, and security.
– Ensuring the pipeline is resilient to intermittent failures.
– Having an AI solution that can detect people in conditions with very limited light (e.g., nighttime photos taken in the savanna).

Solution

Design considerations

In designing the architecture of the alerting system, we adhered to the following requirements:

– Build a scalable solution that can handle a variable (through day, night and seasons) load of camera trap based input.
– Guarantee the resilience of the solution to expansion of the camera trap installations over time.
– Ensure a quick turnaround time so that rangers receive an alert as fast as possible.
– Make it easy to deploy the alert system to additional conservation parks.
– Find a solution that takes into consideration price, complexity and fault tolerance.
– Ensure privacy, compliance and security are handled with the high standard required by the crime prevention use case.
– Enable PPF and its partners to have automatic build and deployment processes.

In the subsequent sections we describe the architectural design of the system that satisfies the above requirements.

Azure Functions-based pipeline

As PPF doesn’t have the resources to manage server infrastructure on its own (i.e., a virtual or physical server, the operating system, and other web server hosting processes required for an application to run), we considered Platform as a Service (PaaS) and Serverless approaches for hosting our application. Both of these approaches eliminate the need to manage server hardware and software. The primary difference is in the way the application is composed and deployed, and therefore in how it scales.

Azure offers Web Apps for PaaS. With Web apps, an application is deployed as a single unit. Scaling is only done at the entire application level.

With a Serverless approach using Azure Functions, an application is composed of individual, autonomous functions. Each function is hosted by the Azure Functions runtime and scales automatically as the number of requests increases or decreases. It’s a very cost-effective way of paying for compute resources, as we only pay for the time the functions run, rather than paying to have an application always running.

We chose Serverless deployment with Azure Functions as our hosting platform. An additional benefit to choosing Azure Functions is the built-in auto-retry mechanism. Some pieces of our code (which we isolated into functions) depend on third-party services that can be down or have connectivity issues. The auto-retry mechanism becomes important to ensure fault tolerance. We decided to use queues for communication between functions. When a function fails, Azure Functions retries the function up to five times for a given queue message, including the first try. If all five attempts fail, the functions runtime adds the message to a poison queue. We can monitor the poison queue and take appropriate actions.

Architectural overview

Regardless of the model, motion-activated devices (camera traps) support the ability to send photos by email, so we leveraged this feature to move images from a device to the cloud. As our email service we used SendGrid, which provides reliable transactional email delivery and supports setting up a webhook to intercept incoming email. This webhook becomes the entrance to our image-processing pipeline.

Figure 2: The solution architecture

The SendGrid webhook triggers the Process Email Azure Function. This function parses the email to extract information about the organization and camera, saves this metadata to an Azure PostgreSQL Database, and uploads the images to Azure Blob Storage. An output binding of the function is the Process Image queue where the function puts blob URLs.

The Process Image Azure Function is then triggered from this queue. Its main job is to call our scoring service to get a probability of people detected in the image. More details about the scoring service will be covered in the next section. Based on the detection results and business logic, the function decides if an alert should be sent to rangers. If that’s the case, it delivers the blob URL to the SendAlert queue.
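A condensed sketch of what the Process Image function might look like with the Azure Functions Python programming model; the binding names, scoring endpoint, and alert threshold are assumptions, and the real queue/storage bindings would be declared in function.json.

import json
import requests
import azure.functions as func

SCORING_URL = "https://scoring-service.example.net/score"  # hypothetical scoring endpoint

def main(msg: func.QueueMessage, alertQueue: func.Out[str]) -> None:
    blob_url = msg.get_body().decode("utf-8")

    # Ask the scoring service for the probability that a person appears in the image.
    result = requests.post(SCORING_URL, json={"image_url": blob_url}).json()

    # Business logic: only alert rangers when the detection is confident enough.
    if result.get("person_probability", 0.0) > 0.5:
        alertQueue.set(json.dumps({"blob_url": blob_url, "detection": result}))

    # An unhandled exception here would cause the runtime to retry the queue message
    # (up to five attempts) and then move it to the poison queue.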

Each park can have a specific tool that rangers use for notifications – for example, Command and Control Collaborator (CMORE), Domain Awareness System (DAS), email, etc. The SendAlert Azure Function sends alerts to the appropriate notification system. The alert contains the location of the camera, the timestamp of the photo, the results of the detection, and the photo attached. Here is an example of an alert in CMORE:

Figure 3: An alert in CMORE (partial view due to the sensitivity of the data)

When deploying a complex service to Azure, it’s important to have at least one test environment to check the system’s health before proceeding with deployment to production. We used Azure resource groups that have an identical list of resources in order to separate Dev and Production environments.

To automate the initial deployment of Azure resources, we used Azure Resource Manager (ARM) templates. Using ARM templates, we can repeatedly deploy the solution throughout the development lifecycle and have confidence that resources are deployed in a consistent state. Automated initial deployment also ensures that the resources in the Dev and Prod resource groups are created from the same templates.

We used Key Vault for securely storing secrets such as the SendGrid API key, the database password, etc. To optimize performance and cost, the Azure Functions don’t read secrets from Key Vault directly. Instead, when the function apps are created with ARM templates, the deployment reads the secrets from Key Vault and sets them as environment variables in the appropriate function app.

For continuous integration and deployment we used Azure DevOps. Each time a pull request (PR) is created with suggested changes, it triggers a PR build that runs unit tests and linting. When a pull request is completed and changes are merged to master, it starts a Master build that archives the functions and publishes them as artifacts. Completion of a Master build triggers automatic deployment of the functions to the Dev stage. When changes are verified in the Dev stage and approved to be applied to production, a release of those changes to the Prod resource group can be triggered manually and requires at least one person’s approval. This whole process ensures that features are delivered safely as soon as they’re ready.

Figure 4: Azure DevOps

Machine learning operationalization

In this section we will cover the deployment of the Scoring Service.

Scoring is the process of generating prediction values from a trained Machine Learning model given new input data – an image in our scenario. As we trained our custom model to detect people within the specific African game park environment, the result of the scoring service is the probability of detecting people in the given image.

End to end, the scoring flow interacts with the trained Machine Learning model by sending POST messages to a web service. From a hosting/deployment perspective, we can treat the scoring service as a black box that takes images as input and returns the results of people detection as output.
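Treated as a black box, a call to the scoring service is just an HTTP POST. A minimal client sketch is below; the endpoint URL and response shape are assumptions.

import requests

with open("camera_trap_photo.jpg", "rb") as f:
    image_bytes = f.read()

# Hypothetical scoring endpoint; the service responds with the probability that people are present.
resp = requests.post(
    "https://scoring-service.example.net/score",
    data=image_bytes,
    headers={"Content-Type": "application/octet-stream"},
)
print(resp.json())  # e.g., {"person_probability": 0.97, "boxes": [...]}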

To achieve the goal of having high-quality detection results we trained an object detection model using Python Deep Learning libraries Keras and Tensorflow.

It is worth mentioning that the current state-of-the-art algorithms are highly computationally- and memory-intensive, as they involve significant amounts of matrix-related calculations.

Design considerations

We identified the following requirements for the scoring service deployment:

– Response time should be less than 2 minutes
– Should be easy to maintain
– Reasonable price
– Scalable
– Fault tolerant
– Option to use GPU compute

As the serverless architecture of the system is built on Azure Functions, it would be natural to consider Azure Functions as the hosting infrastructure for the scoring service as well. As of December 2018 (when we started the project), Python support in Azure Functions had not been released to general availability. Hosting our model through Azure Functions would additionally limit the underlying infrastructure to being CPU only. With more and more cameras being deployed into parks, and even more advanced Machine Learning image analytics being developed, using GPU machines for scoring may eventually be not only faster but cheaper. We faced the same constraint with the App Service option.

Deploying the scoring service manually as a web service on an Azure Virtual Machine (VM) would make it neither scalable nor resilient to faults. Deploying a number of distributed VMs behind a load balancer would solve that problem, but deployment and management of such a system becomes cumbersome.

Further investigation brought us to the Azure Machine Learning service, which became generally available in December 2018. Azure ML is a cloud service that can be used for the end-to-end Machine Learning model management life cycle – including training and deployment of Machine Learning models – all at the broad scale that the cloud provides. At the time of writing there are three compute options available for the scenario of deploying a model to the cloud:

– Azure Kubernetes Service (AKS)
– Azure ML Compute
– Azure Container Instances (ACI)

The AKS option is recommended for real-time scenarios in production, as it is good for high-scale production deployments, provides autoscaling, and enables fast response times. It satisfied all our requirements and we chose Azure ML AKS as a hosting platform for our scoring service.

Scoring service hosting on Azure Machine Learning Kubernetes Cluster

Kubernetes is a container orchestration system for automating application deployment, scaling, and management across clusters of hosts. Azure offers an Azure Kubernetes Service (AKS) to make it simple to deploy a managed Kubernetes cluster in Azure. AKS reduces the complexity and operational overhead of managing Kubernetes by offloading much of that responsibility to Azure.

Azure Machine Learning builds front-end services on top of AKS for predictably fast latency. The default cluster deployed to Azure ML will have 3 front-end services. For fault tolerance all services should land on different VMs (nodes) – thus it’s required to have at least 3 VMs (nodes) in the cluster. Front-end Azure ML services, in addition to AKS services, consume resources – so it’s recommended to have at least 12 cores in the cluster. As a result of these considerations, the smallest AKS cluster will have 3 nodes of Standard_D3_v2 VMs, where each VM has 4 cores.

One of the benefits of using an Azure ML AKS cluster is that it can host multiple services. For example, different models can be deployed as separate services to the same cluster. In that case, the autoscaling settings for each service allow resource distribution rules – such as CPU and memory – to be set up across the scoring web services. At the time of writing, Azure ML AKS did not support scaling at the node level – the number of nodes (VMs) is defined during deployment. Autoscaling at the pod level (pods are also called containers or replicas in this context) is supported. A pod is the basic building block of Kubernetes – the smallest and simplest unit that you create or deploy, representing a running process on the cluster. The Azure ML AKS autoscaler adjusts the number of containers (pods, replicas) depending on the current request rate. Currently, for the PPF scenario we deployed only one type of scoring model to Azure ML AKS – the model that detects people – and we therefore turned off autoscaling.
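With the azureml-core SDK, the replica count and per-container resources are set through the AKS deployment configuration; the values below are illustrative rather than the ones used for PPF.

from azureml.core.webservice import AksWebservice

# Fixed replica count with autoscaling disabled, plus per-container CPU/memory requests.
aks_config = AksWebservice.deploy_configuration(
    autoscale_enabled=False,
    num_replicas=3,   # hypothetical number of scoring containers (pods/replicas)
    cpu_cores=1,
    memory_gb=4,
)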

In preparation for deployment we calculated the number of containers required for the service – how many parallel requests we can and need to process. The parameters below were taken into account to calculate the estimate:

In this example, a cluster of 18 16-core nodes should be deployed. And, when choosing specific VM configuration, CPU/memory requirements should be taken into account. For example, if a container needs 4 GB of RAM and we deploy 16 of them per VM, nodes should have at least 64GB of RAM.
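A back-of-the-envelope sizing helper; the input numbers are hypothetical and chosen only to reproduce the example figures above.

import math

containers_needed = 288       # hypothetical: parallel scoring requests to support
containers_per_node = 16      # hypothetical: containers packed onto each 16-core VM
ram_per_container_gb = 4

nodes = math.ceil(containers_needed / containers_per_node)     # 18 nodes
ram_per_node_gb = containers_per_node * ram_per_container_gb   # at least 64 GB of RAM per node
print(nodes, ram_per_node_gb)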

The people detection Machine Learning model had C dependencies that needed compilation. Here is how we did that:

# /var/azureml-app/ is where dependencies from ContainerImage.image_configuration are uploaded.
# Each RUN executes in a fresh shell, so the commands are chained to keep the working directory.
RUN mkdir model_dependencies && \
    cd model_dependencies && \
    tar -xvzf /var/azureml-app/modelSetup.tar.gz && \
    python setup.py build_ext --inplace && \
    pip install -e .

Conclusion

During our collaboration with Peace Parks Foundation and Ezemvelo we built a scalable pipeline for processing images and alerting rangers to the presence of people within a restricted territory. The pipeline design is extendable and could be adapted to process additional sources of data – for example, audio. The approach we have covered in this article could be applied to any domain where independent stability is important (for example, manufacturing).

Digital transformation empowered by AI can make a difference in various areas of critical importance to our world: global climate issues, sustainable farming, biodiversity and water conservation, just to name a few.

We hope our narrative will spark ideas and help turn concepts into reliable software – perhaps even software with an AI for Good story.

Using uPort For Authentication On Blockchain To Improve Standards On The Food Logistics Chain

Food ingredients travel thousands of miles along logistics chains. One bad batch of produce can ruin a restaurant’s reputation, but it’s hard to identify who’s at fault. CSE has worked with Hitachi to build a sample infrastructure leveraging blockchain – thus achieving a new level of accountability for those producing, storing and serving food.

Background

Supply chains – made up of individual suppliers, shippers, warehouses, and customers – are extremely complicated networks. When something goes wrong along a supply chain, finding out where and how a shipment went bad can consume a huge amount of time and energy. This is especially true in the food logistics industry. While modern consumers may not think too much about the journey food products take before ending up on our plates, ingredients travel thousands of miles through multiple stages of the logistics chain.

When a bad batch of produce is discovered, it has traditionally been challenging to pinpoint which stage of the logistics process was the root of the problem. Did the farm produce a contaminated batch? Was the produce stored at the right temperature during the different stages of transportation and storage? Or did the restaurant or store simply prepare the food badly?

Restaurants and stores risk ruining customer trust and facing lawsuits if they cannot ensure that the products they sell adhere to the standards of quality legislated by multiple jurisdictions across a complicated network of local, national, and international regulations. This means that once a trustworthy supplier is found, switching to a cheaper one or trying someone new can be far too risky to even attempt.

Aiming to tackle this multifaceted problem, Microsoft partnered with Hitachi‘s Application Services Division to build out a sample infrastructure together, leveraging the power of blockchain. Such an infrastructure could ensure that suppliers, carriers, and distributors are more accountable for producing, storing and serving food in a safe manner.

Blockchain technology is offering a new level of accountability across a range of industries. From storing energy companies’ emissions data to recording the source of raw minerals, wood, and metals, the tamper-proof nature of data stored on the blockchain is making it more difficult for individual players to cut corners and use unethical practices. It also offers a means of increasing transparency in the food logistics industry, allowing restaurants and stores to put their faith in new suppliers and distributors. By leveraging a publicly distributed ledger, the supplier, carrier, and restaurant can effectively work together to get the best food possible to customers – without risking exposure to bad actors along the way.

Using blockchain, individual players on the logistics chain can be assigned an identity that, while tied to the underlying chain, is otherwise app-agnostic; while we used uPort, another app that implemented the same callbacks and protocols would work just as well. This off-chain identity can be used to take ownership of a crate of goods. The individual can sign for the goods received and record a picture of the item, which should include a thermochromatic seal. The seal changes color once a certain temperature is reached, and won’t change back. Anyone with access to the chain can audit the information stored and catch bad actors.

Challenges and Objectives

We needed a way to track products as they changed hands along a complicated network of organizations. The solution needed to be versatile – anything platform-specific wouldn’t fly, since we needed most organizations to be able to pick this solution up easily. Additionally, because of the nature of supply chains, one actor in the chain might deal with two competitors at any level. While we ended up with a local solution, a long-term centralized store of trust was a luxury we couldn’t afford.

This unique set of prerequisites meant blockchain was a perfect fit. For this discussion, suffice to say that blockchain is immutable, decentralized, and trusted by those who use it. While these statements come with some qualifiers, we won’t be delving into those nuances here.

Our approach was to build a web page that could be accessed via mobile or desktop. This web page would use a public blockchain as a back end, storing ownership immutably. The proof of concept would only deal with proof and transfer of ownership of generalized objects. A theoretical implementation would include a picture stored in a hash addressable storage network (IPFS/Swarm) and that hash being stored on chain. A peer would just have to look up the hash of the image via its address, compare it to the hash of the image received, and verify they match. However, we wouldn’t deal with actual pictures on chain just yet. This approach was versatile, decentralized, and secure.

Since blockchain is still a technology in its infancy, we knew that good design principles would be hugely important to the hack. Blockchain as a concept is a highly secure system, but only if it’s used correctly. Badly written contracts are still contracts, and we knew that secure and durable contract design would be our biggest challenge during the hack.

General Flow

Going into the hack, there was some interest in blockchain as a technology, but there wasn’t much expertise in this area. The user for this solution would be low-income earners, including farmers, in Southeast Asia. Given this, a low-cost system that could be built up quickly and distributed easily was required. We suggested the use of uPort, a smartphone app. This was a great fit for us, since smartphones are nearly ubiquitous. In Vietnam, for example, smartphone penetration is over 70%, which means that an app-based solution would be able to support the majority of use cases, without any additional infrastructure investment. While Hitachi had already started working with blockchain, with Microsoft’s help they were able to get their idea fully realized in a proof of concept. Additionally, we were able to provide vital insights into best practices in using the chain.

In general, the flow that we ended up with from farm to table follows the path outlined in the following chart:

In this example, a farmer creates a shipment and registers it on the chain. When a carrier picks it up, they initiate a transfer with the farmer and take responsibility for the package on the chain. Additionally, they could include a picture of the state of the seal to verify that it’s been added to the package and that it hasn’t risen above the allowed temperature during storage.

They take the package to a warehouse in this example, which puts another transfer on the chain. The warehouse can add that same info, and so can the second carrier, who finally takes it to the restaurant. Not only is there undeniable proof of where a bad shipment went wrong, but if the patrons want (and the restaurant made their view of the chain available to them), they could trace the history of their food all the way back to the farm where their produce came from, since all of that information is saved on the chain.

Proof of Concept (PoC)

Methodologies

Login

Since we knew that the system was going to be used by multiple users within the supply chain, we of course needed a way to associate users with steps on the chain. That way, when a user transferred goods to another user, we would be able to follow that transaction through the chain as a transaction between two different addresses, each address representing a user. On top of that, we had to associate users with sessions within a web browser, in case users wanted to access the page from a desktop. We wrote a web page that interacts with the blockchain to allow an actor to create or take ownership of a crate of goods, with approval needed from the current owner if there is one. First, the user would create a browser session and associate it with their identity as stored in the uPort app.

To do this, the user would be presented with a QR code. The QR code is a graphical representation of a token, created to be associated with this session. The web service uses a local data store to keep track of this session information.

All that would be left for the user to do is to scan the QR code they’ve been presented with (they could also simply click the QR code if they’re accessing the page from mobile).

Upon scanning the QR code, the user would receive a notification on their phone, letting them know which application they’re about to interact with, and allowing them to approve or deny the attestation. The user would verify who they’re working with, and the web client would associate the user’s identity with the current client session.

Every QR code generated encodes a callback with a token associated with the user and the desired action. This QR code in particular, as stated, was designed to associate the browser session with the user’s information from uPort via a callback. To observe when this has happened, the browser polls the web service to see when a login request comes in. With each request, the browser checks whether that request contains a known user’s information to associate with the session that was created. If it’s a match, the browser can proceed, and will grab information from the web service about other known transactions. In the future, there is an opportunity to build out more functionality on chain – things like membership fees, subscription levels, and more could be stored on chain – but for now we don’t actually need login to be associated with the chain.
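A minimal sketch of the web-service side of this flow, using Flask; the routes, token format, and callback payload fields are illustrative assumptions rather than the uPort API.

import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
sessions = {}  # local data store: session token -> identity address (None until the QR code is scanned)

@app.route("/session", methods=["POST"])
def create_session():
    # Issue a token for a new browser session; the browser renders it as a QR code.
    token = str(uuid.uuid4())
    sessions[token] = None
    return jsonify({"token": token})

@app.route("/callback/<token>", methods=["POST"])
def identity_callback(token):
    # Hit after the user approves the attestation in the identity app.
    sessions[token] = request.json.get("address")  # hypothetical payload field
    return "", 204

@app.route("/session/<token>", methods=["GET"])
def poll_session(token):
    # The browser polls until an identity has been associated with its session.
    return jsonify({"identity": sessions.get(token)})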

Package Creation

Now that the user has logged in, we wanted them to be able to do something. Namely, this would be the point where a supplier in the supply chain would be able to add new goods to be tracked on the blockchain. In this example, the supplier has just grown and packed a new shipment of cucumbers. While ideally they’d include a picture of the cucumbers with the thermochromatic seal mentioned earlier, for the proof of concept we just built out a system that could track the transfer of ownership through the system.

We built a simple UI to allow a farmer to create a package of some product. Upon creating the package, the farmer is presented with a popup in the uPort app tied to the identity associated with this browser session, asking if they’re sure they want to create a new package.

We needed creating packages to be easy and secure. On top of that, this was a perfect fit for our need to be able to verify every step of the way. Once the farmer hits approve, the transaction would be signed with the private key within the uPort app. That signature would then be sent back to the service, which submits the transaction to a node in the Ethereum network. This submission is relayed to other nodes, and the transaction is added to a new block on chain. An example of one of these transactions on chain can be seen here on etherscan.io.

Package Transfer

The only other piece that we wanted to prove out on chain was the ability to transfer a package to another user. To do so, we built a UI flow around this, allowing us to set up a transfer QR code to be shown to another organization within the supply chain. Since this is a permanent transfer of ownership, we made sure to have the current owner confirm that this was indeed the package they wanted to transfer.

This creates and displays another QR code. Here again, the browser watches the public chain to make sure it understands when and where that transaction takes place.

While the browser watches the chain for updates, the user is presented with another QR code to scan. This time, the recipient would scan the QR code and get a notification ensuring that they’re taking ownership of what they’re expecting. In the future, they could include information about the pickup (such as the date and time) and also a picture of the package itself, to ensure that the chromatic seal is applied and associated with that crate. This is especially useful if this transfer process is automated: since a human wouldn’t be checking each and every seal at each and every transfer, storing all the information on the chain automatically would mean that anyone could trace the package back and see where exactly the seal was broken.

Once the carrier accepts, that information goes onto the chain and the transfer is complete. The web browser – which again, has been watching the chain for the completion of this transfer – updates its internal model of who owns the package. And voila! We’ve transferred a package from supplier to carrier, all verifiable yet decentralized, sitting on a blockchain.

Summary

Once this demo was built out, Hitachi had a working proof of concept for blockchain-backed proof of ownership of packages, as well as a way to transfer those packages from an owner to a recipient. Our solution also offers the exciting opportunity to bring OAuth to the blockchain. We are working to understand what a general solution for this might look like, and considering how we might build out an open source library that fits the specs of OAuth 2.0 for blockchain. In particular, the flow that we created resembles Google’s OAuth for devices flow, and we believe that this solution could be transformed into a library for OAuth on blockchain.

Today this solution lives here. We’d appreciate any feedback, comments, concerns, questions, and ideas for what this project could use to expand.

Issues

Transaction Latency

In this example, every QR code that is displayed turns into a transaction that is mined on the chain. This can be slow: at the time of writing, it takes somewhere between 15 and 30 seconds to mine a block. For a traditional application, comparable transactions take nowhere near this amount of time. Also, if packages are added individually to the chain, the time to transfer a truckload of goods from a carrier to a restaurant may increase quite a bit.

This can be mitigated by reducing the number of times a block is introduced to the chain. In particular, two parties transferring could perform multiple transactions together off chain, logging them to the chain afterwards.

Sacrifices for the demo

In this demo, we manually scanned QR codes with our phones. In reality, this could be automated to a certain degree. We did play around a little with adding links to QR codes so that the users could just interact with mobile pages. At the end of the day, carriers won’t be dealing with unitary crates of cucumbers, but with truckloads of them at a time, and having some sort of sidechain would be a great way to minimize the number of times that a transaction hits the chain. This was not explored in this demo.

While pictures were part of our ideal flow, this wasn’t built into the demo either. Initially, we stored pictures on chain, but this amount of information would be very expensive to store. Instead, a different solution would have to be built out for picture storage. When pictures are generated, they’d be hashed, and that hash would be stored on chain. This, along with other useful metadata, was not built into the demo, due to time constraints.

Real-time Streaming Of 3D Enterprise Applications From The Cloud To Low-powered Devices

Microsoft recently partnered with AVEVA, an engineering, design and management software provider to the Power, Oil & Gas and Marine industries. AVEVA’s challenge is one that is becoming more and more common in the construction visualization space: the ever-increasing complexity of 3D data that needs to be highly interactive for a customer base that operates on mobile platforms (smartphones, tablets, and untethered headsets like HoloLens). AVEVA believes that remote rendering and live asset streaming is the way forward for their business.

As a result of our collaboration, we built 3D Streaming Toolkit, an open-source toolkit that provides an approach for developing 3D server applications that stream frames in real-time to other devices over the network. Specifically:

– A server-side C++ plugin and samples for remotely rendering and streaming 3D scenes

Background

The next platform for human-computer interaction is mixed reality. Mixed reality is the blanket term that encapsulates Virtual Reality, Augmented Reality, Wearable Computing and other device verticals that enhance or replace our natural senses and capabilities with human-centered technologies.

Unlike the human-computer interactions of the 20th century, which were machine-focused, the future of computing in the 21st century rests decidedly in swinging the pendulum towards humanity-first solutions. The physical button – including remotes, keyboards, mice, touchscreens, dials, and switches – won’t disappear from our lives. Mixed reality is an additive platform that provides human beings with a broader range of ways to interact with technology, improving efficiency, experiences and expression.

Mixed reality isn’t happening in a vacuum. It is growing up in a mobile computing world. This means smaller, portable, battery-powered devices that are wearable and comfortable for users. The total shipment forecast for 2020 is over 60 million headsets. It is important to note that smartphone solutions like Google Daydream, as well as fit-to-purpose untethered headsets like the HoloLens and others already in the market, will outsell tethered solutions by 2020, with a 54% market share, up from 17% in 2018.

The biggest problem for these standalone HMDs (head-mounted displays) is the inherent tradeoff between mobility and fidelity. However, the race has already started to bring “desktop”-class experiences to these larger-volume devices. Perhaps even more importantly, there is a massive appetite for existing mobile products like smartphones and tablets to deliver high-end 3D visualization experiences far beyond their on-board GPU capabilities. With billions of these devices already in the wild, and mixed reality devices following the same base platform technologies and additive technology strategy, addressing both segments is within reach.

AVEVA built a multi-discipline 3D plant design product called Everything3D (AVEVA E3D screenshot shown above) that increases project productivity, improves project quality and drives down project time. This software can render a full-sized plant without any loss of detail and enables engineers to make real-time changes to the plan. Such detailed rendering requires substantial compute power and is out of reach for low-powered devices like smartphones, tablets and headsets such as HoloLens. AVEVA wanted a solution that could use their existing product and capabilities but work across platforms, including a mixed-reality experience on HoloLens, so that engineers can interact with the content in the real world.

The solution

After analyzing the market and AVEVA’s requirements, we decided the best solution was to develop cloud rendering capabilities for their 3D application. This approach also addressed a number of security, content management and cross-platform issues that were critical to AVEVA’s success. Our initial implementation was specifically created for AVEVA’s rendering application, but we quickly realized that the components behind this solution were re-usable and we decided to open-source a toolkit that will enable cloud rendering for any 3D application. We called it 3D Streaming Toolkit (3DSTK) and we developed it to meet the following goals:

Low latency rendering of complex scenes

Latency is the main blocker for any cloud-rendering scenario. Unlike consumer solutions like game streaming, with enterprise solutions we can mandate end-to-end hardware requirements to ensure smooth and predictable performance with commercial wireless access points, managed switches and carefully considered data center routing.

With the correct configuration, our end-to-end latency from the cloud to clients is around 100ms.

Multiple peers connected to a single instance of a server

In a large-scale deployment, it is crucial that multiple peers can connect to a single server. This allows the re-use of the same dataset in multiple viewpoints and substantially lowers the cloud computing costs.

Consistent delivery across heterogeneous device ecosystems

The mobile devices on the market have extremely variable compute and GPU capabilities onboard. With cloud rendering, server hardware is managed by the enterprise and the clients are essentially playing a real-time video, a capability that is standard in almost all mobile devices on the market.

Extremely lightweight on-device footprint

The datasets that are useful for visualization are continually changing and are prohibitively large. It’s neither feasible nor practical to push multi-gigabyte updates down to devices to maintain shared context across users in an organization.

With server managed context, rolling out updates can happen instantaneously to the user.

Inherent protection of potentially sensitive IP and assets

Often the datasets and models required to enable users to perform basic work functions are extremely sensitive IP. By keeping this off-device, they can both keep the asset current and have no worries about losing their IP due to theft, malware, or malicious agents.

Ease of remote sharing, recording for training and supervision

Shared POV context is critical for many traceability and transparency scenarios – law enforcement, liability, etc.

Fast, intuitive development, scale and rollout

For the foreseeable future (next 5-7 years) there will be a continued explosion of new hardware players as well as continual platform flux across Windows, iOS and Android. A service-based rollout strategy dramatically reduces exposure to new hardware adoption and capital expenditures for device management.

Cloud / On-Prem solution

For installations without low-latency networks (e.g., oil rigs, conventions, and types of transit like planes, trains, and buses) the same technology can run in a Windows Server instance locally. This was a critical need for development and testing, enabling isolation of network issues from other functional issues during development of their products and services.

The open source toolkit has the same components used for AVEVA’s solution. We simply abstracted the components to enable easy integration with other rendering applications. Now that we understand the goals of this toolkit and why it’s important for enterprise customers, let’s dive into the architecture behind 3D Streaming Toolkit.

Architecture

Instead of developing a solution from scratch, the team chose to evaluate existing technologies that could be leveraged to achieve the stated goals. The team chose to make use of the WebRTC (Web Real-Time Communications) protocols and API, as well as the NVENCODE hardware encoding library from NVIDIA, to enable zero-latency encoding on the server side.

NVENCODE

Most NVIDIA graphics cards include dedicated hardware for accelerated video encoding, and this capability is engaged through NVIDIA’s NVENCODE library. This library provides complete offloading of video encoding without impacting the renderer’s graphics performance. This was key to the success of our implementation, as any delay in encoding would drastically influence latency and make the experience unusable. These capabilities are also available on Virtual Machines on the cloud, like the NV series VMs from Azure.

For video encode acceleration, we chose to use NvPipe, a simple and lightweight C API library for low-latency video compression. It provides easy-to-use access to NVIDIA’s hardware-accelerated H.264 and HEVC video codecs and is a great choice for drastically lowering the bandwidth required for networked interactive server/client applications.

WebRTC

The WebRTC project was released by Google in 2011 as an open source project for the development of real-time communications between browser-based applications. The stated goal of WebRTC is “To enable rich, high-quality RTC applications to be developed for the browser, mobile platforms, and IoT devices, and allow them all to communicate via a common set of protocols.” The project has been widely applied to support low-latency VOIP audio and video applications.

Below is a diagram outlining the typical functional blocks used by WebRTC solutions. Each client that connects to WebRTC can act as both a client and a server of audio, video, and data streams. The clients make use of a Signaling Server to establish connectivity to one another. Once connected, a STUN service is used to evaluate the connectivity options between clients in order to establish the most performant pathway for the lowest latency communications. This service allows for the possibility of direct peer-to-peer sessions. When direct connectivity between clients is not allowed by VPN, firewall, or proxy, then a TURN service is used as a message relay.

All communication between clients is managed through one or more data channels. The Video Engine is offered as a middleware service to establish a video data channel and to automate buffer, jitter, and latency management. The Audio Engine performs the same duties for audio connectivity and is tailored for efficient processing of voice data. Applications are free to establish multiple data channels for custom, arbitrarily formatted messages.

The diagram below summarizes the 3DStreamingToolkit’s extensions to typical WebRTC usage. The set of video encoders is expanded to include NVIDIA NVENCODE hardware encoder library for real time encoding of 3D rendered content. A custom data channel is established to manage camera transforms and user interaction events, and these are applied by plugins that encapsulate rendering by Unity or native OpenGL or DirectX rendering engines.

In practice, it is only useful to apply NVENCODE on the rendering server side, so the WebRTC diagram can be simplified to the diagram below. Since the TURN Service and rendering service are on the same network, there is no performance penalty for always making use of the TURN relay service, and as a result the STUN service can be bypassed.

A common desired scenario is to connect multiple peers to a single instance of a server. The architecture above is targeted at a single server to a single peer, but our toolkit is also capable of a large-scale architecture to allow thousands of users to connect to a streaming experience.

Large scale cloud architecture

The team decided to create a reference architecture for large-scale deployments that makes it easy for any partner to deploy the components to Microsoft Azure, customize the configuration for any scenario, and start streaming a custom 3D experience.

This architecture contains the required WebRTC servers (signaling and TURN) and an orchestrator capable of monitoring and scaling up/down pools of VMs that host the rendering applications. Clients can easily connect to the signaling server, and the orchestrator decides which VM should serve each user.

Now, let’s look at some of the 3D Streaming Toolkit components in more detail:

A server-side C++ plugin and samples for remotely rendering and streaming 3D scenes

One key challenge was to develop a server-side C++ plugin that can grab the GPU buffer of any 3D application (DirectX for AVEVA’s solution), encode it with H264 NvPipe, and send it over the network to any connected client using WebRTC. The clients will then use native WebRTC components to receive and decode the H264 frame and display it on the screen as soon as it’s received.

We keep the RGB image buffer created by the application in GPU memory and pass the pointer down to WebRTC. Once a frame is ready to be sent, our WebRTC H264 encoder extension will use NvPipe to encode the frame at the GPU level. This frame does not need any color conversion or any CPU copying to output the encoded image. This enables zero-latency encoding and does not affect any rendering capability.

In our main game loop, we then check all the connected peers, update the camera transformation for each perspective, render the scene, and send the raw frame buffer to the peer conductor. This ensures that each user will have their own experience and can freely navigate the scene based on their input.

For a stereoscopic experience, we update the left/right views corresponding to each eye. Our HoloLens clients have sample code on how to project the frames correctly and display a stereoscopic experience.

The HoloLens client is a special case as it also allows stereoscopic rendering. With the use of frame prediction and custom matrix projections, we enabled a smooth experience that can run at 60fps. Currently, the Unity HoloLens client does not have frame prediction and we recommend the DirectX version for any production quality experiences. For more info, check our HoloLens docs: https://3dstreamingtoolkit.github.io/docs-3dstk/Clients/directx-hololens-client.html

WebRTC extensions for 3D content and input

WebRTC is a key component of our solution but we needed to add a few features to enable 3D streaming. The HoloLens client is originally based on the WebRTC UWP repo maintained by the Windows Developer Platform team at Microsoft. We published all of our changes as part of a unified native and UWP WebRTC repository.

Video frame extensions to allow pointers of RGB and texture data to be passed to the encoder

WebRTC uses I420 as the main video frame buffer format across the stack, whereas our renderers produce RGB and texture data. To support this, we extended the VideoFrame class with methods that accept pointers to RGB and texture data and pass them directly to the encoder.

NvPipe provides a simple C interface to enable low-latency encoding, and we integrated this library into the WebRTC H264 encoder plugin. We pre-compile the DLL based on our branch and keep it inside the server application. When the encoder is created, we dynamically load the DLL and create an NvPipe instance.

For HoloLens clients, we needed a way to sync frame prediction timestamps between the server and clients. To achieve this, we extended the VideoFrame and EncodedFrame classes to also hold prediction timestamps.

Conclusion and Reuse

Through our collaboration with AVEVA, we developed an open-source toolkit that enables the creation of powerful stereoscopic experiences that run on the cloud and stream to low-powered devices. This approach addresses several business concerns: consistent delivery across heterogeneous device ecosystems, a small on-device footprint, protection of potentially sensitive IP, and fast, intuitive development, scale, and rollout. These were crucial to AVEVA and to many other partners that have started using our open-source solution.

The solution can be adapted to any cloud-rendering scenario and we have seen many industries – including medical, manufacturing and automotive – adopting this approach. It is our hope that the work presented here will lead to further collaborations in the future on additional projects using the 3D Streaming Toolkit.

Watch this episode of the Decoded Show for an in-depth look with Gary Bradski, the inventor of OpenCV, and the Nestlé team.

In the Global Burden of Disease Study 2015, acne was identified as the 8th most common disease in the world and is estimated to affect as many as 700 million people globally. This skin health condition, medically known as acne vulgaris, occurs when pores become clogged with dead skin cells and oil from the skin, creating blackheads, whiteheads and, as inflammation worsens, red pimples.

Due to the common nature of the ailment, there are numerous treatments available in markets around the world. However, for consumers suffering from acne it can be challenging to self-assess their acne lesions, to choose the products they should use, and to monitor the impact of treatments. Not all consumers can seek or want to seek help from qualified dermatologists. To date there have been limited digital tools available universally which allow consumers to safely and adequately self-assess, treat and monitor the progress of their acne.

Our mobile app allows users to self-assess the severity of their own acne with roughly the same accuracy as a dermatologist’s assessment. With the input of additional demographic information — such as age, gender, race, skin type, and acne history — the app is also able to suggest treatment plans appropriate to the specific level of severity of the user’s acne, as well as guide users towards intelligent choices for skin care products, indoor or outdoor activities, nutritional suggestions, and more.

While the app is not intended to replace a visit to a dermatologist for clinical assessment in all cases, it provides Nestlé Skin Health the opportunity to actively engage with their customers, make dermatological expertise user-friendly and universally accessible, and address their customers’ needs promptly. This application significantly shortens the feedback loop for their products with real-time customer data, thus enabling the company to proactively design new products or improve existing ones.

Figure 1. AI empowers consumers with cost-effective, curated, useful, scientific knowledge-based solutions. With expert assessment, treatment and management on demand and right in the palms of their hands, consumers have a high-quality alternative to the unyielding and expensive health care system to help them solve their simplest health problems.

Challenges and Technical Overview

In previous experiments, teams had attempted to detect and classify acne lesions by applying the texture method, the HSV (hue, saturation, value) method and ‘k-means’ techniques, but were not able to find a suitable application to accurately determine the severity of a specific user’s acne. The Nestlé Skin Health internal data science team had also explored the texture method with reasonable success but could not match the targeted Root-Mean-Square Error (RMSE) of their in-house dermatologist-labelled images.

To train and evaluate CNN models, every image in the training and testing sets has to be labeled. We developed a web app in Azure to allow Nestlé Skin Health dermatologists to label the images by just using their web browsers.

Train the CNN model on a limited number of images with substantial noise in both images and labels.

We applied a facial landmark detection model and the one-eye OpenCV model to detect and extract skin patches on faces, and then developed an innovative image augmentation method to help the trained CNN model generalize better on testing data. Due to the limited number of images in the training data, we applied transfer learning: we used the pre-trained ResNet-152 model as a feature extractor and trained a fully-connected neural network on just the feature set of the training images.

Operationalize the CNN model so that any application can use it to predict the severity of acne.

We built a web service API using Python Flask. A docker image was created to hold the built model and all its dependencies and was published to an Azure container registry. The docker image was then deployed to an Azure Kubernetes Service (AKS) cluster and hosted in Azure.

Approaches

Image Labeler Web App on Azure

The joint Microsoft and Nestlé Skin Health teams used a total of 4700 unlabeled selfie images that included 230 images selected by a Nestlé Skin Health dermatologist as ‘golden set’ testing images, and the remaining 4470 images which were used as training images. The golden set images were separated from the model training process and used to evaluate the performance of the CNN models.

To train and test the CNN models, we first needed to label the images. The 4470 training images were randomly split into 11 even folds, and each fold was assigned to one of the 11 Nestlé Skin Health dermatologists to be labeled. In addition, each dermatologist was also assigned the entire golden set, meaning each training image was labeled only once, whereas each golden set image was labeled 11 times. The dermatologists were not told which images were from the golden set to ensure the experiment was fair.

Figure 2. The image allocation mechanism.

To facilitate this labeling process, we developed a simple Python Flask web app on Azure as shown in Figures 3.1 and 3.2. This web app allowed the dermatologists to check and label each image by just using a web browser. The access to images through the web application was secured through Azure Active Directory (AAD) to ensure that only those dermatologists whose Microsoft accounts were added to the AAD were able to access the labeling web app. More details about how we built an AAD-secured web app using Python and Azure SQL Database for image labeling will be shared in a future code story.

Figure 3.1. Web App using Python Flask

Figure 3.2. Web Interface for Labeling

Data Augmentation

The selfie images used in this application varied dramatically in quality and backgrounds. This presented a big challenge for the CNN model since it introduced a significant level of background noise to the training data, especially considering the number of training images was limited. We took the following steps to augment the images so that the trained CNN models could generalize efficiently on the golden set images:

Step 0. Pre-screened images of poor quality.
The quality of the selfie images posed a big challenge for classification. Among the 4430 training images, there were a significant number of images with poor environment control, such as over- or under-exposed images or images with very low resolution. Therefore, the first step we took was to manually pre-screen the training images and remove images with low resolution or bad exposure, leaving about 1000 images.

Step 1. Extracted skin patches from facial skin.
This step was implemented in a Jupyter notebook that extracts forehead, cheek, and chin skin patches from the raw images using the facial landmark model and the One Eye model. In this step, we extracted different skin patches from the forehead, both cheeks, and chin of each face image by applying the pretrained Shape Predictor 68 Face Landmarks (landmark model) published by Akshayubhat on github.com. Sample Python code for applying the landmark model to detect facial features can be found here. We found that the landmark model did not work as well if there was only one eye clearly visible in the image, such as when a selfie is taken from a profile perspective. To extract skin patches from these types of profile images, we employed the One Eye model in OpenCV to detect the location of the single eye, and then inferred the regions of the forehead, the cheek side, and the chin skin patches.

Figure 4.1. Detected Facial Landmarks

Figure 4.2. Extracted Facial Skin Patches

For skin patches taken from training images, the label of the source image was assigned to each extracted patch. Here is some sample Python code for facial landmark detection.
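The original sample isn’t reproduced in this post; the following is a minimal sketch using dlib’s frontal face detector and the 68-point shape predictor (the image path and predictor file location are placeholders):

import dlib
import cv2

# The 68-point predictor file (shape_predictor_68_face_landmarks.dat) is downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("selfie.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for face in detector(gray, 1):
    landmarks = predictor(gray, face)
    points = [(landmarks.part(i).x, landmarks.part(i).y) for i in range(68)]
    # the landmark points can then be used to crop forehead, cheek and chin regions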

Step 2. Rolled skin patches to help the CNN models generalize better on the validation and golden set images. Rolling the extracted skin patches proved to be a critical data augmentation step in this application, as it mitigates the spatial sensitivity of CNN models.

In CNN models, the features extracted from a certain region will be projected onto a certain location (neuron) in the feature space. Generally, the CNN model will not recognize an acne lesion well if the lesion on a testing image appears in a location the model has never seen lesions in before. The seriousness of acne, however, does not depend on where lesions appear, but on the number and severity of lesions across a patient’s face.
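The exact rolling scheme used in the project is not shown in this post; a minimal NumPy sketch of the idea – circularly shifting each skin patch so that lesions appear at new positions – might look like this:

import numpy as np

# patch: an HxWxC skin-patch image (a blank placeholder array is used here)
patch = np.zeros((128, 128, 3), dtype=np.uint8)

rolled_patches = []
for shift in range(0, 128, 32):
    # np.roll shifts the patch along both spatial axes, wrapping pixels around
    rolled_patches.append(np.roll(patch, shift=(shift, shift), axis=(0, 1)))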

Converting Classification to Regression Problems

Another challenge we faced was in cleaning up the labels on our images, as many of the image labels provided by the dermatologists were noisy. When organizing our samples, we noticed that there were multiple identical or close-to-identical images in the training image set which had been labeled differently by different dermatologists, which would have posed a challenge when training a classification model. To mitigate the impact of label noise on the model, instead of building an image classification model, we decided to build a regression CNN model.

In the regression CNN model, we assigned ordinal numerical values to the five acne severity levels as follows: 1-Clear, 2-Almost Clear, 3-Mild, 4-Moderate, and 5-Severe. Our team decided that assigning ordinal numerical values to the severity levels made sense here, since a higher numerical value corresponds to more severe acne lesions.

Transfer Learning Model

For the next part of the process, we used a pre-trained deep learning model in CNTK, ResNet-152, to extract features from the training image skin patches. We then trained a fully-connected neural network model on these features, to make the entire deep learning model specific to the acne severity classification domain.

The features we extracted were from the last max pooling layer of the pre-trained ResNet-152 model. The trained fully-connected neural network was then stored for use in the scoring pipeline. The transfer learning model was implemented in the Step3_Training_Pipeline Jupyter notebook.

The feature extraction from the ResNet-152 model was implemented in the following function:

from PIL import Image
import numpy as np

# image_width, image_height, loaded_model and output_nodes are defined elsewhere in the notebook:
# the model input size, the loaded ResNet-152 model and its feature (pooling) layer output node.
def extract_features(image_path):
    img = Image.open(image_path)
    resized = img.resize((image_width, image_height), Image.ANTIALIAS)
    # convert RGB to the BGR channel order expected by the model
    bgr_image = np.asarray(resized, dtype=np.float32)[..., [2, 1, 0]]
    # move the channel axis to the front (CHW layout)
    hwc_format = np.ascontiguousarray(np.rollaxis(bgr_image, 2))
    arguments = {loaded_model.arguments[0]: [hwc_format]}
    # extract the features from the pretrained model and return them
    output = output_nodes.eval(arguments)
    return output

The features of the training image patches were used to train a fully-connected neural network (FCNN) with three hidden layers of 1024, 512, and 256 hidden neurons, respectively.
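The original network was defined and trained in CNTK; that listing is not reproduced here. As a rough illustration only, an equivalent fully-connected regression network in Keras might look like the following (the feature length and training parameters are assumptions):

from keras.models import Sequential
from keras.layers import Dense

feature_dim = 2048   # assumed length of the extracted ResNet-152 feature vector

fcnn = Sequential([
    Dense(1024, activation="relu", input_shape=(feature_dim,)),
    Dense(512, activation="relu"),
    Dense(256, activation="relu"),
    Dense(1)   # regression output: predicted acne severity
])
fcnn.compile(optimizer="adam", loss="mse")
# fcnn.fit(train_features, train_severity, epochs=50, batch_size=32, validation_split=0.1)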

The trained FCNN model was then saved and used to predict the labels of the golden set images. This image scoring process was implemented in the Step4_Scoring_Pipeline Jupyter notebook.

Performance Metrics and Bar

Since we had built CNN regression models to tackle the problem, we chose RMSE to measure the performance of the model on the golden set images. Since each image on the Golden set had 11 labels from all 11 dermatologists, after replacing acne severity levels with numerical values, we used the average of the 11 numerical labels to represent the ground truth rating of the severity level of each image.

As we mentioned earlier in the image augmentation section, multiple skin patches were extracted from each selfie image, and the model was trained on the individual skin patches and used to predict the severity levels of each sample. We chose to use the average of predicted severity values of the multiple skin patches as the final predicted severity level of each selfie image.

Nestlé Skin Health and Microsoft decided to set the performance bar of this CNN model to be at least as good as the highest RMSE among the 11 dermatologists. The RMSE of dermatologist i is calculated as RMSE_i = sqrt( (1/N) * Σ_k ( y_i,k − ȳ_k )² ), where y_i,k is the label provided by dermatologist i on image k, ȳ_k is the average label of all 11 dermatologists on image k, and N is the number of golden set images. It describes how far the labels provided by the dermatologist are from the consensus of all 11 dermatologists on the golden set images.
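As a small illustrative sketch (with toy label values, not the project’s data), this per-dermatologist RMSE can be computed as follows:

import numpy as np

# labels: one row per dermatologist, one column per golden-set image (toy values)
labels = np.array([[3, 2, 4],
                   [3, 3, 4],
                   [2, 2, 5]], dtype=float)

consensus = labels.mean(axis=0)                        # ground-truth rating per image
rmse_per_derm = np.sqrt(((labels - consensus) ** 2).mean(axis=1))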

The following table lists the RMSE of all 11 dermatologists. The highest of these values, RMSE = 0.517, was set as the performance bar that the CNN model had to beat to be considered a success.

ID      1      2      3      4      5      6      7      8      9      10     11
RMSE    0.517  0.508  0.500  0.495  0.490  0.484  0.454  0.450  0.402  0.400  0.388

Performance of Our Deep Learning Models

We can see that the CNN model achieved an RMSE of 0.482, which is better than the target performance bar of 0.517 set at the beginning of this engagement. There is also a strong positive correlation (0.755) between the predicted and actual severity values.

If we use [1.5, 2.5, 3.5, 4.5] as the list of edges to discretize the ground truth and the predicted severity values into categorical severity levels – with values below 1.5 labeled as 1, values in the range 1.5–2.5 labeled as 2, and so on – we get a confusion matrix as shown in Figure 5.2.
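A minimal sketch of this discretization (with toy severity values; edge handling follows the description above) could look like this:

import numpy as np
from sklearn.metrics import confusion_matrix

edges = [1.5, 2.5, 3.5, 4.5]
y_true = np.array([1.2, 2.7, 3.4, 4.8, 2.1])   # toy consensus severity values
y_pred = np.array([1.4, 2.4, 3.8, 4.4, 2.6])   # toy predicted severity values

true_levels = np.digitize(y_true, edges) + 1    # < 1.5 -> 1, 1.5-2.5 -> 2, ...
pred_levels = np.digitize(y_pred, edges) + 1
print(confusion_matrix(true_levels, pred_levels, labels=[1, 2, 3, 4, 5]))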

We can see that the model is extremely accurate for images of mild acne (82%), but had more difficulty differentiating Almost Clear (label 2) from Mild (label 3), Moderate (label 4) from Mild, and Severe (label 5) from Moderate.

This was consistent with what we observed in the original training images labels, where some identical images were labeled differently by different dermatologists.

Figure 5.1 Real vs. Predicted Severity

Figure 5.2 Confusion Matrix after Discretizing Severity Levels

Operationalizing the Scoring Pipeline in Containers and Azure Kubernetes Services for Scalability

We then operationalized the trained CNN model, together with the image augmentation steps, as a Python Flask web service API using an Azure container registry and Azure Kubernetes Service (AKS), so that a selfie image could be sent to the web service API and an acne severity score returned. Assuming that you have installed docker on your local machine, cloned the github repository to a local directory, and changed into that directory, you can take the following steps to create a docker image with the Flask web service API and deploy this image to AKS:

Step 1. Run the following command to build a docker image nestleapi.

docker build -t nestleapi .

Step 2. Tag the nestleapi image with the login server of your Azure Container Registry (ACR).

docker tag nestleapi <acrName>.azurecr.io/nestleapi:v1

Step 3. Push the nestleapi image to the registry.

docker push <acrName>.azurecr.io/nestleapi:v1

Step 4. Deploy the application to an AKS cluster:

kubectl apply -f nestledeploy.yml

The web service API in this example takes the URL of an image as its input. To test the web service API, you can send it a scoring request and check the returned severity score.
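For example, a minimal Python sketch using the requests library (the endpoint path and the payload field name are assumptions and should be adjusted to match the deployed service):

import requests

scoring_url = "http://<aks-endpoint>/score"          # placeholder endpoint
payload = {"image_url": "https://example.com/selfie.jpg"}

response = requests.post(scoring_url, json=payload)
print(response.status_code, response.json())          # expected: an acne severity score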

Conclusion and Future Recommendations

During this collaboration, our team built a CNN deep learning model to assess the severity level of acne lesions from selfie images. This will allow Nestlé Skin Health to develop an application which will make it easier for acne sufferers all over the world to self-assess their condition, choose the most appropriate products for treatment and monitor the response of their acne over time.

The biggest challenge in building an image classification CNN model with a level of accuracy comparable to that of a human dermatologist, was reducing the background noise in the selfie images as much as possible. Experience shows us that this type of noise would have had a significant impact on the accuracy of the model, and the ability of the model to generalize effectively when using a small set of training data.

During the process, our team first applied the landmark and one-eye models to extract skin patches from different sectors of the face, and designed a special image augmentation technique, image rolling, to help the model generalize well on the testing images. Our results show that transfer learning is an extremely effective method to train a domain-specific model when using a small sample training set.

The authors feel the model could be further improved. In the current model the team used the label of the entire face image as the label of each skin patch. In retrospect, this potentially introduced additional noise to the labels of the training skin patches in cases where most of the face was clear and only one skin patch had moderate acne lesions. Furthermore, although the trained deep learning model outperformed the performance bar, it could potentially be improved if more labeled images were available with less label noise and better-controlled environmental factors such as resolution and exposure. Finally, if the training images had included metadata, additional corrections could have been implemented, which may also have improved the model significantly.

While GPUs are immensely powerful for training deep learning models and scoring new images, they are expensive, which could limit their use in certain applications. As an additional test, our team assessed the performance of the solution using ResNet-50 (instead of ResNet-152, as mentioned above) as the feature extractor, keeping everything else unchanged. Using this technique, we achieved approximately the same performance as reported previously, still better than the performance bar. This means that for future projects it would be possible to migrate the image scoring pipeline onto Microsoft’s Brainwave (FPGA) platform to achieve high throughput while reducing costs at the same time.

Nestlé Skin Health plans to make this application globally available to the millions of people who are affected by acne. Using the app will make the consumer’s journey a more individualized experience, with instantaneous analysis of selfie images and personalized guidance for treatment and management using interactive coaching. Nestlé Skin Health and Microsoft hope this will not only have a positive impact on consumers but also provide healthcare professionals around the world with new insights on acne based on learning from the deep neural network.

For anyone interested in tackling facial dermatological problems using a similar model, you can find code samples in the relevant GitHub repository.

Acknowledgements

We would like to acknowledge the great support on this collaborative work from our colleagues Thierry Roulette and Laurent Chantalat at Nestle Skin Health.

]]>https://www.microsoft.com/developerblog/2019/02/05/assessing-the-severity-of-acne-via-cell-phone-selfie-images-using-a-deep-learning-model/feed/0https://www.microsoft.com/developerblog/2019/02/05/assessing-the-severity-of-acne-via-cell-phone-selfie-images-using-a-deep-learning-model/Running Parallel Apache Spark Notebook Workloads On Azure Databrickshttp://feedproxy.google.com/~r/microsoft/devblog/~3/DQxhkpirruY/
https://www.microsoft.com/developerblog/2019/01/18/running-parallel-apache-spark-notebook-workloads-on-azure-databricks/#respondFri, 18 Jan 2019 19:39:57 +0000https://www.microsoft.com/developerblog/?p=9978This article walks through the development of a technique for running Spark jobs in parallel on Azure Databricks. The technique enabled us to reduce the processing times for JetBlue's reporting threefold while keeping the business logic implementation straight forward. The technique can be re-used for any notebooks-based Spark workload on Azure Databricks.

In today’s fast-moving world, having access to up-to-date business metrics is key to making data-driven customer-centric decisions. With over 1000 daily flights servicing more than 100 cities and 42 million customers per year, JetBlue has a lot of data to crunch, answering questions such as: What is the utilization of a given route? What is the projected load of a flight? How many flights were on-time? What is the idle time of each plane model at a given airport? To provide decision-makers answers to these and other inquiries in a timely fashion, JetBlue partnered with Microsoft to develop a flexible and extensible reporting solution based on Apache Spark and Azure Databricks.

A key data source for JetBlue is a recurring batch file which lists all customer bookings created or changed during the last batch period. For example, a batch file on January 10 may include a newly created future booking for February 2, an upgrade to a reservation for a flight on March 5, or a listing of customers who flew on all flights on January 10. To keep business metrics fresh, each batch file must result in the re-computation of the metrics for each day listed in the file. This poses an interesting scaling challenge for the Spark job computing the metrics: how do we keep the metrics production code simple and readable while still being able to re-process metrics for hundreds of days in a timely fashion?

The remainder of this article will walk through various scaling techniques for dealing with scenarios that require large numbers of Spark jobs to be run on Azure Databricks and present solutions that were able to reduce processing times by over 60% compared to our initial solution.

Developing a Solution

At the outset of the project, we had two key solution constraints: time and simplicity. First, to aid with maintainability and onboarding, all Spark code should be simple and easily understandable even to novices in the technology. Second, to keep business metrics relevant for JetBlue decision-makers, all re-computations should terminate within a few minutes. These two constraints were immediately at odds: a natural way to scale jobs in Spark is to leverage partitioning and operate on larger batches of data in one go; however, this complicates code understanding and performance tuning since developers must be familiar with partitioning, balancing data across partitions, etc. To keep the code as straightforward as possible, we therefore wanted to implement the business metrics Spark jobs in a direct and easy-to-follow way, and to have a single parameterized Spark job that computes the metrics for a given booking day.

Cluster Size and Spark Job Processing Time

After implementing the business metrics Spark job with JetBlue, we immediately faced a scaling concern. For many Spark jobs, including JetBlue’s, there is a ceiling on the speed-ups that can be gained by simply adding more workers to the Spark cluster: past a certain point, adding more workers won’t significantly decrease processing times. This is due to added communication overheads or simply because there is not enough natural partitioning in the data to enable efficient distributed processing.

Figure 1 below demonstrates the aforementioned cluster-size related Spark scaling limit with the example of a simple word-count job. The code for the job can be found in the Resources section below. The graph clearly shows that we encounter diminishing returns after adding only 5 machines to the cluster; and past a cluster size of 15 machines, adding more machines to the cluster won’t speed up the job.

After using cluster size to scale JetBlue’s business metrics Spark job, we came to an unfortunate realization. It would take several hours to re-process the daily metrics. This was unacceptable.

Figure 1: Processing time versus cluster size of a simple word-count Spark job. We note that past a specific cluster size, adding more machines to a job doesn’t speed up the runtime anymore.

Parallel Execution of Spark Jobs on Azure Databricks

We noticed that JetBlue’s business metrics Spark job is highly parallelizable: each day can be processed completely independently. However, using separate Databricks clusters to run JetBlue’s business metrics Spark job on days in parallel was not desirable – having to deploy and monitor code in multiple execution environments would result in a large operational and tooling burden. We therefore reformulated the problem as such: was there a way in which we could run JetBlue’s jobs in parallel on the same cluster?

The Driver Notebook Pattern in Azure Databricks

Azure Databricks offers a mechanism to run sub-jobs from within a job via the dbutils.notebook.run API. A simple usage of the API is as follows:

// define some way to generate a sequence of workloads to run
val jobArguments = ???
// define the name of the Azure Databricks notebook to run
val notebookToRun = ???
// start the jobs
jobArguments.foreach(args =>
  dbutils.notebook.run(notebookToRun, timeoutSeconds = 0, args))

Using the dbutils.notebook.run API, we were able to keep JetBlue’s main business metrics Spark job simple: the job only needs to concern itself with processing the metrics for a single day. We then created a separate “driver” Spark job that manages the complexity of running the metrics job for all the requisite days. In this way we were able to hide the complexity of scheduling for performance from the business logic. This fulfilled our code simplicity goal with JetBlue.

Using Scala Parallel Collections to Run Parallel Spark Jobs

Upon further investigation, we learned that the run method is a blocking call. This essentially means that the implementation is equivalent to running all the jobs in sequence, thus leading back to the previously experienced performance concerns. To achieve parallelism for JetBlue’s workload, we next attempted to leverage Scala’s parallel collections to launch the jobs concurrently.

Figure 3 at the end of this section shows that the parallel collections approach does offer some performance benefits over running the workloads in sequence. However, we discovered that there are two factors limiting the parallelism of this implementation.

First, Scala parallel collections will, by default, only use as many threads as there are cores available on the Spark driver machine. This means that if we use a cluster of DS3v2 nodes (each with 4 cores), this approach will launch at most 4 jobs in parallel. This is undesirable given that the calls are IO-bound instead of CPU-bound and we could thus be supporting many more parallel run invocations.

Additionally, while the code above does launch Spark jobs in parallel, the Spark scheduler may not actually execute the jobs in parallel. This is because Spark uses a first-in-first-out scheduling strategy by default. The Spark scheduler may attempt to parallelize some tasks if there is spare CPU capacity available in the cluster, but this behavior may not optimally utilize the cluster.

To further improve the runtime of JetBlue’s parallel workloads, we leveraged the fact that, at the time of writing with runtime 5.0, Azure Databricks is enabled to make use of Spark fair scheduling pools. Fair scheduling in Spark means that we can define multiple separate resource pools in the cluster which are all available for executing jobs independently. This enabled us to develop a mechanism that guarantees that Azure Databricks will always execute some configured number of separate notebook runs in parallel.

This mechanism is somewhat more involved than the parallel collections approach but offers two key benefits. Firstly, the use of a dedicated threadpool guarantees that there are always the configured number of jobs executing in parallel regardless of the number of cores on the Spark driver machine. Additionally, by setting explicit Spark fair scheduling pools for each of the invoked jobs, we were able to guarantee that Spark will truly run the notebooks in parallel on equally sized slices of the cluster.

Using this mechanism, if we were to create a cluster with 40 workers and set the number of parallel jobs to 4, then each individual job will utilize 10 workers in the cluster. If implemented correctly, the stages tab in the cluster’s Spark UI will look similar to Figure 2 below, which shows that there are 4 concurrently executing sets of Spark tasks on separate scheduler pools in the cluster.

As shown in Figure 3 below, the fair scheduler approach provided great performance improvements. However, determining the optimal number of jobs to run for a given workload whenever the cluster size changed would have been a non-trivial time overhead for JetBlue. Instead, we ran a benchmark similar to Figure 1 to determine the inflection point after which adding more workers to our Spark job didn’t improve the processing time anymore. We then were able to use this information to dynamically compute the best number of jobs to run in parallel on our cluster:

// define the number of workers per job
val workersPerJob = ???
// look up the number of workers in the cluster
val workersAvailable = sc.getExecutorMemoryStatus.size
// determine number of jobs we can run each with the desired worker count
val totalJobs = workersAvailable / workersPerJob

This approach worked well for JetBlue since we noticed that all the individual Spark jobs were roughly uniform in processing time and so distributing them evenly among the cluster would lead to optimal performance.

Figure 3 below shows a comparison of the various Spark parallelism approaches described throughout this section. We can see that by using a threadpool, Spark fair scheduler pools, and automatic determination of the number of jobs to run on the cluster, we managed to reduce the runtime to one-third of what it was when running all jobs sequentially on one large cluster.

Figure 3: Comparison of Spark parallelism techniques. Note that using an approach based on fair scheduler pools enables us to more effectively leverage larger clusters for parallel workloads.

Limitations of Parallel Spark Notebook Tasks

Note that all code included in the sections above makes use of the dbutils.notebook.run API in Azure Databricks. At the time of writing with the dbutils API at jar version dbutils-api 0.0.3, the code only works when run in the context of an Azure Databricks notebook and will fail to compile if included in a class library jar attached to the cluster.

The best performing approaches described in the previous section require Spark fair scheduler pools to be enabled on your cluster. You can double check that this is the case by executing the following snippet:

// must return "FAIR"
spark.conf.get("spark.scheduler.mode")

Furthermore, note that while the approaches described in this article do make it easy to accelerate Spark workloads on larger cluster sizes by leveraging parallelism, it remains important to keep in mind that for some applications the gains in processing speed may not be worth the increases in cost resulting from the use of larger cluster sizes. The techniques outlined in this article provide us with a tool to trade off larger cluster sizes for shorter processing times, and it’s up to each specific use-case to determine the optimal balance between urgency and cost. Additionally, we must also realize that the speedups resulting from the techniques are not unbounded. For instance, if the Spark jobs read from external storage – such as a database or a cloud object storage system via HDFS – eventually the number of concurrent machines reading from the storage may exceed the configured throughput on the external system. This will lead to the jobs slowing down in aggregate.

Conclusions and Next Steps

In this article, we presented an approach to run multiple Spark jobs in parallel on an Azure Databricks cluster by leveraging threadpools and Spark fair scheduler pools. This enabled us to reduce the time to compute JetBlue’s business metrics threefold. The approach described in the article can be leveraged to run any notebooks-based workload in parallel on Azure Databricks. We welcome you to give the technique a try and let us know your results in the comments below!

For a subscription service business, there are two ways to drive growth: grow the number of new customers, or increase the lifetime value from the customers that you already have by retaining more of them. Improving customer retention requires the ability to predict which subscribers are likely to cancel (referred to as churn), and to intervene with the right retention offers at the right time. Recently, the use of deep learning algorithms that learn sequential product usage customer behavior to make predictions have begun to offer businesses a more powerful method to pinpoint accounts at risk. This understanding of an account’s churn likelihood allows a company to proactively act to save the most valuable customers before they cancel.

CSE recently partnered with the finance group of Majid Al Futtaim Ventures (MAF), a leading mall, communities, retail and leisure pioneer across the Middle East, Africa and Asia, to design and deploy a machine learning solution to predict attrition within their consumer credit card customer base. MAF sought to use their customer records – including transaction and incident history plus account profile information – to inform a predictive model. Once developed, MAF needed to deploy this model in operation to allow them to make effective retention offers to the customers directly from their customer marketing systems.

Managing churn is fundamental to any service business. The lifetime value of the customer (LTV) is the key measure of business value for a subscription business, with churn as the central input. It’s often calculated as Lifetime Value = margin * (1/monthly churn). Reducing monthly churn in the denominator increases the LTV of the customer base. With increased LTV for the customer base comes increased profitability, and with that increased profit comes the economic support to increase marketing activity and investment in customer acquisition, completing a virtuous cycle for the business.

Technical Problem Statement

Predicting that a customer is likely to churn requires understanding the patterns in the sequence of user behavior and user state for churners compared to non-churners. Modeling these user states and behavioral sequences requires using tabular data coming from transaction systems, incident management systems, and customer and product records, and then turning these data into a model-friendly set of numerical matrices. This pre-processing of the original tabular data into model-friendly sequences is, in and of itself, a significant piece of technical work.

For the prediction task we had to choose whether to predict the attrition event itself or the inactivity that might presage a later attrition. In our modeling approach we predict churn itself. Despite the fact that the actual cancellation decision is a lagging indicator, our model delivered sufficient precision. In future modeling efforts, we will label long-standing inactivity as effective churn.

We worked with MAF to define a performance requirement in terms of accuracy, precision and recall. Once a model performed at the required levels, we could experiment with retention offers for those predicted to cancel. To run these experiments we needed to deploy this predictive model in a secure and operationally reliable environment where we could retrieve batch predictions daily and drive retention offers with the associated workflow.

The Data

The source data includes profile data as well as time series of sequential numerical and categorical information. These data come from historical transaction activity, historical customer incident activity, the product portfolio information and customer profile data. The customer profile data includes state information on the characteristics of the subscriber and the product they are using. The time series of sequential data includes transactions and incidents with a time stamp for each event. In some cases, the time series may have many events in a single day. In other cases, there may be no activity for a period of time.

While we can’t share the anonymized MAF data, we use three open source examples of source data that are similar in shape and type in order to better illustrate the starting point.

First, we had customer portfolio information, similar to that detailed in the telco churn open data set on Kaggle. These data characterize the customer demographics, preferences and type of product used. See a selection of the data in the chart below.

Public Data Sample – Customer Profile Data

Next, we have transaction history information. The transaction information used a schema similar to banking transaction information, like that shared in the PKDD 99 Discovery Challenge. See a small snippet of these data in the chart below.

Public Data Sample – Transactional Data

We also had customer incident history. The customer incident data capture the case ID, opened and closed dates, category, subcategory and details, as well as the status of the case. Below is a sample of a similarly structured data set from a public incident data set from San Francisco on data.gov. You can see that these data track the incident cases, categorization, activity and resolution.

Public Data Sample – Incident Data

The structure and content of the available data informed our approach for pre-processing and modeling.

Approach

Our project has three parts: pre-processing, modeling, and deployment in operation. We start with pre-processing: manipulating this tabular business application data into the formats that can feed our modeling. This step is highly generalizable to many types of deep learning work with tabular business data.

We broke the data pre-processing work down into several steps. This allowed us to reduce the size of the very large data, especially the transaction tables, for easier manipulation. It also focused the subsequent pre-processing steps, enabling our team members to coordinate and focus our individual work more efficiently. Additionally, it set up the work to be deployed in operation, script by script, into SQL, while allowing us to refine our pre-processing work in later steps as we learned from our early models.

For our sequential modeling approach we required a specific data format. For a sequence deep learning model like the Long Short-Term Memory (LSTM) network we used, the data ultimately need to be expressed as a three-dimensional array of account x timestep x feature. More details on LSTM data formats are available in this excellent blog post.

Our text information needed to be expressed numerically. We chose to represent the sequences with word tokens and embeddings. Read more about embedding approaches within this comprehensive blog post.

In our modeling work, we wanted to derive signal from two somewhat different sets of information, the history of incidents and the time series of account-related transactions and states. To do this, we prepared two data sets to supply to a hybrid model. Recent innovative churn prediction models are typically multi-input hybrid models. The first model input is time series numerical data in the account x timestep x feature format that feeds into a bidirectional LSTM. The second model input is textual and categorical incident data to feed a 1D CNN. We used the Keras functional API to easily build a multi-input model. We detail the specific modeling method and choices we made in the section below.

High Level MAF Hybrid Deployment Overview

Lastly, the data pre-processing and the predictive model needed to be deployed into production such that they could be called easily on a per-account basis. For our customer, this required integrating the model services with on-premises data in a hybrid cloud and on-premises implementation. The chart here demonstrates how the model will be called in operation. Pre-processed tables are generated in SQL Azure. At a high level, from within an Azure Machine Learning workspace, we retrieve data and then run a model script to generate a prediction score. Then we store the predictions for each account from that batch run into a SQL Azure table for later use by the MAF marketing operations.

The Data Preprocessing

Pre-processing comprises four sequential functional steps with an associated script for each. The first is ‘filtering and formatting’ from the source tables of interest. This reduces the data bulk for our design phase. The second script performs ‘joins, folds and conversions’. It joins our tables of interest, folds the data (sequentially or in aggregate) into our final timestep of interest for the sequential models, and generates meaningful numeric values from non-numeric information. The third script performs feature engineering. In this script we derived additional features and augmented the data in order to provide our models with more signal. We iterated on this script as we refined the model.

We organized these scripts separately so that we could deploy each script individually into SQL, while leaving us the flexibility to iterate on the remaining scripts as we matured our understanding of and approach to the model. This approach afforded us speedier collaboration on the later scripts.

The following is an excerpt of the processing done at each stage. To illustrate we highlight just the time series pre-processing and not the text pre-processing. This github repo contains all the pre-processing with more detailed information.

a.) Filtering and Formatting: The goal of this step is to reduce the size of the data passed along to subsequent steps and to do rudimentary formatting and cleaning (e.g. replace missing values and convert datatypes). For example, we selected columns of interest and filtered to eliminate account types that are not in our consideration set. We also limit the history for each account that we bring forward for sampling, reducing the size of the transaction table substantially. With the size of the data reduced, the time required to design the subsequent steps is significantly reduced. A few snippets of the filtering and formatting steps are highlighted below, while the full notebooks are here.

Pre-processing – Filtering and Formatting Script Excerpt

b.) Joins, Folds and Conversions: After filtering the source data, the data are joined to form two data frames. The goal of this step is to get close to having the data in the functional model-ready numerical form, but before much augmentation. This simplifies the later augmentation work.

We chose the ‘daily’ timestep interval to model on. Where the data were more frequent than daily, we ‘folded’ the data – that is, we generated aggregates, concatenations and other descriptive statistics of the information within that day. For example, with the incident data, we concatenated the text descriptions of each day’s incidents, preserving up to the last five events of the day, which captured the lion’s share of the multiple-events-per-day cases. We chose aggregations that expressed the maximums and totals for within-day events by type. We also generated conversions of datetime variables, representing them as count-of-day differences (e.g., days since last activity). A selection of steps from this script can be seen here:

Pre-processing – Join, Merge and Fold Script Excerpt
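Since the excerpt above appears only as an image, here is a minimal pandas sketch of the kind of daily fold described (the table and column names are assumptions, not MAF’s actual schema):

import pandas as pd

# Toy per-event transaction table
transactions = pd.DataFrame({
    "account_id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime(["2018-03-01 09:00", "2018-03-01 17:30",
                                 "2018-03-04 12:00", "2018-03-02 08:15"]),
    "amount": [120.0, 35.5, 80.0, 15.0],
})

# Fold multiple events per day into one row per account per day
transactions["date"] = transactions["timestamp"].dt.floor("D")
daily = (transactions.groupby(["account_id", "date"])["amount"]
         .agg(["size", "sum", "max"])
         .rename(columns={"size": "txn_count", "sum": "txn_total", "max": "txn_max"})
         .reset_index())

# Represent datetimes as count-of-day differences (e.g., days since previous activity)
daily["days_since_prev"] = daily.groupby("account_id")["date"].diff().dt.days.fillna(0)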

c.) Feature engineering: Within the feature engineering step, we removed the predictive period – in our case 14 days – added derivative features, wrote out the final labels, and added new engineered or augmented features to the array or data frame.

Pre-processing – Feature Engineering Script Excerpt

Modeling Method

As we noted in the approach section, we applied a multi-input model construction. We used the Keras functional API and combined the textual and categorical data coming from incident history with the time series transaction and state sequence for the account. The multi-input approach allows us to essentially concatenate these and fit the hybrid model.

High Level Model Summary

Our sequential non-text information is best harnessed in a Bidirectional LSTM – a type of sequential model described in more detail here and here – that allows the model to learn end-of-sequence and beginning-of-sequence behavior. This maps to domain experts’ knowledge that distinctive behavior at the end of the subscription period presages churn. It also captures the patterns in the progression of events over time that can be used to predict eventual churn.

On the other hand, our textual and categorical data need a separate model to learn from this differently structured data. We had several options here. The simplest option that we developed was to feed the sequence of textual and categorical data, coming from our incident data, into a 1D CNN. We created row-wise sequences of textual incident information for each account, tokenized these words, and applied GloVe word embeddings for each. These were right-aligned and pre-padded to better learn patterns from the most recent dates, in order to discern sequences and events that presage churn.

In future iterations, we could add additional input models to learn from non-volatile state and profile information to improve the signal, and categorical embedding modeling for the categorical incident information and other categorical data.

We developed and fit each model input independently to understand its performance. Hyperparameters were adjusted to a good starting point to fit the hybrid model. Then we combined them into a hybrid construction and fit against the two inputs. An excerpt of the model is below, with more about the approach and the rest of the code in this Github repo.

Model Design – Code Excerpt
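The excerpt above was published as an image; as a rough illustration only, a comparable two-input Keras functional-API model (with assumed input shapes and layer sizes, not the actual MAF architecture) might be sketched as follows:

from keras.models import Model
from keras.layers import (Input, Dense, LSTM, Bidirectional, Embedding,
                          Conv1D, GlobalMaxPooling1D, concatenate)

timesteps, n_features = 90, 12        # assumed time-series shape: account x timestep x feature
max_tokens, vocab_size = 200, 20000   # assumed incident-text sequence length and vocabulary size

# Branch 1: bidirectional LSTM over the daily transaction/state sequence
ts_input = Input(shape=(timesteps, n_features), name="transactions")
ts_branch = Bidirectional(LSTM(64))(ts_input)

# Branch 2: 1D CNN over tokenized incident text (GloVe weights would be passed
# to the Embedding layer via its `weights` argument in the real model)
txt_input = Input(shape=(max_tokens,), name="incidents")
txt_branch = Embedding(vocab_size, 100)(txt_input)
txt_branch = Conv1D(64, 5, activation="relu")(txt_branch)
txt_branch = GlobalMaxPooling1D()(txt_branch)

# Concatenate both branches and predict churn probability
merged = concatenate([ts_branch, txt_branch])
merged = Dense(64, activation="relu")(merged)
output = Dense(1, activation="sigmoid")(merged)

model = Model(inputs=[ts_input, txt_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])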

Results

Our multi-input model performed much better at identifying the portion of accounts at risk than either of the independent models. While the model achieved our performance goal of >80% accuracy and can be used to begin testing retention offers, the precision can most likely still be improved with additional data.

Model Training Accuracy and Loss

Model Results with Confusion Matrix

Model Deployment

We used the Microsoft Azure Machine Learning service to deploy our model as a REST API endpoint to be consumed by an internal MAF application. The Azure Machine Learning service provides several targets for deploying trained models, including Azure Container Instances (ACI) and Azure Kubernetes Service (AKS).

We chose Azure Container Instances (ACI) over Azure Kubernetes Service (AKS) as our deployment target for two reasons: First, we needed to quickly deploy and validate the model. Second, the deployed model won’t be used at high scale, as it’s intended to be consumed by just one internal MAF application on a scheduled interval.

To get started with the Azure Machine Learning service and use it to deploy our model, we installed the Azure Machine Learning SDK for Python. The Azure portal can also be used, but we preferred using the SDK so that our deployment workflow was documented step by step in Jupyter Notebooks.

As a first step in our deployment workflow we used Azure ML SDK to create the Azure Machine Learning Service Workspace (as shown below). You can look at Azure Machine Learning Workspace as the foundational block in the cloud that you use to experiment, train and deploy your Machine Learning models. In our case it was our one-stop shop for registering, deploying, managing and monitoring our model.

Creating Azure ML Workspace Code Snippet
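The snippet above was shared as an image; a minimal sketch of creating a workspace with the Azure ML SDK for Python looks roughly like this (the names, subscription ID and region are placeholders):

from azureml.core import Workspace

ws = Workspace.create(name="churn-aml-workspace",
                      subscription_id="<subscription-id>",
                      resource_group="churn-rg",
                      create_resource_group=True,
                      location="westeurope")
ws.write_config()   # persist the workspace details for use in later notebooks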

We followed the deployment workflow here to deploy our model as a REST API endpoint:

Deployment workflow and artifacts

Scoring File

We created a score.py file to load the model, return prediction results, and generate a JSON schema that defines our REST API input/output parameters. Since our model is a hybrid model that expects two inputs (CNN data and RNN data), we designed our API to take one flattened NumPy array that combines the CNN and RNN data, which the scorer then splits and reshapes into two NumPy arrays: 1D for the CNN and 3D for the RNN. Once the reshaping is done, the two NumPy arrays are passed as arguments to the model for prediction and returning results.

Score.py – Loading the model

Score.py – Load data and predict

Score.py – Generate json schema
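The score.py excerpts above are images; the split-and-reshape logic in the scoring function might be sketched roughly as follows (array sizes and the input ordering are assumptions; model is assumed to be loaded in the service’s init step):

import numpy as np

def run(raw_data):
    # raw_data: one flattened array combining the CNN (incident text) and RNN (time-series) inputs
    flat = np.array(raw_data, dtype=np.float32)
    cnn_len = 200                                     # assumed tokenized incident sequence length
    cnn_input = flat[:cnn_len].reshape(1, cnn_len)    # 1 account x token sequence
    rnn_input = flat[cnn_len:].reshape(1, 90, 12)     # 1 account x 90 timesteps x 12 features
    prediction = model.predict([rnn_input, cnn_input])
    return prediction.tolist()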

Configuring Docker Image & ACI Container

With our scoring file up and running, we started working on creating our conda environment file. This file defines the targeted runtime for the Python 3.6.2 model, in addition to the dependencies and required packages. For the Azure ML execution of our model, we needed the following packages:

azureml-defaults

scikit-learn

numpy

keras==2.2.4

tensorflow

In our Docker image & ACI configuration step, we relied on azureml.core to configure our Docker image using the score.py and conda environment.yml file that we created. Additionally, we configured our container to have 1 CPU core and 1 gigabyte of RAM, which is sufficient for our model.

Docker image and Azure Container Instance configuration
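The configuration shown above as an image can be sketched with the azureml SDK roughly as follows (API names reflect the SDK generation used at the time; file names are assumptions):

from azureml.core.image import ContainerImage
from azureml.core.webservice import AciWebservice

# Docker image configuration: the scoring script plus the conda environment file
image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  runtime="python",
                                                  conda_file="environment.yml")

# ACI deployment configuration: 1 CPU core and 1 GB of RAM
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)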

Register the Model, Create the Docker Image, and Deploy!

In our final deployment step, we used the Webservice.deploy() method from the Azure ML Python SDK. This enabled us to pass our model file, Docker image configuration, ACI configuration, and our created Azure ML Workspace to perform each of these steps for every new model we deployed:

Register the model in a registry hosted in our Azure Machine Learning Service workspace.

Create and register our docker image. This image pairs a model with a scoring script and dependencies in a portable container, taking into consideration the image configuration we created. The registered image is stored in Azure Container Registry.

Deploy our image to ACI as a web service (REST API). We deployed the model directly from the model file. This option registered the model in our Azure Machine Learning Service workspace with the least amount of code; however, it gave us the least amount of control over the naming of the provisioned components. The alternative is to use the Model.register() method first to control the naming of your provisioned resources.

Last Step – Model Deployment

Once the deployment completed, we were able to see the provisioned resources, the registered model, the created image, and the web service through the Azure portal.

Figure 5: Azure Portal Deployment Screenshots

Conclusion

We have shared an end-to-end example of preparing tabular business data, combining hybrid data to generate sequences that can be used to train a multi-input deep learning model. We have also shown how to deploy the model in operation using Azure ML Workspace and services and Azure Container Instances.

MAF now has an initial model that they can use in operation to identify likely churners and take proactive action to retain these customers. The AML Workspace and ACI environment will allow MAF to continue to update this model. For example, the model accuracy may be improved by adding additional data sources and more transaction details beyond our simple daily aggregates. They can also run additional models within the same environment for other purposes.

We hope this post is useful to others building prediction models from a combination of transaction and account data. You can find the code and scripts we developed for data pre-processing here, for modeling here, and for deployment here. Please feel free to reach out if you have any questions.

]]>https://www.microsoft.com/developerblog/2019/01/10/develop-and-deploy-a-hybrid-multi-input-churn-prediction-model-with-azure-machine-learning-services/feed/0https://www.microsoft.com/developerblog/2019/01/10/develop-and-deploy-a-hybrid-multi-input-churn-prediction-model-with-azure-machine-learning-services/Real-time time series analysis at scale for trending topics detectionhttp://feedproxy.google.com/~r/microsoft/devblog/~3/3JIxNWkVG3A/
https://www.microsoft.com/developerblog/2019/01/02/real-time-time-series-analysis-at-scale-for-trending-topics-detection/#respondWed, 02 Jan 2019 16:46:36 +0000https://www.microsoft.com/developerblog/?p=9537This code story describes a collaboration with ZenCity around detecting trending topics at scale. We discuss the datasets, data preparation, models used and the deployment story for this scenario.

]]>Managing big cities and providing citizens with public services requires municipalities to have a keen understanding of what citizens care most about. And, as the cities we live in seek to become the 'smart cities' of tomorrow, this means gathering and analyzing vast amounts of data. Data gathered from social and municipal sources enables municipalities to understand which trending issues citizens currently care most about, and to better understand how they feel about any given matter.

In this code story we’ll look at Commercial Software Engineering’s (CSE) recent collaboration with ZenCity, a startup focused on providing municipalities with insights gained through social and operational data. ZenCity’s platform provides municipalities with insights into topics currently discussed on social media and operational systems – from current security issues to upcoming events or festivals – the sentiment being expressed on each issue, and additional context such as specific locations related to issues. In engaging with Microsoft, ZenCity was interested in two outcomes: first, in expanding their offering with a detector for temporal patterns (i.e., points in time in which specific events pop up from the data) and second, in extending their data ingestion pipeline to better handle data at scale and new types of data.

Introducing temporal patterns into ZenCity’s system is an important addition to the product, as it allows the municipality to automatically detect urgent issues and can provide more contextual understanding of events. For example, such a module would enable the city to easily identify issues pertaining to a certain festival, a specific hazard, or any other event that occurred in the city.

Working with ZenCity, we developed a system for detecting trending topics that comprises a data ingestion and preparation pipeline and a set of models capable of detecting interesting anomalies or trends. Additionally, we developed a supporting Spark-based pipeline capable of handling events gathered from various sources. In this code story we'll focus on data understanding and modeling, and as an example we'll use the San Francisco 311 data, available in the San Francisco Open Data portal.

Challenges and Solution

Detecting trending topics requires processing multiple textual items, often at scale, extracting one or more topics from each item, and then examining the temporal characteristics of each topic and of the entire set of topics. Our solution uses time series analysis methods to quantify how much a topic is trending, as well as a pipeline for handling textual items from ingestion through text analytics to a statistical model that detects which topics are currently trending.

Figure 1 describes the data flow from a social network to a trending topics detection mechanism. Data from social networks or other sources is collected and ingested into the system. Then, various processes cleanse the data and enrich it (for example by detecting language or entities). For each item, a topic or multiple topics are extracted if found, and the count of items per topic over time is stored. The counts over time form a time series which can be analyzed for trends or anomalies.

Figure 1 – Data flow

Problem formulation

Trending topic detection is the ability to automatically extract topics that are temporally more common than usual. Once topics are extracted from each body of text (e.g. tweets, Facebook posts, or incidents in the city's CRM, its Customer-Relationship Management system), one can count the number of texts per topic over a fixed period of time (e.g. 1 hour) and look for differences between time windows. Changes in quantity between time windows can be caused by an external factor for a specific topic (e.g. a festival that took place in the city), an external factor affecting all topics (e.g. citizens opened a new Facebook group), or an internal system factor, such as a change to the way topics are extracted or texts are gathered from sources. It is important to distinguish these cases, since usually only an external factor for a specific topic is of interest.

There are different types of patterns we can consider as a trend in a specific topic. Figure 2 shows three examples of such patterns.

Figure 2: Different types of trend patterns, taken from [1]. Ramp-up means the number of items related to a topic starts growing after being stable. Mean shift (or change point) is the case where the number of incidents was stable and, from a certain point in time, increases significantly to a higher value. Pulse is the most common pattern: from a typical normal state, the time series increases significantly and then returns to its typical value.

Data understanding

The San Francisco 311 Cases dataset holds ~3M records from calls to the 311 call center since July 2008. Each record has a predefined category (topic). There are 102 categories in the dataset, some of which were only used for a certain period of time. Out of the 102 categories, 46 have more than 1,000 incidents and were used for more than 100 days. In this dataset, topics (categories) are predefined; when working with social network data, we used ZenCity's proprietary topic extraction module.

Figure 3 shows the number of events per category for the top 10 most common categories.

Figure 3: Number of incidents per category

Different categories are used in different periods of time. Figure 4 shows the number of yearly incidents per category, for the top 10 categories. Some categories exhibit a trend from year to year, while others have relatively the same number of incidents per year.

Figure 4: Number of incidents per year, per category

In order to transform a set of incidents into intervals for time-series analysis and analyze trending topics, we developed moda, a Python package for transforming and modeling such data. We'll look more at moda in the experimentation section. Figure 5 shows the time series of one category, using three different time interval values. The following code snippet demonstrates how to turn the SF311 dataset into a time series that can be analyzed with moda:

Figure 5: Time series for one category, represented by different time intervals (30 minutes, 3 hours, 24 hours)
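
Since the original snippet is not reproduced in this extract, a pandas-based sketch of the same aggregation follows. The column names (Opened, Category) reflect the SF311 export and are assumptions here, moda's dataprep module provides equivalent helpers, and the exact index layout moda expects may differ slightly:

import pandas as pd

# Load the raw SF311 export (path and column names are assumptions)
raw = pd.read_csv('SF311.csv', parse_dates=['Opened'])

# Count incidents per category per fixed time interval (24 hours here)
counts = (raw
          .groupby(['Category', pd.Grouper(key='Opened', freq='24H')])
          .size()
          .rename('value')
          .reset_index())

# The moda models consume a similar (date, category)-indexed frame with a 'value' column
dataset = counts.set_index(['Opened', 'Category']).sort_index()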

Data preparation

The data preparation steps we used were the following:

Decide on a time interval, based on the business requirement and frequency of incidents. There is a trade-off between detecting changes quickly and the detection accuracy. We used a 24-hour window for small cities and 1-hour windows for bigger cities like San Francisco.

Count the number of incidents per time interval and handle additional information such as the number of shares/likes/retweets a post received.

Remove rare categories: If rare categories are also of less interest to the municipality, they can be removed by putting a lower bound to incident/post count.

Remove old categories: Some categories appear in the data but aren’t used by the municipality any longer.

Missing values handling: There are cases when a topic had no incidents during a specific time interval. We padded the time series with zeros at such intervals, as this is the true value of the series at these points in time.

Decide on history size: In most cases, the entire history is unnecessary. We decided to look back 30 or 60 days. This constant might change for other datasets.

Approaches to modeling

As mentioned earlier, we focused on time series methods for modeling. Specifically, we looked at methods capable of identifying pulses, as it is the most frequent form of change that we have seen in the data. To find an optimal model, we evaluated different time series methods. Additional methods exist such as the ones surveyed in [1].

Moving average based seasonality decomposition (MA adapted for trendiness detection)

This method is a naive decomposition that uses a moving average to remove the trend and a convolution filter to detect seasonality, leaving a time series of residuals. See [2] for additional information on seasonal decomposition. To detect anomalies and interesting trends in the time series, we look for outliers in the decomposed trend series and in the residuals series. A point is considered an outlier if its value is more than a given number of standard deviations above the historical values. We evaluated different policies for trendiness prediction: residual anomaly only; trend anomaly only; residual OR trend anomaly; and residual AND trend anomaly.
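
As an illustration of this policy, here is a minimal sketch using statsmodels' naive seasonal decomposition and a standard-deviation threshold on the residuals; the seasonal period and the threshold of three standard deviations are assumptions, and it expects a pandas Series with a regular time index:

import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

def residual_anomalies(series, period=24, num_of_std=3):
    """Flag points whose decomposition residual is unusually large."""
    # Moving-average based decomposition into trend, seasonality and residual
    decomposition = seasonal_decompose(series, model='additive', period=period)
    resid = decomposition.resid.dropna()
    # A point is anomalous if its residual exceeds the historical mean by k standard deviations
    threshold = resid.mean() + num_of_std * resid.std()
    return resid[resid > threshold].index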

Seasonality and trend decomposition using Loess (Adapted STL)

STL uses iterative Loess smoothing [5] to obtain an estimate of the trend and then Loess smoothing again to extract a changing additive seasonal component. It can handle any type of seasonality, and the seasonality value can change over time. We used the same anomaly detection mechanism as the moving-average based seasonal decomposition. For more on this method see [2] and [4]. Figure 6 provides an example of running STL on one category, and how anomalies are calculated following the time series decomposition.

Figure 6: Trendiness detection using STL on one category on the SF 24H dataset. Top four plots: The original, seasonality, trend and residual decomposed time series, as decomposed by the STL algorithm. Bottom four plots: Anomalies found in the trend time-series, anomalies found in the residual time-series, anomalies combined (Either one of the anomalies or both), the ground truth values (label).

Azure Machine Learning Anomaly Detection API

Another way to look at the problem is strictly as an anomaly detection problem. The exploratory data analysis showed that most of the interesting areas are actually anomalies, in contrast to a large upward trend or a change point. We used the Azure Machine Learning Anomaly Detection API as a black box for detecting anomalies. We further used the upper bound of the time series provided by the tool to estimate the degree of anomaly.

Twitter Anomaly Detection

An anomaly detection method that employs techniques similar to STL and MA is the Twitter Anomaly Detection package. Initial experimentation showed good results, so we included it in the analysis. The official implementation is in R; we used a 3rd-party Python implementation which works a bit differently.

Univariate representation vs. multi-category representation

Anomalous patterns can appear in multiple topics at once. Some trending topics detection methods, such as the one proposed by Kostas Tsioutsiouliklis [3], represent the data as multi-category and attempt to find topics with a higher proportion than usual, rather than a higher quantity than usual. The advantage of that representation is that each topic is always compared to the others: an increase in a single topic is detected, while an increase across all topics does not change the topic distribution. In this analysis we model the problem assuming independence among topics; it is possible to look for co-occurring trending topics as a post-processing step.

Evaluation

Labeling

Four different time series datasets were manually labeled using the TagAnomaly labeling tool, which was built as a part of this engagement. TagAnomaly allows the labeler to view each category independently or jointly with other categories, to better understand the nature of the anomaly / shift. It further allows the labeler to look at the raw data in a specific time range, to see what the nature of the posts / incidents was about.

It is important to note that manual labels are highly subjective. For example, a weekly pattern for the topic "religion" could be interesting for some municipalities but irrelevant for others. Peaks in garbage collection could be interesting in some cases but could also be random outliers. In addition, some labelers are more conservative than others and mark fewer points as interesting.

Metric

We evaluated each model by comparing the predicted trending timestamps with the timestamps manually labeled as trending. For short time intervals, or for trends spanning multiple timestamps, the predicted and actual points in time may not be perfectly aligned. We therefore implemented a soft evaluation metric that looks for matches in a nearby window, whose size is customizable. We then counted true positives, false positives, and false negatives for each category and used them to calculate precision, recall, F1, and F0.5, both per category and for the entire dataset. Figure 7 shows the way metrics are calculated for this multi-category time series, and a simplified sketch of the matching logic follows the figure.

Figure 7: Metrics for multi-category classification used in this analysis
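
A simplified sketch of the soft matching logic is shown below; moda's get_metrics_for_all_categories implements the full version, and the window size and integer time-step indices here are purely illustrative:

def soft_confusion_counts(predicted, labeled, window=1):
    """Count TP/FP/FN, treating a prediction as correct if a label falls within +/- window steps."""
    tp = sum(any(abs(p - l) <= window for l in labeled) for p in predicted)
    fp = len(predicted) - tp
    fn = sum(not any(abs(p - l) <= window for p in predicted) for l in labeled)
    return tp, fp, fn

# Example with integer time-step indices of trending points
tp, fp, fn = soft_confusion_counts(predicted=[3, 10, 21], labeled=[4, 22], window=1)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0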

Experimentation

All models and evaluation code are available in moda. The package provides an interface for evaluating models on either univariate or multi-category datasets and lets the user add models using a scikit-learn-style API. All models described here were adapted to the multi-category scenario using the package's abstract trend_detector class, which allows a univariate model to run on multiple categories. The package also provides functionality for evaluating models using either a train/test split or time-series cross-validation, and we provide sample code for grid-search hyperparameter optimization for different models.

The following code snippet shows how to run one model using moda:

from moda.evaluators import get_metrics_for_all_categories, get_final_metrics
from moda.dataprep import read_data
from moda.models import STLTrendinessDetector

# dataset is a (date, category)-indexed DataFrame with 'value' and 'label' columns,
# e.g. the output of read_data on a prepared CSV file
model = STLTrendinessDetector(freq='24H',
                              min_value=10,
                              anomaly_type='residual',
                              num_of_std=3,
                              lo_delta=0)

# Take the entire time series and evaluate anomalies on all of it or just the last window(s)
prediction = model.predict(dataset)
raw_metrics = get_metrics_for_all_categories(dataset[['value']],
                                             prediction[['prediction']],
                                             dataset[['label']],
                                             window_size_for_metrics=1)
metrics = get_final_metrics(raw_metrics)

# Plot results for each category
model.plot(labels=dataset['label'])

Results

We compared the moving average seasonal decomposition, STL, Twitter’s model and Azure anomaly detector on four datasets: San Francisco 311 with 1H interval (SF 1H), San Francisco 311 with 24H interval (SF 24H), social network data for the city of Corona (Corona 12H), and social network data for the city of Pompano Beach (Pompano 24H). Here are the F1 scores:

Dataset | Adapted Moving Average Seasonal Decomposition | Adapted STL | Twitter Anomaly Detection | Azure Anomaly finder | Number of items | Number of categories | Number of samples
Corona 12H | 0.68 | 0.7 | 0.74 | 0.73 | 21,800 | 37 | 2982
Pompano 24H | 0.72 | 0.71 | 0.76 | 0.76 | 24,245 | 39 | 3308
SF 24H | 0.69 | 0.75 | 0.57 | 0.38 | 385K | 84 | 180K
SF 1H | 0.35 | 0.31 | 0.28 | 0.09 | 385K | 84 | 432K

Table 1: Results for 4 different datasets and 4 different models, and additional information on each dataset.

Conclusion

We can see that for datasets with 12/24 hour intervals, we get decent results. For a 1H interval dataset, F1 measure drops to 35%. There are two main reasons:

The time series is much more volatile and sparser, thus harder to model

There are more points in this dataset (432K vs 180K), so manual labeling is more difficult and more subjective

Figure 8 shows an example of the time series, the prediction (of adapted STL) and the manually labeled data for one category on the 1H dataset. The two reasons are apparent in this example. In most cases, the recall was higher than precision, so possibly exploring higher thresholds might improve precision.

Figure 8: Top: original time series for the “Homeless Concerns” category on the SF1H dataset. Middle: The STL model prediction, Bottom: Points labeled manually as anomalous. The human labeler in this case is more conservative than the model.

Operationalization

A data pipeline for this model was built on Azure. The pipeline is mainly based on Databricks, which is a data engineering and analytics platform for Spark. Figure 9 shows the high-level design of the system on Azure. For the entire deployment story, refer to this code story [11].

Metadata from all enriched items is stored in a SQL server. The database stores the time, topic and additional metadata about each item, for further aggregation at a later phase. (Save to SQL)

A batch operation runs every fixed period equal to the time interval selected during modeling and reads two datasets: The history (e.g. 1 month back), and events from the latest time interval.

The model trains on the history time-series and predicts anomalies for the last time interval. (Model based trend detection)

Stream: In parallel to the batch operation, a Spark Streaming operation groups items at relatively short time intervals to detect extreme anomalies. It compares the number of items per time range to a constant. (Threshold based trend detection)

Predicted events are further processed in UIs or alerts such as email senders. (Alerted items)

Time series forecasting used in real time on a stream of data is inherently different from other machine learning tasks. Most models are lazy, i.e., the model is trained on the entire history to forecast the next value, and the data is usually non-stationary. In a classical machine learning setting, a model is trained once and serves new samples many times. In the time-series scenario, the model must be trained or updated before each prediction (or small set of predictions), and predictions usually require the last k values of the time series as features. Therefore, we use a SQL server to hold historical data and supply it to the model during inference. Some models, such as recurrent neural networks, can still be used without being retrained before each sample, and some models allow an online learning setting which adapts the model to each new sample.

Discussion

By detecting trending topics, we allow municipalities, ZenCity's customers, to gain insights about events at specific points in time. Municipalities, which are becoming more data-driven than ever, gain an additional set of tools to detect, examine, and act upon events occurring in their jurisdiction. They can respond to events which were previously impossible to identify, and estimate the level of engagement for different activities, thus becoming better decision makers and better servants of their citizens. Moreover, fusing such data with other temporal data, like sensor readings, can provide additional insights and help the municipality better define root causes for many of the city's problems.

Trending topics analysis is an interesting and challenging task. In essence, it is a hybrid text analytics and time series analysis problem. Manual labeling is difficult and subjective, and multiple labelers should be used for a robust labeled dataset. In this engagement we adapted and evaluated multiple trending topics detectors and built a pipeline to support such models at scale. We built an open source labeling tool, TagAnomaly, for time series anomaly detection, and developed an open source Python package, moda, for running and evaluating models. We find that the best model often depends on the dataset's characteristics, such as the time interval size, seasonality, volume of data, and the accuracy of the topic extractors that feed it with data.

]]>https://www.microsoft.com/developerblog/2019/01/02/real-time-time-series-analysis-at-scale-for-trending-topics-detection/feed/0https://www.microsoft.com/developerblog/2019/01/02/real-time-time-series-analysis-at-scale-for-trending-topics-detection/Improving Safety and Efficiency in BMW Manufacturing Plants with an Open Source Platform for Managing Inventory Deliveryhttp://feedproxy.google.com/~r/microsoft/devblog/~3/biqDyCJB_3M/
https://www.microsoft.com/developerblog/2018/12/19/improving-safety-and-efficiency-in-bmw-manufacturing-plants-with-an-open-source-platform-for-managing-inventory-delivery/#commentsWed, 19 Dec 2018 19:47:50 +0000https://www.microsoft.com/developerblog/?p=9771Over the course of twelve months Microsoft and BMW partnered three different times to help BMW with its vision for technical transformation. An open-source package called ROS-Industrial was used to help provide the building blocks for the robotics work.

German car manufacturer BMW has always been at the forefront of technological advancement within the auto industry. And a major part of the company’s innovation focus is on its production system, ensuring quality and flexibility alike.

When it comes to cutting-edge production systems, and more precisely logistics, new autonomous vehicles are among the very most innovative new technologies. However, in order to use these vehicles in a manufacturing environment, they need to be administered by open transportation services. For the moment, these robots and the systems that run them are still vendor-specific.

While open source technology, IoT, and cloud-based connectivity have dramatically increased in other areas of manufacturing, to date this has not been the case with robotics. BMW needs a way to connect their robots to an open source platform, allowing a heterogeneous fleet of robots to work together in perfect harmony with the human workers on the assembly line.

BMW aims to truly bring innovation to the factory floor by creating a system through which all of the robots in its colossal manufacturing plants – from the warehouses to the production lines – are connected with and communicating via one cloud-based, open source system.

This new system is production critical; hence it has to be highly available, robust, and easily scalable, so that it may be used for a significant number of robots. To accelerate this technical transformation, BMW has partnered with Microsoft to tackle this robotics innovation challenge.

Objectives and Challenges

BMW highlighted one process on the production line where efficiency could be dramatically increased by adding multiple robots working side by side with humans. Until now, parts have mainly been transported from the warehouse to specific assembly docks on the production line using route trains; this process was to be taken over by autonomous robots.

Microsoft and BMW worked through the objectives together and came up with three challenges that they could collaborate on. The first of these was to create an architecture steering autonomous transport robots that comply with all the above-mentioned criteria. The second challenge was to connect the orchestration system to hundreds of physical robots running on the factory floor in an automated and scalable manner. Finally, given the high cost of interruptions to the BMW production line, there is value in simulating the orchestration of the robots and their behavior. This would provide a higher degree of confidence in the decisions regarding robot fleet size, robot behavior, and orchestration algorithms.

The Development

Over the course of twelve months Microsoft and BMW partnered three different times to help BMW with its vision for technical transformation. An open-source package called ROS-Industrial was used to help provide the building blocks for the robotics work.

The First Engagement: Orchestration

When a human worker on the assembly line empties a parts bin, the bin needs to be swapped out with a fresh bin that has the right parts for the human worker to continue to assemble the vehicles that are coming down the assembly line. The orchestration engine is a collection of micro services in Azure that receive the order from a backend system, select the most suitable robot of the fleet, and assign the order to it so that parts can be transported to the assembly line.

The first integration edge of the orchestration system is an ingest endpoint that receives the order and safely stores it before placing it in a queue for processing. Upon successfully saving and queueing the order, the orchestration system responds with an HTTP 200. The order is then picked up by the Order Manager for assignment to a robot.

The Order Manager looks at all of the orders stored in the queue, asks the Fleet Manager for an inventory of available robots, and creates a plan to deliver them. Planning the delivery of each order is a highly customized function that is deeply influenced by the environment where the work is being done. In our shared GitHub repositories we have created a simple planning plug-in (assignment class) that can easily be swapped out for something more specific to your needs.

Orders that are ready to be sent to the robots are sent to a dispatcher. The dispatcher wraps the order in a job that the robot understands and sends the job to the robot over the Azure IoT Hub. The robot will periodically update the status of the job as it proceeds with the delivery of the order to the assembly line. While the robot is online and operating in the factory, it is also broadcasting telemetry back to the orchestration system (vehicle status). The telemetry sent from the robot to the orchestration system contains information about the battery level of the robot, location, and heading. The telemetry is sent through the Azure IoT Hub back to the orchestration system and is collected by the Fleet Manager. The Fleet Manager uses this information to calculate the availability of a robot, and also schedule the robot for maintenance functions like recharging its battery.
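
As a rough illustration of the dispatch step, the sketch below sends a job to one robot as a cloud-to-device message through Azure IoT Hub using the Python service SDK. The job payload shape, device id, and connection string are assumptions, and the actual orchestration services are implemented as separate micro services rather than this single helper:

import json
from azure.iot.hub import IoTHubRegistryManager

SERVICE_CONNECTION_STRING = '<iothub-service-connection-string>'  # placeholder

def dispatch_job(robot_id, order_id, destination):
    # Wrap the order in a job the robot-side ROS node understands (illustrative shape)
    job = {
        'jobId': order_id,
        'command': 'deliver_bin',
        'target': destination,  # e.g. {'x': 12.5, 'y': 3.2}
    }
    registry_manager = IoTHubRegistryManager(SERVICE_CONNECTION_STRING)
    # Send the job as a cloud-to-device message to the selected robot
    registry_manager.send_c2d_message(robot_id, json.dumps(job))

dispatch_job(robot_id='ats-robot-017', order_id='order-42', destination={'x': 12.5, 'y': 3.2})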

The Second Engagement: ROS Integration

Once BMW had an orchestration system capable of scheduling and dispatching work across a fleet of autonomous robots, our next challenge was to connect the robots in a plug and play mode to that system. BMW’s Autonomous Transport System (ATS) robots run on a software stack built atop a flexible open-source framework called the Robotic Operating System (ROS). ROS lacked the ability to natively talk to Azure IoT Hub, and furthermore needed a way to interpret and respond to the commands issued by the orchestration system.

Our next engagement set out to solve exactly those challenges with the goal of having real ATS robots fulfilling real work in the real world. At its core ROS runs a peer-to-peer network of processes called nodes which support both synchronous RPC-style communication over services and asynchronous streaming of data over topics. Crucially, although nodes communicate in a peer-to-peer fashion, a centralized ROS Master node acts as a lookup service for service and topic registrations. ROS code is built and distributed as packages that anyone can develop to extend the native capabilities of ROS using any of several supported languages.

However, although ROS is distributed with a variety of tools and capabilities for commonly-used functionality, we needed to create our own packages to handle the integration with Azure IoT Hub and to handle the interactions between the ATS robots and the orchestration system. We designed the Azure IoT Hub Relay as an adapter to bridge between the MQTT interface exposed by Azure IoT Hub and the native messaging capabilities of ROS. The Relay runs on the robot as a ROS node. It facilitates both bidirectional communication between ROS topics and the IoT Hub, as well as fulfilling commands from IoT Hub by invoking ROS services. With the Relay running on our ATS robot we could register our robot as an IoT device in Azure IoT Hub and start sending messages in both directions. We then released the Relay back to the ROS community as an open-source project to give ROS developers the ability to leverage all of the capabilities of the Azure cloud. You can check out the Github repository at https://github.com/Microsoft/ros_azure_iothub.

This gave us a ROS topic containing the stream of messages from the orchestration system but nothing listening to that topic. We therefore then had to build another ROS node that could understand those messages and respond appropriately. In the case of the ATS robots our initial instruction set was pretty basic: move to a given location, pick up a bin, and drop a bin. The basic navigation components in ROS support moving to a given coordinate pair – we just needed to forward that instruction to the appropriate ROS topic in the format that the navigation node expected. Similarly, another custom node provided services that would drive the servos needed to raise and lower the ATS robot. Our node was essentially a controller which would delegate the handling of commands by either publishing a message on a ROS topic or calling a particular service – a common design pattern in ROS.

That took care of the Cloud-to-Device messaging – but we still needed to be able to send telemetry back to the orchestration system to keep it notified of the current state. Things like current location, battery level, and error conditions all needed to be taken into account when the orchestration system made decisions about work assignments. We did this by configuring the Relay node to listen to a particular ROS topic and to publish those messages as Device-to-Cloud messages via Azure IoT Hub. We tagged the messages with metadata that IoT Hub could use to route the incoming telemetry to an Azure EventHub message queue. The messages were then read and handled by an Azure Function which could inform the orchestration system of any changes in robot state.
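
A minimal rospy sketch of a robot-side telemetry publisher is shown below; the topic name, message format, and field values are assumptions, since the relay's actual topic configuration lives in the ros_azure_iothub project:

import json
import rospy
from std_msgs.msg import String

def publish_telemetry():
    rospy.init_node('ats_telemetry')
    # The relay node is configured to forward this topic to IoT Hub as device-to-cloud messages
    pub = rospy.Publisher('/ats/telemetry', String, queue_size=10)
    rate = rospy.Rate(0.2)  # one status message every 5 seconds
    while not rospy.is_shutdown():
        status = {'battery': 0.87, 'location': {'x': 12.5, 'y': 3.2}, 'heading': 1.57}
        pub.publish(String(data=json.dumps(status)))
        rate.sleep()

if __name__ == '__main__':
    publish_telemetry()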

In order for the navigation commands to work, the robot needs to actually understand where it is and where you would like it to go. ROS solves this problem using Simultaneous Localization and Mapping (SLAM). Given a map, SLAM works out the current location by looking around using its sensors and identifying the region on the map that looks like what it sees – a process called localization. Since it knows the coordinates of that spot on the map, it can plot out a path to navigate to the desired destination.

But how can the robot localize, much less navigate, without a map? We solved that challenge by creating a ROS node to bootstrap the ATS robot with an initial map and a service in the orchestration system to send a map on demand. When the robot starts up, this node sends a message over the Relay to the orchestration system and asks for an initial map. The service replies over Azure IoT Hub and the Relay node then publishes the map data on a particular topic that the Bootstrap node listens on. Once the map is loaded to the ROS mapping subsystem, the robot sends a telemetry message to the orchestration system informing it that the robot is ready to accept work.

With those pieces in place we had a fully-functioning system. ATS robots could be brought online and bootstrap all the information they needed to accept work. The orchestration system could be quickly notified about any changes in robot state that might affect its decision making. And the robots could respond appropriately to an extensible list of commands from the orchestration system. BMW then set about testing a limited number of ATS robots in a designated part of one of their factories to see how the system worked in real life.

But all of this raised yet another set of questions. How can we know that the orchestration system will make good decisions given a large amount of work and a large fleet of robots? How can we verify that the ROS nodes running on the ATS robots behave well under real world conditions? How can we know how many robots are necessary to fulfill a given volume of work? To answer these and other pressing questions, we needed to turn to simulation.

The Third Engagement: Simulation

Path optimization, congestion control, and human safety protocols are just a few of the key challenges to be addressed in a factory environment with robots. Yet, many of the problems that arise with large fleets of robots do not manifest themselves when small numbers of experimental robots are first introduced to the factory floor. Attempting to run real life experiments at scale can cause interruptions on the factory floor, leading to lost revenue – not to mention the initial investment needed for the robot fleet. With simulation at scale, these concerns could be first explored in the virtual world without needing to invest in cost-prohibitive experiments in the real world. Microsoft, BMW, and ROS Industrial came together to create a scalable simulation system that could support such explorations.

There are three basic components needed to run ROS simulation at scale: a simulation engine, a ROS adapter, and an orchestrator to scale the fleet of ROS robots. The simulation engine drives the physics of the world and generates LIDAR and other types of sensor data, which would otherwise come from physical sensors on the robot. The simulation engine also tracks the movement of robots in the virtual space and moves them according to commands given from the robot. The ROS adapter transforms the sensor data from the simulator and publishes them to the robots in a standardized ROS format. Conversely, the adapter transforms ROS commands from the robot into movements executed by the simulator. Finally, an orchestrator is needed to deploy the simulator and the robots, and the orchestrator can ensure that communication can happen across the components.

The first step in building out the simulation system was to evaluate available options for simulation engines with a base requirement that they also be categorized as open source software. The engine needed to scale effectively for the overall simulation to support over 100 robots – this was critical. In addition, the simulation needed to run in real time since it would eventually be used to track the activity of robots executing real delivery jobs, albeit in the virtual world. To reach an informed decision on the simulator, Microsoft and ROS Industrial partnered to perform tests on each of the three most prevalent robotics simulation engines, utilizing a Real Time Factor (RTF) metric to gauge how well a simulator was able to maintain its speed as the number of robots increased.

Gazebo, Stage, and ARGoS each had a niche target audience and performed differently under high load. Both Gazebo and Stage-Ros had heavy support from the ROS community and had a wide range of robots which could be run in simulation. Gazebo, in particular, was built specifically with ROS in mind, and the simulation environment works with ROS messages by default. Stage had a ROS package available and was installed by default with many releases of ROS. However, both Gazebo and Stage-Ros failed to perform at scale, and the RTF began to degrade quickly.

Ultimately, ARGoS was the most suitable choice. ARGoS – implemented with a multi-threaded architecture – could take advantage of many cores (up to 128 virtual cores in Azure), whereas both Gazebo and Stage-Ros could only leverage a few cores. The major drawback of ARGoS was its relatively small community of developers and its nearly nonexistent usage in the ROS community. There was only one open source project available at the time – a sample ARGoS-ROS bridge that demonstrated how to send a simple custom ROS message (list of objects) from ARGoS and receive movement messages in response. In order to leverage ARGoS for BMW’s use case, it required adding much more functionality to the bridge to take it beyond its initial sample concept and generalize it for ROS robotic simulation.

The ARGoS-ROS bridge allowed the simulator and robots to communicate over a number of essential topics. The robot needs to send movement commands to the simulation over the /cmd_vel topic. The simulation engine needs to send sensor data to the robots, including laser scans and odometry (over the /scan and /odom topics). Finally, the bridge needs to format the sensor data into ROS standardized messages and populate the fields necessary for ROS navigation libraries. Not all of the information required for ROS was available in ARGoS, so new generic ARGoS plugins for LIDAR and ground truth odometry were written to accommodate the necessary changes.
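
To make the topic contract concrete, here is a robot-side rospy stub that consumes the simulated sensor topics and publishes velocity commands; in the real system these topics are handled by the ROS navigation stack and the ARGoS-ROS bridge rather than hand-written callbacks:

import rospy
from sensor_msgs.msg import LaserScan
from nav_msgs.msg import Odometry
from geometry_msgs.msg import Twist

def on_scan(scan):
    rospy.logdebug('laser scan with %d ranges', len(scan.ranges))

def on_odom(odom):
    rospy.logdebug('pose x=%.2f y=%.2f', odom.pose.pose.position.x, odom.pose.pose.position.y)

rospy.init_node('sim_robot_stub')
rospy.Subscriber('/scan', LaserScan, on_scan)   # LIDAR data generated by the simulator
rospy.Subscriber('/odom', Odometry, on_odom)    # ground-truth odometry from the simulator
cmd_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)

# Drive slowly forward; the bridge translates this command into movement inside ARGoS
forward = Twist()
forward.linear.x = 0.2
rate = rospy.Rate(10)
while not rospy.is_shutdown():
    cmd_pub.publish(forward)
    rate.sleep()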

The simulation engine also functioned as a clock server and as the source of truth for the current time in the virtual world. This required a way to share the centralized clock with the robots. Additional functionality was added to the ARGoS-ROS bridge to publish the simulated timestamp with each cycle of the simulation, so that the robots could stay in sync. Without such a mechanism, variations in simulation speeds led to abnormal robot behavior in simulation, given the simulator and robots had differing perceptions of time.

BMW-specific changes were implemented in the simulator as well. ARGoS supported only a few types of robots for swarm robotics applications, but these robots did not have the same dimensions or behave in the same way as the custom BMW robots. In order for the simulation to mimic the real world, physics representations matching the BMW robots needed to be implemented. In addition, an OpenGL plugin was added so that visualization could be used. For the open source version of the project, support for the Turtlebot Waffle Pi robot was written to appeal to the larger ROS community.

To run simulation at scale, a distributed approach was used for the robots. Kubernetes was chosen as the orchestrator, and it managed the deployment of both the simulator and the robots to a cluster. A dedicated node was given to the simulator, which consumed far more CPU and memory than the robots. The robots, on the other hand, each consumed only a few CPU cores and could be distributed across the cluster, since they had no dependencies on each other. Both ACS-Engine and AKS were suitable choices for creating a Kubernetes cluster on Azure. As the number of robots increased, ACS-Engine became the more practical choice, since heterogeneous clusters could be used to provision one very large simulator virtual machine and several smaller virtual machines for the robots.

ROS networking became a bottleneck as the number of robots grew. Originally, one shared ROS master node helped map topics between the simulator and robots. This pattern of the one master node and distinct namespaces for each of the robots was common in multi-robot configurations for Gazebo and Stage-Ros. But this design created several challenges. For one, many packages in ROS are not written to fully accommodate namespaces, which then caused mismatched topics. Also, the one ROS master had a shared transform tree that decayed in performance with each additional robot.

To resolve the issue, the open source project NIMBRO helped relay specific topics to the other nodes, which allowed for a ROS master node for each robot. This is the most desirable configuration, as robots are run in the real world with their own ROS master, and the simulation should test ROS controllers under similar conditions.

In the end, the simulation system scaled to more than 100 robots, and BMW was able to run scenarios that would otherwise require much higher investment for real-world testing. Additionally, a simulation environment served several functions. It not only provided a much-needed exploration environment for their questions around robot fleet management, but also served as a staging ground for new deployments to their factories, giving BMW more confidence that new releases would perform well under realistic loads.

Future steps

Our partnership with BMW to develop a parts delivery system on Azure, connected to autonomous robots, was a success. Rolling out a robot fleet in a highly critical assembly line area can only be done with great care and consideration for the safety of the plant and its workers. To this end, BMW is currently testing the robot functions in a test facility representing a portion of the full assembly plant. Additional testing is being done within the simulation environment with the same live data used to run the plant. These two testing methods together will build confidence in the true capability of the robot, the accuracy of the overall system, and help define the volume at which the robots can be safely released into production.

Looking ahead, BMW plans on running the simulation environment and the production environment side-by-side with the production data feeding into both. In the simulation environment, the virtual robots are running the same release of ROS as the real robots, the virtual world is using the same map as the robots use to navigate the real world, and the robots are communicating with an instance of the orchestration software that involves the same bits running in production. Bringing future upgrades through the simulation environment will provide a production-like test environment to validate the changes with very high levels of confidence.

Results

Microsoft has created a ROS package that enables two-way communication between any ROS-powered robot and Microsoft’s Azure cloud via Azure IoT Hub. Telemetry from robots can be sent to Azure for processing while command and control messages can be sent through Azure to individual robots. Microsoft and BMW are actively improving the system, bringing new functionality (Edge) to ensure high availability and robustness for any assembly scenario.

We have shared two ROS frameworks on GitHub that you can download to learn more. One is a ROS simulation package that allows you to simulate up to 300 robots in a factory environment. The other is a ROS orchestration framework that allows you to orchestrate and control 300 robots via messages sent through Azure IoT.

Finally, all of the enhancements to the ARGoS simulation platform have been submitted back to their original projects for inclusion and the scripts necessary to run the environment in Azure are available online.

]]>https://www.microsoft.com/developerblog/2018/12/19/improving-safety-and-efficiency-in-bmw-manufacturing-plants-with-an-open-source-platform-for-managing-inventory-delivery/feed/1https://www.microsoft.com/developerblog/2018/12/19/improving-safety-and-efficiency-in-bmw-manufacturing-plants-with-an-open-source-platform-for-managing-inventory-delivery/Social Stream Pipeline on Databricks with auto-scaling and CI/CD using Travishttp://feedproxy.google.com/~r/microsoft/devblog/~3/ILjPGFYf0xs/
https://www.microsoft.com/developerblog/2018/12/12/databricks-ci-cd-pipeline-using-travis/#respondWed, 12 Dec 2018 21:45:51 +0000https://www.microsoft.com/developerblog/?p=9429This code story describes CSE's work with ZenCity to create a data pipeline on Azure Databricks supported by a CI/CD pipeline on TravisCI. The aim of the collaboration was to create a pipeline capable of processing a stream of social posts, analyzing them, and identifying trends.

For the tech companies designing tomorrow's smart cities, enabling local authorities to collect and analyze large quantities of data from many different sources and mediums is critical. Data can come from different sources, from posts on social media and data automatically collected from IoT devices, to information submitted by citizens through a range of different channels.

To consolidate a continuous stream from this myriad of sources, these companies need an infrastructure strong enough to support the load. They also require a flexible infrastructure that offers the right tools, can automatically scale up or down, and provides an environment dynamic enough to support quick changes to model processing and data scoring.

One such company, ZenCity, is dedicated to making cities smarter by processing social, IoT and LOB data to identify and aggregate exceptional trends. In June 2018, ZenCity approached CSE to partner in building a pipeline that could analyze a varying array of data sources, scale according to need, and potentially scale separately for specific customers. At the outset of our collaboration with ZenCity, our team evaluated ZenCity’s existing infrastructure, which consisted of manually managed VMs that were proving difficult to maintain and support given the startup’s rapidly growing customer base. It was very important to understand ZenCity’s needs and to try and predict how those needs would evolve in the near future as the company grows.

Our primary role in the collaboration was to investigate Azure Databricks and other streaming alternatives that might meet their requirements and to recommend the best approach from a technical standpoint. However, we also aimed to integrate the systems with other Azure services and online OSS libraries that could support the sort of pipeline ZenCity needed.

As the project progressed, our team discovered that there are very few online open source examples that demonstrate building a CI/CD pipeline for a Spark-based solution. And, to the best of our knowledge, none of the examples demonstrated a Databricks-based solution that can utilize its rich features. As a result, we decided to provide a CI/CD sample and address the challenges of continuous integration and continuous deployment for a Databricks-based solution.

This code story describes the challenges, solutions and technical details of the approach we decided to take.

Challenges

In searching out the most suitable solution for ZenCity, we faced the following challenges:

Finding an Event Stream Processing solution that was capable of near real-time processing of events coming in from social networks, LOB (line of business) systems, IoT devices, etc.

Building a solution that scales and could support a growing market of customers

Constructing a CI/CD pipeline around the solution that supports several environments (e.g., development, staging and production)

Solution

For simplicity, the architecture diagram below describes a single workload, chosen as an example from several somewhat similar ones that we focused on.
It depicts the ingestion and analysis of tweets from a Twitter feed.

Architecture

The above architecture uses Databricks notebooks (written in Scala) and Event Hubs to separate computational blocks and enable scalability in a smart way.

The pipeline works as a stream that flows in the following manner:

Ingest tweets and push them into the pipeline for processing

Each tweet is enriched with Language and an associated Topic

From here the stream diverges into 3 parallel parts:

Each enriched tweet is saved in a table on an Azure-based SQL Database

A model runs on a sliding window, scanning the last 10 minutes for topic anomalies

Once a day, the entire batch of tweets is processed for topic anomalies

Each time an anomaly is detected, it is passed to a function app that sends an email describing the anomaly

Databricks

Databricks is a management layer on top of Spark that exposes a rich UI, a scaling mechanism (including a REST API and CLI tool), and a simplified development process. We chose Databricks specifically because it enables us to:

Create clusters that automatically scale up and down

Schedule jobs to run periodically

Co-edit notebooks (*)

Run Scala notebooks interactively and see results interactively

Integrate with GitHub

The option to create clusters on demand can also potentially enable a separate execution environment for a specific customer that scales according to their individual need.

* The idea of notebooks used in Databricks is borrowed from Jupyter Notebooks and is meant to provide an easy interface to manipulate queries and interact with data during development.

Databricks Deployment Scripts

In order to create a maintainable solution that supports both CI/CD and easy manual deployment, we needed a suite of scripts that could perform granular actions like "deploy resources to Azure" or "upload environment secrets to Databricks." While almost all of these actions are achievable using azure-cli, databricks-cli, or the other libraries needed to deploy, build, and test such a solution, being able to call on them quickly is essential when developing the solution and checking changes. More importantly, it is critical for supporting a CI/CD pipeline that doesn't require any manual interaction.

To aggregate all scripts/actions into a manageable and coherent collection of commands, we built upon Lace Lofranco’s work and used make which can be run locally from a Linux terminal or on Travis.

Using make, the entire solution can be deployed by running make deploy, while providing (when prompted) the appropriate parameters for the resource group name, region, and subscription ID.

The Makefile deployment target runs a collection of scripts that use azure-cli, databricks-cli, and Python.

Getting Started

Using our sample project on GitHub, you can run deployment on Azure from your local environment and follow the prompt for any details.

The make deploy command can also be run with all parameters supplied up front, but it will still prompt for an access token from Databricks.

To set up a test environment, follow the Integration Tests section in the README file.

Deploying the ARM template

The ARM template enabled us to deploy all the resources in the solution into a single resource group while associating the keys and secrets between them. Using ARM deployment, all resources except Databricks could be fully configured, with the various secrets and keys wired up quickly.

We also used the ARM output feature to export all keys and secrets into a separate .env file which could later be used to configure Databricks.

Configuring Databricks Remotely

To configure Databricks, we used databricks-cli, which is a command line interface tool designed to provide easy remote access to Databricks and most of the API it offers.

The first script uploads all the relevant secrets into the Databricks environment, making them available to all clusters that will be created in it. The second script configures the libraries, clusters, and jobs that are required to run as part of the pipeline.

A dedicated line in the job definition adds a parameter to the job execution of each notebook; the notebooks affected by this parameter switch to running with mock data.

Cleanup

When running on a test environment, it was necessary to remove the excess resources once the test had completed its execution. In a full scenario, we’d be able to completely delete the ARM resource group and re-create it in the next execution. But, because there’s currently no API for creating a token for Databricks, it was necessary to generate the token manually. This meant we were not able to delete the resources between executions and it was important to keep the Databricks resource (although not its clusters) alive between test runs. Because this was the case, we needed to make sure all jobs we initiated were terminated so that they didn’t continue to consume resources – the cleanup script would only be run in a test deployment, after a successful / failed test run.

Java Packages and Build

The build stage builds the Java packages that are uploaded to Databricks and used during job execution.

The integration test run by Travis CI connects to Azure Event Hubs and listens on the last event hub in the pipeline, the one receiving the alerts. If a new alert is identified, the test process exits successfully; otherwise, the test fails.
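
The original test is not reproduced in this extract; a sketch of the same check using the azure-eventhub Python SDK is shown below. The connection string, event hub name, and timeout are assumptions, and the test in the repository may be implemented differently:

import sys
import threading
from azure.eventhub import EventHubConsumerClient

CONNECTION_STR = '<alerts-eventhub-connection-string>'  # placeholder
EVENTHUB_NAME = 'alerts'                                # placeholder
TIMEOUT_SECONDS = 300

alert_seen = threading.Event()

def on_event(partition_context, event):
    if event is not None:
        print('Alert received:', event.body_as_str())
        alert_seen.set()

client = EventHubConsumerClient.from_connection_string(
    CONNECTION_STR, consumer_group='$Default', eventhub_name=EVENTHUB_NAME)

# Listen on a background thread, then stop after the first alert or the timeout
worker = threading.Thread(
    target=client.receive,
    kwargs={'on_event': on_event, 'starting_position': '@latest'},
    daemon=True)
worker.start()
alert_seen.wait(timeout=TIMEOUT_SECONDS)
client.close()

sys.exit(0 if alert_seen.is_set() else 1)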

Databricks Continuous Integration Using Travis

Travis-CI is a great tool for continuous integration, listening to GitHub changes, and running the appropriate deployment scripts.

In this project, we used Travis to listen to any change on the master branch and execute a test deployment. This configuration can also be changed to run once a day if you don't want every change to trigger a build.

All the configuration of Azure and Databricks can currently be done remotely and automatically via the scripts described in this article, except for one task: creating an authentication token in Databricks. This task requires manual interaction with the Databricks workspace.

For that reason, to run tests in a test environment, it was necessary to first deploy a test environment, and use the output from the deployment to configure Travis repository settings with those parameters. To see how to do that, continue reading here.

Integration Testing

Together with ZenCity, we discussed adding unit testing to the Databricks pipeline notebooks, but we found that running unit tests on a stream-based pipeline was not something we wanted to invest in at that moment. For more information on unit testing Spark-based streams, see here.

Integration testing on the other hand seemed like something that would help us figure out the maturity of the code being checked end to end, but also seemed quite a challenge. The idea behind the integration testing was to spin up an entire test environment, let the tests run with mock data, and “watch” the end of the pipeline while waiting for specific results.

Mocking Cognitive Services

We used Text Analytics in Cognitive Services to get a quick analysis of the language and topics of each tweet. Although this solution worked well, the throttling limits on Cognitive Services (which apply at all SKU levels, with varying limits) could not support a pipeline at the scale we were hoping for. Therefore, we used those services as an implementation example only; in the customer deployment, we used scalable proprietary models developed by the customer.

For that reason, in the published sample, we chose to mock those requests using a Function App with constant REST responses – in a real-life scenario this should be replaced with Cognitive Services (in case of a small scale stream), an external REST API, or a model that can be run by Spark.

Twitter API

In this sample, the production version uses the Twitter API to read tweets on a certain hashtag/topic and ingests them into the data pipeline. This API was encapsulated to enable mocking with predefined data in a test environment. Both the Twitter and the mock implementations use the following interface:
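
The interface itself is not shown in this extract. A Python rendering of what it conveys is sketched below; the pipeline notebooks are written in Scala, so this class and its method names are purely illustrative:

from abc import ABC, abstractmethod

class TweetSource(ABC):
    """Common contract shared by the live Twitter reader and the test-time mock reader."""

    @abstractmethod
    def read(self, topic):
        """Return an iterable of raw tweets about the given hashtag/topic to ingest."""

class MockTweetSource(TweetSource):
    def __init__(self, canned_tweets):
        self._canned_tweets = canned_tweets

    def read(self, topic):
        # In a test environment, replay predefined data instead of calling the Twitter API
        return list(self._canned_tweets)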

Conclusion

We started out looking to help ZenCity solve a challenge around building a cloud-based data pipeline, but found that the real challenge was finding and creating a CI/CD pipeline that can support that kind of solution.

Along the way we worked on generalizing the solution in a way that would allow it to be configured and enhanced to work for any CI/CD pipeline around a Databricks-based pipeline. The generalization process made our solution flexible and adaptable for other similar projects.

Additionally, the approach we took with the pipeline, which includes integration testing of the entire pipeline in a full test environment, can also be applied to other projects that include a data pipeline.

There are two requirements for adopting this approach: first, a streaming pipeline in which it is hard to test each component in isolation, or whose SDK does not expose an easily testable API; and second, a data pipeline that lets you control the input and monitor the output.

The scripts in this solution can be changed to support the deployment and configuration of any remotely controlled streaming infrastructure.