
Key Takeaways

Currently, Azure Data Lake Analytics can be used for batch workloads only. For streaming and event-processing workloads, alternative big data analytics solutions on Azure, such as HDInsight or Azure Databricks, should be used.

U-SQL combines the concepts and constructs of both SQL and C#; the power of U-SQL comes from the simplicity and declarative nature of SQL combined with the programmatic power of C#, including its rich types and expressions.

U-SQL operates on unstructured data stored in files and provides a schematized view on top of it. For structured data, it also provides a general metadata catalog system very similar to that of relational databases.

Even though big data and Hadoop technologies are more than a decade old now, big data and big data analytics are more relevant than ever. While the initial version of Hadoop was only able to handle batch workloads, the Hadoop ecosystem now has tools for other use cases like structured data, streaming data, event processing, machine learning workloads and graph processing.

While the Hadoop ecosystem relies on a collection of tools like Hive, Impala, Pig, Storm, and Mahout to provide the complete set of features, newer data analytics frameworks like Spark take an integrated approach to handling different types of workloads.

Azure Data Lake Analytics, or ADLA, is one of the newer big data analytics engines. ADLA is Microsoft's fully managed, on-demand analytics service on the Azure cloud. Together with Azure Data Lake Storage and HDInsight, Azure Data Lake Analytics forms Microsoft's complete cloud-hosted data lake and analytics offering. Azure Data Lake Analytics introduces a new big data query and processing language called U-SQL. This article provides an overview of the U-SQL language and how to use it in applications.

Azure Databricks – A managed, serverless analytics service based on Apache Spark. It supports a Jupyter/IPython/Zeppelin-like notebook experience, along with other collaboration features, and supports Scala, Python, R and SQL.

Azure Analysis Services – A fully managed analytics engine on Azure that helps build semantic models in the cloud. It's built on the familiar SQL Server Analysis Services, the on-premises analytics engine that ships with SQL Server. As of now, Azure Analysis Services only supports tabular models and does not support multidimensional models (remember cubes?).

U-SQL Introduction

U-SQL is the big data query and processing language for Azure Data Lake Analytics. It's a new language created by Microsoft especially for Azure Data Lake Analytics. U-SQL combines a SQL-like declarative language with the programmatic power of C#, including C#'s rich types and expressions. U-SQL provides familiar big data processing concepts such as "schema on read", "lazy evaluation", and custom processors and reducers. Data engineers who have previously used languages like Pig, Hive and Spark will find U-SQL similar to those, and developers with C# and SQL knowledge will find it easy to learn and get started with.

Figure 3: How U-SQL relates to C# and SQL

Though U-SQL uses many concepts and keywords from the SQL language, it is not ANSI SQL compliant. It adds unstructured file handling capabilities using keywords like EXTRACT and OUTPUT.
Currently, ADLA and U-SQL can be used for batch processing only; they don't provide stream analytics or event processing capabilities.

U-SQL Concepts and Scripts

U-SQL query and processing logic is written in files with the ".usql" extension, called U-SQL scripts. The Visual Studio IDE or the Azure portal can be used to author these scripts. A U-SQL project in Visual Studio contains multiple scripts, code-behind files and related reference assemblies.

U-SQL scripts follow the familiar Extract/Retrieve, Transform and Load/Output (ETL) pattern used by other big data languages like Pig or Spark. They can extract data from text files (both unstructured text files and semi-structured files like JSON or XML) and tables.

U-SQL imposes a schema while retrieving unstructured data from files – this helps in performing SQL-like operations on the retrieved data.

The rowset is the primary data structure of U-SQL. It is used throughout a script: for extracting data from the input file or table, for performing transformations, and for writing to the output destination. Rowsets are unordered, which helps the Azure Data Lake Analytics engine parallelize processing across multiple processing nodes.

U-SQL scripts can use types, operators and expressions from C#.

U-SQL scripts use SQL constructs like SELECT, WHERE and JOIN, along with other data definition (DDL) and data manipulation (DML) statements. All keywords must be written in upper case.
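For example, a transformation can mix upper-case SQL keywords with C# methods and expressions. The sketch below assumes a @restaurants rowset with Name, Cuisine and Rating columns has already been extracted; the names are illustrative assumptions:

// Upper-case SQL keywords combined with C# expressions (column names are assumptions).
@highRated =
    SELECT Name.ToUpper() AS RestaurantName,
           (Rating >= 4.0 ? "High" : "Standard") AS RatingBand
    FROM @restaurants
    WHERE Cuisine == "Italian" && Rating > 3.0;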

U-SQL supports control-flow constructs like IF ELSE, but does not support While or For loops.


Figure 5: Data flow in U-SQL script

What is required for U-SQL local development

Microsoft provides an emulator-like setup for trying U-SQL and Azure Data Lake on a local machine or laptop. For this, three components are required:

Visual Studio 2017 or 2019

Azure SDK (version 2.7.1 or higher), which provides the client-side SDKs needed to interact with Azure cloud services such as storage and compute.

Azure Data Lake and Stream Analytics Tools for Visual Studio, which add the U-SQL project templates and local run support.
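With the tooling in place, a first U-SQL script can extract the restaurant data from a file and write it back out. The following is a minimal sketch; the file paths and restaurant columns are illustrative assumptions:

// Schema on read: a schema is imposed on the unstructured file at EXTRACT time.
@restaurants =
    EXTRACT RestaurantId int,
            Name string,
            Cuisine string,
            FavoriteDish string,
            Rating double
    FROM "/input/Restaurants.tsv"
    USING Extractors.Tsv();

// Write the full rowset to the output file in tab-separated format.
OUTPUT @restaurants
TO "/output/Restaurants.tsv"
USING Outputters.Tsv();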

The script writes the entire restaurant rowset to the output file in a tab-separated format.

Note that C# datatypes are used here (e.g., string rather than the char/varchar typically used in SQL). Not only can we use C# datatypes, but also C# expressions and all the goodness of an expressive programming language.

U-SQL script performing joins

U-SQL supports joins between two different datasets, including inner join, outer join and cross join.
In the following code snippet, we perform an inner join between the restaurants dataset and a dish ingredients dataset.
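The sketch below assumes the @restaurants rowset from the earlier extract and an ingredients file with illustrative columns:

// Extract the dish ingredients dataset (path and columns are assumptions).
@ingredients =
    EXTRACT DishName string,
            Ingredient string,
            Quantity string
    FROM "/input/DishIngredients.tsv"
    USING Extractors.Tsv();

// Keep highly rated restaurants and attach the ingredients of their favorite dish.
@topRestaurantDishes =
    SELECT r.Name,
           r.FavoriteDish,
           r.Rating,
           i.Ingredient,
           i.Quantity
    FROM @restaurants AS r
         INNER JOIN @ingredients AS i
         ON r.FavoriteDish == i.DishName
    WHERE r.Rating >= 4.0;

OUTPUT @topRestaurantDishes
TO "/output/TopRestaurantDishes.tsv"
USING Outputters.Tsv();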

This returns the restaurants with higher ratings, along with the ingredient details of each restaurant's favorite dish, retrieved by joining the restaurant details rowset with the dish ingredients rowset through an inner join.

U-SQL catalog

So far, we have focused on unstructured and semi-structured data being read from files and written to files. While one of U-SQL's strengths is operating on unstructured data stored in files and providing a schematized view on top of it, U-SQL can also manage structured data. It provides a general metadata catalog system similar to Hive's. Below is a list of the primary objects supported by U-SQL:

Database: U-SQL supports databases similar to other big data systems like Hive.

Tables and Indexes: Tables are containers that hold structured data. Tables can contain columns of different data types, and table data is stored in files. Compared to simple schematized views over unstructured files, tables provide additional advantages such as indexing and the ability to partition table data into multiple buckets, with each bucket backed by a file.

Views: U-SQL views are of two types – (i) views that are based on a U-SQL table and (ii) views that point to a file and use EXTRACT to get the data.

Functions: U-SQL supports both scalar and table-valued functions.

Procedures: Procedures are similar to functions but they don’t return any value.

Now, let’s say in our restaurant rating example, we would like to further analyze restaurants with low ratings. To do so, we would like to move all the restaurants with less than a four rating to a U-SQL table for further analysis.

U-SQL database, tables and indexes

In the example below, we will create a U-SQL database and a table within it, defined with a schema and an index key. We are not creating a schema explicitly here, so the table will be created under the default schema 'dbo' (remember SQL Server?) inside the database.
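A sketch of such a script follows; the database, table, column and index names are illustrative assumptions:

// Create the database and switch to it.
CREATE DATABASE IF NOT EXISTS RestaurantsDW;
USE DATABASE RestaurantsDW;

// The table lands in the default 'dbo' schema, with a clustered index
// and hash distribution on RestaurantId.
CREATE TABLE IF NOT EXISTS dbo.LowRatedRestaurants
(
    RestaurantId int,
    Name string,
    Cuisine string,
    FavoriteDish string,
    Rating double,
    INDEX idx_LowRatedRestaurants CLUSTERED (RestaurantId ASC)
        DISTRIBUTED BY HASH (RestaurantId)
);

// Extract the restaurant data and move the low-rated restaurants into the table.
@restaurants =
    EXTRACT RestaurantId int,
            Name string,
            Cuisine string,
            FavoriteDish string,
            Rating double
    FROM "/input/Restaurants.tsv"
    USING Extractors.Tsv();

INSERT INTO dbo.LowRatedRestaurants
SELECT RestaurantId, Name, Cuisine, FavoriteDish, Rating
FROM @restaurants
WHERE Rating < 4.0;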

U-SQL views

U-SQL views are similar to database views – they do not physically store the data, but provide a view over data stored in tables or files. A view can be based on a table or on an extraction over files.

The example script below shows how to create a view that’s based on an extraction.
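As a sketch, with an assumed file path and columns:

// Create a view defined over an EXTRACT; querying the view re-reads the file.
USE DATABASE RestaurantsDW;

DROP VIEW IF EXISTS dbo.RestaurantsView;
CREATE VIEW dbo.RestaurantsView
AS
    EXTRACT RestaurantId int,
            Name string,
            Cuisine string,
            Rating double
    FROM "/input/Restaurants.tsv"
    USING Extractors.Tsv();

// The view can then be queried like a table.
@allRestaurants =
    SELECT *
    FROM dbo.RestaurantsView;

OUTPUT @allRestaurants
TO "/output/AllRestaurants.tsv"
USING Outputters.Tsv();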

U-SQL table valued functions (TVF)

U-SQL supports both scalar functions and table-valued functions (TVFs). Functions take zero to many arguments and return either a single scalar value or a table, which is a dataset composed of columns and rows.

The code snippets below show first how to create a TVF and then how to invoke it. The function takes a single parameter and returns a table.
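A sketch of the function definition, with an assumed file path and columns (the function name matches the invocation shown afterwards):

// Table-valued function: takes a restaurant type and returns a matching rowset.
CREATE FUNCTION IF NOT EXISTS RestaurantsDW.dbo.tvf_SearchRestaurants(@restaurantType string)
RETURNS @searchResult TABLE
(
    RestaurantId int,
    Name string,
    Cuisine string,
    Rating double
)
AS
BEGIN
    @allRestaurants =
        EXTRACT RestaurantId int,
                Name string,
                Cuisine string,
                Rating double
        FROM "/input/Restaurants.tsv"
        USING Extractors.Tsv();

    @searchResult =
        SELECT RestaurantId, Name, Cuisine, Rating
        FROM @allRestaurants
        WHERE Cuisine == @restaurantType;
END;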

Now let’s invoke the table valued function we just created and pass ‘Bakery’ as parameter– it would return all the restaurants which are of type Bakery.

OUTPUT RestaurantsDW.dbo.tvf_SearchRestaurants("Bakery")
TO "/output/BakeryRestaurants.csv"
USING Outputters.Csv();

Case Study

The following case study highlights the use of Azure Data Lake Analytics and the U-SQL language in a large, multiyear, strategic digital transformation program. The customer, a large insurance major, had over the years acquired multiple insurance companies and brokers, and as a result used multiple customer engagement systems for interacting with customers over email, text/SMS, web/mobile chat and calls (both inbound and outbound). Because of this fractured approach, it became very difficult for the customer to analyze customer interaction data.

While the customer embarked on a journey to build an omni-channel platform and an integrated contact center for customer service over various channels (email, text, chat bot, contact center voice calls), their immediate tactical choice was to analyze data from the various sources for email, text/SMS, chat and call logs.

An Azure Data Lake-based solution was developed to answer the immediate need of analyzing data from different systems, in different formats. Data from the various source systems was moved to Azure Data Lake Store and then analyzed using Azure Data Lake Analytics and U-SQL.

Ingest – In the ingest phase, unstructured and structured data from two different sources (Email/Text/Chat data as well as Call Log) are moved to Azure using Azure Data Factory ETL service.

Analyze – Various types of analysis including filtering, joins, aggregation, windowing etc. are performed in U-SQL.

Model and Serve – Analyzed data is stored in structured tables for later consumption by users through Power BI or custom reports.


Figure 10: Azure Data Analytics Pipeline

Conclusions

Azure Data Lake Storage and Analytics have emerged as a strong option for performing big data and analytics workloads, alongside Azure HDInsight and Azure Databricks. Though still in its early days and lacking streaming and event-processing capabilities, its power lies in the new U-SQL language, which combines the simplicity and ubiquity of SQL with the power of Microsoft's flagship C# language. In addition, Microsoft's development tools like Visual Studio, together with the local dev/test capability, make it a strong competitor in the big data and analytics space.

About the Author

Aniruddha Chakrabarti has 19 years of experience spread across strategy, consulting, product development and IT services. He has experience across functions including solution architecture, presales, technology architecture, delivery leadership and program management. As AVP of digital at Mphasis, Chakrabarti is responsible for presales, solutions, RFP/RFI and technology architecture for large digital deals and programs. Prior to joining Mphasis, he played various leadership and architecture-focused roles at Accenture, Microsoft, Target, Misys and Cognizant. His focus areas include cloud, big data & analytics, AI/ML, NLP, IoT, distributed systems, microservices and DevOps.