Introducing U-SQL – A Language that makes Big Data Processing Easy

Microsoft announced the new Azure Data Lake services for analytics in the cloud. They include a hyper-scale repository; a new analytics service built on YARN that lets data developers and data scientists analyze all their data; and HDInsight, a fully managed Hadoop, Spark, Storm, and HBase service. Azure Data Lake Analytics includes U-SQL, a language that unifies the benefits of SQL with the expressive power of your own code. U-SQL’s scalable distributed query capability lets you efficiently analyze data in the store and across relational stores such as Azure SQL Database. In this blog post I will outline the motivation for U-SQL and some of the inspiration and design philosophy behind the language, and show you a few examples of its major aspects.

Why U-SQL?

If you analyze the characteristics of Big Data analytics, several requirements arise naturally for an easy to use, yet powerful language:

Process any type of data. From analyzing BotNet attack patterns from security logs to extracting features from images and videos for machine learning, the language needs to enable you to work on any data.

Use custom code easily to express your complex, often proprietary business algorithms. The example scenarios above may all require custom processing that is often not easily expressed in standard query languages, ranging from user defined functions, to custom input and output formats.

Scale efficiently to any size of data without you focusing on scale-out topologies, plumbing code, or limitations of a specific distributed infrastructure.

How do existing Big Data languages stack up to these requirements?

SQL-based languages (such as Hive and others) provide you with a declarative approach that natively does the scaling, parallel execution, and optimizations for you. This makes them easy to use, familiar to a wide range of developers, and powerful for many standard types of analytics and warehousing. However, their extensibility models and support for unstructured data and files are often bolted on and harder to use. For example, even if you just want to quickly explore your data in a file or remote data source, you need to create catalog objects to schematize the file data or remote sources before you can query them, which reduces your agility. And although SQL-based languages often have several extensibility points for custom formatters, user-defined functions, and aggregators, these extensions are rather complex to build, integrate, and maintain, with varying degrees of consistency in the programming models.

Programming language-based approaches to processing Big Data, for their part, provide an easy way to add your custom code. However, the programmer often has to explicitly code for scale and performance, down to managing the execution topology and workflow, such as synchronizing the different execution stages or managing the scale-out architecture. This code can be difficult to write correctly and to optimize for performance. Some frameworks support declarative components such as language-integrated queries or embedded SQL. However, the SQL may be integrated as strings and thus lack tool support, and the extensibility integration may be limited or, because the procedural code does not guard against side effects, hard to optimize and unsuited for reuse.

Taking the issues of both SQL-based and procedural languages into account, we designed U-SQL from the ground-up as an evolution of the declarative SQL language with native extensibility through user code written in C#. This unifies both paradigms, unifies structured, unstructured, and remote data processing, unifies the declarative and custom imperative coding experience, and unifies the experience around extending your language capabilities.

U-SQL is built on the learnings from Microsoft’s internal experience with SCOPE and existing languages such as T-SQL, ANSI SQL, and Hive. For example, we base our SQL and programming-language integration and the execution and optimization framework for U-SQL on SCOPE, which currently runs hundreds of thousands of jobs internally each day. We also align the metadata system (databases, tables, etc.), the SQL syntax, and the language semantics with T-SQL and ANSI SQL, the query languages most of our SQL Server customers are familiar with. And we use C# data types and the C# expression language so you can seamlessly write C# predicates and expressions inside SELECT statements and use C# to add your custom logic. Finally, we looked to Hive and other Big Data languages to identify patterns and data-processing requirements and integrate them into our framework.

In short, building the U-SQL language on these existing languages and experiences should make it easy for you to get started, yet powerful enough for the hardest problems.

Show me U-SQL!

Let’s assume that I have downloaded my Twitter history of all my tweets, retweets, and mentions as a CSV file and placed it into my Azure Data Lake Store.

In this case I know the schema of the data I want to process, and for starters I want to just count the number of tweets for each of the authors in the tweet “network”:

@t = EXTRACT date string
           , time string
           , author string
           , tweet string
     FROM "/input/MyTwitterHistory.csv"
     USING Extractors.Csv();

@res = SELECT author
            , COUNT(*) AS tweetcount
       FROM @t
       GROUP BY author;

OUTPUT @res TO "/output/MyTwitterAnalysis.csv"
ORDER BY tweetcount DESC
USING Outputters.Csv();

The above U-SQL script shows the three major steps of processing data with U-SQL:

Extract data from your source. Note that you just schematize it in your query with the EXTRACT statement. The datatypes are based on C# datatypes and I use the built-in Extractors library to read and schematize the CSV file.

Transform using SQL and/or custom user defined operators (which we will cover another time). In the example above, it is a familiar SQL expression that does a GROUP BY aggregation.

Output the result either into a file or into a U-SQL table to store it for further processing.
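For the third step, a sketch of what persisting @res into a U-SQL table could look like instead of writing a file (the table name is made up, and the exact DDL options may differ by release):

```sql
CREATE TABLE TweetAuthorCounts
(
    INDEX idx CLUSTERED(author ASC)
    PARTITIONED BY HASH(author)
) AS SELECT author, tweetcount FROM @res;
```

A table stores the schematized result in the U-SQL catalog so later scripts can query it without re-extracting the CSV.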

Note that U-SQL’s SQL keywords have to be upper-case to differentiate them syntactically from C# expressions that use the same keywords with a different meaning.
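For example, in this hypothetical filter against @t, SELECT, FROM, and WHERE are U-SQL keywords, while the predicate is a C# expression and therefore uses C# comparison syntax (the author handle is made up):

```sql
@rs = SELECT author, tweet
      FROM @t
      WHERE author == "SomeAuthor" && tweet.Contains("U-SQL");
```

A lower-case `select` would instead be parsed as a C# identifier, which is why the case distinction matters.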

Also notice that each expression is assigned to a variable (@t and @res). This lets U-SQL incrementally transform and combine data step by step, expressed as an incremental expression flow using functional lambda composition (similar to what you find in the Pig language). The execution framework then composes the expressions into a single expression. That single expression can then be globally optimized and scaled out in a way that would not be possible if the expressions were executed line by line. The following picture shows you the graph generated for the next query in this blog post:

Going back to our example, I now want to add information about the people mentioned in the tweets and extend my aggregation to return how often people in my tweet network author tweets and how often they are mentioned. Because I can use C# to operate on the data, I can use an inline C# LINQ expression to extract the mentions into an ARRAY. Then I turn the array into a rowset with EXPLODE and apply the EXPLODE to each row’s array with a CROSS APPLY. I union the authors with the mentions, but need to drop the leading @-sign to align them with the author values. This is done with another C# expression, where I take the Substring starting at position 1.

@t = EXTRACT date string
           , time string
           , author string
           , tweet string
     FROM "/input/MyTwitterHistory.csv"
     USING Extractors.Csv();

@m = SELECT new SQL.ARRAY<string>(
                tweet.Split(' ').Where(x => x.StartsWith("@"))) AS refs
     FROM @t;

@t = SELECT author, "authored" AS category
     FROM @t
     UNION ALL
     SELECT r.Substring(1) AS r, "mentioned" AS category
     FROM @m CROSS APPLY EXPLODE(refs) AS Refs(r);

@res = SELECT author
            , category
            , COUNT(*) AS tweetcount
       FROM @t
       GROUP BY author, category;

OUTPUT @res TO "/output/MyTwitterAnalysis.csv"
ORDER BY tweetcount DESC
USING Outputters.Csv();
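Outside of U-SQL, the inline C# expression above behaves like ordinary LINQ. A standalone C# sketch of the same logic (the sample tweet is made up, purely for illustration):

```csharp
using System;
using System.Linq;

class MentionDemo
{
    static void Main()
    {
        string tweet = "Thanks @alice and @bob for the retweet!";

        // Same logic as the U-SQL expression: split on spaces,
        // keep only tokens that start with '@'.
        var refs = tweet.Split(' ').Where(x => x.StartsWith("@"));

        // Drop the leading '@', as the script does with Substring(1).
        var mentions = refs.Select(r => r.Substring(1));

        Console.WriteLine(string.Join(",", mentions)); // prints "alice,bob"
    }
}
```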

As a next step I can use the Azure Data Lake Tools for Visual Studio to refactor the C# code into C# functions using the tool’s code-behind functionality. When I then submit the script, it automatically deploys the code to the service.

I can also deploy and register the code as an assembly in my U-SQL metadata catalog. This allows me and other people to use the code in future scripts. The following script shows how to refer to the functions, assuming the assembly was registered as TweetAnalysis:

REFERENCE ASSEMBLY TweetAnalysis;

@t = EXTRACT date string
           , time string
           , author string
           , tweet string
     FROM "/input/MyTwitterHistory.csv"
     USING Extractors.Csv();

@m = SELECT Tweets.Udfs.get_mentions(tweet) AS refs
     FROM @t;

@t = SELECT author, "authored" AS category
     FROM @t
     UNION ALL
     SELECT Tweets.Udfs.cleanup_mentions(r) AS r, "mentioned" AS category
     FROM @m CROSS APPLY EXPLODE(refs) AS Refs(r);

@res = SELECT author
            , category
            , COUNT(*) AS tweetcount
       FROM @t
       GROUP BY author, category;

OUTPUT @res
TO "/output/MyTwitterAnalysis.csv"
ORDER BY tweetcount DESC
USING Outputters.Csv();

Because I noticed that I need a bit more cleanup around the mentions than just dropping the @ sign, the assembly also contains a cleanup_mentions function that does the additional processing.
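The post does not show the assembly’s source, but the functions the script references could look roughly like this. This is a sketch of hypothetical implementations: the Tweets.Udfs names come from the script above, while the exact cleanup rules are assumptions.

```csharp
using System.Linq;
using Microsoft.Analytics.Types.Sql; // U-SQL type library; SQL.ARRAY<T> maps to SqlArray<T>

namespace Tweets
{
    public static class Udfs
    {
        // Mirrors the inline LINQ expression from the earlier script:
        // split on spaces and keep tokens starting with '@'.
        public static SqlArray<string> get_mentions(string tweet)
        {
            return new SqlArray<string>(
                tweet.Split(' ').Where(x => x.StartsWith("@")));
        }

        // Drops the leading '@' and trims trailing punctuation that
        // often clings to mentions (assumed cleanup rules).
        public static string cleanup_mentions(string mention)
        {
            return mention.Substring(1).TrimEnd(',', ':', '.', '!', '?');
        }
    }
}
```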

This is why U-SQL!

I hope you got a glimpse of why we think U-SQL makes it easy to query and process Big Data, and that you understand our thinking behind the language. Over the next couple of weeks we will expand more on the language design philosophy and provide more sample code and scenarios over at our Big Data topic on the Azure blog. We will also dive deeper into many of the language’s additional capabilities.

Not Just U-SQL – Azure Data Lake provides Productivity on All Your Data

U-SQL is just one of the ways we are working to make Azure Data Lake the most productive environment for authoring, debugging, and optimizing analytics at any scale. With rich support for authoring and monitoring Hive jobs, a C#-based authoring model for building Storm jobs for real-time streaming, and support for every stage of the job lifecycle from development to operationalization, the Azure Data Lake services let you focus on the questions you want to answer rather than on debugging distributed infrastructure. Our goal is to make Big Data technology simpler and more accessible to the greatest number of people possible: Big Data professionals, engineers, data scientists, analysts, and application developers.

Michael has been doing data processing and query languages since the 1980s. Among other things he has been representing Microsoft on the XQuery and SQL design committees and has taken SQL Server beyond relational with XML, Geospatial and Semantic Search. Currently he is working on Big Data query languages such as SCOPE and U-SQL when he is not enjoying time with his family under water or at autocross.


I think an article with a few concrete examples comparing T-SQL vs U-SQL (including statistics, execution plans, …) would be really interesting.

When is V-SQL planned ? 🙂

3 years ago

Ilya Geller

U-SQL is obsolete.

Why?

For instance, there are two sentences:

a) ‘Pickwick!’

b) 'That, with the view just mentioned, this Association has taken into its serious consideration a proposal, emanating from the aforesaid, Samuel Pickwick, Esq., G.C.M.P.C., and three other Pickwickians hereinafter named, for forming a new branch of United Pickwickians, under the title of The Corresponding Society of the Pickwick Club.'

Evidently, ‘Pickwick’ has a different importance in each sentence, owing to the extra information in each. This distinction is reflected in the weights of the phrases that contain ‘Pickwick’: the first has weight 1, the second 0.11; a greater weight signifies stronger emotional ‘acuteness’, where the weight refers to the frequency with which a phrase occurs relative to other phrases.

U-SQL does not see and cannot produce the above statistics. How can you use what does not work with statistics?

3 years ago

mwinkle (msft)

@Ilya, one of the key features is the ease with which you can incorporate user code, e.g., a library that computes the weight or other language-processing statistics. Is there a specific library that you are partial to?

3 years ago

Yoshihiro Kawabata

Awesome, U-SQL.

I want to see U-SQL performance for over 100 TB data in Azure Data Lake Store.

3 years ago

Martin

Hi Ilya, here is how you would compute the importance of words in sentences:
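A minimal U-SQL sketch of one way to do this, splitting each sentence into words and computing each word’s relative frequency within the sentence (@sentences and all other names are purely illustrative, not a tested script):

```sql
@words = SELECT sentence, word
         FROM @sentences
         CROSS APPLY EXPLODE(new SQL.ARRAY<string>(sentence.Split(' '))) AS W(word);

@counts = SELECT sentence, word, COUNT(*) AS wc
          FROM @words
          GROUP BY sentence, word;

@weights = SELECT sentence, word,
                  (double) wc / SUM(wc) OVER (PARTITION BY sentence) AS weight
           FROM @counts;
```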

@Sebastien: Since the execution engine is different and the use cases are different, this would be a bit of an apple to orange comparison at the moment. But eventually there will be more comparisons. I will have a SQLPASS presentation on U-SQL for the SQL dev and dba, so functionality will be compared at some level. And V-SQL is not planned at the moment :).

@XSISTOR: Indeed, the use cases include TB and PB processing. Assuming you add enough processing capacity, U-SQL can scale over very large files. There will be a demo by us later this week that shows a bit more data than my simple example above (which was just highlighting the language functionality).

3 years ago

Rajeesh Menoth

Awesome…!!!

Good News..

3 years ago

Richard

Exciting.

Does it optimise the queries of a C# developer with limited DBA skills? 😛

U-SQL is designed to run on large clusters for now. There will be local execution for development as well, but a redistributable Windows executable is currently not planned. If you have use cases, please send them to us on the feedback alias above.

@Ryan: You can use any .NET language to code your registered assembly. We actually have a few F#-based libraries in use. However, the U-SQL expression language is C#-based at the moment. But the implementations of the functions that you call can be written in any .NET language. You can even call native code or deploy an .exe (although that is a bit more complex).

3 years ago

Chris Moore

You had me until "…a redistributable Windows executable is currently not planned". Seriously? So to use U-SQL you have to use it in the context of Azure, which is rather absurd. There are many obvious use cases for U-SQL everywhere. SQL CLR integration is OK, but it's rather clunky and definitely not agile. U-SQL could be a sweet spot for many scenarios. But keeping it locked up in Azure is not a brilliant move. Personally, I have no motivation to learn a language that is so restrictive. On the other hand, if I could develop with it on-premises the way I can with T-SQL, C#, F#, R, whatever, then I would be all over it.

@Chris: You will of course be able to develop your U-SQL code on-premises, outside of Azure, with a local execution component provided by the tools. Given that the current target scenario is to use the language to scale out processing over a possibly large amount of data and nodes in Azure, we (the Azure Data Lake team) are not focusing on providing an on-premises Azure Data Lake with U-SQL processing as a "product", though. However, I am very happy to hear your excitement about U-SQL, and I would be very happy to help other teams provide an on-premises offering. 🙂

Which tool will the analytics service use to interact with HDInsight applications? And how will data move between HDInsight blob storage and the Data Lake Store?

3 years ago

Andrey Artemyev

Do posts like this one mean that all further Big Data evolution is about Azure? Could you guys tell me whether I can expect some new Data Mining features in SSAS and the on-prem SQL Server Data Tools, or should I not waste my time on them and go with Azure?

@Krati: Any textual file that you store through HDInsight in WASB or Azure Data Lake Storage can be read with U-SQL. We are working on other formats as well.

@Andrey: I would suggest that you ask this question of the respective teams in SQL. SQL Server is not going away because of this, and we are not determining their product roadmap, although your feedback may influence them 🙂

3 years ago

Troy Rose

This looks and feels a lot like grunt/pig. What are the key differentiators between the two?

3 years ago

Saad Mahboob

Hi Michael,

I have a question. Does U-SQL combine the analytics and Big Data concepts with MS SQL? What is the main difference? What is the timeline for learning U-SQL, and how can it be a major benefit to old customers who are not doing much Big Data work?

Thanks,

Saad

3 years ago

Scott

Who is the target developer for this?

In my experience, DBAs, BI devs, and others with a strong T-SQL background tend to be rather intimidated by C#. Even when not intimidated, the quality of their C# code is sometimes an issue (which could be disastrous in cases where coding mistakes cost their companies big money on rented Data Lake nodes).

People with C# skills tend to be weaker on the relational side. When properly motivated, they tend to pick up T-SQL constructs and concepts quickly, but it's not a skill set they tend to walk in the door with.

In this language, those issues are amplified, because you switch back and forth between the syntaxes. Even if you know both, there is an added layer of complexity in understanding which context you are in and how the contexts interact.

I know this has been very successful with developers within Microsoft, but you guys aren't the average data dev. Is this something you're concerned about and/or working to address?

Grunt is more of a file-command shell, comparable to the HDFS command set. U-SQL and Pig have some things in common (like LOAD vs. EXTRACT, or the expression flow model), but U-SQL is more approachable for SQL developers and has a much deeper extensibility story.

@Saad:

U-SQL at this time is the language of the Azure Data Lake Analytics service and not part of SQL Server. In the ADL context it does combine these aspects. Learning it should not be too hard; if you know any SQL dialect, you can get started fairly quickly (for example, see http://aka.ms/usql-hol for a simple walk-through). The C# aspects you can add slowly as you get more experienced.

@Scott:

The target developer is the big data developer and analyst (once we have the interactive version) who is familiar with SQL and needs to start doing Big Data processing, but may have been intimidated by the complexity of managing clusters etc. (the latter, of course, is addressed not by the language but by the job service).

By understanding just a few (3 to 5, depending on the person) core differences, you can go pretty far with U-SQL: the type system is C# types, logic is 2-valued with nulls, keywords are all upper-case, and you use == and not =. The C# aspects can be brought in slowly after that. Also note that we will soon have a local development experience, so C# code can be tested before deploying it to the cluster :).

If you see any specific harsh transitions between the SQL and C# contexts, please let me know. One of the design goals of the language is to make it seamless.

We are of course also looking at improving tooling and making it easy for people with less language background.

3 years ago

Clarence

OK, where does that leave SQL Parallel Data Warehouse? And does U-SQL allow parallel processing, even if through the C# code? I mean, I'm very new to this, but as an old-school programmer this all seems to come down to easy key/value-pair data storage and parallel processing, with very little use for transactional or operational processing. Thoughts?

@Clarence: U-SQL provides you the parallelization framework to execute your code in parallel across the data, so you do not have to do explicit marshalling of the parallelism yourself. Note that the data is not stored as key-value pairs (once you schematize the content of a file, or when you refer to a table). SQL Parallel Data Warehouse is similar in using SQL as the way to provide a simple language for parallel execution; however, its current use cases are different. Azure Data Lake Analytics (which contains U-SQL) is at the moment focused on batch processing and data preparation, while SQL DW provides interactive analytics.

@Noel: The expression language inside a U-SQL script is C#. However, you can write your custom code assemblies in any .NET language. You can even write a C# wrapper that lets you call out into other languages. For some languages you may have to deploy the runtime as part of the script (there is a way to do so) if the runtime is not available. I am planning to write an example sometime next year for a blog post. That wrapper can then be offered so that programmers who prefer the other language can use it more easily.

This looks more like the declarative equivalent of the graphical ML Studio. Nevertheless, very cool stuff, but it remains to be seen how it scales up and how it can be used to replace or work with Apache Spark/Hadoop-like distributed environments. As I see it, this looks like a very good companion to streaming-like algorithms. Waiting for more. Nice to see Microsoft doing some serious innovation; it feels like a very good time to be a Microsoft developer, unlike in the last decade.