Featured in AI, ML & Data Engineering

In this article, author shows how to use big data query and processing language U-SQL on Azure Data Lake Analytics platform. U-SQL combines the concepts and constructs both of SQL and C#. It combines the simplicity and declarative nature of SQL with the programmatic power of C# including rich types and expressions.

Featured in Culture & Methods

The book Agile Leadership in Practice - Applying Management 3.0 by Dominik Maximini is an experience report of the agile transformation journey of NovaTec. Maximini shares his experiences from applying principles and practices from Management 3.0, success stories, failure stories, and learnings from experiments.

Featured in DevOps

Yuri Shkuro presents a methodology that uses data mining to learn the typical behavior of the system from massive amounts of distributed traces, compares it with pathological behavior during outages, and uses complexity reduction and intuitive visualizations to guide the user towards actionable insights about the root cause of the outages.

Bio

Rob Pike is a Principal Engineer at Google. Recently he has been helping develop the Go language. Before that Rob was at the CS Research Center at Bell Labs where he worked on computer graphics, languages, concurrent programming, etc and was the architect of the Plan 9 and Inferno OSes and is the co-author with Brian Kernighan of The Unix Programming Environment and The Practice of Programming.

About the conference

The target audience for GOTO conferences are software developers, IT architects and project managers. GOTO Aarhus is an annual event in Denmark. The idea for GOTO (formerly known as JAOO) came about as the management at Trifork was dissatisfied with the conferences that existed and wanted to create a forum that would give the development staff inspiration, energy and desire to learn coupled with the chance to network with the IT community.

Let me answer this by saying why we did it, which is slightly different. I have given a talk in a programming language series at Google, which is actually on Youtube, talking about an early language I did called Newsqueak, which was much earlier, it was in the eighties. And in giving that talk I started to think a little bit about why some of the ideas in Newsqueak were not available to me now in the environment I was working with which is primarily C++. And also at Google we tend to build very large programs, they take a long time to build, and they have issues with dependency management, binaries get big because they are pulling in things they don't need, link times get big, compilation times get big, and also there is a sort of old fashioned way that C++, because really down at the bottom it is really C, which is at this point a thirty years old language, forty years old language, and there is a lot of new things to think about in computing with the hardware that is coming around, multi core machines, networking, distributed systems, cloud computing, all of this stuff.

It just seemed like there was enough things that we could think about that it was worth looking at making a new language to help us write the kind of code we write at Google. And so a couple of us started talking about whether it was worth trying to do this and we chatted for a while and Ken Thompson, Robert Griesemer and I said "Yes, there is actually a reason to do this, we should probably try it". And so that's how we got started.

The features that most people react to when they first see it is that it's got concurrency as a language primitive, which is very nice and very important to us we're dealing with distributed computing and multi core stuff. I think a lot of people when they first look at Go think that it's a very simple uninteresting language there's not much going on, because it looks like the ideas are fairly straight forward. In fact it's actually a much more unusual language than it really looks like at first blush. And people who've used it many of them discover that it is actually a very expressive and effective productive language which actually addresses the problems we wrote the language to address.

It compiles extremely quickly, binaries are good sized, dependency management is handled by the way the language deals with stuff, there's a story there but it's not, maybe we don't want to talk about that, but the available concurrency in the language right at the front makes it possible to do really simple patterns for fairly complicated operations and distributed computing environments. And so I think the big deal is probably the concurrency and then we can talk some about the type system, which is very different from the more traditional object oriented type systems that languages like C++ and Java have.

There are two things that make it fast: first of all there are two compilers, two independent implementations. One of them is essentially a fresh compiler written in the Plan 9 style I would say, which has its own unusual way of working, but it's in a whole new compiler, and then there is a GCC front-end called GCC Go which was written by Ian Taylor later. So there are two compilers, and they share one property to go fast, but the Plan 9-like compiler is about five times faster just because it's a whole new compiler written from scratch, it doesn't have all the GCC backend stuff which spends a great deal of time generating really really good code.

The GCC one generates better code, but much slower. But the really important point is that the dependency management issue makes it possible to compile really quickly. If you look at a C or C++ program, it has header files to describe libraries, object code, things like that. And there is no dependency forced on you by the language, you basically have to parse code to figure out what your libraries look like every time. And if you want to compile using another class in the C++ program, you first have to compile its dependencies and its header files and so on and so on. If you compile a C++ program that has many classes in it that are inter-related, you can compile the same header file many many hundreds of times, even thousands of times. And obviously you can use pre-compiled headers and tricks to get around it.

But the language isn't helping you, the tools can fight to make it better. And then the biggest problem is there is no guarantee that the things you are compiling are actually needed for the program. It might be a header file you are including you don't actually need. But there is no way to know that because the language doesn't enforce it. So Go has a much stricter model of dependency. There are these things called packages which you can think of as being equivalent to a Java class file or something like that, or a library in the old school, it's a little different from both of them but the same basic idea. But the key point is that if this guy depends on this guy, and this guy depends on this guy, so A depends on B, B depends on C, then the first thing you have to do is you have to compile the inner most dependencies: you compile C, then you can compile B and then you can compile A.

But what happens is if A depends on B but A does not directly depend on C, there is a transitive dependency, then what happens is all of the information that B needs from C gets put into the object code for B. So when I then compile A, I don't have to look at C anymore. It is a very simple thing: you just pull the type information up the tree as you compile, so that when you get to the top of the tree, you are only compiling your immediate dependencies and nothing else. And if you do the arithmetic you'll find that in Objective-C or C++ or languages like that, you can compile hundreds of thousands of lines, just to include a simple header file because of all the transitive dependencies. Whereas in Go you will open one file and it might have only twenty lines because it only expresses the public interface there.

It doesn't sound like a big deal with only three files in a chain like that, but if you have tens of thousands of files, it's actually an exponentially faster effect. And we believe that we should be able to compile millions of lines of code, in just a few seconds with the Go tools, because of this dependency management problem, whereas millions of lines of an equivalent program written in C++ there's just so much more overhead to put a program together, that it would take many minutes to build the same equivalent program. So it's about dependency management, it's really where the speed comes from.

A Type in Go is analogous to a type in any other traditional programming language. You can have integers, strings, data structures called structs, there are arrays, there are these things called slices, which are more like C arrays, [but they're] a little easier to use, arrays are a little more rigid, and then you can declare your local types and name them, and use them the way you normally would. What's different about Go from an objected oriented perspective is that types are just a way of writing down data. Methods are a completely independent idea, so you can put a method on a struct there is no concept of class in Go, instead there is a structure and then you say here are some methods for that structure.

They are not conflated to make a class. But you can also put methods on an array, or methods on an integer or methods on a float, or methods on a string, any type at all can have methods attached to it. So there is this very much more general idea of methods, than there is in, say, Java where methods are part of a class, and that's all there is to it. So for instance you can have methods on an integer, which doesn't sound useful, until you realize you can do things like make pretty-printing constants like days of the week that can print themselves by having a to_string method attached to the integer constant named Tuesday. Or the ability to reformat strings by having them print themselves in different ways, all kinds of methods or just really nice things, why stick them to classes, why not have them work on anything at all?

No, the thing about methods and packages is you are only allowed to define a method for a type that you implement in that package. I can't pull in your type and put my own methods directly on it, but using a thing called anonymous fields we can wrap it in a different way, but a method is not something that you can add to anything you like, you define a type and then you can put methods on it. And that is because you want a sort of encapsulation in packages and we can talk about interfaces but interfaces become very hard to understand if you don't have these rigid boundaries about who is allowed to put a method on an object.

You typedef the integer, give it a name, call it "Day" if you are doing days of the week or something but you are not allowed to put a method on int, you are allowed to put a method on a type you declare, which could have another name. And that's because all of the methods are you didn't define integer. The int type is not yours, it's not in your package, it's imported but it's not defined by your package, and that means you simply can't put, you are not allowed to add methods to something you didn't actually write.

It's safe and controlled and very easy to understand. We thought it would be inconvenient, we wanted to be able to add methods post facto like that, but it turned out to make interfaces very hard to understand. So we just took them out, we actually never put them in, we couldn't figure how to do it so methods are only on local types. But then it's very easy to understand what they mean and how to use them.

It's called "type", yes, you say type, the name you want like "type Day int" and then you have a new type which you can put methods on and declare variables with that type, and that type is not the same as int, it's not like in C where it's just another name for the same thing, you've actually made a new type, that is not the same as int, and it's called "Day" and it has the structural properties of int but it may have it's own method set.

The smallest type would be bool probably, it's bool, int and float and then there are size things like int32, float64, there is string, complex, I may be missing something but that' the basic set. And then from those you can build structures, arrays, maps, map is a build-in thing in Go, it's not a library. And then we are going to get to interfaces at some point I suspect, which is where the really interesting stuff starts to happen.

Int is a value type. Everything is a value in Go, like in C, everything is called by value and values, but you can have pointers. And so if you want to have a reference to something you take its address and now you have a pointer. And Go has pointers but they are much more restrictive than C pointers and they are actually safe, because they are type safe, you can't cheat and there is no pointer arithmetic so if you have a pointer to something you can't move it outside the object and cheat.

Yes, they are closer to references but you can write through them just the way you expect. And you can take the address of something inside the middle of a structure like a buffer, it's not like in Java where you have to allocate a buffer on the side, it's an extra allocation, you actually allocate the object as part of the structure in one block of memory which is important to performance for a lot of reasons.

Yes. It can be if it's a value as opposed to a pointer to a value. You can obviously put a pointer inside the struct and then it's outside, but if you have a struct A and you put a struct B inside then it's just one piece of memory, which is not how Java works and one of the reasons Java has performance issues sometimes with memory.

Interfaces in Go are really, really simple. An interface means two different things: it means the the idea of the type, an interface type is a type that lists a set of methods, so if you have a behavior that is defined by a set of methods that you want to abstract, you define an interface and write down those methods. And now you have a type, let's call it the interface type, and now any type in the system including basic types, or structures or maps or whatever that have those methods implemented, then satisfies that interface implicitly. And what's really interesting, the thing that is very different about interfaces compared to most languages, is that there is nothing like an 'implements' declaration.

You don't have to say "My object implements this interface" you just define the methods and it implements it automatically. And some people are really nervous about that, because I think they want to declare, it's really important that we know that I am implementing. And there are some tricks you can do if you really want to be sure that you really do implement that. But the idea is quite the opposite, the idea is you shouldn't have to think about it, you should be able to just write stuff down and because you don't have to decide ahead of time that you are implementing an interface, it may turn out that later you are implementing an interface that you didn't even know about because it hadn't been designed yet, but now you are implementing it.

You might find later that two classes that you didn't think were related – I used the word class, I think about Java too much – two structs that implement related methods in some small subset are actually useful, it's useful to write a function that can operate on either one of them. So you just declare an interface and then off you go. And that works even if those things are in someone else's code, you are not allowed to edit, whereas if they were Java things, they would have to say they are implementing your interface, in some sense the implements idea only goes one direction, whereas in Go because it's implicit it can go both ways. And so there are really some beautiful simple examples.

The one we use all the time is a really common example is a thing called "Reader" so there is an IO package, and IO has a Reader inside and Reader is the interface with a single method, which is the standard signature for read, like reading from the operating system from a file. And that interface is implemented by anything in the system that does a read system call with the standard signature. So obviously files, but also networks, buffers, decompressors, decryptors, pipes, anything that wants to access the data, can provide a Reader interface to its data and then anyone who wants to read from that thing can make it flow and it's a little bit like some of the ideas we were talking earlier about Plan 9 but generalized in a different way.

And similarly there is a Writer and to pick an example that is really easy to understand, Writer is implemented by everybody who does write and then formatted printing, the first argument to fprintf is not a file but a Writer. And so therefore fprintf can do formatted IO to anything that implements write. And there is lots of nice examples: HTTP if you are implementing an HTTP server, you can just do fprintf to the connection to deliver data to the client without any fancy gymnastics. You can write through compressors, you can write through all the things I mentioned go the other way. Compressors, encryptors, buffers, network connections, pipes, files, you can just do fprintf directly to them because they all implement the write method and therefore implicitly satisfy the writer interface.

It's a bit like structural typing except it's behavioral, it's totally abstract, it's not what they have, it's what they do. You have structs, you write down what the memory looks like and then the methods say what the behaviors are and then interfaces abstract those methods to anything else that does the same methods, it's sort of like duck typing, but just behavioral, not structural.

A struct with methods is just like a class. The interesting difference is that Go does not have sub-type inheritance, there is no sub-classing and you just have to learn to write differently in Go, there is stuff of comparable power I think even more expressiveness, but Java programmers and C++ programmers who come to Go at first they are a little surprised because they try to write Java programs in Go and they don't work very well, you can make it happen but it's kind of clumsy. But if you just take a step back and say "How would I write this thing in Go?" you find there is different patterns you might call and you can express similar ideas but often much shorter programs because you don't have to reproduce the behavior by doing all the sub-classing stuff. It's a very different environment as I said it's much more different from what it looks like when you first look at it.

There is a concept called an anonymous field, also colloquially called embedding. The idea here is if you have a struct and you have something that implements a behavior you want, you can embed this thing inside your struct and when you do that it gets not only the data but also all the methods. So if you have some common behavior, like you have a name method for some type, in Java you would think it was a bunch of sub-classes, you take this nameable thing and then you put it in all of the implementations that you want, and they all get the name method for free, you don't have to write all that out. It's a very simple example but there are a lot of interesting structural stuff you can do with embedded things.

Plus you can embed multiple things inside a single struct so you might think of that as multiple inheritance but that's much more confusing than what really goes on in Go it's just very simple, it's just a set where you add up all of the, basically union all the methods, and you can have multiple behaviors defined for just one line of code for each set of methods.

Name clashes are resolved statically and it's actually trivial, the rule is that if at the same level of this you got embedding, you can imagine embedding, embedding, embedding, there's a level idea, the highest level name wins, if there is two names, identical names, identical method at the same level, that's simply a static error, you don't do that and that just takes care of it, it's a very simple rule and statically checked and in practice it just doesn't seem to come up much.

One of the interesting ideas about interfaces is that they are just sets, sets of methods, so there's an empty set, interface with no methods in it, that's called the empty interface. And the empty interface is satisfied by everything in the system. So it's somehow analogous to Java Object, except that it's also satisfied by int and float and string, it doesn't have to be an actual class, because there are no classes, but it's totally uniform across everything, and so it's a little like void*, but again in void* it only works for pointers but it doesn't work for values.

But an empty interface value can represent anything at all in the entire system, so it's also very general. So, if you made an array of empty interfaces you could essentially have a polymorphic bag of stuff and then if you want to take it out again, there is a type switch so you can ask what is inside as you unpack it, so you can unpack it safely.

Coroutines and Goroutines are different that's why they have different names. We made up a new name because there is a lot of terminology around processes, threads, lightweight processes, chords, there is a million different names for these things, and Goroutines - they are not totally novel I think the same ideas have been done in some other systems. But the idea is different enough from those names I've used that we wanted to have our own name for them. And the idea behind a Goroutine is it's a coroutine but if it blocks then it may get moved to another – well other coroutines on the same thread will get moved so they don't block.

So, essentially Goroutines are a bunch of coroutines that get multiplexed onto sufficient operating threads that none of them is ever blocked by another coroutine. So if they are all just cooperating they may never need more than one thread. But if there is a lot of IO going on, a lot of operating system action you may have many many threads. But Goroutines are also very cheap you can have hundreds of thousands of them, totally fine and with a reasonable amount of memory and they are cheap to create, they are garbage collected it's all very simple.

Yes, once you have two independently executing functions or whatever, Goroutines, they need to be able to talk to one another if it's going to coordinate. And so there is this concept of a channel, which is a typed message queue in effect, that you can use to send values, if you are holding one end of the channel in one Goroutine you can send a typed value to something at the other end and it will get it when it wants. There are synchronous channels and also they can be asynchronous and we prefer to use synchronous ones when possible because there is a nice idea where you synchronize and communicate at the same time, everything runs in nice lockstep.

But sometimes it makes sense to buffer them for efficiency reasons or scheduling reasons, so you can also buffer them. So you can send channels of integers, you can use strings, structures, pointers to structures, whatever and also very interestingly you can send a channel on a channel. So I can send you the ability to communicate with someone else, which is an interesting idea.

No, synchronous is unbuffered; asynchronous and buffered mean the same thing, because if there is a buffer and there is room in the buffer I can just put the value in and keep going. But if there is no buffer then I have to wait for someone to take the value, and so unbuffered and synchronous mean the same thing.

Right, so what Goroutines do is that when they are created, they have a typically small stack, I think it's 4 K, probably a bit smaller, and it's just in the heap and it's just a thing, and of course what would happen if you had a small stack like that in C you would immediately overflow when you call functions or allocate an array or something. In Go instead what happens is there is a couple of instructions on the beginning of every function that checks whether the stack pointer has reached a limit and if it has, it chains on another block and that's called a segmented stack so if you use more stack than you first started with then you get this chained string of stack blocks which we call a segmented stack.

It's actually very cheap it's only a couple of instructions. Obviously you have to allocate the stack blocks but in Go the compiler trends to move big things on to the heap, so you actually have to call quite a few functions in typical usage before you hit even a 4 K boundary and so it doesn't happen very often but the really important points are: they are very cheap to create because there is just one allocation, it's a very small allocation, you don't have to specify how big the stack is when you create a new Goroutine which is just a nice abstraction not to have to worry about that.

Then they grow and shrink as needed and so you don't have to worry about recursion, you don't have to worry about big buffers or any of that stuff it's just completely invisible to the programmer, the language simply takes care of it which is the whole idea of having a language.

My take on this is it's a very hard problem, and we haven't solved it yet, we have a working garbage collector but it has some latency issues, so things can pause, but our take is that we believe although it's a research problem, it's not solved but we are working on it. With these concurrent machines that we have today, it's possible to collect in parallel with the execution of your program by dedicating some fractions of the cores to the garbage collection problem just as a background task in effect. There has been a lot of work in this area and a lot of success in it but it's a very subtle problem and I don't believe we will ever get latency to be zero but I believe we can get latency low enough so it won't be an issue for the vast majority of system software. I don't guarantee that every program would never notice any latency but I think we can get there and it's definitely an active area of work in the language.

Go lets you talk about the layout of memory, you can allocate your own arena, you can do your own memory management, if you want, there is no alloc and free but you can declare a buffer and lay things out in it, the trick is to avoid generating garbage unnecessarily. It's just like in C. In C if you call malloc and free all the time, it's expensive. So you allocate an array of objects and link them together, make a linked list, manage your own arena and now you don't have malloc and free and it's much faster. You can do exactly the same kind of things in Go because Go gives you the ability to talk about low level things safely, so you don't have to cheat the type system to do this, you can actually do it yourself.

I made this point earlier, in Java whenever you have nested things inside your structure they are done by pointers, in Go you can lay it all out in a single thing if you want. So if you have some data structure that needs a couple of buffers, you can put the buffers in the memory of your structure and that means not only is it more efficient because you don't have to go indirect to get to the buffer but it also means that a single thing can be allocated and garbage collected in one step. And that is just less overhead. So, if you think about the fact that it's garbage collected, and when you are designing a performance sensitive thing, you shouldn't think about it all the time.

But if you are performance sensitive, think about the layout of memory, Go gives you the tools to actually control how much memory and garbage get generated while still being a truly garbage collected language. I think that is a pretty important insight a lot of people tend to miss.

We designed it to be a systems level language because the problems we do at Google are systems level, right? Web servers and database systems, and storage systems and those are systems. Not operating systems, I don't know that Go would be a good operating system language but I am not sure it wouldn't be, what was interesting was because of the approach we took in the design of the language, somewhat to our surprise it turned out to be a really nice general purpose language. And so I think most of our users are not actually thinking about it from a systems point of view although a lot of them are doing little web servers and things like that.

There are lots of much more application-like things that Go is really nice at, and it will get better as the libraries improve and more and more toolkits and things like that developed to make Go useful, but it's a very nice general purpose language and it's the most productive language I have ever worked in.

Allocated/Controlled Memory vs Garbage Collection

Your message is awaiting moderation. Thank you for participating in the discussion.

I thought his comments at the end on garbage collection were very interesting--Go's ability to allocate memory blocks to minimize indirection, object creation, and garbage collection.

Of course, as he stated, once you make you're own area, you have to manage it yourself.

Now if only someone would make the jump to nested garbage collectors--the ability to allocate an area on the stack, then designate additional garbage collectors to utilize inside that area, even as the general garbage collector is operating heap-wide. That way, I could control memory very specifically, but still have garbage collection operating all the way down.

This would generalize and make selectable various memory schemes, like Java's, Go's more customizable heap/stack, and Erlang's segmented/shared-nothing heaps.

Re: Allocated/Controlled Memory vs Garbage Collection

Your message is awaiting moderation. Thank you for participating in the discussion.

I assume that Go's collector does compaction (i.e. physically moves memory blocks around to defragment memory) in which case a nested collector will be hard to do.

Azul's collector is pauseless. Apparently, in Azul, the reference to an object is intercepted and corrected if/when needed. If an object has been moved and a reference to it has not yet been updated, next time that reference is accessed, the system will check and correct the reference.

In essence objects can be moved and references left 'broken' till a reference is actually used - to achieve pauseless GC. Not sure if such an approach will work for Go.

Re: Allocated/Controlled Memory vs Garbage Collection

Your message is awaiting moderation. Thank you for participating in the discussion.

Absolutely, it reminds me of a similar thing I implemented years ago in C which was basically a linked-list of blocks of memory that I wanted to manage manually. The linked-list structure helped me walk up or down the memory blocks and allocate/re-allocate or de-allocate. That was quite good fun, I must dig up that code..