Stuart Halloway on Datomic, Clojure, Reducers

Bio

Stuart Halloway is a co-founder of Relevance, Inc. Stuart is the author of Programming Clojure, Component Development for the Java Platform, and Rails for Java Developers.
Stuart regularly speaks at industry events including the No Fluff, Just Stuff Java Symposiums, the Pragmatic Studio, RubyConf, and RailsConf.

About the conference

GOTO Copenhagen is produced jointly by Trifork and DANSK IT and is a vendor-independent conference where the program is put together by a Program Committee. The target audience for GOTO conferences is software developers, IT architects and project managers.

I don't have any idea how I was chosen for the Iconoclast track; someone reached out to me and said, “We have this Iconoclast track; you should come and say something provocative or offensive,” and I said, “Anybody who knows me knows that I never say anything provocative or offensive, so it would be an interesting experiment for me to try to do that; so what the heck, I'll give it a shot.”

There's this idea that has a lot of currency that there is an impedance mismatch between relational databases and object orientation, and I think certainly there's pain and suffering; if you look at people who have tried to bridge that with ORM, you'll see all kinds of awkwardness that ensues. But I think that the problem is really more fundamental than that, and that if you can unask some decisions and assumptions that have been made, both on the relational side and the object side, there actually does not have to be any impedance; but it requires a pretty significant revisiting of a lot of different areas to get there.

So particularly after the brilliant keynote that Rich just gave, a great one-sentence summary is, what happens if you program with values in your database?

In Rich's talk this morning, he enumerated a lot of different benefits of programming with values versus programming with places, with “I've got to go to this location and do a getter or a setter,” and I would definitely encourage people to watch that talk in its entirety; but the place where it hit home for me, and this is not an argument that I had heard him make before, was when he started talking about the tools that developers give consumers and the tools that developers give themselves.

So developers give everybody else these update-in-place systems, and then when we have to manage our own stuff, we use git, where we have complete audit history and timestamping and we're programming with values; the moment in time in the repository is a value. When we want to troubleshoot our own systems, we look at logs, which are cumulative; they grow over time and they're all timestamped. But as he pointed out, log analysis is kind of a bummer because you're sort of backed into needing to do this, and maybe we could put some of that information in more amenable data structures to begin with.

But it's a plain fact that when we care about our data, we treat it in a cumulative fashion and we timestamp it. Datomic works that way.

So one of the fun things about Datomic is that it explodes the traditional notion of what the database is; a lot of times you'll make some observation about the database, and with Datomic you end up needing to make three observations, because the three things that are normally shipped together as one cohesive unit are split apart. At the bottom you have a storage service: this is something that can, with varying degrees of quality of service, keep stuff; the minimum requirement is keeping it, but typically you also want redundancy, certain guarantees on latency, whatever.

Then you have the thing that makes ACID happen - the Transactor. The job of the Transactor is to organize transactions, to make sure that the rules are followed, to ensure ACID, and also, in the background, to build indexes; that guy doesn't do any storage, so he's out of the storage business. And then you have query and read access to data, which actually happens in your application; the query engine is a library that you install inside your JVM app instead of being a remote procedure call to a place. That separation into these three pieces, in particular having query in your own application, requires programming with values, because programming with values is essentially saying, “Anything that I have a hold of, I can cache forever,” because it's a value; it's timestamped.

And if I want to go back and say, “Well, I really want to know what happened in the last five minutes,” I can always go back and ask again; but I can sit there and party with that stable value in my queries, which is something that's very difficult to do with an update-in-place system. In fact, most developers get it wrong; think of the number of times you've looked at an application where somebody reads something out of the database and a few seconds later reads something else out of the database, and presumes that those two reads are based on the same point in time.

And the worst thing about that is, of course, that system can seem to work 99.9% of the time until you get lucky.
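Editor's note: the difference between reading from a place and holding a value can be sketched in a few lines of Python. This is only an illustration, not Datomic's API; the `CumulativeLog` class and its `record` and `value_as_of` methods are hypothetical names invented for this sketch.

```python
from types import MappingProxyType


class CumulativeLog:
    """Toy cumulative store (hypothetical): entries are only ever appended."""

    def __init__(self):
        self._entries = []  # list of (t, key, value) tuples

    def record(self, key, value):
        """Append a new timestamped entry and return its logical time."""
        t = len(self._entries) + 1
        self._entries.append((t, key, value))
        return t

    def value_as_of(self, t):
        """Return a read-only snapshot of the latest value per key at time t."""
        state = {}
        for entry_t, key, value in self._entries:
            if entry_t <= t:
                state[key] = value
        return MappingProxyType(state)


log = CumulativeLog()
t1 = log.record("john/balance", 100)
snapshot = log.value_as_of(t1)      # a value: safe to hold on to and cache

log.record("john/balance", 80)      # somebody else writes in the meantime

first_read = snapshot["john/balance"]
second_read = snapshot["john/balance"]        # same point in time as first_read
latest = log.value_as_of(2)["john/balance"]   # asking again sees the new entry
```

Both reads against `snapshot` agree no matter what is written afterwards, which is the guarantee an update-in-place store cannot give across two separate reads.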

So in all of the different deployment scenarios for Datomic, it uses persistent data structures in the sense that's used in Clojure, right; these are immutable data structures where updates don't require going back and copying the entire thing, so it's efficient to create copies with modifications of these data structures.

And then having the persistence layer separate means that you can choose different characteristics or qualities there. Right now, on the high end, you could choose DynamoDB and have pretty much arbitrary reliability and very, very good latency, and you can actually just turn a knob in DynamoDB and say, “I want to be able to do this many writes per second and that many reads per second,” which is a beautiful thing; I just turn that up. If you've got a problem that requires a big solution, you can keep turning that knob until you're there.
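Editor's note: a minimal Python sketch of what "persistent" means here. It uses a plain binary search tree with path copying, chosen only because it is short; it is an assumption made for illustration and not the structure Clojure or Datomic actually use. `Node`, `insert` and `to_list` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node; a tree version is just a reference to its root."""
    key: int
    left: Optional["Node"]
    right: Optional["Node"]


def insert(node, key):
    """Return a new tree containing key; the old tree is left untouched."""
    if node is None:
        return Node(key, None, None)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key))
    return node  # key already present: reuse the existing subtree


def to_list(node):
    """In-order traversal, used here only to inspect a version."""
    if node is None:
        return []
    return to_list(node.left) + [node.key] + to_list(node.right)


v1 = None
for k in (50, 30, 70, 20, 40):
    v1 = insert(v1, k)

v2 = insert(v1, 80)  # copies only the nodes on the path 50 -> 70
```

After the update, `v1` still holds the old contents, and `v2.left is v1.left` is true: the untouched left subtree is shared between the two versions rather than copied.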

For people that are not ready for the cloud yet, you want something else; and so you can also target traditional relational databases and use them as your storage.

Now, when we do that, and in fact in all these cases, Datomic treats its storage engine the way a traditional database treats the file system; it doesn't write individual things into little cubbyholes, things are chunked up into blocks. So if you look at Datomic's usage of Postgres, if you use that as your backing store, what you see in the backing store is, you know, 64k chunks of index that are then retrieved as necessary.

And then one of the beautiful things about this model is that it scales all the way down to nothing; you can have a pure in-memory Datomic. In fact, the library that lets you do query and lets you submit transactions to the Transactor can instantiate the entire programming model in memory. First of all, that's going to be useful in some production scenarios where I have state I don't really need to keep around and I don't really need to go through all this other stuff, but it's also fantastic for testing.

A lot of times people go through a lot of complicated gyrations to get something that's kind of like their database but not, and in particular not so expensive to set up; with this thing - boom, you have your database and you can use it. It doesn't have any semantic differences, though obviously it has timing differences, so I would not recommend testing your whole system this way, but it is a really beautiful story for micro-level testing using the memory storage.

One of the problems with the impedance mismatch is that it is a blanket term that encapsulates a bunch of different problems, and you can't solve those problems incrementally. Say we enumerate them, and we'll do this in a minute, and there are six or seven different problems, and you say, “I'm going to solve three or four of them today and I've got some solution that does that, and then if I have to, I'll come back and think about five and six.” You can never get to the end that way; you find yourself in some sort of local maximum where you've solved those four problems but you haven't considered the other ones at all, and you can't get there.

One of the problems that people suffer with is the rectangularness of the relational world; over here, in the relational world, things are rectangles, and over here, things are really, we say objects, but let's just be realistic, they're dictionaries. Dictionaries might have methods or data or whatever, but they're dictionaries; so you have that distinction. The bigger one, though, is that your objects are over here in your program and those rectangles are over there; so when you look at what ORM systems do, there's a ton of gyration around bundling of operations because they have to happen over there.

And a lot of times you see that couched in terms of round-trips, the n+1 problem or those kinds of things; well, I don't want to make 17 trips, you know, over there to do something. But actually the bigger problem is that on every round-trip, that guy may have changed; so even if the round-trips were free, going to a place, saying, “I've got to go to the place to do this operation,” is problematic. And so you end up trying to bundle things up and say, “Well, I've got to get all my work done.”

An obvious example is some sort of update where I want my update to be based on the most recent value; that really needs to happen over there. But a less obvious example is, “I'm doing some sort of complex query and then some reporting based on that query, and I have some decision making that I want to do in the middle of the query that I maybe can't express effectively in my query language.”

So I do bundle up the first half of my query and I do this kind of complicated SQL and I send it over there; then I get back this answer and then based on that I want to finish my query but say, “Oh phooey,” when I go back over there, that stuff may not even be there anymore. So what do I do?

And so there's all this packaging, overloading more and more work into each trip, and the fundamental problem there is that I have to go over there not just to transact but to read; just to perceive information, I have to go to the place.

And so by programming with values, that totally falls away. If I'm over here, I'm looking at a value of the database at a point in time; if I look at it again in a second, it hasn't gone anywhere; it can't. I don't want to say append-only, I'll say cumulative - append-only has suggestions about implementation details which are troublesome - but the database is cumulative. There may be new stuff over here, and if I want to find out about it, I can go there.

Now, once you've said you're going to program with values, regardless of what values you use, they could be documents or rectangles or whatever, they can be over here. Once you're programming with values, you have the question of what data structure is going to have no impedance and have good performance for the different kinds of things that you want to do; there really is a generic, universal answer for this and we call it the datom: it's entity, attribute, value, time. John likes pizza, as of this point in time.

We've already talked about needing that point in time; that's how we get a value. But if you talk about the other pieces, if you go smaller than “John likes pizza,” you don't have a fact. If I say, “likes pizza,” you don't know anything; and we do a lot of work that way, that's called the key/value store.

And the semantics then actually have to be built in your application; I have to know that this key/value store is the thing that keeps track of the different things people like and this other one keeps track of whatever it keeps track of. So key/value is kind of too small, and all of the other traditional approaches are too big. Rectangles are too big, we already know this, right? It's terrible if you have sparse data, and it makes multi-arity relationships a disaster.

I want to say that people can belong to more than one club, and to say that in a relational database is like, “Well, I've got to know the ID in the person table and the ID in the club table and the name of the person table and the name of the club table,” and you sort of bundle all this together; and the typical solution for that is what people now call convention over configuration.

So you could take the ActiveRecord approach and say, “I'll make a bunch of assumptions,” but that's not getting rid of the underlying complexity; it's just putting a package around it.

So tables are too big, and documents are appallingly too big; the worst thing that you could ever do to data is to make documents the unit of storage. Documents are a wonderful unit of report; right at the end of a query, I produce a document: here's the thing that you asked for. But making that the unit of storage is like saying, “I'm going to be as far from agile as possible and be really, really good at giving you this answer, but not good at all at organizing information in any other way.”

So the other thing I should say, though, is that over here, in the object world, we're used to having dictionaries; there is a mechanical transformation between Datomic datoms and dictionaries, right? All you have to do is go and grab all of the EAVs that have the same E, all the guys that have the same entity, and say, “Boom!” That is a map-like thing and it looks like a dictionary.

So you can programmatically use logic programming; you can use logic in your queries and work with these things at that raw level if you want; or, having gotten back an entity, you can then say, “I want to treat this like a map,” and “I want to treat it like a lazy map,” so if I haven't gotten all the pieces yet, that's fine; if I decide five minutes from now that I want the rest of it, that's fine. You could be in your JSP page, God help you, pulling new things, because you know nothing has changed; you know you're not actually going back to the transactional system, you're just acquiring additional details.
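Editor's note: the mechanical transformation from entity/attribute/value/time facts to a dictionary can be sketched as follows. The `Datom` tuple, the sample data and the `entity` helper are assumptions made for illustration, and the sketch assumes each attribute holds a single value per entity.

```python
from collections import namedtuple

# Hypothetical representation of a fact: entity, attribute, value, time.
Datom = namedtuple("Datom", ["e", "a", "v", "t"])

datoms = [
    Datom("john", "likes", "pizza", 1),
    Datom("john", "email", "john@example.com", 1),
    Datom("mary", "likes", "sushi", 2),
]


def entity(datoms, e):
    """Grab every datom that shares entity e and present it as a dictionary."""
    return {d.a: d.v for d in datoms if d.e == e}


john = entity(datoms, "john")
```

Here `john` is an ordinary dictionary with the keys `likes` and `email`, built without any mapping layer in between.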

The other thing that that does is it undermines the problem that people have with those complex queries because it separates out finding the entities from producing the details of the report; you can write a query that says, “Give me all the entities,” and then at your leisure, you can say, “Now, it turns out that I need these five fields about this guy and these seven fields about that guy,” that's all just local getter type access in your own process.

That's right, Datalog has a long track record; it was developed like Prolog but to be a query language for databases. There are some things about Prolog that make it not the best choice, or maybe not the first choice, I should say.

There are a couple of things. Prolog is tuple-at-a-time, and if you're coming from a relational database, you're used to set operations, right? I don't want to think mechanically about how these things are processed. Prolog has ordering dependencies, so getting decent performance often requires juggling your clause order and really understanding the internal details of the Prolog evaluation model.

Datalog was specifically designed to address these things: it's not Turing complete; it's guaranteed to terminate; clause order doesn't matter; and it has expressive power equivalent to the relational model plus recursion. So when you say, “I can't be totally in love with SQL but I really do like that relational model,” you can keep all of that power and use Datalog to get there.

Right, so a couple of things there: for any kind of query-side capability, because the query runs in your process, you don't really need any special support for it; so it's a whole new world. Any usage for which you would say, “I want to have some sort of view on my data,” you can do all that in your own process with Datalog rules, which is another nice thing. I should just sidetrack for a moment: rules in Datalog are the moral equivalent of views in SQL, except that when you program with values, you get to run them in your own process. So if you want to have a new view on your data, you don't have to go to the DB admin gods and say, “Can we install this on the one precious production box where, if we make it fall over, it's the end of the world?” You can say, “I'm going to run this over here on my application's box; if I shoot myself in the foot, I have a foot injury.”
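Editor's note: "the relational model plus recursion" and "guaranteed to terminate" can be made concrete with a naive bottom-up evaluation of one recursive rule, sketched here in Python. This is not how Datomic's query engine is implemented; the `ancestors` function and the sample facts are assumptions chosen for illustration.

```python
def ancestors(parent_facts):
    """Naive bottom-up evaluation of the two Datalog-style rules:

        ancestor(X, Y) :- parent(X, Y).
        ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    """
    known = set(parent_facts)  # first rule: every parent is an ancestor
    while True:
        # second rule: join parent(X, Y) with the ancestor(Y, Z) facts so far
        derived = {
            (x, z)
            for (x, y) in parent_facts
            for (y2, z) in known
            if y == y2
        }
        new = derived - known
        if not new:
            # Fixpoint: the set of possible pairs is finite, so the loop ends.
            return known
        known |= new


parents = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}
result = ancestors(parents)
```

The evaluation works on whole sets of facts rather than one tuple at a time, and the answer does not depend on the order in which the rules or facts are written.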

It's not quite the same old problem; where you especially want to have functions that run in the Transactor is when you have atomicity requirements. Raw facts look like assertions or retractions: “I assert that John likes pizza,” as of now; or, “I retract that John likes pizza, John no longer likes pizza.” You can build those up, and you can use an interface that's map-oriented instead of, you know, that raw; if you pass maps in, they just get expanded into that stuff, so you don't even have to worry about it at that level. But where you run into trouble is when you say, “John just put $10 more in his bank account.”

So you're sitting in your application and say, “What's John's balance right now?” “It's 100.” “Okay, so I want to put $10 in his account; I am going to assert that his balance is now 110.” So we send that flying over to the database. Meanwhile, somebody else comes in and deducts $20 from John's bank account balance, and now we have this inconsistency; we have these two updates that needed to be atomic, and I'm updating based on bad information.

And so you really need to have code that runs in the transaction portion of the system. What database functions in Datomic do is this: you write a function, right now in Java or Clojure, we'll add more languages in the future, and that function is then stored in the database; it's versioned, like all data in Datomic is, and you can then run that function inside of a transaction just by sending a piece of data naming it. Normally you would send a list of data saying, “The first item is assert; the second item is ‘John’,” you know, it's just a list. Now you send a list where the first item is Foobar, and the database sees that and says, “You said Foobar; I don't really know Foobar, all I know is assert and retract.” But it's the database, so it looks in itself and says, “Aha, Foobar is this code,” and it runs it.

One of the really cool things about this is that when you use a database function inside a transaction like this, and we call that case out and give it a separate name, transaction functions, those transaction functions are pure functions; they take as their first argument the value of the database as of now, and then as their second and further arguments whatever you sent in that list. So they have the ability to say, “What is John's balance as of this point in time?” because we're now inside the transaction, where we can say, “I know definitively that nobody else has gotten in and changed the balance.”

Again, the testing story is stunning, because you write these functions, you test them locally in your own process, and they're pure functions; so the test suite for them is these inputs and these outputs. No complicated set-up, tear-down, mocking, stubbing… just try it with a sampling of the possible inputs that makes you happy based on your level of paranoia, and then announce that you've tested that portion of the system.

Werner Schuster: That wasn't a part of the original Datomic; was it a recent addition?
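Editor's note: the shape of a transaction function, a pure function of the current database value plus the arguments sent in the list, can be sketched in Python. Everything named here (`deposit`, `ToyTransactor`, the tuple format) is hypothetical and greatly simplified; it is not Datomic's API, and this toy only handles assertions.

```python
def deposit(db, account, amount):
    """Pure transaction function: reads the db value, returns new assertions."""
    current = db[account]["balance"]
    return [("assert", account, "balance", current + amount)]


class ToyTransactor:
    """Applies transactions one at a time, so each sees the latest db value."""

    def __init__(self, db):
        self.db = db  # treated as an immutable value; never mutated in place
        self.functions = {"deposit": deposit}  # functions "stored in the database"

    def transact(self, tx):
        op, *args = tx
        if op in self.functions:
            # The first item names a stored function: expand it against db as of now.
            ops = self.functions[op](self.db, *args)
        else:
            ops = [tx]  # already a raw ("assert", e, a, v) tuple
        new_db = {e: dict(attrs) for e, attrs in self.db.items()}
        for _, e, a, v in ops:
            new_db.setdefault(e, {})[a] = v
        self.db = new_db
        return new_db


transactor = ToyTransactor({"john": {"balance": 100}})
transactor.transact(("deposit", "john", 10))   # one client adds $10
transactor.transact(("deposit", "john", -20))  # another client deducts $20
final_balance = transactor.db["john"]["balance"]
```

Because `deposit` is pure, it can be checked without a transactor at all: calling it with `{"john": {"balance": 100}}`, `"john"` and `10` returns a single assertion of the new balance, 110.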

Stuart: So we have had three or four announced releases and we actually do code drops more often than not just for smaller ... we plan to continue to do that.

There are several things in Datomic that have been part of the design for three years now that we haven't rolled out yet; we don't like to roll out designs until there's enough of an implementation for people to try them, because otherwise you're just talking.

Saying, “I have this plan that in the future, Datomic will store millions and millions of petabytes of data on Saturn's rings,” doesn't really prove anything; so we prefer to actually have code in hand.

So probably the thing that's most topical and timely right now is the new reducers library that just got added into Clojure master; there's actually a targetable Maven release called Clojure 1.5 Alpha 1 as of last Friday [Editor's note: interview was taped in late May 2012], so you can grab those bits and start using them in your program.

So the idea is, people are pretty familiar at this point with MapReduce, and if you look at MapReduce, you have this notion that you have an operation that you can split into chunks of work; you break up your operation and you spread it out; that's the mapping phase. For each one of those mapping phases, the interface people typically have in mind, historically, is that a map is a function of an item to an item; so mapping over a collection is walking that collection and doing something.

And so you have that kind of level of operation; and then having done all that work, spreading it out over a bunch of boxes, having done something you couldn't possibly have finished on one box in a sane amount of time perhaps, you bring it back and you have a reduce phase. The reducer's job is to combine, and the kind of result that it might produce is arbitrary; it typically is not going to produce a result that's as big as the input passed in, because if that stuff is so huge you can't fit it on one box, it's not clear how you would even do that, but also because you're typically trying to get some sort of rolled-up or aggregated summary of some job that has more elements in it.

There are a couple of problems with MapReduce as it is conceived, and I have probably steered people wrong by driving it into the distributed analogy; so step back and just think about using it in your own functional programming environment. MapReduce has built into it, in the function signatures, the notion that we're doing something with a collection and what that collection looks like: that it's a list, or that it's lazy, or what have you.

And I think Guy Steele did a great talk about this where he said, “Look, these interfaces basically commit you to a sequential approach; even if you're doing MapReduce, if you're using map with that signature, you're committed to a sequential operation on the individual sub-components.”

And so what the new reducers library does is say, “In short, let's not call that thing MapReduce, let's call it Reduce/Combine.” That first step now no longer has any order in it; actually, I should say it is processed in order, but it no longer has the notion of the collection in it; and the second step, the combine step, again pulls things together.

First off, in order to reconceive it this way, when you're composing functions, they no longer get to have the signatures they had before, and this is the part where my head starts to hurt a little bit. You start to think, “You know, before I had a function of a thing to a thing to do map”; now map is a transforming function of a function. So instead of working with the concrete thing, I'm working on building the function I will eventually run.

There are two things that fall out of that. Imagine that you have composed a piece of work from five maps on top of each other: map this, then map that, then map the other thing. If you build it out of maps, you have to realize all those subsequences as you go; you don't have to realize them in their entirety, but you have to call map from this thing to this thing, and then that thing has to be in some sort of collection that gets passed to the next map; it can be lazy, but you still have to realize all the pieces. If you instead conceive of map as a function of the function, then when you're doing this map, you're not actually doing anything; when you call map in the reducers framework, you're just building the guy that will do the work, so you're just building the function of a function. That guy actually does all the work, but he doesn't have any object argument in him until the end, and then at the very end, reduce pulls the string and does all the work.

When you do it that way, you don't have to build all those intermediate results, which is going to be a savings of memory and computation on large jobs; but the bigger thing is that Reduce/Combine is Fork/Join ready. If the collections themselves are amenable to it, then it can just automatically use Fork/Join; several of the things in Clojure have already been retrofitted this way to use Fork/Join, so that the work can be divided out with work stealing and take advantage of the capabilities there.

So if you use things like vectors and maps, then those, being persistent data structures, are fully ready to participate nicely in the system. If you're using your own data structures, the reducing framework can always degrade to doing it old school; if you have a collection type that has not been made aware of how to participate in this, then you can fall back to a more sequential style of processing.

The nice thing about it is that you can switch to the new signatures, and then your applications will just get faster if and when your collections get better.

You just bring in a different namespace, and all of these APIs have the same shape as the APIs that they replace in Clojure; they have different semantics, so they have different names. But if you know that you need these new semantics, you can actually do a search/replace and upgrade to the new names, and then all of a sudden you'll be taking advantage of this capability.

Now, the other thing you need to do is run out and buy yourself an 8-core or a 32-core or a 128-core box so you can really see the benefit; you're going to see modest benefit with two cores, and then from four you'll start to really see the difference.
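Editor's note: the idea of map as "a function of a function" can be sketched in Python. This is an illustration of the concept only, not the Clojure reducers API; `mapping` and `compose` are hypothetical helper names.

```python
from functools import reduce


def mapping(f):
    """Return a transformer that wraps a reducing function so f is applied first."""
    def transform(reducing_fn):
        return lambda acc, item: reducing_fn(acc, f(item))
    return transform


def compose(*transforms):
    """Compose transformers; the first one listed is applied to each item first."""
    def composed(reducing_fn):
        for t in reversed(transforms):
            reducing_fn = t(reducing_fn)
        return reducing_fn
    return composed


# Building the pipeline does no work and touches no collection.
pipeline = compose(mapping(lambda x: x * 2), mapping(lambda x: x + 1))


def add(acc, x):
    return acc + x


# Only reduce "pulls the string": each item is doubled, incremented and summed
# in a single pass, with no intermediate sequence between the two maps.
total = reduce(pipeline(add), [1, 2, 3], 0)
```

With the input `[1, 2, 3]`, the items become 3, 5 and 7 on their way into the sum, so `total` is 15.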

I'm okay with that as long as pipeline doesn't carry for you some implications about ordering. The individual operations on a single guy have to be ordered; if I said, “You know, multiply by two then add one,” I don't want to add one and then multiply by two. But the order of operations across the entire thing, I want to be able to do the work towards the end.

So all of the operations that you're using here need to be associative, so that you can combine these guys over here and then combine these guys over here and then pull the result together.

That's right, and so with this API, if you're working on trees, which you always are when you're using Clojure data structures, then you get these benefits automatically. As a convenience, if you pass lists into these APIs, they just work; they just don't get that performance boost.

So you should start using them right now. The naming convention for Clojure releases is that we call releases Alpha if we don't consider them feature complete for the next version, we call them Beta if we consider them feature complete, and we call it done when we call it done. There's an Alpha out there right now that has them in it; if you go and look at that Alpha, it doesn't have much of a code delta from the shipping 1.4 other than the addition of these reducers. So don't put your production system on it without testing, but test and see and then put your production system on it; I plan to run production systems on the 1.5 Alpha line; I have no problem with that.
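Editor's note: why associativity matters for Reduce/Combine can be sketched in Python. This hypothetical `fold` splits the input into chunks, reduces each chunk independently on a thread pool, and then combines the partial results. It stands in for the Fork/Join machinery only conceptually and makes no claim about speed; it assumes `init` is an identity value for `combine`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def fold(combine, reduce_fn, init, items, chunk_size=4):
    """Reduce each chunk independently, then combine the partial results.

    The answer matches a plain sequential reduce only when combine is
    associative and init is its identity value.
    """
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda c: reduce(reduce_fn, c, init), chunks))
    return reduce(combine, partials, init)


numbers = list(range(1, 11))
total = fold(operator.add, operator.add, 0, numbers)
```

Addition is associative, so grouping the work as (1..4), (5..8), (9..10) gives the same total, 55, as summing left to right. A non-associative operation such as subtraction would give a result that depends on where the chunks were cut.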

So the fancy feature in Clojure 1.4 is kind of a sleeper; it doesn't seem like that big of a deal. It's data literals, and the idea with data literals is that you can add literals that will be understood by the Reader, and those literals are composed out of smaller pieces, which are the parts that Clojure already understands.

And so what this means is that, theoretically, there could be other Readers that do the right thing in a different environment, and we're seeing a lot more Clojure environments right now: you see Clojure-Py, and Clojure running on the CLR, and ClojureScript; you may have some piece of data that you need to treat differently in those worlds.

Probably the bigger benefit there, though, is that the extensibility is a big deal. In my career, we've gone from binary frameworks and protocols to XML, which was extensible but kind of overly baroque and therefore disastrous, to JSON, which is spartan but not extensible and so also kind of disastrous. Of course, there have been some efforts to add extensibility to JSON, but there's nothing that has, in my mind, sufficient traction to say there's a real solution in that space.

And so the data literals in Clojure 1.4 are a nascent data serialization library that might not even have to be part of Clojure. This is an approach to serializing data that answers the common complaints if you want to build up from JSON, or the common complaints if you want to come down from the baroque monstrosity that is XML. Those complaints are like: I can't really treat everything as a String; I really need to know this thing is a URI and that thing is a Date; and I don't want to write application code that has to figure that out again and again forever and support that. That's very easy when you're building your very first web service endpoint, but it becomes tedious over time.

So I've been extremely excited about the potential of data structures in Clojure becoming a lingua franca even outside of Clojure; they do really solve a problem and fill a need, and it's difficult to get right and not do too much.

There have been a lot of attempts at serialization that did way, way too much. I was at one of the companies that was a signatory to the SOAP spec - I'm sorry - and it's easy to do way too much; but we sort of overreacted with JSON and did too little, and I think that the data literals hit a real sweet spot there.
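Editor's note: the idea of a tagged literal, a tag followed by a value built from parts the reader already understands, can be sketched in Python. `#inst` and `#uuid` are the tags Clojure 1.4 ships with; the `READERS` table, the regular expression and `read_tagged` are hypothetical and handle only the narrow case of a tag followed by one quoted string.

```python
import re
import uuid
from datetime import datetime

# Each environment can supply its own table mapping a tag to a constructor.
READERS = {
    "inst": datetime.fromisoformat,
    "uuid": uuid.UUID,
}

# A tag, whitespace, then a double-quoted string payload.
TAGGED = re.compile(r'#([\w/.-]+)\s+"([^"]*)"')


def read_tagged(text, readers=READERS):
    """Parse one tagged literal and build a typed value using the reader table."""
    match = TAGGED.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a tagged literal: {text!r}")
    tag, payload = match.groups()
    if tag not in readers:
        raise KeyError(f"no reader registered for tag #{tag}")
    return readers[tag](payload)


when = read_tagged('#inst "2012-05-25T10:30:00"')
ident = read_tagged('#uuid "f81d4fae-7dec-11d0-a765-00a0c91e6bf6"')
```

The application receives a `datetime` and a `UUID` rather than two strings it has to interpret again, and a different environment could register different constructors for the same tags.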

So there are a couple of things there. Clojure as a programming language targets being able to deliver the efficiency of wherever it's hosted, and that's not an obvious or a trivial goal. The idea is that if you're running Clojure on Java, you don't have to dip down to Java for performance anywhere, and there are many factors in the design that go into that. One of the biggest ones is unproxied access to things; when you're talking to your host platform from Clojure, there's not a proxy in between you; there's not this layer that tries to be smarter or make your life better or easier or whatever; you're talking to that thing.

It's the most frequent complaint from people new to Clojure, and they're wrong; it's actually the key feature that makes Clojure viable as a language in places that are trying to do things. I won't deny that it sucks when you're learning and you get stack traces and whatever; I think that problem needs to be solved, but it can't be solved by killing the goose here, and the goose really is not just running on the JVM, there are hundreds of languages that do that, but thinking about how to target that.

Since then Rich and the core team have developed ClojureScript, which targets JavaScript source code, but particularly JavaScript source code that can then be whole-program optimized by the Google Closure compiler. I apologize for their language also being named Closure; they were second and they should apologize to us and change their name [Editor's note to readers: Stuart might not be totally serious about this last point, as can be seen from the video.]. Again, the design was targeting efficient use of JavaScript as a platform that wouldn't force you to drop down to JavaScript when you're getting work done.

So the combination of being a Lisp and having a tiny core and having that bootstrapping approach and having a design aesthetic that is about being willing to stick your fingers into the platform and really use the platform has encouraged an ecosystem that is far beyond anything that any one person has direct control over.

So far as I know; I certainly have not had any conversations with anybody working on Clojure-Py; I've just watched in awe as that has come together. I have talked to David, who's done great work with Clojure on the CLR, but I think that it really proves out several aspects of Clojure's model: that it's easy to target other platforms and it's worth bothering to target other platforms.