Thursday, December 27, 2012

I've been working with Angular.js for the past little while, both on Web-Mote (still listed as a Perl project for some reason) and various work projects. Overall, the impression is very good. For the most part, the Angular approach saves a lot of lines and cycles over jQuery-style DOM traversal and manipulation. It always saves a lot of lines over Handlebars-style HTML templating, which was honestly a bit of a surprise at first. Proper routing is slightly more annoying than its Backbone.js counterpart, but it forces you to break your app out into discrete, composable pieces, which seems like it would help you scale up down the line.

There are a couple of small places where DOM-traversal seems to be the easier way forward[1], and there's one omission made by the Angular guys[2], but otherwise, I can heartily recommend the library, even after initial frustration.

The big perspective change you need to come to grips with is the shift from imperative/functional-ish programming to a model-centric, reactive approach. Using plain jQuery, you might define a dynamic list as a $.each/append call wrapped in a render method somewhere. You might define the template manually in your js code if it's simple enough, or do it using Handlebars or a similar HTML-templating library if it's more involved. If you need to collect the contents/order of the list later, you traverse the DOM and pull out the chunks you need.

It's not an unreasonable way of going about things, but Angular does it better; the same task is done by using the HTML-DSL to describe the relationship of a model (a literal JS list of objects) to the markup, and then populating that model. The framework reactively updates the DOM whenever model changes occur. Later, when you need the data back, you don't need to traverse anything. You just send the model out to wherever it needs to go.

Let's go through some before and after shots of Web-Mote for illustrative purposes. Specifically, let's take a look at the controls, since that's the simpler piece. Incidentally, I'm not claiming that this is the most elegant code either before or after. I just want to show you the structural and philosophical differences between approaches.

The target element gets its own id so that we can refer to it from our jQuery code. The script blocks are Handlebars template declarations. I've elided the rest of the HTML markup because it's all include/template/meta overhead, but you can see it in the appropriate Web-Mote commit if you are so inclined.

command is only relevant because it switches out the pause button for a play button when it's pressed successfully. Observe that all of the rendering here is happening through DOM manipulations. We run .append over the result of calling the controlBlock template on each group of player controls, and each call to controlBlock itself applies the control template. When we need to do that button switch I mentioned, we do it by calling .replaceWith on the appropriate DOM selector. We probably could have avoided going to sub-templates for control buttons, but that would have saved us five lines at the outside; just the script tag boilerplate in the HTML markup, and that Handlebars helper definition.

That's that. Like I said, this isn't the most elegant code I've ever written. If I really put my mind to it, I might be able to shave off ten lines or so, and clarify my intent in a couple of places, but I think it would be pretty difficult to do much better without fundamentally changing the approach.

It should be fairly self-explanatory. That's not the clearest code you're likely to find, but it's illustrative. We've got a bunch of non-HTML directives strewn about; all the stuff starting with ng- is part of the Angular DSL. While we need to do the {{}} thing to evaluate code inside of standard HTML properties, any code inside of ng- properties is automatically run in the context of the controller, CommandCtrl.

That's all, by the way. You've seen all the code for the Angular version, and the two are functionally identical from the user's point of view.

Unlike in the jQuery solution, there's no DOM manipulation here. We've got a model called controlTree which contains the same specification of controls that the earlier version did, but this time, the actual construction of relevant templates is taken care of by the framework. We just specify the relationship between that model and the front-end in the form of the HTML code above, and Angular automatically updates. The clearest demonstration of that is these lines

Where we're back to templating ourselves. You can also see the same principles affecting that code hacking around older versions of Safari; we're just setting up some objects rather than doing DOM traversal ourselves.

The effect is the same, but the particulars of updating and rendering are kept comfortably away from us.

As I said, the above example was picked to clearly illustrate the differences between approaches, not necessarily because it's the biggest gain in clarity I've gotten out of porting over[3]. I'm sure a headache or two will pop up down the line, but I submit that this is a fundamentally more humane way to craft responsive web front-ends than the alternatives.

Footnotes

1 - [back] - (re-ordering complex elements is really the only one I've observed; stuff that's too complex to do like this, but where you still need to pass the current order of some set of UI elements back to the server for persistence. As I said already, angular-ui does it just fine for simple constructs, but for anything more complicated, the Angular solution is ~30 lines of sub-module, where the DOM-traversal solution is a mere 5)

2 - [back] - (the $http.post function doesn't do the jQuery thing of encoding an object as POST parameters. The default behavior is to dump the parameter object to a JSON string and pass that to the server as a post body. I could actually see that being the easier approach if you had perfect control of the server, since that would let you do some not-exactly-HTTP processing on the incoming structure. If you're using a pre-built one, though, you're probably stuck doing something manual and annoying like this

Not too ugly once you throw in the usual pinch of underscore, but this is the sort of thing that really seems like it should be built in as a default behavior. Unless the Angular devs really think some large portion of their users are going to build their own servers to work the other way)

3 - [back] - (in fact, this is probably the least clarity I've gained by moving over to the reactive approach. As I said earlier, the line-count is usually halved without breaking a sweat)

Friday, December 14, 2012

The flu can go fuck itself in its nonexistent, viral ass. This shit will not beat me. While I run down the clock, I'm profiling more things to make me feel a bit better.

First off, neither GHCi nor Haskell mode comes with an interactive profiler. Or, as far as I can tell, any utilities to make batch profiling any easier. The way you profile Haskell programs is by installing the profiling extensions

apt-get install libghc-mtl-dev libghc-mtl-prof

compiling your program with the profiling flags on

ghc -prof -auto-all -o outFile yourFile.hs

and then running the result with some different profiling flags.

./outFile +RTS -p

That should create a file called outFile.prof in the directory you just ran it from, and that file will contain a couple of well-formatted tables that will tell you where your space and time cost-centers are.

Those functions are both now part of my ha-custom mode. The big one takes a Haskell file, compiles it to a tempfile with the appropriate flags, runs the result with the other appropriate flags, and returns the name of the profiling output file. The little function takes the current buffer and runs it through the big one, then opens the result in a new window. That should make it a bit easier to actually do the profiling.

Actually Profiling Haskell

We started with pretty much the same thing as the Lisp code, and I'll strip the printing elements again for the purposes of this exercise; we're not interested in how inefficient it is to actually produce a grid based on our model of the world.

It's almost the same, actually; the one real difference is how we determine frequencies. Instead of doing a single traversal of the corpus, we do what looks like a much more expensive operation, composing group onto sort onto concatMap neighbors. In a book, that would be called "foreshadowing".
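That pipeline can be sketched like so; a minimal version with illustrative names (the actual source may differ slightly):

```haskell
import Data.List (group, sort)

-- Illustrative types and names; the real source may differ.
type Cell = (Int, Int)

-- The Moore neighborhood of a cell: its eight surrounding cells.
neighbors :: Cell -> [Cell]
neighbors (x, y) =
  [ (x + dx, y + dy)
  | dx <- [-1, 0, 1], dy <- [-1, 0, 1]
  , (dx, dy) /= (0, 0) ]

-- How many living cells border each candidate cell: every neighbor of
-- every living cell, sorted so duplicates land adjacent, then grouped
-- and counted.
frequencies :: [Cell] -> [(Cell, Int)]
frequencies cells =
  map (\g -> (head g, length g)) . group . sort $ concatMap neighbors cells
```

The sort is the suspicious-looking part: it turns what could be a single pass into an O(n log n) operation over an 8x-expanded corpus.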

We're actually only interested in that small table, so I'll omit the exhaustive one for the future. Basically, yes. grouped and neighbors are the resource hogs here. Even so, this compares favorably against the Common Lisp infinite-plane version, both in terms of program complexity and in terms of runtime. Not to mention that the initial CL version actually crashed at ~3000 iterations, since Common Lisp doesn't guarantee tail-call optimization.

Anyhow, the first thing we're doing this time is limiting the size of the world.
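Clipping the board comes down to a filter over the candidate cells. A sketch of the idea, assuming a tuple cell type (Data.Ix.inRange works on pairs directly; the real code may be structured differently):

```haskell
import Data.Ix (inRange)

-- Illustrative cell type; the real source may differ.
type Cell = (Int, Int)

-- Throw away any cell outside the fixed bounds of the world;
-- anything that wanders off the board is simply forgotten.
clampWorld :: (Cell, Cell) -> [Cell] -> [Cell]
clampWorld bounds = filter (inRange bounds)
```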

Granted, inRange is on the map as a cost center, but this shaved ~28 seconds off the final run time, so I'm gonna call that fair enough. Given the numbers we were posting yesterday, I'm almost tempted to call this good enough. Let's see where it all goes, shall we? Step size of

It's funny: after just clipping the board, we start getting much better numbers with unoptimized Haskell than we saw with unoptimized Common Lisp. That's not really much of a victory, since optimized Lisp was handily beating the numbers we're putting down today, but it's also not the showdown I want to see. I want to know how optimized Haskell stacks up, and I want to know how Gridless Life stacks up to a gridded implementation. Back to Rosetta Code, I guess. Second verse, same as the first; I added a grid-appropriate gun[1] and stripped all but the final printing code.

That's ... almost sad enough not to be funny. Almost. Do note for the record that this is an order of magnitude up from the gridless version with the same inputs. And when you think about what's involved in each traversal of each corpus, it kind of becomes obvious why that is. The grid's corpus traversal always has 2500 stops. The gridless traversal is somewhere between 50 and 100 for a comparably populated board of the same size; 2500 is our worst case, and we'll probably never hit it.

I'm not even going to bother profiling the higher steps with this approach if 5000 took two minutes. I do still want to see how low we can go, and how we'd go about it.

The first thought I have is to try out that iterate approach, rather than recurring manually
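A sketch of that change, with a minimal Set-based lifeStep standing in for the real one (names are illustrative, not necessarily those in the actual source):

```haskell
import qualified Data.Set as Set
import Data.List (group, sort)

type Cell = (Int, Int)
type World = Set.Set Cell

-- The Moore neighborhood of a cell.
neighbors :: Cell -> [Cell]
neighbors (x, y) =
  [ (x + dx, y + dy)
  | dx <- [-1, 0, 1], dy <- [-1, 0, 1], (dx, dy) /= (0, 0) ]

-- One generation: a cell lives next step if it has exactly three
-- neighbors, or two neighbors and was already alive.
lifeStep :: World -> World
lifeStep world = Set.fromList
  [ cell | (cell, n) <- freqs, n == 3 || (n == 2 && Set.member cell world) ]
  where
    freqs = map (\g -> (head g, length g)) . group . sort
          . concatMap neighbors $ Set.toList world

-- Instead of recurring manually, index into the lazy stream of worlds.
runLife :: Int -> World -> World
runLife n start = iterate lifeStep start !! n
```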

I'm gonna go ahead and put that one down to a profiler error, especially since running the same program in interactive mode confers no such magical acceleration. This does kind of call the process into question somewhat though...

Oh, well, I'm meant to be exploring. Let's pull the same incremental stuff we did with CL yesterday. First, we're already using Set here, so the member check is already as tight as it's going to get. Our last valid profiler ping told us that lifeStep.grouped is where the big costs are paid, so let's see if we can't reduce them somewhat.

I'm going to cut it here for now. I think I've done enough damage. I won't be putting the latest up[2] for obvious reasons. Yes, I peeked ahead, which is why I knew this particular optimization wouldn't work in Haskell early enough to foreshadow it, but I still wanted to formalize my thoughts about it.

It's hard not to learn something from playing with a language's profiler. This experience tells me that I might have the wrong model in my head, or it might be that predicting where a traversal will happen is a lot more difficult in lazy languages, or, as I suspect from the latest profiler readouts, it might be that a Haskell Map's lookup speed isn't constant time. The reason I suspect this is that some of our biggest cost centers are now frequencies.rec.newM (which does a Map.lookup each call) and frequencies.inc (which manipulates a particular element of a Map, so I assume a lookup is part of it).

I'm off to read up on Haskell data structures and test these hypotheses.

Oh. And heal up.

Footnotes

1 - [back] - (by the way, this makes clear that whatever the performance comparisons come down to, the gridless version has a more elegant notation)

2 - [back] - (though the limited-size version and the gridded competitor will be checked in)

Thursday, December 13, 2012

So, to make myself feel better, I'm profiling things. Specifically, the Common Lisp version of Life I wrote last time. I'll be using Emacs and SLIME, but I'm pretty sure you can do at least some of this using time in whatever REPL you've got lying around.

Gosper's gun is the simplest emitter I could find, and I need to test that sort of thing to convince myself of the performance of this abstract machine. The .cells->list function exists purely to convert files like this into inputs suitable for our peculiar model of the Life world. You'll also notice that I stripped all printing code from run-life; I'm not interested in how inefficient the conversion between sparse-array and grid is, and I imagine that it would have been the main cost-center had I kept it. Let's hop into the REPL

Ok, I guess that's not entirely unexpected. After all, run-life is still recursive, and Common Lisp doesn't guarantee tail-call optimization. Still, we probably got some pretty decent data, even from a failed attempt. M-x slime-profile-report says

frequencies and life-step are obviously the culprits here, and since we now know what the cost-centers are, we can mitigate them. Discounting micro-optimization[1], there are essentially three ways to optimize a piece of code for time[2]

reduce the number of traversals of your corpus

reduce the time taken per traversal

eliminate sequential data dependencies and do more traversals at once through parallelism

We won't be doing the third because the Game of Life problem doesn't inherently lend itself to it; you need to compute step N before you can compute step N+1, and that can't really be helped. We might be able to take advantage of parallelism in a couple of places during each step, but that tends to have its own costs associated and typically doesn't pay off except on very large data sets.

There are a bunch of ways to do one and two. We can re-write pieces of our code with tight loops; reducing readability somewhat but removing traversals where we can. We can change the representation of our corpus to something more easily searchable, or we can be more aggressive up-front about throwing out elements we know we won't need later. We'll probably end up doing all of that.

So frequencies takes up a fuckton of conses, and the second most execution time, right behind life-step. This preliminary survey probably wasn't worth doing on a program this size; just looking at our defined functions would probably have told you who the culprits are and aren't.

First off, (mapcan #'moore-neighborhood cells) is one traversal of the input. Ok, not too much we can do about that; we need to do it at least once. Calling frequencies on that is a second traversal, and we can probably tweak our code enough that those two happen at once. The subsequent loop call is another traversal of (* ~8 cells). We do actually need to traverse f, but it's currently longer than it needs to be because it's a hash-table that contains all cells in any living cells' Moore neighborhood. Fixing that would mean tweaking frequencies so that it automatically threw out cells with fewer than two or more than three neighbors, since those couldn't possibly be alive next time. Finally, it might not be entirely obvious, but member is a linked-list operation that traverses its list argument each time it's called. I put it in the tail end of an and, which means it should only be getting called for cells with two neighbors, but each time it does get called, it traverses some part of cells; all of it, if its argument wasn't alive last time. We'll fix that by using a data type that has a more efficient membership check than a linked list.

Oh, by the by, I have to apologize for the poor frequencies implementation last time. It turns out that Common Lisp has something like Python's defaultdict built in; gethash takes an optional third argument which it returns as the default value. Which is nice, because (incf (gethash [key] [table] 0)) will do exactly what you think it should. Now then, one traversal eliminated; let's hook the new thing into life-step
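As an aside, the same one-pass frequencies idea translates to Haskell, the other language in this series; Data.Map.Strict's insertWith plays the role of gethash's default argument. A sketch:

```haskell
import qualified Data.Map.Strict as Map

-- One pass over the input: each element bumps its own count,
-- with insertWith supplying the "default to zero" behavior that
-- gethash's third argument supplies in the Lisp version.
frequencies :: Ord a => [a] -> Map.Map a Int
frequencies = foldr (\k m -> Map.insertWith (+) k 1 m) Map.empty
```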

Not bad, actually. compute-frequencies conses more, but saves us about half a second over the program's running time. A direct result of this is a ~5 second drop in computing time for a 5000-step Gosper gun. Not too shabby for five minutes' worth of work. Next up, let's try to ignore irrelevant cells. That means not adding them to the result hash unless they've got at least two neighbors, and it means knocking them out if they have more than three. In other words, we'll be wanting more hash-table.

Hm. Very slightly faster, it turns out. All the time we buy in reducing the number of cells we need to traverse seems to get eaten by the more complex check. I'm not entirely sure it was worth it, but let's keep that optimization where it is for now. We've got one trick left up our sleeves, and it's changing the representation of cells. At the moment, it's represented by that Lisp mainstay, the Linked List. In Clojure, we'd use a set, but we don't have ready access to those here. So, we'll need to use the next best thing: a data structure with quick insertion and constant-time lookup.

Subtle changes happen to each of those functions to support the overarching change, which is that we're using hash-tables everywhere now. Because member has to traverse the entire list of cells, while gethash is constant time, this should knock the shit out of our performance problems.

Granted, we're still consing like crazy, but removing that member check has pushed life-step down so low that it actually takes up significantly fewer resources than friggin moore-neighborhood. We've cut our total running time from ~40 seconds to under 10. In fact, let's crank this fucker up to eleven.

Aaaaand I got bored. Yeah, that took a while. What were you expecting? We get plenty of new cells each iteration, and we don't actually throw any away unless they die naturally. Which doesn't happen often when you're dealing with a generator. That's the last "optimization" we can make; instead of (declare (ignore ...))-ing the world-size, we can use it to forget cells that lie outside of our target area. It won't help all patterns, but the *gosper-glider-gun* won't create a Malthusian disaster for our computing resources.

Pretty good, right? All things considered? Before we go, let's take a look at how this approach compares to the traditional grid Life technique. Here's the code, pulled from Rosetta Code, using a two-dimensional array instead of a list of the living. Oh, I've commented out printing of intermediate steps, and included a 50x50 field with the Gosper Gun, just to make sure this is as even as possible. I also have to reset the starting world for :life-grid each time, since its process is destructive.

Hm. Honestly wasn't expecting to be cleaning the grid's clock yet, but we're using about a quarter of the time and about a sixth of the memory. Remember, at the low end of the spectrum, the difference between a poor algorithm and a good one isn't very big. If you've got a corpus of length 20, it really doesn't matter whether you pick bubble-sort, quicksort or timsort. In fact, you'd expect the better algorithms to do mildly worse on smaller data sets, since their optimizations don't have as much opportunity to pay for themselves.

The optimized gridless approach is holding steady at about 1/4 time taken and about 1/6 memory used. Again, because this is a garbage collected language, those affect each other. Each trip of the collector adds precious seconds to the tally of consumed resources, so being a memory hog does come back to bite you in the ass even if you're not directly optimizing for space. Last one. Don't try this at home, unless you have something to do for a little while.

We're still the same fraction better, but the numbers have increased pretty drastically. I know which one I'd rather rely on for crunching large life patterns.

These aren't all the optimizations we could pull, by the way. If we wanted to do better, we could inline moore-neighborhood within compute-frequencies, and we could prevent it from consing nearly as much by using its integers directly rather than allocating a fresh list of conses every time. A particular optimization we could do that would be relatively difficult with the grid approach would be to check for a barren world before each step; if we ever get an empty set as a result, we can return immediately rather than spinning wheels until we reach the end of our step counter. It would be easy for us to do, since we just need to check (= 0 (hash-table-count cells)), whereas doing it the obvious way would add another traversal of the corpus per step for the already much slower traditional approach.
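That barren-world early exit can be sketched generically; here's a hedged Haskell version (illustrative names, with the step function passed in, since the CL code would do the (= 0 (hash-table-count cells)) check described above):

```haskell
import qualified Data.Set as Set

-- Step the world up to n times, but bail out as soon as nothing is
-- left alive: an empty world can never produce new cells, so there's
-- no point spinning wheels until the step counter runs out.
runUntilBarren :: Int -> (Set.Set a -> Set.Set a) -> Set.Set a -> Set.Set a
runUntilBarren n step world
  | n <= 0 || Set.null world = world
  | otherwise = runUntilBarren (n - 1) step (step world)
```

The check is one Set.null call per step, a constant-time operation, versus an extra full traversal per step for the grid representation.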

Ok. I'm going to sleep. I was going to do a similar writeup using the Haskell profiler, but that took a lot out of me. Hopefully, you've learned something from all this. Fresh code up at my Life github. Feel free to beat those numbers. I'd be particularly interested if someone wanted to do some micro-optimization on the same problem and put forth an explanatory article.

Footnotes

1 - [back] - Which is a huge topic in its own right, and involves things like hand-optimizing memory cache interactions for minimum fall-through and various other low-level, machine oriented optimizations.

2 - [back] - We're optimizing for time because space tends to be cheap, and running things fast is fun.

Ruby and Erlang each come with their own modes, and recent Emacs versions ship with a built-in Python mode and shell. Smalltalk uses its own environment (though GNU Smalltalk does have its own mode), and I'd really rather not talk about PHP. If you're writing in it, chances are you're using Eclipse or an IDE anyway.