A couple of years ago I tried to do some simulation of trading strategies with historical data, and had some problems with Yahoo!'s data. Yesterday I updated that data, and suddenly the results of my simulations were entirely different.

After digging for a little bit, I came across some changes in the data for BLT.L. For 4th February 2010, 2nd and 5th April 2010, and 4th May 2010, Yahoo! Finance now has open, low, high and close values of 2301. For some of those dates that's a big outlier. What's strange is that the data I downloaded from Yahoo! earlier has believable values for 4th February, which match the values Google provides now. That earlier data was taken at the start of March 2010, so it pre-dates the other days which now have strange values.

On a different theme, the opening values in 2003 are quite different to the values I had previously. Here's a brief sample. New values are the first quintuple, old ones are the second quintuple:

| Date | Open (new) | High (new) | Low (new) | Close (new) | Volume (new) | Open (old) | High (old) | Low (old) | Close (old) | Volume (old) |
|---|---|---|---|---|---|---|---|---|---|---|
| 2003-03-20 | 330.25 | 339.69 | 327.5 | 330.25 | 7611300 | 338 | 338 | 327.5 | 330.25 | 7723600 |
| 2003-03-21 | 336.25 | 339.5 | 330 | 336.25 | 8991500 | 330 | 339.5 | 330 | 336.25 | 6820800 |
| 2003-03-24 | 323 | 334.34 | 322.75 | 323 | 9223700 | 332 | 332 | 322.75 | 323 | 7317400 |
| 2003-03-25 | 329.75 | 331.75 | 321.5 | 329.75 | 6180300 | 323 | 331.75 | 321.5 | 329.75 | 4764400 |
| 2003-03-26 | 331 | 332 | 317.5 | 331 | 12376800 | 332 | 332 | 326 | 331 | 5400800 |
| 2003-03-27 | 331.75 | 331.95 | 326 | 331.75 | 17529400 | 331 | 332 | 326 | 331.75 | 9259400 |
| 2003-03-28 | 331 | 332 | 323 | 324.75 | 0 | 331 | 332 | 323 | 324.75 | 7892000 |
| 2003-03-31 | 326 | 332 | 317 | 317 | 0 | 326 | 332 | 317 | 317 | 6881800 |
| 2003-04-01 | 325.5 | 326.25 | 317 | 325.5 | 0 | 318.5 | 326.5 | 317 | 325.5 | 9346800 |
| 2003-04-02 | 334.5 | 335.75 | 320 | 334.5 | 14982600 | 320 | 335.75 | 320 | 334.5 | 8688600 |
| 2003-04-03 | 337.5 | 340 | 332.25 | 337.5 | 7572100 | 335 | 340 | 332.25 | 337.5 | 7030600 |
| 2003-04-04 | 334.5 | 338 | 329.5 | 334.5 | 13705200 | 334.5 | 337.75 | 329.5 | 334.5 | 7976800 |
| 2003-04-07 | 351.5 | 351.5 | 336 | 351.5 | 12427200 | 336 | 351.5 | 336 | 351.5 | 10592300 |
| 2003-04-08 | 343 | 392.5 | 338.75 | 343 | 9995800 | 347 | 347 | 338.5 | 343 | 7796600 |
| 2003-04-09 | 340.5 | 341.34 | 335.25 | 335.25 | 0 | 340.5 | 341.25 | 335.25 | 335.25 | 5601000 |
| 2003-04-10 | 326 | 335 | 326 | 326 | 8175000 | 334.5 | 335 | 326 | 326 | 7375500 |
| 2003-04-11 | 333.75 | 339.94 | 329 | 333.75 | 10880400 | 330 | 335 | 329 | 333.75 | 6761300 |
| 2003-04-14 | 335.75 | 336.48 | 331 | 335.5 | 0 | 335.75 | 336.25 | 331 | 335.5 | 5256700 |
There doesn't seem to be any pattern to these changes, but clearly some of them are significant enough to make a big difference to a simulation. I am hoping to get the chance to investigate some other potential sources of price data soon.
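A comparison like the sample above can be automated. The following is only a sketch: `diff_snapshots` and `FIELDS` are hypothetical names, and it assumes each download has already been parsed into a dict keyed by date.

```python
# Hypothetical helper: the field names mirror the columns of the sample above.
FIELDS = ("open", "high", "low", "close", "volume")

def diff_snapshots(old, new, tolerance=0.0):
    """Yield (date, field, old_value, new_value) for every field that differs.

    `old` and `new` map a date string to a dict of field name -> number.
    Only dates present in both snapshots are compared.
    """
    for day in sorted(set(old) & set(new)):
        for field in FIELDS:
            before, after = old[day][field], new[day][field]
            if abs(before - after) > tolerance:
                yield day, field, before, after

# Values taken from the 2003-03-20 row of the sample.
old = {"2003-03-20": dict(open=338, high=338, low=327.5, close=330.25, volume=7723600)}
new = {"2003-03-20": dict(open=330.25, high=339.69, low=327.5, close=330.25, volume=7611300)}
changes = list(diff_snapshots(old, new))
```

For that one row, the open, high and volume differ while the low and close agree.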

I generally have a lot of things to say about DBMSs in general, but here I'm going to concentrate on Datomic's features, and try to leave the comparison with other DBMSs to a more general post on the subject (that post has been in progress for many years now).

Datomic is a new DBMS based on Rich Hickey's ideas. It does a few things fundamentally differently to other DBMSs. One is the architecture: to be ACID, all transactions go through a single transactor node. What's unusual is that nothing else needs to happen on that node. Data from there goes to storages. The application runs elsewhere, and any queries run on those application nodes. Nodes access the storages to download and cache any data required for the queries given. See these slides for pretty pictures which may help clarify things.

Like quite a few modern DBMSs, Datomic uses append-only storage. However, Datomic really takes advantage of this. Every transaction corresponds to a version of the database, and that version can be queried for eternity. So if someone asks "What was the customer's address at the time the order was placed?", you just need to find the transaction id when the order was created, and look at the address record at that transaction. It's also possible to add other data to transactions beyond the default transaction id and time; the most obvious to me is the user responsible for the changes.

Having all this built into the database sounds very powerful. What percentage of database applications end up with a history table mirroring every data table? I'm willing to bet it's significant, and it's certainly not fun.

What about storage limits?

Since people always get obsessed about performance, let's briefly discuss the amount of data we're going to store. Suppose we're writing an app for a company which has a large data entry department, say with 100 employees. Every day each of those employees enters details from 50 forms, each of which contains 100 integers and 2 strings averaging 1000 characters. Taking one byte per character, eight bytes per integer and 200 working days a year, that's roughly (1000×2 + 100×8) × 50 × 200 × 100 = 2,800,000,000 bytes, or 2.8GB per year. Add an overhead for structure and storage of about 4x, and we get about 10GB/year. DNUK will currently charge me £1380 to put 128GB of RAM in a server. For some strange reason people still don't like selling servers with SSDs in them (meaning that for many tasks they can't compete with my MacBook Air), but anyway 1TB of SSD is about £400.
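The estimate is easy to replay with different assumptions. In this sketch the variable names are mine; the figures are the ones from the paragraph above.

```python
# Figures from the estimate above; change them to model a different workload.
employees = 100
forms_per_day = 50
working_days = 200
ints_per_form = 100
bytes_per_int = 8
strings_per_form = 2
chars_per_string = 1000      # one byte per character assumed
overhead_factor = 4          # structure and storage overhead

bytes_per_form = strings_per_form * chars_per_string + ints_per_form * bytes_per_int
raw_bytes_per_year = bytes_per_form * forms_per_day * working_days * employees
stored_bytes_per_year = raw_bytes_per_year * overhead_factor

# How long until 128GB of RAM could no longer hold everything?
years_to_fill_ram = 128e9 / stored_bytes_per_year

if __name__ == "__main__":
    print("raw: %.1f GB/year" % (raw_bytes_per_year / 1e9))
    print("stored: %.1f GB/year" % (stored_bytes_per_year / 1e9))
    print("years to fill 128GB: %.1f" % years_to_fill_ram)
```

With these numbers the stored volume comes to 11.2GB a year, so the 128GB server would take over a decade to fill.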

My point here is that, for human entered data, there is no reason to have update-in-place unless you're designing a system to record all the items purchased at Tesco, in which case, why are you reading my blog? For the vast majority of systems ever developed, storage is not going to be an issue. Of course if you're planning on developing the next Facebook you might want to choose something else, but I suggest waiting until you look like you might have a problem. At that point you can probably afford a big enough team to address the issue.

Okay, so how does it work?

Schemas

Datomic could be described as an entity-attribute-value-transaction database. A database has a schema determining what attributes can be set on entities, what type they are and whether they are single or multiple cardinality. Since any entity can have any attribute, this makes data modelling very flexible, and you don't get stuck with the old pain of how polymorphic objects might map to SQL. Let me say that again: the schema specifies what attributes are valid; there is no constraint as to which attributes an entity can have. If you want SQL-like tables then you could stick a :table attribute on all entities.

Adding data

To add data to the database, you commit facts as datoms. A datom is the addition or retraction of an (entity, attribute, value, transaction) tuple.
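To make the idea concrete, here is a toy in-memory model of an append-only log of datoms, with a lookup "as of" a transaction. It is emphatically not Datomic's API: `ToyLog`, `transact` and `value_as_of` are names invented for this sketch, and it only handles single-cardinality attributes.

```python
from collections import namedtuple

# Toy model only: this is NOT Datomic's API, just an illustration of the idea.
Datom = namedtuple("Datom", "entity attribute value tx added")

class ToyLog:
    def __init__(self):
        self.datoms = []     # append-only: nothing is ever updated in place
        self.next_tx = 1

    def transact(self, facts):
        """facts: iterable of (added, entity, attribute, value). Returns the tx id."""
        tx = self.next_tx
        self.next_tx += 1
        for added, entity, attribute, value in facts:
            self.datoms.append(Datom(entity, attribute, value, tx, added))
        return tx

    def value_as_of(self, entity, attribute, tx):
        """Value of a single-cardinality attribute as of transaction tx."""
        current = None
        for d in self.datoms:            # datoms are stored in tx order
            if d.tx > tx:
                break
            if (d.entity, d.attribute) != (entity, attribute):
                continue
            if d.added:
                current = d.value
            elif current == d.value:     # a retraction removes the matching value
                current = None
        return current

log = ToyLog()
t1 = log.transact([(True, "customer-1", ":customer/address", "1 Old Road")])
t2 = log.transact([(True, "order-7", ":order/customer", "customer-1")])
t3 = log.transact([(False, "customer-1", ":customer/address", "1 Old Road"),
                   (True, "customer-1", ":customer/address", "2 New Street")])
address_at_order = log.value_as_of("customer-1", ":customer/address", t2)
address_now = log.value_as_of("customer-1", ":customer/address", t3)
```

This is the "address at the time the order was placed" question from earlier: the lookup at `t2` still sees the old address, because the later retraction and addition belong to `t3`.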

When designing a schema, there will obviously be times when there is more than one way to do it. One example I came across was flags. If you have an entity representing an order, you may want a flag to say whether the order has shipped. There are two simple ways to do this: either have an attribute :shipped which is added when the order is shipped, or have that attribute added and set to false when the entity is created, and changed to true when the order ships. It turns out that for cases where the flag can go from true to false, it's better to have it always present; the advantage of this is that it is easy to query to find out when it was set to its current state.

Querying

Queries are written in Datalog, a logic query language, and, since Rich Hickey wrote this, everything is just data and the query can be used on other data structures too, as well as database values.

Arbitrary logic

Since Datomic runs on the JVM, it's possible to throw code around in various sneaky ways. Since Java makes me sleepy, I'm going to assume the use of Clojure. There are two points at which we might want to do sneaky things: during transactions, and during queries.

During transactions you may wish to call a function to generate the exact update so that changes are atomic; the canonical example of this is a bank transfer, where it is critical that the new balance is a function of the old value at the time the transaction began and not at an arbitrary point in the past. If a database function throws an exception then the transaction is aborted.

The documentation says that transactions can also be used for integrity checks. Thinking about this, I guess that it must mean that the update logic can check the current state and make sure the update doesn't break it, which is rather more an update than an integrity check. For me an integrity check would imply a function being called after an update with the new state of the database and a list of new facts, but that doesn't seem to be possible.

The other thing which seems odd to me about update functions is that they look like other facts in a query. Let's define ourselves a credit function which adds credit to an account:

What I expected was that the balance would be increased by 15. Or maybe 12. I certainly didn't expect the entity to grow two balance values, though thinking about it I can see that, given that the transaction was explicitly requesting multiple values, that's what you get. An error saying that the schema is contradicted would seem more natural to me, though. Anyway, two values is what Datomic gives you, and maybe this highlights a more general point: Datomic is less strict than many users of relational databases will be used to; the onus is on the application to be correct. For databases which are shared between many applications, this may be an unpopular balance, as it makes it more likely that other people break your data model, but maybe the way this is avoided is to have a shared data model which the applications all build on.

The reason I was surprised that calling a database function in a transaction looks like a normal fact is that transactional database functions will normally want to be called on their own; although each individual function will check that it maintains consistency, having more than one, or having other updates in the same transaction, is not guaranteed to. On further thought, this is probably just another example of the app being responsible for not making a mess of the database.

In queries, code may be used to, er, do other clever things. And possibly basic things like sorting. As you can probably tell, I'm not really too clear about this, but hopefully I'll get a better understanding once I can use database functions in queries from the REST interface.

Indexes

By default, Datomic keeps indexes on EAVT (entity, attribute, value, transaction), AEVT and VAET. Datomic can optionally also index on AVET; I should learn more about this and how it affects range queries.

Partitions

Data is split into partitions. Querying within a partition is quick, across partitions is slow. Presumably querying also gets slower as partitions get larger, but until I understand more I guess I'm going to stick everything in one big partition and wait for something to go wrong.

Usage from other languages

Whilst it's most natural to use Datomic from JVM languages, and particularly from Clojure, Datomic recently grew a REST API so that everyone else can access it. I've written a Python Datomic client.

Since you can still call database functions (or will be able to soon in the case of queries, I think), you still have most of the power you need. But clearly the question of where application logic goes and what needs to be pushed into the database gets more critical, both because any logic will have to be written in a different language, and because the cost of transferring query results out of the database will be a lot higher.

Personally I see this being the way I would be most likely to use Datomic. I learnt Clojure a while back, but whilst there are probably tasks out there that I might choose it for, they are few and far between. (Developing Datomic is, as Rich says, a very good fit.) Web applications usually consist of many independent threads which communicate only at the database layer, so the value of Clojure's cunning concurrency is small. On the opposite side, Python's web development libraries are very mature and flexible, and not something I'd want to lose.

Common SQL problems

Associating users with changes

Because every transaction is logged and has a time, it's easy to find the time of any update. But since transactions are themselves entities, arbitrary extra data can also be recorded. My application transaction function adds the current user to the transaction:

transaction.append('{:db/id #db/id[:db.part/tx] :data/user %s}'%user)

Get the id of a new entity

Getting the row id that was created by a SQL insert can be surprisingly painful. Fortunately, this one's quite easy in Datomic. Transactions contain responses confirming what has changed, like this:

The values in the tempids map are the ids of the created entities. Provided you've only created one entity in your transaction, life is simple. Otherwise you'll need to think a little harder.

Conclusion

I'm quite excited by Datomic. It's quite different from my attempt to write something similar to the ZODB for Clojure! I think the accessible history is great. I've never used an EAV store before, so Datomic's data model is entirely new to me.

Hopefully I'll soon get the chance to publish my version of the Pyramid todo app backed by Datomic. Whether I'll get to use Datomic for anything more is not clear. I'd like to use it for CMS development - performance is never going to be an issue, and having full history is naturally really useful.

Closures were the last significant feature of Python that I mastered, and I now seem to use them very regularly in any significant project. (I haven't yet had occasion to use metaclasses, but I can't imagine they will become so fundamental to me.) Closures in Python are pretty similar to those in most other languages which have them, which is basically all dynamic languages (JavaScript, LISP, Clojure).

So, what is a closure? Simply, it's a binding of variables in a function definition to the enclosing scope. For example:
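A minimal sketch of such a binding (the names `make_adder` and `add_five` are purely illustrative):

```python
def make_adder(n):
    def add(x):
        return x + n    # n is found in the enclosing make_adder scope
    return add

add_five = make_adder(5)
result = add_five(10)

# The captured variable is visible on the function object itself.
captured = add_five.__closure__[0].cell_contents
```

Here `add_five` still remembers that `n` was 5, even though `make_adder` has long since returned.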

Right, great, but that's not very useful. However, there are very many cases where it is useful to pass around functions, and often those functions will need to be dynamically generated. Without closures, in Python we would be stuck with code generation.

Objects are a Poor Man's Closures?

When closures refer to mutable objects, it can be possible to use them to replace objects. For example:

Now we can generate a pair of functions which share access to a data structure, but with no other way of accessing that data structure. Let's just check that we can create two and that they are independent:
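As an illustration, assuming a simple stack as the shared data structure (`make_stack`, `push` and `pop` are names chosen for this sketch):

```python
def make_stack():
    items = []              # only reachable through the two closures below

    def push(x):
        items.append(x)

    def pop():
        return items.pop()

    return push, pop

# Two pairs, each with its own private list.
push_a, pop_a = make_stack()
push_b, pop_b = make_stack()

push_a(1)
push_a(2)
push_b("x")

popped = [pop_a(), pop_b(), pop_a()]
```

Each call to `make_stack` creates a fresh `items` list, so pushing onto one stack never affects the other.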

Although this is a very neat technique, I don't think I have ever used it in finished code. Producing objects always seems cleaner, and passing around methods from objects actually results in passing closures around those objects anyway. I will just make that point clear with an example:
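A small sketch of that point, using a made-up `Counter` class: a bound method carries its object along with it, much as a closure carries its enclosing variables.

```python
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

counter = Counter()
bump = counter.increment    # a bound method: it carries `counter` with it

bump()
bump()

same_object = bump.__self__ is counter
```

Whoever receives `bump` can change the counter without ever being handed the `counter` object explicitly.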

So what's going on here? Well, each function is bound to the variable i in the enclosing scope, which changes after each iteration. Indeed, I could still change it:

>>> i=100
>>> functions[0](2)
102
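The same behaviour can be reproduced in a self-contained way. The function names here are illustrative; the second version uses a default argument to freeze the value of `i` for each function.

```python
def make_adders_late():
    functions = []
    for i in range(3):
        # Each lambda looks up i when it is *called*, after the loop has ended.
        functions.append(lambda x: x + i)
    return functions

def make_adders_bound():
    functions = []
    for i in range(3):
        # A default argument is evaluated at definition time, freezing this i.
        functions.append(lambda x, i=i: x + i)
    return functions

late = [f(10) for f in make_adders_late()]
bound = [f(10) for f in make_adders_bound()]
```

`late` comes out as `[12, 12, 12]` because all three functions share the final value of `i`, whereas `bound` is `[10, 11, 12]`.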

Some languages (Scheme, Clojure, maybe Perl) work differently. In those languages, the loop variable in a for loop creates a new scope for every iteration, and thus there is no risk of writing code which has the same bug. Here's a demonstration in Clojure:

This issue sounds like it could be fixed by Python behaving differently, and creating a new variable i and de-scoping the old i each time through the loop, as Clojure does. But that wouldn't make a lot of difference; there are many occasions when one wants to create closures in loops bound to something other than the iterator variable itself, and these would still have the same problem. Making each iteration of a loop an entirely different scope would solve the problem, but then it wouldn't be Python.

Another one to watch for

Today I wanted a decorator function which would check that the length of an argument list was equal to a given integer, though in one case I wanted it to check a range instead. The obvious way to do this is for the function producing the decorator to accept either an integer or a function.

Eh? Well, this is basically the same issue as in the for loop, but has quite a different feel. Indeed, the fact that we have created a closure here is accidental, and this could be quite a confusing issue if you aren't expecting it.

The problem is that the function defined by the lambda refers to condition, which itself becomes the said function. So the function is testing whether integers are equal to itself, which unsurprisingly they aren't. Again, this can be fixed by either having an external function to create the check function:

>>> def check_int(n):
...     return lambda x: x == n

or by using keyword arguments:

>>> condition = lambda x, cond=condition:x==cond
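Putting the bug and the keyword-argument fix side by side may help. This is a reconstruction under assumptions: the function names are invented, and the real decorator presumably did more than return the check.

```python
def length_checker_buggy(condition):
    if isinstance(condition, int):
        # Bug: the lambda refers to the *name* condition, which this very
        # assignment rebinds to the lambda itself.
        condition = lambda x: x == condition
    return condition

def length_checker_fixed(condition):
    if isinstance(condition, int):
        # The default argument captures the integer before the rebinding.
        condition = lambda x, cond=condition: x == cond
    return condition

buggy = length_checker_buggy(3)(3)
fixed = length_checker_fixed(3)(3)
in_range = length_checker_fixed(lambda n: 1 <= n <= 4)(2)
```

The buggy version ends up comparing 3 with a function object, so it is always false; the fixed version compares 3 with 3.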

Overall, closures are a very powerful feature, and one well worth understanding. They make it easy to pass functions with context around, rather than trying to pass around functions with argument lists as may be tempting otherwise. Here's a final example to demonstrate that:

I've had the good fortune to get to experiment with Python's excellent multiprocessing module recently. It's an interesting case of a language feature largely designed to take advantage of a performance characteristic of some operating systems; namely, highly efficient forking in unix-style systems. It exists on Windows too, but as a developer you have to be aware of the relatively high cost of producing new processes.

The multiprocessing module is Python's answer to the parallelisation problem. Because of the GIL, CPython can only have one thread actually executing at any instant. (Some Python implementations may be able to avoid the GIL, but there is a performance cost and, maybe as significantly, no compatibility with existing C extension modules.) But since threaded programming is hard, using separate processes and having a simple but explicit method of transferring data between them is very useful.

I found multiprocessing very easy to work with. After a while manually scheduling tasks into processes and polling them to see whether they had completed, I figured that using multiprocessing.Pool would do most of the work for me, and most importantly give me callbacks instead of polling. Here's a simple example of using multiprocessing.Pool:

    from multiprocessing import Pool

    def f(x):
        return x**2

    def printer(x):
        print(x)

    pool = Pool()
    pool.map_async(f, range(10), callback=printer)
    pool.close()
    pool.join()  # Essential otherwise main process can exit before results returned

This example works. What I was initially doing didn't. The reason was that the library I was writing didn't have a fixed list of functions to pass to map_async. Because of that, I had to dynamically produce a function to use as a callback, as I can't work out what to do with the result without knowing which function it came from.

[It would have been possible to create a Pool for one function at a time, and record that function in a shared variable, but that's pretty ugly too. Besides, each function took some other arguments beside that one being iterated over by map, so I needed a closure on those arguments anyway.]

So what's the problem with dynamically producing a function? Well, map_async passes the function to the subprocess executing it via pickling. Python functions can't truly be pickled, but so long as they are importable in the standard way they appear to be, though they are actually passed by name. So the problem I had was that I got errors like this:

pickle.PicklingError: Can't pickle <function b at 0x5014f0>: it's not found as pickletest.b

What's going on here is that the function I have dynamically created, b, which has acquired module pickletest (presumably since that's where the function that created it lives), isn't actually importable as pickletest.b. A read of the pickle module source showed that what matters is:

the module (__module__ on the function) is importable

the module contains the function with name matching the function's __name__ attribute
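Those two rules can be observed directly with the pickle module. In this sketch `make_local` is a made-up helper; both exception types are caught because, as far as I know, the exact exception raised differs between Python versions.

```python
import json
import pickle

def make_local():
    def b(x):
        return x
    return b

# An importable function is pickled as a reference to its module and name...
payload = pickle.dumps(json.dumps)
names_in_payload = b"json" in payload and b"dumps" in payload
same_object = pickle.loads(payload) is json.dumps

# ...so a function that cannot be found by that name cannot be pickled.
try:
    pickle.dumps(make_local())
    local_pickled = True
except (pickle.PicklingError, AttributeError):
    local_pickled = False
```

Unpickling hands back the very same function object, which shows that no code was serialised at all, only the name.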

Clearly if I create a function in the main process which isn't in the subprocess that isn't going to work. But if I create it before forking and creating the worker pool, it seems a bit silly to be prevented from using it just because it can't be pickled.

The solution I came up with was to create a module to use as a namespace to hold the functions I dynamically generated. The module has to be a physical file (not created by instantiating types.ModuleType), but what it contains can be added at run-time (before forking), and the rules above for what can be pickled are still met.

Note that due to the way Python's imports work, if the module is already imported then re-running import doesn't actually do anything (though it seems that it does have to be imported once: importing a module created from types.ModuleType fails). So, putting all that together I came up with something along these lines:

    from multiprocessing import Pool
    import mapfn      # empty module on disk, used as a namespace
    import resultfn   # empty module on disk, used as a namespace

    def square(x):
        return x**2

    def cube(x):
        return x**3

    def makeContextFunctions(*fns):
        for f in fns:
            newName = f.__name__

            def mapper(x):
                return f(x)
            # Make the function findable by pickle as mapfn.<newName>
            mapper.__name__ = newName
            mapper.__module__ = 'mapfn'
            setattr(mapfn, newName, mapper)
            f.mapper = mapper

            def resultFn(x):
                print("%s returned result %s" % (f.__name__, x))
            resultFn.__name__ = newName
            resultFn.__module__ = 'resultfn'
            setattr(resultfn, newName, resultFn)
            f.resultFn = resultFn

    # Called once per function, so each mapper closes over its own f
    makeContextFunctions(square)
    makeContextFunctions(cube)

    pool = Pool()
    pool.map_async(square.mapper, range(10), callback=square.resultFn)
    pool.map_async(cube.mapper, range(10), callback=cube.resultFn)
    pool.close()
    pool.join()  # Essential otherwise main process can exit before results returned

What happens here is that after dynamically creating functions I set their __name__ and __module__ attributes, set the functions to be attributes of a module, and, to make finding them easier, actually set them as attributes of the function which they are based on.
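The naming trick can be checked without starting any processes. The sketch below is a variation rather than the approach described here: it registers a `types.ModuleType` object in `sys.modules` instead of using a physical file, so it only demonstrates the round trip within one process (or in forked children, which inherit `sys.modules`). The names `dynfns`, `register` and `make_power` are invented, and `__qualname__` is set as well because Python 3's pickle looks functions up by that attribute.

```python
import pickle
import sys
import types

# Assumption: a module object registered in sys.modules stands in for the
# physical file; "dynfns" is a made-up module name.
namespace = types.ModuleType("dynfns")
sys.modules["dynfns"] = namespace

def register(fn, name):
    """Make fn findable as dynfns.<name>, which is what pickle looks for."""
    fn.__name__ = name
    fn.__qualname__ = name       # Python 3's pickle uses __qualname__
    fn.__module__ = "dynfns"
    setattr(namespace, name, fn)
    return fn

def make_power(exponent):
    def power(x):
        return x ** exponent
    return register(power, "power_%d" % exponent)

cube = make_power(3)
restored = pickle.loads(pickle.dumps(cube))
```

Because only the name travels through pickle, `restored` is the same closure object as `cube`, captured exponent included.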

One problem with the code above is that the function names I am dealing with are not guaranteed to be unique, since the functions come from many modules. I tried to replicate the module structure the functions were found in, but that failed due to pickle failing to import dynamically created modules. I ended up encoding the full function path by replacing - with -- and . with -_-.

Finally I realised that the code responsible for doing the dance above to make pickling work shouldn't be confused with the functions to make the mapper and callback functions. So I ended up writing a function which took three functions as arguments, the latter two of which expect the first as an argument. And it worked first time. It should provide me with good code for interview questions!

I found a website which gives some advice for swing traders. Reading it, I continuously found the need to call out citation needed! Since I couldn't find any research by anyone else, I thought I would do some tests myself.

The first thing I wanted to test was relating to the use of moving averages. The simplest test I could think of was comparing holding stocks when the 10 period simple moving average (SMA) is above the 30 period exponential moving average (EMA) versus holding them continually. The advice on the website looks immediately dubious since if you look at the example chart, the stock price at the point the presumed downward trend starts is actually slightly lower than the price at the time it ends.

Choosing companies to test

Selecting some companies to test the strategy with wasn't easy. Since if I do any investment it is likely to initially be in London, I wanted to test with London companies. The FTSE-100 is an obvious choice, but testing with the current FTSE-100 companies biases towards those which have been successful over the testing period. Surprisingly, there doesn't seem to be a source which will give a list of companies in the FTSE-100 at a given date. There is a list of FTSE 100 constituent changes, which looks slightly suspicious as it shows no changes since 2008; it also doesn't give the symbol for companies, making it harder to look them up.

For the sake of getting something done, I downloaded from Yahoo finance data for the current FTSE 100, and then removed any companies for which I didn't have data from January 2003 to March 2010. This left me with 70 stocks. This is clearly still biased, but probably not in a way which would significantly affect this experiment.

The simulation

I compared the result of putting £1 into shares of each of the companies and leaving it there with a strategy based on only holding shares when they were supposedly trending upwards.

My moving averages were calculated on closing values. If the short-term moving average was above the long-term average and my money for that stock was currently in cash, I converted it to shares at the next day's opening price. Conversely, if the short-term average moved below the long-term average and I was holding shares I sold at the next day's opening price.
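The rule can be sketched as follows. This is not the code used for the results below: the function names are mine, the EMA uses the common smoothing factor 2/(n+1) seeded with the first close, and transaction costs, splits and dividends are ignored.

```python
def sma(values, n):
    """Simple moving average; None until n values are available."""
    return [None if i + 1 < n else sum(values[i + 1 - n:i + 1]) / float(n)
            for i in range(len(values))]

def ema(values, n):
    """Exponential moving average, smoothing 2/(n+1), seeded with the first value."""
    alpha = 2.0 / (n + 1)
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out

def simulate(opens, closes, short_n=10, long_n=30, cash=1.0):
    """Final value of `cash` when holding only while SMA(short_n) > EMA(long_n)."""
    short, long_ = sma(closes, short_n), ema(closes, long_n)
    shares = 0.0
    for day in range(1, len(closes)):
        prev_short, prev_long = short[day - 1], long_[day - 1]
        if prev_short is None:
            continue                                  # not enough history yet
        if prev_short > prev_long and shares == 0.0:
            shares, cash = cash / opens[day], 0.0     # buy at today's open
        elif prev_short < prev_long and shares > 0.0:
            cash, shares = shares * opens[day], 0.0   # sell at today's open
    return cash + shares * closes[-1]                 # value any holding at last close

rising = [100.0 + i for i in range(40)]
falling = [100.0 - i for i in range(40)]
rising_result = simulate(rising, rising, short_n=3, long_n=5)
falling_result = simulate(falling, falling, short_n=3, long_n=5)
```

The decision on each day uses the averages from the previous day's close, so the trade happens at the next open, as described above. On the steadily falling series the strategy never buys and the pound stays in cash.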

Results

A naive approach (even ignoring stock splits) gave the following results: holding the shares from January 2003 until March 2010 would give you £168 from your £70 investment. Following the trends would return you £113.

The site suggests only holding stocks when the trends are strong and the averages are well separated. So I re-ran the simulation holding stocks only when the averages were separated by 1% and 0.5%. For 0.5% I got an amazing return, and decided to investigate.

On the middle day, the opening value is only 1% of what it should be. Ahem. That's not useful data, and obviously wrong since the opening value is less than the minimum. Time to look at the data: how often are the opening and closing values not between the maximum and minimum? 1400. Oh dear.

Okay, how many are more than 2% out? 135.

How many are more than 5% out? 45. But only two of these fall after the start of 2005. They are these values:
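A check of this kind might look like the sketch below. How "x% out" was measured is my assumption (distance beyond the high or low, as a fraction of that bound), `out_of_range` is a hypothetical name, and two of the three rows are made up to trigger the check.

```python
def out_of_range(open_, high, low, close, threshold=0.0):
    """True if open or close lies outside [low, high] by more than `threshold`.

    `threshold` is a fraction of the bound, e.g. 0.05 for 5%.
    """
    for price in (open_, close):
        if price > high * (1 + threshold) or price < low * (1 - threshold):
            return True
    return False

# (date, open, high, low, close)
rows = [
    ("2003-03-24", 323, 334.34, 322.75, 323),     # consistent row
    ("2003-03-20", 338, 337.5, 327.5, 330.25),    # made-up: open just above high
    ("2003-03-21", 3.3, 339.5, 330, 336.25),      # made-up: open is 1% of a sane value
]
any_error = [r[0] for r in rows if out_of_range(*r[1:])]
over_5pct = [r[0] for r in rows if out_of_range(*r[1:], threshold=0.05)]
```

Raising the threshold filters out the small inconsistencies and leaves only the gross errors, which is the narrowing-down done above.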

For GSK it appears that the close value is wrong. Substituting in the low value probably wouldn't be too bad. For RBS it's harder to see what's happened. The next opening was 215, so it's not impossible that the closing value is correct and it's the low that's wrong.

For the current simulation this isn't going to be a big factor, but the idea that the data are this flakey casts massive questions over any simulation results. I also have no idea whether there are other significant errors which aren't caught by the above check.

Other data sources

Google's historical data seems pretty similar to Yahoo's. I subscribed to Reuters DataLink in order to get better data, but after persuading a Windows user to allow me to install the client application on her computer we got lots of errors about unknown ticker symbols and failed to get any data at all. The fact that you have to use the standard client makes the data source pretty useless anyway, but it would be nice to see whether someone has accurate historical data. Anyone know where Yahoo's data comes from?

Plotting what happened

I was surprised by how hard it is to test even simple code to do data analysis. Even when you're confident that all the components you have work as expected it's still easy to connect them together wrongly, and once they're connected together it's pretty hard to verify that they're working correctly. I have an idea that using a stepped function as input data it should be possible to manually predict the expected output for moderately complex functions, but I haven't implemented this, and it still wouldn't be an obviously-correct test.

In order to try to visualise what is going on, I decided to plot the stock values and when the algorithm is holding them. Here (warning: 15MB!) is the format I came up with for the S&P 500 (see later for why). The value of each line is the log of the percentage movement of each close value from the previous day. The three shades of blue indicate what is held by the straight-forward algorithm, and also the 1% difference and 2% difference variants of it. I haven't worked out how to add a variable horizontal scale yet; this graph is from March 2002 to May 2010.

One weakness is that here we just plot closing values (which we base our investment decisions on), not opening values (which are the prices we actually pay).

Data issues

Using the charts it was easy to spot anomalies in the data. Here are a couple:

Toby at Timetric helpfully pointed out that Yahoo's US price data seems much better. I downloaded prices for the S&P 500. There do seem to be fewer errors, but there definitely still are some. For example, here are a couple of cases where the stock split ratio seems to have been applied to a couple of dates before the split.

In fact, this pattern seems to occur on about eight stock splits from 2001. However, there don't seem to be any instances after that. I decided to work with S&P 500 data, from 2002 onwards.

Stock splits data

In order to do any accurate simulation I need data about stock splits. I think Yahoo has this data pretty accurately, but getting hold of it is a pain. It appears in at least three places:

On the bottom of the standard graph of stock prices

In the table of monthly prices, though not in the CSV version, so that's not good

In the table of dividend data, but this data only appears when there is at least one dividend paid, and again not in the CSV version

None of these are very useful. I ended up parsing the human-readable page. Parsing that wasn't too bad, except that the 1960's (!) splits are of the form %Y-%m-%d instead of %b %d, %Y. Since some of the ratios are 102:100 I'm not even sure these are likely to be right. Anyway, after cleaning up this data I finally got to re-plot my graph and to my surprise it cleaned up most of the 'icicles' (the appearance of a sudden price drop) first time.
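Handling the two date formats can be done by trying each in turn. In this sketch the helper names are hypothetical, the exact layout of the scraped strings is assumed, `%b` relies on an English locale, and the 1962 date is made up.

```python
from datetime import date, datetime

# Formats mentioned above: the usual one first, then the one used for old splits.
SPLIT_DATE_FORMATS = ("%b %d, %Y", "%Y-%m-%d")

def parse_split_date(text):
    """Try each known format in turn; %b assumes English month abbreviations."""
    for fmt in SPLIT_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognised split date: %r" % text)

def parse_ratio(text):
    """Turn a ratio such as '2:1' into 2.0, or '102:100' into 1.02."""
    new, old = text.split(":")
    return float(new) / float(old)

recent = parse_split_date("Apr 24, 2002")
ancient = parse_split_date("1962-05-14")     # made-up date in the old format
ratio = parse_ratio("102:100")
```

Raising on anything unrecognised, rather than guessing, makes a third format show up immediately if one exists.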

One exception was a 2:1 stock split of Dean Foods (DF) before trading on 24 April 2002. However, this seemed to be the only obvious split which was missing.

Dividend data

Downloading the dividend data was straight-forward. Checking it was a little trickier. After an anticipated dividend is paid, one would expect that the share price would have fallen by that amount, but I couldn't find clear examples of this. Instead I thought I'd first check that I agreed with Yahoo's adjusted close calculations.

According to a comment on a blog post, the dividend was $0.22.
Working with that: Increase = 22.58+0.22-21.96=0.84. 0.84/21.96 = 3.8% change. Much nearer.
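The arithmetic above, written as a small function (`total_return` is a name chosen for this sketch):

```python
def total_return(previous_close, close, dividend=0.0):
    """Fractional change from previous_close to close, counting any dividend paid."""
    return (close + dividend - previous_close) / previous_close

with_dividend = total_return(21.96, 22.58, dividend=0.22)
price_only = total_return(21.96, 22.58)

if __name__ == "__main__":
    print("with dividend: %.1f%%" % (with_dividend * 100))
    print("price only:    %.1f%%" % (price_only * 100))
```

Counting the $0.22 dividend moves the change from about 2.8% to about 3.8%.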

So what's going on here? Why does Yahoo's data on the dividend amount seem to be out by a factor of 20? Well, the ratio between the adjusted close and actual price at that point is about 20. Can that be it? If so, argh! It does appear to be, but most of the dividends are round figures, so they don't look adjusted.

As an attempt to get some definitive info I went to the AIG website. It was down. You don't get much for $85,000,000,000 these days. I went to Google finance. The data is the same as Yahoo's. But then I realised that Google adjust all their historical price data instead of just having an adjusted close column as Yahoo do!

Okay, at this point I decided that the dividend payments don't make enough difference to the comparison of strategies to worry about for now. I will fix it up, but not yet.

Final results

Incorporating the stock split data into my simulations, I got the following results for S&P 500 companies between 20 March 2002 and 21 May 2010:

There are 458 companies in the current S&P 500 for which I have data for this time period.

Holding $1 in each would return $947

Holding when SMA10 > EMA30 would return $649

Holding when SMA10 is 1% greater would return $591

Holding when SMA10 is 2% greater would return $545

Further work

It would be nice to have some data in each case about how much of the time the money was in stocks. You might expect that if the efficient market hypothesis were true then the proportion of gain/loss that each strategy has compared to just holding the stocks is proportional to the time for which they are held.

It might make sense to invest a fixed amount each time we re-invest in a stock, rather than allocating a set fund to each stock and re-investing whatever returns we got previously, but making sure we don't use too much capital would be a good idea.

Working with the companies that were in the S&P 500 in 2002, instead of those which are in it now and also have data going back that far, would reduce some bias (the current list only contains the survivors).

Clever strategies could be employed to calculate at what value the averages would cross and buy during the day if that value is reached.
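One way to find that crossing value is to solve for the price at which the two averages would be equal. The sketch below assumes the EMA uses the common smoothing factor alpha = 2 / (N + 1), which is not specified anywhere above, and the function name `crossing_price` is hypothetical. With today's price x, the SMA is (S + x) / n, where S is the sum of the previous n - 1 closes, and the EMA is alpha * x + (1 - alpha) * E, where E is yesterday's EMA. Setting the two equal and solving for x gives the expression in the code.

```python
def crossing_price(prev_closes, prev_ema, sma_period=10, ema_period=30):
    """Price at which today's SMA would equal today's EMA.

    prev_closes: recent closes, oldest first (at least sma_period - 1 of them).
    prev_ema: yesterday's EMA value.
    """
    needed = sma_period - 1
    if len(prev_closes) < needed:
        raise ValueError("not enough previous closes for the SMA window")
    alpha = 2.0 / (ema_period + 1)  # assumed smoothing convention
    # Sum of the closes that remain in the SMA window once today's price is added.
    window_sum = sum(prev_closes[len(prev_closes) - needed:])
    denom = 1.0 / sma_period - alpha
    if denom == 0:
        raise ValueError("both averages respond identically; no unique crossing")
    return ((1 - alpha) * prev_ema - window_sum / sma_period) / denom


# Made-up example: flat prices and a matching EMA cross at the same price.
flat_cross = crossing_price([100.0] * 9, prev_ema=100.0)
```

In the made-up example the crossing price works out to 100, up to floating-point rounding, as you would expect when everything is flat.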

The code for this project should be tidied up so that it can be made public.

My initial attempt was quite slow; apparently Python's strptime is now a pure-Python implementation and hence quite slow, but much more consistent than using the underlying system library.

Anyway, since the format is fixed, I thought I could save a little time and ended up with something like this:

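    from collections import namedtuple
    from datetime import date

    Day = namedtuple('Day', 'date open high low close volume adjclose')

    def parseLine(l):
        # Everything after the date field is numeric.
        d = l.split(',')[1:]
        # The date format is fixed (YYYY-MM-DD), so slice it at known offsets.
        return Day(date(int(l[0:4]), int(l[5:7]), int(l[8:10])), *[float(x) for x in d])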

Yes, more beautiful code has been written, but this works; if this horrifies you, I recommend that you stop reading now. This code was many times faster than using Python's strptime, but still not fast. The above function was taking about 2.2 seconds to read nearly 200,000 lines of data. That's not bad, but not really fast either. Optimising wasn't necessary, but as a learning exercise I thought I'd give it a go.
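A rough way to reproduce the strptime comparison is sketched below. The sample line is made up, the two function names are hypothetical, and the timings will vary by machine and Python version, so no figures are quoted here.

```python
import timeit
from datetime import date, datetime

# Made-up sample line in the same date,open,high,low,close,volume,adjclose layout.
LINE = "2010-05-21,1.0,2.0,0.5,1.5,1000,1.5"


def parse_date_strptime(line):
    """Parse the leading YYYY-MM-DD field with strptime."""
    return datetime.strptime(line[:10], "%Y-%m-%d").date()


def parse_date_slices(line):
    """Parse the same field by slicing at fixed offsets."""
    return date(int(line[0:4]), int(line[5:7]), int(line[8:10]))


for fn in (parse_date_strptime, parse_date_slices):
    seconds = timeit.timeit(lambda: fn(LINE), number=20000)
    print(f"{fn.__name__}: {seconds:.3f}s for 20,000 calls")
```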

My first attempt was to use PyPy. With JIT, PyPy beats CPython in the majority of benchmarks. Regular expressions are a known slow point, and I don't have any of those, so I expected good things. Sadly it was about 5 times slower. (Plus I had to find an implementation of namedtuple and import the with statement from __future__, since PyPy currently implements Python 2.5.)

Then I remembered that I wanted to have a play with Cython at some point. Cython is really amazing because it enables you to mix Python and C functions with ease.

The first thing I did was to move the above code into a .pyx file and import it using pyximport which automatically converts the code to C and compiles it on import. This trivial change gave me a small speed-up, around 20%.

This wasn't enough to be worth the effort, so the next step was to avoid some of the Python string processing. C's atoi function parses until the first non-numeric character, so I could point it at the right point in the Python string and have it work. Sounds too good to be true? I thought so, but here it is:

This saved a bit of time, but some profiling showed that this function was still the slowest in my code. Could I also do the rest of the string parsing in C? Well, C doesn't have a nice split function. But it turns out that it does have a function strtod which converts a string to a double and sets a pointer to the first character after the part it parsed.

Yes, sadly this gets quite ugly now, but anyway here's the code:

    from collections import namedtuple
    from datetime import date

    Day = namedtuple('Day', 'date open high low close volume adjclose')

    cdef extern from "stdlib.h":
        int atoi(char*)
        double strtod(char*, char**)

    def parseLine(char* l):
        cdef char** j
        j = &l
        return Day(date(atoi(l), atoi(l+5), atoi(l+8)),
                   strtod(l+11, j), strtod(j[0]+1, j), strtod(j[0]+1, j),
                   strtod(j[0]+1, j), strtod(j[0]+1, j), strtod(j[0]+1, j))

As you can see we declare j to be a pointer to a pointer into a character array, and initialise it to point to l; we could initialise it to anything but l is a handy unused pointer which strtod can trample over before it goes out of scope when the function exits. To get a reference we use & as in C, but to dereference we have to use [0] since * already has a meaning in Python.

This is clearly a bit ugly and has made the code more fragile, but it actually gives a 3-times speed increase over the pure-Python implementation.

Cython is usually used for increasing the speed of numerical calculations (where improvements of the order of 1000 times have been reported), and there are probably not that many cases where it is suitable for string processing, which after all is something Python is particularly good at. But the ease with which Python and C can be mixed is really impressive, and I will certainly be happier programming heavy numerical procedures in Python knowing that I can likely get a good speed up with some minor changes if I need to.

After eliminating the string processing, the next slowest thing was instantiating named tuples. This made me a little sad as I think named tuples are one of the nicest features Python has grown recently. However, doing something similar in Cython isn't too hard. Here's what I ended up with instead of the namedtuple definition:

Here I first have a class definition; to make attributes available from Python they need to be declared either public or readonly. Below that is a factory which instantiates the class and sets the appropriate attributes. This still isn't incredibly fast, but the change more than doubled the speed at which I could load data again, so it did make a significant difference.

Returning to the problem of simulating investment strategies, I was struggling with Clojure. I have spent a lot more time with it and even written a basic asynchronous web-server with it, but there are still things which don't feel natural to me.

The task of simulating an investment strategy seemed ill-suited. After each simulated day, my simulation needs to update its ideas about the world and decide whether to invest based on that. Whilst it could re-compute everything for each day from all the previous history, keeping state and updating that state based only on the current day seemed a more sensible option.

With Clojure, the above would probably require something which could pass some functions their state and the next day's data, and then have the functions return their new state and a function deciding whether to invest. (This is a function of the following day's opening price).

What feels more natural is to have the state stay in place and to just pass in the next day's data and have the investment decision returned. This sounded like a perfect job for coroutines, so I decided that it was time to switch back to Python. There is an excellent tutorial on using coroutines in Python (pdf).

Combining these coroutines enables complex things to be expressed with good encapsulation. For example, suppose I want a coroutine which returns True iff the 10 period simple moving average is above the 30 period exponential moving average. Then I can combine them like this:
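A minimal sketch of how such coroutines might be written and combined follows. The names `coroutine`, `sma`, `ema` and `above` are illustrative rather than taken from the original code, the EMA is assumed to use alpha = 2 / (N + 1) and to be seeded with the first price, and the input prices are made up.

```python
from collections import deque


def coroutine(func):
    """Decorator that advances a generator to its first yield so it can accept send()."""
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start


@coroutine
def sma(period):
    """Send in prices; get back the simple moving average (None until the window is full)."""
    window = deque(maxlen=period)
    average = None
    while True:
        price = yield average
        window.append(price)
        if len(window) == period:
            average = sum(window) / period


@coroutine
def ema(period):
    """Send in prices; get back the exponential moving average."""
    alpha = 2.0 / (period + 1)  # assumed smoothing convention
    average = None
    while True:
        price = yield average
        # Seed with the first price (an assumption), then update incrementally.
        average = price if average is None else alpha * price + (1 - alpha) * average


@coroutine
def above(first, second):
    """Send in prices; get back True iff first's output exceeds second's."""
    result = None
    while True:
        price = yield result
        a, b = first.send(price), second.send(price)
        result = None if a is None or b is None else a > b


# Each coroutine keeps its own state; the caller only passes in the next price.
signal = above(sma(10), ema(30))
decisions = [signal.send(float(price)) for price in range(1, 41)]
```

The combined coroutine returns None until the slower-starting average has enough data, and a boolean after that.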

I've spent a few more days playing with Clojure. I've found it to be fun and challenging, but I can't really argue that it has been useful in any way yet.

I guess this isn't surprising. There are a few times when you would expect advantage from using Clojure: when you need a lot of heavy concurrency (so you can't just do things asynchronously), when you want to integrate with stuff on the JVM, and when you understand LISP enough for macros to make the language genuinely more powerful than other languages.

On the negative side, I've been confused by a few things.

Boxing

One of the expected ones was the boxing of low-level types. If you want to play with Java libraries, you might need to pass unboxed values around. But as soon as you touch anything in Clojure, it's automatically boxed. Calling libraries with values is fine: Clojure does some magic, but if you need to pass, say, an array of floats, you need to convert your Clojure vector explicitly. I made the mistake of expecting this to work:

(into-array [(double 1) 2.0 3.0])

I expected that because into-array is a function which creates an array in which everything is the same type as the first item. However, what actually happens is that the double has to be boxed to be put into the Clojure vector, so you get an array of boxed doubles. The correct solution is:

(double-array [1 2 3])

Java libraries

In order to use ChartDirector with Clojure I had to download a couple of web libraries (servlet-api and jsp-api) and add them to my classpath. This wasn't too painful when I worked out what to get, but there are multiple jars out there containing the classes I needed, and I don't know what might be an official source of them.

Functional programming

I'm still to be convinced that the purists from either the functional or the OO world have a point. Generally I've liked the functional side of Clojure, but today I wanted to create an exponentially weighted moving average function. It took me a long while! Here's what I came up with:

Although this seems to work, it won't work with very large lists: I need to thread an accumulator through so that tail-call optimisation is possible, and then use Clojure's recur since the JVM doesn't support tail-call optimisation. I also haven't thought very carefully about whether I should be using lists or vectors. Probably the performance is good enough with both, but it's another thing to think about.

After a while writing this, I wondered how much effort it would take in Python. Answer: about two minutes.
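A minimal sketch of what such a Python function might look like, not necessarily the version originally written. The name `ewma`, the `alpha` parameter and the choice to seed the average with the first value are all assumptions.

```python
def ewma(values, alpha):
    """Yield the exponentially weighted moving average of values, one item at a time."""
    average = None
    for value in values:
        if average is None:
            average = value  # seed with the first value (an assumption)
        else:
            average = alpha * value + (1 - alpha) * average
        yield average


print(list(ewma([1.0, 2.0, 3.0], alpha=0.5)))  # [1.0, 1.5, 2.25]
```

Because it is a generator, it keeps only the running average as state and works on inputs of any length.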

A lot simpler. (I'm not arguing that Python is always this much clearer; I'm just making the point that some things don't seem to be suited to the functional approach).

Maybe the Clojure code could be more like the Python by using the lazy-seq macro. I couldn't see how immediately, though.

Brain size

I'm not sure whether Clojure could ever fit in my brain. There are sooo many functions and macros. And often they are synonyms which do different things. For example vector returns a vector containing its arguments, whilst vec takes a collection as an argument and returns a vector containing its contents. I haven't found the equivalent of vec for lists yet!

Inlining

I spent ages early on trying to figure out why some code (which wasn't correct) behaved in inconsistent ways. It turned out to be related to function inlining. I haven't looked into this in depth, but I'm not the only person who's been confused by it. Looking at the source for + you can see the inline option:

This weekend I played with Clojure for the first time. There's no doubt it's pretty cool. I did a little bit of work with LISP before, and I do believe Clojure fixes a lot of the warts, as well as enabling use of all the Java libraries. And that's not mentioning the concurrency stuff which was the primary reason for Clojure's development.

For now, here's what I've written. I expect this code to get improved a lot, but so far it just fetches stock prices from Yahoo and maps the result into the correct data types.

We have a number of Linux machines onto which we want to deploy a number of applications which will run as services. There will likely need to be some amount of data transfer between the applications at some point, too, and possibly some shared configuration.

The applications which we are deploying need to be started at boot-up, restarted if they crash (and preferably also if they misbehave, such as by using too many system resources), and need to be manually controllable. We want it to be possible to easily install new versions of any of the services we have running on any of the systems.

Of course, it's also important that it's easy to install versions of the software for development and testing. In particular, it would be nice if installing on my Mac worked too.

Our standard existing deployment platform is Fedora Core 8. It would be good to have a solution that works on later versions of Fedora, but also Debian, Ubuntu and other distributions. Working on the custom Linux distributions found on some SCCs which we would like to use as embedded devices would also be an advantage.

Installing the software

Here, I'm going to assume that we're going to use zc.buildout. This is mainly because I have familiarity with it, it's extensible, it does what I want roughly how I think it should be done, and if there was anything better out there I suspect Jim Fulton would have found it.

Running the services

There are a few tools out there to run a program and monitor it to ensure that it keeps running. The two main ones I considered were zdaemon and supervisor. However, to confuse matters there is also D J Bernstein's daemontools, which covers pretty similar ground, but also provides a start-up system which works across most unix-based systems. Once you get there you also come across runit, which is meant to be an enhanced daemontools. Runit would really like to be run as process 1 and replace init, but it's not necessary in order to use it. There's a good article explaining the whats, whys and wherefores of runit.

Configuring boot-up scripts

This is possibly the trickiest bit. In our existing set-up, all custom things to be started on boot-up are in /etc/rc.local. To make it easy to install and upgrade different applications separately, we would ideally like to just place a start-up script in a directory and know that it will be run.

When daemontools installs itself it makes sure that it gets started by appending a line to /etc/inittab if it exists. From Fedora 9 onwards this file still exists, but the entries in it are no longer run on boot-up, as Fedora has moved to the new upstart system. OS X has its own start-up system called launchd.

zdaemon compared to supervisor

zdaemon and supervisor fill almost the same rôle, so it makes sense to compare them. This thread is a good comparison.

Puppet

On my way to getting all this working, I took a look at Puppet. Puppet is a tool for managing systems. It will create files, install packages, configure services and all those other things that one usually writes flakey scripts to do. A Puppet configuration, called a manifest, can be run repeatedly and will update the necessary components. Components can depend on one another, and it's all cool. Puppet is written in Ruby. The test coverage is high, and they use Trac and Buildbot. I can't help having a very positive feel about the project.

Puppet will also do some level of supervision. All that is required is a process which daemonises itself, and a set of commands to start/stop/status it. There are buildout recipes to install both zdaemon and supervisor with an application. However, following slightly the philosophy of Daemontools, I decided it made more sense to install the application and install something to daemonise it entirely separately.

I used zdaemon to daemonise my process, as it produces an executable with start, stop and status commands. Supervisor doesn't install an init-style script. There's one for Debian in the repository here, but that's all, and that doesn't include a status command. There was a slight snag: the zdaemon status command returns exit status 0 even if the process is not running. Since the exit status is how Puppet tells whether the process is running, I hacked zdaemon to return 1 if the supervisor process is not running.
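To illustrate the contract Puppet relies on (exit status 0 when running, non-zero otherwise), here is a sketch of a status check. It is not zdaemon's code and does not reflect how zdaemon tracks its process: the pid-file approach and the names `is_running` and `status_exit_code` are assumptions for illustration, and it is intended for POSIX systems only.

```python
import os


def is_running(pidfile):
    """Return True if pidfile names a process that currently exists (POSIX only)."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # no pid file, or its contents are not a number
    if pid <= 0:
        return False  # 0 and negative values address process groups, not a process
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def status_exit_code(pidfile):
    """Exit status for a status command: 0 if running, 1 otherwise."""
    return 0 if is_running(pidfile) else 1
```

A status script would pass the result to `sys.exit`, so that a service manager can rely on the exit status rather than on parsing output.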

I should add that Puppet can either check the status of a process every time it updates itself (by default on boot-up and then every half hour), and start the process if it's not running, or it can configure the process to be started by init. It doesn't really make that much sense to do both, and I opted for the former since it makes it cleaner if I decide that I don't want the process running.

Not doing any of that

You might think that it would make sense for Puppet to do roughly what zdaemon does, so that you can just install an application and tell Puppet to make it into a service. But I have something which works, so I'm happy.

Another improvement at some point may be to move to using Supervisor instead of zdaemon. Supervisor is much more full-featured, including being able to configure notifications on failure, clever restarting logic and memory monitoring. But until I need one of these features, I'm quite happy.

Puppet Introduction

Puppet manifests are written in a custom language which borrows heavily from Ruby syntax. The code below demonstrates a lot of the features of Puppet, and I concluded that there are too many to attempt to explain them all!

But as a quick overview, I have defined a, erm, definition which takes a buildout config file and installs the application. I also created a definition to take an executable and make it into a service. (Unfortunately this only works under Linux as it puts a script in /etc/init.d/. It would also be possible to write the control script to the application directory and then specify the start, stop and status commands explicitly in the service resource.)

Below is the Puppet configuration file which controls all the above. It requires Puppet to be configured to serve some files, including the application buildout file and a separate buildout which just installs zdaemon. (zdaemon could have just been installed globally, but this felt a little cleaner.)

The unsolved problems

In some ways you might argue that the above fails to get anywhere near solving the problem that I set out to solve. In particular, there is still no simple way to decide which version of myapp is deployed. You would need to find a buildout configuration file for the correct version and copy it to the Puppet file serving directory. That's not too bad for performing an upgrade, but it's pretty messy for performing a downgrade or, more commonly, seeing what version is currently deployed.

It would be much better if I could just specify the version of myapp in the Puppet configuration. Hopefully that will happen soon!