I was wondering what the best practice is for representing elements in a time series, especially with large amounts of data. The focus/context is a backtesting engine and comparing multiple series.

It seems there are two options:

1) Using an integer index, or
2) Using a date-based index

At the moment I am using dates, but this hurts performance and memory usage: I am using a hash table rather than an array, and iteration (forward or backward) carries some overhead because I have to determine the next/previous valid date before I can access it.

However, it does let me aggregate data on the fly (e.g. building the OHLC for the previous week when looking at daily bars) and, most importantly for me, it lets me compare different series with certainty that I am looking at the same date/time. If I am looking at an equity issue relative to a broader index, and the broader index is missing a few bars for whatever reason, an integer-indexed array would mean I'm looking at future data for the broad index vs. present data for the given security. I don't see how you could handle these situations unless you're using date/times.

Using integer indexes would be a lot easier code-wise, so I was just wondering what others are doing, or whether there is a best practice for this.
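To make the alignment concern above concrete, here is a minimal sketch (types and names are illustrative, not a real implementation) of date-keyed storage that pairs bars only when both series actually contain that timestamp:

```cpp
// Illustrative only: date-keyed series aligned by timestamp rather than position.
// std::map keeps keys sorted, so iteration stays in chronological order, and a
// bar missing from one series simply produces no pair instead of a silent shift.
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Bar { double open, high, low, close; };

using Series = std::map<std::int64_t, Bar>;  // epoch timestamp -> bar

// Pair up only those timestamps present in BOTH series.
std::vector<std::pair<Bar, Bar>> align(const Series& equity, const Series& index) {
    std::vector<std::pair<Bar, Bar>> out;
    for (const auto& [t, bar] : equity) {
        auto it = index.find(t);
        if (it != index.end())
            out.emplace_back(bar, it->second);  // same timestamp on both sides
    }
    return out;
}
```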

What programming language are you using?
– chrisaycock, Feb 26 '11 at 6:27

I just do what R does with POSIXct: fractional seconds since the epoch. Millisecond resolution ... and it easily interfaces with POSIX time semantics in other systems.
– Dirk Eddelbuettel, Feb 26 '11 at 14:12
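For concreteness, that representation is just a plain double in most languages; a minimal sketch of the same idea in C++ (the function name is my own, illustrative choice):

```cpp
// Illustrative only: fractional seconds since the Unix epoch, stored as a double,
// expressed with std::chrono.
#include <chrono>

double to_epoch_seconds(std::chrono::system_clock::time_point tp) {
    using namespace std::chrono;
    return duration<double>(tp.time_since_epoch()).count();  // e.g. 1298729532.123
}
```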

R also has packages such as [xts](https://cran.r-project.org/web/packages/xts/index.html) which you can use. They also let you select subsamples quite easily. But, as @chrisaycock said, it pretty much depends on what technology you're using and how large your sample is.
– SRKX♦, Feb 28 '11 at 19:58

8 Answers

Representing time series (esp. tick data) using elaborate data structures may not be the best idea.

You may want to try using two arrays of the same length to store your time series. The first array stores values (e.g. price) and the second stores time. Note that the second array is monotonically increasing (or at least non-decreasing), i.e. it's sorted. This property lets you search it with the binary search algorithm. Once you get the index of a time of interest in the second array, you also have the index of the relevant entry in the first array. If you wrap the two arrays and the search algorithm in, e.g., a class, the whole implementation complexity stays hidden behind a simple interface.
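A hedged sketch of that two-array idea, assuming 64-bit epoch timestamps and double values (names are illustrative, not a reference implementation):

```cpp
// Two parallel arrays: values plus a sorted timestamp array, wrapped so the
// binary search is hidden behind a simple lookup interface.
#include <algorithm>
#include <cstdint>
#include <vector>

class TimeSeries {
public:
    void append(std::int64_t time, double value) {  // times must arrive in order
        times_.push_back(time);
        values_.push_back(value);
    }

    // Index of the first element at or after `time` (binary search, O(log n)).
    std::size_t index_at_or_after(std::int64_t time) const {
        auto it = std::lower_bound(times_.begin(), times_.end(), time);
        return static_cast<std::size_t>(it - times_.begin());
    }

    double value_at(std::size_t i) const { return values_[i]; }
    std::int64_t time_at(std::size_t i) const { return times_[i]; }
    std::size_t size() const { return times_.size(); }

private:
    std::vector<std::int64_t> times_;   // sorted, parallel to values_
    std::vector<double> values_;
};
```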

Arrays are also cache-friendly, which is a big advantage on modern CPUs, where a lot of cache misses can be critical for performance.
– Andrey Taptunov, Aug 23 '11 at 6:25

Additionally, if you have data at mostly regular intervals (e.g. daily), you can use estimates of the actual position (knowing the start/end dates and the frequency) to squeeze out even more performance.
– Karol Piczak, Dec 10 '11 at 14:59
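A hedged sketch of that position-estimation trick for (mostly) regular daily data: guess the index arithmetically from the start time and bar spacing, then scan a few neighbours to correct for occasional missing bars. All names and the spacing parameter are illustrative assumptions.

```cpp
// Index of the last bar at or before `target` (clamped to the first bar).
#include <cstdint>
#include <vector>

std::size_t find_near(const std::vector<std::int64_t>& times,  // sorted timestamps
                      std::int64_t target,
                      std::int64_t spacing) {                   // e.g. 86400 for daily
    if (times.empty()) return 0;
    std::int64_t last = static_cast<std::int64_t>(times.size()) - 1;
    std::int64_t guess = (target - times.front()) / spacing;    // arithmetic estimate
    if (guess < 0) guess = 0;
    if (guess > last) guess = last;
    std::size_t i = static_cast<std::size_t>(guess);
    while (i > 0 && times[i] > target) --i;                      // walk back if we overshot
    while (i + 1 < times.size() && times[i + 1] <= target) ++i;  // walk forward if short
    return i;
}
```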


I disagree with this solution; it invites human error. What if you miss one entry in one of the arrays? All your data is out of whack. Better to store the Date,OHLC as rows, and then in one array.
– alpha, Apr 3 '12 at 0:08

I second alpha's comment; the last thing you want is a massive vector once you get any real amount of data in your time series.
– Ron E, Jul 6 '12 at 4:15

@alpha This is the correct data structure, although I agree with you that one needs to either put a massive amount of testing into it or steal someone else's implementation. pandas' Series is essentially what's described here.
– U2EF1, Jan 18 '14 at 4:06

I really wouldn't implement time series on my own unless I had a good reason to. AQR uses pandas, and almost everyone in R uses zoo or xts.

I never like multiple parallel arrays: if one breaks, everything is broken, and it gets uglier as you add data. If you are doing something in C++, why not have an array of structs, where each struct holds the quote, the time, and any other data you need?
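A minimal sketch of that array-of-structs layout (field names are illustrative): one contiguous vector, each element carrying its own time, so a single missing entry can't silently misalign anything.

```cpp
#include <cstdint>
#include <vector>

struct Tick {
    std::int64_t time;   // epoch timestamp travels with its data
    double bid;
    double ask;
    double last;
    std::int32_t volume;
};

std::vector<Tick> series;  // grows as one unit; no parallel arrays to keep in sync
```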

It's usually more efficient to have timeseries objects located sequentially in contiguous memory.

A hashtable doesn't provide this, and however good it looks from a complexity standpoint, it's not faster than a plain array for the [i+1] or [i-lag] kind of access that is typical in time series code.

(For the most part you can estimate the array size needed for a time series before you start your run, so array resizing operations can be optimized out.)
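An illustrative sketch of that point (the `expected_bars` count is a hypothetical placeholder): preallocate once, and [i-lag] access is plain indexed reads over contiguous memory, with no hashing involved.

```cpp
#include <vector>

std::vector<double> closes;
// closes.reserve(expected_bars);        // hypothetical size known before the run;
                                         // avoids reallocation while backtesting

double simple_return(const std::vector<double>& px, std::size_t i, std::size_t lag) {
    return px[i] / px[i - lag] - 1.0;    // contiguous, cache-friendly O(1) reads
}
```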

If you are serious about performance and flexibility, you have to take a look at the data.table package in R. Here is the crantastic review. It is lightning fast! I think this is the best package for addressing performance and memory issues.

At least for daily data: if you can afford to replace holes in your data with N/A rows (e.g. null), a possible approach is to simply store the date of the first row as a comment in your file, and then deterministically compute the date of any other row from its offset to the first row.

A key benefit is being able to use such a file as a time-indexed memory-mapped file (no need to load into memory), without involving a hashmap. Works like a charm (I implemented it in C#).
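A hedged sketch of the offset-to-date arithmetic for daily data with fixed-size rows (the row layout and names are illustrative assumptions): given the date of row 0, the date of row i is pure arithmetic, and row i's byte position is i * sizeof(Row), which is what makes a time-indexed, memory-mapped lookup possible.

```cpp
#include <cstddef>
#include <cstdint>

struct Row { double open, high, low, close; };  // fixed width; gaps hold NaN

constexpr std::int64_t kSecondsPerDay = 86400;

std::int64_t date_of_row(std::int64_t first_row_epoch, std::int64_t i) {
    return first_row_epoch + i * kSecondsPerDay;        // deterministic, no lookup table
}

std::int64_t row_of_date(std::int64_t first_row_epoch, std::int64_t epoch) {
    return (epoch - first_row_epoch) / kSecondsPerDay;  // index straight from the date
}

std::size_t byte_offset(std::int64_t row) {
    return static_cast<std::size_t>(row) * sizeof(Row); // position in the mapped file
}
```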

For a great, R-compatible (among others) memory-mapped file implementation, I can highly recommend TeaFiles (free and open source).

If your language is Java, CoralStore can persist time series for a fraction of the price of KDB. It provides very fast write access (~70 nanos/msg), so you can dump huge amounts of data to disk. For reads, you can fetch messages by sequence, and it uses paging/swapping for very fast access. It also allows simultaneous read/write access and comes with an asynchronous implementation for extra-low variance.