Time series DB?

After some analysis (with the help of a few former co-workers), we are moving toward temnik's suggestion, kdb+, in the long run. In the ultra-short run, I have switched to storing everything in duplicate as a pickle to improve read speeds from Python.
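The pickle cache is a small trick: serialize the already-parsed structure once, then load the binary blob on subsequent reads instead of re-parsing from the primary store. A minimal sketch (the series layout here is hypothetical, not from the original post):

```python
import pickle

# Hypothetical series layout: a dict of parallel columns.
series = {
    "timestamps": list(range(5)),
    "prices": [100.0 + i * 0.25 for i in range(5)],
}

# HIGHEST_PROTOCOL gives the fastest, most compact binary encoding.
blob = pickle.dumps(series, protocol=pickle.HIGHEST_PROTOCOL)

# Later reads deserialize the blob directly, skipping any re-parsing.
cached = pickle.loads(blob)

assert cached == series
print(cached["prices"][-1])  # -> 101.0
```

In practice the blob would be written to a file next to the primary copy; `pickle.dump`/`pickle.load` on an open file handle work the same way.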

It truly would be nice if there were an open source product as good as kdb+. It's possible one exists, but I just don't have the time to experiment.

@T0pH4t - so far my biggest complaints about influxdb are its RAM requirements and its poor implementation of as-of joins. At what version did you give up on it? Those other databases are not time-oriented, so they have no concept of previous/next or even group-by-time. How do you deal with that?
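For context, an as-of join matches each query time to the most recent record at or before it (e.g. the prevailing quote for each trade). A minimal sketch of the semantics, assuming sorted timestamps:

```python
import bisect

def asof_join(quote_times, quote_vals, trade_times):
    """For each trade time, return the latest quote value at or
    before it, or None if no quote exists yet.
    Both time lists must be sorted ascending."""
    out = []
    for t in trade_times:
        # Index of the last quote with time <= t.
        i = bisect.bisect_right(quote_times, t) - 1
        out.append(quote_vals[i] if i >= 0 else None)
    return out

quotes_t = [1, 3, 7, 9]
quotes_v = [100.0, 100.5, 101.0, 100.8]
trades_t = [2, 7, 8, 10]
print(asof_join(quotes_t, quotes_v, trades_t))
# -> [100.0, 101.0, 101.0, 100.8]
```

kdb+'s `aj` implements this natively over columnar data; general-purpose databases have to emulate it, often with expensive correlated subqueries.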

Kerf is a time series database, so I will assume you are talking about the others. At the end of the day, most if not all databases are implemented on top of core data structures: they are row-based (Oracle, Microsoft, MySQL...) or column-based (kdb+, Cassandra, MongoDB...), and they generally use a file structure based on a B-tree variant or an LSM tree (BigTable, HBase, LevelDB, MongoDB, RocksDB...). Time series DBs then layer optimizations (like delta-of-delta compression) on top of these structures, taking advantage of the fact that time series data is a continuous integer series with a known start and end. The query language can then be structured around the properties of time series data. The databases I suggested are just simple key-value stores: they give you a base layer on which you can start building a time series database (which is what I did). They will not give you an out-of-the-box experience like InfluxDB. Most databases could be made time-oriented with certain techniques, and some will be better than others. Your access patterns should drive your decision on which underlying structure to use (or at least they should).
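The key trick when building on a plain key-value store is encoding (series, timestamp) so that byte-wise key order equals time order; a sorted KV/LSM engine (LevelDB, RocksDB) then serves time-range scans as cheap sequential iteration. A sketch of the idea, with an in-memory dict standing in for the real store:

```python
import struct

def make_key(series: str, ts: int) -> bytes:
    # Big-endian unsigned 64-bit: lexicographic byte order == numeric
    # time order, so a range scan over keys is a time-range scan.
    return series.encode() + b"\x00" + struct.pack(">Q", ts)

store = {}  # stand-in for a sorted KV store
for ts, value in [(1609459200, 42.0), (1609459260, 42.5), (1609459320, 41.9)]:
    store[make_key("cpu.load", ts)] = struct.pack(">d", value)

def range_scan(series: str, start: int, end: int):
    lo, hi = make_key(series, start), make_key(series, end)
    for k in sorted(store):  # a real LSM store iterates in key order
        if lo <= k <= hi:
            yield struct.unpack(">d", store[k])[0]

print(list(range_scan("cpu.load", 1609459200, 1609459260)))
# -> [42.0, 42.5]
```

The series name `cpu.load` and the fixed-width float encoding are illustrative choices, not a specific product's format.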

Facebook put out an interesting white paper on Gorilla, the in-memory time series database they use for metrics. Beringei is their open source time series database based on that paper.
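Gorilla's timestamp compression is the delta-of-delta idea mentioned above: since metrics usually arrive at near-regular intervals, storing the change in the delta collapses most entries to zero. A sketch of the concept (integers here, not Gorilla's actual bit-level format):

```python
def dod_encode(timestamps):
    """Delta-of-delta encode a sorted series (length >= 2).
    Regular intervals collapse to runs of zeros."""
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for a, b in zip(timestamps[1:], timestamps[2:]):
        delta = b - a
        out.append(delta - prev_delta)
        prev_delta = delta
    return out

def dod_decode(encoded):
    ts = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        ts.append(ts[-1] + delta)
    return ts

raw = [1000, 1060, 1120, 1180, 1245]   # one slightly late sample
enc = dod_encode(raw)
print(enc)                             # -> [1000, 60, 0, 0, 5]
assert dod_decode(enc) == raw
```

The zeros then compress to a couple of bits each in a real bit-packed encoding, which is where the storage win comes from.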

I should mention that kdb+ and kerf are both based on APL to an extent, which heavily leverages CPU vector instructions (and in some cases the GPU). For time series/numerical data this can be a huge advantage. It's why other databases can have such a hard time beating them in the financial space.
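The array style that APL-family systems exploit can be illustrated with NumPy: one expression over a whole column, instead of an element-at-a-time loop, is what maps onto SIMD-friendly machine code. A rough sketch of the two styles:

```python
import numpy as np

prices = np.array([100.0, 100.5, 101.0, 100.8, 101.2])

# Scalar style: an interpreted loop, one element at a time.
returns_loop = [prices[i] / prices[i - 1] - 1 for i in range(1, len(prices))]

# Array style: one whole-column expression, as in q/kdb+ or APL.
returns_vec = prices[1:] / prices[:-1] - 1

assert np.allclose(returns_loop, returns_vec)
```

The results are identical; the difference is that the array form runs in tight compiled (and vectorizable) loops over contiguous columns, which is exactly the layout a column store provides.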

Going this route (though it is overkill for my personal project), it is good to consider both access patterns and possibly separate the API/technology for each, treating write and read operations as separate concerns.

Experience and history suggest that these access patterns are very distinct, so it may be worth abstracting them from each other and optimizing each on its own. Beyond the possible optimizations, this also gives more flexibility and freedom to change the underlying platform.

Traditional solutions tend to tie the write and read access patterns together in the same technology/API, which yields a lowest-common-denominator middle ground but may be simpler to get initial development started with.

Starting with such a split is easier to justify if you can see clear benefits from these principles up front.
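The split described above can be sketched in miniature: an append-only write path, and a read-optimized view rebuilt from it. The class names are illustrative only; the point is that either side can later be swapped for a specialized store without touching the other.

```python
class WriteLog:
    """Write path: append-only, optimized for ingest order."""
    def __init__(self):
        self._rows = []

    def append(self, ts, value):
        self._rows.append((ts, value))

class ReadView:
    """Read path: a columnar, time-sorted view built from the log."""
    def __init__(self, log):
        rows = sorted(log._rows)           # sort once, at build time
        self.ts = [r[0] for r in rows]
        self.values = [r[1] for r in rows]

log = WriteLog()
for ts, v in [(3, 9.0), (1, 7.0), (2, 8.0)]:   # out-of-order arrivals
    log.append(ts, v)

view = ReadView(log)
print(view.ts, view.values)   # -> [1, 2, 3] [7.0, 8.0, 9.0]
```

Writes stay cheap regardless of arrival order, and the read side pays the sorting/columnarization cost once, on its own schedule; this is the same separation that log-plus-snapshot architectures make at scale.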

However, the real advantage is in the q (and k) language framework itself. To truly get the best out of it you will need to master the language and design your system so that most of the heavy (pre/post)processing of data is done within a set of dedicated q servers. Only in this way will you be able to fully utilise the memory and speed optimisations of the kdb+ framework.

Any other front-end clients should then just consume the results, for display only, for example.