Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community.

I've been scratching my head for the last few days trying to find a solution that best suits my needs. What I want is probably not very different from what many others have wanted before, but so far I haven't found anything close enough.

Let me describe the problem and a few aspects/solutions I considered.

I don't want to be too specific but my definition of the problem is like this:

We have a very large (scientific) installation that works in "pulsed" mode: there is one new cycle every ~1 s.

It consists of 1000+ individual devices, each producing 1 < n < 100 values per cycle. Each device sends its data to the control system with a distributed timestamp (the cyclestamp), which makes it possible to tell that a given set of values corresponds to a given machine cycle.

A key constraint is that the data model should have at its core the fact that data coming from different devices are associated with the same cycle.

The values are accessed by DeviceName/Property#field, where a "property" is just a group of "field"s.

The values are of three distinct types: character strings, double-precision reals, and vectors (or matrices) of doubles. The scalars should not be opaque (of course); the vectors (and matrices) can be opaque (in the end they should be Python numpy arrays, and I think I'll store them in a binary format: numpy dump as binary => conversion to a db-specific binary format).
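As a minimal sketch of the binary round-trip for the opaque vectors, the stdlib `array` module can stand in for numpy here (numpy's `tobytes()`/`frombuffer()` behave the same way on a contiguous float64 array); the helper names are illustrative, not an existing API:

```python
import array

# Hypothetical helpers: pack a vector of doubles into raw bytes suitable
# for a BYTEA/BLOB column, and unpack it again on read.
def pack_vector(values):
    return array.array('d', values).tobytes()

def unpack_vector(blob):
    return list(array.array('d', blob))

blob = pack_vector([1.5, -2.0, 3.25])
print(len(blob))            # 3 doubles * 8 bytes = 24
print(unpack_vector(blob))  # [1.5, -2.0, 3.25]
```

With numpy the equivalent would be `np.asarray(values, dtype='float64').tobytes()` on write and `np.frombuffer(blob, dtype='float64')` on read.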

A very important aspect to me is that the existing devices can change regularly: the set of Property#fields they expose can change over time (a control-software version is associated with each device, and this can evolve). Of course there's a way for me to get a complete schema of all this, replicate it in my database and store the values there, but I think this is a bad idea, because it forces me to maintain that schema all the time. I think of my implementation as living "in user space" and the complete description of the devices and their properties as living "in system space", and I'd like to find a solution that can handle the addition/removal of devices and stay flexible throughout.

What I'm looking for is to design a persistency layer for this system. The workflow on the database (whatever it is) is:

During machine operation a person can decide to record a given set of values on a given set of devices (you provide a list of these and switch recording on and off whenever you want).

At some other point in time these datasets are queried. So far the datasets from the previous point were just dumped into single files, so there was no way to query the data properly, and each dataset was isolated (this person stores their files here, that other one there, etc.). The goal is to be able to query data within an individual dataset, or to look at some data (the value X of device Y) across many datasets, i.e. to view it as a time series.

The queries can be simple (time series) or more complicated (correlations between values of devices, filtering based on a range of values, etc.). I guess with properly built indexes these should not be too bad; if this requires JOINs, then in the end they will be JOINs, and those exist for a reason...

This doesn't sound too complicated, and I can think of half a dozen data models and implementations. However, I can't decide what's best under my constraints.

I considered:

Relational (SQL) solutions (see below)

NoSQL solutions, which at this point I ruled out for two main reasons: the volume, speed and availability I need are within the scope of regular SQL solutions, and SQL solutions seem less exotic and would allow more complex queries

Unless I missed a fantastic advantage of NoSQL for my situation, let's focus on SQL solutions.

Here are some schema ideas:

have one table per 'Device/Property', one column per 'field', and one row per cyclestamp: I want to rule this out (see above) because I want to stay flexible: if data from NewDevice/UnknownProperty comes in, I want to go ahead and store it

Instead, I would just "tag" the entries with text fields containing the device information. This is a bit disappointing: I get wonderful flexibility, but I'm not sure whether it's query-able or recommended:

one table storing entries: timestamp, dataset id (a group of inserts coming from the same user who started storing values from a set of devices), username, etc., plus "deviceName", "propertyName" and "fieldName" columns, and finally a "dataType" column

one table per datatype, with two columns: id (or timestamp) and the actual value (a real, or a binary blob for opaque vectors and matrices)
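The two bullets above can be sketched concretely. This is a minimal illustration using SQLite in place of PostgreSQL (the SQL carries over almost verbatim); all table and column names are assumptions for the sketch, not part of the original design:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entries (
    id          INTEGER PRIMARY KEY,
    cyclestamp  INTEGER NOT NULL,   -- the distributed cycle timestamp
    dataset_id  INTEGER NOT NULL,   -- one recording session by one user
    username    TEXT    NOT NULL,
    device      TEXT    NOT NULL,
    property    TEXT    NOT NULL,
    field       TEXT    NOT NULL,
    datatype    TEXT    NOT NULL    -- 'double' | 'text' | 'blob'
);
CREATE TABLE values_double (entry_id INTEGER PRIMARY KEY, value REAL);
CREATE TABLE values_text   (entry_id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE values_blob   (entry_id INTEGER PRIMARY KEY, value BLOB);
CREATE INDEX idx_entries_lookup ON entries (device, property, field, cyclestamp);
""")

# Store one scalar reading for an arbitrary Device/Property#field;
# no schema change is needed when a new device or field appears.
cur = con.execute(
    "INSERT INTO entries (cyclestamp, dataset_id, username, device, property, field, datatype) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1000, 1, "alice", "DeviceA", "Acquisition", "current", "double"))
con.execute("INSERT INTO values_double (entry_id, value) VALUES (?, ?)",
            (cur.lastrowid, 42.5))

# Time-series read-back for that field.
rows = con.execute(
    "SELECT e.cyclestamp, v.value FROM entries e "
    "JOIN values_double v ON v.entry_id = e.id "
    "WHERE e.device = 'DeviceA' AND e.property = 'Acquisition' AND e.field = 'current' "
    "ORDER BY e.cyclestamp").fetchall()
print(rows)  # [(1000, 42.5)]
```

The index on (device, property, field, cyclestamp) is what would make the time-series access pattern cheap; in PostgreSQL one would likely also partition by dataset or by time.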

The last solution seems to work, and makes me wonder why I lost so much time researching this whole thing and why I wrote such a long question...

However, the last one looks like a NoSQL solution, which means either that it's a good solution, or that I'm not profiting from SQL at all...

Additionally, this stores the timestamps multiple times. In the end I want to run queries like "SELECT all values having the same timestamp", where the timestamp is between this and that, where this_value < x, etc.
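A query of the "all values sharing a cyclestamp, filtered by value" kind is essentially a self-join on the entries table. A self-contained sketch (SQLite standing in for PostgreSQL; device and field names are made up for the example):

```python
import sqlite3

# Simplified version of the per-datatype layout, just enough for the query.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entries (id INTEGER PRIMARY KEY, cyclestamp INTEGER,
                      device TEXT, field TEXT);
CREATE TABLE values_double (entry_id INTEGER PRIMARY KEY, value REAL);
""")
data = [(1000, "DevA", "x", 1.0), (1000, "DevB", "y", 7.5),
        (1001, "DevA", "x", 2.0), (1001, "DevB", "y", 3.0)]
for cyclestamp, device, field, value in data:
    cur = con.execute("INSERT INTO entries (cyclestamp, device, field) VALUES (?,?,?)",
                      (cyclestamp, device, field))
    con.execute("INSERT INTO values_double VALUES (?,?)", (cur.lastrowid, value))

# Correlate DevA's value with DevB's value cycle by cycle,
# keeping only cycles where DevB's value is below a threshold.
rows = con.execute("""
    SELECT a.cyclestamp, va.value AS dev_a_x, vb.value AS dev_b_y
    FROM entries a JOIN values_double va ON va.entry_id = a.id
    JOIN entries b ON b.cyclestamp = a.cyclestamp
    JOIN values_double vb ON vb.entry_id = b.id
    WHERE a.device = 'DevA' AND b.device = 'DevB' AND vb.value < 5.0
""").fetchall()
print(rows)  # [(1001, 2.0, 3.0)]
```

The duplicated cyclestamp is exactly what makes this join possible; with an index on cyclestamp it stays an indexed lookup rather than a scan.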

What version of PostgreSQL are you using? Have you considered JSONB (or alternatively HSTORE)? I wouldn't recommend EAV unless there's no better option. Having tables created dynamically can also be cumbersome. PostgreSQL can scale quite well with your data: Postgres-XL or pg_shard come to mind. You can also "migrate" older partitions to a columnar store with cstore_fdw, which was built specifically for this kind of use case, if you don't plan to update records.
– Károly Nagy, Dec 29 '15 at 15:19
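The JSONB route the comment suggests would store each device's readings for a cycle as one document, so new Property#field keys need no schema change. A hedged sketch, again with SQLite (its JSON1 `json_extract` function standing in for PostgreSQL's `->>` operator on a JSONB column, where a GIN index would back the lookups); all names are illustrative:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (cyclestamp INTEGER, device TEXT, payload TEXT)")

# One row per device per cycle; the payload can carry any Property#field keys.
con.execute("INSERT INTO readings VALUES (?,?,?)",
            (1000, "DevA", json.dumps({"Acquisition": {"current": 42.5, "status": "OK"}})))
con.execute("INSERT INTO readings VALUES (?,?,?)",
            (1001, "DevA", json.dumps({"Acquisition": {"current": 43.1, "status": "OK"}})))

# Filter on a value inside the document.
rows = con.execute("""
    SELECT cyclestamp, json_extract(payload, '$.Acquisition.current')
    FROM readings
    WHERE device = 'DevA'
      AND json_extract(payload, '$.Acquisition.current') > 43.0
""").fetchall()
print(rows)  # [(1001, 43.1)]
```

In PostgreSQL the column would be `JSONB`, the extraction would read `payload->'Acquisition'->>'current'`, and a GIN index on the payload would support containment queries.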