I have to design a relational database to store the results of gene expression experiments (both microarrays and RNA-seq), using MySQL.

I have a number of experiments, and for each experiment several contrasts/comparisons. For each contrast, I have thousands of genes with fold changes and p-values that I want to store. The database will be queried to retrieve genes showing differential expression across comparisons and across experiments.

What's the best design for such a scenario? I'm thinking to create one table per experiment but I'm not sure this is the best way to go about it. I'd be curious to hear from people who have experience in designing relational databases or who use SQL to query databases containing gene expression data.

Using a database system to store genes with just p-val and fold-change seems a bit overkill. What's the size of your data? If it is less than few hundred megabytes, you could probably just load the entire thing into memory to serve to the user via a simple webserver framework like bottle.py or node.js.

That's not what I need though. I'm oversimplifying the situation in my question. In reality I'll have several columns to store for each comparison and experiment. Besides, this will be part of a larger projects and has to be a relational database.

When designing databases, I try to follow a business object model. This usually means having a table per entity/class (e.g. a gene table, an experiment table...)and mapping their attributes to columns. I then apply normalization constraints, typically trying to achieve third normal form. I denormalize when it makes sense (e.g. faster queries). I would start simple with just core entities and their relationships. I also use MySQL Workbench to help with the design.

If this is for a single large experiment where the points of comparison are fixed the column approach could work, otherwise I would suggest adding an 'experiment' component.

What I've done in the past is have experiment, condition and fold change tables. Each experiment has several conditions, then you make two keys for the fold change table which refer to conditions. In the fold change table I made keys for the experiment and the two conditions. The exact implementation is a bit more complex, it was for microarray data so I built in the array's probes and so on into this. That way it was possible to pull fold change by probe id as well as the raw intensity values for each instance of that probe spotted on the array.

This made it relatively easy to just dump the fold changes from limma/etc into a single table using a python script to insert the relevant rows into the experiment and condition tables.

The idea was similar to what you are saying, the plan was to be able to pull fold changes for a given gene across individual experiments to make meta-analysis easier.

There is probably a better way, but you want to avoid adding tables for each experiment or comparison.

I'm not sure I completely understand what you mean with the two keys for the fold change table. Do you care to explain with an example? Or, even better, can you show me what the tables would look like? Cheers!