I need to organize a testing/improvement process for ad banners, and I plan to automate most of it in Python. If possible, I'd like to have my data (daily updates of per-client performance metrics) in a format that's easy to work with in Python, and possibly in both R and Python. These will be rather large data sets, so I want to make sure I get it right from the beginning. What do you recommend?

I'm using h5py and HDF5 right now in a project and it's been great. Working with the data is fairly similar to working with dicts, so it's pretty painless in Python, not to mention the built-in support for NumPy arrays and the ability to use HDF5 hyperslabs. HDF5 is also supported in a ton of other languages.

I've never used PyTables extensively - I considered it along with h5py when I started my current project, but I wasn't sure how easily I'd be able to access data from other programming languages if I went with PyTables. I went with h5py mainly because it wasn't adding anything on top of HDF5 so I was fairly certain I wouldn't have too much trouble if I wanted access from Java, C++, etc.
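For what it's worth, a minimal sketch of that dict-like workflow (file, group, and dataset names here are just placeholders):

```python
import numpy as np
import h5py

# One day's per-client metrics as a 2D array (made-up shape/values).
daily = np.random.rand(10000, 12)

with h5py.File("metrics.h5", "w") as f:
    grp = f.create_group("day_001")                   # groups nest like dicts
    grp.create_dataset("metrics", data=daily, compression="gzip")

with h5py.File("metrics.h5", "r") as f:
    # Pull back just a slice (a hyperslab); the rest stays on disk.
    first_hundred = f["day_001"]["metrics"][:100, :]  # returns a NumPy array
    print(first_hundred.shape)
```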

h5py is a more straightforward (but still very Pythonic) wrapper around the C HDF5 libraries.

If you want to store a 3D, 4D, etc. array, or just want to easily store and quickly access index-based slices of your data, then h5py is probably the way to go.

If you want fast queries or optimized offline calculations (without loading everything into memory), then PyTables is a better option.

That's all an oversimplification, of course, but it's the basic idea. Also, as has already been mentioned, h5py doesn't add any extra metadata or do anything special to the HDF files. If you're going to access things from another language, it's usually easier to use h5py.
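To make the contrast concrete, here's a rough sketch of a PyTables in-kernel query (the table layout and file name are invented): read_where evaluates the condition chunk by chunk on disk, so only the matching rows come back into memory.

```python
import tables

class Metric(tables.IsDescription):
    client_id = tables.Int32Col()
    clicks    = tables.Int64Col()
    ctr       = tables.Float64Col()

with tables.open_file("metrics_pt.h5", mode="w") as h5:
    table = h5.create_table("/", "daily", Metric)
    row = table.row
    for i in range(100000):                 # fill with dummy rows
        row["client_id"] = i % 500
        row["clicks"] = i
        row["ctr"] = (i % 100) / 1000.0
        row.append()
    table.flush()

    # In-kernel query: filtering happens on disk, not in a Python loop.
    hot = table.read_where("(ctr > 0.05) & (clicks > 1000)")
    print(len(hot))
```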

Probably something like Redis or MongoDB if a properly-configured traditional RDBMS can't handle it (which it probably can).

Personally, I'd go with plain ol' PostgreSQL. Write to one table all day, partitioning only if your volume becomes a problem. Each night, collect the daily stats and produce reports, writing to "roll-up" summary tables. Drop/truncate the "daily" table and start over the next day. R, Python, and most other languages have excellent support for this.
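Roughly what that nightly roll-up could look like from Python with psycopg2 (table and column names are made up for illustration):

```python
import psycopg2

conn = psycopg2.connect("dbname=ads user=ads")
with conn, conn.cursor() as cur:
    # Summarize the day's raw rows into the roll-up table...
    cur.execute("""
        INSERT INTO daily_summary (day, client_id, impressions, clicks)
        SELECT current_date, client_id, count(*), sum(clicked::int)
        FROM daily_events
        GROUP BY client_id
    """)
    # ...then clear the raw table for tomorrow.
    cur.execute("TRUNCATE daily_events")
conn.close()
```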

I was on a project that was sifting through several terabytes of information. We used makefiles and SQL with PostgreSQL. I was very gung-ho about using Python as an alternative, but it simply lacked parallel processing, and the Python library interface to Postgres was too limiting: as datasets grew past 1 GB, it began to slow exponentially. I never dug into what was going on in the underlying C code, but Python simply couldn't hack it.

I've had great luck moving large sets of data into MongoDB. I had an extremely large government database of locations inside an .xls file. I wrote a quick Python script to push the values into Mongo, and that's where it happily sits now. Depending on the size, though, Redis or PostgreSQL might be a better choice. Mongo tends to eat a lot of memory, and unless you're using location-aware data, it's not always worth it.
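A quick sketch of that kind of load script, assuming xlrd for the .xls and pymongo for the inserts (file name, database, and collection names are placeholders):

```python
import xlrd
from pymongo import MongoClient

book = xlrd.open_workbook("locations.xls")
sheet = book.sheet_by_index(0)
header = sheet.row_values(0)

# One document per spreadsheet row, keyed by the header names.
docs = [dict(zip(header, sheet.row_values(i))) for i in range(1, sheet.nrows)]

client = MongoClient()                  # localhost:27017 by default
client.geo.locations.insert_many(docs)
```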

'Large' is pretty vague. Are you talking about something you're going to need a Hadoop cluster to analyze, or something that fits in RAM? There aren't many binary formats that are easily read by both R and Python. Both can connect to the same database, of course, but the only compatible binary format I found was Matlab's - both have libraries for it. I'd pickle NumPy arrays to a gzipped file for intermediate steps in Python, and convert as needed if R isn't your main environment.
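The pickle-to-gzip step is only a few lines, something like:

```python
import gzip
import pickle
import numpy as np

arr = np.random.rand(1000, 50)          # stand-in for an intermediate result

with gzip.open("step1.pkl.gz", "wb") as f:
    pickle.dump(arr, f, protocol=pickle.HIGHEST_PROTOCOL)

with gzip.open("step1.pkl.gz", "rb") as f:
    arr_back = pickle.load(f)
```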

I can recommend giving Pandas a look; large chunks of it are optimized in C and it's built from the outset on top of NumPy. If/when you outgrow CSV, it's trivial to build DataFrames from any dict-yielding iterable (e.g. various DBAPI adaptors with the right row factory configured).
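For example, something like this builds a DataFrame straight from a dict-yielding cursor - here with sqlite3's Row factory standing in for whichever DBAPI adaptor you're using (table and column names are invented):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("metrics.db")
conn.row_factory = sqlite3.Row          # rows now behave like dicts

rows = conn.execute("SELECT client_id, impressions, clicks FROM daily")
df = pd.DataFrame([dict(r) for r in rows])

print(df.groupby("client_id")["clicks"].sum())
```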

One word of warning against PyTables/HDF5 is that record lengths are fixed, so if you need to store a potentially long string, things start to look ugly quickly (note: I only know this from fixing someone else's design; I never invested time really getting to know PyTables). In that case, a multi-GB file and a 10-minute runtime were reduced to a 30 MB SQLite DB and a 15-second runtime.

Edit: not many people seem to know this, but with autocommit disabled and PRAGMA synchronous = 0 (which is similar to what you get with HDF5), SQLite becomes blazingly fast. As a compound tabular file format it's very hard to beat.
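For reference, that setup is only a couple of lines from Python's sqlite3 module - a sketch with a made-up table:

```python
import sqlite3

conn = sqlite3.connect("rollup.db")
conn.execute("PRAGMA synchronous = OFF")
conn.execute("CREATE TABLE IF NOT EXISTS metrics (client_id INTEGER, clicks INTEGER)")

rows = ((i % 500, i) for i in range(1000000))   # dummy data

with conn:                                      # one transaction for the whole batch
    conn.executemany("INSERT INTO metrics VALUES (?, ?)", rows)
conn.close()
```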

I routinely use space-delimited flat files for data sets up to about 5GB with no problems whatsoever. If you're familiar with command-line tools (grep, awk, cut, etc), this makes it extremely easy to do quick explorations of the data. It's much easier than using a database for this purpose.

I know it's not for everyone. (User input error isn't an issue for me, since my data are machine-generated.) But I don't think it should be dismissed outright: it's certainly a possibility if your data are not more than a few GB.

CSV files are terrible for large data sets and are really prone to user input error. The same goes for tab-delimited files. A lot of data contains commas and spaces, either naturally or introduced by users, which can cause problems in CSV and tab-delimited files.

A lot of people have different ideas of what 'large' is. If you are trying to put your data into R, it had better not be much more than a GB unless you are running it on a server with a lot of RAM. For taking a file out of a DB or a Python scrape and throwing it into R, flat files are great.

Take a look at Twitter's Rainbird. It wasn't released and apparently they don't plan to release it, but the concept, along with Cassandra counters, looks very good. I've implemented a similar system (with some adjustments) for a game statistics system, but I'm not planning to release it any time soon; I can share some code if you find it interesting, though.

One idea could be pipe-delimited ASCII, gzipped (or bzipped) and left compressed, then decompressed on the fly. Unless you're on SSDs or RAM disks, you may get a better load time into memory by decompressing the file while loading it rather than reading it uncompressed from disk. If you already have disk compression, then this wouldn't apply.
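Decompressing on the fly is straightforward in Python; a sketch with a placeholder file name, reading the pipe-delimited rows straight out of the gzip stream:

```python
import gzip
import csv

with gzip.open("metrics.psv.gz", "rt") as f:
    reader = csv.reader(f, delimiter="|")
    header = next(reader)
    rows = [row for row in reader]      # decompressed as it's read, never written out
```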