Gems don't really have to be modified to be used via Bundler. Gems have always had dependencies specified via gemspec files, and that's what Bundler uses. AFAIK, the problem Bundler has solved is dependency resolution, not gem packaging or dependency specification. That's the case for Rails apps, anyway.

Ah, interesting, I guess I should have looked at all the links. I agree with you that theoretical knowledge of computer science and math is important. However, when I look at resumes, the degree matters a whole lot less to me than experience. Unfortunately, degrees don't seem to correlate well with knowledge, the ability to apply it, or even interest in the field. I think that people who are genuinely interested in computer science will continue to learn on their own after Zoho's program.

I think your comment is misguided because TFA says that Zoho started their own two-year education program. That's plenty of time to teach people about algorithms and data structures. Also, I've met people with CS degrees who could not write decent code, so I think that a degree in and of itself does not guarantee any amount of knowledge.

In New Zealand you don't have to file anything if you were just receiving wages and bank interest. If you have other income (e.g. from freelancing), then you file a return sometime in June.
The tax rates are:

12.5% on income up to $14000

21% on income between $14000 and $48000

33% on income between $48000 and $70000

38% on income over $70000

So if you're earning $80000-$90000, you end up paying roughly a quarter of it in taxes, which isn't too bad considering that we need a lot of infrastructure for the population that we've got (New Zealand is larger than the UK in area but has only 4M people).
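To make the bracket arithmetic concrete, here's a sketch of how the rates above combine. The brackets are exactly as quoted; ACC levies and other deductions are ignored, so real take-home pay will differ a bit:

```python
# NZ income tax sketch using the brackets quoted above.
# Each rate applies only to the slice of income inside its bracket.
BRACKETS = [
    (14000, 0.125),        # 12.5% on income up to $14000
    (48000, 0.21),         # 21% on $14000-$48000
    (70000, 0.33),         # 33% on $48000-$70000
    (float("inf"), 0.38),  # 38% on income over $70000
]

def income_tax(income):
    """Total tax: sum each bracket's rate times the income inside it."""
    tax, lower = 0.0, 0
    for upper, rate in BRACKETS:
        if income <= lower:
            break
        tax += (min(income, upper) - lower) * rate
        lower = upper
    return tax
```

At $80000 this comes to $19950, an effective rate of just under 25% before levies.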

Having driven both on the American highways (in California and Oregon) and on the Autobahn, I can say that the Autobahn is way better, at least in terms of surface quality. It was easy to drive at over 110 mph on the Autobahn but I think I'd be uncomfortable at those speeds on US highways.

I think you missed the part where it says "based on a heavily modified PostgreSQL engine". I'm aware of Yahoo's database, and there's no way you can say that it's a "Postgres database". This was my point right from the start: when you have a lot of data, you are forced to move away from a stock standard RDBMS and do something else.

I agree that there isn't a plug and play solution for large amounts of data (at least not yet), and of course doing things right helps immensely.

I still think that things could be a lot easier than what we have with the current generation of RDBMS. As an example, Skype uses Postgres but they have to jump through a lot of hoops to make it work for them. For one thing, they can't just run SQL queries anymore, and they have to maintain the shards somehow (e.g. they probably need a way of balancing them). Backup/restore probably isn't viable for them either, so they must have implemented some form of redundancy. Another limitation is that with shards you need to route all queries through an indexing server which can also become a bottleneck. In short, this is a very difficult problem to solve.
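To illustrate the routing problem, here's a minimal sketch of hash-based shard routing. The shard connection strings are made up, and Skype's actual tooling (PL/Proxy) does this inside the database and is considerably more sophisticated:

```python
import zlib

# Hypothetical connection strings -- in a real setup these would point
# at separate Postgres servers, each holding a slice of the data.
SHARDS = [
    "host=db0 dbname=app",
    "host=db1 dbname=app",
    "host=db2 dbname=app",
]

def shard_for(key):
    """Route a key to a shard with a stable hash.

    Every query now has to pass through a router like this, and any
    query that spans shards (joins, aggregates) can no longer be
    expressed as a single SQL statement.
    """
    return SHARDS[zlib.crc32(str(key).encode()) % len(SHARDS)]
```

Note the rebalancing cost mentioned above: adding a shard changes `len(SHARDS)`, so a naive modulo scheme reshuffles nearly every key, which is why shard maintenance needs real machinery.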

The appropriate solution also depends on the structure of your data. For example, in my case we had a massive table with hundreds of millions of rows that dwarfed everything else, and we ran relatively simple queries on it. A dataset better suited to an RDBMS would have a lot of tables with roughly the same number of rows in them, where you run queries with lots of joins and filters.

I'm actually curious what the data in your 150TB database was like and what sort of hardware was required for it.

One thing we did was upgrade from Postgres 8.1 to 8.3. From what I read, 8.1 performance degrades rapidly with multiple concurrent long queries. 8.3 also has more efficient storage, which helps with the main bottleneck: hard drive throughput. IIRC, we got about a 10% improvement in query times with 8.3.

We also had two databases on one server, so the other thing that helped a lot was to run them on two separate servers. The largest table we had was clustered by one of the fields which made queries on that field fast. We didn't use autovacuuming and instead vacuumed overnight. A hardware upgrade also helped. We did some query profiling and made sure everything was indexed appropriately. None of this is rocket science of course, and just shows that as your database grows you have to get more and more involved in ensuring good performance.

We investigated vertical scaling with a better, more expensive server, and that would have helped for a while, but the database was projected to double in size in 1-2 years, so that would be no more than a stopgap measure. The conclusion I came to was that we had to move away from standard relational databases. One option was to use sharding (but I think sharding is a workaround for RDBMS limitations, so I don't like it that much), and the other option was to use something like a key-value store that can scale horizontally. Unfortunately, I didn't stay at the company long enough to implement this, so I can't tell you which of those would be a successful solution.
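For the key-value route, the usual trick behind horizontal scaling is consistent hashing, which is roughly what stores like Cassandra build on: nodes and keys are hashed onto a ring, and adding a node only takes over one slice of the ring instead of reshuffling everything. A toy sketch (node names are made up; real systems also use virtual nodes for balance):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: each key belongs to the first node
    clockwise from the key's position on the ring."""

    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.positions = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        # Stable hash so key placement survives process restarts.
        return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position at or past the key's hash, wrapping around.
        i = bisect.bisect(self.positions, self._hash(key)) % len(self.positions)
        return self.ring[i][1]
```

The property that matters: when a new node joins, every key either stays where it was or moves to the new node, so growing the cluster (or re-replicating after a failure) touches only a fraction of the data.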

Oh, absolutely, I'm not surprised that your setup works well, Postgres is a great RDBMS. Of course, how you design your schema matters a great deal too.

But here is another issue I thought of: backup. For our database, a full restore took 24 hours, which isn't practical. The only reasonable solution I know of is to use replication, which is a nuisance with Postgres and adds maintenance overhead (keeping the schemas in sync). I'd prefer to have built-in redundancy. Again, I think you get that with Cassandra and MongoDB.

I guess in a few years we'll probably end up with something that combines good properties of both key-value stores (redundancy and scalability) and RDBMS (powerful query language, transactions).

I have worked with large PostgreSQL databases (150GB or so) and really, Postgres isn't a solution at that scale. You run into issues anyway once some of your tables contain millions or even billions of rows. At that stage, things like vacuuming or altering the schema start to become damn near impossible, and even querying starts to become a bottleneck.

Now how do you scale that if your database is still growing? Postgres doesn't have a decent clustering solution that I know of, so your options are either to roll your own, or to scale vertically. Both of those are expensive options.

Based on my experience, I don't think that relational databases are appropriate for really large databases, and at present the only realistic option is horizontal scaling which is a lot easier with things like Cassandra or MongoDB.