I have a strong interest in creating and learning about highly scalable web sites and applications that deal with Big Data. For example, many world-class web companies and startups such as Twitter, Facebook, and Google deal with massive amounts of data, and their software engineers use tools such as Hadoop/MapReduce, Cassandra/memcached, and other algorithms, techniques, and tools to make sense of this data. To me this is a very interesting and challenging area, and much more invigorating than just creating yet another e-commerce site or web portal for a commercial company.

Naturally, working at a typical software development firm won't give you experience with this level of massively distributed data processing, and few people have access to the CPU clusters needed to run Hadoop, DryadLINQ, etc. I wanted to ask whether there are particular geographic areas or countries with companies dealing with such Big Data problems - Silicon Valley is the obvious one; any others? Also, feel free to mention any non-obvious startups and companies that are tackling such challenging problems.


Look into retail systems programming. Via ETL alone you deal with many millions of records per day going into OLTP systems (not just cubes or OLAP), and those get blown out to several factors more in mastered records.

My personal opinion on this is that traditional data management tools (e.g., RDBMSs) do a stand-up job of managing and querying these data as long as the schema is intelligently designed (what's new), without having to resort to more exotic data management alternatives.

Anecdote: with data sets this large you can't be afraid to scale up rather than take the scale-out approach. Many years back we were processing what amounted to Cartesian joins on 4-to-5-million-record-per-day data; think combinatorics. We had a "cluster" of 52 blades listening on a queue of data subsets, doing smaller calculations and resubmitting the results back into the MQ. Sound familiar?

We ended up processing 24 hours of data in 26 hours, losing ground every day. Dozens of people tried to optimize it, myself included, for weeks and weeks. Finally an old, cool-and-collected DBA convinced the IT department to lease a 24-way IBM pSeries (which was really big back then) with some blazing DASD. He stayed up one night writing a few monster SQL statements and ended up processing the same 24 hours of data in about 16 hours. That taught me a big lesson the next day.

Other areas include equity and market data, GIS, and anywhere else you have discrete data points that need evaluation.

I personally used to work for a company that did logistics, tracking product as it was shipped around the country. We'd get around eight million bar-code scans an hour, and we had to validate, log, and accumulate that data into several categories (product, shipper, state, region, source, destination, etc.), and each category had several "buckets" (hourly, daily, weekly, monthly, month-to-date, quarter, quarter-to-date, ...).
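To make the "categories and buckets" idea concrete, here is a minimal sketch of that kind of accumulation in Python. This is not the system we actually ran; the record fields, category names, and bucket granularities are just assumptions for illustration.

    from collections import defaultdict
    from datetime import datetime

    # counts[(category, value, bucket)] -> number of scans seen
    counts = defaultdict(int)

    def bucket_keys(ts):
        """Time buckets a scan falls into (hourly, daily, monthly)."""
        return [
            ts.strftime("%Y-%m-%d %H:00"),  # hourly
            ts.strftime("%Y-%m-%d"),        # daily
            ts.strftime("%Y-%m"),           # monthly
        ]

    def accumulate(ts, product, shipper, state):
        """Add one bar-code scan to every (category, bucket) counter."""
        for category, value in (("product", product),
                                ("shipper", shipper),
                                ("state", state)):
            for bucket in bucket_keys(ts):
                counts[(category, value, bucket)] += 1

    # A hypothetical scan record
    accumulate(datetime(2010, 5, 1, 14, 30), "SKU-42", "ACME Freight", "OH")

At eight million scans an hour, the interesting work is keeping those counters consistent and queryable, but the data model itself stays this simple.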

A shop that does data warehousing is a good candidate too. In that situation, it's not so much the amount of raw data, it's the volume you build up by slicing and dicing it so many ways.

Financial services is another obvious area. I am regularly involved in projects that process tens of thousands of events per second in near-real-time and/or need to store millions of rows of data a day.

The technology used is pretty diverse, ranging from proprietary software to some of the open source code that you mention to greenfield development.

(To be fair, there's quite a lot of primitive stuff too. I know of quite a few companies whose tick "database" is just a bunch of CSV files.)

If you want to analyze large data sets, you can also work on data you probably already have, or can easily obtain.

Example 1: web site logs

If you have a website and you've kept all the logs for the last five years, chances are you have a large amount of data. Being a victim of DoS/DDoS attacks helps a lot too.

This sort of data is also fun to play with (a small parsing sketch follows this list):

Gather statistics about your users. Either you will learn something useful, or at least you'll be able to build some nice graphics from those results (like a map showing where your users are and how the geographical distribution evolves over time).

Analyze the user flow on your website to optimize it (this may be obsolete if you've changed the website a lot).

Try to detect which requests were hacking attempts. Based on those results, can you detect current hacking attempts faster?

Track the HTTP responses and compare them to your error reporting tool. What you want to find is that most 500- and 404-type responses were reported through the error reporting tool, analyzed, and fixed. What you don't want to find is that half of your customers regularly encountered 500/404 errors on your website, and you were unaware of it and did nothing to solve it.

etc.
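As a starting point for the last two items, here is a minimal sketch. It assumes the common/combined log format used by Apache and nginx, and the file name access.log is just a placeholder; it counts HTTP status codes and the distinct clients that ever received a 4xx/5xx response.

    import re
    from collections import Counter

    # Matches the start of a common/combined format access log line.
    LINE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) '
    )

    status_counts = Counter()
    clients_with_errors = set()

    with open("access.log") as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue  # skip malformed lines
            status = m.group("status")
            status_counts[status] += 1
            if status.startswith(("4", "5")):
                clients_with_errors.add(m.group("ip"))

    print(status_counts.most_common(10))
    print(f"{len(clients_with_errors)} distinct clients hit a 4xx/5xx response")

Cross-checking that second number against what your error reporting tool actually recorded is exactly the comparison described above.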

Example 2: files

Having over 100,000 files on your PC can be another source of large data sets, especially if you start logging what happens to those files. For example, years ago I wrote a monitoring system that recorded the size of some directories and alerted me when something went wrong (e.g., a large increase in size). The collected data was pretty small, but you could build a similar tool that collects more detailed data to process later.
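As a rough illustration only (not the tool I actually wrote; the watched paths and the growth threshold are made up), such a monitor can be as simple as this:

    import json
    import os

    SNAPSHOT = "dir_sizes.json"        # previous run's sizes
    WATCHED = ["/var/log", "/home"]    # directories to watch; adjust to taste
    THRESHOLD = 0.5                    # alert on >50% growth since last run

    def dir_size(path):
        """Total size in bytes of all files under path."""
        total = 0
        for root, _, files in os.walk(path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # file vanished or is unreadable
        return total

    previous = {}
    if os.path.exists(SNAPSHOT):
        with open(SNAPSHOT) as f:
            previous = json.load(f)

    current = {d: dir_size(d) for d in WATCHED}
    for d, size in current.items():
        old = previous.get(d)
        if old and (size - old) / old > THRESHOLD:
            print(f"WARNING: {d} grew from {old} to {size} bytes")

    with open(SNAPSHOT, "w") as f:
        json.dump(current, f)

Run it from a scheduled task and you have a slowly growing history of how your disk usage evolves.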

What about, for example, graphically displaying the flow of your data between directories over time, to show which ones grow or shrink, which files change, and which never change? Or, if you prefer more business-oriented applications, you could process those statistics to decide which files should be backed up more or less frequently, or to tune your system by moving the most frequently accessed files to an SSD.

Example 3: log the server activity

What about logging the activity of your server (a minimal collection sketch follows this list) to determine, for example, that:

When you are logged in through Remote Desktop but do nothing, the CPU is still used more than when there are no logged-in users. Log off to reduce the power usage of your server.

Every Saturday at 3 AM, there is a peak in CPU usage that you didn't expect. Maybe there is a scheduled task you don't need?

Over the last four days, the average memory usage increased by 10 MB, even though you haven't changed anything on your server. How do you explain that?

For the last two hours, CPU usage has been constant at 95%, whereas it was at 25% before. Is it a DDoS attack?
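Collecting the data behind those scenarios doesn't take much either. Here is a minimal sketch, assuming the third-party psutil package is installed (pip install psutil) and an arbitrary output file name, that appends one CPU/memory sample per minute to a CSV for later analysis:

    import csv
    import time
    from datetime import datetime

    import psutil  # third-party: pip install psutil

    # Append one CPU/memory sample per minute for later analysis.
    with open("server_activity.csv", "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            writer.writerow([
                datetime.now().isoformat(timespec="seconds"),
                psutil.cpu_percent(interval=1),                 # % CPU over 1 s
                psutil.virtual_memory().used // (1024 * 1024),  # MB in use
            ])
            f.flush()
            time.sleep(59)

Let it run for a few months and the questions above become simple queries over the resulting file.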

None of those scenarios are really Big Data cases, since you're dealing with a few GB of data, not tera-, peta-, or exabytes. But still, if you want to do something which is:

much more invigorating than just creating yet another e-commerce or web portal for a commercial company