Have you tried Python? It is also an open-source language and easy for beginners. More importantly, you can access R from Python almost seamlessly with the RPy package. I ran into the same problem you describe (although with less than the 100 GB you are facing) and solved it with Python. I also wrote a post about it: http://www.mathfinance.cn/life-is-short-use-python/.
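To make the "solve it with Python" suggestion concrete, here is a minimal sketch of the streaming style that makes this work: read the file row by row and keep only running aggregates, so memory use stays flat no matter how big the file is. The file and column names are made up for illustration.

```python
import csv
import tempfile

def column_mean(path, column):
    """Stream a CSV and average one column without loading the file."""
    total, count = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")

# Tiny demonstration file standing in for a multi-GB one.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as f:
    f.write("id,value\n1,10\n2,20\n3,30\n")
    path = f.name

print(column_mean(path, "value"))  # 20.0
```

The same loop shape works for counts, sums, group-by dictionaries, and anything else that can be updated one row at a time.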

Have a look at ROOT (root.cern.ch). It was created for particle-physics data, and we routinely analyze ntuples with more than a billion events. You can split the data across multiple files and then merge them for analysis. It has its limitations too, but it might be helpful.

SAS is good for large datasets because it has out-of-core algorithms. S-PLUS can also do this, as can Revolution Computing's version of R. All of these are commercial products. In the open-source world I have found Python to be great. I would also look at open-source databases such as MySQL and SQLite, though I haven't used them myself.
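The database route is worth a quick sketch: push the aggregation into SQL so only the summary rows, never the raw data, reach your analysis environment. This hypothetical example uses Python's built-in sqlite3 module with an in-memory database; with real data you would pass a file path instead.

```python
import sqlite3

# Table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")  # use a file path for on-disk data
conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# SQLite scans the table; Python only ever sees one row per group.
for grp, avg in conn.execute(
    "SELECT grp, AVG(value) FROM measurements GROUP BY grp ORDER BY grp"
):
    print(grp, avg)
```

The same pattern works from R via the DBI/RSQLite packages mentioned elsewhere in this thread.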

I regularly analyse data sets larger than 15 GB using the standard R distribution (and I am the author of the second article you reference). You do have to think and work somewhat differently from how the standard introductions to the language work, which is obviously a problem. And of course it depends on what you need to do: I ran into trouble with around 100 million call records when I tried to do social network analysis the naive way [1], but I eventually found a more fruitful way of analysing that data set.

Standard recommendations include the biglm, biganalytics, speedglm, and biglars packages, as well as DBI and friends.

In general (and this is probably better suited to a blog post than a comment), my approach is first to work hard at data selection and preparation, to make sure I am working on the right problem, and then to look for algorithms that I can execute in chunks and then combine. The latter is, of course, essentially what SAS does.
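The chunk-and-combine idea can be made concrete with simple linear regression, which is essentially what the biglm package does: each chunk contributes only a handful of sufficient statistics, so memory use is independent of the total number of rows. A minimal sketch, assuming the data arrives as separate chunks:

```python
def chunk_stats(xs, ys):
    """Sufficient statistics for simple OLS from one chunk of data."""
    n = len(xs)
    return (n, sum(xs), sum(ys),
            sum(x * x for x in xs),
            sum(x * y for x, y in zip(xs, ys)))

def combine(a, b):
    """Merging two chunks is just elementwise addition of their stats."""
    return tuple(u + v for u, v in zip(a, b))

def fit(stats):
    """Recover slope and intercept from the accumulated statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Two "chunks" of y = 2x + 1, processed separately and then merged.
s1 = chunk_stats([0, 1, 2], [1, 3, 5])
s2 = chunk_stats([3, 4], [7, 9])
print(fit(combine(s1, s2)))  # (2.0, 1.0)
```

The multivariate version accumulates X'X and X'y per chunk instead, but the shape of the computation is the same.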

Rick from SAS here. I think that the 2009 ASA Data Expo (http://stat-computing.org/dataexpo/2009/posters/) really helped expose many statistical programmers to the magnitude of data that corporations have to analyze every day. Taking part in the Expo was definitely an eye-opening experience for me, and it was fun to use SAS to analyze such a massive data set. For a summary, see http://support.sas.com/publishing/authors/extras/Wicklin_scgn-20-2.pdf

In the open source world, Kane and Emerson's bigmemory package (http://www.bigmemory.org/) is a great addition to the R arsenal. For his work on bigmemory, Kane was awarded the 2010 Chambers Award by the ASA Sections on Statistical Computing and Statistical Graphics.
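The core idea behind bigmemory is a file-backed, memory-mapped matrix: the operating system pages in only the parts you touch, so the object can be larger than RAM. A rough stdlib-only illustration of that idea in Python (bigmemory itself is an R package; the file layout here is invented for the example):

```python
import mmap
import struct
import tempfile

# Write a small file of doubles standing in for a huge file-backed matrix.
values = [1.5, 2.5, 3.5, 4.5]
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack(f"{len(values)}d", *values))
    path = f.name

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    # Random access to element i at byte offset i * 8,
    # without reading the whole file into memory.
    third = struct.unpack_from("d", mm, 2 * 8)[0]
    print(third)  # 3.5
    mm.close()
```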

Since memory is so cheap now, just install more memory in your machine so that you don't need a virtual-memory workaround for Stata or R. I regularly work with a dataset larger than 30 GB in Stata on a desktop computer with 32 GB of RAM installed. As long as your RAM exceeds the dataset size, it will run fine.

Think about this: for the cost of an Intel 160 GB SSD (a 320 or 510), you get near-RAM speeds from "disk files". I would expect that someone (not I, as my C is very old and clunky) will, in time, build a package that leverages SSDs. As for just loading up on memory, most PCs these days already ship with all DIMM slots filled with the largest supported DIMM.
