
$\begingroup$Best asked on Data Science Stack Exchange, but there are plenty of ML platforms designed to scale to larger data sets (as are typical for e.g. computational advertising); see Spark, Dask, ...$\endgroup$
– seanv507, Mar 31 at 14:57

2 Answers

Use a model that can learn incrementally if you want to try using the whole dataset. All deep learning frameworks support mini-batch processing, so you can train a multilayer perceptron, or use a single layer for logistic regression or a linear SVM. Gradient-boosted trees also support incremental learning and GPU training.
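A minimal sketch of the mini-batch pattern, assuming a recent scikit-learn (where the logistic loss is spelled `loss="log_loss"`; older versions use `loss="log"`). `batch_stream` is a hypothetical stand-in for whatever actually streams chunks out of your database:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def batch_stream(n_batches=100, batch_size=10_000, n_features=200, seed=0):
    """Stand-in for a generator that pulls (X, y) chunks from the database."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.standard_normal((batch_size, n_features))
        y = (X[:, 0] + 0.1 * rng.standard_normal(batch_size) > 0).astype(int)
        yield X, y

# Logistic regression fit one mini-batch at a time; loss="hinge" gives a linear SVM instead.
clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])
for X_batch, y_batch in batch_stream():
    clf.partial_fit(X_batch, y_batch, classes=classes)
```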

50M samples with 200 features is possibly doable overnight on standard hardware. At roughly 40 GB of data (assuming 4-byte single-precision features), I/O to the database is the likely bottleneck, so sampling and chunking should be pushed down to the DB.
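One way to do that chunking, sketched with pandas reading directly from the database; the connection string, table, and column names are hypothetical placeholders for your own schema:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string, table, and column names; adjust to your schema.
engine = create_engine("postgresql://user:password@host/db")
feature_cols = [f"f{i}" for i in range(200)]
query = "SELECT {}, label FROM samples".format(", ".join(feature_cols))

# chunksize turns read_sql into an iterator of DataFrames, so only one chunk
# of rows sits in memory at a time; each chunk can be fed to partial_fit above.
for chunk in pd.read_sql(query, engine, chunksize=100_000):
    X_batch = chunk[feature_cols].to_numpy(dtype="float32")
    y_batch = chunk["label"].to_numpy()
    # ... pass (X_batch, y_batch) to the incremental learner ...
```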

Compute a learning curve to estimate how many observations are actually needed to reach the desired performance. For example, train on 1/1000, 1/100, 1/10, etc. of the data and watch the effect on the validation scores.
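A sketch of that learning-curve check, assuming a manageable sample has already been pulled into memory (synthetic data stands in for it here, and `loss="log_loss"` again assumes a recent scikit-learn):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a sample already pulled from the database.
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 200))
y = (X[:, 0] + 0.5 * rng.standard_normal(100_000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Train on growing fractions of the training set and watch the validation score;
# once the curve flattens, extra observations are buying very little.
for frac in (0.001, 0.01, 0.1, 1.0):
    n = int(frac * len(X_train))
    idx = rng.choice(len(X_train), size=n, replace=False)
    model = SGDClassifier(loss="log_loss").fit(X_train[idx], y_train[idx])
    auc = roc_auc_score(y_val, model.decision_function(X_val))
    print(f"fraction={frac:<6} n={n:<6} validation AUC={auc:.3f}")
```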

I believe there are still clear situations where sampling is appropriate, within and without the "big data" world, but the nature of big data will certainly change our approach to sampling, and we will use more datasets that are nearly complete representations of the underlying population.

On sampling: Depending on the circumstances it will almost always be clear if sampling is an appropriate thing to do. Sampling is not an inherently beneficial activity; it is just what we do because we need to make tradeoffs on the cost of implementing data collection. We are trying to characterize populations and need to select the appropriate method for gathering and analyzing data about the population. Sampling makes sense when the marginal cost of a method of data collection or data processing is high. Trying to reach 100% of the population is not a good use of resources in that case, because you are often better off addressing things like non-response bias than making tiny improvements in the random sampling error.

How is big data different? "Big data" addresses many of the same questions we've had for ages, but what's "new" is that the data collection happens off an existing, computer-mediated process, so the marginal cost of collecting data is essentially zero. This dramatically reduces our need for sampling.

When will we still use sampling? If your "big data" population is the right population for the problem, then you will only employ sampling in a few cases: the need to run separate experimental groups, or if the sheer volume of data is too large to capture and process (many of us can handle millions of rows of data with ease nowadays, so the boundary here is getting further and further out). If it seems like I'm dismissing your question, it's probably because I've rarely encountered situations where the volume of the data was a concern in either the collection or processing stages, although I know many have.

The situation that seems hard to me is when your "big data" population doesn't perfectly represent your target population, so the tradeoffs are more apples to oranges. Say you are a regional transportation planner, and Google has offered to give you access to its Android GPS navigation logs to help you. While the dataset would no doubt be interesting to use, the population would probably be systematically biased against the low-income, the public-transportation users, and the elderly. In such a situation, traditional travel diaries sent to a random household sample, although costlier and smaller in number, could still be the superior method of data collection. But, this is not simply a question of "sampling vs. big data", it's a question of which population combined with the relevant data collection and analysis methods you can apply to that population will best meet your needs.

$\begingroup$I'm thinking in this instance that the big data set IS the population (e.g. a customer database). How is it possible to run model fitting reasonably on a 50M-observation, 200-feature dataset? You could reasonably sample from the population - but then you still have a large dataset (say 10M observations). How is this handled in practice? In this case modeling the sample would appear untenable without substantial time and computing resources.$\endgroup$
– Windstorm1981, Mar 31 at 15:19

$\begingroup$Whenever one applies techniques of statistical inference, it is important to be clear as to the population about which one aims to draw conclusions. Even if the data that has been collected is very big, it may still relate only to a small part of the population, and may not be very representative of the whole. Suppose for example that a company operating in a certain industry has collected 'big data' on its customers in a certain country. If it wants to use that data to draw conclusions about its existing customers in that country, then sampling might not be very relevant.$\endgroup$
– Massoud, Mar 31 at 15:25

$\begingroup$If however it wants to draw conclusions about a larger population - potential as well as existing customers, or customers in another country - then it becomes essential to consider to what extent the customers about whom data has been collected are representative - perhaps in income, age, gender, education, etc - of the larger population.$\endgroup$
– Massoud, Mar 31 at 15:27

$\begingroup$The time dimension also needs to be considered. If the aim is to use statistical inference to support predictions, then the population must be understood to extend into the future. If so, then again it becomes essential to consider whether the data set, however large, was obtained in circumstances representative of those that may obtain in the future.$\endgroup$
– Massoud, Mar 31 at 15:28