I did a small side project on predicting a social phenomenon (I cannot divulge the exact topic or the code) using scraped social network data;

So far I can see that I did it more or less well (ROC AUC ~0.9 on train, ~0.7 on validation, though with a very small amount of annotated data - 5-7k items; also a small survey of 100+ people showed that my predictions mostly agreed with its results);

Unsurprisingly the best features are:

The plainest indicators (age, how many friends / audios you have - simple activity and popularity counts);

Or the features that best correlate with your actual latest actions where you have intent (in my case - group reposts and your friend / network metrics; likes would probably be even better, but they are harder to collect) - nobody gives a shit about data you have not updated in your profile in years;

The natural log of your id in the social network is a good feature (it is the least prone to trickery and may indicate age / tech savviness / social group, etc.);
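A sketch of what this feature looks like in practice (the user ids below are made up for illustration):

```python
import numpy as np

# Hypothetical user ids: low ids = early adopters, high ids = recent signups.
ids = np.array([1_234, 56_789_012, 350_000_000])

# ln(id) compresses the huge id range into a well-behaved feature that
# roughly proxies account age / tech savviness / social group.
log_id = np.log(ids)
```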

In a nutshell social networks are much more mundane and harsh than you may think. Herd mentality, 90% of content being stupid shit and reposts, shitty popular public groups and posts at their finest;

Despite what the top solutions of the best 2017 tabular-data Kaggle competitions may suggest, you actually need very moderate hardware for such tasks (if you are in practical reality);

A bit of inside information - as of 2017, clever handling of vk.com API timeouts plus access to API tokens with high limits actually lets you get the necessary data legally and without any botnets or proxy lists;
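A minimal sketch of the client-side throttling meant here, assuming the usual per-token requests-per-second cap of the vk.com API; the `fetch_user` stub is hypothetical (a real call would hit an HTTP method such as `users.get` with an access token):

```python
import time

class RateLimiter:
    """Sleep just enough between calls to stay under `rps` requests/second."""
    def __init__(self, rps=3):
        self.min_interval = 1.0 / rps
        self.last_call = 0.0

    def wait(self):
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

def fetch_user(uid, limiter):
    # Hypothetical stand-in for a real request like
    # GET https://api.vk.com/method/users.get?user_ids=<uid>&access_token=...
    limiter.wait()
    return {"id": uid}

limiter = RateLimiter(rps=3)
users = [fetch_user(uid, limiter) for uid in (1, 2, 3)]
```

Backing off on timeout responses instead of hammering the API is what keeps the whole thing within the official limits.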

This is probably also a good reason why the best people (and sometimes even rich people) use social media only for professional reasons. If you want to have a strong voice, it is best to be faceless and voiceless.

For this reason, btw, oppressive companies and countries fear third-party messengers so much. I personally advocate Telegram - it is fast, easy to use, cross-platform, has a vibrant community and a plethora of simple yet efficient features (I make no claims about its security). Also, as of 2017 the messenger itself has not sold out, but the majority of public channels are kind of cringe now.

1 So, what is the fuss about?

Some random guys asked me to finish off a prediction of a social phenomenon for ~200k+ vk.com users (2 other people started it, with different coding styles I have mixed feelings about =) ). The data was scraped from the vk.com public API. I finished the task, and in doing so I was able to compare some of the best-performing features.

Also, I remember that a year ago some people were selling me the BS that such tasks are really difficult. No, they are not. The IT/ML fields have a veneer of exclusivity - a reserved club - when in fact they are not.

Scraping every user on vk.com daily for 5 years is difficult. The data science and ETL parts are not.

2 Why is it useful?

Well, for 2 basic reasons:

Some ideas for your company's ML algorithms;

Your data is not yours, and being private is real currency nowadays. Just pause and ponder:

The fact that I have personally heard people claiming to be doing such data collection for law enforcement agencies;

You probably get my point.

3 My stack + a couple of tricks

Basically you just need a set of plain vanilla ML-libraries and this XGBoost tutorial. That's it.

Basic libraries:

xgboost

pandas

numpy

scipy

sklearn

matplotlib

Also read this series for simple ETL jobs in pandas before jumping into something like this. Arguably you can do all the ETL with various methods (extended SQL, functional-style scripts, shell scripts, even some higher-level Python libraries like Luigi or Dask), but with the leaps in modern PC power, chances are a single powerful workstation (plus probably something along the lines of bcolz if you do not have enough RAM or are HDD-bound) will be more than enough.
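For illustration, a toy pandas ETL step of the kind meant here (the column names and values are fabricated; a real job would read the raw scrape from disk instead of an inline literal):

```python
import numpy as np
import pandas as pd

# Hypothetical raw scrape: one row per user, counters may be missing.
raw = pd.DataFrame({
    "uid":     [1_234, 56_789_012, 350_000_000],
    "friends": [120,   None,       45],
    "audios":  [10,    300,        None],
})

# Derive a compact model-ready feature frame: fill missing counters
# with zero and add the log-id feature mentioned above.
features = pd.DataFrame({
    "log_uid":    np.log(raw["uid"]),
    "friends":    raw["friends"].fillna(0),
    "audios":     raw["audios"].fillna(0),
    "popularity": raw["friends"].fillna(0) + raw["audios"].fillna(0),
})
```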

I ended up using a slightly modified version of the training script from the tutorial above. Comparing feature groups:

Surprisingly, time / activity-related features and manual group and post annotation features performed very poorly; the groups that carried the model were:

Numerical features from profiles

Extended friend counters

Friend counters

5 What did not work / I did not try

Well, having info on 10M user posts, you would think that unleashing the raw power of LSTMs with pre-trained embeddings would help... but 90%+ of this content is just public page reposts, which in turn is well covered by a plain SVD matrix decomposition.