Profiling Kaggle's user base

It's been almost five months since Kaggle launched its first competition and the project now has a user base of around 2,500 data scientists. I had a look at the make-up of the Kaggle user base for a recent talk that I gave in Sydney. For those interested, the highlights are below.

The largest share of users comes from North America (followed by Europe, India and Australia).

| Country | Proportion (%) |
| --- | --- |
| United States | 35.6 |
| United Kingdom | 9.7 |
| India | 8.9 |
| Australia | 6.6 |
| Canada | 3.8 |
| France | 3.3 |
| Germany | 2.0 |
| China | 1.8 |
| Netherlands | 1.4 |
| Brazil | 1.4 |
| Spain | 1.3 |
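As a rough check on the regional ordering, the country figures can be summed by region. This is only a sketch: the country-to-region grouping is an assumption added for illustration, and because the table lists only the top countries, each total is a lower bound for its region.

```python
# Proportions copied from the country table (read as percentages).
country_share = {
    "United States": 35.6,
    "United Kingdom": 9.7,
    "India": 8.9,
    "Australia": 6.6,
    "Canada": 3.8,
    "France": 3.3,
    "Germany": 2.0,
    "China": 1.8,
    "Netherlands": 1.4,
    "Brazil": 1.4,
    "Spain": 1.3,
}

# Assumption: this grouping is illustrative and does not come from the post.
region_of = {
    "United States": "North America",
    "Canada": "North America",
    "United Kingdom": "Europe",
    "France": "Europe",
    "Germany": "Europe",
    "Netherlands": "Europe",
    "Spain": "Europe",
    "India": "India",
    "Australia": "Australia",
    "China": "China",
    "Brazil": "Brazil",
}


def regional_totals(shares, regions):
    """Sum country shares within each region, rounded to one decimal place."""
    totals = {}
    for country, share in shares.items():
        region = regions[country]
        totals[region] = totals.get(region, 0.0) + share
    return {region: round(total, 1) for region, total in totals.items()}


totals = regional_totals(country_share, region_of)
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for region, share in ranked:
    print(f"{region}: {share}")
```

On the listed countries alone, North America comes to 39.4 and Europe to 17.7, which is consistent with the ordering described above.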

Of those who have signed up with university email addresses, most come from North American universities (although there is an inexplicably large number of users from Sabanci University in Turkey).

| Email domain | Proportion (%) |
| --- | --- |
| sabanciuniv.edu | 7.1 |
| umich.edu | 3.8 |
| harvard.edu | 2.1 |
| javeriana.edu.co | 2.1 |
| mit.edu | 2.1 |
| duke.edu | 1.7 |
| gatech.edu | 1.7 |
| nthu.edu.tw | 1.7 |
| psu.edu | 1.7 |
| stanford.edu | 1.7 |
| unimelb.edu.au | 1.7 |
| columbia.edu | 1.3 |
| imperial.ac.uk | 1.3 |
| nd.edu | 1.3 |
| ualr.edu | 1.3 |
| uchicago.edu | 1.3 |
| yale.edu | 1.3 |

Those who fill in the education section of the profile are typically trained in computer science, statistics, econometrics, mathematics and electrical engineering.

| Training | Proportion (%) |
| --- | --- |
| Computer Science | 15.6 |
| Statistics | 11.6 |
| Economics and Econometrics | 10.0 |
| Mathematics | 8.8 |
| Electrical Engineering | 7.2 |
| Bioinformatics, Biostatistics and Computational Biology | 6.4 |
| Physics | 5.2 |
| Finance and Computational Finance | 4.8 |
| Operations Research | 3.2 |

Among those who nominate a favourite software package, R and Matlab are most popular.

| Favourite Software | Proportion (%) |
| --- | --- |
| R | 22.5 |
| Matlab | 16.2 |
| SAS | 12.7 |
| SPSS | 5.8 |
| WEKA | 3.5 |
| Excel | 2.3 |
| Minitab | 1.7 |
| Stata | 1.7 |

Those who filled in the favourite technique section of their profile typically like using neural networks, Bayesian methods, support vector machines and logistic regression.

| Favourite Technique | Proportion (%) |
| --- | --- |
| Neural Networks | 7.4 |
| Bayesian Methods | 6.5 |
| Support Vector Machine | 6.5 |
| Logistic Regression | 5.6 |
| Regression | 4.6 |
| Decision Trees | 3.7 |
| Linear Regression | 2.8 |

Anthony Goldbloom is the founder and CEO of Kaggle. Before founding Kaggle, Anthony worked in the macroeconomic modeling areas of the Reserve Bank of Australia and before that the Australian Treasury. He holds a first class honours degree in economics and econometrics from the University of Melbourne and has published in The Economist magazine and the Australian Economic Review.

It's sort of surprising to me that neural networks are the "most favorite" technique - I was under the impression that neural networks were considered passé, due to their slow training, numerous parameters, and ancient history. Is this just skew in Kaggle's user base, or an indication that the neural network approach is not so outdated as the eye-rolling I get from ML people would seem to suggest?

Anthony Goldbloom

@CHCH, I was under the same impression. Bear in mind that neural networks are only preferred by 7.4 per cent of those who report their favourite technique. And because we only recently started polling users on their favourite techniques (and favourite software), the sample size is small. (Hopefully this blog post will alert members to these newish profile fields.)

IDFP

Neural networks are not outdated. They are an area of active research in the machine learning community and are used in a variety of applications. Here is an example published this year. http://www.cs.toronto.edu/~vmnih/docs/road_detection.pdf
One point to take away from that paper is that even very basic neural networks can do surprisingly well when they have a very large number of hidden units and are trained on a very large dataset.

Also, "neural networks" could mean a lot of different things. I would view logistic regression and linear regression as one-layer neural networks. Maybe some of the people who said "Bayesian methods" use neural networks in a Bayesian way. Perhaps some people think of graphical models with a layered structure as neural networks. What about parametric models trained with gradient based approaches? The phrase "neural networks" could mean practically anything.
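To make the one-layer point concrete, here is a minimal sketch comparing the textbook logistic regression prediction with a network that has no hidden layer and a single sigmoid output unit. The parameter values and the input are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_regression_predict(x, coefficients, intercept):
    """Textbook logistic regression: sigmoid(intercept + coefficients . x)."""
    linear = intercept + sum(c * xi for c, xi in zip(coefficients, x))
    return sigmoid(linear)


def single_layer_network_predict(x, weights, bias):
    """One output unit, no hidden layer, sigmoid activation."""
    pre_activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(pre_activation)


# Hypothetical parameters and input, for illustration only.
params = [0.8, -1.2, 0.3]
offset = 0.5
example = [1.0, 2.0, -1.0]

p_lr = logistic_regression_predict(example, params, offset)
p_nn = single_layer_network_predict(example, params, offset)
print(p_lr, p_nn)
```

The two functions compute the same quantity under different names. Swapping the sigmoid for the identity function would give the linear regression prediction in the same way.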

Responding to CHCH now, I don't think neural networks have slow training compared to SVMs for instance. SVMs with general kernels often have quadratic or cubic training times as a function of the number of training cases and other techniques in the "kernel methods" family are typically just as bad. Certainly simple neural networks are much more scalable than many of the highly trendy non-parametric Bayesian techniques that are all the rage these days (don't get me wrong, I love this stuff). I assume by "numerous parameters" you mean numerous hyper-parameters, since one wants as many parameters as one can get away with. IMHO, the proliferation of hyper-parameters is hard to get away from with a lot of modern machine learning techniques, but for the simplest of feed-forward neural networks this isn't bad.
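One way to see where the quadratic term can come from is to count the entries of a full kernel matrix. The sketch below assumes an RBF kernel and small made-up data. It illustrates only the cost of forming the matrix, not the behaviour of any particular SVM solver, and practical solvers may avoid materialising the full matrix.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def gram_matrix(points, gamma=1.0):
    """Full n x n kernel matrix: one kernel evaluation per pair of training cases."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


def count_entries(matrix):
    """Number of stored kernel values."""
    return sum(len(row) for row in matrix)


# Made-up two-dimensional points, for illustration only.
for n in (10, 20, 40):
    points = [[float(i), float(i % 3)] for i in range(n)]
    print(n, count_entries(gram_matrix(points)))
```

Each doubling of the number of training cases quadruples the number of kernel values, whereas one pass of gradient training for a fixed-size network touches each training case once.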

Nathaniel Ramm

I am sure that reports of the death of neural networks are greatly exaggerated!
I think that the 'eye-rolling' effect when neural networks are discussed is due to the impression non-modellers have of predictive modelling. Neural networks are often mentioned in popular culture accounts of modelling (even in movies!), and some people have a glassy-eyed utopian view of what a neural network is, no doubt because of the 'brain' analogy.
However we all know that neural networks are 'JAFA' (Just Another Friggin' Algorithm)...

tilapia

It's just a silly artifact of the way the categories are partitioned. It's not like "logistic regression" people would insist on continuing to use it for a problem where the dependent variable was not binary. Had "regression" been a single category, it would include 13%.
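The 13% figure can be reproduced from the technique table. This is a small sketch; treating these three rows as one "regression" family is the grouping suggested in the comment above.

```python
# Proportions copied from the favourite technique table (percentages).
technique_share = {
    "Neural Networks": 7.4,
    "Bayesian Methods": 6.5,
    "Support Vector Machine": 6.5,
    "Logistic Regression": 5.6,
    "Regression": 4.6,
    "Decision Trees": 3.7,
    "Linear Regression": 2.8,
}

# The three rows that would merge into a single "regression" category.
regression_family = ["Logistic Regression", "Regression", "Linear Regression"]

combined = round(sum(technique_share[name] for name in regression_family), 1)
print(combined)
print(combined > technique_share["Neural Networks"])
```

With the rows merged, regression would sit ahead of neural networks at the top of the list.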
