Please be more careful when interpreting SO Developer Survey data

These types of surveys are interesting and useful, but each year I find myself pulling my hair out at poor analyses by the press and internal analysts. As an example:

The analysis of the Evaluating Competence question:

We asked respondents to evaluate their own competence, for the specific work they do and years of experience they have, and almost 70% of respondents say they are above average while less than 10% think they are below average. This is statistically unlikely with a sample of over 70,000 developers who answered this question, to put it mildly.

is seriously flawed, and represents a misunderstanding of what "statistically unlikely" means.

First of all, there are no inferential statistics computed here, only summary statistics. Implicit in this analysis is a comparison between the distribution of competence in the population and a distribution of competence in the sample. See below for a brief discussion of the implied comparison. You cannot say whether the difference between your sample distribution and the population is "statistically likely" or not without inferential statistics.

If you did run an analysis using inferential statistics, you could make a statement about how likely it is that a distribution from a random sample of the population would have the characteristics that this sample does. You would not be able to draw a conclusion about whether respondents are biased in their evaluation of their own competence, or whether your sample was biased. Because of your methodology, we must assume a biased sample. Inferential statistics of SO survey data have minimal value in this context (comparing distributions to the population of developers) because respondents were not sampled using random sampling methods.
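To make the distinction concrete, here is a sketch (in Python, using the rough numbers from the quoted analysis) of the naive one-sample test of proportions that the phrase "statistically unlikely" implicitly gestures at. Even this test is only meaningful for a random sample; with a self-selected one, the z-statistic says nothing about the population of developers.

```python
import math

# Naive one-sample test of proportions: is the observed share of
# "above average" responses consistent with the 50% expected if
# "average" means the median?  Numbers approximate the survey's.
n = 70_000     # respondents to this question
p_hat = 0.70   # observed proportion saying "above average"
p0 = 0.50      # proportion above the median, by definition

se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
z = (p_hat - p0) / se              # z-statistic

# Enormous (about 106) -- but the p-value this implies is only
# interpretable if respondents were a random sample, which they weren't.
print(f"z = {z:.1f}")
```

The point is not that the test "passes" or "fails"; it is that the machinery of inference assumes random sampling before the number means anything.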

This is a simple but crucial principle that, apparently, we don't hammer on enough in introductory statistics courses. Everyone seems to be able to parrot "correlation isn't causation", but this is equally important: you cannot generalize from a non-random sample!

Sample size doesn't save a biased sample:

Consider the case of the Literary Digest Election Poll of Landon vs. Roosevelt. A huge sample (2.4 million people) was used to generalize to the electorate at large, and it predicted Landon would win with 57% of the vote. In fact, the opposite occurred: Roosevelt won in a landslide, with 61% of the vote and a 24-point margin of victory. A much smaller sample (50,000) by Gallup used sampling methods that allowed for generalization, and it correctly predicted the Roosevelt landslide.

The challenges to generalization and inference here are the same challenges the 1936 Literary Digest poll had -- selection and non-response bias. 70,000 is a lot, but you cannot generalize from a non-random sample, even a big one. Consider that, with over 20 million developers globally, you would need about 1 million respondents to match the Literary Digest sample's proportion of the population of interest. And we know how that turned out.
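The Literary Digest failure is easy to reproduce in miniature. The simulation below is a toy model, not the actual 1936 response mechanism: it simply assumes Roosevelt voters respond at half the rate of Landon voters. The huge biased sample then converges confidently to the wrong answer, while the much smaller random sample lands near the truth.

```python
import random

random.seed(42)

TRUE_SUPPORT = 0.61  # Roosevelt's actual 1936 vote share

def biased_sample(n):
    # Hypothetical selection/non-response mechanism: Roosevelt voters
    # respond at half the rate of Landon voters.
    votes = []
    while len(votes) < n:
        v = random.random() < TRUE_SUPPORT      # True = Roosevelt voter
        respond_rate = 0.5 if v else 1.0
        if random.random() < respond_rate:
            votes.append(v)
    return sum(votes) / n

def random_sample(n):
    # Every voter equally likely to be polled.
    return sum(random.random() < TRUE_SUPPORT for _ in range(n)) / n

big_biased = biased_sample(200_000)    # scaled down from 2.4M for speed
small_random = random_sample(50_000)   # Gallup-sized random sample

print(f"huge biased sample:  {big_biased:.3f}")    # ~0.44 in this model: badly wrong
print(f"small random sample: {small_random:.3f}")  # close to 0.61, the truth
```

Making the biased sample ten times larger only tightens its estimate around the wrong value; sample size reduces random error, not bias.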

A comment on the response by the analyst:

That paragraph I wrote was intended to be a little light-hearted, but I'm willing to stick by it.

This is disappointing. The reasoning in the answer is mostly about the plausibility of the hypothesis, and about whether there is data on any association between the known bias and the variable of interest. While I agree it is plausible, and even likely, that developers overestimate their abilities, this is not at all the point. We could make that argument without the survey. The most basic point here has little to do with the conclusion. The analysis itself contains an error, and it is incorrect regardless of whether the conclusion it supports is true. It is an error to use the sample size of a non-random sample to support the underlying comparison with the population of interest. Sample size can decrease random error, but not bias. I would hope Dr. Silge would consider carefully why she thinks the sample size of ~70,000 provides additional support for her comparison with the population, and what exactly is "statistically unlikely".

Please note that I'm not coming at this from the perspective that there is nothing useful to be learned here. The SO developer survey is a useful undertaking. I would just suggest more care be taken when interpreting the data. Here, in particular, please own your errors when someone points them out.

The implied comparison:

Average competence is the midpoint of the distribution of competence. In this case, "average" is the median of the distribution, and, in any valid measure of competence, the median happens to have the same value as the mean. Average is explicitly defined in this literature (see Kruger and Dunning) to be the 50th percentile, the median.

The analysis of the proportion of respondents who said they were above average (70%) rests on two expectations: that 50% of the population is above average competence by definition, and that the sample should have a distribution of competence similar to the population's.
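The first expectation doesn't depend on the shape of the competence distribution. A quick check in Python, with an arbitrary skewed distribution standing in for "competence", shows that, by construction, half of any continuous sample sits above its own median:

```python
import random
import statistics

random.seed(1)

# "Competence" scores from a deliberately skewed distribution: the
# shape doesn't matter, only that it's continuous (no ties).
scores = [random.lognormvariate(0, 1) for _ in range(100_001)]

median = statistics.median(scores)
share_above = sum(s > median for s in scores) / len(scores)

print(f"share above the median: {share_above:.3f}")  # always ~0.500
```

Swap in any other continuous distribution and the result is the same; the "50% are above average" baseline is definitional, not an empirical claim about developers.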

Stack Exchange data scientist Julia Silge regarding last year's survey (emphasis mine): "we do have great evidence that survey respondents and/or SO engaged users are not representative of all developers. As one example, ~18% of US CS undergrad degrees currently go to women but ~9% of US survey respondents were women". Stack has people who know better, yet they continue to present conclusions that assume representativeness. I won't speculate as to the incentives or motivations, but some possibilities come to mind...
– Jeremy Banks, Apr 10 '19 at 20:53

@Jeremy it drives me nuts every year, but this year, with the actual phrase "statistically unlikely" in the analysis, I thought I would write something.
– De Novo, Apr 10 '19 at 20:56

@DeNovo Thank you for making this point. It grates on me every year. I'm no statistician myself, but it seems pretty obvious that a self-selected sample can't be generalized to the population at large. The survey has entertainment value, but it doesn't have statistical value.
– Ryan Lundy, Apr 11 '19 at 7:43

From the linked "Literary Digest Election Poll of Landon vs. Roosevelt" article, this line towards the end was interesting: "The most extreme form of nonresponse bias occurs when the sample consists only of those individuals who step forward and actually 'volunteer' to be in the sample."
– Jon Schneider, Apr 11 '19 at 16:55

Interesting, I've never studied stats, but the exact same issue jumped out at me: it's entirely possible that StackOverflow users (or at least, respondents to the poll) are more competent than the broader developer population.
– Steve Bennett, Apr 12 '19 at 0:29

Not only is this a non-random sample of developers, it's not a developer sample at all. It's a non-random sample of Stack Overflow viewers. There was no requirement for anyone to be a developer, or even to answer honestly.
– Sklivvz, Apr 13 '19 at 8:44

There may be a point here, but it sounds like the original statement was not meant to be taken this seriously.
– Andrew, Apr 13 '19 at 14:25

What's a better methodology for SO to try to get a representative survey response? Given affordable cost constraints, i.e. not phoning a sample of their users.
– smci, Feb 11 at 2:32

3 Answers

With respect to the "evaluating competence" metric, I think what the SO folks thought they'd get was a bell curve where the top of the bell was right down the middle, so that half of the respondents would say they are above average, and half would say that they are below. That is, after all, what "average" means, right?

But this analysis makes several invalid assumptions:

That developers on Stack Overflow are a representative sample of the entire developer population as a whole, and,

That developers who self-select to take the survey are a representative sample of developers on Stack Overflow generally, and

That developers' self-evaluations of their own competency are equivalent to evaluating individuals' competency against a representative group (assuming you have such a group).

These kinds of studies are fundamentally limited in their validity due to selection bias; any conclusions drawn by such studies must be taken with a not-insignificant sized grain of salt.

This kind of survey is certainly useful for generating hypotheses. You could then use valid sampling methods to get some answers, but we never seem to take that step.
– De Novo, Apr 10 '19 at 20:53

4. That respondents know about the Dunning–Kruger effect. BTW, I answered below average. Why? Because by nature I tend to read stuff I don't need help with.
– Braiam, Apr 10 '19 at 21:10

@DeNovo I'd argue that it's not really even that useful for generating hypotheses. At best it's kind of a weak census of SO users, but any conclusion about the broader developer community is kind of a stretch.
– rjzii, Apr 11 '19 at 3:07

There are many characteristics that you can ask people about where most people will say they are above (or below) average. That's because you are not actually measuring their skill; you are measuring either their self-perceived skill or the level of skill they think it is socially acceptable to self-report.
– Elin, Apr 11 '19 at 3:58

Because Dunning/Kruger Effect is a thing ...
– user10677470, Apr 11 '19 at 15:09

@RobertHarvey Sure, but this survey instrument is not very useful for drawing any conclusions about developers and imposter syndrome. At best, and it would be tenuous, you could use the results to justify a study that really looks for a connection between developer experience and the imposter effect.
– rjzii, Apr 12 '19 at 18:35

"any conclusions drawn by such studies must be taken with a not-insignificant sized grain of salt." (emphasis added) Please be more careful about drawing conclusions about the statistical significance of salt grain size. ;)
– Asaph, Apr 13 '19 at 15:31

#3 very true. I feel like an impostor often (esp if I've received a # of journal or job rejections that week) and if asked "how good of a programmer are you?" would likely answer on the lower side of possible options. However, when asked "how good of a developer are you?" the value is much higher. Due to the fact that my peers in data (where I'm just starting out) may never have done development. Brilliant programmers and mathematicians but not all understand automation tools, SDLC, setting up testing environments, or etc because they don't use it. Rating depends on talent (& focus) of peers
– LinkBerest, Apr 13 '19 at 21:24

FYI @RobertHarvey I would be surprised if that is what the SO survey team expected when they wrote this question. The Dunning–Kruger effect is very hot right now in pop sociology and makes for good snarky headlines, with all the misunderstanding that entails. I imagine this is exactly what they expected to see.
– De Novo, Apr 15 '19 at 16:08

@DeNovo: To be fair, the first paragraph of my answer is meant to be slightly tongue-in-cheek, in the same vein as the original comment that precipitated this discussion. I'm dead serious about the rest, though.
– Robert Harvey, Apr 15 '19 at 16:17

It's funny; I chuckled a little when this Meta post crossed my path. The reason I chuckled is that, when I read that passage in the post about the SO Developer Survey, I had roughly the following sequence of thoughts:

However, I know perfectly well what the author's point is: there is a long standing and well established body of research on the tendency for people to overestimate their abilities, this summary statistic is broadly consistent with that, and it's reasonable to suspect that a similar phenomenon is at work here.

Some not insignificant number of people are going to make a much bigger deal out of this less than optimal choice of phrasing on Meta than it really deserves.

This is exactly why I happen to think this is important. Data, not properly analyzed, matches a phenomenon that is often poorly understood by the general public, and is seen as supporting it. It would be one thing if this was a little blog, but the SO Developers Survey is broadly read, reported, and internalized as a source of truth about the field. You may have the sophistication to not take it seriously, but have still concluded "it's reasonable to suspect a similar phenomenon is at work here." It may be reasonable to suspect that a priori, but these data do not give any additional support.
– De Novo, Apr 13 '19 at 22:36

Rather than making a possibly tongue-in-cheek statement suggesting that these data (somehow strongly) support an example of biased self-assessment, a responsible analysis would be very clear that they don't. Again, we should expect accurate self-assessment in developers to depend on competence, numeracy, culture, and the manner of assessment. But these data have nothing to say about that.
– De Novo, Apr 13 '19 at 22:42

no, but I did mean "data, not properly analyzed, match" rather than "data... matches"
– De Novo, Apr 14 '19 at 0:42

The world is full of examples of this kind of thinking. "Oh, I'll just compromise a little here; nobody will notice or care." Then you compromise a little more the next time. Before you know it, you're a used car salesman.
– Robert Harvey, Apr 14 '19 at 4:02

@RobertHarvey Implying that Julia is on her way to having the morals of a used car salesman over this incident isn't likely to convince me that you aren't overreacting. Or were you phrasing something in a not technically accurate way in order to make a point, perhaps humorously? 🤔
– joran, Apr 14 '19 at 4:27

Frankly, the observation about ability overestimation doesn't bother me. It's the part about artificially adjusting for selection bias that seems hinky. While it doesn't skew the numbers much, I don't think you can take a bad sample and make it good just by fudging the numbers.
– Robert Harvey, Apr 14 '19 at 15:09

@RobertHarvey there are some very particular ways that you can recover from a biased sample, but none of them allow you to make a general statement about one variable in the population based on one variable in the sample.
– De Novo, Apr 14 '19 at 16:45

I'm the data scientist who worked on the survey this year and I wrote that piece of text, so the responsibility for it mainly rests with me.

We asked respondents to evaluate their own competence, for the specific work they do and years of experience they have, and almost 70% of respondents say they are above average while less than 10% think they are below average. This is statistically unlikely with a sample of over 70,000 developers who answered this question, to put it mildly.

The Stack Overflow Developer Survey has real issues when it comes to how representative it is. The main axis along which the sampling bias exists is participation on Stack Overflow; we field the survey on our site so we sample more from developers who are more engaged on our site. This has secondary effects on our sample. Developers from underrepresented groups in tech participate on Stack Overflow at lower rates, so we undersample those groups, compared to their participation in the software developer workforce. We have data that confirms that, but certainly I would also expect to undersample parents, folks from tech communities who are active away from Stack Overflow, people who code for work but aren't sure if the word "developer" applies to them, and more.

That paragraph I wrote was intended to be a little light-hearted, but I'm willing to stick by it.

The hypothesis that developers who are more engaged on Stack Overflow are more skilled overall, to such a dramatic degree, is not something I have ever seen data confirm. I value our community and the resource we are building here together! However, someone who engages a lot on Stack Overflow is largely better at navigating Stack Overflow, not more or less good at coding overall. We have, for example, data around traffic patterns for registered and anonymous users, for low- and high-reputation users, etc., that points to this.

I don't say this to minimize the value of understanding Stack Overflow and how to participate here. I wouldn't work here if I didn't highly value the resource we are all creating! At the same time, such a hypothesis is, in my opinion, far-fetched, especially when the well-studied cognitive bias of illusory superiority is... right there.

Absence of evidence is not evidence of absence. Just sayin'. :)
– Robert Harvey, Apr 10 '19 at 21:49

It doesn’t seem completely absurd that the most dedicated and knowledgeable developers overlap with the developers who use Stack Overflow. First because the people who are motivated to use a site like this in their free time to share knowledge tend to be, well, knowledgeable and willing to learn. Second because using this site actually makes you a better developer. Not exclusively so—reading books or blogs about programming would do the same thing. But being a contributing member of Stack Overflow is definitely an indicator of someone who is devoted to their craft.
– Cody Gray♦, Apr 10 '19 at 21:51

I think it's statistically significant that so many people think that they are better than average. 70%, when the reality is that 40% should be in that bin. Even if SO users are better than average developers, that is absurdly high; almost laughably so that so many people think they're better than average.
– Nick Vitha, Apr 10 '19 at 21:55

And 40% is like the upper range of where it should be.
– Nick Vitha, Apr 10 '19 at 22:02

Your interpretation is a reasonable hypothesis, but using the phrase "statistically unlikely", and specifically referring to the sample size, makes it very hard to take your analysis seriously. I would, by the way, encourage you to further interrogate your hypothesis using appropriate methods.
– De Novo, Apr 10 '19 at 22:24

@NickVitha even out of 10M users of SO, only 90K answered (less than 1%), so it is possible that 100% of those who answered are above average; they could all be in the top 10% (or even top 1%) of developers who are SO users. And since the set of all developers is presumably bigger than the set of all SO users, everyone who replied to the survey could truly be in the top 1% of all developers, well above average... (I don't think that is the case, but it's plausible)
– Alexei Levenkov, Apr 10 '19 at 22:30

@NickVitha you may think it is significant, but you can't correctly call it statistically significant
– De Novo, Apr 10 '19 at 22:39

@JuliaSilge, you don't need to endorse an alternate explanation to be more careful about the way you report your summary findings. You're measuring self-perception here, and you have some support for an interesting finding (in particular, a difference in subgroups). Please don't say "statistically" though, unless you can back it up with an appropriate statistic.
– De Novo, Apr 10 '19 at 23:28

I'm honestly befuddled by the lack of numeracy here. Since nobody else has mentioned it, I suppose I will: @NickVitha and the upvoters, the expected value for "above average" is 50%. Not 40%, not, like, less than 40%. 50%. 50% are above average. That is what average means.
– De Novo, Apr 11 '19 at 1:19

What @CodyGray said. I would consider it extraordinary if developers who volunteer their spare time to contribute on Stack Overflow are not above average relative to the rest of the community.
– Alex Harvey, Apr 11 '19 at 3:53

The correct conclusion to draw is not that it's statistically unlikely that 70% of survey respondents are above-average. Nor that it's statistically likely that 70% of survey respondents are above-average. The correct conclusion to draw is that there's not enough information to know how many survey respondents are truly above-average. The data doesn't support anything more than this.
– Ryan Lundy, Apr 11 '19 at 8:02

@DeNovo 50% are above average. That is what average means. 50% are above median, that is what median means.
– BrakNicku, Apr 11 '19 at 12:23

Of all ways you could have poked lighthearted fun at the possible self-bias effects, which certainly exist, it doesn't seem this case in particular was the most well thought out one. There's certainly plenty of wiggle room in your interpretation, and one could easily imply you simply don't like the results yourself.
– lucasgcb, Apr 11 '19 at 15:16

"Developers from underrepresented groups..." Categorizing them as "underrepresented" assumes you know the optimal representation. How did you calculate the optimal representation?
– jpmc26, Apr 12 '19 at 1:34

@YvetteColomb perhaps this would be better in a separate meta post, but as I understand it, comments in meta do not have the same narrow use as comments on the main site, and discussion with some back and forth is appropriate for comments here.
– De Novo, Apr 14 '19 at 16:54