In my previous diary entry The
Return of Jake, I shared the triumphant tale of Jake The
Construction Worker. Also, I posted a poll in which I invited you,
dear readers of HuSi, to vote on what you thought Jake would do with
his winnings from the out-of-court settlement with his employer.

Of the twenty-two votes cast among the five options provided, a surprising
eight votes were cast for the last option, which had been reserved
"for cynical bastards only." If we consider the twenty-two votes to be
representative of the population of HuSi users, can we reasonably say that the population has a significant cynical-bastard component? That is, can we say that the "cynical" votes are unlikely to be explained by chance alone?

Well, what do you think?

The rules

Because people may continue to vote in the poll and thus change our playing
field, restrict your analysis to the following snapshot of the
poll's results:

Let us assume the following:

1. We are willing to accept no more than a five-percent chance of making the you-are-cynical-bastards conclusion falsely.

2. The poll is fair in that a vote for the last option indicates cynical bastardism and votes for other options do not.

3. The poll is fair in that all of its options are equally attractive if we ignore each voter's tendencies toward cynical bastardism.

The challenge

Given the above, does the poll's outcome provide significant evidence to support the conclusion that HuSi's population has a significant component that leans towards cynical bastardism?
Support your answer with sound reasoning (or brilliant distractions).

Because some unknown population will always pick the option they find most amusing regardless of what they really believe. (Which I suppose means that assumption 3 is incorrect.)
---[ucblockhead is] useless and subhuman

(Even though they may not in reality.) That way, we can eliminate the obvious answer – it's the goofy, CowboyNeal option that lots of people pick by default – and thus require the support of a more fun-with-math/entertaining/thoughtful analysis in order to answer the question.

(I had thought that the "Let us assume the following ..." lead-in to the assumptions would make my intent clear. Sorry about the confusion.)

If there's a 5% chance of choosing "cynical bastard", then one would expect an average of one (22 * 5%) incorrect cynical vote, which means that at least 1/3 of voters are truly cynical bastards. I can't say if this is "significant" as far as the general HuSi population goes without knowing how you're defining it. (Though I suspect the population size is too small.)

Of course, you have to deal with the case where some people may have wanted to choose more than one poll option; for instance, a voter might have wanted to vote for an F-150 and for feeding his insatiable drug habit.
---[ucblockhead is] useless and subhuman

I define it like so: "We are willing to accept no more than a five-percent chance of making the you-are-cynical-bastards conclusion falsely."
In other words, if we make the claim that the users are cynical, we don't want our chance of being incorrect to be more than 5%.

Of course, you have to deal with the case...

Well, you might choose to deal with that case, but I might not. For now, however, I ain't talkin' about how I might approach the problem. I'll leave the fun squarely in your capable hands.

In case by "significant" you were referring to this particular use: "Can we reasonably say that the population has a significant cynical-bastard component?" I explained what this means in the introduction to the story: "That is, can we say that the 'cynical' votes are unlikely to be explained by chance alone?"

In other words, if chance alone doesn't explain the votes, then we will conclude that there is a significant c-b component in HuSi's population of users.

If the chance of voting is even, then the chance of getting each cynical vote is 1/5. The chance of getting exactly 8 cynical votes is (4/5)^14, or ~4.4%. (It is easier to calculate the reverse, the chance that exactly 14 votes won't be cynical.) The chance of getting exactly 9 cynical votes is (4/5)^15, or ~3.5%. Therefore the chance of getting 8 or more votes purely by chance is quite a bit greater than 5%, and therefore, by your definition, the vote isn't significant. Not even close.

I'm too lazy to calculate exactly how much more, so that'll have to do. I could probably come up with some series with a limit or something, but there's no point, as I've already proved it with just two terms.

(Note that this is most assuredly not the right way to do this problem, but it's the easiest way I think to do it without having to go refresh my statistics memory.)

Playing with calc a bit on the high end, it looks like you'd have to have 18 cynical votes before you could say with 95% confidence that HuSi users pick cynical more often than chance would imply.
---[ucblockhead is] useless and subhuman

If the chance of voting is even, then the chance of getting each cynical vote is 1/5.

Sounds reasonable.

The chance of getting exactly 8 cynical votes is (4/5)^14, or ~ 4.4%.

Care to explain your reasoning?

It would seem that by your logic, the chance of getting exactly 1,000,000 cynical votes out of 1,000,014 total votes is still (4/5)^14. Yet, it stands to reason that as the total number of votes increases, the probability of the count of cynical votes being any particular given number ought to decrease, something that your reasoning doesn't seem to address.

To come from a completely different angle, how would you calculate the probability of zero out of fourteen votes being cynical? (Warning: You may experience d

There are 5^22 different combinations...this requires calculating the number of combinations with 8 or more cynical bastard votes. Or less than 8, which is probably easier.

The number of combinations with no cynical bastard votes is 4^22. The number of combinations with one is 4^21. Etc. So it works out to (4^22 + 4^21 + .. + 4^15)/5^22

Which seems wrong because I get 1%, implying that the chance of getting 8 or more is 99% in favor, which has got to be wrong, but I've got no more time so bleah.
---[ucblockhead is] useless and subhuman
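One possible repair of the counting argument above, offered as a sketch and not as anyone's official answer: a term like 4^(22-k) counts how the non-cynical votes can fall, but not which k of the 22 voters cast the cynical votes. Multiplying by the binomial coefficient "22 choose k" accounts for that. The names used here (`arrangements`, `N_VOTES`, `OBSERVED`, and so on) are made up for illustration, and the equal-likelihood of the five options is taken from the rules.

```python
from math import comb

N_VOTES = 22      # votes in the snapshot
N_OPTIONS = 5     # poll options, assumed equally likely (assumption 3)
OBSERVED = 8      # votes for the last ("cynical") option

def arrangements(k: int) -> int:
    """Ways 22 voters can vote so that exactly k of them pick the last option."""
    # Choose which k voters are cynical; each remaining voter picks one of
    # the other four options.
    return comb(N_VOTES, k) * (N_OPTIONS - 1) ** (N_VOTES - k)

total = N_OPTIONS ** N_VOTES

# Sanity check: every possible voting pattern is counted exactly once.
assert sum(arrangements(k) for k in range(N_VOTES + 1)) == total

p_fewer = sum(arrangements(k) for k in range(OBSERVED)) / total
p_at_least = 1 - p_fewer

print(f"P(fewer than 8 cynical votes) = {p_fewer:.4f}")
print(f"P(8 or more cynical votes)    = {p_at_least:.4f}")
```

With the coefficient included, the counts add up to 5^22 as they should, and the chance of fewer than 8 cynical votes comes out at roughly 94.4%, leaving roughly 5.6% for 8 or more.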

You can't lump cynicism in with a perfectly reasonable choice and then claim that because it was chosen, people are cynical. It is possible to read the instructions as "vote for what you believe unless you're cynical, in which case you must vote for drugs". You have no information on why people voted for drugs; maybe they just believe that's an expectable (or even preferred) outcome. Remember the population you're surveying here; drugs are no worse a choice than, say, a shrine to NASCAR or an unnecessary truck.

Also, I agree with Mr. Ckhead: 22 is not a large enough sample size to make any statistically meaningful distinctions between 5 choices. You should have at least 50 votes (10x number of levels of the measured variable), preferably more like 75, to really be able to trust the answers to 5%. It is on this excuse (rather than my own laziness) that I will refrain from running the numbers anyway.

You can't lump cynicism in with a perfectly reasonable choice and then claim that because it was chosen, people are cynical.

Yes I can. See The Rules in my original posting, items 2 and 3, in particular.

Please note that the goal of this exercise is not to draw truly meaningful inferences about the population of HuSi. Rather, the goal is to enjoy a fun little probability puzzle that I have drawn loosely from a context near and dear to our hearts – HuSi. That's why I ask you (in The Rules) to assume certain things that in reality aren't likely to hold.

Get it? Don't fight the assumptions. Love them, and they will set you free to enjoy the puzzle.

22 is not a large enough sample size to make any statistically meaningful distinctions between 5 choices

To take a figure like 36% as indicative of the true percentage of cynical bastards, you also have to assume that the poll reaches a fairly spread subset of the users, which something by invitation (like a diary poll) is unlikely to achieve.

Also, the numbers ignore the people who chose not to vote. You should probably also quote the x users who have seen the story (currently over a hundred); not voting will often equate to a "none of the above" option.

OK (without figuring the 5% misvote)... It's been so long since I did anything approaching proper stats, but I wrote a quick and dirty application to randomly fill a 5-option poll with 22 votes. Running it a million times, it looks like the chance of having any single category with 8 or more elements is approximately 30% (a bit less), so it sounds like the sample size really isn't big enough to make a justified conclusion.

Whether this equates to a 6% or a 30% chance of the conclusion being wrong probably depends on whether you were only looking at one of the poll options (fixed by the hypothesis in advance), or are trying to form a hypothesis from the results of the poll!

I mean that if the purpose of this poll was solely to support the hypothesis that HuSi has a bias towards cynical bastards, one could say that it has approximately a 94% chance of being correct.

If the hypothesis is determined solely by the output of the poll (i.e., you would have hypothesised a bias towards one of the other entries if it had come out on top), then it has approximately a 70% chance of being correct.

Simulation is an interesting approach to take. Care to share your code?

[T]he chance of having any single category with 8 or more elements is approximately 30% ...

Do you mean that 30% of the time when you inspect the five categories after voting that at least one of them will have 8 or more votes? Or do you mean that 30% of the time, the first category will have 8 or more votes; and 30% of the time, the second category will have 8 or more, and so on?

Consider the following experiment. Run a single iteration of your simulator, casting 22 votes at random into five categories. Let X be the count of votes for the last category. Output X. Repeat for 100,000 iterations. Analyse the distribution of the X values you output. What does the distribution of X tell you about the likelihood of the 8 original cynical-bastard votes being significant?
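The experiment just described could be sketched in Python roughly as follows. This is one possible rendering, not the simulator discussed in the thread; the function name `simulate_last_option_counts`, its parameters and the seed are all invented for illustration.

```python
import random
from collections import Counter

def simulate_last_option_counts(iterations=100_000, votes=22, options=5, seed=2004):
    """Tally X, the number of votes landing on the last option, over many random polls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter()
    for _ in range(iterations):
        # Each vote picks one of the options uniformly at random;
        # option index (options - 1) stands for the last option.
        x = sum(1 for _ in range(votes) if rng.randrange(options) == options - 1)
        counts[x] += 1
    return counts

counts = simulate_last_option_counts()
total = sum(counts.values())

# Empirical distribution of X.
for x in sorted(counts):
    print(f"{x:2d} {counts[x] / total:7.4f}")

# Fraction of simulated polls at least as extreme as the observed 8 votes.
tail = sum(n for x, n in counts.items() if x >= 8) / total
print(f"fraction of runs with X >= 8: {tail:.4f}")
```

The last line printed is the simulated chance of seeing 8 or more votes for the last option when voting is purely random, which can be compared directly against the 5% threshold from the rules.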

(You may find my histogram program, with the --all-integral option, helpful for this analysis. See the "Data analysis and statistics" section of Tom's Perl code for other tools like stats, which you might also find handy.)

For comparison purposes with your simulator, here's a small Perl one-liner that will generate 100,000 values of X (five at a time) via simulation:

I probably shouldn't have posted here; my stats is terrible and I am probably just showing myself up. (Amazingly, I managed to go through a combined maths and physics degree at a fairly decent English university without ever studying statistics.)

What I meant is given random voting, 30% of the time one of the poll options will have 8 or more votes (not any of the options in particular). So generating any concrete conclusions from a poll exhibiting this behaviour seems unlikely.
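For anyone who wants to check that "any option" figure, here is a minimal sketch of such a simulation. It is not the original application, whose code was not posted; `fraction_with_busy_option` and its parameters are hypothetical names, and 100,000 runs are used instead of a million to keep it quick.

```python
import random

def fraction_with_busy_option(iterations=100_000, votes=22, options=5,
                              threshold=8, seed=2004):
    """Fraction of random polls in which at least one option reaches the threshold."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(iterations):
        tally = [0] * options
        for _ in range(votes):
            tally[rng.randrange(options)] += 1  # each vote is uniform over the options
        if max(tally) >= threshold:             # any option, not one in particular
            hits += 1
    return hits / iterations

any_option = fraction_with_busy_option()
print(f"P(some option gets 8 or more of 22 votes) ~ {any_option:.3f}")
```

The result should land a little under 30%, which is about five times the chance for one pre-selected option, less a small correction for polls in which two options both reach 8.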

Note that I was reading "The poll is fair in that all of its options are equally attractive if we ignore each voter's tendencies toward cynical bastardism." to mean that even non-cynical bastards have a 20% chance of voting for the cynical-bastards option. I might have misinterpreted this part.

I was reading [assumption 3 to mean that] even non cynical bastards have a 20% chance of voting the cynical bastards option?

No, it means that if we ignore the influence that cynical bastardism may have over the voters, then all options are equally likely to be chosen. In other words, if we eliminate the effects of c-bism from the universe, both cynics and non-cynics will be equally likely to choose each option. What this means is that if we determine that the last option was chosen more frequently than mere chance alone allows for, we can conclude that c-bism is what explains the unusually high frequency.

I guess: close, but not within 5% by george (6.00 / 1) #19 Fri Aug 06, 2004 at 03:36:33 PM EST

If votes are cast "by chance", we can treat each vote as a Bernoulli trial with p = 0.2 probability of "success". The probability distribution for x successes in a series of N independent Bernoulli trials is given by the binomial distribution. For large N, the binomial distribution can be approximated as a normal distribution and one can use the properties of the normal distribution to figure this sort of thing out.

In this case, I think N is sufficiently small that it's better to use the binomial distribution directly: the probability of x successes in N trials is $\binom{N}{x} p^x (1-p)^{N-x}$. (Where $\binom{N}{x}$ means N choose x -- the binomial coefficient.)

Then the probability of x or more successes in N trials equals $\sum_{i=x}^{N} \binom{N}{i} p^i (1-p)^{N-i}$.

Via Excel (where n choose k can be computed via COMBIN(n, k)) and the above formula, the probability of 8 or more successes in 22 trials is 0.0561.

In other words, if every vote were cast by chance, there would be slightly more than a 5% chance of seeing 8 or more cynical votes, so the result narrowly misses the 5% threshold.
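For readers without Excel, the same tail sum can be sketched in Python using only the standard library. The helper names `binomial_tail` and `critical_count` are invented for this example; the second one goes a step beyond the comment above and looks for the smallest vote count that would clear the 5% bar.

```python
from math import comb

def binomial_tail(x: int, n: int, p: float) -> float:
    """P(X >= x) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x, n + 1))

def critical_count(n: int, p: float, alpha: float) -> int:
    """Smallest x such that P(X >= x) <= alpha."""
    for x in range(n + 2):  # x = n + 1 gives an empty sum (0.0), so the loop always returns
        if binomial_tail(x, n, p) <= alpha:
            return x

p_value = binomial_tail(8, 22, 0.2)
needed = critical_count(22, 0.2, 0.05)

print(f"P(8 or more of 22 by chance) = {p_value:.4f}")
print(f"votes needed for significance at 5%: {needed}")
```

Under the stated assumptions this gives about 0.0561 for 8 or more votes, and it reports 9 as the smallest count whose tail probability drops to 5% or below.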

Excel has a function, BINOMDIST, to directly evaluate binomial distribution probabilities. If the probability of success is p and you have N independent trials, then you can compute the probability of "exactly x successes" as BINOMDIST(x, N, p, FALSE), and the probability of "at most x successes" as BINOMDIST(x, N, p, TRUE).

I don't particularly trust Excel, and so I like to use R from The R Project for Statistical Computing. In R, the pbinom function gives the cumulative distribution function of the binomial distribution; i.e., pbinom(n, N, p) gives the probability of n or fewer successes in N trials each having probability of success p:

> 1 - pbinom(7, 22, 1/5)
[1] 0.0561446

R is under the GPL, and that's a big advantage. I used Mathematica for more than a decade, but over time Wolfram's licensing and support policies wore me down (i.e., annoyed the hell out of me) to the point where I'm moving to Free-Software platforms for my analyses. R has filled the gap (and then some) for the statistical analyses I do. I recommend it.
