(a) There is a correct mark, independent of our efforts to measure it.

That is the design intent of the system.

In this, I believe the ISU to be grievously in error. In my response to gkelly's post above I said that the impossibility of achieving such a notion of the "correct mark" was the one thing we could all agree on.

I see I was wrong about that. No wonder we are not able to fit the square peg of figure skating judging into the round hole of statistical analysis.

Some GOE bullets. Jumps: "Superior flow in and out of jump elements."

Step sequences: "Highlight the character of the program."

Spiral sequences: "Creativity and originality."

And not only the individual bullets, but "it is up to the judge to decide on the number of bullets for any upgrade..."

In my opinion, to say that these considerations require measurement rather than judgment is to distort the words measurement and judgment out of all usefulness.

Originally Posted by Mathman

To me, "measuring" something means assigning a real number to it...

Originally Posted by gsrossano

Don't know what you mean by a "real" number.

By a real number I meant an element of the real number system. Like pi is a real number, or the square root of 2 is a real number.

"Third place" is not a real number. I don't mean that third place is unreal. Just that you cannot point to a place on the number line and say, this is third place, right here between 16.39847 and 16.39848.

Judges are not judging perceptions of intangible things, like pain, beauty, emotion, how I feel about the program, or how you feel about the program -- or at least they are not supposed to under IJS.

They are judging height, speed, rotations, cheats, positions, movements, timing, unison, etc. If a jump is cheated by less than 1/4 turn, the GoE goes down 1; a 1/4 cheat, down 2; a 1/2 cheat, down 3.
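As a toy illustration (my own sketch, with a hypothetical function name; the real IJS guidelines are more detailed than this three-step scale), the deduction rule just described might look like:

```python
def goe_reduction_for_cheat(cheat_fraction):
    """Map an under-rotation ('cheat') fraction of a turn to a GoE
    reduction, following the three-step scale described above
    (a simplification of the actual IJS guidelines)."""
    if cheat_fraction >= 0.5:
        return -3  # half-rotation cheat
    elif cheat_fraction >= 0.25:
        return -2  # quarter-rotation cheat
    elif cheat_fraction > 0:
        return -1  # cheated, but by less than a quarter turn
    return 0       # fully rotated

print(goe_reduction_for_cheat(0.1))   # -1
print(goe_reduction_for_cheat(0.25))  # -2
print(goe_reduction_for_cheat(0.5))   # -3
```

The point is simply that the rule is a step function of a quantifiable input, not a free-form opinion.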

Using the pain example, there is no standard unit of pain that allows one to say I am feeling 2.5 units of pain, and I can't say how much pain you are feeling at all, since you are feeling it and not me.

But in judging there are quantifiable units of under-rotation and I can say you under-rotated by 1/4 unit of rotation, or 1/2 unit of rotation, or whatever. I can also say you did 5 rotations in your spin, or 6 or 8. I can say (to a sufficient extent) if you were spinning at 2 rotations a second or 6 or 10. I can decide if your jump was 6 inches off the ice or two feet. I can say in a quantifiable way if your elements hit the highlights of the music or not, and if you were in time to the music or not. I can say you finished 5 seconds early or 10 seconds late, or stood at the boards for 5 seconds waiting for your music to catch up at the start of the step sequence. And I am not trying to perceive how you feel about your timing; I am judging your timing as I see/measure it. Pretty much everything that gets judged in a skating performance has quantifiable units associated with it.

And the fact that we are not measuring them with a stopwatch or a ruler does not mean we are not measuring. As an example, as a photographer I can look at an arena and tell you how much light there is and what settings are correct for exposure to within 1/4 stop. After 50 years of experience my eyes and brain make a pretty good light meter -- and that is true for most photographers. A highway patrol officer can look at you on the highway and tell your speed to within 5 mph (better than 10%). They don't need a radar gun.

On the other hand what are the units for beauty? How many units of beauty does one picture have compared to another? Can't say. That's why we don't judge things like beauty or art. Sorry, but artistic impression hasn't been a judging concept for 20 years. Let it go.

IJS was created to address the IOC demand that skating must be judged according to an objective quantifiable standard -- or it's out of the Olympics. Some would say that is not the right approach. Nevertheless, that is the purpose of IJS, to objectively judge quantifiable characteristics of skating. Not to judge perceptions. The 6.0 system was killed off by the IOC because it was viewed as being nothing but a mishmash of perceptions and opinions.

From some of the previous comments it seems the reactionaries would like to push IJS back into the subjective, ill-defined mode of 6.0 judging. Isn't going to happen.

I guess what I am saying is that no matter what the ISU comes up with, when we subject the IJS to rigorous statistical analysis, we always uncover all sorts of problems.

The reason why, IOC or no IOC, is not because the system needs tweaking but because it rests on doubtful assumptions.

(That's what I think, anyway.)

I for one am not saying that we should go back to 6.0 judging. Just that we should not be surprised when we see a bunch of error terms that are distressingly large compared to the increments of judging (in the PCSs, for instance), and stuff like that.

No offense, but the ISU doesn't care. This is the road they have gone down.

And yes there is still a lot of subjective junk in the rules, but over time it is slowly being weeded out. It's hard making progress, though, because there are still some judges who want the ability to do whatever they want in whatever situation they want, to get the answer they prefer. That group puts a lot of pressure on the Tech Committees to water down the system and go back to the old ways, but I don't think it is going to happen.

By a real number I meant an element of the real number system. Like pi is a real number, or the square root of 2 is a real number.

I see, counting the number of times a coin flip comes up heads is not a measurement since it is an integer, and coin flips (or dice tosses) do not follow the laws of statistics! Glad you cleared that up for us.

I see, counting the number of times a coin flip comes up heads is not a measurement since it is an integer, and coin flips (or dice tosses) do not follow the laws of statistics! Glad you cleared that up for us.

Yes, I would certainly say that counting the number of times a coin turns up heads is counting, as contrasted with measuring. Coin flipping follows the laws of discrete probability. Measured quantities follow the laws of continuous probability.

There are two grand themes that since prehistoric times have animated all mathematical thought. They are counting and measuring. On the counting side are arithmetic, number theory, and algebra. On the measuring side are analysis (calculus) and geometry ("measuring the earth").

It is the difference between "how many?" and "how much?"

Of course there are many areas of overlap and interaction, probability and statistics being two of the most interesting.
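The how-many/how-much distinction can be seen in a toy simulation (my own sketch, not anything from the thread): a count of heads is always an integer governed by discrete probability, while a measured length takes values on a continuum:

```python
import random

random.seed(1)

# Counting ("how many?"): heads in 100 flips is an integer,
# governed by the discrete binomial distribution.
heads = sum(random.random() < 0.5 for _ in range(100))

# Measuring ("how much?"): a length reading varies continuously,
# modeled here as a true length plus Gaussian instrument noise.
length = 1.300 + random.gauss(0, 0.002)

print(heads)             # always a whole number between 0 and 100
print(round(length, 4))  # a real number somewhere near 1.3
```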

Originally Posted by Hsuhs

1. Judging = measurement, IMO. 'I like it / I don't like it' is a measure, or so I thought.

It seems to be a question of how broadly we want to use the word "measurement."

A carpenter takes out his tape measure, measures a board and finds that it is 1.3 meters too long for his needs, so he cuts some off. If you ask him if he measured the board, he would certainly say yes. (The fundamental rule of carpentry: measure twice, cut once.)

Now ask him why he chose rosewood instead of mahogany for his project. I don't think he would use the word "measure" in his answer.

Well, I guess he measured the qualities of each type of wood against the qualities that his experience told him would be desirable in the finished project (sort of like figure skating judging after all.)

Still, if the customer said, "hey, you measured this cabinet wrong," I don't think the carpenter would say, "by golly, you're right, I should have gone with mahogany."

Anyway, that's semantics. On the mathematics side, the question is whether or not we are trying to fit a statistical model to a particular case study that it just doesn't match up with very well. This is a question of more than just semantic substance, IMHO.

Now ask him why he chose rosewood instead of mahogany for his project. I don't think he would use the word "measure" in his answer.

Well, I guess he measured the qualities of each type of wood against the qualities that his experience told him would be desirable in the finished project (sort of like figure skating judging after all.)

Still, if the customer said, "hey, you measured this cabinet wrong," I don't think the carpenter would say, "by golly, you're right, I should have gone with mahogany."

Anyway, that's semantics.

Let 9 carpenters choose between rosewood and mahogany for their individual projects. You'll have a measure of 'popularity of 2 sorts of wood among professional carpenters'. Still semantics?

Originally Posted by Mathman

the question is whether or not we are trying to fit a statistical model to a particular case study that it just doesn't match up with very well.

I'm still not sure what exactly IJS is supposed to measure, what's the name of the construct? Without a certainty of that knowledge, I'm afraid I'm clueless about the best fit.

Let 9 carpenters choose between rosewood and mahogany for their individual projects. You'll have a measure of 'popularity of 2 sorts of wood among professional carpenters'. Still semantics?

If the panel went 6 to 3 for mahogany I think I would prefer to say that we have a "count" of the relative popularity.

But maybe this is splitting hairs stupidly. I will give up on trying to press this distinction.

I'm still not sure what exactly IJS is supposed to measure, what's the name of the construct? Without a certainty of that knowledge, I'm afraid I'm clueless about the best fit.

I was hoping someone more knowledgeable about skating than I am would jump in and give that question a shot.

I fear that the answer is, "the name of the thing we are trying to measure is 'the number of CoP points that a program ought to receive.'"

An alternative approach to figure skating judging would be to say, "the name of the thing we are trying to judge is 'the quality of the performance.'"

But this may be only half the story. There are, after all, the first mark and the second mark, the technical specialist and the judges.

Maybe: "Do hard tricks and do them well" is the mantra.

Trying to roll up quantities and qualities into a single ball of wax is the joy and the curse of figure skating judging, in my opinion. But I have to say, of the two, I do have a special fondness in my heart for quality.

(a) I voted for this particular political candidate because he is 6 foot 2 inches tall, because he has 13,476.12 dollars in his bank account, and because he voted for tax cut bills 52.38 per cent of the time.

Or

(b) I voted for this guy because he has a sterling character, an heroic spirit and a noble mind.

I think we all more or less agree on the main issues, which are that, statistically, random dropping of judges makes little sense, and, ethically, the ISU's explanation that this is done to shield the judges from outside influences isn't persuasive (it looks more like they're hiding the corrupt judges from public scrutiny). Anonymous judging suffers from some of the same problems.

Some of the other questions that have come up, what gsrossano might term "pollens on the leaves of the trees", are of great interest to me, because they come close to my research.

So for instance, there are two major camps in statistics: Frequentist and Bayesian. Frequentists basically apply probability only to the empirical frequency of a repeated, external event (for instance, a fair coin lands on heads .5 fraction of the time), whereas Bayesians also apply probability to subjective uncertainty (so armchair figure skating fans can say there's a .65 probability that Yu-na will win OGM, and a .95 probability that the US ladies will not medal, even though this particular Olympics with these particular skaters has never taken place before). This is a philosophical difference, not a mathematical one: both Frequentists and Bayesians agree on the fundamental laws of probability, just not on their applications.

I feel this is analogous to the discussion of what is actually "measured" by the judges' scores. We can argue forever whether the judges' scores reflect noisy samples of some underlying "truth", or whether there is no independent "truth" except what emerges as a consensus from judges' scores.

This is a philosophical question, not a scientific one. The fact is that we can only access judges' scores, whether an independent "truth" exists or not.

That doesn't mean one can't be scientific about it. The scientific way to go about it is to analyze the statistics of the judges' scores across competitions and judging formats, as gsrossano seems to have been doing, and quantify just how much margin of error there is in terms of absolute COP scores, how much there is in terms of relative placements, and how these margins of error change as a function of the number of judges taken into account (and whether the mean or median is used, and whether or not highs and lows are trimmed). This would be really useful, because then one could say, for instance, that with 7 judges a COP difference of 5 points in the SP, or 10 points in the LP, is a "statistical tie" (a term commentators seem to toss around a lot). And what a statistical tie means is that with only 7 judges, skaters scoring within 5 points of each other in the SP should be given a tied score (because the variability in their scores reflects only "noise"), and likewise for skaters scoring within 10 points of each other in the LP. Then maybe what one would find is that with 9 judges, the margins of error go down to 4 and 8 points, and with 12 judges, to 3 and 5. Etc. I'm just making all these numbers up to illustrate the point, but I think they're in the right ballpark.
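To make the suggestion concrete, here is a sketch (my own, with made-up inputs in the same spirit as the numbers above: an assumed per-judge noise of 3 CoP points) of how one could estimate such margins of error by simulation, using an IJS-style trimmed mean:

```python
import random
import statistics

def trimmed_mean(scores):
    """Drop the single highest and lowest score, then average the rest."""
    s = sorted(scores)
    return statistics.mean(s[1:-1])

def margin_of_error(n_judges, judge_sd=3.0, n_trials=10000, seed=0):
    """Estimate, by simulation, a rough 95% margin of error for a panel's
    trimmed-mean score, assuming each judge reports the 'true' score plus
    independent Gaussian noise of judge_sd CoP points (an assumption)."""
    rng = random.Random(seed)
    true_score = 100.0
    panel_means = [
        trimmed_mean([rng.gauss(true_score, judge_sd) for _ in range(n_judges)])
        for _ in range(n_trials)
    ]
    # Roughly two standard deviations of the panel average.
    return 2 * statistics.pstdev(panel_means)

for n in (7, 9, 12):
    print(n, round(margin_of_error(n), 2))  # margin shrinks as panels grow
```

Feeding in a per-judge standard deviation estimated from actual protocols, instead of the assumed 3 points, would turn the made-up margins into empirical ones.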

I think sadly, what we will discover is that the majority of championship decisions in major competitions since the inception of COP are "statistically insignificant" -- in other words, if these scores were the results of a scientific experiment, no journal would credit it as "real" and publish it.

(Thank God at least we hold our scientific journalism to higher standards than the COP, right? Imagine statistical methodology as sloppy as COP being applied to drug cost/benefit analysis or monitoring of climate change or effects of smoking on health. Hey, actually...)

While a "fully rotated jump" may be precisely defined in COP (270 degrees), there is no precise, independent way of measuring it except through human judgment. Note how a tech panel may downgrade or not downgrade a jump due to camera angle, and a "strict" and "lenient" caller may call the same jump differently. Precision of definition doesn't matter if there is no precise instrument to measure it. When it comes to vision (which is a dominant sensory modality in which skating is perceived), human vision (even rat vision) far exceeds machine vision (i.e. the kind of analyses computer algorithms can perform on camera images), and the richness of figure skating visual imagery is such that I think human judgment (with no independent measurement nearly as precise or accurate) will prevail as part of the scoring systems for decades and maybe centuries to come (assuming Speedy and his successors don't kill it first).

And this is related to what some people seem to be arguing in terms of what can be "measured" and what can't be. One thought is that what can be externally verified is measurable, while what requires human judgment is not. But if something as precisely defined as 270 degrees has to be judged by a human (or three), does that make "rotatedness" as subjective a judgment as "beauty"?

Personally, I say "yes." One day, scientists may understand better how the brain evaluates "beauty" and put it into quantifiable terms (we already know, for instance, that on average people judge a symmetric face more beautiful than an asymmetric one, judge a certain female waist-to-hip ratio to be the most beautiful over all other ratios, and judge certain harmonic structures to be musically beautiful and others ugly). So just as "rotatedness" can be defined in degrees (and who doesn't think 270 is arbitrary, by the way?), beauty may be quantifiable in some physical dimensions as well. And all of them are subjective in their own ways.

So if that's the case, then figure skating as a sport might as well give up trying to contend there is an absolute "truth". Let's face it, figure skating is a much more complex activity than any other sport, and requires mastering a much larger array of skills than anything else even comes close to. And its very complexity is what draws us figure skating fans to it. All this pigeonholing of FS into these oversimplistic criteria (which are actually growing more complex by the month) is killing the beauty of skating. A skilled judge can judge a program as a whole much better than by applying a system of arbitrary criteria and levels and requirements. When it's a 6.0 performance, we can all feel it, audience and judges alike.

I like whoever proposed that COP be changed to have a single presentation score out of 6, and the TES normalized to the same scale of 6. I think that'd be a great idea!

The fact is that we can only access judges' scores, whether an independent "truth" exists or not.

So now the next question is: if there is a wide variation among judges' scores, is this really statistical noise, or is it rather the very thing that we are studying?

Originally Posted by feraina

Thank God at least we hold our scientific journalism to higher standards than the COP, right? Imagine statistical methodology as sloppy as COP being applied to drug cost/benefit analysis or monitoring of climate change or effects of smoking on health. Hey, actually...

Studies of statistical errors in prestigious medical journals like the New England Journal of Medicine consistently reveal serious or fatal errors in about 50 per cent of the articles.

I suspect that one problem may be the availability of easy-to-use statistical software that makes no requirement that the user actually understand the underlying principles behind the tests employed. In any event, it would appear that the general standard of statistics in medical journals is shabby. Perhaps special emphasis should be given to the necessity for medical journals to have proper statistical refereeing of submitted papers. Indeed, some journals, embarrassed by reports such as these, are doing exactly that.

Edited to add: That last sentence is especially interesting. Now medical researchers complain that the statistical censors have become so zealous that the researchers can't get their stuff published at all.

Not surprising that Mr. Dore agrees with the ISU's position. He is among the principal architects of the new judging system, and as ISU Vice President for Figure Skating he is the ISU's official spokesman on the figure skating side.

Of the nine judges for the SP, five will be replaced for the LP.

I presume they draw straws for which judges will be dropped after the SP. No?

Do the 5 replacements know who they are before the competition begins?

I think that four of the nine judges will be replaced, with five continuing on to the long program.

I think that's right, that they draw straws to see which four of the nine SP judges will be replaced.

Yes, the replacement judges know in advance that they will be judging the long program, after sitting out the short program.

Can a judge volunteer to be on a Panel of his choice?

For each of the four disciplines the ISU has some kind of random draw for the countries that will send judges. After this draw it is known which 9 countries will have judges for the SP, and which 4 countries will be held in reserve to replace four of the judges in the LP.

Then later each national federation selects the individual judge that they will send, subject to the condition that the judge sent is on the ISU list of approved championship judges.

All in all, it looks to me like there is no real reduction in judges, since 14 judges are active in one way or another.

The 14 judges, the replacement of four from the SP, the elimination of two judges' scores at random, and anonymous judging -- all of these are methods by which the ISU is trying to make it harder for judges to cheat.

The mathematical question that gsrossano raises is different. When the smoke has cleared, you have seven marks to combine in a trimmed averaging procedure. This is down from a total of nine before. This reduces the trustworthiness of the average score by a factor that is measured by the ratio of the square root of 9 divided by the square root of 7.

This ratio works out to be about 1.13. In other words, the statistical margin of error in the judges' average score, as a meaningful measure, increases by about 13 per cent.

So in the example that Feraina gave above, suppose the margin of error is plus or minus 5 points for the short program and plus or minus 10 points for the long, with nine judges. Then with 7 judges, the margin of error increases to about 5.7 points for the short and 11.3 points for the long.
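The arithmetic can be checked in a couple of lines (a sketch; the 5- and 10-point margins are Feraina's illustrative numbers, not real data):

```python
import math

# The standard error of an average of n independent scores scales as
# 1/sqrt(n), so going from 9 counted scores down to 7 inflates it by
# a factor of sqrt(9/7).
inflation = math.sqrt(9 / 7)
print(round(inflation, 2))        # 1.13
print(round(5 * inflation, 2))    # SP margin: 5.67
print(round(10 * inflation, 2))   # LP margin: 11.34
```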

So you can see what the problem is. If a skater wins by a score of 240 points to 238 for his rival, but the margin of error built into the judging system is plus or minus 5 points, then who "really" won?

And this is just purely "statistical error." In other words, nobody did anything wrong; it's just that the procedure of taking a bunch of numbers and averaging them is not a 100 per cent reliable way of drawing a conclusion.

By the way, the part where I think that the ISU is being dishonest is here:

"It's to keep everything uniform," David Dore, ISU vice president, said Tuesday. "We've already done it in the championships, but the ruling on the Olympics was really out of whack. I don't know how you can have two such important championships done two different ways.

"It was an oversight, to be honest with you."

Right. The ISU just "forgot" that they have to judge the Olympics as well as the World Championships.

What actually happened was, last October they snuck the changes for the World Championship through without consulting the membership. Since it did not involve the Olympics, the change did not attract much attention.

Now they can come along and say, oh by the way, we are changing the judging procedure for the Olympics, too, because it is "out of whack" with the procedure for Worlds.