denis_berthier wrote:Most of the polls you read in the newspapers rely on samples < 1,000. As precision of the results varies like sqrt(1/sample_size) for uncorrelated samples, significant improvements in precision require very large increases in size. That's why small samples are generally used, unless precision is absolutely necessary for some reason.

If you truncate your sample at 9.0, you still have more than 43,000 puzzles. If you truncate it at 9.1, you still have more than 2,300 puzzles; this doesn't provide high precision, but it's better than nothing. If you tried to truncate it at 9.2, I don't think it would keep much meaning.

So, as an expert, you have no problem stating that, with a sample having no puzzles in the 9.2-10.8 area, you can define the ratio JE / number of puzzles in that area???

Correct for the poll sample, but it is part of a statistician's skills to build a so-called "representative" sample.
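The sqrt(1/sample_size) scaling in the quoted passage can be illustrated numerically. This is a generic sketch (the function name and the numbers are mine, not from the thread), using the usual normal approximation for a polled proportion:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an estimated proportion p
    from n uncorrelated samples (normal approximation)."""
    return z * math.sqrt(p * (1.0 - p) / n)

# A 1,000-respondent poll at p = 0.5: roughly +/- 3.1 points.
moe_1k = margin_of_error(0.5, 1000)

# Quadrupling the sample size only halves the margin, because
# precision scales like sqrt(1/sample_size):
moe_4k = margin_of_error(0.5, 4000)
```

This is why newspaper polls stop near n = 1,000: the next factor-of-two improvement in precision costs four times the interviews.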

denis_berthier wrote:Most of the polls you read in the newspapers rely on samples < 1,000. As precision of the results varies like sqrt(1/sample_size) for uncorrelated samples, significant improvements in precision require very large increases in size. That's why small samples are generally used, unless precision is absolutely necessary for some reason.

If you truncate your sample at 9.0, you still have more than 43,000 puzzles. If you truncate it at 9.1, you still have more than 2,300 puzzles; this doesn't provide high precision, but it's better than nothing. If you tried to truncate it at 9.2, I don't think it would keep much meaning.

So, as an expert, you have no problem stating that, with a sample having no puzzles in the 9.2-10.8 area, you can define the ratio JE / number of puzzles in that area???

As I've just said exactly the contrary ("If you tried to truncate it at 9.2, I don't think it would keep much meaning."), I can't see what you mean. BTW, I may have lost my elementary skills in arithmetic, but it seems to me that 158+3 = 161 and not 0.

denis_berthier wrote:Estimated unbiased proportion of puzzles having an exocet in the SER > 8.6 area

I see some convergence in our views. Maybe the next point would be to rephrase that sentence. As I read it, it affirms that the results of the sample can be applied without any limit on SER.

I don't understand. My sentence unambiguously defines the SER > 8.6 area, i.e. the result applies to this whole area, namely: Estimated unbiased proportion of puzzles having an exocet in the SER > 8.6 area = 0.018%

Notice that this doesn't imply that puzzles with an exocet are uniformly distributed in this area, independently of their SER. Therefore, if this is your problem, no conclusion can be drawn for any subarea, especially for a subarea with almost null relative probability within the SER > 8.6 area, such as the SER > 9.2 subarea.

Last edited by denis_berthier on Tue Jun 11, 2013 6:09 am, edited 1 time in total.

denis_berthier wrote:Of course, there's no problem for the estimate of E(X), but knowing such an estimate without knowing its "uncertainty" / "precision" / "error" is not very useful in my view.

Using std(X) as an approximation of the "uncertainty" / "precision" / "error" of this estimate of X is merely a convention; as you know, one can use confidence intervals and confidence levels if one wants to be more precise; but this is equivalent. In the sequel, I'll therefore concentrate on the only mathematical notion, std(X). In this context, I can make no sense of your sentence:

Blue wrote:Note: here too, correlations aren't figuring into the actual calculation of the estimated standard deviation, but only into the calculation of the estimated error.

There's another view on X's and E(X)'s: that knowing the value E(X), to any precision, is not very useful, because it doesn't provide any insight into the shape of the distribution. In that case, knowing the standard deviation of X can at least be helpful.

It's that kind of standard deviation that I was talking about, at the time.

The other kind of standard deviation is the standard deviation in estimates for either E(X) or std(X). The idea is that if you prescribe some particular method of obtaining an estimate (including sample sizes and so on), and then try that method some number of times ... then the results you'd actually get fall into some range centered on the actual values, with a "spread" that's dictated by a number that you would calculate as the standard deviation for the estimates.

For the second kind of standard deviation, when you calculate the number, the calculation can involve the actual std(X) value. When you use a simple random sampling of X values, it's just std(X)/sqrt(# of samples). You know this, of course.
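The std(X)/sqrt(# of samples) rule for simple random sampling can be checked with a small simulation. Everything below is illustrative (a uniform variable stands in for any X): repeat the estimation procedure many times and compare the observed spread of the sample means with the predicted value.

```python
import random
import statistics

random.seed(1)

# Stand-in population: X ~ uniform on [0, 1), so std(X) = sqrt(1/12).
true_std = (1.0 / 12.0) ** 0.5
n = 400            # sample size used for each individual estimate
reps = 2000        # how many times we repeat the whole procedure

# Repeat the estimation procedure and record the sample means.
means = []
for _ in range(reps):
    sample = [random.random() for _ in range(n)]
    means.append(statistics.fmean(sample))

observed_spread = statistics.pstdev(means)   # spread of the estimates
predicted = true_std / n ** 0.5              # std(X) / sqrt(n)
```

With 2000 repetitions, the observed spread should agree with the prediction to within a few percent.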

What I was getting at was that when you use Red Ed's method, the correlations that are present in the "short bursts" of puzzles produced from a particular (G,S) choice don't enter at all into the calculation of the actual estimates for E(X) or std(X), but they do enter in (in some tangential way) into the (hypothetical) calculation of the standard deviation in the estimates for a particular sample size and subgrid size. I'm not saying that you can do the actual calculation, or that if you could, the result would depend only on the sample size and the actual std(X) value. I think you can only try the method, and arrive at an approximate formula "experimentally". (The situation is similar for your "controlled bias" method, I think.) There is still a general rule, though, that the standard deviation in the estimates will be some constant, divided by the square root of the number of (G,S) samples.

denis_berthier wrote:Now, I'd like to clarify the following points.

Consider a random variable X (e.g. the W rating of a randomly chosen puzzle or the presence of an sk-loop in a randomly chosen puzzle). Consider also a set of n independent instances Xk of X (associated with different puzzles Pk). One can estimate the mean and the standard deviation of X from the sample made of the n Xk's. In particular, there's a full-fledged theory for this independent case and there are known unbiased estimators for E(X) and almost unbiased ones for std(X)* (given by the usual formula with sqrt(1/(n-1)) for std(X)). (*) There is a truly unbiased estimator for var(X).
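The (n-1) correction in the quoted passage can be checked directly with Python's statistics module (the data values below are arbitrary):

```python
import statistics

# Arbitrary illustrative data (n = 8, mean = 5).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Unbiased estimator of var(X): divide by (n - 1) (Bessel's correction).
var_unbiased = statistics.variance(data)   # sum((x - m)^2) / (n - 1)

# The plain 1/n version systematically underestimates var(X):
var_biased = statistics.pvariance(data)    # sum((x - m)^2) / n

# Taking the square root of the unbiased variance gives the usual,
# only *almost* unbiased, estimator of std(X):
std_almost_unbiased = statistics.stdev(data)
```

This matches the point in the quote: the var(X) estimator is truly unbiased, while its square root is only almost unbiased for std(X).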

No doubt that this method actually works -- at least if we assume that top down generators can provide an unbiased source of random puzzles with a fixed clue size.

denis_berthier wrote:Now, what's the difference with Red Ed's situation? He has a set of correlated (and therefore not independent) instances of X. From this point, there are two possibilities:
- either you compute an "empirical standard deviation" using the same formula as if the Xk's were independent - but this is where words are delusory: it is by no means an unbiased (nor even an almost unbiased) estimator of the real std(X); it can even be grossly wrong if there is a strong correlation;
- or, for any possible variable X, you need to define an (almost) unbiased estimator for std(X), prove that it is (almost) unbiased and, most of all, compute how it will vary with the size of your sample; until now I've not seen any of this in Red Ed's approach and I doubt I'll ever see any of it.

I think that the key point that you're overlooking here, is that in Red Ed's method, you don't have just one (potentially correlated) collection of Xk's produced from a single (G,S) choice. The number 'K', that appeared in my calculations, was the number of random (G,S) choices that are used to produce the estimate, and you still need a lot of those, to get any precision in the estimates.

blue wrote:There's another view on X's and E(X)'s: that knowing the value E(X), to any precision, is not very useful, because it doesn't provide any insight into the shape of the distribution. In that case, knowing the standard deviation of X, can be at least be helpful.

When you have a sample S generated by the controlled-bias generator, you can use it to compute estimates of the real E(X) and std(X) for any random variable X. You can therefore use it in a straightforward way to compute an estimate of the real distribution of X, as I did long ago for X = number-of-clues, for X = SER and for X = W-rating.

blue wrote:I think that the key point that you're overlooking here, is that in Red Ed's method, you don't have just one (potentially correlated) collection of Xk's produced from a single (G,S) choice. The number 'K', that appeared in my calculations, was the number of random (G,S) choices that are used to produce the estimate, and you still need a lot of those, to get any precision in the estimates.

I hadn't overlooked the K. In my method, there's a unique parameter, the size n of the sample, and precision classically varies like sqrt(1/n). In Red Ed's method, for each k <= K you deal with a correlated collection of puzzles. Hopefully, "local" precision for each k increases with the size of the k-th sub-collection; but how it increases highly depends on the k-th "local" correlation, and this correlation may also strongly depend on which random variable you are considering. If you have p puzzles uniformly for each sub-collection, you'll have to deal with a total of p*K puzzles (the equivalent of my n, as far as the complexity of the computations related to the random variable under consideration is concerned), but with precision varying only like sqrt(1/K) * some_hopefully_decreasing_function(p) instead of sqrt(1/n). So we're back to my initial doubts about the efficiency of the method.

denis_berthier wrote:When you have a sample S generated by the controlled-bias generator, you can use it to compute estimates of the real E(X) and std(X) for any random variable X. You can therefore use it in a straightforward way to compute an estimate of the real distribution of X, as I did long ago for X = number-of-clues, for X = SER and for X = W-rating.

Yes, I understand this well. The same thing is true if you have a sample generated using Red Ed's method. The only caveat is that, besides having the puzzles, you need to have a "batch number" associated with each puzzle, and you need to use it in calculating the error estimates (but not the value estimates). The "batch number" would indicate which (G,S) sample the puzzles were from. Alternatively, the puzzles can just be grouped into sets coming from the same (G,S) sample. A sequential list with "start" markers would suffice.
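A minimal sketch of the batch idea described above: the value estimate pools all puzzles and ignores batches, while the error estimate collapses each batch first, so within-batch correlations only affect the error term. The numbers and the equal-weight treatment of batches are my illustrative assumptions, not the exact weighting of Red Ed's method:

```python
import statistics

# Made-up data: each inner list holds the X values (say, ratings) of the
# puzzles produced from one (G,S) batch; values inside a batch may be
# correlated, but distinct batches are independent.
batches = [
    [8.4, 8.5],
    [7.2],
    [9.0, 9.1, 9.0],
    [8.8],
    [7.9, 8.0],
]

# Value estimate: pool every puzzle; the batch structure is irrelevant.
pooled = [x for batch in batches for x in batch]
mean_estimate = statistics.fmean(pooled)

# Error estimate: collapse each batch to a single number first, so that
# within-batch correlations are absorbed into the batch-level spread,
# and the K batch values can be treated as independent draws.
K = len(batches)
batch_means = [statistics.fmean(batch) for batch in batches]
std_error = statistics.stdev(batch_means) / K ** 0.5
```

A sequential puzzle list with "start" markers carries exactly the information needed to rebuild `batches` here.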

denis_berthier wrote:

blue wrote:I think that the key point that you're overlooking here, is that in Red Ed's method, you don't have just one (potentially correlated) collection of Xk's produced from a single (G,S) choice. The number 'K', that appeared in my calculations, was the number of random (G,S) choices that are used to produce the estimate, and you still need a lot of those, to get any precision in the estimates.

I hadn't overlooked the K. In my method, there's a unique parameter, the size n of the sample, and precision classically varies like sqrt(1/n). In Red Ed's method, for each k <= K you deal with a correlated collection of puzzles. Hopefully, "local" precision for each k increases with the size of the k-th sub-collection; but how it increases highly depends on the k-th "local" correlation, and this correlation may also strongly depend on which random variable you are considering. If you have p puzzles uniformly for each sub-collection, you'll have to deal with a total of p*K puzzles (the equivalent of my n, as far as the complexity of the computations related to the random variable under consideration is concerned), but with precision varying only like sqrt(1/K) * some_hopefully_decreasing_function(p) instead of sqrt(1/n). So we're back to my initial doubts about the efficiency of the method.

First, one point: if you've looked at the beginning of Red Ed's post, you can see that for producing an error estimate, the final word will be that the standard deviation of the estimates goes like sqrt(Var(V) / K) ... or if you like, sqrt(Var(V) / s(K-1)). The only question is: what is Var(V) for a given subset size and statistic of interest?

OK, there are two different things that we can talk about that we might call "methods".

First, there is the idea of using a collection of sample puzzles generated by either method to estimate some value, and to produce an error estimate to go with it. I wouldn't deny that if the collection was generated by Red Ed's method, then you might need a larger sample to achieve the same error estimate -- emphasis on the might. When the statistic is something like an SE rating, then every puzzle in a batch contributes, and 1) calculating SE ratings is expensive, and 2) the average batch size might be large, depending on the size of S. For some statistics -- puzzles containing a JExocet, for example -- the story can be that it is highly unlikely that even one puzzle from a batch satisfies the property, and that for two in a batch to contain one is even less likely. In that case, the two methods should perform more or less the same, with respect to the number of puzzles in the sample. I won't say exactly the same, since both kinds of samples will have some known bias in the clue counts that needs to be addressed, and I won't pretend to know the details relating to that off the top of my head.

The other issue is how the collection of puzzles was generated in the first place. Having some experience using both methods, I can tell you that Red Ed's method is capable of generating a much larger collection of puzzles in a fixed time than the "controlled bias" method -- which I'd prefer to call the "known bias" method.

I've always been curious how many CPU core hours it took to generate the collection that you produced. I'm guessing something like 100 (2 GHz) Macs for a month. Would you mind giving a number (an estimate if necessary)?

Regards,
Blue.

Edit: corrected a couple of minor things, and one major thing: Var(V)/sqrt(K) --> sqrt(Var(V) / K)

Last edited by blue on Wed Jun 12, 2013 2:29 am, edited 3 times in total.

As the focus now seems to be moving away from validity of the statistical estimates and on to efficiency of the algorithm, let me try to bring a potential point of confusion to the fore.

Statistics says that Var(X+Y) = Var(X) + Var(Y) + 2.Cov(X,Y) ... in other words, if you process a batch of positively correlated items (Cov>0) and take their sum as the output then the variance is worse than you'd get from summing the results of processing the same number of uncorrelated items. In the context of subtree enumeration, where the puzzles within a subtree are positively correlated, that leads to the variance of their sum (or of the random variable I referred to previously as f(S)) being worse than for the equivalent number of independent singleton probes (suexg-cb type probing).
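Red Ed's Var(X+Y) identity can be verified numerically. The series below are made-up values with positive covariance (population formulas throughout):

```python
def mean(values):
    return sum(values) / len(values)

def cov(xs, ys):
    """Population covariance; cov(v, v) is Var(v)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Made-up, positively correlated series (Cov(X,Y) > 0).
X = [1.0, 2.0, 3.0, 4.0]
Y = [1.5, 2.0, 3.5, 4.0]

sums = [x + y for x, y in zip(X, Y)]
var_of_sum = cov(sums, sums)
identity_rhs = cov(X, X) + cov(Y, Y) + 2.0 * cov(X, Y)
# var_of_sum equals identity_rhs, and since Cov(X,Y) > 0 here, it
# exceeds Var(X) + Var(Y), the value for uncorrelated items.
```

This is exactly the sense in which summing positively correlated items gives worse variance than summing the same number of uncorrelated ones.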

Denis, stop celebrating, I'm not done yet

The reason that the correlated internals of S (the subtree) don't hurt much is that we can search within S very efficiently. A 24-clue subtree for 21-clue minimals contains 2024 21-clue subsets: that's the "equivalent number" mentioned above. But we can search that subtree at a cost of substantially fewer than 2024 suexg-cb type probes -- maybe 2000 times quicker, if the 24-clue "root" has more than one solution (because then there's nothing more to check). It's this algorithmic efficiency in searching the tree that offsets the negative effect of correlations internal to S, ultimately making subtree and supertree methods the best choice in certain (but not all) parts of the distribution.
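The "2024" above is just a binomial coefficient: a 24-clue root has C(24, 21) = C(24, 3) ways to drop 3 clues. A one-line check with the standard library:

```python
from math import comb

# Number of 21-clue subsets of a 24-clue root: C(24, 21) = C(24, 3).
n_subsets = comb(24, 21)   # 2024
```

So dismissing a multi-solution root with a single solver call replaces up to 2024 individual subset probes.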

blue wrote:The only caveat, is that you besides having the puzzles, you need to have a "batch number" associated with each puzzle,and you need to use it in calculating the error estimates (but not the value estimates). The "batch number" would indicate which (G,S) sample the puzzles were from. Alternatively, the puzzles can just be grouped into sets coming from the same (G,S) sample. A sequential list with "start" markers would suffice.

I need a unique mean "batch number" for the whole collection; in any case, I don't see how it could be a caveat. In case you want to use sub-samples instead of the whole collection, you cannot cut the latter anywhere: you need these "start markers" and you need the "batch number" for the sub-collection; but, here again, I can't see any problem.

blue wrote:if you've looked at the beginnings of Red Ed's post, you can see that for producing an error estimate, the final word will be that the standard deviation of the estimates goes like sqrt(Var(V) / K) ... or if you like, sqrt(Var(V) / s(K-1)). The only question is: what is Var(V) for a given subset size, and statistic of interest ?

Yes

blue wrote:there are two different things that we can talk about

I agree, we shouldn't confuse generating a collection of puzzles (1) and using it (2). Generating a sample is a one-shot business; using it is an iterative one: every time you want to study a new random variable, you can re-use the same sample.

(1) Generation

blue wrote:how the collection of puzzles was generated in the first place. Having some experience using both methods, I can tell you that Red Ed's method is capable of generating a much larger colletion of puzzles in a fixed time, than the "controlled bias" method -- which I'ld prefer to call the "known bias" method.

Comparing generators with different properties is meaningless in my view. My generator produces uncorrelated samples. Comparing it with Red Ed's may be more meaningful than comparing it with top-down, but not much more. The expression "controlled-bias" has a history: I defined it in opposition to unknown bias; by adapting a top-down algorithm, we can control the bias in such a way that it becomes known.

blue wrote:I've always been curious how many CPU core hours it took, to generate the collection that you produced. I'm guessing something like 100 (2 GHz) Macs for a month. Would you mind giving a number (an estimate if necessary) ?

I'd like to have 100 Macs (especially the forthcoming new ones http://www.apple.com/mac-pro/ - free ad. Donations accepted.)
I never made a secret of it, but I don't have a precise estimate. I started using the generator before drastic improvements were made. As the controlled-bias algorithm is a simple modification of top-down, it could easily be implemented by eleven, starting from suexg (as I now remember it, only one or two lines of code were changed). There was much collaborative work to improve it (see chapter 6 of my last book for a list of participants). The final version was ~ 200 times faster than the first. As these improvements couldn't change the results but only the speed, I kept all the puzzles generated by all the successive versions.
With these preliminaries, the whole generation process took several weeks (between 6 and 8, I can't remember). At the start, I used only my MacPro 2006. After some time, I also used Linux computers from my lab, up to ~ 100 cores during the weekends; during workdays, I could only use fewer cores and the job was assigned low priority; I have no idea which % of the cores was used. So, the estimate remains rather vague, but that's all I have.

(2) Use

blue wrote:I wouldn't deny that if the collection was generated by Red Ed's method, then you might need a larger sample to achieve the same error estimate -- emphasis on the might. When the statistic is something like an SE rating, then every puzzle in a batch, contributes, and 1) calculating SE ratings is expensive, and 2) the average batch size might be large, depending on the size of S.

Most of the stats I've been interested in fall in this category.

(3) Confusing generation and use

Red Ed wrote:Denis, stop celebrating, I'm not done yet

Would you mind not trying to turn this again into a personal affair?

Red Ed wrote:The reason that the correlated internals of S (the subtree) don't hurt much is that we can search within S very efficiently. A 24-clue subtree for 21-clue minimals contains 2024 21-clue subsets: that's the "equivalent number" mentioned above. But we can search that subtree at a cost of substantially fewer than 2024 suexg-cb type probes -- maybe 2000 times quicker, if the 24-clue "root" has >1 solutions (because then there's nothing more to check). It's this algorithmic efficiency in searching the tree that offsets the negative effect of correlations internal to S,

The term "offsets" relies on a confusion between generation and use. You implicitly assume you have to generate a new collection every time you use it. But the more you use the collection, the more the negative effects of correlation will appear.

Red Ed wrote:ultimately making subtree and supertree methods the best choice in certain (but not all) parts of the distribution.

I guess you are speaking of the number of clues. As far as I can remember, your generation method was more efficient (for this very particular random variable) in the upper part of the distribution, where probability is close to 0.

With regard to generation and use, I am not confusing anything, but our aims do differ. I have only ever cared about the pure number-of-minimals question, so I need counts (what you referred to as the "main result" of the 2009 work) and that's it. In the summary post, you advocate using such results ("valid once and for all") to partially correct the statistics arising from biased generators, so we have some common ground here. It is true that the more expensive a statistic you wish to compute on those subtree-generated puzzles, the less advantageous subtree enumeration becomes: parameter-setting trials, as described in the summary thread I set up for you, will guide you to the right choice.

To repeat what I said in the summary thread, yes, subtree/supertree enumeration wins in the tails of the distribution (well, okay, not on 17s - we have another way of tackling those!) which, perhaps in part due to the "exotic" nature of those low probability puzzles, have attracted enduring interest on this forum. If you prefer to look only at the more ordinary mid-section of the distribution then that's your choice, in which case, yep, suexg-cb may be all you need.

I wonder if dobrichev, whose brief foray into subtree methods sparked this exchange, is feeling any better informed now?

Red Ed wrote:our aims do differ. I have only ever cared about the pure number-of-minimals question, [..]It is true that the more expensive a statistic you wish to compute on those subtree-generated puzzles, the less advantageous subtree enumeration becomes.[..]subtree/supertree enumeration wins in the tails of the distribution [...] which, perhaps in part due to the "exotic" nature of those low probability puzzles, have attracted enduring interest on this forum. [..]If you prefer to look only at the more ordinary mid-section of the distribution then [...] yep, suexg-cb may be all you need.

One uses the part of a distribution adapted to one's goals; it isn't a matter of "preference" or being "more ordinary". It's very funny to see how you use vocabulary with different connotations to speak of the two methods. But I have no major disagreement with your above statements.

BTW, speaking of generation, suexg-cb is only one implementation of the controlled-bias idea/algorithm. There are now faster programs than suexg and potential improvements in speed - but "probably" not orders of magnitude.