I can't seem to work out how to reply with the received response/quote in a box, so I'm starting again from scratch. So my reply to DCB's exquisite insights and my subsequent questions have got lost in the "crosstalk".

Apologies for revisiting previous themes.

David provided an elegant solution to my Sudoku solution of the top left box, the middle centre box and the bottom right box being identical. I am curious to know whether such a solution can be reduced to a 17 clue entry grid. Perhaps it's ignorance on my behalf i.e. ALL Sudoku solutions can be reduced to a 17 clue starting grid. I don't believe this - and of course just thought of another question. Given a valid unique solution, what solutions can ONLY be reduced to a given clue number starting grid - what is that maximum number?

I'll leave further discussion of the symmetry of 17 clue starting grids until another message, save for an observation that any such grid must have a clue in the centre cell of the centre box.

I'll come back to my knight's tour questions later.

David - thanks for your analysis of the first 100 17 clue grids. As I said previously, I am not an expert in number theory - I know even less about statistics. This raises a really deep mathematical question - are the bizaare (esoteric) questions I'm asking best answered by a statistical approach or a pure mathematical approach?

Coming back to the minimum digit problem - instinctively, I think 7 is doable, but 6 isn't. Of course the quest doesn't end there - if 7 is doable - what is the smallest number of clue starting grid?

David - thanks for your analysis of the first 100 17 clue grids. As I said previously, I am not an expert in number theory - I know even less about statistics. This raises a really deep mathematical question - are the bizaare (esoteric) questions I'm asking best answered by a statistical approach or a pure mathematical approach?

Mark Twain said there are lies, damned lies, and statistics. He wasn't far off the mark. Statistical evidence never constitutes proof.

That said, I'd like to explain why I started digging into the distribution of digits in the 17-clue Sudoku puzzles. One of your questions was "What is the minimum number of different digits appearing in a 17-clue Sudoku puzzle?"

I think the minimum number of distinct digits appearing in the initial grid for any valid Sudoku puzzle is 8. I can't prove that. I don't even have a clue as to how such a proof might be constructed. But I have a copy of all 24,620 known 17-clue Sudoku puzzles. If I can analyze all of those and see that every one contains at least 8 distinct digits I'll be pretty sure that the answer to your question, paraphrased above, is 8. Too bad I'm not a real good PC programmer ... I'll probably have to brush up on C to get the analysis done.

We don't know -- in the sense of a mathematical proof -- that the set of 24,620 17-clue Sudoku puzzles is complete. There may be more of them. On the other hand, there probably aren't any more of them, because a large number of people put a great deal of effort into finding them, and it seems intuitively likely that those guys have already found all the 17-clue possibilities.

There seems to be general agreement that there are only 24620 17 clue valid Sudoku starting grids. I have just visited one of your links that published 50 such grids. What was interesting to me was that

a) none of them had the centre cell filled in
b) ALL of them have all 9 digits

Although my maths ability isn't up to it, I am intrigued to know the answers to the following questions:

I thought I'd already answered that question in an earlier post, John.

David Bryant wrote:

(Referring to the first 100 puzzles in Gordon Royle's collection of 24,620 17-clue Sudokus)

-- Exactly 79 different combinations of the digits 1 - 9 occur in those 100 puzzles. 61 of the combinations are unique; 15 of them occur twice, and 3 of them occur three times.

-- In 40 of the puzzles, only eight of the digits appear as initial clues.

-- Of those 40, 39 omit the digit "9", and 1 omits the digit "7".

So 17-clue Sudokus with exactly eight different digits appearing as initial clues are quite common, occurring roughly 40% of the time. Here are a couple of examples, neither of which contains the digit "9".

I have completed an analysis of the distribution of digits in the first 1000 17-clue Sudoku puzzles from Gordon Royle's collection. Here are my findings.

-- 575 of these puzzles contain all 9 digits in the initial setup. The other 425 only contain eight digits. Not one of these puzzles has seven or fewer digits in the initial grid.

-- Of the 425 puzzles with only 8 digits, 403 omit the number "9", 2 omit the number "8", 3 omit the number "7", 6 omit the number "6", 9 omit the number "5", and 2 omit the number "4". Every one of these 1000 puzzles contains at least one "1", one "2", and one "3".

-- I think it's odd that most of the 8-digit examples omit the number "9", with a smattering of exceptions. Since none of the puzzles in this collection can be transformed into another one using the symmetry group operations (permutation of digits, columns, rows, etc) I had expected to find that all the 8-digit examples would be missing the same digit. Perhaps Mr. Royle just hasn't bothered to comb his pets that carefully yet.

John, I hope this satisfies some of your curiosity. I haven't had time to look for the symmetrical 17-clue puzzles yet. dcb

Coming back to the minimum digit problem - instinctively, I think 7 is doable, but 6 isn't. Of course the quest doesn't end there - if 7 is doable - what is the smallest number of clue starting grid?

I finally got around to writing a couple of C programs to massage the file from Gordon Royle's web site. I'll provide more details later, but here are the main results.

-- I found 3,504 distinct combinations of digits among the 24,620 puzzles.

-- Roughly 45% of these combinations involve only 8 digits. The remaining 55% use all nine digits.

-- There are no puzzles in this set with fewer than 8 distinct digits.

So, John, either the collection of 24,620 17-clue Sudoku puzzles is incomplete, or else there are no 17-clue Sudoku puzzles containing fewer than 8 distinct digits. I've verified that by directly examining every entry in the file (with the help of my PC). dcb

Here are some more details from my initial analysis of Gordon Royle's collection of 24,620 distinct minimal Sudoku puzzles.

-- There are 3,504 different combinations of the digits 1 - 9 appearing in the file I downloaded from Mr. Royle's web site. 1,466 (41.84%) of these combinations have just 8 digits in them. The other 2,038 combinations incorporate all 9 digits.

-- 11,230 (45.61%) of the puzzles incorporate just 8 digits as clues. In the other 13,390 puzzles, all 9 digits appear.

-- There were 878 8-digit puzzles in which the missing digit was not a "9". The other 10,352 8-digit puzzles had the digit "9" missing. Apparently some attempt to standardize this subset has been made, but it was incomplete.

Since all these puzzles are supposed to be distinct, even when the digits are permuted, it just makes sense that the "canonical" presentation of the 8-digit subset should always use the digits 1 - 8. So I cleaned up the file a little bit, interchanging the digit "9" with the missing digit in the 878 odd-ball puzzles. This reduced the number of combinations considerably.

-- There are 2,897 distinct combinations of digits in this modified file. Of those, 859 (29.65%) have just 8 digits in them, and 2,038 (as before) have all 9 digits.

In looking at the combinations more closely, more opportunities for standardization are immediately apparent. For example, the most prevalent pattern of digits, by far, appears to be "11122233445566778" -- that is, 3 each of two digits, 2 each of five digits, and just one of the 8'th digit. But this pattern appears in many alternative forms -- as "11122333445566778" and as "11123344555667788", for example. So the next thing I'll try to do with this file is to boil the combinations down some more, by always making "1" the most common digit, and "2" the next most common, etc. dcb

I wrote to Gordon Royle the other day -- he's the math professor in Australia who put the file containing 24,620 17-clue Sudoku puzzles up on the internet. Here's the message I sent.

David Bryant wrote:

Hello!

I don't have any new 17-clue puzzles to tell you about. But I have downloaded the big file from your web site, and I have spent some time looking at it. Here are some ideas I want to share.

11,230 of the 24,620 puzzles have just 8 non-zero digits in them. 10,352 of these contain the digits 1 - 8 and omit the digit 9. The other 858 omit some other digit besides 9, although none of them omit the 1.

Since permutation of the digits doesn't really alter the puzzle, I did a little cleanup of the file, so that the digit "9" is the only digit that does not appear in the 11,230 puzzles with just 8 digits in them. This reduced the number of combinations of digits appearing in the 8-digit puzzles from 1,466 to 859.

It appears that some further standardization of the choice of symbols is possible. For example, the most common combination of digits appearing in the 8-digit puzzles is "11122334455667788". But I also noticed the combination "11223334455667788", which is really the same thing, if we permute the 1's and 3's in the second string.

I intend to do some further fiddling around with the file you so kindly provided. I'm curious about the choice of digits in these puzzles. Apparently someone has done a certain amount of mapping, to be sure that "1" is the most prevalent digit. I'm just curious how much of this you have done, and if you think there's any point in trying to find a new representation for the set of puzzles, so that the number of combinations of digits is minimized.

Thanks for the puzzles! I've had a great deal of fun with them. dcb

Professor Royle was kind enough to respond very quickly -- here's his reply:

Gordon Royle wrote:

Hi, Glad you have found them of some interest...

As you say, the puzzles do all have some mapping performed on them, in order to ensure that they are not just trivial relabellings or other rearrangements of each other.. So each new one I find is first mapped to its "canonical form" and then this is checked against the others and it is only added if it is different to the new ones. However this canonical form is purely something that is easy to compute, and it has no real human interpretation. Essentially, it is a black box with the property that I pop in a puzzle, and out comes a relabelled puzzle... the key property is that no matter how I relabel or do other trivial rearrangements (ie swap rows 1 and 2) on the INPUT puzzle, the same puzzle comes out...

So there may well be some value in finding a sensible human interpretable relabelling, but for computation purposes when dealing with thousands or even millions of puzzles, I am stuck with the black box!

All the best
Gordon

Now I can't profess to understand the "black box" Mr. Royle is talking about, except that it must involve finding a version of each puzzle that is invariant under some subset of the symmetry transformations that preserve a Sudoku puzzle. I'll try to dig into that a bit more as time permits -- if anybody else has some insight, I'd love to hear your thoughts.

In the meantime I have gone ahead and relabelled the puzzles according to my own very simple-minded scheme, which simply ensures that "1" is the most common digit in every puzzle, that "2" is the next most common, and so forth. I took this modified file and boiled down the digits and was surprised to learn that there are only 14 combinations of digits appearing in my modified file. Here are those combinations.

Out of curiosity I extracted the two "rarest" puzzles (records #5773 & #22024, for the benefit of anyone who has downloaded Mr. Royle's file) and tried solving them. The second one was very easy to solve and did not present any unusual difficulties. The first one was a little more fun, leading to an interesting position after 47 moves, when there were 17 blank cells left. Here's that puzzle.

> The cadidate table has 16 cells with pairs and one call with 3 digits.
> I started a "double implication chain" technique:

> (A) r9c9 = 9 <== Start setting
> Which you can follow also on the Print of Exclude/Candidate Tables
> And it produced the solution:

+++

What is a "double implication chain" technique????

I looked at this one and it is clear that r9c9 MUST have value 9 as
giving it value 6 (the alternative for that cell) leads to a contradiction.

However, this looks suspiciously like "Trial and Error".

Is there ANY reason for choosing '9' as the start value in r9c9 BEFORE
testing value 6? If not then it is 'trial and error' with a fancy name!

++
NB: After my "rant" about 'trial and error' I suggest a method which
resolves this puzzle without such techniques. It may be worth reading
on a bit - even snoozing during the rant - but waking up later.

As I have posted elsewhere, a breach of the conventions on this site
occurs when one needs to posit a specific value for a cell and to rely
upon a contradiction down the road in order to "backtrack" to a point
from which a valid solution emerges.

What is needed is a rule which results in a UNIQUE allocation of a value
to a cell based on the currently available data with NO reliance on the
specific data in a particular cell. Examples of the rules include the basics
such as

a) A digit must not appear more than once in a line/region.
b) Every digit must appear at least once in a line/region.
c) If a digit must occur in one of two cells, it cannot appear in any other
cell which shares a common line/region with such two cells.
d) etc, etc

All these rules are quite independent of the actual values in the cells
and they define what value will go into any specific cell by reference
to what is known about the grid AT THAT POINT IN TIME and using the
rules which manipulate such current knowledge.

As soon as one postulates the value for a cell with the reservation that
one may wish to change that value later one has entered the realm of
"trial and error". This is NOT the same as saying the value in cell P must
be the same as the value in cell Q or that it must differ - because such a
rule applies irrespective of the specific value.

Binary Pairs often lead to a "Forcing Chain" (if there are enough of them)
but in this puzlle the problem lies in column four. Digit 8 appears in
three cells as a candidate and so a judgement of "Not 8" for a cell does
not imply that "must be 8" applies to any other specific cell. This is in
contrast with column six say where each of digits 2,8,9 occur twice only
in the column - albeit in three separate cells so that being able to state
conclusively that a cell "must be X" or "cannot be X" (where X is one of
the candidates for the cell) leads to an incontravertible resolution for
some cell or other.

++++++++++++++++++++ New Method discussion starts here
Looking at column 4 and row 6 in this puzzle gave me the inkling of
a rule but I have not been able to formulate it as a generality.

++
Taking row 6, the candidates are {89, 268, 28, 69}
I tried taking any two from three of the pairs and attempting to make
a triplet set with digits from the 268.

89 and 28 would require 29 as the third - impossible from 268.
89 and 69 would require 68 as the third - quite possible, isolating 28.
28 and 69 have nothing in common.

With 28 isolated and 8 inside the triplet set, r6c6 must be 2.

++
Taking col 4, the candidates are {29, 68,268, 89}
Taking any two, in turn, from three of the pairs and attempting to make
a triplet set with digits from the 268 as before:

89 and 68 would require 69 as the third - impossible from 268.
89 and 29 would require 28 as the third - quite possible, isolating 68.
29 and 68 have nothing in common.

With 68 isolated and 8 inside the triplet set, r5c4 must be 6.

++
Having determined that one cell must have the same value irrespective
of the value taken by another (only indirectly linked) cell, there is often
sufficient 'new' information to resolve the whole puzzle (as here where
setting r6c6 to 2 OR r5c4 to 6 leads to full resolution).

+++

My question is:

Is it possible to specify a rule covering the situation found above which
can be applied irrespective of the values actually present?

An attempt would be

"If a logical line contains four unresolved cells with three containing
only two candidates and one containing three candidates, the cell
with two candidates whose members BOTH occur in the cell with three
members may be further reduced by elimination of the digit which
occurs as a candidate three times overall."

Here this would mean
a) We have four unresolved cells in the row or column in context.
b) They are occupied as three with two members and one with three.
c) There is a cell with two "containable" in the three
(28 in 268 for row six and 68 in 268 for col 4)
d) One digit (8) occurs three times, the other digits only twice each.
This MUST occur as there are four distinct digits and a total of nine
candidate places (3x2+1x3=9). If more than one digit were to
occur three times then another could occur only once - and so
would be resolved anyway!
e) Eliminating the thrice-occurring digit from the "containable" cell
resolves that cell (here taking 28 to 2 or 68 to 6)
f) After that, the "impasse" is resolved - WITHOUT 'trial and error' or
substitutes with fancy names like "double implication chain".

The theory behind this is that if N cells have a "congruent" set of
candidate profiles it MUST be the case that if one cell were to be
resolved then N-1 cells must then have a congruent set of profiles.
The rule above seeks to determine which of the N cells could be
excluded and leave an N-1 congruent set. The suggestion is that in
the scenario specified only ONE of the cells could be eliminated so
that the remaining N-1 could still be congruent. However, rather than
testing each cell in turn (although this can be done to DEMONSTRATE
the rule!) the rule identifies exactly which cell can be removed from the
N-cell set and what value it should take.

+++

My final question is then whether this rule (which applies to cases with
four unresolved cells) can be extended to other situations as a much
more general rule. A simple desk-check suggests not - at least not
without several more caveats.

I looked at this one and it is clear that r9c9 MUST have value 9 as
giving it value 6 (the alternative for that cell) leads to a contradiction.

However, this looks suspiciously like "Trial and Error".

Is there ANY reason for choosing '9' as the start value in r9c9 BEFORE
testing value 6? If not then it is 'trial and error' with a fancy name!

Quote:

What is needed is a rule which results in a UNIQUE allocation of a value
to a cell based on the currently available data with NO reliance on the
specific data in a particular cell.

You can add: "a rule which results in a UNIQUE exclusion of a value to a cell ..."

Now a question for you Allan:

Q: Is the "Turbot Fish" technique (or the "Coloring" one) getting into your category of "good" rules or is it getting into the "bad" trial and error?

Depending on your answer, I could extend it (your answer) to the "Double implication chain" technique.

I can undestand that you think (with your very nice algorithm) and feel (that other are trial & error) different !!!
It's the same with me and my wife. We both stick to our ways of thinking and feeling, and love each other.

A long discussion, in several places, about "trail & error".
A lot of people see it as something durty, that should not be used.
It was the invension of the Devil, not of God that has everyting arranged so that things can be "forseen" 100% for the future. Pure Math!!!
Till Kurt Gödel came!

> Q: Is the "Coloring" technique within the category of "good" rules
> or is it getting into the "bad" trial and error?

The colouring technique is acceptable in that it does NOT allocate
values to cells and then backtrack if a contradiction is found.
It uses the information currently to hand and constructs chains which
allocate colours (eg blue/green) to cells and then monitors the patterns
formed by such colouring. It does NOT allocate values to cells UNTIL
the END of the process - whereas 'trial and error' allocates on the
basis of "guesswork" and then backtracks if it goes wrong.

I am not familiar with the detail of the Turbot in sufficient depth to be
able to opine upon its category. However, if it is a process that results
in a specific NON-REVERSIBLE result and NO allocations are made to
the cells in the grid before completion of the process then it would
seem to be acceptable.

The technique introduced here is either not explained adequately for
me to understand it or it is akin to trial and error. The previous post
started off by setting r9c9 to have a specific value (9) in line A of the
several lines a to E presented to us. If there was, indeed, a setting of
value 9 in that way then I still need an answer as to why that cell was
not set to 6 in the first place. I am quite willing to read the reasons
but acknowledge that I do not always understand them (like why r9c9
of the Nov 18th Medium can be set to '3' before anything else. I know
that such is true but cannot see why!).

> A long discussion, in several places, about "trial & error".
> A lot of people see it as something dirty, that should not be used.

It is not "dirty" and it is quite acceptable on OTHER sites but part of
the (attractive!) ethos of this site is that the puzzles do not use any
"trial and error" techniques. It is, of course, significant that the exemplar
puzzle for this topic is NOT one set by SamGJ! However, I (and many
others I suspect) choose not to engage with T and E and thus will
avoid interaction with the puzzles set in the London Daily Telegraph
and probably in some other journals as well.

It is true that "trial and error" (known as "reductio ad absurdum" when
I was at school) is the ONLY technique with an ABSOLUTE guarantee
of reaching a solution - given enough time! However, the attraction of
the puzzle is the application of logic - not brute force resolution.

By no means would I deny users of this site the right to use T and E
for any sudoku - by consenting people in private! However, THIS site
should be kept as free as possible from such contagion. There are
many other sites that cater for those sorts of puzzles. Our task here
is to search for (and hopefully to find) methods that do not rely upon
the 'trial' of positing an initial value for a cell (ie guessing!).

+++
> It was the invension of the Devil, not of God that has everyting
> arranged so that things can be "forseen" 100% for the future.
> Pure Math!!! Till Kurt Gödel came!

I have not studied philosophy and so have no idea of the ideas of
Kurt Godel or their comparison with those of others.

However, there is a current school of thought that has some evidence
for the idea that physical reality is not "fixed" until and unless it has
intentional consciousness applied to it. The theory extends to the whole
universe having an inherent consciousness (akin to what some call
"God") and an innate organising principle.

Personally I do not subscribe to the concept of "The Devil" - with its
binary basis (everything has its opposite - even God). In our physical
universe nearly everything has an opposite but there is one important
exception - Light! We delude ourselves if we posit "dark" as the opposite
of light. Of course it is so with shades of colour but true light has no
opposite - only the relative absence of it. Take a completely sealed
room that is totally dark and open the door. Does the dark reduce the
amount of light outside the room? No, the light permeates the space of
the room and the dark just ceases to exist. It is not possible to get a
bottle of "darkness" in order to tone down the light. Eventually it is
quite likely that "science" will come to understand such things. Indeed,
the leading edge of science is already doing so.

Some fifty years ago it was believed that science had made the concept
of god outmoded ('We are all atheists now') but more recently it is from
science that we have calls that ('If god does not exist then we will have
to invent such an entity to explain our total environment.').

However this is getting off the subject of Sudoku - other to ask the
fundamental question: "Why are we so fascinated by it?" and the
supplementary question "What attributes do sudoku players have that
are present in significantly different proportions in non-players?".

Many thanks to David Bryant for introducing me to Gordon Royle’s puzzles with 17 starting entries.

An indication of the extent to which GR’s black box skews the numbers might be provided by comparing the frequencies with which the digits 1, 2, …, 9 occur in the starting entries of GR’s data with the same frequencies in the corresponding normalised puzzles. The normalisation consisted of permuting the starting entries. The permutation applied to each puzzle is the one which makes the top row of the solution 1, 2, …, 9 in order. Here are the results:

That's interesting information, Steve. I guess you got your PC to generate all 24,620 solutions. How long did that take?

Puzzle #2350 is interesting. There's a hidden pair in the initial setup, and I found a swordfish (in the "2"s) right off the bat. After 26 moves (43 cells completed, 38 left to go) I had to use the "coloring" technique (on "9"s, and on "1"s) to break through.

I'm curious -- did your computer notice the swordfish? The binary chain in the "1"s? That pattern is particularly cute, because it resolves itself into a row and a column that each contain only one possibility for the "1" (both the same color, obviously), and this allows one to place all 5 (or was it 6?) missing "1"s simultaneously!

David, the run took between half and three-quarters of an hour, I think. Sadly, the program knows nothing of fish: after checking X-wings it is as yet reduced to guesswork. In the case of number 2,350 it found four pairs, two quadruples and two X-wings before guessing so I was interested to learn that a swordfish cracks it.

I took the “GR puzzle” program one step further after posting my previous note, intending to check that there were no duplicates in the data set. To my surprise the 24,620 puzzles gave rise to 24,611 distinct normalised solutions. The shortfall is accounted for by 9 doubles in the set. It is, I think, obvious that distinct normalised solutions implies distinct puzzles but I have no reason to suppose the converse is true; on the contrary, GR’s black box proves otherwise.

The doubles are 1725 and 1795; 12631 and 13047; 9437 and 13048; 11873 and 14238; 2776 and 15009; 12790 and 15487; 12791 and 15488; 12359 and 17884; 18099 and 22938.

In each double the two elements have different patterns but I did not take this further and shall not be able to do so for some time without risking being shot. The doubles might repay study. Anyone interested is welcome to take them on.

Incidentally, if anyone wants a copy of the normalised puzzles in GR’s order (or solutions or normalised solutions for that matter), just ask. I’m pretty sure I can get the listings into a Microsoft Word document but might need help with the next steps.