In part three of this
seemingly never-ending series
on high performance computing
we continued what may become the longest whittle in Perl Monks
history by further reducing the running time of our magic formula
search program, from 728 years down to just 286.

The latest round of improvements was achieved not by a major breakthrough,
but rather by applying a succession of micro-optimizations.
It was becoming increasingly clear that, to achieve the required
massive speed boost, we needed to find
a new breakthrough idea.

Constraint Satisfaction Problems

Constraint satisfaction problems (CSPs) are mathematical problems
defined as a set of objects whose state must satisfy a number of
constraints or limitations.
CSPs are the subject of intense research in
both artificial intelligence and operations research, since the
regularity in their formulation provides a common basis to analyze
and solve problems of many unrelated families.
CSPs often exhibit high complexity, requiring a combination of heuristics
and combinatorial search methods to be solved in a reasonable time.

So, while the talk of this being a “10**21 problem” is an interesting one,
and the latest installment (#3) about whacking the TLB of the processor is even more so,
it seems like an incredible amount of work to unleash
on a problem that has a trivial algorithmic solution ...
This is on the one hand very interesting and computer-sciency,
as I have already said, but the choice of example-problem puzzles me.

Roman numerals were the challenge at hand.
The method of the example is common to constraint
searches beyond the example.
Constraint searches exist in drug research and many other fields ...
Pick any field which has a large search space for just the
right combination of properties in an as yet undiscovered item.
Write a program which tries and makes a preliminary fitness
determination for each possibility.
Have that program spit out a short list of candidates
for further investment of testing and development ...
In this specific case, the fitness is a maximum length,
a handful of inputs, and a handful of outputs that map
correctly to those inputs ...
The point of an example is that it is a concrete thing that
is completed and shown rather than an abstract idea.

This was indeed a classic constraint satisfaction problem ...
which I felt woefully ill-equipped to solve.
Lacking the mathematical ability of an ambrus,
the only strategy I could think of was to desperately search for a hack,
any hack, that would allow me to abandon a potential solution as early
as possible, as soon as I was certain it could not
be successfully completed.

In computer science and mathematical optimization, a metaheuristic is
a higher-level procedure or heuristic designed to find, generate, or
select a lower-level procedure or heuristic (partial search algorithm)
that may provide a sufficiently good solution to an optimization problem,
especially with incomplete or imperfect information or limited computation capacity.

Hmmm, every one of these blocks looks like it is one of just three distinct types:

All zeros

All non-zero numbers even

All non-zero numbers odd

Every one of those blocks is one of just three distinct types!
Every one of those blocks is one of just three distinct types?!!
Whoa. But does that hold true everywhere?
To find out, I nervously ran a brute force search over all seven 4GB lookup tables
and was elated to learn that this "theorem" does indeed hold true for
every single block of every single lookup table!

For modulo 1001, each and every 128-byte block in all seven 4GB lookup tables
contains either all zeros, all non-zero numbers even,
or all non-zero numbers odd

-- ambrus' fourth theorem

Once again I could not restrain myself from adding to our list
of the theorems of ambrus. :-)
I expect ambrus' third theorem is easy to prove, his fourth much harder.

Curiously, out of all the odd moduli in the range 1001..1221,
the fourth theorem of ambrus applies
to moduli 1001 and 1221 only.
Why only those two? Who decides? Weird.
Maybe ambrus can show mathematically why it must be so.
I "proved" it only via brute force enumeration of hundreds of 4GB lookup tables.

For moduli 1003..1219 you would need to find a different heuristic to
trim the search space (if one exists).
That is the main reason why I used modulo 1001 only.

As you might have guessed, noticing this oddity in the lookup table data was a key breakthrough. Why?

All-zero blocks can be skipped immediately. 125 searches eliminated.

To get a hit, all blocks must be even or all blocks must be odd (a value from an odd block can never match one from an even block).

We can encode each 128-byte block in just two bits: one to indicate zero or non-zero, the other to indicate odd or even. This reduces the lookup table size from 4GB down to just 8MB (4GB / 128 bytes gives 2**25 blocks, at two bits each)! This is crucial in reducing CPU cache misses.

Wait, there's more.
A candidate can produce a
valid solution only if the lookup table blocks for all seven of
M, D, C, L, X, V and I are of the same (non-zero) type.
Assuming an equal number of blocks of each type,
that occurs with a probability of (2/3) (first block non-zero)
times (1/3)**6 (next six blocks match first non-zero block).
After a preliminary check of all seven bitmaps therefore,
only one in 1093.5 candidate solutions needs further
(more expensive) checking for an exact match --
via a 4GB table lookup plus calculation,
as detailed in earlier episodes of this series.
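
To make the arithmetic concrete, here is a minimal sketch of that pre-check, assuming the 2-bit block type codes are packed four to a byte as described above; the names get_type and precheck, and the per-letter block indexes, are illustrative rather than the original code:

    #include <cstddef>
    #include <cstdint>

    // Extract the 2-bit type code for block b from one packed 8MB bitmap:
    // 0 = all zeros, 1 = non-zero even, 3 = non-zero odd.
    inline uint8_t get_type(const uint8_t* bitmap, size_t b) {
        return (bitmap[b >> 2] >> ((b & 3) * 2)) & 3;
    }

    // Cheap pre-check across the seven bitmaps (M D C L X V I): reject the
    // candidate unless all seven of its blocks share the same non-zero type.
    // Only survivors (about one in 1093.5) proceed to the expensive
    // 4GB-table verification.
    bool precheck(const uint8_t* const bitmaps[7], const size_t blk[7]) {
        uint8_t t = get_type(bitmaps[0], blk[0]);
        if (t == 0)
            return false;              // all-zero block: no hit possible
        for (int i = 1; i < 7; ++i)
            if (get_type(bitmaps[i], blk[i]) != t)
                return false;          // mixed even/odd types: no hit possible
        return true;
    }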

Bitmaps

The code (64-bit compile, 32-bit int, 64-bit size_t) creates an 8MB bitmap from a 4GB lookup table.
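
A minimal sketch of that conversion, assuming the 4GB table is a flat array of 2**32 bytes (make_bitmap and BlockType are illustrative names, and most of the tuning noted below is omitted):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <xmmintrin.h>   // _mm_prefetch

    // Two bits per 128-byte block: bit 0 set if the block contains any
    // non-zero byte, bit 1 set if those non-zero bytes are odd.
    enum BlockType : uint8_t { ALL_ZERO = 0, NONZERO_EVEN = 1, NONZERO_ODD = 3 };

    void make_bitmap(const uint8_t* table, uint8_t* bitmap) {
        const size_t NBLOCKS = (size_t)1 << 25;   // 2^32 bytes / 128-byte blocks
        std::memset(bitmap, 0, NBLOCKS / 4);      // 8MB: four 2-bit codes per byte
        for (size_t b = 0; b < NBLOCKS; ++b) {
            const uint8_t* block = table + (b << 7);
            _mm_prefetch((const char*)(block + 4096), _MM_HINT_NTA);
            uint8_t type = ALL_ZERO;
            for (size_t i = 0; i < 128; ++i) {
                if (block[i] != 0) {
                    // By the fourth theorem, all non-zero bytes in a block
                    // share the same parity, so the first one suffices.
                    type = (block[i] & 1) ? NONZERO_ODD : NONZERO_EVEN;
                    break;
                }
            }
            bitmap[b >> 2] |= (uint8_t)(type << ((b & 3) * 2));
        }
    }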

The full version of this code uses the TLB, prefetch, vectorization and loop peeling techniques
discussed in earlier episodes, but this time with seven 8MB bitmaps.

This one runs in about 2.9 seconds; with 125**4 such runs required, that is 2.9 * 125**4 / 60 / 60 / 24 / 365 = 22 years.
Given multithreading and enough cores that should be fast enough to find
a solution in a year or two.

As you might expect, this inner loop code was trivial to multithread.
When the code was suffering from high memory latency, Intel's hyper-threading bought me quite a lot.
As I reduced that latency, hyper-threading gave me less and less, until I
didn't bother running more threads than there were physical cores on the machine.
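
(A minimal sketch of the threading, with search_slice() standing in for the inner-loop code; interleaving the 125 values of the first outer digit across the workers is an assumption, not necessarily the partitioning I used:)

    #include <thread>
    #include <vector>

    void search_slice(int first_digit);   // stand-in for the inner-loop code

    // Interleave the 125 values of the first outer digit across nthreads
    // workers, one disjoint slice each; no locking is needed because the
    // slices never overlap.
    void run_search(int nthreads) {
        std::vector<std::thread> workers;
        for (int t = 0; t < nthreads; ++t)
            workers.emplace_back([t, nthreads] {
                for (int d = t; d < 125; d += nthreads)
                    search_slice(d);
            });
        for (auto& w : workers)
            w.join();
    }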

In practice, this multithreaded code, running on two
four-core physical machines, found a solution in about a year.

I expect its performance could be considerably further improved using the many
excellent performance tips provided by oiskuu in replies to earlier
episodes in this series.

As in golf, luck played its part.
Finding the solution as soon as I did, after searching only 17 of 125 numbers, was about a one in nine shot.
Of course, as computer hardware improved over the next few years, the task would have got easier.

As you can see above, for some strange reason, only odd numbers were hit in the
first two hundred (of 0..1001), while only even numbers were hit in the last two
hundred; the middle ranges featured a mixture of odd and even.
I have no explanation for that, maybe it warrants yet another theorem of ambrus. :-)

Apart from keeping you interested, keeping statistics as you run a very long-running
program is a good way to protect against bugs: for example, by verifying that the rate
of finding "almost" solutions matches theoretical expectations.
You certainly don't want a search program, like Deep Thought,
that thinks silently for millions of years before telling you The Answer.
Imagine the pain of leaving a program running for years, only to find
it contained a silly bug which prevented a solution being found.
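
For instance, a simple counter comparing the observed pre-check survival rate against the expected one in 1093.5 might look like this (a sketch; SearchStats and its fields are hypothetical names):

    #include <cstdint>
    #include <cstdio>

    // Tally every candidate tested and every one that survives the bitmap
    // pre-check; the observed ratio should hover around the theoretical
    // 1 in 1093.5, and a large deviation suggests a bug.
    struct SearchStats {
        uint64_t candidates = 0;
        uint64_t survivors  = 0;

        void report() const {
            if (survivors == 0) return;
            double ratio = (double)candidates / (double)survivors;
            std::printf("candidates=%llu survivors=%llu (one in %.1f, expect 1093.5)\n",
                        (unsigned long long)candidates,
                        (unsigned long long)survivors, ratio);
        }
    };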

What was the point?

We all await The Final Chapter, when, we trust, All Mysteries Will Be Revealed.

Apart from that, for me personally, solving this problem was a journey,
forcing me out of my comfort zone, learning many interesting and new things.
I also enjoy writing, telling a story. I hope you enjoyed reading
this story as much as I enjoyed writing it.


I'll never be really good at golf!

I have no problem with attacking a problem from all angles; running programs for days or even weeks to find solutions; and optimisation is one of my most enjoyed pastimes and something of a passion. But there has to be -- at least notionally -- a practical use for the code or the results it produces.

But that's my hang-up, and is no bad reflection upon your pastime. Millions of people spend their time knocking little balls into a field and then looking for them; and often as not have to pay exorbitant annual and per-game fees for the privilege. Others write long lists of numbers in books. Yet more sit around all night in the freezing cold and/or rain, on the off chance that the clouds will clear long enough to peer at fuzzy blobs of light in the night sky. Each to their own "waste of time" :)

For me, two things come out of this latest of your series:

All the optimisation techniques, especially those from oiskuu, are extremely informative and will be useful to me in the future.

Your statement "know your data" is one of the most oft overlooked -- and outright ignored -- missives in our industry.

All too often people tackle tasks involving large datasets with the mindset that they must cater for the full generic range of possibilities for that data; when often large subsets of that range either cannot, or just usually do not, occur.

And in the latter case, in the rare event that the uncatered-for data does occur, it can be shown that the results would be anomalous anyway.

I enjoyed following along; albeit that I came late to the party. Thank you.


Luckily, it was accepted both by
shinh's golf server,
when I submitted it
about a month ago,
and by the code golf server (currently down).
Shinh's server runs Python 2.7.2 on Linux, the code golf server Python 2.5 on Linux.

It would be interesting to see the actual hash value and Python version by running this version:

Strange that this is the same Python version 2.7.5 as reported by mr_mischief
but on a different platform (Windows vs MacOS).
I tested both 32-bit and 64-bit Python on Windows (multiple versions) and didn't see a failure.

In my own defense ... :-) ... I guessed soon enough, as did we all, that this was really a discussion about a problem that required nothing less than a massive search to solve. And, that the only possible way to foreshorten such a search must be to find an early-abandonment strategy or strategies, and to prove that they work without taking 22 years to do so. I did not mean to “diss” your work at any point, and I think that this was clearly understood by most Monks and by you.

The point of my side-thread was, and is, that the search for algorithms to reduce a problem effectively can be tricky unto itself. I have seen many “Roman Numerals decoder rings” that were very complicated examples of recursion, simply because the designer in question didn’t hit upon “read it backwards.” (Not one of my teachers ever presented it to me that way.) Exhaustive searches have been done to hunt down problems ... or the searches were much larger than they needed to be ... because (generally in the days before Google It™) the designer missed a single key stroke of insight. Meant to be a parallel observation, not a rebuttal of any sort. And, I think, mostly understood to be so.

Although the search-space in this problem is extremely large such that bitmaps coupled with a vast amount of memory are required to solve it, we know that Moore’s Law will continue to hold true ... and this is a very nice demonstration of just how powerful the Perl language really is.

Early-abandonment strategies aren’t new: the “minimax” optimization used in game-playing programs is the simplest case. (Once you know that a solution is sufficiently bad that it will cause any of the preceding nested look-aheads to be rejected, you can clip away the entire subtree at the base, because you know it can only get worse and do not care exactly how much worse it gets.) The extraordinary challenge presented by this problem is its size, and the difficulty of proving that a particular abandonment strategy holds.

Very entertaining reading, well-worth all of its parts. Well done, and thanks very much for sharing.