The routine is memory based.
Everybody knows that memory is slower than the CPU.
So you would assume that it does not matter how the routine accesses the memory,
as the memory will always be the bottleneck. This is not true.

Loop unrolling helps, but in real life the time needed to process the data may be higher,
so the impact of the unrolling will usually be lower.

TODAY ALL CPUS HAVE MULTIPLE INSTRUCTION UNITS
WHY CAN THE CPU PROCESS MULTIPLE BYTES PER CLOCK?

Yes, all modern CPUs have multiple instruction units and are often able to process
up to 4 independent integer instructions at the same time.
But many CPUs can only perform one memory load per clock.
Because of the multiple instruction units, we assume that our example CPU only needs one clock to process each byte. We also assume that the CPU is able to execute the string-end compare at the same time as the processing of the memory byte.

FACT: while modern CPUs have multiple instruction units, they often have a limited number of load/store units.
This means our CPU might be able to perform 4 integer operations per clock, but only one load.

HOW CAN WE SPEED UP THE ROUTINE?
By operating on longer words. If we can find a way to access the memory 32 bits at a time instead of 8, we can greatly increase the throughput.

Let's say we are able to work on 32-bit words and we unroll the loop 4 times.

HOW CAN ALTIVEC HELP?
AltiVec has powerful string pattern matching operations, and AltiVec is able to process 16 bytes per instruction.

So with AltiVec we could in theory get down to:
10 clocks loading the first cache line (16 bytes)
+ 1 clock processing the cache line (working on 16 bytes)
+ 1 clock for the loop
+10 clocks loading the next 16 bytes
+ 1 clock processing the next cache line
+ 1 clock for the loop

Total: 24 clocks

--
The above example is oversimplified.
If you have any ideas on how to improve it or how to explain this better, please comment.

The explanation is nice, as long as you mention that it is a general illustration, not an accurate description of every detail.

There are a few additional goodies in the hardware that are triggered more easily from vector code than from scalar code. Most importantly the "store miss merging" mechanism. The processor attempts to merge two pending stores of adjacent small quantities into one store of a larger quantity. This is recursively done each clock cycle, but the hardware cannot work miracles. It's simplest for the hardware when two adjacent vector stores can be merged, because then the merged transaction contains a whole cache line. In that particular case, the modified cacheline can be written out directly, instead of the more usual "read old contents then merge with new data" procedure. This can reduce bandwidth requirements of a copy operation by a third.

Another potential benefit of AltiVec is that there won't be conditional branches in the loop body; vectorization forces you to find a more streamlined algorithm simply because you cannot express spaghetti code with vectors. Another side effect of the inherent parallelism is that loops will be sort-of unrolled, as each iteration works on more than one element of data. Actual unrolling still helps, of course.

On the downside, you cannot scan along vectors sequentially. The hardware processes each element at the same time, so it just doesn't fit the paradigm. I could imagine one or two extra instructions that would help to vectorize that type of algorithm, but I don't know if those could practically be etched in silicon. Data dependencies across the full vector width are bad for the timing.

Yes, the explanation is good. Too many people still think that the processor fetches and executes one instruction after another. Their perception of performance is limited to the question "how long does this particular instruction take?".

Anything that educates them about the enormous amount of parallelism that is really at work in even a comparatively simple CPU like the G4+ is a good thing. These days, performance means high throughput, not low latency.

Arriving at a result in fewer instructions is one trick, but it is equally important to hand enough independent work to the CPU so that all the parallel functional units (or pipeline stages) can simultaneously help to achieve the overall goal.

A traditional metaphor for a CPU used to be a file clerk working at his desk. But these days it is more appropriate to think of the processor as a big factory hall with lots of workers and machinery. That other image makes it obvious that a lot of work can be done in parallel, and that efficiency means to keep everyone busy with sensible work.