A simple (but not necessarily correct, see below) model is: The line
is not in the I-cache, so it only invalidates the corresponding
D-cache line (if that line is present). So on the next D-cache access
there is a D-cache miss, and the access invalidates the corresponding
I-cache line, leading to an I-cache miss later on. Repeat.

This happens if you do alternating data and instruction accesses to
the same cache-line-sized region. E.g., consider primitives in a
traditional-style indirect-threaded Forth: the code field is directly
in front of the code, and is accessed directly before executing the
code; however, the code can be separated from the data without too
much hassle. With direct threaded code, you have the dual of the
situation: code in the code field adjacent to data of non-primitives
(no so easy to avoid, but still possible, seehttp://www.complang.tuwien.ac.at/papers/ertl01.ps.gz).

> which means that with a "normal">(unoptimised) indirect-thread implementation each cache would be>invalidated, flushed and filled twice for *each and every* thread "jump".>Do I have that right?

I do not know exactly what happens, but the timings indicate that
there are at most two (not four) L1 cache refills happening for each
pair of data and instruction accesses; in some cases, there is only
time for less than two normal refills. Nowadays I might be able to
use performance counters to find out more if I had access to a Pentium
with the appropriate software.

> This was only on the original Pentium?

The fundamental problem is there in all processors with split caches.
Younger architectures fix this by making it software's job to keep the
I-cache coherent with the D-cache. But for older architectures with a
strong compatibility requirement all implementations with split caches
have the problem in some form or other.

For the IA32 architecture the K5 and K6-* have the problem in a form
similar to the Pentium (and Pentium MMX). The P6 (PentiumPro and
later), K7 (Athlon and later), and Pentium 4 cores are a little more
sophisticated: on these chips an I-cache/trace-cache invalidation only
happens when there is a write to the D-cache line.