Chris Leary

I'm back from holiday break and I need to limber up my tech writing a bit. Reason 1337 of my ever-growing compendium, Nerdy Reasons to Love the Internet, is that there are always interesting discussions going on. [*] I came across Never create Ruby strings longer than 23 characters the other day, and despite the link-bait title, there's a nice discussion of string representation within MRI (a Ruby VM).

My recap will be somewhat abbreviated, since I've only given myself a chunk of the morning to write this, so feel free to ask for clarification / follow up in the comments.

Basic language overview

At the language level JavaScript strings are pretty easy to understand. They are immutable, same as in Python:
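A small sketch of that immutability, using only standard JavaScript behavior. The helper name `rejectsIndexWrite` is made up for illustration.

```javascript
const greeting = 'hello';

// String methods never modify the receiver; they return a new string.
const shouted = greeting.toUpperCase();

// Writing to an index of a primitive string throws a TypeError in strict
// mode (in sloppy mode the write is silently ignored).
function rejectsIndexWrite(str) {
  'use strict';
  try {
    str[0] = 'j';
    return false;
  } catch (e) {
    return e instanceof TypeError;
  }
}

console.log(greeting);                    // hello
console.log(shouted);                     // HELLO
console.log(rejectsIndexWrite(greeting)); // true
```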

But you can (without mutating any of the original values) compare them for equality, concat them, slice them, regexp replace/split/match on them, trim whitespace from them, slap them to chop vegetables, and so forth. (See the MDN docs for String.prototype methods.) In the VM, we need to make those operations fast, with an emphasis on the operations that the web uses heavily, which are ideally[†] the ones reflected in benchmarks.

Abstractly

In an abstract sense, a primitive string in SpiderMonkey is a GC cell (i.e. small header that is managed by the garbage collector) that has a length and refers to an array of UCS2 (uniformly 16-bit) characters. [‡]

Recall that, in many dynamic language implementations, type tagging is used in order to represent the actual type of a statically-unknown-typed value at runtime. This generally allows you to work on integers (and, in SpiderMonkey, doubles) without allocating any space on the heap. Primitive strings are very important to distinguish quickly and they are subtly distinct from (non-primitive) objects, so they have their own type tag in our value representation, as you can see in the following VM function:

/*
 * Convert the given value to a string. This method includes an inline
 * fast-path for the case where the value is already a string; if the value is
 * known not to be a string, use ToStringSlow instead.
 */
static JS_ALWAYS_INLINE JSString *ToString(JSContext *cx, const js::Value &v)
{
    if (v.isString())
        return v.toString();
    return ToStringSlow(cx, v);
}

Aside

In JavaScript there's an annoying distinction between primitive strings and string objects that you may have seen:
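A quick illustration of that distinction, relying only on standard JavaScript behavior (the variable names are arbitrary):

```javascript
const primitive = 'nose';
const wrapped = new String('nose');

console.log(typeof primitive);                // "string"
console.log(typeof wrapped);                  // "object"
console.log(primitive === wrapped);           // false: different types
console.log(primitive == wrapped);            // true: the object is converted first
console.log(primitive === wrapped.valueOf()); // true
```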

For simplicity and because they're uninteresting, let's pretend those new String things don't exist.

Atoms

The simplest string form to describe is called an "atom", which is somewhat similar to an interned string in Python. When you write a literal string or identifier in your JavaScript code, SpiderMonkey's parser turns it into one of these atoms.

Note that the user has no overt control over which strings get atomized (i.e. there is no intern builtin). Also, there are a bunch of "primordial" atoms that the engine creates when it starts up: things like the empty string, prototype, apply, and so on.

The interesting property of atoms is that any two atoms can be compared in O(1) time (via pointer comparison). Some work is required on behalf of the runtime to guarantee that property.

To get an atom within the VM, you have to say, "Hey SpiderMonkey runtime, atomize these characters for me!" In the general case the runtime then does a classic "get or create" via a hash table lookup: it determines whether or not those characters have an existing atom and, if not, creates one. The atomized primitive string that you get back always has its characters contiguous in memory — a property which is interesting by contrast to...
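The "get or create" step can be modeled in a few lines. This is only a sketch: `AtomTable` and the cell layout are invented for illustration, and a JavaScript `Map` stands in for the runtime's hash table.

```javascript
// Hypothetical model: an "atom" is a frozen cell object, and the table maps
// character contents to the unique cell for those contents.
class AtomTable {
  constructor() {
    this.table = new Map();
  }

  // Classic get-or-create: return the existing atom for these characters,
  // or create and remember a new one.
  atomize(chars) {
    let atom = this.table.get(chars);
    if (atom === undefined) {
      atom = Object.freeze({ chars, length: chars.length });
      this.table.set(chars, atom);
    }
    return atom;
  }
}

const atomTable = new AtomTable();
const first = atomTable.atomize('prototype');
const second = atomTable.atomize('proto' + 'type');
const other = atomTable.atomize('apply');

// Comparing atoms is comparing cell identity; no characters are inspected.
console.log(first === second); // true
console.log(first === other);  // false
```

The cost of the hash lookup is paid once, at atomization time, which is what buys the O(1) comparison afterwards.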

Ropes

Let's say that you had immutable strings, like in JavaScript, and you had three large books already in string form: let's call them AoCP I, II, and III. Then, some jerk thinks it would be funny to take the first third of the first book, the second third of the second book, and the third third of the third book, and slice them together into a single string.

What's the simplest thing that could possibly work? Let's say that each book is 8MiB long. You could allocate a new, 8MiB array of characters and memcpy the appropriate characters from each string into the resulting buffer, but now you've added 33% memory overhead and wasted quite a few cycles.

A related issue is efficient appending and prepending. Let's say you have a program that does something like:

var resultString = '';
function onMoreText(text) {
    // If the text starts with a letter in the lower part of the alphabet,
    // put it at the beginning; otherwise, put it at the end.
    if (text[0] < 'l')
        resultString = text + resultString;
    else
        resultString = resultString + text;
}

If you did the naive "new string and memcpy" for all of the appends and prepends, you'd end up creating a lot of useless garbage inside the VM. The Python community has the conventional wisdom that you should build up a container (like a deque) and join on it, but it's difficult to hold the entire ecosystem of web programmers to such standards.

In the SpiderMonkey VM, the general solution to problems like these is to build up a tree-like data structure that represents the sequence of immutable substrings, and collapse that data structure only when necessary. Say that you write this:

The concatenation is performed lazily by using a tree-like data structure (actually a DAG, since the same string cell can appear in the structure at multiple points) that we call a rope. Say that all of the arguments are different atoms — the resulting rope would look like:

Since strings are immutable at the language level, cycles can never form. When the character array is requested within the engine, a binary tree traversal is performed to flatten the constituent strings' characters into a single, newly-allocated buffer. Note that, when the rope is not flattened, the characters which constitute the overall string are not in a single contiguous region of memory — they're spread across several buffers!
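To make the lazy concatenation concrete, here is a toy model of a rope. The class names `Leaf` and `Rope` and the helper `concat` are assumptions made for this sketch; the real engine works on GC cells and raw character buffers rather than JavaScript objects.

```javascript
// Hypothetical model: a leaf holds flat characters; a rope node holds two
// children and defers any copying until flatten() is called.
class Leaf {
  constructor(chars) {
    this.chars = chars;
    this.length = chars.length;
  }
  flatten() {
    return this.chars;
  }
}

class Rope {
  constructor(left, right) {
    this.left = left;
    this.right = right;
    this.length = left.length + right.length; // O(1), nothing is copied
    this.flat = null;
  }

  // Walk the tree once, left to right, collecting the leaves' characters,
  // and remember the result so later requests are cheap.
  flatten() {
    if (this.flat === null) {
      const pieces = [];
      const stack = [this];
      while (stack.length > 0) {
        const node = stack.pop();
        if (node instanceof Rope && node.flat === null) {
          stack.push(node.right, node.left); // left is popped first
        } else {
          pieces.push(node.flatten());
        }
      }
      this.flat = pieces.join('');
    }
    return this.flat;
  }
}

function concat(a, b) {
  return new Rope(a, b);
}

// The same node may appear more than once, which makes the structure a DAG.
const shared = concat(new Leaf('one '), new Leaf('two '));
const rope = concat(shared, concat(shared, new Leaf('three')));

console.log(rope.length);    // 21
console.log(rope.flatten()); // one two one two three
```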

Dependent strings

How about when you do superHugeString.substr(4096, 16384)? Naively, you need to copy the characters in that range into a new string.

However, in SpiderMonkey there's also a concept of dependent strings which simply reuse some of the buffer contents of an existing string's character array. In a very simple fashion, the derived string keeps the referred-to string alive in order to reuse the characters in the referred-to string's buffer.
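A minimal sketch of the idea, assuming a made-up representation in which a flat string owns a `Uint16Array` of characters. `FlatString`, `DependentString`, and `substr` are hypothetical names for this model only.

```javascript
// Hypothetical model: `buffer` stands in for a flat character array.
class FlatString {
  constructor(buffer) {
    this.buffer = buffer; // Uint16Array of 16-bit characters
    this.length = buffer.length;
  }
  charCodeAt(i) {
    return this.buffer[i];
  }
}

class DependentString {
  constructor(base, start, length) {
    this.base = base; // this reference keeps the base string alive
    this.start = start;
    this.length = length;
  }
  charCodeAt(i) {
    return this.base.charCodeAt(this.start + i);
  }
}

// No characters are copied; only three fields are stored.
function substr(str, start, length) {
  return new DependentString(str, start, length);
}

const superHugeString = new FlatString(
  Uint16Array.from({ length: 32768 }, (_, i) => 65 + (i % 26))
);
const piece = substr(superHugeString, 4096, 16384);

console.log(piece.length);                                             // 16384
console.log(piece.charCodeAt(0) === superHugeString.charCodeAt(4096)); // true
```

The flip side of this design is that a small dependent string keeps the entire base buffer alive for as long as it is reachable.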

Fancier and fancier!

Really small strings are a fairly common case: they are used for things like array property indices and single characters of strings. Recall that, in JavaScript, all object properties are named by strings, unlike in languages like Python, which use arbitrary hashables. To optimize for this case, we have strings with characters embedded into their GC cell header, avoiding a heap-allocated character buffer. [§] We also pre-initialize many of these atoms (strings shorter than length 3 and integers up to 256) when the runtime starts up to bypass the typical hash table lookup overhead.

I'm out of time, but I hope this gives some insight into the good number of tricks that are played to make common uses of JavaScript strings fast within the SpiderMonkey VM. For you curious types, there's lots more information in the code!

Of all the things I've lost...

For a long time, I felt that I wasn't crazy enough to write my own blog software. So, in a totally sane fashion, I:

Meticulously wrote all of my blog entries in reStructuredText, complete with metadata, in my own Mercurial repository

Converted that reStructuredText to HTML with my own custom extension to Docutils' rst2html capability

Copied and pasted the HTML from the rendered file into Wordpress

Updated the metadata in Wordpress by hand

Of course, with every edit, I repeated all of these steps.

Now, this wouldn't be so bad, if writing weren't such a damn perfectionist art. I'm not sure of the average number of cycles I took around this loop of automation apathy for each blog entry, but I would guess it was around five. Each trip around the loop I hated it more.

They say that the definition of insanity is doing the same thing over and over again, but expecting different results.

I think I've successfully channeled my gripes into the implementation of MicroClog. I hope that, ultimately, greasing the wheels on this process will help flush the 83 entries in my drafts folder (along with a small handful of unfulfilled promises to write something) out to the internet.

The idea behind MicroClog

Writing about code is a total pain in most blog engines. Writing in reStructuredText rocks.

MicroClog chooses reStructuredText over WYSIWYG/HTML editing and existing distributed revision control systems over an in-blog-engine revision control system. The current workflow for MicroClog is:

Write a blog entry in reStructuredText on your local machine

Commit and push the changes to a repository on the host server

The host server's repository hook renders entries that have changed

Entries designated for publishing are publicly visible

There are also ways to share drafts in a restricted fashion. I'm currently hacking together a "live preview" on the server side for the reST entries you're editing on the client side, using the fancy new server-sent events API.

Ultimately, there are a few simple tasks that I want to optimize for:

Start an entry and dump a stream-of-consciousness text in it

Share draft entries with proofreaders

Converge on a publication by iterating a read-and-tweak cycle

I love writing in my text editor — especially when writing about code — but I also want to marginalize the advantages WYSIWYG has over markup by getting live previews as smooth as possible.

Feature creep

There are some more sordid incentives for me to have all my blog data easily queried and manipulated in a Django app. A few of the features I'd like to try adding in the future:

First class updates

I would really like to support the idea of an "update" or "followup" as a first class feature — manually hacking old entries to point at newer ones with followup content is lame, and engine support for that kind of workflow isn't difficult.

More widgets

I've always wanted to have a widget where I could select a handful of my hundred-odd drafts and generate a poll where users could select the title/intro blurb that was most interesting to them. Knowing what people are interested in reading gives me additional motivation.

Decoupling syndication and entry labeling

I find that tying planet syndication directly to the feed generated for a label has been bothersome. Sometimes I feel like I want to syndicate an entry to a planet but that label isn't appropriate, or sometimes I don't want to syndicate an entry to a planet but I do want to use the label.

Statistics pr0n

Because data is fun to look at. Some ideas I've had:

tf/idf style analysis to suggest tags automatically

Plot of entries correlated against start/publish date/time

Start-to-publish duration versus word count

The good left undone

I'm still writing the software out of a private repo, because I've perpetrated some epic hackery in the interest of shipping.

There's a bunch done, but there's a lot more cleanup, generalization, and feature implementation to do. My day job isn't at all webdev related, so if you're knowledgeable and interested in helping me generalize a system like this for public source release, feel free to get in touch!

Footnotes

You should be cautious that the functions you call from generators don't accidentally raise StopIteration exceptions.

Background

Generators are generally a little tricky in a behind-the-scenes sort of way. Performing a return from a generator is different from a function's return — a generator's return statement actually raises a StopIteration instance.

Snag

There's a snag that I run into at times: when you're writing a generator and that generator calls into other functions, be aware that those callees may accidentally raise StopIteration exceptions themselves.

If you substitute StopIteration with ValueError, you get a traceback, as you'd probably expect. The leakage of a StopIteration exception, however, propagates up to the code that moves the generator along, [*] which sees an uncaught StopIteration exception and terminates the loop.
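The failure mode can be reproduced with a hand-rolled model of an exception-driven iteration protocol. Everything here (`StopIteration` as a plain class, `makeIterator`, `drive`, `lookup`) is invented for the sketch and is not any engine's real machinery; it only shows why the driving loop cannot tell an accidental StopIteration from a deliberate one.

```javascript
// Hypothetical exception-driven iteration protocol, modeled by hand.
class StopIteration extends Error {}

// Produces items by calling produce(i) until it throws StopIteration.
function makeIterator(produce) {
  let i = 0;
  return { next: () => produce(i++) };
}

// The driver: the code that "moves the iterator along".
function drive(iterator) {
  const seen = [];
  for (;;) {
    try {
      seen.push(iterator.next());
    } catch (e) {
      if (e instanceof StopIteration) break; // treated as normal exhaustion
      throw e;
    }
  }
  return seen;
}

// A helper with a bug: it reports "not found" with the wrong exception type.
function lookup(table, key) {
  if (!(key in table)) throw new StopIteration();
  return table[key];
}

const table = { 0: 'a', 1: 'b', 3: 'd' };
const keys = [0, 1, 2, 3];
const leaky = makeIterator((i) => {
  if (i >= keys.length) throw new StopIteration(); // intended termination
  return lookup(table, keys[i]); // accidental termination at key 2
});

const collected = drive(leaky);
console.log(collected); // [ 'a', 'b' ]  ('d' is silently lost)
```

Swapping the exception thrown by `lookup` for any other error type makes `drive` rethrow it, which is the traceback behavior described above.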

JavaScript

The same trap exists in the SpiderMonkey (Firefox) JavaScript dialect:

Design

You may look at this issue and think:

The fundamental problem is that uncaught exceptions are raised over any number of function invocations. Generators should have been designed such that you have to stop the generator from the generator invocation.

In an alternate design, a special value (like a StopIteration singleton) might be yield'd from the generator to indicate that it's done.

One issue with that alternative approach is that you're introducing a special-value-check into the hot path within the virtual machine — i.e. you'd be slowing down the common process of yielding iteration items. Using an exception only requires the VM to extend the normal exception handling machinery a bit and adds no additional overhead to the hot path. I think the measurable significance of this overhead is questionable.

Another issue is that it hurts the lambda abstraction — namely, the ability to factor your generator function into smaller helper functions that also have the ability to terminate the generator. In the absence of a language-understood exception, the programmer has to invent a new way for the helper functions to communicate to the generator that they'd like the generator to terminate.
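As a sketch of what that invention tends to look like, here is a generator whose helper signals termination through a sentinel value. The names `DONE`, `nextRecord`, and `records` are hypothetical; the point is only that the generator body has to check for the sentinel and return on the helper's behalf.

```javascript
// Hypothetical sentinel that a helper returns to say "please stop".
const DONE = Symbol('done');

// The helper cannot end the generator itself; it can only report back.
function nextRecord(lines, i) {
  return i < lines.length && lines[i] !== '' ? lines[i] : DONE;
}

function* records(lines) {
  for (let i = 0; ; i++) {
    const record = nextRecord(lines, i);
    if (record === DONE) return; // the check lives in the generator body
    yield record;
  }
}

const firstBatch = [...records(['a', 'b', '', 'c'])];
console.log(firstBatch); // [ 'a', 'b' ]
```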

Footnotes

I am totally taken aback by the lack of hyperbolic romanticism in the foreword of the programming language book that I just got. There are horribly boring-sounding hooks along the lines of "computers are ubiquitous in the 21st century" and "connecting the theoretical foundations of computer science to modern platform architectures". Hopefully people don't judge a book by its cover or its foreword.

Pragmatic rhetoric is uninspiring — you could just as easily write a foreword for a book on ripping up floorboards and talk about how "floorboards are ubiquitous in the 21st century" and "connects the theoretical foundations of wood science to modern house architectures".

If I were to write a foreword that mentioned the reasons you should be interested in programming languages, it would go roughly like this, to which I hope there is no analogy for ripping up floorboards:

Foreword

Rejoice, programming languages are the irrigation ditches of your mind-goo!

You've got brilliant ideas brewing inside your head, trying to claw their way out of your little mind and escape into the world. Unfortunately, the meat shell that your ideas live in is quite limited — you can barely eat and talk at the same time, and even then, you're not supposed to be talking with your mouth full. It's bad manners.

Luckily for you, computers provide another venue to express and execute your abstract thoughts. The way you express yourself to these unquestioning harbingers of awesome is through programming languages, whose programs cause actions to be taken in the computing device's virtual world.

Much like ice cream, programming languages come in many flavors (and some have those great cookie dough chunks that you find when you first bite into them). The flavor of a programming language shapes the way that people express and reason about their abstract thoughts — a result of the language's unique design and implementation. Because there's no "best" way to channel your thoughts into a computing device, our work on programming languages is never done!

What [whoever] has written here is [probably] a wonderful explanation of the way we currently bridge the gap between the mind and the computing device for the flavors of programming language we've invented-and-used thus far. Learning from historical successes and pitfalls is key to really understanding existing programming languages and evaluating the design decisions that you'll be making for the programming languages of tomorrow.

Inline caching is a critical ingredient in the delicious pie that is dynamic language performance optimization. What follows is a gentle-albeit-quirky introduction to what polymorphic inline caches (PICs) are and why they're useful to JavaScript Just-In-Time compilers like JaegerMonkey.

But first, the ceremonial giving of the props: the initial barrage of PIC research and implementation in JaegerMonkey was performed by Dave Mandelin and our current inline cache implementations are largely the work of David Anderson. As always, the performance improvements of Firefox's JavaScript engine can be monitored via the Are We Fast Yet? website.

C is for speed, and that's good enough for me

C is fast.

Boring people (like me) argue about astoundingly interesting boring things like, "Can hand-tuned assembly be generally faster than an equivalent C program on modern processor architectures?" and "Do languages really have speeds?", but you needn't worry — just accept that C is fast, and we've always been at war with Eurasia.

So, as we've established, when you write a program in C, it executes quickly. If you rewrite that program in your favorite dynamic language and want to know if it still executes quickly, then you naturally compare it to the original C program.

C is awesome in that it has very few language features. For any given snippet of C code, there's a fairly direct translation to the corresponding assembly instructions. [*] You can almost think of C as portable assembly code. Notably, there are (almost) zero language features that require support during the program's execution — compiling a C program is generally a non-additive translation to machine code.

Dynamic languages like JavaScript have a massive number of features by comparison. The language, as specified, performs all kinds of safety checks, offers you fancy-n-flexible data record constructs, and even takes out the garbage. These things are wonderful, but generally require runtime support, which is supplied by the language engine. [†] This runtime support comes at a price, but, as you'll soon see, we've got a coupon for 93 percent off on select items! [‡]

You now understand the basic, heart-wrenching plight of the performance-oriented dynamic language compiler engineer: implement all the fancy features of the language, but do it at no observable cost.

Interpreters, virtual machines, and bears

"Virtual machine" sounds way cooler than "interpreter". Other than that, you'll find that the distinction is fairly meaningless in relevant literature.

An interpreter takes your program and executes it. Generally, the term "virtual machine" (AKA "VM") refers to a sub-category of interpreter where the source program is first turned into fake "instructions" called bytecodes. [§]

I call these instructions fake because they do things that hardware processing units are unlikely to ever do: for example, an ADD bytecode in JavaScript will try to add two arbitrary objects together. [¶] The point that language implementors make by calling it a "virtual machine" is that there is conceptually a device, whether in hardware or software, that could execute this set of instructions to run the program.

These bytecodes are then executed in sequence. A program instruction counter is kept in the VM as it executes, analogous to a program counter register in microprocessor hardware, and control flow bytecodes (branches) change the typical sequence by indicating the next bytecode instruction to be executed.
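A toy interpreter loop shows the moving parts. The four bytecodes below are invented for this sketch and have nothing to do with SpiderMonkey's real instruction set; the part to look at is the program counter `pc` and how the branch overwrites it.

```javascript
// Hypothetical bytecode set for a tiny stack machine.
const PUSH = 'PUSH', ADD = 'ADD', JUMP_IF_LESS = 'JUMP_IF_LESS', HALT = 'HALT';

function run(bytecodes) {
  const stack = [];
  let pc = 0; // program counter: index of the next bytecode to execute
  for (;;) {
    const [op, arg, target] = bytecodes[pc];
    pc += 1; // by default, fall through to the next bytecode
    switch (op) {
      case PUSH:
        stack.push(arg);
        break;
      case ADD: {
        const right = stack.pop();
        const left = stack.pop();
        stack.push(left + right); // JS `+` accepts arbitrary values
        break;
      }
      case JUMP_IF_LESS:
        // Control flow: if the top of the stack is below `arg`, branch.
        if (stack[stack.length - 1] < arg) pc = target;
        break;
      case HALT:
        return stack.pop();
      default:
        throw new Error('unknown bytecode: ' + op);
    }
  }
}

// Count up from 0 in steps of 1 until the value reaches 3.
const countingProgram = [
  [PUSH, 0],
  [PUSH, 1],
  [ADD],
  [JUMP_IF_LESS, 3, 1],
  [HALT],
];

console.log(run(countingProgram));                              // 3
console.log(run([[PUSH, 'nose'], [PUSH, 1], [ADD], [HALT]]));   // nose1
```

The second program is the "fake instruction" point in miniature: the same ADD happily concatenates a string and a number.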

Virtual (machine) reality

Languages implemented in "pure" VMs are slower than C. Fundamentally, your VM is a program that executes instructions, whereas compiled C code runs on the bare metal. Executing the VM code is overhead!

To narrow the speed gap between dynamic languages and C, VM implementers are forced to eliminate this overhead. They do so by extending the VM to emit real machine instructions — bytecodes are effectively lowered into machine-codes in a process called Just-In-Time (JIT) compilation. Performance-oriented VMs, like Firefox's SpiderMonkey engine, have the ability to JIT compile their programs.

The term "Just-In-Time" is annoyingly vague — just in time for what, exactly? Dinner? The heat death of the universe? The time it takes me to get to the point already?

In today's JavaScript engines, the lowering from bytecodes to machine instructions occurs as the program executes. With the new JaegerMonkey JIT compiler, the lowering occurs for a single function that the engine sees you are about to execute. This has less overhead than compiling the program as a whole when the web browser receives it. The JaegerMonkey JIT compiler is also known as the method JIT, because it JIT compiles a method at a time.

For most readers, this means a few blobs of x86 or x86-64 assembly are generated as you load a web page. The JavaScript engine in your web browser probably spewed a few nice chunks of assembly as you loaded this blog entry.

Aside: TraceMonkey

In SpiderMonkey we have some special sauce: a second JIT, called TraceMonkey, that kicks in under special circumstances: when the engine detects that you're running loopy code (for example, a for loop with a lot of iterations), it records a stream of bytecodes that corresponds to a trip around the loop. This stream is called a trace and it's interesting because a) it can record bytecodes across function calls and b) the trace optimizer works harder than the method JIT to make the resulting machine code fast.

There's lots more to be said about TraceMonkey, but the inline caching optimization that we're about to discuss is only implemented in JaegerMonkey nowadays, so I'll cut that discussion short.

The need for inline caching

In C, accessing a member of a structure is a single "load" machine instruction:

The way that the members of struct Nose are laid out in memory is known to the C compiler because it can see the struct definition — getting the attribute nose->isPointy translates directly into a load from the address addressof(nose) + offsetof(Nose, isPointy).

Note: Just to normalize all the terminology, let's call the data contained within a structure the properties (instead of members) and the way that you name them the identifiers. For example, isPointy is an identifier and the boolean data contained within nose->isPointy is the property. The act of looking up a property through an identifier is a property access.

On the other hand, objects in JavaScript are flexible — you can add and delete arbitrary properties from objects at runtime. There is also no language-level support for specifying the types that an identifier can take on. As a result, there's no simple way to know what memory address to load from in an arbitrary JavaScript property access.

Consider the following snippet:

function isNosePointy(nose) {
return nose.isPointy;
}

To get at the isPointy property, the JavaScript VM emits a single bytecode, called GETPROP, which says "pull out the property with the identifier isPointy". [#] Conceptually, this operation performs a hash-map lookup (using the identifier as a key), which takes around 45 cycles in my microbenchmark. [♠]

The process of "looking up a property at runtime because you don't know the exact type of the object" falls into a general category of runtime support called dynamic dispatch. Unsurprisingly, there is execution time overhead associated with dynamic dispatch, because the lookup must be performed at runtime.

To avoid performing a hash-map lookup on every property access, dynamic language interpreters sometimes employ a small cache for (all) property accesses. You index into this cache with the runtime-type of the object and desired identifier. [♥] Resolving a property access against this cache under ideal circumstances takes about 8.5 cycles.
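A rough sketch of such a cache, assuming a made-up object model in which every object points at a shared `shape` (standing in for its runtime type) that maps identifiers to slot indices. The names `makeShape`, `makeObject`, `getProp`, `CACHE_SIZE`, and the hash function are all invented for illustration.

```javascript
// Hypothetical object model: a shape maps identifiers to slot indices, and
// objects with the same layout share one shape.
let nextShapeId = 1;
function makeShape(identifiers) {
  return {
    id: nextShapeId++,
    slotOf: new Map(identifiers.map((name, i) => [name, i])),
  };
}
function makeObject(shape, values) {
  return { shape, slots: values };
}

const CACHE_SIZE = 64;
const propertyCache = new Array(CACHE_SIZE).fill(null);
const stats = { hits: 0, misses: 0 };

// Combine the shape's identity with the identifier's characters.
function cacheIndex(shape, identifier) {
  let h = shape.id;
  for (let i = 0; i < identifier.length; i++) {
    h = (h * 31 + identifier.charCodeAt(i)) >>> 0;
  }
  return h % CACHE_SIZE;
}

function getProp(obj, identifier) {
  const index = cacheIndex(obj.shape, identifier);
  const entry = propertyCache[index];
  if (entry !== null && entry.shape === obj.shape && entry.identifier === identifier) {
    stats.hits++;
    return obj.slots[entry.slot]; // fast path: slot index is already known
  }
  stats.misses++;
  const slot = obj.shape.slotOf.get(identifier); // the slow hash-map lookup
  if (slot === undefined) return undefined;
  propertyCache[index] = { shape: obj.shape, identifier, slot };
  return obj.slots[slot];
}

const noseShape = makeShape(['isPointy', 'length']);
const pointyNose = makeObject(noseShape, [true, 3]);
const roundNose = makeObject(noseShape, [false, 5]);

console.log(getProp(pointyNose, 'isPointy')); // true  (miss, fills the cache)
console.log(getProp(roundNose, 'isPointy'));  // false (hit, same shape)
console.log(stats.hits, stats.misses);        // 1 1
```

Because one table serves every access in the program, unrelated property accesses can evict each other's entries, so the hit rate depends on global access patterns.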

WTF is inline caching already!?

So we've established that, with good locality, JS property accesses are at least 8.5x slower than C struct property accesses. We've bridged the gap quite a bit from 45x slower. But how do we bridge the gap even bridgier?

The answer is, surprisingly, self-modifying code: code that modifies code-that-currently-exists-in-memory. When we JIT compile a property access bytecode, we emit machine-code that looks like this:

Now, if you ask Joe Programmer what he thinks of that code snippet, he would correctly deduce, "The slow lookup code will always be executed!" However, we've got the self-modifying code trick up our sleeves. Imagine that the type matched, so we didn't have to go to the slow lookup code — what's our new property access time?

One type load, one comparison, an untaken branch, and a property value load. Assuming good locality/predictability and that the object's type happened to already be in the register (because you tend to use it a lot), that's 0+1+1+1 == 3 cycles! Much better.

But how do we get the types to match? Joe Programmer is still looking pretty smug over there.

The trick is to have the slowLookupCode actually modify this snippet of machine code! After slowLookupCode resolves the property in the traditional ways mentioned in previous sections, it fills in a reasonable value for IMPOSSIBLE_TYPE and IMPOSSIBLE_SLOT like they were blank fields in a form. This way, the next time you run this machine code, there's a reasonable chance you won't need to go to slowLookupCode — the types might compare equal, in which case you can perform a simple load instruction to get the property that you're looking for!

This technique of modifying the JIT-compiled code to reflect a probable value is called inline caching: inline, as in "in the emitted code"; caching, as in "cache a probable value in there". This is the basic idea behind inline caches, AKA ICs.
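Real inline caches patch machine code, which can't be shown in a few lines of JavaScript, so the following is a simulation under loud assumptions: a plain `site` object stands in for the emitted snippet, overwriting its fields stands in for self-modification, and `IMPOSSIBLE_SHAPE`, `makeSite`, and `getPropAtSite` are names made up here. For brevity it patches on the very first miss.

```javascript
// Hypothetical placeholders that no real object can ever match.
const IMPOSSIBLE_SHAPE = Symbol('impossible shape');
const IMPOSSIBLE_SLOT = -1;

// One site per property access in the program text.
function makeSite(identifier) {
  return { identifier, shape: IMPOSSIBLE_SHAPE, slot: IMPOSSIBLE_SLOT, slowCalls: 0 };
}

function slowLookupCode(site, obj) {
  site.slowCalls++;
  const slot = obj.shape.slotOf.get(site.identifier); // traditional lookup
  if (slot === undefined) return undefined;
  // "Self-modify": fill in the blanks so the next run can take the fast path.
  site.shape = obj.shape;
  site.slot = slot;
  return obj.slots[slot];
}

function getPropAtSite(site, obj) {
  if (obj.shape === site.shape) { // shape load, compare, untaken branch
    return obj.slots[site.slot];  // one load from a known slot
  }
  return slowLookupCode(site, obj);
}

const noseLayout = { slotOf: new Map([['isPointy', 0]]) };
const noses = [
  { shape: noseLayout, slots: [true] },
  { shape: noseLayout, slots: [false] },
  { shape: noseLayout, slots: [true] },
];

const isPointySite = makeSite('isPointy');
const answers = noses.map((nose) => getPropAtSite(isPointySite, nose));

console.log(answers);                // [ true, false, true ]
console.log(isPointySite.slowCalls); // 1
```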

Also, because we emit this snippet for every property-retrieving bytecode we don't rely on global property access patterns like the global property cache does. We mechanical mariners are less at the mercy of the gods of locality.

Where does "P" come from?

Er, right, we're still missing a letter. The "P" in "PIC" stands for polymorphic, which is a fancy sounding word that means "more than one type".

The inline cache demonstrated above can only remember information for a single type; any other type will result in a shapeIsKnown of False and you'll end up going to the slowLookupCode.

Surveys have shown that the degree of polymorphism (number of different types that actually pass through a snippet during program execution) in real-world code tends to be low, in JavaScript [♦] as well as related languages. However, polymorphism happens, and when it does, we like to be fast at it, too.

So, if our inline cache only supports a single type, what can we do to handle polymorphism? The answer may still be surprising: self-modify the machine code some more!

Before we talk about handling the polymorphic case, let's recap the PIC lifecycle.

The PIC lifecycle

The evolution of the PIC is managed through slowLookupCode, which keeps track of the state of the inline cache in addition to performing a traditional lookup. Once the slow lookup is performed and the PIC evolves, the slowLookupCode jumps back (to the instruction after the slot load) to do the next thing in the method.

When a PIC is born, it has that useless-looking structure you saw in the previous section — it's like a form waiting to be filled out. The industry terminology for this state is pre-monomorphic, meaning that it hasn't even seen one (mono) type pass through it yet.

The first time that inline cache is executed and we reach slowLookupCode we, shockingly, just ignore it. We do this because there is actually a hidden overhead associated with modifying machine code in-place — we want to make sure that you don't incur any of that overhead unless there's an indication you might be running that code a bunch of times. [♣]

The second time we reach the slowLookupCode, the inline cache is modified and the PIC reaches the state called monomorphic. Let's say we saw a type named ElephantTrunk — the PIC can now recognize ElephantTrunk objects and perform the fast slot lookup.

When the PIC is monomorphic and another type, named GiraffeSnout, flows through, we have a problem. There are no more places to put cache entries — we've filled out the whole form. This is where we get tricky: we create a new piece of code memory that contains the new filled-out form, and we modify the original form's jump to go to the new piece of code memory instead of slowLookupCode.

Recognize the pattern? We're making a chain of cache entries: if it's not an ElephantTrunk, jump to the GiraffeSnout test. If the GiraffeSnout test fails, then jump to the slowLookupCode. An inline cache that can hit on more than one type is said to be in the polymorphic state.

There's one last stage that PICs can reach, which is the coolest sounding of all: megamorphic. Once we detect that there are a lot of types flowing through a property access site, slowLookupCode stops creating cache entries. The assumption is that you might be passing an insane number of types through this code, in which case additional caching would only slow things down. For a prime example of megamorphism, the 280slides code has an invocation site with 1,437 effective types! [**]
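The whole lifecycle can be walked through with a small simulation. As before, this models the chain of patched code stubs with a plain array, and every name in it (`makePic`, `picState`, `picGetProp`) is invented for the sketch. `MAX_STUBS` is an arbitrary limit chosen here, not JaegerMonkey's actual threshold.

```javascript
const MAX_STUBS = 4; // arbitrary limit, assumed for this sketch

function makeShape(name) {
  return { name, slotOf: new Map([['isPointy', 0]]) };
}
function makeObject(shape, isPointy) {
  return { shape, slots: [isPointy] };
}
function makePic(identifier) {
  return { identifier, stubs: [], slowHits: 0, megamorphic: false };
}

function picState(pic) {
  if (pic.megamorphic) return 'megamorphic';
  if (pic.stubs.length === 0) return 'pre-monomorphic';
  return pic.stubs.length === 1 ? 'monomorphic' : 'polymorphic';
}

function slowLookup(pic, obj) {
  pic.slowHits++;
  const slot = obj.shape.slotOf.get(pic.identifier);
  if (slot === undefined) return undefined;
  if (pic.slowHits === 1) {
    // First miss: do not patch yet; this code may never run again.
  } else if (pic.stubs.length < MAX_STUBS) {
    pic.stubs.push({ shape: obj.shape, slot }); // extend the chain
  } else {
    pic.megamorphic = true; // too many shapes: stop adding entries
  }
  return obj.slots[slot];
}

function picGetProp(pic, obj) {
  // Linear walk of the chain: each stub is a shape test plus a slot load.
  for (const stub of pic.stubs) {
    if (stub.shape === obj.shape) return obj.slots[stub.slot];
  }
  return slowLookup(pic, obj);
}

const pic = makePic('isPointy');
const elephantTrunk = makeShape('ElephantTrunk');
const giraffeSnout = makeShape('GiraffeSnout');

const states = [picState(pic)];
for (const shape of [elephantTrunk, elephantTrunk, elephantTrunk, giraffeSnout]) {
  picGetProp(pic, makeObject(shape, false));
  states.push(picState(pic));
}
console.log(states.join(' -> '));
// pre-monomorphic -> pre-monomorphic -> monomorphic -> monomorphic -> polymorphic

// Keep feeding brand-new shapes until the site gives up on caching.
for (let i = 0; i < 3; i++) {
  picGetProp(pic, makeObject(makeShape('Shape' + i), true));
}
console.log(picState(pic)); // megamorphic
```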

Conclusion

There's a lot more to discuss, but this introduction is rambling enough as-is — if people express interest we can further discuss topics like:

Why in the name of Knuth is it okay to perform a linear search of the types in the inline cache? Can't we do better?

What are all the types of inline caches that you currently implement? What bytecodes do they correspond to? Who are the handsome fellows who implemented each of those inline caches and are they currently seeing anybody?

Why are you talking about types in a language like JavaScript, where typeof o is almost always "object"?!

What happens if the property lives on a prototype of the object, instead of the object itself? ...then it can't be a simple slot load! I feel hurt and betrayed... what other caveats haven't been discussed!?

Why might you choose to use monomorphic inline caches (MICs) instead of using PICs everywhere?

Does the TraceMonkey execution time suffer because hits in the inline cache prevent fills of the (global) property cache?

What kick-arse optimizations related to PICs have been discussed in the literature that have yet to be implemented in JaegerMonkey?

What the crap have I been talking about this whole time?

Suffice it to say that JavaScript gets a nice speed boost by enabling PICs: x86 JaegerMonkey with PICs enabled is 25% faster on SunSpider than with them disabled on my machine. [††] If something makes a dynamic language fast, then it is awesome. Therefore, inline caches are awesome. (Modus ponens says so.)

Footnotes

Alternative interpreter designs tend to walk over something that looks more like the source text: either an abstract syntax tree or the program tokens themselves. These designs are less common in modern dynamic languages.

There have historically been implementations that do things like this; notably, the Lisp machines and Jazelle DBX. The JavaScript semantics for ADD are particularly hairy compared to these hosted languages, because getting the value-for-adding out of an object can potentially invoke arbitrary functions, causing re-entrance into JavaScript interpretation.

Note that there is actually further overhead in turning the looked-up property into an appropriate JavaScript value. For example, there are additional checks to see whether the looked-up value represents a "getter" function that should be invoked.

Gregor Richards published a paper in PLDI 2010 that analyzed a set of popular web-based JS applications. The results demonstrated that more than eighty percent of all call sites were monomorphic (had the same function body). I'm speculating that this correlates well to the property accesses we're discussing, though that wasn't explicitly established by the research; in JS, property access PICs are easier to discuss than function invocation PICs. In related languages, like Self, there is no distinction between method invocation and property access.

"Hidden overhead my foot! Where does it come from?" Today's processors get a little scared when you write to the parts of memory that contain code. Modern processor architecture assumes that the memory you're executing code out of will not be written to frequently, so they don't optimize for it. [‡‡]

The annoying part is that the instruction prefetcher may have buffered up the modified instructions, so you have to check if the modified cache line is in there. Older cache coherency protocols I've read about flush lines past unified caches if they detect a hit in both the instruction and data caches — maybe it's better nowadays.