Short-Circuiting Boolean Operators in YARA

Posted on 2018-12-18 by Rob King

Here at InQuest, YARA is among the many tools we use to perform deep-file
inspection, with a fairly extensive rule set. InQuest operates at line
speed in very high-traffic networks, so these rules need to be fast.

This blog post is the second in a series discussing YARA performance notes,
tips, and hacks.

Each of these tests (regular expressions, string matches, function calls, and so
on) can succeed or fail for a given input. Whether a rule fires is determined by
a single condition that is a combination of all of the individual terms that
are being tested.

This post talks about how changing the order of your terms in a rule's condition
can have a significant performance impact.

The Example Rule

To use a simple example, let's write a rule that checks to see if the file
in question is a blacklisted executable. Our blacklist is expressed as a list
of SHA-256 hashes and when we say
"executable" in this example, we mean an
ELF binary.

A First Attempt

This is the "obvious" way to write this in Yara, given our super-huge
two-member blacklist and YARA's hash module:

So we can see that about half of our time is spent calculating the hashes,
while the other half is spent on unavoidable work like opening and reading
file contents; we'll never be faster than six seconds.

Don't Calculate the Hash Unnecessarily

It would be great if we didn't have to calculate the hash of the file if
we knew it couldn't possibly be the file in question.

That "good" corpus has gigs and gigs of files that cannot possibly match
because they aren't executables: they're documents, images, text files, and
so on. It would be nice if we could avoid calculating the hash on those files
if at all possible.

Logical Connectives and Short-Circuiting

YARA supports three logical connectives: and, or, and not.
These have the same meaning that they do in logic:

- *a and b* is true only if both *a* and *b* are true
- *a or b* is true if either *a* and *b* or true, or both of them are
- *not a* is true if *a* is false

YARA's version of these operators implements them using a technique
known as "short-circuiting". This technique is common in many programming
languages (including universally popular ones like Python, C, and Java) that
performs a kind of "conditional evaluation". The arguments to and and or
are evaluated from left to right and if the outcome is certain after evaluting
the left hand side the right hand side is skipped.

For example, if I write

false and check_database(1)

I know after evaluating the false that there's no way that statement can be true.
Conversely, if I write

true or check_database(1)

I know after evaluating the true that there's no way that the statement can be
false.

Many programming languages, YARA included, take advantage of this bit of logical
truth and simply don't evaluate the right hand side of a connective if the left
hand side is sufficient to determine truth or falsity.

(The famous, brilliant, and...let's say punctillious...computer
scientist Edsger Dijkstra, who taught here in Austin for many years,
was famously opposed to short-circuit evaluation. "Because the
conditional connectives thus complicate the formal reasoning
about programs, they are better avoided," he wrote in his famous
handwriting.

Because of this impediment to formal reasoning about programs, some
languages omit the short-circuiting operators entirely and others, like
Erlang and Eiffel, provide them but with different names so that you know
what you're getting into.)

Only Calculating Hashes on Binaries

Armed with the knowledge of short-circuiting logical connectives in YARA, we can now
write a rule that only calculates the hash of a file that we've confirmed is an ELF
binary. If the binary check is faster than the hash calculation and we know most of the
files we're going to inspect aren't ELF binaries, this should result in a pretty significant speedup.

First, let's confirm that the short-circuiting works as we expect it does.
Let's write a rule that would calculate the hash if YARA didn't do short-circuiting:

Oh no! We're now taking over a minute, where our first rule took only twelve seconds.

(NB. This section's title comes from the story of Magic and More
Magic, an old hacker
legend. Note that the magic module doesn't take its name from this story,
of course, but rather from the concept of "magic numbers" at the beginning
of files that identify what they are.)

The Integer Trick

The magic string for ELF binaries is "\x7fELF"; that is, an ASCII DEL character followed by
the letters 'E', 'L', and 'F'. We can convert that to a single 32-bit integer and use YARA's
uint32be function to inspect the first four bytes of the file very quickly using the trick
described in the previous blog post.

Success! It's around 50% faster than our first test, and only 33% slower than the
empty test!

Conclusion

We increased the speed of our YARA rule by 50% simply by taking advantage of our knowledge of
how YARA implements logical connectives. In any rule with lots of different connected terms,
it makes sense to move the cheap tests to the left side of their connective.

Coming Soon

Tune in next time for a discussion (and release!) of some of the tools we use to
generate YARA rules automatically, including a tool that makes the "integer checks"
discussed above easy to write.