Stringless YARA Rules

Posted on 2018-09-30 by Rob King

Here at InQuest, YARA is among the many tools we use to perform deep-file
inspection, with a fairly extensive rule set. InQuest operates at line
speed in very high-traffic networks, so these rules need to be fast.

This blog post is the first in a series discussing YARA performance notes,
tips, and hacks.

YARA bills itself as the "Pattern Matching Swiss Knife". (It used to be the
"Pattern Matching Swiss Army Knife" but apparently "Swiss Army Knife" is a trademark of Victorinox AG. Who knew?)

It's used to determine if a given input (often a file, but it can attach
to a running process and analyze its memory too) matches any defined rules.
YARA rules consist of three parts, two of them optional:

Some rule metadata, which is just a mapping of strings to values.
These name-value pairs have no effect on the rule itself, but are useful
for conveying additional information when a rule matches. This section
is optional.

Some "strings". I put that in quotes because they're not limited to static
strings; regular expressions are also permitted. This section is likewise
optional.

A condition. This is exactly one boolean expression of arbitrary complexity.
If it evaluates to true for a given file, the rule fires. This section
isn't optional.

There are a few other things that can happen in rules, like tags, but they are beyond the scope of this blog post.
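
As a rough sketch of how the three parts fit together, here is an illustrative rule. The rule name, metadata values, and strings are all made up for this example and are not taken from any real rule set:

```yara
rule ExampleRule
{
    meta:
        // Optional: name-value pairs that do not affect matching.
        description = "Illustrative rule only"
        author = "example"

    strings:
        // Optional: static strings and regular expressions.
        $text = "static string"
        $regex = /ab[0-9]+cd/

    condition:
        // Required: one boolean expression.
        $text or $regex
}
```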

The condition is where everything is tied together. It can check for
matches of any strings/regular expressions defined in the rule, check
against the values of some provided external variables, call external
functions, and even run loops.
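
To give a feel for this, here is a small hypothetical condition that combines a match count, match offsets, and a loop. The rule name, the string, and the thresholds are assumptions chosen only for illustration. External variables and functions are left out because they need extra setup outside the rule:

```yara
rule ConditionSketch
{
    strings:
        $marker = "marker"

    condition:
        // File is smaller than 10 MB ...
        filesize < 10MB and
        // ... the string occurs at least twice ...
        #marker >= 2 and
        // ... and at least one occurrence starts within the first 1024 bytes.
        for any i in (1..#marker) : ( @marker[i] < 1024 )
}
```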

The Example Rule

To use a simple example, let's write a rule that detects if the file we're
looking at is an Adobe Flash file.

Adobe Flash files begin with one of three magic strings: "FWS", "CWS",
and "ZWS". The three magic strings differentiate Flash files based on
the compression mechanism used for the data they contain.

A First Attempt
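
A first attempt along these lines might look like the following sketch. The rule name Flash and the string name $flash_magic come from the description in this post; the exact layout is an assumption:

```yara
rule Flash
{
    strings:
        // Matches "FWS", "CWS", or "ZWS" anywhere in the file.
        $flash_magic = /[FCZ]WS/

    condition:
        // Only fire if one of those matches sits at offset 0.
        $flash_magic at 0
}
```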

This defines a single regular expression (named $flash_magic), case
sensitive, that will match the three strings we defined above. We then
say the rule Flash will match if $flash_magic has a match in the
file at offset 0 (that is, the beginning of the file).

Let's see how fast this is. I'll run it on a Flash file in the InQuest
testing corpus:

yara -f test1.rule testfile : 0.006s

0.006s is pretty good, right? But when we run the same rule on a much
larger (1GB), more malicious file, YARA fails to process it at all.

Too Many Matches

What's going on here is that YARA first finds all the matches for a regular
expression in the file and then checks the rule conditions to see if
they are true. The malicious file has a huge number of strings that end
up containing the characters "CWS", "FWS", and/or "ZWS".

Obviously, we don't want to fail to analyze the file, so we need to try
a few alternatives.

Don't Match So Much

Let's try modifying the regular expression so that it won't match so much.
Since the magic bytes are always at the beginning of the file, let's anchor
the regular expression:
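
As a sketch, the anchored version could look like this. It is the same Flash rule with ^ added to the expression; the layout is again an assumption:

```yara
rule Flash
{
    strings:
        // ^ restricts the expression to the beginning of the file.
        $flash_magic = /^[FCZ]WS/

    condition:
        $flash_magic at 0
}
```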

Now, the question is, can we go faster? It turns out we can, but why
it works will take some explaining.

Strings and Atoms

YARA tries very hard to make string and regular expression matching
very fast. It operates under the assumption that a file is going to have
hundreds or even thousands of rules run against it, and each of those
rules is usually going to contain one if not many more strings.

To speed up this process, YARA tries to avoid running regular expressions
over the whole file. Instead, at rule compilation time, it looks at all
of the defined strings and regular expressions and extracts out of them
a collection of atoms.

An atom is a short string. For example, for the expression

$flash_magic = /[FCZ]WS/

YARA might extract out the atoms "FWS", "CWS", and "ZWS". The set
of all atoms for a given rule set is then fed into the Aho-Corasick
algorithm, a fast string-finding algorithm.

(The "Aho" in "Aho-Corasick" refers to Alfred Aho who is also the "A"
in the AWK programming language.)

The Aho-Corasick algorithm is run over the file and the offset of each
atom match is recorded. Thanks to this, the various regular expressions
don't need to be run over the whole file: any spot where the atoms couldn't
possibly line up with the expression is eliminated.

This can result in enormous speedups, but it has the major downside that
it requires a lot of preprocessing to find all of the atoms in the file.
For a large file with a lot of atom matches (indicating a lot of potential
regular expression or string matches), this preprocessing time can be large.

Eliminate the Strings

To speed things up, we could try eliminating the strings/regular expression.
The problem is, how do you match the regular expression ^[FCZ]WS when
you can't look for strings?

YARA has a collection of built-in "integer functions" that can read integers
of various sizes and orderings from a given offset in the file. For example:

uint16be(0x72) == 0x3829

would read a 16-bit big-endian integer from offset 0x72 in the file and
see if it's equal to 0x3829. Similar functions exist for single bytes
and 16- and 32-bit integers in both big- and little-endian formats.
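
For illustration, here are a few more of these functions in a throwaway rule. The rule name, offsets, and values are made up for this example; they happen to describe a file whose first four bytes are 4D 5A 90 00:

```yara
rule IntegerFunctionSketch
{
    condition:
        uint8(0) == 0x4D and          // single byte at offset 0
        uint16(0) == 0x5A4D and       // 16-bit little-endian: bytes 4D 5A
        uint16be(0) == 0x4D5A and     // 16-bit big-endian: the same two bytes
        uint32be(0) == 0x4D5A9000     // 32-bit big-endian: bytes 4D 5A 90 00
}
```

Note how the little-endian and big-endian reads of the same two bytes compare against different constants.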

Given this, we can transform our regular expression into a sequence of these
calls. To not keep you all in suspense, here's what that would look like:
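
One way to write it, as a sketch: the constants are the ASCII values of the magic strings ("FW" is 0x4657, "CW" is 0x4357, "ZW" is 0x5A57, and "S" is 0x53). The exact split between the 16-bit and 8-bit reads is an assumption:

```yara
rule Flash
{
    condition:
        // Bytes 0-1 are "FW", "CW", or "ZW" ...
        (uint16be(0) == 0x4657 or
         uint16be(0) == 0x4357 or
         uint16be(0) == 0x5A57) and
        // ... and byte 2 is "S".
        uint8(2) == 0x53
}
```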

Notice how we used a combination of 16-bit and 8-bit calls above, to
handle the fact that our strings were three bytes long. We use that sort
of trick often to match strings of arbitrary length.

Can We Go Faster?

We've gone from failing to process the file at all to 29 seconds, to 23
seconds, to 15 seconds.

Before we see if we can go any further, it might be useful to see what the
lower limit actually is. To do that, we can run a rule that does nothing
against the file:

rule NullRule
{
    condition:
        false
}

Let's run it against the file and see how long it takes:

yara -f test5.rule maliciousfile : 14.476s

Running a rule that does nothing against our file takes just under 1.2
seconds less than our fastest rule. Those fourteen seconds are the time
it takes to launch YARA, parse and compile the rule, load the target file,
and do whatever preprocessing is necessary.

We could probably shave off a few milliseconds by reordering the clauses
in our condition, but I don't think it would buy us much. I think we've
truly gotten this rule as fast as it will go.

Conclusion

YARA is pretty fast already, especially given how extensive its abilities are.
However, how you write your rules can have a real bearing on how fast they run,
and sometimes doing things in the less-obvious way can result in some real
speedups.