What makes Avast security unique?

Just as the computing landscape has dramatically changed in recent years, so have the requirements for protecting users. An ever-increasing diversity of platforms, operating systems, applications and devices means an ever-greater set of attack vectors. By leveraging the largest multi-platform sensor network in the world, the cloud, machine learning, and proprietary big-data analytics, Avast delivers a unique and highly sophisticated approach to today’s security challenges.

Security in Numbers

There are four main components to our protection network. First, there are the 400 million globally distributed devices that serve as the unparalleled sensor network for our system. Second, there is the massive database of more than 5 billion files tested by Avast, which we call FileRep. Third, there is Scavenger, our internal system that keeps track of all the threat samples Avast has ever detected and of all the work that the virus analysts did to classify and detect those threats. Fourth, there is the GPU-powered system that leverages this amazing amount of information to quickly and automatically classify as many new samples as possible. We call it Medusa.

Synchronization

To be effective, Medusa has to be kept up to date with all the changes that happen in Scavenger. This includes new files arriving, and those files being classified as malware or benign, or being removed from the sets that are relevant for classification. A durable message queue in RabbitMQ is used to ensure that no updates are lost even during Medusa server updates and maintenance. Additionally, the consistency between Medusa and Scavenger is continuously checked with complete dumps of object identifiers.
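
The consistency check described above can be pictured as a comparison of two identifier dumps. The following is a minimal sketch, not Avast's actual implementation: the function name `reconcile` and the example identifiers are hypothetical, and the dumps are assumed to fit in memory as Python sets.

```python
def reconcile(scavenger_ids, medusa_ids):
    """Compare two complete dumps of object identifiers.

    Returns (missing, stale):
      missing - identifiers Scavenger lists but Medusa has not loaded
      stale   - identifiers Medusa holds but Scavenger no longer lists
    """
    scavenger = set(scavenger_ids)
    medusa = set(medusa_ids)
    return scavenger - medusa, medusa - scavenger


# Hypothetical dumps: Medusa missed the arrival of "a1" and the removal of "d4".
missing, stale = reconcile(["a1", "b2", "c3"], ["b2", "c3", "d4"])
print(sorted(missing), sorted(stale))
```

A check like this complements the durable queue: the queue protects updates in flight, while the periodic comparison catches any drift that accumulated despite it.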

Real-time deployment

Real-time classification is required whenever a file is too new and a machine cannot decide on its own whether the file is malicious. With 400 million Avast users, there is potential for a massive overload when new threats appear. To prevent this, we implemented a caching proxy between the clients and the Medusa cluster. Because a file’s classification can change as new information arrives from Scavenger, the TTL of the cached decisions is set to a few minutes. Even so, the cache serves almost 40% of the requests.
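
To illustrate the idea of a caching proxy with a short TTL, here is a minimal sketch. It makes several assumptions: the class name `TTLCache`, the cache key (for example a hash of the feature vector), and the 180-second TTL are all hypothetical, and a real proxy would also need eviction and concurrency control.

```python
import time


class TTLCache:
    """Cache classification verdicts for a limited time (sketch only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the cache can be tested
        self._entries = {}      # key -> (expires_at, verdict)
        self.hits = 0
        self.misses = 0

    def get_or_classify(self, key, classify):
        """Return a fresh cached verdict, or call classify(key) and cache it."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        verdict = classify(key)  # stands in for a request to the cluster
        self._entries[key] = (now + self.ttl, verdict)
        return verdict


cache = TTLCache(ttl_seconds=180)
print(cache.get_or_classify("vector-hash-1", lambda key: "clean"))
print(cache.get_or_classify("vector-hash-1", lambda key: "clean"))
print(cache.hits, cache.misses)
```

The short TTL is the trade-off: a verdict can be stale for at most a few minutes, while repeated requests for the same file during an outbreak never reach the cluster.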

Cluster setup

Every Medusa node in our deployment uses two or four Nvidia GPUs. A Medusa cluster has one master node that is aware of all the sets, and several slave nodes, each of which holds parts of the sets. Classification needs many clean and malicious samples, and the Evo-gen generator also uses a set of unclassified samples. Because their usage patterns differ widely, we keep the sets separate. Clean-set samples are the most important because of the inherently high cost of a false positive. The clean set therefore takes most of the space and is proportionally the slowest to scan; to increase throughput, we keep it mirrored. The sets with recent malware and unclassified samples take up only a fraction (roughly 10%) of the space needed for the clean set.

Machine Learning

The only way to effectively use all the information collected over the years is to leverage machine learning. We use instance-based learning because of its many beneficial properties, including the ease of:

re-learning the model, which is only a matter of adding or removing a sample to or from the correct set

understanding the reasons for particular decisions

fine tuning the false positive rate

Custom distance function

Each sample is represented by a constant-sized feature vector consisting of approximately 100 attributes. We keep the exact composition of the feature vector secret, but, for example, obvious candidates such as section table data in the Portable Executable format are included. In general, there are static and dynamic features, categorized as offsets, sizes, checksums, factors, bit flags and generic numbers. Taking into account the nature of the attributes, we ended up with several distance operators and a weighting scheme that equalizes the importance of the attributes. The following table contains a sample of the operators we use.

Distance operator | Field types           | Description
EQUAL_RET32       | Checksum, value       | return 32 when values are equal
Ham               | String                | Hamming distance
Ham-Mul32         | Bitfield              | Hamming distance multiplied by 32; each flag change is as important as the maximal change of one feature
Log               | Length, Offset        | base 2 logarithm of a difference
Order             | Length, Offset, Value | difference of base 2 logarithms
Retz              | all                   | ignore the feature, return 0 for all values
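
The operators in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production code: the function names are hypothetical, EQUAL_RET32 is assumed to return 0 for identical values and 32 otherwise (so that identical samples have distance zero), the `+ 1` terms are added only to keep the logarithms defined at zero, and the weighted sum is one plausible way to combine per-feature distances.

```python
import math


def equal_ret32(a, b):
    # Assumption: 0 for identical values, 32 otherwise.
    return 0 if a == b else 32


def ham(a, b):
    # Hamming distance between two equal-length strings.
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def ham_mul32(a, b):
    # Bit flags stored as integers: every differing flag costs 32.
    return 32 * bin(a ^ b).count("1")


def log_diff(a, b):
    # Base-2 logarithm of the difference (+1 is an assumption, see above).
    return math.log2(abs(a - b) + 1)


def order(a, b):
    # Difference of base-2 logarithms (+1 is an assumption, see above).
    return abs(math.log2(a + 1) - math.log2(b + 1))


def retz(a, b):
    # Ignore the feature entirely.
    return 0


def distance(u, v, operators, weights):
    """Weighted sum of per-feature distances between vectors u and v."""
    return sum(w * op(a, b) for a, b, op, w in zip(u, v, operators, weights))


# Hypothetical three-feature vectors: checksum, bit flags, section size.
ops = (equal_ret32, ham_mul32, log_diff)
weights = (1.0, 0.5, 2.0)
print(distance((5, 0b1010, 1024), (6, 0b0110, 1), ops, weights))
```

With a maximum of about 32 for a logarithm of a 32-bit difference, the factor 32 in EQUAL_RET32 and Ham-Mul32 puts a single changed checksum or flag on the same scale as the largest possible change of a numeric feature.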

kNN classifier

The most common approach to instance-based learning is nearest-neighbor classification. To fine-tune our classifier, we built a tool called Pythia, which displays the nearest neighbors of a given query sample. It uses a dimensionality reduction method (NMDS) to display the neighbors in 2D space, and also displays additional metadata for the selected samples. A human can use this information to determine whether it is feasible to distinguish between malware and clean neighbors in the case at hand. The goal was a fully autonomous system, which means favoring high precision at the cost of lower recall. After some experimenting, we added a few thresholds, including a minimal allowed distance to clean files and a maximal allowed distance to malware files, as well as a weighting term that shifts the balance between the clean and malware sets.
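
One way the thresholds and the weighting term could interact is sketched below. This is an assumption-laden illustration: the function name, the parameter names, their default values, and the voting rule are all hypothetical, since the text does not specify how the thresholds are combined. The sketch abstains ("unknown") whenever it is not confident, which reflects the precision-over-recall goal.

```python
def classify(query, clean_set, malware_set, dist, k=5,
             min_clean_dist=10.0, max_malware_dist=5.0, clean_weight=2.0):
    """Return "malware" only when all thresholds agree, else "unknown"."""
    clean = sorted(dist(query, s) for s in clean_set)[:k]
    malware = sorted(dist(query, s) for s in malware_set)[:k]
    if not malware:
        return "unknown"
    # Too close to a known clean file: a false positive would be costly.
    if clean and clean[0] < min_clean_dist:
        return "unknown"
    # Nearest malware sample is too far away to be convincing.
    if malware[0] > max_malware_dist:
        return "unknown"
    # Vote among the k nearest neighbors overall; clean votes weigh more.
    neighbours = sorted([(d, "clean") for d in clean] +
                        [(d, "malware") for d in malware])[:k]
    clean_votes = clean_weight * sum(1 for _, label in neighbours if label == "clean")
    malware_votes = sum(1 for _, label in neighbours if label == "malware")
    return "malware" if malware_votes > clean_votes else "unknown"


# Toy one-dimensional "feature vectors" with absolute difference as distance.
toy_dist = lambda a, b: abs(a - b)
print(classify(2.5, [100, 120, 130], [1, 2, 3], toy_dist, k=3))
```

Because both sets are plain collections of samples, re-learning amounts to adding or removing an element, and the neighbors that drove a decision can be listed directly.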

Real-world data

The redundancy in real-world data is quite significant. Our internal systems handle around 250,000 new PE files every day. Out of those, 150,000 can be directly assigned to one of 20,000 clusters using very strict clustering criteria (low threshold distance and complete linkage). Each cluster can then be classified as a whole. That means 130,000 fewer decisions to make. The total number of clusters also does not grow by 20,000 every day, because the clusters overlap between days.
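
The complete-linkage criterion can be illustrated with a simple incremental assignment: a sample joins a cluster only if it is within the threshold of every member. This is a sketch under assumptions, not the production clustering: the function name and the toy data are hypothetical, and a real system would avoid the linear scan over all clusters.

```python
def assign_to_cluster(sample, clusters, dist, threshold):
    """Add sample to the tightest cluster that satisfies complete linkage.

    Complete linkage: the distance to the farthest member must not
    exceed the threshold. A new cluster is opened if none qualifies.
    Returns the index of the cluster the sample ended up in.
    """
    best_index, best_farthest = None, None
    for i, members in enumerate(clusters):
        farthest = max(dist(sample, m) for m in members)
        if farthest <= threshold and (best_farthest is None or farthest < best_farthest):
            best_index, best_farthest = i, farthest
    if best_index is None:
        clusters.append([sample])
        return len(clusters) - 1
    clusters[best_index].append(sample)
    return best_index


# Toy one-dimensional samples with absolute difference as distance.
clusters = []
for x in [1, 2, 3, 10, 11, 4]:
    assign_to_cluster(x, clusters, lambda a, b: abs(a - b), threshold=2)
print(clusters)
```

The arithmetic in the paragraph follows from classifying per cluster rather than per file: 150,000 files collapse into 20,000 decisions, saving 150,000 - 20,000 = 130,000.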

World-class Protection

There are several methods we can use to transfer the classification to our customers. In order to provide the highest level of protection, we chose these three methods.

Method                   | How many users | Delay in classification | How many file versions
Real-time classification | single user    | no delay                | one
FileRepMalware           | all users      | seconds                 | one
Evo-gen                  | all users      | minutes                 | many

Real-time classification

Avast checks every executable before it is executed on a customer’s machine. When no signature from the current threat database matches the file, the FileRep service is queried. If the returned user count (prevalence) is anomalously low, the executable ends up in the Avast Sandbox. If the executable trace log does not match any known threat, the real-time classifier is invoked. Avast extracts the feature vector, submits it to a cloud-based service, and waits for the response. Most of the low-prevalence files are benign. Out of approximately 250,000 requests daily, about 4,000 are classified as malicious.
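
The sequence of checks above can be summarized as a short decision chain. The sketch below is illustrative only: every function and parameter name is hypothetical, the prevalence cutoff of 10 is invented for the example, and treating sufficiently prevalent files as allowed is an assumption, since the text does not say what happens to them.

```python
def real_time_check(file, signature_match, prevalence, sandbox_match,
                    cloud_classify, low_prevalence=10):
    """Walk the checks in the order described; callables are stand-ins."""
    if signature_match(file):
        return "blocked: signature"
    if prevalence(file) >= low_prevalence:
        return "allowed: prevalent"       # assumption, see lead-in
    if sandbox_match(file):
        return "blocked: sandbox trace"
    # Last resort: send the feature vector to the cloud classifier.
    if cloud_classify(file) == "malware":
        return "blocked: classifier"
    return "allowed: classifier"


# A rare file that no signature or sandbox trace recognizes.
print(real_time_check(
    "sample.exe",
    signature_match=lambda f: False,
    prevalence=lambda f: 2,
    sandbox_match=lambda f: False,
    cloud_classify=lambda f: "malware",
))
```

Ordering the cheap local checks first means the cloud classifier sees only the small residue of files that nothing else can decide; by the figures above, about 4,000 of 250,000 daily requests (1.6%) come back malicious.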

FileRepMalware

Once a file is classified as malicious and our internal systems check that it is safe to detect this particular file worldwide, a simple flag is set in the FileRep service. Every Avast client that encounters that particular file instantly blocks it and reports it as FileRepMalware.

Evo-gen

Old string-based signatures work well when properly executed, and are especially good at generalizing many variants of a threat. But string-based signatures require an analyst and time. In today’s threat landscape, with all its variants and interconnectivity, there just aren’t enough people or enough time to keep pace. A new approach was required that generalizes like string signatures do, but doesn’t rely on human intervention or take as much time.

Enter Evo-gen. Evo-gen uses the distance function to collect a set of very similar feature vectors. We can then pick the features that make those vectors similar and build a rule set from them. It is somewhat similar to rule-set generation in decision trees, but the objectives are different. To boost generalization, we pick as few rules as possible while keeping hits in the clean set at zero. But there are many ways to pick 20 rules from 100 possible ones: about 5.36 × 10^20, or 536 quintillion. We’re currently taming the combinatorial explosion with a stochastic approach, which provides better results than Scavenger approaches. This is where the speed of the GPUs is very important again.

While trying to understand how the Evo-gen rule sets (blue) affect the signature “ecosystem,” we produced the following visualization. Each blob represents a different rule set or signature, and the size of the blob is proportional to the number of detected variants.
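
The size of the search space, and the flavor of a stochastic search through it, can be sketched as follows. The combinatorial count is exact; everything else is an assumption made for illustration: rules are modeled as simple "feature i equals value v" tests, the function names are hypothetical, and uniform random sampling stands in for whatever stochastic method Evo-gen actually uses.

```python
import math
import random

# Number of ways to choose 20 rules out of 100: on the order of 5.36e20.
print(math.comb(100, 20))


def matches(rule_set, vector):
    """A vector is detected when it satisfies every (index, value) rule."""
    return all(vector[i] == v for i, v in rule_set)


def stochastic_rule_search(cluster, clean_set, max_rules, trials, rng):
    """Randomly sample rule subsets; keep the smallest with zero clean hits."""
    seed = cluster[0]
    # Candidate rules: features on which every cluster member agrees.
    candidates = [(i, seed[i]) for i in range(len(seed))
                  if all(s[i] == seed[i] for s in cluster)]
    if not candidates:
        return None
    best = None
    for _ in range(trials):
        size = rng.randint(1, min(max_rules, len(candidates)))
        rules = rng.sample(candidates, size)
        if any(matches(rules, c) for c in clean_set):
            continue  # would cause a false positive, discard
        if best is None or len(rules) < len(best):
            best = rules
    return best


# Toy data: two malware variants and two clean files, four features each.
cluster = [(1, 2, 3, 9), (1, 2, 3, 8)]
clean = [(1, 2, 0, 9), (1, 0, 3, 9)]
print(stochastic_rule_search(cluster, clean, 3, 200, random.Random(0)))
```

Fewer rules generalize to more variants, but each candidate rule set must be checked against the entire clean set, which is why the scan speed of the GPUs matters here.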

Avast: The Smarter Security Solution

By leveraging the largest multi-platform network, the cloud, machine learning, and proprietary big data analytics, Avast delivers a unique and highly sophisticated approach to security. Unlike other security providers, our protection network gets stronger and easier to manage as we add more users. Anyone who connects to our network is immediately protected, regardless of their location. We call this Global Security.