Blog

I first heard about Mayhem when I read that researchers at my university, Carnegie Mellon, had reported 1200 crashes in Debian, just by running their binary analysis system on Debian programs for 15 minutes at a time. When I learned that the technology developed by those researchers was spun out as a startup, ForAllSecure, I knew I had to get involved.

My joining ForAllSecure coincided with the company’s preparation for the DARPA Cyber Grand Challenge, the world’s first automated hacking competition. We spent thousands of hours extending the Mayhem system over two years to prepare for the competition, with significant focus on improving its core, the symbolic executor. Symbolic execution is the process of taking a program and generating test cases by tracking how inputs are used and mathematically solving for new inputs based on a deep understanding of what the code is doing with the input. Contrast this with fuzzing, which combines high-level feedback with high-speed testing to find new inputs.
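To make the idea of "solving for new inputs" concrete, here is a minimal sketch of symbolic execution over a toy target. It is not how Mayhem works internally. The `CHECKS` list, the three-byte input, and the brute-force `solve` helper are all assumptions made for illustration. A real symbolic executor works on machine code and hands its path conditions to an SMT solver.

```python
import operator

OPS = {"==": operator.eq, "<": operator.lt, ">": operator.gt}

# Hypothetical target: each tuple is one branch `if data[index] OP constant`.
# A failed check exits early; a passed check goes one level deeper.
CHECKS = [(0, "==", 0x7F), (1, "==", ord("E")), (2, ">", 0x10), (2, "<", 0x20)]
INPUT_LEN = 3


def run(data):
    """Concrete execution: count the checks passed before the first failure."""
    depth = 0
    for index, op, const in CHECKS:
        if not OPS[op](data[index], const):
            break
        depth += 1
    return depth


def solve(constraints):
    """Tiny stand-in for an SMT solver: filter each byte's 0..255 domain.

    This only works because every constraint here mentions a single byte.
    """
    domains = [set(range(256)) for _ in range(INPUT_LEN)]
    for (index, op, const), wanted in constraints:
        domains[index] = {v for v in domains[index] if OPS[op](v, const) == wanted}
    if any(not d for d in domains):
        return None  # the path condition is unsatisfiable
    return bytes(min(d) for d in domains)


def explore():
    """Enumerate every path and solve its path condition for a test case."""
    tests = {}
    for depth in range(len(CHECKS) + 1):
        # Path condition: the first `depth` checks hold, the next one does not.
        constraints = [(check, True) for check in CHECKS[:depth]]
        if depth < len(CHECKS):
            constraints.append((CHECKS[depth], False))
        test = solve(constraints)
        if test is not None:
            tests[depth] = test
    return tests


tests = explore()
for depth, data in sorted(tests.items()):
    print(f"path of depth {depth}: input {data.hex()}")
```

Each generated input drives the toy target down a different path, including the deepest one, without any guessing.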

After CGC’s very public demonstration of the possibilities for automated software analysis, we started receiving many inquiries from a wide range of companies wanting to know how this would work for their applications. Most of these projects involve software we cannot discuss publicly, so we also decided to continue applying our technology to open source programs whose results we could share. Some of our engineers had done a few small-scale experiments on well-known and well-fuzzed binaries, and found new bugs: OpenSSL (CVE-2016-7053) and sthttpd (CVE-2017-10671). While I think these are good demonstrations of what Mayhem can do for real-world software, I was itching to try additional experiments that would look at a single piece of software in more depth.

After reviewing several open source projects, I settled on testing objdump. This program is a normal part of a developer or security researcher’s workflow, providing insight into the contents of a binary program, for example, which functions and assembly instructions it contains. At its core is libbfd, which has been fuzzed significantly in the recent past. Judging by the history of reports, it looked ripe for additional fuzzing enhanced by symbolic execution. Many bugs had been reported and fixed, but reports had tailed off significantly in recent months. This suggested that most of the bugs visible to existing fuzzing tools had already been found and patched. If Mayhem discovered any more bugs, it would be a great indicator that Mayhem can find things other tools cannot.

Results

Our testing found over 10 bugs in about 60 total hours on my workstation with an 8-thread CPU. I triaged and reported the bugs each morning, then merged fixes and restarted the testing process over the course of a few days. A single bug can manifest in multiple ways, so I reported only 1-2 bugs per binary format in each report; that way, any crashes that appeared after the fixes were merged were clearly the result of other distinct bugs. Libbfd’s excellent maintainer was able to supply patches that fixed not only specific bugs but entire bug patterns. Triggering multiple crashes because the same buggy pattern appears in multiple places is not really compelling and inflates the bug count, so I wanted to be as fair as possible about the number of bugs.

In this chart, the red entries represent bugs where memory corruption occurred. This means they are potentially exploitable to run an attacker’s code. The orange and yellow entries represent memory disclosure, which means information in the program’s memory may be visible to an attacker. Finally, the green entry can simply crash the program.

Why were we able to find so many overlooked bugs? These are not recent regressions, and coverage-guided fuzzing has been applied to the same code. In the VMS module, we almost immediately found a bug that had somehow been missed by others who had reported bugs in that same area of code only a few weeks earlier.

At ForAllSecure, we believe it comes from a confluence of fuzzing and symbolic execution. Symbolic execution shines the most when several simultaneous constraints must hold on an input. Although symbolic execution is significantly slower per iteration than fuzzing, each and every synthesized seed is guaranteed to take a unique path, unlike fuzzing where mutated tests are generated very quickly, but are often duplicated or meaningless.
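The duplication point can be illustrated with a small sketch. The parser below is hypothetical: it checks the four-byte ELF magic one byte at a time, so reaching the deepest path requires four constraints to hold at once. The "fuzzer" is blind random generation with no coverage feedback, which is a deliberate simplification of what real guided fuzzers do. The "solved" input is written by hand to stand in for what a solver would return.

```python
import random

MAGIC = b"\x7fELF"  # all four bytes must match at the same time


def path_of(data):
    """Return the branch outcomes taken by a hypothetical header parser."""
    path = []
    for i, expected in enumerate(MAGIC):
        taken = len(data) > i and data[i] == expected
        path.append(taken)
        if not taken:
            break  # the parser rejects the input at the first mismatch
    return tuple(path)


def fuzz_paths(tries, seed=0):
    """Generate random inputs and count how many land on each path."""
    rng = random.Random(seed)
    seen = {}
    for _ in range(tries):
        data = bytes(rng.randrange(256) for _ in range(len(MAGIC)))
        path = path_of(data)
        seen[path] = seen.get(path, 0) + 1
    return seen


seen = fuzz_paths(10_000)
print(f"{len(seen)} distinct paths from {sum(seen.values())} random inputs")

# Stand-in for solver output: the one input satisfying all four constraints.
solved = MAGIC
print("solved input reaches the deepest path:", path_of(solved) == (True,) * 4)
```

Only five paths exist in this toy parser, so almost all of the random inputs are duplicates of one another, while the single solved input goes straight to the path that needs every constraint satisfied.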

During CGC, we and other teams found that a combination of testing strategies is more effective than either one alone. The relationship between practitioners of the two approaches has been a little rocky, with skepticism on both sides: fuzzing is seen as effective but shallow, and symbolic execution as heavyweight and, by itself, not always scalable for real-world software. Both sides really want the same result, which is to find high-impact bugs.

I suspect the two approaches will ultimately converge, as this experiment reflects only the superficial effect of merging corpora between guided fuzzing and symbolic execution; there is a great deal of room for improvement and we’re excited to continue pioneering this approach in our Mayhem product.

Interested in trying out Mayhem yourself on your software? Contact us about starting a pilot here.

LEGIT_00004 was a challenge from Defcon CTF that implemented a file system in memory. The intended bug was a tricky memory leak that the challenge author didn’t expect Mayhem to get. However, Mayhem found an unintended null-byte overwrite bug that it leveraged to gain arbitrary code execution. We heard that other teams noticed this bug, but thought it would be too hard to deal with. Mayhem 1 – Humans 0. In the rest of this article, we will explain what the bug was, and how Mayhem used it to create a full-fledged exploit.

Mayhem is a fully autonomous system for finding and fixing computer security vulnerabilities. On Thursday, August 4, 2016, Mayhem competed in the historic DARPA Cyber Grand Challenge against other computers in a fully automatic hacking contest…and won. The team walked away with $2 million, which ForAllSecure will use to continue its mission to automatically check the world’s software for exploitable bugs.

In 2008 I started as a new assistant professor at CMU. I sat down, thought hard about what I had learned from graduate school, and tried to figure out what to do next. My advisor in graduate school was Dawn Song, one of the top scholars in computer security. She would go on to win a MacArthur “Genius” Award in 2010. She’s a hard act to follow. I was constantly reminded of this because, by some weird twist of fate, I was given her office when she moved from CMU to Berkeley.

The research vision I came up with is the same I have today:

Automatically check the world’s software for exploitable bugs.

To me, the two most important words are “automatically” and “exploitable”. “Automatically” because we produce software far faster than humans could check it manually (and manual analysis is unfortunately far too common in practice). “Exploitable” because I didn’t want to find just any bugs, but those that could be used by attackers to break into systems.

Aside from our cool research, ForAllSecure also works on creating fun and engaging games to promote computer security. Just about every employee in our company has been involved in Capture the Flag exercises for the past several years, and we have been hosting these online events for our customers for about 3 years now. One of our big dreams is to see these types of contests gain in popularity, similar to how e-sports grew.

In nearly all CTF competitions, organizers spend dozens of hours creating challenges that are compiled once with no thought for variation or alternate deployments. For example, a challenge may hard-code a flag, making it hard to change later, or hard-code a system-specific resource.

At ForAllSecure, we are working to build automatically generated challenges from templates. For example, when creating a buffer overflow, you should be able to generate 10 different instances to practice on. And these instances should be able to be deployed anywhere, on a dime. While you can’t automate away the placement of subtle bugs and clever tricks, we can definitely add meaningful sources of variance to challenges without much additional effort, with the added bonus that challenges are easier to deploy.
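As a rough sketch of what template-based generation could look like, the snippet below renders C source for a practice buffer overflow from a seed. It is not ForAllSecure's actual tooling. The template, the `generate_instance` helper, the size choices, and the `FLAG` environment variable are all hypothetical names chosen for illustration.

```python
import random
from string import Template

# Hypothetical challenge template. The generated C program is intentionally
# vulnerable and is meant only for practice environments.
CHALLENGE_TEMPLATE = Template("""\
#include <stdio.h>
#include <stdlib.h>

/* instance $instance_id: intentionally vulnerable practice program */
void win(void) {
    /* The flag is read at run time, not hard-coded, so it can be rotated. */
    const char *flag = getenv("$flag_env");
    puts(flag ? flag : "flag not configured");
}

static void vuln(void) {
    char buf[$buf_size];
    /* Deliberate bug: reads more bytes than buf can hold. */
    fgets(buf, $read_size, stdin);
}

int main(void) {
    vuln();
    return 0;
}
""")


def generate_instance(seed, flag_env="FLAG"):
    """Render one challenge instance; the same seed gives the same source."""
    rng = random.Random(seed)
    buf_size = rng.choice([16, 32, 64, 128])
    read_size = buf_size + rng.choice([16, 32, 64])  # always overflows buf
    return CHALLENGE_TEMPLATE.substitute(
        instance_id=seed,
        flag_env=flag_env,
        buf_size=buf_size,
        read_size=read_size,
    )


# Ten practice instances, each reproducible from its seed.
instances = [generate_instance(seed) for seed in range(10)]
print(instances[0])
```

Because the flag comes from the environment and every varying value comes from the seed, the same template can be redeployed elsewhere or regenerated later without editing the challenge source.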

Although we have been very busy at ForAllSecure, we finally got the time to redo our website, huzzah! This website is a bit more pleasing on the eyes, and we hope to add more up-to-date information about our projects and what we’re up to.

Part of this refresh is also a new blog. We plan to talk about interesting things we are working on, so check back frequently! To kick things off, here is a post about some of our work on DARPA’s Cyber Grand Challenge.

In June, ForAllSecure participated in DARPA’s Cyber Grand Challenge (CGC) Qualification Event (CQE). During the event our automated system tweeted its progress, and to continue the trend of openness, we decided to publish a writeup of some more details about our system. Our team, Thanassis Avgerinos, David Brumley, John Davis, Ryan Goulden, Tyler Nighswander, and Alex Rebert, spent many thousands of hours on our system, and now that the CQE is over, we’re excited to give you a glimpse of its inner workings.