One of the applications you can use to demonstrate the capabilities
of RV-Match is the benchmark generated by Toyota ITC to evaluate
different static analysis tools.
For more information on this benchmark, refer to the following ISSRE’15
paper:

The Toyota ITC benchmark is publicly available
on GitHub.
At the time of this writing (December 2015), it consisted of a total of 1,276
tests: half with planted defects, meant to evaluate the detection rate (DR)
of the analysis tools, and the other half without defects, meant to
evaluate their false positive rate (FPR).
The tests are grouped in nine different categories:
static memory, dynamic memory, stack-related, numerical, resource management,
pointer-related, concurrency, inappropriate code, and miscellaneous.
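
To give a flavor of the benchmark's structure, here is a sketch of such a
with-defect / without-defect pair in the static memory category; the function
names and code below are purely illustrative and are not taken from the actual
benchmark sources:

```c
#include <stdio.h>

/* Hypothetical with-defect / without-defect pair in the static memory
 * category; illustrative only, not actual Toyota ITC benchmark code. */

/* With defect: the write at index 5 is one element past the end of buf. */
void static_buffer_overrun_with_defect(void)
{
    int buf[5] = {0};
    buf[5] = 1;               /* out-of-bounds write: the planted defect */
    printf("%d\n", buf[0]);
}

/* Without defect: same code shape, but all accesses stay in bounds. */
void static_buffer_overrun_without_defect(void)
{
    int buf[5] = {0};
    buf[4] = 1;               /* last valid index */
    printf("%d\n", buf[4]);
}

int main(void)
{
    /* A tool should flag the first function and stay silent on the second. */
    static_buffer_overrun_without_defect();
    return 0;
}
```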

An earlier 2014 version of the paper above,
Quantitative Evaluation of Static Analysis Tools,
was presented and published at the same venue one year before, namely at
ISSRE’14.
However, the 2014 paper has been removed from the IEEE database and we have
not been able to find it anywhere since.
According to Prof. Regehr’s blog,
Coverity has a DeWitt Clause
in its EULA that disallows comparisons of its tool with other tools,
which is probably why the 2014 paper was removed.

The revised version of the Toyota ITC paper published in
ISSRE’15
discusses, with approval from GrammaTech and MathWorks, only three static analysis tools:
GrammaTech’s CodeSonar and MathWorks’ Code Prover and Bug Finder.
Kudos to the Toyota ITC researchers for their fabulous and immensely useful
effort, and to GrammaTech and MathWorks for allowing science to follow its path!

When executed on the Toyota ITC benchmark, RV-Match achieves the best score of any of the tools for which we were able to obtain results.
RV-Match also distinguishes itself with a perfect false positive rate: zero
false positives.
The table below shows how RV-Match compares (in January 2016) to the commercial static analysis tools discussed in the ISSRE’15 paper:

DR (Detection Rate) refers to the percentage of tests containing errors
where the error was detected.
FPR (False Positive Rate) refers to the percentage of tests not containing
errors where a false positive was nevertheless reported, and the underlined
FPR is 100 - FPR.
PM refers to the productivity metric computed in the
paper, which is the geometric mean of DR and 100 - FPR.
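
In other words, PM = √(DR × (100 - FPR)). As a purely hypothetical illustration
(these numbers are not from the paper or the table): a tool that detects 80% of
the planted defects (DR = 80) and reports false positives on 10% of the
defect-free tests (FPR = 10) would score PM = √(80 × 90) ≈ 84.9.
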
For the other tools referenced in the table above, we used the numbers
presented in the original public
ISSRE’15
paper by Toyota ITC, because these tools are not freely available.
The green boxes indicate that the tool had the best score in that category for
that metric.
The red highlights mark features not yet supported by the corresponding tool,
which yield a score of 0.
The dark red boxes at the bottom compare the final score of RV-Match to that
of the best tool in the original ISSRE’15 paper.

Ordinary compilers also perform some minimal static checks on the program.
For reference, in the table above we also included the numbers we obtained
with gcc and clang when run on the benchmark, with all their warning
flags enabled.
As the table illustrates, if no special analysis tools are used in your
development process, you should expect your usual compiler to catch only
about 5% of the undefined-behavior bugs lurking in your C code.
This is why safety-critical code is very rarely developed without the help
of software analysis tools.
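
To illustrate why that percentage is so low, consider the sketch below (our own
example, not a benchmark test): the off-by-one read is undefined behavior, yet
because the bound depends on a value the compiler cannot see at compile time,
gcc and clang typically accept it without a single warning even with all
warning flags enabled (exact behavior varies by compiler version and flags).

```c
#include <stdio.h>

/* Illustrative only: an off-by-one, out-of-bounds read that typically
 * compiles without warnings because the bound is not a compile-time
 * constant. A semantics-based tool can still catch it when the program runs. */
static int sum(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i <= n; i++)   /* off by one: should be i < n */
        s += a[i];                 /* reads a[n], one past the end */
    return s;
}

int main(int argc, char **argv)
{
    int data[4] = {1, 2, 3, 4};
    (void)argv;
    /* argc is not known at compile time; when run without arguments it is 1,
     * so sum() is called with n == 4 and reads past the end of data. */
    printf("%d\n", sum(data, 3 + argc));
    return 0;
}
```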
There is also a variety of non-commercial analysis tools that you can use to
improve the quality of your code.
Below is a table showing the numbers we obtained (in January 2016) when running
such freely available analysis tools on the Toyota ITC benchmark:

As expected, the scores of the freely available tools are lower than those of
the commercial tools.

The reader should not naively assume, based on the numbers above alone, that
runtime verification is simply the best program analysis approach.
The best way to look at the various technologies is as complementing, not
competing with, each other.
For example, static analysis tools are more forgiving about analyzing
code that does not even compile, so they can help you find errors earlier
in the process.
Also, they typically analyze all your code in one run of the tool.
On the other hand, dynamic analysis tools aim at very low runtime overhead,
ideally unnoticeable, at the expense of less rigorous analysis.
Like dynamic analysis, runtime verification requires the program to actually
execute, so it cannot analyze a program that does not compile.
Also, runtime verification covers in depth only the fragment of code in your
program that is actually executed, so it needs to be combined with your
existing testing infrastructure (e.g., running your unit tests with the RV
tool) for best results.
However, unlike dynamic analysis tools, the checks that RV-Match performs
are mathematically rigorous and follow the formal specification of the
ISO C11 standard.
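
For instance (our own illustrative snippet, not taken from the benchmark),
both statements below are undefined according to the C11 standard, yet a
program containing them will usually appear to run correctly under ordinary
testing:

```c
#include <limits.h>
#include <stdio.h>

/* Illustrative snippet: two operations that the ISO C11 standard leaves
 * undefined, which ordinary testing rarely notices because the program
 * does not crash. */

static void scale(int *out, int *in)
{
    /* If out and in point to the same object, the store performed by the
     * assignment and the store performed by the increment are unsequenced
     * side effects on the same scalar object: undefined behavior. */
    *out = (*in)++ * 2;
}

int main(void)
{
    int x = INT_MAX;
    int y = x + 1;        /* signed integer overflow: undefined behavior */

    int v = 3;
    scale(&v, &v);        /* aliasing call that triggers the unsequenced update */

    printf("%d %d\n", y, v);
    return 0;
}
```
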
For more information on the runtime verification technology underlying
RV-Match, you can view this presentation.

If you would like to reproduce the results of the benchmark for
yourself, you can do so by executing the following commands:

Note

The Toyota ITC benchmark requires the use of libraries, including
libc, to run.
Because our native Windows version does not currently fully
support libraries, we recommend using the Linux version or
a Linux-based virtual machine on Windows, such as the provided
Vagrant VM, to run the benchmark.

You can then reproduce the results of any test by running the
relevant program.
For example, to run the first test in the second test group of tests with
defects, you can run ./01.w_Defects/01.w_defects002001.
For more information about which test groups correspond to which defect
types, refer to main.c.
If you want to check the details of our experiments involving the freely
available tools described above, or to redo the experiments yourself, you can
find all our records and the scripts we used
on GitHub.

Interestingly, running RV-Match on the Toyota ITC benchmark
detected a number of flaws in the benchmark itself, both in the form
of undefined behavior that was not the intended defect of the test
case, and in the form of tests that were intended to
contain a defect but were actually correct.
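
To make the two kinds of flaws concrete, here is a purely hypothetical
illustration; these are not actual tests from the suite, nor the actual fixes
in our pull request.

```c
#include <stdio.h>

/* Purely hypothetical illustrations of the two kinds of benchmark flaws;
 * these are NOT actual Toyota ITC tests or fixes from our pull request. */

/* First kind: a test from the defect-free half that nevertheless contains
 * undefined behavior, here a read of a variable that is never initialized. */
void unintended_undefined_behavior(void)
{
    int flag;
    if (flag)                 /* read of an uninitialized variable: undefined */
        printf("set\n");
}

/* Second kind: a test intended to contain an out-of-bounds write, but whose
 * "defect" never actually leaves the array bounds. */
void intended_defect_is_actually_correct(void)
{
    int buf[8] = {0};
    for (int i = 0; i < 8; i++)   /* meant to overrun, but stays in bounds */
        buf[i] = i;
    printf("%d\n", buf[7]);
}
```
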
We have made our fixes, with detailed explanations, public in
this pull request on GitHub,
and informed the Toyota ITC authors so they can fix their benchmark.
If you have any comments or questions regarding any of these changes,
please feel free to contribute to the pull request above and/or contact us at
our support page.

Note that we used the fixed version of the benchmark in our experiments
reported in the RV-Match columns in the table above.
Fixing the benchmark bugs was not optional, because without some of the fixes
the benchmark would not even compile with RV-Match due to constraint violation
defects in the code.
Unfortunately, we do not have access to the static analysis tools referenced
in the ISSRE’15 paper and the table above, so we simply reproduced the
results reported in the ISSRE’15 paper for them.
Because of that, the metrics reported for the other tools
may be off by some amount.
If you have access to these static analysis tools and run the fixed benchmark
yourself and find any inconsistencies, please contact us at
our support page
and help us fix the numbers.