Even if we intend to test only a GUI feature, the rest of the layers may be covered by this test. We are not even verifying the calculation results, yet the Calculation layer may have 100% code coverage. Obviously, test coverage is not a good metric here.

I have some thoughts on a better metric.

Test depth

In the example above, the Calculation layer is tested only through the User interface layer. In other words, the Calculation layer is several levels away from the test code in the call stack. Let's call this distance between the test and the tested code the "test depth". A high test depth (for a particular test) suggests that the test is probably not testing that code very well.

This gives us an improved metric: by weighting the test coverage by the test depth, we get a depth-weighted coverage. In traditional test coverage, each line is either covered or not covered:

C(line) = 0% or 100%, depending on whether the line is executed by a test or not

whereas the coverage for an entire class is the average over its lines:

C(class) = (sum of C(line) over all lines in the class) / line_count

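A depth-weighted coverage would instead define the coverage of each covered line as

C(line) = 100% / lowest_depth_of_test

where depth_of_test is the distance in the call stack between a test and the line (1 when the test calls the code directly), and lowest_depth_of_test is the smallest such distance over all tests that execute the line. Uncovered lines stay at 0%.

If a line is covered directly by a test, C(line) will be 100%. If it is one level deeper, it will be 50%. Two levels deeper will yield a C(line) of 33%.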
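
To make the arithmetic concrete, here is a minimal Python sketch of the weighting. It assumes that a depth of 1 means the test calls the code directly (so that such a line scores 100%), and that the shallowest depth per line is already known. The function names and the sample data are made up for illustration.

```python
from typing import Dict, Optional


def line_coverage(min_depth: Optional[int]) -> float:
    """Depth-weighted coverage of one line, as a fraction between 0 and 1.

    min_depth is the shallowest call-stack distance from any test to the
    line, or None if no test executes the line at all.
    Assumption: depth 1 means the test calls the code directly.
    """
    if min_depth is None:
        return 0.0
    if min_depth < 1:
        raise ValueError("depth starts at 1 for a direct call from a test")
    return 1.0 / min_depth


def class_coverage(depths: Dict[int, Optional[int]]) -> float:
    """Average the per-line values over all lines of a class."""
    if not depths:
        return 0.0
    return sum(line_coverage(d) for d in depths.values()) / len(depths)


# Hypothetical class with four lines: line number -> shallowest test depth.
depths = {10: 1, 11: 2, 12: 3, 13: None}
print(f"{class_coverage(depths):.1%}")  # 45.8%
```

Traditional coverage would report 75% for the same class (three of four lines executed), so the lines that are only reached indirectly pull the number down.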

Is this the perfect metric?

Is depth-weighted coverage a perfect test metric? No, it still doesn't verify that you are asserting on the right things. It is not a silver bullet, but it does at least penalize the deep-test scenario above.

How do I calculate this?

I haven't found any tools that actually compute this kind of metric. Perhaps I need to make one myself... or are there any volunteers out there? ;-)
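
As a starting point for such a tool, here is a rough sketch of how the depth could be measured in Python. It uses the standard sys.settrace hook to count how many calls lie between a test function and each executed line. This is only an experiment under simplifying assumptions (single thread, one test at a time, lines keyed by function name rather than file), not how any existing coverage tool works. measure_depths and the three small functions are hypothetical names.

```python
import sys


def measure_depths(test_func):
    """Run test_func and return {(function_name, line_no): shallowest depth}.

    Depth 1 means the line lives in a function called directly by the test.
    Lines of the test function itself are not recorded.
    """
    depths = {}
    level = 0  # Python frames entered since tracing started (test itself = 1)

    def tracer(frame, event, arg):
        nonlocal level
        if event == "call":
            level += 1
        elif event == "return":
            level -= 1
        elif event == "line" and level > 1:
            key = (frame.f_code.co_name, frame.f_lineno)
            distance = level - 1
            if key not in depths or distance < depths[key]:
                depths[key] = distance
        return tracer  # keep receiving line/return events for this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        test_func()
    finally:
        sys.settrace(previous)  # always restore whatever was active before
    return depths


def calculate(x):  # stands in for the Calculation layer
    return x * 2


def render(x):  # stands in for the User interface layer
    return f"<b>{calculate(x)}</b>"


def test_gui():
    assert render(21) == "<b>42</b>"


depths = measure_depths(test_gui)
by_function = {}
for (name, _line), d in depths.items():
    by_function[name] = min(d, by_function.get(name, d))
print(by_function)  # {'render': 1, 'calculate': 2}
```

Here the GUI test reaches calculate only at depth 2, so its lines would count for 50% rather than 100%. A real tool would still need to merge results across many tests and map them back to source files.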