Ran across this story about the Three Amigos Strategy of Developing User Stories that looks interesting. I’m definitely feeling the squishy requirements problem lately, though I’m not sure the Quality staffing levels I have at the moment will let me fully staff the third amigo.
And the Quality seat at the table isn’t the sort of thing you can just assign someone a hat to wear for a meeting.

My boss asked for some help devising some Key Performance Indicators (KPIs) for our team of Quality Engineers for the next year. We needed to define some metrics to measure the performance of our entire team; ideally metrics that are based on numbers, really reveal how well the team is executing, and help focus the team on what is important. The question of how to do this (well) comes up often and seems hard to answer. Keeping focus on performance and improvement can be hard as QE are often caught in the software development cycle between development engineering always wanting a bit more time and release deadlines that can’t move. Their performance is no easier (and likely harder?) to judge than that of ‘regular’ development engineers. I know I’ve been asked countless times while interviewing how one should measure and gauge individual QE performance. It’s a tough question, especially if you want easy-to-measure numbers, rather than vague hand waving.

The first thing that was suggested by her management group was to track the # of bugs filed. That’s such an easy number to generate on a per-person and per-team basis. But how useful is that number? Today I reopened a bug at the direction of the engineer I’m working with because he had conceived of a new way to handle an issue that I’d filed and he’d dismissed earlier. If my team (or I) were being rated based on the number of bugs filed, it would have been in my best interests to spend a couple extra minutes creating a new bug for the issue instead of clicking [Reopen][Save]. If the “# of bugs filed” really is an indication of our/my performance, we’re saying that the time spent filing that extra bug was adding value to our organization and the company. But I’m pretty sure we all agree that it wasn’t, as the incentive isn’t to ‘add value to the company’ but to generate bug reports.

We also discussed that maybe “Reopened Bugs” was a possible metric to track, with the goal being to keep that number low. That means my act of imagined misdirection above would have helped keep another metric low in my favor.

But, I pointed out, if we were to track the bugs, it would also make sense to track the number of open bugs over time for each project. This would be the basis for some charting and bug modelling, which I felt would add value to the organization and improve our ability to ship software with higher quality and with a better understanding of its quality.* Tracking the number and state of bugs in itself isn’t bad. But the number doesn’t map well to being a measure of the value that the Quality team is adding to the business.

We walked through several other potential metrics, some good, some bad:

more subjective test coverage (how much of the product is tested, identify gaps)

code coverage (use a tool to evaluate how much of the code is exercised by tests)

number of test configurations and environments

time spent maintaining test systems/environments divided by the number of them

time spent per project (map initial estimate vs. actual, also track the inevitable project development slippage against what was promised*)

number of issues that make it to customers on a per project basis to see if our quality is improving over time*

* Throughout the process I just couldn’t help turning the “metrics of QE team performance” into numbers that show the health of the project and software.

But the problem was that we had quite a lot of ideas for numbers that could be generated that didn’t have much of a connection to what we as quality engineers are doing with our time or what our entire engineering team knows that we MUST work on in order to assure and improve our quality. There are a lot of projects that are vital (“create beta test system for key customers”) that don’t fit into the ‘execute tests, file bugs’ model or metrics. And many of these suggestions are open to being gamed. I can easily write more bugs, break tests into separately numbered parts, write a lot of easy tests and avoid working on a hard test, and devise all sorts of ways to pad the various metrics in my favor that are totally counter to constructive work. While I’m sure I wouldn’t copy-n-paste a section of code 10 times instead of writing a loop to pad my statistics, these ‘bad metrics’ can subtly influence even the best of us. More importantly, I’ve never seen ‘bad metrics’ have a positive effect on team morale; at best we shrug our shoulders and get back to work.

And that’s when it hit me: There is already a way to measure the value being added by the work of individuals and teams. It was something that the various Agile teams I’ve been a part of were collecting all the time. Story points is one of the names they go by. Every task that the team is working on can be assigned a ‘business value’ that determines how useful they are to generating value. There may be other measurements as to how hard they are, how long they are expected to take, etc. but the important one here isn’t that it’s an arduous 3-month long project, but rather the relative value added to the company. These should be the points assigned by the product owner as to the value added to the company, not the team’s planning poker estimate of effort required. We want to measure outcomes not estimate effort.

Story points. That’s the appropriate measurement of my team’s performance and individual performance. That’s a measurement that doesn’t distract the team from working to provide value. That’s a measurement that can be taken into account when some other emergency project arrives. We may find at the end of the year that we have only provided 75% of the “points” for the quality initiative goals for the year–but I bet it will be because there’s another 50% worth of unplanned and unforeseen projects that have been completed. And those projects can be assigned points–and if they’re trumping our scheduled work, they must be worth more to the company. We can scale our expectations of our number by the size of the staff. We can compare the cost of the team and infrastructure against those points. We will encourage the team to work faster, smarter, and more focused.
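
To make that arithmetic concrete, here is a small sketch in Python. The function name, the 200-point plan, the team size, and the cost figure are all hypothetical illustrations rather than real data; only the 75% and 50% proportions come from the paragraph above.

```python
def delivery_summary(planned, planned_done, unplanned_done, staff, team_cost):
    """Summarize value delivered against the plan (all inputs are illustrative)."""
    total = planned_done + unplanned_done
    return {
        # Share of the originally planned points that were delivered.
        "planned_pct": 100.0 * planned_done / planned,
        # Everything delivered, planned or not, relative to the original plan.
        "total_pct_of_plan": 100.0 * total / planned,
        # Scale expectations by the size of the staff.
        "points_per_person": total / staff,
        # Compare the cost of the team against the points delivered.
        "cost_per_point": team_cost / total,
    }


# Hypothetical year: 200 points planned, 150 delivered, plus 100 unplanned points.
summary = delivery_summary(
    planned=200, planned_done=150, unplanned_done=100, staff=5, team_cost=500_000
)
print(summary)
```

In this made-up year the team hit only 75% of the plan but delivered 125% of the planned value once the unplanned work is counted.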

I’ve been involved in Agile development, Scrum trainings, Bay Area Agile Leadership Network meetings, read the books, etc. but for whatever reason I hadn’t quite made the connection on how to apply the theory to the reality of this level of software development. Perhaps the question being pitched to me as counting the number of bugs threw me off and into a defensive posture and I couldn’t think straight.

Agile KPIs are the best KPIs. Whether your organization is Agile or not.

Many of the challenges my team faced while implementing, automating, and reporting performance test results came from the differences between performance tests and the type of test the rest of the organization was used to: the standard “Pass/Fail” test. This is another in a series of Some thoughts on Performance Testing where I’ll examine those differences.

Pass/Fail tests are absolute; Performance tests have nuance. The product management, project/program management, and engineering teams love the absolute nature of pass/fail tests. Even when a Pass/Fail test is broken, everyone knows what the desired result would be had it been working: Pass. Pass/Fail tests can still cause problems, especially automated tests in a system that hasn’t been built to consider that your results should really fall into one of 3 main categories: Error (of various types)/Test Completed with a Pass/Test Completed with a Fail. This adds some complexity to deal with ‘failure to run’ cases, such as required media/network connection failures that preclude a test running to completion. But even in an imaginary world where all tests are always able to run to completion, performance tests don’t give such simple pass/fail answers. And the answers they do give can change over time and configuration.
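
A minimal sketch of that three-way split, in Python. The names Outcome and run_check are hypothetical, and it assumes a test is just a callable that returns True or False; a real harness would distinguish error types more carefully.

```python
from enum import Enum


class Outcome(Enum):
    ERROR = "error"  # the test could not run to completion
    PASS = "pass"    # the test completed and passed
    FAIL = "fail"    # the test completed and failed


def run_check(check):
    """Run a zero-argument callable and classify the result three ways."""
    try:
        ok = check()
    except Exception:
        # A 'failure to run' (missing media, dropped connection, ...) is
        # reported separately instead of being counted as a product failure.
        return Outcome.ERROR
    return Outcome.PASS if ok else Outcome.FAIL


def network_down():
    raise ConnectionError("required network connection unavailable")


print(run_check(lambda: 2 + 2 == 4))  # Outcome.PASS
print(run_check(lambda: 2 + 2 == 5))  # Outcome.FAIL
print(run_check(network_down))        # Outcome.ERROR
```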

A standard Error/Pass/Fail test returns a simple value (or error) and suites of tests can be easily rolled up as in most cases those 100 tests in suite X will all Pass, so the suite itself can just say “Pass” and perfectly record all the data for the underlying tests. A performance test may report multiple results for a single test: e.g. both time elapsed and memory used. You’ll also need to know what units the test is reporting so the values have more meaning. For some applications you may need even more information about the data being recorded such as amount of time a measurement spans, etc. though some of that info may fall more into “know what your test is testing” rather than info that needs to be stored with every measurement taken, which bleeds into info about tests vs. info about data.

All of the readings that are taken will require further info to interpret. Tests which use a resource (memory, time, battery, cpu) generally aim for less, but for other tests (connections, maximum, or X per time-period) better performance means a higher number. Frames per second, number of operations per second, number of connections are all examples where bigger is better. A boolean “BiggerIsBetter” may capture that sort of vector info about the test results. Other tests may require falling within a certain range, so a simple vector isn’t enough. If you were using a Control Chart, comparing results to a target with upper and lower limits, you may have a lot of interpretive information which needs to be considered.
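
As a rough illustration of the interpretive info that travels with a reading, here is a Python sketch. The Measurement class, its field names, and the helper functions are assumptions made for this example, not a description of any particular reporting system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    name: str               # what was measured, e.g. "scroll_fps"
    value: float            # the reading as measured
    unit: str               # units give the value meaning, e.g. "fps" or "ms"
    bigger_is_better: bool  # the direction of "better" for this reading


def improved(old, new):
    """True if `new` is better than `old`, respecting the direction flag."""
    if (old.name, old.unit) != (new.name, new.unit):
        raise ValueError("readings are not comparable")
    if new.bigger_is_better:
        return new.value > old.value
    return new.value < old.value


def within_limits(reading, lower, upper):
    """Control-chart style check: the reading must fall inside a range."""
    return lower <= reading.value <= upper


old_fps = Measurement("scroll_fps", 52.0, "fps", bigger_is_better=True)
new_fps = Measurement("scroll_fps", 58.0, "fps", bigger_is_better=True)
old_ms = Measurement("startup", 120.0, "ms", bigger_is_better=False)
new_ms = Measurement("startup", 140.0, "ms", bigger_is_better=False)

print(improved(old_fps, new_fps))  # True: more frames per second
print(improved(old_ms, new_ms))    # False: start-up got slower
```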

You may also have one or more complete sets of ‘target’ results which you wish to try to achieve or better. I’ll cover that later in my discussion of Benchmarks.

Each platform, configuration, or network design may require different benchmarks, test time-frames and limits, and will impact the interpretation of results. A simple ‘bouncing balls’ animation test may run so much faster on a new desktop compared to an older smart phone that the test itself must behave differently so it doesn’t run too slowly or too quickly to provide decent results.

If a performance test does Fail upon comparison with a goal, there is still a great deal more to learn. Did the performance improve a little, a lot, not at all, or did we go backwards? Are we almost to the goal and some tweaks may be all we need or are we far away and really need to reconsider the approach? Does the history of results for this test reveal anything? Did we make one little tweak to improve one test and affect others (negatively or positively)? Even if the results of a performance test technically Pass, you will likely learn a lot more from the result in the context of previous results and benchmarks. Did we go from a solid Pass to just barely? Are we trending down on this test as we implement other systems?
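
Two of those questions (how much did we move, and how far are we from the goal?) can be sketched as simple percentages. The function names below are hypothetical, and the sign convention (positive always means ‘good’) is just one reasonable choice.

```python
def percent_change(previous, current, bigger_is_better):
    """Signed percentage change where a positive number always means 'got better'."""
    raw = 100.0 * (current - previous) / previous
    return raw if bigger_is_better else -raw


def margin_to_goal(current, goal, bigger_is_better):
    """Percent of the goal by which we beat it (positive) or miss it (negative)."""
    raw = 100.0 * (current - goal) / goal
    return raw if bigger_is_better else -raw


# A timing test (smaller is better) that went from 100 ms to 90 ms
# against a goal of 80 ms: it improved, but still misses the goal.
print(percent_change(100, 90, bigger_is_better=False))
print(margin_to_goal(90, 80, bigger_is_better=False))
```

Storing the raw readings and computing these afterwards keeps a bare Pass or Fail from hiding whether the result was a near miss or a large regression.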

Understanding the need for context and more information for interpreting results is the key to understanding how tests of performance differ from standard Pass/Fail/Error tests.

While there may be cases where a simple, uncomplicated performance test provides “Pass”, I prefer all such tests to record their results as-measured and do the interpretation of results after completion. If you aren’t graphing your results over time, you should be.

Having inherited a huge mass of performance tests that were developed by many different teams over more than 5 years, one of the first things I had to do was try to figure out what each test did and make some sense of them. As I cataloged each one, exploring what they did and what they were intending to measure, they began to sort themselves out into three main categories based on the ‘depth’ of testing: Micro-Benchmark, Feature/Module, and Product/Real-World tests. These borders are a bit fuzzy but the various levels of testing have some important distinctions.

These levels of depth build upon each other and should generally be executed and evaluated in this order. One hopes that a problem with the creation of your base data object will be quicker to identify, fix, and confirm fixed via a simple Micro-Benchmark level test instead of noticing that nearly all the Product level tests have slowed down.

Micro-benchmark level tests
Micro-benchmark tests are designed to test small sections of code at a very basic level. The most common type of this test runs a small bit of code in a tight loop that repeats many times and measures the amount of time elapsed. Often these should be very focused on a specific aspect of the software and stripped down to avoid other factors influencing the results. Some examples of these sorts of tests are math functions, object creation, file I/O, database interaction, text/image drawing, screen transition, etc. We repeat the operation many times so any change in behavior is magnified.
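
A minimal tight-loop sketch in Python, using only the standard library. The function name micro_benchmark and the choice to keep the best of several repeats are assumptions for illustration; a real suite would likely record every run.

```python
import time


def micro_benchmark(operation, iterations=100_000, repeats=5):
    """Time `iterations` calls of `operation`; return the best run in milliseconds."""
    best_ms = float("inf")
    for _run in range(repeats):
        start = time.perf_counter()
        for _i in range(iterations):
            operation()  # the one small thing being measured
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        # Keeping the fastest run reduces noise from other activity on the machine.
        best_ms = min(best_ms, elapsed_ms)
    return best_ms


# "How many ms does it take to generate 100,000 objects?"
print(f"{micro_benchmark(object):.2f} ms for 100,000 object creations")
```

Note that the loop overhead itself is included in the measurement, so the number is most useful for comparing runs of the same test rather than as an absolute cost.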

I’ve labelled these “micro-benchmark” rather than “benchmark” for a couple of reasons. First, to avoid confusion with the general use of the word “benchmark” in testing. Second, to draw attention to the focus on keeping the part tested here very small. Many of the tests one may encounter in a “benchmark” suite of tests (e.g. Octane 2.0) are what I’m calling “micro-benchmark” tests.

These tests are developed with a firm understanding of the underlying code. They may be developed in coordination with the code developer and may even start out as a development tool that an engineer runs while figuring out the best way to implement a feature. Micro-benchmark tests of this sort are very interested in pointing out differences in performance between implementations and in underlying architectures.

Because the measurements made by micro-benchmark tests should be very focused, any change in performance here should be easy to trace back to a cause. They also may provide some protection of the base units of code from later changes.

While most often concerned with time-based measurements (“How many ms does it take to generate 100,000 objects?”), this type of test may also focus on memory use.

Feature/Module level tests
Feature/Module level tests are designed to test the performance of specific features and modules in isolation. These types of performance tests are probably best built as part of feature/module development, and the existence of the tests should be a part of the definition of feature complete and/or your definition of “done”. These tests do not require as much understanding of the underlying code as a micro-benchmark test–you don’t need to understand the code which anti-aliases text or renders the video in order to measure the fps of a scrolling text box or of a video.

While it is important to measure, understand, and prove performance of a new feature during development, these tests are often considered “protective” once the feature/module is complete. If someone later fixes a bug in the feature/module without fully understanding the code, or changes underlying code that the feature/module depends upon, the resulting performance change can be caught.

These tests should try to focus on the feature/module so any significant performance change can be attributed to the feature/module. A test of basic video playback should still try to isolate video from other factors such as text rendering, network performance, etc. that may cause a test to report a drop that isn’t caused by the video playback.

Because these are more involved tests the performance variance by platform and over time may be much wider and require more thought than an ‘all out’ micro-benchmark test. A ‘bouncing balls’ type test written for an average desktop browser could perform too poorly to be useful on a mobile device but also not tax the system of a new high-end desktop in a couple years. There are various ways to try to anticipate this: different ball counts per platform, stepping up the ball count and taking readings at each step, increasing the number of balls until performance is degraded below a certain point, etc.
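
The ‘step up until it degrades’ approach could look something like the Python sketch below. The function find_capacity, its parameters, and the fake fps model standing in for a real animation are all hypothetical; a real test would measure actual frame rates.

```python
def find_capacity(measure_fps, min_fps=30.0, start=100, step=100, limit=10_000):
    """Increase the ball count until the measured fps drops below `min_fps`.

    Returns the last count that still held `min_fps` (or None if even the
    first step failed) plus every (count, fps) reading taken along the way.
    """
    readings = []
    last_ok = None
    for count in range(start, limit + 1, step):
        fps = measure_fps(count)
        readings.append((count, fps))
        if fps < min_fps:
            break  # performance has degraded below the threshold
        last_ok = count
    return last_ok, readings


def fake_device(count):
    # Stand-in for a real measurement: frame rate falls as balls are added.
    return 60_000 / count


capacity, readings = find_capacity(fake_device)
print(capacity, len(readings))
```

Because the readings at each step are kept, the same run works on a slow phone and a fast desktop: each simply stops at a different ball count.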

This sort of test may measure time (“How long does this transaction take?”), but it covers a broader scope than the micro-benchmark’s tight loop and may measure different parameters such as frames per second (fps), memory use, load on the system, etc.

Product level tests
The Product level tests are designed to replicate real-world use cases. These may focus on a certain area (e.g. specific graphics pipeline, account creation) but are intended to measure how everything behaves together. These may be tests run against a literally live system or they may be a more canned and controlled simulation which replicates expected real world use cases. Though we want to test something akin to a real-world situation it is still helpful to remove as much noise as possible. When creating a product level test for a helicopter attack game you will get more reliable, consistent, useful data if you can remove causes of randomness: exclude download time when extraneous, hardcode the random seed or otherwise hardwire the enemy movements, program the player’s moves (or at least require no interaction), etc.
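
Hardcoding the random seed is the easiest of those to show. In this Python sketch the function enemy_path and the 640x480 play field are invented for illustration; the point is only that a fixed seed makes every test run see identical ‘random’ movement.

```python
import random


def enemy_path(seed, steps=5, width=640, height=480):
    """Generate a repeatable list of enemy positions from a fixed seed."""
    # A private generator avoids disturbing (or depending on) global random state.
    rng = random.Random(seed)
    return [(rng.randint(0, width - 1), rng.randint(0, height - 1))
            for _ in range(steps)]


first_run = enemy_path(seed=1234)
second_run = enemy_path(seed=1234)
print(first_run == second_run)  # True: same seed, same movements
```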

These gestalt tests may be the most important from an end-user and system health point of view, because they really let you know how all the pieces are working together. These are the tests that the Product team should be most concerned with. But they are also generally the hardest for the engineering team to reverse engineer to find the root cause of a change. With a robust set of micro-benchmark and feature/module tests you may already have indicators of where the problem is (and isn’t). If you’re able to keep these easier to deconstruct, you’ll save engineering time and aggravation. Very high-end, complex tests may have simpler versions: for the helicopter attack game we may want one version to test our sprite animation with little logic and another logic-based with little animation.

These tests will often measure such things as rendering performance (fps), memory, start-up time, CPU or battery usage, interaction speed, concurrent users, etc.

Bonus level: Meta-level tests
This probably isn’t really a different category from the Product level of testing, but when an example of this cropped up it turned out to require a different measurement mechanism. Rather than building out a specific test, the meta-level performance test metric is a measure of meaningful data for running an entire test suite (e.g. “What does the memory footprint look like after executing all the tests in the test suite?”). It may even be adding performance-type tracking around a suite of non-performance pass/fail tests. This sort of test is likely covering a higher level measurement such as memory usage, battery drain, cpu use, etc. A fall in performance here will likely require a bit of thought as to what the root cause may be.
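
As one possible shape for this, the Python sketch below wraps a suite of ordinary pass/fail tests with memory tracking from the standard library’s tracemalloc module. The function run_suite_with_memory is hypothetical, and tracemalloc only sees allocations made by Python itself, so treat it as an illustration of the idea rather than a full memory-footprint tool.

```python
import tracemalloc


def run_suite_with_memory(tests):
    """Run pass/fail callables and report memory traced across the whole suite.

    Returns (results, bytes still allocated at the end, peak bytes).
    """
    tracemalloc.start()
    try:
        results = [bool(test()) for test in tests]
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return results, current, peak


def builds_big_list():
    data = list(range(100_000))  # temporary allocation, released on return
    return len(data) == 100_000


results, current, peak = run_suite_with_memory([builds_big_list, lambda: 1 + 1 == 3])
print(results, current, peak)
```

The individual tests still only answer Pass or Fail; the memory numbers are recorded around the suite as a whole and can be graphed over time like any other performance reading.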

I had a conversation earlier this week that got me to thinking about Performance Testing. Specifically, it was an open question about how to go about doing Performance Testing on a specific product, and my mind instantly exploded with a lot of thoughts about what I learned from a previous software product. The products aren’t identical; the previous one was enormously complex (Mac & PC desktops * most browsers, mobile (Android, iPhone, RIM/QNX, Palm), a little Linux, set-top boxes, etc.) and the new one is much more constrained. But I still think that much of what I learned on the previous project is applicable to simpler and different Performance Testing approaches.

So I’m going to scribble down some notes and take a look at some old documentation and see if I can put some coherent posts together. This isn’t intended to be a comprehensive guide to everything one could do or encounter (much like the Wikipedia Software Performance Testing page which seems to be written with mostly web-services in mind) but it is intended to highlight some things I learned that I think may be interesting.

It dawned on me this morning that I have lately been hitting a lot of technical debt, but of a different sort than I’m used to encountering.

Technical debt is a metaphor that Ward Cunningham came up with to compare technical complexity and debt. The term comes up a lot in software development and generally the ‘debt’ is when you do something ‘quick and dirty’ to hit a deadline and then end up paying ‘interest’ for it later when you encounter problems: bugs, less robust code, code that is harder to work with, etc. If you pay off the debt soon after the deadline is hit by writing ‘good code’ then you just did a short-term trade off–but most of the time it is ignored and the development time is turned to the next deadline and over time this can build to the point where much of the work a team is doing is servicing the debt. It’s also probably brought up by engineers for more cases than necessarily apply–if today’s new feature isn’t found to meet a real need/increase demand/bring in more business, time spent doing it ‘the right way’ for building upon and extending later is wasted. (Agile development hopes to understand this.)

Ordinarily when I’ve encountered technical debt, it has been much like what the Wikipedia article discusses and is often considered ‘friction’ in the system which slows down new development and maintenance.

As I’m currently involved on a product which is being radically de-staffed (not off-shored, but cut across the board) I’ve been encountering another sort of technical debt: with fewer and fewer people available, each with less and less of a complete understanding of this complex physical+software system, it gets harder to maintain that system, let alone continue to do any development. The technical debt that has built up is based on work and processes spread around a large number of machines, which are all too often old, physical machines of dubious need. There is a lack of documentation about what state things are supposed to be in. A drift of development over time, which never moved important pieces to whatever the modern system is, leaves us with 100’s of machines which occasionally require direct user interaction when a critical bug or security breach is discovered. Development workflow that takes just a tiny bit of hands-on interaction, even little things like adding a build number to the bug tracking database, quickly drags everything to a halt when you have only 20% of a team left.

These are bits of technical debt that affected the development team when under ‘growth development’ as well, but in a less severe way as there were plenty of people (too many?) to spread the pain around. But they were known. And I wouldn’t suggest that the team should have spent time worrying about fixing problems that would purely impact ‘post growth upkeep’ before it was clear that the product was being shelved. But there were a great many ‘best practices’ that weren’t followed over the years, which become glaringly evident when the development shifts. I know I certainly had opportunities to recognize that random machines stacked in a server room were something that should be addressed, but it seemed like such an improvement compared to the original problem of random/critical machines sitting under various desks.

Going forward, I think my main takeaway from this situation is a broader understanding of how technical debt can balloon in different ways and of the risks it presents.

My take on the sort of situation he is discussing isn’t so much that this is an example of “Scrum is too much” as it is that all the developers involved in this rapid-prototyping situation are senior enough that they already have or don’t need most of the advantages of Scrum: visibility, engaging all team members, protecting developers from frequent change, frequent iteration and working product.

To me this seems more like an example of an experienced team that was motivated to develop while the visionaries are still on their vision-quest. Post-Scrum, anyone?

Many have heard (it was announced at MAX) that Flash CS5 will have the ability to publish Flash content for a variety of platforms, including making iPhone apps. This doesn’t mean that Flash content will run in Apple’s iPhone Safari browser, but it will allow a lot more people to create content for the iPhone, and/or quickly port over things that they have already created in Flash to the iPhone.

Here’s some perspective from a developer that has already used the beta CS5 to create the Trading Stuff in Outer Space game: Flash on iPhone: My Experience.