When it comes to testing software, many of today’s organizations rely heavily on comprehensive testing, especially unit testing, to minimize the risk of outages. But in this session, Michalis Zervos of Microsoft talked to audience members about what some consider the “next generation” of building software resiliency: deliberately forcing those anticipated faults to occur in your own software.

“Fault injection,” as Zervos calls it, can be performed on everything from virtual machines to custom applications to hardware. It is a practice Zervos’ team at Microsoft actively uses and promotes in order to see not just how particular services are affected by certain unwanted events, but also how dependent services and software are affected.

Zervos explains some of the reasons to adopt fault injection alongside testing.

“We create ‘storms in the cloud’ to see how it performs under pressure and failure and use that to create resiliency,” he said. And according to Zervos, fault injection can be used for more than just testing resiliency. It can also be used for things like testing new features, training, and verifying staged deployments.

Zervos covered the numerous faults that teams could consider injecting, including creating a kernel panic, “hooking” and disrupting critical service code, crashing critical processes, and even pulling the power plug on a data center. He also suggested a few publicly available tools that development teams can use to make the process easier, such as Consume.exe, the Sysinternals tools and “managed code fault injection” through TestApi, a library of test and utility APIs.
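To make the idea concrete, here is a minimal sketch, in Python rather than the .NET stack Zervos discussed, of what “hooking” a critical call and forcing it to fail can look like. The function names and failure rate are illustrative assumptions, not part of TestApi or any tool he named:

import random

# Wrap a target function so that calls to it can be forced to fail on
# demand. Tools such as TestApi do this at a lower level, by intercepting
# methods in compiled code; this is only an illustration of the concept.
def inject_fault(failure_rate, fault=None):
    fault = fault or RuntimeError("injected fault")
    def decorator(func):
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise fault  # force the anticipated failure to occur
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_fault(failure_rate=0.25)  # roughly one call in four fails
def read_from_storage(key):
    return f"value-for-{key}"  # stands in for a real dependency call

Calling read_from_storage() under this wrapper exercises the error-handling paths that ordinary unit tests rarely reach.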

Zervos did warn audience members that fault injection cannot be performed without certain precautions if teams are to get accurate results and avoid creating new problems. Teams still need to follow fundamental security principles such as the principle of least privilege, and make extensive use of code signing. They should also create a “safety net” that automatically removes faults should they get out of a tester’s control, and keep a “kill switch” available, which he said can save developers and testers “a lot of grief.”
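A minimal sketch of how those last two precautions might fit together, assuming a TTL-based “safety net” and a file-based “kill switch” (both mechanisms and all names here are illustrative assumptions, not Zervos’ implementation):

import os
import time

KILL_SWITCH_FILE = "/tmp/fault-injection-kill-switch"  # hypothetical path

class TimedFault:
    """A fault that removes itself automatically and honors a kill switch."""

    def __init__(self, fault, ttl_seconds):
        self.fault = fault
        self.expires_at = time.time() + ttl_seconds  # safety net: auto-expiry

    def maybe_raise(self):
        if os.path.exists(KILL_SWITCH_FILE):  # kill switch: disable all faults
            return
        if time.time() >= self.expires_at:  # safety net: fault has expired
            return
        raise self.fault

# A disk fault that disappears on its own after five minutes, or as soon
# as an operator creates the kill-switch file.
disk_fault = TimedFault(IOError("injected disk failure"), ttl_seconds=300)

The point of both mechanisms is the same: no injected fault should ever depend on a human remembering to turn it off.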

Zervos also stressed the importance of extensive verification and reporting when it comes to fault injection, and told audience members that it is useful to manage fault injection from a centralized location.

“If you are not able to verify what happened, you don’t get the most out of your system,” he said.
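One way to read that advice: every injection should be registered with a single fault-management service so testers can later verify what actually happened. The sketch below assumes a hypothetical internal endpoint and event schema; Zervos did not describe his service’s API:

import json
import time
import urllib.request

FAULT_SERVICE_URL = "http://fault-service.internal/api/events"  # hypothetical

def report_fault_event(target, fault_type, status):
    """Log a fault-injection event centrally for later verification."""
    event = {
        "target": target,          # the machine, VM, or service the fault hit
        "fault_type": fault_type,  # e.g. "kernel_panic" or "process_crash"
        "status": status,          # "injected", "removed", or "killed"
        "timestamp": time.time(),
    }
    req = urllib.request.Request(
        FAULT_SERVICE_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # the central log makes verification possible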

Zervos presents his own system architecture in relation to a centralized fault management service.

One of Zervos’ final points was that it is not enough to simply perform fault injection every now and again. He stressed that teams need to integrate fault injection as a continuous part of the production cycle and find creative ways to encourage its adoption. One suggestion he made was the idea of “recovery games,” in which one team member simulates an attack on a particular system and another team member, often a trainee, must record what occurs and take the proper steps to mitigate the risk of an outage. By implementing these kinds of programs, Zervos said, his organization was able to increase adoption of fault injection and also gain helpful insights into team members’ behavior, such as that some spent too much time debugging and not enough time actually mitigating the problem.

“It needs to be part of the engineering process and part of the culture of the company,” Zervos said.

Zervos provides examples of the goals that can be achieved through adoption and training programs such as “recovery games.”

John Billings, technical lead on one of the infrastructure teams at Yelp and an attendee of Zervos’ talk, said he thoroughly enjoyed the session and believes that fault injection is “the next step in actually testing resiliency of production systems.”

Billings, who also gave a talk at QCon on the “human side of microservices,” said he particularly liked the fact that Zervos spent his time discussing the general principles of fault injection rather than specific technologies. And while his company already makes use of fault-injection techniques, he hopes to push adoption of the strategy even further there, and hopes that others will as well.

“Tests can only cover so much that you’ve thought about beforehand,” he said. “If you actually have fault injection happening all the time in production, you get that additional level of reliability that otherwise would be very difficult to achieve.”

Billings also said he liked the idea of introducing “fault injection games” as an approach to encouraging adoption of this strategy, but he believes these adoption strategies must align with a company’s individual culture. For instance, he noted hearing about the idea of a “badge-based system” that awards teams particular badges for completing and adopting certain testing and production techniques.

“You have to experiment and just see what works for your particular culture and your company,” he said.

A photograph from a Mars rover may be breathtaking, but it will not deliver the complex data space scientists seek. Scientists like Thomas Stein, a computer systems manager at Washington University in St. Louis, need broader sets of data in formats that work with modern data analysis software. Stein helped create Analyst’s Notebook, a tool that documents geological findings from space missions and organizes that data in an online offering accessible to scientists and the public.

Consider the data coming from just one source, say, the Mars rover Opportunity. Some scientists are focused on a certain type of data from that one source, some on others. Meanwhile, said Stein, the general science community may want broader data from that source to do research in other disciplines. In addition, many scientists are doing cross-instrument and cross-mission searches and correlations to study a variety of topics.

“Today’s scientists cannot simply convert an image to a .JPG and use it, because you lose so much of the science quality of the data,” said Stein, who works in the university’s Department of Earth and Planetary Sciences. Analyst’s Notebook helps enable the replay and archiving of mission images and data, but that information must still be archived in formats accessible to scientists using many different software applications and devices.
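Stein’s point is easy to demonstrate. Raw instrument data is commonly stored at 16 bits per pixel (65,536 distinct levels), while 8-bit formats such as JPEG keep only 256, so nearby measurements collapse together. This toy example with synthetic values shows the loss; JPEG’s lossy compression would then discard even more:

import numpy as np

raw = np.array([[1000, 1001], [1002, 1003]], dtype=np.uint16)  # 16-bit counts
as_8bit = (raw // 256).astype(np.uint8)  # the rescaling an 8-bit export applies
print(as_8bit)  # all four distinct measurements collapse to the value 3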

Stein’s group works with NASA (National Aeronautics and Space Administration) to archive planetary data for the long term – as in the next 50 to 100 years. “We wanted to develop a value-added tool that helps scientists bind this data in a meaningful way,” said Stein. “By giving them data previews, we’d help them understand what they’re getting before they actually hit the download button.”

Developing software for geological studies of space rocks wasn’t Stein’s intention when he got an after-college job in the Smithsonian Institution’s Mineral Sciences Department. Yet it was there that he was asked to develop software for a traveling exhibit on volcanoes. The success of the three applications he delivered led to more projects for the Smithsonian.

After these geological software projects succeeded, Washington University contacted Stein about programming software for scientists studying “space rock” data from the Giant Magellan Telescope. The immediate problem Stein addressed was a flaw in the way scientists were doing field testing. “Nobody was taking notes about the decision-making process,” he said. “After a week of field tests, they realized, ‘Hey, we don’t even remember why we decided to look at this rock instead of that rock.’”

Of the many challenges of building scientific applications, two in particular really perplexed Stein and the NASA team: the unpredictability of data from the rovers, and feature glut.

For an orbital mission, an obvious objective is to map the planet systematically, but the rovers don’t make this process easy, because they, well, rove. “Scientists often don’t know where the rover will drive and what it’s going to find,” said Stein. Another goal is determining the characteristics of natural objects, such as rocks. The scientists need to know where a find was made and in what context, which is hard to tell from a single image. To deal with this problem, the development team used Microsoft Image Composite Editor, which was built on Microsoft SQL Server. The Editor can be used to create images that aggregate the surroundings of a finding into a single contextual mosaic image.
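As a rough illustration of the mosaic idea (not how Image Composite Editor works internally, which involves real feature matching and blending), assembling frames into one context image can be as simple as pasting tiles at known offsets; all names and sizes below are hypothetical:

from PIL import Image

def build_mosaic(tiles, positions, mosaic_size):
    """tiles: list of PIL images; positions: (x, y) offset for each tile."""
    mosaic = Image.new("RGB", mosaic_size)
    for tile, (x, y) in zip(tiles, positions):
        mosaic.paste(tile, (x, y))  # place each frame at its offset
    return mosaic

# Three 512x512 frames laid out left to right around a finding.
frames = [Image.new("RGB", (512, 512)) for _ in range(3)]
context = build_mosaic(frames, [(0, 0), (512, 0), (1024, 0)], (1536, 512))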

The feature glut issue comes from the length of today’s space missions. “Keeping up with what our users need over 10-15 years is unbelievably hard,” Stein said. “Think of how different the expectations of software users were 15 years ago – nobody asked for one-click ordering online.”

The development team, focused on Opportunity and other NASA rovers, sought an automated development platform that set up the back end so they could devote more time to building value-added tools specific to the planetary data coming from the rovers. “We shouldn’t be building basic code, laboring over documentation and doing cross-platform testing,” he said. Telerik Platform, a cross-platform development suite, was chosen to help the software teams focus on high-level challenges and bypass earlier phases of software development.

A web-based application running on the Microsoft ASP.NET platform, Telerik Platform provides user interface (UI) framework controls that NASA uses. In addition, Telerik’s automated test and quality-assurance tools reduce the time needed to build a feature. An example is a documentation feature Stein’s team built that enables rapid online searches. “Documentation becomes very difficult when doing rapid application development and dealing with such huge sets of data,” he said. Telerik’s toolset helped him build a feature that lets a user looking for images of a certain target find them quickly online “at the push of a button, instead of the user having to do literature searches.”

Being able to react quickly to user needs is a necessity today, one that automated test and development platforms make possible. “In reality, I’m still not a computer scientist, I’m a geologist,” he said. “A foundation development tool really helps me not worry so much about the computer science side and focus on the science side.”