Quality in the Cloud

Last week I was invited to a marketing event where several customers talked about their experience in the world of Cloud and virtualization.

It was really interesting. I learned a lot, and above all I discovered how production teams see… application quality.

A message to the project teams: good times are running out.

As you probably know, cost reduction is the primary reason why more and more companies move from physical to virtual infrastructure. I had not imagined how many servers, in every company, use less than 5% of their CPU. It is common to find 20% to 40% of the infrastructure underused to this degree. And the benefit is simple to evaluate: even if you keep a margin of 50% of the existing resources to handle peaks or HA (High Availability) policies, reclaiming half the capacity of 20% to 40% of your machines means a potential reduction of 10% to 20% in future purchases.
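The arithmetic behind that estimate can be written down in a few lines. This is only a back-of-the-envelope sketch: the function name and its parameters are made up for illustration, and it assumes that all servers have the same capacity and that every unit of reclaimed capacity avoids one unit of future purchase.

```python
def potential_purchase_reduction(underused_share: float,
                                 reclaimable_fraction: float = 0.5) -> float:
    """Fraction of total capacity freed by consolidating underused servers.

    underused_share: share of the machines that are heavily underused
        (20% to 40% in the figures quoted above).
    reclaimable_fraction: part of their capacity that is reclaimed once a
        safety margin is kept for peaks and HA (half, in the text).
    Assumption: all servers have the same capacity.
    """
    return underused_share * reclaimable_fraction


for share in (0.20, 0.40):
    saving = potential_purchase_reduction(share)
    print(f"{share:.0%} of servers underused -> {saving:.0%} potential reduction")
```

With a smaller safety margin the saving grows in proportion, which is why the 50% margin is the cautious end of the estimate.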

However, this evolution brings a lot of complexity for the production teams:

Most companies virtualize 50% of their servers; 80% is rare and nobody is 100% virtual.

Rationalization is not on the agenda: nobody buys from a single vendor, although it would limit operating expenses. All IT departments prefer to take advantage of competition between vendors in order to get lower prices.

So virtualization comes with highly heterogeneous infrastructures, both physical and virtual, and creates technological silos of hardware and software that require highly specialized teams. No system engineer is able to control and manage the whole production room or to resolve every incident that may occur.

Meanwhile, virtualization means a higher level of service.

Previously, when you needed a new server for a new project, the answer was « you will have to wait a month while we order the machine and have it delivered ». Now, installing a new VM (Virtual Machine) takes three mouse clicks, so you ask for it to be done overnight and would not understand if you were not served quickly. Time to Market has now reached the production department.

Previously, if your server had a performance problem, it was your server, your responsibility: you had to investigate and resolve this problem yourself. Now, it’s the VM’s fault. You just create a ticket for the incident and wait for other people to solve it.

However, as we just said, production teams are still not well equipped to cope with the complexity induced by heterogeneity. What is the cause of the problem: CPU saturation, memory or I/O? To measure this, you need different people: the specialist for this virtualization system, another who knows this operating system on this hardware well, etc. And the problem could be caused by another application in another virtual machine on the same server, taking resources from the other VMs. Engineers need to develop and maintain scripts to retrieve the right metrics, and spend time collecting and analyzing the data. Monitoring alerts and incident management have become a major concern in computer rooms.

The consequence is very simple. When the gains from reducing infrastructure costs are overtaken by spending on more and more hardware to support user demands and by rising troubleshooting costs, the answer from IT management is: ‘Stop’.

Some comments I heard during this event:

The project team that requests a new database server and does not use it: Stop.

The QA team that wants a copy of the production database to run its tests, so another 800 GB on the hard disk: Stop. Please relearn how to generate test data programmatically.

All these directories of data forgotten on a disk that no application can read any more, all these Excel files whose purpose nobody remembers, all the users who cannot say what the regulatory data retention period is, all these applications that are kept alive because nobody knows how to recycle the data: Stop.
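To make the remark about test data concrete, here is a minimal sketch of what generating it programmatically can look like, instead of copying a production database. Everything in it is hypothetical: the `customers` table, its columns and the `build_test_database` helper are invented for the example, and SQLite stands in for whatever database the application really uses.

```python
import random
import sqlite3


def build_test_database(n_customers: int = 1000, seed: int = 42) -> sqlite3.Connection:
    """Create a small in-memory database filled with synthetic rows."""
    rng = random.Random(seed)  # fixed seed: every test run sees the same data
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, country TEXT, balance REAL)"
    )
    countries = ["FR", "DE", "US", "JP"]
    rows = [
        (i, f"customer_{i}", rng.choice(countries), round(rng.uniform(-500, 5000), 2))
        for i in range(1, n_customers + 1)
    ]
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = build_test_database()
print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])
```

A thousand generated rows take a few kilobytes, and the test controls exactly which edge cases (negative balances, rare countries) are present.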

And last but not least, I asked a stupid question: « Are some technologies more demanding than others in terms of resources? ». I naively thought that databases, data mining or infocenters were the big consumers. Everybody looked at me and I thought for a moment that they had not understood. Then came the answer: « There is no bad technology, there is only bad code ».

The J2EE application whose memory leaks saturate the VM and its neighbors on the server: Stop.

The SQL query that sends the drive heads of the storage array into a panic: Stop.

Expensive statements in a loop: Stop.
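The last complaint is easy to illustrate. The sketch below is an assumed example, not code shown at the event: the `orders` table and both function names are invented, and SQLite is used only because it needs no setup. The first function runs one SQL statement per loop iteration; the second gets the same result from a single aggregated statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)


def totals_query_per_customer(conn, customer_ids):
    """The pattern to stop: one SQL statement per loop iteration."""
    totals = {}
    for cid in customer_ids:
        row = conn.execute(
            "SELECT SUM(amount) FROM orders WHERE customer_id = ?", (cid,)
        ).fetchone()
        totals[cid] = row[0]
    return totals


def totals_single_query(conn):
    """Same totals from a single aggregated statement."""
    cursor = conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    )
    return dict(cursor)


slow = totals_query_per_customer(conn, range(100))
fast = totals_single_query(conn)
print(slow == fast)
```

Against a real database server each iteration of the first version also pays a network round trip, so the gap widens with the number of customers.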

The tools exist to monitor the infrastructure end-to-end, from hardware to application, and identify the technical processes that saturate the virtual machines. Production teams are beginning to map them to applications. And system engineers are beginning to look at code quality tools.

There is no bad technology, there is only bad code. Beware: good times are running out.