OSSC zero-day patching

Author: Pawel Koniszewski

Dec 21, 2016

Written by Kamil Szczygiel

There is a wide range of opinions on the current state of live migration capabilities in OpenStack. By establishing objective goals and using repeatable automated tests, we intend to determine the proper configuration options and uncover any technical issues in the entire stack so we can continually improve the solution's results. The goal is to ensure that live migration with OpenStack is highly reliable, efficient, fast, and automated.

Testing methodology

Our working assumption was that we need to apply a zero-day system patch to all compute nodes as fast as possible. To simulate that, we perform the following steps (a minimal automation sketch follows after the list):

First, we disable the nova-compute services on the nodes being patched to make sure that virtual machines won't be migrated onto compute nodes that are currently being patched (relevant when patching more than one server at a time).

After that, we live migrate all of the virtual machines off those compute nodes.

Then we force a system reboot of the servers, without applying any actual system modifications.

When the servers and nova-compute services are back online, we re-enable the nova-compute services and repeat the same set of actions on the remaining compute nodes.
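The steps above can be scripted against the compute API. The following is a minimal sketch using openstacksdk; the clouds.yaml entry name ("mycloud"), the hostnames ("compute-1" through "compute-3"), and the helper function names are hypothetical, and the actual reboot and patching are assumed to happen out of band between draining and restoring a node.

    import time
    import openstack

    # Assumption: admin credentials live in a clouds.yaml entry named "mycloud".
    conn = openstack.connect(cloud="mycloud")

    def nova_compute_service(host):
        # Find the nova-compute service record for a given hypervisor host.
        for svc in conn.compute.services():
            if svc.binary == "nova-compute" and svc.host == host:
                return svc
        raise LookupError("no nova-compute service on %s" % host)

    def drain_host(host, timeout=3600):
        svc = nova_compute_service(host)
        # 1. Disable the service so the scheduler stops targeting this node.
        conn.compute.disable_service(
            svc, host=svc.host, binary=svc.binary,
            disabled_reason="zero-day patching")
        # 2. Live migrate every instance off the node; with no destination
        #    host given, the scheduler picks one. The host filter requires
        #    admin credentials.
        for server in conn.compute.servers(all_projects=True, host=host):
            conn.compute.live_migrate_server(server, host=None)
        # 3. Wait until nova no longer reports instances on this node.
        deadline = time.time() + timeout
        remaining = list(conn.compute.servers(all_projects=True, host=host))
        while remaining and time.time() < deadline:
            time.sleep(15)
            remaining = list(conn.compute.servers(all_projects=True, host=host))
        if remaining:
            raise TimeoutError("%d instances still on %s" % (len(remaining), host))

    def restore_host(host):
        # 4. After the out-of-band reboot/patch, put the node back in service.
        svc = nova_compute_service(host)
        conn.compute.enable_service(svc, host=svc.host, binary=svc.binary)

    # Example usage for one batch of nodes (hypothetical hostnames):
    for node in ["compute-1", "compute-2", "compute-3"]:
        drain_host(node)
    # ... reboot and patch the batch here, then:
    for node in ["compute-1", "compute-2", "compute-3"]:
        restore_host(node)

The reboot and actual patching between drain_host() and restore_host() are outside nova's scope and are assumed to be driven by whatever configuration management tooling is in use.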

Conclusions

Disabling tunneling has a security drawback: the migration traffic is no longer encrypted.

One way to mitigate this is to use a dedicated, encrypted live migration network based on IPv6.
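For reference, the nova.conf options on the compute nodes that control this trade-off look roughly as follows. This is a sketch: the address is a placeholder, and encryption of the dedicated segment would be handled outside nova (for example with IPsec on that network).

    [libvirt]
    # Send migration traffic natively instead of through the libvirtd tunnel;
    # faster, but the migration stream itself is no longer encrypted.
    live_migration_tunnelled = False
    # Bind migration traffic to a dedicated interface, e.g. an isolated
    # IPv6 live migration network (placeholder address).
    live_migration_inbound_addr = fd00:4d69:6772::12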

The lowest patching time with all virtual machines successfully migrated was five hours and 40 minutes (three compute nodes at once and one concurrent live migration).

While tunneling was disabled, auto-converge was never triggered because the live migration process made good progress on its own.

QEMU throttles down guest CPUs only when there is no progress during live migration. Because of that, live migration duration with auto-converge enabled and disabled was nearly equal.
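In a libvirt/QEMU-based deployment, auto-converge is typically switched on with a single nova option on the compute nodes; a sketch of the setting:

    [libvirt]
    # Let QEMU progressively throttle guest vCPUs when the migration is not
    # converging; it has no effect as long as the transfer keeps up.
    live_migration_permit_auto_converge = True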

The test with stressed instances and tunneling enabled caused live migrations to fail.

The virtual machines dirtied memory pages faster than the host could transfer them to the destination host.

Enabling the auto-converge feature did not improve the live migration process.

Network throughput on the live migration interface was below 2 Gbit/s during migration (due to tunneling).

The attempt to live migrate the instance timed out after 15 minutes.

Patching three hosts at once with three concurrent live migrations (nine concurrent live migrations in total) caused RabbitMQ connection failures. Because of that, not all live migrations were successful.

Connection failures during live migration resulted in "ghost" virtual machines: the virtual machine was actually running on a different compute node than the one reported by nova. This caused a resource usage mismatch between reality and nova, and the virtual machine could not be live migrated to the correct compute node because its disk already existed on the target node.
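For reference, the number of outgoing live migrations a single compute node runs in parallel is capped by a nova option. The value below mirrors the three-per-host scenario described above; treat it as an illustration, not necessarily the exact configuration used in these tests.

    [DEFAULT]
    # Maximum number of concurrent outgoing live migrations per compute node
    # (0 means unlimited; nova's default is 1).
    max_concurrent_live_migrations = 3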

Software and workloads used in performance tests might have been optimized for performance only on Intel® architecture.

Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors might cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of each product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
