Please note: This schedule is for OpenStack Active Technical Contributors participating in the Icehouse Design Summit sessions in Hong Kong. These are working sessions to determine the roadmap of the Icehouse release and make decisions across the project. To see the full OpenStack Summit schedule, including presentations, panels and workshops, go to http://openstacksummitnovember2013.sched.org.

The REST and RPC APIs should expose a unified means of managing hardware, regardless of which driver is used, and regardless of whether that driver acts on nodes via the out-of-band (OOB) management interface or via local operations in a custom ramdisk.

We will discuss the API changes needed to expose various hardware management operations, and the creation of a reference implementation for them based on a bootable ramdisk.

* How do we expose these functions through the REST API?
* Who are the consumers of these functions? Cloud admins? Nova? Heat?
* Do we need an agent in the ramdisk with its own API?
* Do we need distinct ramdisks for different operations (eg, update-firmware, build-raid, erase-disks, etc)?
* How do we get logs back from the ramdisk, and how do we report errors back to the user?
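To make the first question concrete, here is a minimal sketch of what a vendor-neutral hardware-management request might look like over the REST API. The endpoint path, the `management` resource, and the payload fields are assumptions for discussion, not part of Ironic's existing API.

```python
# Hypothetical sketch: a vendor-neutral hardware-management request as it
# might appear in Ironic's REST API. The endpoint path, operation names,
# and payload fields are assumptions, not Ironic's defined API.

import json


def build_management_request(node_uuid, operation, params=None):
    """Build the (hypothetical) body for PUT /nodes/{uuid}/management."""
    allowed = {"update-firmware", "build-raid", "erase-disks"}
    if operation not in allowed:
        raise ValueError("unsupported operation: %s" % operation)
    return {
        "target": "/nodes/%s/management" % node_uuid,
        "body": json.dumps({"operation": operation, "params": params or {}}),
    }


req = build_management_request("1be26c0b", "erase-disks", {"cycles": 3})
```

The same request shape would be produced whether the driver carries it out over the OOB management interface or hands it to an agent in a ramdisk.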

This session will include the following subject(s):

Data protection for bare metal nodes:

The problem
-----------

There are several cases in which private data might be accessible to third-party users:
- A new user of a node provisions it with a smaller partition size than the previous one, so some data might still remain on the volume.
- The previous user exceeded the size of physical memory, so some data might remain on the swap partition.

Possible solutions
------------------

- Build a special undeploy image and use it for either:
  - securely erasing the volume on the node side, or
  - exporting the volume to the manager and performing the erase on the manager side.
- Create a separate boot configuration on the node that loads a kernel and a ramdisk with undeploy scripts in it.

Food For Thought
----------------

- Should wiping be part of deploying or undeploying?
- Should we wipe all nodes, or wipe them on demand?
  - Wiping all nodes might not be required for everyone.
  - Securely wiping a node requires a lot of time.
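The node-side erase option above can be sketched as follows: overwrite the whole device with zeros for some number of passes. This is illustrative only; a real implementation would likely prefer hardware-level ATA Secure Erase where available, and would query the block device's size rather than a file's.

```python
# Minimal sketch of "securely erasing the volume on the node side":
# overwrite every byte of a device with zeros, `passes` times.
# The path here is a placeholder; a real undeploy ramdisk would target
# a block device and would prefer ATA Secure Erase when supported.

import os


def wipe_device(path, passes=1, chunk_size=1024 * 1024):
    """Overwrite every byte of `path` with zeros, `passes` times."""
    size = os.path.getsize(path)  # for a real device, query its size instead
    zeros = b"\x00" * chunk_size
    for _ in range(passes):
        with open(path, "r+b") as dev:
            written = 0
            while written < size:
                n = min(chunk_size, size - written)
                dev.write(zeros[:n])
                written += n
            dev.flush()
            os.fsync(dev.fileno())
    return size
```

The pass count matters for the "securely wiping a node requires a lot of time" point: each pass costs a full sequential write of the disk.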

Related bug report: https://bugs.launchpad.net/ironic/+bug/1174153

(Session proposed by Roman Prykhodchenko)

Communicating with the nodes:

It would be nice to formalize what a node under Ironic's control communicates to Ironic, and how. Currently the node's only communication is a signal to start the deploy and a return signal that the deploy has completed. Ironic should support a dynamic conversation between itself and the nodes it is controlling.

Ironic will need to support several new areas of communication:
* All node actions
  * will need to send basic logging back to Ironic
  * should be interruptible
* Deployment (if done by an agent on the node)
  * nodes will need a way to communicate with Ironic to get the image to be deployed
  * Ironic will need to communicate RAID setup and disk partition information to nodes
* Hardware & firmware discovery
  * nodes will need a way to send information about their hardware and current firmware revisions to Ironic
  * do nodes need to be able to (re)discover replaced hardware, such as a NIC?
* Firmware update
  * Ironic will need a way to push firmware updates to a node
* Secure erase
  * nodes will need to communicate progress back to Ironic
  * Ironic will need to communicate which devices to erase and how many cycles to run
* Burn-in
  * nodes performing a "burn-in" will need to communicate any failures back to Ironic
  * Ironic will need a way to specify which burn-in tests to run
  * Ironic will need to specify how long / how many tests to run
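Most of the areas above share a common shape: a long-running action on the node periodically reporting progress, logs, and errors back to Ironic. A minimal sketch of such a status message follows; the field names are assumptions for discussion, not a defined protocol.

```python
# Sketch of the kind of status message a node-side agent might send back
# to Ironic during a long-running action (deploy, secure erase, burn-in).
# The field names here are assumptions, not an agreed-upon protocol.

import json
import time


def make_status_message(node_uuid, action, percent_done, log_lines=(), error=None):
    """Build one progress report from the node's agent to Ironic."""
    return json.dumps({
        "node": node_uuid,
        "action": action,          # eg. "secure-erase", "burn-in"
        "timestamp": time.time(),
        "percent_done": percent_done,
        "log": list(log_lines),    # basic logging back to Ironic
        "error": error,            # set if the action failed or was interrupted
    })
```

A single message format like this would let Ironic treat progress reporting, logging, and error reporting uniformly across deploy, erase, and burn-in actions.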

-----------------

Open questions:
* Some of the operations above (eg, discovery, RAID setup, firmware) may be performed via multiple vectors (eg, IPMI).
* Some may be best served by borrowing from other OpenStack services (eg, Cinder for RAID volume specification).
* Not all deployers will want all of these features, and some may use vendor extensions to accelerate specific features (eg, firmware mgmt). How do we support this mixture?

2:40pm

Users and system administrators commonly need to update firmware on bare metal servers, but different vendors may have different processes for doing so. The summit is a good opportunity for several vendors to get together and discuss how Ironic might provide a common framework to implement this (which may involve the Diskimage-builder project, too).

3:30pm

The Ironic service must be able to tolerate individual components failing. Large deployments will need redundant API and Conductor instances, and a deployment fabric with no single point of failure (SPoF). Ironic's current resource locking uses the database for lock coordination between multiple Conductors, but only a single Conductor manages a given deployment. There are several things we need to do to improve Ironic's fault tolerance.

Let's get together and plan development of the ways in which we can:
* recover the PXE / TFTP environment for a managed node when the conductor that deployed it goes away;
* set reasonable timeouts on task_manager mutexes;
* break a task_manager's mutex if the lock holder is non-responsive or dies;
* distribute deployment workload intelligently among many conductors;
* route RPC requests to the Conductor that has already locked a node;
* route RPC requests appropriately when multiple drivers are used by different Conductor instances.

4:30pm

Let's talk about how Ironic can manage local ephemeral, local persistent, and network-attached storage in a general way, and then make a plan to implement it!

Some things to consider:
* TripleO requires local volumes that persist across re-imaging of the machine, eg. "nova rebuild";
* users may require secure-erase of all local volumes on instance deletion;
* flavor "root_gb" may be much less than the actual storage; use of the additional space should be enabled via the Cinder and Ironic APIs;
* users may request that a different RAID topology be applied to the same node;
* some hardware can mount a network volume and present it as a local disk.
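The "root_gb much less than actual storage" point can be made concrete with a small sketch: carve the flavor's volumes out of the physical disk and track the leftover space separately. The notion of an "extra" partition exposed later via the Cinder and Ironic APIs is an assumption for discussion.

```python
# Sketch of splitting a node's local disk into the flavor's volumes plus
# leftover space. The "extra" bucket (space the flavor doesn't claim,
# potentially exposed via Cinder/Ironic APIs) is an illustrative assumption.

def plan_partitions(disk_gb, root_gb, swap_gb=0, ephemeral_gb=0):
    """Return a simple partition plan; leftover space goes to 'extra'."""
    used = root_gb + swap_gb + ephemeral_gb
    if used > disk_gb:
        raise ValueError("flavor exceeds physical disk size")
    plan = {"root": root_gb, "swap": swap_gb, "ephemeral": ephemeral_gb}
    plan["extra"] = disk_gb - used  # could be offered via Cinder/Ironic later
    return plan
```

For example, a 1000 GB disk with a 40 GB root, 8 GB swap, and 100 GB ephemeral flavor leaves 852 GB unclaimed, which is exactly the space the Cinder/Ironic integration would need to manage.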

5:20pm

Let's discuss what remains to be done before we can tag our first RC and tell everyone to migrate away from nova-baremetal. Since this will be our last session for the day, we should also list the various tasks that we've postponed, and discuss anything that isn't clear.

I would also like to specifically invite vendors and anyone deploying nova-baremetal to this session. Their feedback will be invaluable in helping the project evolve to meet the needs of the community.

Expect a lively session and many action items getting assigned to NobodyCam before we're done!