Part 3: Keeping the devices running

So in Parts 1 and 2, we saw how the Buildbot tegra and panda masters assign jobs to Buildbot slaves, how those slaves run on foopies, and how the foopies connect to the SUT Agent on the devices to deploy and perform the tests, and pull back results.

However, these devices can fail over time. How do we make sure they are running OK, and handle the case where they go AWOL?

The answer has two parts:

watch_devices.sh

mozpool

What is watch_devices.sh?

You remember that in Part 2, we said you need to create a directory under /builds on the foopy for any device that foopy should be taking care of.

This script looks for device directories under /builds to see which devices are associated with this foopy. For each of these, it checks that there is a buildbot slave running for that device. It automatically starts buildbot slaves as necessary, if they are not running, but it also checks the health of the device, using the verification tools of SUT tools (discussed in Part 2). If it finds a problem with a device, it shuts down the buildbot slave, so that the slave does not take new jobs. In short, it keeps the state of the buildbot slave consistent with what it believes the availability of the device to be: if the device is faulty, it brings down the buildbot slave for that device; if the device is healthy and passes the verification tests, it starts up the buildbot slave if it is not already running.

Therefore, if you need to disable a device, you mark it as disabled in slavealloc, and watch_devices.sh, running from a crontab on the foopy, will bring down the buildbot slave of the device.
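The core decision logic of watch_devices.sh can be sketched in Python roughly as follows. The function name and the exact flag-file names are assumptions based on the flags mentioned in this series (error.flg for faulty devices, a disable flag for slavealloc-disabled ones), not the actual script:

```python
import os

def decide_action(device_dir, slave_running, device_healthy):
    """Decide what the watcher should do for one device directory.

    Returns "start", "stop", or "noop". Flag-file names are assumptions
    mirroring those described in the text, not necessarily the real ones.
    """
    faulty = os.path.exists(os.path.join(device_dir, "error.flg"))
    disabled = os.path.exists(os.path.join(device_dir, "disabled.flg"))
    if faulty or disabled or not device_healthy:
        # Device should not take jobs: kill its slave if one is running.
        return "stop" if slave_running else "noop"
    # Healthy, enabled device: make sure a buildbot slave is running.
    return "start" if not slave_running else "noop"
```

The point is the invariant described above: the slave's state always tracks the believed availability of its device.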

Where are the log files of watch_devices.sh?

They are on the foopy:

/builds/watcher.log (global)

/builds/<device>/watcher.log (per device)

If during a buildbot test we determine that a device is not behaving properly, how do we pull it out of use?

If a serious problem is found with a device during a buildbot job, the job will create an error.flg file under the device directory on the foopy. This signals to watch_devices.sh that, once the job has completed, it should kill the buildbot slave, since the device is faulty, and it should not respawn a buildbot slave while that error.flg file remains. Once per hour, watch_devices.sh deletes the error.flg file, to force another verification test of the device.
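The hourly expiry of error.flg can be sketched like this. The helper name is hypothetical; the one-hour threshold is the behaviour described above:

```python
import os
import time

ERROR_FLAG_MAX_AGE = 60 * 60  # one hour, per the behaviour described above

def should_clear_error_flag(flag_path, now=None):
    """Return True if error.flg is old enough to delete, forcing a fresh
    verification run of the device on the watcher's next pass."""
    if not os.path.exists(flag_path):
        return False
    now = time.time() if now is None else now
    return now - os.path.getmtime(flag_path) >= ERROR_FLAG_MAX_AGE
```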

But wait, I heard that mozpool verifies devices and keeps them alive?

Yes and no. Mozpool is a tool (written by Dustin) to take care of the life-cycle management of panda boards. It does not manage tegras. Remember: tegras cannot be automatically reimaged – you need fingers to press buttons on the devices, and physically connect a laptop to them. Pandas can. This is why mozpool only takes care of pandas.

Mozpool is the highest-level interface, where users request a device in a certain condition, and Mozpool finds a suitable device.

Lifeguard is the middle level. It manages the state of devices, and knows how to cajole and coddle them to achieve reliable behavior.

Black Mobile Magic is the lowest level. It deals with devices directly, including controlling their power and PXE booting them. Be careful using this level!

So the principle behind mozpool is that all the logic around getting a panda board (making sure it is clean, ready to use, and running the OS image you want) can be handled outside of the buildbot jobs. You simply query mozpool, tell it you’d like a device, specify the operating system image you want, and it will get you one.

In the background it is monitoring the devices and checking they are ok, only handing you a “good” device, and cleaning up when you finish with it.
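The "request a device" interaction can be sketched as below. Note that the host name, endpoint path, and field names here are illustrative assumptions, not the real Mozpool API:

```python
import json

MOZPOOL_URL = "http://mozpool.example.com/api"  # hypothetical host

def build_device_request(image, requester, duration_secs=3600):
    """Build the URL and JSON body for a 'give me any healthy device
    running this image' request. Endpoint and field names are
    illustrative, not the actual Mozpool API."""
    body = {
        "requester": requester,
        "image": image,
        "duration": duration_secs,
        "assignee": "any",  # let Mozpool pick a healthy device for us
    }
    return MOZPOOL_URL + "/request/", json.dumps(body)
```

A caller would POST the body to the URL and poll until Mozpool hands back a ready device.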

So watch_devices and mozpool are both routinely running verification tests against the pandas?

No. This used to be the case, but now the verification test in watch_devices.sh for pandas simply queries mozpool for the status of the device. It no longer runs verification tests directly against the panda, to avoid having two systems doing the same job. It trusts mozpool to report the correct state.

So if I dynamically get a device from mozpool when I ask for one, does that mean my buildbot slave might get different devices at different times, depending on which devices are currently available and working at the time of the request?

No. Since the name of the buildbot slave is the same as the name of the device, the buildbot slave is bound to that one device only. This means it cannot take advantage of the “give me a panda with this image, I don’t care which one” model.

Summary part 3

So we’ve learned:

there is a cron job running on the foopies, that looks for the device directories under /builds, and spawns/kills buildbot slaves as appropriate, so that the state of the buildbot slave matches the availability of the device

mozpool is a tool for automatically reimaging pandas

not all features of mozpool are available due to our buildbot setup (such as being able to get an arbitrary panda dynamically at runtime for a given buildbot slave)

Part 2: The foopy, Buildbot slaves, and SUT tools

So how does buildbot interact with a device, to perform testing?

By design, Buildbot masters require a Buildbot slave to perform any job. For example, to create Windows builds, we would run a Buildbot slave on a Windows machine; the Buildbot master would assign it tasks, which it would perform, feeding results back to the master.

In the mobile device world, this is a problem:

Running a slave process on the device would consume precious limited resources

Buildbot does not run on phones, or mobile boards

Thus was born …. the foopy.

What the hell is a foopy?

A foopy is a machine, running CentOS 6.2, that is devoted to the task of interfacing with pandas or tegras, and running buildbot slaves on their behalf.

My first mistake was thinking that a “foopy” is a special piece of hardware. This is not the case. It is nothing more than a regular CentOS 6.2 machine – just a regular server, with no special physical connection to the mobile device boards. It is simply a machine that has been set aside for this purpose, and that has network access to the devices, just like other machines in the same network.

For each device that a foopy is responsible for, it runs a dedicated buildbot slave. Typically each foopy serves between 10 and 15 devices. That means it will have around 10-15 buildbot slaves running on it, in parallel (assuming all devices are running ok).

When a Buildbot master assigns a job to a Buildbot slave running on the foopy, the slave runs the job on the foopy, but parts of the job involve communicating with the device: pushing binaries onto it, running tests, and gathering results. As far as the Buildbot master is concerned, the slave is the foopy, and the foopy is doing all the work; it doesn’t need to know that the foopy is executing code on a tegra or panda. As far as the device is concerned, it is receiving tasks over the SUT Agent listener network interface, and performing those tasks.

So does the foopy always connect to the same devices?

Yes. Each foopy has a static list of devices for it to manage jobs for.

How do you see which devices a foopy manages?

If you ssh onto the foopy, you will see the devices it manages as subdirectories under /builds.

How did those directories get created?

Manually. Each directory contains artefacts related to that panda or tegra, such as log files for verify checks, error flags if it is broken, and disable flags if it has been disabled. More about this later. Just know at this point that if you want a foopy to look after a device, you must create a directory for it.

So the directory existence on the foopy is useful to know which devices the foopy is responsible for, but how do you know which foopy manages an arbitrary device, without logging on to all foopies?
The foopy -> device mappings are listed in the devices.json file, so you can look up a device there to find which foopy manages it.

So what if devices.json lists different foopy -> device mappings than the foopy filesystems do? Isn’t there a danger this data gets out of sync?

Yes, there is nothing checking that these two data sources are equivalent. For example, if /builds/tegra-0123 was created on foopy39, but devices.json said tegra-0123 was assigned to foopy65, nothing would report this difference, and we would have non-deterministic behaviour.
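Such a consistency check would be straightforward to write. The sketch below is illustrative only (no such check exists in production, as the text says):

```python
def find_mismatches(devices_json, foopy, dirs_on_foopy):
    """Compare the devices.json mapping against the device directories
    actually present under /builds on one foopy.

    devices_json: {device_name: {"foopy": foopy_name, ...}, ...}
    Returns (missing_dirs, unexpected_dirs), i.e. devices assigned to
    this foopy with no directory, and directories with no assignment.
    """
    assigned = {name for name, info in devices_json.items()
                if info.get("foopy") == foopy}
    present = set(dirs_on_foopy)
    return sorted(assigned - present), sorted(present - assigned)
```

Run against the example above (tegra-0123 assigned to foopy65 in devices.json, but with a directory on foopy39), it would flag the stray directory on foopy39 and the missing one on foopy65.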

Why is the foopy data not in slavealloc?

Currently the fields for the slaves are static across different slave types – so if we added a field for “foopy” for the foopies, it would also appear for all other slave types, which don’t have a foopy association.

What is that funny other data in the devices.json file?

The “pdu” and “pduid” are the coordinates required to determine the physical power supply of the tegra. These are the values that you call the PDU API with to enable/disable power for that particular tegra.

The “relayhost” and “relayid” are the equivalent values for the panda power supplies.
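Looking up the power-control coordinates for a device can be sketched as below. The field names match those described above, but the sample entries (host names, IDs) are made up for illustration:

```python
import json

# Illustrative snippet in the shape described above; the real
# devices.json is maintained separately and these values are made up.
SAMPLE = json.loads("""
{
  "tegra-0123": {"foopy": "foopy39",
                 "pdu": "pdu1.example.mozilla.com", "pduid": ".AB7"},
  "panda-0456": {"foopy": "foopy65",
                 "relayhost": "relay-001.example.mozilla.com", "relayid": "2:3"}
}
""")

def power_coordinates(device, devices=SAMPLE):
    """Return the (host, outlet_id) pair needed to toggle power for a
    device, whether it sits behind a PDU (tegras) or a relay board
    (pandas)."""
    info = devices[device]
    if "pdu" in info:
        return info["pdu"], info["pduid"]
    return info["relayhost"], info["relayid"]
```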

Where does this data come from?

This data is maintained in IT’s inventory database. We duplicate this information in this file.

Is there any sync process between inventory and devices.json to guarantee integrity of the relayboard and PDU data?

No. We do not sync the data, so there is a risk our data can get out-of-sync. This could be solved by having an auto-sync to the devices.json file, or using inventory as the data source, rather than the devices.json file.

So how do we interface with the PDUs / relay boards to hard reboot devices?
By calling the PDU API (for tegras) or the relay boards (for pandas), using the coordinates stored in devices.json, to cut and restore power to the device.

Please note: nowadays, Fennec is only available for Android 2.2+. It is not available for iOS (iPhone, iPad, iPod Touch), Windows Phone, Windows RT, Bada, Symbian, BlackBerry OS, webOS, or other mobile operating systems.

Therefore, the original reason for creating a standard interface to all devices (the SUT Agent) no longer exists. It would also be possible to use a different mechanism (telnet, ssh, adb, …) to communicate with the device. However, this is not what we do.

So what is the SUT Agent, and what can it do?

The SUT Agent is a listener running on the tegra or panda, that can receive calls over its network interface, to tell it to perform tasks. You can think of it as something like an ssh daemon, in the sense that you can connect to it from a different machine, and issue commands.

How do you connect to it?

You simply telnet to the tegra or panda, on port 20700 or 20701.

Why two ports? Are they different?

Only marginally. The original idea was that users would connect on port 20701, and that automated systems would connect on port 20700. For this reason, if you connect on port 20700, you don’t get a prompt. If you connect on port 20701, you do. However, everything else is the same. You can issue commands to both listeners.
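A minimal client for this can be sketched as below. The port numbers and behaviour come from the text; the helper names are hypothetical, and real clients handle prompts, multi-line responses, and reconnects:

```python
import socket

SUT_AUTOMATION_PORT = 20700    # no prompt: intended for automated systems
SUT_INTERACTIVE_PORT = 20701   # prompt: intended for human users

def sut_port(interactive):
    """Pick the SUT Agent port appropriate to the kind of session."""
    return SUT_INTERACTIVE_PORT if interactive else SUT_AUTOMATION_PORT

def send_command(host, command, interactive=False, timeout=30):
    """Open a TCP connection to the SUT Agent and send one command,
    returning the first chunk of its response. A deliberately minimal
    sketch: it does not parse prompts or stream long responses."""
    with socket.create_connection((host, sut_port(interactive)),
                                  timeout) as s:
        s.sendall(command.encode("ascii") + b"\n")
        return s.recv(4096).decode("ascii", "replace")
```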

What commands does it support?

The most important command is “help”, which displays a list of all the available commands.

Typically we use the SUT Agent to query the device, push Fennec and tests onto it, run tests, perform file system commands, execute system calls, and retrieve results and data from the device.

What is the difference between quit and exit commands?

I’m glad you asked. “quit” will terminate the session. “exit” will shut down the SUT Agent itself. You really don’t want to do that, so be very careful.

Is the SUT Agent a daemon? If it dies, will it respawn?

No, it isn’t, but yes, it will!

The SUT Agent can die, and sometimes does. However, it has a daddy, who watches over it. The Watcher is a daemon, also running on the pandas and tegras, that monitors the SUT Agent. If the SUT Agent dies, the Watcher will spawn a new SUT Agent.

Probably it would be possible to have the SUT Agent as an auto-respawning daemon – I’m not sure why it isn’t this way.
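The Watcher's job boils down to a respawn loop like the one below. This is an illustrative sketch, not the actual Watcher code (which runs on the device itself):

```python
import subprocess
import time

def keep_alive(cmd, max_respawns):
    """Respawn cmd whenever it exits, up to max_respawns times.
    The real Watcher does this indefinitely for the SUT Agent."""
    respawns = 0
    proc = subprocess.Popen(cmd)
    while respawns < max_respawns:
        if proc.poll() is not None:  # the child has died
            proc = subprocess.Popen(cmd)
            respawns += 1
        time.sleep(0.01)
    proc.wait()
    return respawns
```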

Do the Watcher and SUT Agent get automatically deployed when there are new changes?

No. If there are changes, they need to be manually built (no continuous integration) and manually deployed to all tegras, and a new image needs to be created for pandas in mozpool (will be explained later).

Fortunately, there are very rarely changes to either component.

Summary part 1

So we’ve learned:

Tegras and Pandas are used for testing Fennec for Android

They run different versions of the Android OS (2.2 vs 4.0)

We don’t build anything on them

Tegras are older/inferior/less reliable than pandas

We can’t reimage tegras programmatically, but pandas we can

There is a SUT Agent that runs on both the tegras and the pandas, and provides a mechanism to interact with it

There is a Watcher that keeps the SUT Agent alive

Whenever a new version of SUT Agent or Watcher is required, this needs to be manually built and rolled out to devices