Trying to find useful things to do with emerging technologies in open education and data journalism

Thoroughly Confused About Student VMs & Docker

The story so far… We’re looking at using a virtual machine (VM) preconfigured with all sorts of software and services for a distance education course. The VM runs in headless mode (no graphical desktop) and exposes the applications we want students to be able to run as services accessed through a web browser. The VM is built from a vagrant script using puppet, in order to support maintenance and as a demonstration of a potentially generic production model. It also means that we should be able to build VMs for different VM runners (Virtualbox, VMWare etc), and to generate machine images for cloud hosted VMs and get them up and running. The activities to be run inside the VM include a demonstration of a distributed MongoDB network into which network partitions are introduced. The separate database instances run inside individual docker containers and firewall rules are used to introduce network partitions between them.

The set-up looks something like this – green blocks are services running in the VM, orange blocks are containers:

The containers are started up from within the IPython notebook using a python wrapper to docker.io.
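As a rough illustration of the firewall-based partitioning mentioned above, here is a minimal sketch that generates iptables commands to cut two groups of containers off from each other. It is not the actual course code: the function name partition_rules and the example IP addresses are made up, and it assumes that traffic between containers passes through the FORWARD chain, which depends on how the docker bridge is configured on a given host.

```python
from itertools import product


def partition_rules(group_a, group_b, action="-I"):
    """Return iptables argv lists that drop traffic between two groups of IPs.

    action is "-I" to insert the DROP rules (create the partition) or
    "-D" to delete them again (heal the partition).
    """
    if action not in ("-I", "-D"):
        raise ValueError("action must be '-I' or '-D'")
    rules = []
    for a, b in product(group_a, group_b):
        # Block both directions so neither side can reach the other.
        rules.append(["iptables", action, "FORWARD", "-s", a, "-d", b, "-j", "DROP"])
        rules.append(["iptables", action, "FORWARD", "-s", b, "-d", a, "-j", "DROP"])
    return rules


if __name__ == "__main__":
    # Hypothetical container addresses on the default docker bridge.
    for argv in partition_rules(["172.17.0.2"], ["172.17.0.3", "172.17.0.4"]):
        print(" ".join(argv))
```

The function only builds the commands; running them (for example via subprocess with root privileges inside the VM) is left out so that the sketch can be inspected safely from a notebook.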

One problem with this approach is that we have two MongoDB downloads – one in the VM and one for use in containers. This makes me think that it might make more sense to run all the applications in containers of their own. For example, the notebook server in a container of its own, the original MongoDB instance in a container of its own, PostgreSQL in a container of its own, and finally a data container or other area of the VM that can be used to persist data within the VM such that it can be accessed by one or more of the other services as and when required.

I’m not sure what the best strategy would be for persisting state, for example as used by the database services? If we mount a database’s datastore in a volume within the VM, we can destroy the container that is used to run a database service and our database datastore will be preserved. If the mounted volume is located in the VM/host share area, then the data will persist in a volume on the host even if the VM itself is destroyed. This is perhaps a bit scrappy because it means that we might nominally take up a large amount of space “on the host”, compared with the situation of providing students with a pre-populated database in which the data volume was mounted “inside” the VM proper (i.e. away from the share area). In such a case, it would be nice if any data tables that students built were mounted into the host shared area (so the students can clearly “retain ownership” of those tables), but I suspect that DBMS don’t like putting different data tables or databases into different volumes..?
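To make the volume idea a little more concrete, here is a minimal sketch of a helper that builds a docker run command with the datastore mounted from a directory outside the container. The helper name docker_run_args, the container name and the paths are illustrative assumptions: /vagrant is the default Vagrant share and /data/db is where the stock mongo image keeps its data, but whether a given DBMS is happy writing to a shared folder is something that would need testing.

```python
def docker_run_args(name, image, host_dir, container_dir, port=None):
    """Build a `docker run` argv that keeps a service's data outside the container."""
    args = ["docker", "run", "-d", "--name", name,
            # -v host_dir:container_dir mounts the host directory as a volume,
            # so the data survives if the container is removed.
            "-v", "{}:{}".format(host_dir, container_dir)]
    if port is not None:
        # Publish the service on the same port number on the VM.
        args += ["-p", "{0}:{0}".format(port)]
    args.append(image)
    return args


if __name__ == "__main__":
    # Hypothetical example: MongoDB data kept in the VM/host share area.
    print(" ".join(docker_run_args(
        "mongo-main", "mongo", "/vagrant/data/mongo", "/data/db", port=27017)))
```

Pointing host_dir at a directory inside the VM proper rather than the share area would give the "pre-populated database" variant described above, with the same command shape.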

This approach to using docker is perhaps at odds with the typical way of thinking about how we might make use of it. It is very much in the style of a VM acting as an app runner for a host, and docker containers being used to run the individual apps.

One problem with this approach is that we need a control panel that:

is always running;

is exposed via an http/HTML service that can be accessed on the host machine;

allows students to bring up and shut down containers/services/“apps” as required.

The container approach is nice because if something breaks with one of the databases, for example, it should be easy for a student to switch it off and switch it back on again… That said, if we are popping up and ripping down containers, we need to work out a connection manager so that eg IPython knows where to find a particular database (I think that docker starts services up on essentially arbitrary internal IP addresses?). This could possibly be done by naming docker processes sensibly and then creating a little python library to look up the docker processes by name so that we know where to find them?
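A little lookup library of that sort might look something like the following sketch. It assumes the docker command line client is available inside the VM and that containers have been given sensible names; the function names container_address and find_container are made up for illustration. The parsing is kept separate from the call to docker inspect so that it can be tried out on a saved record.

```python
import json
import subprocess


def container_address(inspect_record):
    """Extract the internal IP address from one `docker inspect` record."""
    settings = inspect_record.get("NetworkSettings") or {}
    ip = settings.get("IPAddress")
    if ip:
        return ip
    # Newer docker versions list addresses per network instead.
    for network in (settings.get("Networks") or {}).values():
        if network.get("IPAddress"):
            return network["IPAddress"]
    return None


def find_container(name):
    """Look up a container by name; return its IP address, or None if not found."""
    try:
        out = subprocess.check_output(["docker", "inspect", name])
    except (OSError, subprocess.CalledProcessError):
        # docker is not installed, or there is no container with that name.
        return None
    records = json.loads(out.decode("utf-8"))
    return container_address(records[0]) if records else None
```

A notebook could then call something like find_container("mongo1") just before opening a database connection, rather than hard-coding an address.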

As ever, there are tradeoffs in terms of making this approach easy for us (as production engineers), easy for students, and easy for the helpdesk to support…

I don’t know enough about any of this stuff to know whether it makes sense to replace the VM we currently have, in which services roam around the base VM, with a completely containerised version. The core requirement from the user perspective is that a student should be able to download a base box, fire it up as easily as possible in Virtualbox, and then access a control panel (ideally in a browser) that allows them to start up and shut down applications-in-containers, as well as seeing a clear dashboard view of what services are up and running and what (localhost) ports they can be accessed on through their browser.
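As a very rough sketch of the dashboard part of that requirement, the following turns the output of docker ps into a plain-text status listing. It assumes docker ps is run with the --format option and the template held in PS_FORMAT; the function names are made up, port ranges are not handled, and the ports shown are only reachable from the host browser if the VM also forwards them.

```python
import re

# Template to pass as: docker ps --format "{{.Names}}|{{.Status}}|{{.Ports}}"
PS_FORMAT = "{{.Names}}|{{.Status}}|{{.Ports}}"

# Matches published ports such as "0.0.0.0:8888->8888/tcp".
_PUBLISHED = re.compile(r":(\d+)->(\d+)/(?:tcp|udp)")


def parse_ps(output):
    """Turn `docker ps --format PS_FORMAT` output into a list of dicts."""
    services = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, status, ports = line.split("|", 2)
        host_ports = sorted({int(host) for host, _ in _PUBLISHED.findall(ports)})
        services.append({"name": name, "status": status, "host_ports": host_ports})
    return services


def render_dashboard(services):
    """Render one line per running container: name, status, published ports."""
    lines = []
    for s in services:
        where = ", ".join("localhost:{}".format(p) for p in s["host_ports"])
        lines.append("{:<12} {:<16} {}".format(
            s["name"], s["status"], where or "(not exposed)"))
    return "\n".join(lines)
```

The same list of dicts could just as easily be rendered as an HTML page by whatever always-on service ends up acting as the control panel.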

If you can help talk me through what the issues are or might be, and whether any of the above makes sense or is complete nonsense, I would be most grateful…

11 thoughts on “Thoroughly Confused About Student VMs & Docker”

You’re channeling my current thoughts with this post! We’re bringing Kin Lane in next week and one of my primary objectives to work on with him is what Reclaim Hosting could do to build a Docker infrastructure and program an open source control panel frontend for users so that anyone can fire up “applications” regardless of the complexity of the environment. I see this as the future of hosting and moving beyond the traditional LAMP stack and I want Reclaim Hosting to help pave the way for others (the upside as well is how incredibly portable containers are, which fits right in with our mission of data ownership).

Look at container linking as an example of naming containers and connecting them regardless of internal IP (https://docs.docker.com/userguide/dockerlinks/). Also the API will give you the IP if you need it, and it’s quite possible it’s supported with environment variables to dynamically set configurations as well. As you can see I’m coming at this pretty new too, and so much is changing so fast. The support for distributed infrastructure with swarm, machines, and orchestration recently announced gets me that much more excited about what this could all look like.

@Tim Ooh… sounds interesting… One thing I have been looking at is panamax, though that’s more with a view to how academics, say, might pull together a docker config that wires together various containerised apps that an instructor (or researcher) decides they want in their VM.

What’s confusing me atm are pros/cons of having a “blended” approach where a user has a VM that runs some things as services on the base VM and other things in containers. The expectation is that students (or researchers) would run their own VM, but I guess I also need to be mindful about how things would scale if we offered hosted VMs, or containerised app configs, on behalf of students.

Could you not call Docker’s HTTP API to control containers from the IPython notebooks rather than building a custom dashboard?

Regarding DB data storage, why not store data on the VM, but have a command the user can run which calls ‘mongoexport’ and dumps the Mongo data as JSON onto the user’s computer through a mapped folder?

Finally, container discovery might be solved with one of the new Docker tools announced at Dockercon (can’t remember which one). Alternatively you could use something like Consul (consul.io) to do DNS discovery for you.

@alex I’ve been using docker-py library ( http://docker-py.readthedocs.org/en/latest/api/ ) thus far as the library to manage the containers. HTTP API sounds interesting… I guess I need to give docker a kick to get that service running first? (Will look for the docs…)

The thing about storage is that the data needs to be in a place where it’s safe just in case. Like a backup, it’s no good if you think to run the command just after you accidentally lost everything…

I was thinking it’d be nice to have an embedded iframe showing the realtime status of those machines, and two or three lines for each of them showing their most recent status thoughts (?!;-). Streaming is totally new to me too, so I wondered whether this might prove a simple test case for me to have a play with in terms of getting eg a socket.io thing going to stream status messages from each mongo container into a simple HTML status dashboard for them?

Fellow Google Developer Expert Martin Hawksey suggested I might find this post interesting and indeed it is right up my street. I’m a Systems Engineer who has worked on something conceptually quite similar for a pilot at the University of Portsmouth. In our approach we built VMs on Google Compute Engine with a 1:1 ratio of VMs to students on our first year CS undergrad program. Each student has access to a control panel running in App Engine which allows them to create their own dedicated VM running Debian. The initial use case for the first years is simple MySQL and web development although in theory there is no reason why they couldn’t also use containers as part of their coursework. (Containers are first class citizens in Google Cloud Platform)

I guess the reason we didn’t go down the Virtualbox route is that we felt Compute Engine would be easier to support. All of the support staff and faculty involved have password-less sudo rights to any of the VMs, which are running 24×7 (the students can’t switch them off, but they can reboot them). If a student gets stuck for some reason then it’s trivial for someone to dial in and help them. Where a student has demonstrated competency they can be granted sudo rights on a case by case basis. The entire 300 node cluster runs in a private network that is only accessible via ssh over the internet. HTTP access is only possible via the UoP network.

We are hoping to document and share our experiences via blog and video formats in the coming months and would love to hear how your approach develops once you also have more experience with it.

I’m not sure I understand why you want to control the containers from the notebook server… I guess I would initially start by trying to have everything in its own container, and interlinking them. Only the docker process could be in the VM… but my students working on such a setup haven’t yet let me tinker with such a system enough to have a firm opinion ;-)

@Olivier Right. At the moment I have a hybrid system – PostgreSQL, an ‘everyday’ MongoDB instance and an IPython notebook server run in the VM natively, and for a specific exercise I fire up a set of docker containers running additional MongoDB instances. Docker was introduced as a way of making it easy to demonstrate network partitions between containers (which have their own IP addresses) – the original model was a straightforward VM.

Now I am wondering what the advantages might be of making further use of the containers, given possible deployment models: student runs own VMC on desktop/laptop machine; student runs own VMC on the cloud; OU hosts VMCs for all students. In each case, what are the pros/cons of VMC in each of the config models i-iii described above?

In theory, I could imagine that everything in containers may mean that it would be easier to have mixed setups : local containers inside VM (or native, on Linux base system) + containers on the Net, etc.

But practice may be more complex ;-) I still have to do a lot of experiments with this (rapidly changing) technology before I can provide more useful comments.

Yes; if everything is containerised and rules are in place to wire different containers together (and firewall them) we could have configurations for several students in the same VM, which would be good when offering hosted services.

On the desktop, an issue is making sure state is persisted, ideally to a shared area with host. I guess we could keep the container model going by having a state/data container, and then sharing that with host in the desktop case.

I’m not sure what the trade-offs are; I’m also trying to be mindful of OU production and delivery models, as well as the idea that eg different OU courses taken by the same student may all require VM run apps. In which case, containerising everything would probably make sense, with different configs for each course, and the possibility that different courses might make use of similarly defined containers, if not configurations of multiple containers. eg one course may use IPython notebook, another may use IPython notebook+ postgresql, etc.