Synopsis

Project Petri is a 1-year experiment in providing a Platform-as-a-service (PaaS) and/or Infrastructure-as-a-Service (IaaS) offering to make it easy for people with new ideas for web apps to try them quickly, with minimal impact on IT/Ops, and with an on-ramp to making webapps that are better/safer/faster/more maintainable if the ideas they're testing prove to be useful.

Poetry

Ideas expressed in software have a life that is not very different from Hobbes' description of life: solitary, poor, nasty, brutish and short.

Most ideas die young, barely committed to version control, never seen by anyone other than a select few. Only a few make it to any decent scale. Figuring out how to identify the promising ones in a primordial soup of random thoughts is very hard. In particular, for ideas to grow up, one has to use the same tricks that biology employs: create lots of tiny things, put them in a safe place, hope that some survive, and then help the strong ones grow up. This is particularly tricky in an environment like Mozilla, where it's not unreasonable to expect that some of these ideas will reach hundreds of millions of users. That's 7 orders of magnitude.

When it comes to software products, it's remarkably hard to know what a "safe place" is, especially as many of the things required for later stages of life are deadly when applied too young. A young idea may need to be tested in front of a small audience, be assessed, and change, daily. And do so for weeks until it gets "good enough" to get to the next stage. "Safe," in this stage of its life, means being able to make changes and deploy immediately. As a result, too much coordination between groups is a young-idea-killer. Once an idea has made it past the pond, and wants to explore the river beyond, then it needs to grow some defenses, grow a harder shell. Once in that new stage, change gets harder, and rapid evolution is no longer what is best for the creature, or for the ecosystem.

We need an approach to product-type web apps that makes it easy for "creative types" to quickly spin up experimental products, offer them to selected, fully informed early adopters, and limit the risk (to Mozilla and to users) of such crazy ideas. We also need a process to define when an idea has demonstrated enough value that it needs to get "harder" and go out into the broader world.

This project is about the software provisioning part of that plan.

Goals

Goals of the project:

To make it possible for creative devs who want to explore an idea that requires a publicly addressable web service to be able to go full-speed ahead without requiring too much coordination.

To reduce the load that labs-type folks place on IT.

To be more efficient about allocation of hardware resources for experiments.

To be able to understand what experimental services we're running.

To set up incentives which naturally lead developers to build on a smaller set of stacks and build a shared knowledge base.

Non-goals of the project:

To provide a deployment platform for anything other than small-scale, experimental work.

To make it easy for developers to get around security, infrasec, privacy, or other reviews.

To build a competitor to any commercial or OSS PaaS/IaaS system.

To provide a public general-purpose PaaS/IaaS.

Definitions:

A PaaS is a Platform-as-a-Service. It is a software system running on computers (virtualized or not) which allocates services and deploys programs on demand, using a self-service model. In contrast, an IaaS is an Infrastructure-as-a-Service, where virtual machines (full computer abstractions) are provisioned.

High level proposal

In an initial phase, we'll deploy a limited-scope PaaS on Mozilla-owned hardware, and if demand warrants, an IaaS for those users who need full VMs.

Background: Not all, but many webapps can be, and are, built as a composition of a few well-understood stacks. In Mozilla in particular, a lot of projects consist of either Python (using WSGI/Django/WebOb/Tornado) or Node.js, using either SQL databases (MySQL/Postgres) or NoSQL DBs (MongoDB, Redis, CouchDB, etc.).

The hypothesis is that we should:

Quickly evaluate the various PaaS & IaaS systems available today, within the context of an experimental "farm"

Procure enough hardware that it should support forecasted needs for one year of operation

Define roles and responsibilities between various groups: IT/Ops, Platform drivers, and Platform users

Offer Petri as a service to paid Mozilla developers ASAP (in particular, expect uptake from labs, research, webdev, engineering), and explore opening the service to trusted volunteers in a later timeframe.

After 1 year, evaluate whether to evolve, grow, change, or wind down the offering.

Driver

David Ascher (dascher@ mozilla dot com)

Approvers

Jim Cook, mrz, Todd Simpson, Mark Mayo

Contributors

Gozer, MCoates, others TBD

Informed

everyone who cares, in particular: directors, labs, research, engineering, webdev, people-who-sign-up, + public posts along the way

Status

Ongoing (just starting, actually)

Draft Schedule

-Dec 12

Proposal Discussion / Editing / Establish Schedule

Dec 5

Survey of Mozilla developers to understand current behaviors and which platforms/IaaS systems they'd be able to use

Identifying and training Petri Ops: folks who are responsible for maintenance / monitoring / management of the platform.

Ordering HW / racking / installation / netops / etc.

Configuration, testing, initial documentation

Launch to early adopter devs (e.g. lloyd, atul, etc.)

Launch to broader dev population

Evaluation Process

This will move to a separate set of wiki pages, but what's known so far:

Known candidate PaaS providers to evaluate:

OpenShift from Red Hat (mmayo has contacts)

CloudFoundry from VMware (mmayo has contacts)

Heroku from Salesforce (mmayo has contacts)

Stackato from ActiveState (davida has contacts)

Nodejitsu (mmayo has contacts)

Others?

Candidate IaaS systems to evaluate:

Eucalyptus (note: CEO is a friend of Mozilla)

OpenStack

OpenNebula

Nimbula

Ganeti

Others?

Proposed evaluation criteria (none have veto power)

Open source or not, what license, what contribution model

Can we run it ourselves inside the firewall?

Are there commercial service providers? Could we pay someone else to run it?

Current capabilities (e.g. supports Django or not, etc)

Known roadmaps/plans (e.g. "doesn't do HBase today, but planned for 2013")

Developer ergonomics: how good is the fit with what our developers need?

IT/Ops ergonomics: how good is the fit with what IT/Ops need?

Cost

Strategic possibilities (could we influence the PaaS provider to make things better for people outside of Mozilla?)

Expected Evolution

While we'll start the system with stock installations of a PaaS, the following changes are expected to happen:

Account provisioning: We need to manage who has access to launch an app/service on these systems. All PaaS systems have account management systems, but we'll likely want to do some integration with BrowserID or LDAP (or both).

One of the reasons to host this platform on our systems is to have maximal control over user data that might be stored. We will build processes, procedures, and IT systems to make it easy for developers to "do the right thing" when it comes to user data, and make it hard to do the wrong thing. For example, we should consider integrating Sauropod as a service to the platform when appropriate.

Similarly, we want to make it easier for developers to do the right thing when it comes to: infrastructure security, being a good net citizen, etc. To deal with these on a platform level will mean things like:

Making it easy for a webapp to send emails from the Mozilla domain, but using a trusted mechanism (e.g. a third-party mailer) using APIs that are metered and monitored. Again, making it easy for a developer to test an idea that requires sending a few emails, but hard for a badly written script to send spam.

Making it easy for a developer to declare expected network behaviors, so that Infrasec/Netops can more easily manage the network, etc.
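The metered mail idea in the first item above could be sketched roughly as follows. This is only an illustration, not a proposed implementation: the `MailQuota` class, its default limit of 20 messages per hour, and the per-app keying are all assumptions made for the example, and a real deployment would sit in front of whatever trusted mailer is chosen.

```python
import time
from collections import deque


class MailQuota:
    """Hypothetical per-app quota: at most `limit` sends per `window` seconds."""

    def __init__(self, limit=20, window=3600.0, clock=time.monotonic):
        self.limit = limit      # assumed default, not a Petri policy
        self.window = window    # sliding window length in seconds
        self.clock = clock      # injectable so the quota can be tested
        self.sent = {}          # app name -> deque of send timestamps

    def allow(self, app):
        """Return True and record the send if `app` is under quota, else False."""
        now = self.clock()
        stamps = self.sent.setdefault(app, deque())
        # Drop timestamps that have aged out of the sliding window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False  # over quota: a runaway script gets refused here
        stamps.append(now)
        return True
```

A platform-level mail endpoint would call `allow(app_name)` before handing a message to the mailer, and the refusals themselves are a useful monitoring signal: a developer testing an idea never notices the quota, while a badly written loop hits it almost immediately.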

Scope Restrictions

There will be experiments which we want to facilitate which don't fit the stacks provided by these PaaS/IaaS systems. (As an example, it's likely that none of these systems currently provide elastic clustered "big data" capabilities in the open-source + behind-the-firewall configurations.) For now, such applications should be provisioned the way they have been in the past, and we'll use that demand to gauge whether to expand the PaaS offering or not.

Production scale services should not be hosted on Petri, for lots of reasons:

Production services imply promises to users and others that Petri is designed to avoid.

The platforms we're picking will likely only work well with single-node configurations, thereby limiting the number of addressable users.

Everything about Petri will be generic, meaning that per-user costs will be high. That's only affordable for small numbers of users.

Petri will come with clear SLAs which will be deliberately terrible. In particular, it's impossible to ask an Ops group to ensure uptime for a system running unpredictable software. Petri is hoping for "one 9" availability, but even that isn't guaranteed in the first year.
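To make the "one 9" figure concrete, here is a small back-of-the-envelope calculation. It assumes "one 9" means 90% availability (the usual reading of the term) and a 365-day year; the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability):
    """Hours per year a service may be down at the given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Compare "one 9" with the targets a production service would normally expect.
for label, availability in [("one 9", 0.9), ("two 9s", 0.99), ("three 9s", 0.999)]:
    hours = allowed_downtime_hours(availability)
    print(f"{label}: about {hours:.0f} hours of downtime per year")
```

At 90% availability the budget is roughly 876 hours, or about 36.5 days of downtime a year, which is why nothing that makes promises to users belongs on Petri.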

Project Assessment

After one year, what would define a success for Petri?

Fewer Mozilla staff use outside-the-firewall services like Heroku to run their systems (and not because they were yelled at).

We have more experiments launched

Some experiments have been deemed successful enough that they have started the next stage of life (see: What Happens After Conception below)

The platform is not a 'problem area' for IT/Ops

Apps deployed on Petri have better security/privacy/design/management characteristics than those deployed elsewhere

Project Mechanics

There's nothing about this project that warrants secrecy, so the project will be run like a traditional Mozilla project:

We should be careful to explain in the project descriptions that we're starting by tackling the problem for internal users, but we should also explore the issues preventing us from making the platform available to trusted Mozillians who aren't paid. I expect there are legal issues there worth exploring.

Background

The Labs VM cluster has approximately 50-60 VMs currently running. Provisioning each of those VMs takes a variable amount of time, depending on what else is ongoing. In part because of the impact of coordination, developers have a tendency to go outside of the firewall and self-serve either full VMs on places like Rackspace, EC2, etc., or nodes/services on PaaS providers like Heroku. At that point, we have no idea who's doing what where, and we have no way of improving developer ergonomics, cost control, user privacy, network security, etc.

[Could use more background about what happens outside of labs, e.g. in webdev/engineering/research].

What Happens After Conception

It doesn't match the analogy that I introduce at the top of the page, but when an idea passes the validation stage, and leaves the Petri dish, life for that project is going to be different. Much like what happens once the pregnancy test comes back positive. For most people, there's a realization that what you used to do the day before is no longer the right thing. If you smoked before, you're going to try and quit. You're going to eat healthier. You'll put on your seatbelt. You'll start saving money. Some people lead lives that don't change after that point in time – and some projects will maybe just keep on doing what they did before. But for most, that "pregnancy test" moment is decisive.

In the life of a service-based software project at Mozilla, the pregnancy test will imply a lot more people getting involved in your project, so that the pregnancy is as healthy as possible. It's a change. And it's really uncomfortable if the project is too young -- but if you want to actually see this baby born, it's a requirement for carrying the Mozilla name.