Yesterday, we had the people from the PHPMad user group back in our Madrid offices to hold the 2nd monthly meetup. This time, the speaker was our backend engineer Daniel Pañeda, who talked about his experience migrating to HipHop, the HipHop virtual machine, and the general state of the project. You can read the epic tale of the migration process at Tuenti in this post. The talk wasn’t recorded this time, but you can see the slides here, and if you are interested, you can also browse our patches to HipHop in our Github account.

Normally, website development differentiates between regular code and configuration code.

Keeping configuration separate from the code allows you to make quick, basic changes without touching the logic. These changes are safer because they should be just an integer value change, a boolean swap, a string substitution, etc., and they don’t involve a full release process.

Some good practices have been described recently on the Internet about how to write your code in a flexible way to avoid useless releases, how to do A/B testing, how to make your database changes backwards compatible, etc., and all of these good practices involve a good configuration system.

Following the DevOps Culture

Here at Tuenti, we are very fond of the DevOps culture and try to apply it as much as possible. We consider it the way to go and the way to do things efficiently, and there is proof that it has helped us improve quite a lot.

In our company, configuration deployment is a clear example of the DevOps culture. There is no manual intervention and no dedicated person, such as a release manager or an operations engineer, doing deployments. Every developer pushes his/her configuration changes to production on his/her own. Therefore, devs are doing ops tasks.

We use ConfigCop for that.

ConfigCop

ConfigCop is a tool to deploy configuration to preproduction and production servers. Any developer can and must use it to first test their changes on a preproduction server and then deploy them to production.

Deploy to Preproduction

The preproduction deployment logic is fully done on the client side, and the basic options a developer can use are a configuration initialization and a configuration update, the latter being the one that uploads any configuration to be tested.

ConfigCop pulls the latest code from Mercurial, gathers all configuration files, and stages them, applying some overrides necessary to make them work on preproduction servers and generating some .json files readable by HipHop.

Then, ConfigCop deploys them using Rsync and the developer is ready to test.
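The staging step can be sketched roughly like this, assuming the configuration lives in plain dicts; the file layout, override names, and JSON shape below are our assumptions, not ConfigCop’s actual format.

```python
import json

def stage_config(base, overrides):
    """Merge preproduction overrides on top of the base configuration."""
    staged = dict(base)
    staged.update(overrides)
    return staged

def to_hiphop_json(staged):
    # Serialize the staged configuration into a .json document readable by HipHop.
    return json.dumps(staged, sort_keys=True)

# Hypothetical values: a base config plus a preproduction override.
base = {"chat_enabled": True, "max_upload_mb": 25}
overrides = {"db_host": "preprod-db.internal"}
print(to_hiphop_json(stage_config(base, overrides)))
```

The real tool works over whole files rather than single dicts, but the merge-then-serialize idea is the same.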

Deploy to Production

A production deployment must be sequential and cannot be done in parallel because conflicts may arise.

Therefore, ConfigCop uses its server side for this. It’s basically server-client communication over RPC that establishes a locking mechanism to perform sequential deployments. The lock is given to the developer currently deploying a configuration and it can’t be stolen.

Until the developer has finished a configuration change, another one can’t start.
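The lock semantics described above can be sketched as follows; a thread lock with an owner stands in for the real RPC server, and all names are ours, not ConfigCop’s actual API.

```python
import threading

class DeployLock:
    """Minimal sketch of the sequential-deployment lock."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._owner = None

    def acquire(self, developer):
        with self._mutex:
            if self._owner is None:
                self._owner = developer
                return True
            return False  # someone else is deploying; the lock can't be stolen

    def release(self, developer):
        with self._mutex:
            if self._owner != developer:
                raise RuntimeError("only the current owner may release the lock")
            self._owner = None

lock = DeployLock()
assert lock.acquire("alice")    # alice starts a configuration change
assert not lock.acquire("bob")  # bob must wait until alice finishes
lock.release("alice")
assert lock.acquire("bob")
```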

The workflow is pretty simple:

The developer starts a configuration change by typing a single command.

ConfigCop checks whether the lock is free. If it is, it deploys the configuration change to a preproduction testing server that always runs the same code as production.

This is important: it assures that the configuration change will work properly in production.

ConfigCop notifies the developer that s/he can start testing.

Once the change is tested, the developer pushes it to production with a single command.

The deployment process runs and finishes in less than a minute.

Results

ConfigCop freed the release managers from doing these operational tasks, and now developers, with just two commands, are able to test and deploy configuration to production in an easy, reliable, and fast way.

Real Examples

Due to a network problem, some chat servers are down and the load on those still alive is increasing very quickly; we will probably suffer an outage.

A developer can simply make a config change to temporarily disable the chat for 10% of the users.

The problem is fixed in less than a minute.
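A common way to implement such a percentage switch is to hash the user id into 100 stable buckets; this is a generic sketch, not Tuenti’s actual config code, and the flag name is invented.

```python
import zlib

def in_rollout(user_id, percentage):
    """True for roughly `percentage`% of users, and stable per user."""
    bucket = zlib.crc32(str(user_id).encode()) % 100
    return bucket < percentage

CHAT_DISABLED_PERCENT = 10  # hypothetical value pushed with a config change

def chat_enabled_for(user_id):
    return not in_rollout(user_id, CHAT_DISABLED_PERCENT)

disabled = sum(1 for uid in range(10000) if not chat_enabled_for(uid))
print(f"chat disabled for {disabled} of 10000 users")  # roughly 1000
```

Because the bucket is derived from the user id, the same users stay disabled until the config value changes back.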

A release requires changes in the database schema, so the developer coded the writes to insert into both the old and the new tables to keep them backwards compatible. The release is successfully deployed, so the old schema can be deprecated.

A developer makes a config change so that the code inserts only into the new tables.

The developer does it on his/her own; nobody else is bothered with such a task.

The product team wants to test two different layouts (A/B testing) for the registration form and wants some users using the old one, and others using the new one so they can measure stats and choose the best one.

The developer will play with the percentages of users seeing the A or B version and, when the final decision is made, set the chosen one to 100%.
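Such a weighted split is typically implemented with a stable per-user hash; the function and weight names below are illustrative, not the real A/B testing API.

```python
import hashlib

def choose_variant(user_id, weights):
    """Pick a layout variant with the configured percentages.

    `weights` maps variant name to a percentage and must sum to 100.
    Hashing the user id keeps the choice stable across visits.
    """
    bucket = int(hashlib.md5(f"reg-form:{user_id}".encode()).hexdigest(), 16) % 100
    cumulative = 0
    for variant, pct in sorted(weights.items()):
        cumulative += pct
        if bucket < cumulative:
            return variant
    raise ValueError("weights must sum to 100")

weights = {"A": 50, "B": 50}  # tweaked via configuration while the test runs
final = {"A": 0, "B": 100}    # once B wins, route everyone to it
assert choose_variant(1234, final) == "B"
```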

Requirements and Motivation

The team in charge of the Mobile Virtual Network Operator (hereafter MVNO) at Tuenti was asked to build a new checkout for the store. For legal and business reasons, a checkout for an MVNO has to comprise steps such as shipping information, billing information, customer ID check, etc. Our main goal was to find the best checkout that minimizes users abandoning the process.

The checkout had to be extensible and configurable enough to fulfill the following requirements:

To be able to perform A/B testing on the order of the steps.

To be able to perform A/B testing on the specific view of a step.

Back-end

There are two main entities in the back-end to model our checkout framework: the Order and the StateMachine.

The Order is a simple, persistent container into which we place the input the user has chosen through the checkout. In the Order, we keep a record of the data plan chosen, the initial top-up, shipping information, etc. An Order is identified by its OrderId.

The StateMachine is a state machine that drives the checkout process from beginning to end. Its responsibility is to define a flow of steps, so that the front-end can request the next step of the checkout in a generic way, which the StateMachine will provide. A StateMachine is identified by its FlowId.
When we have a step that depends on those that come before it, the StateMachine is able to resolve it through guard conditions. For example, if the user chooses neither an initial top-up amount nor a data bundle, the machine won’t provide the payment step.

The relation between the two is that an Order references a StateMachine, so an order can only be processed by a single StateMachine.
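A minimal sketch of the two entities and their relation might look like this; the field names are our guesses, since the post doesn’t show the real schema.

```python
from dataclasses import dataclass, field

@dataclass
class Order:
    """Persistent container for what the user chose through the checkout."""
    order_id: str
    flow_id: str  # references the single StateMachine that may process it
    data: dict = field(default_factory=dict)  # data plan, top-up, shipping...

@dataclass
class StateMachine:
    """Drives the checkout flow; identified by its FlowId."""
    flow_id: str
    steps: list

order = Order(order_id="o-1", flow_id="checkout-v1", data={"data_plan": "1GB"})
machine = StateMachine(flow_id="checkout-v1",
                       steps=["plan", "shipping", "billing", "payment"])
assert order.flow_id == machine.flow_id  # an Order binds to exactly one flow
```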

A/B Testing on the Order of the Steps

The design described above allows us to perform A/B testing on the order of the steps simply by defining another StateMachine, which is no more than a declaration of steps and transitions.

For example, if we’re doing an A/B/C test, we define three different state machines.

StateMachine, Auto and Edit Transitions

Any transition of the StateMachine has two attributes:

The guard condition decides which step will follow the current one depending on the return value of the guard function.

The callback is the function to be executed when the transition takes place.
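A transition carrying these two attributes could be sketched as below; the class, guard, and callback names are ours, not the real framework’s.

```python
class Transition:
    """A transition with the two attributes described above."""

    def __init__(self, source, target, guard=None, callback=None):
        self.source, self.target = source, target
        self.guard = guard or (lambda order: True)        # decides if we may follow it
        self.callback = callback or (lambda order: None)  # runs when it takes place

def has_payable_amount(order):
    # Guard: only reach the payment step if there is something to pay for.
    return bool(order.get("top_up") or order.get("data_bundle"))

def notify_payment(order):
    order.setdefault("events", []).append("payment_requested")

payment = Transition("summary", "payment",
                     guard=has_payable_amount, callback=notify_payment)

order = {"top_up": 10}
if payment.guard(order):     # the guard decides whether the transition fires
    payment.callback(order)  # the callback runs only when it actually does
assert order["events"] == ["payment_requested"]
assert not payment.guard({})  # nothing chosen: the payment step is not offered
```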

Although the checkout happy path is to go forward until the end, we had to make it possible to go back to edit steps. Having implemented a state machine, the natural move was to just define edit transitions between steps. These edit transitions go backwards in the StateMachine.

Implementing the step-editing possibility as a transition not only reduces code complexity and matches the natural behaviour of the state machine, but it also makes clear from which states users can edit others (for example, if the user has already gone through the payment step, we shouldn’t allow him/her to change the initial top-up).

Since a main goal is to make finishing the checkout as fast as possible, we didn’t want our customers to repeat steps they had already completed, even if that only means clicking and submitting the pre-filled data again. This applies both when editing a previous step and after a full reload.

Thus, when the user submits the step s/he was editing, we make him/her jump forward in the state machine to where s/he was before editing. The initial implementation was to navigate automatically through the original transitions, resolving them from the previously introduced data (automatically resolving branching, like postpay/prepay), until we got to the point where we could not transition further because of a lack of data: that’s where the user was before, so we would stop the machine there.

This initial implementation of these auto transitions quickly proved incorrect when we needed to add callbacks to the transitions, such as sending a request to our third-party ID validation system, sending an email to the user on a certain event, etc. These actions took place every time the transition was traversed, which was undesirable behaviour.

In the end, we coded the auto transitions as independent transitions without any triggers. As with the edit transitions, this gave us the power to choose which transitions could be automatic and which couldn’t.

In summary, every StateMachine is defined by its transitions, which are specified in three groups:

Forward transitions (Grey)

Edit transitions (Orange)

Auto transitions (Green)
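A StateMachine declared as these three groups of transitions might look roughly like this; the step names and the lookup helper are hypothetical.

```python
# Each group is a list of (source, target) pairs, mirroring the grouping above.
FLOW_A = {
    "forward": [("plan", "shipping"), ("shipping", "billing"), ("billing", "payment")],
    "edit":    [("payment", "shipping")],  # backwards: revisit a previous step
    "auto":    [("shipping", "billing")],  # replayed without triggering callbacks
}

def next_step(flow, current, kind="forward"):
    """Return the target of the first matching transition, or None."""
    for source, target in flow[kind]:
        if source == current:
            return target
    return None

assert next_step(FLOW_A, "plan") == "shipping"
assert next_step(FLOW_A, "payment", kind="edit") == "shipping"
assert next_step(FLOW_A, "shipping", kind="auto") == "billing"
```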

Front-end

The front-end side of the project is based on our in-house framework. The typical architecture of a product based on this framework consists of an Agent that receives the HTTP request parameters and returns a response with rendered content and data to the client side.

We added a new layer to decouple the rendering. With it, we are able to control the kind of views we are using for this flow; that is, it allows us to perform A/B testing on the look and feel of the checkout.

From the client side, if a step has a special need, we subscribe a step-initializer function name. It will be called by our JavaScript page coordinator when the step is loaded. For example, if the shipping information step needs to build a city autofill, that step will specify the JavaScript function to do so.

A/B Testing on the View of the Steps

With the architecture presented above, testing a step is as easy as creating a new renderer that inherits from the original and overrides the method in charge of rendering the step we want to change.

The entry point of the checkout (usually an agent) is what defines the flow (StateMachine) and the renderer of the process for the user. Our A/B testing API chooses both of them, and they are stored in the Order (which is recovered from the session), so if the user reloads the page or revisits it after several days, s/he still sees and continues the same process.

Conclusions

The whole design and implementation was a challenge for us. Making something as sensitive as an MVNO checkout configurable was a very ambitious project.

We feel as though we developed a checkout framework more than the product itself, something that exceeded our original estimations. However, we are confident that it will pay off, because we now have total flexibility to change any aspect of the process in an easy way, as well as the power to do A/B testing on any part of the checkout and, of course, to build more specific flows.

Last Wednesday, we hosted the first meetup of the new PHP users group, PHPMad, formerly the Symfony Madrid group, in our nice Madrid office. In addition to a short presentation by the group, we had the chance to hear a few words from invited speaker David Buchman. He gave a talk and presented a demo on the Symfony2 Content Management Framework, of which he is one of the core developers. The talk was recorded, so you can watch it here.

Release Candidate Selection

So, we have made sure that the integration branch is good enough and that every changeset is a potential release candidate. Therefore, release branch selection is trivial: just pick the latest changeset.

The release manager is in charge of this. How does s/he do it? Using Flow and Jira.

Like the pull request system we talked about in the previous post, Jira orchestrates the release workflow, and Flow takes care of the logic and operations. So:

The release manager creates a Jira “release” type ticket.

To start the release, the release manager just transitions the ticket to “Start”.

When this transition happens, Jira notifies Flow and the release start process begins.

This is what Flow does in the background:

It creates a new branch from the latest integration branch changeset.

It analyzes the release contents and gathers the involved tickets, linking them to the release ticket (Jira supports ticket linking).

It configures Jenkins to test the new branch.

It adds everyone involved in the release as watchers of the release ticket (a Jira watcher is notified of every ticket change), so that all of them are aware of anything related to it.

It sends an email notification with the release contents to everyone on the company’s tech side.

This process takes ~1 minute.

Building, Compiling and Testing the Release Branch

Once the release branch is selected, it is time to start testing it because there might be a last minute bug that escaped all of the previous tests and eluded our awesome QA team’s eyes.

Flow detects new commits made to the release branch (these commits almost never occur) and builds, compiles, and updates an alpha server dedicated to the release branch.

Build

Why do we build the code? PHP is an interpreted language that doesn’t need to be built! Yes, that’s true, but we need to build other things.

Compilation

We also use HipHop, so we do have to compile PHP code.
The HipHop compilation for a huge code base like ours is quite a heavy operation. Even so, using a farm of 6 servers dedicated just to this purpose, we get the full code compiled in about 5 - 6 minutes.

Testing

The built and compiled code is deployed to an alpha server for the release branch, and QA tests the code there. The testing is fast and not extensive; it’s basically a sanity test of the main features, since Jenkins and the previous testing assure its quality. Furthermore, the error log is checked in case anything fails silently but leaves an error trace.
This testing phase usually takes a few minutes, and bugs are rarely found.
Jenkins also runs all of the automated tests, so we have even more assurance that no tests have been broken.

Staging Phase, the Users Test for Us

Staging is the last step before the final production deployment. It consists of a handful of dedicated servers where the release branch code is deployed and thousands of real users transparently “test” it. We just need to keep an eye on the error log, the performance stats, and the server monitors to see if any issue arises.

This step is quite important. New bugs are almost never found here, but the ones that are found are very hard to detect, so anything found here is more than welcome, especially because those bugs are usually caused by a large number of users browsing the site, a situation we can’t easily reproduce in an alpha or development environment.

Updating the Website!

We are now sure that the release branch code is correct and bug free. We are ready to deploy the release code to hundreds of frontend servers. The same built code we used for the alpha deployment of the release branch will be used for production.

The Deployment: TuentiDeployer

The deployment is performed with a tool called TuentiDeployer. Every time we’ve mentioned a “deploy” within these blog posts, that deploy was done using this tool.

It is used across the entire company: any type of deployment, for any service or to any server, uses it. It’s basically a smart wrapper over Rsync and WebDav that parallelizes operations and supports multiple, flexible configurations, letting you deploy almost anything you want wherever you want.

The production deployment, of course, is also done with TuentiDeployer, and pushing code to hundreds of servers only takes 1 - 2 minutes (mostly depending on the network latency).
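The core idea of a parallel Rsync wrapper can be sketched as follows; the flags, paths, and host names are illustrative, and the real TuentiDeployer surely does much more (WebDav, per-service configuration).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def rsync_command(host, src="build/", dest="/var/www/"):
    # --delete keeps each frontend's tree an exact mirror of the build.
    return ["rsync", "-az", "--delete", src, f"{host}:{dest}"]

def deploy(hosts, run=lambda cmd: subprocess.run(cmd).returncode, workers=32):
    """Fan out to all hosts in parallel; return the ones whose rsync failed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(lambda h: (h, run(rsync_command(h))), hosts))
    return [host for host, code in codes if code != 0]

# Dry run with a fake runner instead of touching the network:
failed = deploy([f"front{i:03d}" for i in range(5)], run=lambda cmd: 0)
assert failed == []
```

Injecting the `run` callable keeps the sketch testable without a network; in real use the default `subprocess` runner would execute each rsync.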

It performs different types of deployments:

PHP code to some servers that do not support HipHop yet.

Static files to the static servers.

An alpha deployment to keep at least one alpha server with live code.

HipHop code to most of the servers:

This is not fully true: we can’t push a huge binary file to hundreds of servers.

Instead, TuentiDeployer only deploys a text file with the new HipHop binary version.

The frontend servers have a daemon that detects when this file has changed.

If it has changed, every server fetches the binary from the artifact server.

The binary is already there, pushed as a previous step right after its compilation.

Obviously, hundreds of servers fetching a big file from a single artifact server would overwhelm it, so there are a bunch of artifact cache servers to fetch from and relieve the master one.
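The version-file mechanism can be sketched like this; the paths and the fetch step are invented for illustration.

```python
import os
import tempfile

def read_version(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None

def check_and_fetch(version_file, current, fetch):
    """One iteration of the daemon loop: fetch the binary if the version changed."""
    wanted = read_version(version_file)
    if wanted is not None and wanted != current:
        fetch(wanted)  # e.g. download the HipHop binary from an artifact cache
        return wanted
    return current

fetched = []
with tempfile.TemporaryDirectory() as d:
    vf = os.path.join(d, "hiphop.version")
    with open(vf, "w") as f:
        f.write("build-4217\n")  # the small text file the deploy actually pushes
    current = check_and_fetch(vf, "build-4216", fetched.append)
assert current == "build-4217" and fetched == ["build-4217"]
```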

Finishing the Release

Release done! It can now be safely merged into the live branch so that every developer can get the latest code to work with. This is where Jira and Flow take part again. The Jira ticket triggers the whole automated process with just the click of a button.

The Jira release ticket is transitioned to the “Deployed” status and a process in Flow starts. This process:

Merges the release branch to the live branch.

Closes the release branch in Mercurial.

Disables the Jenkins job that tested the release branch to avoid wasting resources.

Notifies everyone by email that the release is already in production.

Updates the Flow dashboards to reflect that the release is over.

Oh No!! We Have to Revert the Release!!

After deploying the release to production, we might detect that something important is really broken and we have no choice but to revert the release, because the feature cannot be disabled by config and the fix seems complex and won’t be ready in the short term.

No problem!! We store compressed copies of the built code for the last few releases, so we just need to decompress the previous one and do the production deployment.
The revert process takes only about 4 minutes and almost never takes place.
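The revert path can be sketched under the assumption that builds are stored as gzipped tarballs (the post doesn’t name the actual compression format):

```python
import os
import tarfile
import tempfile

def archive_build(build_dir, archive_path):
    """Store a compressed copy of a built release."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(build_dir, arcname=".")

def revert(archive_path, target_dir):
    # Decompress the stored build; the normal production deploy then ships it.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)

with tempfile.TemporaryDirectory() as work:
    build = os.path.join(work, "build")
    os.makedirs(build)
    with open(os.path.join(build, "index.php"), "w") as f:
        f.write("<?php // release r41\n")
    archive = os.path.join(work, "r41.tar.gz")
    archive_build(build, archive)

    restored = os.path.join(work, "restored")
    revert(archive, restored)
    assert os.path.exists(os.path.join(restored, "index.php"))
```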