We have recently released the #1 requested feature at Tuenti, group chat.
It has been a titanic effort: months of developing the server code, the client-side code, and new systems infrastructure to support this highly anticipated feature. But was it really so big as to take so much time?

Scope

Since 2010 improvements have been made to the chat server code (Ejabberd, using Erlang as the programming language), achieving important performance gains and lowering the server resource consumption.
We had approximately 3x better performance than a vanilla Ejabberd setup, which taking into account that we currently have more than 400M daily chat messages is not bad at all.

We also had 20 chat server machines, each running on average 6 instances of Ejabberd and operating well below their capacity, so resharding the machines and setting up a load balancer was appealing.

Chat history was almost done, but we had to add support for group chat. It is one of the first projects we have built with HBase instead of MySQL as the storage layer.

The message delivery system (aka message receipts) was also well advanced in its development, but not yet finished. It uses a simple flow of Sent -> Delivered -> Read states.
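As a minimal sketch of that flow (the function and state names are illustrative, not our production code), a receipt can be modelled as a state that only ever moves forward. Treating late or out-of-order receipts as no-ops is an assumption made here for simplicity:

```javascript
// Hypothetical names; the text only specifies the Sent -> Delivered -> Read order.
const RECEIPT_STATES = ['sent', 'delivered', 'read'];

// Returns the new state, ignoring receipts that arrive late or out of order.
function advanceReceipt(current, incoming) {
  const from = RECEIPT_STATES.indexOf(current);
  const to = RECEIPT_STATES.indexOf(incoming);
  if (to === -1) {
    throw new Error('Unknown receipt state: ' + incoming);
  }
  return to > from ? incoming : current;
}
```

With this rule, a "delivered" receipt that arrives after "read" leaves the message marked as read.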

Multi-presence means being able to open multiple browser windows and/or use multiple mobile devices without losing the chat connection on any of them (up to a maximum). To achieve this, the server side logic needs to handle not only Jabber IDs (JIDs) but also resources, so that the same JID can be connected from multiple sources at the same time.
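A rough sketch of the idea, written in JavaScript for readability (the real logic lives in the Erlang server). The class name, the limit of 5 and the example addresses are assumptions; only the "bare JID plus resource, up to a maximum" rule comes from the description above:

```javascript
// Hypothetical sketch: track every connected resource of a bare JID,
// up to an assumed per-user maximum.
const MAX_RESOURCES = 5; // assumed limit; the real value is not given here

// Splits "user@domain/resource" into its bare JID and resource parts.
function parseJid(fullJid) {
  const slash = fullJid.indexOf('/');
  if (slash === -1) {
    return { bare: fullJid, resource: null };
  }
  return { bare: fullJid.slice(0, slash), resource: fullJid.slice(slash + 1) };
}

class PresenceRegistry {
  constructor() {
    this.sessions = new Map(); // bare JID -> Set of connected resources
  }

  // Returns true if the connection is accepted, false if over the limit.
  connect(fullJid) {
    const { bare, resource } = parseJid(fullJid);
    if (resource === null) {
      throw new Error('A full JID with a resource is required');
    }
    const resources = this.sessions.get(bare) || new Set();
    if (!resources.has(resource) && resources.size >= MAX_RESOURCES) {
      return false;
    }
    resources.add(resource);
    this.sessions.set(bare, resources);
    return true;
  }

  resourcesOf(bareJid) {
    return Array.from(this.sessions.get(bareJid) || []);
  }
}
```

Connecting `ana@example.com/web-1` and `ana@example.com/mobile` leaves two live resources under the same bare JID, so a message for `ana@example.com` can be routed to both.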

The “new Tuenti”: This new version of the main website required a large part of the company’s technical resources. The team in charge of the chat is not responsible for the chat alone, so we had to dedicate engineers to building parts of the new website.
As it implied a complete new visual look, the chat had to change its appearance too.

And of course, the group chat.

Being able to chat with multiple people at once

Roles of room owner (the “administrator” of that chat group/room), members and banned members

Storing the rooms even if you close the window (until you explicitly close them)

Supporting both default group room avatars (a pretty mosaic) or custom ones (choose any of your photos or upload a new one)

Supporting custom room titles

Room mute

The Old Chat Web Client

The web chat is a full JavaScript client, using Flash only for videochat. We use a modified open source JavaScript XMPP library, JSJaC, tailored to our needs.
A rough schematic architecture of the chat client is:

HTML receiver file, that performs long polling connections to the chat servers to simulate a typical socket.

One requests controller that processes incoming XML chat messages (stanzas, iqs and the like) using the JSJaC library, and converts them to javascript objects.

Buddylist, User and other classes, all of them twice, one with UI prefix, and the other with Data prefix. We separate UI behaviours from data handling, and all components communicate with each other (think of linked widgets more than a traditional desktop chat client application).

User class performs two tasks: It represents a buddy list contact, but it also represents a conversation room (stores the conversation, etc.)

The code has been working perfectly and with almost no client-side maintenance since it was launched in 2009, just adding new features and visual style changes.

What went right

New cluster: It works really well. Now we not only have load balancing, but we can also perform upgrades on one leaf and keep the chat up with the other leaf’s nodes.
Each node now has 10 machines running up to 4 instances per machine, so we actually do more with less hardware.

Cleaner, up to date code: Now we have inheritance in the chat client code, which lets us avoid repeating code by having a base chat room and then one-to-one and group rooms. Data-related classes are also now better separated from UI-related ones, much of the code is now well commented, and we have private and public fields (by convention, not enforced by any JavaScript framework).
Many events are now handled by YUI, and we have dozens of JavaScript files that are still bundled into one when we deploy the code live, which makes development much easier.
Overall, the client will now support future enhancements and additions much faster.
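To illustrate the room hierarchy, here is a minimal sketch. The class and method names are made up for the example and modern class syntax is used for brevity; only the split into a base room, one-to-one rooms and group rooms, and the "private by convention" fields, come from the description above:

```javascript
// Illustrative names only; not the actual client classes.
class BaseChatRoom {
  constructor(id) {
    this.id = id;
    this._messages = []; // leading underscore marks "private by convention"
  }
  addMessage(from, text) {
    this._messages.push({ from: from, text: text });
  }
  messageCount() {
    return this._messages.length;
  }
}

class OneToOneRoom extends BaseChatRoom {
  constructor(id, buddy) {
    super(id);
    this.buddy = buddy;
  }
  title() {
    return this.buddy;
  }
}

class GroupRoom extends BaseChatRoom {
  constructor(id, owner, members, customTitle) {
    super(id);
    this.owner = owner;
    this.members = members;
    this.customTitle = customTitle || null;
  }
  // Falls back to the member names when no custom title was set.
  title() {
    return this.customTitle || this.members.join(', ');
  }
}
```

Message storage is written once in the base class, and each room type only adds what is specific to it.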

Fast, very fast: Server side code is even faster. More optimized, more adapted to our needs, being able to handle up to 13 times more messages at once! Custom XMPP stanzas have been built to allow fast but correct delivery.

Everything works as expected: We didn’t have to make any tradeoffs due to technical limitations. We have kept the same browser support (including IE7) and all features work as the original requirements defined.

Two UIs co-exist happily: Both versions of www.tuenti.com, each with their distinct UI, share all inner code and are easy to extend.

What went wrong

Ran two projects in parallel: Along with housekeeping tasks, the team had other high priority projects to work on, which took resources and time away from the group chat. Because of bad timing, half of the client side team was dedicated to building the “new Tuenti” and could not bring its full power to the group chat until the last stages of development.

One step at a time: Tuenti has migrated almost all client side code to use Yahoo’s YUI library. We had to migrate the chat client, plus do a huge code refactor to add support for group chats, plus make the visual changes of the new website, plus add new features (chat history, receipts...). This generated a lot of overhead and a first phase of code instability in which we couldn’t tell quickly whether a bug was due to the refactor, due to YUI or due to a new feature not yet finished.
It would probably have been much better to first migrate to the new framework, then refactor, and then apply the visual changes and implement or finish the new features.

Single responsibility principle: A class should have only one single responsibility. By far, the biggest and hardest part of the refactor was separating the original User class into ChatUser and ChatRoom. We could not have anticipated group chat back in 2009, but we were too optimistic about the impact of this change when planning the group chat.

Lack of client side tests: The old chat client had no tests, so QA had to test everything manually, which also generated a lot of overhead.
We are now preparing a client side testing environment and framework to keep the new chat codebase bug-free.

CSS 3 selectors performance: With the multipresence and the new social application, all users now have many more friends online or reachable via mobile device at once. Rendering hundreds of chat friends, plus some performance-wise dangerous CSS 3 selectors hit us in the late stages of development.
We hurried to do some fixes and we are still improving performance as some browsers still suffer a bit from the amount of DOM nodes plus CSS matching rules.

General guides

At Tuenti we love HTML and CSS, but we also know they can cause lots of problems. That’s why we use some of the following practices to keep our “love affair” going strong. This article is not a list of recommendations, but rather our way of sharing some guidelines used at Tuenti. I was inspired to share our methodology after reading Harry Roberts’ article (kudos, Harry).
Tuenti is a big project with lots of code, so you can imagine how hard it is to work in a collaborative environment where different people can create or modify HTML and CSS, and where, above all, you need to provide the most efficient solutions possible. To accomplish that, we follow a modular approach to generate reusable components and to maintain a consistent framework of UI components.
Using this modular approach has some key benefits:

Write less CSS, and more predictable (because we maximize 1:n relation between CSS vs HTML)

Less browser rendering problems (because we use more previously tested code)

Easier to maintain

More flexible

HTML

You need to be flexible with your markup to provide efficient components. I remember in the past trying to avoid “divitis” and “classitis” all the time, and keeping the markup as clean as possible. However, these practices are bad when associated with modularity; the markup is too closely coupled with the CSS, resulting in less flexibility and more troublesome maintenance.
We try to create modules that can adapt to different HTML structures, which we achieve by applying different classes to our HTML, so our HTML is happily affected by “classitis”, “divitis” and violations of other not-so-effective so-called “good practices”.
Additionally, we always write our markup in lowercase to maintain coherency throughout our platform, which improves the readability and consistency of our code. We also focus on readability by adding blank lines between some structures, and comments in some closing tags.

Tags

We always close our tags. Although we know there are some situations where you can avoid closing tags, we’ve decided to explicitly close all tags to avoid confusing other developers. If I were coding for a personal project, I would probably omit some closing tags (especially li’s or td’s). But efficient teamwork requires predictable code.
We also use certain tags for some specific things. For example, we use <i class="i-photo"></i> for icons and for text in buttons or to accompany icons.

Attributes

We always use quotes for attributes, again for the sake of consistency. If you’re using an OOCSS approach, you’ll often use multiple classes, so it’s better to always quote attributes.
Recently, we’ve started using dashes to separate attribute words. We previously used a camel case approach, but have learned that dashes improve readability. Here at Tuenti, the markup you write will always be read by your colleagues, so we make an effort to follow conventions.

CSS

Following the OOCSS approach, we use classes in order to abstract modules. This is a great way to avoid code repetition as the project grows. Be careful with the abstractions. Although coupling HTML and CSS is a big problem, abstracting too much can also generate issues (like having to apply too many classes to an element to style it). Try to separate concepts like structure, appearance, and function.
We have some basic CSS files (reset, layout and structures) and lots of small CSS files for every module, and then we concatenate and minify them on the server side. In fact this separation is great for loading only the modules you need… but that’s another story. Working in a CSS file with thousands of lines can be a real mess, so don’t be afraid of having to deal with lots of small files.
Here’s a summary of the practices used in our modular approach:

Avoid id’s for styling hooks (mainly because of the specificity)

Avoid over-qualified classes (p.whatever, because of flexibility and specificity)

Short selectors (maximum 3 or 4 classes, because of the specificity, readability and performance)

Separate container and content (.bulleted-list instead of .sidebar ul, for example)

Try always to use our grid framework to build module structures

Avoid !important and inline styles

This approach tries to minimize the “C” in CSS (cascading) by only using cascading on small components that do not affect other structures. Again, this means that lots of good practices we used in the past are not advisable to create modular CSS.

Naming classes

We use different naming conventions to easily predict which styles to apply, for example we use to create general helpers:

Selectors

As I mentioned above, we try to keep our selectors as short as possible to maximize portability and minimize dependency and specificity – this avoids the specificity wars of the past. The way you write your selectors is crucial to avoid out-of-control style sheets, so you must write them carefully.
For performance, the most important part of a selector is the key selector (the last part of the selector), because browsers read selectors from right to left. So we try to write key selectors that are class selectors instead of type selectors (an element like p or li).
To increase readability, we also indent related selectors to see the hierarchy easily, and we add some white space between blocks.

Properties

We order CSS properties by relevance (position and size first, then margin, padding, fonts, colors, and finally everything else). We don’t have an explicit rule for this, we prefer common sense. To increase readability, we like to add whitespace after “:” and align vendor prefixes around “:”. We also use one line per property if the block of properties has more than 4 properties. This creates a nice balance between increased readability and proper function with version control tools. Here’s an example:

Wrapping up

I hope you’ve enjoyed these “Tuenti practices”. Take what makes sense for you and adapt it to your needs. But remember, the “best practices” we used in the past are now obsolete. So never stop innovating and evolving!

We have been busy at work building a new Tuenti that we wanted to be much faster. In the process of renewing ourselves we needed to shake out the older client architecture and start afresh. More than doing code changes we found a new philosophy of doing things in which performance was part of the main criteria on how to architect the website.

Load Javascript on Demand

The most important shift we made was loading javascript on demand. This means downloading javascript lazily rather than eagerly. To be clear, you can download eagerly and still be downloading Javascript and CSS asynchronously. Still, that is not a great practice to follow if you have lots of javascript in the client.

Loading JavaScript eagerly led us to a situation in which at some point we had about 1.5MB of JS code for the browser to parse and execute (minimized and gzipped), much of which the user wouldn’t use at all. What we wanted was to download just the JavaScript we needed to display the current page. To do dynamic code downloading in the client we used YUI, which worked great for us once we integrated its build with our own.

Once we added the ability to download JavaScript on demand, we extended it to client side translations and client side templates. Both translations and templates are compiled to JavaScript by our build process and can thus be loaded as needed, just like the rest of the code. We use Handlebars as a client side rendering engine; it has a great module that runs on Node.js to compile HTML templates into JavaScript.

Measure Everything

We use the HTML5 performance timing API to gather performance data.
Make sure you are not wasting precious keep-alive connections to send performance data. We found that sending stats data through our regular HTTP connection was counterproductive: sending data so frequently kept keep-alive connections open in the load balancer for too long, which increased the load balancer’s memory usage. We ended up sending our stats and performance data from the client to a different domain without keep-alive so as not to interfere with requests to www.tuenti.com.
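A minimal sketch of this kind of measurement, assuming the browser exposes the Navigation Timing object (`window.performance.timing`). The metric names, the helper functions and the `stats.example.com` domain are placeholders, not our real setup:

```javascript
// Sketch only: metric names and the stats URL are assumptions.
// `timing` is expected to look like window.performance.timing (Navigation Timing).
function collectTimings(timing) {
  const start = timing.navigationStart;
  return {
    ttfb: timing.responseStart - start,                // time to first byte
    domReady: timing.domContentLoadedEventEnd - start, // DOM ready
    load: timing.loadEventEnd - start                  // full page load
  };
}

// Builds a beacon URL on a separate (hypothetical) stats domain.
function buildBeaconUrl(baseUrl, metrics) {
  const query = Object.keys(metrics)
    .map(function (key) {
      return encodeURIComponent(key) + '=' + encodeURIComponent(metrics[key]);
    })
    .join('&');
  return baseUrl + '?' + query;
}

// In a browser, once the load event has finished:
//   var url = buildBeaconUrl('https://stats.example.com/t',
//                            collectTimings(window.performance.timing));
//   new Image().src = url; // fire-and-forget request to the stats domain
```

Sending the beacon as an image request to another host keeps it off the connections used for the main site.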

We optimize everything down to the connection level. Not only do we compress images, CSS and JS, version them and use cache headers, but we also analyze what happens at the network level to guarantee the best experience.

For example, we make your browser fetch images from several domains so that it opens more parallel connections for those resources, but not so many that DNS resolution for the extra domains causes more delay than benefit.

For the main site we focus on fast response times rather than increased parallelism: we use long-lasting keep-alive connections, we use the same origin domain for more types of requests to reuse open connections as much as possible, and we have tuned TCP's initial window to allow sending more data on newly established connections before waiting for the client's ACKs.

Optimize the first Page Load. Use the right tool for the job.

We wanted to keep the Javascript needed to show the first page to a minimum -- none, if possible. So, we removed client side rendering for the first page load, which is normally the Home Page, and made sure all of it was rendered server side. Since we rely on YUI for dynamic loading we at least needed to load the code to do the YUI bootstrap. We ran experiments with slow connections to decide whether it was better to do a connection to retrieve the YUI bootstrap or to just plain inline it on the page. While fast browsers didn’t care, for slow browsers it was faster to actually do a request to retrieve the bootstrap code, since inlining it in the page actually made the payload bigger and slowed the first page response while an external request that gets cached (changes are rare) reduces the data needed to be fetched by the browser.

While we kept the javascript to a minimum, there was still some Javascript we needed to download for the page to render besides the YUI bootstrap. To make things faster, rather than 1) serve page and 2) do client side calculations to see what extra Javascript we needed to download and download it (resulting in a waterfall network panel graph), we added the ability server side to find the Javascript needed for a given url. Thus, the first pageload computes server side the javascript it needs to run that particular url and adds it to the page so by the time the user gets the page it is ready to roll.
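The server side lookup can be pictured as a walk over a module dependency graph. The sketch below is written in JavaScript only for illustration; the module names, the route table and the function are invented for the example and say nothing about how our backend actually stores this information:

```javascript
// Hypothetical module graph and route table; a sketch of the idea only.
const MODULE_DEPS = {
  'home-page': ['photo-stream', 'chat-dock'],
  'photo-stream': ['templates-photo'],
  'chat-dock': [],
  'templates-photo': []
};
const ROUTE_MODULES = { '/home': ['home-page'] };

// Returns every module a URL needs, dependencies first, without duplicates.
function modulesForUrl(url) {
  const ordered = [];
  const seen = new Set();
  function visit(name) {
    if (seen.has(name)) {
      return;
    }
    seen.add(name);
    (MODULE_DEPS[name] || []).forEach(visit);
    ordered.push(name);
  }
  (ROUTE_MODULES[url] || []).forEach(visit);
  return ordered;
}
```

The resulting list can be written into the page as script references, so the browser does not need a second round trip to discover what to load.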

Maintainable Javascript.

Writing maintainable Javascript is very important but it becomes critical if you work with a large number of developers. YUI provides a very good set of tools and principles to make this happen:

Everything is a module that can depend on other modules. This allows you to create decoupled and reusable components without worrying about dependency management.

YUI modules run in a sandbox, which can be a little bit confusing at the beginning but pays off with a lot of benefits in the long run. For more information about this, check their quick start guide.

Custom events provide a very simple way to isolate your components. Rather than having a component making direct calls to other components, i.e. making one know about the other and increasing coupling between them, it’s preferred to have two isolated components that can throw events, and having an upper level entity able to set up the event listening and make them work.
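The custom events pattern can be sketched without any library. The `Emitter` class below is a tiny stand-in written for this example (it is not YUI's event API), and the component names are hypothetical:

```javascript
// A tiny stand-in for a custom-event system (not YUI's API).
class Emitter {
  constructor() {
    this._listeners = {};
  }
  on(type, fn) {
    (this._listeners[type] = this._listeners[type] || []).push(fn);
  }
  fire(type, payload) {
    (this._listeners[type] || []).forEach(function (fn) {
      fn(payload);
    });
  }
}

// Two components that know nothing about each other (hypothetical names).
class BuddyList extends Emitter {
  select(buddy) {
    this.fire('buddySelected', buddy);
  }
}
class ConversationPanel {
  constructor() {
    this.openWith = null;
  }
  open(buddy) {
    this.openWith = buddy;
  }
}

// The upper-level entity is the only place where the two are connected.
function wireChat(buddyList, panel) {
  buddyList.on('buddySelected', function (buddy) {
    panel.open(buddy);
  });
}
```

Either component can be replaced or tested on its own, because the only coupling lives in `wireChat`.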

CSS matters

We found several performance issues related to having a large number of DOM nodes. As the chat client needs to be able to render a lot of contacts, this is something we have to deal with.

Selectors

Certain CSS selectors can slow the page down. Browsers will apply selectors from right to left so the faster you are able to discard a rule, the faster the browser will be able to process the whole tree and apply the style properly to that element. e.g.

All selectors will match when the browser is resolving the computed style for the span[class="foo"] element, but what the browser needs to do in each case is very different.

For a) it just compares the class of the element with the selector. Trivial for the browser.

For b) it has to go up in the DOM tree to know whether the span has a <p> element as ancestor, and then move up again in the tree to check if the <p> has a <div> element as ancestor. This is horrifically slow.

For c) we reduce the ancestor lookups to just parents. Better than b, but still bad.

For d) we need to check if we have a sibling at distance=1 that has the foo-parent class.

For e) we need to check if at the same level of the DOM tree we have any sibling that has the foo-parent class.

As a summary, avoid descendant selectors, avoid tags to qualify rules with IDs or classes and use CSS3 selectors with care, thinking about what the browser will need to do to compute the style of each element.
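To make the cost difference concrete, here is a toy model of right-to-left matching. It is not how any real browser engine is implemented, and the node shape and function names are invented for the example. It compares a single class selector with a descendant selector of the `div p .foo` kind by counting how many nodes have to be inspected:

```javascript
// Toy model, not a browser engine: each node is { tag, cls, parent }.
// A selector is a list of simple parts, e.g. ['div', 'p', '.foo'] for "div p .foo".
function matchesPart(node, part) {
  return part[0] === '.' ? node.cls === part.slice(1) : node.tag === part;
}

// Matches a descendant selector right to left and counts the nodes inspected.
function matchDescendant(node, parts) {
  let visited = 1;
  let index = parts.length - 1;
  if (!matchesPart(node, parts[index])) {
    return { matched: false, visited: visited }; // key selector failed: cheap exit
  }
  index -= 1;
  let current = node.parent;
  while (index >= 0 && current) {
    visited += 1;
    if (matchesPart(current, parts[index])) {
      index -= 1;
    }
    current = current.parent;
  }
  return { matched: index < 0, visited: visited };
}
```

In this model a class-only selector always inspects one node, while a descendant selector that ends up not matching walks every ancestor up to the root before it can be discarded.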

Event delegation

Event delegation is a nice practice to reduce the number of DOM event subscribers. It uses event bubbling to centralize listeners in a container node, working more or less like this

// Using YUI as an example, it’s pretty much the same on every library out there
Y.one('#container').delegate('click', myHandler, 'span');

So, when we click on one of these span elements, the click event will bubble up to the <li>, then the <ul> and finally the container, where, because the span matches the provided CSS selector, it will be properly handled.

The math is pretty simple -- the deeper your DOM is, the slower all your event handling will be too. Also, CSS sanity rules apply here too, don’t use aggressive selectors to perform the matching as the browser needs to do a querySelector (or a replacement where not available) for each of the nodes that are propagating the event.
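The cost can be seen in a toy version of delegation. This is a simplified model written for this example, not YUI's implementation: nodes are plain objects with a `parent` link, and `matches` stands in for the CSS selector test:

```javascript
// Toy model of delegation (not YUI's implementation): nodes are { tag, parent }.
// Walks from the event target up to the container, calling the handler for
// every node that passes the `matches` test.
function delegate(container, matches, handler) {
  return function onEvent(event) {
    let node = event.target;
    let steps = 0;
    while (node && node !== container) {
      steps += 1;
      if (matches(node)) {
        handler(node, event);
      }
      node = node.parent;
    }
    return steps; // nodes inspected: grows with the depth of the DOM
  };
}
```

Every node between the target and the container costs one `matches` call, which is why both a deep DOM and an expensive selector make each event slower.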

Wrapping Up

Making the changes outlined (and some more) resulted in a Tuenti experience that is about five times faster than before. Not only that, we no longer need a loading bar, as we download the minimum amount of resources to be able to display the page you want to see.

Tuenti is a fast-paced, growing company and website. Our site is constantly growing, with more and more users registering, new features being added and new issues being addressed to ensure safe and steady growth of our system. It is therefore part of our company methodology to develop and release code as quickly as possible with certain quality guarantees.
At the time of writing we have code releases on a tri-weekly basis, and we are aiming to be able to safely release code on a daily basis, minimizing the QA team efforts. In the next few lines I’ll describe what our test framework engineers have been working on to achieve the current system and what steps are being taken to get to the next level.

The Release Process

Typically the workflow for anyone to check code into Tuenti is the following:

So the cycle begins with a feature request and its implementation. That is a complex process on its own, but we will not get into that now. When a development team starts the actual coding of a feature, it forks a branch from the last stable version of the code and adds its project to our Continuous Integration system: Jenkins. Jenkins is an open source tool that, among many other things, can trigger code builds when code is checked into the repositories. During the development process Jenkins gives feedback to the developer so that he knows whether he broke the build or any of the tests.

Typically the developer will write his own tests to unit test the classes he is modifying or coding, plus some integration tests, and finally the QA engineers assigned to the team will write some UI browser tests to ensure the feature works and feels right in the UI. Developers have test sandboxes in their environments, which are very useful for quickly checking that their code is working; nonetheless, they can only run a reduced set of tests there, most likely the ones related to the feature being developed. This is where Jenkins comes into play.

The Jenkins CI environment automatically runs the whole test suite on the developer's project so that he can be sure that his feature is not breaking any existing features or causing any disruption on the site.
Once Jenkins gives the developer the thumbs up (represented by a blue icon), the developer can automatically merge his code into the integration branch. Each time a branch is merged, a build is triggered in the integration branch, so we ensure that what was already integrated by other developers works with the code which has just been merged.

At some point the release operations team selects an error-free changeset which will be the one going live. The set of error-free changesets is moved to a different branch, called release, so that developers can continue integrating code in the integration branch without affecting the code release. The set of changes in the release branch is then manually tested by the QA team, and bugs are corrected as soon as they are reported. In the meantime, automatic regressions keep running to ensure that the fixes don’t introduce new bugs into the code.

Once the branch is stable and all the bugs are under control, the code is rolled out to live in off-peak traffic hours to avoid disrupting our users. Our QA team ensures that everything works as expected and reports any bugs found so that they can be fixed. When the devops (Development Operations) team considers that the site is stable, the release is closed and the release branch is considered to be safe and stable. At that point the code is merged to the stable branch, so the stable branch always holds the latest version of the code that is live.

If bugs are spotted after the release is closed, then depending on how important and compromising they are, they will be fixed immediately and committed to the Hotfix repository, where the fix will be tested and later pushed live outside the planned workflow. If the bugs are not urgent, the fixes are pushed live in the next scheduled code release.

Continuous Integration Framework

Now that we have given you an insight into our release workflow, let’s take a look at how our test framework is implemented and what it is capable of.

The Tuenti test framework is under continuous development. It is a pretty complex tool which at the same time provides a lot of value to the development teams. We constantly face new challenges to solve. We keep developing features to help developers get feedback as soon as possible and in the clearest way, and we keep adding tweaks to the system to maximize speed and throughput.

Currently we have a test suite with over 10,000 tests. Some of these tests are slower than others: browser UI tests take an average of 16 seconds each, integration tests 2 seconds, and unit tests just a few milliseconds. The number of tests is constantly growing, at a pace of around 400 tests per month, making the full regressions slower and slower. As if that wasn’t enough, the rapid growth of the development team has made the amount of code to be tested skyrocket.
It currently takes our CI system around 110 minutes to run all those tests, and about 25 minutes when we use the pipeline strategy. For some it might be good enough, but there are certain scenarios in which we need very quick feedback to react ASAP, plus developers are always happy to get feedback as soon as possible.

How do we achieve all this? What challenges did we face and what challenges arose?

To achieve this we are using 21 fairly powerful machines (8 cores @ 2.50GHz, with 12 and 24 GB of RAM). Each job runs on a single machine, except for pipelined jobs for extra quick feedback. On each machine we execute tests in parallel in 6 isolated environments (as you can see in the figure below), which are fed by a test queue. Each environment has its own DB with the required fixtures and, for the browser tests, its own VNC environment, as we had problems with some tests failing when they lost browser focus to other browsers running tests in parallel.

Optimizing the code of the browser tests decreased the test regression time by about 20 minutes. We sped up builds by removing unconditional sleeps that weren’t strictly necessary in some parts of the test framework.

On the issues side, we have a few non-deterministic tests. These tests produce a different outcome each time they are run, in a way that does not seem to depend on code changes. If you have worked in test automation, you probably know what I’m talking about: those tests that fail from time to time, marking your build as failed for no apparent reason, and then pass when you run them again. Unstable tests are indeed a big issue. Some of the core benefits and goals of automation are lost when your suite has unstable tests: your suite is neither self-checking nor fully automatic (it requires manual debugging) and the test results are not repeatable. Unstable tests waste a lot of test engineers’ time in debugging to figure out what went wrong, and they make the suite look flaky and unreliable, which you don’t want. If your test regression becomes unreliable, developers will get angry and will always blame the framework when tests fail, no matter whether it was their fault or not.

To fight off this enemy, we took two strategies:
● Short-term strategy: we keep track of all the tests which have failed, and after the complete suite is finished, if the number of failed tests is below a given threshold, the build process automatically retries these tests in isolation, without parallel execution. If the tests pass, we modify the logs and mark them as passed. We implemented this approach to filter out false positives and save time analysing reports. It is pretty effective and does a good job of cleaning the reports, but it has some drawbacks. The retry task adds some 15 extra minutes to the whole build. The strategy doesn’t address the root of the problem, as the number of unstable tests will keep increasing. And finally, some tests do fail even after being retried, requiring manual debugging.
● Long-term strategy: when a test fails and then passes, it is added to an unstable tests report produced by Jenkins after every build. After the reports are produced, the Quality Assurance or the DevOps team temporarily removes the test from the regression suite and examines it to try to determine the root cause of the non-determinism. Once the test is fixed and stable, it is brought back to the suite. Thanks to this approach the number of unstable tests has dropped by 80%, making our regressions reliable and completely automatic in most cases.
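The two strategies fit together roughly as in the sketch below. It is written in JavaScript for illustration only; the threshold value, the function name and the shape of the results object are assumptions, and `runInIsolation` stands for whatever the build uses to rerun a single test without parallelism:

```javascript
// Sketch of the retry step; the threshold and data shapes are assumptions.
const RETRY_THRESHOLD = 20; // hypothetical: at or above this, failures are treated as real

// `results` maps test name -> 'passed' | 'failed'.
// `runInIsolation(name)` is a caller-supplied function returning true on pass.
function retryFailures(results, runInIsolation) {
  const failed = Object.keys(results).filter(function (name) {
    return results[name] === 'failed';
  });
  const report = { results: Object.assign({}, results), unstable: [] };
  if (failed.length === 0 || failed.length >= RETRY_THRESHOLD) {
    return report; // nothing to retry, or too many failures to blame on flakiness
  }
  failed.forEach(function (name) {
    if (runInIsolation(name)) {
      report.results[name] = 'passed'; // short-term: clean the report
      report.unstable.push(name);      // long-term: flag it for investigation
    }
  });
  return report;
}
```

A test that fails and then passes in isolation is marked as passed but also lands in the `unstable` list, which is the input for the long-term clean-up.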

When it comes to providing feedback ASAP, we can’t parallelize limitlessly: we don’t have infinite machines, and using more than one machine per job trades away total build throughput, so we need to find the number of machines that gives quick feedback while minimizing that loss. Also, running tests in more environments means more setup time (the built code has to be deployed in each environment, the databases have to be prepared for it, etc.). So far we have applied the following strategies:
● Early feedback: we provide results per test in real time as they are executed, and we run the tests which failed during the last build first, so that developers can check ASAP whether their tests have been fixed.
● Execution of selected tests: developers can customize their regressions to choose which tests they want to run. They can choose to run only unit and integration tests or only browser tests. In early development stages we advise running unit and integration tests, and finally acceptance tests upon merging to the integration branch. Developers can also specify the tests to run by using the phpunit @group annotation, so they have a fine-grained selection of the tests they need.
● Build Pipeline: for special jobs in which it is critical to get very quick feedback or for which we need reports on nearly all the changesets. For the pipeline we use 6 testing nodes. The pipeline is divided into 3 main steps: build and unit tests, main test regression, and aggregation of the test reports. We trigger 5 jobs to run the main regression in parallel. The pipeline produces results in about 25-35 minutes, compared to 110-120 minutes for normal builds. The drawback is that build pipelines require many machines and reduce the total build throughput per hour at peak times.
● On demand cloud resources: we used Amazon’s web services to get testing resources in the past as a proof of concept, but we decided to invest in our own infrastructure as it proved to be more convenient economically. Nowadays we are thinking of going back to on demand virtualization, as we need to produce quicker feedback for full regressions. Quicker feedback on complete regressions with a constantly growing number of tests can only be achieved by adding machines to our testing farm. However, as most builds are needed during peak hours, it would make sense to have on demand resources rather than acquiring new machines which would be under-used at off-peak hours.
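Two of these strategies are easy to sketch: running previously failed tests first, and splitting the queue across parallel jobs. The function names are invented for the example, and the round-robin split is an assumption rather than a description of how our queue actually distributes tests:

```javascript
// Sketch: order the queue so tests that failed in the previous build run first.
function prioritizeTests(allTests, previouslyFailed) {
  const failedSet = new Set(previouslyFailed);
  const first = allTests.filter(function (t) { return failedSet.has(t); });
  const rest = allTests.filter(function (t) { return !failedSet.has(t); });
  return first.concat(rest);
}

// Sketch: split the queue into roughly equal shards, one per parallel job
// (round-robin is an assumption made for this example).
function shard(tests, jobs) {
  const shards = Array.from({ length: jobs }, function () { return []; });
  tests.forEach(function (test, i) {
    shards[i % jobs].push(test);
  });
  return shards;
}
```

For the pipeline described above, `shard(prioritizeTests(tests, lastFailures), 5)` would give each of the 5 parallel regression jobs its own slice, with the tests that failed last time still scheduled early in each slice.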

Last week we held the latest edition of our Hack Me Up, an internal challenge to spend 24 hours building personal projects related to Tuenti. We had the two usual tracks, "product" and "geek", and almost 15 projects were finished in time to be presented.

On the product track the winning project was Clippy cupcake, by Oleg Zaytsev, the comeback of the old Office Word assistant Clippy, "helping you" use Tuenti by auto liking photos, searching for hot girls, and other funny tasks.

On the geek track we have David Iglesias as the winner again, with More Chocolate, an API to do batch inserts on Google Spreadsheets to save money, bandwidth and CPU time, and thus allow Tuenti to buy more chocolate for us hungry engineers ;)