Automating the migration of lodash to lodash-es in a large codebase with jscodeshift (2019-04-04)

Recently the Elements team needed to make a reasonably large change to the codebase: migrating over 300 files which imported lodash to instead import from lodash-es.

To automate this change we chose to write a codemod for jscodeshift, a tool by Facebook. The power of jscodeshift is that it parses your code into an Abstract Syntax Tree (AST) before transforming it, allowing you to write codemods that are smarter than ones based on regular expressions.

jscodeshift is a toolkit for running codemods over multiple JS files. It provides:

A runner, which executes the provided transform for each file passed to it. It also outputs a summary of how many files have (not) been transformed.

A wrapper around recast, providing a different API. Recast is an AST-to-AST transform tool and also tries to preserve the style of original code as much as possible.

Writing the jscodeshift transformer

The starting point of any jscodeshift codemod is the transformer function. The transformer function gives you the fileInfo of the file that the CLI is operating on, and the api.jscodeshift API. It's common to see this API reference aliased to j in examples.

export default function transformer(fileInfo, api) {
  // j is a reference to the api we will use later on
  const j = api.jscodeshift

  // Create a jscodeshift Collection from the source string
  const root = j(fileInfo.source)

  // Do some sort of transform on the Collection
  // .. omitted ..

  // Return the new code string
  return root.toSource()
}

When working with the API j, you pass it the file source and it returns a Collection. A Collection is an object containing an array of NodePath objects. The docs describe it as jQuery-like:

jscodeshift is a reference to the wrapper around recast and provides a jQuery-like API to navigate and transform the AST.

Finding the import declarations

The first thing we want to do is find all the import declarations that are sourcing Lodash modules. To better understand how we might do this, it's helpful to first explore what the AST looks like. The AST for our sample code contains many ImportDeclaration nodes, each of which contains a source string literal whose value we can check for Lodash.
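
For example, the lookup can be expressed in a few lines (this is the same find/filter used in the final transform below):

// Find all import declarations whose source value starts with "lodash"
const lodashImports = root
  .find(j.ImportDeclaration)
  .filter(nodePath => nodePath.value.source.value.startsWith("lodash"))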

We also want to understand whether the import declaration's specifier is the same as the module name. If it's not, it's important to capture this, because we might need to import the named functions with the as keyword in the future.

To do this we need to explore what the import specifier AST nodes look like.

// Get the first specifier...
const [specifier] = nodePath.value.specifiers
// ...and save the name
const name = specifier ? specifier.local.name : id

With that, we now have enough information to create new import specifiers. As we loop over each import declaration, we’ll populate an array of replacement specifiers we intend to use on our final import declaration.
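
Each replacement specifier is built with the j.importSpecifier builder; this is the same call that appears in the final transform below:

const replacementSpecifier = j.importSpecifier(
  j.identifier(id),  // the import id
  j.identifier(name) // the import "as" name, it might be the same as id
)
replacementSpecifiers.push(replacementSpecifier)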

Removing and replacing import declarations

Now that we have all of the replacement specifiers, we can look at replacing all the lodash/* import declarations we found with a single import declaration for lodash-es. To do so we'll need to maintain a reference to the first import declaration for later. All the other Lodash import nodes can be removed.

To remove nodes in jscodeshift, we'll need to wrap each NodePath in a j(nodePath) call and use remove().

if (!first) {
  first = nodePath
} else {
  j(nodePath).remove()
}

Using that first Lodash import reference, we can create the new import declaration.

To replace nodes in jscodeshift, we'll need to wrap the NodePath in a j(nodePath) call and use replaceWith().
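
As a sketch, that could look like the following (the final transform below takes a shortcut and mutates first.value in place rather than building a replacement node):

// Assuming `first` holds the NodePath of the first Lodash import
j(first).replaceWith(
  j.importDeclaration(replacementSpecifiers, j.literal("lodash-es"))
)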

Dealing with comments

The modifications we’ve applied so far are, are enough for our application code to be functional, but we will lose all the comments from the nodes with removed and replaced since, in the AST, comments are attached to nodes. In order to fix this problem, we’ll need to collect up all the comments and assign them to the final lodash-es import declaration.

let replacementComments = []

// Save all the comments
lodashImports.forEach(nodePath => {
  replacementComments = replacementComments.concat(nodePath.value.comments || [])
})

// Replace the comments
first.value.comments = replacementComments

Putting it all together

That’s it, now lets assemble it into our final transform function.

export default function transformer(fileInfo, api) {
  const j = api.jscodeshift
  const root = j(fileInfo.source)

  const lodashImports = root
    .find(j.ImportDeclaration)
    .filter(nodePath => {
      return nodePath.value.source.value.startsWith("lodash")
    })

  let first
  let replacementComments = []
  const replacementSpecifiers = []

  lodashImports.forEach(nodePath => {
    const id = nodePath.value.source.value.replace("lodash/", "")
    const [specifier] = nodePath.value.specifiers
    const name = specifier ? specifier.local.name : id

    const replacementSpecifier = j.importSpecifier(
      j.identifier(id),  // the import id
      j.identifier(name) // the import "as" name, it might be the same as id
    )
    replacementSpecifiers.push(replacementSpecifier)

    replacementComments = replacementComments.concat(nodePath.value.comments || [])

    if (!first) {
      first = nodePath
    } else {
      j(nodePath).remove()
    }
  })

  if (first) {
    first.value.specifiers = replacementSpecifiers
    first.value.source.value = "lodash-es"
    first.value.comments = replacementComments
  }

  return root.toSource()
}

Finally, run the codemod over your entire codebase with the CLI:

jscodeshift src -t lodash-es-imports.js --extensions=js,jsx

And the result? Checking the git diff, we see many changes like this:

 import React from "react"
 // isEmpty used because it works on arrays and objects
-import isEmpty from "lodash/isEmpty"
 // Note that myMapValues is different from mapValues
-import myMapValues from "lodash/mapValues"
+import { isEmpty, mapValues as myMapValues } from "lodash-es";
 import { connect } from "react-redux"

You can copy the complete example above and play with it in AST Explorer. You’ll need to turn on Transform -> jscodeshift in order to see the codemod output.

Happy codemod’n.

Speeding up CI (2018-12-06)

One of our development teams highlighted that their build was taking too long to run. We obtained a near three-times speed improvement, in large part by using newer AWS instance types and allocating fewer Buildkite agents per CPU.

Envato use the excellent Buildkite to run integration tests and code deployments. As a bring-your-own-hardware platform, Buildkite offers us a lot of flexibility in where and how these tasks run.

This means that we’re able to analyse how that build is using its hardware resources and try to work out a better configuration.

The build in question is for the “Market Shopfront” product: a React & node.js application written in TypeScript, built with webpack, and tested using Jest and Cypress.

On-branch builds were taking between ten and twenty-five minutes. master builds, which also include a separate build of a production container and a deploy to a staging environment, were taking between fifteen and forty minutes.

Builds should take less than five minutes: any longer and waiting for a build becomes a reason to switch to something else, forcing an expensive context switch back once the build has finished. Worse, a consistently failing build can easily consume an entire day, especially if it's only repeatable in CI.

The efforts described below are one part of a larger project to improve this build’s performance and use what we learned to improve other builds at Envato.

Investigation

The first thing that stood out to me was the very high variance in build times. This hinted that either:

the build was relying on third party APIs with varying response times, or

the build’s performance was being affected by other builds stealing its resources

The first possibility was quickly ruled out: the parts of the build that talk to things on the internet or in AWS showed the same level of variance as other parts of the build that are entirely local.

We don’t (yet) have external instrumentation on the build nodes, so we ssh’d into them individually and used the sysstat toolkit to watch the instance’s performance. We found that CPU was almost entirely utilised while memory, disk bandwidth, disk operations per second, and network throughput still had a fair amount of headroom. We also found that the CPU being fully utilised by concurrent builds on the same node was the cause of the large variance in build times.

This confirmed what several people in the team already suspected: we needed more CPU.

Exploratory Research

Dedicated spot fleets and Buildkite queues were created to perform indicative testing on the effect of different node configurations and classes on build performance.

The existing configuration was c3.xlarge and m3.xlarge spot instances with one agent per AWS virtual CPU.

We tried:

increasing the instance size from xlarge to 2xlarge

halving the number of agents per virtual CPU

moving to current generation c5 and m5 instances

using the newly released super-fast CPU z1d instances

We found that:

current generation instances provide a 50% speed increase over their previous generation counterparts

the difference between m5 and c5 instances was minimal

z1d instances provided a further 30% performance increase, but at double the cost

halving the number of agents per virtual CPU provided a performance increase

using smaller instance types meant steps more often needed to docker pull cache layers, which randomly increased build times

However, these results and findings are indicative only: only one sample was taken for each instance class.

Modern Instances

We isolated two steps from the build that were not network-dependent and were idempotent: the initial webpack build and the first set of unit tests. They were run multiple times on a set of instance types, using the avgtime utility to average the results.

This confirmed (at least, for these two steps) the indicative findings on the performance improvements offered by the newer instance types: c5 and m5 instances are approximately 50% faster than their older c3 counterparts for this type of work. It is also interesting that c5 and m5 instances are almost exactly as fast as each other for this step, despite the c5's reported 3 GHz versus the m5's 2.5 GHz.

Virtual vs “Real” CPUs

AWS advertises its instances as having a certain number of "virtual CPUs" or vCPUs. This can be misleading if you're not already familiar with Intel's Hyperthreading, where for every processor core that is physically present, two "logical" cores are made available to the operating system. AWS' vCPUs map directly to logical cores, not physical ones.

Our instances were configured to run one agent per logical core, not per physical core. This meant that even single-threaded build steps could take up to twice as long to run if the instance's CPU was fully taxed. This was originally a cost-saving measure based on the assumption that most build steps would spend their time waiting on network resources or other tasks. For this build queue that assumption proved to be incorrect.

We ran two benchmarks on a single c3.large: one with a single webpack build running, and one with two running in parallel. We also ran the same benchmark on a c5.large to determine whether the newer instance type provided better Hyperthreading optimisations.

On both classes of instance, running two identical steps at the same time on the same physical CPU nearly doubled the execution time versus running only one, despite the benefits offered by Hyperthreading.

Other findings: Docker COPY vs Bind Mounts

All of the tests above were run via docker run on a container without volumes or bind mounts: node_modules and the project's source were baked into the image via COPY . /app. Running the webpack build with these files bind mounted instead (via -v $(pwd):/app) showed a significant performance improvement.

Unfortunately, this isn’t something that we can easily take advantage of in our builds without making them significantly more complicated. Bind mounts also gave us no performance improvements when running the unit test step.

Configuration Changes

Based on the above results, we decided on two initial actions:

moving to c5d.2xlarge and m5d.2xlarge instances

halving the number of agents per virtual CPU

We opted for the d class instances as we wished to keep using the instance storage provided by the c3 and m3 class instances. Doubling the instance size while halving the number of agents per virtual CPU meant that we still had the same number of agents per spot instance, meaning that we’d increase build performance without increasing cache misses on docker image layers.

This was recorded as a set of Architectural Decision Records in the git repository containing the StackMaster configuration for this fleet so that future maintainers would know the context and thinking behind these changes.

Costs

Predicting how much this would increase costs was difficult: we anticipated that while each individual instance cost twice as much, we'd ultimately need fewer of them, as faster builds would mean that the spot fleet autoscaling rules would be triggered less frequently. We expected that the change would increase costs somewhat, as our fleet is configured to always have one instance running regardless of load, and we'd be doubling that instance's size.

We found that the change more than doubled the cost for this fleet: the newer instance types are in higher demand and therefore attract higher spot prices. Fortunately for us the original costs were very low, so this level of increase was not a big worry!

Impact

This change had an almost immediate and significant effect on branch builds, as shown in the scatter plot below from the middle of November. Master builds have also improved, but less so, as the deploy to staging adds a significant chunk of time.

Through this change builds have become both much faster and much more consistent: branch builds that previously took between ten and twenty-five minutes now take between four and ten, and master builds that took between fifteen and thirty-five minutes now take between seven and thirteen.

Other improvements have been made to this build, but of all of them it was this change that had the highest impact. We're now hoping to take what we've learned here and roll it out to a single consolidated fleet of agents that can be shared by all projects, rather than a single fleet per project. This will allow us to consider faster instance types (like the lightning-fast z1d instances) as we'll have fewer "idle" agents, offsetting costs.

Eagle-eyed readers will notice that the times in the scatter plot above are faster than the speculative improvements we expected in our initial runs. The above improvements aren’t the whole story, just the change we made with the highest impact: additional improvements were made in our webpack configuration, balancing of E2E tests between nodes, and docker layer caching strategies.

More on these further changes soon!

Migrating edge providers (2018-03-06)

Unknown to our users, we recently migrated edge network providers. This
involved some particularly interesting problems that we needed to solve
in order to migrate without impacting availability or integrity of our
services.

Before we get into how we made the move, let’s look at what an edge
network actually does.

An edge network allows us to serve content to users from a physically
nearby location. This allows us to deliver the content in a fast, secure
manner by avoiding sending every request to our origin infrastructure,
which may be physically distant from users.

On the security front, using an edge provider allows us to perform
security mitigations without tying up our origin infrastructure
resources. This becomes quite important when we start talking about
Distributed Denial of Service (DDoS for short) attacks that aim to
saturate your network and consume all of your compute resources making
it difficult for legitimate users to visit your site. By offloading the
defense against malicious traffic and mitigation work to a set of
purpose built servers distributed across the globe, you free up your
origin resources to do what they need to do and service your users.

DDoS + WAF

Malicious users are a very real threat and something we deal with on a
daily basis. The majority of these attacks aren't volumetric; however,
they can impact other users if they manage to generate enough requests
to slow down a particular part of our service ecosystem.

Depending on the type of malicious traffic, we have two options: a Web
Application Firewall (WAF) and DDoS scrubbing.

WAF is used for most of our mitigations. This is a series of rule
sets that have been developed over the years based on attacks that we’ve
seen against our services. We “fingerprint” a large sample of requests
and then extract out common traits to either block the traffic,
tarpit the request or perform a challenge that
requires human interaction to proceed.

DDoS scrubbing comes into play when we have a highly distributed
attack or we are seeing high volumes of network traffic. The goal of
this mitigation is to filter out the malicious requests (much like using
WAF) however it is usually done far more aggressively and involves
inspecting other aspects than just HTTP.

Prior to the move, these were two separate systems and the request flow
looked like this:

This setup wasn’t perfect and lead to a few issues.

Debugging was very difficult. To get to the bottom of any request
issues, we needed to use both systems to piece together the full
picture. While both systems had correlation IDs that we could map
against each other, it was easy to get confused about which part of the
request/response you were looking at in either system.

Coordinating changes was hard. As we added additional features to
either our DDoS defense or WAF we needed to do some extra work to
ensure that rolling out changes in one would continue to work with the
other.

API differences. We are big users and advocates of Infrastructure
as Code; however, only some of our service providers offered this
facility, and even then only for limited portions of their services.
This resulted in us either needing to use the UI with manual reviews
or only storing part of the configuration in code, which added
confusion about what went where.

Getting blocked in one system could be misunderstood by the other
system. If a user managed to trigger one of our WAF rules, the DDoS
system that had some basic origin healthcheck capabilities could read
the response as the origin being under load and start throwing
confusing errors. This would get in the way of finding the real issue
as you would get errors from both systems instead of just a single
one.

So, we set out to combine the two systems into a single port of call for
our traffic mitigation needs.

DNS

We maintain both internal and external DNS services. For our internal
DNS, we use AWS Route53 as that is already well integrated with our
infrastructure. However, externally we needed something that would do
all the standard stuff plus cloak our origin and prevent recursive
lookups from finding the origin.

Something else we wanted to improve was the auditability of our DNS zone
changes. Our existing DNS provider didn’t lend itself very well to
managing the records as code. This resulted in changes needing to be
staged in a UI and then posted to Slack channels for review from other
engineers before being committed. Managing our DNS in code would help us
level up in our security practices because it would keep DNS changes
easily searchable and aid mitigating vectors like dangling DNS
vulnerabilities.

Preparing for the move

One of our biggest concerns with migrating these services was conformity
between the new and old. Having discrepancies between the two systems
could cause a bunch of issues that, if not monitored, would create bigger
issues for us.

We decided that we would address this in the same way we prevent
regressions in our applications; we would build out a test suite. Our
engineering teams are very autonomous which meant that this test suite
needed to be easy to understand and use by the majority of the
engineering team since they could be potentially making changes and
would need to verify behaviour.

After some discussions, we landed on RSpec. RSpec
is already a well understood framework in our test suites and the
majority of our teams are using it daily. Even though using RSpec would
get us most of the way, we would still need to extend it to add
support for the expected HTTP interactions and conditions. To do this,
we wrote HttpSpec. This is our HTTP RSpec library that performs the
underlying HTTP request/response lifecycle and has a bunch of custom
matchers for methods, statuses, caching, internal routing and protocol
negotiation. Here is an example of something you might see in our test
suite:

This solved the issue for most of the functionality we were looking to
port; however, we still didn't have a solution for DNS. We started putting
together a proof of concept that relied on parsing dig responses and a
short while later decided that wasn’t scalable to our configuration due
to the number of variations that could be encountered. This prompted us
to go in search for a more maintainable tool. Lucky for us, Spotify had
already solved this issue and open sourced rspec-dns.
rspec-dns was a great option for us since it could be integrated into
our existing RSpec test suites and gave us the same benefits we wanted
in our edge test suite. This is what our DNS tests looked like:

Now that we had a way of confirming behaviour on both systems, we were
ready to migrate!

Making the move

The second big issue we hit was that the two providers didn’t use the
same terminology. This meant that a "zone" in provider A wasn't
necessarily going to be the same thing in provider B.

Remedying this wasn’t a straight forward process and required a fair
amount of documentation diving and experimentation with both providers.
In the end, we built a CLI tool that took the API responses from our old
provider and mapped them to what our new provider expected to manage
the equivalent resources. This greatly reduced the chance of human error
when migrating these resources and ensured that we would be able to
reliably create and destroy resources over and over again. An upside of
taking this automated approach is that we could couple resource
creation with spec creation. For instance, if the CLI tooling found a
DNS record in provider A it would also update our specs to include an
assertion based on what was going to be created in provider B (yay for
free test coverage!).

As an additional safety measure, we configured our edge and DNS test
suite to run hourly (outside of regular Pull Request triggered builds)
and trigger notifications for any failures. This ensured that we were
constantly getting feedback on a quickly changing system if we broke
anything.

To keep the blast radius of changes as small as possible while we were
gaining confidence in the migration process, we migrated the systems in
order of traffic and their potential for customer impact. By taking this
approach, we were able to give stakeholders confidence that we could
bring over larger systems without impacting users.

Once we were happy the site was working as expected, we would release
it to our staff using split horizon DNS to gain some confidence that
nothing had been missed. If a regression was
found, we’d go back through the TDD process until we were completely
confident in our changes.

After we were happy with the testing, we’d schedule some time to perform
the public cut over. On cut over day, the migration team would jump into
a Hangout and start stepping through the runbook and monitoring for any
abnormal changes.

A caveat to note about DNS NS record TTLs: despite taking precautions
such as lowering the NS TTLs weeks beforehand, the TTLs of NS records
are pretty well ignored by most implementations. This means that while
you may cut over at 8am on a Monday, the change may see a long tail
until full propagation is achieved. In our case, this took up to 3 or 4
days in some regions. For this reason we introduced a new 24x7 on-call
roster that would help the system owners mitigate this issue should we
need to roll back the cut over.

Final thoughts

Embarking on a migration project for your edge network provider is no
small feat and it definitely isn’t without risks. However, we are
extremely pleased thus far with the improvements and added
functionality that we have gained from the move.

In the future, we will be looking to integrate our edge provider closer
with our origin infrastructure. The intention behind this is to automate
away some of the manual intervention we currently perform when applying
traffic mitigation. In the long term this will help us build a safer and
more resilient Envato ecosystem.

Building a scalable ELK stack (2018-02-11)

An ELK stack is a combination of three components, Elasticsearch, Logstash and Kibana, which together form a Log Aggregation system. It is one of the most popular ways to make log files from various services and servers easily visible and searchable. While there are many great SaaS solutions available, many companies still choose to build their own.

When we set about building a log aggregation system, these were the requirements we had:

Durability: we need to persist logs long term with a very high level of confidence.

Integrity: we need to ensure logs cannot be tampered with in the event of a security breach.

Maintainability: we have a small team with limited resources to operate and manage the platform.

Scalability: we need to be able to deal with a very large number of events that may spike during peak periods, outages and when bugs are introduced on the producer end.

ELK stacks are not known for their Durability or Integrity, so we had to think outside the box to solve these problems. But they are fairly easy to maintain and scale when they are designed and implemented thoughtfully. Logstash can become particularly unruly if not implemented carefully.

Structured Logs

“You want structured data generated at the client side. It isn’t any harder to generate, and it’s far more efficient to pass around.” - Charity Majors

We decided early on to push the complexity of making logs easy to consume onto the application or server that produced them. Each server has an agent that collects logs from a file or network socket. If the logs are already structured, the agent simply forwards them on; if they are not, it parses them first.
This greatly simplifies the Logstash configuration. Instead of having dozens of rules covering a variety of applications (nginx, HAProxy) and frameworks (Rails, Phoenix, NodeJS), we have a single JSON format.
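
For illustration, a structured event in this scheme might look something like the following (the field names here are hypothetical, not our exact schema):

{"time": "2018-02-11T14:45:00Z", "level": "info", "app": "example-app", "message": "GET /items 200", "duration_ms": 12}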

Queuing

“(A Queue) is imperative to include in any ELK reference architecture because Logstash might overutilize Elasticsearch, which will then slow down Logstash until the small internal queue bursts and data will be lost. In addition, without a queuing system it becomes almost impossible to upgrade the Elasticsearch cluster because there is no way to store data during critical cluster upgrades.” - logz.io

Since building our modified ELK stack we have experienced incidents where the number of logs being created was greater than we were able to process. Having a queue in place was invaluable to buffer the logs whilst we caught up. It also allows us to easily take down any part of our stack for maintenance without losing logs. For the queue we chose AWS Kinesis, because it scales well beyond what we will need and we don’t have to manage it ourselves.

Logstash now has Persistent Queues, which are a step in the right direction, but they are not turned on by default.

Persistent Queues also have an important side effect: without them, Logstash is a stateless service that can be treated like the rest of your infrastructure, built up and torn down whenever you want. With persistence they become stateful and your Logstash instances become harder to manage. They have to be managed like a database: backed up, disk space carefully measured, and data replicated somehow between multiple data-centres to make them highly available. Many people use Kafka or Kinesis to form a queue in front of Logstash, which comes with built-in replication. Logstash may be more manageable than Kafka, particularly on a smaller scale, but Kafka and Kinesis are a lot more robust and have been developed with durability as a primary concern.

Durability

While having a queue improves the durability of our logs in ElasticSearch there’s still a risk we could lose them from there. We can replicate our shards across multiple availability zones, but there’s still the risk, through human error or catastrophic failure, that logs could be lost. So we configure all our Log Agents to forward logs to both a Kinesis Stream and a Kinesis Firehose Delivery Stream. This Delivery Stream persists the logs to an S3 bucket for long term archival.
In the event of a failure we can always retrieve logs from S3.

Integrity

One of the many benefits of Centralised Logs is providing an audit trail of actions taken by users in any given system. During a security breach they form an essential piece of evidence for establishing what was breached, how and potentially by whom.
But if the log servers themselves are breached, the audit trail could be modified, rendering it useless. This forms the basis of the PCI DSS 10.5.2 requirement: "Protect audit trail files from unauthorized modifications."

A side note about Logstash Plugins

While Logstash core is a robust service, there are many community maintained plugins with varying levels of maturity. In an early implementation of this platform we attempted to load some logs from S3 using the logstash-input-s3 plugin. However we had many issues:

The plugin doesn’t support assuming a role to read logs, which was necessary since the S3 bucket is in a different account and the objects are sometimes owned by a third account (such as ELB and CloudTrail logs). We had to write the code ourselves and raised a Pull Request upstream

We also tried to use a number of Codec plugins to parse CloudTrail and CloudFront logs, but had many issues including a lack of compatibility with Logstash 5.
In the end we dumped all plugins except logstash-input-kinesis.

The curious case of the blocked pipeline

Another issue we faced was that Logstash seemed unable to keep up with CloudTrail logs. This manifested as all logs being delayed several hours at regular intervals. We tried to report metrics from the S3 input plugin, but this proved difficult to get working. To understand why, we need to look at the structure of a CloudTrail payload as delivered to S3:
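
Abbreviated, the payload is a single JSON object whose Records array holds the individual events (the field values shown are illustrative):

{
  "Records": [
    { "eventVersion": "1.05", "eventName": "AssumeRole", ... },
    { "eventVersion": "1.05", "eventName": "PutObject", ... }
  ]
}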

Inside the Records array there can be hundreds of events. This means Logstash has to deserialise a large (several megabyte) JSON file before passing it down the pipeline. In testing we found that Ruby, even JRuby which is used by Logstash, would take dozens of seconds to load such a large JSON string.
Ultimately Logstash is designed to deal with streams of logs and not large serialised payloads. Our theory is that the CloudTrail logs were choking all the worker threads, causing all logs to be delayed.
The number of Logstash instances, with their RAM and CPU requirements, needed to ingest all our CloudTrail logs was cost prohibitive.

To solve this we looked at ways to pre-process the events before they were consumed by Logstash. Writing a small program to chunk the JSON into smaller events and feed them to the Kinesis Stream also simplified our architecture, since all events now come from Kinesis, and reduced the responsibility and complexity of the Logstash implementation. We decided to write this program in Golang, as benchmarks showed it was six times faster at deserialising large JSON strings than Ruby or JRuby.
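
The core of the idea, sketched in JavaScript for illustration (the production pre-processor was written in Golang, and the stream name below is hypothetical):

const AWS = require("aws-sdk")
const kinesis = new AWS.Kinesis()

async function chunkAndForward(payload) {
  // One CloudTrail payload holds hundreds of events in its Records array;
  // emit each event as its own small Kinesis record.
  const { Records } = JSON.parse(payload)
  const records = Records.map(event => ({
    Data: JSON.stringify(event),
    PartitionKey: event.eventID, // spread events across shards
  }))

  // PutRecords accepts at most 500 records per call, so send in batches.
  for (let i = 0; i < records.length; i += 500) {
    await kinesis
      .putRecords({
        StreamName: "example-log-stream", // hypothetical name
        Records: records.slice(i, i + 500),
      })
      .promise()
  }
}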

Tuning Logstash for performance

The issue with CloudTrail was just one of the performance issues we have experienced with Logstash. Over time we developed a better understanding of how Logstash performs and how to troubleshoot it. The Elastic guide provides some insights into how you can scale worker threads and batch sizes to get better utilisation and throughput. However, these settings are one-dimensional: they generally only affect the filter and output components, not performance issues in the input plugin. For instance, the issue discussed above is exacerbated by the fact that the S3 input plugin is single-threaded. Adding more worker threads or increasing the batch size does not improve it.

Given a Logstash Pipeline, consisting of input, filter and output plugins, how do we find the bottleneck? A simple way is to start with the guide provided by Elastic and see if this improves performance.

The pipeline.workers setting determines how many threads to run for filter and output processing.

If increasing the pipeline.workers setting improves performance then great! If not, the issue could still be an input, filter or output plugin. To determine which, we can dive into the Logstash API:

$ curl localhost:9600/_node/stats/pipeline?pretty=true

This API, available in Logstash 5.0 and greater, gives a breakdown of the number of events that have executed in the pipeline and how long each step has taken. To understand what these numbers mean we first need to understand a bit more about how the Logstash Pipeline works:

Each input stage in the Logstash pipeline runs in its own thread. Inputs write events to a common Java SynchronousQueue. This queue holds no events, instead transferring each pushed event to a free worker, blocking if all workers are busy. Each pipeline worker thread takes a batch of events off this queue, creating a buffer per worker, runs the batch of events through the configured filters, then runs the filtered events through any outputs.

With this in mind, and some basic queuing theory, we can see how increasing pipeline.batch.size and pipeline.workers may improve throughput. If we process more events at a time we can reduce the number of network calls made to the output we are feeding. If we have more workers, we can process more events at a time. However, if the input or output plugin is single-threaded, or a filter plugin takes a long time to process each event, more CPU and RAM needs to be dedicated to Logstash (either by creating more instances, or adding more resources to the existing instance), which becomes cost prohibitive.

A blocked pipeline, in which the input worker is waiting for a free thread to handle a new batch of events, can be discovered by looking at the queue_push_duration_in_millis statistic from the node pipeline stats. For this example we’ll look at a Logstash instance we use that takes inputs from syslog and forwards them to a Kinesis Stream and a Firehose Delivery Stream:

Here we can see the “firehose” output plugin is taking a lot longer than others. 99% of time is spent in the firehose plugin! As it turns out the Firehose output plugin we were using was sending each event one at a time in a single thread, limiting it to about 20 events per second.

Summary

Building any kind of event processing pipeline is not an easy task and Log Aggregation is no exception. But as with any pipeline, decoupling components to reduce their responsibilities greatly simplifies the architecture and makes it easier to scale and troubleshoot issues. By taking this approach to the problem of Log Aggregation we have been able to scale to consistently process 1000 events per second, with regular spikes exceeding 1300 events per second.

A real-world story of upgrading React Router to v4 in an isomorphic app (2017-05-08)

While working on the new Envato Market Shopfront app, the team agreed to always keep all the dependencies in the project up to date. Sometimes that meant a straightforward patch or minor version upgrade, but sometimes it meant breaking changes that needed a whole lot of thought. The upgrade to react-router v4 happened to be a good example.

For context, the project uses Webpack (v2.3.3) and Babel for bundling JavaScript for server and browser.

My original plan was to upgrade all the dependencies in one pull request. But when it came to the React-related package families, things started to get out of control. For those who are using React in their projects already, you may have heard about the changes in React v15.5.0.

The biggest change is that we’ve extracted React.PropTypes and React.createClass into their own packages.

This means that every single component using those two packages or methods will have to be updated to use the new packages to get rid of all the deprecation warnings. Luckily, the React team always provides nice codemods, via react-codemod, to automatically migrate the code.

But what about third party React related modules? If you’ve chosen your project’s packages wisely and with a little luck, the package author would have already released a new version to support the latest release of React and, even if that’s not the case, this might be a good opportunity to give back by sending a pull request to the repo.

Everything went pretty smoothly until it came to upgrading React Router. We were on v2.8.1: did we want to upgrade to v3 or to v4?

Considering all the changes we'd already made to the other React packages, I thought that there might be too many changes in one pull request, so in the end I decided to try to only update to v3 (as I'd heard React Router v4 had changed dramatically since the previous version) in a separate pull request.

According to the change logs, it seemed to me that the biggest change from v2 to v3 for React Router was to withRouter.

Add params, location, and routes to props injected by withRouter and to properties on context.router

This turned out to be a big problem for us because we depend on the location object heavily for critical search query filters, SEO and other things. Previously, location was not injected by withRouter, so we were passing a modified version of it from the very top page level down to the components which needed access to location.

And not by coincidence, those components also use withRouter to do props.router.push for page transitions (router here is injected by withRouter into the component props). Now that the newer version injects location as well, we would have lots of conflicts around the location object.

Because the code is heavily dependent on React Router and we can't change the internal API provided by it, we can only modify or rename the location we are passing down, which is not a small amount of work.

Considering the amount of work to get from v2 to v3, why not upgrade to v4 directly?
I decided to give it a try.

The how

Before reading on, I highly recommend reading the migration guide from the official GitHub repo first.

The first change I made was to install the new react-router-dom package and update all the references in the code from react-router to react-router-dom. For those who don't know what react-router-dom is and how the packages differ, the short answer is that react-router is the core package, while react-router-dom and react-router-native build on it for the web and React Native respectively. For a web based project, react-router-dom is usually what you need.
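
The mechanical part of the change is small; a sketch with illustrative imports:

// Before (v2/v3)
import { Router, Route, browserHistory } from "react-router"

// After (v4)
import { BrowserRouter, Route } from "react-router-dom"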

Another big difference is that instead of having a centralised route configuration for your application and rendering children based on the router, you now define routes as normal child components inside whichever component needs to render content based on the current location.

However this doesn’t really work for us because we are doing isomorphic rendering. The key for achieving isomorphic rendering is the ability to pre-fetch data before calling React.renderToString so the content(HTML) you sent to browser will have the required data.

In the previous version, we normally had a central routes config like this:
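
A reconstructed sketch, with illustrative component names:

import React from "react"
import { Route } from "react-router"
import App from "./components/App"
import SearchPage from "./components/SearchPage"

export default (
  <Route path="/" component={App}>
    <Route path="search" component={SearchPage} />
  </Route>
)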

On the server side, when the request comes in, we can have an express.js middleware like this to handle and render the content:

export default (req, res) => {
  match({ routes, location: req.url }, (error, redirectLocation, renderProps) => {
    // here we assume we have defined a `loadData` static method on the
    // component where we want to pre-fetch data
    const prefetchingRequests = renderProps.components.map(component => {
      if (component && component.loadData) {
        return component.loadData(renderProps)
      }
    })

    Promise.all(prefetchingRequests).then(prefetchedData => {
      const HTML = React.renderToString(<App data={prefetchedData}></App>)
      res.send(HTML)
    })
  })
}

What about v4?

In v4, there is no centralized route configuration. Anywhere that you need to render content based on a route, you will just render a component.

There’s no central routes config anymore, how do we co-locate the static loadData method on the render component tree?

Luckily there is someone already doing this for us! There’s a package named react-router-config from the react-router team.

To achieve the same result, we now just have to adjust our routes config into something like this:
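
Again a reconstructed sketch, following the react-router-config route object format (component names are illustrative):

const routes = [
  {
    component: App,
    routes: [
      { path: "/search", component: SearchPage },
    ],
  },
]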

And 💥, we now have server side data pre-fetching working with React Router v4.
The last thing we need to fix is client side data fetching. When the user switches routes inside the browser, we also need to trigger the same requests to load the new data needed to render the new content.

In the previous version, we could use browserHistory.listen to watch for client-side route changes and trigger the network requests:

browserHistory.listen(location => {
  match({ routes, location }, (error, redirectLocation, renderProps) => {
    // same as what we did for server side
  })
})

This could still work with the new version, but we can also follow the example given in the react-router-config repo: create a special component, use withRouter to attach the location object to the component props, and then use componentWillReceiveProps to listen for location changes and trigger the network request.

componentWillReceiveProps(nextProps) {
  const navigated = nextProps.location !== this.props.location
  const { routes } = this.props

  if (navigated) {
    // save the location so we can render the old screen
    const prefetchingRequests = matchRoutes(routes, window.location.pathname)
      .map(({ route, match }) => {
        return route.component.loadData
          ? route.component.loadData(match)
          : Promise.resolve(null)
      })

    Promise.all(prefetchingRequests).then(prefetchedData => {
      // do things with new data
    })
  }
}

Another benefit of using componentWillReceiveProps over browserHistory.listen is that you have the context of both the previous location and the current location, so you can implement a shouldFetchNewData check to prevent making expensive network requests.
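
A hypothetical sketch of such a check (the logic here is an assumption, not the shipped code):

shouldFetchNewData(prevLocation, nextLocation) {
  // Only refetch when the path or query string actually changed
  return (
    prevLocation.pathname !== nextLocation.pathname ||
    prevLocation.search !== nextLocation.search
  )
}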

Conclusion

Voila! That was pretty much what we needed to do to upgrade an isomorphic React app from React Router v2 to v4. I’ve definitely learned a lot from it:

It was probably a bad idea to depend so much on a routing library so deeply nested in the application component tree. What we should probably do next is make the location part of the redux store; this way, the next time the location object changes, we simply update it in the redux store without having to modify things all over the code base.

Some of you reading this article may be wondering why I did the upgrade at all; I asked myself the same question in the middle of doing it. Is it just because we want to keep everything up to date? I'm not sure. Maybe not. What I did figure out is that there are still quite a lot of places in our code which could be improved.

Last but not least, I wrote this article because when I went to do the upgrade, I couldn't find any existing example I could refer to, and I hope you find this one helpful.

Remedying the API gateway (2017-03-22)

To expose our internal services to the outside world, we use what is
known as an API Gateway. This is a central point of contact for the
outside world to access the services Envato Market uses behind the
scenes. Taking this approach allows authors to leverage the information
and functionality Envato provides on its marketplaces within their own
applications without duplicating or managing it themselves. It also
benefits customers who want to programmatically interact with Envato
Market for their purchases instead of using a web browser.

The old API gateway

The previous generation API gateway was a bespoke NodeJS application
hosted in AWS. It was designed to be the single point of contact for
authentication, authorisation, rate limiting and proxying of all API
requests. This solution was conceived one weekend as a proof of concept
and was quickly made ready for production in the weeks that followed.

This solution worked well and allowed Envato to expose a bunch of
internal services via a single gateway, removing the need to know which
underlying service it was connecting to and how to query it correctly.

Here is an overview of how the infrastructure looked:

Whilst building a Ruby client for the Envato API I noticed some
niggling issues that I fixed internally. However, throughout the whole
process, I was getting intermittent empty responses from the
gateway. This was annoying but at the time I didn’t think much of it
since my internet connection could have been to blame and there wasn’t
any evidence of this being a known issue.

March 2016 saw Envato experience a major outage on the private API
endpoints due to a change that incorrectly evaluated the authorisation
step, resulting in all requests getting a forbidden response. You can
read the PIR for full details; however, during this outage many of our
authors got in touch and conveyed their justified frustrations. Due to
this incident, we implemented a bunch of improvements to the API and
created some future tasks to address issues that weren't user facing
but would help us answer some questions we had about the reliability
of our current solution.

Following on from these discussions, in April a couple of our elite
authors got in touch regarding some ongoing connectivity issues with
the API. They were experiencing random freezes in requests that would
eventually just time out without a response or warning. During the
conversations, they also mentioned they would see an occasional empty
body in the responses. We spent a great deal of time investigating these
reports and working with the elite authors to help mitigate the issue as
much as possible. We finally managed to trace down some problematic
requests and began trying to replicate the issue locally.

Even though we were able to eventually reproduce the issue locally, it
was very difficult to isolate the exact cause of the problem for a
number of reasons:

The single API gateway application had so many responsibilities and
tracing requests showed it crossing concerns at every turn.

We were using third party libraries for various parts of functionality,
however the versions we were running were quite old and included many
custom patches we added along the way to fit our needs.

The proxying functionality (used for sending requests to the backends)
didn’t perform a simple passthrough. There was a great deal of code
covering discrepancies in behaviour between backends and the content
was rewritten at various stages to conform to certain expectations.

All of the above points were made even more difficult since we have very
little in-house support for NodeJS and those who are familiar with it
are primarily working on the front end components, not the backend so
this was a new concept for them too.

After spending a few weeks trying to diagnose the issue, we realised
we weren’t making enough headway and we needed a better strategy. We got
a few engineers together and started working on some proposals to solve
this for good. During the meeting we decided that going forward, NodeJS
wasn’t going to work for us and it needed to be replaced with a solution
that handled our production workload more effectively and that we knew
how to run at scale.

The meeting created the following action items:

Throw more hardware into the mix with the aim of reducing the chance
of hanging requests by balancing the load over a larger fleet of
instances. While this didn’t solve the issue entirely, it would allow
our consumers hit this issue less often.

Find a replacement solution for the NodeJS gateway. It needed to be
better supported, designed in a way that allowed us to have better
visibility, be highly scalable and fault tolerant.

The new API gateway

Immediately after the meeting we scaled out the API gateway fleet and
saw a drop off in the hanging requests issue. While it wasn’t solved,
we saw significantly fewer occurrences and eased the pressure.

We started assessing our requirements for the new API gateway and came
up with a list of things that we set as bare minimums before a solution
was considered viable:

Must isolate responsibilities. If a single component of the service
was impaired, it should not impact the rest.

Must be able to be managed in version control. This was important for
us since we are big fans of infrastructure as code and all of our
services take this approach to ensure we can rebuild our
infrastructure reliably each time, every time.

Must be able to maintain 100% backwards compatibility with existing
clients so that our consumers don’t need to redo their whole
applications to fit our changes.

Have great in-house support. If something goes pear-shaped, we have
the skills to solve the problems.

Following some trialling of PaaS and in-house solutions we landed on
AWS API gateway. This met all of our criteria and employed many
AWS products we were already familiar with which made the transition far
smoother. However, a problem for us was that much of the functionality we
needed was still under development by AWS and for a long time, we were
building against a private beta of the service and hit various bugs that
were still being addressed by the AWS teams.

We finally managed to ship a private beta of the service to a select few
elite authors in late November and after ironing out a few bugs we
found, we dark launched the new gateway to public use in January.

Here is what the infrastructure and request flow looks like (as of this
writing):

This new infrastructure has allowed us to meet all the requirements we
set out, while also removing a bunch of the confusion around which
components are associated with which responsibilities. When we go to
perform changes to a piece of this infrastructure, we know exactly what
the impact will be and how to best mitigate it. The move has also given
us a bunch of improvements around scalability and resiliency. Now if we
experience a request surge the gateway infrastructure is able to scale
to meet the needs instead of throwing errors because all the available
resources have been exhausted.

While it’s still early days, we are far more confident in the API
Gateway’s reliability. Since the move we have full visibility into each
component, which was lacking before and a major cause of frustration.
Consequently we are able to measure the availability and act quickly
when a component fails.

Cloudbleed impact on envato.com users (2017-02-27)

You may have recently heard reports or seen news about a security bug called "Cloudbleed" affecting sites served by Cloudflare. Envato delivers some websites using services provided by Cloudflare; however, Cloudflare have confirmed that none of our websites are directly affected by this security bug. Cloudflare published a detailed explanation of what the bug is and how it came to be; you can read it on their blog.

UPDATE Since the original publication of this post, Cloudflare have released a follow up blog post with information they have learned in their investigations. The second article focuses more on explaining the real-world impact of the bug, rather than the technical details.

How does the security bug impact you?

The security bug has caused a very tiny percentage of requests served through Cloudflare to contain information from other unrelated sites. In an even smaller percentage of cases, some of this leaked information included usernames, passwords, and other private information.

Envato takes security very seriously, so as a precautionary measure we have:

Expired all current login sessions on all Envato websites that use Cloudflare services. Despite being extremely confident session data was not exposed by this bug, we took this step to make 100% sure that even if session data was exposed it was no longer valid and could not be used to access your account.

Replaced all credentials that Envato systems use with other service providers that may have also been affected by this bug.

Whilst we are confident no usernames or passwords for Envato websites were leaked through Cloudflare, if you used the same password somewhere else it may have been compromised. If you are at all unsure we recommend changing your password.

Post-mortem report: 19 October 2016 (2016-10-23)

On Wednesday 19 October, Envato Market sites suffered a prolonged incident and were intermittently unavailable for over eight hours. The incident began at 01:56 AEDT (Tuesday, 18 October 2016, 14:56 UTC) and ended at 10:22 AEDT (Tuesday, 18 October 2016, 23:22 UTC). During this time, users would have seen our "Maintenance" page intermittently and therefore would not have been able to interact with the sites. The issue was caused by an inaccessible directory on a shared filesystem, which in turn was caused by a volume filling to capacity. The incident duration was 8 hours 26 minutes; total downtime of the sites was 2 hours 56 minutes.

We’re sorry this happened. During the periods of downtime, the site was completely unavailable. Users couldn’t find or purchase items, authors couldn’t add or manage their items. We’ve let our users down and let ourselves down too. We aim higher than this and are working to ensure it doesn’t happen again.

In the spirit of our “Tell it like it is” company value, we are sharing the details of this incident with the public.

Context

Envato Market sites recently moved from a traditional hosting service to Amazon Web Services (AWS). The sites use a number of AWS services, including Elastic Compute Cloud (EC2), Elastic Load Balancing (ELB), and the CodeDeploy deployment service. The sites are served by a Ruby on Rails application, fronted by the Unicorn HTTP server. The web EC2 instances all connect to a shared network filesystem, powered by GlusterFS.

[09:13] Our Gluster expert identifies a problem with one directory in the shared filesystem

[10:22] A fix to use a different shared directory is deployed, restoring the site to service.

Analysis

This incident manifested as five “waves” of outages, each subsequent one occurring after we thought the problem had been fixed. In reality there were several problems occurring at the same time, as is usually the case in complex systems. There was not one single underlying cause, but rather a chain of events and circumstances that led to this incident. A section follows for each of the major problems we found.

Disk space and Gluster problems

The first occurrence of the outage was due to a simple problem which went embarrassingly uncaught: our shared filesystem ran out of disk space.

As shown in the graph, free space started decreasing fairly quickly prior to the incident, dropping from around 200 GiB to 6 GiB in a couple of days. Low free space isn't a problem in and of itself, but the fact that we didn't recognize and correct the issue is a problem. Why didn't we know about it? Because we neglected to set an alert condition for it. We were collecting filesystem usage data, but never generating any alerts! An alert about rapidly decreasing free space may have allowed us to take action to avoid the problem entirely. It's worth mentioning that we did have alerts on the shared filesystem in our previous environment but they were inadvertently lost during our AWS migration.

An out-of-space condition doesn’t explain the behavior of the site during the incident, however. As we came to realize, whenever a user made a request that touched the shared filesystem, the Unicorn worker servicing that request would hang forever waiting to access the shared filesystem mount. If the disk were simply full, one might expect the standard Linux error in that scenario (ENOSPC No space left on device).

The GlusterFS shared filesystem is a cluster consisting of three independent EC2 instances. When the Gluster expert on our Content team investigated, he found that the full disk had caused Gluster to shut down as a safety precaution. When the lack of disk space was addressed and Gluster started back up, it did so in a split-brain condition, with the data in an inconsistent state between the three instances. Gluster attempted to automatically heal this problem, but was unable to do so because our application kept attempting to write files to it. The end result was that any access to a particular directory on the shared filesystem stalled forever.

A compounding factor was the uninterruptible nature of any process which tried to access this directory. As the stuck Unicorn workers piled up, we tried killing them, first gracefully with SIGTERM, then with SIGKILL; neither had any effect, since a process blocked in uninterruptible sleep cannot be signalled. The only option to clear these stuck processes was to terminate the instances.

Resolution

One of the biggest contributors to the extended recovery time was how long it took to identify the problem with the shared filesystem’s inaccessible directory: just over seven hours. Once we understood the problem, we reconfigured the application to use a different directory, redeployed, and had the sites back up in less than an hour.

GlusterFS is a fairly new addition to our tech stack and this was the first time we’d seen errors with it in production. As we didn’t understand its failure modes, we weren’t able to identify the underlying cause of the issue. Instead, we reacted to the symptom and kept trying to isolate our code from the shared filesystem. Happily, the issue was eventually identified and we were able to work around it.

Takeaway: new systems will fail in unexpected ways; be prepared for that when putting them into production.

Unreliable outage flip

In order to isolate our systems from dependent systems which experience problems, we’ve implemented a set of “outage flips” – basically choke points that all code accessing a given system goes through, allowing that system to be disabled in one place.

We have such a flip around our shared filesystem and most of our code respects it, but not all of it does. Waves 3 and 5 were both due to code paths that accessed the shared filesystem without checking the flip state first. Any requests that used these code paths would touch the problematic directory and stall their Unicorn worker. When all the available workers on an instance were thus stalled, the instance was unable to service further requests. When that happened on all instances, the site went down.
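
To make this concrete, here is a minimal sketch of a flip-guarded access in Ruby; the module, flip name, and mount path are hypothetical, not our actual implementation.

# A single choke point consulted by all code that touches the shared
# filesystem. In practice the flag might live in the database or a
# feature-flag service; a hash keeps this sketch self-contained.
module OutageFlip
  def self.flips
    @flips ||= {}
  end

  def self.enable!(system)
    flips[system] = true
  end

  def self.enabled?(system)
    flips.fetch(system, false)
  end
end

def read_shared_asset(path)
  # Refuse to touch the shared filesystem while it is flipped off, so a
  # hung mount cannot stall this Unicorn worker forever.
  return nil if OutageFlip.enabled?(:shared_filesystem)

  File.read(File.join('/mnt/shared', path))
end

Waves 3 and 5 were, in effect, code paths that bypassed a guard like the one above.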

Resolution

During the incident we identified two code paths which did not respect the shared filesystem outage flip. Had we not identified the underlying cause, we probably would have continued the cycle of fixing broken code paths, deploying, and waiting to find the next one. Luckily, as we fixed the broken code the frequency with which the problem reoccurred decreased (the broken code we found in wave five took much longer to consume all available Unicorn workers than that in the first wave).

Takeaway: testing emergency tooling is important; make sure it works before you need it.

Deployment difficulties

We use the AWS CodeDeploy service to deploy our application. The nature of how CodeDeploy deployments work in our environment severely slowed our ability to react to issues with code changes.

When you deploy with CodeDeploy, you create a revision which gets deployed to instances. When deploying to a fleet of running instances this revision is deployed to each instance in the fleet and the status is recorded (successful or failed). When an instance first comes into service it receives the revision from the latest successful deployment.

A couple of times during the outage we needed to deploy code changes. The process went something like this:

1. Deploy the application. (The deployment would fail on a few instances, which were in the process of starting up or shutting down due to the ongoing errors.)

2. Scale the fleet down to a small number of instances (two).

3. Deploy again to only those two instances.

4. Once that deployment was successful, scale the fleet back to nominal capacity.

This process takes between 20 and 60 minutes, depending on the current state of the fleet, so it can really impact the time to recovery.
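
For illustration, the scale-down and scale-up steps look roughly like this with the Ruby AWS SDK; the group name, capacities, and region here are hypothetical.

require 'aws-sdk-autoscaling'

asg = Aws::AutoScaling::Client.new(region: 'ap-southeast-2')

# Shrink the fleet so the deployment only has to succeed on two stable
# instances.
asg.set_desired_capacity(
  auto_scaling_group_name: 'market-web', # hypothetical group name
  desired_capacity: 2,
  honor_cooldown: false
)

# ...deploy via CodeDeploy, then restore nominal capacity...
asg.set_desired_capacity(
  auto_scaling_group_name: 'market-web',
  desired_capacity: 20 # hypothetical nominal capacity
)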

Resolution

This process was slow but functional. We will investigate whether we’ve configured CodeDeploy properly and look for ways to decrease the time taken during emergency deployments.

Takeaway: consider both happy-path and emergency scenarios when designing critical tooling and processes.

Maintenance mode script

During outages, we sometimes block public access to the site in order to carry out certain tasks that would disrupt users. To implement this, we use a script that creates a network ACL (NACL) entry in our AWS VPC to block all inbound traffic. We found that when we used this script, outbound traffic destined for the internet was also blocked. This was especially problematic because it prevented us from deploying any code.

CodeDeploy uses an agent process on each instance to facilitate deployments: it communicates with the remote AWS CodeDeploy service and runs code locally. To talk to its service it initiates outbound requests to the CodeDeploy service endpoint on port 443. When we enabled maintenance mode the agent was no longer able to establish connections with the service.

As soon as we realized that the maintenance mode change was at fault, we disabled it (and blocked users from the site with a different mechanism). After the incident, we investigated the cause further, which turned out to be an oversight in the design of the script. Our network is partitioned into public and private subnets. Web instances live in private subnets, and communicate with the outside world via gateways residing in public subnets. Traffic destined for the public internet crosses the boundary between private and public subnets, and at that point the network access controls are imposed. In this case, the internet-bound traffic was blocked by the NACL added by the maintenance mode script.
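
For reference, the kind of NACL entry the script creates looks like this with the Ruby AWS SDK (the ACL id and region are hypothetical). Because NACLs are stateless and evaluate rules in order, a low-numbered deny-all ingress rule on a public subnet also catches traffic arriving from the private subnets on its way out to the internet.

require 'aws-sdk-ec2'

ec2 = Aws::EC2::Client.new(region: 'ap-southeast-2')

# Deny all inbound traffic at the subnet boundary.
ec2.create_network_acl_entry(
  network_acl_id: 'acl-0123456789abcdef0', # hypothetical
  rule_number: 1,          # evaluated before the default allow rules
  protocol: '-1',          # all protocols
  rule_action: 'deny',
  egress: false,           # an ingress rule
  cidr_block: '0.0.0.0/0'
)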

Resolution

As soon as we realized that the maintenance mode script was blocking deployments, we disabled it and used a different mechanism to block access to the site. This was effectively the first time the script was used in anger, and although it did work, it had unintended side effects.

Takeaway: again, testing emergency tooling is important.

Corrective measures

During this incident and the subsequent post-incident review meeting, we’ve identified several opportunities to prevent these problems from reoccurring.

Alert on low disk space condition in shared filesystem

This alert should have been in place as soon as Gluster was put into production. If we’d been alerted about the low disk space condition before it ran out, we may have been able to avoid this incident entirely. We’re also considering more advanced alerting options to avoid the scenario where the available space is used up rapidly.

This action is complete; we now receive alerts when the free space drops below a threshold.

Add monitoring for GlusterFS error conditions

When Gluster is not serving files as expected (due to low disk space, shutdown, healing, or any other type of error) we want to know about it as soon as possible.

Add more disk space

On the day of the incident, we freed space by deleting some unused files from the server. We also need to add more space so we have an appropriate amount of “headroom” to avoid similar incidents in the future.

Investigate interruptible mounts for GlusterFS

The stalled processes which were unable to be killed significantly increased our time to recovery. If we could have killed the stuck workers, we may have been able to recover the site much faster. We’ll look into how we can mount the shared filesystem in an interruptible way.

Reconsider GlusterFS

Is GlusterFS the right choice for us? Are there alternatives that may work better? Do we need a shared filesystem at all? We will consider these questions to decide the future of our shared filesystem dependency. If we do stick with Gluster, we’ll upskill our on-callers in troubleshooting it.

Ensure all code respects outage flip

Had all our code respected the shared filesystem outage flip, this would have been a much smaller incident. We will audit all code which touches the shared filesystem and ensure it respects the state of the outage flip.

Fix the maintenance mode script

The maintenance mode script’s unintended side effect of blocking deployments extended the downtime unnecessarily. The script will be fixed to allow the site to function internally while still blocking public access.

Ensure incident management process is followed

We have an incident management process to follow, which (amongst other things) describes how incidents are communicated internally. The process was not followed appropriately, so we’ll make sure that it’s clear to on-call engineers.

Fire drills

The incident response process can be practiced by running “fire drills”, where an incident is simulated and on-call engineers respond as if it were real. We’ve not had many major incidents recently, so we need some practice. This practice will also include shared filesystem failure scenarios, since that system is relatively new.

Summary

Like many incidents, this was due to a chain of events that ultimately resulted in a long, drawn-out outage. By addressing the links in that chain, similar problems can be avoided in the future. We sincerely regret the downtime, but we’ve learned a lot of valuable lessons and welcome this opportunity to improve our systems and processes.

]]>2016-08-25T20:38:00+00:00http://webuild.envato.com/blog/moving-the-marketplace-to-awsIn a previous post, Envato Market: To The Cloud! we discussed why we moved the Envato Market websites to Amazon Web Services (AWS) and a little bit about how we did it. In this post we’ll explore more of the technologies we used, why we chose them and the pros and cons we’ve found along the way.

To begin with there are a few key aspects to our design that we feel helped modernise the Market Infrastructure and allowed us to take advantage of running in a cloud environment.

Where possible, everything should be an artefact:

- Source code for the Market site
- Servers
- System packages (services and libraries)

Everything is defined by code:

- Amazon Machine Images (AMIs) are built from code that lives in source control
- Infrastructure is built entirely using code that lives in source control
- The Market site is bundled into a tarball using scripts

Performance and resiliency testing:

- Form hypotheses about our infrastructure and then define mechanisms to prove them

We made a few technical decisions along the way to achieve these goals. Below we lay those decisions out, explain why they worked for us, and note some caveats we discovered.

The implementation

Auto Scaling Groups (ASG)

We rely heavily on Auto Scaling Groups (ASGs) to keep our infrastructure running; they are the night watchmen minding our servers so we don’t have to. At the core of designing infrastructure for the cloud is the concept of designing for failure: only when you embrace failure do you enable yourself to take advantage of the scalability and reliability of cloud services.

Every server lives in an Auto Scaling Group, which defines a healthcheck to ensure the server is running. If a server fails, it is terminated and replaced with a new one. We also run our ASGs across three Availability Zones (different data centres in the same region). If an Availability Zone fails, the failed servers are launched automatically in another.

In order to use ASGs we must define a server artefact to launch. To provide operational efficiencies we want that artefact to be built automatically.

Packer and Puppet

For simple servers, like our log forwarder, we use vanilla Packer with embedded bash to build AMIs. The JSON templates are our code and the AMIs our build artefact.

We’ve been using Puppet for a number of years to manage our servers and we’re comfortable with it. Since the migration took many months, it was also good to use the same code to define our servers in both our old and new environments, so we didn’t miss any updates or fixes. So for our application servers (which have by far the most complex requirements) we decided to build our AMIs with Puppet and Packer, using Buildkite to run the build for us to ensure consistency.

We also have a lot of ServerSpec tests for our infrastructure code. Running them locally on our laptops was sometimes a slow and buggy process, especially for those of us who work from home without fast internet. It’s also not entirely accurate, as the virtual machine on a laptop doesn’t exactly replicate the AMI being built. So we developed AmiSpec to help us utilise our Continuous Integration systems to test our servers before they go into production.

We build as much of the software and configuration as we can into our AMIs. This enables us to launch replacement instances quickly, but unlike Netflix, we don’t bake our application into the AMIs (a concept called “immutable AMIs”). This gives us a degree of flexibility to deploy as often as we do at lower cost, while still allowing us to launch new servers relatively quickly (generally within a few minutes).

Code Deploy

In a previous post we discussed how we implemented automated deploys. During the migration we moved our aging Capistrano deployment code to CodeDeploy. We make upwards of 40 changes and 18 deployments to our website a day, and we need those deployments to be reliable and fast. The change was significant but necessary; our existing deployment code had many problems:

It had grown organically over many years and resembled spaghetti code

The mixture of Bash and Ruby made the code difficult to read, write and reason about

It had zero unit tests

For all these reasons it was extremely fragile, which further prevented us from refactoring it. The result was a very brittle deployment approach that everyone wanted to avoid touching.

With CodeDeploy the deployment code continued to live in our source code repository, but since CodeDeploy handled downloading the source to every server we were able to write most of it in Bash. Some more complicated parts required Ruby, but even with Bash we are able to write tests using Bats. This differs from our previous approach of “shelling out” to Bash from Ruby, because we are defining specific functions in one language. Each component can then be unit tested and swapped out easily for another if necessary.

While we found CodeDeploy was a great choice for us, we also had a couple of issues that tripped us up more than once.

Diagnosing launch failures

It can be tricky to diagnose why a new instance fails to launch: CodeDeploy will automatically fail the launch of any instance to which it fails to deploy, causing the instance to be terminated. If your ASG is trying to scale up to meet desired capacity and this keeps happening, you end up launching new instances in a loop. This is very expensive if it goes unchecked, since you’re paying for an hour of instance time for each launch and can launch many instances an hour. We highly recommend that anyone using CodeDeploy monitor for this scenario and wake someone up if necessary to resolve it. To do this we chose Datadog, but there are other solutions we won’t cover in this article.

To troubleshoot this you should first check the CodeDeploy deployment log, available in the AWS Console. You can also use the get-console-output cli command to see the output from your instance at boot time to help understand if the server started correctly.
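
The same check is available from the Ruby SDK if you’d rather script it; a quick sketch with a hypothetical instance id and region:

require 'aws-sdk-ec2'
require 'base64'

# Fetch the boot-time console output for a suspect instance; handy when
# CodeDeploy keeps terminating new instances before you can log in.
ec2  = Aws::EC2::Client.new(region: 'ap-southeast-2')
resp = ec2.get_console_output(instance_id: 'i-0123456789abcdef0')
puts Base64.decode64(resp.output) # the API returns base64-encoded text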

Creating a new CodeDeploy application with no successful deployment revisions

If you have to re-create your CodeDeploy application then there are no healthy revisions of your application. When you have no healthy revisions it is impossible to get CodeDeploy to deploy if you use an Elastic Load Balancer (ELB) healthcheck.

Your instances won’t get the application deployed to them on boot because there is no previous “healthy” revision (that is, a revision that was successfully deployed). And because they have no application deployed, they never pass the ELB healthcheck, so you can’t deploy; the instances get stuck in a respawn loop as described above.

We chose to switch to EC2-based checks to work around this situation.

Concurrent deploys

There’s a limit of 10 concurrent deployments per account, and each instance you launch in a CodeDeploy deployment group counts as one deployment. When we wanted to scale up our ASG by more than 10 instances at a time, the excess instances failed to launch and were terminated because their heartbeat timed out (10 minutes by default). The maximum number of concurrent deployments is a service limit you can ask AWS to raise.

Starting the CodeDeploy agent on boot

If you require your user data to run before your app can be deployed, you need to start the CodeDeploy agent from your user data; otherwise the agent can start before user data has run, resulting in a race condition.

We found the AWS CodeDeploy Under the Hood blog post extremely valuable for understanding how CodeDeploy works and troubleshooting these types of issues.

Elastic Load Balancers (ELB)

This is a no-brainer for our core web servers. ELBs scale to support hundreds of thousands of requests per minute and are at the core of almost every major AWS deployment.

We also spent a lot of time planning and creating our healthcheck endpoint. We chose Rack ECG, an open source tool developed in-house, to create a simple endpoint for the ELB to check. We were deliberate about only checking hard dependencies of our application, like our databases and cache. We ensure our databases are writable, so that if our database fails and Rails does not reconnect or re-resolve the DNS entry, the instance is terminated and a new one provisioned. We did lots of testing with different failure scenarios to make sure we could recover automatically where possible, and as quickly as possible.
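
Wiring this up is roughly a one-liner in config.ru. Treat the following as a sketch: the option and check names are assumptions based on the rack-ecg README, not a copy of our configuration.

# config.ru
require 'rack/ecg'

# Mount the healthcheck middleware in front of the app; which checks run
# (e.g. database connectivity) is configurable.
use Rack::ECG, checks: [:active_record]

run Rails.application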

ELB connection draining over application reloads

One decision we made, without measuring the performance impact, was how we stop serving traffic on a web server in order to deploy a new revision of our code without impacting users.

We use Unicorn as our backend HTTP server. It supports a reload command that allows existing connections to finish while stopping old worker processes and starting new ones running our new code. In our case this resulted in a fourfold increase in response times for a brief period, until Unicorn settled down.

Using the ELB to drain connections to our web servers instead, we’ve noticed our response time only increases about 50% during deployments.

Route53

For hosts that aren’t part of our application server group and don’t need to be (or cannot be) load balanced, we use Route53 to register domain names with one or more IPs, which we associate with our Auto Scaling Group instances on boot.

CloudFormation and StackMaster, a match made in heaven

Last year, as part of a hackfort project, some of our developers put together a tool called StackMaster. You can read more about it in a previous blog post.

Initially we reviewed Terraform as well as StackMaster, but chose StackMaster for its flexibility combined with the maturity of CloudFormation. Time and again we’ve found that the modularity of SparkleFormation dynamics, combined with StackMaster parameter resolvers and many other features, produces small, re-usable stacks with little repetition, reducing the amount of code needed and making that code easier to reason about.

We like smaller stacks because, from previous experience, it’s possible, through human error or software bugs, for a stack to become “wedged” in an unrecoverable state. That’s also why we chose not to define our database resources in CloudFormation, but with scripts instead. When a stack is wedged you have little choice but to destroy it and re-create it; by creating smaller stacks we reduce the impact of having to do so.

Also, by splitting certain resources out, like our Elastic IPs and domain names, we decouple the infrastructure in a way that allows us to more easily make changes in the future. At the moment we’re considering adding another load balancer to an Auto Scaling Group, an operation that requires that the Auto Scaling Group be destroyed (along with all its instances) and recreated. This would normally cause downtime, but because the domain name that points to our load balancer is defined in a separate stack, we can stand up an exact copy of that stack, swap the domain name over to it, and delete the old one, similar to a blue-green deployment.

In Summary

We created what we like to call “semi-immutable” machine images, to balance speed of scaling up with cost and flexibility.

We chose to restructure how we deploy our infrastructure and application in order to take advantage of a cloud platform.

We spent time investigating our technology choices and design decisions to validate they solved the right problems

All this would be for nothing if there was no impact on users. No amount of fancy cloud buzzwords would make it valuable if our customers were not better off. Thankfully we’ve already started to see impressive performance improvements on our site, mostly because early performance testing revealed some bottlenecks that we were able to resolve quickly.

Here’s a chart of our backend response time (in blue), with response times from before the migration in grey. This is how fast our servers respond to user requests.

Our agility is what allowed us to make this improvement and it’s what drove us to move the Envato Market sites in the first place, so it’s already paying dividends.

]]>2016-08-16T23:16:00+00:00http://webuild.envato.com/blog/envato-market-to-the-cloudThis is the story of how we moved Envato’s Market sites to the cloud. Envato Market is a family of seven themed websites selling digital assets. We’re busy; our sites serve 25,000 requests per minute on average, roughly 140 million pageviews per month. We have nearly eleven million unique items for sale and seven million users. We recently picked these sites up out of the home they’d occupied for the past six years and moved them to Amazon Web Services (AWS). Read on to learn why we did it, how we did it, and what we learned!

A short history of hosting at Envato

Back in 2010, Envato was hosted at EngineYard, and looking to move. EngineYard was then a Ruby-only application hosting service. The Market sites were growing to the point where the EngineYard service was no longer suitable, and in addition Envato wanted to focus on the core business of building marketplaces rather than running servers. In August 2010 the Market sites moved to Rackspace’s managed hosting platform.

From 2010 to 2016 the Market sites were hosted by Rackspace. While managed hosting was a good choice for the Envato of that time, the company and the community have grown significantly since then. Around 2013, we found ourselves looking once again for a platform that better fit our needs.

Like many tech companies, Envato runs “hack weeks”, where we pause our normal work and spend a week or two trying out new ideas. In a hack week in September 2014, a team wondered if it was possible to move Market to AWS in one week. In true hack week style, this project focused solely on that goal, and was successful. The team had one Market site running in AWS within the week, and proved that the “lift and shift” strategy was feasible. While we’d have loved to migrate to AWS then and there, the work was only a proof of concept and nowhere near production-ready.

Flash forward nearly two years from that first try, and we’ve made it a reality for all the Market sites!

Why we moved

A strong element in the development culture at Envato is a “do it yourself” attitude – rather than waiting for someone else to do something for us, we’d prefer to do it ourselves. Managed hosting was no longer such a good fit, because we had so much to do and were constantly being constrained by the delays inherent in a managed service.

The managed nature of the service meant we were effectively hands-off the infrastructure. While we had access to the virtual machines that ran our sites, everything else – physical hardware, storage, networking – was controlled by Rackspace, and changing it required a very manual support ticket process. This process was often lengthy, and held us back from operating with the speed we desired.

Working in AWS requires a paradigm shift. While Amazon still manages the physical infrastructure, everything else is up to you. Provisioning a new server, adding storage capacity, changing firewall rules or even network layout: these tasks, which would have taken days to weeks in a managed hosting environment, can be accomplished in seconds to minutes in AWS.

Sometimes we run experiments to prove or disprove an idea’s feasibility, often in the form of putting a new website up and observing how people use it. That means taking the idea from nothing to a functional site in a short period of time. With managed hosting, that could take weeks or even months to accomplish. In AWS, we can build out a site and its supporting infrastructure very rapidly. This ability to quickly run an experiment is crucial to developing new products and features.

Finally, there is a cost incentive to moving to AWS. In Rackspace, we leased dedicated hardware and paid a fixed cost, no matter how much traffic we were serving. We had to pay for enough capacity to handle peak traffic load for our sites at all times; at non-peak times we paid the same rate. In AWS, “you pay for what you use” – you’re only billed for the actual use of the resources you provision. The ease with which you can add or remove capacity means that we’ll be able to add capacity during peak times, and remove it during non-peak times, saving money. After an initial settling period, we’ll be able to model our usage and we expect to see cost savings on the order of 30-50%.

How

With a limited timeframe to accomplish the migration, we had some tough decisions to make. Rebuilding the application from scratch to work in the cloud was not an option, due to the enormous amount of time that would take. Instead, we chose a common strategy in software: MVP, minimum viable product. We did just the amount of work required to deliver the new, migrated platform, without rebuilding every component. This reduced the time to market and let us focus on the core problems.

A big choice faced by companies moving workloads to the cloud is whether to “lift and shift” or rearchitect. Lift and shift refers to picking up an entire application and rehosting it in the cloud. This has the advantage of speed and reduced development effort; however, applications migrated like this can’t fully leverage the capabilities of the cloud platform and often cost more than they did pre-migration. Rearchitecting, on the other hand, is cost and time intensive, but results in an application built for the platform which can benefit from all the features it provides.

Envato performed a lift and shift migration of our Single Sign-on system (account.envato.com) a couple of years ago; we learned that while this approach can be accomplished quickly in the short term, it requires significant work after the fact to get the systems involved running as desired. Had we realized that up front, we may have chosen to do that work as we migrated.

Why not both?

Instead of picking one or the other, we chose a hybrid approach in moving the Market. Functionality that could easily be left unchanged, was. Only changes that were required or that would be immediately beneficial were made.

Amongst the more important changes made were the following:

We replaced our aging Capistrano-based deployment scripts with AWS CodeDeploy. CodeDeploy integrates with other AWS systems to make deployments easier. While Capistrano can be made to work in the cloud, it falls short at supporting rapid scaling.

Scout, our existing Rails-specific monitoring system, has been replaced by Datadog for monitoring and alerting. Monitoring in the cloud requires first-class support for ephemeral systems, and Datadog provides that along with excellent visualization, aggregation, and communication functionality.

The key component of the Market sites, our database, was moved from a self-managed MySQL installation to Amazon Aurora, a high performance MySQL-compatible managed database service from AWS. Aurora offers significant performance increases, high availability, automated failover, and many other features.

For some core services, we opted to use AWS’ managed versions, rather than managing ourselves. We chose Amazon ElastiCache for application-level caching; the Aurora database mentioned above is also a managed service; and we make use of the Elastic Load Balancing service for our load balancers.

The application now runs on Amazon EC2 instances managed by Autoscaling groups, effectively removing the concept of a single point of failure from our infrastructure. If a problem affects any given instance, it is easily and quickly replaced and returned to service. Adding and removing capacity literally takes nothing more than the click of a button.

As a counterpoint, some specific things which didn’t change:

Shared filesystem (NFS) for some content: while we really wanted to get rid of this part of our architecture, it would have been too time consuming to remove our reliance on it. We’ve instead marked it as something to address post-migration.

Logging infrastructure: we had a good look at Amazon Kinesis which looked to provide a new AWS-integrated log aggregation system. However, it turned out that there were irreconcilable problems with this approach, so we left the current system unchanged. Again, we’ll review this at a later date.

The vast majority of the Market codebase was untouched during the migration. Any code that didn’t need to be changed, wasn’t.

A key decision we made early on in the project was to manage our infrastructure as code. Traditionally, infrastructure is defined by disparate systems: routers, firewalls, load balancers, switches, databases, hosts, and rarely do these systems share a common definition language or configuration mechanism. That’s a major difference in AWS; everything is defined in the same way. We chose the AWS CloudFormation provisioning tool, which lets you define your infrastructure in “stacks”. The benefit is that our infrastructure is under source control; changes can be reviewed before being applied, and we have a history of all changes. We use CloudFormation to such an extent that we’ve written StackMaster to make working with stacks easier.

In Rackspace, our systems were spread over a small number of physical hosts, on which we were the only tenants. Contrast that to AWS, where our systems are spread out over hundreds of physical hosts which we share with other AWS customers. A consequence of the increased number of systems is an increased failure rate of individual servers. However, this can be mitigated by architecting with expected failures in mind:

As mentioned previously, all our instances are members of Autoscaling groups, which means they are automatically replaced if they become unhealthy.

Most systems are deployed to multiple physical locations, ensuring a problem (e.g. loss of power, cooling, or internet connectivity) at any one location does not affect the availability of the site. Those systems deployed to only a single location are able to run in any location, and when disrupted in one location can launch in another.

Managed services (Aurora and ElastiCache, most notably) are also configured to run in multiple locations, and are tolerant of the loss of a location.

Not only have we followed the cloud best practice of designing for failure, we’ve taken it a step further by researching possible failure scenarios, validating our assumptions, and where possible, optimizing our designs for quick recovery. Additionally, we’ve worked to create self-healing systems; many problems can be resolved without human intervention. This gives us the confidence that not only can we tolerate most failures, but when they do occur we can quickly recover.

Readers familiar with cloud architecture may ask, “why not multi-region?” This refers to running applications in multiple AWS regions. Even though we’ve architected for availability by running in multiple locations (availability zones) and storing our data in multiple regions, we still only serve customer traffic from a single region at a time. For availability and resiliency on a global scale, we could run out of multiple regions concurrently. Running a complex application like Market simultaneously in multiple locations is a hard problem, but it is on our roadmap.

Execution

The mandate from our CTO was clear: “optimize for safety.” Many of our community members depend on Market for their livelihoods; any data loss would be unacceptable. This requirement led to a hard decision: the migration would incur downtime – Market sites would be entirely shut down during the actual cutover from Rackspace to AWS.

While we would have liked to keep the Market sites open for business the entire time, there was no way to guarantee that every change – purchases, payments, item updates – would be recorded appropriately. This is due in large part to the fact that the source of truth for all this data, our primary database, was moving at the same time. Maintaining multiple writable databases is a very difficult problem to solve, and we opted to take the safer route and temporarily disable Market sites.

Months of planning led to the formation of a runsheet: a spreadsheet containing the details of every single change to be made during the cutover, including timing, personnel, specific commands, and every other detail required to make each change. Multiple rollback plans were made, instructions for undoing the changes in the event of a major failure.

The community was notified; authors were alerted, vendors were consulted, Envato employees informed. Preparation for the cutover day, scheduled for a Sunday morning (our time of lowest traffic and purchases), began the week prior. On Sunday morning, the team arrived (physically and virtually) and ran the plan. Market was taken down, the move commenced, and four and a half hours later, the sites were live on AWS! Not only live, but showing a small performance increase as well!

In the following app-level view from one of our monitoring systems, you can clearly see the spike in the middle of the graph showing the cutover, and the decreased (faster) response time following it:

In this browser-level view, you can again see the cutover at the same time, and following that the better-than-historical behavior of the new site:

Next steps

While the sites have successfully been moved to AWS, we’re far from done. There is plenty of clean-up work to do, removing now-unused code and configuration. Our infrastructure at Rackspace needs to be decommissioned.

Another major task which will continue for some time is to start modifying the Market to take advantage of the AWS platform – or as it’s more commonly known, “drinking the kool-aid.” AWS provides many services, and we’ve only scratched the surface during the migration. As we continue to develop and operate the Market sites in AWS, we’ll evaluate these services and use them where it makes sense.

Lessons Learned

A factor that really contributed to the success of this migration was having the right team involved. The migration team had representatives from several parts of the business: the Customer group (owners of the Market sites themselves), the Infrastructure team (responsible for company-wide shared infrastructure), and the Content group (who look after all the content we sell on the sites). Having stakeholders from each area involved in the day-to-day work of the migration meant that we had confidence that everyone was up to speed and we weren’t missing any major components.

Another contributing factor was the “get it done” strategy we employed – the team was empowered to make the necessary decisions to complete the project. That’s not to say that we didn’t involve other people in the decision-making process, but we were able to avoid the “analysis paralysis” problem by not asking each and every team their opinion on how to proceed.

With a project of this scale, there will certainly be things that don’t go right. One area where we could have improved is communication. This project affected many teams at Envato, but our communication plan didn’t reflect that. Notifications were left until later in the project, and we didn’t communicate every detail we should have. Given another chance, communicating early and often to the rest of the company would have helped ensure everyone was on the same page and had all the information they required. Similarly, we didn’t communicate our plan to the community until the project was nearing its end; more lead time would have been helpful.

On cutover day, we had trouble with the database. Indeed, migrating the database was far and away the most complex part of the move. We had a detailed plan for it, but due to the fact that it contained live data and the complexity around it, we had no opportunity to practice that part of the migration. Finding a way, however difficult, of practicing the database migration may have mitigated some of this trouble. Ultimately, though, we found solutions to the problems and the database was safely migrated without ever putting data at risk of loss or corruption.

Were we to offer any tips to the reader thinking about a similar migration, they’d be these:

First and foremost, understand your application. A solid understanding of what the app does and how it works is critical to a successful migration. Our biggest fear, happily unrealized, was of some unknown detail of our ten-year-old system that would show up and stop the show.

Get AWS expertise on board. There’s no substitute for experience, and having that experience in the team was critical. Send team members to training, if necessary, to get the knowledge, but also practice it.

Beware the shiny things! There are a lot of cool technologies in AWS, and it’s tempting to use them anytime you see a fit. This can be dangerous and distract from the migration goal. You can always revisit things once the project is complete.

Consider AWS Enterprise Support. It may seem expensive, but having a technical account manager (TAM) on call to answer your questions or pass them off to internal service teams when required will save your team valuable time. The TAM will also analyze your designs, highlight potential problems, and help you address them before they become real problems. AWS provides a service called IEM, where the TAM will be available during major events (e.g. migrations), proactively monitoring for issues, and liaising with internal service teams in realtime to address actual problems.

Conclusion

As this post has hopefully demonstrated, a lot of thought went into this migration. Due to the comprehensive planning the move went relatively smoothly. We’re now in a position to start capitalizing on our new platform and making Envato even better!

]]>2016-08-05T16:51:00+00:00http://webuild.envato.com/blog/getting-envato-market-https-everywhereLast month we announced that we had
finally completed the move to HTTPS everywhere for Envato
Market. This was no easy feat, since we serve over 170 million page
views a month across about 10 million listed products, almost all of
it user generated content. Along the way we learnt many valuable
lessons that we want to share with the wider community, to hopefully
make other HTTPS moves easier and encourage better adoption of HTTPS
everywhere.

Behind the scenes, the groundwork for the HTTPS rollout started back
in 2014, with a couple of the engineers implementing a feature toggle
which allowed staff to opt in to HTTPS. For a long time this sat
dormant and unused by most staff, until earlier this year a few
engineers got together and decided it was time to give HTTPS
everywhere another push and get it to general availability.

But Why?

HTTPS isn’t just about having a padlock or green indicator shown in
the browser. It’s about creating a trusted connection between the end
user and your services via three protection layers:

Encryption: Securing the exchanged data to prevent eavesdropping
on the connections.

Data integrity: Confidence that the data has not been altered mid
transit without being detected.

Authentication: Assuring the website you are connecting to is who
you expect them to be.

An added side effect of migrating to HTTPS is that you can unlock HTTP/2
and features like request multiplexing and server push which are great
news for performance! Last year, Google announced HTTPS as a ranking
signal so by migrating to HTTPS you get a boost in
your search results too!

User managed content

We have a lot of user managed content. The problem here is that many
of our authors don’t have the time or additional funds to implement
things like content delivery network (CDN) caching, so most requests
for user managed content would end up hitting their origin servers.
This was bad for a few reasons:

Many authors use shared hosting or very small instances for storing
these assets. During the testing phase, even the low volume of traffic
we generated meant requests for a particular set of assets would
sometimes take over 20 seconds to complete! The result of these slow
load times is a very poor experience for buyers, and would leave many
people looking elsewhere because they couldn’t see previews or
screenshots of the product quickly enough.

Very little HTTPS adoption. If we intended to serve our pages under
HTTPS, we needed to ensure the assets on the page were also served
securely. The issue here is that it’s very unrealistic for Envato to
force users to spend time (and potentially money) on updating all of
their assets to be served via HTTPS to avoid seeing mixed content
warnings on the item pages.

To solve both of these issues, we decided on an approach consisting of
an image proxy and a CDN. The image proxy rewrites all of the
non-secure links at render time to point at our CDN, which helps speed
up response times and lets us take some of the load off author origins
by caching the assets.
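
To give a feel for how such a proxy is addressed, here is a minimal
sketch of camo-style URL signing in Ruby (the proxy host and key are
hypothetical): the proxy only fetches URLs whose HMAC it can verify,
and the original URL travels hex-encoded in the path.

require 'openssl'

PROXY_HOST = 'https://img-proxy.example.com' # hypothetical
PROXY_KEY  = ENV.fetch('PROXY_KEY')

# Rewrite an insecure asset URL to go via the image proxy.
def proxied_url(insecure_url)
  digest  = OpenSSL::HMAC.hexdigest('sha1', PROXY_KEY, insecure_url)
  encoded = insecure_url.unpack1('H*') # hex-encode the original URL
  "#{PROXY_HOST}/#{digest}/#{encoded}"
end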

Initially we used camo which was built by Corey Donohoe
when he was at GitHub where they needed to solve a similar
issue. This worked well for us until we started
trying to scale it to handle more traffic. GitHub solved the scaling
issue by adding more worker processes however we decided to try adding
clustering support so that we could utilise more of
the hardware we already had in place. This didn’t solve the problem for
long and we eventually ended up back in the same position and needed to
resize our hardware to account for the additional load. We determined
this wasn’t going to be a viable path going forward and we needed a
better solution. After some looking around we found go-camo
which is a Go port of Corey’s original project. For a while we ran the
two implementations side by side and discovered that go-camo was able
to better utilise all of the existing hardware (due to its ability to
use more than a single operating system thread) and was easier to debug
when issues popped up. After a couple of weeks of load testing we
decided to completely swap to go-camo and start ramping up the number
of users who were using this.

Sharing cookies

As you may know, Envato Market is built using Ruby on Rails, and out
of the box Rails lets you define how your cookies are handled. To
continue with our incremental rollout, we needed user cookies to be
accessible over HTTP or HTTPS, depending on which protocol served the
request. This was achieved by omitting the Secure flag on cookies
until we were confident, post rollout, that we were not going to roll
back.
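
In Rails this comes down to the session store configuration; a minimal
sketch, with a hypothetical cookie name:

# config/initializers/session_store.rb
# Leaving out `secure: true` lets the browser send the session cookie
# over both HTTP and HTTPS during the incremental rollout; once the
# rollout is permanent, the flag can be switched on.
Rails.application.config.session_store :cookie_store,
  key: '_market_session',
  secure: false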

Performance

One of the big concerns from teams looking to undertake HTTPS
migrations is that they will incur a performance hit once it’s live,
and in most cases that’s just not true. Deploying to a modern
hardware/software setup and using a suitable cipher suite mitigates
many of the performance bottlenecks that used to be associated with
HTTPS. This definitely doesn’t mean you shouldn’t collect and monitor
metrics around these areas; it just means that “HTTPS is slow” is not
a valid excuse.

In Envato’s case, we haven’t seen any performance impacts and our end
user time is consistent with the weeks prior to the HTTPS rollout.

Monitoring

One of the most important things you can do during a HTTPS migration is
Monitor All The Things. By having insight into the changes within your
stack during the migration you can quickly detect an issue before all
your users do. During our rollout some of the metrics we kept a very
close eye on were:

Exception rate

Time spent in network requests

End user response time

Application response time

Instance resource utilisation (CPU specifically)

Total number of requests

Edge network requests and the count by status code

During our rollout we identified a couple of issues, most notably a
load balancer misconfiguration. We were seeing a CPU spike on a small
subset of web instances that had been missed in all of our testing,
and we managed to catch it before rolling HTTPS out to all of our
users.

Here are two charts that we put together to keep everyone informed
about how far through the rollout we were. The top one covers the
initial rollout (mostly staff usage) and the second the point at which
we cut everyone over to HTTPS.

SEO

2016 has been a big year for SEO at Envato. We’ve kicked off many
initiatives targeting better visibility for search engines into our
author products, and during early discussions it was decided we needed
to be extra careful during the migration not to undo all of the hard
work we’d put in over the previous 7 months. To ensure we didn’t go
backwards, we took a couple of steps that have helped us stay on top
and improve our search engine rankings:

Submit both HTTP and HTTPS sitemaps to Google webmaster tools: In
the week leading up to the swap over, we took a snapshot of our
sitemaps and uploaded them into Google webmaster tools as a new set of
sitemaps. This ensured that when we swapped over to HTTPS, Google
would have access to both an HTTP and an HTTPS sitemap source, and
could continue crawling the HTTP sitemap while being led into the
HTTPS version of the site.

Ensure we maintained 1:1 redirects: This helped ensure our users
(and bots) still knew where to find us even though we moved to HTTPS.

In taking these steps, 61% of high volume terms we track have remained
stable or improved their rankings since the HTTPS release. The remaining
terms that have moved backwards were not on page 1 and have not actually
lost us traffic or revenue.

The migration wasn’t completed overnight and took longer than we would
have liked, but we managed to roll it out without any negative impact
on our users or application, which is something we are extremely proud
of. Hopefully publishing our journey will give others useful
information about migrating to HTTPS, along with the added benefits
that come with it, and remove the stigma that HTTPS is only a painful
experience.

]]>2016-05-19T20:09:00+00:00http://webuild.envato.com/blog/tracking-down-ruby-heap-corruptionBack in November 2015, one of the Envato Market developers made a
startling discovery: our exception tracker was overrun with
occurrences of undefined method exceptions, with the target classes
being NilClass and FalseClass. These types of exceptions are often a
symptom of Ruby code that hasn’t accounted for a particular case where
the data being accessed returns nil or false. For our users, this
manifested as our robot error page letting them know that we had
encountered an issue. It was a particularly hairy scenario to be in:
the exceptions we were seeing were not legitimate failures, replaying
the requests never reproduced the error, and code inspection showed
the values could never be set to nil or false.

It’s worth noting that during the assessment of these exceptions we
were able to confirm that the security of Envato Market, and that of
our community’s data, was not impacted. This was a notion we
continuously challenged and re-verified throughout every step of our
investigations; if at any stage it was not clear, we stopped and
tightened third party checks and monitoring until we were certain.

These exceptions were harder to track down than regular errors that
manifest at a single point in the application, and our error tracker
showed the errors had begun sometime in October, although we could not
isolate a specific deployment or change matching that timeframe.
Initially we tried upgrading and downgrading newly introduced gem
versions, and when none of these stopped the errors, we also rolled
our Gemfile back to previous months without success. We also read
close to every available article on the internet related to Ruby heap
corruption and tried all the proposed solutions without any luck.

To add to the problem, this issue was only occurring in production.
Any attempts to replicate it locally or in our staging environments
never produced the error; even replaying production traffic through
another environment didn’t reproduce the issue. In production it was
very difficult to create a reproducible test case, as we did not know
exactly where the problem originated. About mid-investigation we did
think we had a script to reproduce it, but it turned out to be simply
running into the exceptions much like our usual traffic was.

Our suspects

Premature garbage collection

From the early investigations, we suspected premature GC due to valid
pointers disappearing mid-call. This became more suspicious when
upgrading to MRI 2.2.x and its newer GC made our situation worse. To
dive further into this, we instrumented Ruby GC stats using Graphite
and watched for any unusual change in object allocations and heap
health. However, we could never find anything that pointed us to a GC
problem, despite spending a large amount of our time investigating and
tuning Ruby’s GC behaviour.
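
The instrumentation itself is straightforward; a sketch of the idea,
pushing a few GC.stat counters over Graphite’s plaintext protocol (the
host and metric names are hypothetical):

require 'socket'

# Periodically push GC counters to Graphite so trends in allocations
# and heap health are visible over time.
def report_gc_stats(host: 'graphite.internal', port: 2003)
  now   = Time.now.to_i
  stats = GC.stat
  sock  = TCPSocket.new(host, port)
  %i[count total_allocated_objects heap_live_slots].each do |key|
    # Graphite plaintext protocol: "<metric path> <value> <timestamp>\n"
    sock.write("app.ruby.gc.#{key} #{stats[key]} #{now}\n")
  end
ensure
  sock&.close
end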

Unicorn

Envato Market runs unicorn as the
application server, and suspicion was raised after we traced a series
of requests back to the same parent process. Since unicorn uses
forking and copy-on-write, we thought there was a chance the parent
process was becoming corrupt and passing that corruption on to its
child processes. We lowered the worker count to 1 and attached rbtrace
to the parent and child processes, but came up with nothing that
looked like corruption, and never managed to capture a segmentation
fault in the watched processes, so we were able to rule this out.

C extensions

We use a handful of gems that rely on C extensions to perform low
level work quickly and efficiently. Using C is a double-edged sword:
using it well results in excellent performance, whereas using it
poorly (especially unintentionally!) results in issues that are very
difficult to diagnose unless you are well versed in the language and
environment. For the most part we relied on valgrind and manual review
to identify anything that we thought could be the cause. To get a list
of all the gems we needed to inspect, we used a snippet that returns
the gems which have an ext directory; a minimal version of such a
check looks like this:
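
require 'rubygems'

# Gems that ship C extensions have an ext/ directory inside the
# installed gem's directory.
gems_with_ext = Gem::Specification.select do |spec|
  Dir.exist?(File.join(spec.gem_dir, 'ext'))
end

puts gems_with_ext.map(&:name).sort.uniq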

Once we had a list of potential culprits we set off reviewing each one
individually and running it through valgrind, looking for memory leaks
and anything that didn’t quite look right that might lead us to a heap
corruption scenario. Our investigations here didn’t solve our specific
issue, however we did submit some upstream patches to the projects in
which we identified potential issues.

Our Ruby build process

Before the Envato Market application got onto MRI 2.x, we ran a custom
Ruby build with a series of patches, which required us to build and
maintain our own fork of Ruby. Ruby is in a far better place now, so
we rarely need to patch it, but we still maintain this process: it
allows us to package up our version of Ruby and ship it to our hosts
as a single package, eliminating the need to download and build on
individual hosts.

During the investigations there were concerns that the version of Ruby
we were building and deploying was being corrupted at some stage, and
that we were seeing that as the segmentation faults in production. To
troubleshoot this further, we tried to build our custom version with
additional debugging flags, but quickly discovered they were not being
applied as we thought they had been in the past. After spending a bit
of time digging in, we identified the cause as the position in which
we were passing the flags, which meant they were inadvertently
trampled by ruby-build’s default options. We fixed this in our
packaging script, and shortly after we were able to verify that our
packaged Ruby matched the upstream Ruby once it had been built
manually and deployed to our hosts.

Getting help

After a few months of trying and eliminating everything we could think
of, we reached out for external help. We had spent a bit of time
analysing our C extensions but we didn’t feel we went deep enough with
our investigations. To get more insight into the C side of things we
teamed up with James Laird-Wah and started
going through our core dumps and gems that relied on C extensions.

After many long hours of debugging and stepping through core dumps, we
found a smoking gun in the form of a Nokogiri release. Nokogiri RC3
introduced a patch for a nasty edge case where, if you have
libxml-ruby and Nokogiri loaded in the same application, you could
encounter segmentation faults due to the way some fields were managed.
Updating to Nokogiri RC3 saw the number of segmentation faults drop to
a third of their previous counts! Looking at this behaviour further,
we identified that the cause of this edge case was the way libxml-ruby
managed the fields regardless of the ownership on them. To address the
remaining segmentation faults, we got together with James and
formulated a patch to ensure libxml-ruby would only manage the fields
it explicitly owned. We tested it and deployed it to production, where
we monitored it closely for 24 hours. Lo and behold, 0 segmentation
faults! We’d finally found the cause of our issues! Excited with our
discovery, we pushed the patch upstream and it’s now available at
xml4r/libxml-ruby#118.

Ensuring this doesn’t happen again

Like most regressions, proactive monitoring and insight into your
application’s normal behaviour are the best defence against
long-running issues like this. To make sure we are not stung by this
again, we are taking the following measures:

Implementing alerting based on any occurrences of segmentation faults
within our applications. Our applications should never hit
segmentation faults, and if they do, we need to treat them as high
priority and assign engineers to resolve them.

Reviewing the dependency graph on each bundle update to spot lingering
dependencies for possible removal. The gem we were using that
leveraged libxml-ruby has since been refactored out, but the
relationships and dependencies were never cleaned up once it was
removed.

Better monitoring and roll ups of exceptions on a team by team basis.
We are aiming to better integrate our developer tooling with our
exception tracker to ensure we can quickly identify an increase in
exceptions and work on a resolution before they become a bigger issue.

Since the end of 2015, the Envato Front End team has been working on bringing a modern development workflow to our stack. Our main project repo powers sites like themeforest.net and serves around 150 million page views a month, so it is quite a challenge to re-architect our front end while maintaining a stable site. In addition, the codebase is 9 years old, so it contains the code of many developers and multiple approaches.

We recently introduced our first React based component into the code base when we developed an autosuggest search feature on the homepage of themeforest.net and videohive.net. The React component was written with ES6, and uses Webpack to bundle the JavaScript code.

As I mentioned above, it’s a 9 year old code base and nobody can guarantee that introducing something new won’t break the code, so we began all the work with tests in mind. This post documents our experiences developing the framework for testing the React-based autosuggestion component.

One issue I had while writing unit tests is that some of the code depends on a browser environment, because it needs access to browser-only objects or APIs.

The first solution

Most unit tests nowadays run in Node.js, so to emulate a browser environment, jsdom comes into play.

A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js.

Here’s a handy snippet that you could use before your testing code to prepare a DOM environment:

import jsdom from 'jsdom'

// This part injects the document and window variables for the DOM mount test
export const prepareDOMEnv = (html = '<!doctype html><html><body></body></html>') => {
  if (typeof document !== 'undefined') {
    return
  }

  global.document = jsdom.jsdom(html)
  global.window = global.document.defaultView
  global.navigator = {
    userAgent: 'JSDOM'
  }
}

And in your test code, you could just import it and use it by calling the function.

import { prepareDOMEnv } from 'jsdomHelper'

prepareDOMEnv()

If your code depends on a DOM helper library like jQuery, you may also need to include the jQuery source in the prepared environment. You could do something like this:
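
This is a sketch using the fs approach mentioned in the notes below, via the old jsdom.env API; the path to jQuery is an assumption based on a standard npm install:

import fs from 'fs'
import path from 'path'
import jsdom from 'jsdom'

// Read the jQuery source from node_modules (hypothetical path) so no
// network request is needed, then inject it into a fresh document.
const jquerySrc = fs.readFileSync(
  path.resolve('node_modules/jquery/dist/jquery.js'), 'utf-8')

jsdom.env({
  html: '<!doctype html><html><body></body></html>',
  src: [jquerySrc],
  done: (err, window) => {
    if (err) throw err
    global.window = window
    global.document = window.document
    global.jQuery = window.jQuery
  }
})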

Notes: the official jsdom GitHub repo gives an example of loading jQuery from a CDN, which requires an additional network request, can be unreliable, and won't work at all without network access. It also has an example that loads the jQuery source with the Node.js fs module, but that isn't clean either, as you have to hard-code the path to jQuery.

Everything looks OK so far, so why bother with a real browser environment?

The reason is that once things get complicated, your code may depend on more browser-based APIs. You can always fix your own code, but what if you are using third-party modules from npm, and one of them happens to depend on XMLHttpRequest? It's nearly impossible to mock everything, and to be honest, I feel uncomfortable doing it this way; it's tricky and kind of dirty.

Let’s run it in a browser

Why not Phantomjs

Given the problems above, it's pretty straightforward to think about just running all the tests in a real browser. If you search "headless browser testing" on Google, the first result will be PhantomJS.

I haven't used PhantomJS much and I'm not familiar with how it works, but I've heard bad things about it: "lagging further and further behind what actual web browsers do today", "1500+ open issues on GitHub", "unicode encoding issues in different languages".

Let’s talk about Electron

Build cross platform desktop apps with web technologies. Formerly known as Atom Shell. Made with <3 by GitHub.

It would take another blog post to explain what Electron is and what it does; I have built a few projects with it and written a few blog posts about it. The short version, and what really matters to me: it's a Node.js + Chromium runtime, actively maintained by fine folks from GitHub and used by the Atom editor, Slack, etc. To conclude, I'll quote one of my favourite JavaScript developers, dominictarr:

Electron is the best thing to happen to javascript this year.
Now we get root access to the browser!

Important notes

As the title indicates, this post is about running tests on any CI server. Most CI servers run neither Mac nor Windows, and there's a known issue with running Electron on Linux: you need a few extra setup steps to get it running.

Here are a few notes copied from the repo; thanks to juliangruber for including my section on running it on GNU/Linux there.

To use the default electron browser on travis, add this to your travis.yml:
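
It looks roughly like this (a sketch; check the repo for the current version, and note that the display number and screen size are arbitrary):

addons:
  apt:
    packages:
      - xvfb
install:
  # Give Electron a virtual display to render into on headless Linux
  - export DISPLAY=':99.0'
  - Xvfb :99 -screen 0 1024x768x24 > /dev/null 2>&1 &
  - npm install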

Final step

Once we have the setup ready, our test becomes much simpler, with no need to "hack" together a browser-like environment:

import test from 'tape'
import React from 'react'
import jQuery from 'jquery'
import { render } from 'react-dom'

test('should have a proper testing environment', assert => {
  jQuery('body').append('<input>')
  const $searchInput = jQuery('input')

  assert.true($searchInput instanceof jQuery, '$searchInput is an instance of jQuery')
  assert.end()
})

Conclusion

Voilà! That's all we needed to get headless JavaScript tests running on any CI server. Of course your testing environment may differ from mine, but the idea is there.

As front-end development changes rapidly, with single page applications, isomorphic/universal apps, and front-end tooling like npm, Browserify, Babel and Webpack, testing will only become more complex. I hope this setup will make your life suck less and be significantly easier.

Last but not least, if you have any questions or better way for testing setups, let us know!

CloudFormation is an Amazon (AWS) service for provisioning infrastructure as
“stacks”, described in a JSON template. We use it a lot at Envato, and
initially I hated it. Typing out JSON is just painful (literally!), and the
APIs exposed in the AWS CLI are very asynchronous and low level. I wanted
something to hold my hand and provide more visibility into stack updates.

Today I’d like to introduce a project we’ve recently open-sourced: StackMaster
is a tool to make working with multiple CloudFormation stacks a lot simpler. It
solves some of the problems we’ve experienced while working with the
CloudFormation CLI directly. The project is a refinement of some existing
tooling that we have been using internally at Envato for most of this year, and
it was built during one of Envato’s previous “Hack Fortnights”.

See the changes you are making to a stack before you apply them:

When applying a stack update, StackMaster does a few things. First
you'll see a text diff of the proposed template JSON against the template that
currently exists in CloudFormation. This helps you sanity-check the changes and
abort if something doesn't look right. It also shows a diff of any parameter
changes. After you confirm the change, StackMaster displays the log of stack
events until CloudFormation has finished applying it.

Easy ways to connect stacks:

StackMaster provides a number of helper functions to deal with parameters.
One allows you to easily take the output from one stack and use it as the input
to another, without having to hardcode it. We call these helpers parameter
resolvers.
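
As a sketch of the idea (the stack names, template files and parameter key here are hypothetical, not from a real project):

stacks:
  us-east-1:
    myapp-vpc:
      template: myapp_vpc.json
    myapp-web:
      template: myapp_web.json
      parameters:
        vpc_id:
          stack_output: myapp-vpc/vpc_id

Here the myapp-web stack picks up the vpc_id output of the myapp-vpc stack at apply time, so the value never has to be copied by hand.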

Make it easy to keep secrets secret.

Another parameter resolver transparently decrypts encrypted parameters before
supplying them to CloudFormation, meaning you don’t need to worry about
plain-text secrets.

Make your parameters easier to understand by using names instead of IDs.

Another set of parameter resolvers StackMaster offers allows you to refer to
Amazon Simple Notification Service (SNS) topics and security groups by
descriptive names, instead of obscure, hard-to-maintain ID numbers.

Make it easy to customise stacks in different environments

StackMaster will load and merge parameters for a given stack from multiple YAML
files to allow for region- or environment-specific overrides. You can, for
example, set defaults in one YAML file and then use an environment specific
YAML file to tailor as required. We use this to do things like use a smaller
instance type in our staging environment.
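
For instance, assuming a defaults file plus an environment-specific file (the file layout and keys here are illustrative, not our real configuration):

# parameters/web.yml -- defaults shared by every environment
instance_type: m4.large
min_instances: 2

# parameters/staging/web.yml -- staging-only overrides, merged over the defaults
instance_type: t2.small
min_instances: 1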

Apply descriptive labels to regions

Think in terms of environments instead of region names. StackMaster allows you
to operate on your staging stack, rather than on your ap-southeast-2 stack,
reducing the chance of applying changes where they are not desired.

]]>2015-09-22T17:10:00+00:00http://webuild.envato.com/blog/how-envato-defined-the-expectations-of-our-developers

The Envato development team has always had a strong sense of what we stand for, how we work together and what we expect of each other … at least that is what many of us thought. Around 9 months ago our company participated in the Great Places to Work survey, which gauges how our employees feel about Envato as a place to work. Each department received a breakdown of their feedback, and whilst much of our feedback was great, one statement was a clear outlier: "Management makes its expectations clear". This was a trigger to question our assumptions about those expectations. This post tells the story of that journey.

The Response

Step 1 - Review What We Have

We reviewed where we stated expectations, how consistent and available that information was, and how we applied it. The conclusion: it was patchy and inconsistent. Our position descriptions alluded to some expectations; our annual review questions didn't at all, being broad and generic across the whole company; and the goals that line managers set for developers were unique to each developer and did not relate to any common set of expectations.

Step 2 - Organise a Working Group

Envato has around 60 developers right now, and about 45 when we started this work, so consulting everyone as a large group was not going to be productive. We got together a working group of 7 developers that was a cross section of disciplines and seniority.

Step 3 - Come Up with a Framework / Classification

We reviewed all the information we had and came up with a basic classification system of expectations:

Tech Competency

Learning and Teaching

Collaboration and Ownership

Envato Values

Step 4 - A Starting Set of Values / Expectation

Next the working group had a go at coming up with a set of these statements, as a template for broader consultation. We phrased them as first-person assertions, so that a developer could "ask themselves" whether they satisfy an expectation, not just so that a manager could "judge" the developer … e.g. "I define the problems I am solving before the solutions".

We came up with 13 statements in this first draft; we chose not to be exhaustive, but rather to provide starting points for the rest of the team.

Step 5 - Get Ideas from All Team Members

The entire dev team of 35 staff members was split into groups, each assigned to one working group member. Nine separate sessions were run.

Each group ran a workshop to come up with its own set of statements. Ideas were collected on individual Trello boards and categorised under either the existing categories or new ones.

These sessions generated 220 statements, around 20 from each group.

Step 6 - Merging all Input to One Master List

The working group got back together to synthesise all this feedback. We created a new "Master" board with 5 major lists, and moved cards from the workshop boards into it. The lists we came up with, with the number of cards in each, were:

Technical Competence - 86

Collaboration - 57

Learning and Teaching - 42

Envato Values - 19

What Makes an Awesome Team Mate - 8

Miscellaneous - 7

With everyone’s separate lists on the one board it looked like this!

Step 7 - Consolidating Ideas

The working group split into three groups, each consolidating one list. The goal of consolidation was to capture the common themes in the input. For example, the Collaboration list had 57 cards and was consolidated into 10 major themes covering about 50 of the cards, with 7 marked for review.

We linked these consolidated cards back to the original card so we could trace individual input through to final statements.

Consolidating all our ideas revealed some outliers that were not common across the working groups. We wanted to reduce the final set of expectations to a workable number, so outliers were cut.

Step 8 - Finalise the List

After much re-wording we came up with our final set of lists and cards that we considered a small enough but not too large list of expectations.

Step 9 - Convert to a Github Repo and Request Reviews

Once we were happy with our Trello cards, it was time to publish these expectations and open them up for comment and review. We decided to use the development process that serves us well for code: GitHub-hosted repositories and pull requests. After presenting our content to the entire team at a 'code party', team members started conversations via pull requests.

And finally, we published our living set of developer expectations to the public. You can find them here at Developer Expectation.

]]>2015-07-08T19:19:00+00:00http://webuild.envato.com/blog/how-to-organise-i18n-without-losing-your-translation-not-found

I've written before about Working with Locales and Time Zones in
Rails, but I often feel the i18n library (short for
internationalisation) is underused (or underappreciated?). Perhaps it is even
avoided because of the perception that it takes more effort to develop with
and is harder to maintain.

This article will, I hope, open your mind to the idea that you will be better
off using i18n in your application (even for a single language) and that it
can be maintainable with some simple organisational pointers.

Organisation

The number of i18n keys your application accumulates can become
overwhelming as it grows. In my experience, the biggest pain
point has been finding the key(s) you want to update when everything lives in
one huge file such as the default en.yml.

Thankfully, you are not restricted to a single file. We can adjust the
i18n load_path in Rails and break up our translation files into more
logically grouped files.
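
For example, this one-liner (the standard Rails pattern) picks up every YAML file nested anywhere under config/locales:

# config/application.rb
# Load nested locale files, e.g. config/locales/views/products/en.yml
config.i18n.load_path += Dir[Rails.root.join('config', 'locales', '**', '*.yml')]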

Rails doesn't care about or place special significance on this structure; all it
cares about is the key hierarchy it eventually stores after parsing and merging
the resulting data structures.

Global translations

What if you have translations that need to be "global" to your application
and don't fit a particular view or class? Perhaps they change often and
appear in many locations, so repeating them would be inconvenient.

The idea can be carefully applied at a global level too when it’s needed.

|- config
|- - locales
|- - - global.en.yml

If you've only got a few keys then a single file is fine; if you start finding
the file hard to read, break it up into smaller files inside a global
directory, or whatever makes sense for your domain.

Using i18n outside of views

This approach is not limited to views; I find it really useful for
validation messages in form objects and models as well. For example:

Given a simple form object with a basic presence validation, the locale file
below shows how you might customise the validation message. This makes use of
Rails' "lazy" lookup in a similar way to views, and will affect this form only.
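
A minimal sketch (the form class, file paths and message are hypothetical):

# app/forms/contact_form.rb
class ContactForm
  include ActiveModel::Model

  attr_accessor :email

  validates :email, presence: true
end

# config/locales/forms/contact_form.en.yml
en:
  activemodel:
    errors:
      models:
        contact_form:
          attributes:
            email:
              blank: We need your email address so we can reply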

Global error messages can be set as well, check out the rails-i18n
gem for an exhaustive list of the defaults you can change in
Rails.

Taking care to create our locale file in a corresponding location to our class
makes it easy to find these translations in the future.

Naming keys

Naming things is hard, so don't overthink it: with our file structure, each key is already confined to a single view and locale file. If a name turns out to be a poor choice, you can confidently go and change it.

My recommendation is to name your keys after their purpose, not just an
underscored version of the translation. One exception would be things like
model attribute labels, where it makes sense to use the translation as the key.

en:
  avoid:
    keys_like_this_are_fragile: Keys like this are fragile
  good:
    heading: This is a better key
    call_to_action_message: "I expect this translation to change so I haven't based my key on it"

Hopefully in the example above you can see the point I am trying to make. This
isn’t a rule; more of a guide to help you make a more meaningful choice early
on.

Remove text from HAML and Slim templates

If you use, or have ever used, a template engine such as Slim or
HAML, you'll know they are great for structure, but I find that once you start adding text things can get out of hand.
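
Moving the copy out into a locale file, a partial might look like this instead (a sketch; the view path, keys and copy are hypothetical):

-# app/views/pages/_features.html.haml
%h2= t('.heading')
%ul
  - t('.feature').each do |_key, text|
    %li= text

# config/locales/views/pages/features.en.yml
en:
  pages:
    features:
      heading: Everything in its place
      feature:
        structure: Templates stay focused on structure
        copy: Copy lives in the locale file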

That's better! You might notice I call #each on the .feature translation;
this lets me quickly mention namespace lookups: if you call a
key with children, it returns a Hash with the translations as values, which
can be used to create the list.

In my opinion this keeps your focus on good structure without the noise and
distraction of interpolated strings. This can be used for ERb templates too.

Bonus level: Pluralization

I just wanted to share one of my favourite uses for i18n: handling
pluralization. Not simply changing a singular form to a plural; I'm talking
about adapting the entire sentence based on the count. For example, given the
following locale file:

en:
  comments:
    number_of_comments:
      zero: No one has left a comment yet
      one: There is %{count} comment
      other: There are %{count} comments

You can probably see what’s going on just from that example; depending on the
count you pass to the translation it will select the appropriate response.

It should be noted that you don't need to include all those options. If you don't include a special case for zero, it will fall back to other.

I18n.t('comments.number_of_comments', count: @comments.size)

This can help reduce a lot of unnecessary code in your view, not to mention the
benefit of providing users with a better message.

I suggest taking a look at the documentation for i18n in Rails to
learn more about its features and creative uses. This article is more about
how to better manage your locales and really only scratches the surface of
what you can do with i18n.

]]>2015-06-17T09:00:00+00:00http://webuild.envato.com/blog/envato-market-structure-styleguide

Today we have released the Envato Market ‘Structure Styleguide’ to the public.

A Structure Styleguide is a special breed of living styleguide, designed to document an application’s UI logic and all of the possible permutations of the UI.

The goal is to have a complete and shared understanding of how an application behaves in any situation, even in the most uncommon edge cases, without having to laboriously recreate each scenario by hand.

What does it do?

The Envato Market Structure Styleguide is the very same tool that our developers, designers, user experience experts and product team use day-to-day to build Envato Market. It is a living, breathing work in progress and its various elements are likely to change at any time without warning.

Although at this stage only a small percentage of the application lives in the styleguide, we are continually adding more as we develop new and exciting features or have the chance to improve existing ones.

Why do this?

Although the content itself may not be super useful to anyone else, we hope that by making it public we can share our knowledge and experience of a development technique that we have been refining for the past two years and found tremendously beneficial in building complex user interfaces: Styleguide Driven Development.

By making it public, we hope that other people can learn first-hand the benefits of a Structure Styleguide and how to effectively document an application’s UI so that nothing is out-of-sight, out-of-mind and ultimately neglected.

Resources

To learn all about ‘Structure Styleguides’ as well as ‘Styleguide Driven Development’, here are a bunch of resources we’ve put together:

Update: We have been actively working on a new design for Envato Market to provide a better design and user experience; as a result, the Market styleguide will no longer be maintained. Please use our new, updated site: envato.design.

]]>2015-05-27T10:56:00+00:00http://webuild.envato.com/blog/your-puppet-code-base-makes-you-fear-the-apocalypse

Let me paint you a picture. At some point in time someone said, ‘Hey, wouldn’t it be great if we could manage our servers with that new puppet thing?’ ‘Great,’ said everyone, ‘let’s do that and the way we have always done it.’ And that, my friends, is how you end up where we are.

Our puppet code base reads much like a bad folding story: everyone had a plot line and tried to weave it into a flowing narrative. And like all badly written stories, it has many dead ends, twists, continuity issues and obscure meanings. That's what you get from many different authors; all code bases face the same problem.

So in this particular story I'm going to tell how we began solving some of the mysteries within this Odyssey-like tale.

Without mixing my metaphors, it starts with two families living in a village: the Webs and the Web-Trans. The Web-Trans were an old family of servers that had been managed with the ‘old ways’ of building servers. Rather than being a respected family of web servers, people started to fear them because they were just not like other web servers.

The Webs on the other hand, were a new addition to the village. Much more certain in themselves. But still with that titillating vulnerability that you don’t quite know what you’re getting.

Conformity became a very strong movement in the village - no one liked odd things anymore. Something had to change. Now the simple way of fixing this conflict would have been to start a village war where hopefully one side would be wiped out and forever erased from the commit history. But no, not yet at least.

The messy problem is that these two web families sometimes used different and sometimes the same paths through the village to get to the end of whatever they are getting to the end of; let’s just call it a happy wedding where they serve web requests. And along each of these pathways that the families take, they often stop and ask the village Gods for directions because they forgot their maps, or lost their way or whatever. So the people that have planned the wedding don’t always know if one of the web families will turn up, be dressed the right way, or have their heads on.

So how do we make certain that whatever path our web families take they turn up to the wedding the way we want them? We first of all describe to them how they should look and then we test that they are what we expect.

Now I want to talk about Zombies - see it is very hard to stay in your storyline, but I do want to break out of this villagey weddingy metaphor and just start getting to the real story.

So we have two ‘types’ of web servers that do exactly the same thing but take two different paths to get there, one with more certainty than the other. Okay, pick the more certain one and start testing against that.

In puppet world (man, should have used a story about muppets) we can test modules with things like puppet-rspec and other tools including using vagrant to provision a host for us. That is a good way of doing some quick testing of your manifests. But we wanted to test that the two types of web server actually produced the same web server configuration state. So when the village war started we could be assured that the victor would be what we wanted. With that information we can then even wipe out the entire village, make a new village with completely different paths and lights and stop signs and discos and everything and still have the same type of web server at the end.

We decided to introduce ServerSpec to the village. As you may know it is a way of testing your actual server state with rspec code. It ties in nicely with vagrant and is a great addition to your TDD infrastructure.

You can use the serverspec documentation to set up your testing, but here is how we test both types of web host without duplicating the tests. Having the same expected state means we can begin to remove the obvious differences and validate that we still get what we expect in a web server.
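
The layout looks something like the sketch below. This is a reconstruction, since the original diagram isn't reproduced here, and the nginx expectations are placeholders rather than our real manifests; the key trick is that mp_web_trans is a symlink to mp_web, so both host types run an identical suite:

|- spec
|- - shared
|- - - web_server_spec.rb
|- - mp_web
|- - - web_spec.rb
|- - mp_web_trans -> mp_web

# spec/shared/web_server_spec.rb -- shared expectations for any web host
shared_examples 'a web server' do
  describe service('nginx') do
    it { should be_enabled }
    it { should be_running }
  end

  describe port(80) do
    it { should be_listening }
  end
end

# spec/mp_web/web_spec.rb -- also runs for mp_web_trans via the symlink
require 'spec_helper'
require_relative '../shared/web_server_spec'

describe 'web host' do
  it_behaves_like 'a web server'
end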

In the layout above, a vagrant host called mp_web executes a bunch of tests, including some shared tests. Using the symlink for mp_web_trans, we perform the same tests for mp_web_trans hosts as we do for mp_web. Now when a change is made to the mp_web_trans hosts, we can validate that they will work the same as an mp_web host. Of course this makes deploying changes a much more assured thing.

So when the Zombie apocalypse comes and the village is under attack and the story starts taking a surprising turn, our wedding can still go on and be a happy event driven experience.

]]>2015-04-10T08:39:00+00:00http://webuild.envato.com/blog/push-button-deployments

Envato is becoming a large company, with several teams working on many different products spread across various technology stacks.

Each of our teams is responsible for its own deployment process, which means each has ended up with its own way to deploy changes. It's complicated to grant a new developer access to all the tools and services they need to deploy a single project, let alone multiple projects.

We have just finished one of our quarterly “hack weeks” here at Envato, where we take time out from our usual programme of work to spend time on solving new or interesting problems. Our team of 5 developers decided to try out Zendesk’s open source deployment tool Samson to improve our deployment process.

Compared to our current process of each developer deploying from their local machine, a centralised deployment tool has some benefits. You don't need a complex set of permissions, nor do you have to add SSH keys to a bunch of servers to deploy – you simply need web access to Samson. Some further benefits:

Logins to Samson are authenticated via OAuth, from either Github or Google

Deploys are logged

Multiple users can view an ongoing deployment

Our deploy process now looks like this:

A developer merges a pull request into the master branch in GitHub

The merge triggers a build of the master branch in our CI service Buildkite

Passing builds notify Samson to create a release

Releases can be deployed by Samson by either the click of a button, or automatically

Samson updates the status in the GitHub deployment API

As a finishing touch for this project, we wanted to perform deploys by pushing a physical button. We got our hands on a wireless pushbutton that is connected to a given project; pressing it triggers a deploy of that project's latest pending release. Here is our CEO Collis, eager to deploy a new feature. With the press of a button, he now can. We're actually going to have him deploy this very blog post by pushing the button during our presentation.

]]>2015-03-23T15:56:00+00:00http://webuild.envato.com/blog/announcing-aldous

About a year ago the Tuts team encountered an issue with our Rails codebases becoming more unwieldy as they grew. Our development pace was slowing down, and new features and fixes would cause regressions in unexpected parts of the system. This is a problem many of us have encountered before, and we often treat it as part and parcel of doing business. This time, however, we decided to see if we could improve the situation.

Our goals were:

keep a constant development pace regardless of the size of the codebase

ensure that new features and fixes don’t cause regressions in unrelated parts of the system

Focusing our attention on OO design helped to address some of these concerns, but along the way we found a few common threads which could be codified into some helpful patterns. We created a gem and slowly started augmenting Rails, trying things out on our own codebases as we went. A little while ago, we decided that our gem was finally useful enough to release publicly.

Meet Aldous - a Brave New World for Rails with more cohesion, less coupling and greater development speed.

The main issues common to larger Rails codebases that we try to address are:

controllers that contain a lot of logic, spread around many before_actions

the lack of proper view objects, having to make do with templates and helpers

Have a glance over the README; it is pretty extensive. We also plan to write a few blog posts about how to use Aldous and to give more detail about some of the motivations behind it. As the posts appear, we will add them to the README.

We hope you’ll give it a try. Bug reports, pull requests and general comments are welcome.