Shipping Software in FinTech

Posted by Dennis Ideler & Oleksandr Kruk

Funding Circle is one of Europe’s leading FinTech companies.
Our headquarters are based in the City of London, one of the oldest financial centres in the world and still a global leader.
Naturally there are a lot of financial regulations with which our business must comply.

In this post we describe how regulation affects our agile development process, and present an open-source tool that
helps with some of the challenges faced. Afterwards we discuss some of the implementation details – this
section is more technical and is included as optional reading.

Funding Circle has seen tremendous growth over the last year. We’ve become an
established financial institution and with that comes financial regulation.
One of the regulatory bodies in the UK is the Financial Conduct Authority (FCA).
We have been operating under interim permission from the FCA since April 2014
and continue to work closely with them to obtain a full operating license.

With regulation comes process. There are a lot more checks in place now than
when we were a startup, and rightfully so to protect our client money and data.
The increased trust that comes with such process is a win, but if the business
isn’t careful, the tradeoff is a loss in speed from all the red tape.

With our goal to become compliant, an anti-goal was slowing down. To be
successful, we have to be able to make quick decisions and deploy software multiple
times a day. A non-goal was to be faster - if it happened that would be
great, but we wouldn’t actively be trying to achieve that.

Becoming FCA compliant affects virtually every department of our business.
For the tech team it meant being able to answer ‘yes’ to a very long list of
questions about engineering and security practices. Fortunately for us there
were quite a lot of practices which we already satisfied or exceeded. If we can prove
during an audit that we follow good practices, we can answer ‘yes’ to those.
For those which we cannot, we have to explain what we are doing
to get to a ‘yes’ answer.

One of the sections of the compliance checklist was focused on the governance of software
delivery. Our approach was to build a system that would create an inspectable audit trail of
the software development lifecycle, highlighting practices that we follow. It would have to withstand a
surprise technical audit, meaning we had to create a detailed audit trail of events
where we could easily view the state of the world at a specific time in history.

After an inception meeting we had a quick look to see if there were any existing
tools but couldn’t find any that met our criteria. After many discussions and diagrams in a cramped cold
room we dubbed “The Icebox”, we all decided on the following:

Our unit of currency would be Git SHAs.
These are the software versions that ultimately get shipped.

Track the full delivery process for all released software versions.
From story inception to deployment and everything in between.

Keep the tool fairly generic and not too tightly coupled to specific services.
In case the business decides to switch from GitHub to BitBucket for example.

The tool would only observe and alert, not lock down. Though nothing prevents
another process from locking things down based on the information retrieved from our tool.

We open sourced the tool from the beginning, as it would be a lot more challenging to do after
it had been built. It’s called Shipment Tracker[1] and you can find it on GitHub.

Shipment Tracker brings our software development processes closer to becoming fully compliant,
though it’s important to note that it only addresses certain needs. It’s not a
silver bullet – there will be other changes to your processes and tools that need
to be made in addition to using a tool such as Shipment Tracker.

When a feature is ready to be reviewed by the product owner (PO), the developer presents them with a Feature
Review. This is a page that has a checklist of various processes, such as relevant tickets and their state,
test results, which user acceptance testing (UAT) environment it’s been deployed to, QA approval, and so on.

[fig. a] The Feature Review page as shown for a specific software version that’s under review.

Some of these processes can be optional. For example, not every change needs a QA review.

Each panel gets its information from events. Shipment Tracker is continuously
receiving and storing events from many different sources – more on this later.

A Feature Review gives the change-control process more visibility. We can use it when signing off a feature to
make sure it’s in a good state. We can also use it during an audit to see which criteria were met at specific
times.

Because this page is a projection of events, we can show the state of a Feature Review at any
time by replaying events up until a given time. If no time is specified, we show the Feature Review at its
current state by replaying all events up to the very last one.

[fig. b] Feature Reviews can be viewed at specific times, down to milliseconds if needed.
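As a rough illustration, this replay can be sketched in plain Ruby. The real projections are richer and read their events from PostgreSQL; the event types and fields below are invented for the example:

```ruby
# A bare-bones event: when it happened, what kind, and a details payload.
Event = Struct.new(:created_at, :type, :details)

# Rebuild a feature review's state by replaying only the events that
# occurred at or before the given cut-off time.
def project_feature_review(events, at:)
  state = { qa_approved: false, build_passed: false }
  events.sort_by(&:created_at).each do |event|
    next if event.created_at > at
    case event.type
    when :qa_approval then state[:qa_approved] = event.details[:approved]
    when :ci_build    then state[:build_passed] = event.details[:success]
    end
  end
  state
end
```

Passing an earlier `at:` shows the state of the world as it was at that moment, which is exactly what a surprise audit needs.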

On every push to GitHub, Shipment Tracker creates a commit status for the associated Feature Review.

[fig. c] GitHub Commit Statuses by Shipment Tracker show the state of any associated Feature Review(s).

When no associated Feature Review exists, it asks you to create one and link it to at least one (JIRA) ticket.
If an associated Feature Review does exist for the commit, it links to it.
Once a Feature Review exists for the feature branch, any child commits on the same branch will have a Feature
Review auto-created for them.

Every repository tracked by Shipment Tracker has its own Releases page where you can see what’s been deployed.
A Release is defined by its software version (Git SHA). Only commits made directly on the canonical branch[2]
are considered to be Releases, as that’s the source for releasable software[3]. This means any commits made
on feature branches will not be shown, but their merge commit will be.

An application can be deployed in multiple regions, with each region having a different deploy queue.
This is why Releases are per geography – indicated by a flag for each geography. New geographies can easily
be added as two-letter country codes[4] by setting an environment variable.

The page is divided into two sections: pending releases and deployed releases. Releases are by default
considered pending. To be marked as deployed, Shipment Tracker must receive a production deploy event
for that software version or for one of its children.

We can see whether a release has been approved or not. Unapproved releases are indicated by a red row and a
variety of statuses, such as being in a pre-approval state, having a code change pushed after approval, or
simply lacking an associated Feature Review.

[fig. e] Releases can be flagged if they haven’t gone through the proper approval process.

The Shipment Tracker tool is an observer, not a gatekeeper. Sometimes things won’t follow the happy path.

Unauthorised Releases trigger a deploy alert at the time of deployment. This can happen for a variety of
reasons, such as deploying an older software version. A full list can be found on the Shipment Tracker wiki.

[fig. f] Risky deploys are alerted to the business via a Slack channel.

Currently such alerts go to a Slack channel, but the tool can easily be extended to send email alerts.
There, the Deployer or Product Owner can justify the release or retrospectively rectify the situation.

The Shipment Tracker landing page is the Search page. Here you can search for released features that have gone
through the Shipment Tracker process. By default it shows releases for the current day.

[fig. g] Tickets that have been released can be found via the Search page.

The query can contain the app name or SHA of a deployed commit, or any keywords from a ticket title or
description. The search criteria are weighted. For example, matches against deploy data are the most relevant,
so those results appear first.

The results will show all relevant tickets with their title and snippet of their description. The bottom
of each ticket panel shows deployment information, such as the region, time of deployment, app name, and short SHA.

Implementation

Shipment Tracker is a typical Rails application with a PostgreSQL database and background jobs.
It has an authenticated GUI and API. We used Auth0 for user authentication but this is
easily configurable with environment variables. In fact, because we built Shipment Tracker as an open source
application from the start, it affected a lot of our design decisions. We strived to follow the guidelines of
a twelve-factor app wherever possible.

A few areas where Shipment Tracker stands out from other Rails apps are:
- Interaction with Git - there’s quite a bit of this
- PostgreSQL is used in a variety of ways, including as a NoSQL store (via JSON) and full-text search
- Storing state as an event stream, which creates an audit trail with time travel capability

Shipment Tracker relies on the concept of Event Sourcing.
In summary, event sourcing treats all system state changes as a stream of events, recording each
state change as an event object. This allows you to reconstruct
the state of your application at any given time by replaying all the events, assuming you have stored every
event since the application began operating.

Shipment Tracker stores all events in PostgreSQL, a relational database management system (RDBMS).

[fig. h] Shipment Tracker keeps track of the software development lifecycle by receiving events from a
variety of external and internal sources. For brevity, not all event sources are shown here.

Shipment Tracker uses event sourcing to provide traceability of all the relevant actions
taken during the software development and delivery process. The platform provides a webhook to which all the
external services report. This webhook is the source of all the events the system needs to build its
current state. Tracking new data is as easy as creating a new event type with a specific endpoint.

To get the current state of the world, state has to be accumulated by applying all events, from the very
beginning up to the specific point of interest in time. This is inefficient and doesn’t scale well:
the time it takes to apply events grows linearly with the number of events.

Instead of storing events and lazily (re)applying them whenever the system has to project information,
we can apply events as we receive them and persist the accumulated state so it can be efficiently looked up.
This is called “snapshotting” and is common in systems that use event sourcing.

A snapshot is a record of accumulated state from events, for a specific model. For example, events from a CI
source such as CircleCI, Travis, or Jenkins would be normalised into build snapshots with these attributes:[5]

[fig. i] Events are normalized into persisted snapshots, which are more efficient to work with.

So any information that is ultimately projected to users is retrieved from snapshots instead of raw events.

Queries – which are used in controllers – collect information to be projected in the views.
A Query can have multiple Repositories for data retrieval.
Every Repository has a data store, which contains Snapshots for a specific model.
Finally, a Snapshot normalizes a raw Event.

[fig. j] Views project information from Queries. Queries use Repositories for interacting with the persistence layer.
Repositories each have a Snapshot store. Snapshots store accumulated state of Events.

Snapshotting is a continuous process. There is an infinite loop that snapshots every new event.
To identify new events, there is an EventCounts table that keeps track of the last event applied for each
snapshot type. Events with an id greater than the last applied event are considered new and pending snapshot.

[fig. k] The EventCounts table tracks the last event applied for each event repository.
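A minimal sketch of that loop, with a plain hash standing in for the EventCounts table (all names here are illustrative, not the actual schema):

```ruby
# Applies only events newer than the last id recorded for each snapshot
# type, accumulating per-SHA state that projections can read directly.
class SnapshotWorker
  def initialize
    @event_counts = Hash.new(0)                 # snapshot type => last applied event id
    @snapshots = Hash.new { |h, k| h[k] = {} }  # snapshot type => { sha => state }
  end

  # events: hashes like { id:, type:, sha:, details: }, ordered by id
  def apply(events, snapshot_type)
    events.each do |event|
      next if event[:id] <= @event_counts[snapshot_type]  # already applied
      if event[:type] == snapshot_type
        @snapshots[snapshot_type][event[:sha]] = event[:details]
      end
      @event_counts[snapshot_type] = event[:id]
    end
  end

  # Projections read from here instead of replaying raw events.
  def snapshot(snapshot_type, sha)
    @snapshots[snapshot_type][sha]
  end
end
```

Because the event count only moves forward, re-running the loop over already-seen events is a no-op, which is what makes continuous snapshotting safe.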

Sometimes we need to wipe all snapshots and recreate them. An example is adding a new attribute to a snapshot.
We would have to recreate all snapshots of that type so that events are re-applied and the relevant information
can be extracted for the new field.
In these cases we put the application into maintenance mode. A Capistrano task sets DATA_MAINTENANCE=true,
stops the background workers (so they don’t process new incoming events), then restarts the application.
A rake task is then automatically triggered to recreate snapshots - but only up to the previous snapshot count.
After resnapshotting is completed, maintenance mode is automatically turned off and any new events that were
queued are snapshotted as usual.

All events are derived from an abstract BaseEvent model and stored in the database.

[fig. l] Events have a straightforward structure. Event-specific details are stored as JSON. The
unstructured data allows for quick and easy modification of payloads from event sources.

The only “custom” field is the details column which holds all event metadata. The others are common columns
provided by Rails. We have the typical timestamp columns which we need for event sourcing, and we also have a
type column that we need for Single Table Inheritance (STI).

Specific types of events derived from the base event will have their type set. This allows us to use
BaseEvent to query all events, or to narrow down our queries to a specific event type, such as
JiraEvent.
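In plain Ruby, the idea looks roughly like this. The real models inherit from ActiveRecord, and the JIRA payload fields shown are assumptions based on JIRA's webhook format:

```ruby
# Base class: every event carries an unstructured details payload.
class BaseEvent
  attr_reader :details, :created_at

  def initialize(details, created_at: Time.now)
    @details = details
    @created_at = created_at
  end
end

# A typed event picks out only the fields it cares about from the
# (possibly huge) payload; everything else is simply ignored.
class JiraEvent < BaseEvent
  def key
    details.dig("issue", "key")
  end

  def status
    details.dig("issue", "fields", "status", "name")
  end
end
```

Any unrecognised keys in the payload are simply carried along in `details`, which is what lets event sources change their payloads without schema migrations.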

Note that the details column uses a JSON data type. This makes it very easy for us to accept any type of payload
for event details, which is essential as we have a large amount of event sources, each being very different.
Many of these are from third-party sources where we do not have any control over the payload. So we accept anything
but we only pick and choose what we need from the details payload, and that varies for different events.

Let’s look at the JIRA case. JIRA allows you to specify an endpoint to be notified of the
specific modifications we are interested in. In our case, that’s the ticket lifecycle: whenever a ticket
moves between states, e.g. “To Do”, “In Progress”, “Ready for Deploy”, we want to be notified.
This interest in a ticket’s state changes allows us to answer questions like:

“Was the code changed in context of ‘ticket A’ after the ticket was approved by PO?”

“What is the current state of ‘ticket A’?”

In order to explain how we answer those questions, we need to introduce GitHub push notifications.

GitHub, like JIRA, simplifies our life by allowing us to set up an endpoint to which it will send the
events we are interested in. To answer our questions, we only need a notification
event when a code change is pushed to a repository we are tracking. We do not filter by the destination
of the commits; pushes to all branches are processed.
How do we use the information from GitHub push events?
First, let’s look at some key fields in the body of a GitHub push event:
{ "repo": "audited_repo", "sha": "29efedc", "parent": "4acde2d" }

As you probably guessed, the sha key points to the SHA of the commit we are being notified about,
while the parent key points to the SHA of its parent commit.

With this information we can easily identify which repository was updated, but what about information
on branches? In order to quickly query a repository for relevant information,
we decided to maintain a local copy of each repository on every deployed Shipment Tracker instance.
We’ll give more details on this later; for now, let’s assume that we have an up-to-date local copy of each
repository that we track. We use rugged to manage the locally cloned repositories. Rugged is a very useful
gem (a wrapper for the C library libgit2) that lets us query the repository to find out which
commit belongs to which branch, which commits were made between commits A and B, whether commit A was merged to
the master branch, and other important information.

Now that we have a webhook that notifies us whenever the code changes, and an easy way to find out
more about a commit by querying the local repository, we can do some interesting things.
Whenever we receive a push event from GitHub, Shipment Tracker does the following:

Checks whether there is any association between the parent commit and a JIRA ticket; if so, it associates
that same ticket (or tickets) with the newly pushed commit.

Checks which status the associated ticket is in, and posts it to GitHub as a commit status. The status can be
one of those described in figure c.
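A toy version of that push handler might look like this. The status names and data shapes are invented for illustration; the real logic lives in Shipment Tracker's event handlers:

```ruby
# ticket_links: { sha => [ticket keys] }, mutated in place to inherit
# the parent commit's ticket associations.
# Returns the commit status we would post back to GitHub.
def handle_push(ticket_links, ticket_states, sha:, parent:)
  tickets = ticket_links.fetch(parent, [])
  ticket_links[sha] = tickets unless tickets.empty?

  if tickets.empty?
    :no_feature_review   # ask the developer to create one
  elsif tickets.all? { |t| ticket_states[t] == "Ready for Deploy" }
    :approved
  else
    :awaiting_approval
  end
end
```

Because the child commit inherits the parent's ticket links, the whole feature branch stays associated with the ticket after the first link is made.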

These commit statuses on GitHub are a convenient way for developers to check what status a feature’s
ticket is in. They can also be used to implement a specific “merge to master” policy, e.g. a developer
can only merge the PR if the ticket in JIRA is approved. (This is not the practice at Funding Circle, given
that we prefer to monitor and alert rather than restrict people’s actions and slow down the development process.)

In the section on GitHub events we mentioned that we keep a copy of all Git repositories
that are under audit by Shipment Tracker. Now let’s look at how we keep those copies up
to date.
In order to query the most recent state of the code in each local repo, we need to update them
frequently. We have experimented with two different solutions:

Fetch the repo updates whenever there is a web request that will require querying the repository

Run a background job which compares each local repo against its remote every minute, and fetches
updates whenever a local repo is outdated.

Let’s look at each solution closely and analyse the pros and cons. Solution 1 is quite efficient, given that
we are not polling the repos all the time; a repo is only cloned when there is a clear need to query it.
The problem manifests when we deal with larger repos which take some time to update, or when we have
to clone a repo on first access. The size of the repo significantly impacts the response time of
the web requests, which is quite frustrating for users.

After living with this solution for a couple of
months, we decided that the impact on the user experience was too big, and started developing a better
version, which resulted in solution 2 above.

We are still evaluating this second solution and we have already identified a couple of advantages and
disadvantages. This solution provides a much better user experience in most cases by keeping all the
repositories up to date in the background. All the repositories are scanned in an infinite loop.

It keeps all the repos up to date quickly in most cases, but has the associated cost of a tireless background worker.
One problem we’ve foreseen is scaling the number of tracked repositories.

Imagine we have a thousand repositories to keep track of, and assume that in a given update loop 300 of them
are outdated. If an update operation takes on average 2 seconds, then 2×300 gives us a delay of 600s,
or 10 minutes, before the last repository is updated. This may be acceptable for small
companies, which are unlikely to reach those numbers, but when you have dozens of teams working in parallel
on a couple of repositories each, you might hit the acceptable limit quite easily. Luckily this is not the
worst problem to have, since it can be solved with threading. With only ten threads we can reduce the
longest outage to one minute, which looks much more acceptable, and if it’s not, we can always use more
threads. In fact, the most reasonable approach to this problem is probably to define the maximum outage time
and calculate the number of threads needed to satisfy that criterion.

Let’s look at an example. With the exact same number of outdated repositories, 300, an average of 2s
per repo update, and a desired maximum outage of 10 seconds, each thread can update 10/2 = 5 repos within
the window, so the number of threads needed is 300/5 = 60. Sixty is not a low number by any means, but it’s
still manageable. In the real world we would of course define a maximum threshold for the number of threads,
and establish the maximum number of repos for which we can guarantee a maximum outage of 10s.
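The arithmetic above generalises to a small helper. Under the stated assumptions (uniform update times, perfect parallelism), it reproduces the 60-thread answer:

```ruby
# Each thread can update (max_outage / seconds_per_update) repos within
# the outage window, so divide the backlog by that capacity and round up.
def threads_needed(outdated_repos:, seconds_per_update:, max_outage:)
  repos_per_thread = max_outage / seconds_per_update
  (outdated_repos.to_f / repos_per_thread).ceil
end

threads_needed(outdated_repos: 300, seconds_per_update: 2, max_outage: 10)  # => 60
```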

In the case of CircleCI, we receive a notification event whenever a build finishes.
These events allow us to determine whether the build for a given commit was successful.
This information is shown in the “Test Results” section of the Feature Review page pictured above.

This information is relevant for the Product Owner and QA: it is a quick way of confirming that the
implementation of a specific feature builds correctly and passes all tests.
Another use case, planned for the future, is to allow automated deploys once a build goes green.
A CircleCI build would package the application version and store it in a convenient place; then, with a single
click of a button, that version could be deployed.

The landing page of Shipment Tracker is a search page for Released Tickets, as seen in figure g.
For this we use PostgreSQL’s full text search
instead of a dedicated full-text search tool such as Elasticsearch.

Postgres’s text search system preprocesses documents[6] by reducing them to the tsvector format –
a compact representation of the full document. The tsvector data type stores lexemes (key words) with
associated rankings. It’s like a hash map specifically for weighted search.

Some of the words look weird because they’re stemmed. Postgres uses a dictionary to eliminate common words
that should not be considered in a search, and to stem words so that different derived forms of the same word
will match. For example, “jumping” and “jumped” would be stored as a lexeme like “jump”.

Searching and ranking are performed entirely on the tsvector — the original text only needs to be retrieved
when the document has been selected for display to a user.

Search results are ranked. Certain parts of a document (e.g. titles) can be given higher relevance by using
weights, so that when there are matches, the most relevant results are shown first.

We use the pg_search gem to extend ActiveRecord (the default ORM
for Rails). It provides ActiveRecord callbacks to update the tsv indexes, but some ActiveRecord operations
skip callbacks, so it’s much safer to have a trigger defined on the database.

This trigger will set the weights for records that have been touched and index them.

A pleasant side effect we hadn’t anticipated is that we are now collecting a lot of information about our development process, and we can answer interesting questions about it. Some of the most interesting are:

How many times do we deploy in a day?

How long, on average, does it take from development start until the deploy of a feature?

How many unauthorised deploys do we have per month?

This information is extremely valuable when you want to improve and speed up the process, because it allows you to see where the bottlenecks are and focus immediately on the pain points without losing too much time on investigation.

Our acceptance tests are integration tests. Here we almost never stub, except for cases such as communication
with a third-party service. These tests use Cucumber (for the Gherkin syntax), Capybara (for simulating
user interaction), and RSpec (for test assertions).

Our acceptance tests are usually written in a BDD style,
focusing on the business value from a customer’s perspective. The narrative follows the style of a user story.
For example:

Feature: Managing Repository Locations
  As an application onboarder
  I want to add a repository from GitHub
  Because I want an audit trail of the application's development

  Scenario: Add repositories
    Given I am on the new repository location form
    When I enter a valid uri "ssh://github.com/new_app"
    Then I should see the repository locations:
      | Name    | URI                      |
      | new_app | ssh://github.com/new_app |

From a development perspective, the outside-in approach lets us focus on getting a feature working first,
and then making it better while being confident that it doesn’t break. This prevents us from getting bogged down
in design details. Because acceptance tests are expensive, we don’t use them to test exhaustively; instead
we test only the business-critical paths, which are usually happy paths.

After unit tests and acceptance tests are run, we run a Ruby style linter known as RuboCop.
It runs last because the changes it prompts are usually cosmetic, and we prefer to see that the code behaves as
specified before making cosmetic changes. On projects with a slower test suite, you’ll usually see such linters
run first.

We introduced performance tests to benchmark projections before and after snapshotting.
They were used to justify implementing snapshots, as event sourcing wasn’t scalable without them: page
requests would very quickly start to time out.

Testing with Git can be difficult. Git repositories need to be built up and torn down on the spot, while
keeping the tests easy to understand. To make it easier, we introduced some helpers.

The interface for interacting with a test Git repository is the GitTestRepository class.
This class helps us create basic test repositories and handles all the common operations for us, such as
creating commits, and creating and merging branches. To reference specific commits in our tests without
knowing their SHAs beforehand, we use “pretend commits”: a hash that links a
predetermined human-readable key, such as “#abc”, to an actual commit.

Overall the introduction of Shipment Tracker was a positive experience, and it has been running successfully for almost a year.
We are really happy to be able to carry on developing with agility while being regulated,
instead of setting up blockers that would frustrate the development team and slow down the company.

There are some usability enhancements still to be made so that we reduce distraction time to a minimum.
For example, a typo fix to the README is not a change that affects production; perhaps a commit message keyword could skip some of the strict checks.
There are also some missing features, such as alerting on out-of-hours deploys or unlinking tickets from a Feature Review.

As a nice-to-have, it would also be interesting to create a statistics page,
charting the answers to the questions mentioned in the Data analysis section.

We hope this project might be of use to others and would love to see some contributions given its open-source nature.
Don’t hesitate to contact us if you are interested in exploring more of Shipment Tracker.

[1] The name “Snowden” won an internal poll but was rejected due to its controversial meaning.

[3] Container-based deployment using a tool such as Docker is becoming increasingly popular.
In such cases we recommend setting image tags to the commit SHA.
That ensures the software version can be easily found when sending a deploy event.

[4] Country codes as defined in ISO 3166-1 alpha-2. For example, ‘GB’ for the United Kingdom and ‘US’ for the United States of America.

[5] Notice the event_created_at field which each snapshot has. It allows the system to easily move across history to specific points in time. For example, we may want to show that a build for a software version was failing at the time of ticket approval, but a later rebuild was passing. Damn flaky tests.

[6] For text search purposes, a document is the unit of searching; in other words, the text to be searched.