If you develop or administer apps running on Google Cloud Platform (GCP), you’re probably familiar with Cloud Shell, an on-demand interactive shell environment that contains a wide variety of pre-installed developer tools. Up until now, you could only access Cloud Shell from your browser. Today, we're introducing the ability to connect to Cloud Shell directly from your terminal using the gcloud command-line tool.
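For example, once you have the Cloud SDK installed, starting a session from your local terminal looks something like this (the command group is in alpha, so the exact invocation may change):

gcloud components install alpha   # one-time, if the alpha commands aren't installed yet
gcloud alpha cloud-shell ssh      # open an interactive Cloud Shell session in your terminal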

If you're using Mac or Linux, you can even mount your Cloud Shell home directory onto your local file system after installing sshfs. This allows you to edit the files in your Cloud Shell home directory using whatever local tools you want! All the data in your remotely mounted file system is stored on a Persistent Disk, so it's fast, strongly consistent and retained across sessions and regions.
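As a sketch, once sshfs is installed, mounting looks something like this (the subcommand prints the right sshfs invocation for your account; the mount point shown is just an example):

# Print the sshfs command that mounts your Cloud Shell home directory
# at ~/cloudshell on your local machine, then run the printed command.
gcloud alpha cloud-shell get-mount-command ~/cloudshell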

We're sure you'll find plenty of uses for these features, but here are a few to get you started:

Use it as a playground — take advantage of the tools and language runtimes installed in Cloud Shell to do quick experiments without having to install software on your machine.

Use it as a sandbox — install or run untrusted programs in Cloud Shell without the risk of them damaging your local machine or reading your data, or to avoid polluting your machine with programs you rarely need to run.

Use it as a portable development environment — store your files in your Cloud Shell home directory and edit them using your favorite IDEs when you're at your desk, then keep working on the same files later from a Chromebook using the web terminal and editor.

The full documentation for the command-line interface is available here. The cloud-shell command group is currently in alpha, so we're still making changes to it and welcome your feedback and suggestions via the feedback link at the bottom of the documentation page.

As companies onboard to Kubernetes, one of their goals is to provide developers with an iteration and deployment experience that closely mirrors production. To help companies achieve this goal, we recently announced Skaffold, a command-line tool that facilitates continuous development for Kubernetes applications. With Skaffold, developers can iterate on application source code locally while having it continually updated and ready for validation or testing in their local or remote Kubernetes clusters. Automating the development workflow saves time and increases the quality of the application on its journey to production.

Kubernetes provides operators with APIs and methodologies that increase their agility and facilitate reliable deployment of their software. Kubernetes replaces bespoke deployment methodologies with programmatic ways to achieve similar, if not more robust, procedures. Kubernetes’ functionality helps operations teams apply common best practices like infrastructure as code, unified logging, immutable infrastructure and safer API-driven deployment strategies like canary and blue/green. Operators can now focus on the parts of infrastructure management that are most critical to their organizations, supporting high release velocity with a minimum of risk to their services.

But in some cases, developers are the last people in an organization to be introduced to Kubernetes, even as operations teams are well versed in the benefits of its deployment methodologies. Developers may have already taken steps to create reproducible packaging for their applications with Linux containers, like Docker. Docker lets them produce repeatable runtime environments where they can define the dependencies and configuration of their applications in a simple way. This keeps development runtimes in sync across the team; however, it doesn’t introduce a common deployment and validation methodology. For that, developers will want to use the same Kubernetes APIs and methodologies that are used in production to create a similar integration and manual testing environment.

Once developers have figured out how Kubernetes works, they need to use the Kubernetes APIs to accomplish their tasks. In this process they'll need to:

1. Find or deploy a Kubernetes cluster

2. Build and upload their Docker images to a registry that's enabled in their cluster

3. Use the reference documentation and examples to create their first Kubernetes manifest definitions

4. Use the kubectl CLI or Kubernetes Dashboard to deploy their application definitions

5. Repeat steps 2-4 until their feature, bug fix or changeset is complete

6. Check in their changes and run them through a CI process that includes:

Unit testing

Integration testing

Deployment to a test or staging environment

Steps 2 through 5 require developers to use many tools via multiple interfaces to update their applications. Most of these steps are undifferentiated work for developers and can be automated, or at the very least guided by a set of tools tailored to a developer's experience.

Enter Skaffold, which automates the workflow for building, pushing and deploying applications. Developers can start Skaffold in the background while they're developing their code, and have it continually update their application without any input or additional commands. It can also be used in an automated context such as a CI/CD pipeline to leverage the same workflow and tooling when moving applications to production.
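For instance, assuming a skaffold.yaml already describes how to build and deploy your application, the core developer loop is a single command:

# Watch source files; on every change, rebuild the image, push it if needed,
# and redeploy the updated manifests to the current Kubernetes cluster.
skaffold dev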

Skaffold features

Skaffold is an early phase open-source project that includes the following design considerations and capabilities:

No server-side components mean no overhead to your cluster.

Allows you to detect changes in your source code and automatically build/push/deploy.

Image tag management. Stop worrying about updating the image tags in Kubernetes manifests to push out changes during development.

Supports existing tooling and workflows. Build and deploy APIs make each implementation composable to support many different workflows.

Support for multiple application components. Build and deploy only the pieces of your stack that have changed.

Deploy regularly when saving files or run one-off deployments using the same configuration.

Pluggability

Skaffold has a pluggable architecture that allows you to choose the tools in the developer workflow that work best for you.


Editor’s note: If you recently migrated from one cloud provider to another—or are thinking about making the move—you understand the value of avoiding vendor lock-in by using third-party tools. Tamr, a data unification provider, recently made the switch from AWS to Google Cloud Platform, bringing with them a variety of DevOps tools to help with the migration and day-to-day operations. Check out their recommendations for everything from configuration management to storage to user management.

Here at Tamr, we recently migrated from AWS to Google Cloud Platform (GCP), for a wide variety of reasons, including more consistent compute performance, cheaper machines, preemptible machines and better committed usage stories, to name a few. The larger story of our migration itself is worth its own blog post, which will be coming in the future, but today, we’d like to walk through the tools that we used internally that allowed us to make the switch in the first place. Because of these tools, we migrated with no downtime and were able to re-use almost all of the automation/management code we’d developed internally over the past couple of years.

We attribute a big part of our success to having been a DevOps shop for the past few years. When we first built out our DevOps department, we knew that we needed to be as flexible as possible. From day one, we had a set of goals that would drive our decisions as a team, and which technologies we would use. Those goals have proved themselves as they have held up over time, and more recently allowed us to seamlessly migrate our platform from AWS to GCP and Google Compute Engine.

Here were our goals. Some you’ll recognize as common DevOps mantras, others were more specific to our organization:

Automate everything, and its corollary, "everything is code"

Treat servers as cattle, not pets

Scale our DevOps team sublinearly in relation to the number of servers and services we support

Don’t be tied into one vendor/cloud ecosystem. Flexibility matters, as we also ship our entire stack and install it on-prem at our customers' sites

Our first goal was well defined and simple. We wanted all operation tasks to be fully automated. Full stop. Though we would have to build our own tooling in some cases, for the most part there's a very rich set of open source tools out there that can solve 95% of our automation problems with very little effort. And by defining everything as code, we could easily review each change and version everything in git.

Treating servers as cattle, not pets is core to the DevOps philosophy. Server "pets" have names like postgres-master, and require you to maintain them by hand: you run commands on them via a shell and upgrade settings and packages yourself. Instead, we wanted to focus on primitives like the amount of cores and RAM that our services need to run. We also wanted to be able to kill any server in the cluster at any time without having to notify anyone. This makes maintenance much easier and more streamlined, as we can do rolling restarts of every server in our fleet. It also ties into our first goal of automating everything.

We also wanted to keep our DevOps team in check. We knew from the get-go that to be successful, we would be running our platform across large fleets of servers. Doing things by hand would have required us to hire and train a large number of operators just to run through set runbooks. By automating everything and investing in tooling, we can scale the number of systems we maintain without having to hire as many people.

Finally, we didn’t want to get tied into one single vendor cloud ecosystem, for both business reasons—we deploy our stack at customer sites—and because we didn’t want to be held hostage by any one cloud provider. To avoid getting locked into a cloud’s proprietary services, we would have to run most things ourselves on our own set of servers. While you may choose to use equivalent services from your cloud provider, we like the independence of this go-it-alone approach.

Our DevOps toolbox

1. Configuration management: Ansible

Picking a configuration management system should be the very first thing you do when building out your DevOps toolbox, because you’ll be using it on every server that you have. For configuration management, we chose to use Ansible; it’s one of the simpler tools to get started with, and you can use it on just about any Linux machine.

You can use Ansible in many different ways: as a scripting language, as a parallel ssh client, and as a traditional configuration management tool. We opted to use it as a configuration management tool and set up our code base following Ansible best practices. In addition to the best practices laid out in the documentation, we went one step further and made all of our Ansible code fully idempotent—that is, we expect to be able to run Ansible at any time, and as long as everything is already up-to-date, for it to not make any changes. We also try to make sure that any package upgrades in Ansible have the correct handlers to ensure a zero-downtime deployment.

We were able to use our entire Ansible code base in both the AWS and GCP environments without having to change any of our actual code. The only things that we needed to change were our dynamic inventory scripts, which are just Python scripts that Ansible executes to find the machines in your environment. Ansible lets you use several of these dynamic inventory scripts simultaneously, which allowed us to run Ansible across both clouds at once.
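As a rough illustration (the inventory script names here are placeholders for your own), recent Ansible releases let you pass more than one inventory source on the command line:

# Run the same playbook against hosts discovered by both dynamic inventory scripts.
ansible-playbook -i inventory/gce.py -i inventory/ec2.py site.yml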

That said, Ansible might not be the right fit for everyone. It can be rather slow for some things and isn’t always ideal in an autoscaling environment, as it's a push-based system, not pull-based (like Puppet and Chef). Some alternatives to Ansible are the aforementioned Puppet and Chef, as well as Salt. They all solve the same general problem (automatic configuration of servers) but are optimized for specific use cases.

2. Infrastructure configuration: Terraform

When it comes to setting up infrastructure such as VPCs, DNS and load balancers, administrators sometimes set up cloud services by hand, then forget they are there, or how they configured them. (I’m guilty of this myself.) The story goes like this: we need a couple of machines to test an integration with a vendor. The vendor wants shell access to the machines to walk us through problems and requests an isolated environment. A month or two goes by and everything is running smoothly, and it’s time to set up a production environment based on the development environment. Do you remember what you did to set it up? What settings you customized? That is where infrastructure-as-code configuration tools can be a lifesaver.

Terraform allows you to codify the settings and infrastructure in your cloud environments using its domain specific language (DSL). It handles everything for you (cloud integrations, and ordering of operations for creating resources) and allows you to provision resources across multiple cloud platforms. For example, in Terraform, you can create DNS records in Google DNS that reference a resource in AWS. This allows you to easily link resources across multiple environments and provision complex networking environments as code. Most cloud providers have a tool for managing resources as code: AWS has CloudFormation, Google has Cloud Deployment Manager, and Openstack has Heat Orchestration Templates. Terraform effectively acts as a superset of all these tools and provides a universal format across all platforms.
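Regardless of which providers you target, the day-to-day workflow stays the same; a minimal sketch:

# Download provider plugins, preview the changes, then apply them.
terraform init
terraform plan -out=tfplan
terraform apply tfplan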

One of the basic building blocks of a cloud environment is a Virtual Machine (VM) image. In AWS, there’s a marketplace with AMI images for just about anything, but we often needed to install tools onto our servers beyond the basic services included in the AMI (think Threatstack agents that monitor activity on the server and scan its packages for CVEs). As a result, it was often easier to just build our own images. We also build custom images for our customers and need to share them into their various cloud accounts. These images need to be available in different regions, as do our own base images that we use internally as the basis for our VMs. Having a consistent way to build images independent of a specific cloud provider and region is a huge benefit.

We use Packer, in conjunction with our Ansible code base, to build all of our images. Packer provides the framework to spin up a machine, run our Ansible code against it, and then save a snapshot of that machine into our account. Because Packer is integrated with configuration management tools, it allowed us to define everything in the AMIs as source code. This lets us easily version images and have confidence that we know exactly what’s in them. It made reproducing problems that customers had with our images trivial, and allowed us to easily generate changelogs for images.

The bigger benefit that we experienced was that when we switched to Compute Engine, we were able to reuse everything we had in AWS. All we needed to change was a couple of lines in Packer to tell it to use Compute Engine instead of AWS. We didn’t have to change anything in the base images that developers use day-to-day or in the base images that we use in our compute clusters.
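As an illustration (the template and builder names are placeholders), a single Packer template can define builders for several clouds, and you simply pick which one to run:

# Build only the Compute Engine image from a template that also defines an AWS builder.
packer build -only=googlecompute image.json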

When we first started building out our infrastructure at Tamr, we knew that we wanted to use containers as I had used them at my previous company and seen how powerful and useful they can be at scale. Internally we have standardized on Docker as our primary container format. It allows us to build a single shippable artifact for a service that we can run on any Linux system. This gives us portability between Linux operating systems without significant effort. In fact, we’ve been able to Dockerize most of our system dependencies throughout the stack, to simplify bootstrapping from a vanilla Linux system.

Containers in and of themselves don’t provide scale or high availability. Docker itself is just a piece of the puzzle. To fully leverage containers you need something to manage them and provide management hooks. This is where a container orchestrator comes in. It allows you to link together your containers and use them to build up services in a consistent, fault-tolerant way.

For our stack we use Apache Mesos as the basis of our compute clusters. Mesos is essentially a distributed kernel for scheduling tasks on servers. It acts as a broker between frameworks requesting resources and the resources (CPU, memory, disk, GPUs) available on machines in the Mesos cluster. One of the most common frameworks for Mesos is Marathon, which ships as part of Mesosphere’s commercial DC/OS (Data Center Operating System) and is the main interface for launching tasks onto a Mesos cluster. Internally we deploy all of our services and dependencies on top of a custom Mesos cluster. We spent a fair amount of time building our own deployment/packaging tool on top of Marathon for shipping releases and handling deployments. (Down the road we hope to open source this tool, in addition to writing a few blog posts about it.)

The Mesos + Marathon approach for hosting services is so flexible that during our migration from AWS to GCP, we were able to span our primary cluster across both clouds. As a result, we were able to slowly switch services running on the cluster from one cloud to another using Marathon constraints. As we were switching over, we simply spun up more Compute Engine machines and then deprecated machines on the AWS side. After a couple of days, all of our services were running on Compute Engine machines, and off of AWS.

However, if we were building our infrastructure from scratch today, we would heavily consider building on top of Kubernetes rather than Mesos. Kubernetes has come a long way since we started building out our infrastructure, but it just wasn’t ready at the time. I highly recommend Google Kubernetes Engine as a starting point for organizations starting to dip their toes into the container orchestration waters. Even though it's a managed service, the fact that it's based on open-source Kubernetes minimizes the risk of cloud lock-in.

One of the first problems we dealt with in our AWS environment was how to provide ssh access to our servers for our development team. Before we automated server provisioning, developers often created a new root key every time they spun up an instance. We soon consolidated to one shared key. Then we upgraded to running an internal LDAP instance. As the organization grew, managing that LDAP server became a pain—we were definitely treating it as a pet. So we went looking for a hosted LDAP/Active Directory offering, which led us to JumpCloud. After working with them, we ended up using their agent on our servers instead of an LDAP connector, even though they have a hosted LDAP endpoint that we do use for other things. The JumpCloud agent syncs with JumpCloud and automatically provisions users, groups and ssh keys onto the server for us. JumpCloud also provides a self-service portal for developers to update their ssh keys. This means that we now spend almost no time actually managing access to our servers; it’s all fully automated.

It’s worth noting that access to machines on Compute Engine is completely different than on AWS. With GCP, users can use the gcloud command-line interface (CLI) to gain access to a machine. The CLI generates an ssh key, provisions it onto the server and creates a user account on the machine (for example: `gcloud compute --project "gce-project" ssh --zone "us-east1-b" "my-machine-name"`). In addition, users can upload their ssh-key/user pairs in the console, and new machines will have those user accounts set up on launch. In other words, the problem of providing ssh access to developers that we ran into on AWS doesn’t exist on Compute Engine.

JumpCloud solved a specific problem with AWS, but provides a portable solution across both GCP and AWS. Using it with GCP works great; however, if you're 100% on GCP, you don’t need to rely on an additional external service such as JumpCloud to manage your users.

Given that we run a large number of services on top of a Mesos cluster, we needed a way to provide persistent storage to the Docker containers running there. Since we treat servers as cattle, not pets (we expect to be able to kill any one server at any time), using Mesos local persistent storage wasn’t an option for us. We ended up using RexRay as an interface for provisioning and mounting disks into containers. RexRay acts as the bridge on a server between disks and a remote storage provider. Its main interface is a Docker storage driver plugin that can make API calls to a wide variety of sources (AWS, GCP, EMC, Digital Ocean and many more) and mount the provisioned storage into a Docker container. In our case, we were using EBS volumes on AWS and persistent disks on Compute Engine. Because RexRay is implemented as a Docker plugin, the only thing we had to change between the environments was the config file with the Compute Engine vs. AWS settings. We didn’t have to change any of our upstream invocations for disk resources.
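A hedged sketch of what this looks like from the Docker side (the volume name, size and container image are placeholders, and the exact driver name depends on how RexRay is installed and configured):

# Create a volume backed by a cloud disk through the RexRay driver...
docker volume create --driver rexray --opt size=100 pgdata
# ...and mount it into a container; the data follows the volume, not the host.
docker run -d -v pgdata:/var/lib/postgresql/data postgres:9.6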

DevOps = Freedom

From the viewpoint of our DevOps team, these tools enabled a smooth migration, without much manual effort. Most things only required updating a couple of config files to be able to talk to Compute Engine APIs. At the top layers in our stack that our developers use, we were able to switch to Compute Engine with no development workflow changes, and zero downtime. Going forward, we see being able to span across and between clouds at will as a competitive advantage, and this would not be possible without the investment we made into our tooling.

Love our list of essential DevOps tools? Hate it? Leave us a note in the comments—we’d love to hear from you. To learn more about Tamr and our data unification service, visit our website.

When we launched the Netherlands region earlier this year, we said the third zone would be along shortly. We opened up the region with two zones as soon as we could to fulfill the growing demand from our customers in Benelux. Now, we’re happy to announce the launch of a third zone (europe-west4-a) in the region. This is the 45th GCP zone globally and now, like other GCP regions, this third zone enables developers to build highly available services that meet the needs of their business.

As an official training partner, g-company is dedicated to helping companies organize their processes more intelligently by offering highly interactive Google Cloud Platform (GCP) training and supporting companies as they build tailor-made applications on GCP to transform their businesses. g-company also led Netherlands-based online travel company Travix through a company-wide migration to Google Cloud.

Resources

For the latest on availability of services from this region as well as additional regions and services, visit our locations page. For guidance on how to build and create highly available applications, take a look at our zones and regions page. Watch this webinar to learn more about how we bring GCP closer to you. Give us a shout to request early access to new regions and help us prioritize what we build next.

As an auditor, you probably spend a lot of time reviewing logs. Google Cloud Audit Logging is an integral part of the Google Stackdriver suite of products, and understanding how it works and how to use it is a key skill you need to implement an auditing approach for systems deployed on Google Cloud Platform (GCP). In this post, we’ll discuss the key functionality of Cloud Audit Logging and call out some best practices.

The first thing to know about Cloud Audit Logging is that it provides two log streams for each project: admin activity logs and data access logs. GCP services generate these logs to help you answer the question of "who did what, where, and when?" within your GCP projects. Further, these logs are distinct from your application logs.

Configure and view audit logs

Getting started with Cloud Audit Logging is simple. Some services are on by default, and others are just a few clicks away from being operational. Here’s how to set up, configure and use various Cloud Audit Logging capabilities.

Configuring audit log collection

Admin activity logs are enabled by default; you don’t need to do anything to start collecting them. Data access audit logs, however, are disabled by default (with the exception of BigQuery). Follow the guidance detailed in Configuring Data Access Logs to enable them.
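One way to do this (a sketch; the project ID is a placeholder) is to edit the project's IAM policy, which carries the audit configuration:

# Download the current IAM policy, add an auditConfigs section that enables
# DATA_READ and DATA_WRITE logs for the services you need, then write it back.
gcloud projects get-iam-policy my-test-project --format=json > policy.json
# (edit policy.json to add the auditConfigs section)
gcloud projects set-iam-policy my-test-project policy.json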

One best practice for data access logs is to use a test project to validate the configuration for your data access audit collection before you propagate it to developer and production projects. If you configure your IAM controls incorrectly, your projects may become inaccessible.

Viewing audit logs

You can view audit logs from two places in the GCP Console: via the activity feed, which provides summary entries, and via the Stackdriver Logs viewer page, which gives full entries.

Permissions

You should consider access to audit log data as sensitive and configure appropriate access controls. You can do this by using IAM roles to apply access controls to logs.

To view logs, you need to grant the IAM role logging.viewer (Logs Viewer) for the admin activity logs, and logging.privateLogViewer (Private Logs Viewer) for the data access logs.
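For example, you might grant those roles to an auditor like this (the project ID and user are placeholders):

# Allow the user to view admin activity logs...
gcloud projects add-iam-policy-binding my-project \
    --member=user:auditor@example.com --role=roles/logging.viewer
# ...and data access logs.
gcloud projects add-iam-policy-binding my-project \
    --member=user:auditor@example.com --role=roles/logging.privateLogViewer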

When configuring roles for Cloud Audit Logging, this how-to guide describes some typical scenarios and provides guidance on configuring IAM policies that control access to audit logs. One best practice is to ensure that you’ve applied the appropriate IAM controls to restrict who can access the audit logs.

Viewing the activity feed

You can see a high-level overview of all your audit logs on the Cloud Console Activity page. Click on any entry to display a detailed view of that event, as shown below.

By default, this feed does not display data access logs. To enable them, go to the Filter configuration panel and select the “Data Access” field under Categories. (Please note, you also need the Private Logs Viewer IAM role in order to see data access logs.)

Viewing audit logs via the Stackdriver Logs viewer

You can view detailed log entries from the audit logs in the Stackdriver Logs Viewer. With Logs Viewer, you can filter or perform free text search on the logs, as well as select logs by resource type and log name (“activity” for the admin activity logs and “data_access” for the data access logs).

The example below displays some log entries in their JSON format, and highlights a few important fields.

Filtering Audit Logs

Stackdriver provides both basic and advanced logs filters. Basic log filters allow you to filter the results displayed in the feed by user, resource type and date/time.

An advanced logs filter is a Boolean expression that specifies a subset of all the log entries in your project. You can use it to choose log entries:

from specific logs or log services

within a given time range

that satisfy conditions on metadata or user-defined fields

that represent a sampling percentage of all log entries

The following filter selects all calls made to the Cloud IAM API's SetIamPolicy method.
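Expressed as a gcloud command (a sketch; the same filter expression can also be pasted into the advanced filter box in the Logs Viewer):

# Read admin activity audit log entries for calls to the SetIamPolicy method.
gcloud logging read \
    'logName:"cloudaudit.googleapis.com%2Factivity" AND protoPayload.methodName="SetIamPolicy"'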

Log entries are held in Stackdriver Logging for a limited time known as the retention period. After that, the entries are deleted. To keep log entries longer, you need to export them outside of Stackdriver Logging by configuring log sinks.

A sink includes a destination and a filter that selects the log entries to export, and consists of the following properties:

Sink identifier: A name for the sink

Parent resource: The resource in which you create the sink. This can be a project, folder, billing account, or an organization

Logs filter: Selects which log entries to export through this sink, giving you the flexibility to export all logs or specific logs

Writer identity: A service account that has been granted permissions to write to the destination.

You need to configure log sinks before you can export any logs; you can’t retroactively export logs that were written before the sink was created.

Another feature for working with logs is Aggregated Exports, which allows you to set up a sink at the Cloud IAM organization or folder level and export logs from all the projects inside the organization or folder. For example, a gcloud command along the following lines (all identifiers are placeholders, and you may need the beta command group depending on your SDK version) sends all admin activity logs from your entire organization to a single BigQuery dataset:
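# Create an organization-level sink that exports admin activity audit logs
# from every project under the organization into one BigQuery dataset.
gcloud logging sinks create my-org-audit-sink \
    bigquery.googleapis.com/projects/my-project/datasets/audit_logs \
    --organization=1234567890 --include-children \
    --log-filter='logName:"cloudaudit.googleapis.com%2Factivity"'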

Be aware that an aggregated export sink sometimes exports very large numbers of log entries. When designing an aggregated export sink to capture the data you need to store, here are some best practices to keep in mind:

Ensure that logs are exported for longer term retention

Ensure that appropriate IAM controls are set against the export sink destination

Design aggregated exports for your organization to filter and export the data that will be useful for future analysis

Managing exclusions

Stackdriver Logging provides exclusion filters to let you completely exclude certain log messages for a specific product or messages that match a certain query. You can also choose to sample certain messages so that only a percentage of the messages appear in Stackdriver Logs Viewer. Excluded log entries do not count against the Stackdriver Logging logs allotment provided to projects.

It’s also possible to export log entries before they're excluded. For more information, see Exporting Logs. Excluding this noise will not only make it easier to review the logs but will also allow you to minimize any charges for logs over your monthly allotment.
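A sketch of creating an exclusion with gcloud (the exclusion name and filter are illustrative; check gcloud logging exclusions --help for the exact syntax in your SDK release):

# Drop data access audit log entries in this project before they are ingested.
gcloud beta logging exclusions create exclude-data-access \
    --description="Exclude data access audit logs" \
    --log-filter='logName:"cloudaudit.googleapis.com%2Fdata_access"'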

Best practices:

Ensure you're using exclusion filters to exclude logging data that will not be useful. For example, you shouldn’t need to log data access logs in development projects. Storing data access logs is a paid service (see our log allotment and coverage charges), so recording superfluous data incurs unnecessary overhead

Cloud Audit Logging best practices, recapped

Cloud Audit Logging is a powerful tool that can help you manage and troubleshoot your GCP environment, as well as demonstrate compliance. As you start to set up your logging environment, here are some best practices to keep in mind:

Use a test project to validate the configuration of your data-access audit collection before propagating to developer and production projects

Be sure you’ve applied appropriate IAM controls to restrict who can access the audit logs

Determine whether you need to export logs for longer-term retention

Set appropriate IAM controls against the export sink destination

Design aggregated exports so your organization can filter and export the data needed for future analysis

There are many reasons to automate your deployments: consistency, safety, and timeliness. These increase in value as your software becomes more critical to your business. In this post, I'll demonstrate how easy it is to start automating deployments with Google Cloud Platform (GCP) tools, and refer you to additional resources to help make your deployment process more robust.

Before you start your first build, set up your project for Container Builder. First, enable two APIs: Container Builder API and Cloud Functions API. To allow Container Builder to deploy, you need to give it access to your project. The build process uses the credentials of a service account associated with those builds. The address for that service account is {numerical-project-id}@cloudbuild.gserviceaccount.com. You'll need to add an IAM role to that service account: Project Editor. If you use this process to deploy other resources, you might need to add other IAM roles.
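For example, granting that role looks something like this (the project ID and project number are placeholders):

# Give the Container Builder service account permission to deploy into the project.
gcloud projects add-iam-policy-binding my-project \
    --member=serviceAccount:123456789012@cloudbuild.gserviceaccount.com \
    --role=roles/editor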

Now, test your deployment configuration and permissions by running:

gcloud container builds submit --config deploy.yaml .

Your function is now being deployed via Cloud Container Builder.

Creating a build trigger is easy: choose your repository, the trigger condition (in this case, pushing to the "prod" branch), and the build to run (in this case, the build specified in "deploy.yaml").

Now, update the "prod" branch, bring it up-to-date with "master", push it to Cloud Source Repositories, and your function will be deployed!

If the deployment failed, it will show up as a failed build in the build history screen. Check the logs to investigate what went wrong. You can also configure e-mail or other notifications using Pub/Sub and Cloud Functions.

This is a simplified deployment pipeline—just enough to demonstrate the power of deployment automation. At some point, you'll probably find that this process doesn't meet your needs. For example, you might want to get a manual approval before you update production. If that happens, check out Spinnaker, an open-source deployment automation system that can handle more complex workflows.

And that’s just the beginning! As you get further down the road toward automating your deployments, there are plenty of other tools and techniques to explore.

In the world of distributed systems, hosting and scaling dedicated game servers for online, multiplayer games presents some unique challenges. And while the game development industry has created a myriad of proprietary solutions, Kubernetes has emerged as the de facto open-source, common standard for building complex workloads and distributed systems across multiple clouds and bare metal servers. So today, we’re excited to announce Agones (Greek for "contest" or "gathering"), a new open-source project that uses Kubernetes to host and scale dedicated game servers.

Currently under development in collaboration with interactive gaming giant Ubisoft, Agones is designed as a batteries-included, open-source, dedicated game server hosting and scaling project built on top of Kubernetes, with the flexibility you need to tailor it to the needs of your multiplayer game.

The nature of dedicated game servers

It’s no surprise that game server scaling is usually done by proprietary software—most orchestration and scaling systems simply aren’t built for this kind of workload.

Many of the popular fast-paced online multiplayer games such as competitive FPSs, MMOs and MOBAs require a dedicated game server—a full simulation of the game world—for players to connect to as they play within it. This dedicated game server is usually hosted somewhere on the internet to facilitate synchronizing the state of the game between players, but also to be the arbiter of truth for each client playing the game, which also has the benefit of safeguarding against players cheating.

Dedicated game servers are stateful applications that retain the full game simulation in memory. But unlike other stateful applications, such as databases, they have a short lifetime. Rather than running for months or years, a dedicated game server runs for a few minutes or hours.

Dedicated game servers also need a direct connection to a running game server process’ hosting IP and port, rather than relying on load balancers. These fast-paced games are extremely sensitive to latency, which a load balancer only adds more of. Also, because all the players connected to a single game server share the in-memory game simulation state at the same time, it’s just easier to connect them to the same machine.

Here’s an example of a typical dedicated game server setup:

Players connect to some kind of matchmaker service, which groups them (often by skill level) to play a match.

Once players are matched for a game session, the matchmaker tells a game server manager to provide a dedicated game server process on a cluster of machines.

The game server manager creates a new instance of a dedicated game server process that runs on one of the machines in the cluster.

The game server manager determines the IP address and the port that the dedicated game server process is running on, and passes that back to the matchmaker service.

The matchmaker service passes the IP and port back to the players’ clients.

The players connect directly to the dedicated game server process and play the multiplayer game against one another.

With Agones, Kubernetes gets native abilities to create, run, manage and scale dedicated game server processes within Kubernetes clusters using standard Kubernetes tooling and APIs. This model also allows any matchmaker to interact directly with Agones via the Kubernetes API to provision a dedicated game server.

Building Agones on top of Kubernetes has lots of other advantages too: it allows you to run your game workloads wherever it makes the most sense, for example, on game developers’ machines via platforms like minikube, in-studio clusters for group development, on-premises machines and on hybrid-cloud or full-cloud environments, including Google Kubernetes Engine.

Kubernetes also simplifies operations. Multiplayer games are never just dedicated game servers—there are always supporting services, account management, inventory, marketplaces etc. Having Kubernetes as a single platform that can run both your supporting services as well as your dedicated game servers drastically reduces the required operational knowledge and complexity for the supporting development team.

Finally, the people behind Agones aren’t just one group building a game server platform in isolation. Agones, and the developers that use it, leverage the work of hundreds of Kubernetes contributors and the diverse ecosystem of tools that have been built around the Kubernetes platform.

A founding contributor to the Agones project, Ubisoft brought its deep knowledge and expertise in running top-tier, AAA multiplayer games for a global audience.

“Our goal is to continually find new ways to provide the highest-quality, most seamless services to our players so that they can focus on their games. Agones helps by providing us with the flexibility to run dedicated game servers in optimal datacenters, and by giving our teams more control over the resources they need. This collaboration makes it possible to combine Google Cloud’s expertise in deploying Kubernetes at scale with our deep knowledge of game development pipelines and technologies.”

Getting started with Agones

Since Agones is built with Kubernetes’ native extensions, you can use all the standard Kubernetes tooling to interact with it, including kubectl and the Kubernetes API.

Creating a GameServer

Authoring a dedicated game server to be deployed on Kubernetes is similar to developing a more traditional Kubernetes workload: the dedicated game server is built into a container image and then described declaratively as a GameServer resource.
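As a rough sketch (the field names follow an early alpha API and may differ in your Agones release; the container image is a placeholder), a GameServer definition can be written to a file like this:

# Write a minimal GameServer definition to gameserver.yaml.
cat > gameserver.yaml <<'EOF'
apiVersion: "stable.agones.dev/v1alpha1"
kind: GameServer
metadata:
  name: my-game-server
spec:
  containerPort: 7654
  template:
    spec:
      containers:
      - name: my-game-server
        image: gcr.io/my-project/my-game-server:0.1
EOF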

You can then apply it through the kubectl command or through the Kubernetes API:

$ kubectl apply -f gameserver.yaml
gameserver "my-game-server" created

Agones manages starting the game server process defined in the yaml, assigning it a public port, and retrieving the IP and port so that players can connect to it. It also tracks the lifecycle and health of the configured GameServer through an SDK that's integrated into the game server process code.

You can query Kubernetes to get details about the GameServer, including its State, and the IP and port that player game clients can connect to, either through kubectl or the Kubernetes API:
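For example (the output fields are illustrative):

# Show the full GameServer object; its status section includes the current
# state and the address/port that game clients should connect to.
kubectl get gameserver my-game-server -o yaml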

What’s next for Agones

Agones is still in very early stages, but we’re very excited about its future! We’re already working on new features like game server Fleets, planning a v0.2 release and working on a roadmap that includes support for Windows, game server statistic collection and display, node autoscaling and more.

If you would like to try out the v0.1 alpha release of Agones, you can install it directly on a Kubernetes cluster such as GKE or minikube and take it for a spin. We have a great installation guide that will take you through getting set up!