Fraud detection and risk analysis initiatives are a high priority for many companies conducting business online. These initiatives task data science teams with building and maintaining platforms capable of complex processing, modeling, and analysis of fast data and big data in real time. Such a platform (and all of its associated components) can be difficult and time-consuming to build and maintain manually.

Fortunately, there is a better way: Mesosphere DC/OS streamlines the building of data processing and modeling pipelines so data science teams can be more productive. And fortunately for you, you don’t have to experience the pain I did when I helped build fraud detection and risk analysis platforms at past companies.

Let’s take a look at the pain points that I encountered in the past when building fraud detection and risk analysis platforms and how DC/OS would have made the process far easier and faster. Although much of this post describes situations that predate DC/OS, the core problems are the same today. As such, this is a good demonstration of how DC/OS helps data science teams build and maintain a modeling infrastructure now and in the future.

Background

Most data science teams that I’ve managed focused on the product and how customers used it. There were many disparate data science platforms along the way, and much of our work revolved around reporting and A/B testing. I’ve worked with teams responsible for things like measuring the impact of changes to site flow, search rankings, and many other product adjustments. Most of the work was performed in batch jobs, which meant that adding new products, features, and pipelines often did not require many infrastructure changes. However, as stream processing has become more prevalent, the infrastructure required to support data science has grown more complex.

At one point I was tasked with taking over a fraud/risk platform from engineering. The fraud platform was very different from anything the data science team had previously owned; its near real-time nature was the reason it had belonged to engineering. At that time the data science team owned nothing that operated in real time. The closest system we had consisted of batch jobs that ran every 15 minutes. There were analytics systems that ran in near real time, but they were not owned and maintained by the data science team, and they were custom-built for their tasks. Given the nature of the data and how it was accessed and processed, simply running a new batch process with smaller batch intervals was not an option. The data science team needed a new system that returned model results in near real time.

Building a Fraud Detection Platform Manually

When I started work on the fraud platform, it was a Java and Weka solution. It worked but wasn’t trivial for a non-Java programmer to take over. It didn’t read from all the data sources that were being continuously added to the infrastructure.

There were a couple of data sources: a Ruby on Rails application wrote CSV files, and queries were run against a MySQL database. At the start that was sufficient. Over time, however, third-party data and data from HDFS became available, and adding these new sources to the existing system would have required significant work.

After the data was pulled, Weka was used to create and tune a model. When satisfied with the results, the modeler could save the Weka code where it was available to a RESTful API written in Java.

However, the difficulty in adding new data sources wasn’t the main problem. There was no way to run multiple models at once, and we had no version control over our models. Even this was fixable, but not if the data science team was going to take over ownership: there were no programmers on the data science team to maintain the Java and Weka code we were using. The primary languages of the team members were R and Python. There were a few Stata programmers, but they were all economists (who wants economists working on risk?).

I started looking for an oxymoron: a scalable system that worked with R. If engineering was going to hand this off to the existing data science team the system needed a different design.

Fraud Detection Platform Requirements

Fraud detection and risk analysis initiatives can only be successful if they impact the business as a whole. For this reason, it was important for us to gather business and technical requirements prior to developing and deploying the platform.

There were many more requirements than the ones listed below, but the following were the most important high-level requirements.

Ability to develop and test new models quickly and consistently to constantly improve models and react to rapidly changing conditions.

Migrate from batch only to batch and real-time architecture to improve the timeliness of analyses and prevent fraud as it happened.

Transfer ownership of fraud/risk platform from engineering to the data science team to improve developer agility.

The last requirement revolved around a non-engineering team taking the platform over. Without hiring data scientists with strong engineering backgrounds (later that was done), we needed a solution built on SQL, R or Python. This business requirement was the largest driver for the technical requirements.

Technical Requirements

R support. As stated earlier, with few exceptions everyone on the data science team knew R well. Thankfully, fraud is rare, so working with fraud data allows for large undersampling of negative cases. Modeling on modest hardware with R was, therefore, possible. This also allowed model training to be divorced from production processing.

RESTful API. The main new feature required was quickly returning model results from API calls. The API call volume was high and responses needed to be returned in under a few dozen milliseconds. Failover and redundancy also needed to be included. The calls and results needed to be logged for future analysis and reporting.
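To make the request/response contract concrete, here is a minimal Python sketch of the JSON shapes involved. The field names (`arguments`, `result`) follow Openscoring's evaluation API as I remember it, and the feature names are hypothetical; treat this as illustrative rather than a spec.

```python
import json

def build_score_request(features):
    # Openscoring-style evaluation request: the model's input fields
    # go under an "arguments" key (shape shown here is illustrative).
    return json.dumps({"arguments": features})

def parse_score_response(body):
    # Pull the model outputs out of an Openscoring-style response body.
    return json.loads(body).get("result", {})

request_body = build_score_request({"amount": 250.0, "country": "US"})
result = parse_score_response('{"result": {"fraud_probability": 0.07}}')
```

In production, the request body would be POSTed to the scoring endpoint, and both the call and the parsed result would be logged for later analysis.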

Batch processing. Although the API was the key component, batch processing of the model results was still required for “what-if” scenarios such as “What if the model we just created had been running in the past? What would the results have been?” It wouldn’t work to make tens of millions of calls to an API for every query used to evaluate a model. The model would need to be compilable and transportable.
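Conceptually, a what-if replay is just a batch map of the model over historical records. A minimal Python sketch, where the records and the stand-in model are hypothetical (at scale, the same PMML model compiled into a Hive UDF played this role):

```python
def replay_model(model, history):
    # "What if this model had been running in the past?" -- apply the
    # model to each historical record and collect (id, score) pairs
    # for offline evaluation.
    return [(record["id"], model(record)) for record in history]

# Hypothetical stand-in model: flag transactions over a threshold.
toy_model = lambda record: 1 if record["amount"] > 1000 else 0

history = [
    {"id": "t1", "amount": 120},
    {"id": "t2", "amount": 5400},
]
scores = replay_model(toy_model, history)  # [("t1", 0), ("t2", 1)]
```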

A/B Testing. Running models and what-if scenarios was not enough. We also needed to run A/B tests of various models. That meant that a model, selected randomly from the set of appropriate models, needed to be returned with a call. The model name and version needed to be included and logged for future evaluation.
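The selection logic itself is simple; the important part is returning the model name and version so every response can be tied back to the model that produced it. A sketch in Python, with hypothetical model names and versions:

```python
import random

def pick_model(candidates, rng=random):
    # Pick one candidate model uniformly at random. The caller logs
    # the returned name and version alongside the response so A/B
    # results can be correlated with models later.
    return rng.choice(candidates)

candidates = [
    ("rf_fraud", "v12", lambda record: 0.10),   # hypothetical models
    ("gbm_fraud", "v3", lambda record: 0.20),
]
name, version, model = pick_model(candidates, random.Random(7))
```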

Git support. There needed to be version control on the models. Any data scientist added to the project would want to know what had already been tried and correlate each model with its results.

Fraud Detection Platform Architecture (Without DC/OS)

Finding a solution that met all requirements was more difficult at the time I was building this fraud detection platform than it would be today. I looked through many potential solutions at various stages of maturity. The one that we eventually went with was Openscoring.

R support. R could export Predictive Model Markup Language (PMML). This allowed models created in R to be consumable by anything that was PMML compatible. Openscoring consumed PMML, so it could meet our other needs. At that time Openscoring was a young project missing some features we needed. It also had a few known bugs that we needed addressed. It was very close to what we needed and the quality of the code base looked very good.

RESTful API. This was the main draw to Openscoring. The RESTful API worked well, and it’s improved a lot since then.

Batch processing. Openscoring could convert the PMML to compilable Java. The model could then be used as a UDF in Hive. Now it would be even easier with jpmml/jpmml-hive.

A/B Testing. Openscoring logged calls with all the model details. This made A/B testing trivial.

Git Support. PMML is just text, so models were easily versioned and stored in Git. There were workarounds for a few very large random forests, but generally this wasn’t a problem.

Building a Better Fraud Detection Platform with DC/OS

Even with a clear path the project took months and required a lot of engineering help. The project was a big success, but had DC/OS existed it could have been completed in weeks and been easier to maintain. Here I’ll run down how we implemented Openscoring and how DC/OS could have made it a lot easier. Keep in mind that this was several years ago, and while some of the decisions may seem a little crazy today, they were good options at the time.

Provisioning
At the start, I needed a machine to test and train models and this required the operations team to provision a node on AWS. They often wanted to take the node down in the evening and especially over the weekend since it would be idle. A laptop was sometimes used, but that was less than ideal. Matching a laptop environment to an AWS node while continuing to do other things on the laptop created unneeded problems.

With DC/OS, I wouldn’t need anything provisioned by the operations team. On a DC/OS cluster, any Docker image can easily be provisioned from the GUI. If the container doesn’t consume a lot of resources while idle, it can be left running on the cluster. The container still takes up resources, but it can request fractional resources: rather than requiring an entire node, my container may only need a fraction of an already-running node. In many cases, there is no incremental cost.
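For illustration, a Marathon-style app definition requesting half a CPU might look like the following. The field names follow Marathon's app schema, but the image tag and resource values are examples, not a tested configuration:

```json
{
  "id": "/data-science/openscoring",
  "container": {
    "type": "DOCKER",
    "docker": { "image": "openscoring/openscoring:latest" }
  },
  "cpus": 0.5,
  "mem": 1024,
  "instances": 1
}
```

Because `cpus` is 0.5, this container can share an already-running node rather than claiming a whole machine.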

It’s true that you could do this with Docker alone on a laptop. While that is convenient, it is not as good as running on DC/OS: something that works in a local Docker container is not guaranteed to work on a cluster with different networking and routes between machines. And if the container is already reading from and writing to the cluster, from inside the cluster, turning the process into a production process is much easier.

DC/OS clusters can be installed locally, or on AWS, Azure, or Google Compute Engine. Installation does not require strong operations expertise. Anyone with a basic understanding of networking can bring up a cluster by just following the directions.

Installing
After the node was provisioned on AWS, Openscoring needed to be installed, not a trivial task at the time. For a data scientist that is not an engineer, it was even more of a challenge. I mercilessly solicited assistance from engineers outside the data science team.

If we had had DC/OS, I could have done the work with minimal help, if any. Engineers would have gotten a lot more sleep and spent far less time hiding from me. Openscoring has a container on Docker Hub. Anyone on the data science team could have entered the container information and resource requirements into the DC/OS management GUI; two clicks later, it would be running on the cluster. There would be no need to fuss over versions, paths, missing libraries, etc. It would have just worked.

As an example, this 13-minute video shows how to easily install Tensorflow and Jupyter on DC/OS using Docker images. The example shows both GPU and non-GPU versions and compares performance. Another 9-minute video has the same example using only Docker. Together they illustrate how to seamlessly go from a local test environment to a cluster running DC/OS.

Adding Features and Debugging

Fortunately, not only did I have great internal engineering support, but we also contracted the originator of Openscoring, Villu Ruusmann. Some complications, however, were introduced because he was not an employee and therefore didn’t have access to our cluster. It was sometimes difficult for him to reproduce our errors: Villu would add features that worked when he tested them in his environment but sometimes did not work in ours.

With DC/OS, he could have worked directly on a Docker image, hosted if needed in a private Docker repository owned by the company. Since passwords are stored securely in a secrets store on the cluster, code can be written independently of usernames and passwords. Even better, we could have set up a dummy cluster outside our network, with the same networking but no real data. A contractor could then develop and debug on something very close to the environment the application would eventually run in, with DC/OS managing details like versions, paths, libraries, and more.

Integration

Even with everything running, integration with all the other systems was difficult at best and painful at worst. Data came from several systems across the company. Some data arrived in near real time (e.g., IP addresses of site users and third-party partners). Other data was compiled from user history, host history, and batch-processed data. Some processes were dependent on other processes completing. The data pipeline required for each request was extensive.

With DC/OS, connecting applications is often easy. The built-in networking makes it easy to discover data services and connect clients to them via a hostname. The DC/OS job scheduling capabilities can be used to reliably orchestrate ETL pipelines with complex dependencies. This still requires the hard work of writing the logic to process and compile data, but the connectivity is easier to see, understand, and debug. It’s possible to deploy a data pipeline in under 10 minutes with DC/OS.
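Whatever scheduler runs the jobs, the pipeline reduces to a dependency graph that must be executed in topological order. A small Python sketch of a hypothetical fraud-scoring pipeline (the job names are made up; the dependency structure is the point):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each job maps to the set of jobs it depends on -- the same
# dependency structure a job scheduler must respect when
# orchestrating ETL stages.
pipeline = {
    "score_batch":   {"join_features"},
    "join_features": {"load_events", "load_history"},
    "load_events":   set(),
    "load_history":  set(),
}

# static_order() yields each job only after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
```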

Maintenance

Chef was required to ensure that operations could bring up an Openscoring node if it died in the middle of the night. That meant I had to set up the Chef recipe. While not incredibly difficult, it’s not an enjoyable task.

With DC/OS, if the node running Openscoring goes down, the service is automatically restarted on another node. If the container itself crashes for some reason, another container is automatically spawned. In either case, it comes up correctly configured in the network: DC/OS takes care of routing, so all other apps seamlessly continue to reach Openscoring. No Chef recipes are needed.
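The automatic restart behavior is driven by health checks declared on the app definition. A fragment of what that might look like for Openscoring (field names follow Marathon's health-check schema; the path and thresholds are hypothetical):

```json
{
  "healthChecks": [
    {
      "protocol": "HTTP",
      "path": "/model",
      "gracePeriodSeconds": 60,
      "intervalSeconds": 10,
      "maxConsecutiveFailures": 3
    }
  ]
}
```

If the check fails three times in a row, the task is killed and respawned, with no operator intervention.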

It is also worth noting that, by design, any application set up with attached volumes will be stateful. That is, when a new instance is spawned to replace one that died, it will know where the old one left off.

DC/OS Increases Data Science Productivity

DC/OS makes data science initiatives easier to create, shortens the amount of time required to bring them to market, and simplifies the deployment and management of open source frameworks and in-house developed models.

Openscoring, or any other containerized application, can form the basis of a data science platform. Any application with a Docker image can be deployed as part of the solution. Moreover, there are dozens of applications already in the DC/OS Service Catalog to choose from, including Apache Kafka, Apache Spark, BeakerX (Jupyter with extensions), Apache Zeppelin, and more, all of which can be installed and configured as easily as an iPhone app.

Getting individual applications working on DC/OS is easy, even for a non-engineer. Connecting several applications together as a comprehensive and powerful solution is also within reach of non-engineers. Data scientists can snap together a pipeline of applications like Lego bricks. Creating solutions that are assembled on the cluster from the beginning shortens the time to production and improves the quality of the solution. With DC/OS, data science teams can be more proactive in architecting solutions, relieving pressure on infrastructure teams and freeing data science teams to focus on data and data-centric solutions.

Download this free book excerpt from O’Reilly to learn how to use Apache Spark to process data quickly, at scale. This excerpt includes three chapters: an introduction to Spark, how Spark works, and how to leverage Spark settings for optimal performance.

The DC/OS package for Kubernetes 1.9.4, which fixed a critical security vulnerability found in Kubernetes itself, is now generally available. As stated in the GitHub issue, the vulnerability affects most versions of Kubernetes after 1.3. Mesosphere announced the general availability of open source Kubernetes version 1.9.3 on DC/OS 1.11 last week. The Kubernetes team at Mesosphere had the new package and documentation ready for customers the day of its release. This release also fixed an issue with the DC/OS SDK that affected the Kubernetes service.

The Kubernetes security vulnerability compromised clusters that allowed untrusted users access to the pod specification, and was more likely to affect shared clusters than clusters used by a single team within an organization.

This security vulnerability was quickly patched by the active and large Kubernetes community and highlights another important reason for having a constant awareness of newly discovered security vulnerabilities, a painless upgrade path, and the option to have the newest version soon after a Kubernetes release.

DC/OS 1.11 includes enhanced security for Kubernetes that enforces secure configuration settings for authentication, authorization and secure networking. DC/OS 1.11 can be configured to secure application and data services traffic using SSL/TLS.

The documentation for Kubernetes 1.9.4 on DC/OS 1.11 is available on Mesosphere’s site. The Mesosphere team will release Kubernetes 1.9.5 as it becomes available in the coming weeks to address other issues found in Kubernetes.

Getting Started with DC/OS 1.11 and Kubernetes 1.9.4, Batteries Included

Push-Button Kubernetes for Existing Customers
Once existing customers have updated to DC/OS 1.11, they should use the Kubernetes 1.9.4 package. For more information, see the official documentation.

Quickstart for New Open Source DC/OS Users
For those who are new to DC/OS, there is now a Quickstart (including Terraform templates for AWS, Microsoft Azure, and Google Cloud) to get you up and running quickly.

Mesosphere Enters into Reseller Agreement with Portworx to Enable Customer Deployments of Containers and Fast Data Services at Scale

Today we are excited to announce that we have entered into a formal agreement with Portworx, the leading provider of cloud native storage for containerized applications, to resell PX-Enterprise. This agreement comes on the heels of the success of our previous collaboration to support joint customers running big data and fast data applications.

Running stateful applications, such as Apache Cassandra, ElasticSearch, Apache Kafka, Apache Spark and Apache Hadoop, in containers requires special care to ensure persistence, high availability (HA), and security in a completely automated way that works for any service or infrastructure. By combining DC/OS and Portworx PX-Enterprise, customers are able to deploy and run data-intensive applications in a fully automated, DevOps friendly manner.

This relationship is an ideal fit for Mesosphere and DC/OS as we are the leading platform for implementing containers and fast data services at scale and Portworx PX-Enterprise is the leading cloud native storage offering in the market for container-based applications. Many of our joint customers, including GE Digital, Verizon, NIO, athenahealth, and Beco, Inc. have already achieved demonstrable value by combining the flexibility of DC/OS with the PX-Enterprise persistent container storage platform.

With this agreement, Mesosphere is providing its customers with a best-in-class solution for running scalable, high-performance, data-rich applications in production. PX-Enterprise supports both Marathon and Kubernetes container orchestrators, allowing Mesosphere customers the choice to manage their mission-critical stateful applications using either technology via DC/OS.

This reseller agreement formalizes the relationship between the two companies and will lead to future implementations of our technology to make it easier for customers to build, test, and deploy robust container-based and fast data applications.

With the focus on hybrid, edge and multi-cloud deployments in the 1.11 release of DC/OS, Portworx is a particularly good solution for DC/OS customers. Using Portworx with Mesosphere DC/OS running Marathon or Kubernetes, customers can have true portability of applications between environments. For instance, with Mesosphere and Portworx:

Stateful applications like Cassandra, Kafka, ElasticSearch, HDFS and more can be deployed across fault-domains, ensuring that if a server, rack, datacenter or even region is down, an application can keep running or be rescheduled to another host with an up-to-date copy of the data available.

Big data, fast data, and machine learning workloads like TensorFlow can be burst to the cloud using the Portworx CloudSnap feature, which moves data between environments so that DC/OS can schedule compute jobs on demand.

DC/OS Kubernetes-as-a-service users can take advantage of Portworx’s forced hyperconvergence of compute with data to ensure that their mission-critical, performance sensitive data services always run with fast local storage, even when being rescheduled due to a server or network failure.

Imagine this: You’re pulling out of your driveway, asking your phone to pull up navigation to your destination and just at that critical moment when you need to decide which way to turn, you look down at your phone and realize it’s hanging. It can’t find the destination, and it looks like it has no connection. This frustrating moment is an almost universal experience, but it’s one Deutsche Telekom aims to eliminate for all of its customers in Europe and across the globe.

As one of the world’s top telecommunications providers, Deutsche Telekom and its subsidiaries, like T-Mobile in the United States, are using machine learning algorithms based on dynamic cloud infrastructure managed by Mesosphere DC/OS to dramatically improve the consumer experience when it comes to mobile connectivity. The Deutsche Telekom CONNECT app, now available on the Google Play and iOS App Store, launched this year to allow customers to optimize their connection based on cost or performance at any point in time.

Want to avoid dropped connections and slow performance like the scenario above? Then you’d likely choose the “best network” setting, where the app automatically switches your phone from a Wi-Fi hotspot to the cellular network to ensure a seamless experience. If you’re the budget-conscious type, then you’ll opt for the “Wi-Fi preferred” setting to keep data transfers over cellular to a minimum. Of course, if the telecommunications provider can help you find fast Wi-Fi at will, everyone wins. Ensuring customers are using available Wi-Fi networks whenever possible means keeping networks from being overly taxed. Unclogging mobile cellular networks means a better experience for everyone.

A New Stack for Data Analytics and Machine Learning

While the end goal may seem simple, the technology to get there is far from it. Oliver Goldich, the Deutsche Telekom solution architect responsible for driving the backend infrastructure for the CONNECT app, explained that the company first started investigating connectivity solutions in 2015. The team began by piloting an emerging telecommunications standard called the Access Network Discovery and Selection Function (ANDSF). However, Goldich observed, “The standard was still old-fashioned and not very flexible for our use case, which required dynamic monitoring and decision-making. We needed a dynamic rule engine that the protocol could not support.”

When it became clear they required a new kind of stack to adopt advanced data analytics and machine learning, Goldich and team began sourcing the components to get there. In 2016, they chose to invest in DC/OS. “At the time, Mesosphere DC/OS was already a mature and production-proven platform,” said Goldich. “It was the best candidate to move our connectivity project forward.” The CONNECT app started as a low-friction proof-of-concept: they built a network speed test app on a DC/OS service layer that allowed users to do a simple speed check. Deutsche Telekom collected and analyzed that user data with Spark to validate their approach.

Choosing a Cloud Provider

With the proof-of-concept results looking positive, Goldich and team began to build out the infrastructure requirements to automate the speed tests and, ultimately, improve the customer experience. “At that point, we knew if we wanted to create machine learning capabilities, we needed to add cloud capabilities for scale. However, it was very important to do so without vendor lock-in,” said Goldich. With cloud came complications. Goldich noted that their existing devops process centered around deploying to an on-premise datacenter, which would not work with the machine learning stack they wanted to adopt.

“We chose Microsoft Azure and immediately set up DC/OS environments with new CI/CD capabilities. We felt confident choosing a cloud provider knowing that, with DC/OS, we will never need to re-architect the applications if we choose to move providers in the future. DC/OS completely abstracts the infrastructure layer, making it easy to move our applications to our preferred infrastructure — be it cloud, bare metal, or on-premise — with minimal engineering effort,” said Goldich.

Additionally, DC/OS enables the use of elastic cloud resources and eliminates the need to create a virtual machine (VM) for every application, allowing Deutsche Telekom to maintain a high rate of utilization, currently averaging more than 75 percent CPU utilization on its production cluster.

DC/OS also provides one-click access to all of the open source data services Deutsche Telekom needed, including Apache Spark, Akka, and Apache Cassandra. By leveraging the “SMACK Stack”, as this combination of data tools on Mesos is often called, Deutsche Telekom is able to collect data at scale and analyze network speed tests in real time.

Adding on the ELK Stack

Deutsche Telekom also chose the “ELK Stack,” the combination of Elasticsearch, Logstash, and Kibana, to supplement its data science work, and partnered with Instana for AI-powered application performance monitoring, which delivers deep insight into the health of its full technology stack and services.

Looking Forward to Predictive Analytics

With all the production-ready components in place, Deutsche Telekom launched its CONNECT app at the close of 2017, enabling seamless connectivity experiences for its 156 million mobile customers. Whenever the CONNECT app is connected to a hotspot, Deutsche Telekom is able to perform automated speed tests to collect data on the thousands of available hotspots. Spark then processes those concurrent data streams to make real-time decisions for its customers on-the-go. Taking it a step further, the engineering team at Deutsche Telekom is training machine learning models using Spark to make these decisions even faster, which will continuously improve the user experience.

As the machine learning models continue to evolve, the DC/OS platform allows the developers to push constant, incremental improvements to their customers. The long-term goal is to incorporate predictive analytics, anticipating when certain cellular networks or hotspots are normally congested and diverting network usage accordingly. For now, the CONNECT app is focused on Deutsche Telekom’s network of hotspots, but it plans to expand this service to third-party providers, like hotels and other public spaces.

We are proud to announce the availability of Mesosphere DC/OS 1.11, which makes DC/OS an even better choice for deploying and operating all of your applications and data services with ease. This latest release adds three exciting new capabilities:

Seamless Edge and Multi-Cloud Operations — Unifying multiple cloud providers and private datacenters has been the holy grail for infrastructure and operations teams since the birth of cloud computing. Gartner estimates 9 in 10 enterprises will adopt Hybrid Infrastructure Management within two years. Enterprises want the flexibility to choose where to run their applications based on cost, speed to market, and security & compliance considerations. Distributing today’s applications and a growing set of data services across multiple infrastructures (including private and edge computing environments) helps guarantee quality of service and uptime. Bursting workloads to the cloud, disaster recovery across locations, and simplified management of edge compute and remote offices is now effortless with DC/OS as your unified control plane. DC/OS 1.11 allows you to pool public cloud, private datacenter, and edge compute resources into a single logical computer and intelligently schedule workloads anywhere from a unified user interface.

Production Kubernetes-as-a-Service — Development teams around the world are flocking to Kubernetes as their preferred platform for containerizing and deploying applications. But as an operator, your options for supporting these teams are less than ideal. Installing, operating, and upgrading Kubernetes on your own infrastructure can be challenging, and the loss of control and high cost of using cloud hosted container services can trump their convenience. DC/OS provides a third way: operations teams can deploy, scale, and upgrade pure Kubernetes for all of the teams in their organization with one click, and run their stateless applications alongside the stateful services that underpin them. Following a successful beta release of Kubernetes on DC/OS 1.10, during which the technology was tested by many users and customers, DC/OS 1.11 makes Kubernetes on DC/OS generally available.

Enhanced Data Security — Every company’s most valuable asset is its data. However, that data is also constantly under threat from bad actors around the world. To retain the trust of their customers, partners, and shareholders, every business needs to protect their data and applications. This latest DC/OS release adds multi-layer security features to help you secure your entire application stack.

Since our first release of Mesosphere DC/OS nearly 3 years ago, we have focused on automating the best practices of cloud-native infrastructure and operations, so that you can accelerate your time to market, eliminate mundane operational tasks, and reduce your costs. Our customers and user community rely on DC/OS to deliver data-intensive applications like personalization, IoT, and predictive analytics. Our latest release continues our mission of making cloud native tools and infrastructure easy to deploy and operate, so that you can focus on creating the next generation of applications that will help you and your company succeed.

With Mesosphere DC/OS 1.11, mainstream companies can deliver personalized and data-driven experiences with far less specialized expertise. They can focus on their customers, not their infrastructure.

For a long time, technology leaders have searched for a way to seamlessly pool resources from multiple-cloud environments. Mesosphere DC/OS has always provided a cloud-like operational experience by pooling cluster resources and automating applications services based on their unique operational requirements. Examples include all components of the SMACK stack and other popular data services on DC/OS. This means an automated and highly consistent management experience on any infrastructure where DC/OS is deployed.

Now with DC/OS 1.11 a single DC/OS cluster can pool resources from multiple public or private clouds at once, and operators can distribute workloads across multiple fault domains. This means that in addition to application-aware automation, DC/OS 1.11 adds cloud-aware automation that unleashes powerful new hybrid and multi-cloud operations capabilities, and helps to address enterprise-wide resourcing requirements.

Edge and Multi-Cloud Federation

An operator using his or her DC/OS credentials can manage multiple clusters on different clouds from a single DC/OS interface by linking these clusters. This means operators can focus on the services they’re running, not the differences of the underlying infrastructure. Whether it’s an on-premises datacenter, cloud compute on Azure, AWS, or Google, or any other mix of resources, the underlying infrastructure is transparent to the operator – simply use the dropdown menu to switch to the cluster you want to manage.

DC/OS operators can also run clusters that are stretched, where the agent nodes (the servers that do the work) can be in a remote location away from the master nodes (the brains of DC/OS). This means operators can minimize the complexity of their infrastructure by deploying only agent nodes in edge datacenters or remote offices (where they are needed), while still having a single unified operating experience across their entire infrastructure.

Business Continuity and Disaster Recovery

Keeping applications highly available is another key challenge for infrastructure operations. Outages can occur at multiple levels including server, rack, datacenter (e.g., AWS US-EAST-1), region (e.g., AWS US-EAST) or the entire cloud (e.g., all of AWS).

DC/OS 1.11 allows operators to intelligently define fault domains and recover across this hierarchy to maximize service survivability. For example, within a region, stateless services can recover automatically from failures at the node, cluster, rack, or even site level. For stateful services, Mesosphere has partnered with Portworx to provide persistent storage for containers that is fully integrated with DC/OS, so users can easily run stateful services with highly available storage, bare-metal performance, and built-in data protection.

DC/OS allows operators to easily deploy workloads to multiple regions (e.g., on AWS and also on Azure) to enable multi-cloud high availability.

Cloud Bursting

Scale applications across multiple clouds (or from local datacenters to public clouds) to accommodate rapid demand spikes and reduce infrastructure spend. Companies worldwide spend over $60 billion annually on cloud capacity they don’t need. By creating a DC/OS cluster composed of agents from multiple clouds, operators can elastically scale by adding and removing nodes as needed (using Terraform or other basic scripts). DC/OS’s cloud-aware scheduling capabilities can then schedule workloads to take advantage of the burst capacity.

Give Your Development Teams Kubernetes-as-a-Service on Any Infrastructure

DC/OS and Google Cloud Platform both provide pure Kubernetes by using an underlying platform to supply resources and automate operations. Unlike public cloud providers, however, DC/OS is agnostic to the infrastructure it runs on top of, so your Kubernetes-based applications, developer tools, and backing data services are all completely portable.

Production-Ready Kubernetes On Demand, Anywhere

DC/OS makes it effortless to set up highly available Kubernetes for production — it automates 20+ steps and many hours (or days) of work into a single click, resulting in a fully functional deployment in minutes.

Count on the latest version of Kubernetes as soon as you’re ready for it. Upgrade your Kubernetes deployment to the latest version in-place, without disruption, thanks to DC/OS’s application-aware automation.

Kubernetes, Dev Tools, & Data Services Happy Together

Teams typically run Kubernetes with other tools to facilitate operations and support a delivery pipeline. Examples include Prometheus for monitoring, Jenkins for continuous integration/continuous delivery (CI/CD), and Elastic, Logstash, & Kibana for logging. All of these services run elastically together on a shared DC/OS cluster.

Modern data-intensive applications have many components, and securing all of them can be hard. Containerized microservices are dynamically scheduled, discovered, load balanced, killed, and restarted by design, compounding security challenges even more, and making security strategies highly error-prone.

DC/OS is already secured with an encrypted control plane and role-based access control (RBAC) integrated with authentication providers, ensuring that only authorized users with the right roles or privileges can access services running on DC/OS.

DC/OS 1.11 adds additional layers of security for data services, which simplifies regulatory compliance by enabling transport-level encryption for sensitive information in transit. DC/OS also simplifies data services integration with authentication, authorization, and access control mechanisms such as Kerberos, LDAP, and Active Directory. Secrets management in DC/OS has also been enhanced.

Secure Communications Within Distributed Data Services

Transport layer security (TLS) ensures only trusted services can communicate with each other (server-to-server), and their client communications are also encrypted (server-to-client). For example, TLS ensures that two nodes of a Cassandra or Kafka cluster can communicate securely, by encrypting the network traffic between those nodes.

Encryption keys and certificates are securely stored in DC/OS’s encrypted secret store and dynamically loaded only for authorized services or clients, providing an additional level of security for your sensitive data.

Control Which Applications Can Access Data Services

By enabling client authentication for connections to application or data services, you can control which applications can read or write to those data services. Authentication mechanisms can include Kerberos, LDAP, or Active Directory protocols.

Fine-grained client authorization gives you control over read and write operations. For example, you may decide to have only certain applications read or write to a specific topic within the Kafka service on DC/OS.

Secrets Management

DC/OS provides a centralized encrypted and access-controlled location for sensitive application credentials such as username/password, certificates and configuration files. Applications are automatically loaded with the right credentials at launch.

DC/OS 1.11 adds hierarchy and multi-team isolation to the DC/OS secrets store, making it easier to manage which secrets can be accessed by various applications or teams.
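To illustrate how an application consumes a stored secret, here is a minimal sketch (as a Python dict, for readability) of a Marathon app definition that injects a DC/OS secret as an environment variable at launch. The app id and secret path are hypothetical; the hierarchical path is the kind of name that the new multi-team isolation scopes.

```python
import json

# Hypothetical app and secret path; the "env"/"secrets" wiring follows
# Marathon's secret-injection structure on DC/OS Enterprise.
app = {
    "id": "/dev/billing-api",
    "cmd": "./run-server.sh",
    "cpus": 0.5,
    "mem": 512,
    # DB_PASSWORD is populated from the referenced secret at launch.
    "env": {"DB_PASSWORD": {"secret": "secret0"}},
    # The hierarchical path "dev/db-password" can be scoped to the dev team.
    "secrets": {"secret0": {"source": "dev/db-password"}},
}
print(json.dumps(app, indent=2))
```

The credential itself never appears in the app definition, only a reference to its path in the secrets store.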

Since the beginning, our goal for DC/OS was to provide application developers and operators with an easy way to consistently deploy and run applications and data services on any public and private infrastructure. The unified developer and operator experience allowed our customers to easily migrate applications across clouds, and made it easy to use one cloud or datacenter for testing, a second for production, and a third for disaster recovery.

Edge and Multi-Cloud Federation: A single DC/OS cluster can now stretch beyond local nodes, which means remote offices or edge computing infrastructures can be centrally managed while maintaining a minimal footprint. In addition, multiple DC/OS clusters can be linked to simplify management and operation.

Business Continuity/Disaster Recovery: In addition to being resistant to machine failure, DC/OS 1.11 allows you to build applications that are resistant to outages within racks, availability zones or even cloud regions. You can even deploy multiple instances of your application across cloud providers.

Cloud Bursting: Easily expand your on-premises infrastructure with additional capacity from public cloud providers, remove capacity when not needed, and manage it all from a single unified interface.

To accomplish the above, DC/OS 1.11 introduces many features such as:

Regions and Zones support

Multi-cluster Linking

Other enhancements to simplify operations

Let’s dive into these capabilities and see how they work together to achieve a hybrid cloud experience.

DC/OS 1.11: Regions & Zones

DC/OS introduces a two-level hierarchical grouping for your physical/virtual/cloud server pool: Regions and Zones. A Region represents all server instances in a specific datacenter or a cloud region (e.g., AWS-West). A Zone represents a fault domain within each region, for example, a rack in a datacenter or a cloud availability zone (AWS-West-A).

Regions make it easy to manage different cloud environments, while Zones make it easy to automatically manage and deploy applications across fault domains. Let’s look at each in more detail.

Regions
Regions make it easy to manage one large DC/OS cluster across multiple clouds. DC/OS supports multiple regions in one cluster, and introduces the concept of (one) local and (multiple) remote regions. The local region is the region running the master nodes and agent nodes, while a remote region contains only agent nodes. Mesos master nodes must all be in the local region due to network latency requirements. They should, however, be spread across zones within that region for fault tolerance (see the Zones section below for more details).

Customers can use a single DC/OS GUI or CLI instance to deploy applications to a specific region based on available resources or business requirements. Applications that don’t specify a region are deployed to the local region by default.

Regions are expected to have latency no greater than 100ms between them. This is usually adequate for a cluster spread between the US East and West coasts, or between the US East coast and Europe. By default, DC/OS considers connectivity to a remote node to be lost after 10 minutes of inactivity. Users can configure remote agents with a higher timeout, ensuring that applications in remote regions remain alive until connectivity is restored.

Zones
Zones make it easy to deploy applications across fault domains in your datacenter or cloud to increase application uptime. A fault domain is a section of the datacenter or a cloud that is vulnerable to damage if a critical device or system fails. All server instances within a fault domain share similar failure and latency characteristics. All application instances in the same fault domain are affected by failure events within the domain. Placing server instances in more than one fault domain reduces the risk that a failure will affect them all.

Prior to Zones, DC/OS provided high availability and automatic failover against hardware or virtual machine failure. Distributed data services like Kafka, Cassandra, Elastic, or HDFS were automatically deployed across multiple server instances so that the loss of any one machine did not cause a major outage. Zones take this concept to the next level, allowing applications to be distributed across fault domains and further increasing uptime.

For an on-premises datacenter, Zones can be manually defined according to business requirements or datacenter layout. A common approach is to use a server rack or a group of racks as a Zone boundary. Public cloud providers identify fault domains as availability zones within each cloud region. DC/OS automatically detects the region and availability zones of the top 3 public cloud providers (AWS, Azure, GCP) during installation without any configuration from the user.
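For on-premises installs, the region/zone assignment is supplied to DC/OS via a small fault-domain detection script that emits a JSON document. The following sketch (in Python, for clarity) builds that payload; the region and rack names are hypothetical, and the exact schema should be confirmed against your DC/OS version’s documentation.

```python
import json

def fault_domain(region: str, zone: str) -> str:
    """Build the JSON payload a fault-domain detection script emits."""
    return json.dumps({
        "fault_domain": {
            "region": {"name": region},
            "zone": {"name": zone},
        }
    })

# Example: an on-premises datacenter using racks as zone boundaries.
print(fault_domain("dc-east", "rack-12"))
```

Each agent reports its own region and zone this way, and DC/OS uses the labels for placement decisions.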

DC/OS 1.11: Cluster Linker

DC/OS also introduces cluster linking to simplify administration of multiple DC/OS clusters. Organizations sometimes have different DC/OS clusters across multiple datacenters and clouds for different functions such as dev/test, edge cloud, or a remote branch. Clusters can be linked together, and users in the same organization can manage multiple clusters through a single interface. Customers using a single sign-on solution such as SAML or OpenID Connect only have to log in once to manage any linked clusters.

DC/OS 1.11: Other enhancements for simplified cloud operations

To improve the operator experience, we’ve also provided many more capabilities such as Marathon app support, simplified node addition and removal, and automatic detection of Regions and Zones:

Marathon Support
Marathon support for hybrid cloud allows operators to define fine-grained placement policies across Regions and Zones. Operators can now specify whether they want an application deployed to a specific Region (e.g., a particular cloud region) or distributed across (specific) Zones for high availability.

Marathon defaults application deployment to the local Region if no remote Region is specified, in order to avoid an application accidentally being deployed to remote Regions.
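As a sketch of what such a placement policy can look like, here is a hypothetical Marathon app definition (built as a Python dict) that pins an app to one region and spreads its instances across that region’s zones. The app id, image, instance count, and region name are illustrative; `@region` and `@zone` are Marathon’s placement-constraint fields.

```python
import json

# Hypothetical app definition demonstrating Region/Zone constraints.
app = {
    "id": "/frontend",
    "instances": 6,
    "cpus": 0.5,
    "mem": 256,
    "container": {"type": "DOCKER", "docker": {"image": "nginx:1.13"}},
    "constraints": [
        # Pin the app to one (remote) region...
        ["@region", "IS", "aws-us-east-1"],
        # ...and spread its instances across that region's zones.
        ["@zone", "GROUP_BY", "3"],
    ],
}
print(json.dumps(app, indent=2))
```

Omitting the `@region` constraint leaves the app in the local region, per the default described above.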

SDK-based Data Services Support
Certified data services in the Mesosphere DC/OS Service Catalog have been updated to support deployment across Zones. Customers can use Marathon-style constraints to specify that a data service should be deployed automatically across Zones. Customers can also specify the zones in which they would like their application to be deployed for more fine-grained control. Note that deploying data services across Regions is not recommended given the expected higher latency between them. Our team will continue working with our partners to bring these new capabilities to additional data services.

Simplified node addition/decommission
DC/OS 1.11 also makes it easy to add nodes to and remove nodes from the cluster. You can add nodes to a specific Region by installing DC/OS on new resources using any of your favorite automation tools such as Chef, Puppet, Ansible, Terraform, or even Bash scripts. Node decommission is also simplified through the CLI with a simple dcos node decommission command. Note that this command only removes the node from the DC/OS cluster; it does not tear down the physical or cloud machine, which you will need to do yourself to avoid paying for unused resources.

Automatic detection of cloud regions and availability zones
At node installation, DC/OS communicates with the 3 major public cloud provider APIs (AWS, Azure, GCP) to identify the region and availability zone of the node. DC/OS then tags the node automatically with the appropriate labels for simplified administration. This feature can be enabled or disabled if desired.

Kubernetes-as-a-Service Now Available in DC/OS 1.11 (https://mesosphere.com/blog/dcos-1_11-kubernetes/, Thu, 08 Mar 2018)

Pure, open source, Kubernetes-as-a-Service is now generally available on DC/OS 1.11. Operations teams can now deploy and manage a CNCF-certified, highly available Kubernetes cluster anywhere with a single command or button push. Furthermore, fixing the cluster usually requires no action from operators thanks to zero-touch self-healing. To keep management truly simple, the numerous monitoring, developer tools, and other solutions required alongside Kubernetes are easily accessible from the DC/OS service catalog, and they all share the same high availability deployment and ongoing management features.

As one would expect of any “as-a-Service” delivery, with the new release of DC/OS, managing each stage of a Kubernetes cluster’s lifecycle is distilled into a single command line.

Some of the highlights for this release include:

Single Click HA Automation – Deploy, scale, upgrade, and manage with a push of a button or single command.

Google, AWS, and Azure – Mesosphere partnered with Google, worked with AWS, and made DC/OS available on Azure to ensure anyone can easily deploy Kubernetes and its cloud native ecosystem, including in hybrid and edge computing environments.

Enhanced Security – Possible security holes opened up by inexperienced Kubernetes administrators, like the unsecured administrative console recently used as a vector for malware, are locked down by default on a secured DC/OS cluster.

Evergreen – With this release, we are tracking to the newest release of Kubernetes, and the releases will remain evergreen by adding future Kubernetes releases to the DC/OS Service Catalog soon after they become generally available.

Unify the Cloud Native Landscape With Kubernetes

Kubernetes requires other components that users often ask for by name. The Cloud Native Computing Foundation, which governs the Kubernetes project, has adopted some of these open source technologies like Prometheus for monitoring or Linkerd for service management, and has published a helpful guide to the Cloud Native Landscape.

DC/OS includes over 100 solutions delivered as-a-Service in its service catalog, including the full array needed for Kubernetes solutions.

Kubernetes-as-a-Service Delivery, Anywhere

With DC/OS, installation and ongoing management of Kubernetes clusters is automated. Kubernetes on DC/OS has the following features to allow operations teams to easily manage the cluster:

Provisioning: $ dcos package install kubernetes

By default, there are three Kubernetes etcd nodes, three master nodes, three private worker nodes, and one public worker node. The private nodes (e.g., behind the firewall) are where pods are deployed by default. The public node (e.g., in the DMZ) needs to be exposed explicitly by the user.

Upgrading: $ dcos kubernetes update --options=new_options.json

The Kubernetes versions available on DC/OS will track very closely to the releases of Kubernetes. This allows customers to use the latest features of the release, which are often required for “next steps,” and ensure interoperability with public clouds that update to the newest version a few weeks after a Kubernetes release. Updating is a single command once you have the new packages and an updated JSON file.

Scaling: $ dcos kubernetes update --options=options.json

Scaling the Kubernetes nodes up and down in DC/OS simply requires changing the number of nodes in the JSON file and running a single command.
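A minimal sketch of that workflow, assuming the worker count lives under a `kubernetes.node_count` key in the options file (the exact key name is an assumption; check your package version’s configuration schema):

```python
import copy
import json

# Sketch: raise the worker count in the options dict that is then written
# to options.json and fed to `dcos kubernetes update --options=options.json`.
# The "node_count" key is an assumption about the package's schema.
def scale_worker_count(options: dict, node_count: int) -> dict:
    updated = copy.deepcopy(options)  # leave the original options untouched
    updated.setdefault("kubernetes", {})["node_count"] = node_count
    return updated

current = {"kubernetes": {"node_count": 3}}
print(json.dumps(scale_worker_count(current, 5)))
```

Scaling back down is the same operation with a smaller count.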

Despite everyone loving their Kubernetes cluster, it is sometimes best to let go and start anew. Killing a Kubernetes cluster in DC/OS is a single command.

Zero Touch Self-Healing and Easy Disaster Recovery

When Kubernetes master or worker node components are no longer working, DC/OS can respawn them. DC/OS uses Mesos application-aware scheduling not only to provision the cluster but also to maintain the operator’s desired state and self-heal when something goes wrong.

If there is a catastrophic failure across all infrastructure, DC/OS has adopted the Heptio Ark tool to provide a simple disaster recovery procedure for backing up and restoring the cluster. The commands are natively supported in the DC/OS command line, and the Kubernetes cluster can be backed up to cloud provider storage or to one of the data services in DC/OS.

To fully restore a cluster, the command is simply $ dcos kubernetes restore.

Robust Cluster and Network Security

Kubernetes inherits many of the security features of DC/OS. Full Transport Layer Security (TLS) is enabled by default.

Kubernetes inherits the DC/OS network overlay by default, and the Container Network Interface (CNI) is also available. For example, if you want a zero-trust, policy-driven network, you can plug in Project Calico. CNI provides a number of options for pure Kubernetes, including Project Calico, Amazon EC2 elastic network interfaces (ENI), VMware’s NSX, and others.

Getting Started with DC/OS 1.11 and Kubernetes: Batteries Included

Whether you’re an existing Mesosphere customer or a new user who wants to try open source DC/OS, it is easy to get started with Kubernetes on DC/OS today.

Push-Button Kubernetes for Existing Customers
For Mesosphere’s more than 125 customers, after adopting DC/OS 1.11 it is easy to spin up a highly available Kubernetes cluster for production workloads with a single push of a button or a single command line. For more information, see the official documentation.

Quickstart for New Open Source DC/OS Users
For those that are new to DC/OS, there is now a Quickstart to get you up and running quickly. There are Terraform templates for AWS, Microsoft Azure and Google Cloud to help you provision cloud instances, DC/OS, and a Kubernetes cluster with a few commands.

It’s incredibly straightforward to deploy production-ready data services into your DC/OS cluster in a few clicks, but why stop there? Let’s take the next step and connect data services to create a complete data pipeline!

For this guide, I will show an example that uses the Confluent Platform, leveraging the following tools, to pipe data into an Elasticsearch service co-located in my DC/OS cluster:

Confluent-Kafka

Confluent-Connect

Confluent-Control-Center

Confluent-REST-Proxy

Confluent-Schema-Registry

Elastic

If you are interested in learning more about Confluent, take a look at this blog post by Kai Waehner covering why Confluent Platform, DC/OS, and microservices work hand-in-hand to produce highly scalable microservices.

Note that this guide leverages some of our Certified frameworks in the DC/OS Service Catalog (Confluent-Kafka, Elastic). Certified packages (as discussed in part 1 of this tutorial blog series) are Enterprise-supported and production-ready frameworks built in conjunction with our partner network adhering to best practices for deployment, Day 2 Operations, and maintenance. As a quick recap, DC/OS Certified packages typically support:

Single-line install command

Built-in health monitoring

DC/OS CLI subcommands

Rolling updates

Rolling configuration updates

Configurable placement constraints

Production grade security

Multitenancy

Enterprise support through a channel partner

The architecture of the solution we’re building (visualized below) is commonly used in many fast-data (streaming) solutions today. By using a highly scalable and highly available publish-subscribe solution such as Apache Kafka, customers can build powerful distributed applications at web scale. Multiple raw data sources can pipe data into Kafka, and tools such as Apache Spark can be used to analyze Kafka streams and persist results into many different data services.

Benefits of running Kafka on DC/OS

While DC/OS streamlines the implementation of data services, it also makes Day 2 Operations such as scaling, management, networking, and monitoring insanely easy. Here are some examples:

Automated provisioning and zero-downtime upgrades of Kafka components. All Certified DC/OS frameworks are built with best practices in mind and are quick and easy to deploy. In Day 2 Operations however, Mesosphere brings value by making it simple to upgrade, scale, and manage the cluster past the deployment phase from a single graphical management console and control plane.

Unified management and monitoring of multiple Kafka clusters on a single infrastructure. Many companies run multiple Kafka clusters, often one per business unit, each managed by a separate team. By running Kafka on DC/OS, you can manage all of these clusters from one single pane of glass, driving up operational efficiency and lowering management costs.

Elastic scaling, fault tolerance, and self-healing of Kafka components. Kafka on DC/OS is built out-of-the-box to be a highly scalable and fault tolerant solution. Our team has worked together with Confluent to create an enterprise grade and production-ready solution that can be easily deployed and managed.

Apache Kafka on DC/OS Guide

For this guide, we will walk through deployment using the GUI as much as possible so that you can get a visual representation of the end solution. However, we recommend that you script and automate this in production settings with the CLI and API tools provided by Mesosphere.

Prerequisites:
This guide uses 21 CPU shares in DC/OS. I typically test on a DC/OS cluster using m4.xlarge instances with 8 private agents and 1 public agent.

Step 1: Deploy Services
Once your cluster is up and running, navigate to the DC/OS Catalog and search for the Confluent packages. Deploy the default packages listed below by clicking package —> review & run —> run service.

confluent-kafka*

confluent-connect

confluent-control-center*

confluent-rest-proxy

confluent-schema-registry

elastic

Marathon-LB

*Note: A Confluent license is required to use this package. The trial lasts for 30 days.

Note: It may also be useful to install the CLIs associated with these services.

Step 2: Expose Control Center via Marathon-LB

Select and edit the configurations for control-center and navigate to the Environment tab in the left column. Add a label (Key: HAPROXY_GROUP / Value: external) to expose the service and select Review & Run —> Run Service.

Repeat step 2 for the REST-proxy service as well.

Step 3: Access Confluent Control Center

Now that Confluent Control Center is properly exposed to the public internet, we can access the server via the service port that Marathon assigned. In the recent DC/OS 1.11 release, we added the capability to view service endpoints through the GUI: simply navigate to the Control Center service –> Endpoints tab to view the service port.

In this case, the assigned port is 10002, so we can access Control Center by opening http://<public_node_IP>:10002 in a browser.

You should see this:

Step 4: Create a Kafka Topic

Create a Kafka topic using the Confluent-Kafka CLI command below:

dcos confluent-kafka topic create <topic_name>

Output should look similar to this:

Step 5: Configure Elastic Connector in Confluent Control Center

First, grab the Elastic coordinator-http service endpoint by running the command below:

dcos elastic endpoints coordinator-http

Output should look similar to this:

Go ahead and grab the Confluent-Kafka broker endpoint as well, since we may need it later:

dcos confluent-kafka endpoints broker

Navigate to the Kafka Connect tab in Control Center —> Sinks —> Add a Sink in order to configure the data pipeline between the Confluent-Kafka and Elastic data services and select topic1 that was just created.

Set your connector class to ElasticsearchSinkConnector and give it a name; in this case I used topic1connector. Continue to fill in the information below:
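Behind the Control Center form, Kafka Connect receives a sink configuration along these lines. This is a sketch shown as a Python dict; the connection URL assumes the coordinator-http endpoint retrieved in the previous step, so substitute your cluster’s actual value.

```python
# Sketch of the Elasticsearch sink configuration assembled by the
# Control Center form. connection.url is assumed to be the Elastic
# coordinator-http endpoint retrieved earlier.
connector = {
    "name": "topic1connector",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "tasks.max": "1",
        "topics": "topic1",
        "connection.url": "http://coordinator.elastic.l4lb.thisdcos.directory:9200",
        "type.name": "kafka-connect",
        # Ignore record keys for simple document indexing.
        "key.ignore": "true",
    },
}
print(connector["config"]["connector.class"])
```

The same JSON can be POSTed directly to the Kafka Connect REST endpoint if you prefer to skip the GUI.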

The result should be a ‘RUNNING’ state:

Note: To retrieve the DC/OS auth token, you can run dcos config show core.dcos_acs_token. It is good practice to reference this command as an abstraction rather than pasting the raw token into the configuration, so that your token is not exposed as it is in the example above.

Step 6: Send Messages to Kafka Using an API Call

Using the API syntax below, send a couple of messages to Kafka in AVRO/JSON format, the easiest and preferred format for Kafka:
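As a sketch, a Confluent REST Proxy (v2) request posting Avro records is built like this; the record schema and field values are hypothetical, and the payload would be POSTed to http://<public_node_IP>:<rest_proxy_port>/topics/<topic_name>.

```python
import json

# Sketch of a Confluent REST Proxy v2 request body for Avro records.
# Schema and values are hypothetical examples.
value_schema = json.dumps({
    "type": "record",
    "name": "Purchase",
    "fields": [
        {"name": "item", "type": "string"},
        {"name": "amount", "type": "int"},
    ],
})
payload = {
    "value_schema": value_schema,
    "records": [{"value": {"item": "book", "amount": 2}}],
}
# The Avro-embedded content type tells the proxy how to decode the body.
headers = {"Content-Type": "application/vnd.kafka.avro.v2+json"}
print(json.dumps(payload))
```

Send it with curl or any HTTP client, using the headers shown above.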

Output should look similar to below:

Step 7: View Data Persisting in Elastic

To view the data persisting in Elastic, we can leverage the popular byrnedo/alpine-curl Docker image. Create a Service through the GUI and input the following as the command (see picture below for more detail), or deploy a Marathon app definition using the DC/OS CLI:
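If you take the CLI route, a sketch of such a Marathon app definition (e.g., saved as curl.json and deployed with dcos marathon app add curl.json) might look like the following. The app id, Elastic index name, and polling loop are illustrative, and depending on the image’s entrypoint you may need to adjust how the command is invoked.

```python
import json

# Hypothetical Marathon app that periodically curls the Elastic
# coordinator endpoint so the results appear in the task's STDOUT.
app = {
    "id": "/elastic-curl",
    "cpus": 0.1,
    "mem": 32,
    "cmd": ("while true; do "
            "curl -s http://coordinator.elastic.l4lb.thisdcos.directory:9200"
            "/topic1/_search?pretty; sleep 60; done"),
    "container": {"type": "DOCKER", "docker": {"image": "byrnedo/alpine-curl"}},
}
print(json.dumps(app, indent=2))
```

Once the task is running, its STDOUT shows the indexed documents, as described below.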

Congrats! If successful, you can navigate to the STDOUT of the curl service that was just created and see that data has been persisted in Elastic. The output should look similar to below:

Summary

Data pipelines often consist of multiple complex components such as Kafka, Spark, Elastic, and Cassandra. DC/OS provides a platform that simplifies the deployment and management of such solutions using our operational expertise to automate non-trivial tasks. In this article, I showed you how to build a data pipeline to work with streaming data from multiple sources. In a future blog, I will elaborate to show how we can use analytics tools such as Spark in order to complete the data lifecycle from start to finish.

Learn More About Cloud Native Infrastructure

Cloud native infrastructure is more than servers, network, and storage in the cloud: it is as much about operational hygiene as it is about elasticity and scalability. This book reveals hard-earned lessons on architecting infrastructure from companies such as Google, Amazon, and Netflix, drawing inspiration from projects adopted by the Cloud Native Computing Foundation (CNCF). It also provides examples of patterns seen in existing tools such as Kubernetes.

HQ Trivia’s outages during the Super Bowl halftime show remind us that, even for seasoned professionals, designing apps that can handle the deluge that comes with overnight success is hard. It’s now common in the tech industry to treat the public cloud’s near-limitless capacity as a substitute for architecting app infrastructure that can scale quickly to meet growing and dynamic demand. It’s possible to avoid the “technical difficulties” fate by learning from the successes of other SaaS and mobile apps that faced this challenge before.

If you haven’t played HQ Trivia yet, you’re missing out. It’s re-imagining the traditional trivia show format to create a live, mobile experience — allowing thousands of your closest friends to compete in a question-and-answer game for cash prizes. You can’t yet earn Jeopardy-level money or notoriety, but it is fun. However, it’s begging for some updates, like nixing the annoying chat room and creating ways to keep players engaged after they miss one of the questions.

Game mechanics aside, its Achilles’ heel right now is, unfortunately, downtime caused by overwhelming user demand. Like Twitter, Pokemon Go, and dozens of apps before them, HQ Trivia is struggling with its popularity. Building an application to scale up to support an average of 600,000 simultaneous users is no small feat — let alone a spike of 1.7 million during the Super Bowl. It’s not surprising that the video routinely freezes and connections are often dropped, resulting in emotional outbursts and obscenities in the chat stream. If HQ Trivia wants to keep building on its surging popularity, they must address current technical shortfalls. Rapid success is both a blessing and a curse for digital businesses, and HQ Trivia is wrestling with some of the hardest demands that can be placed on an application infrastructure. It’s a great problem to have, but a real headache for the IT organization.

HQ Trivia’s Exponential yet Spiky Demand

HQ Trivia is gaining users at breakneck speed. The hundreds of thousands of general trivia enthusiasts that log in twice every weekday play for only a short time. This results in serious traffic bursts lasting tens of minutes, while for much of the day traffic is flat. This dramatic burst pattern is common in industries like retail/ecommerce (Black Friday and Cyber Monday) and financial services, where crunching numbers for end-of-day portfolio analysis can require orders of magnitude more compute. Experience in these industries has shown that avoiding outages requires a well-thought-out approach, not a quick fix. Adding physical compute and memory through elastic cloud resources is akin to the traditional approach of throwing more hardware at the problem, and isn’t the only, or even the best, answer. Truly elastic application infrastructure that can scale components of the backend up and down is crucial for high availability and long-term scalability.

In the early years, Twitter faced a similar problem: global site outages, labeled “fail whales,” became common occurrences. A user’s homepage on Twitter, essentially a firehose of short messages, required multiple loosely coupled data and other services in order to render. An outage in one of the open source services (like MySQL) or any latency in a connection would cascade and snowball into outages. At the time, the only way to add capacity to any service was to add servers and use configuration management to provision the application infrastructure. When the number of users is growing exponentially, these processes become error-prone and need to be automated and standardized. At Twitter, Mesosphere founders Florian Leibert and Ben Hindman helped architect one of the solutions that killed the fail whale, including new software infrastructure based on Apache Mesos. This allowed engineering to easily provision and manage the entire lifecycle of the application infrastructure components needed to add capacity as usage spiked. If something went down, Mesos would make sure to spin it back up. Mesosphere was born out of the need to make that type of technology more accessible to the wider business world, not just web-scale startups.

Real-Time, Data-Rich Delivery

HQ Trivia’s traffic spikes involve demanding, resource-intensive, data-rich video, which requires additional design considerations to ensure a Netflix-like experience. On top of that, the network must support real-time chat interactions. While automation, management, and standardization of application infrastructure are important for data-rich services like video, it is also critical to consider data locality: to preserve a compelling user experience, data needs to be geographically close to the user to take advantage of a low-latency connection.

One could claim that cloud infrastructure and content delivery networks (CDNs) eliminate the need for regional compute. Cloud providers have sold technical professionals wholesale on the myth that cloud bandwidth is infinite and latency virtually zero. In reality, when signing up with a cloud provider, technologists are often limited by the hardware available in a given service region. For example, a cloud provider may only have the hardware needed to support your disk or memory requirements in one datacenter. Once your application is constrained to a particular regional datacenter, your entire business is limited by that site’s location, resources, and load.

CDN services, however, become less useful the more dynamic your content. Very few support live-streaming, and those that do are shockingly expensive at scale. Instead of relying on a large, centralized cloud or “off-the-shelf” CDN to distribute video to customers, engineering should take a hybrid approach, deploying compute at the edge with the right resource management and networking to ensure that the application is highly available. This hybrid approach, with edge computing powering the user experience, is crucial to delivering on rising consumer expectations.

Edge is the Answer

HQ Trivia requires more than a CDN or a global cloud provider to solve their problems. While we don’t know for certain what their infrastructure looks like, we believe that what they need is an edge computing architecture built on a combination of many regionally distributed data caching and compute resources that can quickly process data streams from within the chat while preserving data-rich video performance and quality-of-service (QoS).

If Royal Caribbean can use edge computing at sea to cope with exponential, spiky demand against a geographically challenged infrastructure, then so can HQ Trivia. As anyone who has been on a cruise can testify, cruise ship internet connections are expensive and tenuous. Not too long ago, passengers who wanted to sign up for on-board activities or excursions had to line up and talk to guest services. By adopting Mesosphere DC/OS, Royal Caribbean was able to extend its compute power all the way to the edge (in this case, cruise ships) to power a reliable mobile app experience, allowing customers to spend more onboard the ship. The suite of Royal Caribbean mobile applications is powered by DC/OS, which manages resources across the datacenter and public cloud to deliver reliable mobile experiences to everyone on board, even during peak demand. As this example shows, the right architecture makes it possible to create reliable mobile experiences, even in the middle of the ocean.

HQ Trivia could leverage a similar approach to connect concurrent users to content close to their location. Indeed, most organizations should start planning to adopt this edge design. As our devices become more connected and smarter, enabling deeper interactions, and as our tastes evolve from generic to bespoke, personalized experiences, the need for edge computing becomes ever more salient. That’s easier said than done, as the HQ Trivia engineering team is likely thinking as they read this post.

At Mesosphere, we’re the first to admit that the new reality is hard. Building and operating distributed applications and related services results in hard problems with often half-built solutions. It’s why we’ve made it our mission to make managing dynamic infrastructure, which is increasingly hybrid and multi-cloud, a bit more manageable.

Free O’Reilly eBook excerpt: Cloud Native Infrastructure

Your guide to the best practices, patterns, and requirements for creating cloud native infrastructure that meets your organization's needs.

It takes a certain level of tenacity to dive into the unknown of building a technology startup from scratch, and this is elevated to a whole new level when you decide to take on a highly seasonal business dealing with sensitive financial information and stressed-out customers. That’s exactly what Will Sahatdjian endeavored to do when he and his co-founders, Richard Lavina and Michael Mouriz, started Taxfyle in 2015. An experienced CTO, Sahatdjian is no stranger to building systems from the ground up. He had previously built HIPAA-compliant software and knew he wanted to build a new service on a highly secure, available, and scalable modern stack.

“When it comes to startups, it’s easy to fall prey to the path of least resistance. Sign up with a cloud provider, add a few proprietary services to maintain velocity, and ship product fast,” said Sahatdjian. “It’s only later that you realize you’ve chained yourself to a particular cloud provider and you’ve lost control of your data. I never wanted to put Taxfyle in that position. On the flip-side, traditional ETL services get very expensive, very quickly. While researching viable alternatives, I was immediately impressed by Mesosphere DC/OS.”

Before deciding on DC/OS, Sahatdjian worked through the logical steps: He experimented with Digital Ocean droplets and AWS instances for some proofs of concept, quickly realizing that neither would be able to scale with the company in the long run. Sahatdjian said, “We tried Docker Cloud next. I had a positive experience while evaluating its predecessor, Tutum, and a low-friction path to going multi-cloud with a SaaS-like, managed service seemed like an easy win. Unfortunately, we had some stability hiccups between Docker Cloud and the host nodes it needed to manage on AWS and Digital Ocean. We quickly realized that there were trade-offs that come with having mission-critical infrastructure reaching across multiple availability zones for every aspect of orchestration.”

Taxfyle Invests in DC/OS

“Determined to avoid a repeat of our first experience, I dug pretty deep while researching DC/OS and its underlying services. It became apparent that even seemingly arbitrary architecture decisions were actually directly influenced by the types of pitfalls I’d encountered during my other approaches. The end result seemed less like an automobile and more like an airplane. As long as our internal testing went well, I was ready to bet the farm. We felt that investment in DC/OS would be worth the added stability, the control over staging and production environments, and the CI/CD capabilities, like consistent artifact management and blue-green deployments.”

Prioritizing responsiveness, flexibility, and fine-grained control, Sahatdjian began extensive research and development, evaluating cloud providers, bare metal, and related server management services. “About a year ago, we decided to migrate from AWS to Google Cloud, and we’ve been a very happy customer since. The management and services interface is easy to use and intuitive.” With Google Cloud Platform in place, Sahatdjian sought additional capabilities to get the most out of his cloud resources with highly scalable microservices and data pipelines.

“We’re planning to double our engineering team this year, and I needed to simplify cloud operations so that each new hire can get up to speed quickly and start contributing. Mesosphere DC/OS helps us do that and more,” said Sahatdjian. “We were using Kubernetes for container orchestration, and still do for one production database, but it’s not the end-to-end solution we found in DC/OS and, more specifically, Marathon, which is much more intuitive to use than Kubernetes. DC/OS has many features that allow us to address the security concerns that come with processing sensitive financial data. Between Mesosphere DC/OS and Google Cloud Platform, I’m confident in the security of Taxfyle’s platform.”

Taxfyle is Poised to Scale

With his stack shaping up nicely and engineering onboarding accelerating, Sahatdjian has perfectly positioned Taxfyle to scale, not only in users but also in features. With DC/OS Edge-LB, Sahatdjian doesn’t have to worry about load balancing and has the confidence to scale during tax season. “For me, this first phase was about getting the right infrastructure and toolchain to capture streaming data, with Google Cloud Platform, DC/OS, and the suite of services it enables in place. Our next phase involves leveraging an advanced data pipeline with Spark, Kafka, and machine learning.”

The Taxfyle platform makes filing taxes as painless as possible. Consumers can upload their documents and file with confidence, knowing that documents are processed accurately and securely, and reviewed by a tax professional. With DC/OS enabling data stream capture, Taxfyle can create customer profiles that help identify important user behavior and quality metrics at a granular level. In addition to the consumer-facing service, Taxfyle builds and maintains a vertically integrated platform serving its distributed workforce of financial professionals to ensure a fully automated and seamless experience.
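As a rough sketch of what that kind of event capture can look like on a Kafka-backed pipeline (the kafka-python client, the topic name, the event fields, and the broker address below are illustrative assumptions, not details of Taxfyle’s actual implementation):

```python
import json


def serialize_event(event: dict) -> bytes:
    """Encode a user event as canonical UTF-8 JSON for the message bus."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    # Requires the kafka-python package and a reachable broker.
    # On DC/OS, the Kafka package exposes its brokers behind a named VIP.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="broker.kafka.l4lb.thisdcos.directory:9092",
        value_serializer=serialize_event,
    )
    # Publish one hypothetical user event to a hypothetical topic.
    producer.send("user-events", {"user_id": "u123", "action": "upload_document"})
    producer.flush()
```

Events captured on a topic like this can then feed both real-time consumers that build customer profiles and downstream batch analysis in Spark.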

“I think most people assume Mesosphere is only for enterprise companies, but startups should start paying more attention to the long-term implications of architecture choices and dependency on cloud operations. Sacrificing agility for convenience across the stack really begins to add up. DC/OS has removed so much of the usual DevOps and orchestration headaches from my day-to-day. Instead, I’m able to focus on building product and growing my team to better serve our growing customer base. And, if the day ever comes when I need to migrate to a new cloud provider, I can rest easy knowing that my applications are easily portable using DC/OS,” said Sahatdjian.

Learn more about Data-Rich Apps in Financial Services

Modern data-driven application architectures are gaining broad adoption in leading financial services firms. Technologies first developed by webscale companies for personalized experiences are now used by financial services to enable everything from customer insights to cybersecurity and fraud detection. Download this guide to learn how microservices and cloud-native data services bring new flexibility for organizations to get actionable insights from enterprise data.