What is Cassandra?

Apache Cassandra is an open-source, distributed NoSQL database system whose design and data model are inspired by Amazon’s Dynamo and Google’s Bigtable, respectively. Cassandra gained popularity because of its scalability and high availability with no single point of failure. There is no concept of a master node; all nodes communicate with each other for consensus and data partitioning. Cassandra also allows workloads to run across multiple datacenters with support for low-latency replication, making it a great platform for mission-critical data.

Golden Signals of Cassandra Health and Performance

For engineering teams using Cassandra to deliver workloads at scale, it is critical to monitor cluster health in real time to avoid performance issues. Cassandra is a Java-based system that can be managed and monitored via JMX. Some of the key metrics to monitor, often referred to as the Golden Signals of application health, include the following (a short client-side sketch of these signals appears after the list):

Throughput of read and write request queries

Latency of slowest queries

Error rates
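To make these signals concrete, here is a minimal sketch of what they look like when measured from the client side with the gocql Go driver; the contact point, keyspace, and products table are hypothetical placeholders, not part of this article’s setup. Netsil derives the same signals from the wire without any such client code, but the sketch shows exactly what is being measured per query.

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Connect to the cluster; host and keyspace are placeholders.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "shop" // hypothetical keyspace
	cluster.Consistency = gocql.Quorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer session.Close()

	// Time a single read query and record whether it errored.
	start := time.Now()
	var name string
	err = session.Query(`SELECT name FROM products WHERE id = ?`, 42).Scan(&name)
	latency := time.Since(start)

	if err != nil {
		log.Printf("read error after %v: %v", latency, err) // feeds the error-rate signal
	} else {
		log.Printf("read ok in %v", latency) // feeds the latency/throughput signals
	}
}
```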

Netsil’s Approach for Cassandra Monitoring

Fig 1: Golden Signals of Cassandra Monitoring

Interaction Analytics: The Netsil Application Operations Center (AOC) uses a network-centric approach to monitor query-level performance for databases like Cassandra. Without instrumenting either the server or the client side, just by looking at the wire protocol through TCP packet capture, the AOC provides information about how each and every query is doing. This approach has low overhead and gives real-time visibility into latency, throughput, error codes, request distribution, response sizes, etc. for every query. You do not need to install the Netsil traffic collectors on the database server; you can observe the interactions from the client side.

Polling: The AOC also uses a polling technique to give a complete picture of database performance. Basic polling allows us to look at the saturation metrics of the database (e.g., thread counts, IOPS issues, connections/sec, etc.).

The protocol datasources related to request/response are available out-of-the-box in the AOC. Please look at the pre-canned dashboards for Cassandra or use the Analytics Sandbox to plot charts without any additional configuration. The infrastructure datasources are available if you configure the Cassandra integration. More information on configuring the infrastructure integration can be found in the documentation.

Throughput

Monitoring the throughput of queries between a client and server provides a real-time snapshot of how the database is performing. Using the AOC’s drill-down feature, you can easily separate throughput numbers based on dimensions such as query string, query type, server error code, server instance, server port, etc.

For example, as shown in the image below, you can sort the throughput of the top-K queries by the client name. All of this is done without any code instrumentation. This really helps DevOps efforts when you want to know which clients or servers are making the most requests and receiving the most responses.

By monitoring the throughput numbers you can track your cluster’s overall health and watch for spikes or dips that might need further investigation.

Fig 2: Throughput of most requested Cassandra queries

Latency

Monitoring the latency of a read or write query is critical no matter what your use case is. By focusing on the latency numbers you can identify potential problems or shifts in usage patterns and adjust your cluster size accordingly. In the AOC you get latency information about the top-K most requested queries as well as the slowest queries.

DevOps teams can use the real-time latency information to figure out where traffic bottlenecks might be building up and, for example, figure out which host might be contributing to the latency the most.

Fig 3: Latency of slowest Cassandra queries

Error Rates

The AOC tracks the server error codes and error strings, and monitors the error rates. Alerting on a high number of errors is very important for DevOps teams. If your Cassandra cluster is unable to handle incoming requests adequately, then it is definitely something worth paging your team about.

Conclusion

Netsil’s interaction analytics combined with polling gives complete visibility into the performance of the Cassandra database, without any instrumentation on the client application or database side. By interpreting the network interactions, Netsil is able to track the performance of queries with no overhead on the database servers.

If you are using Cassandra for your cloud application, we encourage you to get started free with the AOC today and gain complete visibility into the health of all your service interactions.

Recap

Kubernetes has emerged as the dominant container orchestration platform. It pretty much slayed all of its competition, including Docker Swarm, DC/OS, and AWS ECS, in 2017. Containers are a convenient pattern that simplifies packaging, shipping and running software applications, particularly “microservices” applications. But containers come with the headaches of managing namespaces and underlying infrastructure abstractions such as network namespaces, routing tables, persistent storage volume mounts, noisy/bad container neighbors, etc. So, if you have lots of containers, you will have lots of problems. This is where Kubernetes comes in.

From a business perspective, Kubernetes, Kubernetes-native applications, Kontainers, etc., all essentially promise that software development teams will be able to deliver features faster and quickly address business needs. However, if your software velocity is slow mostly due to organizational and process-related issues, these technologies may not result in any benefits.

With that in mind, let’s take a closer look at the Kubernetes 2018 outlook.

Common wisdom in the valley is that building software is only 10% of the challenge and the other 90% lies in running, monitoring, securing, maintaining and operating it. Microservices architectures of Kubernetes-native applications are no exception. 2018 will be the year when the operational challenges of Kubernetes-native applications take center stage. The progress made in 2017 points to the following operational challenges:

1. Service Mesh: In Kubernetes-native applications almost every service depends heavily on a large number of other services, and the dominant mode of dependency is the network (i.e., API calls). Hence the problems of service discovery, routing service calls, fault-tolerant circuit breaking, handling time-outs and retries, etc. are omnipresent in Kubernetes-native applications. And indeed, one of the areas of focus at KubeCon (Dec 2017) was the service mesh, i.e., a network of proxies that provide these inter-service communication functions without requiring every service author to implement them. (Read: Matt Klein’s blog on load balancers and proxies for modern applications)

2. Observability: Kubernetes-native applications can be thought of as a graph or map where each node represents a service and edges represent communication/dependency. If such a visualization is absent, every operational aspect becomes daunting. If a service is having issues, the entire dependency chain will be impacted, yet it is very hard to identify all the services in that chain. This challenge arises every time there is a bad deployment or an incident. Additionally, now that a majority of the business logic relies on inter-service API calls, it is crucial to monitor the golden signals of these interactions: latency, throughput and error rates. (Read: Netsil’s blog on observability)

3. Security: Kubernetes has always put security first. Kudos to the developers and thought leaders for baking in things like TLS among the CLI, kubelet and master API endpoints since the very early days. In 2016 and 2017 we saw RBAC and network policies as major steps towards security. However, there is still a significant way to go on the security front. At the core, the namespace and cgroup underpinnings of containers are very “thin boundaries” from a security perspective (as opposed to the VM or good ol’ bare-metal OS). That problem still needs to be addressed, and yes, signed Docker images were the “d’oh” moment in that direction, but developers are picking up open-source packages all the time. Beneath the signed image could be lurking something dangerous that crosses container boundaries without too much difficulty. (Read: Rkt examples of security challenges)

On the security of inter-service communications, the DMZ concept has already proven to be obsolete. So the state of the art is micro-segmentation or security groups. Today, the billion-dollar revenue claims of VMware NSX and the fact that security groups are fundamental elements of all public clouds are a testament that micro-segmentation is crucial. However, IP-level security does not work for containers that change their port and IP address with every incarnation. Moreover, modern attacks already piggyback on existing, sanctioned communications. In a nutshell, deeper, application-aware security of inter-service communication will be needed. (Read: Project Cilium for HTTP-aware security).

Kubernetes on Public Cloud is The Winner. Kubernetes on Private Cloud Not Yet

In 2017, we saw every major public cloud embracing Kubernetes, including AWS and Azure, as well as private cloud vendors including VMware and Pivotal. The fastest and most reliable path to getting Kubernetes-native applications into production will be leveraging the K8s-as-a-service offerings of the public cloud vendors. GKE naturally has a huge lead there, with polished integrations for handling networking, volumes and ingress controllers. AWS is the incumbent cloud leader and one of the fastest-moving companies in terms of meeting customer needs, and Azure is making huge strides with its Deis acquisition and Brendan Burns, one of the founding engineers of Kubernetes, now in the Azure camp.

Kubernetes in the private cloud will suffer, though, from the simple fact that the private cloud still largely struggles in reality. Let’s first establish that a virtualized environment is not a cloud. The breadth of services, high SLAs, and comprehensive APIs and ecosystem offered by public clouds give them an advantage over any virtualized private datacenter, however close to a cloud it tries to be. Even if good APIs, automation, RBAC, authentication, etc. are addressed in a private cloud, there are still big gaps such as object storage (S3) or a reliable Database-as-a-Service (RDS). Where does the “state” get stored, since the Kubernetes-native paradigm encourages building stateless apps that push state out to services such as S3, Spanner, or Cloud SQL? Then there is the networking challenge. The talk by Kelsey Hightower on “Container Networking” illustrates the inadequacies of the private cloud with regard to Kubernetes-native applications.

Early Majority Asks — What applications to run on K8s?

When all is said and done, what do you build on K8s? A challenge is that much of your workforce is baby-sitting what was built in the past decades. Those older applications don’t work well with the modern “stateless”, “ephemeral”, “ci/cd” paradigms of the Kubernetes world. So it will likely be green-field applications and services built from scratch, though mechanisms to discover and interact with the old world will still be needed. As K8s enters the early majority, the early adopters will continue to present their use cases and help pave the way in these conversations. As an example, here is a brilliant talk from KubeCon 2017 describing the challenges of porting existing legacy applications to the Kubernetes-native landscape.

To Pod or To Lambda

While you were reading the sections above, a new paradigm was already brewing hot in the market: Serverless Computing. While technically not server-less, this paradigm essentially takes your “function code” and schedules it to run on servers. There is no need to baby-sit schedulers or worry about routing calls, load balancing, etc.; the “FaaS” takes care of it. If this walks like a PaaS and quacks like a PaaS then perhaps it is, but with fewer constraints than the PaaS of yesteryear and done at the larger scale of the cloud. Serverless computing is still in its nascent stages. The debate will be whether Kubernetes is needed at all, or whether energy should rather be focused on application logic, described to a Lambda service which takes care of operationalizing it. For an excellent article on this topic, you can read Karl Stoney from Thoughtworks.

Conclusion

Change is the only constant attribute of the technology industry. The pace of change has also been accelerating, particularly with the democratization of computing via public clouds. Kubernetes is the promising layer of democratization across the clouds. It will certainly have a significantly disruptive impact on the way applications are designed and run in the coming years. Of course it runs the risk of getting disrupted itself by the likes of Lambda services or AI programs that write and run software on their own!

Best wishes from the Netsil family, we look forward to engaging with you in your tech endeavors in 2018 and beyond.

Modern digital businesses are delivered by real-time interactions among hundreds and thousands of services. When you order a Lyft or stream a Netflix movie, several services start interacting and coordinating with each other to fulfill your request. Given the importance of service interactions, the performance, reliability and health of these interactions become critical for every digital business.

Unsurprisingly, significant advances are underway to improve inter-service communication mechanisms and, more broadly, the entire communication fabric. HTTP/2 and gRPC are defining the next generation of highly efficient inter-service communications. Istio (Envoy) and Linkerd promise to overhaul and establish a robust fabric for service discovery, routing, failure handling, etc. among services.

In this post we will summarize the key advancements in HTTP/2, share an overview of gRPC, and then describe Netsil’s approach to monitoring the health and performance of HTTP/2 and gRPC interactions.

HTTP/2 Overview

1. Binary Framing and Compression: As opposed to the newline-delimited plain-text HTTP 1.x protocol, HTTP/2 employs binary encoding for frames. The binary encoding is much more compact, efficient to process, and easier to implement correctly. The structure of the binary-encoded frame is described in detail here.

In addition to binary encoding, HTTP/2 employs header compression to reduce the footprint of HTTP headers, which can grow up to kilobytes (think cookies) and are often repeated across requests and responses. HTTP/2 leverages a static Huffman code to compress literals. In addition to the compression, the client and server also maintain a table of frequently seen header fields and their compressed values, so when these fields are repeated they simply include a reference to the previously indexed value.
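As a rough illustration of that compression, the sketch below uses Go’s golang.org/x/net/http2/hpack package to encode the same (made-up) header set twice; the second pass comes out much smaller because the fields are served from the shared dynamic table instead of being sent as full literals.

```go
package main

import (
	"bytes"
	"fmt"

	"golang.org/x/net/http2/hpack"
)

func main() {
	var buf bytes.Buffer
	enc := hpack.NewEncoder(&buf)

	headers := []hpack.HeaderField{
		{Name: ":method", Value: "GET"},
		{Name: ":path", Value: "/catalogue/1234"},                 // made-up path
		{Name: "cookie", Value: "session=abcdef0123456789"},       // made-up cookie
	}

	// First request: the encoder emits Huffman-coded literals and adds
	// the fields to the dynamic table shared with the peer.
	for _, h := range headers {
		enc.WriteField(h)
	}
	fmt.Printf("first request:  %d bytes on the wire\n", buf.Len())

	// Repeated request: fields already in the dynamic table are sent
	// as short index references instead of full literals.
	buf.Reset()
	for _, h := range headers {
		enc.WriteField(h)
	}
	fmt.Printf("second request: %d bytes on the wire\n", buf.Len())
}
```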

2. Multiplexing: HTTP was initially a single request-and-response flow: the client had to wait for the response before issuing the next request. HTTP 1.1 introduced pipelining, where the client could send multiple requests without waiting for responses. However, the server was still required to send the responses in the order of the incoming requests. So HTTP 1.1 remained a FIFO queue and suffered from requests getting blocked behind high-latency requests at the front (referred to as head-of-line blocking).

HTTP/2 introduces fully asynchronous multiplexing of requests through the concept of streams. Clients and servers can both initiate multiple streams on a single underlying TCP connection. Yes, even the server can initiate a stream to transfer data that it anticipates will be required by the client. For example, when a client requests a web page, in addition to sending the HTML content the server can initiate a separate stream to transfer images or videos that it knows will be required to render the full page. The figure below shows multiple streams, 0 to 4, communicating on a single TCP connection.

Stream 0 is reserved for communicating connection control frames.

Streams 1 and 3 (odd-numbered) are initiated by the client.

Stream 2 (even-numbered) is initiated by the server.

TCP packets are illustrated as carrying content for multiple streams. Packet 1, for example, is transferring SETTINGS for Stream 0, HEADERS and DATA for Stream 1, and DATA for Stream 3. While streams are mostly independent, there are provisions to establish priority and dependencies across streams as well.
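A hedged sketch of what multiplexing looks like from the client’s point of view, using Go’s golang.org/x/net/http2 package: the three concurrent requests below (to placeholder URLs) are carried as separate streams over a single TCP+TLS connection rather than over three connections.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"

	"golang.org/x/net/http2"
)

func main() {
	// A client whose transport speaks HTTP/2; requests to the same host
	// are multiplexed as streams over one underlying connection.
	client := &http.Client{Transport: &http2.Transport{}}

	urls := []string{
		"https://example.com/",      // placeholder endpoints
		"https://example.com/page1",
		"https://example.com/page2",
	}

	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			resp, err := client.Get(u)
			if err != nil {
				fmt.Println(u, "error:", err)
				return
			}
			defer resp.Body.Close()
			// resp.Proto reports the negotiated protocol, e.g. "HTTP/2.0".
			fmt.Println(u, resp.Status, resp.Proto)
		}(u)
	}
	wg.Wait()
}
```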

3. Flow Control: A successful implementation of multiplexing requires flow control to avoid contention for underlying TCP resources and destructive behavior across streams. Rather than enforce a particular flow-control algorithm, HTTP/2 provides the building blocks for clients and servers to implement flow control suited to the specific situation.

Application-layer flow control allows the browser to fetch only a part of a particular resource, put the fetch on hold by reducing the stream flow control window down to zero, and then resume it later — e.g., fetch a preview or first scan of an image, display it and allow other high priority fetches to proceed, then resume the fetch once more critical resources have finished loading. (More details on flow control available at HTTP 2 Spec and O’Reilly High Performance Browser Networking)
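As a rough illustration of those building blocks, the sketch below uses the Framer from Go’s golang.org/x/net/http2 package to emit the two frame types that flow control is built on: SETTINGS (advertising an initial window) and WINDOW_UPDATE (granting more credit). The window sizes are arbitrary, and the frames are written to an in-memory buffer rather than a live connection.

```go
package main

import (
	"bytes"
	"fmt"

	"golang.org/x/net/http2"
)

func main() {
	var wire bytes.Buffer
	// A Framer writes/reads raw HTTP/2 frames; here we only write.
	fr := http2.NewFramer(&wire, nil)

	// Advertise a small initial per-stream window (SETTINGS frame).
	fr.WriteSettings(http2.Setting{
		ID:  http2.SettingInitialWindowSize,
		Val: 16 * 1024,
	})

	// Later, grant the peer more credit on stream 1 (WINDOW_UPDATE frame),
	// e.g. to resume a fetch that was paused by letting the window drain.
	fr.WriteWindowUpdate(1, 32*1024)

	// Connection-level credit is managed on stream 0.
	fr.WriteWindowUpdate(0, 64*1024)

	fmt.Printf("wrote %d bytes of SETTINGS and WINDOW_UPDATE frames\n", wire.Len())
}
```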

gRPC Overview

gRPC is rapidly gaining adoption as the next generation of inter-service communication, particularly in microservices architectures. gRPC leverages HTTP/2 underneath and as such benefits from many of the efficiencies described above. The practical benefits of gRPC have been captured elegantly in this blog post at grpc.io. Specifically, the following points are highlighted as reasons to choose gRPC:

Ability to auto-generate and publish SDKs as opposed to publishing the APIs for services.

Leverage server-side streaming from underlying HTTP2

Efficiency gains during serialization and deserialization by using protocol buffers as opposed to JSON

When you are building many services and establishing interactions among them using gRPC, it becomes critical to monitor the golden signals, i.e., latency, throughput and error rates, for the gRPC calls. At Netsil, we perform a deep analysis of both HTTP/2 and gRPC interactions. As a result you get complete visibility into the health of these critical service interactions as well as a clear understanding of the dependencies among services.
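As an illustration of what these golden signals look like at the call level, here is a minimal Go sketch of a gRPC unary client interceptor that records latency and status code per method; the target address is a placeholder. This is client-side instrumentation shown purely for clarity; Netsil gathers the same signals from the wire without code changes.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// goldenSignals is a unary client interceptor that records latency and the
// gRPC status code for every call, keyed by the full method name.
func goldenSignals(
	ctx context.Context,
	method string,
	req, reply interface{},
	cc *grpc.ClientConn,
	invoker grpc.UnaryInvoker,
	opts ...grpc.CallOption,
) error {
	start := time.Now()
	err := invoker(ctx, method, req, reply, cc, opts...)
	st, _ := status.FromError(err)
	log.Printf("method=%s code=%s latency=%v", method, st.Code(), time.Since(start))
	return err
}

func main() {
	// The target is a placeholder; a real service address would go here.
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithInsecure(),
		grpc.WithUnaryInterceptor(goldenSignals),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// Generated client stubs created from conn are now measured per call.
}
```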

Monitoring HTTP2 and gRPC Interactions

The health of gRPC and HTTP/2 interactions can be defined by the golden signals of latency, throughput and error rates. You can easily monitor the health of gRPC and HTTP/2 interactions without any code or container changes by using the Netsil Application Operations Center (AOC). All you need to do is download the Netsil collector and install one collector per host. The collector can be installed as a Docker container, as Kubernetes DaemonSet pods, or as regular processes.

The Netsil collectors will automatically start analyzing the gRPC and HTTP/2 interactions. They will generate detailed metrics for latency and throughput along with all the key attributes such as gRPC service method name, gRPC status message, status code, etc. Leveraging these metrics and attributes you can set up alerts to monitor the health of your service interactions. The dashboard below captures the latency and throughput of gRPC calls grouped by the method type and service method name.

gRPC Dashboard in the Netsil AOC

From the perspective of monitoring gRPC health, the following mapping of gRPC constructs to HTTP/2 headers is of significance (full mapping details).

gRPC Request:

Method → “:method POST”

Scheme → “:scheme ” (“http” / “https”)

Path → “:path” “/” Service-Name “/” {method name}

gRPC Response:

HTTP-Status → “:status 200”

Status → “grpc-status” 1*DIGIT ; 0–9

Status-Message → “grpc-message” Percent-Encoded

These attributes allow you to further analyze the interactions and build granular alerts and dashboards. For example, the chart below alerts on gRPC requests resulting in errors.

Monitoring gRPC Errors in the Netsil AOC

Conclusion

Service interactions are critical for your digital business. HTTP/2 and gRPC deliver significant performance and reliability improvements. As you adopt these newer communication mechanisms, you can leverage the Netsil AOC to monitor and alert on the health of gRPC and HTTP/2 interactions.

A fundamental challenge for the reliability of distributed systems is the ability to observe and understand dependencies among components. The blindness from not understanding service dependencies is costly:

Frustrating Root-cause Analysis: “The service looks fine; some other dependency is causing errors.”

A new category of products is emerging to address the observability challenges across services. These observability products generate live maps and traces which capture the dependency structure among services. Additionally, they capture the golden signals of monitoring service health — latency, throughput and error rates. In this post, we discuss Netsil Maps and OpenTracing for delivering complete visibility into the dependency structure and health of service interactions.

Netsil Maps

The Netsil Application Operations Center (AOC) delivers auto-discovered maps of Kubernetes services and their interactions. The Netsil maps can be created at multiple abstraction levels of Kubernetes clusters. For example, the picture below shows maps at (a) host, (b) namespace, and (c) pod level.

Multi-level Netsil Maps for Kubernetes Clusters

Along with the dependency structure, the Netsil maps also show the latency and throughput of service interactions. Deeper insights into any service interaction can be obtained by simply clicking on the link between the two services. For example, the picture below captures the complete profile of the HTTP interaction between the sock-shop/frontend and sock-shop/catalogue pods. The latency, throughput and error rates are presented grouped by insightful attributes such as URIs, request methods, and return status codes.

Service Interaction Health and Details

The Netsil AOC generates the service interaction map by performing deep analysis of packets. As a result, Netsil maps don’t require any code changes or container image changes to deliver complete visibility into the health of service interactions. The AOC can analyze and understand most of the common service protocols including gRPC, HTTP/2, HTTP, PostgreSQL, MySQL, DNS, Cassandra, Redis, and many more (full list here).

DevOps teams can easily deploy the Netsil collectors as DaemonSets and leverage the maps and metrics. One common use case of the maps is deployment management. Every deployment carries a significant risk of negative impact on other dependent services. The Netsil maps not only show you the dependencies but can also alert you to negative impacts such as latency increases, throughput drops, or increases in error rates. This way you can prevent bad deployments from hitting production and avoid costly downtime.

The Netsil maps capture individual segments of service interactions. One limitation is that causality information is lacking. The timestamps and call signatures (i.e., URI, MySQL query, etc.) provide heuristics to deduce causality. Since the communications are highly repetitive, causality may not be needed at the individual transaction level. Nevertheless, if granular, individual transaction-level tracing is crucial for your debugging needs, then OpenTracing is a good but laborious option.

OpenTracing

While Netsil employs a “black-box” approach to generating maps, OpenTracing employs what can be called a “white-box” approach. For OpenTracing (or distributed tracing in general), the application code:

creates spans

creates and sends the required context to subsequent calls for linking spans

establishes the causal link among spans.

For example, in the picture below, let’s say A, B, C, etc. represent the services associated with the respective spans. Service A creates SpanA as part of its request processing. Only service A knows that it is calling services B and C as part of fulfilling the ongoing request. So service A will need to create the required context and send it to services B and C, which can then link their respective spans as children of SpanA.
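A minimal sketch of what that application-side work looks like with the opentracing-go API; the service names and URLs are placeholders. Service A starts SpanA and injects its context into the outgoing request headers, and service B extracts that context to start a child span.

```go
package main

import (
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
)

// handleRequest sketches what service A has to do by hand: create SpanA,
// then inject its context into the outgoing call to service B so that B
// can create a child span.
func handleRequest(w http.ResponseWriter, r *http.Request) {
	tracer := opentracing.GlobalTracer()

	spanA := tracer.StartSpan("serviceA.handleRequest")
	defer spanA.Finish()

	// Outgoing call to service B (URL is a placeholder).
	req, _ := http.NewRequest("GET", "http://service-b.local/work", nil)

	// Propagate SpanA's context in the HTTP headers.
	tracer.Inject(
		spanA.Context(),
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header),
	)
	if resp, err := http.DefaultClient.Do(req); err == nil {
		resp.Body.Close()
	}
}

// serviceBHandler shows the receiving side: extract the propagated context
// and start a span that is a child of SpanA.
func serviceBHandler(w http.ResponseWriter, r *http.Request) {
	tracer := opentracing.GlobalTracer()
	parentCtx, _ := tracer.Extract(
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(r.Header),
	)
	spanB := tracer.StartSpan("serviceB.work", opentracing.ChildOf(parentCtx))
	defer spanB.Finish()
}

func main() {
	http.HandleFunc("/", handleRequest)
	http.HandleFunc("/work", serviceBHandler)
	http.ListenAndServe(":8080", nil)
}
```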

It is safe to say that only the application has the full context available to reliably establish causality. So for all practical purposes, in order for tracing to work, you have to add tracing code to the application. This is a pretty laborious undertaking, especially considering the vast amount of code that is already written, which would be extremely hard to change just for inserting traces. Additionally, there is a lot of third-party software such as caches, databases, payment processing, inventory management, etc. for which it might be impossible to insert tracing code. Every service that doesn’t handle the tracing becomes a blind spot and a termination point for the trace. Ironically, the seminal Google Dapper paper, which has inspired many of the distributed tracing efforts, already warned of the brittle nature of relying on code changes for distributed tracing:

“Application-level transparency: programmers should not need to be aware of the tracing system. A tracing infrastructure that relies on active collaboration from application-level developers in order to function becomes extremely fragile, and is often broken due to instrumentation bugs or omissions, therefore violating the ubiquity requirement. This is especially important in a fast-paced development environment such as ours.” — Google Dapper Paper

Another important challenge with tracing is the underlying protocols and the “room” required for propagating the span context. While protocols such as HTTP provide support for custom headers, there are a lot of protocols which don’t have any room for passing additional context in headers. Modern communication channels such as gRPC, Thrift, etc. all have good support for OpenTracing, but in the absence of these channels, the constraints of the underlying carrier protocol become a challenge for propagating context across services.

Conclusion

Observing and monitoring service dependencies is critical for reliability and performance of microservices applications (and in general for any distributed application). With Netsil you get maps without any hard work and you get to understand the structure of the communications and transaction flows. Netsil maps and metrics can greatly help the DevOps teams with deployment management, incident response, root-cause analysis and capacity planning.

If transaction-level granularity is required, then you should consider a disciplined approach to adopting and instrumenting tracing across all the services. With tracing, the burden is on the development teams to properly handle traces, to ensure the underlying communication protocols can carry the context for traces, and to have scalable analytics available to query vast amounts of traces for meaningful insights. While tracing efforts might take months to bear fruit, you can deliver observability in minutes using Netsil for your Kubernetes clusters.

Good To Read

Google Dapper Paper (At Google, the “uniformity” in the use of thread libraries, common RPC frameworks, etc. was a key instrumentation point for adding tracing without requiring every app-dev team to add traces.)

The article was originally published on the Netsil blog.

In Part 1, we covered a high-level overview, differences and use cases for Openshift, Tectonic and vanilla Kubernetes. In this post we will take a deep dive and evaluate the following aspects in greater detail:

Supported Environments

Storage

Networking

Ease of Operations

Application Ecosystem

Openshift vs Tectonic vs vanilla Kubernetes

Supported Environments

Vanilla Kubernetes has a lot of installation options for various environments.

Minikube is a single node cluster available for local testing and development.

Kubeadm is another tool which makes it easy to install Kubernetes on Linux VMs running Ubuntu 16.04+ or CentOS 7.

Tectonic has GUI installers for AWS and bare-metal platforms. There are also Terraform installers available for AWS, bare metal, Azure (alpha), VMware (pre-alpha) and OpenStack (pre-alpha). Support for other clouds is not specified.

Openshift can be installed in two ways: via RPM or via containerized components. Ansible scripts are also provided, which allow automated installation and can be tuned as required.

Openshift Origin has documented support for AWS, OpenStack, GCE and Azure. Minishift is a small Openshift installation on a single VM which allows a quick installation and is useful for anyone to do local testing.

Storage

Vanilla Kubernetes provides persistent volume support for storage backends such as NFS, iSCSI, Fibre Channel, GlusterFS, AzureDiskVolume, AWS EBS, GCE Persistent Disks, OpenStack Cinder, Ceph RBD, vSphere Volume, OpenEBS, Quobyte, Portworx and ScaleIO. As vanilla Kubernetes can be installed across multiple hardware types, the range of supported storage is also quite wide. Vanilla Kubernetes also provides an abstraction in the form of Storage Classes, which hides the storage complexity from the user.

Since Tectonic provides additional features on top of vanilla Kubernetes, all of the above storage options are supported. Openshift has documented support for most of the above, but support for third-party storage solutions such as vSphere Volume, Quobyte, Portworx and ScaleIO is not mentioned.

Some of the above options are bound to their respective clouds, whereas others are open source as well as cloud-agnostic (such as NFS, GlusterFS, etc.). Using cloud-based storage is a good fit for scenarios where all deployments of the product will be on the same cloud, as there is less maintenance overhead. For a product which may have to be deployed on any cloud or on-premises, a cloud-agnostic solution can be a better fit.

Networking

Again, vanilla Kubernetes is the most flexible, as it supports networking plugins such as Cilium, Contiv, Contrail, Flannel, GCE, direct L2 networking (experimental), Nuage VCS, Open vSwitch, Open Virtual Routing, Calico, Romana, Weave and CNI-Genie, and users can choose based on their requirements.

An interesting project in the networking area is the Container Network Interface (CNI). Managed by the CNCF, CNI is designed to be a minimal spec concerned only with configuring network interfaces within Linux containers and removing the networking resources when the container itself is removed. This creates a uniform standard against which various networking plugins can be created.

Kubernetes, Tectonic and Openshift are listed as container runtime adopters of CNI, which makes it easy to swap various networking plugins that follow the CNI specification.

Ease of Operations

In this section, we shall look at two major operational activities: upgrades and RBAC implementation.

Upgrading vanilla Kubernetes depends on the method of installation (direct / kubeadm / hyperkube / juju-charms among others), and hence there are multiple ways to upgrade a cluster, which can get confusing. In addition, there aren’t any well-documented, generic upgrade procedures that are readily available. Most Kubernetes upgrade guides allow for upgrades within a major version number.

Tectonic provides an experimental automatic upgrade of Kubernetes components within its cluster which can be fully automatic or approval based. While minor upgrades can be done seamlessly, upgrades between releases (such as 1.5.x to 1.6.x) are still a work in progress. Also, a clear procedure for manual upgrades is missing.

Openshift provides the ability to automate upgrades via Ansible playbooks in addition to a well-defined manual upgrade process. The upgrades do, however, need to be performed in sequence.

Vanilla Kubernetes 1.6 has a new RBAC feature which can be enabled by passing the --authorization-mode=RBAC flag to kube-apiserver. There are two types of roles: regular Roles are scoped to a namespace, while ClusterRoles are scoped to the entire cluster. A RoleBinding resource is used to bind roles to subjects, which can be users, groups or service accounts.

Tectonic has an RBAC implementation similar to vanilla Kubernetes, but it also adds audit logging capability so that the audit logs can be streamed to log aggregation backends, which is a requirement in certain industries.

Application Ecosystem

Helm is a package manager for Kubernetes which allows deployment of pre-configured Kubernetes resources (called Charts). The Helm charts repository contains a lot of applications packaged as charts which can be easily deployed on a Kubernetes cluster. This makes it easy to manage and update applications rather than handling individual resources. In addition to the charts available in the Helm repository, users can create charts for their own applications and use them as a way of distributing their application. The Helm client can receive charts from source repositories, zip archives, as well as directories. Both vanilla Kubernetes and Tectonic allow applications to be deployed via Helm charts.

Openshift has a similar concept called templates, which are used to deploy a list of parameterized objects on an Openshift cluster. Openshift also maintains a library of curated templates similar to Helm charts. Users can also write their own templates and upload them to their cluster for further deployment.

Conclusion

Comparing Openshift, Tectonic and vanilla Kubernetes, we see that in terms of handling storage they are almost at par, with each supporting a wide range of storage backends. In terms of networking, vanilla Kubernetes provides the widest variety of plugins, whereas Tectonic and Openshift support relatively fewer plugins.

Looking at upgrades of Kubernetes itself, Tectonic provides an automated way to upgrade between minor versions, while Openshift provides scripts for performing upgrades from one version to the next and requires upgrades to be handled sequentially. Vanilla Kubernetes still needs a lot of clarity on the upgrade procedure.

Helm Charts are a good way of packaging applications for vanilla Kubernetes and Tectonic, and they are agnostic of the underlying layers. Openshift, on the other hand, has its own mechanism of templates, which may not be as portable.

NETSIL launched today from stealth by unveiling the Application Operations Center (AOC), a universal observability and monitoring platform for modern cloud applications. With the AOC, Netsil enables DevOps teams to gain complete visibility into all the services and their dependencies with absolutely no code changes required. As a result, DevOps teams are able to reduce downtime, ensure safer deployments and meet their service level objectives (SLOs).

Observability Challenge

Every digital business is powered by hundreds of services and thousands of service interactions. Yet, DevOps teams, responsible for uptime and performance of applications, cannot see the services and the service dependencies. This critical blindness incurs a huge financial cost from:

Prolonged Outages: “The service looks fine; some other dependency is causing errors.”

To quote Adrian Cockcroft, an expert on cloud architectures, “flow visualization is a big challenge.” With the shift to Kubernetes and Docker-based microservices, the blindness is worsening as more services and interactions become part of applications. The Netsil AOC squarely addresses this blindness problem by delivering auto-discovered, real-time maps capturing all the services and their dependencies.

Observability Challenges of Microservices Applications

The Netsil Application Operations Center (AOC)

Netsil AOC — Observability and Monitoring Platform

The Netsil AOC can be thought of as “Google Maps for Cloud Apps”. The AOC generates maps which automatically discover every Docker container, Kubernetes pod, host, and service endpoint, along with all the interactions among them. The maps also capture key service health metrics of latency, throughput and error rates for API calls, database queries, DNS lookups and several other service interactions. Using the Netsil maps, DevOps teams can:

Reduce Downtime — by quickly identifying root cause using dependencies on the map

Deliver on SLOs — by monitoring and addressing the latency and errors of services that impact end users

Limetray is a rapidly growing company that is addressing the marketing and operational challenges of the restaurant industry. On the tech side, we have a modern stack running on Kubernetes, Docker, and AWS. As we add more features and services for our customers, one of the biggest challenges has been understanding service dependencies. Netsil maps and metrics have been instrumental for us to understand transaction flows and quickly identify root causes of latency and errors before they impact our customers. We love the simplicity of Netsil’s approach where we just drop an agent and are able to see everything. Equally impressive has been the responsive support and close engagement with Netsil team. If you are running Kubernetes stack, then Netsil is essential for real-time visibility into all the services and their dependencies. — Sooraj Elamana, AVP Engineering at Limetray.

“The proliferation of microservices makes it harder to monitor and debug applications, and it’s becoming increasingly difficult for operation teams to have adequate visibility into the systems they manage. A deeper level of visibility and control is needed, and I’m excited about Netsil’s introduction of the application control center. With the AOC, operations teams finally have the tools they need to be successful for managing the ever-growing complexity of today’s and tomorrow’s systems.” — Alex Ethier, Chief Product Officer at SourceClear, ex-VP Product at Chef.

“As enterprises speed up software development using DevOps and microservices, release management becomes a critical need. CI/CD pipeline tools, Kubernetes and Docker enable enterprises to quickly deploy changes to production environments. However, without a complete understanding of service dependencies, making changes to a production environment is inherently risky. This is where Netsil’s ability to see every container, pod, host and their dependencies is indispensable. Using Netsil’s maps and service metrics, DevOps can evaluate changes in a complete application context, ensure safer deployments, and achieve agility without compromising reliability.” — Steve Hendrick, Research Director at Enterprise Management Associates.

Network As The Vantage Point for Microservices Observability

A key benefit of the AOC is that it does not require any code change to generate maps and metrics. Netsil (listen spelled backward), “listens” to service interactions and conducts a real-time analysis of packets to obtain deep application insights. As a result, Netsil observes everything that “hits the wire” including calls to external services such as AWS RDS, AWS DynamoDB, API calls to Google Maps, Salesforce, Stripe, Twilio, etc.

Using network as the vantage point, the Netsil AOC can observe and monitor across generations of applications. In particular, Netsil is especially powerful for the Kubernetes- and Docker-based microservices applications. For Kubernetes clusters, DevOps teams can visualize their applications at multiple levels by creating maps of hosts, namespaces, services, and pods. From the application maps, they can drill down and quickly diagnose a range of complex issues such as service configuration (e.g. Kubernetes DNS errors), service reachability issues (e.g. HTTP errors) and service creation problems (e.g. pod scheduling errors). The Netsil AOC delivers all of these features and capabilities with the simplicity of installing just one collector agent per node.

Incumbent APM providers such as AppDynamics and Dynatrace also deliver application maps that capture transactions and service dependencies. In contrast to Netsil’s code-agnostic approach, APM products inflexibly depend on programming languages since they rely on code-instrumentation techniques. With APM, each service written in an unsupported programming language becomes a blind spot for DevOps teams. Moreover, there is a wide range of critical services such as databases, load balancers, service discovery and DNS, that are impractical to instrument using APM. All such services become blind spots for operations teams. Built for the monolithic Java and .NET era, the APM techniques are a liability for the polyglot, fast-changing world of public clouds and containers. Netsil’s auto-discovered maps and deep analytics elegantly address these challenges without relying on code changes or instrumentation.

About Netsil

Netsil’s patent-pending approach is the combined result of years of research at the University of Pennsylvania and the operational experience of founders at Google and Twitter. Netsil’s seed round of funding included participation from Mayfield Fund, Engineering Capital, Moment Ventures and other marquee Silicon Valley investors.

When you consider the chaos in application space with new programming languages, abstractions and frameworks, the network emerges as a natural, stable vantage point to observe and monitor modern cloud applications. Netsil’s network-centric approach is future-proof across generations of applications. So, whether it is Kubernetes and Docker today or Lambda functions tomorrow, the Netsil AOC will observe and monitor them without requiring any code changes. — Harjot Gill, Netsil CEO and Co-founder.

With the continued rise of Kubernetes and Docker, we are seeing a secular shift towards microservices architectures. This shift is exposing new challenges and creating opportunities to rethink entire category of products such as APM. We are very excited about Netsil’s radically innovative approach that delivers pervasive observability for DevOps with immense simplicity and ease of use. We’re excited to partner with Harjot, Shariq and the Netsil team on an exciting journey. — Ursheet Parikh, Partner at Mayfield Fund and one of the lead investors in Netsil.

The lack of visibility into service dependencies is a big challenge and is getting worse with microservices and heavy usage of external SaaS services. Netsil addresses this critical operational blindness in a non-intrusive manner without requiring changes to code or containers. You simply drop an agent on the host and start seeing everything. This combination of power and simplicity is incredible. — Gokul Rajaram, Product Engineering Lead at Square, the “Godfather of Google AdSense”, and an early investor in Netsil.

The Netsil AOC is available now in both SaaS and self-hosted deployment models. The AOC works on all clouds (AWS, Microsoft Azure, Google Cloud Platform, VMware) and container environments (Kubernetes, Mesos, Docker). Get Started Free With Netsil and take control of your modern cloud applications today.

We had the opportunity to sit down with Nathaniel Felsen, DevOps Engineer at Medium and the author of “Effective DevOps with AWS”. We are happy to share some practical insights from Nathaniel’s extensive experience as a seasoned DevOps and SRE practitioner.

While we hear a lot about these experiences from Google, Netflix, etc., we wanted to gather perspectives on DevOps and SRE life with other easily relatable companies. From tech-stack challenges to organization structure, Nathaniel provides a wide range of practical insights that we hope will be valuable in improving DevOps practices at your organization.

How is Site Reliability Engineering (SRE) practiced at Medium?

Medium takes a slightly different approach to SRE. We have split the responsibility into two groups:

There is a DevOps team responsible for automation, deployment management, understanding impact of deployments and improving collaboration across teams.

There is a separate group called “the Watch”, which is the on-call rotation team. The Watch is made up of 2 product developers at a time, rotating every 2 weeks, who mostly handle day-to-day operations but more specifically respond to pages, triage bugs, handle deployment issues / rollbacks, etc. As a developer, you go on the Watch every 4 months.

My time is divided across both these teams and I get to see both worlds quite closely. While I lead the Watch team, I also invest significant time doing DevOps engineering including creating build pipelines, automating aspects such as monitoring, and providing feedback on production readiness when a service is ready to launch.

Give us a flavor of your tech-stack?

All AWS and containers using ECS. We started with a monolithic application and still have that. But in order to build newer services faster, we are using containers and adding them as services. So, we have monolith as well as quite a few containerized services. We heavily use Node.js but it becomes harder to manage Node.js as the application becomes big. Many new services are being written in Go and we are experimenting with React as the new language for our front-ends. Other than that, we leverage all the usual suspects from AWS — Auto-scaling, ELB, SQS, Kinesis, Lambda, DynamoDB, etc.

What would you say are the top challenges for the Watch team?

In distributed architectures including microservices, it is hard to address the bottlenecks. For example, we use a lot of queues in the backend for asynchronous processing. We do a lot of asynchronous processing for our recommendation engine, which tends to increase the number of messages added to SQS queues. To address this, we add more “queue consumers”. But then we hit limits on how fast we can write into DynamoDB. Our primary challenges are around understanding the dependencies, identifying bottlenecks and addressing them for the short and long term.

Share with us a “house on fire” war story along with the key learning?

This incident is a great example of how features that you build down the line might not play well with your database structure from the past. The White House took the initiative to publish the script of the State of the Union. Naturally, this was going to be a very popular Medium post. Usually, our auto-scaling capabilities are very well established to handle such traffic spikes. However, we had recently launched a new feature for “highlighting and sharing” text. So, naturally, along with huge readership came heavy usage of the highlight feature. Through a sequence of dependencies, the highlight feature eventually ends up invoking a service that does a write call to DynamoDB. We were sharding the content based on post-id, which had worked fine for that table until now. The highlight feature, though, swamped our table because all the writes were on the same post-id and hence on the same shard!

As I had said earlier, the key learning here is to be able to visualize and understand the intricate dependencies in modern applications.

What attracted you to Netsil?

Since we have a bunch of services and microservices, understanding dependencies is a common, critical task. We were doing tcpdump and putting that into Wireshark. But when I heard about Netsil, I found that Netsil could do the dependency analysis for us. Netsil would auto-discover API communications and automatically produce a nice graphical version (maps) of what we were doing with tcpdump + Wireshark.

With Netsil’s auto-discovered maps we are able to identify dependency chains such as “The Monolith → Queue → other services → HAProxy & ELB → Social Service → Graph database”. Netsil also gives us insights into latencies and throughput for these API calls, which helps us identify the hotspots in our dependency chains. If you have a modern microservices application, then Netsil is great for monitoring and tracing the dependencies.

Another use case for us was to do build comparisons and catch code deployment issues. Our entire deployment pipeline is automated, allowing us to deploy new services dozens of times a day. In order to identify regressions caused by bugs, we use tools such as Netsil to analyze the HTTP status codes (e.g., 500/400 errors) of new builds by exposing the build ID in the HTTP header. Thanks to that system, we are able to prevent bugs from making their way to production and have our Watch team analyze and file bugs for these issues.

We’ve been implementing a request tracing service for over a year and it’s not complete yet. The challenge with these types of tools is that we need to add code around each span to truly understand what’s happening during the lifetime of our requests. The frustrating part is that if the code is not instrumented or the header is not carrying the ID, that code becomes a risky blind spot for operations.

What would be your tips for fellow DevOps Engineers & SREs?

To check out my book! :-)

Measuring everything doesn’t mean alerting on everything. Whenever you are investigating an issue, having all the data you need is critical to get to the bottom of it. You don’t want to spend the first 20 minutes of an outage trying to gather information about what’s going on. You want service alerts to be important, timely and actionable. You don’t want the on-call engineer to suffer from alert fatigue and constantly see warnings (or, even worse, get paged) for issues they can’t fix or that don’t matter (for example, issues with an internal reporting system may not require waking up the on-call engineer at 3am). For web applications, for example, you can usually focus on top-level metrics such as latency and error rate and rely on your dashboards inside Netsil and other monitoring tools to tell you why those metrics are higher than expected.

With respect to alerting, some of the common questions that need to be answered are: Was the page justified/avoidable? Was there proper documentation? Can something be done to prevent that issue from happening again? After each important incident, review what happened and, whenever possible, create a post-mortem. Include information like the timeline, the root cause, top-level metrics such as mean time to detect and mean time to recover, and mention what went well and what could be improved.

Conclusion

On behalf of the Netsil team and all our readers, our sincere thanks to Nathaniel @ Medium for sharing practical insights on DevOps and SRE life. We look forward to learning more about Nathaniel’s experiences in the upcoming book: “Effective DevOps with AWS”.

HTTP API calls are the backbone of modern cloud applications, especially Kubernetes-based microservices applications. Yet, very little is done to understand the health of HTTP communications. Other than for services connected to the load balancer, it has been rather difficult to measure the key performance indicators (KPIs) of latency, throughput and error rates for HTTP calls.

The Netsil Application Operations Center (AOC) captures and analyzes service interactions to deliver the complete picture of HTTP service health. The AOC does a deep analysis of application-level protocols such as HTTP and gathers all the KPIs along with HTTP attributes (URIs, status codes, etc.). In this tutorial, we will provide a step-by-step guide to using the various HTTP datasources and to grouping and filtering the HTTP data based on HTTP attributes.

This blog is meant as a follow-along tutorial; all you need is a Kubernetes cluster and kubectl. You can easily set up the sock-shop app and the Netsil AOC.

Topics Covered

Defining HTTP Latency, Throughput and Error Rates

Comparing Latency of HTTP Success and Errors

Setup

We will be using the sock-shop app running on a Kubernetes cluster as our target application for mapping and monitoring. The AOC is installed as a pod and the collectors are installed as DaemonSet pods on each of the Kubernetes worker nodes (see figure below). You can easily get this setup going in your Kubernetes cluster using our installer.

Netsil AOC Setup to Monitor HTTP Services in Kubernetes Cluster

What HTTP Service to Monitor?

Your application probably has a lot of HTTP services. The Netsil maps help you understand the dependencies among services and pick HTTP calls that you should monitor. From the Maps blog, we have the following picture of HTTP interactions in the sock-shop app.

We will pick the HTTP communication between front-end and catalogue for this tutorial (see figure below).

Using Maps to Understand Service Dependencies and Select Services to Monitor

Getting A List of the HTTP Interactions

There might be multiple HTTP calls going on between the front-end and catalogue pods. We can understand these calls by using the AOC Analytics Sandbox. All we need to do is select the client and server pod names and group by http.uri. Easy!

1. From the left navigation box, select Analytics Sandbox

2. Select http.request_response.count as the Datasource

3. Select count as the Aggregation function to apply

4. Set http.uri as the GroupBy

5. Now let’s set the Filters so that we restrict the client and server to specific pods

pod_name(client) : sock-shop/front-end...

pod_name(server) : sock-shop/catalogue...

6. Change the chart type to Bar

Using Netsil Analytics to Get List of HTTP Interactions Between Specific Pods

We can see the HTTP URI associated with the communication between front-end and catalogue. As expected, the calls are for URIs of the form /catalogue/<catalogue_id>.

Defining HTTP Avg Latency

We will define the HTTP Avg Latency for the calls to URI: /catalogue.*. Additionally, we will restrict the measurement to GET requests coming from front-end to catalogue.

From the left navigation box, select Analytics Sandbox

Select http.request_response.latency as the Datasource

Select avg as the Aggregation function to apply

Now, let’s set the Filters so that we restrict the metrics to the specific http interaction of interest.

And we have the chart measuring the latency of the front-end to catalogue HTTP interaction! We selected the HTTP latency datasource, then applied the client/server filters and restricted the metrics to the specific URI, /catalogue.*, and the GET request method.

All this was made easy because Netsil gathers the HTTP metrics along with all the key attributes such as URI, request method, etc. automatically from analyzing service interactions.

Measuring HTTP Latency Using Netsil Analytics

Defining HTTP Throughput

This is very similar to defining the latency. All we need to do is change the datasource from http.request_response.latency to http.request_response.throughput. Below we have repeated the steps and highlighted them in the resulting chart.

From the left navigation box, select Analytics Sandbox

Select http.request_response.throughput as the Datasource

Select throughput as the Aggregation function to apply

Now, let’s set the Filters so that we restrict the metrics to the specific http interaction of interest.

pod_name(client) : sock-shop/front-end...

pod_name(server) : sock-shop/catalogue...

http.uri : /catalogue.* (regex)

http.request_method : GET

Measuring HTTP Throughput Using Netsil Analytics

Defining HTTP Error Rates

For simplicity, let’s focus on the HTTP 5xx and 4xx errors (e.g., status codes 500, 404, etc.). The error rate is then defined as:

(Throughput of HTTP 5xx or 4xx requests) / (Total Throughput) * 100

Continuing from the previous section, we have already defined the overall throughput. Below is the screenshot of that query. Note the query statement name A; A represents the total throughput. We will see how to use this name and combine queries to generate the error rate. We will create another query statement and use filters to restrict the throughput metrics to HTTP 5xx and 4xx status codes.

Using HTTP Status Codes to Obtain Throughput of HTTP Errors

Create another query statement by clicking the + METRIC button. Note that this creates a new statement named B.

Select http.request_response.throughput as the Datasource

Select throughput as the Aggregation function to apply

Now, let’s set the Filters so that we restrict the metrics to the specific http interaction of interest.

pod_name(client) : sock-shop/front-end...

pod_name(server) : sock-shop/catalogue...

http.uri : /catalogue.* (regex)

http.request_method : GET

http.status_code : (4\d\d|5\d\d) (regex) [We filter on the status code and select only those requests that are returning 4xx or 5xx errors.]

Query statement B has the throughput of the 4xx and 5xx errors. Next we will use the EXPRESSION feature to combine the two statements and obtain the error rate, i.e., (B/A)*100.

Measuring HTTP Error Rates Using Netsil Analytics

Create an expression statement by clicking the +EXPRESSION button

Select Eval as the operator to combine queries using arithmetic

Now simply use $ followed by the query statement name to reference the results of the query statements, and write the appropriate mathematical formula; in this case, ($B/$A)*100.

We now have the error rates. We created two query statements and combined them to obtain the error rate.
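As a sanity check on the arithmetic, here is a tiny self-contained Python sketch of the same error-rate definition. The status codes are made-up sample data; responses matching the (4\d\d|5\d\d) filter play the role of statement B, all responses play the role of statement A, and the Eval expression is ($B/$A)*100.

```python
import re

# Sample status codes standing in for one time bucket of responses.
codes = [200, 200, 404, 200, 500, 200, 200, 200, 200, 200]

# Query statement B: throughput of responses matching the (4\d\d|5\d\d) regex filter.
is_error = lambda code: re.fullmatch(r"4\d\d|5\d\d", str(code)) is not None
b = sum(is_error(c) for c in codes)

# Query statement A: total throughput.
a = len(codes)

# The Eval expression ($B/$A)*100.
print(f"error rate = {b / a * 100:.1f}%")   # 20.0% for this sample
```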

Comparing Latency of HTTP Errors and Success

If an HTTP service is failing, it had better fail fast. Otherwise end users not only end up waiting longer but are ultimately frustrated to receive HTTP errors. A good approach to measure this is the ratio Avg Latency of HTTP Errors / Avg Latency of HTTP Success.

Let’s learn how to define this metric in Netsil.

From the left navigation box, select Analytics Sandbox

Select http.request_response.latency as the Datasource

Select avg as the Aggregation function to apply

Now, let’s set the Filters so that we restrict the metrics to the specific http interaction of interest.

Note the query statement name B. This query statement returns the average latency of HTTP requests resulting in success. Now we just need to calculate A/B to get the ratio comparing the latency of errors and successes.

Create an expression statement by clicking the +EXPRESSION button

Select Eval as the operator to combine queries using arithmetic

Now simply use $ followed by the query statement name to reference the results of the query statements, and write the appropriate mathematical formula; in this case, ($A/$B). Note that the Eval statement name is C.

The plot of C reveals that the latency of error requests is a small fraction of that of successful requests. This is how it should be! As mentioned earlier, this is a good metric to track and alert on, as it greatly impacts the end-user experience.

Ratio of Latency of HTTP Errors & HTTP Success
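The same Eval arithmetic can be sketched outside the AOC. The latencies below are made-up numbers, purely to illustrate why a ratio well under 1.0 means errors are failing fast relative to successful requests.

```python
# Query statement A: average latency of requests that returned 4xx/5xx (ms, sample data).
error_latencies = [3.1, 2.8, 3.5]

# Query statement B: average latency of successful requests (ms, sample data).
success_latencies = [12.0, 15.4, 11.2, 13.8]

avg = lambda xs: sum(xs) / len(xs)

# The Eval expression ($A/$B): well below 1.0 means errors return faster than successes.
print(round(avg(error_latencies) / avg(success_latencies), 2))   # ~0.24 for this sample
```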

Conclusion

Monitoring the health of HTTP API calls is critical to ensure the reliability of modern microservices applications. Latency, error rates and throughput are key health indicators for HTTP calls. There is a need to understand these calls and monitor them along multiple attributes such as client id, server id, status codes, URI patterns, etc.

The Netsil Application Operations Center (AOC) provides deep insights into HTTP API health by performing real-time analysis of service interactions. By leveraging Netsil, operations teams can get complete visibility into the health and performance of their HTTP APIs. You can get valuable insights into your API health right away by using Netsil in your Kubernetes cluster.

Microservices applications are an intricate web of service interactions. All transactions are fulfilled through sequences of API and DB calls that span multiple services. It is absolutely critical to understand the service dependencies in microservices applications such as those running on Kubernetes. This is where application maps, which capture all the services and their dependencies in real time, come into play.

In a previous blog post, we defined application maps and compared various techniques for generating them. The Netsil Application Operations Center (AOC) generates auto-discovered application maps by analyzing service interactions, without requiring any code changes. Users can visualize and understand their applications from multiple perspectives by using the AOC-generated maps. For example, the maps below show a Kubernetes cluster at the host, namespace and pod levels.

This blog is intended as a walk-through tutorial that can help you to create maps for your Kubernetes clusters. We also highlight specific use cases for leveraging the dependency chains in maps to help with incident response and production deployments.

Kubernetes Maps at Host, Namespace and Pod Levels

Setup

We will be using the sock-shop app running on a Kubernetes cluster as our target application for mapping and monitoring. The AOC is installed as a pod, and the collectors are installed as a DaemonSet, with one pod on each of the Kubernetes worker nodes (see figure below).

You can easily get this setup going in your Kubernetes cluster within minutes by downloading our installer and using this documentation.

Netsil AOC Setup to Map Kubernetes Cluster

Discovering Your Application Using Default Map

Once the AOC and collectors are installed, log in to the webapp and switch to the Map Sandbox. This will load the Default Map.

The Default Map uses an internal algorithm (AutoGroup) to identify services based on the protocol and protocol attributes such as HTTP URIs, DB queries, etc. For example, in the picture below you can see the auto-discovered HTTP, DNS and MySQL services.

Netsil AOC Auto-discovered Service Map

The zoom and pan features of the map help you move around and visualize the discovered services. You can also search for specific services; the picture below shows a search for MySQL services in the application. In addition to discovering services, the default map also captures dependencies and key performance metrics (latency and throughput) for services.

Creating Your First Map

The default map is great to quickly get started and get visibility into all the services making up your application. But if you are responsible for a specific subset of services, you can create a map containing just the right set of services.

Let's say we are responsible for the sock-shop application. We can create a map for sock-shop consisting of all its pods and their dependencies. We will use the Filters and GroupBy features to customize the map. Netsil automatically collects Kubernetes metadata such as pod names, namespaces, service names, etc., so all we need to do is select the right grouping and apply the right filters.

Creating Custom Maps Using Filters and Groups

Start with the Map Sandbox in the left navigation. This will load the default map.

Apply a filter to restrict pods to the sock-shop namespace. Use the tags.kube_namespace attribute and set it to the sock-shop namespace.

Since we want the sock-shop map to be at the pod level, change the grouping criterion from AutoGroups to pod_name

Name the map and save it. That's it, we are done!

The figure below displays the sock-shop map at the pod-level.

Pod-level Map of Kubernetes

Summary Action Items:
* Load a Default Map from the Map Sandbox
* Change the GroupBy from AutoGroup to pod_name
* Apply a filter using the tags.kube_namespace attribute
* Name and save the map
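The map relies on Kubernetes metadata that Netsil collects automatically. If you want to see the same namespace and pod-name attributes yourself, here is a small sketch using the official Kubernetes Python client (assuming you have kubectl access to the cluster); the printed values are exactly what the tags.kube_namespace filter and pod_name grouping operate on.

```python
from kubernetes import client, config

# Load the same credentials kubectl uses (~/.kube/config).
config.load_kube_config()
v1 = client.CoreV1Api()

# Namespace and pod name: the attributes used by the map's Filter and GroupBy.
for pod in v1.list_namespaced_pod("sock-shop").items:
    print(pod.metadata.namespace, pod.metadata.name)
```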

Understanding Impact of Deployment

The rate of production deployments has increased significantly as a result of DevOps and microservices. Unfortunately, deployment and code changes are also among the top causes for production issues. By using application maps you can evaluate the impact of deployments in the complete application context and prevent costly incidents.

Let's use a concrete example of updating the shipping pod in our sock-shop application. In the figure below, (a) shows the sock-shop map before and (b) shows the same map after deploying a new shipping pod image. Even though the throughput, shown in requests per second (rps), remains roughly the same, there is an almost 2x jump in latency across all the pods in the dependency chain front-end -> orders -> shipping. This is a good indicator to take a second look at the changes made to the shipping pod before it hits production!

Using Maps to Understand Impact of Deployments

Summary Action Items:
* Use the dependency chains in Netsil maps to evaluate the impact of deployments on other dependent services
* Ensure there are no performance drifts before the deployment hits production

Accelerating Root Cause Analysis in Dependency Chains

Another natural use case for maps is to expedite root cause analysis. Let's say we are monitoring the latency of the front-end pod, since that is the service exposed to end users. We have set an alert on a spike in front-end latency, and the pager goes off. In the figure below, we can compare the before and after maps.

A quick scan of the dependencies reveals a spike in latency on the dependency chain leading up to the catalogue items database service. If the metrics on other dependencies look normal, then the catalogue database service seems like a good candidate to diagnose further. A very promising candidate for the root cause is revealed promptly using the maps. In the absence of maps, such analysis would involve manual correlation or chasing tcpdump across multiple machines. Netsil maps greatly accelerate root cause analysis, thereby saving time and money and, best of all, delivering extra sleep for your on-call teams!

Using Maps to Accelerate Incident Response

Conclusion

Microservices applications heavily utilize service interactions (API calls, DB queries, DNS lookups, etc.) to fulfill transactions. Additionally, due to freedom of parallel development, microservices applications change very frequently. These characteristics greatly complicate root-cause analysis during incidents and make it difficult to evaluate the impact of deployments.

As beautiful as Kubernetes is, it often feels like voodoo magic in how it does things. But for operations teams planning to run Kubernetes in production, "it's magical" isn't a very confidence-inspiring answer. As a layer immediately above the infrastructure, Kubernetes orchestrates several critical functions such as scheduling, deployment management, service discovery, service routing and much more. So, it is critical for operations teams to have a solid understanding of this framework and its inner workings.

Fortunately, Kubernetes is a quintessential modern system that relies heavily on API calls among its components. By observing and analyzing these API calls, we can learn a lot about systems such as Kubernetes, without requiring any code changes or an in-depth Ph.D. in the system. Netsil employs precisely this technique of analyzing packets from service interactions. We will leverage Netsil to understand the inner workings of Kubernetes. In this blog post, we will examine the call flows for namespace, pod and service creation.

An important broader point worth highlighting is that, just like Kubernetes, there are many 3rd-party, OSS and external systems that are heavily used in modern cloud applications. Whether it is OSS such as Kafka, MySQL and Consul, or SaaS services such as AWS RDS, Stripe APIs and Google Auth APIs, your cloud application is made up of components that are hard to instrument and monitor using existing techniques. A much more practical and effective approach is to leverage service interactions as the source of truth to monitor modern cloud applications (check out our previous blog for more information on this approach). With that digression out of the way, let's first review our test environment setup and then understand how Kubernetes works.

Overview of Netsil Technique Used to Understand Kubernetes

We use a simple Kubernetes setup with one master and two worker nodes. The Netsil collector is installed as a Docker container on the Kubernetes master and as DaemonSet pods on the workers (see figure 1). The Netsil collectors leverage pcap to capture and send copies of packets from the API interactions between the Kubernetes master and workers. These packet copies are low-level (L3 TCP/IP packets), which Netsil reconstructs into application-level protocols (L7). We use the Netsil Analytics Sandbox to query, analyze and understand the Kubernetes inner workings using this API data.

Setup for understanding the Kubernetes call flows using Netsil
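For intuition on the capture step (this is not Netsil's actual implementation), here is a rough scapy sketch that sniffs TCP payloads headed for an API server port and prints the HTTP request lines it finds. It assumes a plaintext port such as the legacy insecure port 8080; traffic on the TLS port 6443 would have to be decrypted before the HTTP layer is visible. It also needs root privileges to capture packets.

```python
from scapy.all import sniff, TCP, Raw   # pip install scapy; run as root

HTTP_METHODS = (b"GET", b"POST", b"PUT", b"PATCH", b"DELETE")

def show_request(pkt):
    # Only look at TCP segments that actually carry payload bytes.
    if pkt.haslayer(TCP) and pkt.haslayer(Raw):
        first_line = bytes(pkt[Raw].load).split(b"\r\n", 1)[0]
        if first_line.split(b" ", 1)[0] in HTTP_METHODS:
            # e.g. "POST /api/v1/namespaces HTTP/1.1"
            print(first_line.decode(errors="replace"))

# Capture 50 packets destined for the (assumed plaintext) API server port.
sniff(filter="tcp port 8080", prn=show_request, store=False, count=50)
```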

With that backdrop on API data collection, we use the following query to understand the Kubernetes API call flows.

We use the http.request_response.count datasource, since we know these are primarily REST calls.

We will group the HTTP data by Client.hostname and Server.hostname. This will, naturally, help us understand who is involved in the request and response for each call.

We will also group by http.uri and http.request_method to identify the URI endpoint and the HTTP method type (GET, PUT, POST, PATCH, etc.)

And lastly, we use a time window to focus the query on the specific time when the Kubernetes calls are made.

Figure 2 shows what the query looks like in the Netsil Analytics Sandbox. And now let's dive right into the call flows for Kubernetes.

How does Kubernetes Work?

Create Namespace

>> date; kubectl create ns test-netsil-ns ; date

Using the above command, we create the namespace test-netsil-ns and use the date timestamps to analyze the API calls in the Netsil AOC.

Key Observations:

There are multiple GET calls from kubectl to the Kubernetes API server before the namespace creation starts.

Kubectl wraps up not just authentication but also authorization, certificate checks, policy and RBAC calls.

For namespace creation, there were two POST calls: one to create the namespace and another to create a service account. The service-account creation POST call is initiated and served by k8s-master. With every namespace, a default service account is automatically created inside it.
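If you prefer to issue the same namespace-creation POST from code rather than kubectl, here is a small illustration using the official Kubernetes Python client (assuming cluster credentials in ~/.kube/config). It creates the namespace and then lists the default service account that the control plane automatically adds to it.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Corresponds to the observed POST /api/v1/namespaces call.
ns = client.V1Namespace(metadata=client.V1ObjectMeta(name="test-netsil-ns"))
v1.create_namespace(body=ns)

# The control plane auto-populates the new namespace with a "default"
# service account; it may take a moment to appear.
for sa in v1.list_namespaced_service_account("test-netsil-ns").items:
    print(sa.metadata.name)
```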

Create Deployment and Pods

One POST call for the creation of the Deployment (client.host_name:null refers to kubectl running on our laptop, where the Netsil collector is not installed)

One POST call from k8s-master to k8s-master for creating the ReplicaSet

Two POST calls to create the 2 pods, one for each pod, also going from k8s-master to k8s-master. These essentially just create entries in etcd; the actual pod creation happens later and is done by the workers.

POST calls to bind the pods to nodes

An important step is where the kubelets on worker nodes read the pods assigned to them. This is done via a GET call to the API server as shown below. There is no POST call from k8s-master to k8s-worker for creating the pods.
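The kubelet-style read can be mimicked with a plain list call using a field selector on spec.nodeName. The node name k8s-worker-1 below is a placeholder for one of your workers; in reality the kubelet keeps a watch open rather than polling, but the direction of the call is the same: a GET from the worker, not a POST from the master.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Ask the API server which pods are bound to this node (a GET, not a POST).
pods = v1.list_pod_for_all_namespaces(field_selector="spec.nodeName=k8s-worker-1")
for pod in pods.items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```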

Completing the Knowledge

We captured quite a lot of information on how Kubernetes works by using Netsil and analyzing HTTP calls. Naturally, we didn't learn much about what is going on inside the Kubernetes master or workers. This is where having deep knowledge (or, as we call it, a Ph.D.) in the system comes in. There is no better person to learn about the internals of Kubernetes from than Joe Beda, one of the founders of Kubernetes, and he recently wrote an amazing blog on the "Jazz Improv" performed by Kubernetes. We have copied an informative picture from his blog below and would highly recommend that you check out that blog.

Conclusion

In this tutorial, we explained the inner workings of Kubernetes API calls for some of its key functions. We used Netsil to capture and analyze HTTP calls between the Kubernetes master and workers. We didn't need to make any changes to Kubernetes code, and we didn't require in-depth knowledge of Kubernetes internals.

Considering the complexity of modern applications and the number of components they have, it is not practical for operations teams to have in-depth knowledge of all of them. As illustrated with Kubernetes, the Netsil approach of capturing and analyzing API calls is simple yet effective for understanding and monitoring the complex components and systems of modern microservices applications.