Docker offers CaaS (Containers-as-a-Service), allowing any Docker container to run on its platform. CaaS fills the gap between IaaS (Infrastructure-as-a-Service), which requires much more system administration and configuration, and PaaS (Platform-as-a-Service), which is typically very limiting in terms of language and library support.

Containers are here to transform how we build, test, ship and run applications securely on any infrastructure.

Containers as a service (CaaS) is a paid offering from cloud providers that includes compute resources, a container engine, and container orchestration tools.

Developers can use the framework, via an API or a web interface, to provision and manage containers and application deployments.

There are two key concepts in the container world:
– Container Orchestration
– Container as a Service

There are many overlapping projects on the market that cover both. For container orchestration, for example, we can use:
Amazon ECS
Kubernetes
Docker Swarm
Rocket
Apache Mesos
Azure Container Service (ACS supports two orchestration engines – Docker Swarm and Mesosphere DCOS)

Docker UCP
An enterprise-grade service for managing and deploying Dockerized distributed applications in any on-premises or virtual cloud environment. Its built-in security features, such as LDAP/AD integration and role-based access control (RBAC), allow IT teams to stay compliant with industry security regulations.

Kubernetes Concepts

SkyDNS is the DNS add-on that resolves service names to service IPs.

Jobs (kind:Job) are complementary to Replication Controllers. A Replication Controller manages pods which are not expected to terminate (e.g. web servers), and a Job manages pods that are expected to terminate (e.g. batch jobs). A Job can also be used to run multiple pods in parallel and one can control the parallelism.

Endpoints are simply a collection of pod_ip:port pairs.

Port: the abstracted Service port. A Service is backed by a group of pods, and these pods are exposed through Endpoints.

TargetPort: the port on which the container accepts traffic.

NodePort: when a new Service of this type is created in the cluster, kube-proxy opens a port on all nodes (the node port). Connections to that port are proxied to the pods matched by the Service's selectors and labels.

By default, Kubernetes creates a Deployment (the newer replacement for the Replication Controller) for pods if an RC is not defined. Deployments support rollback to a previous revision, which RCs lacked.

kube-proxy is responsible for implementing a form of virtual IP (the clusterIP). In Kubernetes v1.0 the proxy ran purely in userspace; in v1.1 an iptables proxy was added as well.

Proxy mode: userspace. In this mode, kube-proxy watches the Kubernetes master for the addition and removal of Service and Endpoints objects. For each Service it opens a randomly chosen port on the local node; any connection to this "proxy port" is proxied to one of the Service's backend Pods (as reported in Endpoints).

Proxy mode: iptables. kube-proxy watches the Kubernetes master for the addition and removal of Service and Endpoints objects. For each Service it installs iptables rules that capture traffic to the Service's clusterIP (which is virtual) and port and redirect it to one of the Service's backend sets; for each Endpoints object it installs iptables rules that select a backend Pod.
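To make the userspace-mode idea concrete, here is a rough Go sketch (not kube-proxy's actual code): it listens on a local "proxy port" and forwards each incoming TCP connection to a randomly chosen backend pod address. The backend list is hard-coded here; in the real thing it would come from the watched Endpoints objects.

```go
package main

import (
	"io"
	"log"
	"math/rand"
	"net"
)

// Toy illustration of userspace-style proxying: forward each connection on a
// local "proxy port" to a randomly picked backend. Real kube-proxy keeps the
// backend list in sync by watching the Service's Endpoints object.
func main() {
	backends := []string{"10.244.1.5:8080", "10.244.2.7:8080"} // hypothetical pod_ip:port endpoints

	ln, err := net.Listen("tcp", ":30080") // the locally opened proxy port
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go func(c net.Conn) {
			defer c.Close()
			backend, err := net.Dial("tcp", backends[rand.Intn(len(backends))])
			if err != nil {
				log.Print(err)
				return
			}
			defer backend.Close()
			// Pipe bytes in both directions until either side closes.
			go io.Copy(backend, c)
			io.Copy(c, backend)
		}(client)
	}
}
```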

Security

Security in Kubernetes is applied to four types of consumers (three infrastructure consumer types and one service consumer type).

When a human accesses the cluster (e.g. using kubectl), they are authenticated by the apiserver as a particular User Account.

All infrastructure components (kubelets, kube-proxies, controllers, scheduler) should have an infrastructure user that they can authenticate with and be authorized to perform only the functions they require against the APIServer.

Processes in containers inside pods can also contact the apiserver. When they do, they are authenticated as a particular Service Account. This covers inter-container and container-to-apiserver communication.

When a consumer outside the cluster contacts a service via kube-proxy, it is authenticated against the Service Account by the service itself.

The apiserver is responsible for performing authentication and authorization for users of the Kubernetes infrastructure, e.g. kubectl.

Kubelet handles locating and authenticating to the apiserver

A secret stores sensitive data, such as authentication tokens/certificates, which can be made available to containers/applications upon request.
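As a rough illustration of how a pod consumes such a secret: the kubelet mounts the ServiceAccount token and CA certificate as files under /var/run/secrets/kubernetes.io/serviceaccount, and the application simply reads them and presents the token as a bearer credential to the apiserver. A minimal Go sketch (error handling kept short):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// ServiceAccount secret mounted into the pod by the kubelet.
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		log.Fatal(err)
	}
	caCert, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caCert)

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool},
	}}

	// Talk to the apiserver via the in-cluster "kubernetes" Service,
	// presenting the ServiceAccount token as a bearer credential.
	req, _ := http.NewRequest("GET", "https://kubernetes.default.svc/api", nil)
	req.Header.Set("Authorization", "Bearer "+string(token))

	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("apiserver responded with:", resp.Status)
}
```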

Namespace is a mechanism to partition resources created by users into a logically named group.

A security context is a set of constraints that are applied to a container/pod in order to achieve the following goals

Ensure clear isolation between the container and the underlying host it runs on, using Docker's user namespaces feature.

Limit the ability of the container to negatively impact the infrastructure or other containers, using Docker features such as adding or dropping capabilities and applying resource limits (CPU/memory, etc.).

A pod runs in a security context under a service account that is defined by an administrator, and the secrets a pod has access to are limited by that service account.

For infrastructure users, security is implemented as follows to secure apiserver access:

Create a namespace -> set the cluster name and override cluster-level properties for this namespace -> set credentials for the cluster and user in the namespace -> create a security context for the "cluster + namespace + user" combination.

For service consumers:

Create a service account -> secure it with a secret -> create the service under the service account -> create the pods belonging to the service.

Define iptables rules for service access.

kube-up.sh creates the certificates below in /srv/kubernetes/:

First a CA is created; the result is a cert/key pair (ca.crt/ca.key). You can use easyrsa or OpenSSL to generate your PKI.

Then a certificate is requested and signed using this CA (server.cert/server.key). It is used:

– by the apiserver to enable HTTPS and verify service account tokens

– by the controller manager to sign service account tokens, so that pods can authenticate against the API using these tokens

Another certificate (kubecfg.crt/kubecfg.key) is requested and signed using the same CA; you can use it to authenticate your clients.
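The CA / signed-certificate relationship above can be sketched with Go's standard crypto/x509 package (kube-up.sh itself drives easyrsa or OpenSSL instead). This toy example creates a self-signed CA and then signs a server certificate with it, analogous to ca.crt/ca.key and server.cert/server.key:

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"log"
	"math/big"
	"os"
	"time"
)

func main() {
	// 1) Create the CA key pair and a self-signed CA certificate (ca.crt/ca.key).
	caKey, _ := rsa.GenerateKey(rand.Reader, 2048)
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "kubernetes-ca"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(10, 0, 0),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
		BasicConstraintsValid: true,
	}
	caDER, err := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, &caKey.PublicKey, caKey)
	if err != nil {
		log.Fatal(err)
	}

	// 2) Request and sign a server certificate with that CA (server.cert/server.key),
	//    as used by the apiserver to serve HTTPS.
	srvKey, _ := rsa.GenerateKey(rand.Reader, 2048)
	srvTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "kube-apiserver"},
		DNSNames:     []string{"kubernetes", "kubernetes.default"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().AddDate(1, 0, 0),
		KeyUsage:     x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	caCert, _ := x509.ParseCertificate(caDER)
	srvDER, err := x509.CreateCertificate(rand.Reader, srvTmpl, caCert, &srvKey.PublicKey, caKey)
	if err != nil {
		log.Fatal(err)
	}

	// Write both certificates out in PEM form (keys omitted for brevity).
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: caDER})
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: srvDER})
}
```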

Kubernetes HA Cluster

Architecture

Concepts

flannel is used because we want an overlay network. Alternatives to flannel include Open vSwitch or any other SDN tool.

While configuring cluster/ubuntu/config-default.sh, be aware that the chosen private IP ranges should not conflict with the datacenter's private IPs (a quick overlap check is sketched after this list). We can use any of these ranges:
10.0.0.0 – 10.255.255.255 (10/8 prefix)
172.16.0.0 – 172.31.255.255 (172.16/12 prefix)
192.168.0.0 – 192.168.255.255 (192.168/16 prefix)
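A minimal Go sketch of that overlap check using the standard net package; the two CIDRs below are placeholders, substitute the range from config-default.sh and your datacenter range:

```go
package main

import (
	"fmt"
	"log"
	"net"
)

// overlaps reports whether two CIDR ranges share any addresses.
func overlaps(a, b *net.IPNet) bool {
	return a.Contains(b.IP) || b.Contains(a.IP)
}

func main() {
	_, clusterCIDR, err := net.ParseCIDR("172.16.0.0/16") // candidate cluster/flannel range
	if err != nil {
		log.Fatal(err)
	}
	_, datacenterCIDR, err := net.ParseCIDR("172.16.10.0/24") // existing datacenter range
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("conflict:", overlaps(clusterCIDR, datacenterCIDR)) // conflict: true
}
```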

As of Kubernetes 1.3, DNS is a built-in service (based on SkyDNS) launched automatically by the addon manager ("cluster add-on", /etc/kubernetes/addons). DNS is used to resolve hostnames like http://www.dns.com into machine IPs.

Etcd cluster: etcd provides both a TTL on objects and a compare-and-swap operation, which together can implement an election algorithm. Kubernetes uses both of these features for master election and HA.

Unelected instances can watch “/election” (or some other well known key) and if it is empty become elected by writing their ID to it. The written value is given a TTL that removes it after a set interval, and the elected instance must rewrite it periodically to remain elected. By the use of etcd’s atomic compare and swap operation, there is no risk of a clash between two instances being undetected.
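A minimal sketch of this election pattern in Go, using the etcd v3 client (clientv3), where a lease plays the role of the TTL and a transaction provides the atomic compare-and-swap. This is illustrative only, not the code Kubernetes itself uses, and the endpoint, key and ID are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	id := "master-1" // this instance's ID
	ctx := context.Background()

	// A lease plays the role of the TTL on the election key.
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		log.Fatal(err)
	}

	// Atomic "write only if the key does not exist yet" -- the compare-and-swap
	// that prevents two instances from both winning undetected.
	txn, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision("/election"), "=", 0)).
		Then(clientv3.OpPut("/election", id, clientv3.WithLease(lease.ID))).
		Commit()
	if err != nil {
		log.Fatal(err)
	}

	if txn.Succeeded {
		log.Println("elected; keep refreshing the lease to stay elected")
		for range time.Tick(5 * time.Second) {
			if _, err := cli.KeepAliveOnce(ctx, lease.ID); err != nil {
				log.Fatal("lost the lease: ", err)
			}
		}
	} else {
		log.Println("another instance holds /election; watch it and retry when it expires")
	}
}
```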

Podmaster:

Podmaster’s job is to implement a master election protocol using etcd “compare and swap”. If the apiserver node wins the election, it starts the master component it is managing (e.g. the scheduler), if it loses the election, it ensures that any master components running on the node (e.g. the scheduler) are stopped.

Podmaster is a small utility written in Go that uses etcd's atomic "CompareAndSwap" functionality to implement master election. The first master to reach the etcd cluster wins the race and becomes the master node, marking itself with an expiring key that it periodically extends. If it finds the key has expired, it attempts to take over with an atomic request. If it is the current master, it copies the scheduler and controller-manager manifests into the kubelet directory; if it isn't, it removes them. Since all it does is copy files, it could be used for anything that requires leader election, not just Kubernetes!

The easiest way to implement an HA Kubernetes cluster is to start with an existing single-master cluster. The instructions at https://get.k8s.io describe easy installation for single-master clusters on a variety of platforms.

1) RAID-Z1 is similar to RAID 5 (allows one disk to fail), RAID-Z2 is similar to RAID 6 (allows two disks to fail), and RAID-Z3 allows three disks to fail. The need for RAID-Z3 arose recently because RAID configurations with future large disks (say 6–10 TB) may take a long time to repair, the worst case being weeks.

2) ZFS has no equivalent of the fsck repair tool common on Unix filesystems; instead, ZFS has a repair tool called "scrub".

3) ZFS compresses data first, then deduplicates it.

4) Logical Data (Original size of data without compression or dedup)
The amount of space logically consumed by a filesystem. This does not factor into compression, and can be viewed as the theoretical upper bound on the amount of space consumed by the filesystem. Copying the filesystem to another appliance using a different compression algorithm will not consume more than this amount. This statistic is not explicitly exported and can generally only be computed by taking the amount of physical space consumed and multiplying by the current compression ratio.

5) zpool replace will copy all of the data from the old disk to the new one. After this operation completes, the old disk is disconnected from the vdev.

6) Although additional vdevs can be added to a pool, the layout of the pool cannot be changed.

7) ZFS deduplication is in-band, which means deduplication occurs when you write data to disk and impacts both CPU and memory resources. Deduplication tables (DDTs) consume memory and eventually spill over and consume disk space. At that point, ZFS has to perform extra read and write operations for every block of data on which deduplication is attempted. This causes a reduction in performance.

* GlusterFS is a powerful cluster filesystem written in user space; it uses FUSE to hook into the VFS layer.

* Filesystem in Userspace (FUSE) lets non-privileged users create their own file systems without editing kernel code. Users run file system code in user space while the FUSE module provides only a "bridge" to the actual kernel interfaces.

* Though GlusterFS is a File System, it uses already tried and tested disk file systems like ext3, ext4, xfs, etc. to store the data.

* Gluster designed a system that does not separate metadata from data and does not rely on any separate metadata server, whether centralized or distributed.
* In the Gluster algorithmic approach, we take a given pathname/filename (which is unique in any directory tree) and run it through a hashing algorithm; each pathname/filename results in a unique numerical result (see the sketch after this list).
* One could imagine storing files the way a library shelves books, in alphabetical order, but an alphabetical algorithm would never work in practice; that is why a hash is used.
People familiar with hash algorithms will know that hash functions are generally chosen for properties such as determinism (the same starting string always results in the same ending hash) and uniformity (the results tend to be uniformly distributed mathematically).
* Storage servers can be added or removed on the fly, with data automatically rebalanced across the cluster.
* File system configuration changes are accepted at runtime and propagated throughout the cluster, allowing changes to be made dynamically as workloads fluctuate or for performance tuning.
* The number of bricks should be a multiple of the replica count for a distributed replicated volume.
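Here is the hashing sketch referred to above. It only illustrates the "filename -> deterministic hash -> brick" idea, using FNV from Go's standard library; GlusterFS's actual elastic hashing uses the Davies-Meyer hash and per-directory layout ranges rather than a simple modulo, and the brick names are made up:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// brickFor maps a path/filename to one of n bricks purely by hashing --
// no metadata server lookup is involved. (Illustrative only; see the
// caveats in the paragraph above.)
func brickFor(name string, bricks []string) string {
	h := fnv.New32a()
	h.Write([]byte(name))
	return bricks[h.Sum32()%uint32(len(bricks))]
}

func main() {
	bricks := []string{"server1:/data/brick1", "server2:/data/brick2", "server3:/data/brick3"}
	for _, f := range []string{"/photos/a.jpg", "/photos/b.jpg", "/docs/report.pdf"} {
		// Determinism: the same name always hashes to the same brick.
		fmt.Printf("%-20s -> %s\n", f, brickFor(f, bricks))
	}
}
```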

5)

* NFS is traditionally difficult to scale and to make highly available; GlusterFS can do both.

* If the file is not where the hash calculation says it should be, an extra lookup operation must be performed, adding slightly to latency.

6)

Self Healing

Previously, self-healing had to be triggered manually; now there is a self-heal daemon that runs in the background and automatically initiates healing every 10 minutes on any files that require it.

A file is said to be in split-brain when the copies of that file on the different bricks that constitute the replica pair have mismatching data and/or metadata that conflict with each other, so automatic healing is not possible. In this scenario, you can decide which is the correct file (the source) and which is the one that needs healing (the sink) by inspecting the mismatching copies.

* When a client witnesses brick disconnections, a file can be modified on different bricks at different times while the other brick in the replica is offline. These situations lead to split-brain; the file becomes unusable and manual intervention is required to fix the issue.

Swift is evolving so that a single cluster can be distributed over multiple, geographically dispersed sites joined via high-latency network connections.
Disaster Recovery is the mechanism for continued operations when you have multiple Swift environments in various locations. In this context, DR means continued workload operations in an alternative deployment, the recovery target clouds.

OpenStack Swift itself has an architecture for dealing with disasters by way of data replication to zones that are distributed across the datacenter. Swift can uniquely place replicas according to drives, nodes, racks, PDUs, network segments and datacenter rooms.

A new concept of “Region” is introduced in Swift. A Region is bigger than a Zone and extends the concept of Tiered Zones. The proxy nodes will have an affinity to a Region and be able to optimistically write to storage nodes based on the storage nodes’ Region. Affinity makes the proxy server prefer local backend servers for object PUT requests over non-local ones.

-:Pointers:-
** Some distinguish HA from DR by networking scope – LAN for HA and WAN for DR; in the cloud context a better distinction is probably the autonomy of management.
** To add more capacity to the cluster, add new capacity to the ring with increased weight.
** To add more regions to the cluster, change the ring and increase the replica count by a fractional amount, e.g. 3 -> 3.1.
** Replication traffic needs to be bandwidth-limited across WAN links, both for responsiveness and for cost.
** Objects (the actual data) can help in recreating the entire Swift setup after proxy server recovery. A simple rebalance of the rings can be used to redistribute the data to nodes added/recovered as part of disaster recovery and mitigation.

Software-defined networking (SDN) is an approach to networking in which control is decoupled from hardware and given to a software application called a controller.

1) SDN is :
a) Separation of data and control planes and a vendor-agnostic interface (e.g. OpenFlow) between the two.
b) A well-defined API for the networking (3rd parties can develop and sell network control and management apps).
c) Network virtualization (Underlying network infrastructure is abstracted from the applications, no vendor lock-in).

2) SDN is Not :
a) Only Implementing Network Functions in Software or on Virtual Machine
b) Only Programmable Proprietary APIs for Network Device or Management System

a) At the bottom, the data plane consists of network elements, whose SDN Datapaths expose their capabilities through the Control-Data-Plane Interface (CDPI) Agent.

b) On top, SDN Applications exist in the application plane, and communicate their requirements via NorthBound Interface (NBI) Drivers. In the middle, the SDN Controller translates these requirements and exerts low-level control over the SDN Datapaths, while providing relevant information up to the SDN Applications.

c) The Management & Admin plane is responsible for setting up the network elements,
assigning the SDN Datapaths their SDN Controller, and configuring policies defining the scope of control given to the SDN Controller or SDN Application.

d) This SDN network architecture can coexist with a non-SDN network, especially for the purpose of a migration to a fully enabled SDN network

** Openstack Integration with SDN

1) OpenStack Neutron is a networking-as-a-service project within the OpenStack cloud computing initiative.

2) Neutron is an application-level abstraction of networking that relies on plug-in implementations to map the abstraction(s) to reality.

3) Neutron includes a set of APIs, plug-ins and authentication/authorization control software that enable interoperability and orchestration of network devices and technologies (including routers, switches, virtual switches and SDN controllers) within infrastructure-as-a-service environments.

** OpenDaylight: OpenDaylight is an open source SDN project with a modular, pluggable, and flexible controller platform at its core. This controller is implemented strictly in software and is contained within its own Java Virtual Machine (JVM). As such, it can be deployed on any hardware and operating system platform that supports Java. OpenDaylight has a driver for Neutron.

SDN is focused on the separation of the network control layer from its forwarding layer, while NFV decouples the network functions, such as network address translation (NAT), firewalling, intrusion detection, domain name service (DNS), caching, etc., from proprietary hardware appliances, so they can run in software. Both concepts can be complementary, although they can exist independently.

Swift's Object Placement Strategy (The Ring): Swift uses a data structure called the "Ring" to map an object's URL to a particular location in the cluster where the object is stored. It is a static mapping; it cannot be changed on the fly.

0.2) The Ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the Ring is replicated three times by default across the cluster, and the locations for a partition are stored in the mapping maintained by the Ring. The Ring is also responsible for determining which devices are used for handoff should a failure occur.
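A rough Go sketch of how an object path maps to a partition number: MD5 the path, take the leading 4 bytes, and keep the top "partition power" bits. Real Swift also mixes a secret per-cluster hash prefix/suffix into the path before hashing, and the partition power of 18 below is just an example value:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// partitionFor maps an object path to a partition number, Swift-style:
// hash the path with MD5, take the leading 4 bytes, and keep only the top
// "partition power" bits.
func partitionFor(path string, partPower uint) uint32 {
	sum := md5.Sum([]byte(path))
	return binary.BigEndian.Uint32(sum[:4]) >> (32 - partPower)
}

func main() {
	const partPower = 18 // example value; yields 2^18 partitions
	p := partitionFor("/AUTH_test/photos/cat.jpg", partPower)
	fmt.Println("partition:", p)
	// The ring then maps this partition number to the devices holding
	// each of its (by default three) replicas.
}
```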

0.3) For a given partition number, each replica’s device will not be in the same zone as any other replica’s device.

0.4) The ring builder assigns each replica of each partition to the device that desires the most partitions at that point while keeping it as far away as possible from other replicas. The ring builder prefers to assign a replica to a device in a region that has no replicas already; should there be no such region available, the ring builder will try to find a device in a different zone; if not possible, it will look on a different server; failing that, it will just look for a device that has no replicas; finally, if all other options are exhausted, the ring builder will assign the replica to the device that has the fewest replicas already assigned. Note that assignment of multiple replicas to one device will only happen if the ring has fewer devices than it has replicas.
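A simplified Go sketch of that preference order (unused region, then unused zone, then unused server, then any device without a replica, then fewest replicas); the real swift-ring-builder also balances by device weight, which is omitted here:

```go
package main

import "fmt"

// Device is one drive in the cluster, identified by its failure-domain tiers.
type Device struct {
	Region, Zone, Server string
	Replicas             int // replicas of the current partition already placed here
}

// pick chooses a device for the next replica following the preference order
// described above. (A simplification of swift-ring-builder.)
func pick(devs []*Device, placed []*Device) *Device {
	used := func(level func(*Device) string, d *Device) bool {
		for _, p := range placed {
			if level(p) == level(d) {
				return true
			}
		}
		return false
	}
	region := func(d *Device) string { return d.Region }
	zone := func(d *Device) string { return d.Region + "/" + d.Zone }
	server := func(d *Device) string { return d.Region + "/" + d.Zone + "/" + d.Server }

	var best *Device
	bestRank := 5
	for _, d := range devs {
		rank := 4 // last resort: any device, even one already holding a replica
		switch {
		case !used(region, d):
			rank = 0
		case !used(zone, d):
			rank = 1
		case !used(server, d):
			rank = 2
		case d.Replicas == 0:
			rank = 3
		}
		if rank < bestRank || (rank == bestRank && best != nil && d.Replicas < best.Replicas) {
			best, bestRank = d, rank
		}
	}
	return best
}

func main() {
	devs := []*Device{
		{Region: "r1", Zone: "z1", Server: "s1"},
		{Region: "r1", Zone: "z2", Server: "s2"},
		{Region: "r2", Zone: "z1", Server: "s3"},
	}
	var placed []*Device
	for i := 0; i < 3; i++ { // three replicas of one partition
		d := pick(devs, placed)
		d.Replicas++
		placed = append(placed, d)
		fmt.Printf("replica %d -> %s/%s/%s\n", i, d.Region, d.Zone, d.Server)
	}
}
```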

1) Regions, zones, servers and drives form a hierarchy for data placement.
1.1) Regions are used only when distributing a cluster over geographic sites.
1.2) A zone is defined as a unique domain of something that can fail, such as power or a networking segment.

2) OpenStack Swift places three copies of every object across the cluster in as unique-as-possible locations: first by region, then zone, then server, then drive.
A quorum is required — at least two of the three writes must be successful before the client is notified that the upload was successful.

3) As a distributed storage system, the ring is deployed to every node in the cluster, both proxies and object servers.

4) All objects have their own metadata.

5) The Ring maps Partitions to physical locations of object/container/account on disk.
An account database contains the list of containers in that account. A container database contains the list of objects in that container.

6) After Object placement the Container database is updated asynchronously to reflect that there is a new object in it.

9) The Container Server’s primary job is to handle listings of objects. It does not know where those objects are, just what objects are in a specific container.
The listings are stored as SQLite database files, and replicated across the cluster similar to how objects are.
Statistics are also tracked that include the total number of objects, and total storage usage for that container.

10) The Account Server is very similar to the Container Server, excepting that it is responsible for listings of containers rather than objects.

11) If a replicator detects that a remote drive has failed, the replicator uses the get_more_nodes interface for the ring to choose an alternate node with which to synchronize.

13) When a disk fails, replica data is automatically distributed to the other zones to ensure there are three copies of the data.

Notes:

1) Post-Grizzly, the token format defaults to PKI instead of UUID. Change the provider and format in keystone.conf to UUID if you want to see tokens in short form, though PKI tokens are much more secure, since the service can trust where the token came from, and much more efficient, since it does not have to validate the token on every request as is done for UUID tokens.