Kubernetes

The Distributed Linux of the Cloud.

“Customers don’t want to buy new machines every year, they want to reconfigure it.” [Chuck McManis]

“I expect to see other Big Data AI/ML project to embrace Kubernetes to scale their workloads.” [Chris Aniszczyk]

Mainstream technologies rarely keep pace with cutting-edge research; usually the research comes first and stays behind the scenes for long periods, sometimes decades, until someone finds a practical use for it. Containers are no different: the concepts were born on IBM mainframes, which added partition isolation so multiple companies and agencies could share the same expensive hardware. Years later, Solaris Zones let customers reuse the same hardware for several applications at low operational cost, without having to reinstall or reconfigure the whole OS every time a change was needed. Docker brought simplicity, and Kubernetes added enterprise-grade features on top of it for highly scalable production scenarios, transforming what was a MAY for a few customers into a MUST for today's multi-tenant platforms expected to handle millions or billions of users.

INTRODUCTION

Kubernetes is an open source platform for managing containerized applications. As a platform it provides management, orchestration, networking and storage capabilities, that is, all the resources necessary to build a PaaS or IaaS in a portable design across distinct infrastructures or cloud providers. It was developed from the ground up to handle thousands of nodes and hundreds of thousands of applications in the same cluster.

The most basic building blocks of a kubernetes cluster are nodes and pods. Nodes are the machines (physical or virtual) where all components run and where all applications are executed, while pods are sets of containers that together define an application or a service.

Basically, there are two types of nodes: masters and workers. The main purpose of the former is to run the components responsible for managing the worker nodes and containers, while the latter hosts and executes the applications. The figure below illustrates those two node types interacting in a single-node deployment and in distributed or clustered ones.

k8s-node-deployment
Kubernetes Node Deployment

Looking at the image it is possible to identify four different kubernetes deployments. The first one is best suited for training and demonstration purposes, since it hosts master and worker components on the same node; such a deployment can also be used in production scenarios where high availability and high performance are not strict requirements.

In the second scenario there are two nodes, one master and one worker. It is still not appropriate for high availability, although it is safer, since the configuration lives on a distinct node, allowing for fast recovery when necessary.

The third scenario is a common one and can cover high-performance requirements where applications are highly available. Since there is only one master node, such a deployment risks total configuration unavailability when the master goes down; that does not bring the worker nodes down, but it prevents SysOps from performing any administration operation.

Finally, there is the fourth scenario, where both masters and workers are clustered. Note that the masters should always be deployed in an odd number of servers, since kubernetes (in fact etcd) uses the Raft consensus algorithm to store its data.

 

NODES

Master and worker are just names given to the servers where kubernetes runs; what truly differentiates the two node types are the components installed on them. Let's start with the simpler one, the worker. As mentioned before, the worker is responsible for hosting and managing all applications; to do that, three main components must be installed on every single worker node: the container engine, kube-proxy and the kubelet.

The container engine is the component responsible for managing all running containers. The most popular container engine today is Docker, even though others exist; at the time of writing kubernetes supports two distinct container engines, Docker and rkt.

Kube-proxy is the component that manages networking. Since kubernetes aims to manage a cluster of applications hosted inside containers, it must create routes between containers that live on the same worker node or even across distinct nodes. There are three distinct modes to manage networking (we will talk about them later); for now, think of kube-proxy as the component that performs all networking configuration on each node.

Last, but not least, is the kubelet, whose main duty is to guarantee that all pods are up and running. A little further on we will get into what pods truly are; until then, think of pods as applications or services. Basically, the kubelet guarantees the cluster's desired state by bringing containers up and down and by probing them and checking their health status. The next figure shows those components inside a worker node.

k8s-worker-node-components
Worker Node Components

Every worker node in a kubernetes cluster has an address composed of its hostname, an external IP address (visible outside the cluster) and an internal IP address (visible from inside the cluster). Every node has its own CPU and memory capacity, which is used to guarantee a proper allocation of pods; a node also reports its current conditions:

  • OutOfDisk occurs when a node has no more disk space available to allocate a pod,
  • Ready indicates whether a node is ready to receive pods,
  • MemoryPressure occurs when memory is low,
  • DiskPressure occurs when disk space is low,
  • NetworkUnavailable occurs when the node network is not configured,
  • ConfigOK indicates whether the kubelet is working properly.
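These conditions show up in the node's status object. Below is a trimmed, illustrative sketch of what `kubectl get node worker-1 -o yaml` might report; the node name, capacity values and timestamps are all made up.

```yaml
# Illustrative excerpt of a node status; names and timestamps are hypothetical.
apiVersion: v1
kind: Node
metadata:
  name: worker-1
status:
  capacity:
    cpu: "4"
    memory: 16Gi
  conditions:
  - type: Ready
    status: "True"           # node is healthy and can receive pods
    reason: KubeletReady
    lastHeartbeatTime: "2018-06-01T10:00:00Z"
  - type: MemoryPressure
    status: "False"          # memory is not low
    reason: KubeletHasSufficientMemory
```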

The master node also has its own components; unlike the worker ones, these components are supposed to manage the whole kubernetes configuration, also called the desired state. The master node has a few more components. The first one is etcd, which works as a distributed key/value store holding most of the cluster data; etcd can also be deployed on separate servers if the volume demands it.

Kube-apiserver is the component that exposes kubernetes so external tools can configure and monitor the cluster. All exposed APIs, from creating a node, through adding and deleting pods, to allocating a storage resource, are available through the kube-apiserver. Basically, every request that reaches it has its permissions checked and its input validated before being stored as cluster data, from which all actions to keep the desired state up to date are taken. This component can be scaled horizontally as much as needed; it is usually proxied by an external load balancer and should be protected with an SSL public/private key pair.

There is also a component whose main duty is to schedule pods so they are assigned to the correct nodes. It is called kube-scheduler, and it keeps working to fulfill the configured desired state by placing the right pods on the right nodes.

Finally, there is the kube-controller-manager, which manages and controls everything that exists in a cluster. For example, the node controller is responsible for managing all nodes, while replication controllers manage all pods; there is also an endpoint controller to make configurations available to pods, and the service account & token controller to manage users, tokens and the roles assigned to them.

Fortunately, things don't stop there. As kubernetes evolved and many cloud providers started to support it, a fifth component was released under the name cloud-controller-manager, whose main purpose is to manage and control all resources when kubernetes is deployed in the cloud. For example, a node controller on AWS manages EC2 instances, a route controller configures network routes, service controllers are responsible for setting up load balancers, and volume controllers monitor storage volumes. The cloud-controller-manager is a brand-new component created so that cloud providers can evolve their infrastructure and their kubernetes support decoupled from kubernetes itself.

k8s-master-node-components
Master Node Components

Since kubernetes is a live project with hundreds of developers working on it, it isn't uncommon for new components to be delivered as add-ons. Those are nothing more than pods whose main goal is to provide extra tools to manage and monitor the cluster; some of those add-ons have been so effectively used that they are becoming kubernetes standards.

Kubernetes DNS assigns DNS names to nodes, pods and services in the cluster, while the Web UI Dashboard simplifies management of the whole cluster through a visual interface. Some external tools, like Resource Monitoring, store all cluster metrics in a time-series database that can be queried for historical data, while Cluster-Level Logging collects logs from all containers and centralizes them so searches can be performed via a friendly web interface. The kubelet performs on the master node the same duty it has on the worker ones, which is to guarantee that all components are up and running as previously configured.

 

POD

As mentioned above, think of pods as application services. What does that mean? Basically, a pod represents a business service that provides functionality to a customer or user. Inside kubernetes, a pod is the minimum package/bundle used to put an application or service to run. Pods may be composed of a set of containers, each one running a single process or a dozen of them.

One could raise a question: why pods, and not just containers? Well, some application services may very well be composed of a set of containers that, working together, provide a complete feature or solution that separately has no use at all. For example, if a company has a system that requires a PHP application plus a MySQL database and it makes no sense to deploy or undeploy them independently, those are very good candidates to compose a pod. If that example did not sound quite right, good for you, because creating pods with a lot of interdependent containers or several processes isn't, for sure, something we would call a good practice.

For a kubernetes cluster, a pod is the smallest deployable unit; as a consequence, communication between its containers happens over a localhost address, which makes it impossible for containers inside the same pod to overlap on network port allocation. As the canonical unit of deployment, pods can also be bound to volumes that store persistent data accessible to all containers inside the pod and that survive crashes and restarts.
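Just as a sketch, a minimal pod manifest could look like the YAML below; the pod name, labels and image are illustrative choices, not anything prescribed by kubernetes itself.

```yaml
# A minimal, illustrative pod: one container listening on port 80.
apiVersion: v1
kind: Pod
metadata:
  name: yellow-pod            # hypothetical name
  labels:
    app: yellow
spec:
  containers:
  - name: web
    image: nginx:1.15         # any container image would do
    ports:
    - containerPort: 80
```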

When a pod is created, the kube-scheduler assigns it to a worker node as soon as possible; the kubelet on that node then downloads the container images and asks the container engine to bring them up. Once allocated, the pod stays on that node until it is restarted, the node crashes or the node reaches its maximum resource usage. All decisions about whether a pod should be kept on or moved from a node are taken by the control plane; the worker node never, ever, decides that on its own. The figure below is quite similar to Figure 2 (Worker Node Components), yet it highlights what a pod truly represents inside a worker node.

k8s-worker-node-pods
Worker Node Pods

To give SysOps flexibility, kubernetes allows hooks to be configured that run right after a container starts (the PostStart hook) or just before it is gracefully shut down (the PreStop hook). Sometimes more than a simple hook is necessary; in such cases it is possible to define as many init containers as needed, which run before the pod's application containers are initialized and can be used to pre-configure and perform all necessary setup before an application service becomes available.
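A hedged sketch combining both ideas is shown below; the images, commands and the idea of waiting for a MySQL service are purely illustrative assumptions.

```yaml
# Illustrative pod with an init container and lifecycle hooks.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-hooks        # hypothetical name
spec:
  initContainers:
  - name: wait-for-db         # runs to completion before the app container starts
    image: busybox:1.28
    command: ["sh", "-c", "until nslookup mysql; do sleep 2; done"]
  containers:
  - name: app
    image: php:7.2-apache     # any application image
    lifecycle:
      postStart:
        exec:
          command: ["sh", "-c", "echo started >> /tmp/lifecycle.log"]
      preStop:
        exec:
          command: ["sh", "-c", "apachectl -k graceful-stop"]
```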

It isn't uncommon to need to share a piece of information among pods, for example a common certificate, a secret, a file or an environment variable. For such cases, kubernetes provides PodPresets, which inject those variables and artifacts at pod initialization without developers knowing the values at development time. Furthermore, it is possible to probe a pod's health by executing a command and checking its return code, by opening a TCP connection or by issuing an HTTP GET request. A SysOps can then define a policy to restart the pod as soon as it becomes unhealthy, making sure it is always up.
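For example, a container section with liveness and readiness probes might look like the sketch below; the health path, ports and timings are assumptions for illustration only.

```yaml
# Illustrative probes inside a container spec.
containers:
- name: web
  image: nginx:1.15
  livenessProbe:              # restart the container if this check keeps failing
    httpGet:
      path: /healthz          # hypothetical health endpoint
      port: 80
    initialDelaySeconds: 10
    periodSeconds: 15
  readinessProbe:             # only send traffic once this check succeeds
    tcpSocket:
      port: 80
    periodSeconds: 5
```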

Also, it is possible to create a pod in privileged mode, granting it the capability to access underlying Linux features and kernel resources; this is very common for pods that must access kernel data, as is the case of the now famous cAdvisor. Finally, for safety reasons, SysOps can define PDBs (PodDisruptionBudgets), which are policies describing how many pods can be terminated voluntarily (i.e. on purpose); such policies are tremendously important to guarantee that a mistyped command or a human mistake will not bring down a set of functionality and affect the system's quality of service or availability.
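A hedged PodDisruptionBudget sketch follows; the name, label selector and numbers are illustrative, and older clusters expose this object under the policy/v1beta1 API group instead.

```yaml
# Keep at least two "yellow" pods running during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: yellow-pdb            # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: yellow
```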

Similarly to nodes, pods also have their own states, which are very important to understand how they are operating and to debug them when something goes wrong. The available statuses are:

  • Pending when the container images are still being downloaded and created,
  • Running when all containers have been created and at least one is running,
  • Succeeded indicating all containers have finished properly,
  • Failed indicating at least one container has failed after termination,
  • Unknown when it is not possible to retrieve the pod's status.

 

Now that we know what nodes and pods are, let's move forward to understand how kubernetes works behind the scenes and dive a bit deeper into its components' details.

Kubernetes also supports policies for pods using the well-known RBAC model. For example, it is possible to define whether a pod can access the local disk, bind a specific port on the node, use a specific kind of volume and so on. Pod policies are rich and extensive; their usage will not be covered in this article.

 

OBJECTS AND DESIRED STATE

Knowing the concepts behind the main components is important, since they give a straight view of how kubernetes works. The question that should be raised by now is: how do SysOps tell kubernetes to do the things it is supposed to do? Here come the concepts of objects and desired state.

An object inside kubernetes behaves like an entity inside a software product: it contains all the relevant information the system needs to perform its duty. The same way entities are usually inserted by an end user, kubernetes objects are inserted, modified, deleted or retrieved by SysOps. The set of all objects inside kubernetes is what we call the desired state, which is nothing else but the state kubernetes must match to fulfil the SysOps' demands. That is achieved using the control loop concept, borrowed from the automation systems discipline.

The same way a consumer goes to amazon.com, adds items to a shopping cart and presses 1-click ordering expecting to receive the product as soon as possible, a SysOps goes to kubernetes, configures its objects into a desired state and presses apply, expecting a confirmation that the cluster is up to date and running as needed. And just as an amazon.com purchase stays safe in the company's database, all cluster configuration is stored in the kubernetes control plane (more specifically in the etcd database), which is responsible for guaranteeing that the desired state will be reached by keeping master and worker node components in continuous communication with each other.

Naturally, since kubernetes is a very complex system, it has dozens of objects, some of which we have already discussed: nodes, pods, services, volumes, ingresses, configs, etc. As kubernetes has evolved, some of those objects have been abstracted by higher layers whose main purpose is to simplify cluster configuration; one of those abstractions are the controllers, which are the right ones to use for cluster configuration in most cases. As an example, say a SysOps needs three replicas of a pod: instead of creating three pod objects, he or she can create a deployment of the pod requiring a minimum of three replicas (see the sketch after this list). Quite easier, isn't it? The main controller types are:

  • ReplicaSets, which are rarely created directly nowadays, since Deployments manage them on our behalf,
  • Deployments, which sit on top of ReplicaSets and are the usual way to keep pods in a desired state,
  • StatefulSets, which work like Deployments yet guarantee pod ordering and uniqueness, which can be very useful for services that must be started and stopped in an ordered fashion,
  • DaemonSets, which place a pod on each node of the cluster, a good idea when agents must run on every node,
  • the GarbageCollector, which guarantees cascade operations among object dependencies and collects objects that have no owner at all,
  • Jobs, which execute one pod or a set of parallel pods to completion,
  • CronJobs, which are jobs scheduled to run at a specific time or periodically.
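Here is a minimal Deployment sketch for the three-replica example above; the names and image are illustrative.

```yaml
# Three replicas of a pod, kept at the desired state by the Deployment controller.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yellow-deployment     # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: yellow
  template:                   # the pod template replicated three times
    metadata:
      labels:
        app: yellow
    spec:
      containers:
      - name: web
        image: nginx:1.15
        ports:
        - containerPort: 80
```

Applying it is a matter of saving the YAML to a file and running `kubectl apply -f` on it.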

There are several ways to configure a kubernetes cluster's desired state. The most common is the kubectl command line, which can be compared to a bash terminal for managing a Linux or Unix server. Behind the scenes, kubectl invokes the kubernetes APIs available in the kube-apiserver to modify the current configuration, applying the desired state. All kubernetes APIs are made available in the OpenAPI (Swagger) format, and the most common way to describe an object is JSON or YAML.

Each object has a unique identifier, a type and its configurable properties. It is also a very good practice to label those objects so they can be used later for administration purposes. For example, assigning the label "high-performance" to a database pod could be used to ensure its allocation on nodes with more than sixteen cores. Although less common, it is also possible to attach annotations to objects; those provide an extensibility mechanism, so plugins can retrieve object metadata and take further decisions.
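Labels and annotations live in an object's metadata section; the keys, values and annotation domain below are hedged, illustrative choices.

```yaml
# Illustrative metadata: labels for selection, annotations for plugin/tool metadata.
apiVersion: v1
kind: Pod
metadata:
  name: db-pod                          # hypothetical name
  labels:
    tier: database
    profile: high-performance           # used later by selectors or scheduling rules
  annotations:
    example.com/owner: "platform-team"  # free-form metadata read by external tools
spec:
  containers:
  - name: mysql
    image: mysql:5.7
```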

Kubernetes also has the concept of namespaces, which work like database schemas, i.e., they allow a single cluster to be logically partitioned so configurations applied in one namespace do not affect others. Be aware that not all objects in a kubernetes cluster can be partitioned by namespaces, since some of them, like nodes and volumes, are cluster-wide.

The figure below puts everything discussed in this section together: kubectl is the command line used to configure the desired state, which is stored in the etcd distributed database. Every configuration is used by the master node to configure the worker ones, including partitioning them into separate namespaces. It is also possible to use a web interface like the Web UI, or any third-party vendor tool, to invoke the APIs.

k8s-configuration
Kubernetes Configuration

 

ARCHITECTURE

Kubernetes is a very complex system; in this section we will go deeper into its internals, clarifying how the pieces fit together and how components end up orchestrated in a cluster deployment. The first thing to consider is that a worker node does not communicate with the master ones, except on a very few occasions, for example when a node must register itself into the cluster. It is also possible that a pod on a worker node communicates with the kube-apiserver to perform some kind of configuration, but keep in mind that such communication is more related to deployment requirements than to kubernetes itself.

Since the kube-apiserver allows any client to modify, create or delete desired state, it is highly recommended that this interface be protected by an authentication mechanism; likewise, activating authorization policies and configuring secure communication through SSL are a MUST. Maybe that last statement passed unnoticed, so let's emphasize it: all cluster configuration is performed over a secure encrypted connection that requires proper credentials and authentication, and there are authorization policies and rules to guarantee that only allowed users or roles can perform the necessary configuration.

On the other hand, the communication between the master node and the worker nodes is very intensive: it is used to fetch log data, attach to pods, configure ports, configure services, etc. Those interfaces are not protected by default; in fact, they do not validate the certificate, which makes it very insecure to configure a deployment where the master node and the worker ones are separated by a public network. Therefore, if such a public network is necessary, the recommendation is to enable certificate validation or use an SSH tunnel.

k8s-master-node-communication
Kubernetes Master/Worker Communication

Another very important piece of the kubernetes architecture is the Cloud Controller Manager. The concept has existed in the kubernetes model for a while, but the rocketing adoption by cloud providers demanded a new design. The Cloud Controller Manager, or CCM, is an interface created by kubernetes engineering so a cluster can be easily deployed in any available cloud. Basically, it defines a common contract that must be implemented to develop extensions allowing kubernetes to integrate with cloud providers like AWS, Azure and GCE by automatically configuring their underlying infrastructure.

For example, say a worker node has failed and a new one must be started to keep the desired state. Most cloud providers have APIs and resources to perform such an operation, and kubernetes can drive it: the node controller picks the right zone or region and defines the proper network and hostname, while the kubelet initializes the node. The same is true for configuring network communication between all nodes, a task performed by the Route Controller. When a service is created with several redundant pods, in a high-availability deployment, it is necessary to expose the host and port where it can be reached, so kubernetes can ask the Service Controller to create a rule on a load balancer. The same applies to reserving or allocating volumes used as persistent storage, which is handled by the Persistent Volume Controller.

The CCM works similarly to the kube-controller-manager, including the watch/notify (control loop) mechanism that keeps the cluster in the desired state. The main difference is that the CCM is used when deploying a kubernetes cluster on a cloud provider, while the KCM is used in private infrastructure deployments. The CCM will hardly ever replace the KCM for all features; for example, the community decided the volume controller should not be handled by the CCM, and the same is true for a couple of other existing controllers.

 

CONFIGURATION

Creating a kubernetes cluster is not the only complex task: managing it and guaranteeing that every piece sticks together, that all configurations are enforced and easily maintained, will require from the cluster administrator a lot of organized thinking, very good housekeeping skills and, last but not least, deep knowledge of the environment and its applications, including the chaos that will happen in it. I know it sounds hilarious, but that is the plain truth: kubernetes clusters are not meant for simple solutions, they aim to handle the deployment and administration of quite complex sets of systems, and that demands from administrators a lot of time, study and knowledge.

The first thing to have in mind is that the cluster, including its configurations, will change quite frequently: developers will find a better way to use a kubernetes resource or a new feature every once in a while, and administrators will have to figure out how to put that in production, without leaving behind the possibility of rolling it back if anything goes wrong. So make sure all configurations are stored in a versioning repository, like git, so recovering a previous configuration is simple and fast.

A kubernetes administrator must keep in mind that it is his or her mission to help developers use resources properly: instruct them to use DNS names instead of IPs and to avoid pods that require a specific host port. It is also possible to reference images by digest to make it harder to update pods that are more stable, whose impact is higher, and where a modification should be approved by a committee.

A simple strategy to make cluster administration easier is to use labels on every pod; once they are there, every kubectl command can be applied simultaneously to a set of commonly labeled pods, preventing a partial modification that could put the cluster in an undesirable state. It is also very important to set limits for CPU and memory usage; since kubernetes turns swap off, reaching the OS memory limit is easy, and the consequences are dangerous.

Guaranteeing minimum resources for a pod to be hosted is also very important, preventing a pod from getting stuck on a node that doesn't have the appropriate resources to host it. In kubernetes, a CPU means a vCPU or a hyper-thread, and it can be configured using fractional numbers: 0.1 vCPU (or 100m) means the node must have at least 10% of one CPU available before hosting the pod.
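A hedged container resources sketch is shown below; the actual numbers depend entirely on the application and are illustrative here.

```yaml
# Illustrative requests (minimum reserved) and limits (hard ceiling).
containers:
- name: web
  image: nginx:1.15
  resources:
    requests:
      cpu: "100m"        # 0.1 vCPU reserved for scheduling decisions
      memory: "128Mi"
    limits:
      cpu: "500m"        # the container is throttled above half a vCPU
      memory: "256Mi"    # exceeding this gets the container OOM-killed
```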

Kubernetes also allows configuring node affinity and anti-affinity, which force, or give priority to, nodes that match the defined requirements. In the same way, it is possible to use taints and tolerations to prevent a pod from being scheduled on a specific node; for example, if a pod demands too many resources, we may not want it scheduled on the same node where a very critical service is running, reducing the risk of a total crash or high latency due to CPU competition.
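The sketch below combines a required node affinity rule with a toleration; the label key, taint key and values are all hypothetical, and the taint itself would have been applied to the node beforehand with `kubectl taint nodes ...`.

```yaml
# Illustrative scheduling constraints inside a pod spec.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: profile               # hypothetical node label
            operator: In
            values: ["high-performance"]
  tolerations:
  - key: "dedicated"                   # hypothetical taint key on the node
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"               # allows scheduling onto nodes tainted this way
  containers:
  - name: batch-worker
    image: busybox:1.28
    command: ["sleep", "3600"]
```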

Finally, use and abuse kubernetes ConfigMaps and Secrets. Both are just key/value pairs storing variables that are injected into pods during startup; the difference is that while ConfigMaps store and display all values as plain text, Secrets are stored using base64 encoding. Environment variables hard-coded at development time should be eradicated in favor of centralized configuration; for example, a database connection URL should not be defined by the developer during the development phase, but by the cluster administrator during deployment. Secrets also allow their values to be mounted as files in a pod volume, so keep in mind that if more than just a hash or a password is necessary, for example a private/public key pair, the secret can be pushed into the pod filesystem. Always remember that the secret must be defined before the pod comes up, otherwise the pod will not be able to find it.
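A hedged sketch of a ConfigMap and a Secret, and of how a pod could consume them as environment variables and as a mounted file; every name, URL and value below is made up for illustration.

```yaml
# Illustrative ConfigMap and Secret, consumed as env vars and a mounted file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  DB_URL: "mysql://mysql.default.svc.cluster.local:3306/shop"
---
apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
stringData:                    # plain text here; stored base64-encoded by the API
  DB_PASSWORD: "not-a-real-password"
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: php:7.2-apache
    envFrom:
    - configMapRef:
        name: app-config       # every key becomes an environment variable
    env:
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: app-secret
          key: DB_PASSWORD
    volumeMounts:
    - name: certs
      mountPath: /etc/certs    # the same secret exposed as files as well
  volumes:
  - name: certs
    secret:
      secretName: app-secret
```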

Due to the existence of multiple namespaces, a cluster also supports the concept of quotas per namespace. This very powerful feature allows putting distinct teams, or sets of applications, on the same cluster while preventing one from impacting another or consuming more resources than it is supposed to. It is possible to define limits for CPU usage, memory usage, number of pods, number of load balancers and so on. Quotas are very important when a more centralized administration is performed and a set of distinct companies or applications shares a single cluster; that might not be the case for most companies, yet it is a very common scenario for hosting providers.
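An illustrative ResourceQuota for a hypothetical team namespace; the namespace name and all the numbers are assumptions.

```yaml
# Caps for the "team-a" namespace; the numbers are made up.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    pods: "50"
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    services.loadbalancers: "2"
```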

 

NETWORKING

Since a kubernetes cluster is supposed to support thousands of pods hosted on thousands of nodes, a good question would be: how do the pods talk to each other? For someone with some knowledge of designing physical networks, such a number of nodes would certainly bring some challenges. Now imagine that working in a very dynamic scenario, where pods and nodes are inserted and removed hundreds of times a day. If such administration had to be performed manually, it would require a lot of work from dozens of specialists, 24/7.

To support all those configurations in a simple fashion, kubernetes networking has three main requirements:

  • All containers can communicate with each other without using NAT,
  • All nodes can reach every container, and vice-versa, also without using NAT,
  • A container sees the same IP address for itself that all other containers see.

There is also an implicit statement in the rules above: each pod has its own IP address, so containers inside the same pod all share that IP, which makes it impossible for those containers to share the same port; containers in the same pod always have to use distinct ports. Another important point lies in the fact that pods are ephemeral and move from one machine to another, so binding pods to a specific host port will end up causing port assignment conflicts in the OS; do not use static host ports for your services unless strictly necessary.

How does kubernetes manage all this packet exchange, simulating a private IP network for thousands of nodes and hundreds of applications? The answer is quite simple: it depends! There are dozens of models already implemented and available; let's walk through some of them.

Cisco ACI is a kind of ultra-mega-super-premium solution that allows integrating several datacenters around the world and managing all network details from a single point, even building a multitenant network environment on top of it. Probably the kind of solution that AWS or Azure would need behind the scenes, not us mere mortals. Just for imagination's sake, let's say you have all that: why not integrate kubernetes with it and guarantee that all its internal communications will also be handled by this same complex, hardware-backed network?

If you are a kubernetes cluster admin that lucky, you would be able to just add pods and nodes using kubectl and let all the route configuration flow automatically. Most companies don't have the budget for an ACI solution, so I describe it here only so we can visualize how far a kubernetes network plugin can go.

Cilium is a solution based on Berkeley Packet Filters (BPF) that injects instruction code into the kernel, modifying socket routing and filtering in such a way that a private network and its routes can be simulated. Basically, it allows a received packet to be filtered based on its IP address, defines forwarding rules, performs load balancing or, for security reasons, blocks HTTP requests to a specific service. Naturally it also supports overlay networking by encapsulating packets in the VXLAN protocol and forwarding them to another node, which is simple to configure and understand yet has lower performance.

Since there are dozens of network plugins, Huawei has defined its own, CNI-Genie, which allows several plugins to coexist and switches between them based on a group, namespace or application group.

Contiv and Contrail are also technologies that embrace multiple backends, supporting not only several network and device providers but also multiple container orchestration technologies like Red Hat OpenShift, OpenStack and Mesos.

In spite of all the existing solutions and plugins, probably the most used is Flannel, due to its simplicity and easy deployment. Flannel does not provide any network policies, just an overlay network that provides communication between nodes. One thing to keep in mind is that Flannel is far from being a good choice for high performance or huge loads; it is a basic step to get into kubernetes networking and a component to be used with caution.

Since Flannel has no policy control over the network, Project Calico aims to deliver security policies inside a kubernetes network, regardless of whether an encapsulated overlay network is used. Using policies is an advanced topic and makes sense when it is necessary to protect nodes from each other or to isolate a multitenant network, which probably is not the case for most scenarios today.

Most kubernetes deployments just assume free and open access between all nodes in the cluster, since creating a whole policy for a virtual network between containers that are ephemeral and move across subnets and nodes would require a lot of network design and deployment planning, which is time consuming for those in a rush.

All major cloud providers, like AWS, GCE, DigitalOcean, Bluemix, Azure and a couple of others, have a kubernetes plugin implemented to support networking on top of their own infrastructure. The same is true for virtualization providers like VMware, whose network virtualization is delivered via NSX-T.

Another common word in the kubernetes world is the acronym CNI, which stands for Container Network Interface: a set of interfaces, or contracts, defined by the community that every vendor shall follow to set up or tear down kubernetes network connections. Several of the models described above are, behind the scenes, CNI-compliant processes that integrate a running pod into a virtual network by assigning it an IP address and configuring all required network routes.

The figure below tries to put the whole kubernetes network model together. There we have only two nodes, yet the diagram could be expanded to more; each node has a couple of pods inside it. The yellow pod has two containers that communicate with each other using localhost on different ports, highlighting the fact that containers in a pod cannot use the same port. The figure also shows the nodes themselves deployed on network 192.168.0.0/16 while the kubernetes cluster uses network 10.0.0.0/16, which is reachable via a virtual network or a custom bridge.

CNI is the common interface used by the kubernetes controllers to configure the network; as mentioned earlier, Flannel is one of the available implementations. Just for illustration purposes, an overlay tunnel was drawn to remember that Flannel uses an overlay network that encapsulates all packets inside UDP datagrams.

Calico can handle both network configuration and policies, so it was highlighted with a security icon; a policy example would be prohibiting the green pod from talking to the red one by any means. Finally, there is the ACI adapter, which uses real hardware and infrastructure networking to guarantee the communication between all nodes.

k8s-networking
Kubernetes Networking Model

Remember this figure is an illustrative one; in a real deployment it would hardly ever be necessary to put Flannel, AWS, ACI and Calico together to manage kubernetes networking.

Up until now, everything should be able to talk to everything else, and it all should seem properly arranged; so let's bring some reality into that. As said at the very beginning, pods are ephemeral, meaning that pods come and go dozens or hundreds of times per day. Given that, how can we guarantee that the green pod can reach the yellow pod when its IP address and port keep changing all the time? The answer is kubernetes services, from now on just services.

Basically, services are just a kubernetes configuration, or desired state, that defines a protocol (e.g. UDP or TCP), a port (e.g. 443 for https) and a target port (e.g. 9443). A service works as an abstraction that keeps pods available at the same IP address and port regardless of pod restarts or even node changes. While the port exposes the service, the target port defines on which port the containers expect requests. With that, the green pod can access the yellow pod using its DNS name (e.g. greenpod.svc.com) on port 443, and kubernetes will know how to forward the request to the right IP address and the right destination port. How does that work? Remember the kube-proxy component, the one that behaves like a DaemonSet and runs on every worker node? It is the one responsible for configuring, and unconfiguring, all of that.
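A hedged Service sketch exposing the hypothetical yellow pods on port 443 while their containers listen on 9443; names and selectors are illustrative.

```yaml
# Illustrative ClusterIP service: a stable virtual IP and port in front of the pods.
apiVersion: v1
kind: Service
metadata:
  name: yellow-svc            # hypothetical name
spec:
  selector:
    app: yellow               # matches the pods' labels
  ports:
  - protocol: TCP
    port: 443                 # port clients use
    targetPort: 9443          # port the containers actually listen on
```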

At the very beginning, kube-proxy behaved like a real proxy: it changed iptables rules inside the OS kernel, redirecting to itself all requests addressed to services, and assumed the responsibility of delivering each request to the right destination. Such an approach works and scales up to a hundred nodes or so; however, clusters grow beyond that, and performance problems arise due to its centralized, user-space architecture.

Once the kubernetes community figured out that performance was too low, due to the heavy use of user-space memory, they decided kube-proxy should instead program iptables to forward requests straight to the right pod. Routing then became much faster and could scale to around a thousand nodes, but it still has some drawbacks, such as the difficulty of performing round-robin routing or trying a secondary pod when the primary one is unreachable for any reason.

Kubernetes could still do better, so in version 1.9 kube-proxy gained support for IPVS (Linux IP Virtual Server), which allows creating a routing table in kernel space with several types of balancing algorithms and filters. A service can be reached either through environment variables following the convention {SVCNAME}_SERVICE_HOST:{SVCNAME}_SERVICE_PORT or by adding the DNS add-on and configuring it appropriately. With that, clusters can scale to five thousand nodes while still keeping very good performance. The figure below illustrates the three kube-proxy modes.

k8s-services
Kubernetes Service Model

When creating a service, it is possible to define how it will be reachable from other pods or even from applications outside the cluster. The default behaviour is called ClusterIP and makes the service reachable only from inside the cluster. The second type is NodePort, which reserves a static port on every node of the cluster and maps it to the right pods; in such a configuration, any node can expose the service to requests sent to node:port, but developers must guarantee they will not release two services on the same port, in order to prevent conflicts.
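A NodePort variant of the earlier service sketch; the port numbers are illustrative (by default NodePorts are allocated in the 30000-32767 range).

```yaml
# Illustrative NodePort service: every node forwards port 30443 to the pods.
apiVersion: v1
kind: Service
metadata:
  name: yellow-nodeport
spec:
  type: NodePort
  selector:
    app: yellow
  ports:
  - protocol: TCP
    port: 443
    targetPort: 9443
    nodePort: 30443           # static port reserved on every node
```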

It is also possible to use the LoadBalancer type, which exposes the service through a cloud provider load balancer. Behind the scenes it defines a NodePort, creates a listener on the load balancer and adds every existing node in the cluster as a target. For some reason, kubernetes engineering did not require the LoadBalancer type to target only the nodes where the service's pods are running, so it is still necessary to be careful about port conflicts, and almost all requests will take an extra routing hop via iptables, as previously described.

Finally, it is possible to select ExternalName, where the service is exposed through an external CNAME; such a configuration requires the kube-dns add-on to be installed. The figures below illustrate the first three types of services: ClusterIP, NodePort and LoadBalancer. ExternalName will be covered later.

k8s-services-types
Kubernetes Service Types

It is important to mention that kubernetes supports headless services, which are services that generate no kube-proxy configuration at all, giving total freedom to map a service to the right pods in any other way, for example via DNS configuration.

In the figure above all services are described using their unique IP addresses, however it is also possible to use DNS names. That requires the kubernetes cluster to run a DNS server (kube-dns and CoreDNS are just two examples), which behind the scenes is just a pod that provides name resolution to all other pods in the cluster.

Basically, when a cluster administrator installs a DNS add-on in kubernetes, what really happens behind the scenes is the deployment of a DNS pod, with its own IP address and the ability to perform name resolution, plus an entry on every created pod pointing to that DNS service for name-to-IP resolution.

Kubernetes allows two kinds of DNS records for its services. A records map a hostname to an IP address, while SRV records map a port and hostname to a specific IP address and port. It is straightforward to understand why kubernetes DNS allows both models: the first is used when the client knows the server's destination port, while the latter is used when the client knows the protocol but has no idea on which port the service can be reached. An A record would have the format “yellowpod.svc.cluster.local”: “10.0.0.13”, while an SRV record would follow the format “_ldap._tcp.yellowpod.svc.cluster.local”: “10.0.0.13:389”.

DNS can also be used to give pods more human-friendly names, following the format pod-ip-addr.namespace.pod.cluster.local or hostname.subdomain.namespace.svc.cluster.local. There are four types of DNS resolution for a pod:

  • ClusterFirst: pods use the cluster DNS and, if it cannot resolve a name, fall back to the DNS configured on the node itself; this is the default behavior,
  • Default: which, by the way, is not the default (ClusterFirst is), makes pods use the DNS configuration of the node where they are hosted,
  • ClusterFirstWithHostNet: which shall be used for pods running with hostNetwork,
  • None: where all DNS configuration comes from the dnsConfig section of the pod specification (supported only in kubernetes 1.9+).

Each one of the models above has its own reasons to exist. The most common scenarios rely on ClusterFirst, the default, with a fallback to the node's DNS configuration. A pod using “hostNetwork: true” must use ClusterFirstWithHostNet, since it has direct access to the node's network interface, while the None configuration is useful because it allows more granularity for a specific pod in the cluster, or for administrators who don't have a standard way to access DNS on their own networks.
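A hedged sketch of the None policy with an explicit dnsConfig; the nameserver address, search domain and ndots value are made up.

```yaml
# Illustrative pod using an explicit DNS configuration (kubernetes 1.9+).
apiVersion: v1
kind: Pod
metadata:
  name: custom-dns-pod
spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
    - 10.0.0.53               # hypothetical in-cluster resolver
    searches:
    - svc.cluster.local
    options:
    - name: ndots
      value: "2"
  containers:
  - name: app
    image: busybox:1.28
    command: ["sleep", "3600"]
```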

Unfortunately, things don't end here for services. Usually they must be accessible to an external system or even the Internet, so some security constraints have to be addressed. The first of them is guaranteeing the service has a proper certificate on all pods, which can be achieved using the secret configuration mentioned earlier, or by configuring the certificate on the load balancer itself.

After all this, the cluster is up and running, reachable from any pod, and services prevent node shutdowns or pod reallocation from breaking communication between pods. Nonetheless, one part of service exposure is still missing: making a service available to third-party components outside the cluster, which is a very common scenario for applications that integrate with external systems or are invoked by an independent frontend.

Here comes the kubernetes ingress, which is used to give an external address to kubernetes services, including support for load balancing, SSL termination (using secrets), virtual hosting, etc. A cluster with an ingress object configured needs an Ingress Controller enabled, which is responsible for managing all those configurations, including the ones applied to an external edge router. So far, ingress configuration supports only HTTP[S] requests, where a URL path is mapped to a kubernetes service.
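A hedged ingress sketch mapping a path to the service from the earlier examples; the hostname and TLS secret name are illustrative, and older clusters exposed this object under the extensions/v1beta1 API group instead.

```yaml
# Illustrative ingress: TLS termination plus a URL path mapped to a service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop-ingress
spec:
  tls:
  - hosts:
    - shop.example.com          # hypothetical hostname
    secretName: shop-tls        # secret holding the certificate and key
  rules:
  - host: shop.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: yellow-svc
            port:
              number: 443
```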

Regardless of running on a cloud provider or on-premise infrastructure, an ingress controller must be installed and its provider defined; currently there are four main controllers available:

  • Nginx: which uses the very famous Nginx server to route HTTP (support for TCP and UDP protocols is still being developed),
  • GCE: which shall be used when deploying kubernetes on Google Cloud Platform,
  • F5 BIG-IP: which integrates kubernetes with the well-known F5 load balancer,
  • Kong API Gateway: which integrates kubernetes with the Kong API gateway, making it easier to bind services to the API gateway configuration itself.

Ingress is a feature that has been improving over the last months and will keep improving; so, despite it being possible to expose services to external addresses in other ways (e.g. using a NodePort, or a LoadBalancer on cloud providers), ingress will probably become the official and most adequate model for doing so. The figure below illustrates how an ingress object interacts with service objects to reach the pods, with containers inside them.

k8s-ingress
Kubernetes Ingress

At this point the whole kubernetes networking picture should be complete; most developers would be satisfied, and maybe dozens of cluster administrators too. Nonetheless, a very important networking issue has not been covered at all: security. Most corporate networks have very strict rules separating subnets and computers from reaching each other. For example, one could protect a database server so it can be reached only by the servers hosting the applications, preventing any employee from trying to access it directly; the same is true for a file server or any other security-critical component.

When creating a kubernetes cluster, thinking about security policies is important as well, even though most deployments simply have no protection, allowing every pod to access every other. Security policies aren't as difficult as they sound, even though they are laborious. Think of them as AWS security groups, where you allow inbound and outbound traffic for pods. By default, kubernetes allows every pod to reach every other pod, so to change that it is necessary to install a policy add-on, like Project Calico.

k8s-network-policies
Kubernetes Network Policies

The figure above illustrates a simple network policy where all pods can access the database except the green one. In this case an inbound rule was configured blocking the green pod from accessing the gray one. Remember that “green pod” isn't a valid object in a kubernetes cluster at all, so for such a configuration it is assumed that proper labels were previously assigned to the pod objects. Network policy constraints can take other forms as well, like blocking access based on CIDRs, protocols or even ports.
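A hedged sketch of such a policy, assuming the pods carry the hypothetical labels app=database and app=green.

```yaml
# Illustrative policy: database pods accept ingress from every pod except app=green.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: protect-database
spec:
  podSelector:
    matchLabels:
      app: database            # the policy applies to the database pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchExpressions:
        - key: app
          operator: NotIn      # allow every pod whose app label is not "green"
          values: ["green"]
```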

 

VOLUMES

Containers are ephemeral, so when a container is removed all the files inside it are deleted. In the same way, a pod is also ephemeral, so when a pod is deleted and recreated all the data on its filesystem is gone. Kubernetes volumes are the objects created to handle such behavior. Using volumes, it is possible to define a common filesystem to be shared among containers inside the same pod, even allowing it to survive container restarts; volumes also support more advanced features, for example configuring remote filesystems like NFS or high-end storage. In fact, kubernetes supports dozens of volume types; the following are a couple of them:

  • emptyDir: maps a pod filepath to a node filepath; the directory is created empty at pod startup and deleted as soon as the pod is removed. Container restarts do not affect the volume, which can be backed by memory if necessary,
  • hostPath: maps a node filepath into the pod, including its existing files; some extra configuration is supported, like creating the directory if it doesn't exist or checking the entry type (in case a file maps to a device). Be aware that writing to a hostPath-mapped file may require root permission,
  • configMap: maps a pre-configured ConfigMap into a volume, so the pod can easily access the provided information,
  • secret: behaves the same way as configMap, however it maps Secrets into the pod filesystem,
  • nfs: maps several pods to an existing NFS server, including support for multiple writers; it is very useful for on-premise setups where common files must be shared,
  • awsElasticBlockStore, azureDisk or gcePersistentDisk: used to bind a pod to a storage device on AWS, Azure or GCE,
  • cephfs, quobyte, glusterfs, flocker, storageos or portworxVolume: despite being different, all of them address volumes mounted on top of a distributed file system, with support for object storage as well,
  • vsphereVolume: supports binding a pod to a vSphere VMDK or a VSAN datastore.

k8s-volumes
Kubernetes Volumes

The figure above illustrates how a pod can be configured and how kubernetes handles it behind the scenes. The first configuration shows an emptyDir example where each container has an inner path mapped to the same node VFS; the second shows hostPath usage, mapping both containers to a specific VFS path on the node; while the third exhibits the nfs volume type, where containers are mapped to an external NFS server.
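A hedged sketch of the first two configurations, an emptyDir shared by two containers plus a hostPath mount; all names, paths and commands are illustrative.

```yaml
# Illustrative pod sharing an emptyDir between containers and mounting a host path.
apiVersion: v1
kind: Pod
metadata:
  name: volume-demo
spec:
  containers:
  - name: producer
    image: busybox:1.28
    command: ["sh", "-c", "while true; do date >> /cache/out.log; sleep 5; done"]
    volumeMounts:
    - name: cache
      mountPath: /cache        # same emptyDir seen by both containers
  - name: consumer
    image: busybox:1.28
    command: ["sh", "-c", "sleep 10 && tail -f /data/out.log"]
    volumeMounts:
    - name: cache
      mountPath: /data         # different mount path, same volume
    - name: host-logs
      mountPath: /host-logs
  volumes:
  - name: cache
    emptyDir: {}               # lives exactly as long as the pod does
  - name: host-logs
    hostPath:
      path: /var/log           # directory on the node itself
      type: Directory
```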

When allocating a volume it is possible to specify the required resources; for example, it is possible to require a volume with at least 4GB of disk available, read/write capabilities, etc. Having that, kubernetes binds the proper volume to the pod, guaranteeing the required resources. Kubernetes itself has a pre-defined set of resources that can be configured, although plugins from third-party providers usually support extended capabilities.

The same way kubernetes defined CNI to standardize container network interfaces, a CSI (Container Storage Interface) was provided as an alpha feature in version 1.9 and moved to beta in kubernetes 1.10, so storage providers can create their own plugins and evolve them independently from the kubernetes source.

Also, when mounting a volume, it is possible to define whether another pod or the node can add sub-mounts to it. For example, if a volume is mounted with None propagation, new mounts performed under it by the node or any other container will not be noticed by the original pod. On the other hand, if a volume is mounted as HostToContainer, the pod is notified when a new mount point is created by the node, but not vice-versa. Finally, there is the Bidirectional mount propagation type, which guarantees that all pods and nodes using the mount point share and see all changes under it. Enabling mount propagation may require extra docker daemon configuration.
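Mount propagation is set per volumeMount; a hedged fragment, reusing the hypothetical names from the previous sketch:

```yaml
# Illustrative volumeMount that sees mounts created later by the node.
volumeMounts:
- name: host-logs
  mountPath: /host-logs
  mountPropagation: HostToContainer   # None (the default) and Bidirectional also exist
```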

Although the examples above demonstrate how to attach a pod to a volume, something important is left behind: developers end up describing which volume to bind at development time, while the attachment to volumes should be decided by cluster administrators at deployment time. To address that, kubernetes brought in the concepts of PersistentVolumes (PV), PersistentVolumeClaims (PVC) and StorageClasses.

PersistentVolumes are objects that define volumes without binding them to pod definitions, i.e., PVs give volumes a lifecycle distinct from pods, for example so they are not deleted when a pod is removed. PersistentVolumeClaims, on the other hand, are objects that allow a pod to request volume-related resources, such as size in GB or IOPS throughput, in such a way that the pod doesn't have to know anything about the volume that will be assigned at deployment time. To describe it more clearly: before PVs and PVCs, a pod had to explicitly define the path, host or any other attribute of the volume it would attach to; with PVs and PVCs, all a pod must do is state how much disk space, which policies and how many IOPS it needs.
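A hedged PV/PVC pair; the NFS server address, sizes and labels are illustrative assumptions.

```yaml
# Illustrative PV created by the administrator and PVC requested on behalf of a pod.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nfs-01
  labels:
    speed: standard
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.0.50       # hypothetical NFS server
    path: /exports/data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 4Gi             # any PV with at least 4Gi can satisfy this claim
  selector:
    matchLabels:
      speed: standard
```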

The figure below illustrates the workflow for a pod to receive a physical volume it can attach to and use. Notice that the pod requests a PVC (lines 4 to 7), which in turn defines the demanded resources (lines 8 to 10) and filters by labels (lines 5 to 7), helping kubernetes match the most appropriate PV, which in turn was previously created by a system administrator or storage administrator.

k8s-PV-PVC
Kubernetes PVC and PV

In the example above, the volumeMode attribute was not specified, which means the default, Filesystem, was selected; kubernetes also supports the Block volumeMode, which means the pod will access the block device directly, without going through a Linux filesystem. It may seem to make no sense, since most filesystems sit on top of a block device, but keep in mind that accessing the device directly can increase performance, which can be very useful for applications that support such a feature (e.g. DBMSs or NoSQL databases).

PVs can handle several distinct policies when a pod is deleted, from deleting the volume to recycling it so another pod can use it. A PV can be bound to a single PVC at a time; however, since several pods can reference the same PVC, kubernetes will always bind a pod's claim request to a PV available at that specific time. PVs also have an attribute indicating their status:

  • Available: the PV is ready but not yet requested by any PVC,
  • Bound: the PV is bound to a specific PVC,
  • Released: the PV has been released and has not been requested by any other PVC,
  • Failed: the PV could not be allocated and failed to attend a PVC.

At this point, kubernetes has covered almost everything a distributed application would need, yet there is still a piece missing: using PVs and PVCs requires SysOps to define all PVs beforehand, which is laborious and very static in a dynamic environment. Here is where StorageClasses come onto the scene: they define the provisioner, parameters and reclaim policies used to dynamically allocate PersistentVolumes.

Basically, a PersistentVolume points to a StorageClass in its object description, the same way a PVC describes which StorageClass shall be used during a claim request; kubernetes makes sure they match and dynamically allocates the volume. How does that magic happen? Defining the provisioner in the StorageClass is like pointing to the driver that shall be used to allocate the disks; it is, one more time, the kubernetes architecture allowing distinct vendors to implement plugin behavior. A cluster administrator can define dozens or hundreds of distinct StorageClasses so different flavors of pods can request what best fits their needs. If no StorageClass is specified, kubernetes uses a default one, which can also be defined by the SysOps as needed.
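A hedged StorageClass sketch using the AWS EBS provisioner as an example, plus a PVC referencing it; the class name and parameters are illustrative.

```yaml
# Illustrative StorageClass: PVs are provisioned on demand as gp2 EBS volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fast-claim
spec:
  storageClassName: fast       # triggers dynamic provisioning through the class above
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```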

This article touched several aspects of kubernetes, from its architecture to its main features. The next one will address a use case with several services, using almost all the described features, in a step-by-step guide on how to configure a kubernetes cluster from scratch in a virtualized environment.

 
