47 Things To Become a Kubernetes Expert

I've been leading a project team for 3 years and have developed a large system around Kubernetes.

In this article, I'd like to share my knowledge and findings about what I think is important when developing custom Kubernetes controllers.

Please leave your comments and/or suggestions on my Twitter, if any.

API
Implementing controllers
Components and their collaborations
Resources
Networking
Monitoring
Access control

API

List extension mechanisms of kube-apiserver
Sample answer
- Custom resources: to define custom resources using OpenAPI schema
- Aggregation layer: to configure reverse-proxy servers to provide additional API groups
- Admission webhooks: to validate or mutate resources before saving them to etcd
- Authentication webhook: to verify authentication tokens with external authentication servers
- Authorization webhook: to authorize requests with external authorization servers
Describe the operation sequence of admission controls
Sample answer
1. Authentication and authorization
2. Mutating webhook
3. Object schema validation
4. Validating webhook
5. Saving data to etcd
cf. A Guide to Kubernetes Admission Controllers
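
The sequence above can be sketched as a toy pipeline. This is only an illustration in Python, not kube-apiserver code: `admit`, its parameters, and the helper functions are hypothetical names, a dict stands in for etcd, and authentication/authorization (step 1) is assumed to have succeeded already.

```python
def admit(obj, mutators, schema_ok, validators, storage):
    """Toy model of steps 2 to 5 of the admission sequence."""
    # Step 2: mutating webhooks may change the object.
    for mutate in mutators:
        obj = mutate(obj)
    # Step 3: the (possibly mutated) object must match the schema.
    if not schema_ok(obj):
        raise ValueError("schema validation failed")
    # Step 4: validating webhooks can only accept or reject.
    for validate in validators:
        if not validate(obj):
            raise ValueError("rejected by a validating webhook")
    # Step 5: persist the object (the dict stands in for etcd).
    storage[obj["name"]] = obj
    return obj


def add_team_label(obj):
    return {**obj, "labels": {**obj.get("labels", {}), "team": "a"}}


def has_name(obj):
    return isinstance(obj.get("name"), str)


def has_team_label(obj):
    return "team" in obj.get("labels", {})


storage = {}
admit({"name": "demo"}, [add_team_label], has_name, [has_team_label], storage)
```

Because validation runs after mutation, the validating step sees the label that the mutating step added.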
Describe the problem and solution when multiple mutating webhooks edit the same resource

Sample answer
There is no way to specify the order of applying mutating webhooks for kube-apiserver.

Suppose we have two mutating webhooks to edit Pods, one is to add a volume mount configuration to all containers, and another is to add a container. To make all containers have the volume mount configuration, the first webhook needs to be called after the second.

We can set the reinvocation policy of the first webhook to IfNeeded to make the first called after the second.
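
The effect of the reinvocation policy can be sketched with a toy simulation. The functions below are hypothetical stand-ins for the two webhooks, and `run_mutating` is a simplified model of the behavior (one extra call for IfNeeded webhooks when a later webhook changed the object), not the actual kube-apiserver algorithm.

```python
def add_mount(pod):
    """Webhook 1: add a volume mount to every container that lacks it."""
    changed = False
    for container in pod["containers"]:
        if "cache" not in container["mounts"]:
            container["mounts"].append("cache")
            changed = True
    return changed


def add_sidecar(pod):
    """Webhook 2: add a sidecar container once."""
    if any(c["name"] == "sidecar" for c in pod["containers"]):
        return False
    pod["containers"].append({"name": "sidecar", "mounts": []})
    return True


def run_mutating(pod, webhooks):
    """webhooks: list of (function, reinvocation_policy) in call order."""
    changed_at = []  # indexes of the webhooks that modified the object
    for i, (func, _policy) in enumerate(webhooks):
        if func(pod):
            changed_at.append(i)
    # Re-call IfNeeded webhooks when a later webhook modified the object.
    for i, (func, policy) in enumerate(webhooks):
        if policy == "IfNeeded" and any(j > i for j in changed_at):
            func(pod)
    return pod


pod = {"containers": [{"name": "app", "mounts": []}]}
run_mutating(pod, [(add_mount, "IfNeeded"), (add_sidecar, "Never")])
```

With `IfNeeded`, the sidecar added by the second webhook also ends up with the mount; with `Never` on both, it would not.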
Describe what happens when a call of an admission webhook fails

Sample answer
It depends on the setting of failure policy of the webhook. For admissionregistration.k8s.io/v1, the default is Fail so that the request is rejected.
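
A minimal sketch of this decision, assuming the two failure policy values Fail and Ignore. `admit_with_webhook` and `unreachable` are hypothetical names, and an exception stands in for a webhook call that errors out or times out.

```python
def admit_with_webhook(call_webhook, failure_policy="Fail"):
    """Return True if the request is admitted.

    call_webhook returns True (allow) or False (deny), and raises an
    exception when the webhook cannot be called successfully.
    """
    try:
        return call_webhook()
    except Exception:
        # The call itself failed, so the failure policy decides.
        return failure_policy == "Ignore"


def unreachable():
    raise ConnectionError("webhook timed out")


rejected_by_default = not admit_with_webhook(unreachable)
admitted_with_ignore = admit_with_webhook(unreachable, "Ignore")
```

Note that the failure policy only matters when the call fails; an explicit denial from the webhook rejects the request under either policy.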
Describe how kube-apiserver prevents resource editing conflicts

Sample answer
Every resource saved in etcd has a resource version (metadata.resourceVersion) that changes every time the resource is edited. kube-apiserver uses it to reject conflicting edits: an update request is rejected if the resource version it carries differs from the saved one.

This mechanism is called optimistic locking and is applied to all PUT (update) requests.

cf. Optimistic lock control for Kubernetes API Server object modification
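
Optimistic locking can be illustrated with a small in-memory store. This is a sketch under simplifying assumptions, not how kube-apiserver is implemented: `Store` and `Conflict` are hypothetical names, and the exception stands in for the HTTP 409 Conflict response.

```python
class Conflict(Exception):
    """Stands in for the HTTP 409 Conflict response."""


class Store:
    def __init__(self):
        self._objects = {}
        self._revision = 0

    def _save(self, name, spec):
        self._revision += 1
        self._objects[name] = {"resourceVersion": str(self._revision), "spec": spec}
        return dict(self._objects[name])

    def create(self, name, spec):
        return self._save(name, spec)

    def get(self, name):
        return dict(self._objects[name])

    def update(self, name, obj):
        # Reject the update if the client edited a stale copy.
        if obj["resourceVersion"] != self._objects[name]["resourceVersion"]:
            raise Conflict(name)
        return self._save(name, obj["spec"])


store = Store()
first = store.create("demo", {"replicas": 1})
second = dict(first)  # another client read the same version
store.update("demo", {**first, "spec": {"replicas": 2}})
try:
    store.update("demo", {**second, "spec": {"replicas": 3}})
    conflicted = False
except Conflict:
    conflicted = True
```

The second client has to read the resource again and retry its change on top of the new version.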
List and describe available PATCH methods
Sample answer
- JSON Patch: can be used for both built-in and custom resources.
- JSON Merge Patch: ditto. For the difference from JSON patch, read http://erosb.github.io/post/json-patch-vs-merge-patch/ .
- Strategic Merge Patch: can be used for only built-in resources.
- Server Side Apply: can be used for both built-in and custom resources.
  This works substantially differently from others. Fields in resources may have owners, and the owners manage only their fields.
cf. PATCH operations
cf. An example of using dynamic client of k8s.io/client-go
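
To make JSON Merge Patch concrete, here is a short sketch of the merge algorithm described in RFC 7386. It is an illustration written for this article, not code taken from Kubernetes.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) and return the result."""
    if not isinstance(patch, dict):
        return patch  # non-object patches replace the target entirely
    if not isinstance(target, dict):
        target = {}
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null deletes the key
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


obj = {"metadata": {"labels": {"a": "1", "b": "2"}}, "spec": {"ports": [80, 443]}}
patch = {"metadata": {"labels": {"b": None, "c": "3"}}, "spec": {"ports": [8080]}}
patched = merge_patch(obj, patch)
```

Maps are merged key by key, but lists are replaced as a whole. Strategic Merge Patch differs mainly in that it can merge list items for built-in resources.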
Describe subresources

Sample answer
A subresource is a part of a resource that is served at its own REST API endpoint, separate from the main resource. The most common subresource is /status, which represents the status element.

Since a subresource has its own API endpoint and verbs, its RBAC permissions are independent of those of the main resource.

cf. Types (Kinds)
Describe what is the storage version of API

Sample answer
Each Kubernetes API is versioned. When an incompatible change is introduced to an API, its version is bumped.

When an API resource is saved in etcd, the resource is converted to a specific version of the API and serialized. This specific version is called the storage version of the API.
Describe how to bump Kubernetes API version step by step
Sample answer
1. Introduce a new API version. The storage version of the API stays the old version.
2. Change the storage version to the new one after the new version gets stabilized and matured.
3. Update the saved API resources in etcd to the new version (by updating them).
4. Deprecate the old API version. Tell users to update their resources to the new version.
5. Remove the old API version after a while.
cf. The Future of Your CRDs – Evolving an API
Describe why conversion webhooks have to implement a round-trip conversion

Sample answer
Suppose that an API sets v1 as its storage version.

When creating the API resource as v2, the conversion webhook needs to convert the resource from v2 to v1. kube-apiserver then saves the resource as v1 in etcd.

When retrieving the API resource as v2, the conversion webhook needs to convert back the saved resource from v1 to v2.

Clearly, the conversion webhook needs to implement a round-trip conversion.
Describe how to avoid missing information in round-trip conversion

Sample answer
The common technique is to save the missing information as annotations. For instance, HorizontalPodAutoscaler saves fields added in v2 as annotations in v1.

cf. Horizontal Pod Autoscaler
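
The annotation technique can be sketched as follows. Everything here is hypothetical: the resource has a `replicas` field in v1, v2 adds `maxReplicas`, v1 is the storage version, and `example.com/v2-fields` is a made-up annotation key. A real conversion webhook would exchange ConversionReview objects instead of plain dicts.

```python
import json

ANNOTATION = "example.com/v2-fields"  # hypothetical annotation key
V2_ONLY_FIELDS = ("maxReplicas",)     # hypothetical field added in v2


def v2_to_v1(obj):
    """Convert to the storage version, keeping v2-only fields in an annotation."""
    spec = dict(obj["spec"])
    extra = {k: spec.pop(k) for k in V2_ONLY_FIELDS if k in spec}
    annotations = dict(obj.get("annotations", {}))
    if extra:
        annotations[ANNOTATION] = json.dumps(extra)
    return {"apiVersion": "v1", "annotations": annotations, "spec": spec}


def v1_to_v2(obj):
    """Convert back, restoring v2-only fields from the annotation."""
    annotations = dict(obj.get("annotations", {}))
    extra = json.loads(annotations.pop(ANNOTATION, "{}"))
    return {"apiVersion": "v2", "annotations": annotations, "spec": {**obj["spec"], **extra}}


original = {"apiVersion": "v2", "annotations": {}, "spec": {"replicas": 1, "maxReplicas": 5}}
stored = v2_to_v1(original)
restored = v1_to_v2(stored)
```

The round trip is lossless because the field that v1 cannot express travels along in the annotation.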
Describe how kube-apiserver and aggregation API servers authenticate/authorize each other

Sample answer
They mutually authenticate each other using TLS. Read Authentication Flow for details.

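As to authorization, aggregation API servers need permission to create SubjectAccessReview resources in kube-apiserver. To grant it, bind the built-in ClusterRole called system:auth-delegator to the ServiceAccount of the aggregation API server. In addition, bind the built-in Role called extension-apiserver-authentication-reader in the kube-system namespace to the same ServiceAccount so that the server can read the authentication configuration stored in the extension-apiserver-authentication ConfigMap.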

Implementing controllers

Describe what are Event resources and how long they live in kube-apiserver

Sample answer
Event is a resource to record events that happened to a target resource. kubectl describe pods NAME displays the events of the Pod in a readable manner.

By default, kube-apiserver keeps Events for only one hour. The retention period can be changed with the --event-ttl flag.

cf. Emitting, Consuming, and Presenting: The Event Lifecycle
What namespace should be used for Events of cluster resources such as Node?

Sample answer
default namespace.
Describe what is reconciliation in Kubernetes

Sample answer
Reconciliation is a process to make sure the actual state of the world matches the desired state. In other terms, reconciliation is the implementation of declarative API.

cf. What is "reconciliation"?
Describe how to watch resources in kube-apiserver

Sample answer
kube-apiserver provides a mechanism called watch that streams changes of API resources to clients. Watching is much more efficient than polling kube-apiserver periodically.

cf. Efficient detection of changes
Describe how Delete REST API works

Sample answer
Delete REST API begins the deletion of a given resource. The completion of the REST API call does not necessarily mean that the resource is removed from kube-apiserver.

kubectl delete waits for the completion of the deletion by watching kube-apiserver until the resource is removed. With --wait=false, kubectl delete does not wait for the completion.
Describe what is metadata.deletionTimestamp and how it works

Sample answer
metadata.deletionTimestamp is usually not set. It is set when a resource cannot be deleted immediately. The timestamp indicates the schedule of the deletion.

For Pods, this field is used to implement graceful termination. Containers get SIGTERM as soon as the deletion timestamp is set, and get SIGKILL after the timestamp expires. The Pod resource itself will not be removed until kubelet completes the deletion of Pod processes.

The deletion timestamp is also set when metadata.finalizers is not empty as described below.

cf. Metadata
Describe what is metadata.finalizers and how it works

Sample answer
While metadata.finalizers is not empty, the resource will not be removed from kube-apiserver. A controller can do some finalization process for deleting objects by inserting an item to metadata.finalizers. When the controller completes the finalization, it should remove the item from metadata.finalizers.

As soon as metadata.finalizers becomes empty, kube-apiserver deletes the resource from etcd.

cf. Using Finalizers
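
The interplay between deletion, metadata.deletionTimestamp, and metadata.finalizers can be modelled with a few lines. This is a toy model with hypothetical function names and a plain dict as the store; `example.com/cleanup` is a made-up finalizer name.

```python
def delete(store, name, now):
    """Toy model of the Delete REST API."""
    meta = store[name]["metadata"]
    if meta.get("finalizers"):
        # The object cannot be removed yet: only mark it as being deleted.
        meta.setdefault("deletionTimestamp", now)
    else:
        del store[name]


def remove_finalizer(store, name, finalizer):
    """Called by a controller when its finalization is done."""
    meta = store[name]["metadata"]
    meta["finalizers"].remove(finalizer)
    # Once the list is empty, a resource marked for deletion goes away.
    if not meta["finalizers"] and "deletionTimestamp" in meta:
        del store[name]


store = {"demo": {"metadata": {"finalizers": ["example.com/cleanup"]}}}
delete(store, "demo", now=100.0)
still_present = "demo" in store
marked = store["demo"]["metadata"].get("deletionTimestamp")
remove_finalizer(store, "demo", "example.com/cleanup")
```

A controller typically treats a set deletion timestamp as the signal to run its cleanup and then remove its own finalizer.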
Describe how k8s.io/client-go/tools/leaderelection works

Sample answer
The package implements leader election by using kube-apiserver resources. Currently, the recommended resource to be used is Lease.

This package does not guarantee that only one client is acting as a leader at any given moment. In other words, it does not provide fencing.

Implementation example: https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go
Describe what is metadata.ownerReferences and how it works

Sample answer
The field is used by the garbage collector to implement cascading deletion of resources.

The field is also used by controllers to identify the parent resource.

cf. Garbage Collection
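
Cascading deletion can be sketched as a walk over owner references. This is a simplified model, not the garbage collector's real algorithm: objects are reduced to a uid and a set of owner uids, and a dependent is removed once its last owner is gone.

```python
def collect_garbage(objects, deleted_uid):
    """objects maps uid -> set of owner uids; returns the removed uids."""
    removed = []
    pending = [deleted_uid]
    while pending:
        gone = pending.pop()
        objects.pop(gone, None)
        for uid, owners in list(objects.items()):
            if gone in owners:
                owners.discard(gone)
                if not owners:  # the last owner is gone
                    removed.append(uid)
                    pending.append(uid)
    return removed


objects = {
    "deploy": set(),
    "rs": {"deploy"},
    "pod-1": {"rs"},
    "pod-2": {"rs"},
    "other": set(),
}
removed = collect_garbage(objects, "deploy")
```

Deleting the Deployment removes its ReplicaSet and, through it, the Pods, while the unrelated object stays.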
What happens to PersistentVolumeClaims instantiated from a StatefulSet when the StatefulSet is deleted?

Sample answer
They will remain.

If you want to delete the PVCs along with the StatefulSet, set the PVCs' metadata.ownerReferences to the StatefulSet or to another resource. For example, Elastic Cloud on Kubernetes (ECK) sets the owner of the PVCs to the Elasticsearch custom resource.

Components and their collaborations

Describe the roles of these components:
- etcd
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- kubelet
- kube-proxy
- containerd
- CoreDNS
Sample answer
- etcd: to store resource objects persistently.
- kube-apiserver: to access etcd and provide the REST API for other components.
- kube-controller-manager: to run a set of controllers that watch and edit resources in kube-apiserver.
- kube-scheduler: to schedule new Pods to a Node.
- kubelet: to run Pods on each Node.
- kube-proxy: to configure network rules on each Node for Services.
- containerd: to accept CRI requests from kubelet and run containers.
- CoreDNS: to provide internal DNS for Service names.
Describe the behavior of each component from the creation of a Pod to the running of the containers inside
Sample answer
1. kube-apiserver saves a new Pod resource in etcd
2. kube-scheduler finds the new Pod
3. kube-scheduler allocates a Node to the new Pod based on available resources and other conditions
4. kubelet on the allocated Node finds the new Pod
5. kubelet initializes the Pod runtime as follows:
  1. kubelet sends a CRI request to CRI runtime such as containerd to create an infrastructure container
  2. CRI runtime calls CNI plugins to initialize the network namespace of the Pod
6. kubelet sequentially requests CRI runtime to run spec.initContainers, if any
7. kubelet concurrently requests CRI runtime to run spec.containers
Describe who creates default ServiceAccount in each namespace and when

Sample answer
The default ServiceAccount does not exist immediately after a namespace is created.

The default ServiceAccount is created by kube-controller-manager a little after the namespace is created. Similarly, the Secret token for the default ServiceAccount is created by kube-controller-manager a little after default is created.

For this reason, creating a Pod in a newly created namespace sometimes fails. It is safe to create a Deployment instead.
Explain what happens to a Pod when kubelet or Node running the Pod cannot communicate with kube-apiserver
Sample answer
kubelet sends periodic heartbeats to kube-apiserver. If the heartbeats stop, the node lifecycle controller in kube-controller-manager adds taints to the Node resource.

Pods can tolerate the taints up to 300 seconds because they have the following tolerations by default:
```
 tolerations:
 - effect: NoExecute
   key: node.kubernetes.io/not-ready
   operator: Exists
   tolerationSeconds: 300
 - effect: NoExecute
   key: node.kubernetes.io/unreachable
   operator: Exists
   tolerationSeconds: 300
```
When 300 seconds have elapsed, graceful termination is initiated. As spec.terminationGracePeriodSeconds is 30 seconds by default, metadata.deletionTimestamp is normally set to 30 seconds after that point.

When an additional 30 seconds have elapsed, the Pod transitions to Terminating status. However, since kubelet cannot see the status of the Pod, the Pod will remain running.

cf. Taint Nodes by Condition, Taint based Evictions
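
The timing described above is simple arithmetic. The helper below is a hypothetical illustration that only restates the default values from the text (300 seconds of toleration, 30 seconds of grace period).

```python
def eviction_timeline(toleration_seconds=300, grace_period_seconds=30):
    """Seconds, counted from when the taint is added, for each step."""
    deletion_requested = toleration_seconds
    deletion_timestamp = deletion_requested + grace_period_seconds
    return {
        "deletion_requested": deletion_requested,
        "deletion_timestamp": deletion_timestamp,
    }


defaults = eviction_timeline()
```

With the defaults, the deletion timestamp falls 330 seconds after the taint was added. A Pod with a shorter tolerationSeconds is evicted correspondingly earlier.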
Describe how ReplicaSet controller works if a Pod is Terminating

Sample answer
ReplicaSet controller usually adds a new Pod in a timely manner.
Describe how StatefulSet controller works if a Pod is Terminating

Sample answer
StatefulSet controller cannot add a new Pod because Pods in a StatefulSet have stable network IDs.
Explain why fencing a failing Node is important before removing Node resource

Sample answer
When kubelet cannot communicate with kube-apiserver, Pods on the Node become Terminating but will not be removed. Such Pods can be removed if the Node resource is removed from kube-apiserver.

However, Pod processes may still live if the problem is merely communication between kubelet and kube-apiserver. In this case, removing Nodes and Pods might cause split brain syndrome because a new Pod having the same ID of a StatefulSet would run on another Node.

To avoid such incidents, a failing Node should be killed with STONITH or something like that before removing the Node resource.

Resources

Explain the difference between resources.limits and resources.requests of a container

Sample answer
resources.limits sets an upper limit of resource usage to containers using Linux cgroups.

resources.requests is used by kube-scheduler to choose available Nodes. resources.requests.cpu is also used to distribute CPU time among containers using CFS shares.

cf. Setting the right requests and limits in Kubernetes
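
As a rough illustration of how CPU values map to cgroup settings, the sketch below converts a CPU request to CFS shares and a CPU limit to a CFS quota. The constants (1024 shares per CPU, a minimum of 2 shares, a 100 ms period) reflect my understanding of the cgroup v1 conversion used by kubelet; treat them as assumptions rather than an authoritative reference.

```python
MIN_SHARES = 2               # assumed lower bound for cpu.shares
SHARES_PER_CPU = 1024        # assumed shares for one full CPU
MILLI_CPU_PER_CPU = 1000
QUOTA_PERIOD_US = 100_000    # assumed CFS period (100 ms)


def milli_cpu_to_shares(milli_cpu):
    """CPU request in millicores -> cgroup v1 cpu.shares."""
    if milli_cpu == 0:
        return MIN_SHARES
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU)


def milli_cpu_to_quota(milli_cpu, period_us=QUOTA_PERIOD_US):
    """CPU limit in millicores -> CFS quota in microseconds per period."""
    return milli_cpu * period_us // MILLI_CPU_PER_CPU


shares = milli_cpu_to_shares(500)   # request of 500m
quota = milli_cpu_to_quota(250)     # limit of 250m
```

Shares are relative weights that only matter under contention, whereas the quota is a hard cap. This is why requests influence how CPU time is distributed and limits cause throttling.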
What happens when a container has only resources.limits.memory?

Sample answer
The container is modified to have resources.requests.memory with the same value of resources.limits.memory.

This is the same for resources.limits.cpu.

cf. Create a Pod that gets assigned a QoS class of Guaranteed
What happens when a container consumes more memory than requested?

Sample answer
Pods that use more memory than they requested become candidates for eviction when the Node is running out of memory.

cf. Interactions between Pod priority and quality of service
Describe Quality of Service classes for Pods

Sample answer
There are three classes, namely, Guaranteed, Burstable, and BestEffort.

Pods that have requests and limits for both CPU and memory, and have the same value for requests and limits are classified into Guaranteed. Guaranteed Pods will not be evicted, except in exceptional cases.

Pods that do not meet the Guaranteed criteria but have at least one resource request or limit are classified into Burstable. Other Pods are classified into BestEffort. Burstable Pods are less likely to be evicted than BestEffort Pods.

cf. Evicting end-user Pods
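
The classification rules can be written down as a small function. This is a simplified sketch: `qos_class` is a hypothetical helper, quantities are compared as plain strings (so "1" and "1000m" would count as different), and a missing request is defaulted to the limit of the same resource as described earlier.

```python
def qos_class(containers):
    """containers: list of {"requests": {...}, "limits": {...}} dicts."""
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A missing request defaults to the limit of the same resource.
        requests = {**limits, **container.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


limits_only = qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}])
request_only = qos_class([{"requests": {"cpu": "100m"}}])
nothing_set = qos_class([{}])
```

A Pod whose containers specify only limits for both CPU and memory ends up Guaranteed because the requests are filled in with the same values.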
Describe PriorityClass for Pods

Sample answer
PriorityClass is used by kube-scheduler to perform Pod preemption. Preemption is an operation that removes a low-priority Pod from a Node and schedules a high-priority Pod to the Node.

cf. Pod Priority and Preemption
Is a Pod evicted when the Node is running out of CPU time?

Sample answer
No.

For this reason, setting a proper CPU request is important for production environments.

Networking

Describe types of Service, namely, ClusterIP, NodePort, and LoadBalancer

Sample answer
ClusterIP is the most basic Service type. It provides a virtual IP address to service consumers to access backend Pods.

NodePort allocates a port number in addition to the virtual IP provided by the ClusterIP type. Service consumers can reach backend Pods by connecting to any Node on that port.

LoadBalancer asks an external load balancer to assign a virtual IP address and to route packets sent to that address to backend Pods.
Explain the relationship between Service and Endpoints (EndpointSlices)

Sample answer
A Service is accompanied by an Endpoints resource of the same name, and by one or more EndpointSlices that reference the Service through the kubernetes.io/service-name label. Endpoint(Slice)s represent the addresses of backend Pods.

Endpoint(Slice)s are created and updated automatically by kube-controller-manager if the Service has a Pod selector. If not, Endpoint(Slice)s need to be maintained by other means.
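
What the controller computes for a Service with a selector can be approximated as follows. This is a toy model: `endpoint_addresses` is a hypothetical helper, Pods are reduced to an IP, labels, and a readiness flag, and only ready Pods are listed, on the assumption that unready Pods are not load balancing targets.

```python
def endpoint_addresses(selector, pods):
    """Return the IPs of ready Pods whose labels match the selector."""
    return [
        pod["ip"]
        for pod in pods
        # Every key/value pair of the selector must appear in the labels.
        if pod["ready"] and selector.items() <= pod["labels"].items()
    ]


pods = [
    {"ip": "10.0.0.1", "labels": {"app": "web", "tier": "front"}, "ready": True},
    {"ip": "10.0.0.2", "labels": {"app": "web"}, "ready": False},
    {"ip": "10.0.0.3", "labels": {"app": "db"}, "ready": True},
]
addresses = endpoint_addresses({"app": "web"}, pods)
```

Only the first Pod is selected here: the second matches the selector but is not ready, and the third has different labels.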
Describe the usage of spec.containers.ports of Pod and EXPOSE in Dockerfiles
Sample answer
If defined with names, spec.containers.ports can be used in livenessProbe, readinessProbe, or Service's targetPort field to reference the port by the name.
```
 containers:
 - ports:
   - name: health
     containerPort: 8080
     protocol: TCP
   livenessProbe:
     httpGet:
       port: health
       path: /healthz
```
EXPOSE in Dockerfiles is merely documentation.

Neither of them actually publishes the port. A container may listen on ports other than the specified ones, and it may not listen on the specified ports at all.
Explain how packets from the outside reach Pods if the Service's spec.externalTrafficPolicy is set to Local

Sample answer
spec.externalTrafficPolicy is mainly for LoadBalancer type Services. If this field is empty or Cluster (default), kube-proxy rewrites packets' source address to the Node address and forwards them to the destination Pod. In this mode, the destination Pod may be running on another Node.

If this field is Local, kube-proxy does not rewrite the source address. In this mode, the destination Pod must be running on the same Node where kube-proxy is running. Therefore, the external load balancer must route packets only to the Nodes where the destination Pods are running.

For example, MetalLB advertises the virtual address only from the Nodes where the destination Pods are running.

cf. Preserving the client source IP

Monitoring

Describe what happens when a readinessProbe fails

Sample answer
A readinessProbe checks if the container is ready to accept requests.

If a readinessProbe fails, the Pod becomes unready and is excluded from Service load balancing targets.
Describe what happens when livenessProbe fails

Sample answer
A livenessProbe checks if the container is alive.

If a livenessProbe fails, the container process is killed and restarted.

Access control

Can a Role (not a ClusterRole) grant access to cluster-scoped resources?

Sample answer
No.
Can a ClusterRole grant access to namespace-scoped resources?

Sample answer
Yes. Such a ClusterRole can be used to grant access to resources in any namespace.

cf. Understanding Kubernetes RBAC
Is it a good idea to edit the privilege of the default ServiceAccount?

Sample answer
Definitely not.

The default ServiceAccount is used by any Pods that do not specify ServiceAccount. Editing the privilege of the default ServiceAccount would cause unexpected behavior.
Describe how kube-apiserver prevents privilege escalation

Sample answer
kube-apiserver performs a check whenever a subject (user or ServiceAccount) creates or updates a (Cluster)RoleBinding. If the subject does not already hold all the privileges that the binding would grant to other entities, kube-apiserver denies the operation.

If you are implementing a controller that dynamically grants some privilege to other ServiceAccounts, make sure that the ServiceAccount of the controller has the same privilege.

cf. Privilege escalation prevention and bootstrapping
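
The core of the check can be sketched as a subset test. This is a deliberately simplified model: rules are reduced to (verb, resource) pairs, wildcards and API groups are not modelled, and `can_grant` is a hypothetical name.

```python
def can_grant(granter_rules, granted_rules):
    """True if the granter already holds every rule it wants to grant."""
    return set(granted_rules) <= set(granter_rules)


controller_rules = {("get", "pods"), ("list", "pods"), ("create", "rolebindings")}
allowed = can_grant(controller_rules, {("get", "pods"), ("list", "pods")})
denied = not can_grant(controller_rules, {("delete", "pods")})
```

In the second case, the controller would first need the delete permission on Pods itself before it could hand that permission to another ServiceAccount.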
Describe what is user impersonation

Sample answer
If granted the permission, a user can act as another user and/or as a member of another group through HTTP request headers. User impersonation should be granted only to cluster administrators.

When using kubectl, --as=USER and --as-group=GROUP command-line flags set impersonation headers.

cf. User impersonation
Describe view, edit, admin ClusterRoles

Sample answer
These are called aggregated ClusterRoles. An aggregated ClusterRole merges privileges of other ClusterRoles that have special labels.

When defining new custom resources, consider aggregating the appropriate privileges into these ClusterRoles.

cf. User-facing roles
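
Aggregation by label can be sketched like this. The model is simplified: ClusterRoles are plain dicts, `aggregate` is a hypothetical helper, and `widget-viewer` with the `example.com` API group is a made-up ClusterRole for a custom resource. The label key follows the rbac.authorization.k8s.io/aggregate-to-view convention used for the view ClusterRole.

```python
def aggregate(cluster_roles, label):
    """Merge the rules of every ClusterRole that carries label: "true"."""
    rules = []
    for role in cluster_roles:
        if role.get("labels", {}).get(label) == "true":
            rules.extend(role["rules"])
    return rules


roles = [
    {
        "name": "widget-viewer",  # hypothetical ClusterRole
        "labels": {"rbac.authorization.k8s.io/aggregate-to-view": "true"},
        "rules": [
            {"apiGroups": ["example.com"], "resources": ["widgets"],
             "verbs": ["get", "list", "watch"]},
        ],
    },
    {
        "name": "unrelated",
        "labels": {},
        "rules": [{"apiGroups": [""], "resources": ["secrets"], "verbs": ["get"]}],
    },
]
view_rules = aggregate(roles, "rbac.authorization.k8s.io/aggregate-to-view")
```

Only the labeled ClusterRole contributes its rules, so users bound to view gain read access to the custom resource without the view ClusterRole being edited directly.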