47 Things To Become a Kubernetes Expert

I've been leading a project team for 3 years and have developed a large system around Kubernetes.

In this article, I'd like to share what I have learned and what I think is important when developing custom Kubernetes controllers.

If you have comments or suggestions, please leave them on my Twitter.

API

  1. List extension mechanisms of kube-apiserver

    Sample answer

  2. Describe the operation sequence of admission controls

    Sample answer

    1. Authentication and authorization
    2. Mutating webhook
    3. Object schema validation
    4. Validating webhook
    5. Saving data to etcd

    cf. A Guide to Kubernetes Admission Controllers

  3. Describe the problem and solution when multiple mutating webhooks edit the same resource

    Sample answer kube-apiserver provides no way to specify the order in which mutating webhooks are applied.

    Suppose we have two mutating webhooks that edit Pods: one adds a volume mount configuration to all containers, and the other adds a container. For every container to get the volume mount configuration, the first webhook needs to be called after the second.

    We can set the reinvocation policy of the first webhook to IfNeeded so that it is called again if the second webhook modifies the Pod after the first one has run.
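
    The effect of the reinvocation policy can be illustrated with a toy model. This is only a sketch of the idea, not how kube-apiserver is implemented; the two webhook functions and the Pod dictionary layout are hypothetical.

```python
import copy


def add_volume_mount(pod):
    """Hypothetical webhook A: add a volume mount to every container."""
    for container in pod["containers"]:
        if "shared" not in container["volumeMounts"]:
            container["volumeMounts"].append("shared")


def add_sidecar(pod):
    """Hypothetical webhook B: add a sidecar container once."""
    if all(c["name"] != "sidecar" for c in pod["containers"]):
        pod["containers"].append({"name": "sidecar", "volumeMounts": []})


def admit(pod, webhooks):
    """Apply (mutate, reinvocation_policy) pairs in the given order."""
    pod = copy.deepcopy(pod)
    seen = {}
    for index, (mutate, _policy) in enumerate(webhooks):
        mutate(pod)
        seen[index] = copy.deepcopy(pod)  # the object as this webhook left it
    # Call IfNeeded webhooks again if a later webhook changed the object.
    for index, (mutate, policy) in enumerate(webhooks):
        if policy == "IfNeeded" and seen[index] != pod:
            mutate(pod)
    return pod


pod = {"containers": [{"name": "app", "volumeMounts": []}]}
never = admit(pod, [(add_volume_mount, "Never"), (add_sidecar, "Never")])
if_needed = admit(pod, [(add_volume_mount, "IfNeeded"), (add_sidecar, "Never")])
```

    With Never, the sidecar added by the second webhook ends up without the volume mount. With IfNeeded, the first webhook runs once more and fixes it, which is also why such a webhook should be idempotent.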

  4. Describe what happens when a call of an admission webhook fails

    Sample answer It depends on the setting of failure policy of the webhook. For admissionregistration.k8s.io/v1, the default is Fail so that the request is rejected.

  5. Describe how kube-apiserver prevents resource editing conflicts

    Sample answer Every resource saved in etcd has a resource version that changes each time the resource is edited. kube-apiserver uses it to detect conflicts: an edit request is rejected if the resource version it carries differs from the saved one.

    This mechanism is called optimistic locking and is applied to all PUT (update) requests.

    cf. Optimistic lock control for Kubernetes API Server object modification
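
    Optimistic locking and the usual read-modify-write retry loop can be sketched as follows. The Store class is a hypothetical in-memory stand-in for kube-apiserver, and Conflict stands in for an HTTP 409 response.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class Store:
    """Toy in-memory stand-in for kube-apiserver backed by etcd."""

    def __init__(self):
        self._objects = {}
        self._revision = 0

    def _save(self, name, spec):
        self._revision += 1
        self._objects[name] = {"resourceVersion": str(self._revision), "spec": dict(spec)}

    def create(self, name, spec):
        self._save(name, spec)

    def get(self, name):
        obj = self._objects[name]
        return {"resourceVersion": obj["resourceVersion"], "spec": dict(obj["spec"])}

    def update(self, name, obj):
        # Reject the update if the caller edited a stale copy.
        if obj["resourceVersion"] != self._objects[name]["resourceVersion"]:
            raise Conflict(name)
        self._save(name, obj["spec"])


def update_with_retry(store, name, mutate, attempts=5):
    """Read the latest copy, edit it, and retry when a conflict occurs."""
    for _ in range(attempts):
        obj = store.get(name)
        mutate(obj["spec"])
        try:
            store.update(name, obj)
            return
        except Conflict:
            continue
    raise Conflict(name)


store = Store()
store.create("web", {"replicas": 1})
alice = store.get("web")
bob = store.get("web")
alice["spec"]["replicas"] = 2
store.update("web", alice)  # succeeds and bumps the resource version
bob["spec"]["image"] = "v2"
try:
    store.update("web", bob)  # bob still holds the old resource version
    bob_rejected = False
except Conflict:
    bob_rejected = True
update_with_retry(store, "web", lambda spec: spec.update(image="v2"))
```

    The rejected client has to fetch the latest object and apply its change again, which is what update_with_retry does.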

  6. List and describe available PATCH methods

    Sample answer

    cf. PATCH operations
    cf. An example of using dynamic client of k8s.io/client-go

  7. Describe subresources

    Sample answer A subresource is a part of a resource for which a REST API endpoint is provided separately from the main resource. The most common subresource is /status, which represents the status element.

    Since subresources have their own API endpoints and verbs, their RBAC permissions are independent of those of the main resource.

    cf. Types (Kinds)

  8. Describe what the storage version of an API is

    Sample answer Each Kubernetes API is versioned. When an incompatible change is introduced to an API, its version is bumped.

    When an API resource is saved in etcd, the resource is converted to a specific version of the API and serialized. This specific version is called the storage version of the API.

  9. Describe how to bump Kubernetes API version step by step

    Sample answer

    1. Introduce a new API version. The storage version of the API stays at the old version.
    2. Change the storage version to the new one once the new version has stabilized and matured.
    3. Update the saved API resources in etcd to the new version (by updating them).
    4. Deprecate the old API version. Tell users to update their resources to the new version.
    5. Remove the old API version after a while.

    cf. The Future of Your CRDs – Evolving an API

  10. Describe why conversion webhooks have to implement a round-trip conversion

    Sample answer Suppose that an API sets v1 as its storage version.

    When creating the API resource as v2, the conversion webhook needs to convert the resource from v2 to v1. kube-apiserver then saves the resource as v1 in etcd.

    When retrieving the API resource as v2, the conversion webhook needs to convert the saved resource back from v1 to v2.

    Clearly, the conversion webhook needs to implement a round-trip conversion.

  11. Describe how to avoid missing information in round-trip conversion

    Sample answer The common technique is to save the missing information as annotations. For instance, HorizontalPodAutoscaler saves fields added in v2 as annotations in v1.

    cf. Horizontal Pod Autoscaler
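
    The annotation technique can be sketched with a hypothetical API whose v2 adds a metrics field that v1 lacks. The group example.com, the annotation key, and the field names are all assumptions for illustration; this is not the HorizontalPodAutoscaler code.

```python
import json

ANNOTATION = "example.com/v2-metrics"  # hypothetical annotation key


def v2_to_v1(obj):
    """Convert to the storage version, parking the v2-only field in an annotation."""
    annotations = dict(obj["metadata"].get("annotations", {}))
    annotations[ANNOTATION] = json.dumps(obj["spec"]["metrics"])
    return {
        "apiVersion": "example.com/v1",
        "metadata": {**obj["metadata"], "annotations": annotations},
        "spec": {"replicas": obj["spec"]["replicas"]},
    }


def v1_to_v2(obj):
    """Convert back, restoring the v2-only field from the annotation."""
    annotations = dict(obj["metadata"].get("annotations", {}))
    metrics = json.loads(annotations.pop(ANNOTATION, "[]"))
    metadata = {**obj["metadata"], "annotations": annotations}
    if not annotations:
        del metadata["annotations"]
    return {
        "apiVersion": "example.com/v2",
        "metadata": metadata,
        "spec": {"replicas": obj["spec"]["replicas"], "metrics": metrics},
    }


original = {
    "apiVersion": "example.com/v2",
    "metadata": {"name": "demo"},
    "spec": {"replicas": 3, "metrics": [{"type": "cpu", "target": 80}]},
}
stored = v2_to_v1(original)
restored = v1_to_v2(stored)
```

    The stored v1 object has no metrics field, yet converting it back yields the original v2 object because the information travelled in the annotation.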

  12. Describe how kube-apiserver and aggregation API servers authenticate/authorize each other

    Sample answer They mutually authenticate each other using TLS. Read Authentication Flow for details.

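    As to authorization, aggregation API servers delegate the decision to kube-apiserver by creating SubjectAccessReview resources, so they have to be granted permission to create them. To grant the privilege, bind the built-in ClusterRole system:auth-delegator to the ServiceAccount of the aggregation API server. In addition, bind the built-in Role extension-apiserver-authentication-reader in the kube-system namespace to the same ServiceAccount so that the server can read the authentication configuration published by kube-apiserver.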

Implementing controllers

  1. Describe what Event resources are and how long they live in kube-apiserver

    Sample answer An Event is a resource that records something that happened to a target resource. kubectl describe pods NAME displays the events of the Pod in a readable manner.

    Events live for only one hour in kube-apiserver by default; the lifetime is controlled by the --event-ttl flag of kube-apiserver.

    cf. Emitting, Consuming, and Presenting: The Event Lifecycle

  2. What namespace should be used for Events of cluster resources such as Node?

    Sample answer The default namespace.

  3. Describe what reconciliation is in Kubernetes

    Sample answer Reconciliation is the process of making the actual state of the world match the desired state. In other words, reconciliation is how a declarative API is implemented.

    cf. What is "reconciliation"?
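
    A minimal sketch of a reconciliation loop, assuming a controller whose only job is to keep a desired number of Pods. The function names and the Pod naming scheme are hypothetical.

```python
def reconcile(desired_replicas, actual_pods):
    """Compare desired and actual state and return the actions that close the gap."""
    diff = desired_replicas - len(actual_pods)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        return [("delete", name) for name in sorted(actual_pods)[:-diff]]
    return []  # already converged


def run_until_converged(desired_replicas, pods):
    """Apply reconcile() repeatedly until nothing is left to do."""
    pods = set(pods)
    counter = 0
    while True:
        actions = reconcile(desired_replicas, pods)
        if not actions:
            return pods
        for verb, name in actions:
            if verb == "create":
                counter += 1
                pods.add(f"pod-new-{counter}")
            else:
                pods.discard(name)


scaled_up = run_until_converged(3, {"pod-a"})
scaled_down = run_until_converged(1, {"pod-a", "pod-b", "pod-c"})
```

    Note that reconcile() never looks at what changed; it only looks at the difference between the desired and the actual state, so it can be called at any time and any number of times.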

  4. Describe how to watch resources in kube-apiserver

    Sample answer kube-apiserver provides a mechanism called watch that streams changes to API resources to clients. Watching is much more efficient than polling kube-apiserver periodically.

    cf. Efficient detection of changes
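
    The following toy model shows why watching is cheap: a client remembers the last resource version it has seen and receives only newer changes. The real watch is a streaming HTTP response rather than a function call, and the ChangeLog and Client classes are hypothetical.

```python
class ChangeLog:
    """Toy stand-in for the change history kept by kube-apiserver."""

    def __init__(self):
        self.events = []  # tuples of (resource_version, event_type, name)

    def record(self, event_type, name):
        self.events.append((len(self.events) + 1, event_type, name))

    def watch(self, since):
        """Return only the events newer than the given resource version."""
        return [event for event in self.events if event[0] > since]


class Client:
    """Keeps a local cache up to date from watch events."""

    def __init__(self):
        self.cache = set()
        self.last_seen = 0

    def sync(self, log):
        events = log.watch(self.last_seen)
        for version, event_type, name in events:
            if event_type == "DELETED":
                self.cache.discard(name)
            else:  # ADDED or MODIFIED
                self.cache.add(name)
            self.last_seen = version
        return len(events)


log = ChangeLog()
client = Client()
log.record("ADDED", "pod-a")
log.record("ADDED", "pod-b")
first = client.sync(log)
log.record("DELETED", "pod-a")
second = client.sync(log)
third = client.sync(log)
```

    The second sync transfers one event instead of the whole list, and the third transfers nothing at all.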

  5. Describe how Delete REST API works

    Sample answer Delete REST API begins the deletion of a given resource. The completion of the REST API call does not necessarily mean that the resource is removed from kube-apiserver.

    kubectl delete waits for the completion of the deletion by watching kube-apiserver until the resource is removed. With --wait=false, kubectl delete does not wait for the completion.

  6. Describe what metadata.deletionTimestamp is and how it works

    Sample answer metadata.deletionTimestamp is usually not set. It is set when deletion of a resource has been requested but the resource cannot be removed immediately. The timestamp indicates when the resource is scheduled to be deleted.

    For Pods, this field is used to implement graceful termination. Containers get SIGTERM as soon as the deletion timestamp is set, and get SIGKILL after the timestamp expires. The Pod resource itself will not be removed until kubelet completes the deletion of Pod processes.

    The deletion timestamp is also set when metadata.finalizers is not empty as described below.

    cf. Metadata

  7. Describe what metadata.finalizers is and how it works

    Sample answer While metadata.finalizers is not empty, the resource will not be removed from kube-apiserver. A controller can run its own cleanup before an object disappears by adding an item to metadata.finalizers in advance. When the controller completes the cleanup, it should remove its item from metadata.finalizers.

    As soon as metadata.finalizers becomes empty, kube-apiserver deletes the resource from etcd.

    cf. Using Finalizers
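
    The interplay between finalizers and the deletion timestamp can be sketched as below. The Store class is a hypothetical stand-in for kube-apiserver, and the finalizer name and the set of external resources are assumptions for illustration.

```python
FINALIZER = "example.com/cleanup"  # hypothetical finalizer name


class Store:
    """Toy stand-in for kube-apiserver; not the real deletion logic."""

    def __init__(self):
        self.objects = {}

    def create(self, name):
        self.objects[name] = {"finalizers": [], "deletionTimestamp": None}

    def delete(self, name):
        obj = self.objects[name]
        if obj["finalizers"]:
            obj["deletionTimestamp"] = "now"  # only mark; keep the object
        else:
            del self.objects[name]

    def remove_finalizer(self, name, finalizer):
        obj = self.objects[name]
        obj["finalizers"].remove(finalizer)
        if obj["deletionTimestamp"] is not None and not obj["finalizers"]:
            del self.objects[name]  # last finalizer gone: really delete


def reconcile(store, name, external_resources):
    """Controller logic: register the finalizer, or clean up when deleting."""
    obj = store.objects.get(name)
    if obj is None:
        return
    if obj["deletionTimestamp"] is None:
        if FINALIZER not in obj["finalizers"]:
            obj["finalizers"].append(FINALIZER)
        external_resources.add(name)
    elif FINALIZER in obj["finalizers"]:
        external_resources.discard(name)  # the finalization work itself
        store.remove_finalizer(name, FINALIZER)


store = Store()
external = set()
store.create("db")
reconcile(store, "db", external)
store.delete("db")
still_present = "db" in store.objects
reconcile(store, "db", external)
```

    After the delete request the object is still there with its deletion timestamp set; it disappears only after the controller has cleaned up and removed its finalizer.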

  8. Describe how k8s.io/client-go/tools/leaderelection works

    Sample answer The package implements leader election by using kube-apiserver resources. Currently, the recommended resource to use is Lease.

    This package does not guarantee that only one client is acting as the leader at any moment; in other words, it does not provide fencing.

    Implementation example: https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go
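
    The core idea of lease-based election can be sketched in a few lines. This is a simplification under my own assumptions, not the client-go implementation: the real package updates a Lease object through kube-apiserver with optimistic locking, and the 15-second duration here is only an example value.

```python
def try_acquire_or_renew(lease, identity, now, lease_duration=15):
    """Take or renew the lease; return True if identity holds it afterwards."""
    holder = lease.get("holderIdentity")
    expired = holder is None or now - lease["renewTime"] > lease_duration
    if holder != identity and not expired:
        return False  # someone else holds a live lease
    lease["holderIdentity"] = identity
    lease["renewTime"] = now
    return True


lease = {}
a_at_0 = try_acquire_or_renew(lease, "a", now=0)
b_at_5 = try_acquire_or_renew(lease, "b", now=5)
a_at_10 = try_acquire_or_renew(lease, "a", now=10)
# "a" stalls (for example, a long GC pause) and stops renewing.
b_at_30 = try_acquire_or_renew(lease, "b", now=30)
a_at_31 = try_acquire_or_renew(lease, "a", now=31)
```

    Between time 25 and 31, "a" has not yet noticed that it lost the lease and may still be acting as the leader while "b" has already taken over. This window is why the package alone cannot serve as a fencing mechanism.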

  9. Describe what metadata.ownerReferences is and how it works

    Sample answer The field is used by the garbage collector to implement cascading deletion of resources.

    The field is also used by controllers to identify the parent resource.

    cf. Garbage Collection
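
    Cascading deletion can be modeled as repeatedly removing objects whose owners no longer exist. The sketch below identifies objects by a name string for readability, whereas real owner references use UIDs; the object names are hypothetical.

```python
def collect_garbage(objects, deleted):
    """Return the names that survive after `deleted` is removed.

    objects maps an object name to the list of its owners' names.
    """
    live = {name: owners for name, owners in objects.items() if name != deleted}
    changed = True
    while changed:
        changed = False
        for name, owners in list(live.items()):
            # An object with owners is collected once none of them is left.
            if owners and not any(owner in live for owner in owners):
                del live[name]
                changed = True
    return set(live)


objects = {
    "deployment/web": [],
    "replicaset/web-1": ["deployment/web"],
    "pod/web-1-a": ["replicaset/web-1"],
    "pod/web-1-b": ["replicaset/web-1"],
    "configmap/settings": [],
}
remaining = collect_garbage(objects, "deployment/web")
```

    Deleting the Deployment removes the ReplicaSet and, in turn, its Pods, while the ConfigMap without an owner is left alone.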

  10. What happens to PersistentVolumeClaims instantiated from a StatefulSet when the StatefulSet is deleted?

    Sample answer They will remain.

    If you want to delete the PVCs along with the StatefulSet, set the PVCs' metadata.ownerReferences to the StatefulSet or to another resource. For example, Elastic Cloud on Kubernetes (ECK) sets the owner of PVCs to the Elasticsearch custom resource.

Components and their collaborations

  1. Describe the roles of these components:

    • etcd
    • kube-apiserver
    • kube-controller-manager
    • kube-scheduler
    • kubelet
    • kube-proxy
    • containerd
    • CoreDNS

    Sample answer

    • etcd: to store resource objects persistently.
    • kube-apiserver: to access etcd and provide a REST API for other components.
    • kube-controller-manager: to run a set of controllers that watch and edit resources in kube-apiserver.
    • kube-scheduler: to schedule new Pods to a Node.
    • kubelet: to run Pods on each Node.
    • kube-proxy: to configure network rules on each Node for Services.
    • containerd: to accept CRI requests from kubelet and run containers.
    • CoreDNS: to provide internal DNS for Service names.

  2. Describe the behavior of each component from the creation of a Pod to the running of the containers inside

    Sample answer

    1. kube-apiserver saves a new Pod resource in etcd
    2. kube-scheduler finds the new Pod
    3. kube-scheduler allocates a Node to the new Pod based on available resources and other conditions
    4. kubelet on the allocated Node finds the new Pod
    5. kubelet initializes the Pod runtime as follows:
      1. kubelet sends a CRI request to CRI runtime such as containerd to create an infrastructure container
      2. CRI runtime calls CNI plugins to initialize the network namespace of the Pod
    6. kubelet sequentially requests CRI runtime to run spec.initContainers, if any
    7. kubelet concurrently requests CRI runtime to run spec.containers

  3. Describe who creates default ServiceAccount in each namespace and when

    Sample answer The default ServiceAccount does not exist immediately after a namespace is created.

    The default ServiceAccount is created by kube-controller-manager a little after the namespace is created. Similarly, the Secret token for the default ServiceAccount is created by kube-controller-manager a little after default is created.

    For this reason, creating a Pod in a newly created namespace sometimes fails. Creating a Deployment instead is safe because its controller retries Pod creation until it succeeds.

  4. Explain what happens to a Pod when kubelet or Node running the Pod cannot communicate with kube-apiserver

    Sample answer kubelet periodically sends heartbeats to kube-apiserver. If the heartbeats stop, the node lifecycle controller in kube-controller-manager adds taints to the Node resource.

    Pods can tolerate the taints up to 300 seconds because they have the following tolerations by default:

     tolerations:
     - effect: NoExecute
       key: node.kubernetes.io/not-ready
       operator: Exists
       tolerationSeconds: 300
     - effect: NoExecute
       key: node.kubernetes.io/unreachable
       operator: Exists
       tolerationSeconds: 300
    

    When 300 seconds have elapsed, graceful termination is initiated. As spec.terminationGracePeriodSeconds is 30 seconds by default, metadata.deletionTimestamp is normally set to 30 seconds after that moment.

    When an additional 30 seconds have elapsed, the Pod transitions to Terminating status. However, since kubelet cannot see the status of the Pod, the Pod will remain running.

    cf. Taint Nodes by Condition, Taint based Evictions
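
    The timeline above is simple arithmetic, shown here as a worked example with the default values. The function name and the sample date are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def eviction_timeline(taint_added_at, toleration_seconds=300, grace_period_seconds=30):
    """Return (when deletion is requested, the resulting deletionTimestamp)."""
    deletion_requested_at = taint_added_at + timedelta(seconds=toleration_seconds)
    deletion_timestamp = deletion_requested_at + timedelta(seconds=grace_period_seconds)
    return deletion_requested_at, deletion_timestamp


taint_added_at = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
requested_at, deletion_timestamp = eviction_timeline(taint_added_at)
```

    With the defaults, a taint added at 12:00:00 leads to a deletion request at 12:05:00 and a deletion timestamp of 12:05:30. Pods that declare their own tolerationSeconds or terminationGracePeriodSeconds shift these times accordingly.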

  5. Describe how ReplicaSet controller works if a Pod is Terminating

    Sample answer ReplicaSet controller usually adds a new Pod in a timely manner.

  6. Describe how StatefulSet controller works if a Pod is Terminating

    Sample answer StatefulSet controller cannot add a new Pod because Pods in a StatefulSet have stable network IDs.

  7. Explain why fencing a failing Node is important before removing Node resource

    Sample answer When kubelet cannot communicate with kube-apiserver, Pods on the Node become Terminating but will not be removed. Such Pods can be removed by removing the Node resource from kube-apiserver.

    However, Pod processes may still be alive if the problem is merely communication between kubelet and kube-apiserver. In this case, removing Nodes and Pods might cause split-brain syndrome because a new Pod with the same ID in a StatefulSet would run on another Node.

    To avoid such incidents, a failing Node should be killed with STONITH or a similar mechanism before removing the Node resource.

Resources

  1. Explain the difference between resources.limits and resources.requests of a container

    Sample answer resources.limits sets an upper limit on the resource usage of a container, enforced with Linux cgroups.

    resources.requests is used by kube-scheduler to choose available Nodes. resources.requests.cpu is also used to distribute CPU time among containers using CFS shares.

    cf. Setting the right requests and limits in Kubernetes
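
    As a worked example, CPU requests and limits map to CFS settings roughly as follows. The formulas and constants reflect my understanding of how kubelet converts milli-CPU values for cgroup v1 (1024 shares per CPU, a 100 ms period); treat them as assumptions rather than a specification.

```python
SHARES_PER_CPU = 1024
MIN_SHARES = 2
CFS_PERIOD_US = 100_000
MIN_QUOTA_US = 1_000


def milli_cpu_to_shares(milli_cpu):
    """CPU request in milli-CPU -> relative weight (CFS shares)."""
    if milli_cpu == 0:
        return MIN_SHARES
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // 1000)


def milli_cpu_to_quota(milli_cpu, period_us=CFS_PERIOD_US):
    """CPU limit in milli-CPU -> CPU time allowed per period, in microseconds."""
    if milli_cpu == 0:
        return None  # no limit configured
    return max(MIN_QUOTA_US, milli_cpu * period_us // 1000)


shares_500m = milli_cpu_to_shares(500)
shares_250m = milli_cpu_to_shares(250)
quota_500m = milli_cpu_to_quota(500)
```

    A container requesting 500m gets twice the weight of one requesting 250m when the CPU is contended, while a 500m limit caps the container at 50 ms of CPU time per 100 ms period regardless of contention.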

  2. What happens when a container has only resources.limits.memory?

    Sample answer The container is modified to have resources.requests.memory with the same value as resources.limits.memory.

    The same applies to resources.limits.cpu.

    cf. Create a Pod that gets assigned a QoS class of Guaranteed

  3. What happens when a container consumes more memory than requested?

    Sample answer Pods that use more memory than they requested become candidates for eviction when the Node is running out of memory.

    cf. Interactions between Pod priority and quality of service

  4. Describe Quality of Service classes for Pods

    Sample answer There are three classes, namely, Guaranteed, Burstable, and BestEffort.

    Pods in which every container has requests and limits for both CPU and memory, with requests equal to limits, are classified as Guaranteed. Guaranteed Pods will not be evicted, except in exceptional cases.

    Pods that have at least one resource request are classified as Burstable. All other Pods are classified as BestEffort. Burstable Pods are less likely to be evicted than BestEffort Pods.

    cf. Evicting end-user Pods
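
    The classification rules can be written down as a small function. This is a sketch under simplifying assumptions: it compares quantities as plain strings (so "500m" and "0.5" would not be considered equal), it ignores init containers, and the dictionary layout is hypothetical.

```python
def qos_class(containers):
    """Classify a Pod from its containers' resource requests and limits."""
    any_resource_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A missing request defaults to the limit of the same resource.
        requests = {**limits, **container.get("requests", {})}
        if requests:
            any_resource_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                all_guaranteed = False
    if not any_resource_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


full = {"requests": {"cpu": "1", "memory": "1Gi"}, "limits": {"cpu": "1", "memory": "1Gi"}}
limits_only = {"limits": {"cpu": "1", "memory": "1Gi"}}
request_only = {"requests": {"cpu": "100m"}}
empty = {}
```

    For example, qos_class([full]) and qos_class([limits_only]) both yield Guaranteed, qos_class([request_only]) yields Burstable, and qos_class([empty]) yields BestEffort. A Pod mixing a fully specified container with an empty one is Burstable.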

  5. Describe PriorityClass for Pods

    Sample answer PriorityClass is used by kube-scheduler to perform Pod preemption. Preemption is an operation that removes a low-priority Pod from a Node and schedules a high-priority Pod to the Node.

    cf. Pod Priority and Preemption

  6. Is a Pod evicted when the Node is running out of CPU time?

    Sample answer No. CPU is a compressible resource, so containers simply get less CPU time instead of being evicted.

    For this reason, setting a proper CPU request is important for production environments.

Networking

  1. Describe types of Service, namely, ClusterIP, NodePort, and LoadBalancer

    Sample answer ClusterIP is the most basic Service type. It provides a virtual IP address to service consumers to access backend Pods.

    NodePort provides a port number in addition to the virtual IP provided by the ClusterIP type. Service consumers can reach backend Pods by connecting to any Node on that port.

    LoadBalancer tells an external load balancer to assign a virtual IP address and route packets sent to that address to backend Pods.

  2. Explain the relationship between Service and Endpoints (EndpointSlices)

    Sample answer Every Service is accompanied by an Endpoints resource of the same name, or by EndpointSlices labeled with the Service name. Endpoint(Slice)s represent the addresses of backend Pods.

    Endpoint(Slice)s are created and updated automatically by kube-controller-manager if the Service has a Pod selector. If not, Endpoint(Slice)s need to be maintained by other means.

  3. Describe the usage of spec.containers.ports of Pod and EXPOSE in Dockerfiles

    Sample answer If defined with names, spec.containers.ports can be used in livenessProbe, readinessProbe, or Service's targetPort field to reference the port by the name.

     containers:
     - ports:
       - name: health
         containerPort: 8080
         protocol: TCP
       livenessProbe:
         httpGet:
           port: health
           path: /healthz
    

    EXPOSE in Dockerfiles is merely documentation.

    Neither of them actually publishes the port. A container may listen on ports other than the specified ones, and it may not listen on the specified ports at all.

  4. Explain how packets from the outside reach Pods if the Service's spec.externalTrafficPolicy is set to Local

    Sample answer spec.externalTrafficPolicy is mainly for LoadBalancer type Services. If this field is empty or Cluster (default), kube-proxy rewrites packets' source address to the Node address and forwards them to the destination Pod. In this mode, the destination Pod may be running on another Node.

    If this field is Local, kube-proxy does not rewrite the source address. In this mode, the destination Pod must be running on the same Node where kube-proxy is running. Therefore, the external load balancer must route packets only to the Nodes where the destination Pods are running.

    For example, MetalLB advertises the virtual address only from the Nodes where the destination Pods are running.

    cf. Preserving the client source IP
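
    The difference between the two policies can be modeled as follows. This is a toy model of the behavior described above, not of kube-proxy itself; the node names, the address-of-NODE placeholder, and the "pick the first Pod" choice are hypothetical simplifications.

```python
def deliver(policy, receiving_node, pods_by_node, client_ip):
    """Return (pod, source address seen by the Pod), or None if dropped."""
    if policy == "Local":
        local_pods = pods_by_node.get(receiving_node, [])
        if not local_pods:
            return None  # no local backend: the packet goes nowhere
        return local_pods[0], client_ip  # source address preserved
    # Cluster: any Pod may be chosen and the source address is rewritten.
    all_pods = [pod for node in sorted(pods_by_node) for pod in pods_by_node[node]]
    if not all_pods:
        return None
    return all_pods[0], f"address-of-{receiving_node}"


def nodes_to_advertise(pods_by_node):
    """Nodes an external load balancer should target under the Local policy."""
    return sorted(node for node, pods in pods_by_node.items() if pods)


pods_by_node = {"node-1": ["web-a"], "node-2": []}
cluster_result = deliver("Cluster", "node-2", pods_by_node, "203.0.113.7")
local_miss = deliver("Local", "node-2", pods_by_node, "203.0.113.7")
local_hit = deliver("Local", "node-1", pods_by_node, "203.0.113.7")
targets = nodes_to_advertise(pods_by_node)
```

    Under Local, a packet arriving at node-2 has no backend to go to, which is why the load balancer should send traffic to node-1 only.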

Monitoring

  1. Describe what happens when a readinessProbe fails

    Sample answer A readinessProbe checks if the container is ready to accept requests.

    If a readinessProbe fails, the Pod becomes unready and is excluded from Service load balancing targets.

  2. Describe what happens when a livenessProbe fails

    Sample answer A livenessProbe checks if the container is alive.

    If a livenessProbe fails, the container process is killed and restarted.

Access control

  1. Can a Role (not a ClusterRole) grant access to cluster-scoped resources?

    Sample answer No.

  2. Can a ClusterRole grant access to namespace-scoped resources?

    Sample answer Yes. Such a ClusterRole can be used to grant access to resources in any namespace.

    cf. Understanding Kubernetes RBAC

  3. Is it a good idea to edit the privilege of the default ServiceAccount?

    Sample answer Definitely not.

    The default ServiceAccount is used by any Pod that does not specify a ServiceAccount. Editing the privilege of the default ServiceAccount would cause unexpected behavior.

  4. Describe how kube-apiserver prevents privilege escalation

    Sample answer kube-apiserver checks the requester whenever a subject (user or ServiceAccount) creates or updates a (Cluster)RoleBinding. If the subject does not itself hold every privilege it is going to grant to other entities, kube-apiserver denies the operation.

    If you are implementing a controller that dynamically grants some privilege to other ServiceAccounts, make sure that the ServiceAccount of the controller has the same privilege.

    cf. Privilege escalation prevention and bootstrapping
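
    The check boils down to a subset test between rule sets. The sketch below represents a rule as a hypothetical (API group, resource, verb) tuple and treats "*" as a wildcard; the real rule matching in kube-apiserver is richer than this.

```python
def covers(held, wanted):
    """A held rule covers a wanted rule if each field matches or is "*"."""
    return all(h == "*" or h == w for h, w in zip(held, wanted))


def may_grant(subject_rules, role_rules):
    """True if the subject already holds every rule of the role it binds."""
    return all(any(covers(held, wanted) for held in subject_rules) for wanted in role_rules)


controller_rules = [("", "configmaps", "get"), ("", "configmaps", "list")]
admin_rules = [("*", "*", "*")]
reader_role = [("", "configmaps", "get")]
writer_role = [("", "configmaps", "get"), ("", "configmaps", "update")]

can_bind_reader = may_grant(controller_rules, reader_role)
can_bind_writer = may_grant(controller_rules, writer_role)
```

    Here the controller may bind the reader role but not the writer role, because it does not hold the update privilege itself.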

  5. Describe what user impersonation is

    Sample answer If granted, a user can act as another user and/or belong to another group through HTTP request headers. User impersonation should be granted only to cluster administrators.

    When using kubectl, --as=USER and --as-group=GROUP command-line flags set impersonation headers.

    cf. User impersonation

  6. Describe view, edit, admin ClusterRoles

    Sample answer These are called aggregated ClusterRoles. An aggregated ClusterRole merges privileges of other ClusterRoles that have special labels.

    When defining new custom resources, consider aggregating the appropriate privileges into these ClusterRoles.

    cf. User-facing roles
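
    Aggregation can be sketched as collecting the rules of every ClusterRole whose labels match a selector. The function and the sample roles are hypothetical, and a real aggregation rule may list several selectors; the label key follows the rbac.authorization.k8s.io/aggregate-to-view convention used by the built-in roles.

```python
def aggregate(cluster_roles, selector):
    """Merge the rules of all ClusterRoles whose labels match the selector."""
    rules = []
    for role in cluster_roles:
        labels = role.get("labels", {})
        if all(labels.get(key) == value for key, value in selector.items()):
            for rule in role["rules"]:
                if rule not in rules:
                    rules.append(rule)
    return rules


cluster_roles = [
    {
        "name": "system:aggregate-to-view",
        "labels": {"rbac.authorization.k8s.io/aggregate-to-view": "true"},
        "rules": [("", "pods", "get"), ("", "pods", "list")],
    },
    {
        "name": "widgets-viewer",  # hypothetical role for a custom resource
        "labels": {"rbac.authorization.k8s.io/aggregate-to-view": "true"},
        "rules": [("example.com", "widgets", "get")],
    },
    {
        "name": "unrelated",
        "labels": {},
        "rules": [("", "secrets", "get")],
    },
]
view_rules = aggregate(cluster_roles, {"rbac.authorization.k8s.io/aggregate-to-view": "true"})
```

    Adding the label to the ClusterRole of a custom resource is enough for the view role to cover it; the role without the label contributes nothing.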