I've been leading a project team for 3 years and have developed a large system around Kubernetes.
In this article, I'd like to share my knowledge and findings about what I think is important when developing custom Kubernetes controllers.
Please leave your comments and/or suggestions on my Twitter, if any.
- Implementing controllers
- Components and their collaborations
- Access control
List extension mechanisms of kube-apiserver
- Custom resources: to define custom resources using OpenAPI schema
- Aggregation layer: to configure reverse-proxy servers to provide additional API groups
- Admission webhooks: to validate or mutate resources before saving them to etcd
- Authentication webhook: to verify authentication tokens with external authentication servers
- Authorization webhook: to authorize requests with external authorization servers
Describe the operation sequence of admission controls
- Authentication and authorization
- Mutating webhook
- Object schema validation
- Validating webhook
- Saving data to etcd
Describe the problem and solution when multiple mutating webhooks edit the same resource
Sample answer: There is no way to tell kube-apiserver the order in which mutating webhooks are applied.
Suppose we have two mutating webhooks that edit Pods: one adds a volume mount configuration to all containers, and the other adds a container. For all containers to have the volume mount configuration, the first webhook needs to be called after the second.
We can set the reinvocation policy of the first webhook to IfNeeded so that it is called again after the second.
Describe what happens when a call of an admission webhook fails
Sample answer: It depends on the failure policy of the webhook. For admissionregistration.k8s.io/v1, the default is Fail, so the request is rejected.
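As a minimal sketch of the two settings discussed above (reinvocation policy and failure policy), a MutatingWebhookConfiguration might look like the following. The webhook name, Service name, namespace, and path are hypothetical, and the caBundle that a real configuration usually needs is omitted.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: add-volume-mount.example.com     # hypothetical name
webhooks:
  - name: add-volume-mount.example.com   # hypothetical name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    # Ask kube-apiserver to call this webhook again if another
    # mutating webhook modified the object after this one ran.
    reinvocationPolicy: IfNeeded
    # Fail is the default in v1; Ignore would let the request
    # through when the webhook call fails.
    failurePolicy: Fail
    clientConfig:
      service:
        namespace: example-system        # hypothetical
        name: example-webhook            # hypothetical
        path: /mutate-pod                # hypothetical
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```

Because a webhook may be called more than once for the same object, its mutation should be idempotent.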
Describe how kube-apiserver prevents resource editing conflicts
Sample answer: Every resource saved in etcd has a resource version that is updated every time the resource is edited. kube-apiserver uses this to reject a conflicting edit request if the resource version in the request differs from the saved one.
This mechanism is called optimistic locking and is applied to all PUT (update) requests.
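To illustrate, here is a sketch of an update body that carries the resource version the client read earlier. The object name and the version value are hypothetical. Sending this manifest with kubectl replace -f issues a PUT; if the object has been edited since that version was read, kube-apiserver should reject the request with 409 Conflict.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: example              # hypothetical
  namespace: default
  # The version this client read before editing (hypothetical value).
  # The update succeeds only if the saved object still has this version.
  resourceVersion: "12345"
data:
  key: new-value
```

On a conflict, the usual approach is to read the latest object, reapply the change, and try again.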
cf. Optimistic lock control for Kubernetes API Server object modification
List and describe available PATCH methods
- JSON Patch: can be used for both built-in and custom resources.
- JSON Merge Patch: ditto. For the difference from JSON patch, read http://erosb.github.io/post/json-patch-vs-merge-patch/ .
- Strategic Merge Patch: can be used for only built-in resources.
- Server Side Apply: can be used for both built-in and custom resources.
This works substantially differently from the others. Fields in resources may have owners, and each owner manages only its own fields.
cf. PATCH operations
cf. An example of using dynamic client of k8s.io/client-go
Sample answer: A subresource is a part of a resource for which a REST API endpoint is provided separately from the main resource. The most common subresource is
Since subresources have their own API endpoints and verbs, their RBAC permissions are independent of those of the main resource.
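As a sketch of what independent permissions mean in practice, the Role below allows reading Deployments but allows writes only to their status subresource. The Role name is hypothetical, and Deployments are used only as an example.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: deployment-status-updater   # hypothetical
rules:
  # Read-only access to the main resource.
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
  # Write access to the status subresource only.
  # Subresources are written as RESOURCE/SUBRESOURCE.
  - apiGroups: ["apps"]
    resources: ["deployments/status"]
    verbs: ["update", "patch"]
```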
cf. Types (Kinds)
Describe what the storage version of an API is
Sample answer: Each Kubernetes API is versioned. When an incompatible change is introduced to an API, its version is bumped.
When an API resource is saved in etcd, the resource is converted to a specific version of the API and serialized. This specific version is called the storage version of the API.
Describe how to bump Kubernetes API version step by step
- Introduce a new API version. The storage version of the API stays the old version.
- Change the storage version to the new one after the new version gets stabilized and matured.
- Update the saved API resources in etcd to the new version (by updating them).
- Deprecate the old API version. Tell users to update their resources to the new version.
- Remove the old API version after a while.
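For custom resources, the first step above can be sketched as a CustomResourceDefinition that serves two versions while keeping the old one as the storage version. The group, kind, and the permissive schemas are hypothetical placeholders.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com        # hypothetical
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
    - name: v1
      served: true
      storage: true                # exactly one version is the storage version
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
    - name: v2
      served: true
      storage: false               # flip the two flags in the second step
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
```

If the schemas of the two versions differ, a conversion webhook is also needed; it is configured under spec.conversion.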
Describe why conversion webhooks have to implement a round-trip conversion
Sample answer: Suppose that an API sets v1 as its storage version.
When a resource is created as v2, the conversion webhook needs to convert it from v2 to v1. kube-apiserver then saves the resource as v1 in etcd.
When the resource is retrieved as v2, the conversion webhook needs to convert the saved resource back from v1 to v2.
Clearly, the conversion webhook needs to implement a round-trip conversion.
Describe how to avoid losing information in round-trip conversion
Sample answer: The common technique is to save the information that the other version cannot represent as annotations. For instance, HorizontalPodAutoscaler saves fields added in v2 as annotations in v1.
Describe how kube-apiserver and aggregation API servers authenticate/authorize each other
Sample answer: They authenticate each other using mutual TLS. Read Authentication Flow for details.
As to authorization, aggregation API servers must be granted permission to create SubjectAccessReview resources in kube-apiserver. To grant the privilege, bind a system built-in Role called
kube-system namespace to the ServiceAccount of the aggregation API server.
Describe what Event resources are and how long they live in kube-apiserver
Sample answer: An Event is a resource that records something that happened to a target resource.
kubectl describe pods NAME displays the events of the Pod in a readable manner.
Events usually live for only one hour in kube-apiserver.
cf. Emitting, Consuming, and Presenting: The Event Lifecycle
What namespace should be used for Events of cluster resources such as Node?
Describe what reconciliation is in Kubernetes
Sample answer: Reconciliation is the process of making the actual state of the world match the desired state. In other words, reconciliation is how a declarative API is implemented.
Describe how to watch resources in kube-apiserver
Sample answer: kube-apiserver provides a mechanism called watch that feeds changes to API resources to clients. Watch is much more efficient than polling kube-apiserver periodically.
Describe how Delete REST API works
Sample answer: The Delete REST API begins the deletion of a given resource. Completion of the API call does not necessarily mean that the resource has been removed from kube-apiserver.
kubectl delete waits for the completion of the deletion by watching kube-apiserver until the resource is removed. With
kubectl delete does not wait for the completion.
Describe what metadata.deletionTimestamp is and how it works
Sample answer: metadata.deletionTimestamp is usually not set. It is set when a resource cannot be deleted immediately. The timestamp indicates when the deletion is scheduled.
For Pods, this field is used to implement graceful termination. Containers get SIGTERM as soon as the deletion timestamp is set, and get SIGKILL after the timestamp expires. The Pod resource itself will not be removed until kubelet completes the deletion of Pod processes.
The deletion timestamp is also set when metadata.finalizers is not empty, as described below.
Describe what metadata.finalizers is and how it works
Sample answer: If metadata.finalizers is not empty, the resource will not be removed from kube-apiserver. A controller can run a finalization process for objects being deleted by inserting an item into metadata.finalizers. When the controller completes the finalization, it should remove the item from metadata.finalizers.
As soon as metadata.finalizers becomes empty, kube-apiserver deletes the resource from etcd.
cf. Using Finalizers
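As a sketch, this is roughly what a resource looks like while it waits for finalization after a delete request. The object name, finalizer name, and timestamp are hypothetical.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: example                    # hypothetical
  namespace: default
  # Set by kube-apiserver when the delete request arrived.
  deletionTimestamp: "2024-01-01T00:00:00Z"
  finalizers:
    # Inserted earlier by a controller (hypothetical name).
    # The object stays until the controller removes this item.
    - example.com/cleanup
data:
  key: value
```

A controller typically checks whether deletionTimestamp is set; if so, it runs its cleanup and then removes its own item from the list.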
Describe how k8s.io/client-go/tools/leaderelection works
Sample answer: The package implements leader election by using kube-apiserver resources. Currently, the recommended resource to be used is
This package does not guarantee that only one client is acting as a leader (a.k.a. fencing).
Implementation example: https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go
Describe what metadata.ownerReferences is and how it works
Sample answer: The field is used by the garbage collector to implement cascading deletion of resources.
The field is also used by controllers to identify the parent resource.
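A minimal sketch of an owned object follows. The names are hypothetical, and the uid is a placeholder that must match the actual UID of the owner. Note that a namespaced owner must be in the same namespace as the object it owns.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: child                      # hypothetical
  namespace: default
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: parent                 # hypothetical owner
      uid: 00000000-0000-0000-0000-000000000000  # placeholder
      controller: true             # marks the managing controller
      blockOwnerDeletion: true     # used by foreground deletion
data:
  key: value
```

When the owner is deleted, the garbage collector deletes this ConfigMap as well.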
What happens to PersistentVolumeClaims instantiated from a StatefulSet when the StatefulSet is deleted?
Sample answer: They will remain.
If you want to delete the PVCs along with the StatefulSet, set the PVCs' metadata.ownerReferences to the StatefulSet or to another resource. For example, Elastic Cloud on Kubernetes (ECK) sets the owner of its PVCs to the Elasticsearch custom resource.
Components and their collaborations
Describe the roles of these components:
- etcd: to store resource objects persistently.
- kube-apiserver: to access etcd and provide a REST API for other components.
- kube-controller-manager: to run a set of controllers that watch and edit resources in kube-apiserver.
- kube-scheduler: to schedule new Pods to a Node.
- kubelet: to run Pods on each Node.
- kube-proxy: to configure network rules on each Node for Services.
- containerd: to accept CRI requests from kubelet and run containers.
- CoreDNS: to provide internal DNS for Service names.
Describe the behavior of each component from the creation of a Pod to the running of the containers inside
- kube-apiserver saves a new Pod resource in etcd
- kube-scheduler finds the new Pod
- kube-scheduler allocates a Node to the new Pod based on available resources and other conditions
- kubelet on the allocated Node finds the new Pod
- kubelet initializes the Pod runtime as follows:
- kubelet sends a CRI request to CRI runtime such as containerd to create an infrastructure container
- CRI runtime calls CNI plugins to initialize the network namespace of the Pod
- kubelet sequentially requests CRI runtime to run spec.initContainers, if any
- kubelet concurrently requests CRI runtime to run
Describe who creates the default ServiceAccount in each namespace and when
Sample answer: The default ServiceAccount does not exist immediately after a namespace is created.
The default ServiceAccount is created by kube-controller-manager a little after the namespace is created. Similarly, the Secret token for the
default ServiceAccount is created by kube-controller-manager a little after
For this reason, creating a Pod in a newly created namespace sometimes fails. It is safe to create a Deployment instead.
Explain what happens to a Pod when kubelet or Node running the Pod cannot communicate with kube-apiserver
Sample answer: kubelet periodically sends heartbeats to kube-apiserver. If the heartbeats stop, the node lifecycle controller in kube-controller-manager adds taints to the Node resource.
Pods can tolerate the taints up to 300 seconds because they have the following tolerations by default:
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
When 300 seconds have elapsed, graceful termination is initiated. As spec.terminationGracePeriodSeconds is 30 seconds by default, metadata.deletionTimestamp is normally set to 30 seconds from now.
When an additional 30 seconds have elapsed, the Pod transitions to Terminating status. However, since kubelet cannot see the status of the Pod, the Pod will remain running.
Describe how ReplicaSet controller works if a Pod is Terminating
Sample answer: ReplicaSet controller usually adds a new Pod in a timely manner.
Describe how StatefulSet controller works if a Pod is Terminating
Sample answer: StatefulSet controller cannot add a new Pod because Pods in a StatefulSet have stable network IDs, so a replacement with the same ID cannot be created until the old Pod is removed.
Explain why fencing a failing Node is important before removing Node resource
Sample answer: When kubelet cannot communicate with kube-apiserver, Pods on the Node become Terminating but will not be removed. Such Pods can be removed if the Node resource is removed from kube-apiserver.
However, Pod processes may still be alive if the problem is merely communication between kubelet and kube-apiserver. In this case, removing Nodes and Pods might cause split brain syndrome because a new StatefulSet Pod with the same ID would run on another Node.
To avoid such incidents, a failing Node should be killed with STONITH or something like that before removing the Node resource.
Explain the difference between
resources.requests of a container
Sample answer: resources.limits sets an upper limit on the resource usage of containers using Linux cgroups.
resources.requests is used by kube-scheduler to choose available Nodes.
resources.requests.cpu is also used to distribute CPU time among containers using CFS shares.
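As a sketch, the Pod below sets both fields with different values. The Pod name, image, and quantities are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resources-demo                       # hypothetical
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0    # hypothetical
      resources:
        # Used by kube-scheduler to pick a Node with enough room.
        # The CPU request also determines the share of CPU time.
        requests:
          cpu: 250m
          memory: 128Mi
        # Upper bounds enforced through cgroups.
        limits:
          cpu: "1"
          memory: 256Mi
```

With these values the container may burst above its requests, up to its limits, when the Node has spare capacity.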
What happens when a container has only
Sample answer: The container is modified to have resources.requests.memory with the same value of
This is the same for
cf. Create a Pod that gets assigned a QoS class of Guaranteed
What happens when a container consumes more memory than requested?
Sample answer: Pods that use more memory than they requested become candidates for eviction when the Node is running out of memory.
cf. Interactions between Pod priority and quality of service
Describe Quality of Service classes for Pods
Sample answer: There are three classes, namely, Guaranteed, Burstable, and BestEffort.
Pods that have requests and limits for both CPU and memory, with requests equal to limits, are classified as Guaranteed. Guaranteed Pods will not be evicted, except in exceptional cases.
Pods that are not Guaranteed but have at least one resource request are classified as Burstable. All other Pods are classified as BestEffort. Burstable Pods are less likely to be evicted than BestEffort Pods.
Describe PriorityClass for Pods
Sample answer: PriorityClass is used by kube-scheduler to perform Pod preemption. Preemption is an operation that removes a low-priority Pod from a Node and schedules a high-priority Pod to the Node.
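A minimal sketch of a PriorityClass and a Pod that refers to it is shown below. The names, the priority value, and the image are hypothetical.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority                        # hypothetical
value: 1000000                               # higher means more important
globalDefault: false
description: "For Pods that may preempt lower-priority Pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app                        # hypothetical
spec:
  priorityClassName: high-priority
  containers:
    - name: app
      image: registry.example.com/app:1.0    # hypothetical
```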
Is a Pod evicted when the Node is running out of CPU time?
For this reason, setting a proper CPU request is important for production environments.
Describe types of Service, namely, ClusterIP, NodePort, and LoadBalancer
Sample answer: ClusterIP is the most basic Service type. It provides a virtual IP address that service consumers use to access backend Pods.
NodePort provides a port number in addition to the virtual IP provided by the ClusterIP type. Service consumers can reach backend Pods by connecting to any Node on that port.
LoadBalancer tells an external load balancer to assign a virtual IP address and to route packets destined for that address to backend Pods.
Explain the relationship between Service and Endpoints (EndpointSlices)
Sample answer: Every Service is accompanied by Endpoint(Slice)s of the same name. Endpoint(Slice)s represent the addresses of backend Pods.
Endpoint(Slice)s are created and updated automatically by kube-controller-manager if the Service has a Pod selector. If not, Endpoint(Slice)s need to be maintained by other means.
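As a sketch of the case without a Pod selector, the Service below is paired with a manually maintained EndpointSlice. The names and the address are hypothetical. Note that an EndpointSlice is tied to its Service through the kubernetes.io/service-name label rather than through its own name.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: external-db                # hypothetical
spec:
  # No selector, so nothing creates the endpoints automatically.
  ports:
    - name: postgres
      protocol: TCP
      port: 5432
      targetPort: 5432
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: external-db-1              # hypothetical
  labels:
    kubernetes.io/service-name: external-db
addressType: IPv4
ports:
  - name: postgres                 # matches the Service port name
    protocol: TCP
    port: 5432
endpoints:
  - addresses:
      - "192.0.2.10"               # hypothetical backend address
```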
Describe the usage of spec.containers.ports of Pod and
Sample answer: If defined with names, spec.containers.ports can be used in
readinessProbe, or a Service's targetPort field to reference the port by name.
    containers:
    - ports:
      - name: health
        containerPort: 8080
        protocol: TCP
      livenessProbe:
        httpGet:
          port: health
          path: /healthz
EXPOSE in Dockerfiles is merely documentation.
Neither of them actually publishes the port. A container may listen on ports other than the specified ones, and it may not listen on the specified ports at all.
Explain how packets from the outside reach Pods if the Service's
spec.externalTrafficPolicy is set to
Sample answer: spec.externalTrafficPolicy is mainly for LoadBalancer type Services. If this field is empty or Cluster (default), kube-proxy rewrites the packets' source address to the Node address and forwards them to the destination Pod. In this mode, the destination Pod may be running on another Node.
If this field is Local, kube-proxy does not rewrite the source address. In this mode, the destination Pod must be running on the same Node where kube-proxy is running. Therefore, the external load balancer routes packets only to the Nodes where the destination Pods are running.
For example, MetalLB advertises the virtual address only from the Nodes where the destination Pods are running.
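A minimal sketch of such a Service follows. The name, selector, and port numbers are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web                        # hypothetical
spec:
  type: LoadBalancer
  # Local keeps the client source address and sends traffic
  # only to Pods on the Node that received the packet.
  externalTrafficPolicy: Local
  selector:
    app: web                       # hypothetical label
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
```

A trade-off of Local is that traffic may be spread unevenly when Pods are not distributed evenly across Nodes.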
Describe what happens when a readinessProbe fails
Sample answer: A readinessProbe checks if the container is ready to accept requests.
If a readinessProbe fails, the Pod becomes unready and is excluded from Service load balancing targets.
Describe what happens when livenessProbe fails
Sample answer: A livenessProbe checks if the container is alive.
If a livenessProbe fails, the container process is killed and restarted.
Can a Role (not a ClusterRole) grant access to cluster-scoped resources?
Can a ClusterRole grant access to namespace-scoped resources?
Sample answer: Yes. Such a ClusterRole can be used to grant access to resources in any namespace.
Is it a good idea to edit the privilege of the
Sample answer: Definitely not.
The default ServiceAccount is used by any Pod that does not specify a ServiceAccount. Editing the privileges of the default ServiceAccount would cause unexpected behavior.
Describe how kube-apiserver prevents privilege escalation
Sample answer: kube-apiserver checks privileges when a subject (user or ServiceAccount) creates or updates a (Cluster)RoleBinding. If the subject does not itself hold the privileges it is going to grant to other entities, kube-apiserver will deny the operation.
If you are implementing a controller that dynamically grants some privilege to other ServiceAccounts, make sure that the ServiceAccount of the controller has the same privilege.
Describe what user impersonation is
Sample answer: If granted, a user can act as another user and/or belong to another group through HTTP request headers. User impersonation should be granted only to cluster administrators.
--as-group=GROUP command-line flags set impersonation headers.
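Impersonation is granted through RBAC with the impersonate verb. The following is a sketch of a ClusterRole that allows impersonating any user or group; the ClusterRole name is hypothetical, and it still has to be bound to a subject with a ClusterRoleBinding.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: impersonator               # hypothetical
rules:
  - apiGroups: [""]
    resources: ["users", "groups"]
    verbs: ["impersonate"]
```

Adding resourceNames to the rule would restrict which users or groups can be impersonated.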
Sample answer: These are called aggregated ClusterRoles. An aggregated ClusterRole merges privileges of other ClusterRoles that have special labels.
When defining new custom resources, consider aggregating the appropriate privileges into these ClusterRoles.
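As a sketch, assuming the ClusterRoles in question include the built-in view ClusterRole, read access to a custom resource can be aggregated into it with a label. The ClusterRole name, API group, and resource name are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: widgets-view               # hypothetical
  labels:
    # Rules of this ClusterRole are merged into the built-in
    # "view" ClusterRole by kube-controller-manager.
    rbac.authorization.k8s.io/aggregate-to-view: "true"
rules:
  - apiGroups: ["example.com"]     # hypothetical group
    resources: ["widgets"]         # hypothetical resource
    verbs: ["get", "list", "watch"]
```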