I've been leading a project team for 3 years and have developed a large system around Kubernetes.
In this article, I'd like to share what I have learned and what I think is important when developing custom Kubernetes controllers.
Please leave your comments and/or suggestions on my Twitter, if any.
- API
- Implementing controllers
- Components and their collaborations
- Resources
- Networking
- Monitoring
- Access control
API
List extension mechanisms of kube-apiserver
Sample answer
- Custom resources: to define custom resources using OpenAPI schema
- Aggregation layer: to configure reverse-proxy servers to provide additional API groups
- Admission webhooks: to validate or mutate resources before saving them to etcd
- Authentication webhook: to verify authentication tokens with external authentication servers
- Authorization webhook: to authorize requests with external authorization servers
Describe the operation sequence of admission controls
Sample answer
- Authentication and authorization
- Mutating webhook
- Object schema validation
- Validating webhook
- Saving data to etcd
Describe the problem and solution when multiple mutating webhooks edit the same resource
Sample answer
There is no way to specify the order in which kube-apiserver applies mutating webhooks.
Suppose we have two mutating webhooks that edit Pods: one adds a volume mount configuration to all containers, and the other adds a container. For all containers to have the volume mount configuration, the first webhook needs to be called after the second.
We can set the reinvocation policy of the first webhook to IfNeeded so that it is called again after the second webhook has modified the Pod.
Describe what happens when a call of an admission webhook fails
Sample answer
It depends on the failure policy of the webhook. For admissionregistration.k8s.io/v1, the default is Fail, so the request is rejected.
Describe how kube-apiserver prevents resource editing conflicts
Sample answer
All resources saved in etcd have a resource version that is updated every time the resource is edited. kube-apiserver uses it to reject a conflicting edit request when the resource version in the request differs from the saved one.
This mechanism is called optimistic locking and is applied to all PUT (update) requests.
cf. Optimistic lock control for Kubernetes API Server object modification
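The check described above can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: `Store`, `Conflict`, and `update_with_retry` are hypothetical names, and the resource version is modeled as an integer counter, whereas the real `metadata.resourceVersion` is an opaque string that clients should only compare for equality.

```python
class Conflict(Exception):
    """Raised when the submitted resourceVersion is stale (HTTP 409 in the real API)."""


class Store:
    """Toy stand-in for kube-apiserver backed by etcd (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def create(self, name, spec):
        self._objects[name] = {"resourceVersion": 1, "spec": dict(spec)}

    def get(self, name):
        obj = self._objects[name]
        # Return a copy so that callers cannot modify the stored object directly.
        return {"resourceVersion": obj["resourceVersion"], "spec": dict(obj["spec"])}

    def update(self, name, obj):
        current = self._objects[name]
        # Reject the write if the client read an older version.
        if obj["resourceVersion"] != current["resourceVersion"]:
            raise Conflict(name)
        self._objects[name] = {
            "resourceVersion": current["resourceVersion"] + 1,
            "spec": dict(obj["spec"]),
        }


def update_with_retry(store, name, mutate, attempts=5):
    """Read-modify-write loop that re-reads the object after a conflict."""
    for _ in range(attempts):
        obj = store.get(name)
        mutate(obj["spec"])
        try:
            store.update(name, obj)
            return
        except Conflict:
            continue
    raise Conflict(name)
```

Controllers typically follow the same pattern as `update_with_retry`: on a conflict, they read the latest object again and reapply the change instead of overwriting blindly.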
List and describe available PATCH methods
Sample answer
- JSON Patch: can be used for both built-in and custom resources.
- JSON Merge Patch: can be used for both built-in and custom resources. For the difference from JSON Patch, read http://erosb.github.io/post/json-patch-vs-merge-patch/.
- Strategic Merge Patch: can be used only for built-in resources.
- Server Side Apply: can be used for both built-in and custom resources.
  This works substantially differently from the others. Each field in a resource may have an owner, and each owner manages only its own fields.
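To make the difference between the first two methods concrete, here is a minimal sketch in plain Python. `merge_patch` follows the algorithm of JSON Merge Patch (RFC 7396), and `json_patch` implements only the add, replace, and remove operations of JSON Patch (RFC 6902). Both functions are illustrations written for this article, not the code kube-apiserver runs.

```python
import copy


def merge_patch(target, patch):
    """JSON Merge Patch: objects are merged recursively, null deletes a key."""
    if not isinstance(patch, dict):
        # Scalars and arrays replace the target wholesale.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


def json_patch(doc, operations):
    """A small subset of JSON Patch: add, replace, and remove on non-root paths."""
    doc = copy.deepcopy(doc)
    for op in operations:
        if op["op"] not in ("add", "replace", "remove"):
            raise ValueError("unsupported op: " + op["op"])
        # Split the JSON Pointer and undo its escaping (~1 is "/", ~0 is "~").
        parts = [p.replace("~1", "/").replace("~0", "~") for p in op["path"].split("/")[1:]]
        *parents, last = parts
        node = doc
        for part in parents:
            node = node[int(part)] if isinstance(node, list) else node[part]
        if isinstance(node, list):
            index = len(node) if last == "-" else int(last)
            if op["op"] == "add":
                node.insert(index, op["value"])
            elif op["op"] == "replace":
                node[index] = op["value"]
            else:
                del node[index]
        elif op["op"] == "remove":
            del node[last]
        else:
            node[last] = op["value"]
    return doc


pod = {
    "metadata": {"labels": {"app": "web", "tier": "front"}},
    "spec": {"containers": [{"name": "a"}, {"name": "b"}]},
}
# Merge Patch cannot address a single array element: the whole list is replaced.
merged = merge_patch(pod, {"metadata": {"labels": {"tier": None}},
                           "spec": {"containers": [{"name": "c"}]}})
# JSON Patch can append one element to the list.
patched = json_patch(pod, [{"op": "add", "path": "/spec/containers/-", "value": {"name": "c"}}])
```

The array handling is the practical difference: with a merge patch, adding one container means sending the full container list, while JSON Patch can target a single position.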
cf. PATCH operations
cf. An example of using dynamic client of k8s.io/client-go
Describe subresources
Sample answer
A subresource is a part of a resource that has its own REST API endpoint, separate from that of the main resource. The most common subresource is /status, which represents the status element.
Since subresources have their own API endpoints and verbs, their RBAC permissions are independent of those of the main resource.
cf. Types (Kinds)
Describe what is the storage version of API
Sample answer
Each Kubernetes API is versioned. When an incompatible change is introduced to an API, its version is bumped.
When an API resource is saved in etcd, the resource is converted to a specific version of the API and serialized. This specific version is called the storage version of the API.
Describe how to bump Kubernetes API version step by step
Sample answer
- Introduce a new API version. The storage version of the API stays the old version.
- Change the storage version to the new one after the new version gets stabilized and matured.
- Update the saved API resources in etcd to the new version (by updating them).
- Deprecate the old API version. Tell users to update their resources to the new version.
- Remove the old API version after a while.
Describe why conversion webhooks have to implement a round-trip conversion
Sample answer
Suppose that an API sets v1 as its storage version.
When creating the API resource as v2, the conversion webhook needs to convert the resource from v2 to v1. kube-apiserver then saves the resource as v1 in etcd.
When retrieving the API resource as v2, the conversion webhook needs to convert the saved resource back from v1 to v2.
Clearly, the conversion webhook needs to implement a round-trip conversion.
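A minimal sketch of such a round trip is shown below. Everything in it is hypothetical: the `example.com` API group, a `paused` field that exists only in v2, and the annotation key used to carry that field through v1. Stashing version-specific fields in an annotation is one way to keep the conversion lossless; the sketch shows only the conversion functions, not the webhook server around them.

```python
import json

ANNOTATION = "example.com/v2-only-fields"  # hypothetical annotation key
V2_ONLY_FIELDS = ("paused",)               # hypothetical fields added in v2


def v2_to_v1(obj):
    """Convert to the storage version, stashing v2-only fields in an annotation."""
    spec = dict(obj["spec"])
    extra = {k: spec.pop(k) for k in V2_ONLY_FIELDS if k in spec}
    annotations = dict(obj["metadata"].get("annotations", {}))
    if extra:
        annotations[ANNOTATION] = json.dumps(extra, sort_keys=True)
    metadata = {**obj["metadata"], "annotations": annotations}
    return {"apiVersion": "example.com/v1", "metadata": metadata, "spec": spec}


def v1_to_v2(obj):
    """Convert back, restoring v2-only fields from the annotation."""
    annotations = dict(obj["metadata"].get("annotations", {}))
    extra = json.loads(annotations.pop(ANNOTATION, "{}"))
    metadata = {**obj["metadata"], "annotations": annotations}
    return {"apiVersion": "example.com/v2", "metadata": metadata,
            "spec": {**obj["spec"], **extra}}


original = {
    "apiVersion": "example.com/v2",
    "metadata": {"name": "demo", "annotations": {"team": "infra"}},
    "spec": {"replicas": 3, "paused": True},
}
stored = v2_to_v1(original)    # what would be saved in etcd
restored = v1_to_v2(stored)    # what a v2 client would read back
```

If `v2_to_v1` simply dropped `paused`, a client that wrote the resource as v2 would read back a different object than it created.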
Describe how to avoid missing information in round-trip conversion
Sample answer
The common technique is to save the missing information as annotations. For instance, HorizontalPodAutoscaler saves fields added in v2 as annotations in v1.
Describe how kube-apiserver and aggregation API servers authenticate/authorize each other
Sample answer
They mutually authenticate each other using TLS. Read Authentication Flow for details.
As for authorization, aggregation API servers need permission to create SubjectAccessReview resources in kube-apiserver. The built-in ClusterRole system:auth-delegator grants this permission. In addition, bind the built-in Role called extension-apiserver-authentication-reader in the kube-system namespace to the ServiceAccount of the aggregation API server so that it can read the authentication configuration published by kube-apiserver.
Implementing controllers
Describe what are Event resources and how long they live in kube-apiserver
Sample answer
Event is a resource that records events that happened to a target resource. kubectl describe pods NAME displays the events of the Pod in a readable manner.
Events usually live for only one hour in kube-apiserver.
cf. Emitting, Consuming, and Presenting: The Event Lifecycle
What namespace should be used for Events of cluster resources such as Node?
Sample answer
default namespace.
Describe what is reconciliation in Kubernetes
Sample answer
Reconciliation is a process to make sure the actual state of the world matches the desired state. In other words, reconciliation is the implementation of the declarative API.
Describe how to watch resources in kube-apiserver
Sample answer
kube-apiserver provides a mechanism called watch that streams changes to API objects to clients. Watch is much more efficient than polling kube-apiserver periodically.
Describe how Delete REST API works
Sample answer
Delete REST API begins the deletion of a given resource. The completion of the REST API call does not necessarily mean that the resource is removed from kube-apiserver.
kubectl delete waits for the completion of the deletion by watching kube-apiserver until the resource is removed. With --wait=false, kubectl delete does not wait for the completion.
Describe what is metadata.deletionTimestamp and how it works
Sample answer
metadata.deletionTimestamp is usually not set. It is set when a resource cannot be deleted immediately. The timestamp indicates when the deletion is scheduled.
For Pods, this field is used to implement graceful termination. Containers get SIGTERM as soon as the deletion timestamp is set, and get SIGKILL after the timestamp expires. The Pod resource itself will not be removed until kubelet completes the deletion of Pod processes.
The deletion timestamp is also set when metadata.finalizers is not empty as described below.
cf. Metadata
Describe what is metadata.finalizers and how it works
Sample answer
While metadata.finalizers is not empty, the resource will not be removed from kube-apiserver. A controller can run a finalization process for objects being deleted by adding an item to metadata.finalizers. When the controller completes the finalization, it should remove the item from metadata.finalizers.
As soon as metadata.finalizers becomes empty, kube-apiserver deletes the resource from etcd.
cf. Using Finalizers
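The interplay between deletion and finalizers can be sketched with a toy model. `ToyAPIServer`, `reconcile`, and the finalizer name `example.com/cleanup` are hypothetical and exist only for illustration; a real controller would read and update objects through the Kubernetes API instead.

```python
class ToyAPIServer:
    """Toy model of how deletion interacts with finalizers (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def create(self, name, finalizers=()):
        self.objects[name] = {"finalizers": list(finalizers), "deletionTimestamp": None}

    def delete(self, name, now):
        obj = self.objects[name]
        if obj["finalizers"]:
            # Deletion is deferred: the object is only marked as being deleted.
            if obj["deletionTimestamp"] is None:
                obj["deletionTimestamp"] = now
        else:
            del self.objects[name]

    def remove_finalizer(self, name, finalizer):
        obj = self.objects[name]
        obj["finalizers"].remove(finalizer)
        # Once the list is empty, a pending deletion completes.
        if obj["deletionTimestamp"] is not None and not obj["finalizers"]:
            del self.objects[name]


def reconcile(server, name, finalizer, cleanup):
    """Controller side: finalize objects that are being deleted."""
    obj = server.objects.get(name)
    if obj is None or obj["deletionTimestamp"] is None:
        return
    if finalizer in obj["finalizers"]:
        cleanup(name)  # e.g. release external resources
        server.remove_finalizer(name, finalizer)
```

Note that the controller removes its finalizer only after the cleanup has succeeded; if the cleanup raised an exception, the object would stay in the deleting state and be retried later.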
Describe how k8s.io/client-go/tools/leaderelection works
Sample answer
The package implements leader election by using kube-apiserver resources. Currently, the recommended resource is Lease.
This package does not guarantee that only one client is acting as the leader at any moment. In other words, it does not provide fencing.
Implementation example: https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go
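The core idea of lease-based election can be sketched as follows. This is a simplified model, not the client-go implementation: the `Lease` class, the 15-second duration, and `try_acquire_or_renew` are assumptions made for illustration, and the real package writes the Lease through kube-apiserver, where optimistic locking resolves races between candidates.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Simplified stand-in for a Lease resource (hypothetical fields)."""
    holder: str = ""
    renew_time: float = 0.0
    duration: float = 15.0  # arbitrary value chosen for this sketch


def try_acquire_or_renew(lease, identity, now):
    """Return True if `identity` holds the lease after this call."""
    expired = now >= lease.renew_time + lease.duration
    if lease.holder and lease.holder != identity and not expired:
        # Someone else holds a lease that has not expired yet.
        return False
    lease.holder = identity
    lease.renew_time = now
    return True
```

The sketch also hints at why fencing is not guaranteed: if the current holder stalls for longer than the lease duration, another candidate takes over, but the stalled process does not learn that it lost leadership until it next tries to renew.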
Describe what is metadata.ownerReferences and how it works
Sample answer
The field is used by the garbage collector to implement cascading deletion of resources.
The field is also used by controllers to identify the parent resource.
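Cascading deletion can be illustrated with a small sketch. `collect_garbage` and the data layout are hypothetical: objects are keyed by UID, and each object lists the UIDs of its owners. The function removes every object whose owners have all disappeared, and repeats until nothing changes so that grandchildren are removed as well.

```python
def collect_garbage(objects):
    """Return a copy of `objects` without dependents whose owners are all gone.

    `objects` maps a UID to a dict with an "owners" list of owner UIDs.
    """
    objects = dict(objects)
    changed = True
    while changed:
        changed = False
        for uid, obj in list(objects.items()):
            owners = obj["owners"]
            # An object with owners is garbage once none of them exists.
            if owners and not any(owner in objects for owner in owners):
                del objects[uid]
                changed = True
    return objects


cluster = {
    "deploy-1": {"kind": "Deployment", "owners": []},
    "rs-1": {"kind": "ReplicaSet", "owners": ["deploy-1"]},
    "pod-1": {"kind": "Pod", "owners": ["rs-1"]},
    "pod-2": {"kind": "Pod", "owners": ["rs-1"]},
    "cm-1": {"kind": "ConfigMap", "owners": []},
}
# Deleting the Deployment makes the ReplicaSet, and then its Pods, garbage.
without_deployment = {uid: obj for uid, obj in cluster.items() if uid != "deploy-1"}
remaining = collect_garbage(without_deployment)
```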
What happens to PersistentVolumeClaims instantiated from a StatefulSet when the StatefulSet is deleted?
Sample answer
They will remain.
If you want to delete the PVCs along with the StatefulSet, set metadata.ownerReferences of each PVC to the StatefulSet or to another parent resource. For example, Elastic Cloud on Kubernetes (ECK) sets the owner of PVCs to the Elasticsearch custom resource.
Components and their collaborations
Describe the roles of these components:
- etcd
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- kubelet
- kube-proxy
- containerd
- CoreDNS
Sample answer
- etcd: to store resource objects persistently.
- kube-apiserver: to access etcd and provide a REST API to other components.
- kube-controller-manager: to run a set of controllers that watch and edit resources in kube-apiserver.
- kube-scheduler: to schedule new Pods to a Node.
- kubelet: to run Pods on each Node.
- kube-proxy: to configure network rules on each Node for Services.
- containerd: to accept CRI requests from kubelet and run containers.
- CoreDNS: to provide internal DNS for Service names.
Describe the behavior of each component from the creation of a Pod to the running of the containers inside
Sample answer
- kube-apiserver saves a new Pod resource in etcd
- kube-scheduler finds the new Pod
- kube-scheduler allocates a Node to the new Pod based on available resources and other conditions
- kubelet on the allocated Node finds the new Pod
- kubelet initializes the Pod runtime as follows:
  - kubelet sends a CRI request to a CRI runtime such as containerd to create an infrastructure container
  - The CRI runtime calls CNI plugins to initialize the network namespace of the Pod
  - kubelet sequentially requests the CRI runtime to run spec.initContainers, if any
  - kubelet concurrently requests the CRI runtime to run spec.containers
Describe who creates the default ServiceAccount in each namespace and when
Sample answer
The default ServiceAccount does not exist immediately after a namespace is created.
The default ServiceAccount is created by kube-controller-manager a little after the namespace is created. Similarly, the Secret token for the default ServiceAccount is created by kube-controller-manager a little after the ServiceAccount is created.
For this reason, creating a Pod in a newly created namespace sometimes fails. It is safe to create a Deployment instead, because its controller retries creating Pods until it succeeds.
Explain what happens to a Pod when kubelet or Node running the Pod cannot communicate with kube-apiserver
Sample answer
kubelet sends periodic heartbeats to kube-apiserver. If the heartbeats stop, the node lifecycle controller in kube-controller-manager adds taints to the Node resource.
Pods can tolerate the taints for up to 300 seconds because they have the following tolerations by default:
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
When 300 seconds have elapsed, graceful termination is initiated. As spec.terminationGracePeriodSeconds is 30 seconds by default, metadata.deletionTimestamp is normally set to 30 seconds from now.
When an additional 30 seconds have elapsed, the Pod transitions to Terminating status. However, since kubelet cannot see the status of the Pod, the Pod will remain running.
Describe how ReplicaSet controller works if a Pod is Terminating
Sample answer
ReplicaSet controller usually adds a new Pod in a timely manner.
Describe how StatefulSet controller works if a Pod is Terminating
Sample answer
StatefulSet controller cannot add a new Pod because Pods in a StatefulSet have stable network IDs, and the replacement Pod would reuse the ID of the Terminating one.
Explain why fencing a failing Node is important before removing Node resource
Sample answer
When kubelet cannot communicate with kube-apiserver, Pods on the Node become Terminating but will not be removed. Such Pods can be removed if the Node resource is removed from kube-apiserver.
However, Pod processes may still be alive if the problem is merely communication between kubelet and kube-apiserver. In this case, removing Nodes and Pods might cause split-brain syndrome because a new Pod having the same ID in a StatefulSet would run on another Node.
To avoid such incidents, a failing Node should be killed with STONITH or something like that before removing the Node resource.
Resources
Explain the difference between resources.limits and resources.requests of a container
Sample answer
resources.limits sets an upper limit on the resource usage of containers using Linux cgroups.
resources.requests is used by kube-scheduler to choose available Nodes. resources.requests.cpu is also used to distribute CPU time among containers using CFS shares.
What happens when a container has only resources.limits.memory?
Sample answer
The container is modified to have resources.requests.memory with the same value as resources.limits.memory.
This is the same for resources.limits.cpu.
cf. Create a Pod that gets assigned a QoS class of Guaranteed
What happens when a container consumes more memory than requested?
Sample answer
Pods that use more memory than they requested become candidates for eviction when the Node is running out of memory.
cf. Interactions between Pod priority and quality of service
Describe Quality of Service classes for Pods
Sample answer
There are three classes, namely, Guaranteed, Burstable, and BestEffort.
Pods in which every container has requests and limits for both CPU and memory, with requests equal to limits, are classified as Guaranteed. Guaranteed Pods will not be evicted, except in exceptional cases.
Other Pods that have at least one resource request are classified as Burstable. The remaining Pods are classified as BestEffort. Burstable Pods are less likely to be evicted than BestEffort Pods.
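The classification rules can be written down as a short function. This is a sketch, not the kubelet implementation: `qos_class` is a hypothetical helper, it looks only at CPU and memory, and it compares quantities as literal strings, so "500m" and "0.5" would be treated as different values.

```python
def qos_class(containers):
    """Classify a Pod from its containers' resource settings.

    Each container is a dict with optional "requests" and "limits" maps.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # Requests default to limits when only limits are given.
        requests = {**limits, **container.get("requests", {})}
        if requests:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

For example, a Pod whose only container sets `limits` for both CPU and memory and nothing else is Guaranteed, because the missing requests are filled in from the limits.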
Describe PriorityClass for Pods
Sample answer
PriorityClass is used by kube-scheduler to perform Pod preemption. Preemption is an operation that removes a low-priority Pod from a Node and schedules a high-priority Pod to the Node.
Is a Pod evicted when the Node is running out of CPU time?
Sample answer
No. CPU is a compressible resource, so containers are throttled rather than evicted.
For this reason, setting a proper CPU request is important for production environments.
Networking
Describe types of Service, namely, ClusterIP, NodePort, and LoadBalancer
Sample answer
ClusterIP is the most basic Service type. It provides a virtual IP address that service consumers use to access backend Pods.
NodePort provides a port number in addition to the virtual IP provided by the ClusterIP type. Service consumers can reach backend Pods by connecting to any Node on that port.
LoadBalancer tells an external load balancer to assign a virtual IP address and route packets sent to that address to backend Pods.
Explain the relationship between Service and Endpoints (EndpointSlices)
Sample answer
Every Service is accompanied by an Endpoints resource of the same name. EndpointSlices are linked to the Service by the kubernetes.io/service-name label instead. Endpoint(Slice)s represent the addresses of backend Pods.
Endpoint(Slice)s are created and updated automatically by kube-controller-manager if the Service has a Pod selector. If not, Endpoint(Slice)s need to be maintained by other means.
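What the controller does for a Service with a selector can be sketched roughly as follows. `build_endpoints` and the flat data layout are hypothetical simplifications: the real resources also carry ports, and they record addresses that are not ready separately instead of dropping them.

```python
def build_endpoints(service, pods):
    """Compute backend addresses for a Service from its Pod selector."""
    selector = service.get("selector")
    if not selector:
        # Without a selector, nothing is generated automatically.
        return None
    addresses = []
    for pod in pods:
        labels = pod.get("labels", {})
        matches = all(labels.get(key) == value for key, value in selector.items())
        if matches and pod.get("ready"):
            addresses.append(pod["ip"])
    return {"name": service["name"], "addresses": sorted(addresses)}


pods = [
    {"ip": "10.0.0.1", "labels": {"app": "web"}, "ready": True},
    {"ip": "10.0.0.2", "labels": {"app": "web"}, "ready": False},
    {"ip": "10.0.0.3", "labels": {"app": "db"}, "ready": True},
]
web = build_endpoints({"name": "web", "selector": {"app": "web"}}, pods)
external = build_endpoints({"name": "external-db"}, pods)
```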
Describe the usage of spec.containers.ports of Pod and EXPOSE in Dockerfiles
Sample answer
If defined with names, spec.containers.ports can be used in livenessProbe, readinessProbe, or the targetPort field of a Service to reference the port by name.
    containers:
    - ports:
      - name: health
        containerPort: 8080
        protocol: TCP
      livenessProbe:
        httpGet:
          port: health
          path: /healthz
EXPOSE in Dockerfiles is merely documentation.
Neither of them actually publishes the port. A container may listen on ports other than the specified ones, and it may not listen on the specified ports at all.
Explain how packets from the outside reach Pods if the Service's spec.externalTrafficPolicy is set to Local
Sample answer
spec.externalTrafficPolicy is mainly for LoadBalancer type Services. If this field is empty or Cluster (default), kube-proxy rewrites the source address of packets to the Node address and forwards them to the destination Pod. In this mode, the destination Pod may be running on another Node.
If this field is Local, kube-proxy does not rewrite the source address. In this mode, the destination Pod must be running on the same Node where kube-proxy is running. Therefore, the external load balancer routes packets only to the Nodes where the destination Pods are running.
For example, MetalLB advertises the virtual address only from the Nodes where the destination Pods are running.
Monitoring
Describe what happens when a readinessProbe fails
Sample answer
A readinessProbe checks if the container is ready to accept requests.
If a readinessProbe fails, the Pod becomes unready and is excluded from Service load balancing targets.
Describe what happens when livenessProbe fails
Sample answer
A livenessProbe checks if the container is alive.
If a livenessProbe fails, the container process is killed and restarted.
Access control
Can a Role (not a ClusterRole) grant access to cluster-scoped resources?
Sample answer
No.
Can a ClusterRole grant access to namespace-scoped resources?
Sample answer
Yes. Such a ClusterRole can be used to grant access to resources in any namespace.
Is it a good idea to edit the privilege of the default ServiceAccount?
Sample answer
Definitely not.
The default ServiceAccount is used by every Pod that does not specify a ServiceAccount. Editing the privilege of the default ServiceAccount would cause unexpected behavior.
Describe how kube-apiserver prevents privilege escalation
Sample answer
kube-apiserver checks privileges when a subject (user or ServiceAccount) creates or updates a (Cluster)RoleBinding. If the subject does not already hold all the privileges it is going to grant to other entities, kube-apiserver denies the operation.
If you are implementing a controller that dynamically grants some privilege to other ServiceAccounts, make sure that the ServiceAccount of the controller has the same privilege.
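The check is essentially a subset test, as the following sketch shows. The function names and the flat `(apiGroup, resource, verb)` tuples are assumptions for illustration; wildcards, resource names, and the separate `bind` verb that can also authorize a binding are left out.

```python
def missing_permissions(grantor_rules, role_rules):
    """Return the permissions in the role that the grantor does not hold."""
    return sorted(set(role_rules) - set(grantor_rules))


def can_create_binding(grantor_rules, role_rules):
    """A binding is allowed only if it grants nothing the grantor lacks."""
    return not missing_permissions(grantor_rules, role_rules)


controller = [("", "configmaps", "get"), ("", "configmaps", "list")]
reader_role = [("", "configmaps", "get")]
writer_role = [("", "configmaps", "get"), ("", "configmaps", "update")]
```

With these values, the controller could bind `reader_role` but not `writer_role`, because it does not hold the `update` permission itself.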
Describe what is user impersonation
Sample answer
If granted, a user can act as another user and/or belong to another group through HTTP request headers. User impersonation should be granted only to cluster administrators.
When using kubectl, the --as=USER and --as-group=GROUP command-line flags set impersonation headers.
Describe view, edit, admin ClusterRoles
Sample answer
These are called aggregated ClusterRoles. An aggregated ClusterRole merges privileges of other ClusterRoles that have special labels.
When defining new custom resources, consider aggregating the appropriate privileges into these ClusterRoles.
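The merging can be sketched as a label match over all ClusterRoles. The `aggregate` function, the flat data layout, and the `widgets-viewer` role for a `widgets` custom resource are hypothetical; the label `rbac.authorization.k8s.io/aggregate-to-view` is the one the built-in view ClusterRole selects on.

```python
def aggregate(cluster_roles, target_name):
    """Collect the rules of every ClusterRole selected by the target's selectors."""
    selectors = cluster_roles[target_name].get("aggregationSelectors", [])
    rules = []
    for name, role in sorted(cluster_roles.items()):
        if name == target_name:
            continue
        labels = role.get("labels", {})
        selected = any(
            all(labels.get(key) == value for key, value in selector.items())
            for selector in selectors
        )
        if selected:
            for rule in role["rules"]:
                if rule not in rules:
                    rules.append(rule)
    return rules


cluster_roles = {
    "view": {
        "aggregationSelectors": [{"rbac.authorization.k8s.io/aggregate-to-view": "true"}],
        "rules": [],
    },
    # Hypothetical ClusterRole shipped together with a custom resource.
    "widgets-viewer": {
        "labels": {"rbac.authorization.k8s.io/aggregate-to-view": "true"},
        "rules": [("example.com", "widgets", "get"), ("example.com", "widgets", "list")],
    },
    "widgets-admin": {
        "labels": {},
        "rules": [("example.com", "widgets", "delete")],
    },
}
view_rules = aggregate(cluster_roles, "view")
```

In this sketch, anyone bound to `view` would gain read access to the hypothetical `widgets` resource without the `view` ClusterRole itself being edited.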