Production Readiness Checklist for Deploying Application on Kubernetes

Hong Chun
6 min read · Dec 13, 2020

Background

After spending almost a year running applications in a production environment, I've learned some useful configurations that I would love to share.

I find myself referring to this checklist every time I assess an application's production readiness from a Kubernetes configuration perspective.

Reliability

1. Health Check

Why?

Kubernetes can't automatically restart an unhealthy Pod if it doesn't know the Pod has died. It needs a way to tell whether your Pod is healthy and whether it's ready to receive traffic, since applications sometimes take a while to boot up before they become available.

How?

HTTP Endpoint:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: busybox
          livenessProbe: # Do This
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 3
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe: # Do This
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 3
            timeoutSeconds: 10
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10

Custom Script:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: busybox
          livenessProbe: # Do This
            exec:
              command:
                - sh
                - /healthcheck.sh
            initialDelaySeconds: 3
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe: # Do This
            exec:
              command:
                - sh
                - /healthcheck.sh
            initialDelaySeconds: 3
            timeoutSeconds: 10
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10

How to Validate?

Send high traffic to the Pod during the following scenarios and verify that you don't receive any 5xx errors (example commands are sketched after the list):

  • Pod rolling update / start a new Pod
  • Kill an existing Pod
  • Kill/crash the process running in the container
  • Make the /health endpoint (or the /healthcheck.sh script) fail or become inaccessible
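
A rough sketch of how these scenarios can be triggered, assuming the Deployment is named myapp as in the examples above and a load generator (a simple curl loop or any load-testing tool you prefer) keeps hitting the application in another terminal:

kubectl rollout restart deployment/myapp   # rolling update / start new Pods
kubectl delete pod <myapp-pod-name>        # kill an existing Pod
kubectl exec <myapp-pod-name> -- kill 1    # send SIGTERM to the main process; whether it actually crashes depends on the app's signal handling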

2. Rollout Strategy

Why?

During a Deployment rollout, we need to ensure that we don't replace all Pods at once, but instead replace them gradually to avoid downtime.

How?

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      component: myapp
  strategy: # Do This
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        component: myapp
    ...

How to Validate?

Roll out a new Deployment and monitor the Pod replacement process; based on the configuration above, at least 1 Pod should be available at any given time.
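
One way to observe the rollout, assuming the Deployment is named myapp as above:

kubectl rollout status deployment/myapp          # waits until the rollout completes
kubectl get pods -l component=myapp --watch      # watch Pods being replaced one at a time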

3. Graceful Shutdown

Why?

Generally, when Kubernetes schedules a Pod for shutdown, it issues a SIGTERM signal to the container's process, which should then initiate a graceful shutdown. This gives currently running requests enough time to finish instead of being forcefully stopped.

This article has a very detailed explanation of exactly what happens when a Pod shutdown is initiated; I highly recommend reading it: https://learnk8s.io/graceful-shutdown

How?

Most frameworks handle this out of the box, but if you have a process that doesn't, you can make use of a preStop hook.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: nginx
          lifecycle:
            preStop: # Do This
              exec:
                command: ["/usr/sbin/nginx", "-s", "quit"]

How to Validate?

Send high traffic to the Pod during the following scenarios and verify that you don't receive any 5xx errors (a short validation sketch follows the list):

  • Pod rolling update
  • Kill an existing Pod
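
A quick way to observe the shutdown behaviour, using the names from the example above:

kubectl rollout restart deployment/myapp
kubectl get pods --watch               # Pods should move through Terminating without abrupt failures
kubectl logs <terminating-pod-name>    # check for the application's own graceful-shutdown log lines, if it emits any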

4. Give Kubernetes Enough Time to Remove Your Pod from Service Endpoints

Why?

Kubernetes is a distributed environment. When you expose your Pod via a Service, every Node in the cluster has to learn about the change and stop sending traffic to a Pod that is shutting down, and this propagation takes time.

How?

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: nginx
          lifecycle:
            preStop: # Do This
              exec:
                command: ["/bin/bash", "-c", "/bin/sleep 5;"]

How to Validate?

Send high traffic to the Pod during the following scenarios and verify that you don't receive any 5xx errors (see the endpoint check after the list):

  • Pod rolling update
  • Kill an existing Pod
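
You can also watch the Service's endpoints while Pods are being replaced, to confirm that a terminating Pod's IP is removed before its container exits (assuming the Service fronting this Deployment is named myapp):

kubectl get endpoints myapp --watch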

High Availability

1. Pod Distribution Across Availability Zones or Nodes

Why?

A high availability setup is highly recommended to increase the application's resiliency and reduce risk when any of the following happen:

  • A Pod dies while only 1 replica is running
  • An availability zone outage
  • Node failure

How?

Replication for redundancy:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2 # Do This or Use HorizontalPodAutoscaler
  selector:
    matchLabels:
      component: myapp
  template:
    metadata:
      labels:
        component: myapp
    ...

Distribute Pods across availability zones & Nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      component: myapp
  template:
    metadata:
      labels:
        component: myapp
    spec:
      affinity:
        podAntiAffinity: # Do This
          # "preferredXXX" for soft requirement
          # or use "requiredXXX" for hard requirement
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm: # Distribute Across Availability Zone
                labelSelector:
                  matchExpressions:
                    - key: component
                      operator: In
                      values:
                        - myapp
                topologyKey: failure-domain.beta.kubernetes.io/zone
              weight: 100
            - podAffinityTerm: # Distribute Across Node
                labelSelector:
                  matchExpressions:
                    - key: component
                      operator: In
                      values:
                        - myapp
                topologyKey: kubernetes.io/hostname
              weight: 100

How to Validate?

Check if the Pods are scheduled on different Nodes:

kubectl get pods --output=wide

# Output
NAME    READY   STATUS    RESTARTS   AGE   IP           NODE
myapp   1/1     Running   0          13s   10.200.0.4   node1
myapp   1/1     Running   0          13s   10.200.0.5   node2

Check if the Pods are scheduled in different availability zones:

kubectl get nodes --show-labels

# Output
NAME    STATUS   ROLES    AGE   VERSION   LABELS
node1   Ready    <none>   2d    v1.17     failure-domain.beta.kubernetes.io/zone=eu-west-1a
node2   Ready    <none>   2d    v1.17     failure-domain.beta.kubernetes.io/zone=eu-west-1b

2. PodDisruptionBudget

Why?

Pods can be evicted in various scenarios, for example when:

  • Node auto-scaling is scaling the cluster down
  • Someone performs kubectl drain to upgrade the Kubernetes cluster

It is important to configure a PodDisruptionBudget so that your Pods are not evicted faster than replacements can be scheduled on other Nodes, keeping a minimum number available during these voluntary disruptions.

How?

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  ...
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp
spec:
  maxUnavailable: 1 # Do This
  selector:
    matchLabels:
      app: myapp # must match the Pod labels of the Deployment above

How to Validate?

Use kubectl drain to drain the Node the Pod is running on and monitor whether a new Pod is scheduled on another Node.
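
For example, using the names from the earlier output (kubectl drain may need extra flags depending on what else runs on the Node):

kubectl get pdb myapp                     # ALLOWED DISRUPTIONS should be at least 1
kubectl drain node1 --ignore-daemonsets   # evict the Pods from that Node
kubectl get pods --output=wide --watch    # a replacement Pod should start on another Node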

Performance Efficiency

1. Configure CPU & Memory

Why?

By default, a Pod can consume all of the Node's CPU & memory if it doesn't have resource requests & limits configured.

Resource requests: the Pod is scheduled on a Node that has at least this amount of CPU & memory available, guaranteeing the Pod its own reserved resources.

Resource limits: equal to or higher than the resource requests, this is a ceiling the Pod cannot exceed even if the Node has spare resources.

How?

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: busybox
          resources: # Do This
            limits:
              memory: 512Mi
            requests:
              cpu: 1024m
              memory: 512Mi

How to Validate?

Check the resource requests & limits set on the Pod:

kubectl get pod mypod --output=yaml

Check how much of those resources the Pod is currently using:

kubectl top pod mypod

2. Dynamic Scaling with HorizontalPodAutoscaler

Why?

An unpredictable workload could easily use up all the resources of a single Pod, while provisioning the highest number of replicas up front means resources sit underutilized. We can strike a balance with a HorizontalPodAutoscaler, which scales the Pods automatically when needed.

How?

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      component: myapp
  template:
    metadata:
      labels:
        component: myapp
    ...
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 2 # Do This
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
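
Note that averageUtilization is measured against the Pods' CPU resource requests, so the requests from the previous section must be configured, and the cluster needs a metrics source such as metrics-server for the HorizontalPodAutoscaler to act on.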

How to Validate?

Send high traffic to the Pods and monitor whether they scale up, then whether they scale back down when traffic is low.
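
To observe this, assuming the HorizontalPodAutoscaler is named myapp as above:

kubectl get hpa myapp --watch                 # shows current vs. target CPU utilization and the replica count
kubectl get pods -l component=myapp --watch   # new Pods appear on scale-up and disappear on scale-down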

Final Note

The above checklist might not cover every use case, but it should include the most common configurations that work for most applications. I'd love to know what your checklist looks like. :)
