I find myself referring to this checklist every time I evaluate an application's production readiness from a Kubernetes configuration perspective.
Reliability
1. Health Check
Why?
Kubernetes can't automatically heal your Pod if it doesn't know the Pod has died. It needs a way to tell whether your Pod is healthy and whether it's ready to receive traffic. Applications also often take a while to boot before they become available.
How?
HTTP Endpoint:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: busybox
          livenessProbe: # Do This
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 3
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe: # Do This
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 3
            timeoutSeconds: 10
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10
Custom Script:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: busybox
          livenessProbe: # Do This
            exec:
              command:
                - sh
                - /healthcheck.sh
            initialDelaySeconds: 3
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe: # Do This
            exec:
              command:
                - sh
                - /healthcheck.sh
            initialDelaySeconds: 3
            timeoutSeconds: 10
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10
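The probes above assume a /healthcheck.sh inside the image, which isn't shown. A minimal sketch of what such a script might look like (the marker-file convention here is an assumption; the only contract the probe cares about is that the script exits 0 when healthy):

```shell
#!/bin/sh
# Hypothetical /healthcheck.sh for the exec probes above: the probe passes
# when the script exits 0 and fails otherwise. Here the health signal is a
# marker file the application is assumed to write once it is ready; adapt
# the check to whatever your process actually exposes (a port, a pidfile...).
check_health() {
  [ -f "$1" ]
}

marker="$(mktemp -d)/app-ready"
check_health "$marker" && echo "status: healthy" || echo "status: unhealthy"
touch "$marker" # stand-in for the application signalling readiness
check_health "$marker" && echo "status: healthy" || echo "status: unhealthy"
```

Run standalone, this prints "status: unhealthy" followed by "status: healthy" once the marker exists.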
How to Validate?
Send high traffic to the Pod during the following scenarios and verify that you don't receive 5xx errors:
- Pod rolling update / starting a new Pod
- Killing an existing Pod
- Killing/crashing the process running in the container
- Modifying the /healthcheck endpoint to become inaccessible
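A sketch of that validation loop (the URL is a placeholder for your own Service endpoint; run it while you disrupt Pods in another terminal):

```shell
#!/bin/sh
# Sketch: hammer the Service while Pods are being replaced and count errors.
# 5xx responses and outright connection failures (curl reports code 000)
# both count as failures here.
URL="${1:-http://myapp.example.com/health}" # placeholder endpoint

is_error() {
  case "$1" in
    5??|000) return 0 ;;
    *)       return 1 ;;
  esac
}

# Live loop (uncomment to run against a real endpoint):
# errors=0; total=0
# while [ "$total" -lt 200 ]; do
#   code=$(curl -s -o /dev/null -w '%{http_code}' "$URL" || echo 000)
#   is_error "$code" && errors=$((errors + 1))
#   total=$((total + 1))
# done

# Demo of the classification on sample status codes:
errors=0; total=0
for code in 200 204 503 000; do
  is_error "$code" && errors=$((errors + 1))
  total=$((total + 1))
done
echo "errors: $errors / $total"
```

The demo prints "errors: 2 / 4"; against a healthy rollout the live loop should report zero errors.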
2. Rollout Strategy
Why?
We need to ensure that during a Deployment rollout we don't replace all Pods at once, but instead replace them gradually to avoid downtime.
How?
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      component: myapp
  strategy: # Do This
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        component: myapp
  ...
How to Validate?
Roll out a new Deployment and monitor the Pod replacement process; at least 1 Pod should be available at any given time (based on the configuration above).
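One way to sketch that monitoring, assuming the Deployment is named myapp as in the example (the sampled values below stand in for live kubectl output):

```shell
#!/bin/sh
# Sketch: while the rollout runs, repeatedly sample readyReplicas and track
# the minimum observed. With maxUnavailable: 0 and replicas: 2, the minimum
# should never drop below 2. A live sample would come from:
#   kubectl get deploy myapp -o jsonpath='{.status.readyReplicas}'
min_ready=999
for ready in 2 2 3 3 2; do # stand-in values for live kubectl output
  [ "$ready" -lt "$min_ready" ] && min_ready=$ready
done
echo "minimum ready replicas observed: $min_ready"
```

If the minimum ever drops below the configured replica count, the rollout caused an availability gap.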
3. Graceful Shutdown
Why?
Generally, when Kubernetes schedules a Pod for shutdown, it issues a SIGTERM
signal to the container's process, and the process then initiates its graceful shutdown. This makes sure currently running work has enough time to finish instead of being forcefully stopped.
This article I found has a very detailed explanation of exactly what happens when a Pod shutdown is initiated; I highly recommend reading it — https://learnk8s.io/graceful-shutdown
How?
Most frameworks handle this out of the box, but if you have a process that doesn't, you can make use of a preStop
hook.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: nginx
          lifecycle:
            preStop: # Do This
              exec:
                command: ["/usr/sbin/nginx", "-s", "quit"]
How to Validate?
Send high traffic to the Pod during the following scenarios and verify that you don't receive 5xx errors:
- Pod rolling update
- Killing an existing Pod
4. Give Kubernetes Enough Time to Remove Your Pod from the Service
Why?
Kubernetes is a distributed environment: when you expose your Pod via a Service, every Node in the cluster has to learn about the change and stop sending traffic to a Pod that is shutting down. That propagation takes time, so the Pod should keep serving briefly after receiving SIGTERM.
How?
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: nginx
          lifecycle:
            preStop: # Do This
              exec:
                command: ["/bin/bash", "-c", "/bin/sleep 5;"]
How to Validate?
Send high traffic to the Pod during the following scenarios and verify that you don't receive 5xx errors:
- Pod rolling update
- Killing an existing Pod
High Availability
1. Pod Distribution Across Availability Zone or Node
Why?
A high availability setup is highly recommended to increase the application's resiliency while reducing risk when any of the following happens:
- A Pod dies and there is only 1 replica
- An availability zone outage
- A Node failure
How?
Replication for redundancy:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2 # Do This or Use HorizontalPodAutoscaler
  selector:
    matchLabels:
      component: myapp
  template:
    metadata:
      labels:
        component: myapp
  ...
Distribute Pod across availability zone & Node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      component: myapp
  template:
    metadata:
      labels:
        component: myapp
    spec:
      affinity:
        podAntiAffinity: # Do This
          # "preferredXXX" for soft requirement
          # or use "requiredXXX" for hard requirement
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm: # Distribute Across Availability Zone
                labelSelector:
                  matchExpressions:
                    - key: component
                      operator: In
                      values:
                        - myapp
                topologyKey: failure-domain.beta.kubernetes.io/zone
              weight: 100
            - podAffinityTerm: # Distribute Across Node
                labelSelector:
                  matchExpressions:
                    - key: component
                      operator: In
                      values:
                        - myapp
                topologyKey: kubernetes.io/hostname
              weight: 100
How to Validate?
Check whether the Pods are scheduled on different Nodes:
kubectl get pods --output=wide

# Output
NAME    READY   STATUS    RESTARTS   AGE   IP           NODE
myapp   1/1     Running   0          13s   10.200.0.4   node1
myapp   1/1     Running   0          13s   10.200.0.5   node2
Check whether the Pods are scheduled in different availability zones:
kubectl get nodes --show-labels

# Output
NAME    STATUS   ROLES    AGE   VERSION   LABELS
node1   Ready    <none>   2d    v1.17     failure-domain.beta.kubernetes.io/zone=eu-west-1a
node2   Ready    <none>   2d    v1.17     failure-domain.beta.kubernetes.io/zone=eu-west-1b
2. PodDisruptionBudget
Why?
A Pod can be evicted in various scenarios. For example, when:
- The Node autoscaler is trying to scale down
- Someone performs kubectl drain to upgrade the Kubernetes cluster
It is important to configure a PodDisruptionBudget
so that a minimum number of your Pods stays available during such voluntary disruptions, instead of all of them being killed at once.
How?
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  ...
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp
spec:
  maxUnavailable: 1 # Do This
  selector:
    matchLabels:
      app: myapp # Must match the Pod labels above
How to Validate?
Use kubectl drain
to drain the Node the Pod is running on, and monitor whether a new Pod is scheduled on another Node.
Performance Efficiency
1. Configure CPU & Memory
Why?
By default, if a Pod has no resource requests & limits configured, it can consume the whole Node's CPU & Memory.
Resource requests: Schedule the Pod on a Node that has at least x amount of CPU & Memory available. This guarantees the Pod its own reserved resources.
Resource limits: Equal to or higher than the resource requests; a ceiling the Pod cannot exceed even if the Node has resources available.
How?
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: busybox
          resources: # Do This
            limits:
              memory: 512Mi
            requests:
              cpu: 1024m
              memory: 512Mi
How to Validate?
Check the resource requests & limits set on the Pod:
kubectl get pod mypod --output=yaml
Check how many resources the Pod is currently using:
kubectl top pod mypod
2. Dynamic Scaling with HorizontalPodAutoscaler
Why?
An unpredictable workload can easily use up all the resources in a single Pod, while provisioning every replica for the peak load means under-utilized resources. We can strike a balance with a HorizontalPodAutoscaler
, which scales the number of Pods up and down as needed.
How?
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      component: myapp
  template:
    metadata:
      labels:
        component: myapp
  ...
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 2 # Do This
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
How to Validate?
Send high traffic to the Pod and monitor whether it scales up, and whether it scales back down when traffic is low.
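A sketch of that load test (the names assume the myapp examples above; the busybox load generator is a common pattern, not part of the original configuration):

```shell
#!/bin/sh
# Sketch: run a throwaway load generator against the Service and watch the
# HPA react. Run these two in separate terminals:
#   kubectl run load-generator --rm -it --image=busybox -- \
#     /bin/sh -c 'while true; do wget -q -O- http://myapp; done'
#   kubectl get hpa myapp --watch
#
# The HPA above targets 80% average CPU utilization, so scale-up is only
# expected once observed utilization crosses that threshold:
target=80
for observed in 35 70 95; do # stand-in values for live metrics
  if [ "$observed" -gt "$target" ]; then
    echo "cpu ${observed}% -> expect scale up"
  else
    echo "cpu ${observed}% -> expect hold"
  fi
done
```

Stopping the load generator should eventually bring the replica count back down to minReplicas.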
Final Note
The above checklist might not cover every use case, but it should contain the most common configuration that works for most applications. I'd love to know what your checklist looks like. :)