Module 03

Pods, Containers & Scheduling

The building blocks of every Kubernetes application

CKA CKAD ~45 slides


Meet Sarah...

Sarah is a developer at a fast-growing startup. She has been building containerized apps with Docker for a year. Today, her team is moving to Kubernetes, and her manager just asked her to deploy their flagship API to the cluster.

"Where do I even start?" she wonders, staring at the kubectl command line.

The answer: Pods. Everything in Kubernetes begins with a Pod.

Let's follow Sarah's journey from her first Pod to mastering scheduling, probes, and resource management. By the end, you will be able to do everything she learns -- and pass the CKA/CKAD exam questions about it.

What Is a Pod?

Sarah's Docker experience taught her to think in containers. But Kubernetes wraps containers in something bigger...

Think of a Pod as a shared apartment for containers

Containers in the same Pod share the same network (localhost), storage volumes, and lifecycle. They are always scheduled together on the same node -- just like roommates share an address, a kitchen, and a lease.

  • A Pod is the smallest deployable unit in Kubernetes (not a container!)
  • Most Pods run a single container, but multi-container Pods are common for sidecars
  • Pods are ephemeral -- they can be killed and replaced at any time
  • Every Pod gets its own IP address within the cluster

Anatomy of a Pod

Sarah opens the documentation and sees this structure. Let's break it down piece by piece.

Shared Resources

  • Network namespace -- all containers share one IP, communicate via localhost
  • Volumes -- mounted storage accessible to all containers in the Pod
  • IPC namespace -- containers can use shared memory
  • PID namespace -- optionally share process visibility

Per-Container Settings

  • Image -- which container image to run
  • Ports -- which ports the container exposes
  • Resources -- CPU/memory requests and limits
  • Environment variables -- config injected at runtime
  • Probes -- health checks (liveness, readiness, startup)

Sarah's First Pod YAML

Sarah writes her very first Pod manifest:

apiVersion: v1
kind: Pod
metadata:
  name: my-api
  labels:
    app: my-api
    tier: backend
spec:
  containers:
  - name: api
    image: nginx:1.25
    ports:
    - containerPort: 80
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

Every Kubernetes object follows the same pattern: apiVersion, kind, metadata, spec.

Declarative vs Imperative

Sarah has two ways to create her Pod. Understanding the difference is fundamental to Kubernetes thinking.

The Restaurant Analogy

Imperative = walking into a kitchen and cooking it yourself, step by step.
Declarative = handing a menu order to the waiter and letting the kitchen figure out how to make it.

Imperative (Quick & dirty)

kubectl run my-api --image=nginx:1.25
kubectl expose pod my-api --port=80

Good for one-off debugging; not recommended for production.

Declarative (The K8s way)

kubectl apply -f pod.yaml

Version-controlled, repeatable, auditable. Always prefer this.

CKA/CKAD Tip: On the exam, use imperative commands to save time generating YAML, then edit the YAML before applying: kubectl run my-api --image=nginx --dry-run=client -o yaml > pod.yaml

Pod Lifecycle

Sarah deploys her Pod and watches it go through several phases. Here's what happens behind the scenes...

  • Pending -- accepted but not yet running: the scheduler is finding a node; images are being pulled
  • Running -- at least one container is running: main container(s) executing normally
  • Succeeded -- all containers exited 0: common for Jobs; the Pod completed its task
  • Failed -- a container exited non-zero: something went wrong; check the logs
  • Unknown -- state cannot be determined: usually a communication failure with the node
Key insight: Pods do not self-heal. If a Pod dies, it stays dead unless a controller (Deployment, ReplicaSet) recreates it. That is why we rarely create bare Pods in production.

Multi-Container Pod Patterns

Sarah's teammate asks: "Can I run a log shipper next to my app in the same Pod?" Absolutely -- that is the sidecar pattern.

Sidecar

Extends main container functionality. Examples: log collectors, proxies, sync agents.

The main container and the sidecar share volumes and network.

Ambassador

Proxies network connections from the main container to the outside world.

Example: a proxy that handles connection pooling to a database.

Adapter

Transforms output from the main container into a standard format.

Example: converting logs to a common format before shipping.

spec:
  containers:
  - name: app
    image: my-app:1.0
    volumeMounts:
    - name: logs
      mountPath: /var/log/app   # the app writes its logs here
  - name: log-shipper           # sidecar container
    image: fluentd:latest       # pin a specific tag in production
    volumeMounts:
    - name: logs
      mountPath: /var/log/app   # the sidecar reads the same files
  volumes:
  - name: logs
    emptyDir: {}

Init Containers

Before Sarah's API can start, it needs to wait for the database to be ready. Init containers solve this perfectly.

Init containers run before the main containers start. They run sequentially, one at a time, and each must complete successfully before the next starts.

Common use cases:

  • Wait for a dependent service to be available
  • Pre-populate shared volumes with data
  • Run database migrations before the app starts
  • Fetch secrets or config from a vault
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c',
      'until nslookup postgres.default.svc.cluster.local; do
        echo "Waiting for DB..."; sleep 2;
      done']
  containers:
  - name: api
    image: my-api:1.0

Namespaces

Sarah's cluster is shared by multiple teams. How do they keep their resources from colliding?

Like folders on your computer, keeping things organized

Namespaces provide logical isolation within a cluster. Different teams, environments, or projects can each have their own namespace -- with their own resource quotas and access policies.

Default Namespaces

  • default -- where resources go if no namespace is specified
  • kube-system -- Kubernetes system components (DNS, scheduler, etc.)
  • kube-public -- publicly readable, used for cluster info
  • kube-node-lease -- node heartbeats for health detection

Working with Namespaces

# Create a namespace
kubectl create namespace dev

# Deploy to a namespace
kubectl apply -f pod.yaml -n dev

# List pods in a namespace
kubectl get pods -n dev

# List across all namespaces
kubectl get pods -A

Note: Not all resources are namespaced. Nodes, PersistentVolumes, and ClusterRoles are cluster-scoped. Check with kubectl api-resources --namespaced=false.

Labels & Selectors

Sarah has 50 Pods running. How does she find, group, and manage them? Labels are the answer.

The Post-it note system that makes Kubernetes work

Labels are key-value pairs you stick on any resource. Selectors are how you query them. Services use selectors to find Pods. Deployments use selectors to manage ReplicaSets. Everything connects through labels.

Setting Labels

metadata:
  labels:
    app: my-api
    tier: backend
    env: production
    version: v2

Querying with Selectors

# Equality-based
kubectl get pods -l app=my-api

# Set-based
kubectl get pods -l 'env in (prod,staging)'

# Multiple conditions (AND)
kubectl get pods -l app=my-api,tier=backend
Best practice: Adopt a consistent labeling convention across your org: app.kubernetes.io/name, app.kubernetes.io/version, app.kubernetes.io/component.

Annotations

While labels are for identification and selection, annotations store non-identifying metadata.

Labels vs Annotations

  • Labels are used for selection & grouping; annotations hold non-identifying metadata
  • Label values must be short (63 chars max); annotations can be large (256KB total per object)
  • Labels are consumed by K8s selectors; annotations are consumed by tools & humans

Common Annotations

metadata:
  annotations:
    kubernetes.io/change-cause: "Update to v2"
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    description: "Main API service"

Knowledge Check: Pods Basics

Let's see what you have learned so far

Q1: What is the smallest deployable unit in Kubernetes?

A) Container
B) Pod
C) Node
D) Deployment
Correct: B) Pod. A Pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that share network and storage. You cannot deploy a bare container directly -- it must be inside a Pod.

Q2: Which multi-container pattern extends the main container with supporting functionality like log shipping?

A) Sidecar
B) Ambassador
C) Init container
D) Adapter
Correct: A) Sidecar. The sidecar pattern adds auxiliary functionality to the main container -- like log shippers, monitoring agents, or service meshes. Ambassador proxies network connections, and Adapter transforms output formats.

Q3: What happens to a bare Pod (not managed by a controller) when it crashes?

A) Kubernetes automatically creates a new Pod
B) The kubelet moves it to another node
C) The container may restart based on restartPolicy, but the Pod is not rescheduled
D) The namespace controller recreates it
Correct: C) The kubelet will restart the container based on the Pod's restartPolicy (default: Always), but if the node itself fails, the bare Pod is lost forever. Only controllers like Deployments can reschedule Pods on other nodes.

Environment Variables

Sarah needs to pass database credentials and feature flags to her app. Kubernetes offers several ways to inject configuration.

Direct Values

env:
- name: DB_HOST
  value: "postgres.default.svc"
- name: LOG_LEVEL
  value: "info"

From ConfigMaps & Secrets

env:
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: db-credentials
      key: password
- name: APP_CONFIG
  valueFrom:
    configMapKeyRef:
      name: app-config
      key: config.json
Security note: Never hardcode secrets in Pod YAML. Use Kubernetes Secrets (or an external secret manager) and reference them with secretKeyRef.

ConfigMaps: Externalizing Configuration

ConfigMaps decouple configuration from container images. Change config without rebuilding.

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  DATABASE_URL: "postgres://db.default.svc:5432/mydb"
  LOG_LEVEL: "info"
  config.yaml: |
    server:
      port: 8080
      timeout: 30s

Three ways to consume ConfigMaps:

Env Variables

envFrom:
- configMapRef:
    name: app-config

Volume Mount

volumes:
- name: cfg
  configMap:
    name: app-config

Single Key

env:
- name: LOG_LEVEL
  valueFrom:
    configMapKeyRef:
      name: app-config
      key: LOG_LEVEL

Secrets

Sarah's app needs a database password. She knows better than to put it in plain YAML...

# Create a Secret imperatively
kubectl create secret generic db-creds \
  --from-literal=username=admin \
  --from-literal=password='S3cureP@ss!'   # quote special characters in the shell

# Or declaratively (values must be base64-encoded)
apiVersion: v1
kind: Secret
metadata:
  name: db-creds
type: Opaque
data:
  username: YWRtaW4=        # base64 of "admin"
  password: UzNjdXJlUEBzcyE=
Important: Base64 is encoding, NOT encryption. Secrets are stored unencrypted in etcd by default. Enable encryption at rest and use RBAC to restrict access. Consider external secret managers (HashiCorp Vault, AWS Secrets Manager) for production.
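Because base64 is reversible, anyone who can read the Secret can recover the value. In a cluster you would pipe kubectl get secret db-creds -o jsonpath='{.data.password}' into base64 -d; the decoding step itself can be tried locally with the encoded value from the manifest above:

```shell
# base64 is an encoding, not encryption -- it decodes instantly.
echo -n 'UzNjdXJlUEBzcyE=' | base64 -d
# prints: S3cureP@ss!
```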

Resource Requests & Limits

Sarah's Pod is eating all the memory on a shared node, affecting other teams. Time to set boundaries.

Requests (minimum guaranteed)

The scheduler uses requests to decide where to place the Pod. The node must have at least this much available.

resources:
  requests:
    memory: "128Mi"
    cpu: "250m"   # 0.25 CPU core

Limits (maximum allowed)

The kubelet enforces limits. Exceed memory limit = OOMKilled. Exceed CPU limit = throttled.

resources:
  limits:
    memory: "256Mi"
    cpu: "500m"   # 0.5 CPU core
QoS Classes -- determined automatically by K8s:

  • Guaranteed -- requests == limits (both CPU and memory) for all containers; evicted last
  • Burstable -- at least one request or limit set
  • BestEffort -- no requests or limits set; evicted first under pressure
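As a sketch, a container spec that qualifies for the Guaranteed class sets requests equal to limits for both resources (every container in the Pod must do the same); the assigned class appears in the Pod's .status.qosClass field:

```yaml
# Requests equal limits for CPU and memory -> Guaranteed,
# provided every container in the Pod follows suit.
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"
```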

LimitRanges & ResourceQuotas

Cluster admins use these to enforce resource governance across namespaces.

LimitRange (per-Pod/container defaults)

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container

If a Pod doesn't specify resources, these defaults apply.

ResourceQuota (per-namespace totals)

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    limits.cpu: "8"
    limits.memory: "16Gi"
    pods: "20"

The namespace cannot exceed these totals across all Pods.

Health Probes: Is Your App Actually Working?

Sarah's Pod is "Running" but users are getting 503 errors. The container started, but the app inside hasn't finished initializing. Kubernetes needs a way to know when the app is truly ready.

Liveness Probe

"Is the app alive?"

If it fails, kubelet restarts the container. Use this to catch deadlocks and unrecoverable states.

Readiness Probe

"Is the app ready for traffic?"

If it fails, the Pod is removed from Service endpoints. No traffic is sent until it passes again.

Startup Probe

"Has the app finished starting?"

Disables liveness/readiness checks until it succeeds. Perfect for slow-starting apps (Java, .NET).

Probe Types & Configuration

Three mechanisms to check health:

HTTP GET

Most common for web apps

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3

TCP Socket

For non-HTTP services (databases, caches)

readinessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 5
  periodSeconds: 10

Exec Command

Run a command inside the container

livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
Tuning tip: initialDelaySeconds should be >= your app's startup time. Set failureThreshold high enough to avoid false positives but low enough to detect real failures. For slow-starting apps, use a startup probe instead of a long initial delay.

Restart Policies

What happens when a container exits? The restartPolicy field determines the behavior.

  • Always (default) -- always restart, regardless of exit code. Use for long-running services (web servers, APIs)
  • OnFailure -- restart only on non-zero exit code. Use for Jobs that should retry on failure
  • Never -- never restart. Use for one-shot tasks and debugging
Backoff behavior: Kubernetes uses exponential backoff for restarts: 10s, 20s, 40s... up to 5 minutes. You will see CrashLoopBackOff status when this is happening. Check logs with kubectl logs <pod> --previous.
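A minimal sketch of a one-shot Pod that retries on failure (the Pod name and the deliberately failing command are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: retry-task            # hypothetical example name
spec:
  restartPolicy: OnFailure    # restart only when a container exits non-zero
  containers:
  - name: task
    image: busybox:1.36
    command: ['sh', '-c', 'echo "working..."; exit 1']  # fails so you can watch the backoff
```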

Knowledge Check: Config & Lifecycle

Test your understanding of environment, probes, and resources

Q1: A Pod where EVERY container sets requests equal to limits for both CPU and memory has which QoS class?

A) Guaranteed
B) Burstable
C) BestEffort
D) Restricted
Correct: A) Guaranteed. When every container in a Pod has requests equal to limits for both CPU and memory, the Pod gets the Guaranteed QoS class. These Pods are the last to be evicted under resource pressure.

Q2: A readiness probe failure causes Kubernetes to:

A) Restart the container
B) Remove the Pod from Service endpoints
C) Delete the Pod
D) Reschedule the Pod to another node
Correct: B) Remove the Pod from Service endpoints. A readiness failure stops traffic from being sent to the Pod, but the Pod keeps running. A liveness failure, on the other hand, causes a container restart.

Q3: What is the purpose of a startup probe?

A) To check if the node has enough resources
B) To run initialization scripts
C) To disable liveness/readiness checks until the app has started
D) To validate the container image before pulling
Correct: C) Startup probes protect slow-starting containers. Until the startup probe succeeds, liveness and readiness probes are disabled -- preventing premature restarts of applications that take a long time to initialize (e.g., Java apps with large classpaths).

How Scheduling Works

Sarah has 10 nodes in her cluster. When she creates a Pod, how does Kubernetes decide which node it lands on? Enter the kube-scheduler -- the matchmaker of the cluster.

Kubernetes plays matchmaker between Pods and Nodes

The scheduler looks at each Pod's requirements (CPU, memory, affinity rules) and each node's capacity, then finds the best match -- like a housing algorithm matching tenants to apartments.

The scheduling process:

  1. Filtering -- eliminate nodes that don't meet the Pod's requirements (insufficient resources, wrong labels, taints)
  2. Scoring -- rank remaining nodes by preference (spreading, affinity, resource balance)
  3. Binding -- assign the Pod to the highest-scoring node

All of this happens in milliseconds. You can influence it but rarely need to override it.

nodeSelector: The Simplest Scheduling Constraint

Want your Pod on a specific type of node? Use nodeSelector -- it matches Pod to node labels.

# First, label your node
kubectl label node worker-1 disktype=ssd

# Then reference it in your Pod spec
apiVersion: v1
kind: Pod
metadata:
  name: fast-io-app
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: my-app:1.0
How it works: The scheduler will ONLY place this Pod on nodes with the label disktype=ssd. If no node matches, the Pod stays Pending.

For more complex placement logic, use node affinity (coming up next).

Node Affinity: Advanced Node Selection

Node affinity is a more expressive version of nodeSelector. It supports soft preferences and complex expressions.

Required (hard rule)

requiredDuringSchedulingIgnoredDuringExecution

Pod MUST be placed on a matching node. Like nodeSelector but with richer operators.

Preferred (soft rule)

preferredDuringSchedulingIgnoredDuringExecution

Scheduler TRIES to match, but will place elsewhere if needed. Has a weight (1-100).

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values: ["high-memory"]

Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt

Pod Affinity & Anti-Affinity

Sarah wants her web Pods close to her cache Pods (for low latency) but spread across zones (for high availability). Pod affinity and anti-affinity solve both.

Pod Affinity -- "co-locate with"

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: redis
      topologyKey: kubernetes.io/hostname

Place this Pod on the same node as Pods labeled app=redis.

Pod Anti-Affinity -- "spread away"

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web
        topologyKey: topology.kubernetes.io/zone

Try to spread web Pods across different zones.

topologyKey defines the scope: kubernetes.io/hostname = per node, topology.kubernetes.io/zone = per zone, topology.kubernetes.io/region = per region.

Taints & Tolerations

Some nodes in Sarah's cluster are reserved for GPU workloads only. How does she keep regular Pods off those nodes?

VIP areas in the cluster

Taints are like a "VIP Only" sign on a node -- regular Pods are repelled. Tolerations are like a VIP pass on a Pod -- allowing it through. Taints go on nodes; tolerations go on Pods.

Taint a Node

# Apply a taint
kubectl taint nodes gpu-node-1 \
  gpu=true:NoSchedule

# Remove a taint (note the minus)
kubectl taint nodes gpu-node-1 \
  gpu=true:NoSchedule-

Tolerate a Taint

spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Taint Effects

Three effects determine how strictly the taint is enforced:

  • NoSchedule -- new Pods without a toleration will NOT be scheduled here; existing Pods are not affected
  • PreferNoSchedule -- the scheduler TRIES to avoid this node (soft version); existing Pods are not affected
  • NoExecute -- new Pods are rejected AND existing Pods without a toleration are evicted!
Real-world usage:
  • Control-plane nodes are tainted with node-role.kubernetes.io/control-plane:NoSchedule
  • When a node becomes unreachable, K8s automatically adds node.kubernetes.io/unreachable:NoExecute
  • GPU nodes, high-memory nodes, and spot instances often use custom taints
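For example, a Pod that must also run on control-plane nodes (say, a node monitoring agent) can carry a toleration for the built-in taint -- a sketch:

```yaml
spec:
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists          # tolerate the taint regardless of its value
    effect: NoSchedule
```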

Topology Spread Constraints

Sarah's Pods keep landing on the same node. If that node goes down, all replicas are lost. She needs to spread them evenly.

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web

Key Fields

  • maxSkew -- max difference in Pod count between topology domains
  • topologyKey -- what topology to spread across (zone, node, region)
  • whenUnsatisfiable -- DoNotSchedule (hard) or ScheduleAnyway (soft)

vs Pod Anti-Affinity

Topology spread constraints give you finer control over how evenly Pods are distributed. Anti-affinity is more binary -- "don't put me near X".

Knowledge Check: Scheduling

How well do you understand Pod placement?

Q1: What taint effect evicts existing Pods that don't have a matching toleration?

A) NoSchedule
B) PreferNoSchedule
C) NoExecute
D) Evict
Correct: C) NoExecute. This is the strictest taint effect: it not only prevents new Pods from being scheduled but also evicts already-running Pods that don't have a matching toleration. NoSchedule only affects new scheduling decisions.

Q2: Which scheduling feature lets you say "preferably place this Pod on a high-memory node, but it's okay if you can't"?

A) nodeSelector
B) preferredDuringSchedulingIgnoredDuringExecution node affinity
C) Taints with PreferNoSchedule
D) Pod anti-affinity
Correct: B) The preferredDuringSchedulingIgnoredDuringExecution node affinity lets you express soft preferences for node placement. nodeSelector supports only hard requirements. Taints repel Pods from nodes rather than attract them. Pod anti-affinity relates to other Pods, not node characteristics.

Q3: In the kube-scheduler's process, what happens during the "Filtering" phase?

A) Nodes that don't meet the Pod's requirements are eliminated
B) Remaining nodes are ranked by preference
C) The Pod is assigned to the best node
D) Resource limits are applied to the Pod
Correct: A) Filtering eliminates nodes that cannot run the Pod (insufficient resources, non-matching nodeSelector, taints without tolerations, etc.). Scoring (ranking) happens next, followed by binding (assignment).

Manual Scheduling with nodeName

In rare cases, you can bypass the scheduler entirely by setting nodeName directly.

apiVersion: v1
kind: Pod
metadata:
  name: manual-pod
spec:
  nodeName: worker-node-3   # Bypasses the scheduler entirely
  containers:
  - name: app
    image: nginx:1.25
Warning: Avoid nodeName in production. It bypasses all scheduling logic: no resource checks, no affinity rules, no taints. If the named node doesn't exist or is down, the Pod will never run. Use nodeSelector or affinity instead.

When is this useful?

  • Debugging scheduler issues
  • Static Pods (managed by kubelet directly, not the API server)
  • Testing on a specific node during development

Static Pods

Sarah notices some Pods in kube-system that she can't delete through kubectl. These are static Pods -- managed directly by the kubelet on each node.

What are Static Pods?

  • Managed by kubelet, not the API server
  • Defined as YAML files in a directory on the node (usually /etc/kubernetes/manifests/)
  • Kubelet watches this directory and creates/updates/deletes Pods automatically
  • A "mirror Pod" is created in the API server so you can see them with kubectl get pods

Used for Control Plane Components

  • kube-apiserver
  • kube-controller-manager
  • kube-scheduler
  • etcd

This is how kubeadm bootstraps the control plane -- the kubelet starts these before the API server is even running.
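A static Pod is just an ordinary Pod manifest dropped into the watched directory -- the kubelet picks it up on its own. A minimal sketch (the file name, Pod name, and image are hypothetical):

```yaml
# Saved as /etc/kubernetes/manifests/static-web.yaml on the node.
# The kubelet creates the Pod directly; a read-only mirror Pod
# appears in the API server with the node name appended.
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 80
```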

CKA Tip: The static Pod manifest path is set via staticPodPath in the kubelet config file (or the legacy --pod-manifest-path flag). Check the kubelet's flags with ps aux | grep kubelet, or look for staticPodPath in /var/lib/kubelet/config.yaml.

Priority & Preemption

The cluster is full and Sarah's critical payment service can't get scheduled. But there are lower-priority batch jobs running. Priority classes to the rescue.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For critical production workloads"
# Reference in Pod spec
spec:
  priorityClassName: high-priority
  containers:
  - name: payment-service
    image: payment:2.0
How preemption works: When a high-priority Pod can't be scheduled, the scheduler will evict lower-priority Pods to make room. The evicted Pods go back to the scheduling queue. Built-in classes: system-cluster-critical (2000000000) and system-node-critical (2000001000).

Pod Disruption Budgets (PDB)

The ops team is draining nodes for maintenance. Sarah's API has 3 replicas, and she wants to guarantee at least 2 are always available during the disruption.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # OR use maxUnavailable: 1
  selector:
    matchLabels:
      app: my-api

minAvailable

Minimum number (or percentage) of Pods that must remain available during voluntary disruptions.

maxUnavailable

Maximum number (or percentage) of Pods that can be unavailable. Use one or the other, not both.

Applies to voluntary disruptions only: node drains, rolling updates, cluster autoscaler. Does NOT protect against involuntary disruptions like hardware failures.

Designing Good Health Checks

Sarah's team has been bitten by poorly configured probes. Here are the lessons they learned.

Do

  • Make liveness probes lightweight (don't check dependencies)
  • Use readiness probes to check downstream dependencies
  • Use startup probes for slow apps instead of long initialDelaySeconds
  • Set reasonable thresholds and timeouts

Don't

  • Don't make liveness probe check the database (cascading restarts!)
  • Don't set initialDelaySeconds too low for the app's actual start time
  • Don't forget to configure probes at all (K8s assumes the Pod is healthy)
  • Don't use the same endpoint for liveness and readiness
Pattern: /healthz for liveness (am I alive?), /readyz for readiness (am I ready for traffic?). The liveness check is internal-only; the readiness check verifies dependencies.
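Applied to a container spec, the pattern looks like this sketch (paths and port are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /healthz            # process-only check, no dependencies
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz             # may verify downstream dependencies
    port: 8080
  periodSeconds: 5
```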

Container Lifecycle Hooks

Kubernetes lets you run code at two points in a container's lifecycle:

postStart

Runs immediately after the container is created (no guarantee it runs before the ENTRYPOINT).

Use case: register with a service, warm caches.

preStop

Runs before the container receives SIGTERM. Blocks the shutdown for its duration.

Use case: drain connections, deregister from load balancer.

lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "echo 'Started' >> /var/log/lifecycle.log"]
  preStop:
    httpGet:
      path: /shutdown
      port: 8080
Graceful shutdown: When K8s terminates a Pod, it runs the preStop hook, then sends SIGTERM; if the container has not exited within terminationGracePeriodSeconds (default 30s, measured from the start of termination and including the preStop time), it sends SIGKILL. Configure your app to handle SIGTERM gracefully.

Knowledge Check: Probes & Advanced Topics

Almost there -- test your advanced knowledge

Q1: Why should liveness probes NOT check external dependencies like a database?

A) It would be too slow
B) A database outage would cause cascading container restarts across all Pods
C) Kubernetes doesn't allow network calls in probes
D) The probe would time out and be ignored
Correct: B) If the liveness probe checks the database and the database goes down, every Pod's liveness check would fail, causing Kubernetes to restart all of them simultaneously -- making the situation far worse. Liveness should only check if the app process itself is healthy.

Q2: Where are static Pod manifests typically stored on a kubeadm cluster?

A) /var/lib/kubelet/pods/
B) /etc/kubernetes/manifests/
C) /opt/kubernetes/static/
D) /etc/kubernetes/pki/
Correct: B) /etc/kubernetes/manifests/ is the default static Pod path for kubeadm clusters. The kubelet watches this directory and manages the Pods defined there. Control plane components (apiserver, scheduler, controller-manager, etcd) are typically run as static Pods.

Q3: What does a PodDisruptionBudget protect against?

A) Hardware failures
B) OOM kills
C) Voluntary disruptions like node drains and rolling updates
D) Network outages
Correct: C) PDBs only apply to voluntary disruptions -- operations initiated by a cluster admin or controller, such as node drains (kubectl drain), rolling updates, and cluster autoscaler scale-downs. They do not protect against involuntary disruptions like hardware failures or kernel panics.

Security Contexts

The security team reviewed Sarah's deployment and flagged that her containers are running as root. Time to lock things down.

spec:
  securityContext:           # Pod-level settings
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
  - name: app
    image: my-app:1.0
    securityContext:         # Container-level settings
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]

Pod-Level

runAsUser, runAsGroup, fsGroup, runAsNonRoot, supplementalGroups

Container-Level

allowPrivilegeEscalation, readOnlyRootFilesystem, capabilities, privileged, seccompProfile

Minimum security baseline: runAsNonRoot: true, allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, drop ALL capabilities, add only what you need.

Service Accounts

Pods use ServiceAccounts to authenticate with the Kubernetes API. Every Pod gets one.

# Create a service account
kubectl create serviceaccount my-app-sa

# Use it in a Pod
spec:
  serviceAccountName: my-app-sa
  automountServiceAccountToken: false  # Disable if not needed
  containers:
  - name: app
    image: my-app:1.0
Security best practice:
  • Create dedicated ServiceAccounts per workload (don't use the default one)
  • Set automountServiceAccountToken: false unless the Pod needs API access
  • Bind minimal RBAC roles to the ServiceAccount
  • Since K8s 1.24, tokens are no longer auto-created as Secrets; they are issued via the TokenRequest API (short-lived, audience-bound -- request one manually with kubectl create token <sa>)

Debugging Pods: Sarah's Toolkit

Things will go wrong. Here is the debugging playbook Sarah keeps handy.

# Check Pod status and events
kubectl describe pod my-api

# View container logs (current and previous crash)
kubectl logs my-api
kubectl logs my-api --previous
kubectl logs my-api -c sidecar   # specific container

# Exec into a running container
kubectl exec -it my-api -- /bin/sh

# Debug with ephemeral container (K8s 1.25+)
kubectl debug -it my-api --image=busybox --target=app

# View resource usage
kubectl top pod my-api

# Check events in namespace
kubectl get events --sort-by=.lastTimestamp
Common issues: ImagePullBackOff = wrong image name or no pull secret. CrashLoopBackOff = container keeps crashing, check logs. Pending = no node available, check resources/taints/affinity. OOMKilled = increase memory limit.

Essential kubectl Commands

Creating & Managing

# Generate YAML without creating
kubectl run nginx --image=nginx \
  --dry-run=client -o yaml

# Apply from file
kubectl apply -f pod.yaml

# Delete
kubectl delete pod my-api
kubectl delete -f pod.yaml

# Edit live resource
kubectl edit pod my-api

Inspecting

# List with extra info
kubectl get pods -o wide

# JSON output + jq
kubectl get pod my-api -o json | jq '.status'

# Watch for changes
kubectl get pods -w

# Sort by restart count
kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'
CKA/CKAD speed tips:
  • Set up aliases: alias k=kubectl
  • Use kubectl explain pod.spec.containers to browse API docs directly from the command line
  • Tab completion saves time: source <(kubectl completion bash)
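
Put together, a one-time exam shell setup might look like this (bash; the $do variable is a common convention, not an official kubectl feature):

```shell
alias k=kubectl
source <(kubectl completion bash)
complete -o default -F __start_kubectl k   # make completion work for the alias too
export do="--dry-run=client -o yaml"       # usage: k run nginx --image=nginx $do > pod.yaml

# Browse nested API docs without leaving the terminal
kubectl explain pod.spec.containers.resources
```

Generating YAML with $do and editing it is usually faster than writing manifests from scratch under exam time pressure.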

Putting It All Together: Sarah's Production Pod

After everything she learned, here is Sarah's production-ready Pod manifest. Notice how it incorporates labels, probes, resources, security, and more.

apiVersion: v1
kind: Pod
metadata:
  name: my-api
  namespace: production
  labels:
    app: my-api
    tier: backend
    version: v2
  annotations:
    prometheus.io/scrape: "true"
spec:
  serviceAccountName: my-api-sa
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
  containers:
  - name: api
    image: my-api:2.0
    ports:
    - containerPort: 8080
    resources:
      requests: { cpu: "250m", memory: "128Mi" }
      limits:   { cpu: "500m", memory: "256Mi" }
    livenessProbe:
      httpGet: { path: /healthz, port: 8080 }
      initialDelaySeconds: 10
      periodSeconds: 15
    readinessProbe:
      httpGet: { path: /readyz, port: 8080 }
      initialDelaySeconds: 5
      periodSeconds: 5
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: In
            values: ["production"]
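Before trusting a manifest like this in production, it is worth validating it against the live API server. A short sketch, assuming the manifest above is saved as pod.yaml:

```shell
# Server-side validation without creating anything
kubectl apply -f pod.yaml --dry-run=server

# Create it, then confirm it was scheduled and is Ready
kubectl apply -f pod.yaml
kubectl get pod my-api -n production -o wide
```

--dry-run=server catches schema errors and admission-webhook rejections that client-side validation misses.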

Sarah's Next Question

Sarah's production Pod is running beautifully. But she realizes something troubling...

The problem with bare Pods

"What happens if my Pod crashes? Or if I need to run 5 copies? Or update to a new version without downtime?"

Bare Pods can't do any of that. She needs something smarter -- something that manages Pods for her.

Next up: Workload Controllers -- Deployments, ReplicaSets, StatefulSets, and more.

Final Knowledge Check

Comprehensive review of Pods, Containers, and Scheduling

Q1: A Pod spec has tolerations: [{key: "gpu", operator: "Equal", value: "true", effect: "NoSchedule"}]. What does this mean?

A) The Pod will only run on GPU nodes
B) The Pod CAN be scheduled on nodes tainted with gpu=true:NoSchedule, but is not required to
C) The Pod applies a taint to the node it runs on
D) The Pod requires the node to have a GPU label
Correct: B) A toleration allows (but does not require) scheduling on a tainted node. The Pod can still be scheduled on untainted nodes too. To force it onto GPU nodes, you would also need a nodeSelector or node affinity for gpu=true.
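
As the answer notes, forcing the Pod onto GPU nodes takes a toleration plus a selector. A minimal sketch, assuming GPU nodes carry both the taint gpu=true:NoSchedule and the label gpu=true:

```yaml
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"   # allows scheduling onto tainted GPU nodes
  nodeSelector:
    gpu: "true"            # requires scheduling onto labeled GPU nodes
```

The toleration opens the door; the nodeSelector makes the Pod walk through it.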

Q2: Which field in a Pod spec completely bypasses the kube-scheduler?

A) nodeSelector
B) nodeName
C) nodeAffinity
D) priorityClassName
Correct: B) nodeName. When set, the Pod is bound directly to that node without going through the scheduler -- no filtering, scoring, or resource checking is performed. nodeSelector and nodeAffinity are constraints evaluated BY the scheduler, not bypasses of it.
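
A nodeName pin looks like this -- a sketch with a hypothetical node name, useful mainly for debugging (if worker-1 lacks resources, the kubelet simply fails the Pod):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod
spec:
  nodeName: worker-1   # hypothetical node; the scheduler is skipped entirely
  containers:
  - name: app
    image: nginx
```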

Q3: Init containers in a Pod run:

A) In parallel with each other, before main containers
B) Sequentially, one at a time, each must succeed before the next starts
C) In parallel with the main containers
D) Only when the main containers fail
Correct: B) Init containers run one at a time in the order they are defined. Each must complete successfully (exit code 0) before the next one starts. Only after all init containers succeed do the main containers start. If an init container fails, the Pod restarts according to its restartPolicy.
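
The sequential behavior described in the answer looks like this in a manifest -- a sketch where the db hostname and ./migrate command are hypothetical placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:            # run one at a time, in order
  - name: wait-for-db
    image: busybox
    command: ["sh", "-c", "until nc -z db 5432; do sleep 2; done"]
  - name: run-migrations     # starts only after wait-for-db exits 0
    image: my-app:2.0
    command: ["./migrate"]
  containers:                # start only after ALL init containers succeed
  - name: app
    image: my-app:2.0
```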

Module Complete

Key Takeaways

  • Pods are the smallest deployable unit -- shared network, storage, lifecycle
  • Declarative YAML is the Kubernetes way; imperative for quick generation
  • Labels & Selectors are the glue that connects everything in K8s
  • Probes (liveness, readiness, startup) tell K8s about your app's health
  • Resource requests & limits determine QoS and scheduling
  • Scheduling is a 3-step process: filter, score, bind
  • Taints/Tolerations control which Pods can run where
  • Security contexts and ServiceAccounts lock down your workloads
  • Bare Pods don't self-heal -- you need controllers (next module)

Next: Module 04 -- Workload Controllers
