Kubernetes Training — Presentation 8 of 8

Observability, Troubleshooting
& Exam Prep

It's 3 AM. Your phone rings. The application is down. How do you debug it? This is the most important skill in Kubernetes — and 30% of the CKA exam.

CKA: Troubleshooting — 30%
CKAD: Application Observability — 15%
The 3 AM Scenario

When Things Go Wrong

"PagerDuty alert: 'Production API - 5xx error rate above 50%'. You open your laptop. The dashboard is red. Users are complaining on social media. Your manager is on Slack asking for an ETA. Where do you even start?"

Today we build the skills to answer that question confidently. By the end of this session, you will have a systematic approach to debugging any Kubernetes issue — and the exam skills to prove it.

Overview

Our Journey Today

Observability

  1. Logging — the black box recorder
  2. Monitoring — the dashboard
  3. Azure Monitor for AKS

Troubleshooting

  1. Debugging Pods — CrashLoopBackOff
  2. Debugging Services — connectivity
  3. Common failure scenarios
  4. Systematic debugging flow

Exam & Beyond

  1. Helm — package management
  2. CRDs & Operators
  3. CKA exam strategy
  4. CKAD exam strategy
  5. kubectl aliases & speed tricks
Logging

Logging: Reading the Black Box Recorder

In aviation, when something goes wrong, investigators look at the black box. In Kubernetes, that black box is your container logs. Everything your application writes to stdout/stderr is captured by the kubelet.

How K8s Logging Works

  • Containers write to stdout and stderr
  • Container runtime captures output
  • Stored as files on the node: /var/log/containers/
  • Rotated by kubelet (default 10Mi, 5 files)
  • Lost when pod is deleted (no persistence!)

Key Principle

Applications in Kubernetes should always log to stdout/stderr, never to files inside the container. This is one of the 12-Factor App principles.

If your legacy app writes to /var/log/app.log, use a sidecar container to stream that file to stdout.
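A minimal sketch of that sidecar pattern (the pod, image, and container names here are illustrative, not from a real chart): the app writes to a file on a shared emptyDir volume, and a busybox sidecar tails that file to its own stdout, where the kubelet captures it.

```shell
# Hypothetical example: a legacy app logs to /var/log/app.log;
# a sidecar streams the file to stdout so `kubectl logs` can see it.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
spec:
  containers:
  - name: app
    image: my-legacy-app:1.0        # assumed image that writes /var/log/app.log
    volumeMounts:
    - name: logs
      mountPath: /var/log
  - name: log-tailer                # sidecar: its stdout is captured by the kubelet
    image: busybox
    command: ["sh", "-c", "tail -n+1 -F /var/log/app.log"]
    volumeMounts:
    - name: logs
      mountPath: /var/log
  volumes:
  - name: logs
    emptyDir: {}
EOF

# Read the app's file-based logs through the sidecar:
kubectl logs legacy-app -c log-tailer -f
```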

Logging

kubectl logs — Your First Debugging Tool

# Basic: logs from a single pod
kubectl logs my-pod

# Specific container in a multi-container pod
kubectl logs my-pod -c sidecar

# Follow logs in real-time (like tail -f)
kubectl logs my-pod -f

# Last 100 lines only
kubectl logs my-pod --tail=100

# Logs from the last hour
kubectl logs my-pod --since=1h

# Logs from a PREVIOUS crashed container (critical for CrashLoopBackOff!)
kubectl logs my-pod --previous

# Logs from all pods matching a label
kubectl logs -l app=api --all-containers

# Logs from a specific deployment's pods
kubectl logs deployment/api-server --tail=50
💡
Exam Must-Know: kubectl logs pod-name --previous shows logs from the last crashed container instance. This is essential for debugging CrashLoopBackOff.
Logging

Cluster-Level Log Aggregation

Node-level logs are ephemeral. For production, you need centralized log aggregation that survives pod restarts and node failures.

Node-Level Agent

DaemonSet on every node collects logs and forwards them.

  • Fluentd / Fluent Bit
  • Filebeat
  • Azure Monitor Agent

Most common pattern

Sidecar Container

Dedicated container in each pod streams logs.

  • For apps that write to files
  • Higher resource cost
  • More control per-app

Use when node agent isn't enough

Direct Push

Application pushes logs directly to backend.

  • Application Insights SDK
  • Custom logging libraries
  • No K8s involvement

Least common in K8s

Monitoring

Monitoring: The Dashboard That Tells You Everything

Logs tell you what happened. Monitoring tells you what's happening right now. It's the difference between reading a crash report and watching the speedometer.

Metrics Server (Built-in)

  • Lightweight, in-cluster metrics aggregator
  • Provides CPU and memory metrics
  • Powers kubectl top commands
  • Powers HorizontalPodAutoscaler
  • Does NOT store historical data
# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods -n production
kubectl top pods --sort-by=memory

Prometheus (Industry Standard)

  • Pull-based metrics collection
  • Time-series database
  • PromQL query language
  • Alerting via Alertmanager
  • Grafana for visualization

The Prometheus + Grafana stack is the de facto standard for Kubernetes monitoring.

Azure Integration

Azure Monitor for AKS

Azure provides managed observability for AKS clusters without the overhead of running your own Prometheus/Grafana stack.

Container Insights

  • Agent-based log and metric collection
  • Pre-built dashboards for clusters, nodes, pods
  • KQL (Kusto) queries for deep analysis
  • Integration with Azure Alerts
  • Live container logs in the portal

Azure Managed Prometheus + Grafana

  • Fully managed Prometheus-compatible metrics
  • Azure Managed Grafana for dashboards
  • PromQL support
  • No infrastructure to manage
  • Built-in recording rules and alerts
# Enable Container Insights on an existing AKS cluster
az aks enable-addons -a monitoring -n myCluster -g myResourceGroup

# Query container logs via KQL
ContainerLog | where LogEntry contains "error" | project TimeGenerated, LogEntry | top 50
Knowledge Check

Quiz: Observability Fundamentals

Q1: Which command shows logs from a previously crashed container instance?

kubectl logs pod-name --crashed
kubectl logs pod-name --previous
kubectl logs pod-name --last
kubectl describe pod pod-name
Correct: kubectl logs pod-name --previous (or -p). This retrieves logs from the previous container instance, which is critical for debugging CrashLoopBackOff.

Q2: What component must be installed for kubectl top to work?

Prometheus
Metrics Server
cAdvisor
Grafana
Correct: Metrics Server. It provides the metrics API that kubectl top reads from. cAdvisor provides container metrics to kubelet, but kubectl top needs Metrics Server to aggregate them.

Q3: Where should Kubernetes applications write their logs?

/var/log/application.log
stdout and stderr
A shared PersistentVolume
Directly to Elasticsearch
Correct: stdout and stderr. This follows the 12-Factor App methodology and allows the container runtime and kubectl logs to capture output. File-based logging requires sidecar containers to forward.
Troubleshooting

Debugging Pods: A Story-Driven Approach

"You deploy your application. Instead of Running, the pod shows CrashLoopBackOff. You wait. It restarts. Crashes again. Restarts. Crashes again. The back-off delay grows: 10s, 20s, 40s, 80s... up to 5 minutes. Users are waiting."

Let's walk through exactly how to investigate this, step by step.

Troubleshooting

Step 1: Get the Status

Always start with kubectl get to understand the current state. The STATUS column tells you what phase the pod is in.

# See all pods and their status
kubectl get pods -n production
NAME          READY   STATUS             RESTARTS   AGE
api-server    0/1     CrashLoopBackOff   5          3m
web-frontend  1/1     Running            0          1h
db-worker     0/1     Pending            0          10m
Status             Meaning                                   Next Step
CrashLoopBackOff   Container starts and crashes repeatedly   kubectl logs --previous
Pending            Can't be scheduled                        kubectl describe pod
ImagePullBackOff   Can't pull container image                Check image name, registry access
Init:0/1           Init container hasn't completed           kubectl logs pod -c init-container
Running            All containers started                    Check readiness probes, service
Troubleshooting

Step 2: Describe the Pod

kubectl describe is your detective report. It shows events, conditions, and the full pod specification. Focus on the Events section at the bottom.

kubectl describe pod api-server -n production

# Key sections to examine:

# 1. Status & Conditions
Status:       Running
Conditions:
  Ready:      False    # <-- readiness probe failing

# 2. Container State
State:        Waiting
  Reason:     CrashLoopBackOff
Last State:   Terminated
  Reason:     Error
  Exit Code:  1        # <-- non-zero = crash

# 3. Events (most important!)
Events:
  Warning  BackOff   kubelet  Back-off restarting failed container
  Normal   Pulled    kubelet  Successfully pulled image
  Warning  Unhealthy kubelet  Readiness probe failed: connection refused
💡
Exit codes: 0 = success, 1 = application error, 137 = OOMKilled (128 + 9/SIGKILL), 139 = segfault, 143 = SIGTERM (graceful shutdown).
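The 128 + signal arithmetic can be verified in any shell: a process killed by a signal exits with status 128 plus the signal number.

```shell
# A process killed by a signal exits with status 128 + the signal number.
bash -c 'kill -KILL $$' || echo "SIGKILL (9)  -> $?"   # 137, what OOMKilled reports
bash -c 'kill -SEGV $$' || echo "SIGSEGV (11) -> $?"   # 139, a segfault
bash -c 'kill -TERM $$' || echo "SIGTERM (15) -> $?"   # 143, graceful shutdown signal
```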
Troubleshooting

Step 3: Read the Logs

Events tell you what Kubernetes sees. Logs tell you what the application sees. For a crashing container, use --previous to see the last run's output.

# Logs from the crashed container
kubectl logs api-server -n production --previous

# Example output that tells us the problem:
Starting API server on port 8080...
Connecting to database at postgres.db.svc:5432...
ERROR: FATAL: password authentication failed for user "admin"
ERROR: Cannot connect to database. Exiting.

The answer is clear: the database password is wrong. The fix might be updating a Secret or ConfigMap.

⚠️
Common pitfall: If logs show nothing (container crashes instantly), the problem might be the command/entrypoint. Use kubectl describe to check the container's command, args, and image.
Troubleshooting

Step 4: Get Inside the Container

Sometimes you need to poke around inside a running container or test connectivity from within the cluster.

# Exec into a running container
kubectl exec -it api-server -n production -- /bin/sh

# Run a specific command without interactive shell
kubectl exec api-server -- cat /etc/config/settings.conf

# For crashed/minimal containers, use ephemeral debug containers (K8s 1.25+)
kubectl debug -it api-server --image=busybox --target=api-server

# Spin up a temporary debug pod in the same namespace
kubectl run debug --image=busybox --rm -it --restart=Never -- sh

# Test DNS resolution from inside the cluster
kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
  nslookup api-service.production.svc.cluster.local

# Test connectivity to a service
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl -v http://api-service.production:8080/health
Troubleshooting

Debugging Services: Where's the Break in the Chain?

Users can't reach your app. The chain is: User → Ingress → Service → Endpoints → Pod. Check each link.

# 1. Does the Service exist and have the right selector?
kubectl get svc api-service -n production -o wide

# 2. Does the Service have endpoints? (Most common issue!)
kubectl get endpoints api-service -n production
# If ENDPOINTS is <none>, the selector doesn't match any running pods

# 3. Check that pod labels match the service selector
kubectl get pods -n production --show-labels
kubectl get svc api-service -n production -o jsonpath='{.spec.selector}'

# 4. Is the pod actually listening on the right port?
kubectl exec api-server -- netstat -tlnp
# or
kubectl exec api-server -- ss -tlnp

# 5. Test from within the cluster
kubectl run test --rm -it --image=curlimages/curl --restart=Never -- \
  curl http://api-service.production:8080
💡
Top 3 service issues: (1) No endpoints — label mismatch, (2) Wrong port — targetPort doesn't match container port, (3) Pod not Ready — readiness probe failing.
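When the endpoint list is empty, the fix is to make the labels and selector agree. A sketch under assumed names (api-service, app=api-server): the durable fix is correcting the Deployment's pod template labels, but patching the Service selector repairs it in place.

```shell
# Hypothetical repair: the Service selects app=api but the pods carry app=api-server.
# Quick in-place fix: patch the Service selector to match the pod labels.
kubectl patch svc api-service -n production \
  -p '{"spec":{"selector":{"app":"api-server"}}}'

# Verify: endpoints should now list pod IPs instead of <none>
kubectl get endpoints api-service -n production
```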
Knowledge Check

Quiz: Debugging Fundamentals

Q1: A pod's exit code is 137. What does this mean?

Application error (uncaught exception)
OOMKilled (out of memory, killed by SIGKILL)
Image pull failure
Graceful shutdown via SIGTERM
Correct: OOMKilled. Exit code 137 = 128 + 9 (SIGKILL). The Linux kernel killed the process because it exceeded its memory limit. Fix by increasing the memory limit or fixing a memory leak.

Q2: A Service shows <none> for endpoints. What is the most likely cause?

The Service port is wrong
The Service selector doesn't match any running pod labels
The cluster DNS is down
Network policies are blocking traffic
Correct: The Service selector doesn't match any running pod labels. Endpoints are populated by the Endpoints controller which watches for pods matching the Service's selector. Check labels with kubectl get pods --show-labels.

Q3: Which command lets you debug a crashed container that has no shell binary?

kubectl exec -it pod-name -- /bin/sh
kubectl debug -it pod-name --image=busybox --target=container-name
kubectl attach pod-name
kubectl cp debug-tools pod-name:/usr/bin/
Correct: kubectl debug with an ephemeral container. This injects a debug container (like busybox) that shares the process namespace with the target container, letting you inspect it even if the original has no shell.
Failure Scenarios

Scenario 1: Pod Stuck in Pending

A Pending pod means the scheduler cannot find a suitable node. Here's the investigation flow:

kubectl describe pod stuck-pod -n production

# Common events you'll see:
Events:
  Warning  FailedScheduling  0/3 nodes are available:
    1 Insufficient cpu           # Node doesn't have enough CPU
    2 node(s) had taints that the pod didn't tolerate  # Missing toleration

Common Causes

  • Insufficient resources — no node has enough CPU/memory
  • Taints/Tolerations — all nodes tainted
  • Node selector/affinity — no matching nodes
  • PVC not bound — waiting for storage
  • ResourceQuota exceeded

Investigation Commands

# Check node capacity vs usage
kubectl top nodes
kubectl describe node node-1

# Check for taints
kubectl get nodes -o json | \
  jq '.items[].spec.taints'

# Check PVC status
kubectl get pvc -n production
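Once the cause is identified, the remedies are short. A sketch of common fixes (node name, taint key, and workload name are illustrative):

```shell
# Remove a taint (the trailing '-' deletes it) so untolerated pods can schedule
kubectl taint nodes node-1 dedicated=gpu:NoSchedule-

# Lower the requests so the pod fits on an existing node
# (the scheduler places pods by requests, not limits)
kubectl set resources deployment stuck-app -n production \
  --requests=cpu=100m,memory=128Mi

# Or add capacity (AKS example)
az aks scale -n myCluster -g myResourceGroup --node-count 4
```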
Failure Scenarios

Scenario 2: CrashLoopBackOff

The container starts, crashes, and Kubernetes keeps restarting it with exponential back-off (10s, 20s, 40s... up to 5 minutes).

Common Causes

  • Application error — missing config, bad credentials, unhandled exception
  • Missing dependencies — database not reachable, required service down
  • OOMKilled — memory limit too low (exit code 137)
  • Bad command/entrypoint — wrong command in pod spec
  • Missing volume mounts — required config file not present
  • Liveness probe kills healthy container — probe misconfigured

Debugging Flowchart

  1. kubectl describe pod → check exit code
  2. kubectl logs --previous → read crash output
  3. Exit code 137? → increase memory limits
  4. Exit code 1? → fix application config
  5. No logs? → check command/args in describe
  6. Liveness probe? → check probe config and timing
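The flowchart's fixes map to short commands. A sketch for the two most common endings (deployment name is illustrative):

```shell
# Exit code 137 (OOMKilled): raise the memory limit on the workload
kubectl set resources deployment api-server -n production \
  --limits=memory=512Mi

# Liveness probe killing a slow-starting app: inspect the probe, then edit it
kubectl get deploy api-server -n production \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'
kubectl edit deploy api-server -n production   # raise initialDelaySeconds / failureThreshold
```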
Failure Scenarios

Scenario 3: ImagePullBackOff

Kubernetes cannot pull the container image. This is often one of the simplest issues to fix, but it can be confusing.

Common Causes

  • Typo in image name — ngnix instead of nginx
  • Tag doesn't exist — v2.0 hasn't been pushed yet
  • Private registry — no imagePullSecret configured
  • Registry down — Docker Hub rate limiting, ACR outage
  • Network policy — blocking egress to registry

Fix Checklist

# 1. Check the exact image reference
kubectl describe pod my-pod | grep Image

# 2. Can you pull it manually?
docker pull myregistry.azurecr.io/app:v2

# 3. Is imagePullSecret configured?
kubectl get pod my-pod -o jsonpath=\
'{.spec.imagePullSecrets}'

# 4. Create/fix pull secret
kubectl create secret docker-registry \
  acr-creds \
  --docker-server=myregistry.azurecr.io \
  --docker-username=user \
  --docker-password=pass
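Creating the secret alone is not enough: the pod must reference it. A sketch of the two ways to wire it up (namespace and secret names as above):

```shell
# Attach the pull secret to the namespace's default ServiceAccount
# so every pod using that ServiceAccount inherits it:
kubectl patch serviceaccount default -n production \
  -p '{"imagePullSecrets":[{"name":"acr-creds"}]}'

# Or reference it directly in the pod spec:
#   spec:
#     imagePullSecrets:
#     - name: acr-creds

# Recreate the pod so the kubelet retries the pull with credentials
kubectl delete pod my-pod -n production
```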
Troubleshooting

The Systematic Debugging Flow

When something is broken, follow this order. It covers 95% of issues:

  1. kubectl get pods -n NAMESPACE — What's the status?
  2. kubectl describe pod POD -n NAMESPACE — Events + conditions?
  3. kubectl logs POD -n NAMESPACE [--previous] — What did the app say?
  4. kubectl get events -n NAMESPACE --sort-by=.lastTimestamp — Cluster events?
  5. kubectl get svc,ep -n NAMESPACE — Service has endpoints?
  6. kubectl exec / kubectl debug — Can I test from inside?
  7. kubectl top pods/nodes — Resource pressure?
  8. kubectl get nodes — Any NotReady nodes?
💡
CKA Exam: Troubleshooting is 30% of the exam. Memorize this flow. Practice it until it becomes muscle memory.
Troubleshooting

Debugging Node Issues

If a node shows NotReady, the kubelet may be down, or the node may have resource pressure.

# Check node status
kubectl get nodes
NAME     STATUS     ROLES          AGE   VERSION
node-1   Ready      control-plane  30d   v1.29.0
node-2   NotReady   <none>         30d   v1.29.0

# Detailed node conditions
kubectl describe node node-2

# Key conditions to check:
Conditions:
  MemoryPressure   True   # Node running out of memory
  DiskPressure     True   # Node running out of disk
  PIDPressure      False
  Ready            False  # Kubelet not reporting or unhealthy

# On the node itself (SSH or debug container):
systemctl status kubelet
journalctl -u kubelet -f
systemctl status containerd
⚠️
CKA Exam Scenario: You may be asked to fix a broken kubelet. Check: Is kubelet running? Is the config correct? Is the certificate valid? systemctl restart kubelet is often the fix after correcting a config issue.
Knowledge Check

Quiz: Troubleshooting Scenarios

Q1: A pod is Pending with event "0/3 nodes are available: 3 Insufficient memory". What should you do?

Restart the pod
Delete and recreate the namespace
Reduce the pod's memory request or add more nodes
Increase the pod's memory limit
Correct: The scheduler uses memory requests (not limits) to place pods on nodes. Either reduce the memory request to fit on existing nodes, or scale up the cluster to add more capacity.

Q2: Users can't reach your app via the Service. kubectl get endpoints shows the service has no endpoints. What should you check first?

That the pod labels match the Service selector
The cluster DNS configuration
The pod's security context
The Ingress controller logs
Correct: No endpoints means no pods match the Service selector. The most common cause is a label mismatch. Compare kubectl get svc -o yaml (check spec.selector) with kubectl get pods --show-labels.

Q3: A node shows NotReady. Which component should you check first on that node?

kube-proxy
kubelet
etcd
kube-scheduler
Correct: kubelet. The kubelet is responsible for reporting node status to the API server. A NotReady node usually means kubelet is stopped, crashed, or misconfigured. Check with: systemctl status kubelet.
Helm

Helm: The Package Manager for Kubernetes

Think of Helm like apt or npm for Kubernetes. Instead of managing dozens of YAML files, you install a "chart" that contains everything your application needs — deployments, services, configmaps, RBAC, all templated and versioned.

Key Concepts

  • Chart — a package of templated K8s manifests
  • Release — an installed instance of a chart
  • Repository — where charts are stored (like npm registry)
  • Values — configuration overrides (values.yaml)

Why Use Helm?

  • Templated manifests with variables
  • Version management and rollbacks
  • Dependency management between charts
  • Reusable across environments (dev/staging/prod)
  • Huge ecosystem (Artifact Hub)
Helm

Essential Helm Commands

# Add a repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Search for charts
helm search repo nginx
helm search hub wordpress        # search Artifact Hub

# Install a chart (creates a release)
helm install my-nginx bitnami/nginx -n web --create-namespace

# Install with custom values
helm install my-app ./my-chart -f production-values.yaml

# Override specific values on the command line
helm install my-app ./my-chart --set replicaCount=3,image.tag=v2

# List releases
helm list -n web

# Upgrade a release
helm upgrade my-nginx bitnami/nginx --set replicaCount=3

# Rollback to a previous revision
helm rollback my-nginx 1

# Uninstall a release
helm uninstall my-nginx -n web

# Show what would be installed (dry-run)
helm template my-app ./my-chart -f values.yaml
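Rollback targets a numbered revision, so it pairs with the inspection commands below (release name follows the examples above):

```shell
# Each install/upgrade creates a numbered revision; list them before rolling back
helm history my-nginx -n web

# Inspect what a release is currently running
helm status my-nginx -n web
helm get values my-nginx -n web      # user-supplied values only
helm get manifest my-nginx -n web    # rendered YAML as applied to the cluster
```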
Helm

Helm Chart Structure

Directory Layout

my-chart/
  Chart.yaml          # metadata (name, version)
  values.yaml         # default config values
  templates/
    deployment.yaml   # templated manifests
    service.yaml
    ingress.yaml
    _helpers.tpl      # template helpers
    NOTES.txt         # post-install message
  charts/             # sub-chart dependencies

Template Example

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
      - name: app
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        ports:
        - containerPort: {{ .Values.service.port }}
CRDs & Operators

Custom Resource Definitions: Extending Kubernetes

Kubernetes ships with resources like Pods, Services, and Deployments. CRDs let you define your own resource types, making the API server understand application-specific concepts.

What CRDs Enable

  • Define custom resources (e.g., Database, Certificate)
  • Use kubectl to manage them like built-in resources
  • RBAC and admission control work on CRDs
  • Extend K8s without modifying its source code

Real-World Examples

  • cert-manager: Certificate, Issuer
  • Istio: VirtualService, Gateway
  • Prometheus: ServiceMonitor, PrometheusRule
  • ArgoCD: Application
  • Crossplane: RDSInstance, Bucket
# List all CRDs in the cluster
kubectl get crds

# Get custom resources
kubectl get certificates -n production
kubectl describe certificate my-tls -n production
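Because CRDs register with the same API discovery machinery as built-ins, the usual introspection commands work on them too (examples assume cert-manager is installed):

```shell
# Custom resources appear in API discovery like built-in types
kubectl api-resources --api-group=cert-manager.io

# Field documentation works once the CRD is installed
kubectl explain certificate.spec

# Short names and plurals come from the CRD definition itself
kubectl get crd certificates.cert-manager.io -o jsonpath='{.spec.names}'
```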
CRDs & Operators

Operators: Automating Human Knowledge

An Operator is a CRD + a custom controller that encodes operational knowledge. Think of it as a robot sysadmin that watches your custom resources and takes action.

The Operator Pattern

  1. Define a CRD (e.g., PostgresCluster)
  2. Deploy a controller that watches for these resources
  3. Controller creates/manages pods, services, config
  4. Controller handles upgrades, backups, failover

The controller runs a reconciliation loop: "desired state in CRD vs actual state in cluster → take action to converge."

Example: Database Operator

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: my-db
spec:
  postgresVersion: 15
  instances:
  - replicas: 3
    dataVolumeClaimSpec:
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 10Gi

The operator handles replication, failover, backups, and upgrades automatically.

Knowledge Check

Quiz: Helm & CRDs

Q1: What is the Helm command to rollback a release to a previous version?

helm undo release-name
helm rollback release-name REVISION
helm revert release-name --to=previous
helm upgrade release-name --rollback
Correct: helm rollback release-name REVISION. Each helm install/upgrade creates a numbered revision. Use helm history release-name to see all revisions, then rollback to a specific one.

Q2: What does an Operator consist of?

A Helm chart and a values file
A Custom Resource Definition (CRD) and a custom controller
A DaemonSet and a ConfigMap
An Admission Webhook and a Service
Correct: An Operator combines a CRD (to define the desired state of a custom resource) with a controller (that watches those resources and takes action to achieve the desired state). This is the Operator Pattern.

Q3: Which Helm command shows the rendered YAML without installing it?

helm lint my-chart
helm show values my-chart
helm template my-release my-chart
helm install --debug my-release my-chart
Correct: helm template renders chart templates locally and displays the output YAML without sending anything to the cluster. Great for reviewing what Helm will create before installing. helm install --dry-run also works but requires cluster access.
CKA Exam

CKA: Certified Kubernetes Administrator

Exam Format

Duration — 2 hours
Questions — 15-20 performance-based tasks
Passing Score — 66%
Environment — Real cluster(s) via PSI browser
Resources — kubernetes.io/docs, helm.sh/docs, github.com/kubernetes allowed
Cost — $395 USD (includes 1 retake)
Validity — 2 years

Domain Weights

Troubleshooting — 30%
Cluster Architecture — 25%
Workloads & Scheduling — 15%
Services & Networking — 20%
Storage — 10%

Troubleshooting is the largest domain. Master the debugging flow from earlier in this session.

CKA Exam

CKA: Strategy & Time Management

Time Management

  • Read all questions first (2 min). Flag easy ones.
  • Easy questions first — bank quick points
  • Budget ~6 min per question (15 questions = 90 min)
  • Skip and return if stuck after 8 minutes
  • Leave 10 min at the end for review
  • Check which cluster/context each question uses!

Critical Skills to Practice

  • Create pods, deployments, services imperatively
  • Fix broken kubelet, etcd, scheduler configs
  • RBAC: create roles and bindings fast
  • Network Policies from scratch
  • Upgrade a cluster with kubeadm
  • etcd backup and restore
  • Debug Pending/CrashLoopBackOff pods
💡
Pro tip: Always start with kubectl config use-context for each question. Working on the wrong cluster is the #1 reason people lose points.
CKA Exam

CKA: Practice Scenarios

Cluster Maintenance

  1. Upgrade the control plane from v1.28 to v1.29 using kubeadm
  2. Drain a node, perform maintenance, uncordon it
  3. Back up etcd to /opt/etcd-backup.db
  4. Restore etcd from a backup file
# etcd backup
ETCDCTL_API=3 etcdctl snapshot save \
  /opt/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
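The matching restore task can be sketched like this (paths follow kubeadm defaults; the restored data-dir name is illustrative):

```shell
# Restore into a NEW data directory -- never restore over the live one
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup.db \
  --data-dir=/var/lib/etcd-restored

# Point the etcd static pod at the restored directory; the kubelet
# recreates the pod automatically when the manifest changes
vi /etc/kubernetes/manifests/etcd.yaml
#   -> change the etcd-data hostPath volume to /var/lib/etcd-restored
```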

Troubleshooting Tasks

  1. A node is NotReady. Fix the kubelet.
  2. A pod can't reach a service. Fix the NetworkPolicy.
  3. The scheduler is not running. Fix the static pod manifest.
  4. Create a user certificate and RBAC binding.
# Common kubelet fix
ssh node-2
systemctl status kubelet
# Check for config errors
journalctl -u kubelet | tail -50
# Fix config and restart
systemctl restart kubelet
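Task 4 above has no commands on this slide, so here is a hedged sketch using the CertificateSigningRequest API (the user "jane", namespace, and role are illustrative; `base64 -w0` is the GNU flag):

```shell
# Hypothetical user "jane": generate a key and a CSR with CN=jane
openssl genrsa -out jane.key 2048
openssl req -new -key jane.key -subj "/CN=jane" -out jane.csr

# Submit the CSR to the cluster and approve it
cat <<EOF | kubectl apply -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: jane
spec:
  request: $(base64 -w0 < jane.csr)
  signerName: kubernetes.io/kube-apiserver-client
  usages: ["client auth"]
EOF
kubectl certificate approve jane
kubectl get csr jane -o jsonpath='{.status.certificate}' | base64 -d > jane.crt

# Grant permissions via RBAC
kubectl create rolebinding jane-edit --clusterrole=edit --user=jane -n dev
```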
CKAD Exam

CKAD: Certified Kubernetes Application Developer

Exam Format

Duration — 2 hours
Questions — 15-20 performance-based tasks
Passing Score — 66%
Focus — Application development, not cluster admin
Resources — Same as CKA (kubernetes.io/docs)

Domain Weights

Application Design & Build — 20%
Application Deployment — 20%
Application Observability & Maintenance — 15%
Application Environment, Config & Security — 25%
Services & Networking — 20%
CKAD Exam

CKAD: Speed Tricks & Strategy

The CKAD is about speed. You need to create resources fast without writing YAML from scratch.

Generate YAML, Don't Write It

# Generate pod YAML
kubectl run nginx --image=nginx \
  --dry-run=client -o yaml > pod.yaml

# Generate deployment YAML
kubectl create deploy app \
  --image=app:v1 --replicas=3 \
  --dry-run=client -o yaml > deploy.yaml

# Generate service YAML
kubectl expose deploy app \
  --port=80 --target-port=8080 \
  --dry-run=client -o yaml > svc.yaml

# Generate job YAML
kubectl create job backup \
  --image=busybox \
  --dry-run=client -o yaml > job.yaml

CKAD Focus Areas

  • Multi-container pods — sidecar, init containers
  • Probes — liveness, readiness, startup
  • ConfigMaps & Secrets — creation and consumption
  • Resource limits — requests/limits
  • Rolling updates — strategy, rollback
  • Jobs & CronJobs
  • Network Policies
  • Ingress rules
  • SecurityContext
  • Helm basics (install, upgrade, rollback)
Exam Cheat Sheet

kubectl Aliases & Speed Shortcuts

Set these up at the start of your exam. They save minutes over the 2-hour session.

# Essential aliases (add to ~/.bashrc at exam start)
alias k=kubectl
alias kgp='kubectl get pods'
alias kgs='kubectl get svc'
alias kgd='kubectl get deploy'
alias kgn='kubectl get nodes'
alias kd='kubectl describe'
alias kdp='kubectl describe pod'
alias kaf='kubectl apply -f'
alias kdf='kubectl delete -f'
alias kex='kubectl exec -it'
alias klo='kubectl logs'
alias klof='kubectl logs -f'

# Enable auto-completion (usually pre-configured)
source <(kubectl completion bash)
complete -o default -F __start_kubectl k

# Quick context switch
alias kcc='kubectl config current-context'
alias kuc='kubectl config use-context'

# The most important shortcut: generate YAML
export do='--dry-run=client -o yaml'
# Usage: k run nginx --image=nginx $do > pod.yaml
Exam Cheat Sheet

kubectl Power Moves for the Exam

Quick Resource Creation

# Pod with command
k run busybox --image=busybox \
  --restart=Never -- sleep 3600

# Service for existing deployment
k expose deploy app --port=80 --type=ClusterIP

# ConfigMap from literal
k create cm my-config \
  --from-literal=key=value

# Secret from literal
k create secret generic my-secret \
  --from-literal=pass=s3cret

# Role + RoleBinding
k create role pod-reader \
  --verb=get,list --resource=pods -n dev
k create rolebinding read-pods \
  --role=pod-reader --user=jane -n dev

Quick Debugging

# Force delete a stuck pod
k delete pod stuck --grace-period=0 --force

# Get all events sorted by time
k get events --sort-by=.lastTimestamp

# Get pod IPs
k get pods -o wide

# JSON path for specific field
k get pod my-pod \
  -o jsonpath='{.status.podIP}'

# All images in namespace
k get pods -o jsonpath=\
'{.items[*].spec.containers[*].image}'

# Watch pods in real-time
k get pods -w

# Replace a resource quickly
k get pod my-pod -o yaml > tmp.yaml
# edit tmp.yaml
k replace -f tmp.yaml --force
Knowledge Check

Quiz: CKA/CKAD Exam Prep

Q1: What percentage of the CKA exam is dedicated to Troubleshooting?

15%
20%
30%
40%
Correct: 30%. Troubleshooting is the single largest domain in the CKA exam. It covers debugging nodes, pods, services, networking, and cluster components.

Q2: What is the passing score for both CKA and CKAD exams?

50%
66%
70%
75%
Correct: 66%. Both CKA and CKAD require a minimum score of 66% to pass. The exam includes a free retake if you fail the first attempt.

Q3: Which kubectl flag generates YAML without creating the resource?

--output=yaml
--dry-run=client -o yaml
--generate-yaml
--template=yaml
Correct: --dry-run=client -o yaml. This creates the resource definition client-side without sending it to the API server, and outputs it as YAML. Essential for quickly generating templates during the exam.
Exam Resources

Study Resources & Practice Labs

Free Resources

  • kubernetes.io/docs — official docs (allowed in exam)
  • killer.sh — free exam simulator (included with exam purchase)
  • KodeKloud free labs — CKA/CKAD challenges
  • kubectl cheat sheet — kubernetes.io/docs/reference/kubectl/cheatsheet
  • GitHub: dgkanatsios/CKAD-exercises
  • Play with Kubernetes — browser-based cluster

Paid Courses

  • KodeKloud CKA/CKAD — hands-on labs, best for beginners
  • Linux Foundation courses — official training
  • Udemy: Mumshad Mannambeth — highly rated CKA course
  • A Cloud Guru — structured learning path

Practice on killer.sh at least twice before the exam. It's harder than the real exam, which is exactly what you want.

Exam Cheat Sheet

Bookmark These Docs Pages

During the exam, you can access kubernetes.io/docs. Knowing where to find things saves critical time.

Topic — Docs Location
kubectl Cheat Sheet — kubernetes.io/docs/reference/kubectl/cheatsheet/
Pod Spec Reference — kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/
Network Policies — kubernetes.io/docs/concepts/services-networking/network-policies/
RBAC — kubernetes.io/docs/reference/access-authn-authz/rbac/
etcd Backup/Restore — kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/
kubeadm Upgrade — kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
Persistent Volumes — kubernetes.io/docs/concepts/storage/persistent-volumes/
Security Context — kubernetes.io/docs/tasks/configure-pod-container/security-context/
💡
Exam Tip: Use the search bar on kubernetes.io — it's fast. But having mental bookmarks of key pages is even faster.
Final Knowledge Check

Quiz: Final Comprehensive Review

Q1: You need to check if a specific ServiceAccount can create deployments. Which command do you use?

kubectl get rolebinding --serviceaccount=ns:sa
kubectl auth can-i create deployments --as=system:serviceaccount:ns:sa
kubectl describe sa sa-name -n ns
kubectl auth check sa-name --verb=create --resource=deployments
Correct: kubectl auth can-i create deployments --as=system:serviceaccount:ns:sa. The --as flag impersonates the service account. Note the format: system:serviceaccount:NAMESPACE:SA_NAME.

Q2: Your pod is running but the readiness probe is failing. What happens?

The pod is killed and restarted
The pod is evicted from the node
The pod is removed from Service endpoints (no traffic routed to it)
Nothing — readiness probes are informational only
Correct: The pod is removed from Service endpoints, so no new traffic is routed to it. The pod continues running (unlike liveness probe failure, which restarts the container). This is the key difference between readiness and liveness probes.

Q3: During the CKA exam, which documentation sites are you allowed to access?

Only kubernetes.io/docs
kubernetes.io/docs, kubernetes.io/blog, helm.sh/docs, and github.com/kubernetes
Any website
No external resources are allowed
Correct: You can access kubernetes.io (docs, blog), helm.sh/docs, and github.com/kubernetes. No other websites, no personal notes, no Stack Overflow. Practice navigating these docs efficiently before the exam.
Course Recap

The Full Journey: All 8 Sessions

Sessions 1-4: Foundations

  1. K8s Fundamentals — architecture, pods, kubectl
  2. Workload Resources — deployments, scaling, updates
  3. Networking — services, ingress, DNS, network policies
  4. Storage — volumes, PV/PVC, StorageClasses

Sessions 5-8: Advanced

  5. Scheduling & Lifecycle — affinity, taints, probes
  6. Advanced Patterns — multi-container, jobs, operators
  7. Config, Security & RBAC — secrets, policies, access control
  8. Observability & Exam Prep — debugging, Helm, CKA/CKAD

You now have the knowledge to deploy, manage, secure, and troubleshoot Kubernetes workloads — and pass the certification exams.

Course Recap

Key Takeaways

Think Declaratively

Define desired state, let Kubernetes reconcile. This applies to everything: deployments, RBAC, network policies, operators.

Labels Are Everything

Services find pods via labels. Network policies select pods via labels. Scheduling uses labels. Get your labeling strategy right.

Security Is Not Optional

Run as non-root, set RBAC from day one, use network policies, scan images. Retro-fitting security is painful.

Master Debugging

get → describe → logs → exec. This flow solves 95% of issues. Practice until it's muscle memory.

Use kubectl Efficiently

Imperative commands for speed, declarative YAML for production. --dry-run=client -o yaml bridges the gap.

Keep Learning

Kubernetes evolves fast. Follow release notes, practice with new features, and stay connected to the community.

What's Next

Your Next Steps

This Week

  • Set up a practice cluster (kind, minikube, or AKS)
  • Deploy a real app end-to-end
  • Practice the debugging flow on broken pods
  • Schedule your exam date (accountability!)

Before the Exam

  • Complete 2 full killer.sh practice sessions
  • Master all imperative kubectl commands
  • Practice RBAC, NetworkPolicy, and etcd backup
  • Time yourself: 15 questions in 2 hours
Kubernetes Training Complete

Thank You & Good Luck!

You've invested significant time and effort in learning Kubernetes. Now it's time to put that knowledge into practice. Go build things, break things, fix things, and earn that certification.

CKA CKAD You've Got This