Kafka Integration, Tuning, Observability, and Production Readiness
From running Knative to running it well in production.
Knative offers three levels of Kafka integration: KafkaSource, KafkaChannel, and the Kafka Broker.

KafkaSource imports events from existing Kafka topics into Knative:

```yaml
apiVersion: sources.knative.dev/v1beta1
kind: KafkaSource
metadata:
  name: payment-events
spec:
  consumerGroup: knative-payments
  bootstrapServers:
    - kafka-cluster.kafka.svc.cluster.local:9092
  topics:
    - payments.completed
    - payments.failed
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default
  # Optional: where a new consumer group starts reading
  initialOffset: latest   # or "earliest"
```
KafkaChannel provides durable, Kafka-backed channels:

```shell
# Install Kafka Channel
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.0/eventing-kafka-channel-install.yaml
```

```yaml
# Set KafkaChannel as the default channel type
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-ch-webhook
  namespace: knative-eventing
data:
  default-ch-config: |
    clusterDefault:
      apiVersion: messaging.knative.dev/v1beta1
      kind: KafkaChannel
      spec:
        numPartitions: 3
        replicationFactor: 3
```
Key benefit: Events survive pod restarts and broker failures. In-Memory channels do NOT provide this guarantee.
The Kafka Broker replaces the default channel-based Broker with a Kafka-native implementation:

```shell
# Install Kafka Broker
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.0/eventing-kafka-broker.yaml
```

```yaml
# Create a Kafka Broker
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: kafka-broker
  annotations:
    eventing.knative.dev/broker.class: Kafka
spec:
  config:
    apiVersion: v1
    kind: ConfigMap
    name: kafka-broker-config
    namespace: knative-eventing
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-broker-config
  namespace: knative-eventing
data:
  default.topic.partitions: "6"
  default.topic.replication.factor: "3"
  bootstrap.servers: "kafka-cluster:9092"
```
| Feature | KafkaSource | KafkaChannel | Kafka Broker |
|---|---|---|---|
| Direction | Kafka to Knative | Within Knative | Within Knative |
| Use Case | Import external events | Durable channels | Full event hub |
| Ordering | Per partition | Per partition | Per partition |
| Performance | High | High | Highest |
| Complexity | Low | Medium | Medium |
| Durability | Kafka guarantees | Kafka guarantees | Kafka guarantees |
Recommendation: Use Kafka Broker for new deployments. Use KafkaSource when integrating with existing Kafka topics.
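Triggers subscribe to a Kafka-backed Broker exactly as they would to any other Broker. A sketch, assuming a `payment-processor` Knative Service exists:

```yaml
# Route completed-payment events from the Kafka-backed Broker
# ("payment-processor" is an assumed Knative Service)
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: payment-trigger
spec:
  broker: kafka-broker
  filter:
    attributes:
      type: payments.completed
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: payment-processor
```

Because the Broker is Kafka-backed, events that arrive while the subscriber is down are retained by Kafka and delivered once it recovers.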
1. What is the key advantage of using KafkaChannel over InMemoryChannel?
2. What is the difference between KafkaSource and Kafka Broker?
3. What broker class annotation do you use for a Kafka-backed Broker?
Two approaches to create your own event sources:

- SinkBinding: injects a K_SINK environment variable into any existing Deployment (or other PodSpec-able resource); your app posts CloudEvents to that URL.
- ContainerSource: Knative manages the container lifecycle for you, injecting K_SINK and running your container.
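The SinkBinding approach can be sketched as follows; the `event-publisher` Deployment name is an assumption:

```yaml
# Bind an existing Deployment to the default Broker
apiVersion: sources.knative.dev/v1
kind: SinkBinding
metadata:
  name: bind-publisher
spec:
  subject:
    apiVersion: apps/v1
    kind: Deployment
    name: event-publisher
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default
```

Once applied, every pod of `event-publisher` receives K_SINK pointing at the Broker's ingress URL.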
```yaml
apiVersion: sources.knative.dev/v1
kind: ContainerSource
metadata:
  name: azure-blob-watcher
spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/blob-watcher:v1
          env:
            - name: STORAGE_ACCOUNT
              value: "mystorageaccount"
            - name: CONTAINER_NAME
              value: "uploads"
            - name: CONNECTION_STRING
              valueFrom:
                secretKeyRef:
                  name: storage-secret
                  key: connection-string
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default
```
Your container polls Azure Blob Storage and emits CloudEvents to $K_SINK when new blobs appear.
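The emit step can be sketched in Python using only the standard library; the event type, source, and payload fields below are illustrative choices, not a fixed schema:

```python
import json
import os
import urllib.request
import uuid

def make_cloudevent(event_type: str, source: str, data: dict):
    """Build binary-mode CloudEvent HTTP headers and a JSON body."""
    headers = {
        "Ce-Specversion": "1.0",        # CloudEvents spec version
        "Ce-Type": event_type,          # e.g. com.example.blob.created
        "Ce-Source": source,            # logical origin of the event
        "Ce-Id": str(uuid.uuid4()),     # unique per event
        "Content-Type": "application/json",
    }
    return headers, json.dumps(data).encode("utf-8")

def emit(sink: str, headers: dict, body: bytes) -> int:
    """POST one event to the sink URL Knative injects as K_SINK."""
    req = urllib.request.Request(sink, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    sink = os.environ.get("K_SINK")  # injected by ContainerSource
    if sink:
        headers, body = make_cloudevent(
            "com.example.blob.created",                     # illustrative type
            "/azure-blob-watcher",                          # illustrative source
            {"container": "uploads", "blob": "report.pdf"},
        )
        emit(sink, headers, body)
```

Binary mode puts CloudEvents attributes in `Ce-*` headers and the payload in the body, which is the simplest shape to hand-roll without an SDK.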
```yaml
# config-autoscaler ConfigMap (knative-serving namespace)
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Scale-to-zero settings
  enable-scale-to-zero: "true"
  scale-to-zero-grace-period: "30s"
  scale-to-zero-pod-retention-period: "0s"
  # Scaling windows
  stable-window: "60s"                # Window for stable-mode decisions
  panic-window-percentage: "10"       # % of stable window for panic mode
  panic-threshold-percentage: "200"   # Trigger panic at 2x target
  # Scale bounds
  max-scale-up-rate: "1000"           # Max ratio of scale-up per tick
  max-scale-down-rate: "2"            # Max ratio of scale-down per tick
  # Target utilization
  target-burst-capacity: "200"        # Extra capacity for bursts
  activator-capacity: "100"           # Requests the activator can buffer
```
| Normal (Stable Mode) | Panic Mode |
|---|---|
| 60-second window | 6-second window (10% of stable) |
| Gradual scaling | Aggressive scaling |

Panic triggers when observed concurrency exceeds 2x the target (panic-threshold-percentage: "200"). The autoscaler returns to stable mode once traffic stays below the target for a full stable window.
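The panic test reduces to a one-line comparison. A worked example of the arithmetic (a sketch, not the autoscaler's actual source):

```python
def panic_triggered(observed_concurrency: float, target: float,
                    panic_threshold_pct: float = 200.0) -> bool:
    """Panic mode fires when observed concurrency exceeds
    panic-threshold-percentage of the per-pod target."""
    return observed_concurrency > target * panic_threshold_pct / 100.0

# With a per-pod target of 10, panic fires above 20 concurrent requests.
print(panic_triggered(25, 10))  # True
print(panic_triggered(15, 10))  # False
```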
Three pillars of observability for Knative:

- Logging: structured JSON logs from all Knative components, configured via the config-logging ConfigMap.
- Metrics: Prometheus-format metrics from the autoscaler, activator, and queue-proxy.
- Tracing: distributed traces of requests as they cross services, exported to a backend such as Zipkin.
```yaml
# config-observability ConfigMap (knative-serving namespace)
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  # Emit metrics in Prometheus format
  metrics.backend-destination: prometheus
  # Request metrics reporting period
  metrics.reporting-period-seconds: "5"
---
# Tracing is configured separately, in the config-tracing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-tracing
  namespace: knative-serving
data:
  backend: zipkin
  zipkin-endpoint: "http://zipkin.observability.svc.cluster.local:9411/api/v2/spans"
  # Sample rate (1.0 = trace everything, 0.1 = 10%)
  sample-rate: "0.1"
```
| Metric | What It Tells You |
|---|---|
| revision_request_count | Total requests per revision |
| revision_request_latencies | Response time distribution (p50, p95, p99) |
| revision_app_request_count | Requests reaching your container (excludes queue-proxy overhead) |
| autoscaler_desired_pods | How many pods the autoscaler wants |
| autoscaler_actual_pods | How many pods are actually running |
| activator_request_count | Requests buffered by the activator (cold starts) |
| queue_depth | Requests waiting in queue-proxy |
1. When does the Knative autoscaler enter "panic mode"?
2. Which ConfigMaps control metrics and tracing in Knative Serving?
3. What does the max-scale-down-rate setting control?
```yaml
# config-logging ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging
  namespace: knative-serving
data:
  # Log level per Knative component
  loglevel.controller: "info"
  loglevel.autoscaler: "info"   # Set to "debug" for troubleshooting
  loglevel.activator: "info"
  loglevel.webhook: "info"
  loglevel.queueproxy: "info"
  # Structured logging format
  zap-logger-config: |
    {
      "level": "info",
      "development": false,
      "outputPaths": ["stdout"],
      "errorOutputPaths": ["stderr"],
      "encoding": "json",
      "encoderConfig": {
        "timeKey": "ts",
        "levelKey": "level",
        "nameKey": "logger",
        "callerKey": "caller",
        "messageKey": "msg"
      }
    }
```
Istio provides advanced networking features beyond basic Kourier:
```shell
# Install Knative with Istio networking
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.14.0/net-istio.yaml

# Configure Knative to use Istio
kubectl patch configmap/config-network \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"ingress-class":"istio.ingress.networking.knative.dev"}}'
```
| Factor | Kourier | Istio |
|---|---|---|
| Complexity | Simple | Complex |
| Resource Usage | Lightweight (~50MB) | Heavy (~500MB+) |
| mTLS | No | Yes (automatic) |
| Auth Policies | No | Yes |
| Traffic Mirroring | No | Yes |
| Learning Curve | Low | High |
| Best For | Dev, simple prod | Enterprise, multi-tenant |
Recommendation: Start with Kourier. Move to Istio when you need mTLS, authorization, or are already using Istio for other workloads.
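Istio's automatic mTLS from the table can be enabled mesh-wide with a single resource. A sketch, assuming sidecar injection is enabled for your workload namespaces:

```yaml
# Enforce strict mutual TLS for all sidecar-injected workloads
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

Kourier has no equivalent; this is typically the first concrete reason teams switch.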
```yaml
# Example ResourceQuota for a team namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    pods: "50"
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
```
| Strategy | Impact | Trade-off |
|---|---|---|
| Set minScale: "1" | Eliminates cold starts | Always-on cost |
| Small container images | Faster image pull | Build complexity |
| Pre-pull with DaemonSet | No image pull delay | Disk space on nodes |
| Fast app startup | Reduces init time | App refactoring |
| Increase target-burst-capacity | More headroom | More idle pods |
| Use scale-down-delay | Prevents premature scale-down | More idle time |
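The first and last strategies in the table map to per-Service annotations. A sketch, assuming a Service named my-api:

```yaml
# Keep one warm pod and delay scale-down after a burst
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "1"          # never scale to zero
        autoscaling.knative.dev/scale-down-delay: "5m"  # hold capacity 5 minutes
```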
```yaml
# Pre-pull images with a DaemonSet
# (selector and template labels are required for apps/v1 DaemonSets)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepull
spec:
  selector:
    matchLabels:
      app: image-prepull
  template:
    metadata:
      labels:
        app: image-prepull
    spec:
      initContainers:
        - name: prepull
          image: myregistry.azurecr.io/my-api:v1
          command: ["sh", "-c", "exit 0"]
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
```
For fast, lightweight endpoints:

```yaml
annotations:
  autoscaling.knative.dev/target: "100"
spec:
  containerConcurrency: 0   # unlimited
```

Result: Fewer pods, each handling many requests.
For heavy processing (ML, image processing):

```yaml
annotations:
  autoscaling.knative.dev/target: "1"
spec:
  containerConcurrency: 1   # one request at a time
```

Result: Many pods, each handling one request.
Tip: Load test to find the right target. Start with target = 70% of what your container can handle at acceptable latency.
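The rule of thumb as arithmetic (the numbers are illustrative):

```python
def starting_target(max_concurrency_at_slo: int, utilization_pct: int = 70) -> int:
    """Initial autoscaling target: ~70% of the concurrency one container
    sustains while still meeting its latency SLO."""
    return max(1, max_concurrency_at_slo * utilization_pct // 100)

# A container that holds 150 concurrent requests at acceptable latency:
print(starting_target(150))  # 105
```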
1. What is the most effective way to completely eliminate cold starts?
2. When should you choose Istio over Kourier for Knative networking?
3. For an ML inference endpoint that takes 5 seconds per request, what is the recommended concurrency setup?
```shell
# Step 1: Check service status
kn service describe my-api
# Look at Conditions:
#   Ready: False
#   ConfigurationsReady: False
#   RoutesReady: True

# Step 2: Check the latest revision
kn revision describe my-api-00003
# Look for:
#   ContainerHealthy: False
#   ResourcesAvailable: False

# Step 3: Check pods
kubectl get pods -l serving.knative.dev/service=my-api
kubectl describe pod my-api-00003-deployment-xxx

# Common causes:
# - Image pull errors (wrong image name, missing credentials)
# - Container crash (check logs: kubectl logs ...)
# - Readiness probe failing
# - Resource limits too low (OOMKilled)
# - Missing ConfigMaps or Secrets
```
```shell
# Revision stuck in "not ready"
kubectl get revisions
# NAME          CONFIG NAME   READY   REASON
# my-api-00003  my-api        False   ContainerMissing

# Check revision details
kubectl get revision my-api-00003 -o yaml | grep -A 10 "conditions"
```

Common REASON values and fixes:
| Reason | Cause | Fix |
|---|---|---|
| ContainerMissing | Image not found | Check image name and registry access |
| ExitCode1 | Container crashes | Check container logs |
| ResourcesUnavailable | Not enough cluster resources | Scale cluster or reduce requests |
| ProgressDeadlineExceeded | Pod took too long to start | Check probes, image size, startup time |
```shell
# Events not being delivered? Systematic check:

# 1. Check Broker status
kubectl get broker default -o yaml
# Is READY: True?

# 2. Check Trigger status
kubectl get triggers -o wide
# Are all triggers READY: True? Is the subscriber URL correct?

# 3. Check Source status
kubectl get pingsource,kafkasource -o wide

# 4. Deploy event-display to see what's arriving
kn service create debug-display \
  --image gcr.io/knative-releases/knative.dev/eventing/cmd/event_display
kn trigger create catch-all --broker default --sink ksvc:debug-display
kubectl logs -l serving.knative.dev/service=debug-display -f

# 5. Check eventing controller logs
kubectl logs -n knative-eventing -l app=eventing-controller --tail=50

# 6. Check dead letter sink for failed deliveries
kubectl logs -l serving.knative.dev/service=dead-letter-handler
```
```shell
# AKS-specific: internal load balancer for Kourier
kubectl annotate svc kourier -n kourier-system \
  service.beta.kubernetes.io/azure-load-balancer-internal="true"
```
```shell
# Create AKS cluster optimized for Knative
az aks create \
  --resource-group myRG \
  --name knative-cluster \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --network-plugin azure \
  --network-policy azure \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 20 \
  --zones 1 2 3 \
  --attach-acr myACR \
  --enable-managed-identity
```

```shell
# Configure DNS for Knative
# Option 1: Use Azure DNS with external-dns
# Option 2: Use nip.io for development
# Option 3: Configure config-domain with your domain
kubectl patch configmap/config-domain \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"mycompany.com":""}}'
```
```
Is it an HTTP workload?
├─ No  -> Regular Deployment
└─ Yes -> Is it stateless?
          ├─ No  -> Regular Deployment / StatefulSet
          └─ Yes -> Does it have variable/bursty traffic?
                    ├─ No (steady) -> Either works; Knative adds convenience
                    └─ Yes -> Do you want scale-to-zero?
                              ├─ Yes -> Knative!
                              └─ No  -> Do you want built-in traffic splitting?
                                        ├─ Yes -> Knative!
                                        └─ No  -> Regular Deployment is fine
```
Knative can coexist with regular Deployments: same cluster, same namespace, no conflicts. To convert a Deployment to a Knative Service:

1. Take your container image.
2. Create a Knative Service YAML.
3. Move env vars, Secrets, and ConfigMaps across.
4. Add autoscaling annotations.
5. Deploy and test.
6. Switch DNS / traffic.
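The conversion, sketched for a hypothetical my-api Deployment (image, annotation value, and env var are placeholders):

```yaml
# Knative Service equivalent of a plain my-api Deployment
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "100"
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1
          env:
            - name: LOG_LEVEL
              value: "info"
```

Knative generates the Deployment, Service, and Route objects from this one resource, so the original Deployment YAML can be retired once traffic is switched.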
- Serverless on K8s, no vendor lock-in, Serving + Eventing components
- Autoscaling (KPA), traffic splitting, custom domains, TLS
- CloudEvents, Broker/Trigger, Sources, Sequences, dead letters
- Kafka, tuning, observability, troubleshooting, AKS best practices

Your containers now scale to zero, spring back to life, split traffic, and react to events. Welcome to serverless on Kubernetes.