Module 4 of 4

Advanced Knative and Operations

Kafka Integration, Tuning, Observability, and Production Readiness

From running Knative to running it well in production.

Kafka Integration

"Kafka is the backbone of event-driven architectures. Knative's Kafka integration lets you use Kafka as the durable, high-throughput foundation for your events."

Knative offers three levels of Kafka integration:

  • KafkaSource: consume existing Kafka topics into Knative
  • KafkaChannel: Kafka-backed durable channels
  • Kafka Broker: Kafka as the native Broker implementation

KafkaSource in Detail

apiVersion: sources.knative.dev/v1beta1
kind: KafkaSource
metadata:
  name: payment-events
spec:
  consumerGroup: knative-payments
  bootstrapServers:
    - kafka-cluster.kafka.svc.cluster.local:9092
  topics:
    - payments.completed
    - payments.failed
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default
  # Optional: initial offset
  initialOffset: latest     # or "earliest"
  

KafkaChannel: Durable Event Delivery

# Install the Kafka controller and the KafkaChannel data plane
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.0/eventing-kafka-controller.yaml
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.0/eventing-kafka-channel.yaml

# Set KafkaChannel as the default channel type
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-ch-webhook
  namespace: knative-eventing
data:
  default-ch-config: |
    clusterDefault:
      apiVersion: messaging.knative.dev/v1beta1
      kind: KafkaChannel
      spec:
        numPartitions: 3
        replicationFactor: 3
  

Key benefit: Events survive pod restarts and broker failures. In-Memory channels do NOT provide this guarantee.
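
With KafkaChannel set as the cluster default, a plain Channel resolves to Kafka, but you can also create one explicitly. A minimal sketch (the channel name is illustrative):

apiVersion: messaging.knative.dev/v1beta1
kind: KafkaChannel
metadata:
  name: orders-channel
spec:
  numPartitions: 3
  replicationFactor: 3

Subscriptions and Sequences that reference this channel get Kafka's durability automatically.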

Kafka Broker: Native Integration

# Install the Kafka controller and the Kafka Broker data plane
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.0/eventing-kafka-controller.yaml
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.0/eventing-kafka-broker.yaml

# Create a Kafka Broker
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: kafka-broker
  annotations:
    eventing.knative.dev/broker.class: Kafka
spec:
  config:
    apiVersion: v1
    kind: ConfigMap
    name: kafka-broker-config
    namespace: knative-eventing
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-broker-config
  namespace: knative-eventing
data:
  default.topic.partitions: "6"
  default.topic.replication.factor: "3"
  bootstrap.servers: "kafka-cluster:9092"
  

Kafka Integration Comparison

Feature       KafkaSource             KafkaChannel        Kafka Broker
Direction     Kafka to Knative        Within Knative      Within Knative
Use Case      Import external events  Durable channels    Full event hub
Ordering      Per partition           Per partition       Per partition
Performance   High                    High                Highest
Complexity    Low                     Medium              Medium
Durability    Kafka guarantees        Kafka guarantees    Kafka guarantees

Recommendation: Use Kafka Broker for new deployments. Use KafkaSource when integrating with existing Kafka topics.
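
Consumers subscribe to a Kafka-backed Broker with ordinary Triggers. A sketch, assuming producers set a CloudEvents type of payments.completed and a Knative Service named payment-processor exists:

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: payment-completed
spec:
  broker: kafka-broker
  filter:
    attributes:
      type: payments.completed
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: payment-processor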

Knowledge Check

1. What is the key advantage of using KafkaChannel over InMemoryChannel?

A) KafkaChannel is faster
B) KafkaChannel provides durability -- events survive pod restarts
C) KafkaChannel supports more event types
Correct: B. KafkaChannel stores events in Kafka, providing durability. InMemoryChannel loses all undelivered events if the pod restarts.

2. What is the difference between KafkaSource and Kafka Broker?

A) KafkaSource imports events from Kafka; Kafka Broker uses Kafka as the internal event routing layer
B) They are the same thing with different names
C) KafkaSource is for production, Kafka Broker is for development
Correct: A. KafkaSource consumes from existing Kafka topics into Knative. Kafka Broker uses Kafka internally as the backbone for Broker/Trigger event routing.

3. What broker class annotation do you use for a Kafka-backed Broker?

A) eventing.knative.dev/broker.class: KafkaBroker
B) eventing.knative.dev/broker.class: Kafka
C) eventing.knative.dev/broker.class: MTKafkaBroker
Correct: B. The annotation value is simply "Kafka" to use the native Kafka Broker implementation.

Custom Event Sources

Two approaches to create your own event sources:

SinkBinding

Injects K_SINK into any Deployment. Your app posts CloudEvents to that URL.

  • Easiest approach
  • Works with any language
  • You manage the app lifecycle
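
A SinkBinding sketch (the Deployment and Broker names are illustrative): Knative resolves the sink to a URL and injects it as K_SINK into the matched Deployment's containers.

apiVersion: sources.knative.dev/v1
kind: SinkBinding
metadata:
  name: app-binding
spec:
  subject:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default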

ContainerSource

Knative manages the container lifecycle. It injects K_SINK and runs your container.

  • Knative-managed lifecycle
  • Better for dedicated sources
  • Automatic restarts

ContainerSource Example

apiVersion: sources.knative.dev/v1
kind: ContainerSource
metadata:
  name: azure-blob-watcher
spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/blob-watcher:v1
          env:
            - name: STORAGE_ACCOUNT
              value: "mystorageaccount"
            - name: CONTAINER_NAME
              value: "uploads"
            - name: CONNECTION_STRING
              valueFrom:
                secretKeyRef:
                  name: storage-secret
                  key: connection-string
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default
  

Your container polls Azure Blob Storage and emits CloudEvents to $K_SINK when new blobs appear.
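
The emitting side is a few lines in any language. A minimal Python sketch of the binary-mode CloudEvents HTTP POST (function names and the event type are illustrative; a real watcher would add the Azure SDK, polling, and error handling):

```python
import json
import os
import urllib.request
import uuid

def build_cloudevent(event_type: str, source: str, data: dict):
    """Build binary-mode CloudEvent headers and a JSON body for an HTTP POST."""
    headers = {
        "Content-Type": "application/json",
        "ce-specversion": "1.0",
        "ce-type": event_type,
        "ce-source": source,
        "ce-id": str(uuid.uuid4()),
    }
    return headers, json.dumps(data).encode("utf-8")

def emit(sink: str, event_type: str, source: str, data: dict) -> int:
    """POST one CloudEvent to the injected K_SINK URL; returns the HTTP status."""
    headers, body = build_cloudevent(event_type, source, data)
    req = urllib.request.Request(sink, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    # K_SINK is injected by ContainerSource (or SinkBinding)
    sink = os.environ["K_SINK"]
    emit(sink, "com.example.blob.created", "/blob-watcher", {"blob": "report.csv"})
```

The Broker accepts the event as long as the four required attributes (specversion, type, source, id) are present as ce- headers.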

Autoscaler Tuning: config-autoscaler

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Scale-to-zero settings
  enable-scale-to-zero: "true"
  scale-to-zero-grace-period: "30s"
  scale-to-zero-pod-retention-period: "0s"

  # Scaling windows
  stable-window: "60s"           # Window for stable mode decisions
  panic-window-percentage: "10"  # % of stable window for panic mode
  panic-threshold-percentage: "200" # Trigger panic if 2x target

  # Scale bounds
  max-scale-up-rate: "1000"      # Max ratio of scale-up per tick
  max-scale-down-rate: "2"       # Max ratio of scale-down per tick

  # Target utilization
  target-burst-capacity: "200"   # Extra capacity for bursts
  activator-capacity: "100"      # Requests activator can buffer
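
These ConfigMap values are cluster-wide defaults; individual services can override many of them with annotations on the revision template. A sketch (the values are illustrative):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "30s"      # per-service stable window
        autoscaling.knative.dev/min-scale: "1"     # never scale to zero
        autoscaling.knative.dev/max-scale: "20"    # cap the fleet size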
  

Panic Mode: Handling Traffic Spikes

"When traffic suddenly doubles, you can't wait 60 seconds to scale up. Panic mode kicks in and scales aggressively using a much shorter observation window."
  Normal (Stable Mode)              Panic Mode
  ------------------               ----------
  60-second window                  6-second window (10% of 60s)
  Gradual scaling                   Aggressive scaling

  Panic triggers when:
  observed_concurrency > 2x target (panic-threshold: 200%)

  Returns to stable when:
  traffic stays below target for full stable window
    

Observability Stack

Three pillars of observability for Knative:

Metrics

  • Request count, latency, errors
  • Autoscaler decisions
  • Queue depth in queue-proxy
  • Export to Prometheus

Tracing

  • Distributed request tracing
  • End-to-end event flow
  • Zipkin or Jaeger
  • OpenTelemetry support

Logging: Structured JSON logs from all Knative components, configurable via config-logging ConfigMap.

Metrics Configuration

# config-observability ConfigMap (knative-serving namespace)
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  # Send metrics to Prometheus
  metrics.backend-destination: prometheus

  # Request metrics reporting period
  metrics.reporting-period-seconds: "5"
---
# Tracing is configured in a separate ConfigMap: config-tracing
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-tracing
  namespace: knative-serving
data:
  # Tracing backend
  backend: "zipkin"
  zipkin-endpoint: "http://zipkin.observability.svc.cluster.local:9411/api/v2/spans"

  # Sample rate (1.0 = trace everything, 0.1 = 10%)
  sample-rate: "0.1"
  

Key Metrics to Monitor

Metric                       What It Tells You
revision_request_count       Total requests per revision
revision_request_latencies   Response time distribution (p50, p95, p99)
revision_app_request_count   Requests reaching your container (excludes queue-proxy overhead)
autoscaler_desired_pods      How many pods the autoscaler wants
autoscaler_actual_pods       How many pods are actually running
activator_request_count      Requests buffered by the activator (cold starts)
queue_depth                  Requests waiting in queue-proxy

Knowledge Check

1. When does the Knative autoscaler enter "panic mode"?

A) When any pod crashes
B) When observed concurrency exceeds the panic threshold (default 200% of target)
C) When CPU usage exceeds 90%
Correct: B. Panic mode triggers when concurrency exceeds the panic-threshold-percentage (default 200%) of the target, using a shorter observation window for faster scaling.

2. Which ConfigMap controls the metrics backend in Knative Serving?

A) config-observability
B) config-metrics
C) config-tracing
Correct: A. The config-observability ConfigMap in the knative-serving namespace controls the metrics backend and reporting period. Tracing (backend, endpoint, sample rate) is configured separately in config-tracing.

3. What does the max-scale-down-rate setting control?

A) The maximum number of pods that can be removed at once
B) The maximum ratio of current to desired pods when scaling down per evaluation cycle
C) The time delay between scale-down decisions
Correct: B. max-scale-down-rate (default 2) means the autoscaler can halve the number of pods per tick at most, preventing aggressive scale-down.

Logging Configuration

# config-logging ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging
  namespace: knative-serving
data:
  # Log level for Knative components
  loglevel.controller: "info"
  loglevel.autoscaler: "info"    # Set to "debug" for troubleshooting
  loglevel.activator: "info"
  loglevel.webhook: "info"
  loglevel.queueproxy: "info"

  # Structured logging format
  zap-logger-config: |
    {
      "level": "info",
      "development": false,
      "outputPaths": ["stdout"],
      "errorOutputPaths": ["stderr"],
      "encoding": "json",
      "encoderConfig": {
        "timeKey": "ts",
        "levelKey": "level",
        "nameKey": "logger",
        "callerKey": "caller",
        "messageKey": "msg"
      }
    }
  

Knative with Istio

Istio provides advanced networking features beyond basic Kourier:

# Install Knative with Istio networking
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.14.0/net-istio.yaml

# Configure Knative to use Istio
kubectl patch configmap/config-network \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"ingress-class":"istio.ingress.networking.knative.dev"}}'
  

Kourier vs. Istio: When to Choose

Factor              Kourier               Istio
Complexity          Simple                Complex
Resource Usage      Lightweight (~50MB)   Heavy (~500MB+)
mTLS                No                    Yes (automatic)
Auth Policies       No                    Yes
Traffic Mirroring   No                    Yes
Learning Curve      Low                   High
Best For            Dev, simple prod      Enterprise, multi-tenant

Recommendation: Start with Kourier. Move to Istio when you need mTLS, authorization, or are already using Istio for other workloads.

Multi-Tenant Knative

"Multiple teams sharing one Knative installation? You need isolation, resource limits, and clear ownership boundaries."
# Example ResourceQuota for a team namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    pods: "50"
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
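
A ResourceQuota rejects pods that carry no resource requests at all, so it pairs naturally with a LimitRange that fills in defaults. A sketch (the default values are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-defaults
  namespace: team-alpha
spec:
  limits:
    - type: Container
      default:            # applied when no limit is set
        cpu: "1"
        memory: 512Mi
      defaultRequest:     # applied when no request is set
        cpu: 250m
        memory: 256Mi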
  

Performance: Optimizing Cold Starts

Strategy                         Impact                          Trade-off
Set minScale: "1"                Eliminates cold start           Always-on cost
Small container images           Faster image pull               Build complexity
Pre-pull with DaemonSet          No image pull delay             Disk space on nodes
Fast app startup                 Reduces init time               App refactoring
Increase target-burst-capacity   More headroom                   More idle pods
Use scale-down-delay             Prevents premature scale-down   More idle time
# Pre-pull images with a DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepull
spec:
  selector:
    matchLabels:
      app: image-prepull
  template:
    metadata:
      labels:
        app: image-prepull
    spec:
      initContainers:
        - name: prepull
          image: myregistry.azurecr.io/my-api:v1
          command: ["sh", "-c", "exit 0"]
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
  

Concurrency Tuning Guide

High Concurrency

For fast, lightweight endpoints:

annotations:
  autoscaling.knative.dev/target: "100"
spec:
  containerConcurrency: 0  # unlimited
      

Result: Fewer pods, each handling many requests.

Low Concurrency

For heavy processing (ML, image processing):

annotations:
  autoscaling.knative.dev/target: "1"
spec:
  containerConcurrency: 1  # one at a time
      

Result: Many pods, each handling one request.

Tip: Load test to find the right target. Start with target = 70% of what your container can handle at acceptable latency.
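
As a worked example (the numbers are illustrative): if load tests show a pod sustains about 10 concurrent requests at acceptable latency, set the target to 70% of that and cap at the measured limit:

annotations:
  autoscaling.knative.dev/target: "7"   # 70% of a measured capacity of ~10
spec:
  containerConcurrency: 10              # hard cap at measured capacity

The gap between target and cap absorbs bursts while new pods spin up.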

Knowledge Check

1. What is the most effective way to completely eliminate cold starts?

A) Set minScale to 1 or higher
B) Use a smaller container image
C) Increase the stable-window
Correct: A. Setting minScale to 1+ ensures at least one pod is always running. Other strategies reduce cold start duration but don't eliminate it.

2. When should you choose Istio over Kourier for Knative networking?

A) For all production deployments
B) When you need mTLS, authorization policies, or already use Istio
C) When you have more than 10 services
Correct: B. Istio adds significant complexity and resources. Choose it when you specifically need its features (mTLS, auth policies, traffic mirroring) or already run Istio.

3. For an ML inference endpoint that takes 5 seconds per request, what is the recommended concurrency setup?

A) target: 100, containerConcurrency: 0
B) target: 1, containerConcurrency: 1
C) target: 50, containerConcurrency: 50
Correct: B. For heavy processing workloads like ML inference, low concurrency (1) ensures each pod handles one request at a time, preventing resource contention and timeouts.

Troubleshooting: Service Not Ready

# Step 1: Check service status
kn service describe my-api
# Look at Conditions:
#   Ready: False
#   ConfigurationsReady: False
#   RoutesReady: True

# Step 2: Check the latest revision
kn revision describe my-api-00003
# Look for:
#   ContainerHealthy: False
#   ResourcesAvailable: False

# Step 3: Check pods
kubectl get pods -l serving.knative.dev/service=my-api
kubectl describe pod my-api-00003-deployment-xxx

# Common causes:
# - Image pull errors (wrong image name, missing credentials)
# - Container crash (check logs: kubectl logs ...)
# - Readiness probe failing
# - Resource limits too low (OOMKilled)
# - Missing ConfigMaps or Secrets
  

Troubleshooting: Revision Failures

# Revision stuck in "not ready"
kubectl get revisions
NAME               CONFIG NAME   READY   REASON
my-api-00003       my-api        False   ContainerMissing

# Check revision details
kubectl get revision my-api-00003 -o yaml | grep -A 10 "conditions"

# Common REASON values and fixes:
  
Reason                     Cause                          Fix
ContainerMissing           Image not found                Check image name and registry access
ExitCode1                  Container crashes              Check container logs
ResourcesUnavailable       Not enough cluster resources   Scale cluster or reduce requests
ProgressDeadlineExceeded   Pod took too long to start     Check probes, image size, startup time

Troubleshooting: Eventing Issues

# Events not being delivered? Systematic check:

# 1. Check Broker status
kubectl get broker default -o yaml
# Is READY: True?

# 2. Check Trigger status
kubectl get triggers -o wide
# Are all triggers READY: True?
# Is the subscriber URL correct?

# 3. Check Source status
kubectl get pingsource,kafkasource -o wide

# 4. Deploy event-display to see what's arriving
kn service create debug-display \
  --image gcr.io/knative-releases/knative.dev/eventing/cmd/event_display
kn trigger create catch-all --broker default --sink ksvc:debug-display
kubectl logs -l serving.knative.dev/service=debug-display -f

# 5. Check eventing controller logs
kubectl logs -n knative-eventing -l app=eventing-controller --tail=50

# 6. Check dead letter sink for failed deliveries
kubectl logs -l serving.knative.dev/service=dead-letter-handler
  

Knative on AKS Best Practices

# AKS-specific: Internal load balancer for Kourier
kubectl annotate svc kourier -n kourier-system \
  service.beta.kubernetes.io/azure-load-balancer-internal="true"
  

AKS Network Configuration for Knative

# Create AKS cluster optimized for Knative
az aks create \
  --resource-group myRG \
  --name knative-cluster \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --network-plugin azure \
  --network-policy azure \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 20 \
  --zones 1 2 3 \
  --attach-acr myACR \
  --enable-managed-identity

# Configure DNS for Knative
# Option 1: Use Azure DNS with external-dns
# Option 2: Use nip.io for development
# Option 3: Configure config-domain with your domain
kubectl patch configmap/config-domain \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"mycompany.com":""}}'
  

When to Use Knative vs. Regular Deployments

Use Knative When:

  • Traffic is bursty or unpredictable
  • Services have idle periods (scale-to-zero)
  • You need built-in traffic splitting
  • Event-driven processing
  • Rapid iteration with revisions
  • Stateless HTTP services

Use Regular Deployments When:

  • Stateful workloads (databases)
  • Long-running background workers
  • Non-HTTP protocols (gRPC streaming, TCP)
  • Need PersistentVolumes
  • DaemonSets or node-specific workloads
  • Steady, predictable traffic

Decision Flowchart

  Is it an HTTP workload?
     /          \
   Yes           No --> Regular Deployment
    |
  Is it stateless?
     /          \
   Yes           No --> Regular Deployment (StatefulSet)
    |
  Does it have variable/bursty traffic?
     /          \
   Yes           No (steady) --> Either works, Knative adds convenience
    |
  Do you want scale-to-zero?
     /          \
   Yes           No
    |              |
  Knative!      Do you want built-in traffic splitting?
                   /          \
                 Yes           No --> Regular Deployment is fine
                  |
               Knative!
    

Migrating to Knative

# Knative can coexist with regular Deployments!
# Same cluster, same namespace, no conflicts.

# Convert a Deployment to Knative Service:
# 1. Take your container image
# 2. Create a Knative Service YAML
# 3. Move env vars, secrets, configmaps
# 4. Add autoscaling annotations
# 5. Deploy and test
# 6. Switch DNS / traffic
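
The steps above condense to replacing a Deployment + Service + Ingress with one Knative Service. A minimal sketch (the image, env var, and Secret names are illustrative):

# Before: Deployment + Service + Ingress. After: one resource.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # allow scale-to-zero
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1
          env:
            - name: LOG_LEVEL
              value: "info"
          envFrom:
            - secretRef:
                name: my-api-secrets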
  


Course Summary

Module 1: Introduction

Serverless on K8s, no vendor lock-in, Serving + Eventing components

Module 2: Serving

Autoscaling (KPA), traffic splitting, custom domains, TLS

Module 3: Eventing

CloudEvents, Broker/Trigger, Sources, Sequences, dead letters

Module 4: Operations

Kafka, tuning, observability, troubleshooting, AKS best practices

Key Takeaways

Your containers now scale to zero, spring back to life, split traffic, and react to events. Welcome to serverless on Kubernetes.
