Module 2 of 4

Knative Serving Deep Dive

Autoscaling, Traffic Splitting, and Production-Ready Services

From simple deployments to sophisticated traffic management -- mastering Knative Serving.

Knative Service Anatomy

A Knative Service manages three child resources automatically:

        Knative Service
        /      |       \
       /       |        \
Configuration  |      Route
      |        |        |
   Revision    |   Traffic Rules
   (v1, v2..)  |   (% splits, tags)
               |
         Kubernetes
        (Pods, Services)
    

Configuration

The Configuration defines what your service looks like:
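You rarely create a Configuration by hand -- the Service generates and owns it -- but its shape mirrors the Service's template. A minimal sketch (the image name is illustrative):

```yaml
apiVersion: serving.knative.dev/v1
kind: Configuration
metadata:
  name: my-api
spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1   # illustrative image
```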

Each change to the Configuration creates a new Revision.

Revision: Immutable Snapshots

"A Revision is like a Git commit for your running service. Once created, it never changes. You can always go back."

Route: Traffic Management

The Route controls how traffic reaches your Revisions:

# Route URL pattern
http://service-name.namespace.example.com         # main
http://tag-name-service-name.namespace.example.com # tagged
  

Creating Services: Full YAML

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
  namespace: production
  labels:
    app: my-api
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "10"
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1.2.0
          ports:
            - containerPort: 8080
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: url
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
  

Knowledge Check

1. What are the three child resources managed by a Knative Service?

A) Configuration, Revision, Route
B) Deployment, ReplicaSet, Pod
C) Ingress, Service, Endpoint
Correct: A. A Knative Service manages a Configuration (which creates Revisions) and a Route (which manages traffic distribution).

2. What triggers the creation of a new Revision?

A) Manually creating a Revision resource
B) Any change to the Configuration (template spec)
C) Updating the traffic split percentages
Correct: B. A new Revision is automatically created whenever the template spec in the Configuration changes (image, env vars, resources, etc.).

3. How do tagged revisions receive traffic?

A) They only receive traffic from the main URL
B) They get a unique URL (tag-name-service.namespace.example.com) for direct access
C) They cannot receive any traffic until untagged
Correct: B. Tagged revisions get a dedicated URL for direct access, even if they receive 0% of the main traffic.

Traffic Splitting: Canary Deployments

"Release to 5% of users first. If error rates stay low, gradually increase. If something breaks, roll back instantly."
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    metadata:
      name: my-api-v2
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v2.0.0
  traffic:
    - revisionName: my-api-v2
      percent: 5
    - revisionName: my-api-v1
      percent: 95
  

Canary Progression

# Start with 5%
kn service update my-api \
  --traffic my-api-v2=5 \
  --traffic my-api-v1=95

# Monitor metrics... looks good! Increase to 25%
kn service update my-api \
  --traffic my-api-v2=25 \
  --traffic my-api-v1=75

# Still good! Go to 50/50
kn service update my-api \
  --traffic my-api-v2=50 \
  --traffic my-api-v1=50

# Full rollout
kn service update my-api \
  --traffic my-api-v2=100

# Something went wrong? Instant rollback!
kn service update my-api \
  --traffic my-api-v1=100
  

Blue/Green Deployments

# Blue is live (current)
# Deploy green with a tag (0% main traffic)
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    metadata:
      name: my-api-green
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v2.0.0
  traffic:
    - revisionName: my-api-blue
      percent: 100
    - revisionName: my-api-green
      percent: 0
      tag: green
  

Test green at http://green-my-api.ns.example.com, then switch 100% when ready.
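Assuming the revision names above, the cut-over can be done with the kn CLI (a sketch; the same update also works against tag names):

```shell
# Shift all main traffic to green once testing passes
kn service update my-api \
  --traffic my-api-green=100 \
  --traffic my-api-blue=0
```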

Knative Autoscaling

Knative provides two autoscaler implementations:

KPA (Knative Pod Autoscaler)

  • Default autoscaler
  • Scales based on concurrency or requests per second
  • Supports scale-to-zero
  • Fast, responsive scaling

HPA (Horizontal Pod Autoscaler)

  • Kubernetes-native HPA
  • Scales based on CPU or memory
  • Does NOT support scale-to-zero
  • Good for CPU-bound workloads
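A revision can opt into the HPA class via annotations. A sketch (for the cpu metric, the target value is read as a CPU utilization percentage):

```yaml
spec:
  template:
    metadata:
      annotations:
        # Use the Kubernetes HPA instead of the default KPA
        autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "cpu"
        autoscaling.knative.dev/target: "80"   # ~80% CPU utilization
```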

KPA: How It Works

  Request ---> Queue-Proxy (sidecar in each pod)
                    |
               Metrics reported
                    |
              Autoscaler collects
              concurrency/RPS data
                    |
          Desired replicas =
          total_concurrency / target_concurrency
                    |
          Scale up or down
          (including to zero)
    

The queue-proxy sidecar is the key -- it measures real request concurrency per pod.

Scale-to-Zero: The Mechanics

# Key config in config-autoscaler ConfigMap
enable-scale-to-zero: "true"         # default: true
scale-to-zero-grace-period: "30s"    # default: 30s
stable-window: "60s"                 # default: 60s
  

Cold Start Considerations

"Scale-to-zero is great for cost savings. But when that first request arrives, users wait for a cold start. Let's manage that tradeoff."

Mitigation Strategies

  • Set minScale: "1" (or higher) for latency-sensitive services
  • Increase scale-down-delay so pods stay warm between bursts
  • Keep images small and application startup fast
  • Tune scale-to-zero-grace-period and stable-window for your traffic pattern

Knowledge Check

1. What metric does KPA (Knative Pod Autoscaler) primarily use for scaling?

A) CPU utilization
B) Request concurrency or requests per second
C) Memory usage
Correct: B. KPA scales based on observed concurrency or RPS, measured by the queue-proxy sidecar in each pod.

2. What is the role of the queue-proxy sidecar?

A) It queues events for Knative Eventing
B) It measures request concurrency and reports metrics to the autoscaler
C) It acts as a message queue between services
Correct: B. The queue-proxy is injected into every Knative pod and measures request concurrency, enforcing concurrency limits and reporting metrics.

3. Which autoscaler supports scale-to-zero?

A) KPA (Knative Pod Autoscaler) only
B) HPA (Horizontal Pod Autoscaler) only
C) Both KPA and HPA
Correct: A. Only KPA supports scale-to-zero. HPA (Kubernetes-native) requires a minimum of 1 replica.

Scaling Annotations

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    metadata:
      annotations:
        # Min and Max replicas
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/maxScale: "50"

        # Target concurrency per pod
        autoscaling.knative.dev/target: "100"

        # Autoscaler class (kpa.autoscaling.knative.dev or hpa.autoscaling.knative.dev)
        autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"

        # Metric type: concurrency (default) or rps
        autoscaling.knative.dev/metric: "concurrency"

        # Scale down delay
        autoscaling.knative.dev/scale-down-delay: "5m"
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1
  

Target Concurrency Explained

The target annotation controls how aggressively Knative scales:

Target          Behavior                         Use Case
target: "1"     1 request per pod at a time      Heavy processing, ML inference
target: "10"    10 concurrent requests per pod   Moderate API workloads
target: "100"   100 concurrent requests per pod  Light, fast endpoints

Formula: desired_pods = total_concurrent_requests / target

With target=10 and 50 concurrent requests: 5 pods
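The sizing rule above can be sketched in a few lines of Python. This is only the core arithmetic -- the real autoscaler also averages metrics over the stable window and applies panic-mode logic -- and the min/max clamping mirrors the minScale/maxScale annotations:

```python
import math

def desired_pods(total_concurrency, target, min_scale=0, max_scale=None):
    """Sketch of the KPA sizing rule: ceil(concurrency / target),
    clamped to the minScale/maxScale bounds."""
    pods = math.ceil(total_concurrency / target)
    pods = max(pods, min_scale)
    if max_scale is not None:
        pods = min(pods, max_scale)
    return pods

print(desired_pods(50, 10))              # 5 pods, matching the example above
print(desired_pods(55, 10))              # 6 pods -- partial load rounds up
print(desired_pods(5, 10, min_scale=2))  # 2 pods -- clamped up to minScale
```

Note that the result always rounds up: even a fraction of a pod's worth of extra load adds a whole replica.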

Burst Capacity and Initial Scale

# In config-autoscaler ConfigMap (knative-serving namespace)
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Allow initial-scale to be set to "0"
  # (new revisions may start with no pods)
  allow-zero-initial-scale: "true"

  # Initial scale when a revision is first deployed
  initial-scale: "1"

  # Max ratio of desired to current pods per evaluation
  # (caps how fast the autoscaler can scale up)
  max-scale-up-rate: "1000"

  # Panic mode: scale aggressively if traffic spikes
  panic-window-percentage: "10.0"
  panic-threshold-percentage: "200.0"
  

containerConcurrency vs. target

containerConcurrency (hard limit)

Maximum concurrent requests the container can handle. Extra requests are queued by queue-proxy.

spec:
  template:
    spec:
      containerConcurrency: 10
      

Set to 0 for unlimited (default).

target (soft target)

The autoscaler's target. Drives scaling decisions. Does NOT limit actual concurrency.

annotations:
  autoscaling.knative.dev/target: "10"
      

A common rule of thumb: set the target to roughly 70% of containerConcurrency, so the autoscaler adds pods before the hard limit starts queuing requests.
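The two settings might be combined like this (a sketch following the 70% rule of thumb):

```yaml
spec:
  template:
    metadata:
      annotations:
        # Soft goal: the autoscaler adds pods around 70 in-flight requests
        autoscaling.knative.dev/target: "70"
    spec:
      # Hard cap: queue-proxy queues requests beyond 100 per pod
      containerConcurrency: 100
```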

Custom Domains: DomainMapping

Map your own domain to a Knative Service:

apiVersion: serving.knative.dev/v1beta1
kind: DomainMapping
metadata:
  name: api.mycompany.com
  namespace: production
spec:
  ref:
    name: my-api
    kind: Service
    apiVersion: serving.knative.dev/v1
  
# Or configure default domain in config-domain ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  mycompany.com: ""   # All services get *.mycompany.com
  

TLS and HTTPS

Secure your Knative services with automatic TLS:

# Option 1: Use cert-manager for automatic certificates
# Install cert-manager, then configure Knative:
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-network
  namespace: knative-serving
data:
  auto-tls: "Enabled"
  http-protocol: "Redirected"    # Redirect HTTP -> HTTPS
  certificate-class: "cert-manager.io"
  
# Option 2: Bring your own certificate
kubectl create secret tls my-tls-cert \
  --key=tls.key --cert=tls.crt -n production

# Reference in DomainMapping
spec:
  tls:
    secretName: my-tls-cert
  

Knowledge Check

1. What is the difference between containerConcurrency and the autoscaling target annotation?

A) They are the same thing with different names
B) containerConcurrency is a hard limit enforced by queue-proxy; target is a soft scaling goal
C) containerConcurrency applies to CPU and target applies to memory
Correct: B. containerConcurrency is a hard cap -- excess requests are queued. The target annotation is used by the autoscaler to decide when to add/remove pods.

2. If you have a target of 10 and 80 concurrent requests, how many pods will the autoscaler create?

A) 8 pods
B) 10 pods
C) 80 pods
Correct: A. desired_pods = total_concurrent_requests / target = 80 / 10 = 8 pods.

3. How do you enable automatic TLS for Knative services?

A) Set autoscaling.knative.dev/tls: "true" annotation
B) Set auto-tls: "Enabled" in the config-network ConfigMap with cert-manager
C) TLS is always enabled by default
Correct: B. Configure auto-tls in config-network and use cert-manager (or another certificate provider) to handle certificate issuance.

Private / Cluster-Local Services

Not every service should be publicly accessible:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: internal-api
  labels:
    networking.knative.dev/visibility: cluster-local
spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/internal-api:v1
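A cluster-local service is reachable only via its internal DNS name, e.g. from another pod (the namespace here is an assumption):

```shell
# Works from inside the cluster; the public ingress never routes to it
curl http://internal-api.default.svc.cluster.local
```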
  

Container Configuration: Env Vars

spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1
          env:
            # Direct value
            - name: LOG_LEVEL
              value: "info"

            # From ConfigMap
            - name: API_BASE_URL
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: api-url

            # From Secret
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password

          # All keys from a ConfigMap
          envFrom:
            - configMapRef:
                name: feature-flags
  

Resources and Probes

spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 1000m
              memory: 512Mi

          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
  

Readiness probes are critical -- Knative uses them to know when to route traffic.

Volume Mounts in Knative

spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
            - name: secret-volume
              mountPath: /etc/secrets
              readOnly: true
      volumes:
        - name: config-volume
          configMap:
            name: app-config
        - name: secret-volume
          secret:
            secretName: app-secrets
  

Note: Knative does NOT support PersistentVolumeClaims -- services should be stateless. Use external storage (Azure Blob, databases) for persistent data.

Service Account and Image Pull Secrets

spec:
  template:
    spec:
      serviceAccountName: my-app-sa
      imagePullSecrets:
        - name: acr-credentials
      containers:
        - image: myregistry.azurecr.io/my-api:v1
  

For AKS with ACR integration:

# Attach ACR to AKS (no imagePullSecrets needed)
az aks update \
  --name myAKSCluster \
  --resource-group myResourceGroup \
  --attach-acr myACR
  

Knowledge Check

1. How do you make a Knative Service only accessible within the cluster?

A) Set spec.visibility: "private"
B) Add the label networking.knative.dev/visibility: cluster-local
C) Remove the Route from the Service
Correct: B. The label networking.knative.dev/visibility: cluster-local makes the service only reachable from within the cluster.

2. Can Knative Services use PersistentVolumeClaims?

A) Yes, just like regular Deployments
B) No, Knative Services should be stateless; use external storage
C) Yes, but only with ReadOnlyMany access mode
Correct: B. Knative Services are designed to be stateless. They support ConfigMaps and Secrets as volumes, but not PVCs. Use external storage services for persistent data.

3. Why are readiness probes especially important in Knative?

A) Knative uses them to determine when to route traffic to a new pod
B) They control the autoscaling target
C) They are required for scale-to-zero to work
Correct: A. Knative relies on readiness probes to know when a pod (especially one scaling from zero) is ready to receive traffic. Without them, requests may fail.

Scaling Strategy Decision Guide

Scenario               minScale   target                  Autoscaler
Dev/test workloads     0          Default (100)           KPA
Low-traffic APIs       0 or 1     50-100                  KPA
Production APIs        2+         Based on load testing   KPA
CPU-heavy processing   1+         N/A (CPU metric)        HPA
ML inference           1          1-5                     KPA
WebSocket services     1+         Low (5-10)              KPA

Monitoring Your Services

# Quick health check
kn service list
kn revision list
kn route list

# Describe a service (see conditions, traffic, URLs)
kn service describe my-api

# Watch pods scale
kubectl get pods -w -l serving.knative.dev/service=my-api

# Check autoscaler decisions
kubectl logs -n knative-serving -l app=autoscaler -f

# Key metrics to watch:
# - Revision ready latency (cold start time)
# - Request concurrency per pod
# - Response latency (p50, p95, p99)
# - Scale-from-zero duration
  

Knative Serving Best Practices

What's Coming Next

In the next module, we explore Knative Eventing:

Module 3: Knative Eventing

Key Takeaways
