Module 2 of 4

Knative Serving Deep Dive

Autoscaling, Traffic Splitting, and Production-Ready Services

From simple deployments to sophisticated traffic management -- mastering Knative Serving.

Knative Service Anatomy

A Knative Service manages three child resources automatically:

        Knative Service
        /      |       \
       /       |        \
Configuration  |      Route
      |        |        |
   Revision    |   Traffic Rules
   (v1, v2..)  |   (% splits, tags)
               |
         Kubernetes
        (Pods, Services)
    

Configuration

The Configuration defines what your service looks like:
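You rarely create a Configuration by hand -- the Service generates and owns it -- but its shape mirrors the Service's template. A minimal sketch (the image name is illustrative):

```yaml
apiVersion: serving.knative.dev/v1
kind: Configuration
metadata:
  name: my-api
spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1   # illustrative image
```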

Each change to the Configuration creates a new Revision.

Revision: Immutable Snapshots

"A Revision is like a Git commit for your running service. Once created, it never changes. You can always go back."

Route: Traffic Management

The Route controls how traffic reaches your Revisions:

# Route URL pattern
http://service-name.namespace.example.com         # main
http://tag-name-service-name.namespace.example.com # tagged
  

Creating Services: Full YAML

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
  namespace: production
  labels:
    app: my-api
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "10"
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1.2.0
          ports:
            - containerPort: 8080
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: url
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
  

Knowledge Check

1. What are the three child resources managed by a Knative Service?

A) Configuration, Revision, Route
B) Deployment, ReplicaSet, Pod
C) Ingress, Service, Endpoint
Correct: A. A Knative Service manages a Configuration (which creates Revisions) and a Route (which manages traffic distribution).

2. What triggers the creation of a new Revision?

A) Manually creating a Revision resource
B) Any change to the Configuration (template spec)
C) Updating the traffic split percentages
Correct: B. A new Revision is automatically created whenever the template spec in the Configuration changes (image, env vars, resources, etc.).

3. How do tagged revisions receive traffic?

A) They only receive traffic from the main URL
B) They get a unique URL (tag-name-service.namespace.example.com) for direct access
C) They cannot receive any traffic until untagged
Correct: B. Tagged revisions get a dedicated URL for direct access, even if they receive 0% of the main traffic.

Traffic Splitting: Canary Deployments

"Release to 5% of users first. If error rates stay low, gradually increase. If something breaks, roll back instantly."
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    metadata:
      name: my-api-v2
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v2.0.0
  traffic:
    - revisionName: my-api-v2
      percent: 5
    - revisionName: my-api-v1
      percent: 95
  

Canary Progression

# Start with 5%
kn service update my-api \
  --traffic my-api-v2=5 \
  --traffic my-api-v1=95

# Monitor metrics... looks good! Increase to 25%
kn service update my-api \
  --traffic my-api-v2=25 \
  --traffic my-api-v1=75

# Still good! Go to 50/50
kn service update my-api \
  --traffic my-api-v2=50 \
  --traffic my-api-v1=50

# Full rollout
kn service update my-api \
  --traffic my-api-v2=100

# Something went wrong? Instant rollback!
kn service update my-api \
  --traffic my-api-v1=100
  

Blue/Green Deployments

# Blue is live (current)
# Deploy green with a tag (0% main traffic)
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    metadata:
      name: my-api-green
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v2.0.0
  traffic:
    - revisionName: my-api-blue
      percent: 100
    - revisionName: my-api-green
      percent: 0
      tag: green
  

Test green at http://green-my-api.ns.example.com, then switch 100% when ready.
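Assuming the revision names above, the cut-over can be done with the kn CLI (a sketch; the same update also works against tag names):

```shell
# Shift all main traffic to green once testing passes
kn service update my-api \
  --traffic my-api-green=100 \
  --traffic my-api-blue=0
```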

Knative Autoscaling

Knative provides two autoscaler implementations:

KPA (Knative Pod Autoscaler)

  • Default autoscaler
  • Scales based on concurrency or requests per second
  • Supports scale-to-zero
  • Fast, responsive scaling

HPA (Horizontal Pod Autoscaler)

  • Kubernetes-native HPA
  • Scales based on CPU or memory
  • Does NOT support scale-to-zero
  • Good for CPU-bound workloads
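A revision can opt into the HPA class via annotations. A sketch (for the cpu metric, the target value is read as a CPU utilization percentage):

```yaml
spec:
  template:
    metadata:
      annotations:
        # Use the Kubernetes HPA instead of the default KPA
        autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "cpu"
        autoscaling.knative.dev/target: "80"   # ~80% CPU utilization
```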

KPA: How It Works

  Request ---> Queue-Proxy (sidecar in each pod)
                    |
               Metrics reported
                    |
              Autoscaler collects
              concurrency/RPS data
                    |
          Desired replicas =
          total_concurrency / target_concurrency
                    |
          Scale up or down
          (including to zero)
    

The queue-proxy sidecar is the key -- it measures real request concurrency per pod.

Scale-to-Zero: The Mechanics

# Key config in config-autoscaler ConfigMap
enable-scale-to-zero: "true"         # default: true
scale-to-zero-grace-period: "30s"    # default: 30s
stable-window: "60s"                 # default: 60s
  

Cold Start Considerations

"Scale-to-zero is great for cost savings. But when that first request arrives, users wait for a cold start. Let's manage that tradeoff."

Mitigation Strategies

  • Set minScale: "1" (or higher) for latency-sensitive services
  • Increase scale-down-delay so pods stay warm between bursts
  • Keep images small and application startup fast
  • Tune scale-to-zero-grace-period and stable-window for your traffic pattern

Knowledge Check

1. What metric does KPA (Knative Pod Autoscaler) primarily use for scaling?

A) CPU utilization
B) Request concurrency or requests per second
C) Memory usage
Correct: B. KPA scales based on observed concurrency or RPS, measured by the queue-proxy sidecar in each pod.

2. What is the role of the queue-proxy sidecar?

A) It queues events for Knative Eventing
B) It measures request concurrency and reports metrics to the autoscaler
C) It acts as a message queue between services
Correct: B. The queue-proxy is injected into every Knative pod and measures request concurrency, enforcing concurrency limits and reporting metrics.

3. Which autoscaler supports scale-to-zero?

A) KPA (Knative Pod Autoscaler) only
B) HPA (Horizontal Pod Autoscaler) only
C) Both KPA and HPA
Correct: A. Only KPA supports scale-to-zero. HPA (Kubernetes-native) requires a minimum of 1 replica.

Scaling Annotations

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-api
spec:
  template:
    metadata:
      annotations:
        # Min and Max replicas
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/maxScale: "50"

        # Target concurrency per pod
        autoscaling.knative.dev/target: "100"

        # Autoscaler class (kpa.autoscaling.knative.dev or hpa.autoscaling.knative.dev)
        autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"

        # Metric type: concurrency (default) or rps
        autoscaling.knative.dev/metric: "concurrency"

        # Scale down delay
        autoscaling.knative.dev/scale-down-delay: "5m"
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1
  

Target Concurrency Explained

The target annotation controls how aggressively Knative scales:

Target          Behavior                         Use Case
target: "1"     1 request per pod at a time      Heavy processing, ML inference
target: "10"    10 concurrent requests per pod   Moderate API workloads
target: "100"   100 concurrent requests per pod  Light, fast endpoints

Formula: desired_pods = total_concurrent_requests / target

With target=10 and 50 concurrent requests: 5 pods
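The sizing rule above can be sketched in a few lines of Python. This is only the core arithmetic -- the real autoscaler also averages metrics over the stable window and applies panic-mode logic -- and the min/max clamping mirrors the minScale/maxScale annotations:

```python
import math

def desired_pods(total_concurrency, target, min_scale=0, max_scale=None):
    """Sketch of the KPA sizing rule: ceil(concurrency / target),
    clamped to the minScale/maxScale bounds."""
    pods = math.ceil(total_concurrency / target)
    pods = max(pods, min_scale)
    if max_scale is not None:
        pods = min(pods, max_scale)
    return pods

print(desired_pods(50, 10))              # 5 pods, matching the example above
print(desired_pods(55, 10))              # 6 pods -- partial load rounds up
print(desired_pods(5, 10, min_scale=2))  # 2 pods -- clamped up to minScale
```

Note that the result always rounds up: even a fraction of a pod's worth of extra load adds a whole replica.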

Burst Capacity and Initial Scale

# In config-autoscaler ConfigMap (knative-serving namespace)
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Allow initial-scale to be set to "0"
  # (new revisions may start with no pods)
  allow-zero-initial-scale: "true"

  # Initial scale when a revision is first deployed
  initial-scale: "1"

  # Max ratio of desired to current pods per evaluation
  # (caps how fast the autoscaler can scale up)
  max-scale-up-rate: "1000"

  # Panic mode: scale aggressively if traffic spikes
  panic-window-percentage: "10.0"
  panic-threshold-percentage: "200.0"
  

containerConcurrency vs. target

containerConcurrency (hard limit)

Maximum concurrent requests the container can handle. Extra requests are queued by queue-proxy.

spec:
  template:
    spec:
      containerConcurrency: 10
      

Set to 0 for unlimited (default).

target (soft target)

The autoscaler's target. Drives scaling decisions. Does NOT limit actual concurrency.

annotations:
  autoscaling.knative.dev/target: "10"
      

A common rule of thumb: set the target to roughly 70% of containerConcurrency, so the autoscaler adds pods before the hard limit starts queuing requests.
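The two settings might be combined like this (a sketch following the 70% rule of thumb):

```yaml
spec:
  template:
    metadata:
      annotations:
        # Soft goal: the autoscaler adds pods around 70 in-flight requests
        autoscaling.knative.dev/target: "70"
    spec:
      # Hard cap: queue-proxy queues requests beyond 100 per pod
      containerConcurrency: 100
```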

Custom Domains: DomainMapping

Map your own domain to a Knative Service:

apiVersion: serving.knative.dev/v1beta1
kind: DomainMapping
metadata:
  name: api.mycompany.com
  namespace: production
spec:
  ref:
    name: my-api
    kind: Service
    apiVersion: serving.knative.dev/v1
  
# Or configure default domain in config-domain ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  mycompany.com: ""   # All services get *.mycompany.com
  

TLS and HTTPS

Secure your Knative services with automatic TLS:

# Option 1: Use cert-manager for automatic certificates
# Install cert-manager, then configure Knative:
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-network
  namespace: knative-serving
data:
  auto-tls: "Enabled"
  http-protocol: "Redirected"    # Redirect HTTP -> HTTPS
  certificate-class: "cert-manager.io"
  
# Option 2: Bring your own certificate
kubectl create secret tls my-tls-cert \
  --key=tls.key --cert=tls.crt -n production

# Reference in DomainMapping
spec:
  tls:
    secretName: my-tls-cert
  

Knowledge Check

1. What is the difference between containerConcurrency and the autoscaling target annotation?

A) They are the same thing with different names
B) containerConcurrency is a hard limit enforced by queue-proxy; target is a soft scaling goal
C) containerConcurrency applies to CPU and target applies to memory
Correct: B. containerConcurrency is a hard cap -- excess requests are queued. The target annotation is used by the autoscaler to decide when to add/remove pods.

2. If you have a target of 10 and 80 concurrent requests, how many pods will the autoscaler create?

A) 8 pods
B) 10 pods
C) 80 pods
Correct: A. desired_pods = total_concurrent_requests / target = 80 / 10 = 8 pods.

3. How do you enable automatic TLS for Knative services?

A) Set autoscaling.knative.dev/tls: "true" annotation
B) Set auto-tls: "Enabled" in the config-network ConfigMap with cert-manager
C) TLS is always enabled by default
Correct: B. Configure auto-tls in config-network and use cert-manager (or another certificate provider) to handle certificate issuance.

Private / Cluster-Local Services

Not every service should be publicly accessible:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: internal-api
  labels:
    networking.knative.dev/visibility: cluster-local
spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/internal-api:v1
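A cluster-local service is reachable only via its internal DNS name, e.g. from another pod (the namespace here is an assumption):

```shell
# Works from inside the cluster; the public ingress never routes to it
curl http://internal-api.default.svc.cluster.local
```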
  

Container Configuration: Env Vars

spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1
          env:
            # Direct value
            - name: LOG_LEVEL
              value: "info"

            # From ConfigMap
            - name: API_BASE_URL
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: api-url

            # From Secret
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password

          # All keys from a ConfigMap
          envFrom:
            - configMapRef:
                name: feature-flags
  

Resources and Probes

spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 1000m
              memory: 512Mi

          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
  

Readiness probes are critical -- Knative uses them to know when to route traffic.

Volume Mounts in Knative

spec:
  template:
    spec:
      containers:
        - image: myregistry.azurecr.io/my-api:v1
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
            - name: secret-volume
              mountPath: /etc/secrets
              readOnly: true
      volumes:
        - name: config-volume
          configMap:
            name: app-config
        - name: secret-volume
          secret:
            secretName: app-secrets
  

Note: Knative does NOT support PersistentVolumeClaims -- services should be stateless. Use external storage (Azure Blob, databases) for persistent data.

Service Account and Image Pull Secrets

spec:
  template:
    spec:
      serviceAccountName: my-app-sa
      imagePullSecrets:
        - name: acr-credentials
      containers:
        - image: myregistry.azurecr.io/my-api:v1
  

For AKS with ACR integration:

# Attach ACR to AKS (no imagePullSecrets needed)
az aks update \
  --name myAKSCluster \
  --resource-group myResourceGroup \
  --attach-acr myACR
  

Knowledge Check

1. How do you make a Knative Service only accessible within the cluster?

A) Set spec.visibility: "private"
B) Add the label networking.knative.dev/visibility: cluster-local
C) Remove the Route from the Service
Correct: B. The label networking.knative.dev/visibility: cluster-local makes the service only reachable from within the cluster.

2. Can Knative Services use PersistentVolumeClaims?

A) Yes, just like regular Deployments
B) No, Knative Services should be stateless; use external storage
C) Yes, but only with ReadOnlyMany access mode
Correct: B. Knative Services are designed to be stateless. They support ConfigMaps and Secrets as volumes, but not PVCs. Use external storage services for persistent data.

3. Why are readiness probes especially important in Knative?

A) Knative uses them to determine when to route traffic to a new pod
B) They control the autoscaling target
C) They are required for scale-to-zero to work
Correct: A. Knative relies on readiness probes to know when a pod (especially one scaling from zero) is ready to receive traffic. Without them, requests may fail.

Scaling Strategy Decision Guide

Scenario               minScale   target                  Autoscaler
Dev/test workloads     0          Default (100)           KPA
Low-traffic APIs       0 or 1     50-100                  KPA
Production APIs        2+         Based on load testing   KPA
CPU-heavy processing   1+         N/A (CPU metric)        HPA
ML inference           1          1-5                     KPA
WebSocket services     1+         Low (5-10)              KPA

Monitoring Your Services

# Quick health check
kn service list
kn revision list
kn route list

# Describe a service (see conditions, traffic, URLs)
kn service describe my-api

# Watch pods scale
kubectl get pods -w -l serving.knative.dev/service=my-api

# Check autoscaler decisions
kubectl logs -n knative-serving -l app=autoscaler -f

# Key metrics to watch:
# - Revision ready latency (cold start time)
# - Request concurrency per pod
# - Response latency (p50, p95, p99)
# - Scale-from-zero duration
  

Knative Serving Best Practices

What's Coming Next

In the next module, we explore Knative Eventing:

Module 3: Knative Eventing

Key Takeaways
