SSO, Monitoring, Notifications, and Production Operations
Civica Training Program
ArgoCD is installed and bootstrapped. Now let's make it enterprise-grade.
You have ArgoCD deploying applications across dev, staging, and prod using the App of Apps pattern. But your CISO asks: "Who has access? How do we audit? What happens if ArgoCD goes down? How do we get notified of failures?"
Azure AD SSO + RBAC
Monitoring, notifications
DR, multi-team, troubleshooting
Replace password auth with Azure AD single sign-on using OIDC.
Create an App Registration in Azure AD for ArgoCD.
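App name: ArgoCD-SSO
Redirect URI: https://argocd.civica.internal/auth/callback
Important: If a user belongs to more than 200 groups, Azure AD replaces the groups claim in the ID token with a group overage claim (the limit is 150 for SAML tokens). Configure a group filter or use Application Roles instead.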
Add the Azure AD OIDC configuration to ArgoCD's Helm values.
    # values-production.yaml
    configs:
      cm:
        url: https://argocd.civica.internal
        oidc.config: |
          name: Azure AD
          issuer: https://login.microsoftonline.com/TENANT_ID/v2.0
          clientID: CLIENT_ID
          clientSecret: $oidc.azure.clientSecret  # From argocd-secret
          requestedScopes:
            - openid
            - profile
            - email
          requestedIDTokenClaims:
            groups:
              essential: true
      secret:
        extra:
          oidc.azure.clientSecret: "YOUR_CLIENT_SECRET"  # Use External Secrets in prod!
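The "Use External Secrets in prod" comment could be realised roughly as follows. This is a minimal sketch, assuming the External Secrets Operator is installed and a ClusterSecretStore backed by Azure Key Vault already exists. The store name `azure-keyvault`, the resource name `argocd-oidc`, and the Key Vault secret name `argocd-oidc-client-secret` are hypothetical, and the `apiVersion` may differ depending on the operator version in use.

```yaml
# Sketch only: store name and Key Vault secret name are assumptions.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: argocd-oidc              # hypothetical name
  namespace: argocd
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-keyvault         # assumed to exist already
  target:
    name: argocd-secret
    creationPolicy: Merge        # add the key to the existing Secret, do not replace it
  data:
    - secretKey: oidc.azure.clientSecret      # key referenced by $oidc.azure.clientSecret
      remoteRef:
        key: argocd-oidc-client-secret        # assumed Key Vault secret name
```

With something like this in place, the literal client secret no longer needs to appear in the Helm values file.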
Map Azure AD security groups to ArgoCD roles and projects.
    # values-production.yaml (continued)
    configs:
      rbac:
        policy.csv: |
          # Platform team: full admin access
          g, "ad-group-id-platform-team", role:admin

          # Payments team: read + sync their own apps
          p, role:payments-team, applications, get, payments-team/*, allow
          p, role:payments-team, applications, sync, payments-team/*, allow
          p, role:payments-team, applications, action, payments-team/*, allow
          p, role:payments-team, logs, get, payments-team/*, allow
          g, "ad-group-id-payments-team", role:payments-team

          # Orders team: read + sync their own apps
          p, role:orders-team, applications, get, orders-team/*, allow
          p, role:orders-team, applications, sync, orders-team/*, allow
          g, "ad-group-id-orders-team", role:orders-team

          # Read-only for all authenticated users
          p, role:readonly, applications, get, */*, allow
          g, "ad-group-id-all-devs", role:readonly
        policy.default: ""  # No default access
        scopes: "[groups]"
Understanding the Casbin-based policy format.
    # Format: p, role, resource, action, object, effect
    # Examples:
    # Allow role to get apps in project
    p, role:dev, applications, get, myproject/*, allow
    # Allow role to sync specific app
    p, role:dev, applications, sync, myproject/myapp, allow
    # Deny delete for everyone
    p, role:dev, applications, delete, */*, deny

    # Format: g, group-or-user, role
    # Map Azure AD group to role
    g, "azure-ad-group-object-id", role:dev
    # Map specific user
    g, "[email protected]", role:admin
| Resource | Actions |
|---|---|
| applications | get, create, update, delete, sync, action |
| repositories | get, create, update, delete |
| clusters | get, create, update, delete |
| logs | get |
Azure AD SSO and RBAC.
g, "ad-group-id", role:admin maps an Azure AD group to an ArgoCD role.
policy.default: "" means users without an explicit role assignment get no access. This follows the principle of least privilege: access must be explicitly granted.
Ensuring AKS can pull images and ArgoCD can monitor image updates.
    # Attach ACR to AKS (recommended)
    az aks update \
      --resource-group myRG \
      --name myAKS \
      --attach-acr myACR

    # This grants AKS kubelet identity
    # the AcrPull role on the registry.
    # No imagePullSecrets needed!

    # Verify
    az aks check-acr \
      --resource-group myRG \
      --name myAKS \
      --acr myACR.azurecr.io
    # Install ArgoCD Image Updater
    helm install argocd-image-updater \
      argo/argocd-image-updater \
      -n argocd

    # Configure ACR access
    # Use Workload Identity or
    # create a Service Principal with AcrPull

    # Annotate your Application:
    metadata:
      annotations:
        argocd-image-updater.argoproj.io/image-list: app=myacr.azurecr.io/myapp
        argocd-image-updater.argoproj.io/app.update-strategy: semver
        argocd-image-updater.argoproj.io/write-back-method: git
ArgoCD exposes Prometheus metrics out of the box.
argocd-server, argocd-repo-server, argocd-application-controller each expose /metrics
App sync status, health, git operations, reconciliation time, API requests, and more
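The ArgoCD components expose metrics by default. With the Helm chart, enable the metrics Services and ServiceMonitors so that Prometheus (via the Prometheus Operator) discovers them.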
    # values-production.yaml
    server:
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true
    controller:
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true
    repoServer:
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true
The metrics that matter most for operational health.
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| argocd_app_info | App sync status and health | sync_status != "Synced" |
| argocd_app_reconcile_count | Number of reconciliation attempts | High error rate |
| argocd_app_sync_total | Total sync operations by phase | phase = "Error" |
| argocd_git_request_total | Git fetch operations | High failure rate |
| argocd_app_reconcile (histogram) | Time to reconcile apps | > 5 minutes |
| argocd_redis_request_total | Redis cache operations | High error rate |
| argocd_cluster_api_resource_objects | Managed resource count | Unexpected drop |
Visualising ArgoCD health and performance.
Grafana Dashboard ID: 14584 (community)
Grafana Dashboard ID: 19993 (community)
Import community dashboards or build custom ones. The ArgoCD project provides a sample dashboard JSON in their docs.
Alerts that will save you from midnight surprises.
    # prometheus-rules.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: argocd-alerts
    spec:
      groups:
        - name: argocd
          rules:
            - alert: ArgoCDAppOutOfSync
              expr: argocd_app_info{sync_status!="Synced"} == 1
              for: 15m
              labels:
                severity: warning
              annotations:
                summary: "App {{ $labels.name }} is out of sync"
            - alert: ArgoCDAppUnhealthy
              expr: argocd_app_info{health_status!="Healthy",health_status!="Progressing"} == 1
              for: 10m
              labels:
                severity: critical
              annotations:
                summary: "App {{ $labels.name }} is {{ $labels.health_status }}"
            - alert: ArgoCDSyncFailed
              expr: increase(argocd_app_sync_total{phase="Error"}[1h]) > 3
              labels:
                severity: critical
ArgoCD has a built-in notification system for Slack, Teams, email, and webhooks.
Get notified in Slack when deployments succeed, fail, or drift.
    # values-production.yaml
    notifications:
      enabled: true
      secret:
        items:
          slack-token: xoxb-your-slack-bot-token
      notifiers:
        service.slack: |
          token: $slack-token
      templates:
        template.app-sync-succeeded: |
          slack:
            attachments: |
              [{
                "color": "#18be52",
                "title": "{{ .app.metadata.name }} synced successfully",
                "text": "Application {{ .app.metadata.name }} is now running revision {{ .app.status.sync.revision }}.",
                "fields": [{
                  "title": "Project",
                  "value": "{{ .app.spec.project }}",
                  "short": true
                }]
              }]
        # Note: a template.app-sync-failed entry must also be defined,
        # because the on-sync-failed trigger below sends it.
      triggers:
        trigger.on-sync-succeeded: |
          - when: app.status.operationState.phase in ['Succeeded']
            send: [app-sync-succeeded]
        trigger.on-sync-failed: |
          - when: app.status.operationState.phase in ['Error', 'Failed']
            send: [app-sync-failed]
For organisations using Teams as their primary communication platform.
    # Configure Teams webhook
    notifications:
      notifiers:
        service.teams: |
          recipientUrls:
            deployments-channel: https://outlook.office.com/webhook/xxx
      templates:
        template.app-deployed: |
          teams:
            title: "Deployment: {{ .app.metadata.name }}"
            text: |
              Application **{{ .app.metadata.name }}** has been synced.
              Revision: {{ .app.status.sync.revision | trunc 7 }}
              Status: {{ .app.status.health.status }}
      subscriptions:
        - recipients:
            - teams:deployments-channel
          triggers:
            - on-sync-succeeded
            - on-sync-failed
            - on-health-degraded
Per-app notifications: You can also annotate individual Applications to subscribe to specific notification channels, so each team gets only their relevant alerts.
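A per-app subscription might look like the sketch below. The annotation key follows the pattern `notifications.argoproj.io/subscribe.<trigger>.<service>`, with the recipient as the value. The application name `payments-api` and the Slack channel `payments-alerts` are hypothetical, and the triggers named here must exist in the notifications configuration.

```yaml
# Sketch only: app name and Slack channel are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
  annotations:
    # Send sync failures for this app to a team-specific Slack channel
    notifications.argoproj.io/subscribe.on-sync-failed.slack: payments-alerts
    # Send successful syncs to the shared Teams channel
    notifications.argoproj.io/subscribe.on-sync-succeeded.teams: deployments-channel
# spec: unchanged, omitted here
```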
Monitoring and notifications.
What happens if ArgoCD itself goes down? Planning for the worst.
If ArgoCD stops working, your applications keep running. ArgoCD doesn't run your apps — it only manages their deployment. Existing workloads are unaffected.
Practical commands for backing up and restoring ArgoCD.
    # Export all ArgoCD Applications
    argocd admin export -n argocd > backup.yaml

    # Or backup specific resources
    kubectl get applications -n argocd -o yaml \
      > apps-backup.yaml
    kubectl get appprojects -n argocd -o yaml \
      > projects-backup.yaml

    # Backup Sealed Secrets key (critical!)
    kubectl get secret -n kube-system \
      -l sealedsecrets.bitnami.com/sealed-secrets-key \
      -o yaml > sealed-secrets-key-backup.yaml

    # Store backups in Azure Blob Storage
    # (storage account and credentials must be supplied, e.g. via --account-name)
    az storage blob upload \
      --container-name backup \
      --name backup.yaml \
      --file backup.yaml
    # 1. Reinstall ArgoCD (same Helm values)
    helm install argocd argo/argo-cd \
      -n argocd -f values-production.yaml

    # 2. Restore Sealed Secrets key
    kubectl apply -f sealed-secrets-key-backup.yaml

    # 3. Re-connect Git repos
    argocd repo add [email protected]:civica/gitops-config \
      --ssh-private-key-path ./argocd-key

    # 4. Apply the root App of Apps
    kubectl apply -f argocd/apps.yaml

    # ArgoCD will re-discover and sync
    # everything from Git automatically!
The ultimate GitOps pattern: ArgoCD managing its own configuration.
Create an ArgoCD Application that points to ArgoCD's own Helm values in Git. When you update the values (e.g., add a new RBAC rule), ArgoCD upgrades itself.
    # argocd/applications/argocd-self.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: argocd
      namespace: argocd
    spec:
      project: default
      # Multi-source application: both the chart and the values repo go
      # under `sources` (a separate `source` field is ignored when `sources` is set).
      sources:
        - repoURL: https://argoproj.github.io/argo-helm
          chart: argo-cd
          targetRevision: 6.0.0
          helm:
            valueFiles:
              - $values/argocd/values-production.yaml
        - repoURL: "[email protected]:civica/gitops-config"
          targetRevision: main
          ref: values
      destination:
        server: https://kubernetes.default.svc
        namespace: argocd
      syncPolicy:
        automated:
          selfHeal: true
Ensuring teams can work independently without stepping on each other.
    apiVersion: argoproj.io/v1alpha1
    kind: AppProject
    metadata:
      name: payments-team
      namespace: argocd
    spec:
      sourceRepos:
        - "[email protected]:civica/gitops-config"
      destinations:
        - server: "*"
          namespace: "payments-dev"
        - server: "*"
          namespace: "payments-staging"
        - server: "*"
          namespace: "payments-prod"
      clusterResourceWhitelist: []
      roles:
        - name: team-lead
          policies:
            - p, proj:payments-team:team-lead, applications, *, payments-team/*, allow
          groups:
            - "ad-group-payments-leads"
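To show how an Application lands inside this project boundary, here is a minimal sketch. The application name, the overlay path, and the `GITOPS_REPO_URL` placeholder are assumptions. The repository URL must exactly match an entry in the project's `sourceRepos`, and the namespace must be one of its allowed destinations, otherwise ArgoCD rejects the Application.

```yaml
# Sketch only: name, path, and repo placeholder are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-dev
  namespace: argocd
spec:
  project: payments-team                # ties the app to the AppProject and its RBAC scope
  source:
    repoURL: GITOPS_REPO_URL            # placeholder: must match a sourceRepos entry
    targetRevision: main
    path: apps/payments-api/overlays/dev   # assumed repo layout
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-dev             # must be an allowed destination of the project
```

Because RBAC objects take the form `<project>/<application>`, this app is addressed as `payments-team/payments-api-dev` in policy rules.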
How teams interact with the shared GitOps platform.
| Action | Who | How |
|---|---|---|
| Add new microservice | Dev team | PR to add base + overlays in apps/ directory |
| Deploy new version | CI pipeline | Updates image tag in dev overlay, auto-syncs |
| Promote to staging | Dev team lead | PR to update staging overlay image tag |
| Promote to prod | Platform team | PR approval + manual sync in ArgoCD |
| Add infrastructure | Platform team | PR to infra/ directory, reviewed by SRE |
| Change RBAC | Platform team | PR to cluster/rbac/ and ArgoCD RBAC policy |
| Troubleshoot app | Dev team | ArgoCD UI (scoped to their project only) |
| Rollback | Dev team lead | git revert on the offending commit |
Disaster recovery and multi-team operations.
The problems you'll actually encounter and how to fix them.
    # Check what's different
    argocd app diff my-app

    # Common causes:
    # - Mutating webhooks adding fields
    # - Default values injected by K8s
    # - Resource fields ignored by ArgoCD

    # Fix: Add to ignoreDifferences
    spec:
      ignoreDifferences:
        - group: apps
          kind: Deployment
          jsonPointers:
            - /spec/replicas  # If HPA manages
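One caveat worth sketching: `ignoreDifferences` affects how the diff and sync status are computed, but by default a sync still applies the value from Git. If an HPA owns the replica count, the `RespectIgnoreDifferences=true` sync option asks ArgoCD to leave the ignored fields alone during sync as well. The fragment below is an illustrative excerpt of an Application spec, not a complete manifest.

```yaml
# Excerpt of an Application spec (sketch).
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas          # managed by the HPA
  syncPolicy:
    syncOptions:
      - RespectIgnoreDifferences=true   # do not overwrite ignored fields on sync
```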
    # Check sync operation details
    argocd app get my-app

    # Check events
    kubectl get events -n my-namespace \
      --sort-by=.metadata.creationTimestamp

    # Common causes:
    # - RBAC: ArgoCD SA lacks permissions
    # - Resource quota exceeded
    # - Invalid manifest (schema error)
    # - Namespace doesn't exist
    # - Image pull failure (ACR auth)
Diving deeper into common operational issues.
    # Repo clone failures
    kubectl logs -n argocd \
      deploy/argocd-repo-server

    # Common causes:
    # - SSH host key changed
    # - PAT expired
    # - Network policy blocking egress
    # - Repo too large (increase resources)

    # Fix SSH host key (add-ssh reads known_hosts entries from stdin):
    ssh-keyscan github.com | argocd cert add-ssh --batch
    # Controller reconciliation slow
    # Check queue depth via the controller's workqueue_depth metric
    # (served on its /metrics endpoint).

    # Solutions:
    # - Increase controller resources
    # - Increase the reconciliation interval (timeout.reconciliation)
    # - Use resource tracking (annotation)
    # - Split into multiple ArgoCD instances
    # - Exclude non-essential resource types
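Several of these solutions can be expressed in the Helm values. The following is a sketch with illustrative numbers, not tuned recommendations: the interval, resource requests, and excluded resource type are all assumptions to adapt. Note that a longer `timeout.reconciliation` lowers controller load, since apps are refreshed from Git less often.

```yaml
# values-production.yaml (sketch; all values are illustrative assumptions)
configs:
  cm:
    timeout.reconciliation: 300s                      # refresh apps less frequently
    application.resourceTrackingMethod: annotation    # annotation-based resource tracking
    resource.exclusions: |
      # Example: stop watching a high-churn resource type
      - apiGroups:
          - cilium.io
        kinds:
          - CiliumIdentity
        clusters:
          - "*"
controller:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
```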
Pro tip: Enable ArgoCD server debug logging temporarily with --loglevel debug to diagnose complex issues. Remember to revert after troubleshooting.
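If ArgoCD is managed through Helm, the same setting can be made declaratively. This minimal sketch assumes the chart's `configs.params` map, which populates the `argocd-cmd-params-cm` ConfigMap. The argocd-server pods need to restart to pick up the change.

```yaml
# values-production.yaml (sketch; temporary troubleshooting setting)
configs:
  params:
    server.log.level: debug   # revert to info after troubleshooting
```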
Lessons learned from running GitOps at scale.
Common mistakes to avoid on your GitOps journey.
| Anti-Pattern | Why It's Bad | What to Do Instead |
|---|---|---|
| Plain secrets in Git | Anyone with repo access can read them | Sealed Secrets or External Secrets |
| Using :latest image tag | No audit trail, unpredictable rollouts | Pin to specific tags (semver or SHA) |
| Manual kubectl in production | Drift, no audit trail, breaks GitOps loop | All changes through Git PRs only |
| Branch-per-environment | Merge conflicts, diverging configs | Directory-per-environment (overlays) |
| Auto-sync everything in prod | Risky — no human gate for critical changes | Manual sync + PR approval for prod |
| One giant monolithic app | Blast radius too large, slow syncs | App of Apps with granular child apps |
| Ignoring drift alerts | Erodes trust in GitOps as source of truth | Investigate and resolve every drift event |
| No RBAC on ArgoCD | Everyone can sync/delete any app | Projects + Azure AD RBAC from day one |
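Two of the recommendations above, pinned image tags and directory-per-environment overlays, can be combined in a Kustomize overlay. This is a minimal sketch: the file path, the base location, and the tag `1.4.2` are hypothetical.

```yaml
# apps/payments-api/overlays/prod/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                      # shared manifests for all environments
images:
  - name: myacr.azurecr.io/myapp
    newTag: 1.4.2                   # pinned tag: promotion is a PR that changes this line
```

Each environment directory carries its own pinned tag, so a promotion or rollback is an ordinary, reviewable Git change.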
Everything we've built across all four presentations.
Git is the source of truth. ArgoCD is the engine. Azure is the platform. Your team is in control.
Everything we've covered across all four presentations.
Practical next steps after completing this module.
Start small. One service, one environment. Build confidence, then expand. GitOps is a journey, not a big bang.
Key takeaways from Azure Integration and Operations.
Azure AD OIDC for SSO. Map AD groups to ArgoCD roles. Disable admin account.
Built-in Prometheus metrics. ServiceMonitors for auto-discovery. Key alerts for sync failures.
Built-in system with Slack, Teams, email. Triggers, templates, services pattern.
Git is the backup. Back up Sealed Secrets keys. Test recovery regularly.
ArgoCD Projects for isolation. RBAC per team. Namespace-scoped access.
Pin images, manual sync for prod, no secrets in Git, investigate every drift.
Azure Integration & Operations
Congratulations on completing the GitOps Module!