Argo Rollouts — LearnwithVishnu

🚀 Argo Rollouts

Progressive delivery: canary, blue-green, metric-driven promotion, automatic rollback

🚀 What is Argo Rollouts?

Why standard Kubernetes rolling updates are not enough

A standard Kubernetes rolling update replaces pods in small batches, and each new pod starts receiving real user traffic as soon as it is Ready. If the new version has a bug, even a subtle performance regression, it affects users immediately. The traffic share simply follows the ratio of new to old pods, so you cannot hold the new version at 10% of traffic while the old version keeps 90%.

Argo Rollouts adds the missing layer: traffic management, metric analysis, and automated rollback. You deploy the new version, measure its behaviour against real metrics, and promote or roll back based on data.

	K8s Rolling Update	Argo Rollouts Canary
Traffic control	None — all traffic shifts as pods replace	Precise % control — 10%, 25%, 50%
Metric-based promotion	None	Prometheus/Datadog queries drive promotion
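Automatic rollback	None. The rollout stalls if new pods fail readiness checks, but a high error rate goes unnoticed	Aborts and returns traffic to stable when an analysis metric breaches its threshold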
Pause for approval	No	Pause at any step for human approval

Real Scenario: Payment Service

HPE payment processing: deploy v2.1.0 as a 10% canary. An AnalysisRun queries Prometheus every minute for P99 latency and error rate. If both stay healthy for 5 checks, the rollout auto-promotes to 25%. At 50% traffic it pauses for manual approval from the release manager, then promotes to 100%. If any analysis check fails at any stage, the rollout automatically rolls back to stable within 2 minutes. Nobody has to watch dashboards during the deployment.
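
The payment-service scenario could be expressed roughly as the strategy section below. This is a sketch under assumptions, not the configuration used in that deployment: the template name payment-health is hypothetical and is assumed to check P99 latency and error rate once a minute for 5 measurements.

```yaml
# Fragment of a Rollout spec (goes under spec:). Sketch only.
strategy:
  canary:
    steps:
    - setWeight: 10                     # 10% canary
    - analysis:                         # blocks until the AnalysisRun completes
        templates:
        - templateName: payment-health  # hypothetical AnalysisTemplate name
    - setWeight: 25                     # reached only if the analysis passed
    - analysis:
        templates:
        - templateName: payment-health
    - setWeight: 50
    - pause: {}                         # wait for the release manager to promote
    - setWeight: 100
```

A failed analysis step aborts the rollout and sends traffic back to the stable version. The indefinite pause is released with kubectl argo rollouts promote.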

Install + kubectl plugin + watch command

🐦 Canary Rollout

Complete Rollout spec with Istio traffic routing + steps
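
The listing for this caption is not included on the page. As a stand-in, here is a minimal sketch of what a canary Rollout with Istio traffic routing might look like. The VirtualService name (payment-api-vsvc), the route name (primary), the host, and the step weights are assumptions chosen for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
      - name: payment-api
        image: myacr.azurecr.io/payment-api:v2.1.0
        ports:
        - containerPort: 8080
  strategy:
    canary:
      canaryService: payment-api-canary   # Service selecting the new pods
      stableService: payment-api-stable   # Service selecting the old pods
      trafficRouting:
        istio:
          virtualService:
            name: payment-api-vsvc        # hypothetical VirtualService name
            routes:
            - primary                     # hypothetical HTTP route name
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - setWeight: 25
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {}                         # manual approval
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-api-vsvc
spec:
  hosts:
  - payment-api                           # hypothetical host
  http:
  - name: primary
    route:
    - destination:
        host: payment-api-stable
      weight: 100                         # the controller rewrites these weights
    - destination:
        host: payment-api-canary
      weight: 0
```

With this kind of setup the controller adjusts the two route weights at each setWeight step, so the traffic split no longer depends on the pod ratio.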

📊 AnalysisTemplate — Data-Driven Promotion

This is what separates progressive delivery from just a slow rollout

Without an AnalysisTemplate, the canary just waits a fixed time and then promotes. A bug that only shows up after the pause window, say after 10 minutes, is not caught. With an AnalysisTemplate, Prometheus (or Datadog, or a custom webhook) is queried on an interval, and the system promotes or rolls back based on real metric data, not time elapsed.

AnalysisTemplate with Prometheus success rate + latency
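
The listing for this caption is not included on the page. The sketch below shows one way such a template could combine a success-rate metric with a P99 latency metric. The template name, the Prometheus address, the metric names (http_requests_total, http_request_duration_seconds_bucket), the label names, and the thresholds are all assumptions; adjust them to whatever your services actually export.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payment-health                  # hypothetical name
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 1m                        # one measurement per minute
    count: 5                            # 5 measurements in total
    failureLimit: 0                     # the first failed measurement fails the run
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090   # assumed address
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
  - name: p99-latency
    interval: 1m
    count: 5
    failureLimit: 0
    successCondition: result[0] <= 0.5  # seconds, i.e. 500ms
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090   # assumed address
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m])) by (le))
```

Both metrics must pass for the AnalysisRun to succeed; if either one exceeds its failureLimit, the run fails.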

🚀 Deployment Strategies — Canary and Blue-Green Deep Dive

Why Argo Rollouts over standard Kubernetes rolling updates?

A standard Kubernetes rolling update sends traffic to new pods as soon as they are ready. There is no traffic splitting, no automated analysis, and no pause-and-check. Argo Rollouts adds controlled traffic percentages, automated metric analysis (roll back if the error rate rises), manual pause gates, and full visibility of the rollout progress.

Strategy	How it works	When to use
Canary	Send a small % of traffic to new version first. Monitor. Gradually increase. Auto-rollback if metrics fail.	Applications where you want to limit blast radius. Most common in production.
Blue-Green	New version (green) deployed fully alongside old (blue). Switch 100% traffic at once. Old stays for instant rollback.	Databases, stateful services, when you need instant full-cutover or instant rollback.
Rolling (native K8s)	Gradually replace old pods with new. No traffic control.	Non-critical apps, when you just need basic zero-downtime.

Complete canary rollout example

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-api
  template:          # same as Deployment pod template
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
      - name: payment-api
        image: myacr.azurecr.io/payment-api:v2.1.0
        ports:
        - containerPort: 8080
  strategy:
    canary:
      steps:
      - setWeight: 10        # Step 1: send 10% of traffic to the new version
      - pause: {duration: 5m}  # wait 5 minutes, observe metrics
      - setWeight: 30        # Step 2: increase to 30%
      - pause: {}            # pause indefinitely until manually promoted
      - setWeight: 60        # Step 3: increase to 60%
      - pause: {duration: 10m}
      - setWeight: 100       # Step 4: full rollout
      canaryService: payment-api-canary    # Service pointing to new pods
      stableService: payment-api-stable    # Service pointing to old pods
      trafficRouting:
        nginx:
          stableIngress: payment-ingress   # NGINX Ingress controls traffic split
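
The Rollout above refers to two Services and an Ingress that must already exist. A minimal sketch of those objects follows; the host name and the Service port are assumptions, while the object names match the Rollout.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-api-stable
  namespace: production
spec:
  selector:
    app: payment-api        # the controller narrows this selector to the stable ReplicaSet at runtime
  ports:
  - port: 80                # assumed Service port
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: payment-api-canary
  namespace: production
spec:
  selector:
    app: payment-api        # narrowed to the canary ReplicaSet at runtime
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-ingress
  namespace: production
spec:
  ingressClassName: nginx
  rules:
  - host: payments.example.com          # hypothetical host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: payment-api-stable    # must point at the stableService
            port:
              number: 80
```

Both Services start with the same selector. The controller is expected to add a pod-template-hash selector to each one and to manage a separate canary Ingress that carries the NGINX canary weight.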

Complete blue-green example

strategy:
  blueGreen:
    activeService: payment-api           # currently receiving traffic (blue)
    previewService: payment-api-preview  # new version deployed here (green)
    autoPromotionEnabled: false          # require manual promotion
    scaleDownDelaySeconds: 30            # keep blue running 30s after promotion
    prePromotionAnalysis:                # run analysis before switching traffic
      templates:
      - templateName: success-rate
      args:
      - name: service-name
        value: payment-api-preview

📊 AnalysisTemplate — Automated Rollback

AnalysisTemplate queries Prometheus/Datadog to decide pass or fail

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 2m           # take a measurement every 2 minutes
    count: 3               # take 3 measurements in total
    successCondition: result[0] >= 0.95   # each measurement needs a success rate of at least 95%
    failureLimit: 1        # tolerate 1 failed measurement; a 2nd failure fails the run
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{
            service="{{args.service-name}}",
            status!~"5.."
          }[5m])) /
          sum(rate(http_requests_total{
            service="{{args.service-name}}"
          }[5m]))
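
A template like this does nothing until a Rollout references it. One option, sketched below under assumptions, is background analysis, which runs alongside the canary steps instead of as a step of its own. The step weights and the startingStep value are illustrative, and payment-api-canary is assumed to be the value of the service label on canary traffic.

```yaml
# Fragment of a Rollout spec (goes under spec:). Sketch only.
strategy:
  canary:
    analysis:                       # background analysis
      templates:
      - templateName: success-rate
      startingStep: 1               # step index; skips the initial setWeight
      args:
      - name: service-name
        value: payment-api-canary   # assumed label value for canary traffic
    steps:
    - setWeight: 10
    - pause: {duration: 5m}
    - setWeight: 30
    - pause: {}
```

If the background AnalysisRun fails at any point, the rollout is aborted and traffic returns to the stable version. For long rollouts, consider omitting count in the template so that measurements continue for as long as the rollout is in progress.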

Rollout commands

# Verify that the kubectl plugin is installed (prints the plugin version)
kubectl argo rollouts version

# Watch rollout progress in terminal
kubectl argo rollouts get rollout payment-api --watch

# Promote manually (resume a paused step)
kubectl argo rollouts promote payment-api

# Abort and rollback immediately
kubectl argo rollouts abort payment-api

# Start a new rollout by updating the container image
kubectl argo rollouts set image payment-api payment-api=myimage:v2

# Skip the current step and move on to the next one
kubectl argo rollouts promote payment-api --skip-current-step

🎯 Interview Questions

ARGO ROLLOUTS · ARCHITECT

What is the difference between Canary and Blue-Green deployments? When do you use each?

Both are progressive delivery strategies that reduce deployment risk, but they work differently.

Canary: you gradually shift traffic from the old version to the new one. At a 10% canary, 90% of users get v1 and 10% get v2. If v2 has a problem, only 10% of users are affected. You can watch metrics for minutes or hours before promoting further. Best for: detecting subtle performance regressions or increased error rates that only appear at scale. Risk: the canary users are real users, and they do experience any bugs in v2.

Blue-Green: two complete environments exist simultaneously. Blue is current production; green is the new version. You switch ALL traffic from blue to green at once. If green has a problem, you switch back to blue just as quickly, with no re-deployment needed. Best for: zero-downtime deployments where you need instant rollback, database schema changes where you need to test the full stack before the switch, and releases where mixed traffic between versions is unacceptable. Risk: it costs 2x infrastructure during the transition period.

At HPE for the TeMIP platform: we used canary for API changes (gradual, metric-driven) and blue-green for database migrations (a full environment is needed to test before the switch).

ARGO ROLLOUTS · PRODUCTION

Argo Rollouts detected a regression during canary rollout. Walk through what happens.

Automated rollback scenario. We deploy payment-service v2.1.0 as a 10% canary. An AnalysisRun starts, querying Prometheus every minute. The first two checks pass: success rate 97%, P99 latency 450ms. On the third check, P99 jumps to 820ms and crosses the 500ms threshold. The AnalysisRun is marked Failed.

Argo Rollouts sees the failed analysis and immediately begins rollback: it shifts the traffic weight from 10% canary back to 0% and makes the stable version (v2.0.9) the single traffic target. The entire rollback completes in under 2 minutes. At the same time, Argo Rollouts records the reason for the failure in the Rollout status and emits a Kubernetes event. Our monitoring picks this up and sends a Slack alert: payment-service canary rollback triggered, P99 latency 820ms exceeded 500ms threshold.

Developers look at what changed in v2.1.0 that could cause the latency. They find a new ORM query that does a full table scan on the accounts table. They fix the query, bump to v2.1.1, and re-deploy the canary. This time the analysis passes and the rollout is promoted automatically to 25%, 50%, 100%. Zero user-visible production incident.

This is the exact value of metric-driven progressive delivery: the system catches regressions before they affect all users, without a human watching dashboards at 2am.

ARGO ROLLOUTS · ENGINEER

What is the difference between a canary and blue-green deployment strategy in Argo Rollouts?

Canary: gradually shifts traffic to the new version in configurable steps, for example 10%, 30%, 60%, 100%. At each step you can pause for manual approval, run automated analysis (Prometheus metrics, Datadog), or set a duration. If analysis fails at any step, Argo Rollouts automatically rolls back to the previous stable version. The old and new pods run simultaneously during the rollout. Traffic splitting is controlled by an Istio VirtualService, NGINX Ingress, or AWS ALB annotations. Best for: most web services where you want progressive exposure and automated safety.

Blue-Green: deploys the complete new version alongside the complete old version. Traffic still goes 100% to old (blue) while new (green) is running. After validation (automated or manual), traffic switches 100% from blue to green in one step. The old version stays running for instant rollback: just switch the service selector back. Best for: databases or services where you cannot have mixed versions handling traffic simultaneously, or when you need instant full cutover with instant rollback capability.

The key difference: canary = gradual traffic shift with automated metric gates. Blue-green = parallel full deployment with instant switch.

ARGO ROLLOUTS · ENGINEER

How does AnalysisTemplate work and what happens when analysis fails?

An AnalysisTemplate defines the metric queries that Argo Rollouts runs during a rollout to determine whether the new version is healthy. It connects to Prometheus, Datadog, CloudWatch, or a web endpoint. You define: the query, the success condition (e.g. error rate < 5%), the failure limit (how many failed checks are tolerated before the run is declared failed), and interval/count (how often to check and how many times).

An AnalysisRun is created each time the template is used in a rollout step. The run executes the metric checks on the interval. If the measurements satisfy successCondition, the run succeeds and the rollout proceeds. If failureLimit is exceeded, the AnalysisRun fails.

When analysis fails, Argo Rollouts automatically aborts the rollout and rolls back to the last stable version. The canary service selector is reverted to the stable pods, traffic returns to the old version, and the new pods scale down. The rollout status shows Degraded with the reason. You can inspect it with kubectl argo rollouts get rollout myapp and kubectl describe analysisrun to see which metric failed and the actual values.

This is the core value of Argo Rollouts over a manual canary: automated, data-driven rollback instead of human reaction time.

ARGO ROLLOUTS · PRODUCTION

Argo Rollouts canary is stuck at a step. How do you investigate and resolve?

Step 1: check the rollout status. kubectl argo rollouts get rollout myapp --watch shows the current step, the weight, and why it is paused. If the step is pause: {} with no duration, it is waiting for manual promotion; this is expected, not a bug.

Step 2: if it is paused at an analysis step, kubectl get analysisrun shows the current run and kubectl describe analysisrun myapp-xxx shows each metric result. If the analysis is running and not yet complete, wait for it. If it has failed, read why, then decide whether to abort or investigate the new version.

Step 3: if it is stuck somewhere other than a pause step, check the Argo Rollouts controller logs: kubectl logs -n argo-rollouts deploy/argo-rollouts. Common causes: a traffic routing issue (wrong Ingress annotation, incompatible NGINX controller version), canaryService or stableService not found, or an AnalysisTemplate query returning no data (Prometheus connectivity issue).

Step 4: to manually promote past a pause: kubectl argo rollouts promote myapp. To skip a specific step: kubectl argo rollouts promote myapp --skip-current-step. To abort the rollout immediately: kubectl argo rollouts abort myapp. After an abort, the rollout reverts to stable; you fix the issue, update the image tag, and start a new rollout.
