Interview Prep
Azure / AKS: AGIC, Security, RBAC, Zero Trust, CI/CD, Monitor (35 Q&As)
Azure/AKS · ENGINEER
What is the Azure resource hierarchy and why does it matter?
Four levels from top to bottom: Management Groups (org-level governance), Subscriptions (billing + access boundary), Resource Groups (logical container), Resources (actual VMs, VNets, databases). Why it matters: policies and RBAC assigned at any level cascade downward. Assign a policy at Management Group level = applies to ALL subscriptions in the org. Assign a policy at Subscription level = all resource groups in that subscription. This is how enterprises enforce standards across hundreds of teams — tag requirements, region restrictions, VM SKU limits. Resource Groups are the operational unit: all resources in a resource group share a lifecycle. Delete the resource group = delete everything in it. In production: one resource group per application per environment (rg-telecom-sro-prod, rg-telecom-sro-staging).
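The downward cascade can be illustrated with a small model. This is only a sketch of the inheritance idea, not how Azure evaluates policy internally; the scope names other than rg-telecom-sro-prod and the policy names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    """One level of the hierarchy: management group, subscription or resource group."""
    name: str
    parent: Optional["Scope"] = None


def ancestors(scope):
    """Yield the scope itself, then each parent up to the root."""
    while scope is not None:
        yield scope
        scope = scope.parent


def effective_policies(scope, assignments):
    """Collect policies assigned at this scope or at any scope above it."""
    result = []
    for level in ancestors(scope):
        result.extend(assignments.get(level.name, []))
    return result


# Hypothetical hierarchy: management group -> subscription -> resource group.
mg = Scope("mg-org")
sub = Scope("sub-telecom", parent=mg)
rg = Scope("rg-telecom-sro-prod", parent=sub)

# Hypothetical policy assignments keyed by scope name.
assignments = {
    "mg-org": ["require-tags"],
    "sub-telecom": ["allowed-regions"],
}

print(effective_policies(rg, assignments))
```

The resource group has no assignment of its own, yet it picks up both the subscription-level and the management-group-level policy.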
Azure/AKS · ARCHITECT
What is Azure Managed Identity and when do you use it over service principals?
Managed Identity is Azure's answer to service principals with auto-managed credentials. Two types: System-assigned is created for and tied to one resource, deleted when the resource is deleted. User-assigned is a standalone identity that can be assigned to multiple resources. Use Managed Identity over service principals whenever possible: no credential rotation needed (Azure manages the underlying certificates), no secret to store or leak, permissions are tied to the resource lifecycle. Service principals still needed when: authenticating from outside Azure (on-premises server, GitHub Actions for non-OIDC auth), cross-tenant access, legacy applications. At HPE on Azure: we replaced all service principal secrets in Azure DevOps pipelines with workload identity federation. Zero credentials stored, automatic rotation, complete audit trail.
Azure/AKS · PRODUCTION
AKS pods cannot reach Azure Key Vault. Walk through troubleshooting.
Layered investigation. First: check that Workload Identity is configured correctly. Run kubectl describe serviceaccount <sa-name> -n <namespace> and look for the azure.workload.identity/client-id annotation, then use kubectl describe pod to confirm the pod runs under that service account and carries the azure.workload.identity/use: "true" label. Second: check that the managed identity has Key Vault access. az role assignment list --assignee <identity-client-id> --all should show Key Vault Secrets User (Key Vault Reader only exposes vault metadata, not secret values). Third: check which permission model the vault uses, access policies or RBAC. az keyvault show --name my-vault reports this in enableRbacAuthorization. Fourth: check the network. Is there a Private Endpoint on the vault? If so, can the AKS subnet reach the private endpoint subnet, and does the vault hostname resolve to the private IP from inside a pod? Fifth: check the Key Vault firewall. Does it have network rules that block the AKS nodes' outbound IPs? The trusted Microsoft services bypass does not cover AKS workloads, so allow the AKS subnet with a virtual network rule or use the private endpoint. Sixth: test manually from inside the pod, using the workload identity token to call the Key Vault API directly. Common issue at HPE: the grant did not match the vault's permission model. The identity had an RBAC role assignment, but the vault was still evaluating access policies.
Azure/AKS · ENGINEER
Explain Azure NSGs vs Azure Firewall vs Application Gateway.
Three different layers of network security. NSG (Network Security Group): stateful layer 4 firewall applied to a subnet or NIC. Allow/deny rules based on IP, port, protocol. No deep packet inspection. Free. Use for: basic inbound/outbound rules between subnets and internet. Azure Firewall: managed stateful firewall at the VNet level. Supports FQDN filtering (allow outbound to *.ubuntu.com but not other internet). Threat intelligence feed. Centralised policy management. Cost: significant. Use for: centralised outbound internet filtering in hub-and-spoke architecture. Application Gateway (with WAF): layer 7 load balancer with Web Application Firewall. SSL termination, URL-based routing, cookie-based session persistence, OWASP rule sets. Use for: web applications needing WAF, SSL offload, HTTP routing. Architecture: NSGs on every subnet (always), Application Gateway/WAF for inbound web traffic, Azure Firewall for centralised outbound control in large enterprises.
Azure/AKS · ARCHITECT
How does hub-and-spoke network topology work in Azure?
Hub-and-spoke is the standard enterprise Azure network architecture. Hub VNet: contains shared services such as Azure Firewall for centralised internet filtering, a VPN/ExpressRoute gateway for on-premises connectivity, and Azure Bastion for jump server access. Spoke VNets: one per application or environment, peered to the hub. VNet peering is not transitive, so spokes cannot reach each other directly; spoke-to-spoke traffic is sent through the hub firewall using user-defined routes (UDRs) on the spoke subnets, which gives visibility and control. Benefits: centralised security policy, one gateway for all on-premises connectivity, each spoke team manages their own VNet independently. Implementation: VNet peering from each spoke to the hub with UseRemoteGateways enabled so spokes use the hub's VPN gateway. The hub side of the peering has AllowGatewayTransit enabled. Azure Virtual WAN is the managed version of hub-and-spoke: it reduces operational overhead but allows less customisation. At HPE: hub-spoke with 8 spoke VNets per region. Each product team owns their spoke, central networking team owns the hub.
Azure/AKS · ENGINEER
What is the primary advantage of AKS over self-managed Kubernetes on Azure VMs?
AKS is a managed Kubernetes service — Azure manages the control plane entirely. This means Azure handles API server availability, etcd backups, control plane upgrades, and HA for the master nodes at no additional charge. You only pay for the worker nodes. With self-managed Kubernetes on VMs, you are responsible for every aspect: provisioning the control plane VMs, configuring etcd clustering, setting up HA for the API server, managing TLS certificates, and running upgrades manually. The trade-off is control vs operational effort. Self-managed gives you full configuration flexibility — specific Kubernetes versions, custom admission controllers, non-standard networking. AKS constrains some of this but removes weeks of operational work. For most production workloads: AKS is the right choice. Only choose self-managed if you have specific compliance requirements or need control-plane-level customisation that AKS does not support.
Azure/AKS · ENGINEER
What are AKS Node Pools and when do you use multiple node pools?
A Node Pool is a group of virtual machines with identical configuration in an AKS cluster — same VM size, same OS, same Kubernetes version. Every AKS cluster has at least one system node pool (runs kube-system components) and optionally one or more user node pools (runs your workloads). You use multiple node pools when: 1. Different workloads need different VM sizes — ML training needs GPU VMs (NC-series), web APIs need general-purpose VMs (D-series). Put each in its own pool. 2. Cost optimisation — batch/non-critical workloads go on Spot node pools (60-80% cheaper, can be evicted). Critical workloads go on regular (on-demand) pools. 3. OS separation — some workloads need Windows nodes (legacy .NET apps), others need Linux. 4. Isolation — security-sensitive workloads on dedicated nodes using node selectors and taints. In practice: system pool (2-3 nodes, Standard_D2s_v3), app pool (auto-scales 2-20, Standard_D4s_v3), spot pool (batch jobs, 0-10 nodes). Each pool scales independently.
Azure/AKS · ENGINEER
What is Workload Identity in AKS and why is it better than using Service Principal credentials?
Workload Identity allows a Kubernetes pod to authenticate to Azure services (Key Vault, Storage, SQL) using a Managed Identity — without any stored credentials. How it works: create a User Assigned Managed Identity in Azure, federate it with a Kubernetes Service Account, annotate the Service Account with the managed identity client ID. When a pod using that Service Account makes an Azure SDK call, it gets a token from the Azure OIDC endpoint automatically. The pod never has a secret, password, or client secret. Why it is better than Service Principal credentials: Service Principal approach stores a client_id and client_secret as a Kubernetes Secret. That secret can be read by anyone with kubectl access to the namespace, can appear in logs, and must be manually rotated. Workload Identity has zero stored credentials — nothing to rotate, nothing to leak, nothing to accidentally commit to Git. If a pod is compromised, the attacker cannot steal a long-lived credential — they can only make calls that the managed identity is authorised for, and only while the pod runs. This is the current Microsoft-recommended approach for AKS-to-Azure authentication.
Azure/AKS · PRODUCTION
How do you troubleshoot ImagePullBackOff errors when AKS pulls from Azure Container Registry?
Systematic diagnosis. Step 1: get the exact error. kubectl describe pod pod-name -n namespace — look at Events section. It will say either ImagePullBackOff (retrying after failure) or ErrImagePull (initial failure). The message shows the exact error: unauthorized, manifest not found, or connection refused. Step 2: verify the image reference. Check that the image name and tag in the pod spec exactly match what exists in ACR. Common mistake: image tag does not exist (pushed as v1.2.3, pod spec says latest). az acr repository show-tags --name myregistry --repository myapp confirms available tags. Step 3: check ACR permissions. AKS uses either a managed identity or service principal to pull from ACR. Run: az aks check-acr --name myakscluster --resource-group myrg --acr myregistry.azurecr.io. If permission is missing: az role assignment create --assignee AKS-CLIENT-ID --role AcrPull --scope ACR-RESOURCE-ID. Step 4: check network. If AKS is private and ACR is behind a private endpoint, ensure the AKS subnet can reach the ACR private endpoint. Step 5: check imagePullSecret. If using imagePullSecrets in the pod spec, verify the secret exists and has valid credentials: kubectl get secret regcred -n namespace -o yaml.
Azure/AKS · ENGINEER
Explain the difference between System Assigned and User Assigned Managed Identity in Azure.
Both Managed Identity types allow Azure resources to authenticate to other Azure services without credentials. System Assigned: created automatically when you enable it on a resource (VM, AKS, Function App). Its lifecycle is tied to the resource — when you delete the VM, the identity is deleted. Cannot be shared across resources. Good for single-resource scenarios where you want automatic cleanup. User Assigned: created as a standalone Azure resource, independent of any compute resource. Can be assigned to multiple VMs, AKS clusters, or Function Apps simultaneously. Its lifecycle is independent — deleting the VM does not delete the identity. Good for: sharing the same identity across multiple resources (e.g., 10 AKS pods all use the same identity to access Key Vault), pre-creating identities before compute resources exist, and AKS Workload Identity (which requires User Assigned). In AKS: system-assigned identity is used for the cluster itself to manage Azure resources (create load balancers, assign IPs). User-assigned identity is used for pods to access Azure services via Workload Identity.
Azure/AKS · ENGINEER
What is the relationship between ACR and AKS? How does AKS authenticate to ACR?
ACR (Azure Container Registry) is where your container images are stored. AKS pulls images from ACR to run as containers in pods. Authentication between them can work three ways. Method 1, managed identity attachment (recommended): az aks update --name myAKS --resource-group myRG --attach-acr myregistry assigns the AcrPull role to the AKS kubelet managed identity. Nodes pull images automatically without any credentials stored anywhere. Method 2, service principal: AKS authenticates using a service principal that has the AcrPull role on the registry. Requires managing and rotating the SP credentials. Method 3, imagePullSecret: create a Kubernetes Secret containing ACR credentials and reference it in the pod spec. This is the most manual approach, and the credentials are stored in etcd. In production always use Method 1. The flow: the pod spec references an image in ACR, the kubelet on the node needs to pull it, the kubelet uses its managed identity to get an ACR token from Azure AD, then pulls the image. The developer never manages any credentials. Common ACR features used with AKS: geo-replication (a replica in the same region as AKS for fast pulls), content trust (signed images only), vulnerability scanning (Defender for Containers).
Azure/AKS · ARCHITECT
How do you structure Terraform for AKS across multiple environments (Dev, Staging, Production)?
A maintainable structure combines shared modules with per-environment workspaces or directories. Recommended approach: separate environment directories with shared modules. Root structure: modules/aks/ (reusable AKS module), modules/acr/, modules/networking/. Environment directories: environments/dev/main.tf, environments/staging/main.tf, environments/production/main.tf. Each environment's main.tf calls the modules with environment-specific variables, for example a module "aks" block with source = "../../modules/aks", cluster_name = "aks-production", node_count = 5, vm_size = "Standard_D4s_v3". Environment-specific tfvars files hold the variable values. Remote state: each environment has its own state file in Azure Blob Storage; never share state between environments. Using a separate storage account per environment prevents a corrupted dev state from affecting production. CI/CD: separate pipelines per environment. Dev pipeline runs on every push. Staging on merge to main. Production requires manual approval. State locking prevents concurrent applies. Never run terraform apply manually on production; always go through the pipeline so there is an audit trail.
Azure/AKS · ENGINEER
Which Kubernetes workload type is best for a database like MongoDB and why?
StatefulSet. Databases require two things that regular Deployments cannot provide: stable network identity and persistent storage per pod. With a Deployment, a replacement pod gets a new random name (app-7d8f9-xyz → app-9k3m-abc), and every replica references the same PVC from the pod template rather than owning its own volume. MongoDB replica sets identify members by hostname: if the hostname changes, the replica set cannot re-form and data can be lost. With a StatefulSet: pods get stable, predictable names (mongo-0, mongo-1, mongo-2) that persist across restarts. Each pod gets its own PVC (volumeClaimTemplates creates a dedicated PVC per pod). Pods start in order and terminate in reverse order: mongo-0 starts before mongo-1, which matters for MongoDB primary election. When mongo-1 restarts, it comes back as mongo-1 with the same PVC containing its data. The headless service (clusterIP: None) provides DNS entries per pod, such as mongo-0.mongo-svc.namespace.svc.cluster.local, and MongoDB uses these to find replica set members. In production: use a StatefulSet with 3 replicas (1 primary, 2 secondaries), dedicated SSD storage class, pod disruption budget allowing maximum 1 unavailable, and pod anti-affinity to spread replicas across availability zones.
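The stable per-pod DNS names follow a fixed pattern, which a short sketch can reproduce. The namespace production, the port 27017 (MongoDB's default) and the replica set name rs0 are assumptions for illustration; substitute your own values.

```python
def statefulset_pod_dns(statefulset, service, namespace, replicas,
                        cluster_domain="cluster.local"):
    """Return the stable DNS name of each pod behind a headless service."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


# Three replicas of a StatefulSet named "mongo" behind headless service "mongo-svc".
hosts = statefulset_pod_dns("mongo", "mongo-svc", "production", 3)

# Seed list for a MongoDB connection string (port and replica set name assumed).
seed_list = ",".join(f"{host}:27017" for host in hosts)
connection_string = f"mongodb://{seed_list}/?replicaSet=rs0"

for host in hosts:
    print(host)
print(connection_string)
```

Because the names depend only on the StatefulSet name, the ordinal and the service, they can be written into configuration before the pods exist.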
Azure/AKS · BEGINNER
What Azure services do you use and how do they integrate?
Core services I use daily: AKS (managed Kubernetes), ACR (container registry — stores Docker images that AKS pulls), Azure DevOps (CI/CD pipelines — builds images, deploys to AKS), Azure Key Vault (secrets/certs — mounted into pods via CSI driver), Virtual Network and NSGs (network isolation), Application Gateway (L7 load balancer with WAF, terminates SSL), Azure Monitor + Log Analytics (logs and metrics), Managed Identity (pod-to-Azure authentication without credentials). Integration flow: Developer commits code → Azure DevOps pipeline triggers → builds Docker image → pushes to ACR → deploys to AKS via helm upgrade → AKS pulls image from ACR using Managed Identity (no credentials stored) → pods access Key Vault secrets via Workload Identity → Application Gateway routes external traffic to pods → Azure Monitor collects all logs and metrics → alerts to PagerDuty.
Azure/AKS · ENGINEER
How does Application Gateway work with AKS? What is AGIC?
Application Gateway is Azure's Layer 7 load balancer. Features: SSL termination (offload TLS processing from pods), URL-based routing (/api → backend API pods, /app → frontend pods), WAF (Web Application Firewall, which blocks OWASP Top 10 attacks), health probes (automatically removes unhealthy backends), session affinity, auto-scaling. AGIC (Application Gateway Ingress Controller): a controller pod running inside AKS. It watches Kubernetes Ingress resources and translates them into Application Gateway configuration, creating listeners, routing rules and backend pools automatically. When you create a Kubernetes Ingress resource with class azure/application-gateway, AGIC programs the Application Gateway for you. No manual App Gateway configuration needed. Benefit over NGINX Ingress: App Gateway is managed by Azure, auto-scales, includes WAF. NGINX needs you to manage the ingress controller pods. Setup: az aks enable-addons --name myAKS --resource-group myRG --addons ingress-appgw --appgw-id /subscriptions/.../applicationGateways/myAppGW. App Gateway must be in the same VNet as AKS (or a peered VNet).
Azure/AKS · ENGINEER
What is NSG and how do you use it with AKS?
NSG (Network Security Group) is a stateful firewall at the subnet or NIC level. Rules: priority (100-4096, lower number = higher priority), source IP/range, destination IP/range, port, protocol, allow/deny. Rules are evaluated in priority order and the first match wins. For AKS, subnet NSG rules needed: allow port 443 inbound for HTTPS traffic to the Application Gateway subnet. Allow port 10250 between nodes (kubelet communication). Allow 30000-32767 for NodePort services where they are used. No rule is needed for etcd (2379-2380): in AKS, etcd runs in the Azure-managed control plane, not on your nodes. Deny everything else inbound. Outbound: allow 443 to ACR, Key Vault, Azure APIs (use service tags: AzureContainerRegistry, AzureKeyVault, AzureMonitor). Deny outbound to internet from production nodes (use private endpoints for ACR and Key Vault). Best practice: manage NSGs in Terraform. Never manually modify NSGs in production; changes must go through the Git PR process. az network nsg rule list --nsg-name myNSG --resource-group myRG shows current rules. NSG flow logs: enable for security audit. They record every allowed/denied flow to a storage account, and Traffic Analytics can surface them in Log Analytics.
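How priority decides the outcome can be sketched as a first-match lookup. This is a deliberately simplified model (inbound only, source CIDR and destination port only, and it ignores the default rules Azure adds to every NSG); the rule set is hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Rule:
    priority: int      # 100-4096, lower number is evaluated first
    source: str        # CIDR block, or "*" for any source
    port_range: tuple  # inclusive (low, high) destination ports
    access: str        # "Allow" or "Deny"


def evaluate(rules, source_ip, port):
    """Return the access of the first rule, in priority order, that matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        in_source = (
            rule.source == "*"
            or ipaddress.ip_address(source_ip) in ipaddress.ip_network(rule.source)
        )
        low, high = rule.port_range
        if in_source and low <= port <= high:
            return rule.access
    return "Deny"  # nothing matched: treat as denied in this simplified model


# Hypothetical inbound rules, listed out of order on purpose.
rules = [
    Rule(4096, "*", (0, 65535), "Deny"),               # catch-all deny
    Rule(100, "*", (443, 443), "Allow"),               # HTTPS from anywhere
    Rule(200, "10.0.0.0/16", (30000, 32767), "Allow"), # NodePorts, VNet only
]

print(evaluate(rules, "203.0.113.7", 443))    # internet client on HTTPS
print(evaluate(rules, "203.0.113.7", 30080))  # internet client on a NodePort
print(evaluate(rules, "10.0.1.5", 30080))     # in-VNet client on a NodePort
```

The catch-all deny at priority 4096 only takes effect when no lower-numbered rule has already matched, which is why it is safe to keep it alongside the allow rules.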
Azure/AKS · ENGINEER
What is Azure DevOps pipeline? Walk through CI/CD flow.
Azure DevOps Pipeline YAML flow: trigger: branches: include: [main] → pipeline starts on merge to main. Stage 1 Build: checkout code, dotnet build or mvn package or npm install && npm run build, run unit tests, SonarQube analysis (fail if quality gate fails), docker build --build-arg BUILD_ID=$(Build.BuildId), trivy image scan (fail on CRITICAL), docker push to ACR with tag $(Build.BuildId). Stage 2 Deploy Dev: dependsOn: Build. Uses the KubernetesManifest task or helm upgrade --install --namespace dev --set image.tag=$(Build.BuildId) --atomic --timeout 5m. Run smoke tests. Stage 3 Deploy Staging: auto-deploy after dev succeeds. Stage 4 Deploy Production: environment: production with approval check, so a release manager must approve. Service connections: connection from Azure DevOps to AKS (using Managed Identity or Service Principal). Variables: non-secret in Variable Groups (Library), secrets from Azure Key Vault linked to a Variable Group. Artifact: Build.BuildId is the versioned tag. Every build gets a unique ID, traceable from pipeline to running pod.
Azure/AKS · ENGINEER
How do you store and manage variables and secrets in Azure DevOps?
Variable Groups (non-sensitive): Pipelines → Library → Variable Groups. Store: environment names, registry URLs, cluster names, feature flags. Reference in pipeline: variables: - group: myapp-variables. Azure Key Vault linked Variable Group (for secrets): create Variable Group → link to Azure Key Vault → select secrets. Secrets appear as pipeline variables but are fetched from Key Vault at runtime and never stored in Azure DevOps. Reference: $(secret-name). Pipeline YAML variables vs Variable Groups: use pipeline-level variables for things specific to that pipeline and Variable Groups for config shared across multiple pipelines. Secret masking: Azure DevOps automatically masks secrets in pipeline logs, so $(my-secret) prints as ***. Environment variables: secret variables are not exposed to scripts automatically, so map them explicitly in the env: section of the step. Best practice: never hardcode any value that might change between environments. Use variables for everything: image tag, namespace, cluster name, replica count. Keep secrets exclusively in Key Vault: do not type them into Variable Group values, only link Key Vault references.
Azure/AKS · ARCHITECT
How do you set up Workload Identity in AKS? Why use it?
Without Workload Identity: pods use Service Principal credentials stored as Kubernetes Secrets to authenticate to Azure. Those credentials can expire, can be leaked, and are a security risk. Workload Identity: federate a Kubernetes ServiceAccount with an Azure Managed Identity using OIDC. Pod uses the K8s SA → AKS projects a signed service account token into the pod → pod presents the token to Azure AD → Azure AD validates it via the OIDC trust → issues a short-lived Azure access token. No stored credentials anywhere. Setup: 1. Enable the OIDC issuer and workload identity on the cluster: az aks update --name myAKS --resource-group myRG --enable-oidc-issuer --enable-workload-identity. 2. Create a User Assigned Managed Identity: az identity create --name payment-identity --resource-group myRG. 3. Create the Kubernetes ServiceAccount and annotate it: kubectl annotate serviceaccount payment-sa azure.workload.identity/client-id=<identity-client-id>. 4. Create the federated credential: az identity federated-credential create, linking the K8s SA to the Managed Identity through the cluster's OIDC issuer URL and the subject system:serviceaccount:<namespace>:payment-sa. 5. Grant the Managed Identity access: az keyvault set-policy --name myVault --object-id <identity-object-id> --secret-permissions get list (on a vault that uses the RBAC model, assign the Key Vault Secrets User role instead). In the pod spec: serviceAccountName: payment-sa, plus the pod label azure.workload.identity/use: "true" so the token is injected. The SDK (azure-identity DefaultAzureCredential) automatically uses the injected token.
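The federated credential in step 4 is essentially three values that must line up with the cluster and the ServiceAccount. A minimal sketch of those values follows; the issuer URL and the payments namespace are placeholders (the real issuer URL comes from the cluster once the OIDC issuer is enabled), and the dictionary keys simply mirror the issuer, subject and audience fields of a federated credential.

```python
def federated_credential(issuer_url, namespace, service_account):
    """Build the issuer/subject/audience triple for a federated credential."""
    return {
        # OIDC issuer URL of the AKS cluster (placeholder in this sketch).
        "issuer": issuer_url,
        # Kubernetes identifies a ServiceAccount token by this subject string.
        "subject": f"system:serviceaccount:{namespace}:{service_account}",
        # Audience used for Azure AD token exchange.
        "audiences": ["api://AzureADTokenExchange"],
    }


# Placeholder issuer URL and a hypothetical namespace.
cred = federated_credential(
    issuer_url="https://oidc.example.invalid/cluster-id/",
    namespace="payments",
    service_account="payment-sa",
)

for field, value in cred.items():
    print(f"{field}: {value}")
```

If the namespace or ServiceAccount name in the subject does not match the pod's actual ServiceAccount exactly, the token exchange is rejected, which makes this string the first thing to check when authentication fails.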
Azure/AKS · ENGINEER
How do you implement Azure Monitor and Container Insights for AKS?
Enable Container Insights: az aks enable-addons --name myAKS --resource-group myRG --addons monitoring --workspace-resource-id /subscriptions/.../workspaces/myLogAnalytics. What it collects automatically: node-level metrics (CPU, memory, disk, network), pod metrics (CPU, memory, restarts), container logs (stdout/stderr from all containers), Kubernetes events. Query with KQL in Log Analytics: KubePodInventory | where TimeGenerated > ago(1h) | where Namespace == "production" | where PodStatus == "Failed" | project TimeGenerated, Name, ContainerStatus (the pod name is in the Name column). ContainerLog | where LogEntry contains "ERROR" | summarize count() by Computer, bin(TimeGenerated, 5m). Alert rules: Alerts → New Alert Rule → use a metric (e.g. Percentage CPU > 80%) or a KQL query (e.g. failed pods count > 0 for 5 minutes). Action groups: send to email, SMS, PagerDuty, Slack webhook. Dashboards: Azure Monitor Workbooks provide pre-built Kubernetes dashboards. Custom Grafana dashboards: add Azure Monitor as a Grafana data source and query the Container Insights tables. Cost control: set a daily cap on the Log Analytics workspace to avoid unexpected bills from verbose application logs.
Azure/AKS · BEGINNER
What is AKS and what does Azure manage for you?
AKS (Azure Kubernetes Service) is managed Kubernetes. Azure manages: control plane (API server, etcd, scheduler, controller manager) with zero maintenance and no charge on the Free tier (the Standard tier adds a paid uptime SLA). etcd backups: automatic. Control plane HA: built-in. Control plane upgrades: initiated by you, executed by Azure. You manage: worker nodes (VM size, count, OS updates), application deployments, networking config, security policies. Key benefit: removes the roughly 30-40% of ops effort that goes into managing the control plane in self-managed K8s. Apart from the optional SLA tier, you pay only for worker node VMs.
Azure/AKS · BEGINNER
What are AKS Node Pools and why use multiple?
A node pool is a group of VMs with the same configuration. The system node pool runs Kubernetes system components (CoreDNS, metrics-server). User node pools run your workloads. Reasons for multiple pools: workload isolation (keep noisy batch jobs away from latency-sensitive APIs), cost optimisation (standard VMs for most workloads, GPU VMs only in the GPU pool), OS separation (Linux pool for containers, Windows pool for .NET apps), scaling independence (scale web pool to 20 nodes during peak, batch pool stays at 2), taint/toleration-based scheduling (taint gpu-pool=true:NoSchedule, only GPU workloads tolerate it).
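The taint/toleration rule can be sketched as a matching function. This is a simplified model of the scheduler's check (it handles the Equal and Exists operators and treats PreferNoSchedule as non-blocking); the pod definitions are hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        key = toleration.get("key")
        return key is None or key == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value") == taint.get("value")
    )


def can_schedule(tolerations, taints):
    """A pod fits a node only if every blocking taint is tolerated."""
    blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(
        any(tolerates(tol, taint) for tol in tolerations) for taint in blocking
    )


# Taint from the text: gpu-pool=true:NoSchedule
gpu_taint = {"key": "gpu-pool", "value": "true", "effect": "NoSchedule"}

# Hypothetical pods: one tolerates the GPU taint, one has no tolerations.
gpu_pod = [{"key": "gpu-pool", "operator": "Equal",
            "value": "true", "effect": "NoSchedule"}]
web_pod = []

print(can_schedule(gpu_pod, [gpu_taint]))
print(can_schedule(web_pod, [gpu_taint]))
```

A toleration only permits scheduling onto the tainted pool; to make GPU workloads actually land there, pair it with a node selector or node affinity for that pool.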
Azure/AKS · ENGINEER
How do you configure Application Gateway Ingress Controller (AGIC) for AKS?
AGIC makes Azure Application Gateway act as the Kubernetes Ingress controller. Architecture: Application Gateway (public IP, WAF, SSL termination) → AGIC controller (pod in AKS) → Routes to Kubernetes Services → Pods. Setup: 1. Create Application Gateway in the same VNet as AKS (or a peered VNet). 2. Enable the AGIC addon: az aks enable-addons --name myAKS --resource-group myRG --addons ingress-appgw --appgw-id /subscriptions/.../applicationGateways/myAppGW. 3. Create an Ingress resource with the class annotation: kubernetes.io/ingress.class: azure/application-gateway. 4. For SSL: upload the certificate to App Gateway and reference it in the Ingress: appgw.ingress.kubernetes.io/appgw-ssl-certificate: my-cert. AGIC watches Ingress resources and automatically updates App Gateway backend pools, HTTP settings, and routing rules. One App Gateway can serve multiple applications using path-based routing.
Azure/AKS · ENGINEER
How do you add TLS/SSL certificates to applications in AKS?
Two approaches: cert-manager (automated) or manual. cert-manager approach (recommended for production): Install the cert-manager helm chart. Create a ClusterIssuer pointing to Let's Encrypt: apiVersion: cert-manager.io/v1, kind: ClusterIssuer, spec.acme.server: https://acme-v02.api.letsencrypt.org/directory. Create a Certificate resource or add the cert-manager annotation to the Ingress: cert-manager.io/cluster-issuer: letsencrypt-prod. cert-manager automatically requests the cert, stores it as a K8s Secret, and renews it before expiry. Manual approach: create a TLS secret: kubectl create secret tls myapp-tls --cert=tls.crt --key=tls.key -n production. Reference in Ingress: spec.tls: - hosts: [app.example.com], secretName: myapp-tls. For App Gateway: upload the PFX certificate to Key Vault, reference via AGIC annotation. For NGINX ingress: the secret is used directly. Monitor expiry: kubectl get certificate -A shows READY status, and kubectl describe certificate <name> shows the Not After expiry date.
Azure/AKS · ENGINEER
How do you deploy 10 applications via Jenkins, Terraform, Helm, and Kubernetes?
The complete pipeline for each app: Terraform manages the infrastructure layer: AKS cluster, ACR, Key Vault, networking. All in code, state in Azure Blob. Jenkins CI pipeline: trigger on Git commit → checkout code → docker build → trivy scan → docker push to ACR → update Helm values file with new image tag → commit back to Git → ArgoCD detects the change. Helm chart per application: values.yaml contains image tag, replica count, resource limits, environment variables. Each app has its own Helm release. kubectl apply or helm upgrade --install --atomic --timeout 5m --namespace app-namespace. For 10 apps: define each as a separate Helm release. Use ArgoCD App-of-Apps pattern: one root ArgoCD Application points to a directory of 10 Application manifests. ArgoCD syncs all 10. Namespace isolation: each app gets its own namespace. Network policies restrict cross-namespace communication. RBAC per namespace: dev team only accesses their app namespace.
Azure/AKS · ENGINEER
How do you upgrade an AKS cluster? What is the process?
Pre-upgrade: az aks get-upgrades --name myAKS --resource-group myRG shows available versions. Read the release notes for breaking changes. Test in dev/staging first. Raise Change Request, get approvals, schedule maintenance window. Verify application backups and rollback plan. Upgrade process: Step 1, upgrade control plane: az aks upgrade --name myAKS --resource-group myRG --kubernetes-version 1.29.0 --control-plane-only. Verify control plane: az aks show reports the new version (kubectl get nodes lists only worker nodes, which still show the old version at this point). Step 2, upgrade node pools one at a time: az aks nodepool upgrade --name userpool --cluster-name myAKS --resource-group myRG --kubernetes-version 1.29.0. AKS adds a surge node on the new version, then cordons and drains each old node in turn so pods move across. Step 3, validate: kubectl get nodes shows every node on the new version, kubectl get pods -A shows all pods running, then run smoke tests against all critical endpoints. Monitor for 30 minutes before marking complete. Important: node pools may lag the control plane; keep the skew to one minor version (control plane 1.29, nodes 1.28). Never skip minor versions: to go from 1.27 to 1.29, upgrade to 1.28 first.
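The two version rules above (no skipped minor versions, limited skew between control plane and node pools) are easy to check before raising the change request. A minimal sketch follows; the max_skew default of 1 follows the rule stated in this answer, and the function names are made up for illustration.

```python
def parse_minor(version):
    """Split '1.29.0' into (major, minor) integers, ignoring the patch."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def upgrade_path(current, target):
    """List each minor version to step through, one hop at a time."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major:
        raise ValueError("major version changes are not handled in this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not supported")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


def skew_ok(control_plane, node_pool, max_skew=1):
    """Node pools may lag the control plane by at most max_skew minor versions."""
    cp_major, cp_minor = parse_minor(control_plane)
    np_major, np_minor = parse_minor(node_pool)
    return cp_major == np_major and 0 <= cp_minor - np_minor <= max_skew


print(upgrade_path("1.27.3", "1.29.0"))  # two hops are needed
print(skew_ok("1.29.0", "1.28.5"))       # nodes one minor behind
print(skew_ok("1.29.0", "1.27.9"))       # nodes two minors behind
```

A path with more than one entry means the upgrade has to be planned as several maintenance windows, or at least several sequential az aks upgrade runs.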
Azure/AKS · ENGINEER
How do you verify users have correct access in AKS?
Audit from cluster side: kubectl get rolebindings,clusterrolebindings -A -o wide shows every binding: who has what access where. For a specific user: kubectl auth can-i --list --as=user@company.com --namespace production shows everything they can do. For a group: kubectl auth can-i --list --as=any-user --as-group=devteam (--as-group only works together with --as). Check if a specific action is allowed: kubectl auth can-i delete pods --as=user@company.com -n production. Audit from Azure AD side: az role assignment list --all shows all Azure RBAC assignments. For AKS specifically: az aks show shows AAD integration status and admin groups. Regular access review process: quarterly, export all RoleBindings and ClusterRoleBindings, review with team leads, remove any stale accounts or over-privileged roles. Automate detection: Azure AD access reviews can be scheduled to require each user to re-confirm their access. Enable AKS audit logging: send the kube-audit category to Log Analytics through the cluster's diagnostic settings so every API server request is recorded. Query: AzureDiagnostics | where Category == "kube-audit" | project TimeGenerated, log_s. AzureActivity only covers Azure control-plane operations (for example, who ran az aks get-credentials), not kubectl commands.
Azure/AKS · ARCHITECT
How do you implement zero trust for applications in AKS?
Zero trust means: never trust, always verify. No implicit trust based on network location. Every pod, every request must be explicitly authorised. Layers: 1. Network: Kubernetes NetworkPolicies — deny all traffic by default, explicitly allow only what is needed. apiVersion: networking.k8s.io/v1, kind: NetworkPolicy, spec.podSelector: {}, spec.policyTypes: [Ingress, Egress], spec.ingress: [] (deny all). Then add explicit allow policies for required paths. Use Cilium for L7 policies (allow only GET /api/v1/orders not all HTTP). 2. Identity: Workload Identity for pod-to-Azure authentication. Each service has its own Managed Identity. Payment service can only access payment Key Vault secrets. 3. mTLS between services: Istio or Linkerd encrypts all pod-to-pod traffic and mutually authenticates every call. 4. Image security: only images from ACR with valid scan results allowed (Azure Policy). 5. Runtime: Falco detects anomalous behaviour (shell in container, unexpected network call). 6. RBAC: least privilege for every service account. No default service account has meaningful permissions. Result: even if one pod is compromised, it cannot reach other services, cannot access credentials it was not explicitly granted, and the anomaly is detected.
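The network layer (deny by default, then allow specific paths) can be sketched by generating the two manifests as JSON, which kubectl apply -f also accepts. The policy names, the app labels, the production namespace and port 8080 are hypothetical.

```python
import json


def default_deny(namespace):
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches all pods
            "policyTypes": ["Ingress", "Egress"],
            # no ingress/egress rules listed, so nothing is allowed
        },
    }


def allow_ingress(namespace, name, to_app, from_app, port):
    """Allow TCP traffic on one port from pods labelled from_app to to_app."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": to_app}},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


deny = default_deny("production")
allow = allow_ingress("production", "orders-from-frontend",
                      to_app="orders", from_app="frontend", port=8080)

print(json.dumps(deny, indent=2))
print(json.dumps(allow, indent=2))
```

Because the default-deny policy also blocks egress, a real cluster would additionally need an explicit egress allowance for DNS before pods can resolve service names.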
Azure/AKS · ENGINEER
How do you increase node size if a node has slow performance or runs out of resources?
You cannot resize an existing node pool in-place: the VM size is fixed at creation. The process is blue-green node pool replacement. Step 1: create a new node pool with the larger VM size: az aks nodepool add --name newpool --cluster-name myAKS --resource-group myRG --node-vm-size Standard_D8s_v3 --node-count 3. Step 2: taint the old node pool to prevent new pods scheduling on it: kubectl taint nodes -l agentpool=oldpool dedicated=old:NoSchedule. Step 3: cordon all old nodes: kubectl cordon -l agentpool=oldpool. Step 4: drain old nodes one at a time: kubectl drain nodename --ignore-daemonsets --delete-emptydir-data. Pods move to the new larger nodes. Step 5: verify all pods are running on new nodes: kubectl get pods -A -o wide. Step 6: delete the old node pool: az aks nodepool delete --name oldpool --cluster-name myAKS --resource-group myRG. For immediate relief while waiting: add more nodes (scale out) to distribute load: az aks nodepool scale --name existingpool --cluster-name myAKS --resource-group myRG --node-count 5. Check which pods are causing pressure: kubectl top nodes, kubectl top pods -A --sort-by=cpu.
Azure/AKS · ENGINEER
How do you overcome pod restart impact on applications?
Pod restarts cause brief downtime unless handled properly. Prevention strategies: Multiple replicas: minimum 2 replicas per service. One pod restarting leaves the other handling traffic, and Kubernetes removes the restarting pod from the Service endpoints while it is not ready. PodDisruptionBudget: ensures at least 1 replica stays running during voluntary disruptions (node drain, upgrades): kubectl create pdb myapp-pdb --selector=app=myapp --min-available=1. Readiness probes: pod only receives traffic when the readiness probe passes. During restart and initialisation, traffic is not sent to the pod. livenessProbe restarts containers that are still running but stuck and no longer serving. startupProbe gives slow-starting apps (Java) time to initialise before liveness checks begin. PreStop hook: add lifecycle.preStop.exec.command (for example sleep 15) so the load balancer can drain connections before the container receives SIGTERM; keep terminationGracePeriodSeconds longer than the sleep. Resource limits: set appropriate memory limits to prevent OOMKilled. Set CPU requests to match real usage and avoid CPU limits that are too low: throttling comes from the limit, and a throttled app can respond slowly enough to fail its liveness probe. For detecting the cause: kubectl get events -n production --sort-by=.lastTimestamp shows restart reasons. kubectl logs mypod --previous shows what happened before the last restart.
Azure/AKS · ARCHITECT
How does RBAC integration with Azure AD work in AKS?
AKS AAD integration creates a unified identity layer. How it works: Enable on cluster: az aks update --name myAKS --resource-group myRG --enable-aad --enable-azure-rbac --aad-admin-group-object-ids followed by the object ID of the admin group. Authentication flow: developer runs az aks get-credentials → developer runs a kubectl command → kubectl calls the kubelogin credential plugin, which sends the developer to Azure AD → developer authenticates with their corporate credentials (MFA supported) → Azure AD returns a token → kubectl sends the token to the AKS API server → AKS validates the token → checks Azure RBAC role assignments and Kubernetes RBAC permissions for the user/group. Define permissions in code (Terraform): create Azure RBAC role assignments: azurerm_role_assignment resource with role_definition_name "Azure Kubernetes Service RBAC Reader" scoped to the namespace. AD groups map to K8s roles (bindings reference the group by its object ID): devteam@company.com group bound to edit ClusterRole in dev namespace only. ops@company.com group bound to cluster-admin ClusterRole. Why AD groups over individual users: when an engineer joins the team → add to AD group → automatically gets all permissions. When they leave → remove from AD group → access revoked everywhere. No manual K8s RBAC changes needed. Audit: every API server call is logged with the Azure AD identity of the caller.
Azure/AKS · ENGINEER
How do you define specific role access in AKS? Example: only read pods in production.
Step 1: create a Role (namespace-scoped) or ClusterRole (cluster-wide). For read-only pods in production namespace: apiVersion: rbac.authorization.k8s.io/v1, kind: Role, metadata.name: pod-reader, metadata.namespace: production, rules: - apiGroups: [""], resources: ["pods", "pods/log"], verbs: ["get", "list", "watch"]. Step 2: bind to an AD group with RoleBinding: kind: RoleBinding, metadata.namespace: production, subjects: - kind: Group, apiGroup: rbac.authorization.k8s.io, name: GROUP_OBJECT_ID (placeholder for the object ID of the support-team@company.com AD group; AKS matches groups by object ID, not by email), roleRef.apiGroup: rbac.authorization.k8s.io, roleRef.kind: Role, roleRef.name: pod-reader. Step 3: verify. Impersonating a group also requires --as with a user name (test-user here is a placeholder): kubectl auth can-i get pods --as=test-user --as-group=GROUP_OBJECT_ID -n production → yes. kubectl auth can-i delete pods --as=test-user --as-group=GROUP_OBJECT_ID -n production → no. In Terraform (recommended, no manual kubectl): resource "kubernetes_role" and resource "kubernetes_role_binding". Store all RBAC definitions in a GitOps repo so any change goes through PR review and leaves an audit trail. Principle: production gets READ ONLY for most engineers. Only the deployment service account and ops team have write access.
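A minimal Python sketch of how the pod-reader rules above answer a "can I?" question. It is a simplified model, not the API server's authorizer: it assumes the subject is already bound to the Role, and it ignores resourceNames and non-resource URLs. The function names are hypothetical.

```python
# Rules copied from the pod-reader Role in the answer above.
POD_READER_RULES = [
    {"apiGroups": [""], "resources": ["pods", "pods/log"], "verbs": ["get", "list", "watch"]},
]


def _matches(values, wanted):
    # "*" is the RBAC wildcard; otherwise the value must be listed explicitly.
    return "*" in values or wanted in values


def can_i(rules, api_group, resource, verb):
    """RBAC is additive: the request is allowed if any single rule matches."""
    return any(
        _matches(rule["apiGroups"], api_group)
        and _matches(rule["resources"], resource)
        and _matches(rule["verbs"], verb)
        for rule in rules
    )


print(can_i(POD_READER_RULES, "", "pods", "get"))         # True
print(can_i(POD_READER_RULES, "", "pods", "delete"))      # False
print(can_i(POD_READER_RULES, "apps", "deployments", "get"))  # False
```

There is no deny rule in RBAC, so anything not matched by a rule is refused by default, which is why the delete check comes back "no".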
Azure/AKS · ENGINEER
What security controls do you apply for pods and applications in AKS?
Pod-level security: SecurityContext on every pod — runAsNonRoot: true (no root processes), readOnlyRootFilesystem: true (container cannot write to its own filesystem), allowPrivilegeEscalation: false, drop all capabilities then add only what is needed (capabilities.drop: [ALL], capabilities.add: [NET_BIND_SERVICE] if needed). Container image security: use distroless or minimal base images. Trivy or Grype scan in CI pipeline — fail build on CRITICAL/HIGH CVEs. Only pull from trusted registry (ACR) enforced via Azure Policy. Tag immutability on ACR — no overwriting image tags. Network security: NetworkPolicy deny-all default, explicit allow for required paths only. Network policies enforced by Cilium or Azure CNI. Secrets: never in environment variables directly, never in ConfigMaps. Use CSI Secrets Store driver to mount Key Vault secrets as files. Workload Identity for pod-to-Azure authentication. Admission control: Azure Policy for Kubernetes enforces: required resource limits, banned privileged containers, required security context fields, approved image registries only. Runtime: Falco for anomaly detection. These controls together mean a compromised pod has no root access, no writable filesystem, no ability to reach other services or Azure resources it was not explicitly granted, and any anomalous behaviour is detected within seconds.
Azure · SENIOR
Design a production AKS architecture for a financial services application.
Architecture I have implemented at HPE for a telecom production platform. Network: private AKS cluster (API server not reachable from internet), Azure CNI networking (pods get VNet IPs directly, no NAT), separate subnets for system node pool and user node pools, private endpoints for all PaaS services (ACR, Key Vault, Azure SQL, Storage). Identity: managed identity for the cluster, workload identity for pod-level Azure access (payment pods get their own identity to access Key Vault payment secrets, order pods get a different identity), zero service principal secrets. Node pools: system pool (3 nodes, Standard_D2s_v3, always on), user pool (auto-scale 2-20, Standard_D4s_v3), GPU pool (optional, only when needed for ML). Security: Azure Policy add-on enforces: no privileged containers, required resource limits, approved registry only (ACR). Defender for Containers for runtime security. Network policies (Calico) for pod-to-pod traffic control. Monitoring: Azure Monitor + Container Insights, Prometheus for custom metrics, Grafana dashboards, Alert rules for: node pressure, pod crash rate, certificate expiry. Disaster recovery: secondary cluster in paired region, Azure Traffic Manager for DNS failover, ArgoCD managing both clusters from same Git.
Azure · PRODUCTION
AKS node pool upgrade is stuck at 60%. What do you do?
First: check why it is stuck. kubectl get nodes shows which nodes are on the old version vs the new one. The upgrade works through the pool node by node (or in batches, depending on max surge): cordon node (stop new pods), drain node (move existing pods), bring the node back on the new version, uncordon. Stuck at 60% usually means one node cannot be drained. Run kubectl describe node stucknode and look at Events. Common causes: bare pods that no controller manages (kubectl drain fails with "cannot delete pods not managed by RC/RS/DaemonSet/Job"; nothing will recreate these pods, so check they are disposable and delete them manually), PDB set too strictly (minAvailable = replicas, so drain cannot remove any pod; temporarily relax the PDB or add a replica), DaemonSet pods (kubectl drain --ignore-daemonsets handles these), local storage (kubectl drain --delete-emptydir-data). If drain keeps timing out: az aks nodepool upgrade --no-wait and manually drain the stuck node. Last resort: delete the stuck node manually (az vmss delete-instances) and AKS will replace it with a new upgraded node. Never let an upgrade run unattended on production. Always monitor each node drain manually the first time you run an upgrade on a new cluster.
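The "PDB set too strictly" case is plain arithmetic. A minimal Python sketch, assuming a PDB expressed as an absolute minAvailable (the function name is hypothetical):

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions the PDB permits right now (never negative)."""
    return max(0, healthy_pods - min_available)


# minAvailable equals the replica count: drain can never evict a pod.
print(disruptions_allowed(healthy_pods=3, min_available=3))  # 0

# Adding one replica, or lowering minAvailable by one, unblocks the drain.
print(disruptions_allowed(healthy_pods=4, min_available=3))  # 1
print(disruptions_allowed(healthy_pods=3, min_available=2))  # 1
```

kubectl get pdb -A shows the same number in the ALLOWED DISRUPTIONS column, which is the quickest way to spot the blocking budget.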
☸️Kubernetes — Workloads, HPA, Networking, Storage, Production28 Q&As›
Kubernetes · ARCHITECT
How do you design a zero-downtime deployment strategy in Kubernetes?
The approach depends on risk level. For standard services: Rolling Update with maxUnavailable:0 and maxSurge:1. This ensures full capacity is maintained throughout. For database schema changes: expand-and-contract pattern — deploy new code that handles both old and new schema simultaneously, run migration, then remove old code path. For high-risk production releases: Argo Rollouts with canary strategy — send 10% of traffic to new version, monitor Prometheus error rate for 5 minutes, if below 1% proceed to 50%, then 100%. If error rate breaches threshold, automatic rollback triggers. Result: zero customer-visible downtime across 200+ deployments per quarter at HPE.
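A minimal Python sketch of the canary gate described above (10% → 50% → 100%, roll back when the error rate reaches 1%). This only models the decision; it is not how Argo Rollouts is configured, and the function and constant names are hypothetical.

```python
CANARY_STEPS = [10, 50, 100]   # traffic weights in percent, from the answer above
ERROR_RATE_THRESHOLD = 0.01    # 1% error rate


def next_action(current_weight: int, error_rate: float):
    """Return (action, new_weight) after one 5-minute analysis window."""
    if error_rate >= ERROR_RATE_THRESHOLD:
        return ("rollback", 0)
    position = CANARY_STEPS.index(current_weight)
    if position == len(CANARY_STEPS) - 1:
        return ("promoted", 100)
    return ("promote", CANARY_STEPS[position + 1])


print(next_action(10, 0.002))  # ('promote', 50)
print(next_action(50, 0.030))  # ('rollback', 0)
print(next_action(100, 0.001)) # ('promoted', 100)
```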
Kubernetes · ENGINEER
A pod is in CrashLoopBackOff. Walk through your systematic troubleshooting.
Step 1: kubectl describe pod — check Events section at the bottom for OOMKilled (memory limit too low), ImagePullBackOff (wrong image), scheduling failures, missing secrets. Step 2: kubectl logs --previous — this shows logs from the CRASHED container, not the new one. Look for startup errors, missing environment variables, connection refused to dependencies. Step 3: If the app starts briefly then dies — kubectl exec -it into the pod while it is briefly running to inspect the filesystem or test connections. Common causes in order of frequency: missing Kubernetes Secret or ConfigMap, wrong environment variable name, memory limit too low (OOMKilled), aggressive liveness probe restarting healthy pods before they are ready.
Kubernetes · ENGINEER
Explain the difference between Liveness and Readiness probes with real examples.
Liveness probe: is the container still alive? If it fails, K8s restarts the container. Use for detecting deadlocks. Example: a Java app that gets stuck in an infinite loop still has a running process but is not doing work. Liveness probe hits /healthz, gets no response, K8s restarts the container. Readiness probe: is the container ready to RECEIVE traffic? If it fails, K8s removes the pod from Service endpoints but does NOT restart it. Use for slow startup or temporary unavailability during cache warming. Example: a pod takes 30 seconds to load its cache on startup. Without a readiness probe, traffic hits the pod before it is ready and users see errors. With a readiness probe, traffic only reaches the pod once it responds 200 to /ready. Production rule: always set both. A missing readiness probe means users hit pods that are not ready.
Kubernetes · PRODUCTION
How does kube-proxy implement Services? What is the difference between iptables and IPVS mode?
kube-proxy runs on every node and maintains network rules that implement Service virtual IPs. When you create a Service with ClusterIP 10.96.0.1:80, kube-proxy creates rules so that traffic to that IP gets distributed to healthy pod IPs. iptables mode (default): kube-proxy creates iptables NAT rules. Traffic hits the virtual IP, iptables randomly selects a pod IP and NATs the packet. Problem: iptables rules are O(n) — with 10,000 Services, each connection scans thousands of rules. IPVS mode: uses Linux kernel IPVS (IP Virtual Server) which uses hash tables — O(1) lookup regardless of number of Services. For clusters with more than 1000 Services, IPVS mode is significantly faster. Enable with kube-proxy --proxy-mode=ipvs.
Kubernetes · ARCHITECT
How do you manage secrets securely in Kubernetes? What are the risks of default Secrets?
Default Kubernetes Secrets are only base64-encoded, not encrypted. Anyone who can read the etcd database or has get/list on Secrets can decode them instantly. Production approaches in order of security: Level 1: enable etcd encryption at rest (EncryptionConfiguration). This encrypts data in etcd, but Secrets are still readable via kubectl by anyone with RBAC access. Level 2: Sealed Secrets (Bitnami). Asymmetrically encrypted Secrets that are safe to commit to Git. Only the controller in the cluster can decrypt. Level 3: External Secrets Operator. The source of truth lives in AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault, and the operator syncs values into Kubernetes Secrets at runtime, so rotation and audit happen in the external store and nothing sensitive sits in Git. The synced Secret still exists in etcd, so keep Level 1 and tight RBAC in place; to keep values out of etcd entirely, mount them with the Secrets Store CSI driver instead. At HPE we use External Secrets with Vault, with Vault as the single source of truth for production credentials.
Kubernetes · ENGINEER
What are QoS classes in Kubernetes and why do they matter during node pressure?
Kubernetes assigns each pod a QoS class based on its resource configuration. Guaranteed: pod has both requests and limits set to identical values for all containers. These pods are the LAST to be evicted under memory pressure. Burstable: pod has requests set but limits are higher, or only requests set. Evicted after BestEffort pods. BestEffort: pod has NO resource requests or limits. These are the FIRST to be evicted when a node runs out of memory. In production: always set resource requests and limits to get Guaranteed or at minimum Burstable class. A single BestEffort pod can get killed at any time during memory pressure even if it is your most critical service. This is one of the most common causes of mysterious pod evictions in production.
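A minimal Python sketch of the classification rules above. It is a simplification: it only looks at cpu and memory, and it compares quantities as plain strings, so "500m" and "0.5" would wrongly count as different. The function name is hypothetical.

```python
def qos_class(containers):
    """Classify a pod from its containers' requests/limits dictionaries."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


same = {"cpu": "500m", "memory": "256Mi"}
print(qos_class([{"requests": same, "limits": same}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "250m"}}]))       # Burstable
print(qos_class([{}]))                                  # BestEffort
```

Note that one container without limits is enough to drop the whole pod from Guaranteed to Burstable, which is easy to miss when a sidecar is injected.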
Kubernetes · ARCHITECT
Explain how the Kubernetes scheduler works and how you can influence scheduling decisions.
Scheduler works in two phases: Filter — removes nodes that cannot run the pod (insufficient CPU/memory, node has a taint the pod does not tolerate, node selector does not match, affinity rules exclude the node). Score — ranks remaining nodes (least-requested prefers underutilised nodes, pod affinity scores preferred nodes higher). To influence scheduling: nodeSelector — simplest, requires exact label match. nodeAffinity — flexible label expressions, can be preferred or required. podAffinity/podAntiAffinity — schedule near or away from other pods. Taints and Tolerations — mark nodes for specific workloads (GPU nodes, high-memory nodes). PriorityClasses — higher priority pods can preempt lower priority pods when resources are scarce. Real use case: spread web pods across availability zones using topologyKey: topology.kubernetes.io/zone with podAntiAffinity.
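The two phases can be sketched in a few lines of Python. This is a toy model under stated assumptions: taints are plain strings, the only score is a least-requested style average of free CPU and memory, and all node data is made up for illustration.

```python
def fits(pod, node):
    """Filter phase: resources, taints/tolerations, nodeSelector."""
    if node["free_cpu"] < pod["cpu"] or node["free_mem"] < pod["mem"]:
        return False
    if not set(node.get("taints", [])) <= set(pod.get("tolerations", [])):
        return False
    selector = pod.get("node_selector", {})
    return all(node.get("labels", {}).get(k) == v for k, v in selector.items())


def score(pod, node):
    """Score phase: prefer the node with the most free capacity after placement."""
    cpu_free = (node["free_cpu"] - pod["cpu"]) / node["cpu"]
    mem_free = (node["free_mem"] - pod["mem"]) / node["mem"]
    return (cpu_free + mem_free) / 2


def schedule(pod, nodes):
    feasible = [n for n in nodes if fits(pod, n)]
    if not feasible:
        return None  # pod stays Pending
    return max(feasible, key=lambda n: score(pod, n))["name"]


nodes = [
    {"name": "a", "cpu": 4, "mem": 16, "free_cpu": 1, "free_mem": 4},
    {"name": "b", "cpu": 4, "mem": 16, "free_cpu": 3, "free_mem": 8},
    {"name": "c", "cpu": 8, "mem": 32, "free_cpu": 6, "free_mem": 24},
    {"name": "gpu", "cpu": 8, "mem": 32, "free_cpu": 8, "free_mem": 32, "taints": ["gpu"]},
]
pod = {"cpu": 2, "mem": 4}
print(schedule(pod, nodes))  # c
```

Node a is filtered out for CPU, the gpu node for its untolerated taint, and c beats b on score (0.5625 vs 0.25).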
Kubernetes · PRODUCTION
What happens during a Kubernetes node failure? How does K8s recover?
When a node goes offline: the node controller in kube-controller-manager stops receiving heartbeats from the kubelet. After node-monitor-grace-period (default 40 seconds in most releases), the node is marked NotReady and tainted with node.kubernetes.io/not-ready or node.kubernetes.io/unreachable (effect NoExecute). Pods tolerate these taints for tolerationSeconds (default 300 seconds, added to every pod automatically) and are then marked for eviction; the older pod-eviction-timeout flag (default 5 minutes) played this role before taint-based evictions. ReplicaSet controller sees fewer pods than desired, creates new pods. Scheduler assigns them to healthy nodes. Total recovery time: typically 5-7 minutes end-to-end. To reduce recovery time: set a lower tolerationSeconds on the pod's tolerations for the not-ready and unreachable taints, and run multiple replicas spread across nodes so traffic keeps flowing while pods are replaced. Stateful workloads: the StatefulSet controller does NOT create a replacement while the old pod may still be running (to prevent data corruption). If the node is permanently dead, manual intervention is required: delete the Node object or force-delete the pod.
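The timeline is simple addition. A minimal Python sketch using the figures quoted above (40-second grace period, 5-minute eviction delay); the assumption, labeled in the code, is that in current Kubernetes the per-pod knob for the eviction delay is tolerationSeconds on the not-ready and unreachable tolerations. The function name is hypothetical.

```python
NODE_MONITOR_GRACE_S = 40   # time until the node is marked NotReady
DEFAULT_EVICTION_DELAY_S = 300  # 5 minutes before pods are evicted


def seconds_until_eviction(eviction_delay_s=DEFAULT_EVICTION_DELAY_S,
                           grace_s=NODE_MONITOR_GRACE_S):
    """Time from node failure until its pods are marked for eviction."""
    return grace_s + eviction_delay_s


print(seconds_until_eviction())    # 340, about 5.7 minutes
# Assumption: a pod that sets tolerationSeconds: 30 for both taints.
print(seconds_until_eviction(30))  # 70
```

Scheduling, image pull and application startup come on top of this, which is where the 5-7 minute end-to-end figure comes from.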
Kubernetes · ENGINEER
What is the difference between a Deployment and a StatefulSet? When do you use each?
Deployment: pods are interchangeable. Any pod can be replaced by any other. Pods get random names (my-app-5d8f7-xkp2r). No stable storage per pod. Use for stateless apps: web servers, APIs, microservices. StatefulSet: pods have identity. Each pod has a stable ordinal name (postgres-0, postgres-1). Ordered startup and shutdown. Each pod gets its own PVC that follows the pod. Stable DNS hostname per pod. Use for: databases (PostgreSQL, MySQL, MongoDB), message queues (Kafka), search (Elasticsearch), distributed caches (Redis Cluster). Key operational difference: if you delete postgres-0 from a StatefulSet, K8s creates a NEW postgres-0 on another node AND attaches the SAME PVC. The data survives. If you delete a Deployment pod, K8s creates a new pod with a new random name and no storage connection.
Kubernetes · ARCHITECT
How do you implement multi-tenancy in Kubernetes for multiple teams?
Soft multi-tenancy for trusted internal teams: Namespace per team or per environment (team-payments, team-search). RBAC: each team gets Role with create/update/delete on their namespace resources. ResourceQuota per namespace limits CPU, memory, pod count so one team cannot starve others. LimitRange sets default requests/limits so pods without resource specs still get sensible defaults. NetworkPolicy default-deny then explicit allow: teams cannot reach each other unless intentionally configured. Hard multi-tenancy for untrusted workloads: separate physical clusters. The complexity of truly isolating untrusted workloads in a shared cluster is not worth it. Use vCluster for a middle ground — virtual clusters inside a physical cluster with complete API isolation. At scale: 15 internal teams on one cluster with namespace isolation, 3 external customers on dedicated clusters.
Kubernetes · PRODUCTION
How do you perform a zero-downtime Kubernetes cluster upgrade?
Managed clusters (EKS/AKS/GKE): upgrade control plane first via console/CLI — provider handles leader election, minimal API downtime (< 30 seconds usually). Worker nodes: use managed node groups for rolling replacement. For self-managed: cordon each node (mark unschedulable: kubectl cordon node-name), drain it (kubectl drain node-name --ignore-daemonsets --delete-emptydir-data — this respects PodDisruptionBudgets), upgrade the node OS and K8s components, uncordon (kubectl uncordon node-name), validate pods are running, proceed to next node. Before any upgrade: check all workloads have PodDisruptionBudgets set, verify all critical deployments have replicas >= 2, read the K8s changelog for API deprecations (check with kubectl deprecations plugin), test in staging first. At HPE: upgraded 50-node clusters with zero application downtime using this rolling strategy.
Kubernetes · ENGINEER
Explain the Kubernetes container lifecycle — what states does a container go through?
A container in Kubernetes goes through these states. Waiting: container is not yet running. It is pulling the image, waiting for a secret to be available, or waiting for an init container to complete. Running: container is executing. Terminated: container finished execution (exit code 0 = success, non-zero = failure). At the Pod level the phases are: Pending (pod accepted but containers not yet started, usually scheduling or image pulling), Running (pod bound to a node and at least one container running, starting or restarting), Succeeded (all containers exited 0 and will not be restarted), Failed (all containers have terminated and at least one exited non-zero), Unknown (node communication lost). Restart policy controls behaviour on failure: Always (the default, and the only value a Deployment accepts), OnFailure (restart only on non-zero exit), Never (do not restart). When a container crashes repeatedly, Kubernetes applies exponential backoff: 10s, 20s, 40s, 80s, 160s, 300s max between restarts. This is the CrashLoopBackOff state you see in kubectl get pods: the container keeps crashing and Kubernetes keeps waiting longer between restarts. Check kubectl logs pod-name --previous to see logs from the crashed container.
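The backoff sequence above is a doubling delay with a cap. A minimal Python sketch (the function name is hypothetical; the 10-second base and 300-second cap are the values quoted above):

```python
def backoff_delays(restarts: int, base: int = 10, cap: int = 300):
    """Delay in seconds before each successive restart attempt."""
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]


print(backoff_delays(7))       # [10, 20, 40, 80, 160, 300, 300]
print(sum(backoff_delays(6)))  # 610 seconds of waiting before the 7th attempt
```

After the sixth crash every further restart waits the full five minutes, so a pod that has been looping for a while can take that long to pick up a fix.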
Kubernetes · ENGINEER
What are the four Kubernetes Service types and when do you use each?
ClusterIP (default): exposes the service on an internal cluster IP. Only reachable from within the cluster. Use for: microservice-to-microservice communication and database services that should never be exposed externally. Most services in a production cluster are ClusterIP. NodePort: exposes the service on each node's IP at a static port (30000-32767). Accessible from outside the cluster via NodeIP:NodePort. Use for: development and testing, on-premise clusters without cloud load balancers. Not recommended for production: it exposes high ports, depends on node IPs, and has no TLS termination. LoadBalancer: provisions a cloud load balancer (AWS NLB, Azure LB, GCP LB) that routes external traffic to the service. Use for: production workloads needing direct external access without ingress, TCP/UDP services (databases, game servers). Cost: each LoadBalancer service creates a cloud LB, which has an hourly cost. ExternalName: maps a service to an external DNS name (a CNAME record, no proxying). Use for: integrating external services (RDS database, external API) into cluster service discovery. Pods call the service name and it resolves to the external DNS name. In practice: use ClusterIP for everything internal, one LoadBalancer for the Ingress Controller, ExternalName for external dependencies.
Kubernetes · ENGINEER
Explain Kubernetes RBAC — Roles, ClusterRoles, RoleBindings, ClusterRoleBindings.
RBAC controls who can do what on which Kubernetes resources. The four objects: Role defines permissions within a single namespace — rules that allow specific verbs (get, list, watch, create, update, delete) on specific resources (pods, deployments, secrets). ClusterRole defines permissions cluster-wide — same structure as Role but applies across all namespaces, or to cluster-scoped resources (nodes, persistent volumes). RoleBinding attaches a Role or ClusterRole to a subject (User, Group, ServiceAccount) within a specific namespace. ClusterRoleBinding attaches a ClusterRole to a subject across the entire cluster. Common patterns: developer access — ClusterRole with read-only on pods/deployments, RoleBinding per namespace (developer can view their namespace, not production). CI/CD service account — ClusterRole with create/update on deployments, RoleBinding in the deployment namespace only. Least privilege: never give ClusterAdmin unless absolutely necessary. A compromised service account with ClusterAdmin can read all secrets in the cluster including other teams' production credentials. In AKS: Azure AD integration means you can bind Azure AD groups to Kubernetes RBAC — platform team in AD group gets ClusterAdmin, developers in another group get namespace-scoped access.
Kubernetes · ARCHITECT
What is the role of Helm in Kubernetes and how does it manage releases?
Helm is the package manager for Kubernetes: it packages, versions, and deploys collections of Kubernetes manifests as a single unit called a chart. Without Helm: you apply 15 separate YAML files for one application. Updating a value means editing multiple files. Rolling back means tracking which YAML version was previously applied. With Helm: one helm install command deploys all 15 resources. One helm upgrade updates what changed. helm rollback reverts to a previous release version. A Helm chart contains: templates/ (Kubernetes YAML with Go template variables), values.yaml (default configuration values), Chart.yaml (chart metadata: name, version, dependencies). Release management: Helm stores release state in Kubernetes Secrets in the deployment namespace. Each upgrade creates a new revision. helm history shows all revisions. helm rollback with a revision number reverts in seconds. Values override: helm install myapp ./mychart -f production-values.yaml --set image.tag=v2.1.0. The environment-specific values file overrides the chart defaults, and --set overrides both at the command line. In GitOps with ArgoCD: when values.yaml in Git changes, ArgoCD re-renders the chart (the equivalent of helm template) and applies the resulting manifests automatically. It does not run helm upgrade, so history and rollback for those applications live in ArgoCD and Git rather than in helm history. The combination of Helm for packaging and ArgoCD for delivery is a common production setup for Kubernetes application deployment.
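The override order (chart defaults, then the -f file, then --set) behaves like a layered deep merge. A minimal Python sketch, not Helm's actual implementation: it ignores Helm details such as deleting keys with null, and the values shown are hypothetical.

```python
import copy


def deep_merge(base, override):
    """Return base with override layered on top, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def set_path(values, dotted_key, value):
    """Apply one --set style override such as image.tag=v2.1.0."""
    result = copy.deepcopy(values)
    node = result
    keys = dotted_key.split(".")
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return result


chart_defaults = {"replicaCount": 1, "image": {"repository": "myapp", "tag": "latest"}}
production_values = {"replicaCount": 3, "image": {"tag": "v2.0.0"}}

final = set_path(deep_merge(chart_defaults, production_values), "image.tag", "v2.1.0")
print(final)
# {'replicaCount': 3, 'image': {'repository': 'myapp', 'tag': 'v2.1.0'}}
```

The repository key survives from the defaults because the merge is per key, not per top-level block.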
Kubernetes · BEGINNER
What is the difference between a Pod, ReplicaSet, and Deployment?
Pod: the smallest deployable unit in Kubernetes. Contains one or more containers that share the same network namespace and storage. Pods are ephemeral — if a Pod dies, it is gone. No self-healing. ReplicaSet: ensures a specified number of Pod replicas are always running. If a Pod dies, ReplicaSet creates a new one. But ReplicaSet alone has no rolling update capability. Deployment: manages ReplicaSets. Provides declarative updates — you tell it what you want, it figures out how to get there. Rolling updates: create a new ReplicaSet with the new version, scale it up while scaling down the old. Rollback: switch back to the previous ReplicaSet. In practice: you never create ReplicaSets directly. You always create Deployments, which manage ReplicaSets, which manage Pods.
Kubernetes · ENGINEER
What is HPA (Horizontal Pod Autoscaler) and how do you configure it?
HPA automatically scales the number of pod replicas based on observed CPU utilization or custom metrics. How it works: HPA controller queries the metrics API every 15 seconds. It calculates desired replicas = ceil(current_replicas * (current_metric / target_metric)), so it scales up when the metric is above target and down when it is below. Scale-down is damped by a stabilization window (5 minutes by default) to prevent thrashing. Basic HPA on CPU: apiVersion: autoscaling/v2, kind: HorizontalPodAutoscaler, spec.scaleTargetRef (apiVersion: apps/v1, kind: Deployment, name: payment-api), spec.minReplicas: 2, spec.maxReplicas: 20, spec.metrics: type: Resource, resource.name: cpu, resource.target.type: Utilization, resource.target.averageUtilization: 70. Custom metrics HPA (Kafka lag): metrics: type: External, external.metric.name: kafka_consumer_lag, external.target.type: Value, external.target.value: 1000. This scales pods when Kafka consumer lag exceeds 1000 messages. Prerequisites: metrics-server installed (for CPU/memory), Prometheus adapter or KEDA (for custom metrics). Always set resource requests: HPA cannot calculate utilization percentage without them.
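A minimal Python sketch of the scaling calculation, reading "current" in the formula as the current replica count. The 10% tolerance band is the controller's default and is an addition not mentioned above; the function name is hypothetical, and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """ceil(current_replicas * current/target), clamped to [min, max]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# payment-api example: target 70% CPU, 2 to 20 replicas.
print(desired_replicas(4, 90, 70, 2, 20))    # 6
print(desired_replicas(4, 72, 70, 2, 20))    # 4 (inside tolerance)
print(desired_replicas(10, 35, 70, 2, 20))   # 5
print(desired_replicas(10, 210, 70, 2, 20))  # 20 (capped at maxReplicas)
```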
Kubernetes · ENGINEER
How does Kubernetes networking work? How does external traffic reach a pod?
External traffic flow in AKS: 1. DNS resolves app.company.com to Azure Load Balancer public IP. 2. Azure Load Balancer receives traffic on port 443, forwards to NGINX Ingress Controller pod (NodePort or LoadBalancer service). 3. NGINX Ingress Controller reads Ingress rules — matches hostname and path, selects backend Service. 4. Service (ClusterIP) has a virtual IP. kube-proxy maintains iptables rules that distribute traffic to matching pod IPs. 5. Request reaches the pod. Pod-to-pod communication: pods get IPs from the pod CIDR. Azure CNI: pods get IPs from the actual VNet subnet — each pod is a first-class citizen in the network. Flannel/Calico: overlay network, pods get virtual IPs, NAT at the node boundary. Service discovery: pods reach other services by DNS name: payment-api.production.svc.cluster.local. CoreDNS resolves this to the ClusterIP. Key insight: the Service IP is virtual — it never actually appears on any network interface. It is implemented entirely via iptables/eBPF rules on each node.
Kubernetes · PRODUCTION
Pod is in OOMKilled state. How do you diagnose and fix?
OOMKilled means the container exceeded its memory limit and the Linux OOM killer terminated it. Diagnose: kubectl describe pod mypod and look for Last State: Terminated, Reason: OOMKilled. kubectl top pod mypod --containers shows current memory usage. Check if the memory limit is set too low: kubectl get pod mypod -o jsonpath='{.spec.containers[0].resources}'. Check application logs before the kill: kubectl logs mypod --previous. Check if it is a memory leak: watch -n5 kubectl top pod mypod. If memory grows continuously without levelling off, it is a leak. Fix options: Increase the memory limit: edit the Deployment resources.limits.memory. But first check that the increase is justified: if usage levels off and the app just needs more memory, increase it. If memory grows continuously, investigate the code for a memory leak instead. JVM applications: set explicit heap size with the -Xmx flag to control maximum heap. Set limits 20-30% above maximum heap to account for JVM overhead (metaspace, thread stacks, native memory). Python applications: check for unbounded in-memory caches and accumulating lists. Add Prometheus memory metrics to track allocation over time.
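The JVM sizing rule above works out as follows. A minimal Python sketch, assuming the 20-30% headroom is applied on top of -Xmx; the function name is hypothetical and the right percentage depends on the application's real non-heap usage.

```python
import math


def memory_limit_mib(max_heap_mib: int, overhead_pct: int = 25) -> int:
    """Container memory limit for a given -Xmx, with non-heap headroom."""
    return math.ceil(max_heap_mib * (100 + overhead_pct) / 100)


# -Xmx2048m with 20%, 25% and 30% headroom.
print(memory_limit_mib(2048, 20))  # 2458
print(memory_limit_mib(2048, 25))  # 2560
print(memory_limit_mib(2048, 30))  # 2663
```

So a 2 GiB heap wants a limit of roughly 2.5Gi; setting the limit equal to -Xmx is a common cause of OOMKilled on otherwise healthy JVM pods.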
Kubernetes · ENGINEER
What are PersistentVolumes and PersistentVolumeClaims?
PersistentVolume (PV): a piece of storage provisioned in the cluster. Can be backed by Azure Disk, Azure Files, NFS, or other storage. Has capacity, access mode, and reclaim policy. Like a physical disk that exists independently of any pod. PersistentVolumeClaim (PVC): a request for storage by a user/pod. Specifies: how much storage (5Gi), access mode (ReadWriteOnce, ReadWriteMany), and optionally a StorageClass. Kubernetes binds a PVC to a matching PV. Pod uses the PVC as a volume. StorageClass: defines the type of storage to dynamically provision. Azure Disk StorageClass provisions a new Azure Managed Disk when a PVC is created. No pre-provisioning needed. Access modes: ReadWriteOnce (one node at a time — Azure Disk), ReadWriteMany (multiple nodes simultaneously — Azure Files, NFS), ReadOnlyMany. StatefulSet + PVCs: volumeClaimTemplates in StatefulSet creates one PVC per pod replica. mongo-0 gets its own disk, mongo-1 gets its own disk. They are not shared. Lifecycle: pods are deleted, PVCs remain (data preserved). PVCs are deleted, PVs are released (reclaim policy: Retain = keep data, Delete = delete the disk).
Kubernetes · ARCHITECT
How do you implement Kubernetes network policies for micro-segmentation?
Network policies use pod/namespace label selectors to control traffic. Default K8s: all pods can talk to all pods (no isolation). Default deny all: apiVersion: networking.k8s.io/v1, kind: NetworkPolicy, metadata.namespace: production, spec.podSelector: {} (matches all pods), spec.policyTypes: [Ingress, Egress] with empty rules. Now no pod in the namespace can receive or send traffic. Explicit allow, only payment-api can talk to postgres: spec.podSelector.matchLabels: app=postgres, ingress.from.podSelector.matchLabels: app=payment-api, ports: 5432. Only ingress controller can send external traffic to the app pods: spec.podSelector.matchLabels selecting the app pods, ingress.from.namespaceSelector.matchLabels: kubernetes.io/metadata.name=ingress-nginx. Allow DNS (critical, without this pods cannot resolve names): spec.podSelector: {}, egress.ports: 53/UDP and 53/TCP. Allow monitoring (Prometheus scrape): ingress on all pods from the monitoring namespace on the application's metrics port (9090 in this example). For L7 policies (HTTP path/method level): use Cilium CiliumNetworkPolicy, which can allow only POST to /api/v1/payment from order-service. Standard NetworkPolicy cannot do this.
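A minimal Python sketch of how default-deny plus an explicit allow combine for ingress. It is a toy evaluator under stated assumptions: one namespace, matchLabels only, ingress only, and a dictionary layout invented for illustration rather than the real NetworkPolicy schema.

```python
def _selects(selector, labels):
    # An empty selector {} matches every pod.
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(dst_labels, src_labels, port, policies):
    selecting = [p for p in policies if _selects(p["podSelector"], dst_labels)]
    if not selecting:
        return True  # no policy selects the pod: traffic is allowed
    # Policies are additive: any matching rule in any policy allows the traffic.
    return any(
        _selects(rule["from"], src_labels) and port in rule["ports"]
        for policy in selecting
        for rule in policy.get("ingress", [])
    )


policies = [
    {"podSelector": {}, "ingress": []},  # default deny
    {"podSelector": {"app": "postgres"},
     "ingress": [{"from": {"app": "payment-api"}, "ports": [5432]}]},
]

postgres = {"app": "postgres"}
print(ingress_allowed(postgres, {"app": "payment-api"}, 5432, policies))  # True
print(ingress_allowed(postgres, {"app": "order-api"}, 5432, policies))    # False
print(ingress_allowed(postgres, {"app": "payment-api"}, 22, policies))    # False
```

The default-deny policy never blocks anything by itself; it only makes every pod "selected", so that traffic without a matching allow rule is dropped.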
Kubernetes · BEGINNER
What is the difference between a Deployment, StatefulSet, and DaemonSet?
Deployment: manages stateless pods. Pods are interchangeable, get random names (app-7d9f-xyz), can be scheduled on any node. Rolling updates replace pods one at a time. Use for: web APIs, microservices, anything stateless. StatefulSet: manages stateful pods needing stable identity. Pods get ordered stable names (mongo-0, mongo-1, mongo-2) that persist across restarts. Each pod gets its own PVC via volumeClaimTemplates. Pods start/stop in order. DNS: pod-0.service.namespace. Use for: databases, message brokers, any app that knows its own identity. DaemonSet: ensures one pod runs on every node (or selected nodes). When new node added, DaemonSet pod automatically scheduled on it. When node removed, pod garbage collected. Use for: log collectors (Fluentd, Promtail), monitoring agents (Datadog agent, node-exporter), network plugins (Cilium, Calico), security agents (Falco). The key distinction: Deployment = any pod anywhere, StatefulSet = specific pod with stable identity, DaemonSet = one per node.
Kubernetes · ENGINEER
How do Kubernetes probes work? Liveness, Readiness, Startup.
Readiness probe: determines when the pod is ready to RECEIVE TRAFFIC. Until the readiness probe passes, the pod is kept out of the Service endpoints. Traffic is not sent to it. Use for: apps that take time to load config, warm up caches, or establish database connections. If readiness fails: pod stays Running but gets no traffic. Kubernetes does NOT restart it. Liveness probe: determines if the container is ALIVE (not stuck in a deadlock, infinite loop, or corrupted state). If liveness fails: Kubernetes restarts the container. Use for: catching application freezes that do not crash the process. Startup probe: used for slow-starting containers (Java apps, legacy apps). Liveness and readiness probes are disabled until the startup probe succeeds. Prevents liveness from killing a slow-starting pod before it finishes initialising. Without a startup probe: a Java app taking 90 seconds to start would be killed by liveness after about 30 seconds with default settings (periodSeconds 10 x failureThreshold 3) before it could serve traffic. Types: httpGet (HTTP 200-399 = success), tcpSocket (connection accepted = success), exec (command exit code 0 = success). Always tune: initialDelaySeconds, periodSeconds, failureThreshold to match your app's actual startup and response characteristics.
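The timing behind the 90-second Java example can be worked out directly. A minimal Python sketch, assuming Kubernetes' default probe settings (periodSeconds 10, failureThreshold 3) and ignoring probe timeouts; the function names are hypothetical.

```python
import math


def seconds_before_restart(period_seconds=10, failure_threshold=3,
                           initial_delay_seconds=0):
    """Roughly how long a never-passing liveness probe waits before a restart."""
    return initial_delay_seconds + period_seconds * failure_threshold


def startup_failure_threshold(max_startup_seconds, period_seconds=10):
    """Smallest failureThreshold that covers the app's worst-case startup."""
    return math.ceil(max_startup_seconds / period_seconds)


print(seconds_before_restart())        # 30: too short for a 90-second startup
print(startup_failure_threshold(90))   # 9 probes at 10s covers 90 seconds
print(seconds_before_restart(10, 12))  # 120: a startup budget with headroom
```

In practice the startup probe gets failureThreshold x periodSeconds a little above the slowest observed start, while liveness keeps tight settings for the steady state.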
Kubernetes · ENGINEER
Explain ConfigMaps and Secrets. When do you use each?
ConfigMap: stores non-sensitive configuration as key-value pairs. Environment variables, config files, command-line arguments. Not encrypted at rest by default. Use for: application config (log level, feature flags, database hostnames). kubectl create configmap app-config --from-literal=LOG_LEVEL=INFO --from-file=app.properties. Secret: stores sensitive data, base64 encoded (base64 is encoding, not encryption; Secrets are not encrypted in etcd by default, so enable encryption at rest for real protection). Use for: passwords, tokens, certificates, API keys. kubectl create secret generic db-creds --from-literal=password=mypassword. Better: use Azure Key Vault with the Secrets Store CSI driver. By default the secrets are NOT stored in etcd at all, they are fetched from Key Vault and mounted as files. Consumption options: as environment variables (env.valueFrom.configMapKeyRef or secretKeyRef), as volume files (mount config as /etc/config/app.properties), as a volume for secrets (mount at /var/secrets/). Best practice: avoid exposing secrets as environment variables. They are readable from /proc/<pid>/environ inside the container, inherited by every child process, and tend to leak into crash dumps and debug logs. Prefer mounting as files. Use Workload Identity to fetch from Key Vault dynamically.
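To see why base64 offers no protection, here is a minimal sketch that encodes the example password the same way a Secret manifest stores it and decodes it back without any key. Anyone who can read the Secret object (or an etcd backup) can do the same.

```python
import base64

# What ends up in the Secret's data field for password=mypassword.
encoded = base64.b64encode(b"mypassword").decode("ascii")

# Decoding needs no key or credential: it is a reversible encoding.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded)
```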
Kubernetes · PRODUCTION
How do you handle application rollback in Kubernetes?
Deployment rollback: kubectl rollout undo deployment/myapp -n production. Kubernetes rolls back to the previous ReplicaSet. Check history first: kubectl rollout history deployment/myapp shows all revisions with their CHANGE-CAUSE (set the kubernetes.io/change-cause annotation on each rollout; the old --record flag is deprecated). Roll back to specific version: kubectl rollout undo deployment/myapp --to-revision=3. Monitor rollback: kubectl rollout status deployment/myapp --watch. Helm rollback: helm rollback myapp [REVISION] -n production. helm history myapp shows all releases. After rollback: investigate why the new version failed BEFORE attempting another deployment. Root cause analysis: kubectl logs from the failed pods (add --previous for containers that crashed), kubectl describe pod for events, check monitoring dashboards for the exact minute deployment started. Prevention: use --atomic in helm upgrade (auto-rollback if the release fails within the timeout). Use ArgoCD: OutOfSync detection and one-click rollback via UI. Blue-green: instant rollback by switching traffic back to blue. Canary: if error rate increases at 10% traffic, do not promote to 30%; roll back automatically. Key point: rollback is not failure. Rollback + investigation + fix + re-deploy is the right pattern. Never leave a broken version deployed.
Kubernetes · SENIOR
Production K8s cluster: 40% pods evicted, 2am page. Walk through exactly what you do.
Minute 0-2 — assess without touching anything: kubectl get nodes (are nodes healthy?), kubectl get pods -A | grep -v Running (full picture), kubectl top nodes (CPU/memory pressure?), kubectl get events -A --sort-by=.lastTimestamp | tail -30 (what just happened?). Minute 2-5 — identify trigger: if nodes show MemoryPressure or DiskPressure → eviction is expected, kubelet protecting the node, fix: add nodes or reduce pod density. If sudden → check what changed, was there a deployment 30 min before? Minute 5-10 — triage by severity: are any critical services (payment, auth) affected? If yes → declare incident immediately, do not solo-debug. Minute 10-20 — fix based on root cause: resource pressure → kubectl cordon the struggling node, drain pods to healthy nodes. Quota issue → check ResourceQuota. Node NotReady → check kubelet logs on the node. Key principle: never touch a production cluster blindly. Understand before acting. Communicate blast radius within 5 minutes of being paged.
Kubernetes · SENIOR
How do you design a K8s cluster for 99.9% SLA with 50+ microservices?
Multi-AZ worker nodes — minimum 3 nodes across 3 availability zones so one zone failure does not bring the cluster down. Pod Anti-Affinity rules — ensure no two replicas of the same service land on the same node. topologySpreadConstraints with maxSkew:1. Resource requests and limits on every pod — prevents noisy neighbours starving other services. HPA on every stateless service — scale based on CPU (70% threshold) or custom metrics (Kafka lag, queue depth). PodDisruptionBudgets — minAvailable:1 ensures at least one pod always running during node drains. ReadinessProbes — pod only gets traffic when genuinely ready. LivenessProbes — restart pods that are alive but not serving. Cluster Autoscaler — nodes scale with pod demand. Pre-production load testing — verify the HPA and CA work before peak traffic. At HPE I managed a 50+ pod OpenShift cluster serving 3 telecom provisioning systems at 99.9% SLA using exactly this pattern.
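It helps to turn the SLA percentage into an error budget before sizing replicas and PodDisruptionBudgets. The sketch below is plain arithmetic; the function name is illustrative and not part of any Kubernetes tooling.

```python
def downtime_budget_minutes(sla_percent, period_days):
    """Minutes of downtime allowed over a period at a given SLA."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


monthly = downtime_budget_minutes(99.9, 30)    # 30-day month
yearly = downtime_budget_minutes(99.9, 365)    # one year

print(f"99.9% over 30 days: {monthly:.1f} minutes")
print(f"99.9% over 365 days: {yearly / 60:.2f} hours")
```

At 99.9% the budget is roughly 43 minutes per 30-day month, so a single slow node drain or a failed rollout without readiness probes can consume most of it.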
Kubernetes · ARCHITECT
Explain RBAC in Kubernetes. How did you implement it in production?
RBAC has four objects. Role: namespace-scoped, defines which verbs (get/list/watch/create/delete) on which resources (pods/deployments/secrets). ClusterRole: same but cluster-wide. RoleBinding: attaches a Role to a subject (User, Group, ServiceAccount) in a namespace. ClusterRoleBinding: attaches ClusterRole cluster-wide. Production pattern at HPE: developers got a Role in their team namespace with get/list/watch on pods, deployments, services — no create/delete in production. The deployment service account got create/update on deployments only. No human had cluster-admin except the platform team. Integration with Azure AD: we used AKS AAD integration — engineers authenticated with their corporate credentials, mapped to K8s groups via ClusterRoleBindings. The developer AD group mapped to view ClusterRole. The release team AD group mapped to a custom deploy Role. Audit: kubectl get rolebindings,clusterrolebindings -A shows the entire access matrix.
⚙️ Terraform - State, Drift, Workspaces, Security Scanning (20 Q&As)
Terraform · ARCHITECT
How do you structure Terraform for a team of 10 engineers managing 5 environments?
Monorepo with three sections: modules/ (reusable building blocks versioned with Git tags), environments/ (one subfolder per environment with its own backend.tf and terraform.tfvars), and shared/ (data sources common across environments). Governance rules: no direct resource creation outside modules — everything through a module for consistency. Module changes require PR review from 2 engineers. Environments are promoted in order: dev → staging → uat → prod. Pipeline runs terraform plan on every PR so engineers see the infrastructure diff before merging — same as they see code diffs. State files in Azure Storage with separate containers per environment, locking enabled. Result at HPE: reduced infrastructure drift by 80% compared to ad-hoc portal changes.
Terraform · ENGINEER
What is Terraform state and why is losing it catastrophic?
Terraform state is Terraform's memory: it maps your HCL configuration to real cloud resources, including all cloud-assigned properties (resource IDs, IP addresses, connection strings). On every plan and apply, Terraform reads state to understand what exists and calculates only the diff. Losing state means Terraform no longer knows what it manages. Running terraform apply after losing state tries to create everything again: it either fails because resources already exist or creates duplicates. Worse, the existing resources are orphaned: terraform destroy and terraform plan -destroy see an empty state and do nothing, so every resource has to be re-imported or cleaned up by hand. Prevention: always remote state (Azure Storage or S3), never local. Enable blob versioning on Azure Storage for point-in-time recovery. Enable DynamoDB state locking for S3 to prevent concurrent applies corrupting state. Back up state files regularly. At HPE: weekly automated backup of all state files to a separate storage account in a different region.
Terraform · PRODUCTION
Production Terraform apply destroyed a resource that should not have been destroyed. What happened and how do you recover?
This happens from a few causes: resource renamed in code (Terraform sees destroy old + create new), for_each key changed, an attribute change that forces replacement, or someone ran terraform apply without reviewing the plan carefully. Immediate recovery: check if the resource was deleted or replaced. If deleted and the resource has backups (RDS snapshot, Azure disk snapshot): restore from backup, then terraform import the restored resource into state. If the resource was infrastructure (VNet, subnet): recreate via terraform apply. K8s pods will reschedule, services will recover. Post-incident actions: enforce a plan review step before apply, and require PR approval before Atlantis will apply to prod (apply_requirements: [approved] in the repo config). Treat -target as an exceptional escape hatch for limiting blast radius, not as a routine safeguard. Prevention: terraform plan -out=tfplan always, apply only the saved plan, never allow apply without reviewing plan output. In CI/CD: never auto-apply to production, always require manual approval after reviewing the plan. We had this at HPE: a VNet was recreated after a module refactor. Recovery took 4 hours. After this we added terraform state mv (or moved block) steps to our refactoring runbook.
Terraform · ARCHITECT
What is Terraform drift and how do you detect and prevent it at enterprise scale?
Drift occurs when real infrastructure diverges from Terraform state, usually from manual console changes, auto-scaling, or provider-initiated changes. Detection: scheduled nightly Jenkins job runs terraform plan -refresh-only -detailed-exitcode. Exit code 0 = no changes, 1 = error, 2 = changes detected = drift. Alert sent to Slack. Review in the morning and decide: was this change intentional (update the .tf files to match, and import anything that was created outside Terraform) or unauthorized (revert with terraform apply). Prevention: Azure Policy / AWS SCP denying resource modifications except from the Terraform service principal. Every infrastructure change must be a PR with plan output, no exceptions for production. Atlantis automates this: it plans on every PR and applies from an atlantis apply comment once the PR is approved, after which the PR is merged. For compliance-heavy environments (banking, telco): every resource has a mandatory terraform managed tag, and Azure Policy alerts on any resource without this tag, which indicates it was created outside Terraform.
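A nightly drift job mostly comes down to interpreting the -detailed-exitcode result. The following is a minimal sketch, assuming the terraform binary is on PATH and the working directory is already initialised. The function names and the workdir parameter are hypothetical, and the Slack alerting step is left out.

```python
import subprocess


def classify_plan_exit(code):
    """Map a `terraform plan -detailed-exitcode` exit code to a status."""
    if code == 0:
        return "clean"   # no changes
    if code == 2:
        return "drift"   # plan succeeded and found changes
    return "error"       # 1 (or anything else): the plan itself failed


def run_drift_check(workdir):
    """Run a refresh-only plan in workdir and classify the result.

    Not called here: it needs terraform and real backend credentials.
    """
    result = subprocess.run(
        ["terraform", "plan", "-refresh-only", "-detailed-exitcode",
         "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode), result.stdout


print(classify_plan_exit(2))
```

Keeping the classification separate from the subprocess call makes the alerting logic testable without touching a real backend.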
Terraform · ENGINEER
Explain count vs for_each in Terraform. When should you use each?
count creates a list of resources numbered by index: count = 3 creates resource[0], resource[1], resource[2]. Problem: if you remove the first element of the underlying list, indices shift: the old resource[1] value now sits at resource[0], the old resource[2] value at resource[1]. Terraform sees this as change resource[0], change resource[1], destroy resource[2]. You wanted to delete one, but Terraform modifies two (or destroys and recreates them if the changed attribute forces replacement) and destroys the last one. for_each creates a map keyed by string: for_each = { "east" = ..., "west" = ... } creates resource["east"] and resource["west"]. Deleting "east" from the map only deletes resource["east"], "west" is untouched. Rule: always use for_each for real production resources. Use count only for on/off patterns (count = var.enable_feature ? 1 : 0). Real consequence: we had 5 subnets created with count at HPE. Removing subnet index 2 caused Terraform to plan destroying and recreating subnets 3 and 4, which had running K8s nodes. Caught in plan review, avoided production incident.
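The index-shift problem is easy to reproduce without Terraform. The sketch below is a toy model, not Terraform's actual planner: it addresses five hypothetical subnets first by index (count) and then by key (for_each), removes the middle one, and diffs the address maps.

```python
def count_addresses(items):
    """Address resources by list index, as count does."""
    return {f"subnet[{i}]": name for i, name in enumerate(items)}


def for_each_addresses(items):
    """Address resources by string key, as for_each does."""
    return {f'subnet["{name}"]': name for name in items}


def plan(old, new):
    """Diff two address -> value maps into per-address actions."""
    actions = {}
    for addr in sorted(old.keys() | new.keys()):
        if addr not in new:
            actions[addr] = "destroy"
        elif addr not in old:
            actions[addr] = "create"
        elif old[addr] != new[addr]:
            actions[addr] = "change"
    return actions


before = ["a", "b", "c", "d", "e"]
after = ["a", "b", "d", "e"]  # subnet "c" (index 2) removed

count_plan = plan(count_addresses(before), count_addresses(after))
for_each_plan = plan(for_each_addresses(before), for_each_addresses(after))

print(count_plan)
print(for_each_plan)
```

With count, removing one element touches three addresses (indices 2 and 3 change, index 4 is destroyed). With for_each, only the removed key is destroyed.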
Terraform · ARCHITECT
How do you manage secrets in Terraform? What are the risks?
Three layers of risk: 1) Secrets in .tf files committed to Git: visible to everyone with repo access, permanent in Git history even after deletion. 2) Secrets in terraform.tfvars committed to Git: same problem. 3) Secrets in state file: Terraform stores all resource attributes in state, including sensitive fields like connection strings, passwords, private keys. State file is often stored in S3/Azure Storage with broad team access. Solutions in order of security: Mark variables sensitive = true: Terraform redacts them from plan output and logs, but the values are still written to state in plaintext. Use environment variables (TF_VAR_secret_name) in CI/CD pipelines, never hardcode. Read secrets from Key Vault data sources at apply time (data "azurerm_key_vault_secret"); the value read still lands in state. Use random_password resource to generate passwords and store them in Key Vault so no human ever handles the value; the generated result is also kept in state. Because none of these keep secrets out of state, protect the state file itself: encrypt at rest (Azure Storage server-side encryption, S3 with KMS), restrict access to the state container to only the Terraform service principal and SRE team.
Terraform · PRODUCTION
Someone created 50 Azure resources manually over 6 months. How do you bring them under Terraform without downtime?
This is a common enterprise adoption problem: "shadow IT" that needs governance. My approach: Phase 1: Inventory. Use az resource list to enumerate all manually created resources. Categorize by team and criticality. Phase 2: Write configuration. For each resource, write the Terraform HCL to match current settings. Use Azure Export for Terraform (the aztfexport tool) to auto-generate .tf files from existing resources, which saves around 80% of writing time. Phase 3: Import in non-production first. Pick the lowest-risk environment, run terraform import for each resource, then terraform plan to verify zero changes before proceeding. Phase 4: Production import. Import 5-10 resources at a time in off-peak hours. Import only changes state, not the resources, so there is no downtime as long as nothing is applied until the plan is clean. After each batch: terraform plan shows no changes = safe to proceed. Phase 5: Enforce going forward. After all resources imported: Azure Policy denying resource creation outside the Terraform service principal. Total timeline at HPE for 200 resources: 3 weeks with 2 engineers. Main risk: import succeeds but plan shows unexpected changes. Always review the plan before applying after import.
Terraform · ENGINEER
What is the difference between Terraform Cloud, Terraform Enterprise, and self-hosted Atlantis?
Terraform Cloud (SaaS, now branded HCP Terraform): HashiCorp-hosted, free tier available. Manages state, runs plans and applies in hosted runners, VCS integration, basic RBAC. Limitation: runners cannot reach private networks without agent setup. Good for: small-medium teams, public cloud without network restrictions. Terraform Enterprise (self-hosted): Runs in your own infrastructure. Full RBAC, SSO, audit logging, Sentinel policy as code, private network access. Good for: enterprises with compliance requirements, air-gapped environments. Cost: a commercial licence, generally the most expensive of the three options. Atlantis (open source, self-hosted): Simple Terraform pull request automation. Runs plan on PR and posts it as a bot comment, applies when someone comments atlantis apply on the approved PR, then the PR is merged. No state management, you bring your own backend. Good for: teams that want PR-driven Terraform without paying for Cloud/Enterprise. We use Atlantis at HPE: lightweight, free, runs inside our AKS cluster with access to private networks, full audit trail in GitHub.
Terraform · PRODUCTION
Terraform plan works but apply fails halfway through. 20 resources created, 5 failed. What do you do?
Partial apply is one of the most stressful Terraform situations. What happened: Terraform applies resources in dependency order. It created 20 and hit errors on the remaining 5. The state file records the 20 successfully created resources. The 5 failed ones are usually NOT in state, although a resource that was half-created may be recorded as tainted. First action: do NOT re-run terraform apply blindly. Terraform will not duplicate the 20 resources already in state, but re-running without understanding the error just repeats the failure, and anything half-created in the cloud without being recorded (RBAC assignments, DNS records, security groups) can collide with the retry. Investigation: check the error message carefully. Common causes: permission denied (service principal missing a role), resource name conflict (already exists manually), quota exceeded, API rate limit. Fix the root cause first. Then: terraform plan again to see the full picture. If the plan shows only creating the failed resources and no unexpected changes to the 20 already created: apply is safe. If the plan shows unexpected changes to existing resources: investigate state inconsistency with terraform state show. At HPE: we hit an Azure quota limit for public IPs. Fixed the quota, re-ran apply, it created only the remaining resources. The 20 already created were untouched.
Terraform · ARCHITECT
How do you handle Terraform provider version upgrades across 50 microservices?
Provider upgrades are high-risk in large organizations because provider changes can introduce breaking changes or plan diffs on unchanged resources. Strategy: pin all providers with version constraints in versions.tf — never allow automatic upgrades (no version = latest is dangerous). Use ~> 3.90 style: >= 3.90, < 4.0. Upgrade process: 1) Create a dedicated branch for the upgrade. 2) Update version constraint in one non-prod environment. 3) Run terraform init -upgrade to fetch new provider. 4) Run terraform plan — review ALL changes carefully. Even minor provider upgrades can cause plan diffs if provider fixes a bug in how it reads resource state. 5) Apply in dev, test for 1 week. 6) Roll out to staging and prod. For 50 microservices: use a module that sets the required_providers block — all callers inherit the constraint. When you update the module version, all services that upgrade to the new module version get the new provider constraint. This staggers the upgrade naturally through the promotion process.
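The ~> (pessimistic) operator is worth understanding precisely, because the number of components changes what it allows. The sketch below is an illustrative re-implementation for plain numeric versions only (no pre-release suffixes), not Terraform's own parser. It shows that ~> 3.90 means >= 3.90, < 4.0, while ~> 3.90.1 means >= 3.90.1, < 3.91.0.

```python
def parse(version):
    """Split '3.90.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def pad(parts, length=3):
    """Right-pad with zeros so tuples compare component by component."""
    return parts + (0,) * (length - len(parts))


def pessimistic_bounds(constraint):
    """Return (inclusive lower, exclusive upper) for a '~> X.Y[.Z]' constraint."""
    parts = parse(constraint.replace("~>", "").strip())
    if len(parts) < 2:
        raise ValueError("this sketch needs at least MAJOR.MINOR")
    # Only the rightmost component may grow: bump the one before it.
    upper = parts[:-2] + (parts[-2] + 1,)
    return parts, upper


def allows(constraint, version):
    lower, upper = pessimistic_bounds(constraint)
    return pad(lower) <= pad(parse(version)) < pad(upper)


print(allows("~> 3.90", "3.117.0"))   # minor upgrade within 3.x
print(allows("~> 3.90", "4.0.0"))     # next major
print(allows("~> 3.90.1", "3.91.0"))  # next minor with a 3-part constraint
```

A two-part constraint therefore still admits every later 3.x minor release, which is why the plan review step matters even for pinned providers.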
Terraform · ENGINEER
How do you structure Terraform for multiple environments and what are the best practices?
The recommended structure uses reusable modules with separate environment directories. Folder structure: modules/ folder contains reusable, environment-agnostic infrastructure definitions (modules/aks/, modules/network/, modules/database/). Each module takes variables as inputs and produces outputs. environments/ folder has a subdirectory per environment: environments/dev/, environments/staging/, environments/production/. Each environment's main.tf calls the modules with environment-specific values. Separate state per environment: each environment has its own backend.tf pointing to a different Azure Storage container or AWS S3 prefix. Never share state between environments — a corrupted state in dev should never affect production. tfvars files: dev.tfvars, staging.tfvars, production.tfvars contain the variable values per environment (VM sizes, replica counts, retention periods). CI/CD integration: dev pipeline runs on feature branches, applies automatically. Staging applies on merge to main. Production requires manual approval gate before apply. Always run terraform plan in the pipeline and store the plan output — apply only executes the reviewed plan, not a fresh plan. Remote state locking (via Azure Storage lease or DynamoDB) prevents two pipeline runs from applying simultaneously. Never allow terraform apply from a local machine against production.
Terraform · ENGINEER
What are the advantages of Bicep over ARM templates for Azure infrastructure?
Both Bicep and ARM Templates define Azure infrastructure declaratively, but Bicep is the modern replacement for ARM JSON. Syntax: ARM is verbose JSON with heavy nesting. Bicep is a domain-specific language that compiles to ARM — much cleaner syntax with 40-60% less code for the same resource. Readability: a 200-line ARM template is typically 80 lines in Bicep. String interpolation, conditional expressions, and loops are much more readable. Modules: Bicep has first-class module support — split infrastructure across multiple files, call modules like functions. ARM has linked templates but they require URL references and are harder to manage. Type safety: Bicep has strong typing and intellisense in VS Code. ARM JSON has limited validation. Tooling: Bicep CLI and VS Code extension provide instant feedback. No need to understand ARM's complex schema. Compilation: Bicep always compiles to ARM — you can decompile existing ARM templates to Bicep with bicep decompile. This lets you start with existing ARM templates and migrate gradually. When to choose Terraform over Bicep: Terraform manages multi-cloud infrastructure (Azure + AWS + GCP) in one codebase. Bicep is Azure-only. If your infrastructure is purely Azure, Bicep is excellent. If you have any multi-cloud, Terraform wins.
Terraform · ENGINEER
What is Terraform drift and how do you detect and fix it?
Drift occurs when infrastructure changes outside Terraform: someone modifies a security group manually, an auto-healing process changes a VM, or an Azure Policy enforces a setting. Detection: terraform plan refreshes state from the cloud APIs and compares the real resources with your configuration. Differences are shown as changes. Run terraform plan -detailed-exitcode regularly in a read-only CI job and alert on exit code 2 (changes present). Fix options: 1. Re-apply Terraform to restore desired state: terraform apply overwrites the manual change. 2. Accept the manual change: update the .tf code to match reality and run terraform apply -refresh-only so state reflects it. For a resource that was created entirely outside Terraform, import it first: terraform import azurerm_virtual_machine.web /subscriptions/.../virtualMachines/myVM. 3. Use lifecycle ignore_changes: lifecycle { ignore_changes = [tags] } tells Terraform to ignore specific attributes that change outside its control (e.g. auto-managed tags). Prevention: enforce "no console changes" policy. Use Azure Policy or SCPs (AWS) to block manual infrastructure changes in production. All changes through Terraform, all Terraform through CI/CD pipeline with PR review.
Terraform · ENGINEER
Explain Terraform workspaces. When do you use them vs separate state files?
Terraform workspaces allow multiple state files for the same configuration in the same directory. Default workspace is always "default". Create: terraform workspace new dev. Switch: terraform workspace select production. Reference in code: terraform.workspace variable. Example: resource_group_name = "rg-${terraform.workspace}". Use workspaces when: environments are structurally identical (same resources, just different names/sizes), small team, simple infrastructure. Limitations: all workspaces share the same backend config, no separation of access controls between workspaces. Separate directories + separate state files when: environments differ significantly (production has more redundancy, WAF, DR), different teams manage different environments, need different IAM permissions for production vs dev, want blast-radius isolation. At HPE I used separate directories per environment with separate state files in separate storage containers. Production state container had strict IAM — only CI/CD service principal could write. Developers could plan against dev but not apply to production. This is the enterprise pattern.
Terraform · PRODUCTION
You run terraform apply and it gets stuck. What do you do?
A stuck apply is worse than a failed apply: the state might be partially modified and locked. Step 1: check if it is genuinely stuck or just slow. AKS cluster creation takes 10-15 minutes. Azure SQL creation takes 5 minutes. Wait and check Azure Portal: is the resource being created? Step 2: check the state lock. If another process holds the lock, your apply fails with a lock error (or waits, if -lock-timeout is set). With the azurerm backend the lock is a lease on the state blob, so check the blob's lease state in Azure Storage; the lock error also prints the lock ID and who acquired it. If the locking process crashed, the lock is orphaned. Force-unlock: terraform force-unlock LOCK_ID. Only do this if you are CERTAIN no other process is applying. Step 3: if a specific resource is stuck: check Azure Activity Log for the resource. Azure might be returning an error that Terraform is retrying. Sometimes Azure has transient issues. Step 4: if apply was interrupted: run terraform plan again to see what changed and what remains. The state file reflects what was successfully created before the interruption. Run apply again to complete the remaining resources. Prevention: set -parallelism=5 (default 10) to reduce concurrent API calls and avoid rate limiting. Use timeouts in resource configs: timeouts { create = "30m" }.
Terraform · ENGINEER
Where should you store Terraform code and state? Best practices.
Terraform CODE in Git: version controlled, PR reviewed, all changes audited. One repo per major infrastructure domain or per cloud (infra-azure/, infra-aws/). Branching: main = production, feature branches for changes, PR required to merge. Never commit .tfstate or .tfvars with secrets to Git. Add to .gitignore: *.tfstate, *.tfstate.backup, .terraform/, *.tfvars (if containing secrets). Terraform STATE in remote backend: Azure: Azure Storage Account, Blob Container, terraform.tfstate file. backend "azurerm" { resource_group_name = "tfstate-rg", storage_account_name = "tfstateXXX", container_name = "tfstate", key = "production.tfstate" }. AWS: S3 bucket with versioning enabled. State locking: AzureRM backend uses Azure Blob lease for locking — prevents two applies running simultaneously. Separate state per environment: dev has its own tfstate, staging has its own, production has its own. This means a failed plan in staging cannot corrupt production state. Access control: production state storage account locked to CI/CD pipeline service principal and senior engineers only. Developers can read staging state, not production.
Terraform · ENGINEER
How does Terraform code scanning work in a pipeline?
Static analysis tools for Terraform security and compliance scanning: tfsec: scans Terraform code for security misconfigurations. Runs in CI pipeline before terraform plan. Example findings: storage account allows public access, security group allows 0.0.0.0/0 on port 22, AKS cluster has RBAC disabled. Checkov: policy-as-code scanner. Checks against CIS benchmarks for Azure, AWS, GCP. Outputs pass/fail per check with remediation advice. Pipeline integration: - stage: Security_Scan steps: - script: pip install checkov && checkov -d . --framework terraform --compact --quiet. Fail the pipeline if HIGH severity issues found. tflint: linting and provider-specific rules. Catches syntax errors, deprecated attributes, invalid resource configurations. Infracost: calculates cost of terraform plan output — see the cost impact before applying. Workflow: git push → Terraform lint (tflint) → Security scan (tfsec + Checkov) → terraform fmt check → terraform validate → terraform plan (in CI) → review plan output → manual approval → terraform apply (in CD). All scan results stored as pipeline artifacts for audit.
Terraform · ARCHITECT
How do you prevent Terraform from accidentally destroying production resources?
Multiple defence layers: Layer 1: lifecycle block: lifecycle { prevent_destroy = true } on critical resources (databases, VNets, AKS clusters). Terraform errors if it tries to destroy. Requires explicit removal of the block before any destruction. Layer 2: state separation: production has its own state file. Developers cannot run terraform apply in production, only the CI/CD pipeline can. Layer 3: plan review gate: CI pipeline runs terraform plan and posts output as PR comment. Any destroy shows up highlighted. Required reviewer must approve. No auto-apply of destroys. Layer 4: workspace/folder protection: production Terraform files in a separate directory. Branch protection rules prevent direct commits. Layer 5: IAM restrictions: production service principal only has Contributor on the specific resource group, not subscription-wide, so a mistake cannot reach other resource groups. Contributor can still delete that group and everything in it, which is why the next layer matters. Layer 6: Azure resource locks: az lock create --name "production-lock" --lock-type CanNotDelete --resource-group production-rg. Even the Owner role cannot delete locked resources without first removing the lock, which generates an audit trail. Layer 7: backup: Azure Backup enabled on all databases. Even if destroyed, recovery is possible. Defence in depth: the failure of any single layer is not catastrophic.
Terraform · SENIOR
You run terraform plan and see 15 resources will be destroyed. The team panics. What do you do?
Step 1: do not apply. Read every destroy line carefully. Terraform destroys when: a resource was renamed (sees old name as deleted, new as created; use a moved block to fix without recreation). A required attribute changed that forces replacement (some attributes like AKS node pool OS disk type cannot change in-place). The resource was changed or recreated outside Terraform and the state no longer matches. Step 2: for each destroy, determine if it is intentional. If database or network resource: almost certainly a mistake. If an old unused resource: may be fine. Step 3: fix the cause not the symptom. Renamed resource: add a moved block with from = aws_instance.old and to = aws_instance.new (each argument on its own line), and Terraform updates state without destroying. Attribute change forcing replacement: check if you can use lifecycle { ignore_changes = [disk_type] } or if a blue-green replacement is acceptable. Step 4: add lifecycle { prevent_destroy = true } on all production databases, networks, and stateful resources. This makes Terraform error if it ever tries to destroy them, requiring explicit removal of the block first. Step 5: in CI/CD, require manual approval for any plan containing destroy. Never auto-apply destroys.
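The "manual approval for any plan containing destroy" gate can be automated by inspecting the machine-readable plan (terraform plan -out=tfplan, then terraform show -json tfplan). The sketch below assumes that JSON has already been loaded. The sample document is hand-written for illustration, with hypothetical resource addresses, and only includes the fields the function reads.

```python
import json


def destructive_changes(plan):
    """List (address, kind) for every resource the plan deletes or replaces."""
    hits = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            # delete + create in the same change means a replacement
            kind = "replace" if "create" in actions else "destroy"
            hits.append((rc["address"], kind))
    return hits


# Hand-written sample in the shape of `terraform show -json` output.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "azurerm_resource_group.main",
     "change": {"actions": ["no-op"]}},
    {"address": "azurerm_subnet.old",
     "change": {"actions": ["delete"]}},
    {"address": "azurerm_kubernetes_cluster.main",
     "change": {"actions": ["delete", "create"]}},
    {"address": "azurerm_storage_account.logs",
     "change": {"actions": ["update"]}}
  ]
}
""")

found = destructive_changes(sample_plan)
for address, kind in found:
    print(f"{kind}: {address}")
```

A pipeline could fail the stage, or route it to a manual approval, whenever this list is non-empty.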
Terraform · ARCHITECT
How do you structure Terraform for multiple environments and multiple cloud regions?
Structure I have used in production at HPE. Three layers: Layer 1 — modules/ directory. Reusable components: modules/aks-cluster/, modules/networking/, modules/monitoring/. Each module has variables.tf (inputs), main.tf (resources), outputs.tf (exposed values). No environment-specific code here. Layer 2 — environments/ directory. environments/dev/, environments/staging/, environments/production/. Each environment directory calls the modules with environment-specific variable values. environments/production/main.tf: module "aks" { source = "../../modules/aks-cluster"; node_count = 5; vm_size = "Standard_D4s_v3" }. Layer 3 — separate state per environment. Each environment has its own backend.tf pointing to a different storage container: dev state in dev-tfstate storage, production state in prod-tfstate storage with stricter access controls. Multi-region: create a subdirectory per region: environments/production/eastus/, environments/production/westeurope/. Each has its own state. Cross-region resources (like Azure Traffic Manager) go in environments/production/global/. Key rule: production state is read-only for most engineers. Only the pipeline service principal and a small group can run terraform apply in production.
🔧 Jenkins - Parallel Stages, Shared Libraries, Scale (18 Q&As)
Jenkins · ARCHITECT
How do you design Jenkins for 50 microservices without code duplication?
Shared Library — this is the answer that separates senior engineers. Create a Git repo for shared library with vars/ folder containing Groovy functions: buildAndPush.groovy, deployToK8s.groovy, runTests.groovy, slackNotify.groovy. Register in Jenkins: Manage Jenkins → Global Pipeline Libraries. Each microservice Jenkinsfile pins to a version: @Library('jenkins-shared-library@v2.1.0'). Entire service Jenkinsfile is 25 lines — just calls shared functions. When you add Trivy scanning: update buildAndPush.groovy once, tag v2.2.0, update one line in each service. Done in 30 minutes. Without this: 50 Jenkinsfiles to edit, 3 days minimum, inevitable inconsistency. At HPE: 15 services, one shared library. Pipeline change that took 3 days now takes 30 minutes.
Jenkins · ENGINEER
What is the difference between Declarative and Scripted Jenkins pipelines?
Declarative pipeline has a strict validated structure: pipeline → agent → options → stages → stage → steps. Structure is checked before execution — syntax errors caught immediately, not mid-run. Reads cleanly. Enforces best practices. Use for: 95% of real-world pipelines. Scripted pipeline is pure Groovy in a node block. Full programming language power — any control flow, any data structure, dynamic stage generation. Use for: complex dynamic pipelines, generating stages based on runtime data, iterating over dynamic deployment target lists. Rule: start Declarative. When you hit a genuine limitation, add a script block inside Declarative. Only go full Scripted for genuinely complex platform engineering pipelines.
Jenkins · ENGINEER
Explain Jenkins agents. What types exist and when do you use each?
Agents are where builds run — master only orchestrates. Four types: Permanent SSH agent — dedicated VM, Jenkins SSHes in. Always available, zero startup time. Downside: idle cost 24/7, environment drift as packages accumulate. Good for simple setups. Docker agent — each build in a fresh container. Clean and reproducible. Requires Docker daemon. Kubernetes pod agent — pod created per build, destroyed after. Zero idle cost, unlimited scale, Kubernetes handles scheduling. 10-30 second startup overhead. Production standard for cloud-native teams. Multi-container pod — specialist containers in one pod: Maven for compilation, Docker-in-Docker for image building, Trivy for scanning, all sharing the same workspace volume. My recommendation: Kubernetes agents for any team running on K8s. Zero idle cost, perfect reproducibility, auto-scaling. Combined with KEDA: agent pool scales to zero overnight.
Jenkins · PRODUCTION
Jenkins build is queued and never starts. How do you debug it?
Systematic approach. Step 1: check if agents are online. Manage Jenkins → Nodes. Any agent showing offline means builds for that label queue indefinitely. Step 2: click the queued build → Why is this blocked? Jenkins tells you exactly: label mismatch (pipeline says linux-k8s but agent has label linux), all agents busy (increase executor count or add more agents). Step 3: if Kubernetes agents, check pods. kubectl get pods -n jenkins shows pending agent pods. kubectl describe pod shows why: Insufficient CPU, insufficient memory, image pull error, node selector constraints. Step 4: check resource quotas: kubectl describe resourcequota -n jenkins — namespace may have hit CPU ceiling. Step 5: check Jenkins logs: Manage Jenkins → System Log for provisioning errors. Most common causes in order: label mismatch between pipeline and agent, all agents at executor limit, K8s pod stuck due to resource constraints, agent VM rebooted and disconnected.
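The first two checks above can be sketched as a small decision function. This is a simplified model, assuming each agent is a plain dict with hypothetical keys `labels`, `online`, `executors` and `busy`; real Jenkins label expressions are richer than a single-label membership test.

```python
def queue_reason(required_label, agents):
    """Explain why a build needing `required_label` is (or is not) stuck.

    Simplified model: a pipeline asks for exactly one label, and an agent
    matches if that label is in its label set.
    """
    matching = [a for a in agents if required_label in a["labels"]]
    if not matching:
        return "label mismatch"
    online = [a for a in matching if a["online"]]
    if not online:
        return "matching agents offline"
    if all(a["busy"] >= a["executors"] for a in online):
        return "all executors busy"
    return "schedulable"


# Hypothetical agent inventory used only for illustration.
agents = [
    {"name": "vm-1", "labels": {"linux"}, "online": True, "executors": 2, "busy": 2},
    {"name": "vm-2", "labels": {"linux", "docker"}, "online": False, "executors": 2, "busy": 0},
]

print(queue_reason("linux-k8s", agents))  # no agent carries this label
print(queue_reason("docker", agents))     # only vm-2 matches, and it is offline
print(queue_reason("linux", agents))      # vm-1 is online but fully busy
```

The order of the checks mirrors the "most common causes" list: label mismatch first, then availability, then executor saturation.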
Jenkins · ARCHITECT
How do you secure Jenkins credentials in a production environment?
Multiple layers. Layer 1: Jenkins Credentials store (AES-128 encrypted). Use withCredentials() block — value automatically masked as **** in all log output. Organize credentials with meaningful IDs: staging-acr-credentials, production-db-password. Set credential scope to Global (not System — System scope is invisible to pipelines). Layer 2: Vault integration. For enterprises: Jenkins authenticates to HashiCorp Vault via AppRole. Pipeline fetches secret at runtime. When secret rotates in Vault, all pipelines get new value automatically — no manual Jenkins credential updates. No production secrets stored in Jenkins at all. Layer 3: Cloud-native IRSA/Workload Identity. For K8s agents on AWS/Azure: pod authenticates as a Kubernetes service account mapped to an IAM role. Zero stored credentials. Perfect for Kubernetes-native pipelines. The anti-patterns to avoid: env variables set directly in shell (visible in process list), echoing credentials even to masked logs, storing credentials as plain text files in workspace.
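Why "echoing credentials even to masked logs" is an anti-pattern is easier to see with a toy model of masking. The sketch below assumes masking is literal string replacement on each log line; that is my simplification, not a description of Jenkins internals, and `mask_secrets` is a hypothetical helper.

```python
import base64


def mask_secrets(line, secrets):
    """Replace every literal occurrence of a known secret with ****."""
    # Longest first, so a secret that contains another one is masked whole.
    for secret in sorted(secrets, key=len, reverse=True):
        if secret:
            line = line.replace(secret, "****")
    return line


secret = "abc123"  # illustrative value only
encoded = base64.b64encode(secret.encode()).decode()

print(mask_secrets("token=" + secret, [secret]))   # literal value is hidden
print(mask_secrets("token=" + encoded, [secret]))  # transformed value slips through
```

Under this model any transformation of the secret (encoding, splitting, reversing) defeats masking, so the safer habit is never to print secrets at all.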
Jenkins · PRODUCTION
Jenkins master is running out of memory. How do you fix it and prevent it?
Diagnosis: Manage Jenkins → System Information → Memory Usage shows heap usage. When heap stays above 80%, builds slow down, then OutOfMemoryError failures start. Immediate fix: increase the JVM heap in the Jenkins startup configuration (Dockerfile, systemd unit, or wherever your installation sets JVM options): JAVA_OPTS="-Xmx4g -Xms2g -XX:+UseG1GC". A 4GB heap handles 50-100 concurrent builds comfortably. Structural fixes: configure build rotation (Manage Jenkins → System → Global Build Discarders, or Discard old builds on each job, keeping 20). Without rotation, Jenkins keeps records and logs for every build forever and the master fills up. Configure workspace cleanup: add cleanWs() to the post { always } block of every pipeline. Enable build discard in the pipeline itself: options { buildDiscarder(logRotator(numToKeepStr: '20')) }; numToKeepStr takes a string, so quote the number. Root cause prevention: the Jenkins master should ONLY orchestrate. If builds run on agents, master memory usage is minimal because it only stores logs and schedules work. If the master is starved, it usually means some builds are still running on it: check Manage Jenkins → Nodes → Built-In Node → Configure → Number of executors and set it to 0.
Jenkins · ENGINEER
What is Jenkins Multibranch Pipeline and how does branch strategy work with it?
Multibranch Pipeline automatically discovers all branches in a Git repository and creates a Jenkins pipeline for each. When you push a new branch, Jenkins detects it (via webhook or polling) and creates a pipeline. When you delete the branch, Jenkins deletes the pipeline. No manual pipeline creation or cleanup. The power comes from combining it with when conditions in Jenkinsfile. One Jenkinsfile, different behaviour per branch: feature/* branches deploy to a feature namespace in dev. The develop branch deploys to staging automatically. The main branch requires human approval then deploys to production. Branch protection integration: configure GitHub to require the Jenkins status check to pass before merge. Jenkins posts commit status via GitHub Checks API. Red circle on PR = failed build = merge blocked. This enforces quality automatically without anyone policing it manually. PR approval gates your code review. Jenkins gates your quality. Neither is optional.
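The per-branch behaviour described above amounts to a small routing table. The sketch below models it outside Jenkins; `BRANCH_RULES` and `deployment_target` are hypothetical names, and in a real Jenkinsfile the same decisions would be expressed with `when` conditions.

```python
from fnmatch import fnmatchcase

# Rules are checked in order; the first matching pattern wins.
BRANCH_RULES = [
    ("main", {"environment": "production", "needs_approval": True}),
    ("develop", {"environment": "staging", "needs_approval": False}),
    ("feature/*", {"environment": "dev", "needs_approval": False}),
]


def deployment_target(branch):
    """Return the deployment rule for a branch, or None if it only builds."""
    for pattern, rule in BRANCH_RULES:
        if fnmatchcase(branch, pattern):
            return rule
    return None


print(deployment_target("feature/login"))
print(deployment_target("main"))
print(deployment_target("hotfix/urgent"))  # no rule: build and test only
```

Keeping the rules in one ordered list makes the "one Jenkinsfile, different behaviour per branch" idea explicit: adding a new branch type is a one-line change.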
Jenkins · ENGINEER
How do Kubernetes agents work in Jenkins and what are the benefits over static agents?
Kubernetes agents work through the Jenkins Kubernetes Plugin. When a build triggers, the plugin creates a Pod in Kubernetes for that build. The pod contains one or more containers: typically your build tool (Maven, Node.js) plus any other tools needed (Docker, kubectl). The agent container connects back to the Jenkins controller as an inbound agent (the connection historically called JNLP). When the build completes, the pod is deleted. Benefits over static agents: no idle cost (pods only exist during builds), clean environment for every build (no state contamination), horizontal scalability (Kubernetes can schedule hundreds of agent pods in parallel, cluster capacity permitting), different container images per job (Maven job uses Maven image, Node.js job uses Node image), resource limits per build (prevent one build consuming all CPU). The main challenge: pod startup time (10-30 seconds) adds to build time. Mitigate by keeping agent images cached on the nodes so pods do not wait on image pulls, and by letting idle agent pods stay around for reuse (the pod template idleMinutes setting) instead of being deleted immediately. At HPE, we ran all Jenkins builds as Kubernetes pods on OpenShift: each microservice build spun up, compiled, built the Docker image, pushed to the registry, and terminated. Zero idle agents.
Jenkins · ENGINEER
What is a Jenkins Shared Library and how do you structure it for a large organisation?
A Shared Library is a common Groovy codebase loaded into Jenkins and available to all pipelines. Without it: every team writes its own docker build, push, and deploy steps in its Jenkinsfile, which means duplication, inconsistency, and a maintenance nightmare. With it: common steps live in the library and Jenkinsfiles become thin orchestration files. Structure: the library lives in a Git repository with three directories: src/ (Groovy classes), vars/ (global variables, which are the functions pipelines call), resources/ (static files). Example: vars/dockerBuild.groovy defines a call(image, tag) method, which pipelines invoke as dockerBuild. Any Jenkinsfile can call dockerBuild("myapp", "v1.2") and it runs the standardised build. Configuration: in Manage Jenkins → Configure System → Global Pipeline Libraries, point to the Git repo and give it a name. In Jenkinsfile: @Library("shared-library") _ at the top (the trailing underscore gives the annotation something to attach to when there is no import statement). Production use at HPE: one shared library for all microservice pipelines. Steps for build, test, security scan, deploy to OpenShift, and notification were all in the library. When we added SonarQube, we added it once to the library and all 50 pipelines got it automatically on next run.
Jenkins · PRODUCTION
Jenkins build is queued but never starts. What do you investigate?
Builds queue when no agent is available to run them. Investigation steps: 1. Check build queue item — click the stuck build, it shows why it is waiting: "No agents are online", "All agents are busy", or the label expression is not matched. 2. Check agents: Manage Jenkins → Nodes. Are agents online? If offline: click the agent, see the disconnect reason. Reconnect manually or check the agent process. 3. For Kubernetes agents: check pod creation. kubectl get pods -n jenkins — are agent pods being created? kubectl describe pod for pending pods — might be scheduling failure (insufficient CPU/memory, node selector mismatch, PVC issues). 4. Label mismatch: the pipeline has agent { label "linux" } but no agent has that label. Check agent labels in Manage Jenkins → Nodes → agent configuration. 5. Executor count: agent is online but all executors are busy. Check how many concurrent builds are running. 6. Resource limits on Kubernetes: the namespace has a ResourceQuota — no more pods can be created. kubectl describe resourcequota -n jenkins. 7. Plugin issue: Kubernetes plugin cannot authenticate to cluster. Check Jenkins logs: tail -f /var/log/jenkins/jenkins.log | grep -i error.
Jenkins · ARCHITECT
How do you secure Jenkins in a production environment?
Security in layers. Authentication: enable security (not anonymous access). Integrate with LDAP/Active Directory or OAuth (GitHub, Google) using the LDAP plugin or GitHub Authentication plugin — developers use their existing credentials. Authorisation: use Matrix-Based Security or Role Strategy plugin. Principle of least privilege: most users get Job/Read and Job/Build for their team folders. Only DevOps team gets Administer. Never give everyone Administer. Credentials: never put passwords, tokens, or keys in Jenkinsfile or build parameters. Store all secrets in Jenkins Credentials Store (encrypted using master.key). Reference with withCredentials() or credentials() binding. Better: integrate with HashiCorp Vault or Azure Key Vault. Agent isolation: Kubernetes agents run in isolated pods. Each job gets a clean environment. Prevent agents from having access to controller filesystem. Network: Jenkins controller behind a VPN or internal load balancer — not publicly accessible. Webhook validation: validate GitHub webhook signatures using the Jenkins GitHub plugin. Audit: install the Audit Trail plugin — logs all Jenkins actions with user and timestamp. Regular updates: Jenkins and plugins have frequent security patches. Set up automated update notifications.
Jenkins · ENGINEER
How do you implement parallel stages in Jenkins to speed up pipeline?
Parallel stages run simultaneously in a Declarative Pipeline. stages { stage("Parallel Tests") { parallel { stage("Unit Tests") { steps { sh "mvn test -Dtest=Unit*" } }; stage("Integration Tests") { steps { sh "mvn test -Dtest=Integration*" } }; stage("Security Scan") { steps { sh "trivy image myapp:${env.BUILD_NUMBER}" } } } } }. The three stages run at the same time. If any fails, the parallel block fails. Limit with agent { label "docker" } per stage if they need different environments. Resource considerations: each parallel stage runs on a separate agent if available. Ensure Jenkins has enough agents. For matrix builds (test against multiple JDK versions): matrix { axes { axis { name "JAVA_VERSION"; values "11", "17", "21" } }; stages { stage("Build") { steps { sh "java -version" } } } }. Performance impact: parallel execution reduced our pipeline from 45 minutes to 12 minutes at HPE. Unit tests (5 min) + integration tests (8 min) + security scan (6 min) running in parallel = 8 minutes total instead of 19 minutes sequential.
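The timing claim at the end of the answer is simple arithmetic: sequential time is the sum of the stage durations, parallel time is the slowest stage. A minimal sketch, assuming every parallel stage gets its own agent and ignoring agent startup time:

```python
def sequential_minutes(stages):
    """Wall-clock time when stages run one after another."""
    return sum(stages.values())


def parallel_minutes(stages):
    """Wall-clock time when all stages run at once: the slowest stage wins.

    Assumes one free agent per stage and no startup overhead.
    """
    return max(stages.values())


# Durations taken from the example above, in minutes.
stages = {"unit_tests": 5, "integration_tests": 8, "security_scan": 6}

print(sequential_minutes(stages))  # 19
print(parallel_minutes(stages))    # 8
```

If fewer agents are available than stages, some stages queue behind others and the real figure lands between the two numbers.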
Jenkins · ENGINEER
What are Jenkins shared libraries and why do you use them?
Shared Libraries allow you to define reusable pipeline code that multiple Jenkinsfiles can import. Without them: every project has its own Jenkinsfile copying the same boilerplate. If you need to change the security scan step, you update 20 Jenkinsfiles. With a Shared Library: one change propagates to all pipelines. Structure: the vars/ directory contains global functions as Groovy files. vars/dockerBuild.groovy: def call(String imageName, String tag) { sh "docker build -t ${imageName}:${tag} ." }. In Jenkinsfile: @Library("platform-library") _ at the top, then dockerBuild("myapp", env.BUILD_NUMBER) inside a step. Steps I put in the shared library at HPE: dockerBuildAndPush (standard build, scan, push), helmDeploy (standard helm upgrade with retry and rollback), slackNotify (standard Slack alerts on success/failure), sonarAnalysis (standard SonarQube quality gate). Configure in Jenkins: Manage Jenkins → Configure System → Global Pipeline Libraries → Add library with Git URL. Libraries are versioned with Git tags: teams can pin to a specific version (@Library("platform-library@v2.1.0")) or track main for the latest.
Jenkins · ARCHITECT
How do you implement Jenkins at scale with multiple teams?
Architecture: Jenkins controllers (masters) with distributed agents. For scale: do NOT share one Jenkins master across all teams. Use one master per business unit or pipeline type. Better: use the Jenkins Kubernetes Plugin to spin up fresh pods as Jenkins agents and delete them after the build. No persistent agents to maintain. Each build gets a clean environment. Kubernetes manifest per build type: agent { kubernetes { yaml pod-spec } } in Jenkinsfile. Job organization: a GitHub Organization folder (GitHub Branch Source plugin) automatically creates jobs for every repo in a GitHub org that contains a Jenkinsfile. Developers do not register their pipelines manually. Folder structure: top-level folders per team. Each team has its own credentials, its own agents, its own shared library. RBAC: Jenkins Matrix-based security. Operations team has full access. Developers have access to their folder only. Release managers have access to production environments. Scaling agents: Jenkins scales agent pods on Kubernetes automatically. Peak builds: 50 concurrent pods. Idle: 0 pods. No idle compute cost. Monitoring: the Prometheus metrics plugin exposes Jenkins metrics for scraping. Alert on: executor saturation (all agents busy), long queue times, failed build rate. Infrastructure as code: use the Jenkins Configuration as Code (JCasC) plugin to keep the entire Jenkins config as YAML in Git.
Jenkins · ENGINEER
How do you integrate code scanning (SonarQube) in a Jenkins pipeline?
Declarative pipeline with SonarQube: pipeline { agent any; environment { SONAR_TOKEN = credentials("sonar-token") }; stages { stage("Checkout") { steps { checkout scm } }; stage("SonarQube Analysis") { steps { withSonarQubeEnv("SonarQube") { sh "mvn sonar:sonar -Dsonar.projectKey=myapp -Dsonar.host.url=http://sonar:9000 -Dsonar.login=$SONAR_TOKEN" } } }; stage("Quality Gate") { steps { timeout(time: 5, unit: "MINUTES") { waitForQualityGate abortPipeline: true } } }; stage("Build Docker") { when { expression { return currentBuild.result == null } }; steps { sh "docker build -t myapp:${BUILD_NUMBER} ." } } } }. The waitForQualityGate step polls SonarQube until the quality gate result is available. If the gate fails (coverage below threshold, too many bugs, security hotspots), the pipeline aborts — the Docker image is never built, nothing gets deployed. Quality gate conditions you define in SonarQube: code coverage > 80%, no new CRITICAL or BLOCKER issues, duplicated lines < 5%, security rating A. For new code: Sonar also checks only the changed lines (new code gate) so legacy code does not block new features.
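The quality gate conditions listed above can be read as a handful of threshold checks. The sketch below evaluates them locally to show the logic; the metric key names are hypothetical and this is not the SonarQube API, which performs the evaluation server-side.

```python
def quality_gate_failures(metrics):
    """Return the list of failed conditions; an empty list means the gate passes."""
    failures = []
    if metrics["coverage_pct"] <= 80:
        failures.append("coverage must be above 80%")
    if metrics["new_critical_or_blocker"] > 0:
        failures.append("no new CRITICAL or BLOCKER issues allowed")
    if metrics["duplicated_lines_pct"] >= 5:
        failures.append("duplicated lines must be below 5%")
    if metrics["security_rating"] != "A":
        failures.append("security rating must be A")
    return failures


# Illustrative numbers only.
good = {"coverage_pct": 86.5, "new_critical_or_blocker": 0,
        "duplicated_lines_pct": 2.1, "security_rating": "A"}
bad = dict(good, coverage_pct=72.0, security_rating="C")

print(quality_gate_failures(good))
print(quality_gate_failures(bad))
```

A non-empty result corresponds to the case where waitForQualityGate aborts the pipeline before the Docker build stage.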
Jenkins · ENGINEER
How do you store secrets securely in Jenkins?
Jenkins Credentials Store: Manage Jenkins → Credentials → System → Global credentials. Store: Username/Password, Secret Text, SSH Keys, Certificates. Reference in pipeline: environment { DB_PASS = credentials("db-password") } or withCredentials([string(credentialsId: "api-token", variable: "TOKEN")]) { sh 'curl -H "Authorization: Bearer $TOKEN" <url>' }. Use single quotes around the sh command so the shell expands $TOKEN; a double-quoted Groovy string would interpolate the secret into the command before the shell runs. Jenkins masks credentials in build logs automatically. Better approach: Azure Key Vault integration through the Jenkins Azure Key Vault plugin. Configure vault URL and credentials in Jenkins. Reference in pipeline: withAzureKeyvault(credentialID: "my-vault", secrets: [[envVariable: "DB_PASS", name: "database-password", secretType: "Secret"]]) { sh 'connect to db with $DB_PASS' }. Secrets are fetched at runtime, never stored in Jenkins. Best practices: use service principals with minimum permissions, rotate credentials regularly, never hardcode in Jenkinsfile, use short-lived tokens where possible, audit credential access logs. What NOT to do: never echo $SECRET in scripts (even though masked, avoid the habit), never write secrets to files that get archived as build artifacts, never pass secrets as build parameters (they appear in build history).
Jenkins · SENIOR
Jenkins pipeline takes 45 minutes. Development team is frustrated. How do you fix it?
I have reduced a 45-minute pipeline to 12 minutes. The analysis first — instrument each stage to find where time goes: add timestamps to each stage, run the pipeline and read the logs. Usually 80% of the time is in 20% of stages. Common findings: sequential tests that could run in parallel (unit + integration + security scan all running one after the other), no caching (Maven downloads 500MB of dependencies every run), no Docker layer caching (rebuilding base image from scratch), waiting for environment to spin up. Fixes applied: parallel stages — wrap independent stages in parallel { stage("Unit Tests") stage("Security Scan") stage("Lint") }. Caching — for Maven: mount .m2 as a persistent volume on the Jenkins agent. For npm: cache node_modules. For Docker: use --cache-from with a cached layer image in ACR/ECR. Agent pools — pre-warmed agents eliminate 2-3 minutes of agent startup time. Selective testing — only run full test suite on main branch; on feature branches run only tests related to changed files. Incremental builds — if only docs changed, skip the build stage entirely. Result at HPE: 70% pipeline time reduction, development team satisfaction improved significantly.
Jenkins · PRODUCTION
Jenkins master goes down at 11pm before a critical release at 9am. What do you do?
This is why we have runbooks and do not depend on a single Jenkins master. Immediate response: check if it is a service crash or a full node failure. Restart the Jenkins service first — most outages are service-level: systemctl restart jenkins. If node failure: failover to the Jenkins secondary if you have HA configured. If no HA: check if all pipelines are stored in Jenkinsfiles in Git (they should be). The pipelines themselves are not lost. Recovery steps: spin up a new Jenkins instance (we had an AMI/VM image with Jenkins pre-installed and configured). Restore the configuration from backup — CASC (Configuration as Code) plugin means the entire Jenkins config is a YAML file in Git. Restore jobs from Git — all Jenkinsfiles are in source control. Restore credentials from the secrets manager — never store credentials only in Jenkins. Within 45 minutes: new Jenkins instance running, CASC applied, all jobs re-registered, credentials from Key Vault re-injected. Preventive action after: implement Jenkins HA (active-passive with EFS/Azure Files for shared storage). Lesson learned: always use CASC + Jenkinsfiles in Git. Never depend on the Jenkins UI for configuration.
🐳Docker — BuildKit, Disk Management, Production Patterns20 Q&As›
Docker · ENGINEER
What is the difference between CMD and ENTRYPOINT in a Dockerfile?
Both define what runs when the container starts. The key difference: ENTRYPOINT is the executable, CMD provides default arguments to ENTRYPOINT. When you override CMD (docker run myimage custom-arg), ENTRYPOINT stays. When you override ENTRYPOINT (docker run --entrypoint sh myimage), CMD is ignored. Production rule: use ENTRYPOINT for the application, CMD for default flags. Example: ENTRYPOINT ["python", "-m", "gunicorn"] CMD ["--workers", "4", "app:app"]. In production you also must use exec form (square brackets) not shell form — shell form makes /bin/sh PID 1 and your app never receives SIGTERM, causing 10-second kill delays on every container stop.
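The override rules above can be captured in a few lines. This is a sketch of exec-form semantics only, with a hypothetical helper name; it models how the final argument vector is assembled, not how Docker implements it.

```python
def container_argv(entrypoint, cmd, run_args=None, entrypoint_override=None):
    """Compute the process argv for exec-form ENTRYPOINT and CMD.

    entrypoint / cmd: lists from the Dockerfile.
    run_args: arguments given after the image name on `docker run`.
    entrypoint_override: value of `docker run --entrypoint`.
    """
    if entrypoint_override is not None:
        # --entrypoint replaces ENTRYPOINT and discards the image's default CMD.
        return [entrypoint_override] + list(run_args or [])
    # Arguments on the command line replace CMD; otherwise CMD supplies defaults.
    args = list(run_args) if run_args else list(cmd)
    return list(entrypoint) + args


entrypoint = ["python", "-m", "gunicorn"]
cmd = ["--workers", "4", "app:app"]

print(container_argv(entrypoint, cmd))
print(container_argv(entrypoint, cmd, run_args=["custom-arg"]))
print(container_argv(entrypoint, cmd, entrypoint_override="sh"))
```

The three calls correspond to `docker run myimage`, `docker run myimage custom-arg`, and `docker run --entrypoint sh myimage`.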
Docker · ENGINEER
Explain Docker layer caching. How do you optimise it?
Docker builds images layer by layer. If a layer has not changed since the last build, Docker reuses the cached version — making subsequent builds fast. The critical rule: once a layer cache is invalidated, ALL layers after it rebuild. The mistake: COPY . . before pip install — every code change invalidates the pip install cache and rebuilds all dependencies. The fix: COPY requirements.txt . first, RUN pip install, then COPY . . — dependencies only reinstall when requirements.txt changes. At HPE this changed our CI build time from 8 minutes to 90 seconds. In CI pipelines use BuildKit with --cache-from to share cache between pipeline runs. Use docker image history to see which layers are largest and where cache is being broken.
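The "everything after the first changed layer rebuilds" rule can be simulated with a toy model. The sketch assumes each instruction is a pair of its text and a fingerprint of its inputs; the fingerprints below are made-up labels, not real checksums.

```python
def rebuilt_layers(previous, current):
    """Return the instructions that must rebuild in `current`.

    Each build is a list of (instruction, input_fingerprint) pairs.
    The first difference invalidates that layer and every layer after it.
    """
    for index, (old, new) in enumerate(zip(previous, current)):
        if old != new:
            return [instruction for instruction, _ in current[index:]]
    # Shared prefix is identical: only instructions beyond it are new.
    return [instruction for instruction, _ in current[len(previous):]]


# Source copied before dependencies: a code change sits early in the chain.
bad_v1 = [("FROM python:3.11-slim", "base"),
          ("COPY . .", "src-v1"),
          ("RUN pip install -r requirements.txt", "cmd")]
bad_v2 = [("FROM python:3.11-slim", "base"),
          ("COPY . .", "src-v2"),
          ("RUN pip install -r requirements.txt", "cmd")]

# Dependencies first: the same code change only touches the last layer.
good_v1 = [("FROM python:3.11-slim", "base"),
           ("COPY requirements.txt .", "reqs-v1"),
           ("RUN pip install -r requirements.txt", "cmd"),
           ("COPY . .", "src-v1")]
good_v2 = good_v1[:3] + [("COPY . .", "src-v2")]

print(rebuilt_layers(bad_v1, bad_v2))
print(rebuilt_layers(good_v1, good_v2))
```

In the first ordering the code change forces `pip install` to rerun; in the second only the final `COPY` rebuilds, which is where the build-time saving comes from.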
Docker · PRODUCTION
Your container is OOMKilled every few hours. Walk me through how you find the cause and fix it.
Exit code 137 means the process was killed with SIGKILL (128 + 9); in a container with a memory limit that is usually the kernel OOM killer. Systematic approach: Step 1: confirm it is OOM: docker inspect container --format='{{.State.OOMKilled}}' returns true. Step 2: check the memory limit and what the container was actually using before it died. In our case docker stats showed usage climbing to 490MB of a 512MB limit over several hours. Step 3: is it a memory leak or just an insufficient limit? Memory leak: usage grows steadily and never comes back down. Insufficient limit: usage levels off at a steady working set that the limit is simply too low for. At HPE we had a Python service with a Kafka consumer that cached every message in memory without eviction. Fix: profiled with memory_profiler, found an unbounded cache dictionary, replaced it with an LRU cache with maxsize=1000. Short term: increased the limit from 512m to 1g to stop the pages. Permanent fix: added a Prometheus alert at 80% of the memory limit to catch future leaks before they cause kills.
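The fix for an unbounded cache is a bounded one that evicts the least recently used entry. The sketch below is one minimal way to build that with the standard library; `BoundedCache` is a hypothetical class, not the code used in the service described above.

```python
from collections import OrderedDict


class BoundedCache:
    """LRU cache: holds at most `maxsize` items, evicting the least recently used."""

    def __init__(self, maxsize=1000):
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self.maxsize = maxsize
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.maxsize:
            self._items.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._items)


cache = BoundedCache(maxsize=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used
cache.put("c", 3)  # evicts "b"
print(len(cache), cache.get("b"), cache.get("a"))
```

Memory use is now capped by `maxsize` times the size of one entry, so the container's usage levels off instead of growing until the kernel kills it.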
Docker · ARCHITECT
What is a multi-stage Docker build and why is it critical for production?
Multi-stage builds use multiple FROM instructions in one Dockerfile. Each stage starts fresh and can copy artifacts from previous stages. The final image only contains what you explicitly copy into the last stage. Why it matters: a Go application image that includes the compiler toolchain is 1.2GB. The same app compiled and copied to scratch (empty base image) is 8MB. Smaller image = faster pull, faster startup, lower registry storage cost, and a smaller attack surface (fewer installed packages = fewer CVEs). At HPE we reduced our Java microservice images from 680MB (JDK) to 180MB (JRE only). The build stage has all the tools. The runtime stage has only the compiled binary and its runtime dependencies. I also use multi-stage builds to separate test execution: run the tests in an early stage, and if they fail the build fails before producing any image, so an image that failed its tests can never be pushed. With BuildKit this only works if the final stage depends on the test stage (for example by copying an artifact from it), because stages the final image does not need are skipped.
Docker · ENGINEER
How do Docker networking modes differ? When do you use each?
Bridge (default): containers get their own network namespace, communicate via Docker-managed bridge network. Containers see each other by IP or container name on custom networks. Use for most applications. Host: container shares the host network stack directly — no port mapping needed, performance is slightly better, but zero network isolation. Use for network monitoring tools or when you need absolute maximum throughput. None: container has no network interface. Use for batch jobs that should never make network calls. Overlay: cross-host networking for Docker Swarm. In Kubernetes, Docker networking is replaced entirely by CNI plugins (Calico, Cilium, Flannel) — kube-proxy handles Service networking at the node level, not Docker. Important: create a custom bridge network instead of using the default bridge — custom networks provide automatic DNS resolution by container name. Default bridge uses IP addresses only.
Docker · PRODUCTION
A new Docker image is 2GB. Your CTO asks you to reduce it. How do you approach this?
Systematic reduction approach: Step 1 — analyse with docker image history to see which layers are biggest. Step 2 — switch to smaller base image: ubuntu (77MB) → debian-slim (30MB) → alpine (5MB) → distroless (2MB) → scratch (0). Caveat: alpine uses musl libc, some Python packages need glibc, test before committing. Step 3 — multi-stage build: separate build tools from runtime. A Maven/JDK build image is 500MB, but only the JAR needs to go to production, use JRE not JDK. Step 4 — clean up in same RUN layer: RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/* — the cleanup must be in the same RUN, otherwise the package cache is already committed to a layer. Step 5 — .dockerignore to exclude tests, docs, .git. At HPE: a Node.js service went from 1.4GB (with node_modules) to 120MB (multi-stage + alpine + only production dependencies).
Docker · ARCHITECT
How do you handle secrets in Docker? What are the risks of using ENV for secrets?
ENV variables for secrets have three problems: First, docker inspect my-container shows all environment variables in plaintext to anyone with Docker access. Second, docker history myimage shows all build-time ENV instructions. Third, child processes inherit all environment variables — if the app spawns subprocesses, secrets leak there too. Production approaches in order of security: Runtime injection via orchestrator (Kubernetes Secrets, Docker Swarm secrets) — secret never in image, injected at runtime as file mount. External secrets manager (HashiCorp Vault, AWS Secrets Manager) — app fetches secret at startup using IAM role. BuildKit secret mount for build-time secrets (pip private index tokens etc.) — RUN --mount=type=secret,id=pip_token pip install... — secret is never committed to any layer. Never do: ENV DB_PASSWORD=secret in Dockerfile (baked in, visible in docker history forever even if later changed).
Docker · PRODUCTION
Production container is running slowly. How do you diagnose it?
Performance diagnosis framework: Step 1 — is it CPU, memory, IO, or network? docker stats shows live CPU%, memory usage, network IO, block IO. Step 2 — if CPU high: docker exec -it container top or docker exec container ps aux. Look for a process consuming 100% CPU — could be a tight loop, or the app is CPU-bound and needs more replicas. Step 3 — if memory high: close to limit = risk of OOMKill. Check for memory leak: watch docker stats over 30 minutes. If memory grows continuously = leak. Step 4 — if disk IO high: use iostat inside container or host-level iotop. If writing too many logs = use log rotation. Step 5 — add profiling: for Python use py-spy or memory_profiler, for Java use async-profiler, for Go use pprof. In production I always have resource limits set AND Prometheus container metrics (container_cpu_usage_seconds_total, container_memory_usage_bytes) to correlate performance issues with recent deployments or traffic patterns.
Docker · ENGINEER
What is the difference between COPY and ADD in Dockerfile?
Use COPY for everything. ADD does everything COPY does plus two extra features: it auto-extracts local tar archives and it can fetch URLs. These extra features are the reason NOT to use ADD: they make Dockerfiles unpredictable. ADD https://example.com/file.tar.gz /app downloads the file at build time (a remote archive is not extracted, only local ones are), which is a security risk because the build pulls unverified content from an external URL. COPY is explicit and predictable. The only valid use case for ADD is if you specifically need tar extraction in a single layer, but even then I prefer COPY + RUN tar xzf. Interviewers ask this because ADD is in many old tutorials and its use signals that someone learned Docker from outdated sources.
Docker · ARCHITECT
How do you manage Docker images in a large organisation with 50+ microservices?
Image management strategy: Naming convention: registry.company.com/team/service:version — never just service:latest. Tagging strategy: semantic versioning (v1.2.3) + git SHA (sha-abc1234) + latest for latest main branch. Never deploy :latest to production. Base image governance: define approved base images (python:3.11-slim, openjdk:21-slim, node:20-alpine). Run weekly Trivy scans on all approved bases. When a base image CVE is found, trigger rebuild of all services using it via a dependency graph in CI. Registry: private registry (ECR or Harbor) with image scanning on push, vulnerability policy that blocks CRITICAL CVEs. Retention policy: keep last 10 tags per service, delete everything older. At HPE we had Harbor with automated Trivy scanning — any image with a CRITICAL CVE was automatically quarantined and developers notified within 5 minutes of push.
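The retention rule ("keep last 10 tags per service") can be sketched as a pure function over an ordered tag list. Protecting `latest` from deletion is my assumption, not something the policy above states, and `tags_to_delete` is a hypothetical helper rather than a registry API.

```python
def tags_to_delete(tags_oldest_first, keep=10, protected=("latest",)):
    """Return the tags a retention job would remove.

    tags_oldest_first: tag names for one service, ordered oldest to newest.
    keep: how many of the newest unprotected tags survive.
    protected: tags that are never deleted (an assumption of this sketch).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    candidates = [tag for tag in tags_oldest_first if tag not in protected]
    if len(candidates) <= keep:
        return []
    return candidates[:-keep]


tags = ["v1.0.%d" % n for n in range(12)] + ["latest"]
print(tags_to_delete(tags))          # the two oldest version tags
print(tags_to_delete(tags, keep=3))  # everything except the three newest
```

A production job would also want to skip any tag that is currently deployed, which a count-based rule alone cannot know.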
Docker · ENGINEER
What is a multi-stage Docker build and why is it important for production?
Multi-stage builds use multiple FROM instructions in one Dockerfile, where each FROM starts a new stage. The key feature: you copy only the output from one stage into the next, leaving everything else behind. A Java application without multi-stage: one image with JDK + Maven + source code + tests + compiled JAR = 600MB. With multi-stage: Stage 1 uses maven:3.9 to compile. Stage 2 uses eclipse-temurin:17-jre-alpine and only COPY --from=builder the compiled JAR. Final image: 180MB with no Maven, no source code, no test files. Why it matters for production: security (smaller attack surface — no compiler or build tools to exploit), performance (faster image pulls, faster pod startup, less registry storage), compliance (many security policies require minimal runtime images). Important optimisation inside each stage: copy dependency files (pom.xml, package.json) before source code. Docker caches each layer — if pom.xml hasn't changed, it skips the dependency download step on the next build. Source code changes every commit but dependencies change rarely. This cache trick reduces build time from 5 minutes to 30 seconds for most builds.
Docker · ENGINEER
Explain Docker networking — bridge, host, overlay, and when to use each.
Bridge network (default): each container gets a private IP in a virtual network (172.17.0.0/16 range). Containers on the same bridge can communicate by container name (if using user-defined bridge). Host network: container shares the host's network stack — no isolation, uses host IP and ports directly. Use for performance-critical applications where network overhead matters, or when the app needs to bind specific host ports. Custom user-defined bridge: docker network create mynetwork. Containers on the same user-defined bridge can resolve each other by name — myapp can reach database:5432. Default bridge doesn't have DNS resolution between containers. Always use custom networks, not the default bridge. Overlay network: multi-host networking for Docker Swarm or when containers on different hosts need to communicate. Uses VXLAN encapsulation. For Kubernetes: containers in the same Pod share a network namespace (same localhost). Between Pods: handled by the CNI plugin (Flannel, Calico, Azure CNI) which ensures every pod has a routable IP. Docker Compose automatically creates a user-defined bridge network for all services in the same compose file — services can reach each other by service name: redis://redis:6379 works because Compose creates DNS entries per service name.
Docker · PRODUCTION
A container is running but the application inside is not responding. How do you debug?
Systematic debugging steps. Step 1: check container status and recent logs. docker ps shows the container is running (Up). docker logs container-name --tail 100 shows application output — check for startup errors, exceptions, missing config. docker logs --follow to watch in real time. Step 2: check resource usage. docker stats container-name shows CPU, memory, network, disk I/O. Is the container at memory limit (potential OOM kill incoming)? High CPU from runaway loop? Step 3: get a shell inside. docker exec -it container-name sh (or bash if available). Inside: check if the process is running: ps aux. Check if it is listening on the expected port: netstat -tlnp or ss -tlnp. Try curling the endpoint from inside: curl localhost:8080/health. Step 4: check container events. docker inspect container-name shows restart count, exit code of previous runs. docker events --filter container=container-name shows all lifecycle events. Step 5: network connectivity. From another container or host: docker exec -it other-container curl http://target-container:8080. Check if the service port is actually exposed: docker port container-name. Step 6: if the container exits immediately on startup: docker run --entrypoint sh myimage -c "sleep 3600" to override ENTRYPOINT and keep it running for investigation.
Docker · ARCHITECT
How does Docker Compose work for a multi-service application? What are the key features?
Docker Compose defines and runs multi-container applications from a single YAML file. The compose file defines: services (each becomes a container), networks (how services communicate), volumes (persistent data). Key features for production-like local environments: depends_on with condition: service_healthy ensures the database is healthy before the app starts, which prevents the classic "app starts before DB is ready" problem. Health checks define how Docker determines if a service is healthy: healthcheck: test: ["CMD", "pg_isready"] interval: 10s retries: 5. Profiles: mark services with profiles: [tools] and they only start when docker compose --profile tools up is run. Useful for optional debug tools. Environment variable files: env_file: .env.development separates config from the compose file. Override files: docker-compose.yml (base) + docker-compose.override.yml (local dev overrides like volume mounts for live code reload). Production compose: add resource limits (cpus: "0.5", memory: 512m), restart policies (restart: unless-stopped), and use secrets instead of environment variables for sensitive values. Docker Compose is ideal for local development and integration testing. For production Kubernetes: use Kompose as a starting point. kompose convert turns the compose file into Kubernetes manifests, and kompose convert -c generates a Helm chart.
Docker · ENGINEER
What is Docker BuildKit and why should you use it?
BuildKit is the next-generation Docker build engine, enabled by default in Docker 23+. Advantages over the classic builder: parallel stage execution. In multi-stage builds, independent stages build simultaneously instead of sequentially, which is significantly faster. Better caching: RUN --mount=type=cache,target=/root/.m2 mounts a persistent cache directory that survives across builds on the same builder, so Maven downloads dependencies once and reuses them. Cache export: --cache-to type=registry exports build cache to a registry for CI/CD to reuse between pipeline runs. Secret mounting: RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm install makes the secret available during the build without baking it into the image or exposing it in docker history. SSH forwarding: RUN --mount=type=ssh git clone git@github.com:private/repo makes the SSH key available during the build without it ending up in the image. Syntax on older versions: DOCKER_BUILDKIT=1 docker build . (or set "features": {"buildkit": true} in daemon.json). Better output: structured progress output, per-step timing. Use BuildKit in CI: docker buildx build --cache-from type=registry,ref=myacr.azurecr.io/myapp:cache --cache-to type=registry,ref=myacr.azurecr.io/myapp:cache,mode=max . Note that registry cache export is not supported by the default docker build driver; it needs a docker-container Buildx builder (docker buildx create --use) or the containerd image store.
Docker · PRODUCTION
Docker container is consuming all disk space on the host. How do you fix?
Docker uses disk in several places: images, containers (writable layer), volumes, build cache, and container logs. Check usage: docker system df shows the breakdown. Images too large: docker image ls --format "table {{.Repository}} {{.Tag}} {{.Size}}" | sort -k3 -hr. Remove unused: docker image prune -a removes all images not used by any container. Or selective: docker rmi image-id. Stopped containers using disk: docker container prune removes all stopped containers. Volumes: docker volume prune removes unused volumes (CAREFUL: check whether any volume has data you need). Build cache: docker builder prune clears build cache (can be large, 10-20GB on active build machines). Nuclear option: docker system prune --all --volumes removes everything unused. On CI agents: add to cron: 0 2 * * * docker system prune -f (nightly at 02:00; use 0 2 * * 0 for a weekly run). Log rotation: in daemon.json set log-driver: json-file with log-opts max-size and max-file to prevent container logs filling the disk. Monitoring: set up Prometheus node-exporter to alert when the filesystem holding /var/lib/docker exceeds 80%. In Kubernetes: the kubelet garbage-collects unused images once disk usage passes its image GC high threshold, and the hard eviction threshold imagefs.available<15% is the point at which it reclaims images and, if that is not enough, evicts pods.
Docker · ENGINEER
Write a production-grade Dockerfile for a Python Flask application.
# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
WORKDIR /app
# Install deps first (layer caching - deps change less often than code)
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
# Stage 2: Runtime image
FROM python:3.11-slim
# Security: create non-root user
RUN useradd --create-home --no-log-init appuser
WORKDIR /app
# Copy only installed packages from builder
COPY --from=builder /root/.local /home/appuser/.local
# Copy application code
COPY --chown=appuser:appuser . .
# Switch to non-root user
USER appuser
# Environment
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PATH=/home/appuser/.local/bin:$PATH
EXPOSE 5000
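# Health check (python:3.11-slim ships without curl, so probe with the stdlib)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health', timeout=3)" || exit 1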
# Use exec form (not shell form) for proper signal handling
ENTRYPOINT ["python3", "-m", "gunicorn"]
CMD ["--bind", "0.0.0.0:5000", "--workers", "4", "app:app"]
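The HEALTHCHECK in the Dockerfile above probes /health on port 5000. Below is a minimal sketch of such a probe using only the Python standard library, which avoids relying on extra tools being present in a slim base image. The function name is_healthy is hypothetical, and the URL in the usage note simply mirrors the Dockerfile.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=3.0):
    """Return True only if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and 4xx/5xx responses
        # (urlopen raises HTTPError, a URLError subclass, for those).
        return False
```

A probe script would call is_healthy("http://localhost:5000/health") and exit with 0 when it returns True and 1 otherwise, since HEALTHCHECK treats a non-zero exit code as unhealthy.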
Docker · ENGINEER
What is Docker layer caching and how do you optimise for it?
Docker caches the result of each Dockerfile instruction (RUN, COPY and ADD produce filesystem layers; the other instructions only add metadata). When rebuilding: if an instruction and all instructions before it are unchanged, Docker uses the cache. If any instruction or the files it copies change, that layer and ALL subsequent layers are rebuilt. Optimisation rule: put things that change LESS OFTEN earlier in the Dockerfile. Wrong order (slow): COPY . . then RUN pip install. Every code change invalidates the pip install layer, so all packages are downloaded again. Correct order (fast): COPY requirements.txt . then RUN pip install then COPY . . (code). pip install only reruns when requirements.txt changes. For Docker builds in CI: use --cache-from to reuse cache from ACR: docker build --cache-from myacr.azurecr.io/myapp:cache --build-arg BUILDKIT_INLINE_CACHE=1 -t myapp:${BUILD_ID} -t myacr.azurecr.io/myapp:cache . and push the cache image: docker push myacr.azurecr.io/myapp:cache. BuildKit (enabled by default in Docker 23+) has improved cache logic including parallel stage building and external cache sources. Result: a correctly ordered Dockerfile with registry cache can cut CI build time sharply, for example from 8 minutes to 90 seconds.
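The invalidation rule can be illustrated with a small simulation. This is a simplified model, not Docker's actual cache implementation: each instruction is paired with the set of files it reads, and the first instruction touched by a change breaks the cache for everything after it. The file names and the layers_to_rebuild helper are hypothetical.

```python
def layers_to_rebuild(instructions, changed_files):
    """Return the instructions that must re-run after changed_files changed.

    instructions: list of (instruction_text, set_of_input_files) in
    Dockerfile order. Once one instruction misses the cache, every later
    instruction is rebuilt as well.
    """
    rebuild = []
    cache_broken = False
    for text, inputs in instructions:
        if not cache_broken and inputs & changed_files:
            cache_broken = True
        if cache_broken:
            rebuild.append(text)
    return rebuild


all_files = {"app.py", "requirements.txt"}

slow = [
    ("COPY . .", all_files),
    ("RUN pip install -r requirements.txt", set()),
]
fast = [
    ("COPY requirements.txt .", {"requirements.txt"}),
    ("RUN pip install -r requirements.txt", set()),
    ("COPY . .", all_files),
]

code_change = {"app.py"}
print("slow order rebuilds:", layers_to_rebuild(slow, code_change))
print("fast order rebuilds:", layers_to_rebuild(fast, code_change))
```

With the slow ordering a change to app.py reruns pip install; with the fast ordering only the final COPY is rebuilt.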
Docker · SENIOR
Your Docker image is 2.1GB. A colleague says it is fine. How do you push back?
2.1GB is a serious problem in production. I would show the concrete impact: Pull time — on a cold node, pulling 2.1GB takes 3-5 minutes. During a spike, K8s schedules pods on new nodes. They cannot start for 5 minutes. Your HPA is useless during that window. Storage — if you run 20 services on one node, 2.1GB per service = 42GB just in images. Attack surface — a 2.1GB image likely includes a full OS, package manager, compiler, and tools. Every one of those is a potential CVE. Fix I would propose: multi-stage build. Builder stage: full JDK/Maven/Node — installs, compiles, builds. Final stage: FROM eclipse-temurin:17-jre-alpine (100MB base) COPY --from=builder just the JAR. Result: 2.1GB → 180MB in one afternoon. Other reductions: use alpine or distroless base images, remove build tools (apt-get install X && ... && apt-get purge X in one RUN layer), use .dockerignore to exclude tests, docs, node_modules. I have reduced images from 1.8GB to 220MB multiple times. The effort is 2 hours, the benefit lasts the lifetime of the service.
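The pull-time and storage arguments are simple arithmetic. The sketch below assumes a 100 Mbit/s effective registry bandwidth, which is an illustrative figure and not from the text; real pulls also spend time decompressing layers, so treat the result as a lower bound.

```python
def pull_seconds(image_gb, bandwidth_mbit_s):
    """Network transfer time only; ignores decompression and throttling."""
    megabits = image_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    return megabits / bandwidth_mbit_s


def node_image_storage_gb(image_gb, services):
    """Worst case: the services' images share no layers."""
    return image_gb * services


ASSUMED_BANDWIDTH = 100  # Mbit/s, hypothetical

for size_gb in (2.1, 0.18):
    seconds = pull_seconds(size_gb, ASSUMED_BANDWIDTH)
    storage = node_image_storage_gb(size_gb, 20)
    print(f"{size_gb} GB image: {seconds:.0f} s to pull, "
          f"{storage:.1f} GB on a node running 20 services")
```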
Docker · PRODUCTION
How do you handle secrets in Docker containers? What NOT to do?
What NOT to do: never put secrets in ENV in the Dockerfile (they appear in docker inspect, docker history, and the image metadata). Never bake secrets into the image with COPY or ARG. Never store them in docker-compose.yml committed to Git. Production approaches: Kubernetes Secrets mounted as files, where the container reads from /var/secrets/db-password at startup. Better than env vars because environment variables are inherited by child processes, are readable from /proc/<pid>/environ, and tend to leak into crash dumps and debug logs. External secrets manager: HashiCorp Vault, Azure Key Vault, AWS Secrets Manager. The container authenticates using its workload identity (Workload Identity/IRSA/Managed Identity) and fetches secrets at runtime. The CSI Secrets Store driver mounts secrets directly into pods as files without storing them in etcd (unless its optional sync to Kubernetes Secrets is enabled). Docker Swarm: docker secret create stores secrets encrypted, mounted at /run/secrets/ inside the container. Rule of thumb: if the secret could be seen by running docker inspect or kubectl describe pod, it is too exposed. Secrets should be fetched at runtime using the container identity, never baked in.
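A minimal sketch of the "read the secret from a mounted file" pattern follows. The read_secret helper and the demo value are hypothetical; the default directory mirrors the /var/secrets path used above, and a temporary directory stands in for the mounted volume so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def read_secret(name, secrets_dir="/var/secrets"):
    """Read a secret from a file mounted into the container."""
    path = Path(secrets_dir) / name
    try:
        # Strip the trailing newline that editors and tooling often add.
        return path.read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        raise RuntimeError(f"secret {name!r} is not mounted at {path}") from None


# Demo: a temporary directory stands in for the mounted secret volume.
with tempfile.TemporaryDirectory() as mount:
    Path(mount, "db-password").write_text("example-only\n", encoding="utf-8")
    demo_value = read_secret("db-password", mount)

# Log that the secret was loaded, never the value itself.
print(f"loaded secret of length {len(demo_value)}")
```

Failing fast with a clear error when the file is missing makes a misconfigured mount obvious at startup instead of surfacing later as an authentication failure.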
🤖 Ansible: Roles, Dynamic Inventory, Vault (17 Q&As)
Ansible · ENGINEER
What is Ansible and how is it different from Chef and Puppet?
Ansible is agentless — it connects to servers over SSH and requires only Python on the target server. No agent daemon to install, maintain, or upgrade. Chef and Puppet require an agent running on every managed server — agent upgrades, agent authentication, agent failures become their own operational problem. Ansible uses YAML playbooks which any developer can read. Chef uses Ruby DSL which requires programming knowledge. Ansible is push-based — control node pushes tasks when you run ansible-playbook. Puppet and Chef are pull-based — agents periodically check for updates. The pull model is better for continuous compliance; the push model is better for on-demand deployments and CI/CD integration. At HPE: we chose Ansible specifically because the infrastructure team could write and understand playbooks without needing Ruby knowledge, and because we needed CI/CD integration that push-based Ansible makes natural.
Ansible · ENGINEER
Explain Ansible variable precedence. Which wins?
Ansible has 22 precedence levels. For interviews, the 6 most important in order from lowest to highest: role defaults (role/defaults/main.yml) — anyone can override these; inventory variables (host and group vars in inventory file); group_vars (files in group_vars/ folder); host_vars (files in host_vars/ folder); playbook vars (vars: section); extra vars (-e flag) — always wins, cannot be overridden. Practical implication: role defaults are the safety net defaults. group_vars/production.yml overrides them for production. host_vars/critical-server.yml can further override for one specific server. And in an emergency, -e "log_level=DEBUG" overrides everything without touching any files. The most common mistake: setting variables in role vars/main.yml (high precedence) instead of defaults/main.yml (low precedence) — then nobody can override them from group_vars, which breaks multi-environment playbooks.
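The six levels above behave like dictionaries merged from lowest to highest precedence, with later levels overwriting earlier ones. The sketch below models only that merge, not Ansible's full 22-level resolution; the level names and variable values are hypothetical.

```python
# Lowest precedence first, highest last, following the order in the answer.
PRECEDENCE = [
    "role_defaults",
    "inventory_vars",
    "group_vars",
    "host_vars",
    "play_vars",
    "extra_vars",
]


def resolve(sources):
    """Merge variable sources so that higher-precedence levels win."""
    merged = {}
    for level in PRECEDENCE:
        merged.update(sources.get(level, {}))
    return merged


sources = {
    "role_defaults": {"log_level": "INFO", "workers": 4},
    "group_vars": {"workers": 8},
    "host_vars": {"workers": 16},
    "extra_vars": {"log_level": "DEBUG"},  # like -e "log_level=DEBUG"
}

print(resolve(sources))
```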
Ansible · ARCHITECT
How do you design Ansible roles to support both on-premise and cloud environments without duplicating code?
The key is parameterization and abstraction through variables. Design roles to be environment-agnostic by default, environment-specific through variable overrides. Example: my nginx role defines nginx_worker_processes in defaults/main.yml as 4. For cloud VMs with 8 cores, group_vars/cloud_webservers.yml sets it to 8. For on-prem servers with 16 cores, group_vars/onprem_webservers.yml sets it to 16. The role code never changes — only the variables differ. For genuinely different behavior (systemd vs init.d, different package managers), use when conditionals on ansible_os_family and ansible_distribution_major_version. For cloud-specific tasks (register with cloud load balancer, fetch secrets from Key Vault), use delegate_to: localhost to run cloud API calls from the control node. The role structure stays identical — cloud tasks are just enabled or disabled via variables like cloud_provider: azure or cloud_provider: none.
Ansible · PRODUCTION
Your Ansible playbook runs successfully against dev but fails against production. What do you investigate?
Systematic approach: look for the differences between dev and prod that could cause failures. First, run with -vvv to see exact SSH and task output. Most common causes in order of frequency: 1) Variable values: prod group_vars has different values (db_host, app_port, credentials). Verify with ansible prod-servers -m debug -a "var=hostvars[inventory_hostname]". 2) Ansible Vault: prod uses a different vault password. Verify vault decryption works: ansible-playbook site.yml -i inventory/prod --check --vault-password-file prod_vault.pass. 3) Network/firewall: target port not open, package repository not reachable from the prod network. Test with ansible prod-server -m uri -a "url=https://registry.example.com". 4) Permissions: prod has stricter sudo rules or SELinux enforcing. Check with ansible prod-server -m shell -a "getenforce". 5) OS version differences: prod is RHEL 8, dev is RHEL 9. Some modules behave differently. Use --check --diff to preview exactly what would change on prod without making changes.
Ansible · ARCHITECT
What is Ansible Automation Platform and when would you choose it over CLI Ansible?
AAP is enterprise Ansible with Web UI, RBAC, scheduling, audit logs, and centralized credential management. You need AAP when you have: more than 3 engineers running Ansible (SSH keys on laptops = security risk), any compliance requirement (PCI-DSS, SOC2 require audit trails of every change — CLI Ansible has none), production environments that need approval gates (AAP Workflow Templates support approval steps), and 24x7 operations (AAP schedules nightly compliance runs without a Jenkins dependency). Key AAP RBAC use case: Dev team gets Execute permission on dev Job Templates only. Ops team gets Execute on all. Nobody gets SSH key access to servers directly — all access goes through AAP with full logging. At Vodafone scale with 400+ servers across dev/staging/prod, CLI Ansible was a security and audit nightmare. AAP replaced it: every playbook run recorded, every credential centralized, every dev action approved by ops.
Ansible · ENGINEER
What is idempotency in Ansible and why does it matter?
Idempotency means running a playbook once or 100 times produces the same result — the system ends up in the desired state either way, with no side effects from repeated runs. Why it matters: CI/CD pipelines run playbooks on every deployment. If a playbook is not idempotent, running it twice might install duplicate packages, create duplicate users, append duplicate config lines, or fail because a resource already exists. Ansible built-in modules (package, file, service, user, template, lineinfile) are idempotent. The shell and command modules are NOT idempotent by default — they run every time. If you must use shell, use creates or removes flags: shell: create_database.sh creates=/var/lib/db — this skips the command if the file already exists. The measure of a good playbook: run it against an already-configured server — all tasks should show "ok" (unchanged), zero "changed". If any task shows "changed" every time, it is not idempotent.
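The difference between an idempotent and a non-idempotent operation can be shown with a config-line example. This is a sketch of the idea behind a module like lineinfile, not its implementation; ensure_line, append_line and the sample setting are hypothetical.

```python
import tempfile
from pathlib import Path


def append_line(path, line):
    """Not idempotent: every run adds another copy of the line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def ensure_line(path, line):
    """Idempotent: add the line only if missing. Returns True if changed."""
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if line in lines:
        return False  # already in the desired state: report "ok"
    lines.append(line)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True  # state was modified: report "changed"


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "good.conf")
    bad = Path(tmp, "bad.conf")
    results = [ensure_line(good, "max_connections = 200") for _ in range(3)]
    for _ in range(3):
        append_line(bad, "max_connections = 200")
    good_count = len(good.read_text(encoding="utf-8").splitlines())
    bad_count = len(bad.read_text(encoding="utf-8").splitlines())

print(results, good_count, bad_count)
```

The first ensure_line call reports a change and the later calls report none, which is the same "changed" then "ok" pattern expected from a well-written playbook.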
Ansible · PRODUCTION
Production server configuration drifted from your Ansible playbooks. How do you detect and remediate this?
Configuration drift in Ansible is detected by running playbooks in check mode against production: ansible-playbook site.yml --check --diff -i inventory/prod. This shows every difference between current state and desired state without making any changes. The --diff flag shows exact file content changes. Anything showing "changed" in check mode = drift. Common drift sources: manual emergency fixes during incidents that were never formalized into playbooks, security patches applied manually, and configuration changes made directly on servers by application teams. Remediation decision: if the drift was an intentional improvement, update the playbook first, then apply. If the drift was incorrect, run ansible-playbook site.yml -i inventory/prod to revert to desired state. Prevention: run check mode as a nightly Jenkins job. Any drift detected = Slack alert to the team. At HPE: nightly drift detection on 50+ servers. Alert fires maybe twice per month, usually from manual emergency changes. Having the alert meant we always caught and formalized the change within 24 hours.
Ansible · ENGINEER
What is the difference between include_tasks and import_tasks in Ansible?
import_tasks (static): The tasks file is read and included at parse time, before playbook execution starts. It is as if the tasks were written directly in the playbook. Result: you can use --list-tasks to see all tasks before running, and tags applied to the import apply to all imported tasks. Limitation: the file name is resolved at parse time, so it cannot depend on host facts or other per-host variables, and import_tasks cannot be used with loop. include_tasks (dynamic): The tasks file is loaded at runtime when that point in the playbook is reached. You can use variables in the file path: include_tasks: "{{ ansible_os_family }}_tasks.yml" loads a different file based on OS. Tags on the include_tasks do NOT automatically apply to included tasks (use the apply keyword if you need that). Limitation: --list-tasks does not show the included tasks before running. Rule of thumb: use import_tasks for static includes where you always know what to include. Use include_tasks for conditional inclusion based on variables, or when you need to loop over multiple task files.
Ansible · ARCHITECT
How do you handle secrets in Ansible across a team of 20 engineers?
Three-layer secret management strategy. Layer 1: Ansible Vault for playbook secrets (database passwords, API keys in vars/secrets.yml). Vault password stored in a password manager (HashiCorp Vault or 1Password for teams), never in Git. Each environment has a separate vault password. Layer 2: SSH keys managed in the AAP credential store, so engineers never see or hold SSH keys. AAP injects them at job execution time. Complete audit: who connected to which server and when. Layer 3: For production secrets that rotate regularly (DB passwords, API tokens), fetch them from HashiCorp Vault at playbook runtime, for example with the community.hashi_vault lookup plugins or Vault Agent, and never hardcode them, even in vault files. Rotation: when a secret rotates, update it in HashiCorp Vault and all playbooks pick it up automatically on the next run without any code changes. At HPE: I implemented this three-layer approach. Result: no engineer has direct SSH access to production servers, every secret access is audited, and we passed SOC2 audit without any findings related to credential management.
Ansible · PRODUCTION
A runaway Ansible playbook is running on production and making unintended changes. How do you stop it?
Immediate stop: Ctrl+C in the terminal if you are watching it. This aborts ansible-playbook on the control node, but a module that has already started on a remote host can keep running until it finishes, so do not assume the current task was cut off cleanly. If it is running in Jenkins/AAP: cancel the job immediately in the UI. For SSH-based playbooks you can also kill the SSH sessions to the target hosts: pkill -f "ssh.*production-server" from the control node, which interrupts the connection for the current task on all matching hosts. Assessment: work out what already ran from the playbook output or job log (the AAP job output, or the file set by log_path in ansible.cfg) to understand the blast radius. Ansible stores no rollback information: if tasks already ran (files changed, services restarted, packages installed), you must manually reverse them or re-run an earlier version of the playbook. Prevention: always run --check --diff in CI before any prod apply. Use serial to limit blast radius. For high-risk plays, add a manual approval step in the AAP workflow before the actual execution stage. At HPE: we had a runaway playbook that restarted all telecom services simultaneously instead of serially. The fix took 2 hours. After this we added serial: 1 to all service-restart playbooks and mandatory --check in CI.
Ansible · ENGINEER
What are Ansible roles and how do they improve playbook organisation?
Ansible roles are a structured way to organise playbooks into reusable components. A role has a standard directory structure: roles/nginx/tasks/main.yml (tasks to execute), roles/nginx/handlers/main.yml (handlers like restart nginx), roles/nginx/templates/nginx.conf.j2 (Jinja2 templates), roles/nginx/vars/main.yml (role-specific variables), roles/nginx/defaults/main.yml (default values, lowest priority), roles/nginx/files/ (static files to copy), roles/nginx/meta/main.yml (role metadata and dependencies). Use in playbook: roles: - nginx - { role: application, version: "2.1.0" }. Benefits: reusability, since one nginx role is used by 20 different playbooks. Separation of concerns: nginx config is in the nginx role, not mixed with application config. Testing: roles can be tested independently with Molecule. Distribution: roles can be shared via Ansible Galaxy. A standalone role is described by meta/main.yml; galaxy.yml is the metadata file for a collection, which can bundle several roles. Molecule for testing: molecule test runs the role in a Docker container and verifies it. At HPE I created roles for: base-hardening (SSH config, firewall, fail2ban), application-deploy, monitoring-agent, certificate-renewal. Each team consumed these roles without needing to understand their internals.
Ansible · ENGINEER
How do you use Ansible dynamic inventory for cloud environments?
Static inventory files break when servers come and go (cloud auto-scaling). Dynamic inventory scripts or plugins query the cloud API at runtime. Azure dynamic inventory: install the azure.azcollection collection, which provides the azure_rm inventory plugin. Create a file whose name ends in azure_rm.yml: plugin: azure.azcollection.azure_rm, include_vm_resource_groups: [production-rg]. ansible-inventory -i azure_rm.yml --list returns all Azure VMs. Group by tag: keyed_groups with prefix: tag and key: tags (creates groups like tag_Environment_production). Target all production servers: ansible -i azure_rm.yml tag_Environment_production -m ping. AWS dynamic inventory: aws_ec2 plugin. Create aws_ec2.yml: plugin: amazon.aws.aws_ec2, regions: [eu-west-1], filters: instance-state-name: [running]. keyed_groups: key: tags.Role → groups built from the EC2 Role tag. By default the names get a leading separator (such as _webservers); set a prefix or leading_separator: false to control the names. Kubernetes dynamic inventory: for managing K8s workloads with Ansible. In CI/CD: Ansible Tower/AWX stores dynamic inventory configurations. Credentials for cloud APIs are configured in Tower. Teams run playbooks against live inventory without managing any host files.
Ansible · ENGINEER
Write an Ansible playbook to install Nginx on all web servers.
---
- name: Install and configure Nginx on web servers
  hosts: webservers
  become: yes
  vars:
    nginx_port: 80
  tasks:
    - name: Install Nginx
      apt:
        name: nginx
        state: present
        update_cache: yes
      when: ansible_os_family == "Debian"
    - name: Install Nginx
      yum:
        name: nginx
        state: present
      when: ansible_os_family == "RedHat"
    - name: Start and enable Nginx
      service:
        name: nginx
        state: started
        enabled: yes
    - name: Configure Nginx
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        backup: yes
      notify: Reload Nginx
    - name: Open port 80 in firewall
      ufw:
        rule: allow
        port: "{{ nginx_port }}"
        proto: tcp
      when: ansible_os_family == "Debian"
  handlers:
    - name: Reload Nginx
      service:
        name: nginx
        state: reloaded
Key points: become: yes for sudo. when conditions handle different OS families. template module uses a Jinja2 template for config. notify triggers the handler only if the task changed; the handler runs once at the end even if notified multiple times. Idempotent: running this 100 times always results in the same state.
Ansible · ENGINEER
What is Ansible Vault and how do you use it for secrets?
Ansible Vault encrypts sensitive data (passwords, API keys, certificates) within YAML files or as standalone encrypted strings. The encrypted files can safely be committed to Git, because without the vault password they are unreadable. Encrypt a file: ansible-vault encrypt group_vars/production/secrets.yml. Edit an encrypted file: ansible-vault edit group_vars/production/secrets.yml, which opens it in an editor and saves it re-encrypted. Encrypt a single string (inline): ansible-vault encrypt_string "my-db-password" --name db_password → outputs an encrypted YAML value to paste into a vars file. View without changing the file: ansible-vault view secrets.yml. ansible-vault decrypt secrets.yml rewrites the file in plaintext, so use it only when you intend to remove encryption and never commit the result. Use in playbook: the vault-encrypted vars are automatically decrypted when running with the vault password. Run playbook: ansible-playbook site.yml --vault-password-file ~/.vault_pass or --ask-vault-pass for an interactive prompt. In CI/CD: store the vault password in Jenkins/Azure DevOps as a secret credential. Pass as --vault-password-file /dev/stdin with the credential piped in. Multiple vault IDs: ansible-vault encrypt --vault-id prod@prompt secrets.yml allows different passwords for different environments. Best practice: never store the vault password in the same repo as the encrypted files.
Ansible · PRODUCTION
Ansible fails with "UNREACHABLE" on 5 of 20 servers. How do you handle it?
UNREACHABLE means Ansible cannot connect via SSH (or WinRM) to those hosts. It does not mean the playbook task failed. Diagnose: check if the hosts are actually up: ping hostname. Check SSH manually: ssh user@hostname. Common causes: host is down or being rebooted (wait and retry), SSH key changed or expired (update inventory credentials), firewall blocking SSH port 22 (check NSG/security group), wrong username or Python interpreter path, bastion/jump host required but not configured. Handle gracefully in playbook: serial: 5 processes 5 hosts at a time. max_fail_percentage: 20 aborts the play if more than 20% of the hosts in a batch fail. This prevents cascading failures. Retry failed hosts: if retry_files_enabled = True is set in ansible.cfg (it is off by default in current releases), Ansible writes a retry file (site.retry) listing failed hosts. Re-run: ansible-playbook site.yml --limit @site.retry to only target the failed hosts. ignore_unreachable: true in a play or task continues with the remaining hosts even if some are unreachable; use it carefully and log which hosts were skipped. After fix: ansible all -m ping --limit problematic_hosts verifies connectivity before re-running the full playbook.
Ansible · SENIOR
Production server configuration drifted. How do you use Ansible to detect and fix?
Configuration drift means servers that should be identical have diverged -- someone ran a manual command, a package auto-updated, a file was edited. Detection: ansible all -m setup --tree /tmp/facts collects current state of all servers. Compare against expected state. Use ansible-lint to validate playbooks. The idempotent approach: write your playbooks with state: present/absent/latest so running them multiple times always produces the same result. Run playbooks in check mode first: ansible-playbook site.yml --check --diff shows exactly what would change without making changes. This is your drift detection. Fix: ansible-playbook site.yml --limit drifted_servers applies the correct state. For scheduled drift detection: run the playbook in check mode via cron or Ansible Tower/AWX every 6 hours. If check mode finds differences, send an alert. Decide: auto-remediate (re-run without --check) or require manual approval. At HPE we ran check mode daily and auto-remediated non-critical drift (package versions) but required human approval for file changes in /etc/.
Ansible · PRODUCTION
Ansible playbook fails halfway through on 50 servers. How do you recover safely?
The danger: 25 servers are in the new state, 25 are in the old state. Your environment is inconsistent. Recovery steps: do not just re-run the full playbook blindly. Step 1: identify exactly which servers failed. The Ansible output (play recap) shows the exact hosts. Step 2: understand WHY it failed. Check the error message. Common causes: SSH connection timeout (network issue, not a playbook problem, so retry), package manager locked (another process running apt/yum, so wait and retry), disk full (needs manual intervention before retry), wrong variable value (fix the variable first). Step 3: use --start-at-task to resume from where it failed, for example ansible-playbook site.yml --limit "failed_servers" --start-at-task "task that failed". Do not re-run from the beginning on servers where steps already completed. Step 4: for the 25 servers that completed successfully, run the playbook in --check mode to verify they are in the correct state. Step 5: after full recovery, run the complete playbook against all 50 servers in check mode to confirm consistency. Prevention: use serial: 10% in the playbook so Ansible processes servers in batches. If every host in the first batch fails, the play stops before touching the rest; add max_fail_percentage to abort when only part of a batch fails.
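The effect of serial: 10% on blast radius can be sketched as follows. This is a simplified model under stated assumptions: percentages round down with a minimum batch size of one host, and the rollout stops when every host in a batch fails. Host names and both helper functions are hypothetical.

```python
def serial_batches(hosts, serial):
    """Split hosts into batches for an integer or percentage serial value."""
    if isinstance(serial, str) and serial.endswith("%"):
        size = int(len(hosts) * float(serial[:-1]) / 100)
    else:
        size = int(serial)
    size = max(1, size)  # never produce an empty batch
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def rollout(hosts, serial, run_on_host):
    """Run batch by batch; stop when a whole batch fails."""
    succeeded = []
    for batch in serial_batches(hosts, serial):
        results = {host: run_on_host(host) for host in batch}
        succeeded.extend(host for host, ok in results.items() if ok)
        if not any(results.values()):
            break  # every host in this batch failed: abort the rollout
    return succeeded


hosts = [f"web{n:02d}" for n in range(1, 51)]  # 50 hypothetical servers
touched = []


def broken_deploy(host):
    touched.append(host)
    return False  # simulate a playbook that fails everywhere


done = rollout(hosts, "10%", broken_deploy)
print(len(serial_batches(hosts, "10%")), "batches;", len(touched), "hosts touched")
```

With 50 servers and serial: 10%, a playbook that fails everywhere touches only the first 5 hosts instead of all 50.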
📈 Monitoring: Alertmanager, Grafana, PromQL, Datadog (15 Q&As)
Prometheus · ENGINEER
What is Prometheus and how does it differ from traditional monitoring tools like Nagios?
Prometheus is a pull-based, time-series monitoring system. It periodically scrapes metrics from HTTP endpoints on your applications and infrastructure. Nagios and Zabbix are check-based — they periodically run a check script (is the service responding? is disk below 90%?) and return OK/WARNING/CRITICAL. The fundamental difference: Prometheus stores actual metric values over time as numbers, enabling powerful queries like trends, rates, and percentile calculations. Nagios only knows current state. With Prometheus you can ask: what was the 99th percentile latency last Tuesday between 2pm-4pm? What is the rate of change in memory usage over the last 6 hours? Nagios cannot answer these. Prometheus is cloud-native — designed for containers where services appear and disappear. Service discovery automatically finds new pods. Nagios requires manual host registration.
Prometheus · ENGINEER
Explain the four Prometheus metric types. When do you use each?
Counter: always increases, never decreases (resets on restart). Use for total requests, errors, bytes processed. Always use rate() or increase() in PromQL to get rate of change per second — the raw counter number is not useful. Gauge: current value, can go up or down. Use for memory usage, active connections, queue depth, temperature. Query directly without rate(). Histogram: distributes values into predefined buckets. Use for request latency and response sizes where you need percentile calculations. Creates three series: _bucket (counts), _count (total), _sum (sum of all values). Use histogram_quantile() to calculate P95/P99. Summary: similar to histogram but calculates quantiles in the application code. Less flexible than histogram because you cannot aggregate across multiple instances. Prefer histogram unless you have a specific reason for summary.
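To show what histogram_quantile() does with the _bucket series, here is a simplified sketch of the calculation: find the bucket containing the target rank and interpolate linearly inside it. It follows the commonly documented behaviour (lower bound of the first bucket taken as 0, highest finite bound returned when the rank lands in the +Inf bucket) but is not Prometheus's actual code, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with (math.inf, total_count), like the le label series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[index - 1][0] if index > 0 else math.nan
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency histogram in seconds: 100 requests in total.
latency = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print("P50:", histogram_quantile(0.50, latency))
print("P95:", histogram_quantile(0.95, latency))
```

Because the result is interpolated within a bucket, accuracy depends on choosing bucket boundaries close to the latencies you care about, such as the SLA threshold.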
Prometheus · ARCHITECT
An alert is firing in Prometheus but you see no issue in Grafana. What are the possible causes?
Five possible causes. One: time range mismatch. The Grafana dashboard is showing the last 1 hour but the alert triggered on a spike that is now outside the range. Expand the time range. Two: the alert has no for duration, so it fired on a single data point that immediately resolved. The alert appears in history but the Grafana dashboard now shows normal. Add for: 5m to the alert rule. Three: the alert is silenced in Alertmanager because someone silenced it while investigating. Check the silences in the Alertmanager UI. Four: the metric labels in the alert do not match the Grafana panel query. The alert is for namespace=production but the Grafana panel shows all namespaces combined. Look at the exact alert labels. Five: recording rule lag. If the alert uses a recording rule, that rule is only refreshed once per evaluation interval, so the alert can fire on slightly stale data. Check the rule group's evaluation interval. At HPE: most of these issues come from copy-pasting alert rules without understanding the labels: the alert fires for a specific pod but the Grafana panel aggregates all pods.
Prometheus · PRODUCTION
Prometheus is using too much memory and pods keep OOMKilling. How do you fix it?
Root cause: almost always too many active time series (high cardinality). Memory is dominated by the series held in the in-memory head block, so retention length matters far more for disk than for RAM. Investigation: Prometheus UI → Status → TSDB Status shows top series by metric name and label name. Look for any label with millions of unique values, typically user_id, session_id, request_id, or URL path with parameters. These are high-cardinality labels and each unique value = separate time series = memory. Immediate fix: increase the memory limit to stop the OOM loop; reducing retention from 15 days to 7 days mainly frees disk and helps memory only modestly. Medium-term fix: remove high-cardinality labels from metrics using metric_relabel_configs in the scrape config to drop the offending label. Long-term fix: code review of application metrics, where every label must have bounded cardinality. Also check for metrics that are never queried and can be dropped entirely. Recording rules help query cost too, by pre-aggregating expensive queries into new lower-cardinality series. At HPE: a developer added request_path as a label with full URL paths including query parameters. 50 million unique series in 2 hours. Fix: drop the path label in relabel config, add path_prefix with only the first URL segment.
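The cardinality problem is multiplication: in the worst case a metric produces one series per combination of label values. The sketch below uses made-up label counts to show how a single unbounded label dominates; series_count is a hypothetical helper.

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of time series for one metric.

    label_cardinalities maps each label name to its number of distinct values.
    """
    return prod(label_cardinalities.values())


# Hypothetical HTTP metric with bounded labels.
bounded = {"method": 5, "status": 8, "path_prefix": 20}

# Same metric with the full request path as a label.
unbounded = {"method": 5, "status": 8, "request_path": 100_000}

print("bounded:  ", series_count(bounded))
print("unbounded:", series_count(unbounded))
```

Replacing the full path with a bounded path_prefix, as described above, is what brings the product back down.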
Prometheus · ENGINEER
What is the difference between Prometheus rate() and increase() functions?
Both calculate how much a counter changed over a time window, but express it differently. rate(counter[5m]) gives the per-second rate of increase averaged over 5 minutes. increase(counter[5m]) gives the total increase over 5 minutes. Mathematically: increase = rate × 300 (seconds in 5 minutes). Use rate() when you care about speed — requests per second, errors per second. Use increase() when you care about total count in a window — how many restarts in the last hour, how many deploys today. Important nuance: both handle counter resets (when a process restarts and counter goes to 0) by detecting the reset and not counting it as a decrease. This is why you must use rate()/increase() instead of subtracting counter values directly.
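A simplified sketch of the relationship between the two functions, including counter-reset handling, is shown below. Real Prometheus also extrapolates to the edges of the window, which this version deliberately omits; the sample data is made up and includes one process restart.

```python
def increase(samples):
    """Total counter increase across samples, tolerating counter resets.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # Counter reset: the process restarted at 0 and reached `current`.
            total += current
    return total


def rate(samples):
    """Average per-second increase over the sampled window."""
    window = samples[-1][0] - samples[0][0]
    return increase(samples) / window


# Hypothetical 5-minute window, scraped every 60s, with a restart at t=180.
samples = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70), (300, 130)]

naive = samples[-1][1] - samples[0][1]  # plain subtraction ignores the reset
print("naive difference:", naive)
print("increase:", increase(samples))
print("rate per second:", rate(samples))
```

Plain subtraction reports 30 for this window, while the reset-aware calculation reports 250, which is why raw counter arithmetic gives misleading results after restarts.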
Prometheus · ARCHITECT
How do you design Prometheus alerting for a Kubernetes production cluster to avoid alert fatigue?
Alert fatigue happens when too many alerts fire, operators stop paying attention, and real incidents get missed. Prevention strategy: only alert on symptoms (user-visible impact), not causes. Symptom alert: high error rate. Cause alert: database connection timeout, which is often too specific. Four golden signals to always alert on: Latency (P99 > SLA), Traffic (abnormal request rate), Errors (error rate > threshold), Saturation (CPU/memory/disk approaching limit). For each alert: add a for duration (for: 5m for critical, for: 15m for warning) to prevent noise from transient spikes. Add a runbook_url annotation pointing to the documented response. Inhibition rules: if a node is down, suppress all pod alerts from that node, because the root cause is the node, not the individual pods. Group related alerts in Alertmanager: group by namespace so 50 pod alerts from one bad deployment become one grouped notification. Route by team: platform team gets infra alerts, app team gets their service alerts. Review alerts monthly: if an alert fires more than twice a week for a non-incident, its threshold is too low or it is not important enough to page on.
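The inhibition and grouping ideas can be sketched in a few lines. This models the concepts only and is not Alertmanager's implementation; the alert names, label names, and both helper functions are hypothetical.

```python
from collections import defaultdict


def inhibit(alerts):
    """Drop pod-level alerts whose node already has a NodeDown alert."""
    down_nodes = {a["node"] for a in alerts if a["alertname"] == "NodeDown"}
    return [
        a for a in alerts
        if a["alertname"] == "NodeDown" or a.get("node") not in down_nodes
    ]


def group_alerts(alerts, group_by):
    """Bucket alerts by the given labels; each bucket is one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = (
    # 50 pods crash-looping after one bad deployment.
    [{"alertname": "PodCrashLooping", "namespace": "payments",
      "node": "node-a", "pod": f"api-{i}"} for i in range(50)]
    # A node failure plus the pod alerts it causes.
    + [{"alertname": "NodeDown", "node": "node-b"}]
    + [{"alertname": "PodNotReady", "namespace": "search",
        "node": "node-b", "pod": f"idx-{i}"} for i in range(3)]
)

kept = inhibit(alerts)
groups = group_alerts(kept, ["namespace"])
print(len(alerts), "alerts ->", len(kept), "after inhibition ->",
      len(groups), "notifications")
```

In this example 54 raw alerts collapse into 2 notifications: one for the payments namespace and one for the node failure.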
Datadog · ENGINEER
What is Datadog and when would you choose it over Prometheus+Grafana?
Datadog is a SaaS observability platform covering metrics, logs, traces, APM, synthetic testing, and security in one platform. You choose Datadog over Prometheus+Grafana when: team does not want operational overhead of managing monitoring infrastructure, you need distributed tracing (APM) with automatic service dependency mapping, you want logs and metrics in the same tool with automatic correlation, you need synthetic monitoring of external endpoints, or you need ML-based anomaly detection. Prometheus+Grafana wins when: cost is a priority (Datadog is $15-35/host/month, Prometheus is free), team has Kubernetes expertise and wants fine-grained control, you are multi-cloud and want one consistent tool, or you need custom PromQL queries Datadog cannot express. In practice: large enterprises often use BOTH — Prometheus for detailed K8s metrics and alerting, Datadog for APM and distributed tracing where auto-instrumentation saves weeks of engineering.
Datadog · PRODUCTION
Datadog agent on Kubernetes is not sending metrics. How do you troubleshoot?
Layered debugging. First: is the agent pod running? kubectl get pods -n datadog should show one DaemonSet pod per node. If a pod is in CrashLoopBackOff, check logs: kubectl logs daemonset/datadog -n datadog. Second: check agent status from inside the pod. kubectl exec -it datadog-agent-pod -n datadog -- agent status. This shows all checks, whether they are passing, and any errors. Third: verify the API key is valid. The agent status output reports API key validity, and a wrong API key shows up as 403 Forbidden errors in the agent logs. Fourth: check network connectivity. Can the agent reach the Datadog endpoint for your site (api.datadoghq.com for the default US site) on port 443? Some clusters have restrictive egress policies. kubectl exec -it datadog-agent-pod -n datadog -- curl -v https://api.datadoghq.com. Fifth: if you expect custom metrics, check that the path they use is enabled. For HPA scaling on Datadog metrics, clusterAgent.metricsProvider.enabled must be true in Helm values; application metrics sent through DogStatsD need DogStatsD to be reachable from the application pods. Common issue at scale: the cluster agent (not the node agent) is responsible for Kubernetes state metrics. If the cluster agent is down, pod/deployment metrics disappear from Datadog even though node agents are running.
Datadog · ARCHITECT
How does Datadog APM work? What is distributed tracing?
Distributed tracing tracks a single user request as it flows through multiple services. Without tracing: you see slow responses but don't know if the slowness is in the API, the database call, or the downstream payment service. With tracing: every request gets a unique trace ID. Each service that handles the request creates a span (start time, end time, tags). All spans with the same trace ID are linked into a waterfall diagram showing exactly where time was spent. Implementation: instrument each service with a Datadog tracer (ddtrace for Python, dd-java-agent for Java). The tracer automatically creates spans for HTTP requests, database queries, Redis calls, Kafka consumers — no code changes needed for most frameworks. The trace ID is propagated via HTTP headers (x-datadog-trace-id) between services. In Datadog: APM → Traces shows all traces. APM → Service Map shows all services as nodes with error rates and latency on the edges between them. Critical for understanding: which service is causing slowness, what is the dependency graph, what is the blast radius if one service is slow.
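To make the span and trace ID idea concrete, here is a minimal sketch of how spans sharing one trace ID can be analysed to find where the time went. It does not use the Datadog tracer; the Span class, the service names, and the durations are all hypothetical, and it assumes child spans run one after another.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    duration_ms: int


def self_times(spans):
    """Time spent in each span itself, excluding its direct children.

    Assumes children run sequentially, which real traces do not guarantee.
    """
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


def slowest_service(spans):
    """Return (service, self_time_ms) for the span with the largest self time."""
    times = self_times(spans)
    worst = max(spans, key=lambda s: times[s.span_id])
    return worst.service, times[worst.span_id]


# One request: the API span took 1000 ms in total, most of it inside two children.
trace = [
    Span("t1", "a", None, "api", 1000),
    Span("t1", "b", "a", "postgres", 800),
    Span("t1", "c", "a", "payment", 100),
]
print(slowest_service(trace))
```

The API span looks slow from the outside, but its own self time is only 100 ms; the waterfall view in an APM tool surfaces the same conclusion visually.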
Monitoring · ENGINEER
How do you configure Alertmanager for production alerting?
Alertmanager receives alerts from Prometheus and handles routing, grouping, silencing, and notification. Config: global.smtp_from / slack_api_url. route: group_by: [alertname, namespace]. group_wait: 30s (wait to collect related alerts before the first notification). group_interval: 5m (minimum wait before notifying about new alerts added to a group that has already been notified). repeat_interval: 4h (re-send if still firing). receiver: slack-production. routes: - match: severity=critical, receiver: pagerduty. Inhibition rules: critical alert inhibits warning. If payment-api-down fires, suppress all individual pod alerts for payment-api (noise reduction). Receiver config for Slack: - name: slack-production, slack_configs: api_url: $SLACK_WEBHOOK, channel: #alerts-production, text: {{ range .Alerts }}{{ .Annotations.description }}{{ end }}. PagerDuty: pagerduty_configs: service_key: $PAGERDUTY_KEY, severity: critical. Best practices: critical alerts = PagerDuty (wakes someone up). Warning = Slack (visible but no wake-up). Alert on symptoms not causes: alert on "high error rate" not "CPU high" (CPU high is a cause, user impact is the symptom). Dead man switch: an always-firing alert confirms the Prometheus-to-Alertmanager pipeline is working. If it goes silent, something is broken.
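The effect of group_by and severity routing can be illustrated with a small model. This is a sketch of the idea only, not how Alertmanager is implemented; the alert dictionaries, label values, and receiver names are assumptions chosen to mirror the config above.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels, like route.group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def pick_receiver(alert, default="slack-production"):
    """Critical alerts page someone; everything else goes to the default receiver."""
    if alert["labels"].get("severity") == "critical":
        return "pagerduty"
    return default


# 50 pods from one bad deployment, all firing the same alert in one namespace.
alerts = [
    {"labels": {"alertname": "PodCrashLooping", "namespace": "payments",
                "pod": f"payment-api-{i}", "severity": "warning"}}
    for i in range(50)
]
alerts.append({"labels": {"alertname": "HighErrorRate", "namespace": "payments",
                          "severity": "critical"}})

groups = group_alerts(alerts, ["alertname", "namespace"])
for key, members in groups.items():
    print(key, len(members), pick_receiver(members[0]))
```

Fifty-one alerts collapse into two notifications: one grouped Slack message for the crash loops and one page for the error rate.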
Monitoring · ENGINEER
How do you set up Grafana dashboards for Kubernetes?
Grafana connects to Prometheus as a data source. Add data source: Configuration → Data Sources → Add Prometheus → URL: http://prometheus:9090. For Kubernetes monitoring use kube-prometheus-stack (Prometheus Operator + pre-built Grafana dashboards): helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack. Includes 20+ pre-built dashboards: Kubernetes Cluster Overview (nodes, pods, namespaces), Node Exporter Full (CPU, memory, disk, network per node), Kubernetes Pods (per-pod CPU, memory, restarts), Kubernetes Deployments. Custom dashboard for your application: Add panel → PromQL query → select visualisation type. Panel for HTTP error rate: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service). Organise dashboards: folder per team (platform, payments, orders). Dashboard variables: $namespace dropdown lets user filter all panels. Variables: Settings → Variables → type=query → query: label_values(kube_pod_info, namespace). Now every panel uses $namespace variable. Export dashboards as JSON → store in Git → GitOps deploy via ConfigMap → Grafana sidecar picks them up automatically.
Monitoring · ENGINEER
What PromQL queries do you use daily for AKS monitoring?
CPU throttling (most important): rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1 flags pods being CPU throttled, which indicates limits are too low. OOMKilled restarts: increase(kube_pod_container_status_restarts_total[1h]) > 3 and on (namespace, pod, container) kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1. Memory pressure: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1 means the node has less than 10% memory available. Pod pending: kube_pod_status_phase{phase="Pending"} > 0 with for: 5m in the alert rule catches pods waiting for resources. HTTP error rate per service: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.01. P99 latency: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)). Certificate expiry: (x509_cert_expiry - time()) / 86400 < 30 flags certificates expiring in less than 30 days (the exact metric name depends on which certificate exporter you run). Disk pressure: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15 means less than 15% disk free. These 8 queries cover the most common production alert scenarios.
Monitoring · ENGINEER
How do you set up Datadog for AKS monitoring?
Install the Datadog Agent as a DaemonSet via Helm: helm repo add datadog https://helm.datadoghq.com. values.yaml: datadog.apiKey from environment variable (DD_API_KEY), datadog.clusterName: myaks-production, clusterAgent.enabled: true, datadog.kubelet.tlsVerify: false (for AKS), datadog.logs.enabled: true, datadog.logs.containerCollectAll: true, datadog.apm.enabled: true. helm install datadog datadog/datadog -f values.yaml -n datadog. What Datadog collects automatically: node CPU/memory/disk metrics, pod metrics (CPU, memory, restarts), container logs from all pods, Kubernetes events. Add application-level monitoring: annotate pods for Datadog log collection: ad.datadoghq.com/<container_name>.logs: '[{"source":"python","service":"payment-api"}]', where <container_name> is the name of the container in the pod spec. For APM: add DD_AGENT_HOST (via downward API) and the dd-trace library to the application. Dashboards: Datadog has pre-built Kubernetes dashboards. Custom dashboards: drag-and-drop with your custom metrics. Monitors (alerts): create from any metric with threshold, anomaly detection, or forecast alerts. Integration with PagerDuty: one-click in Datadog integrations. Cost control: use DogStatsD for custom metrics instead of APM where possible, and configure log exclusion filters to avoid ingesting noisy or irrelevant logs.
Monitoring · SENIOR
Production alert: 503 errors spiking. You have Prometheus + Grafana. What is your process?
Minute 0-2: check the Grafana dashboard. The HTTP error rate panel shows which service and which endpoint is returning 503s. The error rate chart shows when it started: was there a deployment? Minute 2-5: narrow the blast radius. Is it one service or multiple? kubectl get pods -n production shows whether pods are crashing. If CrashLoopBackOff: the service itself is broken. If pods are running: it might be a downstream dependency (database, external API). Minute 5-10: Prometheus queries to isolate. sum(rate(http_requests_total{status="503"}[5m])) by (service, endpoint) shows which specific endpoint. Check the upstream: if payment-api returns 503, check whether it can reach the database using the probe_success metric from Blackbox Exporter. Check latency: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) shows whether the 99th percentile is exploding, which points to slowness rather than crashes. Minute 10+: if the database connection pool is exhausted, increase pool size or scale the service. If a dependency is down, the circuit breaker should kick in, so check that it is working. Always: communicate status every 10 minutes to stakeholders even if it is only "still investigating". Silent incident response makes stakeholders nervous.
Monitoring · ARCHITECT
How do you set SLOs and SLAs for a production service? What did you use at HPE?
SLA is the contractual commitment to customers. SLO is the internal engineering target (usually stricter than SLA). SLI is the measurement. At HPE for the telecom provisioning platform: SLI choices were availability (percentage of requests returning 2xx/3xx, measured per 1-minute window), latency (percentage of requests completing under 500ms), and error rate (percentage of requests returning 5xx). SLO values: availability 99.9% (allows about 43.8 minutes of downtime in an average calendar month); latency: 95% of requests under 500ms, 99% under 2 seconds. SLA to business: 99.5% availability monthly. Error budget: if the SLO is 99.9% availability over 30 days, the error budget is 43.2 minutes (0.1% of 43,200 minutes). If we use more than that: freeze new deployments until next month, focus engineering on reliability. Implementation in Prometheus: a recording rule calculates rolling 30-day availability. Alert fires when burn rate exceeds 2x (fast burn = will exhaust the budget in about 2 weeks) or 1x for 6 hours (slow burn). The multi-window alerting (5-minute window + 1-hour window must both exceed threshold) reduces alert noise dramatically. Grafana dashboard shows: current SLO compliance, error budget remaining, burn rate. Reviewed weekly in engineering meeting.
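The error budget and burn rate arithmetic is easy to check by hand. The sketch below is plain arithmetic under stated assumptions (the function names are made up, and burn rate is taken as observed error ratio divided by the allowed error ratio). A 30-day window at 99.9% gives 43.2 minutes; the roughly 43.8-minute figure often quoted corresponds to an average calendar month of 365.25/12 days.

```python
def error_budget_minutes(slo, window_days=30.0):
    """Minutes of full downtime the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio, slo):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1 - slo)


def days_to_exhaustion(rate, window_days=30.0):
    """Days until the whole budget is gone if the current burn rate holds."""
    return window_days / rate


print(round(error_budget_minutes(0.999), 1))                  # 30-day window
print(round(error_budget_minutes(0.999, 365.25 / 12), 1))     # average calendar month
print(round(error_budget_minutes(0.995), 1))                  # the looser 99.5% SLA
print(round(days_to_exhaustion(burn_rate(0.002, 0.999)), 1))  # 0.2% errors against a 99.9% SLO
```

A sustained 0.2% error ratio against a 99.9% SLO is a 2x burn, which spends a 30-day budget in 15 days; that is where the "about 2 weeks" in the fast-burn alert comes from.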
📊Datadog — APM, Dashboards, Monitors, Logs4 Q&As›
Datadog · ENGINEER
What is Datadog APM and how do you instrument a Python application?
Datadog APM (Application Performance Monitoring) traces requests as they flow through your services — showing latency, errors, and bottlenecks. To instrument Python: pip install ddtrace. Run: ddtrace-run python app.py. This auto-instruments Flask, Django, requests, psycopg2 etc. In code: from ddtrace import tracer; with tracer.trace("operation.name") as span: span.set_tag("user.id", user_id). Key concepts: Trace = end-to-end request. Span = single operation within a trace. Service Map = visual of all services and their dependencies. Flame graph = shows exactly where time is spent within a request. Configure via DD_SERVICE, DD_ENV, DD_VERSION env vars for unified service tagging.
Datadog · ENGINEER
How do you create a Datadog monitor and alert? Walk through a real example.
A Datadog Monitor watches a metric or log query and alerts when a threshold is breached. Real example — CPU alert on Kubernetes node: Go to Monitors → New Monitor → Metric. Query: avg:system.cpu.user{kube_cluster_name:prod} by {host}. Set threshold: Alert when > 85% for 5 minutes. Warning at 70%. Configure notification: "@slack-devops-alerts High CPU on {{host.name}} — current: {{value}}%". Add escalation: if unresolved after 30 minutes, page on-call via PagerDuty. Types of monitors: Metric monitor (threshold on time-series), Log monitor (alert on log pattern count), APM monitor (P99 latency > 500ms), Composite monitor (CPU AND memory both high), Anomaly monitor (ML-based deviation from baseline).
Datadog · ARCHITECT
How do you set up Datadog in a Kubernetes cluster and what should you monitor?
Install via Helm: helm repo add datadog https://helm.datadoghq.com; helm install datadog-agent datadog/datadog --set datadog.apiKey=YOUR_KEY --set datadog.clusterName=prod. The DaemonSet deploys one agent per node. What to monitor: Node level — CPU, memory, disk, network per node. Pod level — container CPU/memory limits vs requests, OOMKills, restarts. Control plane — API server latency, etcd health, scheduler queue. Application — custom metrics via StatsD or DogStatsD. Logs — collect with datadog.logs.enabled=true, then use Log Patterns to detect anomalies. Key dashboards: Kubernetes Overview (built-in), Node health, Deployment status. SLO tracking: create SLOs in Datadog based on APM error rate or uptime monitors — links directly to error budget reporting.
Datadog · ARCHITECT
What is the difference between Datadog Logs, Metrics, and Traces? When do you use each?
Three pillars of observability — each answers a different question. Metrics: What is happening? Numeric, time-series data. CPU=85%, request rate=1200/min, error rate=0.3%. Low storage cost, fast to query, ideal for dashboards and alerts. Logs: Why did it happen? Unstructured text events. "ERROR: database connection failed at 14:32:05". High detail but expensive to store and slow to search. Use for debugging specific incidents. Traces: How did it happen? Distributed request flow across services. Shows that a slow API call spent 800ms waiting on database query. Links logs and metrics to a specific request. In practice: Alert triggers from a Metric threshold. You open the Dashboard, correlate with Logs to find the error message. You open the Trace to see which service call caused the slowdown. Together they give full context — metrics alone cannot tell you WHY, logs alone cannot show you WHERE across services.
🟠AWS — EC2, EKS, HA Design, Troubleshooting43 Q&As›
AWS · ARCHITECT
Design a production-grade VPC architecture for a 3-tier application on AWS.
Three-tier VPC with public, private-app, and private-data subnets across 3 AZs. VPC CIDR 10.0.0.0/16 giving 65,536 IPs. Public subnets host: Application Load Balancer (inbound 443 from 0.0.0.0/0), NAT Gateways (one per AZ for HA), Bastion host (if needed). Private app subnets host EKS nodes and EC2 — they route outbound traffic through NAT Gateway. Private data subnets host RDS, ElastiCache — NO internet route at all, fully isolated. Security groups implement least-privilege: ALB-SG allows inbound 443 from anywhere. App-SG allows inbound only from ALB-SG. DB-SG allows inbound only from App-SG. VPC Flow Logs enabled for security audit. VPC endpoints for S3 and DynamoDB avoid NAT Gateway costs for AWS service traffic. At scale: inter-AZ data transfer costs money — keep app pods and their RDS AZ aligned.
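The subnet layout can be planned with Python's standard ipaddress module before any Terraform is written. This is a sketch only: the /20 subnet size, the tier names, and the AZ suffixes are assumptions for illustration, not something the design above prescribes.

```python
from ipaddress import ip_network

vpc = ip_network("10.0.0.0/16")
print(vpc.num_addresses)  # total addresses in the VPC CIDR

# Carve the VPC into /20 blocks (4,096 addresses each; AWS reserves 5 per subnet).
blocks = list(vpc.subnets(new_prefix=20))

tiers = ["public", "private-app", "private-data"]
azs = ["a", "b", "c"]

# Assign one block per (tier, AZ) pair, in order; the remaining blocks stay free
# for future growth.
pairs = [(tier, az) for tier in tiers for az in azs]
plan = {f"{tier}-{az}": blocks[i] for i, (tier, az) in enumerate(pairs)}

for name, cidr in plan.items():
    print(f"{name:16} {cidr}")
print("unallocated blocks:", len(blocks) - len(plan))
```

Nine of the sixteen /20 blocks are used, which leaves room to add a tier or a fourth AZ later without re-addressing.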
AWS · ENGINEER
What is the difference between Security Groups and NACLs in AWS?
Security Groups are stateful — if you allow inbound traffic, return traffic is automatically allowed. They operate at the instance/ENI level. You can only create ALLOW rules. Changes take effect immediately. NACLs (Network ACLs) are stateless — you must explicitly allow both inbound and outbound traffic for a connection to work. They operate at the subnet level and apply to all instances in the subnet. Rules are evaluated in order by rule number — first match wins. You can create both ALLOW and DENY rules. Use case for NACLs: blocking a specific IP range at the subnet level (DDoS mitigation), quick emergency block. Use case for Security Groups: fine-grained instance-level control. Best practice: Security Groups for normal operations. NACLs as an additional layer for subnet-level blocking only.
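The "first match wins" behaviour of NACLs is the part people most often get wrong, so here is a small model of it. This is an illustration of the evaluation order only; the NaclRule class, rule numbers, and IP ranges are hypothetical, and real NACLs also match on protocol and port ranges.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass
class NaclRule:
    number: int
    cidr: str
    port: int
    allow: bool


def nacl_allows(rules, src_ip, port):
    """Evaluate rules in ascending rule-number order; the first match decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port == port and ip_address(src_ip) in ip_network(rule.cidr):
            return rule.allow
    return False  # nothing matched: the implicit final deny applies


rules = [
    NaclRule(200, "0.0.0.0/0", 443, True),        # allow HTTPS from anywhere
    NaclRule(100, "203.0.113.0/24", 443, False),  # emergency block of one range
]

print(nacl_allows(rules, "203.0.113.7", 443))   # blocked range
print(nacl_allows(rules, "198.51.100.5", 443))  # any other client
print(nacl_allows(rules, "198.51.100.5", 22))   # no rule for this port
```

The DENY only works because rule 100 is evaluated before rule 200; give the block a higher number than the broad ALLOW and it never takes effect. A Security Group has no equivalent, since it has no DENY rules and no ordering.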
AWS · ARCHITECT
Explain IRSA — IAM Roles for Service Accounts — and why it matters for EKS security.
Before IRSA: to give a pod AWS access, you stored access keys as Kubernetes secrets or gave the EC2 node role broad permissions (all pods on that node get all permissions). Both are security risks. IRSA uses OIDC federation: EKS cluster has an OIDC endpoint. You associate this with your AWS account. Create an IAM role with a trust policy allowing only the specific Kubernetes service account in a specific namespace. Pod uses that service account. At runtime, the pod automatically gets a temporary credential via the OIDC token. No keys stored anywhere. If the pod is compromised, credentials expire in 1 hour. Blast radius is limited to exactly the permissions in that role. Implementation: eksctl utils associate-iam-oidc-provider, then eksctl create iamserviceaccount. This is the production standard for EKS and what every AWS interview expects you to know for containerised workloads.
AWS · PRODUCTION
S3 bucket was accidentally made public. What do you do in the next 60 seconds?
First 60 seconds: block public access immediately at both levels with all four flags set to true: aws s3api put-public-access-block for the bucket and aws s3control put-public-access-block for the account. This stops new exposure instantly. Next 5 minutes: check what was in the bucket using aws s3api list-objects-v2. Check CloudTrail for GetObject calls in the last hour to understand what was potentially accessed (this requires S3 data events to be enabled in CloudTrail; otherwise fall back to S3 server access logs if they are configured). Check the bucket policy and ACL that allowed public access. Next 30 minutes: file a security incident. Notify your security team. If the bucket contained PII, initiate your GDPR/data breach notification process (72 hours under GDPR). Fix the root cause: identify which Terraform/IaC change omitted the public access block settings. Prevention for future: AWS Config rule s3-bucket-public-read-prohibited to flag public buckets as soon as they appear. SCP at organization level blocking all public bucket creation. In Terraform: always include an aws_s3_bucket_public_access_block resource with all four booleans set to true.
AWS · ENGINEER
What is the difference between EBS, EFS, and S3 storage on AWS?
EBS (Elastic Block Store): block storage, attached to one EC2 instance at a time (ReadWriteOnce). Like a hard drive. Use for OS volumes, databases (RDS uses EBS), single-instance app data. High IOPS, low latency. Types: gp3 (general), io2 (high performance databases), st1 (throughput — Kafka), sc1 (cold, infrequent access). EFS (Elastic File System): network file system, multiple EC2 instances can mount simultaneously (ReadWriteMany). NFS protocol. Use for shared content (web tier accessing same files), EKS pods needing shared storage across nodes. More expensive than EBS. S3: object storage, HTTP API (not mountable as filesystem natively). Unlimited scale. Use for: backups, static content, data lake, application artifacts, logs. Cannot run a database on S3. Differences in EKS context: EBS for databases in StatefulSets, EFS for shared config or content across pods, S3 for application data and backups via SDK.
AWS · PRODUCTION
How do you troubleshoot an EC2 instance that is unreachable via SSH?
Systematic approach. First: check EC2 console — is the instance state Running? Check system status checks and instance status checks. System check failure means AWS hardware issue — stop and start the instance (this migrates to new hardware). Instance check failure means OS-level issue. Second: check Security Group — does it allow inbound 22 from your IP? Connections silently drop without SG rule. Third: check NACLs — is there a DENY rule on port 22? Fourth: check instance system log — EC2 Console → Get system log — shows Linux boot messages and any panic/crash. Fifth: if no SSH key access, use AWS Systems Manager Session Manager — no SSH needed, works through Systems Manager agent. Sixth: for EBS-backed instances, detach the root volume, attach to a working instance as secondary volume, fix the issue (bad sshd_config, full disk), reattach. Production rule: disable SSH entirely and use SSM Session Manager — no inbound ports, full audit trail.
AWS · ARCHITECT
How does AWS Auto Scaling work with ALB for a production application?
Full flow: ALB receives traffic, distributes to target group. Target group contains EC2 instances or EKS pods. Auto Scaling Group manages the EC2 instances. When CPU/memory/custom metric crosses threshold, ASG launches new instances, registers them with the target group, ALB starts sending traffic once health check passes. Scale-down: after cooldown period (300 seconds default), underutilised instances are terminated, deregistered from target group first so in-flight requests complete. Key settings: min/desired/max capacity, health check grace period (give new instances time to start), cooldown (prevent rapid scale-up/down flapping), instance warm-up (how long before new instance counted in metrics). ALB health check vs EC2 health check: ALB health check tests HTTP endpoint. If it fails, ALB removes instance from rotation but ASG does not know. Configure ASG to use ELB health checks to replace unhealthy instances automatically.
AWS · ENGINEER
What is IRSA in EKS and how does it compare to Azure Workload Identity?
IRSA (IAM Roles for Service Accounts) is EKS's mechanism for giving pods access to AWS services without storing credentials. It works through OIDC federation: the EKS cluster has an OIDC issuer URL. An IAM Role is created with a trust policy that trusts tokens from that OIDC issuer for a specific Kubernetes ServiceAccount. The pod uses that ServiceAccount and gets temporary AWS credentials automatically via the AWS SDK credential chain. The trust policy specifies: "I trust tokens from cluster X for ServiceAccount Y in namespace Z." When the payment pod calls S3, boto3 sees the ServiceAccount token in the pod filesystem, exchanges it for temporary AWS credentials, and makes the S3 call. No access keys stored anywhere. Comparison with Azure Workload Identity: identical concept, different implementation. Both use OIDC federation between Kubernetes ServiceAccounts and the cloud identity system (AWS IAM vs Azure AD). Both eliminate stored credentials. The practical difference: IRSA configuration is done per-cluster with eksctl or Terraform. Azure Workload Identity requires the federated credential to be set up on the Managed Identity resource. Both are the current production standard for their respective platforms — never use static access keys in pods.
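The trust policy sentence ("I trust tokens from cluster X for ServiceAccount Y in namespace Z") maps onto a small JSON document. The sketch below builds one in Python; the account ID, OIDC issuer ID, namespace, and ServiceAccount name are placeholders, and the helper function is hypothetical rather than part of any AWS SDK.

```python
import json


def irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Build an IAM trust policy for one ServiceAccount.

    oidc_issuer is the cluster's issuer URL without the https:// prefix.
    """
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": provider_arn},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Only this namespace/ServiceAccount pair may assume the role.
                f"{oidc_issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{oidc_issuer}:aud": "sts.amazonaws.com",
            }},
        }],
    }


# Placeholder values for illustration only.
policy = irsa_trust_policy(
    account_id="111122223333",
    oidc_issuer="oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE0123456789",
    namespace="payments",
    service_account="payment-api",
)
print(json.dumps(policy, indent=2))
```

The sub condition is where most IRSA misconfigurations live: a typo in the namespace or ServiceAccount name produces an AccessDenied on AssumeRoleWithWebIdentity even though the role's permissions are correct.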
AWS · ENGINEER
What is the difference between ALB and NLB in AWS? When do you use each?
ALB (Application Load Balancer) operates at Layer 7 (HTTP/HTTPS). It understands the content of the request: URL path routing (/api → backend service, /static → static-content service), host-based routing (api.example.com → API service, app.example.com → frontend), gRPC, WebSocket, SSL termination, content-based routing, and WAF integration. Use ALB for: web applications, microservices with path-based routing, HTTP API endpoints, WebSocket applications. NLB (Network Load Balancer) operates at Layer 4 (TCP/UDP/TLS). It routes based on IP and port only, no content inspection. Handles millions of requests per second with very low latency compared with ALB. It can preserve the source IP of the client (behind an ALB the target sees the ALB's IP, and the client IP arrives in the X-Forwarded-For header). Use NLB for: TCP/UDP applications (gaming servers, IoT), high-frequency trading where latency matters, applications requiring source IP preservation, EKS Services with type LoadBalancer where you need a static IP (NLB supports static Elastic IPs, ALB does not). In EKS: use the AWS Load Balancer Controller (LBC). Annotate Service type LoadBalancer with service.beta.kubernetes.io/aws-load-balancer-type: external to get an NLB, or use Ingress with IngressClass alb to get an ALB for HTTP routing.
AWS · ARCHITECT
How do you design a highly available three-tier application on AWS?
Three-tier (presentation, application, data) deployed across two AZs minimum. Presentation tier: static assets in S3 with CloudFront CDN in front. Dynamic frontend in ECS/EKS. ALB distributes across AZs. Auto Scaling Group maintains minimum 2 instances across AZs. Application tier: EKS with node groups spanning two AZs. HPA scales pods. Cluster Autoscaler adds nodes. ALB routes to the EKS service. Security groups allow only the ALB to reach the application tier — no direct internet access. Data tier: RDS with Multi-AZ deployment — primary in AZ-1, standby in AZ-2. Automatic failover if primary fails (1-2 minutes). Read replicas for read-heavy workloads. ElastiCache (Redis) in cluster mode across AZs for session storage and caching. All data tier resources in private subnets — no public access. Supporting services: Route53 for DNS with health checks and automatic failover. CloudFront for global CDN and DDoS protection (absorbs layer 7 attacks at edge). WAF attached to CloudFront and ALB for OWASP Top 10 protection. KMS encrypts all data at rest. Secrets Manager for database credentials (rotate automatically). CloudWatch for monitoring, CloudTrail for audit. For disaster recovery: S3 Cross-Region Replication for static assets, RDS cross-region read replica that can be promoted, infrastructure as Terraform code so you can rebuild in a new region from code.
AWS · PRODUCTION
EKS pods cannot access S3. Walk through your troubleshooting steps.
Step 1: check the error. From inside the pod: kubectl exec -it pod-name -- aws s3 ls s3://my-bucket. The error message narrows the search. "Unable to locate credentials" means no IAM role attached. "Access Denied" means wrong permissions. "NoSuchBucket" means wrong bucket name or region. Step 2: verify IRSA setup. kubectl describe serviceaccount my-sa -n namespace and check for the annotation eks.amazonaws.com/role-arn. If missing: the ServiceAccount is not linked to an IAM role. Step 3: verify the pod is using the correct ServiceAccount. kubectl describe pod my-pod and check serviceAccountName. If it says "default", the deployment is not using the annotated ServiceAccount. Step 4: verify the IAM role trust policy. aws iam get-role --role-name MyEKSRole and check that the trust policy allows the EKS cluster OIDC issuer and the specific ServiceAccount. Common mistake: trust policy names the right cluster but the wrong namespace or ServiceAccount name. Step 5: verify the IAM role has the right permissions. aws iam simulate-principal-policy checks whether the role can perform s3:GetObject on the bucket ARN without actually doing it. Step 6: check the bucket policy. The S3 bucket might have a policy that denies access regardless of IAM role. Also check if the bucket is in a different region and you need the --region flag. Step 7: token refresh. The projected ServiceAccount token used by IRSA is valid for 24 hours by default, and the kubelet rotates it before it expires; current AWS SDKs re-read the file automatically. If a long-running pod starts failing with expired-token errors, suspect an old SDK version that does not reload the token, and restart the pod as a stopgap.
AWS · ENGINEER
Your web application is experiencing unpredictable traffic spikes, causing performance issues. How would you design an auto-scaling solution to handle these spikes?
To handle unpredictable traffic spikes, I would design an auto-scaling solution using AWS Auto Scaling groups. First, I would create an Auto Scaling group for the web application instances, defining the minimum, maximum, and desired capacity. I would set up scaling policies based on CloudWatch metrics, such as CPU utilization or request count. For example, if the average CPU utilization exceeds 70% for 5 consecutive minutes, I would trigger the scaling out policy to add more instances. Conversely, if the CPU utilization drops below 20% for 10 minutes, I would trigger the scaling in policy to terminate instances. I would also use Elastic Load Balancing (ELB) to distribute incoming traffic across multiple instances to ensure even load distribution and high availability.
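The two thresholds in this answer can be expressed as a small decision function. This is a toy model of the policy described above, not how CloudWatch alarms or Auto Scaling are implemented; the function name and the per-minute CPU samples are made up for the example.

```python
def scaling_decision(cpu_by_minute, high=70, high_minutes=5, low=20, low_minutes=10):
    """Decide on a scaling action from per-minute average CPU, newest sample last."""
    recent_high = cpu_by_minute[-high_minutes:]
    recent_low = cpu_by_minute[-low_minutes:]
    # Scale out only if every one of the last 5 minutes was above 70%.
    if len(recent_high) == high_minutes and all(v > high for v in recent_high):
        return "scale_out"
    # Scale in only if every one of the last 10 minutes was below 20%.
    if len(recent_low) == low_minutes and all(v < low for v in recent_low):
        return "scale_in"
    return "hold"


print(scaling_decision([40, 75, 80, 85, 90, 95]))  # sustained high load
print(scaling_decision([40, 75, 80, 30, 90, 95]))  # one dip breaks the streak
print(scaling_decision([10] * 10))                 # sustained low load
```

Requiring a longer quiet period to scale in than to scale out is deliberate: adding capacity late hurts users, while removing it late only costs a few extra instance-minutes.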
AWS · ENGINEER
Your company is facing high AWS bills. What strategies would you implement to optimize costs without compromising performance?
To optimize costs, I would implement the following strategies:
- Right-Sizing: Analyze the usage of EC2 instances and other resources, and right-size them to the most cost-effective instance types based on actual usage.
- Reserved Instances (RIs) and Savings Plans: Purchase Reserved Instances or Savings Plans for predictable workloads to get significant discounts compared to On-Demand pricing.
- Spot Instances: Use Spot Instances for fault-tolerant and flexible workloads to take advantage of the lower prices.
- Auto Scaling: Implement Auto Scaling to ensure resources are only used when needed, scaling in during off-peak times.
- Storage Optimization: Use S3 lifecycle policies to move infrequently accessed data to cheaper storage classes (e.g., S3 Infrequent Access or Glacier). Delete unused EBS volumes, snapshots, and obsolete data.
- Monitoring and Alerts: Set up CloudWatch alarms to monitor resource usage and receive alerts for any anomalies or over-utilization.
- Cost Explorer and Budgets: Use AWS Cost Explorer and Budgets to analyze spending patterns and set up budgets and alerts to stay within budget.
AWS · ENGINEER
Your organization requires a disaster recovery plan for a critical application hosted on AWS. What would your plan include?
The disaster recovery plan would include:
- Backup and Restore: Regularly back up data using AWS Backup or automated scripts to S3, and enable versioning and cross-region replication for S3 buckets.
- Pilot Light: Maintain a minimal version of the environment always running in another region. In case of a disaster, scale up the resources to handle production traffic.
- Warm Standby: Run a scaled-down version of the fully functional environment in another region. In case of a disaster, scale up to full capacity.
- Multi-Region Active-Active: Deploy the application across multiple regions in an active-active configuration. Use Route 53 for DNS failover to route traffic to healthy regions.
- Data Replication: Use RDS cross-Region read replicas or Aurora Global Database for database replication across regions (RDS Multi-AZ only protects against an AZ failure within a single region). Use DynamoDB global tables for multi-region replication of NoSQL data.
- Regular Testing: Regularly test the disaster recovery plan by simulating failover and recovery scenarios to ensure that RTO (Recovery Time Objective) and RPO (Recovery Point Objective) are met.
AWS · ENGINEER
Your company needs to comply with strict security regulations for handling sensitive customer data. How would you ensure compliance on AWS?
To ensure compliance, I would:
- Encryption: Enable encryption at rest and in transit for all sensitive data using AWS Key Management Service (KMS) and SSL/TLS. Use AWS Certificate Manager for managing SSL/TLS certificates.
- Identity and Access Management (IAM): Implement the principle of least privilege by creating granular IAM policies and roles. Use IAM roles for EC2 instances and other services to avoid hardcoding credentials.
- Logging and Monitoring: Enable CloudTrail for auditing API calls, configure CloudWatch Logs for logging application and system logs, and use AWS Config for resource configuration tracking and compliance auditing.
- Network Security: Use VPC with subnets, security groups, and NACLs to control inbound and outbound traffic. Implement AWS WAF and Shield for web application protection.
- Compliance Programs: Leverage AWS compliance programs and services like AWS Artifact to access audit reports and compliance documentation for regulatory requirements.
- Automated Security Assessments: Use AWS Trusted Advisor, Inspector, and Macie to regularly scan for vulnerabilities and data leaks, and ensure compliance with security best practices.
- Periodic Audits and Penetration Testing: Conduct regular security audits and penetration testing to identify and mitigate potential vulnerabilities.
AWS · ENGINEER
Your company needs to migrate a large on-premises database to AWS with minimal downtime. What migration strategy would you recommend?
For minimal downtime migration of a large database, I would recommend:
- AWS Database Migration Service (DMS): Use AWS DMS to perform the migration. Set up the source and target endpoints and create a replication task to perform continuous data replication from the on-premises database to the AWS target database.
- Pre-Migration Planning: Conduct a thorough assessment of the source database, including schema, data size, and dependencies. Plan the migration strategy, downtime window, and rollback procedures.
- Schema Conversion: Use AWS Schema Conversion Tool (SCT) to convert the source database schema to the target database schema (e.g., from Oracle to Aurora PostgreSQL).
- Continuous Data Replication: Enable continuous data replication using DMS to keep the source and target databases in sync. This minimizes downtime by ensuring that only the final incremental changes need to be applied during the cutover.
- Testing: Perform multiple migration tests in a staging environment to ensure data consistency, performance, and application compatibility.
- Cutover: Schedule the cutover during a maintenance window. Perform a final sync using DMS, switch the application to the new database, and validate the data integrity and application functionality.
AWS · ENGINEER
Your organization is expanding rapidly and needs a multi-account strategy to manage different projects and departments efficiently on AWS. How would you implement this?
To implement a multi-account strategy, I would:
- AWS Organizations: Use AWS Organizations to create and manage multiple AWS accounts. Create Organizational Units (OUs) for different projects, departments, or environments (e.g., development, staging, production).
- Service Control Policies (SCPs): Implement SCPs to enforce governance and control access at the organizational level. Define policies to restrict or allow specific actions based on organizational requirements.
- Cross-Account Access: Set up IAM roles for cross-account access, allowing users and services in one account to access resources in another account securely. Use AWS Resource Access Manager (RAM) to share resources like VPCs and subnets across accounts.
- Billing and Cost Management: Use consolidated billing to aggregate usage and simplify billing management. Allocate budgets and set up cost allocation tags to track expenses per account, project, or department.
- Centralized Logging and Monitoring: Use AWS CloudTrail, CloudWatch, and AWS Config across all accounts to centralize logging, monitoring, and compliance tracking. Aggregate logs to a central S3 bucket or use AWS CloudWatch Logs Insights for cross-account log analysis.
- Security and Compliance: Implement a centralized security model using AWS Security Hub, GuardDuty, and Macie to monitor and enforce security best practices across all accounts.
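A typical SCP guardrail is a region restriction. The sketch below builds one as a dictionary, assuming a hypothetical list of approved regions and a hypothetical list of global services to exempt; the real exemption list depends on which global services the organisation uses.

```python
import json


def region_restriction_scp(allowed_regions, exempt_actions):
    """Build an SCP that denies requests outside the approved regions.

    exempt_actions lists global-service actions (for example "iam:*") that
    must keep working regardless of the requested region.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedRegions",
                "Effect": "Deny",
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(allowed_regions)
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Both lists are illustrative assumptions, not a recommended baseline
    scp = region_restriction_scp(
        allowed_regions=["eu-west-1", "eu-central-1"],
        exempt_actions=["iam:*", "organizations:*", "route53:*", "sts:*"],
    )
    print(json.dumps(scp, indent=2))
```

SCPs never grant permissions; they only cap what IAM policies in the member accounts are able to allow.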
AWS · ENGINEER
Your application must be highly available and fault-tolerant. What architecture would you design on AWS to meet these requirements?
To design a highly available and fault-tolerant architecture:
- Multi-AZ Deployment: Deploy the application across multiple Availability Zones (AZs) to ensure high availability and fault tolerance. Use services like RDS with Multi-AZ for database deployments.
- Elastic Load Balancing (ELB): Use ELB to distribute incoming traffic across multiple EC2 instances running in different AZs. This ensures that traffic is balanced and that the application remains available even if an instance or AZ fails.
- Auto Scaling: Configure Auto Scaling groups to automatically adjust the number of EC2 instances based on demand. This ensures that the application can handle traffic spikes and maintain performance.
- Data Replication and Backup: Use S3 for data storage with versioning and cross-region replication enabled. Regularly back up data using AWS Backup and implement snapshot policies for EBS volumes and RDS instances.
- Route 53: Use Route 53 for DNS management with health checks and failover routing policies to route traffic to healthy endpoints and provide seamless failover in case of an AZ or region failure.
- Serverless Services: Where applicable, use serverless services like AWS Lambda, DynamoDB, and S3 to reduce the risk of infrastructure-related failures and increase scalability.
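The reason Multi-AZ helps can be shown with simple availability arithmetic. This is a rough sketch that assumes failures are independent, which real AZs only approximate; the availability figures are hypothetical, not AWS SLA numbers.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def parallel_availability(single: float, copies: int) -> float:
    """Availability of redundant copies where any one copy is enough."""
    return 1 - (1 - single) ** copies


def serial_availability(*tiers: float) -> float:
    """Availability of a chain of tiers that must all be up at once."""
    total = 1.0
    for tier in tiers:
        total *= tier
    return total


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    one_az = 0.99  # hypothetical availability of the app tier in one AZ
    two_az = parallel_availability(one_az, 2)
    stack = serial_availability(two_az, 0.9995)  # app tier behind a database tier
    print(f"app tier across 2 AZs: {two_az:.6f}")
    print(f"whole stack: {stack:.6f}")
    print(f"downtime: {downtime_minutes_per_year(stack):.0f} min/year")
```

Redundancy multiplies failure probabilities together, while chaining tiers multiplies availabilities together, so the weakest single-instance tier tends to dominate the result.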
AWS · ENGINEER
Your company wants to modernize its legacy monolithic application by migrating to a microservices architecture on AWS. How would you approach this?
To modernize the legacy monolithic application:
- Assessment and Planning: Conduct a thorough assessment of the existing monolithic application to understand its components, dependencies, and data flow. Identify the logical boundaries and the potential microservices that can be created.
- Containerization: Start by containerizing the existing monolith using Docker. Deploy the containerized application on Amazon ECS or EKS to manage the containers.
- Service Decomposition: Gradually decompose the monolith into microservices. Identify and extract functionalities into independent services. Use AWS Lambda for stateless, event-driven microservices or ECS/EKS for stateful services.
- Service Communication: Use Amazon API Gateway to manage API requests and service-to-service communication. Implement inter-service communication using RESTful APIs, gRPC, or AWS App Mesh for service mesh architecture.
- Data Management: Decouple the data layer by using purpose-built databases for each microservice. Use DynamoDB, RDS, or Aurora depending on the data requirements. Implement data synchronization and eventual consistency where necessary.
- CI/CD Pipeline: Set up a CI/CD pipeline using AWS CodePipeline, CodeBuild, and CodeDeploy to automate the build, test, and deployment processes for the microservices.
- Monitoring and Logging: Implement centralized logging and monitoring using AWS CloudWatch, X-Ray, and Amazon OpenSearch Service (formerly Elasticsearch Service) to trace requests, monitor performance, and troubleshoot issues.
AWS · ENGINEER
Your organization wants to integrate its on-premises infrastructure with AWS to create a hybrid cloud environment. What approach would you take?
To create a hybrid cloud environment:
- Networking: Establish a secure network connection between the on-premises data center and AWS using AWS Direct Connect or VPN. This ensures low latency and secure communication.
- Identity and Access Management: Implement federated access by integrating on-premises Active Directory with AWS IAM. Use AWS IAM Identity Center (formerly AWS Single Sign-On) or IAM roles with SAML for seamless identity management.
- Data Integration: Use AWS DataSync or Storage Gateway to integrate on-premises storage with AWS storage services like S3. For databases, use AWS DMS to replicate data between on-premises databases and AWS databases.
- Hybrid Services: Leverage hybrid services like AWS Outposts to run AWS infrastructure and services on-premises for workloads that require low latency or data residency compliance.
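- Workload Distribution: Use ECS Anywhere or EKS Anywhere to run containerised workloads on-premises with the same tooling used in AWS, and ECS, EKS, or Elastic Beanstalk for the AWS side (Elastic Beanstalk itself does not deploy to on-premises servers). Implement load balancing and auto-scaling to manage the distribution of workloads.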
- Monitoring and Management: Use AWS Systems Manager to manage and monitor both on-premises and AWS resources. Implement CloudWatch and AWS Config for centralized monitoring, logging, and compliance tracking.
AWS · ENGINEER
Your team is tasked with building a new serverless application on AWS. What services and architecture would you use to design this application?
For a serverless application, I would use the following architecture and services:
- Compute: Use AWS Lambda to run the application code without managing servers. Write functions to handle individual tasks and trigger them based on events.
- API Gateway: Use Amazon API Gateway to create and manage RESTful APIs. Integrate API Gateway with Lambda functions to handle incoming requests and responses.
- Storage: Use Amazon S3 for storing static assets like images, videos, and documents. Use DynamoDB for a serverless, managed NoSQL database solution for storing application data.
- Authentication and Authorization: Use Amazon Cognito for user authentication, authorization, and user management. Integrate Cognito with API Gateway to secure the APIs.
- Messaging and Integration: Use Amazon SNS and SQS for message publishing and queuing. Use AWS Step Functions to coordinate and manage complex workflows and state machines.
- Monitoring and Logging: Use AWS CloudWatch for monitoring application performance and setting up alarms. Use CloudWatch Logs to aggregate and analyze logs from Lambda functions.
- CI/CD: Use AWS CodePipeline, CodeBuild, and CodeDeploy to automate the build, test, and deployment processes for the serverless application.
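The Lambda plus API Gateway pairing above can be sketched as a single handler. This assumes the API Gateway Lambda proxy integration, where the query string arrives under queryStringParameters and the function must return statusCode, headers, and a string body; the greeting logic is purely illustrative.

```python
import json


def lambda_handler(event, context):
    """Handle an API Gateway proxy request and return a proxy response."""
    # queryStringParameters is None (not an empty dict) when no query string is sent
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # The proxy integration expects the body as a string, not a dict
        "body": json.dumps({"message": f"Hello, {name}"}),
    }


if __name__ == "__main__":
    # Local smoke test with a hand-written event; no AWS account needed
    print(lambda_handler({"queryStringParameters": {"name": "dev"}}, None))
```

Because the handler is a plain function, it can be unit-tested locally before it is wired into CodePipeline.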
Each of these scenarios covers critical aspects of an AWS Cloud Solution Architect role, helping candidates demonstrate their practical knowledge and problem-solving abilities in real-world situations.
AWS · ENGINEER
What is the difference between EC2 instance types (e.g., t2, m5, c5)?
AWS · ENGINEER
How does Amazon S3 handle object versioning?
AWS · ENGINEER
Explain the difference between RDS Multi-AZ and Read Replicas.
AWS · ENGINEER
What is the difference between Security Groups and NACLs (Network ACLs)?
AWS · ENGINEER
Explain Route 53 failover routing and how to configure it.
AWS · ENGINEER
Explain how Auto Scaling Groups work with Elastic Load Balancers.
AWS · ENGINEER
What are the different types of load balancers in AWS (ALB, NLB, CLB)?
AWS · ENGINEER
How do you design a highly available architecture across multiple regions?
AWS · ENGINEER
What is the difference between IAM roles, policies, and users?
AWS · ENGINEER
How do you grant temporary access to an S3 bucket?
AWS · ENGINEER
Explain AWS Key Management Service (KMS) and how to encrypt EBS volumes.
AWS · ENGINEER
How does ECS (Elastic Container Service) differ from EKS (Elastic Kubernetes Service)?
AWS · ENGINEER
What is Fargate, and how does it simplify container management?
AWS · ENGINEER
Explain DynamoDB Global Tables and how they achieve cross-region replication.
AWS · ENGINEER
How does CloudWatch differ from CloudTrail?
AWS · ENGINEER
How do you set up alarms and automated scaling using CloudWatch?
AWS · ENGINEER
How do you implement AWS Savings Plans to reduce costs?
AWS · ENGINEER
Explain the difference between Reserved Instances and Spot Instances.
AWS · ENGINEER
How do you track and manage AWS billing and cost allocation?
AWS · ENGINEER
How do you design high availability in AWS?
High availability means the system continues operating despite failures. Multi-AZ architecture: deploy across at least 2 AZs. ELB distributes traffic across AZs. RDS Multi-AZ: synchronous replication, automatic failover in 60-120 seconds. EC2 Auto Scaling Groups: span multiple AZs, replace unhealthy instances automatically. EKS: worker nodes across 3 AZs, pod anti-affinity ensures replicas spread across AZs. Route 53 health checks: route traffic away from unhealthy endpoints. S3: 11 nines durability, multi-AZ storage automatically. For stateless services: easy — add more instances across AZs. For stateful services: harder — use managed services (RDS, ElastiCache) which handle replication. DR beyond single region: Route 53 latency routing or failover routing to secondary region. RDS read replicas in secondary region promoted to primary if main region fails. RTO target: 15 minutes. RPO target: 5 minutes. Test DR quarterly: simulate region failure, verify traffic routes correctly, verify database failover completes.
AWS · PRODUCTION
EKS node group is unhealthy — nodes not joining cluster. Debug process.
Step 1: check node group status. aws eks describe-nodegroup --cluster-name myEKS --nodegroup-name myNodeGroup. Status DEGRADED or CREATE_FAILED shows the reason. Step 2: check EC2 instances in the Auto Scaling Group. aws autoscaling describe-auto-scaling-groups. Are instances launching? If launching but not joining: Step 3: check EC2 instance system logs. aws ec2 get-console-output --instance-id i-xxxx shows bootstrap errors. Common causes: IAM instance role missing AmazonEKSWorkerNodePolicy or AmazonEC2ContainerRegistryReadOnly policies. VPC DNS not enabled (enableDnsHostnames must be true). Security Group blocking port 443 to EKS cluster endpoint. Wrong AMI — must use EKS-optimised AMI for the correct K8s version. Bootstrap script error — check /var/log/cloud-init-output.log on the node. Step 4: if node joins but shows NotReady: kubectl describe node nodename — look at Conditions. Usually: kubelet not started (systemctl status kubelet), CNI plugin not installed, or disk pressure. kubectl get events -A shows cluster-level events that may explain the issue.
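Step 4 (reading node Conditions) is easy to automate. The sketch below assumes the node object has already been fetched as JSON, for example with kubectl get node NAME -o json, and only inspects the status.conditions list; the sample node is hand-written for illustration.

```python
def unhealthy_conditions(node: dict) -> list[str]:
    """Return a description of every node condition that is not in its healthy state.

    Ready is healthy when "True"; pressure-style conditions such as
    MemoryPressure, DiskPressure, PIDPressure and NetworkUnavailable
    are healthy when "False".
    """
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        expected = "True" if ctype == "Ready" else "False"
        if status != expected:
            problems.append(f"{ctype}={status}: {cond.get('reason', 'unknown')}")
    return problems


if __name__ == "__main__":
    # Hand-written sample resembling a node with a full disk
    sample = {
        "status": {
            "conditions": [
                {"type": "DiskPressure", "status": "True", "reason": "KubeletHasDiskPressure"},
                {"type": "Ready", "status": "False", "reason": "KubeletNotReady"},
            ]
        }
    }
    for line in unhealthy_conditions(sample):
        print(line)
```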
AWS · SENIOR
Compare EKS, ECS, and Lambda. How do you decide which to use?
EKS: managed Kubernetes. Full K8s feature set. Control plane managed by AWS ($0.10/hr). You manage worker nodes. Use when: team has K8s expertise, complex microservices needing K8s features (RBAC, network policies, custom schedulers), multi-cloud portability required, existing Kubernetes workloads. ECS: AWS-proprietary container orchestrator. Simpler than K8s. No control plane cost. Two modes: EC2 (you manage servers) and Fargate (fully serverless). Use when: team is new to containers, pure AWS environment, want simplicity over flexibility, Fargate avoids node management entirely. Lambda: serverless functions. No infrastructure. Pay per request plus execution time (billed per millisecond). Scale to zero. Cold start latency (100ms-3s). Use when: event-driven processing (S3 uploads, SQS messages, API calls), short-lived tasks (under 15 minutes), variable traffic with periods of zero load. Decision framework: Is the workload containerised and long-running? → EKS or ECS. Is it event-driven and short? → Lambda. Does the team know Kubernetes? → EKS. Does the team want simplicity? → ECS Fargate. At HPE we used EKS for the main microservices platform (K8s expertise existed) and Lambda for event-driven integrations (S3 processing, webhook handlers).
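The decision framework can be written down as a small function. This is a deliberately simplified sketch: the three inputs are the only factors considered, and real decisions also weigh cost, portability, and existing tooling.

```python
LAMBDA_MAX_MINUTES = 15  # Lambda's maximum execution time per invocation


def choose_compute(event_driven: bool, max_runtime_minutes: float, team_knows_k8s: bool) -> str:
    """Pick a compute platform following the rough framework in the answer."""
    # Event-driven and short enough to fit in one invocation: Lambda
    if event_driven and max_runtime_minutes < LAMBDA_MAX_MINUTES:
        return "Lambda"
    # Long-running containers: Kubernetes only if the team can operate it
    if team_knows_k8s:
        return "EKS"
    return "ECS Fargate"


if __name__ == "__main__":
    print(choose_compute(event_driven=True, max_runtime_minutes=2, team_knows_k8s=False))
    print(choose_compute(event_driven=False, max_runtime_minutes=600, team_knows_k8s=True))
```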
📝Git & GitHub — Branching, Workflows, Troubleshooting10 Q&As›
GitHub Actions · ENGINEER
What is GitHub Actions and how is it different from Jenkins?
GitHub Actions is CI/CD built directly into GitHub: with GitHub-hosted runners there is no server to manage, no plugins to update, and no agents to configure. You write YAML workflows stored in .github/workflows/. When an event happens (push, PR, schedule), GitHub automatically runs the workflow on hosted servers (runners). Key differences from Jenkins: GitHub manages all infrastructure, so you never think about agents going offline or Jenkins master running out of memory. GitHub Actions supports OIDC for cloud authentication, so cloud credentials do not have to be stored as long-lived secrets. The free tier gives 2000 minutes/month for private repositories (standard hosted runners are free for public repositories), which is enough for most projects. Jenkins gives more control and is better for complex pipelines with many dependencies, but GitHub Actions wins for simplicity and GitHub-integrated projects.
GitHub Actions · ENGINEER
Explain the difference between jobs and steps in GitHub Actions.
Jobs run in parallel by default on separate runner machines. Each job gets a fresh VM — nothing carries over between jobs unless you explicitly pass outputs or use artifacts. Steps run in sequence inside a job on the same machine. If a step fails, the job fails and remaining steps are skipped (unless you add if: always()). Common pattern: quality job (tests, SonarQube) runs in parallel with security job (Trivy, Checkov). Both must pass before the build job starts. Build job produces a Docker image, uploads it as an artifact. Deploy job downloads that artifact and deploys. The needs keyword creates a dependency graph between jobs.
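The dependency graph created by needs can be simulated with the standard library. The sketch below groups the four jobs from the pattern above into waves that could run in parallel. It is only an approximation: GitHub starts each job as soon as its own needs have finished rather than waiting for a whole wave.

```python
from graphlib import TopologicalSorter


def execution_stages(needs: dict[str, list[str]]) -> list[list[str]]:
    """Group jobs into waves; every job in a wave has all its needs satisfied."""
    sorter = TopologicalSorter(needs)  # maps each job to the jobs it needs
    sorter.prepare()  # raises CycleError on circular needs
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    jobs = {
        "quality": [],
        "security": [],
        "build": ["quality", "security"],
        "deploy": ["build"],
    }
    for number, stage in enumerate(execution_stages(jobs), start=1):
        print(f"wave {number}: {', '.join(stage)}")
```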
GitHub Actions · ARCHITECT
How do you manage secrets securely in GitHub Actions?
Three layers. Repository secrets (Settings → Secrets → Actions) for most secrets — these are encrypted and masked in logs. Environment secrets for environment-specific values (prod DB host vs dev DB host). The production environment can require review approval before secrets are accessible, preventing accidental prod deployments. OIDC for cloud providers — instead of storing AWS/Azure keys as long-lived secrets, configure GitHub as an OIDC provider in your cloud account. The workflow requests a short-lived token at runtime using id-token: write permission. No credentials stored anywhere. Rotation happens automatically. At HPE: we moved from AWS access keys in secrets to OIDC — eliminated an entire class of credential rotation toil and security risk.
GitHub Actions · PRODUCTION
A GitHub Actions workflow is passing locally but failing in CI. Walk through diagnosis.
Systematic approach: First, enable debug logging by setting ACTIONS_STEP_DEBUG=true (and ACTIONS_RUNNER_DEBUG=true for runner-level diagnostics) as a repository secret or variable and re-running. The detailed logs usually reveal the exact problem. Common causes: 1) Environment difference: the local machine has tools installed globally (kubectl, helm, node) that the runner does not have. Fix: add explicit setup steps (actions/setup-node, actions/setup-java). 2) Secret not set: the workflow references a secret that exists in your environment but not in GitHub. Check Settings → Secrets. 3) Permissions: the runner cannot push to the registry or create releases. Check the permissions block in the workflow YAML. 4) Path difference: the workflow runs from the repo root, while locally you run from a subdirectory. Use working-directory: ./subfolder. 5) Token scope: GITHUB_TOKEN has limited permissions. Check if you need additional permissions.
GitHub Actions · ARCHITECT
How do you design a GitHub Actions workflow for multiple environments with approval gates?
Use GitHub Environments with protection rules. Create three environments in repo Settings → Environments: staging (no approvals, auto-deploy), uat (1 required reviewer), production (2 listed reviewers + deployment branch restrictions; note that GitHub proceeds as soon as any one listed reviewer approves). Each environment can have its own secrets (prod DB password different from staging). In the workflow, set environment: production on the deployment job. GitHub automatically pauses the workflow and sends notifications to required reviewers. If a reviewer approves, the workflow continues. If a reviewer rejects, or nobody approves within 30 days, the job fails. This gives a complete audit trail: who approved what deployment, when, and from which commit. No Slack messages that disappear, no email threads. All in GitHub history.
GitHub Actions · ENGINEER
What are reusable workflows and when should you use them?
Reusable workflows are workflow files that other workflows can call, like functions. You define them once with the workflow_call trigger and inputs/secrets parameters. Any other workflow calls them with uses: org/repo/.github/workflows/deploy.yml@main (the @ref suffix, a branch, tag, or commit SHA, is required when calling a workflow in another repository). Use them when: multiple repos do the same deployment pattern (all microservices deploy to Kubernetes the same way), you want to enforce standards (all deployments must include security scan before push), or you want to simplify individual repo workflows (50-line workflow becomes 10 lines calling reusable workflows). Difference from composite actions: reusable workflows run as separate jobs. Composite actions are steps within a job. Reusable workflows are better for complete deployment flows; composite actions for step-level reuse.
GitHub Actions · PRODUCTION
GitHub Actions minutes are being used too quickly. How do you optimise?
Measure first: Actions → billing shows minutes per workflow. Four main optimisations. One: caching. Use actions/cache for node_modules, Maven .m2, pip cache. Saves 2-5 minutes per run. Use hashFiles() to invalidate cache only when dependency files change. Two: parallel vs sequential. Check if jobs that run sequentially could run in parallel (needs removed). Three: conditional jobs. Skip deployment jobs on PR (if: github.event_name == 'push'; the string must be quoted in the expression). Skip tests on documentation-only changes using paths-ignore or path filters. Four: self-hosted runners. For heavy workloads, run on your own server. No minute counting. Your server, your hardware. Good for Docker builds with large images (10+ GB) where network bandwidth to GitHub runners is the bottleneck.
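The idea behind keying a cache on hashFiles() is that the key changes only when the dependency manifest changes. The sketch below imitates that idea in Python; it does not reproduce GitHub's exact hashing, and the key layout (OS, tool, digest) is just a common convention.

```python
import hashlib


def cache_key(runner_os: str, tool: str, lockfile_bytes: bytes) -> str:
    """Derive a cache key that changes only when the lockfile content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{runner_os}-{tool}-{digest}"


if __name__ == "__main__":
    # Hypothetical lockfile contents before and after a dependency bump
    before = b'{"lodash": "4.17.20"}'
    after = b'{"lodash": "4.17.21"}'
    print(cache_key("Linux", "npm", before))
    print(cache_key("Linux", "npm", after))
```

An unchanged lockfile yields the same key, so the cache is restored; a dependency bump yields a new key, so a stale cache is not reused.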
GitHub Actions · ARCHITECT
How does OIDC authentication work in GitHub Actions?
OIDC (OpenID Connect) lets GitHub Actions authenticate to cloud providers without stored credentials. Traditional approach: store AWS access key ID and secret in GitHub secrets. Problem: long-lived credentials, manual rotation, risk of exposure. OIDC approach: GitHub acts as an identity provider. When a workflow runs, GitHub mints a short-lived JWT token signed by GitHub's private key. Your cloud provider (AWS, Azure, GCP) is configured to trust GitHub's identity provider. The workflow exchanges the GitHub JWT for a cloud-specific credential. On AWS the resulting credential expires after 1 hour by default. Setup: AWS IAM console → Identity providers → Add GitHub as OIDC provider. Create an IAM role with a trust condition that allows only your specific repo/branch. Add permissions: id-token: write to the workflow. Use the aws-actions/configure-aws-credentials action. Result: no secrets to rotate, no stored credentials, automatic short-lived tokens per run.
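The "IAM role with condition" step is where most OIDC setups go wrong, so it helps to see the trust policy itself. The sketch below builds one as a dictionary; the account ID, repository, and branch are placeholders, and pinning the sub claim to one branch is one reasonable choice among several (environments and tags are also possible).

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, branch: str) -> dict:
    """Build an IAM role trust policy that only one repo and branch can assume."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience used by the official AWS credentials action
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        # Subject claim pins the role to one repo and branch
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and repository name
    print(json.dumps(github_oidc_trust_policy("123456789012", "my-org/my-repo", "main"), indent=2))
```

Leaving out the sub condition would let any GitHub repository assume the role, which is the mistake to call out in an interview.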
Git · ENGINEER
Explain git rebase vs merge. Which do you prefer and why?
Merge: creates a new merge commit combining two branches. Preserves the exact history of both branches including when the feature branch diverged. Non-destructive -- existing commits are never changed. Rebase: moves the entire feature branch to start from the tip of the target branch. Rewrites commit history to appear linear. The feature commits get new SHA hashes. When to use merge: on main/release branches -- never rebase shared branches because rewriting history breaks everyone else who has pulled them. Long-lived feature branches where the history of when the branch diverged is important. When to use rebase: on your local feature branch before creating a PR (interactive rebase to clean up "WIP commit 1", "fix typo", "fix the fix" into meaningful commits). To bring your feature branch up to date with main before merging. Golden rule: never rebase commits that have been pushed to a shared remote branch. If you need to clean up a pushed branch (only yours, no one else has pulled it): git push --force-with-lease. --force-with-lease is safer than --force: it fails if anyone else has pushed since your last pull.
Git · PRODUCTION
You accidentally pushed a secret (API key) to a public GitHub repo. What do you do?
This is a security incident. Speed matters -- bots scan GitHub in under 60 seconds. Step 1 (immediate): revoke the secret. Go to wherever the API key was issued (AWS Console, GitHub Settings, Stripe Dashboard) and invalidate it immediately. Even if you delete it from Git, assume it was already harvested. A new key is your actual protection. Step 2: remove from Git history. If just pushed: git rebase -i HEAD~N to remove the commit, then git push --force-with-lease. For commits deeper in history: git filter-repo (safer and faster than the older git filter-branch). git filter-repo --path-glob "*.env" --invert-paths removes files. BFG Repo Cleaner: bfg --delete-files .env. Step 3: force all collaborators to re-clone or rebase -- the rewritten history means their local copies have the old commits. Step 4: GitHub has a secret scanning feature -- go to Security tab, check if GitHub already detected and alerted. Step 5: audit access logs. If the key gave cloud access: check CloudTrail (AWS) or Activity Log (Azure) for any actions using that key in the last 24 hours. Prevention: git-secrets pre-commit hook, GitHub secret scanning with push protection, .env in .gitignore from day one, secrets manager for all credentials.
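The prevention step can start as a very small pre-commit check. The sketch below only looks for strings shaped like AWS access key IDs (the AKIA or ASIA prefix followed by 16 uppercase letters or digits); it is an illustration of the idea and not a substitute for git-secrets or GitHub's own scanning. The sample key is the placeholder used in AWS documentation.

```python
import re

# Long-term keys start with AKIA, temporary (STS) keys with ASIA
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_aws_key_ids(text: str) -> list[tuple[int, str]]:
    """Return (line number, match) for every string that looks like an AWS key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            hits.append((lineno, match.group()))
    return hits


if __name__ == "__main__":
    staged = "DEBUG=1\nAWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\n"
    for lineno, key in find_aws_key_ids(staged):
        print(f"line {lineno}: possible AWS access key ID {key[:4]}...")
```

A real hook would run this over the staged diff and exit non-zero on any hit so the commit is blocked.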
🔴OpenShift — SCC, Routes, vs Kubernetes6 Q&As›
OpenShift · ENGINEER
How is OpenShift different from vanilla Kubernetes?
OpenShift adds enterprise security, operational tooling, and developer workflow on top of Kubernetes. Key differences: Security Context Constraints enforce that pods cannot run as root by default — stricter than vanilla K8s PodSecurityPolicy or Pod Security Admission. Routes are OpenShift's native HTTP routing resource (backed by HAProxy router) — richer TLS options than Kubernetes Ingress. Operator Lifecycle Manager is built-in for managing complex application lifecycle. Internal container registry is included out of the box. Image Streams are an OCP-native concept that automatically triggers deployments when an upstream image is updated. The oc CLI extends kubectl with OpenShift-specific commands. Web console has both developer and administrator perspectives. From an operational standpoint: OpenShift clusters are more opinionated and harder to customise than vanilla Kubernetes, but enterprise features (RBAC, audit logging, built-in registry, operator framework) save significant setup time. At HPE we ran OCP 4.x for the telecom SRO/COM/NBI platform — the SCC enforcement and Operator-based deployment model suited our enterprise compliance requirements.
OpenShift · PRODUCTION
A pod fails with SCC error in OpenShift. Walk through the fix.
SCC (Security Context Constraints) error means the pod is requesting privileges that no assigned SCC allows. Step 1: read the error message carefully. oc describe pod failing-pod shows the exact SCC violation, usually running as root UID (0), using hostPath volumes, or requesting host network. Step 2: check what SCCs the pod's service account can use. oc adm policy scc-review -z service-account-name -n namespace -f pod.yaml (the command needs the pod or deployment manifest to evaluate). Step 3: choose the right fix. Correct fix: update the container to run as non-root. Add USER 1001 to the Dockerfile. This makes the container work with the default restricted SCC, with no privilege escalation needed. Workaround for legacy apps: create a dedicated service account, grant the minimum SCC needed (nonroot before anyuid), use that service account in the deployment. Never use anyuid in production without a valid reason and never use privileged unless it is a node-level DaemonSet. Step 4: verify the fix. oc get pod fixed-pod -o yaml shows which SCC was assigned in the openshift.io/scc annotation. At HPE: legacy telecom apps (TeMIP) needed anyuid because they were built to run as root. We isolated them in a dedicated namespace with tighter network policies to compensate.
OpenShift · ARCHITECT
Explain OpenShift Operators and why they matter.
An Operator is a Kubernetes controller that encodes the operational knowledge of running a complex application. Instead of a human manually creating replicas, running backups, handling failover, and managing upgrades, an Operator does all of this automatically by watching custom resources. Example: a PostgreSQL Operator watches PostgreSQLCluster objects. When you create one, the Operator creates the StatefulSet, creates users, configures replication, sets up monitoring, schedules backups, and handles failover — things that would take hours of manual work. In OpenShift, Operator Lifecycle Manager is built-in and OperatorHub provides a catalogue of certified operators. Key objects: Subscription (which operator, which channel, auto or manual upgrade), ClusterServiceVersion (installed operator version + capabilities), InstallPlan (pending upgrades waiting for approval). Production best practice: always use Manual approval for operators in production. Auto-upgrade can break things. With Manual, you review the upgrade notes and approve during a maintenance window. At HPE: we used the OpenShift GitOps Operator (ArgoCD) and the Prometheus Operator — both managed through OLM with manual upgrade approval.
OpenShift · ENGINEER
What is the difference between a Route and an Ingress in OpenShift?
Route is OpenShift's native HTTP routing resource, backed by the HAProxy-based OpenShift Router. Ingress is the standard Kubernetes resource that also works in OpenShift but routes to the same underlying router. Key differences: Routes have more TLS options natively — edge (TLS terminates at router, plain HTTP to pod), passthrough (TLS passes through to pod, end-to-end encryption), and reencrypt (TLS to router, new TLS to pod). Routes support custom certificates per route without needing cert-manager. Route status shows the exact hostname assigned. Ingress in OpenShift is converted to a Route internally. Use Routes when you want OpenShift-native features or when using the oc CLI and web console. Use Ingress when you want portability across Kubernetes distributions or when using Helm charts designed for vanilla K8s. The hostname format for Routes: app-name-namespace.apps.cluster-domain — automatically assigned if not specified.
OpenShift · SENIOR
How does OpenShift compare to vanilla Kubernetes? What adds value?
OpenShift IS Kubernetes (uses the same API server, etcd, kubelet) but adds enterprise features on top. Key additions: Security Context Constraints (SCC): more granular than K8s Pod Security Standards. Controls UID, capabilities, SELinux context per pod. Default in OpenShift: pods cannot run as root — more secure by default than vanilla K8s. Routes: OpenShift-native Ingress. Simpler than K8s Ingress with annotations. Integrated with HAProxy. TLS termination built-in. Operator Framework: pre-installed Operator Lifecycle Manager. Install complex applications (databases, monitoring) via OperatorHub — one click to install and manage. ImageStreams: abstraction over container images. Track image changes, trigger automatic redeploys when a base image updates. Built-in registry: integrated container registry, no ACR/ECR needed for internal images. OpenShift Console: richer UI than K8s dashboard. Built-in monitoring stack (Prometheus + Grafana via OpenShift Monitoring Operator). Developer experience: oc new-app deploys an application from Git with S2I (source-to-image) without writing a Dockerfile. When to choose OpenShift over AKS: heavily regulated industries (banking, healthcare) needing stricter defaults. Teams wanting an integrated platform (registry, monitoring, CI/CD). On-premise data centre deployments.
OpenShift · SENIOR
What are Security Context Constraints (SCC) in OpenShift and how do they differ from K8s Pod Security Standards?
SCCs are the OpenShift equivalent of Kubernetes Pod Security Standards (PSS) but more powerful and granular. SCCs control what a pod is allowed to do at the OS level: which UIDs it can run as, which capabilities it can request, whether it can use host networking/PID/IPC, whether it can mount host paths, what SELinux context it runs under. Standard SCCs: restricted (most secure, no root, no host access -- default for user workloads), anyuid (allow running as any UID including root), privileged (full host access -- only for infrastructure pods like node agents). How pods get SCCs: the pod's service account is granted access to an SCC via RBAC. oc adm policy add-scc-to-user anyuid -z my-service-account. Common problem: a containerised application that was written to run as root fails in OpenShift under the restricted SCC because OpenShift assigns a random high UID. Fix: update the Dockerfile to use a non-root user, or grant the anyuid SCC to the service account (less secure but sometimes necessary for third-party apps). K8s PSS comparison: PSS has three levels (privileged, baseline, restricted) applied at namespace level. Less granular than SCC but standardised across clouds. OpenShift 4.x supports both SCC (backward compatibility) and PSS.
🐍Python — Terraform Automation, Log Analysis, Azure SDK9 Q&As›
Python · ENGINEER
What is subprocess.run() and why do you use it instead of os.system()?
subprocess.run() executes a command (without a shell unless you pass shell=True) and returns a CompletedProcess object with stdout, stderr, and returncode. os.system() runs the command through the shell and returns only its exit status; the output goes straight to the terminal and cannot be captured. In DevOps scripts, capturing output is essential: you need to parse kubectl get pods output, read terraform output JSON, check helm status. Always use subprocess.run() with capture_output=True, text=True. Set a timeout so scripts never hang indefinitely. Check returncode != 0 and exit or raise an exception; never silently continue after a command failure.
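Those rules fit into one small wrapper. This is a minimal sketch; the function name run_cmd and the 60-second default timeout are arbitrary choices, and the demo runs the Python interpreter itself so it works without kubectl or terraform installed.

```python
import subprocess
import sys


def run_cmd(args: list[str], timeout: int = 60) -> str:
    """Run a command without a shell and return its stdout.

    Raises RuntimeError on a non-zero exit code and lets
    subprocess.TimeoutExpired propagate if the command hangs.
    """
    result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(
            f"{args[0]} exited {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for something like run_cmd(["kubectl", "get", "pods", "-o", "json"])
    print(run_cmd([sys.executable, "-c", "print('ok')"]))
```

Passing the arguments as a list instead of one shell string also avoids quoting bugs and shell injection when values come from user input.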
Python · ENGINEER
How does boto3 authenticate to AWS? Why should you never hardcode credentials?
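boto3 walks a credential chain and uses the first source that yields credentials. Simplified order: explicit parameters passed to the client or Session; environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; assume-role and web identity providers (this is how IRSA works on EKS); the shared ~/.aws/credentials and ~/.aws/config files; container credentials (for example ECS tasks); and finally the EC2 instance metadata service (instance profile). Because environment variables sit early in the chain, stray AWS_* keys on a host take precedence over an attached role. In production on EC2 or EKS, the IAM Role supplies short-lived credentials automatically, so no long-lived keys are stored anywhere. Never hardcode credentials because: they end up in Git history forever even after deletion, anyone with repo access can see them, and rotating them requires code changes. Use IAM Roles for production, environment variables (injected from the CI secret store) for CI/CD, and the credentials file only for local development.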
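To make the lookup order concrete, here is a stdlib-only sketch that reports which credential sources appear to be configured on a machine, in the simplified order described above. It is an illustration, not boto3's implementation: the function name `detect_credential_sources` is hypothetical, the real resolution happens inside botocore and has more providers, and the instance metadata service cannot be detected offline.

```python
import os
from pathlib import Path

def detect_credential_sources(env=None, home=None):
    """List configured credential sources in simplified chain order."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    found = []
    # 1. Static keys in the environment win over everything below.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        found.append("environment variables")
    # 2. Web identity token + role ARN (what IRSA injects into EKS pods).
    if env.get("AWS_WEB_IDENTITY_TOKEN_FILE") and env.get("AWS_ROLE_ARN"):
        found.append("web identity (IRSA)")
    # 3. Shared credentials file (local development).
    creds_file = Path(
        env.get("AWS_SHARED_CREDENTIALS_FILE", home / ".aws" / "credentials")
    )
    if creds_file.is_file():
        found.append("shared credentials file")
    # 4. Container credentials endpoint (for example ECS tasks).
    if (env.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
            or env.get("AWS_CONTAINER_CREDENTIALS_FULL_URI")):
        found.append("container credentials")
    return found

print(detect_credential_sources())
```

If the first entry printed on a production host is "environment variables", someone has exported static keys that shadow the IAM Role.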
Python · ADVANCED
How do you write a Python script that monitors Kubernetes pod health and alerts Slack when pods are crashing?
The pattern: kubernetes SDK to poll pod status, retry decorator for network reliability, Slack webhook for notifications. Start with config.load_incluster_config() in a pod or load_kube_config() locally. List pods with v1.list_pod_for_all_namespaces(). For each pod's container_statuses, check state.waiting.reason == CrashLoopBackOff. If found, build a message and post to Slack webhook using requests.post(). Run this as a Kubernetes CronJob every 5 minutes. Add a cooldown mechanism — track which pods you already alerted on to avoid spam. At HPE: this exact script ran as a CronJob and caught three CrashLoopBackOff pods during a deployment, alerting the team before users noticed.
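The detection and cooldown logic can be sketched without the SDK, operating on the dict form of a pod list (the `items` array from `kubectl get pods -A -o json`). With the kubernetes client the same fields are reached by attribute access instead. The names `find_crashlooping` and `Cooldown` and the sample pods are hypothetical. Note that a CronJob starts a fresh process every run, so an in-memory cooldown like this one would need to be persisted somewhere (a file or ConfigMap, for instance) to survive between runs.

```python
import time

def find_crashlooping(pods):
    """Yield (namespace, pod, container) for containers in CrashLoopBackOff."""
    for pod in pods:
        meta = pod["metadata"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                yield meta["namespace"], meta["name"], cs["name"]

class Cooldown:
    """Remember when each key was last alerted and suppress repeats."""
    def __init__(self, seconds=1800, clock=time.monotonic):
        self.seconds, self.clock, self.last = seconds, clock, {}

    def should_alert(self, key):
        now = self.clock()
        if key in self.last and now - self.last[key] < self.seconds:
            return False
        self.last[key] = now
        return True

# Made-up sample data in the shape kubectl returns.
sample_pods = [
    {"metadata": {"namespace": "production", "name": "payment-api-7d9f"},
     "status": {"containerStatuses": [
         {"name": "app", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"namespace": "production", "name": "order-api-5c2a"},
     "status": {"containerStatuses": [
         {"name": "app", "state": {"running": {}}}]}},
]

cooldown = Cooldown(seconds=1800)
for ns, pod, container in find_crashlooping(sample_pods):
    if cooldown.should_alert((ns, pod)):
        # In the real script this message would go to requests.post(webhook, ...)
        print(f"ALERT: {ns}/{pod} container {container} is in CrashLoopBackOff")
```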
Python · ENGINEER
How do you use Python to automate Terraform deployments?
import json, subprocess, sys

def notify_slack(message):
    """Placeholder: replace with a real Slack webhook call."""
    print(f"[slack] {message}")

def wait_for_approval():
    """Placeholder: block until the change ticket is approved."""
    input("Production change: press Enter once the ticket is approved... ")

def terraform_plan(env, working_dir):
    """Run terraform plan and return the exit code (0 = no changes, 2 = changes planned)"""
    result = subprocess.run(
        ["terraform", "plan", "-var-file", f"envs/{env}.tfvars",
         "-out", f"/tmp/{env}.plan", "-json",
         "-detailed-exitcode"],  # without this flag, plan never returns 2
        cwd=working_dir, capture_output=True, text=True
    )
    if result.returncode not in (0, 2):  # 1 = error
        raise RuntimeError(f"terraform plan failed: {result.stderr or result.stdout}")
    # Parse the JSON event stream to extract resource changes
    for line in result.stdout.splitlines():
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON line
        if event.get("type") == "change_summary":
            changes = event.get("changes", {})
            print(f"Plan: +{changes.get('add', 0)} ~{changes.get('change', 0)} -{changes.get('remove', 0)}")
            # Alert if destroys detected
            if changes.get("remove", 0) > 0:
                notify_slack(f"WARNING: Plan includes {changes['remove']} destroys in {env}")
    return result.returncode

# Run: python3 deploy.py production
env = sys.argv[1] if len(sys.argv) > 1 else "production"
tf_dir = f"/infra/environments/{env}"
rc = terraform_plan(env, tf_dir)
if rc == 2:  # changes to apply
    if env == "production":
        wait_for_approval()  # check approval ticket
    subprocess.run(["terraform", "apply", f"/tmp/{env}.plan"], cwd=tf_dir, check=True)
Python · ENGINEER
How do you parse and analyse Kubernetes logs with Python?
import json, re, subprocess
from collections import Counter

def get_pod_logs(namespace, label_selector, since_hours=1):
    """Get logs from all pods matching a label"""
    pods_result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-l", label_selector, "-o", "json"],
        capture_output=True, text=True, check=True
    )
    pods = json.loads(pods_result.stdout)["items"]
    all_logs = []
    for pod in pods:
        pod_name = pod["metadata"]["name"]
        log_result = subprocess.run(
            ["kubectl", "logs", pod_name, "-n", namespace,
             f"--since={since_hours}h"],
            capture_output=True, text=True
        )
        if log_result.returncode != 0:
            # Pod may have been deleted or not started yet: report it and move on
            print(f"warning: no logs for {pod_name}: {log_result.stderr.strip()}")
            continue
        for line in log_result.stdout.splitlines():
            if line:
                all_logs.append({"pod": pod_name, "line": line})
    return all_logs

def analyse_errors(logs):
    errors = [l for l in logs if "ERROR" in l["line"] or "FATAL" in l["line"]]
    # Group by error pattern
    patterns = Counter()
    for e in errors:
        # Extract error type (first word after ERROR or ERROR:)
        m = re.search(r"ERROR:?\s*(\w+)", e["line"])
        if m:
            patterns[m.group(1)] += 1
    print(f"Total errors: {len(errors)}")
    for pattern, count in patterns.most_common(5):
        print(f"  {pattern}: {count}")

logs = get_pod_logs("production", "app=payment-api")
analyse_errors(logs)
Python · ENGINEER
Write a Python script that checks if all K8s pods in a namespace are Running.
import json, subprocess, sys

def check_pods(namespace):
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
        sys.exit(1)
    pods = json.loads(result.stdout)["items"]
    if not pods:
        print(f"No pods in {namespace}")
        return True
    failed = []
    for pod in pods:
        name = pod["metadata"]["name"]
        phase = pod["status"].get("phase", "Unknown")
        # Note: completed Job pods report "Succeeded" and are flagged here too
        if phase != "Running":
            failed.append(f"{name}: {phase}")
    if failed:
        print(f"FAILED pods in {namespace}:")
        for f in failed:
            print(f"  {f}")
        return False
    print(f"All {len(pods)} pods Running in {namespace}")
    return True

if not check_pods("production"):
    sys.exit(1)  # fail the CI/CD step
Python · ENGINEER
Write a Python script to send a Slack alert when CPU usage exceeds 80%.
import os, time
import psutil, requests

SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]  # never hardcode
THRESHOLD = 80
HOSTNAME = os.uname().nodename

def send_alert(cpu):
    payload = {
        "text": f":red_circle: *CPU Alert* on `{HOSTNAME}`",
        "attachments": [{
            "color": "danger",
            "fields": [{
                "title": "CPU Usage",
                "value": f"{cpu:.1f}% (threshold: {THRESHOLD}%)",
                "short": True
            }]
        }]
    }
    # A timeout stops the monitor from hanging if Slack is unreachable
    resp = requests.post(SLACK_WEBHOOK, json=payload, timeout=10)
    resp.raise_for_status()
    print(f"Alert sent: CPU={cpu:.1f}%")

while True:
    cpu = psutil.cpu_percent(interval=5)  # average over a 5-second sample
    if cpu > THRESHOLD:
        try:
            send_alert(cpu)
        except requests.RequestException as exc:
            print(f"Could not send alert: {exc}")  # keep monitoring
    time.sleep(60)  # check every minute
Python · ENGINEER
How do you use Python to call the Azure REST API and list all AKS clusters?
import os
import requests
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient

# DefaultAzureCredential works with: managed identity (in Azure),
# service principal env vars, Azure CLI login (local dev)
credential = DefaultAzureCredential()
subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]

# Option 1: the management SDK
client = ContainerServiceClient(credential, subscription_id)
print("AKS Clusters:")
for cluster in client.managed_clusters.list():
    # agent_pool_profiles and count can be None, so guard both
    nodes = sum(p.count or 0 for p in (cluster.agent_pool_profiles or []))
    print(f"  {cluster.name}: {cluster.location} | "
          f"K8s: {cluster.kubernetes_version} | "
          f"Nodes: {nodes}")

# Option 2: raw requests + bearer token
token = credential.get_token("https://management.azure.com/.default")
headers = {"Authorization": f"Bearer {token.token}"}
url = (f"https://management.azure.com/subscriptions/{subscription_id}"
       "/providers/Microsoft.ContainerService/managedClusters?api-version=2024-01-01")
while url:  # follow nextLink until every page has been read
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    body = resp.json()
    for cluster in body["value"]:
        print(cluster["name"])
    url = body.get("nextLink")
Python · PRODUCTION
How do you handle errors and retries in Python DevOps scripts?
Production scripts must be resilient. Error handling pattern:
import functools, logging, subprocess, time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def retry(max_attempts=3, delay=5, exceptions=(Exception,)):
    """Decorator for automatic retry with exponential backoff"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_attempts:
                        logging.error(f"{func.__name__} failed after {max_attempts} attempts: {e}")
                        raise
                    wait = delay * (2 ** (attempt - 1))  # exponential backoff
                    logging.warning(f"Attempt {attempt} failed: {e}. Retrying in {wait}s...")
                    time.sleep(wait)
        return wrapper
    return decorator

# Retry only the failure we expect: check=True raises CalledProcessError
@retry(max_attempts=3, delay=10, exceptions=(subprocess.CalledProcessError,))
def deploy_to_kubernetes(manifest_path):
    result = subprocess.run(["kubectl", "apply", "-f", manifest_path],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Always log start, end, and key decisions
# Always send alerts on final failure (Slack, PagerDuty, email)
# Never silently swallow exceptions in production
🖥️ Bash: Health Checks, Functions, Error Handling (14 Q&As)
Bash · ENGINEER
Server is at high CPU. Walk through how you find the cause.
Start broad, then narrow. First: uptime to see the load average, and compare it to the number of CPUs (nproc). A sustained load of 2× the CPU count means work is queuing. On Linux the load average also counts tasks blocked on I/O, so check top or vmstat for high iowait before assuming a CPU problem. Then: ps aux --sort=-%cpu to find the top consumer. Note the PID and process name. Check how long it has been running with ps -o pid,etime,cmd -p PID. If it is a known service (nginx, java): check its logs with journalctl -u nginx --since '30 min ago'. If it is a runaway process: check what it is doing with strace -p PID. A flood of repeated or failing syscalls points to a retry loop; no syscalls at all while CPU stays at 100% points to a tight loop in user space, which needs perf top or a language-level profiler instead. Common cause at HPE: a Kafka consumer stuck in a retry loop consuming 100% CPU. Fix: kill the process, find the poison message, add a retry limit with backoff in code.
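The first check (load average against CPU count) is easy to script. A minimal sketch, assuming a hypothetical helper `load_status` and the 2× rule of thumb from the answer above; `os.getloadavg()` is available on Unix-like systems only.

```python
import os

def load_status(load1=None, cpus=None, factor=2.0):
    """Classify the 1-minute load average relative to the CPU count."""
    load1 = os.getloadavg()[0] if load1 is None else load1
    cpus = (os.cpu_count() or 1) if cpus is None else cpus
    ratio = load1 / cpus
    if ratio >= factor:
        return "overloaded", ratio
    if ratio >= 1.0:
        return "busy", ratio
    return "ok", ratio

# Explicit numbers for illustration; call load_status() with no
# arguments on a server to read the live values.
print(load_status(load1=9.5, cpus=4))
```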
Bash · ENGINEER
What is the difference between a process and a thread in Linux?
A process is an independent program with its own memory space, file descriptors, and PID. A thread is a lightweight execution unit WITHIN a process: threads share the same memory space and file descriptors as the parent process. Creating a process with fork() is the more expensive operation: the kernel duplicates the page tables and marks memory copy-on-write, so pages are physically copied only when one side writes to them. Creating a thread is cheaper because it reuses the existing address space. In Linux, both are implemented as tasks created with the clone() syscall: processes use clone() without the CLONE_VM flag (separate memory), threads use clone() with CLONE_VM (shared memory). For DevOps: ps aux shows processes. To see threads: ps -eLf or top -H. Important for troubleshooting: if a Java process has 200 threads and CPU is high, top -H -p PID shows which threads are burning CPU, and jstack PID gives a thread dump to map them back to code.
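The shared-versus-separate memory distinction can be observed directly. A small sketch, POSIX only because it uses `os.fork()`; the function name `demo` is just for illustration. The forked child appends to its own copy-on-write copy of the list, so the parent never sees the change, while the thread's append lands in the same list.

```python
import os
import threading

def demo():
    """Return the list contents after a forked child and after a thread modify it."""
    data = []

    # Forked child: works on a private copy-on-write copy of memory.
    pid = os.fork()
    if pid == 0:
        data.append("from child process")
        os._exit(0)          # exit the child immediately, skipping cleanup handlers
    os.waitpid(pid, 0)       # parent waits for the child to finish
    after_fork = list(data)

    # Thread: runs in the same address space, so its append is visible here.
    t = threading.Thread(target=data.append, args=("from thread",))
    t.start()
    t.join()
    return after_fork, list(data)

after_fork, after_thread = demo()
print(after_fork)     # []
print(after_thread)   # ['from thread']
```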
Bash · PRODUCTION
Your disk is 100% full on a production server. Walk through the fix without downtime.
Do NOT just delete random files. Systematic approach: First: df -h to confirm which partition is full. Second: du -sh /* to find the largest directories. Third: common culprits in order — /var/log (logs grew unbounded), /var/lib/docker (Docker images/containers), /tmp (someone wrote large temp files), /home (developer left large files). Safe immediate fixes: journalctl --vacuum-size=500M to trim journal logs. find /var/log -name '*.gz' -mtime +30 -delete to remove old compressed logs. docker system prune -f to remove unused Docker resources. For permanent fix: add logrotate config, add monitoring alert at 80% disk usage. At HPE: had this on a TeMIP server. /var/log/app filled up because log level was set to DEBUG in production. Fixed by changing log level to INFO and adding logrotate.
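The 80% alert mentioned above can be prototyped with the standard library. A minimal sketch with a hypothetical helper `disk_alert`; note that `df` may report a slightly different percentage because it accounts for blocks reserved for root.

```python
import shutil

def disk_alert(path="/", threshold=80.0):
    """Return (is_over_threshold, percent_used) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    percent = usage.used / usage.total * 100
    return percent >= threshold, round(percent, 1)

over, percent = disk_alert("/", threshold=80.0)
print(f"/ is {percent}% full" + (" - ALERT" if over else ""))
```

In a real setup this check would feed the monitoring system rather than print, so the alert fires well before the partition reaches 100%.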
Bash · ARCHITECT
Explain Linux file permissions. How do you secure a private key file?
Every file has three permission sets: owner, group, others. Each set has three bits: read (4), write (2), execute (1), and each octal digit is the sum of the bits granted (7 = 4+2+1, 5 = 4+1). Common values: 755 = owner can rwx, group and others can rx; good for executables. 644 = owner can rw, group and others can read; good for config files. 600 = only owner can rw, nobody else has any access; required for SSH private keys. 700 = only owner can rwx; good for directories with sensitive content. For an SSH private key: chmod 600 ~/.ssh/id_rsa. If the permissions are too open, the ssh client prints an "UNPROTECTED PRIVATE KEY FILE" warning and ignores the key, so the login usually ends with "Permission denied (publickey)". For production: sensitive config files should be 640 (owner read-write, group read) and owned by the application user. Never use 777 in production: it lets any user on the host modify the file.
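The same check ssh performs (no access for group or others) can be expressed as a bit mask. A short sketch for POSIX systems; `is_private` and the demo file name are hypothetical.

```python
import os
import stat
import tempfile

def is_private(path):
    """True if group and others have no permissions on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0   # 0o077 masks the group and other bits

with tempfile.TemporaryDirectory() as tmp:
    key = os.path.join(tmp, "id_demo")   # stand-in for a private key file
    with open(key, "w") as f:
        f.write("not a real key\n")
    os.chmod(key, 0o644)
    print(oct(stat.S_IMODE(os.stat(key).st_mode)), is_private(key))
    os.chmod(key, 0o600)
    print(oct(stat.S_IMODE(os.stat(key).st_mode)), is_private(key))
```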
Bash · PRODUCTION
How do you investigate a memory leak on a Linux server?
Memory leak = application allocates memory and never frees it. Symptoms: free -h shows available memory decreasing over hours/days, and the kernel eventually OOM-kills processes. Investigation: watch the specific process over time with watch -n 60 'ps -o pid,vsz,rss,comm -p PID'. RSS (resident memory) climbing steadily without levelling off is the leak signal; VSZ (virtual) usually grows with it but can be large for harmless reasons. Check dmesg and journalctl -k for OOM killer messages, which show which process was killed and how much memory it had. For Java: jmap -histo PID shows object count by class, so you can see which class is growing. For Python: use tracemalloc or memory_profiler. For Go: use pprof. Immediate mitigation: restart the leaking service (a scheduled nightly restart if the fix takes time). Permanent fix: find the objects that stay referenced and are therefore never freed, and fix the code. At HPE: a Python Kafka consumer cached every processed message ID in a dict without expiry. Fixed by switching to a size-capped OrderedDict that evicts the oldest entry once the limit is reached.
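A bounded "seen IDs" cache of the kind described in the HPE fix might look like the sketch below. This is an illustration of the technique, not the original code: the class name `BoundedSeenCache` and the default size are assumptions. `OrderedDict` has no built-in size limit, so the eviction is done by hand with `popitem(last=False)`.

```python
from collections import OrderedDict

class BoundedSeenCache:
    """Remember the most recent max_size message IDs, evicting the oldest."""

    def __init__(self, max_size=10_000):
        self.max_size = max_size
        self._seen = OrderedDict()

    def add(self, message_id):
        """Record an ID. Returns True if it had already been seen."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)   # refresh its position
            return True
        self._seen[message_id] = None
        if len(self._seen) > self.max_size:
            self._seen.popitem(last=False)       # drop the oldest entry
        return False

    def __len__(self):
        return len(self._seen)

cache = BoundedSeenCache(max_size=3)
for msg_id in [1, 2, 3, 4, 5]:
    cache.add(msg_id)
print(len(cache))      # 3
print(cache.add(5))    # True  (still cached)
print(cache.add(1))    # False (was evicted earlier)
```

Memory use is now capped at `max_size` entries; the trade-off is that a very old duplicate can slip through once its ID has been evicted.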
Bash · ENGINEER
What is set -euo pipefail and why do you use it in bash scripts?
Three separate options: set -e makes the script exit immediately when any command returns non-zero exit code. Without it, errors are silently ignored and the script continues — dangerous in deployment scripts. set -u makes the script exit when you reference an undefined variable. Without it, a typo in a variable name gives an empty string — silent bug. Example: rm -rf $DIRECOTRY/ (typo) without -u would run rm -rf / (delete everything). set -o pipefail makes a pipeline fail if ANY command in the pipe fails. Without it, ls /nonexistent | sort returns exit code 0 because sort succeeded — the ls failure is hidden. Together they make bash scripts behave like proper programming languages — fail loudly on errors rather than silently continuing in a broken state. Every production bash script should start with these.
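The pipefail behaviour is easy to verify rather than take on trust. A small sketch that runs the same pipeline through bash with and without the option and compares exit codes. It assumes `bash` is on the PATH; `bash_exit_code` and the directory name are made up for the demo.

```python
import subprocess

def bash_exit_code(script):
    """Run a snippet with bash and return its exit code."""
    result = subprocess.run(["bash", "-c", script],
                            capture_output=True, text=True)
    return result.returncode

# ls fails, sort succeeds: the pipeline's status depends on pipefail.
without = bash_exit_code("ls /nonexistent-dir | sort")
with_pipefail = bash_exit_code("set -o pipefail; ls /nonexistent-dir | sort")

print("without pipefail:", without)        # 0, the ls failure is hidden
print("with pipefail:   ", with_pipefail)  # non-zero, the failure surfaces
```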
Bash · PRODUCTION
A service cannot connect to a database. Walk through network troubleshooting.
Layered investigation, from the network up to the application. Step 1: can we reach the DB host at all? ping db-server from the app server. If ping fails, suspect routing or a firewall, but treat it as a hint only: many networks block ICMP while still allowing the database port. Step 2: is the DB port open? nc -zv db-server 5432 (PostgreSQL) or nc -zv db-server 3306 (MySQL). "Connection refused" means the host is reachable but nothing is listening on that port; a timeout usually means a firewall or security group is dropping the traffic; also rule out a wrong host/port. Step 3: is DNS resolving correctly? nslookup db-server and check that it resolves to the right IP. Step 4: is there a firewall rule? On the DB server: sudo iptables -L -n | grep 5432, and ss -tlnp | grep 5432 to confirm PostgreSQL is actually listening. On the app server: check whether outbound traffic on 5432 is allowed. Step 5: test the actual connection with the DB client: psql -h db-server -U user -d dbname. This confirms credentials and SSL settings too. Step 6: check the application config for a wrong host name, wrong port, or wrong credentials.
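When `nc` is not installed on the app server, the port test from Step 2 can be done with Python's socket module. A minimal sketch; `check_tcp` and its return labels are hypothetical. The demo uses a throwaway listener on localhost so it runs without a real database.

```python
import socket

def check_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt, similar in spirit to nc -zv."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"        # host reachable, nothing listening
    except socket.timeout:
        return "timeout"        # packets likely dropped by a firewall
    except socket.gaierror:
        return "dns-failure"    # host name did not resolve
    except OSError as exc:
        return f"error: {exc}"

# Local demonstration: open a listener, test it, close it, test again.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))      # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
print(check_tcp("127.0.0.1", port))
listener.close()
print(check_tcp("127.0.0.1", port))
```

Distinguishing "refused" from "timeout" is the useful part: the first points at the database process, the second at the network path.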
Bash · ARCHITECT
What is the Linux /proc filesystem and how do you use it for troubleshooting?
/proc is a virtual filesystem: it exists only in memory, not on disk. It exposes kernel and process information as readable files. Every process has a directory /proc/PID containing: cmdline (full command), fd (open file descriptors), status (memory, state), net (network info). Key files: /proc/meminfo shows a detailed memory breakdown including cached, buffers, available. /proc/cpuinfo shows CPU details and core count. /proc/loadavg shows the 1/5/15 minute load average. /proc/net/tcp shows all TCP connections in kernel format. For troubleshooting: cat /proc/PID/status shows memory usage (VmRSS, VmSize), state, and thread count, and cat /proc/PID/oom_score shows the OOM killer's current score for the process. ls /proc/PID/fd | wc -l counts open file descriptors; if this is very high, you have a file descriptor leak. cat /proc/PID/net/tcp lists the TCP sockets in that process's network namespace, not only its own; to see the sockets a specific process owns, use ss -tnp or lsof -p PID. You should never edit /proc files except for specific tuning like /proc/sys/net/ipv4/tcp_fin_timeout or /proc/PID/oom_score_adj.
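Because /proc entries are plain files, a script can read them without any extra tooling. A Linux-only sketch that collects a few of the fields mentioned above for one process; `proc_summary` is a hypothetical helper and defaults to inspecting the current process.

```python
import os

def proc_summary(pid="self"):
    """Read a few troubleshooting fields from /proc for one process."""
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in ("Name", "State", "VmRSS", "Threads"):
                fields[key] = value.strip()
    # Each entry in fd/ is one open file descriptor.
    fields["open_fds"] = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/oom_score") as f:
        fields["oom_score"] = int(f.read())
    return fields

if os.path.isdir("/proc/self"):      # skip quietly on non-Linux systems
    print(proc_summary())
```

Sampling `open_fds` for a PID every minute and watching it climb is a quick way to confirm a descriptor leak.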
Bash · ENGINEER
Write a Bash health check script for multiple services.
#!/bin/bash
set -euo pipefail
# Each entry is name:namespace:port:health-path
SERVICES=(
  "payment-api:production:8080:/health"
  "order-api:production:8080:/health"
  "inventory:production:8080:/actuator/health"
)
FAILED=0
for svc in "${SERVICES[@]}"; do
  IFS=":" read -r name namespace port path <<< "$svc"
  # Port-forward in background (local port is 1<port>, e.g. 18080)
  kubectl port-forward "svc/$name" "1${port}:${port}" \
    -n "$namespace" &>/dev/null &
  PF_PID=$!
  sleep 2 # Wait for port-forward
  # Health check. curl prints 000 and exits non-zero when it cannot connect,
  # so "|| true" stops set -e from aborting the whole loop.
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    "http://localhost:1${port}${path}" \
    --connect-timeout 5 --max-time 10) || true
  kill "$PF_PID" 2>/dev/null || true
  wait "$PF_PID" 2>/dev/null || true # reap it so the local port is free again
  if [[ "$HTTP_CODE" == "200" ]]; then
    echo "✅ $name: HEALTHY (HTTP $HTTP_CODE)"
  else
    echo "❌ $name: UNHEALTHY (HTTP $HTTP_CODE)"
    FAILED=$((FAILED+1))
  fi
done
if [[ $FAILED -gt 0 ]]; then
  echo "ALERT: $FAILED service(s) unhealthy"
  exit 1
fi
echo "All services healthy"
Bash · ENGINEER
How do you use Bash functions and error handling in DevOps scripts?
#!/bin/bash
set -euo pipefail
# ── Logging functions ──────────────────────────────────
log() { echo "[$(date "+%Y-%m-%d %H:%M:%S")] INFO $*"; }
warn() { echo "[$(date "+%Y-%m-%d %H:%M:%S")] WARN $*" >&2; }
error() { echo "[$(date "+%Y-%m-%d %H:%M:%S")] ERROR $*" >&2; exit 1; }
# ── Retry function ─────────────────────────────────────
retry() {
  local max_attempts="$1"; shift
  local delay="$1"; shift
  local attempt=1
  while [[ $attempt -le $max_attempts ]]; do
    # Run the command exactly as passed: "$@" preserves argument quoting,
    # which eval on a flattened string does not
    if "$@"; then
      return 0
    fi
    if [[ $attempt -lt $max_attempts ]]; then
      warn "Attempt $attempt/$max_attempts failed. Retrying in ${delay}s..."
      sleep "$delay"
      delay=$((delay * 2)) # exponential backoff
    fi
    attempt=$((attempt + 1))
  done
  error "$* failed after $max_attempts attempts"
}
# ── Cleanup trap ───────────────────────────────────────
cleanup() {
  log "Cleaning up temp files..."
  rm -f /tmp/deploy-$$.*
  # Kill any background processes
  jobs -p | xargs -r kill 2>/dev/null || true
}
trap cleanup EXIT
# ── Usage ──────────────────────────────────────────────
retry 3 5 kubectl rollout status deployment/myapp -n production
retry 3 10 helm upgrade --install myapp ./chart --atomic
Bash · ENGINEER
Write a Bash script to deploy a new image version to AKS.
#!/bin/bash
set -euo pipefail # exit on error, undefined vars, pipe failures
# Variables from CI/CD environment
APP_NAME="${APP_NAME:?APP_NAME required}"
IMAGE_TAG="${IMAGE_TAG:?IMAGE_TAG required}"
NAMESPACE="${NAMESPACE:-production}"
ACR_NAME="${ACR_NAME:?ACR_NAME required}"
IMAGE="${ACR_NAME}.azurecr.io/${APP_NAME}:${IMAGE_TAG}"
echo "Deploying ${APP_NAME}:${IMAGE_TAG} to ${NAMESPACE}"
# Update the deployment image (assumes the container is named after the app)
kubectl set image "deployment/${APP_NAME}" \
  "${APP_NAME}=${IMAGE}" \
  -n "${NAMESPACE}"
# Wait for rollout. The command must sit inside the if: under set -e, a bare
# failing command would exit the script before the rollback could run.
if kubectl rollout status "deployment/${APP_NAME}" \
    -n "${NAMESPACE}" \
    --timeout=300s; then
  echo "Deployment successful"
  # Verify pods are running
  kubectl get pods -n "${NAMESPACE}" -l "app=${APP_NAME}"
else
  echo "Deployment failed - rolling back"
  kubectl rollout undo "deployment/${APP_NAME}" -n "${NAMESPACE}"
  exit 1
fi
Bash · ENGINEER
How do you use Bash for log analysis? Find all ERROR lines in last hour.
#!/bin/bash
# Find errors from last hour in application logs
# Assumes each log line starts with "YYYY-MM-DD HH:MM:SS" and GNU date
LOG_FILE="/var/log/app/application.log"
ONE_HOUR_AGO=$(date -d "1 hour ago" "+%Y-%m-%d %H:%M:%S")
# Count errors in last hour (timestamps in this format compare correctly as strings)
ERROR_COUNT=$(awk -v cutoff="$ONE_HOUR_AGO" \
  '$1" "$2 >= cutoff && /ERROR/ {count++} END {print count+0}' \
  "$LOG_FILE")
echo "Errors in last hour: $ERROR_COUNT"
# Extract unique error messages (drop date, time, and level fields)
echo "Top error messages:"
grep "ERROR" "$LOG_FILE" | \
  awk '{for(i=4;i<=NF;i++) printf $i" "; print ""}' | \
  sort | uniq -c | sort -rn | head -10
# Check pod logs directly from Kubernetes
# Get logs from all pods matching a label
kubectl get pods -n production -l app=payment-api -o name | \
  xargs -I {} kubectl logs {} -n production --since=1h | \
  grep ERROR | \
  sort | uniq -c | sort -rn | head -20
# Alert if error count exceeds threshold
if [ "$ERROR_COUNT" -gt 100 ]; then
  echo "ALERT: High error rate - $ERROR_COUNT errors in last hour"
  # Add Slack/PagerDuty notification here
fi
Bash · ENGINEER
Explain set -euo pipefail and why you use it in every script.
set -e: exit immediately if any command returns a non-zero exit code. Without this: a failed command is silently ignored and the script continues in a broken state. With -e: if kubectl apply fails, the script stops immediately and you know about it. Commands tested by if, while, && or || are exempt, which is what makes if ! cmd; then ... fi safe. set -u: treat unset variables as errors. Without this: if $APP_NAME is undefined, it expands to an empty string silently, so kubectl set image deployment/$APP_NAME would run with an empty deployment name. With -u: the script exits with an "unbound variable" error and you know a required variable was not set. set -o pipefail: by default, the exit code of a pipeline (cmd1 | cmd2 | cmd3) is the exit code of the LAST command only. Without pipefail: in kubectl get pods | grep ERROR | sort, if kubectl fails, sort still exits 0 on its empty input and the overall exit code is 0 (success!). With pipefail: if any command in the pipeline fails, the whole pipeline is considered failed. One side effect: grep exits 1 when it finds no match, so under pipefail use grep ... || true where "no match" is a valid result. Together, set -euo pipefail means the script is strict: any error, undefined variable, or pipe failure stops the script immediately. This is the production standard. Put it on the first line after the shebang in every script.
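The effect of -e and -u can be checked by running the same snippet with and without each option. A small harness, assuming `bash` is on the PATH; `run_bash` and the variable name `APP_NAME_TYPO` are made up for the demo.

```python
import subprocess

def run_bash(script):
    """Run a bash snippet; return (exit_code, stdout)."""
    result = subprocess.run(["bash", "-c", script],
                            capture_output=True, text=True)
    return result.returncode, result.stdout.strip()

# -e: without it the script carries on after a failed command.
print(run_bash('false; echo "still running"'))
print(run_bash('set -e; false; echo "still running"'))

# -u: without it an undefined variable silently becomes an empty string.
print(run_bash('echo "deploying [$APP_NAME_TYPO]"'))
print(run_bash('set -u; echo "deploying [$APP_NAME_TYPO]"'))
```

The first and third calls exit 0 and print their message; the second and fourth stop before the echo with a non-zero exit code.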
Bash · ENGINEER
How do you check if a Kubernetes namespace exists before creating resources?
#!/bin/bash
set -euo pipefail
NAMESPACE="production"
# Method 1: check exit code
if kubectl get namespace "$NAMESPACE" >/dev/null 2>&1; then
  echo "Namespace $NAMESPACE already exists"
else
  echo "Creating namespace $NAMESPACE"
  kubectl create namespace "$NAMESPACE"
  # Add labels for network policy
  kubectl label namespace "$NAMESPACE" env=production team=platform
fi
# Method 2: using --dry-run (idempotent approach)
kubectl create namespace "$NAMESPACE" \
  --dry-run=client -o yaml | kubectl apply -f -
# Idempotent: if exists, no-op. If not, creates it.
# This is the preferred production approach
# Method 3: wait for namespace to be Active
kubectl wait --for='jsonpath={.status.phase}=Active' \
  "namespace/$NAMESPACE" --timeout=30s
echo "Namespace $NAMESPACE is Active and ready"
🚀 Senior Scenarios: Outage War Stories, ROI, On-Call (9 Q&As)
DevOps · SENIOR
Walk through the worst production outage you have handled.
At HPE on the telecom provisioning platform (OCP cluster serving 3 network provisioning systems). Alert at 2 AM: all provisioning requests failing, 100% error rate. Impact: new SIM cards and number activations for thousands of customers failing. My process: Minute 0 — acknowledged alert, opened incident channel, messaged stakeholders "Investigating outage in provisioning platform". Minute 2 — kubectl get pods -A: 40% of pods in CrashLoopBackOff across all namespaces. kubectl get nodes: all nodes showing MemoryPressure. kubectl top nodes: memory at 98-100% on every node. Minute 5 — root cause: a recent configuration change to the Oracle database connection pool settings had removed the pool size limit. Each pod was creating 200 connections instead of 20. The connection pool memory overhead OOMKilled the pods. Pods restarted, created more connections, more OOM. A feedback loop. Minute 8 — mitigation: kubectl scale deployment oracle-adapter --replicas=1 in each namespace. Reduced connection load immediately. Nodes stabilised. Minute 15 — services restored. Root cause fix: rolled back the ConfigMap change. Pods restarted cleanly with correct pool settings. Post-incident: added a Prometheus alert for database connection count. Added ResourceQuota for memory per namespace. Added PodDisruptionBudget. Documented in runbook. Changed review process for database config changes.
DevOps · SENIOR
How do you present DevOps metrics and ROI to management?
Management does not care about deployment frequency or MTTR in isolation. They care about business outcomes. Translation: Deployment frequency 30→ per day: "We now deliver features to customers 5x faster. Competitors take 2 weeks. We take 2 days." MTTR 4 hours→ 20 minutes: "Average revenue lost per incident dropped from ₹2L to ₹17K based on transaction volume during downtime." Change failure rate 15%→ 3%: "Rollbacks in production dropped from 18 per quarter to 4. Each rollback previously caused 2-hour customer impact." Pipeline automation: "Manual deployment process took 6 hours of engineer time. Now fully automated. Saves 180 engineer-hours per month = ₹3.6L monthly at loaded cost." Infrastructure cost: "Right-sizing and autoscaling reduced Azure spend by 35%. Saving ₹18L annually." Security: "Trivy scanning caught 47 CRITICAL CVEs before they reached production this quarter." Present as a quarterly Business Review: show trend graphs (4 quarters of DORA metrics), show cost trend (infrastructure spend over time), show security posture (CVEs caught vs missed). Ask management: "What do you want us to improve next quarter? Faster feature delivery, fewer incidents, or lower cost?" Let them prioritise. This creates partnership instead of reporting.
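The arithmetic behind two of those translations can be reproduced in a few lines, which helps when someone in the review asks where a number came from. This is a sketch under stated assumptions: revenue loss is taken to scale linearly with downtime, and the 30 deployments per month and ₹2,000 loaded hourly cost are back-solved from the quoted 180 hours and ₹3.6L, not figures given in the answer.

```python
def revenue_loss_after(loss_before, mttr_before_min, mttr_after_min):
    """Scale per-incident revenue loss linearly with downtime (an assumption)."""
    return loss_before * mttr_after_min / mttr_before_min

# MTTR 4 hours -> 20 minutes, starting from a loss of 2 lakh rupees per incident
loss = revenue_loss_after(200_000, 4 * 60, 20)
print(f"Loss per incident after: ₹{loss:,.0f}")   # ₹16,667, quoted as ₹17K

# Pipeline automation savings
hours_per_deploy = 6        # from the answer above
deploys_per_month = 30      # assumption, back-solved from 180 hours
hourly_cost = 2_000         # assumption (rupees), back-solved from ₹3.6L
hours_saved = hours_per_deploy * deploys_per_month
print(f"Hours saved per month: {hours_saved}")                  # 180
print(f"Monthly saving: ₹{hours_saved * hourly_cost:,}")        # ₹360,000
```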
DevOps · SENIOR
Describe your approach to on-call and incident management.
On-call is a system, not a heroic individual. Good on-call has: Clear escalation: L1 (automated recovery), L2 (on-call engineer), L3 (senior + management). Alert quality: every alert is actionable. No alert fires without a clear runbook. No flapping alerts. We reviewed and culled alerts quarterly — removed 40% that were noise. Runbooks: for every alert, a runbook with: what it means, initial triage commands, common causes and fixes, escalation criteria. New team member should be able to follow it at 2 AM. Rotation fairness: maximum 1 week per rotation. No more than 2 incidents per night (if exceeded, root cause must be fixed before next rotation). Post-incident process: every P1 gets a blameless RCA within 24 hours. Action items must be implemented within 1 sprint. Repeat incidents get root cause eliminated (not just fixed). At HPE: we had a certificate expiry alert that kept firing quarterly. Instead of just renewing the cert each time, I implemented cert-manager for automatic renewal. The alert never fired again. On-call health: monitor on-call burden. Track: alerts per week per engineer, sleep-affecting alerts, time to resolve. If any metric exceeds threshold: stop feature work, fix the reliability problem first.
DevOps · SENIOR
How do you design a CI/CD pipeline for 10 microservices deploying to AKS?
Architecture: each microservice has its own pipeline but all use the same template. A Jenkins shared library (or Azure DevOps template) defines the standard stages. Stage 1, CI (runs on every PR): checkout, unit tests, code coverage check, SonarQube quality gate (abort if it fails), docker build, Trivy scan of the built image (abort on CRITICAL), push to ACR with a PR tag. Stage 2, Dev deploy (auto on merge to develop): helm upgrade --install --atomic --namespace dev --set image.tag=${BUILD_ID}. Run smoke tests. Stage 3, Staging deploy (auto on merge to main): deploy to staging, run integration tests, performance test (k6 or JMeter), security scan against the running app (DAST). Stage 4, Production deploy (manual approval): raise a CR automatically, wait for approval, deploy during the maintenance window, helm upgrade --atomic --timeout 10m, automated smoke tests, monitor for 30 minutes, close the CR. Canary for critical services: Argo Rollouts with a 10% → analysis → 30% → analysis → 100% pattern. Each microservice has its own ArgoCD Application. The App-of-Apps root application manages all 10. Every deployment is visible to the whole team in the ArgoCD UI. Rollback: one click in ArgoCD or helm rollback.
DevOps · SENIOR
How do you implement a complete security pipeline for container images?
Defence in depth, with security at every stage. Stage 1, Developer workstation: pre-commit hooks run Dockerfile linting (Hadolint) and basic secrets detection (detect-secrets). This catches obvious issues before they reach CI. Stage 2, CI on every commit: Trivy scans the built image for OS CVEs, language CVEs, and IaC misconfigurations. FAIL on CRITICAL, alert on HIGH. Checkov scans Terraform and Kubernetes YAML for misconfigurations (public storage, no resource limits, privileged containers). Semgrep runs static analysis on application code. Stage 3, Registry: images are signed (ACR content trust with Notary), and the "only signed images run on AKS" guarantee comes from verifying those signatures at admission time. Microsoft Defender for Containers (formerly Azure Defender) continuously monitors images in ACR and alerts on new CVEs. Tag immutability: lock pushed tags so they cannot be overwritten, and deploy by version tag or digest rather than :latest. Stage 4, Cluster admission: Azure Policy for Kubernetes (OPA Gatekeeper) enforces at admission time: images must come from ACR only, must have passed the Trivy scan, must not run as root, must have resource limits. Stage 5, Runtime: a Falco DaemonSet detects runtime anomalies such as a shell spawned in a container, an unexpected network connection, or sensitive file access. Result: a CVE in a base image is caught in CI before deployment, or if it emerges post-deployment, Defender alerts and Falco detects exploitation attempts.
DevOps · SENIOR
Walk through a P1 production incident: API returning 503s. Your exact process.
This is about speed, communication, and systematic thinking under pressure. Minute 0, acknowledge: respond to the PagerDuty alert within 3 minutes. Open an incident channel. "Acknowledged, investigating, will update in 10 minutes." Never go silent. Minute 1-3, understand blast radius: how many users are affected? Is it all traffic or a specific region/service? Check the monitoring dashboard first for the HTTP error rate spike. When did it start? Correlate with recent deployments. Minute 3-8, isolate: kubectl get pods -n production to look for CrashLoopBackOff. kubectl get events -n production --sort-by=lastTimestamp to see what happened recently. Check the service that is returning 503: is the pod itself broken, or its dependency? Was there a deployment in the last 30 minutes? Check downstream: if payment-api is returning 503, is it because the database is unreachable? Minute 8-15, fix: if bad deployment → kubectl rollout undo deployment/payment-api. If database connection issue → check the connection pool, then kubectl rollout restart deployment/payment-api (with maxUnavailable: 0 in the rolling-update strategy so capacity never drops). If node pressure → cordon the node, drain pods to healthy nodes. Minute 15, communicate: "Root cause identified: deployment at 14:32 introduced a misconfigured database URL. Rolled back at 14:47. Services restored. Monitoring." Post-incident (24 hours): full RCA, timeline, what we missed, what we are changing.
DevOps · SENIOR
Major release in 2 hours. Pipeline breaks. What do you do?
First question: is there a way to deploy without the pipeline? I always keep an emergency deployment runbook. Then triage: flaky test (skip with justification, document, fix after release), broken tool -- SonarQube down, registry unreachable (bypass with tech lead approval), actual code issue (delay the release -- no release is worth pushing broken code). I communicate immediately to all stakeholders: "Pipeline issue detected at X, investigating, update in 15 minutes." Never go silent. If bypassing: I need written confirmation from tech lead level that the decision was made and risk accepted. I create an immediate ticket to fix within 24 hours. The key is: I own the outcome either way. If I delay, I own the delay. If I bypass and something breaks, I own that too. Own your decisions clearly.
DevOps · SENIOR
How do you convince a team to adopt DevOps practices they are resisting?
Resistance comes from fear: fear of more work, fear of breaking things, fear of losing control. I never lead with process or tools. I lead with pain. First, I listen to understand what actually hurts. Long release cycles? Too many production incidents? Weekend deployments taking 6 hours? Then I show a small win. Take the most painful thing and fix it. If deployments take 6 hours manually, automate one step -- not everything, just one. Show the team their deployment time dropped to 4 hours. Now they want more. At HPE, the team resisted automated testing because "it takes time to write tests." I took the most common production bug from the last quarter, wrote a test that would have caught it, and showed it. The next incident happened and the test caught it before prod. The team started writing tests themselves. The pattern: show the pain, fix one piece, celebrate the win, let the team pull you forward. Mandate from above creates compliance. Demonstrated value creates ownership. Ownership is what you actually want.
DevOps · SENIOR
Senior developer keeps bypassing CI/CD and deploying directly to production.
People problem disguised as a technical problem. Technical fix: remove direct prod access -- all deployments go through the pipeline, no exceptions. Implement audit logging so every manual action is visible to everyone. People fix: understand WHY they are bypassing. Always a reason. Usually: pipeline is too slow (45 minutes is unacceptable), too many false failures (flaky tests blocking good code), bureaucratic approvals for trivial changes. At HPE a developer was bypassing because our pipeline took 45 minutes. We parallelised the test stages, added caching, optimised the Docker build. Pipeline went to 12 minutes. Bypassing stopped completely. Fix the real problem. If direct access is needed for genuine emergencies: create a documented emergency process with required ticket, manager approval, and mandatory post-incident review. This channels the behaviour constructively rather than just blocking it. The last resort: if bypasses continue after the real problem is fixed, it is now a performance issue handled by management.
🐙 ArgoCD & GitOps: Sync, App-of-Apps, Secrets, FluxCD (7 Q&As)
ArgoCD · ARCHITECT
What is GitOps and how does ArgoCD implement it?
GitOps is an operational model where Git is the single source of truth for infrastructure and application state. Every change to a production system happens through a Git commit and PR review — no direct kubectl apply, no console clicks, no shell scripts run manually. ArgoCD implements GitOps by continuously watching a Git repository. Every 3 minutes (default), ArgoCD fetches the desired state from Git and compares it with the actual state of the Kubernetes cluster. If they differ (sync needed), ArgoCD can automatically apply the Git state to the cluster. The critical shift: developers push to Git. ArgoCD pulls and applies. Nobody needs direct cluster credentials. Every deployment is a commit with author, timestamp, and review. Rollback is git revert — takes 2 minutes instead of 30. Drift detection: if someone manually kubectl applies something, ArgoCD marks the app as OutOfSync and can revert it automatically with selfHeal: true. At HPE: ArgoCD manages all telecom platform deployments. When a microservice update causes issues, rollback is git revert the image tag change. ArgoCD applies it within 3 minutes.
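The compare-and-apply loop described above can be sketched in a few lines. This is a toy model, not ArgoCD's implementation: state is flattened to a plain dict, and the `reconcile` and `diff_state` names and the sample image tag are made up for illustration.

```python
from typing import Any


def diff_state(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (desired_value, actual_value)} for every key that differs."""
    keys = desired.keys() | actual.keys()
    return {
        k: (desired.get(k), actual.get(k))
        for k in sorted(keys)
        if desired.get(k) != actual.get(k)
    }


def reconcile(desired: dict[str, Any], actual: dict[str, Any], self_heal: bool) -> tuple[str, dict[str, Any]]:
    """One reconciliation pass: report sync status and the resulting cluster state."""
    drift = diff_state(desired, actual)
    if not drift:
        return "Synced", actual
    if self_heal:
        # Git wins: the live state is replaced by the desired state.
        return "Synced", dict(desired)
    return "OutOfSync", actual


git_state = {"image": "payment-api:v1.4.2", "replicas": 3}   # hypothetical values
live_state = {"image": "payment-api:v1.4.2", "replicas": 5}  # someone scaled manually

print(diff_state(git_state, live_state))
print(reconcile(git_state, live_state, self_heal=False)[0])
print(reconcile(git_state, live_state, self_heal=True))
```

The point to make in an interview: without self-heal the drift is only reported, with self-heal the manual change is overwritten by what Git says.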
ArgoCD · ENGINEER
What is the App of Apps pattern in ArgoCD?
App of Apps is a pattern where a parent ArgoCD Application watches a directory in Git that contains other Application YAML files. The parent app manages the child apps. When you commit a new Application YAML file to that directory, ArgoCD automatically creates and manages that new application. Use App of Apps when: you have many services (10+) and want to manage them consistently, you want a single sync to bootstrap an entire environment, or you want ArgoCD to manage its own applications (self-managing). The directory structure: argocd/production/ contains sro-app.yaml, com-app.yaml, monitoring.yaml. The parent app watches that directory. Adding a new service is just adding a new YAML file. From our HPE design: we bootstrap staging with one parent app sync that creates 15 child applications in the right order using sync waves (the argocd.argoproj.io/sync-wave annotation, lower waves first). Complete environment up in under 10 minutes.
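To show how sync waves give a bootstrap order, here is a simplified sketch that groups child applications by their argocd.argoproj.io/sync-wave annotation. The wave numbers and the "namespaces" child app are assumptions for illustration; ArgoCD's real ordering also considers sync phase and resource kind.

```python
from itertools import groupby

WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave"


def wave_of(manifest: dict) -> int:
    """Read the sync wave from the manifest annotations; a missing annotation means wave 0."""
    annotations = manifest.get("metadata", {}).get("annotations", {})
    return int(annotations.get(WAVE_ANNOTATION, "0"))


def sync_order(manifests: list[dict]) -> list[tuple[int, list[str]]]:
    """Group child app names by wave, lowest wave first, names sorted within a wave."""
    ordered = sorted(manifests, key=lambda m: (wave_of(m), m["metadata"]["name"]))
    return [
        (wave, [m["metadata"]["name"] for m in group])
        for wave, group in groupby(ordered, key=wave_of)
    ]


children = [
    {"metadata": {"name": "sro-app", "annotations": {WAVE_ANNOTATION: "2"}}},
    {"metadata": {"name": "monitoring", "annotations": {WAVE_ANNOTATION: "1"}}},
    {"metadata": {"name": "com-app", "annotations": {WAVE_ANNOTATION: "2"}}},
    {"metadata": {"name": "namespaces"}},  # hypothetical app with no annotation
]

for wave, names in sync_order(children):
    print(f"wave {wave}: {', '.join(names)}")
```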
ArgoCD · PRODUCTION
ArgoCD application is OutOfSync. Walk through diagnosis and fix.
OutOfSync means the actual cluster state differs from what is in Git. Step 1: identify what is different. argocd app diff app-name shows the exact diff, like git diff but for Kubernetes resources. Step 2: determine cause. Common causes: someone ran kubectl apply manually (config drift), Kubernetes or a controller mutated a resource (defaulted fields, injected annotations), the app changed its own config at runtime (bad practice), or a new commit to Git changed something. Step 3: if the diff is an intentional Git change, sync it: argocd app sync app-name. Step 4: if the diff is manual drift, the Git version is correct, so revert the manual change. With selfHeal: true, ArgoCD does this automatically. Without it: run argocd app sync app-name, adding --force only when a resource has to be replaced rather than patched. Step 5: if neither, investigate what changed. kubectl apply view-last-applied deployment/app-name prints the last-applied-configuration annotation. Check audit logs. Step 6: prevent recurrence. Enable selfHeal: true in sync policy. Restrict direct kubectl access using Kubernetes RBAC or admission webhooks. Rule: in production, Git is always right. Manual changes should be exceptions with immediate follow-up to update Git.
ArgoCD · ARCHITECT
How do you manage multi-cluster deployments with ArgoCD?
ArgoCD can manage multiple Kubernetes clusters from one ArgoCD instance. Register each cluster: argocd cluster add context-name. Each Application's destination.server points to the cluster API endpoint. For environment promotion: three clusters (dev, staging, prod), separate Application objects pointing to each cluster, same Git repo but different branches or folder paths (environments/dev/, environments/staging/, environments/prod/). Promotion flow: merge PR to dev branch → ArgoCD syncs dev cluster. After QA approval: PR to update image tag in environments/staging/ → ArgoCD syncs staging cluster. After approval: PR to update environments/prod/ → ArgoCD syncs prod cluster. Each environment has its own ArgoCD Project with RBAC. Dev cluster: auto-sync allowed for all. Prod cluster: sync windows restrict deployments to business hours, requires manual sync approval. ApplicationSets (ArgoCD feature): generate Applications for multiple clusters from one template — useful when you have 10+ clusters following the same pattern (multi-region or multi-tenant).
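The "one template, many clusters" idea behind ApplicationSets can be illustrated by rendering Application manifests from a list of clusters. This is only a sketch of the concept in Python, not how the ApplicationSet controller works; the server URLs, repo URL, app name and the auto-sync choice for staging are all assumptions.

```python
def generate_applications(app_name: str, repo_url: str, clusters: list[dict]) -> list[dict]:
    """Render one Application manifest (as a dict) per cluster entry."""
    applications = []
    for cluster in clusters:
        spec = {
            "project": cluster["name"],  # one ArgoCD Project per environment
            "source": {
                "repoURL": repo_url,
                "targetRevision": "main",
                "path": f"environments/{cluster['name']}",
            },
            "destination": {"server": cluster["server"], "namespace": app_name},
        }
        if cluster.get("auto_sync", False):
            spec["syncPolicy"] = {"automated": {"prune": True, "selfHeal": True}}
        applications.append({
            "apiVersion": "argoproj.io/v1alpha1",
            "kind": "Application",
            "metadata": {"name": f"{app_name}-{cluster['name']}", "namespace": "argocd"},
            "spec": spec,
        })
    return applications


clusters = [  # hypothetical cluster endpoints
    {"name": "dev", "server": "https://dev.example.internal:6443", "auto_sync": True},
    {"name": "staging", "server": "https://staging.example.internal:6443", "auto_sync": True},
    {"name": "prod", "server": "https://prod.example.internal:6443", "auto_sync": False},
]

apps = generate_applications("payment-api", "https://git.example.internal/platform/deploy.git", clusters)
for a in apps:
    mode = "auto-sync" if "syncPolicy" in a["spec"] else "manual sync"
    print(a["metadata"]["name"], a["spec"]["source"]["path"], mode)
```

Prod gets no automated sync policy here, which matches the manual-approval rule above.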
ArgoCD · ARCHITECT
How do you manage secrets in a GitOps workflow with ArgoCD?
The GitOps dilemma: everything is in Git, but plaintext secrets must never be in Git (base64 in a Secret manifest is encoding, not encryption), and even encrypted secrets are only as safe as the key that protects them. Solutions: Sealed Secrets (Bitnami): kubeseal encrypts a Secret with the cluster's public key. Only the controller in the cluster holds the private key and can decrypt it. The SealedSecret (encrypted) is safe to commit to Git. ArgoCD syncs SealedSecrets, the controller decrypts them into real Secrets. Works fully GitOps, no external secret store required. External Secrets Operator: defines ExternalSecret resources in Git that reference secrets in an external store (Azure Key Vault, AWS Secrets Manager, HashiCorp Vault). The operator fetches and creates Kubernetes Secrets. ArgoCD syncs the ExternalSecret definition (safe to commit), the operator handles the actual secrets. Best for enterprise: centralised secret rotation, audit trail, compliance. Vault Agent Injector: HashiCorp Vault sidecar injected into pods fetches secrets at runtime, writes to in-memory filesystem. Application reads from /vault/secrets/. No Kubernetes Secrets at all. Most secure but most complex. My production recommendation: External Secrets Operator + Azure Key Vault. Secrets defined in code (the ExternalSecret manifest), stored in Key Vault, rotated in Key Vault. ArgoCD monitors the ExternalSecret, ESO syncs changes. Full GitOps, no secrets in Git, centralised management.
ArgoCD · SENIOR
ArgoCD vs FluxCD — when do you choose one over the other?
Both are CNCF GitOps tools. I have used ArgoCD in production. ArgoCD: multi-cluster management from one UI, excellent visibility into sync status, RBAC built into the UI, supports Helm/Kustomize/plain YAML, App-of-Apps pattern for bootstrapping entire clusters. Best when: team needs visibility, multiple clusters to manage, non-platform engineers need to see deployment status. FluxCD: CLI-first, no UI (Weave GitOps adds one), modular controllers, built-in image automation (auto-commit new image tags to Git), better multi-tenancy with namespace scoping. Best when: platform team manages it exclusively, you want Kubernetes-native approach, you need image automation without extra tools. Choose ArgoCD when: product team wants self-service deploys with visibility. Choose FluxCD when: pure GitOps, platform-team-managed, CLI workflow preferred. In practice: many enterprises use ArgoCD for application delivery and FluxCD for cluster bootstrapping. They can coexist. My recommendation for your context: ArgoCD — the UI is critical for communicating deployment status to non-platform stakeholders.
ArgoCD · PRODUCTION
ArgoCD shows an application as OutOfSync but sync keeps failing. Debug walkthrough.
Step 1: read the sync error in the ArgoCD UI. Application → Sync Status → click the error. Common errors: "namespace not found" (the target namespace does not exist: create it or add the CreateNamespace=true sync option), "resource already exists" (manual resource in the cluster not in Git: delete the manual resource or use kubectl apply --server-side), RBAC error (ArgoCD service account lacks permission to create this resource type). Step 2: check the diff. ArgoCD shows exactly what differs between Git and cluster. Sometimes it is a field being continuously mutated by a controller (like spec.replicas managed by an HPA, or a caBundle injected into a webhook configuration): use ignoreDifferences to tell ArgoCD to ignore that field. Step 3: check the application controller logs. kubectl logs -n argocd statefulset/argocd-application-controller | grep my-app (the controller runs as a StatefulSet in current releases; older installs used a Deployment). Step 4: verify the Git repository is accessible. ArgoCD repository page shows connection status. Expired credential, wrong branch name, or private repo without SSH key all cause sync failure. Step 5: if Helm-based app: render the manifests locally first. helm template my-release ./chart -f values.yaml. If template fails locally, it will fail in ArgoCD too. Fix the template, push to Git, ArgoCD syncs.
⎈ Helm: Charts, Values, Upgrade Recovery, Dependencies (9 Q&As)
Helm · ENGINEER
What is Helm and what problem does it solve?
Helm is the package manager for Kubernetes. Without Helm: to deploy an application you have multiple YAML files (Deployment, Service, Ingress, ConfigMap, Secret) and you manually edit each file for different environments (change image tag, change replica count, change hostname). This is error-prone and hard to version. Helm solves this with charts: a chart packages all the YAML files as templates with variables. You provide values (replicaCount: 3, image.tag: v1.2.0) and Helm renders the final YAML. The same chart deploys to dev, staging, and prod with different values files. Upgrade is one command: helm upgrade --install. Rollback is one command: helm rollback. History is tracked: helm history shows every deployment. This is why Helm is the standard way to deploy applications on Kubernetes — every major tool (Prometheus, ArgoCD, nginx ingress, cert-manager) is distributed as a Helm chart.
Helm · ENGINEER
What is the difference between helm install and helm upgrade --install?
helm install fails if the release already exists (the name is taken). helm upgrade --install (effectively an upsert) installs the release if it does not exist and upgrades it if it does. This makes it idempotent: safe to run multiple times, perfect for CI/CD pipelines where you don't know if this is the first deployment or an upgrade. Use helm upgrade --install with the --atomic flag in production: if the upgrade fails (pods don't become ready within the timeout), Helm automatically rolls back to the previous revision. Without --atomic, a failed upgrade leaves the release in a failed state and you must roll back manually. --atomic implies --wait, so the command blocks until the resources are ready; set --timeout 5m (or whatever your slowest rollout needs) explicitly. If you do not use --atomic, add --wait yourself. This ensures your CI/CD pipeline knows if the deployment actually worked.
Helm · ARCHITECT
How do you manage Helm chart values across multiple environments?
Layered values files. Base values.yaml contains defaults that work for any environment (usually dev-sized). Environment-specific files override only what differs. helm upgrade --install myapp ./chart -f values.yaml -f values-production.yaml — later files override earlier ones. Production values-production.yaml only needs to specify what is different: larger resource limits, pinned image tag instead of latest, external database instead of local PostgreSQL, correct hostname, production-specific annotations. This keeps values-production.yaml small and focused. For image tags: never put the image tag in Git-committed values files for production. Pass it at deploy time: --set image.tag=$CI_COMMIT_SHA. This way Git does not have a rapidly-changing file, and every deploy has a unique traceable tag. Secret management: never put secrets in values files committed to Git. Use helm secrets plugin for encrypted values or pass secrets via --set from CI/CD secret variables.
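The layering rule (later files override earlier ones, --set overrides files) behaves like a recursive dictionary merge. The sketch below mimics that behaviour in plain Python under the assumption that maps merge key by key while scalars and lists are replaced; it is not Helm's code, and the sample values and the commit SHA are made up.

```python
from copy import deepcopy
from typing import Any


def merge_values(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge override into a copy of base; later values win."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not concatenated.
            merged[key] = deepcopy(value)
    return merged


# values.yaml (defaults), values-production.yaml (overrides), then --set image.tag=...
base = {
    "replicaCount": 1,
    "image": {"repository": "myapp", "tag": "latest"},
    "resources": {"limits": {"memory": "256Mi"}},
}
production = {"replicaCount": 3, "resources": {"limits": {"memory": "1Gi"}}}
set_flags = {"image": {"tag": "3f2a9c1"}}  # hypothetical commit SHA

final = merge_values(merge_values(base, production), set_flags)
print(final)
```

Note that image.repository survives even though the production file never mentions it, which is why the production file can stay small.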
Helm · PRODUCTION
Helm upgrade failed and the application is down. What do you do?
First: if you used --atomic, Helm already rolled back automatically. helm history shows revision 5 as failed and a newer revision recorded as a rollback to revision 4, which is now active. Check what the failure was: helm status myapp shows the error. Check pod events: kubectl describe pod and kubectl logs for the failed pods. Second: if you did NOT use --atomic and the release is in a failed state: helm rollback myapp 4 to roll back to the last known good revision. This takes seconds. Third: investigate the failure. helm template myapp ./chart --values values.yaml > rendered.yaml and compare with the working version. kubectl apply --dry-run=client -f rendered.yaml catches schema and syntax issues (use --dry-run=server to also run admission checks). Common failure causes: image does not exist in registry (wrong tag, registry authentication), resource limits too low (pod OOMKilled), health check misconfigured (readiness probe fails, Helm times out waiting for ready). Prevention: always use --atomic --timeout 10m (--atomic implies --wait). Add staging deployment before prod. Use helm diff (plugin) to preview changes before upgrading. Add smoke tests as a post-upgrade step in CI/CD.
Helm · ENGINEER
Walk through the complete Helm chart structure and what each file does.
A Helm chart is a directory with this structure: Chart.yaml — metadata: chart name, version, appVersion (the app image tag), description, dependencies list. values.yaml — default configuration values. Any value here can be overridden at install time. templates/ directory — Kubernetes YAML files with Go template syntax. _helpers.tpl — named template functions reused across manifests (define once, include everywhere). templates/deployment.yaml, service.yaml, ingress.yaml, configmap.yaml — standard K8s resources. templates/NOTES.txt — displayed to user after helm install, typically shows how to access the app. charts/ — sub-charts (dependencies). .helmignore — files to exclude from packaging (like .git, README). When you run helm install, Helm renders all templates/ files by combining the template syntax with values (from values.yaml overridden by your -f flags or --set), producing plain Kubernetes YAML, then applies it to the cluster. The key design principle: templates should be generic, values should be environment-specific. Your templates/deployment.yaml should work for dev, staging, and production — the differences come entirely from values files.
Helm · ENGINEER
How do you handle multiple environments with Helm? What is the values file strategy?
Environment-specific values override the defaults in values.yaml. Structure: values.yaml (base defaults, used in development), values-staging.yaml (staging overrides), values-production.yaml (production overrides). Install command: helm install myapp ./mychart -f values-production.yaml. The chart's own values.yaml is always loaded first, and the -f flag merges production values ON TOP of it. Only the keys present in values-production.yaml override the defaults. Using --set for single overrides: helm upgrade myapp ./mychart --set image.tag=v2.1.0. In CI/CD: helm upgrade --install myapp ./mychart -f values-${ENVIRONMENT}.yaml --set image.tag=${IMAGE_TAG} --atomic --wait. The --atomic flag: if the upgrade fails, automatically rolls back to the previous release. The --wait flag: waits for all pods to be Running and ready before returning success. Production best practice: never use helm install (fails if already exists), always use helm upgrade --install (upgrades or installs). Never use --set for many values in CI/CD: use -f with a committed values file so changes are tracked in Git. ArgoCD with Helm: ArgoCD can deploy Helm charts. You commit values file changes to Git, ArgoCD detects the change, renders the chart (the equivalent of helm template) and applies the resulting manifests. It does not run helm upgrade, so these apps do not show up in helm list.
Helm · PRODUCTION
A Helm upgrade failed and left the release in a broken state. How do you recover?
First: understand the state. helm status myapp shows the release status: deployed, failed, pending-upgrade, or superseded. helm history myapp shows all revisions with their status. If upgrade failed: helm rollback myapp 5 (roll back to revision 5, the last working one). Helm rollback re-applies the previous revision manifests. Check: kubectl get pods after rollback. Are pods healthy? If rollback also fails (rare but happens with CRDs or irreversible changes): helm status shows you the release is stuck. To force it: helm upgrade myapp ./mychart --force --atomic re-applies the chart and forces resource updates through a replacement strategy (delete and recreate), which can cause brief downtime. Nuclear option: helm uninstall myapp then helm install, but this causes downtime. Debugging failed upgrade: helm upgrade myapp ./mychart --debug 2>&1 | tee upgrade.log shows you exactly what Kubernetes returned. Look for: invalid YAML, missing required values, resource quota exceeded, PVC not found. Prevention: always use --atomic on upgrades in CI/CD. It automatically rolls back on failure, so a failed pipeline never leaves the cluster in a broken intermediate state. Also: run helm template . -f values-prod.yaml before the upgrade to catch template rendering errors before touching the cluster.
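Picking the rollback target from the history can be scripted. The sketch below assumes a list of records shaped like the output of helm history myapp -o json (only the revision and status fields are used); the sample data is invented, and `last_good_revision` is a hypothetical helper, not part of Helm.

```python
from typing import Optional

# Statuses that mark a revision as having been rolled out successfully at some point.
GOOD_STATUSES = {"deployed", "superseded"}


def last_good_revision(history: list[dict]) -> Optional[int]:
    """Return the highest revision number that reached a healthy status, or None."""
    good = [entry["revision"] for entry in history if entry["status"] in GOOD_STATUSES]
    return max(good) if good else None


# Hypothetical sample: revision 6 failed, revision 5 is still the deployed one.
history = [
    {"revision": 4, "status": "superseded", "description": "Upgrade complete"},
    {"revision": 5, "status": "deployed", "description": "Upgrade complete"},
    {"revision": 6, "status": "failed", "description": "Upgrade failed: timed out"},
]

target = last_good_revision(history)
if target is None:
    print("no healthy revision to roll back to")
else:
    print(f"helm rollback myapp {target}")
```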
Helm · ARCHITECT
How do you manage Helm chart dependencies in a microservices architecture?
Helm dependencies (subcharts) allow one chart to include others. Defined in Chart.yaml under dependencies. Example: your application chart depends on PostgreSQL and Redis. Chart.yaml: dependencies: - name: postgresql, version: "12.x.x", repository: https://charts.bitnami.com/bitnami. - name: redis, version: "17.x.x", repository: https://charts.bitnami.com/bitnami. Run helm dependency update to download them into charts/ directory. Values for subcharts are namespaced: postgresql.auth.password: "mypassword" in your values.yaml configures the postgresql subchart. Two strategies for microservices: Strategy 1 (Umbrella chart): one parent chart with all microservices as subcharts. One helm upgrade deploys everything. Simple but: one bad subchart can block the whole deployment. Strategy 2 (Independent charts per service): each microservice has its own chart, deployed independently by ArgoCD. More resilient — payment service chart can deploy without touching order service. This is the preferred approach for large systems. At HPE: each microservice (TeMIP, UOC, UTM, UCA) had its own Helm chart managed as an independent ArgoCD Application. The charts were stored in a dedicated Helm charts Git repo, versioned separately from application code.
Helm · SENIOR
Helm upgrade failed midway leaving the release in a broken state. How do you recover?
Check the release status: helm status my-release -n production. If it shows "failed" or "pending-upgrade", Helm has left the release in a bad state. Check history: helm history my-release -n production shows all revisions with their status and timestamps. Rollback: helm rollback my-release [REVISION] -n production. Example: helm rollback my-release 3 -n production rolls back to revision 3. This redeploys the previous chart version with the previous values. If rollback also fails: this usually means the cluster state is inconsistent (partially applied resources). Manually check: kubectl get all -l app.kubernetes.io/instance=my-release -n production. Delete the broken resources manually if needed, then run helm rollback. If the release is stuck in "pending-upgrade": helm history shows it as pending forever, and new upgrades fail with "another operation (install/upgrade/rollback) is in progress". Force unlock: kubectl delete secret -n production sh.helm.release.v1.my-release.v[N] deletes the stuck history entry. Then retry the upgrade. Prevention: use helm upgrade --atomic, which automatically rolls back if any deployment fails its readiness check within --timeout. Add --timeout 5m to all production upgrades (--atomic already implies --wait).
⭐ More Topics: KEDA, OTel, ELK, GitFlow, DR, Certificates (22 Q&As)
Azure/AKS · ENGINEER
What is Bicep and how does it compare to ARM Templates and Terraform?
Bicep is Microsoft's domain-specific language for Azure infrastructure. It compiles to ARM JSON templates but has much simpler syntax. ARM templates: verbose JSON, hard to read, difficult to maintain (as a rough rule of thumb, 100 lines of Bicep correspond to 300+ lines of ARM JSON). Bicep: clean declarative syntax, type safety, IDE autocomplete, native Azure support, no state file to manage. Terraform: multi-cloud (works on AWS, Azure, GCP), larger community, better module ecosystem, but requires separate installation, state management, and understanding of HCL. When to use Bicep: Azure-only infrastructure, want native Azure tooling, team is already in the Azure DevOps ecosystem. When to use Terraform: multi-cloud, need complex state management across providers, team already knows Terraform. Many teams use both: Terraform for infrastructure (AKS cluster, networking), Bicep for Azure-specific pieces (for example Azure Policy definitions and assignments).
Azure/AKS · ENGINEER
How do you implement Azure Policy for Kubernetes (OPA Gatekeeper)?
Azure Policy for Kubernetes uses OPA Gatekeeper to enforce rules at admission time, before resources are created. Built-in policies enforce: containers must not run as root, images must come from allowed registries (ACR only), pods must have CPU/memory limits, no privileged containers, required labels on all resources. Assign policies: az policy assignment create --name require-resource-limits --policy-definition-id /providers/Microsoft.Authorization/policyDefinitions/... --scope /subscriptions/.../resourceGroups/.../providers/Microsoft.ContainerService/managedClusters/myAKS. Effect: Audit (logs violations but allows), Deny (blocks non-compliant resources). Custom policies: write a ConstraintTemplate (defines the parameter schema and the Rego rule) and a Constraint (an instance of that template with concrete parameters). Example: require all pods to have a team label. Constraint: spec.match.kinds with kinds: [Pod], spec.parameters.requiredLabels: [team] (the parameter name is whatever your template's schema defines). Test: kubectl apply a pod without the team label → admission rejected with policy violation message. Policy compliance: Azure Portal shows all non-compliant resources and remediation suggestions.
Kubernetes · ENGINEER
What is a Kubernetes Operator and when do you create one?
An Operator is a Kubernetes controller that automates operational tasks for a specific application. It extends the K8s API with Custom Resource Definitions (CRDs) and watches those resources to take action. Built-in controllers handle generic workloads (Deployment, StatefulSet). Operators handle application-specific logic. Example: PostgreSQL Operator manages: creating a database cluster (CRD: kind: PostgresCluster), taking backups on schedule, performing failover when primary dies, scaling read replicas, rotating passwords. Without operator: ops team manually handles all of this. With operator: it is automated and self-healing. When to create an operator: you have a stateful application with complex day-2 operations that cannot be modelled with standard K8s resources. Examples: database operators (CloudNativePG, MongoDB Community), messaging operators (Strimzi for Kafka), certificate operators (cert-manager). Build with: Operator SDK (Go or Ansible based). For most teams: use an existing operator from OperatorHub rather than building one. Building is complex — Go programming required for most.
Kubernetes · PRODUCTION
How do you debug a Kubernetes networking issue where pods cannot communicate?
Systematic network debugging: Step 1: verify basic pod connectivity. kubectl exec pod-a -- wget -qO- http://pod-b-ip:port. If this fails: Step 2: check NetworkPolicies. kubectl get networkpolicies -n namespace. If any policies exist, they might be blocking. Temporarily delete the policy to test: if communication works without the policy, the policy is the culprit. Step 3: check Service. kubectl get svc myservice. kubectl describe svc myservice — check if Endpoints field has pod IPs. If Endpoints is empty: label selector on Service does not match pod labels. kubectl get pod --show-labels. Step 4: check CoreDNS. kubectl exec pod-a -- nslookup myservice.namespace.svc.cluster.local. If DNS fails: kubectl get pods -n kube-system | grep coredns. Check CoreDNS logs. Step 5: check kube-proxy (or Cilium). kubectl get pods -n kube-system | grep kube-proxy. If using Cilium: cilium connectivity test runs a full connectivity validation. Step 6: use ephemeral debug container: kubectl debug pod-a -it --image=nicolaka/netshoot -- bash. netshoot has all networking tools (tcpdump, curl, netstat, dig).
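Step 3 (empty Endpoints because the selector does not match) is easy to reason about once you see the matching rule. A minimal sketch of equality-based selector matching, with invented pod names and labels:

```python
def selector_matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """A pod is selected only if every selector key/value pair appears in its labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints_for(selector: dict[str, str], pods: dict[str, dict[str, str]]) -> list[str]:
    """Return the names of the pods a Service with this selector would route to."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return []
    return sorted(name for name, labels in pods.items() if selector_matches(selector, labels))


pods = {  # hypothetical pods and labels
    "api-7d9f": {"app": "payment-api", "tier": "backend"},
    "api-x2k1": {"app": "payment-api", "tier": "backend"},
    "web-91ab": {"app": "storefront", "tier": "frontend"},
}

print(endpoints_for({"app": "payment-api"}, pods))
print(endpoints_for({"app": "payments-api"}, pods))  # typo in the selector: no endpoints
```

A one-character typo in the selector gives an empty list, which is exactly what an empty Endpoints field in kubectl describe svc looks like.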
Terraform · ENGINEER
What is a Terraform data source and when do you use it?
Data sources allow Terraform to READ information about existing resources without managing them. You use data sources when: the resource was created outside this Terraform code (created manually, by another team, or by a different Terraform state), you need to reference an existing resource by name or tag, you want to look up dynamic values (latest AMI ID, current subscription ID). Examples: data "azurerm_resource_group" "existing" { name = "shared-rg" } references an existing resource group created by another team. data "azurerm_client_config" "current" {} returns the current subscription ID and tenant ID. A data "azurerm_key_vault" "vault" block with name = "company-vault" and resource_group_name = "platform-rg" looks up the Key Vault to get its ID. Then reference it from a resource: resource "azurerm_key_vault_secret" "app_secret" { key_vault_id = data.azurerm_key_vault.vault.id } (other required arguments omitted). Data sources are normally read at plan time, so the plan output shows the current value; if an argument depends on something not yet known, the read is deferred to apply. If the referenced resource does not exist, the plan fails. Common use: look up the existing VNet created by the network team, then create subnets in it using this Terraform code.
Jenkins · PRODUCTION
Jenkins build triggered from GitHub webhook is not starting. Debug process.
Step 1: check if the webhook is reaching Jenkins. GitHub → repo Settings → Webhooks → click the webhook → Recent Deliveries. Green tick = delivered. Red X = failed (check the error, usually an SSL cert issue or Jenkins URL unreachable). Step 2: check that Jenkins received the webhook. Manage Jenkins → System Log → look for webhook-related log entries. The GitHub Plugin logs the trigger. Step 3: check GitHub plugin configuration. Manage Jenkins → Configure System → GitHub section → GitHub Servers → test connection. Verify the GitHub API URL and credentials are correct. Step 4: check the job trigger configuration. The Multibranch pipeline or job must have "GitHub hook trigger for GITScm polling" checked. Or use the generic webhook trigger plugin. Step 5: check if Jenkins is behind a firewall. GitHub webhooks come from GitHub IP ranges, so ensure your firewall allows inbound from GitHub IPs on port 443 or 8080. GitHub publishes its IP ranges: curl https://api.github.com/meta. Step 6: fallback to polling. Add H/5 * * * * to "Poll SCM" and Jenkins polls every 5 minutes. Less efficient but more reliable than webhooks through complex firewalls.
Docker · ENGINEER
How do you scan Docker images for vulnerabilities in CI/CD?
Trivy is one of the most widely used scanners (open source, fast). Integration in Jenkins/Azure DevOps: after docker build, before docker push: trivy image --severity CRITICAL,HIGH --exit-code 1 myapp:${BUILD_NUMBER}. --exit-code 1 makes the pipeline fail if CRITICAL or HIGH CVEs are found. Output formats: table (human readable), json (for parsing), sarif (for GitHub Security tab). Trivy scans: OS packages (apt, yum), language packages (pip, npm, maven, gem), Dockerfile misconfigurations, secrets accidentally embedded in images. In Azure DevOps: task type: Docker, command: run, image: aquasec/trivy:latest, options: --rm, containerCommand: image --exit-code 1 --severity CRITICAL,HIGH myregistry.azurecr.io/myapp:$(Build.BuildId). Microsoft Defender for Containers: scans images pushed to ACR automatically. Alerts on new CVEs even for images already in the registry. Baseline: do not fail on MEDIUM, it produces too much noise. CRITICAL always fail. HIGH fail unless explicitly suppressed with documented justification. Scan the base image too: FROM python:3.11-slim inherits all python:3.11-slim vulnerabilities. Rebuild when the base image has new security patches.
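The "CRITICAL always fails, HIGH fails unless suppressed" baseline can be applied to a JSON report in a small gate script. This sketch assumes the report shape produced by trivy image --format json (a top-level Results list whose entries carry a Vulnerabilities list); the CVE IDs are placeholders, and the `blocking_findings` helper and its suppression list are illustrative, not a Trivy feature.

```python
import json

BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(report: dict, suppressed: frozenset[str] = frozenset()) -> list[str]:
    """Return the IDs of CRITICAL/HIGH findings that are not explicitly suppressed."""
    found = set()
    for result in report.get("Results", []):
        # "Vulnerabilities" may be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln["Severity"] in BLOCKING and vuln["VulnerabilityID"] not in suppressed:
                found.add(vuln["VulnerabilityID"])
    return sorted(found)


# Hypothetical, heavily trimmed report with placeholder CVE IDs.
sample = json.loads("""
{"Results": [
  {"Target": "myapp (debian 12)", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
    {"VulnerabilityID": "CVE-0000-0002", "Severity": "MEDIUM"},
    {"VulnerabilityID": "CVE-0000-0003", "Severity": "HIGH"}]},
  {"Target": "requirements.txt", "Vulnerabilities": null}
]}
""")

# CVE-0000-0003 stands in for a HIGH finding with a documented justification.
failures = blocking_findings(sample, suppressed=frozenset({"CVE-0000-0003"}))
print("FAIL" if failures else "PASS", failures)
```

In a pipeline you would exit non-zero when the list is not empty; keeping the suppression list in Git gives you the documented justification trail.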
Ansible · ENGINEER
What is the difference between ansible ad-hoc commands and playbooks? When to use each?
Ad-hoc commands: run a single Ansible module once without writing a file. ansible all -m ping, ansible webservers -m shell -a "free -m", ansible databases -m service -a "name=mysql state=restarted". Use for: quick one-off tasks, checking system state, emergency actions, exploring what Ansible can do. Playbooks: YAML files defining multiple tasks in sequence. Reusable, version-controlled, idempotent. Use for: everything that happens more than once, automation that runs on schedule, configuration management, deployment automation. The rule: if you find yourself running the same ad-hoc command more than twice, make it a playbook task. Idempotency depends on the module, not on ad-hoc vs playbook: the shell and command modules are not idempotent, so running ansible all -m shell -a "useradd newuser" twice fails the second time because the user already exists. A task with user: name=newuser state=present is idempotent, so the second run is a no-op (reported as ok, not changed). Debugging: ansible all -m setup collects facts (OS, memory, disk, IP) as ad-hoc. ansible all -m command -a "cat /etc/os-release" checks OS versions. These are naturally ad-hoc because you run them occasionally for diagnostics.
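The idempotency difference is easier to explain with a toy model. The sketch below uses an in-memory set in place of a real user database; the function names are invented and only mirror the behaviour of a raw useradd call versus a "state=present" style task.

```python
class UserExistsError(Exception):
    """Raised when an imperative create is attempted for an existing user."""


def useradd(users: set[str], name: str) -> None:
    """Imperative: fails if the user already exists, like running useradd twice."""
    if name in users:
        raise UserExistsError(name)
    users.add(name)


def ensure_user_present(users: set[str], name: str) -> bool:
    """Declarative: returns True if a change was made, False if already in the desired state."""
    if name in users:
        return False
    users.add(name)
    return True


users: set[str] = set()
print(ensure_user_present(users, "newuser"))  # first run changes the system
print(ensure_user_present(users, "newuser"))  # second run is a no-op

try:
    useradd(users, "newuser")  # the imperative version errors on a repeat
except UserExistsError as exc:
    print(f"useradd failed: user {exc} already exists")
```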
Monitoring · ARCHITECT
How do you implement distributed tracing with OpenTelemetry?
OpenTelemetry is the CNCF standard for distributed tracing, metrics, and logs. It provides vendor-neutral instrumentation. Architecture: application code instrumented with OTel SDK (or auto-instrumented via agent) → sends telemetry to OTel Collector → Collector processes and exports to backend (Jaeger, Zipkin, Azure Monitor, Datadog). Python auto-instrumentation: pip install opentelemetry-distro opentelemetry-exporter-otlp opentelemetry-instrumentation-flask. OTEL_SERVICE_NAME=payment-api OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 opentelemetry-instrument python app.py. This automatically instruments Flask HTTP requests, plus SQLAlchemy DB calls and Redis calls once their instrumentation packages are installed (opentelemetry-bootstrap -a install adds them for the libraries it detects), with no code changes needed. OTel Collector deployment in Kubernetes: DaemonSet or Deployment with OTel Collector. Receives traces from pods, batches, exports to Jaeger. Collector config: receivers: otlp (grpc port 4317), processors: batch, exporters: otlp (Jaeger accepts OTLP natively; the dedicated jaeger exporter has been removed from recent Collector releases) + prometheus. Context propagation: each service passes trace-id and span-id in HTTP headers (W3C TraceContext). Downstream services create child spans under the same trace. You see the entire request journey across all services in one Jaeger trace view. At HPE: added OTel to 5 Java microservices using the Java auto-instrumentation agent (one JVM flag, no code changes). Reduced P99 latency debugging from hours to minutes.
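Context propagation is worth being able to show by hand. The sketch below parses a W3C traceparent header (version-traceid-parentid-flags) and builds the header a service would send downstream: same trace ID, new span ID. In practice the OTel SDK does this for you; the helper names here are made up, and the sample header is the example value used in the W3C Trace Context specification.

```python
import re
import secrets
from typing import NamedTuple

# version (2 hex) - trace-id (32 hex) - parent span-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: str


def parse_traceparent(header: str) -> SpanContext:
    """Parse a traceparent header; raise ValueError if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    _version, trace_id, span_id, flags = match.groups()
    return SpanContext(trace_id, span_id, flags)


def child_traceparent(incoming: str) -> str:
    """Build the header for a downstream call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{ctx.trace_id}-{new_span_id}-{ctx.flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)
```

The trace ID staying constant across hops is what lets the backend stitch all spans into one trace view.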
Monitoring · ENGINEER
How do you set up ELK Stack for Kubernetes log centralisation?
ELK = Elasticsearch (storage and search) + Logstash (log processing) + Kibana (visualisation). With Fluentd or Fluent Bit in place of Logstash the stack is often called EFK. For Kubernetes: use Fluent Bit (lightweight) or Fluentd as the log collector DaemonSet instead of Logstash. Architecture: Fluent Bit DaemonSet on every node → reads /var/log/containers/*.log → adds K8s metadata (pod name, namespace, labels) → sends to Elasticsearch. Fluent Bit config: [INPUT] Name tail, Tag kube.*, Path /var/log/containers/*.log. [FILTER] Name kubernetes, Match kube.*, Kube_URL https://kubernetes.default.svc:443. [OUTPUT] Name es, Match *, Host elasticsearch, Index k8s-logs. Elasticsearch: deploy via ECK (Elastic Cloud on Kubernetes) operator or Helm. Index templates define mappings. ILM (Index Lifecycle Management) policies: hot (recent logs, fast SSDs) → warm (week old, standard disk) → cold (month old, cold storage) → delete (after 90 days). Kibana: create index pattern k8s-logs-*. Discover: search and filter logs. Dashboard: visualise error rates, namespace activity. Alerts: Elasticsearch Watcher or Kibana alerting rules. For high volume: use Logstash between Fluent Bit and Elasticsearch for complex parsing/enrichment.
AWS · ENGINEER
How do you implement IAM least privilege in AWS?
IAM least privilege: grant only the permissions actually needed, nothing more. Principles: start with no permissions. Add only what the role/user needs. Review and revoke regularly. Tools: IAM Access Analyzer: analyzes policies and flags overly permissive rules. Its unused access findings show which permissions have not been used within a tracking period you configure (90 days is a common choice). AWS recommends removing unused permissions. IAM Policy Simulator: test what actions a user/role can actually perform before granting access. Condition keys: restrict by time, IP, VPC endpoint, MFA status. Example restrictive S3 policy: Allow s3:GetObject on arn:aws:s3:::mybucket/* with condition aws:SourceVpc=vpc-xxx (only from that VPC). Service Control Policies (SCPs): apply to entire AWS accounts in an Organization. Prevent users from disabling CloudTrail, creating public S3 buckets, disabling GuardDuty. They apply even to the root user of member accounts (not to the management account). Permission boundaries: max permissions an entity can have. Even if an identity policy grants more, the boundary limits actual access. For EKS pods: use IRSA (IAM Roles for Service Accounts). Each pod gets its own role with only what it needs. payment-pod can write to payment-bucket only. order-pod cannot access payment-bucket at all. Audit: use IAM Access Advisor (last accessed information) and CloudTrail to see what is actually used.
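How SCPs, permission boundaries and identity policies combine is a frequent follow-up question. As a deliberately simplified model (exact action names only, no wildcards, no explicit Deny, no resource policies, no conditions), the effective permissions are the intersection of the three sets. The action lists below are illustrative assumptions.

```python
def effective_permissions(identity_policy: set[str], boundary: set[str], scp: set[str]) -> set[str]:
    """An action is allowed only if the identity policy, the boundary and the SCP all allow it."""
    return identity_policy & boundary & scp


identity_policy = {"s3:GetObject", "s3:PutObject", "iam:CreateUser"}
boundary = {"s3:GetObject", "s3:PutObject", "s3:ListBucket"}
scp = {"s3:GetObject", "s3:PutObject", "s3:ListBucket", "ec2:DescribeInstances"}

allowed = effective_permissions(identity_policy, boundary, scp)
print(sorted(allowed))

# iam:CreateUser is granted by the identity policy but cut off by the boundary.
print("iam:CreateUser" in allowed)
```

Real IAM evaluation has more inputs, but the intersection picture explains why granting more in an inline policy does not widen access past the boundary or the SCP.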
AWS · ARCHITECT
How do you design a multi-region active-passive DR on AWS?
Active-passive: primary region handles all traffic. Secondary region is warm standby, ready to take over. Targets: RTO 15 minutes, RPO 5 minutes. Components: Route 53: primary failover record points to ALB in us-east-1 with health check. If the health check fails 3 consecutive times: automatic failover to ALB in eu-west-1. EKS: active cluster in us-east-1. Passive cluster in eu-west-1, running minimal workloads. ArgoCD: one ArgoCD managing both clusters. Same Git manifests deployed to both. RDS: Multi-AZ in primary region for HA. Cross-region read replica in secondary. DB failover: promote read replica to primary (5-10 minutes). Script this so it runs automatically: Lambda triggered by a CloudWatch alarm on the Route 53 health check. S3: cross-region replication enabled on all buckets. Replication is asynchronous: most objects arrive within seconds to minutes, and S3 Replication Time Control adds a 15-minute SLA if the RPO depends on it. EFS: AWS DataSync for periodic sync. Terraform: identical infrastructure code for both regions with different variable files. Test quarterly: simulate us-east-1 failure by pointing Route 53 health check to a failing endpoint. Verify traffic routes to eu-west-1. Verify database promotion completes. Document every step with expected timing. Refine until RTO target is consistently achieved.
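The "3 consecutive failures" rule is easy to get wrong in a drill, so here is a small model of it. This is a sketch, not the Route 53 API: it assumes the same threshold applies in both directions (unhealthy → healthy as well), and the function names are hypothetical.

```python
from typing import Iterable

FAILURE_THRESHOLD = 3  # consecutive opposite results needed to flip status

def endpoint_healthy(check_results: Iterable[bool],
                     threshold: int = FAILURE_THRESHOLD) -> bool:
    """Replay health-check results (True = pass) and return the final status."""
    healthy = True
    streak = 0  # consecutive results that disagree with the current status
    for passed in check_results:
        if passed == healthy:
            streak = 0
        else:
            streak += 1
            if streak >= threshold:
                healthy = passed
                streak = 0
    return healthy

def active_region(check_results: Iterable[bool],
                  primary: str = "us-east-1",
                  secondary: str = "eu-west-1") -> str:
    """Region that receives traffic after the given sequence of primary checks."""
    return primary if endpoint_healthy(check_results) else secondary

if __name__ == "__main__":
    print(active_region([True, False, False]))          # two failures: no failover yet
    print(active_region([False, False, False]))         # third failure: failover
    print(active_region([False] * 3 + [True] * 3))      # recovered: back to primary
```

A single blip does not move traffic, which is why the quarterly test has to keep the endpoint failing for at least three check intervals.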
Git · ENGINEER
What is the difference between git fetch, git pull, and git merge?
git fetch: downloads changes from remote WITHOUT modifying your working directory or current branch. Updates remote tracking branches (origin/main). Safe operation: never modifies local work. Use to see what changed remotely: git fetch origin, then git log HEAD..origin/main shows commits on remote you do not have yet. git pull: git fetch followed by git merge (or git rebase with --rebase flag). Downloads AND integrates remote changes into your current branch. If your branch has no local commits it simply fast-forwards; if local and remote have diverged, it creates a merge commit. With --rebase: replays your local commits on top of remote changes, giving a cleaner linear history but rewriting your local commits. git merge: integrates changes from one branch into another. Unless a fast-forward is possible, it creates a new merge commit preserving both branches' history. Use for syncing a working branch in Jenkinsfiles and CI: git pull --rebase origin main. Use for long-lived feature branches: git merge main (update feature branch with main changes). Best practice for team workflow: git fetch origin, review what changed (git log HEAD..origin/main), then git pull --rebase origin main. The --rebase keeps history clean without merge commits for routine syncs. Do not integrate work into shared branches (main, release) with a bare git pull: prefer explicit fetch + merge through a reviewed PR.
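A toy model of which refs each command moves. It is a deliberate simplification, assuming refs are just name → commit-id mappings and that the pull is a fast-forward (no local commits); the function names are illustrative and nothing here calls git.

```python
def fetch(local_refs: dict, remote_refs: dict) -> dict:
    """Model of git fetch: only remote-tracking refs (origin/*) move."""
    updated = dict(local_refs)
    for branch, commit in remote_refs.items():
        updated[f"origin/{branch}"] = commit
    return updated

def fast_forward(local_refs: dict, branch: str, upstream: str) -> dict:
    """Model of a fast-forward merge: the branch ref jumps to the upstream commit."""
    updated = dict(local_refs)
    updated[branch] = updated[upstream]
    return updated

def pull(local_refs: dict, remote_refs: dict, branch: str = "main") -> dict:
    """Model of git pull: fetch, then merge the tracking ref into the branch."""
    return fast_forward(fetch(local_refs, remote_refs), branch, f"origin/{branch}")

if __name__ == "__main__":
    local = {"main": "c1", "origin/main": "c1"}
    remote = {"main": "c3"}
    print(fetch(local, remote))  # main stays at c1, origin/main moves to c3
    print(pull(local, remote))   # both main and origin/main end at c3
```

After fetch the local branch is untouched, which is why fetch is always safe and pull is not.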
Git · ENGINEER
How do you handle merge conflicts in Git?
Merge conflicts happen when two branches change the same lines of the same file, or when one branch deletes or renames a file the other edits. How to resolve: Step 1: identify conflicts after git merge or git rebase. git status shows "both modified" files. Step 2: open each conflicted file. Conflict markers: <<<<<<< HEAD (your changes), ======= (separator), >>>>>>> feature-branch (incoming changes). During a rebase the sides are swapped: HEAD is the branch you are rebasing onto. Edit the file to keep the correct version, which could be yours, theirs, or a combination. Remove the conflict markers entirely. Step 3: after editing each file: git add filename. Step 4: git merge --continue (or git rebase --continue). Tools: git mergetool opens a visual diff tool (vimdiff, VS Code, IntelliJ). VS Code diff editor shows conflicts clearly. Prevention: merge main into your feature branch frequently (git merge main). Small, focused PRs: less code changed means fewer conflicts. Communication: tell team members when you are working in a file that others are also changing. Team rule: never work in the same file at the same time for large changes. If conflicts happen repeatedly in the same file: consider splitting the file into smaller modules or having clearer code ownership.
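To make the marker layout concrete, here is a small parser that splits a conflicted file into its two sides. A minimal sketch, assuming the default two-way marker style (no ||||||| base section from diff3); the function name is hypothetical. The same check works as a CI guard against committed markers.

```python
def parse_conflicts(text: str) -> list:
    """Return one (ours, theirs) pair of line lists per conflict hunk."""
    hunks = []
    ours, theirs, state = [], [], None
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            ours, theirs, state = [], [], "ours"    # start of a hunk
        elif line.startswith("=======") and state == "ours":
            state = "theirs"                        # separator
        elif line.startswith(">>>>>>> ") and state == "theirs":
            hunks.append((ours, theirs))            # end of a hunk
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return hunks

if __name__ == "__main__":
    sample = (
        "timeout = 30\n"
        "<<<<<<< HEAD\n"
        "retries = 3\n"
        "=======\n"
        "retries = 5\n"
        ">>>>>>> feature-branch\n"
        "debug = false\n"
    )
    for ours, theirs in parse_conflicts(sample):
        print("ours:", ours, "theirs:", theirs)
```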
Git · ENGINEER
Explain GitFlow vs trunk-based development. Which do you prefer?
GitFlow: multiple long-lived branches (main, develop, feature/xxx, release/xxx, hotfix/xxx). PRs go to develop. Release branch created from develop. After testing, merged to main and tagged. Hotfixes branch from main. Complex, lots of merges, good for scheduled release cycles. Problems: long-lived branches diverge significantly. Integration issues discovered late. Trunk-based development (TBD): one main branch (trunk). Everyone commits to main (or short-lived feature branches of 1-2 days). Continuous integration — code goes to production daily. Feature flags hide incomplete features in production. Why TBD is better for DevOps: forces small, focused commits. Catches integration issues immediately (CI runs on every commit). Aligns with continuous deployment — if code is always in a releasable state, you can deploy anytime. At HPE we used TBD: feature branches never lasted more than 2 days. Short-lived branches, small PRs, fast reviews. Paired with ArgoCD: every merge to main triggered deployment to dev. Allowed us to deploy to production 5x per day safely. GitFlow is still valid for: embedded software, regulated releases, large teams with long QA cycles. For cloud-native microservices: TBD is the right choice.
OpenShift · ENGINEER
How do you create and manage OpenShift Routes with TLS?
OpenShift Route exposes a service externally through the built-in HAProxy router. Three TLS termination types: Edge termination: TLS terminated at the router. Traffic from router to pod is HTTP (unencrypted). Certificate stored in the Route object. Most common for web apps. Passthrough: TLS passes through to the pod. Pod handles TLS itself. Used when the application must own the TLS handshake (mutual TLS, client certs). Re-encrypt: TLS terminated at router, then re-encrypted to pod with a different certificate. Used when both external and internal encryption are required. Create edge route: oc create route edge myapp --service=myapp --hostname=myapp.company.com --cert=tls.crt --key=tls.key --ca-cert=ca.crt (oc expose only creates unsecured routes and has no TLS flags). Or via YAML: spec.tls.termination: edge, spec.tls.certificate: -----BEGIN CERTIFICATE-----, spec.tls.key: -----BEGIN PRIVATE KEY-----. Automated certificates: OpenShift cert-manager integration or custom cert-manager via OperatorHub. cert-manager does not watch Routes by itself; with the cert-manager openshift-routes add-on installed, annotate the Route with cert-manager.io/issuer-name: letsencrypt → cert-manager requests and manages the certificate automatically. Wildcard certs: one certificate for *.apps.cluster.company.com covers all routes.
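The YAML fields above map onto a Route object like this. A minimal sketch that builds the manifest as JSON (which oc apply -f also accepts); the helper name is hypothetical, the PEM strings are placeholders, and insecureEdgeTerminationPolicy: Redirect is an added assumption to force HTTP → HTTPS.

```python
import json
from typing import Optional

def edge_route(name: str, service: str, host: str,
               cert_pem: str, key_pem: str, ca_pem: Optional[str] = None) -> dict:
    """Build a route.openshift.io/v1 Route with edge TLS termination."""
    tls = {
        "termination": "edge",
        "certificate": cert_pem,
        "key": key_pem,
        "insecureEdgeTerminationPolicy": "Redirect",  # assumption, not from the answer
    }
    if ca_pem:
        tls["caCertificate"] = ca_pem
    return {
        "apiVersion": "route.openshift.io/v1",
        "kind": "Route",
        "metadata": {"name": name},
        "spec": {
            "host": host,
            "to": {"kind": "Service", "name": service},
            "tls": tls,
        },
    }

if __name__ == "__main__":
    route = edge_route("myapp", "myapp", "myapp.company.com",
                       cert_pem="<contents of tls.crt>",
                       key_pem="<contents of tls.key>")
    print(json.dumps(route, indent=2))
```

For passthrough the tls block shrinks to just the termination field, because the router never sees the certificate.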
OpenShift · PRODUCTION
How do you troubleshoot image pull failures in OpenShift?
OpenShift image pull failures: more complex than vanilla K8s because of the built-in registry, ImageStreams and per-namespace pull permissions. Start with oc describe pod mypod. The Events section shows: Failed to pull image "myimage": rpc error: code = Unknown desc = ... Common errors and causes: "unauthorized": authentication failure. Check if the pull secret is correct and attached to the service account. oc get sa default -n mynamespace -o yaml shows imagePullSecrets. "not found": image name or tag wrong, or the repository does not exist in that registry. Check image name exactly: registry/org/repo:tag. "x509: certificate signed by unknown authority": registry uses a self-signed or private-CA cert. Add the registry CA to the cluster image config (image.config.openshift.io/cluster, spec.additionalTrustedCA) or, as a last resort, list the registry under spec.registrySources.insecureRegistries. For ACR: create pull secret: oc create secret docker-registry acr-pull-secret --docker-server=myacr.azurecr.io --docker-username=$SP_APP_ID --docker-password=$SP_PASSWORD. Attach to service account: oc secrets link default acr-pull-secret --for=pull. For internal OpenShift registry: ensure the BuildConfig pushed the image correctly. oc get is (ImageStream) shows available images. For permission issues: if pod is trying to pull from internal registry, the SA needs system:image-puller role in the namespace where the image lives.
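The error → cause mapping above can be turned into a first-pass triage helper for pod events. A rough sketch: the substrings are the three from the answer, the function name is hypothetical, and real event text varies by registry, so treat a match as a hint only.

```python
# (substring to look for, likely cause) in priority order.
KNOWN_PULL_ERRORS = [
    ("unauthorized",
     "authentication failure: check the pull secret and that it is linked to the service account"),
    ("x509: certificate signed by unknown authority",
     "untrusted registry certificate: add the registry CA to the cluster image config"),
    ("not found",
     "wrong image name or tag: check registry/org/repo:tag exactly"),
]

def diagnose_pull_error(event_message: str) -> str:
    """Map a 'Failed to pull image' event message to a likely cause."""
    lowered = event_message.lower()
    for needle, cause in KNOWN_PULL_ERRORS:
        if needle in lowered:
            return cause
    return "unrecognised error: read the full event text"

if __name__ == "__main__":
    msg = 'Failed to pull image "myimage": x509: certificate signed by unknown authority'
    print(diagnose_pull_error(msg))
```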
Python · ENGINEER
How do you write a Python script to check certificate expiry dates?
import datetime
import socket
import ssl

def check_cert_expiry(hostname, port=443, warn_days=30):
    """Check SSL certificate expiry for a hostname"""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as ssock:
                cert = ssock.getpeercert()
        # Parse expiry date into a timezone-aware UTC datetime
        expiry_str = cert["notAfter"]  # "Nov 15 12:00:00 2024 GMT"
        expiry = datetime.datetime.fromtimestamp(
            ssl.cert_time_to_seconds(expiry_str), tz=datetime.timezone.utc)
        days_left = (expiry - datetime.datetime.now(datetime.timezone.utc)).days
        status = "CRITICAL" if days_left < 7 else "WARNING" if days_left < warn_days else "OK"
        return {"hostname": hostname, "expiry": expiry.isoformat(),
                "days_left": days_left, "status": status}
    except ssl.SSLError as e:
        # Covers failed verification too, e.g. a certificate that has already expired
        return {"hostname": hostname, "error": str(e), "status": "ERROR"}
    except Exception as e:
        # DNS failure, timeout, connection refused
        return {"hostname": hostname, "error": str(e), "status": "UNREACHABLE"}

# Check multiple hosts
hosts = ["app.company.com", "api.company.com", "payment.company.com"]
for host in hosts:
    result = check_cert_expiry(host)
    print(f"{result['status']:10} {host}: {result.get('days_left', 'N/A')} days remaining")
    if result["status"] in ["CRITICAL", "WARNING"]:
        # Send to PagerDuty or Slack
        print(f"  ALERT: Certificate expiring soon on {host}")
Bash · PRODUCTION
Write a Bash script that monitors a Kubernetes deployment and alerts if pods are not ready.
#!/bin/bash
set -euo pipefail
DEPLOYMENT="payment-api"
NAMESPACE="production"
# Fail fast with a clear message if the webhook is not configured
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL:?SLACK_WEBHOOK_URL must be set}"
THRESHOLD=2 # alert if fewer than 2 pods are ready
check_deployment() {
    local READY DESIRED PODS MESSAGE PAYLOAD
    # Get ready/desired replicas (readyReplicas is absent when no pod is ready)
    READY=$(kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" \
        -o jsonpath="{.status.readyReplicas}" 2>/dev/null || echo "0")
    DESIRED=$(kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" \
        -o jsonpath="{.spec.replicas}" 2>/dev/null || echo "0")
    READY="${READY:-0}"
    DESIRED="${DESIRED:-0}"
    echo "Ready: ${READY}/${DESIRED}"
    if [[ "$READY" -lt "$THRESHOLD" ]]; then
        # Pod name and status columns; assumes pods carry the label app=$DEPLOYMENT
        PODS=$(kubectl get pods -n "$NAMESPACE" -l "app=$DEPLOYMENT" \
            --no-headers 2>/dev/null | awk '{print $1, $3}' || true)
        MESSAGE=$(printf 'ALERT: %s in %s has only %s/%s ready pods\n\nPod status:\n%s' \
            "$DEPLOYMENT" "$NAMESPACE" "$READY" "$DESIRED" "$PODS")
        # Build the JSON with jq (must be installed) so quotes and newlines are escaped
        PAYLOAD=$(jq -n --arg text "$MESSAGE" '{text: $text}')
        curl -s -X POST "$SLACK_WEBHOOK" \
            -H "Content-Type: application/json" \
            -d "$PAYLOAD"
        echo "Alert sent to Slack"
        return 1
    fi
    return 0
}
# Run every 60 seconds
while true; do
    check_deployment || true
    sleep 60
done
DevOps · SENIOR
How do you handle a Terraform state file that got corrupted?
Corrupted Terraform state is a critical incident: it can lead to Terraform trying to recreate or delete real infrastructure. Immediate response: pause every pipeline that runs terraform apply and, if the backend allows it, hold the state lock so nobody else can write. Assess damage: terraform state list. If this fails, state is indeed corrupted. Look at the raw state file in Azure Blob/S3. Try to parse as JSON: if it is valid JSON but with wrong content, you may be able to manually fix it. If state file is empty or completely broken: restore a previous version, which only works if versioning was already enabled on your Azure Blob/S3 bucket (you did do this, right?). On Azure: list versions with az storage blob list --account-name tfstate --container-name state --prefix prod.tfstate --include v, then copy the last good version over the current blob: az storage blob copy start --account-name tfstate --destination-container state --destination-blob prod.tfstate --source-uri "https://tfstate.blob.core.windows.net/state/prod.tfstate?versionid=<version-id>". After restore: run terraform plan and compare what Terraform thinks exists vs actual Azure resources. For any resources Terraform lost track of: terraform import to re-add them to state without recreating. For resources Terraform thinks exist but do not: terraform state rm resource_type.name removes from state without destroying. Testing the recovery: run terraform plan, verify no unexpected creates/destroys, get tech lead review before any apply. Prevention: enable blob versioning and soft delete on state storage. Weekly automated backup of state files. Use Terraform Cloud or remote backend with built-in versioning.
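The "try to parse as JSON" step can be scripted so the first triage is the same every time. A minimal sketch, assuming the version 4 state layout with top-level version, serial, lineage and resources keys; the function name is hypothetical and it only reads the file, never writes.

```python
import json

# Top-level keys expected in a v4 state file (an assumption about the format).
REQUIRED_KEYS = ("version", "serial", "lineage", "resources")

def inspect_state(raw: str) -> dict:
    """Classify a raw state file as usable or not, with a reason."""
    if not raw.strip():
        return {"ok": False, "reason": "state file is empty"}
    try:
        state = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "reason": f"invalid JSON: {exc.msg} at line {exc.lineno}"}
    if not isinstance(state, dict):
        return {"ok": False, "reason": "top-level JSON value is not an object"}
    missing = [key for key in REQUIRED_KEYS if key not in state]
    if missing:
        return {"ok": False, "reason": "missing keys: " + ", ".join(missing)}
    return {"ok": True, "serial": state["serial"],
            "resource_count": len(state["resources"])}

if __name__ == "__main__":
    good = '{"version": 4, "serial": 12, "lineage": "abc", "resources": []}'
    print(inspect_state(good))
    print(inspect_state('{"version": 4, "serial": 12'))  # truncated upload
```

A valid file with a suspiciously low resource_count is the "valid JSON but wrong content" case, and still needs a terraform plan before anyone applies.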
Kubernetes · ARCHITECT
How do you implement Kubernetes cluster autoscaler in AKS?
Cluster Autoscaler (CA) automatically adjusts the number of nodes in a node pool. When pods cannot be scheduled (Pending due to insufficient CPU/memory): CA adds nodes. When nodes are underutilised for 10+ minutes: CA removes nodes and reschedules pods. Enable in AKS: az aks nodepool update --name userpool --cluster-name myAKS --resource-group myRG --enable-cluster-autoscaler --min-count 2 --max-count 20. By default CA checks every 10 seconds for unschedulable pods. Adds nodes in the most suitable pool (based on pod resource requests and node selectors). Scale-down: CA marks nodes for removal if requested resources stay below 50% of node capacity for 10 minutes. Drains pods, deletes node from Azure. Caveats: pods with local storage (emptyDir, hostPath) block scale-down. Pods with strict PodDisruptionBudgets block drain. DaemonSet pods do not block scale-down, but kube-system pods that are not DaemonSets and have no PodDisruptionBudget do. Tune with annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: "false" on pods that must not be evicted. For faster scale-up: over-provisioning pattern. Keep 2 low-priority pause pods (Deployment with priority class below all real workloads). When real pods need room, the scheduler preempts the pause pods immediately, freeing space. The evicted pause pods go Pending, so CA adds a node for them = nodes ready for real workload spikes.
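The scale-down rules read more clearly as a predicate. This is a simplified model of the decision described above, not the autoscaler's code: it ignores PodDisruptionBudgets and whether evicted pods fit elsewhere, and the Pod fields and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Pod:
    name: str
    uses_local_storage: bool = False  # emptyDir or hostPath volume
    safe_to_evict: bool = True        # False models safe-to-evict: "false"
    is_daemonset: bool = False

def can_remove_node(utilisation: float, minutes_unneeded: float, pods: List[Pod],
                    threshold: float = 0.5, unneeded_time: float = 10.0) -> bool:
    """True if the node is below the threshold long enough and nothing blocks draining."""
    if utilisation >= threshold or minutes_unneeded < unneeded_time:
        return False
    for pod in pods:
        if pod.is_daemonset:
            continue  # DaemonSet pods never block scale-down
        if pod.uses_local_storage or not pod.safe_to_evict:
            return False
    return True

if __name__ == "__main__":
    pods = [Pod("fluent-bit", is_daemonset=True), Pod("web")]
    print(can_remove_node(0.3, 12, pods))
    print(can_remove_node(0.3, 12, pods + [Pod("cache", uses_local_storage=True)]))
```

One pod with an emptyDir volume is enough to pin a mostly idle node, which is the usual reason a pool refuses to shrink.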
Azure/AKS · ENGINEER
How do you implement horizontal scaling with KEDA in AKS?
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA to scale on external events: Kafka lag, Azure Service Bus queue depth, Prometheus metrics, HTTP request rate. Out of the box HPA scales on CPU/memory; anything else needs a custom or external metrics adapter. KEDA fills the gap by acting as that adapter and creating the HPA for you. Install: helm repo add kedacore https://kedacore.github.io/charts, then helm install keda kedacore/keda -n keda --create-namespace. ScaledObject for Azure Service Bus: apiVersion: keda.sh/v1alpha1, kind: ScaledObject, spec.scaleTargetRef.name: order-processor, spec.minReplicaCount: 0 (scale to zero when queue empty!), spec.maxReplicaCount: 50, spec.triggers: type: azure-servicebus, metadata.queueName: orders, metadata.messageCount: "10" (1 pod per 10 messages). Authentication: TriggerAuthentication using Workload Identity, so no credentials are stored. Kafka scaler: type: kafka, metadata.bootstrapServers: kafka:9092, metadata.consumerGroup: my-group, metadata.topic: orders, metadata.lagThreshold: "100" (1 pod per 100 messages of lag). Scale to zero: KEDA can scale to 0 replicas when no events, which removes the compute cost of idle consumers. HPA cannot go below 1 by default. HTTP-based scaler: scale on requests per second to an endpoint. Use case: payment processing pods scale from 0 to 50 based on Azure Service Bus queue depth. Zero pod cost when no payments. Scale-up starts within one polling interval of orders arriving, plus pod start time for the first replica.
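"1 pod per 10 messages" is a ceiling division clamped to the replica bounds. A minimal sketch of that arithmetic using the numbers from the ScaledObject above; it ignores polling interval, cooldown and HPA tolerance, and the function name is hypothetical.

```python
import math

def desired_replicas(queue_length: int, messages_per_pod: int = 10,
                     min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Replicas needed so each pod handles at most messages_per_pod messages."""
    if queue_length <= 0:
        return min_replicas  # empty queue: fall back to the floor (0 = scale to zero)
    wanted = math.ceil(queue_length / messages_per_pod)
    return max(min_replicas, min(wanted, max_replicas))

if __name__ == "__main__":
    for depth in (0, 1, 95, 10_000):
        print(depth, "messages ->", desired_replicas(depth), "pods")
```

With messageCount "10", 95 queued messages ask for 10 pods, and anything beyond 500 is capped by maxReplicaCount.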