🔷 What is Terraform?
Terraform is an Infrastructure as Code (IaC) tool by HashiCorp. You write code in HCL (HashiCorp Configuration Language) to define cloud infrastructure, and Terraform creates and manages it automatically. It is declarative: you describe what you want, and Terraform figures out how to get there.
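As a minimal sketch of what "declarative" means in practice, the configuration below states a desired end result (one Azure resource group) and leaves the create/update/no-op decision to Terraform. The provider choice, version constraints, resource name, and region are illustrative assumptions, not values from this guide.

```hcl
# Minimal declarative configuration (all names are illustrative).
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

# Desired end state: one resource group. Terraform decides whether to
# create it, update it, or leave it alone based on what already exists.
resource "azurerm_resource_group" "example" {
  name     = "rg-example-dev"
  location = "westeurope"
}
```

Running `terraform plan` against this shows the actions Terraform would take; running it again after a successful apply should report no changes.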
IaC vs Manual Provisioning
| Aspect | Manual (Portal/CLI) | Terraform IaC |
|---|---|---|
| Consistency | Different every time | Identical every time |
| Speed | Hours/days | Minutes |
| Auditability | None | Git history = full audit trail |
| Repeatability | Impossible to guarantee | Guaranteed — same code = same infra |
| Rollback | Manual, risky | terraform apply of previous Git version |
| Review process | No review | PR review of infrastructure changes |
| Drift detection | None | terraform plan reports drift on the resources Terraform manages |
Terraform vs Alternatives
| Tool | Type | When to use |
|---|---|---|
| Terraform | Multi-cloud IaC | Any cloud, multi-cloud, largest provider ecosystem |
| Pulumi | Multi-cloud IaC (real code) | When team prefers Python/TypeScript over HCL |
| AWS CloudFormation | AWS-only IaC | AWS-only teams; no extra charge; state is handled by the service itself |
| Azure Bicep / ARM | Azure-only IaC | Azure-locked teams, native Azure integration |
| Ansible | Config management | Server configuration, not infra provisioning |
📝 HCL Fundamentals — Variables, Locals, Outputs
File Structure: always use this layout
| File | Purpose |
|---|---|
| main.tf | Resources: the actual infrastructure |
| variables.tf | Variable declarations with types and descriptions |
| outputs.tf | Values exported for other modules or humans |
| locals.tf | Computed values, avoid repetition |
| backend.tf | Remote state configuration |
| providers.tf | Provider versions and authentication |
| terraform.tfvars | Variable values. DO NOT commit to Git if it has secrets |
| versions.tf | Terraform version constraints |
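A short sketch of how variables, locals, and outputs fit together across those files. The variable names, the naming prefix, and the tag keys are assumptions chosen for illustration; the file each block would live in is noted in a comment.

```hcl
# variables.tf
variable "environment" {
  description = "Deployment environment name"
  type        = string
  default     = "dev"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be dev, staging, or prod."
  }
}

variable "location" {
  description = "Azure region for all resources"
  type        = string
  default     = "westeurope"
}

# locals.tf: computed once, reused everywhere
locals {
  name_prefix = "demo-${var.environment}"
  common_tags = {
    environment = var.environment
    managed_by  = "terraform"
  }
}

# main.tf
resource "azurerm_resource_group" "main" {
  name     = "rg-${local.name_prefix}"
  location = var.location
  tags     = local.common_tags
}

# outputs.tf
output "resource_group_name" {
  description = "Name of the resource group"
  value       = azurerm_resource_group.main.name
}
```

Variables are inputs set by the caller, locals are internal computed values, and outputs are what the configuration exposes to other modules or to the person running it.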
🗄️ State Management — Remote State, Locking, Import
Terraform state (.tfstate) is Terraform's memory: it records every resource Terraform manages, including cloud-assigned properties (resource IDs, IP addresses, connection strings). On every plan and apply, Terraform refreshes state from the real infrastructure and compares it against your .tf files to calculate what needs to change. Because state can hold secrets in plain text, treat the state file as sensitive.
Why remote state is mandatory for teams
- Local state — stored on your laptop. If you leave the team, state is gone. Two engineers cannot work simultaneously.
- Remote state — stored in cloud (Azure Storage, S3+DynamoDB). Anyone on the team can run Terraform. State is locked during apply — prevents simultaneous changes corrupting state.
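A minimal sketch of a remote backend, assuming Azure Storage is used. The resource group, storage account, container, and key are hypothetical and must already exist before `terraform init`; with this backend, locking relies on blob leases and needs no extra configuration.

```hcl
# backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"          # hypothetical
    storage_account_name = "sttfstateexample"    # hypothetical, must be globally unique
    container_name       = "tfstate"             # hypothetical
    key                  = "dev/network.tfstate" # path of this state file in the container
  }
}
```

Backend blocks cannot reference variables, so teams commonly keep one backend file per environment or pass values with `terraform init -backend-config=...`.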
State Commands — know all of these
terraform import — bring manual resources under Terraform
When someone creates a resource manually (portal, CLI) without Terraform, you need to bring it under Terraform management without destroying and recreating it.
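On Terraform 1.5 or later, the same result can be expressed declaratively with an `import` block instead of the CLI command. This is a sketch only: the subscription ID is a zeroed placeholder and the resource group name is hypothetical.

```hcl
# Requires Terraform 1.5 or later.
import {
  to = azurerm_resource_group.legacy
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-legacy"
}

# The configuration must describe the resource as it currently exists;
# otherwise the next plan proposes changes to it.
resource "azurerm_resource_group" "legacy" {
  name     = "rg-legacy"
  location = "westeurope"
}
```

`terraform plan` then shows the import as part of the plan, so it can go through the same PR review as any other change.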
📦 Reusable Modules — Design, Versioning, Registry
A module is a reusable package of Terraform files, much like a function in programming. Modules enable DRY (Don't Repeat Yourself) infrastructure. Write once, use in dev/staging/prod with different variable values.
Module folder structure
Module versioning — critical for teams
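A sketch of the two usual ways to pin a module version. The registry address, Git URL, tag, and the `cluster_name` input are all hypothetical; substitute your own module source.

```hcl
# Option 1: registry source with a version constraint.
module "aks_from_registry" {
  source  = "app.terraform.io/example-org/aks/azurerm" # hypothetical private registry address
  version = "~> 2.1"                                   # allows 2.1 and later 2.x, never 3.0

  cluster_name = "aks-dev" # hypothetical module input
}

# Option 2: Git source pinned to a tag. The version argument is not
# supported for Git sources, so the pin lives in the ref query string.
module "aks_from_git" {
  source = "git::https://github.com/example-org/terraform-modules.git//modules/aks?ref=v2.1.0"

  cluster_name = "aks-dev" # hypothetical module input
}
```

Pinning means a module author can publish a breaking change without silently altering every environment that consumes the module; each consumer upgrades deliberately.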
Complete reusable AKS module example
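The original example is not reproduced here. What follows is a deliberately small sketch of what such a module could look like, assuming the azurerm provider and a hypothetical `modules/aks` folder; the variable names and defaults are illustrative, and a production module would add networking, RBAC, and monitoring settings.

```hcl
# modules/aks/variables.tf
variable "cluster_name" {
  description = "Name of the AKS cluster"
  type        = string
}

variable "resource_group_name" {
  description = "Existing resource group to deploy into"
  type        = string
}

variable "location" {
  description = "Azure region"
  type        = string
}

variable "node_count" {
  description = "Number of nodes in the default pool"
  type        = number
  default     = 2
}

variable "vm_size" {
  description = "VM size for the default node pool"
  type        = string
  default     = "Standard_D2s_v3"
}

# modules/aks/main.tf
resource "azurerm_kubernetes_cluster" "this" {
  name                = var.cluster_name
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = var.cluster_name

  default_node_pool {
    name       = "default"
    node_count = var.node_count
    vm_size    = var.vm_size
  }

  identity {
    type = "SystemAssigned"
  }
}

# modules/aks/outputs.tf
output "cluster_id" {
  description = "Resource ID of the cluster"
  value       = azurerm_kubernetes_cluster.this.id
}

output "kube_config" {
  description = "Raw kubeconfig for the cluster"
  value       = azurerm_kubernetes_cluster.this.kube_config_raw
  sensitive   = true
}
```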
🌍 Multi-Environment — tfvars, Workspaces, Folder Pattern
The most common Terraform architecture question at senior interviews: "How do you manage dev, staging, and prod with the same Terraform code?"
Two approaches — know both, prefer folder pattern
| Aspect | Folder per environment | Workspace per environment |
|---|---|---|
| Isolation | Complete — separate state, separate runs | Same code, separate state |
| Risk | Low — impossible to accidentally affect prod | Higher — one wrong workspace select = prod |
| Clarity | Easy to see difference between envs | Harder to track which workspace |
| Recommended for | Production teams, regulated environments | Simple projects, single cloud account |
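A sketch of the folder pattern, assuming a hypothetical `environments/prod/` directory that calls a shared module. Every name, path, and value below is illustrative; dev and staging would have sibling folders with the same shape and different values.

```hcl
# environments/prod/main.tf (hypothetical layout: one folder per environment,
# each with its own backend key and its own input values)
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"       # hypothetical
    storage_account_name = "sttfstateexample" # hypothetical
    container_name       = "tfstate"
    key                  = "prod/aks.tfstate" # dev and staging use their own keys
  }
}

module "aks" {
  source = "../../modules/aks" # hypothetical shared module

  cluster_name        = "aks-prod"
  resource_group_name = "rg-platform-prod"
  location            = "westeurope"
  node_count          = 5
}
```

Because the working directory determines the environment, running `terraform apply` in `environments/dev` cannot touch the prod state file.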
☁️ Multi-Cloud Terraform — AWS + Azure + GCP
Terraform manages all three major clouds with the same workflow. This is its biggest advantage over single-cloud tools such as CloudFormation (AWS-only) or Bicep (Azure-only).
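A sketch of one configuration that declares all three providers. Regions, the GCP project ID, and bucket and group names are hypothetical, and the version constraints are illustrative.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-west-1"
}

provider "azurerm" {
  features {}
}

provider "google" {
  project = "example-project-id" # hypothetical
  region  = "europe-west1"
}

resource "aws_s3_bucket" "artifacts" {
  bucket = "example-artifacts-bucket" # hypothetical, must be globally unique
}

resource "azurerm_resource_group" "main" {
  name     = "rg-multicloud-example"
  location = "westeurope"
}

resource "google_storage_bucket" "artifacts" {
  name     = "example-artifacts-bucket-gcp" # hypothetical, must be globally unique
  location = "EU"
}
```

Note what is shared and what is not: the init/plan/apply workflow and the language are the same, but resource types stay provider-specific, so the code is not portable between clouds.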
📡 Drift Detection & Remediation
Drift occurs when someone modifies infrastructure directly (Azure portal, AWS console, CLI) without going through Terraform. The real state now differs from what Terraform's state file records. This is one of the most important topics at architect-level interviews.
How drift happens
- Developer manually adds a firewall rule in the console "just this once"
- Ops team resizes a VM manually during an incident, forgets to update Terraform
- Cloud provider automatically changes a setting (certificate rotation, auto-scaling)
- Legacy resources created before Terraform adoption
Drift remediation decision tree
| Scenario | Action | Command |
|---|---|---|
| Manual change was wrong — revert it | Apply Terraform to restore desired state | terraform apply |
| Manual change was intentional, keep it | Update .tf files to match, then confirm the plan is clean | terraform plan (expect no changes) + terraform apply -refresh-only |
| Resource not in Terraform yet | Import it into state, write config | terraform import |
| Stop future drift | Enforce policy — deny console changes | Azure Policy / AWS SCP |
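For changes made automatically by the platform rather than by a person, one option (an addition to the table above, not a replacement for it) is to tell Terraform to stop treating a specific attribute as drift. The sketch below assumes a hypothetical setup where an external process also writes tags.

```hcl
resource "azurerm_resource_group" "app" {
  name     = "rg-app-prod" # hypothetical
  location = "westeurope"

  tags = {
    managed_by = "terraform"
  }

  lifecycle {
    # Tags are also written by an external process in this hypothetical
    # setup, so differences in tags are no longer reported in the plan.
    ignore_changes = [tags]
  }
}
```

This suppresses drift reporting for that attribute rather than fixing it, so it is best kept to attributes that something else legitimately owns.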
🔐 Access Control, Permissions & Secrets
Access control for Terraform infrastructure is a two-layer problem: who can run Terraform (human operators) and what Terraform can do (service principal permissions). Getting this wrong means either overly permissive access (a security risk) or operators who are constantly blocked.
Principle: Terraform runs with least privilege
| Identity | What it is | Permissions |
|---|---|---|
| CI/CD Service Principal | Automated pipeline identity | Contributor on specific resource groups only |
| Developer | Human operator | Read on prod state, full on dev/staging |
| SRE/Platform team | Infrastructure owners | Full on all environments via PR approval |
| terraform plan (PR) | Automated — runs on every PR | Read-only on all resources |
| terraform apply (prod) | Runs after manual approval | Write on approved resources only |
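A sketch of the first table row in code, assuming Azure: the pipeline identity gets Contributor on a single resource group rather than on the subscription. The resource group name and both variables are hypothetical. The last block shows the related secrets convention of marking inputs as sensitive.

```hcl
data "azurerm_resource_group" "app" {
  name = "rg-app-dev" # hypothetical existing resource group
}

variable "pipeline_principal_id" {
  description = "Object ID of the CI/CD service principal"
  type        = string
}

# Least privilege: the role is scoped to one resource group only.
resource "azurerm_role_assignment" "pipeline_contributor" {
  scope                = data.azurerm_resource_group.app.id
  role_definition_name = "Contributor"
  principal_id         = var.pipeline_principal_id
}

# Secrets: sensitive = true redacts the value in plan and apply output.
variable "db_admin_password" {
  description = "Database admin password, supplied at runtime"
  type        = string
  sensitive   = true
}
```

A sensitive value is hidden in CLI output but is still written to the state file, which is another reason to restrict who can read remote state.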
✅ Compliance & Security Scanning
At enterprise scale (Razorpay, JP Morgan, Walmart), every Terraform PR must pass compliance checks before merge. Checkov and tfsec catch misconfigurations before they reach the cloud; a CRITICAL finding is far cheaper to fix in a PR than in prod.
Compliance tools
| Tool | What it checks | When to run |
|---|---|---|
| Checkov | Security misconfigs: public S3, unencrypted disks, open SGs, missing encryption | Every PR — blocks merge on CRITICAL |
| tfsec | Security and best practice violations (the project has since been folded into Trivy) | Every PR |
| terraform validate | Syntax and schema validation | Every PR — fast, runs first |
| terraform fmt | Code formatting | Every PR: run terraform fmt -check and fail on diff |
| Infracost | Cost estimation — shows cost change in PR comment | Every PR — informational |
| Terratest | Integration tests — provision real infra, assert, destroy | Nightly or on module changes |
| OPA/Sentinel | Policy as code — custom compliance rules | In Terraform Cloud/Enterprise |
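To make the "public S3" and "missing encryption" findings concrete, here is a sketch of an S3 bucket configured the way scanners generally expect. The bucket name is hypothetical, and the exact checks and severities depend on the tool and its version.

```hcl
resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket" # hypothetical, must be globally unique
}

# Blocks every form of public access on the bucket.
resource "aws_s3_bucket_public_access_block" "logs" {
  bucket                  = aws_s3_bucket.logs.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Encrypts new objects at rest by default.
resource "aws_s3_bucket_server_side_encryption_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```

Omitting the second or third resource is the kind of gap a scanner is meant to flag in the PR, before anything is deployed.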
⚡ Terraform in CI/CD — Jenkins + GitHub Actions
Terraform should NEVER be run manually from a laptop in production. All production changes go through CI/CD: automated plan on PR → human review of plan → approval gate → automated apply. This gives you an audit trail and peer review, and prevents "oops, I ran apply against prod by mistake".
🔍 Troubleshooting — Real Production Problems
Most common Terraform failures and fixes
| Error | Cause | Fix |
|---|---|---|
| Error acquiring the state lock | Previous apply crashed, lock not released | Confirm no run is still active, then terraform force-unlock <lock-id> |
| Resource already exists | Resource created manually, not in state | terraform import <resource_address> <cloud_resource_id> |
| State is corrupted/inconsistent | Concurrent apply, crash during apply | Restore state from backup, use state commands to fix |
| Plan unexpectedly destroys a resource | Resource renamed in code, for_each key changed | terraform state mv <old_address> <new_address> before apply |
| Provider authentication failed | Expired credentials, wrong env vars | Refresh service principal token, check ARM_ env vars |
| Cycle detected | Circular dependency between resources | Break the cycle: remove the mutual reference or split it out into a separate resource (depends_on cannot fix a cycle) |
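For the rename case in the table, Terraform 1.1 and later also offers a declarative alternative to `terraform state mv`: a `moved` block committed alongside the rename. A sketch, with hypothetical resource names:

```hcl
# The block was previously declared as azurerm_resource_group.rg.
# The moved block records the rename, so the plan reports a move
# instead of a destroy followed by a create.
moved {
  from = azurerm_resource_group.rg
  to   = azurerm_resource_group.platform
}

resource "azurerm_resource_group" "platform" {
  name     = "rg-platform" # hypothetical
  location = "westeurope"
}
```

Unlike a one-off CLI command, the `moved` block is reviewed in the PR and applies to every state that uses the code, which matters when several environments share a module.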