This migration covered 3 environments, more than 50 resource types, and well above 300 Terraform resources.
The team maintaining this environment was basically one contractor. So the operator goal was simple: get access to new Cloudflare provider features and do not break production.
Cloudflare is too close to the edge to migrate casually. DNS, WAF, rulesets, Zero Trust, tunnels, redirects. If the process is messy, the feedback loop gets slow very quickly.
I treated this as a controlled edge migration: phased rollout, no auto-apply, test before prod, and scripted state repair for resources the provider could not upgrade cleanly.
So before touching provider v5, I set up the working model first:
- block dangerous commands
- start with read-only Cloudflare access
- work on a local copy of state first
- dry-run the migration tool
- only then think about import, state cleanup, and apply
This post is mostly about how to set up the migration so engineers can get through it with less guesswork. Not the full resource-by-resource migration.
Start read-only, block `terraform apply`, pull state locally, and expect manual cleanup.
Why This Mattered to the Business Link to heading
This was not only a Terraform upgrade.
This was a Day 2 infrastructure problem.
At the beginning, most teams are happy just because IaC exists. Fine, we are cool. But then product pressure goes up, urgent changes happen, people do click ops, someone says they will clean it up later, and Terraform coverage starts sliding. First it is 100%, then 90%, then 80%, and after that every change becomes slower and less trustworthy.
That is the real risk.
If the edge configuration is not represented correctly in code, then:
- delivery gets slower
- production changes get harder to review
- new engineers need more tribal knowledge
- the company becomes dependent on the memory of one operator
So the migration mattered because it pulled the edge layer back into a shape where the business can keep moving without relying on click ops.
What This Unlocks Beyond Terraform Link to heading
Another reason this mattered: Cloudflare is pushing hard beyond classic CDN and DNS use cases. The developer platform is now a real product surface, not just a side feature.
If the provider layer is outdated or half-managed manually, it becomes harder to adopt what Cloudflare is actually investing in.
The migration helps keep the company ready for things like:
- Workers AI for running inference on Cloudflare’s network
- AI Gateway for observability, caching, retries, rate limiting, and fallback for AI applications
- Vectorize for vector search and retrieval workloads
- Durable Objects for stateful coordination and real-time systems
- Agents SDK for stateful agents with scheduling, tools, and human-in-the-loop flows
- Hyperdrive for connecting Workers to existing regional databases with better global performance
That matters because it keeps the path open for building application features on the same platform that already sits in front of production traffic.
For a small company, that is leverage.
The value is not “we upgraded Terraform”. The value is “we are in a position to adopt new platform capabilities without first untangling old infra debt.”
Start with a Read-Only Cloudflare Token Link to heading
The next part is authentication.
I provision a read-only Cloudflare token first. Not admin. Not “temporary full access”. Just enough access to inspect what already exists.
Just read-only.
Why?
- I want discovery first
- provider refresh and plan usually need API reads
- I want to inspect what exists before allowing any write path
- if the token leaks somewhere, the blast radius is much smaller
For this stage I only need visibility into the objects already managed by Terraform: zones, DNS, rulesets, WAF objects, Zero Trust resources, and similar things depending on the stack.
If later I need write access, I switch credentials only after the diff is reviewed by a human. Discovery and API lookups use the read-only token. Mutation happens in a separate reviewed phase.
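Before the first plan, it is also worth confirming the token is alive at all. A minimal sketch using Cloudflare's `/user/tokens/verify` endpoint, which only needs the token itself and no zone permissions; the function name is my own:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Ask Cloudflare whether the token in CLOUDFLARE_API_TOKEN is active.
# A failed check here is much cheaper than a half-run terraform refresh.
verify_cf_token() {
  local response
  response="$(curl -s \
    -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
    "https://api.cloudflare.com/client/v4/user/tokens/verify")"
  if [[ "$response" == *'"status":"active"'* ]]; then
    echo "token ok"
  else
    echo "token not active or invalid" >&2
    return 1
  fi
}

# verify_cf_token   # run once before any terraform command
```

The string match on `"status":"active"` is deliberately crude; if you already have `jq` in the loop, checking `.result.status` is the cleaner version.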
Keep the Authentication Boring Link to heading
I prefer the auth model to be boring and explicit. Usually it is just:
export CLOUDFLARE_API_TOKEN="..."
The token should come from a proper secret source:
- local secret manager
- CI secret store
- short-lived shell session
And it should not be committed to the repo or copied into prompt text.
This is not the place to be creative.
The Migration Plan I Followed Link to heading
My notes ended up being very close to this sequence:
- upgrade from `4.52.0` to `4.52.5` first
- run `tf-migrate` in dry-run mode
- apply the HCL rewrite and review every changed `.tf` file
- fix renamed or removed resources manually where needed
- switch to provider `~> 5`
- repair state issues
- apply in `test` first
- only after that touch `prod`
I also prefer to split the work into several PRs:
- PR1: transitional provider upgrade to `4.52.5`
- PR2: `tf-migrate` HCL rewrite plus manual fixes
- PR3: provider `~> 5`, state cleanup, import flow, final validation
This keeps the diff readable, makes CI output easier to understand, and gives engineers a cleaner checkpoint after every phase.
It also reduces delivery risk. When one phase goes wrong, I know exactly where to stop, revert, or re-plan instead of carrying one giant migration diff through the whole stack.
Pull State and Work Locally First Link to heading
One practical lesson from this migration: I do not want to start by experimenting against the normal backend.
First I pull the state locally and prepare a local work mode:
./tf.sh state pull > migration.tfstate
cp <environment>.tfvars <environment>.auto.tfvars
I copy the environment tfvars to *.auto.tfvars simply to make local Terraform runs load the same environment-specific values without adding extra flags to every command.
Then I temporarily switch the backend:
terraform {
backend "local" {
path = "migration.tfstate"
}
}
This part matters a lot.
The moment I know state cleanup, imports, and provider schema upgrades may be involved, I want a local copy first. It gives me a safer place to inspect, test, and understand the damage before touching the normal backend flow.
It also gives a faster feedback loop. That matters because this migration is not one command. It is many small iterations.
One obvious warning here: local Terraform state may contain secrets. Treat that local file accordingly.
Dry Run the Migration Tool First Link to heading
Before changing the provider constraint, I run the migration tool in dry-run mode:
tf-migrate migrate --source-version v4 --target-version v5 --dry-run --config-dir .
And the warnings are the interesting part.
In this migration, the dry run showed exactly where manual work was still required. The main categories were:
- application-scoped Access policies
- removed resources in `v5`
- resources that would need state cleanup and re-import
That is already a very good result. The tool does not need to finish the migration. It just needs to show where engineers should spend manual review time.
What tf-migrate Did Not Finish Link to heading
A few warnings from the dry run were especially important.
Application-scoped Access policies could not be migrated automatically. In v5, those policies need to live inline inside cloudflare_zero_trust_access_application.
cloudflare_split_tunnel was removed and had to move to device profile configuration.
cloudflare_zone_settings_override was removed too. The migration generated per-setting resources, but the old state still had to be removed and the new resources had to be imported correctly.
There were also a few field-level changes. For example, min_days_for_renewal disappeared from origin CA certificate resources.
This is why I treat the migration tool output as the first pass, not as the final migration.
Expect Manual State Cleanup Link to heading
This migration is not only about renaming resources.
Some failures happen because the old state payload cannot be decoded correctly by the v5 provider. So Terraform fails before you even get a useful diff.
I saw this pattern on resources such as:
- Zero Trust gateway policies
- load balancer monitors
- zones and zone settings related objects
The errors looked like provider decode problems, for example:
rule_settings: expected object, got array
header: expected object, got array
plan: expected object, got string
When that happens, the path is usually:
- back up the state
- remove only the failing addresses from state
- import them again with the `v5` format
- re-run plan
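That sequence can be turned into something reviewable instead of ad-hoc typing. A sketch that only prints the commands for a human to check before running; the CSV input shape is my own convention, and the address and import ID in the example are dummy values:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read "address,import_id" pairs on stdin and emit the backup / rm /
# import / plan sequence for review, instead of executing anything.
emit_repair_plan() {
  echo 'terraform state pull > "backup-$(date +%s).tfstate"'
  while IFS=, read -r address import_id; do
    echo "terraform state rm '${address}'"
    echo "terraform import '${address}' '${import_id}'"
  done
  echo 'terraform plan'
}

# Example (dummy IDs):
# emit_repair_plan <<'EOF'
# cloudflare_zero_trust_gateway_policy.example,7f3c5a0b1d4e6f8899aabbccddeeff00/2c9d4a8f7b6e5d4c3b2a1908fedcba76
# EOF
```

Printing instead of executing keeps the mutation step behind the same human-review boundary as the rest of the migration.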
This is one more reason why I like the local backend step first. It gives engineers room to repair state deliberately instead of rushing through it.
Some Cloudflare Resources Need Manual Review Anyway Link to heading
The migration tool helps a lot, but some resources still need human attention.
The ones I would watch first are:
- Access policies attached to applications
- split tunnel configuration
- zone settings overrides
- rulesets
- load balancer resources
For example, application Access policies are not just a rename problem. In v5, some of them need to move into inline policies on the application resource. That is not something I want to trust to an automatic rewrite without review.
Zone settings are another good example. Old override-style resources may turn into many per-setting resources. That often means imports and explicit state cleanup, not just HCL edits.
Use Small Shell Scripts as Migration Helpers Link to heading
One thing that helped a lot was using small disposable shell scripts instead of trying to remember every `state rm` and every import format.
I would recommend this to anyone doing the same migration.
Not because the scripts are fancy. The opposite. They are boring, explicit, and easy to review.
I ended up with three useful categories of scripts:
- scripts that remove stale state entries
- scripts that import renamed resources back into state
- scripts that query Cloudflare API and match objects automatically
All IDs in the examples below are dummy values. They are here only to show the expected shape.
Example 1: Bulk State Cleanup Script Link to heading
For resources that obviously had to be removed from legacy state, I prefer a helper like this:
#!/usr/bin/env bash
set -euo pipefail
# Keep a restorable copy of state before removing anything.
terraform state pull > "backup-$(date +%s).tfstate"
terraform state rm \
  'cloudflare_ruleset.example' \
  'cloudflare_access_policy.example' \
  'cloudflare_zone_settings_override.this' \
  'cloudflare_worker_domain.this' \
  'cloudflare_tunnel_virtual_network.default'
This is much safer than typing a long list manually while you are tired and already many plans deep into the migration.
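One extra guard I like before a bulk removal: cross-check the candidate addresses against `terraform state list` output, so a typo drops out visibly instead of failing mid-script. A hypothetical helper; the function name and input shape are mine:

```shell
#!/usr/bin/env bash
set -euo pipefail

# stdin: candidate addresses, one per line.
# $1: a file containing `terraform state list` output.
# Prints only the addresses that actually exist in state.
filter_existing_addresses() {
  local state_list="$1"
  local addr
  while IFS= read -r addr; do
    grep -qxF "$addr" "$state_list" && echo "$addr" || true
  done
}

# Usage sketch:
# terraform state list > state-list.txt
# filter_existing_addresses state-list.txt < candidates.txt
```

Diffing the filtered output against the candidate list shows exactly which addresses were mistyped or already gone.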
Example 2: Environment-Aware Import Script Link to heading
For resources that exist in all environments but have different IDs, I like a small wrapper that detects the AWS account and chooses the correct import ID.
#!/usr/bin/env bash
set -euo pipefail
ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
case "$ACCOUNT_ID" in
"123456789012")
ENV_NAME="test"
CLOUDFLARE_ACCOUNT_ID="7f3c5a0b1d4e6f8899aabbccddeeff00"
RESOURCE_ID="2c9d4a8f7b6e5d4c3b2a1908fedcba76"
;;
"210987654321")
ENV_NAME="prod"
CLOUDFLARE_ACCOUNT_ID="7f3c5a0b1d4e6f8899aabbccddeeff00"
RESOURCE_ID="8a7b6c5d4e3f2109fedcba9876543210"
;;
*)
echo "Unsupported account: $ACCOUNT_ID"
exit 1
;;
esac
echo "Detected environment: $ENV_NAME"
terraform import cloudflare_load_balancer_monitor.default "$CLOUDFLARE_ACCOUNT_ID/$RESOURCE_ID"
That pattern was useful for load balancer monitors, pools, load balancers, worker domains, and a few other resources.
Example 3: Parse Plan and Re-Import Existing Rulesets Link to heading
Rulesets were more interesting.
Sometimes Terraform wanted to create a ruleset that already existed in Cloudflare. In that case, I do not want to import by hand one by one if the plan already contains enough metadata to identify the object.
So another useful helper script pattern is:
- read `plan.txt`
- find `cloudflare_ruleset.* will be created`
- extract zone ID, name, phase, and description
- call Cloudflare API
- resolve the matching ruleset ID
- run `terraform import`
Very rough shape:
#!/usr/bin/env bash
set -euo pipefail
PLAN_FILE="${1:-plan.txt}"
ZONE_ID="f1e2d3c4b5a697887766554433221100"
RULESET_ID="9b8a7c6d5e4f32100123456789abcdef"
# parse plan output here
# call Cloudflare API here
# match by zone_id + name + phase
# terraform import "cloudflare_ruleset.example" "zones/$ZONE_ID/$RULESET_ID"
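The "parse plan output" step in the rough shape above is the easiest half to make concrete. A sketch of just that part; the regex assumes simple resource addresses without `for_each` keys or module prefixes:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Pull every cloudflare_ruleset address that `terraform plan` text
# output says will be created. $1 is the saved plan output file.
rulesets_to_create() {
  grep -oE 'cloudflare_ruleset\.[A-Za-z0-9_.-]+ will be created' "$1" \
    | sed 's/ will be created//'
}

# Each address can then be matched against the Cloudflare rulesets API
# by zone, name, and phase before running terraform import.
```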
This is one of those places where automation actually saves time instead of adding risk.
More importantly, it reduces team dependency.
Without these scripts, the migration would live mostly in one engineer’s memory. With them, another engineer can follow the same sequence, understand the shape of the repair work, and repeat it without reverse-engineering the whole stack from scratch.
Put Tools Behind Guardrails Link to heading
I still use code assistants for this kind of work. They are useful for scanning many .tf files, detecting renamed resources, summarizing warnings, and preparing repetitive edits.
But I keep the boundary simple:
- read the repository
- read provider docs
- prepare code changes
- summarize migration warnings
- never apply infrastructure changes on its own
If you use Cursor, `beforeShellExecution` is one of the easiest controls to add. I use it as a deny layer before the command is executed.
"beforeShellExecution": [
{
"command": "~/.cursor/hooks/block-apply.sh"
}
]
Very small hook:
#!/usr/bin/env bash
set -euo pipefail
input="$(cat 2>/dev/null || echo '{}')"
cmd="$(echo "$input" | jq -r '.command // .cmd // .shell_command // empty')"
case "$cmd" in
*"terraform apply"*|*"terraform destroy"*|*"terraform import"*|*"terraform state rm"*|*"terraform state mv"*|*"auto-approve"*)
echo "Blocked by policy during migration window: $cmd" >&2
exit 2
;;
esac
That was enough for the first stage. The tool could still scan modules, compare v4 and v5 resources, prepare refactors, and produce review notes, but it could not jump straight to mutation.
Important detail: this deny policy is for the assistant-driven exploration and diff-preparation phase.
Later, once the review is done, I run the approved terraform import, terraform state rm, and terraform apply steps myself in a separate supervised shell session. The hook is there to stop premature mutation, not to ban the whole migration workflow forever.
What Helped the Most Link to heading
If I reduce the whole experience to a few practical points, these are the things that helped most:
- use `test` first and keep `prod` behind review
- run `tf-migrate` in dry-run mode before changing provider version
- expect some `state rm` plus `terraform import` work
- keep helper scripts for repetitive import flows
- keep CI plans running during every phase
- remove `-auto-approve` for the migration window
None of this is complicated, but together it makes the migration much easier to pass.
It also lowers maintenance cost later. Repeated state repair or import logic stops being a custom one-time ritual and starts becoming documented operational tooling.
The First Phase Workflow Link to heading
Before any write-capable step, I want the work loop to be very small:
- read the Terraform code
- run the dry-run migration
- prepare code changes
- run safe checks like `terraform fmt` and `terraform validate`
- prepare import and state-cleanup commands for review
- review the diff
The first goal is not to “finish the migration”. The first goal is to remove uncertainty.
Test Before Prod, and Keep CI Running Link to heading
Another useful note from the migration: test first, always.
My rollout rule was:
- migrate and apply in `test`
- make sure post-apply plan is clean
- keep CI planning both `test` and `prod`
- only then allow the `prod` path
During migration, temporary prod plan instability can happen because of intermediate rename and state steps. That is acceptable for a short period. Blind prod apply is not.
Also, I keep -auto-approve out of the flow completely for this migration window.
Result Link to heading
The upgrade path was not fully automatic, but the combination of dry-run migration, local state work, scripted imports, and staged rollout made it predictable enough to execute safely.
That was the real objective. Not to make the migration look elegant, but to make it pass without breaking production and without turning one engineer’s memory into the only runbook.
For me this is the fun part of infrastructure work. I am comfortable owning technical risk when the process is clear and the rollback path is real.
From different angles, the result was:
- better path to adopt newer Cloudflare platform capabilities
- less dependence on click ops and one-person memory
- safer rollout shape for a production edge migration
- clearer signal that this environment can be maintained by another engineer later
Define the Reset Path Before You Need It Link to heading
One more practical point: define the way back before the first apply.
After local repair work, I want an explicit reset path:
- restore the normal backend block
- reinitialize Terraform
- migrate backend metadata if needed
- remove temporary files like copied tfvars, scratch plans, and local state artifacts
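The cleanup step can itself be a tiny script so nobody has to remember which scratch files exist. A sketch matching the file names used earlier in this post; the backend restore step stays manual:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Remove the local-migration leftovers: local state copies, the copied
# auto tfvars, and scratch plan output. Adjust names for your layout.
cleanup_migration_artifacts() {
  local dir="${1:-.}"
  rm -f "$dir"/migration.tfstate "$dir"/migration.tfstate.backup
  rm -f "$dir"/*.auto.tfvars
  rm -f "$dir"/plan.txt
  echo "cleaned $dir"
}
```

After this, restore the normal backend block and run `terraform init -reconfigure` (or `terraform init -migrate-state` if backend metadata needs to move) before the next plan.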
If you skip this part, the migration gets messy very quickly. Temporary files pile up, people forget which plan is the latest one, and it becomes much easier to make a bad decision.
Final Thoughts Link to heading
For me, the hard part of this migration is not HCL rewrite. The hard part is keeping the process predictable enough that other engineers can follow it too.
So the rule is simple:
- read-only first
- local state first
- dry-run first
- small helper scripts instead of manual repetition
- human review before mutation
That is how I prefer to start a Cloudflare provider v4 to v5 migration and help the next engineer get through it faster.