This migration covered 3 environments, more than 50 resource types, and well above 300 Terraform resources.
The team maintaining this environment was basically one contractor. So the operator goal was simple: get access to new Cloudflare provider features and do not break production.
Cloudflare is too close to the edge to migrate casually. DNS, WAF, rulesets, Zero Trust, tunnels, redirects. If the process is messy, the feedback loop gets slow very quickly.
I treated this as a controlled edge migration: phased rollout, no auto-apply, test before prod, and scripted state repair for resources the provider could not upgrade cleanly.
So before touching provider v5, I set up the working model first:
- block dangerous commands
- start with read-only Cloudflare access
- work on a local copy of state first
- dry-run the migration tool
- only then think about import, state cleanup, and apply
This post is mostly about how to set up the migration so engineers can get through it with less guesswork. Not the full resource-by-resource migration.
Start read-only, block `terraform apply`, pull state locally, and expect manual cleanup.
Why This Mattered to the Business Link to heading
This was not only a Terraform upgrade.
This was a Day 2 infrastructure problem.
At the beginning, most teams are happy just because IaC exists. Fine, we are cool. But then product pressure goes up, urgent changes happen, people do click ops, someone says they will clean it up later, and Terraform coverage starts sliding. First it is 100%, then 90%, then 80%, and after that every change becomes slower and less trustworthy.
That is the real risk.
If the edge configuration is not represented correctly in code, then:
- delivery gets slower
- production changes get harder to review
- new engineers need more tribal knowledge
- the company becomes dependent on the memory of one operator
So the migration mattered because it pulled the edge layer back into a shape where the business can keep moving without relying on click ops.
What This Unlocks Beyond Terraform Link to heading
Another reason this mattered: Cloudflare is pushing hard beyond classic CDN and DNS use cases. The developer platform is now a real product surface, not just a side feature.
If the provider layer is outdated or half-managed manually, it becomes harder to adopt what Cloudflare is actually investing in.
The migration helps keep the company ready for things like:
- Workers AI for running inference on Cloudflare’s network
- AI Gateway for observability, caching, retries, rate limiting, and fallback for AI applications
- Vectorize for vector search and retrieval workloads
- Durable Objects for stateful coordination and real-time systems
- Agents SDK for stateful agents with scheduling, tools, and human-in-the-loop flows
- Hyperdrive for connecting Workers to existing regional databases with better global performance
That matters because it keeps the path open for building application features on the same platform that already sits in front of production traffic.
For a small company, that is leverage.
The value is not “we upgraded Terraform”. The value is “we are in a position to adopt new platform capabilities without first untangling old infra debt.”
Start with a Read-Only Cloudflare Token Link to heading
The next part is authentication.
I provision a read-only Cloudflare token first. Not admin. Not “temporary full access”. Just enough access to inspect what already exists.
Just read-only.
Why?
- I want discovery first
- provider refresh and plan usually need API reads
- I want to inspect what exists before allowing any write path
- if the token leaks somewhere, the blast radius is much smaller
For this stage I only need visibility into the objects already managed by Terraform: zones, DNS, rulesets, WAF objects, Zero Trust resources, and similar things depending on the stack.
If later I need write access, I switch credentials only after the diff is reviewed by a human. Discovery and API lookups use the read-only token. Mutation happens in a separate reviewed phase.
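Before the first plan, it is also worth confirming the token is alive at all. A minimal sketch using Cloudflare's `/user/tokens/verify` endpoint, which only needs the token itself and no zone permissions; the function name is my own:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Ask Cloudflare whether the token in CLOUDFLARE_API_TOKEN is active.
# A failed check here is much cheaper than a half-run terraform refresh.
verify_cf_token() {
  local response
  response="$(curl -s \
    -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
    "https://api.cloudflare.com/client/v4/user/tokens/verify")"
  if [[ "$response" == *'"status":"active"'* ]]; then
    echo "token ok"
  else
    echo "token not active or invalid" >&2
    return 1
  fi
}

# verify_cf_token   # run once before any terraform command
```

The string match on `"status":"active"` is deliberately crude; if you already have `jq` in the loop, checking `.result.status` is the cleaner version.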
Keep the Authentication Boring Link to heading
I prefer the auth model to be boring and explicit. Usually it is just:
export CLOUDFLARE_API_TOKEN="..."
The token should come from a proper secret source:
- local secret manager
- CI secret store
- short-lived shell session
And it should not be committed to the repo or copied into prompt text.
This is not the place to be creative.
The Migration Plan I Followed Link to heading
My notes ended up being very close to this sequence:
- upgrade from `4.52.0` to `4.52.5` first
- run `tf-migrate` in dry-run mode
- apply the HCL rewrite and review every changed `.tf` file
- fix renamed or removed resources manually where needed
- switch to provider `~> 5`
- repair state issues
- apply in `test` first
- only after that touch `prod`
I also prefer to split the work into several PRs:
- PR1: transitional provider upgrade to `4.52.5`
- PR2: `tf-migrate` HCL rewrite plus manual fixes
- PR3: provider `~> 5`, state cleanup, import flow, final validation
This keeps the diff readable, makes CI output easier to understand, and gives engineers a cleaner checkpoint after every phase.
It also reduces delivery risk. When one phase goes wrong, I know exactly where to stop, revert, or re-plan instead of carrying one giant migration diff through the whole stack.
Pull State and Work Locally First Link to heading
One practical lesson from this migration: I do not want to start by experimenting against the normal backend.
First I pull the state locally and prepare a local work mode:
./tf.sh state pull > migration.tfstate
cp <environment>.tfvars <environment>.auto.tfvars
I copy the environment tfvars to *.auto.tfvars simply to make local Terraform runs load the same environment-specific values without adding extra flags to every command.
Then I temporarily switch the backend:
terraform {
backend "local" {
path = "migration.tfstate"
}
}
This part matters a lot.
The moment I know state cleanup, imports, and provider schema upgrades may be involved, I want a local copy first. It gives me a safer place to inspect, test, and understand the damage before touching the normal backend flow.
It also gives a faster feedback loop. That matters because this migration is not one command. It is many small iterations.
One obvious warning here: local Terraform state may contain secrets. Treat that local file accordingly.
Dry Run the Migration Tool First Link to heading
Before changing the provider constraint, I run the migration tool in dry-run mode:
tf-migrate migrate --source-version v4 --target-version v5 --dry-run --config-dir .
And the warnings are the interesting part.
In this migration, the dry run showed exactly where manual work was still required. The main categories were:
- application-scoped Access policies
- removed resources in `v5`
- resources that would need state cleanup and re-import
That is already a very good result. The tool does not need to finish the migration. It just needs to show where engineers should spend manual review time.
What tf-migrate Did Not Finish Link to heading
A few warnings from the dry run were especially important.
Application-scoped Access policies could not be migrated automatically. In v5, those policies need to live inline inside cloudflare_zero_trust_access_application.
cloudflare_split_tunnel was removed and had to move to device profile configuration.
cloudflare_zone_settings_override was removed too. The migration generated per-setting resources, but the old state still had to be removed and the new resources had to be imported correctly.
There were also a few field-level changes. For example, min_days_for_renewal disappeared from origin CA certificate resources.
This is why I treat the migration tool output as the first pass, not as the final migration.
Expect Manual State Cleanup Link to heading
This migration is not only about renaming resources.
Some failures happen because the old state payload cannot be decoded correctly by the v5 provider. So Terraform fails before you even get a useful diff.
I saw this pattern on resources such as:
- Zero Trust gateway policies
- load balancer monitors
- zones and zone settings related objects
The errors looked like provider decode problems, for example:
rule_settings: expected object, got array
header: expected object, got array
plan: expected object, got string
When that happens, the path is usually:
- back up the state
- remove only the failing addresses from state
- import them again with the `v5` format
- re-run plan
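That sequence can be turned into something reviewable instead of ad-hoc typing. A sketch that only prints the commands for a human to check before running; the CSV input shape is my own convention, and the address and import ID in the example are dummy values:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read "address,import_id" pairs on stdin and emit the backup / rm /
# import / plan sequence for review, instead of executing anything.
emit_repair_plan() {
  echo 'terraform state pull > "backup-$(date +%s).tfstate"'
  while IFS=, read -r address import_id; do
    echo "terraform state rm '${address}'"
    echo "terraform import '${address}' '${import_id}'"
  done
  echo 'terraform plan'
}

# Example (dummy IDs):
# emit_repair_plan <<'EOF'
# cloudflare_zero_trust_gateway_policy.example,7f3c5a0b1d4e6f8899aabbccddeeff00/2c9d4a8f7b6e5d4c3b2a1908fedcba76
# EOF
```

Printing instead of executing keeps the mutation step behind the same human-review boundary as the rest of the migration.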
This is one more reason why I like the local backend step first. It gives engineers room to repair state deliberately instead of rushing through it.
Some Cloudflare Resources Need Manual Review Anyway Link to heading
The migration tool helps a lot, but some resources still need human attention.
The ones I would watch first are:
- Access policies attached to applications
- split tunnel configuration
- zone settings overrides
- rulesets
- load balancer resources
For example, application Access policies are not just a rename problem. In v5, some of them need to move into inline policies on the application resource. That is not something I want to trust to an automatic rewrite without review.
Zone settings are another good example. Old override-style resources may turn into many per-setting resources. That often means imports and explicit state cleanup, not just HCL edits.
Use Small Shell Scripts as Migration Helpers Link to heading
One thing that helped a lot was using small disposable shell scripts instead of trying to remember every `state rm` and every import format.
I would recommend this to anyone doing the same migration.
Not because the scripts are fancy. The opposite. They are boring, explicit, and easy to review.
I ended up with three useful categories of scripts:
- scripts that remove stale state entries
- scripts that import renamed resources back into state
- scripts that query Cloudflare API and match objects automatically
All IDs in the examples below are dummy values. They are here only to show the expected shape.
Example 1: Bulk State Cleanup Script Link to heading
For resources that obviously had to be removed from legacy state, I prefer a helper like this:
#!/usr/bin/env bash
set -euo pipefail
# Keep a restorable copy of state before removing anything.
terraform state pull > "backup-$(date +%s).tfstate"
terraform state rm \
  'cloudflare_ruleset.example' \
  'cloudflare_access_policy.example' \
  'cloudflare_zone_settings_override.this' \
  'cloudflare_worker_domain.this' \
  'cloudflare_tunnel_virtual_network.default'
This is much safer than typing a long list manually while you are tired and already many plans deep into the migration.
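One extra guard I like before a bulk removal: cross-check the candidate addresses against `terraform state list` output, so a typo drops out visibly instead of failing mid-script. A hypothetical helper; the function name and input shape are mine:

```shell
#!/usr/bin/env bash
set -euo pipefail

# stdin: candidate addresses, one per line.
# $1: a file containing `terraform state list` output.
# Prints only the addresses that actually exist in state.
filter_existing_addresses() {
  local state_list="$1"
  local addr
  while IFS= read -r addr; do
    grep -qxF "$addr" "$state_list" && echo "$addr" || true
  done
}

# Usage sketch:
# terraform state list > state-list.txt
# filter_existing_addresses state-list.txt < candidates.txt
```

Diffing the filtered output against the candidate list shows exactly which addresses were mistyped or already gone.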
Example 2: Environment-Aware Import Script Link to heading
For resources that exist in all environments but have different IDs, I like a small wrapper that detects the AWS account and chooses the correct import ID.
#!/usr/bin/env bash
set -euo pipefail
ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
case "$ACCOUNT_ID" in
"123456789012")
ENV_NAME="test"
CLOUDFLARE_ACCOUNT_ID="7f3c5a0b1d4e6f8899aabbccddeeff00"
RESOURCE_ID="2c9d4a8f7b6e5d4c3b2a1908fedcba76"
;;
"210987654321")
ENV_NAME="prod"
CLOUDFLARE_ACCOUNT_ID="7f3c5a0b1d4e6f8899aabbccddeeff00"
RESOURCE_ID="8a7b6c5d4e3f2109fedcba9876543210"
;;
*)
echo "Unsupported account: $ACCOUNT_ID"
exit 1
;;
esac
echo "Detected environment: $ENV_NAME"
terraform import cloudflare_load_balancer_monitor.default "$CLOUDFLARE_ACCOUNT_ID/$RESOURCE_ID"
That pattern was useful for load balancer monitors, pools, load balancers, worker domains, and a few other resources.
Example 3: Parse Plan and Re-Import Existing Rulesets Link to heading
Rulesets were more interesting.
Sometimes Terraform wanted to create a ruleset that already existed in Cloudflare. In that case, I do not want to import by hand one by one if the plan already contains enough metadata to identify the object.
So another useful helper script pattern is:
- read `plan.txt`
- find `cloudflare_ruleset.* will be created`
- extract zone ID, name, phase, and description
- call Cloudflare API
- resolve the matching ruleset ID
- run `terraform import`
Very rough shape:
#!/usr/bin/env bash
set -euo pipefail
PLAN_FILE="${1:-plan.txt}"
ZONE_ID="f1e2d3c4b5a697887766554433221100"
RULESET_ID="9b8a7c6d5e4f32100123456789abcdef"
# parse plan output here
# call Cloudflare API here
# match by zone_id + name + phase
# terraform import "cloudflare_ruleset.example" "zones/$ZONE_ID/$RULESET_ID"
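The "parse plan output" step in the rough shape above is the easiest half to make concrete. A sketch of just that part; the regex assumes simple resource addresses without `for_each` keys or module prefixes:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Pull every cloudflare_ruleset address that `terraform plan` text
# output says will be created. $1 is the saved plan output file.
rulesets_to_create() {
  grep -oE 'cloudflare_ruleset\.[A-Za-z0-9_.-]+ will be created' "$1" \
    | sed 's/ will be created//'
}

# Each address can then be matched against the Cloudflare rulesets API
# by zone, name, and phase before running terraform import.
```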
This is one of those places where automation actually saves time instead of adding risk.
More importantly, it reduces team dependency.
Without these scripts, the migration would live mostly in one engineer’s memory. With them, another engineer can follow the same sequence, understand the shape of the repair work, and repeat it without reverse-engineering the whole stack from scratch.
Put Tools Behind Guardrails Link to heading
I still use code assistants for this kind of work. They are useful for scanning many .tf files, detecting renamed resources, summarizing warnings, and preparing repetitive edits.
But I keep the boundary simple:
- read the repository
- read provider docs
- prepare code changes
- summarize migration warnings
- never apply infrastructure changes on its own
If you use Cursor, `beforeShellExecution` is one of the easiest controls to add. I use it as a deny layer before the command is executed.
"beforeShellExecution": [
{
"command": "~/.cursor/hooks/block-apply.sh"
}
]
Very small hook:
#!/usr/bin/env bash
set -euo pipefail
input="$(cat 2>/dev/null || echo '{}')"
cmd="$(echo "$input" | jq -r '.command // .cmd // .shell_command // empty')"
case "$cmd" in
*"terraform apply"*|*"terraform destroy"*|*"terraform import"*|*"terraform state rm"*|*"terraform state mv"*|*"auto-approve"*)
echo "Blocked by policy during migration window: $cmd" >&2
exit 2
;;
esac
That was enough for the first stage. The tool could still scan modules, compare v4 and v5 resources, prepare refactors, and produce review notes, but it could not jump straight to mutation.
Important detail: this deny policy is for the assistant-driven exploration and diff-preparation phase.
Later, once the review is done, I run the approved terraform import, terraform state rm, and terraform apply steps myself in a separate supervised shell session. The hook is there to stop premature mutation, not to ban the whole migration workflow forever.
What Helped the Most Link to heading
If I reduce the whole experience to a few practical points, these are the things that helped most:
- use `test` first and keep `prod` behind review
- run `tf-migrate` in dry-run mode before changing provider version
- expect some `state rm` plus `terraform import` work
- keep helper scripts for repetitive import flows
- keep CI plans running during every phase
- remove `-auto-approve` for the migration window
None of this is complicated, but together it makes the migration much easier to pass.
It also lowers maintenance cost later. Repeated state repair or import logic stops being a custom one-time ritual and starts becoming documented operational tooling.
The First Phase Workflow Link to heading
Before any write-capable step, I want the work loop to be very small:
- read the Terraform code
- run the dry-run migration
- prepare code changes
- run safe checks like `terraform fmt` and `terraform validate`
- prepare import and state-cleanup commands for review
- review the diff
The first goal is not to “finish the migration”. The first goal is to remove uncertainty.
Test Before Prod, and Keep CI Running Link to heading
Another useful note from the migration: test first, always.
My rollout rule was:
- migrate and apply in `test`
- make sure post-apply plan is clean
- keep CI planning both `test` and `prod`
- only then allow the `prod` path
During migration, temporary prod plan instability can happen because of intermediate rename and state steps. That is acceptable for a short period. Blind prod apply is not.
Also, I keep -auto-approve out of the flow completely for this migration window.
Result Link to heading
The upgrade path was not fully automatic, but the combination of dry-run migration, local state work, scripted imports, and staged rollout made it predictable enough to execute safely.
That was the real objective. Not to make the migration look elegant, but to make it pass without breaking production and without turning one engineer’s memory into the only runbook.
For me this is the fun part of infrastructure work. I am comfortable owning technical risk when the process is clear and the rollback path is real.
From different angles, the result was:
- better path to adopt newer Cloudflare platform capabilities
- less dependence on click ops and one-person memory
- safer rollout shape for a production edge migration
- clearer signal that this environment can be maintained by another engineer later
Define the Reset Path Before You Need It Link to heading
One more practical point: define the way back before the first apply.
After local repair work, I want an explicit reset path:
- restore the normal backend block
- reinitialize Terraform
- migrate backend metadata if needed
- remove temporary files like copied tfvars, scratch plans, and local state artifacts
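The cleanup step can itself be a tiny script so nobody has to remember which scratch files exist. A sketch matching the file names used earlier in this post; the backend restore step stays manual:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Remove the local-migration leftovers: local state copies, the copied
# auto tfvars, and scratch plan output. Adjust names for your layout.
cleanup_migration_artifacts() {
  local dir="${1:-.}"
  rm -f "$dir"/migration.tfstate "$dir"/migration.tfstate.backup
  rm -f "$dir"/*.auto.tfvars
  rm -f "$dir"/plan.txt
  echo "cleaned $dir"
}
```

After this, restore the normal backend block and run `terraform init -reconfigure` (or `terraform init -migrate-state` if backend metadata needs to move) before the next plan.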
If you skip this part, the migration gets messy very quickly. Temporary files pile up, people forget which plan is the latest one, and it becomes much easier to make a bad decision.
Final Thoughts Link to heading
For me, the hard part of this migration is not HCL rewrite. The hard part is keeping the process predictable enough that other engineers can follow it too.
So the rule is simple:
- read-only first
- local state first
- dry-run first
- small helper scripts instead of manual repetition
- human review before mutation
That is how I prefer to start a Cloudflare provider v4 to v5 migration and help the next engineer get through it faster.