Comments

Note: Terraform currently requires the state to exist after Terraform has been run. Technically, at some point in the future, Terraform should be able to populate the local state file with the real infrastructure if the file didn't exist. But currently, Terraform state is a mixture of both a cache and required configuration and isn't optional.

I'd go even further and suggest there should be no state file at all. Terraform should just query the remote state every time it's needed. The state file going out of sync with the real infrastructure state is a major source of issues with Terraform. Only the provider knows what the state is. I realize this would make Terraform slower, but correctness is more important.

If this isn't possible, can you tell me why not, please?

Thanks for your consideration.

I'm glad you asked since I haven't had a chance to really write this down in any single place.

This isn't possible because there needs to be some database of some sort to map Terraform config <=> real world. For some providers like AWS you could theoretically use something like AWS tags (early versions of Terraform actually had no state file and did this). We quickly ran into problems: not all resources support tags.
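To make the "config <=> real world" mapping concrete, here is a minimal sketch in Python. It is not Terraform's actual state format or planning logic; the `plan` function, the address strings, and the IDs are all hypothetical. It only illustrates why a resource that has been removed from the configuration can't be found again without some stored mapping.

```python
# Hypothetical, simplified model; not Terraform's real state format.

def plan(config_addresses, state):
    """Classify each resource address by comparing config against state.

    config_addresses: set of addresses declared in the configuration.
    state: dict mapping address -> real-world ID recorded on an earlier run.
    """
    actions = {}
    for address in sorted(config_addresses | set(state)):
        if address in config_addresses and address in state:
            actions[address] = ("keep", state[address])
        elif address in config_addresses:
            actions[address] = ("create", None)
        else:
            # Removed from config: only the state still knows which
            # real-world object this address referred to.
            actions[address] = ("delete", state[address])
    return actions


# Made-up addresses and IDs for illustration.
state = {"aws_vpc.main": "vpc-0a1b", "aws_subnet.old": "subnet-9f8e"}
config = {"aws_vpc.main", "aws_subnet.new"}
result = plan(config, state)
```

Without the `state` dict, nothing in `config` mentions `aws_subnet.old`, so there would be no record of what to delete.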

Going forward from there, we ran into bigger issues: we encode more than just attributes in the state file. We have to encode things like depends_on so that when you delete items from a configuration we can delete them in the proper order. We can't just encode rules like "subnets before VPC" in Terraform because this also affects cross-provider resources and the complexity is effectively infinite.
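As a rough illustration of the ordering point, the sketch below derives a destroy order from dependency edges recorded in state. It assumes a made-up `state_deps` structure (address to list of addresses it depends on) and uses Python's standard `graphlib` module. It is not how Terraform itself is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def destroy_order(state_deps):
    """Return addresses in an order that is safe for deletion.

    state_deps: dict mapping address -> list of addresses it depends on,
    as it might be recorded in state (hypothetical structure).
    """
    # static_order() yields dependencies before dependents (creation order).
    creation_order = list(TopologicalSorter(state_deps).static_order())
    # Destroying is the reverse: dependents go first.
    return list(reversed(creation_order))


# Made-up resources: instance depends on subnet, subnet depends on VPC.
state_deps = {
    "aws_vpc.main": [],
    "aws_subnet.a": ["aws_vpc.main"],
    "aws_instance.web": ["aws_subnet.a"],
}
order = destroy_order(state_deps)
# order == ["aws_instance.web", "aws_subnet.a", "aws_vpc.main"]
```

If these resources were removed from the configuration, the edges in `state_deps` would be the only remaining record that the subnet has to go before the VPC.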

In addition to depends_on, we are going to store (in the future) information like last run time, when a resource was created, lifecycle options like prevent destroy to avoid accidental destroy, Terraform-specific tags/annotations on resources, etc.

We need state somewhere.

You brought up sync issues. Terraform by default will refresh the state on every plan/apply operation. This is effectively the same as if we didn't have a state file.

The pain I've found people have with state files is generally in conflicts when two people modify it. We are certainly working to improve that, but "removing the state file" just shifts a WHOLE INCREDIBLE AMOUNT of complexity from one place to a completely new place.

Beyond that, you also brought up performance. We have customers that manage over 10,000 resources with Terraform (in a single state file). Personally, we don't recommend managing that many resources in a single state file, but Terraform can do it. They get around this with clever tooling around targeted refreshing. If Terraform synced every resource on every operation, these users just could not use Terraform. They must work under the assumption that the state is in sync most of the time, and allow errors when it's not.

But I think it's important to reiterate that the state file isn't a convenient performance optimization. If anything, it is an inconvenient one that we need in order to store the critical metadata described above.

We have plans to improve things though! For example, for Terraform 0.9 we actually plan to split the state into two files: tfcache and tfdata (final names TBD). The tfcache will be the attribute data that syncs, and you can openly ignore this if you want and let Terraform sync your entire state. The tfdata is critical metadata for syncing and operations that must not be deleted. This should help lower conflicts a lot and simplify management.

As an aside: This sort of issue reminds me of another issue. I'm not trying to degrade your viewpoint in any way and I appreciate you asking this question, but it's a pattern I find folks follow that usually isn't the right approach: X is complicated and causes problems, please get rid of X. When something is complicated and causes problems, the implementors generally have had multiple conversations about "why do we have this? do we need it? can we get rid of it?" and have determined that it's either needed or that there is a way to improve it. State falls under this and I hope that this helps!

As I said, I'm glad you asked since I haven't had a chance to really write this down in any single place.

Another time this cropped up (the issue is still around here somewhere, closed) is when someone recommended we abandon representing infrastructure as a graph, and just try everything in parallel until it succeeds. That may be oversimplifying it but basically: retry until you don't get an error or you've retried enough times.

The graph is SUPER complicated but it enables a level of safety and understanding. I view state as a similar thing, but we've done it less well... up to this point. We're working on improving that with time though.