Infrastructure as code (IaC) has been a big project for us recently, and the benefits we expected from it were clear from the start. These benefits include:
- applying our existing review process to our infrastructure (not just to the application code);
- using languages, tools, and build chains that we already know; and
- allowing for scenarios that are difficult to accomplish otherwise, like easily spinning up a copy of our production infrastructure. (We can then stress-test it and forecast how much leeway we have.)
We explored several options and ultimately settled on Pulumi. We liked that it lets you use a full-fledged programming language of your choice (C# in our case) without having to learn yet another proprietary DSL, as many other tools would have us do.
Proof of concept
It was easy to get started with Pulumi, and, in no time at all, we had a proof of concept ready, defining a couple of Azure resources in our test environment.
This is what it looked like:
It is easy to understand what is going on here, but it is also clear that we are imperatively telling Pulumi what to do:
- Get the resource group.
- Get the Redis cache and the alert action group.
- Create an alert.
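A minimal sketch of this imperative style, written with hypothetical stand-in types rather than real Pulumi classes so that it runs offline. `FakeAzure`, the record types, and all resource names are assumptions made for illustration and are not taken from our code base:

```csharp
using System;

// Imperative style: every step is an explicit call, executed in order.
var azure = new FakeAzure();
var resourceGroup = azure.GetResourceGroup("monitoring");
var redis = azure.GetRedisCache(resourceGroup, "sessions");
var actionGroup = azure.GetActionGroup(resourceGroup, "on-call");
var alert = azure.CreateMetricAlert("redis-cpu", resourceGroup, redis, actionGroup);

Console.WriteLine($"{alert.Id} watches {alert.ScopeId}");

// Hypothetical stand-ins for cloud resources; none of these are Pulumi types.
public record ResourceGroup(string Id);
public record RedisCache(string Id);
public record ActionGroup(string Id);
public record MetricAlert(string Id, string ScopeId, string ActionGroupId);

// Hypothetical in-memory client that only builds IDs.
public class FakeAzure
{
    public ResourceGroup GetResourceGroup(string name) =>
        new($"/resourceGroups/{name}");

    public RedisCache GetRedisCache(ResourceGroup parent, string name) =>
        new($"{parent.Id}/redis/{name}");

    public ActionGroup GetActionGroup(ResourceGroup parent, string name) =>
        new($"{parent.Id}/actionGroups/{name}");

    public MetricAlert CreateMetricAlert(
        string name, ResourceGroup parent, RedisCache scope, ActionGroup action) =>
        new($"{parent.Id}/metricAlerts/{name}", scope.Id, action.Id);
}
```

Even in this tiny form, the order of the calls matters, and nothing stops arbitrary logic from being added between them.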
This works, and it is basically what you find when you look at how other people or companies are using Pulumi.
From our perspective, however, this approach has some major drawbacks.
If we take a step back, we realize that being able to use a full-fledged programming language like C# to define your infrastructure is a double-edged sword.
It is good because you are reusing the knowledge you and your team already have. You can keep using the same static analysis tools, build chains, and CI/CD pipelines. If necessary, you can also leverage the full power of the language instead of being restricted to a dedicated configuration language such as Terraform's HCL or to the JSON- and YAML-based template formats that other tools use.
This last point, however, raised a red flag for us. With the infrastructure being such a critical part of our system, we do not want the developers to run wild with writing elaborate code that obscures what is going to change in the infrastructure as a result of deploying it.
Going back to the first iteration of our code, imagine what this would look like if we had our entire infrastructure written down like this and we needed to update it. With each change, you would have to be very cautious when reviewing it because there are basically no limits to what the code can do.
So, it was clear to us from the very beginning that this is not the way to go.
We had a distinct idea of how it should look. Ideally, you would define the infrastructure in a declarative manner; a configuration interpreter would translate it to Pulumi objects, and these would then be deployed to Azure by the Pulumi engine.
Once you start working on that, you realize that the most important thing the interpreter is going to do is resolve dependencies between resources.
For example, when you have a metric alert in Azure, it depends on
- the resource group that it is a part of,
- the action group defining what happens when the alert triggers, and
- the resource itself on which the metrics are observed.
These three dependencies also have their own dependencies.
This is what we ended up with:
As you can see, we now declare the infrastructure rather than laying out the exact steps to create it, and we use the ResourceDependency class to represent the relationships between resources.
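A simplified sketch of such a declaration. It is an illustration only: our `ResourceDependency` wraps a lambda, whereas this version stores a plain stack name and output name, and every type and value here other than `ResourceDependency`, `AzureResourceGroups`, and `Monitoring` is hypothetical:

```csharp
using System;
using System.Collections.Generic;

// Declaration only: nothing is called here, relationships are plain data.
var redisCpuAlert = new MetricAlertDefinition(
    Name: "redis-cpu",
    ResourceGroup: new ResourceDependency("AzureResourceGroups", "Monitoring"),
    ActionGroup: new ResourceDependency("AzureActionGroups", "OnCall"),   // hypothetical stack
    Scope: new ResourceDependency("AzureRedisCaches", "Sessions"),        // hypothetical stack
    MetricName: "example-cpu-metric",                                     // hypothetical metric
    Threshold: 80);

foreach (var dependency in redisCpuAlert.Dependencies)
    Console.WriteLine($"{dependency.Stack} -> {dependency.Output}");

// Points at a named output of another stack instead of at a live resource.
public record ResourceDependency(string Stack, string Output);

public record MetricAlertDefinition(
    string Name,
    ResourceDependency ResourceGroup,
    ResourceDependency ActionGroup,
    ResourceDependency Scope,
    string MetricName,
    double Threshold)
{
    // The three relationships the interpreter has to resolve before deploying.
    public IEnumerable<ResourceDependency> Dependencies =>
        new[] { ResourceGroup, ActionGroup, Scope };
}
```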
Internally, this relies on Pulumi's concept of stack references. A stack reference lets one stack read the outputs of another stack (typically resource IDs) and use them as its own inputs, which is how we express dependencies.
So, when we take the example of the Redis alert above, we see that it has three dependencies, and the first one, which is the resource group, is going to be resolved by referencing the AzureResourceGroups stack and its output called Monitoring:
To create the stack outputs, we get to the interpreter itself, which looks like this:
It goes through all the alerts we defined, resolves their dependencies, and deploys the alerts.
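The loop described above can be sketched as follows. This is a hedged illustration, not our interpreter: a dictionary stands in for the stack outputs that Pulumi would provide through a `StackReference`, and `Interpreter`, `AlertDefinition`, and `DeployedAlert` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the outputs published by other stacks (all values are made up).
var stackOutputs = new Dictionary<(string Stack, string Output), string>
{
    [("AzureResourceGroups", "Monitoring")] = "/resourceGroups/monitoring",
    [("AzureActionGroups", "OnCall")] = "/actionGroups/on-call",
    [("AzureRedisCaches", "Sessions")] = "/redis/sessions",
};

var alerts = new[]
{
    new AlertDefinition(
        "redis-cpu",
        new("AzureResourceGroups", "Monitoring"),
        new("AzureActionGroups", "OnCall"),
        new("AzureRedisCaches", "Sessions")),
};

foreach (var deployed in Interpreter.Run(alerts, stackOutputs))
    Console.WriteLine(deployed);

public record ResourceDependency(string Stack, string Output);

public record AlertDefinition(
    string Name,
    ResourceDependency ResourceGroup,
    ResourceDependency ActionGroup,
    ResourceDependency Scope);

// Every dependency arrives as a plain string ID, not as a typed resource.
public record DeployedAlert(
    string Name, string ResourceGroupId, string ActionGroupId, string ScopeId);

public static class Interpreter
{
    public static IReadOnlyList<DeployedAlert> Run(
        IEnumerable<AlertDefinition> alerts,
        IReadOnlyDictionary<(string Stack, string Output), string> outputs)
    {
        // Resolve one dependency by looking up the referenced stack output.
        string Resolve(ResourceDependency dependency) =>
            outputs.TryGetValue((dependency.Stack, dependency.Output), out var id)
                ? id
                : throw new InvalidOperationException(
                    $"Stack '{dependency.Stack}' has no output '{dependency.Output}'.");

        return alerts
            .Select(alert => new DeployedAlert(
                alert.Name,
                Resolve(alert.ResourceGroup),
                Resolve(alert.ActionGroup),
                Resolve(alert.Scope)))
            .ToList();
    }
}
```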
This is much closer to what we wanted; however, there is still a lot of boilerplate that would be repeated across all the stacks. The entire Run method is not specific to alerts, and it would be nice to get rid of it.
Also, it is not ideal that, to define a dependency, you have to use the special-purpose ResourceDependency class with a lambda inside. Ideally, you would just reference the other resource. Additionally, in each stack, you have to explicitly list the resources as key-value pairs. All of that exists only to support the reflection bits, which, in turn, are there only to represent the mechanism of stack references in the code.
Finally, because we are crossing the boundaries of Pulumi stacks and only simple values can be passed through them, the code is not as type-safe as we would like. Note that the three dependencies of the alert we are creating here are represented not as a resource group, an action group, and the resource itself, but instead, as the IDs of the resources, which are just strings.
Where we are now
After refactoring, our infrastructure as code looks like this:
First, note that we got rid of the ResourceDependency class and instead, to represent relationships between resources, we just use object references. That makes the configuration easier to understand and removes the remaining non-static aspect of it.
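A minimal sketch of what declaring relationships through object references can look like. All type and resource names are hypothetical and only illustrate the idea:

```csharp
using System;

var monitoring = new ResourceGroupDefinition("monitoring");
var onCall = new ActionGroupDefinition("on-call", monitoring);
var sessions = new RedisCacheDefinition("sessions", monitoring);
var redisCpu = new MetricAlertDefinition("redis-cpu", monitoring, onCall, sessions, Threshold: 80);

// Relationships are ordinary, compiler-checked object references.
Console.WriteLine(redisCpu.Scope.ResourceGroup.Name); // prints "monitoring"

public record ResourceGroupDefinition(string Name);
public record ActionGroupDefinition(string Name, ResourceGroupDefinition ResourceGroup);
public record RedisCacheDefinition(string Name, ResourceGroupDefinition ResourceGroup);

public record MetricAlertDefinition(
    string Name,
    ResourceGroupDefinition ResourceGroup,
    ActionGroupDefinition ActionGroup,
    RedisCacheDefinition Scope,
    double Threshold);
```

Passing an action group where a Redis cache is expected now fails at compile time instead of at deployment time.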
How do we then interpret the resource dependencies?
For each type of resource, we have a class we call a deployer, which does one of only two things:
- it gets the resource from Azure (in case we need to reference it but do not want to manage it in Pulumi), or
- it creates the resource, in which case it knows what the dependencies are.
Note that the methods are now fully type-safe, allowing us to easily access properties of the Azure resources we are provided as dependencies.
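A hedged sketch of what a deployer with these two responsibilities might look like. `MetricAlertDeployer`, its method signatures, and the stand-in resource records are assumptions for illustration and do not use real Pulumi types:

```csharp
using System;

var deployer = new MetricAlertDeployer();
var resourceGroup = new ResourceGroup("/resourceGroups/monitoring");
var actionGroup = new ActionGroup("/actionGroups/on-call", "oncall");
var cache = new RedisCache("/redis/sessions", "sessions.example.invalid");

var created = deployer.Create(
    new MetricAlertDefinition("redis-cpu", 80), resourceGroup, actionGroup, cache);
Console.WriteLine($"{created.Id}: {created.Summary}");

// Hypothetical stand-ins for already deployed resources.
public record ResourceGroup(string Id);
public record ActionGroup(string Id, string ShortName);
public record RedisCache(string Id, string HostName);
public record MetricAlertDefinition(string Name, double Threshold);
public record MetricAlert(string Id, string Summary);

public class MetricAlertDeployer
{
    // Option 1: reference an existing alert without managing it.
    public MetricAlert Get(string existingId) =>
        new(existingId, "existing alert, not managed here");

    // Option 2: create the alert. Dependencies arrive as typed objects,
    // so their properties (HostName, ShortName) are available directly.
    public MetricAlert Create(
        MetricAlertDefinition definition,
        ResourceGroup resourceGroup,
        ActionGroup actionGroup,
        RedisCache scope) =>
        new(
            $"{resourceGroup.Id}/metricAlerts/{definition.Name}",
            $"{scope.HostName} above {definition.Threshold}, notify {actionGroup.ShortName}");
}
```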
The stack itself, that is, the point where the deployment starts, now looks like this:
It goes through all the types of resources in the specific environment and deploys all of them. On the individual level, this boils down to a simple decision in the ResourceDeployer.Deploy method:
The resource we are going to deploy is either
- a manual resource, meaning we only need to get a reference to it but do not manage it with Pulumi. Then it is simply a matter of calling a getter that Pulumi provides for each type of resource;
- a managed resource, meaning we define all its properties in code, and Pulumi makes sure it is synchronized with Azure; or
- an imported resource, which is basically the same as a managed resource but in a situation where the resource already exists in Azure, and we want Pulumi to start managing it.
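The three cases above can be sketched as a single switch. This is a simplified illustration that returns a description of the action instead of calling Pulumi; the `Management` enum and `ResourceDefinition` record are hypothetical. In Pulumi's .NET SDK, the manual case would typically map to a resource's static `Get` method and the imported case to the `ImportId` resource option:

```csharp
#nullable enable
using System;

Console.WriteLine(ResourceDeployer.Deploy(
    new ResourceDefinition("monitoring", Management.Manual, "/resourceGroups/monitoring")));
Console.WriteLine(ResourceDeployer.Deploy(
    new ResourceDefinition("redis-cpu", Management.Managed)));
Console.WriteLine(ResourceDeployer.Deploy(
    new ResourceDefinition("sessions", Management.Imported, "/redis/sessions")));

public enum Management { Manual, Managed, Imported }

// ExistingId is only needed for resources that already exist in Azure.
public record ResourceDefinition(string Name, Management Management, string? ExistingId = null);

public static class ResourceDeployer
{
    // Returns a description of the action so the sketch runs without Pulumi.
    public static string Deploy(ResourceDefinition definition) =>
        definition.Management switch
        {
            Management.Manual =>
                $"get {definition.Name} by ID {RequireId(definition)}",
            Management.Managed =>
                $"create {definition.Name}",
            Management.Imported =>
                $"create {definition.Name} and import {RequireId(definition)}",
            _ => throw new ArgumentOutOfRangeException(nameof(definition)),
        };

    private static string RequireId(ResourceDefinition definition) =>
        definition.ExistingId
        ?? throw new InvalidOperationException(
            $"Resource '{definition.Name}' needs an existing ID.");
}
```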
To keep the code snippets succinct and to the point, I left out some of the plumbing such as caching the Pulumi resources, but it is not hard to figure out how that works. If something is unclear, do not hesitate to hit us with a question.
This approach minimizes the cognitive load required to understand and change the infrastructure. However complex our infrastructure becomes, it will always be represented by a set of resource declarations that one can easily comprehend.
Currently, we only have a very small part of our infrastructure defined in code, so it is too early to say if this approach will stand the test of time. Also, because of the chosen architecture, it is unlikely we will be able to adopt micro-stacks, something which is supposed to be a preferred way of organizing large projects. However, we presume that a lot of the intricacies associated with defining complex infrastructure in code can be avoided by adopting the declarative approach.
In the long term, we would like to publish the reusable bits as a NuGet package so that the IaC community gains one more possibility to define infrastructure in .NET with Pulumi.
What is your experience with defining infrastructure in C#? Have you considered going declarative?
For more engineering insights shared by Mews tech team: