Why define infrastructure through code?
Infrastructure as code gets a lot of attention in the cloud engineering and DevOps community, but it is not wise to get caught up in the hype around the latest technology, framework, or library and invest time and resources in it right away. That can easily turn into a huge waste of time when you realize a few months later that the hyped tool has introduced more problems than you had without it.
So, what were the main reasons we wanted to move in the “infrastructure as code” direction?
- We wanted to be able to run load tests/stress tests easily—to spin up a replica of a given environment and test where its limitations are.
- We had poor observability into infrastructure changes. We were changing resources directly in the cloud and, even with rules in place, we found those changes hard to control.
- We wanted the infrastructure defined declaratively in code so that we could see the differences between specific environments right away, rather than digging through the Azure portal and comparing them “by eye”.
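To make the last point concrete, here is a minimal sketch of how declarative definitions turn the comparison of two environments into a plain data diff. The resource names, settings, and the `diff_environments` helper are hypothetical and only illustrate the idea; they are not taken from our actual infrastructure or from any IaC tool's API.

```python
# Hypothetical declarative descriptions of two environments.
# Resource names and settings are illustrative only.
STAGING = {
    "app-service": {"sku": "S1", "instances": 1},
    "sql-database": {"tier": "Standard", "max_size_gb": 50},
}
PRODUCTION = {
    "app-service": {"sku": "P2v3", "instances": 4},
    "sql-database": {"tier": "Standard", "max_size_gb": 50},
    "redis-cache": {"sku": "Premium"},
}


def diff_environments(left, right):
    """Return human-readable differences between two environment definitions."""
    differences = []
    for name in sorted(left.keys() | right.keys()):
        if name not in left:
            differences.append(f"{name}: only in right")
        elif name not in right:
            differences.append(f"{name}: only in left")
        else:
            # Both environments define the resource: compare setting by setting.
            for key in sorted(left[name].keys() | right[name].keys()):
                left_value = left[name].get(key)
                right_value = right[name].get(key)
                if left_value != right_value:
                    differences.append(f"{name}.{key}: {left_value!r} -> {right_value!r}")
    return differences


if __name__ == "__main__":
    for line in diff_environments(STAGING, PRODUCTION):
        print(line)
```

With definitions like these, the question "how does staging differ from production?" is answered by a few lines of output instead of a click-through comparison in the portal.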
Okay, and are there any reasons not to adopt infrastructure as code? Are there any downsides? Well, glad you asked. Of course there are.
- In some sense, you have less control over the resources. The actual actions are hidden in the details of how a given tool works. For this reason, we are not going to put our critical storage under the tool's control, for example.
- Like any software, the tool has bugs and limitations that you must understand and work around, which can be time-consuming.
- There is naturally a learning curve associated with the technology.
I would say that the common saying applies here too: "it depends". Having infrastructure defined through code is not a silver bullet, but it can be beneficial, so it really comes down to your situation. Take a small startup that needs to iterate as fast as possible and has a relatively small infrastructure that changes quickly between iterations. There, an IaC tool can slow the process down (through its bugs, limitations, and learning curve), and its benefits matter less. If the startup does not have a production environment yet, for example, nobody cares that much if a mistake causes a short downtime.
On the other hand, in a bigger company where:
- You already have a profitable production environment with many users,
- Your infrastructure is large but more or less stable, and
- You have SLAs for your uptime
… then you will value the benefits much more. For example, better observability into infrastructure changes reduces the room for mistakes (which can cause downtime), and that matters more to bigger companies.
Which tool to use?
Once you start looking into the options, you will find that there are many possibilities. Our main requirement was that we needed a tool compatible with Azure, since all of our infrastructure is based there. That was our starting point, and then we did a little research on the possible tools. Here is a quick overview of what the possibilities were for us:
- Azure Resource Manager templates
- Azure CLI with PowerShell scripts
- Azure .NET SDK
- Bicep
- Farmer
- Terraform
- Pulumi
After our research, we had two hot candidates: Terraform and Pulumi. The other options were left out mostly for one of two reasons: either they were hard to maintain (e.g., Azure Resource Manager templates with thousands of lines of JSON, or Azure CLI with PowerShell scripts), or they were fairly new technologies without much coverage of Azure resources, an active community, or good documentation (e.g., Azure .NET SDK, Bicep, and Farmer).
Terraform vs Pulumi
Both of those options have many similarities: They have been around for some time already, and they have great documentation, a nice CLI experience, a strong community behind them, and teams that are continuously working on improving the product.
The main difference between them is that Terraform uses its own DSL (domain-specific language), while Pulumi uses existing programming languages (with an SDK for each major language, such as TypeScript, Python, C#, and Go). Beyond that, they differ in the details:
- How they handle state management (e.g., how and where they store the state of the infrastructure).
- How they handle concurrent changes.
- And many more.
Both tools address these topics, but in slightly different ways.
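To give an idea of what state management and handling of concurrent changes mean in practice, here is a simplified conceptual sketch. It is not how Terraform or Pulumi are implemented: `StateStore`, `plan`, and `apply` are hypothetical names, the state lives in memory, and a real tool would also call the cloud provider. It only assumes the general pattern of comparing the desired definitions with a recorded state and refusing a second update while one is in progress.

```python
import threading


class StateStore:
    """In-memory stand-in for a state backend (hypothetical, for illustration)."""

    def __init__(self):
        self.resources = {}
        self._lock = threading.Lock()

    def try_lock(self):
        # Non-blocking: a second concurrent update is rejected, not queued.
        return self._lock.acquire(blocking=False)

    def unlock(self):
        self._lock.release()


def plan(desired, recorded):
    """Compare desired definitions with the recorded state and list the actions."""
    actions = []
    for name in sorted(desired.keys() | recorded.keys()):
        if name not in recorded:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != recorded[name]:
            actions.append(("update", name))
    return actions


def apply(store, desired):
    """Apply a plan while holding the lock; refuse if another update is running."""
    if not store.try_lock():
        raise RuntimeError("state is locked by another update")
    try:
        actions = plan(desired, store.resources)
        # A real tool would call the cloud API here; this sketch only records state.
        store.resources = {name: dict(props) for name, props in desired.items()}
        return actions
    finally:
        store.unlock()


if __name__ == "__main__":
    store = StateStore()
    print(apply(store, {"dashboard": {"tiles": 3}}))
    print(apply(store, {"dashboard": {"tiles": 4}, "vm": {"size": "small"}}))
```

The questions worth asking of any real tool follow from this sketch: where the recorded state is stored, who can read it, and what happens when two people run an update at the same time.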
After some time researching these candidates and working with them (or rather playing with them), we decided to go with Pulumi. As described above, the two options offer a similar developer experience (CLI, documentation, community, etc.), but the key differentiator in our case was that we could use C# with Pulumi, because we have strong knowledge of, experience with, and an ecosystem around C#. The idea of having to learn a new language, learn its best practices, develop tooling around it, and understand its limitations and problems was not a pleasant one. On the other hand, many companies use Terraform successfully, so it is definitely a good choice to consider for your needs.
We started adopting Pulumi in our least important environments, where we built some proofs of concept and became more familiar with it. Our next step is to adopt it horizontally across all environments. That means we will get to production as soon as possible instead of leaving it as the last environment to migrate sometime in the future, which would be a big unknown. This approach lets us set up the processes around deployment right away and migrate specific types of resources in all environments in parallel, starting with the least important ones (like dashboards) and moving on to the more important ones (like compute).
If you found this blog post interesting, you will probably like our next article on how we declare the infrastructure with Pulumi. Check it out and let me know what you think about our choice of infrastructure as code tool.