Agari has made a significant investment in infrastructure as code, and almost two years into the project we've learned some lessons. (If you'd like to read about our first year's efforts, check out my previous blog post, Ansible and Terraform at Agari: An Automation Journey.) Our efforts have already paid dividends by increasing engineering velocity while maintaining infrastructure reliability. Our new tool set also enables more experimentation, since we know we're never more than a few commands away from rolling back changes or re-deploying with zero data loss.
Our automation tool set makes managing cloud-based infrastructure orders of magnitude faster and more repeatable, but with more powerful tools at our disposal come complexity, technical debt, and higher barriers to entry for simple changes. Managing state, deploying secrets, working with distributed data stores, and using tools that are themselves in a state of rapid development (e.g. Packer, Terraform, and Ansible) can yield significant and sometimes unexpected challenges. I'll share several pitfalls and best practices in a three-part series.
Your automation repository: git, branching and organization.
First, I'd like to address one of the more contentious issues we've struggled with: how to organize an integrated automation repository. I've seen a variety of approaches to organizing infrastructure code. You might place the code for each automation tool (e.g. Ansible, Terraform, CloudFormation, Puppet) in a separate repository. Alternately, you might prefer that infrastructure code live alongside each product or product component; for example, all of the web caching code and its automation code might live in a web caching repo, and the same would be true of a database component.
We've taken a different approach. All automation code, by which I mean nearly all non-product code, lives in one big, slightly messy git repo. I'm not going to claim this is the most elegant, perfectly virtuous solution, but it has some important advantages.
Maximizing code re-use between teams
It's very easy to share Ansible roles between product teams when all you have to do is reference them alongside your custom, application-specific roles. For example, here's how we configure roles for one of our host groups:
roles:
  - nrt-alerter
  - role: cousteau
    url_domain: producta.stage.agari.com
    cousteau_branch: develop
  - postfix
This particular machine gets three roles: the role specific to its function (nrt-alerter in this instance), a role which checks out code specific to the product it supports and the environment it's in (i.e. cousteau), and finally a general-purpose postfix role which supports Amazon SES as well as custom MTA configuration.
Additionally, all hosts get a base set of roles which configure things like users and ssh keys, IDS/IPS agents, host monitoring agents, ntp, and log shipping. For these catch-all roles we simply target the hosts: all group; a minimal sketch follows.
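As a rough illustration, a play for those base roles might look something like this (the role names below are hypothetical stand-ins, not our actual roles):

- hosts: all
  become: true
  roles:
    - users           # users and ssh keys
    - ids-agent       # IDS/IPS agent
    - monitoring      # host monitoring agent
    - ntp
    - log-shipping    # log shipping configuration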
Another major requirement of developing software at Agari is enabling product teams to have visibility into the automation work contributed by other product teams. This promotes code reusability, cross-team collaboration, and dissemination of best practices and reusable design patterns. Just being peripherally aware of what your colleagues are working on can be extremely valuable.
Integration between automation tools
Besides enabling teams to share code, it's also important to be able to leverage the work we've put into writing and maintaining our Ansible roles by using them not strictly for typical playbook runs but also as provisioners for Terraform and Packer when appropriate. An example is a Packer-built Docker container we use to run Apache Airflow. In this case, a Packer shell provisioner first bootstraps Ansible, and an ansible-local provisioner then applies an Airflow role we'd already written to install and configure Airflow on a virtual machine host.
"provisioners": [{
"type" : "shell","inline" : ["sudo apt-get update","sudo apt-get install -y software-properties-common","sudo apt-add-repository ppa:ansible/ansible","sudo apt-get update","sudo apt-get -y --force-yes install ansible python-apt"]
},{
"type" : "ansible-local","playbook_file": "local.yml","role_paths" : ["{{ user `ans_home` }}/roles/airflow"]
}]
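To build the image you'd run something like the following; the template path and the ans_home value here are placeholders, not our actual layout:

# Hypothetical invocation; ans_home is the user variable referenced in the template above
packer build -var "ans_home=../.." packer/airflow/airflow.json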
Repo organization
├── bin
├── examples
├── group_vars
├── host_vars
├── packer
│   └── airflow
├── roles
└── terraform
    ├── product_a
    │   ├── dev
    │   ├── prod
    │   └── stage
    └── product_b
It's important to maintain a clear idea of where things go when you've got multiple product teams contributing to a single infrastructure repository. We organize our repository this way:

- Ansible code goes in the top level (Level 1).
- Roles, group vars, and host vars live underneath (Level 2).
- This way, Ansible code can easily be referenced by the other tools, which live in second-level directories.
- Other second-level directories include:
  - a bin directory for things like Python scripts that interact with EC2 via boto
  - an examples directory to store items like an example .ssh/config for connecting through our bastion hosts (a sketch follows below)
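For illustration, such an example .ssh/config might look roughly like this (the hostnames, user, and addresses are hypothetical, not our actual setup):

# Hypothetical bastion setup: reach private instances by hopping through the bastion host
Host bastion
    HostName bastion.stage.agari.com
    User ubuntu
    IdentityFile ~/.ssh/id_rsa

Host 10.0.*.*
    User ubuntu
    ProxyCommand ssh -W %h:%p bastion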
Parameterizing environments in Terraform with make
We've iterated a few times on how to best organize Terraform configurations. Early efforts duplicated a lot of code between environments, resulting in configuration drift: it became increasingly difficult to keep staging and production in sync. Our solution is to parameterize the differences between environments and to use a Makefile that runs Terraform with environment-specific state files and variable files.
# Test that we have the necessary executables available
EXECUTABLES = ansible terraform
K := $(foreach exec,$(EXECUTABLES),\
        $(if $(shell which $(exec)),some string,$(error "No $(exec) in PATH")))

all: plan

plan:
	@if [ -z "${ENV}" ]; \
	then echo "usage: make plan ENV=(prod|stage|dev)"; \
	else terraform plan -state=$(ENV)/terraform.tfstate -var-file=$(ENV)/terraform.tfvars \
		-var-file=$(ENV)/secrets.tfvars -out=$(ENV)/terraform.tfplan $(ARGS); \
	fi

apply:
	@if [ -z "${ENV}" ]; \
	then echo "usage: make apply ENV=(prod|stage|dev)"; \
	else terraform apply -state=$(ENV)/terraform.tfstate -var-file=$(ENV)/terraform.tfvars \
		-var-file=$(ENV)/secrets.tfvars $(ARGS); \
	fi

destroy:
	@if [ -z "${ENV}" ]; \
	then echo "usage: make destroy ENV=(prod|stage|dev)"; \
	else terraform destroy -state=$(ENV)/terraform.tfstate -var-file=$(ENV)/terraform.tfvars \
		-var-file=$(ENV)/secrets.tfvars $(ARGS); \
	fi

clean:
	rm -f */terraform.tfplan

.PHONY: all plan apply destroy clean
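With this in place, a typical workflow from a product's terraform directory looks like the following (aws_instance.app below is a hypothetical resource name, used only to show the ARGS pass-through):

# Preview changes against staging, then apply them
make plan ENV=stage
make apply ENV=stage

# Extra flags pass through to terraform via ARGS
make plan ENV=prod ARGS="-target=aws_instance.app"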
In this example, the common Terraform configurations live in the parent directory (one per product), and underneath it we keep per-environment directories, each containing a terraform.tfvars, a terraform.tfstate, and a secrets.tfvars. We generally store account credentials in secrets.tfvars, since our environments are segmented by AWS account (and yours should be too). The tfvars file looks something like this:
environment = "prod"
account_id = "313371234567"
ssl_certificate_id = "arn:aws:iam::313371234567:server-certificate/myagari-ev-cert"
app_server_count = 8
test_elb_count = 0
Setting the environment variable enables configurations like:

Name = "${format("app-%02d", count.index)}.cp.${var.environment}.agari.com"
The account ID and SSL certificate ARN will need to be set separately per environment, of course, and server counts often differ between environments. Resource counts can be set to zero for resources that exist in one environment but not the others, for example prototypes that are only spun up in a development environment; the sketch below shows how these variables feed a resource definition.
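As a rough sketch in the Terraform 0.x syntax we were using at the time (the AMI variable and resource name are hypothetical), those per-environment variables might drive a resource like this:

# Hypothetical app server definition driven by per-environment tfvars
resource "aws_instance" "app" {
  count         = "${var.app_server_count}"   # 8 in prod, fewer (or 0) elsewhere
  ami           = "${var.app_ami}"            # assumed variable: one AMI per environment
  instance_type = "m4.large"

  tags {
    Name = "${format("app-%02d", count.index)}.cp.${var.environment}.agari.com"
  }
}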
Git branching
Our git branching strategy is as follows:
- master — represents production for both products.
- develop-product_a — holds Product A-related changes to be deployed to our pre-production environment (i.e. staging). After staging is validated, we merge this branch into master for eventual production deployment.
- develop-product_b — holds Product B-related changes to be deployed to our pre-production environment (i.e. staging). After staging is validated, we merge this branch into master for eventual production deployment.
Three simple rules for how to use them:
- Run Terraform/Ansible against prod from master and against stage from develop-product_{a,b} only.
- As part of daily development, please commit and push to develop-product_{a,b}.
- On prod deployments, please merge your respective develop-product_{a,b} branch into master just as you would for the product code repository, and immediately merge master back into your develop-product_{a,b} branch to keep the two coherent (see the sketch after this list). Periodic merges of master into your develop-product_{a,b} branch outside of prod deployments can't hurt either.
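For example, a Product A prod deployment might look roughly like this (a sketch of the rules above, not a prescribed script):

# Promote staging-validated changes to production
git checkout master
git merge develop-product_a
git push origin master
# ...run Terraform/Ansible against prod from master (rule 1)...

# Immediately merge master back so the develop branch stays coherent (rule 3)
git checkout develop-product_a
git merge master
git push origin develop-product_a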
Coming up next
In Part 2 of this post we'll discuss:
- State management
- Dealing with datastores: special cases and how to handle complex migrations
- Why your immutable infrastructure doesn't necessarily need to be 100% immutable