[originally published in ShuttleCloud’s blog]
It’s no secret that ShuttleCloud uses Ansible for managing its infrastructure, so we thought it would be useful to explain how we use it.
At ShuttleCloud, we take automation as a first order component of Software Engineering. We aim to have everything automated in production: application deployment, monitoring, provisioning, etc. Ansible is a key component in this process.
Our infrastructure is almost entirely in Amazon Web Services (AWS), although we are in the process of moving to Google Compute Engine (GCE). The stack is distributed across several regions in the US and Europe. Because we have SLAs of up to 99.7%, every component, service and database must operate in a High Availability manner.
We maintain a dedicated private network (VPC) for most of our clients (Gmail, Contacts+, Comcast, etc.). This adds complexity because we have to handle more separate infrastructure domains in the US regions of AWS. If we were able to share VPCs between clients, things would certainly be easier.
The migration platform is designed using a microservices architecture. We have a few of them:
- autheps: Takes care of requests and refreshes OAuth2 tokens.
- capabilities: Registry of supported providers and operations.
- migration-api: API exposing our migration service.
- migration-robot: Takes care of moving data between providers.
- security-locker: Storage of sensitive data.
- stats: Gathering and dispatching of KPI data.
These microservices are mostly written in Python, but there is some Ruby too.
The software stack is very diverse. Between development and operations we have to manage:
apache, celery, corosync, couchdb, django, dnsmasq, haproxy, mysql, nginx, openvpn, pacemaker, postgresql, prometheus, rabbitmq, selenium, strongswan.
In total, we manage more than 200 instances, all of them powered by Ubuntu LTS.
Naming Hosts and Dynamic Inventory
All our instances have a human-friendly, fully qualified domain name: a string like auth01.gmail.aws that is reachable from inside the internal network. These names are automatically generated based on the tags associated with the corresponding instances.
Although Ansible provides dynamic inventories for both ec2 and gce, we developed our own in order to have more freedom for grouping hosts. This has proved to be very useful, especially during the migration from AWS to GCE, when you can have machines from both clouds inside the same group.
This custom inventory script also allows us to include a host in multiple groups by specifying a comma-separated value in a tag (we named it ansible_groups). This seems to be a popularly requested feature (1, 2, 3).
The script also automatically creates some dynamic groups that are useful for specifying group_vars.
We use two main dimensions for selecting hosts: project and role. Project can have values like gmail, comcast or twc, while role's values can be haproxy or couchdb. The inventory script takes care of generating composite groups in the form gmail_haproxy or twc_couchdb, giving us the freedom to target the haproxies in gmail like:
$ ansible gmail_haproxy -a 'cat /proc/loadavg'
or setting variables in any of the following group vars files:
group_vars/gmail.yml
group_vars/gmail_haproxy.yml
group_vars/haproxy.yml
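To make this concrete, here is a hedged sketch of what such an inventory script can look like. The hard-coded instance list stands in for a cloud API call, and the tag names (including ansible_groups) are illustrative; our real script is more involved.

```python
#!/usr/bin/env python
# Sketch of a custom dynamic inventory: builds project, role and composite
# project_role groups, plus any extra groups from an "ansible_groups" tag.
import json

# Stand-in for the instance metadata a real script would fetch from AWS/GCE.
INSTANCES = [
    {"name": "auth01.gmail.aws", "project": "gmail", "role": "haproxy",
     "ansible_groups": "frontend,monitored"},
    {"name": "db01.twc.aws", "project": "twc", "role": "couchdb",
     "ansible_groups": ""},
]

def build_inventory(instances):
    inventory = {}
    for inst in instances:
        groups = [inst["project"], inst["role"],
                  "%s_%s" % (inst["project"], inst["role"])]
        # Honor the comma-separated ansible_groups tag.
        groups += [g for g in inst["ansible_groups"].split(",") if g]
        for group in groups:
            inventory.setdefault(group, {"hosts": []})["hosts"].append(inst["name"])
    return inventory

if __name__ == "__main__":
    # Ansible calls the script with --list and expects JSON on stdout.
    print(json.dumps(build_inventory(INSTANCES)))
```

The composite gmail_haproxy group falls out of the same loop that builds the per-project and per-role groups, which is why targeting and group_vars both work on it for free.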
Limitations of Built-in EC2 Inventory
If the composite groups didn’t exist, selection could still be achieved by intersecting groups with the --limit option:

$ ansible --limit 'gmail:&haproxy' all -a 'cat /proc/loadavg'

or, with the built-in ec2.py (assuming that a Role tag exists and that the project gmail corresponds to the VPC vpc-badbeeff):

$ ansible -i ec2.py --limit 'vpc_id_vpc-badbeeff:&tag_Role_haproxy' all -a 'cat /proc/loadavg'

In contrast, however, there is no evident alternative for setting variables for hosts that are in both groups vpc_id_vpc-badbeeff and tag_Role_haproxy.
Playbooks and Roles
We use roles for everything and separate them between provision roles (for example couchdb) and deployment roles (for example migration-api).
- Provisioning roles take care of leaving the machine ready to run the required application.
- Deployment roles take care of deployment and rollback actions.
Dependencies between Roles
Right now, deployment roles have an explicit dependency on provisioning roles. For example, role migration-api depends on role django-server and role django-server depends on roles apache and django.
This model is useful because you can apply migration-api to a raw instance and have it adequately provisioned. However, it has the drawback that, once the machine is provisioned, running the provisioning role on every new deployment is mostly a waste of time.
Tags are slowly being added to the tree. It’s better not to abuse them and to keep things well organized; otherwise you might end up forgetting which tags to select and when.
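As a sketch of the dependency chain described above (the file paths follow Ansible’s standard role layout; the exact contents of our files may differ):

```yaml
# roles/migration-api/meta/main.yml -- illustrative
dependencies:
  - role: django-server

# roles/django-server/meta/main.yml -- illustrative
dependencies:
  - role: apache
  - role: django
```

With these meta files in place, applying migration-api pulls in apache and django transitively, which is exactly what makes a raw instance end up fully provisioned.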
Custom Modules

CouchDB exposes a REST API, giving us several options for managing users:

- use the command module combined with curl.
- use the uri module.
- write a custom module.
Writing a custom module can be daunting at first, but it pays off. The flexibility that Python gives you is not comparable to what you get by stringing together command or shell modules.
When it’s not trivial to guarantee idempotency with current modules, a custom one is the way to go.
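To illustrate, the heart of such a module is the idempotency decision: read the current state from CouchDB’s _users database, and only write when it differs from the desired state. A minimal sketch of that core, with the AnsibleModule boilerplate omitted and the helper names being ours, not from any real module:

```python
# Sketch of the idempotent core of a custom CouchDB user module.
# A real module would wrap this in AnsibleModule and report changed/ok.
import json
import urllib.error
import urllib.request

def get_user(base_url, name):
    """Return the user document from _users, or None if it does not exist."""
    url = "%s/_users/org.couchdb.user:%s" % (base_url, name)
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise

def needs_update(existing_doc, desired_roles):
    """Decide whether a PUT is required; this check is what makes the module idempotent."""
    if existing_doc is None:
        return True  # user does not exist yet
    return sorted(existing_doc.get("roles", [])) != sorted(desired_roles)
```

Running the module twice in a row then yields changed on the first run and ok on the second, which a pile of command/curl tasks cannot guarantee without extra work.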
Managing Secrets

We use git-crypt for managing secrets. It’s a general-purpose tool for selectively encrypting parts of a git repo. We had been using it long before ansible-vault became popular, and it has worked well.
With git-crypt, all sensitive data is kept inside a top-level directory called secrets/, and with the proper configuration you tell git-crypt to encrypt everything inside it. By combining the inventory_dir magic variable with the password lookup, you can express:

lookup('password', inventory_dir + '/secrets/' + project + '/haproxy-admin.key')
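Behind the scenes this relies on git-crypt’s standard .gitattributes mechanism; the rule looks roughly like this (the pattern assumes the secrets/ layout above):

```
secrets/** filter=git-crypt diff=git-crypt
```

Anything matching the pattern is transparently encrypted on commit and decrypted on checkout for holders of the key.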
Scripting Common Invocations
With time, your Ansible tree gains new features, modules and playbooks. Your inventory grows both in machines and in group names. It becomes more difficult to memorize the correct parameters you have to pass to ansible-playbook, especially if you don’t run it every day. In some cases it may be necessary to run more than one playbook to achieve some operational goal.
A simple solution to this is to store the appropriate parameters inside scripts that will call Ansible for you. For example:

$ cat bin/stop-gmail-migration-robots
#!/bin/sh
ansible-playbook --limit gmail -t stop email-robots.txt
Yes, it is as useful as it is simple.
Using Callback Plugins
One handy feature of Ansible is its callback plugins. You just drop a Python file in the ./callback_plugins directory containing a class called CallbackModule and the magic happens.

In that class, you can add code for Ansible to run whenever certain events take place. Examples of these events are:
- playbook_on_stats: called before printing the RECAP (source).
- runner_on_failed: called when a task fails (source).
- runner_on_ok: called when a task succeeds (source).
You can find more of them by looking at the examples or by searching for call_callback_module in the source code.
We have created our own callback plugins for logging and for enforcing usage policy.

Javier had the great idea of using callback plugins to log playbook executions, and he set up everything needed to make it work.
Each time a playbook is run, a plugin gathers relevant data such as:
- which playbook was run.
- against which hosts.
- who ran it.
- from where it was run.
- what was the final result.
- which revision of Ansible tree was active.
- an optional message (requested on the terminal).

It then sends everything to a REST service. The service takes care of writing it all into a database, which you can later query for things like who deployed migration-api in the gmail project between last Tuesday and Wednesday.
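A hedged sketch of such a logging plugin, against the pre-2.0 callback API that the events above belong to. The payload fields mirror the list, but the exact field names, and where the record gets POSTed, are illustrative:

```python
# Sketch of a logging callback plugin (pre-2.0 Ansible callback API).
# Dropped into ./callback_plugins/, Ansible discovers the CallbackModule class.
import getpass
import json
import socket

class CallbackModule(object):
    """Accumulates per-host results and emits one record at the end of the run."""

    def __init__(self):
        self.results = {"ok": [], "failed": []}

    def runner_on_ok(self, host, res):
        self.results["ok"].append(host)

    def runner_on_failed(self, host, res, ignore_errors=False):
        self.results["failed"].append(host)

    def playbook_on_stats(self, stats):
        # Fired just before the RECAP: assemble the record for the run.
        record = {
            "user": getpass.getuser(),
            "run_from": socket.gethostname(),
            "ok_hosts": self.results["ok"],
            "failed_hosts": self.results["failed"],
        }
        # A real plugin would POST this to the REST service instead of printing.
        print(json.dumps(record))
```

Because the plugin is passive, playbooks need no changes at all to be logged.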
Enforcing Usage Policy
Another useful application of callback plugins is enforcing that Ansible is run from an up-to-date repo. Before starting a playbook, a git fetch is run behind the scenes and the current branch is validated against a list of rules.
You can use this as you wish, for example, to enforce that HEAD is on top of origin/master and that the tree is clean.
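That rule reduces to two cheap git checks. A sketch, with function names that are ours (a real plugin would run this before the play starts and abort on failure):

```python
# Sketch of the "HEAD on top of origin/master, clean tree" policy check.
import subprocess

def git(*args):
    """Run a git command and return its trimmed stdout."""
    return subprocess.check_output(("git",) + args).decode().strip()

def is_fast_forward(remote_sha, merge_base_sha):
    # HEAD is on top of origin/master exactly when the merge-base of the
    # two refs is origin/master itself.
    return merge_base_sha == remote_sha

def check_policy():
    subprocess.check_call(["git", "fetch", "origin"])
    remote = git("rev-parse", "origin/master")
    base = git("merge-base", "HEAD", "origin/master")
    clean = git("status", "--porcelain") == ""   # empty output == clean tree
    return is_fast_forward(remote, base) and clean
```

If either condition fails, the plugin can refuse to run the playbook, so nobody deploys from a stale or dirty checkout.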
Overall, we are very satisfied with the tool. Everybody on the team uses it on a daily basis, and both Dev and Ops people commit to the same repo.
We would like to further investigate:
- Splitting the tree in two repos, provisioning and deployment: using the provisioning repo to generate images with the help of Packer, and letting the deployment side just assume that everything is in place.
- Using security group names to generate inventory groups (like ec2.py does). This would improve compatibility between AWS and GCE instances, as the latter only has one set of values (tags) for both tagging and firewalling instances.
- Considering vault for storing sensitive data.
- Replacing the scripts for common invocations with Python scripts that use Ansible as a library.