[originally published in ShuttleCloud’s blog]
It’s no secret that ShuttleCloud uses Ansible for managing its infrastructure, so we thought it would be useful to explain how we use it.
At ShuttleCloud, we take automation as a first order component of Software Engineering. We aim to have everything automated in production: application deployment, monitoring, provisioning, etc. Ansible is a key component in this process.
Our infrastructure is almost entirely in Amazon Web Services (AWS), although we are in the process of moving to Google Compute Engine (GCE). The stack is distributed across several regions in the US and Europe. Because we have SLAs of up to 99.7%, every component, service and database must operate in a High Availability manner.
We maintain a dedicated private network (VPC) for most of our clients (Gmail, Contacts+, Comcast, etc.). This adds complexity because we have to handle more separate infrastructure domains in the US regions of AWS. If we were able to share VPCs between clients, things would certainly be easier.
The migration platform is designed using a microservices architecture. We have a few of them:
- autheps: Takes care of requests and refreshes OAuth2 tokens.
- capabilities: Registry of supported providers and operations.
- migration-api: API exposing our migration service.
- migration-robot: Takes care of moving data between providers.
- security-locker: Storage of sensitive data.
- stats: Gathering and dispatching of KPI data.
These microservices are mostly written in Python, but there is some Ruby too.
The software stack is very diverse. Between development and operations we have to manage:
apache, celery, corosync, couchdb, django, dnsmasq, haproxy, mysql, nginx, openvpn, pacemaker, postgresql, prometheus, rabbitmq, selenium, strongswan.
In total, we manage more than 200 instances, all of them powered by Ubuntu LTS.
Naming Hosts and Dynamic Inventory
All our instances have a human-friendly, fully qualified domain name: a string like auth01.gmail.aws that is reachable from inside the internal network. These names are automatically generated based on the tags associated with the corresponding instances.
Although Ansible provides dynamic inventories for both ec2 and gce, we developed our own in order to have more freedom for grouping hosts. This has proved to be very useful, especially during the migration from AWS to GCE, when you can have machines from both clouds inside the same group.
This custom inventory script also allows us to include a host in multiple groups by specifying a comma-separated value in a tag (we named it ansible_groups). This seems to be a popularly requested feature (1, 2, 3).
The script also automatically creates some dynamic groups that are useful for specifying group_vars.
We use two main dimensions for selecting hosts: project and role. Project can have values like gmail, comcast or twc, while role's values can be haproxy or couchdb. The inventory script takes care of generating composite groups in the form gmail_haproxy or twc_couchdb, giving us the freedom to target the haproxies in gmail like:
$ ansible gmail_haproxy -a 'cat /proc/loadavg'
or setting variables in any of the following group vars files:
group_vars/gmail.yml
group_vars/gmail_haproxy.yml
group_vars/haproxy.yml
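To make this concrete, here is a hedged sketch of what such an inventory script can look like. The hard-coded instance list stands in for a cloud API call, and the tag names (including ansible_groups) are illustrative; our real script is more involved.

```python
#!/usr/bin/env python
# Sketch of a custom dynamic inventory: builds project, role and composite
# project_role groups, plus any extra groups from an "ansible_groups" tag.
import json

# Stand-in for the instance metadata a real script would fetch from AWS/GCE.
INSTANCES = [
    {"name": "auth01.gmail.aws", "project": "gmail", "role": "haproxy",
     "ansible_groups": "frontend,monitored"},
    {"name": "db01.twc.aws", "project": "twc", "role": "couchdb",
     "ansible_groups": ""},
]

def build_inventory(instances):
    inventory = {}
    for inst in instances:
        groups = [inst["project"], inst["role"],
                  "%s_%s" % (inst["project"], inst["role"])]
        # Honor the comma-separated ansible_groups tag.
        groups += [g for g in inst["ansible_groups"].split(",") if g]
        for group in groups:
            inventory.setdefault(group, {"hosts": []})["hosts"].append(inst["name"])
    return inventory

if __name__ == "__main__":
    # Ansible calls the script with --list and expects JSON on stdout.
    print(json.dumps(build_inventory(INSTANCES)))
```

The composite gmail_haproxy group falls out of the same loop that builds the per-project and per-role groups, which is why targeting and group_vars both work on it for free.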
Limitations of Built-in EC2 Inventory
If the composite groups didn’t exist, selection could still be achieved by intersecting groups with the --limit option:

$ ansible --limit 'gmail:&haproxy' all -a 'cat /proc/loadavg'

or, with the built-in ec2.py (assuming that a Role tag exists and that the project gmail corresponds to the VPC vpc-badbeeff):

$ ansible -i ec2.py --limit 'vpc_id_vpc-badbeeff:&tag_Role_haproxy' all -a 'cat /proc/loadavg'

In contrast, however, there is no evident alternative for setting variables for hosts that are in both groups vpc_id_vpc-badbeeff and tag_Role_haproxy.
Playbooks and Roles
We use roles for everything and separate them between provision roles (for example couchdb) and deployment roles (for example migration-api).
- Provisioning roles take care of leaving the machine ready to run the required application.
- Deployment roles take care of deployment and rollback actions.
Dependencies between Roles
Right now, deployment roles have an explicit dependency on provisioning roles. For example, role migration-api depends on role django-server and role django-server depends on roles apache and django.
This model is useful because you can apply migration-api to a raw instance and have it adequately provisioned. However, it has the drawback that, once the machine is provisioned, running the provisioning role on every new deployment is mostly a waste of time.
Tags are slowly being added to the tree. It’s better not to abuse them and to keep things well organized; otherwise you might end up forgetting which tags to select and when.
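As a sketch of the dependency chain described above (the file paths follow Ansible’s standard role layout; the exact contents of our files may differ):

```yaml
# roles/migration-api/meta/main.yml -- illustrative
dependencies:
  - role: django-server

# roles/django-server/meta/main.yml -- illustrative
dependencies:
  - role: apache
  - role: django
```

With these meta files in place, applying migration-api pulls in apache and django transitively, which is exactly what makes a raw instance end up fully provisioned.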
Custom Modules

CouchDB exposes a REST API, giving us several options for managing users:

- use the command module combined with curl.
- use the uri module.
- write a custom module.
Writing a custom module can be daunting at first, but it pays off. The flexibility that Python gives you is not comparable to what you get by stringing together command or shell modules.
When it’s not trivial to guarantee idempotency with current modules, a custom one is the way to go.
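To illustrate, the heart of such a module is the idempotency decision: read the current state from CouchDB’s _users database, and only write when it differs from the desired state. A minimal sketch of that core, with the AnsibleModule boilerplate omitted and the helper names being ours, not from any real module:

```python
# Sketch of the idempotent core of a custom CouchDB user module.
# A real module would wrap this in AnsibleModule and report changed/ok.
import json
import urllib.error
import urllib.request

def get_user(base_url, name):
    """Return the user document from _users, or None if it does not exist."""
    url = "%s/_users/org.couchdb.user:%s" % (base_url, name)
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise

def needs_update(existing_doc, desired_roles):
    """Decide whether a PUT is required; this check is what makes the module idempotent."""
    if existing_doc is None:
        return True  # user does not exist yet
    return sorted(existing_doc.get("roles", [])) != sorted(desired_roles)
```

Running the module twice in a row then yields changed on the first run and ok on the second, which a pile of command/curl tasks cannot guarantee without extra work.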
Managing Secrets

We use git-crypt for managing secrets. It’s a general-purpose tool for selectively encrypting parts of a git repo. We had been using it long before ansible-vault became popular, and it has worked well.
With git-crypt, all sensitive data is kept inside a top-level directory called secrets/, and with the proper configuration you tell git-crypt to encrypt everything inside it. By combining the inventory_dir magic variable with the password lookup, you can express:

lookup('password', inventory_dir + '/secrets/' + project + '/haproxy-admin.key')
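Behind the scenes this relies on git-crypt’s standard .gitattributes mechanism; the rule looks roughly like this (the pattern assumes the secrets/ layout above):

```
secrets/** filter=git-crypt diff=git-crypt
```

Anything matching the pattern is transparently encrypted on commit and decrypted on checkout for holders of the key.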
Scripting Common Invocations
With time, your Ansible tree gains new features, modules and playbooks. Your inventory grows both in machines and in group names. It becomes more difficult to memorize the correct parameters you have to pass to ansible-playbook, especially if you don’t run it every day. In some cases it may be necessary to run more than one playbook to achieve some operational goal.
A simple solution to this is to store the appropriate parameters inside scripts that will call Ansible for you. For example:

$ cat bin/stop-gmail-migration-robots
#!/bin/sh
ansible-playbook --limit gmail -t stop email-robots.txt
Yes, it is as useful as it is simple.
Using Callback Plugins
One handy feature of Ansible is its callback plugins. You just drop a Python file in the ./callback_plugins directory containing a class called CallbackModule and the magic happens.

In that class, you can add code for Ansible to run whenever certain events take place. Examples of these events are:
- playbook_on_stats: called before printing the RECAP (source).
- runner_on_failed: called when a task fails (source).
- runner_on_ok: called when a task succeeds (source).
You can find more of them by looking at the examples or by searching for call_callback_module in the source code.
We have created our own callback plugins for logging and for enforcing usage policy.

Javier had the great idea of using callback plugins to log playbook executions, and he set up everything needed to make it work.
Each time a playbook is run, a plugin gathers relevant data such as:
- which playbook was run.
- against which hosts.
- who ran it.
- from where it was run.
- what was the final result.
- which revision of Ansible tree was active.
- an optional message (requested on the terminal).

It then sends everything to a REST service. The service takes care of writing it all into a database, which you can later query for things like who deployed migration-api in the gmail project between last Tuesday and Wednesday.
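A hedged sketch of such a logging plugin, against the pre-2.0 callback API that the events above belong to. The payload fields mirror the list, but the exact field names, and where the record gets POSTed, are illustrative:

```python
# Sketch of a logging callback plugin (pre-2.0 Ansible callback API).
# Dropped into ./callback_plugins/, Ansible discovers the CallbackModule class.
import getpass
import json
import socket

class CallbackModule(object):
    """Accumulates per-host results and emits one record at the end of the run."""

    def __init__(self):
        self.results = {"ok": [], "failed": []}

    def runner_on_ok(self, host, res):
        self.results["ok"].append(host)

    def runner_on_failed(self, host, res, ignore_errors=False):
        self.results["failed"].append(host)

    def playbook_on_stats(self, stats):
        # Fired just before the RECAP: assemble the record for the run.
        record = {
            "user": getpass.getuser(),
            "run_from": socket.gethostname(),
            "ok_hosts": self.results["ok"],
            "failed_hosts": self.results["failed"],
        }
        # A real plugin would POST this to the REST service instead of printing.
        print(json.dumps(record))
```

Because the plugin is passive, playbooks need no changes at all to be logged.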
Enforcing Usage Policy
Another useful application of callback plugins is enforcing that Ansible is run from an up-to-date repo. Before starting a playbook, a git fetch is run behind the scenes and the current branch is validated against a list of rules.
You can use this as you wish, for example, to enforce that HEAD is on top of origin/master and that the tree is clean.
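That rule reduces to two cheap git checks. A sketch, with function names that are ours (a real plugin would run this before the play starts and abort on failure):

```python
# Sketch of the "HEAD on top of origin/master, clean tree" policy check.
import subprocess

def git(*args):
    """Run a git command and return its trimmed stdout."""
    return subprocess.check_output(("git",) + args).decode().strip()

def is_fast_forward(remote_sha, merge_base_sha):
    # HEAD is on top of origin/master exactly when the merge-base of the
    # two refs is origin/master itself.
    return merge_base_sha == remote_sha

def check_policy():
    subprocess.check_call(["git", "fetch", "origin"])
    remote = git("rev-parse", "origin/master")
    base = git("merge-base", "HEAD", "origin/master")
    clean = git("status", "--porcelain") == ""   # empty output == clean tree
    return is_fast_forward(remote, base) and clean
```

If either condition fails, the plugin can refuse to run the playbook, so nobody deploys from a stale or dirty checkout.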
Overall, we are very satisfied with the tool. Everybody on the team uses it on a daily basis, and both Dev and Ops people commit to the same repo.
We would like to further investigate:
- Splitting the tree in two repos, provisioning and deployment: using the provisioning repo to generate images with the help of Packer, and letting the deployment side just assume that everything is in place.
- Using security group names to generate inventory groups (like ec2.py does). This would improve compatibility between AWS and GCE instances, as the latter only has one set of values (tags) for both tagging and firewalling instances.
- Considering vault for storing sensitive data.
- Replacing the scripts for common invocations with Python scripts that use Ansible as a library.