December 25, 2016

Day 25 - Building a Team CLI with Python: One Alternative to ChatOps

Written by: Jan Ivar Beddari (@beddari)
Edited by: Nicholas Valler (@nvaller)


ChatOps is a great idea. Done right, it creates a well-defined collaborative
space where the barriers to entry are low and sharing improvements is quick.
Because of the immediate gains in speed and ease, ChatOps implementations have
a tendency to outgrow their original constraints. If this happens, the amount
of information and interrupts a team member is expected to filter and process
might become unmanageable. To further complicate the issue, reaching that limit
is a personal experience. Some might be fine with continuously monitoring three
dashboards and five chat rooms, and still get their work done. Others are more
sensitive and perhaps end up fighting feelings of guilt or incompetence.

Being sufficiently explicit about what and when information reaches team
members takes time to get right. For this reason, I consider shared filtering
to be an inherent attribute of ChatOps, and a very challenging problem to
solve. As humans think and reason differently given the same input, building
and encouraging collaboration around a visible ‘robot’ perhaps isn’t the best
idea.

Defining the Team CLI

As an engineer, taking one step back, what alternative approaches exist that
would bring a lot of the same gains as the ChatOps pattern? We want it to be
less intrusive and not as tied to communication, hopefully increasing the
attention and value given to actual human interaction in chat rooms. To me, one
possible answer is to provide a team centric command line interface. This is
a traditional UNIX-like command line tool to run in a terminal window,
installed across all team members’ environments. Doing this, we shift our focus
from sharing a centralized tool to sharing a decentralized one. In a
decentralized model, there is an (apparent) extra effort needed to signal or
interrupt the rest of the team. This makes the operation more conscious, which
is a large win.

With a distributed model, where each team member operates in their own context,
a shared cli gives the opportunity to streamline work environments beyond the
capabilities of a chatbot API.

Having decided that this is something we’d like to try, we continue defining a
requirements list:

  • Command line UX similar to existing tools
  • Simple to update and maintain
  • Possible to extend very easily

There’s nothing special or clever about these three requirements. Simplicity is
the non-listed primary goal, using what experience we have to try getting
something working quickly. To further develop these ideas we’ll break down the
list and try to pinpoint some choices we’re making.

Command line UX similar to existing tools

Ever tried sharing a folder full of scripts using git? Scripts don’t really
need docs, and by reading git commits everyone can follow along with updates to
the folder, right? No. It just does not work. Shared tooling needs constraints.
Just pushing /usr/local/bin into git will leave people frustrated at the lack
of coherency. As the cognitive load forces people into forking their own
versions of each tool or script, any gains you were aiming for by sharing them
are lost.

To overcome this we need standards. It doesn’t have to involve much work, as we
already mostly agree on what a good cli UX is - something similar to well-known
tools we already use. Thus we should be able to quickly set some rules and move
on:

  • A single top level command tcli is the main entry point of our tool
  • All sub-commands are modules organized semantically using one of the two
    following syntax definitions:

    tcli module verb arguments
    tcli module subject verb arguments

  • Use of options is not defined but every module must implement --help

Unlike a folder of freeform scripts, this is a strict standard. But even so,
the standard is easy to understand and reason about. Its purpose is to create
just enough order and consistency to make sharing and reuse within our team
possible.

Simple to update and maintain

Arguably also a part of the UX are updates and maintenance. A distributed
tool shared across a team needs to be super simple to maintain and update. As a
guideline, anything more involved than running a single command would most
likely be off-putting. Having the update process stay out of any critical usage
paths is equally important. We can’t rely on a tool that blocks to check a
remote API for updates in the middle of a run. That would break our most valued
expectation - simplicity. To solve this with a minimal amount of code, we could
reuse some established external mechanism to do update checks.

  • Updates should be as simple as possible, ideally git pull-like.
  • Don’t break expectations by doing calls over the network, shell out to
    package managers or similar.
  • Don’t force updates, stay out of any critical paths.
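To make the git pull-like idea concrete, here is a minimal sketch (the function
name and the idea of updating from the tool's own checkout are assumptions, not
part of the design above): updating is an explicit command the user runs, never
an implicit network call during normal use.

```python
# Minimal sketch: updating is one explicit command. With an editable
# install the code lives in a git checkout, so updating is 'git pull'.
import subprocess

def self_update(repo_dir):
    """Run a fast-forward-only git pull in the tool's own checkout."""
    return subprocess.call(['git', 'pull', '--ff-only'], cwd=repo_dir)
```

Returning the exit code keeps the caller in control; nothing here ever runs in
the middle of a normal command invocation.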

Possible to extend very easily

Extending the tool should be as easy as possible and is crucial to its long
term success and value. Typically there’s a large amount of hidden specialist
knowledge in teams. Using a collaborative command line tool could help share
that knowledge if the barrier to entry is sufficiently low. In practice, this
means that the main tool must be able to discover and run a wide variety of
extensions or plugins delivered using different methods, even across language
platforms. A great example of this is how it is possible to extend git with
custom sub-commands just by naming them git-my-command and placing them in
your path.
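A quick illustration of that pattern (the sub-command name and greeting here
are made up): any executable named git-hello found on PATH becomes runnable as
git hello.

```shell
# Create a hypothetical 'hello' sub-command for git.
mkdir -p "$HOME/bin"
cat > "$HOME/bin/git-hello" <<'EOF'
#!/bin/sh
echo "Hello from a custom git sub-command!"
EOF
chmod +x "$HOME/bin/git-hello"
export PATH="$HOME/bin:$PATH"

git hello
```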

Another interesting generic extension point to consider is running Docker
images as plugin modules in our tool. There’s a massive amount of tooling
already packaged that we’d be able to reuse with little effort. Just be sure to
maintain your own hub of canonical images from a secure source if you are doing
this for work.

Our final bullet point list defining goals for extensions:

  • The native plugin interface must be as simple as possible
  • Plugins should be discovered at runtime
  • Language and platform independent external plugins are a first class use case

Summoning a Python skeleton

Having done some thinking to define what we want to achieve, it’s time to start
writing some code. But why Python? What about Ruby, or Golang? The answer is
disappointingly simple: for the sake of building a pluggable cli tool, it does
not matter much what language we use. Choose the one that feels most
comfortable and start building. Due to our design choice to be able to plug
anything, reimplementing the top command layer in a different language later
would not be hard.

So off we go using Python. Anyone having spent time with it would probably
recognize some of the projects listed on the site, all of
them highly valued with great documentation available. When I learned that it
also hosts a cli library called Click, I was intrigued by its description:

“Click is a Python package for creating beautiful command line interfaces in a
composable way with as little code as necessary.”

Sounds perfect for our needs, right? Again, the documentation is great as it
doesn’t assume anything and provides ample examples. Let’s try to get ‘hello
tcli’ working!

Hello tcli!

The first thing we’ll need is a working Python dev environment. That could mean
using a virtualenv, a tool and method used for separating libraries and
Python runtimes. If just starting out, you could run virtualenvwrapper, which
further simplifies managing these envs. Of course you could also just skip all
this and go with using Vagrant, Docker or some other environment, which will be
just fine. If you need help with this step, please ask!

Let’s initialize a project, here using virtualenvwrapper:

mkvirtualenv tcli
mkdir -p ~/sysadvent/tcli/tcli
cd ~/sysadvent/tcli
git init

Then we’ll create the three files that are our skeleton implementation. First,
our main function cli() that defines our topmost command:


import click


@click.group()
def cli():
    """tcli is a modular command line tool wrapping and simplifying common
    team related tasks."""

Next, an empty __init__.py file to mark the tcli sub-directory as
containing Python packages:

touch tcli/__init__.py

Last we’ll add a setup.py file that describes our Python package and its
dependencies:

from setuptools import setup, find_packages

setup(
    name='tcli',
    version='0.1.0',
    packages=find_packages(),
    # minimal dependencies needed to run
    install_requires=['click'],
    # wire the tcli wrapper executable to our cli() function
    # (the module path assumes the code above lives in tcli/cli.py)
    entry_points='''
        [console_scripts]
        tcli=tcli.cli:cli
    ''',
)

The resulting file structure should look like this:

tree ~/sysadvent/
~/sysadvent/
└── tcli
    ├── setup.py
    └── tcli
        ├── __init__.py
        └── cli.py

That’s all we need for our ‘hello tcli’ implementation. We’ll install our newly
crafted Python package as being editable - this just means we’ll be able to
modify its code in-place without having to rerun pip:

pip install --editable $PWD

pip will read our setup.py file and first install the minimal needed
dependencies listed in the install_requires array. You might know another
mechanism for specifying Python deps using requirements.txt, which we will not
use here. Last, it installs a wrapper executable named tcli pointing to our
cli() function. It does this using the configuration values found under
entry_points, which are documented in the Python Packaging User Guide.
Be warned that Python packaging and distribution is a large and sometimes
painful subject. Outside internal dev environments I highly recommend
simplifying your life by using fpm.

That should be all. If the stars aligned correctly, we’re now ready for the
inaugural tcli run in our shell. It will show a help message and exit:

(tcli) beddari@mio:~/sysadvent/tcli$ tcli
Usage: tcli [OPTIONS] COMMAND [ARGS]...

  tcli is a modular command line tool wrapping and simplifying common team
  related tasks.

Options:
  --help  Show this message and exit.

Not bad!

Adding commands

As seen above, the only thing we can do so far is specify the --help option,
which is also done by default when no arguments are given. Going back to our
design, remember that we decided to allow only two specific UX semantics in our
command syntax. Add the following code below the cli() function:

@cli.group()
def christmas():
    """This is the christmas module."""


@christmas.command()
@click.option('--count', default=1, help='number of greetings')
@click.argument('name')
def greet(count, name):
    for x in range(count):
        click.echo('Merry Christmas %s!' % name)

At this point, we should treat the @sysadvent
team to the number of greetings we think they deserve:

tcli christmas greet --count 3 "@sysadvent team"

The keys to understanding what is going on here are the @cli.group() and
@christmas.command() lines: greet() is a command belonging to the
christmas group, which in turn belongs to our top level click group. The
Click library uses decorators, a common Python pattern, to achieve this.
Spending some hours with the Click documentation, we should now be able to
write quite complex command line tools using minimal Python boilerplate code.

In our design, we defined goals for how we want to be able to extend our
command line tool, and that is where we’ll go next.

Plugging it together

The Click library is quite popular and there’s a large number of
third party extensions available. One such plugin is click-plugins, which
we’ll use to make it possible to extend our main command line script. In Python
terms, plugins can be separate packages that we’ll be able to discover and load
via setuptools entry_points. In non-Python terms this means we’ll be able to
build a plugin using a separate codebase and have it publish itself as
available for the main script.
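To sketch what such a plugin package's registration might look like (the
package name matches the tcli-oncall example used later, but the module path
and the entry point group name are assumptions for illustration), its setup.py
would publish a Click group under the group name the main script scans:

```python
# Hypothetical plugin package setup.py: the entry point group name must
# match the one the main tcli script iterates over at runtime.
from setuptools import setup, find_packages

setup(
    name='tcli-oncall',
    version='0.1.0',
    packages=find_packages(),
    install_requires=['click', 'click-plugins'],
    entry_points='''
        [tcli.plugins]
        oncall=tcli_oncall.cli:oncall
    ''',
)
```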

We want to make it possible for external Python code to register at the
module level of the UX semantics we defined earlier. To make our main tcli
script dynamically look for registered plugins at runtime we’ll need to modify
it a little:

The first lines of our main module should now look like this:

from pkg_resources import iter_entry_points

import click
from click_plugins import with_plugins


# plugins register under this entry point group name (name assumed here)
@with_plugins(iter_entry_points('tcli.plugins'))
@click.group()
def cli():
Next, we’ll need to add click-plugins to the install_requires array in our
setup.py file. Having done that, we reinstall our project using the same
command originally used:

pip install --editable $PWD

Reinstalling is needed here because we’re changing not only the code, but also
the Python package setup and dependencies.

To test if our new plugin interface is working, clone and install the example
tcli-oncall project:

cd ~/sysadvent/
git clone
cd tcli-oncall
pip install --editable $PWD

After installing, we have some new example dummy commands and code to play
with:
tcli oncall take "a bath"

Take a look at the setup.py file and the tcli_oncall package in this project
to see how it works.

There’s bash in my Python!

The plugin interface we defined above obviously only works for native Python
code. An important goal for us is however to integrate and run any executable
as part of our cli as long as it is useful and follows the rules we set. In
order to do that, we’ll replicate how git extensions work to add commands
that appear as if they were built-in.

We create a new file tcli/utils.py in our project and add the following code
(adapted from this gist) to it:


import os
import re
import itertools
from stat import S_IMODE, S_ISREG, ST_MODE


def is_executable_posix(path):
    """Whether the file is executable, based on a helper from the stdlib."""
    try:
        st = os.stat(path)
    except os.error:
        return None

    isregfile = S_ISREG(st[ST_MODE])
    isexemode = (S_IMODE(st[ST_MODE]) & 0111)
    return bool(isregfile and isexemode)


def canonical_path(path):
    return os.path.realpath(os.path.normcase(path))

The header imports some modules we’ll need, and next follow two helper
functions. The first checks if a given path is an executable file; the second
normalizes paths by resolving any symlinks in them.

Next we’ll add a function to the same file that uses these two helpers to
search through all directories in our PATH for executables matching a regex
pattern. The function returns a list of pairs of plugin names and executables
we’ll shortly be adding as modules in our tool:

def find_plugin_executables(pattern):
    filepred = re.compile(pattern).search
    filter_files = lambda files: itertools.ifilter(filepred, files)
    is_executable = is_executable_posix

    seen = set()
    plugins = []
    for dirpath in os.environ.get('PATH', '').split(os.pathsep):
        if os.path.isdir(dirpath):
            rp = canonical_path(dirpath)
            if rp in seen:
                continue
            seen.add(rp)

            for filename in filter_files(os.listdir(dirpath)):
                path = os.path.join(dirpath, filename)
                isexe = is_executable(path)

                if isexe:
                    cmd = os.path.basename(path)
                    name = filepred(cmd).group(1)
                    plugins.append((name, cmd))
    return plugins

Back in our main module, add another function and a loop that iterates
through the executables we’ve found to tie this together:


import tcli.utils
from subprocess import call


def add_exec_plugin(name, cmd):
    @cli.command(name=name, context_settings=dict(
        ignore_unknown_options=True,
    ))
    @click.argument('cmd_args', nargs=-1, type=click.UNPROCESSED)
    def exec_plugin(cmd_args):
        """Discovered exec module plugin."""
        cmdline = [cmd] + list(cmd_args)
        call(cmdline)


# regex filter for matching executable filenames starting with 'tcli-'
FILTER = "^%s-(.*)$" % __package__
for name, cmd in tcli.utils.find_plugin_executables(FILTER):
    add_exec_plugin(name, cmd)

The add_exec_plugin function adds a little bit of magic: it has an inner
function exec_plugin that represents the command we are adding, dynamically.
The function stays the same every time it is added; only its variable data
changes. Perhaps surprising is that the cmd variable is also addressable inside
the inner function. If you think this sort of thing is interesting, the topics
to read more about are scopes, namespaces and decorators.
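The closure behavior can be seen in isolation with a few lines of Python (a
standalone illustration, not part of the tcli code):

```python
# Each call to the outer function creates a new inner function that
# captures its own copy of the enclosing 'cmd' variable.
def make_command(cmd):
    def run():
        return 'would exec: %s' % cmd
    return run

ls_cmd = make_command('ls')
du_cmd = make_command('du')
print(ls_cmd())  # would exec: ls
print(du_cmd())  # would exec: du
```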

With a dynamic search and load of tcli- prefixed executables in place, we
should test if it works as it should. Make a simple wrapper script named
tcli-ls in your current directory, and remember to chmod +x it:

#!/bin/bash
ls "$@"

Running the tcli command will now show a new module called ‘ls’ which we can
run, adding the current directory to our PATH for the test:

export PATH=$PATH:.
tcli ls -la --color

Yay, we made ourselves a new way of calling ls. Perhaps time for a break ;-)

An old man and his Docker

As the above mechanism can be used to plug any wrapper in as a module, we now
have a quick way to hook Docker images as tcli modules. Here’s a simple
example that runs Packer, saved as an executable named tcli-builder in the
current directory, with $sha256 set to the image digest you trust:


#!/bin/bash
docker run --rm -it "hashicorp/packer@sha256:$sha256" "$@"

The last command below should run the entrypoint from hashicorp/packer,
and we’ve reached YALI (Yet Another Layer of Indirection):

export PATH=$PATH:.
tcli builder

Hopefully it is obvious how this can be useful in a team setting. However,
creating bash wrappers for Docker isn’t that great; it would be a better and
faster UX if we could discover what (local?) containers to load as tcli modules
automatically. One idea to consider is an implementation where tcli used data
from Docker labels with Label Schema. The org.label-schema.name and
org.label-schema.description labels would be of immediate use, representing
the module command name and a single line of descriptive text, suitable for the
top level tcli --help command output. Docker has an easy-to-use Python API
so anyone considering that as a project should be starting from there.
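As a sketch of the idea (a plain function, not the Docker API itself; the
wiring to the Docker Python API is deliberately left out), given each image's
label mapping we could derive module names and help texts like this:

```python
# Pick out the two Label Schema keys we would map to a tcli module name
# and its one-line help text; images without a name label are skipped.
def modules_from_labels(label_dicts):
    """label_dicts: iterable of dicts mapping label names to values."""
    modules = []
    for labels in label_dicts:
        name = labels.get('org.label-schema.name')
        if name:
            modules.append((name, labels.get('org.label-schema.description', '')))
    return modules
```

Feeding it a list like [{'org.label-schema.name': 'packer',
'org.label-schema.description': 'Builds machine images'}] would yield
[('packer', 'Builds machine images')], ready for the --help listing.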

Other plugin ideas

The scope of what we could or should be doing with the team cli idea is
interesting; bring your peers in and discuss! For me however, the fact that it
runs locally, inside our personal dev envs, is a large plus.

Here’s a short list of ideas to consider where I believe a team cli could bring
value:
  • git projects management, submodules replacement, templating

    tcli project list # list your teams git repositories, with descriptions
    tcli project create # templating
    tcli project [build|test|deploy]

    This is potentially very useful for my current team at $WORK. I’m planning
    to research how to do this with a control repo pattern.

  • Secrets management

    While waiting for our local Vault implementation team to drink all of their
    coffee, we can try making a consistent interface to (a subset of) the
    problem. Plugging in our current solution (or non-solution) would help, at
    least.

    If you don’t already have a gpg wrapper I’d look at blackbox.

  • Shared web bookmarks

    tcli web list
    tcli web open dashboard
    tcli web open licensing

    Would potentially save hours of searching in a matter of weeks ;-)

  • On-call management

    E.g as the example tcli-oncall Python plugin we used earlier.

  • Dev environment testing, reporting, management

    While having distributed dev environments is something I’m a big fan of, it
    is sometimes hard figuring out just WHAT your coworker is doing. Running
    tests in each team members context to verify settings, versioning and so on
    is very helpful.

    And really there’s no need for every single one of us to have our own,
    non-shared Golang binary unzip update routine.

Wait, what just happened?

We had an idea, explored it, and got something working! At this stage our team
cli can run almost anything and do so with an acceptable UX, a minimum of
consistency and very little code. Going further, we should probably add some
tests, at least to the functions in tcli.utils. Also, an even thinner design
of the core, where discovering executables is a plugin in itself, would be
better. If someone wants to help make this a real project and iron out these
wrinkles, please contact me!

You might have noticed I didn’t bring up much around the team cli versus
ChatOps arguments again. Truth is there is not much to discuss; I just wanted
to present this idea as an alternative, and the term ChatOps gets people
thinking about the correct problem sets. A fully distributed team would most
likely try harder to avoid centralized services than others. There is quite
some power to be had by designing your main automated pipeline to act just as
another team member, driving the exact same components and tooling as us
non-robots.

In more descriptive, practical terms: it could be you notifying your team ‘My
last build at commit# failed this way’ through standardized tooling, as
opposed to the more common model where all build pipeline logic and message
generation happens centrally.

December 24, 2016

Day 24 - Migrating from mrepo to reposync

Written by: Kent C. Brodie


We are a RedHat shop (in my case, many CentOS servers, and some RedHat as well). To support the system updates around all of that I currently use mrepo, an open source repository mirroring tool created by Dag Wieers. Mrepo is an excellent yum repository manager that has the ability to house, manage, and mirror multiple repositories. Sadly for many, mrepo’s days are numbered. Today, I’m going to cover why you may need to move from using mrepo, and how to use reposync in its place.

For me, mrepo has thus far done the job well. It allows you to set up and synchronize multiple repositories all on the same single server. In my case, I have been mirroring RedHat 6/7, and Centos 6/7 and it has always worked great. I’ve had this setup for years, dating back to RedHat 5.

While mirroring CentOS with mrepo is fairly trivial, mirroring RedHat updates requires a little extra magic: mrepo uses a clever “registration” process to register a system to RedHat’s RHN (Red Hat Network) service, so that the fake “registered server” can get updates.

Let’s say you have mrepo and wanted to set up a RedHat 6 repository. The key part of this process uses the “gensystemid” command, something like this:

gensystemid -u RHN_username -p RHN_password --release=6Server --arch=x86_64 /srv/mrepo/src/6Server-x86_64/

This command actually logs into RedHat’s RHN, and “registers” the server with RHN. Now that this fake-server component of mrepo is allowed to access RedHat’s updates, it can begin mirroring the repository. If you log into RedHat’s RHN, you will see a “registered server” that looks something like this:

Redhat RHN registered server screen


So what’s the issue? For you RedHat customers, if you’re still using RHN in any capacity, you hopefully have seen this notice by now:

Redhat RHN warning

Putting this all together: If you’re using mrepo to get updates for RedHat servers, that process is going to totally break in just over 7 months. mrepo’s functionality for RedHat updates depends on RedHat’s RHN, which goes away July 31st.

Finally, while mrepo is still used widely, it is worth noting that it appears continued development of mrepo ceased over four years ago. There have been a scattering of forum posts out there that mention trying to get mrepo to work with RedHat’s new subscription-management facility, but I never found a documented solution that works.


Reposync is a command-line utility that’s included with RedHat-derived systems as part of the yum-utils RPM package. The beauty of reposync is its simplicity. At the core, an execution of reposync will examine all of the repositories that the system you’re running it on has available, and downloads all of the included packages to local disk. Technically, reposync has no configuration. You run it, and then it downloads stuff. mrepo on the other hand, requires a bit of configuration and customization per repository.


You simply have to think about the setup differently. In our old model, we had one server that acted as the master repository for all things, whether it was RedHat 6, CentOS 7, whatever. This one system was “registered” multiple times, to mirror RPMS for multiple operating system variants and versions.

In the new model, we have to divide things up. You will need one dedicated server per operating system version. This is because any given server can only download RPMs specific to the operating system version that server is running. Fortunately with today’s world of hosting virtual machines, this isn’t an awful setup, it’s actually quite elegant. In my case, I needed a dedicated server for each of: RedHat 6, RedHat 7, CentOS 6, and CentOS 7.

For the RedHat servers, the elegant part of this solution deals with the fact that you no longer need to use “fake” system registration tools (aka gensystemid). You simply register each of the repository servers using RedHat’s preferred system registration: the “subscription-manager register” command that RedHat provides (with the retirement of RHN coming, the older rhn_register command is going bye-bye). mrepo, at present, does not really have a way to do this using RedHat’s “new” registration mechanism.


The best way for you to see how reposync works is to try it out. For this example, I highly recommend starting with a fresh new server. Because I want to show the changes that occur with subscribing the server to extra channel(s), I am using RedHat 6. You are welcome to use CentOS but note the directory names created will be different and by default the server will already be subscribed to the ‘extras’ channel.

For my example, perform the following steps to set up a basic reposync environment:
  • Install a new server with RedHat 6. A “Basic” install is best.
  • Register the server with RedHat via the subscription-manager command.
  • Do NOT yet add this server to any other RedHat channels.
  • Do NOT yet install any extra repositories like EPEL.
  • Install the following packages via YUM: yum-utils, httpd.
  • Remove /etc/httpd/conf.d/welcome.conf (the repository will not have an
    index web page, so by removing this, you’re not redirected to a default
    apache error document).
  • Ensure the system’s firewall is set so that you can reach this server via
    a web browser.

The simplest form of the reposync command will download all packages from all channels your system is subscribed to, and place them in a directory of your choosing.

The following command will download thousands of packages and build a full
local RedHat repository, including updates (reposync’s -p option sets the
download path):

reposync -p /var/www/html
The resulting directory structure will look like this:

/var/www/html
`-- rhel-6-server-rpms
    `-- Packages

If you point your web browser to http://repohost/rhel-6-server-rpms/Packages,
you should see all of your packages.

Use RedHat’s management portal to add this same server to RedHat’s “Optional packages” channel. For my example, I also installed the EPEL repository to my yum environment.

With the server now ‘subscribed’ to more stuff (RedHat’s optional channel and
EPEL), a subsequent reposync command like the one performed above now generates
the following:

/var/www/html
|-- epel
|-- rhel-6-server-optional-rpms
|   `-- Packages
`-- rhel-6-server-rpms
    `-- Packages

Note: EPEL is a simpler repo; it does not use a “Packages” subdirectory.

Hopefully this all makes sense now. The reposync command examines what repositories your server belongs to, and downloads them. Reminder: you need one ‘reposync’ server for each major operating system version you have, because each server can only download RPMs and updates specific to the version of the operating system the server is running.


One more step, actually. A repository directory full of RPM files is only the first of two pieces. The second is the metadata. The repository metadata is set up using the “createrepo” command. The output of this will be a “repodata” subdirectory containing critical files YUM requires to sort things out when installing packages.

Using a simple example and our first repository from above, let’s create the
metadata:

createrepo /var/www/html/rhel-6-server-rpms
After which our directory structure now looks like this:

/var/www/html
|-- epel
|-- rhel-6-server-optional-rpms
|   `-- Packages
`-- rhel-6-server-rpms
    |-- Packages
    `-- repodata

You will need to repeat the createrepo command for each of the individual repositories you have. Each time you use reposync, it should be followed by a subsequent execution of createrepo. The final step in all of this to keep current is the addition of cron job entries that usually run reposync and createrepo every night.
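A nightly cron entry could look like the following sketch (the paths, the
schedule, and the single-repository createrepo call are examples only):

```
# /etc/cron.d/reposync: sync at 02:30 every night, then rebuild metadata
30 2 * * * root reposync -p /var/www/html && createrepo --update /var/www/html/rhel-6-server-rpms
```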


Both reposync and createrepo have several command options. Here are some key options that I found useful and explanations as to when or why to use them.



reposync options:

--download-metadata

This downloads not only the RPMS, but also extra metadata that may be useful,
most important of which is an xml file that contains version information as
relates to updates. This totally depends on the particular repository you’re
syncing.


-m (--downloadcomps)

Also download the comps.xml file. The comps.xml file is critical to deal with
“grouping” of packages (for example, “yum groupinstall Development-tools” will
not function unless the repository has that file).


-n (--newest-only)

Only download the latest versions of each RPM. This may or may not be useful,
depending on whether you only want the absolute newest of everything, or
whether you want ALL versions of everything.



createrepo options:

-g <path to comps.xml>

If you have a comps.xml file for your repository, you need to tell createrepo
exactly where it is.

--workers N

The number of worker processes to use. This is super handy for repositories that have thousands and thousands of packages. It speeds up the createrepo process significantly.


--update

Do an “update” versus a full new repo. This drastically cuts down on the I/O
needed to create the final resulting metadata.


The main point of this SysAdvent article was to help those using mrepo today to wrap their head around reposync, and (no thanks to RedHat) why you need to move away from mrepo to something else like reposync if you’re an RHN user. My goal was to provide some simple examples and to help you understand how it works.

If you do not actually have official RedHat servers (for example, you only have CentOS), you may be able to keep using mrepo for quite some time, even though the tool has not had any active development in years. Clearly, a large part of mrepo’s functionality will break after 7/31/2017. Regardless of whether you’re using RedHat or CentOS, reposync is in my opinion an excellent and really simple alternative to mrepo. The only downside is that you need multiple servers (one for each OS version), but virtualization helps keep that down to a minimal expense.

December 23, 2016

Day 23 - That Product Team Really Brought The Room Together

Written by: H. “Waldo” Grunenwald (@gwaldo)
Edited by: Cody Wilbourn

There are plenty of articles talking about DevOps and Teamwork and Aligning Authority with Responsibility, but what does that look like in practice?

Having been on many different kinds of teams, and having run a Product Team, I will talk about why I think that Product Teams are the best way to create and run products sustainably.

Hey, Didn’t you start with “DevOps Doesn’t Work” last time?

Yes, (yes I did). And I believe every word of it. I consider Product Teams to be a definitive implementation of “Scaling DevOps” which so many people seem to struggle with when the number of people involved scales beyond a conference room.

To my mind, Product Teams are the best way to ensure that responsibility is aligned with authority, ensuring that the applications that you need are operated sustainably, and minimizes the likelihood that a given application becomes “Legacy”.

What do you mean “Legacy”?

There is a term that we use in this industry, but I don’t think that I’ve ever seen it be well-defined. In my mind, a Legacy Product is:

  1. Uncared For: Not under active development. Any releases are rare, using old patterns, and are often the result of a security update breaking functionality, causing a fire-drill of fixing dependencies.
  2. In an Orphanage: The people who are responsible for it don’t feel that they own it, but are stuck with it.

If there is a team that actively manages a legacy product, they might not be really equipped to make significant changes. Most of the time they are tasked only with keeping this product barely running, and may have a portfolio of other products in similar state. This “Legacy Team” might have some connotation associated with it of being “second-string” engineers, and it might be a dumping ground for many apps that aren’t currently in active development.

What are we coming from?

The assumed situation is that there is a product or service defined by
“business needs”. A decision is reached that these goals are worthwhile, and a
Project is defined. This may be a new product or service, or it may be features
added to an existing product or service. At some point this Project goes into
“Production”, where it is hopefully consumed by users, and hopefully it
provides value.

Here’s where things get tricky.

In most companies, the team that writes the product is not the same team that runs the product. This is because many companies organize themselves into departments. Those departments often have technical distinctions like “Development” or “Engineering”, and “Quality Assurance”, and an “Operations” and/or “Systems” groups. In these companies, people are aligned along job function, but each group is responsible for a phase of a product’s lifecycle.

And this is exactly where the heart of the problem is:

The first people who respond to a failure of the application aren’t the application’s developers, creating a business inefficiency:
Your feedback loop is broken.

As a special bonus, some companies organize their development into a so-called “Studio Model”, where a “studio” of developers work on one project. When they are done with that project, it gets handed off to a separate team for operation, and another team will handle “maintenance” development work. That original Studio team may never touch that original codebase again! If you have ever had to maintain or operate someone else’s software, you might well imagine the incentives that this drives, like assumptions that everything is available, and latency is always low!

See, the Studio Model is patterned after Movie and Video Game Studios. This can work well if you are releasing a product that doesn’t have an operational component. Studios make a lot of sense if you’re releasing a film. Some applications like single-player Games, and Mobile Apps that don’t rely on Services are great examples of this.

If your product does have an operational component, this is great for the people on the original Studio team, for whom work is an evergreen pasture. Unfortunately it makes things more painful for everyone who has to deal with the aftermath, including the customers. In reality it’s a really efficient way of turning out Legacy code.

Let’s face it, your CEO doesn’t care that you wrote code real good. They care that the features and products work well, and are available so that they bring in money. They want an investment that pays off.

Having Projects isn’t a problem. But funding teams based on Projects is problematic. You should organize around Products.

Ok, I’ll bite. What’s a Product Team?

Simply put, a Product Team is a team that is organized around a business problem. The Product Team is comprised of people such that it is largely Self-Contained, and collectively the team Owns its own Products. It is “long-lived”, as the intention behind it is that the team is left intact as long as the product is in service.

Individuals on the team will have “Specialties”, but “that’s not my job” doesn’t exist. The QA Engineer specializes in determining ways of assuring that software does what’s expected of it. They are responsible for writing useful test cases, but they are not limited to writing tests. Notably, they’re not solely responsible for writing tests. Likewise for Operations Engineers, who have specialties in operating software, infrastructure automation, and monitoring, but they aren’t limited to or solely responsible for those components. Likewise for Software Engineers…

But the Product Team doesn’t only include so-called “members of technical staff”. The Product Team may also need other expertise! Design might be an easy assumption, but perhaps you should have a team member from Marketing, or Payments Receivable, or anyone who has domain expertise in the product!

It’s not a matter of that lofty goal of “Everyone can do everything.” Even on Silo teams, this never works. This is “Everyone knows enough to figure anything out”, and “Everyone feels enough ownership to be able to make changes.”

The people on this team are on THIS team. Having or being an engineer on multiple teams is painful and will cause problems.

You mentioned “Aligning Authority with Responsibility” before…

Because the team is closely-knit and long-lived, certain understandings need to be in place. If you want a successful product and a sustainable lifecycle, the staffing needs to work like this:

  • Engineers have a one-to-one relationship to a Product Team.
  • Products have a one-to-one relationship with a Product Team.
  • A Product Team may have a one-to-many relationship with its Products.
  • A Product Team will have a one-to-one relationship with a Pager Rotation.
  • An Engineer will have a one-to-one relationship with their team’s Pager Rotation.

Simply put, having people split among many different teams sounds great in theory, but it never works out well for the individuals. The teams never seem to get the attention required from the Individual Contributors, and an Individual Contributor effectively doubles their number of bosses, having to appease them all.


Some developers might balk at being made to participate in the operation of the product that they’re building. This is a natural reaction. They’ve never had to do that before. Yes, exactly. That doesn’t mean that they shouldn’t have to; that is the “we’ve always done it this way” argument.

This topic has already been well-covered in another article in this year’s SysAdvent, in Alice Goldfuss’ “No More On-Call Martyrs”, itself well-followed up by @DBSmasher’s “On Being On-Call”.

In this regard, I say that if one’s sleep is on the line - if you are on the hook for the pager - you will take much more care with your assumptions when building a product than if that is someone else’s problem.

The last thing that amazes me is that this is a pattern that is well-documented in many of the so-called “Unicorn Companies”, whose practices many companies seek to emulate, but somehow “Developers-on-Call” is always argued to be “A Bridge Too Far”.

I would argue that this is one of their keystones.

Who’s in Charge

Before I talk about anything else, I have to make one thing perfectly clear. If you have a role in Functional Leadership (Engineering Manager, Operations Director, etc), your role will probably change.

In Product Teams, the Product Owner decides work to be done and priorities.

Within the team you have the skills that you need to create and run it, delegating functions that you don’t possess to other Product Teams. (DBA’s being somewhat rare, and “DB-as-a-Service” is somewhat common.)

Many Engineering and Operations managers were promoted because they were good at Engineering or Ops. Unfortunately it’s then that it sets in that, in Lindsay Holmwood’s words, “It’s not a promotion, it’s a career change”, and also addressed in this year’s SysAdvent article “Trained Engineers - Overnight Managers (or ‘The Art of Not Destroying Your Company’)” by Nir Cohen.

How many of you miss Engineering, but spend all of your time doing… stuff?

Under an org that leverages Product Teams, Functional Leaders have a fundamentally different role than they did before.

Leadership Roles

Under the Product Team paradigm, Product Managers are responsible for the work, while Functional Managers are responsible for passing on knowledge and overseeing the career growth of Individual Contributors.

Product Managers            Functional Managers
Owns Product                IC’s Professional Development
Product Direction           Coordinate Knowledge
Assign Work & Priority      Keeper of Culture
Hire & Fire from Team       Involved in Community
Decide Team Standards       Bullshit Detector / Voice of Reason

Product Managers

The Product Manager “Owns the Product”. They are ultimately responsible for the product successfully meeting business needs. Everything else is in support of that. I must stress that it isn’t necessary that a Product Manager be technical, though it does seem to help.

The product owner is the person who understands the business goals. With that knowledge and those stakes, they assign work and priorities such that they are aligned with those business goals.

Knowing the specific problems that they’re solving and the makeup of their team, they are responsible for hiring and firing from the team.

Because the Product Team is responsible for their own success, and availability (by which I mean, of course, the Pager), they get to make decisions locally. They get to decide themselves what technologies they want to use and suffer.

Finally, the Product Manager evangelizes their product for other teams to leverage, and helps to on-board them as customers.

Functional Managers

At this point, I expect that the Functional managers are wondering “well what do I do?” Functional Managers aren’t dictating what work is done anymore, but there is still a lot of value that they bring. Their job becomes The People.

I don’t know a single functional manager who has been able to attend to their people’s professional development like they feel that they should.

Since technology decisions are made within the Product Team, the Functional Management has a key role in coordinating knowledge between the members of their Community, keeping track of who’s-using-what, and the relevant successes and pitfalls. When one team is considering a new tool that another is using, or a team is struggling with a tech, the functional manager is well-equipped for connecting people.

Functional Managers are the Keepers of Culture, and are encouraged to be involved in Community. That community-building is both within the company and in their physical region.

Functional managers are crucial for Hiring into the company, and helping Product Managers with hiring skills that they aren’t strong with. For instance, I would run a developer candidate by a development manager for a sanity-check, but for a DBA, I’d be very reliant on a DBA Manager’s expertise and opinion!

Relatedly, the Functional Manager serves as a combination Bullshit Detector and Voice-of-Reason when there are misunderstandings between the Product Owners and their Engineers.

The Reason for Broad Standards

Broad standards are often argued for one of two main reasons: either for “hiring purposes”, where engineers may be swapped relatively interchangeably, or because there is a single Ops team responsible for many products, which doesn’t have the ability to cope with multiple ways of doing things. (Since any one Engineer might be called upon to fix many apps in the dark of the night.)

Unfortunately, app development can often be hampered by those Standards that don’t fit their case and needs.

Hahahaha I’m kidding! What really happens is that Dev teams clam up about what they’re doing. They subvert the “standards” and don’t tell anyone, either pleading ignorance or claiming that they can’t go back and rewrite because of a deadline. Best case is that they run a request for an “exemption” up the flagpole, where Ops gets overridden. And Operations is still left with a “standard” and a pile of “one-offs”.

Duplicate Effort

Another claimed reason for broad “Standards” is to “reduce the amount of duplicated effort”. While this is a great goal, again, it tends to cause more friction than is necessary.

The problem is the fallacy of assuming that the way a problem was solved for one team will be helpful to another. The solution may be helpful, but assuming that it will be, and making it mandatory, causes unnecessary friction.

At one company, my team ran ELK as a product for other teams to consume. A new team was spun up, and asked about our offerings, but also asked my opinion of them using a different service (an externally-hosted ELK-as-a-Service). I was thrilled, in fact! I wanted to see whether we were solving the problem in the best way, or even a good way, and to be able to come back later for some lessons learned!

Scaling Teams

At some point, your product is going to get bigger than everyone can keep in their head. It may be time to split up responsibilities into a new team. But where to draw boundaries? Interrogate them!

A trick that I learned a long time ago for testing your design in Object-Oriented Programming is to ask the object a question: “What are you?” or “What do you do?” If the answer includes an “And”, you have two things. This works well for evaluating both Class and Method design. (I think that this tidbit was from Sandi Metz’s “Practical Object-Oriented Design in Ruby” (aka “POODR”), which I was exposed to by Mark Menard of Enable Labs.)

What Doesn’t Work

Because this can be a change to how teams work, it’s important to be clear about the rules. If there is a misunderstanding about where work comes from, or who the individual contributors work for, or who decides the people who belong to what team, this begins to fall apart.

Having people work for multiple sets of managers is untenable.

Having people quit is an unavoidable problem in any company. Having a functional manager decide by themselves that they’re going to reassign one of your people away from you is worse, because they’re not playing by the rules.

WARNING: Matrix Organizations Considered Harmful

If someone proposes a Matrix Org, you need to be extremely careful. It’s important that you keep a separation of Church and State. Matrix Organizations instantly create a conflict between the different axes of managers, with the tension being centered on the individual contributor who just wants to do good work. A Matrix Org actively adds politics.

All Work comes from Product Management. Functional Management is for Individual Careers and Sharing Knowledge.

This shouldn’t be hard to remember, as the Functional Leaders shouldn’t have work to assign. But it will be hard, because they’ll probably have a lot of muscle-memory around prioritizing and assigning work.

Now, I’m sure a lot of you are skeptical about how a product team actually works. You might just not believe me.

If you properly staff a team, give them direction, authority, and responsibility, they will amaze you.

Getting Started

As with anything, the hardest thing to do is begin.

Identifying Products

An easy candidate is a new initiative for development that may be coming down the pipeline, but if you aren’t aware of any new products, you probably have many “orphaned” products already running within your environment.

As I discussed last year, there are plenty of ways of finding products that are critical, but not actually maintained by anyone. Common places to look are tools around development, like CI, SCM, and Wikis. Also commonly neglected are what I like to call “Insight Tools” like Logging, Metrics, and Monitoring/Alerting. These all tend to be installed and treated as appliances, not receiving any maintenance or attention unless something breaks. Sadly, it means that there’s a lot of value left on the table with these products!

Speaking with Leadership

If you say “I want to start doing Product Team”, they’re going to think of something along the lines of BizDev. A subtle but important difference is to say that you want to organize a cross-functional team, that is dedicated to the creation and long-term operation of the Product.

I don’t know why, but it seems that executives go gooey when they hear the phrase “cross-functional team”. So, go buzz-word away. While you’re at it, try to initiate some Thought Leadership and coin a term with them like “Product-Oriented Development”! (No, of course it doesn’t mean anything…)

What you’re looking for is a commitment to fund the product long-term. The idea is that your team will own and solve a set of related problems. The team is of “Your People”, and that becomes a “we”. Oddly enough, when you have a team focused and aligned together, you have really built a capital-T “Team”.


The Product Team should be intact and in development as long as the product is found to be necessary. When the product is retired, the product team may be disbanded, but nobody should be left with the check. Over time, the features should stabilize, the bugs will disappear, and the operation of the application should settle to a low level of effort, even including external updates.

That doesn’t mean that your engineers need to be farmed out to other teams; you should take on new work, and begin development of new products that aid in your space!


I believe that organizing work in Product Teams is one of the best ways to run a responsible engineering organization. By orienting your work around the Product, you are aligning your people to business needs, and the individuals will have a better understanding of the value of their work. By keeping the team size small, they know how the parts work and fit. By everyone operating the product, they feel a sense of ownership, and by being responsible for the product’s availability, they’re much more likely to build resilient and fault-tolerant applications!

It is for these reasons and more, that I consider Product Teams to be the definitive DevOps implementation.


I’d like to thank my friends for listening to me rant, and my editor Cody Wilbourn for their help bringing this article together. I’d also like to thank the SysAdvent team for putting in the effort that keeps this fun tradition going.

Contact Me

If you wish to discuss with me further, please feel free to reach out to me. I am gwaldo on Twitter and Gmail/Hangouts and Steam, and seldom refuse hugs (or offers of beverage and company) at conferences. Death Threats and unpleasantness beyond the realm of constructive Criticism may be sent to:

c/o FBI Headquarters  
935 Pennsylvania Avenue, NW  
Washington, D.C.  

December 22, 2016

Day 22 - Building a pipeline for Azure Deployments

Written by: Sam Cogan (@samcogan)
Edited by: Michelle Carroll (@miiiiiche)


In my SysAdvent article from last year I talked about automating deployments to Azure using Azure Resource Manager (ARM) and PowerShell Desired State Configuration (DSC). This year, I wanted to take this a few steps further and talk about taking this “infrastructure as code” concept, and using this to build a deployment pipeline for your Azure Infrastructure.

In a world where infrastructure is just another set of code, we have an exciting opportunity to apply techniques developers have been using for a long time to refine our deployment process. Developers have been using a version of the pipeline below for years, and we are able to apply the same techniques to infrastructure.


By implementing a deployment pipeline we gain a number of significant benefits:

  1. Better collaboration and sharing of work between distributed teams
  2. Increased security, reliability, and reusability of your code.
  3. Repeatable and reliable packaging and distribution of your code and artifacts.
  4. The ability to catch errors early, and fix them before time and money is wasted during deployment.
  5. Reliable and repeatable deployments.
  6. Absolute control over when a deployment occurs, because of the ability to add gating and security controls to the process.
  7. Moving closer to the concepts of continuous, automated deployment.

The process described in this article focuses on Azure deployments and the tools available for Microsoft Azure — however, this process could easily be applied to other platforms and tools.

Source Control

The first step, and one that I believe everyone writing any sort of code should be doing, is to make sure you are using some sort of version control. Once you’ve jumped in and started writing ARM templates and DSC files, you’ve got artifacts that could (and should) be in a version control system. Using version control helps us in a number of areas:

  1. Collaboration. As soon as more than one person is involved in creating, editing, or reviewing deployment files, we hit the problems of passing files around by email, knowing which is the most recent version, trying to merge conflicting changes etc. Version control is a simple, well-tested solution to this problem. It provides a single point of truth for everyone, and an easy way to collaboratively edit files and merge the results.
  2. Versioning. One of the big benefits of ARM and DSC is that the code is also the documentation of your infrastructure deployment. With version control, it is also the history of your infrastructure. You can easily see how your infrastructure changed over time, and even roll back to a previous version.
  3. Repository. A lot of the techniques we will discuss in this article require a central repository to store and access files. Your version control repository can be used for this, and provides a way to access a specific version of the files.

The choice of which version control system to use is really up to you and how you like to work (distributed vs Client/Server). If you work with developers, it is very likely they will already have a system in place, and it’s often easier to take advantage of the existing infrastructure. If you don’t have a system in place (and don’t want the overhead of managing one), then you can look at cloud providers like Github, Visual Studio Team Services or Bitbucket.
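As a small illustration of the repository as a single point of truth, assuming Git is the chosen system (any of the options above would work), tagging each infrastructure release lets later pipeline stages retrieve an exact version of the files:

```shell
# Create a throwaway repo holding one ARM template (names illustrative).
rm -rf /tmp/arm-demo && mkdir -p /tmp/arm-demo && cd /tmp/arm-demo
git init -q .
echo '{"resources": []}' > azuredeploy.json
git add azuredeploy.json
git -c user.email=ci@example.com -c user.name=ci commit -qm "Initial template"

# Tag the commit so build and deploy stages can reference it exactly.
git tag v1.0.0

# Later, any stage can retrieve precisely that version of the files.
git show v1.0.0:azuredeploy.json
```

The tag name becomes the version identifier that flows through the rest of the pipeline.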


Build

At first glance this may seem like a bit of an odd step: none of the script types we are using require compiling, so what is there to build? In this process, “build” is the transformation and composition of files into the appropriate format for later steps, and getting them to the right place. For example, my deployment system expects my ARM templates and DSC files to be delivered in a NuGet package, so I have a build step that takes those files, brings them together in the right folder structure, and creates a NuGet package. Another build step looks at the software installer files required for deployment and, if required, uploads these to Azure Blob storage. This could include MSI or EXE files for installers, but also things like NuGet packages for web applications.
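The staging and packaging step described above can be sketched as a short shell script. All file names and the folder layout are illustrative, and a tar archive stands in for the NuGet package (which is itself just a zip archive) so the example is self-contained:

```shell
# Stage ARM templates and DSC files into the layout the deployment
# system expects (all paths and file names here are illustrative).
rm -rf /tmp/build && mkdir -p /tmp/build/pkg/templates /tmp/build/pkg/dsc
echo '{"resources": []}' > /tmp/build/pkg/templates/azuredeploy.json
echo 'Configuration WebServer {}' > /tmp/build/pkg/dsc/webserver.ps1

# Package the staged files. A real pipeline might run `nuget pack`
# here instead, and upload installer files to Azure Blob storage.
tar -czf /tmp/build/deploy-1.0.0.tar.gz -C /tmp/build pkg
tar -tzf /tmp/build/deploy-1.0.0.tar.gz
```

The resulting versioned archive is the single artifact that the test and deploy stages consume.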

Again, the tools used for this stage are really up to you. At a very basic level you could use PowerShell or even Batch scripts to run this process. Alternatively, you could look at build tools like VSTS, TeamCity, or Jenkins to coordinate the process, which provides the additional benefits of:

  1. Many build systems come with pre-built processes that will do a lot of this work for you.
  2. It’s usually easy to integrate your build system with version control, so that when a new version is committed (or any other type of trigger) a build will be started.
  3. The build systems usually provide some sort of built-in reporting and notification.
  4. Build systems often have workflows that can be used as part of the testing and deployment process.


Test

This step is possibly the most alien for system administrators. Structured code testing is often left to developers, with infrastructure testing limited to things like Disaster Recovery and performance tests. However, because our infrastructure deployments are now effectively just more code, we can apply testing frameworks to that code and try to find problems before we start a deployment. Given that these deployments can take many hours, finding problems early can be a real benefit in terms of time and money.

There are various different testing frameworks out there that you could use, so I recommend picking the one you are comfortable with. My preference is Pester, the PowerShell testing framework. By using Pester, I can write my tests in PowerShell (with some Pester specific language added on), and I gain Pester’s ability to natively test PowerShell modules out of the box. I split my testing into two phases, pre-deployment and post-deployment testing.

Pre-Deployment Testing

As the name suggests, these are the tests that run before I deploy, and are aimed at catching errors in my scripts before I start a deployment. This can be a big time saver, especially when deployment scripts take hours. The tests I tend to run:

  1. Syntax Checks. I parse all my JSON and PowerShell files to look for simple syntax errors, missing commas, quotation marks, and other typos, to ensure that the scripts will make it through the parser. I have a simple Pester test that loops through all my JSON files and runs the PowerShell ConvertFrom-Json command — if this throws an error, I know it failed.
  2. Best Practices. To get an idea of how my PowerShell conforms to best practices, I run a Pester test that runs the PowerShell Script Analyzer and fails if there are any errors. These tests are based on the code in Ben Taylor’s “Script Analyzer” article.
  3. Unit Tests. Pester’s initial purpose was to run unit tests against PowerShell scripts, so I run any available unit tests before deployment. It’s not really possible to unit test DSC or ARM templates, but you can run tests against any DSC Resources. These can be downloaded from the PowerShell Gallery (and usually come with tests), or you can write tests for custom DSC resources. This article on DSC unit tests is a great starting point for building generic tests for DSC resources.
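The syntax-check gate in item 1 can also be sketched outside of Pester as a plain shell loop. Here `python3 -m json.tool` stands in for `ConvertFrom-Json` as the parser, and the file names are illustrative:

```shell
# Create one valid and one deliberately broken template (illustrative).
rm -rf /tmp/templates && mkdir -p /tmp/templates
echo '{"resources": []}' > /tmp/templates/good.json
echo '{"resources": [}'  > /tmp/templates/bad.json

# Try to parse every JSON file; remember the names of any that fail.
failed=""
for f in /tmp/templates/*.json; do
    if ! python3 -m json.tool "$f" > /dev/null 2>&1; then
        failed="$failed $f"
    fi
done
echo "failed:$failed"
```

If the failed list is non-empty, the pipeline stops before any deployment time is wasted.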

At this point I’m in a pretty good state to run the deployment. Assuming nothing fails, I can move on to my next set of tests. If any tests do fail, the pipeline stops and I don’t progress any further until the tests pass.

Post Deployment Testing

Once the deployment is complete, I want to be able to check that the environment I just deployed matches the state I defined in my ARM templates and DSC files. The declarative nature of these files means it should be in compliance, but it is good to confirm that nothing has gone wrong, and that what I thought I had modelled in DSC is actually what came out the other end. For example, I have a DSC script that installs IIS, so I have a corresponding test that checks that IIS has been installed. It looks like this when written in Pester:

Describe "Web Server" {
    It "Is Installed" {
        $Output = Get-WindowsFeature web-server
        $Output.InstallState | Should Be "Installed"
    }
}
You can be as simple or as complex as you want in the tests checking your infrastructure, based on your criteria for a successful deployment.


Deploy

The whole point of this exercise is to actually get some infrastructure deployed, which we could have done without any of the previous steps. Following this process gives us several benefits:

  1. We have a copy of the deployment files in a known-good state, at the specific version that we know we want to deploy.
  2. We have packaged these files in the right format for our deployment process, so there is no need to manually zip or arrange files.
  3. We have already performed necessary pre-deployment tasks like uploading installers, config files, etc.
  4. We have tested our deployment files to make sure they are syntactically correct, and know we won’t have to stop the deployment halfway through because of a missing comma.

At this point, you would kick off your ARM deployment process — this may mean downloading or copying the appropriate files from your build output, and running the New-AzureResourceGroupDeployment cmdlet. However, just like we used a build tool to tie our build process to a new version control check-in, we can also use deployment tools to tie our deployment process to a new build. Once a build completes, your deployment software can create a release, and even deploy it automatically. Some examples of tools that can do this include VSTS (again), Octopus Deploy, and Jenkins.

The Deployment Pipeline

Each of the steps we’ve discussed can be implemented on their own and will add benefit to your process straight away. As you gain familiarity with the techniques, you can layer on more steps until you have a pipeline that runs from code commit to deployment, similar to this:


Each step in the process will be a gate in your deployment, and if it fails you don’t move on to the next. This can be controlled by something as simple as some PowerShell or CMD scripts, or something as complex as VSTS or Jenkins — there’s no one right tool or process to use. The process is going to differ markedly depending on what you are trying to deploy, what opportunity there is for testing, which pieces are automated and which are done manually, and how agile your business is.
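At the simple end of that spectrum, the gating can be a small shell driver that stops at the first failed stage. The stage commands here are placeholders for the real build, test, and deploy steps:

```shell
# Run pipeline stages in order; stop the pipeline at the first failure.
run_stage() {
    name=$1; shift
    echo "==> $name"
    if ! "$@"; then
        echo "Stage '$name' failed; stopping pipeline." >&2
        return 1
    fi
}

# `true` is a placeholder for the real build/test/deploy commands.
run_stage "build"  true &&
run_stage "test"   true &&
run_stage "deploy" true &&
echo "pipeline succeeded"
```

Replacing `true` with the real stage scripts gives each step its gate: a non-zero exit code anywhere stops everything after it.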

Your ultimate goal might be to deploy your software and infrastructure on every new commit, for true continuous deployment. In many industries, this may not be practical. Even if that is the case, implementing this pipeline means you can still be in a position where you could deploy a new release from any commit. With the pipeline, you gain confidence in your code. What comes out at the end of the process should be in a known good state, and is ready to go if you want to deploy it.

To give you an example, in one of my environments each commit triggers an automatic deployment to the development environment, as this always needs to be the very latest version. However, the test environment needs to be more stable than the dev environment. While a release is created in the deployment tool, it is still deployed manually by the development team, with multiple human approvals. Moving the release through to production requires successful deployments to development and test — once that requirement is met, a member of the team can trigger a manual deployment.

Useful Resources

Pester Testing Framework

Continuous deployment with Visual Studio Team Services

Devops on Windows with Jenkins and Azure Resource Manager

December 21, 2016

Day 21 - Reusable Application Packaging With Habitat

Written by: Joshua Timberman (@jtimberman)
Edited by: Dan Webb (@dan_webb)


Habitat by Chef is a framework for building, deploying, and running any kind of application. Chef’s blog has a good introductory series on the concepts behind Habitat, and the Habitat tutorial is a good place to start learning. In this post, I’m going to take a look at how to integrate Habitat with Chef to deploy and run a package we’ve created for an example application.

The sample application is Heroku’s “ruby-rails-sample.” This application is a simple example, but it does make a database connection, so we can see how that works out, too. The Habitat organization on GitHub has a forked repository where we’ve made changes to add Habitat and integration with Chef and Chef Automate. Let’s take a look at those changes.

Habitat Background

First, we have created a habitat directory. This is where the Habitat configuration goes. In this directory, we have:

  • config
  • default.toml
  • hooks

The config directory contains handlebars.js templates for the configuration files for the application. In this case, we only have the Rails database.yml. It looks like this:

default: &default
  adapter: postgresql
  encoding: unicode
  pool: 5

production:
  <<: *default
  database: {{cfg.database_name}}
  username: {{cfg.database_username}}
  password: {{cfg.database_password}}
  host: {{cfg.database_host}}
  port: {{cfg.database_port}}

The next part of the Habitat configuration is the default.toml. This file contains all the default variables that will be used to configure the package. These variables are accessible in hooks and in the templates in config. The configuration file above replaces {{cfg.VARIABLE}} with the value from the default.toml. These values can also be dynamically changed at run time. The default.toml looks like this:

rails_binding_ip = ""
rails_port = 3000
database_name = "ruby-rails-sample_production"
database_username = "ruby-rails-sample"
database_password = ""
database_host = "localhost"
database_port = 5432
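“Dynamically changed at run time” means the defaults can be overridden without rebuilding the package. One way to do this in Habitat is to apply a TOML fragment to the running service group with `hab config apply`; only the keys being changed need to appear. The service group name, version number, and values below are illustrative assumptions:

```toml
# override.toml, applied with something like:
#   hab config apply ruby-rails-sample.default 1 override.toml
rails_port = 8080
database_host = "db.internal.example.com"
```

The supervisor re-renders the templates in config with the new values and restarts the service as needed.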

The hooks directory contains shell scripts that we call “hooks”. For the ruby-rails-sample application, we have two hooks, init, and run. The init hook is used to initialize the application. It looks like this:

rm -rf {{pkg.svc_static_path}}/*
cp -a {{pkg.path}}/static/* {{pkg.svc_static_path}}
cp {{pkg.svc_config_path}}/database.yml {{pkg.svc_static_path}}/config/database.yml
export GEM_HOME="{{pkg.svc_static_path}}/vendor/bundle/ruby/2.3.0"
export GEM_PATH="$(hab pkg path core/ruby)/lib/ruby/gems/2.3.0:$(hab pkg path core/bundler):$GEM_HOME"
export LD_LIBRARY_PATH="$(hab pkg path core/gcc-libs)/lib"
export PATH="$PATH:{{pkg.svc_static_path}}/bin"
export RAILS_ENV="production"
chown -R hab:hab {{pkg.svc_static_path}}
cd {{pkg.svc_static_path}}
exec 2>&1
if [[ ! -f {{pkg.svc_static_path}}/.migrations_complete ]]; then
echo "Running 'rake bootstrap' in ${PWD}"
exec chpst -u hab bin/rake bootstrap && touch {{pkg.svc_static_path}}/.migrations_complete

Hooks are templates too, just like the configuration file database.yml. The values that come from the {{pkg.VARIABLE}} variables are set by the package and are fully documented. To initialize the application, we remove the existing deployed version and copy the new version from the package to the “static” path, because we treat the extracted package as immutable. We copy the config file from the service config directory into the static path’s config directory because that is where Rails looks for database.yml. We then make sure the entire application is readable by the application runtime user, hab, and finally run the database migrations if they haven’t already completed.

Next, we have a run hook: in order to start our application, we need to set some environment variables so Rails knows where to find Ruby and the gems.

export GEM_HOME="{{pkg.svc_static_path}}/vendor/bundle/ruby/2.3.0"
export GEM_PATH="$(hab pkg path core/ruby)/lib/ruby/gems/2.3.0:$(hab pkg path core/bundler):$GEM_HOME"
export LD_LIBRARY_PATH="$(hab pkg path core/gcc-libs)/lib"
export RAILS_ENV="production"

cd {{pkg.svc_static_path}}

exec 2>&1
exec chpst -u hab ./bin/rails server -b {{cfg.rails_binding_ip}} -p {{cfg.rails_port}}

Rails itself doesn’t support dropping privileges, so we use the chpst command to run the application as the hab user.

Next, we have the plan itself. This is a shell script executed by Habitat’s build script. All the gory details of plans are documented in great detail, so I’ll only cover the highlights here. The plan is a Bourne Again Shell script that contains metadata variables starting with pkg_, and callback functions starting with do_. The build script supplies default values and behavior if your plan does not specify anything or override the functions. You can view the full plan in the GitHub repository.

First, we want to ensure that we execute the hooks as root by setting the pkg_svc_user and pkg_svc_group variables. This is because the init hook needs to create files and directories in a privileged root directory where the service runs.
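In the plan, that amounts to two variable assignments (a sketch; these are literal user and group names):

```shell
# In habitat/plan.sh: run this service's hooks as root,
# so the init hook can manage files under the privileged service directory.
pkg_svc_user="root"
pkg_svc_group="root"
```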


Habitat packages are built in a “cleanroom” we call a “studio.” This is a stripped-down environment that isn’t a full Linux distribution: it has just enough OS to build the package, relying on the build dependencies specified in pkg_build_deps. Many application build scripts assume that /usr/bin/env is available; Rubygems with native extensions are a common example, and we cannot possibly know what any arbitrary Rubygem is going to do. The first callback function we override is do_prepare(), where we symlink the Habitat core/coreutils package’s bin/env command to /usr/bin/env if that path does not exist. The symlink is removed after the package is built, in do_install(), later.

do_prepare() {
  build_line "Setting link for /usr/bin/env to 'coreutils'"
  [[ ! -f /usr/bin/env ]] && ln -s "$(pkg_path_for coreutils)/bin/env" /usr/bin/env
  return 0
}
The next function in the plan is do_build(). Many software packages in the open source world are built by doing ./configure && make, and the default do_build() function in Habitat does that as a “sane default.” However, Ruby on Rails applications are built using bundler to download all the Rubygem dependencies required to run the application. Habitat packages have their own runtime dependencies, specified with pkg_deps, and these packages are isolated away from the underlying OS in the /hab directory. This means we need to tell the build script where to find all the libraries we’re going to need to install the Rails application bundle. This includes any Rubygems that install native extensions, such as nokogiri or the PostgreSQL client, pg. The full, commented version is on GitHub.

do_build() {
  # solve compiling nokogiri native extensions!
  local _libxml2_dir=$(pkg_path_for libxml2)
  local _libxslt_dir=$(pkg_path_for libxslt)
  local _zlib_dir=$(pkg_path_for zlib)
  local _pgconfig="$(pkg_path_for postgresql)/bin/pg_config"
  export NOKOGIRI_CONFIG="--use-system-libraries --with-zlib-dir=${_zlib_dir} --with-xslt-dir=${_libxslt_dir} --with-xml2-include=${_libxml2_dir}/include/libxml2 --with-xml2-lib=${_libxml2_dir}/lib"
  bundle config build.nokogiri "${NOKOGIRI_CONFIG}"
  bundle config build.pg --with-pg-config="${_pgconfig}"
  bundle install --jobs 2 --retry 5 --path vendor/bundle --binstubs
}

The next callback function we define is do_install(). Similar to do_build(), most open source software does its installation with make install. This isn’t the case with our Rails application, so we need to define our own function. The intent here is to install the content into the correct prefix’s static directory so we can create the artifact. We also need to ensure that any binaries shipped in the application use the correct Ruby by rewriting their shebang lines. Finally, we clean up the symlink created in do_prepare().

do_install() {
  cp -R . "${pkg_prefix}/static"
  for binstub in ${pkg_prefix}/static/bin/*; do
    [[ -f $binstub ]] && sed -e "s#/usr/bin/env ruby#$(pkg_path_for ruby)/bin/ruby#" -i "$binstub"
  done
  if [[ $(readlink /usr/bin/env) = "$(pkg_path_for coreutils)/bin/env" ]]; then
    rm /usr/bin/env
  fi
}

Continuous Integration/Delivery

With the contents of our habitat directory in place, we’re ready to build the package. We do this using the “Habitat Studio”. While an application artifact can be built anywhere Habitat runs, we strongly recommend doing this in an automated CI/CD pipeline. In the case of this project, we’re going to automatically build the package and upload it to a Habitat package depot using Chef Automate. The .delivery directory in the project contains a Chef Automate “Build Cookbook” and a configuration file. This cookbook is run on worker “build” nodes in Chef Automate. For this project, we wrap the habitat-build cookbook, which does the heavy lifting. In .delivery/build-cookbook, it defines a dependency on habitat-build:

depends 'habitat-build'

This allows us to include the habitat-build cookbook’s recipes in the various phases within the pipeline. The phases we’re interested in are:

  • Lint: Check that we’re using good shell script practices in our plan using shellcheck.
  • Syntax: Verify that the script is valid Bash with bash.
  • Publish: Build the artifact and upload it to a Habitat Depot.
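The lint and syntax phases are easy to approximate locally before pushing a change. Roughly (shown here on a scratch plan file; the lint step assumes shellcheck is installed):

```shell
# Create a scratch plan file to demonstrate the checks on.
tmpdir=$(mktemp -d)
cat > "$tmpdir/plan.sh" <<'EOF'
pkg_name=ruby-rails-sample
pkg_origin=delivery-example
do_build() {
  return 0
}
EOF

# Syntax phase: bash -n parses the script without executing it.
bash -n "$tmpdir/plan.sh" && echo "syntax OK"

# Lint phase: shellcheck flags unquoted variables, bad test syntax, etc.
# shellcheck "$tmpdir/plan.sh"
```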

Each of these is a recipe in the project’s build cookbook. Importantly, we use the publish recipe in .delivery/build-cookbook/recipes/publish.rb:

include_recipe 'habitat-build::publish'

This recipe is in the habitat-build cookbook. When the build node runs the publish recipe, it loads the origin key from an encrypted data bag on the Chef Server used with Automate. Then, it executes the hab studio build command with an ephemeral workspace directory.

execute 'build-plan' do
  command "unset TERM; HAB_ORIGIN=#{origin} sudo -E #{hab_binary} studio" \
          " -r #{hab_studio_path}" \
          " build #{habitat_plan_dir}"
  cwd node['delivery']['workspace']['repo']
  live_stream true
end

This builds the package according to the plan, and creates a “results” directory containing the output Habitat artifact (a .hart file) and a file with information about the build. The recipe loads that file and stores the content in a data bag on the Chef Server, and it uploads the package to a Habitat Depot, in this case the publicly available one:

execute 'upload-pkg' do
  command lazy {
    "#{hab_binary} pkg upload" \
    " --url #{node['habitat-build']['depot-url']}" \
    " #{hab_studio_path}/src/results/#{artifact}"
  }
  environment(
    'HOME' => delivery_workspace,
    'HAB_AUTH_TOKEN' => depot_token
  )
  live_stream true
  sensitive true
end


Once we have a package, it’s time to deploy it. Generally speaking, this is as simple as installing Habitat on a system and then running hab start delivery-example/ruby-rails-sample. Of course, we want to automate that, and we do so in our Chef Automate pipeline. After the publish phase come provision and deploy phases, where we provision infrastructure (an EC2 node in this case) and run Chef on it to deploy the application. In our project, the .delivery/build-cookbook has the provision and deploy recipes to handle this; it’s outside the scope of habitat-build. Those recipes use chef-provisioning, but one could also write a recipe that uses Terraform or some other provisioning tool. The recipe that actually deploys the application is in the cookbooks/ruby-rails-sample directory at the top of the repository. This cookbook uses the habitat cookbook, which provides resources for installing Habitat, installing Habitat packages, and enabling Habitat services.

In the metadata.rb:

depends 'habitat'

There is only a default.rb recipe in the ruby-rails-sample cookbook. First, it loads details about the database that it connects to:

database_details = {
                     'host' => '',
                     'username' => '',
                     'password' => ''
                   }

The astute reader will note that these are empty values. For now, we don’t have any connection information handling here because these are secrets and should be managed appropriately. Previous iterations of this cookbook used a hardcoded plain text password, so we’re going to move away from that in a later version. For now, let’s step through the recipe.

Habitat is good about managing the application. However, there are still things we need a configuration management system to do on the node(s) that run the application. Notably, we’re going to ensure the hab user and group required to run the application are present.

execute('apt-get update') { ignore_failure true }

package 'iproute2'

group 'hab'

user 'hab' do
  group 'hab'
  home '/hab'
end

Next, we need to ensure that Habitat itself is installed, and that the application’s package is installed as well. Not only do we want them installed, but we want them to be the latest version available. We’re doing this in a continuous delivery pipeline, so we will assume everything is tested in an acceptance environment before it gets delivered. Right? :-)

hab_install 'habitat' do
  action :upgrade
end

hab_package 'delivery-example/ruby-rails-sample' do
  action :upgrade
end

Next, we’re going to manage the runtime configuration file for the application. Remember earlier we had the default.toml? That file contains default values. We actually want to modify the runtime, so we do that with a user.toml file in the service directory. Habitat creates this directory by default when it starts the application, but we need to make sure it exists first so the application starts properly the first time.

directory '/hab/svc/ruby-rails-sample' do
  recursive true
end

template '/hab/svc/ruby-rails-sample/user.toml' do
  variables database_details
  owner 'hab'
  group 'hab'
  mode '0600'
end

We’ve passed in the database_details hash we set up earlier. In a future version of this recipe, that hash will come from an encrypted data bag on the Chef Server, and all we need to do is change that data structure in the recipe. And later on, we can also change this project to deploy a PostgreSQL server using Habitat and use the supervisor’s gossip to discover that and configure the application automatically. But, that is an article for another time.

Finally, we want to enable and start the service with Habitat. When we run hab start delivery-example/ruby-rails-sample on a system, it will run the service in the foreground under the Habitat Supervisor. If we do that in a Chef recipe, Chef will hang here forever. The hab_service resource will set up the service to run as a systemd unit.

hab_service 'delivery-example/ruby-rails-sample' do
  action [:enable, :start]
end

After this recipe runs, we will have Habitat installed at the latest version, and our application package will be installed and running as the hab user. If we are working in an environment where we have multiple Rails applications to manage, we can use this pattern across those other projects and automate ourselves out of a job.

December 20, 2016

Day 20 - How to set and monitor SLAs

Written by: Emily Chang
Edited by: Ben Cotton (@funnelfiasco)

SLAs give concrete form to a worthy but amorphous goal: you should always be trying to improve the performance and reliability of your services. If you’re maintaining an SLA, collecting and monitoring the right metrics can help you set goals that are meant to improve performance, rather than simply policing it. In this post we’ll walk through the process of collecting data to define reasonable SLAs, and creating dashboards and alerts to help you monitor and maintain them over time.

The ABCs of SLAs, SLOs, and SLIs

Before we go any further, let us first define what the term SLA means within the context of this article. Throughout this post, we will refer to the terms SLA, SLO, and SLI as they are defined in Site Reliability Engineering, a book written by members of Google’s SRE team. In brief:
- SLA: Service Level Agreements are publicly stated or implied contracts with users—either external customers, or another group/team within your organization. The agreement may also outline the economic repercussions (e.g. service credits) that will occur if the service fails to meet the objectives (SLOs) it contains.
- SLO: Service Level Objectives are objectives that aim to deliver certain levels of service, typically measured by one or more Service Level Indicators (SLI).
- SLI: Service Level Indicators are metrics (such as latency or throughput) that indicate how well a service is performing.

In the next section, we will explore the process of collecting and analyzing key SLI metrics that will help us define reasonable SLAs and SLOs.

Collect data to (re)define SLAs and SLOs

Infrastructure monitoring

Maintaining an SLA is difficult, if not impossible, if you don’t have excellent visibility into your systems and applications. Therefore, the first step to maintaining an SLA is to deploy a monitoring platform to make all your systems and applications observable, with no gaps. Every one of your hosts and services should be submitting metrics to your monitoring platform so that when there is a degradation, you can spot the problem immediately, and diagnose the cause quickly.

Whether you are interested in defining external, user-facing SLAs or internal SLOs, you should collect as much data as you can, analyze the data to see what standards you’re currently achieving, and set reasonable goals from there. Even if you’re not able to set your own SLAs, gathering historical performance data may help you make an argument for redefining more reasonable objectives. Generally, there are two types of data you’ll want to collect: customer-facing data (availability/uptime, service response time), and internal metrics (internal application latency).

Collect user-facing metrics to define external SLAs

Synthetic monitoring tools like Pingdom and Catchpoint are widely used to measure the availability and performance of various user-facing services (video load time, transaction response time, etc.). To supplement this data, you’ll probably also want to use an application performance monitoring (APM) tool to analyze the user-facing performance of each of your applications, broken down by its subcomponents.

When it comes to assessing performance metrics, it’s often not enough to simply look at the average values—you need to look at the entire distribution to gain more accurate insights. This is explained in more detail in Site Reliability Engineering: “Using percentiles for indicators allows you to consider the shape of the distribution and its differing attributes: a high-order percentile, such as the 99th or 99.9th, shows you a plausible worst-case value, while using the 50th percentile (also known as the median) emphasizes the typical case.”

For example, let’s say that we wanted to define an SLA that contains the following SLO: In any calendar month period, the user-facing API service will return 99 percent of requests in less than 100 milliseconds. To determine if this is a reasonable SLA, we used Datadog APM to track the distribution of API request latency over the past month, as shown in the screenshot below.

request latency distribution

In this example, the distribution indicates that 99 percent of requests were completed in under 161 ms over the past month. This suggests that it may be difficult to fulfill the previously stated SLO without some backend performance enhancements. But you’d probably be able to meet a 250-ms SLO, assuming this month is fairly representative. If you have the data available, it may also be a good idea to query the latency distribution and graph it over longer periods of time, to identify seasonal trends or long-term variations in performance.
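The nearest-rank percentile arithmetic behind a statement like “99 percent of requests completed in under 161 ms” is simple enough to sketch with standard shell tools (the sample latencies below are made up for illustration):

```shell
# Ten hypothetical request latencies in milliseconds.
printf '%s\n' 80 90 95 100 120 140 161 150 130 110 > /tmp/latencies.txt

# Nearest-rank method: p99 is the value at rank ceil(0.99 * N) in sorted order.
p99=$(sort -n /tmp/latencies.txt |
  awk '{a[NR]=$1} END {idx=int(0.99*NR); if (idx < 0.99*NR) idx++; print a[idx]}')
echo "p99=${p99}ms"
```

With only ten samples the 99th percentile degenerates to the maximum value; real SLO math needs far more data points, which is why a monitoring platform does this for you.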

Analyze subcomponent metrics to define internal SLOs

Monitoring customer-facing metrics is important for obvious reasons, but collecting and assessing metrics that impact internal services can be just as crucial. Unlike a user-facing SLA, internal SLO violations do not necessarily result in external or economic repercussions. However, SLOs still serve an important purpose—for example, an SLO can help establish expectations between teams within the same organization (such as how long it takes to execute a query that another team’s service depends on).

Below, we captured and graphed an internal application’s average, 95th percentile, and maximum response times over the past month. By collecting and graphing these metrics over a substantial window of time (the past month), we can identify patterns and trends in behavior, and use this information to assess the ongoing viability of our SLOs.

response time histogram metrics graphed

For example, the graph above indicates that the application response time was maxing out around 1 second, while the response time was averaging about 400 ms. Assuming you have leeway to set your own SLO, looking at the full range of values should help guide you toward a more informed and supportable objective.

Internal SLOs can also serve as more stringent versions of external SLAs. In the case of the graph above, if the external SLA contains an SLO that aims to fulfill 95 percent of requests in under 2 seconds within any given month, this team might choose to set its internal SLO to 1.5 seconds. This would hopefully leave the alerted individual(s) enough time to investigate and take action before the external SLA is violated.

Create SLA-focused dashboards

Once you’ve collected metrics from internal and external services and used this data to define SLOs and SLAs, it’s time to create dashboards to visualize their performance over time. As outlined in Datadog’s Monitoring 101 series, preparing dashboards before you need them helps you detect and troubleshoot issues more quickly—ideally, before they degrade into more serious slowdowns or outages.

End user-focused dashboards

The dashboard below provides a general overview of high-level information, including the real-time status and response time of an HTTP check that pings a URL every second. Separate widgets display the average response time over the past 5 minutes, and the maximum response time over the past hour.

sla overview dashboard

Incorporating event correlation into your graphs can also help provide additional context for troubleshooting. As Ben Maurer has explained, Facebook noticed that it recorded fewer internal SLA violations during certain time periods—specifically, when employees were not releasing code.

To spot if your SLA and SLO violations are correlated with code releases or other occurrences, you may want to overlay them as events on your metric graphs. In the screenshot above, the pink bar on the timeseries graph indicates that a code release occurred, which may have something to do with the spike in page load response time that occurred shortly thereafter. We also included a graph that compares today’s average response time to the previous day’s.

Dashboards that dive beneath the surface

While the previous example provided a general overview of user-facing performance, you should also create more comprehensive dashboards to identify and correlate potential issues across subcomponents of your applications.

In the example shown below, we can see a mixture of customer-facing data (API endpoint errors, slowest page load times), and metrics from the relevant underlying components (HAProxy response times, Gunicorn errors). In each graph, code releases have been overlaid as pink bars for correlation purposes.

SLA dashboard with underlying components

Alerting on SLAs

For many businesses, not meeting their SLAs is almost as serious a problem as downtime. So in addition to creating informative dashboards, you should set up alerts to trigger at increasing levels of severity as metrics approach internal and external SLO thresholds.

Classify your alerts by urgency

Effective alerting assigns an appropriate notification method based on the level of urgency. Datadog’s Monitoring 101 post on alerting has detailed guidelines for determining the best method of notification (record, notification, or page).

As mentioned earlier, you may want to set an internal SLO that is more aggressive than the objectives in your external SLA, and alert on that value. Any time a metric crosses a “warning” threshold (approaching the internal SLO threshold), it should be recorded as a lighter notification, such as an email or a chat room message. However, if the internal SLO is violated, you should page the person(s) responsible, which will ideally provide the individual(s) with enough time to address the situation before the external SLA is violated.
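The escalation logic described above is essentially a pair of thresholds. A toy sketch, using made-up numbers that match the earlier 1.5 s internal / 2 s external example:

```shell
# Hypothetical measurement and thresholds, in milliseconds.
response_ms=1600
warn_ms=1500   # internal SLO: notify via email or chat
page_ms=2000   # external SLA: page the on-call

if [ "$response_ms" -ge "$page_ms" ]; then
  alert="page"      # external SLA violated
elif [ "$response_ms" -ge "$warn_ms" ]; then
  alert="notify"    # internal SLO breached, buffer remains before the SLA
else
  alert="ok"
fi
echo "alert=${alert}"
```

Here the 1600 ms measurement triggers a notification rather than a page, leaving time to investigate before the external SLA is at risk.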

Watch your resources

In addition to alerting on SLO thresholds, you should also collect comprehensive metrics on underlying resources—everything from low-level (disks, memory) to higher-level components (databases, microservices). If these go unchecked, they can degrade into a state where they will negatively impact end-user experience.

Degradations of resource metrics (for example, database nodes running low on disk space) may not immediately impact customers, but they should be addressed before they ripple into more serious consequences. In this case, the appropriate alert would be a notification (email, chat channel), so that someone can prioritize it in the near future (unless the disk space is very low, in which case it is probably an urgent problem). Whenever a less serious resource issue (for example, an increase in database replication errors) is detected, it should be saved as a record. Even if it eventually resolves on its own, these records can be useful later on, if further investigation is required.

Set informative, actionable alerts

Successful alerts should:
- clearly communicate what triggered the alert
- provide actionable steps to address the situation
- include information about how it was resolved in the past, if applicable

For further reading, check out this other SysAdvent article for sage advice about making alerts more actionable and eliminating unnecessary noise.

Some examples of alerts that you might create to monitor SLAs as well as their underlying resources:
- Your average service response time over the past day exceeds your internal SLO threshold (alert as page)
- An important HTTP check has been failing for the past 2 minutes (alert as page)
- 20 percent of health checks are failing across your HAProxy backends (alert as record if it resolves itself, or email/chat notification if it doesn’t)

Put your SLA strategy in action

If you don’t have a monitoring platform in place, start there. After that you’ll be ready to set up dashboards and alerts that reflect your SLAs and key resources and services that the SLA depends on.