December 9, 2016

Day 9 - One year of PowerShell or DevOps in the Microsoft world...

Written by: Dominique Broeglin (@dbroeglin)
Edited by: Matt Stratton (@mattstratton)

Holidays are a time for reflection and giving back. This year, I thought I would reflect on my ongoing Linux to Microsoft transition and give back a few tidbits I learned about DevOps in the Microsoft world. This article is meant to illustrate a journey from Linux to Microsoft in hopes that the “from the trenches” perspective will help others attempting the same transition.

Let’s start with a quick background to set the stage.

Background

After an education in both software engineering and distributed systems, I worked for several IT companies, small and big, in a variety of domains such as telcos, education, justice, transportation, luxury goods, and the press. I was lucky enough to have the opportunity to work with bright people very early in my career. I’m very grateful to all of them, as I learned a lot from them and was spared quite a few mistakes. One of the subjects I was introduced to early on is automation, both through my co-workers and through seminal books like The Pragmatic Programmer: From Journeyman to Master and its sequel Pragmatic Project Automation. This pushed me towards automated configuration management in its early days.

If memory serves, I discovered Linux somewhere around 1994 and was instantly hooked. I still recall that first Slackware distribution with its 50-something floppy disks. For twenty years, I’ve worked mainly on Linux and Unices, trying studiously to avoid any Microsoft product. Some people may even recall me swearing that I would never work on Windows. However, a few years ago Microsoft changed its approach, and I was, coincidentally, presented opportunities in Microsoft environments. As was once said by John H. Patterson, “Only fools and dead men don’t change their minds.” So I went on and changed mine, one step at a time…

Armed with almost 20 years of Linux experience and ten years of automated configuration management, I jumped into Microsoft automated configuration management.

Step 1: Develop on Windows, Deploy on Linux

The first step was a Java project that was just starting. In that early design and setup phase, only a few people were involved. I was brought in to help with the Java architecture and software factory setup. Having set up quite a few factories already, that part was not new to me. However, we would have to onboard a sizable number of people in a short amount of time, and that would be a challenge: at the time, setting up a working environment for a developer, tester, architect, or DBA took quite a bit of effort.

During previous projects, I had used Vagrant to set up development environments that could be shared easily. Vagrant is a wrapper around a virtualization tool (at the time VirtualBox; Hyper-V and VMware support have since been added) that helps in setting up a virtual machine. It starts from a fixed base image but allows additional setup instructions to be declared in a file that can be shared in the project’s source control repository. Everything needed to set up the environment is declared in a file called Vagrantfile, and when developers wish to set up their environment, they just execute vagrant up.

A few minutes later, a fully operational environment is running inside a virtual machine, which in turn runs on the developer’s desktop. In a Mac OS X or Linux environment, the developer usually edits files by running an editor on the host system but builds, deploys, and tests the system in the virtual machine. Moreover, Vagrant not only allows provisioning scripts to be declared in the Vagrantfile, it also integrates with a Puppet or Chef infrastructure, which means that the development virtual machine can be provisioned exactly like the test and production environments.
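
For illustration, here is a minimal Vagrantfile sketch (box name, IP, and script paths are made up) combining an ad hoc shell script with the Puppet provisioner mentioned above:

# Vagrantfile: declares the base image, networking, and provisioning
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"                        # fixed base image
  config.vm.network "private_network", ip: "192.168.33.10"

  # Ad hoc provisioning script, versioned with the project
  config.vm.provision "shell", path: "setup.sh"

  # Or reuse the same Puppet manifests as test and production
  config.vm.provision "puppet" do |puppet|
    puppet.manifests_path = "puppet/manifests"
    puppet.manifest_file  = "site.pp"
  end
end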

It also allows for a nice DevOps pipeline by permitting each developer to experiment with an infrastructure change on their own virtual machine. When satisfied, they can then share that change through source control. Their fellow developers will benefit from the change by refreshing their sources and executing a simple vagrant provision command. The change can then be merged, again through source control, into the test and production environments. Developers and operations people can experiment in their environment without fear and at a very low cost. Breaking their environment only means that they have to revert their changes through source control and execute vagrant destroy; vagrant up (with the additional benefit that they get to grab a cup of coffee while this is running).

However, after a few experiments with Vagrant on Windows, we quickly found out that the developer experience was not satisfactory. Performance was severely degraded, and editing on Windows while building on Linux proved quite challenging. Moreover, the JEE technology stack, with its thousands of Java, XML, and ZIP files, posed a greater challenge to virtualized environments than the more common Rails, Node, or Python environments.

As we already planned on using Puppet for test and production environment deployment, we settled on deploying each developer’s desktop environment with the same tool. This allowed us to almost completely automate desktop environment setup. It had a few drawbacks, though. Mainly, for project members who worked on laptops with previously installed tooling, the setup process might interfere with those tools or fail altogether because they were set up in an unexpected way. We also had the common class of issues that were only discovered in the test environment, because developers tested on their own Windows environments.

Step 2: Deploy on Windows in a Linux shop

The second step was a bit different. This time, we were both developing and deploying on Windows. The project was only a few months old; the deployment process was manual, and configuration management was starting to become an issue. Most of the rest of the company was deploying applications on Linux, and the infrastructure team had already introduced Puppet to manage those hosts. We ourselves used Puppet to deploy our hosts to build on existing experience and to prepare for an eventual migration from Windows to Linux.

All environments were managed by the same system, which allowed developers to have local setups almost identical to the test and production environments. Deployments and configuration changes were propagated along the pipeline by successively merging from development to test and from test to production. Configuration data was stored in Hiera, the part of Puppet that provides a hierarchical configuration database.
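
A Hiera hierarchy is declared in a small YAML file; the sketch below uses Hiera 3 syntax with illustrative paths. Each level is consulted in order, so a per-node value overrides an environment-wide one, which in turn overrides the common defaults:

:backends:
  - yaml
:hierarchy:
  - "nodes/%{::fqdn}"
  - "environment/%{::environment}"
  - "common"
:yaml:
  :datadir: /etc/puppetlabs/code/hieradata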

This particular software we were developing required complicated configuration management, and we rapidly reaped the benefits of introducing automation, especially when deployments started multiplying.
However, it revealed much more vividly than in step 1 that Puppet, or any tool that grew up in the Linux world, has an impedance mismatch when deployed on Windows. Basically, Unices are based on files and executables, whereas Windows is much more complex: the registry, the rights management system, drivers, host processes, services, etc. None of those concepts is hugely different from its Linux equivalent, but they are different enough to make things complicated with tools like Puppet, Chef, Ansible, Salt, etc. Nothing is completely impossible, but you have to work harder to achieve the same result on Windows than on those tools’ native OSes.

This is especially true when you cannot handle the task with pre-existing resources and need to add some new behavior to the configuration management tool, which is done by extending it with Ruby code for Puppet and Chef, or Python code for Ansible and Salt. But even general purpose languages like Ruby and Python that are ported to Windows show some of that impedance mismatch. Additionally, most of the time, they are not known by existing IT professionals. We chose to use Bash through the MSysGit toolbox because both developers and IT professionals were familiar with it. However, that still was not satisfactory. Even though that version of Bash is very well integrated with Windows, it does not work seamlessly. Very simple tasks like creating a properly configured Windows service required juggling with different concepts and commands.
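
To give a feel for that juggling, here is a sketch of creating a service from the MSysGit Bash environment versus native PowerShell (service name and path are made up):

# From Bash we still had to shell out to native Windows tools;
# note sc.exe's unusual "key= value" argument syntax
sc.exe create MyAppService binPath= "C:\apps\myapp\service.exe" start= auto
sc.exe description MyAppService "MyApp background service"

# The PowerShell equivalent is a single native cmdlet:
# New-Service -Name MyAppService -BinaryPathName 'C:\apps\myapp\service.exe' -StartupType Automatic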

Note: in the meantime, Microsoft has released a beta version of Bash on Windows, running on the new Windows Subsystem for Linux. We did not use it as much as the MSysGit version, but it seems to exhibit the same inherent difficulties.

Step 3: What the heck, let’s learn PowerShell

The latest step of our journey occurred in a full Microsoft environment. Development and deployment happened almost exclusively on Windows. Modern configuration management like Puppet or Chef had not yet been introduced. Several efforts had been made to introduce automation, but mostly of the mechanization sort (see below): repetitive tasks had been gathered into scripts, and in some cases those scripts were called through workflow-like graphical interfaces, but mostly the control and knowledge remained with the IT professional.

This time around, people already were familiar with PowerShell, the scripting language introduced by Microsoft in 2006. The approach we took was to expand on that existing body of knowledge with extensive training and systematic exploration of PowerShell solutions to our issues before even considering more advanced but also more alien solutions. We found that PowerShell provided most of the required building blocks for a full deployment and configuration management stack.

PowerShell

PowerShell is a scripting language introduced by Microsoft in 2006, but its design was already laid out in the Monad Manifesto by Jeffrey Snover in 2002. As explained in the manifesto, PowerShell takes a new approach to old issues. It is a shell with a pipeline; however, PowerShell passes objects down the pipeline, not text or binary data like its Unix relatives. This saves quite a bit of work when composing simpler commands together.
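
For instance, the following pipeline filters and sorts processes without any of the cut/awk-style text parsing a Unix shell would require (the 100MB threshold is arbitrary):

# Each stage receives real Process objects, so we can address
# properties by name instead of parsing columns of text
Get-Process |
    Where-Object { $_.WorkingSet64 -gt 100MB } |
    Sort-Object CPU -Descending |
    Select-Object -First 5 Name, Id, CPU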

PowerShell also leverages the .Net framework, which makes it as powerful as any general purpose programming language. Access to the .Net framework means that any API available to .Net programmers is also accessible through PowerShell. But it does not stop there: it also provides jobs, remoting, workflows, package management, and many other features. Among those, two are of particular interest when building an automation solution: Desired State Configuration (DSC) and Just Enough Administration (JEA).

Desired State Configuration

DSC builds on top of PowerShell to implement configuration management building blocks called resources. Those resources are very similar to Puppet resources and likewise allow IT professionals to declaratively express what the state of the system should be. The actual imperative implementation that changes the system to the desired state is provided by a PowerShell module.

DSC also provides a declarative domain specific language to express a system’s configuration, and an agent that can apply that configuration to a specific host. The agent is an integral part of the Windows Management Framework, which means that it is deployed everywhere PowerShell is.
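
As a minimal sketch (node name and resources are illustrative), a DSC configuration looks like this:

Configuration WebServer {
    Import-DscResource -ModuleName PSDesiredStateConfiguration

    Node 'web01' {
        # Declare the desired state; the resource module implements
        # the imperative steps needed to reach it
        WindowsFeature IIS {
            Name   = 'Web-Server'
            Ensure = 'Present'
        }
        Service W3SVC {
            Name      = 'W3SVC'
            State     = 'Running'
            DependsOn = '[WindowsFeature]IIS'
        }
    }
}

# Compiling the configuration produces a MOF document that the agent enforces
WebServer -OutputPath .\WebServer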

Transforming traditional imperative scripts into DSC resources allows us to achieve idempotence, which in turn allows us to handle any configuration change as a simple change in the declared configuration. DSC then ensures the system moves from its previous state to the required one. Even a manual change to the system would be caught by DSC as configuration drift and could be automatically corrected, depending on the agent’s configuration.

DSC can do much more. However, it is not as mature a solution as Puppet or Chef. It lacks the ecosystem around it: a configuration database, reporting tools, easy but powerful configuration composition and reuse, etc. Therefore, our current approach is to associate both tools. The impedance mismatch mentioned earlier can be solved by letting Puppet or Chef use DSC resources to effect configuration changes on Windows, creating a win/win situation: we benefit from all the power behind the Puppet and Chef ecosystems, while still leveraging PowerShell, which is native to Windows and already known to IT professionals.

Our current preference is to use Chef. Chef is a bit less mature than Puppet, but it seems that Chef and Microsoft are actively working together to integrate the two solutions.
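
For example, a Chef recipe can delegate to a native DSC resource directly; the sketch below assumes Chef’s dsc_resource support (which requires WMF 5 on the node) and uses an illustrative feature name:

# Let Chef drive the DSC WindowsFeature resource to install IIS
dsc_resource 'webserver' do
  resource :windowsfeature
  property :name, 'Web-Server'
  property :ensure, 'Present'
end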

Just Enough Administration

JEA is a tool that helps reduce the dissemination of administrative rights. Lots of operations require administrative, or at least somehow elevated, rights to be performed. That usually means that if you have to perform one of those operations (even if it is just a read-only operation), you get full administrative rights, or some ad hoc solution is put in place to somehow allow you to do what you need to. This can lead to pretty complex and fragile setups.

From my Linux background’s point of view, JEA is like sudo on steroids. It handles privilege elevation, RBAC, and remoting, and integrates nicely with PowerShell scripting. It can easily be deployed via DSC, which helps ensure limited and uniform access rights throughout the whole system.
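
As a minimal sketch (group, role, and service names are made up, and in practice the .psrc file lives in a module’s RoleCapabilities folder), a JEA endpoint that only lets operators restart one service could be declared like this:

# Role capability: expose a single cmdlet, constrained to one service
New-PSRoleCapabilityFile -Path .\ServiceOperator.psrc `
    -VisibleCmdlets @{ Name = 'Restart-Service'; Parameters = @{ Name = 'Name'; ValidateSet = 'W3SVC' } }

# Session configuration: run as a temporary virtual account and
# map an AD group to the role defined above
New-PSSessionConfigurationFile -Path .\ServiceOperator.pssc `
    -SessionType RestrictedRemoteServer `
    -RunAsVirtualAccount `
    -RoleDefinitions @{ 'CONTOSO\ServiceOperators' = @{ RoleCapabilities = 'ServiceOperator' } }

Register-PSSessionConfiguration -Name ServiceOperator -Path .\ServiceOperator.pssc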

Perspective

That latest step is still quite new and, in some parts, a work in progress. We have solved all the issues encountered in the previous steps, and even a few new ones, which leads us to think we are on the right track. Moreover, the vision we have built seems to align nicely with the “new” Microsoft. For instance, the recent open sourcing of .Net and PowerShell, along with their ports to Linux and Mac OS X, opens new avenues for automating our few existing Linux systems.

Lessons learned

The following lessons were learned the hard way. They are a bit opinionated, but all of them are rooted in real-life situations where not following them meant failure…

Start small.

In my experience, trying to automate the whole system at once ends badly. At best, only those parts of the system that were easy end up automated, leaving huge gaps where manual intervention is still needed. At worst, the effort fails completely.

Start small by considering only a single part of the system, but ensure that that part is entirely automated. Do not leave manual steps in between automated parts. The automation should encompass all use cases, even those considered exceptions. If it looks like they are too different to be automated, it usually means that the automation system is too rigid, too specialized to some local or current way of doing things. Not being able to handle today’s exceptions is a good indicator that the system we are creating will probably not be able to handle future ones either. Also, partial coverage means that we cannot have full confidence in our ability to reconstruct the system, be it because we are facing a major disaster or just because we need to experiment with a copy of the actual system.

This makes for slower progress but allows you to get a solid foothold on the automated ground. Expanding from that solid ground will be much easier in the future.

“Mechanization” is not automation…

In my experience, in environments where the concept of automation is new, it is often confused with mechanization. It is what might be called the “reduce 10 steps to 1” syndrome.

Wikipedia defines Mechanization as:

Mechanization is the process of changing from working largely or exclusively by hand or with animals to doing that work with machinery.

Too often, automation is confused with mechanization. Allowing developers to submit an SQL schema migration through a form is a form of mechanization: the developer no longer needs to open an SQL tool, enter the proper credentials, and execute the SQL file by hand. However, it is not automation yet. The process is still almost exclusively under the control of a human being (animal labor is now mostly eliminated from our IT operations, although we still occasionally see some developers talking to ducks). Wikipedia defines Automation as:

Automation is the use of various control systems for operating equipment […] with minimal or reduced human intervention.

To fully automate our SQL database schema migrations, human intervention should be entirely eliminated, which means that the developer should only specify which version of the database schema is required, and automation will take care of bringing the database to the proper state. Tools like Liquibase or Flyway can help a lot with that, to the point that humans do not even need to ask for a database schema version change: the application can, upon starting, check that the database is in the proper state and, if not, apply the relevant migrations.
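
As a minimal sketch of that migration-on-startup pattern using Flyway’s Java API (connection settings are illustrative):

import org.flywaydb.core.Flyway;

public class Main {
    public static void main(String[] args) {
        // Point Flyway at the database and the versioned migration scripts
        // shipped with the application (classpath:db/migration by default),
        // then bring the schema up to date before the application starts
        Flyway flyway = new Flyway();
        flyway.setDataSource("jdbc:postgresql://db/app", "app", "secret");
        flyway.migrate(); // applies only the migrations not yet run

        // ... start the application proper ...
    }
}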

While introducing automation instead of mechanization is a bit harder, it has tremendous advantages down the road. To quote Wikipedia one last time:

The biggest benefit of automation is that it saves labor; however, it is also used to save energy and materials and to improve quality, accuracy, and precision.

Automation removes human error from day-to-day execution by removing human decisions from the process, which in turn improves quality and repeatability.

Keep away from shiny tools

With the introduction of Infrastructure as Code, IT professionals became de facto developers. Their activity has changed from simply deploying and operating software to actually building the software that deploys that software, even when that software is some piece of infrastructure.

That poses a challenge to any IT professional who did not start out as a developer. In the Linux world, however, any IT professional has been exposed to a variety of scripting languages performing tasks from the most mundane to the quite sophisticated (for instance Gentoo’s build system, ebuild). The Microsoft world, by contrast, has long relied on graphical user interfaces and manual operations. It is thus natural to turn to graphical tools that promise to simplify infrastructure as code by allowing complex automation to be built through graphical manipulation. I have yet to find a graphical tool that fulfills that promise. Over the last few years, I have found quite a few instances where they made things way worse.

Coding is coding. To this day, most of the coding occurs by writing code. Yes, there are a few specialized tasks that are better handled by graphical tools, but general coding tasks usually involve editing some form of text. So, even if you can leverage graphical tools for some of your automation tasks, in the end, a general purpose scripting language will always be necessary.

The language of choice on Microsoft Windows is PowerShell. It is still a shell, with its roots firmly planted in the Unix world, but it is also very well integrated with the Windows system and Windows applications. All of which makes your life so much easier. If you only learned PowerShell by doing the same stuff you did with .BAT files and sprinkling in a bit of Stack Overflow, try to get some formal training. It will be time and money well spent that will pay for itself time and time again.

Involve IT professionals

Automation should be built by the people who are the most intimate with the system being automated. If the automation solution is built by outside people, it often lacks the proper adoption that would permit it to grow. In some extreme cases, the knowledge of what the automation actually does is lost in the system. Developers move on to building other systems, and IT professionals lack the expertise to actually dig into the system when needed.

Of course, in most cases, IT professionals are not yet developers and lack the skill set required to build the automation tools. In that case, bringing in external people to help mentor them in those new fields will speed things up and prevent fundamental mistakes. However, at the end of the process, IT professionals should be in charge of, and in complete control of, their toolchain.

One way to flatten their learning curve while still leaving them in charge is to help them set up a software factory and software tooling very early in the process. If PowerShell module scaffolds can be generated, tests executed automatically, and modules packaged and deployed automatically to the repository, they can concentrate on what really matters: the automation system they are building.
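
For instance, the test-run step of such a factory can be as simple as the following sketch (paths are illustrative, assuming the Pester module is installed):

# Run all Pester tests under .\Tests and capture the result object
$results = Invoke-Pester -Path .\Tests -PassThru

# Fail the build if any test failed
if ($results.FailedCount -gt 0) {
    throw "$($results.FailedCount) Pester test(s) failed."
}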

The release pipeline model

Earlier this year Microsoft published a white paper about the Release Pipeline Model, which neatly sums up everything mentioned here (and more). I strongly encourage anyone attempting an automated deployment and configuration management effort in the Microsoft world to read it. Most of what you would need to get started, but is not covered in this post, can be found in it.

Looking forward

It is probably too soon to have enough perspective on how automated configuration management will evolve in the Microsoft world. So far, though, all the pieces seem to fall neatly into place and work well together.

December 8, 2016

Day 8 - Building Robust Jenkins Pipelines

Written By: Michael Heap (@mheap)
Edited By: Daniel Maher (@phrawzty)

For many years continuous integration was synonymous with the name Jenkins, but as time went on it fell out of favour and was replaced by newer kids on the block such as Travis, GoCD and Concourse. Users became frustrated with Jenkins’ way of defining build jobs and looked to services that allowed you to define your builds as code alongside your project, building real pipelines with fan-in/fan-out capabilities.

Jenkins recently hit version 2, and with it came a whole host of new features! There were some small additions to the base install (such as the git plugin being shipped by default), but true to its roots, Jenkins still ships most of its functionality as plugins. Functionality such as defining your job configuration in code is provided by plugins that are maintained by the Jenkins core team, which means they’re held to the same high standards as Jenkins itself.

In this article we’ll be building an Electron application and adding it to Jenkins. Whilst we’ll be using the GitHub Organization plugin, it’s not essential - you can configure repositories by hand if you prefer.

The majority of the code in this article is simple calls to shell commands, but a passing familiarity with either Java or Groovy may help later on.

Getting Started

In this article, we’re going to be building a pipeline that builds, tests, and packages an Electron application for deployment on Linux. We’ll start with a basic Jenkins installation and add all of the plugins required, before writing a Jenkinsfile that builds our project.

Let’s create a Vagrant virtual machine to run Jenkins inside. This isn’t a requirement for running Jenkins, but it helps us get up and running a little more easily.

Creating a VM

vagrant init ubuntu/xenial64
sed -i '/#.*private_network*/ s/^  #/ /' Vagrantfile # Enable private networking
vagrant up
vagrant ssh
wget -q -O - https://pkg.jenkins.io/debian/jenkins-ci.org.key | sudo apt-key add -
sudo sh -c 'echo deb http://pkg.jenkins.io/debian-stable binary/ > /etc/apt/sources.list.d/jenkins.list'
sudo apt-get update
sudo apt-get install -y jenkins
sudo cat /var/lib/jenkins/secrets/initialAdminPassword
exit

Visit http://192.168.33.10:8080 on your local machine and continue the setup.

Bootstrapping Jenkins

You’ll need two plugins to get started: the Pipeline plugin and the GitHub Organization Folder plugin. These form the cornerstone of your Jenkins install. Click on None at the top then type pipeline into the search box. Select Pipeline and GitHub Organization Folder before clicking Install.

You’ll be prompted to create a user once the plugins are installed. Do this now, then click on Start using Jenkins to get underway.

Configuring Jenkins

Now that you’re logged in, there’s a little bit of work to do to configure Jenkins so that we can start using it. We’ll need to provide access credentials and set up our GitHub Organization job so that Jenkins knows what to build. Sadly, we need to do this even if we’re only working with public repositories as certain endpoints on the GitHub API require authentication.

  • Click Credentials on the left hand side.

  • Click on (global) next to the Jenkins store.

  • Click Add Credentials on the left.

  • Decide what kind of credentials you want to add. I’m using username & password authentication for HTTPS clones from GitHub, so I visited https://github.com/settings/tokens and generated a token that had only the repo OAuth scope.

  • Provide the credentials by filling in the username and password fields. (Note that the fields are different if you’re using SSH authentication.)

  • Click Jenkins in the top left to go back to the homepage.

  • Click New Item on the left.

  • Choose GitHub Organization and give it a name (I’ve called mine michael-test), then click OK.

  • Under Repository sources, set the Owner field to be either your organisation name or your personal account name.

  • Select the credentials you just set up using the Scan Credentials option.

  • Click Advanced under Repository name pattern and make sure that Build origin PRs merged with base branch is enabled. This will make Jenkins build any incoming pull requests.

It’s worth noting at this point that under the Project Recognizers section it says Pipeline Jenkinsfile. Jenkins will scan an organisation or account for any repos with branches that contain a “Jenkinsfile” at their root and create a build job for that branch.

Let’s create your first Jenkinsfile and kick off a build.

Writing your first Jenkinsfile

We need a project to build, so let’s create one. For now it’ll be an empty project that only contains a Jenkinsfile.

mkdir demo-electron
cd demo-electron
git init
touch Jenkinsfile

Edit Jenkinsfile and add the following content:

echo "Hello World"

Commit and push this up to GitHub (you’ll need to create a repository first). I’ve called my repo demo-electron.

git add .
git commit -m "Initial Jenkinsfile"
git remote add origin git@github.com:michael-test-org/demo-electron.git
git push -u origin master

Go back to Jenkins, and you’ll see an option on the left called Re-scan organization. Click it, then click on the Run now option that appears. You’ll see some log output where it detects the Jenkinsfile and creates a job for you.

Within michael-test there’s now a job called demo-electron, which matches our repository name. Clicking through, we can now see a master job within. By default, Jenkins will build all existing branches. Click into the master job and then click on #1 on the left hand side to see the results from the first build that was performed when the repository was detected. If you click on Console Output on the left, you’ll see Jenkins cloning our repo before outputting “Hello World”.

Congratulations, you just wrote your first Jenkinsfile! It doesn’t do much, but it proves that our Jenkins instance is up and running, detecting pipelines, and automatically executing them.

Throughout the rest of this post, we’ll be building a small Electron based desktop application. This means that our build machine will need some additional dependencies installed. If you want to work along with this tutorial, your Vagrant machine will need a little bit of configuration. Log in to your machine with vagrant ssh and run the following command to install the necessary dependencies to build an Electron application:

sudo apt-get install -y nodejs nodejs-legacy npm jq

Adding Electron

We’re going to add Electron to the project at this point. Instead of working directly on master, we’re going to work on a new branch called add-electron.

Create a new branch to work on add-electron by running the following command:

git checkout -b add-electron

The first thing we need to do is create a standalone Electron application. To do this, we follow the Electron Quick Start document. Go ahead and follow that - once you’re done you should have a package.json, main.js and index.html. We also need to install Electron as a dependency, so run the following command on your local machine in the same directory as package.json.

npm install electron --save-dev

This will make Electron available to run our app. You can test it by running ./node_modules/.bin/electron . in the same folder as your main.js file. This will show a “Hello World” page.

This gives us our first real action to run in Jenkins! Although running npm install doesn’t feel like a lot, ensuring that we can install our dependencies is a great first step towards continuous builds.

Jenkinsfile

It’s time to change our Jenkinsfile! It’ll be a nice simple one to start with: first make sure that we’re on the right branch, then run npm install.

The first thing to do is delete our “Hello World” line and add the following contents:

node {
  
}

This declaration tells Jenkins that anything inside the braces must run on a build agent. Our old echo statement could run without a workspace (a copy of our repository on disk) but as we’re about to interact with the repository we need a workspace.

The next thing to do is to tell Jenkins to make sure that our repository is in the correct state based on the job that we’re running. To do this, we run a special checkout scm command that makes sure we’re on the correct branch. Finally, we run npm install using the sh helper to run a shell command. Our entire Jenkinsfile looks like this after the changes:

node {
  checkout scm
  sh 'npm install'
}

If we commit and push this, Jenkins will build the project using our new Jenkinsfile and run npm install each time we commit.

git add Jenkinsfile index.html main.js package.json
git commit -m "Create Electron app"
git push origin add-electron

While it is possible for Jenkins to be notified from GitHub each time there is a commit to build, this won’t work for us as we’re testing with a local Jenkins instance. Instead, we need to tell Jenkins to search for changes by visiting the main page for our job and clicking on Branch Indexing then Run Now. Jenkins will then rescan the repo for new branches, add our add-electron branch, then run the new Jenkinsfile.

Just like we could view the build for the master branch, we can click in to add-electron, click on Console Output on the left and watch as Jenkins runs npm install for us.

At this point, we can raise a pull request to merge our branch into master on GitHub. Once Jenkins rescans the repository (manually triggered, see above), it will detect this pull request and automatically update the pull request’s commit status with the outcome of the build.

Once the request is merged and our add-electron branch is deleted, Jenkins will go back to just building the master branch on its own.

Creating a build

We now have an application that’s building, but we need a way to package it up and ship it to customers. Let’s create a new working branch called build-app by running the following command:

git checkout -b build-app

There’s a project called electron-builder which can help us build our application. Just like with Electron, we need to require it in our package.json by running the following on your local machine in the same directory as package.json:

npm install electron-builder --save-dev

Next, we need to add some details to our package.json for the builder to use, including configuration values such as our application’s description and application identifier. We’ll also add a few npm scripts that build the application for different operating systems.

Update your package.json to look like the following, which contains all of the required information and build scripts:

{
  "name": "sysadvent",
  "version": "0.1.0",
  "main": "main.js",
  "devDependencies": {
    "electron": "^1.4.10",
    "electron-builder": "^10.4.1"
  },
  "description": "SysAdvent",
  "author": "Michael Heap <m@michaelheap.com>",
  "license": "MIT",
  "build": {
    "appId": "com.michaelheap.sysadvent",
    "category": "app.category.type"
  },
  "scripts": {
      "start": "electron main.js",
      "build": "build --mwl --x64 --ia32",
      "build-windows": "build --windows --x64 --ia32",
      "build-mac": "build --mac --x64 --ia32",
      "build-linux": "build --linux --x64 --ia32"
  }
}

At this point, you can test the build on your local machine by running npm run build-<os>; in my case I run npm run build-linux and the built applications show up in the dist folder. I can run ./dist/sysadvent-0.1.0-x86_64.AppImage to run the application. If you’re on OSX, there should be a .app file in the dist folder. If you’re on Windows, there should be an installer .exe.

This is another huge step in our build process. We now have a tangible build that we can distribute to people! Let’s update our build process so that it builds the application. Edit your Jenkinsfile and add a call to npm run build-linux, as we’re going to be building on a Linux machine.

node {
    checkout scm
    sh 'npm install'
    sh 'npm run build-linux'
}

This is almost enough to have our build automated, but we somehow have to get the build artefacts out of our workspace and make them available for people to download. Jenkins has built-in support for artefacts, which means there’s a built-in archiveArtifacts step that we can use. We tell Jenkins to step into the dist folder with the dir command and archive all artefacts that end with .AppImage in that directory. Your final Jenkinsfile will look like the following:

node {
    checkout scm
    sh 'npm install'
    sh 'npm run build-linux'
    dir('dist') {
      archiveArtifacts artifacts: '*.AppImage', fingerprint: true;
    }
}

If we commit and push this up to GitHub and retrigger the branch indexing, Jenkins will pick up the new branch, build our application, and publish an artefact.

git add Jenkinsfile package.json
git commit -m "Build packaged application"
git push origin build-app

Once the branch has built and is green, we can raise a pull request and get our changes merged into master. At this point, we have an application that has a build process and creates a packaged application that we can send out to users.

A more complex Jenkinsfile

What we’ve put together so far is a good introduction to building applications with Jenkins, but it’s not really representative of a real world project that would contain tests, code style linting, and various other build steps.

On the full-application branch, you’ll find a build of the Electron application that has basic functionality added, tests for that functionality, and code style linting added. With the addition of these common steps, the Jenkinsfile is already starting to grow.

node {
    checkout scm
    sh 'npm install'
    sh 'npm test'
    sh 'npm run lint'
    sh 'npm run build-linux'
    dir('dist') {
      archiveArtifacts artifacts: '*.AppImage', fingerprint: true;
    }
}

It’s still relatively easy to understand this Jenkinsfile, but imagine that we’ve copied and pasted it into half a dozen different applications that need to build an Electron app for distribution. If we wanted to update our applications and offer an OS X build too, we’d have to update every project and edit every Jenkinsfile one by one. Fortunately, Jenkins has a solution for this too: the Jenkins Global Library.

Creating a Global Library

When writing software applications we frown upon copying and pasting code between files, so why is it accepted when building deployment pipelines? Just like you’d wrap your code up in a library to reuse in your application, we can do the same with our Jenkins build steps.

Just as the Jenkinsfile is written in the Groovy language, our global library will be written in Groovy too. Groovy runs on the JVM, so the namespace structure is very similar to Java. Create a new folder and bootstrap the required files:

mkdir global-lib
cd global-lib
mkdir -p src/com/michaelheap
touch src/com/michaelheap/ElectronApplication.groovy

The ElectronApplication file is where our build logic will live. Take everything in your Jenkinsfile and paste it into ElectronApplication.groovy inside an execute function:

#!/usr/bin/env groovy

package com.michaelheap;

def execute() {
  node {
    checkout scm
    sh 'npm install'
    sh 'npm test'
    sh 'npm run lint'
    sh 'npm run build-linux'
    dir('dist') {
      archiveArtifacts artifacts: '*.AppImage', fingerprint: true;
    }
  }
}

return this;

Then update your Jenkinsfile so that all it does is call into this class. Groovy files are automatically compiled into a class that has the same name as the file, so ours is available as new ElectronApplication(). Your Jenkinsfile should look like the following:

def pipe = new com.michaelheap.ElectronApplication()
pipe.execute()

Once we update all of our applications to run this class rather than having the same thing in multiple places, any time we need to update the build pipeline we only need to update it in our Global Library and it will automatically be used in the next build of any job that runs.

There is one final step we need to perform before Jenkins can start using our global library. We need to publish it to a repository somewhere (mine’s on GitHub) and load it into the Jenkins configuration. Click on Manage Jenkins on the homepage and then Configure System before scrolling down to Global Pipeline Libraries.

To make Jenkins load your library, follow these steps:

  • Click Add.
  • Provide a name.
  • Set the default version to master.
  • Make sure Load Implicitly is checked, so that we don’t need to declare @Library in every Jenkinsfile.
  • Click Modern SCM.
  • Enter your organisation/account username into the Owner box.
  • Select the credentials to use for the scan.
  • Select your Global Library repository.
  • Click Save at the bottom.

The next time any job that references our new com.michaelheap.ElectronApplication definition runs, the global library will automatically be downloaded and imported so that any functions defined in the file can be used. In this case we call execute which runs everything else we needed.

Making the library DRY

Having a Global Library is a huge step towards re-usability, but if we ever needed to build an Electron application that didn’t have any tests to run, or that needed some additional steps (or even the same steps in a different order), we’d need to copy and paste our ElectronApplication definition and make the changes in a new file. Isn’t this what we were trying to get away from?

Fortunately, the TYPO3-infrastructure project on GitHub contains an awesome example of how you can build your jobs as a pipeline of steps to be executed. It’s quite tough to explain, so instead let’s work through an example. Let’s take our existing ElectronApplication and break it down into five different steps:

  • npm install
  • npm test
  • npm run lint
  • npm run build-linux
  • archiveArtifacts

Each of these is a command that could be run independently, so instead of having them all in a single file, let’s give each of them their own class by creating some new files:

cd src/com/michaelheap
touch Checkout.groovy InstallDeps.groovy Test.groovy Lint.groovy Build.groovy ArchiveArtifacts.groovy

We move each command out of ElectronApplication and into an execute function in each of these files. It’s important to ensure that they’re in the correct package namespace, and that they’re inside a node block, as we need a workspace:

InstallDeps.groovy

package com.michaelheap;
def execute() {
  node {
    sh 'npm install'
  }
}

ArchiveArtifacts.groovy

package com.michaelheap;
def execute() {
  node {
    dir('dist') {
      archiveArtifacts artifacts: '*.AppImage', fingerprint: true;
    }
  }
}

At this point, your ElectronApplication file will look pretty empty - just an empty execute function and a return this. We need to instruct Jenkins which steps to run. To do this, we’ll add a new run method that tries to execute a step and handles the error if anything fails:

#!/usr/bin/env groovy

package com.michaelheap;

def run(Object step){
    try {
        step.execute();
    } catch (err) {
        // Mark the build as failed and abort with the failing step's message
        currentBuild.result = "FAILURE"
        error(err.getMessage())
    }
}

def execute() {
}

return this;

Finally, we have to fill in the execute method with all of the steps we want to run:

def execute() {
    this.run(new Checkout());
    this.run(new InstallDeps());
    this.run(new Test());
    this.run(new Lint());
    this.run(new Build());
    this.run(new ArchiveArtifacts());
}

Once we commit all of our changes and new files to GitHub, the next time an ElectronApplication pipeline runs, it’ll use our new code.

For example, if we ever needed to set up a pipeline that automatically tested and published new modules to NPM, we wouldn’t have to reimplement all of those steps. We may have to create a new Publish.groovy that runs npm publish, but we can reuse all of our existing steps by creating an NpmModule.groovy file that has the following execute function:

def execute() {
    this.run(new Checkout());
    this.run(new InstallDeps());
    this.run(new Test());
    this.run(new Lint());
    this.run(new Publish());
}

Once that’s added to our global library, we can use it in any project by adding a Jenkinsfile with the following contents:

def pipe = new com.michaelheap.NpmModule();
pipe.execute();

This will reuse all of our existing steps alongside the new Publish step to test, lint, and publish an NPM module.
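
For completeness, the hypothetical Publish.groovy step could follow exactly the same pattern as the others (a sketch; npm authentication against your registry is assumed to already be configured on the build agent):

Publish.groovy

package com.michaelheap;
def execute() {
  node {
    sh 'npm publish'
  }
}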

Conditional builds

One of the awesome things about the GitHub Organization plugin is that it automatically detects new branches and pull requests and runs builds for them. This is great for things like testing and linting, but we don’t want to publish every branch we create. Generally, we want to run all of our tests on every branch but only publish from the master branch. Fortunately, Jenkins provides the branch name as an environment variable called env.BRANCH_NAME. We can therefore add a conditional in NpmModule.groovy so that we only publish when the pipeline is running against master:

def execute() {
    this.run(new Checkout());
    this.run(new InstallDeps());
    this.run(new Test());
    this.run(new Lint());
    if (env.BRANCH_NAME == "master") {
      this.run(new Publish());
    }
}

This works great for teams that are working towards a continuous delivery goal, but in the real world we do sometimes have to deploy from other branches too - whether it’s a legacy branch that receives security fixes or multiple active versions. Jenkins lets you do this too - the Jenkinsfile is just code after all!

If we wanted to publish everything on master, but also everything where the branch name starts with publish- we could change our if statement to look like the following:

if (env.BRANCH_NAME == "master" || (env.BRANCH_NAME.length() >= 8 && env.BRANCH_NAME.substring(0, 8) == "publish-")) {
  this.run(new Publish());
}

It’s a bit long, but it gets the job done. Now, only commits to either master or publish-* will be published.

Using 3rd Party Plugins in a Jenkinsfile

The final part of this post is a section on calling 3rd party plugins from your Jenkinsfile. Plugins may specifically support the Jenkinsfile (you can find a list here), in which case they’re nice and easy to use. Take the Slack plugin for example - it’s been updated with Jenkinsfile compatibility, so it’s as simple as calling slackSend "string to send" in your Jenkinsfile.

Sadly, not all plugins have been updated with Jenkinsfile friendly syntax. In this situation, you need to know which class you want to call and use the special step method. The ws-cleanup plugin is one that I use all the time that hasn’t been updated, so I have to call it via step([$class: 'WsCleanup']). ws-cleanup doesn’t accept any parameters, but you can pass parameters via the step method. For example, the JUnitResultArchiver can be called via step([$class: 'JUnitResultArchiver', testResults: '**/target/surefire-reports/TEST-*.xml']) (as seen in the pipeline plugin documentation).

If we wrap these calls to step in a custom step like we did with InstallDeps, Test, etc., then we can keep working with the same abstractions in our build pipelines. If the plugin is ever updated to provide a Jenkinsfile-friendly interface, we only have a single place to edit rather than dozens of different projects.

WorkspaceCleanup.groovy

package com.michaelheap;
def execute() {
  node {
    step([$class: 'WsCleanup'])
  }
}

Hopefully you won’t find too many plugins that aren’t pipeline-friendly. As more and more people are starting to use the Jenkinsfile, plugin authors are making it easier to work with their addons.

There’s more!

We’re only scratching the surface, and there’s so much more that Jenkins’ new pipeline support can do; we didn’t even get around to building on slave machines, running jobs in parallel, or stage support! If you’d like to learn more about any of these topics (or Jenkinsfiles in general) feel free to ask! You can find me on Twitter as @mheap or you can email me at m@michaelheap.com.

If you’re doing something really cool with Jenkins and the Jenkinsfile, I’d love to hear from you as well. I bet I could learn something from you, too!

December 7, 2016

Day 7 - What we can learn about Change from the CIA

Written By: Michael Stahnke (@stahnma)
Edited By: Michelle Carroll (@miiiiiche)

I’ve recently spent some time on the road, working with customers and potential customers, as well as speaking at conferences. It’s been great. In those discussions with customers and prospects, I’ve been fascinated by the descriptions of their organizations, behaviors, and policies.

I was reflecting on one of those discussions one week while preparing a lightning talk for our Madison DevOps Meetup. I looked through the chicken-scratch notes I keep on talk ideas to see what I could whip up, and found a note about the CIA Simple Sabotage Field Manual. This guide was written in 1944 and declassified in 2008. It is a collection of worst practices for running a business. The intent of the guide was to have CIA assets, or citizens of occupied countries, slow the output of the companies they were placed in, thus reducing their effectiveness in supplying enemy forces. Half of these tips and tricks describe ITIL.

ITIL comes from the Latin word Itilus which means give up.

ITIL and change control processes came up a lot during my recent trips. I’ve never been a big fan of ITIL, but I do think the goals it set out to achieve were perfectly reasonable. I, too, would like to communicate about change before it happens and have good processes around change. That, however, is roughly where my ability to work within ITIL ends.

Let’s take a typical change scenario.

I am a system administrator looking to add a new disk into a volume group on a production system.

First, I open up some terrible ticket-tracking tool. Even if you’re using the best ticket-tracking tool out there, you hate it. You’ve never used a ticket-tracking tool and thought, “Wow, that was fun.” So I open a ticket of type “change.” Then I add in my plan, which includes scanning the bus for the new disk, running some lvm commands, and creating or expanding a filesystem. I add a backout plan, because that’s required. I guess the backout plan would be to not create or expand the filesystem; I can’t unscan a bus. Then I have a myriad of other fields to fill out, some required by the tool, some required by the company’s process folks (but not enforced at the form level). I save my work.

Now this change request is routed for approvals. It’s likely that somewhere between one and eight people review the ticket, approve it, and move its state to ready for review, or ready to be talked about, or the like.

From there, I enter my favorite (and by favorite, I mean least favorite) part of the process: the Change Advisory Board (CAB). This is a required meeting that you have to attend, or send a representative to. They will talk about all changes, make sure all the approvals are in, make sure a backout plan is filled out, and make sure the ticket is in the ready-to-implement phase. This meeting will hardly discuss the technical process of the change. It will barely scratch the surface of an impact analysis for the work. It might ask what time the change will occur. All in all, the CAB meeting is made up of something like eight managers, four project managers, two production control people, and a slew of engineers who just want to get some work done.

Oh, and because reasons, all changes must be open for at least two weeks before implementation, unless it’s an emergency.

Does this sound typical to you? It matches up fairly well with several customers I met with over the last several months.

Let’s recap:

Effort: 30 minutes at the most.

Lag Time: 2+ weeks.

Customer: unhappy.

If I dig into each of these steps, I’m sure we can make this more efficient.

Ticket creation:

If you have required fields for your process, make them required in the tool. Don’t make a human audit tickets and figure out if you forgot to fill out Custom Field 3 with the correct info.

Have a backout plan when it makes sense, recognize when it doesn’t. Rollback, without time-travel, is basically a myth.

Approval Routing:

Who is approving this? Why? Is it the business owner of the system? The technical owner? The manager of the infra team? The manager of the business team? All of them?

Is this adding any value to the process or is it simply a process in place so that if something goes wrong (anything, anything at all) there’s a clear chain of “It’s not my fault” to show? Too many approvals may indicate you have a buck-passing culture (you’re not driving accountability).

Do you have to approve this exact change, or could you get mass approval on this type of change and skip this step in the future? I’ve had success getting DNS changes, backup policy modifications, disk maintenance, account additions/removals, and library installations added to this bucket.

CAB:

How much does this meeting cost? 12-20 people: if the average rate is $50 an hour per person, you’re looking at $600-$1000 just to talk about things in a spreadsheet or PDF. Is this cost in line with the value of the meeting?
What’s the most valuable thing that setup provides? Could it be done asynchronously?

Waiting Period:

Seriously, why? What good does a ticket do by sitting around for 2 weeks? Somebody could happen to stumble upon it while they browse your change tickets in their free time, and then ask an ever-so-important question. However, I don’t have stories or evidence that confirm this possibility.

CIA Sabotage Manual

Let’s see which of the CIA’s worst practices for running an org (or perhaps best practices for ensuring failure) this process hits:

Employees: Work slowly. Think of ways to increase the number of movements needed to do your job: use a light hammer instead of a heavy one; try to make a small wrench do instead of a big one.

This slowness is built into the system with a required duration of 2 weeks. It also requires lots of movement in the approval process. What if approver #3 is on holiday? Can the ticket move into the next phase?

When possible, refer all matters to committees, for “further study and consideration.” Attempt to make the committees as large and bureaucratic as possible. Hold conferences when there is more critical work to be done.

This just described a CAB meeting to the letter. Why make a decision about moving forward when we could simply talk about it and use up as many people’s time as possible?

Maybe you think I’m being hyperbolic. I don’t think I am. I am certainly attempting to make a point, and to make it entertaining, but this is a very real-world scenario.

Now, if we apply some better practices here, what can we do? I see two ways forward. You can work within a fairly stringent ITIL-backed system. If you choose this path, the goal is to keep the processes forced upon you as out of your way as possible. The other path is to create a new process that works for you.

Working within the process

To work within a process structured around a CAB, a review, and a waiting period, you’ll need to be aggressive. Most CAB processes have a standard change flow, or a pre-approved change flow, for things that just happen all the time. Often you have to demonstrate a number of successful changes of a given type before it is considered for this categorization.

If you have an option like that, drive toward it. When I last ran an operations team, we had dozens (I’m not even sure of the final tally) of standard, pre-approved change types set up. We kept working to get more and more of our work into this category.

The pre-approved designation meant it didn’t have to wait two weeks, and rarely needed to get approval. In cases where it did, it was the technical approver of the service/system who did the approval, which bypassed layers of management and production control processes.

That’s not to say we always ran things through this way. Sometimes, more eyes on a change is a good thing. We’d add approvers if it made sense. We’d change the type of change to normal or high impact if we had less confidence this one would go well. One of the golden rules was, don’t be the person who has an unsuccessful pre-approved change. When that happened, that change type was no longer pre-approved.

To get things into the pre-approved bucket, there was a bit of paperwork, but mostly, it was a matter of process. We couldn’t afford variability. I needed to have the same level of confidence that a change would work, no matter the experience of the person making the change. This is where automation comes in.

Most people think you automate things for speed, and you certainly can, but consistency was a much larger driver around automation for us. We’d look at a poorly-defined process, clean it up, and automate.

After getting 60%+ of our normal changes into the pre-approved category, our involvement in ITIL work-displacement activities shrank dramatically. Since things were automated, our confidence level in the changes was high. I still didn’t love our change process, but we were able to remove much of its impact on our work.

Automating a bad process just outputs crap...faster

Have a different process

At my current employer, we don’t have a strong ITIL practice, a change advisory board, or mandatory approvers on tickets. We still get stuff done.

Basically, when somebody needs to make a change, they’re responsible for figuring out the impact analysis of it. Sometimes, it’s easy and you know it off the top of your head. Sometimes, you need to ask other folks. We do this primarily on a voluntary mailing list — people who care about infrastructure stuff subscribe to it.

We broadcast proposed changes on that list. From there, impact information can be discovered and added. We can decide timing. We also sometimes defer changes if something critical is happening, such as release hardening.

In general, this has worked well. We’ve certainly had changes that had a larger impact than we originally planned, but I saw that with a full change control board and 3–5 approvers from management as well. We’ve also had changes sneak in that didn’t get the level of broadcast we’d like to see ahead of time. That normally only happens once for that type of change. We also see many changes not hit our discussion list because they’re just very trivial. That’s a feature.

Regulations

If you work in an environment with lots of regulations preventing a more collaborative and iterative process, the first thing I encourage you to do is question those constraints. Are they in place for legal coverage, or are they simply “the way we’ve always worked?” If you’re not sure, dig in a little bit with the folks enforcing regulations. Sometimes a simple discussion about incentives and what behaviors you’re attempting to drive can cause people to rethink a process or remove a few pieces of red tape.

If you have regulations and constraints due to government or industry policies, such as PCI or HIPAA, then you may have to conform. One of the common controls in those types of environments is that people who work in the development environment may not have access to production or push code into it. If this is the case, dig into what that really means. I’ve seen several organizations determine those constraints based on how they were currently operating, instead of what they could be doing.

A common rule is that developers should not have uncontrolled access to production. Oftentimes companies take that to mean they must restrict all developer access to production. If, however, you focus on the uncontrolled part, you may find different incentives for the control. Could you mitigate risks by allowing developers to perform automated deployments and have read access to logs, but not a shell prompt on the systems? If so, you’re still enabling collaboration and rapid movement, without creating a specific handover from development to a production control team.

Conclusion

The way things have always been done probably isn’t the best way. It’s a way. I encourage you to dig in and play devil’s advocate for your current processes. Read a bit of the CIA sabotage manual, and if it starts to hit too close to home, look at your processes and rethink the approach. Even if you’re a line-level administrator or engineer, your questions and process concerns should be welcome. You should be able to receive justification for why things are the way they are. Dig in and fight that bureaucracy. Make change happen, either to the computers or to the process.

December 6, 2016

Day 6 - No More On-Call Martyrs

Written By: Alice Goldfuss (@alicegoldfuss)
Edited By: Justin Garrison (@rothgar)

Ops and on-call go together like peanut butter and jelly. It folds into the batter of our personalities and gives it that signature crunch. It’s the gallows from which our humor hangs.

Taking up the pager is an ops rite-of-passage, a sign that you are needed and competent. Being on-call is important because it entrusts you with critical systems. Being on-call is the best way to ensure your infrastructure maintains its integrity and stability.

Except that’s bullshit.

The best way to ensure the stability and safety of your systems is to make them self-healing. Machines are best cared for by other machines, and humans are only human. Why waste time with a late night page and the fumblings of a sleep-deprived person when a failure could be corrected automatically? Why make a human push a button when a machine could do it instead?

If a company was truly invested in the integrity of its systems, it would build simple, scalable ones that could be shepherded by such automation. Simple systems are key, because they reduce the possible failure vectors you need to automate against. You can’t slap self-healing scripts onto a spaghetti architecture and expect them to work. The more complex your systems become, the more you need a human to look at the bigger picture and make decisions. Hooking up restart and failover scripts might save you some sleepless nights, but it wouldn’t guard against them entirely.

That being said, I’m not aware of any company with such an ideal architecture. So, if not self-healing systems, why not shorter on-call rotations? Or more people on-call at once? After all, going 17 hours without sleep can be equivalent to a blood alcohol concentration of 0.05%, and interrupted sleep causes a marked decline in positive mood. Why trust a single impaired person with the integrity of your system? And why make them responsible for it a week at a time?

Because system integrity is only important when it impacts the bottom line. If a single engineer works herself half-to-death but keeps the lights on, everything is fine.

And from this void, a decades-old culture has arisen.

There is a cult of masochism around on-call, a pride in the pain and of conquering the rotating gauntlet. These martyrs are mostly found in ops teams, who spend sleepless nights patching deploys and rebuilding arrays. It’s expected and almost heralded. Every on-call sysadmin has war stories to share. Calling them war stories is part of the pride.

This is the language of the disenfranchised. This is the reaction of the unappreciated.

On-call is glorified when it’s all you’re allowed to have. And, historically, ops folk are allowed to have very little. Developers are empowered to create and build, while ops engineers are only allowed to maintain and patch. Developers are expected to be smart; ops engineers are expected to be strong.

No wonder so many ops organizations identify with military institutions and use phrases such as “firefighting” to describe their daily grind. No wonder they craft coats of arms and patches and nod to each other with tales of horrendous outages. We redefine what it means to be a hero and we revel in our brave deeds.

But, at what cost? Not only do we miss out on life events and much-needed sleep, but we also miss out on career progression. Classic sysadmin tasks are swiftly being automated away, and if you’re only allowed to fix what’s broken, you’ll never get out of that hole. Furthermore, you’ll burn out by bashing yourself against rocks that will never move. No job is worth that.

There is only one real benefit to being on-call: you learn a lot about your systems by watching them break. But if you’re only learning, never building, you and your systems will stagnate.

Consider the following:

  1. When you get paged, is it a new problem? Do you learn something, or is it the same issue with the same solution you’ve seen half a dozen times?
  2. When you tell coworkers you were paged last night, how do they react? Do they laugh and share stories, or are they concerned?
  3. When you tell your manager your on-call shift has been rough, do they try to rotate someone else in? Do they make you feel guilty?
  4. Is your manager on-call too? Do they cover shifts over holidays or offer to take an override? Do they understand your burden?

It’s possible you’re working in a toxic on-call culture, one that you glorify because it’s what you know. But it doesn’t have to be this way. Gilded self-healing systems aside, there are healthier ways to approach on-call rotations:

  1. Improve your monitoring and alerting. Only get paged for actionable things in an intelligent way. The Art of Monitoring is a good place to start.
  2. Have rules in place regarding alert fatigue. The Google SRE book considers more than two pages per 12-hour shift too many.
  3. Make sure you’re compensated for on-call work, either financially or with time-off, and make sure that’s publicly supported by management.
  4. Put your developers on-call. You’ll be surprised what stops breaking.

For those of you who read these steps and think, “that’s impossible,” I have one piece of advice: get another job. You are not appreciated where you are and you can do much better.

On-call may be a necessary evil, but it shouldn’t be your whole life. In the age of cloud platforms and infrastructure as code, your worth should be much more than editing Puppet manifests. Ops engineers are intelligent, scrappy, and capable of building great things. You should be allowed to take pride in making yourself and your systems better, and not just stomping out yet another fire.

Work on introducing healthier on-call processes in your company, so you can focus on developing your career and enjoying your life.

In the meantime, there is support for weathering rough rotations. I started a hashtag called #oncallselfie to share the ridiculous circumstances I’m in when paged. There’s also The On-Call Handbook as a primer for on-call shifts and a way to share what you’ve learned along the way. And if you’re burned out, I suggest this article as a good first step toward getting back on your feet.

You’re not alone and you don’t have to be a martyr. Be a real hero and let the pager rest.

December 5, 2016

Day 5 - How to fight and fix spam. Common problems and best tools.

Written By: Pablo Hinojosa (@pablohn6)
Edited By: Brendan Murtagh (@bmurts)

The Best Tools to Combat and Fix Common Spam Problems

This article takes a 30,000-foot view of what spam is, what anti-spam is, and how to fix common problems. This is not an article where you will find the exact command(s) to fix spam problems for your MTA. Rather, it will help you understand why you are suffering spam problems and how to identify the root cause. It is not intended as a how-to, but as a foundation for troubleshooting and implementing a configuration that helps rectify a spam or bad-reputation problem.

What is Spam?

Obviously you know what spam is, but do you know what it represents in global terms? According to the Kaspersky Lab Spam and Phishing in Q3 report, “Six in ten of all emails received are now unsolicited spam”. Imagine if six of every ten webpages you visited were unsolicited. What if they were phone calls, text messages, or clients of your business? Email is this century’s primary form of communication; business-related or not, our conversations are electronic. Can you imagine starting each day with six of every ten conversations unsolicited? It is the systems administrator’s responsibility to turn that ten into a hundred, a thousand, or as many zeroes as you are able to reach.

One thing that helps in understanding the scale of spam is that spam is a huge business. Unsolicited email is one of the most common methods of promoting hundreds of businesses, legal and illegal, from a small bike shop to a huge phishing or ransomware criminal network.

Spam is also a huge consumer of resources, both electronic and human. The SMTP protocol was designed from a naive perspective: its designers did not give much thought to how a communication could be cheated. That is why big providers have designed and implemented several protocols to authenticate senders and limit cheating in email delivery (which, at its core, is just two MTAs exchanging messages). It is important to learn what those protocols are and how to configure them properly so you are not flagged.
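
To see just how naive the protocol is, here is a raw SMTP session (the hostnames and addresses are hypothetical). Nothing in SMTP itself stops a client from claiming any sender it likes, which is exactly the gap that the protocols described below try to close:

telnet mx.example.net 25
HELO anyhost.example.org
MAIL FROM:<ceo@victim.example>
RCPT TO:<user@example.net>
DATA
From: CEO <ceo@victim.example>
Subject: Urgent

(message body)
.
QUIT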

However, there are instances where our own servers are sending actual spam. Obviously this is not our intention, but we need to identify the issue quickly and begin remediation immediately. In the next section, we will focus on how to detect the sending of spam and discuss techniques to resolve the problem.

Are you a spammer?

Whether or not you are a spammer is a matter of trust. The receiving MTA will question your trustworthiness, and you will have to show your credentials. Let’s see what we have to configure so that we can answer “no” to that question and be trusted.

The most important record is the MX record. As RFC 1034 states, it “identifies a mail exchange for the domain”. You can send email from servers that are not in your MX records and still be trusted, but the best way to be trusted (because it is the first thing to be checked) is to send email from the servers in your MX records. This is not always possible or desirable. Sometimes another MTA sends email spoofing the sending domain. This is typically unauthorized, which is why the SPF record was created. As RFC 7208 states, with an SPF record you can “explicitly authorize the hosts/IPs that are allowed to use your domain name, and a receiving host can check such authorization”. An SPF record is a TXT record you should create (I recommend this website) to tell the world which servers are allowed to send on behalf of your domain name.
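
As a sketch, assuming a hypothetical domain example.com that sends mail from its MX hosts plus one additional IP, the published TXT record could look like the following, and you can check what the world sees with dig:

example.com.  IN TXT  "v=spf1 mx ip4:203.0.113.25 -all"

dig +short TXT example.com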

Some MTAs require more validation. They need the email signed by your MTA in order to trust you, so they can verify that signature. This is implemented using DKIM. DKIM “defines a domain-level authentication framework for email using public-key cryptography and key server technology to permit verification of the source and contents of messages by either Mail Transfer Agents (MTAs)”. As Wikipedia says, DKIM resulted in 2004 from merging two similar efforts: “enhanced DomainKeys” from Yahoo and “Identified Internet Mail” from Cisco. The configuration of DKIM depends on your MTA and your OS. Generally speaking, the steps include, but aren’t limited to, generating a public/private key pair, setting up your MTA, and publishing a TXT record. A simple Google search for your MTA, OS, and DKIM should get you started. You can verify your configuration with this tool.
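
As a hedged sketch, assuming OpenDKIM on a Linux MTA and a selector named mail for the hypothetical domain example.com, the setup is roughly:

# generate a key pair for the "mail" selector
opendkim-genkey -d example.com -s mail
# mail.private is the key your MTA signs with; mail.txt holds the record
# to publish, roughly (public key truncated here):
# mail._domainkey.example.com. IN TXT "v=DKIM1; k=rsa; p=MIGfMA0GCSq..."

# once published, verify that it resolves
dig +short TXT mail._domainkey.example.com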

There are times when an MTA can say: “Hey! You are cheating me! I reject your email, and you should know you are a cheater.” That is why DMARC was created. “DMARC is a scalable mechanism by which a mail-originating organization can express domain-level policies and preferences for message validation, disposition, and reporting, that a mail-receiving organization can use to improve mail handling”. It basically uses the SPF and DKIM records to decide whether to accept or reject your email (and to notify you, if you wish). It is just another TXT record; you can use this generator, and there are several tools to create and validate your DNS record or email and to read DMARC reports.
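
A minimal sketch for the same hypothetical domain: start with a monitoring-only policy (p=none) to collect reports, then tighten it to quarantine or reject once you trust your SPF and DKIM configuration:

_dmarc.example.com. IN TXT "v=DMARC1; p=none; rua=mailto:dmarc-reports@example.com"

dig +short TXT _dmarc.example.com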

If you have not had spam problems previously and you configure MX, SPF, DKIM, and DMARC, it is 99% certain that you will be able to answer “no” to the “Are you a spammer?” question and be trusted. If you are not trusted, feel free to send me an email and I will help you figure out why, despite that configuration, you are not trusted. Be sure to check that all of your configuration is OK with this amazing tool. But wait, what happens if you are a spammer?

You are a spammer.

First of all, by “you” I mean your IP. Sometimes, usually on shared hosting services, you have the problem but you are not the cause. Or worse, your IP is not sending spam now, but it did before. Spam problems are so serious that once you have been flagged for spamming, you are not easily given the chance to be forgiven. It is all about reputation. Your reputation is based on your IP’s spam history and even on the spam history of your IP range. Yes, the IP seven bits away from yours may be sending spam, and your reputation could be affected. To find out, I recommend you visit this website.

Most of the time the problem is your IP itself, and thus your IP is blacklisted. I recommend this tool to check whether your IP is blacklisted. But be careful: sometimes you may appear blacklisted not because you sent actual spam, but because you do not respect some RFC. That is why this tool shows only the main and most famous blacklists. If you are blacklisted, you will have to:

  1. Be able to respond “no” to the “Are you a spammer?” question.
  2. Fix your “you are spammer” problems (locate the spam source and fix it).
  3. Request a delist to blacklist.

If you do only step 3, it will actually make things worse, because there is a strong possibility you will be blacklisted again. Spam is so serious that blacklists sometimes forgive you once, but rarely twice.
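
You can also query a DNSBL directly from the shell. As a sketch, to ask Spamhaus ZEN (one widely used blacklist) about a hypothetical sending IP of 203.0.113.25, reverse the octets and look up the resulting name; an answer in 127.0.0.x means you are listed, and no answer means you are not:

dig +short 25.113.0.203.zen.spamhaus.org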

Often enough, this scenario happens: you send an email and it is rejected, or it goes directly to the recipient’s spam folder. In the first case, the NDR (non-delivery report) can sometimes tell you the reason for the rejection, or the blacklist behind it. That level of detail depends on the receiving side’s MTA configuration.

In the second case, however, anti-spam software and the major providers work in a different fashion. Typically the provider flags your IP as a spammer, which results in all email originating from that host/IP going directly to the spam folder or being rejected, and it negatively affects your “internal reputation”. The reputation of an email is calculated as a score, using a mathematical formula in conjunction with pattern detection and defined rules analyzed by the mail server. This tool can show the score of your email content. This is very important when you are sending newsletters, because they have a high probability of being marked as spam. This is why many companies and people use dedicated email marketing services like MailChimp and AWeber.
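
You can also score a message locally before sending it. As a sketch, assuming SpamAssassin is installed and a copy of the message is saved as message.eml:

# test mode appends a content analysis report with the total score
# and every rule that fired
spamassassin -t < message.eml | tail -n 25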

With major providers, that internal reputation depends on additional “secret” factors, but we could also say it helps when a person (not a robot or a mathematical formula) says: this is not spam. Do you remember that button?

If you are having spam problems (mainly rejections) with only one provider, the following information will help you. If your problem is with Yahoo, you can use this form to say: “Hey, please forgive me.” Gmail also has a form for this. Microsoft (Outlook and Hotmail) has a form as well, but it also has internal reputation tools that show you what it thinks about you. They are named SNDS and JMRP, and if you are having problems with Hotmail (which happens all too often), they will help you a lot.

With small providers, sometimes the best option is to send an email to the postmaster requesting that your IP be whitelisted.

When you are sending too much spam, anti-spam software, services, or major providers will sometimes just reject your emails outright, because they have no doubt you are a spammer. If, from your MTA’s IP, you cannot telnet to port 25 of the MX record’s IP (it times out), you will not be able to send them any email, and your outgoing messages will queue up. This is the worst symptom: sometimes nobody can send email to one provider, sometimes nobody can send email to anybody, the telephone is ringing, and everybody is screaming. If that situation brought you here, I hope this article has helped you understand how serious a problem spam can be. Remember to validate your configuration, and always work as fast as possible to find the source of the spamming.
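
A quick way to test this from your MTA, sketched here against Gmail (any provider works the same way): look up the receiving domain’s MX hosts, then try to open an SMTP session on port 25. An immediate banner means you can deliver; a timeout suggests your IP is being blocked:

dig +short MX gmail.com
telnet gmail-smtp-in.l.google.com 25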

Locating a spam source is sometimes a hard, but necessary, task for system administrators. If you have read this far, you understand how anti-spam works technically, so you have more weapons to fight it. It is also a security research task, because usually one of your clients or your server has been compromised and used to send unsolicited email all around the world. In that case you will have to find the malicious code and also close the point of entry. In this situation, I suggest the following:

  • Study your mail logs to find out whether the spam comes from a single email account (a log-mining sketch follows this list). If it is one account, malware or a cracked email password is probably the root cause. Changing the password may fix the problem; however, if malware is still on the client’s device, the password could be compromised again. A re-image is the safest method to ensure a clean device or machine.
  • If the FROM address is generated, malicious code on the server itself may be generating the spam. You can create a wrapper for your MTA, OS, and platform stack to log the source of each email that is sent; for example, this is a wrapper for Sendmail, Apache, and PHP. Pay special attention if your platform is WordPress or Joomla: bots exploit old bugs in outdated plugins, and malicious “free” (but not really free) themes are used to plant the malware.
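
As a sketch of the log mining in the first point, assuming a Postfix MTA logging to /var/log/maillog (paths and field positions vary by distribution and MTA):

# rank authenticated senders by volume; one account dominating is a red flag
grep -o 'sasl_username=[^,]*' /var/log/maillog | sort | uniq -c | sort -rn | head

# rank envelope senders of messages stuck in the queue
mailq | awk '/^[0-9A-F]/ {print $7}' | sort | uniq -c | sort -rn | head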

In conclusion, spam is a huge problem that affects all email providers. The problem may be caused by a missing configuration that would establish your IP’s reputation, or by actual spam being sent because of malware. That is why it is important to take into account both your IP’s reputation and the security of your infrastructure, to avoid future problems.

Pablo Hinojosa is a Linux system administrator who worked on the Gigas Hosting support team, assisting thousands of clients affected by spam.

December 4, 2016

Day 4 - Change Management: Keep it Simple, Stupid

Written By: Chris McDermott
Edited By: Christopher Webber (@cwebber)

I love change management. I love the confidence it gives me. I love the traceability: it’s effectively a changelog for my environment. I love the discipline it instills in my team. If you do change management right, it allows you to move faster. But your mileage may vary.

Not everyone has had a good experience with change management. In caricature, this manifests as the Official Change Board that meets bi-monthly and requires all participants to be present for the full meeting as every proposed plan is read aloud from the long and complicated triplicate copy of the required form. Questions are asked and answered; final judgements eventually rendered. Getting anything done takes weeks or months. People have left organizations because of change management gone wrong.

I suppose we really should start at the beginning, and ask “Why do we need change management at all?” Many teams don’t do much in the way of formal change process. I’ve made plenty of my own production changes without any kind of change management. I’ve also made the occasional human error along the way, with varying degrees of embarrassment.

I challenge you to try a simple exercise. Start writing down your plan before you execute a change that might impact your production environment. It doesn’t have to be fancy – use notepad, or vim, or a pad of paper, or whatever is easiest. Don’t worry about approval or anything. Just jot down three things: step-by-step what you’re planning to do, what you’ll test when you’re done, and what you would do if something went wrong. This is all stuff you already know, presumably. So it should be easy and fast to write it down somewhere.
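
As a sketch, a plan for a hypothetical nginx tweak on a host called web01 might be no more than this:

Change:    raise worker_connections from 1024 to 4096 on web01
Plan:      1. cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.bak
           2. edit worker_connections, then validate: nginx -t
           3. systemctl reload nginx
Test:      curl -fsS http://web01/health returns 200; error log stays quiet
Roll-back: restore nginx.conf.bak, nginx -t, systemctl reload nginx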

When I go through this exercise, I find that I routinely make small mistakes, or forget steps, or realize that I don’t know where the backups are. Most mistakes are harmless, or they’re things that I would have caught myself as soon as I tried to perform the change. But you don’t always know, and some mistakes can be devastating.

The process of writing down my change plan, test plan, and roll-back plan forces me to think through what I’m planning carefully, and in many cases I have to check a man page or a hostname, or figure out where a backup file is located. And it turns out that doing all that thinking and checking catches a lot of errors. If I talk through my change plan with someone else, well that catches a whole bunch more. It’s amazing how much smarter two brains are, compared to just one. Sometimes, for big scary changes, I want to run the damn thing past every brain I can find. Heh, in fact, sometimes I show my plan to people I’m secretly hoping can think of a better way to do it. Having another human being review the plan and give feedback helps tremendously.

For me, those are the really critical bits. Write down the complete, detailed plan, and then make sure at least one other person reviews it. There’s other valuable stuff you can do like listing affected systems and stakeholders, and making notification and communication part of the planning process. But it’s critical to keep the process as simple, lightweight, and easy as possible. Use a tool that everyone is already using – your existing ticketing software, or a wiki, or any tool that will work. Figure out what makes sense for your environment, and your organization.

When you can figure out a process that works well, you gain some amazing benefits. There’s a record of everything that was done, and when, and by whom. If a problem manifests 6 or 12 or 72 hours after a change was made, you have the context of why the change was made, and the detailed test plan and roll-back plan right there at your fingertips. Requiring some level of review means that multiple people should always be aware of what’s happening and can help prevent knowledge silos. Calling out stakeholders and communication makes it more likely that people across your organization will be aware of relevant changes being made, and unintended consequences can be minimized. And of course you also reduce mistakes, which is benefit enough all by itself. All of these things combined allow high-functioning teams to move faster and act with more confidence.

I can give you an idea of what this might look like in practice. Here at SendGrid, we have a Kanban board in Jira (a tool that all our engineering teams were already using when we rolled out our change management process). If an engineer is planning a change that has the potential to impact production availability or customer data, they create a new issue on the Change Management Board (CMB). The template has the following fields:

  • Summary
  • Description
  • Affected hosts
  • Stakeholders
  • Change plan
  • Test plan
  • Roll-back plan
  • Roll-back verification plan
  • Risks

All the fields are optional except the Summary, and several of them have example text giving people a sample of what’s expected. When the engineer is happy with the plan, they get at least one qualified person to review it. That might be someone on their team, or it might be a couple of people on different teams. Engineers are encouraged to use their best judgement when selecting reviewers. Once a CMB has been approved (the reviewer literally just needs to add a “LGTM” comment on the Jira issue), it is dragged to the “Approved” column, and then the engineer can move it across the board until they’re done with the change. Each time the CMB’s status in Jira changes, it automatically notifies a HipChat channel where we announce things like deploys. For simple changes, this whole process can happen in the space of 10 or 15 minutes. More complicated ones can take a day or two, or in a few cases weeks (usually indicative of complex inter-team dependencies). The upper bound on how long it has taken is harder to calculate. We’ve had change plans that were written and sent to other teams for review, which then spawned discussions that spawned projects that grew into features or fixes and the original change plan withered and died. Sometimes that’s the better choice.

I don’t think we have it perfect yet; we’ll probably continue to tune it to our needs. Ours is just one possible solution among many. We’ve tried to craft a process that works for us. I encourage you to do the same.

December 3, 2016

Day 3 - Building Empathy: a devopsec story

Written By: Annie Hedgpeth (@anniehedgie)
Edited By: Kerim Satirli (@ksatirli)

’Twas the night before Christmas, and all through the office not a creature was stirring … except for the compliance auditors finishing up their yearly CIS audits.

Ahh, poor them. This holiday season, wouldn’t you love to give your security and compliance team a little holiday cheer? Wouldn’t you love to see a bit of peace, joy, and empathy across organizations? I was lured into technology by just that concept, and I want to share a little holiday cheer by telling you my story.

I’m totally new to technology, having made a pretty big leap of faith into a career change. The thing that attracted me to technology was witnessing this display of empathy firsthand. My husband works for a large company that specializes in point-of-sale software, and he’s a very effective driver of DevOps within his organization. He was ready to move forward with automating all of the things and bringing more of the DevOps cheer to his company, but his security and compliance team was, in his eyes, blocking his initiatives - and for good reason!

My husband’s year-long struggle with getting his security and compliance team on board with automation was such an interesting problem to solve for me. He was excited about the agile and DevOps methodologies that he had adopted and how they would bring about greater business outcomes by increasing velocity. But the security and compliance team was still understandably hesitant, especially with stories of other companies experiencing massive data breaches in the news with millions of dollars lost. I would remind my husband that they were just trying to do their jobs, too. The security and compliance folks aren’t trying to be a grinch. They’re just doing their job, which is to defend, not to intentionally block.

So I urged him to figure out what they needed and wanted (ENTER: Empathy). And what he realized is that they needed to understand what was happening with the infrastructure. I can see how all of the automated configuration management could have caused a bit of hesitation on behalf of security and compliance. They wanted to be able to inspect everything more carefully and not feel like the automation was creating vulnerability issues that were out of their control.

But the lightbulb turned on when they realized that they could code their compliance controls with a framework called InSpec. InSpec is an open-source framework owned by Chef but totally platform agnostic. The cool thing about it is that you don’t even need to have configuration management to use it, which makes it a great introduction to automation for those that are new to DevOps or any sort of automation.

(Full-disclosure: Neither of us works for Chef/InSpec; we’re just big fans!)

You can run it locally or remotely, with nothing needing to be installed on the nodes being tested. That means you can store your InSpec test profile on your local machine or in version control and run it from the CLI to test your local machine or a remote host.

# run test locally
inspec exec test.rb

# run test on remote host on SSH
inspec exec test.rb -t ssh://user@hostname

# run test on remote Windows host on WinRM
inspec exec test.rb -t winrm://Administrator@windowshost --password 'your-password'

# run test on Docker container
inspec exec test.rb -t docker://container_id

# run with sudo
inspec exec test.rb --sudo [--sudo-password ...] [--sudo-options ...] [--sudo-command ...]

# run in a subshell
inspec exec test.rb --shell [--shell-options ...] [--shell-command ...]

The security and compliance team’s fears were finally allayed. All of the configuration automation that my husband was doing had allowed him to see his infrastructure as code, and now the security and compliance team could see their compliance as code, too.

They began to realize that they could automate a huge chunk of their PCI audits and verify every time the application or infrastructure code changed instead of the lengthy, manual audits that they were used to!

Chef promotes InSpec as being human-readable and accessible for non-developers, so I decided to learn it for myself and document on my blog whether or not that was true for me, a non-developer. As I learned it, I became more and more of a fan and could see how it was not only accessible, but in a very simple and basic way, it promoted empathy between the security and compliance teams and the DevOps teams. It truly is at the heart of the DevSecOps notion. We know that for DevOps to deliver on its promise of creating greater velocity and innovation, silos must be broken down. These silos being torn down absolutely must include those of the security and compliance teams. The InSpec framework does that in such a simple way that it is easy to gloss over. I promise you, though, it doesn’t have to be complicated. So here it is…metadata. Let me explain.

If you’re a compliance auditor, then you’re used to working with PDFs, spreadsheets, docs, etc. One example of that is the CIS benchmarks. Here’s what a CIS control looks like.

And this is what that same control looks like when it’s being audited using InSpec. Can you see how the metadata provides a direct link to the CIS control above?

control "cis-1-5-2" do
  impact 1.0
  title "1.5.2 Set Permissions on /etc/grub.conf (Scored)"
  desc "Set permission on the /etc/grub.conf file to read and write for root only."
  describe file('/etc/grub.conf') do
    it { should be_writable.by('owner') }
    it { should be_readable.by('owner') }
  end
end

And then when you run a profile of controls like this, you end up with a nice, readable output like this.

When security and compliance controls are written this way, developers know what standards they’re expected to meet, and security and compliance auditors know that they’re being tested! InSpec allows them to speak the same language. When someone from security and compliance looks at this test, they feel assured that “Control 1.5.2” is being tested, and they can see what its impact level is for future prioritization. They can also read plainly how that control is being audited. And when a developer looks at this control, they see a description that gives them a frame of reference for why this control exists in the first place.

And when the three magi of Development, Operations, and Security and Compliance all speak the same language, bottlenecks are removed and progress can be realized!

Since I began my journey into technology, I have found myself at 10th Magnitude, a leading Azure cloud consultancy. My goal today is to leverage InSpec in as many ways as possible to add safety to 10th Magnitude’s Azure and DevOps engagements so that our clients can realize the true velocity the cloud makes possible.

I hope this sparked your interest in InSpec; it is my holiday gift to you! Find me on Twitter @anniehedgie, and find much more about my journey with InSpec and technology on my blog.