The 4 KPIs Behind High-Performance IT

I recently read Accelerate: The Science of Lean Software and DevOps – Building and Scaling High Performing Technology Organizations. The premise is that a) DevOps principles and practices lead to high IT performance and b) this is scientifically provable. The authors back their case with cohort analysis. This may feel like hand-wavy magic, but you don’t have to buy their conclusion hook, line, and sinker.

Actually, that’s not even what this post is about. This post is about something more valuable than the book’s conclusion: Accelerate offers a concrete, quantifiable definition of IT performance. Given that definition, it’s possible to empirically compare, contrast, rank, and ultimately improve your team’s performance. This was the most valuable insight in the book for me, because I’ve struggled and fallen short in capturing IT performance in numbers.

Accelerate breaks IT performance down into four KPIs:

  1. Lead Time: time it takes to go from a customer making a request to the request being fulfilled.
  2. Deployment Frequency: frequency as a proxy for batch size since it is easy to measure and typically has low variability. In other words, (desirable) smaller batches correlate with higher deployment frequency.
  3. Mean Time to Restore (MTTR): since software failures are expected, it makes more sense to measure how quickly teams recover from failure than how often failures happen.
  4. Change Fail Percentage: a proxy measure for quality throughout the process.

These KPIs are powerful because they focus on global outcomes and capture many intangibles (such as “quality”) throughout the Software Development Lifecycle (SDLC). Accelerate proposes these KPIs, but does not offer a method of capturing them throughout the SDLC. This post suggests possible approaches and solutions for measuring these KPIs.

Lead Time

Lead time is the easiest to measure once you’ve defined the start and completion criteria. “Customer making a request” may mean the moment a task is taken from the backlog and work begins. It may also mean the moment an item enters the backlog. These are important distinctions that produce very different results. One clock starts only after planning and prep work complete; the other also includes that upfront time, which varies widely even for similar requests.

Accelerate refers to this as the “fuzzy frontend” problem. The authors settle on measuring the time it takes to implement, test, and deliver because it’s easier to measure and has less variability. So the clock starts when an item moves from the backlog into any of the “work in progress” states and ends when the work meets your internal definition of “done” (for example, deployed and verified in production).

Measuring this is straightforward with most task/project management software, provided it records timestamps when items move between states. Collecting the measurements likely means writing a program that polls the API or accepts webhooks, calculates the difference between the start and end times, and sends the timing to your telemetry system.

Bear in mind that this metric should be reported to the telemetry system in seconds, but viewed in hours (if you’re extremely good) or days. Measuring in “sprints” or whatever arbitrary unit your team uses is not advisable. This number should be easily understood by people inside and outside your organization.
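
To make that concrete, here is a minimal sketch of such a webhook collector in Python with Flask. The payload fields, metric name, and telemetry endpoint are all assumptions for illustration; adapt them to whatever your task tracker and telemetry system actually provide.

```python
# Minimal sketch of a lead-time collector, assuming a hypothetical task-tracker
# webhook that posts JSON with "status", "started_at", "finished_at", and "team"
# fields (ISO-8601 timestamps). Adapt the names to what Jira, Trello, etc. send.
from datetime import datetime

import requests  # pip install flask requests
from flask import Flask, request

app = Flask(__name__)
TELEMETRY_URL = "https://telemetry.example.com/metrics"  # hypothetical endpoint


def emit(metric: str, value: float, tags: dict) -> None:
    """Send one data point to the telemetry system (payload shape is an assumption)."""
    requests.post(TELEMETRY_URL, json={"metric": metric, "value": value, "tags": tags})


@app.route("/task-webhook", methods=["POST"])
def task_webhook():
    event = request.get_json()
    # Only record items that just reached the internal definition of "done".
    if event.get("status") == "done":
        started = datetime.fromisoformat(event["started_at"])
        finished = datetime.fromisoformat(event["finished_at"])
        # Report in seconds, as recommended above; view it in hours or days.
        emit("lead_time_seconds", (finished - started).total_seconds(),
             {"team": event.get("team", "unknown")})
    return "", 204
```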



Deployment Frequency

Deployment frequency is a simple counter, incremented whenever someone initiates a deploy and aggregated over a given time window. Incrementing the counter may happen via automated deployment pipelines, manual scripts, or API calls. Technically, this comes down to injecting a call that increments the counter into your deploy process.

If the team is already using automated deployment pipelines, then it’s likely a straightforward API integration or a matter of scraping information from the tool. If you’re unfortunate enough to have no deployment software, don’t let that stop you. Distribute a simple script (which may really be as simple as a curl command) and run it manually at the right time.
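
As a rough sketch, that script could look something like the following; the telemetry endpoint, metric name, and SERVICE_NAME variable are assumptions, and a curl one-liner POSTing the same JSON works just as well.

```python
# Minimal sketch of the "simple script": increment a deploy counter at the end of
# your deploy process. TELEMETRY_URL and the payload shape are assumptions; a
# curl one-liner POSTing the same JSON from a pipeline step works just as well.
import os

import requests  # pip install requests

TELEMETRY_URL = "https://telemetry.example.com/metrics"  # hypothetical endpoint


def record_deploy(service: str) -> None:
    """Count one deploy, tagged with the service it belongs to."""
    requests.post(TELEMETRY_URL,
                  json={"metric": "deploys", "value": 1, "tags": {"service": service}})


if __name__ == "__main__":
    record_deploy(os.environ.get("SERVICE_NAME", "unknown"))
```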

Mean Time to Restore (MTTR)

This KPI also suffers from the “fuzzy frontend” problem: it’s difficult to define when an incident or outage begins. You’ve likely been in situations where outage-level conditions existed before anyone was paged, whether due to poor production telemetry, false positives, or on-the-ground reports from customers. Regardless, when an outage occurs someone is paged, either automatically or manually. That’s when the clock starts. The clock ends when the outage is resolved.

On-call software like PagerDuty provides this data out of the box. In fact, PagerDuty even exposes it in its analytics feature. However, it only provides monthly breakdowns, and that’s not enough. Tracking this data likely requires an API integration via webhooks: one for when an incident is created and another for when it is resolved. Calculate the difference and send a data point to your telemetry system.
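
Here is a minimal sketch of such an integration, assuming a generic incident webhook that sends “triggered” and “resolved” events with an incident id and a timestamp; the field names are placeholders to map onto the actual PagerDuty payload.

```python
# Minimal sketch of an MTTR collector, assuming a generic incident webhook that
# delivers "triggered" and "resolved" events with an incident id and a timestamp.
# The field names are placeholders; map them onto the actual PagerDuty payload.
from datetime import datetime

import requests
from flask import Flask, request

app = Flask(__name__)
TELEMETRY_URL = "https://telemetry.example.com/metrics"  # hypothetical endpoint
open_incidents = {}  # incident id -> paged-at time; use a real store in production


@app.route("/incident-webhook", methods=["POST"])
def incident_webhook():
    event = request.get_json()
    incident_id = event["incident_id"]
    occurred_at = datetime.fromisoformat(event["occurred_at"])

    if event["type"] == "triggered":
        open_incidents[incident_id] = occurred_at  # the clock starts on the page
    elif event["type"] == "resolved" and incident_id in open_incidents:
        restore_seconds = (occurred_at - open_incidents.pop(incident_id)).total_seconds()
        requests.post(TELEMETRY_URL,
                      json={"metric": "time_to_restore_seconds", "value": restore_seconds})
    return "", 204
```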

Change Fail Percentage

The term “failure” must be unpacked. Accelerate defines a failure as a change that “result[s] in degraded service or subsequently require[s] remediation (e.g., leads to service impairment or outage, require[s] a hotfix, a rollback, a fix-forward, or a patch).”

Note that this definition does not include changes that failed to deploy. That information is useful, but not this KPI’s focus.

One approach is to combine two counters: one for deploys and another for failures. The deployment frequency counter can be repurposed for the first; the second increments whenever a failure later occurs. The failure rate is then the sum of failures divided by the sum of deploys in a given time window.
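
In code, the calculation itself is a one-liner; the sum_counter helper in the usage note below is a hypothetical stand-in for however your telemetry system sums a counter over a time window.

```python
# The calculation is trivial. sum_counter() below is a stand-in for however your
# telemetry system sums a counter over a window (a query API, a dashboard formula,
# etc.); the metric names match the hypothetical ones used above.
def change_fail_percentage(failures: int, deploys: int) -> float:
    """Failures divided by deploys as a percentage; zero if nothing was deployed."""
    return 100.0 * failures / deploys if deploys else 0.0

# Example (hypothetical query helper):
# change_fail_percentage(sum_counter("deploy_failures", days=30),
#                        sum_counter("deploys", days=30))
```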

Automatically counting service impairments or outages maps nicely onto new pages: each new page counts as a failure. This is easily implemented with an API integration via webhook, which you may already have from the MTTR KPI.

Automatically counting hotfixes, rollbacks, roll-forwards, or patches is less straightforward. This may be achieved by looking at source control. A webhook integration with your code host can get the job done: the integration increments the failure counter whenever a commit message includes a “hotfix” keyword or a branch prefixed with “hotfix” is merged. The same strategy works for roll-forwards and patches, as long as commit messages or branches include the appropriate tokens.
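
Here is a rough sketch of such a code-host webhook handler; the payload fields, tokens, and endpoint are assumptions to adapt to GitHub, GitLab, or whatever host you use.

```python
# Minimal sketch of a code-host webhook that counts hotfix-style failures, assuming
# a generic push/merge payload with a branch ref and commit messages. The field
# names and tokens are assumptions; adapt them to GitHub, GitLab, or your host.
import requests
from flask import Flask, request

app = Flask(__name__)
TELEMETRY_URL = "https://telemetry.example.com/metrics"  # hypothetical endpoint
FAILURE_TOKENS = ("hotfix", "rollback", "roll-forward", "patch")


@app.route("/scm-webhook", methods=["POST"])
def scm_webhook():
    event = request.get_json()
    branch = event.get("ref", "").lower()
    messages = [commit.get("message", "").lower() for commit in event.get("commits", [])]

    # Count one failure if the branch name or any commit message carries a failure token.
    if any(token in branch for token in FAILURE_TOKENS) or \
       any(token in message for token in FAILURE_TOKENS for message in messages):
        requests.post(TELEMETRY_URL, json={"metric": "deploy_failures", "value": 1})
    return "", 204
```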

Other cases may be handled by integrating with your deployment software, or even tagging tasks as “failed” in project management and counting those with an API integration. In the worst case, you can always distribute an “increment failure counter” script and run it manually as needed.

Conclusion

Ideally, collecting this data would happen as transparently as possible: just let the team do the work and watch the data come out. Some data collection may require manual tracking in the worst case, but that should be avoided as much as possible. In all likelihood, adding specific tags or other metadata in your project management software, combined with webhook integrations, can work around manual tracking.

Collecting and/or scraping data from the various systems will require some technical work. Check out Zapier before writing code if possible. Zapier can integrate with hundreds of existing systems and even process incoming webhooks. That is far easier to start with than writing code; only write code if it’s absolutely necessary.

All this data should be reported in real time and displayed on a single dashboard visible to the entire organization. Do not sequester this data or fall back to looking at it once a month or less often. Also tag all data points with team, product, service, or whatever other dimensions may be useful for drilling down. Keep this data at the forefront of your organization and let it guide you. Now you have the basis to move your team forward through data, objectives, and key results.
