How can DevOps metrics help improve team productivity?
To compete with the best in the world, every organization aims to improve its developer productivity and wants its teams to deliver high-quality work at high velocity.
As an organization, we also promise our clients high quality and high velocity. Data empowers us to identify gaps and arrive at root causes, so we can improve our future performance relative to our past performance.
Therefore, we wanted to come up with a set of DevOps metrics that can be monitored to measure our productivity and identify the areas that can be improved.
How do we come up with such DevOps metrics and measure them? It would be great to have metrics that are backed by data, so that we can see their measured impact. I have previously worked on capturing such metrics and introducing the processes, tools, and automation required to improve them.
Based on an exhaustive survey, the DevOps report identified a few DevOps metrics and areas for process improvement.
How fast work gets done and how much work is getting delivered (Velocity)
- Deployment Frequency
- Lead Time
How reliable your code is (High Quality)
- Change Failure Rate
- Mean Time to Recover
What do these DevOps metrics indicate, and how can we use them to improve our productivity and process?
Throughput metrics and quality metrics go hand in hand, and both should improve over time.
Let's say we start pushing more and more features at a faster rate, and we also get more and more bugs in production. This means we are pushing unfinished, untested code to production. In this case, our throughput metrics will improve but our quality metrics will go down.
Similarly, let us say we have very good quality metrics but our throughput metrics are not improving, indicating that we are shipping quality code at a much slower rate. Here, we might miss out on business opportunities because we are rolling out features slowly.
So both the metrics should improve together.
Let's dive deep into the changes that can improve these metrics and what it means if they are not getting better.
Deployment Frequency
The number of deployments per day, week, or month. This number should go up as we improve.
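As a rough illustration of how this number could be derived, here is a minimal sketch that groups deployment timestamps by ISO week. The function name `deployments_per_week` and the sample timestamps are hypothetical; in practice the timestamps would come from your own CI/CD or deployment logs.

```python
from collections import Counter
from datetime import datetime


def deployments_per_week(deploy_times):
    """Count deployments per (ISO year, ISO week)."""
    counts = Counter()
    for ts in deploy_times:
        iso = ts.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


# Hypothetical deployment timestamps, for illustration only.
deploys = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 3, 15, 30),
    datetime(2024, 1, 4, 9, 0),
    datetime(2024, 1, 9, 11, 0),
]

weekly = deployments_per_week(deploys)
for (year, week), count in sorted(weekly.items()):
    print(f"{year}-W{week:02d}: {count} deployments")
```

Grouping by day or month works the same way; only the key changes.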
What does it mean if your organization has low deployment frequency?
- You might not have an automated deployment process, which is why you deploy less frequently, e.g. once a week or once a month.
- You might ship multiple features in one go because of dependencies on other components, resulting in a large batch of code changes. The larger the batch size, the greater the chance of breaking the code in production. Identifying issues also becomes harder with a larger batch size.
We should keep the batch size of code changes small so that changes can be tested and moved to production quickly, and so that we get feedback as soon as possible if there are any issues.
Lead Time
The time taken to go from code committed to code successfully running in production.
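A minimal sketch of this calculation, assuming you can pair each change's commit time with the time it went live in production. The function name `lead_times_minutes` and the sample data are hypothetical. The median is used here as one reasonable summary, since a single slow change can skew the mean.

```python
from datetime import datetime
from statistics import median


def lead_times_minutes(changes):
    """Return commit-to-production durations in minutes.

    `changes` is a list of (committed_at, deployed_at) datetime pairs.
    """
    return [
        (deployed - committed).total_seconds() / 60
        for committed, deployed in changes
    ]


# Hypothetical changes, for illustration only.
changes = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 40)),    # 40 minutes
    (datetime(2024, 1, 1, 13, 0), datetime(2024, 1, 1, 15, 0)),   # 120 minutes
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 10, 50)),  # 50 minutes
]

durations = lead_times_minutes(changes)
median_lead_time = median(durations)
print(f"Median lead time: {median_lead_time:.0f} minutes")
```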
What does it mean to have a high lead time?
- You don't have a decoupled architecture (such as microservices) that allows you to make a code change and deploy it easily.
- Not understanding the business requirements well can lead to a feature being changed too many times before we can release it to production.
- Developers need to spend more time testing different scenarios manually because no automated testing pipeline is available.
- Absence of CI/CD pipelines to automatically test and deploy code changes.
- Absence of an automated infrastructure setup process to move things to production. For example, developers writing new services depend on the operations team to move code to production.
- Trunk-based development using feature toggles also helps push your code to production quickly and keeps the batch size small.
Change Failure Rate
This is the ratio of the number of deployments that cause a problem (bug, failure, outage) to the total number of deployments.
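The ratio can be written as a small function. This is only a sketch: the name `change_failure_rate` and the example counts are hypothetical, and how you decide that a deployment "failed" is up to your own incident tracking.

```python
def change_failure_rate(failed_deployments, total_deployments):
    """Fraction of deployments that caused a problem in production."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return failed_deployments / total_deployments


# Hypothetical counts: 3 problematic deployments out of 40.
rate = change_failure_rate(3, 40)
print(f"Change failure rate: {rate:.1%}")
```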
What causes a high change failure rate?
- Absence of automated testing, or insufficient test coverage, can lead to bugs being introduced in production.
- Absence of a post-deployment testing strategy. Canary deployments and blue-green deployments are ways to test your latest code changes before routing production traffic to them.
- Lack of proper code review. (See: How to do code review)
Mean Time to Recover
The average time to restore service once it goes down or is in an unusable state.
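A minimal sketch of the average, assuming each incident is recorded as the time the service went down and the time it was restored. The function name `mean_time_to_recover` and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average of (restored_at - down_at) over all incidents.

    `incidents` is a list of (down_at, restored_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (restored - down for down, restored in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


# Hypothetical incidents, for illustration only.
incidents = [
    (datetime(2024, 2, 1, 10, 0), datetime(2024, 2, 1, 10, 30)),  # 30 minutes
    (datetime(2024, 2, 5, 22, 0), datetime(2024, 2, 5, 23, 30)),  # 90 minutes
]

mttr = mean_time_to_recover(incidents)
print(f"MTTR: {mttr}")
```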
What stops us from bringing the services up quickly?
- Absence of good operations automation tools (CI/CD) to roll back and deploy changes quickly.
e.g. having automation tools to replace the hardware and deploy the code in case of a hardware crash.
- Absence of a resilient infrastructure setup. If one availability zone (AZ) goes down, you should still be able to serve requests.
- You don't have sufficient monitoring and alerting to be notified properly when something goes wrong.
- Absence of log search and distributed log tracing capabilities to quickly identify an issue and fix it.
How do we capture such metrics?
I am sure there are paid tools and plugins that measure these metrics, but I could not find any open-source tool that does. So I wrote an open-source tool (the initial idea came from Varun Achar when I was working with him) to capture these data points, run queries, and generate reports on top of them. Thanks to Vinay Wadagavi for contributing to the development of this tool.
I am running this tool in production to capture Throughput metrics and have exposed APIs to capture Quality metrics.
A few other metrics we capture are listed below:
- Number of commits per repository
- Average number of comments per PR: if it is low, either PR reviews are not happening properly or our code quality is already good enough that no one has comments
- The average number of files changed per PR
- Deployment time
You can pull many other data points that suit your needs.
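As an illustration of the per-PR averages, here is a minimal sketch. The record layout (`comments`, `files_changed`) and the sample values are hypothetical and are not taken from the tool described above; the real data would come from your source control provider.

```python
from statistics import mean

# Hypothetical pull request records, for illustration only.
pull_requests = [
    {"comments": 4, "files_changed": 3},
    {"comments": 0, "files_changed": 12},
    {"comments": 2, "files_changed": 6},
]

avg_comments = mean([pr["comments"] for pr in pull_requests])
avg_files_changed = mean([pr["files_changed"] for pr in pull_requests])

print(f"Average comments per PR: {avg_comments}")
print(f"Average files changed per PR: {avg_files_changed}")
```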
What numbers should we target?
High-performing organizations have numbers similar to these:
- Deployment Frequency: On-demand (multiple deploys per day)
- Lead Time: Less than an hour (Small batches required!)
- MTTR: Less than one hour
- Change Failure Rate: 0–15%
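To compare your own numbers with these targets, a small check like the following could be used. It is only a sketch: the function name `meets_targets` is hypothetical, and reading "multiple deploys per day" as an average of more than one deployment per day is an assumption.

```python
from datetime import timedelta


def meets_targets(deploys_per_day, lead_time, mttr, change_failure_rate):
    """Return, per metric, whether the measured value meets the target."""
    return {
        # Assumption: "multiple deploys per day" means more than one on average.
        "deployment_frequency": deploys_per_day > 1,
        "lead_time": lead_time < timedelta(hours=1),
        "mttr": mttr < timedelta(hours=1),
        "change_failure_rate": 0 <= change_failure_rate <= 0.15,
    }


# Hypothetical measurements, for illustration only.
report = meets_targets(
    deploys_per_day=3.0,
    lead_time=timedelta(minutes=45),
    mttr=timedelta(hours=2),
    change_failure_rate=0.10,
)
for metric, ok in report.items():
    print(f"{metric}: {'on target' if ok else 'needs work'}")
```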
These are a few of the processes and metrics we identified and implemented or improved to make our work more efficient. I hope some of them help you.
Please reach out to me if you have any questions or need help with your DevOps transformation journey.