… or how you can make on-call for service providers a virtuous cycle.
In the tech world, for everything that is running as a service or website 24/7/365, there must be someone available to take care of any issues that arise. It’s been quite common to see that someone in the organization monitoring the service (or website) and act in case of issues. Some orgs do have an operations team on the frontline, others have developers. Even if the operations team exists, someone from the engineering team who develops and/or maintains that code must be available if there’s an issue that requires further investigation. All of these people that, either at work or at home are available through cell phone, pager, or email, are what is called on-call.
In my opinion, on-call can be a virtuous cycle that improves the code, provides good customer service, and over time tends to decrease the issues. BUT the problem is that, very rarely the environment and the on-call is set up for this virtuous cycle to happen.
When are you NOT setting your company up for a successful on-call?
- Customer service is not the priority: the primary goal of being on-call is to detect issues before customers and prevent them from hitting the issue. The secondary issue is, if an impact is inevitable, to reduce the impact on the customer. This must be the direction given to all engineers on-call, and must be valued by management through rewards, being a top item on the list for performance, time for this work must factored into schedules/estimates, among others. But this is not all, in order to be really customer-centric, the right investment in monitoring must be made, the lessons must be applied in order to improve the process, tools must be written to speed up process, tests for disasters or failures must be conducted. When on-call does not have this focus, but it is essentially to meet metrics (time the issue was in my court), or just to get rid of and go back to the task that has higher priority or rewards, for instance, then this on-call is not virtuous and tend to provide poor customer service.
- Monitoring is not sufficient: if the company is really customer-centric, monitoring in the code must be all over the place. Monitor must be sensitive to minor changes before the customer is impacted and must be very proactive in alerting the team. Not having monitoring in the system may sound like it’s good (no news is good news), but you can essentially be providing a very poor service to the customer as you will only fix some of the issues IF and WHEN the customer reports it. Monitoring cannot be an afterthought, monitoring cannot have second priority. At least not if you plan to be serious about providing a service.
- Wrong people on-call: oftentimes to meet the obligation of being oncall and to reduce the load, anyone on the team is added to the on-call list, even if the person is not familiar with the system. Doing that, although can look good at first as knowledgeable people don’t need to be formally on-call, is a really bad idea in my opinion. First, the knowledgeable person will very likely be engaged anyway. If the person is not available, you risk having an outage or you risk having a customer impact. If the person is available, by doing this you essentially delayed the resolution by having a middle-main engineer whose role is just to engage others. This is not virtuous, as it essentially doesn’t solve the problem of reducing the load, but also add load on whoever is on-call and decreases the customer service quality.
- On-call people are not vested into really fixing the issues and improving the service: essentially, if on-call people do not have a feeling of ownership for what they are maintaining and being woken up for, this is not a virtuous cycle. It’s not virtuous because they will not be vested into doing the proper investigation, fixing the issue, or improving the process – after all, they don’t feel that they are owners and being on-call can be just an obligation, a nuisance. To be virtuous, the right thing is to get engineers to own the service, to be able to make the decisions (and be held accountable for them), to have the desire to improve the service. That is one of the pre-requisites for people to be on-call and actually see the importance of that activity.
- Many of the issues are due to other teams/orgs at the company: having people vested into the service can do so much if the interdependencies makes them suffering the consequences of other team’s issues. This is not a problem unless management does not make the right investment to fix the underlying services that causing issues. Otherwise, it can cause the sense of ownership not to be sufficient as the on-call engineer will be paying the price for something that s/he’s not responsible for, and the cycle is no longer virtuous.
- Too many issues: for any service that is very popular and/or not yet mature, besides having the right people on-call, these people must be able to handle the issues in a reasonable fashion and still be able to sleep and have the basic needs met. Once you’re over this threshold and people have to live for the issues, as they are getting out of hand, something is amiss and needs to be fixed. If the investments are made, this can be fixed and does not impact the virtuous cycle, otherwise it can show that on-call and customer service is not the priority for the company, generating discomfort and dissatisfaction on the side of those that have to do it.
- The feeling is just to get rid of or blame someone else: still related to other items, if the goal is to meet some metrics and just get rid of the issue or blame someone else, this cycle is most definitely not virtuous, as it does not bring the benefits of a great customer service and of improving the process.
I have felt good being on-call given the right circumstances, and being on-call taught me so much about how to engineer a system to be run as a service, and improved and matured the systems I was being on-call for. It’s a matter of not doing the things above, showing respect for those professionals that are on-call, and getting them to be vested into the service through ownership. Essentially, make the "have your skin in the game" something that makes sense to the engineer. It’s rare to find these things, but if you do it, you can rest assured that you started your virtuous cycle.
What do you think? Do you have any other tips or had a different experience being on-call?