This post is for anyone who wants to get serious about providing a service, whether or not you work in the cloud computing space. Working at Amazon Web Services and Microsoft taught me a few things about running a service, but most importantly I believe it taught me the right mindset, which applies whether you are an engineer who wants to get better at it or a manager or entrepreneur who wants to learn what it takes to do your best.
At Amazon I started to understand what having a services mindset really means, especially since I joined after running a web site on my own. Being a good developer does not by itself make you a good services developer unless you are really open to learning. There is a mindset behind doing the right thing, and it is primarily customer obsession. That is where I have to give Amazon all the credit: anyone who is an Amazon customer knows how much the company focuses on customer service. More than that, if you are or were an Amazon engineer, you know how deeply ingrained customer focus is in the company’s culture, even to the point that the company is willing to sacrifice engineers’ time for the sake of providing a customer-focused, world-class service.
I first understood what that actually meant when I saw how new hires, who joined Amazon from companies without much of a services background, thought about services in general. Had these people been in charge, we could have drifted away from the culture of focusing on the customer experience, but luckily the culture is so strong that they ended up adapting to it and learning the Amazon way of doing things. At Microsoft, it is far more common to see people without this mindset, since the company, unlike Amazon, has not had a culture of providing services since its inception.
[One note before getting to the lessons: if you know companies that are serious about providing services, please share them with me; I’d love to know about them]
These are the important lessons or pieces of advice I’d give:
- No customer must be forgotten: that simply means that every customer matters. This may sound silly, especially if you believe that all companies have this goal. But that is not true, so let me repeat: the goal is that not a single customer should have a bad experience. If necessary, engineers or managers must be woken up when even one customer is experiencing issues. Having a mechanism to give credits back when something goes wrong is not the same as really caring about each and every customer.
- I’ve seen web service systems without any metrics on how customers are actually interacting with the system. The problem comes when management claims that the system is very successful even though nobody actually knows how it behaves: there were simply no metrics that could deliver the bad news.
- Another common belief is that everything is going well in your system because you did not get any customer support calls (assuming you provide customer support at all). The first big mistake here is to believe that all customers will call you if something goes really wrong. I, for one, am a customer who will probably give up on the company rather than expect that something will be done if I call customer support. This model is very bad because it does not account for the business you lose when customers walk away from your service, nor for the bad publicity and word of mouth you will get because of that.
- All customer-impacting operations must be traceable, tracked, and investigated, so that you get better over time and have the data to show that you are actually getting better.
- This is a corollary of the first one, but it is very important that the service provider is diligent about tracing customer scenarios, detecting failures, detecting performance degradation, and alerting on them. These must be tracked, and there must be a serious commitment to fixing issues and improving the service.
- This is another thing we may take for granted, but companies are not always diligent about tracing issues, detecting failures, and detecting performance degradation. Most importantly, many companies do not necessarily care whether the issues get fixed at all.
- Don’t deceive yourself over the quality of your service.
- Only claim that you have a successful or stable service if you have visibility and know what you are talking about. Do you really catch all potential issues and log them? Do you know that your customers are not experiencing something bad? Do you have visibility into the performance data and trends over time? Do you alert in case things start to get bad?
- Software fails, so everything must be monitored comprehensively. Better safe than sorry here.
- I’ve seen a lot of pushback against monitoring components from people who don’t really understand why monitoring is important in the first place. One of the counterarguments is that the component should be reliable and should have been tested before going to production, therefore comprehensive monitoring is not required. It may sound ridiculous to have to say this, but software fails for reasons that one doesn’t expect. Yes, it does: because of bugs in your code that your tests did not catch, bugs in libraries you use, hardware issues, and network issues, among others. Anyone who can formally prove that their software is bug-free and will always behave well, so that it doesn’t need monitoring, deserves a prize.
- On monitoring, one of the most important things is to think about how to avoid customer impact altogether, so you have to be smart and start alerting on any signal that indicates things are about to go downhill, before they actually do.
- There is an implicit belief that services will be available, so you had better strive for that.
- If you are providing a service, especially in the cloud computing space, you had better be available, especially the core part of your service. You need to understand the system very well from all aspects so that you don’t neglect anything. The moment you neglect something, you are risking availability for some customers and, if you really care about your business, you will not allow that to happen.
- A consequence of having to be available is that you need to scale as demand grows. Not thinking about this ahead of time can kill your business. And again, offering credits back will not repair the damage to your brand.
- Know distributed systems.
- Services are distributed systems in all cases. At the very least, your customers will be calling you remotely, but typically there are also many distributed components internally. Do not architect and launch a service without a minimum knowledge of distributed systems. And yes, I am talking about theory here. Know partitioning, know Paxos, know the CAP theorem, know about partial failures. Or have someone on your team who knows, and always run ideas by this person. If you don’t know the theory of distributed systems well, and especially if you don’t have experience running a service, you will very likely get it wrong. In the best case it may take a while for something bad to happen, and unfortunately your customers will be the ones most impacted.
- Have passion and be proud of what you deliver. Have ownership.
- You need an organizational structure where people feel ownership and are proud of what they are doing. If nobody is the owner, and there’s always someone else responsible for a component, then you will not get the same dedication and willingness to learn and solve problems as you would if people felt that it’s their “baby”.
- Share and learn the lessons
- Services that are not successful typically don’t run into many problems, but it is easy to see how many lessons successful services have had to learn. An effective company learns these lessons and shares them broadly. Have a knowledge base; don’t worry about sharing the shame; make sure people really understand what caused the problem and what it takes to fix it. Follow through to make sure changes are driven across the company and these mistakes are not repeated.
- Be serious about on-call
- A company that cares about customers will want to be on top of issues. This is not an out-of-band process in which someone will perhaps take a look at the issue at some point: any customer-impacting issue causes the team to stop what they are doing and go fix it.
- An on-call rotation, although bad for engineers if badly implemented, is vital for a well-run service. Those on call need to be trained, but most importantly they must want to fix the problem and avoid or reduce customer impact.
- Engineers must know the system well enough to diagnose issues and potentially even fix issues in different components. That is the goal of a well-implemented on-call system.
- On-call must be reliable: whoever is on call must be reached by page, SMS, email, or whatever it takes, through the most reliable channel available. You can’t afford to lose alerts.
- Make your deployment process easy
- There are many aspects to deployment, including auditability, but one thing that cannot be traded for anything is simplicity. Deployment must be simple and easy for two reasons: you need to be quick to release new features and updates, and you must be even quicker to fix customer issues. Customer-impacting issues may require fixes right away, and if the process gets in the way, as it does in some cases, who will be the one impacted? The customer.
- Have people with hands-on experience making the technical decisions
- People with skin in the game, who have actual experience running the service, must make the decisions or have great influence over them. If you don’t have service experience, get someone to help you at the beginning, and be humble enough to take their advice and learn from them.
- Take your customers’ feedback
- Except in some rare cases, your customers typically know more about using the service than you ever will. Don’t be so pretentious as to assume that you know more. Pay attention to their feedback, incorporate it into your planning, and be appreciative. Customers help so much, and they stop doing so if they notice you don’t care.
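
To make the monitoring lessons above a bit more concrete, here is a minimal sketch in Python of tracking customer-facing operations and alerting before things get really bad. Everything in it is an assumption made for illustration: the `ErrorRateMonitor` class, the window size, and the two thresholds are hypothetical and not taken from any real system, and a real service would feed this from its request logs and wire the result into its paging tool.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sketch: watches the most recent requests for a rising error rate."""

    def __init__(self, window_size=100, warn_threshold=0.01, page_threshold=0.05):
        # All three values are illustrative assumptions; tune them per service.
        self.warn_threshold = warn_threshold  # start alerting early
        self.page_threshold = page_threshold  # wake somebody up
        # Only the latest `window_size` outcomes are kept; older ones fall off.
        self.outcomes = deque(maxlen=window_size)

    def record(self, customer_id, ok):
        """Record one customer-facing operation and whether it succeeded."""
        self.outcomes.append((customer_id, ok))

    def error_rate(self):
        """Fraction of failed operations in the current window."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def impacted_customers(self):
        """Customers with at least one failure in the window, so none is forgotten."""
        return {customer_id for customer_id, ok in self.outcomes if not ok}

    def status(self):
        """Return "ok", "warn" (early signal), or "page" (customer impact is real)."""
        rate = self.error_rate()
        if rate >= self.page_threshold:
            return "page"
        if rate >= self.warn_threshold:
            return "warn"
        return "ok"


# Usage: 98 successful requests followed by 2 failures.
monitor = ErrorRateMonitor()
for i in range(98):
    monitor.record(f"customer-{i}", ok=True)
monitor.record("customer-98", ok=False)
monitor.record("customer-99", ok=False)
print(monitor.status(), sorted(monitor.impacted_customers()))
```

With these assumed thresholds, two failures out of a hundred requests already put the monitor in the "warn" state, well before the "page" level, and the two affected customers can be looked up by name instead of disappearing into an average.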