Tuesday, October 30, 2012

Yoder on Good vs. Bad architecture

One of the biggest challenges in software engineering is to prove that shortsighted decisions, both at the low and high level, can be detrimental and cost more in the long term. So far, whenever I had to convince someone, it has been very challenging. And along these lines, it's always refreshing to see someone being sensible when it comes to that:

“If you think good architecture is expensive, try bad architecture.” Brian Foote & Joseph Yoder

Hashing functions and MurmurHash

Today I watched a presentation on cloud applications, and one of the interesting points was that users had to be hashed across shards in a relatively uniform way to make sure the load is evenly distributed.

While the presenter mentioned that they used an in-house hashing algorithm, they realized that a good hashing algorithm can make a lot of difference. In this case, his suggestion is to use MurmurHash.

I found a great post on StackExchange on that, which is a must read before picking the hashing function:

Which hashing algorithm is best for uniqueness and speed?

And, of course, leave aside the Not Invented Here syndrome and don't go implement your own function :-)
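
To make the sharding point concrete, here is a minimal sketch of mapping users to shards with an existing, well-tested hash function. Note the assumptions: MurmurHash is not in the Python standard library, so BLAKE2b from hashlib stands in for it; the shard count and the user ID format are made up for illustration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # hypothetical shard count, chosen only for illustration


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a well-tested hash function.

    BLAKE2b (standard library) is a stand-in for MurmurHash here.
    """
    digest = hashlib.blake2b(user_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_shards


def shard_histogram(user_ids, num_shards: int = NUM_SHARDS) -> Counter:
    """Count how many users land on each shard."""
    return Counter(shard_for(uid, num_shards) for uid in user_ids)


if __name__ == "__main__":
    # Made-up user IDs; print the per-shard counts to eyeball the spread.
    counts = shard_histogram(f"user-{i}" for i in range(100_000))
    for shard in range(NUM_SHARDS):
        print(shard, counts[shard])
```

A plain modulo over the hash is the simplest placement scheme. It reshuffles most users when the number of shards changes, which is a separate problem from picking the hash function itself.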

G-Wan Web Server

This post on the performance of Node.js vs. G-Wan caught my attention:

What makes something "notable"?

These are some of the points:
  • Node.js is not as great in terms of performance as normally advertised
  • They claim that their web server needs 2,444x fewer servers than Node.js to run a mere "hello world"
  • Technology media dismisses G-WAN, in spite of all the technical superiority

But what really caught my attention the most was not the product, but the tone of this post. To me it seems to be aggressive, and actually I notice that most of the web site seems to have the same tone. Look at the section that mentions whether it's open source:
G-WAN is a freeware. It means that it is free for all (commercial users included). But some virulent (anonymous) users claim that this is not enough. They exige G-WAN's source code, and, "at no cost".
This kind of tone definitely does not attract a lot of people, and can have quite the opposite effect, as it comes across as too radical. Of course, if they do have a compelling business proposal through their software, they can be at least somewhat successful, but a more neutral stance could be more fruitful and appeal to a larger audience.

One other interesting thing is their results when running benchmarks on Windows:
Linux was found to be much faster. After years of development this gap is surely larger now because Unix leaves more room for developers to innovate.
IIS 7.0/Windows is by far slower than all – despite being part of the kernel. G-WAN/Windows does better than Apache/Linux and GlassFish/Linux despite the Windows user-mode overhead which is 6x higher than on Linux. But G-WAN/Linux crunches G-WAN/Windows. Yes, Windows is that lame.
No wonder they discontinued the Windows version back in 2009. I wonder how much of it is due to the system and how much is because they didn't tune the OS for better performance - or even whether they used Windows Server or a Windows client edition.

And even when it compares to Tomcat, it seems that this G-Wan web server kicks ass:
G-WAN runs an "hello world" with 10x less CPU and 24x less RAM handling 11x more requests in 13x less time than Apache Tomcat… on a 6-Core. Many other languages (PHP, C#, JS...) benefit even more.

Update 11/12/2012: Contrary to what I mentioned above, it doesn't seem that the issue was with OS tuning, but with the OS internals. Please see the comment on this post by Timothy Bolton.

Monday, October 29, 2012


Harvard Business Review has a good article on micromanagement, which is a recurring theme.

Stop Micromanaging and Learn to Delegate

One of the highlights of this article:
It's important to realize that other people won't do things exactly the same way you would. Challenge yourself to distinguish between the style in which direct reports approach tasks and the quality of the results.

Great Operations Leader

I've linked to posts or articles by John Allspaw a few times, and this is another instance where I admire what he wrote and think it's worth quoting a few lines:

What are the attributes (other than technical ability/experience) that make a great VP of Technical Operations?

Some good quotes:
Great Ops leaders understand that enduring risk of service outage, failure, and degradation is necessary for evolving and enabling a business, so they don't avoid change, they instead build a straightforward and collaborative way for change (and the accompanying risk) to take place.
You want someone who is haunted by worst-case scenarios but doesn't allow them to paralyze the organization or technical evolution of an infrastructure. Leaders that I admire value solutions and remediation-finding over blame assignment and avoidance.
And, as for many other areas of life:
a great Ops leader is one that continually looks elsewhere for improvement, inspiration, mentoring, and guidance. This means that becoming a great Ops leader isn't an actual achievement, it's a never-ending process involving humility, which in turn means that lists of great Ops leader attributes will always be incomplete. :)

Sunday, October 28, 2012

Senior Engineer

John Allspaw wrote a fantastic post on senior engineers that is a must read for those seeking to improve themselves:

On Being A Senior Engineer

Through this post, I found a number of areas in which I have improved in the past few years, and also some areas in which I need to make some adjustments. But that's the beauty of it: if you are actively trying to improve yourself and making the effort for that, results will come. In the post above, John also gives some references to additional posts or books that may be good to read for those actively looking to improve.

There are many parts of this post I could quote here, but just a couple of teasers for you:
I’ve mentioned it elsewhere, but I must emphasize the point more: the degree to which other people want to work with you is a direct indication on how successful you’ll be in your career as an engineer. Be the engineer that everyone wants to work with.
I have also found that to be a really strong indication of an engineer's maturity. Managers sometimes think differently and don't take that as an indication, but I most definitely think they should. And the interesting thing is that oftentimes these are the most knowledgeable engineers, the ones who offer learning opportunities to those around them and let creativity flourish on the team.
The only true authority stems from knowledge, not from position. Knowledge engenders authority, and authority engenders respect – so if you want respect in an egoless environment, cultivate knowledge.
This is very true. In spite of formal authority, one simply cannot have any actual authority without having the knowledge and actually earning that authority. Other people become the de facto leaders and, unless boycotted for some reason, they will probably have a much higher influence. So the key verb here is earn.

Saturday, October 27, 2012

Job Descriptions

I've always been curious to see job descriptions when someone reaches out to me about software engineer (or related) positions. Most of them don't really have anything uncommon, but sometimes you see something in them that could be an indication of how the company approaches software development and what they value.

For instance,
  • "Write code that is art": this is the first time I have seen (or perhaps noticed) art in a job description. It is so nice to see a description with that (see my post Software As Art), as it may indicate a team/company/manager that sees software beyond utilitarianism.
  • "Professional and Technical Competencies": it's good to see the word "professional", as that could indicate that this company may want to do things in a professional way.
  • "Use SOLID design principles and patterns.": this means, first of all, that someone at that company knows design principles. Points for that. And they seem to value them. So bandaid code should be detected and discouraged, as it would not follow any good principles or patterns.
I know that if I have to write a job description in the future, I'll put these small signs in the text for those who are paying attention and looking for them.

Software quality hell: bandaid development

In the industry, I see a mindset problem that can be very detrimental to the software quality - and quite opposite to the Software As an Art mindset: shortsightedness to fix only the problem at hand. Let me explain through an example.

Let's say you have a very simple problem: fix a unit or component test that is broken after some major changes in other components. Now, for some reason, a parameter is being passed null and you get a null pointer/reference exception. What do you do?
  1. You can just do an "if (param != null)" and work around the issue. If this is the only issue, the test will pass, you can close the bug with the feeling that you're quick, your manager will think that you're really efficient, everybody will think that the software is now fixed, and ultimately you will just collect the rewards for being a good engineer for the business.
  2. You can try to understand why this parameter is now null, which would require going through some other code, potentially requiring major fixes in some other areas. This will very likely take longer (sometimes much longer), so it will not give immediate relief to the problem and, depending on your manager, you may be considered someone who is not quick, who overengineers, or who overly complicates things.
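
The two options can be sketched side by side. This is only an illustration under made-up assumptions: the function names, the profile dictionary, and the "renamed key" root cause are all hypothetical, standing in for whatever the investigation in option 2 would actually uncover.

```python
from typing import Optional


def format_greeting_bandaid(name: Optional[str]) -> str:
    # Option 1: guard against None and move on. The test passes again,
    # but nobody learns why the caller started passing None.
    if name is not None:
        return f"Hello, {name.strip()}!"
    return ""


def load_display_name(profile: dict) -> str:
    # Option 2, part one: fix the producer of the value. Suppose
    # (hypothetically) the investigation showed that an upstream change
    # renamed the "name" key to "display_name".
    if "display_name" not in profile:
        raise ValueError("profile is missing 'display_name'")
    return profile["display_name"]


def format_greeting(name: str) -> str:
    # Option 2, part two: the consumer states its contract and fails
    # loudly, so the next regression is caught where it is introduced.
    if name is None:
        raise ValueError("name must not be None")
    return f"Hello, {name.strip()}!"


if __name__ == "__main__":
    print(repr(format_greeting_bandaid(None)))
    print(format_greeting(load_display_name({"display_name": " Ana "})))
```

The bandaid version silently returns an empty greeting, so the real defect stays in the code base and surfaces later somewhere else.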
Picking option 1 (bandaid) over option 2 (proper) has a number of consequences that most people don't consider - or just do not care about:
  • Ticking bomb: essentially that is what this decision is. The code base becomes harder and harder to reason about. Bugs become very complicated to understand, and there are side effects and regressions in many places.
  • The time saved with this solution will not make up for the time spent investigating issues in the future, given how difficult and obscure the software becomes once its behaviors are unclear. It is just a mess.
  • It is just a black hole: the worse, more confusing, and less maintainable the code is, the harder it is to isolate yourself from all this mess. To implement new features or fix bugs, you end up needing to write bad code, because writing good code may require fixing so many other places that it becomes very risky to the business and no manager would actually approve that.
  • New people changing this code become very scared of making changes, as they may have effects all over the place. In particular, in orgs where taking risks is not rewarded and being bold and going above and beyond is not worth the risks you're taking, people will not take the initiative to improve such code.
  • Any serious software professional loses the pride of working on such a code base. This is an overlooked and often not even noticed hidden consequence of such behavior. You risk losing good professionals and, given word of mouth, you also risk not attracting really good professionals.
In some cases, like during a release, it's perfectly acceptable to pick option 1, as long as there's the professionalism and responsibility to track this technical debt and tackle it right away.

In my experience, I've come across quite a few people that would pick option 1 right away. And not only that, but it would be very hard to convince them of the reasons to consider option 2. I've seen this in people from all backgrounds (with and without graduate degrees), working at big and well-known software companies and at startups. As it turns out, it is a mindset issue, like a friend of mine said. Unfortunately this is one of the areas of the creative process of software development that cannot be enforced by process.

When this is restricted to some individuals, it's less of a problem if the team culture embraces high quality - as in, you see people in general improving their code quality and spending time on that, either through additional efforts, reading books or any other material to push themselves forward; management actually values that, etc. In that case, the issue can be contained if the team is encouraged to reject such changes in code reviews and the "bandaid" engineers accept comments to fix the issue properly.

However, this issue becomes very toxic when leads/managers themselves are the first ones to pick the bandaid solutions (or to encourage them through subliminal messages). Not only that, but if the company's culture, performance reviews or other mechanisms actually encourage and/or reward thinking in a shortsighted way at the expense of long-term solutions, then there is nothing one engineer alone can do - and in this case I'd suggest considering other options. In my opinion, that's where companies start to lose the agility to add features and innovate as they get caught up in the software mess they created in the past.

Finally, do you know what the problem is with the software bandaids that keep getting added to the software? If that software is integral to the business, you can't get rid of it and just do it properly from scratch. Someone will need to deal with the consequences of it. And that's when you distinguish great engineers that are worth keeping in your org: they are not the ones to think that it will be someone else's problem. Great engineers don't think about their upcoming review first, but think about doing it right (of course aligning with the business priorities). If you want to build a great team and a great company, start distinguishing between these types of engineers - the ones that apparently deliver are not always the ones worth keeping or rewarding.

PS - for companies that run surveys among engineers and do take them seriously to improve the company, I'd suggest a couple of questions to measure this effect:
  • Do you actually feel proud of the product/service you work on?
  • Do you feel that management embraces high quality and does not promote low quality indirectly through shortsightedness?

Wednesday, October 24, 2012

Caching Algorithms

After working on the Linux kernel and implementing an LRU eviction policy for a memory compressed cache, the following article definitely hit home. It talks about the different eviction policies for caching, which one was picked by Dropbox for its client, and briefly mentions at the end how to do a simple cache invalidation.

Caching in theory and practice
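
For readers who haven't implemented one, here is a minimal sketch of an LRU eviction policy with a simple explicit invalidation. It is only an illustration built on Python's OrderedDict; it is not the kernel code I worked on, nor the implementation described in the article, and the class and method names are my own.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: the least recently used entry is evicted first."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a hit makes the entry most recent
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

    def invalidate(self, key):
        # Simple invalidation: drop the entry so the next read misses.
        self._items.pop(key, None)

    def __contains__(self, key):
        return key in self._items

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")     # "a" is now more recent than "b"
    cache.put("c", 3)  # exceeds capacity, so "b" is evicted
    print("a" in cache, "b" in cache, "c" in cache)
```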

Map Reduce Patterns

I just read the post below on Map Reduce patterns. It is long, but it goes over many of the patterns and how map reduce can be used for many interesting computations. I found the PageRank algorithm particularly interesting.

MapReduce Patterns, Algorithms, and Use Cases

Kudos to Ilya Katsov (author) for putting this together.
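
To give a flavor of how PageRank fits the map/reduce shape, here is a simplified single-process sketch. It is my own illustration and not code from the post: it assumes every page has at least one outgoing link and that every link target is itself a key in the graph, and the damping factor of 0.85 is just the commonly used value.

```python
from collections import defaultdict

DAMPING = 0.85  # commonly used damping factor; an assumption, not from the post


def map_phase(graph, ranks):
    """Map: each page splits its current rank evenly among its outgoing links."""
    for page, links in graph.items():
        for target in links:
            yield target, ranks[page] / len(links)


def reduce_phase(graph, contributions):
    """Reduce: sum the contributions per page and apply the damping factor."""
    totals = defaultdict(float)
    for target, value in contributions:
        totals[target] += value
    n = len(graph)
    return {page: (1 - DAMPING) / n + DAMPING * totals[page] for page in graph}


def pagerank(graph, iterations=20):
    """Run a fixed number of map/reduce rounds, starting from uniform ranks."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        ranks = reduce_phase(graph, map_phase(graph, ranks))
    return ranks


if __name__ == "__main__":
    # Made-up three-page graph: a links to b and c, b links to c, c links to a.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 4))
```

In a real MapReduce job each round would be one full job, with the graph structure passed along between rounds; here both phases are plain functions so the data flow is easy to follow.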

Thursday, October 18, 2012

Life's work

I read this article "What Do I Want To Do When I Grow Up?" Is The Wrong Question To Ask and thought to myself "story of my life". This is one of the paragraphs that I could relate very much to:
My unconventional career path took me to five major national and international cities. I stayed at jobs for as long as 18 months and as short as one month. I sold all of my belongings and moved cross-country because my intuition told me to. I worked with more than 15 different startups in one year of living in New York City. I started a blog to document my journey--both the learning and the mistakes. I started a website to document the stories of people boldly pursuing their life's work. I messed up two startups. I accidentally turned insomnia into a global movement. I met with tarot card readers, talked strategy with multimillion-dollar entrepreneurs, and helped a best-selling author launch a publishing company, all to see if I could answer the question I'd been wondering about since I was 5: What do I want to do when I grow up?
And on the same note, this is another great post on "life's work": 8 Signs You've Found Your Life's Work

This part is particularly interesting and I believe it's one of the most noticeable signs that you found your life's work:
The people who matter notice.
"You look vibrant!" and "I've never seen you so healthy and happy!" and "This is without question what you're meant to be doing!" are among the comments you may hear from the people closest to you when you're on the right path.

Windows 8 Closed Distribution Model

Windows 8 is almost there, and this post on its closed distribution model gave me a quite different historical perspective on Windows 8 and its desktop model.

I agree with the article that Microsoft will be moving towards the Windows 8 UI - and less and less of the old desktop UI. That will probably happen for the very same reasons that DOS applications faded into obscurity: features and focus will be on the new UI, so there will be no reasonable way of maintaining an application running on the old UI for too long.

The whole problem is that Microsoft controls which applications are distributed for the new UI. Just like Apple does with the apps for iOS. And just like Apple, it can dictate what one can and cannot have on their device. And, depending on the distribution model, it can decide that a device is no longer supported and essentially "brick it" by not allowing apps to be released for the OS versions that older devices can run. Just like what happened with iPhones and more recently with the iPad 1, which had a lifetime of just 2 years.

Although I'm currently a Microsoft employee, this is very scary. I never liked Apple's model, and for this reason never had an iOS device. I never liked Amazon Kindle's model, even when I was an employee at Amazon, because Amazon decides where I can actually read the books I purchased (besides the fact that I cannot loan a book for as long as I want or give the books away, but that's more a digital media problem than one specific to the Kindle case). Based on the same principles, I'm not comfortable with the Windows model going forward.

The unfortunate question in this whole discussion is whether we have any option. I love open source, but is it a viable option going forward to compete with these large app stores? Will Android, which has the most open distribution model, be able to survive in spite of its fragmentation, not very well-oiled processes, and patent litigation?

One could say that we can just go and use open source, but the future will be mostly in the hands of those that hold data. That is why some startups are very valuable, even if they don't have a defined business model, as long as they have users' data. And they will dictate which platforms are supported, based on their business reasons. Given that, you will not have an option if you want to use these applications.

One could just say no to all of that and not use any of these applications if they imply using a closed platform and some sort of lock-in. But one can only do that to the extent that it doesn't impair one's ability to live one's life and/or run a business in an efficient way. Otherwise you will have just an "illusion of control", and will have to surrender to reality and pay up, complying with whatever restrictions these big or small companies impose.

The future looks very scary, unless market forces make companies like Microsoft or Apple open up their distribution models. That will only happen, though, if this is better for their revenue, otherwise they will not do it out of "good will".

Tuesday, October 16, 2012

Implementing MVCC on a Key-Value store

The following post is quite interesting and has good ideas on how to implement MVCC (Multi-Version Concurrency Control) on a key-value store:

Implementation of MVCC Transactions for Key-Value Stores

It gives good ideas on how to track when each entity came into existence by storing the transaction that created it and the one that deleted it.
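
As a rough illustration of that idea, here is a minimal in-memory sketch where each version of a key records the transaction that created it and the one that deleted (or superseded) it. It is my own simplification, not the implementation from the post: every write is treated as committed immediately, there is no conflict detection or rollback, and the class and method names are made up.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Version:
    value: Any
    created_by: int                   # transaction ID that wrote this version
    deleted_by: Optional[int] = None  # transaction ID that deleted or superseded it


class MVCCStore:
    """Toy multi-version store; writes are assumed to commit immediately."""

    def __init__(self):
        self._versions: Dict[str, List[Version]] = {}
        self._next_txn = 1  # the global sequence number

    def begin(self) -> int:
        txn = self._next_txn
        self._next_txn += 1
        return txn

    @staticmethod
    def _visible(version: Version, txn: int) -> bool:
        # Visible if created at or before txn and not yet deleted as of txn.
        created = version.created_by <= txn
        alive = version.deleted_by is None or version.deleted_by > txn
        return created and alive

    def get(self, key: str, txn: int) -> Any:
        for version in self._versions.get(key, []):
            if self._visible(version, txn):
                return version.value
        return None

    def delete(self, key: str, txn: int) -> None:
        for version in self._versions.get(key, []):
            if self._visible(version, txn):
                version.deleted_by = txn

    def put(self, key: str, value: Any, txn: int) -> None:
        # An update is a delete of the old version plus a new version.
        self.delete(key, txn)
        self._versions.setdefault(key, []).append(Version(value, created_by=txn))


if __name__ == "__main__":
    store = MVCCStore()
    t1 = store.begin()
    store.put("k", "v1", t1)
    reader = store.begin()
    writer = store.begin()
    store.put("k", "v2", writer)
    # The older reader keeps seeing the version that existed at its snapshot.
    print(store.get("k", reader), store.get("k", writer))
```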

However, I have concerns about this global sequence number and I copy here my comment instead of retyping it:
The idea of keeping the transaction ID is very interesting, but requires a global sequence number to be assigned to the transactions, right?

I don’t know which NoSQL databases have that, but when I think about Windows Azure Storage (the one I’ve been working with more recently), that would be a problem. Actually, that’s a problem with any scalable DB, as it can be a contention point.

In the Windows Azure Storage, or others that don’t have that on the server side, it’s more of a problem to have this global number as it requires operations to read the number, increment, and then update them. This creates a contention point, and reduces the rate of transactions you can have.

What are your thoughts on it? Have you tried to implement that on top of a NoSQL DB? Does Oracle Coherence offer an increment operation for this global counter?
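
To make the read-increment-update concern concrete, here is a minimal sketch of a global counter kept in a store that only offers conditional writes. The store below is a hypothetical in-memory stand-in with ETag-style version checks; it is not the Windows Azure Storage API or the API of any real database, and all names are made up.

```python
import threading


class ConditionalStore:
    """Hypothetical in-memory store with ETag-style conditional writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0
        self._etag = 0

    def read(self):
        with self._lock:
            return self._value, self._etag

    def write_if_match(self, new_value, expected_etag) -> bool:
        # The write succeeds only if nobody else wrote since our read.
        with self._lock:
            if self._etag != expected_etag:
                return False
            self._value = new_value
            self._etag += 1
            return True


def next_transaction_id(store):
    """Read, increment, conditionally write; retry on conflict.

    Returns the new ID and the number of retries that were needed.
    """
    retries = 0
    while True:
        value, etag = store.read()
        if store.write_if_match(value + 1, etag):
            return value + 1, retries
        retries += 1


if __name__ == "__main__":
    store = ConditionalStore()
    ids = []

    def worker():
        for _ in range(200):
            txn_id, _ = next_transaction_id(store)
            ids.append(txn_id)

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(ids), len(set(ids)))
```

Every ID request is at least two round trips to the same record, and every conflict adds more, which is why a single global counter limits the transaction rate as the number of concurrent writers grows.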

UPDATE 10/18/2012: today I read another good post on how Postgres works with MVCC. It has a few more details to complement your knowledge: PostgreSQL Concurrency With MVCC

Monday, October 15, 2012

Business cards

I ordered some business cards for myself on moo.com and have to tell you how pleased I was with the experience. Very nice and functional website, cards were delivered before the estimated date, and quality was just great. They got themselves a customer.

Software as an Art

I see software as an art. Although it can be used simply as a mundane tool to get something to work, that's not what energizes me when working with software. What does excite me about it is the art in a beautiful solution, in a nice abstraction, in a great architecture, in readable and maintainable code.

In spite of seeing software as an art, I've been able to deliver in all the projects I've participated in by making the right prioritizations. After all, it must serve a concrete need and be part of a business, adding some sort of value. However, even then, it's been an art and I'm invested in making it better, nicer, even if it requires some harder work.

The problem, though, is to be in environments where software is just viewed as a tool, where achieving a result is more important than anything else (even if it's totally shortsighted), where getting something done simply annihilates the art behind the software, where achieving the reward for doing something (even if crappy) is more important than doing it well or right.

In these environments, you seem to become merely a cog in a machine of results, because your creativity and your heart are not really that required as long as you solve the problem in any way or form. That shifts software from an art or a craft into an almost mechanical task. That simply kills the joy of software development.

If you have a similar view of software, how do you actually cope with this in your work environment? Do you try to shield yourself? Or is there any work environment where software as an art/craft is still preserved? I'd love to know about your experiences.

Sunday, October 14, 2012

NoSQL: algorithms and data modeling

My Sunday reading session was on a couple of great NoSQL blog posts that are definitely worth reading:

  • Distributed Algorithms in NoSQL Databases: it talks about data consistency, data placement, and system coordination. I learned some interesting new things, like Bully Algorithm for leader election, passivated replicas, how to handle rebalancing between replicas, and multi-attribute sharding, among others.
  • NoSQL Data Modeling Techniques: it lists a bunch of techniques, many of which match my experiences with NoSQL databases. I like this quote, which is essentially how I explained how I designed the data model: "The main design theme is ”What questions do I have?”"
These posts, in particular the first one, require some background, otherwise they may seem a bit hazy. But the author added a great list of references that should be read to understand them better.

The more I learn about NoSQL - or work with distributed systems - the more I see how much complexity there is in dealing with seemingly simple problems, like data modeling. It's very hard to get others to understand, but as soon as you start reading these posts and see how many different concerns one has when thinking of the data model, for instance, then you see how skilled a software engineer must be to try to minimize the mistakes and rework. And that's even though the posts above don't talk about processing like MapReduce, which is still another aspect to be factored in.

The ones below I haven't read yet, but are on my list and seem promising:

Saturday, October 13, 2012

Startup vs. Big Company Mindset

Paul Graham wrote a great post on startups that is definitely worth reading (in spite of its length). One of the interesting quotes in this text is: "if they aren't median people, it's a rational choice for founders to start [startup companies]".

As I work for a big (or rather, huge) company, I always think of the differences between people working at both types of companies. I don't think there's a definitive set of personality traits that define whether you should work at one or the other, especially as you see people working at some point of their lives in one type of company and then moving to a different type.

However, Paul's quote on "median people" is interesting. I think it is along the lines of my post on Working at Big Software Companies. Due to their size, big companies must set their rules for the average professional and try to minimize the damage to the outliers (either top performers or underperformers). But there's no magical silver bullet that works perfectly, so there is always some collateral damage.

One example of collateral damage. Let's assume that you have a few top performers, who are capable of doing the majority of work well and faster than others. And then you have many average or good (but not really top) performers on the team. What are the criteria to divide the work? If you're focusing on efficiency and getting the work done more quickly and better, you would probably assign most of the items to the top performers, or at least the critical ones, and have others taking care of the less important stuff. That would guarantee efficiency, but probably you're risking having a lower team morale as most of your people will not be working on the "meaty" or interesting stuff.

In big organizational structures, more often than not, as a manager you're not being measured by the most efficient output, but by having MORE output than your peers. Another point is that, as a manager, you will be measured by what the team thinks of you (MS Poll is an example of that at Microsoft). Given all that, if you focus on your career, you're better off not delivering the most you can - by purposefully not assigning features to top performers and/or slowing them down (*) - and guaranteeing a high team morale among the average professionals rather than guaranteeing that the top performers are indeed happy and that you're really very effective.

Actually, because of the typical peer comparison when it comes to performance evaluation, it is interesting to see that it can hurt the overall company's performance. Of course, one could argue that if everybody is producing like crazy, this system would force all to catch up and keep up with others' performance. But if the company is profitable, there is not enough incentive to produce to your potential rather than slightly better than your peers and, on top of that, producing to your potential may bring more liability and risks that are detrimental in that culture, then I'm a firm believer that the rule will be to be (or at least look) better than peers, but not to the extent that it can cause risks.

Risk is another interesting aspect of the big company mindset. I've seen more average people being promoted and rewarded for not "getting themselves into trouble" than top performers that were bold and had the courage to take risks, whatever they were (like trying to shake the existing status quo, trying to get the most efficiency out of the team, attempting something new). It's just like any contest one sees on TV: the ones that keep moving forward and sometimes even win are typically the ones that are not bottom performers and do not make huge mistakes that put them on the spot. If you keep going with an average result (better than someone else's) and do not take many risks (incl. pissing others off), there's a better chance of moving forward in your career.

The biggest perk for those who stick around is that, over time, it's very unlikely to be fired from these companies - it's possible, but unlikely unless you screw things up very badly. So, once you reach a good salary and all the benefits that come with tenure, it's very hard to jump ship and try something new. That is particularly true for those interested in focusing on other aspects of their life, as over time you get comfortable in the position, without requiring a lot of effort on your part to keep going.

With this mindset, it's not surprising that the innovator's dilemma exists and that, unfortunately, most of the innovation will not come from these big companies. Actually, there will be innovation there, don't get me wrong. But it's the kind of innovation that requires an enormous structure or ecosystem in place, so that anyone new to the area can't afford such a large investment or tap into any potential returns. Also, the big companies will not be the most efficient, unless there's really a market force driving that. Their survival is not a matter of really being efficient, but just of being more efficient than competitors - if these can't be bought out or driven out of the market somehow.

(*) On slowing down top performers: I've seen that happening and couldn't believe my eyes. The same is reported in the book "The Peter Principle". Essentially, from a management perspective, having someone "too good" makes others feel bad, so one needs to manage these professionals and, oftentimes, it's better to get rid of them. Yes, it seems absurd, but it happens.

Tuesday, October 02, 2012

Performance Review

It's not a surprise that Microsoft's performance review has been so debated in the past few years, and more recently after Vanity Fair's article on "Microsoft's Downfall". If you haven't heard of how the performance review works at Microsoft, I'd suggest you read this blog post and this article:

Microsoft Stack Ranking is not Good Management
Microsoft’s Downfall: Inside the Executive E-mails and Cannibalistic Culture That Felled a Tech Giant

However, what surprised me was coming across some posts on how Google does its performance review. From recruiters, I knew all about the wonders of peer feedback at Google and how things are fairer there.

I did not know, though, the other side of the coin. Read it here:

What are the major deficiencies of the performance review process at Google?
Promotion Systems
Promotions System Redux

Essentially, the problems with performance review are everywhere at big companies. I wonder how well it works at Valve, as it has a non-traditional organization structure.

Monday, October 01, 2012

BTree Library for .NET

As I was reading about MongoDB deficiencies, one of the problems I learned about was the lack of counted BTrees. As I wanted to learn more about BTrees, I coded up a basic implementation of BTrees in C# and posted it to GitHub, which allowed me to learn more about Git as well.

This is the GitHub repository where you can find the initial BTree implementation (in-memory implementation only for now):


Doesn't a developer spend most time coding?

This blog post on the less-known truths about programming is great, and I definitely recommend reading it entirely. Here, however, I want to touch on this point:
Averaging over the lifetime of the project, a programmer spends about 10-20% of his time writing code, and most programmers write about 10-12 lines of code per day that goes into the final product, regardless of their skill level.
I completely agree that most of my time is not spent writing code, and it never has been, irrespective of whether I was considered junior or senior on a project. My experience perfectly matches that of the author of the post mentioned above, as well as that of other people on the web (and apparently what was written in The Mythical Man-Month).


This is something that nobody tells you at college: you will not be coding as much as you may think. As a matter of fact, in big corps, you will probably spend much more time communicating and handling other things than working on the technical side. If you're operating a service, like I mentioned here, then it's possible that you'll be coding even less, and debugging and diagnosing issues far more than coding.


Another point here is how the technical side is the focus of interviews, although it can account for a minority of your time on the job. Soft skills, communication, dealing with ambiguity, etc., are often neglected, yet they can be very important for success at work as well.


Actually, what is important for a professional to be successful? This TED Talk gives us good tips. See this quote:
Embedded within that question is the key to understanding the science of happiness. Because what that question assumes is that our external world is predictive of our happiness levels, when in reality, if I know everything about your external world, I can only predict 10 percent of your long-term happiness. 90 percent of your long-term happiness is predicted not by the external world, but by the way your brain processes the world. And if we change it, if we change our formula for happiness and success, what we can do is change the way that we can then affect reality. What we found is that only 25 percent of job successes are predicted by I.Q. 75 percent of job successes are predicted by your optimism levels, your social support and your ability to see stress as a challenge instead of as a threat.
These are things that we don't look at when hiring someone. Coupled with the fact that you don't spend much of your time doing the technical core work anyway, I don't think that our interview techniques are actually that effective at predicting an individual's longevity (and success (*)) at the company.

(*) What counts as success is probably a topic for another post :-)

The virtuous cycle of being on-call

… or how you can make on-call for service providers a virtuous cycle.

In the tech world, for everything that is running as a service or website 24/7/365, there must be someone available to take care of any issues that arise. It’s quite common to see someone in the organization monitoring the service (or website) and acting in case of issues. Some orgs have an operations team on the frontline, others have developers. Even if the operations team exists, someone from the engineering team who develops and/or maintains that code must be available if there’s an issue that requires further investigation. All of these people who, whether at work or at home, are reachable by cell phone, pager, or email are said to be on-call.
In my opinion, on-call can be a virtuous cycle that improves the code, provides good customer service, and over time tends to decrease the issues. BUT the problem is that the environment and the on-call are very rarely set up for this virtuous cycle to happen.

When are you NOT setting your company up for a successful on-call?

- Customer service is not the priority: the primary goal of being on-call is to detect issues before customers do and to prevent customers from hitting them. The secondary goal is, if an impact is inevitable, to reduce the impact on the customer. This must be the direction given to all engineers on-call, and must be valued by management through rewards, through being a top item on the list for performance, and through factoring the time for this work into schedules/estimates, among others. But this is not all: in order to be really customer-centric, the right investment in monitoring must be made, the lessons must be applied in order to improve the process, tools must be written to speed up the process, and tests for disasters or failures must be conducted. When on-call does not have this focus, but is essentially about meeting metrics (time the issue was in my court), or just about getting rid of the issue and going back to the task that has higher priority or rewards, for instance, then this on-call is not virtuous and tends to provide poor customer service.

- Monitoring is not sufficient: if the company is really customer-centric, monitoring in the code must be all over the place. Monitoring must be sensitive to minor changes before the customer is impacted and must be very proactive in alerting the team. Not having monitoring in the system may sound like it’s good (no news is good news), but you can essentially be providing a very poor service to the customer, as you will only fix some of the issues IF and WHEN the customer reports them. Monitoring cannot be an afterthought; monitoring cannot have second priority. At least not if you plan to be serious about providing a service.

- Wrong people on-call: oftentimes, to meet the obligation of being on-call and to reduce the load, anyone on the team is added to the on-call list, even if the person is not familiar with the system. Doing that, although it can look good at first as knowledgeable people don’t need to be formally on-call, is a really bad idea in my opinion. First, the knowledgeable person will very likely be engaged anyway. If that person is not available, you risk having an outage or a customer impact. If that person is available, you have essentially delayed the resolution by having a middle-man engineer whose role is just to engage others. This is not virtuous, as it not only fails to solve the problem of reducing the load, but also adds load on whoever is on-call and decreases the customer service quality.

- On-call people are not invested in really fixing the issues and improving the service: essentially, if on-call people do not have a feeling of ownership for what they are maintaining and being woken up for, this is not a virtuous cycle. It’s not virtuous because they will not be invested in doing the proper investigation, fixing the issue, or improving the process: after all, they don’t feel that they are owners, and being on-call can be just an obligation, a nuisance. To be virtuous, the right thing is to get engineers to own the service, to be able to make the decisions (and be held accountable for them), and to have the desire to improve the service. That is one of the prerequisites for people to be on-call and actually see the importance of that activity.

- Many of the issues are due to other teams/orgs at the company: having people invested in the service can only do so much if the interdependencies make them suffer the consequences of other teams’ issues. This is not a problem as long as management makes the right investment to fix the underlying services that are causing issues. Otherwise, the sense of ownership may not be sufficient, as the on-call engineer will be paying the price for something that s/he’s not responsible for, and the cycle is no longer virtuous.

- Too many issues: for any service that is very popular and/or not yet mature, besides having the right people on-call, these people must be able to handle the issues in a reasonable fashion and still be able to sleep and have their basic needs met. Once you’re over this threshold and people have to live for the issues, as they are getting out of hand, something is amiss and needs to be fixed. If the investments are made, this can be fixed and does not impact the virtuous cycle; otherwise it can show that on-call and customer service are not a priority for the company, generating discomfort and dissatisfaction on the side of those who have to do it.

- The feeling is just to get rid of the issue or blame someone else: still related to the other items, if the goal is to meet some metrics and just get rid of the issue or blame someone else, this cycle is most definitely not virtuous, as it does not bring the benefits of great customer service and of improving the process.
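
On the monitoring point above, here is a minimal sketch of what "sensitive to minor changes" and "proactive in alerting" could look like in code. It is only an illustration under my own assumptions: the ErrorRateMonitor class, the 60-second window, the 5% threshold, and the minimum of 20 requests are all made-up values, and a real service would use its monitoring stack rather than a hand-rolled class. The idea is to track the error rate over a sliding window and raise an alert as soon as it crosses a threshold, instead of waiting for a customer to report the problem.

```python
from collections import deque


class ErrorRateMonitor:
    """Sliding-window error-rate check (illustrative values only)."""

    def __init__(self, window_seconds=60.0, threshold=0.05, min_requests=20):
        self.window = window_seconds      # how far back we look
        self.threshold = threshold        # alert above this error fraction
        self.min_requests = min_requests  # avoid alerting on a tiny sample
        self.events = deque()             # (timestamp, ok) in arrival order
        self.errors = 0                   # failed requests inside the window

    def record(self, timestamp, ok):
        """Record one request outcome at the given time (in seconds)."""
        self.events.append((timestamp, ok))
        if not ok:
            self.errors += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop requests that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, ok = self.events.popleft()
            if not ok:
                self.errors -= 1

    def should_alert(self, now):
        """True when the recent error rate is above the threshold."""
        self._evict(now)
        total = len(self.events)
        if total < self.min_requests:
            return False
        return self.errors / total > self.threshold


monitor = ErrorRateMonitor()
for i in range(100):                      # 100 healthy requests
    monitor.record(i * 0.5, ok=True)
for i in range(10):                       # then a burst of failures
    monitor.record(50 + i * 0.1, ok=False)
if monitor.should_alert(now=51.0):
    print("page the on-call engineer")
```

Timestamps are passed in explicitly so the example is deterministic; in a service they would come from the clock, and the alert would go to whatever paging system the team uses.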

I have felt good being on-call given the right circumstances. Being on-call taught me so much about how to engineer a system to be run as a service, and it improved and matured the systems I was on-call for. It’s a matter of not doing the things above, showing respect for those professionals who are on-call, and getting them to be invested in the service through ownership. Essentially, make "having your skin in the game" something that makes sense to the engineer. It’s rare to find these things, but if you do, you can rest assured that you have started your virtuous cycle.

What do you think? Do you have any other tips, or have you had a different experience being on-call?