Great post on distributed systems. I recall a conversation from not too long ago that was almost identical to the one mentioned in the "Level 0" section:
NN: “For our first person shooter, we’re going to write our own networking engine”
NN: “There are good commercial engines, but license costs are expensive and we don’t want to pay these.”
ME: “Do you have any experience in distributed systems?”
NN: “Yes, I’ve written a socket server before.”
ME: “How long do you think you will take to write it?”
NN: “I think 2 weeks. Just to be really safe we planned 4.”
Saturday, June 30, 2012
Thursday, June 28, 2012
Posted by Rodrigo De Castro at 5:26 PM
I learned a great lesson in management, team playing, and individual contributors. How did I learn that? Playing a soccer video game.
Back in the day, when I used to play soccer on the weekends, a major difference between actual play and the video game always struck me: with a game controller in hand, you control the entire team. Recently, when exploring the options in FIFA 12, I realized that one can now pick a player and stick with him throughout the match, seeing only this individual's perspective. Doing that, I was amazed at the lesson in management hidden in that difference of perspective.
When I play controlling the entire team, it's like I am the manager or the shareholder. I don't actually focus on one particular player, but I want all of them to work well together and in a concerted way. They need, as a team, to reach the goal of scoring. If players do their work of being well positioned and obey my commands to pass, tackle, and shoot, the team has a great chance of playing well. If they don't perform their activities well or have problems among themselves, that's going to hurt the team. Also, if I, as the team "controller", focus on some particular players, I tend to diminish the chances my team actually has of playing well. The lesson is: the focus is on the group.
When I play as a single player, either in real life or as a "pro player" in FIFA 12, my perspective changes. When I have possession of the soccer ball, more often than not, I think of the pleasure I can get out of it. I try to score, because that's what will give me the ultimate joy. Sometimes I will pass to other players, but I'll try to dribble and keep control of the ball more than I should. I no longer have the team perspective, and I do not necessarily make the best decision for the team.
If the focus is on the team, what does it mean for the individual player? That he may feel privileged in certain situations when the circumstances allow, but quite often he will feel hurt, not having the "right opportunities" if you will. And that's what is very challenging, because on one hand the manager wants the employee to feel motivated to focus and get great work done. What makes that happen varies from person to person, but typically the motivators are the possibility of learning, growth, and taking on more challenges. On the other hand, the manager has the big picture view of how to get the best out of that group of people, not least because that is how the manager's performance is primarily measured. However, the team's plan does not always align with the individual's expectations.
And this is just from a purely objective perspective, not to mention the many subjective aspects of this interaction between manager and reports, like potential personal preferences, inaccurate assessment of report's skills, among others.
To succeed and to survive for any substantial amount of time in an organization, an individual needs to understand that decisions made by management will not always make sense without access to the same data that management had. There will be times when what is best for the team or for the company will come at the expense of some individual's expectations, whether or not that individual is a rock star. There's nothing to be done about that, and if your only source of joy and happiness is what you do at work, you risk going through phases when your expected level of joy is not matched and will definitely not come from that source.
Monday, June 25, 2012
Posted by Rodrigo De Castro at 8:29 PM
I remember asking someone a few months back how he was doing, and his answer was: too much work, but too little gets done. That stuck with me, as I had had similar thoughts at the time and have had them since.
One of the measures of satisfaction to me is simply a good answer to the question “have I been doing any meaningful work recently?” When it gets to a point when I see that I spent a substantial amount of time without doing anything meaningful, then it’s time to reevaluate. I think this becomes particularly worrisome when you are working very hard, but still getting little done.
I’ve had good discussions with friends about that and would like to share some thoughts out of these discussions.
The main question about getting work done is: what’s actually getting in the way? From my experience I can see some explanations that stand out:
- Multiple people involved: the more the organization has a structure that involves multiple people in decisions, the more effort will be spent communicating, bringing people up to speed, and trying to achieve even a minimum consensus. Although bringing more people to the discussion typically improves the outcome with more perspectives looking at the problem, there’s a sweet spot on the number of people and who is brought into the discussion. At some point, more people don’t add much value and actually introduce great cost.
- Hierarchy, lack of autonomy: in companies where there’s more hierarchy, more people are responsible for the outcome of certain decisions. If all these people in the hierarchy want to be involved, any decision is simply very hard to make, as it requires the blessing of everyone up the management chain who is ultimately responsible for that work. Therefore, autonomy is definitely limited (as I understand autonomy).
Ultimately, the consequences of having these things in the way of getting things done are:
- Pace is just much slower than it could have been, as it went through all the “red tape” to get projects approved, resources allowed, etc.
- Decisions can end up being hasty and incomplete simply because a substantial portion of the process was spent in communication, talking to peers and managing up.
However, when I come to think about this, it’s natural to some extent to expect this behavior in an organization. If you are at the top, you need to make sure that the orgs under you are doing something reasonable. You want to make sure that people are not signing up for something unrealistic that will become a liability in the future. You don’t want to see them handwaving and setting themselves up for failure. Of course, that is true for anything that really matters – if you’re working on something that doesn’t really have that much value to the business, you may not go through the same things.
How can the organization be improved in a way that makes it more agile and more streamlined?
- Startup environment: in all places I’ve seen or heard of, it seems that nothing beats a startup environment, either at a startup company or within a big company. It reduces the amount of communication and avoids big hierarchies; the side effect is that you may end up with something that does not properly align with the org or that duplicates some other effort. That is a fair price to pay for agility, as long as a minimum consistency bar with other company products/services is met.
- Owners: another point is to turn hands-on employees (engineers) into owners. Get them vested in the product, get them to think of how to improve the project, to deal with customers, to operate services, to help decide features, to pick what features they want to work on. Then you will have engineers who, rather than spending 100% of their time on engineering, become owners.
The counterarguments to ownership, from the functional org perspective, are:
- Discipline is not really focused on its area and getting that much accomplished: ownership makes sense even in comparison with functional orgs and the alleged focus that each discipline has. The reason is that the focus engineers supposedly have is a fallacy. They end up not really getting that much done anyway, due to the overhead in communication and the lack of freedom to get things done.
- Ownership not clearly defined (or others are the owners): in a functional org, ownership may not be clearly defined and, if it is defined at all, it belongs to Program/Project Managers. Other disciplines, not being owners of the project, may tend to back off and not be that vested. After all, they don’t have the autonomy to improve the project without going through the hierarchy ranks and getting the respective approvals.
Can such big functional orgs work and prosper? I am sure they can work once trust is established and they get into a rhythm, but how thriving can such orgs be for passionate individuals? How innovative can such orgs be if it’s not natural for individuals to feel like owners?
Sunday, June 17, 2012
Posted by Rodrigo De Castro at 1:34 PM
· "Availability" is oftentimes not the total time up, but rather related to a customer-oriented SLA
Availability was computed from the time a ticket was opened until it was closed. Then came the exclusions: only customer-opened tickets counted – proactive restorations did not count (if the tree fell and no one noticed before it was propped up, it did not fall); then the outage start time was taken to be the time the ticket was opened (a somewhat long time from when the problem physically occurred); and lastly, any time the ticket was transferred to the customer (“on the customer’s clock”) did not count. And on top of this, scheduled maintenance downtime did not count toward availability. The Voice NOC had seemingly qualified away rigor in order to meet the mandated goal of five-nines. (Once again proving the rule that you get what you measure.)
Essentially the Voice NOC was not wrong, nor the Data division right. What the NOC was measuring was compliance with a customer-oriented SLA, a Service Level Agreement, and not really an availability measure. The Data division was measuring availability as “measured uptime,” or the probability the network could be used at any specific time the customer wished to use it.
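To make the gap between the two measures concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Outage` record, its field names, and the made-up outage data are mine, and I assume service is restored at the moment the ticket is closed. It only tries to mirror the exclusions described above, not any real NOC tooling.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One hypothetical outage; all times are minutes from the start of the period."""
    failed_at: float              # when the problem physically occurred
    ticket_opened_at: float       # when a ticket was opened
    ticket_closed_at: float       # assumed to be the moment service was restored
    customer_opened: bool = True  # False for proactive restorations
    customer_clock: float = 0.0   # minutes the ticket sat "on the customer's clock"
    scheduled: bool = False       # True for scheduled maintenance


def measured_availability(outages, period_minutes):
    """Measured uptime: every minute the service was unusable counts."""
    down = sum(o.ticket_closed_at - o.failed_at for o in outages)
    return 1 - down / period_minutes


def sla_availability(outages, period_minutes):
    """Ticket-based figure with the exclusions described in the text."""
    down = 0.0
    for o in outages:
        if o.scheduled or not o.customer_opened:
            continue  # scheduled maintenance and proactive restorations do not count
        ticket_minutes = o.ticket_closed_at - o.ticket_opened_at
        down += max(ticket_minutes - o.customer_clock, 0.0)
    return 1 - down / period_minutes


# Made-up data for a 30-day period (43,200 minutes).
PERIOD = 30 * 24 * 60
outages = [
    Outage(failed_at=1000, ticket_opened_at=1030, ticket_closed_at=1120, customer_clock=40),
    Outage(failed_at=5000, ticket_opened_at=5000, ticket_closed_at=5060, customer_opened=False),
    Outage(failed_at=9000, ticket_opened_at=9000, ticket_closed_at=9240,
           customer_opened=False, scheduled=True),
]

print(f"measured uptime: {measured_availability(outages, PERIOD):.5f}")
print(f"SLA compliance:  {sla_availability(outages, PERIOD):.5f}")
```

With this made-up data the network was actually unusable for 420 minutes, but only 50 of those minutes survive the exclusions, so the same month yields two quite different "availability" numbers.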
· Origin: hardware reliability numbers
It is likely that the origin of five-nines availability comes from design specifications for network elements. It also seems likely that the percent measure came before the current MTBF (Mean Time Between Failures) and MTTF (Mean Time To Fix) measurement, since it is a simply expressed figure and the MTTF requirements often match the % calculation while being expressed as an odd-ish number. In practice, these numbers are mostly estimates that the manufacturer makes about the reliability of their equipment. Because telecom equipment frequently is expensive, not deployed in statistically significant sample set sizes, and rushed into service as soon as it passes laboratory tests, the manufacturer estimates its reliability.
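As a rough illustration of how a simple percent figure can turn into an odd-looking time-based requirement, here is a small Python sketch. It assumes the conventional steady-state relation, availability = MTBF / (MTBF + mean time to fix), with MTBF read as the mean uptime between failures; the function names and the 4-hour fix time are my own choices, not something from the original text.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_minutes_per_year(availability):
    """Downtime budget implied by an availability figure."""
    return (1 - availability) * MINUTES_PER_YEAR


def availability_from_mtbf(mtbf_hours, fix_hours):
    """Steady-state availability from mean uptime between failures and mean time to fix."""
    return mtbf_hours / (mtbf_hours + fix_hours)


def required_mtbf_hours(availability, fix_hours):
    """MTBF needed to reach a target availability for a given mean time to fix."""
    return availability * fix_hours / (1 - availability)


print(downtime_minutes_per_year(0.99999))         # roughly 5.26 minutes per year
print(required_mtbf_hours(0.99999, fix_hours=4))  # roughly 399,996 hours
```

Under these assumptions, five-nines leaves a budget of a little over five minutes of downtime per year, and a device that takes four hours to fix would need an MTBF of about 399,996 hours, which is exactly the kind of odd-ish number one would expect if the percentage came first.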
It is not clear when the measure of reliability of an individual network element became the measure of overall network availability – but it did: customers don’t care if one element fails, or dozens fail. Customers only care that the service they are paying for and rely on works as offered.
· Availability: average or a probability?
Is availability an average or a probability? Generally, if you are building a network element, you would define availability in terms of a “probability of failure”. But if you are an engineer in a NOC running a report, you would define availability as an average. According to Cisco, by contrast, software availability is calculated from the number of defects per million, and time does not enter directly into the calculation.
· High-availability: averaged out among all system elements?
If one system crashes for 24 hours, and all others work without interruption for the month, the 24 hour outage might be averaged across a large base of installed units. When the number of installed units is large enough, this yields a number that would remain within the required five nines of uptime. More specifically, if a service provider installs a product described as "carrier grade" or providing five-nines availability, should the service provider expect that this is the product reliability standard expected of every single network element, or should the service provider expect that some of the elements may perform at a much degraded level, and that it is only the worldwide "law of large numbers" that is used to measure carrier-grade? You see, it isn’t just that one bad apple in a group of network elements can skew the overall numbers – there are actual customers and real traffic affected by that “bad apple” element. Way beyond some theoretical measure, the effect on customers might be quite severe – certainly outside the guarantees of any customer SLA, and almost certainly extracting a penalty payment from the carrier, and likely attracting a fine from the regulator as well.
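The averaging effect is easy to put numbers on. The Python sketch below is only an illustration under simple assumptions of my own: a 30-day month, exactly one unit down for 24 hours, every other unit up for the whole month, and function names I made up for the purpose.

```python
import math

PERIOD_HOURS = 30 * 24  # assumption: a 30-day month


def fleet_average_availability(units, outage_hours, period_hours=PERIOD_HOURS):
    """Average availability when one unit is down for outage_hours and the rest never fail."""
    return 1 - outage_hours / (units * period_hours)


def units_needed_to_hide_outage(outage_hours, target=0.99999, period_hours=PERIOD_HOURS):
    """Smallest installed base at which the fleet average still meets the target."""
    return math.ceil(outage_hours / ((1 - target) * period_hours))


print(units_needed_to_hide_outage(24))               # 3334
print(fleet_average_availability(3334, 24))          # still at or above 0.99999
print(fleet_average_availability(1, 24))             # the unit that actually failed
```

Under these assumptions an installed base of 3,334 units is enough for the fleet-wide average to stay at five-nines, even though the unit that actually failed delivered only about 96.7% availability to the customers behind it that month.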
· System availability vs. component availability - or, why 5 nines was chosen.
Measuring system availability is actually different from measuring individual element availability. If a string of network elements is directly dependent on each other, say a common circuit which passes through each one, then the availability of the system follows the law of multiplying probabilities. In a 4 node system, each node of which is five-nines reliable, the availability would be .99999 * .99999 * .99999 * .99999 = .99996. This partially explains why five-nines was chosen – because it is so close to unity/one that the system remains very reliable even after the degradation of availability due to directly dependent probabilities. So the network effect is reduced. (Choosing 0.95 reliability as a standard, for example, would mean that with a string of 12 dependent nodes, one would be in failure mode about half of the time.) But with everything built to five-nines, if just one bad apple exists (say with even a stated four-nines reliability), then the string of four nodes as a group becomes .99987 - very close to the reliability of the lowest performing element. In fact, in this case the close-to-unity of the other devices nearly removes them from the equation; the dependent string will always be very near the value of the bad apple, which can be very bad if the apple is actually rotten. In this situation all of the careful design and investment in carrier-grade devices of five-nines reliability becomes economically worthless.
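The arithmetic in this section can be checked with a few lines of Python. This is just a sketch of the multiplication rule for directly dependent elements, assuming their failures are independent; the function name is my own.

```python
from math import prod


def series_availability(node_availabilities):
    """Availability of a chain in which every node must be up (independent failures assumed)."""
    return prod(node_availabilities)


four_good_nodes = series_availability([0.99999] * 4)
twelve_weak_nodes = series_availability([0.95] * 12)
one_bad_apple = series_availability([0.99999] * 3 + [0.9999])

print(f"{four_good_nodes:.5f}")    # about 0.99996
print(f"{twelve_weak_nodes:.2f}")  # about 0.54, i.e. failing close to half the time
print(f"{one_bad_apple:.5f}")      # about 0.99987, dominated by the four-nines node
```

The last line shows the "bad apple" effect: the three five-nines nodes barely move the result away from the 0.9999 of the weakest element.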