Thursday, December 27, 2012

Geo-replication: MDCC and RedBlue consistency

Below you can find a couple of good posts by Murat on recent academic work on geo-replication. In particular, the first link has a good explanation of Paxos optimizations.

MDCC: Multi-Data Center Consistency

Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary

And, if you want to watch the OSDI presentation on the RedBlue consistency (second link above), click on this video link.

Tuesday, December 25, 2012

Go Language at Google

I just read a good article on the Go language and was pleased to see the efforts to improve build times and code dependencies. It's worth reading the article just to learn about the problems with C/C++ include files.

The other reason is to learn that Go is a language without exceptions, garbage-collected, without a type hierarchy (but object-oriented and with interfaces), with a different concurrency model (it follows communicating sequential processes), among others. And, last but not least, it is an open source language.

Go at Google: Language Design in the Service of Software Engineering


From The Art of Doing Science and Engineering:
I am preaching the message that, with apparently only one life to live on this earth, you ought to try to make significant contributions to humanity rather than just get along through life comfortably - that the life of trying to achieve excellence in some area is in itself a worthy goal for your life. It has often been observed that the true gain is in the struggle and not in the achievement - that a life without a struggle on your part to make yourself excellent is hardly a life worth living. This, it must be observed, is an opinion and not a fact, but it is based on observing many people's lives and speculating on their total happiness rather than the moment to moment pleasures they enjoyed. Again, this opinion of their happiness must be my own interpretation as no one can know another's life. Many reports by people who have written about the "good life" agree with the above opinion. Notice that I leave it to you to pick your goals of excellence, but claim only that a life without such a goal is not really living but it is merely existing - in my opinion. In ancient Greece Socrates (469-399) said, "The unexamined life is not worth living"

Science vs. Engineering

From The Art of Doing Science and Engineering:
In science if you know what you are doing you should not be doing it.
In engineering if you do not know what you are doing you should not be doing it.

Monday, December 24, 2012

"Lean Startup" lessons for big companies

Today I finished reading the Lean Startup book and, like most other readers, I can only rate it as excellent.

Here, though, I'll not talk about the quick iterations proposed by the book with the purpose of learning what works and what doesn't, but rather the organizational aspects suggested by the author. Eric shows how big companies are already having difficulties in the new era if they don't adapt to the startup mindset.

One of the most important things is to have small cross-functional teams in order to have small pockets within the company that work like startups. One good example of a company that implemented that is Intuit. Going in the direction of large functional organizations isn't the right thing to do if the company wants to deliver in small batches. Big functional orgs work well for delivering in big batches, but that does not help with keeping up with the competition and innovation, as innovation requires a lot of experimentation and learning that is only feasible with small batches. For that, the company must be quick, must be willing to take risks, and must have the proper metrics (not vanity metrics, as Eric says) to help direct efforts. Unfortunately, it seems that some companies are going in the direction of having fewer and fewer cross-functional orgs, and that can be very concerning with regard to their future.

Another interesting aspect is that, unlike most orgs, Eric recognizes that different employees have different skills, and the ones good at innovating and starting projects are not necessarily the ones who are interested in or skilled for later stages of the project. So, rather than having a team owning a project, the "functional org" model proposed is one where projects move between teams. Each team in this model is specialized in a phase of the project. It's somewhat like what a manager told me about his reports: some are starters, some are middlers, some are finishers.

Both of these points would improve the chances of big companies that have challenges innovating nowadays. And not only that, they can make better use of employees and provide a better environment for them. As it turns out, "Lean Startup" is not only a book for those interested in startups, but also for senior management at big companies.

Sunday, December 23, 2012

Microsoft Surface: slow to render web pages?

I'm typing this post from a Microsoft Surface device. I've had it for 10 days now and, although I don't plan to do a full review here, one thing has caught my attention since I got it: it seems to take longer to render web pages than my desktop computer.

Today I did a non-scientific experiment: both the Surface and my Windows 8 desktop, connected to the same wifi network, loading the same pages (e.g., Google Plus, Facebook) in the same browser (IE 10 Metro on the Surface, IE 10 Desktop on the desktop computer).

On first load, the difference between my desktop and the Surface is quite noticeable (sometimes the Surface takes seconds for a page that is loaded much more quickly on the desktop). After the pages are cached, the difference is not so noticeable. As I was first blogging from my desktop and then moved to the Surface, I could notice it, and I'm afraid other users will notice as well.

Anyone else noticing this difference or is it just me?


This XKCD cartoon explains the recent Instagram issue.

Origin of Algorithm word

If you never read about the origin of the word, "The Art of Computer Programming" by Donald Knuth explains it:
The word "algorithm" itself is quite interesting [...] The word did not appear in Webster's New World Dictionary as late as 1957; we find only the older form "algorism" with its ancient meaning, the process of doing arithmetic using Arabic numerals. [...] historians of mathematics found the true origin of the word algorism: It comes from the name of a famous Persian textbook author, Abu 'Abd Allah Muhammad ibn Musa al-Khwarizmi (c. 825) - literally, "Father of Abdullah, Mohammed, son of Moses, native of Khwarizm." [...] Al-Khwarizmi wrote the celebrated book Kitab al-jabr wa'l-muqabala ("Rules of restoring and equating"); another word, "algebra", stems from the title of his book, which was a systematic study of the solution of linear and quadratic equations.

You learn not only the origin of "algorithm" but also that of "algebra".

Knuth recommends a 1981 article from Lecture Notes in Computer Science, but access to it is restricted to members. I'd be curious to read it, but did not gain access through the MS corporate subscription to Springer.

Friday, December 21, 2012

Writing good code: be concise and do one thing

These are two good posts on guidelines about writing clean, good code:

What are some of your personal guidelines for writing good, clear code? (see Jason Barrett Prado's answer)

Rule of 30 – When is a method, class or subsystem too big?

Both posts touch on a very interesting point: code length. One of the common indications of bad or confusing code is lengthy methods, functions, or classes. I remember seeing or hearing about methods that were 2 to 3 THOUSAND lines long, with these long indented if statements. What happened to them? Typically everybody avoided touching them if the original developer left. And if refactoring is not valued (and/or becomes more of a liability), nobody will ever try to improve their maintainability. That's why the tips above come in handy: forcing yourself to somehow limit the size of methods and classes goes a long way, because it ends up making the developer think about abstraction and how to split the code.

The other important point in Jason's answer is single responsibility. This also goes a long way. It's a very simple concept, but many experienced developers still don't think along these lines. Have a method doing one thing, have a class doing one thing. When you read books like The Art of Readable Code or Clean Code and start adopting the mindset of having good names, then methods that do multiple things will start to become obvious. When that happens, flag and refactor them at some point. That will improve your code greatly as well.

Friday, December 14, 2012

Code Reviews: small and done by experts

I read a good article today on lessons on code reviews from open source software.

Contemporary Peer Review in Action: Lessons from Open Source Development

Unfortunately you need to be an IEEE member/subscriber to access it, but if you do have access to it, read these lessons.

The core idea is pretty much that: (1) reviews must be small; (2) reviews must be done by experts, otherwise they don't offer much value.

From my experience, most developers wanted feedback and took it well to improve the code. However, on the negative side, I've seen some techniques to work around the process in place to require code review - and that's where the purpose was defeated.

The main technique that I've seen is: avoid the developer that gives more feedback and send the review to an "auto-approver" developer. This is just a technique to bypass the process, as there is essentially zero interest in getting feedback or making the code better.

Another technique is to send the review to new hires, with the excuse of ramping them up, but with the intent of not having the design or code questioned at all.

Of course, if a reviewer unexpectedly "annoys" the developer with valid concerns, just reply with "won't fix" and get that captured in a bug that will never get prioritized.

This issue becomes even more critical if technical leaders employ these techniques to "get things done".

How do we get developers not to use these techniques and to do the right thing of sending reviews to the experts and waiting for their feedback? I wonder if these developers are actually invested or just prioritize other things over quality.

Technical Debt

Some great links about technical debt:

Are bugs part of technical debt?

Technical Debt - from metaphor to theory and practice (presentation)

Special IEEE Explore issue on Technical Debt (Nov 2012)

A must read if you're passionate about quality and have seen or heard of some of the issues mentioned in these materials.

Friday, November 23, 2012

Career Advice for Software Engineers

A great career advice post for software engineers. It definitely matches my experiences and my observations:

Don't Call Yourself A Programmer, And Other Career Advice

Find below an excellent quote from this post. Sad, but true:
The person who has decided to bring on one more engineer is not doing it because they love having a geek around the room, they are doing it because adding the geek allows them to complete a project (or projects) which will add revenue or decrease costs. Producing beautiful software is not a goal. Solving complex technical problems is not a goal. Writing bug-free code is not a goal. Using sexy programming languages is not a goal. Add revenue. Reduce costs. Those are your only goals.

Thursday, November 22, 2012

Success: not features, but solving customer's problems

Not all places are very customer focused due to culture and/or inefficiencies, but I have to say that this quote resonates with me:
"Success is not delivering a feature; success is learning how to solve the customer's problems" - The Lean Startup
Working for the customer can be quite motivating for employees.

Tuesday, November 20, 2012

Using queues for asynchronous processing

Great article on the reasons to consider using message queues for asynchronous processing. I do remember the hard time I had convincing coworkers of the advantages of queues - people who have never used them are reluctant to use them and very often prefer to shoehorn solutions on top of databases. That is what this article talks about:

Asynchronous Processing in Web Applications, Part 1: A Database Is Not a Queue
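
To make the idea concrete, here is a minimal sketch of asynchronous processing with a queue. It uses Python's in-process queue.Queue as a stand-in for a real message broker, and the names handle_request, worker and run_demo are made up for illustration; they do not come from the article.

```python
import queue
import threading


def handle_request(jobs, payload):
    """Request side: enqueue the slow work and return right away."""
    jobs.put(payload)
    return "accepted"


def worker(jobs, results):
    """Background side: process jobs until a None sentinel arrives."""
    while True:
        payload = jobs.get()
        if payload is None:
            jobs.task_done()
            break
        results.append(payload.upper())  # stand-in for the slow work
        jobs.task_done()


def run_demo():
    jobs = queue.Queue()
    results = []
    thread = threading.Thread(target=worker, args=(jobs, results))
    thread.start()
    for name in ["resize-image", "send-email"]:
        handle_request(jobs, name)
    jobs.put(None)   # tell the worker to stop
    jobs.join()      # wait until every queued item was processed
    thread.join()
    return results
```

The request path only pays for the put call; the expensive part runs later on the worker, which is the property that is awkward to get from polling a database table.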

Saturday, November 10, 2012

Generating Sequential IDs

A typical problem we see in distributed systems is how to generate sequential IDs efficiently. Here you can find three approaches:

Percolator (Google)

The timestamp oracle is a server that hands out timestamps in strictly increasing order. Since every transaction requires contacting the timestamp oracle twice, this service must scale well. The oracle periodically allocates a range of timestamps by writing the highest allocated timestamp to stable storage; given an allocated range of timestamps, the oracle can satisfy future requests strictly from memory. If the oracle restarts, the timestamps will jump forward to the maximum allocated timestamp (but will never go backwards). To save RPC overhead (at the cost of increasing transaction latency) each Percolator worker batches timestamp requests across transactions by maintaining only one pending RPC to the oracle. As the oracle becomes more loaded, the batching naturally increases to compensate. Batching increases the scalability of the oracle but does not affect the timestamp guarantees. Our oracle serves around 2 million timestamps per second from a single machine.
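
The range-allocation trick in the quote can be sketched roughly as below. This is only an illustration, not Percolator's code: the TimestampOracle class, the dict standing in for stable storage, and the batch size are all assumptions.

```python
class TimestampOracle:
    """Hands out strictly increasing timestamps, persisting only range limits."""

    def __init__(self, storage, batch_size=1000):
        self.storage = storage        # dict standing in for stable storage
        self.batch_size = batch_size
        # After a restart, jump forward to the highest allocated timestamp.
        self.next_ts = storage.get("max_allocated", 0)
        self.limit = self.next_ts

    def _allocate_range(self):
        # Persist the new upper bound before serving anything from the range.
        self.limit = self.next_ts + self.batch_size
        self.storage["max_allocated"] = self.limit

    def get_timestamp(self):
        if self.next_ts >= self.limit:
            self._allocate_range()
        self.next_ts += 1             # served from memory within the range
        return self.next_ts
```

A restarted oracle skips whatever was left of the previous range, so timestamps may jump forward but never go backwards.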


Snowflake (Twitter)

To generate the roughly-sorted 64 bit ids in an uncoordinated manner, we settled on a composition of: timestamp, worker number and sequence number. Sequence numbers are per-thread and worker numbers are chosen at startup via zookeeper (though that’s overridable via a config file). We encourage you to peruse and play with the code: you’ll find it on github. Please remember, however, that it is currently alpha-quality software that we aren’t yet running in production and is very likely to change.
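
As a rough sketch of the composition described in the quote, the three parts can be packed into one integer with shifts. The 10-bit worker and 12-bit sequence widths below are assumptions picked for illustration (the quote does not state them), and compose_id / decompose_id are hypothetical names, not Snowflake's API.

```python
WORKER_BITS = 10     # assumed width, for illustration only
SEQUENCE_BITS = 12   # assumed width, for illustration only


def compose_id(timestamp_ms, worker_id, sequence):
    """Pack timestamp, worker number and sequence number into one integer."""
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    return (
        (timestamp_ms << (WORKER_BITS + SEQUENCE_BITS))
        | (worker_id << SEQUENCE_BITS)
        | sequence
    )


def decompose_id(value):
    """Split an id back into (timestamp_ms, worker_id, sequence)."""
    sequence = value & ((1 << SEQUENCE_BITS) - 1)
    worker_id = (value >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
    timestamp_ms = value >> (WORKER_BITS + SEQUENCE_BITS)
    return timestamp_ms, worker_id, sequence
```

Because the timestamp sits in the highest bits, ids from a later millisecond always compare greater than ids from an earlier one, regardless of which worker produced them. That is what makes them roughly sorted without coordination.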


MySQL Ticket Server (Flickr)

A Flickr ticket server is a dedicated database server, with a single database on it, and in that database there are tables like Tickets32 for 32-bit IDs, and Tickets64 for 64-bit IDs.

The Tickets64 schema looks like:
CREATE TABLE `Tickets64` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `stub` (`stub`)
)

SELECT * from Tickets64 returns a single row that looks something like:

+-------------------+------+
| id                | stub |
+-------------------+------+
| 72157623227190423 |    a |
+-------------------+------+

When I need a new globally unique 64-bit ID I issue the following SQL:

REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
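
To see why the REPLACE INTO statement keeps a single row while still producing a fresh ID each time, here is a small sketch that runs the same idea against an in-memory SQLite database. SQLite is only a stand-in for the MySQL server described above, the schema is adapted to SQLite's syntax, and the helper names are made up for illustration.

```python
import sqlite3


def make_ticket_server():
    """Create an in-memory table that mimics the Tickets64 idea."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Tickets64 ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " stub TEXT NOT NULL UNIQUE)"
    )
    return conn


def next_id(conn):
    """Replace the single 'a' row; the new row receives a fresh id."""
    cur = conn.execute("REPLACE INTO Tickets64 (stub) VALUES ('a')")
    conn.commit()
    return cur.lastrowid


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM Tickets64").fetchone()[0]
```

Each call conflicts with the existing row on the unique stub column, so the old row is removed and a new one is inserted with the next auto-increment value. The table never grows, yet the IDs keep increasing.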

Romney's Orca fiasco

This is a great article on the IT project fiasco in Mitt Romney's campaign:

Inside Team Romney's whale of an IT meltdown

This is a good quote that is representative of much of what you see in IT projects:
[...] I had some serious questions—things like 'Has this been stress tested?', 'Is there redundancy in place?', and 'What steps have been taken to combat a coordinated DDOS attack or the like?', among others. These types of questions were brushed aside (truth be told, they never took one of my questions). They assured us that the system had been relentlessly tested and would be a tremendous success."

Sunday, November 04, 2012

Consensus Protocol

You have probably heard of 2-phase commit, Paxos, and consensus protocols. Paxos itself seems to be a complex thing to understand and, although proven to be correct, it's not uncommon to hear that it's not really required and some "simpler" algorithm can be used.

These posts I came across give a great explanation of the shortcomings of 2-phase commit when it comes to failures and also of 3-phase commit, which handles one type of failure (fail-stop) well. They give great examples of hard distributed-system issues that make these protocols not robust enough.
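
A toy, single-process sketch of 2-phase commit makes the main shortcoming easier to see. It is a simplification under stated assumptions: there is no networking, no timeouts and no recovery, and the Participant class and the coordinator_crashes flag are hypothetical names introduced only for this illustration.

```python
class Participant:
    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.state = "init"

    def prepare(self):
        """Phase 1: vote, and hold resources if the vote is yes."""
        self.state = "prepared" if self.vote_yes else "aborted"
        return self.vote_yes

    def finish(self, commit):
        """Phase 2: apply the coordinator's decision."""
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants, coordinator_crashes=False):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    if coordinator_crashes:
        return None  # the decision is never sent
    # Phase 2: commit only if every single vote was yes.
    decision = all(votes)
    for p in participants:
        p.finish(decision)
    return decision
```

When the coordinator fails between the two phases, every participant that voted yes is left in the "prepared" state and cannot safely decide on its own. That is the blocking behavior that motivates 3-phase commit and, eventually, Paxos.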

And this is a great quote from these articles:
Mike Burrows, inventor of the Chubby service at Google, says that “there is only one consensus protocol, and that’s Paxos” – all other approaches are just broken versions of Paxos.
I'd definitely recommend you take the time to read them:

Consensus Protocols: Two-Phase Commit
Consensus Protocols: Three-phase Commit
Consensus Protocols: Paxos

And, to wrap up, when I see quotes like the one above and the number of issues with 2PC and 3PC, I wonder how reliable consensus protocols like the one used by MongoDB to pick the primary replica actually are:
We use a consensus protocol to pick a primary. Exact details will be spared here but that basic process is:
1. get maxLocalOpOrdinal from each server.
2. if a majority of servers are not up (from this server's POV), remain in Secondary mode and stop.
3. if the last op time seems very old, stop and await human intervention.
4. else, using a consensus protocol, pick the server with the highest maxLocalOpOrdinal as the Primary.

Any server in the replica set, when it fails to reach master, attempts a new election process.
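
The four quoted steps can be sketched as a single decision function. This is only an illustration of the decision logic, not MongoDB's implementation: it leaves out the consensus round in step 4 entirely, and the dictionary fields, the pick_primary name and the max_op_age threshold are assumptions.

```python
def pick_primary(servers, now, max_op_age=3600):
    """Return (action, name) following the four quoted steps.

    servers: list of dicts with the hypothetical keys 'name', 'up',
    'max_local_op_ordinal' and 'last_op_time' (seconds).
    """
    # Steps 1 and 2: look at reachable servers and require a strict majority.
    up = [s for s in servers if s["up"]]
    if 2 * len(up) <= len(servers):
        return ("secondary", None)
    # Step 3: if even the newest operation is very old, wait for a human.
    newest_op = max(s["last_op_time"] for s in up)
    if now - newest_op > max_op_age:
        return ("manual", None)
    # Step 4 (without the consensus round): highest ordinal wins.
    best = max(up, key=lambda s: s["max_local_op_ordinal"])
    return ("primary", best["name"])
```

Note that a server that is down is ignored even if it holds the highest ordinal, which hints at the kind of failure scenario the consensus posts above worry about.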

Data replication: why 3 copies?

If you ever asked yourself why these storage systems talk about making 3 copies of your data, check out this article that came out in July and that I just reread:

Data Replication in NoSQL Databases

It talks about RAID, why it's not used for big databases, and the reasons for this magical 3 for data replication.

Note that the same reasoning could apply to any replication you may be doing yourself, even at the application level.

Saturday, November 03, 2012

NAT, peer-to-peer and hole punching

This is a great article if you want to understand how connections directly to your game console behind your router (NAT) happen:

Peer-to-Peer Communication Across Network Address Translators

As it turns out, although SYN packets for TCP connections are blocked by default, there are techniques that "punch holes" in the NAT and allow them to go through. For services like Xbox Live or Skype, for instance, that minimizes the need for relay servers.

Tuesday, October 30, 2012

Yoder on Good vs. Bad architecture

One of the biggest challenges in software engineering is to prove that shortsighted decisions, both at the low and high level, can be detrimental and cost more in the long term. So far, whenever I had to convince someone, it has been very challenging. And along these lines, it's always refreshing to see someone being sensible when it comes to that:

“If you think good architecture is expensive, try bad architecture.” Brian Foote & Joseph Yoder

Hashing functions and MurmurHash

Today I watched a presentation on cloud applications, and one of the interesting points was how hashing the users across shards had to be done in a relatively uniform way to make sure the load is evenly distributed.

While the presenter mentioned that they used an in-house hashing algorithm, they realized that a good hashing algorithm can make a lot of difference. In this case, his suggestion is to use Murmur.

I found a great post on StackExchange on that, which is a must read before picking the hashing function:

Which hashing algorithm is best for uniqueness and speed?

And, of course, leave aside the Not Invented Here syndrome and don't go implement your own function :-)
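
As a quick way to check how evenly a hash spreads users across shards, something like the sketch below can help. MurmurHash is not in Python's standard library, so this uses hashlib.blake2b purely as a stand-in; shard_for and shard_histogram are hypothetical helper names.

```python
import hashlib
from collections import Counter


def shard_for(user_id, num_shards):
    """Map a user id to a shard using a stable, well-mixed hash."""
    digest = hashlib.blake2b(user_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_shards


def shard_histogram(user_ids, num_shards):
    """Count how many users land on each shard."""
    return Counter(shard_for(u, num_shards) for u in user_ids)
```

For example, shard_histogram([f"user-{i}" for i in range(10000)], 8) should give counts that are all reasonably close to 1250. A poorly mixed hash over sequential ids tends to show visible skew in the same test.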

G-Wan Web Server

This post on the performance of Node.js vs. G-Wan caught my attention:

What makes something "notable"?

These are some of the points:
  • Node.js is not as great in terms of performance as normally advertised
  • They claim that their web server needs 2,444x fewer servers than Node.js to run a mere "hello world"
  • Technology media dismisses G-WAN, in spite of all the technical superiority

But what really caught my attention the most was not the product, but the tone of this post. To me it seems to be aggressive, and actually I notice that most of the web site seems to have the same tone. Look at the section that mentions whether it's open source:
G-WAN is a freeware. It means that it is free for all (commercial users included). But some virulent (anonymous) users claim that this is not enough. They exige G-WAN's source code, and, "at no cost".
This kind of tone definitely does not attract a lot of people, and can have quite the opposite effect, as it comes across as too radical. Of course, if they do have a compelling business proposal through their software, they can be at least somewhat successful, but taking a more neutral stance could be more fruitful and appeal to a larger audience.

One other interesting thing is their results when running benchmarks on Windows:
Linux was found to be much faster. After years of development this gap is surely larger now because Unix leaves more room for developers to innovate.
IIS 7.0/Windows is by far slower than all – despite being part of the kernel. G-WAN/Windows does better than Apache/Linux and GlassFish/Linux despite the Windows user-mode overhead which is 6x higher than on Linux. But G-WAN/Linux crunches G-WAN/Windows. Yes, Windows is that lame.
No wonder they discontinued the Windows version back in 2009. I wonder how much of it is due to the system and how much is because they didn't tune the OS for better performance - or even whether they used Windows Server or Windows Client.

And even when it compares to Tomcat, it seems that this G-Wan web server kicks ass:
G-WAN runs an "hello world" with 10x less CPU and 24x less RAM handling 11x more requests in 13x less time than Apache Tomcat… on a 6-Core. Many other languages (PHP, C#, JS...) benefit even more.

Update 11/12/2012: Contrary to what I mentioned above, it doesn't seem that the issue was with OS tuning, but with the OS internals. Please see the comment on this post by Timothy Bolton.

Monday, October 29, 2012


Harvard Business Review has a good article on micromanagement, which is a recurring theme.

Stop Micromanaging and Learn to Delegate

One of the highlights of this article:
It's important to realize that other people won't do things exactly the same way you would. Challenge yourself to distinguish between the style in which direct reports approach tasks and the quality of the results.

Great Operations Leader

I've posted articles by John Allspaw a few times, and this is another instance where I admire what he wrote and think it's worth quoting a few lines:

What are the attributes (other than technical ability/experience) that make a great VP of Technical Operations?

Some good quotes:
Great Ops leaders understand that enduring risk of service outage, failure, and degradation is necessary for evolving and enabling a business, so they don't avoid change, they instead build a straightforward and collaborative way for change (and the accompanying risk) to take place.
You want someone who is haunted by worst-case scenarios but doesn't allow them to paralyze the organization or technical evolution of an infrastructure. Leaders that I admire value solutions and remediation-finding over blame assignment and avoidance.
And, as for many other areas of life:
a great Ops leader is one that continually looks elsewhere for improvement, inspiration, mentoring, and guidance. This means that becoming a great Ops leader isn't an actual achievement, it's a never-ending process involving humility, which in turn means that lists of great Ops leader attributes will always be incomplete. :)

Sunday, October 28, 2012

Senior Engineer

John Allspaw wrote a fantastic post on senior engineers that is a must read for those seeking to improve themselves:

On Being A Senior Engineer

Through this post, I found a number of areas in which I have improved in the past few years, and also some areas where I need to make some adjustments. But that's the beauty of it: if you are actively trying to improve yourself and making the effort, results will come. In the post above, John also gives some references to additional posts or books that may be good to read for those actively looking to improve.

There are many parts of this post I could quote here, but just a couple of teasers for you:
I’ve mentioned it elsewhere, but I must emphasize the point more: the degree to which other people want to work with you is a direct indication on how successful you’ll be in your career as an engineer. Be the engineer that everyone wants to work with.
I also noticed that to be a really strong indication of an engineer's maturity. Managers sometimes think differently and don't take that as an indication, but I most definitely think they should. And the interesting thing is that oftentimes these are the most knowledgeable engineers, who will offer learning possibilities to those around them and let creativity flourish on the team.
The only true authority stems from knowledge, not from position. Knowledge engenders authority, and authority engenders respect – so if you want respect in an egoless environment, cultivate knowledge.
This is very true. In spite of formal authority, one simply cannot have any actual authority if he/she doesn't have the knowledge and doesn't actually earn the authority. Other people become the de facto leaders and, unless boycotted for some reason, they will probably have a much higher influence. So the key verb here is earn.

Saturday, October 27, 2012

Job Descriptions

I've always been curious to see job descriptions when someone reaches out to me about software engineer (or related) positions. Most of them don't really have anything uncommon, but sometimes you see something in them that could be an indication of how the company approaches software development and what they value.

For instance,
  • "Write code that is art": this is the first time I have seen (or perhaps noticed) art in a job description. It is so nice to see that in a description (see my post Software As Art), as it may indicate a team/company/manager that sees software beyond utilitarianism.
  • "Professional and Technical Competencies": it's good to see the word "professional", as that could indicate that this company may want to do things in a professional way.
  • "Use SOLID design principles and patterns.": this means, first of all, that someone at that company knows design principles. Points for that. And they seem to value them. So bandaid code should be detected and discouraged, as it would not follow any good principles or patterns.
I know that, if I have to write a job description in the future, I'll put these small signs in the text for those who are paying attention and looking for them.

Software quality hell: bandaid development

In the industry, I see a mindset problem that can be very detrimental to the software quality - and quite opposite to the Software As an Art mindset: shortsightedness to fix only the problem at hand. Let me explain through an example.

Let's say you have a very simple problem: fix a unit or component test that is broken after some major changes in other components. Now, for some reason, a parameter is being passed null and you get a null pointer/reference exception. What do you do?
  1. You can just do an "if (param != null)" and work around the issue. If this is the only issue, the test will pass, you can close the bug with the feeling that you're quick, your manager will think that you're really efficient, everybody will think that the software is now fixed, and ultimately you will just collect the rewards for being a good engineer for the business.
  2. You can try to understand why this parameter is now null, which would require going through some other code, potentially requiring major fixes in some other areas. This will very likely take longer (sometimes much longer), so it will not give immediate relief and, depending on your manager, you may be considered someone who is not quick, or who overengineers or overly complicates things.
Picking option 1 (bandaid) over option 2 (proper) has a number of consequences that most people don't consider - or just do not care:
  • Ticking bomb: essentially that is what this decision is. The code base becomes harder and harder to reason about. Bugs are very complicated to understand, and there are side effects and regressions in many places.
  • The time saved with this solution will not make up for the time spent investigating issues in the future, given the difficulty and obscurity of software full of unclear behaviors. It is just a mess.
  • It is just a black hole: the worse, more confusing, and less maintainable the code is, the harder it is to isolate yourself from all this mess. To implement new features or fix bugs, you end up needing to write bad code, because writing good code may require fixing so many other places that it becomes very risky to the business and no manager would actually approve that.
  • New people changing this code become very scared of making changes, as they may have effects all over the place. In particular, in orgs where risk-taking is not rewarded and being bold and going above and beyond is not worth the risks you're taking, people will not take the initiative to improve such code.
  • Any serious software professional loses the pride of working on such a code base. This is an overlooked and often not even noticed aspect of the hidden consequences of such behavior. You risk losing good professionals and, not only that, but given word of mouth, you risk not attracting really good professionals.
In some cases, like during a release, it's perfectly acceptable to pick option 1, as long as there's the professionalism and responsibility of tracking this technical debt and tackling it right away.

In my experiences, I've come across quite a few people that would pick option 1 right away. And not only that, but it would be very hard to convince them of the reasons to consider option 2. I've seen people from all backgrounds (with and without graduate degrees), working at big and known software companies and at startups. As it turns out, it is a mindset issue, like a friend of mine said. Unfortunately this is one of the areas of the creative process of software development that cannot be enforced by process.

When this is restricted some individuals, it's less of a problem if the team culture embraces high quality - as in, you see people in general improving their code quality and spending time on that, either through additional efforts, reading books or any other material to push themselves forward; management actually values that, etc. In that case, the issue can be contained if the team has the encouragement of rejecting code reviews and the "bandaid" engineers accept comments to fix the issue properly.

However, this issue become very toxic when leads/managers themselves are the first ones to pick themselves (or to encourage through subliminal messages) that the bandaid solutions are the ones to go with. Not only that, but if the company's culture, performance reviews or other mechanisms actually encourage and/or reward one to think in a shortsighted way at the expense of the long-term solutions, then there is nothing one engineer alone can do - and in this case I'd suggest to consider other options. In my opinion, that's where companies start to lose the agility to add features and innovate as they get caught up in the software mess they created in the past.

Finally, do you know what the problem is with the software bandaids that keep getting added to the software? If that software is integral to the business, you can't get rid of it and just do it properly from scratch. Someone will need to deal with the consequences of it. And that's when you distinguish great engineers that are worth keeping in your org: they are not the ones to think that it will be someone else's problem. Great engineers don't think about their upcoming review first, but think about doing it right (of course aligning with the business priorities). If you want to build a great team and a great company, start distinguishing between these types of engineers - not always the ones that apparently deliver are the ones worth keeping or rewarding.

PS - for companies that run surveys among engineers and do take them seriously to improve the company, I'd suggest a couple of questions to measure this effect:
  • Do you actually feel proud of the product/service you work on?
  • Do you feel that management embraces high quality and does not promote low quality indirectly through shortsightedness?

Wednesday, October 24, 2012

Caching Algorithms

After working on the Linux kernel and implementing an LRU eviction policy for a memory compressed cache, the following article definitely hit home. It talks about the different eviction policies for caching, which one Dropbox picked for its client, and briefly mentions at the end how to do simple cache invalidation.

Caching in theory and practice
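
To make the LRU idea concrete, here is a minimal Python sketch of my own. It is not the Dropbox client's code nor my kernel implementation; the class name `LRUCache`, its `capacity` parameter, and the `invalidate` method are names I made up for the illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: evicts the least recently used key when full."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a hit makes the key most recent
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

    def invalidate(self, key):
        """Simple invalidation: forget the key so the next read misses."""
        self._items.pop(key, None)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touch "a", so "b" becomes the eviction candidate
cache.put("c", 3)  # the cache is full, so "b" is evicted
print(cache.get("b"), cache.get("a"), cache.get("c"))
```

The `OrderedDict` keeps the recency order for free, which is why both a hit and a write move the key to the end.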

Map Reduce Patterns

I just read the post below on Map Reduce patterns. It is long, but it goes over many of the patterns and how map reduce can be used for many interesting computations. I particularly found the PageRank algorithm the most interesting.

MapReduce Patterns, Algorithms, and Use Cases

Kudos to Ilya Katsov (author) for putting this together.
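
Since PageRank was the pattern I liked the most, here is a rough sketch of how one iteration can be expressed as a map step and a reduce step. This is my own in-memory simulation in Python, not code from the post: the function names, the damping factor of 0.85, and the tiny example graph are assumptions, and it ignores dangling pages (every page must have at least one outgoing link, and every link target must be a key of the graph).

```python
from collections import defaultdict

DAMPING = 0.85  # conventional damping factor, an assumption of this sketch


def map_phase(node, rank, out_links):
    """Emit this node's rank share to every page it links to."""
    yield node, 0.0  # keep the node in the output even with no in-links
    for target in out_links:
        yield target, rank / len(out_links)


def reduce_phase(contributions, num_nodes):
    """Sum the incoming shares and apply the damping formula."""
    return (1 - DAMPING) / num_nodes + DAMPING * sum(contributions)


def pagerank_iteration(graph, ranks):
    """One map -> shuffle -> reduce round over an in-memory graph."""
    shuffled = defaultdict(list)  # stands in for the framework's shuffle
    for node, out_links in graph.items():
        for key, value in map_phase(node, ranks[node], out_links):
            shuffled[key].append(value)
    return {node: reduce_phase(values, len(graph))
            for node, values in shuffled.items()}


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {node: 1 / len(graph) for node in graph}
for _ in range(20):
    ranks = pagerank_iteration(graph, ranks)
print(ranks)
```

In a real MapReduce job the graph structure would also have to be passed along from iteration to iteration; here it simply stays in memory.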

Thursday, October 18, 2012

Life's work

I read the article "What Do I Want To Do When I Grow Up?" Is The Wrong Question To Ask and thought to myself: "story of my life". This is one of the paragraphs that I could relate to very much:
My unconventional career path took me to five major national and international cities. I stayed at jobs for as long as 18 months and as short as one month. I sold all of my belongings and moved cross-country because my intuition told me to. I worked with more than 15 different startups in one year of living in New York City. I started a blog to document my journey--both the learning and the mistakes. I started a website to document the stories of people boldly pursuing their life's work. I messed up two startups. I accidentally turned insomnia into a global movement. I met with tarot card readers, talked strategy with multimillion-dollar entrepreneurs, and helped a best-selling author launch a publishing company, all to see if I could answer the question I'd been wondering about since I was 5: What do I want to do when I grow up?
And on the same note, this is another great post on "life's work": 8 Signs You've Found Your Life's Work

This part is particularly interesting and I believe it's one of the most noticeable signs that you found your life's work:
The people who matter notice.
"You look vibrant!" and "I've never seen you so healthy and happy!" and "This is without question what you're meant to be doing!" are among the comments you may hear from the people closest to you when you're on the right path.

Windows 8 Closed Distribution Model

Windows 8 is almost there, and this post on its closed distribution model gave me a quite different historical perspective on Windows 8 and its desktop model.

I agree with the article that Microsoft will be moving towards the Windows 8 UI, with less and less of the old desktop UI. That will probably happen for the very same reasons DOS applications faded into obscurity: features and focus will be on the new UI, so there will be no reasonable way of keeping an application running on the old UI for too long.

The whole problem is that Microsoft controls which applications are distributed for the new UI. Just like Apple does with the apps for iOS. And just like Apple, it can dictate what one can and cannot have on their device. And, depending on the distribution model, it can simply decide that a device is no longer supported and essentially "brick" it by not allowing apps to be released for the OS versions that older devices can run. That is just what happened with iPhones and, more recently, with the iPad 1, which had a lifetime of just 2 years.

Although I'm currently a Microsoft employee, this is very scary. I never liked Apple's model, and for this reason never had an iOS device. I never liked Amazon Kindle's model, even when I was an employee at Amazon, because Amazon decides where I can actually read the books I purchased (besides the fact that I cannot loan a book for as long as I want or give the books away, but that's more of a digital media problem than something specific to the Kindle). Based on the same principles, I'm not comfortable with the Windows model going forward.

What is unfortunate about this whole discussion is that it's not clear whether we have any other option. I love open source, but is it a viable option going forward to compete with these large app stores? Will Android, which has the most open distribution model, be able to survive in spite of its fragmentation, its not very well-oiled processes, and patent litigation?

One could say that we can just go and use open source, but the future will be mostly in the hands of those that hold data. That is why some startups are very valuable, even if they don't have a defined business model, as long as they have users' data. And they will dictate which platforms are supported or not, based on their business reasons. Given that, you will not have an option if you want to use these applications.

One could just say no to all of that and not use any of these applications if they imply using a closed platform and some sort of lock-in. But you can only do that to the extent that it doesn't impair your ability to live your life and/or run your business in an efficient way; otherwise you will have just an "illusion of control", and will have to surrender to reality and pay up, complying with whatever restrictions these big or small companies impose.

The future looks very scary, unless market forces make companies like Microsoft or Apple open up their distribution models. That will only happen, though, if it is better for their revenue; otherwise they will not do it out of "good will".

Tuesday, October 16, 2012

Implementing MVCC on a Key-Value store

The following post is quite interesting and has good ideas on how to implement MVCC (Multi-Version Concurrency Control) on a key-value store:

Implementation of MVCC Transactions for Key-Value Stores

It gives good ideas on how to track when each entity came into existence and when it went away, by storing the transaction that created it and the one that deleted it.
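
To illustrate the idea, here is a small Python sketch of my own (not code from that post). The names `Version`, `created_by`, `deleted_by`, and the `committed` set are made up for the example, and it simplifies things by treating the set of committed transactions as globally known instead of capturing it per snapshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    value: str
    created_by: int                   # ID of the transaction that wrote it
    deleted_by: Optional[int] = None  # ID of the transaction that removed it


def visible(version, snapshot_txn, committed):
    """True if `version` can be seen by a reader at `snapshot_txn`."""
    created = (version.created_by in committed
               and version.created_by <= snapshot_txn)
    deleted = (version.deleted_by is not None
               and version.deleted_by in committed
               and version.deleted_by <= snapshot_txn)
    return created and not deleted


def read(versions, snapshot_txn, committed):
    """Return the newest visible value for a key, or None."""
    candidates = [v for v in versions if visible(v, snapshot_txn, committed)]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v.created_by).value


history = [
    Version("v1", created_by=1, deleted_by=3),  # overwritten by txn 3
    Version("v2", created_by=3),
]
committed = {1, 3}
print(read(history, snapshot_txn=2, committed=committed))  # sees "v1"
print(read(history, snapshot_txn=4, committed=committed))  # sees "v2"
```

Note that the comparison `created_by <= snapshot_txn` only makes sense if transaction IDs are totally ordered, which is exactly where the global sequence number comes in.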

However, I have concerns about this global sequence number and I copy here my comment instead of retyping it:
The idea of keeping the transaction ID is very interesting, but requires a global sequence number to be assigned to the transactions, right?

I don’t know which NoSQL databases have that, but when I think about Windows Azure Storage (the one I’ve been working with more recently), that would be a problem. Actually, that’s a problem with any scalable DB, as it can be a contention point.

In Windows Azure Storage, or other stores that don’t have that on the server side, it’s more of a problem to have this global number, as it requires operations to read the number, increment it, and then update it. This creates a contention point and reduces the rate of transactions you can have.

What are your thoughts on it? Have you tried to implement that on top of a NoSQL DB? Does Oracle Coherence offer an increment operation for this global counter?
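
To make the contention concern concrete, here is a Python sketch of the read, increment, and conditional update loop. The `ConditionalStore` class is a hypothetical in-memory stand-in I wrote for the example; it is not the Windows Azure Storage API nor any other real client library, and the ETag is just a counter.

```python
import threading


class ConditionalStore:
    """In-memory stand-in for a store that offers only conditional updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0
        self._etag = 0

    def read(self):
        with self._lock:
            return self._value, self._etag

    def update_if_match(self, new_value, etag):
        """Write only if nobody changed the entry since `etag` was read."""
        with self._lock:
            if etag != self._etag:
                return False
            self._value = new_value
            self._etag += 1
            return True


def next_transaction_id(store):
    """Read, increment, conditionally update; retry on every lost race."""
    retries = 0
    while True:
        value, etag = store.read()
        if store.update_if_match(value + 1, etag):
            return value + 1, retries
        retries += 1


store = ConditionalStore()
ids, lost_races = [], []
results_lock = threading.Lock()


def worker():
    for _ in range(1000):
        txn_id, retries = next_transaction_id(store)
        with results_lock:
            ids.append(txn_id)
            lost_races.append(retries)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(set(ids)), "unique IDs,", sum(lost_races), "retries")
```

Every ID is unique, but every transaction in the system has to go through this single entry, and each lost race costs an extra round trip against a real remote store.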

UPDATE 10/18/2012: today I read another good post on how Postgres works with MVCC. It has a few more details to complement your knowledge: PostgreSQL Concurrency With MVCC

Monday, October 15, 2012

Business cards

I ordered some business cards for myself on and have to tell you how pleased I was with the experience. Very nice and functional website, cards were delivered before the estimated date, and quality was just great. They got themselves a customer.

Software as an Art

I see software as an art. Although it can be used simply as a mundane tool to get something to work, that's not what energizes me when working with software. What does excite me about it is the art in a beautiful solution, in a nice abstraction, in a great architecture, in readable and maintainable code.

In spite of seeing software as an art, I've been able to deliver in all the projects I've participated in by making the right prioritizations. After all, it must serve a concrete need and be part of a business, adding some sort of value. However, even then, it's been an art and I'm invested in making it better, nicer, even if it requires some harder work.

The problem, though, is to be in environments where software is just viewed as a tool, where achieving a result is more important than anything else (even if it's totally shortsighted), where getting something done simply annihilates the art behind the software, where achieving the reward for doing something (even if crappy) is more important than doing it well or right.

In these environments, you seem to become merely a cog in a machine of results, because your creativity and your heart are not really that required as long as you solve the problem in any way or form. That shifts software from an art or a craft into an almost mechanical task. That simply kills the joy of software development.

If you have a similar view of software, how do you actually cope with this in your work environment? Do you try to shield yourself? Or is there any work environment where software as an art/craft is still preserved? I'd love to know about your experiences.

Sunday, October 14, 2012

NoSQL: algorithms and data modeling

My Sunday reading session was on a couple of great NoSQL blog posts that are definitely worth reading:

  • Distributed Algorithms in NoSQL Databases: it talks about data consistency, data placement, and system coordination. I learned some interesting new things, like the Bully Algorithm for leader election, passivated replicas, how to handle rebalancing between replicas, and multi-attribute sharding, among others.
  • NoSQL Data Modeling Techniques: it lists a bunch of techniques, many of which match my experiences with NoSQL databases. I like this quote, which is essentially how I explained the way I designed the data model: "The main design theme is 'What questions do I have?'"
These posts, in particular the first one, require some background, otherwise they may seem a bit hazy. But the author added a great list of references that should be read to understand them better.

The more I learn about NoSQL - or work with distributed systems - the more I see how much complexity there is in seemingly simple problems, like data modeling. It's very hard to get others to understand this, but as soon as you start reading these posts and see how many different concerns one has when thinking of the data model, for instance, you see how skilled a software engineer must be to minimize mistakes and rework. And that's even though the posts above don't talk about processing like MapReduce, which is yet another aspect to be factored in.

The ones below I haven't read yet, but are on my list and seem promising:

Saturday, October 13, 2012

Startup vs. Big Company Mindset

Paul Graham wrote a great post on startups that is definitely worth reading (in spite of its length). One of the interesting quotes in this text is: "if they aren't median people, it's a rational choice for founders to start [startup companies]".

As I work for a big (or rather, huge) company, I always think about the differences between the people working at both types of companies. I don't think there's a definitive set of personality traits that defines whether you should work at one or the other, especially as you see people working at one type of company at some point in their lives and then moving to a different type.

However, Paul's quote on "median people" is interesting. I think it is along the lines of my post on Working at Big Software Companies. Due to their size, big companies must have rules designed for the average professional, and try to minimize the damage to the outliers (either top performers or underperformers). But there's no magical silver bullet that works perfectly, so there is always some collateral damage.

One example of collateral damage. Let's assume that you have a few top performers, who are capable of doing the majority of work well and faster than others. And then you have many average or good (but not really top) performers on the team. What are the criteria to divide the work? If you're focusing on efficiency and getting the work done more quickly and better, you would probably assign most of the items to the top performers, or at least the critical ones, and have others taking care of the less important stuff. That would guarantee efficiency, but probably you're risking having a lower team morale as most of your people will not be working on the "meaty" or interesting stuff.

In big organizational structures, more often than not, as a manager you're not being measured by the most efficient output, but by having MORE output than your peers. Another point is that, as a manager, you will be measured by what the team thinks of you (MS Poll is an example of that at Microsoft). Given all that, if you focus on your career, you're better off not delivering the most you can - by purposefully not assigning features to top performers and/or slowing them down (*) - and guaranteeing high team morale among the average professionals, rather than guaranteeing that the top performers are indeed happy and that you're really very effective.

Actually, it is interesting to see that the typical peer comparison in performance evaluations can hurt the overall company's performance. Of course, one could argue that if everybody is producing like crazy, this system would force all to catch up and keep up with others' performance. But if the company is profitable, there is not enough difference in incentive between producing slightly better than peers and producing to your potential, and, on top of that, producing to your potential may bring more liability and risks that will be detrimental in that culture, then I'm a firm believer that the rule will be to be (or at least look) better than peers, but not to the extent that it can cause risks.

Risk is another interesting aspect of the big company mindset. I've seen more average people being promoted and rewarded for not "getting themselves into trouble" than top performers who were bold and had the courage to take risks, whatever they were (like trying to shake the status quo, trying to get the most efficiency out of the team, attempting something new). It's just like any contest one sees on TV: the ones that keep moving forward and sometimes even win are typically the ones that are not bottom performers and do not make huge mistakes that put them on the spot. If you keep going with an average result (better than someone else's) and do not take many risks (incl. pissing others off), there's a better chance of moving forward in your career.

The biggest perk for those who stick around is that, over time, you're very unlikely to be fired from these companies - it's possible, but unlikely unless you screw things up very badly. So, once you reach a good salary and all the benefits that come with tenure, it's very hard to jump ship and try something new. That is particularly true for those interested in focusing on other aspects of their life, as over time you get comfortable in the position, without requiring a lot of effort on your part to keep going.

With this mindset, it's not surprising that the innovator's dilemma exists and that, unfortunately, most of the innovation will not come from these big companies. Actually, there will be innovation there, don't get me wrong. But it's the kind of innovation that requires an enormous structure or ecosystem in place, where anyone new to the area can't afford such a large investment or tap into any potential returns. Also, the big companies will not be the most efficient, unless there's really a market force driving that. Their survival is not a matter of really being efficient, but just of being more efficient than competitors - if these can't be bought out or driven out of the market somehow.

(*) On slowing down top performers: I've seen that happening and couldn't believe my eyes. The same is reported in the book "The Peter Principle". Essentially, from a management perspective, having someone "too good" makes others feel bad, so one needs to manage these professionals and, oftentimes, it's better to get rid of them. Yes, it seems absurd, but it happens.

Tuesday, October 02, 2012

Performance Review

It's not a surprise that Microsoft's performance review has been so debated in the past few years, and even more so recently after Vanity Fair's article on "Microsoft's Downfall". If you haven't heard of how the performance review works at Microsoft, I'd suggest you read this blog post and this article:

Microsoft Stack Ranking is not Good Management
Microsoft’s Downfall: Inside the Executive E-mails and Cannibalistic Culture That Felled a Tech Giant

However, what surprised me was coming across some posts on how Google does its performance review. From recruiters, I knew all about the wonders of peer feedback at Google and how things are fairer there.

I did not know, though, the other side of the coin. Read it here:

What are the major deficiencies of the performance review process at Google?
Promotion Systems
Promotions System Redux

Essentially, the problems with performance review are everywhere at big companies. I wonder how well it works at Valve, as it has a non-traditional organization structure.

Monday, October 01, 2012

BTree Library for .NET

As I was reading about MongoDB deficiencies, one of the problems I learned about was the lack of counted BTrees. As I wanted to learn more about BTrees, I coded up a basic implementation of BTrees in C# and posted it to GitHub, which allowed me to learn more about Git as well.

This is the GitHub repository where you can find the initial BTree implementation (in-memory implementation only for now):

Doesn't a developer spend most time coding?

This blog post on the lesser-known truths about programming is great, and I definitely recommend reading it in its entirety. Here, however, I want to touch on this point:
Averaging over the lifetime of the project, a programmer spends about 10-20% of his time writing code, and most programmers write about 10-12 lines of code per day that goes into the final product, regardless of their skill level.
I completely agree that most of my time is not spent writing code - and it never has been, irrespective of whether I was considered junior or senior in a project. My experience perfectly matches that of the author of the post mentioned above, as well as that of other people on the web (and apparently what was written in The Mythical Man-Month).


This is something that nobody tells you at college: you will not be coding as much as you may think. As a matter of fact, in big corps, you will probably be communicating and handling other things much more than working on the technical things. If you're operating a service, like I mentioned here, then it's possible that you'll be coding even less, and potentially debugging and trying to diagnose issues even more than coding.


Another point is how the technical side is the focus of interviews, although it can account for the minority of your time on the job. Soft skills, communication, dealing with ambiguity, etc., are often neglected and can be very important for success at work as well.


Actually, what is important for a professional to be successful? This TED Talk gives us good tips. See this quote:
Embedded within that question is the key to understanding the science of happiness. Because what that question assumes is that our external world is predictive of our happiness levels, when in reality, if I know everything about your external world, I can only predict 10 percent of your long-term happiness. 90 percent of your long-term happiness is predicted not by the external world, but by the way your brain processes the world. And if we change it, if we change our formula for happiness and success, what we can do is change the way that we can then affect reality. What we found is that only 25 percent of job successes are predicted by I.Q. 75 percent of job successes are predicted by your optimism levels, your social support and your ability to see stress as a challenge instead of as a threat.
These are things that we don't look at when hiring someone. Coupled with the fact that you don't spend much of your time doing the core technical work anyway, I don't think that our interview techniques are actually that effective at predicting an individual's longevity (and success (*)) at the company.

(*) What is success is probably the topic for another post :-)

The virtuous cycle of being on-call

… or how you can make on-call for service providers a virtuous cycle.

In the tech world, for everything that is running as a service or website 24/7/365, there must be someone available to take care of any issues that arise. It’s quite common to have someone in the organization monitoring the service (or website) and acting in case of issues. Some orgs have an operations team on the frontline, others have developers. Even if the operations team exists, someone from the engineering team who develops and/or maintains that code must be available if there’s an issue that requires further investigation. All of these people who, either at work or at home, are available through cell phone, pager, or email are what is called on-call.
In my opinion, on-call can be a virtuous cycle that improves the code, provides good customer service, and over time tends to decrease the issues. BUT the problem is that the environment and the on-call are very rarely set up for this virtuous cycle to happen.

When are you NOT setting your company up for a successful on-call?

- Customer service is not the priority: the primary goal of being on-call is to detect issues before customers do and prevent customers from hitting them. The secondary goal is, if an impact is inevitable, to reduce the impact on the customer. This must be the direction given to all engineers on-call, and must be valued by management through rewards, by being a top item on the list for performance, by factoring time for this work into schedules/estimates, among others. But this is not all: in order to be really customer-centric, the right investment in monitoring must be made, the lessons must be applied in order to improve the process, tools must be written to speed up the process, and tests for disasters or failures must be conducted. When on-call does not have this focus, but is essentially about meeting metrics (time the issue was in my court), or just about getting rid of the issue and going back to the task that has higher priority or rewards, for instance, then this on-call is not virtuous and tends to provide poor customer service.

- Monitoring is not sufficient: if the company is really customer-centric, monitoring in the code must be all over the place. Monitoring must be sensitive to minor changes before the customer is impacted and must be very proactive in alerting the team. Not having monitoring in the system may sound good (no news is good news), but you can essentially be providing a very poor service to the customer, as you will only fix some of the issues IF and WHEN the customer reports them. Monitoring cannot be an afterthought; monitoring cannot have second priority. At least not if you plan to be serious about providing a service.

- Wrong people on-call: oftentimes, to meet the obligation of being on-call and to reduce the load, anyone on the team is added to the on-call list, even if the person is not familiar with the system. Doing that, although it can look good at first as knowledgeable people don’t need to be formally on-call, is a really bad idea in my opinion. First, the knowledgeable person will very likely be engaged anyway. If that person is not available, you risk having an outage or a customer impact. If that person is available, you have essentially delayed the resolution by having a middleman engineer whose role is just to engage others. This is not virtuous, as it doesn’t solve the problem of reducing the load, and it also adds load on whoever is on-call and decreases the customer service quality.

- On-call people are not invested in really fixing the issues and improving the service: essentially, if on-call people do not have a feeling of ownership for what they are maintaining and being woken up for, this is not a virtuous cycle. It’s not virtuous because they will not be invested in doing the proper investigation, fixing the issue, or improving the process – after all, they don’t feel that they are owners, and being on-call can be just an obligation, a nuisance. To be virtuous, the right thing is to get engineers to own the service, to be able to make the decisions (and be held accountable for them), to have the desire to improve the service. That is one of the pre-requisites for people to be on-call and actually see the importance of that activity.

- Many of the issues are due to other teams/orgs at the company: having people invested in the service can only do so much if the interdependencies make them suffer the consequences of other teams’ issues. This is not a problem as long as management makes the right investment to fix the underlying services that are causing issues. Otherwise, the sense of ownership may not be sufficient, as the on-call engineer will be paying the price for something that s/he’s not responsible for, and the cycle is no longer virtuous.

- Too many issues: for any service that is very popular and/or not yet mature, besides having the right people on-call, these people must be able to handle the issues in a reasonable fashion and still be able to sleep and have the basic needs met. Once you’re over this threshold and people have to live for the issues, as they are getting out of hand, something is amiss and needs to be fixed. If the investments are made, this can be fixed and does not impact the virtuous cycle, otherwise it can show that on-call and customer service is not the priority for the company, generating discomfort and dissatisfaction on the side of those that have to do it.

- The feeling is just to get rid of or blame someone else: still related to other items, if the goal is to meet some metrics and just get rid of the issue or blame someone else, this cycle is most definitely not virtuous, as it does not bring the benefits of a great customer service and of improving the process.

I have felt good being on-call given the right circumstances, and being on-call taught me so much about how to engineer a system to be run as a service, and improved and matured the systems I was on-call for. It’s a matter of not doing the things above, showing respect for those professionals that are on-call, and getting them to be invested in the service through ownership. Essentially, make "having your skin in the game" something that makes sense to the engineer. It’s rare to find these things, but if you do them, you can rest assured that you have started your virtuous cycle.

What do you think? Do you have any other tips or had a different experience being on-call?

Thursday, September 27, 2012

MongoDB: Replication Lag, Network Partition

Replication: when everything is working, it seems the easiest thing, but the challenge comes when we have failures. This is something that MongoDB, being a database that has been out for some years, shows us very well.

If you look at MongoDB's replication page, there is a section called "Rollback". This applies when there is replication lag - your secondary is behind the primary - in conjunction with a network partition. It may not look like it, but this scenario is not that uncommon.

What happens then? Since the old primary is ahead of the secondary, if it wants to rejoin the cluster, it needs to somehow "resync" to the current state. This means that operations applied previously but not replicated (due to the lag) will need to be rolled back. First, this process in MongoDB is manual.

But what happens if your secondary was behind by a large amount of data - more precisely, 300 MB or more? MongoDB does not have the capability of rolling back, and manual intervention is required to recover that node.

Read for yourself:
Warning: A mongod instance will not rollback more than 300 megabytes of data. If your system needs to rollback more than 300 MB, you will need to manually intervene to recover this data.
Although it supports asynchronous replication (the eventual consistency), it does not solve these resynchronization problems automatically. Maybe it was just deemed as not very common and low priority compared to other features, but can you imagine the pain if you need to go through this process?


Redis as Memcached Replacement?

I found Redis to be a very interesting alternative to Memcached for use as a caching layer:

  • Pretty much set/get operations on key/value, just like memcached, but with rich data structures, like Lists, Sets, Sorted Sets, Hashes, Blocking Queues. And rich commands to use all of them. That avoids serializing a bunch of data into Memcached values.
  • Durability options: Redis can be used just in memory, like memcached, or can be tuned to be durable (flushing data every so often is an option, or appending data to a log file that can be replayed if the node restarts, among others).
  • Master-slave: it can be configured as master-slave, with the slaves being your "read replicas".
  • Cluster: just like Memcached, it can be sharded using consistent hashing, but that's taken care of on the client side - which can be a problem if used against cloud instances that are more likely to fail (so one needs to be prepared to update the instance list on the client in case of failures).
  • Security: pretty much non-existent, or at least not built in, like Memcached. You need to resort to other security mechanisms (to be honest, Memcached had some SASL support, but I don't know the current state of that).
  • Not to mention the API, which "[..] is not simply easy to use, but a joy."
  • Lua support on the server side
  • Good client library support, as well. One good example is the library for a Bloom Filter on top of Redis.
  • Publish/Subscribe support: very cool publish/subscribe support on specific channels, pattern matching for both channel and message subscription, etc.
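
To show what "sharding on the client side with consistent hashing" from the Cluster item above can look like, here is a small Python sketch of my own. It is not the scheme of any particular Redis or Memcached client library; the `HashRing` class, the node names, and the use of MD5 with 100 virtual points per node are all assumptions made for the illustration.

```python
import bisect
import hashlib


class HashRing:
    """Client-side consistent hashing with virtual nodes."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [point for point in self._ring if point[1] != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("no nodes in the ring")
        # First point clockwise from the key's hash, wrapping around.
        index = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}
ring.remove("cache-b")  # e.g. the cloud instance failed
moved = [key for key in keys if ring.node_for(key) != before[key]]
print(len(moved), "of", len(keys), "keys moved")
```

Only the keys that lived on the removed instance get remapped, which is the whole point. The catch mentioned above still applies: every client has to learn about the failure and update its own ring.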

So the question that remains is: why does Memcached seem to be that much more popular?

  • LRU: one of the questions I've seen is related to the lack of LRU support, but this has been added to Redis
  • Performance: I did not find more recent numbers, but some numbers from 2010 show Redis being outperformed by Memcached. That could be a good reason not to use it.

I wonder if Redis performance has improved lately and if any other reason not to use Redis comes up - please comment if you know other reasons to avoid Redis.

Redis API

This is an interesting quote from Seven Databases in Seven Weeks about Redis:

"It's not simply easy to use: it's a joy. If an API is UX for programmers, then Redis should be in the Museum of Modern Art alongside the Mac Cube."

I have to say that I'm looking forward to learning more about it after reading this :-)

Tuesday, September 25, 2012

CouchDB and Conflict Resolution

Learning CouchDB showed me that it is different from the other NoSQL DBs I had studied so far with respect to sharding and replication:

  • It does not have sharding, but rather each server has all the data.
  • It has multiple-master (or active-active) configuration, where you can write to any server

The interesting part about the active-active is how conflicts are resolved. Rather than just showing that there's a conflict, all CouchDB servers have a consistent view of the world, even after conflicts. How does it do it?

  • Revision numbers: first, revision numbers start with the change number of the record. Example: 2-de0ea16f8621cbac506d23a0fbbde08a. Therefore, the latest change to any record wins (change 2-de0ea16f8621cbac506d23a0fbbde08a wins over 1-7c971bb974251ae8541b8fe045964219).
  • MD5: after the change number, there is an MD5 hash of the change. In case concurrent changes to the record occur across two different servers, there's an ASCII comparison to determine which one wins (change 2-de0ea16f8621cbac506d23a0fbbde08a wins over 2-7c971bb974251ae8541b8fe045964219). Of course there's no concept of time here to determine which change happened first, but at least each server is able to pick a winning version deterministically. Using an MD5 hash of the document can also be quite helpful to avoid replicating the same change (if the same change was made to different CouchDB databases).
  • Conflict List: besides automatically picking a winning version, any client can know when there were conflicts and do something about them - either merging them or picking a different version. That can be done, for instance, using a view that outputs conflicts and subscribing to it through the Changes API. I understand that this is optional and doesn't stop the system in case of conflicts.
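
The winner selection described above can be sketched in a few lines of Python. This is only a model of the rule as stated in this post (change number first, then an ASCII comparison of the hash); the function names `parse_rev` and `pick_winner` are made up for illustration, and CouchDB's actual rule may take more into account than this.

```python
def parse_rev(rev):
    """Split a revision id like '2-de0e...' into (change number, hash)."""
    number, _, digest = rev.partition("-")
    return int(number), digest


def pick_winner(revs):
    """Pick the same winner on every server: highest change number first,
    then the greater hash by plain string (ASCII) comparison."""
    return max(revs, key=parse_rev)


revs = [
    "1-7c971bb974251ae8541b8fe045964219",
    "2-7c971bb974251ae8541b8fe045964219",
    "2-de0ea16f8621cbac506d23a0fbbde08a",
]
winner = pick_winner(revs)

# Revisions with the same change number as the winner are the conflicts
# a client may still want to inspect and merge.
conflicts = sorted(
    r for r in revs
    if r != winner and parse_rev(r)[0] == parse_rev(winner)[0]
)
print(winner, conflicts)
```

Because the comparison depends only on the revision ids, every server that has seen the same set of revisions picks the same winner, regardless of the order in which the revisions arrived.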

For systems where concurrent updates are not common, this is definitely a valid and good approach. Also, there is conflict avoidance through ETags (just like Windows Azure Storage), which goes a long way toward avoiding conflicts.
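
To make the ETag idea concrete, here is a minimal in-memory sketch in Python. The `Store` class, its `put`/`get` methods, the `if_match` parameter, and the counter-based ETag are all assumptions for illustration; they are not CouchDB's or Azure Storage's API. The point is only that a write carrying a stale ETag is rejected instead of silently creating a conflict.

```python
class ConflictError(Exception):
    """Raised when the caller's ETag no longer matches the stored one."""


class Store:
    """Toy in-memory document store with ETag-checked writes."""

    def __init__(self):
        self._docs = {}  # key -> (etag, body)

    def get(self, key):
        return self._docs[key]

    def put(self, key, body, if_match=None):
        current_etag = self._docs[key][0] if key in self._docs else None
        # The write is accepted only if the caller saw the latest version.
        if if_match != current_etag:
            raise ConflictError(
                "expected %r, found %r" % (if_match, current_etag))
        # Hypothetical ETag scheme: a simple per-document counter.
        etag = str(int(current_etag) + 1) if current_etag else "1"
        self._docs[key] = (etag, body)
        return etag


store = Store()
first = store.put("doc", "v1")
second = store.put("doc", "v2", if_match=first)
try:
    store.put("doc", "v2-stale", if_match=first)  # stale ETag
    stale_rejected = False
except ConflictError:
    stale_rejected = True
```

The rejected client has to re-read the document and retry, which is what keeps concurrent writers on a single server from overwriting each other.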

My concern with CouchDB and active-active is a prolonged split-brain situation, where two or more databases are taking writes (potentially to the same records). Clients can see inconsistent results if they are hitting these different deployments.

Going a bit further on the split-brain scenario, if these databases are used by other downstream components in your system, the same conflict resolution problem may occur on their side, as they could be consuming records from these different databases (which are not resolving conflicts for some time). Reasoning about all these scenarios can be challenging.

The interesting thing about this is that, if there is indeed a split-brain but clients can talk to all deployments, CouchDB could take the write and return an error when replication is currently not working. Clients could then have embedded logic to connect to the other CouchDB deployments and do some of this replication themselves, which would keep all the deployments in a sort of consistent state.

Some references:

Monday, September 17, 2012

Service Failures, Disaster Recovery

ACM Queue has a couple of must-read posts on service failures - these are just required practices if one wants to do it responsibly:

  • Weathering the Unexpected
    “Google runs an annual, company-wide, multi-day Disaster Recovery Testing event—DiRT—the objective of which is to ensure that Google's services and internal business operations continue to run following a disaster. […] DiRT was developed to find vulnerabilities in critical systems and business processes by intentionally causing failures in them, and to fix them before such failures happen in an uncontrolled manner. DiRT tests both Google's technical robustness, by breaking live systems, and our operational resilience by explicitly preventing critical personnel, area experts, and leaders from participating.”
  • Fault Injection in Production
    “fault injection exercises sometimes referred to as GameDay. The goal is to make these faults happen in production in order to anticipate similar behaviors in the future, understand the effects of failures on the underlying systems, and ultimately gain insight into the risks they pose to the business.”
    “treating the fault-toleration and graceful degradation mechanisms as features. [...] Just like every other feature of the application, it's not finished until you've deployed it to production and have verified that it's working correctly.”

Sunday, September 16, 2012

Spanner: Google's Globally-Distributed Database

Today I read this recent paper by Google:

Spanner: Google's Globally-Distributed Database

As a globally-distributed database, Spanner provides several interesting features.
  • First, the replication configurations for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance).
  • Second, Spanner has two features that are difficult to implement in a distributed database: it provides externally consistent [16] reads and writes, and globally-consistent reads across the database at a timestamp. These features enable Spanner to support consistent backups, consistent MapReduce executions [12], and atomic schema updates, all at global scale, and even in the presence of ongoing transactions.

The very interesting part is that it provides semi-relational tables (which some other NoSQL systems also do), a SQL-like query language, and synchronous replication – you can see that many applications within Google use non-optimal data stores just to have synchronous replication.

Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general-purpose transactions. The move towards supporting these features was driven by many factors.
  • The need to support schematized semi-relational tables and synchronous replication is supported by the popularity of Megastore [5]. At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable’s, and because of its support for synchronous replication across datacenters. (Bigtable only supports eventually-consistent replication across datacenters.)
  • The need to support a SQL-like query language in Spanner was also clear, given the popularity of Dremel [28] as an interactive data-analysis tool.
  • Finally, the lack of cross-row transactions in Bigtable led to frequent complaints; Percolator [32] was in part built to address this failing. Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings [9, 10, 19]. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.

Essentially, they accomplish this by still having several versions of each record (like Bigtable, but with timestamps assigned by the server) and guaranteeing timestamps to be within an error margin by using multiple time sources like GPS and atomic clocks. Then there are Paxos groups, of course, and two-phase commits for the transactions.
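
To illustrate why keeping several timestamped versions of each record helps with consistent reads "at a timestamp", here is a toy Python sketch. It is not Spanner's implementation: the class `MultiVersionStore` and its methods are hypothetical, and timestamps are plain integers passed in by the caller, whereas Spanner assigns them on the server with bounded clock uncertainty (not modeled here).

```python
import bisect


class MultiVersionStore:
    """Toy multi-version store: every write keeps its timestamp."""

    def __init__(self):
        self._versions = {}  # key -> (sorted timestamps, matching values)

    def write(self, key, timestamp, value):
        stamps, values = self._versions.setdefault(key, ([], []))
        i = bisect.bisect_right(stamps, timestamp)
        stamps.insert(i, timestamp)
        values.insert(i, value)

    def read_at(self, key, timestamp):
        """Return the newest value written at or before `timestamp`."""
        stamps, values = self._versions.get(key, ([], []))
        i = bisect.bisect_right(stamps, timestamp)
        return values[i - 1] if i else None


store = MultiVersionStore()
store.write("a", 10, "a@10")
store.write("b", 12, "b@12")
store.write("a", 20, "a@20")

# A snapshot at timestamp 15 ignores the later write to "a".
snapshot_15 = {key: store.read_at(key, 15) for key in ("a", "b")}
```

A backup or a MapReduce job that reads everything at one timestamp sees a consistent snapshot even while newer versions keep being written.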

I found this paper very interesting as I've been thinking about some of the challenges of implementing these features (like cross-DC replication and backup) on NoSQL data stores. Quite an accomplishment by this Google team.

Saturday, September 15, 2012

Wednesday, September 12, 2012

MongoDB, Maturity, and "let's implement ourselves"

One common theme I've seen in the past few years is developers downplaying products and features that typically take a long time to implement and to get to a mature state, like data stores and data replication. Even when the discussion ends with the answer that "our problem is much more limited", "it can't be that difficult", or the plain "we're smarter than they are", I haven't really been convinced that the advocates of implementing their own feature (you name it) knew the consequences of moving forward with it.

As an example of how complex some things are, let's take a look at MongoDB. Here you can find a couple of good posts on the topic of maturity:

How much credibility does the post "Don't use MongoDB" have?
Which companies have moved away from MongoDB and why?
Goodbye MongoDB
Don't use MongoDB

10gen is the company behind MongoDB and we're talking about tens of people working on this system for years - and it still has its issues. This is just normal. I once heard that it takes around 10 years to have a mature DB system - yes, 10 years.

Above you can see that replication is one of the common issues. If you go ahead and download the MongoDB source code, you will see that the replication code is under 7,000 lines. To repeat the common question, "How hard can it be?": how hard can it be to get 7,000 lines of code right? :-)

A long time back, when I worked on the Linux kernel, my changes amounted to around 5,000 lines of code. Those were the hardest 5,000 lines of code I ever wrote - hardest to get to work, and to get to some stability. I was able to make good progress with a lot of people trying it out and providing feedback, but it was still far from a stable patch that anyone in the world could use. And that was code for a non-distributed system, without all the complexities of remote failures, network partitions, etc.

Charlie Kindel on Microsoft's Performance Review

Having gone through my own performance review recently, I couldn't help but reread Charlie's post on performance review to see the perspective of a former executive on the review ratings. Highly recommended.

Got a 4? You Were Just Fired from Microsoft

On the rating scale, this is what Charlie says:

1 You walk on water. We love you. We want to love you long time. Here’s an absurd amount of stock, an up-to 100% bonus, and a huge pat on the back. If we’re not promoting you right now, it’s probably because we just promoted you; we’ll get to it again RSN.

2 Stud/Studette! You are awesome in our eyes and we wish everyone else was just like you. You also get a nice healthy stock grant, a bonus that will make your spouse grin ear-to-ear, and a raise. Please, please, please continue what you’ve been doing.

3 Thank you. You did a fine job. We’d love to keep you around as you are clearly an asset to the company. Keep up the great work! Oh, and if you just do a bit more you could get a “2” next time. But also be careful because you were also close to getting a “4”.

4 I know it stings to hear this, but we are telling you Microsoft really doesn’t care about you anymore and would be just as happy having you work somewhere else. Yea, yea, you might still get a bonus, and there might be small raise involved, but promotion? Forget about it. And stock? Why would we give stock to someone who’s likely not going to be here next year? Ok, ok, maybe you can climb out of this hole over the next year. If you want to try, we’ll let you.

5 Don’t let the door hit you on the butt on the way out. We’d fire you, but that’s just so messy. It’s far easier for us to make you feel unwanted and hope you leave.

Dr. Dobb's on Unit Tests

I've had many posts on unit tests and here goes still another one. I can't count how many times I've heard that unit tests are a waste of time. Managers, pressed for a release, are "ok" if you check in without unit tests. Unit tests, when there's pressure, become "optional" and, compared to all the features you could be implementing instead, go to the bottom of your priority list. Yes, these are the chats I have - and no matter how much one tries, it's almost impossible to convince people otherwise.

Dr. Dobb's published a short piece on unit tests called "Unit Testing: Is there really any debate any longer?", which is great and reflects precisely my experience. Still another shot at convincing the opponents :-)

Saturday, September 01, 2012

Craftsmen: Jiro Dreams of Sushi

A few months back, the shuttle driver for my car dealership recommended a bunch of good movies and documentaries he had watched at the Seattle International Film Festival. I always have a great time when he drives me, given his huge passion for movies and his kindness in spreading the word about good films to all customers.

One of the recommended movies was: Jiro Dreams Of Sushi. Luckily for me, Netflix streams it, so today I watched it.

This is a beautiful movie, with a number of great lessons by Jiro and his family. He is a craftsman, dedicating his life to his craft and an insatiable will to improve on his craft. After 70 years on the job, he was awarded 3 Michelin stars and after 75 years on the job, a documentary on his life was released to the world.

Although it's debatable whether this is a balanced life, it's still absolutely fascinating to see such passion for his craft and the discipline to keep going, irrespective of his age and life's other challenges. No matter the price he charges for his sushi, we are the ones who benefit most from having this shokunin around to keep amazing us. As mentioned in the documentary, a miracle every night.

Take the time and watch this documentary when you have an opportunity.

Task ContinueWhenAll and Failures

Lately I've been using Task.Factory.ContinueWhenAll in my TPL programming, but I wasn't quite sure how it handles errors or what the cleanest way is to handle errors coming from inner tasks.

Take a look at the code below for an example. What do you expect the result to be?
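using System;
using System.Threading.Tasks;

Task innerTask1 = Task.Factory.StartNew(() => { throw new Exception("Error1"); });
Task innerTask2 = Task.Factory.StartNew(() => { throw new Exception("Error2"); });

// The continuation only prints each inner task's status; it never observes their exceptions.
var task = Task.Factory.ContinueWhenAll(
    new[] { innerTask1, innerTask2 },
    innerTasks =>
    {
        foreach (var innerTask in innerTasks)
            Console.WriteLine("Result: {0}", innerTask.Status);
    });
// Runs only if the outer task faults; otherwise this continuation is canceled.
var exceptionTask = task.ContinueWith(
    t => Console.Error.WriteLine("Outer Task exception: {0}", t.Exception),
    TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously);

try
{
    // Assumed body (missing from this listing): wait until both continuations have settled.
    exceptionTask.Wait();
}
catch (AggregateException)
{
    // Waiting on a canceled continuation throws; ignore it and print the statuses below.
}

Console.WriteLine("Done - Outer Task Status: {0}, Exception Task: {1}", task.Status, exceptionTask.Status);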
After running this code, you will see:
Result: Faulted
Result: Faulted
Done - Outer Task Status: RanToCompletion, Exception Task: Canceled
So both inner tasks finished with status "Faulted", as indicated in the "Result:" lines, but the outer task finishes as "RanToCompletion" (i.e. successful). At some later point the unobserved exceptions will be thrown when the inner tasks are finalized, which is far from ideal if you want to have a good understanding of this code and be able to debug any issues in the future.

To get the "task" up there to fail properly when any of the inner tasks fails, the exceptions must be observed when the ContinueWhenAll continuation runs. The most obvious way is to check all innerTasks and, if any of them failed, throw its exception.

Today I came across a good post on that. The suggestion is to make the following change to the code:
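using System;
using System.Threading.Tasks;

Task innerTask1 = Task.Factory.StartNew(() => { throw new Exception("Error1"); });
Task innerTask2 = Task.Factory.StartNew(() => { throw new Exception("Error2"); });

var task = Task.Factory.ContinueWhenAll(
    new[] { innerTask1, innerTask2 },
    innerTasks =>
    {
        // Observes the inner exceptions: throws AggregateException if any inner task
        // faulted, which in turn faults this continuation.
        Task.WaitAll(innerTasks);

        foreach (var innerTask in innerTasks)
            Console.WriteLine("Result: {0}", innerTask.Status);
    });
// Runs only if the outer task faults, which is now the case.
var exceptionTask = task.ContinueWith(
    t => Console.Error.WriteLine("Outer Task exception: {0}", t.Exception),
    TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously);

try
{
    // Assumed body (missing from this listing): wait until both continuations have settled.
    exceptionTask.Wait();
}
catch (AggregateException)
{
    // Not expected here, since exceptionTask now runs to completion.
}

Console.WriteLine("Done - Outer Task Status: {0}, Exception Task: {1}", task.Status, exceptionTask.Status);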
The difference is the "Task.WaitAll" - it makes the exceptions observed and makes the ContinueWhenAll task fail, thus making the code much more predictable and understandable in case of failures.

This is the result as compared to the output above (just the "Done" line, as I don't want to paste the output of the WriteLine that prints the exception here):
Done - Outer Task Status: Faulted, Exception Task: RanToCompletion
So the tip is: for all ContinueWhenAll calls, add Task.WaitAll. If you have an empty continuation ("_ => {}" or any variation of that), just replace it with "Task.WaitAll" - and here it can be a method group, which is even cleaner.

Sunday, August 19, 2012

Dealing with Criticism

This post is about tips on how to criticize, how to raise concerns, and how to deal with critics in the software industry.

One thing I've learned over the years working in software development is that, no matter what one does, there will be criticism about your work. The more visible and high profile the work is, the more it will be criticized. Not being a target of criticism would simply mean that the work satisfies everyone involved, which is almost never the case. The reason is that there are always different approaches to the problem, different ways to document and/or present the problem and solution, among others. A person reading your work could be highly logical and prefer a very logical/mathematical approach, another may prefer a higher-level approach, still another would have liked to see a more business-oriented perspective, and a few more would have wanted more low-level technical details than you presented. I personally haven't found a way to satisfy all types of audience - if it is possible at all, it will be extremely hard and time-consuming, and it is hard to see the investment being worth the time.

Given that, it is no wonder that criticism will arise among people who come into contact with your work. For very personal (and sometimes petty) reasons or for good business reasons, there will be criticism. In the software industry, in particular, this is bound to happen frequently as engineers are smart and opinionated individuals. The issue becomes more severe as we're in a highly creative field where there is typically no single solution to the problem at hand.

Being Critical

There are many reasons to be critical, and for everyone interested in growing as a person and professional, this is an area that provides tremendous opportunity for growth if you have the right intent and learn how to provide feedback in the right way.

The most important lesson is: before being critical, consider that there could have been reasons to make the decision in that way. And if that's not the case, it just means that the person does not have your experience and knowledge, which is a great opportunity for you to influence that work.

Author does not have your experience or knowledge

One thing is certain: no two people have the same experience or knowledge. On a particular topic, it's possible that one has more expertise or knowledge, but overall there is always something to be learned from others.

In case the author of a work does not have your experience on that topic, this is a tremendous opportunity for you to teach and help a fellow coworker. The goal is not to make your coworker feel bad, but to walk him/her through what you've learned and see the problem through his/her own eyes. It's not about forcing your suggestion on how to do things, but being patient to get the person to see the potential problem you see and its potential consequences. If the final solution addresses the problems or consequences, that is what matters.

For the inexperienced author, if your experience indeed applies to the problem, either the person will see it or will take the risk of running into the same problem. If the latter, there are two courses of action. First, if it has a great potential to impact the business and the person refuses to see it, it may be worth involving others to see if through consensus the argument becomes more compelling. In some other cases, though, the best course of action is to let the person move ahead and take the risk. Either he/she will learn from it in the future, or something is or will be different from your own past experience and it may not be as bad or impactful to the business as you had pictured. Either way, both will learn from the experience and the organization can benefit from this as it matures.

There was a reason for the decision

Another possibility - often not considered - is that you could be criticizing something without understanding the reason a decision was made that way. Actually, not only the reason: you may not have the entire thought process or context in mind. The author may have considered your point and, among other options and pros/cons, the decision could still have been the best alternative. This is what distinguishes mature engineers (and very often great engineers from the merely good ones): they are willing to be humble at first and potentially learn from others. As time goes by, these engineers are increasingly right most of the time, but there are still points they can learn from to fill the gaps in their knowledge and experience - all of that by just trying to understand others' thought process.

Biases when evaluating someone else's work

As mentioned above, you can have two biases when taking a look at someone else's work: thinking that the author must be more intelligent or more knowledgeable than you, or thinking that the author is stupid and did not see all of your points. Either of these extremes can be a problem for how you criticize someone's work.

Thinking that the author is always more experienced, more knowledgeable or more intelligent can hold you back - you may have good suggestions or good criticisms, but this belief makes you not express yourself and be vocal. On the plus side, though, it may make you learn from every single experience if you know to ask the right questions and understand how the author got to that conclusion.

On the other hand, though, thinking that someone is always less knowledgeable, less experienced, or less intelligent can be a trap. This is very common in companies where people are more aggressive to get things done. In this case, you can end up pushing your view on others, even though they may have valid points and alternatives. Also, it can make you less open to the different perspectives, reducing your possibilities of learning something. Long term, the consequences are that you tend to keep doing things in the same way, less likely to adapt to changes or improve processes.

Ideally, you can hit a sweet spot in the middle of these extremes. First and foremost, always consider the possibility that there was a good reason for something to be done in a given way. And consider your own experience and concerns as potentially valid.

Find what concerns you, not how you'd do it

Problems may have a number of different solutions - in the software industry, this is even truer, as there can be many potential combinations once you dig into the details. When tackling a problem, you will probably have a different approach than someone else. It may be similar, but it probably won't be exactly the same.

Given that there are multiple solutions, the first thing is to respect the creative process that someone else went through. The solution may not be the same as yours, but it can have merits and can take into account more than you're able to see. So do not focus on HOW you would do it, but rather on what IS your concern and what are the undesirable consequences.

Be respectful of others' time and efforts

If you have many concerns and disagreements, have FOCUS and priorities: this is a business environment where you must be efficient, so be mindful of your time and your colleague's. If you have many points to get across, prioritize and pick your battles. Put yourself in the other's shoes and make his/her life easier - while still raising what is important to you.

Get your point across

These are good steps to follow when criticizing in practice:
  • Before criticizing, ask the author why a decision was made in a certain way
    • This can prove to be very revealing and may allow you to understand the context. Your criticism may not make sense anymore after that.
  • If the previous step did not help, first check if the author considered certain scenarios or concerns
    • At that point the author may recall and address your concerns
  • If your concern was not considered, walk the author through
    • A "this is crap" phrase is not the best way to achieve the desired result, but rather try to make the author see what you're seeing. If required, walk him/her through each step of the thought process, how to get that scenario or how that can have an undesirable consequence

At times not even following these steps will actually help you get your point across as not everybody is open or will follow your arguments. In such cases, it may require more data and other measures to be taken offline. In other situations, it can take a long period of time to change the situation as people need to see a pattern of concerns that become reality before starting to give enough attention to you. In general, though, the steps above can help you get your point across in a respectful way, contributing to a healthier work environment.

Learning How to Deal with Criticism

The flip side of the coin is how to deal with criticism. This can be very tough, as not all critics will necessarily be acting out of good intentions. Even if they are, they may not have spent the amount of time you did thinking about the problem. And to make it worse, even if they have, they could simply not have seen all the issues and problems. Finally, there are those who are very loud and can make you look bad, which requires some damage control.

Know why you're doing what you're doing

Before your work is shown to a wider audience, make sure that you have it reviewed in smaller circles to get the most important questions answered. These small circles must be composed of people who provide you with valuable feedback, but are also easy to work with.

Moreover, to make the entire process more efficient, make sure you know why you're doing that work, so you don't get questions later that will throw you off track. For this, go beyond your role and understand the business requirements and technical requirements, and try to understand how the work fits in the overall big picture. That can simplify your life down the road when questions that are related (although not directly applicable) to your work are raised.

Assume good intention

Criticism may arise for a number of reasons, but the healthiest thing to do is to assume all critics have the best of intentions. If you assume they are not acting out of good intentions, you may not get it right (how can you say for sure what the reason for the criticism is?) and, rather than focusing on the problem at hand, you will distract yourself thinking about the reasons, why this is happening, why this is happening to you, etc. Forget the reasons and make the healthy assumption that there's a good intent behind the criticism - that will allow you to focus on the criticism itself.

Accept the fact: there will be critics

I'd say that accepting that there will be critics is one of the most important things in dealing with criticism. The likelihood just increases as you do things that impact more people or are more visible. It's unlikely that you will be criticized if you are doing something very local that nobody cares much about, but it's very likely that you will be if you are changing how the Windows or Mac interfaces look. No matter what you do, there will be critics; it is a fact of life. Not accepting this fact means that, with every criticism, you will replay the debate in your mind that there should be no critics of your work and that your work environment (or the world) is not fair.

Pick which ones to address

Another point is which criticisms to address - depending on the project and impact, it is completely unfeasible to try to convince all critics of your decisions. It is a necessity to be able to pick which critics to focus on and spend your time on these critics and their points. This is another sign of maturity and something one learns as time goes by. In some cases, there will be valid criticisms that can and should be addressed. If many people are involved and they all have different opinions, it can be just unfeasible to address all of them, so decide which criteria to follow, like technical/business expertise or hierarchy within the company.

Depending on when the criticisms take place in the process, my suggestion is to make sure to pay some attention to those who disagree with you, as they may either point out valid concerns or help you prepare better for further justifications down the line if these criticisms come up again.

Know how to defend your points

Over time, all of us forget the reasons decisions were made. We may not even remember the decision later on, and if we do, we tend to forget what led us to that. From that point on, it is very hard to be able to defend your decision as either you will need to go through the entire thought process or simply will not remember.

For future questioning and also to organize your thoughts, write down the reasons decisions were made, from all possible perspectives. Make these answers public, if you like, or write them in a private document you can go back to if you need to in the future.

In addition to that, communicate them in an intelligible and organized way, adapting the answers to each audience. Writing down your decisions and reasons is one way to help you organize your thoughts beforehand, but if you have a different technique, make use of it. Know how to communicate them to the right level of detail and focus depending on the audience: for instance, a business audience has a different focus than a technical audience and isn't commonly interested in low level details.

There will be other ways to do the same

Many critics don't think about what bothers them, but focus on how they would do it, and that is what is thrown at you as a criticism. This is a trap, so don't fall for it. If you do, you will end up trying to point out what the problem is with how they would do it, rather than focusing on your solution (and its problems). Make sure to ask for the clear concerns about your solution and, if the critic can't state them, follow up at some other time, but don't fall for the "why don't you do it like this?" question.

Be mindful of your own time

Make sure to get your critics to focus on the important concerns, especially if you have many critics or many points to address. On many occasions it's impractical to answer all questions from all people, so it's your job to know how to request feedback and not prolong discussions that don't offer much value. Sometimes the best thing is to capture these criticisms in emails or documents and be transparent upfront that they will be addressed if time allows - this gives you the opportunity to do so in the future, without holding your work until every single discussion on why you used this notation vs. that notation is sorted out.

Finally, another suggestion is to know what the agenda for an email thread or meeting is and know how to get back on track when discussions are sidetracked by unfounded criticisms.


I hope these tips can help you deal with this vital and sometimes painful process of criticizing or being criticized. It is definitely more an art than a science and there are no hard rules here.

Additional tips on how to deal with criticism are very welcome. Please share if you had different experiences with critics.