Tuesday, July 31, 2012

Publish/Subscribe Systems

The many faces of publish/subscribe is a great paper if you're interested in distributed systems. It's one of those papers I should have read several years ago, when I started working on messaging brokers, and certainly before my recent work creating a new publish/subscribe system.

This paper offers a lot of insight into the several options you have to "glue" systems together and into the characteristics of publish/subscribe, so you can make a more informed decision when picking an implementation for your problem, or before you think of rolling your own.

Please find below some of my highlights from the paper.


The publish/subscribe interaction scheme is receiving increasing attention and is claimed to provide the loosely coupled form of interaction required in such large scale settings. Subscribers have the ability to express their interest in an event, or a pattern of events, and are subsequently notified of any event, generated by a publisher, which matches their registered interest. An event is asynchronously propagated to all subscribers that registered interest in that given event. The strength of this event-based interaction style lies in the full decoupling in time, space, and synchronization between publishers and subscribers.

The Basic Interaction Scheme

The publish/subscribe interaction paradigm provides subscribers with the ability to express their interest in an event or a pattern of events, in order to be notified subsequently of any event, generated by a publisher, that matches their registered interest. In other terms, producers publish information on a software bus (an event manager) and consumers subscribe to the information they want to receive from that bus. This information is typically denoted by the term event and the act of delivering it by the term notification.

The decoupling that the event service provides between publishers and subscribers can be decomposed along the following three dimensions […]:
  • Space decoupling: The interacting parties do not need to know each other.
  • Time decoupling: The interacting parties do not need to be actively participating in the interaction at the same time.
  • Synchronization decoupling: The production and consumption of events do not happen in the main flow of control of the publishers and subscribers, and do not therefore happen in a synchronous manner.
Decoupling the production and consumption of information increases scalability by removing all explicit dependencies between the interacting participants.
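To make the decoupling dimensions concrete, here is a minimal sketch of a software bus in Python. The names (`EventBus`, `subscribe`, `publish`) are my own, not from the paper: publishers and subscribers only ever know the bus (space decoupling), and a publisher never blocks on its consumers (synchronization decoupling on the producer side).

```python
import queue
from collections import defaultdict

class EventBus:
    """Toy software bus: publishers and subscribers only know the bus."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of queues

    def subscribe(self, topic):
        """Register interest in a topic; returns a queue of notifications."""
        q = queue.Queue()
        self._subscribers[topic].append(q)
        return q

    def publish(self, topic, event):
        """Notify every subscriber registered for the topic."""
        for q in self._subscribers[topic]:
            q.put(event)  # non-blocking: the publisher never waits on consumers

bus = EventBus()
inbox = bus.subscribe("stock/ACME")
bus.publish("stock/ACME", {"price": 42})  # publisher doesn't know who listens
print(inbox.get())                        # consumer pulls when it is ready
```

Note that this toy still lacks full time decoupling (events published before a subscription are simply lost); that is exactly the gap that persistence, discussed later in the paper, fills.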

The Cousins: Alternative Communication Paradigms

  • Message Passing: Message passing represents a low-level form of distributed communication, in which participants communicate by simply sending and receiving messages. […] The producer and the consumer are coupled both in time and space (cf. Figure 3): they must both be active at the same time and the recipient of a message is known to the sender.
  • RPC: One of the most widely used forms of distributed interaction is the remote invocation, an extension of the notion of “operation invocation” to a distributed context. […] RPC differs from publish/subscribe in terms of coupling: the synchronous nature of RPC introduces a strong time, synchronization (on the consumer side1), and also space coupling (since an invoking object holds a remote reference to each of its invokees).
  • Notifications: In order to achieve synchronization decoupling, a synchronous remote invocation is sometimes split into two asynchronous invocations: the first one sent by the client to the server—accompanied by the invocation arguments and a callback reference to the client—and the second one sent by the server to the client to return the reply. This type of interaction—where subscribers register their interest directly with publishers, which manage subscriptions and send events—corresponds to the so-called observer design pattern. […] It is generally implemented using asynchronous invocations in order to enforce synchronization decoupling. Although publishers notify subscribers asynchronously, they both remain coupled in time and in space.
  • Shared Spaces: The distributed shared memory (DSM) paradigm [Li and Hudak 1989; Tam et al. 1990] provides hosts in a distributed system with the view of a common shared space across disjoint address spaces, in which synchronization and communication between participants take place through operations on shared data. […] A tuple space is composed of a collection of ordered tuples, equally accessible to all hosts of a distributed system. Communication between hosts takes place through the insertion/removal of tuples into/from the tuple space. […] The interaction model provides time and space decoupling, in that tuple producers and consumers remain anonymous with respect to each other. The creator of a tuple needs no knowledge about the future use of that tuple or its destination. Unlike the publish/subscribe paradigm, the DSM model does not provide synchronization decoupling because consumers pull new tuples from the space in a synchronous style (Figure 8). This limits the scalability of the model due to the required synchronization between the participants.
    • A similar communication abstraction, called rendezvous, has been introduced in the Internet Indirection Infrastructure (I3) [Stoica et al. 2002]. Instead of explicitly sending a packet to a destination, each packet is associated with an identifier; this identifier is then used by the receiver to obtain delivery of the packet. This level of indirection decouples the act of sending from the act of receiving.
  • Message Queuing: Message queuing and publish/subscribe are tightly intertwined: message queuing systems usually integrate some form of publish/subscribe-like interaction. […] At the interaction level, message queues recall much of tuple spaces: queues can be seen as global spaces, which are fed with messages from producers. From a functional point of view, message queuing systems additionally provide transactional, timing, and ordering guarantees not necessarily considered by tuple spaces. […] In message queuing systems, messages are concurrently pulled by consumers with one-of-n semantics similar to those offered by tuple spaces through the in() operation (Figure 9). This interaction model is often also referred to as point-to-point (PTP) queuing. Which element is retrieved by a consumer is not defined by the element’s structure, but by the order in which the elements are stored in the queue (generally first-in first-out (FIFO) or priority-based order). […] Similarly to tuple spaces, producers and consumers are decoupled in both time and space. As consumers synchronously pull messages, message queues do not provide synchronization decoupling. Some message queuing systems offer limited support for asynchronous message delivery, but these asynchronous mechanisms do not scale well to large populations of consumers because of the additional interactions needed to maintain transactional, timing, and ordering guarantees.
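The one-of-n semantics of point-to-point queuing can be sketched with a shared FIFO queue and competing consumers. This is a toy illustration of the interaction model, not any particular product's API; each message is delivered to exactly one of the n consumers, in the order it was enqueued.

```python
import queue
import threading

work = queue.Queue()      # the shared point-to-point queue
received = []
lock = threading.Lock()

def consumer():
    while True:
        msg = work.get()  # each message goes to exactly ONE consumer
        if msg is None:   # sentinel: shut this consumer down
            break
        with lock:
            received.append(msg)

threads = [threading.Thread(target=consumer) for _ in range(3)]
for t in threads:
    t.start()
for m in range(10):
    work.put(m)           # FIFO order; consumers compete for each element
for _ in threads:
    work.put(None)        # one sentinel per consumer
for t in threads:
    t.join()

print(sorted(received))   # every message was consumed exactly once
```

Contrast this with the publish/subscribe model above, where every interested subscriber receives its own copy of each event.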

The Siblings: Publish/Subscribe Variations

Subscribers are usually interested in particular events or event patterns, and not in all events.
  • Topic-based Publish/Subscribe: The earliest publish/subscribe scheme was based on the notion of topics or subjects […] It extends the notion of channels, used to bundle communicating peers, with methods to characterize and classify event content. […] subscribing to a topic T can be viewed as becoming a member of a group T, and publishing an event on topic T translates accordingly into broadcasting that event among the members of T. […] Every topic is viewed as an event service of its own, identified by a unique name, with an interface offering publish() and subscribe() operations.
  • Content-Based Publish/Subscribe: […] the topic-based publish/subscribe variant represents a static scheme which offers only limited expressiveness. The content-based (or property-based [Rosenblum and Wolf 1997]) publish/subscribe variant improves on topics by introducing a subscription scheme based on the actual content of the considered events. In other terms, events are not classified according to some predefined external criterion (e.g., topic name), but according to the properties of the events themselves.
  • Type-Based Publish/Subscribe: Topics usually regroup events that present commonalities not only in content, but also in structure. This observation has led to the idea of replacing the name-based topic classification model by a scheme that filters events according to their type […].
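The difference between topic-based and content-based subscriptions can be illustrated with a predicate over event properties. This is a sketch with made-up names (`ContentBasedBus` and friends); real systems use a constrained subscription language rather than arbitrary callables, precisely so that subscriptions can be matched and routed efficiently.

```python
class ContentBasedBus:
    """Subscribers register predicates over event content, not topic names."""
    def __init__(self):
        self._subscriptions = []  # (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        self._subscriptions.append((predicate, callback))

    def publish(self, event):
        # Events are matched by their properties, not a predefined topic.
        for predicate, callback in self._subscriptions:
            if predicate(event):
                callback(event)

bus = ContentBasedBus()
hits = []
bus.subscribe(lambda e: e.get("symbol") == "ACME" and e.get("price", 0) > 40,
              hits.append)
bus.publish({"symbol": "ACME", "price": 39})  # filtered out
bus.publish({"symbol": "ACME", "price": 45})  # matches the subscription
print(hits)
```

A topic-based subscription would have delivered both events to anyone subscribed to an "ACME" topic; the content-based predicate is what filters out the first one.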

The Incarnations: Implementation Issues

  • Events
    • Messages
    • Invocations
  • The Media
    • Architectures
      • Centralized Architecture: the role of publish/subscribe systems is to permit the exchange of events between producers and consumers in an asynchronous manner. Asynchrony can be implemented by having producers send messages to a specific entity that stores them, and forwards them to consumers on demand. We call this approach a centralized architecture because of the central entity that stores and forwards messages. […] Applications based on such systems have strong requirements in terms of reliability, data consistency, or transactional support, but do not need a high data throughput. Examples of such applications are electronic commerce or banking applications.
      • Distributed Architecture: Asynchrony can also be implemented by using smart communication primitives that implement store and forward mechanisms both in the producer’s and consumer’s processes, so that communication appears asynchronous and anonymous to the application without the need for an intermediary entity. We call this approach a distributed architecture because there is no central entity in the system. TIBCO Rendezvous [TIBCO 1999] uses a decentralized approach in which no process acts as a bottleneck or a single point of failure. Such architectures are well suited for fast and efficient delivery of transient data, which is required for applications like stock exchange or multimedia broadcasting.
      • Dissemination
        • The actual transmission of data can happen in various ways. In particular, data can be sent using point-to-point communication primitives, or using hardware multicast facilities like IP multicast […]. Centralized approaches like certain message queuing systems are likely to use point-to-point communication primitives between producers/consumers and the centralized broker. […] To ensure high throughput, Internet protocol (IP) multicast or a wide range of reliable multicast protocols

Qualities of Service

  • Persistence: The communicating parties do not control how messages are transmitted and when they are processed. Thus, the messaging system must provide guarantees not only in terms of reliability, but also in terms of durability of the information. It is not sufficient to know that a message has reached the messaging system that sits between the producers and consumers; we must get the guarantee that the message will not be lost upon failure of that messaging system. Persistence is generally present in publish/subscribe systems that have a centralized architecture and store messages until consumers are able to process them. Distributed publish/ subscribe systems do not generally offer persistence since messages are directly sent by the producer to all subscribers. Unless the producer keeps a copy of each message, a faulty subscriber may not be able to get missed messages when recovering. TIBCO Rendezvous [TIBCO 1999] offers a mixed approach, in which a process may listen to specific subjects, store messages on persistent storage, and resend missed messages to recovering subscribers.
  • Priorities: message prioritization is a quality of service offered by some messaging systems. Indeed, it may be desirable to sort the messages waiting to be processed by a consumer in order of priority. […] Priorities should be considered as a best-effort quality of service (unlike persistence).
  • Transactions: Transactions are generally used to group multiple operations in atomic blocks that are either completely executed or not executed at all. In messaging systems, transactions are used to group messages into atomic units: either a complete sequence of messages is sent (received), or none of them is.
  • Reliability: Reliability is an important feature of distributed information systems. It is often necessary to have strong guarantees about the reliable delivery of information to one or several distributed entities. Because of the loose synchronization between producers and consumers of information, implementing reliable event propagation (“guaranteed delivery”) is challenging. […] Centralized publish/subscribe systems generally use reliable point-to-point channels to communicate with publishers and subscribers, and keep copies of events on stable storage. Systems based on an overlay network of distributed event brokers often use reliable protocols to propagate events to all or a subset of the brokers. Protocols based on group communication […] are good candidates as they are resilient to the failure of some of the brokers. […] systems that let publishers and subscriber communicate directly with each other, such as TIBCO Rendezvous [TIBCO 1999], also use lightweight reliable multicast protocols. As events are generally not kept in the system for failed or disconnected (time-decoupled) subscribers, guaranteed delivery must be implemented by deploying dedicated processes that store events and replay them to requesting subscribers.
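Best-effort prioritization, in contrast to strict FIFO delivery, can be sketched with a heap: higher-priority messages jump the queue of messages waiting to be processed. This is a toy illustration (the `PriorityMailbox` name and numeric-priority convention are my own), not how any specific broker implements it.

```python
import heapq
import itertools

class PriorityMailbox:
    """Messages waiting for a consumer, sorted by priority (lower = sooner)."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO within a priority

    def put(self, message, priority=5):
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def get(self):
        priority, _, message = heapq.heappop(self._heap)
        return message

box = PriorityMailbox()
box.put("routine report")                  # default priority
box.put("market crash alert", priority=0)  # jumps ahead of earlier messages
box.put("routine report 2")
print(box.get())  # "market crash alert"
```

The "best-effort" caveat from the paper is visible here: priority only matters for messages that happen to be waiting at the same time, so it gives no hard delivery-order guarantee the way persistence gives a durability guarantee.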

Sunday, July 29, 2012

Make: Electronics - Experiment 17 (Part 2)

This experiment from Make: Electronics is a follow-up to Part 1, where we had 555 timer chips in both monostable and astable modes. This time, we have both 555 chips in astable mode for a similar circuit: the first chip drives an LED (which keeps turning on and off), and the positive voltage going to the LED is connected to the control pin of the second chip, which drives the loudspeaker.

Take a look at the breadboard with both chips:

Saturday, July 28, 2012

Thoughts on Riak

Some thoughts after playing with Riak for a few days:
  • Basic: essentially a key/value store implementing Amazon's Dynamo, where you can decide the level of replication you want on a per-bucket or per-operation basis. The same applies to read operations.

  • Links: typically you don't have relationships between entities in a key/value store, but Riak provides links. One entity can point to another, and this link can be walked. In other words, much of the relationship support that you'd get through foreign key constraints in SQL, and that you'd need to implement yourself in a NoSQL DB, can be done via links in Riak.

  • Still under development: given that the book I am following to learn Riak tested a version from Dec/2011, I can see that many things are still being changed and developed in Riak. This is something to keep in mind, as code that works with one version may not work with a later one, and you'll need to spend time understanding what happened. A good example of that was with "precommit" hooks - I couldn't easily find a good example of how to use them against the current version (thus my post on it).

  • Map Reduce support: although I only tested artificial examples from a book, which may not be very realistic, this is still amazing support. Rather than pulling the data out of the remote nodes to perform the computation, we have the capability to push the code to the Riak nodes and have them perform the computation.

  • Secondary indexes: I've had the experience of working on a NoSQL database that doesn't provide secondary indexes, and this can be a real pain point, so I really appreciate this support in Riak. It is only supported by the LevelDB backend, though, and I am not sure what the performance impact is when one compares primary indexes alone vs. secondary indexes.

  • Precommit/Postcommit hooks: you can set scripts to be run before or after writing to the DB. Whereas you would need to do that by yourself with other DBs, Riak can run your code on the DB server.

  • Search Support: this feature really surprised me. Through a custom precommit script or one of the already available indexer scripts, you can create inverted indexes for your data. This can be set on a per-bucket basis rather than being a global setting. Since Riak integrates Solr, you have all the Lucene + Solr power to perform flexible searches on your inverted indexes.

  • HTTP and Protocol Buffers: I've played with the RESTful API, but Riak also reduces the overhead of these remote calls by supporting Protocol Buffers (Protobuf). HTTP and data serialization can be a performance issue (like in some cloud-based DBs), so this support is definitely welcome.

Overall, I can say that it was really great to learn more about Riak, and I think it's a great NoSQL option to consider.

Update 07/29/2012: I came across this other great post on an actual 1-year experience with Riak. Definitely good follow-up reading if you're interested in more details.

Riak: Updated Examples for "Seven Databases in Seven Weeks"

In my last post, I showed you how to fix an example from the "Seven Databases in Seven Weeks" book to run against Riak 1.1.4. This post shows a few more updates to the book's examples to get them running against this more current version of Riak.

  • Search Precommit: page 87 shows the command to install riak_search_kv_hook as a precommit script. As in the last post, it must have the "language" property to work properly; otherwise the property is not applied - and no error is returned by Riak. This is the command that works against Riak 1.1.4:
    curl -i -X PUT http://localhost:8091/riak/animals -H "Content-Type: application/json" -d '{"props":{"precommit":[{"mod":"riak_search_kv_hook","fun":"precommit","language":"erlang"}]}}'
  • Server port: pages 89 and 90 show commands using "localhost:8098". They should read "localhost:8091". This is actually an erratum.

  • Access secondary index: page 90 shows a command to access a secondary index, but with that command I get the error "Invalid link walk query submitted." The proper command for Riak 1.1.4 is:
    curl http://localhost:8091/buckets/animals/index/mascot_bin/butler

Wednesday, July 25, 2012

Riak: Precommit Hook Example

The book "Seven Databases in Seven Weeks" has an example of how to use a precommit hook in Riak. However, it did not work for me (Riak 1.1.4): I was getting the following message for every single commit:
PUT aborted by pre-commit hook
After enabling debug logging in the app.config (*), I realized that it was failing with the following error:
Problem invoking pre-commit hook: [{<<"lineno">>,1},{<<"message">>,<<"ReferenceError: good_scores is not defined">>},{<<"source">>,<<"unknown">>}]
The problem is that the command to set the precommit hook MUST specify the language; otherwise it doesn't work properly. To get the book's example fixed, do the following:
curl -i -X PUT http://localhost:8091/riak/animals -H "content-type: application/json" -d '{"props":{"precommit":[{"name":"good_score","language":"javascript"}]}}'
Once enabled, I was getting the proper error messages. Most importantly, valid data was being accepted and committed by Riak.

You may be wondering what the JavaScript looks like. Since I can't copy the book's example due to copyright, here is an example of a function that works the same way:
// Makes sure the object has JSON contents
function precommitMustBeJSON(object){
  try {
    // Riak passes the incoming content in object.values[0].data
    JSON.parse(object.values[0].data);
    return object;
  } catch(e) {
    return {"fail":"Object is not JSON"};
  }
}

Unfortunately, Riak does not have a good tutorial or example on this, and it was hard to find the reason for the error. Maybe it'll get better as Riak matures.

One final thing: don't forget that you need to put your function in a .js file in a directory configured in your app.config. For instance, my file is called "my_validators.js" and it's located in the directory I configured in app.config:
{js_source_dir, "/home/rdcastro/riak/js_source"},
(*) To enable debug logging, make sure your "lager" section has the following settings:
            {handlers, [
                {lager_console_backend, debug},
                {lager_file_backend, [
                    {"./log/error.log", error, 10485760, "$D0", 5},
                    {"./log/console.log", debug, 10485760, "$D0", 5}
                ]}
            ]},

Tuesday, July 24, 2012

Make: Electronics - Experiment 17 (Part 1)

This experiment from Make: Electronics is really interesting as it introduces a 555 timer chip in both monostable and astable modes. This is what my breadboard looked like:

My challenge with this experiment, though, is that I had to spend a long time trying to troubleshoot two issues:
  • First, when I powered up the second chip directly to put it into astable mode, sound would not come out of the speaker until a few seconds later. According to the book, it should come out immediately.
  • When chaining the first chip (in monostable mode) to the second one (in astable mode), sound would not come out of the loudspeaker either - while videos on the web showed the speaker working right after the push button for the first circuit was pressed.

SOLUTION: at the end of the day, I learned how to solve the problem through a video on YouTube (linked below). I had to use a ceramic capacitor (I used 100 nF) between pin 3 and the speaker to block DC and pass AC. That fixed the problem right away.

Of course you must be curious as to how it works, so rather than filming my breadboard, I'm embedding a video that shows the same experiment in more detail:

Thanks to Chris, who posted the video above, for helping out with this great tip. I submitted an erratum to the book, so at least this could be mentioned to save others' time.

Monday, July 23, 2012

Understanding Vector Clocks

Riak is one of the databases that use vector clocks for conflict resolution. I came across these two blog posts on Basho.com (Basho is the company that develops Riak), and they are great at explaining the basics of vector clocks - definitely a must-read if you're into distributed systems:

Why Vector Clocks are Easy

Why Vector Clocks are Hard

Voldemort (by LinkedIn) is another DB that uses vector clocks, as explained below. Not surprisingly, it also takes the idea from Amazon's Dynamo (like Riak):

The redundancy of storage makes the system more resilient to server failure. Since each value is stored N times, you can tolerate as many as N – 1 machine failures without data loss. This causes other problems, though. Since each value is stored in multiple places it is possible that one of these servers will not get updated (say because it is crashed when the update occurs). To help solve this problem Voldemort uses a data versioning mechanism called Vector Clocks that are common in distributed programming. This is an idea we took from Amazon’s Dynamo system. This data versioning allows the servers to detect stale data when it is read and repair it.

Voldemort's code, in Java, can be found on code.google.com.

Finally, before I end this post, you may be asking "why complicate things so much?" (if I could get a penny every time I heard that when discussing distributed systems... :-). But in this case it's a good and typical question: can't we just use timestamps and let the last write win? The problem is that this requires clocks to be perfectly synchronized across machines - which is very difficult and oftentimes impossible. With vector clocks, you don't impose this requirement on the system.
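To see how vector clocks avoid the synchronized-clock requirement, here is a small sketch (function names are my own): each node keeps a counter per node, bumps its own entry on a write, and two versions conflict exactly when neither clock dominates the other.

```python
def increment(clock, node):
    """Return a copy of the vector clock with node's counter bumped."""
    c = dict(clock)
    c[node] = c.get(node, 0) + 1
    return c

def descends(a, b):
    """True if clock a has seen everything clock b has (a dominates b)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def conflict(a, b):
    """Neither version descends from the other: concurrent writes."""
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "alice")   # alice writes: {alice: 1}
v2 = increment(v1, "bob")     # bob updates alice's version
v3 = increment(v1, "carol")   # carol also updates alice's version
print(conflict(v2, v1))       # False: v2 descends from v1, safe to overwrite
print(conflict(v2, v3))       # True: concurrent writes, needs resolution
```

Notice that no wall-clock time appears anywhere: causality is tracked purely by the counters, which is why the clocks on the machines don't need to agree.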

Monday, July 16, 2012

Make: Electronics - Intrusion Alarm (Experiment 15)

Today I finished Experiment 15 of Make: Electronics.

I finally got the hang of soldering and now the alarm is all soldered. Below you can see pictures of the alarm ready to be armed (green LED on) and armed (red LED on). It's connected to a pair of magnetic switches that can be applied to doors or windows.

Sunday, July 15, 2012

Quality improvement through TDD


In an attempt to find data backing TDD, I came across this paper by Microsoft Research, IBM, and North Carolina State University:

Realizing quality improvement through test driven development: results and experiences of four industrial teams

Below you can find some of the most interesting quotes from this paper.

Test-driven development (TDD) (Beck 2003) is an “opportunistic” (Curtis 1989) software development practice that has been used sporadically for decades (Larman and Basili 2003; Gelperin and Hetzel 1987). With this practice, a software engineer cycles minute-by-minute between writing failing unit tests and writing implementation code to pass those tests. […] However, little empirical evidence supports or refutes the utility of this practice in an industrial context.

we do investigate TDD within the prevailing development processes at IBM and Microsoft and not within the context of XP.

Related Works
[…] maintenance fixes and “small” code changes may be nearly 40 times more error prone than new development (Humphrey 1989), and often, new faults are injected during the debugging and maintenance phases. The ease of running the automated test cases after changes are made should also enable smooth integration of new functionality into the code base and therefore reduce the likelihood that fixes and maintenance changes introduce new defects. The TDD test cases are essentially a high-granularity, low-level, regression test.

Erdogmus et al. (2005) performed a controlled investigation regarding test-first and test-last programming using 24 undergraduate computer science students. They observed that TDD improved programmer productivity but did not, on average, help the engineers to achieve a higher quality product. Their conclusions also brought out a valid point that the effectiveness of the test-first technique depends on the ability to encourage programmers to enhance their code with test assets.

Müller and Tichy (2001) investigated XP in a university context using 11 students. From a testing perspective they observed that, in the final review of the course, 87% of the students stated that the execution of the test cases strengthened their confidence in their code.

Janzen and Seiedian (2006) conducted an experiment with undergraduate students in a software engineering course. Students in three groups completed semester-long programming projects using either an iterative test-first (TDD), iterative test-last, or linear test-last approach. Results from this study indicate that TDD can be an effective software design approach improving both code-centric aspects such as object decomposition, test coverage, and external quality, as well as developer-centric aspects, which includes productivity and confidence.

TDD Implementations
  • IBM
Unit testing followed as a post-coding activity. In all cases, the unit test process was not formal and was not disciplined. More often than not, there were resource and schedule limitations that constrained the number of test cases developed and run.

With the TDD group, test cases were developed mostly up front as a means of reducing ambiguity and to validate the requirements, which for this team was a full detail standard specification. UML class and sequence diagrams were used to develop an initial design. This design activity was interspersed with the up-front unit test creations for developed classes. Complete unit testing was enforced—primarily via reminders and encouragements. We define complete testing as ensuring that the public interfaces and semantics of each method (the behavior of the method as defined by the specification) were tested utilizing the JUnit unit-testing framework. For each public class, there was an associated public test class; for each public method in the class there was an associated public test method in the corresponding unit test class. The target goal was to cover at least 80% of the developed classes by automated unit testing.

[…] to guarantee that all unit tests would be run by all members of the team, an automated build and test system was set up in both geographical locations. Daily, the build systems extracted all the code from the library, built the code, and ran all the unit tests. […] After each automated build and test run cycle, an email was sent to all members of the teams listing all the tests that ran successfully as well as any errors found. This automated build and test served as a daily integration and validation heartbeat for the team.

  • Microsoft
The TDD team at Microsoft did most of their development using a hybrid version of TDD. By hybrid we mean that these projects as with almost all projects at Microsoft had detailed requirements documents written. These detailed requirements documents drove the test and development effort. There were also design meetings and review sessions. This explains our reason to call this a hybrid-TDD approach, as agile teams typically do not have design review meetings.

Quality and Productivity Results
We measure the quality of the software products in terms of defect density computed as defects/thousand lines of code (KLOC).

“When software is being developed, a person makes an error that results in a physical fault (or defect) in a software element. When this element is executed, traversal of the fault or defect may put the element (or system) into an erroneous state. When this erroneous state results in an externally visible anomaly, we say that a failure has occurred” (IEEE 1988).

All the teams demonstrated a significant drop in defect density: 40% for the IBM team; 60–90% for the Microsoft teams.

Another interesting observation from the outcome measures in Table 3 is the increase in time to develop the features attributed to the usage of the TDD practice, as subjectively estimated by management. The increase in development time ranges from 15% to 35%. From an efficacy perspective this increase in development time is offset by the reduced maintenance costs due to the improvement in quality (Erdogmus and Williams 2003), an observation that was backed up by the product teams at Microsoft and IBM.

– Start TDD from the beginning of projects. Do not stop in the middle and claim it doesn’t work. Do not start TDD late in the project cycle when the design has already been decided and majority of the code has been written. TDD is best done incrementally and continuously.
– For a team new to TDD, introduce automated build test integration towards the second third of the development phase—not too early but not too late. If this is a “Greenfield” project, adding the automated build test towards the second third of the development schedule allows the team to adjust to and become familiar with TDD. Prior to the automated build test integration, each developer should run all the test cases on their own machine.
– Convince the development team to add new tests every time a problem is found, no matter when the problem is found. By doing so, the unit test suites improve during the development and test phases.
– Get the test team involved and knowledgeable about the TDD approach. The test team should not accept new development release if the unit tests are failing.
– Hold a thorough review of an initial unit test plan, setting an ambitious goal of having the highest possible (agreed upon) code coverage targets.
– Constantly running the unit tests cases in a daily automatic build (or continuous integration); tests run should become the heartbeat of the system as well as a means to track progress of the development. This also gives a level of confidence to the team when new features are added.
– Encourage fast unit test execution and efficient unit test design. Test execution speed is very important since when all the tests are integrated, the complete execution can become quite long for a reasonably-sized project and when using constant test executions. Tests results are important early and often; they provide feedback on the current state of the system. Further, the faster the execution of the tests the more likely developers themselves will run the tests without waiting for the automated build tests results. Such constant execution of tests by developers may also result in faster unit tests additions and fixes.
– Share unit tests. Developers’ sharing their unit tests, as an essential practice of TDD, helps identify integration issues early on.
– Track the project using measurements. Count the number of test cases, code coverage, bugs found and fixed, source code count, test code count, and trend across time, to identify problems and to determine if TDD is working for you.
– Check morale of the team at the beginning and end of the project. Conduct periodical and informal surveys to gauge developers’ opinions on the TDD process and on their willingness to apply it in the future.
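To make the practices above concrete, here is a minimal test-first sketch using Python's standard unittest module. The function and its behavior are hypothetical, invented just for illustration: in TDD the tests would be written first, seen to fail, and only then would the function be written to make them pass.

```python
import unittest

# Hypothetical function under test. In a TDD workflow, the test class
# below would exist (and fail) before this implementation was written.
def apply_discount(price, percent):
    """Return price reduced by the given percentage, rounded to cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (100 - percent) / 100, 2)

class TestApplyDiscount(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(100.0, 25), 75.0)

    def test_zero_discount(self):
        self.assertEqual(apply_discount(80.0, 0), 80.0)

    def test_invalid_percent_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(50.0, 150)

if __name__ == "__main__":
    unittest.main(exit=False)
```

Because the suite is plain unittest, it plugs directly into the daily automated build or continuous integration run mentioned above, and it executes in milliseconds, which keeps developers willing to run it locally.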

Thursday, July 12, 2012

Don't touch my code!

If you've worked at any big company, you have probably run into the problem that you are not supposed to touch a component owned by someone else. This can be really infuriating, especially if you know exactly what to change and feel that the process gets in the way.

Today I came across this paper by Microsoft Research, the University of Zurich, and the University of California, Davis, that talks about exactly this: code ownership. It makes the case, based on prior research and on Windows Vista and Windows 7 development, that code changes by minor contributors are one of the main predictors of code defects.

Don’t Touch My Code! Examining the Effects of Ownership on Software Quality

Some quotes:
Within Microsoft, we have found that when more people work on a binary, it has more failures.
Based on our observations and discussions with project managers, we suspect that when there is no clear point of contact and the contributions to a software component are spread across many developers, there is an increased chance of communication breakdowns, misaligned goals, inconsistent interfaces and semantics, all leading to lower quality.


The authors also mention that sharing ownership has substantial communication and coordination costs:
The set of developers that contribute to a component implicitly form a team that has shared knowledge regarding the semantics and design of the component. Coordination is a known problem in software development [16]. In fact, another of the top three problems identified in Curtis' study [10] was "communication and coordination breakdowns." Working in such a group always creates a need for sharing and integrating knowledge across all members [8]. Cataldo et al. showed that communication breakdowns delay tasks [9]. If a member of this team devotes little attention to the team and/or the component, they may not acquire the knowledge required to make changes to the component without error.


The results of our analysis of ownership in both releases of Windows can be interpreted as follows:
1. The number of minor contributors has a strong positive relationship with both pre- and post-release failures even when controlling for metrics such as size, churn, and complexity.
2. Higher levels of ownership for the top contributor to a component results in fewer failures when controlling for the same metrics, but the effect is smaller than the number of minor contributors.
3. Ownership has a stronger relationship with pre-release failures than post-release failures.
4. Measures of ownership and standard code measures show a much smaller relationship to post-release failures in Windows 7.


Finally, the goal is for all these findings to be actionable, so the authors make recommendations that teams should follow:
For contexts in which strong ownership is practiced or where empirical studies are consistent with our own findings, we make the following recommendations regarding the development process based on our findings:
1. Changes made by minor contributors should be reviewed with more scrutiny. Changes made by minor contributors should be exposed to greater scrutiny than changes made by developers who are experienced with the source for a particular binary. When possible, major contributors should perform these code inspections. If a major contributor cannot perform all inspections, he or she should focus on inspecting changes by minor contributors.
2. Potential minor contributors should communicate desired changes to developers experienced with the respective binary. Often minor contributors to one binary are major contributors to a depending binary. Rather than making a desired change directly, these developers should contact a major contributor and communicate the desired change so that it can be made by someone who has higher levels of expertise.
3. Components with low ownership should be given priority by QA resources. Metrics such as Minor and Ownership should be used in conjunction with source code based metrics to identify those binaries with a high potential for having many post-release failures. When faced with limited resources for quality-control efforts, these binaries should have priority.
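As a rough sketch of how the metrics behind these recommendations can be computed, the snippet below derives Ownership (the top contributor's share of a component's changes) and the count of minor versus major contributors from a change history. The 5% threshold matches the paper's definition of a minor contributor; the function and sample data are my own illustration, not the authors' implementation.

```python
from collections import Counter

def ownership_metrics(change_authors, threshold=0.05):
    """Compute ownership metrics for one component.

    change_authors: list of author names, one entry per change/commit
    threshold: fraction of total changes below which a contributor
               counts as "minor" (the paper uses 5%)
    Returns (ownership, minor, major, total_contributors).
    """
    counts = Counter(change_authors)
    total = sum(counts.values())
    shares = {author: n / total for author, n in counts.items()}
    ownership = max(shares.values())                 # top contributor's share
    minor = sum(1 for s in shares.values() if s < threshold)
    major = len(shares) - minor
    return ownership, minor, major, len(shares)

# Hypothetical component: one dominant contributor, two occasional ones
authors = ["alice"] * 90 + ["bob"] * 7 + ["carol"] * 3
ownership, minor, major, total = ownership_metrics(authors)
```

With real data, `change_authors` would come from the version control log (e.g. one entry per commit touching the binary), and components with high `minor` counts or low `ownership` would be the ones flagged for extra review and QA attention.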

Tuesday, July 10, 2012

Much more than a prototype: a tracer code

I've been working on code that has been called a prototype. This time we decided to do it right: first, to code it "properly," without taking major shortcuts, and to write unit tests. We also took the time to see whether the system would hang together and perform to expectations. What is nice is that you don't really need to throw it away; as is typical, you can extend it and use it in production.

Then I came across this text from the "Pragmatic Programmer" book (bold mine):
You might think that this tracer code concept is nothing more than prototyping under an aggressive name. There is a difference. With a prototype, you're aiming to explore specific aspects of the final system. With a true prototype, you will throw away whatever you lashed together when trying out the concept, and recode it properly using the lessons you've learned. [...] The tracer code approach addresses a different problem. You need to know how the application as a whole hangs together. You want to show your users how the interactions will work in practice, and you want to give your developers an architectural skeleton on which to hang code. In this case, you might construct a tracer consisting of a trivial implementation of the container packing algorithm (maybe something like first-come, first-served) and a simple but working user interface. Once you have all the components in the application plumbed together, you have a framework to show your users and your developers. Over time, you add to this framework with new functionality, completing stubbed routines. But the framework stays intact, and you know the system will continue to behave the way it did when your first tracer code was completed.
So in the end, I learned that the proper term is "tracer code".

And what is interesting is what the book says about prototypes (to my first point in this post):
If you feel there is a strong possibility in your environment or culture that the purpose of prototype code may be misinterpreted, you may be better off with the tracer bullet approach. You'll end up with a solid framework on which to base future development.

Saturday, July 07, 2012

Password Hashing - DOs and DONTs

This article on salted password hashing is really great if you want to build security into your website. Definitely a must-read if you work on websites or just want to learn more about how to handle passwords in a secure way.
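As a minimal sketch of the core advice (a unique random salt per user, a deliberately slow key-derivation function, and a constant-time comparison), here is what it can look like with only Python's standard library. The iteration count and parameter choices are illustrative, not a tuning recommendation:

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor; tune for your hardware

def hash_password(password):
    """Return (salt, digest) for storage; never store the password itself."""
    salt = os.urandom(16)  # unique random salt defeats precomputed tables
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, digest):
    """Recompute the hash with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("correct horse battery staple")
```

The slow KDF is the point: a single login check is imperceptible to a user, but the same cost multiplied across billions of guesses makes offline cracking of a stolen database far more expensive.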

Wednesday, July 04, 2012

Notes on Facebook's Haystack

Very interesting paper from Facebook on their photo storage (Finding a needle in Haystack: Facebook’s photo storage). A simple yet effective solution to their problem, and a good example of how to leverage the properties of your workload to come up with a design that delivers great results (a 4x improvement in reads/sec). Some notes:
  • The solution is essentially: (1) cache file metadata in memory to reduce the number of I/Os for each photo retrieval; (2) store multiple photos in a single very large file, reducing per-file metadata.
  • This solution is tailored for this photo storage use case: written once, read often, never modified, and rarely deleted.
  • CDNs alone were not good enough: they are expensive and not well suited to Facebook's workload (the long tail of requests is substantial but not cacheable by a CDN). Haystack still uses a CDN for the hottest photos.
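The first note can be sketched as a toy store (my simplification for illustration, not Facebook's actual needle format): photos are appended to one large file, and a small in-memory index maps each photo ID to its offset and size, so a read costs a single seek-and-read instead of several filesystem metadata I/Os.

```python
import os
import tempfile

class HaystackStore:
    """Toy append-only store: many photos ("needles") in one big file."""

    def __init__(self, path):
        self.path = path
        self.index = {}              # photo_id -> (offset, size), kept in RAM
        open(path, "wb").close()     # start with an empty store file

    def write(self, photo_id, data):
        with open(self.path, "ab") as f:
            offset = f.tell()        # appends only; existing data never moves
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def read(self, photo_id):
        offset, size = self.index[photo_id]
        with open(self.path, "rb") as f:
            f.seek(offset)           # one seek, one read per photo
            return f.read(size)

store = HaystackStore(os.path.join(tempfile.gettempdir(), "haystack_demo.dat"))
store.write("p1", b"jpeg-bytes-1")
store.write("p2", b"jpeg-bytes-2")
```

This fits the stated workload exactly: photos are written once and read often, so an append-only layout with an in-memory index trades away in-place updates (which the workload barely needs) for cheap reads.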

Program Managers: have them or not? Is this the question?

A good post on program managers (PM: Secret weapon or wasted headcount?) got me thinking about my own experience with Program Managers, especially after working at companies both with and without them.

Some thoughts below. This is not meant to recruit or scare anyone considering becoming a Program Manager, but to share some insights from working with PMs in different environments, and also without them, where I had to be an acting PM.


This is big for me. If you don't have a Program Manager, you need to go outside of your comfort zone and try to find the right answer yourself. You may not have the skill, or the time, to do this job well, but because there is no one to whom you can outsource this work, you feel more vested in it. It's related to the startup culture I mentioned in the post Too much work, too little done.

PM technical expertise

A PM's role depends a lot on the product they work on. Some products, like Microsoft Word, are far less technical than low-level infrastructure technologies like Amazon Web Services or Windows Azure. A PM's lack of technical expertise can bias decisions and focus, not to mention communication with customers, which could often be more precise and more helpful if PMs were more involved with the technical details.


So what I've seen happening quite frequently is that PMs were supposed to make decisions for which they did not have enough information or background, so developers had to spend a substantial amount of time providing that information just to get the PMs to decide and/or sign off on decisions. That was not very agile.

Customer Focus

For more customer-oriented products, like Microsoft Word mentioned above, PMs were invaluable: they did all the work to understand customers' needs, talked to customers directly, supervised focus groups, and came up with the right features from the customer's perspective. That is something I really missed at places where we did not have Program Managers and developers did not have the skills or time to do that work.

The problem is that, for infrastructure work, PMs were not fulfilling the same role. We needed a deeper understanding of customers' scenarios to provide the right guidance, but that simply was not a priority, and many decisions were made based on guesses.


Many Program Managers just follow the same script over and over, managing projects and coming up with features based only on stated customer needs or on the competition. When this happens, they are not a medium for innovation within the company. We need more Program Managers who are visionaries and who help give guidance to execute within that vision. This goes beyond processes and bureaucracy; being this kind of catalyst requires a lot of freedom and flexibility to experiment beyond the day-to-day bread and butter. Otherwise PMs will just play catch-up, helping the company implement what others already have, or simply deliver the features customers ask for without thinking outside the box.

Public DNS and Traffic Management

Last weekend I read this good paper on Public DNS and Global Traffic Management, written by scientists at Microsoft Research. The paper on IEEE Xplore is available only to members, but if you have access, it's worth reading.

One of the main points of this paper is:

It appears that Google, as a cloud service provider, exactly achieves this goal by offering its own Public DNS service. When clients switch from ISP-assigned LDNS to Google Public DNS, their performance accessing Google services will improve, as Google's GTM can now observe the clients' IP and select data centers that are client-best, rather than DNS-best. However, when the clients access any other cloud services, their performance will inevitably degrade. The best data center determined by the GTMs of those services can only be DNS-best with respect to Google DNS servers. Because the Google DNS servers are further away from the clients than ISP-assigned LDNS, the performance perceived by the clients will be worse than before switching to the Public DNS system.


When using Public DNS, the conclusion at the time was "when the clients access any other cloud services, their performance will inevitably degrade".

And in my opinion, although Public DNS systems may not currently match LDNS performance when it comes to best-datacenter selection, they can get better by deploying more DNS servers around the world. Any concern about performance should be gone once these Public DNS providers catch up.

The other point is that other companies running DNS servers with special load-balancing methods will have less data, and will therefore make poorer decisions for customers. In a way, Public DNS takes away some of the power these companies have. But the same argument applies: as Public DNS providers deploy more DNS servers, those companies should get back to the current state, with a similar amount of information.

For technologists, the answer on whether to use a Public DNS or the ISP's local DNS typically boils down to which is faster. Google, in particular, has a very interesting caching layer shared by all its DNS servers that helps make it even faster. So why not use it, as asked here: Should I use my ISP's DNS, or Google's? The answer has been plain and simple: measure its performance and switch to it if it's faster.


The concern, though, is that any company providing such a service would be able to learn customers' behavior and monetize it through more targeted ads. Google denies such use in their FAQ section and on their Privacy Page, so in theory they do not relate DNS usage to any personally identifiable information. The question is whether this is honored, and how much can be inferred about usage even without any PII in the logs.

Business Goal

No service is offered for free without aligning with business goals, so when one reads "We built Google Public DNS to make the web faster and to retain as little information about usage as we could" (from the Public DNS Privacy Page), there must be at least some business purpose behind this Public DNS. The first thought, as mentioned in the quoted text above, is that Google can improve its own customers' experience; but since Google controls the client side, it could gather more information about the user through JavaScript embedded in its pages, for instance. So the question is: besides that, what is the purpose behind Public DNS?