Wednesday, February 22, 2012

Distributed computing and partial failures

I read a classical paper on distributed systems tonight called "A Note on Distributed Computing". It is from 1994, but still very relevant and a must read for anyone using distributed systems. It talks about the proposal of unifying the programming model for local and distributed objects, like proposed in CORBA at the time.

Although it seems that we are over this phase of unifying the programming model, I feel like that many people still don't get distributed systems as they should, and still don't think that they are that different from local programming. In a way, I think that they don't get things like partial failures, which is one of the things mentioned in this paper on why local and distributed computing cannot cannot be unified under the same model. So it turns out that this paper helps clarify why there is a difference and that must be taken into account when developing systems as well as testing them.

I like that the authors explain the myth of "Quality of Service" by giving a good example that fails if there are partial failures. Also they show how concurrency can be a problem through another example. And going beyond this example, they talk about the popular NFS file system and how, by being a distributed system behind an interface for a local file system, it has had major problems. I remember hearing about that since the 90s.

Below I paste some core excerpts on partial failures and concurrency from the paper. I would still encourage you to take the time and read it entirely.

The hard problems in distributed computing are not the problems of how to get things on and off the wire. The hard problems in distributed computing concern dealing with partial failure and the lack of a central resource manager. The hard problems in distributed computing concern insuring adequate performance and dealing with problems of concurrency. The hard problems have to do with differences in memory access paradigms between local and distributed entities. People attempting to write distributed applications quickly discover that they are spending all of their efforts in these areas and not on the communications protocol programming interface.

Partial failure is a central reality of distributed computing. Both the local and the distributed world contain components that are subject to periodic failure. In the case of local computing, such failures are either total, affecting all of the entities that are working together in an application, or detectable by some central resource allocator (such as the operating system on the local machine).

This is not the case in distributed computing, where one component (machine, network link) can fail while the others continue. Not only is the failure of the distributed components independent, but there is no common agent that is able to determine what component has failed and inform the other components of that failure, no global state that can be examined that allows determination of exactly what error has occurred. In a distributed system, the failure of a network link is indistinguishable from the failure of a processor on the other side of that link.

These sorts of failures are not the same as mere exception raising or the inability to complete a task, which can occur in the case of local computing. This type of failure is caused when a machine crashes during the execution of an object invocation or a network link goes down, occurrences that cause the target object to simply disappear rather than return control to the caller. A central problem in distributed computing is insuring that the state of the whole system is  consistent after such a failure; this is a problem that simply does not occur in local computing.

Being robust in the face of partial failure requires some expression at the interface level. Merely improving the implementation of one component is not sufficient. The interfaces that connect the components must be able to state whenever possible the cause of failure, and there must be interfaces that allow reconstruction of a reasonable state when failure occurs and the cause cannot be determined.

Similar arguments hold for concurrency. Distributed objects by their nature must handle concurrent method invocations. […] One might argue that a multi-threaded application needs to deal with these same issues. However, there is a subtle difference. In a multi-threaded application, there is no real source of indeterminacy of invocations of operations. The application programmer has complete control over invocation order when desired. A distributed system by its nature introduces truly asynchronous operation invocations. Further, a non-distributed system, even when multi-threaded, is layered on top of a single operating system that can aid the communication between objects and can be used to determine and aid in synchronization and in the recovery of failure. A distributed system, on the other hand, has no single point of resource allocation, synchronization, or failure recovery, and thus is conceptually very different.

Post a Comment