Thursday, September 27, 2012

MongoDB: Replication Lag, Network Partition

Replication: when everything is working, it seems like the easiest thing, but the challenge comes when we have failures. This is something that MongoDB, a database that has been out for some years now, shows us very well.

If you look at MongoDB's replication page, there is a section called "Rollback". It applies when there is replication lag - your secondary is behind the primary - in conjunction with a network partition. It may look like an unlikely combination, but it is not that uncommon a scenario.

What happens then? The old primary is ahead of the rest of the replica set, so to rejoin the cluster it needs to somehow "resync" to the current state. This means that the operations it applied but never replicated (due to the lag) have to be rolled back. First issue: getting that rolled-back data back into the database is a manual process in MongoDB.

But what happens if the node was behind by a large amount of data - more precisely, 300 MB or more? MongoDB will not roll that back on its own, and manual intervention is required to recover that node.

Read for yourself:
Warning: A mongod instance will not rollback more than 300 megabytes of data. If your system needs to rollback more than 300 MB, you will need to manually intervene to recover this data.
Although it supports asynchronous replication (and hence eventual consistency), it does not solve these resynchronization problems automatically. Maybe this was just deemed uncommon and low priority compared to other features, but can you imagine the pain of going through this process?
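
On the operational side, it helps to keep an eye on how far behind the secondaries are, so the lag is noticed long before a failover would require a large rollback. A quick sketch of the shell helpers I'd reach for (as of the 2.x mongo shell - check the docs for your version):
rs.status()                       // per-member state and optimes for the replica set
rs.printSlaveReplicationInfo()    // how far each secondary is behind the primary
db.printReplicationInfo()         // on the primary: oplog size and the time window it covers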

Source: http://docs.mongodb.org/manual/core/replication/

Redis as Memcached Replacement?

I found Redis to be a very interesting alternative to Memcached for use as a caching layer:

  • Pretty much set/get operations on keys/values, just like memcached, but with rich data structures - Lists, Sets, Sorted Sets, Hashes, blocking queues - and rich commands to use all of them. That avoids serializing a bunch of data into opaque Memcached values (see the short sketch after this list).
  • Durability options: Redis can be used just in memory, like memcached, or can be tuned to be durable (flushing data every so often is an option, or appending data to a log file that can be replayed if the node restarts, among others).
  • Master-slave: it can be configured as master-slave, with the slaves acting as your "read replicas".
  • Cluster: just like Memcached, it can be sharded using consistent hashing, but that's taken care of on the client side - which can be a problem if used against cloud instances that are more likely to fail (so one needs to be prepared to update the instance list on the client in case of failures).
  • Security: pretty much non-existent, or at least not built in, like Memcached. You need to resort to other security mechanisms (to be honest, Memcached had some SASL support, but I don't know the current state of that).
  • Not to mention the API, which "[..] is not simply easy to use, but a joy."
  • Lua scripting support on the server side.
  • Good client library support, as well. One good example is the library for a Bloom Filter on top of Redis.
  • Publish/Subscribe support: very cool publish/subscribe support on specific channels, pattern matching for both channel and message subscription, etc.
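
To make the data-structure point concrete, here is a small redis-cli sketch (the keys and values are made up for illustration) of the kinds of operations listed above:
SET user:1000:name "alice"          # plain key/value, memcached-style
GET user:1000:name
EXPIRE user:1000:name 600           # expiration, like a memcached TTL
LPUSH jobs "job-42"                 # list used as a queue
BRPOP jobs 5                        # blocking pop - a blocking queue
ZADD leaderboard 1500 "alice"       # sorted set: score, then member
ZRANGE leaderboard 0 -1 WITHSCORES
HSET user:1000 city "Seattle"       # field in a hash
SUBSCRIBE news.tech                 # in one client...
PUBLISH news.tech "new post!"       # ...published from another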

So the question that remains is: why does Memcached seem to be so much more popular?

  • LRU: one concern I've seen raised is the lack of LRU eviction support, but this has since been added to Redis (maxmemory with an LRU eviction policy).
  • Performance: I did not find more recent numbers, but some numbers from 2010 show Redis being outperformed by Memcached. That could be a good reason not to use it.

I wonder if Redis performance has improved lately, and whether there are other reasons not to use Redis - please comment if you know of any.

Redis API

This is an interesting quote from Seven Databases in Seven Weeks about Redis:

"It's not simply easy to use: it's a joy. If an API is UX for programmers, then Redis should be in the Museum of Modern Art alongside the Mac Cube."

I have to say that I'm looking forward to learning more about it after reading this :-)

Tuesday, September 25, 2012

CouchDB and Conflict Resolution

Learning CouchDB showed me that it is different from the other NoSQL databases I had studied so far with respect to sharding and replication:

  • It does not have sharding, but rather each server has all the data.
  • It has a multi-master (or active-active) configuration, where you can write to any server.

The interesting part about active-active is how conflicts are resolved. Rather than just reporting that there is a conflict, all CouchDB servers end up with a consistent view of the world, even in the presence of conflicts. How do they do it?

  • Revision numbers: first, revision numbers start with the number of changes made to the record. Example: 2-de0ea16f8621cbac506d23a0fbbde08a. Therefore, the revision with more changes wins (change 2-de0ea16f8621cbac506d23a0fbbde08a wins over 1-7c971bb974251ae8541b8fe045964219).
  • MD5: after the change number comes an MD5 hash of the change. If concurrent changes to the record occur on two different servers, an ASCII comparison determines which one wins (change 2-de0ea16f8621cbac506d23a0fbbde08a wins over 2-7c971bb974251ae8541b8fe045964219). Of course there is no notion of time here to determine which change happened first, but at least every server picks the same winning version deterministically. Using an MD5 hash of the document also helps avoid replicating the same change twice (if the identical change was made to different CouchDB databases).
  • Conflict List: besides automatically picking a winner, any client can find out that there were conflicts and do something about them - either merging them or picking a different version. That can be done, for instance, with a view that outputs conflicted documents and a subscription to it through the changes API (see the sketch right after this list). I understand this is optional and conflicts don't stop the system.
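
A minimal sketch of such a view's map function (essentially the example from the CouchDB guide linked below): every document that currently has conflicts shows up as a row, and a client can query the view - or follow the database's changes feed - and resolve what it finds:
function(doc) {
  if (doc._conflicts) {
    emit(doc._conflicts, null);  // one row per conflicted document, keyed by its conflicting (losing) revisions
  }
}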

For systems where concurrent updates are not common, this is definitely a valid and good approach. Also, there is conflict avoidance through ETags (just like Windows Azure Storage), which goes a long way toward avoiding conflicts in the first place.

My concern with CouchDB and active-active is a prolonged split-brain situation, where two or more databases are taking writes to the records (potentially the same ones). Clients can see inconsistent results if they are hitting these different deployments.

Going a bit further on the split-brain scenario: if these databases feed other downstream components in your system, the same conflict-resolution problem may surface on their side, as they could be consuming records from these different databases (which are not resolving conflicts between themselves for some time). Reasoning about all these scenarios can be challenging.

The interesting thing is that, if there is indeed a split-brain but clients can still talk to all deployments, CouchDB could take the write and return an error indicating that replication is currently not working; clients could then embed logic to connect to the other CouchDB deployments and do some of that replication themselves, keeping all the deployments in a roughly consistent state.

Some references:
http://wiki.apache.org/couchdb/Replication_and_conflicts
http://guide.couchdb.org/draft/conflicts.html
http://www.quora.com/CouchDB/How-does-CouchDB-solve-conflicts-in-a-deterministic-way

Monday, September 17, 2012

Service Failures, Disaster Recovery

ACM Queue has a couple of must-read posts on service failures - these are simply required practices if one wants to run services responsibly:

  • Weathering the Unexpected
    “Google runs an annual, company-wide, multi-day Disaster Recovery Testing event—DiRT—the objective of which is to ensure that Google's services and internal business operations continue to run following a disaster. […] DiRT was developed to find vulnerabilities in critical systems and business processes by intentionally causing failures in them, and to fix them before such failures happen in an uncontrolled manner. DiRT tests both Google's technical robustness, by breaking live systems, and our operational resilience by explicitly preventing critical personnel, area experts, and leaders from participating.”
  • Fault Injection in Production
    “fault injection exercises sometimes referred to as GameDay. The goal is to make these faults happen in production in order to anticipate similar behaviors in the future, understand the effects of failures on the underlying systems, and ultimately gain insight into the risks they pose to the business.”
    “treating the fault-toleration and graceful degradation mechanisms as features. [...] Just like every other feature of the application, it's not finished until you've deployed it to production and have verified that it's working correctly.”

Sunday, September 16, 2012

Spanner: Google's Globally-Distributed Database

Today I read this recent paper by Google:

Spanner: Google's Globally-Distributed Database

As a globally-distributed database, Spanner provides several interesting features.
  • First, the replication configurations for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance).
  • Second, Spanner has two features that are difficult to implement in a distributed database: it provides externally consistent [16] reads and writes, and globally-consistent reads across the database at a timestamp. These features enable Spanner to support consistent backups, consistent MapReduce executions [12], and atomic schema updates, all at global scale, and even in the presence of ongoing transactions.

The very interesting part is that it provides semi-relational tables (which some other NoSQL systems also do), a SQL-like query language, and synchronous replication - you can see that many applications within Google use non-optimal data stores just to get synchronous replication.

Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general-purpose transactions. The move towards supporting these features was driven by many factors.
  • The need to support schematized semi-relational tables and synchronous replication is supported by the popularity of Megastore [5]. At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable's, and because of its support for synchronous replication across datacenters. (Bigtable only supports eventually-consistent replication across datacenters.)
  • The need to support a SQL-like query language in Spanner was also clear, given the popularity of Dremel [28] as an interactive data-analysis tool.
  • Finally, the lack of cross-row transactions in Bigtable led to frequent complaints; Percolator [32] was in part built to address this failing. Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings [9, 10, 19]. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.

Essentially, they accomplish this by keeping several versions of each record (like Bigtable, but with timestamps assigned by the server) and by guaranteeing timestamps to be within a bounded error margin using multiple time sources like GPS and atomic clocks. Then there are Paxos groups, of course, and two-phase commit for the transactions.
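
As a rough illustration of how a bounded clock error turns into external consistency, here is a small hypothetical sketch (in C#; the TimeInterval/ITrueTime names are mine, not the paper's) of the "commit wait" idea the paper describes: pick the commit timestamp at the upper bound of the clock uncertainty, then hold the result back until that timestamp is guaranteed to have passed.
// Hypothetical sketch of the "commit wait" idea - not actual Spanner code.
public class TimeInterval
{
    public long EarliestMicros;   // earliest possible current time
    public long LatestMicros;     // latest possible current time (uncertainty bound)
}

public interface ITrueTime
{
    TimeInterval Now();           // clock uncertainty bounded via GPS and atomic clock sources
}

public static class CommitWaitSketch
{
    public static long Commit(ITrueTime trueTime)
    {
        // Assign the commit timestamp at the upper bound of the current uncertainty window.
        long commitTimestamp = trueTime.Now().LatestMicros;

        // Wait until that timestamp is guaranteed to have passed on every clock;
        // only then is it safe to release locks and make the write visible.
        while (trueTime.Now().EarliestMicros <= commitTimestamp)
        {
            System.Threading.Thread.Sleep(1);
        }
        return commitTimestamp;
    }
}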

I found this paper very interesting, as I've been thinking about some of the challenges of implementing these features (like cross-datacenter replication and backup) on NoSQL data stores. Quite an interesting paper and accomplishment by this Google team.

Wednesday, September 12, 2012

MongoDB, Maturity, and "let's implement it ourselves"

One common theme I've seen in the past few years is developers downplaying products and features that typically take a long time to implement and to reach a mature state, like data stores and data replication. Even when the discussion ends with "our problem is much more limited", "it can't be that difficult", or plainly "we're smarter than they are", I haven't really been convinced that the advocates of implementing their own feature (you name it) really knew the consequences of moving forward with it.

As an example of how complex some of these things are, let's take a look at MongoDB. Here are a few good posts on the topic of maturity:

How much credibility does the post "Don't use MongoDB" have?
Which companies have moved away from MongoDB and why?
Goodbye MongoDB
Don't use MongoDB

10gen is the company behind MongoDB, and we're talking about tens of people working on this system for years - and it still has its issues. This is just normal. I once heard that it takes around 10 years to have a mature DB system - yes, 10 years.

Above you can see that replication is one of the common issues. If you go ahead and download the MongoDB source code, you will see that the replication code is under 7,000 lines. To repeat the common question "How hard can it be?": how hard can it be to get 7,000 lines of code right? :-)

A long time back, when I worked on the Linux kernel, my changes amounted to around 5,000 lines of code. Those were the hardest 5,000 lines of code I have written - hardest to get to work and to get to some stability. I was able to make good progress with a lot of people trying it out and providing feedback, but it was still far from a stable patch that anyone in the world could use. And that was code for a non-distributed system, without all the complexities of remote failures, network partitions, etc.

Charlie Kindel on Microsoft's Performance Review

Having gone through my own performance review recently, I couldn't help but reread Charlie's post on performance reviews to get a former executive's perspective on the review ratings. Highly recommended.

Got a 4? You Were Just Fired from Microsoft

On the rating scale, this is what Charlie says:

1 You walk on water. We love you. We want to love you long time. Here’s an absurd amount of stock, an up-to 100% bonus, and a huge pat on the back. If we’re not promoting you right now, it’s probably because we just promoted you; we’ll get to it again RSN.

2 Stud/Studette! You are awesome in our eyes and we wish everyone else was just like you. You also get a nice healthy stock grant, a bonus that will make your spouse grin ear-to-ear, and a raise. Please, please, please continue what you’ve been doing.

3 Thank you. You did a fine job. We’d love to keep you around as you are clearly an asset to the company. Keep up the great work! Oh, and if you just do a bit more you could get a “2” next time. But also be careful because you were also close to getting a “4”.

4 I know it stings to hear this, but we are telling you Microsoft really doesn’t care about you anymore and would be just as happy having you work somewhere else. Yea, yea, you might still get a bonus, and there might be small raise involved, but promotion? Forget about it. And stock? Why would we give stock to someone who’s likely not going to be here next year? Ok, ok, maybe you can climb out of this hole over the next year. If you want to try, we’ll let you.

5 Don’t let the door hit you on the butt on the way out. We’d fire you, but that’s just so messy. It’s far easier for us to make you feel unwanted and hope you leave.

Dr. Dobb's on Unit Tests

I've had many posts on unit tests, and here goes yet another one. I can't count how many times I've heard that unit tests are a waste of time. Managers, pressed for a release, are "ok" if you check in without unit tests. Unit tests, when there's pressure, become "optional" and, compared to all the features you could be implementing instead, go to the bottom of your priority list. Yes, these are the chats I have - and no matter how much one tries, it's almost impossible to convince people otherwise.

Dr. Dobb's published a short piece on unit tests called "Unit Testing: Is there really any debate any longer?", which is great and reflects precisely my experience. Yet another shot at convincing the skeptics :-)

Saturday, September 01, 2012

Craftsmen: Jiro Dreams of Sushi

A few months back, the shuttle driver at my car dealership recommended a bunch of good movies and documentaries he had watched at the Seattle International Film Festival. I always have a great time when he drives me, given his huge passion for movies and his kindness in spreading the word to all customers about good films.

One of the recommended movies was: Jiro Dreams Of Sushi. Luckily for me, Netflix streams it, so today I watched it.

This is a beautiful movie, with a number of great lessons from Jiro and his family. He is a craftsman, dedicating his life to his craft with an insatiable will to keep improving. After 70 years on the job he was awarded 3 Michelin stars, and after 75 years on the job a documentary on his life was released to the world.

Although it's debatable whether this is a balanced life, it's still absolutely fascinating to see such passion for his craft and the discipline to keep going, irrespective of his age and life's other challenges. Whatever he charges for his sushi, we are the ones who benefit most from having this shokunin around to keep amazing us. As mentioned in the documentary, a miracle every night.

Take the time and watch this documentary when you have an opportunity.

Task ContinueWhenAll and Failures

Lately I've been using Task.Factory.ContinueWhenAll in my TPL programming, but I wasn't quite sure how it handles errors, nor what the cleanest way is to handle errors coming from the inner tasks.

Take a look at the code below for an example. What do you expect the result to be?
Task innerTask1 = Task.Factory.StartNew(() => { throw new Exception("Error1"); });
Task innerTask2 = Task.Factory.StartNew(() => { throw new Exception("Error2"); });

var task = Task.Factory.ContinueWhenAll(
    new[] { innerTask1, innerTask2 },
    innerTasks =>
        {
            foreach (var innerTask in innerTasks)
            {
                Console.WriteLine("Result: {0}", innerTask.Status);
            }
        });
var exceptionTask = task.ContinueWith(
    t => Console.Error.WriteLine("Outer Task exception: {0}", t.Exception),
    TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously);

try
{
     task.Wait();
}
catch (Exception e)
{
     Console.WriteLine(e);
}

Console.WriteLine("Done - Outer Task Status: {0}, Exception Task: {1}", task.Status, exceptionTask.Status);
Console.ReadLine();
After running this code, you will see:
Result: Faulted
Result: Faulted
Done - Outer Task Status: RanToCompletion, Exception Task: Canceled
So both inner tasks finished with status "Faulted", as indicated in the "Result:" lines, but the outer task finishes as "RanToCompletion" (i.e. successful). Later on, the unobserved exceptions will be thrown when the faulted tasks are finalized, which is far from ideal if you want to have a good understanding of this code and be able to debug any issues in the future.

The way to properly get the "task" above to fail if any of the inner tasks fail is to observe the exceptions when the ContinueWhenAll continuation runs. The most obvious approach is to check all innerTasks and, if any of them faulted, rethrow its exception.
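
A minimal sketch of that manual check (reusing innerTask1 and innerTask2 from the snippet above):
var task = Task.Factory.ContinueWhenAll(
    new[] { innerTask1, innerTask2 },
    innerTasks =>
        {
            foreach (var innerTask in innerTasks)
            {
                // Observe and propagate the first failure so the outer task faults too.
                if (innerTask.IsFaulted)
                {
                    throw innerTask.Exception;
                }
            }
        });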

Today I came across a good post on that. The suggestion is to make the following change to the code:
Task innerTask1 = Task.Factory.StartNew(() => { throw new Exception("Error1"); });
Task innerTask2 = Task.Factory.StartNew(() => { throw new Exception("Error2"); });

var task = Task.Factory.ContinueWhenAll(
    new[] { innerTask1, innerTask2 },
    innerTasks =>
        {
            Task.WaitAll(innerTasks);

            foreach (var innerTask in innerTasks)
            {
                Console.WriteLine("Result: {0}", innerTask.Status);
            }
        });
var exceptionTask = task.ContinueWith(
    t => Console.Error.WriteLine("Outer Task exception: {0}", t.Exception),
    TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously);

try
{
     task.Wait();
}
catch (Exception e)
{
     Console.WriteLine(e);
}

Console.WriteLine("Done - Outer Task Status: {0}, Exception Task: {1}", task.Status, exceptionTask.Status);
Console.ReadLine();
The difference is the "Task.WaitAll" call - it causes the exceptions to be observed and the ContinueWhenAll task to fault, making the code much more predictable and understandable in case of failures.

This is the result, as compared to the output above (just the "Done" line, as I don't want to paste the WriteLine that prints the exception here):
Done - Outer Task Status: Faulted, Exception Task: RanToCompletion
So the tip is: for every ContinueWhenAll, add Task.WaitAll. If you have an empty continuation (_ => { }, or any variation of that), just replace it with "Task.WaitAll" - and here it can be a method group, which is even cleaner.
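
For instance, an empty continuation can become just the method group (a small sketch reusing innerTask1 and innerTask2 from the snippets above):
var task = Task.Factory.ContinueWhenAll(
    new[] { innerTask1, innerTask2 },
    Task.WaitAll);
Task.WaitAll matches the Action<Task[]> that ContinueWhenAll expects, so all inner exceptions are observed and the outer task faults whenever any inner task fails.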