Today I read a great article about Microsoft Channel 9 moving to Azure. It describes the sound design principles in place and the lessons the Channel 9 team shares about moving a web site to the cloud.
One of the things that caught my attention was seeing a Microsoft project use a distributed cache fleet running Memcached. Using a caching layer is definitely the right thing to do in many cases to make a site more scalable. I wonder why they didn't use Windows Azure AppFabric Caching. Also, having worked on Amazon ElastiCache before joining Azure, I'd be curious to know how they monitor their Memcached instances.
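To make the caching point concrete, here is a minimal cache-aside sketch. It is not Channel 9's code: `InMemoryCache` is a hypothetical stand-in for a Memcached client, and `get_with_cache` and `load_from_store` are names I made up for illustration. The hit and miss counters are the kind of basic signal I would want to monitor on a cache fleet.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryCache:
    """Hypothetical stand-in for a Memcached client: get/set with a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is not None and self._clock() < entry[0]:
            self.hits += 1
            return entry[1]
        # Missing or expired: drop any stale entry and count a miss.
        self._items.pop(key, None)
        self.misses += 1
        return None

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def get_with_cache(
    cache: InMemoryCache,
    key: str,
    load_from_store: Callable[[str], str],
    ttl_seconds: float = 60.0,
) -> str:
    """Cache-aside read: try the cache, fall back to the store, then populate."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_store(key)
    cache.set(key, value, ttl_seconds)
    return value
```

A falling hit ratio is usually the first sign that the cache is undersized or that keys are expiring too aggressively, which is why I would track it per instance.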
I was very glad to see modular code, coding to interfaces, and above all dependency injection being used. While dependency injection is pretty popular in the Java world, it's still not as popular among Microsoft developers. They mention dependency injection being used "for testing purposes but also to isolate you from very specific platform details". Very well done.
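As a rough illustration of what coding to an interface and injecting the dependency buys you, here is a small sketch. All the names (`BlobStore`, `InMemoryBlobStore`, `ThumbnailService`) are hypothetical and not taken from the Channel 9 code base; a real cloud-backed store would simply be another class with the same two methods.

```python
from typing import Dict, Optional, Protocol


class BlobStore(Protocol):
    """Hypothetical interface that hides the platform's storage details."""

    def read(self, name: str) -> Optional[bytes]: ...

    def write(self, name: str, data: bytes) -> None: ...


class InMemoryBlobStore:
    """Test double; a cloud-backed store would expose the same two methods."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def read(self, name: str) -> Optional[bytes]:
        return self._blobs.get(name)

    def write(self, name: str, data: bytes) -> None:
        self._blobs[name] = data


class ThumbnailService:
    """Depends only on the BlobStore interface, injected via the constructor."""

    def __init__(self, store: BlobStore) -> None:
        self._store = store

    def save_thumbnail(self, video_id: str, image: bytes) -> str:
        name = f"thumbnails/{video_id}.png"
        self._store.write(name, image)
        return name

    def has_thumbnail(self, video_id: str) -> bool:
        return self._store.read(f"thumbnails/{video_id}.png") is not None
```

In a unit test you pass `InMemoryBlobStore()`; in production you pass whatever talks to the real platform, and `ThumbnailService` never needs to know the difference.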
Division of labor is the right principle for environments where machines are not reliable. This is the proper mindset about machines in the cloud: "In practice they tend to run a very long time, but you can’t depend on that fact." Breaking down the tasks, connecting them via queues, and using worker roles to pick them up seems a smart strategy (assuming you have proper monitoring of queue depth in place). In particular, I like the fact that the Channel 9 team did not simply assume that instances run for a long time and release an architecture based on that assumption, leaving potential problems to be addressed in the future. Unfortunately, I've seen a lot of people with exactly that mindset, and Channel 9 did very well here.
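Here is a toy sketch of the queue-plus-worker idea, using Python's in-process `queue.Queue` as a stand-in for a real cloud queue. The `drain` function, the `(name, attempts)` tuple format, and the dead-letter list are my own illustrative assumptions, not how Channel 9 or Azure queues actually work.

```python
import queue
from typing import Callable, List, Tuple

Task = Tuple[str, int]  # (task name, number of failed attempts so far)


def drain(
    tasks: "queue.Queue[Task]",
    handle: Callable[[str], None],
    max_attempts: int = 3,
) -> List[str]:
    """Process tasks until the queue is empty.

    A failed task goes back on the queue until it has failed max_attempts
    times; after that it is set aside so it cannot block the other tasks.
    Returns the names of the tasks that were given up on.
    """
    dead_letters: List[str] = []
    while True:
        try:
            name, attempts = tasks.get_nowait()
        except queue.Empty:
            return dead_letters
        try:
            handle(name)
        except Exception:
            if attempts + 1 < max_attempts:
                tasks.put((name, attempts + 1))
            else:
                dead_letters.append(name)
```

In this toy version `tasks.qsize()` plays the role of the queue depth. A depth that keeps growing means the workers are not keeping up (or are dying), which is exactly what the monitoring should alert on.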
From the article, though, the one thing that could have been done better was to think about database sharding. Although SQL Azure will provide Federations, there are still many things service owners need to think about: what the database partition key will be, which queries will need to cross partitions and could hurt scalability, which queries will need to be federated, etc. I am not very familiar with SQL Azure Federations and don't know whether it will automatically repartition hot partitions, but if it doesn't, that's another task service owners need to prepare for. With all that said, you don't need to shard right away, but you do need to think about it before version 1 of your service goes out; otherwise scaling can be a major headache, and if you can't afford downtime, it can be an almost impossible task in some cases.
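To show why the partition key and the query shapes matter, here is a small sketch of hash-based sharding. It is a generic illustration, not the SQL Azure Federations API; `shard_for`, `ShardedComments`, and the choice of a video id as the partition key are all assumptions of mine.

```python
import hashlib
from typing import Dict, List


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index with a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedComments:
    """Comments partitioned by video id; each dict stands in for one database."""

    def __init__(self, shard_count: int) -> None:
        self._shards: List[Dict[str, List[str]]] = [{} for _ in range(shard_count)]

    def add(self, video_id: str, comment: str) -> None:
        shard = self._shards[shard_for(video_id, len(self._shards))]
        shard.setdefault(video_id, []).append(comment)

    def comments_for_video(self, video_id: str) -> List[str]:
        # The query includes the partition key, so it touches one shard only.
        shard = self._shards[shard_for(video_id, len(self._shards))]
        return list(shard.get(video_id, []))

    def count_all(self) -> int:
        # No partition key in the query: it has to fan out to every shard.
        return sum(len(c) for shard in self._shards for c in shard.values())
```

Queries shaped like `comments_for_video` scale with the number of shards, while queries shaped like `count_all` get slower as shards are added. Note also that with plain modulo hashing, changing `shard_count` remaps most keys, which is one reason repartitioning a live service is so painful.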
All that said, I was very glad to read about their work and their sharing the architecture and lessons publicly.
Link to the InfoQ article: