Monday, September 17, 2012

Service Failures, Disaster Recovery

ACM Queue has a couple of must-read articles on service failures. These practices are simply required if one wants to run a service responsibly:

  • Weathering the Unexpected
    “Google runs an annual, company-wide, multi-day Disaster Recovery Testing event—DiRT—the objective of which is to ensure that Google's services and internal business operations continue to run following a disaster. […] DiRT was developed to find vulnerabilities in critical systems and business processes by intentionally causing failures in them, and to fix them before such failures happen in an uncontrolled manner. DiRT tests both Google's technical robustness, by breaking live systems, and our operational resilience by explicitly preventing critical personnel, area experts, and leaders from participating.”
  • Fault Injection in Production
    “fault injection exercises sometimes referred to as GameDay. The goal is to make these faults happen in production in order to anticipate similar behaviors in the future, understand the effects of failures on the underlying systems, and ultimately gain insight into the risks they pose to the business.”
    “treating the fault-toleration and graceful degradation mechanisms as features. [...] Just like every other feature of the application, it's not finished until you've deployed it to production and have verified that it's working correctly.”
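
To make the "degradation as a feature" idea a bit more concrete, here is a minimal sketch of what a fault injection hook around a dependency call might look like. It is not taken from either article: `FaultInjector`, `fetch_recommendations`, and `run_exercise` are hypothetical names, the "remote service" is a local stand-in, and a real GameDay exercise would fail actual infrastructure rather than raise an exception in-process.

```python
import random


class InjectedFault(Exception):
    """Raised by the injector in place of a real dependency failure."""


class FaultInjector:
    """Fails a configurable fraction of calls to a wrapped function (hypothetical helper)."""

    def __init__(self, failure_rate, rng=None):
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def call(self, func, *args, **kwargs):
        # random() returns a value in [0.0, 1.0), so a rate of 0.0 never
        # fails and a rate of 1.0 always fails.
        if self.rng.random() < self.failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)


def fetch_recommendations(user_id):
    # Stand-in for a call to a remote service (assumption for illustration).
    return [f"item-{user_id}-{n}" for n in range(3)]


def recommendations_with_fallback(injector, user_id):
    """Return (items, degraded); degrade to an empty list instead of failing."""
    try:
        return injector.call(fetch_recommendations, user_id), False
    except InjectedFault:
        # A real service would also catch timeouts and connection errors here.
        return [], True


def run_exercise(failure_rate, requests, seed=0):
    """Send `requests` calls through the injector and count degraded responses."""
    injector = FaultInjector(failure_rate, random.Random(seed))
    degraded = 0
    for user_id in range(requests):
        _, was_degraded = recommendations_with_fallback(injector, user_id)
        degraded += was_degraded
    return degraded
```

Calling `run_exercise(0.1, 1000)` would push a thousand requests through with roughly a tenth of them failing. The point of the exercise is that every one of those requests still gets an answer, and that the number of degraded responses is something you can measure and verify, just like any other feature.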