Wednesday, March 25, 2009

Bit Rot in the Cloud

From Wikipedia:
Bit rot, or bit decay, is a colloquial computing term used either to describe gradual decay of storage media or to facetiously describe the spontaneous degradation of a software program over time. The latter use of the term implies that software can literally wear out or rust like a physical tool.
One way this manifests in the cloud is with VM images that worked fine just a few months ago but have problems today. It's not unique to the cloud, but I happen to have run into it with some EC2 images. Specifically, I use these images to demonstrate how to automate distributed testing across multiple browsers, triggered by a continuous integration build.

A taste of some things that can go wrong:
  • REST URLs for APIs are deprecated and eventually stop being supported
  • Services and servers move and are decommissioned
  • Strong password security policies cause expiration of passwords, prevent reuse of old passwords, and lock out users after too many retries (especially bad if it's the admin user)
  • Xauth cookies expire and prevent access to the display
There are a couple of ways to guard against this type of bit rot. One is to identify everything that depends on time or external services, then have the instance make the appropriate adjustments and run diagnostic checks on startup. Another is to use only vanilla VM images and do all installation and configuration through something like Puppet.
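
As a rough illustration of the first approach, here is a minimal Python sketch of a startup self-check. It is not what my images actually run. The names `Check`, `tcp_reachable`, `not_expired`, and `run_checks` are all made up for this example, and the one-week warning margin is an arbitrary assumption. The idea is simply to list each time-dependent or external dependency as a named probe and report the ones that fail when the instance boots.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Check:
    """A named probe that returns True when the dependency looks healthy."""
    name: str
    probe: Callable[[], bool]


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> Callable[[], bool]:
    """Build a probe that succeeds if a TCP connection to host:port opens."""
    def probe() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
    return probe


def not_expired(
    expires_at: float,
    now: Callable[[], float] = time.time,
    margin: float = 7 * 24 * 3600,  # assumed warning window: one week
) -> Callable[[], bool]:
    """Build a probe that fails once we are within `margin` seconds of expiry."""
    def probe() -> bool:
        return now() + margin < expires_at
    return probe


def run_checks(checks: Iterable[Check]) -> List[str]:
    """Run every probe and return the names of the ones that failed.

    A probe that raises is treated as a failure, so one broken check
    cannot hide the results of the others.
    """
    failures = []
    for check in checks:
        try:
            ok = check.probe()
        except Exception:
            ok = False
        if not ok:
            failures.append(check.name)
    return failures
```

On boot, the instance would build a list of `Check` objects for whatever it depends on (a service host and port, a password or cookie expiry timestamp recorded when the image was built) and log or alert on whatever `run_checks` returns.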

Maybe there's also some value in using continuous integration tools to regularly exercise the VMs and their configuration, especially systems of associated nodes.

Has anybody else run into this? I'd love to hear what approaches you've taken to mitigate this sort of thing, and how they've worked for you.