The Joys of Real Hardware

Again from a presentation given by Jeff Dean, a brilliant Google engineer. These are the hardware failures you can expect in a typical first year for a new cluster at Google:

  • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
  • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  • ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  • ~1 network rewiring (rolling ~5% of machines down over 2-day span)
  • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  • ~5 racks go wonky (40-80 machines see 50% packet loss)
  • ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  • ~12 router reloads (takes out DNS and external VIPs for a couple minutes)
  • ~3 router failures (have to immediately pull traffic for an hour)
  • ~dozens of minor 30-second blips for DNS
  • ~1000 individual machine failures
  • ~thousands of hard drive failures
  • slow disks, bad memory, misconfigured machines, flaky machines, etc.

My comment: distributed systems must deal with failure constantly and consistently. My intuition (which is usually wrong) is that Java/.NET exception handling isn’t well suited to this: by the time a catch block runs, the stack has already unwound, so there’s no convenient way (AFAIK) to fix the problem and retry or resume the failing block of code the way you can with Lisp conditions and restarts. Software transactional memory (STM) does something similar. It doesn’t fix anything, but it keeps retrying the block until the transaction commits, i.e. until the inputs it read are still unchanged at the moment it writes the result. I’m sure there’s a solution buried in the Hadoop code. I’ll dig it out and post it soon.
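To make the two ideas concrete, here is a rough Java sketch. It is not taken from Hadoop or from Jeff Dean's talk; the class name RetryDemo, the methods retryWithRepair and optimisticUpdate, and the "replica down" scenario are all made up for illustration. The first method approximates "fix and retry" by handing the exception to a repair handler and then re-running the whole block (a real Lisp restart could resume at the point of failure, which this cannot). The second mimics the STM-style loop with a plain compare-and-set on a single value, which is far weaker than real STM but shows the same retry-until-unchanged shape.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.IntUnaryOperator;

final class RetryDemo {

    /**
     * Runs the block; on failure, passes the exception to a repair handler
     * and re-runs the block from the top, up to maxAttempts times in total.
     * If every attempt fails, the last exception is rethrown.
     */
    static <T> T retryWithRepair(int maxAttempts,
                                 Callable<T> block,
                                 Consumer<Exception> repair) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return block.call();
            } catch (Exception e) {
                last = e;
                // Only try to repair if another attempt will follow.
                if (attempt < maxAttempts) {
                    repair.accept(e);
                }
            }
        }
        throw last;
    }

    /**
     * Optimistic update: read the value, compute the new one, and commit
     * only if nobody changed the value in the meantime; otherwise recompute.
     */
    static int optimisticUpdate(AtomicInteger cell, IntUnaryOperator f) {
        while (true) {
            int seen = cell.get();
            int next = f.applyAsInt(seen);
            if (cell.compareAndSet(seen, next)) {
                return next;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical flaky dependency: fails until the handler "repairs" it.
        boolean[] replicaUp = {false};
        String result = retryWithRepair(
                3,
                () -> {
                    if (!replicaUp[0]) {
                        throw new IllegalStateException("replica down");
                    }
                    return "ok";
                },
                e -> replicaUp[0] = true);
        System.out.println(result); // prints "ok" (second attempt succeeds)

        AtomicInteger counter = new AtomicInteger(41);
        System.out.println(optimisticUpdate(counter, x -> x + 1)); // prints 42
    }
}
```

The standard library's AtomicInteger.updateAndGet already wraps the second loop; it is spelled out here only to show the retry. In a real system the repair handler would do something like pick another replica or back off before the next attempt.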

