Sunday, May 1, 2011

Excuse me, you are talking like an idiot

It has been less than two weeks since the great Amazon Cloud Services failure.  Clearly this was a failure of epic proportions that shakes the very foundation of the Cloud Services concept.  With an outage that lasted more than a day, thousands of Amazon customers were left out in the cold, lost money, and started to re-evaluate their approach to buying network, compute, and storage resources and services. For the pundit press and the sensationalist media, this became the potential Waterloo event for the Cloud.

Well, good grief (please picture me with my mouth wide open and staring at the heavens) this is almost too darn much for me to take and not scream out loud.  Let’s go through some real disasters and measure their impact:

  • The Passenger Ship General Slocum fire.  June 15, 1904.  Over 1,000 people killed.  We did not stop taking ferries, we worked to make them safer.
  • San Francisco Earthquake.  April 18, 1906.  Over 3,000 killed.  We did not abandon San Francisco (or for that matter cities on fault lines)
  • The Triangle Shirtwaist factory fire.  March 25, 1911.  Killed 146 people.  We did not abandon factories, but worked to make the workplace safer.
  • The Titanic sinking. April 15, 1912.  Killed 1,517 people.  We did not stop making ocean voyages, but ensured that there would always be enough lifeboats.
  • de Havilland Comet disasters.  1954.  Over 100 people killed.  We did not abandon commercial jet aircraft, but figured out how to make planes structurally safer.
  • Challenger Shuttle Disaster.  January 28, 1986.  Seven brave men and women died.  Nothing will stop our quest for the stars.
  • Hinsdale Central Office Fire. May 8, 1988.  Several million customers affected.  We did not give up the phone, but recognized the impact of failures.
  • Pan Am Flight 103.  December 21, 1988.   Killed 270 people.  We determined not to be paralyzed by fear and kept flying.
  • AOL Email outage.  June 20, 1997.  Over 500,000 customers affected.  I guess the hundreds of hosted email services (Hotmail, AOL, Gmail, and others) don’t exist.
  • September 11, 2001.   The date says it all.  Nearly 3,000 people killed, tens of millions of people affected.  We did not stop building tall buildings or making airplanes.
  • Northeast Blackout.  August 14, 2003.  Tens of millions of customers affected.  How many businesses would have had their critical IT systems working if they had used a “cloud” service?
  • Hurricane Katrina. August 29, 2005.  Nearly 2,000 people killed.  Hmm, I think you can still visit New Orleans today.
  • Massive Submarine Cable Systems Failures in Asia.  December 26, 2006.  Several million customers affected.  Guess it’s time to stop using high-speed systems to Asia (by the way, other failures occurred during the Japanese earthquake and tsunami this year).
  • White House Email Outage.  January 26, 2009.  The Big Guy affected.  Got significantly less press.
  • BlackBerry Outage. January 29, 2011.   Several million customers affected (and probably the Big Guy).  Again, a “Cloud” provided service fails.  Where was the outcry to re-think using BlackBerries or smart phones for business?
  • White House Email Outage.  February 3, 2011.  The Big Guy affected, again.   Probably better performance if they stuck their services over at Amazon.
  • Amazon Services Failure. April 21, 2011.  Several million customers affected.  Amazon takes responsibility, works on improving their service.
There are tens of thousands of traffic fatalities per year in the USA, but we don’t hear about re-evaluating whether we should have roads.  There are approximately 700 children under the age of 14 who accidentally drown each year, but we don’t ban swimming pools (by the way, there are around 100 accidental deaths of children by guns each year).

The bottom line is that systems fail and bad things happen.   As with other life lessons, such as losing money by betting it all on a particular stock or, for that matter, on red at a roulette wheel, you take a step back, evaluate the real story, and move forward once again; it does not make sense to put all of your eggs in a single basket.  As the list above shows, there have been high-profile failures of commercial shared telecommunication and IT services (and dare I say Cloud) since the days of the telegraph.  In each case, users had to do a risk-to-benefit analysis of continuing to rely on these services for critical functions of their businesses.   Where necessary, detailed mission reliability evaluations are required.  These help determine if, where, and how to add capabilities to reduce or eliminate potential mission failure.  This is exactly how the research-oriented ARPAnet became NSFNet, became the Internet, and then itself became a robust foundation for virtually every facet of our economy and life.
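
To put a rough number on the eggs-in-one-basket point, here is a back-of-the-envelope sketch in Python.  It is not a real mission reliability evaluation: the 99.9% availability figure is purely illustrative, the function names are my own, and it assumes the two providers fail independently of each other, which real-world providers often do not.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(availabilities):
    """Availability of a service that stays up as long as at least one
    provider is up. Assumes provider failures are independent."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def expected_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Illustrative figure only, not a measured value for any real provider.
single = 0.999
dual = combined_availability([single, single])

print(f"one provider:  {expected_downtime_hours(single):.2f} hours down per year")
print(f"two providers: {expected_downtime_hours(dual):.4f} hours down per year")
```

Under those assumptions, a second independent basket takes the expected downtime from hours per year to well under a minute.  Whether that is worth paying for is exactly the risk-to-benefit question each business has to answer for itself.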

High-profile failures should make us think, but not oversensationalize.   Otherwise, I would (although the statement itself is possible) be typing this in a mud-floored hut.

P.S. My company, Polar Star Consulting, provides IT systems reliability support services, with actionable recommendations for mission assurance from the component level to total systems.  Please contact me or send a message to info@polarstarconsulting.com for more information.
