Service levels have always been a critical business continuity metric for companies. That was true even in the years when companies managed their entire technology portfolio themselves. With technology increasingly moving to the cloud, SLAs and the recovery metrics (RTO and RPO) matter even more, to both the service provider and the service consumers.
Notwithstanding the five nines (99.999%) or six nines (99.9999%) guarantees, unplanned outages are unavoidable. That is the reality. Increasing the degree of automation in operations might reduce the potential for an outage but will not provide a 100% guarantee.
This past Sunday (11/14/2010) GitHub experienced one such accidental outage. GitHub, for those who are not aware, is a community-based collaborative version control service used by most open source, startup and web 2.0 projects. Amongst its customers, it lists Facebook, Twitter, Digg, Yahoo and open source projects like jQuery, Ruby on Rails, CakePHP, Hibernate and many more. Given all the marquee names, you would expect a lot of noise when there was an unplanned service outage. Blogs, the Twitter stream and sundry would have been raging with upset users/customers. But what happened was exactly the opposite. Customers came out in numbers and expressed empathy with GitHub. What could have caused this noble behavior?
In sharing the outage information, GitHub also made an extra effort to be transparent. They came out and shared the details of what caused the outage. Gist: a configuration error caused a test suite to be targeted at a production database instead of the test database, and in the process the production database was recreated. Notice in the explanation that they did not hide the fact that they did lose some data between the last good backup and the time of deletion. I think that was extremely smart of them to do.
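(A tangent for the technically inclined: the cheapest insurance against this class of configuration error is a guard that refuses to run destructive test setup against anything that looks like production. Below is a minimal sketch in Python; the environment variable names and host patterns are my own illustrative assumptions, not GitHub's actual setup. The point is not the specific check but that the harness, rather than a human, is the last line of defense.)

```python
import os
import sys

# Hypothetical guard: refuse to run destructive test setup against anything
# that looks like production. Variable names and patterns are illustrative
# assumptions, not GitHub's actual configuration.
PRODUCTION_HINTS = ("prod", "production")

def assert_safe_test_target():
    db_url = os.environ.get("TEST_DATABASE_URL", "")
    env = os.environ.get("APP_ENV", "").lower()
    if env in PRODUCTION_HINTS or any(hint in db_url.lower() for hint in PRODUCTION_HINTS):
        sys.exit("Refusing to run test suite: target looks like a production database.")

if __name__ == "__main__":
    assert_safe_test_target()
    # ...only now drop/recreate the *test* schema and run the suite...
```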
Workday did what amounts to a best practice in outage management when they had a really bad outage late last year. Their president Aneel Bhusri led the charge, reached out to all their customers and kept them apprised. To prevent a Twitter/blogosphere wildfire, they also reached out to bloggers/analysts and explained what had occurred and the remediation. Obviously, some customers did feel the impact of that outage, but the fact that Workday was out there being transparent, rather than hiding behind an automated answering machine, made customers take the outage in their stride.
Outages are part of life, Cloud or no Cloud. Your own datacenter might have an outage. I remember, back in 2000, a plumber cutting a cable at Oracle and shutting down our datacenter for a day and more. I also speak from first-hand experience in managing one such mishap. While running operations for Oracle Exchange (the B2B marketplace), and while Ford was running an auction no less, I accidentally kicked off the backup process on the wrong database and brought the site down. Needless to say, my boss was less than amused, as he owed an update to Larry later that day.
So what are some good-karma recommendations for dealing with outages?
- Outages are part of life. There have been worse outages in the past and there will be worse ones in the future. So there is no point hiding behind an automated answering machine or a static web page. The Fail Whale works for services like Twitter but not for mission critical services. So make it human and get ahead of it.
- Identify the leader who will manage the outage and make them available so (concerned) customers can reach out. Preferably the CEO or the VP of Client Success, not someone from sales/marketing. Listing their contact information on your website would be best.
- Create an outage dashboard and provide a detailed update on what transpired. Treat this as a project with milestones and provide regular updates.
- Maintain an FAQ page that lists all the questions you think people might ask, as well as those you have already addressed.
- Summarize the events of the days after you have taken care of the problem.
If you are one of the better cloud service providers, you probably already have a status page like Amazon AWS or Salesforce.com. Make sure that page links to the outage dashboard and the FAQ page, and shows a red indicator while the service is down.
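A machine-readable version of that status helps too, so monitoring tools and customer scripts can consume it alongside the human-facing page. Here is a minimal sketch, assuming a JSON payload of my own design (not the actual Amazon AWS or Salesforce.com status formats):

```python
import json
from datetime import datetime, timezone

# Illustrative status payload; field names and URLs are my own assumptions,
# not the actual Amazon AWS or Salesforce.com status formats.
status = {
    "service": "example-cloud-service",
    "indicator": "red",  # red while the service is down, green otherwise
    "updated_at": datetime.now(timezone.utc).isoformat(),
    "links": {
        "outage_dashboard": "https://status.example.com/outage",  # hypothetical URL
        "faq": "https://status.example.com/faq",                  # hypothetical URL
    },
}

print(json.dumps(status, indent=2))
```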
I tell my clients that are evaluating cloud-based services to measure companies by the way they handle a crisis. During good times everyone can make tall claims, display status scorecards, flashing billboards and so on. It is when things fall apart that true leaders demonstrate their quality.
If you want to know how customers react to transparency, check out this sampling of responses to the GitHub announcement.
November 16th, 2010 at 10:22 PM
Great post Subraya!
Transparency is great unless you are trying to be unethical. The corollary is that these days, if you are not transparent, people immediately suspect that you have something to hide, and that kicks up even more dirt than the original damage. A case in point was Mike Arrington barging into some kind of junta meeting of VCs. It was a mess. Not that I subscribe to what Mike did, but in today's times, it just attracts too much attention!
November 17th, 2010 at 5:01 PM
Thanks Bala.
The way I look at it, in business, transparency is a must, especially when you have a binding contract with companies that have taken a great risk on your company. In the case of public companies, compliance mandates enforce that. But even with the mandates in place, the fundamental basis of any relationship is transparency. Transparency begets trust, as they say.