How to keep cloud service failures from affecting your business

You look at those prices for Amazon cloud services and think you’re getting a deal.

Fact is, you are. You’re hiring a professional staff to run your systems in a very-high-quality environment and paying little for it.

But are you using these cloud services in a way that protects your business?

Forbes' analysis of the Northern Virginia Amazon cloud outage caused by Friday's storm doesn't distinguish between companies that simply don't use the NoVa cloud site and companies that stayed up because they had a better redundancy setup.

Netflix and Instagram are likely re-examining their use of cloud services. I doubt they'll eliminate Amazon, since what happened in Northern Virginia can happen anywhere. They'll more likely discuss cost-effective ways of increasing redundancy that leave them less sensitive to single-location failures.

Questions to consider

Redundancy with transparent switchover to backup systems with no data loss is ideal. Do you need that? Can you afford it?

Ask the right questions when designing your use of cloud services:

  • How much downtime are your customers (internal or external) willing to tolerate?
  • Do you know what an hour of downtime costs internally (lost productivity, inability to serve customers) and externally (refunds, lost customers)?
  • Given those costs, how much downtime can you afford? (A rough cost sketch follows this list.)
  • What notification mechanisms do you need to have in place to trigger the switch? (Or is the switch automatic?)
  • What do you want to happen when a failure occurs?
  • What are you willing to pay for your desired level of redundancy?
  • What would a failure cost your business without that level of redundancy?
  • How do you switch to the redundant system? Is it manual? Transparent?
  • Does your vendor offer redundancy? How does it work?
  • Are your vendor's redundancy sites geographically dispersed?
  • How does your data get replicated?
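To put rough numbers on the cost and affordability questions, here's a minimal sketch in Python. Every dollar figure is a made-up placeholder rather than a benchmark; substitute your own payroll, revenue, and vendor pricing.

    # Rough downtime-cost estimate. All figures are placeholders -- plug in your own.
    STAFF_IDLED = 12               # employees who can't work during an outage
    LOADED_COST_PER_HOUR = 55.0    # average loaded cost per employee-hour ($)
    LOST_REVENUE_PER_HOUR = 900.0  # sales you can't take while systems are down ($)
    REFUNDS_PER_HOUR = 150.0       # credits or refunds promised per hour of downtime ($)

    internal_cost = STAFF_IDLED * LOADED_COST_PER_HOUR
    external_cost = LOST_REVENUE_PER_HOUR + REFUNDS_PER_HOUR
    cost_per_hour = internal_cost + external_cost
    print(f"One hour of downtime costs roughly ${cost_per_hour:,.2f}")

    # Weigh that against the yearly price of a redundancy option
    # (a standby server, a second region, a second provider, etc.).
    REDUNDANCY_PRICE_PER_YEAR = 6000.0
    break_even_hours = REDUNDANCY_PRICE_PER_YEAR / cost_per_hour
    print(f"It pays for itself after {break_even_hours:.1f} hours of avoided downtime per year")

If the break-even figure is well below the downtime you realistically expect in a year, that redundancy option deserves a serious look.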

This really isn't about Amazon. This kind of planning is necessary to protect your business whether you use Rackspace, Amazon, Microsoft Azure or other cloud services. The key is knowing what you want to happen when a failure occurs and designing it into your processes.
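As one illustration of designing the failure response into your processes rather than improvising it, here's a minimal sketch of an automated health check with a failover step, built only on Python's standard library. The URLs are hypothetical placeholders, and in a real setup the "switch" would repoint a DNS record or load balancer and page someone, not just print a message.

    import time
    import urllib.error
    import urllib.request

    # Hypothetical endpoints -- replace with your own primary and standby systems.
    PRIMARY = "https://app.example.com/health"
    STANDBY = "https://app-backup.example.net/health"

    CHECK_INTERVAL_SECONDS = 30
    FAILURES_BEFORE_SWITCH = 3    # require several misses so one blip doesn't trigger a switch

    def is_healthy(url: str, timeout: float = 5.0) -> bool:
        """Return True if the endpoint answers HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def monitor() -> None:
        active = PRIMARY
        consecutive_failures = 0
        while True:
            if is_healthy(active):
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if active == PRIMARY and consecutive_failures >= FAILURES_BEFORE_SWITCH:
                    # Real-world version: update DNS / the load balancer and notify on-call staff.
                    print("Primary unreachable -- switching traffic to standby")
                    active = STANDBY
                    consecutive_failures = 0
            time.sleep(CHECK_INTERVAL_SECONDS)

    if __name__ == "__main__":
        monitor()

Whether the switch is an automated script like this, a vendor feature, or a documented manual procedure matters less than deciding it in advance and testing it.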

Why not keep it all in-house?

It's tempting to keep your data in-house. It somehow seems cheaper, and there's an impression that it's more secure. But locally hosted data carries its own risks.

Locally-hosted systems have a single point of failure. I’ve had clients whose businesses have burned or flooded and others whose servers were stolen. Without a remote location to transition to, you’re down. Can your business handle that? If so, for how long?

Security

Security of internal business data is a concern with cloud vendors. High-quality cloud vendors obtain security certifications such as SAS 70 (financial industry), HIPAA (health care) and PA-DSS (credit cards), which require regular audits to ensure continued compliance. Companies that keep their data in-house are subject to the same requirements – yet they still suffer data losses.

Local data storage doesn’t allow you to escape expensive HIPAA or PA-DSS compliance if those requirements apply to you. In the financial industry, systems are sometimes subject to examination by the OCC (Office of the Comptroller of the Currency) and/or other agencies. But that doesn’t prevent data loss.

Regardless of system/data location, security should be designed into business processes rather than added as an afterthought.
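One practical example of designing security in rather than adding it later: encrypt sensitive files before they ever leave your network, whether they're headed for a cloud bucket or an offsite drive. Here's a minimal sketch using the third-party cryptography package; the file names are placeholders.

    # pip install cryptography
    from cryptography.fernet import Fernet

    # Generate the key once and store it securely -- never alongside the backups.
    key = Fernet.generate_key()
    fernet = Fernet(key)

    # Encrypt the file before any backup or cloud-sync process touches it.
    with open("customer-records.csv", "rb") as src:       # placeholder file name
        ciphertext = fernet.encrypt(src.read())

    with open("customer-records.csv.enc", "wb") as dst:
        dst.write(ciphertext)

    # Later, the same key recovers the original contents.
    with open("customer-records.csv.enc", "rb") as enc:
        plaintext = fernet.decrypt(enc.read())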

Electrical power and internet

Cloud vendors use industrial-class electricity supplies with diesel backup generators. Their investment in these backup systems varies in both capacity and available time on generator, so ask for details. A site's ability to run on diesel for two weeks isn't nearly as important as your ability to switch to another facility in two hours…unless they don't have two hours of generator time.

You can (and should) use an uninterruptible power supply (UPS, a.k.a. battery backup) with automatic voltage regulation (AVR) to protect your local systems, but you still face internet-related downtime if remote staff or clients need to access locally stored data.

Cloud vendors have multiple very-high-speed internet providers so that they are not subject to pressure from any single vendor and so that a single vendor’s downtime doesn’t bring the entire location down. You can do the same, but most small businesses don’t. If remote connectivity is critical to your business, it’s a smart strategy.

Whether your systems are local, cloud-based or both – plan for what happens when the lights go out. It just might save your business.

3 thoughts on “How to keep cloud service failures from affecting your business”

  1. In past events I have read about how to properly set up Amazon services to roll over to available sites. I haven't read details on whether that worked this time – in which case only those that didn't set up their cloud services in the safest way were impacted.

    I thought I read that Netflix was looking at moving away (for cost reasons, I believe). And Netflix is colocating its servers, loaded with the most frequently used content, at major ISPs… Google has done a great deal of this kind of colocation, too.

  2. As the Navy and the aeronautics industry can clearly show with their multiple redundant systems, redundancy has a proven track record, but it's really damned expensive. Deming's philosophies have been worth trillions to Japan since the 1950s, such as "Quality costs less in the end." True, he disagrees with Drucker on management by objectives. It's true that Japan doesn't do innovation as well as it does quality, but for uptime: put two idiots together to guard a door until morning, and you'll show up at dawn to find them both asleep. Three out of four employees cost many multiples more than you pay them, as shown by the 1997 Apple turnaround and the research highlighted in "Topgrading." Redundancy helps with switchovers, unplanned outages, and acts of God, but quality counts a lot. Nothing can help you if your plan is to switch over to a dead system.
    When planning and researching multiple backups through Dropbox, SugarSync and others, I found they all depend on the same Amazon S3 servers. Oops. Still a single point of failure.
    “It’s not if. It’s when.”
    Let's turn systems engineering on its head. Systems need to be so well designed that it's no longer a question of whether your systems will go down, but of how many milliseconds it takes to re-route service through the remaining African, Asian and European servers until the U.S. servers flash-reboot. Aim for a 30-second maximum downtime for any given continent through the S3 server network, and you'll really have something special.
    Restoring a server farm from a clone over a network connection and rebooting it on pared-down, efficient Linux systems should take less and less time as the years go on.
    Humans can and should demand 100% uptime from their machines, with maximum allowable delays under a minute. It’s simply a matter of designing it that way.

    1. Thanks for the great comment, Roger. It mystifies me why they're all focused on the NoVa S3 node. I'm sure it has something to do with quality of service to the Northeast corridor, but it isn't as if all those other Amazon servers are in NoVa. Re: Dropbox etc., I keep an offsite clone of critical files on Amazon and another on Rackspace for just that reason – and the Asian/European failover sites are exactly the sort of thing I'm talking about. The cost is a factor, but so is the cost of downtime. We seek balance in all things, even redundancy. :) One point of failure is just too obvious a problem to have.
