ENISA to Cloud Service Providers: Define "Availability"

How many 9s have cloud service providers offered you lately with respect to their service availability? Yesterday, as part of its effort to help put Europe back on track with cloud services adoption, ENISA – the public agency responsible for the security of Europe’s information services – published a new set of surprisingly legible recommendations, not just for public-sector organizations but for private-sector firms as well, on how to evaluate a cloud service provider’s (CSP) performance during a security event and determine whether it’s living up to the terms spelled out in its SLAs.

ENISA interviewed both American and European sources for its research. And in a refreshing act of globalization, the agency credits the U.S. government – the very creator of the Patriot Act blamed for keeping Europe on the dark side of the cloud – with initiating the trend away from periodic security reviews and toward continuous monitoring.

“Both the CSP and the customer must be able to respond to changes in the threat environment on a continuous basis. It is essential to monitor the on-going implementation of security controls and the fulfillment of key security objectives,” reads the ENISA report, entitled “Procure Secure” (PDF available here). “This is also described as a priority in the US government’s 2010 report on the implementation of the federal information security management act (FISMA). The report notes a shift in strategy ‘from periodic security reviews to continuously monitoring and remediating IT security vulnerabilities.'”

The typical legislative stance toward performance monitoring, especially in the U.S., leans toward periodic review. How else, legislators often reason, does one decide when to publish the interim and final reports? But the ENISA report suggests that businesses not only move away from this sort of cyclical motion, but also embrace the notion of real-time continuous monitoring even before moment one. In fact, ENISA suggests that businesses only allow prospective CSP bidders to approach them during the Request for Proposals (RfP) phase if those bidders come with clear definitions in hand of the metrics their customers may use to evaluate their performance.

Most importantly, ENISA says, businesses should nail down what the CSP means by “availability.” In many cases, a service may be “up,” from the CSP’s point of view, even though it’s not really responsive. And if it is responsive, and the response is no more than a very fancy 404, then that might be considered “up” too.

Ideally, the document proposes, a CSP should present “a target percentage of total operational time or requests, for which a service should be considered available over a given period (typically a month or a year).” And the CSP should be clear about what it means by “service.” The concept should include a finite number of core functions for which an HTTP request may be placed, and to which the service should respond. When it fails to respond, how long a period must elapse before it’s declared a failure? Five minutes? An hour? A day? And for the customer’s own edification, how many service requests may be reliably placed (the “sample size”) before that target percentage starts to break down? ENISA warns this should be a moderate number – not too low or it’s meaningless, but also not too high lest “extreme events” fail to be taken into account.
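To make the definition concrete, here is a minimal Python sketch of how a customer might score availability from its own request samples. It is an illustration, not something taken from the ENISA report: the `Probe` record, the five-second timeout, and the rule that only a 2xx reply counts as "up" are all assumptions, and a real SLA would substitute its own terms.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Probe:
    """One sampled request against a core function (hypothetical record)."""
    status: int        # HTTP status code, or 0 if no reply arrived at all
    latency_s: float   # seconds until the reply arrived


def is_available(probe: Probe, timeout_s: float = 5.0) -> bool:
    # Assumption: "up" means a 2xx reply inside the agreed timeout.
    # A slow reply, or a very fancy 404, counts as unavailable.
    return 200 <= probe.status < 300 and probe.latency_s <= timeout_s


def availability_pct(probes: Sequence[Probe], timeout_s: float = 5.0) -> float:
    # Request-based availability: share of sampled requests that were "up".
    if not probes:
        raise ValueError("availability is undefined for an empty sample")
    ok = sum(is_available(p, timeout_s) for p in probes)
    return 100.0 * ok / len(probes)


def downtime_budget_minutes(target_pct: float, period_days: float = 30.0) -> float:
    # Time-based view: minutes of unavailability the target tolerates.
    return (1.0 - target_pct / 100.0) * period_days * 24 * 60


probes = [Probe(200, 0.2), Probe(200, 0.4), Probe(404, 0.1), Probe(200, 9.0)]
print(availability_pct(probes))                 # 50.0
print(round(downtime_budget_minutes(99.9), 1))  # 43.2
```

With only four samples, the 50 percent figure says very little, which is the report's point about sample size. The second function shows what "three 9s" buys over a 30-day month: roughly 43 minutes of permitted downtime.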

“Procure Secure” states that CSPs should provide their customers not just with an availability percentage, but a metric called the recovery time objective (RTO). You’ll be hearing more about this metric as we talk more and more about resilience principles in cloud architectures. Here’s why it matters: Both public and private IaaS cloud systems enable customers to deploy “warm stand-by” services – literally VMs that are up-and-running, but doing nothing but waiting. They’re waiting for a significant failure event, in which case they can cut in.

So if a CSP offers an RTO of, say, one minute, you may want to consider a deployment that lets your warm stand-bys kick in after no shorter period of time than 60 seconds. Banking customers, ENISA notes, just don’t want to wait longer than a minute. But if that RTO is shorter, you may not need those stand-bys after all. Still, the report says, it’s important for businesses to make bidding CSPs specify what they’re talking about, especially because a “service” to a CSP and a “service” to its customers can mean very different things.
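The reasoning in that paragraph boils down to comparing the provider's RTO with the longest outage your own customers will tolerate. The Python sketch below is a simplification under stated assumptions: the function name and both parameters are hypothetical, and it treats the RTO as if the provider always meets it, which an objective does not guarantee.

```python
def needs_warm_standby(provider_rto_s: float, max_tolerable_outage_s: float) -> bool:
    # Assumption: the provider recovers within its stated RTO.
    # If that recovery is strictly faster than what the business can
    # tolerate, stand-bys may be unnecessary; otherwise they cover the gap.
    if provider_rto_s < 0 or max_tolerable_outage_s < 0:
        raise ValueError("durations must be non-negative")
    return provider_rto_s >= max_tolerable_outage_s


# A banking-style tolerance of 60 seconds against two hypothetical RTOs.
print(needs_warm_standby(provider_rto_s=60, max_tolerable_outage_s=60))  # True
print(needs_warm_standby(provider_rto_s=20, max_tolerable_outage_s=60))  # False
```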

“In applying RTO metrics, it is very important to understand exactly what recovery applies to and how it relates to service delivery,” states the report. “In an IaaS environment, the provider is likely to provide recovery time objectives for system components, rather than at the overall system level. For example, if an IaaS provider specifies an RTO for storage volume availability, it could be relatively long, on the assumption that the customer will be using multiple redundant volumes in a RAID configuration and can use this to offer a much more resilient service to its own customers. On the other hand, if the same provider specifies an RTO for the storage volume provisioning system, this might be relatively shorter because the customer cannot apply appropriate elasticity measures when the storage volume provisioning system is down. An IaaS provider would typically design a service with the expectation that the customer will build a resilient system using less resilient components.”
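The idea of building a resilient system from less resilient components can be illustrated with a little arithmetic. The Python sketch below rests on two assumptions that are not in the report: that each redundant volume fails independently of the others, and that the system stays up as long as at least one volume is up (closer to simple mirroring than to every RAID level). The function name and the 99 percent figure are illustrative only.

```python
def combined_availability(component_availability: float, replicas: int) -> float:
    # Probability that at least one of `replicas` independent components
    # is up, given each is up with probability `component_availability`.
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - component_availability) ** replicas


# Two mirrored volumes, each available 99% of the time.
print(round(combined_availability(0.99, 2), 6))  # 0.9999
```

Under those assumptions, two "two 9s" volumes yield a "four 9s" pair, which is why a provider can quote a relatively long RTO for a single volume. Correlated failures, such as a shared provisioning system going down, break the independence assumption and erase much of that gain.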

The ENISA report closes with an exhaustive eight-page checklist full of questions that every prospective cloud customer should ask. For example, “What do log management and forensics mean for your organization and your risks?” Breaking this down, the checklist reminds you to consider whether your logs are important to the law enforcement agencies in your area. (This is a valid question whether you’re American or European.) And here’s a favorite of mine: “What metrics and reporting are in place to support monitoring incident response?” When an incident is responded to by some automated system, does that count as “response”? And are you allowed to monitor the impact that the response had on the incident – did it mitigate the damage? Did the CSP take too long to respond? And does the CSP owe you a refund for tardy response times? With so many variables playing into the modern notion of “up-time,” you need to take account of every last one.