The availability management gap
There are many gaps between the practices of most service providers and the so-called “best” practices. Perhaps one of the largest gaps is in the discipline of managing availability. This common situation is striking. Unlike incident or problem management, which cannot create new value for customers but can only reduce the amount of value destroyed, availability management can make a positive contribution toward increasing the value of services to customers by increasing the time during which services may be used.
Why, then, do most organizations focus on what they view as their pain points, such as incident, change and even configuration management? Are they already so proficient in ensuring the appropriate availability of services that they feel there is little more to be done? This can hardly be the case, given the track record of most service providers. There are several aspects to the answer to this question, which concerns what I consider to be the most important service management discipline of all.
Availability management is confused with other disciplines
Most importantly, service providers have difficulty in apprehending just what availability management is. Many believe that it is not a discipline in its own right. Rather, they believe that availability is merely the reflection of the effectiveness of all the other service management disciplines. Other organizations make no distinction between incident management and availability management. One of my customers, for example, is exceptional in that it has a job entitled “Availability Manager.” In reality, the responsibilities of this job are those of a major incident manager. A common question in my ITIL classes concerns the distinction between availability management and service continuity management. Although clear in theory, the practices in organizations do not often make this distinction. If you ask many organizations how they manage service continuity, the typical answer is to describe the infrastructure features designed to ensure a certain level of availability. And many organizations do not distinguish between monitoring, on the one hand, and availability management, on the other.
A discipline, not a formal process
Part of the confusion about availability management is due to the way in which ITIL presents it. After going to great pains to describe what a process is, with all its formalisms, components, resources, controls and responsibilities, ITIL’s presentation of availability management concerns anything but a process. In order to avoid this schizophrenia, I prefer to refer to the disciplines, rather than the processes, of service management. Certain activities in certain disciplines might very well be executed in a process-oriented way. Other activities might be more appropriately handled via case management, as I have described elsewhere. Depending on the maturity of the service provider organization, yet other activities might be most appropriately performed in an ad hoc manner.
A case study
Given the rampant confusion about the nature of the management of availability, it is worthwhile to examine more closely what availability management could be. We may start with a case drawn from one of my former customers. The organization provided a certain number of free and paid services on the Internet. These services were generally owned by the marketing department and entirely dependent on IT for their delivery. At one point, the marketing director started to receive complaints about the unavailability of one of the key services. A task force was created in IT to examine the issue and determine the causes of the unavailability.
There were certain implicit factors in the work of the task force. First of all, there was no end-to-end monitoring of the availability of the service. It was largely impossible to establish any objective measure of availability, no matter what the metric. It was therefore also impossible to determine whether there had been any change in the availability of the service. This situation was exacerbated by the organization’s approach to change control. The team responsible for web services delivery implemented changes and new releases on test platforms. Once those changes had been tested and approved, the RFCs would never describe the real changes being made. Instead, the change was simply to switch the test and the production platforms, a simple change of IP addresses. As a result, the change log provided no help in determining whether a specific change could be related to increased unavailability. The result of this situation was that IT simply took on faith the assumption that there was an unacceptable level of availability, and that the perception of the marketing director was sufficient.
The task force itself was composed of representatives of each of the technologies used in the web service stack: the routers, the load-balancers, the firewalls, the application servers, the database servers and so forth. Each of those representatives was asked to investigate the availability of the components for which they were responsible to identify the potential cause of the issue. Needless to say, as we went around the table, each engineer assured everyone else that his or her own technology was not at fault, that the cause must be elsewhere.
After a series of new releases that sooner or later changed virtually the entire infrastructure, the perception of a problem eventually disappeared. Other, more painful and more tangible issues always seemed to take priority over this one. The case was ultimately closed, with the remark that the issue could not be reproduced.
What can we learn from this case? What were the issues in this case that ought to have been addressed by availability management? For, it goes without saying that the organization had no availability management discipline in place.
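One thing the task force lacked was any objective baseline. As a minimal sketch of what an end-to-end measure could look like, assume a synthetic probe that checks the service from the user's side at regular intervals and records whether it answered correctly. The `ProbeResult` record, the `availability` function and the sample data are hypothetical illustrations, not a description of what the organization in the case actually had.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ProbeResult:
    """One end-to-end check of the service (hypothetical record format)."""
    timestamp: datetime
    ok: bool  # True if the service answered correctly from the user's side


def availability(results: list[ProbeResult]) -> float:
    """Return the fraction of probes that succeeded, between 0.0 and 1.0.

    Assumes evenly spaced probes, so that each probe stands for an
    equal slice of the agreed service time.
    """
    if not results:
        raise ValueError("no probe results: availability cannot be measured")
    successes = sum(1 for r in results if r.ok)
    return successes / len(results)


# Illustrative data only: one probe per minute for an hour,
# with a simulated three-minute outage starting at minute 20.
start = datetime(2024, 1, 1, 9, 0)
samples = [
    ProbeResult(start + timedelta(minutes=i), ok=not (20 <= i < 23))
    for i in range(60)
]
print(f"Availability over the hour: {availability(samples):.1%}")
```

Even a measure as crude as this one would have allowed the task force to say whether availability had actually changed, rather than relying on perception alone.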
Achieving goals for which no one is responsible
The achievement of goals for which no one is responsible is probably a hit-or-miss affair. Think of a new home with walls painted white. If no one is responsible for keeping those walls clean and white, they will nonetheless remain in good shape, at least for a limited amount of time. It is inevitable, though, that they become dirty, stained or damaged through simple use. So it is with availability. It is possible to design and implement a service that will remain perfectly available until the day that a change, an incident, an evolution in requirements or a degradation in the components results in an unexpected and undesired unavailability. It is even imaginable that an organization performs tests on the availability of a new service and, finding that the availability is perfectly adequate, decides that there is no special need to manage its availability.
Availability management cuts across all services
In the time since best practice frameworks, such as ITIL, first talked about the need for managing availability, it has become increasingly common for IT service providers to assign product or service managers to their various services. The product or service manager is globally accountable for the delivery of a service or a line of services, including the availability of the service. Does the existence of a service manager preclude or obviate the need for an availability manager? Since the service management disciplines are horizontal in nature, cutting across all services, the need for an availability manager is just as present as the need for a change manager, an incident manager, a security manager or a capacity manager. Why is that? The simple answer is that availability management is a discipline consisting of its own knowledge, its own methods and its own tools.
Often, a service provider has a very narrow conception of what availability management ought to be, limiting itself to the monitoring of the availability of components and, in some cases, of services. In these cases, it might happen that the engineers or administrators responsible for the configuration and maintenance of the availability monitoring tools are considered to obviate the need for any other availability management function. The great drawback of this belief is that availability management may be confined to the reactive gathering of information about unavailability.
An additional disadvantage is the tendency to develop redundant point solutions for monitoring all sorts of technologies and services. I believe that the reason for this is that those responsible for monitoring availability do not have a stake in ensuring the availability of whatever they monitor. Indeed, the less a service or a component is available, the more the monitoring appears to be needed. Moreover, having a lot of different tools for monitoring availability might be seen to justify increasing the monitoring staff and creating a little, indispensable fiefdom, hidden at the heart of IT. This might sound like an unfair accusation. However, without an availability manager, who is to champion the need for increased availability, yet decreased costs? The CIO might treasure these values, but hardly has the skills or time to realize them. The service managers want the best tools for their own services, independently of the global effect on IT staffing and budget. And the monitoring specialists themselves are only too happy to feel needed and important. They are unlikely to design themselves out of their own jobs.
The breadth of availability management methods
It is useful to think about managing for availability by looking at the principal reasons for unavailability. These reasons may be grouped according to the four pillars of service management: people, processes, products and partners.
Unavailability due to people
Typical causes are:
- Lack of knowledge
- Lack of training
- Failure to follow procedures
- No one responsible
- Responsibility cuts across organizational boundaries
- End users failing to use a service as agreed
Unavailability due to processes
Processes may be:
- Poorly defined
- Difficult to apply
- Lacking in metrics
- Unsupported by appropriate tools.
Unavailability due to products
The products, or components, used to deliver a service might have inappropriate:
Unavailability due to partners
The partners, that is to say, the suppliers of a service provider, might:
- fail to deliver their own services
- deliver services without any defined service levels
- deliver services with ambiguously defined service levels.
Think of the breadth of everything an availability manager should know in order to be able to manage methods for measuring and avoiding all these different sources of unavailability. On this basis alone, I am compelled to believe that availability cannot be holistically and adequately managed by anyone other than a dedicated availability manager. The panoply of vision and skills required would make any such availability manager a good candidate for a future CIO.
Prioritizing availability improvement
I have suggested above 19 different categories of causes for unavailability. There are certainly other categories to be considered, too. Now, let us assume, as in the case study provided above, that there is a general perception among the customers of inadequate service availability.
This issue should be the bread and butter of reactive availability management. It is not really a problem management issue, because no specific problem can yet be articulated. Problem management might come later. Nor is the issue a service improvement register issue. One hopes that specific improvement initiatives will be suggested, each of which shall have to be assessed for inclusion in the service improvement register. In other words, there is a need for a high level evaluation of all the different possible improvements to availability, in all the different categories I have listed and cutting across all the services, technologies, organizational units and suppliers in use.
Sharing knowledge, methods and standards
In a relatively small organization with a highly entrepreneurial culture, a bottom-up approach to developing knowledge, methods and standards for managing availability might be feasible. In large organizations and in those plagued by technology silos, it might be necessary to implement or foster techniques to ensure that availability knowledge becomes known and is reused. An availability manager and the tools used by the availability manager can be important repositories of availability knowledge. This knowledge might be used via formal or informal techniques, be communicated using tools or person-to-person and be promulgated proactively or reactively. What are some of the scenarios that might be used to support the sharing of availability knowledge?
Service design controls
An organization may wish to institute a formal requirement that the design of any new service consider the state of the art in ensuring the appropriate availability of that service. Whether this is done by consulting experts in the domain, using accepted methods or by demonstrating in practice the availability of the service will depend on the capabilities of that organization. Ideally, all persons involved in service design will have demonstrated the necessary qualifications and experience to ensure that availability. If this is not possible, the organization may also resort to formal controls, such as checklists or tests.
Integration in change control
Any system or service should be delivered at a level of availability based on certain assumptions and design features. A change may invalidate one or more of those assumptions or may invalidate one of those features. Change control should ensure that the change is assessed against this possibility. One of the methods usable to support this assessment is the consultation of the historical record of how changes in the past may have had undesirable effects on availability.
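Consulting the historical record can be as simple as asking which past changes were followed closely by unavailability. The following is a minimal sketch under stated assumptions: the `Change` and `Outage` records, the 24-hour window and the sample identifiers are all hypothetical, and proximity in time only suggests, and does not prove, that a change caused an outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A change log entry (hypothetical format)."""
    change_id: str
    implemented_at: datetime


@dataclass(frozen=True)
class Outage:
    """A recorded period of unavailability (hypothetical format)."""
    started_at: datetime
    minutes: int


def downtime_by_change(changes, outages, window=timedelta(hours=24)):
    """Map each change id to the outage minutes that began within
    `window` after the change was implemented.

    The window length is an assumption to be tuned per organization.
    """
    totals = {}
    for change in changes:
        end = change.implemented_at + window
        totals[change.change_id] = sum(
            o.minutes
            for o in outages
            if change.implemented_at <= o.started_at <= end
        )
    return totals


# Illustrative data only.
changes = [
    Change("CHG-001", datetime(2024, 3, 4, 22, 0)),
    Change("CHG-002", datetime(2024, 3, 11, 22, 0)),
]
outages = [
    Outage(datetime(2024, 3, 5, 8, 30), 45),
    Outage(datetime(2024, 3, 5, 14, 0), 10),
    Outage(datetime(2024, 3, 20, 9, 0), 30),
]
for change_id, minutes in downtime_by_change(changes, outages).items():
    print(f"{change_id}: {minutes} minutes of downtime within 24 hours")
```

Such a record is only as useful as the change descriptions behind it. A change log that records nothing but a switch of platforms, as in the case study, gives this kind of assessment nothing to work with.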
Informal sharing of knowledge
Julian E. Orr, in his book Talking About Machines: An Ethnography of a Modern Job, described the importance of informal sharing by technical personnel of knowledge about diagnosing and resolving technical issues. This sharing is surely as important for proactive design activities as it is for reactive repair and maintenance. An organization may foster this sharing in various informal or semi-formal ways. The simple fact of making known to the workforce as a whole who is involved in designing for availability will make it easier to ask questions and discuss approaches. The so-called social networking tools may also make it easier for people to become aware of availability issues, methods and standards—insofar as the organization makes such tools available and encourages people to use them. A somewhat more formal approach is to sponsor periodic meetings where a relaxed and convivial atmosphere gets people to talk about their interests. A typical forum for such meetings is a lunch break, where the organization encourages participation by providing refreshments.
What availability management might be
In summary, availability management might be a service management discipline that:
- has methods, knowledge and roles distinct from the other service management disciplines
- cuts across all services, being a vector for sharing availability knowledge
- shares with capacity management the characteristic of being able to increase the value of services, unlike most of the other service management disciplines, which can only limit the destruction of the value of services.