Incident management is halfway between BPM and ACM
I have discussed elsewhere in these columns the relationship between process-oriented work and adaptive case management. This framework will help us to refine the understanding of how incidents may best be handled. Incident handling is a good example of work that has features of process-oriented work, such as BPM, and case-oriented work, such as ACM. What are some of the patterns for managing incidents?
What are the issues?
Processes define in advance when information needs to be provided, from where and in which format. That information is used as input to one or more activities in the process, activities that transform the information to create some output and reach some objective. For certain types of incidents we should be able to determine this sort of information in advance and handle those incidents in a well-defined and predictable way. For other incidents, we cannot know in advance where we will find the information that allows us to understand how to resolve the incident. We do not always know how the incident may be resolved, nor by whom. In the early stage of handling an incident, its impact is often known only by vague indicators, such as the number of tickets or the number of persons impacted. It is only after the incident is resolved that we can estimate how much revenue or productivity may have been lost. Similarly, incident urgency may evolve during the course of handling it. Sometimes, no particular urgency is identified at the beginning, whereas a very specific urgency may be revealed later on.
On the other hand, incidents should not be handled according to pure adaptive case management principles. The basic process pattern—identification and recording; classification; investigation and diagnosis; resolution; closure—should be used for all incidents. Most of the unpredictability occurs at a more detailed level, inside the high-level activities of the process.
Patterns of impact assessment
Incident impact is, according to frameworks such as ITIL®, to be determined very early in the process. It is used, together with urgency, to determine the priority of the incident. Thus, it determines the pattern of resource allocation to an incident and may influence the overall resolution time. It stands to reason that the more accurately we can assess impact, the closer we can come to the goal of incident management, which is to limit that impact as much as possible. The patterns for assessment of impact vary according to the structure, nature and maturity of the service provider-customer tandem and they may evolve during the lifetime of the incident.
In addition, accurate assessment of incident impact is important because it serves to prioritize improvement initiatives. If the impacts of incidents are incorrectly recorded, the priorities of the problems causing those incidents are not likely to be correctly set. If the limited resources available for managing problems are not effectively used, because problems of lesser significance are given too high a priority, the credibility and usefulness of problem management in particular, and of continual service improvement as a whole, are put at risk.
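To make the relationship among impact, urgency and priority concrete, here is a minimal sketch of a priority matrix in Python. The three levels and the specific mapping are illustrative assumptions, not a prescription from ITIL® or any other framework.

```python
# Illustrative priority matrix: priority (1 = highest, 5 = lowest)
# is derived from the impact and urgency recorded for an incident.
# The levels and the mapping itself are assumptions for this sketch.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("high", "low"): 3,
    ("medium", "high"): 2,
    ("medium", "medium"): 3,
    ("medium", "low"): 4,
    ("low", "high"): 3,
    ("low", "medium"): 4,
    ("low", "low"): 5,
}

def priority(impact: str, urgency: str) -> int:
    """Look up the priority of an incident from its impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]
```

Note that if either impact or urgency is re-assessed during the life of the incident, the same lookup yields the new priority; this is the mechanism behind the re-prioritization scenarios discussed later in this article.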
The service provider-customer tandem
When a service provider is internal to the customer organization, it might have access to information about the business processes supported by the IT services. When the customer is part of an external organization, however, this information is normally not available. The service provider then depends on the increasingly frenzied complaints of the customers to assess what the impact might be.
The nature of the supported business processes
Business processes may range from largely manual processes that use an IT service for a very specific task to highly automated processes enabled and operated as IT services. In the latter case, the operation of the process may be measured directly using technical means. Throughput, revenue acquired, costs, etc., are directly measurable.
The case of automated business processes
For example, a company might run a largely automated video streaming service for which customers pay to see films. Such a company would be particularly interested in incidents that prevent a customer from viewing a film correctly, or from viewing it at all. One might imagine a metric such as the “abandoned film rate”, or incidents caused by poor throughput. In addition, the number of orders for films during such incidents may be compared to similar times of the day and days of the week, as measured in the past, giving a direct measurement, within certain tolerances, of the business lost due to the incident. Thus, the impact of an incident that hinders the service may be measured directly.
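Such a comparison against historical baselines might be sketched as follows; the figures, the function name and the pricing are purely illustrative assumptions.

```python
from statistics import mean

# Hypothetical sketch: estimate the orders lost during an incident by
# comparing the count observed during the incident window with the
# historical average for the same hour and day of the week.
def estimate_lost_orders(observed, historical_samples, price_per_order):
    """historical_samples: order counts from comparable past windows."""
    baseline = mean(historical_samples)
    lost = max(baseline - observed, 0)      # never report negative loss
    return lost, lost * price_per_order     # orders lost, revenue lost

lost, revenue = estimate_lost_orders(
    observed=40,                            # orders during the incident
    historical_samples=[95, 105, 100],      # same hour, past three weeks
    price_per_order=4.0,
)
# lost == 60 and revenue == 240.0
```

The tolerance mentioned above corresponds to the natural variance of the historical samples; a more careful version would report a range rather than a point estimate.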
The case of impact predictions based on configuration models and BIA
When a business process is not automated, or when the process cycle time is significantly longer than the duration of the incident, such direct measurement of impact is hardly possible during the incident itself. In such cases, estimates of impact during the early stages of the incident may be based on previously completed business impact analyses (BIA). These analyses, typically performed as part of service continuity management, may also be used to assess the impact of incidents. If good configuration models are in place, it should be possible to easily determine which services are affected by a failure in a component.
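As a rough illustration of how a configuration model supports that determination, the sketch below represents the model as a dependency graph and walks it to find everything downstream of a failed component. The component and service names are, of course, invented.

```python
# Hypothetical configuration model: for each component, the list of
# items that depend directly on it. A failed component affects every
# service reachable downstream of it.
DEPENDS_ON_ME = {
    "db-server": ["billing-app", "crm-app"],
    "billing-app": ["billing-service"],
    "crm-app": ["crm-service"],
    "billing-service": [],
    "crm-service": [],
}

def affected_services(failed, graph=DEPENDS_ON_ME):
    """Return every item reachable downstream of the failed component."""
    seen, stack = set(), [failed]
    while stack:
        node = stack.pop()
        for dependant in graph.get(node, []):
            if dependant not in seen:
                seen.add(dependant)
                stack.append(dependant)
    return sorted(seen)

affected_services("db-server")
# → ['billing-app', 'billing-service', 'crm-app', 'crm-service']
```

A real CMDB would distinguish relationship types and degraded versus failed states; the point here is only that the traversal is mechanical and need not be done in a technician's head.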
The deliverable of a BIA is a time-dependent function, impact = f(time). A typical result of the analysis is that there is an initial level of impact that increases over time, whether the slope of the increase is linear, exponential, geometric, or other. Organizations that do not know how to measure important things will also add a “non-tangible” component to the impact, described in words rather than in financial value. Finally, the function might be discontinuous, implying that events subsequent to the incident might significantly change its impact.
Whatever form the resulting function might take, it could be evaluated at a time T, calculated as the time since the incident occurred. If the IT system is inadequately monitored, T might instead be calculated as the time since the incident was first detected. Figure 1 summarizes the calculation, which may be described as follows: the total impact of an incident at a given time (that is, now) equals the integral of the BIA function from when the incident was detected (or when it really occurred) until now.
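The calculation can be sketched as follows, assuming the BIA delivers an impact rate f(t), say in currency units per hour, t hours after the incident began. Both the sample BIA function and the use of simple trapezoidal integration are illustrative assumptions.

```python
# Sketch: total impact "now" as the integral of a BIA impact-rate
# function from the start of the incident (t = 0) until T hours later.
def bia_rate(t_hours: float) -> float:
    """Illustrative BIA function: a flat initial impact rate that
    doubles after a discontinuity at 2 hours (e.g. a missed deadline)."""
    return 100.0 if t_hours < 2.0 else 200.0

def total_impact(rate, t_elapsed: float, steps: int = 1000) -> float:
    """Trapezoidal integral of the impact rate from 0 to t_elapsed."""
    h = t_elapsed / steps
    s = 0.5 * (rate(0.0) + rate(t_elapsed))
    for i in range(1, steps):
        s += rate(i * h)
    return s * h
```

With this sample function, an incident resolved within one hour has accumulated roughly 100 units of impact, whereas one lasting four hours has accumulated roughly 600, since the rate doubles after the two-hour discontinuity. An incident management tool holding the BIA functions could perform exactly this evaluation automatically.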
At first glance, this way of assessing incident impact might seem needlessly complicated or overkill. Indeed, most organizations probably do not make business impact analyses with mathematical rigor. I fully agree that it would be a waste of time to perform such analyses for every incident. However, if the BIA had been previously prepared and if the incident management tool could calculate the impact automatically and rapidly, an objective and powerful means for incident impact assessment would be available. This approach would be an improvement over systems where a service desk agent makes a rough, and often inaccurate, estimate based on extremely limited knowledge. In addition, the equation gives a flexible way of predicting the impact at any given moment, helping to support the reallocation of scarce incident management resources, should this be necessary. Be that as it may, why do we insist on continuing to have human beings try to navigate configuration models, BIAs, SLAs and OLAs to estimate incident impact when this could be done much more rapidly and reliably using software?
The case of impact assessment by indirect factors
The fact is that many organizations do not do business impact analyses, or do not deliver impact functions, or do not have in place the configuration models that help determine which services are affected by failures in components. And no standard incident management tools make the sort of calculations I have presented. In this final pattern, the service desk must resort to the most primitive of initial investigations and estimations of impact. Such organizations fall back on the very simple and intuitive measurements of such factors as the number of persons impacted (“Are you the only one to have this problem, or do the people working around you also have the same problem?”). When working on this basis, it is important to periodically re-evaluate the incident impact, as additional information about impact is likely to arrive at varied and unpredictable moments.
Assessing the evolution of impact
Best practice calls for an assessment of the business impact of major incidents, as part of the incident post mortem. Depending on the patterns in use, as described above, and depending on the quality of the relevant service management activities, the post hoc evaluation of impact will be more or less similar to the predictions of impact made in the early stages of incident handling.
The benefit of such analyses depends on the use of the information to make improvements, either in the handling of incidents themselves, or in the general management of services. If an organization limits itself to the management of failure, as defined in the Cost of Quality approach, then these improvements may appear to be surprising revelations. They would be, in any case, expensive to implement, relative to achieving the same level of quality via prevention activities, or even via appraisal activities.
If the post hoc evaluation of impact is compared to the initial assessment, any gap between the two may serve to improve the initial impact assessment. This improvement could take various forms, depending on whether the service provider is using direct feedback from automated business processes, predicted impact based on a business impact analysis function, or estimated impact based on indirect indicators, such as the number of users or customers affected.
The lessons learned in major incident management may be applied to other incidents, especially those whose impact is high. The additional effort required to make post hoc impact assessments of incidents and to use those assessments to make improvements might be difficult to justify for incidents with relatively low impacts.
One might ask if there is any reason to assess the evolution of impact during the investigation and the resolution of an incident. The simple answer to this question is “No.” If we assume that an incident is to be resolved as quickly as possible, little is to be gained from changing its recorded impact or its priority once the handling of the incident has started. An exception to this rule occurs when a new type of impact, for what is presumably the same incident, has been detected. This new information may help move the investigation forward or might force a revision of the proposed resolution.
On the other hand, before the investigation of an incident has started but after it has first been classified, a change in impact might result in a change in priority, thereby moving the incident higher (or lower) in the queue of open incidents. This issue is significant only if the functional unit or team responsible for the incident regularly has a backlog of simultaneous incidents to resolve. Such backlogs might occur on an exceptional basis, but if they become too frequent, then it is a sign that upstream activities, such as service design and testing, are severely inadequate and are allowing services into production that are much too unreliable. Alternatively, it is likely to signal an understaffed team.
The evolution of urgency
Urgency is another incident attribute that might evolve during incident handling. It differs from impact in several key ways, however. First of all, urgency is not under the control of the service provider. It is, by definition, determined by unpredictable events external to the service provider. Therefore, the service provider cannot influence or reduce the urgency of incidents. Knowledge of an incident's correct urgency, gained only after the incident has been resolved, cannot be used to improve the investigation and resolution of that incident, even if the urgency was initially misjudged. It can only be used to help improve the assessment of urgency.
As with changing impact during incident investigation and resolution, a change in urgency is most often not going to change incident handling, if we assume, once again, that all incidents are being resolved as rapidly as possible. Tracking the evolution of urgency during an incident is of value when it becomes evident that the incident will not be resolved before the urgency deadline is passed. The following scenario makes the issue clear.
Suppose a CEO’s assistant has to prepare and print a series of confidential reports for a board meeting that starts at 14:00. Assume further that the assistant uses a local printer, in order to handle the confidentiality issues, and that this printer fails. Finally, assume that it will take at least 15 minutes to print, collate and distribute the relevant reports. That means there is an urgency deadline at 13:45. If the printer is not available at that time, the impact of the incident will increase very significantly. Now, let us assume that the assistant notices a problem with the printer at 10:00 and reports it to the service desk, but fails to mention the board meeting deadline. The service desk, having no reason to assume a particular urgency, assigns a relatively low priority to the incident. As a result, work on it does not start until 13:30. The assistant quickly grasps the quandary and informs the technician of the deadline. The urgency of the incident thus changes and should be noted in the incident record. If, given the very short amount of time to the urgency deadline, it is apparent that the incident cannot be resolved by that deadline, then the service provider needs to inform the assistant and needs to prepare for some major criticism. The only situation that would be worse than this is to fail to inform the assistant that the deadline cannot be met. Consequently, an increase in urgency might require special communications to handle failure to resolve the incident by the desired time.
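The deadline logic in this scenario can be sketched as follows; the times, the estimated resolution duration and the function name are, of course, hypothetical.

```python
from datetime import datetime, timedelta

# Sketch of the check implied by the printer scenario: once the
# urgency deadline is revealed, compare it with the estimated
# resolution time to decide whether the customer must be warned
# that the deadline will be missed.
def must_warn(now: datetime, deadline: datetime,
              estimated_resolution: timedelta) -> bool:
    """True if the incident cannot be resolved before the deadline."""
    return now + estimated_resolution > deadline

now = datetime(2024, 5, 6, 13, 30)        # work finally starts at 13:30
deadline = datetime(2024, 5, 6, 13, 45)   # reports must start printing by 13:45
must_warn(now, deadline, timedelta(minutes=30))   # → True: inform the assistant
```

The warning itself is the special communication referred to above; the calculation merely tells the service provider that it is needed.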
There is yet another, exceptional, situation in which change in urgency might affect incident handling. Suppose incident A is recorded with a relatively high urgency, but it is then determined that the urgency is, in fact, lower than initially recorded. Suppose furthermore that another incident, B, occurs after incident A and has a higher priority. Assume, too, that both incidents need to be handled by the same person. Thus, a backlog is created for that incident handler. In such a scenario, it might be beneficial to stop work on incident A, take care of incident B, with its higher priority, and then return to incident A.
I emphasize “might” because, according to the theory of constraints, stopping a task in the middle to perform a different task and then returning to the first is one of the surest ways to reduce overall throughput. Such an approach creates wasteful duplicated effort, causing irrecoverable knock-on effects. Therefore, if the workload for the incident handler is very high, it might be better to resolve incident A first, before handling incident B, even though incident B has the higher priority. On the other hand, if the workload is very low, and the risk of a knock-on effect from re-prioritizing the incidents is low, then it might be possible to finesse the situation and handle incident B first.
We have given above a set of patterns for handling incidents that combine aspects of a process-oriented approach, with its fixed order of events and its fixed inputs and outputs, with aspects of adaptive case management, where the order of events is determined by the unpredictable arrival of new information and the events unfold as information becomes available.