ITIL®’s version of the goals of incident management
According to the latest version of ITIL®, the goal of incident management is “…to restore normal service operation as quickly as possible and minimize the adverse impact on business operations…”¹ So there are really two goals mentioned: to restore as quickly as possible and to minimize impact. But, from a customer or user perspective, is it necessary to articulate more than the second goal? If we succeed in minimizing the impact of an incident on the business, are we not, by definition, resolving that incident as rapidly as possible? This might sound like quibbling, but there are some important points that depend on the distinction between these two goals.
Let’s take a few examples to illustrate the key points.
Incident 1: An online sales service allows customers around the world to search for items that they may wish to buy and to place orders for them. The service is successful: averaged over all hours, it takes about ten orders per minute. The rate might peak at a much higher level, but it hardly ever falls below five orders per minute. An incident that causes this service to fail therefore has an impact within the very first minute after the failure. Most organizations would consider such an incident to have both the highest level of impact and the highest urgency.
Incident 2: An IT service is used to pay the employees of the company. A payroll run is performed once a month, around the 26th day of the month. An incident causes the server platform used to run the service to fail on the 1st day of the month. Most organizations would consider the impact of such an incident to be quite high, although the initial urgency of that incident is quite low.
There are quite a few different types of incidents that are similar to incident 1, in the sense that the impact is immediate and the urgency is high. But there are also quite a few types of incidents that resemble incident 2. These would include incidents that impact batch work, as opposed to interactive processing. But they could also include incidents where there are readily applied and highly acceptable workarounds, and incidents where the impacted users are perfectly willing to work on something else while the incident is being resolved. I had such an incident myself last week. I needed to download some files from a server before the end of the week. The download failed when I first tried, on a Tuesday. But I had plenty of other things to do and did not need the files until Friday.
Different goals for different incident types?
Let’s return to the two incident management goals. For incidents of type 1, it should be clear that meeting the second goal of minimizing business impact automatically means meeting the first goal of restoring normal service as rapidly as possible. But this is not true for incidents of the second type. In order to meet the second goal, it suffices to restore normal service at any time between the initial identification of the incident and the future time at which the service will next be needed.
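The window described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Incident` class, its field names and the dates are hypothetical, and it assumes we actually know when the service will next be needed, which is not always the case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical structure, for illustration only.
    name: str
    detected_at: datetime
    next_needed_at: datetime  # assumption: this moment is known


def restoration_slack(incident: Incident) -> timedelta:
    """Time available to restore service before the business is impacted."""
    slack = incident.next_needed_at - incident.detected_at
    # A service that is needed continuously has no slack at all.
    return max(slack, timedelta(0))


# Type 1: the online sales service is needed at the very moment it fails.
sales = Incident("online sales", datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 0))
# Type 2: payroll fails on the 1st but is next needed around the 26th.
payroll = Incident("payroll", datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 26, 9, 0))

print(restoration_slack(sales))    # 0:00:00
print(restoration_slack(payroll))  # 25 days, 0:00:00
```

For the type 1 incident the slack is zero, so minimizing impact and restoring service as quickly as possible coincide. For the type 2 incident, any restoration inside the 25-day window meets the second goal.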
Is there any business justification for the first goal if the second goal is being met? I can think of two justifications, both concerned with aspects of risk. First, there is the issue of not being certain how long it will take to resolve an incident. If we do not start work on it immediately, maybe we will fail to resolve the incident before the service is next needed. But hold your horses! Do you see the contradiction in what I just said? If we are so uncertain about how long it will take us to resolve incidents, what business do we have negotiating target resolution dates based on incident priority, as so many organizations do? (I will address that contradiction in a different posting.) Second, there is the issue that we are not always certain when a service will next be needed. As a result, I accept that some of the incidents that we might have thought were like type 2 are better treated as if they were of type 1. But I think that still leaves a lot of incidents of type 2.
Incident management goal and Kanban
You are perhaps still saying to yourselves, “So what?” The distinction between the two goals is of capital importance when it comes to managing the overall flow of work that all teams in IT need to perform. I have been describing in some detail, both in these postings² and also in a series of webinars³, how the use of Kanban to improve the flow of IT service management work can have profound effects on improving the performance and the quality of service management activities. However, incidents always appear like a monkey wrench (a spanner, for those speaking an old world dialect of English) thrown into the works.
Since incidents are always unplanned, they will have a tendency to be treated as expedited, from a cost of delay perspective (see my posting “Priority, Cost of Delay and Kanban” for the concept of cost of delay). Because they represent work that is both pushed and not respecting any limits to work in progress, incidents can drag down the gains in performance that we typically see when using Kanban.
So, our strategy is to find ways of meeting the goals of incident management and, at the same time, of getting the benefits of handling incidents with a Kanban approach, without expediting them. Unfortunately, this remains difficult to achieve with incidents of type 1, which will continue to be expedited. However, incidents of type 2 fit very neatly into the cost of delay framework for choosing the next items on which to work. Some of those incidents might be of the fixed date type. A good example would be the payroll incident: it has to be resolved by the next payroll run, but there is no value in resolving it sooner.
Other incidents could even be handled as normal from a cost of delay perspective. The fact of the matter is that resolving an incident is not always the single most important thing to do to support your customers. Over the years, I have often heard explanations from people as to why they have not been working on incidents that have been assigned to them, preferring to work on other tasks. Sometimes, these people simply do not understand the importance of an available service. But other times, their explanations make a lot of sense. In other words, they have adequately assessed all the work in their backlog and have determined that their customers will derive the most value from some work other than resolving an incident. The only argument to be made against such decisions is based on company policies, such as a policy stating that incident handling always takes priority over other types of work. Such policies might exist for historical reasons. But they fundamentally do a disservice if they mean that work of lesser cost to customers takes priority over work of higher cost to customers.
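The three ways of handling incidents discussed above (expedited, fixed date and normal) can be sketched as a simple classification. This is a minimal illustration, not a definitive rule set: the `Incident` fields, the `classify` function and the third example incident are all hypothetical, and a real assessment would weigh the incident against the rest of the backlog.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class CostOfDelay(Enum):
    EXPEDITE = "expedite"
    FIXED_DATE = "fixed date"
    NORMAL = "normal"


@dataclass
class Incident:
    # Hypothetical fields, for illustration only.
    name: str
    immediate_impact: bool            # does the business suffer right now?
    needed_by: Optional[date] = None  # date the service is next needed, if known


def classify(incident: Incident) -> CostOfDelay:
    """Choose a cost of delay class for an incident (simplified assumption)."""
    if incident.immediate_impact:
        # Type 1: impact starts at once, so the incident is expedited.
        return CostOfDelay.EXPEDITE
    if incident.needed_by is not None:
        # Type 2 with a known deadline: no value in resolving it sooner.
        return CostOfDelay.FIXED_DATE
    # Type 2 with an acceptable workaround and no deadline.
    return CostOfDelay.NORMAL


sales = Incident("online sales down", immediate_impact=True)
payroll = Incident("payroll server down", immediate_impact=False,
                   needed_by=date(2024, 3, 26))
# Made-up example of an incident with a highly acceptable workaround.
minor = Incident("minor glitch with workaround", immediate_impact=False)

for incident in (sales, payroll, minor):
    print(incident.name, "->", classify(incident).value)
```

Only the first incident bypasses the limits on work in progress. The other two can be pulled from the backlog like any other work item of their class.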
A simplified, more useful goal
In conclusion, for incidents of type 1 the goal of minimizing impact on customers automatically implies resolving the incident as rapidly as possible. For incidents of type 2, minimizing impact is a sufficient goal, whereas restoring services as rapidly as possible is not always desirable, given the context of other work that is currently ready to be performed. I would recommend, then, that the goal of resolving incidents as rapidly as possible be eliminated. It is not a goal in and of itself. Rather, it is a strategy or a technique that is used to achieve the true goal of incident management, but only for certain incidents. Let us then use this window of opportunity to improve the flow of all our work and get a maximum of benefit from Kanban.
¹It is too bad that ITIL calls this a purpose, when everyone else I have asked would call this a goal, or even a set of objectives. The purpose of incident management is to resolve incidents and to restore disrupted services.
³Webinars: Kanban as an ITSM Best Practice; Go with the Flow: Integrate development and transition activities using Kanban; Cross-functionality, Kanban and Service Design; Daily Improvement with Kanban.