Although troubleshooting and the definitive elimination of faults have a long history, the particular innovation of ITIL® 2 was to recommend treating problems and incidents as two separate entities, each with its own life-cycle. This advice has led to a series of confusions and ambiguities, many of which have still not been resolved among the practitioners of service management and the creators of tools to support service management.
What’s the problem?
The advice of ITIL 2, retained in all subsequent versions of ITIL, refers to the intuitive practice of seeking and eliminating the cause of an incident as part of the handling of the incident. This is particularly common, even to this day, for incidents with considerable impact. The advice, which some may consider to be counter-intuitive, is to treat problems as the causes of groups of incidents and to base the priority for treating a problem on the overall impact of those incidents, rather than the impact of any single incident. Thus, a problem that causes frequent, but relatively low impact, incidents may have an overall impact that gives it a high priority.
This advice should be understood within the context of an organization having only limited resources for handling incidents and their causes. Normally, a service provider simply cannot handle every cause and every incident simultaneously and with the same priority. Therefore, it is important to get the best return on the investment in the resources by ensuring that they work first on the problems creating the most harm.
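The prioritization rule described above can be made concrete with a small sketch. The incident records, problem names and impact scores below are invented for illustration only; ITIL does not prescribe any particular scoring scheme, and simple summation is just one plausible way to measure "overall impact."

```python
from collections import defaultdict

# Hypothetical incident records: (problem name, impact score of that incident).
incidents = [
    ("login-timeout", 2), ("login-timeout", 2), ("login-timeout", 2),
    ("login-timeout", 2), ("login-timeout", 2), ("login-timeout", 2),
    ("db-outage", 8),
]

def rank_problems(incidents):
    """Return (problem, total impact) pairs, highest total impact first."""
    totals = defaultdict(int)
    for problem, impact in incidents:
        # Priority follows the summed impact, not the worst single incident.
        totals[problem] += impact
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = rank_problems(incidents)
```

In this made-up data, the frequent low-impact problem ranks above the single high-impact one, which is exactly the counter-intuitive result the advice is meant to produce.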
So far, so good. But when we dig a little deeper, we see that the underlying concepts of problem management are described in a somewhat confusing way. The definitions proffered by ITIL have exactly the opposite of their intended effect. Instead of creating a simple standard to enhance communication, they tend to create confusion.
In its current avatar, ITIL defines a problem as “the cause of one or more incidents.” This definition seems simple enough, until we dig deeper. In ITIL 2, a problem was considered to be “the unknown cause of one or more incidents” and a major purpose of problem management was to find that cause. Once the cause was found, and a workaround identified, one spoke no longer of a problem, but of a known error. Why it was called a “known error”, as opposed to a “known problem”, will remain a mystery. Many ITIL 2 trainers will recall having introduced the non-standard concept of an “unknown problem” in order to try to explain how you can identify a problem without knowing its causes.
As if this were not confusing enough, the concept of the “root cause” only compounds the discomfiture. Currently, “root cause” is defined as “the underlying or original cause of an incident or problem.” I will leave the discussion of what “underlying” and “original” mean for another posting. How is it that a root cause is the cause of a problem or an incident, whereas a problem is also the cause of an incident? A very simple concept, with a long practical history, has been made very confusing.
I realize that there are ways of expounding on these concepts so as to make some sense of them. Indeed, that is what all ITIL trainers worth their salt end up doing. Wouldn’t it be much simpler if, instead of having to produce elaborate explanations involving much legerdemain and fancy footwork, we had more direct and intuitive definitions of the concepts involved?
Focus on the symptoms
The easiest way to resolve this conundrum is to look at the reality of what we need to do to manage “problems.” In fact, when a “problem” is first identified in reaction to one or more incidents, all we know are the symptoms. It is useless to talk about causes, or even “unknown” causes, at this point. Therefore, it would be much simpler to speak of a problem as being a certain collection of related symptoms.
This approach to the definition of “problem” has many advantages. First, it reflects the reality of what we do in managing problems. Second, it is well adapted to the long history of identifying things that go wrong and trying to find out what causes them. Third, it removes the ambiguity of the terminology. Finally, it helps us to perform the work of problem management more easily and, perhaps, more automatically.
The logic of problem management, as a discipline separate from incident management, requires us to identify—that is to say, to name—whatever has apparently caused the incident. We can do this based on the symptoms alone. Suppose users frequently get a certain error message in a certain application when they attempt a certain function. Applying Occam’s razor, we assume that all these incidents have a single cause. In our search for the cause, we first identify the particular group of symptoms. The grouping of symptoms typically includes the CI or class of CI in which the fault is detected, the operational context in which the fault occurs and the unexpected behavior.
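As a minimal sketch of this grouping, assume each incident has already been tagged with the three elements just listed. The `Symptom` type, its field names and the sample incidents are all hypothetical, and real symptom matching would rarely be an exact comparison as it is here.

```python
from collections import defaultdict
from typing import NamedTuple

class Symptom(NamedTuple):
    """Hypothetical symptom signature built from the three elements above."""
    ci_class: str   # CI or class of CI in which the fault is detected
    context: str    # operational context in which the fault occurs
    behavior: str   # the unexpected behavior

def group_by_symptoms(incidents):
    """Group incident IDs that share exactly the same symptom signature."""
    groups = defaultdict(list)
    for incident_id, symptom in incidents:
        groups[symptom].append(incident_id)
    return dict(groups)

# Invented sample data for illustration only.
incidents = [
    ("INC-1", Symptom("billing-app", "export invoice", "error E42")),
    ("INC-2", Symptom("billing-app", "export invoice", "error E42")),
    ("INC-3", Symptom("mail-server", "send message", "timeout")),
]
candidates = group_by_symptoms(incidents)
```

Each key of `candidates` is a candidate problem in the sense proposed here: a named group of symptoms, identified before anything is known about its cause.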
We need to name that group of symptoms so that we can easily refer to it in the future and while we attempt to handle it. The name typically is derived from the most unusual symptom. This is precisely the same method used in medicine. For example, one all too common disease today is not named for the incapacity of our bodies’ cells to assimilate glucose correctly. Instead, it is named for a common symptom of the disease, the frequent and copious need to urinate, the phenomenon to which the originally Greek word “diabetes” refers.
Once we have identified, and then prioritized, a problem as a group of symptoms, the next obvious step is to identify the cause or causes of that problem. By defining “problem” as symptoms, and not causes, we avoid the metaphysical embarrassment of trying to find the cause of the cause. We avoid having to talk about whether a problem is “known” or “unknown”. Finally, when we have at last identified the causes and a workaround, we no longer have to debate whether a known error is really an object distinct from a problem or whether it is merely a status for a problem.
What is proactive?
The difficulties in understanding what is meant by proactive problem management are themselves a symptom of the confusion concerning problem and incident terminology. In spite of the advice of ITIL 2, many service providers and many of the tools used to support problem management continue to treat problem management as the extension of incident management. They consider problem management to be the domain of major incidents or of incidents that are difficult to resolve. With such a (mis-)understanding of problem management in hand, these same people consider proactive problem management to be that part of problem management that reacts to previous incidents after those incidents have already been resolved. In short, some people think of reactive problem management as the work of handling difficult or major incidents, whereas proactive problem management is the work of identifying and resolving the causes of incidents that have already been resolved.
This is surely not what ITIL ever intended in coining the term “proactive” problem management. My point is not to criticize organizations that do not follow ITIL, which is largely irrelevant as an issue. My point is to show how fuzzy definitions turn people off, create misunderstandings and result in a failure to benefit from what is, after all, quite good advice.
ITIL’s intended meaning for proactive problem management covers the work of identifying the potential causes of failure before they do cause any incidents. We are all familiar with the phenomenon of seeing anomalous sights and thinking, “They ought to fix that before there is an accident.” Proactive problem management is simply a structured way of identifying those underlying causes. As such, it is related on the one hand to condition-based maintenance and on the other hand to risk management.
Proactive problem management is to condition-based maintenance as reactive problem management is to incident management. For example, condition-based maintenance will check the lubricant levels in machines and top them up when necessary. In this way, it helps prevent incidents due to lack of lubricants. Proactive problem management examines the question of why lubricants need to be topped up much more frequently than expected. It tries to identify the underlying causes for loss of lubricants. As such, proactive problem management is a means for controlling risk. Because lubricant levels are unexpectedly lower than specified, there is a certain risk that incidents will be caused. Identifying the underlying causes helps to remove much of the uncertainty and therefore helps to clarify the priorities in addressing problems. The definition of risk, after all, is “uncertainty of outcome.” In short, proactive problem management is the work of identifying unexpected groups of symptoms, symptoms that do not follow specification, symptoms that are likely to provoke incidents if their causes are not understood and handled.
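The lubricant example can be sketched as a simple check on maintenance records. Everything here is an assumption made for illustration: the function name, the 30-day specified interval, the 50% tolerance and the sample dates are invented, and a real tool would need a more careful statistical test.

```python
from datetime import date

def topup_anomaly(topup_dates, expected_interval_days, tolerance=0.5):
    """Flag a machine whose mean top-up interval is well below specification.

    Returns True when the average gap between top-ups is shorter than
    tolerance * expected_interval_days (both values are assumptions).
    """
    if len(topup_dates) < 2:
        return False  # not enough history to compute an interval
    ordered = sorted(topup_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return mean_gap < expected_interval_days * tolerance

# Invented maintenance log: top-ups roughly every 10 days against a
# hypothetical specification of one top-up every 30 days.
dates = [date(2024, 1, 1), date(2024, 1, 11), date(2024, 1, 19), date(2024, 1, 30)]
flagged = topup_anomaly(dates, expected_interval_days=30)
```

No incident has occurred in this scenario. The flag marks a symptom that does not follow specification, which is the starting point for proactive problem management as described above.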
Problem tooling and automation
I realize perfectly well that merely changing the definition of a term should not change the techniques by which the underlying concepts are managed. That being said, virtually all the tools I have ever seen that are intended to support problem management are either much too complicated or much too simple. At best, they are simply administrative supports designed to document. They hardly ever really help to do the work of identifying, diagnosing or resolving problems and causes.
This is a pity, because the current state of information technology is probably advanced enough to do the work of grouping (that is to say, correlating) symptoms and identifying the probable causes of those symptoms. Clearer implementation in tools of the concepts of symptom, symptom correlation and cause would go a long way toward supporting much faster and probably more reliable problem identification and diagnosis.
Unfortunately, our tools today seem more concerned with administration, control and compliance than with resolving problems. It is a decadent technology used to support a decadent society. Our tools should first and foremost be designed to increase the value of our services and secondly to limit the destruction of value in our services. Any other use, such as in proving that the defined and agreed process is being followed, should be only in a very distant third position.