The Scope of this Discussion
When a problem is identified reactively, it means that one or more incidents have occurred and it has been decided to take note of and perhaps investigate their underlying causes. I exclude from this discussion both proactively identified problems (those identified before any related incidents have occurred) and those organizations that treat problem management as a discipline for resolving difficult incidents rather than a discipline for identifying the causes of those incidents.
Therefore, in the scope defined above, when a problem is recorded, there is always a known way of resolving any related incidents—namely, the same way one or more of the previous incidents have been resolved. I will further exclude from this discussion the case where an incident is resolved by a change that serves, at the same time, to eliminate the cause of the incident.
ITIL’s Definition of Known Error
As we know, ITIL® has defined a known error as a problem for which the root cause has been identified and for which a workaround has been identified. The question I will address here is how that workaround differs from the already known resolution of the incidents related to the problem. Another way of articulating this issue is to ask: when might a reactive problem be recorded for which a workaround is not already known?
What is a Workaround?
It should be clear that all problems within the defined scope have a means for resolving related incidents without eliminating the key factors in the chain of causality that results in the incidents. But this is precisely what we mean by a workaround! So why bother specifying that a known error implies an identified workaround? In fact, it adds little or nothing to the definition of known error to refer to identified workarounds.
An Exceptional Case
There is a special case which is not even covered by such a definition. This is the case where:
- no workaround at all is possible – either the incident is resolved in such a way that its causes are eliminated or it is simply not resolved at all
- it is agreed with the customers that the incident does not need to be resolved within the normal timeframe. In other words, the customer agrees not to use the service feature that is failing.
Now, this exceptional case is exceedingly rare. In the vast majority of cases, a workaround is indeed available, albeit the workaround might require that the customer find a different way of working; one that does not depend on the faulty IT service. For all intents and purposes, we can ignore this exception.
Building a Better Workaround
We may conclude that the objective of problem management is not to identify some workaround for an incident. Rather, an objective is to identify a better workaround, if feasible. Better than what? Better than the way in which incidents have been hitherto resolved. What do we mean by better? We mean resolving the incident in such a way that the business impact of the incident is reduced. Here is an example. Suppose the first instance of the incident type was resolved with a loss of 2 FTEs of productivity. After developing a better workaround, it is possible to resolve such incidents with a loss of only 1 FTE of productivity. That’s better.
Why do I say feasible? Very simply, the cost of developing, implementing and using the better workaround must be lower than the reduction it is expected to bring in the business impact of incidents due to the problem, over a reasonable horizon. It makes no sense to invest 100’000 to improve a workaround if the savings are expected to be only 50’000.
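The feasibility test can be sketched as a small calculation. This is only an illustration under stated assumptions: the function names `expected_savings` and `is_feasible` are hypothetical, the savings are assumed to grow linearly with the number of incidents, and the second set of figures is invented for the example.

```python
def expected_savings(incidents_per_month: float,
                     saving_per_incident: float,
                     horizon_months: int) -> float:
    """Estimate the impact avoided over the horizon (assumed linear model)."""
    return incidents_per_month * saving_per_incident * horizon_months


def is_feasible(workaround_cost: float, savings: float) -> bool:
    """A better workaround is worth pursuing only if it costs less than it saves."""
    return workaround_cost < savings


# Figures from the text: investing 100,000 to save 50,000 is not feasible.
print(is_feasible(100_000, 50_000))  # False

# Hypothetical figures: 20 incidents a month, 500 saved per incident, 12 months.
savings = expected_savings(20, 500, 12)  # 120000
print(is_feasible(100_000, savings))  # True
```

In practice the cost figure would also need to include the effort of using the workaround each time, not just the cost of building it.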
What is a reasonable horizon? The duration of the horizon depends on several factors. The most important is the expected time to live for the system or the service having the problem. If a service or a system is to be decommissioned in 6 months, then 6 months is the maximum horizon. If a bug is known to be resolved by a new software version and it is planned to release that new version in 12 months, then 12 months is the maximum horizon.
The second factor in determining the horizon is the general policy of the organization concerning how to determine the return on investments. This varies considerably according to the risk tolerance of the organization and its capability to deliver solutions as planned. Thus, a low capability organization with low risk tolerance might insist on a positive ROI within 6 months. One with high risk tolerance and with confirmed capabilities to deliver solutions over the long term might require an ROI within three years. A horizon of greater than three years is probably not useful, given the fast pace at which technology changes.
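One way to combine these factors is to take the shortest of the applicable limits. The sketch below assumes exactly that; the function `planning_horizon_months` and its parameter names are hypothetical, and an organization might well weigh the factors differently.

```python
from typing import Optional


def planning_horizon_months(policy_months: int,
                            months_to_decommission: Optional[int] = None,
                            months_to_fix_release: Optional[int] = None) -> int:
    """Return the shortest applicable limit, in months.

    policy_months is the organization's ROI policy horizon; the two
    optional limits apply only when a decommissioning or a fixing
    release is actually planned.
    """
    limits = [policy_months]
    for limit in (months_to_decommission, months_to_fix_release):
        if limit is not None:
            limits.append(limit)
    return min(limits)


# Policy allows three years, but the service is decommissioned in 6 months.
print(planning_horizon_months(36, months_to_decommission=6))  # 6

# Policy demands ROI within 6 months; the fixing release is 12 months away.
print(planning_horizon_months(6, months_to_fix_release=12))  # 6
```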
It should be noted, too, that every organization has limited resources and limited management capabilities. As a result, the organization might decide not to pursue the development of a better workaround even though a positive ROI can be demonstrated. It may be that the limited capabilities and resources are to be devoted to other initiatives with an even higher ROI.
An Example of Identifying a Workaround
Here is a concrete example. An organization uses a client-server application in which the client periodically freezes up. In fact, the cause of this freezing up is known and there is even a solution available, one that requires upgrading to the next major version of the application. While the organization does indeed intend to make that upgrade, it will be extremely complicated, requiring the testing and adaptation of a very large number of procedures. What, if anything, can be done in the interim?
In order to resolve incidents related to this problem, the organization has been rebooting client computers. While this does resolve the incident, it also results in a lot of lost productivity and potentially lost data. Is it possible to find a way to minimize that loss of productivity and minimize the risk of lost data? Since the incidents recur frequently and touch a very large number of users, it is worth investing some time and resources to find a better way to resolve the incidents.
It is determined that it is possible to run a script locally that preserves the integrity of the machine, restarts the client software and allows users to get back to work much more quickly and with little risk of data corruption. The script is developed, tested and installed on all client computers. In the future, incidents related to the problem are resolved using this script, until such time as the software is upgraded and the bug itself is resolved.
Cause Determination and Workaround Identification are Not Necessarily Interdependent
So, this is an example of a workaround, designed and implemented under the auspices of problem management, that is a better workaround than the previous way of resolving incidents. It is curious to note that no knowledge of the application bug and its resolution was needed in order to determine this new workaround. In other words, it was possible to find an improved workaround without first understanding the causes of the incidents.
The Real Goal of Problem Management
The message we should retain is that the goal of problem management is not to find incident causes and it is not to find or improve workarounds. The goal is simply to reduce the impact of incidents. Improving workarounds and identifying causes are only two of the principal means by which that goal may be achieved.