Problem management challenges and critical success factors

Following his presentation on “problem management challenges and critical success factors” at the 8th annual itSMF Estonia conference in December, Tõnu Vahtra, Head of Service Operations at Playtech (the world’s largest publicly traded online gambling software supplier), gives us his advice on understanding problem management, the steps to follow when implementing the process, and how to make it successful.

Tõnu Vahtra

Problem management is not a standalone process

Incident management and event management

Problem management cannot exist without the incident management process, and there is a strong correlation between incident management maturity and problem management efficiency and results. Incident management needs to ensure that problems are detected and properly documented (e.g. the basic incident management requirement that all requests must be registered). Incident management works back-to-back with the event management process; if both of these processes are KPI-managed, any anomalies in alarm or incident trends can be valuable input to problem management. In parallel with restoring service during an incident, incident management also has to ensure that relevant information is collected during or right after resolution (e.g. a server memory dump before restart), so that more information is available to identify the incident root cause(s).
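As an illustration of how anomalies in incident trends could feed problem management, here is a minimal Python sketch. It is not Playtech's method: the function name `flag_anomalies`, the four-period rolling window and the two-standard-deviation threshold are all assumptions chosen for the example. It flags any period whose incident count rises well above the average of the preceding periods.

```python
from statistics import mean, pstdev
from typing import List


def flag_anomalies(counts: List[int], window: int = 4, threshold: float = 2.0) -> List[int]:
    """Return indexes of periods whose count is unusually high.

    A period is flagged when its count exceeds the mean of the previous
    `window` periods by more than `threshold` standard deviations.
    Both parameters are illustrative assumptions, not recommended values.
    """
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        # Use a floor of 1.0 so a perfectly flat baseline does not
        # flag every tiny increase.
        sigma = max(pstdev(baseline), 1.0)
        if counts[i] > mu + threshold * sigma:
            flagged.append(i)
    return flagged


# Hypothetical weekly incident counts for one service.
weekly_incidents = [10, 12, 11, 9, 10, 30, 11]
spikes = flag_anomalies(weekly_incidents)
```

With the hypothetical data above, only the week with 30 incidents (index 5) is flagged. Such a flag is a prompt for a Problem Manager to look closer, not a diagnosis in itself.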

Critical incident management

Problem management at Playtech gains a lot from the critical incident management function, which is carried out by dedicated Critical Incident Managers who have the widest logical understanding of all products and services and years of experience in solving critical incidents. They perform incident post-mortem analysis following all major incidents, and they also begin the initial root cause analysis (RCA) before handing this task over to problem management. RCA is handed over to Problem Managers within 24 hours of the incident end time; during this period the Critical Incident Manager collects and organizes all available information about the incident. Critical Incident Managers usually have no difficulty allocating support and troubleshooting resources from all support levels, as critical incident troubleshooting and initial preventive measures are considered the highest priority under a mandate from the highest level of corporate management. All of the above ensures high-quality input for problem management in a timely manner.

Change management and knowledge management

In the Error Control phase, the two most important processes for problem management are change management and knowledge management. Most action items identified during RCA are implemented through change management. The stronger that process is, the less problem management has to be involved directly in change planning (it can provide abstract goals rather than a concrete action plan or task list for implementation) and the smaller the risk of additional incidents during change implementation. Change management also needs the capability and a documented process flow to implement emergency changes in an organized way and with minimum impact, in order to stop recurring critical incidents as fast as possible.

Knowledge management is vital for incident management: it ensures that service desk specialists can quickly find and apply specific workarounds for known errors while their resolution by problem management is still in progress. Regular input and close attention are needed from problem management to ensure that every stakeholder of the known error database (KEDB) can easily locate the information relevant to his or her role, that all units are aware of the information relevant to them, and that all the information in the KEDB is relevant and up to date. At Playtech, problem management also manages process errors identified through root cause analysis. Process improvements only last when they are properly documented and communicated to all relevant stakeholders, and when additional controls are put in place to detect deviations from the optimal process. Local and cross-disciplinary knowledge management for process knowledge has an important role here.

Defect management

In a software development and services corporation like Playtech, problem management has to go beyond ITSM processes and also integrate with the software development lifecycle (SDLC). For this purpose, Playtech has established a separate defect management sub-process under problem management. Defect management manages the lifecycle of all significant software defects identified in production environments and aligns defect-fixing expectations between the business and development departments. Defect Managers maintain a consistent, prioritized overview of all significant outstanding software defects, which ensures optimal usage of development resources and minimizes the overall business impact of defects. They act as a single point of contact for all defect-related communication and ensure high transparency of the defect-fixing process and fix ETAs. Defect Managers define the defect prioritization framework together with key business and development stakeholders and govern the agreed targets.

Software problem management

Problem management leads the software problem management process through defect management. Under the software problem management process (which is usually run by a quality assurance team in the relevant development unit), development teams perform root cause analysis for defects that have been highlighted for RCA by problem management or raised internally. Every defect is analyzed from two angles: first, why the defect was created by development; and second, given that the defect was created, why it was not identified during internal QA and was instead first reported from the production environment. Root causes and action items are defined for both questions and tracked with the relevant stakeholders. This process aims to ensure that in the future similar defects will either not be created or will be identified internally. Even more importantly, there is a direct feedback channel from the field to the developer or team who created the defect, so that they gain a full understanding of the business implications of their activities.

Important steps to take problem management to the next level

The problem management unit has to become more proactive and get more involved in the service design and service transition phases, in order to identify and eliminate problems before they reach production environments. Problem management needs resources to contribute to pre-production risk management. Even more importantly, this involvement has to be valued and enforced by corporate senior management, as it may take additional resources and delay time-to-market in some situations.

The Problem Management Team itself can free up more resources for proactive tasks by reducing its direct participation in reactive problem management activities. This has to be done by advocating the problem management mindset across the entire corporation (encouraging people to think in terms of cause and effect, with the desire to understand the causes of issues and push their resolution for continuous improvement), so that each major domain has its own Problem Coordinators who identify root causes and track action items independently, and problem management can take on more of a defining and governing role. To demonstrate the value created by problem management, and to enlist more people to spread problem management ideas so that they go viral, it is essential to visualize the process and explain the relations between incidents, root causes and action items to all stakeholders, so that they understand how their tasks contribute to the bigger picture.

There is a large number of operationally independent problem management stakeholders at Playtech, and implementing a KPI framework that is fit to measure and achieve problem management goals, and that applies to all major stakeholders both individually and across stakeholders, seems an almost impossible task. The saying “You get what you measure” is very true in problem management: no stakeholder wants to be measured by problems that involve other stakeholders, so stakeholders take action to remove such problems from their statistics instead of focusing on the problem and its solution. At the same time, problem management tends to be most inefficient and difficult for problems that spread across multiple divisions. A Problem Manager’s role and assertiveness in facilitating a constructive and systematic process towards the resolution of such problems is crucial. Even so, problem management needs to find a creative approach to reflect such problems in KPI reports, to present them as part of the big picture, and to sell them to executive management in order to gain sponsorship for major improvement tasks that compete for the same resources with business development projects, which have a much clearer ROI.

No problem exists in isolation. Problem records in the KEDB can be related to specific categories or domains and can also be related hierarchically to each other (there can be major principal problems that consist of smaller problems). Specific action items can also contribute to the resolution of more than one problem. Problem categories cannot be restricted to a fixed list, as a problem can have multiple triggers and causes. It should be possible to relate a problem record to all interested stakeholders, and for this dynamic tagging seems to be a better approach than a limited number of categories (for example, to list all problems related to a big project). Instead of being examined in isolation, each problem should be approached and prioritized in the right context, fully considering its implications and surroundings. No ITSM tool today provides the full capabilities for problem tagging or for creating the relations mentioned above without additional development, not to mention the visualization of such relations, which would be a powerful tool in trend or what-if analysis and in problem prioritization. Playtech is still looking for the optimal problem categorization model and a tool that would enable the use of such a model.
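The relations described above (free-form tags, principal problems made up of smaller ones, and action items shared between problems) can be sketched as a small data model. The Python below is a hypothetical illustration only: the class names, fields and sample records are assumptions made for this example and do not represent Playtech's KEDB or any particular ITSM tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Problem:
    """Hypothetical problem record with free-form tags instead of one fixed category."""
    problem_id: str
    title: str
    tags: Set[str] = field(default_factory=set)
    parent_id: Optional[str] = None  # principal problem this one belongs to, if any


@dataclass
class ActionItem:
    """Hypothetical action item that may contribute to several problems."""
    action_id: str
    description: str
    problem_ids: Set[str] = field(default_factory=set)


class ProblemRegistry:
    """In-memory registry that answers simple relation queries."""

    def __init__(self) -> None:
        self.problems: Dict[str, Problem] = {}
        self.actions: Dict[str, ActionItem] = {}

    def add_problem(self, problem: Problem) -> None:
        self.problems[problem.problem_id] = problem

    def add_action(self, action: ActionItem) -> None:
        self.actions[action.action_id] = action

    def by_tag(self, tag: str) -> List[str]:
        """IDs of all problems carrying the given tag."""
        return sorted(p.problem_id for p in self.problems.values() if tag in p.tags)

    def children(self, problem_id: str) -> List[str]:
        """IDs of the smaller problems that make up a principal problem."""
        return sorted(p.problem_id for p in self.problems.values()
                      if p.parent_id == problem_id)

    def shared_actions(self) -> List[str]:
        """IDs of action items that contribute to more than one problem."""
        return sorted(a.action_id for a in self.actions.values()
                      if len(a.problem_ids) > 1)


# Hypothetical sample data.
registry = ProblemRegistry()
registry.add_problem(Problem("P1", "Login outages", {"auth", "project-x"}))
registry.add_problem(Problem("P2", "Session cache leak", {"auth"}, parent_id="P1"))
registry.add_problem(Problem("P3", "Slow reports", {"reporting", "project-x"}))
registry.add_action(ActionItem("A1", "Upgrade cache cluster", {"P2", "P3"}))
```

In this sketch, a tag such as `project-x` plays the role of a dynamic category: `registry.by_tag("project-x")` lists every problem related to that project without the project needing a place in a fixed category list.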

Advice to organizations that are planning to start the implementation of the problem management process

For organizations starting to implement the problem management process, my advice is: don’t take all the process activities from the ITIL book and start blindly implementing them. This is not the way to start the implementation of this process or any other. Problem management success depends mostly on a specific mindset, and in an already established organization it may take years for the right mindset to be universally accepted. The formal problem management process should initially be mostly invisible to all stakeholders outside the Problem Management Team, to avoid the natural psychological tendency to resist change.

It is essential to allocate dedicated resources to problem management (Playtech assigned a dedicated person to problem management in 2007; any problem management activities prior to that were ad hoc and inconsistent). The problem management unit should start by performing root cause analysis on, and removing the root causes of, current major incidents that have the highest financial and reputational impact on the organization. If such incidents are being closely monitored by senior management and key stakeholders, solving them can earn problem management the credit it needs to get attention and resources for solving problems elsewhere. Secondly, problem management should look at the most obvious recurring alarm and incident trends that result in high support and maintenance costs. By resolving such problems, problem management gains the trust of the support and operational teams, whose workload is reduced and who are then more willing to contribute and cooperate in future root cause analysis. The final review of a problem before closure is a task that is often neglected, but to improve the process it is essential to assess whether the problem was handled efficiently and to give feedback about the solution to all relevant parties. Proactive problem management and KPIs are not essential to start with; Problem Managers should concentrate on the activities with the highest exposure and clearest value.

In summary

There will definitely be setbacks in problem management. To make a real difference with this process and increase its maturity over time, it needs at least three things: a strong and assertive leader who is persistent in advocating problem management; a continuous improvement mindset throughout the organization; and the ability to find a way forward from dead-end situations with out-of-the-box thinking. When there is no such leader, involving external problem management experts may also help as a temporary measure to get the focus back on the most important activities. However, this measure is not sufficient in the long term, as the problem management process constantly needs to evolve with its organization and adjust to significant operational changes in order to be fit for purpose and remain relevant.