In many IT organizations, IT support teams spend their time reacting to issues – resetting passwords, restoring services, responding to incidents, and managing major incidents such as outages. While incident management is essential to keep IT services running, it can trap IT support teams in a “firefighting cycle.” ITIL, the most popular body of service management best practices, defines a problem as a cause, or potential cause, of one or more incidents and shares problem management best practices.
The available problem management best practices can make a big difference. Instead of simply restoring service (and moving on to the next issue), problem management focuses on understanding and eliminating the underlying root causes of incidents.
I appreciate that people have different views on what problem management is (and should be) within IT organizations. This article presents the thinking in ITIL 4.
Note – this article was written before the detailed ITIL (Version 5) management practices were released.
What Is Problem Management in ITSM?
The Purpose and Scope of Problem Management
In IT service management (ITSM), problem management is the service management discipline responsible for:
- Identifying the root cause of recurring or significant issues
- Documenting known errors, and
- Ensuring permanent fixes are implemented.
Incident Management vs. Problem Management
You could say that incident management asks:
“How do we restore service as quickly as possible?”
Where problem management asks:
“Why did this happen – and how do we stop it happening again?”
A Simple Example of Problems, Workarounds, and Permanent Fixes
For example, Wi-Fi goes down on Monday morning. This is the incident. IT support reboots the network controller. This is a workaround. Problem management then discovers that a firmware bug causes failure during high load. The bug is the problem. The resolution, or permanent fix, is to apply the vendor patch and update capacity thresholds. So incident management restores service, while problem management removes the cause so that the disruption doesn't recur.
Why Problem Management Matters to the Business
Effective problem management can dramatically improve organizational outcomes, with its benefits including:
Reducing Downtime
When problem management is undertaken, there are fewer repeat incidents, which means fewer business disruptions.
Lowering Operational and Business Disruption Costs
Fixing a root cause once through problem management is cheaper than repeatedly fixing symptoms. Plus, the financial costs of repeated business disruption are avoided.
Improving End-User Experience and Confidence
When problem management is used to reduce incidents, your employees notice fewer failures and faster resolutions.
Enabling Better Decisions and Greater IT Maturity
Problem management data can drive smarter IT investment and architecture choices. IT can start shifting from reactive to proactive service management through problem management.
Reactive vs. Proactive Problem Management
Problem management can be triggered in different ways:
Reactive Problem Management
This happens after an incident – often a major one or a pattern of repeated issues. For example, five similar printer outages over two weeks cause the issue to be flagged for problem management.
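To make the printer example concrete, here is a minimal Python sketch of how repeat incidents could be flagged for problem management. The function name, the category labels, and the default threshold of five incidents in 14 days are my own assumptions for illustration; your ITSM tool will have its own data model and rules.

```python
from collections import defaultdict
from datetime import date, timedelta


def flag_repeat_incidents(incidents, threshold=5, window_days=14):
    """Return categories with at least `threshold` incidents inside any
    span of `window_days`. `incidents` is an iterable of (category, date)
    pairs. Threshold and window are illustrative assumptions."""
    by_category = defaultdict(list)
    for category, opened_on in incidents:
        by_category[category].append(opened_on)

    flagged = []
    window = timedelta(days=window_days)
    for category, dates in by_category.items():
        dates.sort()
        start = 0
        for end, current in enumerate(dates):
            # Move the left edge forward until the span fits the window.
            while current - dates[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                flagged.append(category)
                break
    return sorted(flagged)


# Hypothetical incident log: five printer outages over two weeks.
incidents = [
    ("printer-outage", date(2024, 3, 4)),
    ("printer-outage", date(2024, 3, 6)),
    ("printer-outage", date(2024, 3, 8)),
    ("printer-outage", date(2024, 3, 12)),
    ("printer-outage", date(2024, 3, 15)),
    ("password-reset", date(2024, 3, 5)),
]
candidates = flag_repeat_incidents(incidents)
print(candidates)
```

With this sample data, only the printer outages reach the threshold, so they would be the candidate for a problem record.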
Proactive Problem Management
This identifies potential failures before incidents occur, often using data, analytics, or monitoring alerts. For example, disk utilization is trending toward failure, and this triggers analysis and capacity planning.
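As a rough illustration of the disk example, the sketch below fits a straight line to daily utilization readings and estimates how many days remain before a threshold is crossed. The 90% threshold, the function name, and the sample readings are assumptions on my part, and a simple linear trend is only one of many ways monitoring tools forecast capacity.

```python
def days_until_threshold(daily_utilization, threshold=90.0):
    """Fit a least-squares line to daily utilization percentages and
    estimate the days left until the line reaches `threshold`.
    Returns None when usage is flat or falling."""
    n = len(daily_utilization)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_utilization) / n
    covariance = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_utilization)
    )
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Day index where the fitted line meets the threshold,
    # measured from the most recent sample.
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (n - 1))


# Hypothetical readings: disk usage growing two points per day.
samples = [70.0, 72.0, 74.0, 76.0, 78.0]
remaining = days_until_threshold(samples)
print(remaining)
```

A short remaining window would be the trigger for the analysis and capacity planning described above, before any incident is raised.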
The Problem Management Process Explained
The process will differ by organization; however, the core elements are likely to be similar:
1. Detection, Logging, and Prioritization
A problem may arise from incident trends, major incident reviews, system alerts, or end-user reports. These problems are then ranked based on business impact, urgency, and risk.
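One simple way to picture the ranking step is a weighted score. The sketch below is hypothetical: the 1 to 5 scales, the weights, and the candidate problems are assumptions for illustration, not an ITIL-defined formula.

```python
from dataclasses import dataclass


@dataclass
class ProblemCandidate:
    title: str
    business_impact: int  # 1 (low) to 5 (high), assumed scale
    urgency: int          # 1 (low) to 5 (high), assumed scale
    risk: int             # 1 (low) to 5 (high), assumed scale

    def score(self, weights=(5, 3, 2)):
        # Assumed weighting: impact counts most, then urgency, then risk.
        w_impact, w_urgency, w_risk = weights
        return (
            w_impact * self.business_impact
            + w_urgency * self.urgency
            + w_risk * self.risk
        )


def rank_problems(candidates):
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=lambda c: c.score(), reverse=True)


backlog = [
    ProblemCandidate("Repeat printer outages", 2, 3, 2),
    ProblemCandidate("Wi-Fi controller fails under load", 5, 4, 4),
    ProblemCandidate("Disk utilization trending to full", 4, 2, 5),
]
ranked_titles = [c.title for c in rank_problems(backlog)]
print(ranked_titles)
```

The value of something like this is less the exact numbers and more that prioritization becomes consistent and explainable.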
2. Root-Cause Analysis Techniques
Here, the cause of the problem is determined.
Common root-cause analysis (RCA) tools include:
- 5 Whys
- Fishbone (Ishikawa) diagrams
- Fault Tree Analysis
- Pareto charts
- Timeline mapping (for major incidents).
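Of the tools listed above, Pareto analysis is the easiest to sketch in a few lines. The example below counts incidents by category and returns the "vital few" categories that account for a chosen share of the total. The 80% cutoff, the function name, and the sample categories are assumptions for illustration.

```python
from collections import Counter


def pareto_vital_few(categories, cutoff_percent=80):
    """Return the most frequent categories that together account for at
    least `cutoff_percent` of all incidents."""
    counts = Counter(categories).most_common()
    total = sum(count for _, count in counts)
    vital, cumulative = [], 0
    for category, count in counts:
        vital.append(category)
        cumulative += count
        # Integer comparison avoids floating-point rounding at the cutoff.
        if cumulative * 100 >= cutoff_percent * total:
            break
    return vital


# Hypothetical incident categories for one quarter.
incident_categories = (
    ["network"] * 50 + ["printer"] * 30 + ["email"] * 12 + ["other"] * 8
)
vital_few = pareto_vital_few(incident_categories)
print(vital_few)
```

Here two categories make up 80% of the volume, which suggests where root-cause analysis effort is likely to pay off first.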
3. Developing Workarounds and Known Errors
In ITSM terms, a workaround is a temporary solution that restores service and reduces the impact of an incident while the root cause is addressed.
A known error record captures the problem symptoms, cause, and workaround, enabling the IT service desk to respond quickly to recurrence when a permanent fix cannot be found or justified.
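To show what a known error record might hold, here is a small sketch using the Wi-Fi example from earlier. The field names, the record ID format, and the keyword-matching lookup are all assumptions for illustration; a real ITSM tool will offer its own record structure and search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    problem_id: str   # hypothetical ID format
    symptoms: tuple   # phrases an end user or agent might report
    root_cause: str
    workaround: str


def find_known_error(description, known_errors):
    """Return the first known error whose symptom phrase appears in the
    incident description, or None. Naive substring matching is used
    here purely for illustration."""
    text = description.lower()
    for record in known_errors:
        if any(symptom.lower() in text for symptom in record.symptoms):
            return record
    return None


known_errors = [
    KnownError(
        problem_id="PRB-0001",
        symptoms=("wi-fi down", "wireless unavailable"),
        root_cause="Firmware bug causes failure during high load",
        workaround="Reboot the network controller",
    ),
]
match = find_known_error("Wi-Fi down on floor 3 again", known_errors)
print(match.workaround if match else "No known error found")
```

The point is that a service desk agent facing a recurrence can go straight to the documented workaround rather than diagnosing from scratch.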
4. Implementing Permanent Fixes Through Change Enablement
Permanent fixes identified using problem management might require a planned change, review, and approval.
5. Review, Closure, and Continuous Learning
Once a problem is resolved, the problem record is closed, and lessons learned are documented.
Roles and Responsibilities in Problem Management
Clear ownership helps prevent problem management from becoming “everyone’s job and no one’s responsibility.” The key problem management roles and responsibilities can include:
- Problem Manager – who is accountable for governance, prioritization, and process health
- Technical Subject Matter Experts (SMEs)/Engineers – who conduct RCA and design solutions
- IT service desk – where service desk agents log known errors and use workarounds
- Change Advisory Board (CAB) – which might approve permanent fixes
- Service Owners/Business Stakeholders – who provide context and confirm acceptance of problem resolutions.
Measuring the Success of Problem Management
To best measure success, problem management should employ metrics that reflect outcomes, not just activity. Examples of these include:
- The reduction in repeat incidents
- The number of problems converted into permanent fixes
- The Mean Time to Resolve root cause (MTTR-RC)
- Cost avoidance or service hours saved
- The decrease in the percentage of major incidents linked to known errors or problems.
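Two of the metrics above are simple enough to sketch in code. The calculations below are one reasonable interpretation rather than a standard definition: the function names, the choice of days as the unit, and the sample figures are assumptions for illustration.

```python
from datetime import date


def repeat_incident_reduction(before, after):
    """Percentage reduction in repeat incidents between two periods."""
    if before <= 0:
        raise ValueError("baseline period must contain repeat incidents")
    return (before - after) / before * 100


def mean_time_to_resolve_root_cause(problems):
    """Mean days from a problem being logged to its root cause being
    resolved. `problems` is a list of (logged_on, resolved_on) dates."""
    if not problems:
        raise ValueError("no resolved problems to measure")
    durations = [(resolved - logged).days for logged, resolved in problems]
    return sum(durations) / len(durations)


# Hypothetical figures: repeat incidents per quarter, and resolved problems.
reduction = repeat_incident_reduction(before=40, after=25)
mttr_rc = mean_time_to_resolve_root_cause([
    (date(2024, 1, 8), date(2024, 1, 22)),
    (date(2024, 2, 1), date(2024, 2, 11)),
    (date(2024, 3, 3), date(2024, 3, 9)),
])
print(reduction, mttr_rc)
```

Whatever definitions you choose, the important thing is to apply them consistently so that trends over time are meaningful.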
Common Problem Management Challenges (and How to Overcome Them)
Organizations can struggle to adopt problem management best practices effectively. Some of the most common barriers to success include:
- IT support teams feel they are “too busy firefighting”
- The difference between incident management and problem management is not well understood
- Limited RCA skills
- Weak data quality in incident records
- No clear link to change enablement, knowledge management, or service configuration management.
A Practical Roadmap to Adopting Problem Management
Your organization doesn’t need a complex problem-management initiative to begin with. Instead, it can start with a simple phased approach such as:
- Identifying repeat incidents and major outages
- Documenting workarounds and creating known error records
- Introducing structured root cause analysis
- Linking fixes with change enablement
- Expanding into proactive detection using monitoring or artificial intelligence (AI) insights.
From Reactive IT to a Prevention-First Mindset
While this article has aimed to provide a simple but practical insight into what problem management (according to ITIL 4) is, a key learning point for me is that:
Problem management isn’t just another ITSM or ITIL discipline – it’s a mindset shift.
If you read the available best practices, problem management might seem like too much to do (especially with limited resources). But, believe me (and it sounds clichéd), you can definitely start small and build on your successes.
FAQs
What is problem management in ITSM?
Problem management is the service management discipline responsible for identifying the root cause of recurring or significant issues, documenting known errors, and making sure permanent fixes are implemented. Where incident management restores service as quickly as possible, problem management asks why an issue happened and how to stop it recurring. The article covers the discipline as defined in ITIL 4.
What is the difference between incident management and problem management?
Incident management asks how to restore service as quickly as possible; problem management asks why the issue happened and how to prevent it. Using the article’s example, if Wi-Fi goes down, rebooting the network controller is the incident workaround, but discovering that a firmware bug fails under high load is the problem, and applying the vendor patch and adjusting capacity thresholds is the permanent fix. Incident management restores service; problem management eliminates the disruption.
What is the difference between reactive and proactive problem management?
Reactive problem management is triggered after an incident, often a major one or a pattern of repeats, such as five similar printer outages over two weeks. Proactive problem management identifies potential failures before incidents occur, using data, analytics, or monitoring, such as disk utilization trending toward failure triggering capacity planning. The goal is to move from responding to incidents toward preventing them.
What are the stages of the problem management process?
The article sets out five stages: detection, logging, and prioritization based on business impact, urgency, and risk; root-cause analysis using tools like 5 Whys, fishbone diagrams, fault tree analysis, Pareto charts, or timeline mapping; developing workarounds and known error records; implementing permanent fixes through change enablement; and review, closure, and capturing lessons learned.
How do you measure the success of problem management?
The article recommends outcome-based metrics rather than activity counts: the reduction in repeat incidents, the number of problems converted into permanent fixes, mean time to resolve root cause (MTTR-RC), cost avoidance or service hours saved, and the decrease in major incidents linked to known errors or problems.
Sophie Danby
Sophie is a freelance ITSM marketing consultant, helping ITSM solution vendors to develop and implement effective marketing strategies.
She covers both traditional areas of marketing (such as advertising, trade shows, and events) and digital marketing (such as video, social media, and email marketing). She is also a trained editor.
