I've looked around the forum for answers to my question, but I can't say I've spent a lot of time on it and I'm in something of a crunch. I want to know which common reports help to identify problem candidates and potential problems within the environment.

I was recently assigned the task of producing SQL reports for problem management analysis. The goal of these reports is to identify problems within the environment. My question is not only which reports are common and helpful, but also what to "look out for" within them (e.g. discovering a growth trend in a resolver group's incidents).
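
For the growth-trend example, a report along these lines might be a starting point. This is only a sketch: the `incidents` table, its `resolver_group` and `opened_on` columns, and the sample rows are all made up for illustration, and your ticketing tool will almost certainly name things differently. It counts incidents per resolver group per month and shows the change from the previous month.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents ("
    "id INTEGER PRIMARY KEY, resolver_group TEXT, opened_on TEXT)"
)
conn.executemany(
    "INSERT INTO incidents (resolver_group, opened_on) VALUES (?, ?)",
    [
        ("Desktop", "2024-01-05"), ("Desktop", "2024-01-20"),
        ("Desktop", "2024-02-03"), ("Desktop", "2024-02-11"),
        ("Desktop", "2024-02-17"), ("Desktop", "2024-02-25"),
        ("Network", "2024-01-09"), ("Network", "2024-02-14"),
    ],
)

# Incidents per resolver group per calendar month.
MONTHLY_SQL = """
SELECT resolver_group,
       strftime('%Y-%m', opened_on) AS month,
       COUNT(*) AS incidents
FROM incidents
GROUP BY resolver_group, month
ORDER BY resolver_group, month
"""


def monthly_trend(conn):
    """Return {group: [(month, count, change_vs_previous_month), ...]}."""
    trend = {}
    for group, month, count in conn.execute(MONTHLY_SQL):
        rows = trend.setdefault(group, [])
        previous = rows[-1][1] if rows else None
        change = None if previous is None else count - previous
        rows.append((month, count, change))
    return trend


trend = monthly_trend(conn)
for group, rows in trend.items():
    print(group, rows)
```

A group whose count keeps climbing month after month is the kind of thing to "look out for"; what counts as significant growth is a judgment call for your Problem Management team.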

What data is kept by your problem mgmt tool?
What data is in your CMDB?
How is it organized?
How deep or detailed is it?
Does your PM tool link to it?
Is it relational, so that you can see single points of failure for systems, network, etc.?
How are problems classified within the tools you use?
How are incidents classified?
Does your company have a PM process, team, policy, etc.?

Something to keep in mind is that all enterprises already do some form of Problem Management. It's called "Defect Tracking". ITIL has over-complicated this.

The reality is that to have successful Problem Management, the very first step is to ensure that you have a standard Defect Tracking & Management process in place. This should address how you capture and log Defects, the processes for addressing them, etc. Fundamentally, Defects (Problems) are assigned or related to the Products & Services they are identified for. This ensures that they flow to the Product Owners and Service Owners, respectively.

Putting this form of Problem Management in place will address the "vertical" aspect of Problem Management. In other words, Defects will be vertically aligned with the Products/Services they represent. The next step is to address the "horizontal" aspect of Problem Management. This is looking across your Incident landscape to identify Problems that don't necessarily have anything to do with Defects. For example, if you have 1,000 people a day calling the help desk to reset their passwords and you can put in a simple self-serve solution to eliminate the labor costs, you can identify the 1,000 calls as a Problem for the Password Reset Service, which gets assigned to the Service Owner, who will be responsible for coming up with a solution and fixing the problem.
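
To make the "horizontal" scan concrete, here is a minimal sketch of a volume report. The `incidents` table, the `service` and `category` columns, the sample rows, and the threshold are all assumptions for illustration, not anything your tool is guaranteed to have. It groups incidents by service and category and keeps only the groups at or above a chosen volume.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents ("
    "id INTEGER PRIMARY KEY, service TEXT, category TEXT)"
)
sample_rows = (
    [("Password Reset Service", "password reset")] * 5
    + [("Print Service", "map printer")] * 3
    + [("Email", "mailbox full")]
)
conn.executemany(
    "INSERT INTO incidents (service, category) VALUES (?, ?)", sample_rows
)

# High-volume service/category pairs, largest first.
CANDIDATES_SQL = """
SELECT service, category, COUNT(*) AS incidents
FROM incidents
GROUP BY service, category
HAVING COUNT(*) >= ?
ORDER BY incidents DESC
"""


def problem_candidates(conn, min_incidents):
    """Return (service, category, count) rows at or above the threshold."""
    return conn.execute(CANDIDATES_SQL, (min_incidents,)).fetchall()


candidates = problem_candidates(conn, 3)
for row in candidates:
    print(row)
```

Each row that comes back is only a candidate; it would still go to the Service Owner to decide whether a fix (such as self-service password reset) is worth the cost.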

There is also the concept of Risk Management, which is the management of "Potential Problems". Most enterprises don't properly break Risks out from Problems. Risks are tracked and managed in a completely different way. There are usually Tactical Mitigations and Strategic Mitigations that need to be planned and implemented. Ignoring this will definitely leave a very large gap in your Problem Management process.

Also, good luck with creating SQL reports to handle the situation. It's definitely going to be an ugly way to help collect, manage, and communicate your data. It might be far more work than it's worth.

Thank you both for responding. Frank, I will answer your questions first-

We do have a Problem Management team that identifies problem stakeholders, assigns the problem to them for resolution, sets the severity level, etc.

Could you provide additional examples of "horizontal" problem management? Are you implying in your example that end users who forget their passwords is not a defect? Would this example go into the problem log?

Somewhat tying into my previous question, it seems our Service Desk gets many calls on how to map network shares and printers. Would something like this be a problem candidate?

Lastly, are you suggesting that using incident data (which is what my SQL queries would run against) is not a practical way of identifying problems?

Is it a problem that users contact the service desk to ask how to do things?

These are not problems (per ITIL)

Problems per ITIL are things like:

Why does the machine panic reboot every 3 1/2 days for an UNKNOWN reason?

In order to create a problem (ITIL), you need incidents that are the basis of the problem. The incident is complete once service is restored, which should happen as quickly as possible. The problem is complete when the underlying root cause is found AND a solution is found. Then a change is raised to approve the implementation of the solution, a release is used to implement it, and config mgmt is updated.

What you have are Incidents (including Service Requests (v2)). They are resolved when your SD tells the user how to do the thing they asked about.

You can create a KB or FAQ for the most common incidents/Service Requests as a "How do I..." / "What if..." site.

This way, the users can go to that first.

From a pure dictionary definition, yes, these are problems. It appears that your staff has insufficient training or depth of knowledge on computers, etc.

Does your company have a training department? If so, does it provide training in basic PC and other skills?

I don't know what you mean by 'horizontal' problem mgmt.

As to the user who forgets his password and has your SD reset/change it:

This is an incident/Service Request issue. It should be logged against the individual who raised it.

Since I don't know the individuals who are forgetting their passwords, I cannot really consider the issue a defect on their part, for the following reasons.

1 - Humans and their failings (defects, etc.) are due to genetics (nature) and their environment (nurture)
2 - The manufacturer(s) (parents) of the human(s) with the failings can't be sued.
3 - There are no real solutions to the failings. It is part of being human.

What should happen is as follows:

On a monthly basis, the SD should produce a report called 'Password Reset Report'
It should list the # of calls/incidents
It should list the departments and # per departments
It should list individuals who have more than # password reset incidents
_________________
John Hardesty
ITSM Manager's Certificate (Red Badge)
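
A rough sketch of the 'Password Reset Report' described above, for anyone building it in SQL. Everything here is an assumption for illustration: the `incidents` table, the `caller`, `department`, `category` and `opened_on` columns, the sample rows, and the threshold of 2 standing in for the "#" in the last item.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, caller TEXT, "
    "department TEXT, category TEXT, opened_on TEXT)"
)
conn.executemany(
    "INSERT INTO incidents (caller, department, category, opened_on) "
    "VALUES (?, ?, ?, ?)",
    [
        ("alice", "Finance", "password reset", "2024-03-02"),
        ("alice", "Finance", "password reset", "2024-03-09"),
        ("alice", "Finance", "password reset", "2024-03-21"),
        ("bob", "Finance", "password reset", "2024-03-15"),
        ("carol", "Sales", "password reset", "2024-03-04"),
        ("carol", "Sales", "map printer", "2024-03-05"),
        ("dave", "Sales", "password reset", "2024-04-01"),
    ],
)

# Fixed filter shared by all three queries (a constant, not user input).
WHERE = "category = 'password reset' AND strftime('%Y-%m', opened_on) = ?"


def password_reset_report(conn, month, threshold):
    """Build the three parts of the monthly report for month 'YYYY-MM'."""
    total = conn.execute(
        f"SELECT COUNT(*) FROM incidents WHERE {WHERE}", (month,)
    ).fetchone()[0]
    by_department = conn.execute(
        f"SELECT department, COUNT(*) FROM incidents WHERE {WHERE} "
        "GROUP BY department ORDER BY department",
        (month,),
    ).fetchall()
    repeat_callers = conn.execute(
        f"SELECT caller, COUNT(*) FROM incidents WHERE {WHERE} "
        "GROUP BY caller HAVING COUNT(*) > ? ORDER BY caller",
        (month, threshold),
    ).fetchall()
    return {
        "total": total,
        "by_department": by_department,
        "repeat_callers": repeat_callers,
    }


report = password_reset_report(conn, "2024-03", 2)
print(report)
```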

I'll throw you one more example: say the same user calls the help desk repeatedly, reporting a blue screen of death. Does this become a problem until the BSODs stop occurring? Additionally, if the problem is corrected by re-imaging the machine, what would be the "root cause"?

This is actually one of my favorite questions. A Problem is defined as a group of Incidents with similar symptoms for which the root cause is unknown. What you have described is definitely a Problem, but the interesting question is "Is this worthy of research by the Problem Management process?"

Just because you have identified a Problem doesn't mean you want to invest in true Problem Management. Indeed, with the solution of re-imaging the system, you have done no Problem Management. You did Incident Management. Did you ever know what the root cause was? Did you know what component was the cause of the blue screen? No, you just implemented a solution that you knew would probably solve the issue.

Which is a good thing.

The point where you invoke the Problem Management process is where you have determined that it is not cost effective to allow the Incident to recur. To prevent that, you must determine the root cause of the Incident.

This may involve discovering that the "fix" to the Problem is outside the control of your organization. It could be a fault of the OS, drivers, or another off-the-shelf product for which you can't implement a fix, in which case the Incident Management solution of re-imaging the PC is the best fix.

But in situations where the resolution is within the scope of your organization (say, your in-house development group has to create multiple versions of a DLL depending on which version of anti-virus the client is running), Problem Management's job is to submit an RFC to implement the fix.

And that is where Problem Management stops. It is now up to Change and Release Management to approve and implement the fix.

Once Change Management has verified that the Release has successfully resolved the Problem, the Problem record can be closed.

If you never venture into the Problem Management process and use Incident Management's quick fix of re-imaging the PC, the Problem record will remain open.

And again I will say - This is a good thing. It may not be cost justifiable or within the scope of the organization to fully remove the root cause of the Problem.

So the PC BSoDs and gets re-imaged as the solution; then the PC BSoDs again, and so on. Repeat daily.

Did anyone consider that the image may be the issue?

The problem with images is that if you take an image of a PC, that image will work on the PC it came from and, for the most part, on any PC that is identical to it.

Any variances in hardware, such as I/O boards, RAM, hard drive size, etc., may cause the image not to work, or cause the installed OS to BSoD.

If a PC BSoDs, the immediate solution is to reboot the system.
If it happens again, the tech would usually write down the error message that appears on the screen.

So this person has raised # incidents (1 a day), and the solution used to restore service for each incident was to RE-IMAGE the device.

The fact that the # of incidents for this person is high may mean that the set of incidents is a candidate for a Problem being created. But the Problem Mgmt team would have a set of criteria to determine that.
_________________
John Hardesty
ITSM Manager's Certificate (Red Badge)
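
As a sketch of how such criteria could be applied to incident data, the query below flags any device that has raised the same symptom a minimum number of times. The `incidents` table, its `device`, `symptom` and `resolution` columns, the sample rows, and the threshold of 3 are all hypothetical; the real criteria would come from the Problem Mgmt team.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, caller TEXT, "
    "device TEXT, symptom TEXT, resolution TEXT)"
)
conn.executemany(
    "INSERT INTO incidents (caller, device, symptom, resolution) "
    "VALUES (?, ?, ?, ?)",
    [
        ("erin", "pc-042", "BSOD", "reboot"),
        ("erin", "pc-042", "BSOD", "reimage"),
        ("erin", "pc-042", "BSOD", "reimage"),
        ("erin", "pc-042", "no sound", "driver update"),
        ("frank", "pc-017", "BSOD", "reboot"),
    ],
)

# Same device, same symptom, repeated at least N times.
REPEATS_SQL = """
SELECT device, symptom, COUNT(*) AS incidents
FROM incidents
GROUP BY device, symptom
HAVING COUNT(*) >= ?
ORDER BY incidents DESC, device
"""


def repeat_incidents(conn, min_repeats):
    """Return (device, symptom, count) rows that meet the repeat threshold."""
    return conn.execute(REPEATS_SQL, (min_repeats,)).fetchall()


repeats = repeat_incidents(conn, 3)
for row in repeats:
    print(row)
```

A hit here only says the incidents are worth a look; whether a Problem record is actually raised is still a cost decision, as discussed above.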