So here’s the thing. As a Major Incident Manager or Problem Manager, you can do all the ITIL training in the world but nothing can really prepare you for your first ever major incident or crisis. It’s definitely something that becomes easier with experience.
I’ve been a Problem Manager, on and off, for years, so here are my top ten tips for dealing with a crisis without hiding under your desk, chain smoking, or mainlining vodka.
Easier said than done, I know. It’s hard being the calm, sensible one when everyone else is losing it, so I’ve mastered the art of pretending. The idea is to have an outward appearance of calm, because if you’re calm, the people around you will start to calm down and you effectively take the sense of panic out of the situation.
It might sound obvious but when you’re dealing with a crisis, panicking isn’t going to help anyone.
Look After Your People
Is anyone in immediate danger?
If so, invoke the right safety protocols; be it an emergency power off (EPO) button or removing people from site and getting them to safety. Once the immediate danger has been contained you can look at who does what and the lessons learned. But take care of your people first.
How serious is this? Could it be time to invoke disaster recovery (DR) plans? Make change management aware that there may well be the need for an emergency change to fix things. And if your service desk is noticeably struggling, can you get more people in to take the pressure off your existing shift?
The tone you’re looking for is calm but brisk efficiency.
As a Crisis Manager, ensure that everyone involved with the major incident knows to update you so that you can then send out appropriate updates to those that need to be informed. Not only does this mean you have everything captured for the report, it also saves your support teams from being asked the same question by ten different people, freeing them up to help fix the issue.
Be as proactive as you can in getting the message out. For instance, as a senior manager there’s nothing worse than being told about an issue by an irate customer. So ensure your senior management team are kept in the loop with everything they need to know about the issue and its impact.
If you’re really unlucky you may have to deal with the press or regulatory bodies. In the past year or so we’ve seen many big firms experience IT outages, be it Eircom in Ireland, NatWest in the UK, or Target in the US. My heart always goes out to the major incident, problem, and service desk managers involved because let’s face it – what always makes managing a crisis that much easier? That’s right folks, being the main headline on news websites, making the national press, or trending on Twitter for the worst possible reasons.
If you have a CMDB or a service catalog, try to see if the issue’s impact extends to other customers or service towers and warn them accordingly. Also make sure the service desk has updated the welcome message on their Automatic Call Distribution (ACD) system to try and stop the avalanche of calls. Why is this so important, I hear you ask? Well firstly, nothing is more stressful to a service desk analyst than having multiple calls in the queue waiting to be answered. The second reason is something that happened on my watch a long time ago in a galaxy far, far away. We got a message onto the ACD system, but by the time we managed to deploy it, the system was unable to cope with the number of calls and crashed. Now the issue was so much worse. Not only was the business service down, so was the service desk, which meant no one could get through to report other issues and our support teams had two major incidents to fix rather than one. Not good.
Have A Fix? Test and Verify
Brilliant. Bob from the server team has a fix. But has it been tested and checked?
Remember before when we talked to change management to pre-warn them that an emergency change might be needed as part of the fix effort? Go talk to them and raise a change with all the available details while Bob is testing. The change record doesn’t have to be perfect but will need to have the key activities, who will be involved in doing the work, and rough timings.
Manage The Fix
Make sure that Bob has everything he needs to get the fix in successfully. Make sure there are enough people on hand, for example other support teams or third-party support, to ensure that there are no hiccups.
Check Everyone is Up and Running
“Hurrah we’re back in business!” I know the temptation is to shout this from the rooftops but do a quick sanity check first. If your DNS server was down, check to make sure you can access the outside world. Telephony down? See if you can make a call. Website down? See if you can access it and click on some content links to make sure that the whole thing is back up, not just the landing page. You get the idea – check to make sure that everything is as it should be before you break out your victory dance.
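If you’d rather not do these checks by hand while everyone is watching you, a few of them can be scripted in advance. What follows is only a rough sketch of the idea, not a monitoring tool: the function names are made up for illustration, the `example.com` addresses are placeholders for your own services, and it assumes a plain HTTP 200 is good enough to count as "up".

```python
import socket
import urllib.request
from typing import Callable, Dict


def dns_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def page_loads(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts are all OSError subclasses
        return False


def run_sanity_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def all_clear(results: Dict[str, bool]) -> bool:
    """Only declare victory if at least one check ran and every check passed."""
    return bool(results) and all(results.values())


if __name__ == "__main__":
    # Placeholder targets: swap in the landing page AND a few content links.
    checks = {
        "dns": lambda: dns_resolves("example.com"),
        "landing page": lambda: page_loads("https://example.com/"),
        "content link": lambda: page_loads("https://example.com/some/content"),
    }
    results = run_sanity_checks(checks)
    for name, passed in results.items():
        print(f"{name}: {'OK' if passed else 'FAILED'}")
    print("Victory dance approved" if all_clear(results) else "Not yet, keep checking")
```

Checks are passed in as plain callables so you can add whatever matters for the service that was down. Something like a test phone call still needs a human.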
Deal with The Immediate Aftermath
Capture as much information as you possibly can as you’re going along, because once this issue is fixed people tend to be so focused on the next issue that they forget things. So ensure that you capture everything while it’s still fresh in people’s minds.
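If you don’t have a tool that does this for you, even a tiny timestamped log beats relying on memory. Here’s a minimal sketch of what I mean. The class names and the `INC-0001` reference are invented for the example, and it assumes you’re happy recording everything in UTC.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    timestamp: datetime
    author: str
    note: str


@dataclass
class IncidentTimeline:
    reference: str  # hypothetical incident reference, e.g. "INC-0001"
    entries: List[TimelineEntry] = field(default_factory=list)

    def log(self, author: str, note: str, when: Optional[datetime] = None) -> TimelineEntry:
        """Record who said what; defaults to the current time in UTC."""
        entry = TimelineEntry(when or datetime.now(timezone.utc), author, note)
        self.entries.append(entry)
        return entry

    def report(self) -> str:
        """Return the entries in time order, ready to paste into the incident report."""
        lines = [f"Timeline for {self.reference}"]
        for entry in sorted(self.entries, key=lambda e: e.timestamp):
            lines.append(
                f"{entry.timestamp:%Y-%m-%d %H:%M} UTC  {entry.author}: {entry.note}"
            )
        return "\n".join(lines)


if __name__ == "__main__":
    timeline = IncidentTimeline("INC-0001")
    timeline.log("Service desk", "ACD welcome message updated")
    timeline.log("Bob", "Fix tested, emergency change raised")
    print(timeline.report())
```

The report sorts by timestamp, so updates that reach you late can still be logged with the time they actually happened.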
Major Incident Review Meeting
AKA the post mortem or drains-up meeting; not a witch-hunt.
Set ground rules and reassure everyone in the room that the meeting is to look at what happened and how it can be prevented from recurring, not to assign blame. If people think that they’re going to get blamed, then they’ll clam up and you’re not going to get very far. By making people relax and feel comfortable you’ll get to the root cause much quicker, as well as any actions to prevent recurrence.
When you capture your lessons learned make sure they’re documented, shared, and acted on. The easiest way to do this is to add them to a CSI register if your organization has one. Whatever happens, make sure they’re not forgotten – if the same incident happens six months down the line, people tend to be much less forgiving if it could have been prevented.
Look After Your People (Again)
So important, it’s worth mentioning twice. Okay, so you have restored service, told your stakeholders, dealt with the fallout, and captured lessons learned. The chances are that you and your team are stressed out and shattered. So now is the time for motivation in the form of time off in lieu, caffeine, or team building in the form of a quick trip to the pub after work. Not something you’ll necessarily find in any book or training course but it will do wonders for morale.
That’s me done. What are your top tips for dealing with a crisis? Please let me know in the comments!
Vawns Murphy holds qualifications in ITIL V2 Manager (red badge) and ITIL V3 Expert (purple badge), and also has an SDI Managers certificate. Plus she holds further qualifications in COBIT, ISO 20000, SAM, PRINCE2, and Microsoft. In addition, she is an author of itSMF UK collateral on Service Transition, Software Asset Management, Problem Management and the "How to do CCRM" book. She was also a reviewer for the Service Transition ITIL 3 2011 publication.
In addition to her day job as a Senior ITSM Consultant at i3Works, she is also an Associate Analyst at ITSM.tools.