Your company's site is down, you've become aware of an active data breach, or some other incident has occurred, and:
- it needs to be fixed straight away
- it can't ever happen again
- you need to deal with the fallout
Here are some pitfalls to avoid:
Wasteful lines of communication
You receive a notification from a monitoring tool about slower response times on a website you control.
Five minutes later, while you are assessing the cause and scope of the slowdown, the site goes down.
Two minutes after that, your manager wants to know what's happening, how it's possible this is happening to us, and whose fault it is.
Whilst talking to your manager, the CEO joins in to explain how bad it would be if the site were to stay down.
During an active incident, people will want to know what went wrong, whose fault it is, and what processes need to change to make sure it doesn't happen again. In the meantime, the incident will not be investigated or resolved.
The obvious solution would be to politely ask those people to wait until you have better answers and to reassure them that they will receive updates every X minutes. This doesn't work very well in real life. It wastes time, and not everyone can resist management pressure even when they know they should. What you need is a lightning rod. Designating someone as Incident Manager frees up the right people to focus on what needs to be done, whilst the rest of the business knows who to chase for information.
Another example of wasteful lines of communication is when multiple people and teams have each noticed that something is happening and are independently investigating it, discussing it, and coming to conclusions without being aware of each other's efforts. This is another reason why it's so useful for people to know about the lightning rod that is the Incident Manager. They contact the Incident Manager, who decides who is best placed to take action and then communicates with that person and with the rest of the business.
What if your business can't afford to designate someone as Incident Manager? In that case it could be agreed that there will be a messaging group in which any active threats to the business are raised and actions delegated by the group among themselves, with business stakeholders added to this group so they can view progress. However, it may be more expensive to distract all the group members in this way than to designate one person as managing the response to incidents.
Say a data breach has occurred and been handled, you understand exactly what happened, and recommendations have been put together for concrete improvements. Everyone agrees that any one of those improvements would have prevented the incident had it been in place.
A year later the same incident reoccurs, because the improvements were never implemented. How do you explain that?
To avoid having to explain it, prevent similar incidents from occurring again. Now, it's easy enough to say that you should chase recommendations for improvements until they've been completed, but there's a reason this is such a common omission. There may be strong pressure to return to business as usual, since the incident has already caused disruption and is now over. Senior management may add to this pressure, even though they are the very people with the most to lose if a similar incident were to reoccur. You can use that fact - that a repeat incident would damage the business more than the time and budget spent on preventive activities - to gain support.
So, use the leverage of a harmful event to make real changes. Then, once improvements have been implemented, consider how you might simulate the incident that has occurred and similar realistic versions of it, and test whether your preventive measures are adequate. Perhaps someone's login details were compromised via a spearphishing attack and used to log in to a CMS through which malware was added to a website. If you've implemented defense-in-depth protections, you should then test whether users recognise phishing attempts, whether users' permissions are limited to only what they actually need, whether the ability to add custom HTML/CSS/JS to the CMS has been disabled or reduced to only what is genuinely needed, and whether any additional relevant protections, such as IP whitelisting, work as intended.
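As a rough illustration of replaying an incident against layered protections, here is a minimal Python sketch. Everything in it is an assumption made for the example: the role names, the action names, the `can_perform` helper and the allowlisted network (a documentation-only address range) are hypothetical and do not describe any particular CMS.

```python
import ipaddress

# Hypothetical allowlist; 203.0.113.0/24 is a documentation-only range.
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

# Hypothetical least-privilege mapping of roles to permitted actions.
ROLE_PERMISSIONS = {
    "editor": {"edit_article"},
    "admin": {"edit_article", "edit_custom_html"},
}


def ip_is_allowed(client_ip):
    """Return True if client_ip falls inside any allowlisted network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


def can_perform(role, action, client_ip):
    """Allow an action only if the IP is allowlisted and the role grants it."""
    return ip_is_allowed(client_ip) and action in ROLE_PERMISSIONS.get(role, set())


# Replay variations of the incident: compromised credentials used to add custom HTML.
scenarios = [
    ("editor", "edit_custom_html", "203.0.113.10"),  # inside the office, role too weak
    ("admin", "edit_custom_html", "192.0.2.50"),     # stolen admin login, outside IP
    ("editor", "edit_article", "203.0.113.10"),      # legitimate everyday use
]
for role, action, ip in scenarios:
    verdict = "allowed" if can_perform(role, action, ip) else "blocked"
    print(f"{role} {action} from {ip}: {verdict}")
```

The point of the sketch is that each simulated variant should be stopped by at least one layer, while legitimate everyday use still goes through.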
Telling people to do their job better
Another pitfall in the recommendations for improvement is to conclude that procedures were not followed and that if people would just follow the procedures, the incident would not have happened. Whilst that conclusion might be entirely true, telling people to just follow procedures or be less negligent is a fool's errand. There's a reason (or there might be a complex web of reasons) why they didn't follow procedures, and telling them to do better doesn't remove those reasons.
Is their workload so high they simply can't take the time to go by the book if they want to deliver results? Do they face a great deal of context-switching and distractions? Are the procedures viewed as out of date and out of touch? Are the procedures difficult to find or make sense of? Do the required tools suffer from issues in functionality or usability which invite skipping steps or creative workarounds? There could be any number of reasons.
You can find out by interviewing the relevant person. They can explain all the reasons something seemed like a good idea at the time. In hindsight it will likely be just as clear to them as to you that something could have been done better, but that doesn't change the fact that everything they knew and experienced at the time led them to do something different. If you don't change what they, or someone else in their position, will know and experience the next time something similar happens, there's no reason to expect them to act differently.
Implicit expectations about out-of-hours support
When immediate action is required outside of regular business hours, you don't want to have to call around and find that the needed people are not available or don't have access to the tools they need. You also don't want to get into disagreements with vendors about their conditions of out-of-hours support.
Say you work with an offshore development team and you ask if they provide 24/7 on-call support in case of incidents. Perhaps they respond that you can email the project manager at any time and the project manager will then call someone from the team to assist. Don't stop there. Should we understand that the project manager has sound on for their email notifications at all times? Does the rest of their team know, and are they prepared for the fact, that any of them can be called at any time of the day or night to respond to an incident? Are we agreed on what counts as an incident? How do the lines of communication proceed once everyone who's needed is on the case?
Or say you have internal development teams that have agreed on-call schedules. Have you also agreed whether their time will be compensated? Have you agreed a method of communication that each team will use, such as a messaging group for which notifications can be set individually? Have you ensured there are fallbacks for people who turn out to be unreachable? Have you ensured that incidents which have been known to occur in the past and which can be easily resolved (such as through a service restart) have been documented, so you can avoid waking people up for such issues?
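The order of those last two questions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Responder` class, the `KNOWN_FIXES` runbook, the `respond` function and the example incidents are all hypothetical, and the `reachable` flag stands in for whatever acknowledgement mechanism your teams have actually agreed on.

```python
from dataclasses import dataclass


@dataclass
class Responder:
    name: str
    reachable: bool  # stand-in for "acknowledged the page in time"


# Hypothetical runbook: known incidents mapped to their documented fix.
KNOWN_FIXES = {"queue worker stalled": "restart the queue worker service"}


def respond(incident, escalation_chain):
    """Decide what happens first for an out-of-hours incident."""
    # Documented, easily resolved incidents do not need to wake anyone.
    if incident in KNOWN_FIXES:
        return f"runbook: {KNOWN_FIXES[incident]}"
    # Otherwise work down the chain until someone is reachable.
    for responder in escalation_chain:
        if responder.reachable:
            return f"paged: {responder.name}"
    return "unhandled: nobody in the chain was reachable"


chain = [Responder("primary on-call", False), Responder("secondary on-call", True)]
print(respond("queue worker stalled", chain))
print(respond("database unreachable", chain))
```

The final return value is the case worth discussing in advance: if nobody in the chain is reachable, the agreement should already say what happens next.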
Incorrect conclusions due to cognitive biases
The direction of an incident investigation can be powerfully derailed by cognitive biases, as can the root cause analysis and the forming of effective recommendations for improvement. It would be tiresome to list all the possible cognitive biases, so let's look at a few examples which will alert you to the danger, and then see how we can avoid this pitfall.
As you can tell from the order of events, the JS error wasn't the cause of the reduced performance of the site. In fact, the performance of the site did not change at any time (the problem was the office's WiFi connection). How then did the developer and manager get it so wrong? The developer didn't typically look at the browser console, so they assumed that since there was a problem and since the console showed an error, the two must be related. The developer could have seen from the error message that it had no relation to performance, and could have looked further to see whether there was any other explanation. The developer could also have challenged the premise of slow performance by checking the monitoring tools to see whether users of the site were experiencing any slowdown. The manager, for their part, could have asked whether the site was any slower instead of stating that it was.
In another example, perhaps a website goes down and someone discovers that there were hundreds of logins during the downtime. When the site returns after a server reboot, an additional basic authentication dialog is added to the login page to prevent what is seen as a brute force hacking attempt. Leaving aside whether this would be the best way to protect on-page and API logins from brute force attacks, there was no relation between the logins and the downtime. A cursory investigation would have shown that the hundreds of logins during the downtime came down to five logins per minute, which did not noticeably affect server or application performance and was the same average rate as in the days and weeks before the downtime.
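The cursory check described above amounts to a small piece of arithmetic: turn the raw count into a rate and compare it with the rate before the incident. Here is a minimal Python sketch of that comparison. The timestamps are made up for illustration, and the function names and the threshold factor of 3 are assumptions rather than recommended values.

```python
from datetime import datetime, timedelta


def logins_per_minute(timestamps, start, end):
    """Average logins per minute within the window [start, end)."""
    minutes = (end - start).total_seconds() / 60
    if minutes <= 0:
        raise ValueError("end must be after start")
    count = sum(1 for t in timestamps if start <= t < end)
    return count / minutes


def looks_anomalous(incident_rate, baseline_rate, factor=3.0):
    """Flag the incident window only if it clearly exceeds the baseline.

    The factor of 3 is an arbitrary illustrative threshold.
    """
    return incident_rate > baseline_rate * factor


# Made-up data: one login every 12 seconds (five per minute) for three hours,
# with the last hour standing in for the downtime.
day_start = datetime(2024, 1, 1, 9, 0)
logins = [day_start + timedelta(seconds=12 * i) for i in range(900)]
outage_start = datetime(2024, 1, 1, 11, 0)
outage_end = datetime(2024, 1, 1, 12, 0)

baseline = logins_per_minute(logins, day_start, outage_start)
during = logins_per_minute(logins, outage_start, outage_end)
print(f"baseline: {baseline:.1f}/min, during downtime: {during:.1f}/min")
print("anomalous" if looks_anomalous(during, baseline) else "in line with baseline")
```

In this made-up data set, 300 logins in an hour sounds alarming as a raw count, yet the rate is identical to the baseline, which is the comparison the raw count hides.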
How do you avoid such mistakes? Again, there is an obvious answer that doesn't work in real life: that people should be aware of cognitive biases in order to avoid them. This is much easier said than done, and training in recognising cognitive biases does not appear to transfer between contexts, even closely related ones. It is much easier, however, for someone trained in cognitive biases to recognise them in other people. As a result, a testing or quality specialist is well suited to being involved in incident investigation and remediation.