The digitization of business processes should make them more efficient and give their users a better experience. However, these benefits often make stakeholders more and more dependent on the IT systems involved. In this article, I therefore highlight a topic that is increasingly important, yet all too often forgotten or addressed only at the technical level: business continuity.
Higher, further, faster: This is the credo of many digitization initiatives currently being launched at many levels of our society. IT systems play an increasingly central role in them, whether it’s self-checkout at the supermarket, booking a vacation trip, consulting a doctor (sometimes without even having to go there in person), or even electing a new government. Users of these IT systems therefore rely more and more on them functioning correctly, sometimes even too much. Dealing with business continuity helps, among other things, to meet this expectation.
Digitization often increases the complexity of the overall system
For all the advantages of handling (potentially critical) value creation processes via IT systems, there is one major drawback: the additional complexity introduced by the necessary software and/or hardware stacks. As a result, digitized business processes can be disrupted or even interrupted more easily. Relying on hope is not an option here: it is only a matter of time before IT systems fail or the surrounding conditions change abruptly and disruptively, affecting their operation, and unfortunately often not for the better.
A good example is the outage that occurred at Meta (formerly Facebook) this year: Not only were the widely used services WhatsApp, Facebook and Instagram no longer accessible, but Meta employees were also unable to enter certain company buildings and conference rooms or to send external e-mails. A blow on all fronts, so to speak.
How was the problem finally solved? It is rumored that a team of technicians had to be dispatched to a data center in California to manually restart the affected servers. This is a striking example of an abrupt system failure (which, as it turned out, was caused by a configuration change) that had profound global implications and whose handling and eventual resolution relied on processes that had to be established in part on an ad hoc basis. In short, a textbook business continuity case.
What is business continuity?
The term business continuity describes the ability of a company or an organization to continue the delivery of products or services to a predefined extent if a disruptive event occurs.
In most cases, a disruptive event is not a simple failure of an IT system (for example, the failure of the access system at a company location is not necessarily to be regarded as such, although it is also potentially disruptive), but rather an event that leads to the abrupt impairment of entire value chains. One example is the currently rampant “Omicron” variant of SARS-CoV-2, which caused thousands of flights worldwide to be cancelled over the Christmas holidays because infected flight personnel had to go into quarantine.
Identify and prioritize risks
“First, you think better and second, things turn out differently.” This variation on the popular saying is meant to show that it usually pays to prepare for disruptive events. Even if the event plays out differently in “colorful” reality than in gray theory, you have at least made employees aware that it can occur and, where necessary, already procured resources that would be hard to obtain once disaster strikes. Everyone surely still remembers the example of protective masks at the beginning of the above-mentioned pandemic. But what about, for example, procuring emergency power generators so that core systems can keep running during power failures or shutdowns?
Accordingly, the most important foundation for business continuity is that disruptive events are identified, analyzed and managed across the enterprise. In the simplest case, you maintain an up-to-date list of potential events, their impacts, and appropriate measures to mitigate them in an emergency. The best way to create such a list is to consult regulatory and general sources as well as industry-specific analyses, and to supplement them with risk events identified at the organization level. Grouping the events into categories such as “social,” “political,” or “technical” makes them easier to maintain and communicate.
Let’s face it: preparing for hypothetical events is not in the nature of humans and, consequently, not in the nature of groups of humans either. When implementing measures, it is therefore worthwhile to plan them according to the priority of the associated risks. This focuses resources and keeps the implementation effort small, which makes it more likely that the necessary funds will be made available.
For example, a simple estimate of criticality for a single event can be made as follows:
Criticality of an event = probability of occurrence of the event (e.g., per year) × impact of the event (e.g., financial, or an alternative metric)
For a list of events, one can thus individually estimate their criticality and prioritize the implementation of those measures that affect the most critical events. For example, a global assessment of disruptive events for a fictitious company shows that the risks “power shortage” and “delivery problems of semiconductor components” are the most critical.
The measures derived, “Acquisition of emergency power generators” and “Stockpiling of semiconductor components”, should therefore be implemented as a matter of priority and corresponding projects launched. Measures relating to other disruptive events will either be taken later or, if their criticality has been upgraded, moved forward in time (see below).
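The prioritization described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming that probability and impact can each be expressed as a single number. The class `DisruptiveEvent`, the function `prioritize`, and all figures are hypothetical and do not come from a real assessment.

```python
from dataclasses import dataclass


@dataclass
class DisruptiveEvent:
    name: str
    probability_per_year: float  # assumed likelihood of occurrence per year
    impact: float                # assumed financial damage per occurrence
    measure: str                 # mitigation measure assigned to the event

    @property
    def criticality(self) -> float:
        # Criticality = probability of occurrence x impact
        return self.probability_per_year * self.impact


def prioritize(events):
    """Return the events sorted from most to least critical."""
    return sorted(events, key=lambda e: e.criticality, reverse=True)


# Illustrative figures only; none of these numbers come from a real assessment.
events = [
    DisruptiveEvent("Failure of the office access system", 1.0, 10_000,
                    "Keep mechanical keys available"),
    DisruptiveEvent("Power shortage", 0.5, 400_000,
                    "Acquisition of emergency power generators"),
    DisruptiveEvent("Staff shortage due to a pandemic wave", 0.125, 400_000,
                    "Cross-train employees"),
    DisruptiveEvent("Delivery problems of semiconductor components", 0.25, 600_000,
                    "Stockpiling of semiconductor components"),
]

ranked = prioritize(events)
for event in ranked:
    print(f"{event.criticality:>10,.0f}  {event.name}: {event.measure}")
```

If the estimated probability or impact of an event changes later, re-running the ranking moves its measure up or down the list accordingly.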
Derive and implement measures
Now, of course, the question arises as to which measures can be taken at all to ensure sufficient business continuity when a given event occurs. As the definition above implies, this first requires metrics that describe the extent to which the affected value creation processes must continue to provide products or services.
Possible approaches here are quality parameters that relate to the product or service itself, and time periods within which the restricted value creation process must be restored. For the latter, the Recovery Time Objective (RTO, the maximum tolerable time until a process is restored) and the Recovery Point Objective (RPO, the maximum tolerable period of data loss, measured back from the disruption) are particularly well known, but you should not be afraid to define your own metrics.
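To make the two time-based metrics more tangible, here is a small sketch that checks a single outage against them. It assumes the common reading of RTO as the maximum tolerable downtime and RPO as the maximum tolerable window of lost data, measured back from the start of the outage to the last backup. The function `check_objectives`, the timestamps, and the target values are all hypothetical.

```python
from datetime import datetime, timedelta


def check_objectives(outage_start, service_restored, last_backup, rto, rpo):
    """Compare one outage against the agreed RTO and RPO."""
    downtime = service_restored - outage_start
    # Data written after the last backup is assumed to be lost.
    data_loss_window = outage_start - last_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical outage and hypothetical targets, for illustration only.
result = check_objectives(
    outage_start=datetime(2022, 1, 7, 15, 0),
    service_restored=datetime(2022, 1, 8, 14, 0),
    last_backup=datetime(2022, 1, 7, 14, 30),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
print(result)
```

In this made-up case the backup was recent enough to satisfy the RPO, while the long downtime misses the RTO by a wide margin. The same pattern works for self-defined metrics as long as they can be measured after an incident or an exercise.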
What types of measures can be taken? Measures to ensure business continuity can be taken at various levels, for example at
- organizational level
- process organization or process level
- technical level
- legal level
although other types of measures, for example at the level of corporate communications, are also entirely possible.
Organizations in the ICT sector in particular often tend to focus on technical measures and neglect other aspects. However, it is extremely important that business continuity plans (BCP) are created and implemented as holistically as possible, otherwise it is not possible to ensure that full value creation is maintained.
Let me give you an example: The integration of online orders into the mobile app of your fictitious online store has resulted in a majority of users now using it to place orders. One bad Friday afternoon, your support hotline suddenly receives a flood of calls from people complaining that they can no longer place orders in the mobile app.
After some back and forth, it turns out that your ordering platform, which is hosted by a cloud provider, was affected by an outage there. The situation only eases during the course of Saturday afternoon, when the provider announces that the disruptions in the Availability Zone (AZ) where your ordering platform is hosted have been resolved. Of course, by this time there have already been numerous complaints (including some less than flattering ones on social media), and many customers have decided to order elsewhere. The financial damage is considerable.
What measures could you have taken in advance to address this risk in terms of business continuity? On a technical level, it would certainly have been advantageous if your operating organization had been informed about the disruption at the cloud provider at an early stage (e.g., via appropriate notifications) and, in addition, if your ordering platform had been hosted across different AZs.
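The two technical measures, early notification and hosting across several AZs, can be illustrated with a deliberately simplified sketch. In practice this job is usually done by a load balancer or the failover mechanisms of the cloud provider. The code below only shows the idea and makes no network calls. The function `pick_healthy_endpoint`, the zone names, and the URLs are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def pick_healthy_endpoint(
    endpoints: Dict[str, str],
    is_healthy: Callable[[str], bool],
    notify: Callable[[str], None],
) -> Optional[str]:
    """Return the URL of the first healthy zone and report every unhealthy one."""
    chosen = None
    for zone, url in endpoints.items():
        if is_healthy(url):
            if chosen is None:
                chosen = url
        else:
            # Early warning for the operations team (and, via a process, support).
            notify(f"Ordering platform unreachable in {zone}")
    return chosen


# Hypothetical deployment of the ordering platform in two AZs.
endpoints = {
    "az-1": "https://orders-az1.example.com",
    "az-2": "https://orders-az2.example.com",
}

# Simulated health check: az-1 is down. A real check would probe the service.
down = {"https://orders-az1.example.com"}
notifications: List[str] = []

active = pick_healthy_endpoint(
    endpoints,
    is_healthy=lambda url: url not in down,
    notify=notifications.append,
)
print(active, notifications)
```

Passing the health check and the notification channel in as functions keeps the sketch independent of any particular provider and makes it easy to exercise in an emergency drill.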
In addition, if a corresponding process had been in place, support could have been informed about the existing disruption in order to be able to provide information directly to the calling customers and possibly offer them an alternative ordering option (for example, via web form).
Last but not least, assuming that your ordering platform is a commercial off-the-shelf (COTS) solution, you could have made sure in its maintenance contract that any service outages are financially compensated. At this point, I’m sure you and I could think of other measures. But keep in mind here as well: not all measures are equally effective, which is why they too should be prioritized.
Derivation and implementation of measures as an ongoing process
You may have noticed that deriving and implementing business continuity measures is no trivial matter and takes some time. But there’s more to it than that: Since the nature of the events that threaten an organization’s value creation changes continuously, it is not enough to derive and implement measures just once. It is better to view business continuity and its assurance via measures as an ongoing process that takes the form of a Deming cycle.
Accordingly, you should follow the simple Plan-Do-Check-Act pattern, regularly check whether your measures are still meaningful and effective, and correct them if necessary. I am aware that this is not an easy undertaking and in some cases simply not possible. However, even isolated emergency exercises can help to check how well a measure has been implemented. In the above case, for example, this could be a maintenance window for the ordering platform that is not announced to the support organization. In short: keep at it.
Business continuity becomes increasingly important in a VUCA world
The acronym “VUCA”, which stands for “Volatility, Uncertainty, Complexity, Ambiguity”, is often used to describe the current state of our world. You don’t necessarily have to be a pessimist to see a certain degree of truth in this characterization. It is precisely this dynamic environment, which is becoming increasingly difficult to predict, coupled with the growth in technical complexity associated with digitization, that is making business continuity increasingly important for organizations of all sizes. So keep at it: even small steps count.