In July 2024, a faulty update from the cybersecurity firm CrowdStrike caused a global IT outage that affected 8.5 million Windows devices and led to the cancellation of more than 5,000 commercial airline flights. The incident also hit hospitals, media companies, and banks, resulting in an estimated $1.5 billion in losses. Dubbed the "largest IT outage in history," the event highlighted the vulnerabilities of our interconnected digital world.
While this particular incident grabbed global headlines due to its scale, massive IT outages are becoming increasingly common. As companies rely more on cloud services and many depend on the same IT software, the risk of widespread disruptions grows. Here are some notable examples of major incidents in recent years:
Atlassian JIRA Outage (April 2022): Many major corporations use Atlassian software (such as JIRA, Confluence and ServiceDesk) to manage their IT departments. A faulty deletion script removed the data of several hundred major customers. Due to the complexity of the issue, it took several weeks to fully restore service for all impacted accounts.
Amazon Web Services (AWS) Outages: AWS, the market leader in cloud services, has experienced significant outages almost every year. Because of its vast user base, each outage affects countless businesses. An overview of recent AWS outages is available at https://www.datacenterknowledge.com/outages/a-history-of-aws-cloud-and-data-center-outages.
Facebook (Meta) Outage (October 2021): An error during routine maintenance caused a global shutdown of Facebook’s services for 6-7 hours. The incident was triggered by a single mistaken command issued during a capacity check.
Fastly Outage (June 8, 2021): A minor setting change by a customer activated a dormant bug, causing major websites like The New York Times and CNN to go dark for nearly an hour.
Google Outage (December 14, 2020): Google’s authentication system ran out of storage, leading to a 45-minute outage affecting Gmail, YouTube, and Google Drive.
These incidents demonstrate how interconnected modern organizations are and how heavily they depend on cloud services and internet infrastructure. They also show the profound impact that small human errors at these providers can have. Because of the scale at which these providers operate, the economic losses from such outages are very large.
Businesses should be aware of how heavily they depend on IT and how severe the impact of a single human error can be. This risk is especially high in the financial services industry, which is both critical to the worldwide economy and highly dependent on IT, so it should be mitigated. This is where resilience comes into play. Building a resilient organization and resilient systems is a must-have for every financial institution, and it has to happen at all levels of the organization: business, operational, and technical.
Resilience is built by identifying all points of failure and taking measures to minimize the impact of each one. The difficulty, of course, is that the list of points of failure within a large organization like a bank is very long. It is therefore essential to focus first on the points of failure with the highest risk value, calculated by multiplying the probability of a failure by the cost incurred when that failure occurs.
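As an illustration of this prioritization, the sketch below ranks a handful of points of failure by risk value (probability multiplied by cost). The names, probabilities, and cost figures are invented for the example and are not taken from any real institution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePoint:
    name: str
    annual_probability: float  # assumed chance of failure per year, between 0 and 1
    cost_per_failure: float    # assumed cost (in any one currency) if the failure occurs

    @property
    def risk_value(self) -> float:
        # Risk value as described in the text: probability multiplied by cost.
        return self.annual_probability * self.cost_per_failure


def rank_by_risk(points):
    """Return the failure points ordered from highest to lowest risk value."""
    return sorted(points, key=lambda p: p.risk_value, reverse=True)


# Hypothetical inventory; every figure here is illustrative only.
failure_points = [
    FailurePoint("branch printer", 0.50, 2_000),
    FailurePoint("payment gateway vendor", 0.10, 800_000),
    FailurePoint("core banking database", 0.02, 5_000_000),
]

ranked = rank_by_risk(failure_points)
for point in ranked:
    print(f"{point.name}: {point.risk_value:,.0f}")
```

Note how the frequently failing printer ends up at the bottom of the list, while the rarely failing core database comes first: a low probability can still carry the highest risk value when the cost of a failure is large.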
In IT, failures can be grouped into three broad categories depending on their origin:
Hardware failures: Physical malfunctions of hardware components.
Unintentional Human Errors: Mistakes made without bad intent, as in most of the outages described above.
Malicious Errors: Failures caused deliberately. In recent years, several major outages were caused by cyber-attacks (usually ransomware attacks). Some examples include:
ExPetr / NotPetya (2017): Caused a total impact of $10 billion and hit major companies such as the global shipping company Maersk and the pharmaceutical giant Merck, as well as major governments.
WannaCry (2017): Caused a total impact of $4 billion and hit major organizations such as Spain’s mobile company Telefónica and the UK’s National Health Service.
REvil/Sodinokibi (2020-2021): Hit major companies such as Travelex, JBS Foods, and Kaseya.
…
The last two categories can be split further by who introduces the failure: employees, partners/vendors, customers, or unknown external parties. This shows that there are many axes of failure against which a financial services company needs to protect itself.
A company should therefore use a variety of strategies which, combined, result in optimal resilience:
Resilient System Design:
Failure Tolerant Systems: Systems should tolerate hardware failures and network issues, support recoverability, and ensure that operations like restoring backups and relaunching processes can run without negative impacts (e.g. via recycling and relaunch mechanisms and the use of idempotent services).
Self-Healing Systems: Design systems to identify failures quickly and recover from them automatically, using mechanisms like load balancers, throttling, circuit breakers, timeouts, and elastic scalability. Such mechanisms help to avoid the ripple effect (i.e. they prevent a failure in one component from bringing down other components and systems) and ensure a quick return to normal operation.
For further details on building resilient systems, see my 2020 blog "Building resilient systems in the Financial Services industry" (https://bankloch.blogspot.com/2020/02/building-resilient-systems-in-financial.html).
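To make one of the mechanisms above more concrete, here is a minimal sketch of a circuit breaker. It is not a production implementation and does not represent any particular library: the class name, the thresholds, and the injectable clock are all assumptions chosen for illustration. After a configurable number of consecutive failures the breaker "opens" and rejects calls immediately, so a struggling downstream component is not hammered further; once a cool-down period has passed, it lets a single trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are assumptions."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a trial call
        self._clock = clock               # injectable so the behaviour can be tested
        self._failures = 0
        self._opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure counter.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to a dependency in `breaker.call(...)` and handle `CircuitOpenError` with a fallback, such as a cached response or a clear error message to the user. In practice this is usually combined with timeouts, so that slow calls count as failures too.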
Redundancy: Implement backup platforms running on different infrastructures (e.g. different operating systems, cloud providers, or data centres) to minimize the risk of simultaneous failures. Adopting a multi-cloud strategy for critical applications can help achieve this but may incur additional costs.
Robust Business Continuity Plans: Develop and regularly update procedures for major business continuity issues. These should include steps to minimize negative impacts when IT systems are down, such as switching to manual operations if necessary.
Continuous Monitoring: Monitor all business events (business activity monitoring) as well as the technical state of the systems, and proactively detect anomalies and bottlenecks. This allows a continuous evaluation of the state of the architecture and an early identification and analysis of potential issues. When the impact of an issue (the business value at stake and the customers affected) is immediately visible, resources can be allocated to the issues with the most business impact and impacted customers can be informed proactively.
Continuous Testing and Gradual Deployments: Continuously test your software via automated tests. This includes functional tests, but also non-functional tests like performance and security tests, as well as failure tests, in which specific failures are simulated to see whether systems can properly handle those disruptions. Special failure testing frameworks like Netflix’s Simian Army or Gremlin (Failure as a Service) can be used for this. Some of the leading technology companies even run such tests in production as the ultimate proof of their resilience. But even with those tests, unexpected issues are still likely to occur. Gradual deployment rollouts, like canary testing, A/B testing, and blue/green or red/black deployments, are therefore essential to limit the potential impact of such an issue.
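The idea behind a gradual rollout can be sketched in a few lines. The example below is one possible approach, not a description of how any specific deployment tool works: the function name, the salt value, and the customer identifiers are all hypothetical. Each customer is assigned to a stable bucket from 0 to 99 by hashing their identifier, and the new version is served only to buckets below the current rollout percentage.

```python
import hashlib


def in_canary(customer_id: str, rollout_percent: int, salt: str = "release-x") -> bool:
    """Decide whether a customer receives the new version.

    The salt is a hypothetical per-release label, so that different releases
    do not always pick the same customers as early adopters.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent


# Hypothetical customer base, used only to show the effect of widening the rollout.
customers = [f"customer-{i}" for i in range(1000)]
for percent in (1, 10, 50, 100):
    exposed = sum(in_canary(c, percent) for c in customers)
    print(f"{percent:>3}% rollout -> {exposed} of {len(customers)} customers on the new version")
```

Because the bucket depends only on the customer identifier and the salt, a customer who received the new version at 5% still receives it at 25%. A faulty release therefore only ever reaches the share of customers already exposed, and setting the percentage back to 0 stops serving it to everyone.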
Organizations must invest in a comprehensive resilience strategy to guard against failures and minimize disruptions. Not only does this ensure a continuous, reliable service to customers, but regulators are also increasingly imposing resilience standards for critical financial applications. In Europe, the Digital Operational Resilience Act (see my blog "The Dawn of DORA: Building a Resilient Financial Infrastructure" - https://bankloch.blogspot.com/2024/05/the-dawn-of-dora-building-resilient.html) and the UK’s Operational Resilience Policy mandate stringent measures to ensure the stability of financial services.
In the end, failures are inevitable; it is simply a matter of time before they happen. Or to quote Werner Vogels, CTO of Amazon.com: "Everything fails all the time". The true differentiator is taking the necessary measures to minimize their impact (design for failure).
For more insights, visit my blog at https://bankloch.blogspot.com
This content is provided by an external author without editing by Finextra. It expresses the views and opinions of the author.