DORA – Ensuring Resilience with IT Infrastructure Observability

IT Infrastructure Visibility Is Crucial for DORA Compliance

Excerpt: 

As DORA’s compliance deadline creeps closer, ensuring IT infrastructure resilience has never been more critical. This article highlights the importance of monitoring and observability for understanding IT system behaviours and staying ahead of potential issues to avoid costly downtime. Discover the key to IT resilience and how to achieve it at scale in this essential guide covering:

  1. How Resilient are IT Data Centres?
  2. Enhancing IT Infrastructure Resilience.
  3. Achieving Resilience at Scale.

Read the full article at: https://cjcit.com/insight/dora-it-observability-resilience/

Intro:

As the 17th of January 2025 deadline approaches, the pressure to improve operational resilience for DORA compliance is mounting. Building on my previous DORA article about third-party dependencies, this article focuses on how resilience can be achieved through observability – the measurability of an IT system’s internal state by collecting and examining data to deliver valuable performance and stability insights, enabling the detection and proactive resolution of potential issues.
 

How Resilient are IT Data Centres?

IT infrastructure and data centres have become significantly more reliable over the 25 years since CJC was founded. It might not always seem that way, because our reliance on IT services keeps growing and because high-impact issues like glitches3, outages4 or regulatory actions5 are typically picked up swiftly by the media.

A major change in the last 25 years is the shift to cloud services, which changed how service disruptions happen. Major outages are now commonly due to configuration issues (64%), software (40%) or hardware (36%) faults, capacity issues (22%), data corruption (14%) or security (10%), according to an Uptime Intelligence report (Lawrence & Simon 2023:18)6. The percentages add up to more than 100% because an outage can have more than one cause. These errors show the increasing complexity of digital services and the challenges in managing them. Also, because many services are increasingly interconnected7 and rely on a few major cloud providers, a single issue, such as a hyperscale data centre going down, can potentially impact multiple organisations8 simultaneously.

The Uptime Intelligence report shows that even though the number of outages is increasing (Lawrence & Simon 2023:9), they are not growing as quickly as the IT industry (Lawrence & Simon 2023:6) and the proportionate number of major outages is decreasing (see Illustration A). This suggests that IT systems are getting more resilient thanks to IT engineering advancements, better management, and technology investments. CJC's recent announcement9 proves that the right support can improve IT system resilience, protecting revenue-generating operations.

Outage Costs

ILLUSTRATION A: PROPORTION OF OUTAGES CLASSIFIED AS SIGNIFICANT, SERIOUS, OR SEVERE. (LAWRENCE & SIMON 2023:8)

Despite making great strides in IT, outages still happen and are increasingly expensive. A recent report10 highlighted that a single outage could cause stock prices to drop by up to 9% and take over two months to recover. Unplanned downtime reportedly11 costs $256 million annually in the US, more than in Europe or Asia-Pacific at $198 million and $187 million, respectively. The Uptime report found that most unplanned outages cost individual companies over $100,000, sometimes exceeding $1 million when direct repairs, fines12, restrictions13, and lost opportunities14 were included.

So why are outages so expensive? The main reason is the growing dependence on digital services provided through data centres (Lawrence & Simon 2023:23). When critical IT services go down, business operations are disrupted and revenue is lost. To prevent outages, IT and business leaders need a holistic view of IT system behaviours so they can predict problems and make strategic, data-driven decisions, such as managing capacity effectively15 to retire technology or avoid outages. We will discuss how to achieve this later.

What Causes an Outage?

“Understanding the causes of outages is critical to preventing them and to guide resiliency investments. Most outages have several causes…” (Lawrence & Simon 2023:10)

Even though major outages are becoming less common, the Uptime report (Lawrence & Simon 2023:7) found that 60% of data centre operators had an outage between 2020 and 2022. The report noted that the most common cause of these outages was power-related issues16 at the site (see Illustration B). Other common onsite issues included cooling system failures, which could lead to grave consequences, like data centre fires17.

ILLUSTRATION B: LEADING CAUSES OF SIGNIFICANT OUTAGES. (LAWRENCE & SIMON 2023:11)

Uptime's research also showed third-party infrastructure-as-a-service providers, like cloud and hosting services, were increasingly responsible for outages, highlighting their growing importance (Lawrence & Simon 2023:14) as data centres are increasingly off-premises. This underscores the value of technology consultants who meet international standards18 and have a strong track record19 with service-level agreements (SLAs) compared to others who potentially do not meet the required standards20.

ILLUSTRATION C: COMMON OUTAGE CAUSES OBSERVED BY CJC.

For other common causes of outages, like networking21 and IT systems22, I spoke with Umesh Tailor, the Global Head of Service Operations at CJC. He oversees a team that supports critical IT systems around the clock, and he observed that most outages in capital market firms over the last six months were due to capacity management and hardware faults (see Illustration C). These findings align with the top causes identified by the International Organisation of Securities Commissions (IOSCO 2024:13)23 over the past few years.

Umesh explains, “Capacity and hardware issues are usually more common and possibly related. CJC’s mosaicOA tool24 allows detailed IT infrastructure and application visualisation metrics to help identify potential issues.” He adds, “After reviewing mosaicOA data, decisions can be made to balance the workload, increase CPU memory, optimise applications or hardware for better performance.”

Enhancing IT Infrastructure Resilience

According to research by Lawrence & Simon (2023:20-24), 78% of data centre operators believe their last outages could have been avoided with better monitoring, more investment, and thorough analysis. This aligns with DORA, which states25 that security and IT tools should be continuously monitored and managed to reduce risks. Some have proposed data-driven operational resilience maturity frameworks26 that leverage observability tools to spot and manage risks before they cause problems, such as the capacity management issue raised by Umesh. This echoes a 2022 article27: “Firms cannot optimise what firms don't know.”

Steve Moreton, CJC’s Global Head of Product Management, talked about the challenge of market data capacity management28. Using the award-winning mosaicOA tool29, he saw market data consumption double every two years. He also found it could spike within a millisecond or even triple in a day because of market volatility (see Illustration D). With a tool like mosaicOA30, which can visualise data holistically across different environments, market data managers and IT operators could predict when to upgrade capacity or spot a hardware problem.

ILLUSTRATION D: MOSAICOA’S BUSY DAY VIEW.

Experts agree31 that to comply with DORA and be a resilient leader, it is crucial to have clear monitoring32 and observability33 systems. The top companies that recover quickly from downtime often share these traits34:

  • They invest in both security and observability tools.
  • They leverage embedded GenAI (Generative AI) in their existing tools for speed.
  • They recover quickly, which leads to better user experiences and less negative media coverage.
  • They face fewer hidden outage costs.
  • They avoid financial damage from regulatory fines and ransomware demands.

Achieving Resilience at Scale

Managing IT infrastructure in-house is complex and time-consuming, requiring specialist knowledge and resources, which is a challenge given a known skills shortage35. For agility, scalability, resiliency, and other benefits36 underpinned by service-level agreements (SLAs), many prefer outsourcing to trusted operators37.

As market data usage typically doubles every two years, obtaining IT visibility of the entire architecture is a growing challenge. Without holistic observability, blind spots may hinder performance. A monitoring and observability tool would provide better IT visibility of market data systems on a single dashboard, visualising data to indicate when upgrades are necessary or if there is potentially a hardware problem.
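To make the doubling claim concrete, here is a minimal sketch of the arithmetic behind a capacity forecast. It assumes smooth exponential growth with a fixed doubling period, which real market data traffic will not follow exactly; the function name and the example figures are hypothetical and are not taken from mosaicOA or any CJC tool.

```python
import math


def years_until_capacity(current_load, capacity, doubling_period_years=2.0):
    """Estimate how long until load reaches capacity.

    Assumes load grows exponentially, doubling once every
    `doubling_period_years` (an illustrative simplification).
    Solves current_load * 2 ** (t / period) == capacity for t.
    """
    if current_load <= 0 or capacity <= 0:
        raise ValueError("current_load and capacity must be positive")
    if current_load >= capacity:
        return 0.0  # already at or over capacity
    return doubling_period_years * math.log2(capacity / current_load)


# Hypothetical example: a feed handler running at 40% of its capacity.
print(round(years_until_capacity(40, 100), 2))
```

Under these assumptions, a system at half of its capacity has about two years of headroom, and one at 40% has a little over two and a half years. Intraday spikes would shorten that in practice, so the figure is best read as an upper bound.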

Firms may prefer to use a managed service team and their chosen observability tool as an eyes-on-glass watch tower that monitors their systems and alerts them if there is an issue. Alternatively, they could use a fully managed service, where the team not only alerts the client but also swiftly remedies the issue42 and conducts root cause analysis to avoid a repeat.
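As a rough illustration of the kind of check a watch tower might run, the sketch below flags any sample that reaches a multiple of its recent baseline. It is an assumption-laden example rather than a description of how mosaicOA or any managed service actually alerts: the function name, the window size, the threshold factor of three and the sample data are all hypothetical.

```python
from statistics import median


def find_spikes(samples, window=5, factor=3.0):
    """Return (index, value, baseline) for each sample that is at least
    `factor` times the median of the preceding `window` samples.

    The median is used as the baseline so that one earlier spike does
    not inflate it. All thresholds here are illustrative assumptions.
    """
    alerts = []
    for i in range(window, len(samples)):
        baseline = median(samples[i - window:i])
        if baseline > 0 and samples[i] >= factor * baseline:
            alerts.append((i, samples[i], baseline))
    return alerts


# Hypothetical message rates (thousands of updates per second).
rates = [100, 110, 105, 95, 100, 102, 330, 101]
for index, value, baseline in find_spikes(rates):
    print(f"sample {index}: rate {value} vs baseline {baseline}")
```

With the sample data above, only the seventh reading (330 against a baseline of 102) is flagged. A production setup would also need to handle missing data, alert suppression and escalation, which this sketch leaves out.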

In short, financial institutions need holistic IT visibility and observability tools like mosaic to meet DORA's operational resilience requirements. These tools help with compliance and facilitate more resilient and reliable business operations in an increasingly complex digital world.

 

 
