The full story behind the recent TARGET2 outage is not yet in the public domain, but from what we do know, the incident makes a strong case for investment in DevOps and a move away from traditional Disaster Recovery type models.
In a recent communication, the root cause of the incident was “found to be a software defect of a third-party network device used in the internal network of the central banks operating the TARGET2 service.” I am not entirely sure what that means; to be frank, it is so broad that it is almost a meaningless statement. All I can take from it is that there was an infrastructure failure. Another issue we do know about is that a back-up system failed to work and failover to a secondary DR site took many hours. Yet we do not know the causes of these issues.
What can we learn?
We can see that the system operates what I would call a legacy set-up, in that it follows a very typical Disaster Recovery (DR) model. This model has been in place for many years and is pretty much the go-to model within the financial services industry. You have one infrastructure, typically a data centre, which does all the work, and a second, identical infrastructure (data centre) as a 'hot standby', located geographically a good 40+ km away. The thinking is that if your primary data centre has an issue and you cannot get it working quickly, you simply fail services over to the hot standby and the job is done.
This thinking is deeply embedded within financial services; a DR model even forms an integral part of the journey to gain a banking licence. However, it is an outdated model and, as this TARGET2 incident shows, a broken one.
CIOs and COOs within the financial services sector have to stop thinking of IT services in terms of 'disaster' and then 'recovery'. That thinking drives a very basic approach, typically one of redundancy: we must have duplicates of everything. This starts with telecoms, cabling, servers, power and switches, and then the thinking simply gets taken up a notch to the 'data centre' level, meaning we have a redundant data centre. There are two massive issues with this thinking.
The first is that even small incidents are treated as disasters. The second is that your DR site can never truly be the same as your primary, based on the real-world understanding that data is always flowing and 'state' is therefore always changing. And that is before we think of software upgrades, testing cycles, hardware updates and so on.
Availability
I often explain availability as a basic rule of three. DR is thought of as a rule of two: something fails and we switch to the backup. However, in a DR model you then have zero redundancy until you have fixed the initial issue. With the rule of three, you maintain redundancy, and therefore resilience, while you fix issues.
The same applies to data: store your data across three different zones and require quorum between those zones, and all of a sudden you have protection against data corruption and loss. The rule of three gets expensive, so you must think commercially: how can I utilise this redundancy, and how do I ensure it is not just wasted capacity? In thinking this way, you also address the issue of state, ensuring everything stays the same – that data, for example, is available at all times in all three zones.
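To make the quorum idea concrete, here is a rough Python sketch. It is purely illustrative – the zone names and the in-memory 'replicas' are made up, and this is not how TARGET2 or any particular platform implements it – but it shows why a majority of three gives you protection: a write only counts once two zones acknowledge it, and a read only trusts a value the majority agrees on.

    # Illustrative only: three zones held as in-memory dictionaries.
    REPLICAS = {"zone-a": {}, "zone-b": {}, "zone-c": {}}
    QUORUM = 2  # majority of three

    def quorum_write(key, value, available_zones):
        acks = 0
        for zone in available_zones:
            REPLICAS[zone][key] = value
            acks += 1
        return acks >= QUORUM  # the write is only durable with a majority

    def quorum_read(key, available_zones):
        votes = {}
        for zone in available_zones:
            value = REPLICAS[zone].get(key)
            votes[value] = votes.get(value, 0) + 1
        value, count = max(votes.items(), key=lambda kv: kv[1])
        return value if count >= QUORUM else None  # no majority, no answer

    # One zone down: writes and reads still succeed through the other two.
    quorum_write("payment-123", "settled", ["zone-a", "zone-c"])
    print(quorum_read("payment-123", ["zone-a", "zone-c"]))  # -> "settled"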
Availability means you are actually utilising that capacity. In the cloud, availability is something the big players have, and continue to invest billions in. Microsoft Azure, for example, provides availability zones: essentially three separate compounds (think of them as your typical data centre), geographically separated by somewhere between 10 and 40 km, all running active:active:active with each other. Essentially, your redundancy and resilience are load balanced. When a failure occurs within this availability model, there is no interruption to the availability of services. Had TARGET2 been running on such a model, transactions would have continued to flow and be processed as they should, as only one zone would have been impacted.
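As a rough illustration of active:active:active routing (again with hypothetical zone names, and health reduced to a simple flag rather than real probes), requests are spread across whichever zones are healthy, so losing one zone never interrupts service:

    import itertools

    # Hypothetical zones; in reality health would come from probes, not a flag.
    ZONE_HEALTHY = {"zone-a": True, "zone-b": True, "zone-c": True}
    zone_cycle = itertools.cycle(ZONE_HEALTHY)

    def route_request(request_id):
        # Round-robin across the zones, skipping any zone that is unhealthy.
        for _ in range(len(ZONE_HEALTHY)):
            zone = next(zone_cycle)
            if ZONE_HEALTHY[zone]:
                return f"request {request_id} handled in {zone}"
        raise RuntimeError("no healthy zones available")

    ZONE_HEALTHY["zone-b"] = False  # one zone suffers an incident
    print(route_request(1))  # still served, by zone-a or zone-c
    print(route_request(2))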
There are many other aspects of high availability we should get familiar with. One example is tiering your infrastructure and the services you provide. By this I mean: tier 1 is highly important and must be running at all times, so you look to ensure it can run at all times, whereas a tier 3 application or service can afford to go offline for hours on end without having a material impact on what your business does. This tiering keeps your costs in check and ensures resilience is focussed on the areas that really matter.
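A tiering policy can be as simple as a small, reviewable piece of configuration. The sketch below is illustrative only – the tier targets, zone counts and service names are made-up numbers, not a standard – but it shows the principle of writing the policy down and deriving deployments from it:

    # Hypothetical tiers: the availability targets and zone counts are
    # illustrative, not prescribed values.
    TIER_POLICY = {
        1: {"target_availability": "99.99%", "zones": 3, "max_outage": "minutes"},
        2: {"target_availability": "99.9%", "zones": 2, "max_outage": "an hour or two"},
        3: {"target_availability": "99%", "zones": 1, "max_outage": "several hours"},
    }

    def deployment_for(service, tier):
        policy = TIER_POLICY[tier]
        return (f"{service}: deploy across {policy['zones']} zone(s), "
                f"target {policy['target_availability']}, "
                f"tolerable outage {policy['max_outage']}")

    print(deployment_for("payment-settlement", 1))
    print(deployment_for("internal-reporting", 3))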
Upgrades and maintenance
The challenge with upgrades and maintenance, be they physical hardware or software related, is ensuring upgrades do not impact availability and that, if an upgrade is faulty, you can contain the issue before it is applied to the rest of your infrastructure. Enter the importance of DevOps.
When we start to think of the infrastructure that services run on as 'code', we can version control that infrastructure (which maps back to the physical kit), control how it is deployed and ensure it is totally consistent. There is also the added benefit that infrastructure becomes repeatable, and repeatable at speed.
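In practice you would reach for a tool such as Terraform or Ansible; the toy Python sketch below, with invented service names, simply illustrates the principle: the desired infrastructure is declared as data, kept in version control, and compared against what is actually running before anything changes.

    # Toy infrastructure-as-code sketch: "desired" would live in a
    # version-controlled repository; "current" represents what is running.
    desired = {
        "web": {"instances": 6, "image": "app:1.4.2"},
        "db": {"instances": 3, "image": "postgres:13"},
    }
    current = {
        "web": {"instances": 6, "image": "app:1.4.1"},
        "db": {"instances": 3, "image": "postgres:13"},
    }

    def plan(desired, current):
        # Produce the changes needed to make reality match the spec,
        # rather than hand-editing servers one by one.
        changes = []
        for name, spec in desired.items():
            if current.get(name) != spec:
                changes.append((name, current.get(name), spec))
        return changes

    for name, before, after in plan(desired, current):
        print(f"{name}: {before} -> {after}")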
DevOps is critical to ensuring high availability models work, that your software and infrastructure are repeatable, and that you can upgrade parts of your infrastructure without negatively impacting your services. As a CIO/COO, I strongly recommend you get familiar with the term 'rolling upgrade'. Think of it as upgrading some, but not all, of your services, checking they are functioning, and only then continuing the upgrade to the services or infrastructure that has yet to be upgraded. Rolling upgrades enable you to identify issues and contain them while still providing your services (ensuring they remain highly available).
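Here is a minimal sketch of that loop, assuming a hypothetical health_check() you would implement against your own services: one node is upgraded at a time and checked, and the roll-out halts the moment a check fails, so the blast radius is contained while the untouched nodes keep serving traffic.

    # Hypothetical fleet of service instances; the upgrade and health-check
    # steps are stand-ins for whatever your deployment tooling actually does.
    instances = {"node-1": "v1", "node-2": "v1", "node-3": "v1"}

    def upgrade(node, version):
        instances[node] = version  # stand-in for the real deployment step

    def health_check(node):
        return True  # stand-in: probe the service and return True if healthy

    def rolling_upgrade(new_version):
        for node in list(instances):
            upgrade(node, new_version)
            if not health_check(node):
                # Contain the issue: stop here, the remaining nodes are
                # untouched and still serving traffic on the old version.
                print(f"halting roll-out: {node} failed its health check")
                return False
        print(f"all nodes upgraded to {new_version}")
        return True

    rolling_upgrade("v2")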
Systemic importance
TARGET2 is systemically important. The cloud, with its high availability models, coupled with the growth of DevOps, has shown how we can ensure services remain pretty much always on. 100% is not achievable, but it is the goal.
Incidents will always happen; the key is being able to keep your services available while an incident is ongoing. Then, if you suffer a real disaster, such as a natural disaster that destroys an entire data centre, you are still able to keep services available. For systemically important services, we must all ditch the thinking of Disaster Recovery and move to the concept of High Availability.
There is a great deal of 'chat' regarding resilience within financial services and the use of the cloud. However, the financial services sector simply cannot afford to invest the kind of money in core underlying infrastructure that is invested in the public cloud, especially by providers such as Microsoft, Amazon and Google. At some point we have to acknowledge that important infrastructure is destined to run in the cloud, whether within an individual financial institution or as a systemically important payment service such as TARGET2.