Lessons from A Space Odyssey: How to combat computer bugs

Hamish Monk

Reporter, Finextra

Perhaps the most famous instance of a software bug in pop culture is Arthur C. Clarke’s HAL 9000, and its attempt to thwart Discovery One’s mission, in 2001: A Space Odyssey. It eventually emerges that the malevolent onboard computer malfunctioned because it was programmed with two conflicting objectives: to divulge all its information, and to keep the real purpose of the flight from the crew. This programming howler made HAL paranoid and, ultimately, (spoiler alert) homicidal.

While we are some way from facing that kind of dystopian reality, the depiction of the calamitous impact that a software bug can have is, sadly, not an exaggeration. In 1994, a Royal Air Force Chinook crashed into the Mull of Kintyre – killing 29 people. Initially, the incident was passed off as a piloting misstep, but a subsequent investigation revealed it may have been caused by a bug in the aircraft's engine-control computer. This is just one of many such cases in modern history.

Fortunately, there is a software development methodology that can facilitate enhanced product quality and speedier bug fixes. Known as DevOps, the movement combines software development with information technology operations – across the whole application lifecycle – to help construct, test, and launch software more reliably than ever before.

With demand for DevOps rising sharply over the last several years, more and more financial players are looking for skilled engineering teams that can deliver robust solutions and expunge bugs at speed – before serious reputational, financial, legal, and security damage is sustained.

To learn more about the nature of software bugs and how to combat them, Finextra caught up with next-gen cloud hosting platform Platform.sh and with Mollie, one of Europe’s fastest-growing payment service providers (PSPs).

What is a bug?

Before we break down how HAL 9000 could have been outwitted by a modern team of DevOps engineers, we first need to define exactly what it is we are treating.

According to Robert Douglass, vice president, channel sales, Platform.sh, a bug is “any unintended behaviour in software. Whether it's a simple problem of the margin of a paragraph not being visible, or the more complex situation of an errant process crashing a server and taking an entire application offline – these are all bugs. We hate them.”

These fault lines can emerge at any stage of the software development life cycle (SDLC), but they mostly arise – as was the case with HAL’s programming – from misunderstandings within the software team during the design, coding, data entry, and documentation phases.

There are several kinds of software bug:

  1. Functional bugs – application or website components do not work as intended
  2. Workflow bugs – the user journey or navigation becomes muddled
  3. Unit level bugs – small batches of code do not run as expected, but are quick to fix
  4. System-level integration bugs – two or more units of code fail to interact with each other
  5. Out of bound bugs – users interact with the interface in an unintended manner; and
  6. Logical bugs – the workflow of software is disrupted, and it behaves incorrectly.  

With greater DevOps expertise, Discovery One’s mission commander, David Bowman, might have diagnosed HAL’s duplicity as a logical bug, which comes from poorly written code or a misinterpretation of business logic.

What impact can a bug have?

Even with DevOps expertise, software bugs can still arise and incur serious financial costs, reputational damage, and security issues (for example, by helping fraudsters evade access controls and obtain unauthorised privileges).

“The impact of bugs on financial services is enormous,” noted Douglass. “Impacts can range from software working poorly and hampering adoption, to enabling hackers to steal enormous amounts of money from financial systems. A recent example was a software bug in a popular blockchain project which allowed a hacker to steal $600 million worth of assets in one transaction. Bugs are bad. We really hate them.”

The broader picture is more shocking still. In 2016, software failures cost the global economy $1.1 trillion. These failures were traced back to 363 companies and impacted 4.4 billion customers.

Just like the HAL 9000 case, many of these bugs were avoidable.

What are the best practices for exterminating bugs?

This brings us to the million-dollar question: What could mission commander Bowman have done to shut down the supercomputer bug before HAL went haywire?  

If he were acquainted with the practices of DevOps, he would have known to try several things:

1. Test, test, and re-test!

Evaluating the functionality of HAL 9000 would have helped the crew identify defects in its software. Without this step, neutralising the bug was improbable.

When it comes to testing, timing is key. Had the Discovery One crew begun testing HAL early on, during the requirement analysis phase – where 56% of defects originate – potential errors might have been anticipated. Running tests during the maintenance phase was too late for Bowman’s crewmate; the cost of fixing a bug often increases as the SDLC progresses.

According to full stack software development engineer in test (SDET), Rafaela Azevedo, “if a bug is found in the requirements-gathering phase, the cost could be $100. If the product owner doesn’t find that bug until the quality assurance [QA] testing phase, then the cost could be $1500. If it’s not found until production, the cost could be $10,000.”
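To put those numbers in proportion, here is a minimal Python sketch that turns the quoted figures into multipliers relative to the requirements phase. The dollar amounts are Azevedo's illustrative "could be" figures, not measured data, and the phase keys and function name are assumptions made for this example.

```python
# Illustrative fix costs per phase, taken from the quote above.
# These are "could be" figures, not measured data.
FIX_COST_BY_PHASE = {
    "requirements": 100,
    "qa_testing": 1_500,
    "production": 10_000,
}


def cost_multiplier(found_in: str, baseline: str = "requirements") -> float:
    """Return how many times costlier a fix is than at the baseline phase."""
    return FIX_COST_BY_PHASE[found_in] / FIX_COST_BY_PHASE[baseline]


for phase, cost in FIX_COST_BY_PHASE.items():
    print(f"{phase}: ${cost:,} ({cost_multiplier(phase):.0f}x)")
```

On these figures, a bug that slips through to QA costs 15 times as much to fix as one caught at requirements, and one that reaches production costs 100 times as much.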

If the bug is never found, it could be surreptitiously costing the company money – or, by Clarke’s estimation, plotting to wipe out two astronauts from the future.

The bad news is that, according to Douglass, “most software bugs are emergent.” This means “you often only find out about the bugs at a later point in time when the conditions of the software have changed to reveal the bug. This could be a usage pattern that was unanticipated (such as uploading a file that is larger than what was tested), or it could be due to configuration drift.”
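One modest defence against the unanticipated usage pattern Douglass mentions is to test at and beyond the edges of what was originally tried. The Python sketch below is only an illustration: the 10 MiB limit, `validate_upload`, and `boundary_sizes` are hypothetical names and values, not anything described by Platform.sh.

```python
# Hypothetical upload limit; the real value would come from the application.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MiB


def validate_upload(size_bytes: int) -> bool:
    """Return True if an upload of this size should be accepted."""
    return 0 <= size_bytes <= MAX_UPLOAD_BYTES


def boundary_sizes(limit: int) -> list[int]:
    """Sizes just inside, exactly at, and well beyond the limit."""
    return [0, 1, limit - 1, limit, limit + 1, limit * 10]


def probe() -> dict[int, bool]:
    """Check every boundary size and record whether it is accepted."""
    return {size: validate_upload(size) for size in boundary_sizes(MAX_UPLOAD_BYTES)}


for size, accepted in probe().items():
    print(f"{size:>12} bytes -> {'accepted' if accepted else 'rejected'}")
```

Boundary probes like this will not catch configuration drift, but they do cover the "larger than what was tested" case before a user finds it.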

2. Find the defect cluster

The good news is that there is a way to expedite the testing process. This is useful since exhaustive testing of all possible combinations is impractical in most cases and impossible in space. As such, an optimal amount of testing should be discerned, based on a risk assessment of the application. Effective testing beats exhaustive testing every time.

This is where defect clustering comes in – it recognises that bugs are rarely distributed evenly throughout a software application. Following the Pareto principle, 80% of bugs are likely to be found in 20% of components.

If Bowman had heeded this probability and designed a test that distributed his diagnosis efforts in line with the ‘hotspots’ – or, the likely density of bugs in the modules – the mission might have gone more smoothly.
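As a rough illustration of distributing effort by hotspot, the Python sketch below ranks modules by past defect count, picks the smallest set that accounts for 80% of defects, and splits a testing budget in proportion to defect density. The module names and counts are invented for the example, and proportional allocation is just one plausible policy.

```python
# Hypothetical defect counts per module (invented for illustration).
DEFECTS = {
    "mission_logic": 45, "comms": 35, "navigation": 4, "life_support": 3,
    "pod_control": 3, "diagnostics": 2, "hibernation": 2, "airlocks": 2,
    "lighting": 2, "chess": 2,
}


def hotspots(defects: dict[str, int], threshold: float = 0.8) -> list[str]:
    """Return the top modules that together hold `threshold` of all defects."""
    total = sum(defects.values())
    if total == 0:
        return []
    selected, running = [], 0
    ranked = sorted(defects.items(), key=lambda item: item[1], reverse=True)
    for module, count in ranked:
        if running / total >= threshold:
            break
        selected.append(module)
        running += count
    return selected


def allocate_hours(defects: dict[str, int], hours: float) -> dict[str, float]:
    """Split a testing budget in proportion to each module's defect count."""
    total = sum(defects.values())
    return {module: hours * count / total for module, count in defects.items()}


print(hotspots(DEFECTS))
print(allocate_hours(DEFECTS, hours=100)["mission_logic"])
```

With this made-up data, two of the ten modules account for 80% of defects, so they would receive 80 of the 100 testing hours.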

3. Reproduce the bug

Another option available to the Discovery One crew was to understand HAL’s bug by capturing it and reproducing it.

“Capturing a bug sometimes requires reproducing it,” Douglass explained. “In order to safely reproduce a bug, the best practice is to clone the running application into a new environment that is precisely identical to the application itself, and then, in the safety of the new environment, isolated from the production application, you can reproduce the bug repeatedly until it is fully understood.”

The HAL case suggests that creating a clone – an exact copy of the running application – is both feasible and financially justifiable compared with ignoring the bug, or trying to debug it on the live system, with all the risks inherent in that approach. As such, it's important to work with tools that facilitate the full cloning of production applications quickly, and at low cost.

Daan Van Marsbergen, senior system engineer, database engineering team, Mollie, added: “It's vital to have visibility on your system and development environment. A lot of developers use var dumps or echoes or prints, but there is nice tooling out there that allows you to go through your software and code step by step, inspect all the values, and set breakpoints in places. This is really helpful for debugging.”
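The idea of reproducing a bug in isolation can be sketched in a few lines of Python. This is a toy stand-in, not how Platform.sh or Mollie clone environments: `copy.deepcopy` of a dictionary plays the part of the cloned environment, and `apply_order` is a hypothetical handler with a deliberate bug.

```python
import copy


def apply_order(state: dict, order: dict) -> None:
    """Hypothetical buggy handler: it changes stock before validating."""
    # Uncommenting the next line pauses here in Python's debugger (pdb),
    # which allows stepping through the code and inspecting values.
    # breakpoint()
    state["stock"] -= order["quantity"]  # bug: mutation happens first
    if order["quantity"] <= 0:
        raise ValueError("quantity must be positive")


def reproduce(production_state: dict, failing_input: dict, runs: int = 3) -> list[int]:
    """Replay a failing input against fresh clones, never the original."""
    observations = []
    for _ in range(runs):
        clone = copy.deepcopy(production_state)  # isolated copy
        try:
            apply_order(clone, failing_input)
        except ValueError:
            pass  # the error is expected; the corrupted stock is the bug
        observations.append(clone["stock"])
    return observations


production = {"stock": 10}
print(reproduce(production, {"quantity": -5}))  # stock observed in each clone
print(production["stock"])                      # original state is untouched
```

Each run starts from an identical copy, so the faulty behaviour (stock rising after a rejected order) shows up the same way every time, while the original state is never modified.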

4. Avoid the pesticide paradox

Last but not least, had Bowman been seasoned in DevOps, he would have heeded the pesticide paradox. In other words, when the same tests are run over and over, they cease to be effective at identifying new bugs, particularly those outside the module the tests were written for.

To avoid being affected by this phenomenon, software engineers must track the indirect implications of product alterations and regularly review, revise, or retire tests that are no longer effective.
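One way to act on this is to flag tests that have passed for a very long time, and modules that no test touches at all, for human review. The Python sketch below assumes a hypothetical record of test history (`SuiteEntry`) and an arbitrary threshold of 200 consecutive passes; neither comes from the article.

```python
from dataclasses import dataclass


@dataclass
class SuiteEntry:
    """Hypothetical history record for one automated test."""
    name: str
    consecutive_passes: int  # runs since this test last failed
    module: str              # module the test exercises


def review_candidates(entries: list[SuiteEntry], min_passes: int = 200) -> list[str]:
    """Tests that have not failed in a long time: review them, don't auto-delete."""
    return [e.name for e in entries if e.consecutive_passes >= min_passes]


def untested_modules(entries: list[SuiteEntry], all_modules: list[str]) -> list[str]:
    """Modules that no test in the suite exercises."""
    covered = {e.module for e in entries}
    return sorted(set(all_modules) - covered)


suite = [
    SuiteEntry("test_pod_bay_doors", 950, "doors"),
    SuiteEntry("test_antenna_fault_report", 12, "comms"),
    SuiteEntry("test_chess_moves", 400, "games"),
]
modules = ["doors", "comms", "games", "mission_logic"]

print(review_candidates(suite))
print(untested_modules(suite, modules))
```

A long passing streak does not prove a test is useless, so the output is a list for review rather than an instruction to delete anything.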

The thing that’s still bugging engineers

Despite the various tools and practices informed by the DevOps movement, the fact is that it is simply impossible to identify all bugs within an application.

“Acknowledging this truth, and designing operations to take bugs into consideration, is the only choice we have,” Douglass claimed. “It's most important to be able to identify and address bugs as a normal part of daily operations. This means updating the software stack in near real-time, responding to bug reports in an organised and timely manner, and reserving engineering capacity for bug fixing and maintenance.”

At the heart of operations needs to be a tool set that anticipates frequent deployments of bugfix code into the production environment. That is why preventing configuration drift and codifying the definition of infrastructure into the application repository itself have such profound compounding effects on overall operations.

Indeed, if engineers can quickly clone the production environment, avoiding any configuration drift, without needing to rely on DevOps or infrastructure operators, they can quickly address bugs and deploy the fixes on a rolling basis.
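Configuration drift is easier to prevent when it is easy to see. As a minimal Python sketch, assume the infrastructure definition stored in the repository and the settings observed on the running environment are both available as flat dictionaries; the keys and values here are hypothetical, and real tooling would compare far richer structures.

```python
ABSENT = "<absent>"  # marker for a key present on only one side


def detect_drift(declared: dict, observed: dict) -> dict:
    """Return {key: (declared_value, observed_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | observed.keys()):
        want = declared.get(key, ABSENT)
        have = observed.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical values: what the repository declares vs. what is running.
declared = {"runtime": "php:8.2", "disk_mb": 2048, "cache": "redis:7"}
observed = {"runtime": "php:8.1", "disk_mb": 2048, "debug": True}

for key, (want, have) in detect_drift(declared, observed).items():
    print(f"{key}: declared={want!r} observed={have!r}")
```

An empty result means the running environment matches its definition, which is the condition under which a clone can be trusted to behave like production.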

If financial services firms – or spaceships – have any hope of avoiding bug damage, these are the practices that must be adopted. In the words of HAL, DevOps teams “ought to sit down calmly, take a stress pill, and think things over.” Their mission is too important to be jeopardised.
