The causes of the National Air Traffic Services (NATS) air traffic control centre system failure in December 2014, which affected 65,000 passengers directly and up to 230,000 indirectly, have been revealed in a recently published report.
The final report from the UK Civil Aviation Authority’s Independent Inquiry Panel set up after the incident examines the cause of and response to the outage at the Swanwick control centre in Hampshire, one of two sites controlling UK airspace (the other is at Prestwick in Scotland). Safety is key, said the report. I agree. And safety was not compromised in any way. Bravo!
“Independent” is a relative term: after all, the panel includes Joseph Sultana, director of Eurocontrol’s Network Management, and NATS’s operations chief Martin Rolfe, as well as UK Civil Aviation Authority board member and director of safety and airspace regulation Mark Swan – all of whom have skin in the game. (Full disclosure: a panel member, Professor John McDermid, is a valued colleague of many years.)
For a thorough analysis, however, it’s essential to involve people who know the systems intimately. Anyone who has dealt with software knows that often the fastest way to find a fault in a computer program is to ask the programmer who wrote the code. And the NATS analysis and recovery did indeed involve the programmers: Lockheed Martin engineers who built the system in the 1990s. This is one of two factors behind the “rapid fault detection and system restoration” during the incident on December 12.
The report investigates three things: the system outage itself, its cause and how the system was restored; NATS’s operational response to the outage; and what this says about how well the findings and recommendations following the last major incident, a year earlier, had been implemented. I look only at the first here, but arguably the other two are more important in the end.
Cause and effect
In the NATS control system, real-time traffic data is fed into controller workstations by a system component called the System Flight Server (SFS). The SFS architecture is what is called “hot back-up”. There are two identical components (called “channels”) computing the same data at the same time. Only one is “live” in the running system. If this channel falls over, then the identical back-up becomes the live channel, so the first can be restored to operation while offline.
This works quite well to cope with hardware failures, but is no protection against faults in the system logic, as that logic is running identically on both channels. If a certain input causes the first channel to fall over, then it will cause the second to fall over in exactly the same way. This is what happened in December.
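To make the common-mode point concrete, here is a minimal sketch in C. The names and the channel set-up are my assumptions, not the real SFS code: the point is only that when both channels run identical logic, any input that trips a latent fault takes them both down, and the redundancy buys nothing.

#include <stdbool.h>
#include <stdio.h>

typedef bool (*channel_fn)(int input);   /* returns false if the channel falls over */

static bool process_input(int input)
{
    /* identical logic runs on both channels, so a latent fault here is a
       common-mode failure that the redundancy cannot mask */
    return input >= 0;                   /* stand-in for the real SFS processing */
}

int main(void)
{
    channel_fn live = process_input;
    channel_fn standby = process_input;  /* same code, not an independent implementation */
    int bad_input = -1;                  /* an input that trips the latent fault */

    if (!live(bad_input)) {
        puts("live channel failed; switching to standby");
        if (!standby(bad_input))
            puts("standby failed identically: dual-channel outage");
    }
    return 0;
}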
The report describes a “latent software fault” in code written in the 1990s. Workstations in active use by controllers and supervisors, whether for control or for observation, are counted as Atomic Functions (AF). Their number should have been limited by the SFS software to a maximum of 193, but the limit was in fact set at 151, and the SFS fell over when the count reached 153.
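The report gives the numbers but not the code, so the following is only an illustrative C sketch; the constant names, the registration routine and the exact trip point are assumptions. It shows the shape of the fault: a capacity check coded against 151 when the requirement allowed 193.

#include <stdio.h>

#define MAX_ATOMIC_FUNCTIONS 193   /* what the requirement permitted */
#define CODED_LIMIT          151   /* what the software actually checked against */

/* returns 0 on success, -1 when the (wrongly coded) limit is exceeded */
static int register_atomic_function(int current_count)
{
    if (current_count >= CODED_LIMIT)   /* should have compared with MAX_ATOMIC_FUNCTIONS */
        return -1;
    return 0;
}

int main(void)
{
    for (int count = 0; count < MAX_ATOMIC_FUNCTIONS; count++) {
        if (register_atomic_function(count) != 0) {
            /* the report puts the actual failure at 153 workstations; the exact
               trip point depends on details it does not publish */
            printf("failure at %d workstations, well below the permitted %d\n",
                   count + 1, MAX_ATOMIC_FUNCTIONS);
            return 1;
        }
    }
    return 0;
}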
Déjà vu
My first thought is that we’ve heard this before. As far back as 1997-98, evidence given to the House of Commons Select Committee on Environment, Transport and Regional Affairs reported that the NATS system, then under development, was having trouble scaling from 30 to 100 active workstations. But this recent event was much simpler than that – it’s the kind of fault you see often in first-year university programming classes and which students are trained to avoid through inspection and testing.
There are technical methods known as static analysis to avoid such faults – and static analysis of the 1990s was well able to detect them. But such thorough analysis may have been seen as an impossible task: it was reported in 1995 that the system exhibited 21,000 faults, of which 95% had been eliminated by 1997 (hurray!) – leaving 1,050 which hadn’t been (boo!). Not counting, of course, the fault which triggered the December outage. (I wonder how many more are lurking?)
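Had the limits been expressed as named constants – an assumption on my part; they may not have been – even a crude compile-time check would have caught the mismatch. The fragment below deliberately fails to build, which is the point: a C11 static assertion stops the inconsistency before the code ever runs, and heavier-weight static analysers of the kind available in the 1990s could flag the same class of defect.

#define MAX_ATOMIC_FUNCTIONS 193   /* the requirement */
#define CODED_LIMIT          151   /* what the code checks */

/* The build fails here with the message below: the mismatch between
   requirement and code is caught before execution. */
_Static_assert(CODED_LIMIT == MAX_ATOMIC_FUNCTIONS,
               "coded workstation limit does not match the requirement");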
How could an error not tolerated in undergraduate-level programming homework enter software developed by professionals over a decade at a cost approaching a billion pounds?
Changing methods
Practice has changed since the 1990s. Static analysis of code in critical systems is now regarded as necessary. So-called Correct by Construction (CbyC) techniques, in which the software’s intended behaviour is defined in a specification and the code is then developed from it by refinement in a way that demonstrably avoids common sources of error, have proved their worth. NATS nowadays successfully uses key systems developed along CbyC principles, such as iFacts.
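iFacts was built with contracts that are checked by tools rather than discovered in service; that toolchain is not reproduced here. As a loose sketch of the underlying idea only, in C with runtime assertions standing in for proved contracts (all names are illustrative):

#include <assert.h>

#define MAX_ATOMIC_FUNCTIONS 193

/* Contract: requires 0 <= count < MAX_ATOMIC_FUNCTIONS;
             ensures the result never exceeds the maximum. */
static int add_workstation(int count)
{
    assert(count >= 0 && count < MAX_ATOMIC_FUNCTIONS);   /* precondition */
    int new_count = count + 1;
    assert(new_count <= MAX_ATOMIC_FUNCTIONS);            /* postcondition */
    return new_count;
}

int main(void)
{
    return add_workstation(0) == 1 ? 0 : 1;   /* trivial use of the contracted routine */
}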
But change comes only gradually, and old habits are hard to leave behind. For example, Apple’s “goto fail” bug, which surfaced in 2014 across many of its systems, rendered void an internet security function essential for trust online – validating website authentication certificates. Yet it was caused by a trivial coding slip – a duplicated line, essentially a programming typo – that could and should have been caught by the most rudimentary static analysis.
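For those who have not seen it, the published code reduces to something like the following simplified, self-contained rendering; the function names are stand-ins for the real SecureTransport routines. The duplicated goto is always taken, so the final verification step is never reached and the function reports success regardless.

#include <stdio.h>

static int hash_update(void)      { return 0; }   /* stand-in: succeeds */
static int verify_signature(void) { return 1; }   /* stand-in: would reject the certificate */

static int verify_server_key_exchange(void)
{
    int err;

    if ((err = hash_update()) != 0)
        goto fail;
        goto fail;                         /* the duplicated line: always taken */
    if ((err = verify_signature()) != 0)   /* never reached */
        goto fail;

fail:
    return err;                            /* returns 0: "verified" without checking */
}

int main(void)
{
    printf("verification result: %d (0 means accepted)\n", verify_server_key_exchange());
    return 0;
}

A simple unreachable-code warning, dead-code analysis or even enforced use of braces would have flagged it.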
Unlike the public inquiry and report in the NATS case, Apple has said little about either how the problem came about or the lessons learned – and the same goes for the developers of many other software packages that lie at the heart of the global computerised economy.