When Small Software Bugs Cause Big Problems


For most of us, software bugs are the annoying little things that we encounter in the form of small errors, like misaligned text or a clipped image. However, in rare situations, these small bugs can have massive ramifications. In this post, I will be discussing three such events: the Therac-25, the Mars Spirit Rover, and the June 2021 Fastly malfunction. 

Therac-25

The Therac-25 was a radiation therapy machine made in the 1980s. A later generation of the Therac machines, the Therac-25 was unique in that it was the first to rely solely on software, not hardware, for its safety controls. While this seemed like an innovative idea at conception, it removed the hardware backstop that had caught software faults in earlier models, with catastrophic results.

The Therac-25 operated in two modes: low power mode and high power mode. Low power mode used an electron beam that didn’t penetrate too deeply, making it perfect for treating skin cancers. High power mode used x-rays to penetrate more deeply and treat bone or lung cancers. The low and high power modes were beneficial for hospitals because doctors could now use one machine to treat multiple cancers, rather than needing to buy a new machine for each treatment type. 

However, between 1985 and 1987, a malfunction involving the low and high power modes, coupled with the lack of hardware safety controls, caused at least six patients to receive massive radiation overdoses, several of which were fatal.

Causes

The radiation overdoses were caused by a myriad of factors. On the software development side, there were three main issues that, when combined, resulted in the malfunction: 

  • Race Condition: The Therac-25 software had a race condition that went undetected during development and review. The machine’s operator selected the type of beam by typing a character in the setup menu (x for high power, e for low power). If the operator mistyped the character and then went back and corrected it within about 8 seconds, the machine was still busy configuring itself for the first entry, so the correction appeared on the screen but was never applied to the hardware settings.
  • Malfunction 54: When the race condition resulted in an incorrect character selection, the Therac-25 would flash “Malfunction 54” on the screen. No additional information was provided to indicate how serious the malfunction was, so many technicians would dismiss the malfunction warning without investigating further. Technicians were taught that the Therac-25 was “user-error” proof, so they assumed that whatever the “Malfunction 54” was, it couldn’t be serious. 
  • Arithmetic Overflow: Arithmetic overflow occurs when the result of a calculation exceeds the range that a variable’s storage can hold. The final software issue in the Therac-25 was a flag that was incremented on each pass through the code rather than set to a fixed known value, as was intended. The storage for the flag was finite, so repeated increments eventually pushed the value past its maximum, at which point it wrapped around to zero. The software treated a zero flag as “no check needed,” so a safety check was skipped whenever the flag wrapped.
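
To make the race condition and the overflow more concrete, here is a minimal Python sketch. It is not the Therac-25’s actual code: the function names, the simulated clock, and the one-byte flag size are all assumptions chosen for illustration. The first function replays the timing problem deterministically (a real race depends on live timing), and the second shows how an incremented flag can wrap to zero.

```python
# Illustrative only: hypothetical names, not the Therac-25's real software.

SETUP_SECONDS = 8  # assumed window during which the hardware is configured


def mode_applied(edits):
    """Return the beam mode that the hardware ends up using.

    `edits` is a list of (seconds_since_first_entry, key) pairs, where key
    is "x" (high power) or "e" (low power). The buggy setup routine takes
    one snapshot of the mode when setup starts and never re-reads it, so an
    edit made inside the setup window changes the screen but not the
    hardware.
    """
    first_time, snapshot = edits[0]
    for time, key in edits[1:]:
        if time - first_time >= SETUP_SECONDS:
            snapshot = key  # setup had finished, so the edit is honored
        # otherwise the edit is silently lost (the race condition)
    return snapshot


def check_is_skipped(pass_count):
    """Model a one-byte flag that is incremented on every pass.

    A nonzero flag means "run the safety check". Incrementing the flag,
    instead of assigning a fixed nonzero value, lets it wrap to zero on
    every 256th pass, and on that pass the check is skipped.
    """
    flag = 0
    for _ in range(pass_count):
        flag = (flag + 1) & 0xFF  # keep only the low 8 bits, like a byte
    return flag == 0
```

With these definitions, `mode_applied([(0, "x"), (3, "e")])` returns `"x"` even though the operator corrected the entry to `"e"`, while the same correction made 9 seconds after the first entry is honored. Likewise, `check_is_skipped(255)` is `False`, but `check_is_skipped(256)` is `True`. Assigning the flag a constant nonzero value instead of incrementing it would avoid the wraparound entirely.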

Results/Investigation

At least six people suffered radiation overdoses before the problems with the Therac-25 were caught. In addition to the three software issues identified above, the commission that investigated the accidents found the following problems:

  • The company that designed the Therac-25 didn’t have an independent code review. Instead, the review was completed by the same people who wrote the code. 
  • The architecture of the software was so poorly structured that automated testing to identify potential software bugs was not feasible.
  • This particular combination of hardware and software wasn’t tested until it was actually put together in the hospital.
  • The software that was used in the Therac-25 was reused from the Therac-20. This was an issue because the previous iterations of the machine had hardware safety controls as well. This meant that any malfunctions in the software were caught by the hardware, which would physically stop the process if a malfunction occurred. 

The Therac-25 is an extreme example of how small errors, when combined, can cause devastating events.

Mars Spirit Rover

Software bugs seem to be commonplace in space exploration. For example, the Mars Curiosity Rover also had a software malfunction during its time on Mars. However, for this post, I am going to focus on the software bug experienced by the engineers of the Mars Spirit Rover. 

The Mars Spirit Rover landed on Mars on January 4, 2004. On January 21, 2004, the engineers discovered that they could no longer communicate with the rover because it was no longer entering sleep mode. Instead, it was stuck in a pattern of continually rebooting itself again and again. 

The engineers received intermittent pings from the rover, so they knew that it was still alive but also that it was in danger of draining its precious battery reserves and overheating. They knew there was an issue in the software or hardware, but they were unsure what it might be.

Crunched for time, the engineers hypothesized that the rover was suffering from a problem with its flash memory. Without yet knowing the root cause, they had the rover bypass the flash memory during a reboot, which temporarily solved the issue.

Causes

The engineers performed a more intensive investigation once the rover was stable and discovered the root cause of the issue. The issue was created by three smaller compounding components:

  • File Structure: The file structure of the Mars Spirit Rover code was such that it didn’t actually reclaim memory when things were deleted; it just marked them as replaceable. This became an issue because the “deleted” items were still taking up the same amount of memory space, which caused the memory space of the rover to rapidly reach capacity.
  • Third-Party Software: The rover used third-party software that required the contents of the flash memory to be mirrored in RAM. However, upon investigation, the engineers found that the RAM was only 128 MB, while the flash memory was 256 MB, so the mirror could outgrow the space available for it.
  • Dynamic Memory: When the rover entered a reboot mode, it had commands that it would try to place in memory, but the RAM was already full. When the initial storage method didn’t work, the rover would then allocate to a memory address that didn’t exist, which caused errors in the reboot. The rover would then try to fix the errors by rebooting again, forcing it to get stuck in reboot mode. 
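
The way these three components compound can be sketched with a toy model in Python. This is only an illustration of the description above, not the rover’s flight software: the class, its methods, and the file names are hypothetical, and only the 128 MB and 256 MB figures come from the text.

```python
# Illustrative only: a toy model with hypothetical names; sizes are in MB.

RAM_MB = 128    # RAM available for the in-memory mirror
FLASH_MB = 256  # flash capacity


class ToyFlashFS:
    def __init__(self):
        # Each record is a dict with "name", "size_mb", and "deleted" keys.
        self.records = []

    def write(self, name, size_mb):
        if self.space_held() + size_mb > FLASH_MB:
            raise OSError("flash is full")
        self.records.append({"name": name, "size_mb": size_mb, "deleted": False})

    def delete(self, name):
        for record in self.records:
            if record["name"] == name:
                # Marked as replaceable, but the space is not reclaimed.
                record["deleted"] = True

    def space_held(self):
        """Space taken by all records, including the 'deleted' ones."""
        return sum(r["size_mb"] for r in self.records)

    def live_data(self):
        """Space taken by files that have not been deleted."""
        return sum(r["size_mb"] for r in self.records if not r["deleted"])

    def mount(self):
        """Mirror everything held in flash into RAM, as at boot time."""
        needed = self.space_held()
        if needed > RAM_MB:
            raise MemoryError(f"mirror needs {needed} MB, RAM has {RAM_MB} MB")
        return needed
```

For example, writing ten 20 MB files and then deleting eight of them leaves only 40 MB of live data, yet `space_held()` still reports 200 MB, so `mount()` raises `MemoryError` on every attempt. A system that responds to that failure by rebooting and mounting again would loop in the same way the text describes.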

Results

The engineers eventually deleted some unused files (e.g., the landing sequence) to reclaim space in memory. Once the space was reclaimed, they were able to remotely install a file monitoring system that permanently resolved the memory issues. These fixes held for the rest of the mission; the rover last communicated with Earth in 2010.

Fastly

Fastly is a content delivery network (CDN) that lets companies cache content on servers close to their users. While Fastly is not a company that many people know, it is used by many large companies to improve their users’ web experiences.

In June 2021, a large section of the internet went down because of a software bug at Fastly. A software deployment in mid-May had introduced a bug that went undetected. On June 8, a customer uploaded a valid configuration change that inadvertently triggered it. The bug, though small enough to go unnoticed for weeks, caused 85% of the Fastly network to return errors almost immediately.

Though the cause was detected within an hour and patched later that day, the damage to the internet was already done. While not much more is known about the Fastly software bug, this is nonetheless an example of how much the internet can be impacted by relatively unknown companies. 
