1 To the Moon… and back
In this chapter
If someone asked you what you meant by a workaround there is probably no better illustration for them than the events that took place after an explosion on the Apollo 13 spacecraft as it began its journey to the Moon in 1970. The film ‘Apollo 13’ is a fairly accurate depiction of how the team in Mission Control rebuilt the entire mission schedule to ensure that the three astronauts returned safely to Earth. This chapter summarises what took place in 1970 and considers the implications for a better understanding of the concept of a workaround. The Apollo 13 explosion was a direct result of a workaround by engineers in the lead-up to the mission, a good example of how a workaround may seem to be effective to the developer but the downstream impacts may not be apparent. Tragically a workaround to achieve an on-time launch of the Challenger Space Shuttle in 1986 had disastrous consequences.
—————
To begin at the beginning (to quote Dylan Thomas) it would be appropriate to offer a definition of ‘workaround’. A good place to start is always the Oxford English Dictionary, though interestingly the word did not enter the OED database until September 2014.
“The word workaround has entered general English usage to refer to a makeshift method of overcoming or bypassing a problem, but until recently it was limited primarily to technical contexts. First attested in 1961, for its first two decades it was used primarily in aerospace jargon. For instance, in 1965, the Oakland Tribune reported that ‘Project Apollo executives are trying short-cuts, improvisations and ‘work-arounds’ to keep the moon schedule from slipping out of the ’60s and into the ’70s’ (12 Sept.). By the 1980s, the term had also been adopted by the computing industry to refer to a method of overcoming a performance issue or limitation in a program. Adoption of the workaround in nontechnical contexts is a relatively recent development.”
The Cambridge Dictionary offers a single sentence definition.
“A workaround is a way of dealing with a problem or making something work despite the problem, without completely solving it.”
The simplicity of these definitions has not inhibited a very considerable effort on the part of the academic research community to propose more extended definitions, and these are the subject of Chapter 3. The focus of this research has been primarily on the use of the term by IT professionals from the time that the concept of ‘working around’ IT problems was proposed by Leslie Gasser in his PhD thesis in 1984 (Gasser 1986). This is in line with the comment from the OED that the term was adopted by the IT profession in the 1980s.
Workarounds are usually adaptations of an application under the governance of corporate IT. The term ‘shadow IT’ is used to describe applications which have been developed and used without the agreement of corporate IT. It could be argued that shadow IT (often written as Shadow IT for emphasis) is an example of a workaround.
By their nature workarounds and the use of shadow IT are invariably invisible to all except those who develop and adopt them. This makes talking about (and indeed writing about) workarounds something of a challenge. Fortunately, there has been one major event with a global audience that provides a definitive illustration of how workarounds averted a potentially tragic and very public end to a planned flight to land on the surface of the Moon and return safely.
The first crewed Mercury flight took place on 5 May 1961 when the Freedom 7 spacecraft, piloted by Alan Shepard, achieved a suborbital test flight. Nine months later, on 20 February 1962, John Glenn orbited the Earth in his Mercury spacecraft Friendship 7. It is quite amazing in retrospect that despite the colossal technical and human challenges in eventually achieving the Moon landing on 20 July 1969 the only fatalities in a spacecraft in the entire history of the Mercury, Gemini and Apollo missions occurred when the three Apollo 1 astronauts died in a fire whilst testing the capsule on the launch pad at Cape Kennedy in Florida.
“I believe we’ve had a problem here”
Apollo 13 was to be the seventh crewed Apollo mission and the third to land on the Moon. The mission was commanded by Jim Lovell, with Jack Swigert as Command Module (CM) pilot and Fred Haise as Lunar Module (LM) pilot. Swigert (from the back-up crew) was a late replacement for Ken Mattingly, who was grounded after exposure to rubella. Fortunately NASA always had a back-up crew that shadowed the primary team very closely so there would be no significant gap in expertise.
An Apollo Lunar Module is on display at the National Air and Space Museum in Washington. The Apollo Command Modules were (at least to some extent) based on the evolution of the Mercury and Gemini capsules but the Lunar Excursion Module (LEM, later just LM) had no antecedents. The Apollo capsules could be tested in Earth orbit but there was no way to test the LM for landing and takeoff from the Moon.
Photograph © The author
The LM presented some very significant challenges to ensure that the astronauts landed safely and could take off from the surface of the Moon with just a single engine and no redundancy. Throughout the design and manufacturing it was an ongoing problem to keep the weight within what could be lifted by the Saturn V rocket. As a good example the seats originally included for the astronauts were taken out, as it would be easier to manage the LM standing up as well as saving weight. There is an excellent account of the development and manufacture of the LM by Kelly (2001) which highlights many workarounds that had to be adopted to keep to specification and schedule, as well as the situation inside Mission Control during the Apollo 13 mission.
The launch took place on 11 April 1970. The mission very quickly had its first problem when the centre engine of the second stage developed a fault and shut down early. The spacecraft still achieved Earth orbit but only with a change in flight plan.
All the power and life support systems were contained in the Service Module, including two tanks containing liquid oxygen. The oxygen supplied the fuel cells that generated electric power and also provided the air for the crew, while water was produced as a by-product of the fuel cells in operation.
It was important to maintain the temperature of the oxygen tanks as the oxygen was used up. This was accomplished by a heated tube running the length of the tank with fans on either end so that the oxygen could be stirred as well as heated to achieve a consistent temperature. Detecting the level of the liquid oxygen was not easy and the method used was prone to errors caused by variations in density in the tanks, again a reason for them being stirred.
Just over 55 hours into the mission, Mission Control asked the crew to stir the tanks as they had a concern about the quality of data on the oxygen levels. Shortly after initiating the stir there was a loss of data from the spacecraft, and the crew heard a loud noise from the Service Module.
This caused Swigert to say to Mission Control “I believe we’ve had a problem here”.
Workarounds en masse
It did not take long for Mission Control and the Apollo 13 crew to realise that the consequences of the explosion meant the end of the mission, and the focus moved very quickly onto how to get the crew back to Earth. From that time onwards every action was a workaround. These have all been well documented, especially from the perspective of Grumman Aerospace (the company that built the Lunar Module).
There were two very important elements of good fortune at this point. The first was that the spacecraft was at a point in its journey to the Moon where there was time to revise the orbit and have the gravity of the Moon capture the spacecraft and put it back into a return orbit with just a small amount of assistance from the LM engine, which was never designed to cope with the weight of the entire spacecraft.
The second element was that early in the design of the LM there was a discussion about using the LM as a lifeboat should there be a problem with the oxygen and power supplies in the Command Module, and so there was oxygen on the LM to sustain a return to Earth. What had not been considered was the build-up of carbon dioxide. The lithium hydroxide purification canisters on the Command Module and the LM were completely different in shape and construction, having been developed by two different companies, so the Command Module canisters could not simply be fitted in the LM. This led to Mission Control having to develop a workaround ‘tube’ that could be constructed by the astronauts from items of card, plastic and tape already in the Command Module.
As well as the internal resource issues the return route back to Earth had to be recalculated, a major problem given the computing capacity available in 1970. The recalculations had to take into account the fact that the combined weight of the spacecraft was radically different from plan because of the damage to the Service Module and having to return with the LM attached until quite close to the re-entry stage. In addition most of the computer systems in the Command Module had to be turned off to conserve the battery power and there was no precedent for reactivating them. As a result new sets of commands had to be read out by the team at Mission Control and copied down onto paper by the astronauts in almost freezing conditions. A facsimile of the LM Systems Activation Checklist has been published and gives a good indication of the complexity of the LM systems.
The Command Module returned to Earth successfully on 17 April.
The mission was probably the most visible and complex set of workarounds in history. Had they failed the astronauts would have died and the entire space programme would have been overshadowed by their deaths.
The genesis of the failure
There was of course an investigation into the cause of the failure of the Service Module. The story begins about 18 months prior to the Apollo 13 launch. An oxygen tank intended to be used in Apollo 10 was dropped a few centimetres onto the floor of the launch building but appeared to be undamaged. A replacement was fitted for Apollo 10 but the original was then allocated to Apollo 13. Three weeks before launch the tank was filled with oxygen as a standard test but it was found to be slow to empty. As a workaround the technicians switched on the heaters in the tank to boil the gas out.
The original specifications developed a decade earlier were for 28 volt spacecraft systems. However, the launch site used 65 volt systems, and when this voltage was applied to the thermostatic switches in the tank they became welded shut. The heaters therefore stayed on and the insulation on the wiring inside the tank was also damaged. The oxygen boiled out as expected and the technicians had no reason to think that there had been any internal damage.
When the astronauts switched the fans on, a short circuit in the wiring with the damaged insulation caused a spark, and this caused the explosion.
The workaround had a completely unexpected outcome, with its basis in a failure of information management. The change in systems voltage had never been communicated to the manufacturers of the internal switches in the tank. This is a classic example of a workaround having an impact much later on in a series of process steps.
The role of Mission Control
In the case of Apollo 13 the astronauts themselves did not develop the workarounds but did have to have complete trust in the instructions that were being sent up to them to execute. The scale of the effort at Mission Control to save the lives of the astronauts is only partially presented in the accounts of the mission from the astronauts themselves, notably the book by Mission Commander Jim Lovell (Lovell and Kluger 1994).
To understand the immense amount of work being undertaken at Mission Control the definitive resource is an autobiographical account by Flight Director Gene Kranz (2000) who not only provides a wealth of detail about the scale, speed and innovation of the workaround development but also documents the many other workarounds that Mission Control had to manage during the preceding Mercury, Gemini and Apollo missions. The book also illustrates the commitment of the NASA management to learn from the lessons of these workarounds, ensuring that the same problem never emerged a second time during the course of the missions.
The Challenger Space Shuttle disaster
To this day I can remember 28 January 1986. Just as I was about to leave my hotel room in New York for a lunch meeting with colleagues my wife Cynthia phoned me from the UK to tell me the news of the Challenger Space Shuttle disaster. Ten minutes later as I walked into the office it was clear from the office buzz that my colleagues were unaware of the launch catastrophe. It was not easy to convince them of the magnitude of the disaster and the impact it was already having in the USA and the rest of the world.
The full story of the causes of the disaster has been presented in detail by Vaughan (1996) and McDonald (2009). In essence the elastomer O-rings sealing the sections of the booster rockets failed as a result of becoming inflexible in the unusually cold conditions of the launch. The final launch approval meeting recognised that the cold temperature of the launch might have an effect on the O-rings but lacked any experimental data on the degradation of the flexibility at close to the ambient temperature. As a result members of the team had to work around inadequate information to make their decision at a time when there was pressure from NASA to move ahead more quickly than had been the case to that date on scheduling Shuttle launches.
The impact of the low temperature on the O-rings was brilliantly illustrated by US physicist Richard Feynman during the hearings of the Commission set up to identify the issues and how the procedures should be changed. His ad hoc demonstration using an O-ring acquired from a museum and a glass of iced water is still regarded as a classic example of how to make a complex issue understandable.
There was no recurrence of the O-ring problem in subsequent Shuttle launches. Sadly the decision to bring the Shuttle programme to a premature close followed the loss of the Columbia shuttle, which broke up in the atmosphere on its return in 2003. A piece of insulating foam had broken off the external tank at launch and damaged the heat-resistant protection on the leading edge of the left wing. This time there was no workaround available.
The bottom line
The issue of a definition for a workaround is important, because if there is no agreed definition of a workaround how can it be detected and assessed for its impact? For example, does a workaround have to be something that recurs on a regular basis, or (as is the case with Apollo 13) can it be a one-off way of solving a problem? The fact that it was a workaround that inadvertently resulted in the on-board explosion highlights that the impact of a workaround might not be immediately obvious to the employee adopting it. It could be argued that Apollo 13 is a special case as it is not specifically related to IT issues. In the next chapter some more examples of workarounds in a more general sense are presented.
References
Gasser, L. (1986). The integration of computing and routine work. ACM Transactions on Office Information Systems, 4(3). https://doi.org/10.1145/214427.214429
Kelly, T.J. (2001). Moon Lander. Smithsonian History of Aviation and Spaceflight Series. ISBN 978-1-58834-273-7
Kranz, G. (2000). Failure is not an option. Berkley Publishing Group. ISBN 0-425-17987-7
LM Systems Activation Checklist https://www.kroneckerwallis.com/product/apollo-13-activation-checklist/
Lovell, J. & Kluger, J. (1994). Lost Moon – The perilous Voyage of Apollo 13. Houghton Mifflin. ISBN 0-395-67029-2
McDonald, A.J. (2009). Truth, lies and O-Rings: Inside the space shuttle Challenger disaster. University Press of Florida. ISBN 978-0-8130-4193-3
Vaughan, D. (1996). The Challenger launch decision. University of Chicago Press. ISBN 0-226-85176-1