Talk by Tomas Petricek. Many of the ideas are drawn from The Computer Boys Take Over (2010) by Nathan Ensmenger.

In this talk, Tomas defines four strategies for dealing with errors in software. Slides are available here. Video link TBA.

Errors as a Part of the Process

This is the realization that errors are an inevitable part of the process, so we design methods to catch them early through testing and engineering practices.

  • Starts from data processing in the ‘60s with languages like COBOL (an English-like language developed to enable non-skilled programmers to write programs).
  • Continues with methodologies such as TDD today, where a failing test deliberately introduces an error into the code base and the programmer is responsible for implementing a solution. In TDD, the test becomes honest documentation of what the code is designed to do.
  • The NATO Software Engineering Conference of ‘68 started the transition from programming as a craft to programming as engineering. Interestingly, Tomas mentions the Software Craftsmanship (2001) book, which advocates for a movement away from a manufacturing model to one that values the skills of individual developers.
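
The TDD bullet above can be illustrated with a small sketch. This is not from the talk: `parse_amount` and its behaviour are hypothetical, chosen only to show the tests acting as documentation. In a TDD workflow the test class would be written first and would fail until the function is implemented.

```python
import unittest


def parse_amount(text: str) -> int:
    """Hypothetical function under test: parse a string like '12.50' into cents."""
    units, _, fraction = text.partition(".")
    # Pad or truncate the fractional part to exactly two digits.
    cents = (fraction + "00")[:2]
    return int(units) * 100 + int(cents)


class ParseAmountTest(unittest.TestCase):
    # Each test states one thing the code is designed to do.
    def test_whole_number(self):
        self.assertEqual(parse_amount("12"), 1200)

    def test_two_decimals(self):
        self.assertEqual(parse_amount("12.50"), 1250)

    def test_one_decimal(self):
        self.assertEqual(parse_amount("0.5"), 50)
```

Saved as a module, this can be run with `python -m unittest`. Reading the three test names gives a short, executable description of the intended behaviour.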

Errors as Contradiction

This is the idea that we can prove that a program is correct through logic and mathematics. Instead of debugging a program, one should prove that it meets its specs.
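
As a toy illustration of proving rather than debugging, here is a minimal Lean 4 sketch. It is not from the talk: the function `double` and its specification are hypothetical, and the proof assumes a recent Lean 4 toolchain in which the `omega` tactic is available.

```lean
-- Hypothetical program: doubles a natural number by adding it to itself.
def double (n : Nat) : Nat := n + n

-- Specification: for every input, `double n` equals `2 * n`.
-- The proof covers all inputs at once, which no finite test suite can do.
theorem double_meets_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

If the definition were wrong (say `n + 1`), the theorem would fail to compile, so the error shows up as a contradiction with the spec rather than as a runtime bug.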

Errors as the Unavoidable

This is the belief that errors will always happen at runtime, regardless of engineering or design practices. In any distributed long-running system, errors will inevitably happen because of the system’s scale and complexity.

While working on hugely distributed telco systems at Ericsson in the ‘80s, Joe Armstrong helped develop the [Erlang](https://en.wikipedia.org/wiki/Erlang_(programming_language)) language. Joe’s thoughts on the nature of errors are implemented in the language:

  • Exceptions occur when the runtime doesn’t know what to do
  • Errors occur when the programmer doesn’t know what to do
  • Errors are expected because the spec can’t possibly cover all cases

This means that, in Erlang, an error is the opposite of a test failure: it is something that can never be covered by a specification or test case, because it is unpredictable. The language’s design is suited for systems that are:

  • Distributed
  • Fault-tolerant
  • Scalable
  • Highly available
  • Reconfigurable while running
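
The stance above is often summed up as "let it crash": rather than anticipating every failure, a supervising process restarts a failed worker from a clean state. The following is a rough Python sketch of that idea, not Erlang/OTP code and not something shown in the talk. The names `supervise`, `make_flaky_worker`, and `max_restarts` are hypothetical.

```python
from typing import Callable


def supervise(worker: Callable[[], str], max_restarts: int = 3) -> str:
    """Run worker; on any unexpected exception, restart it.

    Re-raises the exception once max_restarts restarts have been used.
    """
    restarts = 0
    while True:
        try:
            return worker()
        except Exception as exc:
            if restarts >= max_restarts:
                raise  # Give up and escalate to whoever supervises us.
            restarts += 1
            print(f"worker crashed ({exc!r}); restart {restarts}/{max_restarts}")


def make_flaky_worker(failures: int) -> Callable[[], str]:
    """Build a hypothetical worker that crashes `failures` times, then succeeds."""
    remaining = {"count": failures}

    def worker() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("unanticipated state")
        return "ok"

    return worker
```

Calling `supervise(make_flaky_worker(2))` survives two crashes and returns the worker's result. The supervisor never inspects what went wrong; it only assumes that something eventually will.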

Errors as Inspiration

This is the idea of seeing programmers as creative individuals like chess masters or musicians. We can also look to jazz where there are no ‘wrong notes’. A jazz musician responds to error by either:

  • Accepting the results of imperfect execution,
  • Compensating for the unexpected result by manual intervention, or
  • Accepting it as a serendipitous alternative

Smalltalk in the ‘70s and ‘80s allowed for reflection, where the program could be designed to rewrite itself in response to error signals.

Sonic Pi is a live music coding program with Ruby-like syntax on Emacs! It’s an utterly brilliant, fully-featured piece of software that I wholeheartedly encourage you all to try! The developer, Sam Aaron, has designed the software to make errors easier to hear and see, enabling quick human intervention.

Summary

Errors are Defined by their Context

Google Translate turned Russia into Mordor and the foreign minister Lavrov into Sad Little Horse.

Google’s AI learned this from Ukrainian blogs written in response to the Russian occupation, where the two words were frequently used interchangeably as a joke.

If we look at the case of Knight Capital losing $440 million due to a computer bug in 2012, we can examine how each of the above four “error types” might react to the mistake:

  • Engineer/Craftsman: there is a missing test. What properties of this system need to be validated?
  • Logician: critical systems should be proved correct. Post-mortem? Code review? It’s tough to make this one applicable if you are not working in science/mathematics.
  • Erlang: errors will always happen, so the system should be redesigned to be failure-resilient.
  • Live Coder (Ops person?): it took 45 mins to shut it down, so we should learn to intervene faster. Improve telemetry/monitoring? Have more reactive services? The system needs to be designed to allow for manual intervention.

Further reading: