Talk by Joe Armstrong.

Slides available here. Video link TBA.

Intro

Hardware rarely fails. If it does, this is easily solved by duplicating the hardware.

Software fails easily and often.

Fault-tolerance cannot be achieved with a single computer. We need to use several computers, which means we need to deal with the following challenges:

  • Concurrency
  • Parallel Programming
  • Distributed Programming
  • Physics
  • Engineering
  • Message passing is inevitable (this is the basis of OOP)

Programming languages should make this easily doable.

You never know how things “are” in a system, you only know how things “were” the last time you checked

How individual computers work is a small problem. How computers are connected and the protocols used between them is the significant problem.

We want the same way to program large and small scale systems

Erlang

  • Influenced by Prolog and by ideas from CSP (see the CSP book).
  • Unifies ideas on concurrent and functional programming
  • Uses asynchronous messaging
  • Is designed for programming fault-tolerant systems

Building fault-tolerant software boils down to detecting errors and doing something when errors are detected

Types of errors

  • @ Compilation
  • @ Runtime
  • Errors that can be inferred (i.e. “this should be this way => something has gone wrong”)
  • Reproducible errors
  • Non-reproducible errors


Philosophy

  • Find how to prove that software is correct @ compile time
  • Assume that the software is incorrect


Proving that things are Correct

Proving self-consistency of small programs is relatively easy but it will not help. For example, we can look at situations where the test cases are incorrect => the program is incorrect.

Large assemblies of small things are impossible to prove correct

Types of System

  • Highly Reliable (nuclear, air-traffic, satellite): very expensive.
  • Reliable (driverless cars): kill people if they fail, moderately expensive
  • Reliable (banks, phones): disruptive if they fail
  • Dodgy (internet, HBO, Netflix): annoying if they fail
  • Crap (free apps): annoying / somewhat expected to fail

How can we make software that works reasonably well even if there are errors in the software?

6 Requirements

  1. Concurrency
  2. Error Encapsulation
  3. Fault detection
  4. Fault identification
  5. Code upgrade
  6. Stable storage


The “Method”

  • Detect all errors and crash
  • If you can’t do what you want then try to do something simpler
  • Handle errors remotely (detect them and ensure that the system is put into a “safe state”)
  • ID the error kernel (the part of the system that must ALWAYS be correct)
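
The "try to do something simpler" step can be sketched as a chain of fallbacks. This is a minimal illustration in Python, not code from the talk; `run_with_fallbacks` and the task names are hypothetical. Each option either succeeds or crashes, crashes are logged rather than ignored, and if even the simplest option fails the error is passed upward.

```python
def run_with_fallbacks(tasks, log):
    """Try each (name, task) in order, from most to least ambitious.

    Returns (name, result) for the first task that succeeds. Every crash
    is recorded in `log`. If all tasks crash, the failure is escalated.
    """
    for name, task in tasks:
        try:
            return name, task()
        except Exception as exc:
            # Record the failure, then fall through to the simpler option.
            log.append((name, repr(exc)))
    raise RuntimeError("every option failed, including the simplest one")


def fetch_live_price():
    # Hypothetical ambitious task that happens to fail.
    raise ConnectionError("price service unreachable")


def use_cached_price():
    # Hypothetical simpler task: serve a stale but safe value.
    return 42.0
```

Calling `run_with_fallbacks([("live", fetch_live_price), ("cached", use_cached_price)], log)` returns the cached value and leaves one entry in `log` describing the failed live lookup.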


Errors 101

What is an Error?

  • An undesirable property of a program
  • Something that crashes a program
  • A deviation between desired and observed behaviour

Who finds the error?

  • The program (run-time) finds the error
  • The programmer finds the error
  • The compiler finds the error

The run-time finds an error

  • Arithmetic errors
  • Array bounds violated
  • System routine called with nonsense arguments
  • Null pointer
  • Switch option not provisioned
  • An incorrect value is observed

What to do when the runtime finds an error

Do:

  • Crash immediately
  • Don’t make matters worse
  • Assume somebody else will fix the problem (supervision tree of systems)

Don’t:

  • Ignore it
  • Fix it
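
A rough sketch of the do/don't list above, in Python rather than Erlang. The worker contains no defensive code and simply crashes on bad input; a separate supervising function notices the crash, logs it, and carries on. All names here (`parse_temperature`, `supervise`, `RestartLimitExceeded`) are hypothetical, and a real supervision tree would run workers as separate processes rather than as function calls.

```python
class RestartLimitExceeded(Exception):
    """The worker keeps crashing, so the problem is escalated upward."""


def parse_temperature(raw):
    # Worker: no attempt to repair bad input. float() raises ValueError
    # on nonsense, which is the "crash immediately" behaviour.
    return float(raw)


def supervise(worker, inputs, max_restarts=3):
    """Run `worker` on each input, treating each exception as a crash.

    Crashes are logged and the worker is "restarted" on the next input.
    After more than `max_restarts` crashes the supervisor itself gives up
    and lets whoever supervises it decide what to do.
    """
    results, crashes = [], []
    for item in inputs:
        try:
            results.append(worker(item))
        except Exception as exc:
            crashes.append((item, repr(exc)))
            if len(crashes) > max_restarts:
                raise RestartLimitExceeded(crashes) from exc
    return results, crashes
```

With the inputs `["21.5", "oops", "19"]` the supervisor returns two parsed readings and one logged crash. The worker never tries to guess what `"oops"` was meant to be.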

What should a programmer do when they don’t know what to do?

Do:

  • Log it
  • Maybe try to fix it, but don’t make matters worse
  • Crash immediately

Don’t:

  • Ignore it

In sequential languages with single threads, crashing will kill the app :(

Concurrency

  • Enables high availability, scalability, and security (security is very hard on a single machine), and gives you one way to program everything
  • One solution is to use linked processes: if a process dies, the linked processes get notified
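
The idea of linked processes can be imitated in Python with threads and mailboxes. This is only a toy model under stated assumptions: the `Proc` class, its `link` method, and the `("EXIT", name, reason)` message shape are invented for illustration and are not an Erlang API. In this sketch a linked peer always receives the exit as an ordinary message, which is a simplification.

```python
import queue
import threading


class Proc:
    """A toy 'process': a thread with a mailbox and a list of linked peers."""

    def __init__(self, name, body):
        self.name = name
        self.mailbox = queue.Queue()
        self.links = []
        self._thread = threading.Thread(target=self._run, args=(body,))

    def link(self, other):
        # Links are symmetric: each side hears about the other's exit.
        self.links.append(other)
        other.links.append(self)

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()

    def _run(self, body):
        reason = "normal"
        try:
            body(self)
        except Exception as exc:
            reason = repr(exc)
        # On exit, normal or not, notify every linked process.
        for peer in self.links:
            peer.mailbox.put(("EXIT", self.name, reason))


def demo():
    seen = []

    def crashing_worker(proc):
        raise RuntimeError("disk full")

    def watcher_body(proc):
        # Block until the linked worker's exit notification arrives.
        seen.append(proc.mailbox.get(timeout=5))

    worker = Proc("worker", crashing_worker)
    watcher = Proc("watcher", watcher_body)
    worker.link(watcher)
    watcher.start()
    worker.start()
    worker.join()
    watcher.join()
    return seen[0]
```

`demo()` returns the exit message the watcher received, naming the worker and the reason it died. The watcher is then free to restart the worker or put the system into a safe state.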

Detecting Errors

Where do errors come from?

  • Arithmetic errors (divide by zero, overflow, underflow, …) quiet or NaN
  • Unexpected inputs
  • Wrong values
  • Wrong assumptions about the environment
  • Sequencing errors
  • Concurrency errors
  • Breaking laws of maths or physics

Arithmetic Errors

  • Silent and deadly errors - errors where the program does not crash but delivers an incorrect result. These can make matters worse!
  • Noisy errors - errors which cause the program to crash
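
The difference between silent and noisy arithmetic errors is easy to reproduce with Python floats. The function names below are made up for this sketch; the behaviour shown is that of standard IEEE 754 doubles as exposed by Python.

```python
import math


def silent_overflow():
    # Float multiplication overflows to infinity without raising anything.
    return 1e308 * 10


def silent_absorption():
    # Adding 1.0 to 1e16 is lost to rounding, so the difference is 0.0
    # instead of 1.0. No error is reported.
    return (1e16 + 1.0) - 1e16


def noisy_division(n, d):
    # Division by zero raises ZeroDivisionError: the program crashes.
    return n / d


def checked(value):
    """Turn a silent error into a noisy one by checking the result."""
    if not math.isfinite(value):
        raise ArithmeticError(f"non-finite result: {value!r}")
    return value
```

Wrapping a computation in `checked(...)` is one way to make an overflow crash immediately instead of letting an infinity propagate. It does not catch the rounding loss in `silent_absorption`, which needs a specification or an invariant to detect.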



Value Errors

  • Program does not crash, but the values computed are incorrect or inaccurate
  • How do we know if a program/value is incorrect if we do not have a specification?
  • Many programs have no specifications or specs that are so imprecise as to be useless
  • The specification might be incorrect (as well as the tests and the program)

What to do?

  • Maintain an invariant (a baseline that lets you know how your system is “supposed” to be)
  • Two systems are the same if they are observationally equivalent (a program is a black box. What is important is the interface at the boundaries of the program - STDIN/OUT)

  • Interactions between components involve message passing

  • There are very formal ways to describe the format (protocols), but they say nothing about the sequence or ordering of data

  • There are very few formal ways to describe messages (JSON, XML)
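
Both ideas, maintaining an invariant and comparing programs as black boxes, can be sketched in a few lines of Python. Everything here is illustrative: the bank-transfer invariant and the helper names are assumptions, and the equivalence check only covers the inputs it is given, so it can show a difference but never prove equivalence.

```python
def check_invariant(balances, expected_total):
    # The baseline for how the system is "supposed" to be.
    if sum(balances.values()) != expected_total:
        raise RuntimeError("invariant violated: money was created or destroyed")
    if any(amount < 0 for amount in balances.values()):
        raise RuntimeError("invariant violated: negative balance")


def transfer(balances, src, dst, amount):
    """Return updated balances, crashing if the invariant no longer holds."""
    expected_total = sum(balances.values())
    updated = dict(balances)
    updated[src] -= amount
    updated[dst] += amount
    check_invariant(updated, expected_total)
    return updated


def observe(func, value):
    # Record only what is visible at the boundary: a result or a crash type.
    try:
        return ("ok", func(value))
    except Exception as exc:
        return ("crash", type(exc).__name__)


def observationally_equivalent(impl_a, impl_b, inputs):
    """True if both black boxes behave identically on every given input."""
    return all(observe(impl_a, x) == observe(impl_b, x) for x in inputs)


def insertion_sort(items):
    # A second implementation to compare against the built-in sorted().
    result = []
    for item in items:
        i = len(result)
        while i > 0 and result[i - 1] > item:
            i -= 1
        result.insert(i, item)
    return result
```

On inputs such as `[3, 1, 2]`, `[]`, and `[1, "a"]`, `sorted` and `insertion_sort` are indistinguishable from the outside, including the crash on mixed types, even though their internals differ.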

Protocols are contracts

  • Contracts assign blame
  • A “contract checker” will validate that the contract is correct
  • How do we describe contracts?
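
One possible way to describe a contract is as a small state machine listing who may send which message in each state. The sketch below is an assumption made for illustration, not the notation used in the talk: the `CONTRACT` table, the message names, and `check_session` are all hypothetical. The checker watches the conversation and, on the first violation, reports which side is to blame.

```python
# state -> {(sender, message_type): next_state}
CONTRACT = {
    "start": {("client", "login"): "awaiting_auth"},
    "awaiting_auth": {("server", "ok"): "ready", ("server", "denied"): "start"},
    "ready": {("client", "get"): "awaiting_data", ("client", "logout"): "start"},
    "awaiting_data": {("server", "data"): "ready"},
}


class ContractViolation(Exception):
    """Raised on the first message that the contract does not allow."""

    def __init__(self, blame, state, message):
        super().__init__(
            f"{blame} broke the contract in state {state!r} by sending {message!r}"
        )
        self.blame = blame
        self.state = state
        self.message = message


def check_session(contract, messages, state="start"):
    """Replay (sender, message_type) pairs against the contract.

    Returns the final state if every message was allowed; otherwise raises
    ContractViolation blaming the sender of the offending message.
    """
    for sender, message in messages:
        allowed = contract.get(state, {})
        if (sender, message) not in allowed:
            raise ContractViolation(sender, state, message)
        state = allowed[(sender, message)]
    return state
```

A session of `login`, `ok`, `get`, `data` ends in the `ready` state. A client that sends `get` before logging in is blamed, and so is a server that answers a login with `data`. Unlike a pure format description, this table also constrains the ordering of messages.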