Slides available here. Video link TBA.

Intro

Hardware rarely fails. If it does, this is easily solved by duplicating the hardware.

Software fails easily and often.

Fault-tolerance cannot be achieved with a single computer. We need to use several computers, which means we need to deal with the following challenges:

Concurrency
Parallel Programming
Distributed Programming
Physics
Engineering
Message passing is inevitable (this is the basis of OOP)

Programming languages should make this easily doable.

You never know how things “are” in a system; you only know how things “were” the last time you checked.

How individual computers work is a small problem. How computers are connected and the protocols used between them is the significant problem.

We want the same way to program large and small scale systems

Erlang

Influenced by Prolog and by ideas from CSP (see the CSP book).
Unifies ideas on concurrent and functional programming
Uses asynchronous messaging
Is designed for programming fault-tolerant systems

Building fault-tolerant software boils down to detecting errors and doing something when errors are detected

Types of errors

@ Compilation
@ Runtime
Errors that can be inferred (i.e. “this should be this way => something has gone wrong”)
Reproducible errors
Non-reproducible errors

Philosophy

Find how to prove that software is correct @ compile time
Assume that the software is incorrect

Proving that things are Correct

Proving self-consistency of small programs is relatively easy, but it will not help. For example, if the test cases themselves are incorrect, then a program that is consistent with them is incorrect too.

Large assemblies of small things are impossible to prove correct

Types of System

Highly Reliable (nuclear, air-traffic, satellite): very expensive.
Reliable (driverless cars): kills people if they fail, moderately expensive
Reliable (banks, phones): disruptive if they fail
Dodgy (internet, HBO, Netflix): annoying if they fail
Crap (free apps): annoying / somewhat expected to fail

How can we make software that works reasonably well even if there are errors in the software?

6 Requirements

Concurrency
Error Encapsulation
Fault detection
Fault identification
Code upgrade
Stable storage

The “Method”

Detect all errors and crash
If you can’t do what you want then try to do something simpler
Handle errors remotely (detect them and ensure that the system is put into a “safe state”)
Identify the error kernel (the part of the system that must ALWAYS be correct)
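
The four steps above can be sketched in a few lines. This is only an illustration in Python (the talk itself is about Erlang), and every name in it (`fetch_live_price`, `cached_price`, `supervise`) is hypothetical. The worker contains no error handling and simply crashes; a separate supervising function detects the crash, tries something simpler, and otherwise reports a safe state.

```python
def fetch_live_price(prices, symbol):
    # Worker: no defensive code. A missing symbol raises KeyError (a crash).
    return prices[symbol]


def cached_price(cache, symbol):
    # Simpler fallback: serve the last known value, if there is one.
    return cache[symbol]


def supervise(attempts):
    """Try each callable in order and return the first result.

    Errors are handled here, outside the worker code. This small function
    plays the role of the error kernel: it is the part that must be correct.
    """
    failures = []
    for attempt in attempts:
        try:
            return {"status": "ok", "value": attempt(), "failures": failures}
        except Exception as exc:  # detect the crash, do not try to repair it
            failures.append(f"{type(exc).__name__}: {exc}")
    # Nothing worked: report a safe state instead of a made-up value.
    return {"status": "safe_state", "value": None, "failures": failures}


live = {"ERL": 42.0}
cache = {"ERL": 41.5, "OTP": 10.0}
result = supervise([
    lambda: fetch_live_price(live, "OTP"),   # crashes: not in live data
    lambda: cached_price(cache, "OTP"),      # simpler thing that works
])
```

Here `result["value"]` comes from the cache, and `result["failures"]` records the crash of the first attempt so that it can be logged.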

Errors 101

What is an Error?

An undesirable property of a program
Something that crashes a program
A deviation between desired and observed behaviour

Who finds the error?

The program (run-time) finds the error
The programmer finds the error
The compiler finds the error

The run-time finds an error

Arithmetic errors 
Array bounds violated
System routine called with nonsense arguments
Null pointer
Switch option not provisioned
An incorrect value is observed

What to do when the runtime finds an error

Do:

Crash immediately
Don’t make matters worse
Assume somebody else will fix the problem (supervision tree of systems)

Don’t:

Ignore it
Fix it
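
A small Python sketch of the difference, using hypothetical function names. The first version “fixes” the error locally by inventing a value; the second has no handler, so the error surfaces immediately and whoever supervises the code can decide what to do.

```python
def average_ignoring_errors(values):
    # Don't: swallow the error and invent a result.
    try:
        return sum(values) / len(values)
    except ZeroDivisionError:
        return 0.0  # looks plausible, but it is a made-up value


def average(values):
    # Do: no handler here. An empty list raises ZeroDivisionError at once,
    # and the caller (or a supervisor) decides what happens next.
    return sum(values) / len(values)
```

With an empty list, `average_ignoring_errors` returns `0.0` and the caller cannot tell that anything went wrong, while `average` crashes on the spot.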

What should a programmer do when they don’t know what to do?

Do:

Log it
Maybe try to fix it, but don’t make matters worse
Crash immediately

Don’t:

Ignore it

In sequential languages with single threads, crashing will kill the app :(

Concurrency

Enables high availability, scalability, and security (security is very hard on a single machine), and you get one way to program everything
One solution is to use linked processes: if a process has died, the linked processes will get notified
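
A rough Python sketch of the linking idea, using threads and a queue in place of Erlang processes and mailboxes. The helper `spawn_link` and the `("EXIT", name, reason)` message shape are assumptions made for this illustration, loosely modelled on the exit messages an Erlang process receives when it traps exits.

```python
import queue
import threading


def spawn_link(func, mailbox, name):
    """Run func in a thread; when it ends, post an exit message to mailbox."""
    def runner():
        try:
            func()
            mailbox.put(("EXIT", name, "normal"))
        except Exception as exc:
            # The worker died: notify the linked party instead of hiding it.
            mailbox.put(("EXIT", name, repr(exc)))

    thread = threading.Thread(target=runner, name=name)
    thread.start()
    return thread


def healthy():
    pass


def faulty():
    raise ValueError("bad input")


mailbox = queue.Queue()
threads = [spawn_link(healthy, mailbox, "a"), spawn_link(faulty, mailbox, "b")]
for t in threads:
    t.join()

# The linked side reads the exit messages and learns who died and why.
exits = {}
while not mailbox.empty():
    _, name, reason = mailbox.get()
    exits[name] = reason
```

After this runs, `exits` maps each worker name to its exit reason, so the observer can restart `"b"` or take some other action without the failed worker having handled anything itself.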

Detecting Errors

Where do errors come from?

Arithmetic errors (divide by zero, overflow, underflow, …), which may be quiet or produce NaN
Unexpected inputs
Wrong values
Wrong assumptions about the environment
Sequencing errors
Concurrency errors
Breaking laws of maths or physics

Arithmetic Errors

Silent and deadly errors - errors where the program does not crash but delivers an incorrect result. These can make matters worse!
Noisy errors - errors which cause the program to crash
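
Both kinds are easy to reproduce. The snippet below is a small Python illustration (variable and function names are made up for the example): floating-point overflow and rounding go wrong silently, while integer division by zero is noisy.

```python
import math

# Silent: float overflow quietly becomes infinity, and NaN follows from it.
big = 1e308 * 10
lost = big - big

# Silent: rounding error gives a wrong comparison without any warning.
rounding_ok = (0.1 + 0.2 == 0.3)


# Noisy: integer division by zero stops the program with an exception.
def noisy_divide(a, b):
    return a // b
```

Here `math.isinf(big)` and `math.isnan(lost)` are both true and `rounding_ok` is false, yet nothing crashed. Calling `noisy_divide(1, 0)` raises `ZeroDivisionError` straight away.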

See:

The End of (Numeric) Error
Beyond Floating Point: Next generation computer arithmetic, John Gustafson (Stanford Lecture)

Value Errors

Program does not crash, but the values computed are incorrect or inaccurate
How do we know if a program/value is incorrect if we do not have a specification?
Many programs have no specifications or specs that are so imprecise as to be useless
The specification might be incorrect (as well as the tests and the program)

What to do?

Maintain an invariant (a baseline that lets you know how your system is “supposed” to be)
Two systems are the same if they are observationally equivalent (a program is a black box. What is important is the interface at the boundaries of the program - STDIN/OUT)
Interactions between components involve message passing
There are very formal ways to describe the format (protocols), but we say nothing about the sequence or ordering of data
There are very few formal ways to describe messages (JSON, XML)
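
As an illustration of maintaining an invariant, here is a minimal Python sketch. The bank-transfer setting and the names `with_invariant`, `transfer` and `buggy_transfer` are assumptions for the example, not something from the talk. The invariant is that a transfer never changes the total balance; the check turns a silent value error into a noisy one.

```python
def with_invariant(step):
    """Wrap a step so that it crashes if the total balance changes."""
    def checked(accounts, *args):
        before = sum(accounts.values())
        step(accounts, *args)
        after = sum(accounts.values())
        if after != before:
            raise RuntimeError(
                f"invariant violated: total went from {before} to {after}"
            )
        return accounts
    return checked


def transfer(accounts, src, dst, cents):
    accounts[src] -= cents
    accounts[dst] += cents


def buggy_transfer(accounts, src, dst, cents):
    accounts[src] -= cents  # bug: forgets to credit dst
```

`with_invariant(transfer)` behaves like `transfer`, while `with_invariant(buggy_transfer)` raises as soon as money disappears, even though no specification of `buggy_transfer` itself was needed.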

Protocols are contracts

Contracts assign blame
A “contract checker” will validate that the contract is being followed
How do we describe contracts?
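
One possible way to describe a contract is as a small state machine listing which request is allowed in which state and which reply must follow. The Python sketch below is only an assumption about how such a checker could look; the `CONTRACT` table, the login protocol and the class names are invented for the example. The checker sits between client and server, checks both message types and their sequence, and names the party to blame.

```python
# Hypothetical contract for a tiny protocol:
# state -> {client request: (expected server reply, next state)}
CONTRACT = {
    "start": {"login": ("ok", "ready")},
    "ready": {"get": ("value", "ready"), "logout": ("bye", "start")},
}


class ContractViolation(Exception):
    def __init__(self, blame, detail):
        super().__init__(f"{blame} broke the contract: {detail}")
        self.blame = blame


class ContractChecker:
    """Sits between client and server and checks every exchange."""

    def __init__(self, server, contract=CONTRACT, state="start"):
        self.server = server
        self.contract = contract
        self.state = state

    def call(self, request):
        allowed = self.contract[self.state]
        if request not in allowed:
            # Wrong message or wrong order: the client is to blame.
            raise ContractViolation(
                "client", f"{request!r} not allowed in state {self.state!r}"
            )
        expected_reply, next_state = allowed[request]
        reply = self.server(request)
        if reply != expected_reply:
            # Request was fine but the answer was not: blame the server.
            raise ContractViolation(
                "server", f"replied {reply!r}, expected {expected_reply!r}"
            )
        self.state = next_state
        return reply
```

For example, wrapping a server function in `ContractChecker` and calling `"get"` before `"login"` raises a `ContractViolation` whose `blame` attribute is `"client"`; a server that answers `"login"` with anything other than `"ok"` is blamed instead.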

dansible:

GOTO - dos and donts of error handling:

4 May 2018