Talk by Joe Armstrong.
Slides available here. Video link TBA.
Intro
Hardware rarely fails. If it does, this is easily solved by duplicating the hardware.
Software fails easily and often.
Fault-tolerance cannot be achieved with a single computer. We need to use several computers, which means we need to deal with the following challenges:
- Concurrency
- Parallel Programming
- Distributed Programming
- Physics
- Engineering
- Message passing is inevitable (this is the basis of OOP)
Programming languages should make this easily doable.
You never know how things “are” in a system, you only know how things “were” the last time you checked
How individual computers work is a small problem. How computers are connected and the protocols used between them is the significant problem.
We want the same way to program large and small scale systems
Erlang
- Influenced by Prolog and by ideas from CSP (see the CSP book).
- Unifies ideas on concurrent and functional programming
- Uses asynchronous messaging
- Is designed for programming fault-tolerant systems
Building fault-tolerant software boils down to detecting errors and doing something when errors are detected
Types of errors
- @ Compilation
- @ Runtime
- Errors that can be inferred (i.e. “this should be this way => something has gone wrong”)
- Reproducible errors
- Non-reproducible errors
Philosophy
- Find how to prove that software is correct @ compile time
- Assume that the software is incorrect
Proving that things are Correct
Proving self-consistency of small programs is relatively easy, but it will not help. For example, if the test cases themselves are incorrect, a program that is consistent with them is incorrect too.
Large assemblies of small things are impossible to prove correct
Types of System
- Highly Reliable (nuclear, air-traffic, satellite): very expensive.
- Reliable (driverless cars): kills people if they fail, moderately expensive
- Reliable (banks, phones): disruptive if they fail
- Dodgy (internet, HBO, Netflix): annoying if they fail
- Crap (free apps): annoying / somewhat expected to fail
How can we make software that works reasonably well even if there are errors in the software?
6 Requirements
- Concurrency
- Error Encapsulation
- Fault detection
- Fault identification
- Code upgrade
- Stable storage
The “Method”
- Detect all errors and crash
- If you can’t do what you want then try to do something simpler
- Handle errors remotely (detect them and ensure that the system is put into a “safe state”)
- ID the error kernel (the part of the system that must ALWAYS be correct)
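The steps above can be sketched roughly as follows. This is only an illustration in Python, not code from the talk: `supervise`, `precise_average`, `simpler_fallback` and `SafeState` are hypothetical names. The worker detects bad input and crashes without trying to repair anything; the error is handled elsewhere, by a supervisor that tries something simpler and, if everything fails, gives up in a controlled way.

```python
class SafeState(Exception):
    """Raised by the supervisor when every strategy has crashed."""


def precise_average(values):
    # Detect errors and crash: no attempt to repair bad input here.
    if not values:
        raise ValueError("no values to average")
    return sum(values) / len(values)


def simpler_fallback(values):
    # Something simpler: report how many values arrived, no arithmetic.
    return len(values)


def supervise(strategies, values):
    """Try each strategy in order; errors are handled here, not in the worker."""
    failures = []
    for strategy in strategies:
        try:
            return strategy.__name__, strategy(values), failures
        except Exception as exc:  # the worker crashed; record it and move on
            failures.append((strategy.__name__, repr(exc)))
    raise SafeState(failures)


name, result, failures = supervise([precise_average, simpler_fallback], [])
```

Here the supervisor is the only part that must always be correct, which makes it a (very small) stand-in for the error kernel.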
Errors 101
What is an Error?
- An undesirable property of a program
- Something that crashes a program
- A deviation between desired and observed behaviour
Who finds the error?
- The program (run-time) finds the error
- The programmer finds the error
- The compiler finds the error
The run-time finds an error
- Arithmetic errors
- Array bounds violated
- System routine called with nonsense arguments
- Null pointer
- Switch option not provisioned
- An incorrect value is observed
What to do when the runtime finds an error
Do:
- Crash immediately
- Don’t make matters worse
- Assume somebody else will fix the problem (supervision tree of systems)
Don’t:
- Ignore it
- Fix it
What should a programmer do when they don’t know what to do?
Do:
- Log it
- Maybe try to fix it, but don’t make matters worse
- Crash immediately
Don’t:
- Ignore it
In sequential languages with single threads, crashing will kill the app :(
Concurrency
- Enables high availability, scalability and security (security is very hard on a single machine), and you get one way to program everything
- One solution is to use linked processes: if a process dies, the processes linked to it get notified
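A rough simulation of the linked-process idea, assuming Python threads and a queue as stand-ins for processes and mailboxes (this is not how Erlang implements links, and `spawn_link` here is a hypothetical helper, not a library call). The worker does not handle its own crash; it only reports its exit to whoever is linked to it.

```python
import queue
import threading


def spawn_link(name, func, mailbox):
    """Run func in a thread; post an exit message to mailbox when it ends."""

    def run():
        try:
            func()
            mailbox.put(("EXIT", name, "normal"))
        except Exception as exc:
            # The crash is not handled here; it is only reported to the link.
            mailbox.put(("EXIT", name, repr(exc)))

    thread = threading.Thread(target=run, name=name)
    thread.start()
    return thread


def faulty_worker():
    raise RuntimeError("worker crashed")


mailbox = queue.Queue()
worker = spawn_link("worker-1", faulty_worker, mailbox)
exit_message = mailbox.get(timeout=5)  # the linked side learns of the death
worker.join()
```

The linked side can then decide what to do with `exit_message`, for example restart the worker or crash in turn.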
Detecting Errors
Where do errors come from?
- Arithmetic errors (divide by zero, overflow, underflow, …; quiet or NaN)
- Unexpected inputs
- Wrong values
- Wrong assumptions about the environment
- Sequencing errors
- Concurrency errors
- Breaking laws of maths or physics
Arithmetic Errors
- Silent and deadly errors - errors where the program does not crash but delivers an incorrect result. These can make matters worse!
- Noisy errors - errors which cause the program to crash
See:
- The End of (Numeric) Error
- Beyond Floating Point: Next generation computer arithmetic, John Gustafson (Stanford Lecture)
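To make the silent/noisy distinction concrete, here is a small Python illustration (not from the talk). Float overflow and NaN pass silently, while integer division by zero raises. The `checked` helper is a hypothetical name; it shows one possible way to turn a silent error into a noisy one.

```python
import math

# Silent: IEEE 754 float arithmetic overflows to infinity without raising.
overflowed = 1e308 * 10
# Silent: inf - inf yields NaN, which then spreads through later results.
not_a_number = overflowed - overflowed
poisoned_total = sum([1.0, 2.0, not_a_number])

# Noisy: integer division by zero raises, so the fault cannot go unnoticed.
try:
    1 // 0
    noisy = None
except ZeroDivisionError as exc:
    noisy = type(exc).__name__


def checked(value):
    """Turn a silent error into a noisy one by crashing on non-finite values."""
    if not math.isfinite(value):
        raise ArithmeticError(f"non-finite result: {value!r}")
    return value
```

Wrapping results in `checked` follows the "crash immediately" advice: the bad value is stopped at the point where it appears instead of leaking into later computations.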
Value Errors
- Program does not crash, but the values computed are incorrect or inaccurate
- How do we know if a program/value is incorrect if we do not have a specification?
- Many programs have no specifications or specs that are so imprecise as to be useless
- The specification might be incorrect (as well as the tests and the program)
What to do?
- Maintain an invariant (a baseline that lets you know how your system is “supposed” to be)
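As an illustration of maintaining an invariant, here is a minimal Python sketch. The bank-account scenario and the names `transfer`, `total` and `InvariantViolation` are assumptions made for the example, not something from the talk. The invariant says what the system is "supposed" to look like after every step, and the code crashes as soon as it no longer holds.

```python
class InvariantViolation(Exception):
    """Raised when the system is no longer in the state it is supposed to be."""


def total(accounts):
    return sum(accounts.values())


def transfer(accounts, source, target, amount):
    """Move money between accounts and crash if an invariant is broken."""
    before = total(accounts)
    updated = dict(accounts)
    updated[source] -= amount
    updated[target] += amount
    # Invariant: a transfer never creates or destroys money.
    if total(updated) != before:
        raise InvariantViolation("total balance changed")
    # Invariant: no account may go negative.
    if min(updated.values()) < 0:
        raise InvariantViolation("negative balance")
    return updated


accounts = {"alice": 100, "bob": 50}
after = transfer(accounts, "alice", "bob", 30)
```

Even without a full specification, a check like this catches value errors that would otherwise go unnoticed.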
Two systems are the same if they are observationally equivalent: a program is a black box, and what matters is the interface at its boundaries (STDIN/STDOUT).
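A small sketch of the black-box view, assuming two hypothetical implementations (`sum_loop` and `sum_builtin`) and a hypothetical helper `observationally_equivalent`. Only inputs and outputs are compared; how each function works inside is ignored.

```python
def sum_loop(numbers):
    total = 0
    for n in numbers:
        total += n
    return total


def sum_builtin(numbers):
    return sum(numbers)


def observationally_equivalent(f, g, inputs):
    """Treat f and g as black boxes and compare only what they return."""
    return all(f(x) == g(x) for x in inputs)


samples = [[], [1], [1, 2, 3], [-5, 5], list(range(100))]
same = observationally_equivalent(sum_loop, sum_builtin, samples)
```

Checking a finite set of samples only gives evidence of equivalence for those inputs; it is not a proof.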
Interactions between components involve message passing
There are very formal ways to describe the format (protocols), but we say nothing about the sequence or ordering of data.
There are very few formal ways to describe messages (JSON, XML)
Protocols are contracts
- Contracts assign blame
- A “contract checker” sits between the components and validates that the contract is being followed
- How do we describe contracts?
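One possible way to describe a contract is as a state machine over message sequences. The sketch below is only an assumption about what such a checker could look like, not the approach presented in the talk: the `CONTRACT` table, the message names and the `check` function are all hypothetical. It covers the ordering of messages and blames whichever party sends a message the contract does not allow.

```python
# Hypothetical contract for a tiny file-transfer session.
# Each state maps (sender, message type) to the next state.
CONTRACT = {
    "start": {("client", "open"): "opening"},
    "opening": {("server", "ok"): "ready", ("server", "error"): "start"},
    "ready": {("client", "read"): "reading", ("client", "close"): "start"},
    "reading": {("server", "data"): "ready"},
}


class ContractViolation(Exception):
    def __init__(self, blame, state, message):
        super().__init__(f"{blame} sent {message!r} in state {state!r}")
        self.blame = blame


def check(contract, trace, state="start"):
    """Replay a trace of (sender, message) pairs against the contract."""
    for sender, message in trace:
        allowed = contract[state]
        if (sender, message) not in allowed:
            # The party that sent the unexpected message broke the contract.
            raise ContractViolation(sender, state, message)
        state = allowed[(sender, message)]
    return state


good = [("client", "open"), ("server", "ok"), ("client", "read"), ("server", "data")]
final_state = check(CONTRACT, good)

bad = [("client", "open"), ("server", "data")]
try:
    check(CONTRACT, bad)
    blamed = None
except ContractViolation as violation:
    blamed = violation.blame
```

In the `bad` trace the server answers `open` with `data`, which the contract does not allow in the `opening` state, so the checker assigns the blame to the server.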