He is like a man which built an house, and digged deep, and laid the foundation on a rock: and when the flood arose, the stream beat vehemently upon that house, and could not shake it: for it was founded upon a rock. But he that heareth, and doeth not, is like a man that without a foundation built an house upon the earth; against which the stream did beat vehemently, and immediately it fell; and the ruin of that house was great. (Luke 6:48-49)
It has long been recognised that one of the most important parts of any building is the foundation. The roof may be the most vulnerable, but it is also the most readily replaceable. Any extensive survey of an area that has been intensively inhabited for a number of years shows the durability (and importance) of foundations. Dig beneath any old English church and you will nearly always find another old English church, and beneath that an old Roman temple. Travel to a land with an even longer history and you can often find generation upon generation built upon the same foundation. No wonder the Lord used the concept of a foundation to drive home the need for stability and permanence in our lives.
The question that naturally arises is "Do applications need foundations of similar permanence and stability?" And the answer appears to be no. In previous articles we have covered the impossibility (or at least unlikeliness) of producing a computer program that accurately reflects the underlying process. We have also considered the improbability of producing a defect-free program in a timely fashion. Yet computer applications really do seem to be taking over the world. From the humblest household gadget to the decision systems used by the military and government, computer logic is getting everywhere, so it must be doing something right. Or at least, it must appear to be doing something right.
The user acceptance of a gadget (or application) is not usually based upon absolute academic criteria but upon perception. As fallible creatures we expect to live in an imperfect world; we are therefore quite tolerant of imperfection provided the problem appears insoluble. I am currently touch-typing on a computer keyboard that has been painstakingly designed to cripple my productivity. Surely not? 'Fraid so. Back in the days of mechanical keyboards the #1 problem was keys jamming together. A fast typist could hit certain key sequences so rapidly that the metal head of the first key would not return fully before the second key struck; this caused a jam which could (in some cases) wreck the typewriter. The solution was the QWERTY keyboard, designed to ensure that all the major sequences and patterns were as difficult to type as possible: slow the typist down and the machine will cope. Over the years many better keyboards have been designed, but the QWERTY layout is so entrenched it seems immovable. So we develop technologies to auto-correct mis-typed key sequences (the hand naturally tries to make the easier movement rather than the one required). We build wrist-rests and exercise sequences to try to reduce RSI, and we employ thousands of extra personnel to type for us because for some reason we don't seem to be able to get up to speed.
So why do we tolerate a sub-optimal solution? Because it gets us there reliably: "if it ain't broke don't fix it." In fact you will find that throughout the whole universe of machinery the ability to keep going regardless is a much-valued feature. Passenger airliners are sold as much on their ability to land on three engines as on their ability to fly on four. Military strategy is likewise designed to keep an army functioning when a number of units (even key units) have been taken out.
The mechanism used to build this doggedness into the machine differs from instance to instance but tends to come from one of four camps:
This design philosophy has been taken on board by computer designers lock, stock and barrel. If you come across a computer application that claims to be fault-tolerant, the chances are it exhibits one (or all) of the above behaviours.
I was once sent a cartoon by fax. The picture was of an aeroplane flying over a busy city, with a speech bubble containing the words "Hello, this is flight 932 from Sydney to Heathrow Central, what does 'Numeric Exception at 000A:0089 -- program halted' mean?"
The point being made was that there are occasions when program exceptions are not acceptable. Aeroplanes are one of the top examples of this; the space shuttle is probably the superlative form. As you would expect, much effort goes into making the computer systems reliable. Firstly, the programmers follow strict coding guidelines and attempt to use verified technology. Secondly, and most famously, the more important systems are duplicated. So, for example, many automatic navigation systems consist of three (or more) independent programs, and the results are cross-checked before utilisation.
Now, here comes the question: do you think the results should be decided by 'majority vote' or 'unanimous decision'? "The show must go on" may suggest majority vote. But is that really safe? Think about it. You know that at least one critical system has gone down; if you felt you needed to fly on three computers, do you really fancy being down to two? And is the one 'outvoted' computer necessarily the one most likely to be wrong? I could argue that for the three to disagree at all should be exceptional; is it more likely that two of them happen to have handled the exceptional case correctly, or one? So maybe you go down the 'unanimous' route. Then I will argue that you have simply trebled your chances of being stopped by a spoof bug.
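To make the two voting policies concrete, here is a minimal sketch in Python. It is purely illustrative: real avionics voting logic is far more involved, and the function names are my own invention, not taken from any actual flight system.

```python
def majority_vote(results):
    """Return the value at least two of three voters agree on.

    The dissenting unit is silently outvoted -- the fault is tolerated.
    Raises only in the (supposedly exceptional) case of total disagreement.
    """
    a, b, c = results
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no two voters agree")


def unanimous(results):
    """Return the common value only if all three voters agree.

    Any single dissenter stops the show -- the fault is reported,
    at the price of trebling the chances of a halt.
    """
    a, b, c = results
    if a == b == c:
        return a
    raise RuntimeError("voters disagree")
```

With results of `[5, 5, 7]`, `majority_vote` quietly returns 5 and flies on; `unanimous` raises. The article's point is that neither policy is free: one masks a known failure, the other multiplies your exposure to spurious halts.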
The fundamental problem with fault tolerant systems is that they tolerate faults.
I knew a gentleman who prided himself on being tough: never had a day's sickness in his life, up with the dawn, and didn't even know where his doctor practised. One day he started complaining of stomach ache. When he collapsed his children over-rode his wishes and had him admitted to hospital, where an emergency operation revealed advanced stomach cancer. It transpired that he had first 'had a few twinges' two years previously; had the fault been reported rather than tolerated, it could have saved his life.
On a less life-and-death matter, I knew an insurance company that performed insurance quotations on a machine that was totally tolerant of numeric overflow exceptions. The legal repercussions when the fault was finally found (several years later) led to a forced merger.
I believe the notion of fault-tolerance is useful, but it must be handled carefully; to do so you need to understand that you are simply playing with odds and trying to turn the odds in your favour.
As soon as you accept that you only have a probability of getting an answer right, you find that you have not one chance but two of getting the answer wrong. You can either decide to do something you shouldn't, or decide not to do something you should. These correspond to what statisticians call Type II and Type I errors respectively; to use different jargon, crimes of commission and crimes of omission.
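A toy illustration may help, sketched in Python with the article's mapping in mind. The sensor, threshold and readings are all invented for the example.

```python
LIMIT = 900  # hypothetical red-line temperature, in degrees


def engine_ok(measured_temp):
    """A go/no-go check driven by a (possibly noisy) sensor reading."""
    return measured_temp < LIMIT

# Two ways this check can be wrong:
#
# Type I (omission): the true temperature is a healthy 850 but the
# sensor over-reads 950 -- engine_ok(950) is False, and we shut down
# an engine we should have kept; we refuse to do something we should.
#
# Type II (commission): the true temperature is a dangerous 950 but
# the sensor under-reads 850 -- engine_ok(850) is True, and we keep
# running an engine we shouldn't; we do something we shouldn't.
```

The check itself is identical in both cases; which error you commit depends entirely on which way the world and your measurement of it disagree.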
Having realised there are two types of error, you have to decide which ones you would rather make. The British legal system is heavily biased against crimes of commission. Therefore it is a criminal offence to defend your wife and family with 'excessive' force, and yet you can watch a woman being dragged by the hair through a public park, kicking and screaming, without committing so much as a civil misdemeanour. Similarly, the judicial procedure is biased on the basis that it is better to free nine criminals than convict one innocent man.
Interestingly, the British military legal system tends to work the other way around. Running headlong into an impossible situation and getting yourself and your colleagues shot results in a medal; not doing so can get you shot for cowardice.
What of computer systems? As we have suggested above, computer systems, especially critical ones, tend to be biased heavily against Type I errors. It is better for the computer to do something than nothing at all. The computer should keep going if it possibly can. But should it?
Let's go back to our four-engined aircraft. We know it can fly on three engines, but should it? My answer is 'it depends'. I would argue that you should never take off on three engines, but you should be prepared to land on whatever you have. Perhaps this is obvious, but the same logic applies to just about every program you ever write; how many of us actually analyse the issues whilst we are programming?
If you start to perform this analysis you will find that different parts of the application have different error requirements at different times. Let us take a few examples from the IDE.
First the compiler. I'm sure many of you would like a compiler that can compile everything you throw at it, but do you? Think of it in terms of $$$. Suppose there is some 'construct' the compiler can't quite understand, and it falls over and dies. What does that cost? A few minutes cursing? A few hours maybe? A few diatribes on CIS? Assuming you actually know what you are typing and keep track of it, I would be surprised if an unscheduled halt cost more than a few hundred dollars. But now imagine the compiler kept going and produced duff code in a computation. It gets through beta easily (see last month), ships out to a few thousand paying customers, and then you find the application gets its sums wrong. How much has that cost you in $$$ and lost goodwill?
But how about a file driver? Would you rather have a file driver that refused to load any data if there was a single corruption, or would you rather have a driver that made the best of what it could? The answer probably depends on the data. If it is a mailing list of 'potential car buyers' you probably wouldn't mind sending 1% of your mailers to the wrong place. If it is a list of people to whom you owe money …
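The two policies can be sketched in a few lines of Python. The comma-separated record format and function names are invented for the illustration; a real file driver faces far messier corruption than this.

```python
def parse_record(line):
    """Parse a 'name,amount' record; raises ValueError if corrupt."""
    name, amount = line.split(",")
    return name.strip(), int(amount)


def load_strict(lines):
    """Refuse to load anything if even one record is corrupt."""
    # A single bad record raises ValueError and aborts the whole load.
    return [parse_record(line) for line in lines]


def load_lenient(lines):
    """Make the best of what is there; silently drop corrupt records."""
    good = []
    for line in lines:
        try:
            good.append(parse_record(line))
        except ValueError:
            continue  # the fault is tolerated -- and hidden
    return good
```

For the car-buyers mailing list, `load_lenient` is probably the right choice; for the creditors list, the `load_strict` failure is exactly the loud complaint you want.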
If you go on any good time management course they will teach you the importance of distinguishing between the urgent and the important. If you make the same distinction in your application design, you are usually a long way towards deciding what kind of errors you wish to tolerate.
Airline ticket reservations. Suppose some remote computer is down, or some data source is off-line: do you want the computer to keep going, or to stop taking reservations? The issue is urgent (flights have a fixed deadline) but not really important (you can always bribe passengers to take a later flight), so you keep the computer going. But now suppose that, using the same database, you have a data-mining agent that is culling management information about airline usage patterns. It finds some data missing (or cannot fit the data to the pattern it is looking for); should it keep going or admit defeat? The task is important but not urgent (of course management would never admit this), therefore the program should stop with an error.
Traditionally computers have been used for the urgent (one of their main skills being speed), and therefore tolerance of Type II errors is endemic. We have even built the tolerance of rubbish into an art-form and called it defensive programming. We have accepted that the libraries we call don't do what we want, and code around it. Our library writers have accepted that we don't know how to call them, and they produce a result anyway. We write code using a subset of the available syntax to work around imagined compiler bugs, and write code that is impenetrable to the human eye to trick the compiler into producing code that is good (look at some C code in a spare moment). We then comment our code copiously to explain to any junior that happens to be tripping by what we hope we were doing. The question is, does our hard work do us any good?
Yes, but not for long. We all associate Y2K with a minor date problem regarding whether or not our computers know which millennium they are working in. I suspect that history will show Y2K to be a minor problem compared to the nightmare that will surface as people discover that defensive programming has produced a world-wide code-base that is totally untrustworthy. We are programming using a technique that is designed to allow bugs to survive undetected! What is even worse, coding defensively can produce code that relies on the bugs being there for the work-around to work. You can easily get locked into a bug-ridden system that you cannot fix without making the code appear to be worse.
I once inherited part of a run-time library for a product that was extremely flakey. In my normal fashion I went through the code fixing bugs as I spotted them; you can imagine my surprise when I had the fixes rejected because other parts of the system relied on the bugs! My amazement when I started getting requests for new bugs (such as a heap checker that didn't check the heap) was absolute.
Go back to the list of four methods of being fault-tolerant. Imagine that those four mechanisms are all being applied, habitually, throughout a system. Imagine the overhead. Envisage how the fragility of the system increases as tolerance upon tolerance is heaped up as the system is re-used.
The really scary thing is that the whole system works only as well as its worst component! Think of it as a high-rise building: no matter how well you start, you can always screw up on the next floor. And once you screw up, it doesn't matter how well you work from then on, the building is unstable.
The issue is fundamental and vital and goes all the way back to the verse at the head of this article. If you are building components of a system that are to be a foundation to a suite of applications then they must not tolerate bugs and bugs within them must not be tolerated.
The reason this is becoming increasingly important is object orientation. Traditionally code has been 'bug enabled', but code has also traditionally been 'used once', so every time you throw away the code you throw away the bugs. Now OOP encourages code re-use, so it is incumbent upon us to make sure code is worth re-using.
So what does this all mean practically? I believe it means we all have to start coding with a different attitude: the question is not "does it work?" but "is it right?" This is a very painful mental transition to make. On more than one occasion I have had a programmer proudly show me some brilliant new code, only to look crestfallen when I start prying into the mechanisms underneath. "Who cares! It works! Look!"
Anyone attempting to produce a re-usable component has to understand that their object will eventually be used in situations and scenarios way beyond those used for initial testing. Therefore you cannot rely upon product testing to verify the validity of the component.
The component has to verify itself. Basically, when producing an object, part of the consideration you put in has to be a self-awareness that allows the object to know whether or not it is functioning correctly; if it isn't, an error should be raised. To go further, the aim of any object should be to function perfectly or not at all. Further yet, an object should pro-actively (during the debug phase at least) hunt around looking for the tiniest infraction of correct behaviour and loud-mouth it to anyone that will listen. I have dubbed this approach offensive programming. In a perfectly offensive environment the program would refuse to load unless it were completely bug-free. You can prove mathematically that we will never get that far; the point I am trying to make in this article is that it may, at least, be a worthy goal.
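What might such a self-aware object look like? Here is a small sketch in Python (the class and its invariants are my own invention, not from any shipping library): a bounded stack that re-checks its own state on every operation and shouts the moment anything is off, rather than quietly coping.

```python
class OffensiveStack:
    """A bounded stack that insists on being used correctly."""

    def __init__(self, capacity):
        assert capacity > 0, "capacity must be positive"
        self.capacity = capacity
        self.items = []
        self._check()

    def _check(self):
        # Self-awareness: verify the object's own invariants hold.
        assert 0 <= len(self.items) <= self.capacity, "size out of bounds"

    def push(self, item):
        # Insist, don't cope: a push on a full stack is the caller's bug.
        assert len(self.items) < self.capacity, "push on a full stack"
        self.items.append(item)
        self._check()

    def pop(self):
        # A defensive stack might return None here; an offensive one shouts.
        assert self.items, "pop on an empty stack"
        item = self.items.pop()
        self._check()
        return item
```

Note that Python's `assert` statements are stripped when the interpreter runs with `-O`, so, in the spirit of the article, the loud-mouthing costs nothing once debug is off.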
It really can be a difficult environment in which to function. You are forced to stay on your toes and think, think, think. BUT if you do manage to get a program out the other end, the chances are it will work better, and faster, and sooner than anything you have ever done before.
Coding offensively will be the subject of future articles; next month I will tackle the issue of maintainability. For this month I would like to finish with an overview of one of my favourite C4 features, one designed to help this process: ASSERT.
I'm sure you have been told you should never ASSUME, because if you ASSUME it makes an ASS out of U and ME. Defensive programming goes along with this, and a properly defended routine will cope with absolutely everything, or at least it will try to.
An offensive routine is different. It should have a tightly defined specification of what it does, and it should assume that the input parameters are correct and valid (this eliminates flab and rarely entered code-lines from the library routine). However, a properly offensive routine should not just assume it is called properly but insist upon it. The ASSERT clause allows just this. Basically, the parameter of an ASSERT is always evaluated, and if the result is FALSE and debug is on then an error message appears telling you the line that failed. You are also offered a GPF, which is not as daft as it appears: it gives you time to put the 16-bit debugger in sleeper mode (or install the 32-bit as the system debugger), then when the program GPFs you will be taken to the offending line in source and shown the variables at the time of the failure.
If you scan the ABC libraries you will find asserts used liberally (they are free when debug is off). They are used to check that references aren't null, parameters are in range, boundary conditions are met, and queues contain expected values. In fact, just about anywhere we thought 'this is bound to be true' we assert it. An assertion failure does not (necessarily) mean an ABC bug; it simply means somebody, somewhere is doing something wrong. So next time you hit one, don't think "what is that doing there?", think "would I rather this fault were discovered next month on a user's machine?"
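The same habit translates directly into most languages. As a sketch (in Python rather than Clarion; the routine and its contract are invented for the illustration), here is an offensively written library routine that insists on being called properly in exactly the ABC style:

```python
def average(values):
    """Return the mean of a non-empty list of numbers.

    Offensively written: rather than coping with rubbish input,
    it insists on its preconditions and shouts if they fail.
    """
    assert values is not None, "reference is null"
    assert len(values) > 0, "parameter out of range: empty list"
    assert all(isinstance(v, (int, float)) for v in values), \
        "list contains unexpected values"
    return sum(values) / len(values)
```

As with C4's ASSERT, the checks are there for the debug phase: run Python with `-O` and every `assert` disappears, leaving only the tightly specified routine with no flab.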