Wednesday, November 30, 2016

How Complex Systems Fail

How Complex Systems Fail is an essay by Richard Cook that describes common principles of how complex systems work and how they fail. In general, these principles apply to all types of technical systems, and in particular, from my point of view, they apply quite well to such complex IT systems as batch systems.

My favorite principle in this essay is #5, which states that "Complex systems run in degraded mode". For batch systems this means that there are always errors in batch processing. All software developers know that complex programs always contain errors and that it is impossible to prove that code is error-free. In batch systems, in addition to such programmatic errors, we may also face so-called environmental issues: when a batch processes millions and millions of requests, there will always be some network issues, connectivity problems, glitches of external components, and so on.
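To make "degraded mode" concrete, here is a minimal Python sketch (my illustration, not from the essay) of a batch loop that keeps running when individual records fail; process_record and TransientError are hypothetical stand-ins for real per-record work and real environmental glitches:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("batch")

class TransientError(Exception):
    """Hypothetical stand-in for network glitches and connectivity problems."""

def process_record(record):
    # Hypothetical per-record work; occasionally hits an environmental issue.
    if record % 1000 == 0:
        raise TransientError(f"glitch on record {record}")
    return record * 2

def run_batch(records):
    """Process everything we can, collecting failures instead of aborting.

    This is 'degraded mode': the batch as a whole keeps running even
    though some fraction of records always fails.
    """
    results, failures = [], []
    for record in records:
        try:
            results.append(process_record(record))
        except TransientError as exc:
            log.warning("skipping record: %s", exc)
            failures.append(record)
    return results, failures

results, failures = run_batch(range(1, 10_001))
print(f"processed={len(results)}, failed={len(failures)}")  # processed=9990, failed=10
```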

So the question arises: if we know that our system has a certain amount of programmatic errors, and we know that there are always environmental issues during batch processing, how can we be sure of the final calculation results? For example, if a system consists of multiple components and aggregates the results of a million calculations into a single final number, then any error in the middle can dramatically affect that number, which is well illustrated by the Garbage In, Garbage Out principle. On the other hand, batch systems are widely used, and in general it seems that we can trust their results despite all these issues. Why is that?
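As a toy illustration of Garbage In, Garbage Out (my example, not the essay's): one corrupted intermediate value among a million otherwise correct ones is enough to ruin the aggregated final number:

```python
values = [1.0] * 1_000_000     # a million correct intermediate results
print(sum(values))             # 1000000.0 — the expected final number

values[500_000] = 1e9          # one garbage value in the middle
print(sum(values))             # 1000999999.0 — the final number is ruined
```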

This question is partially answered in the essay:

  1. Complex systems usually have additional functions to recover from errors, to duplicate critical components, and to cross-check results. For example, if a database connection is dropped in the middle of processing, there may be code that re-establishes it and thus prevents a batch failure. Or it is possible to validate the final numbers and the outputs of each component and spot any errors early (a minimal retry sketch follows this list).
  2. Complex systems are maintained by people who can manually recover the system if the automatic mode fails. In IT, we have production support teams that monitor batch processing and intervene if something goes wrong.
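Here is a minimal sketch of the first point, the retry-and-reconnect pattern for a dropped database connection; connect, run_step, and ConnectionDropped are hypothetical placeholders, not the API of any particular library:

```python
import time

class ConnectionDropped(Exception):
    """Hypothetical error signalling a lost database connection."""

def connect():
    """Hypothetical: open a fresh database connection."""
    ...

def run_step(conn, step):
    """Hypothetical: run one processing step; may raise ConnectionDropped."""
    ...

def run_with_reconnect(steps, max_retries=3):
    """Re-establish the connection on failure instead of failing the whole batch."""
    conn = connect()
    for step in steps:
        for attempt in range(max_retries):
            try:
                run_step(conn, step)
                break                      # step succeeded, go to the next one
            except ConnectionDropped:
                time.sleep(2 ** attempt)   # back off, then reconnect
                conn = connect()
        else:
            raise RuntimeError(f"step {step!r} failed {max_retries} times")
```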
But from my point of view, there are two more points that help to answer the question:
  1. Most errors in batch systems are defects rather than bugs. In other words, they are flaws in the system that do not have a critical impact on its functionality and have only a limited effect. For example, a defect may result in suboptimal batch processing, or it may only affect some helper component.
  2. And the last but not least explanation is that batch systems are "soft" models rather than "hard" ones. The concept of "hard" and "soft" models was introduced by the Russian mathematician Vladimir Arnold.
    • "Hard" model is one where small change in input parameter results in huge non-proportional change in output value.
    • "Soft" model is one where small change in input parameter results in small change in output.
So batch systems, as "soft" models, usually have multiple inputs that may somehow compensate for each other internally, and thus input changes have a smooth impact on the outputs. As a result, even if there are some inaccuracies in the inputs or some errors inside the processing, the impact on the output results will be much smaller than it would be in the case of a "hard" model. In other words, as developers sometimes joke, one should have an even number of errors in one's code, so that they compensate for each other.
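To illustrate the distinction numerically (my sketch, not Arnold's example): a "soft" pipeline that averages many inputs dilutes a single bad value, while a "hard" one, here a chaotic iteration, turns a tiny input change into a completely different output:

```python
def soft_model(inputs):
    # 'Soft': an average over many inputs, so one perturbed input is diluted.
    return sum(inputs) / len(inputs)

def hard_model(x, steps=50):
    # 'Hard': a chaotic iteration, so a tiny input change explodes
    # into a completely different output (non-proportional response).
    for _ in range(steps):
        x = 3.9 * x * (1 - x)
    return x

inputs = [1.0] * 1_000_000
perturbed = [1.0] * 1_000_000
perturbed[0] = 2.0                       # one input is off by 100%

print(soft_model(inputs), soft_model(perturbed))   # 1.0 vs 1.000001 — barely moves
print(hard_model(0.300000), hard_model(0.300001))  # wildly different values
```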