You assume that the only errors a program can encounter are logic errors. Logic errors are really the easiest class of error to fix. Here are examples of other errors that can bring down stable systems:
1) An API returns a list of items. In a new version of the API, the data structure used to generate the response is changed from a list to a set, invalidating any implicit assumptions about response ordering. This happened to a service I worked on: callers implicitly assumed the results were ordered by date, which happened to be true before the switch to a set.
2) Resource leaks exposed by failure conditions. A network outage might cause infrequently tested code paths to leak resources. In Go, I’ve seen this happen with network requests not tied to a context. This is an interesting case because the error itself causes the leak even though the success path works correctly.
3) Missing backoffs. A dependency may start returning errors when it becomes overloaded. That can cause a backup in something like an ingestion job, and without adequate backoff the dependency may never be able to recover.
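For the backoff case, here's a minimal sketch of capped exponential backoff with full jitter. The `call_with_backoff` helper and its parameter defaults are illustrative, not from any particular library:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=2.0):
    """Retry fn with capped exponential backoff plus full jitter.

    Gives an overloaded dependency room to recover instead of
    hammering it with an immediate retry loop.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay))  # full jitter
```

The jitter matters in a fleet: without it, every client that failed at the same moment retries at the same moment, re-creating the spike that overloaded the dependency.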
There are plenty more examples where simply handling every error case is not sufficient for stability.
As a question to you: why is this system more stable than other systems you write? Can’t you apply your error handling philosophy to everything?
1. Implicitly assuming something not specifically guaranteed is a logic error. It’s like assuming hash(x) == x just because it happens to be true for small integers in Python. It’s not part of the spec, and observing it to be true a few times doesn’t change anything.
You could (a) document the implicit assumption that the list is sorted, or (b) sort the list yourself before continuing.
2. A resource not being freed in all cases is also a logic error.
3. In the context of calling basically any service/API repeatedly, not using backoffs and not setting a hard limit on the number of tries is also a mistake. I don’t know if it’s a “logic error” but it’s certainly an error.
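Option (b) above is a one-liner: sort on the field you actually depend on rather than trusting response order. A sketch, assuming each item is a dict with an ISO-formatted `date` field (names are illustrative):

```python
def newest_first(items):
    """Sort API results by date explicitly instead of relying on
    response ordering, which is not part of the API's contract."""
    return sorted(items, key=lambda item: item["date"], reverse=True)
```

This makes the assumption executable: even if the upstream switches from a list to a set, callers still see date order.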
I don’t know why you think these are not error cases. Your program fails, and it could have been written in a way that does not fail, or fails more gracefully - that’s an error case. There is some judgement involved - I might not implement backoff in a context where I know it’s not yet needed, and I might just log a particularly obscure exception instead of trying to recover if I can’t imagine why it would happen. I’m not saying your program should be perfectly bug-free from the moment it’s written - I’ve written plenty of buggy code too, and we’re all human. There are even a few errors that are truly insane, like literally hitting a compiler bug that silently corrupts your program. But calling these things not error cases (and what, random acts of misfortune instead?) is a weirdly defeatist attitude that impedes further progress. They’re all error cases; the only question is how much time and knowledge you have to dedicate to handling them.
I called them errors in my post. But unintentionally using undefined behavior isn’t really a logic error. It’s a misunderstanding, and not something that you can figure out by looking at the API and thinking about the errors the API can throw.
Those errors can also be considered and mitigated though, if one thinks about what could go wrong instead of only thinking about what exceptions can be thrown:
1. One must either encode the assumption into a precondition or transfer the incoming data into a sorted data structure. But the gist is to always validate assumptions.
2. RAII is pretty good at handling resources. One can then inject failures to test the error-handling paths, and combine that with code-coverage measurement.
3. That sounds like an issue in the wider system which may be handled in the subsystem under development assuming that it can throttle back its requests, drop them, etc. But it may just as well be handled in another part of the larger system. It belongs more to the architecture realm, but it’s absolutely possible to foresee such issues.
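To illustrate point 2, Python’s context managers play the RAII role, and failure injection can then confirm the error path actually releases the resource. A toy sketch with a hypothetical `TrackedResource` (the leak counter exists only so the test harness can observe cleanup):

```python
class TrackedResource:
    open_count = 0  # class-wide counter so tests can detect leaks

    def __enter__(self):
        TrackedResource.open_count += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        TrackedResource.open_count -= 1  # runs on success *and* failure
        return False  # propagate any exception to the caller

def process(fail=False):
    with TrackedResource():
        if fail:  # injected failure to exercise the error path
            raise RuntimeError("simulated network outage")
```

Running `process(fail=True)` in a test and asserting `open_count == 0` afterwards is exactly the kind of check that catches the "success path works, failure path leaks" bug.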
For 1, the bug stems from unintentionally using undefined behavior due to a misunderstanding. You can never get the answer by "thinking about what could go wrong" because your view of the system makes the failure case impossible.