Incident Reports & Application Monitoring
Software systems running in production can fail. When a failure happens and someone knows about it, an incident report may be created.
It may seem strange to say that a failure is reported only if someone knows about it. As it turns out, not every scenario, expectation, and evolution of a software system and its services can be predicted, so failure may creep in at the edges of a continuously evolving system.
If the software system is running on-premise at a client, on personal machines, or anywhere except our managed systems, then failure will be reported only at the client’s or user’s discretion. That assumes the software respects opt-in only error reporting.
If the software system is running as a managed service that is operated and offered to the clients, then all the metrics, logs, and errors are available for monitoring and health analysis.
Identifying failure in software systems is difficult. Defining failure is not a prominent part of most software requirements, and failure is even more seldom defined in quantifiable terms. This goes back to the difficulty of defining and capturing every possible case in a complex software system. In reality, the customer or client is quite often the best source of failure reporting.
Incident reports are a great way to capture these failures. A common form of report is a ticket filed by the customer with support, along with all the subsequent communication on it. A more terrifying form arises when a widespread failure of software services has happened and the software team has to file a more comprehensive and formal incident report.
Incident reports may provide many new pieces of information, illustrate newly discovered customer use cases, and expose flaws in systems. Another value of incident reports is that they make it easier to identify software failure without the customer having to inform us of it.
Looking at metrics, logs, and monitoring solutions on their own may not be very helpful. Being able to say that a web API’s average response time over the past month was less than 50 milliseconds says little by itself. Metrics without an accompanying business requirement aren’t very impressive to management, to users, or to anyone really. Perhaps everyone would be happy even if the average response time were 10 times as long.
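To make this concrete, here is a minimal Python sketch that sets the same made-up response times against two views: the bare average, and a business requirement. The requirement used here (99% of requests answered within 200 milliseconds) and the sample data are assumptions invented for illustration, not figures from any real system.

```python
from statistics import mean


def fraction_within(latencies_ms, limit_ms):
    """Return the fraction of requests that finished within limit_ms."""
    return sum(1 for value in latencies_ms if value <= limit_ms) / len(latencies_ms)


# Made-up data: 95 fast requests and 5 very slow ones.
latencies_ms = [20] * 95 + [500] * 5

# Hypothetical business requirement: 99% of requests within 200 ms.
LIMIT_MS = 200
REQUIRED_FRACTION = 0.99

average_ms = mean(latencies_ms)
within = fraction_within(latencies_ms, LIMIT_MS)
meets_requirement = within >= REQUIRED_FRACTION

print(f"average: {average_ms} ms, within {LIMIT_MS} ms: {within:.0%}, "
      f"requirement met: {meets_requirement}")
```

With this data the average stays under 50 milliseconds while the assumed requirement is missed, which is the kind of gap a bare metric does not reveal.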
Combining incident reports with monitoring solutions can surface new information by comparing the times of failure against metrics, logs, and anything else your monitoring solution collects. The point here is that complex systems have complex failures, and being able to find a correlation between the incident and any collected information is a great step towards diagnosing the problem’s cause.
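One simple way to start looking for such a correlation is to split a metric into the samples recorded around the reported time of failure and everything else, then compare the two. The sketch below assumes the metric is available as (timestamp, value) pairs and that the incident report supplies a time. The function name, the two-minute window, and the data are all hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def split_around_incident(samples, incident_time, window):
    """Split (timestamp, value) samples into those near the incident and the rest."""
    near, baseline = [], []
    for timestamp, value in samples:
        if abs(timestamp - incident_time) <= window:
            near.append(value)
        else:
            baseline.append(value)
    return near, baseline


# Made-up data: one response-time sample per minute for an hour,
# with a spike in the five minutes around 12:30.
start = datetime(2024, 1, 1, 12, 0)
samples = [
    (start + timedelta(minutes=m), 400.0 if 28 <= m <= 32 else 40.0)
    for m in range(60)
]

# Time of failure as it might appear in an incident report (assumed).
incident_time = start + timedelta(minutes=30)

near, baseline = split_around_incident(samples, incident_time, timedelta(minutes=2))
print(f"near incident: {mean(near):.0f} ms, baseline: {mean(baseline):.0f} ms")
```

A large gap between the two averages does not prove a cause, but it tells you which metric deserves a closer look first.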
Diagnosis of the problem may lead to a correction of the problem or to a new SLA (service level agreement) that must be honored. Correction of the problem should be a definitive fix of the issue as it is currently known. An SLA becomes the new definition of failure, in that failure to meet the SLA is failure in and of itself. A breach of an internal SLA should be flagged as failure and corrected before the customer ever knows of it. An external SLA merely matches the point where the problem becomes something the customer would consider failure.
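The relationship between an internal and an external SLA can be sketched as two thresholds on the same measurement, with the internal one set stricter so that it trips first. The class name, the thresholds, and the labels below are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseTimeSla:
    internal_ms: float  # stricter limit, watched by the team
    external_ms: float  # limit at which the customer would consider it failure


def classify(observed_ms, sla):
    """Classify an observed response time against both SLA thresholds."""
    if observed_ms > sla.external_ms:
        return "external breach"
    if observed_ms > sla.internal_ms:
        return "internal breach"
    return "ok"


# Hypothetical thresholds: act at 200 ms, customer-visible failure at 500 ms.
sla = ResponseTimeSla(internal_ms=200.0, external_ms=500.0)

for observed_ms in (120.0, 350.0, 800.0):
    print(observed_ms, classify(observed_ms, sla))
```

The window between the two thresholds is the time the team has to correct the problem before the customer notices.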
Failure is the best teacher.