Rugged Tooling: Forget AI - Integrate Human Intelligence

I lead a team that builds highly shared, deep-in-the-stack automation at a large SaaS company that has many software stacks in AWS. This automation includes things like installing security scanners, log collection agents and monitoring agents - all for both Windows and Linux.

I inherited a lot of this code. While a team member and I were working with a technician from the software company behind one of these agents, which was giving us trouble, I realized we could significantly improve the ruggedness of our code!

In 45 short minutes we learned a ton about how the agent registration worked, as well as commands to reliably troubleshoot the various failing behaviors we were seeing.

We had made some notes about how to do these steps and I was contemplating the best way to share them with our team. But I also wanted to share them with our development end users so they could be more productive and not have to engage us every time their configuration was failing in some of these self-diagnosable patterns.

While I welcome every opportunity to learn where my team’s code does not work as intended, I loathe the mind-numbing monotony of repeating identical troubleshooting steps just to learn that the root cause is some simple misconfiguration outside of our code.

A simple but recurring example is that the API endpoint and port for registering the agent are not reachable because the value was mistyped or networking is misconfigured.

As I was struggling with the prospect of escalations as a form of training the hundreds of developers we support, it hit me like a ton of bricks.

My entire job is about taking super repetitive tasks done by humans and getting automation to do them - because the computer does not care how many times it does something and does not lose focus like humans do.

And here I am repetitively doing the same steps over and over to come to very similar conclusions each time with different users of the tooling code.

After this realization we did two very simple things. First, we examined each troubleshooting step and we asked - “Can this step be coded right into the automation?”

A second action was equally important - logging the results of these steps - including successful outcomes. Logging must be used to expose results if the embedded intelligence is to deliver maximum value. By definition, tooling will first be debugged by a developer or development team who is trying to use the tooling.
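As a rough illustration of both actions together, here is a minimal Python sketch of the "is the registration endpoint reachable?" step coded into the automation, with passing and failing results both logged. The function name `check_endpoint`, the logger name `agent_install`, and the hint wording are our own assumptions for illustration, not the agent vendor's tooling.

```python
import logging
import socket

log = logging.getLogger("agent_install")  # hypothetical logger name


def check_endpoint(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if host resolves and accepts a TCP connection on port."""
    # Step 1: name resolution, the first thing a human would check by hand.
    try:
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        log.error(
            "FAIL: cannot resolve '%s' (%s). Hint: check the host name for "
            "typos and confirm DNS works from this machine.", host, exc)
        return False
    log.info("PASS: '%s' resolves to %s", host, addresses[0][4][0])

    # Step 2: open (and immediately close) a TCP connection to the endpoint.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        log.error(
            "FAIL: cannot connect to %s:%d (%s). Hint: check the port number, "
            "firewall rules and routing.", host, port, exc)
        return False
    log.info("PASS: TCP connection to %s:%d succeeded", host, port)
    return True
```

Calling something like this just before the agent's own registration command means the log already says whether the basics were sound when registration fails.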

The value of intelligent logging for tooling is multiplied because it means the logging is more likely to be sought out and reviewed, and corrective action taken, without a cross-team escalation. Intelligent logging also generally means better initial root cause determination by development users because: a) they can see what basics are being checked and can rule out those causes without effort, which b) gives strong hints and motivation on what to check next for root cause.

Earlier I mentioned that our intelligent logging includes logging positive test results - this brings strategic benefits including:

it communicates to tooling users that your code is robust
it reveals how far a process got
it helps you understand where in your code a problem may be located
it improves code quality, because you often realize the positive-case code needs to do more than log a message
it helps with debugging during development of the tool itself

Code that lacks success logging for the sake of brevity is nicer to look at - but it is an area where brevity is an overall anti-pattern to robustness. If your code is truly of a tooling nature, your log messages will undergo human review much more often than the code itself.

We also follow some additional principles when deciding what and where to log:

Log to the operating system’s expected locations, as our messages are more likely to be encountered there and automatically collected from there if log aggregation is being used. This would be the event log on Windows and /var/log on Linux.
Log highly verbose details to a dedicated, local file log.
Log summary and critical information to a centrally collected location, and especially note the location of the verbose log in the summary log.
When justified - run a scheduled monitor script that is capable of reporting that the agent is not installed.
Use timestamps that are automatically parseable by any log collection you do.
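A minimal sketch of how several of these principles might look with Python's standard `logging` module: a verbose local file, a summary stream that announces where the verbose log lives, and ISO 8601 UTC timestamps that most collectors can parse. The logger name and the use of stderr as a stand-in for the centrally collected location are assumptions; on a real host the summary handler would more likely be something like `logging.handlers.SysLogHandler` on Linux or `logging.handlers.NTEventLogHandler` on Windows.

```python
import logging
import sys
import time
from pathlib import Path


def build_logger(verbose_path: Path) -> logging.Logger:
    """Send DEBUG and above to a local file, INFO and above to a summary stream."""
    formatter = logging.Formatter(
        fmt="%(asctime)s %(levelname)s %(name)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
    )
    formatter.converter = time.gmtime  # format in UTC so the "Z" is accurate

    logger = logging.getLogger("agent_install")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers if called twice

    verbose = logging.FileHandler(verbose_path, encoding="utf-8")
    verbose.setLevel(logging.DEBUG)
    verbose.setFormatter(formatter)
    logger.addHandler(verbose)

    # Stand-in for the centrally collected location (an assumption here).
    summary = logging.StreamHandler(sys.stderr)
    summary.setLevel(logging.INFO)
    summary.setFormatter(formatter)
    logger.addHandler(summary)

    # Publish the verbose log's location in the summary log.
    logger.info("Verbose log for this run: %s", verbose_path)
    return logger
```

With this split, `logger.debug(...)` detail stays on the machine, while anyone reading the summary knows exactly which file to open next.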

Code reasonable troubleshooting queries even if you can’t imagine a failure condition in your specific implementation (e.g. testing the registration URL even if we control the default value given for it).

Good troubleshooting code also helps catch future mistakes in the automation code or its input data. For instance, maybe the data values received by your code are under your control, but at some future time someone mistypes a configuration value. It’s much better that your own code reveals this mistake during your automation development cycle than that it makes it to prod.
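To make the mistyped-value case concrete, here is a small sketch that sanity-checks a registration URL before anything tries to use it. The function name `validate_registration_url` and the rule that only http and https are acceptable are assumptions for illustration; the real checks depend on what your agent expects.

```python
from typing import List
from urllib.parse import urlsplit


def validate_registration_url(value: str) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    try:
        parts = urlsplit(value.strip())
        parts.port  # accessing .port raises ValueError for a malformed port
    except ValueError as exc:
        return [f"cannot parse '{value}': {exc}"]

    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append(f"scheme is '{parts.scheme}', expected http or https")
    if not parts.hostname:
        problems.append("host name is missing")
    return problems
```

Logging each returned problem, or a single "registration URL looks valid" line when the list is empty, keeps the positive case visible too.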

Summary of Embedding Human Intelligence To Make Your Code Rugged

never rely on the environment being set up correctly - even though your test environment probably is.
never assume data values will be valid - even if your own code is providing them.
whenever possible, test the reachability of any external resource before making calls - failure messages from application APIs that depend on an external resource are frequently less than helpful about basic conditions like not being able to get an IP route to the resource or resolve a host name.
any troubleshooting tests you (or anyone else) take upon failure of a bit of automation should be assessed for integration directly into the code.
the results of these tests should be logged - both positive and negative results.
it is best if this logging goes to a local location that is also centrally collected, so the results are available both locally and centrally.
if you know or can predict common causes of the error being experienced, be sure to disclose those hints in the log messages.
if you only wish to log summary information to central logging or for capture by orchestration - ensure you log verbosely locally and publish the existence and location of any detailed logs that could help with root cause analysis.
always enable verbose logging on anything you call that supports it. When possible, send this logging to a specific location with a timestamp embedded in the file name to avoid overwriting logs on multiple retries.
do not underestimate the value of your production logging statements for development cycles.
if there are additional troubleshooting steps that must be performed by humans, document them as code comments with a brief explanation of what positive and negative results might mean for that test.
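As one small illustration of the point about timestamped file names, here is a sketch of a helper that builds a distinct verbose-log path for each attempt. The helper name `attempt_log_path` and the naming pattern are assumptions; how the path is handed to the called tool depends on whatever verbose-log option that tool actually provides.

```python
from datetime import datetime, timezone
from pathlib import Path


def attempt_log_path(log_dir: Path, tool: str, attempt: int) -> Path:
    """Build a per-attempt log path so retries never overwrite earlier logs."""
    # UTC timestamp in a compact, sortable, filename-safe form.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir / f"{tool}-{stamp}-attempt{attempt}.log"
```

Including the attempt number as well as the timestamp keeps names distinct even when two retries land in the same second.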