Wednesday, May 8, 2024

The Power of Proactive Monitoring. Part 2

Observing contemporary home devices is entertaining - I didn't expect that much, since I hadn't bought any major appliances until recently. And that washing machine continues to feed my thoughts with ideas/lessons applicable to the IT industry. So, this post will be about those snippets of wisdom (more technical content will eventually follow, but right now I am in a bit more philosophical mood :) ):

Snippet #1

If you can monitor something pretty easily without a significant impact on the overall system performance - just do it! Because when you debug a complex problem, you never know what can lead you to the right solution. Example: washing machines obviously have a huge rotating drum inside that auto-compensates for uneven weight distribution and/or uneven installation. Ooook... If you have a compensating mechanism that adjusts the drum in real time, you can collect all of that information - and suddenly you have very interesting pieces of data that you can aggregate and turn into something new: (a) by running a number of drum tests in different directions, you can understand what is inside and use AI to auto-configure washing cycles; (b) by running a number of wash cycles, you can detect a potentially uneven floor and suggest to the owner that the washing machine needs a level adjustment. Definitely a useful feature!

Let's port that story into the database realm. Assume you have a process that runs some kind of batch job in a multithreaded fashion. That process dynamically adjusts the number of threads depending on the workload and the current processing speed. A prudent solution here would be to record every change that happens to the system (thread added/thread pool shrunk/etc.), because in aggregated form those records paint a pretty good picture of the system's well-being. For example, you can detect previously unknown workload patterns! From my own experience with a similar logging mechanism: by comparing behavioral changes, we were able to trace seemingly random slowdowns to changes in the system's antivirus configuration. Only with enough logs were we able to correlate those slowdowns with OS-level activity.
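To illustrate the idea, here is a minimal sketch in Python - NOT the actual system from the story; the AdaptivePool class, its naive resize policy, and the queue-depth numbers are all made up. The point is simply that every resize decision is both logged and kept around for later aggregation:

    import logging
    import time
    from dataclasses import dataclass

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    log = logging.getLogger("batch")

    @dataclass
    class ResizeEvent:
        when: float      # epoch seconds
        old_size: int
        new_size: int
        reason: str      # e.g. "backlog=120" or "idle"

    class AdaptivePool:
        """Toy self-adjusting worker pool that records every resize decision."""

        def __init__(self, size: int = 4):
            self.size = size
            self.history: list[ResizeEvent] = []  # cheap to keep, priceless to debug

        def resize(self, new_size: int, reason: str) -> None:
            self.history.append(ResizeEvent(time.time(), self.size, new_size, reason))
            log.info("threads %d -> %d (%s)", self.size, new_size, reason)
            self.size = new_size

        def tick(self, queue_depth: int) -> None:
            # Naive policy: grow when the backlog piles up, shrink when idle.
            if queue_depth > self.size * 10:
                self.resize(self.size + 1, f"backlog={queue_depth}")
            elif queue_depth == 0 and self.size > 1:
                self.resize(self.size - 1, "idle")

    pool = AdaptivePool()
    for depth in (0, 120, 90, 0, 0):  # simulated queue-depth samples
        pool.tick(depth)

Each record costs almost nothing to collect, but months later that history is exactly what lets you line up a "random" slowdown with some external event.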

Snippet #2

The more I think about IT as a part of humanity, the more I understand that safety best practices MUST be forced from the top (sorry, my libertarian friends, people are too lazy for their own good). Manufacturers of home appliances understood that maxim pretty well (warranty claims are pricey!) and now try to make the maintenance process (a) easy and (b) inevitable. Rather than merely suggesting what should be done to prolong the life of their device, they added a built-in counter of washing cycles + a special cleanup cycle directly in the machine itself. The user is simply notified more or less in the following way: "you've run N cycles, now it is maintenance time. To continue, press any button". A bit annoying? Yes. Way more productive in terms of the device's longevity? Big YES!

Translating into IT terms... I understand that I am stepping into a minefield here, but I think that we need to embrace the paradigm shift: until very recently the job of DB people was to accumulate and safely store ALL pieces of information, but now the most critical part of DB functionality is to separate useful information from white noise. Our systems are too bloated and too overwhelmed with data elements. I.e. one of the most critical challenges now is to configure a proper data archival process - maintaining only what is critical and taking out all of the garbage (old/unused/retired data). And unless you create that vision from the very beginning, adding that type of self-healing mechanism afterwards is either cumbersome or plain painful for either users or the maintenance team (or both). So, that self-preservation process has to be as explicit and as visible as possible from the very beginning - this way users will be accustomed to it from the system's inception. And we all know how difficult it is to change existing behavioral patterns...
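For the sake of illustration, here is a minimal sketch of what such an explicit, routine archival step could look like (Python + SQLite; the orders/orders_archive tables and the 90-day retention window are made-up assumptions - real tooling will obviously differ per shop). The point is the same as the washing machine's counter: the cleanup is built in, it runs regularly, and it reports what it did instead of hiding it:

    import sqlite3

    RETENTION_DAYS = 90  # hypothetical policy: anything older leaves the live table

    def archive_old_rows(conn: sqlite3.Connection) -> int:
        """Move rows past the retention window into an archive table - visibly."""
        cutoff = (f"-{RETENTION_DAYS} days",)
        (moved,) = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE created_at < date('now', ?)", cutoff
        ).fetchone()
        # Copy, then delete; sqlite3 wraps these DML statements in one
        # implicit transaction that commit() finalizes.
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders"
            " WHERE created_at < date('now', ?)", cutoff
        )
        conn.execute(
            "DELETE FROM orders WHERE created_at < date('now', ?)", cutoff
        )
        conn.commit()
        # Like the cycle counter on the machine: report, don't hide.
        print(f"archival run: {moved} row(s) moved to orders_archive")
        return moved

Schedule something like this from day one and users treat it as a normal part of the system's life, not as a scary intervention bolted on years later.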

