Wednesday, May 8, 2024

The Power of Proactive Monitoring. Part 2

Observing contemporary home devices is entertaining - I didn't expect that much, since I hadn't bought any major appliances until recently. And that washing machine continues to feed my thoughts with ideas/lessons applicable to the IT industry. So, this post will be about those snippets of wisdom (more technical content will eventually follow, but right now I am in a more philosophical mood :) ):

Snippet #1

If you can monitor something pretty easily without a significant impact on overall system performance - just do it! Because when you debug a complex problem, you never know what can lead you to the right solution. Example: washing machines obviously have a huge rotating drum inside that auto-compensates for uneven weight distribution and/or uneven installation. O-o-ok... If you have a compensating mechanism that adjusts the drum in real time, you can collect all of that information - and suddenly you have very interesting pieces of data that you can aggregate to step into something new: (a) by running a number of drum tests in different directions, you can understand what is inside and use AI to auto-configure washing cycles; (b) by running a number of wash cycles, you can detect a potentially uneven floor and suggest to the owner that the washing machine needs a level adjustment. Definitely a useful feature!

Let's port that story into the database realm. Assume you have a process that runs some kind of batch job in a multithreaded format. That process dynamically adjusts the number of threads depending on the workload and the current processing speed. The prudent solution here is to record every change that happens to the system (thread added/number of threads shrunk/etc.), because in aggregated form it will paint a pretty good picture of the system's well-being. For example, you can detect previously unknown workload patterns! From my own experience with a similar logging mechanism: by comparing behavioral changes, we were able to detect changes to the system antivirus configuration that were causing seemingly random slowdowns. Only with enough logs were we able to correlate system slowdowns with OS-level activity.
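To make the idea concrete, here is a minimal Python sketch of that "record every adjustment" habit. All names here (`AdaptivePool`, `PoolEvent`, the backlog-based policy) are invented for illustration, not any specific product's API - the only point is that every resize lands in an event log you can aggregate later:

```python
import time
from dataclasses import dataclass, field

@dataclass
class PoolEvent:
    """One recorded change to the worker pool (hypothetical schema)."""
    timestamp: float
    old_threads: int
    new_threads: int
    reason: str

@dataclass
class AdaptivePool:
    """Sketch of a batch processor that resizes its worker pool and
    logs every adjustment so the history can be analyzed afterwards."""
    threads: int = 4
    events: list = field(default_factory=list)

    def resize(self, new_threads: int, reason: str) -> None:
        if new_threads == self.threads:
            return  # no change, nothing to record
        self.events.append(PoolEvent(time.time(), self.threads, new_threads, reason))
        self.threads = new_threads

    def adjust_for_backlog(self, backlog: int) -> None:
        # Naive policy: roughly one thread per 100 queued items, capped at 16.
        target = max(1, min(16, backlog // 100 + 1))
        self.resize(target, f"backlog={backlog}")

pool = AdaptivePool()
for backlog in (50, 450, 1200, 300, 0):
    pool.adjust_for_backlog(backlog)

# The aggregated view: every resize is now queryable history.
for e in pool.events:
    print(f"{e.old_threads} -> {e.new_threads} ({e.reason})")
```

In a real system the events would of course go to a durable log or table instead of an in-memory list - that is exactly what lets you correlate them later with OS-level activity.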

Snippet #2

The more I think about IT as a part of humanity, the more I understand that safety best practices MUST be forced from the top (sorry, my libertarian friends, people are too lazy for their own good). Manufacturers of home appliances understood that maxim pretty well (warranty claims are pricey!) and now try to make the maintenance process (a) easy and (b) inevitable. Rather than merely suggesting what should be done to prolong the life of their device, they added a built-in counter of washing cycles plus a special cleanup cycle directly in the machine itself. The user is simply notified more or less in the following way: "you have run N cycles, now it is maintenance time. To continue press any button". A bit annoying? Yes. Way more productive in terms of the device's longevity? Big YES!

Translating into IT terms... I understand that I am stepping into a minefield here, but I think that we need to embrace a paradigm shift: until very recently the job of DB people was to accumulate and safely store ALL pieces of information, but now the most critical part of DB functionality is to separate useful information from white noise. Our systems are too bloated and too overwhelmed with data elements. I.e., one of the most critical challenges now is to configure a proper data archival process - maintaining only what is critical and taking out all of the garbage (old/unused/retired data). And unless you create that vision from the very beginning, adding that type of self-healing mechanism afterwards is either cumbersome or plain painful for either the users or the maintenance team (or both). So, that self-preservation process has to be as explicit and as visible as possible from the very beginning - this way users will be accustomed to it from the system's inception. And we all know how difficult it is to change existing behavioral patterns...
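As a tiny illustration of such an archival pass (the retention window, field names, and dates below are all invented for the example), here is a sketch that splits rows into "keep" and "archive" buckets by last-access date:

```python
from datetime import date, timedelta

# Hypothetical retention policy: anything untouched for a year gets archived.
RETENTION = timedelta(days=365)
TODAY = date(2024, 5, 8)

records = [
    {"id": 1, "last_access": date(2024, 4, 1)},   # recent -> keep
    {"id": 2, "last_access": date(2022, 12, 31)}, # stale  -> archive
    {"id": 3, "last_access": date(2022, 6, 15)},  # stale  -> archive
]

def split_for_archival(rows, today=TODAY, retention=RETENTION):
    """Separate the critical (recently used) rows from archival candidates."""
    keep, archive = [], []
    for row in rows:
        (keep if today - row["last_access"] <= retention else archive).append(row)
    return keep, archive

keep, archive = split_for_archival(records)
print([r["id"] for r in keep])     # recent ids
print([r["id"] for r in archive])  # archival candidates
```

The real challenge, as noted above, is not the mechanics but making the policy explicit from day one, so users treat archival as a normal part of the system's life.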


Sunday, April 28, 2024

The Power of Proactive Monitoring. Part 1

This weekend I had an incident, seemingly unrelated to IT, that forced me to think a lot about system development. Here is the story. I bought a new washing machine from a large store (let's keep it anonymous for now), obviously with delivery and installation. O-o-ok, on Friday that machine was delivered and installed by the official crew from the store. After that we ran a number of cycles to wash whatever had accumulated (since the moment our old washer was declared dead), didn't even sort out the clean laundry (Friday...), and left on Saturday morning for an outdoor festival.

But I am proud of doing one thing right: that washer had "smart" features, so I installed the corresponding app and went through all the required registration steps. To be fair, just for fun - I had never had a chance to communicate with the washer.

And suddenly, while walking miles away from home, I get a notification: "Sir, it's your washer. I found a quiet moment and decided to run a self-check... And you know... I have a bad feeling that my hot and cold water hoses are swapped. Maybe I am wrong, but could you check for me? Because otherwise I can damage your clothes!" It was a very polite message, I was impressed!

Lo and behold, my family comes back from the day trip - and I indeed discover that the hoses are swapped! That means that during the previous cycles it rinsed everything with hot water... I check our laundry and find a lot of things shrunk and color-damaged! Good news: it was mostly my items and my son's (not my wife's or daughter's). Bad news: there were a lot of them, so tomorrow I am planning to have a little chat with the store managers (will keep you posted on the results).

But the key point is the importance of self-diagnostics! That regular check saved me a lot of money and trouble. And here is where the IT industry comes back in - let's be frank and answer the following question: "how much does a regular developer like doing code instrumentation?" The answer is "he/she hates it UNLESS he/she is also doing long-term maintenance!" And in how many offices do people care about the long term? Everybody is on a tight schedule, everybody runs like crazy to meet deadlines...

I've seen lots and lots of environments - and the only ones that have well-instrumented code are those with direct pressure (plus corresponding support and resources!) to do it from top management. The reasons are pretty obvious: code instrumentation is very time- and resource-consuming, especially during the early stages of the project (when lots of things are changing). As a result, even the best developers sometimes forget that somebody (a) needs to look at their code years from now, (b) needs to find that piece of code in case something goes wrong inside of it, and (c) needs to understand what is normal functioning for any part of the system and what is not.

While the first two items have been covered for years - yes, comments; yes, proper error logging and tracing - the last one is way more obscure and requires stepping at least one level (maybe more) up from your cubicle. Because the question is not about your code - the question is about the system as a whole. And let's not forget, errare humanum est - people sometimes do very strange things just because they can...
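Item (c) - knowing what "normal" looks like - is exactly where even lightweight instrumentation pays off. Here is a minimal Python sketch, assuming a simple in-process timing store (the `instrumented` decorator and `_timings` dictionary are hypothetical names, not any specific framework): record every call's duration, and the baseline for "normal" builds itself.

```python
import functools
import time
from collections import defaultdict

# Hypothetical baseline store: function name -> list of observed runtimes.
_timings = defaultdict(list)

def instrumented(fn):
    """Record each call's duration so we can later tell normal from abnormal."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            _timings[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@instrumented
def load_batch(n):
    return sum(range(n))  # stand-in for real batch work

for _ in range(3):
    load_batch(1000)

samples = _timings["load_batch"]
print(len(samples), "samples recorded")
```

In production these samples would feed a metrics pipeline rather than a dict, but the principle is the washer's: run the self-check routinely, and deviations from the baseline surface on their own.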

I am very thankful to the engineers of that washer who decided to embed the hot/cold check into the self-diagnostics, even though they very clearly marked what should be connected where. If you can't prevent some common mistakes - at least you can make sure that you detect them early enough!

Summary: IT engineers still have a lot to learn from their mechanical colleagues!