Sunday, April 28, 2024

The Power of Proactive Monitoring. Part 1

This weekend I had an incident, seemingly unrelated to IT, that forced me to think a lot about system development. Here is the story. I bought a new washer from a large store (let's keep it anonymous for now), obviously with delivery and installation. O-o-ok, on Friday the machine was delivered and installed by the official crew from the store. After that we ran a number of cycles to wash whatever had accumulated (since the moment our old washer was declared dead), didn't even sort out the clean laundry (Friday...), and left on Saturday morning for an outdoor festival.

But I am proud of doing one thing right: the washer had "smart" features, so I installed the corresponding app and completed all the required registration steps. To be fair, I did it just for fun - I had never had a chance to communicate with a washer before.

And suddenly, while walking miles away from home, I get a notification: "Sir, it's your washer. I found a quiet moment and decided to run a self-check... And you know... I have a bad feeling that my hot and cold water hoses are swapped. Maybe I am wrong, but could you check for me? Because otherwise I can damage your clothes!" It was a very polite message, I was impressed!

Lo and behold, my family comes back from the day trip - and I indeed find that the hoses are swapped! That means that during the previous cycles it rinsed everything with hot water... I check our laundry and find a lot of things shrunk and color-damaged! Good news: it was mostly my and my son's items (not my wife's or daughter's). Bad news: there were a lot of them, so I am planning to have a little chat with the store managers tomorrow (will keep you posted on the results).

But the key point is the importance of self-diagnostics! That regular check saved me a lot of money and trouble. And here is where the IT industry comes back in - let's be frank and answer the following question: "how much does a regular developer like doing code instrumentation?" The answer is "he/she hates it UNLESS he/she is also doing long-term maintenance!" And in how many offices do people care about the long term? Everybody is on a tight schedule, everybody runs like crazy to meet deadlines...

I've seen lots and lots of environments - and the only ones that have well-instrumented code are those where there is direct pressure (+ corresponding support and resources!) from top management to do it. The reasons are pretty obvious: code instrumentation is very time- and resource-consuming, especially during the early stages of a project (when lots of things are changing). As a result, even the best developers sometimes forget that somebody (a) needs to look at their code years from now, (b) needs to find that piece of code in case something goes wrong inside of it, and (c) needs to understand what is normal functioning of any part of the system and what is not.

If the first two items have been covered for years - yes, comments, yes, proper error logging and tracing - the last one is way more obscure and requires stepping at least one level (maybe more) up from your cubicle. Because the question is not about your code - the question is about the system as a whole. And let's not forget, errare humanum est - people sometimes do very strange things just because they can...

I am very thankful to the engineers of that washer, who decided to embed a hot/cold check into the self-diagnostics even though they had very clearly marked what should be connected where. If you can't prevent some common mistakes - at least you can make sure that you detect them early enough!
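To make the idea a bit more concrete for the software side, here is a minimal sketch of what such a self-check could look like in code. To be clear, this is my own illustration and not how the washer actually works: the names (Finding, make_hose_check, run_self_check), the sensor callbacks and the 5-degree threshold are all hypothetical. The point is only that a check describes what "normal" looks like for the system as a whole and raises a flag when reality disagrees.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Finding:
    check: str    # name of the check that raised the flag
    message: str  # human-readable description of what looks wrong


# A check returns None when everything looks normal, or a message otherwise.
Check = Callable[[], Optional[str]]


def make_hose_check(read_hot_c: Callable[[], float],
                    read_cold_c: Callable[[], float],
                    min_gap_c: float = 5.0) -> Check:
    """Hypothetical washer-style check: flag the case where the 'hot'
    inlet is clearly colder than the 'cold' one.

    The sensor readers and the threshold are assumptions for illustration.
    """
    def check() -> Optional[str]:
        hot, cold = read_hot_c(), read_cold_c()
        if hot + min_gap_c <= cold:
            return (f"hot inlet ({hot} C) is colder than cold inlet "
                    f"({cold} C): hoses may be swapped")
        return None
    return check


def run_self_check(checks: Dict[str, Check]) -> List[Finding]:
    """Run every registered check and collect whatever looks abnormal."""
    findings: List[Finding] = []
    for name, check in checks.items():
        try:
            message = check()
        except Exception as exc:
            # A check that cannot run is itself worth reporting.
            message = f"check could not run: {exc!r}"
        if message is not None:
            findings.append(Finding(name, message))
    return findings


if __name__ == "__main__":
    # Simulated readings: the "hot" hose delivers cold water and vice versa.
    checks = {"hoses": make_hose_check(lambda: 15.0, lambda: 50.0)}
    for finding in run_self_check(checks):
        print(f"[{finding.check}] {finding.message}")
```

In a real system the same loop would run during a quiet moment (just like the washer did) and the findings would go to whoever can act on them, not just to a log file nobody reads.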

Summary: IT engineers still have a lot to learn from their mechanical colleagues!