Thursday, May 30, 2024

The Power of Proactive Monitoring. Part 6

And speaking of benefits... Sometimes, when you have advanced monitoring features, you can use the information they provide in a less-than-expected way. Here is one of my favorite examples.

As a lot of people know, some vendors ship their PL/SQL packages wrapped. Also, as a lot of people know, it is often hard to get vendors to accept that something is wrong with their code (rather than with the way that code is being used).

Imagine you have a slow-running module that takes a table/column and converts it into a formatted CLOB - and before talking to the vendor you want to collect as much data as you can:

SQL> DECLARE
2    v_CL CLOB;
3 BEGIN
4    dbms_hprof.start_profiling ('IO','HProf_Check.txt');
5    v_cl :=wrapped_pkg.f_getdata_cl('object_name','test_tab');
6    dbms_hprof.stop_profiling;
7 END;
8 /


I know that the table has 50k rows - and I see that there are 50k calls to DBMS_LOB.WRITEAPPEND! Yes, I can see what happens inside the wrapped code - big surprise! Now I have real evidence that the wrapped code is not efficient: whoever wrote it used a row-by-row approach.

As I wrote a long time ago, LOB operations are expensive (even LOB variables above 32K in size are not real variables, but segments in the TEMP tablespace) and cause a lot of resources to be consumed. I.e. the original code looks something like this:

FUNCTION f_getData_cl(i_column_tx VARCHAR2, i_table_tx VARCHAR2) RETURN CLOB IS
    v_cl CLOB;
    v_tx VARCHAR2(32767);
    v_cur SYS_REFCURSOR;
BEGIN
    dbms_lob.createTemporary(v_cl,false,dbms_lob.call);
    OPEN v_cur FOR 'SELECT '||
                   dbms_assert.simple_sql_name(i_column_tx)||' field_tx'||
                   ' FROM '||dbms_assert.simple_sql_name(i_table_tx);
    LOOP
        FETCH v_cur into v_tx;
        EXIT WHEN v_cur%notfound;
        -- one LOB write per fetched row: 50k rows => 50k WRITEAPPEND calls
        dbms_lob.writeAppend(v_cl,length(v_tx)+1,v_tx||'|');
    END LOOP;    
    CLOSE v_cur;
    RETURN v_cl;
END;

But if you use a PL/SQL variable as a buffer, you can drastically decrease the number of LOB operations. True, it complicates your code - but the gains are huge!

FUNCTION f_getData_cl(i_column_tx VARCHAR2, i_table_tx VARCHAR2) RETURN CLOB IS  
    v_cur SYS_REFCURSOR;
    v_cl CLOB;
    
    v_tx VARCHAR2(32767);
    v_buffer_tx VARCHAR2(32767);
    
    PROCEDURE p_flush IS
    BEGIN
        -- single LOB write for the whole accumulated buffer
        IF v_buffer_tx IS NOT NULL THEN
            dbms_lob.writeappend(v_cl,length(v_buffer_tx), v_buffer_tx);
            v_buffer_tx:=null;
        END IF;
    END;
    PROCEDURE p_add (i_tx VARCHAR2) IS
    BEGIN
        -- flush only when the next chunk would not fit into the 32K buffer
        IF length(i_tx)+length(v_buffer_tx)>32767 THEN
            p_flush;
            v_buffer_tx:=i_tx;
        ELSE
            v_buffer_tx:=v_buffer_tx||i_tx;
        END IF;
    END;    
BEGIN
    dbms_lob.createTemporary(v_cl,true,dbms_lob.call);
    OPEN v_cur FOR 
        'SELECT '||
            dbms_assert.simple_sql_name(i_column_tx)||' field_tx'||
        ' FROM '||dbms_assert.simple_sql_name(i_table_tx);        
    LOOP
        FETCH v_cur into v_tx;
        EXIT WHEN v_cur%notfound;
        p_add(v_tx||'|');   -- same '|' delimiter as in the original version
    END LOOP;
    p_flush;
    
    CLOSE v_cur;
    RETURN v_cl;
END;

On a reasonably large data set the second version saves a lot of I/O operations - really a lot! I strongly recommend that you run those cases yourself - only then will you believe the performance gains :) But if you are too, well, busy - I will include the stats in the next post of this series :)
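By the way, if you want to reproduce the comparison, the test setup can be trivial. A minimal sketch under my own assumptions (any 50k-row table with a text column will do - the real vendor objects are obviously different):

-- hypothetical 50k-row test table built from the data dictionary
CREATE TABLE test_tab AS
SELECT o.owner, o.object_name
FROM   all_objects o
CROSS JOIN (SELECT level AS n FROM dual CONNECT BY level <= 10)
WHERE  rownum <= 50000;

Profile each implementation exactly the way shown at the top of this post and compare the DBMS_LOB.WRITEAPPEND counts in the two reports.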

Monday, May 27, 2024

The Power of Proactive Monitoring. Part 5

Something worth mentioning while talking about the Oracle Hierarchical Profiler is that its scope is a database session. So, if your middle-tier logic makes multiple database calls within an established connection, you can benefit from the profiler data. To be fair, it has a number of restrictions, as I show further down, but it can still be useful.

Here is a basic example that shows "both sides of the coin":

String sql =
  "begin dbms_hprof.start_profiling(location=>'IO',filename=>'Case1a.txt'); end;";
CallableStatement cstmt = conn.prepareCall(sql);
cstmt.execute();

PreparedStatement stmt =
  conn.prepareStatement("select listagg(owner_tx,',') within group(order by 1) r1 \n" +
    "from (select distinct scott.f_change_tx(owner) owner_tx\n" +
    "    from scott.test_tab) A ");
stmt.execute();

stmt =
  conn.prepareStatement("select listagg(owner_tx,',') within group(order by 1) \n" +
    "from (select distinct scott.f_change_tx(owner) owner_tx\n" +
    "    from scott.test_tab) B ");
stmt.execute();

sql = "begin dbms_hprof.stop_profiling; end;";
cstmt = conn.prepareCall(sql);
cstmt.execute();

As you can see, between start and stop the code runs two functionally identical queries with a slight modification (so the SQL IDs wouldn't match). Each of those SQL statements causes 50k function calls. And that scenario is the more probable one in real life - reasonably few development environments have established PL/SQL APIs as their communication mechanism between layers (to my huge sorrow, since I am a big fan of the "smart DB" approach).

But would they be shown correctly? As I already spoiled above - not exactly.
Since there is no "root" PL/SQL entry point, HProf gets confused and merges both SQL calls into a single one - you see 100k function calls. I.e., you can get at least some stats on what is going on below the top-level calls, but there is no SQL ID and no clear info about the root element.
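For comparison, here is a minimal sketch of what such a PL/SQL "root" might look like (the function name F_OWNER_LIST is made up purely for illustration; the point is only that the middle tier calls one PL/SQL entry point instead of issuing raw SQL):

CREATE OR REPLACE FUNCTION f_owner_list RETURN VARCHAR2 IS
    v_tx VARCHAR2(32767);
BEGIN
    -- same query as above, but now HProf sees a named root with its own subtree
    SELECT listagg(owner_tx,',') WITHIN GROUP (ORDER BY 1)
      INTO v_tx
      FROM (SELECT DISTINCT scott.f_change_tx(owner) owner_tx
              FROM scott.test_tab);
    RETURN v_tx;
END;
/

The JDBC side would then execute something like "begin :result := f_owner_list; end;" via a CallableStatement, and the profiler report gets a proper root entry (and a distinct SQL ID for the inner statement).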

Interestingly, the same behavior can be replicated via SQL*Plus:
SQL> exec dbms_hprof.start_profiling 
  2          (location=>'IO',filename=>'Case1b_SQLPlus.txt');
SQL> select listagg(owner_tx,',') within group (order by 1) r1
  2  from (select distinct  scott.f_change_tx(owner) owner_tx
  3      from scott.test_tab a);
...
SQL> select listagg(owner_tx,',') within group (order by 1)
  2  from (select distinct  scott.f_change_tx(owner) owner_tx
  3      from scott.test_tab b);
...
SQL> exec dbms_hprof.stop_profiling;

As you can see, there is no SQL ID, and all function calls are merged.

Summary of this part: unless you have PL/SQL APIs you may not enjoy the full power of HProf, but you can still benefit from some of the information it provides.


Friday, May 24, 2024

The Power of Proactive Monitoring. Part 4

In the previous post I introduced one of my favorite performance debugging tools - the Hierarchical Profiler. Oracle engineers were proactive enough to instrument the SQL and PL/SQL engines with a clear mechanism for understanding who calls whom inside the database, and it is just great that we (as developers) have access to the same level of internal information. That information tells us the complete and true story of what is happening - not guesses, not estimates, but real fingerprints of every involved element.

Now let's get a bit more real and look at a problem that I bet a lot of people encounter when they start using user-defined PL/SQL functions within SQL - it is a very powerful technique, but it can also cause a lot of damage if used inappropriately or without understanding the internals.

Imagine the following scenario:

  1. Our help desk started to get a lot of performance complaints related to the same module 
  2. The corresponding developer checked the 10046 trace and couldn't find anything suspicious
  3. The ticket was escalated to a senior performance expert (well, me :) )
And the very first thing I notice about the core query is that it contains a user-defined PL/SQL function. I immediately ask the developer: do you KNOW how many times that function is executed during normal activity? The answer I got was: "Well, it shouldn't be too many..."

Let's pause for a sec. We are so accustomed to the Cost-Based Optimizer performing various miracles - but the database engine works only as well as the data we provide it. And it is important to check whether its logic matches ours. Here was a clear case of a developer "assuming" instead of "checking". So, the right solution is to get some hard statistics: I wrapped the suspect SQL call into HProf start/stop calls (below is a simplified case):

SQL> exec dbms_hprof.start_profiling ('IO', 'HProf_Case1.txt');
SQL> declare
  2      v_tx varchar2(32767);
  3  begin
  4      select listagg(owner_tx,',') within group (order by 1)
  5      into v_tx
  6      from (
  7          select distinct scott.f_change_tx(owner) owner_tx
  8          from scott.test_tab
  9          );
 10  end;
 11  /
SQL> exec dbms_hprof.stop_profiling;

Just to be clear: the table TEST_TAB has 50,000 rows with only 26 distinct owners, and the function F_CHANGE_TX does basic formatting, nothing special... The HProf report surprised everybody - 50K calls to the function!


This means that time was wasted on a very cheap function that was simply fired lots and lots of times, because the original developer just guessed at the query behavior. He knew that the function does basic formatting of a repeating value, so the output will always be the same - but he forgot to tell that to Oracle. And without that information the database fired the aforementioned function for every row!
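(Side note: one way to literally "tell that to Oracle" is the DETERMINISTIC clause on the function itself - it allows the SQL engine to reuse results for repeated argument values, which can drastically reduce, though not always eliminate, the number of executions. This is only a sketch: the body below is a made-up stand-in, and it is not the fix that was applied here.)

CREATE OR REPLACE FUNCTION f_change_tx (i_owner_tx VARCHAR2)
RETURN VARCHAR2
DETERMINISTIC  -- promise to Oracle: same input always produces the same output
IS
BEGIN
    RETURN initcap(i_owner_tx);  -- "basic formatting" placeholder
END;
/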

The solution was pretty simple - the query was rewritten to help the CBO:

SQL> exec dbms_hprof.start_profiling ('IO', 'HProf_Case1_fix.txt');
SQL> declare
  2      v_tx varchar2(32767);
  3  begin
  4      select listagg(owner_tx,',') within group (order by 1)
  5      into v_tx
  6      from (
  7          select  scott.f_change_tx(owner) owner_tx
  8          from (select distinct owner 
  9                 from scott.test_tab)
 10           );
 11  end;
 12  /
SQL> exec dbms_hprof.stop_profiling

And here is the proof: 26 function calls; overall, 28 times faster!


IMHO, that was a simple and straightforward enough illustration of how easy it is to find a problem if you have the right tool! And it is also a great illustration of the power of knowing something (rather than guessing).
 




Monday, May 20, 2024

The Power of Proactive Monitoring. Part 3

After talking for a while about some abstract ideas, let's get a bit more technical and a bit closer to Oracle. Before we really talk about the "proactive" part, we should be a bit more realistic. In about 90% of cases the need for monitoring is recognized only after something has already gone wrong. As a result, the "who's-done-it" problem is complicated by not having much forensic data to work with.

Luckily, in the Oracle technology space some things are recorded by Oracle background processes no matter what, and ASH/AWR can (sometimes) help at least narrow down the area to dig into. But for now let's not get into the DBA weeds and stay in the developer's realm.

Assume you can replicate the problem (that's often a big "if"), and assume that you have access to the top level of the spaghetti code you inherited from 30+ years of development. In my experience, the entry point is usually detectable - but nobody living knows what exactly happens inside the "black box".

Luckily #2: Oracle engineers left us instrumentation to access their internal mechanisms - in my case, the calls to SQL and PL/SQL as one program unit calls another. That access is implemented via one of the most powerful (and, unfortunately, underutilized) features - the Oracle Hierarchical Profiler (HProf). That feature has been available since Oracle 11g R1 and supersedes the older Oracle Profiler (notice the absence of the word "hierarchical" in that name!). In the last couple of years I have given a number of presentations about that product, but I feel that sometimes articles are a bit more helpful (yes, I still can't do podcasts - sorry, I just can't adjust to the video style). This and the following post(s?) will cover some basic ideas about why YOU should care about HProf.

Let's first define a use case: the database is doing "something" and that "something" takes too much time. And a key caveat: somebody has indeed proven that it is the database that is slow - not data transmission, not the middle tier, not the client. Remember the classic image that my esteemed guru Dr. Paul Dorsey came up with years ago?


Users are always complaining (that's their nature) - and it is your responsibility as a system architect to understand that there are so many places where you can lose critical resources, including time (and money!). But only when you think in terms of a logical action (rather than a technical call) can you really tune your system:
  • it doesn't matter how fast your database is if there are tens of thousands of calls from the middle tier to accomplish a single activity - the time will be spent opening/closing sessions. And only if you can hierarchically step up to the point where the activity is initiated can you see the "big picture"
  • it doesn't matter how slow your database is if the time is actually spent on the client side because of various JavaScript performance issues

But let's assume you have found out that you need to dive into the internals of the database. The problem is that database calls are rarely something simple - usually you have a combination of PL/SQL units calling SQL, calling PL/SQL units, calling other PL/SQL units, calling SQL, etc. I.e., you have a hierarchy of database calls. And that's where HProf comes to the rescue.

I will start with the simplest module and the simplest logging implementation (writing logs to the file system rather than storing them in the database):

SQL> CREATE DIRECTORY IO AS 'C:\IO';
SQL> exec dbms_hprof.start_profiling 
                   (location=>'IO',filename=>'HProf.txt');

SQL> DECLARE
2      PROCEDURE p_doSomething (pi_empno NUMBER) IS
3      BEGIN
4          dbms_lock.sleep(0.1);
5      END;
6      PROCEDURE p_main IS
7      BEGIN
8          dbms_lock.sleep(0.5);
9          FOR c IN (SELECT * FROM emp) LOOP
10             p_doSomething(c.empno);
11         END LOOP;
12     END;
13 BEGIN
14     p_main();
15 END;
16 /

SQL> exec dbms_hprof.stop_profiling;

As you can see, nothing fancy: (1) the anonymous block calls a procedure, (2) that procedure wastes half a second and walks a loop over the EMP table, (3) for every row in the EMP table another procedure is called that wastes 0.1 sec. The whole run is started/stopped via DBMS_HPROF calls, and the results are written to a file, as I said earlier.

If you want to read the raw file - its format is pretty simple, but it still doesn't provide much insight:

P#V PLSHPROF Internal Version 1.0
P#! PL/SQL Timer Started
P#C PLSQL."".""."__plsql_vm"
P#X 8
P#C PLSQL."".""."__anonymous_block"
P#X 6
P#C PLSQL."".""."__anonymous_block.P_MAIN"#980980e97e42f8ec #6
P#X 63
P#C PLSQL."SYS"."DBMS_LOCK"::9."__pkg_init"
P#X 7
P#R
P#X 119
P#C PLSQL."SYS"."DBMS_LOCK"::11."SLEEP"#e17d780a3c3eae3d #197
P#X 500373
P#R
P#X 586
P#C SQL."".""."__sql_fetch_line9" #9."4ay6mhcbhvbf2"
P#! SELECT * FROM SCOTT.EMP
P#X 3791
P#R
P#X 17
<<… and so on …>>

P#C is a call, P#R is a return from the subprogram, P#X is elapsed time, etc. But you need an aggregator - the PLSHPROF utility:

C:\Utl_File\IO>plshprof -output hprof_intro HProf.txt
PLSHPROF: Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
[8 symbols processed]
[Report written to 'hprof_intro.html']

Now the raw data finally makes sense (the resulting report has a lot of sorting options; I picked the most explicit one):


All of the steps are exactly as I described above - including the SQL ID of the executed query. So, if needed, you have a way to link that report to V$SQLAREA and/or a DBMS_XPLAN analysis.
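For example, taking the SQL ID that appeared in the raw trace above (4ay6mhcbhvbf2), the natural next digging steps could look roughly like this:

-- basic execution stats for the statement captured by HProf
SELECT sql_id, executions, elapsed_time, buffer_gets
  FROM v$sqlarea
 WHERE sql_id = '4ay6mhcbhvbf2';

-- and the actual execution plan from the cursor cache
SELECT * FROM TABLE(dbms_xplan.display_cursor('4ay6mhcbhvbf2'));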

To be continued...







Wednesday, May 8, 2024

The Power of Proactive Monitoring. Part 2

Observing contemporary home devices is entertaining - I didn't even expect that much, since I hadn't bought any major appliances until recently. And that washing machine continues to feed my thoughts with ideas/lessons applicable to the IT industry. So, this post will be about those snippets of wisdom (more technical content will eventually follow, but for now I am a bit more philosophical :) ):

Snippet #1

If you can monitor something pretty easily without significant impact on the overall system performance - just do it! Because when you debug a complex problem you never know what will lead you to the right solution. Example: washing machines obviously have a huge rotating drum inside that auto-compensates for uneven weight distribution and/or uneven installation. Ooook... If you have a compensating mechanism that adjusts the drum in real time, you can collect all of that information - and suddenly you have very interesting pieces of data that you can aggregate and turn into something new: (a) by running a number of drum tests in different directions you can understand what is inside and use AI to auto-configure washing cycles; (b) by running a number of wash cycles you can detect a potentially uneven floor and suggest to the owner that the washing machine needs a level adjustment. Definitely a useful feature!

Let's port that story into the database realm. Assume you have a process that runs some kind of batch job in a multithreaded fashion. That process dynamically adjusts the number of threads depending on the workload and the current speed of processing. The prudent solution here is to record every change that happens to the system (thread added, number of threads shrunk, etc.), because in aggregated form it will paint a pretty good picture of the system's well-being. For example, you can detect previously unknown workload patterns! From my own experience with a similar logging mechanism: by comparing behavioral changes we were able to detect changes to the system antivirus configuration that were causing seemingly random slowdowns. Only with enough logs were we able to correlate system slowdowns with OS-level activity.
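To illustrate (all of the names below are made up), such scaling events can be captured with nothing more than a small log table and an autonomous-transaction logger that the batch controller calls on every adjustment:

-- hypothetical event log for thread-count changes
CREATE TABLE batch_scaling_log (
    log_dt       TIMESTAMP     DEFAULT systimestamp,
    process_tx   VARCHAR2(100),
    old_threads  NUMBER,
    new_threads  NUMBER,
    reason_tx    VARCHAR2(4000)
);

CREATE OR REPLACE PROCEDURE p_log_scaling (
    i_process_tx  VARCHAR2,
    i_old_threads NUMBER,
    i_new_threads NUMBER,
    i_reason_tx   VARCHAR2
) IS
    PRAGMA AUTONOMOUS_TRANSACTION;  -- the log entry survives rollbacks of the main job
BEGIN
    INSERT INTO batch_scaling_log (process_tx, old_threads, new_threads, reason_tx)
    VALUES (i_process_tx, i_old_threads, i_new_threads, i_reason_tx);
    COMMIT;
END;
/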

Snippet #2

The more I think about IT as a part of humanity, the more I understand that safety best practices MUST be forced from the top (sorry, my libertarian friends, people are too lazy for their own good). Manufacturers of home appliances understood that maxim pretty well (warranty claims are pricey!) and now try to make the maintenance process (a) easy and (b) inevitable. Rather than suggesting what should be done to prolong the life of their device, they added a built-in counter of washing cycles plus a special cleanup cycle directly in the machine itself. The user is simply notified more or less in the following way: "You have run N cycles, now it is maintenance time. To continue, press any button." A bit annoying? Yes. Way more productive in terms of the device's longevity? Big YES!

Translating into IT terms... I understand that I am stepping into a minefield here, but I think we need to embrace a paradigm shift: until very recently the job of DB people was to accumulate and safely store ALL pieces of information, but now the most critical part of DB functionality is to separate useful information from white noise. Our systems are too bloated and too overwhelmed with data elements. I.e., one of the most critical challenges now is to configure a proper data archival process - maintaining only what is critical and removing all of the garbage (old/unused/retired data). And unless you build that vision in from the very beginning, adding that type of self-healing mechanism afterwards is either cumbersome or plain painful for the users or the maintenance team (or both). So, that self-preservation process has to be as explicit and as visible from the very beginning as possible - that way users will be accustomed to it from the system's inception. And we all know how difficult it is to change existing behavioral patterns...
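As a purely illustrative sketch (table name and retention window are invented), even a basic scheduled purge job makes that "taking out the garbage" explicit and visible from day one:

BEGIN
    dbms_scheduler.create_job(
        job_name        => 'PURGE_OLD_EVENT_LOG',
        job_type        => 'PLSQL_BLOCK',
        job_action      => 'begin delete from event_log where created_dt < add_months(sysdate,-12); commit; end;',
        repeat_interval => 'FREQ=DAILY;BYHOUR=2',
        enabled         => TRUE);
END;
/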


Sunday, April 28, 2024

The Power of Proactive Monitoring. Part 1

This weekend I had an incident, seemingly unrelated to IT, that forced me to think a lot about system development. Here is the story. I bought a new washing machine from a large store (let's keep it anonymous for now), obviously with delivery and installation. O-o-ok, on Friday that machine was delivered and installed by the official crew from the store. After that we ran a number of cycles to wash whatever had accumulated (since the moment our old washer was declared dead), didn't even sort out the clean laundry (Friday...), and left on Saturday morning for an outdoor festival.

But I am proud of doing one thing right: that washer had "smart" features, so I installed the corresponding app and went through all the required registration steps. To be fair, it was just for fun - I had not yet had a chance to communicate with the washer.

And suddenly, while walking miles away from home, I get a notification: "Sir, it's your washer. I found a quiet moment and decided to run a self-check... And you know... I have a bad feeling that my hot and cold water hoses are swapped. Maybe I am wrong, but could you check for me? Because otherwise I can damage your clothes!" It was a very polite message; I was impressed!

Lo and behold, my family comes back from the day trip - and I indeed find that the hoses are swapped! That means that during the previous cycles it rinsed everything with hot water... I check our laundry and find a lot of things shrunk and color-damaged! Good news: it was mostly my and my son's items (not my wife's or daughter's). Bad news: there were a lot of them, so I am planning to have a little chat with the store managers tomorrow (I will keep you posted on the results).

But the key point is the importance of self-diagnostics! That regular check saved me a lot of money and trouble. And here is where the IT industry comes back in - let's be frank and answer the following question: "how much does a regular developer like doing code instrumentation?" The answer is: "he/she hates it UNLESS he/she is also doing long-term maintenance!" And in how many offices do people care about the long term? Everybody is on a tight schedule, everybody runs like crazy to meet deadlines...

I've seen lots and lots of environments - and the only ones with well-instrumented code are those that have direct pressure (plus corresponding support and resources!) from top management to do it. The reasons are pretty obvious: code instrumentation is very time- and resource-consuming, especially during the early stages of a project (when lots of things are changing). As a result, even the best developers sometimes forget that somebody (a) needs to look at their code years from now, (b) needs to find that piece of code in case something goes wrong inside of it, and (c) needs to understand what is normal functioning of any part of the system and what is not.

While the first two items have been covered for years - yes, comments; yes, proper error logging and tracing - the last one is way more obscure and requires stepping at least one (maybe more) level up from your cubicle. Because the question is not about your code - the question is about the system as a whole. And let's not forget, errare humanum est - people sometimes do very strange things just because they can...

I am very thankful to the engineers of that washer who decided to embed a hot/cold check into the self-diagnostics, even though they very clearly marked what should be connected where. If you can't prevent some common mistakes - at least you can make sure that you detect them early enough!

Summary: IT engineers still have a lot to learn from their mechanical colleagues!