Friday, May 27, 2011

Base64 Encoding in PL/SQL

Just before a holiday weekend it is always nice to tell a good story to cheer everybody up. Here is mine! You see, I have some good news for you :-) And no, it is not about car insurance - it is about Base64 encoding in PL/SQL:

- with some manual tweaking of parameters it is possible to get really good performance out of the UTL_ENCODE package in both 10g and 11g. In my case - up to a 50x performance gain!

And now let's get to the details. In the environment I am working in right now, it just historically happened that we had two different Base64 encoding/decoding implementations (you can google for either of them - there are tons of hits on OTN):
- Java-based (classes Base64 and UTLBase64)
     * encodes into 76 Base64 chars per line with CHR(10) at the end of each line ~ 77 chars/line
- PL/SQL-based (UTL_ENCODE.base64_encode inside).
     * encodes into 64 Base64 chars per line with CHR(10)||CHR(13) at the end of each line ~ 66 chars/line

Recently we started to use encoded data much more, for a number of different reasons, and my developers started to complain about the performance of the Java-based module, especially for large data volumes (just FYI - my development server is a Dell 6850 with four 2-core CPUs, MS Windows 2008, Oracle 10.2.0.5 64-bit).

Eventually I decided to take a step back and, since my area of knowledge is mostly PL/SQL, to look at whether I could make the PL/SQL version run faster. Of course, the first suspect in any string processing was the size of the buffer - it just seemed strange to me that people (in the majority of examples I found online) use a single-line buffer size for both operations. It definitely didn't look optimized. Also, a lot of examples use concatenation to build the output CLOB, which is another major slow-down. Here are my final versions of encode/decode for BLOBs:

---------------------------------------------------------------------
function base64EncodeBlob_plsql (i_bl BLOB)
return clob is
  v_out_cl     CLOB;
  file_len     PLS_INTEGER;
  modulo       PLS_INTEGER;
  pieces       PLS_INTEGER;
  -- 23808 = floor(32767/66)*48: the largest multiple of 48 source bytes
  -- whose encoded output (66 chars per line) still fits into a VARCHAR2
  amt          BINARY_INTEGER      := 23808;
  buf          RAW (32767);
  buf_tx       VARCHAR2 (32767);
  filepos      PLS_INTEGER         := 1;
  counter      PLS_INTEGER         := 1;
BEGIN
  DBMS_LOB.createtemporary (v_out_cl, TRUE, DBMS_LOB.call);
  file_len := DBMS_LOB.getlength (i_bl);
  modulo   := MOD (file_len, amt);
  pieces   := TRUNC (file_len / amt);

  -- process the BLOB in full chunks of AMT bytes
  WHILE (counter <= pieces) LOOP
    DBMS_LOB.READ (i_bl, amt, filepos, buf);
    buf_tx := UTL_RAW.cast_to_varchar2 (UTL_ENCODE.base64_encode (buf));
    DBMS_LOB.writeappend (v_out_cl, LENGTH (buf_tx), buf_tx);
    filepos := counter * amt + 1;
    counter := counter + 1;
  END LOOP;

  -- process the remaining tail (if any)
  IF (modulo <> 0) THEN
    DBMS_LOB.READ (i_bl, modulo, filepos, buf);
    buf_tx := UTL_RAW.cast_to_varchar2 (UTL_ENCODE.base64_encode (buf));
    DBMS_LOB.writeappend (v_out_cl, LENGTH (buf_tx), buf_tx);
  END IF;

  RETURN v_out_cl;
END;

function base64DecodeClobAsBlob_plsql (i_data_cl CLOB)
return blob is
  v_out_bl       BLOB;
  clob_size      NUMBER;
  pos            NUMBER;
  charBuff       VARCHAR2 (32767);
  dBuffer        RAW (32767);
  v_readSize_nr  NUMBER;
  v_line_nr      NUMBER;
BEGIN
  DBMS_LOB.createtemporary (v_out_bl, TRUE, DBMS_LOB.call);
  -- detect the line size from the first end-of-line marker;
  -- default to 65 chars/line if the data is shorter than one full line
  v_line_nr := GREATEST (65,
                         INSTR (i_data_cl, CHR (10)),
                         INSTR (i_data_cl, CHR (13)));
  -- read the largest whole number of lines that still fits into a VARCHAR2
  v_readSize_nr := FLOOR (32767 / v_line_nr) * v_line_nr;
  clob_size := DBMS_LOB.getlength (i_data_cl);
  pos := 1;

  WHILE (pos < clob_size) LOOP
    DBMS_LOB.READ (i_data_cl, v_readSize_nr, pos, charBuff);
    dBuffer := UTL_ENCODE.base64_decode (UTL_RAW.cast_to_raw (charBuff));
    DBMS_LOB.writeappend (v_out_bl, UTL_RAW.LENGTH (dBuffer), dBuffer);
    pos := pos + v_readSize_nr;
  END LOOP;

  RETURN v_out_bl;
END;
----------------------------------------------------------------
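
For completeness, here is the kind of round-trip test I would use to sanity-check the two functions above. It is only a minimal sketch: it assumes both functions are compiled as standalone functions in the current schema, uses a repeated 1000-byte chunk as test data (single-byte character set assumed for the CAST_TO_RAW call), and relies on DBMS_UTILITY.GET_TIME for rough timing only.

----------------------------------------------------------------
declare
  v_src_bl  BLOB;
  v_enc_cl  CLOB;
  v_dec_bl  BLOB;
  v_start   PLS_INTEGER;
begin
  -- build a ~1 MB test BLOB out of a repeated 1000-byte chunk
  dbms_lob.createtemporary (v_src_bl, true, dbms_lob.call);
  for i in 1 .. 1000 loop
    dbms_lob.writeappend (v_src_bl, 1000,
                          utl_raw.cast_to_raw (rpad ('x', 1000, 'x')));
  end loop;

  v_start  := dbms_utility.get_time;
  v_enc_cl := base64EncodeBlob_plsql (v_src_bl);
  dbms_output.put_line ('Encode: '
     || (dbms_utility.get_time - v_start) / 100 || ' sec');

  v_start  := dbms_utility.get_time;
  v_dec_bl := base64DecodeClobAsBlob_plsql (v_enc_cl);
  dbms_output.put_line ('Decode: '
     || (dbms_utility.get_time - v_start) / 100 || ' sec');

  -- 0 means the decoded BLOB matches the original byte-for-byte
  dbms_output.put_line ('Compare: '
     || dbms_lob.compare (v_src_bl, v_dec_bl));
end;
----------------------------------------------------------------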

The performance improvement was just enormous! Let's take the process of encoding a 1 MB BLOB:
- Java version - 5 sec
- unoptimized PL/SQL (48 bytes per read) - 11 sec
- optimized PL/SQL (23808 bytes per read) - 0.1 sec (50x and 110x gain!)

Decoding of a 1 MB BLOB:
- Java version - 5 sec
- unoptimized PL/SQL (66 chars per read) - 2 sec
- optimized PL/SQL (32736 chars per read) - 0.1 sec (50x and 20x gain).

Now let me explain where I found these magic numbers. For encoding, the logic is as follows:
- Known:
   * the length of an output line is 66 chars: 64 data chars + 2 end-of-line chars.
   * 64 Base64-encoded data chars represent 64*(3/4)=48 original data bytes.
   * the result cannot exceed 32K per processed chunk (to stay within the limits of the VARCHAR2 datatype).
- The largest safe read is calculated in the following way: floor(32767/66)*48 = 23808 bytes (verified by the query below).
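
Just to double-check the arithmetic:

---------------------------------------------------------------------
-- 496 complete 66-char output lines fit into a 32767-char buffer;
-- each line carries 48 source bytes: 496 * 48 = 23808
select floor(32767/66)      lines_per_buffer,
       floor(32767/66) * 48 max_bytes_per_read
  from dual;
---------------------------------------------------------------------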

For decoding:
- Known:
  * the routine may get different kinds of Base64-encoded strings.
  * for now I decided to support the 64-char and 76-char formats (plus end-of-line)
  * these formats can produce 65, 66, 77 or 78 chars/line
  * the length of the line is identified by the greater of the first occurrences of CHR(10) / CHR(13)
  * when the original volume of data was less than 48 bytes, a default of 65 chars/line is assumed
  * the result cannot exceed 32K per processed chunk (to stay within the limits of the VARCHAR2 datatype).
  * reads should always happen in complete lines
- The largest safe read is calculated in the following way: floor(32767/<lineSize>)*<lineSize> (see the query below for the 66 chars/line case).
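
For the 66 chars/line format produced by UTL_ENCODE this yields exactly the read size quoted above (again, just a sanity query):

---------------------------------------------------------------------
-- 496 complete 66-char lines fit into a 32767-char buffer: 496 * 66 = 32736
select floor(32767/66) * 66 read_size_per_call
  from dual;
---------------------------------------------------------------------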

P.S. I also ran a number of tests in 11g (11.2.0.1). Unfortunately, the hardware on that box is completely different from what I have on my 10g instance, so at this point I will not post a lot of real numbers (it would be like comparing apples to oranges), but the overall findings are the following:
- Java-version - 0.5 sec
- Optimized PL/SQL - 0.1 sec

So, additional good news - the 11g version of the JVM is significantly faster than the 10g one, but my PL/SQL code still outruns it.

P.P.S. I am wondering whether anybody from the Java side would like to come up with some optimized Java code to be loaded into the database. It would be interesting to compare!

P.P.P.S. Yes, I know that the JVM does not come with the XE edition. That's one of the reasons I decided to optimize my PL/SQL version (since Dulcian's BRIM products are compatible with all editions).

Thursday, May 5, 2011

Small thoughts about hierarchical queries

I am just wondering how many times the question of how exactly Oracle executes CONNECT BY has been raised. I think every well-known database expert has had at least something to say on this :-) I didn't expect this topic to be so popular! Although, it could say something either about the quality of existing books/docs or about the fact that very few newcomers ever read the aforementioned books/docs.

Knowing a number of currently published authors, my guess would be that the latter is more probable. H-m-m-m-m (if anybody reads this blog) - here is a survey: if you are planning to use a new feature, would you rather read the docs first and try later, or would you try first and check the docs only if something doesn't work as expected? :-)

P.S. I should not break the trend. Here are my 2c on CONNECT BY.

Tuesday, May 3, 2011

Hierarchical queries

For the last couple of days I have been working on a new paper for ODTUG KScope'11 on how to appropriately work with hierarchical data. The reason I decided to open this topic was Oracle's introduction in 11g R2 of an extended version of CTE (common table expressions) - or, to put it simply, a recursive WITH clause (here is a set of examples: link, another link). For years it has been available in a number of RDBMS implementations (SQL Server, PostgreSQL, etc.), while Oracle people were using Oracle's own CONNECT BY clause.
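
To make the comparison concrete, here is the same employee hierarchy written both ways - a minimal sketch, assuming the classic SCOTT.EMP demo table:

---------------------------------------------------------------------
-- traditional Oracle syntax
select level, empno, ename, mgr
  from emp
 start with mgr is null
connect by prior empno = mgr;

-- the same hierarchy via the 11g R2 recursive WITH clause
with emp_tree (empno, ename, mgr, lvl) as (
   select empno, ename, mgr, 1
     from emp
    where mgr is null                      -- anchor: the root of the tree
   union all
   select e.empno, e.ename, e.mgr, t.lvl + 1
     from emp e
     join emp_tree t on e.mgr = t.empno    -- recursive member
)
select lvl, empno, ename, mgr
  from emp_tree;
---------------------------------------------------------------------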

I understand that compatibility with other platforms is a good thing - but I am still struggling to figure out why anybody in Oracle would care. My co-worker Grigoriy Novikov and I are running a set of tests and will keep you posted on whether there are any performance reasons (it doesn't look like it - yet), but even architecturally I am not 100% sure that I see the "25-words-or-less" case that would justify the existence of this feature.

To be fair, I know that I am biased - if the problem cannot be solved with a simple SQL statement, I immediately build a function that returns a result collection as an output and do all my processing in PL/SQL (something along the lines of the sketch below). Maybe this approach is limited - but this is code I can hand to my junior developers to maintain. Unfortunately, the proposed solutions involving CTEs (as far as I can see) are above the comprehension point of mere mortals... I appreciate the level of knowledge of the people providing such solutions - but I am still looking for a good "PL/SQL for Dummies"-level example.
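
To illustrate what I mean by that approach, here is a minimal sketch of a pipelined function doing the traversal (the type and function names are made up for the example, and it again assumes the SCOTT.EMP demo table):

---------------------------------------------------------------------
create or replace type empno_tt as table of number;
/
create or replace function get_subtree (p_root_empno number)
return empno_tt pipelined is
begin
  -- walk the hierarchy with plain CONNECT BY and pipe the rows out,
  -- so callers can simply do: select * from table(get_subtree(7839))
  for r in (select empno
              from emp
             start with empno = p_root_empno
           connect by prior empno = mgr)
  loop
    pipe row (r.empno);
  end loop;
  return;
end;
/
---------------------------------------------------------------------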

To be continued!