Friday, September 12, 2014

Analytic functions in real life

Yesterday while working on a real production report I found a requirement that forced me to come up with a bit nontrivial usage of analytic functions. Here is the problem (translated to SCOTT-level terms):

- I have a table that contains a DATE column (HIREDATE)
- Input parameter is also a date (IN_DT)
- For each department I need to show all rows that have HIREDATE after IN_DT plus the most recent row from the past (only one!).

Obviously, if you split the last condition into two, the solution is reasonably clear:

SELECT empno, ename, deptno, hiredate
FROM scott.emp
WHERE hiredate>:IN_DT
UNION ALL
SELECT empno, ename, deptno, hiredate
FROM (
     SELECT e.*,
            row_number() 
               over(PARTITION BY deptno
                    ORDER BY hiredate desc, 

                             empno desc
                   ) rank_nr
     FROM scott.emp e
     WHERE hiredate<=:IN_DT
     )
WHERE rank_nr = 1

The only problem of this solution - it needs two passed over EMP table. But in my production case that table is huge. Also, just to make life interesting, the result of the specified query has to be joined to 8 other tables to provide the required output. It is understandable, that my first optimization idea was to somehow do a single-pass query. To my own surprise, Oracle analytic functions are flexible enough to help me out:

SELECT empno, ename, deptno, hiredate, rank_nr
FROM
(
SELECT e.*,
      dense_rank()  
        over(PARTITION BY deptno
             ORDER BY 
               CASE WHEN hiredate > :IN_DT THEN '9999'
                    ELSE to_char(hiredate,'YYYYMMDDHH24MISS')||
                           lpad(empno,10,'0')
               END  desc   
            ) rank_nr,
      max(hiredate) 
        over(
             PARTITION BY deptno) max_dt        
FROM scott.emp e
)
WHERE max_dt > :IN_DT AND rank_nr IN (1,2)
      OR
      max_dt < = :IN_DT AND rank_nr = 1

In this solution I two analytical functions
1. DENSE_RANK is forced to order in the following way:
- bring all future dates to the top by assigning them exactly the same value '9999'
- order all other dates by converting them to 'YYYMMDDHH24MISS' format. In that format DESC would bring the most recent date up anyways.
- to resolve conflicts of multiple entries associated with the same date concatenate past dates with padded EMPNO. This way textual sort would correctly work even if the length of EMPNO may differ.
2. MAX() OVER() brings the highest hire date for each department.

Now the outer SELECT has enough information to filter out correct rows: if the highest date if after the parameter, you need to bring ranks 1 and 2 (all future dates + the last past date). Otherwise you need only rank 1 (last past date).

I understand, that this solution may take a couple of seconds to wrap your head around :-). Still - I was able to do exactly what I needed in a single pass!

Summary: if you are working on reports, analytic function is the MUST. That's the only safe way to write efficient code that would consistently work over large data sets.

No comments: