Recursive Subquery

At my current assignment we are processing files coming from an online system to be inserted into our database (kind of a data warehouse). This is done using external tables and a scheduled job. The job just checks if there is a file available and will process this. The trouble is that the files might not make it to our database, for various reasons.

I want to be able to check if all the files have been processed and no file has gone missing. After processing a file, its name gets logged into a table so I can check this at a later time. Of course, I don’t want to eyeball the list to see if all the files have been processed.

I have been playing around with some queries to find out the gaps. A couple of things to know in advance.
The files are named using a timestamp (‘YYYYMMDDHH24MI’)
The files are always 5 minutes apart from each other.
Let’s first create a test table:

rem create the testdata table
create table testdata
( filename            varchar2(4000)
)
/

Then I create some testdata. Using the DBMS_RANDOM.VALUE function I determine which records should be inserted.

rem fill the testdata table
begin
  for d in 10 .. 15 loop -- create 5 days of test data
    for hr in 0 .. 23 loop -- for every hour of the day
      for mn in 0 .. 59 loop -- for every second of the hour
        if mod(mn, 5) = 0 then -- only if it's a 5 minute value
          if trunc(dbms_random.value * 10) <> 4 then -- and our randomizer doesn't end up being 4 (create random gaps)
            insert into testdata
              (filename)
            values
              ('201907' || trim(to_char(d, '09')) || trim(to_char(hr, '09')) || trim(to_char(mn, '09')) || '.tst');
          end if;
        end if;
      end loop;
    end loop;
  end loop;
  commit;
end;
/

My first attempt at this query was to find all the gaps between to filenames which weren’t 5 minutes apart. So I started creating the query. Using subquery factoring I can show the steps I took.
Since the table consists of filenames and not dates (or timestamps) I needed to get the date portion of the file first:

select to_timestamp(substr(filename, 1, 12), 'YYYYMMDDHH24MISS') filedate
    from testdata
   order by 1

Using this set of data I can create the next step in the query, that is determining the preceding and the following date for each date and while we’re at it, determine the gap before and after:

with filenames as
 (select to_timestamp(substr(filename, 1, 12), 'YYYYMMDDHH24MISS') filedate
    from testdata
   order by 1)
select lag(filedate) over(order by filedate) previousfiledate
      ,filedate
      ,lead(filedate) over(order by filedate) nextfiledate
      ,(filedate - lag(filedate) over(order by filedate)) gapbefore
      ,(lead(filedate) over(order by filedate) - filedate) gapafter
  from filenames

Wrapping this set of data into yet another factored subquery, I can remove all the rows that have a gap of exactly 5 minutes.

with filenames as
 (select to_timestamp(substr(filename, 1, 12), 'YYYYMMDDHH24MISS') filedate
    from testdata
   order by 1),
gaps as
 (select lag(filedate) over(order by filedate) previousfiledate
        ,filedate
        ,lead(filedate) over(order by filedate) nextfiledate
        ,(filedate - lag(filedate) over(order by filedate)) gapbefore
        ,(lead(filedate) over(order by filedate) - filedate) gapafter
    from filenames)
select *
  from gaps
 where 1 = 1
   and (gapafter <> to_dsinterval('0 00:05:00')) 
    or (gapbefore <> to_dsinterval('0 00:05:00'))

After I got the results for this query it made me wonder: This query shows me where the gaps in the data are, but it doesn’t tell me exactly which file or files are missing. I still have to figure that out myself. It also shows every gap twice, once after one file and once before the next file. There has got to be a better way to find and fill up the gaps. What if I could just generate all the filenames that should be there and then subtract the filenames that have been recorded.
First I need to create a list of all the possible filenames that exist between the first and last recorded filename. I know about a feature called recursive subquery factoring, but I never used it before. Luckily Tim Hall has created a nice post on this subject.

with
/* 
 * First determine all the possible dates between the first and the last recorded file
 * Using recursive subquery
 * Thanks to Tim Hall for https://oracle-base.com/articles/11g/recursive-subquery-factoring-11gr2
 */
possibledates(thedate) as
 ( -- Anchor member
  select min(to_timestamp(substr(filename, 1, 12), 'YYYYMMDDHH24MI')) thedate
    from testdata
  union all
  -- Recursive member
  select thedate + to_dsinterval('0 00:05:00') thedate
    from possibledates
   where thedate <
         (select max(to_timestamp(substr(filename, 1, 12), 'YYYYMMDDHH24MI'))
            from testdata)),
/*
 * Then determing the filenames from these dates
 */
possiblefilenames as
 (select to_char(thedate, 'YYYYMMDDHH24MI') || '.tst' filename
    from possibledates)
/*
 * Using a simple minus operation, determine which filenames are missing
 */
select filename
  from possiblefilenames
minus
select filename
  from testdata

Not only is this query very fast, it also takes away the problem for me to eye-ball the data to find the missing files. It just shows which files are actually missing.

Leave a Reply

Your email address will not be published. Required fields are marked *