At my current assignment we are processing files coming from an online system to be inserted into our database (kind of a data warehouse). This is done using external tables and a scheduled job. The job just checks if there is a file available and will process this. The trouble is that the files might not make it to our database, for various reasons.
I want to be able to check if all the files have been processed and no file has gone missing. After processing a file, its name gets logged into a table so I can check this at a later time. Of course, I don’t want to eyeball the list to see if all the files have been processed.
I have been playing around with some queries to find out the gaps. A couple of things to know in advance.
The files are named using a timestamp (‘YYYYMMDDHH24MI’)
The files are always 5 minutes apart from each other.
Let’s first create a test table:
rem create the testdata table create table testdata ( filename varchar2(4000) ) /
Then I create some testdata. Using the DBMS_RANDOM.VALUE function I determine which records should be inserted.
rem fill the testdata table begin for d in 10 .. 15 loop -- create 5 days of test data for hr in 0 .. 23 loop -- for every hour of the day for mn in 0 .. 59 loop -- for every second of the hour if mod(mn, 5) = 0 then -- only if it's a 5 minute value if trunc(dbms_random.value * 10) <> 4 then -- and our randomizer doesn't end up being 4 (create random gaps) insert into testdata (filename) values ('201907' || trim(to_char(d, '09')) || trim(to_char(hr, '09')) || trim(to_char(mn, '09')) || '.tst'); end if; end if; end loop; end loop; end loop; commit; end; /
My first attempt at this query was to find all the gaps between to filenames which weren’t 5 minutes apart. So I started creating the query. Using subquery factoring I can show the steps I took.
Since the table consists of filenames and not dates (or timestamps) I needed to get the date portion of the file first:
select to_timestamp(substr(filename, 1, 12), 'YYYYMMDDHH24MISS') filedate from testdata order by 1
Using this set of data I can create the next step in the query, that is determining the preceding and the following date for each date and while we’re at it, determine the gap before and after:
with filenames as (select to_timestamp(substr(filename, 1, 12), 'YYYYMMDDHH24MISS') filedate from testdata order by 1) select lag(filedate) over(order by filedate) previousfiledate ,filedate ,lead(filedate) over(order by filedate) nextfiledate ,(filedate - lag(filedate) over(order by filedate)) gapbefore ,(lead(filedate) over(order by filedate) - filedate) gapafter from filenames
Wrapping this set of data into yet another factored subquery, I can remove all the rows that have a gap of exactly 5 minutes.
with filenames as (select to_timestamp(substr(filename, 1, 12), 'YYYYMMDDHH24MISS') filedate from testdata order by 1), gaps as (select lag(filedate) over(order by filedate) previousfiledate ,filedate ,lead(filedate) over(order by filedate) nextfiledate ,(filedate - lag(filedate) over(order by filedate)) gapbefore ,(lead(filedate) over(order by filedate) - filedate) gapafter from filenames) select * from gaps where 1 = 1 and (gapafter <> to_dsinterval('0 00:05:00')) or (gapbefore <> to_dsinterval('0 00:05:00'))
After I got the results for this query it made me wonder: This query shows me where the gaps in the data are, but it doesn’t tell me exactly which file or files are missing. I still have to figure that out myself. It also shows every gap twice, once after one file and once before the next file. There has got to be a better way to find and fill up the gaps. What if I could just generate all the filenames that should be there and then subtract the filenames that have been recorded.
First I need to create a list of all the possible filenames that exist between the first and last recorded filename. I know about a feature called recursive subquery factoring, but I never used it before. Luckily Tim Hall has created a nice post on this subject.
with /* * First determine all the possible dates between the first and the last recorded file * Using recursive subquery * Thanks to Tim Hall for https://oracle-base.com/articles/11g/recursive-subquery-factoring-11gr2 */ possibledates(thedate) as ( -- Anchor member select min(to_timestamp(substr(filename, 1, 12), 'YYYYMMDDHH24MI')) thedate from testdata union all -- Recursive member select thedate + to_dsinterval('0 00:05:00') thedate from possibledates where thedate < (select max(to_timestamp(substr(filename, 1, 12), 'YYYYMMDDHH24MI')) from testdata)), /* * Then determing the filenames from these dates */ possiblefilenames as (select to_char(thedate, 'YYYYMMDDHH24MI') || '.tst' filename from possibledates) /* * Using a simple minus operation, determine which filenames are missing */ select filename from possiblefilenames minus select filename from testdata
Not only is this query very fast, it also takes away the problem for me to eye-ball the data to find the missing files. It just shows which files are actually missing.