Comparing queries…

How do you compare a rewritten query to its original version? Most of the time I just run a MINUS operation on the original and new query. Actually I execute two. Old query (A) MINUS New query (B) and vice versa (B) MINUS (A). Both should result in no rows. That way I thought I had proven that the resultsets for both queries are equal.
But there is a flaw in this assumption.

What if there are duplicate rows in one of the resultsets? The MINUS operator removes a row the resultset if it exists in both collections. But if one of the collections has the row twice and the other collection has it one, then it is completely removed.
Let’s say we have two queries:

-- first query
select 'alpha' nato_spelling from dual union all
select 'bravo' nato_spelling from dual union all
select 'bravo' nato_spelling from dual union all
select 'charlie' nato_spelling from dual
NATO_SPELLING
-------------
alpha
bravo
bravo
charlie

and

-- second query
select 'alpha' nato_spelling from dual union all
select 'bravo' nato_spelling from dual union all
select 'charlie' nato_spelling from dual
NATO_SPELLING
-------------
alpha
bravo
charlie

As you can see, by just eye-balling the queries the resultsets are different.
But when you execute the minus operator on the queries you’ll get the impression the resultsets are the same:

select *
from
((-- first query
select 'alpha' nato_spelling from dual union all
select 'bravo' nato_spelling from dual union all
select 'bravo' nato_spelling from dual union all
select 'charlie' nato_spelling from dual
 )
 MINUS
 (-- second query
select 'alpha' nato_spelling from dual union all
select 'bravo' nato_spelling from dual union all
select 'charlie' nato_spelling from dual
 )
)

NATO_SPELLING
----------------------------
select *
from
((-- second query
select 'alpha' nato_spelling from dual union all
select 'bravo' nato_spelling from dual union all
select 'charlie' nato_spelling from dual
 )
 MINUS
 (-- first query
select 'alpha' nato_spelling from dual union all
select 'bravo' nato_spelling from dual union all
select 'bravo' nato_spelling from dual union all
select 'charlie' nato_spelling from dual
 )
)
NATO_SPELLING
----------------------------

It gets worse when you compare queries that include a UNION or a UNION ALL operator:

select *
from
(
 (-- first query
  select 'alpha' nato_spelling from dual union all
  select 'bravo' nato_spelling from dual union all
  select 'bravo' nato_spelling from dual union all
  select 'charlie' nato_spelling from dual
 )
 UNION
 (-- third query
  select 'x-ray' nato_spelling from dual union all
  select 'yankee' nato_spelling from dual union all
  select 'zulu' nato_spelling from dual
 )
)
NATO_SPELLING
-------------
alpha
bravo
charlie
x-ray
yankee
zulu

6 rows selected

and

select *
from
(
 (-- first query
  select 'alpha' nato_spelling from dual union all
  select 'bravo' nato_spelling from dual union all
  select 'bravo' nato_spelling from dual union all
  select 'charlie' nato_spelling from dual
 )
 UNION ALL
 (-- third query
  select 'x-ray' nato_spelling from dual union all
  select 'yankee' nato_spelling from dual union all
  select 'zulu' nato_spelling from dual
 )
)
NATO_SPELLING
-------------
alpha
bravo
bravo
charlie
x-ray
yankee
zulu

7 rows selected

Clearly the results are different. This is because of how the UNION operator works. It remove duplicates from the resultset. But, when I check it using the MINUS operator (both ways):

select *
  from 
  (
   (select *
      from
      (
       (-- first query
        select 'alpha' nato_spelling from dual union all
        select 'bravo' nato_spelling from dual union all
        select 'bravo' nato_spelling from dual union all
        select 'charlie' nato_spelling from dual
       )
       UNION
       (-- third query
        select 'x-ray' nato_spelling from dual union all
        select 'yankee' nato_spelling from dual union all
        select 'zulu' nato_spelling from dual
       )
      )
   )
   MINUS
   (select *
      from
      (
       (-- first query
        select 'alpha' nato_spelling from dual union all
        select 'bravo' nato_spelling from dual union all
        select 'bravo' nato_spelling from dual union all
        select 'charlie' nato_spelling from dual
       )
       UNION ALL
       (-- third query
        select 'x-ray' nato_spelling from dual union all
        select 'yankee' nato_spelling from dual union all
        select 'zulu' nato_spelling from dual
       )
      )
   )
  )
NATO_SPELLING
-------------

and

select *
  from
  (
   (select *
      from
      (
       (-- first query
        select 'alpha' nato_spelling from dual union all
        select 'bravo' nato_spelling from dual union all
        select 'bravo' nato_spelling from dual union all
        select 'charlie' nato_spelling from dual
       )
       UNION ALL
       (-- third query
        select 'x-ray' nato_spelling from dual union all
        select 'yankee' nato_spelling from dual union all
        select 'zulu' nato_spelling from dual
       )
      )
   )
   MINUS
   (select *
      from
      (
       (-- first query
        select 'alpha' nato_spelling from dual union all
        select 'bravo' nato_spelling from dual union all
        select 'bravo' nato_spelling from dual union all
        select 'charlie' nato_spelling from dual
       )
       UNION
       (-- third query
        select 'x-ray' nato_spelling from dual union all
        select 'yankee' nato_spelling from dual union all
        select 'zulu' nato_spelling from dual
       )
      )
   )
  )
NATO_SPELLING
-------------

These results tell me the queries have the same resultsets, when in fact they do not.
A simple (extra) check could be to check if both queries have the same number of rows in the resultset.
In my opinion the resultsets are equal when both the MINUS queries give no results AND the count of the individual queries are the same.
Is there something I am still missing here?
I hope not.

3 thoughts on “Comparing queries…

  1. Patrick,

    Adding the count makes it a bit more robust, but not enough.
    If ‘alpha’ is a duplicate in query 1 and ‘bravo’ is a duplicate lin query 2, your counts will also be the same.

    I would suggest adding a
    Count(*) over (partition by nato_spelling) occurrences
    To both queries and then do both minus operations.
    That way a specific nato_spelling is only removed by the minus if the occurrences in both queries are the same.



  2. In such cases I use a full outer join on the primary key or some unique key. in your sample I used the following statement to find the difference:

    with my_old_data
    as (
    select ‘alpha’ nato_spelling from dual union all
    select ‘bravo’ nato_spelling from dual union all
    select ‘bravo’ nato_spelling from dual union all
    select ‘charlie’ nato_spelling from dual
    )
    , my_new_data
    as (
    select ‘alpha’ nato_spelling from dual union all
    select ‘bravo’ nato_spelling from dual union all
    select ‘charlie’ nato_spelling from dual
    )
    , my_old
    as (
    select nato_spelling
    , row_number() over (partition by nato_spelling order by nato_spelling) rn
    from my_old_data
    )
    , my_new
    as (
    select nato_spelling
    , row_number() over (partition by nato_spelling order by nato_spelling) rn
    from my_new_data
    )
    select o.*
    , n.*
    from my_old o
    full outer join my_new n on o.nato_spelling = n.nato_spelling
    and o.rn = n.rn
    where nvl(o.rn,-1) != nvl (n.rn, -2)


Leave a Reply

Your email address will not be published. Required fields are marked *