How can we use Oracle to deduplicate our data

Dear Patrick,

We have gone through a merger at our company where we are trying to merge the databases. The problem now is that we have duplicate records in our tables. We can of course go through all the records by hand and check if they exist twice. Another option is to build an application to do this. But using the Oracle Database there must be a better way to do this. Any ideas?

Ramesh Cunar

Dear Ramesh,

Going through all these records by hand seems like quite a time consuming process. I don’t know the number of records in your tables, but I would definitely not do this by hand. Writing an application to do this makes it possible to run the application multiple times, so you can update the rules making it better every time. Writing an application can be time consuming and running it can be as well.

I think your best shot would be to try and find a solution in either PL/SQL or SQL. Depending on the complexity and size of your data you could try any of the following solutions.

You can write a procedure that loops through your data and saves the records in an Associative array. If the record already exists in the array, or at least the key fields are equal, then it should not be added.

CREATE OR REPLACE PROCEDURE ot_deduplicate_aa IS
  TYPE t_ot_emp_aa IS TABLE OF ot_emp%ROWTYPE INDEX BY PLS_INTEGER;
  CURSOR c_ot_emp IS
    SELECT e.empno
          ,e.ename
          ,e.job
          ,e.mgr
          ,e.hiredate
          ,e.sal
          ,e.comm
          ,e.deptno
      FROM ot_emp e;
  r_ot_emp         ot_emp%ROWTYPE;
  l_unique_ot_emps t_ot_emp_aa;
  FUNCTION record_already_exists(record_in        IN ot_emp%ROWTYPE
                                ,collection_inout IN OUT t_ot_emp_aa) RETURN BOOLEAN IS
    l_indx        PLS_INTEGER := collection_inout.first;
    l_returnvalue BOOLEAN := FALSE;
  BEGIN
    IF l_indx IS NOT NULL THEN
      LOOP
        l_returnvalue := l_returnvalue OR ((record_in.ename = collection_inout(l_indx).ename) AND
                         (record_in.job = collection_inout(l_indx).job));
        l_indx := collection_inout.next(l_indx);
        EXIT WHEN l_returnvalue OR(l_indx IS NULL);
      END LOOP;
    END IF;
    RETURN l_returnvalue;
  END record_already_exists;

BEGIN
  OPEN c_ot_emp;
  LOOP
    FETCH c_ot_emp
      INTO r_ot_emp;
    EXIT WHEN c_ot_emp%NOTFOUND;
    -- check if this record already exists
    IF NOT (record_already_exists(record_in => r_ot_emp, collection_inout => l_unique_ot_emps)) THEN
      l_unique_ot_emps(l_unique_ot_emps.count + 1) := r_ot_emp;
    END IF;
  END LOOP;
  FOR indx IN l_unique_ot_emps.first .. l_unique_ot_emps.last LOOP
    INSERT INTO ot_emp_deduplicated
      (empno
      ,ename
      ,job
      ,mgr
      ,hiredate
      ,sal
      ,comm
      ,deptno)
    VALUES
      (l_unique_ot_emps(indx).empno
      ,l_unique_ot_emps(indx).ename
      ,l_unique_ot_emps(indx).job
      ,l_unique_ot_emps(indx).mgr
      ,l_unique_ot_emps(indx).hiredate
      ,l_unique_ot_emps(indx).sal
      ,l_unique_ot_emps(indx).comm
      ,l_unique_ot_emps(indx).deptno);
  END LOOP;
END ot_deduplicate_aa;
/

You can speed up this process by using bulk processing for both the retrieval and the storing of the data. Watch out that you don’t blow up your memory by retrieving too much data at once. You can use the limit clause to prevent this.

Another way to go at this, is the use of the analytic functions in Oracle. The idea behind using analytics is to rank all the record in their own partition. This rank can be rather arbitrary so we can use ROW_NUMBER as a ranking mechanism. In the PARTITION BY clause we can add all the columns we need to check for duplicates on. And to define which record to keep we add the (obligatory) order by clause. Then, using this new column, we can state that we are just interested in the records where the ROW_NUMBER equals 1.

WITH emp_all_rows AS
 (SELECT e.empno, e.ename, e.job, e.mgr, e.hiredate, e.sal, e.comm, e.deptno
       ,row_number() OVER (PARTITION BY e.ename, e.job ORDER BY e.empno ASC) rn
    FROM ot_emp e) 
INSERT
  INTO ot_emp_deduplicated(empno, ename, job, mgr, hiredate, sal, comm, deptno)
(SELECT empno, ename, job, mgr, hiredate, sal, comm, deptno
   FROM emp_all_rows
  WHERE rn = 1)
/

You can of course use this same technique to delete the duplicate rows directly from the source table, but then you cannot use the query factoring (WITH clause) because that isn’t supported in SQL.

DELETE FROM ot_emp e
 WHERE e.rowid IN 
(SELECT ear.rowid
   FROM (SELECT e.rowid
               ,row_number() over(PARTITION BY e.ename ORDER BY e.empno ASC) rn
           FROM ot_emp e) ear --<-- EmpAllRows
  WHERE rn > 1)
/

It is probably faster to insert the de-duplicated rows into a new table, then drop the original table and rename the new table. Be aware of any triggers, constraints, indexes etc. that may exist and need to be recreated.

You can check out the full demo code at
https://livesql.oracle.com/apex/livesql/s/ckqg3dkqmvj5omqttapzmrhtg
Or you can send me an email and I will send you the demo script.

Hope this helps in your efforts to de-duplicate your data.

Happy Oracle’ing,
Patrick Barel

If you have any comments on this subject or you have a question you want answered, please send an email to patrick[at]bar-solutions[dot]com. If I know the answer, or can find it for you, maybe I can help.

This question has been published in OTech Magazine of Winter 2015.