Our company has gone through a merger and we are now trying to merge the databases. The problem is that we have duplicate records in our tables. We can of course go through all the records by hand and check whether they exist twice. Another option is to build an application to do this. But with the Oracle Database there must be a better way. Any ideas?
Going through all these records by hand seems like quite a time-consuming process. I don’t know the number of records in your tables, but I would definitely not do this by hand. Writing an application makes it possible to run the de-duplication multiple times, so you can refine the rules with every run. On the other hand, writing an application can be time-consuming, and so can running it.
I think your best shot would be to try and find a solution in either PL/SQL or SQL. Depending on the complexity and size of your data you could try any of the following solutions.
You can write a procedure that loops through your data and saves the records in an associative array. If the record already exists in the array, or at least its key fields match an existing entry, it is not added.
CREATE OR REPLACE PROCEDURE ot_deduplicate_aa IS
  TYPE t_ot_emp_aa IS TABLE OF ot_emp%ROWTYPE INDEX BY PLS_INTEGER;

  CURSOR c_ot_emp IS
    SELECT e.empno
          ,e.ename
          ,e.job
          ,e.mgr
          ,e.hiredate
          ,e.sal
          ,e.comm
          ,e.deptno
      FROM ot_emp e;

  r_ot_emp         ot_emp%ROWTYPE;
  l_unique_ot_emps t_ot_emp_aa;

  -- Returns TRUE when a record with the same key fields (ename, job)
  -- is already present in the collection.
  -- Note: a NULL ename or job makes the comparison NULL instead of TRUE or FALSE,
  -- so rows with NULL key fields need separate handling.
  FUNCTION record_already_exists(record_in        IN ot_emp%ROWTYPE
                                ,collection_inout IN OUT t_ot_emp_aa) RETURN BOOLEAN IS
    l_indx        PLS_INTEGER := collection_inout.first;
    l_returnvalue BOOLEAN := FALSE;
  BEGIN
    IF l_indx IS NOT NULL THEN
      LOOP
        l_returnvalue := l_returnvalue OR
                         ((record_in.ename = collection_inout(l_indx).ename) AND
                          (record_in.job = collection_inout(l_indx).job));
        l_indx := collection_inout.next(l_indx);
        EXIT WHEN l_returnvalue OR (l_indx IS NULL);
      END LOOP;
    END IF;
    RETURN l_returnvalue;
  END record_already_exists;
BEGIN
  OPEN c_ot_emp;
  LOOP
    FETCH c_ot_emp INTO r_ot_emp;
    EXIT WHEN c_ot_emp%NOTFOUND;
    -- check if this record already exists
    IF NOT (record_already_exists(record_in => r_ot_emp, collection_inout => l_unique_ot_emps)) THEN
      l_unique_ot_emps(l_unique_ot_emps.count + 1) := r_ot_emp;
    END IF;
  END LOOP;
  CLOSE c_ot_emp;

  -- the collection is filled densely from 1, so 1 .. count is also safe when it is empty
  FOR indx IN 1 .. l_unique_ot_emps.count LOOP
    INSERT INTO ot_emp_deduplicated
      (empno, ename, job, mgr, hiredate, sal, comm, deptno)
    VALUES
      (l_unique_ot_emps(indx).empno
      ,l_unique_ot_emps(indx).ename
      ,l_unique_ot_emps(indx).job
      ,l_unique_ot_emps(indx).mgr
      ,l_unique_ot_emps(indx).hiredate
      ,l_unique_ot_emps(indx).sal
      ,l_unique_ot_emps(indx).comm
      ,l_unique_ot_emps(indx).deptno);
  END LOOP;
END ot_deduplicate_aa;
/
You can speed up this process by using bulk processing (BULK COLLECT and FORALL) for both retrieving and storing the data. Watch out that you don’t exhaust your session memory by retrieving too much data at once. You can use the LIMIT clause of BULK COLLECT to prevent this.
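As a rough illustration of the bulk approach, the sketch below fetches the rows in batches with BULK COLLECT ... LIMIT and writes each batch with FORALL. The procedure name ot_deduplicate_bulk, the batch size of 1000 and the '|' key separator are assumptions made for this example. It also assumes that ot_emp_deduplicated has exactly the same columns as ot_emp, and it swaps the linear search for a string-indexed associative array used as a lookup, so unlike the procedure above it treats NULL key fields as equal to each other.

```sql
CREATE OR REPLACE PROCEDURE ot_deduplicate_bulk IS
  -- assumed batch size; tune it to the memory you can spare
  c_batch_size CONSTANT PLS_INTEGER := 1000;

  TYPE t_ot_emp_tab IS TABLE OF ot_emp%ROWTYPE INDEX BY PLS_INTEGER;
  TYPE t_seen_aa    IS TABLE OF BOOLEAN INDEX BY VARCHAR2(4000);

  CURSOR c_ot_emp IS
    SELECT e.* FROM ot_emp e;

  l_batch  t_ot_emp_tab; -- rows fetched in the current batch
  l_unique t_ot_emp_tab; -- rows of the current batch that are new
  l_seen   t_seen_aa;    -- keys seen so far, across all batches
  l_key    VARCHAR2(4000);
BEGIN
  OPEN c_ot_emp;
  LOOP
    FETCH c_ot_emp BULK COLLECT INTO l_batch LIMIT c_batch_size;
    EXIT WHEN l_batch.COUNT = 0;

    l_unique.DELETE;
    FOR i IN 1 .. l_batch.COUNT LOOP
      -- assumes '|' does not occur in ename or job
      l_key := l_batch(i).ename || '|' || l_batch(i).job;
      IF NOT l_seen.EXISTS(l_key) THEN
        l_seen(l_key) := TRUE;
        l_unique(l_unique.COUNT + 1) := l_batch(i);
      END IF;
    END LOOP;

    -- one round trip to the SQL engine per batch
    FORALL i IN 1 .. l_unique.COUNT
      INSERT INTO ot_emp_deduplicated VALUES l_unique(i);
  END LOOP;
  CLOSE c_ot_emp;
END ot_deduplicate_bulk;
/
```

Only the keys stay in memory for the whole run; the rows themselves are released after each batch, which is what keeps the memory use bounded.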
Another way to approach this is to use the analytic functions in Oracle. The idea is to rank all the records within their own partition. The rank can be rather arbitrary, so we can use ROW_NUMBER as the ranking mechanism. In the PARTITION BY clause we list all the columns that need to be checked for duplicates. To define which record to keep, we add the (obligatory) ORDER BY clause. Then, using this new column, we state that we are only interested in the records where ROW_NUMBER equals 1.
-- In Oracle the WITH clause belongs to the query, so it follows the INSERT INTO line
INSERT INTO ot_emp_deduplicated
  (empno, ename, job, mgr, hiredate, sal, comm, deptno)
WITH emp_all_rows AS
 (SELECT e.empno, e.ename, e.job, e.mgr, e.hiredate, e.sal, e.comm, e.deptno
        ,row_number() OVER (PARTITION BY e.ename, e.job ORDER BY e.empno ASC) rn
    FROM ot_emp e)
SELECT empno, ename, job, mgr, hiredate, sal, comm, deptno
  FROM emp_all_rows
 WHERE rn = 1
/
You can of course use this same technique to delete the duplicate rows directly from the source table. An Oracle DELETE statement cannot start with a query factoring (WITH) clause, so here the ranking query is written as an inline view instead.
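DELETE FROM ot_emp
 WHERE rowid IN (SELECT ear.rid
                   FROM (SELECT e.rowid rid -- aliased so it can be selected from the inline view
                               -- same key columns (ename, job) as in the examples above
                               ,row_number() OVER (PARTITION BY e.ename, e.job ORDER BY e.empno ASC) rn
                           FROM ot_emp e) ear -- EmpAllRows
                  WHERE ear.rn > 1)
/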
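Before running a delete like this, it may be worth checking which key combinations actually occur more than once. A minimal sketch, assuming ename and job are the key columns as in the examples above (the alias cnt is just a name chosen here):

```sql
-- List every (ename, job) combination that occurs more than once
SELECT e.ename
      ,e.job
      ,COUNT(*) AS cnt
  FROM ot_emp e
 GROUP BY e.ename, e.job
HAVING COUNT(*) > 1
 ORDER BY cnt DESC
/
```

For each row returned, the delete should remove cnt - 1 records and keep the one with the lowest empno.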
It is probably faster to insert the de-duplicated rows into a new table, then drop the original table and rename the new table. Be aware of any triggers, constraints, indexes etc. that may exist and need to be recreated.
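A sketch of that approach could look like the following. The table names ot_emp_new and ot_emp_old are hypothetical, and instead of dropping the original right away this version renames it first, so it can still be inspected until the result has been verified.

```sql
-- Build the de-duplicated copy in one statement (CREATE TABLE ... AS SELECT)
CREATE TABLE ot_emp_new AS
SELECT empno, ename, job, mgr, hiredate, sal, comm, deptno
  FROM (SELECT e.*
              ,row_number() OVER (PARTITION BY e.ename, e.job ORDER BY e.empno ASC) rn
          FROM ot_emp e)
 WHERE rn = 1
/
-- Swap the tables; ot_emp_old is kept as a safety net
RENAME ot_emp TO ot_emp_old
/
RENAME ot_emp_new TO ot_emp
/
-- Once the result has been checked:
-- DROP TABLE ot_emp_old
```

CREATE TABLE ... AS SELECT does not copy indexes, triggers, or most constraints, so those still have to be recreated on the new table afterwards.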
You can check out the full demo code at
Or you can send me an email and I will send you the demo script.
Hope this helps in your efforts to de-duplicate your data.
This question has been published in OTech Magazine of Winter 2015.