During my work at my current customer site I was asked to create files containing the changes to the data in the database. Normally you would use triggers on the tables to log the changed records, or a Flashback Data Archive, but that is only available starting with Oracle 11g and I am working with Oracle 10g. The trouble in this situation is that we want to see whether something changed in the data of a materialized view, and this materialized view is fully refreshed overnight.
The first idea I had was to check the ROWID; I figured it would change when the row was changed. The problem is that the ROWID either always changes, whether or not the data changed, or never changes, even when the data did change. So this proved not to be a reliable way to check for changes in the data.
Of course I could make a copy of the MV before it gets refreshed, but that would mean duplicating a lot of data. And we would have to check every column for changes, which is only an easy task if the tables don't have too many columns and if they are simple columns (i.e. VARCHAR2s, NUMBERs and DATEs).
How about making some sort of hash value for the entire row and storing that value? This took a lot of thinking and some investigating. The chance that two different rows hash to the same value is really small. According to Stack Overflow:
(there are 2^128 possible hashes total, so the chance of two random strings producing the same hash is theoretically close to 1 in 10^38).
That means hashing is not flawless, but it is good (enough).
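To put a number on that, here is a quick back-of-the-envelope check (in Python, purely as an illustration; the variable names are mine):

```python
# MD5 produces a 128-bit digest, so there are 2**128 possible hash values.
total_hashes = 2 ** 128

# Birthday bound: for n random rows, the chance of any two colliding is
# roughly n * (n - 1) / (2 * total_hashes).
n = 10 ** 9  # a billion rows
collision_probability = n * (n - 1) / (2 * total_hashes)

print(total_hashes > 10 ** 38)        # True: hence "1 in 10^38" for one pair
print(collision_probability < 1e-20)  # True: negligible even at this scale
```

Even with a billion rows in the table, the odds of any two rows accidentally hashing to the same value are vanishingly small.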
I want to make it dynamic; I don't want to create similar code for every table (or materialized view, in this case) I need to check. There can be tens, maybe hundreds or even thousands of them. How can I make this a generic solution? I can of course put all the fields of a record in one big string and hash that string. Let's check some documentation. First of all, I need to find all the columns in the table. Luckily Oracle provides a data dictionary in which we can find out just about everything we need to know about our objects. Column information is stored in the ALL_TAB_COLUMNS view (ALL_ for all objects we have access to; we can also use USER_TAB_COLUMNS, but that only shows the columns of tables in our own schema. There is also DBA_TAB_COLUMNS, which is pretty much the same as ALL_TAB_COLUMNS, but it requires the DBA role and it shows every column in every table in every schema).
I need the names and types of the columns in a certain table (possibly owned by a different user), so I need to retrieve the columns COLUMN_NAME and DATA_TYPE based on TABLE_NAME and OWNER. The SQL statement I need to execute is something like this:
SELECT atc.column_name
     , atc.data_type
  FROM all_tab_columns atc
 WHERE atc.table_name = <table_name>
   AND atc.owner = <owner>
Since I am using PL/SQL, it is easy to return everything I need in one round-trip to the SQL engine using bulk operations. That means I want to BULK COLLECT whatever is returned. When I retrieve a single value I can use a scalar variable to select into, but now I will probably retrieve more than one row, so I need a collection to fetch into. Actually, I need a collection of records, so I create a record type and a collection type in my package:
TYPE column_rt IS RECORD(
  column_name user_tab_columns.column_name%TYPE
, data_type   user_tab_columns.data_type%TYPE);
TYPE column_aat IS TABLE OF column_rt INDEX BY BINARY_INTEGER;
Using this collection type and my SQL statement I create a function like this:
FUNCTION get_column_names(table_name_in IN VARCHAR2
                         ,owner_in      IN VARCHAR2 DEFAULT NULL) RETURN column_aat IS
  l_returnvalue column_aat;
BEGIN
  SELECT atc.column_name
       , atc.data_type BULK COLLECT INTO l_returnvalue
    FROM all_tab_columns atc
   WHERE atc.table_name = table_name_in
     AND (atc.owner = owner_in OR owner_in IS NULL);
  RETURN l_returnvalue;
END get_column_names;
Now that I know what columns are in the table I can construct an SQL statement to retrieve all columns concatenated in one column. Something like this:
SELECT <concatenated_columns> <alias>
FROM <user>.<table> <table_alias>
I also need to find a way to link the result back to the originating table. The only way (I can think of) I can do this (without relying on Primary Keys which may or may not be defined) is using the ROWID. So my SQL statement will become something like this:
SELECT <concatenated_columns> <alias>, rowid
FROM <user>.<table> <table_alias>
Using Native Dynamic SQL I can construct the SQL statement at runtime, using the data dictionary to determine what I actually select.
Oracle does a lot of implicit conversion on which I could rely, but I think it is better (and faster) to do explicit conversion. That way I have more control over the actual statement being generated. Also, my statement will fail if there is no implicit conversion available (and it did, since I have a couple of Oracle Spatial columns in my test Materialized Views).
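The effect of explicit conversion can be sketched outside the database too. Below is a small Python analogue (my own illustration, not the generated PL/SQL): each value is rendered to a canonical string based on its declared type before concatenation, and an unsupported type fails loudly instead of silently:

```python
from datetime import datetime

def canonical(value, data_type):
    """Render a value as a canonical string, mirroring the explicit
    to_char(...)-style conversions in the generated SELECT."""
    if value is None:
        return ""
    if data_type == "VARCHAR2":
        return value
    if data_type == "NUMBER":
        return repr(value)
    if data_type == "DATE":
        return value.strftime("%Y%m%d%H%M%S")
    # No conversion available: fail explicitly, like the dynamic SQL would.
    raise ValueError("unsupported type: " + data_type)

row = [("Patrick", "VARCHAR2"), (42, "NUMBER"), (datetime(2011, 1, 1), "DATE")]
fullrow = "|".join(canonical(v, t) for v, t in row)
print(fullrow)  # Patrick|42|20110101000000
```

Failing on an unknown type is exactly what happened with my Oracle Spatial columns: better a visible error than a silently wrong hash.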
So, to build the select part of the statement which is dynamic I created another function:
FUNCTION create_select(columns_in    IN column_aat
                      ,table_name_in IN VARCHAR2
                      ,owner_in      IN VARCHAR2) RETURN VARCHAR2 IS
  l_returnvalue maxvarchar;
  l_prefix      maxvarchar;
  l_suffix      maxvarchar;
  l_skip        BOOLEAN;
BEGIN
  l_returnvalue := '';
  FOR indx IN nvl(columns_in.first, 0) .. nvl(columns_in.last, -1) LOOP
    l_skip := FALSE;
    CASE columns_in(indx).data_type
      WHEN c_varchar THEN
        l_prefix := c_prefix_varchar;
        l_suffix := c_suffix_varchar;
      WHEN c_number THEN
        l_prefix := c_prefix_number;
        l_suffix := c_suffix_number;
      WHEN c_date THEN
        l_prefix := c_prefix_date;
        l_suffix := c_suffix_date;
      ELSE
        l_prefix := c_prefix_empty;
        l_suffix := c_suffix_empty;
        l_skip   := TRUE;
        logging.add(9, 'data_type', columns_in(indx).data_type);
    END CASE;
    IF NOT l_skip THEN
      l_returnvalue := l_returnvalue || '|| ' || l_prefix || columns_in(indx).column_name || l_suffix;
    END IF;
  END LOOP;
  RETURN l_returnvalue;
END create_select;
As you can see in the code, I am currently only hashing the scalar fields in the table; this makes it easier for me to test. If there are special columns (like the Spatial datatype) they get skipped for now, but I do log that the datatype is currently not processed (in the currently working version I do have some code to hash the SDO_GEOMETRY column type). This function can 'easily' be expanded to handle new, different types of columns that I might encounter. I could even create support for user-defined datatypes that I may store in a table.
I now have a SQL statement that returns two columns: the concatenated values from a row, and the rowid. The rowid must be returned as is; it will be the way of connecting the originating table to the value returned by this bit of PL/SQL. The concatenated value of the entire row should be hashed so it returns a value of a predefined size (VARCHAR2(32)).
To hash a value I use the hash function in the dbms_crypto package. Since I don't want to write the code needed for this directly in my statements (hide the logic behind a layer of code, as Steven Feuerstein advocates), I created a simple function for it:
FUNCTION hashvalue(value_in IN maxvarchar) RETURN VARCHAR2 IS
BEGIN
  RETURN dbms_crypto.hash(src => utl_i18n.string_to_raw(value_in, 'AL32UTF8')
                         ,typ => dbms_crypto.hash_md5);
END hashvalue;
(maxvarchar is a subtype I defined so I don't need to remember the maximum size of a VARCHAR2 in PL/SQL:
SUBTYPE maxvarchar IS VARCHAR2(32767);)
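For comparison, the equivalent operation in Python (UTF-8 encode, MD5, 32-character hex digest; the function name just mirrors mine):

```python
import hashlib

def hashvalue(value):
    # Encode to UTF-8 (the AL32UTF8 counterpart), hash with MD5 and
    # return the 32-character hexadecimal digest.
    return hashlib.md5(value.encode("utf-8")).hexdigest()

print(len(hashvalue("Patrick")))  # 32, whatever the input
```

Whatever the length of the input row, the output is always 32 hex characters, which is why the hash column can be a fixed VARCHAR2(32).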
Now I need to create a table function (a function that can be called from SQL using the TABLE() operator) that gets me the hashed values for all the rows.
To be able to access more than one field in the resultset I need this function to return a schema level defined collection type. In this case a Nested Table Type.
CREATE OR REPLACE TYPE hash_rt AS OBJECT
( hrt_rowid     VARCHAR2(18)
, hrt_hashvalue VARCHAR2(32)
);
/
CREATE OR REPLACE TYPE hash_ntt AS TABLE OF hash_rt;
/
The code needs to know what table it should run against and who the owner of this object is. So the function specification becomes:
FUNCTION hashrows(table_name_in IN VARCHAR2
,owner_in IN VARCHAR2 DEFAULT USER) RETURN hash_ntt
By adding the keyword PIPELINED to this specification I can utilize possible parallel execution of any SQL I write against this function.
FUNCTION hashrows( table_name_in IN VARCHAR2
                 , owner_in      IN VARCHAR2 DEFAULT USER) RETURN hash_ntt
  PIPELINED IS
  TYPE valrow_r IS RECORD(
    vr_value maxvarchar
  , vr_rowid maxvarchar);
  TYPE valrow_aat IS TABLE OF valrow_r INDEX BY BINARY_INTEGER;
  l_columns column_aat;
  l_sql     maxvarchar;
  l_cv      SYS_REFCURSOR;
  l_valrow  valrow_aat;
BEGIN
  -- get the columns for this table
  l_columns := get_column_names(upper(table_name_in), upper(owner_in));
  -- build the select statement
  l_sql := 'SELECT ' || ltrim(create_select(l_columns, table_name_in, owner_in), '|') ||
           ' fullrow, rowid from ' || owner_in || '.' || table_name_in || ' ' || c_table_alias;
  -- open the cursor using this select statement
  OPEN l_cv FOR l_sql;
  -- bulk collect the results
  FETCH l_cv BULK COLLECT INTO l_valrow;
  -- check to see if there is anything to process
  IF l_valrow.count > 0 THEN
    -- loop through the result
    FOR indx IN l_valrow.first .. l_valrow.last LOOP
      -- pipe the result out of the function as soon as it is known
      PIPE ROW(hash_rt(l_valrow(indx).vr_rowid, hashvalue(l_valrow(indx).vr_value)));
    END LOOP;
  END IF;
  -- close the cursor
  CLOSE l_cv;
  -- return control
  RETURN;
EXCEPTION
  WHEN OTHERS THEN
    logging.add(9, 'Error', SQLERRM);
    logging.add(9, 'Error Stack', dbms_utility.format_error_stack);
    logging.add(9, 'SQL Statement', l_sql);
    RETURN;
END hashrows;
By choosing the pipelined option I am obliged to PIPE each ROW out of the function as soon as it is done processing. At the end of the function I have nothing to return but control, so there is a simple RETURN at the end of the function (no value is being returned).
Since my SQL returns multiple rows and I want to BULK COLLECT them I need a collection. In this case (again) a collection of records, since I am fetching more than one column per record. I could have created a package level type, but there is no need for this type outside this function, so I decided to declare it here.
I PIPE out each row, which is of the row type I defined. It is kind of strange that the function says it will return a collection while it actually returns single rows. But if you think about it, it does return more than one row, just one at a time.
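This is essentially what other languages call a generator: the signature promises a collection, but elements are produced one at a time. A Python sketch of the same shape (my analogy, not Oracle code):

```python
import hashlib

def hashrows(rows):
    # Declared as if it returns a whole sequence, but each (rowid, hash)
    # pair is handed to the caller as soon as it is computed,
    # just like PIPE ROW in a pipelined table function.
    for rowid, value in rows:
        yield rowid, hashlib.md5(value.encode("utf-8")).hexdigest()

# The caller consumes it like a regular collection.
result = list(hashrows([("AAA1", "Patrick"), ("AAA2", "Erik")]))
print(len(result))  # 2
```

The caller never sees the one-at-a-time nature; it just iterates over what looks like a complete result set.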
Since the create_select function returns the column names with two | characters at the beginning of the string (this saves a lot of if-then processing), I need to trim those bars off when building the statement:
ltrim(create_select(l_columns, table_name_in, owner_in), '|')
Now, let's see what this function returns. Unfortunately, this function cannot be called directly from PL/SQL. Running this test script:
— Call the function
result := hashing.hashrows(table_name_in => 'MV_HASH');
Results in an Oracle error:
ORA-06550: line 19, column 12:
PLS-00653: aggregate/table functions are not allowed in PL/SQL scope
So I need to call it from SQL using a statement like this:
select * from table(hashing.hashrows('MV_HASH'))
This results in:
Now that looks cool. Not something you would want to read though.
Let's do some testing. First I create two tables: one for the 'real' data and one to store the hashed values in (along with the rowid).
create table mv_hash (
  id   number
, name varchar2(30)
)
/
create table ht_hash (
  ht_rowid     varchar2(18)
, ht_hashvalue varchar2(32)
)
/
Then I insert some rows into the first table (which is called mv_hash because I will be using this on a materialized view in the real world):
insert into mv_hash (id, name) (select t.object_id, t.OBJECT_NAME from user_objects t)
603 rows inserted
Now, insert the hash values of the mv_ table into the ht_ (hash table) table.
insert into ht_hash (ht_rowid, ht_hashvalue) (select hrt_rowid, hrt_hashvalue from table(hashing.hashrows(‘MV_HASH’)))
603 rows inserted
Then it’s time to update a row in the mv_ table
update mv_hash
set name = 'Patrick'
where rownum = 1
1 row updated
Let’s find out which row was updated
select m.*
from mv_hash m
left outer join ht_hash h on (m.rowid = h.ht_rowid)
where h.ht_rowid is null
It seems the rowid wasn't changed by the update. Let's see if we can find the changed row using the entire table, the hash table and the table function together.
select m.*
from mv_hash m
join (table(hashing.hashrows('MV_HASH')) hm
      left outer join ht_hash h on (hm.hrt_hashvalue = h.ht_hashvalue)
     ) on (hm.hrt_rowid = m.rowid)
where h.ht_hashvalue is null
This is the way to get to the changed row. What I am actually doing is an anti-join: an outer join where one of the joined columns is null, i.e. no match was found.
This check even works when the entire table is truncated and repopulated, since we don't compare the rowid but the hash value of the row. If the hash value hasn't changed, the row won't show up.
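The anti-join boils down to a set difference: a row is new or changed exactly when its current hash is absent from the stored set. In Python terms (illustrative data and names, not the SQL above):

```python
import hashlib

def row_hash(row):
    # Concatenate the column values and hash the result, as in the article.
    return hashlib.md5("|".join(row).encode("utf-8")).hexdigest()

# Hashes stored before the refresh.
stored = {row_hash(r) for r in [("1", "Patrick"), ("2", "Erik")]}

# Current contents after the refresh, with one row updated.
current = [("1", "Patrick"), ("2", "Alex")]

# Anti-join: keep only the rows whose hash is not in the stored set.
changed = [r for r in current if row_hash(r) not in stored]
print(changed)  # [('2', 'Alex')]
```

The unchanged row hashes to a value that is already in the stored set, so only the updated row is reported, regardless of rowids.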