Change data capture (CDC) is an approach or a technique to identify changes, only changes, in the source. I have seen applications that are built without CDC and later mandate to implement CDC at a higher cost. Building an ETL application without CDC is a costly miss and usually a backtracking step. In this article we can discuss different methods of implementing CDC.

Scenario #01: Change detection using timestamp on source rows

In this typical scenario the source rows have extra two columns say row_created_time & last_modified_time. Row_created_time : time at which the record was first created ; Last_modified_time: time at which the record was last modified

  1. In the mapping create mapping variable $$LAST_ETL_RUN_TIME of datetime data type
  2. Evaluate condition SetMaxVariable ($$LAST_ETL_RUN_TIME, SessionStartTime); this steps stores the time at which the Session was started to $$LAST_ETL_RUN_TIME
  3. Use $$LAST_ETL_RUN_TIME in the ‘where’ clause of the source SQL. During the first run or initial seed the mapping variable would have a default value and pull all the records from the source, like: select * from employee where last_modified_date > ’01/01/1900 00:00:000’
  4. Now let us assume the session is run on ’01/01/2010 00:00:000’ for initial seed
  5. When the session is executed on ’02/01/2010 00:00:000’, the sequel would be like : select * from employee where last_modified_date > ’01/01/2010 00:00:000’, hereby pulling records that had only got changed in between successive runs

Scenario #02: Change detection using load_id or Run_id

Under this scenario the source rows have a column say load_id, a positive running number. The load_id is updated as and when the record is updated

  1. In the mapping create mapping variable $$LAST_READ_LOAD_ID of integer data type
  2. Evaluate condition SetMaxVariable ($$LAST_READ_LOAD_ID,load_id); the maximum load_id is stored into mapping variable
  3. Use $$LAST_READ_LOAD_ID in the ‘where’ clause of the source SQL. During the first run or initial seed the mapping variable would have a default value and pull all the records from the source, like: select * from employee where load_id > 0; Assuming all records during initial seed have load_id =1, the mapping variable would store ‘1’ into the repository.
  4. Now let us assume the session is run after five load’s into the source, the sequel would be select * from employee where load_id >1 ; hereby we limit the source read only to the records that have been changed after the initial seed
  5. Consecutive runs would take care of updating the load_id & pulling the delta in sequence

In the next blog we can see how to implement CDC when reading from Salesforce.com

Posted by Badri Narayanan
Comments (0)
August 12th, 2010

Comments (0)