Storing timestamps along with a record indicating its new arrival or a change in its value is a must in a data warehouse. We always take it for granted, adding timestamp fields to table structures and tending to miss that the amount of storage space a timestamp field can occupy is huge, the storage occupied by timestamp is almost double against a integer data type in many databases like SQL Server, Oracle and if we have two fields one as insert timestamp and other field as update timestamp then the storage spaced required gets doubled. There are many instances where we could avoid using timestamps especially when the timestamps are being used for primarily for determining the incremental records or being stored just for audit purpose.

How to effectively manage the data storage and also leverage the benefit of a timestamp field?

One way of managing the storage of timestamp field is by introducing a process id field and a process table. Following are the steps involved in applying this method in table structures and as well as part of the ETL process.
Data Structure

  1. Consider a table name PAYMENT with two fields with timestamp data type like INSERT_TIMESTAMP and UPDATE_TIEMSTAMP used for capturing the changes for every present in the table
  2. Create a table named PROCESS_TABLE with columns PROCESS_NAME Char(25), PROCESS_ID Integer and PROCESS_TIMESTAMP Timestamp
  3. Now drop the fields of the TIMESTAMP data type from table PAYMENT
  4. Create two fields of integer data type in the table PAYMENT like INSERT_PROCESS_ID and UPDATE_PROCESS_ID
  5. These newly created id fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID would be logically linked with the table PROCESS_NAME and its field PROCESS_ID

ETL Process

  1. Let us consider an ETL process called ‘Payment Process’ that loads data into the table PAYMENT
  2. Now create a pre-process which would run before the ‘payment process’, in the pre-process build the logic by which a record is inserted with the values like (‘payment process’, SEQUNCE Number, current timestamp) into the PAYMENT table. The PROCESS_ID in the payment table could be defined as a database sequence function.
  3. Pass the current_prcoess_id from pre-process step to the ‘payment process’ ETL process
  4. In the ‘payment process’ if a record is to inserted into the PAYMENT table then the current_prcoess_id value is set to both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID else if a record is getting updated in the PAYMENT table then the current_process_id value is set to only the column UPDATE_PROCESS_ID
  5. So now the timestamp values for the records inserted or updated in the table PAYMENT can be picked from the PROCESS_TABLE by joining by the PROCESS_ID with the INSERT_PROCESS_ID and UPDATE_PROCESS_ID columns of the PAYMENT tableBenefits

  • The fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID occupy less space when compared to the timestamp fields
  • Both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID are Index friendly
  • Its easier to handle these process id fields in terms picking the records for determining the incremental changes or for any audit reporting.

Posted by Muneeswara C Pandian
Comments (3)
May 15th, 2008

Comments (3)

Rajamani - May 29th, 2008

Thanks for the clarification Pandian.

Muneewaara C Pandian - May 19th, 2008

Rajamani, thanks for reading, the second point was an inetersteing observation, have tried to clarify below 1. We would insert 1 record into the PROCESS_TABLE for each process/job. We would insert/update multiple records through a single process. Say for 100 records inserted/updated into the PAYMENT table with PROCESS_ID, the additional storage accupied by three columns for 1 record will be far less than the actual saving of space by avoiding 100 timestamp values. 2. The PROCESS_ID generation will be a pre-process to the PAYMENT PROCESS, we need to set two kind of dependency on this process 1. Only on successful PROCESS_ID creation the PAYMENT PROCESS can start 2. Only after successful PAYMENT PROCESS the PROCESS_ID creation can increment the seqeunce number. This dependency will esnure that on job restart for a load failure the seqeunce number for PROCESS ID id is not incremented. The second dependency will be requried only if wwe want to ensure that all records inserted for a load irrespective of load failure or process rerun needs to have same process id.

Rajamani - May 16th, 2008

The logic seems good, however this needs some clarification with regard to 1. The space occupied by the new table with three columns,and indexes compared to the two timestamp fields in the original table(PAYMENT table). 2. What happens to the Sequence Number generated for PROCESS_ID, in case if the load fails.

Comments are closed.