Data Cleaning

With the data landed and the first pass at type analysis completed, it was time to take a closer look at the current state of the data. Since this data set is quite small and I was at the edge of my SQL knowledge, I did this manually, sketching the table relationships while reviewing the values in the tables. All of this set the stage for cleaning the data.

Data cleaning is an important step in the transform portion of an ETL, as it directly impacts the quality of the end product. In cleaning the data, we want each pass over the data to be an incremental step toward something consumable. That means a particular data set may take several cleaning steps, and each step should retain the business value while removing the inconsistencies we do not want in the end product.

For example, in the landed data every string field was encapsulated in double-quotes. The final state of the data should not contain these, as the double-quotes provide no value. A simple use of the replace(expression, pattern, replacement) function on each varchar column was enough to solve this problem. Another scenario that qualified for cleaning: some columns contained the literal value \N instead of null.
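As a rough sketch of what those two fixes can look like in SQL (the table and column names here are placeholders, not the actual analysis tables), both come down to simple UPDATE statements:

```sql
-- Strip the encapsulating double-quotes from a varchar column.
UPDATE analysis_table
SET    some_text_column = replace(some_text_column, '"', '');

-- Convert the literal marker \N into a proper SQL NULL.
-- Depending on the database, the backslash may need escaping ('\\N').
UPDATE analysis_table
SET    some_text_column = NULL
WHERE  some_text_column = '\N';
```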

In performing this exercise, I decided to execute all data cleaning in place, updating the analysis tables incrementally. If I were to do this again, I would instead create a new table for each cleaning step, which would make it much easier to test each step in isolation. As I made mistakes, getting the data back to a known state meant truncating the table(s) and re-executing all previous steps, which became painful as more cleaning was completed. The in-place approach also made structural transforms more difficult, since pulling data out of the table to refactor the schema would require some jockeying of columns. Rather than modifying the existing schema, I chose to defer the final transforms to the load step.
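For illustration, the table-per-step approach might look like the following sketch, where each step reads the previous step's table and writes a new one (the names are hypothetical):

```sql
-- Step 1: strip the double-quotes, leaving the landed table untouched.
CREATE TABLE cleaned_step_1 AS
SELECT replace(some_text_column, '"', '') AS some_text_column,
       other_column
FROM   landed_table;

-- Step 2: turn the \N markers into real NULLs in a second table,
-- so step 1 can be re-tested without re-running everything before it.
CREATE TABLE cleaned_step_2 AS
SELECT NULLIF(some_text_column, '\N') AS some_text_column,
       other_column
FROM   cleaned_step_1;
```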

At this stage in the process we have a clean data set, and the last thing to do is load it into the target database.