ETL Retrospective


With my first ETL complete, I want to provide additional detail on the steps I took and then reflect on what I would do differently next time. First, a closer look at the process:

  1. Collect data, as CSV files, from the data source
  2. Scan the CSV files for the types of data and content they contain
  3. For each CSV, write a script to create a landing table and land the CSV data into it (see the sketch after this list)
  4. For each landing table, extract the data into type analysis tables
  5. Manually analyze the types, the data, and the required transformations
  6. For each analysis table, clean the data in place
  7. For each target table, create a schema defining the foreign key relationships between tables
  8. For each analysis table, migrate the cleaned data to its target table
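As a sketch of step 3, here is roughly the shape of one of those landing scripts. The table and column names are hypothetical, and the COPY syntax assumes PostgreSQL; the actual 53 scripts live in the repository.

```sql
-- Hypothetical step-3 landing script (PostgreSQL syntax).
-- Landing columns are all text so the raw CSV loads without type errors;
-- typing happens later, in the analysis tables.
CREATE SCHEMA IF NOT EXISTS landing;

CREATE TABLE IF NOT EXISTS landing.player (
    player_id   text,
    player_name text,
    birth_date  text,
    nationality text
);

-- Load the raw CSV into the landing table.
COPY landing.player (player_id, player_name, birth_date, nationality)
FROM '/data/player.csv'
WITH (FORMAT csv, HEADER true);
```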

These steps are captured in 53 SQL files across 6 directories in the repository. All of this is well and good; it took me from little ETL experience to some experience. Still, the approach felt horizontal: every CSV file was loaded into the landing database before I tried to understand the types of data I had. I also waited too long to understand the model of the CSV data, that is, how the data files relate to one another and how those relationships are defined. The definition of relationships and the normalization of the nationality data likewise came too late in the process. Finally, this process does not allow for incremental data loads. Say an additional season of data were added to the data set: the only way to add it at present is to truncate every table and refresh all of the data, from the CSVs up to the target tables.
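To make that last point concrete, a refresh today looks something like the following, repeated for every table in the chain (table names are hypothetical):

```sql
-- Hypothetical full refresh: the only way to bring in new data with the
-- current design. Every layer is emptied and reloaded from the CSVs.
TRUNCATE TABLE target.player, analysis.player, landing.player;
-- ...then re-run every landing, analysis, and target script from the beginning.
```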

Knowing what I know now, here is how I would proceed with the next ETL:

  1. Collect data, as CSV files, from the data source
  2. Review CSV files for data relationships
  3. Review CSV files for data types
  4. Identify the CSV with the fewest dependencies and begin processing it
    1. Script the creation of the landing table and land the CSV data into it (see the sketch after this list)
      • Perform this in one script
      • Append a column to the landing schema to capture a data version
    2. Script the data, including the version, into an analysis table
    3. Identify the data types and the transformations the data needs
      • This would remain a manual, discovery-type process
    4. Script the creation of the staging table and land the transformed data, with its version from the analysis table, into it (also shown in the sketch after this list)
      • Perform this in one script
      • This would include creating any foreign key relationships to other staging tables
  5. Repeat these steps for each remaining CSV
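To make steps 4.1 and 4.4 concrete, here is a rough sketch of what that single landing script and a versioned staging script might look like. Everything here is an assumption for illustration: the table and column names, the data_version values, the already-typed analysis table, and the PostgreSQL COPY syntax.

```sql
-- Hypothetical single landing script for one CSV: create the landing table
-- and load the file in one pass, tagging every row with the data version
-- (the load/source-file identifier) it came from.
CREATE TABLE IF NOT EXISTS landing.player (
    player_id    text,
    player_name  text,
    birth_date   text,
    nationality  text,
    data_version text NOT NULL DEFAULT '2024-season-01'  -- identifies the source file/load
);

COPY landing.player (player_id, player_name, birth_date, nationality)
FROM '/data/player_2024.csv'
WITH (FORMAT csv, HEADER true);

-- Hypothetical single staging script: create the typed staging table,
-- declare foreign keys to staging tables that were processed earlier,
-- and load the transformed rows along with their version.
CREATE TABLE IF NOT EXISTS staging.player (
    player_id      integer PRIMARY KEY,
    player_name    text    NOT NULL,
    birth_date     date,
    nationality_id integer REFERENCES staging.nationality (nationality_id),
    data_version   text    NOT NULL
);

INSERT INTO staging.player (player_id, player_name, birth_date, nationality_id, data_version)
SELECT a.player_id,
       a.player_name,
       a.birth_date,
       n.nationality_id,
       a.data_version
FROM analysis.player AS a
JOIN staging.nationality AS n ON n.nationality_name = a.nationality;
```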

I believe there are substantial gains with this more vertical approach. First, there would be a better understanding of the data and its quality before jumping into the more dependent CSVs. Versioning the data would make it clear which dataset we are working with and would allow updates to occur. Another benefit of versioning: if an error were found in the target datastore, the source file could be identified easily. Grouping like steps into a single SQL file limits the number of files to maintain and keeps related work together. Finally, the target data becomes usable more quickly than with the horizontal approach.
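As an illustration of those versioning benefits, tracing a bad row back to its source load, or replacing a single load, becomes a short operation (names and version values are again hypothetical):

```sql
-- Hypothetical: find which source load produced a suspect row.
SELECT player_id, player_name, data_version
FROM staging.player
WHERE player_name = 'Suspect Value';

-- Hypothetical: remove and reload a single version instead of
-- truncating and refreshing every table.
DELETE FROM staging.player
WHERE data_version = '2024-season-01';
```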