Extract, Transform, Load - An Overview

No matter what your role is in the IT domain, you will have some task that requires you to work with a data set. There is an ever-growing subset of positions in IT that focus solely on manipulating or interpreting data. As one might expect, the tools used by professionals in that data sub-domain are also useful to a software developer.

The ability to quickly ingest a large dataset from one system into another is important. One way I have been successful at ingesting data is the Extract, Transform, and Load process, better known by its acronym: ETL. What are these three steps?

Extract

The start of the process is to fetch the data from the original source or sources. The goal of this step is to land the structure and data in the new technology "as is". With copies of the data in hand, the next part of the process is to perform any necessary data investigation and validation. Once you understand the data, and know which pieces of information are excluded due to validation issues, the next step can begin.
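As a minimal sketch of what this can look like in Microsoft SQL Server (the staging table, column names, and file path here are all hypothetical), landing the data "as is" often means a staging table of plain text columns, so nothing is rejected or coerced on the way in:

```sql
-- Hypothetical staging table: every column lands as NVARCHAR so the raw
-- values arrive untouched; typing and validation happen later.
CREATE TABLE staging.races_raw (
    raceId    NVARCHAR(255),
    year      NVARCHAR(255),
    round     NVARCHAR(255),
    circuitId NVARCHAR(255),
    name      NVARCHAR(255),
    date      NVARCHAR(255),
    time      NVARCHAR(255)
);

-- Fetch the source CSV straight into the staging table "as is".
BULK INSERT staging.races_raw
FROM 'C:\data\f1\races.csv'   -- hypothetical path to the extracted file
WITH (
    FORMAT = 'CSV',
    FIRSTROW = 2,             -- skip the header row
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
);
```

With everything stored as text, simple queries against the staging table can drive the investigation and validation work before anything is transformed.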

Transform

This step applies a series of rules to the data, as uncovered during the data investigation. The rules applied may clean, standardize, aggregate, or translate the data, among many other activities. Depending on the number and complexity of the rules to be applied, you may repeat this step over several iterations, incrementally mutating the dataset until it matches the target.
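Continuing the hypothetical staging table from the extract step, one such pass might replace the source's placeholder for missing values (assumed here to be '\N') with real NULLs and cast the text columns into proper types; TRY_CAST returns NULL instead of failing when a value will not parse:

```sql
-- Rule 1: swap the assumed '\N' missing-value marker for a real NULL.
UPDATE staging.races_raw
SET time = NULL
WHERE time = '\N';

-- Rule 2: cast the text columns into typed columns, writing the result
-- to a new staging table so each iteration's output stays inspectable.
SELECT
    CAST(raceId AS INT)      AS raceId,
    CAST(year AS INT)        AS raceYear,
    CAST(round AS INT)       AS round,
    CAST(circuitId AS INT)   AS circuitId,
    name,
    CAST(date AS DATE)       AS raceDate,
    TRY_CAST(time AS TIME)   AS raceTime  -- NULL when the value cannot be parsed
INTO staging.races_clean
FROM staging.races_raw;
```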

Load

The final step in the process is to place the data in its consumable resting place, also known as the target. The target contains the data in its usable form. There are many different types of targets: a CSV file to share, a data warehouse, or a database for application consumption. That is by no means an exhaustive list, as every business or team has a different set of customers looking for this data.
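To close the loop on the same hypothetical tables, loading a database target can be a single set-based insert from the cleaned staging table into the typed destination:

```sql
-- Hypothetical target table holding the final, typed schema.
CREATE TABLE dbo.races (
    raceId    INT PRIMARY KEY,
    raceYear  INT NOT NULL,
    round     INT NOT NULL,
    circuitId INT NOT NULL,
    name      NVARCHAR(255) NOT NULL,
    raceDate  DATE NOT NULL,
    raceTime  TIME NULL
);

-- Load the cleaned staging data into the target in one insert.
INSERT INTO dbo.races (raceId, raceYear, round, circuitId, name, raceDate, raceTime)
SELECT raceId, raceYear, round, circuitId, name, raceDate, raceTime
FROM staging.races_clean;
```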

I have found a really interesting dataset on Kaggle, as I have been following Formula 1 very closely since COVID started. The next couple of posts will work through each of these steps with the Formula 1 data set, writing about the decisions and tools used. For this ETL I will be using Microsoft SQL Server, so the posts will also contain ways to use INFORMATION_SCHEMA data to our advantage.
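As a small preview of that, and still against the hypothetical staging schema sketched above, INFORMATION_SCHEMA.COLUMNS can report exactly what landed in a schema:

```sql
-- List every column in the staging schema with its type and nullability,
-- a quick way to confirm a landed dataset matches expectations.
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'staging'
ORDER BY TABLE_NAME, ORDINAL_POSITION;
```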