Extracting with Bulk Insert


For this ETL journey, I will be using Microsoft tooling, more specifically Microsoft SQL Server. I have already identified the data set for this learning experience. The data can be downloaded from Kaggle as a set of comma-separated value (CSV) files, 13 files in total.

Now that the CSVs are on my machine, something needs to be done to get the data into SQL Server. There are two options for performing this operation. The first is to create a database in SQL Server, then right-click on the database and navigate to Tasks > Import Flat File.... This flow walks you through a wizard to import the data into the database. It is a fine solution; my issue with it is that I can't record the import in a way that makes it repeatable. On to the second, and preferred, option. Perhaps the more "technical" option.

The preferred approach uses create table and bulk insert commands in a SQL query. This query is something we can write, save, and commit to source control to track its history. The general order of operations for these extract scripts will be to create un-refined tables in the landing database, then import the CSV contents directly into those tables without modification. Here is an example of a landing script:

-- Define the target landing database 
use FormulaOneLanding

-- Determine if the table exists, if so, drop the table
if object_id('dbo.Constructors') is not null
  drop table dbo.Constructors
go

-- Create the table that will house the contents of the CSV
create table dbo.Constructors (
 ConstructorId int,
 ConstructorRef varchar(max),
 Name varchar(max),
 Nationality varchar(max),
 Url varchar(max)
)
go

-- Read the CSV and insert the data into the previously created table
bulk insert dbo.Constructors
from '{location to csv}\constructors.csv'
with (
  fieldterminator = ',',
  rowterminator = '0x0a',
  firstrow = 2
)
go

A couple of notes on the decisions made in this query. First, I chose to drop each table before importing its CSV. This works because I have a complete data set at this point in time. If the data had arrived as a series of incremental updates, dropping the table would not be the preferred method. In that case, I would want to load the files without dropping the table and instead add a version or file name column to identify the source of each record, as sketched below.

Second, I chose not to tightly define the column data types in the landing tables. In scanning through the data, there were columns that appeared to be all integers, yet were not. In other cases the column data appeared to be short (less than a couple of dozen characters), but upon closer inspection two or three records were much larger. At this point, the main goal is to move the data into the desired technology. It is the responsibility of the next steps to validate the data by programmatic means.
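To make the incremental scenario concrete, here is a minimal sketch of what that might look like, assuming the landing table keeps its rows across loads and a SourceFile column records where each row came from. The staging table, the SourceFile column, and the hard-coded file name are assumptions for illustration only; they are not part of the scripts used in this series.

-- Hypothetical incremental variant: create the landing table once and keep it across loads
use FormulaOneLanding
go

if object_id('dbo.Constructors') is null
  create table dbo.Constructors (
    ConstructorId int,
    ConstructorRef varchar(max),
    Name varchar(max),
    Nationality varchar(max),
    Url varchar(max),
    SourceFile varchar(max)  -- assumed column identifying the originating CSV
  )
go

-- Stage the CSV into a table matching its layout exactly,
-- so bulk insert can map the columns one-to-one
if object_id('dbo.Constructors_Stage') is not null
  drop table dbo.Constructors_Stage
go

create table dbo.Constructors_Stage (
  ConstructorId int,
  ConstructorRef varchar(max),
  Name varchar(max),
  Nationality varchar(max),
  Url varchar(max)
)
go

bulk insert dbo.Constructors_Stage
from '{location to csv}\constructors.csv'
with (
  fieldterminator = ',',
  rowterminator = '0x0a',
  firstrow = 2
)
go

-- Append the staged rows, stamping each with the name of the file it came from
insert into dbo.Constructors (ConstructorId, ConstructorRef, Name, Nationality, Url, SourceFile)
select ConstructorId, ConstructorRef, Name, Nationality, Url, 'constructors.csv'
from dbo.Constructors_Stage
go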

In the next post, all of the data will be landed. It will be time to validate what we have and to understand how we may want to shape the data in its final form.
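As a rough preview of that validation, queries along the following lines can show whether a loosely typed column really holds integers and how long its values actually are. The columns used here come from the Constructors landing table above and are purely illustrative stand-ins; the real checks will depend on each table's contents.

-- Sketch of programmatic checks against a loosely typed landing table
use FormulaOneLanding
go

-- Values in a column that fail to parse as integers
-- (useful where a column looked numeric in the CSV but is not)
select distinct ConstructorRef
from dbo.Constructors
where try_cast(ConstructorRef as int) is null

-- Longest observed values, to inform tighter varchar sizes later on
select
  max(len(ConstructorRef)) as MaxConstructorRefLen,
  max(len(Name)) as MaxNameLen,
  max(len(Nationality)) as MaxNationalityLen,
  max(len(Url)) as MaxUrlLen
from dbo.Constructors
go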