Information Schema Introduction

network

With the CSV dataset landed in SQL server the next step is to understand what type of data is in the dataset. While this can be done manually, we are developers why not perform these tasks programmatically? To be successful in this step of the adventure, we need to learn about INFORMATION_SCHEMA.

INFORMATION_SCHEMA is an ANSI standard set of read-only views that expose meta-data. Many relational database management systems implement this standard. To name a few:

  • Microsoft's SQL Server
  • Apache Hive
  • PostgreSQL
  • MariaDB

While there are many views of this meta-data available, there are 2 that will be used heavily throughout implementing this ETL. They are INFORMATION_SCHEMA.COLUMNS and INFORMATION_SCHEMA.TABLES. These 2 meta-data views provide a ton of information used in everyday querying and data discovery. Let's jump right into some example queries:

-- Fetch all tables that have 'analysis' in the table name
select * from INFORMATION_SCHEMA.TABLES where TABLE_NAME like '%analysis%';

-- Fetch all columns with tables that start with 'circuits'
select * from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME like 'circuits%';

-- Fetch all columns with varchar(max) as their datatype
select * from INFORMATION_SCHEMA.COLUMNS where DATA_TYPE = 'varchar' and CHARACTER_MAXIMUM_LENGTH = -1;

-- Fetch all columns of any datatype that are not nullable
select * from INFORMATION_SCHEMA.COLUMNS where IS_NULLABLE = 'NO';

-- Fetch all columns of any datatype that have a default value other than null
select * from INFORMATION_SCHEMA.COLUMNS where COLUMN_DEFAULT is not null;

As you can see by the examples above, this meta-data is all at our disposal. It's great that we can view all this information as it will assist in writing ETLs that are dynamic. An example of this is to add a column to a table if it does not already exist.

-- Declare an int variable to store the column count
declare @containsColumn int = 0;

-- Query INFORMATION_SCHEMA for the count of the columns in the table with the specific column name
-- Store the count in the declared variable
select @containsColumn = count(1) from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'CircuitsTypeAnalysis' and COLUMN_NAME = 'Sample';

-- Check if the column exists in the table
if(@containsColumn > 0)
begin
  -- Add the column to the table, as it does not existS
  alter table CircuitsTypeAnalysis add sample bit
end

This example is a little more involved than the others. It can only get more complicated from here. In the next post, you will see how to create a SQL query using the data from INFORMATION_SCHEMA then executing the constructed query. Pretty wild, heh?