ADF: Incremental Data Loads and Deployments

March 2, 2018

by Analysts

About Azure Data Factory (ADF)

Azure Data Factory (ADF) is a fully managed service for composing data storage, processing, and movement services into streamlined, scalable, and reliable data-production pipelines. Its Copy activity copies data from a source to a sink (destination) in a batch, and can be used to migrate data from on-premises to Azure, from Azure to on-premises, or from Azure to Azure.
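As a trimmed-down illustration, an ADF v1 copy pipeline is defined in JSON and deployed from a file; the pipeline, dataset, and table names and the one-day active window below are illustrative, not taken from this post.

# Illustrative ADF v1 pipeline with a single Copy activity (SQL source to Blob sink).
# All names here are placeholders.
$pipelineJson = @'
{
  "name": "CopyClientData",
  "properties": {
    "activities": [
      {
        "name": "SqlToBlobCopy",
        "type": "Copy",
        "inputs":  [ { "name": "InputDataset" } ],
        "outputs": [ { "name": "OutputDataset" } ],
        "typeProperties": {
          "source": { "type": "SqlSource", "sqlReaderQuery": "SELECT * FROM dbo.Orders" },
          "sink":   { "type": "BlobSink" }
        }
      }
    ],
    "start": "2018-03-01T00:00:00Z",
    "end":   "2018-03-02T00:00:00Z"
  }
}
'@
# The JSON is written to a file and deployed with the cmdlets shown later in this post.
Set-Content -Path .\CopyClientData.json -Value $pipelineJson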

Limitations

While ADF is a great tool for performing ETL activities, it comes with its own set of limitations. The primary one is its inability to support parameterization. Parameters are essential for loading only the data that has changed since the last successful load, instead of performing a full load every time. When dealing with millions of rows, the inability to load incrementally makes the whole ETL process seriously inefficient.
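To make the difference concrete: a full load re-copies every row on every run, while an incremental load filters the source query on a watermark such as the last successful run date. A minimal sketch (the table and column names are assumptions for illustration):

# Full load: copies every row on every run.
$fullLoadQuery = "SELECT * FROM dbo.Orders"

# Incremental load: copies only rows changed since the last successful run.
# $LastRunDate would come from a configuration table maintained per client.
$LastRunDate = '2018-03-01 00:00:00'
$incrementalQuery = "SELECT * FROM dbo.Orders WHERE ModifiedDate > '" + $LastRunDate + "'"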

A Proposed Solution

Although we believe Microsoft will eventually fix this limitation in an upcoming version of ADF, in the meantime we have built an effective stop-gap solution. We have also made sure that the solution is configurable and scales to each new client and each new source/target set, reducing the overhead for our customers in building, configuring, and maintaining ETL activities.

This solution addresses the following needs:

  • Add and manage clients (source/target systems) dynamically
  • Pass parameters to ADF to load data incrementally
  • Drive and manage multiple jobs in Azure
  • Edit jobs dynamically

We have built our solution around a PowerShell script that is driven by a pair of SQL configuration tables.
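The post does not show the schema of those configuration tables; the sketch below is a plausible reconstruction from the columns referenced by the queries later in this post, with data types assumed.

# Hypothetical schema for the two configuration tables used by the script.
# Column names come from the queries below; data types are assumptions.
$createConfigTables = @"
CREATE TABLE dbo.ServerDetails (
    ID           INT IDENTITY(1,1) PRIMARY KEY,
    ServerName   NVARCHAR(128),
    DatabaseName NVARCHAR(128),
    UserName     NVARCHAR(128),
    SecureId     NVARCHAR(256),    -- reference to the stored credential
    StartTime    DATETIME,
    SQLQuery     NVARCHAR(MAX),    -- per-client extraction query
    Client       NVARCHAR(128)
);

CREATE TABLE dbo.ClientConfigurations (
    Client       NVARCHAR(128) PRIMARY KEY,
    LastRunDate  DATETIME          -- watermark for the incremental load
);
"@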

Solution Overview

To implement this solution, follow these steps using the accompanying PowerShell commands:

Note: A generic naming convention is used for linked services, datasets, and pipelines.
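For example, the per-client resource names and generated JSON file names used below might be derived along these lines (the exact convention is not shown in the post; the scheme here is illustrative):

# Illustrative naming convention; the exact scheme is not part of the original script.
$ClientName                      = "Contoso"   # taken from the ServerDetails row being processed
$linkedServiceName               = "LinkedService-$ClientName"
$inputdatasetName                = "InputDataset-$ClientName"
$outputdatasetName               = "OutputDataset-$ClientName"
$FinalLinkedServiceJSONFileName  = ".\Generated\$linkedServiceName.json"
$FinalInputdatasetJSONFileName   = ".\Generated\$inputdatasetName.json"
$FinaloutputdatasetJSONFileName  = ".\Generated\$outputdatasetName.json"
$FinalPipelineJSONFileName       = ".\Generated\Pipeline-$ClientName.json"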

  • Get the LastRunDate for each client from the target table and update it in the configuration table
  • Get the list of clients (and their connection details) whose data needs to be copied to the cloud
$SqlCmd.CommandText = "SELECT ID, ServerName, DatabaseName, UserName, SecureId, StartTime, SQLQuery, Client FROM ServerDetails"
  • Use a ForEach loop in PowerShell to iterate over the list of clients
    Create the Linked Service if it is not already available
# The linked-service name and JSON file follow the same generic naming convention as the datasets
$IsLinkedServiceExist = Get-AzureRmDataFactoryLinkedService -ResourceGroupName $resource `
    -DataFactoryName $datafactory -Name $linkedServiceName
if ([string]::IsNullOrEmpty($IsLinkedServiceExist)) {
    New-AzureRmDataFactoryLinkedService -ResourceGroupName $resource -DataFactoryName $datafactory `
        -File $FinalLinkedServiceJSONFileName -Force
}
  • Similarly, create Input and Output datasets if they are not available
Input dataset

$IsInputDataSetExist = Get-AzureRmDataFactoryDataset -ResourceGroupName $resource `
    -DataFactoryName $datafactory -Name $inputdatasetName
if ([string]::IsNullOrEmpty($IsInputDataSetExist)) {
    New-AzureRmDataFactoryDataset -ResourceGroupName $resource -DataFactoryName $datafactory `
        -File $FinalInputdatasetJSONFileName -Force
}
Output dataset

$IsOutputDataSetExist = Get-AzureRmDataFactoryDataset -ResourceGroupName $resource `
    -DataFactoryName $datafactory -Name $outputdatasetName
if ([string]::IsNullOrEmpty($IsOutputDataSetExist)) {
    New-AzureRmDataFactoryDataset -ResourceGroupName $resource -DataFactoryName $datafactory `
        -File $FinaloutputdatasetJSONFileName -Force
}
  • Get the LastRunDate from the configuration table for each client
$SqlCmd.CommandText = "SELECT LastRunDate FROM ClientConfigurations WHERE Client = '" + $ClientName + "'"
  • Create/update the pipeline with the one-time run configuration (a sketch of how the per-client dates can be injected into the pipeline JSON follows this list)
New-AzureRmDataFactoryPipeline -ResourceGroupName $resource -DataFactoryName $datafactory `
    -File $FinalPipelineJSONFileName -Force
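The post does not show how the LastRunDate actually makes it into the pipeline definition. One plausible way to wire it up is to keep a per-client template JSON with placeholder tokens and have the script replace them before deployment; the placeholder names, file paths, and window logic below are assumptions, not part of the original script.

# Hypothetical glue code: patch a per-client pipeline template with the incremental window.
# Placeholder tokens (__START__, __END__, __QUERY__) and file paths are assumptions.
$windowStart = [datetime]$LastRunDate
$windowEnd   = Get-Date

# Build the incremental extraction query from the per-client SQLQuery column
# (this assumes SQLQuery holds a plain SELECT with no WHERE clause).
$incrementalQuery = "$SQLQuery WHERE ModifiedDate > '" + $windowStart.ToString("yyyy-MM-dd HH:mm:ss") + "'"

# Replace the placeholders in the template and write the file deployed above.
(Get-Content .\Templates\PipelineTemplate.json -Raw) `
    -replace '__START__', $windowStart.ToString("yyyy-MM-ddTHH:mm:ssZ") `
    -replace '__END__', $windowEnd.ToString("yyyy-MM-ddTHH:mm:ssZ") `
    -replace '__QUERY__', $incrementalQuery |
    Set-Content $FinalPipelineJSONFileName

# After a successful run, record the new watermark for this client.
$SqlCmd.CommandText = "UPDATE ClientConfigurations SET LastRunDate = '" +
    $windowEnd.ToString("yyyy-MM-dd HH:mm:ss") + "' WHERE Client = '" + $ClientName + "'"
$SqlCmd.ExecuteNonQuery()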

 

Solution Prerequisites

To use this solution efficiently, you need to have the following prerequisites in place:

  • ADF (including Data Management Gateway for on-premises source databases) as the ETL solution
  • Azure PowerShell cmdlets package installed on the local machine (a minimal sign-in sketch follows this list)
  • Azure Active Directory account
  • Details of the on-premises servers whose data should be copied to the cloud
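A minimal setup-and-sign-in sketch, assuming the AzureRM module family that the cmdlets in this post come from (the subscription name is a placeholder):

# One-time install of the AzureRM modules used by the script.
Install-Module -Name AzureRM -Scope CurrentUser

# Sign in with the Azure Active Directory account and select the target subscription.
Login-AzureRmAccount
Select-AzureRmSubscription -SubscriptionName "<your-subscription-name>"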

Conclusion

This custom solution makes it possible to manage multiple source/target systems through configuration, helping our customers avoid ETL overhead. It also performs incremental data loads instead of full loads, avoiding the inefficiencies described above, and it automates passing parameters to the pipeline while tracking the status of each job.

If you are facing similar challenges, this solution should fit neatly into your environment.