
ADF: Incremental Data Loads and Deployments

March 2, 2018

by Analysts

About Azure Data Factory (ADF)

The ADF service is a fully managed service for composing data storage, processing, and movement services into streamlined, scalable, and reliable data production pipelines. The Copy Activity copies data from a source to a sink (destination) in batches. It can be used to migrate data from on-premises to Azure, from Azure to on-premises, or from Azure to Azure.
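The snippets later in this post assume an existing resource group and data factory ($resource and $datafactory); for completeness, an ADF (v1) factory can be provisioned from PowerShell roughly as follows, where the resource group name, factory name, and region are placeholder values.

# Placeholder names; substitute your own resource group, factory name, and region
$resource    = "MyResourceGroup"
$datafactory = "MyDataFactory"
New-AzureRmDataFactory -ResourceGroupName $resource -Name $datafactory -Location "West US"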

Problem Statement

While ADF is a great tool for performing ETL activities effectively, it comes with its own set of limitations, the primary one being its lack of support for parameterization. Parameters are essential for loading only the data that has changed since the last successful load, rather than performing a full load every time. When dealing with millions of rows of data, the inability to perform incremental loads makes the whole ETL process seriously inefficient.
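To illustrate, an incremental load filters the source query on a watermark such as the date of the last successful run, so that only changed rows are copied; the table and column names below are hypothetical and only sketch the pattern our solution parameterizes.

# Full load: every row is copied on every run (hypothetical table name)
$fullLoadQuery = "SELECT * FROM dbo.Orders"

# Incremental load: only rows modified since the last successful run are copied,
# driven by a LastRunDate watermark kept in a configuration table
$incrementalQuery = "SELECT * FROM dbo.Orders WHERE ModifiedDate > '" + $LastRunDate + "'"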

We believe Microsoft will address this limitation in an upcoming version of ADF, but we have built an effective stop-gap solution to use until that happens.

We have also made sure the solution is configurable and scales to each new client and each new data source/target set, reducing the overhead for our customers in building, configuring, and maintaining ETL activities.

Our Proposed Solution

With the proposed solution, we plan to address the following issues:

  • Ability to add/manage clients (source/target systems) dynamically
  • Ability to pass parameters to ADF to load data incrementally
  • Ability to drive/manage multiple jobs in Azure
  • Ability to edit jobs dynamically

We have used a PowerShell script to build our solution.

Solution Overview

Our approach can be depicted in the following diagram:
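In outline, the driver script ties these pieces together roughly as sketched below; the subscription name, connection string, and loop body are illustrative assumptions, and the individual commands are shown step by step in the next section.

# Sign in and select the subscription that hosts the data factory (placeholder subscription name)
Login-AzureRmAccount
Select-AzureRmSubscription -SubscriptionName "MySubscription"

# Read the list of clients (source/target systems) from the configuration database
$SqlConn = New-Object System.Data.SqlClient.SqlConnection("<configuration DB connection string>")
$SqlCmd  = $SqlConn.CreateCommand()
$SqlCmd.CommandText = "SELECT ID, ServerName, DatabaseName, UserName, SecureId, StartTime, SQLQuery, Client FROM ServerDetails"
$clients = New-Object System.Data.DataTable
$adapter = New-Object System.Data.SqlClient.SqlDataAdapter($SqlCmd)
[void]$adapter.Fill($clients)

# For each client, deploy the ADF artifacts and kick off an incremental, one-time run
foreach ($client in $clients.Rows) {
    # 1. Create the linked service, input dataset, and output dataset if they do not exist
    # 2. Look up the client's LastRunDate in the configuration table
    # 3. Generate the client-specific pipeline JSON and deploy it as a one-time run
}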

Technical Details

To implement our solution, we will follow these steps, using the accompanying PowerShell commands:

Note: A generic naming convention is used for linked services, datasets, and pipelines.

  • Get the LastRunDate for each Client from the Target table and update that in the configuration table
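A minimal sketch of this step is shown below, assuming the target table exposes a load timestamp column; the target table and column names are hypothetical, while ClientConfigurations matches the configuration table used in the later steps.

# Hypothetical target table (dbo.TargetTable) and timestamp column (LoadDate)
$SqlCmd.CommandText = "UPDATE ClientConfigurations
                       SET LastRunDate = (SELECT MAX(LoadDate) FROM dbo.TargetTable)
                       WHERE Client = '" + $ClientName + "'"
[void]$SqlCmd.ExecuteNonQuery()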
  • Get the list of clients and respective details whose data need to be copied to the cloud
$SqlCmd.CommandText = "SELECT ID, ServerName,
DatabaseName, UserName ,SecureId,StartTime,
SQLQuery, Client From ServerDetails"
  • Use the ForEach loop mechanism within PowerShell to go through the list of clients
    Create the Linked Service if it is not available
# $linkedServiceName and $FinalLinkedServiceJSONFileName follow the same naming
# convention as the dataset variables used below
$IsLinkedServiceExist = Get-AzureRmDataFactoryLinkedService -ResourceGroupName $resource `
    -DataFactoryName $datafactory -Name $linkedServiceName
if ([string]::IsNullOrEmpty($IsLinkedServiceExist)) {
    New-AzureRmDataFactoryLinkedService -ResourceGroupName $resource `
        -DataFactoryName $datafactory -File $FinalLinkedServiceJSONFileName -Force
}
  • Similarly, create Input and Output datasets if they are not available
Input dataset

$IsInputDataSetExist = Get-AzureRmDataFactoryDataset -ResourceGroupName $resource `
    -DataFactoryName $datafactory -Name $inputdatasetName
if ([string]::IsNullOrEmpty($IsInputDataSetExist)) {
    New-AzureRmDataFactoryDataset -ResourceGroupName $resource `
        -DataFactoryName $datafactory -File $FinalInputdatasetJSONFileName -Force
}
Output dataset

$IsOutputDataSetExist = Get-AzureRmDataFactoryDataset -ResourceGroupName $resource `
    -DataFactoryName $datafactory -Name $outputdatasetName
if ([string]::IsNullOrEmpty($IsOutputDataSetExist)) {
    New-AzureRmDataFactoryDataset -ResourceGroupName $resource `
        -DataFactoryName $datafactory -File $FinaloutputdatasetJSONFileName -Force
}
  • Get the LastRunDate from the configuration table for each client
$SqlCmd.CommandText = "SELECT LastRunDate  LastRunDate 
FROM ClientConfigurations WHERE Client ="+$ClientName
  • Create/update the pipeline with the one-time run configuration
New-AzureRmDataFactoryPipeline -ResourceGroupName $resource `
    -DataFactoryName $datafactory -File $FinalPipelineJSONFileName -Force
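The parameterization itself happens when the script generates the client-specific pipeline JSON that the command above deploys; a minimal sketch of that generation step is shown below, assuming a template file whose placeholder tokens are replaced with the client name, LastRunDate, and start time (the template path and token names are illustrative assumptions).

# Build the client-specific pipeline JSON from a template by replacing placeholder tokens
$template = Get-Content -Path ".\Templates\PipelineTemplate.json" -Raw
$pipelineJson = $template -replace "__CLIENT__", $ClientName `
                          -replace "__LASTRUNDATE__", $LastRunDate `
                          -replace "__STARTTIME__", (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ssZ")

# With a one-time run configuration ("pipelineMode": "OneTime") in the template, ADF runs the
# deployed pipeline once for the supplied window; this generated file is what the
# New-AzureRmDataFactoryPipeline call above deploys
Set-Content -Path $FinalPipelineJSONFileName -Value $pipelineJson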


Solution Prerequisites

To use the solution efficiently, the following prerequisites need to be in place.

  • ADF (including the Data Management Gateway for on-premises source databases) as the ETL solution
  • The Azure PowerShell cmdlets package installed on the local machine (a sample setup is sketched after this list)
  • An Azure Active Directory account
  • A list of on-premises server details whose data should be copied to the cloud
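For the PowerShell and Azure Active Directory prerequisites, a typical one-time setup on the machine that runs the script looks roughly like this; the module shown is the classic AzureRM module that the snippets above rely on.

# Install the AzureRM module that provides the Get-/New-AzureRmDataFactory* cmdlets
Install-Module -Name AzureRM -Scope CurrentUser

# Sign in with the Azure Active Directory account that has access to the data factory
Login-AzureRmAccount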

Conclusion

This custom solution gives our customers the ability to manage multiple source/target systems through configuration, thereby avoiding ETL overhead. It also enables incremental data loads instead of full loads, eliminating unnecessary inefficiency, and it automates the passing of parameters to the pipeline while tracking the status of each job.

If these are some of the challenges you are facing, this solution could be a perfect fit.