Azure Data Factory

December 15, 2015

by Analysts

A brief introduction

The enterprise data landscape is growing exponentially in volume, variety, and complexity. Unstructured, semi-structured, and cloud-borne data need to be processed through various tools, frameworks, and custom services.

Azure Data Factory (ADF) is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. It is part of Microsoft’s analytics suite rather than a standalone ETL (Extract, Transform, Load) tool: it composes, orchestrates, and monitors pipelines over diverse data from many origins, which makes it a versatile data collector.

Features like automatic cluster management, retries for transient failures, configurable timeout policies, and alerting make it convenient to process data in a managed environment.
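
As a rough sketch (the values are illustrative, this is only a fragment of an activity definition, and the exact set of options depends on the activity type), these behaviors are configured per activity through a policy block in the JSON definition:

```json
{
  "name": "CopyFromFileToSql",
  "type": "Copy",
  "policy": {
    "timeout": "01:00:00",
    "retry": 3,
    "concurrency": 1,
    "executionPriorityOrder": "NewestFirst"
  }
}
```

If a data slice fails with a transient error, the service retries it up to the configured count before marking it as failed and raising any configured alerts.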

The USPs of Azure Data Factory are high availability, fault tolerance, and a fully managed service.

Data Factory works across on-premises data sources, cloud data sources, and SaaS to ingest, prepare, transform, analyze, and finally publish data. Like a manufacturing factory that runs equipment to take in raw materials and convert them into finished goods, Data Factory orchestrates existing services that collect raw data from diverse sources and transform it into ready-to-use information.

In addition to the above functionality, ADF provides a monitoring dashboard where users can track their data pipelines.

Concepts of Azure Data Factory

Dataset: Datasets are named references/pointers to the data you want to use as an input or an output of an Activity, e.g. tables or files.

Activity: Activities define the actions to perform on your data, e.g. a Hive query or a Copy operation.

Pipeline: Pipelines are a logical grouping of Activities that can be managed, monitored, and scheduled as a unit.

Linked service: Linked services define the information Data Factory needs to connect to external resources (such as connection strings), e.g. a SQL Server database or a Hadoop cluster.
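
To make these concepts concrete, here is a minimal, hypothetical sketch of a linked service in ADF's JSON authoring model (the name and connection string are placeholders):

```json
{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<db>;User ID=<user>;Password=<password>;Encrypt=True"
    }
  }
}
```

A dataset then points at data reachable through that linked service and declares how often a slice of it is produced:

```json
{
  "name": "AzureSqlOutputTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "EmployeeData"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
```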

Developers can also create a Data Factory with the Azure Portal, PowerShell (using Azure Resource Manager templates), Visual Studio (with Azure .NET SDK 2.7 or later), or the REST APIs (Azure Data Factory SDK). Users author activities, combine them into a pipeline, and set an execution schedule.
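
For example, a pipeline definition (again a hypothetical sketch; the output dataset is the one sketched above, and the input blob dataset is assumed to exist) groups activities and carries the active period that drives the schedule:

```json
{
  "name": "CopyEmployeeDataPipeline",
  "properties": {
    "description": "Copy employee data from blob storage to an Azure SQL table once a day",
    "activities": [
      {
        "name": "CopyFromBlobToSql",
        "type": "Copy",
        "inputs": [ { "name": "EmployeeBlobInput" } ],
        "outputs": [ { "name": "AzureSqlOutputTable" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink" }
        },
        "policy": {
          "timeout": "01:00:00",
          "retry": 3
        },
        "scheduler": {
          "frequency": "Day",
          "interval": 1
        }
      }
    ],
    "start": "2015-12-01T00:00:00Z",
    "end": "2015-12-31T00:00:00Z"
  }
}
```

The pipeline is active between start and end, and one daily slice of the output dataset is produced for each day in that window.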

ADF can be handy in the following scenarios:

Access to data sources: such as on-premises SQL Server, Azure SQL Database, Azure Blob Storage, etc.

Data transformation: through Hive, Pig, C#, etc.

Monitoring: of data pipelines, validation, and execution of scheduled jobs.

Loading into the desired destination: such as on-premises SQL Server, Azure SQL Database, or Azure Blob Storage.

Prerequisites to use ADF

  • Microsoft Data Management Gateway (to reach on-premises data sources)
  • An Azure subscription: signing up at azure.com requires credit/debit card information; a 30-day free trial is available

It’s a common notion that ADF is here to replace SSIS, but ADF is a complementary service and does not compete with SSIS. The two compare as follows:

Development Tool: SSIS uses SSDT/BIDS; ADF uses the Azure Portal, the ADF Editor, PowerShell, and JSON scripts.
Data Sources and Destinations: SSIS supports many sources and destinations; ADF supports Azure Storage, Azure SQL Database, SQL Server, and the file system.
Data Transformations: SSIS ships with many transformations; ADF has fewer activities, so you often need to write your own.
Environment: SSIS needs administrative effort and suitable hardware and software; with ADF, Azure takes care of everything.
Pricing: SSIS is pay-for-features (licensed); ADF is pay-per-usage.
Error Handling: SSIS offers event handlers and failure precedence constraints; ADF offers alert rules and error-message logging, but no event handlers.
Deployment: SSIS uses the Deployment Wizard; ADF uses PowerShell scripts.
Monitoring: SSIS relies on SSIS logging and catalog reports; ADF provides a powerful GUI with data-slice execution and drill-through monitoring.
Data Lineage: SSIS has none; ADF has a data lineage feature.
Security: SSIS is role-based for deploy, execute, and monitor; ADF uses roles such as Owner, Contributor, Reader, Data Factory Contributor, and User Access Administrator.

On-premises file to Azure SQL Database

Steps to copy .csv file data to a cloud database:

  1. Place the source file (.csv) on the local machine.
  2. Create a SQL Database in the Azure portal.
  3. Create the destination table in the above-created database using SSMS on your local machine.
  4. Create a Data Factory in the Azure portal for the required transfer (a JSON sketch of the source-side definitions follows this list):
  • Create a Gateway to access on-premises resources (file, database).
  • Create Linked Services for the source (file) and the destination (Azure SQL table).
  • Create Datasets for the source and the destination.
  • Create a Pipeline to move data from the source to the destination.
  5. Monitor and manage: for each dataset we can see the status of each data slice (whether it was transferred or not, failure logs if any, and a re-run option if required).
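
For step 4, the on-premises side needs a linked service bound to the Data Management Gateway. A hypothetical sketch of the source-side definitions (gateway, share, credential, and file names are placeholders):

```json
{
  "name": "OnPremisesFileServerLinkedService",
  "properties": {
    "type": "OnPremisesFileServer",
    "typeProperties": {
      "host": "\\\\localmachine\\sharedfolder",
      "gatewayName": "MyDataManagementGateway",
      "userId": "<domain\\user>",
      "password": "<password>"
    }
  }
}
```

```json
{
  "name": "OnPremisesCsvInput",
  "properties": {
    "type": "FileShare",
    "linkedServiceName": "OnPremisesFileServerLinkedService",
    "typeProperties": {
      "folderPath": "data",
      "fileName": "employees.csv",
      "format": { "type": "TextFormat", "columnDelimiter": "," }
    },
    "external": true,
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
```

The copy pipeline then takes this dataset as its input and an AzureSqlTable dataset (as in the earlier example) as its output, using a FileSystemSource and a SqlSink in the copy activity's typeProperties.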