Authenticate Azure Data Factory with Azure Data Lake Gen 2 using Managed Identities

Roshan Joe Vincent
3 min read · Feb 27, 2021



Azure Data Lake Storage Gen2 brings together the capabilities of Azure Data Lake Storage Gen1 and Blob Storage. The addition of a hierarchical namespace enables efficient data access, and together with other performance improvements it makes ADLS Gen2 a very viable storage solution for big data.

On the other hand, Azure Data Factory is a cloud-based ETL service for data integration and transformation. An easy-to-use UI aids the user in creating code-free ETL pipelines.

In this post, I will discuss how you can connect a Data Factory to ADLS Gen2 using Managed Identities. There are several ways to connect to your Gen2 Data Lake from Azure Data Factory, for example:

  • Service Principal
  • Account Key
  • Managed Identity

However, with each of the first two methods, a user has to manage secrets, either through Azure Key Vault or by putting the client credentials directly into Azure Data Factory (not recommended). To remove this burden from developers, Azure introduced Managed Identities. A managed identity gives an Azure resource its own identity in Azure Active Directory, created automatically alongside the resource, so Azure services can authenticate to each other without stored credentials. That identity can then be granted access to other Azure resources using role-based access control (RBAC).
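
To get a feel for what this buys you, here is a minimal sketch in Python (using the azure-identity package) of how code running on a resource with a managed identity obtains a token with no secrets anywhere in the code; the storage scope shown is just an example:

```python
# Minimal sketch: on an Azure resource with a managed identity, code can
# obtain access tokens without any stored secrets. DefaultAzureCredential
# tries several sources and falls back to the resource's managed identity.
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Request a token for Azure Storage; no client secret or account key
# appears anywhere in the code or configuration.
token = credential.get_token("https://storage.azure.com/.default")
print(token.expires_on)
```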

So, in our case of authorizing Data Factory against the Data Lake, all we need to do is assign a role to the Data Factory on the Data Lake instance. You need an instance of a Data Factory and a Data Lake for this tutorial; both can be created from the Azure Portal.

Once you have the two resources up and running, we can grant the Data Factory access to ADLS Gen2 via role-based access control (RBAC).

  1. Browse to your Gen2 Data Lake in the Azure Portal
  2. Click on Access Control (IAM)
  3. Click on + Add
  4. In the Add Role Assignment pane, set the following:
     • Role - Contributor
     • Assign Access to - Data Factory
     • Subscription - your subscription
     • Select - your Data Factory's name
  5. Once selected, click Save

The above gives the Data Factory access to the Data Lake resource itself. However, for the Data Factory to access the files inside the containers, we also need to assign the Storage Blob Data Contributor role to the Data Factory on the Data Lake. Repeat step 4 above with the Role set to Storage Blob Data Contributor.
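
If you prefer to script the role assignment instead of clicking through the portal, a minimal sketch with the azure-mgmt-authorization Python package might look like the following; the subscription ID, resource group, account name, and the Data Factory's principal ID are placeholders you would substitute with your own values:

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

# Placeholders: substitute your own values.
subscription_id = "<subscription-id>"
scope = (
    f"/subscriptions/{subscription_id}"
    "/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/mydatalakegen2"
)

# The Data Factory's managed identity (its principal/object ID), shown on
# the factory's Properties blade in the portal.
adf_principal_id = "<data-factory-principal-id>"

# Built-in role definition ID for "Storage Blob Data Contributor".
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/"
    "Microsoft.Authorization/roleDefinitions/"
    "ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),  # assignment name must be a GUID
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id=adf_principal_id,
    ),
)
```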

You have now successfully given the Data Factory the appropriate permissions on ADLS Gen2. Next, it's time to connect to this ADLS Gen2 instance from your Data Factory.

  1. Browse to your Data Factory instance in the Azure Portal
  2. Click on Author & Monitor
  3. On the left-hand pane, click on the Manage icon with a spanner
  4. Under Connections, click on Linked Services. Linked services allow you to create connections to other resources like data lakes, databases, etc.
  5. Click on + New
  6. In the pane, search for Azure Data Lake Storage Gen2
  7. Click on Azure Data Lake Storage Gen2
  8. In the New Linked Service pane, you can now connect to the Data Lake that you created (a scripted equivalent is sketched after this list):
     • Name - a name for your linked service
     • Authentication Method - Managed Identity
     • Account Selection Method - either select from your Azure subscription or enter manually. If you choose Enter Manually, make sure the URL is correct. For ADLS Gen2, if your Data Lake's account name is mydatalakegen2, then the URL would be https://mydatalakegen2.dfs.core.windows.net
  9. Click on Test Connection. If all goes through successfully, you should see a green tick with "Connection successful" as the message.
  10. Click on Create.
  11. Now, we need to publish this in order to apply it to the Data Factory. Click on Publish all.
  12. Review the changes on the pane and click Publish again.

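For completeness, here is a rough equivalent of creating the same linked service through the azure-mgmt-datafactory Python package; all resource names are placeholders, and the key point is that supplying only the URL and no credential properties leaves the linked service to fall back to the factory's system-assigned managed identity:

```python
# Sketch: create an ADLS Gen2 (AzureBlobFS) linked service via the SDK.
# With no credential properties set, Data Factory authenticates using its
# system-assigned managed identity.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLinkedService,
    LinkedServiceResource,
)

# Placeholders: substitute your own values.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)
client.linked_services.create_or_update(
    resource_group_name=resource_group,
    factory_name=factory_name,
    linked_service_name="AdlsGen2LinkedService",
    linked_service=LinkedServiceResource(
        properties=AzureBlobFSLinkedService(
            url="https://mydatalakegen2.dfs.core.windows.net"
        )
    ),
)
```
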
Because you assigned permissions to the Data Factory on ADLS Gen2 via role-based access control (RBAC), Azure automatically uses the managed identity associated with the Data Factory to authenticate the connection.
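
As a quick sanity check outside of Data Factory, any identity holding the Storage Blob Data Contributor role on the account should be able to list the lake's file systems; a minimal sketch, again with a placeholder account name:

```python
# List the file systems (containers) in the lake as a data-plane check.
# Works for any identity that holds the Storage Blob Data Contributor
# role on the storage account.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalakegen2.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
for fs in service.list_file_systems():
    print(fs.name)
```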

In my next post, I will connect to the data lake using the linked service I created and demonstrate an example of a Data Flow on JSON data.


Written by Roshan Joe Vincent

Machine Learning / Deep Learning — University of Cincinnati — Onc.AI (Machine Learning Engineer) [https://github.com/roshan-vin4u]
