I was reading through the Fabric May 2024 Updates today and noticed a pretty cool development tucked away amongst the pile of features released this month. In a single paragraph was the following:
“With OneLake shortcuts, you can bring on-premises or network-restricted data into OneLake, without any data movement or duplication.”
From my perspective, this is an INCREDIBLY cool feature. It looks like a building block for further Microsoft integrations around the Hybrid Data Platform Architecture, serving both those who prefer cloud and those who prefer on-prem designs in a cohesive “best of both worlds” design.
Current Fabric Shortcut Support
For a quick refresher… a Fabric shortcut is essentially a view in Fabric built on external file stores. It’s important to distinguish between a shortcut and mirroring, as even I get them confused 🙂
- Shortcut: Shortcuts are objects in OneLake that point to other storage locations. Generally these are related to what I would refer to as “Data Lake Storage” locations. Shortcuts do not involve any replication of data.
- Mirroring: Mirroring is essentially the replication solution for Fabric. For example, if you have an Azure SQL DB in an Azure tenant, you can simply mirror that SQL DB into Fabric and create a near-real-time replica without migrating all of the Azure SQL DB processes and objects to Fabric workloads.
Currently Fabric supports shortcuts for:
- Azure Storage: Storage Accounts and Blobs in Azure Data Lake
- Amazon S3: The AWS equivalent to Data Lake.
- Google Cloud Storage: GCP's bucket-based object storage (not to be confused with Google Drive).
- Amazon S3 Compatible: The S3 API is the de facto standard object storage API for storing unstructured data in the cloud. Unlike traditional file system interfaces, it gives application developers a rich API for controlling data. S3-compatible storage is any storage solution that lets you access and manage its data over an S3-compliant interface.
AS WELL AS ON-PREMISES SOURCES! Let’s get into how to get this working…
Configuring your first on-prem source
The first key piece of infrastructure for this on-prem shortcut to work is the on-premises data gateway (the “Standard Data Gateway”). This is essentially the equivalent of the Self-Hosted Integration Runtime in Azure Data Factory, and it lets you link your Fabric environment to your on-prem network.
Install this on an “always on” machine that is within your network. Use the following link to download the installer: https://go.microsoft.com/fwlink/?LinkId=2116849&clcid=0x409
After following the documented installation steps, we should be able to link Fabric to the on-prem file stores.
Now that I have my on-prem data gateway installed and active:
Let’s jump over to our Lakehouse and create a shortcut with this on-prem data gateway:
It is important to note at this point that this feature is available only for Amazon S3, Google Cloud Storage, and S3-compatible shortcuts.
This solution WILL NOT let you just connect and display files from your local drive.
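If you would rather script this than click through the UI, the same shortcut can be described as a request to the Fabric REST API. Here is a minimal sketch in Python that only builds the request and does not send it. I'm assuming the OneLake shortcuts endpoint and the `amazonS3`, `googleCloudStorage`, and `s3Compatible` target names from the public API reference. The IDs, the bucket URL, and the helper name `build_shortcut_request` are hypothetical placeholders. I'm also assuming the gateway is picked up through the connection you reference rather than named in the request itself.

```python
import json

FABRIC_API = "https://api.fabric.microsoft.com/v1"

# Shortcut target types that this post says work through the gateway.
GATEWAY_TARGETS = {"amazonS3", "googleCloudStorage", "s3Compatible"}


def build_shortcut_request(workspace_id, lakehouse_id, name, target_type,
                           location, subpath, connection_id,
                           path="Files", **extra_target_fields):
    """Return (url, body) for a create-shortcut call; nothing is sent."""
    if target_type not in GATEWAY_TARGETS:
        raise ValueError(f"{target_type} is not a gateway-capable shortcut type")
    url = f"{FABRIC_API}/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts"
    target = {
        "location": location,
        "subpath": subpath,
        "connectionId": connection_id,
    }
    # Assumption: some targets take extra fields (for example a bucket name
    # for s3Compatible); pass them through as keyword arguments.
    target.update(extra_target_fields)
    body = {"path": path, "name": name, "target": {target_type: target}}
    return url, body


# Hypothetical placeholder values, for illustration only.
url, body = build_shortcut_request(
    workspace_id="<workspace-id>",
    lakehouse_id="<lakehouse-id>",
    name="gcs_demo",
    target_type="googleCloudStorage",
    location="https://<bucket-name>.storage.googleapis.com",
    subpath="/",
    connection_id="<connection-id>",
)
print(url)
print(json.dumps(body, indent=2))
```

To actually create the shortcut you would POST `body` to `url` with a bearer token for the Fabric API. I've left that part out since auth setups vary.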
To simulate this with GCP, I signed up for some free cloud storage to test the shortcut with this new setting:
And just like that we can see the example buckets created:
I now have access to data stored in GCP storage, via my private gateway, and can view and leverage it in my Fabric Lakehouse without any data movement.
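Once the shortcut exists, it behaves like any other folder in the Lakehouse, so a notebook can address it by its OneLake path. Below is a small sketch of how that path is put together, assuming the name-based `abfss://` URI format for OneLake. The workspace, lakehouse, and shortcut names are made up, and `onelake_uri` is just a helper I named for illustration.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_uri(workspace, lakehouse, shortcut, section="Files"):
    """Build the abfss URI for a shortcut under a Lakehouse's Files or Tables."""
    if section not in ("Files", "Tables"):
        raise ValueError("section must be 'Files' or 'Tables'")
    return (
        f"abfss://{workspace}@{ONELAKE_HOST}/"
        f"{lakehouse}.Lakehouse/{section}/{shortcut}"
    )


# Hypothetical names, for illustration only.
uri = onelake_uri("DemoWorkspace", "DemoLakehouse", "gcs_demo")
print(uri)

# In a Fabric notebook you could then try something like:
#   df = spark.read.parquet(uri)  # assumes the bucket holds Parquet files
```

If your workspace or lakehouse names contain spaces or special characters, the name-based form may not work as written here, so treat this as a starting point rather than a guaranteed path.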
Importance of network-restricted file exposure through Fabric shortcuts
Exposing network-restricted data stores via Microsoft Fabric’s OneLake is important for several reasons:
- Data Accessibility: It allows users to access on-premises data seamlessly without physically moving the data, maintaining data integrity and reducing transfer times.
- Security: By using a gateway, sensitive data remains within the secure network while still being accessible for analytics and processing, ensuring compliance with data protection regulations.
- Efficiency: Facilitates efficient data integration and analysis by providing a unified view of both on-premises and cloud data, enhancing decision-making processes.
Possible use cases and benefits
This type of design for integrating network-restricted data stores with Microsoft Fabric’s OneLake has several use cases and benefits:
Use Cases:
- Hybrid Cloud Solutions: As discussed above, with this feature businesses can leverage both on-prem and cloud data for analytics, supporting hybrid cloud models.
- Real-Time Analytics: Organizations can see exactly what is occurring in on-prem data stores without the need to transfer data to the cloud. This enables faster decision making.
Benefits:
- Cost Efficiency: Reduces the need for large-scale data transfers, saving on bandwidth and storage costs.
- Enhanced Security: Keeps sensitive data within secure, network-restricted environments while still allowing for external access.