
Leveraging Microsoft Fabric and Azure OpenAI for Data Processing and AI Integration

Photo by Engin Akyurt

Microsoft Fabric is a data and analytics platform launched by Microsoft in May 2023. The idea of the platform is to provide one comprehensive tool for data processing, reporting, and analysis. Fabric combines multiple existing Microsoft data services into one new product. It is not just an umbrella for existing features, but also offers new capabilities such as the unified OneLake solution for data storage. The product is still in the Preview stage, which shows in the fact that not all parts of Fabric can yet communicate with each other smoothly. The situation is expected to improve as the product matures.

OpenAI is a well-known company specializing in artificial intelligence. The company first became famous for its DALL-E software, which creates images from text descriptions. The real popularity came in 2022, when the company released the ChatGPT tool based on the GPT-3.5 model. In January 2023, Microsoft invested a significant amount of money in OpenAI and acquired a significant stake in the company. Later that year, Microsoft began offering OpenAI’s software directly from its Azure cloud service. Azure’s OpenAI capabilities are still in the Preview stage, but they already allow you to create functional prototypes and test what the services can do.

This introduction leads us to the actual topic. In this blog post I show how I built a PoC with Microsoft Fabric that fetches data from an API, stores it, shapes it for AI use, and then sends it to Blob Storage so that the Azure OpenAI service can use it.

Microsoft Fabric and OpenAI logos

Ingesting Data Into Fabric

Data can be ingested into Fabric in several different ways. Fabric offers the Pipeline Copy activity and Dataflow Gen2 as low-code/no-code solutions, which enable data retrieval without writing code. In addition, the Fabric toolbox contains Spark-based Notebooks for cases where the no-code tools run out of features.

In this PoC I used Data Pipelines to ingest data from an API. Data Pipelines support paged APIs and are easy to use with different kinds of authentication mechanisms, such as Basic auth and Bearer-token-based authentication.
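A Data Pipeline handles the paging loop for you without code; as a notebook-style sketch, the same logic might look like the following. The page shape, endpoint, and token here are assumptions for illustration, not the actual API used in the PoC.

```python
# Hypothetical sketch of paged-API ingestion logic. `get_page` is any callable
# that takes a page token (None for the first page) and returns a dict like
# {"items": [...], "next": <token or None>}.

def fetch_all_pages(get_page, max_pages=100):
    """Collect items from a paged API until no next-page token is returned."""
    items, token = [], None
    for _ in range(max_pages):      # guard against endless pagination
        page = get_page(token)
        items.extend(page["items"])
        token = page.get("next")
        if token is None:           # last page reached
            break
    return items

# With the `requests` library, a real Bearer-authenticated page fetcher
# could look like (API_URL and API_TOKEN are placeholders):
#   def get_page(token):
#       resp = requests.get(API_URL, params={"page": token},
#                           headers={"Authorization": f"Bearer {API_TOKEN}"})
#       resp.raise_for_status()
#       return resp.json()
```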

On the storage side, Fabric also offers several options. There is a database-style Data Warehouse, a more abstract Lakehouse, and even Power BI’s own Datamart has found its way into Fabric. You can use the guide provided by Microsoft to choose the appropriate storage solution. However, I recommend staying on the Lakehouse side, as it offers the most comprehensive features for data processing. If SQL is a core skill, then Data Warehouse can also be a viable option.

I decided to use a Lakehouse and store my data in two different “tables”. Fabric presents the data as tables, even though the underlying data is stored in Parquet files.
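Inside a Fabric notebook, landing ingested data as a Lakehouse table is a short PySpark step. This is a sketch under assumptions: the `spark` session is provided by the Fabric runtime, and the file path and table name are hypothetical.

```python
# Assumed PySpark sketch for a Fabric notebook, where a `spark` session
# already exists. Paths and names are placeholders.
df = spark.read.json("Files/raw/spare_parts.json")  # data landed by the pipeline

(df.write
   .mode("overwrite")
   .format("delta")              # Delta tables store their data as Parquet files
   .saveAsTable("spare_parts"))  # surfaces as a Lakehouse "table"
```

This only runs inside the Fabric (or another Spark) environment, so it is shown here as a fragment rather than a standalone script.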

Two data pipelines ingest data into the Lakehouse, and one Notebook sends it to Blob Storage for Azure OpenAI

Sending Data to Azure OpenAI

At the time of writing this blog post, Azure OpenAI Studio cannot directly use the data storage options provided by Fabric as a data source, so the data must be transported somewhere within OpenAI’s reach. In my opinion, the easiest way to do this is to send the data from Fabric to Azure Blob Storage using a Notebook. Blob Storage is a cheap and easy-to-use data store, and it can be used directly in OpenAI Studio.
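The upload step in the Notebook can be a small helper like the sketch below. The container name and file paths are placeholders; in a real run the `container_client` would come from the azure-storage-blob package, as shown in the comment.

```python
# Hypothetical helper for pushing prepared files from a Fabric notebook to
# Azure Blob Storage. A real client would be created with azure-storage-blob:
#   from azure.storage.blob import BlobServiceClient
#   container_client = (BlobServiceClient
#                       .from_connection_string(CONNECTION_STRING)
#                       .get_container_client("openai-data"))
import os

def upload_prepared_files(container_client, local_paths):
    """Upload each local file to the container, named after its file name."""
    for path in local_paths:
        name = os.path.basename(path)
        with open(path, "rb") as f:
            container_client.upload_blob(name=name, data=f, overwrite=True)
    return [os.path.basename(p) for p in local_paths]
```

Keeping the upload in its own function makes it easy to point the same Notebook at a different container or file set later.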

Feeding Data to the OpenAI GPT-3.5 Model

With Azure OpenAI Studio, we can feed text, PDF, PowerPoint, Markdown, and Word files to the GPT language model. The files must not be too large (long), and the text must contain enough context. For example, list-like text is not suitable for the language model, because the model may not understand what the list is about. Descriptive texts, such as instructions and specifications, are more suitable. On the Fabric side, it is advisable to add enough context around the raw text data to make it suitable for the language model. For example, the data might contain only spare parts and spare-part numbers as lists:

  • Track bearing, 1234, shelf 3 level 2
  • Belt tensioner, 5678, shelf 9 level 1

In that case, it is advisable to enrich the material with information about what each field means:

  • Spare part name: Track bearing, Warehouse shelf location: shelf 3, Warehouse shelf location level: 2, Spare part serial number: 1234
  • Spare part name: Belt tensioner, Warehouse shelf location: shelf 9, Warehouse shelf location level: 1, Spare part serial number: 5678
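This enrichment step is easy to script in a Fabric notebook. The sketch below follows the spare-part example above; the function and field names are otherwise hypothetical.

```python
# Minimal sketch of the enrichment step: expand a terse inventory row into
# a descriptive sentence the language model can ground on.

def describe_spare_part(name, serial, shelf, level):
    """Turn one list-like row into labeled, self-describing text."""
    return (f"Spare part name: {name}, "
            f"Warehouse shelf location: shelf {shelf}, "
            f"Warehouse shelf location level: {level}, "
            f"Spare part serial number: {serial}")

rows = [("Track bearing", "1234", 3, 2), ("Belt tensioner", "5678", 9, 1)]
lines = [describe_spare_part(*row) for row in rows]
```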

With this additional information, the language model can interpret the data more easily and answer when the user asks, for example, “What spare part has serial number 1234?”

Summary

Once the model has been grounded on the data, Azure OpenAI Studio provides a ready-made tool for publishing the solution as a chat page (web application). The end result can also be tested directly within Azure OpenAI Studio before publishing, and the model can be fine-tuned if necessary, for example by adjusting the GPT parameters.

Connecting Fabric to Azure OpenAI Studio is not yet completely straightforward. Data does not move completely frictionlessly between Azure OpenAI and Fabric, or sometimes even within Fabric, and in addition to the no-code tools you quite often have to use a programming language such as Python, Scala, or R to get the job done. Fabric’s tools are developing at a rapid pace, so the integration between these systems can be expected to become easier in the future.