
Working with Fabric Warehouse Data in PySpark Notebooks

Photo by Pixabay: https://www.pexels.com/photo/wine-tank-room-434311/

I noticed today that there are surprisingly few samples, blog posts, or guides about how to write (and read) warehouse data from a Fabric notebook. You can find a lot of information about doing it through a T-SQL notebook, but you can actually do it from a Spark notebook through the Spark connector. Let me show you how.

Spark Connector

The Spark connector for Fabric Data Warehouse lets Spark notebooks read data from a warehouse and write to it through two-phase dataframe handling (it first stages the data in intermediate storage and then uses a COPY INTO command to load it into the Fabric DW table, but we don't need to care about this).

The Spark connector has a few nice features baked in:

  • Access to data in the same workspace or across different workspaces (see the sketch after this list).
  • Automatic detection of the SQL analytics endpoint in a Lakehouse, based on workspace context.
  • A simplified Spark API that reduces complexity and allows access with minimal code.
  • Built-in availability within the Fabric runtime—no separate installation is needed.
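
If I read the connector docs right, cross-workspace access works by passing the other workspace's ID as an option through the connector's Constants helper. Here is a minimal sketch; the workspace ID, warehouse, and table names are just placeholders:

import com.microsoft.spark.fabric
from com.microsoft.spark.fabric.Constants import Constants

# Read a warehouse table that lives in another workspace by passing its workspace ID
other_ws_df = spark.read \
    .option(Constants.WorkspaceId, "<other-workspace-id>") \
    .synapsesql("OtherWarehouse.dbo.SomeTable")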

Some of the limitations are:

  • You cannot use time travel.
  • Write operations don't work when private link is enabled and public access is blocked.
  • It supports writing only into a DW, not into a Lakehouse (though I don't know why you would want to write into a Lakehouse with this thing).

Code Sample

The whole thing is built around the com.microsoft.spark.fabric library, which brings the spark.read.synapsesql and dataframe.write.synapsesql methods. synapsesql takes the warehouse name, schema, and table name as its parameter, in the form <warehouse name>.<schema>.<table name>. You can of course use the normal dataframe functions like filter and select to query the data; there is a small sketch of that after the sample below.

import com.microsoft.spark.fabric
 
df = spark.read.synapsesql("SampleWarehouse.dbo.Trip")
# Just duplicate one row as an example
first_row_df = df.limit(1)
# If the target table has NOT NULL columns and the dataframe contains null values, fill them before saving
filled_df = first_row_df.fillna({
    "PickupTimeID": 0,
    "DropoffTimeID": 0,
    "DateID": 0,
    "HackneyLicenseID": 0,
    "MedallionID": 0
})
# Write supports multiple methods (overwrite and append for example)
filled_df.write.mode("append").synapsesql("SampleWarehouse.dbo.Trip")
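
And since synapsesql gives you back a regular dataframe, the usual select and filter calls work on top of it. A minimal sketch below; MedallionID and TripDistanceMiles are just example columns from the sample warehouse, so adjust them to whatever your table actually has:

from pyspark.sql import functions as F

trips_df = spark.read.synapsesql("SampleWarehouse.dbo.Trip")
# Normal dataframe operations work on the result of synapsesql
short_trips_df = (
    trips_df
    .select("MedallionID", "TripDistanceMiles")  # example columns
    .filter(F.col("TripDistanceMiles") < 2)
)
short_trips_df.show(10)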

In my tests synapsesql was slower than just working with Lakehouses, and even small actions took 10-20 seconds to complete. The poor performance is more notable in writes. Query performance is at a good level, as in almost any Fabric workload.

Summary

You can read and write data in a Fabric warehouse by using the synapsesql function from the com.microsoft.spark.fabric library. Write operations might be a little bit slower than using the good old Lakehouse, but on the other hand you gain all the other goodies that the Warehouse offers, like working schema support and multi-table transactions.

You can find more information about working with the Spark connector on this Microsoft Learn page.
