
Import Exercise Data from Apple Watch into Microsoft Fabric

Photo by Victor Freitas: https://www.pexels.com/photo/person-holding-barbell-841130/

I have been using an Apple Watch to track my exercise, especially my floorball sessions. While I love the Apple Watch's product quality, ease of use and the feel around my wrist, it sadly lacks good statistics tools for exercise data. For example, I tried to see how many training days I had during the 2023–24 season and couldn't find any way to do it. That gave me the idea to upload the exercise data into Microsoft Fabric and use some good old Pandas magic to calculate the statistics myself.

Exporting Data from iPhone

You can export health data from an Apple Watch quite easily. Open the Health app, go to the home page, tap your initials icon in the top right corner, scroll down and tap Export All Health Data. This creates a bunch of XML files that hold your data. XML is a terrible format to handle inside Fabric, so we first need to convert the XML files into CSV format. I used a web-based tool, the Apple Health XML to CSV Converter page, to do the conversion.
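If you would rather skip the web tool, the conversion can also be scripted. Here is a minimal sketch, assuming the export.xml stores readings as Record elements with type, startDate, endDate and value attributes (file names and the record type identifier are my assumptions, not what the converter produces):

import csv
import xml.etree.ElementTree as ET

# Sketch: pull heart rate records out of Apple Health's export.xml into a
# semicolon-delimited CSV. File names and the record type are assumptions.
tree = ET.parse("export.xml")

with open("HeartRate.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["type", "startdate", "enddate", "value"])
    for record in tree.getroot().iter("Record"):
        if record.get("type") == "HKQuantityTypeIdentifierHeartRate":
            writer.writerow([record.get("type"), record.get("startDate"),
                             record.get("endDate"), record.get("value")])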

Upload to Fabric

After converting the data we can upload it into Microsoft Fabric. To do this I simply created a Lakehouse called TestLakehouse and used its built-in upload function to upload the CSV files into the Files section. This allows us to easily use them in a Data Pipeline.
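I used the UI here, but OneLake also exposes an ADLS Gen2-compatible endpoint, so the same upload could in principle be scripted with the Azure Storage SDK. A rough, untested sketch (the workspace name and local folder are placeholders):

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Untested sketch: upload one CSV into the Lakehouse Files section over
# OneLake's ADLS Gen2-compatible endpoint. "MyWorkspace" is a placeholder.
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("MyWorkspace")
file_client = fs.get_file_client("TestLakehouse.Lakehouse/Files/HeartRate.csv")

with open("exported_csv/HeartRate.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)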

You can upload files into the data lake through the UI

Data Pipeline

I decided to use the Data Pipeline tool to convert the CSV files into Delta Lake format, because a Data Pipeline is simple and fast to build and it has a nice CSV schema import feature. You just need to add a Copy data action, select the file from the data lake as the source, change the delimiter character to semicolon and import the schema mapping. I noticed that the converter added sep=; as the first line of every file, so I removed it manually. By default Fabric expects the header line to be the first line of the file, so you cannot have anything in front of it; otherwise it will mess up the schema import.
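If you have many files, stripping that sep=; line by hand gets tedious. A small script along these lines could do it instead (a sketch; the folder name is a placeholder):

from pathlib import Path

# Sketch: strip the "sep=;" helper line the converter adds, so the header row
# becomes the first line of each CSV. "exported_csv" is a placeholder folder.
for path in Path("exported_csv").glob("*.csv"):
    lines = path.read_text().splitlines(keepends=True)
    if lines and lines[0].strip().lower().startswith("sep="):
        path.write_text("".join(lines[1:]))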

Mappings are auto generated from CSV headers

I wanted to import three different types of data: heart rate, energy burned and exercise time. So I created a Copy data action for each of those files, with almost identical settings (only the source file and the target table name differ).

Three separate copy actions make it easier to debug problems than one bigger action

Data Analysis

After running the pipeline, we have three new Delta Lake tables that hold the data. I am not very good with Power BI, so I decided to do the visualization and analysis using Notebooks. First I wanted to understand what kind of data I was holding. I loaded all the data from the HeartRate table and drew a scatter plot from it, which is a simple way to visualize the data. As seen in the picture below, the data holds multiple values per day. We can easily calculate min, max, mean etc. values for each day using Pandas (a quick sketch of this follows the plot code below).

Visualization of heart rate data
import pandas as pd
import matplotlib.pyplot as plt

df = spark.sql("SELECT * FROM TestLakehouse.`HeartRate`")
pdf = df.toPandas()

# We need to convert startdate and enddate to date type
pdf['startdate'] = pd.to_datetime(pdf['startdate'])
pdf['enddate'] = pd.to_datetime(pdf['enddate'])

# Keep only rows with startdate after 2023-08-01
pdf = pdf[pdf['startdate'] > '2023-08-01']

# Convert value to numeric
pdf['value'] = pd.to_numeric(pdf['value'])

pdf.plot.scatter(x='startdate', y='value', figsize=(12, 6), title='HeartRate')
plt.show()
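As mentioned above, the per-day minimum, maximum and mean are easy to get from the same data. A quick sketch, reusing the pdf DataFrame from the cell above:

# Group the readings by calendar day and aggregate the heart rate values
daily = (
    pdf.groupby(pdf['startdate'].dt.date)['value']
       .agg(['min', 'max', 'mean'])
       .round(1)
)
print(daily.head())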

As I was only interested in data from the 2023–24 season, I used Pandas to filter out the other days. I could have done the filtering in SQL as well, but since startdate was stored as a string, I did it in Pandas to avoid any datetime formatting issues.
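For reference, the same filter could likely be pushed down to the SQL side with a cast, roughly like this (a sketch I did not actually run; whether it works depends on the exact date string format in the table):

# Sketch: filter on the SQL side by casting the string startdate to a timestamp
df = spark.sql("""
    SELECT * FROM TestLakehouse.`HeartRate`
    WHERE to_timestamp(startdate) > '2023-08-01'
""")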

Next I wanted to get the answer I was initially looking for: "How many exercise days did I have this season?" To get that I used a bit more Pandas magic to print out the truth.

df = spark.sql("SELECT * FROM TestLakehouse.`EnergyBurned`")
pdf = df.toPandas()

# We need to convert startdate and enddate to date type
pdf['startdate'] = pd.to_datetime(pdf['startdate'])
pdf['enddate'] = pd.to_datetime(pdf['enddate'])

# Keep only rows with startdate after 2023-08-01
pdf = pdf[pdf['startdate'] > '2023-08-01']

# Convert value to numeric
pdf['value'] = pd.to_numeric(pdf['value'])

total = pdf['value'].sum()

print(f"Calories burned in Season 2023/2024: {round(total, 2)}")
# Drop time and calculate unique values to get training days
pdf['startdate'] = pdf['startdate'].dt.date
print(f"Number of different startdate values in day level: {pdf['startdate'].nunique()}")

Again, nothing fancy here: just loading the data from the EnergyBurned table (so I could also calculate the total calories), some data type conversion, filtering and summing. As a result I got two numbers:

Calories burned in Season 2023/2024: 48518.54
Number of different startdate values in day level: 76

I didn't wear the watch in every exercise, so I would estimate that the real number is closer to 80. On top of that I go to the gym roughly twice a week, so the total number of exercise days is around 160. I also burned almost 50 000 kcal during exercises, which is a nice extra (this does not include the gym, as I don't like to wear the watch during gym sessions).

Summary

The entire process of exporting data, converting it into a suitable format, importing it into Microsoft Fabric, and performing some basic analysis took me about three hours. Microsoft Fabric is an easy and fast tool for building analyses over CSV data, and I cannot emphasize this enough. I could have also handled the data analysis in Jupyter on my local machine by loading the CSV files from my hard drive. However, by storing the data in Fabric, I can begin collecting data over time and potentially compare different years’ heart rate minimums, maximums, means, etc.
