
Exploring Fitness Data in R

Back in late February, I purchased a Microsoft Band 2. While I don’t exercise every day, I thought it would be fun to track the few health activities I can record. After five months, I’ve got quite a bit of daily step data and a few miles of jogging under my belt. I thought I would begin an exploratory analysis and see what kinds of information I could visualize to learn more about my health activities.

The Microsoft Health Dashboard, where all your health data is stored, is designed to export daily summaries for all activities in a comma-separated format. However, with a little web scraping you can retrieve much more: the detailed data the site uses to build its online dashboards. I’ve used this same process in my work projects to get information that is not available for export. Since many websites serve information this way, the technique applies well beyond this one project. I will detail how I retrieved that information and created a few exploratory visuals.

The Data

For this analysis, I used R to retrieve (httr, jsonlite), clean (tidyr, lubridate), summarize (dplyr) and visualize (ggplot2) the data. I used ProjectTemplate to manage my scripts and raw data along with RMarkdown to create this post.

First, you will need to retrieve the data. When you log into your Microsoft Health account and browse to any of your activities (sleep, steps, running, etc.), hourly summaries of the day are displayed in one of the dashboard’s visuals. If you open your browser’s developer tools, you can see the URL request your browser made to retrieve that information before it was displayed. We’re going to mimic those browser requests and clean up the response, creating a tidy data frame that can be used for exploration. You’ll need the request URL and the cookie your browser sent when it made that request. With those two pieces, you can retrieve a single day’s worth of information.

Getting more than a single day requires us to manipulate the original URL’s parameters. Luckily for us, the only thing we need to change is the date. Using base R’s seq.Date function, we can create a sequence of dates in a format that the server will accept. A for-loop then walks through that sequence: the paste function builds the full request URL for each date, httr sends the request to the server, and jsonlite parses the JSON response into a data frame. The final step inside the loop uses rbind to append each day’s result to one data frame called “Activities”, so the hourly step counts for many days end up in a single table. Below is the full code I used to get this information.

library(httr)
library(jsonlite)
library(lubridate)

# Paste the Cookie header value copied from your browser's developer tools.
cookie = "**Insert Browser Cookie**"
DateRange = as.character(seq.Date(from = as.Date("2016-01-01"), to = as.Date("2016-08-05"), by = 1))

Activities = data.frame()
for(i in DateRange){
  request.url = paste('https://dashboard.microsofthealth.com/card/getuseractivitybyhour?date=', i, '&utcOffsetMinutes=-240', sep="")
  page = GET(request.url, add_headers('Cookie' = cookie))
  page = fromJSON(content(page, as = 'text'))
  page = page$ActivityList
  Activities = rbind(Activities,page)
}
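
The loop above grows Activities one rbind at a time, which is fine for a few hundred days. As a sketch of an alternative, the URLs can be built in one vectorized call and the daily results collected in a list before binding once. The helper names build_url and fake_fetch are made up for illustration, and fake_fetch merely stands in for the GET and fromJSON steps so the example runs without a cookie or network access.

```r
# Build request URLs for a vector of dates (endpoint and offset copied from the loop above).
build_url <- function(day, offset_minutes = -240) {
  paste0("https://dashboard.microsofthealth.com/card/getuseractivitybyhour?date=",
         day, "&utcOffsetMinutes=", offset_minutes)
}

# Three days are enough to show the pattern.
days <- as.character(seq.Date(as.Date("2016-08-01"), as.Date("2016-08-03"), by = 1))
urls <- build_url(days)  # paste0 is vectorized, so no loop is needed here

# Hypothetical stand-in for the real request: returns a one-row data frame per URL.
fake_fetch <- function(url) {
  data.frame(Url = url, StepsTaken = 0L, stringsAsFactors = FALSE)
}

# Collect every day's result in a list, then bind once at the end.
pages <- lapply(urls, fake_fetch)
Activities_demo <- do.call(rbind, pages)
```

In real use, fake_fetch would be replaced by a function wrapping the GET, content, and fromJSON calls from the loop.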

Once the information is retrieved, we’ll want to extract various time details using the lubridate package.

# Time formats.
Activities$TimeOfDay = strptime(Activities$TimeOfDay, "%m/%d/%Y %X", tz = "EST")
Activities$TimeOfDay = as.POSIXct(Activities$TimeOfDay)
Activities$Hour = hour(Activities$TimeOfDay)
Activities$Wday = wday(Activities$TimeOfDay, label = TRUE)
Activities$Month = month(Activities$TimeOfDay, label = TRUE)
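# Pass tz explicitly so late-evening hours are not shifted to the next day by a UTC conversion.
Activities$Day = as.Date(Activities$TimeOfDay, tz = "EST")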
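
To see what these columns hold, here is a small base R sketch on a made-up timestamp. The sample string and its exact layout are assumptions (the post does not show a raw TimeOfDay value), and the time format is written out as %H:%M:%S because %X depends on the locale. It also shows why the time zone matters when a timestamp is reduced to a date.

```r
# Assumed example value; the real TimeOfDay strings may look different.
stamp <- "08/05/2016 22:00:00"

# Parse with an explicit format and time zone.
parsed <- as.POSIXct(stamp, format = "%m/%d/%Y %H:%M:%S", tz = "EST")

# Hour of day and weekday number (0 = Sunday ... 6 = Saturday), base R equivalents
# of lubridate's hour() and wday().
hour_of_day <- as.integer(format(parsed, "%H"))
weekday_num <- as.POSIXlt(parsed)$wday

# The calendar date depends on the time zone used for the conversion:
day_local <- as.Date(parsed, tz = "EST")  # 2016-08-05
day_utc   <- as.Date(parsed, tz = "UTC")  # 2016-08-06, because 22:00 EST is 03:00 UTC
```

An hour recorded at 10 pm local time belongs to the next calendar day in UTC, so daily totals can shift if the conversion does not use the local time zone.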

Using dplyr we can get summaries of daily activities from the hourly data we originally retrieved.

library(tibble)
library(tidyr)
library(dplyr)
Activities = as_tibble(Activities)
StepSum = Activities %>% 
  filter(StepsTaken > 0) %>%
  group_by(Day,Wday) %>% 
  summarize(DailyStepCount = sum(StepsTaken)) %>%
  ungroup() %>%
  mutate(AverageDaily = mean(DailyStepCount))
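
As a sanity check on what this pipeline computes, the same logic can be traced on a few made-up rows using only base R. The toy values below are invented for illustration; only the column names Day and StepsTaken come from the code above.

```r
# Toy hourly data: two days, some hours with zero steps (values are invented).
toy <- data.frame(
  Day = as.Date(c("2016-08-01", "2016-08-01", "2016-08-01",
                  "2016-08-02", "2016-08-02")),
  StepsTaken = c(0, 1200, 800, 3000, 0)
)

# Same steps as the dplyr pipeline: drop empty hours, sum per day, add the overall mean.
active <- toy[toy$StepsTaken > 0, ]
daily <- aggregate(StepsTaken ~ Day, data = active, FUN = sum)
names(daily)[2] <- "DailyStepCount"
daily$AverageDaily <- mean(daily$DailyStepCount)

print(daily)
```

One consequence of filtering on StepsTaken > 0 first is that a day with no recorded steps disappears entirely, so the average is taken over days with activity rather than over every calendar day.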

Visualizing Daily Step Count with ggplot2

The data you get from the dashboard is pretty clean out of the box, so we can quickly move to visualizing it with ggplot2. One great thing about all this information is firsthand knowledge: since I am the one who collected, monitored, and analyzed the data, I have a pretty good understanding of what to expect from it. The first chart visualizes daily step counts, their overall average, and the events that contributed to better step days.

# Packages for Visualization
library(ggplot2)
library(dplyr)
library(tibble)
library(viridis)
library(scales)
library(extrafont); library(extrafontdb)
library(tidyr)
library(magrittr)

# A dataframe with events during the year.
Event_Dates = data.frame(Event = c(1:4), 
                         Begin = rep(as.Date(c("2016-02-29", "2016-05-01", "2016-05-16", "2016-06-05")),2), 
                         End = rep(as.Date(c("2016-03-06", "2016-05-07","2016-05-21", "2016-08-01")),2),
                         y_bottom = c(0,0,10000,4700),
                         y_top = c(10000,5000,19000,12500),
                         Label = c("Spring Vacation", "Graduation","New York City", "Summer Hikes and Runs"))

# Daily Step Count Visual
ggDailySteps = ggplot() + 
  geom_point(data = StepSum, size = 3, aes(Day, DailyStepCount),color = 'royalblue3') + 
  labs(title = "My Daily Steps in 2016\n", x = NULL, y = "Daily Step Count\n") +
  scale_color_viridis(option = "C", discrete = TRUE) + 
  guides(color = guide_legend(title = 'Mean Daily Step Count', color = 'black', label = FALSE, size = 1,override.aes = list(color = 'black', size = 1))) +
  scale_y_continuous(labels = comma, breaks = c(0,2500,5000,7500,10000,12500,15000)) + 
  theme(text = element_text(family = "Georgia", color = 'grey10'),
        plot.title = element_text(size = 24, hjust = -0.01),
        panel.background = element_rect(fill = 'antiquewhite'),
        panel.grid = element_blank(),
        plot.margin = unit(c(0.4,0.4,0.4,0.4), 'cm'),
        plot.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
        legend.text = element_text(color = 'grey10', size = 14),
        legend.key = element_rect(fill = 'antiquewhite',color = 'antiquewhite',size = 2),
        legend.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
        legend.position = c(0.8,1.11),
        legend.direction = 'horizontal',
        axis.text = element_text(size = 16, color = 'grey10'),
        axis.title = element_text(size = 16),
        axis.ticks = element_blank()) + 
  geom_rect(data = Event_Dates, fill = 'grey80', alpha = .2,aes(xmin = Begin-3, xmax = End+3, ymin = y_bottom, ymax = y_top)) +
  geom_segment(color = 'grey60', size = 1.3, linetype = 3, aes(x = Event_Dates$End[7]+4, xend = (Event_Dates$End[7]+15), y = 16500, yend = 18000)) + 
  geom_text(label = "New York City Vacation",size = 5, color = 'grey30', aes(x = (Event_Dates$End[6]+60), y = 18755)) +
  geom_segment(size = 1.3,linetype = 3,color = 'grey60',aes(x = Event_Dates$End[6]-5, xend = Event_Dates$End[6]-15, y = 5000, yend = 7600)) + 
  geom_text(label = 'Graduation Week', color = 'grey30', size = 5, aes(y = 8650, x= Event_Dates$End[7]-32)) + 
  geom_segment(aes(x = Event_Dates$End[1]-7, xend = (Event_Dates$End[1]-30), y = 10000, yend = 11600),color = 'grey60', size = 1.3, linetype = 3) + 
  geom_text(label = "Spring Vacation", color = 'grey30',size = 5, aes(y = 12500, x=Event_Dates$End[1]-30)) + 
  geom_text(label = "Summer Hikes\nWalks and Runs",color = 'grey30', size = 5, aes(x = as.Date("2016-07-03"), y = 13700, family = "Georgia")) + 
  geom_rug(data = StepSum,aes(Day, DailyStepCount), sides = 'l', color = 'grey60') + 
  geom_segment(data = StepSum, linetype = 2, aes(y = AverageDaily,yend = AverageDaily, x = as.Date("2016-01-01"), xend = as.Date("2016-08-01"),color = 'grey60'), size = 1.4, show.legend = TRUE)

ggDailySteps

I’m pleased with what I was able to do with this visual. The ggplot2 package does a good job of simplifying the plotting process. Once you get familiar with all the options you have at your fingertips, complex visualizations become much easier. The most difficult things tend to be the little design details, like the length of the line segments and the placement of the annotations. Those details may be easier to handle in a dedicated design program like Inkscape or Photoshop, but I am pleased with how much I could accomplish using R alone. Plus, since the initial plotting work is complete, I can continue to add events and replot with minimal effort.

Hourly Heart Rate

Visualizing my heart rate took much less time to create since it contained fewer annotations. Plus, I had already established the basic thematic elements in my previous plot. The resulting graph shows roughly the number of times I have exercised since purchasing my band.

ggAverageHeart = ggplot(subset(Activities, AverageHeartRate > 0), aes(Day, AverageHeartRate)) + 
  geom_line() +
  labs(x = NULL, y = "Average Rate\n", title = "Average Hourly Heart Rate in 2016\n") +
  theme(text = element_text(family = "Georgia", color = 'grey15'),
        plot.title = element_text(size = 26, hjust = -0.01),
        panel.background = element_rect(fill = 'antiquewhite'),
        panel.grid = element_blank(),
        plot.margin = unit(c(0.4,0.4,0.4,0.4), 'cm'),
        plot.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
        legend.text = element_text(color = 'grey10', size = 14),
        legend.key = element_rect(fill = 'antiquewhite',color = 'antiquewhite',size = 2),
        legend.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
        legend.position = c(.75,1.05),
        legend.direction = 'horizontal',
        axis.text = element_text(size = 16, color = 'grey10'),
        axis.title = element_text(size = 16),
        axis.ticks = element_blank()) + 
  stat_smooth(color = 'royalblue3', size = 1.3) + 
  geom_segment(linetype = 3, size = 1.3,color = 'grey40', aes(x = as.Date("2016-04-10"), xend = as.Date("2016-04-20"), y = 158, yend = 164)) + 
  geom_text(label = "High heart rate indicates a workout", aes(x = as.Date("2016-05-15"), y = 170, family = "Georgia"), size = 4, color = 'grey15')

ggAverageHeart
## `geom_smooth()` using method = 'gam'

Running Performance

The Microsoft Health dashboard provides many metrics to help you understand and track your exercise performance; however, it does not let you compare all of your runs together in one graph. To create a single visual, I retrieved the running information with the same method I used previously and weeded out runs longer than three miles. The graph I chose to create looks at the relationship between distance along a run (between 0 and 3 miles) and my pace at that distance. Below is the code I used to create it.

ggRunRate = ggplot(subset(Runs, Pace > 0), aes(TotalDistance, Pace)) + 
  geom_point() +
  labs(x = "\nTotal Distance\n(miles)", y = "Pace\n", title = "Pace over the Course of a Run\n") +
  theme(text = element_text(family = "Georgia", color = 'grey15'),
        plot.title = element_text(size = 26, hjust = -0.01),
        panel.background = element_rect(fill = 'antiquewhite'),
        panel.grid = element_blank(),
        plot.margin = unit(c(0.4,0.4,0.4,0.4), 'cm'),
        plot.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
        legend.text = element_text(color = 'grey10', size = 14),
        legend.key = element_rect(fill = 'antiquewhite',color = 'antiquewhite',size = 2),
        legend.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
        legend.position = c(.75,1.05),
        legend.direction = 'horizontal',
        axis.text = element_text(size = 16, color = 'grey10'),
        axis.title = element_text(size = 16),
        axis.ticks = element_blank()) + 
  stat_smooth(color = 'royalblue3', size = 1.3) + 
  scale_x_continuous(breaks = c(0,0.5,1,1.5,2,2.5,3))

ggRunRate
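
The step that weeds out runs longer than three miles is not shown in this post, so the following is only a sketch of one way it could be done. It assumes each row is a reading along a run and that a run identifier column exists; RunId is a hypothetical name, the values are invented, and only TotalDistance and Pace appear in the plotting code above.

```r
# Toy run data: run 1 ends at 2.8 miles, run 2 ends at 4.2 miles (values are invented).
Runs_demo <- data.frame(
  RunId = c(1, 1, 1, 2, 2, 2, 2),                   # hypothetical run identifier
  TotalDistance = c(0.5, 1.5, 2.8, 1.0, 2.0, 3.5, 4.2),
  Pace = c(6.1, 5.8, 5.9, 6.3, 6.0, 5.5, 5.2)       # arbitrary units
)

# Longest distance reached in each run, repeated for every row of that run.
run_length <- ave(Runs_demo$TotalDistance, Runs_demo$RunId, FUN = max)

# Keep only the runs that never went past three miles.
short_runs <- Runs_demo[run_length <= 3, ]
```

If the intent were instead to keep the first three miles of every run, filtering rows on TotalDistance <= 3 would be the simpler choice.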

As I would expect, pace generally decreases as distance increases. Adding the loess smoother highlights a few specific running trends. My pace starts out strong and gradually declines over the course of three miles. Towards the end, I might pick up speed and finish stronger.

As the weather cools off, I hope to begin running more consistently and to measure my performance during that period. I would expect (or at least hope) that the downward trend over the length of a run will flatten and that my average pace will rise, but that will have to be tested another time. For now, this project is only a short exercise in exploratory data analysis.