RocketStats
An Intro to Gaussian Mixture Modeling
/2017/02/15/an-intro-to-gaussian-mixture-modeling/
Wed, 15 Feb 2017 20:00:00 +0000
<p>One of my goals for 2016 is to improve my ability to understand different statistical/machine learning problems. I have an educational background in economics, so I have spent a good deal of time studying and using linear modeling in its various forms. However, I have spent little time with the various classification techniques. Gaussian mixture modeling is a good place to start: it is fairly simple, introduces the concept of expectation-maximization, and belongs to a family of algorithms that all share the same form.</p>
<p>The Gaussian mixture model (GMM) is a modeling technique that uses a probability distribution to estimate the likelihood of a given point in a continuous set. For the GMM, we assume that each class follows a normally distributed density function. When the classes are clearly separated, the Gaussian distribution works well as an estimate for the class-conditional probabilities.</p>
<p>In practice, <a href="http://machinelearningmastery.com/dont-implement-machine-learning-algorithms/" title="Jason Brownlee, Stop Coding Machine Learning Algorithms From Scratch">it is not usually a great idea to implement your own learning algorithm.</a> But the exercise can be useful as a learning tool. I will be implementing my own Gaussian mixture model to show how it works and to see whether I can get results that mimic the Mclust package. By the end of this article, I hope at the very least to provide you with good resources for learning the GMM on your own.</p>
<div id="linear-disciminant-analysis" class="section level3">
<h3>Linear Discriminant Analysis</h3>
<p>Before moving directly into the main model, I want to note that if you are familiar with linear discriminant analysis (LDA), then you may have already seen the formula for the GMM. LDA is a supervised learning technique where the class priors are known and the class means and covariances can be estimated using training data. LDA works much like the Gaussian mixture model by estimating an <em>a posteriori</em> probability <a href="http://statweb.stanford.edu/~tibs/ElemStatLearn/" title="Elements of Statistical Learning, Trevor Hastie, Robert Tibsharani, Jerome Friedman, 107-08.">(Elements of Statistical Learning 107-08).</a></p>
<p>LDA takes the form: <span class="math display">\[r_{ic} = \frac{\pi_c N(x_i: \mu_c, \Sigma_c)}{\sum_{c'}{\pi_{c'}N(x_i: \mu_{c'},\Sigma_{c'})}}\]</span> where <span class="math inline">\(\pi_c\)</span> is a prior for each class estimated from our training data and <span class="math inline">\(N(x_i:\mu_c,\Sigma_c)\)</span> is the density function given a mean, <span class="math inline">\(\mu_c\)</span>, and covariance matrix, <span class="math inline">\(\Sigma_c\)</span>. The denominator is the sum of each class prior multiplied by its probability density, which effectively acts as a normalizing constant so that <span class="math inline">\(\sum_c{r_{ic}}=1\)</span>. When we compute a class-conditional probability for each observation, we are given <span class="math inline">\(c\)</span> probabilities that the observation <span class="math inline">\(x_i\)</span> belongs to the class <span class="math inline">\(c\)</span>. We choose the class which maximizes that probability <a href="http://statweb.stanford.edu/~tibs/ElemStatLearn/" title="Elements of Statistical Learning, Trevor Hastie, Robert Tibsharani, Jerome Friedman, 106.">(Elements, 107).</a></p>
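<p>The posterior formula above can be sketched in a few lines of R. This is a minimal one-dimensional illustration with two classes; the priors, means, and shared standard deviation are made-up numbers for illustration only:</p>

```r
# Hypothetical 1-D setup: two classes with known priors and means.
pi_c <- c(0.6, 0.4)   # class priors (assumed)
mu_c <- c(0, 3)       # class means (assumed)
sigma <- 1            # shared standard deviation (assumed)
x <- 1.2              # one observation
# Numerator: prior times class density; the denominator normalizes so
# the posteriors sum to 1.
dens <- dnorm(x, mean = mu_c, sd = sigma)
r <- (pi_c * dens) / sum(pi_c * dens)
# Classify x to the class with the largest posterior.
which.max(r)
```

<p>The same computation extends to the multivariate case by swapping <code>dnorm</code> for a multivariate density.</p>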
</div>
<div id="expectation-maximization" class="section level3">
<h3>Expectation Maximization</h3>
<p>There are times, however, when the class for each observation is unknown and we wish to estimate them. When this is the case, we can use the gaussian mixture model and the <em>Expectation-Maximization algorithm (EM)</em>.</p>
<p>The EM algorithm is a two-step process. First is the <em>E-step</em>, where the expectation is calculated. For the Gaussian mixture model, we use the same form of Bayes’ theorem to compute the expectation as we did with LDA. The equation we end up using is the same:</p>
<p><span class="math display">\[r_{ic} = \frac{\pi_c N(x_i: \mu_c, \Sigma_c)}{\sum_{c'}{\pi_{c'}N(x_i: \mu_{c'},\Sigma_{c'})}}\]</span> In this form of Bayes’ theorem, <span class="math inline">\(\pi_c\)</span> is a vector of mixing components for the Gaussian density where <span class="math inline">\(\sum{\pi_c} = 1\)</span>. <span class="math inline">\(N(x_i: \mu_c, \Sigma_c)\)</span> is notation for our probability density function, where we compute the probability of <span class="math inline">\(x_i\)</span> given <span class="math inline">\((\mu_c, \Sigma_c)\)</span>. The denominator is the sum of the priors multiplied by the Gaussian probabilities. An <span class="math inline">\(r_i\)</span> will be computed for each row-vector, <span class="math inline">\(x_i\)</span>, and each mixing component, <span class="math inline">\(\pi_c\)</span>. Once a probability for each class has been computed, choose the most likely class.</p>
</div>
<div id="the-multivariate-guassian-distribution" class="section level3">
<h3>The Multivariate Gaussian Distribution</h3>
<p>First, we need to define <span class="math inline">\(N(\mu, \Sigma)\)</span>, our Gaussian density function. The normal distribution can take on a univariate or multivariate form. For the example below, I will use the multivariate form:</p>
<p><span class="math display">\[N(\mu_k, \Sigma_k) = \frac{e^{-\frac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k) }}{\sqrt{|2\pi\Sigma_k|}}\]</span></p>
<p>The following R code uses the same equation to calculate the multivariate probability density.</p>
<pre class="r"><code># Multivariate normal PDF: given a matrix x, a mean vector mu, and a
# covariance matrix sigma, calculate the probability density for each
# row of x using apply. Returns a column vector of probabilities.
mvpdf <- function(x, mu, sigma) {
    if (det(sigma) == 0) {
        warning("Determinant is equal to 0.")
    }
    apply(x, 1, function(xi) {
        exp(-(1/2) * (t(xi) - mu) %*% MASS::ginv(sigma) %*% t(t(xi) - mu)) /
            sqrt(det(2 * pi * sigma))
    })
}</code></pre>
<p>With this function, we can estimate the probability that a point lies within a distribution with parameters <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\Sigma\)</span>. Gaussian mixture models work well when the class densities are clearly separated and well defined. So, as long as the classes do not overlap and the data is truly normally distributed, we can find a good estimate.</p>
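<p>As a quick sanity check, in one dimension the formula above should collapse to the familiar univariate normal density. The snippet below evaluates the same expression <code>mvpdf</code> uses, inline with a 1×1 covariance matrix, and compares it with base R’s <code>dnorm</code>:</p>

```r
# One-dimensional check of the multivariate density formula: with a
# 1x1 "covariance" matrix it should equal dnorm(x, mu, 1).
x <- 1.5
mu <- 0
sigma <- matrix(1)  # 1x1 covariance matrix
p <- exp(-0.5 * (x - mu)^2 / sigma[1, 1]) / sqrt(det(2 * pi * sigma))
all.equal(p, dnorm(x, mean = mu, sd = 1))
```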
</div>
<div id="getting-initial-parameters" class="section level3">
<h3>Getting Initial Parameters</h3>
<p>Gaussian mixture models assume that each latent class has a different set of means and covariances. However, since each class is unknown, we must begin by initializing these parameters and iteratively updating them. Initialization methods are an important step in mixture modeling which can greatly affect the consistency and accuracy of your results. Thus, it’s worth briefly mentioning a few techniques.</p>
<p>One approach is to randomly sample <span class="math inline">\(k\)</span> number of means and covariances within the range of our data; however, since the EM algorithm is a “hill climbing” (i.e. maximization) algorithm, randomly choosing your starting points can alter performance. Given certain random values, my custom algorithm created for this article would often fail to produce an estimate and R would begin to return <em>NA</em> values after several iterations. The R package <em>mclust</em> addresses the issue by selecting initial means and covariances through the application of hierarchical clustering – an unsupervised clustering technique which iteratively collects points/groups together until the desired number of clusters is found. After classifying each point to a cluster, we can initialize our mixture, mean, and covariance parameters within those groups (<a href="http://projecteuclid.org/download/pdfview_1/euclid.ssu/1272547280#page=7" title="Volodymyr Melnykov and Ranjan Maitra, Statistics Surveys: Vol 4 (2010), Finite Mixture Models and Model-Based Clustering, 86.">Melnykov, Maitra, 2010</a>).</p>
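<p>As a point of comparison, here is one simple alternative initialization, sketched with base R’s <code>kmeans</code> rather than mclust’s hierarchical clustering; the cluster count of 3 is an assumption matching the iris example used later in this post:</p>

```r
# Assign provisional classes with k-means, then initialize the mixing
# components, means, and covariances within each cluster.
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3)
pi0 <- as.numeric(table(km$cluster)) / nrow(iris)      # initial mixing components
mu0 <- km$centers                                      # initial class means
sigma0 <- lapply(split(iris[, 1:4], km$cluster), cov)  # initial covariances
```

<p>Any reasonable partition works as a starting point; the EM iterations then refine these estimates.</p>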
</div>
<div id="maximization-step" class="section level3">
<h3>Maximization Step</h3>
<p>Once the E-step has been completed, we need to maximize our results. Listed below is each equation we use during the maximization step: <span class="math display">\[m_c = \sum_i{r_{ic}}\]</span> <span class="math display">\[\pi_c=\frac{m_c}{m}\]</span> <span class="math display">\[\mu_c = \frac{1}{m_c} \sum_i{r_{ic}x_i}\]</span> <span class="math display">\[\Sigma_c = \frac{1}{m_c} \sum_i{r_{ic}(x_i - \mu_c)^T(x_i-\mu_c)}\]</span> Now for an explanation of what is happening here. Our first task is to update the mixing components (i.e. the prior probabilities). For each point <span class="math inline">\(x_i\)</span>, we will have calculated an <span class="math inline">\(r_{ic}\)</span>, which gives us one column vector per class. Our first equation, for <span class="math inline">\(m_c\)</span>, tells us the total responsibility assigned to each class. Since <span class="math inline">\(\sum{m_c}=N\)</span>, where <span class="math inline">\(N\)</span> is the number of rows in our dataset, we can update the proportions using the second equation, for <span class="math inline">\(\pi_c\)</span>, which updates the mixing components. <span class="math inline">\(\mu_c\)</span> and <span class="math inline">\(\Sigma_c\)</span> are updates to our Gaussian density parameters: we calculate the mean and covariances as usual, but weight each point by <span class="math inline">\(r_{ic}\)</span> and normalize by <span class="math inline">\(\frac{1}{m_c}\)</span>.</p>
<p>When the maximization step is complete, we return to the beginning and repeat the process until a maximum is found. When this happens, the class means and covariances will not be greatly altered with each iteration.</p>
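<p>The two steps and the stopping rule can be put together in a small, self-contained sketch. Everything here is a one-dimensional toy on simulated data (two normals with true means 0 and 5), not the iris example worked through below:</p>

```r
# Simulated data from two univariate normals.
set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))
# Deliberately rough starting values.
pi_c <- c(0.5, 0.5); mu <- c(-1, 1); sd_c <- c(1, 1)
ll_old <- -Inf
repeat {
  # E-step: responsibilities via Bayes' theorem.
  d <- cbind(pi_c[1] * dnorm(x, mu[1], sd_c[1]),
             pi_c[2] * dnorm(x, mu[2], sd_c[2]))
  r <- d / rowSums(d)
  # M-step: weighted updates of mixing components, means, and variances.
  mc <- colSums(r)
  pi_c <- mc / length(x)
  mu <- colSums(r * x) / mc
  sd_c <- sqrt(colSums(r * outer(x, mu, "-")^2) / mc)
  # Stop when the log-likelihood barely improves.
  ll <- sum(log(rowSums(d)))
  if (ll - ll_old < 1e-8) break
  ll_old <- ll
}
sort(mu)  # should land near the true means, 0 and 5
```

<p>The loop mirrors the structure of the multivariate algorithm: compute responsibilities, re-estimate the parameters with those weights, and watch the log-likelihood flatten out.</p>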
</div>
<div id="code-and-example" class="section level3">
<h3>Code and Example</h3>
<pre class="r"><code># Plot our dataset.
plot(iris[, 1:4], col = iris$Species, pch = 18, main = "Fisher's Iris Dataset")</code></pre>
<p><img src="#####../content/post/2017-02-15---gaussian_mixture_modeling_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>If you have used R long enough, you’re probably familiar with <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set" title="Wikipedia, Iris flower data set.">Fisher’s iris dataset.</a> There are three species of flowers measured on four continuous variables. In this dataset, we have a good mix of variables that explain the differences between the classes (Petal Length and Petal Width) and those which show the classes are mixed (Sepal Length and Sepal Width).</p>
<p>Now let’s assume that we do not know which class each <span class="math inline">\(x_i\)</span> belongs to. Gaussian mixture modeling can help us determine each distinct species of flower.</p>
<p>Let’s start by initializing the parameters. The code below borrows from the mclust package by using its hierarchical clustering technique to help create better estimates for our means.</p>
<pre class="r"><code># Mclust comes with a method of hierarchical clustering. We'll
# initialize 3 different classes.
initialk <- mclust::hc(data = iris[, 1:4], modelName = "EII")
initialk <- mclust::hclass(initialk, 3)
# First split by class and calculate column-means for each class.
mu <- split(iris[, 1:4], initialk)
mu <- t(sapply(mu, colMeans))
# Covariance Matrix for each initial class.
cov <- list(diag(4), diag(4), diag(4))
# Mixing Components
a <- runif(3)
a <- a/sum(a)</code></pre>
<p>Next we’ll use the equations defined above to calculate expectation.</p>
<pre class="r"><code># Calculate the PDF under each class's mean and covariance.
z <- cbind(mvpdf(x = iris[, 1:4], mu = mu[1, ], sigma = cov[[1]]),
           mvpdf(x = iris[, 1:4], mu = mu[2, ], sigma = cov[[2]]),
           mvpdf(x = iris[, 1:4], mu = mu[3, ], sigma = cov[[3]]))
# Expectation step for each class.
r <- cbind((a[1] * z[, 1])/rowSums(t(t(z) * a)),
           (a[2] * z[, 2])/rowSums(t(t(z) * a)),
           (a[3] * z[, 3])/rowSums(t(t(z) * a)))
# Choose the highest row-wise probability.
eK <- factor(apply(r, 1, which.max))</code></pre>
<p>Now let’s begin the maximization step and update our parameter estimates.</p>
<pre class="r"><code># Total responsibility assigned to each class.
mc <- colSums(r)
# Update the mixing components.
a <- mc/NROW(iris)
# Update the means.
mu <- rbind(colSums(iris[, 1:4] * r[, 1])/mc[1],
            colSums(iris[, 1:4] * r[, 2])/mc[2],
            colSums(iris[, 1:4] * r[, 3])/mc[3])
# Update the covariance matrices. Each row of deviations is weighted
# once by r[, c], matching the equation for Sigma_c above.
d1 <- t(apply(iris[, 1:4], 1, function(x) x - mu[1, ]))
d2 <- t(apply(iris[, 1:4], 1, function(x) x - mu[2, ]))
d3 <- t(apply(iris[, 1:4], 1, function(x) x - mu[3, ]))
cov[[1]] <- t(d1) %*% (r[, 1] * d1)/mc[1]
cov[[2]] <- t(d2) %*% (r[, 2] * d2)/mc[2]
cov[[3]] <- t(d3) %*% (r[, 3] * d3)/mc[3]</code></pre>
</div>
<div id="log-likelihood-maximization" class="section level3">
<h3>Log-Likelihood Maximization</h3>
<p>During each iteration, the means are updated in a way that increases the log-likelihood, but the log-likelihood experiences diminishing returns with each iteration of the EM algorithm. Convergence occurs as the marginal gain approaches zero (i.e. the change from another iteration is small). If we calculate the log-likelihood at each iteration, we can estimate the most cost-effective stopping point for estimating our mixtures. The log-likelihood is defined as:</p>
<p><span class="math display">\[\log L = \sum_i{\log \sum_c{\pi_c N(x_i: \mu_c,\Sigma_c)}}\]</span> In other words, we compute the probability density of each Gaussian, combine them into one mixture density for each point, and take the log. The sum of those logs over all the points is our likelihood estimate. The Mclust package takes the log-likelihood estimate and calculates the Bayesian Information Criterion (BIC) as a metric for goodness of fit. The BIC equation is defined as:</p>
<p><span class="math display">\[BIC=-2ln \hat{L} + k * ln(n)\]</span></p>
<p>For our particular example using iris, we compute: <span class="math display">\[ -2 * loglike + 14 * log(150) \]</span> where <span class="math inline">\(k\)</span> is the <a href="http://stats.stackexchange.com/questions/103539/meaning-of-number-of-parameters-in-aic" title="Excellent explanation for choosing a value of k on StackOverflow.">number of free parameters in our model</a> and <span class="math inline">\(n\)</span> is our sample size. We can compute a number of different models, each with a different number of classes, and compare the BIC. Note the sign convention: in the form written above, the best model minimizes the BIC, while mclust reports the BIC with the opposite sign and therefore maximizes it. Below is a quick review of the code used to calculate BIC.</p>
<pre class="r"><code># Compute the mixture density for each point, take the log, and sum
# over the points to get the log-likelihood.
loglik <- sum(log(rowSums(t(t(z) * a))))
# BIC is calculated using the equation above.
bic <- -2 * loglik + 14 * log(NROW(iris))
# After every iteration we can plot, assuming each iteration's
# c(iteration, BIC, loglik) has been stored as a row of the matrix `bic`.
par(mfrow = c(1, 2))
plot(bic[, c(1, 2)], type = "l", lwd = 2, col = "red", main = "BIC",
     xlab = "Iterations", ylab = "Score")
plot(bic[, c(1, 3)], type = "l", lwd = 2, col = "blue", main = "Log-Likelihood",
     xlab = "Iterations", ylab = "Score")</code></pre>
<p><img src = "/images/2017-09-15-gmm/log_like_bic.png"></img></p>
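<p>To make the model comparison described above concrete, here is a toy sketch of choosing among candidate models with the BIC formula; the log-likelihoods and parameter counts are made-up numbers for illustration, not values from the iris fit:</p>

```r
# Hypothetical log-likelihoods and free-parameter counts for three
# models fit to n = 150 observations.
loglik <- c(m1 = -420, m2 = -350, m3 = -310)
k <- c(m1 = 9, m2 = 14, m3 = 19)
n <- 150
bic <- -2 * loglik + k * log(n)
# In this -2*logLik + k*log(n) form, the preferred model minimizes BIC.
names(which.min(bic))
```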
</div>
<div id="putting-it-all-together" class="section level3">
<h3>Putting it all together</h3>
<p>The best way to see the EM algorithm/Gaussian mixture model in action is to visualize each iteration. In the beginning, the updates make a large difference and cause several of the class assignments to change. As the iterations continue, the EM algorithm experiences diminishing marginal returns until the change after another iteration is negligible.</p>
<p><img src = "/images/2017-09-15-gmm/gmm_animation.gif"></img></p>
</div>
<div id="comparison-with-the-mclust-package-and-lda" class="section level3">
<h3>Comparison with the Mclust package and LDA</h3>
<p>Simply using the Mclust package provides you with a richer and more rigorous analysis, but hopefully you now have a better reference for what the algorithm is doing. Training the model with Mclust is easy and requires only two parameters: the data and the number of clusters.</p>
<pre class="r"><code># Load the package
library(mclust)
# Select 4 continuous variables and look for three distinct groups.
mcl.model <- Mclust(iris[, 1:4], 3)
# Plot our results.
plot(mcl.model, what = "classification", main = "Mclust Classification")</code></pre>
<p><img src="#####../content/post/2017-02-15---gaussian_mixture_modeling_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>Let’s compare the results of Mclust to our custom algorithm first, and then look at how well Mclust recovered the actual Species classes.</p>
<div id="mclust-v.s.-custom-gaussian-mixture-model" class="section level4">
<h4>Mclust v.s. Custom Gaussian Mixture Model</h4>
<table style="width:38%;">
<colgroup>
<col width="12%" />
<col width="12%" />
<col width="12%" />
</colgroup>
<thead>
<tr class="header">
<th>Class 1</th>
<th>Class 2</th>
<th>Class 3</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>50</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="even">
<td>0</td>
<td>45</td>
<td>0</td>
</tr>
<tr class="odd">
<td>0</td>
<td>0</td>
<td>55</td>
</tr>
</tbody>
</table>
<p>A confusion matrix shows us that the two models performed equally well with no differences in classification.</p>
</div>
<div id="modeled-class-v.s.-actual-class" class="section level4">
<h4>Modeled Class v.s. Actual Class</h4>
<table style="width:49%;">
<colgroup>
<col width="12%" />
<col width="18%" />
<col width="18%" />
</colgroup>
<thead>
<tr class="header">
<th align="center">setosa</th>
<th align="center">versicolor</th>
<th align="center">virginica</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center">50</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr class="even">
<td align="center">0</td>
<td align="center">45</td>
<td align="center">0</td>
</tr>
<tr class="odd">
<td align="center">0</td>
<td align="center">5</td>
<td align="center">50</td>
</tr>
</tbody>
</table>
<p>When we look at how our model identifies the three species of flower, we can see that it does fairly well in estimating the parameters. As the matrix shows, only five observations were misclassified in our sample of 150.</p>
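<p>Cross-tabulations like the two tables above are straightforward to produce with base R’s <code>table</code>. A toy sketch with made-up labels (not the actual iris results):</p>

```r
# Hypothetical predicted and actual class labels.
predicted <- c("a", "a", "b", "b", "b")
actual <- c("a", "a", "b", "b", "a")
# Rows are predicted classes, columns are actual classes.
confusion <- table(Predicted = predicted, Actual = actual)
confusion
```

<p>With the real models, the same call would cross-tabulate the fitted classifications against <code>iris$Species</code>.</p>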
</div>
</div>
<div id="resources" class="section level3">
<h3>Resources</h3>
<p>I hope that’s enough to pique your interest in these types of models. There’s a lot more that could be explained, but what has been written here is, I hope, a good overview of the whole process. If you’re interested in more of the technical details, I would highly recommend:</p>
<ul>
<li><p><a href="https://www.youtube.com/watch?v=Rkl30Fr2S38&index=119&list=PLD0F06AA0D2E8FFBA" title="Mathematical Monk: Gaussian Mixture Modeling">Mathematical Monk: Gaussian Mixture Modeling</a></p></li>
<li><p><a href="http://statweb.stanford.edu/~tibs/ElemStatLearn/" title="Elements of Statistical Learning, Trevor Hastie, Robert Tibsharani, Jerome Friedman, 107-08.">Elements of Statistical Learning</a></p></li>
</ul>
</div>
Exploring Fitness Data in R
/2016/09/05/exploring-fitness-data-in-r/
Mon, 05 Sep 2016 20:14:00 -0500
<p>Back in late February, I purchased a Microsoft Band 2. While I don’t exercise every day, I thought it would be fun to track the few health activities I can record. After five months, I’ve got quite a bit of daily step data and a few miles of jogging under my belt. I thought I would begin an exploratory analysis and see what kinds of information I could visualize to learn more about my health activities.</p>
<p>The Microsoft Health Dashboard–where all your health data is stored–is designed to export daily summaries for all activities in a comma-separated format. However, if you know a little bit of web scraping, you can get at the much more granular data used to create the dashboards online. I’ve used this same process in my work projects to get information that is not ready for export. Since many websites serve information this way, the technique has applications beyond this one particular project. I will detail the technique I used to retrieve that information and create a few exploratory visuals.</p>
<div id="the-data" class="section level3">
<h3>The Data</h3>
<p>For this analysis, I used R to retrieve (httr, jsonlite), clean (tidyr), summarize (dplyr) and visualize (ggplot2) the data. I used ProjectTemplate to manage my scripts and raw data along with RMarkdown to create this post.</p>
<p>First, you will need to retrieve the data. When you log into your Microsoft Health account and browse to any of your activities (sleep, steps, running, etc.), hourly summaries of the day are displayed in one of the dashboard’s visuals. If you open up your browser’s developer tools, you can see the URL request your browser made to retrieve that information before it was displayed. We’re going to mimic those browser requests and clean up the responses, creating a tidy data frame which can be used for exploration. You’ll need the request URL and the cookie your browser used when it made that request. With those two pieces, you can retrieve a single day’s worth of information.</p>
<p>Getting more than a single day requires us to manipulate the original URL’s parameters. Luckily for us, the only thing we need to change is the date. Using base R’s seq.Date function, we can create a sequence of dates in a format that the server will accept. Once a list of dates has been created, the paste function concatenates the full request URL for each generated date. httr uses the concatenated URL to make a server request. The day’s activities are returned in JSON format, which can be parsed by jsonlite and converted into a data frame. The final step binds each daily request into one data frame called “Activities” using rbind. When used inside of a for-loop, the hourly step count for many days can be obtained. Below is the full code I used to get this information.</p>
<pre class="r"><code>library(httr)
library(jsonlite)
library(lubridate)
# Paste the cookie string copied from your browser's developer tools.
cookie = "**Insert Browser Cookie**"
DateRange = as.character(seq.Date(from = as.Date("2016-01-01"), to = as.Date("2016-08-05"), by = 1))
Activities = data.frame()
for(i in DateRange){
  request.url = paste('https://dashboard.microsofthealth.com/card/getuseractivitybyhour?date=', i, '&utcOffsetMinutes=-240', sep="")
  page = GET(request.url, add_headers('Cookie' = cookie))
  page = fromJSON(content(page, as = 'text'))
  page = page$ActivityList
  Activities = rbind(Activities, page)
}</code></pre>
<p>Once the information is retrieved, we’ll want to extract various time details using the lubridate package.</p>
<pre class="r"><code># Time formats.
Activities$TimeOfDay = strptime(Activities$TimeOfDay, "%m/%d/%Y %X", tz = "EST")
Activities$TimeOfDay = as.POSIXct(Activities$TimeOfDay)
Activities$Hour = hour(Activities$TimeOfDay)
Activities$Wday = wday(Activities$TimeOfDay, label = TRUE)
Activities$Month = month(Activities$TimeOfDay, label = TRUE)
Activities$Day = as.Date(Activities$TimeOfDay)</code></pre>
<p>Using dplyr we can get summaries of daily activities from the hourly data we originally retrieved.</p>
<pre class="r"><code>library(tibble)
library(tidyr)
library(dplyr)
Activities = as_tibble(Activities)
StepSum = Activities %>%
filter(StepsTaken > 0) %>%
group_by(Day,Wday) %>%
summarize(DailyStepCount = sum(StepsTaken)) %>%
ungroup() %>%
mutate(AverageDaily = mean(DailyStepCount))</code></pre>
</div>
<div id="visualizing-daily-step-count-with-ggplot2" class="section level3">
<h3>Visualizing Daily Step Count with Ggplot2</h3>
<p>The data you get from the dashboard is pretty clean out of the box, so we can quickly move to visualizing what we have with ggplot2. One great thing about all this information is firsthand knowledge: since I am the one who collected, monitored, and analyzed it, I have a pretty good understanding of what to expect from it. The first chart visualizes average daily steps and the events that contributed to better step days.</p>
<pre class="r"><code># Packages for Visualization
library(ggplot2)
library(dplyr)
library(tibble)
library(viridis)
library(scales)
library(extrafont); library(extrafontdb)
library(tidyr)
library(magrittr)
# A dataframe with events during the year.
Event_Dates = data.frame(Event = c(1:4),
Begin = rep(as.Date(c("2016-02-29", "2016-05-01", "2016-05-16", "2016-06-05")),2),
End = rep(as.Date(c("2016-03-06", "2016-05-07","2016-05-21", "2016-08-01")),2),
y_bottom = c(0,0,10000,4700),
y_top = c(10000,5000,19000,12500),
Label = c("Spring Vacation", "Graduation","New York City", "Summer Hikes and Runs"))
# Daily Step Count Visual
ggDailySteps = ggplot() +
geom_point(data = StepSum, size = 3, aes(Day, DailyStepCount),color = 'royalblue3') +
labs(title = "My Daily Steps in 2016\n", x = NULL, y = "Daily Step Count\n") +
scale_color_viridis(option = "C", discrete = TRUE) +
guides(color = guide_legend(title = 'Mean Daily Step Count', color = 'black', label = FALSE, size = 1,override.aes = list(color = 'black', size = 1))) +
scale_y_continuous(labels = comma, breaks = c(0,2500,5000,7500,10000,12500,15000)) +
theme(text = element_text(family = "Georgia", color = 'grey10'),
plot.title = element_text(size = 24, hjust = -0.01),
panel.background = element_rect(fill = 'antiquewhite'),
panel.grid = element_blank(),
plot.margin = unit(c(0.4,0.4,0.4,0.4), 'cm'),
plot.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
legend.text = element_text(color = 'grey10', size = 14),
legend.key = element_rect(fill = 'antiquewhite',color = 'antiquewhite',size = 2),
legend.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
legend.position = c(0.8,1.11),
legend.direction = 'horizontal',
axis.text = element_text(size = 16, color = 'grey10'),
axis.title = element_text(size = 16),
axis.ticks = element_blank()) +
geom_rect(data = Event_Dates, fill = 'grey80', alpha = .2,aes(xmin = Begin-3, xmax = End+3, ymin = y_bottom, ymax = y_top)) +
geom_segment(color = 'grey60', size = 1.3, linetype = 3, aes(x = Event_Dates$End[7]+4, xend = (Event_Dates$End[7]+15), y = 16500, yend = 18000)) +
geom_text(label = "New York City Vacation",size = 5, color = 'grey30', aes(x = (Event_Dates$End[6]+60), y = 18755)) +
geom_segment(size = 1.3,linetype = 3,color = 'grey60',aes(x = Event_Dates$End[6]-5, xend = Event_Dates$End[6]-15, y = 5000, yend = 7600)) +
geom_text(label = 'Graduation Week', color = 'grey30', size = 5, aes(y = 8650, x= Event_Dates$End[7]-32)) +
geom_segment(aes(x = Event_Dates$End[1]-7, xend = (Event_Dates$End[1]-30), y = 10000, yend = 11600),color = 'grey60', size = 1.3, linetype = 3) +
geom_text(label = "Spring Vacation", color = 'grey30',size = 5, aes(y = 12500, x=Event_Dates$End[1]-30)) +
geom_text(label = "Summer Hikes\nWalks and Runs",color = 'grey30', size = 5, aes(x = as.Date("2016-07-03"), y = 13700, family = "Georgia")) +
geom_rug(data = StepSum,aes(Day, DailyStepCount), sides = 'l', color = 'grey60') +
geom_segment(data = StepSum, linetype = 2, aes(y = AverageDaily,yend = AverageDaily, x = as.Date("2016-01-01"), xend = as.Date("2016-08-01"),color = 'grey60'), size = 1.4, show.legend = TRUE)
ggDailySteps</code></pre>
<p><img src="#####../content/post/2016-09-03---analyzing_fitness_data_r_files/figure-html/unnamed-chunk-4-1.png" width="960" /></p>
<p>I’m pleased with what I was able to do with this visual. The ggplot2 package does a good job of simplifying the plotting process. Once you get familiar with all the options you have at your fingertips, complex visualizations become much easier. The most difficult things tend to be the little design details, like the length of the line segments and the placement of the annotations. Accomplishing the same details may be easier in a dedicated design program like Inkscape or Photoshop, but I am pleased with how much I could accomplish using R alone. Plus, since the initial plotting work is complete, I can continue to add events and replot with minimal effort.</p>
</div>
<div id="hourly-heart-rate." class="section level3">
<h3>Hourly Heart Rate</h3>
<p>Visualizing my heart rate took much less time to create since it contained fewer annotations. Plus, I had already established the basic thematic elements in my previous plot. The resulting graph shows roughly the number of times I have exercised since purchasing my band.</p>
<pre class="r"><code>ggAverageHeart = ggplot(subset(Activities, AverageHeartRate > 0), aes(Day, AverageHeartRate)) +
geom_line() +
labs(x = NULL, y = "Average Rate\n", title = "Average Hourly Heart Rate in 2016\n") +
theme(text = element_text(family = "Georgia", color = 'grey15'),
plot.title = element_text(size = 26, hjust = -0.01),
panel.background = element_rect(fill = 'antiquewhite'),
panel.grid = element_blank(),
plot.margin = unit(c(0.4,0.4,0.4,0.4), 'cm'),
plot.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
legend.text = element_text(color = 'grey10', size = 14),
legend.key = element_rect(fill = 'antiquewhite',color = 'antiquewhite',size = 2),
legend.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
legend.position = c(.75,1.05),
legend.direction = 'horizontal',
axis.text = element_text(size = 16, color = 'grey10'),
axis.title = element_text(size = 16),
axis.ticks = element_blank()) +
stat_smooth(color = 'royalblue3', size = 1.3) +
geom_segment(linetype = 3, size = 1.3,color = 'grey40', aes(x = as.Date("2016-04-10"), xend = as.Date("2016-04-20"), y = 158, yend = 164)) +
geom_text(label = "High heart rate indicates a workout", aes(x = as.Date("2016-05-15"), y = 170, family = "Georgia"), size = 4, color = 'grey15')
ggAverageHeart</code></pre>
<pre><code>## `geom_smooth()` using method = 'gam'</code></pre>
<p><img src="#####../content/post/2016-09-03---analyzing_fitness_data_r_files/figure-html/unnamed-chunk-5-1.png" width="960" /></p>
</div>
<div id="running-performance" class="section level3">
<h3>Running Performance</h3>
<p>The Microsoft Health dashboard provides many metrics to help you understand and track your exercise performance; however, it does not let you compare all of your runs together under one graph. To create a single visual, I’ll retrieve the running information with the same method I used previously and weed out runs longer than three miles. The graph I chose to create looks at the relationship between distance along a run (between 0 and 3 miles) and my pace at that given distance. Below is the code I used to create it.</p>
<pre class="r"><code>ggRunRate = ggplot(subset(Runs, Pace > 0), aes(TotalDistance, Pace)) +
geom_point() +
labs(x = "\nTotal Distance\n(miles)", y = "Pace\n", title = "Pace over the Course of a Run\n") +
theme(text = element_text(family = "Georgia", color = 'grey15'),
plot.title = element_text(size = 26, hjust = -0.01),
panel.background = element_rect(fill = 'antiquewhite'),
panel.grid = element_blank(),
plot.margin = unit(c(0.4,0.4,0.4,0.4), 'cm'),
plot.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
legend.text = element_text(color = 'grey10', size = 14),
legend.key = element_rect(fill = 'antiquewhite',color = 'antiquewhite',size = 2),
legend.background = element_rect(fill = 'antiquewhite', color = 'antiquewhite'),
legend.position = c(.75,1.05),
legend.direction = 'horizontal',
axis.text = element_text(size = 16, color = 'grey10'),
axis.title = element_text(size = 16),
axis.ticks = element_blank()) +
stat_smooth(color = 'royalblue3', size = 1.3) +
scale_x_continuous(breaks = c(0,0.5,1,1.5,2,2.5,3))
ggRunRate</code></pre>
<p><img src="#####../content/post/2016-09-03---analyzing_fitness_data_r_files/figure-html/unnamed-chunk-6-1.png" width="960" style="display: block; margin: auto auto auto 0;" /></p>
<p>As I would expect, pace generally decreases as distance increases. Adding the loess smoother highlights a few specific running trends. My pace starts out strong and gradually declines over the course of three miles. Towards the end, I might pick up speed and finish stronger.</p>
<p>As the weather cools off, I hope to begin more consistent runs and measure performance during that period of time. I would expect/hope that the decreasing trend over the length of the run would flatten and the average pace rises, but that will have to be tested at another time. For now, this project is only a short exercise in exploratory data analysis.</p>
</div>
About
/about/
Thu, 05 May 2016 21:48:51 -0700
<style>
#col {
-moz-column-count: 2;
-webkit-column-count: 2;
column-count: 2;
}
</style>
<div>
<p>This GitHub site is dedicated to my data analysis projects. Each page was generated using <a href= 'http://rmarkdown.rstudio.com'>RMarkdown</a> and <a href = "http://github.com/rstudio/blogdown">Blogdown</a>. <br> <br> <br> Look for other great content on <a href="https://r-bloggers.com">R-Bloggers.</a></p>
</div>