Proportional-odds logistic regression is often used to model an ordered categorical response. By “ordered”, we mean categories that have a natural ordering, such as “Disagree”, “Neutral”, “Agree”, or “Everyday”, “Some days”, “Rarely”, “Never”. For a primer on proportional-odds logistic regression, see our post, Fitting and Interpreting a Proportional Odds Model. In this post we demonstrate […]

## Getting started with the purrr package in R

If you’re wondering what exactly the purrr package does, then this blog post is for you. Before we get started, we should mention the Iteration chapter in R for Data Science by Garrett Grolemund and Hadley Wickham. We think this is the most thorough and extensive introduction to the purrr package currently available (at least […]

## Working with dates and time in R using the lubridate package

Sometimes we have data with dates and/or times that we want to manipulate or summarize. A common example in the health sciences is time-in-study. A subject may enter a study on Feb 12, 2008 and exit on November 4, 2009. How many days was the person in the study? (Don’t forget 2008 was a leap […]

## The Wilcoxon Rank Sum Test

The Wilcoxon Rank Sum Test is often described as the non-parametric version of the two-sample t-test. You sometimes see it in analysis flowcharts after a question such as “is your data normal?” A “no” branch off this question will recommend a Wilcoxon test if you’re comparing two groups of continuous measures. So what is this […]

## Pairwise comparisons of proportions

Pairwise comparison means comparing all pairs of something. If I have three items A, B and C, that means comparing A to B, A to C, and B to C. Given n items, I can determine the number of possible pairs using the binomial coefficient: $$ \frac{n!}{2!(n - 2)!} = \binom{n}{2}$$ Using the R […]
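
As a quick check of the formula, here is the three-item case from the excerpt worked out, using only the binomial coefficient as defined there:

$$ \binom{3}{2} = \frac{3!}{2!(3 - 2)!} = \frac{6}{2 \cdot 1} = 3 $$

That matches the three pairs listed: A to B, A to C, and B to C.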

## A tidyr Tutorial

The tidyr package by Hadley Wickham centers on two functions: gather and spread. If you have struggled to understand what exactly these two functions do, this tutorial is for you. To begin we need to wrap our heads around the idea of “key-value pairs”. The help pages for gather and spread use this terminology to […]

## Getting Started with Factor Analysis

Take a look at the following correlation matrix for Olympic decathlon data calculated from 280 scores from 1960 through 2004 (Johnson and Wichern, p. 499):

| | 100m | LJ | SP | HJ | 400m | 100mH | DS | PV | JV | 1500m |
|---|---|---|---|---|---|---|---|---|---|---|
| 100m | 1.0000 | 0.6386 | 0.4752 | 0.3227 | 0.5520 | 0.3262 | 0.3509 | 0.4008 | 0.1821 | -0.0352 |
| LJ | 0.6386 | 1.0000 | 0.4953 | 0.5668 | 0.4706 | 0.3520 | 0.3998 | 0.5167 | […] | |

## An Introduction to Loglinear Models

Loglinear models model cell counts in contingency tables. They’re a little different from other modeling methods in that they don’t distinguish between response and explanatory variables. All variables in a loglinear model are essentially “responses”. To learn more about loglinear models, we’ll explore the following data from Agresti (1996, Table 6.3). It summarizes responses from […]

## Setting up Color Palettes in R

Plotting with color in R is kind of like painting a room in your house: you have to pick some colors. R has some default colors ready to go, but it’s only natural to want to play around and try some different combinations. In this post we’ll look at some ways you can define new […]

## Getting Started with Hurdle Models

Hurdle Models are a class of models for count data that help handle excess zeros and overdispersion. To motivate their use, let’s look at some data in R. The following data come with the AER package. It is a sample of 4,406 individuals, aged 66 and over, who were covered by Medicare in 1988. One […]