Clay Ford

Getting Started with Rate Models

Let’s say we’re interested in modeling the number of auto accidents that occur at various intersections within a city. Upon collecting data after a certain period of time perhaps we notice two intersections have the same number of accidents, say 25. Is it correct to conclude these two intersections are similar in their propensity for […]

Modeling Non-Constant Variance

One of the basic assumptions of linear modeling is constant, or homogeneous, variance. What does that mean exactly? Let’s simulate some data that satisfies this condition to illustrate the concept. Below we create a sorted vector of numbers ranging from 1 to 10 called x, and then create a vector of numbers called y that […]

Creating a SQLite database for use with R

When you import or load data into R, the data are stored in Random Access Memory (RAM). This is the memory that is deleted when you close R or shut off your computer. It’s very fast but temporary. If you save your data, it is saved to your hard drive. But when you open R […]

Simulating Data for Count Models

A count model is a linear model where the dependent variable is a count. For example, the number of times a car breaks down, the number of rats in a litter, the number of times a young student gets out of his seat, etc. Counts are either 0 or a postive whole number, which means […]

Simulating a Logistic Regression Model

Logistic regression is a method for modeling binary data as a function of other variables. For example we might want to model the occurrence or non-occurrence of a disease given predictors such as age, race, weight, etc. The result is a model that returns a predicted probability of occurrence (or non-occurrence, depending on how we […]

Assessing Type S and Type M Errors

The paper Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors by Andrew Gelman and John Carlin introduces the idea of performing design calculations to help prevent researchers from being misled by statistically significant results in studies with small samples and/or noisy measurements. The main idea is that researchers often overestimate effect […]

Interpreting Log Transformations in a Linear Model

Log transformations are often recommended for skewed data, such as monetary measures or certain biological and demographic measures. Log transforming data usually has the effect of spreading out clumps of data and bringing together spread-out data. For example, below is a histogram of the areas of all 50 US states. It is skewed to the […]

Getting Started with Matching Methods

NOTE: This article demonstrates how to use propensity scores for matching data. However, propensity scores have come under fire in recent years. King and Nielsen’s 2019 article, Why Propensity Scores Should Not Be Used for Matching, argue that propensity scores increase imbalance, inefficiency, model dependence, and bias. Others argue that while propensity scores may be […]

Getting Started with Moderated Mediation

In a previous post we demonstrated how to perform a basic mediation analysis. In this post we look at performing a moderated mediation analysis. The basic idea is that a mediator may depend on another variable called a “moderator”. For example, in our mediation analysis post we hypothesized that self-esteem was a mediator of student […]

Getting started with Multivariate Multiple Regression

Multivariate Multiple Regression is the method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with […]