The Intuition Behind Confidence Intervals

Say it with me: An X% confidence interval captures the population parameter in X% of repeated samples. In the course of our statistical educations, many of us had that line (or some variant of it) crammed, wedged, stuffed, and shoved into our skulls until definitional precision was leaking out of noses and pooling on our […]

Post Hoc Power Calculations are Not Useful

It is well documented that post hoc power calculations are not useful (Goodman and Berlin 1994, Hoenig and Heisey 2001, Althouse 2020). Also known as observed power or retrospective power, post hoc power purports to estimate the power of a test given an observed effect size. The idea is to show that a “non-significant” hypothesis […]

Understanding Multiple Comparisons and Simultaneous Inference

When it comes to confidence intervals and hypothesis testing there are two important limitations to keep in mind. The significance level1, \(\alpha\), or the confidence interval coverage, \(1 – \alpha\), only apply to one test or estimate, not to a series of tests or estimates. are only appropriate if the estimate or test was not […]

Understanding Robust Standard Errors

What are robust standard errors? How do we calculate them? Why use them? Why not use them all the time if they’re so robust? Those are the kinds of questions this post intends to address. To begin, let’s start with the relatively easy part: getting robust standard errors for basic linear models in Stata and […]

Simulating Data for Count Models

A count model is a linear model where the dependent variable is a count. For example, the number of times a car breaks down, the number of rats in a litter, the number of times a young student gets out of his seat, etc. Counts are either 0 or a postive whole number, which means […]

Simulating a Logistic Regression Model

Logistic regression is a method for modeling binary data as a function of other variables. For example we might want to model the occurrence or non-occurrence of a disease given predictors such as age, race, weight, etc. The result is a model that returns a predicted probability of occurrence (or non-occurrence, depending on how we […]

Is R-squared Useless?

On Thursday, October 16, 2015, a disbelieving student posted on Reddit My stats professor just went on a rant about how R-squared values are essentially useless, is there any truth to this? It attracted a fair amount of attention, at least compared to other posts about statistics on Reddit. It turns out the student’s stats […]

Simulating Endogeneity

First off, what is endogeneity, and why would we want to simulate it? Endogeneity occurs when a statistical model has an independent variable that is correlated with the error term. The reason we would want to simulate it is to understand what exactly that definition means! Let’s first simulate ideal data for simple linear regression […]

A Rule of Thumb for Unequal Variances

One of the assumptions of the Analysis of Variance (ANOVA) is constant variance. That is, the spread of residuals is roughly equal per treatment level. A common way to assess this assumption is plotting residuals versus fitted values. Recall that residuals are the observed values of your response of interest minus the predicted value of […]