StatLab Articles

Getting Started with Shiny

What is Shiny? Shiny is an R package that facilitates the creation of interactive web apps using R code, which can be hosted locally, on the shinyapps server, or on your own server. Shiny apps can range from extremely simple to incredibly sophisticated. They can be written purely with R code or supplemented with HTML, […]

Databases for Data Scientists

As data scientists, we’re often most excited about the final layer of analysis. Once all the data is cleaned and stored in a format readable by our favorite programming language (Python, R, STATA, etc), the most fun part of our work is when we’re finding counter-intuitive causations with statistical methods. If you can prove that […]

Modeling Non-Constant Variance

One of the basic assumptions of linear modeling is constant, or homogeneous, variance. What does that mean exactly? Let’s simulate some data that satisfies this condition to illustrate the concept. Below we create a sorted vector of numbers ranging from 1 to 10 called x, and then create a vector of numbers called y that […]

Creating a SQLite database for use with R

When you import or load data into R, the data are stored in Random Access Memory (RAM). This is the memory that is deleted when you close R or shut off your computer. It’s very fast but temporary. If you save your data, it is saved to your hard drive. But when you open R […]

Simulating Data for Count Models

A count model is a linear model where the dependent variable is a count. For example, the number of times a car breaks down, the number of rats in a litter, the number of times a young student gets out of his seat, etc. Counts are either 0 or a postive whole number, which means […]

Simulating a Logistic Regression Model

Logistic regression is a method for modeling binary data as a function of other variables. For example we might want to model the occurrence or non-occurrence of a disease given predictors such as age, race, weight, etc. The result is a model that returns a predicted probability of occurrence (or non-occurrence, depending on how we […]

An Introduction to Analyzing Twitter Data with R

In this article, I will walk you through why a researcher or professional might find data from Twitter useful, explain how to collect the relevant tweets and information from Twitter in R, and then finish by demonstrating a few useful analyses (along with accompanying cleaning) you might perform on your Twitter data. Part One: Why […]

Getting Started with Multiple Imputation in R

Whenever we are dealing with a dataset, we almost always run into a problem that may decrease our confidence in the results that we are getting – missing data! Examples of missing data can be found in surveys – where respondents intentionally refrained from answering a question, didn’t answer a question because it is not […]