When performing data analysis, we often need to “reshape” our data from wide format to long format. A common example of wide data is a data structure with one record per subject and multiple columns for repeated measures. For example:
id gender race trt day1 day2 day3 1 F 0 0 19.81310 18.05777 14.84996 2 M 0 0 17.91846 18.75825 15.30547 3 M 0 0 17.22526 19.79218 15.10622
Notice that columns day1, day2 and day3 represent repeated measures for each person.
An alternative way to layout these data is listing one record per subject per day. In this case we have one column indicating day of measure and one column for the measure itself, like this:
id gender race trt day amt 1 F 0 0 day1 19.81310 1 F 0 0 day2 18.05777 1 F 0 0 day3 14.84996 2 M 0 0 day1 17.91846 2 M 0 0 day2 18.75825 2 M 0 0 day3 15.30547 3 M 0 0 day1 17.22526 3 M 0 0 day2 19.79218 3 M 0 0 day3 15.10622
An advantage to having data in long format is that graphing and statistical modeling are frequently made easier. For example, the long format allows us to include a variable for
day in our model. In the wide format, there is no explicit variable for day. It’s actually embedded in the column headers of day1, day2 and day3. Storing data in long format also allows us to quickly subset our data without dropping columns. We could subset the long data above to easily see all amounts on day 2 for males with race = 0 and trt = 1. To do the same with the wide data would mean dropping the day1 and day2 columns after subsetting.
How do you reshape wide data to long? Below we give basic demonstrations using R, SAS, SPSS and Stata to perform the reshaping demonstrated above. In each case we assume you’re starting with a csv file called dat.csv. Feel free to download the file and try the code below. It’s identical to the wide format data displayed above.
While base R has a reshape function, we think it’s easier to use the reshape2 or tidyr packages. The reshape2 package has the
melt function. The basic way to use it is to indicate the id.vars. These are the variables that will remain after reshaping. All variables not listed as id.vars are reshaped. That is, the column headers are turned into variables and the values in the columns are gathered into one column. The tidyr function
gather does the same thing but has a slightly different syntax. The
key argument takes the name of the column that will contain the column headers while the
value argument takes the name of the column that will contain the gathered values. Listed after the arguments are the columns to be reshaped.
dat <- read.csv("dat.csv") install.packages("reshape2") # only need to do once library(reshape2) # need to do every time you start a new R session datL <- melt(dat,id.vars = c("id","gender","race","trt"), variable.name = "day",value.name = "amt") # same with tidyr install.packages("tidyr") library(tidyr) datL <- gather(dat,key = "day",value = "amt",day1:day3)
One way to reshape data in SAS is using PROC TRANSPOSE. List the columns that need to be reshaped in the
var statement. Next in the
by statement list the columns in the wide data that should remain in the long data. The out= in the PROC TRANSPOSE statement creates a new data set called datLong. The
(rename=(col1=Measurement _name_=day)) option renames the new column headers in the long data set. Otherwise SAS provides the default columns names “col1” and “_name_”.
proc import datafile="C:\Users\mst3k\Documents\dat.csv" out=dat replace; getnames=yes; run; proc transpose data=dat out=datLong (rename=(col1=Measurement _name_=day)); var day1-day3; by id gender race trt; run;
In SPSS 22 it’s possible to reshape your data using the Restructure Data Wizard (Data…Restructure…). However we find it easier to simply use syntax. Once you read in the CSV file, open up a new syntax window (File…New…Syntax), type the code below, and then highlight the code and click Ctrl + R to run it. The VARSTOCASES example below (read as “vars to cases”) uses three subcommands: /MAKE, /INDEX and /KEEP. The /MAKE line says to create a column called “amt” from the columns day1 – day3. The /INDEX line says name the new column “day”, which will contain the column headers from the wide data set. The /KEEP line lists which variables from the wide data set are to remain as-is in the long data set.
VARSTOCASES /MAKE amt FROM day1 day2 day3 /INDEX=day(amt) /KEEP=id gender race trt.
Stata offers the
reshape command for restructuring data. To reshape a wide data set long, you have to specify
reshape long. After that you specify the word kernel that the multiple columns we want to reshape have in common. In our case that’s “day” (day1, day2, day3). The Stata reshape command apparently relies on this naming convention. In our example this works fine. However Stata uses that common word kernel as the name of the new column containing the gathered values. So we can’t just name our column containing the day number “day”. Stata throws an error. Therefore below we assign the day values to a column called “time”. Use the i() option to supply the variable that uniquely identifies records in the wide data set and use the j() option to supply the name of the variable that will store the numbers from the day1, day2, and day3 column headers. Afterward we rename our two new columns to match the output from the other examples.
import delimited dat.csv, clear reshape long day, i(id) j(time) rename day amt rename time day
For questions or clarifications regarding this article, contact the UVa Library StatLab: firstname.lastname@example.orgClay Ford
Statistical Research Consultant
University of Virginia Library
July 21, 2013