From wide to long format in R

In the last tutorial, we saw how to reshape data from long to wide format in R. Today, we will see how to convert a data format from wide to long. Most plotting functions, including those from ggplot2, require data in a long format. Therefore, we will use the reshape and stack functions from base R, the gather and pivot_longer functions from the tidyr package and the melt function from the data.table package.

Some terminology:

  • id column: it is still present in the long format
  • key column: it is a new variable which replaces several others
  • value column: it is a new variable with the content of those several other variables

Load the tidyr and the data.table packages

library("tidyr")
library("data.table")

Load the iris dataset

data("iris")

Example with one id column

The image below shows a theoretical example of a conversion from wide to long format:

from wide to long format in R

a) Variables assignment

id_column = "Petal.Width"
key_column = "Species"
value_column = "Freq"

data_long = as.data.frame(table(iris[, c(id_column, key_column)]),
                          stringsAsFactors = FALSE)

data_wide = reshape(data_long, direction = "wide", idvar = id_column,
                    timevar = key_column)
names(data_wide) = sub(paste0(value_column, "."), "", names(data_wide))
observations for each combination of two variables from the iris dataset in a wide format

b) Functions

1) reshape from base R

data_long = reshape(data_wide, direction = "long", idvar = id_column,
                    timevar = key_column, v.names = value_column,
                    times = setdiff(names(data_wide), id_column),
                    varying = list(setdiff(names(data_wide), id_column)),
                    new.row.names = seq(1:(nrow(data_wide)*(ncol(data_wide)-1))))

2) stack from base R

data_long = cbind(id_column = data_wide[, id_column],
                  stack(data_wide[, !names(data_wide) %in% id_column]))
names(data_long) = sub("ind", key_column, names(data_long))
names(data_long) = sub("values", value_column, names(data_long))

3) gather from tidyr

data_long = gather(data_wide, key = key_column, value = value_column,
                   -all_of(id_column))
names(data_long) = sub("key_column", key_column, names(data_long))
names(data_long) = sub("value_column", value_column, names(data_long))

4) pivot_longer from tidyr

data_long  = pivot_longer(data_wide, names_to = key_column,
                          values_to = value_column, -all_of(id_column))

5) melt from data.table

data_long = melt(as.data.table(data_wide), id.vars = id_column,
                 variable.name = key_column, value.name = value_column)

c) Result

observations for each combination of two variables from the iris dataset in a long format

Remark

The output data structure can be different depending on the function used:

  • the two functions from base R give a data.frame output
  • the two functions from the tidyr package give a tibble output
  • the function from the data.table package give a data.table output

Rows and/or columns order may also differ, but the content will be the same.

Conclusion

In conclusion, we can change data format from wide to long using the reshape and stack functions from base R, the gather and pivot_longer functions from the tidyr package and the melt function from the data.table package. But in my opinion, the easiest ones are the pivot_longer and melt functions. Which function do you use the most?

Related posts

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply