Merging spreadsheet or csv files into one dataframe - Part 1

By Mariano in File merging

September 7, 2022

This was one of my most frequent tasks. Over the years, I used many different and continuously evolving versions of this workflow to deal with numerous issues arising from differences between files. I couldn’t automate it fully to handle all the problems detailed below; instead, I had adaptations for certain types of files. While my case may seem special, these solutions can be generalized with little or no modification to solve similar tasks.

The situation: there are continuous data series separated into many similar tabular files. In my case they were generated by environmental data loggers, with measurements recorded typically every 5, 10, 30 or 60 minutes and downloaded manually every few weeks. This resulted in several hundred files every year. The original workflow was simply copying and pasting data into a new Excel file, but this was time-consuming and tiring, raising the probability of various errors. Measurement date and time were recorded in one column, with shorter and longer gaps in the timeline caused by stopping measurement at download, especially when sensors connected to the logger had to be repaired or reinstalled, and sometimes by dead batteries or other disturbances. Some loggers automatically switched to daylight saving time and back in “real time”, others did so at the following download. Some loggers did both, causing funny situations, while others didn’t switch time automatically at all. Duplicate measurements also occurred, along with fake recordings between regular logging intervals. Even the file headers, the number and names of columns and the datetime formats differed on occasion for the same type of logger, as we had to use different laptops and different software versions for downloading; otherwise we couldn’t have communicated with some loggers.

About the files: extensions were .xls, .csv and some other text formats, like .dat. Filenames were composed of logger name and download date, usually without space between them, but sometimes with underscore, like “Loggername20151215.csv” or “Other-device_20200423.xls”. They were stored in folders called “DownloadsXXXX”, where “XXXX” was the year of measurements. These naming conventions made the automation of the workflow easier, as we’ll see. For these posts I generated fake data files based on original ones so there is no conflict of interest.

In this first part, we look at the simpler case, when we have .csv files from only one device, with filenames differing only in their 8-digit date part. The files have the same number of columns, with identical names and data types. We assume that, aside from its format, there are no problems with the datetime column, namely that its values continuously follow the regular intervals within each file. Time gaps between files are not a problem.

Let’s start with loading packages. Usually, I simply use library(tidyverse) to load the core packages of the Tidyverse (including dplyr, tidyr, ggplot2 etc., except for the non-automatically loaded lubridate) at once, but here we load them individually.

library(dplyr)
library(readr)
library(tibble)
library(stringr)
library(lubridate)

Then set the working directory, where we have the files.

setwd("path/to/Downloads2020")

We can now get the filenames from our folder with the list.files() function, defining the desired logger name in the pattern argument. The .* part of our pattern is a regular expression (regex), matching any character occurring 0 or more times after “Loggername” and before “csv”. With this we tell the function to list all .csv files whose names begin with “Loggername”, independently of the date part that follows, which is different for each file.

filenames <- list.files(pattern = "Loggername.*csv")

We can see the result by printing our filenames vector.

filenames
## [1] "Loggername20200122.csv" "Loggername20200310.csv" "Loggername20200522.csv"
## [4] "Loggername20200723.csv" "Loggername20200915.csv" "Loggername20201027.csv"
## [7] "Loggername20210108.csv"

There are 7 files with matching names.
We can get one element of this vector with its index number surrounded by [], e.g. filenames[2] gives us the second filename from the vector.

filenames[2]
## [1] "Loggername20200310.csv"

Having the names, let’s create an empty list for our files. We will read the data from the files into this list and store them as separate data.frames in it.

list_for_dataframes <- list()

In the next step we construct a for loop to automatically import data from all of our files, one after another, with the read_csv() function. This means we repeat the same task for each element of the filenames vector: we use [i] as the index to pick the ith filename and pass it as the only argument to read_csv(), since we already set the working directory and there is no need for other arguments like skipping rows or columns.
At the beginning, we define the desired number of rounds in the loop with the seq_along() function, which gives the element indices of the filenames vector and thereby the values of i.
After that comes the body of the loop, surrounded by {} (braces). This is the task itself that will be repeated for each value of i. In this part we tell the loop to read the data from the ith file of the filenames vector into the ith element of list_for_dataframes. Note that we have to use [[i]] (double square brackets) when referring to a list element and [i] (single square brackets) when referring to a vector element.

for (i in seq_along(filenames)) {
  list_for_dataframes[[i]] <- read_csv(filenames[i])
}
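As a quick aside, the [[ ]] versus [ ] difference is easy to see on a toy list (just an illustration, not part of the workflow):

```r
# [[1]] extracts the first element itself; [1] returns a one-element sub-list
toy <- list(a = 1:3, b = letters)
class(toy[[1]])  # "integer"
class(toy[1])    # "list"
```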

Since the list elements don’t have names in this form, we set them based on our filenames. It’s not an essential step, but this way we can easily recognize our data.frames when checking the contents of our list.

names(list_for_dataframes) <- filenames

We can get the contents of the list and the main details of its elements with the glimpse() function to check if everything went well.

glimpse(list_for_dataframes)
## List of 7
##  $ Loggername20200122.csv: spec_tbl_df [4,896 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##   ..$ Datetime: POSIXct[1:4896], format: "2019-12-19 12:20:00" "2019-12-19 12:30:00" ...
##   ..$ Sensor_A: num [1:4896] 29 27.4 -13.6 28 36.1 ...
##   ..$ Sensor_B: num [1:4896] 53.32 48.16 7.85 61.6 20.45 ...
##   ..- attr(*, "spec")=
##   .. .. cols(
##   .. ..   Datetime = col_datetime(format = ""),
##   .. ..   Sensor_A = col_double(),
##   .. ..   Sensor_B = col_double()
##   .. .. )
##   ..- attr(*, "problems")=<externalptr> 
##  $ Loggername20200310.csv: spec_tbl_df [6,917 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##   ..$ Datetime: POSIXct[1:6917], format: "2020-01-22 11:50:00" "2020-01-22 12:00:00" ...
##   ..$ Sensor_A: num [1:6917] -18.91 21.17 26.41 8.72 -3.87 ...
##   ..$ Sensor_B: num [1:6917] 5.15 48.73 4.97 76.31 69.15 ...
##   ..- attr(*, "spec")=
##   .. .. cols(
##   .. ..   Datetime = col_datetime(format = ""),
##   .. ..   Sensor_A = col_double(),
##   .. ..   Sensor_B = col_double()
##   .. .. )
##   ..- attr(*, "problems")=<externalptr> 
##  $ Loggername20200522.csv: spec_tbl_df [10,537 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##   ..$ Datetime: POSIXct[1:10537], format: "2020-03-10 12:20:00" "2020-03-10 12:30:00" ...
##   ..$ Sensor_A: num [1:10537] -15.99 -7.01 -10.2 21.03 30.6 ...
##   ..$ Sensor_B: num [1:10537] 26.569 45.081 0.548 83.368 88.509 ...
##   ..- attr(*, "spec")=
##   .. .. cols(
##   .. ..   Datetime = col_datetime(format = ""),
##   .. ..   Sensor_A = col_double(),
##   .. ..   Sensor_B = col_double()
##   .. .. )
##   ..- attr(*, "problems")=<externalptr> 
##  $ Loggername20200723.csv: spec_tbl_df [8,932 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##   ..$ Datetime: POSIXct[1:8932], format: "2020-05-22 17:20:00" "2020-05-22 17:30:00" ...
##   ..$ Sensor_A: num [1:8932] 29.8 39.5 -18.9 38.1 -5.9 ...
##   ..$ Sensor_B: num [1:8932] 98 20.2 25.1 47.2 25.9 ...
##   ..- attr(*, "spec")=
##   .. .. cols(
##   .. ..   Datetime = col_datetime(format = ""),
##   .. ..   Sensor_A = col_double(),
##   .. ..   Sensor_B = col_double()
##   .. .. )
##   ..- attr(*, "problems")=<externalptr> 
##  $ Loggername20200915.csv: spec_tbl_df [7,783 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##   ..$ Datetime: POSIXct[1:7783], format: "2020-07-23 17:20:00" "2020-07-23 17:30:00" ...
##   ..$ Sensor_A: num [1:7783] 15.44 20.59 -7.66 30.59 -16.83 ...
##   ..$ Sensor_B: num [1:7783] 47.4 90 48.3 35.2 14.2 ...
##   ..- attr(*, "spec")=
##   .. .. cols(
##   .. ..   Datetime = col_datetime(format = ""),
##   .. ..   Sensor_A = col_double(),
##   .. ..   Sensor_B = col_double()
##   .. .. )
##   ..- attr(*, "problems")=<externalptr> 
##  $ Loggername20201027.csv: spec_tbl_df [6,029 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##   ..$ Datetime: POSIXct[1:6029], format: "2020-09-15 18:10:00" "2020-09-15 18:20:00" ...
##   ..$ Sensor_A: num [1:6029] 12.66 -3.82 -15.88 25.97 17.32 ...
##   ..$ Sensor_B: num [1:6029] 16 16.2 85 27.4 85.6 ...
##   ..- attr(*, "spec")=
##   .. .. cols(
##   .. ..   Datetime = col_datetime(format = ""),
##   .. ..   Sensor_A = col_double(),
##   .. ..   Sensor_B = col_double()
##   .. .. )
##   ..- attr(*, "problems")=<externalptr> 
##  $ Loggername20210108.csv: spec_tbl_df [10,523 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##   ..$ Datetime: POSIXct[1:10523], format: "2020-10-27 14:50:00" "2020-10-27 15:00:00" ...
##   ..$ Sensor_A: num [1:10523] 0.439 17.675 22.491 38.042 22.879 ...
##   ..$ Sensor_B: num [1:10523] 41.9 62.7 41.8 62.8 69 ...
##   ..- attr(*, "spec")=
##   .. .. cols(
##   .. ..   Datetime = col_datetime(format = ""),
##   .. ..   Sensor_A = col_double(),
##   .. ..   Sensor_B = col_double()
##   .. .. )
##   ..- attr(*, "problems")=<externalptr>

As we used the read_csv() function from the readr package, all of our data.frames also have the classes spec_tbl_df, tbl_df and tbl, meaning that they are tibbles in addition to their “old” data.frame class, designed to work more smoothly with Tidyverse packages. Still, we’ll keep calling them data.frames for simplicity.
We can also see that our columns have identical names across files and that column types were detected automatically. Alternatively, we could have used base R’s read.csv() function, but then we would have had to define column types manually to avoid getting a character datetime column.
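For illustration, here is a minimal sketch of that base R route, using a temporary file with made-up toy data so the snippet is self-contained; without colClasses, read.csv() would leave the Datetime column as character:

```r
# write a tiny toy csv to a temporary file (stand-in for a logger file)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Datetime,Sensor_A,Sensor_B",
             "2020-01-01 00:00:00,1.5,2.5",
             "2020-01-01 00:10:00,3.5,4.5"), tmp)

# "POSIXct" tells read.csv() to convert the column while reading
df <- read.csv(tmp, colClasses = c(Datetime = "POSIXct",
                                   Sensor_A = "numeric",
                                   Sensor_B = "numeric"))
class(df$Datetime)  # "POSIXct" "POSIXt"
```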
A few values from every column are already visible, but we can take a closer look with the head() function, which works on a single data.frame, so we call it on one list element.

head(list_for_dataframes[[5]])
## # A tibble: 6 x 3
##   Datetime            Sensor_A Sensor_B
##   <dttm>                 <dbl>    <dbl>
## 1 2020-07-23 17:20:00    15.4     47.4 
## 2 2020-07-23 17:30:00    20.6     90.0 
## 3 2020-07-23 17:40:00    -7.66    48.3 
## 4 2020-07-23 17:50:00    30.6     35.2 
## 5 2020-07-23 18:00:00   -16.8     14.2 
## 6 2020-07-23 18:10:00    17.3      2.90

That’s a more familiar tabular representation of our data, showing that the function imported everything correctly, as the column data types also indicate.

Let’s combine them into one data.frame with the bind_rows() function. This function matches columns by name and stacks the rows of our separate data.frames on top of each other.

combined_df <- bind_rows(list_for_dataframes)

As the rows from the separate data.frames simply follow each other, regardless of the time values, this combined_df is not our final result yet. But we can check its number of rows and compare it to the total number of rows in the separate data.frames, since the two should be equal.
Let’s use the nrow() function to get the row count of combined_df.

nrow(combined_df)
## [1] 55617

And then check the total row count of the separate data.frames. Because they are stored in a list, we can’t call nrow() on them directly, so we wrap the function in a sapply() call, which returns these values in a vector, and then add them up with the sum() function.

sum(sapply(list_for_dataframes, nrow))
## [1] 55617

Success.

Now we need to determine the “real” timeline to which we can align our existing but incomplete set of time values in combined_df. I used to set it for a whole calendar year with the appropriate time interval, but we could also bound it by the earliest and latest time values of the datetime column, using min(combined_df$Datetime) and max(combined_df$Datetime).
In either case we use the seq.POSIXt() function to generate a sequence of datetime values, wrapped inside the as_tibble_col() function to create a one-column data.frame from the sequence.
For the whole-year version we use yet another function, make_datetime(), to manually create the beginning and end times of the sequence, with arguments year, month, day, hour, minute and second; these two datetimes become the first two arguments of seq.POSIXt(). The third argument is the time interval of the sequence expressed in seconds, which is 600 in our case, as we have 10-minute intervals.
We start with an additional step I used to have: getting the actual year from the name of the working directory, “Downloads2020” in this case. This way I could use the same code for different year folders. For that, I call getwd() inside a str_extract() function whose regex pattern argument “\\d{4}” matches 4 consecutive digits. This function extracts the part matched by the pattern (the year in our case), and we wrap it in the as.integer() function to store it as a number instead of a character string.

actual_year <- as.integer(str_extract(getwd(), "\\d{4}"))

Now we can pass this actual_year value to our make_datetime() calls.

real_time <- as_tibble_col(seq.POSIXt(make_datetime(actual_year, 1, 1, 0, 0, 0),
                                      make_datetime(actual_year, 12, 31, 23, 50, 0),
                                      600),
                           column_name = "Datetime")
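If we prefer the measurement-bounded alternative mentioned above, we can take the range of the Datetime column instead of building the endpoints by hand. A sketch with a toy two-row stand-in (named demo_df here, purely hypothetical) so the snippet runs on its own:

```r
library(tibble)

# toy stand-in for combined_df, just for demonstration
demo_df <- tibble(
  Datetime = as.POSIXct(c("2020-01-01 00:00:00", "2020-01-01 00:30:00"),
                        tz = "UTC"))

# min() and max() of the datetime column bound the sequence
demo_time <- as_tibble_col(seq.POSIXt(min(demo_df$Datetime),
                                      max(demo_df$Datetime),
                                      600),
                           column_name = "Datetime")
nrow(demo_time)  # 4: 00:00, 00:10, 00:20, 00:30
```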

Having real_time and our combined_df, we can finally use left_join() to join the two data.frames, keeping all real time values for the whole year and filling the proper rows of the other columns with our data by these time values. For this, real_time must be the first argument and combined_df the second, with the joining column given in the by argument of the function.

final_df <- left_join(real_time, combined_df, by = "Datetime")

We can now run some checks again to see the row and column numbers with the dim() function. Let’s start with the one-column real_time.

dim(real_time)
## [1] 52704     1

And then for our final_df that should have the same number of rows and 3 columns.

dim(final_df)
## [1] 52704     3

Goal!
Knowing that we have the proper number of columns and rows, let’s take a look at our first and last rows with the head() and tail() functions.

head(final_df)
## # A tibble: 6 x 3
##   Datetime            Sensor_A Sensor_B
##   <dttm>                 <dbl>    <dbl>
## 1 2020-01-01 00:00:00    -7.98    85.1 
## 2 2020-01-01 00:10:00     6.91     8.32
## 3 2020-01-01 00:20:00    25.9     69.5 
## 4 2020-01-01 00:30:00    20.1     83.3 
## 5 2020-01-01 00:40:00     1.71    23.4 
## 6 2020-01-01 00:50:00    36.1     98.5
tail(final_df)
## # A tibble: 6 x 3
##   Datetime            Sensor_A Sensor_B
##   <dttm>                 <dbl>    <dbl>
## 1 2020-12-31 23:00:00    16.8      2.05
## 2 2020-12-31 23:10:00    13.0     27.1 
## 3 2020-12-31 23:20:00    -8.75    73.1 
## 4 2020-12-31 23:30:00     3.66    55.9 
## 5 2020-12-31 23:40:00    -4.35    46.3 
## 6 2020-12-31 23:50:00    22.7     56.4

Everything is fine.
I usually had one last step: exporting final_df to a file. It can be done with the write_csv() function if we want to stick with the “.csv” format, or with the write_xlsx() function from the writexl package if we need Excel format. In both cases the arguments are the same: the name of our data.frame first, then the desired name of the new file. For the filename, I used the str_c() function to join the fixed beginning and end parts of the string with actual_year (converted to a character string by the function), giving “Loggername2020.csv”. One important thing here is to include the file extension in the filename, otherwise programs may not recognize these files.

write_csv(final_df, str_c("Loggername", actual_year, ".csv"))

writexl::write_xlsx(final_df, str_c("Loggername", actual_year, ".xlsx"))

And that’s all.
This workflow can easily be adapted to input files with “.xls” or “.xlsx” extensions, using the read_excel(), read_xls() or read_xlsx() functions from the readxl package, as long as the files have the same one-row-header format, that is, column names in the first row. In later posts we’ll look at the trickier cases.
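A sketch of the adapted import loop for Excel files, assuming the readxl package is installed and the hypothetical .xlsx files follow the same naming convention as above:

```r
library(readxl)

# same naming convention as the csv files, only the extension differs
xlsx_names <- list.files(pattern = "Loggername.*xlsx")
xlsx_list <- list()
for (i in seq_along(xlsx_names)) {
  # read_xlsx() also takes the filename as its first argument
  xlsx_list[[i]] <- read_xlsx(xlsx_names[i])
}
names(xlsx_list) <- xlsx_names
```

From here on, bind_rows() and the rest of the workflow apply unchanged.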

Arrivederci a presto!
