Merging spreadsheet or csv files into one dataframe - Part 2
By Mariano in File merging
September 14, 2022
Buongiorno. In this episode we expand the workflow presented in
Part 1 to perform a similar operation on many more files belonging to different groups. More precisely, we have the same type of files, generated by automatic measurements from different devices and named according to the data logger name.
As always, we start with loading the packages we will use.
library(dplyr)
library(readr)
library(tibble)
library(stringr)
library(lubridate)
And we continue by setting the working directory to the folder containing the files.
setwd("path/to/Downloads2020")
Then we get all the filenames we want to read. In the pattern argument of list.files(), the regex “.*csv$” matches any filename ending in “csv”.
filenames <- list.files(pattern = ".*csv$")
Printing our filenames vector, we can see that we have several files with different logger-associated names.
filenames
## [1] "Loggername20200122.csv" "Loggername20200310.csv"
## [3] "Loggername20200522.csv" "Loggername20200723.csv"
## [5] "Loggername20200915.csv" "Loggername20201027.csv"
## [7] "Loggername20210108.csv" "lovesensors20200122.csv"
## [9] "lovesensors20200310.csv" "lovesensors20200522.csv"
## [11] "lovesensors20200723.csv" "lovesensors20200915.csv"
## [13] "lovesensors20201027.csv" "lovesensors20210108.csv"
## [15] "Other-device20200122.csv" "Other-device20200310.csv"
## [17] "Other-device20200522.csv" "Other-device20200723.csv"
## [19] "Other-device20200915.csv" "Other-device20201027.csv"
## [21] "Other-device20210108.csv" "Time_machine-66_20200122.csv"
## [23] "Time_machine-66_20200310.csv" "Time_machine-66_20200522.csv"
## [25] "Time_machine-66_20200723.csv" "Time_machine-66_20200915.csv"
## [27] "Time_machine-66_20201027.csv" "Time_machine-66_20210108.csv"
Since our files belong to different devices, we cannot simply read and merge all of them into one data.frame; we need a separate combined data.frame per device. For this purpose we need the names of our loggers. We can extract them from the filenames using the str_replace() function with a regex pattern and a replacement argument, nested inside unique() so that each logger name appears only once instead of once per file. The regex pattern “([^_]+)(?:_)*\\d{8}.+” matches any character string followed by an optional _ and then 8 digits (the date part of our names). The replacement “\\1” refers to the first capturing group, “([^_]+)”, meaning that we replace the whole filename with only the part that comes before the optional underscore and the 8 digits. The “(?:_)*” part is a non-capturing group, which is how we exclude the optional _ from the result.
logger_names <- unique(str_replace(filenames, "([^_]+)(?:_)*\\d{8}.+", "\\1"))
And we can check them by printing the vector.
logger_names
## [1] "Loggername" "lovesensors" "Other-device" "Time_machine-66"
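To see what the pattern does to a single filename, we can try it on two representative names. Base R’s sub() with perl = TRUE behaves the same as str_replace() for this pattern:

```r
# The capture group keeps only the part before the (optional)
# underscore and the 8-digit date
sub("([^_]+)(?:_)*\\d{8}.+", "\\1", "Loggername20200122.csv", perl = TRUE)
## [1] "Loggername"
sub("([^_]+)(?:_)*\\d{8}.+", "\\1", "Time_machine-66_20200122.csv", perl = TRUE)
## [1] "Time_machine-66"
```

Note that in the second case the match starts only at “machine-66”, because “[^_]+” cannot cross an underscore; the unmatched “Time_” prefix is kept as-is, which is exactly what we want.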
Before moving on to reading the files, we need the current year and a regular-interval datetime sequence covering the whole calendar year. In Part 1 we created these after reading the files, but this time we need them beforehand to automate the reading process across devices.
actual_year <- as.integer(str_extract(getwd(), "\\d{4}"))
Now we can plug the actual_year value into our make_datetime() calls when creating the one-column data.frame.
real_time <- as_tibble_col(seq.POSIXt(make_datetime(actual_year, 1, 1, 0, 0, 0),
                                      make_datetime(actual_year, 12, 31, 23, 50, 0),
                                      600),
                           column_name = "Datetime")
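As a quick sanity check, sketched here in base R with an explicit UTC timezone so it runs without lubridate, a 10-minute grid over the leap year 2020 should contain 366 × 144 = 52,704 timestamps:

```r
# 600-second steps from Jan 1 00:00 to Dec 31 23:50 of a leap year
length(seq.POSIXt(ISOdatetime(2020, 1, 1, 0, 0, 0, tz = "UTC"),
                  ISOdatetime(2020, 12, 31, 23, 50, 0, tz = "UTC"), 600))
## [1] 52704
```

This matches the row counts we will see in the combined data.frames at the end.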
And here comes the main part. Unlike in Part 1, we now need a nested for loop: an outer loop to iterate over our logger_names and an inner loop to iterate over all files for each logger name. In this nested construction, we first define files_for_one_logger, then initialize an empty list_for_dataframes, and then the inner loop reads the files of files_for_one_logger into the list. In the next steps we bind the rows of the data.frames from the list into a combined_df, left_join() it with real_time to get our final_df, and write that to a “.csv” file, using the logger_names and actual_year vectors to build the new filename. Since we create temporary objects along the way, we remove all of them with rm() in the last step to keep our environment cleaner. Note that we use name in the outer loop and i in the inner loop. We also expand our reading function a bit: by default read_csv() prints a lot of information about the column types it guesses, so we include show_col_types = FALSE to suppress that.
for (name in seq_along(logger_names)) {
  # require the 8-digit date in the filename, so that a re-run does not
  # also pick up the combined files exported by write_csv() below
  files_for_one_logger <- list.files(pattern = str_c("^", logger_names[name], "_?\\d{8}\\.csv$"))
  list_for_dataframes <- list()
  for (i in seq_along(files_for_one_logger)) {
    list_for_dataframes[[i]] <- read_csv(files_for_one_logger[i], show_col_types = FALSE)
  }
  combined_df <- bind_rows(list_for_dataframes)
  final_df <- left_join(real_time, combined_df, by = "Datetime")
  write_csv(final_df, str_c(logger_names[name], "_", actual_year, ".csv"))
  rm(combined_df, final_df, i, name, files_for_one_logger, list_for_dataframes)
}
Since most of the time I needed additional cleaning steps, I usually didn’t export these final_dfs immediately. Instead, I assigned them to permanent data.frames with the assign() function, which takes the desired object name as its first argument and the object itself as its second. I could then repeatedly plot, filter or modify these individual data.frames before writing them to file. In this version we simply switch the write_csv() function to assign(); everything else remains the same as before.
for (name in seq_along(logger_names)) {
  # require the 8-digit date so only the raw logger files are matched
  files_for_one_logger <- list.files(pattern = str_c("^", logger_names[name], "_?\\d{8}\\.csv$"))
  list_for_dataframes <- list()
  for (i in seq_along(files_for_one_logger)) {
    list_for_dataframes[[i]] <- read_csv(files_for_one_logger[i], show_col_types = FALSE)
  }
  combined_df <- bind_rows(list_for_dataframes)
  final_df <- left_join(real_time, combined_df, by = "Datetime")
  assign(str_c(logger_names[name], "_", actual_year), final_df)
  rm(combined_df, final_df, i, name, files_for_one_logger, list_for_dataframes)
}
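If assign() is new to you, here is a tiny standalone illustration; the name demo_logger_2020 and its contents are made up purely for demonstration:

```r
# assign() takes the target name as a character string, then the object
assign(paste0("demo_logger_", 2020), data.frame(Sensor_A = 1:3))
# the data.frame is now available under that constructed name
exists("demo_logger_2020")
## [1] TRUE
```

In the loop above, str_c() builds the name string the same way paste0() does here.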
There is yet another option for when we need our final_dfs in a list. For that, we first initialize another list to store them.
list_for_final_dfs <- list()
Then comes the nested loop, where instead of writing or assigning, we populate the new list_for_final_dfs with the final_dfs and set the names of the list elements using the current logger name and actual_year.
for (name in seq_along(logger_names)) {
  # require the 8-digit date so only the raw logger files are matched
  files_for_one_logger <- list.files(pattern = str_c("^", logger_names[name], "_?\\d{8}\\.csv$"))
  list_for_dataframes <- list()
  for (i in seq_along(files_for_one_logger)) {
    list_for_dataframes[[i]] <- read_csv(files_for_one_logger[i], show_col_types = FALSE)
  }
  combined_df <- bind_rows(list_for_dataframes)
  final_df <- left_join(real_time, combined_df, by = "Datetime")
  list_for_final_dfs[[name]] <- final_df
  names(list_for_final_dfs)[name] <- str_c(logger_names[name], "_", actual_year)
  rm(combined_df, final_df, i, name, files_for_one_logger, list_for_dataframes)
}
And we can check the contents of list_for_final_dfs with str() or glimpse().
str(list_for_final_dfs)
## List of 4
## $ Loggername_2020 : tibble [52,704 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ Datetime: POSIXct[1:52704], format: "2020-01-01 00:00:00" "2020-01-01 00:10:00" ...
## ..$ Sensor_A: num [1:52704] -7.98 6.91 25.9 20.13 1.71 ...
## ..$ Sensor_B: num [1:52704] 85.14 8.32 69.53 83.28 23.38 ...
## $ lovesensors_2020 : tibble [52,704 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ Datetime: POSIXct[1:52704], format: "2020-01-01 00:00:00" "2020-01-01 00:10:00" ...
## ..$ Sensor_A: num [1:52704] -13.34 9.11 13.43 18.35 17.58 ...
## ..$ Sensor_B: num [1:52704] 91 20.3 94.8 42.7 35.5 ...
## $ Other-device_2020 : tibble [52,704 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ Datetime: POSIXct[1:52704], format: "2020-01-01 00:00:00" "2020-01-01 00:10:00" ...
## ..$ Sensor_A: num [1:52704] -17.78 32.031 37.11 7.443 -0.554 ...
## ..$ Sensor_B: num [1:52704] 31.4 77.2 97.1 33.3 64.3 ...
## $ Time_machine-66_2020: tibble [52,704 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ Datetime: POSIXct[1:52704], format: "2020-01-01 00:00:00" "2020-01-01 00:10:00" ...
## ..$ Sensor_A: num [1:52704] 9.88 3.33 -13.33 1.34 6.64 ...
## ..$ Sensor_B: num [1:52704] 30.6 83.4 93.9 87.5 14.3 ...
And that’s all.
As in Part 1, this workflow can easily be adapted to input files with “.xls” or “.xlsx” extensions using the read_excel(), read_xls() or read_xlsx() functions from the readxl package, provided the files have the same one-row-header format, that is, column names in the first row.
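A minimal sketch of that Excel variant, assuming .xlsx files following the same naming scheme and the readxl package being installed; treat it as scaffolding to adapt, not a tested drop-in:

```r
library(readxl)  # not loaded at the top; provides read_excel()

filenames <- list.files(pattern = ".*xlsx$")
logger_names <- unique(str_replace(filenames, "([^_]+)(?:_)*\\d{8}.+", "\\1"))

for (name in seq_along(logger_names)) {
  files_for_one_logger <- list.files(pattern = str_c("^", logger_names[name], "_?\\d{8}\\.xlsx$"))
  list_for_dataframes <- list()
  for (i in seq_along(files_for_one_logger)) {
    # read_excel() handles both .xls and .xlsx and guesses column
    # types quietly, so there is no show_col_types to switch off
    list_for_dataframes[[i]] <- read_excel(files_for_one_logger[i])
  }
  combined_df <- bind_rows(list_for_dataframes)
  final_df <- left_join(real_time, combined_df, by = "Datetime")
  write_csv(final_df, str_c(logger_names[name], "_", actual_year, ".csv"))
}
```

The logger-name extraction works unchanged, since the regex only cares about the 8-digit date, not the extension.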
Arrivederci a presto!