Combine data frames with different fields using a crosswalk table
Source:R/combineDf.r
combineDf.Rd
This function combines multiple "source" data frames, possibly with different column names, into a single "destination" data frame. Usually merge
will be faster and easier to implement if the columns to be merged on have the same names, and rbind
will always be faster and much easier if the column names and data types match exactly.
The key tool in this function is a "crosswalk" table (a data.frame
) that tells the function which fields in each source data frame match to the final fields in the destination data frame. Values in source data frame fields can be used as-is, combined across fields, or have functions applied to them before they are put into the destination data frame. If a source data frame doe snot gave a field that matches the destination field, a default value (including NA
) can be assigned to all cells for that source data frame.
The data frames to be combined can be provided in ...
or as file names in the first column of the crosswalk table. These can be either CSV files (extension ".csv
"), TAB files (extension ".tab
"), "Rdata" files (read using load
and with a ".rda
" or ".rdata
" extension), or "RDS" files (read using readRDS
and with a ".rds
" extension). The file type will be intuited from the extension, and its case does not matter. Note that if an object in an Rdata file has the same name as an object in this function (i.e., any of the arguments plus any objects internal to the function), this may cause a conflict. To help obviate this issue, all internal objects are named with a period at the end (e.g., "crossCell.
" and "countDf.
").
All cells in each source data frame will have leading and trailing white spaces removed before combining.
Usage
combineDf(
...,
crosswalk,
collapse = "; ",
useColumns = NULL,
excludeColumns = NULL,
useFrames = NULL,
classes = NULL,
verbose = FALSE
)
Arguments
- ...
Data frames to combine. These must be listed in the order that they appear in the
crosswalk
table.- crosswalk
A
data.frame
. Column names are fields in the destination data frame. Each row corresponds to a different data frame to join. If...
is not used then the first column must have the paths and file names to CSV, TAB, Rdata, or RDS files representing data frames to join. All objects will be coerced todata.frame
s.Other than this column, the elements of each cell contain the name of the column in each source data frame that coincides with the column name in the
crosswalk
table. For example, suppose the destination data frame is to have a column by the name of "species" with names of species within it. If the first source data frame has a column named "Species" and the second source data frame has a column named "scientificName", then the first value incrosswalk
under its "species" column will be "Species" and the second "scientificName". This will take all the values in the "Species" field of the first source data frame and all the values in the "scientificName" field in the second source data frame and put them both into the "species" field of the destination frame.More complex operations can be done using the following in cells of
crosswalk
:Filling all cells with a single value: If the value in the crosswalk cell begins with "
%fill%
", then the value that follows it will be repeated in destination data frame in each row. For example,%fill% inspected
will repeat the string "inspected" in every row of the output corresponding to the respective source data frame (any spaces immediately after
%fill%
are ignored).Concatenating (pasting) columns together: To combine multiple fields, begin crosswalk cell with "
%cat%
", then list the fields to combine (with or without commas separating them). For example, if the crosswalk cell has%cat% field1 field2 field3
then the resulting column in the destination data frame will have values from
field1
,field2
, andfield3
pasted together. See also argumentcollapse
.Applying a function: You can manipulate values by including functions in the crosswalk cell. The crosswalk cell should begin with "
%fun%
", then be followed an expression to evaluate. Expressions should generally not use the "<-
" operator (or the equals operator used in the same way). For example:%fun% ifelse(as.numeric(field1) >20, NA, as.numeric(field1))
will create a column that is
NA
whenever values infield1
are >20, and the value infield1
otherwise. Note that for mathematical operations, it is almost always necessary to useas.numeric
around column names representing numbers since fields are read in as characters.
- collapse
Character, specifies the string to put between fields combined with the
%cat%
operator in thecrosswalk
table.- useColumns, excludeColumns
Logical, character vector, integer vector, or
NULL
. Indicates which columns in the crosswalk table are to be used or not to be used. These can be given as aTRUE
/FALSE
vector, a vector of column names, or a vector of column index values (integers). You can only specifyuseColumns
orexcludeColumns
at a time (one or both must beNULL
). If both areNULL
(default), then all columns in the crosswalk will be used.- useFrames
Logical, character, or
NULL
. Indicates if a particular source data frame should be used. This should be aTRUE
/FALSE
vector or the name of a column in the crosswalk table that hasTRUE
/FALSE
values. If this is the name of a column, the column will be removed from the columns inuseColumns
. IfNULL
(default). then all data frames in the crosswalk will be used.- classes
Character or character list, specifies the classes (e.g., character, logical, numeric, integer) to be assigned to each column in the output table. If
NULL
, all classes will be assumed to be character. If just one value is listed, all columns will be set to this class. If a list, it must be the same length as the number of columns incrosswalk
and specify the class of each column. If it has names, then the names must correspond to the columns incrosswalk
and will be used to assign the data type to the corresponding columns.- verbose
Logical, if
TRUE
prints extra information during execution. Useful for debugging thecrosswalk
table.
Examples
df1 <- data.frame(x1=1:5, x2=FALSE, x3=letters[1:5], x4=LETTERS[1:5],
x5='stuff', x6=11:15)
df2 <- data.frame(y1=11:15, y2=rev(letters)[1:5], y3=runif(5))
crosswalk <- data.frame(
a = c('x1', 'y1'),
b = c('x2', '%fill% TRUE'),
c = c('%cat% x3 x4', 'y2'),
d = c('x5', '%fill% NA'),
e = c('%fun% as.numeric(x6) > 12', '%fun% round(as.numeric(y3), 2)')
)
combined <- combineDf(df1, df2, crosswalk=crosswalk)
combined
#> a b c d e
#> 1 1 FALSE a; A stuff 0
#> 2 2 FALSE b; B stuff 0
#> 3 3 FALSE c; C stuff 1
#> 4 4 FALSE d; D stuff 1
#> 5 5 FALSE e; E stuff 1
#> 6 11 TRUE z NA 0.08
#> 7 12 TRUE y NA 0.83
#> 8 13 TRUE x NA 0.6
#> 9 14 TRUE w NA 0.16
#> 10 15 TRUE v NA 0.01