Combine data frames with different fields using a crosswalk table

This function combines multiple "source" data frames, possibly with different column names, into a single "destination" data frame. Usually merge will be faster and easier to implement if the columns to be merged on have the same names, and rbind will always be faster and much easier if the column names and data types match exactly.

The key tool in this function is a "crosswalk" table (a data.frame) that tells the function which fields in each source data frame match to the final fields in the destination data frame. Values in source data frame fields can be used as-is, combined across fields, or have functions applied to them before they are put into the destination data frame. If a source data frame doe snot gave a field that matches the destination field, a default value (including NA) can be assigned to all cells for that source data frame.

The data frames to be combined can be provided in ... or as file names in the first column of the crosswalk table. These can be either CSV files (extension ".csv"), TAB files (extension ".tab"), "Rdata" files (read using load and with a ".rda" or ".rdata" extension), or "RDS" files (read using readRDS and with a ".rds" extension). The file type will be intuited from the extension, and its case does not matter. Note that if an object in an Rdata file has the same name as an object in this function (i.e., any of the arguments plus any objects internal to the function), this may cause a conflict. To help obviate this issue, all internal objects are named with a period at the end (e.g., "crossCell." and "countDf.").

All cells in each source data frame will have leading and trailing white spaces removed before combining.

Usage

combineDf(
  ...,
  crosswalk,
  collapse = "; ",
  useColumns = NULL,
  excludeColumns = NULL,
  useFrames = NULL,
  classes = NULL,
  verbose = FALSE
)

Arguments

...

Data frames to combine. These must be listed in the order that they appear in the crosswalk table.

crosswalk

A data.frame. Column names are fields in the destination data frame. Each row corresponds to a different data frame to join. If ... is not used then the first column must have the paths and file names to CSV, TAB, Rdata, or RDS files representing data frames to join. All objects will be coerced to data.frames.

Other than this column, the elements of each cell contain the name of the column in each source data frame that coincides with the column name in the crosswalk table. For example, suppose the destination data frame is to have a column by the name of "species" with names of species within it. If the first source data frame has a column named "Species" and the second source data frame has a column named "scientificName", then the first value in crosswalk under its "species" column will be "Species" and the second "scientificName". This will take all the values in the "Species" field of the first source data frame and all the values in the "scientificName" field in the second source data frame and put them both into the "species" field of the destination frame.

More complex operations can be done using the following in cells of crosswalk:

Filling all cells with a single value: If the value in the crosswalk cell begins with "%fill%", then the value that follows it will be repeated in destination data frame in each row. For example,

%fill% inspected

will repeat the string "inspected" in every row of the output corresponding to the respective source data frame (any spaces immediately after %fill% are ignored).
Concatenating (pasting) columns together: To combine multiple fields, begin crosswalk cell with "%cat%", then list the fields to combine (with or without commas separating them). For example, if the crosswalk cell has

%cat% field1 field2 field3

then the resulting column in the destination data frame will have values from field1, field2, and field3 pasted together. See also argument collapse.
Applying a function: You can manipulate values by including functions in the crosswalk cell. The crosswalk cell should begin with "%fun%", then be followed an expression to evaluate. Expressions should generally not use the "<-" operator (or the equals operator used in the same way). For example:

%fun% ifelse(as.numeric(field1) >20, NA, as.numeric(field1))

will create a column that is NA whenever values in field1 are >20, and the value in field1 otherwise. Note that for mathematical operations, it is almost always necessary to use as.numeric around column names representing numbers since fields are read in as characters.

collapse

Character, specifies the string to put between fields combined with the %cat% operator in the crosswalk table.

useColumns, excludeColumns

Logical, character vector, integer vector, or NULL. Indicates which columns in the crosswalk table are to be used or not to be used. These can be given as a TRUE/FALSE vector, a vector of column names, or a vector of column index values (integers). You can only specify useColumns or excludeColumns at a time (one or both must be NULL). If both are NULL (default), then all columns in the crosswalk will be used.

useFrames

Logical, character, or NULL. Indicates if a particular source data frame should be used. This should be a TRUE/FALSE vector or the name of a column in the crosswalk table that has TRUE/FALSE values. If this is the name of a column, the column will be removed from the columns in useColumns. If NULL (default). then all data frames in the crosswalk will be used.

classes

Character or character list, specifies the classes (e.g., character, logical, numeric, integer) to be assigned to each column in the output table. If NULL, all classes will be assumed to be character. If just one value is listed, all columns will be set to this class. If a list, it must be the same length as the number of columns in crosswalk and specify the class of each column. If it has names, then the names must correspond to the columns in crosswalk and will be used to assign the data type to the corresponding columns.

verbose

Logical, if TRUE prints extra information during execution. Useful for debugging the crosswalk table.

Value

A data frame.

Examples

df1 <- data.frame(x1=1:5, x2=FALSE, x3=letters[1:5], x4=LETTERS[1:5],
       x5='stuff', x6=11:15)
df2 <- data.frame(y1=11:15, y2=rev(letters)[1:5], y3=runif(5))

crosswalk <- data.frame(
  a = c('x1', 'y1'),
  b = c('x2', '%fill% TRUE'),
  c = c('%cat% x3 x4', 'y2'),
  d = c('x5', '%fill% NA'),
 e = c('%fun% as.numeric(x6) > 12', '%fun% round(as.numeric(y3), 2)')
)

combined <- combineDf(df1, df2, crosswalk=crosswalk)
combined
#>     a     b    c     d    e
#> 1   1 FALSE a; A stuff    0
#> 2   2 FALSE b; B stuff    0
#> 3   3 FALSE c; C stuff    1
#> 4   4 FALSE d; D stuff    1
#> 5   5 FALSE e; E stuff    1
#> 6  11  TRUE    z    NA 0.08
#> 7  12  TRUE    y    NA 0.83
#> 8  13  TRUE    x    NA  0.6
#> 9  14  TRUE    w    NA 0.16
#> 10 15  TRUE    v    NA 0.01

Combine data frames with different fields using a crosswalk table

Usage

Arguments

Value

See also

Examples