I posted a question over on StackOverflow on an efficient way of comparing two data frames with the same column structure, but with different rows. What I would like to end up with is an n x m logical matrix where n and m are the number of rows in the first and second data frames, respectively; and the value at the ith row and jth column indicates whether all the values from row i from data frame one is equal to row j from data frame two. To provide some context, this will be used in a propensity score matching algorithm to identify candidate matches that match exactly on any number of covariates. In addition to the approaches I had, joran provided an approach using the Vectorize function (thanks again as I learned another nice function). I decided to put three approaches to a race

To understand what I need, I’ll start with a small example with two data frames, one with 4 rows, the other with 3, and each has two variables, one logical and the other numeric. As an aside, I only need this to work for integers, factors, characters, and logical types therefore avoiding issues of comparing numerics.

This is pretty straight forward. Now I want the same type of result, but to compare more than one column (in the final implementation I need to handle any number of columns so not necessarily limited to one or two).

We can already see with this small example that the Vectorize approach is the slowest. However, let’s try a larger example. First we’ll create two data frames, one with 1,000 rows and the second with 1,500. The resulting matrix will be 1,000 x 1,500.

The combined column approach is by far the fasted way, and it makes good since. It is a bit surprising (at least to me), how much worse the Vectorize and outer functions are. Moreover, I am a bit concerned about potential issues with the paste method and doing comparisons on those results. Please feel free to leave comments below if there are other approaches.