Tag: merge

Merging two data.frame objects in R is very easily done by using the merge function. While being very powerful, the merge function does not (as of yet) offer to return a merged data.frame that preserved the original order of, one of the two merged, data.frame objects. In this post I describe this problem, and offer […]

Update (2017-02-03) the dplyr package offers a great solution for this issue, see the document Two-table verbs for more details.

Merging two data.frame objects in R is very easily done by using the merge function. While being very powerful, the merge function does not (as of yet) offer to return a merged data.frame that preserved the original order of, one of the two merged, data.frame objects.
In this post I describe this problem, and offer some easy to use code to solve it.

If we will now merge the two objects, we will find that the order of the rows is different then the original order of the “y” object. This is true whether we use “sort =T” or “sort=F”. You can notice that the original order was an ascending order of the “val” variable:

The rows are by default lexicographically sorted on the common columns, but for ‘sort = FALSE’ are in an unspecified order.

Or put differently: sort=FALSE doesn’t preserve the order of any of the two entered data.frame objects (x or y); instead it gives us an
unspecified (potentially random) order.

However, it can so happen that we want to make sure the order of the resulting merged data.frame objects ARE ordered according to the order of one of the two original objects. In order to make sure of that, we could add an extra “id” (row index number) sequence on the dataframe we wish to sort on. Then, we can merge the two data.frame objects, sort by the sequence, and delete the sequence. (this was previously mentioned on the R-help mailing list by Bart Joosen).

Following is a function that implements this logic, followed by an example for its use:

keep_order can accept the numbers 1 or 2, in which case it will make sure the resulting merged data.frame will be ordered according to the original order of rows of the data.frame entered to x (if keep_order=1) or to y (if keep_order=2). If keep_order is missing, merge will continue working as usual. If keep_order gets some input other then 1 or 2, it will issue a warning that it doesn’t accept these values, but will continue working as merge normally would. Notice that the parameter “sort” is practically overridden when using keep_order (with the value 1 or 2).

The same code can be used to modify the original merge.data.frame function in base R, so to allow the use of the keep_order, here is a link to the patched merge.data.frame function (on github). If you can think of any ways to improve the function (or happen to notice a bug) please let me know either on github or in the comments. (also saying that you found the function to be useful will be fun to know about )

Update: Thanks to KY’s comment, I noticed the ?join function in the {plyr} library. This function is similar to merge (with less features, yet faster), and also automatically keeps the order of the x (first) data.frame used for merging, as explained in the ?join help page:

Unlike merge, (join) preserves the order of x no matter what join type is used. If needed, rows from y will be added to the bottom. Join is often faster than merge, although it is somewhat less featureful – it currently offers no way to rename output or merge on different variables in the x and y data frames.