Changelog

dplyr 0.7.5 (2018-04-14)

Breaking changes for package developers

The major change in this version is that dplyr now depends on the selecting backend of the tidyselect package. If you have been linking to the dplyr::select_helpers documentation topic, you should update the link to point to tidyselect::select_helpers.

Another change that causes warnings in packages is that dplyr now exports the exprs() function. This causes a collision with Biobase::exprs(). Either import functions from dplyr selectively rather than in bulk, or do not import Biobase::exprs() and refer to it with a namespace qualifier.
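
A sketch of the selective-import approach (the directive lists below are illustrative, not a complete NAMESPACE):

```r
# Sketch of NAMESPACE directives for a package that uses both dplyr and Biobase.
# Import dplyr selectively so that dplyr::exprs() never enters your namespace:
#   importFrom(dplyr, filter, mutate, select, summarise)
#   importFrom(Biobase, exprs)
# Or leave Biobase::exprs() unimported and always call it fully qualified:
#   Biobase::exprs(eset)
```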

Bug fixes

Reindexing grouped data frames never updates the "class" attribute. This also avoids unintended updates to the original object (#3438).

do() operations with more than one named argument can access . (#2998).

Fix row_number() and ntile() ordering to use the locale-dependent ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (#2792, @foo-bar-baz-qux).

distinct(data, "string") now returns a one-row data frame again. (The previous behavior was to return the data unchanged.)

Note that this only works in selecting functions because in other contexts strings and character vectors are ambiguous. For instance, strings are valid input in mutating operations, and mutate(df, "foo") creates a new column by recycling “foo” to the number of rows.
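
A minimal sketch of the two contexts (assumes dplyr is installed):

```r
library(dplyr)

df <- tibble(x = 1:3)

# distinct() evaluates its arguments like mutate(): the literal "string"
# has a single distinct value, so the result has one row
distinct(df, "string")   # one-row data frame

# In mutate(), the string is recycled to the number of rows
mutate(df, "foo")        # three rows, with a new column containing "foo"
```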

Hybrid evaluation simplifies dplyr::foo to foo (#3309). Hybrid functions can now be masked by regular R functions to turn off hybrid evaluation (#3255). The hybrid evaluator finds functions from dplyr even if dplyr is not attached (#3456).

Scoped select and rename functions (select_all(), rename_if() etc.) now work with grouped data frames, adapting the grouping as necessary (#2947, #3410). group_by_at() can group by an existing grouping variable (#3351). arrange_at() can use grouping variables (#3332).

Error messages

Add warning with explanation to distinct() if any of the selected columns are of type list (#3088, @foo-bar-baz-qux).

Better error message if dbplyr is not installed when accessing database backends (#3225).

Corrected error message when calling cbind() with an object of wrong length (#3085).

Better error message when joining data frames with duplicate or NA column names. Joining such data frames with a semi- or anti-join now gives a warning, which may be converted to an error in future versions (#3243, #3417).

Internal

Avoid cleaning the data mask, a temporary environment used to evaluate expressions. If the environment in which e.g. a mutate() expression is evaluated is preserved until after the operation, accessing variables from that environment now gives a warning but still returns NULL (#3318).

The .env argument to sample_n() and sample_frac() is defunct; passing a value to this argument prints a message, which will be changed to a warning in the next release.

Databases

This version of dplyr includes some major changes to how database connections work. By and large, you should be able to continue using your existing dplyr database code without modification, but there are two big changes that you should be aware of:

Almost all database related code has been moved out of dplyr and into a new package, dbplyr. This makes dplyr simpler, and will make it easier to release fixes for bugs that only affect databases. src_mysql(), src_postgres(), and src_sqlite() will still live in dplyr so your existing code continues to work.

It is no longer necessary to create a remote “src”. Instead you can work directly with the database connection returned by DBI. This reflects the maturity of the DBI ecosystem. Thanks largely to the work of Kirill Müller (funded by the R Consortium), DBI backends are now much more consistent, comprehensive, and easier to use. That means that there’s no longer a need for a layer in between you and DBI.
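
For example, with dbplyr and RSQLite installed, a DBI connection can be passed straight to tbl() (a sketch using an in-memory database):

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# No src_sqlite() layer needed: work directly with the DBI connection
tbl(con, "mtcars") %>%
  filter(cyl == 4) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```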

If you’ve implemented a database backend for dplyr, please read the backend news to see what’s changed from your perspective (not much). If you want to ensure your package works with both the current and previous version of dplyr, see wrap_dbplyr_obj() for helpers.

UTF-8

Internally, column names are always represented as character vectors, and not as language symbols, to avoid encoding problems on Windows (#1950, #2387, #2388).

Error messages and explanations of data frame inequality are now encoded in UTF-8, also on Windows (#2441).

Joins now always reencode character columns to UTF-8 if necessary. This gives a nice speedup, because now pointer comparison can be used instead of string comparison, but relies on a proper encoding tag for all strings (#2514).

Fixed problems when joining factor or character encodings with a mix of native and UTF-8 encoded values (#1885, #2118, #2271, #2451).

New group_vars() generic that returns the grouping as a character vector, to avoid the potentially lossy conversion to language symbols. The list returned by group_by_prepare() now has a new group_names component (#1950, #2384).

The scoped verbs taking predicates (mutate_if(), summarise_if(), etc) now support S3 objects and lazy tables. S3 objects should implement methods for length(), [[ and tbl_vars(). For lazy tables, the first 100 rows are collected and the predicate is applied on this subset of the data. This is robust for the common case of checking the type of a column (#2129).
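
For example, transforming only the columns that pass a predicate:

```r
library(dplyr)

df <- tibble(x = 1:3, y = c("a", "b", "c"))

# The predicate is applied column by column; only numeric columns change
mutate_if(df, is.numeric, function(col) col * 2)
```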

The colwise summarise and mutate functions pass ... on to the manipulation functions.

Tidyeval

dplyr has a new approach to non-standard evaluation (NSE) called tidyeval. It is described in detail in vignette("programming") but, in brief, gives you the ability to interpolate values in contexts where dplyr usually works with expressions:

Most verbs taking dots now ignore the last argument if empty. This makes it easier to copy lines of code without having to worry about deleting trailing commas (#1039).

[API] The new .data and .env environments can be used inside all verbs that operate on data: .data$column_name accesses the column column_name, whereas .env$var accesses the external variable var. Columns or external variables named .data or .env are shadowed, use .data$... and/or .env$... to access them. (.data implements strict matching also for the $ operator (#2591).)
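
A short sketch of the pronouns in action:

```r
library(dplyr)

n <- 100                 # an external variable
df <- tibble(n = 1:3)    # a column with the same name

mutate(df,
  from_col = .data$n,    # always the column
  from_env = .env$n)     # always the external variable
```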

The column() and global() functions have been removed. They were never documented officially. Use the new .data and .env environments instead.

Expressions in verbs are now interpreted correctly in many cases that failed before (e.g., use of $, case_when(), nonstandard evaluation, …). These expressions are now evaluated in a specially constructed temporary environment that retrieves column data on demand with the help of the bindrcpp package (#2190). This temporary environment poses restrictions on assignments using <- inside verbs. To prevent leaking of broken bindings, the temporary environment is cleared after the evaluation (#2435).

Verbs

Joins

[API] xxx_join.tbl_df(na_matches = "never") treats all NA values as different from each other (and from any other value), so that they never match. This corresponds to the behavior of joins for database sources, and of database joins in general. To match NA values, pass na_matches = "na" to the join verbs; this is only supported for data frames. The default is na_matches = "na", kept for the sake of compatibility with v0.5.0. It can be tweaked by calling pkgconfig::set_config("dplyr::na_matches", "na") (#2033).
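
A minimal illustration of the two modes:

```r
library(dplyr)

x <- tibble(key = c(1, NA))
y <- tibble(key = c(1, NA), val = c("a", "b"))

inner_join(x, y, by = "key")                        # "na" (default): NA matches NA, two rows
inner_join(x, y, by = "key", na_matches = "never")  # NAs never match, one row
```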

bind_rows() works correctly with NULL arguments and an .id argument (#2056), and also for zero-column data frames (#2175).

Breaking change: bind_rows() and combine() are more strict when coercing. Logical values are no longer coerced to integer and numeric. Date, POSIXct and other integer or double-based classes are no longer coerced to integer or double as there is chance of attributes or information being lost (#2209, @zeehio).

bind_rows() and bind_cols() now accept vectors. They are treated as rows by the former and columns by the latter. Rows require inner names like c(col1 = 1, col2 = 2), while columns require outer names: col1 = c(1, 2). Lists are still treated as data frames but can be spliced explicitly with !!!, e.g. bind_rows(!!! x) (#1676).
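
For example:

```r
library(dplyr)

# Inner names: each vector becomes a row
bind_rows(c(a = 1, b = 2), c(a = 3, b = 4))

# Outer names: each vector becomes a column
bind_cols(a = c(1, 3), b = c(2, 4))

# Lists can be spliced explicitly with !!!
rows <- list(c(a = 1, b = 2), c(a = 3, b = 4))
bind_rows(!!!rows)
```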

Fixed segmentation faults in hybrid evaluation of first(), last(), nth(), lead(), and lag(). These functions now always fall back to the R implementation if called with arguments that the hybrid evaluator cannot handle (#948, #1980).

Formatting of grouped data frames now works by overriding the tbl_sum() generic instead of print(). This means that the output is more consistent with tibble, and that format() is now supported also for SQL sources (#2781).

dplyr 0.5.0 2016-06-24

Breaking changes

Existing functions

distinct() now only keeps the distinct variables. If you want to return all variables (using the first row for non-distinct values) use .keep_all = TRUE (#1110). For SQL sources, .keep_all = FALSE is implemented using GROUP BY, and .keep_all = TRUE raises an error (#1937, #1942, @krlmlr). (The default behaviour of using all variables when none are specified remains - this note only applies if you select some variables).
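
For example:

```r
library(dplyr)

df <- tibble(g = c(1, 1, 2), x = c("a", "b", "c"))

distinct(df, g)                    # keeps only g
distinct(df, g, .keep_all = TRUE)  # first row per distinct g, all columns
```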

A new family of functions replaces summarise_each() and mutate_each() (which will thus be deprecated in a future release). summarise_all() and mutate_all() apply a function to all columns, while summarise_at() and mutate_at() operate on a subset of columns. These columns are selected with either a character vector of column names, a numeric vector of column positions, or a column specification with select() semantics generated by the new columns() helper. In addition, summarise_if() and mutate_if() take a predicate function or a logical vector (these verbs currently require local sources). All these functions can now take ordinary functions instead of a list of functions generated by funs() (though this is only useful for local sources). (#1845, @lionel-)
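
The three selection styles, sketched on mtcars (the columns() helper is omitted here; character vectors and predicates cover the common cases):

```r
library(dplyr)

summarise_all(mtcars, mean)                 # every column
summarise_at(mtcars, c("mpg", "hp"), mean)  # columns named by a character vector
summarise_if(mtcars, is.numeric, mean)      # columns passing a predicate
```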

Local backends

dtplyr

All data table related code has been separated out into a new dtplyr package. This decouples the development of the data.table interface from the development of the dplyr package. If both data.table and dplyr are loaded, you’ll get a message reminding you to load dtplyr.

Tibble

Functions related to the creation and coercion of tbl_dfs now live in their own package: tibble. See vignette("tibble") for more details.

The $ and [[ methods never do partial matching (#1504), and throw an error if the variable does not exist.

as_data_frame() is now an S3 generic with methods for lists (the old as_data_frame()), data frames (trivial), and matrices (with efficient C++ implementation) (#876). It no longer strips subclasses.

The internals of data_frame() and as_data_frame() have been aligned, so as_data_frame() will now automatically recycle length-1 vectors. Both functions give more informative error messages if you attempt to create an invalid data frame. You can no longer create a data frame with duplicated names (#820). Both check for POSIXlt columns, and tell you to use POSIXct instead (#813).

print.tbl_df() is considerably faster if you have very wide data frames. It will now also only list the first 100 additional variables not already on screen - control this with the new n_extra parameter to print() (#1161). When printing a grouped data frame the number of groups is now printed with thousands separators (#1398). The type of list columns is correctly printed (#1379).

Package includes setOldClass(c("tbl_df", "tbl", "data.frame")) to help with S4 dispatch (#969).

tbl_cube

tbl_cubes are now constructed correctly from data frames: duplicate dimension values are detected, and missing dimension values are filled with NA. The construction from data frames now guesses the measure variables by default, and allows specification of dimension and/or measure variables (#1568, @krlmlr).

Swap order of dim_names and met_name arguments in as.tbl_cube (for array, table and matrix) for consistency with tbl_cube and as.tbl_cube.data.frame. Also, the met_name argument to as.tbl_cube.table now defaults to "Freq" for consistency with as.data.frame.table (@krlmlr, #1374).

The backend testing system has been improved. This led to the removal of temp_srcs(). In the unlikely event that you were using this function, you can instead use test_register_src(), test_load(), and test_frame().

Multiple partitions or ordering variables in windowed functions no longer generate extra parentheses, so should work for more databases (#1060)

Internals

This version includes an almost total rewrite of how dplyr verbs are translated into SQL. Previously, I used a rather ad-hoc approach, which tried to guess when a new subquery was needed. Unfortunately this approach was fraught with bugs, so in this version I’ve implemented a much richer internal data model. Now there is a three step process:

When applied to a tbl_lazy, each dplyr verb captures its inputs and stores them in an op (short for operation) object.

sql_build() iterates through the operations to build up an object that represents a SQL query. These objects are convenient for testing because they are lists, and they are backend agnostic.

sql_render() iterates through the queries and generates the SQL, using generics (like sql_select()) that can vary based on the backend.

In the short-term, this increased abstraction is likely to lead to some minor performance decreases, but the chance of dplyr generating correct SQL is much higher. In the long-term, these abstractions will make it possible to write a query optimiser/compiler in dplyr, which would make it possible to generate much more succinct queries.

If you have written a dplyr backend, you’ll need to make some minor changes to your package:

sql_join() has been considerably simplified - it is now only responsible for generating the join query, not for generating the intermediate selects that rename the variables. Similarly for sql_semi_join(). If you’ve provided new methods in your backend, you’ll need to rewrite them.

select_query() gains a distinct argument which is used for generating queries for distinct(). It loses the offset argument which was never used (and hence never tested).

src_translate_env() has been replaced by sql_translate_env() which should have methods for the connection object.

There were two other tweaks to the exported API, but these are less likely to affect anyone.

translate_sql() and partial_eval() got a new API: now use connection + variable names, rather than a tbl. This makes testing considerably easier. translate_sql_q() has been renamed to translate_sql_().

Also note that the SQL generation generics now have a default method, instead of methods for DBIConnection and NULL.

Dual table verbs

bind_rows() handles 0-length named lists (#1515), promotes factors to characters (#1538), and warns when binding factor and character (#1485). bind_rows() is more flexible in the way it can accept data frames, lists, lists of data frames, and lists of lists (#1389).

Joins now use the correct class when joining on POSIXct columns (#1582, @joel23888), and consider time zones (#819). Joins handle a by that is empty (#1496), or has duplicates (#1192). Suffixes grow progressively to avoid creating repeated column names (#1460). Joins on string columns should be substantially faster (#1386). Extra attributes are ok if they are identical (#1636). Joins work correctly when factor levels are not equal (#1712, #1559). Anti- and semi-joins give the correct result when the by variable is a factor (#1571), but warn if factor levels are inconsistent (#2741). A clear error message is given for joins where an explicit by contains unavailable columns (#1928, #1932). Warnings about join column inconsistencies now contain the column names (#2728).

There were a number of fixes to enable joining of data frames that don’t have the same encoding of column names (#1513), including working around bug 16885 regarding match() in R 3.3.0 (#1806, #1810, @krlmlr).

Vector functions

Hybrid cummean() is more stable against floating point errors (#1387).

Hybrid lead() and lag() received a considerable overhaul. They are more careful about more complicated expressions (#1588), and fall back more readily to pure R evaluation (#1411). They behave correctly in summarise() (#1434), and handle default values for string columns.

dplyr 0.4.3 2015-09-01

Improved encoding support

Until now, dplyr’s support for non-UTF8 encodings has been rather shaky. This release brings a number of improvements to fix these problems: it’s probably not perfect, but should be a lot better than the previous version. This includes fixes to arrange() (#1280), bind_rows() (#1265), distinct() (#1179), and joins (#1315). print.tbl_df() also received a fix for strings with invalid encodings (#851).

bind_rows() gains a .id argument. When supplied, it creates a new column that gives the name of each data frame (#1337, @lionel-).

bind_rows() respects the ordered attribute of factors (#1112), and does better at comparing POSIXcts (#1125). The tz attribute is ignored when determining if two POSIXct vectors are comparable. If the tz of all inputs is the same, it’s used, otherwise it’s set to UTC.

print.tbl_df() now displays the class for all variables, not just those that don’t fit on the screen (#1276). It also displays duplicated column names correctly (#1159).

print.grouped_df() now tells you how many groups there are.

mutate() can set to NULL the first column (used to segfault, #1329) and it better protects intermediary results (avoiding random segfaults, #1231).

mutate() on grouped data handles the special case where for the first few groups, the result consists of a logical vector with only NA. This can happen when the condition of an ifelse is an all NA logical vector (#958).

dplyr 0.4.2 2015-06-16

This is a minor release containing fixes for a number of crashes and issues identified by R CMD CHECK. There is one new “feature”: dplyr no longer complains about unrecognised attributes, and instead just copies them over to the output.

lag() and lead() for grouped data were confused about indices and therefore produced wrong results (#925, #937). lag() once again overrides lag() instead of just the default method lag.default(). This is necessary due to changes in R CMD check. To use the lag function provided by another package, use pkg::lag.

Fixed a number of memory issues identified by valgrind.

Improved performance when working with a large number of columns (#879).

dplyr 0.4.0 2015-01-08

New features

bind_rows() and bind_cols() efficiently bind a list of data frames by row or column. combine() applies the same coercion rules to vectors (it works like c() or unlist() but is consistent with the bind_rows() rules).

right_join() (include all rows in y, and matching rows in x) and full_join() (include all rows in x and y) complete the family of mutating joins (#96).

group_indices() computes a unique integer id for each group (#771). It can be called on a grouped_df without any arguments or on a data frame with same arguments as group_by().
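
Both calling styles, sketched:

```r
library(dplyr)

# On a plain data frame, pass the grouping specification directly
ids <- group_indices(mtcars, cyl)

# Or group first, then call without arguments
ids2 <- mtcars %>% group_by(cyl) %>% group_indices()

length(ids)   # one id per row
unique(ids)   # one integer per group
```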

New vignettes

vignette("data_frames") describes dplyr functions that make it easier and faster to create and coerce data frames. It subsumes the old memory vignette.

Minor improvements

do() uses lazyeval to evaluate its arguments in the correct environment (#744), and new do_() is the SE equivalent of do() (#718). You can modify grouped data in place: this is probably a bad idea but it’s sometimes convenient (#737). do() on grouped data tables now passes in all columns (not all columns except grouping vars) (#735, thanks to @kismsu). do() with database tables no longer potentially includes grouping variables twice (#673). Finally, do() gives more consistent outputs when there are no rows or no groups (#625).

Overhaul of single table verbs for data.table backend. They now all use a consistent (and simpler) code base. This ensures that (e.g.) n() now works in all verbs (#579).

In *_join(), you can now name only those variables that are different between the two tables, e.g. inner_join(x, y, c("a", "b", "c" = "d")) (#682). If non-join columns are the same, dplyr will add .x and .y suffixes to distinguish the source (#655).
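
For example (hypothetical tables; only the differing names need a c("x" = "y") entry):

```r
library(dplyr)

x <- tibble(a = 1:2, c = c(10, 20))
y <- tibble(a = 1:2, d = c(10, 30))

# "a" matches itself; "c" in x is matched against "d" in y
inner_join(x, y, by = c("a", "c" = "d"))
```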

select() now implements a more sophisticated algorithm so if you’re doing multiple includes and excludes with and without names, you’re more likely to get what you expect (#644). You’ll also get a better error message if you supply an input that doesn’t resolve to an integer column position (#643).

Printing has received a number of small tweaks. All print() methods invisibly return their input so you can interleave print() statements into a pipeline to see interim results. print() will print the column names of 0-row data frames (#652), and will never print more than 20 rows (i.e. options(dplyr.print_max) is now 20, not 100) (#710). Row names are never printed, since no dplyr method is guaranteed to preserve them (#669).

slice() works for data tables (#717). Documentation clarifies that slice can’t work with relational databases, and the examples show how to achieve the same results using filter() (#720).

dplyr now requires RSQLite >= 1.0. This shouldn’t affect your code in any way (except that RSQLite now doesn’t need to be attached) but does simplify the internals (#622).

Functions that need to combine multiple results into a single column (e.g. join(), bind_rows() and summarise()) are more careful about coercion.

Joining factors with the same levels in the same order preserves the original levels (#675). Joining factors with non-identical levels generates a warning and coerces to character (#684). Joining a character to a factor (or vice versa) generates a warning and coerces to character. Avoid these warnings by ensuring your data is compatible before joining.

rbind_list() will throw an error if you attempt to combine an integer and factor (#751). rbind()ing a column full of NAs is allowed and just collects the appropriate missing value for the column type (#493).

summarise() is more careful about NA, e.g. the decision on the result type will be delayed until the first non NA value is returned (#599). It will complain about loss of precision coercions, which can happen for expressions that return integers for some groups and doubles for others (#599).

A number of functions gained new or improved hybrid handlers: first(), last(), nth() (#626), lead() & lag() (#683), %in% (#126). That means that when you use these functions in a dplyr verb, they are handled in C++ rather than by calling back to R, which improves performance.

Hybrid min_rank() correctly handles NaN values (#726). Hybrid implementation of nth() falls back to R evaluation when n is not a length one integer or numeric, e.g. when it’s an expression (#734).

filter.data.table() works if the table has a variable called “V1” (#615).

*_join() keeps columns in original order (#684). Joining a factor to a character vector doesn’t segfault (#688). *_join functions can now deal with multiple encodings (#769), and correctly name results (#855).

group_by() on a data table preserves original order of the rows (#623). group_by() supports variables with more than 39 characters thanks to a fix in lazyeval (#705). It gives a meaningful error message when a variable is not found in the data frame (#716).

New functions

data_frame() by @kevinushey is a nicer way of creating data frames. It never coerces column types (no more stringsAsFactors = FALSE!), never munges column names, and never adds row names. You can use previously defined columns to compute new columns (#376).

distinct() returns distinct (unique) rows of a tbl (#97). Supply additional variables to return the first row for each unique combination of variables.

Set operations, intersect(), union() and setdiff() now have methods for data frames, data tables and SQL database tables (#93). They pass their arguments down to the base functions, which will ensure they raise errors if you pass in too many arguments.

rename() makes it easy to rename variables - it works similarly to select() but it preserves columns that you didn’t otherwise touch.

slice() allows you to select rows by position (#226). It keeps positive integers, drops negative integers, and you can use expressions like n().
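
For example:

```r
library(dplyr)

slice(mtcars, 1:3)     # the first three rows
slice(mtcars, -(1:3))  # drop the first three rows
slice(mtcars, n())     # the last row
```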

Programming with dplyr (non-standard evaluation)

You can now program with dplyr - every function that does non-standard evaluation (NSE) has a standard evaluation (SE) version ending in _. This is powered by the new lazyeval package which provides all the tools needed to implement NSE consistently and correctly.

See vignette("nse") for full details.

regroup() is deprecated. Please use the more flexible group_by_() instead.

Minor improvements and bug fixes

%>% is simply re-exported from magrittr, instead of creating a local copy (#496, thanks to @jimhester)

Examples now use nycflights13 instead of hflights because its variables have better names and there are a few interlinked tables (#562). Lahman and nycflights13 are (once again) suggested packages. This means many examples will not work unless you explicitly install them with install.packages(c("Lahman", "nycflights13")) (#508). dplyr now depends on Lahman 3.0.1. A number of examples have been updated to reflect modified field names (#586).

do() now displays the progress bar only when used in interactive prompts and not when knitting (#428, @jimhester).

group_by() has more consistent behaviour when grouping by constants: it creates a new column with that value (#410). It renames grouping variables (#410). The first argument is now .data so you can create new groups with name x (#534).

Now instead of overriding lag(), dplyr overrides lag.default(), which should avoid clobbering lag methods added by other packages. (#277).

Minor improvements and bug fixes by backend

Databases

Correct SQL generation for paste() when used with the collapse parameter targeting a Postgres database. (@rbdixon, #1357)

The db backend system has been completely overhauled in order to make it possible to add backends in other packages, and to support a much wider range of databases. See vignette("new-sql-backend") for instructions on how to create your own (#568).

order_by() now works in conjunction with window functions in databases that support them.

Data frames/tbl_df

All verbs now understand how to work with difftime() (#390) and AsIs (#453) objects. They all check that colnames are unique (#483), and are more robust when columns are not present (#348, #569, #600).

Hybrid evaluation bugs fixed:

Call substitution stopped too early when a sub expression contained a $ (#502).

Cubes

dplyr 0.2 2014-05-21

Piping

dplyr now imports %>% from magrittr (#330). I recommend that you use this instead of %.% because it is easier to type (since you can hold down the shift key) and is more flexible. With %>%, you can control which argument on the RHS receives the LHS by using the pronoun .. This makes %>% more useful with base R functions because they don’t always take the data frame as the first argument. For example you could pipe mtcars to xtabs() with:

mtcars %>% xtabs(~ cyl + vs, data = .)

Thanks to @smbache for the excellent magrittr package. dplyr only provides %>% from magrittr, but it contains many other useful functions. To use them, load magrittr explicitly: library(magrittr). For more details, see vignette("magrittr").

%.% will be deprecated in a future version of dplyr, but it won’t happen for a while. I’ve also deprecated chain() to encourage a single style of dplyr usage: please use %>% instead.

Do

do() has been completely overhauled. There are now two ways to use it, either with multiple named arguments or a single unnamed argument. group_by() + do() is equivalent to plyr::dlply(), except it always returns a data frame.

If you use named arguments, each argument becomes a list-variable in the output. A list-variable can contain any arbitrary R object so it’s particularly well suited for storing models.
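
A sketch of the named-argument form, which stores one fitted model per group:

```r
library(dplyr)

models <- mtcars %>%
  group_by(cyl) %>%
  do(fit = lm(mpg ~ wt, data = .))

models$fit[[1]]  # an lm object for the first cyl group
```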

Minor improvements

compute() gains a temporary argument so you can control whether the results are temporary or permanent (#382, @cpsievert)

group_by() now defaults to add = FALSE so that it sets the grouping variables rather than adding to the existing list. I think this is how most people expected group_by to work anyway, so it’s unlikely to cause problems (#385).

dplyr is more careful when setting the keys of data tables, so it never accidentally modifies an object that it doesn’t own. It also avoids unnecessary key setting which negatively affected performance. (#193, #255).

print() methods for tbl_df, tbl_dt and tbl_sql gain an n argument to control the number of rows printed (#362). They also work better when you have columns containing lists of complex objects.

row_number() can be called without arguments, in which case it returns the same as 1:n() (#303).

"comment" attribute is allowed (whitelisted), as well as names (#346).

Bug fixes

filter() now fails when given anything other than a logical vector, and correctly handles missing values (#249). filter.numeric() proxies stats::filter(), so you can continue to use the filter() function with numeric inputs (#264).

filter() handles scalar results (#217) and better handles scoping, e.g. filter(., variable) where variable is defined in the function that calls filter. It also handles T and F as aliases to TRUE and FALSE if there are no T or F variables in the data or in the scope.

select.grouped_df() fails when the grouping variables are not included in the selected variables (#170).

all.equal.data.frame() handles a corner case where the data frame has NULL names (#217).