Though regular expressions can be a little tough to grasp at first, they are really useful and a great thing to learn early on. I understand that this is a new topic, but I don't think it is more difficult to understand than substring (so it would take only a few extra minutes to teach), and in the long run knowing regular expressions could be more useful.

I wrote the following brief intro on regular expressions:

A regular expression is a sequence of characters that defines a search pattern. In programming, we use regular expressions to check whether a certain pattern occurs in a set of strings. For example, if you have a list of students and want to find those with the last name Smith, you could use the regular expression “Smith”. In R, you use the syntax:

grepl("REGULAR EXPRESSION", VARIABLE)
or, in this example:
grepl("Smith", Name)
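
As a minimal sketch of how this behaves (the names in `Name` are made up for illustration), note that the pattern matches anywhere in the string, so "Smithson" is matched too:

```r
# Hypothetical vector of student names, invented for this example
Name <- c("Anna Smith", "Raj Patel", "John Smithson", "Lee Smith")

# grepl() returns one TRUE/FALSE per element of Name
has_smith <- grepl("Smith", Name)
has_smith
# TRUE FALSE TRUE TRUE

# Use the logical vector to keep only the matching names
Name[has_smith]
```

The logical vector can be used the same way to subset rows of a data frame.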

Because a regular expression is a sequence of characters, it is matched against string values. However, you can use regular expressions to search for a pattern of numbers if the variable you are searching through is a string. For example, you could have an employee id that is a string of numbers and letters. Suppose you want to subset on all employee ids that start with “185”, because “185” identifies employees who work in a particular field. You can use the regular expression “^185” to search through employee id.

There are a lot of tricks you can use with regular expressions. Here I use “^” before “185” to indicate that I want to find strings that start with 185. If we wanted to find strings that end with “185”, we would use “185$”.
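
A small sketch of the two anchors, assuming a made-up vector called `employee_id` (not from any real dataset):

```r
# Hypothetical employee ids stored as strings
employee_id <- c("185A20", "A20185", "218500", "185")

# "^" anchors the pattern to the start of the string
starts_185 <- grepl("^185", employee_id)
starts_185
# TRUE FALSE FALSE TRUE

# "$" anchors the pattern to the end of the string
ends_185 <- grepl("185$", employee_id)
ends_185
# FALSE TRUE FALSE TRUE

# Subset on the ids that start with 185
employee_id[starts_185]
```

The id "218500" contains "185" but matches neither pattern, because the digits sit in the middle of the string.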

If you want to search for a pattern of digits in a variable that is stored in a numeric format (such as double or integer), convert the variable to a string with as.character() before applying the regular expression. grepl() will coerce numeric input to character for you, but converting explicitly keeps the intent clear.
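
For instance, here is a sketch with a hypothetical numeric id column, converted explicitly before matching:

```r
# Hypothetical ids stored as numbers rather than strings
id_num <- c(18501, 20185, 185)

# Convert to character so the pattern is matched against text
id_chr <- as.character(id_num)

starts_185 <- grepl("^185", id_chr)
starts_185
# TRUE FALSE TRUE
```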

Let me know what you think! I'm excited to start contributing.

Hi @hkronenb, thanks for the suggestion, and we're glad to have you excited to contribute too!

I agree with you that the current example is not very elegant or R-like (substr() is probably a little too into the weeds for this particular spot in this lesson, IMO), and that regular expressions would be the proper and in many cases easier and more elegant way to do this.

I am a little hesitant to dive into them in this lesson, though, because they could easily become a whole module unto themselves. But they are certainly a powerful tool that would be useful to introduce, even if only briefly. Let me think a little more about how and where to fit in this content, and I'll get back to you.

[In which Noam clears out like a month of GitHub notifications]

I would hesitate to "partially" introduce regular expressions at all; regex and string handling would easily be an additional module or two. I think it might make sense to introduce them as a concept, though not here. The tidyr lesson might be a good place, because things like sep arguments can use but do not require regexes. So you could show one regex example, and then offer an advanced version of an exercise that requires students to look up some additional regex syntax.
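
To make the sep idea concrete, here is a rough sketch using tidyr's separate(); the data frame and column names are invented for illustration and are not taken from the lesson:

```r
library(tidyr)

# Hypothetical keys that use two different delimiters
obs <- data.frame(key = c("gdp_1952", "pop-1952"), stringsAsFactors = FALSE)

# A character sep is treated as a regular expression,
# so "[-_]" splits on either a hyphen or an underscore
split_obs <- separate(obs, key, into = c("measure", "year"), sep = "[-_]")
split_obs
```

With a single consistent delimiter, a plain sep = "_" would do the same job, so the regex is optional here.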

Given that, I would see if one could re-write this portion of the lesson without string handling at all, so as to reduce the length and cognitive load of the lesson. None of the concepts being conveyed here require string handling. One could re-write these pipelines to simply select 4 countries by name, or use a different logical filter to reduce to some other small set of countries (say, the highest or lowest GDP countries).
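
As a sketch of what that could look like without any string handling (base R, with a toy data frame whose country names and values are placeholders, not real figures):

```r
# Toy data: placeholder countries and made-up values, for illustration only
gap <- data.frame(
  country   = c("Country A", "Country B", "Country C", "Country D", "Country E"),
  gdpPercap = c(100, 400, 200, 300, 500),
  stringsAsFactors = FALSE
)

# Option 1: select a few countries by exact name
selected <- gap[gap$country %in% c("Country A", "Country C"), ]

# Option 2: use a logical/ordering filter, here the two highest values
top2 <- head(gap[order(-gap$gdpPercap), ], 2)
top2$country
# "Country E" "Country B"
```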