4
Text patterns and matches
A regular expression, or regex for short, is a pattern describing a certain amount of text
In these slides, regular expressions are highlighted, as in regex
–regex is the most basic kind of pattern: it simply matches the literal text regex
I will use the term “string” for the text that the regular expression is applied to; strings are highlighted as well, as in string

5
Literal characters
The most basic regular expression consists of a single literal character, e.g. a
–it matches the first occurrence of that character in the string
–on Jack is a boy, it matches the a in Jack, not the later a
In these slides, I’ll sometimes use a shorter notation
–a: Jack is a boy
Eleven characters have special meanings
–[ \ ^ $ . | ? * + ( )
–these are called metacharacters
–escape a metacharacter with a backslash to match it literally: use 1\+1=2 to match 1+1=2
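The literal-character and escaping rules above can be tried out with a short sketch. It assumes Python's standard `re` module as the regex engine (the slides do not name one for these examples), and the variable names are only illustrative.

```python
import re

text = "Jack is a boy"

# re.search scans left to right and returns the first match it finds.
first = re.search(r"a", text)
print(first.start())  # 1, the a in "Jack"

# An unescaped + is a quantifier, so 1+1=2 does not match the literal text.
unescaped = re.search(r"1+1=2", "1+1=2")
print(unescaped)  # None

# Escaping the + with a backslash makes it a literal character.
escaped = re.search(r"1\+1=2", "1+1=2")
print(escaped.group())  # 1+1=2

# re.escape adds the backslashes for a whole literal string.
auto_escaped = re.escape("1+1=2")
print(re.fullmatch(auto_escaped, "1+1=2") is not None)  # True
```

Using `re.escape` is usually safer than escaping by hand when the literal text comes from user input.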

6
Character classes/sets
Match only one out of several characters
–to match an a or an e, use [ae]
–you could use this in gr[ae]y to match gray or grey
–a character class matches only a single character
–gr[ae]y will not match graay or graey
–the order of the characters inside the class does not matter
Use a hyphen to specify a range of characters
–[0-9] matches a single digit between 0 and 9
–combine several ranges: [0-9a-fA-F]
–combine ranges and single characters: [0-9a-fxA-FX]
A caret after the opening square bracket negates the class
–q[^x] matches qu in question but does not match Iraq, since there is no character after the q for the negated character class to match
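As a rough illustration of the character-class rules above, the following sketch again assumes Python's `re` module; the test strings are made up for the example.

```python
import re

# A class matches exactly one character from the set.
gray_matches = re.findall(r"gr[ae]y", "gray grey graay graey")
print(gray_matches)  # ['gray', 'grey']

# Several ranges combined: one hexadecimal digit, repeated with +.
hex_ok = re.fullmatch(r"[0-9a-fA-F]+", "1aF")
hex_bad = re.fullmatch(r"[0-9a-fA-F]+", "1aG")
print(hex_ok is not None, hex_bad is not None)  # True False

# A negated class still has to consume one character.
in_question = re.search(r"q[^x]", "question")
in_iraq = re.search(r"q[^x]", "Iraq")
print(in_question.group())  # qu
print(in_iraq)  # None, nothing follows the q
```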

7
Shorthand character classes
\d matches a single character that is a digit
\w matches a word character
–alphanumeric characters plus underscore
\s matches a whitespace character
–includes tabs and line breaks
–\S is the negation of \s: it matches any character that is not whitespace
The actual characters matched by the shorthands depend on the software you’re using
–$ man perlre
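The point that the shorthands depend on the engine can be seen in a small sketch. It assumes Python 3's `re` module, where `\d` on text strings also matches non-ASCII digits unless the `re.ASCII` flag is given; other engines may behave differently.

```python
import re

sample = "id_7\tok\n"

digits = re.findall(r"\d", sample)
words = re.findall(r"\w+", sample)
spaces = re.findall(r"\s", sample)
non_spaces = re.findall(r"\S+", sample)
print(digits)      # ['7']
print(words)       # ['id_7', 'ok']
print(spaces)      # ['\t', '\n']
print(non_spaces)  # ['id_7', 'ok']

# U+0663 is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit.
arabic_three = "\u0663"
unicode_digit = re.fullmatch(r"\d", arabic_three)
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII)
print(unicode_digit is not None)  # True in Python 3
print(ascii_digit is not None)    # False with re.ASCII
```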

8
Non-printable characters
Use special character sequences to put non-printable characters in a regular expression
–\t for tab (ASCII 0x09)
–\r for carriage return (0x0D)
–\n for line feed (0x0A)
Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n
Use \xFF to match a specific character by its hexadecimal index in the character set
–\xA9 matches the copyright symbol in the Latin-1 character set
Use \uFFFF for a Unicode character (if supported)
–\u20AC matches the euro currency sign
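A minimal sketch of these escapes, assuming Python's `re` module (which supports `\xhh` and `\uhhhh` in text patterns). The pattern `\r?\n` for handling both line-ending styles is an illustration, not something taken from the slides.

```python
import re

windows_text = "first\r\nsecond\r\n"
unix_text = "first\nsecond\n"

# An optional \r before \n covers both Windows and UNIX line endings.
line_end = re.compile(r"\r?\n")
windows_lines = line_end.split(windows_text)
unix_lines = line_end.split(unix_text)
print(windows_lines)  # ['first', 'second', '']
print(unix_lines)     # ['first', 'second', '']

# \t in the pattern matches a tab character in the string.
tab_pos = re.search(r"\t", "a\tb").start()
print(tab_pos)  # 1

# \xA9 is the copyright symbol; U+20AC is the euro sign.
copyright_match = re.search(r"\xA9", "\u00a9 Example Corp")
euro_match = re.search(r"\u20AC", "price: 5 \u20ac")
print(copyright_match is not None, euro_match is not None)  # True True
```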

9
The dot
The dot, ., matches (almost) any character
The dot matches a single character, except line break characters
–short for [^\n]
–gr.y matches gray, grey, gr%y, etc.
Most regex engines have a “dot matches all” or “single line” mode that makes the dot match any single character, including line breaks
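To see the default behaviour and the “dot matches all” mode side by side, here is a short sketch assuming Python's `re` module, where that mode is the `re.DOTALL` flag.

```python
import re

text = "gray grey gr%y gr\ny"

# By default the dot does not match a line break.
default_matches = re.findall(r"gr.y", text)
print(default_matches)  # ['gray', 'grey', 'gr%y']

# With re.DOTALL the dot matches the line break as well.
dotall_matches = re.findall(r"gr.y", text, re.DOTALL)
print(dotall_matches)  # ['gray', 'grey', 'gr%y', 'gr\ny']
```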

10
Anchors
Anchors do not match any characters but match a position
–^ matches at the start of the string
–$ matches at the end of the string
Most regex engines have a “multi-line” mode that makes ^ match after any line break, and $ before any line break
–b$ matches only the last b in bob
\b matches at a word boundary
–a word boundary is a position between a character that can be matched by \w and a character that cannot be matched by \w
–\b also matches at the start and/or end of the string if the first and/or last characters in the string are word characters
–\B matches at every position where \b cannot match
–\bis\b: This island is beautiful (only the standalone word is matches)
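The anchor behaviour described above can be checked with a small sketch, assuming Python's `re` module, where the “multi-line” mode is the `re.MULTILINE` flag. The two-line sample text is invented for the example.

```python
import re

# b$ can only match the b at the very end of the string.
last_b = re.search(r"b$", "bob").start()
print(last_b)  # 2

text = "first line\nsecond line"

# Without multi-line mode, ^ matches only at the start of the string.
start_only = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each line break.
every_line = re.findall(r"^\w+", text, re.MULTILINE)
print(start_only)  # ['first']
print(every_line)  # ['first', 'second']

sentence = "This island is beautiful"
# \bis\b skips the is inside "This" and "island".
whole_word = [m.start() for m in re.finditer(r"\bis\b", sentence)]
# \Bis needs a word character before the i, so only "This" qualifies.
inside_word = [m.start() for m in re.finditer(r"\Bis", sentence)]
print(whole_word)   # [12]
print(inside_word)  # [2]
```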

11
Alternation
Alternation is the regular expression equivalent of “or”
–cat|dog: About cats and dogs (the first match is cat; searching again finds dog)
You can add as many alternatives as you want
–cat|dog|mouse|fish
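A brief sketch of alternation, assuming Python's `re` module. The second test string is made up; it shows that the leftmost match in the string wins, regardless of the order of the alternatives.

```python
import re

# findall keeps searching after each match, so both alternatives show up.
pets = re.findall(r"cat|dog", "About cats and dogs")
print(pets)  # ['cat', 'dog']

# fish appears earlier in the string than mouse, so it is matched first.
first_animal = re.search(r"cat|dog|mouse|fish", "A fish and a mouse").group()
print(first_animal)  # fish
```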

12
Repetition
? makes the preceding token in the regular expression optional
–colou?r matches colour or color
* matches the preceding token zero or more times
+ matches the preceding token once or more
– matches an HTML tag without any attributes
– is easier to write but matches invalid tags such as
{} specifies a specific amount of repetition
–\b[1-9][0-9]{3}\b matches 1000–9999
–\b[1-9][0-9]{2,4}\b matches 100–99999
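The quantifiers can be exercised with the sketch below, which assumes Python's `re` module. The two HTML-tag patterns in this transcript are missing, so `tag_strict` and `tag_loose` are assumptions chosen to fit the surrounding descriptions, not the slide's own patterns.

```python
import re

# ? makes the u optional, but does not allow two of them.
colours = re.findall(r"colou?r", "color colour colouur")
print(colours)  # ['color', 'colour']

# Assumed patterns: a letter followed by any letters or digits (strict),
# versus one or more letters or digits in any order (loose).
tag_strict = r"<[A-Za-z][A-Za-z0-9]*>"
tag_loose = r"<[A-Za-z0-9]+>"
tags = "<b> <h1> <1>"
strict_tags = re.findall(tag_strict, tags)
loose_tags = re.findall(tag_loose, tags)
print(strict_tags)  # ['<b>', '<h1>']
print(loose_tags)   # ['<b>', '<h1>', '<1>']

# {3} repeats exactly three times; {2,4} between two and four times.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")
print(four_digits)    # ['1000', '9999']
print(three_to_five)  # ['100', '99999']
```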

13
Greedy and lazy repetition
The repetition operators or quantifiers are greedy
–they will expand the match as far as they can, and only give back if they must to satisfy the remainder of the regex
– : This is a first test
Place a question mark after the quantifier to make it lazy, i.e., stop matching as soon as possible
– : This is a first test
A better solution is to use ]+> to quickly match an HTML tag without regard to attributes
–the negated character class is more specific than the dot, which helps the regex engine find matches quickly
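The difference between greedy and lazy matching is easiest to see on a concrete string. The patterns and the tagged sample text in this transcript are incomplete, so everything in the sketch below (`<.+>`, `<.+?>`, `<[^<>]+>`, and the sample sentence) is an assumption that follows the descriptions above, using Python's `re` module.

```python
import re

# Assumed sample text containing a pair of HTML tags.
text = "This is a <EM>first</EM> test"

# Greedy: .+ runs to the end, then backs off to the last >.
greedy = re.search(r"<.+>", text).group()
print(greedy)  # <EM>first</EM>

# Lazy: .+? stops at the first > it can reach.
lazy = re.search(r"<.+?>", text).group()
print(lazy)  # <EM>

# Negated class: cannot run past a > at all, so no backtracking is needed.
negated = re.findall(r"<[^<>]+>", text)
print(negated)  # ['<EM>', '</EM>']
```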

14
Grouping and backreferences
Place round brackets, (), around multiple tokens to group them together
–you can then apply a quantifier to the group
–Set(Value)? matches Set or SetValue
Round brackets create a capturing group
–the above example has one group
–how to access the group’s contents depends on the software or programming language you’re using
Group zero always contains the entire regex match
–Set(Value)?: SetValue, then $0 = SetValue, $1 = Value
–Set(Value)?: Set, then $0 = Set, $1 is nothing
Use the special syntax Set(?:Value)? to group tokens without creating a capturing group
–more efficient if you don’t need the contents
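How group contents are accessed depends on the language, as noted above. As one example, this sketch assumes Python's `re` module, where `$0` and `$1` correspond to `group(0)` and `group(1)` on a match object, and a group that did not take part in the match is reported as `None`.

```python
import re

capturing = re.compile(r"Set(Value)?")

with_value = capturing.fullmatch("SetValue")
print(with_value.group(0))  # SetValue
print(with_value.group(1))  # Value

without_value = capturing.fullmatch("Set")
print(without_value.group(0))  # Set
print(without_value.group(1))  # None, the optional group did not match

# (?:...) groups tokens without creating a capturing group.
non_capturing = re.compile(r"Set(?:Value)?")
print(capturing.groups, non_capturing.groups)  # 1 0
```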

15
Look-around
Look-around is a special kind of group
The tokens inside the group are matched normally, but then the regex engine makes the group give up its match and keeps only the result (whether a match is possible)
Look-around matches a position, just like anchors
–q(?=u) matches the q in question, but not the q in Iraq: positive look-ahead
–(?=u) matches at each position in the string before a u; the u is not part of the overall regex match
–q(?!u) matches the q in Iraq but not the q in question: negative look-ahead
–(?<=a)b matches the b in abc: positive look-behind
–(?
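The three look-around forms described above can be tried with the following sketch, assuming Python's `re` module. In each case the matched text is only the single character outside the look-around group.

```python
import re

# Positive look-ahead: q must be followed by u, but u is not consumed.
ahead_question = re.search(r"q(?=u)", "question")
ahead_iraq = re.search(r"q(?=u)", "Iraq")
print(ahead_question.group())  # q
print(ahead_iraq)              # None

# Negative look-ahead: q must not be followed by u.
not_ahead_iraq = re.search(r"q(?!u)", "Iraq")
not_ahead_question = re.search(r"q(?!u)", "question")
print(not_ahead_iraq.group())  # q
print(not_ahead_question)      # None

# Positive look-behind: b must be preceded by a, but a is not consumed.
behind = re.search(r"(?<=a)b", "abc")
print(behind.group(), behind.start())  # b 1
```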