CS 330: Lecture 5 – Lookaround Assertions and Numeric Ranges

To start things off, we’re going to play some Regex Bingo. Find a partner and make a 4×4 grid of randomly generated strings. Include uppercase and lowercase letters, numbers, whitespace, and punctuation. Keep the strings short. There’s no free space.

I will call out the following regexes, one at a time. Cross out all strings that match.

Find all the statements in a JavaScript source file that don’t end in a semicolon. (As a proxy, we’ll say all lines that have code on them and don’t end in a curly brace, comma, or semicolon.)

Evaluate embedded mathematical expressions in a report.

Fix missing quotation marks around single-token attributes in HTML.

Replace all numbers in [0, 255] with their binary representation followed by a b suffix. For example, 6 turns into 110b.
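As a rough sketch of the first problem in Ruby (the language of the examples in these notes), the proxy can be phrased as "the last non-whitespace character on the line is not a curly brace, comma, or semicolon." The helper name unterminated_lines and the sample source are made up for illustration, and comment lines will be flagged along with real statements.

```ruby
# Hypothetical helper: returns the lines of a JavaScript source string
# that contain code but do not end in {, }, a comma, or a semicolon.
def unterminated_lines(source)
  # [^\s{},;] must be the last non-whitespace character on the line;
  # \s*\z allows only trailing whitespace after it. Blank lines have
  # no such character, so they never match.
  source.lines.map(&:chomp).grep(/[^\s{},;]\s*\z/)
end

sample = <<~JS
  let x = 1
  foo();
  if (x) {
    bar()
  }
JS

puts unterminated_lines(sample)
```

On this sample, only the lines containing let x = 1 and bar() are reported.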

Occasionally in our patterns, we find ourselves wanting to match some text X adjacent to some other text Y. We don’t really want to do anything with text Y, but we need it in our pattern to serve as an anchor. We end up capturing text Y only to reinsert it, unmodified. For example, in the missing quotation marks problem, we capture the attribute but reinsert it as is:

id.gsub!(/(\w+=)([^" ]+)/,'\1"\2"')

We didn’t really do anything with the attribute. It’d be great if we could just replace the content that we do care about. We can—with lookaround assertions:

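id.gsub!(/(?<=\w=)([^" ]+)/,'"\1"')

Two details to note: the pattern now has only one capture group, so the replacement refers to \1, and Ruby requires lookbehind patterns to have a fixed width, so the assertion checks for \w= rather than \w+=.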

Lookaround assertions allow us to mark elements as anchor points, dictating where a match occurs, but not actually including the anchoring text as part of the match. I use them a lot in my text editor to position my cursor after some anchoring text. For example, to get my cursor just inside divs with class foo, I’d run the following search in Vim: /\(<div class="foo">\)\@<=.
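To see the two approaches side by side, here is a small sketch. The sample tag is invented, and the character class also excludes > (an adjustment not in the patterns above) so that the closing bracket isn't swallowed into the value. The lookbehind uses a single \w because Ruby's regex engine rejects variable-width lookbehind.

```ruby
html = '<div id=main class=wide>'

# Capture-and-reinsert: group 1 is the anchor (attribute name and equals
# sign), group 2 is the unquoted value.
with_capture = html.gsub(/(\w+=)([^" >]+)/, '\1"\2"')

# Lookbehind: the anchor is asserted but not consumed, so the whole
# match (\0 in the replacement) is just the value.
with_lookbehind = html.gsub(/(?<=\w=)[^" >]+/, '"\0"')

puts with_capture
puts with_lookbehind
```

Both calls should print the same tag, with main and wide wrapped in quotation marks.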

Additionally, sometimes we want to search for numbers that fall into a restricted numeric range. Regex may not be the best tool for this, but it can be made to work. Start by drawing out a table of the range you wish to capture. For the problem above, we want to find byte literals in [0, 255], so here’s our table:

0  1  2  3  4  5  6  7  8  9
00  01  02  03  04  05  06  07  08  09
10  11  12  13  14  15  16  17  18  19
20  21  22  23  24  25  26  27  28  29
30  31  32  33  34  35  36  37  38  39
40  41  42  43  44  45  46  47  48  49
50  51  52  53  54  55  56  57  58  59
60  61  62  63  64  65  66  67  68  69
70  71  72  73  74  75  76  77  78  79
80  81  82  83  84  85  86  87  88  89
90  91  92  93  94  95  96  97  98  99
100  101  102  103  104  105  106  107  108  109
110  111  112  113  114  115  116  117  118  119
120  121  122  123  124  125  126  127  128  129
130  131  132  133  134  135  136  137  138  139
140  141  142  143  144  145  146  147  148  149
150  151  152  153  154  155  156  157  158  159
160  161  162  163  164  165  166  167  168  169
170  171  172  173  174  175  176  177  178  179
180  181  182  183  184  185  186  187  188  189
190  191  192  193  194  195  196  197  198  199
200  201  202  203  204  205  206  207  208  209
210  211  212  213  214  215  216  217  218  219
220  221  222  223  224  225  226  227  228  229
230  231  232  233  234  235  236  237  238  239
240  241  242  243  244  245  246  247  248  249
250  251  252  253  254  255

Now we decompose this table into a series of rectangular groups, with members of each group having something in common. For example, this group has only single digits:

0  1  2  3  4  5  6  7  8  9

This group has only two digits:

00  01  02  03  04  05  06  07  08  09
10  11  12  13  14  15  16  17  18  19
20  21  22  23  24  25  26  27  28  29
30  31  32  33  34  35  36  37  38  39
40  41  42  43  44  45  46  47  48  49
50  51  52  53  54  55  56  57  58  59
60  61  62  63  64  65  66  67  68  69
70  71  72  73  74  75  76  77  78  79
80  81  82  83  84  85  86  87  88  89
90  91  92  93  94  95  96  97  98  99

This group has three digits with a leading 1:

100  101  102  103  104  105  106  107  108  109
110  111  112  113  114  115  116  117  118  119
120  121  122  123  124  125  126  127  128  129
130  131  132  133  134  135  136  137  138  139
140  141  142  143  144  145  146  147  148  149
150  151  152  153  154  155  156  157  158  159
160  161  162  163  164  165  166  167  168  169
170  171  172  173  174  175  176  177  178  179
180  181  182  183  184  185  186  187  188  189
190  191  192  193  194  195  196  197  198  199

This group has three digits with a leading 2 and a tens digit in [0, 4]:

200  201  202  203  204  205  206  207  208  209
210  211  212  213  214  215  216  217  218  219
220  221  222  223  224  225  226  227  228  229
230  231  232  233  234  235  236  237  238  239
240  241  242  243  244  245  246  247  248  249

And this group is what’s left over. These numbers have a leading 25 and a ones digit in [0, 5]:

250  251  252  253  254  255

These rectangles can be individually matched with these expressions:

\d

\d\d

1\d\d

2[0-4]\d

25[0-5]

To match the whole range, we union the rectangles together with |.
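Put together, the union might look like the following sketch, applied to the binary-conversion problem from the start of class. The constant BYTE and the helper to_binary_literals are names made up here. The word boundaries are an added assumption: without them, the pattern would happily match the 25 inside 256. They are not a complete fix, since a decimal like 3.14 would still be treated as two separate numbers.

```ruby
# The five rectangles, widest first, wrapped in word boundaries so that
# digits inside a longer number such as 256 or 1000 are not matched.
BYTE = /\b(?:25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\b/

# Hypothetical helper: replaces each number in [0, 255] with its binary
# representation followed by a b suffix.
def to_binary_literals(text)
  # String#to_i parses base 10, so a leading zero as in 07 is harmless.
  text.gsub(BYTE) { |number| number.to_i.to_s(2) + 'b' }
end

puts to_binary_literals('6 and 255 but not 256')
```

This should print 110b and 11111111b but not 256.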

Here’s your TODO list for next time:

Examine regexes in a language, editor, or tool that you use regularly. On a quarter sheet, describe and write a short program that accomplishes some regex pattern matching task in that language. Aim for some task that involves capturing.