Categorize URLs using Regex

For those not afraid of regex, here is a complementary script to categorize URLs. Be careful: the regex order is important, as some values can overwrite others. Usually, it's a good idea to place the home page last.
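Since the original script isn't reproduced here, below is a minimal sketch of the idea in R. The section patterns (/blog/, /product/) and the INDEX$Url column are assumptions for illustration; adapt them to your own crawl. Later assignments overwrite earlier ones, which is why the home page comes last.

# Minimal sketch (not the original script): tag each crawled URL with a category.
# INDEX is the data frame Rcrawler creates; the Url column name is an assumption.
pages <- data.frame(url = INDEX$Url, stringsAsFactors = FALSE)
pages$category <- "other"
# hypothetical section patterns, adapt them to your site
pages$category[grepl("/blog/", pages$url)] <- "blog"
pages$category[grepl("/product/", pages$url)] <- "product"
# home page last, so it overwrites any broader pattern that also matched it
pages$category[grepl("^https?://www\\.example\\.com/?$", pages$url)] <- "home"
table(pages$category)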

What if I want to follow robots.txt rules?

Just add the Obeyrobots parameter:

# like that
Rcrawler(Website="https://www.gokam.co.uk/", Obeyrobots=TRUE)

What if I want to limit crawling speed?

By default, this crawler is rather quick and can grab a lot of webpages in no time. Every advantage has its inconvenience: it's fairly easy to get wrongly detected as a DoS attack. To limit the risk, I suggest you use the RequestsDelay parameter. It's the time interval between each round of parallel HTTP requests, in seconds. Example:

# this will add a 10 second delay between each round of parallel requests
Rcrawler(Website="https://www.example.com/", RequestsDelay=10)

Other interesting limitation options:

no_cores: specifies the number of clusters (logical CPUs) used for parallel crawling; by default it's the number of available cores.

no_conn: it’s the number of concurrent connections per one core, by default it takes the same value of no_cores.

What if I want to crawl only a subfolder?

Two parameters help you do that: crawlUrlfilter will limit the crawl, and dataUrlfilter will tell it from which URLs data should be extracted.
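For example, a sketch restricting everything to a hypothetical /blog/ subfolder (the path is just an illustration):

# crawl only URLs matching /blog/ and only extract data from them (hypothetical path)
Rcrawler(Website="https://www.example.com/", crawlUrlfilter="/blog/", dataUrlfilter="/blog/")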

Where to find a proxy? It's been a while since I last needed one, so I don't know.

Where are the internal Links?

By default, RCrawler doesn't save internal links; you have to ask for them explicitly by using the NetworkData option, like this:

Rcrawler(Website="https://www.gokam.co.uk/", NetworkData=TRUE)

Then you’ll have two new variables available at the end of the crawling:

NetwIndex: a variable containing all the webpage URLs. The row numbers are the same as the locally stored HTML files, so row n°1 = homepage = 1.html

NetwIndex data frame

NetwEdges: a data frame with all the links. It's a bit confusing, so let me explain:

NetwEdges data frame

Each row is a link. The From and To columns indicate from which page to which page each link goes.

In the image above: row n°1 is a link from the homepage (page n°1) to the homepage; row n°2 is a link from the homepage to webpage n°2. According to the NetwIndex variable, page n°2 is the article about rvest, etc.

Weight is the depth level at which the link connection was discovered. All the first rows are from the homepage, so level 0.

Type is either 1 for internal hyperlinks or 2 for external hyperlinks.
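If you just want a quick look at these two objects in the console, something like this works (a trivial sketch):

# peek at the page index and the link graph
head(NetwIndex)   # page URLs; row number = page id = <id>.html
head(NetwEdges)   # one row per link: From, To, Weight (depth), Type (1 = internal, 2 = external)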

Count Links

I guess you guys are interested in counting links. Here is the code to do it. I won't go into too many explanations, as it would be too long; if you are interested (and motivated), go and check out the dplyr package and specifically its data wrangling functions.

Count outbound links

library(dplyr)
count_from<-NetwEdges[,1:2]%>%
# grabbing the first two columns
distinct()%>%
# if there are several links from and to the same page, the duplicates will be removed
group_by(From)%>%
summarise(n=n())
# the counting
View(count_from)
# we want to view the results

the homepage (n°1) has 13 outbound links

To make it more readable, let's replace page IDs with URLs.

# replace page ids with URLs (note: for outbound counts the column is From)
count_from$From<-NetwIndex
View(count_from)

using website URLs

Count inbound links

The same thing but the other way around

count_to<-NetwEdges[,1:2]%>%
# grabbing the first two columns
distinct()%>%
# if there are several links from and to the same page, the duplicates will be removed
group_by(To)%>%
summarise(n=n())
# the counting
View(count_to)
# we want to view the results

count of inbound links

Again, to make it more readable:

count_to$To<-NetwIndex
View(count_to)

using website URLs

So the useless 'author page' has 14 links pointing at it, as many as the homepage… Maybe I should fix this one day.

Compute ‘Internal Page Rank’

Many SEOs I spoke to seem to be very interested in this, so I might as well add the tutorial here. It is very much an adaptation of Paul Shapiro's awesome script.

But instead of using a Screaming Frog export file, we will use the previously extracted links.

links<-NetwEdges[,1:2]%>%
# grabbing the first two columns
distinct()
# loading igraph package
library(igraph)
# loading website internal links inside a graph object
g<-graph.data.frame(links)
# this is the main function, don't ask how it works
pr<-page.rank(g,algo="prpack",vids=V(g),directed=TRUE,damping=0.85)
# grabbing the result inside a dedicated data frame
values<-data.frame(pr$vector)
values$names<-rownames(values)
# deleting row names
row.names(values)<-NULL
# reordering columns
values<-values[c(2,1)]
# renaming columns
names(values)[1]<-"url"
names(values)[2]<-"pr"
View(values)

Internal Page Rank calculation

Let's make it more readable: we're going to put the numbers on a ten-point scale, just like when PageRank was a thing.

# replacing ids with urls
values$url<-NetwIndex
# out of 10
values$pr<-round(values$pr/max(values$pr)*10)
# display
View(values)

On a 15-page website it's not very impressive, but I encourage you to try it on a bigger website.
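As a quick follow-up (not in the original tutorial), you can sort the result to see the strongest pages first:

# order pages by internal PageRank, highest first
values<-values[order(-values$pr),]
head(values, 10)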

What if a website is using a JavaScript framework like React or Angular?

RCrawler handily includes PhantomJS, the classic headless browser. Here is how to use it:

# download and install the phantomjs headless browser
# takes 20-30 seconds usually
install_browser()
# start browser process
br<-run_browser()

After that, reference it as an option

Rcrawler(Website="https://www.example.com/", Browser=br)
# don't forget to stop the browser afterwards
stop_browser(br)

It’s fairly possible to run 2 crawls, one with and one without, and compare the data afterwards

This Browser option can also be used with the other Rcrawler functions.
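For example, I believe LinkExtractor accepts the same Browser argument (double-check the docs for your version):

# extract links from a single JavaScript-rendered page
# assumes a running browser from run_browser() and that LinkExtractor supports Browser
page<-LinkExtractor(url="https://www.example.com/", Browser=br)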

⚠️ Rendering webpages means every JavaScript file will be run, including web analytics tags. If you don't take the necessary precautions, it'll skew your stats.

So what’s the catch?

Rcrawler is a great tool, but it's far from perfect. SEOs will definitely miss a couple of things: there is no internal dead links report, it doesn't grab nofollow attributes on links, and there are always a couple of bugs here and there. But overall, it's a great tool to have.