Scraping Online Communities for your Outreach Campaigns

Online communities offer a wealth of intelligence for blog owners and business owners alike.

Exploring the data within popular communities will help you understand who the major influencers are, what content is popular, and who the key content aggregators are within your niche.

This is all well and good to talk about, but is it feasible to manually sort through online communities to find all of this information out? Probably not.

This is where data scraping comes in.

What is Scraping and What Can it do?

I'm not going to go into great detail on what data scraping actually means, but to simplify this, here's a definition from the Wikipedia page:

"Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program."

Let me explain this with a little example...

Imagine a huge community full of individuals within your industry. Each person within the community has a personal profile page that contains information about their interests, contact details, social profiles, etc.

If you were tasked with gathering all of this data on all of the individuals then you might start hyperventilating at the thought of all the copying and pasting you'd need to do.

Well, the alternative is to scrape all of this content, automating the whole process and exporting the information into a manageable, more consumable format in a matter of seconds. Pretty awesome, right?

Luckily for you, I'm going to show you how to do just that!

The Example of Inbound.org

Recently, I wanted to gather a list of digital marketers that were fairly active on social media and shared a lot of content online within communities. These people were going to be some of the core targets I'd put my blog content in front of.

To do this, I first found some active communities online where these types of individuals hang out. Being a digital marketer myself, this process was fairly easy and I chose Inbound.org as my starting place.

Scoping out Data Requirements

Each community is different and you'll be able to gather varying information within each.

The place to look for this information is within the individual user profile pages. This is usually where the contact information or links to social media accounts are likely to be displayed.

For this particular exercise, I wanted to gather the following information:

Full name

Job title

Company name and URL

Location

Personal website URL

Twitter URL, handle and follower/following stats

Google+ URL, follower count and list of contributor URLs

Profile image URL

Facebook URL

LinkedIn URL

With all of this information I'll be able to get a huge amount of intelligence about the community members. I'll also have a list of social media accounts to add and engage with.

On top of this, with all the information on their websites and sites that they write for, I'll have a wealth of potential link building prospects to work on.

You'll see in the above screenshot that a few of the pieces of data are available to see on the Inbound.org user profiles. We'll need to get the other bits of information from the likes of Twitter and Google+, but this will all stem from the scraping of Inbound.org.

Scraping the Data

The idea behind this is that we can set up a template based on one of the user profiles and then automate the data gathering across the rest of the profiles on the site.

Once you've installed the SEO Tools plugin for Excel, you're good to go on the actual scraping side of things...

Quick Note: Don't worry if you don't have a good knowledge of coding - you don't need it. All you'll need is a very basic understanding of reading some code and some basic Excel skills.

To begin with, you'll need to do a little Excel admin. Simply add in some column titles based around the data that you're gathering. For example, with my example of Inbound.org, I had, 'Name', 'Position', 'Company', 'Company URL', etc. which you can see in the screenshot below. You'll also want to add in a sample profile URL to work on building the template around.

Now it's time to start getting hands on with XPath.

How to Use XPathOnURL()

This handy little formula is made possible within Excel by the SEO Tools plugin. Now, I'm going to keep this very basic because there are loads of XPath tutorials available online that can go into the very advanced queries that are possible to use.

For this, I'm simply going to show you how to get the data we want and you can have a play around yourself afterwards (you can download the full template at the end of this post).

Here's an example of an XPath query that gathers the name of the person within the profile that we're scraping:

=XPathOnUrl(A2, "//*[@id='user-profile']/h2")

A2 is simply referencing the cell that contains the URL that we're scraping. You'll see in the screenshot above that this is Jason Acidre's profile page.

The next part of the formula is the XPath.

What this essentially says is: scrape through the HTML to find the tag that has the id 'user-profile' attached to it. This could be a div, a span, an a tag or whatever.

Once it's found this tag, it then needs to look at the first h2 tag within this area and grab the text within it. This is Jason's name, which you'll see in the screenshot below of the code:

Don't be put off at this stage - you don't need to go manually trawling through code to build these queries; there's a much simpler way.

The easiest way to do this is by right-clicking on the element you want to scrape on the webpage (within Chrome); for example, on Inbound.org, this would be the profile name. Now click 'Inspect element'.

The developer tools window should now appear at the bottom of your browser (or in a separate window). Within that, you should see the element that you've drilled down on.

All you need to do now is right-click on it and press 'Copy XPath'.

This will now copy the XPath code for your Excel formula to the clipboard. You'll just need to add in the first part of the query, i.e. =XPathOnUrl(A2,

You can then paste in the copied XPath after this and add a closing bracket.

Note: When you use 'Copy XPath', Chrome will wrap some parts of the code in double quotes (") which you'll need to change to single quotes ('). You'll also need to wrap the whole copied XPath in double quotes.
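If you'd rather not do the quote-swapping by hand every time, the idea is easy to sketch in a few lines of Python. This helper is purely illustrative - the function name and approach are my own, not part of the SEO Tools plugin:

```python
# Illustrative helper: take the XPath that Chrome copies (which uses
# double quotes) and turn it into an =XPathOnUrl(...) formula string
# that's ready to paste into Excel.
def to_excel_formula(cell, xpath):
    # Swap the double quotes inside the copied XPath for single quotes...
    xpath = xpath.replace('"', "'")
    # ...then wrap the whole XPath in double quotes for Excel.
    return '=XPathOnUrl({0}, "{1}")'.format(cell, xpath)

print(to_excel_formula("A2", '//*[@id="user-profile"]/h2'))
# =XPathOnUrl(A2, "//*[@id='user-profile']/h2")
```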

Your finished code will look like this:

=XPathOnUrl(A2, "//*[@id='user-profile']/h2")

You can then apply this formula against any Inbound.org profile and it will automatically grab the user's full name. Pretty good, right?
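For anyone curious about what's happening under the hood, here's a rough Python equivalent of that formula, using the standard library's ElementTree and its limited XPath support. The HTML snippet is a simplified stand-in for an Inbound.org profile page - the real markup may well differ:

```python
# Run the same XPath query against a simplified mock-up of a profile page.
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<html>
  <body>
    <div id="user-profile">
      <h2>Jason Acidre</h2>
      <p>Co-Founder / CEO</p>
    </div>
  </body>
</html>
""")

# Find the element with the id 'user-profile', then grab the text of
# the first h2 inside it - exactly what the Excel formula does.
name = page.find(".//*[@id='user-profile']/h2").text.strip()
print(name)  # Jason Acidre
```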

Check out the full video tutorial below that I've put together that talks you through this whole process:

XPath Examples for Grabbing Other Data

As you're probably starting to see, this technique could be scaled across any website online. This makes big data much more attainable and gives you the kind of results that an expensive paid tool would offer without any of the cost - bonus!

Here are a few more examples of XPath that you can use in conjunction with the SEO Tools plugin within Excel to get some handy information.

Twitter Follower Count

If you want to grab the number of followers for a Twitter user then you can use the following formula. Simply replace A2 with the Twitter profile URL of the user you want data on. Just a quick word of warning with this one: it looks really long and complicated, but really I've just used another Excel formula to snip off the text 'followers' from the end.
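The exact XPath depends on Twitter's markup at the time (which changes often), but the clean-up step I mentioned is easy to sketch. Assuming the scraped value comes back as something like '11,201 followers', this illustrative Python snippet (not the Excel formula itself) shows the snipping:

```python
# Illustrative clean-up: strip the trailing word 'followers' and the
# thousands separators from a scraped value to leave a plain number.
def parse_follower_count(raw):
    cleaned = raw.strip()
    if cleaned.lower().endswith("followers"):
        cleaned = cleaned[:-len("followers")].strip()
    return int(cleaned.replace(",", ""))

print(parse_follower_count("11,201 followers"))  # 11201
```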

Google+ Follower Count

Like with the Twitter follower formula, you'll need to replace A2 with the full Google+ profile URL of the user you want this data for.

=XPathOnUrl(A2,"//span[@class='BOfSxb']")

List of 'Contributor to' URLs

I don't think I need to tell you the value of pulling in a list of websites that someone contributes content to. If you do want to know then check out this post that I wrote.

This formula is a little more complex than the rest. This is because I'm pulling in a list of URLs as opposed to just one entity. This requires me to use the StringJoin function to separate all of the outputs with a comma (or whatever character you'd like).

Also, you may notice that there is an additional section to the XPath query, "href". This pulls in the link within the specific code block instead of the text.
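To make the idea concrete, here's a small illustrative Python sketch of the same pattern: grab the href attribute of each link in a block and join the results with commas, much like the StringJoin wrapper does in Excel. The markup is a made-up stand-in for a 'contributor to' section:

```python
# Pull the link targets (href) rather than the link text, then join
# them into one comma-separated string.
import xml.etree.ElementTree as ET

section = ET.fromstring("""
<div id="contributor-to">
  <a href="http://example.com/blog-one">Blog One</a>
  <a href="http://example.com/blog-two">Blog Two</a>
</div>
""")

urls = [a.get("href") for a in section.findall(".//a")]
print(", ".join(urls))
# http://example.com/blog-one, http://example.com/blog-two
```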

As you'll see in the full Inbound.org scraper template that I've made, this is how I pull in the Twitter, Google+, Facebook and LinkedIn profile links.

You'll want to replace A2 with the Google+ profile URL of the person you wish to gather data on.

Some Findings from the Data I've Gathered

Every big data set comes with some interesting findings. Here are a few little things that I've found from the top 100 influential users on Inbound.org.

The chart above maps out the average number of followers that the top 100 users have on both Twitter (12,959) and Google+ (9,601). As well as this, it shows the average number of users that they follow on Twitter (1,363).

The next thing that I've looked at is the job titles of the top 100 users. You can see the most common occurrences of terms within the tag cloud below:

Finally, I had a look through all of the domains listed within each of the top 100 Inbound.org users' Google+ 'contributor to' sections and mapped out the most frequently mentioned sites.

Here's the spread of domains that were the most popular to be contributed to:

It Doesn't Stop There

As you've probably gathered, this can be scaled out across pretty much any community/forum/directory/website online.

With this kind of intelligence in your armoury, you'll be able to learn far more about your targets and dramatically increase the effectiveness of your outreach campaigns.

Also, as promised, you can download my full Inbound.org scraper template below:

Fantastic Tutorial Matt!
It’s a pain in the ass having to manually do this kind of content scraping and being able to pull tons of info straight into an excel doc can save you a LOT of time.
Nicely done! ^.^

Glad you liked it, Evan. I’ve been working on some pretty cool techniques using this process to gather intelligence about my outreach targets. I’m sure you’ll find lots of applications once you have a play around.

Great tutorial, Matthew. There’s definitely a wealth of data you can gather for outreach and competitive analysis.

I’m not sure if XPath can do this, but is there a way to scrape, say, the last 100 links shared by a user in their Twitter timeline? I think that’d be useful for outreach, as you can get an idea of what they’re sharing and what interests them.

Problem with data scraping like this is it starts getting a bit… shit at any volume. I think there’s a setting you can change in SEO Tools but I’m not sure what it is – it starts freezing up your PC and running out of memory.

I completely agree with you – this isn’t the most effective solution, but what it does do great is negate the need for any advanced coding knowledge.

Running Python scripts to crawl through web entities on a large scale is much more resource efficient and can be directly plugged into a web interface, etc. The only thing is, you’ll need a Python developer, and a good one, to scale this out.

I use these techniques to gather niche datasets and then hook the data up to some more sophisticated tools to get the most out of it.

Thanks for chipping in – glad you liked the article. One of the ways that I make use of this kind of data is for link prospecting. For example, within the Inbound.org example, I’ve now got a long list of the websites that the top influencers within the Inbound marketing industry write for. That’s a perfect starting point for getting some exposure to my target market.

Alongside this, I’ve got a pool of influencers to start engaging with over social media in order to get my content in front of them and get my content shared widely.

What I don’t use this data for is to just spam people. That defeats the whole point of the intelligence-gathering exercise.

Hey Matt, just wanted to jump in and say man ur doing a great job posting really helpful content. I have to give you credit I have definitely learned a thing or two from your post. Good $#!+ Dude! Keep’em coming.

This is awesome, Matt! Maybe this is a dumb question, but did you manually enter in all the profile URLs or was there something I missed to use xpath to automatically snag the user profile links as well?

There is an issue pulling down the XPath values on Twitter since their new profile update. The reason is that the SEO Tools XPath checker appears to be treating style classes as divs, and subsequently it cannot identify the item selected. To rectify this, I simply amended the first bracketed number from a 1 to a 2 (which basically adds another div).

So this:
=XPathOnUrl(A2,"//*[@id='page-container']/div[1]/div/div[2]/div/div/div[2]/div/div/ul/li[4]/a/span[2]")

Great tutorial, but I am running into problems any time I try to scrape data that's within an HTML table - I keep getting a "#VALUE!" error in Excel. Any suggestions?