I often receive requests asking about email crawling. This topic is clearly interesting for those who want to scrape contact information from the web (direct marketers, for example), and we have already mentioned GSA Email Spider as an off-the-shelf solution for email crawling. In this article I want to demonstrate how easy it is to build a simple email crawler in Python. The crawler is simple, but you can learn many things from this example (especially if you’re new to scraping in Python).

I purposely simplified the code as much as possible to distill the main idea and let you add any extra features yourself later if necessary. Despite its simplicity, though, the code is fully functional and will extract plenty of emails from the web for you. Note also that this code is written in Python 3.

Ok, let’s move from words to deeds. I’ll walk through the code piece by piece, commenting on what’s going on. If you need the whole code, you can get it at the bottom of the post.

Let’s import all the necessary libraries first. In this example I use BeautifulSoup and Requests as third-party libraries and urllib, collections and re as built-in libraries. BeautifulSoup provides a simple way to search an HTML document, and the Requests library allows you to easily perform web requests.

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

The following piece of code defines a list of urls to start the crawling from. As an example I chose “The Moscow Times” website, since it exposes a nice list of emails. You can add any number of urls that you want to start the scraping from. Though this collection could be a list (in Python terms), I chose a deque, since it better fits the way we will use it:

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/'])

Next, we need to store the processed urls somewhere so as not to process them twice. I chose a set type, since we need to keep unique values and be able to search among them:

# a set of urls that we have already crawled
processed_urls = set()

In the emails collection we will keep the collected email addresses:

# a set of crawled emails
emails = set()

Let’s start scraping. We’ll keep going until there are no urls left in the queue. As soon as we take a url out of the queue, we add it to the set of processed urls, so that we do not forget about it in the future:

# process urls one by one until we exhaust the queue
while len(new_urls):
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)

Then we need to extract some base parts of the current url; this is necessary for converting relative links found in the document into absolute ones:

    # extract base url and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url
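To make these two values concrete: assuming, hypothetically, that the crawler is currently on a page like http://www.themoscowtimes.com/contact_us/index.php, the lines above would produce the following:

    # hypothetical example of the resulting values:
    #   url      = 'http://www.themoscowtimes.com/contact_us/index.php'
    #   base_url = 'http://www.themoscowtimes.com'
    #   path     = 'http://www.themoscowtimes.com/contact_us/'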

The following code gets the page content from the web. If it encounters an error it simply goes to the next page:
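That snippet is missing from this excerpt, so below is a minimal sketch (still inside the while loop) of what the fetch, and the email extraction that “processing the page” refers to, might look like. The exact exceptions caught here are an assumption; the regex is the same one the author quotes in a comment reply further down:

    # try to download the page; on any request error just skip it and move on
    try:
        response = requests.get(url)
    except requests.exceptions.RequestException:
        continue

    # extract all email addresses found in the page text and add them to the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)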

After we have processed the current page, let’s find links to other pages and add them to our url queue (this is what the crawling is about). Here I use the BeautifulSoup library for parsing the page’s html:

    # create a beautiful soup for the html document
    soup = BeautifulSoup(response.text)

The find_all method of this library extracts page elements according to the tag name (<a> in our case):

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):

Some of the <a> tags may not contain a link at all, so we need to take that into consideration:

        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''

If the link address starts with a slash, then we treat it as a relative link, and it is necessary to add the base url to the beginning of it:

        # add base url to relative links
        if link.startswith('/'):
            link = base_url + link

Now, if we have gotten a valid link (starting with “http”) and we don’t have it in our url queue, and we haven’t processed it before, then we can add it to the queue for further processing:

        # add the new url to the queue if it's of HTTP protocol, not enqueued and not processed yet
        if link.startswith('http') and link not in new_urls and link not in processed_urls:
            new_urls.append(link)
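The full listing referenced at the top of the post is cut off in this excerpt. Once the queue is exhausted the while loop ends and everything harvested is sitting in the emails set; as suggested in the comments below, a simple (hypothetical) way to use the result is to print it or write it to a file, for example:

# after the loop: the crawl is done, do whatever you like with the harvested addresses
for email in emails:
    print(email)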

32 Comments

“We also work with law enforcement authorities to track down and prosecute spammers. Harvesting email addresses from websites is illegal under several anti-spam laws, and the data resulting from Project Honey Pot is critical for finding those breaking the law.”

For now the best option is Python because it has multiple web scraping libraries available.
As far as speed is concerned, it’s not the language but rather the server (including its configuration) that requests the web pages which plays the main role in fast content extraction.

Does anybody know how to build a script to extract emails from eBay? For example: ebay.co.uk, and to collect only the domains I need, for example @btinternet.com? This would be something like an eBay scraper Linux script. If anyone has any idea, please respond to my comment. Thanks

Hello, can you please explain to me how to use the code? I tried running it with Python and it says:
line 4, in
from urllib.parse import urlsplit
ImportError: No module named parse
Please help me understand this more, I am still very new to coding. Thanks.

Thanks for providing this code–it’s exactly what I was looking for!! I have some very newbish questions for you guys:

(1) Where do the e-mails get saved once it’s done crawling?
(2) It seems to have a never-ending set of URLs. Is there a way I can stop it once it gets off the ‘path’ I want it to be on and still collect the emails?

(1) You might save the emails in a database. If you do not need a fast associative search then you might save them in a file.
(2) You might stop it by checking every received email to see if it matches a criterion.
Does that make it clear?

Sorry, no. I’m completely new to this. When I run the code I see that it is processing sites, but I have absolutely no clue where the results populate (command line, text, csv, etc.). Do I need to add additional code to print it somewhere?

Sure. The results are populated into the emails set here:
# extract all email addresses and add them into the resulting set
new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
emails.update(new_emails)

So you might process the new_emails set as you wish; emails.update(new_emails) only merges it into the emails set.
If you want the whole html text for each processed url, use response.text.

Hi, I want to gather only the email addresses of those who make posts, ask questions etc. in various FORUMS on online gambling related websites. Will the above web crawler do this? Please advise, Don

“mailto” and “tel” or other prefixes contained in anchor tags cause this crawler to loop infinitely: it mistakes them for part of the relative URL path, adds them to the queue, and then keeps combining these compromised URLs with the wrong paths.
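A minimal guard against the problem described in this comment, placed inside the anchor loop right after the link is extracted, could be to skip such schemes explicitly (a suggested fix, not part of the original code):

        # ignore mailto:, tel: and in-page anchors so they never reach the queue
        if link.startswith('mailto:') or link.startswith('tel:') or link.startswith('#'):
            continue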