Introduction

Well, after talking about the analysis of Comixology in the previous post in this series, we’ll now talk about the web scraping itself, with Python: the code. As we have already discussed, we will use requests to fetch the web pages and lxml with XPath to parse them and extract the information we want. Let’s go.

Starting to write the code

To start the conversation, we will import the necessary libraries: requests and, from lxml, the html module. To access parts of the HTML, we will use XPath. So, let’s talk about XPath before we move on to the code. I’ll try to keep it short.

Xpath: What it is and how to use it

XPath is a query language used to extract content from XML / HTML documents. It can be used to extract information from nodes and attributes, which makes it very useful in web scraping tasks. XPath uses expressions that match elements in the HTML or XML code.

XML / HTML documents are treated as node trees. Nodes can have relationships with one another: parent, children, siblings, ancestors and descendants. Consider the following XML code:
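
Something like this (a minimal reconstruction based on the elements discussed below; the values are just placeholders):

<store>
  <product>
    <class>T-shirt</class>
    <color>Blue</color>
    <price>19.90</price>
  </product>
</store>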

Parent – An element that contains other elements. In our example, product is the parent of class, color and price.

Children – Analogously, children are elements contained in another element. The class, color and price elements are children of the product element.

Siblings – Elements that share the same parent. Class, color and price are siblings.

Ancestors – All elements that contain a certain element: the parent of an element, the parent of the parent, and so on. In our example, product and store are ancestors of color (and of class and price).

Descendants – Elements contained inside another element, regardless of level: children, children of the children, and so on. Class, color and price are descendants of the store element.

To select nodes / elements, we use some symbols to construct expressions that reach the desired information in the XML / HTML code. Each symbol has a function, as shown below:

name – Selects any element with this name.

/ – Selects from the root element.

// – Selects elements that match the selection criteria, no matter where they are in the document.

. – Selects the current element.

.. – Selects the parent of the current element.

@ – Selects an attribute.

In this way, we can build an expression that selects the elements containing the information we want. Let’s look at an example with HTML code:
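
Something along these lines (a minimal sketch; the URL and the XPath expressions here are placeholders, not the ones from the original example):

import requests
from lxml import html

# Make the request and parse the returned HTML
page = requests.get('https://example.com')
tree = html.fromstring(page.content)

# Extract the text of every <h1> element, wherever it is in the document
print(tree.xpath('//h1/text()'))

# Searching for something that does not exist returns an empty list
print(tree.xpath('//h1[@class="does-not-exist"]/text()'))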

In short, first we make the request, passing the address to be scraped to the requests.get function. Then, we use the fromstring() function from lxml’s html module to parse the HTML source code of the page. Finally, we use the xpath method to extract the information, using expressions written with the syntax we just saw.

As you can note, the returned value is always a list, and we just need to treat it like a normal Python list. Also note that, when we search for something that does not exist in the document, an empty list is returned.

And now that we have seen this simple example, let’s move on to something more complex and fun.

Returning to the code

Now that we know what XPath is about, let’s use it to scrape the Comixology website itself. For web scraping with XPath, one of our best friends will be the “Inspect Element” feature of Chrome or Firefox. It lets us see the source code and the HTML structure and path of a certain element on the page.

Let’s move on to the Publishers section on Comixology ( click here ). We will start our code by defining the function and creating an empty list which will hold the Publisher links that we will extract. As we discussed in the first post, we will define the number of pages manually (4 pages, so we will have a for page in range(1,5)), and we will go through each of them. If we go to the site and move to the next page, we will see that the link becomes: https://www.comixology.com/browse-publisher?publisherList_pg=2. If we change back to the first page, the final part of the link turns back to 1. The first lines of the code will be as follows:
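
Something along these lines (a minimal sketch; get_publishers_links is my own placeholder name):

import requests
from lxml import html

def get_publishers_links():
    # List that will hold the links of every Publisher
    publishers_links = []
    # The Publishers section has 4 pages, defined manually as discussed
    for page_number in range(1, 5):
        url = 'https://www.comixology.com/browse-publisher?publisherList_pg=' + str(page_number)
        page = requests.get(url)
        tree = html.fromstring(page.content)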

Simple stuff so far. We need to find the quantity of Publishers on each page. For this, let’s inspect an element that constitutes a Publisher; it could be the image or the link. Also, we must not forget that we will not consider the Featured Publishers table. When we inspect an element of the All Publishers table and compare it with an element of the Featured Publishers table, we note that an intermediate div has a class that differentiates them. While items from the Featured Publishers table are inside a div with the class “list FeaturedPublisherList”, items from the All Publishers table are inside a div with the class “list publisherList”. Let’s build our XPath expression starting from there:

Now, I’ll create an XPath string for the extraction of the elements, starting from the div with the class we just saw (//div[@class=”list publisherList”]) and going through the HTML elements until we get to the < a >, where we will extract the “href” attribute, which is the link itself. I’ll split the creation of the string across 2 lines, so that we won’t have a very long line. An important detail that I haven’t mentioned until now is that the extracted links come with a ref attribute. We will create a function that removes this attribute and returns a list of “clean” links, and we will pass our list of extracted links to it. We will then use the extend function to put this list inside the empty list that we created at the start of our scraping function. When we do this for every page, we will have every Publisher link. The function that removes the attribute will use Regex (regular expressions; we need to import the re module to work with Regex in Python), which I’ll not explain in this post, but basically, we will replace everything from the “?” onward, including the “?”, with nothing. Let’s see how everything looks:
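
A sketch of how it all could look; the middle part of the XPath is an assumption, modeled on the Series extraction that appears later in the post:

import re
import requests
from lxml import html

# Remove the ref attribute: replace everything from the "?" onward with nothing
def remove_attributes_from_link(links):
    return [re.sub(r'\?.*', '', link) for link in links]

def get_publishers_links():
    publishers_links = []
    for page_number in range(1, 5):
        page = requests.get('https://www.comixology.com/browse-publisher'
                            '?publisherList_pg=' + str(page_number))
        tree = html.fromstring(page.content)
        # Xpath string built in 2 lines, from the "list publisherList" div
        # down to the href attribute of the links
        xpath_publishers = '//div[@class="list publisherList"]/ul/li[@class="content-item"]'
        xpath_publishers += '/figure/div[@class="content-cover"]/a/@href'
        extracted_links = tree.xpath(xpath_publishers)
        # Clean the links and add them to the main list
        publishers_links.extend(remove_attributes_from_link(extracted_links))
    return publishers_links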

The comments will help you understand each part of the code. If everything goes right, this function will extract all the Publisher links, and you will have them in the list returned by the function. First part concluded.

I’ll take this moment to talk about a library that will help us in this task: the pickle package ( to know more about it, click here ). This library allows us to export data to files and load it back later. At first, this will not seem very useful, since this extraction function runs very fast, because there are only 4 pages to extract. But for the next steps, when we have lots of pages and links to visit, this package will be extremely important. Let’s export our list of links to a file. We will use the dump function from the pickle package, passing to it the object to be exported:
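
Using the same file name that the load call below expects:

import pickle

pickle.dump(publishers_links, open("publishers_links.p", "wb"))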

To load back the exported object, we use the load() function from pickle, passing to it the desired file:

publishers_links = pickle.load(open("publishers_links.p","rb"))

From Publishers to Series

Now that we already have the Publisher links, our next step is similar. We need to define a function that receives the list of Publisher links, goes through each of them, extracts the Comic Series links and exports them to a file that we can load later on. Let’s start:
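
The opening of the function, which appears in full a bit further below, looks like this:

# Receive list of links for each Publisher and return list of links for the
# Comics Series
def get_series_links_from_publisher(publisher_links):
    series_links = []
    for link in publisher_links:
        page = requests.get(link)
        tree = html.fromstring(page.content)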

Up to this point, all simple. After that, we have to discover the number of pages. When we explore some Publisher links, we note that there are three possible cases: A – when the Publisher has only one page, B – when there is a small number of pages (roughly 2 to 10), C – when a Publisher has a great number of pages. See the images:

A – Publisher with 1 page

B – Publisher with few pages

C – Publisher with many pages

In this scenario, it is a little harder to get the number of pages directly, but there is an easy way. When there is more than one page, the total number of Comic Series is shown in the lower right corner of the Comic Series table. Since a page always has 36 Series (the exceptions being the last page, or a Publisher with only one page), the number of pages is an easy calculation. If the number of Series is divisible by 36, the number of pages is equal to the quantity of Series divided by 36, using integer division (in Python 3, use “//”). If the number of Series is not divisible by 36, all we have to do is add 1 to the result of that division. If the number of Series is not shown at all, the Publisher has only one page of Series. Simple, isn’t it? Well, let’s inspect some elements to see how we are going to build the XPath expression, and then we can go to the code:

Path to quantity of Series


# Xpath string for extraction of total quantity of Series
xpath_series = '//div[@class="list seriesList"]/div[@class="pager"]'
xpath_series += '/div[@class="pager-text"]/text()'
total_series = tree.xpath(xpath_series)

# If the extraction returned the total quantity
if total_series:
    # The only item in the list is a string with the quantity, which
    # we will split to create a list with each word of it
    total_series = total_series[0].split()
    # The quantity of Series will be the last item of that list
    total_series = int(total_series[len(total_series)-1])
    # Divide the quantity of Series by 36, in order to discover the
    # number of pages of Series in this Publisher
    if total_series % 36 == 0:
        number_of_pages = (total_series // 36)
    else:
        number_of_pages = (total_series // 36) + 1
# If the extraction returns an empty list, there is only one page of Series
else:
    number_of_pages = 1

The split function will divide the string we extracted into a list, where the last item will be the total number of Series for this Publisher. As we already discussed, if the XPath does not find anything, the Publisher has only one page of Series.

Now that we have the number of pages, we can go through them, extracting the links of each Series and moving on to the next page, until there are no more pages. The final function will look like this:


# Receive list of links for each Publisher and return list of links for the
# Comics Series
def get_series_links_from_publisher(publisher_links):
    series_links = []
    for link in publisher_links:
        page = requests.get(link)
        tree = html.fromstring(page.content)

        # Xpath string for extraction of total quantity of Series
        xpath_series = '//div[@class="list seriesList"]/div[@class="pager"]'
        xpath_series += '/div[@class="pager-text"]/text()'
        total_series = tree.xpath(xpath_series)

        # If the extraction returned the total quantity
        if total_series:
            # The only item in the list is a string with the quantity, which
            # we will split to create a list with each word of it
            total_series = total_series[0].split()
            # The quantity of Series will be the last item of that list
            total_series = int(total_series[len(total_series)-1])
            # Divide the quantity of Series by 36, in order to discover the
            # number of pages of Series in this Publisher
            if total_series % 36 == 0:
                number_of_pages = (total_series // 36)
            else:
                number_of_pages = (total_series // 36) + 1
        # If the extraction returns an empty list, there is only one page of Series
        else:
            number_of_pages = 1

        for page_number in range(1, number_of_pages+1):
            page = requests.get(link+'?seriesList_pg='+str(page_number))
            tree = html.fromstring(page.content)

            # Xpath for extraction of the Series links in this page
            xpath_series_links = '//div[@class="list seriesList"]/ul/'
            xpath_series_links += 'li[@class="content-item"]/figure/'
            xpath_series_links += 'div[@class="content-cover"]/a/@href'
            extracted_series_links = tree.xpath(xpath_series_links)

            clean_series_links = remove_attributes_from_link(extracted_series_links)
            series_links.extend(clean_series_links)

    return(series_links)

Again, we will use pickle to export our list of links to a file. This step should take longer than the first one to complete, but it should not take more than a few minutes.

PS: Exporting will be essential in the next steps of our scraping, since unexpected errors may cause lots of work to be wasted; so, it is useful to export our data periodically in order to avoid that.

pickle.dump(series_links, open("series_links.p","wb"))

From Series to Comics

Now, we have to take one more step: go through the Series links and extract the links for the comics themselves. Prepare yourselves, because the execution of this step will take some time, due to the long list of links to visit. However, the idea is basically the same: go through each page extracting the links we want. As we saw in our previous post, the analysis of the website, we will have to extract links for different types / categories of comics, like Collected Editions, Issues, etc. Let’s consider that the Series links are in the variable “series_links”, as in the previous part of the post, and start our code:
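
Something along these lines (a sketch; the function name and the use of enumerate to keep the counter are my assumptions, based on the export code shown further below):

def get_comics_links_from_series(series_links):
    # List that will hold the comic links of every type / category
    comics_links = []
    # The counter lets us export partial results and resume after errors
    for counter, link in enumerate(series_links):
        page = requests.get(link)
        tree = html.fromstring(page.content)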

For the next step, we will create another function, since the blocks are very similar, with only a few changes to the link fragment and to the XPath of the div that contains a certain type of comic. But first, to understand it well, we will write the code for one specific block. Then we will see which parts of the code would have to be repeated and how we can structure our function.

Let’s start with the first block, scraping the Collected Editions links. The first thing we’ll do is check whether this block actually exists for the current Series. Not all Series have all types of comics; in fact, most of them only have the Issues block. Thus, if the block does not exist, we do not need to run this extraction code. We will inspect the element to understand the structure that contains the links we want, and then, via XPath, we will check whether this structure exists:

Comixology – Series – Collected Editions div


# Div where the Collected Editions are
collected_div = tree.xpath('//div[@class="list CollectedEditions"]')
if collected_div:

As we have seen in the image, the structure we seek is a div with the class attribute “list CollectedEditions”. Knowing that, we check via XPath whether this div exists and, if it does, continue executing our code.

Since each type of comic may have one or more pages, we follow the same logic that we used for the Series. We check whether the total number of comics of that type is shown. If it is, we divide that amount by 18, which is the number of comics per page for a given type (adding 1 when the division is not exact). If the amount is not on the page, it means that this type has only one page.
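
A sketch of that page-count logic for the Collected Editions block, modeled on the Series code above (the pager XPath is an assumption):

# Total quantity of Collected Editions, shown in the pager of this block
xpath_collected = '//div[@class="list CollectedEditions"]/div[@class="pager"]'
xpath_collected += '/div[@class="pager-text"]/text()'
total_collected = tree.xpath(xpath_collected)
if total_collected:
    # The last word of the pager text is the total quantity
    total_collected = int(total_collected[0].split()[-1])
    # 18 comics per page of a given type
    if total_collected % 18 == 0:
        number_of_pages = total_collected // 18
    else:
        number_of_pages = (total_collected // 18) + 1
else:
    number_of_pages = 1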

Now we get to the core of the block: the scraping itself. We iterate through each page (if there is more than one) or access the single page for that type of comic. We extract the links of this type, referencing the correct div, pass these links through our function that removes attributes from links, and add them to the list we created earlier to store the links:
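
A sketch of that loop for the Collected Editions block (again, the middle of the XPath is an assumption modeled on the Series extraction):

for page_number in range(1, number_of_pages + 1):
    page = requests.get(link + '?CollectedEditions_pg=' + str(page_number))
    tree = html.fromstring(page.content)
    # Xpath from the Collected Editions div down to the links
    xpath_collected_links = '//div[@class="list CollectedEditions"]/ul/'
    xpath_collected_links += 'li[@class="content-item"]/figure/'
    xpath_collected_links += 'div[@class="content-cover"]/a/@href'
    extracted_links = tree.xpath(xpath_collected_links)
    # Clean the links and add them to the list created at the start
    comics_links.extend(remove_attributes_from_link(extracted_links))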

Evaluating the code, you can note that only two things change: the piece that makes up the link for each page (in the case above, ‘?CollectedEditions_pg=’) and the XPath paths. For the XPath, additionally, inspecting elements on the page shows that only the div class changes, and all the other paths start from it, remaining the same from one type of comic to another. Therefore, our function will need two pieces of information: the XPath to the div and the part of the link to be accessed, corresponding to this type of comic. In addition, we will also pass it the tree object and the current link, so that it does not need to make any request again unnecessarily. Our function, then, looks like this:
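
A sketch of that generic function (the function and parameter names are my own; it receives the tree, the current Series link, the XPath to the type’s div and the link fragment for its pages):

def get_comics_links_from_type(tree, link, type_div_xpath, page_link_part):
    type_links = []
    # Quantity of comics of this type, taken from the pager of its div
    total = tree.xpath(type_div_xpath + '/div[@class="pager"]'
                       '/div[@class="pager-text"]/text()')
    if total:
        total = int(total[0].split()[-1])
        # 18 comics per page of a given type
        if total % 18 == 0:
            number_of_pages = total // 18
        else:
            number_of_pages = (total // 18) + 1
    else:
        number_of_pages = 1
    for page_number in range(1, number_of_pages + 1):
        page = requests.get(link + page_link_part + str(page_number))
        page_tree = html.fromstring(page.content)
        # Everything below the type's div is the same for every type of comic
        xpath_links = type_div_xpath + '/ul/li[@class="content-item"]/figure/'
        xpath_links += 'div[@class="content-cover"]/a/@href'
        type_links.extend(remove_attributes_from_link(page_tree.xpath(xpath_links)))
    return type_links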

This function returns a list of all the links of that particular type of comic. Now all we need to do is call this function for each type, with the appropriate piece of the link and the path to the corresponding div, which we discover by inspecting each type of comic.
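
For example, a hypothetical call for the Collected Editions block could look like this; the other types follow the same pattern with their own div class and link fragment:

comics_links.extend(get_comics_links_from_type(
    tree, link, '//div[@class="list CollectedEditions"]', '?CollectedEditions_pg='))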

Here, we use pickle.dump to export the links we have already extracted to files each time the counter hits a multiple of 100. We do this because, with the amount of links we have to visit, it is quite possible that a connection problem occurs, the site goes offline for a while, or something of the sort. Any of these errors could make us lose all the information generated by the code. So, we export the information periodically. The code also exports the counter value at which the last export occurred. This way, when we call the function, our code can check whether there is already exported data and, through the counter value, pick up where we left off.

Since the exporting of the links and of the counter repeats itself, we can also turn this part of the code into one more function.

# Export comics links each time the counter hits a multiple of 100 or when it
# reaches the last one, to avoid loss of information in possible errors
if (counter % 100 == 0 and counter != 0) or (counter == len(series_links) - 1):
    comics_links = comics_links_dump(comics_links, counter, comics_links_counter)
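
A possible sketch of that helper; the file names, the role of comics_links_counter and the empty list returned are all assumptions based on the description above:

def comics_links_dump(comics_links, counter, comics_links_counter):
    # Save the links gathered since the last export, in a file named after the
    # current counter value
    pickle.dump(comics_links, open("comics_links/comics_links_" + str(counter) + ".p", "wb"))
    # Save the counter where this export happened, so a later run can resume
    pickle.dump(counter, open("comics_links/comics_links_counter.p", "wb"))
    # comics_links_counter (the value of the previous export) is kept in the
    # signature only to match the call above; its exact role is an assumption
    # Return a fresh list so each file holds only the links since the previous dump
    return []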

And finally, on to the last step! The extraction of information from each comic link.

The last step: extraction of information from comics

The last step is relatively simple. We have to go into each comic link and extract all the information we can from it. One of the things I learned while doing the scraping is that comics can be removed from the site. In that case, the request returns a 404 error page, and none of our scraping code will work. So we will first extract the title of the HTML document to see whether we are on an error page. We then set the base path from which we retrieve all the information. Our information will all go into a dictionary, which will later be appended to a list. Each key of the dictionary will be one piece of information about the comic. With this setup, Pandas lets you easily create a DataFrame from a list of dictionaries. From here, our task basically boils down to inspecting elements through the browser and setting the corresponding XPath in the code. Let’s go:
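
A minimal sketch of how this could start (the function name, the text checked in the title and the dictionary keys are my assumptions):

def get_comic_info(comic_link):
    page = requests.get(comic_link)
    tree = html.fromstring(page.content)
    # The <title> of the document tells us whether we landed on an error page
    page_title = tree.xpath('//title/text()')
    if page_title and '404' in page_title[0]:
        return None
    # All the information about this comic goes into a dictionary
    comic_info = {}
    comic_info['Link'] = comic_link
    return comic_info

Later on, a list of these dictionaries can be turned into a DataFrame with something like pandas.DataFrame(comics_info_list), where comics_info_list is the list that gathers them.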

And so, we extract the first information about the comic. In the next steps, we will pick up some names that are hidden in the comic credits, in the sidebar on the right of the page, and we will use Regex to remove some HTML escape sequences that come along in the extraction. Here, the information available varies from comic to comic. So our code will take the name of each credit entry (Written by, Art by, etc. – this is the XPath defined in the credits_tasks variable) and its value (the people themselves – the XPath defined in the credits_names variable). Let’s continue:

If you look at the source code, in the part where the credits are located, all the lists have an item named “HIDE …”, which is hidden. So we make our for block go up to this item and, when we reach it, we add the names to the list at that point, which is the index where the “HIDE …” item is. In the end, we just add everything to the dictionary.
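
In generic terms, the idea is something like this (credit_names here stands for the list of strings extracted from one credits list; it is a hypothetical name):

# Collect names until the hidden "HIDE..." item, which marks the end of the list
visible_names = []
for name in credit_names:
    if name.startswith('HIDE'):
        break
    visible_names.append(name)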

Now, let’s extract other comic information, such as page count, Publisher, among others. This information can be found below the credits, and we extract it all at once. The name of each piece of information is within an h4 element with class “subtitle”, and the values are within a div element with class “aboutText”. Let’s build the full paths and collect this information:
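
A sketch using the two classes just mentioned (pairing names and values with zip is my assumption about how the two lists line up):

# Names of the fields (Page Count, Publisher, ...) and their values
about_titles = tree.xpath('//h4[@class="subtitle"]/text()')
about_values = tree.xpath('//div[@class="aboutText"]/text()')
for title, value in zip(about_titles, about_values):
    comic_info[title] = value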

Now, to get the price, we have to consider three situations. The first, most common, is comics with a fixed price and no discount. The second is discounted comics. And the third, free comics. The way I found to organize this was to use three fields: an original price, a final price and a Discount field (a boolean that indicates whether the comic is discounted or not). When there is no discount, the original price and the final price are equal. We also handle in our code some comics that are exclusive to bundles and therefore have no individual price.
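
A sketch of that logic; the XPaths for the price elements are placeholders (the real ones come from inspecting the page), and the field names are my own:

original_price = tree.xpath('//PLACEHOLDER/PATH/TO/ORIGINAL/PRICE/text()')
final_price = tree.xpath('//PLACEHOLDER/PATH/TO/FINAL/PRICE/text()')
if not final_price:
    # Comics that are exclusive to bundles have no individual price
    comic_info['Original_price'] = None
    comic_info['Final_price'] = None
    comic_info['Discounted'] = False
elif original_price:
    # A visible original price next to the final one means a discount
    comic_info['Original_price'] = original_price[0]
    comic_info['Final_price'] = final_price[0]
    comic_info['Discounted'] = True
else:
    # No discount: original and final prices are the same
    comic_info['Original_price'] = final_price[0]
    comic_info['Final_price'] = final_price[0]
    comic_info['Discounted'] = False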

Finally, we extract the average rating received by the comic and the number of ratings it has. At first it seemed that I would have to count the number of classes that made up a rating, but with a little inspection of the source code, I found that there was a hidden element with the rating value. Then, it became easy. The quantity of ratings is also simple, as there is a clear element that contains this number. Let’s see:
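
A sketch with placeholder XPaths for those two elements:

rating = tree.xpath('//PLACEHOLDER/PATH/TO/HIDDEN/RATING/text()')
rating_quantity = tree.xpath('//PLACEHOLDER/PATH/TO/RATING/COUNT/text()')
# Store the values when they exist; comics without ratings keep None
comic_info['Rating'] = rating[0] if rating else None
comic_info['Rating_quantity'] = rating_quantity[0] if rating_quantity else None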

And so, we complete the information-gathering functions. Now, let’s set up a few things so that they are all tied together.

Final Touches

We will need three more functions. The first will simply join the comic links, which are scattered across files, into a single list. The second receives the list of comic links we already extracted, goes through each link extracting information, and then exports this information to files. And the last will gather the information from these files into a new variable, so we can finally use it for analysis. Let’s go to the first function:
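
A sketch of that first function (the folder name and file layout follow the assumptions from the export sketch above):

import os

def join_comics_links(folder="comics_links"):
    comics_links = []
    for file_name in sorted(os.listdir(folder)):
        # Skip the counter file; every other .p file holds a chunk of links
        if file_name.endswith(".p") and "counter" not in file_name:
            comics_links.extend(pickle.load(open(os.path.join(folder, file_name), "rb")))
    return comics_links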

Finally, we need to string together everything we did. For this, we basically need to do one thing: use Python to check whether each exported file containing our links exists. If the files exist and are complete, we load them into variables. If not, we run the corresponding function to extract the information. For data that is split across several files, we also check whether the folder that contains the files exists and, if it does not, we create it. We use the “os.path.isdir()” function to check for the folder and “os.path.isfile()” to check for files. To check whether a scraping step that is split across multiple files is complete, we load the file that holds the counter and check whether it is equal to the number of items in the corresponding list. And with that we close our scraping. Let’s go to the code:
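
A sketch of how that orchestration could look (folder and file names follow the earlier assumptions, and the completeness check is simplified):

import os

# Publishers: load the exported file if it exists, otherwise scrape and export
if os.path.isfile("publishers_links.p"):
    publishers_links = pickle.load(open("publishers_links.p", "rb"))
else:
    publishers_links = get_publishers_links()
    pickle.dump(publishers_links, open("publishers_links.p", "wb"))

# Series: same idea
if os.path.isfile("series_links.p"):
    series_links = pickle.load(open("series_links.p", "rb"))
else:
    series_links = get_series_links_from_publisher(publishers_links)
    pickle.dump(series_links, open("series_links.p", "wb"))

# Comic links are split across several files inside a folder
if not os.path.isdir("comics_links"):
    os.mkdir("comics_links")
counter_file = "comics_links/comics_links_counter.p"
if os.path.isfile(counter_file):
    last_counter = pickle.load(open(counter_file, "rb"))
else:
    last_counter = -1
# The step is complete only if the last export happened at the final Series
if last_counter != len(series_links) - 1:
    get_comics_links_from_series(series_links)
comics_links = join_comics_links("comics_links")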

And just like that, we finish our scraping. In the next post, I will analyze the data in order to understand the world of digital comics a little better and reach some interesting conclusions.

PS: Suggestions and corrections to improve the code are very welcome. If you have one, feel free to send it in the comments.