Introduction

After analyzing the Comixology website to prepare for the scraping (post here) and then scraping it to extract the data (post here), we will now analyze the information we extracted, using Python and Pandas.

Let’s find out which publishers have the best prices and the best average ratings, and then take a detailed look at the two giants: Marvel and DC Comics. Let’s begin.

Initial Preparation

First, as usual, let’s import the packages we need. They are old friends: numpy, pandas, matplotlib and seaborn. Then, we will read the csv file with the read_csv function from Pandas.

Now, let’s create a new column, price per page. This column will help us compare the prices of comics with different page counts: a longer comic should cost more, but how much more?

For some comics, the page count information is not available, and so, for these cases, Pandas will return inf as the value of the column, representing an infinite value. For these comics, we will set the price per page as NaN:
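Here is a minimal sketch of these two steps. The column names (`Price`, `Page_Count`, `Price_per_page`) and the toy rows are assumptions made for illustration, not the real dataset, and a page count of 0 stands in for a missing value:

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the column names are assumptions
comics = pd.DataFrame({
    "Name": ["Comic A", "Comic B", "Comic C"],
    "Price": [1.99, 3.99, 0.99],
    "Page_Count": [20, 40, 0],  # 0 stands in for a missing page count
})

# Dividing by a page count of 0 yields inf instead of raising an error
comics["Price_per_page"] = comics["Price"] / comics["Page_Count"]

# Replace the infinite values with NaN so later averages can skip them
comics["Price_per_page"] = comics["Price_per_page"].replace(
    [np.inf, -np.inf], np.nan
)

print(comics)
```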

Now, let’s use the iterrows() method of the DataFrame to extract the publishing year of the print version of each comic. This method returns an iterator that yields each row of the DataFrame as an (index, row) pair, which we can walk through with a for loop. We’ll use the split() method to turn the string that contains the print release date into a list of values; the third one is the year. In some cases, this will return a value greater than 2016, and since this is impossible, we will set these cases to NaN:
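The loop could look roughly like the sketch below. The column names (`Print_Release_Date`, `Print_Release_Year`) and the "Month Day Year" date format are assumptions; the real data may be formatted differently:

```python
import numpy as np
import pandas as pd

# Toy rows; the column name and the date format are assumptions
comics = pd.DataFrame({
    "Print_Release_Date": [
        "March 14 2012",
        "July 1 1995",
        "January 1 2099",  # impossible year, should become NaN
        None,              # missing date, should become NaN
    ],
})

years = []
for _, row in comics.iterrows():
    date = row["Print_Release_Date"]
    try:
        # The third item of the split string is the year
        year = int(date.split()[2])
    except (AttributeError, IndexError, ValueError):
        # Missing or malformed dates have no usable year
        year = np.nan
    # Years after 2016 cannot be right, so treat them as missing
    if year > 2016:
        year = np.nan
    years.append(year)

comics["Print_Release_Year"] = years
print(comics)
```

For a large DataFrame, a vectorized approach with the `.str` accessor would likely be faster than iterrows(), but the loop keeps each step easy to follow.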

To the analysis (and beyond!)

The first analysis we’ll do is to calculate some average values for the website, such as the average price of comics and the average page count, among others. We’ll use the nanmean() function from numpy, which calculates the mean of a series of values while ignoring NaN entries.
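To show the difference between a plain mean and nanmean(), here is a small sketch with toy values. The column names are assumptions:

```python
import numpy as np
import pandas as pd

# Toy values with some missing entries; column names are assumptions
comics = pd.DataFrame({
    "Price": [1.99, 2.99, np.nan],
    "Page_Count": [20, np.nan, 40],
})

# nanmean() skips the NaN entries when averaging
average_price = np.nanmean(comics["Price"].to_numpy())
average_pages = np.nanmean(comics["Page_Count"].to_numpy())

# A plain numpy mean over the same array is NaN as soon as one value is missing
plain_mean = np.mean(comics["Price"].to_numpy())

print(average_price, average_pages, plain_mean)
```

Note that the Pandas method `Series.mean()` also skips NaN by default, so either approach works on a DataFrame column.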

After that, let’s list the comics with an average rating of 5 stars that have more than 20 ratings, and sort them by price per page. The threshold keeps only the more representative comics: an average of 5 stars based on a single rating is not a very good metric. At the top, we will have some free comics (the first 6). After them come comics that users consider great and that have a very good price.
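A sketch of this filter and sort, using toy rows and assumed column names (`Rating`, `Ratings_Quantity`, `Price_per_page`):

```python
import pandas as pd

# Toy rows; the column names are assumptions
comics = pd.DataFrame({
    "Name": ["Comic A", "Comic B", "Comic C", "Comic D", "Comic E"],
    "Rating": [5, 5, 5, 4, 5],
    "Ratings_Quantity": [150, 30, 1, 500, 45],
    "Price_per_page": [0.10, 0.00, 0.05, 0.02, 0.04],
})

# Keep 5-star comics that have more than 20 ratings
best = comics[(comics["Rating"] == 5) & (comics["Ratings_Quantity"] > 20)]

# Cheapest price per page first, so free comics come out on top
best = best.sort_values(by="Price_per_page")

print(best[["Name", "Ratings_Quantity", "Price_per_page"]])
```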

In the next analysis, we will use only comics with more than 5 ratings, so we first filter the DataFrame. Then, we’ll create a Pandas pivot table to see, for each publisher, the quantity of comics with ratings and their average rating. We will consider as representative only the publishers that have at least 20 comics with ratings, so we filter the pivot table accordingly. Finally, we will sort this table by average rating, from highest to lowest. This means that the publishers at the top of the table will be the ones whose comics have the best average rating.
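The steps above can be sketched as follows. The publisher names, toy ratings and column names are all made up for illustration:

```python
import pandas as pd

# Toy data: made-up publishers and ratings; column names are assumptions
comics = pd.DataFrame({
    "Publisher": ["Alpha Comics"] * 26 + ["Beta Comics"] * 22 + ["Gamma Comics"] * 3,
    "Rating": [4] * 25 + [1] + [5] * 22 + [5] * 3,
    "Ratings_Quantity": [10] * 25 + [2] + [10] * 22 + [10] * 3,
})

# Step 1: keep only comics with more than 5 ratings
rated = comics[comics["Ratings_Quantity"] > 5]

# Step 2: count and average the ratings per publisher
table = pd.pivot_table(
    rated, index="Publisher", values="Rating", aggfunc=["count", "mean"]
)
# Flatten the two-level column header down to "count" and "mean"
table.columns = table.columns.droplevel(1)

# Step 3: keep publishers with at least 20 rated comics
table = table[table["count"] >= 20]

# Step 4: best average rating first
table = table.sort_values(by="mean", ascending=False)

print(table)
```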

One thing that I believed could make a difference in the ratings of a comic was the age classification. Are comics made for adults rated better? Or worse? Let’s check that by making another pivot table:

As we can see, the heights of the bars are quite similar. It seems that the age classification does not have a significant effect on the ratings of a comic. Looking purely at the numbers, comics classified for ages 9+ or for all ages get the best ratings, by a small margin. But we cannot see a strong relationship, since the rating does not vary consistently as the age classification increases or decreases.

Our next step is to see how the release of comics evolved (considering print versions) over the years. Remember that we already created a column with the year of release of the print version of the comic. The next step is basically to count the occurrences of each year in this column. Let’s make a list with the years and then count the releases per year:
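One way to do this counting, sketched with toy years and an assumed column name (`Print_Release_Year`):

```python
import numpy as np
import pandas as pd

# Toy years; NaN marks comics whose year could not be extracted
comics = pd.DataFrame({
    "Print_Release_Year": [2010, 2012, 2012, np.nan, 2015, 2012, 2010],
})

# List of the distinct years, ignoring missing values
years = sorted(comics["Print_Release_Year"].dropna().unique())

# Number of releases for each of those years
releases_per_year = [
    int((comics["Print_Release_Year"] == year).sum()) for year in years
]

for year, count in zip(years, releases_per_year):
    print(int(year), count)
```

A shorter alternative would be `comics["Print_Release_Year"].value_counts().sort_index()`, which returns the same counts as a Series.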

The numbers show that growth was moderate until the 2000s, when a boom happened, with a great increase in releases until 2012, when the release numbers started to oscillate. The drop shown in 2016 is because we are still in the middle of the year.

Now we’ll evaluate the comics with the most ratings on the website. We can probably also say that these are the most read comics on the website. For this analysis, we will sort the table by number of ratings and print some columns. Let’s see the first 30.

As we can see, DC Comics has a lower average price and price per page, and a slightly higher average rating. The average page count is a little higher for Marvel. Below are the bar charts that represent these comparisons:

The next step is to look at some numbers on the quantity of comics each publisher has: how many comics in total, how many of them are good (4 or 5 stars rating), how many are bad (1 or 2 stars) and the proportion of each to the total. For this analysis, we will basically filter the DataFrame and count the number of rows of each filtered view. Simple:
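A sketch of this filter-and-count approach follows. The publisher labels and ratings are toy values, not the scraped data, and `rating_summary` is a hypothetical helper written for this example:

```python
import pandas as pd

# Toy data with placeholder publishers; column names are assumptions
comics = pd.DataFrame({
    "Publisher": ["Publisher X"] * 4 + ["Publisher Y"] * 4,
    "Rating": [5, 4, 3, 1, 5, 2, 4, 4],
})

def rating_summary(df, publisher):
    """Count total, good (>= 4 stars) and bad (<= 2 stars) comics."""
    subset = df[df["Publisher"] == publisher]
    total = len(subset)
    good = len(subset[subset["Rating"] >= 4])
    bad = len(subset[subset["Rating"] <= 2])
    return {
        "Publisher": publisher,
        "Total": total,
        "Good": good,
        "Bad": bad,
        "Good_share": good / total,
        "Bad_share": bad / total,
    }

summary = pd.DataFrame(
    [rating_summary(comics, p) for p in ["Publisher X", "Publisher Y"]]
)
print(summary)
```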

It is interesting to note that even with Marvel having more comics, as we saw in the previous table, the quantity of ratings of DC’s comics is much bigger, approximately 55% more. It seems that DC’s fans are more inclined to rate comics on Comixology than Marvel’s.

Our next evaluation will be about characters and teams of heroes / villains. First, we need to create lists of characters and teams for each publisher. I created the lists by hand, doing some research. It didn’t take very long.

Next, we need to loop over each character or team name. For each one, we filter the DataFrame so that the only rows that remain are the ones where the comic name includes the name of this character or team. Then, we’ll extract some information from there. The quantity of comics will be the number of rows of the resulting DataFrame. Then, we will get the average price, rating and page count. All this information will be saved in a dictionary, and this dictionary will be appended to a character list, if it is a character, or a team list, if it is a team. In the end, we will have a list of dictionaries for characters and one for teams, and we will use them to create DataFrames:
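Here is a rough sketch of that loop. The comic titles and values are toy data, the `summarize` helper is hypothetical, and every column name except `Quantity_of_comics` is an assumption:

```python
import pandas as pd

# Toy titles and values for illustration only
comics = pd.DataFrame({
    "Name": ["Batman #1", "Batman and Robin #2", "Superman #1", "Justice League #5"],
    "Price": [1.99, 2.99, 0.99, 3.99],
    "Rating": [4, 5, 4, 3],
    "Page_Count": [22, 30, 20, 24],
})

# Short hand-made lists, standing in for the researched ones
characters = ["Batman", "Superman"]
teams = ["Justice League"]

def summarize(names):
    """Build one dictionary of stats per name and return them as a DataFrame."""
    rows = []
    for name in names:
        # Keep only comics whose title contains the name (plain text match)
        matches = comics[comics["Name"].str.contains(name, regex=False)]
        rows.append({
            "Name": name,
            "Quantity_of_comics": len(matches),
            "Average_price": matches["Price"].mean(),
            "Average_rating": matches["Rating"].mean(),
            "Average_page_count": matches["Page_Count"].mean(),
        })
    return pd.DataFrame(rows)

characters_df = summarize(characters)
teams_df = summarize(teams)
print(characters_df)
print(teams_df)
```

A plain substring match is simple, but it can also catch titles where the name appears as part of another word, so the counts are approximate.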

Let’s consider only teams and characters whose names appear in the titles of more than 20 comics. So, let’s make a filter:

# Filter characters and teams DataFrame for rows where there are more than 20
# comics where the character / team name is present on the title of the comics
characters_df = characters_df[characters_df['Quantity_of_comics'] > 20]
teams_df = teams_df[teams_df['Quantity_of_comics'] > 20]

Now, let’s check the biggest characters and teams by number of comics and by average rating. For the characters, even considering only the ones with more than 20 comics, there are still too many left, so we’ll limit the list to the top 20. For the teams, there is no need, since there are already fewer than 20. Then, we’ll print the tables:

Among the characters, we have Batman as the one with the biggest number of comics, followed by Spider-Man and Superman. After that, we have some other famous characters, like Captain America, Iron Man, Wolverine, Flash. Here, nothing surprising.

Here, we have some surprises at the top. Even if her quantity of comics is not very big, few people would imagine that Mystique would be the character with the highest average rating, among all these extremely popular characters. In the next positions, there are more surprises, with Booster Gold in second, Jonah Hex in third and Blue Beetle in fifth. Of the most popular characters, we see Spider-Man, Deadpool and Wonder Woman at the end of the top 20. Let’s go to the teams:

Conclusion

And with that, we conclude our series of 3 posts with the analysis of the website, web scraping and data analysis of digital comics, with information extracted from the Comixology website. As the data is not always available in a simple and practical manner, like a database or a csv dataset, sometimes we have to get the data through web scraping, or some other more complex technique.

In this analysis, we reached some conclusions about the comics on the website. I summarized them in the list below:

Some smaller publishers have a good average rating, so they are probably a good option if you want to read something different from the big ones (Marvel, DC Comics, Image, etc.)

Among the big ones (publishers with more than 300 comics rated on Comixology), Marvel and DC Comics are at the bottom of the ranking when it comes to the average ratings of their comics. The top three are Archie (of Archie comics, Mega Man, Sonic, among others), MAX (focused on adult comics: Dexter, Jessica Jones, Deadpool) and Image – Skybound (mainly Walking Dead).

Age classification does not seem to significantly affect the rating a comic receives.

The number of comic releases increased a lot in the 2000s, suffered a recent drop and now seems to oscillate from year to year.

The two comics with the most ratings on Comixology are free. The third, maybe surprisingly, is the first issue of the Saga series, from the publisher Image.

In the head-to-head battle between Marvel and DC Comics, DC seems to have a small advantage. DC has a lower average price per page and average price, while having a slightly higher average rating. The average page count is a little higher for Marvel’s comics. DC also has a bigger proportion of good comics (4 or 5 stars rating) and a smaller proportion of bad comics (1 or 2 stars rating).

Batman is the character with the most comics, followed by Spider-Man and Superman. The characters with the highest average rating are, surprisingly, Mystique (from X-Men), Booster Gold and Jonah Hex.

Among the teams, the ones with the most comics are the X-Men, Avengers and Justice League. On the podium for the highest average rating are All-Star Squadron from DC, and Fantastic Four and Thunderbolts from Marvel.