Python Lessons from 4chan, Part I: Logging is Easy!

If I wanted to make this post sound professional and industrious, I would say that I started this project because I’ve begun working on my Bayesian model of webcomic updates again, and that I’m taking an intermediate step by analyzing data from similar content creators.

But the truth is, I was just pissed off that I couldn’t read the manga I wanted to.

These are the Python lessons I learned scraping manga scanlations off of 4chan.

My Problem

Japanese comics–perhaps unsurprisingly–are mostly written in Japanese. Since many are never officially published in English, there are groups of enthusiasts who scan the ones they like, translate them, and put them on the internet for free. These are called scanlations, and they’re kinda classified as pirating.

The decentralized, volunteer origins of these translations make keeping track of them somewhat difficult, and I frequently find that chapters are missing online, or are out of order, etc. Many scanlations come from 4chan’s anime/manga message board, which means that if no one re-uploads them to a separate hosting site, they eventually get deleted or end up lost in the annals of some third-party archiver.

Warning: never, ever go to 4chan.

I cannot stress this enough: Mom, if you’re reading this, do not go to 4chan.

If you are a loved one, or have any flicker of warmth in your heart, do NOT go to 4chan. It’s like the internet’s unconscious id, vomiting out content that no decent human should ever see. The anime/manga message board (/a/) is much tamer than the disturbing “random” board (/b/) or the infamously fascist/far-right “politics” board (/pol/), but even then, it’s still really… inappropriate. I’m just there for translations of a few series that I can’t get elsewhere.

I wanted to make sure that I didn’t miss any updates, so I decided to write code that would run in the background of my computer, check to see if new translations had been released, and then save all the images to my computer. I wanted to do this:

and in such a way that it would be relatively easy for Python newbies to understand.

Long story short, this was much more irritating than I thought it was going to be, and I wanted to share some of what I learned so that others can avoid my pain.

To avoid an insanely long post, I’ve broken my take-aways into a few shorter posts. This is the first one:

Logging errors isn’t that scary

Since high school, a lot of the code I’ve written has been for me, meaning that only I have to understand it. Doing collaborative research in a computer science lab, developing helpful R packages, and making shareable scientific code has meant that I’ve needed to change the ways I used to work, but I still struggle with my natural laziness.

Since I was making shareable code that would be running in the background, I was going to need to do more than just print("THIS BROKE"). I was going to have to log the scraper’s activity to a file. Luckily, Python has a module that makes this pretty dang easy: logging.

The docs have a pretty decent tutorial, but for the simplest cases, all you basically need to do is to set up the logging configurations:
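A minimal sketch of that kind of setup might look something like this. The file name scraper.log, the format string, and the fetch_thread helper are placeholders I made up for illustration; they are not taken from the actual scraper.

```python
import logging


def configure_logging(log_path="scraper.log", level=logging.INFO):
    """Send log messages to a file (the path and format are illustrative)."""
    logging.basicConfig(
        filename=log_path,
        level=level,
        format="%(asctime)s %(levelname)s %(message)s",
        force=True,  # Python 3.8+: replace any handlers that were set up earlier
    )


def fetch_thread(thread_id):
    """Hypothetical stand-in for the real scraping work."""
    if thread_id < 0:
        raise ValueError("thread id must be non-negative")
    return f"thread {thread_id}"


def safe_fetch(thread_id):
    """Fetch a thread, logging what happened instead of printing it."""
    logging.info("Checking thread %s", thread_id)
    try:
        return fetch_thread(thread_id)
    except ValueError:
        # logging.exception logs at ERROR level and appends the traceback
        logging.exception("Could not fetch thread %s", thread_id)
        return None


if __name__ == "__main__":
    configure_logging()
    safe_fetch(42)
    safe_fetch(-1)  # the failure ends up in scraper.log, traceback included
```

Once the configuration call has run, any later logging.info or logging.exception call anywhere in the program ends up in the same file, which is most of what a background script needs.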

Note that the level you pass to logging.basicConfig decides which priority levels actually get logged. If the level is set to logging.WARNING, only messages at “warning” priority or above (warnings, errors, and critical messages) will be logged, while info and debug messages are dropped.
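To see the threshold in action, here is a small sketch that sends the same four messages through a logger at whatever level you give it and returns the ones that survive. The logger name threshold_demo and the messages themselves are made up for the example, and it writes to an in-memory buffer rather than a file so it is easy to poke at.

```python
import io
import logging


def messages_logged_at(level):
    """Return the lines that get through when the threshold is `level`."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

    logger = logging.getLogger("threshold_demo")  # hypothetical logger name
    logger.handlers.clear()   # start fresh on every call
    logger.addHandler(handler)
    logger.propagate = False  # keep the demo out of the root logger's output
    logger.setLevel(level)

    logger.debug("parsed 12 posts")
    logger.info("found a new chapter")
    logger.warning("image link looks dead")
    logger.error("download failed")
    return stream.getvalue().splitlines()


if __name__ == "__main__":
    for line in messages_logged_at(logging.WARNING):
        print(line)
```

With logging.WARNING only the warning and the error come back; with logging.DEBUG all four do.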

It’s just that simple!

Source Code:

The end product of my pains. I gave up adding docstrings halfway, but I have a lot of comments, so understanding what’s happening shouldn’t be too hard. I made the argument-parsing nice and sexy–try python3 scanlation_scraper_timed.py -h for a look-see.

For an idea of how threads work, maybe check out my previous post about scraping with threads.