As you may have noticed in my previous, criminally short introduction, there were no pictures of any actual honeypot ants. Given that this is a series of blog posts centered around these ants, that might seem strange and self-defeating. I would personally agree: at this point, I don’t remember why I thought it was a good idea.
But you know what? Let me make that up to you. We’re going H-A-double-M on pictures today, folks: welcome to the porn-free, image-palooza post.

Did you know that if you try to scrape too many pages at a time from the same website, it will sometimes decide you’re being malicious and block your IP address? Or that Python’s urllib and shelve packages totally suck on some computers?

Come, let me show off and teach you about some of the more nuanced aspects of web-scraping buttloads of data from the internet.
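To make that first lesson concrete, here’s a minimal sketch of a rate-limit-aware fetcher using only the standard library. The function name, retry counts, and backoff schedule are my illustrative assumptions, not code from the posts themselves:

```python
import time
import urllib.request
from urllib.error import HTTPError, URLError

def polite_get(url, retries=3, delay=2.0):
    """Fetch a URL, backing off when the server pushes back.

    HTTP 429/503 usually mean you're scraping too fast and are
    about to get your IP blocked -- slow down instead of hammering.
    """
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except HTTPError as e:
            if e.code in (429, 503) and attempt < retries - 1:
                # Exponential backoff: 2s, 4s, 8s, ...
                time.sleep(delay * (2 ** attempt))
            else:
                raise
        except URLError:
            if attempt < retries - 1:
                time.sleep(delay)
            else:
                raise
```

Sleeping between retries (and between pages generally) is usually enough to stay under a site’s radar.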

Think of the most popular animal you know. Most of us will imagine pandas, elephants, dolphins, maybe bald eagles. These are the names kids shout out when asked about their favorites; these animals are considered “charismatic.” We paint pictures of them, we write songs about them, we use them as symbols, and we include them in metaphors.

Instead of using the term “computer nerd”, I always preferred to describe myself as a “computer jock” growing up¹. After all, if there had been a programming team in high school, I felt like I probably would have made varsity.

But sometimes, athletes are limited by the equipment they have on hand. Cyclists can’t ride the Tour de France with tricycles, and my lab can’t run hundreds of thousands of simulations on our 2014 MacBooks. We’ve been feeling the crunch as of late, which is why we’ve begun exploring the various computing clusters available to us on campus.

This post is meant for two purposes: the first is to document an example of using remote clusters with R, and the second is to serve as instructions/reference for my lab members at Rochester.

¹ My mommy told me that joke was funny when I came up with it at age 11, so I’m assuming it’s still good.

If I wanted to make this post sound professional and industrious, I would say that my motivations behind this project were because I’ve started working towards my Bayesian model of webcomic updates again, and that I’m taking an intermediate step by analyzing data from similar content creators.

But the truth is, I was just pissed off that I couldn’t read the manga I wanted to.

These are the Python lessons I learned scraping manga scanlations off of 4chan.

Part 3: The ‘pythonicity’ of decorators
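As a taste of what that part covers, here’s one common “pythonic” decorator pattern — a retry decorator factory, handy for exactly the kind of flaky network calls scraping involves. The names and retry policy here are my illustration, not necessarily the post’s own example:

```python
import functools
import time

def retry(times=3, delay=1.0):
    """Decorator factory: re-run the wrapped function on failure."""
    def decorator(func):
        @functools.wraps(func)  # keep the original name and docstring
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise  # out of attempts: let the error surface
                    time.sleep(delay)
        return wrapper
    return decorator

@retry(times=3, delay=0.1)
def flaky_fetch(url):
    ...  # any function that sometimes fails for transient reasons
```

The `@functools.wraps` line is the part people forget — without it, the decorated function loses its name and docstring.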

Have you ever wanted to make a small pull request to improve an open-source project that you maintain a heavily modified version of? Say your personal fork has one particular change you think the main project would find useful, but you don’t want to make the maintainers pull all your custom code.

I’ve run into this type of problem a bunch of times, so I’m making a really short post on how to make a pull request to a project for just a few specific commits.
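In outline, the trick is to branch cleanly off the upstream project and cherry-pick only the commits in question. The remote names, branch name, and commit hash below are placeholders:

```shell
# 'upstream' = the original project, 'origin' = your fork,
# abc1234 = the commit you want to contribute (all placeholders).
git fetch upstream
git checkout -b tiny-fix upstream/main   # branch from upstream, not your messy fork
git cherry-pick abc1234                  # copy over only the commit(s) you want
git push origin tiny-fix                 # then open the PR from this clean branch
```

Because the branch starts from upstream’s tip, the resulting PR contains only the cherry-picked commits and none of your custom code.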

It was a nice evening, sitting around a fire under the stars with my new lab a few days after I started grad school. My girlfriend was visiting at the time, and I was gently teasing her about the way she pronounced the word “eggs” (“eygs”).

No one batted an eye.

It was then that she called me out for pronouncing “museum” as “myoo-zam.”

If I wanted to make this post sound professional and industrious, I would say that my motivations behind this project were because I’ve started working towards my Bayesian model of webcomic updates again, and that I’m taking an intermediate step by analyzing data from similar content creators.

But the truth is, I was just pissed off that I couldn’t read the manga I wanted to.

These are the Python lessons I learned scraping manga scanlations off of 4chan.

Part 2: Using (the new) string formatting!
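For reference, the “new” formatting in question, next to the old %-interpolation it replaces — the strings themselves are made up for illustration:

```python
name, chapter = "One Piece", 1090

# Old-style % interpolation: positional, printf-like.
old = "Reading %s, chapter %d" % (name, chapter)

# "New" str.format: named fields, reorderable, with format specs.
new = "Reading {title}, chapter {ch:04d}".format(title=name, ch=chapter)
```

Named fields pay off as soon as a template has more than a couple of slots, since you can reorder or repeat them without touching the argument list.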

If I wanted to make this post sound professional and industrious, I would say that my motivations behind this project were because I’ve started working towards my Bayesian model of webcomic updates again, and that I’m taking an intermediate step by analyzing data from similar content creators.

But the truth is, I was just pissed off that I couldn’t read the manga I wanted to.

These are the Python lessons I learned scraping manga scanlations off of 4chan.

Part 1: Logging is easy!

What does a quintessential heartland rock song have to do with a traditional Japanese aesthetic?

American rock and roll legend Bob Seger isn’t Japanese. I’m not sure he’s ever even been to Japan. It’s weird, then, that his music so perfectly captures a uniquely Japanese aesthetic, one infamously difficult to translate. What’s even weirder is that he can do this in a song about how he was constantly horny as a kid.

Of course, I’m talking about the mono no aware (物の哀れ) in his hit song, “Night Moves”.

I’ve recently gotten into the fabulous world of mechanical keyboards. Other than the Ａｅｓｔｈｅｔｉｃｓ and the fact that building working electronics is cool, the main draw for me was the ability to completely customize how your board works. Just imagine all the chances for increased ｐｒｏｄｕｃｔｉｖｉｔｙ!

Of course, good cognitive scientist that I am, I know that optimal performance requires a good grasp of the statistics of the environment you’re working in. For us, that’s going to be our typing patterns! Here, I’ll show you how I was able to capture a huge amount of data on how I type, how to play around with it, and how to do the same for your own keyboard.
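To give a flavor of the analysis, here’s a toy sketch assuming you’ve already captured keystrokes as (key, timestamp) events — the event list itself is made up:

```python
from collections import Counter

# Hypothetical keystroke log: (key, timestamp-in-seconds) pairs,
# captured by whatever listener your setup exposes.
events = [("e", 0.00), ("t", 0.12), ("e", 0.25), ("a", 0.31), ("e", 0.44)]

# Key frequencies: which switches wear out first, and which keys
# deserve the easiest-to-reach spots on a custom layout.
freqs = Counter(key for key, _ in events)

# Inter-keystroke intervals: the raw material for typing-speed stats.
gaps = [t2 - t1 for (_, t1), (_, t2) in zip(events, events[1:])]
```

Even these two summaries — per-key counts and inter-key timings — are enough to start making layout decisions from data rather than vibes.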

I like statistical models. I like webcomics. I like not having to suffer through deciding whether a webcomic is ever going to update regularly again. I began to ask myself, “Can I use statistical modelling to tell me when I should stop hoping a webcomic will keep updating?”

Nothing is more haunting than that oft-repeated phrase: “updates when?” It’s not even about the wait, it’s about the uncertainty–either end or start updating, don’t keep me in limbo! I would love it if I could make a model that could just tell me, “Hey, this comic is entering its death spiral, abandon ship!” Also, I just like learning new statistical methods.

Although that model is still in the works, I’ve gotten my hands on a bunch of cool data in the meantime. This post isn’t quite a tutorial; it’s more like a demonstration of how you can have fun with simple web scraping and niche interests–but I’ve attached all the code I used, complete with documentation and a flexible design for newbies who want to start collecting their own data.
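As a taste of where the modelling might start — not the actual model, just a toy heuristic on made-up dates — you can already get mileage out of the gaps between updates:

```python
from datetime import date

# Toy update history for a hypothetical comic.
updates = [date(2020, 1, 1), date(2020, 1, 8),
           date(2020, 1, 15), date(2020, 2, 20)]

# Days between consecutive updates.
gaps = [(b - a).days for a, b in zip(updates, updates[1:])]

# A crude "death spiral" flag: the latest gap dwarfs the typical one.
typical = sorted(gaps)[len(gaps) // 2]   # rough median gap
alarming = gaps[-1] > 3 * typical
```

A real model would put a distribution over those gaps instead of a hard threshold, but even this crude flag captures the intuition: it’s not the wait, it’s the wait relative to the comic’s own history.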