
Getting a full PDF from a DRM-encumbered online textbook

Updated on October 24, 2015 by Tek Editor

I recently started a calculus course that uses an online textbook. Buying this textbook online was mandatory, not for the content, but to get an electronic access code for homework assignments. While I had the option of additionally buying a physical copy of the book, I don't like the idea of textbook publishers trying to squeeze the used-book market with scummy tactics like this. On top of that, unless I pay extra, I will lose access to this book at some point in the future. That is unacceptable to me. So… I'm going to crack it.

(Yes, I probably could have just torrented a PDF copy. But that’s no fun!)

The DRM on this textbook is pretty intense. Of course, there isn’t a “download PDF” option. There is a printing option, but it’s limited to 10 pages at a time, and prints the pages out with a large watermark in the center, along with licensing info (my name, number, and a “do not scan, copy, duplicate, distribute, or exercise any freedom with the material” notice) in the margins. Fun!

First, we need to download all the pages. Thanks to the 10-page print limit, this is going to take forever… right? Nope! A little Clojure and java.awt.Robot has our mouse pointer whizzing around the screen by itself.
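I won't inflict my Clojure on you, but the gist is easy to sketch. Here's a rough shell equivalent using xdotool in place of java.awt.Robot; it echoes the commands instead of running them, and every coordinate, delay, and count below is a placeholder, not a value from the real run:

```shell
# Dry-run sketch of the scraper loop: open the print dialog, confirm
# print-to-PDF, advance by the 10-page print limit, repeat.
# Drop the "echo" to actually drive the mouse (requires an X session).
scrape_pages() {
    page=$1
    while [ "$page" -le "$2" ]; do
        echo xdotool mousemove 1180 40 click 1   # placeholder: "print" button
        echo sleep 2                             # let the dialog render
        echo xdotool mousemove 640 520 click 1   # placeholder: confirm print
        echo sleep 8                             # let the PDF finish saving
        page=$((page + 10))                      # 10 pages per print
    done
}
scrape_pages 1 50
```

The real version also had to paginate through the book's navigation UI, which is where most of the sleeps (and the flakiness) came from.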

My Clojure was pretty rusty, so the code is far from pretty, and I got around timing problems by adding more sleeps… but with some trial and error, it worked pretty well. Several coffee/tea/Tinder breaks later, punctuated by restarting the scraper wherever it broke, all the pages were living on my hard drive. Nice! Er, except the ones that didn't get captured due to timing issues. A bit of Python magic found which pages weren't grabbed correctly, though, and I was able to rerun the scraper on just those ranges to clean up the remnants. Overall, the process took around a day, which, while not ideal, wasn't too bad. I experimented with taking regional screenshots to detect whether UI elements were actually ready instead of just guessing, but if I were doing this to more books and wanted it to be robust, I would look into cracking the .swf itself.
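The Python isn't shown here, but the missing-page check is easy to sketch in shell: generate the list of filenames you expect and diff it against what actually landed on disk. The naming scheme and page count below are assumptions, and the fixture lines fake a download directory in /tmp so the snippet runs on its own:

```shell
# Demo: find which expected page images are missing from a directory.
dir=/tmp/book-imgs-demo
mkdir -p "$dir"
seq -f "page-%04g.png" 1 10 > /tmp/expected.txt            # assumed naming scheme
sed '3d;7d' /tmp/expected.txt | xargs -I{} touch "$dir"/{}  # simulate 2 failed captures
ls "$dir" | sort > /tmp/actual.txt
comm -23 /tmp/expected.txt /tmp/actual.txt                  # prints the missing pages
```

Feed the printed page numbers back into the scraper as ranges and rerun.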

Now, we need to get a page into image form, so we can play around with it in GIMP. Once we get a process worked out, we can automate it with ImageMagick and process all 1500-odd pages of the book. Getting this image is easy: pick a page and run:

Luckily page R11 was totally white, so converting it to a PNG yielded a clean, isolated copy of the watermark.

Dealing with the margins will be easy, since we can just crop them out, so let's focus on the watermark. In GIMP, I made everything but the watermark itself transparent; after that, removing it from the original image is as simple as overlaying the cleaned-up version on the page and setting the watermark layer's mode to divide.

Now we need to repeat our earlier PDF->PNG conversion for all the files. This wasn’t much harder with a dash of GNU parallel (an incredibly handy tool):

(I could have used mogrify and done it in-place, but I wanted to keep a backup. The first parallel command took quite a while to run.)

This command seems a bit confusing, so I’ll break it into its constituent pieces.

The first part is -background white ... -flatten, which fills the transparent edges with white. I wanted the images to have an 8.5×11 ratio (because I’m a silly American), and it turns out they already were in that ratio – if I included the transparent part. No cropping required!

The next part is -fill white -draw "rectangle 160,163 1799,215" -draw "rectangle 160,2725 2342,2834". Since we aren’t cropping out the licensing text, I’m instead simply covering it up with some filled-white rectangles. The coordinates took a bit of tweaking, but it worked out pretty well.

Finally, after we -flatten and all the operations have been applied to the source image, we can load the watermark image and divide by it as before: watermark.png -compose Divide_Src -composite "/home/jon/book-imgs-proc/{/}". To make everything align, I had to manually move the watermark up in GIMP. I'm not sure why, but after I did this, everything processed basically perfectly (the alignment still isn't exact, so there are some very thin gray lines, but it's good enough for me).

Now, we can use convert again (ImageMagick is so useful) to join the PNG images into a single PDF:

…and went out with a friend, then came back and read The C Programming Language for a bit, then browsed Hacker News for a bit… it ended up running for almost three hours, but it chewed through all the pages eventually. The resulting PDF was more than 700 MiB.

Now we can use basically any PDF OCR tool to make the text searchable. If I had the motivation I could probably scrape the original text from the book to get it perfect, but I don’t care that much, so OCR it is.

I already have Ruby and the Tesseract OCR engine installed, so I just grabbed the one-script pdfocr tool from its GitHub repo. It needed one extra command installed for some reason, and…

./pdfocr -t -i book.pdf -o book-ocr.pdf

…another hour or so of waiting later, and my PDF was done! Basically 100% searchable and cleanly formatted. Compared to buying a "lifetime of edition" code, I saved at least $120, so I'm pretty happy with this project overall.