Blog

I was recently out with a friend of mine who mentioned that he was having a tough time scraping some data off a website. After a few drinks we arrived at a barter: if I could scrape the data, he’d buy me some single malt scotch, which seemed like a great deal for me. I assumed I’d make a couple of HTTP requests, parse some HTML, grab the data and dump it into a CSV. In the worst case I imagined having to write some custom code to log in to a web app and maybe persist some cookies. And then I got started.

As it turned out, this site was running one of the most sophisticated anti-scraping/anti-robot packages I’ve ever encountered. In a regular browser session everything looked normal, but after a half dozen or so programmatic HTTP requests I started running into their anti-robot software. After poking around a bit, I found that the blocks they were deploying were a mix of:

Whitelisted User Agents – Following a few requests from PHP cURL the site started blocking requests from my IP that didn’t include a “regular” user agent.

Requiring cookies and Javascript – I thought this was actually really clever. After a couple of requests the site started quietly loading an intermediate page that required your browser to run Javascript to set a cookie and then complete a POST request to a URL that included a nonce in order to view a page. To a regular user this was fairly transparent since it happened so quickly, but it obviously trips up a plain HTTP client.

Soft IP rate limits – After a couple of dozen requests from my IP I started receiving “Solve this captcha” pages in order to view the target content.
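
A scraper has to tell these three responses apart from the real content before it can react to them. Here is a minimal sketch of that check. The status code and marker patterns are hypothetical placeholders, since the site's actual markup isn't shown here, so they would need adapting to whatever the target site really returns.

```javascript
// Hypothetical markers: placeholders to adapt to the real site's responses.
const CAPTCHA_PATTERN = /captcha/i;
const POST_FORM_PATTERN = /<form[^>]+method=["']?post/i;
const SCRIPT_PATTERN = /<script/i;

// Classifies a response as one of: 'blocked', 'captcha', 'js-challenge', 'content'.
function classifyResponse(statusCode, html) {
  // Assumption: a rejected user agent shows up as an HTTP 403.
  if (statusCode === 403) return 'blocked';
  // The rate limit page asks the visitor to solve a captcha.
  if (CAPTCHA_PATTERN.test(html)) return 'captcha';
  // The intermediate page runs a script and then submits a POST.
  if (POST_FORM_PATTERN.test(html) && SCRIPT_PATTERN.test(html)) {
    return 'js-challenge';
  }
  return 'content';
}
```

Keeping the classification in one function makes it easy to add a new case when the site changes its defenses.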

Taken all together, it’s a pretty sophisticated setup for what’s effectively a niche social networking site. Given the “requires Javascript” restriction, I decided to explore using Electron for this project. And it turns out it’s a perfect fit. For a quick primer, Electron is an open source project from GitHub that enables developers to build cross platform desktop applications by merging nodejs and Chrome. Developers end up writing Javascript that can leverage the nodejs ecosystem while also using Chrome’s browser internals to render windows and widgets. Electron helps in this use case because it provides a full Chrome browser that’s scriptable and has access to node’s system level modules. For completeness, you could implement all of this in a Chrome extension, but in my experience extensions have more complicated non-privileged to privileged communication and lack access to node, so you can’t just fire off a “fs.writeFileSync” to persist your results.

With a full browser environment, we now need to tackle the IP restrictions that cause captchas to appear. At face value, like most people, I assumed solving captchas with OCR magic would be easier than getting new IPs after a couple of requests, but it turns out that’s not true. There weren’t any usable “captcha solvers” on npm, so I decided to pursue the IP angle. The idea would be to grab a new IP address after a few requests to avoid having to solve a captcha, which would require human intervention. Following some research, I found out that it’s possible to use Tor as a SOCKS proxy from a third party application. So concretely, we can launch a Tor circuit and then push our Electron HTTP requests through Tor to get a different IP address than our normal Internet connection.
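
To make the proxy step concrete, here is a small sketch of the configuration involved. It assumes Tor is listening on its default SOCKS address (127.0.0.1:9050) and a recent Electron version where `session.setProxy` accepts a `proxyRules` string and returns a promise. The helper name `torProxyConfig` is made up for illustration.

```javascript
// Builds the proxy settings object for Electron's session.setProxy().
// The defaults match a stock Tor daemon; pass other values if your torrc differs.
function torProxyConfig(host = '127.0.0.1', port = 9050) {
  return { proxyRules: `socks5://${host}:${port}` };
}

// Inside an Electron main process, once the app is ready (not executed here):
//   const win = new BrowserWindow({ width: 1024, height: 768 });
//   await win.webContents.session.setProxy(torProxyConfig());
//   await win.loadURL(targetUrl); // targetUrl is a placeholder
```

Because the proxy is set on the window's session, every request the page makes, including scripts and XHR calls, goes through Tor rather than just the top level navigation.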

To run that locally, you’ll need to do the usual “npm install” and then also run a Tor instance if you want to get a new IP address on every request. The way it’s implemented, it’ll detect the “content you want” and also alert you when there’s a captcha by playing a “ding!” sound. To launch, first start Tor and let it connect. Then you should be able to run:

Once it loads, you’ll see the test page in what looks like a Chrome window with a devtools instance. As it refreshes, you’ll notice that the IP address it displays for you keeps updating. One “gotcha” is that by default Tor will only get a new IP address each time it opens a circuit, so you’ll notice that I run “killall” after each request, which closes the Tor circuit and forces it to reopen.
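
The rotation logic can be sketched roughly like this. It is an illustration rather than the skeleton's actual code: the exact “killall” arguments aren't shown above, so they are left as a parameter, and `createRotator` and `runKillall` are hypothetical names.

```javascript
const { execFile } = require('child_process');

// Assumption: the exact killall arguments are unknown, so they are a
// parameter. Plain `killall tor` terminates Tor, which then has to be
// restarted by whatever launched it.
function runKillall(args = ['tor']) {
  return new Promise((resolve) => {
    // Resolves to false instead of throwing when killall reports an error,
    // for example when no matching process was found.
    execFile('killall', args, (err) => resolve(!err));
  });
}

// Returns a function to call after every completed request. It invokes
// `renew` once every `everyN` calls and reports whether it did so.
function createRotator(renew, everyN = 1) {
  let count = 0;
  return function afterRequest() {
    count += 1;
    if (count % everyN !== 0) return false;
    renew();
    return true;
  };
}

// Example wiring (not executed here):
//   const afterRequest = createRotator(() => runKillall(), 1);
```

Passing `renew` in as a callback keeps the counting logic independent of how the new IP is actually obtained, so the killall call can be swapped out later.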

And that’s about it. Using Tor with the skeleton you should be able to build a scraper that presents a new IP frequently, scrapes data, and conveniently notifies you if human input is required.

It has become a bi-weekly ritual. The professor spent too much time on the course material again and is left mumbling through a complex project description during the 11th hour of class. All the while, you’re off somewhere else. As you sling your backpack over your shoulder, you catch the only words you’ll need to hear: “You can download the syllabus along with the source code from the CS department’s website,” they say. Great! You hustle back to your study location of choice, open your laptop, and extract the project files. After the obligatory knuckle crack, you look down at the method stubs spelled out for you. “All I have to do is fill in these functions?” you think to yourself. And as you’re getting familiar with the project structure, a couple of flicks of the scroll wheel reveal hundreds, sometimes thousands, of lines of unexplained boilerplate code.

You eventually finish up the assignment and push it to the CS department’s server for grading. Without fail, someone raises their hand during the next class asking the instructor if they could explain what some of that boilerplate code was for, at which point the student is usually told to refer to the language documentation to figure it out for themselves. And for the most part, this makes perfect sense. After all, you’re there to learn about some of the more complex topics in computer science, not to write setter and getter methods all day. That’s what your data structures class was for.

But I would like to share with you the first few months of my experience as a Jr. Software Engineer and compare it to my time as an undergraduate student. You might be not-so-surprised to hear I have spent more time writing code similar to the boilerplate stuff mentioned above than I have perfecting the space and time complexity of my pioneering solution to The Traveling Salesman problem.

As an undergraduate student, I was an ace at avoiding merge conflicts in repositories where I was the only contributor. I could even run a build script with the best of ‘em. Nobody ever really told me how to use version control systems to manage a collaborative project with tens of thousands of lines of code strewn across a mess of files and directories. And if, for some reason, those same build scripts broke or a merge conflict popped up on a group project? Well, I was pretty much at the mercy of Stack Overflow.

At Setfive, when I was tasked with setting up a relational database schema for my first real project, I wasn’t really sure where to begin. There was no syllabus to refer to and no professor to schedule office hours with. While I was aware of relational database software such as MySQL, I had never really written a query, so I certainly didn’t know the difference between an inner and outer join. And while coordinating all those AJAX calls and setting up the Symfony bundle configs was a little confusing at first, I think I’m starting to learn how to apply my undergraduate education to these real-world projects.

So far, I have found that industry-level programming helps hone a much more practical skill set than academic programming. Don’t get me wrong, I learned a ton in college, and I know the concepts taught are not only important to a fundamental understanding of the field of computer science, but also have profound and meaningful applications elsewhere, such as in operating systems, machine learning, and so on. But when I look back on the things I have learned in such a short period of time over these past few months, it gets me excited for the road ahead. I owe an enormous thanks to Setfive for bringing me on as an entry-level software developer and advising me with patience.

You might remember Txty Jukebox, our free to use collaborative music web app that we built on top of the YouTube Data API. We were happy to find that our original version was well received and even got some press from the folks over at makeuseof.com. Well, we’ve finally got a chance to spend some time (big thanks to our new hire Josh, who led the charge) to make improvements based on the feedback we received and re-brand it under jointdj.com!

The main idea behind our music inspired web application is to create an easy way for groups of people to collaboratively share and listen to song (and video) requests. Any user with a smart phone or computer can enter the event code provided by the event’s host on jointdj.com and start submitting songs to the event’s playlist. The “event” doesn’t always have to be a traditional party either. For example, we’ve been using Joint DJ ourselves in our office as a Pandora or Spotify replacement.

To see how it works, I suggest skimming the jointdj.com landing page, which does a good job of quickly outlining how to use it. Instead of regurgitating that information here, I’ll highlight a few new features/improvements to get excited about:

One big lesson learned from our first go around with Txty Jukebox was that while it’s great when everyone at your event is engaged and the song queue is filled up, you can run into awkward silences if the playlist runs out of songs when people get distracted, say, doing work or playing an intense game of flip cup. In the past you had to wait until someone queued another song, so it became a bit of a chore for the event host. To solve this issue and ensure there will never be a silent moment, we’ve created a new feature that lets the event host pick a genre of music when they create an event, from which a song will be randomly selected and played if the playlist ever runs out. For example, I could create an event with “Top 40 / Pop” as the auto fill genre. If at any point during my event the playlist is empty, all of a sudden the latest Chainsmokers song will magically be queued up!
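
The auto fill behavior boils down to a simple fallback rule. Here is a minimal sketch of it, not Joint DJ’s actual code: the function name and data shapes are hypothetical, with `queue` holding the requested songs and `genrePool` holding the candidate songs for the host’s chosen genre.

```javascript
// Returns the next song to play: the oldest request if there is one,
// otherwise a random pick from the genre pool, or null if both are empty.
// `random` is injectable so the choice can be made deterministic in tests.
function nextSong(queue, genrePool, random = Math.random) {
  if (queue.length > 0) return queue.shift();
  if (genrePool.length === 0) return null;
  const index = Math.floor(random() * genrePool.length);
  return genrePool[index];
}
```

Requests always win over the fallback, so the genre only kicks in during the moments that used to be silent.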

Another issue we saw in the first version was that sometimes users didn’t get the exact song they were searching for. That was because we automatically selected the first result from YouTube regardless of whether it was the desired result. For Joint DJ, we’ve added the ability for users to use an intuitive browser based UI to easily search for a song and then review the list of music video results from YouTube along with their thumbnails. Once the user finds exactly the song they want to play, they can simply select it to add it to the event’s playlist.

Lastly, we improved the design of the live player view, where an event’s users can watch and listen to the music videos associated with the requests. You’ll see “flash” messages when songs are added that show the artist, title and which “DJ” submitted it. Additionally, we show the next 4-5 upcoming songs in the queue along with their thumbnails on the left side of the player window. Overall, the new look is more colorful and crisp and should be more impressive to the event’s users, keeping them engaged, having fun, and contributing songs to the event. Below is a screenshot of what the live player view looks like:

A feature request we get fairly frequently is the ability to convert an HTML document to a PDF. Maybe it’s a report of some sort or a group of charts, but the goal is the same – faithfully replicate an HTML document as a PDF. If you try Google, you’ll get a bunch of options, from the open source wkhtmltopdf to the commercial (and pricey) Prince PDF. We’ve tried those two as well as a couple of others and have never been thrilled with the results. Simple documents with limited CSS styles work fine, but as the documents get more complicated the solutions fail, often miserably. One conversion method that has consistently generated accurate results has been using Chrome’s “Print to PDF” functionality. One of the reasons for this is that Chrome uses its rendering engine, Blink, to create the PDF files.

So then the question is how can we run Chrome in a way that lets us create PDFs programmatically? Enter Electron. Electron is a framework for building cross platform GUI applications, and it provides this by basically being a programmable minimal Chrome browser running nodejs. With Electron, you’ll have access to Chrome’s rendering engine as well as the ability to use nodejs packages. Since Electron can leverage nodejs modules, we’ll use Gearman to facilitate communicating between our Electron app and clients that need HTML converted to PDFs.

The code as well as a PHP example are below:

As you can see it’s pretty straightforward. And you can start the Electron app by running “./node_modules/electron/dist/electron .” after running “npm install”.
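
The core conversion step can be sketched roughly as follows. This is an illustration under stated assumptions rather than the app’s actual code: it assumes a recent Electron version in which `loadURL` and `webContents.printToPDF` return promises, it leaves out the Gearman worker wiring entirely, and `htmlToPdf` is a hypothetical name.

```javascript
// `win` is expected to be an Electron BrowserWindow created elsewhere in the
// main process. It is passed in so this function has no direct dependency
// on the electron module.
async function htmlToPdf(win, html) {
  // A data: URL avoids writing the HTML to a temporary file first.
  const url = 'data:text/html;charset=utf-8,' + encodeURIComponent(html);
  // Resolves once the page has finished loading.
  await win.loadURL(url);
  // Resolves with a Buffer containing the generated PDF data.
  return win.webContents.printToPDF({ printBackground: true });
}

// Example wiring in the main process (not executed here):
//   const win = new BrowserWindow({ show: false });
//   const pdf = await htmlToPdf(win, '<h1>Report</h1>');
//   fs.writeFileSync('report.pdf', pdf);
```

A worker would call something like this for each incoming job and send the resulting buffer back to the client.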

One caveat is you’ll still need an X Windows display available for Electron to connect to and use. Luckily, you can use Xvfb, which is a virtual framebuffer, on a server since you obviously won’t have a physical display. If you’re on Ubuntu you can run the following to grab all dependencies and set up the display:

On one of the projects I am working on, I had the following problem: I needed to create an aggregate temporary table in the database from a few different queries while still using Doctrine2. I needed to aggregate the results in the database rather than in memory, as the result set could be very large, causing the PHP process to run out of memory. The reason I wanted to still use Doctrine to get the base queries was that the application passes around a QueryBuilder object to add restrictions to the query, which may be defined outside of the current function; every query in the application goes through this process for security purposes.

After looking around a bit, it was clear that Doctrine did not support (and shouldn’t support) what I was trying to do. My next step was to figure out how to get an executable query from Doctrine2 without ever running it. Doctrine2 has a built in SQL logger interface which basically lets you listen for executed queries and see what the actual SQL and parameters were for each executed query. The problem I had was that I didn’t want to actually execute the query I had built in Doctrine; I just wanted the SQL that would be executed via PDO. After digging through the code a bit further, I found the routines that Doctrine uses to actually build the query and parameters for PDO to execute; however, the methods were all private. I came up with the following class to take a Doctrine Query and return a SQL statement, parameters, and parameter types that can be used to execute it via PDO.

In the ExampleUsage.php file above I take a query builder, get the runnable query, and then insert it into my temporary table. In my circumstance I had about 3-4 of these types of statements.

If you look at the QueryUtils::getRunnableQueryAndParametersForQuery function, it does a number of things.

First, it uses Reflection classes to access private members of the Query. This breaks a lot of programming principles, and Doctrine could change the inner workings of the Query class and break this class. It’s not a good programming practice to be flipping private variables public, as generally they are private for a reason.

Second, Doctrine aliases any alias you give it in your select. For example, if you do “SELECT u.myField as my_field”, Doctrine may realias that to “my_field_0”. This makes it difficult if you want to read out specific columns from the query without going back through Doctrine. This class flips the aliases back to your original alias, so you can reference ‘my_field’ for example.

Third, it returns an array of parameters and their types. The Doctrine Connection class uses these arrays to execute the query via PDO. I did not want to reimplement the binding of parameters and types to PDO, so I opted to pass it through the Doctrine Connection class.
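
The alias step from the second point is easy to illustrate on its own. The sketch below is a language-agnostic illustration written in JavaScript, not the PHP class itself. It assumes you already have a map from Doctrine’s generated column alias to the alias you wrote (how that map is obtained from Doctrine is outside this sketch), and `restoreAliases` is a hypothetical name.

```javascript
// Rewrites generated aliases in a SQL string back to the original ones.
// `aliasMap` maps generated alias -> original alias,
// e.g. { my_field_0: 'my_field' }. Aliases not in the map are left alone.
function restoreAliases(sql, aliasMap) {
  return sql.replace(/\bAS\s+(\w+)/gi, (match, alias) =>
    Object.prototype.hasOwnProperty.call(aliasMap, alias)
      ? `AS ${aliasMap[alias]}`
      : match
  );
}
```

Matching on the whole alias token, rather than doing a plain substring replace, avoids accidentally rewriting a longer alias that merely starts with the same characters.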

Overall, this was the best solution I could find at the time for what I was trying to do. If I had been ok with running the query first, capturing the actual SQL via an SQL Logger would have been the proper and best route to go; however, I did not want to run the query.