Intro

This article isn’t meant to discuss what web scraping is, or why it’s valuable to do. What I intend to focus on instead is how modern web application architecture is changing how web scraping can, and sometimes must, be performed. A nice article discussing traditional web scraping by Vinko Kodžoman just appeared in Hacker Newsletter #375. His article tipped my motivation to write this.

Traditional Scraping

Up until recently, data was typically harvested by parsing a site’s markup. Parsing libraries and browser automation frameworks allow this to be achieved in various ways, and I’ve used both Beautiful Soup and Selenium in the past to achieve what I needed. In his article, Vinko discusses another library, lxml, which I’ve not tried. His explanation of lxml and how it interacts with the DOM gives a good general understanding of the way scraping is performed. Essentially, your bot reads the markup and extracts the relevant data for you.

Enter Front-End Frameworks & REST APIs

Background

Modern web applications often implement some sort of front-end framework, such as Angular, React, or Vue. These JavaScript frameworks communicate with a web API of some sort to retrieve data. This means that multiple requests are made before the markup containing useful data is actually built. Typically those requests take parameters so that only a subset of the data is returned to you in the markup.

Problem

Earlier this year I was tasked with developing a scraper to collect data from a website which used Angular 2 on its front end. The requirement was that it collect business listings from each region on the site; however, the listings were only shown based on the boundaries of a Google Maps frame users were able to drag around. There was no way to iterate through the markup via pagination, and no directory/list of all existing regions. Traditional scraping of the markup wouldn’t work, because the content was rendered dynamically. I couldn’t figure out how to use BS4 for this task, and I was stumped.

Solution

I began to browse the site in question and examine the requests it made via Postman. What I discovered was that life is much easier under this new type of web app architecture. AJAX calls were made via GET requests to return the data I was looking to collect. After spending a little time browsing the site to map the REST API’s endpoints and their parameters, harvesting the data through their own API was substantially easier than it ever was in the days of parsing markup. Data was returned in JSON format and could be added directly into a Mongo collection. Essentially, all the data could be copied from the website in a matter of half a minute or so.
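That workflow can be sketched in a few lines of Python. The endpoint and parameter names below are invented stand-ins for whatever the site’s front end actually called; the real ones come from watching the browser’s network tab or Postman:

```python
import json
import urllib.parse
import urllib.request

# Invented endpoint and parameter names -- stand-ins for whatever the
# front end actually calls; discover the real ones in your browser's
# network tab or Postman before scraping anything.
API_BASE = "https://example.com/api/listings"

def region_url(region_id, page=1):
    """Build the GET URL the front-end framework would have requested."""
    query = urllib.parse.urlencode({"region": region_id, "page": page})
    return "{}?{}".format(API_BASE, query)

def fetch_region(region_id, page=1):
    """Hit the site's own REST API and return the parsed JSON listings."""
    with urllib.request.urlopen(region_url(region_id, page), timeout=30) as resp:
        return json.load(resp)

# The returned JSON documents can go straight into a Mongo collection,
# e.g. with pymongo:  MongoClient().scraper.listings.insert_many(fetch_region(42))
```

Since the API hands back structured JSON, there’s no parsing step at all between the request and the database insert.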

Cool Story Bro, But Example?

What inspired me to write this article was a project at work. Since paperwork and money are involved, I can’t provide the examples I wish I could. However, here’s an example using a site in the marijuana industry 😉

Example Scenario: You want to know every single place to buy marijuana in the country. Because you love marijuana!

First, browse Leafly.com (a place to find you some marijuana) and check out what’s going on in the background.

You can see a POST request is made each time you change the map, and the request includes the top-left and bottom-right corners of the map frame.

Here is a Python script that makes a POST request to the API endpoint on Leafly with the parameters we just discovered. We’ve ignored sending any headers at all, and set the “Take” parameter to be absurd.
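A stdlib-only sketch of such a script follows. The endpoint path and exact parameter names here are assumptions reconstructed from the request described above, not Leafly’s confirmed API, so verify them against your own dev-tools capture:

```python
import json
import urllib.request

# Assumed endpoint -- reconstructed from the POST observed in the
# browser's network tab, not a documented API.
API_URL = "https://www.leafly.com/finder/searchnext"

def build_payload(nw_lat, nw_lng, se_lat, se_lng, take=100000):
    """Top-left and bottom-right corners of the map frame, plus an
    absurdly large 'take' so a single request returns everything."""
    return {
        "northwestLatitude": nw_lat,
        "northwestLongitude": nw_lng,
        "southeastLatitude": se_lat,
        "southeastLongitude": se_lng,
        "take": take,
    }

def fetch_listings(payload):
    """POST the payload as JSON; no headers beyond Content-Type."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# Usage (not run here): a box roughly covering the continental US.
# fetch_listings(build_payload(49.0, -125.0, 24.0, -66.0))
```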

If this weren’t just a proof of concept, you could do much better by adjusting the coordinates a few times and taking multiple samples. There are a lot of things to change if you were trying to make a legitimate Leafly scraper, I know. I’m just demonstrating that the ability to grab over 4k listings in a few seconds is pretty neat.
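That sampling idea can be done by tiling one huge bounding box into a grid of smaller boxes and issuing one request per tile. A minimal sketch:

```python
def tile_bounds(nw_lat, nw_lng, se_lat, se_lng, rows, cols):
    """Split a map bounding box (top-left / bottom-right corners) into a
    rows x cols grid of smaller boxes -- one API request per tile."""
    lat_step = (nw_lat - se_lat) / rows   # latitude decreases going south
    lng_step = (se_lng - nw_lng) / cols   # longitude increases going east
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append((
                nw_lat - r * lat_step,          # tile's top edge
                nw_lng + c * lng_step,          # tile's left edge
                nw_lat - (r + 1) * lat_step,    # tile's bottom edge
                nw_lng + (c + 1) * lng_step,    # tile's right edge
            ))
    return tiles

# Each tuple is (nw_lat, nw_lng, se_lat, se_lng) for one smaller request,
# which keeps any single response from being silently truncated.
```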

Conclusion

It’s getting easier to scrape large amounts of data when front-end frameworks talk to APIs with no authentication, especially if there’s no limit on the request size.

P.S. If you’re ethical, you’ll obey the terms of service posted on any site. You’ll then conclude that running the provided proof of concept is wrong (don’t do it).

It’s been about a year, a little over actually, since I started work on my main side project. The app is a motocross track directory, which isn’t a new idea, but I felt existing track directories were lacking a lot of features. This led me to create MapMoto.

An Idea

I ride motocross a lot; not as much as a few years ago, but a lot. I’m always looking up weather before I ride, looking for hotline numbers to call to confirm riding days, and looking for new tracks altogether, especially when traveling. I wrote down everything I wished a motocross track directory would have, and came up with the following list.

- A really good map, one where I could just drag it around and see every track easily, without typing in a zip code and having some limited radius displayed, or seeing a list of tracks for a given area and having to click on each one to see its location on a map.

- The ability to see how far away from me (drive-time wise) a track is, without opening up a Google Maps tab on my own.

- The ability to quickly check what days a track is open.

- The ability to see the weather forecast for the week, to cross-reference with the days a track is open.

MapMoto Alpha

I began the project last year in CodeIgniter 3. That’s right, in 2015 I STARTED a project in CI. Those who’ve been around the PHP community are probably wondering what I was thinking, as the framework is widely criticized for being dated. I was simply taking a shortcut. I’d developed two applications for clients using CI and the Google Maps API, so I felt I could get started more quickly. I jumped right into coding, and said to hell with planning ahead.

After 3 or 4 weeks of work I had a working app for adding motocross tracks into a simple MySQL database I’d designed. It was unstyled except for Bootstrap’s default look, but worked well enough for me to spend many weeks adding hundreds of tracks across the country.

I constantly improved things to make my work easier, such as dropping a map pin based on a track’s address, then allowing that pin to be dragged and syncing the new location with the database (lots of tracks are only kinda near their listed address). I developed user and administrative roles, and built user profile pages with user settings pages to match. Users could upload avatars, link to their social media, and be granted permission to manage tracks; the scope of this project crept larger and larger. Eventually I’d styled the site to a look I was fairly happy with, too.

After implementing the weather forecast system, I told myself enough was enough. This was my side project, I wanted to learn something new, and I should learn modern practices. During the entirety of development I’d felt stressed that I was stuck working in CodeIgniter. I knew it well enough already that this project seemed like work, which didn’t seem fun. I just wanted my project finished.

MapMoto Beta

I returned from a trip to Peru a few weeks before my girlfriend, who was there for work. I told myself that during those few weeks of an empty house, with all hours free to stay up coding, I’d finally convert this damn thing to a framework I was excited to work with. There was a strong temptation to switch the project over to Django, as I’m comfortable writing Python but haven’t used a Python web framework. I’d also been reading an Express.js book I’d impulsively purchased on Amazon one day, and thought maybe I’d go full JS.

Ultimately I decided to stick with PHP so as not to have to rewrite everything outside of the HTML/CSS/JS. Laravel 5 seemed like a solid choice for a modern framework, with good documentation and an active community. The conversion from CodeIgniter 3 to Laravel 5 took me roughly 3 weeks, which was surprisingly fast. Laravel’s Eloquent ORM and Blade templates allowed me to clean up the code base dramatically. Since the conversion, implementing new features has gone much more quickly than CI would have allowed, and working on my side project has become fun again.

Beta Launch

After deciding it was time to quietly launch the site, I opened it up to unregistered users and allowed new users to register. I posted on /r/motocross for some feedback on what other riders thought of the initial version. I know of larger motocross communities online, but decided to go with a low test volume at first. The initial reaction from everyone was very positive. I plan to continue to work on this project slowly, and have a larger official launch at some point in the future. It’s been a great learning experience.

Update April 10th, 2017: The plugin developer contacted me about this post. Out of respect for the way they handled it (extremely classy), I wanted to put a link to the paid version of their plugin. So, if you actually want support on the unlimited feed, and don’t want to do any hacky tricks, go support their hard work via the link below.

I came across this info somewhat by accident today while working on an XML feed generator for a WooCommerce installation. I’ll often review the code of a couple of plugins with functions similar to what I’m developing. While looking through WooCommerce Google Feed Manager, I guess I found a gremlin.

Save the file, re-zip your plugin, and upload it to your WordPress installation.

If you have over 100 products in your WooCommerce store, then go ahead and generate your Product Feed once more.

The Long Winded Explanation…

If you want to know what the above steps actually do, then this will go into a little more detail. I say a little, because I really didn’t go through the plugin very much; I made this discovery, took a couple of screenshots, and then moved on… I was busy doing real work stuff 🙁

Initially, I just wanted to see how the application structure looked, and how the feed was being generated. I looked at the file structure and opened /includes/application/class-feed-processor.php without looking at anything else first. I noticed the $grmln variable declaration, which led immediately to the discovery of the if statement mentioned above.

But let’s walk through this like we’re trying to hack the plugin, rather than miraculously discovering a variable.

First, take a look at the main plugin file, wp-product-feed-manager.php, and scroll to line 115, where there’s a method define_constants().

It’s not uncommon for constants to be all that’s holding back paid features in free versions of plugins. Definitely far from always, but not uncommonly the case.

Well that’s a problem, because the conditions that make this statement true are “the user doesn’t have a valid license status” and “the post count is over 99”.

All you really need to do is remove the break;, and then, as one would expect, the loop no longer breaks when the unlicensed-user product limitation is reached. You could likely also remove this limit by bumping WPPFM_FEED_P_LMTR to an absurdly high number, or by setting $grmln equal to count($data)… maybe we could even define wppfm_lic_status ourselves as valid and cruise right past this block as licensed users. Whatever… I didn’t explore these options or play around much in that regard, so others will have more information for me, I’m sure. For those of you who wanted to know how the tl;dr version of this post accomplished removing the limitation, now you do.

Disclaimer

Developers work long and hard on these plugins, and if you’re going to use features which extend beyond the free versions, it’s best you just buy them. Sometimes I’ll publish interesting ways of extending functionality in the free version, but this is NOWHERE CLOSE to the benefits of actually obtaining the paid version. Updates will break these little hacks, and they may change functionality here and there in unexpected ways. Not to mention it’s always nice to get product support from the developer, which a license will get you, and hacking will not 🙂

UPDATE 12/5/2016:
If you’re going to attempt to integrate this into the WordPress platform, please consider using my WP Drinking Age Plugin.

Background

So outside the normal grind, I’ve been working on a website for a tequila brand. After a meeting with marketing, I gathered it was important to add a drinking-age gateway to the site. You see some type of these gateways on just about every alcohol brand’s site. I asked if they’d prefer to simply ask “Are you of Legal Drinking Age?” and have “Yes/No” buttons determine a user’s fate (1)(2), or if they’d rather have the user input their birthday (3). Apparently (and I’m not a business guy or a lawyer, so don’t comment and argue this with me) the yes/no gateways hold slightly less legitimacy than the ones where a user inputs their birthday to enter the site.

These gateways are really just about putting responsibility into the hands of the user, within reason. Is it easier to lie with a yes/no gateway? Sure. Is it impossible to lie about your birthday? Definitely not. So a birthday input is reasonably easy to implement, but not impossible to fool. If a user is going to lie, they’re going to lie; they’re not going to turn off JavaScript to get around the gateway. This is why a client-side gateway seemed like the most reasonable approach.

Solution

A JavaScript gateway will allow easy indexing by search engines, because crawlers won’t run the JavaScript that restricts the content in the first place. If we were to have PHP check for the presence of an age-verification cookie, search engine crawlers would not have this cookie and would not see/index the content. We could work around that, I’m sure, but there’s no need.

Not all countries have the same drinking age, so solutions where the gateway is based on a static entry age aren’t viable. Both an age and a location must be gathered by the gateway to determine whether a user is eligible to enter the site. Tequila is legal in Mexico 3 years earlier than in California. Sites that implemented this type of gateway (1) seemed to POST location values with country codes (“US”, “CA”, etc.), so they were authenticating server-side with these codes… ugh
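The eligibility check itself is simple in any language. Here it is sketched in Python (my actual gateway was client-side JavaScript); the country table is deliberately tiny and illustrative, and unknown countries fall back to a conservative 21:

```python
from datetime import date

# Deliberately tiny, illustrative table -- a real gateway needs an entry
# for every country it serves. Unknown countries fall back to 21.
DRINKING_AGE = {"US": 21, "MX": 18, "DE": 16, "GB": 18}

def age_on(birthday, today):
    """Full years between birthday and today."""
    years = today.year - birthday.year
    # Subtract one if this year's birthday hasn't happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years

def may_enter(birthday, country_code, today=None):
    """True if the visitor meets their country's drinking age."""
    today = today or date.today()
    required = DRINKING_AGE.get(country_code, 21)
    return age_on(birthday, today) >= required
```

Keeping the table client-side avoids the server round trip with those POSTed country codes, at the cost of being exactly as trustworthy as any other honor-system check.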

I’d Googled long enough to determine that the few hours spent writing this script would likely be more fun than continuing to search for an existing solution to work with, since nothing jumped out at me.

Easy Setup

I’ve been meaning to play with honeypots for quite some time, and if I’d given it just a little more research, I’d have started much sooner. That’s because shortly after deciding on Glastopf as the first honeypot on my list to try out, I came across MHN, an open source project by ThreatStream.

The Modern Honeypot Network (MHN) not only makes launching honeypots insanely easy, but also serves as a nice way of monitoring multiple honeypots. Digital Ocean Droplets seemed like a cheap and safe way of getting started, and I quickly found this post by Lenny Zeltser, which provides pretty good directions for anyone wanting to do this themselves.

My initial plan to create a single Glastopf installation evolved to include two more honeypots, one being Dionaea, and the other Wordpot.

Results So Far

After only 2 days of attacks there has certainly been a lot recorded, but I’ve not had the time to properly look into any of it yet. The most prominent port for probes seems to be 5060; looking for phone system (SIP) vulnerabilities, I assume. Dionaea has yet to capture any binaries, and Wordpot has been probed only 4 times.

I did do port scans on a few of the attacking IP addresses and have seen a few older versions of Windows (2003, XP) with open VNC ports…

Further

With more time will come more data, and then the real fun begins.

The REST API included in the MHN framework makes sending the data to other applications simple. You can view the data for my honeypots over the last 24 hours here.
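For example, pulling recent attack sessions into another application might look like the sketch below. The /api/session/ path and parameter names are assumptions from memory, so check your own MHN server’s API documentation before relying on them:

```python
import json
import urllib.parse
import urllib.request

def build_url(server, api_key, hours_ago=24):
    """Query URL for recent attack sessions. The '/api/session/' path and
    parameter names are assumptions -- verify against your MHN server."""
    query = urllib.parse.urlencode({"api_key": api_key, "hours_ago": hours_ago})
    return "{}/api/session/?{}".format(server.rstrip("/"), query)

def fetch_sessions(server, api_key, hours_ago=24):
    """Fetch and parse the JSON list of sessions from the MHN server."""
    with urllib.request.urlopen(build_url(server, api_key, hours_ago), timeout=30) as resp:
        return json.load(resp)
```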

UPDATE 10/4/2016

I’ve stopped running my droplets recently. Other projects have taken up the bulk of my time, and I no longer have the time to dedicate to monitoring them. The plugin I was using to connect this site and the Modern Honeypot Network server can be found on my GitHub here.

WordPress makes up a large percentage of the web. As I’m writing this, web development firms all over the world are churning out WordPress sites for their clients. Some of these installs are vanilla and basic, yet some come with exceedingly complicated plugin/theme combinations. WordPress’s ease of use is a double-edged sword. The positive side: a developer may complete a feature-rich, members-only website in one day. The negative: a multitude of plugins and code snippets written by other developers are included in these projects (otherwise they wouldn’t be completed within a day). A good developer will make good choices as to which plugins to use; a novice developer may not be able to tell the difference, and things can become dangerous.

Vulnerabilities in WordPress itself are often handled via automatic updates. If a client has brought an outdated site to your firm, best practice is often to back up the site in its current configuration, update it, and turn updates on from there on out. Plugins and themes can be harder to manage as far as security is concerned, given that developers work on these projects for free, have their own lives, and may leave the projects behind altogether.

OWASP WordPress Vulnerability Scanner Project

The Vulnerability Scanner Project is a black-box testing script for WordPress installations. A full description can be found on the project’s OWASP Wiki.

The scanner is a PHP script that checks a multitude of things you’d otherwise have to check manually. It does this by reading various bits of the site’s source to determine the core version, theme information, and plugin information, which it references against wpvulndb.com’s WordPress vulnerability database. Because this is a black-box testing method, it alerts you to things that any visitor to your site may potentially discover. This differs from plugins such as Wordfence, which provide you with insider information (white box). I must disclaim by saying that a script such as this is a valuable tool for a developer to quickly check clients’ sites, or their own, for obvious problems, but keep in mind it’s far from a comprehensive security audit.
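As a tiny illustration of the black-box idea (this is not the scanner’s actual code), the WordPress core version can often be read straight out of the page source’s generator meta tag, which WordPress emits by default:

```python
import re

# WordPress emits e.g. <meta name="generator" content="WordPress 4.1.6" />
# unless a theme or plugin strips it out.
GENERATOR_RE = re.compile(
    r'<meta name="generator" content="WordPress ([\d.]+)"', re.IGNORECASE
)

def detect_wp_version(html):
    """Return the WordPress version advertised in the page source,
    or None if the generator tag has been removed."""
    match = GENERATOR_RE.search(html)
    return match.group(1) if match else None
```

Anything this finds, any visitor’s script can find too, which is exactly why the scanner’s results matter.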

Example Time

If you’re a Debian user such as myself, you can get the dependencies for this with:

sudo apt-get install php5 php5-cli php5-curl php5-json git

Create a directory to store the wp-scanner code and navigate into it. Then clone the repo,

git clone https://github.com/RamadhanAmizudin/Wordpress-scanner.git

I went ahead to the WordPress Release Archives and downloaded version 4.1.6, which has known issues (just about every old version has issues). For this example, the site is installed on my local machine’s Apache server in a directory titled /epm, and there are no other plugins or themes installed on the test site.

Usage of the WordPress Vulnerability Scanner is fairly simple; for a list of all options, just give the -h flag. The example below scans the URL “localhost/epm” with the default settings as defined by the -d flag, and references its findings against wpvulndb.com as defined by --wpvulndb.

A WordPress development workflow can be difficult to optimize, especially when working in a team environment. You won’t have trouble finding a few well-written articles online concerning version control setups and project structures. Setting up a new project is smooth when you’re using a Git repository of some flavor for your theme content, and Database Sync or a similar plugin to sync your local and live databases.

What about the case in which you inherit an existing WordPress site?

It’s not uncommon to need to pull all of a site’s plugins, settings, and media content to a local development environment, at least initially. This is typically part of our “onboarding” process at work.

Simply syncing themes and databases as you typically would in your development workflow is not enough in this instance. With a plugin known as All-in-One WP Migration, you’re able to export an entire WordPress instance to a single file with an extension .wpress. You’re then able to turn right back around and import that file to another WordPress installation to create a clone.

Here is what the steps look like to export a production site to a file.

At this point, depending upon the size of the backed-up file, a blank installation of WordPress may be all that’s necessary to clone the production site.

If the size of your .wpress file exceeds 512MB, you will be prompted to purchase the Unlimited Extension of All-in-One WP Migration. If you’re inheriting a site that’s been in production for a while, it’s likely that the backup file is over this small size limit (see a fix for this below).

Hacking the plugin seemed like a reasonable thing to try before making the $59 purchase of the Unlimited Extension (which comes with lifetime updates and unlimited support).

Go ahead and open up /wp-content/plugins/all-in-one-wp-migration/constants.php

Lines 199:201 define the file upload size limit; there’s a nice comment there indicating as much. If you’d like to Ctrl+F “size”, it should take you right to it.

Save the file and navigate back to the “import” function for the All-In-One Migration Plugin. The file upload limit now reads 4GB.

The plugin will no longer reject your large file uploads.

Note that this plugin does have regular updates, and each update will reset your file upload limit. Because I use this plugin to import existing sites to my local development environment, and not as part of my regular workflow, it’s not much of a problem.


Within the same week, my girlfriend and I both found ourselves without phones. Her Galaxy took a soaking in the ladies’ room, and my late Nexus 5 had ceased to charge despite all repair efforts. So now I find myself with two fresh Nexus 5s, a white one for my girlfriend and a black one for myself, running Android Lollipop 5.0.1.

I’m going to walk through the process of what I’ve done setting up the devices. They are almost completely open source, with additional security and privacy features to be installed in Part 2. This is written as a fairly high-level overview of the process, so I’ll try not to get into the nitty-gritty. This isn’t intended as a walk-through.

Since it had been a while since I played with my phone, I had to install some of the Android Developer Tools, adb and fastboot. In my case that meant a quick,

In order to use these tools, USB debugging had to be enabled. Developer mode has to be turned on first by navigating to Settings>>About Phone>>Build Number and tapping the Build Number entry rapidly (I lost count of how many taps it takes). This enables Developer Mode and displays “Developer Options” in the Settings menu. After that, USB Debugging needed to be checked for this process to continue.

With the phone turned on, the following commands reboot the phone to the bootloader and unlock it.

sudo adb reboot bootloader
sudo fastboot oem unlock

It’s that easy, gotta love Nexus devices for that.

Flashing the recovery was the next step. I’ve used both TWRP and CWM before. Initially, I attempted to flash TWRP; however, it just wouldn’t stick, and the phone would reboot to a stock recovery after each flash. I tried a few versions (2.8.0.1 through 2.8.0.4). After researching various potential solutions, I bagged it and decided to flash CWM. This stuck the first time (download here). I don’t find myself in recovery that often, and I’m not married to either… whatever works. I went with the touch version, which as of this writing is version 6.0.4.5.
Once more I rebooted the phone into the bootloader, and then flashed CWM to the Nexus 5 recovery partition.

Flashing the CyanogenMod ROM went slightly differently for each phone. Both processes started with another reboot, this time to the newly flashed recovery partition. The caches are wiped in recovery, then

In the CWM recovery, I was able to install the zip from the sdcard. Oddly, my phone kept telling me the push was successful, yet the zip did not appear on my sdcard. This obstacle was overcome by installing the zip via an adb sideload

sudo adb sideload ./cm-12-20150124-NIGHTLY-hammerhead.zip

After a reboot, the phones were running CM12; however, they weren’t rooted. Booting back into recovery, the SuperSU zip must be placed on the file system.

sudo adb push ./UPDATE-SuperSU-v2.16.zip /sdcard

From there, it can be installed via the recovery. After downloading SuperSU from the Play Store (via an APK downloader), the phone is rooted and ready for Part 2… coming soon

Although I’ve not actually been inside yet, I’m on the email list for Sacramento’s Hacker Lab. A few weeks ago they put out an email alerting local developers that their new location in Rocklin was hosting an event for Intel’s RealSense 3D camera technology. It’s not really my field, but I love learning new things, and I love me a good conference, so I applied. A few weeks later I got a call from an event organizer, and they were nice enough to grant me a spot.

Follow up to come.

UPDATE

The event was very fun, and I can’t thank them enough for the food and the presentation, oh, and the camera 🙂

I was really impressed with the technology, but a bit bummed that I needed Windows 8.1, and a bit bummed that the JavaScript support is the least developed. The folks who knew the Unity engine seemed to be really going to town.