I noticed something odd on the library website the other day: a search of our site displayed a ton of spam in the Google Custom Search Engine (CSE) results.

But when I clicked on the links for those supposed blog posts, I’d get a 404 Page Not Found error. These spammy blog posts didn’t seem to exist except in search results. At first I thought this was some kind of fake-URL generation visible only in the CSE (similar to fake referral URLs in Analytics), but regular Google also saw these spammy blog posts as being on our site if I searched for an exact title.

Still, Google was “seeing” these blog posts that kept returning 404 errors. When I looked at the cached page, though, I saw that Google had indexed what looked like an actual page on our site, complete with our menu options.

Cloaked URLs

Not knowing much more, I had to assume that there were two versions of these spam blog posts: the ones humans saw when they clicked on a link, and the ones that Google saw when its bots indexed the page. After some light research, I found that this is called “cloaking.” Google does not like this, and I eventually received an email from Webmaster Tools with the subject “Hacked content detected.”

It was at this point that we alerted the IT department at our college to let them know there was a problem and that we were working on it (we run our own servers).

Finding the point of entry

Now I had to figure out whether content was actually being injected into our site. Nothing about the website looked different, and Drupal did not list any new pages, but someone was posting invisible content purely to show up in Google’s search results and build some kind of network of spam content. Another suspicious thing: these URLs contained /blogs/, but our actual blog posts have URLs with /blog/, suggesting spoofed content. In Drupal, I looked at all the reports and logs I could find. Under the People menu, I noticed that a week earlier, someone had signed in to the site with the username of a former consultant who hadn’t worked on the site in two years.

Yikes. So it looked like someone had hacked into an old, inactive admin account. I emailed our consultant and asked if they’d happened to sign in, and they replied Nope, and added that they didn’t even like Nikes. Hmm.

So I blocked that account, as well as accounts that hadn’t been used within the past year. I also reset everyone’s passwords and recommended they follow my tips for building a memorable and hard-to-hack password.

Clues from Google Search Console

The spammy content was still online. Just as I was investigating the problem, I got a mysterious message in my inbox from Google Search Console (SC). Background: in SC, site owners can set preferences for how their site appears in Google search results and track things like how many other websites link to their website. There’s no ability to change the content; it’s mostly a monitoring tool.

I didn’t write that reconsideration request. Neither did our webmaster, Mandy, or anybody who would have access to the Search Console. Lo and behold, the hacker had claimed site ownership in the Search Console.

Now our hacker had a name: Madlife520. (Cool username, bro!) And they’d signed up for SC, probably because they wanted stats for how well their spam posts were doing and to reassure Google that the content was legit.

But Search Console wouldn’t let me un-verify Madlife520 as a site owner. To become a verified site owner, you upload a special HTML file Google provides to your website, the idea being that only a true site owner would be able to do that.

But here’s where I felt truly crazy. Google said Madlife520’s verification file was still online. But we couldn’t find it! The only verification file was mine (ending in c12.html, not fd1.html). Another invisible file. What was going on? Why couldn’t we see what Google could see?

Finding malicious code

Geng, our whip-smart systems manager, did a full-text search of the files on our server and found the text string google4a4…fd1.html in the contents of a JPG file in …/private/default_images/. Yep, not the actual HTML file itself, but a line inside a JPG file. Files in /private/ are usually images uploaded to our slideshow or syllabi that professors send through our schedule-a-class webform — files submitted through Drupal, not uploaded directly to the server.
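If you ever need to run the same kind of hunt, a recursive grep over the files directory will do it. Here’s a self-contained sketch: the directory, the fake image, and the full filename below are stand-ins (the real filename is abridged in this post), so point the search at your own server’s files directory.

```shell
# Recreate the hunt in a throwaway directory; the paths and the
# "google4a4..." filename are illustrative stand-ins.
mkdir -p /tmp/demo_private/default_images

# A "JPG" that is actually PHP text, like the payload Geng found:
printf '<?php /* payload */ ?> google4a4_example.html' \
  > /tmp/demo_private/default_images/photo.jpg

# -r: recurse into subdirectories; -l: print only matching filenames
grep -rl "google4a4" /tmp/demo_private
```

On a real server you’d point grep at the Drupal files directory and search for the verification filename Google reports.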

So it looks like this: Madlife520 had logged into Drupal with an inactive account and uploaded a text file with a .JPG extension to a module or form (not sure where yet). This text file contained PHP code dictating that if Google or another search engine requested the URL of one of these spam blog posts, the site would serve up spammy content from another website; if a person clicked on that URL, it would display a 404 Page Not Found page. Moreover, this PHP code spoofed the Google Search Console verification file, making Google think it was there when it actually wasn’t. All of this was done very subtly — aside from weird search results, nothing on the site looked or felt different. The hope was probably that we wouldn’t notice anything unusual so the spam could stay up for as long as possible.
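The behavior described above boils down to a user-agent check. Here’s a tiny sketch of the cloaking logic only, not the recovered PHP; the user-agent strings and response text are assumptions:

```shell
# Illustrative sketch of user-agent cloaking (not the actual payload):
# search-engine crawlers get spam content; everyone else gets a 404.
serve_page() {
  case "$1" in
    *Googlebot*|*bingbot*) echo "200: spam blog post content" ;;
    *)                     echo "404: Page Not Found" ;;
  esac
}

serve_page "Mozilla/5.0 (compatible; Googlebot/2.1)"  # what Google indexed
serve_page "Mozilla/5.0 (Windows NT 10.0) Firefox"    # what a visitor saw
```

A quick way to test a page for this kind of cloaking is to fetch it twice with curl, once with `-A` set to a crawler’s user-agent string, and diff the two responses.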

Steps taken to lock out the hacker

Geng saved a local copy of the PHP code, then deleted it from the server. He also made the subdirectory the files were in read-only. Mandy, our webmaster, installed the Honeypot module in Drupal, which adds an invisible “URL: ___” field to all webforms; bots keep trying to fill it in and so never successfully log in or submit a form, which may help fend off password-cracking software. On my end, I blocked all inactive Drupal accounts, reset all passwords, unverified Madlife520 from Search Console, and blocked IPs that had attempted to access our site a suspiciously high number of times (oddly, these IPs were all in one block located in the Netherlands).

At this point, Google was still suspicious of our site.

But I submitted a Reconsideration Request through Search Console — this time, actually written by me.

And it seems that the spammy content is no longer accessible; we’re seeing far fewer clicks on those links than before we took these actions.

I’m happy that we were able to curb the spam and (we hope) lock out the hacker in just over a week, all during winter break when our legitimate traffic is low. We’re continuing to monitor all the pulse points of our site, since we don’t know for sure there isn’t other malicious code somewhere.

I posted this in case someone, somewhere, is in their office on a Friday at 5pm, frantically googling invisible posts drupal spam urls 404??? like I was. If you are, good luck!

I know this dead horse has been beaten. But here are some reminders about things that slip through the cracks.

Every once in a while, google the name and alternate names of your organization and check the universal (not personal) results.

Google results page: before. (Note: this is my best approximation; I was too distressed to take a screenshot.)

I did this a while ago and was shocked to discover that the one image that showed up next to the results was of someone injecting heroin into their arm! Oh my god! As it turned out, one of our librarians had written a blog post about drug abuse research and that was a book cover or illustration or something. None of us knew about it because why would we google ourselves? Well, now we google ourselves.

Claim your location on Google+.

Click the “Are you the business owner?” link (pink in screenshot at right). You’ll have to verify before you can make a basic page. But in doing so, you will have some control over the photos that show up next to the place name. For example, I posted some of my better library photographs to our Google+ page, and they soon replaced the heroin arm.

Demote sitelinks as necessary.

Sitelinks are the sub-categories that show up beneath the top search result. In our case, it’s things like ‘Databases’ and ‘How to find books’ — appropriate for a library. But there were also some others, like ‘Useful internet links’ (circa 2003) that were no longer being updated, so once verified as webmasters, we demoted them.

Check out your reviews.

Since place-based search is the thing now, you’d better keep tabs on your Foursquare, Google, and other reviews pages. For one thing, it’s great to identify pain points in your user experience, since we are now trained to leave passive-aggressive complaints online rather than speak to humans. Example: our Foursquare page has a handful of grievances about staplers and people being loud. Okay, so no surprise there, but we’re trying to leave more positive tips as the place owners so that people see The library offers Library 101 workshops every fall when they check in, not Get off the damn phone! (verbatim).

Add to your front-page results.

If there are irrelevant or unsatisfactory search results when you look up your organization, remember that you have some form of control. Google loves sites like Wikipedia, Twitter, YouTube, etc., so establishing at least minimal presences on those sites can go far.

Meta tags.

Okay, so this is SEO 101. But I surprised myself this morning when I realized, oh dear, we don’t have a meta description. The text of our search result was our menu options. Turns out Drupal (and WordPress) don’t generate meta tags by default. You’ll have to stick them in there manually or install a module/plug-in. Also, you’ll want to use Open Graph meta tags now. These give social sites more info about what to display. They look like this:

<meta property="og:title" content="Lloyd Sealy Library at John Jay College of Criminal Justice"/>
<meta property="og:type" content="website"/>
<meta property="og:locale" content="en_US"/>
<meta property="og:site_name" content="Lloyd Sealy Library at John Jay College of Criminal Justice"/>
<meta property="og:description" content="The Lloyd Sealy Library is central to the educational mission of John Jay College of Criminal Justice. With over 500,000 holdings in social sciences, criminal justice, law, public administration, and related fields, the Library's extensive collection supports the research needs of students, faculty, and criminal justice agency personnel."/>

All right, good luck. Here’s hoping you don’t have photos of explicit drug use to combat in your SEO strategy.

P.S. If you use the CUNY Commons, try the Yoast WordPress SEO plugin. It is really configurable, down to the post level.

I’m posting this because none of the other solutions I found through googling fixed our problem. Context: Superfish is a Drupal module for your nav menu that shows submenus on hover. Also, I don’t really know JavaScript/jQuery very well, just enough to fumble around and get solutions.

By default, the Superfish submenus fade into view when you hover on the menu’s title, which is usually a link itself. This is dumb. It’s so slow. In 100% of the usability study sessions I conducted with this default menu animation, the users clicked on the menu’s title right away and got annoyed when they saw the submenu begin to appear just as the browser loaded a new page. Ain’t nobody got an extra 400 milliseconds for a submenu! It should appear on hover instantly. Here’s how I disabled the slow animation.

I went to admin/config/user-interface/superfish (you may have to give yourself the right permissions to configure Superfish) and deleted these files from the “Path to Superfish library” text box:
jquery.hoverIntent.minified.js
jquery.bgiframe.min.js

Most of the other solutions out there say to change the delay or speed values to 1 in lines 86-99 of superfish.js, or to set disableHI to false, but none of those worked for us (although I kept those changes in the .js out of laziness).

Note: We’re using Drupal v.7 and Superfish v.7.x-1.9. It’s faster for us to call the jQuery library from Google than from our own server. As of this blog post, you can see our menu in action at the top of our library website.

Other than the default delay in showing submenus on hover, Superfish is awesome.

5:50pm, Friday

One Friday night a few weeks ago, all was peaceful here in the library. Everyone else had left, the lights were dimmed, and I was wrapping up a few last things before heading out to my weekend. I had done a few tweaks to the site’s dropdown menu CSS, and as I put on my scarf and coat, I casually pushed them from our development server’s Git repository to our remote master repo, then pulled the commits down to our production server.

I reloaded the library webpage.

It had gone completely blank.

As the panic slowly seeped into my bloodstream, I reloaded again and again, even looked at the source code — nothing, not even a space or error message.

Reverting

I had never rolled back any changes before, and the Git cheat sheet I have tacked to my wall didn’t have enough information about undoing mistakes to make me comfortable about rushing off a command. I called our Drupal consultant, who answered his cell phone while driving and spoke in a calming voice about how this is why we use version control, just revert to a safe commit, and it will all be okay.

Our commits are logged and easily readable in an Unfuddle project, so I peered at that, picked out what I knew to be the previous, safe commit, and entered this command:

sudo git reset --hard 5154951c5a3a6a9211ba68268c6159c51cdb5f58

Every StackExchange thread featuring this command also included dire warnings that had previously frightened me away from using it, but if you really do want to wipe out changes in your local repository (in this case, whatever had just been pulled down to our production server), this is how you do it.
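Here’s that rollback in a throwaway repo, with hypothetical file contents, so you can see what --hard actually does. (For commits you’ve already shared with others, git revert, which undoes a commit by adding a new one and so preserves history, is usually the safer choice; reset --hard rewrites your checkout outright.)

```shell
set -e
# Demo of rolling back with reset --hard in a throwaway repo.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email me@example.com
git config user.name Me

echo "working site" > page.html
git add page.html && git commit -qm "safe commit"
safe=$(git rev-parse HEAD)   # record the known-good commit's hash

echo "" > page.html          # the "toxic" change: a blank page
git add page.html && git commit -qm "toxic commit"

git reset -q --hard "$safe"  # discard everything after the safe commit
cat page.html                # back to "working site"
```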

The site came back up as it had been before, after maybe five or ten minutes of downtime. I breathed a little easier and left for the weekend.

Investigating

But why had it gone blank? This was what I had to look into when I got back. If I pulled the most up-to-date commits down from the remote repo again, the site would still blank out. (I knew because I tried, hoping the white screen of death had been a fluke. It wasn’t.) There were 60 changed files in the commit, mostly CSS, PDFs, and files for two non-essential modules. Even weirder, why was the up-to-date dev site totally fine? Until we fixed whatever was wrong on the production site, we’d have to pause development.

Drupal’s help pages have a list of common problems that cause the White Screen of Death. It’s thorough but not complete. We troubleshot at times when site use was low, so a few seconds of downtime wouldn’t be too disruptive. We still couldn’t tell whether it was a server problem or something in those 60 files, so we started with the most likely culprits.

One error in our server logs and Drupal error reports pointed to a required file that had been deleted in the toxic commit, but at first it didn’t seem like that could be the problem. The file belonged to the Admin Views module, which is only visible to logged-in administrative users who want a more tricked-out menu bar at the top of their screens. Why would it bring the whole site down?

In exasperation, I disabled the Admin Views module and tried again to pull down — and voilà, the site was still there, updated, and looked fine. Apparently, that was all I had to do: turn off the module causing problems so the site code wouldn’t quit out on me.

If it were a more essential module (not just one for a few admins’ convenience), we would have had to look into this issue further. For now, having caused enough headaches for myself, I’ll leave well enough alone.

The Lloyd Sealy Library website uses Drupal 7 as its content management system and Git for version control. The tricky thing about this setup is that you can keep track of some parts of a Drupal site using Git, but not all. Code can be tracked in Git, but content can’t be.

Code

theme files (CSS, PHP, INC, etc.)

the out-of-the-box system

all modules

any core or module updates (do on dev, push to production)

Content

anything in the Drupal database:

written content (pages, blog posts, menus, etc.)

configurations (preferences, blocks, regions, etc.)

Here’s our workflow:

Code: Using Git to push code from dev to production is pretty straightforward. I was an SVN gal, so getting used to the extra steps in Git took some time. I used video tutorials made by our consultants at Cherry Hill as well as Lynda.com videos. (For those new to version control: it’s a mandatory practice if you manage institutional websites. Using version control between two servers lets you work on the same content simultaneously with other people and roll out changes in a deliberate manner. Version control also keeps track of all the changes made over time, so if you mess up, you can easily revert your site to a safe version.)
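The dev-to-production flow can be sketched end to end with throwaway repositories: a bare repo standing in for our remote, plus dev and production clones. All the names and the CSS tweak are illustrative, and `git init -b` assumes Git 2.28 or newer.

```shell
set -e
# Sketch of the dev -> production deploy flow with throwaway repos.
tmp=$(mktemp -d)
git init -q --bare -b master "$tmp/remote.git"   # stand-in for the remote repo
git clone -q "$tmp/remote.git" "$tmp/dev"        # "development server"
git clone -q "$tmp/remote.git" "$tmp/prod"       # "production server"

cd "$tmp/dev"
git config user.email dev@example.com
git config user.name Dev
echo "nav { color: navy; }" > style.css          # a CSS tweak made on dev
git add style.css
git commit -qm "Tweak dropdown menu CSS"
git push -q origin master                        # push to the remote repo

cd "$tmp/prod"
git pull -q origin master                        # pull the commit onto production
cat style.css
```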

Content: Keeping the content up to date on both servers is a little hairier. We use the Backup and Migrate module to update our dev database on an irregular schedule with new content made on the production server. The only reason to update the dev database is so that our dev and production sites aren’t confusingly dissimilar. Additionally, some CSS might refer to classes newly specified in the database content. The schedule is irregular because the webmaster, Mandy, and I sometimes test out content on the dev side first (like a search box) before copying the content manually onto the production site.

Why have a two-way update scheme? Why not do everything on dev first, and restore the database from dev to production? We want most content changes to be publicly visible immediately. All of our librarians have editor access, which was one of the major appeals of using a CMS that allowed different roles. Every librarian can edit pages and write blog posts as they wish. It would be silly to embargo these content additions.

Help: A lot of workflow points are covered in Drupal’s help page, Building a Drupal site with Git. As with all Drupal help pages, though, parts of it are incomplete. The Drupal4Lib listserv is very active and helpful for both general and library-specific Drupal questions.

Non-Drupal files: Lastly, we have some online resources outside of Drupal that we don’t want clogging up our remote repository, like the hundreds of trial transcript PDFs. These aren’t going to be changing, and they’re not code. The trial transcript directory is therefore listed in our .gitignore file.
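The exclusion itself is just an entry in the repository’s .gitignore; the directory name below is illustrative, not our actual path:

```
# .gitignore: keep large, unchanging non-code assets out of version control
trial-transcripts/
```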