Short-Circuit Evaluation Is Fast

As I wrote some time ago, taking Michael Dinowitz's advice, I turn off session management for spiders and bots in an effort to cut down on memory usage on the server. See, spiders do not accept client cookies and therefore (on my sites) cannot hold sessions. Consequently, they start a new session for each page request they make. Since sessions take some time to time out, this ends up creating large numbers of session variables that go unused (in proportion to the number of pages spidered).

When I first did this I used a Regular Expression (RegEx) to check for commonly known spider user agents (CGI.http_user_agent). It looked something like:

if (REFindNoCase( "slurp|googlebot|....", CGI.http_user_agent )){

This works great; however, as I added more spiders to the list (as they started hitting my site), I started to fear that it wasn't efficient. If you ever look at how a regular expression works by using a program such as The RegEx Coach, you can actually step through the RegEx path, and you will see that for every character it comes across in the target string, it does a LOT of logic for the regular expression. And the larger the expression, the more logic there is.

This got me thinking about short-circuit evaluation. I am not sure which version brought this on board, but ColdFusion MX 7 has this optimization. It means that the evaluation of a Boolean expression in an IF statement is terminated as soon as it is possible to tell what the result will be. In other words, if an IF statement has several parts and the first one can determine the fate of the IF, then the remaining parts are not evaluated.

For example, in the following statement, only the first value is checked:

if (false AND true AND true AND true){ ... }

Since the "false" makes the statement false no matter what the rest of the arguments are, the remaining "true" values are not even evaluated.
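A quick way to see this for yourself is to give each operand a visible side effect. This is only a sketch: the noisy() helper and the REQUEST.CallLog array are names I made up for the demo, not anything built into ColdFusion.

```cfml
<cfscript>
    // Hypothetical helper: logs that it was called, then returns the value it was given.
    function noisy( label, value ){
        ArrayAppend( REQUEST.CallLog, ARGUMENTS.label );
        return( ARGUMENTS.value );
    }

    REQUEST.CallLog = ArrayNew( 1 );

    // With AND, the first false operand settles the result.
    if (noisy( "A", false ) AND noisy( "B", true ) AND noisy( "C", true )){
        WriteOutput( "This branch is not expected to run." );
    }

    // With OR, the first true operand settles the result.
    if (noisy( "X", true ) OR noisy( "Y", false )){
        // Nothing to do here; we only care about which operands ran.
    }

    // If short-circuiting is in effect, "B", "C", and "Y" never show up in the log.
    WriteOutput( ArrayToList( REQUEST.CallLog ) );
</cfscript>
```

The OR case is the one that matters for the spider check: as soon as one operand is true, nothing after it needs to run.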

I have taken this idea and applied it to the problem of turning off session management for spiders. Instead of using a regular expression, I break out each comparison to its own sub-part of an IF statement:

// Define the application. To stop unnecessary memory usage, we are going
// to give web crawlers no session management. This way, they don't have
// to worry about cookie acceptance and object persistence (except for
// APPLICATION scope). Here, we are using short-circuit evaluation on the
// IF statement with the most popular search engines at the top of the
// list. This will help us minimize the amount of time that it takes to
// evaluate the list.
if (
    (NOT Len(CGI.http_user_agent)) OR
    FindNoCase( "Slurp", CGI.http_user_agent ) OR
    FindNoCase( "Googlebot", CGI.http_user_agent ) OR
    FindNoCase( "BecomeBot", CGI.http_user_agent ) OR
    FindNoCase( "msnbot", CGI.http_user_agent ) OR
    FindNoCase( "Mediapartners-Google", CGI.http_user_agent ) OR
    FindNoCase( "ZyBorg", CGI.http_user_agent ) OR
    FindNoCase( "RufusBot", CGI.http_user_agent ) OR
    FindNoCase( "EMonitor", CGI.http_user_agent ) OR
    FindNoCase( "researchbot", CGI.http_user_agent ) OR
    FindNoCase( "IP2MapBot", CGI.http_user_agent ) OR
    FindNoCase( "GigaBot", CGI.http_user_agent ) OR
    FindNoCase( "Jeeves", CGI.http_user_agent ) OR
    FindNoCase( "Exabot", CGI.http_user_agent ) OR
    FindNoCase( "SBIder", CGI.http_user_agent ) OR
    FindNoCase( "findlinks", CGI.http_user_agent ) OR
    FindNoCase( "YahooSeeker", CGI.http_user_agent ) OR
    FindNoCase( "MMCrawler", CGI.http_user_agent ) OR
    FindNoCase( "MJ12bot", CGI.http_user_agent ) OR
    FindNoCase( "OutfoxBot", CGI.http_user_agent ) OR
    FindNoCase( "jBrowser", CGI.http_user_agent ) OR
    FindNoCase( "ZiggsBot", CGI.http_user_agent ) OR
    FindNoCase( "Java", CGI.http_user_agent ) OR
    FindNoCase( "PMAFind", CGI.http_user_agent ) OR
    FindNoCase( "Blogbeat", CGI.http_user_agent ) OR
    FindNoCase( "TurnitinBot", CGI.http_user_agent ) OR
    FindNoCase( "ConveraCrawler", CGI.http_user_agent ) OR
    FindNoCase( "Ocelli", CGI.http_user_agent ) OR
    FindNoCase( "Labhoo", CGI.http_user_agent ) OR
    FindNoCase( "Validator", CGI.http_user_agent ) OR
    FindNoCase( "sproose", CGI.http_user_agent ) OR
    FindNoCase( "oBot", CGI.http_user_agent ) OR
    FindNoCase( "MyFamilyBot", CGI.http_user_agent ) OR
    FindNoCase( "Girafabot", CGI.http_user_agent ) OR
    FindNoCase( "aipbot", CGI.http_user_agent ) OR
    FindNoCase( "ia_archiver", CGI.http_user_agent ) OR
    FindNoCase( "Snapbot", CGI.http_user_agent ) OR
    FindNoCase( "Larbin", CGI.http_user_agent ) OR
    FindNoCase( "psycheclone", CGI.http_user_agent ) OR
    FindNoCase( "ColdFusion", CGI.http_user_agent )
){
    // This application definition is for robots that do NOT need sessions.
    THIS.Name = "KinkySolutions v.1 {dev}";
    THIS.SessionManagement = false;
    THIS.SetClientCookies = false;
    THIS.ClientManagement = false;
    THIS.SetDomainCookies = false;

    // Set the flag for session use.
    REQUEST.HasSessionScope = false;
} else {
    // This application is for the standard user.
    THIS.Name = "KinkySolutions v.1 {dev}";
    THIS.SessionManagement = true;
    THIS.SetClientCookies = true;
    THIS.SessionTimeout = CreateTimeSpan(0, 0, 20, 0);
    THIS.LoginStorage = "SESSION";

    // Set the flag for session use.
    REQUEST.HasSessionScope = true;
}

Now, regular expressions do short-circuit evaluation also, so the difference here is subtle. Let's say that we get a page request from a non-spider user agent. This is the "worst case" scenario since we have to check every spider value against the string. With a regular expression, we would have to run through the matching process for each of the (N) spider values at each of the (C) characters in the user agent. That's NxC iterations. However, in the compound IF statement, we only have to run the matching process once for each spider for each (U) instance of a user agent. That's just NxU, and since U is always one, it's just N iterations.

Now, this is misleading because, for string comparison, the substrings still have to be matched against many characters in the target string. But I suspect (though I do not know for a fact) that literal matching is faster than RegEx matching since there is no "logic" to literal matching.

If we do get a request from a popular spider (higher in the IF statement, earlier in the regular expression), it's still faster to have the compound IF statement. See, the regular expression still needs to be checked in its entirety for EACH character it comes across in the target string. But the IF statement only needs to run a subset of its sub-parts, and each of those just once.

Of course, in practice, they all run between 0-16ms per page hit. With large iterations (10,000+), the compound IF statement is orders of magnitude faster.
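If you want to run this kind of comparison on your own server, a rough timing harness might look something like the sketch below. The user agent string, the shortened three-value spider list, and the variable names are all assumptions made for illustration; the real list is much longer, and the numbers you get will depend on your machine and ColdFusion version.

```cfml
<cfscript>
    // Assumed test input: an ordinary browser user agent. This is the worst
    // case, since none of the spider values will match.
    userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0";
    iterations = 10000;

    // Time the regular expression approach (shortened pattern).
    startTime = GetTickCount();
    for (i = 1 ; i LTE iterations ; i = i + 1){
        isSpider = REFindNoCase( "slurp|googlebot|msnbot", userAgent );
    }
    regexTime = (GetTickCount() - startTime);

    // Time the compound, short-circuited approach with the same values.
    startTime = GetTickCount();
    for (i = 1 ; i LTE iterations ; i = i + 1){
        isSpider = (
            FindNoCase( "Slurp", userAgent ) OR
            FindNoCase( "Googlebot", userAgent ) OR
            FindNoCase( "msnbot", userAgent )
        );
    }
    findTime = (GetTickCount() - startTime);

    WriteOutput( "RegEx: #regexTime# ms, FindNoCase: #findTime# ms" );
</cfscript>
```

Swapping in a spider user agent lets you compare the "popular spider" case the same way.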

Furthermore, you can make it even faster by creating a temporary string of the LCase() of the user agent and then doing Find() rather than FindNoCase() for each sub-part (not shown above).
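Here is a minimal sketch of that idea, cut down to a handful of spiders. The userAgent and isSpider variable names are just placeholders; the one thing to watch is that every search value has to be written in lowercase, since Find() is case sensitive.

```cfml
<cfscript>
    // Lowercase the user agent once so that every check below can use the
    // case-sensitive Find() instead of FindNoCase().
    userAgent = LCase( CGI.http_user_agent );

    // The same short-circuited OR chain as before, with lowercase search
    // values. The remaining spiders would follow the same pattern.
    isSpider = (
        (NOT Len( userAgent )) OR
        Find( "slurp", userAgent ) OR
        Find( "googlebot", userAgent ) OR
        Find( "becomebot", userAgent ) OR
        Find( "msnbot", userAgent )
    );
</cfscript>
```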

Assume that you develop this for a client who has no idea how to edit the Application.cfc file. Could you simply store all the bots in a database and loop over them?

Wouldn't that make it a bit easier to manage? Or perhaps, instead of checking for bots, just check to see if the user is using a common browser. I assume this would be an easier step considering there are more bots than there are browser types. Just saying...

Do the CFID / CFTOKEN values exist at that point in the first page request? I'd have to double-check on that.

@Jody,

True, there are a number of ways to do this. To be honest, I rarely ever update this logic. I haven't even thought much about it in the last few years. For all I know, there might be more bots hitting my site than I realize :)

Yeah, you could probably just check to see if there is a standard user agent.

Actually I may be wrong here - cookies get set early on in the request lifecycle even if they don't stick. I am pretty sure I need to use my own cookie rather than one of the built-ins. Like:

structKeyExists(cookie, "NEEDCOOKIES")

And set it AFTER that check, meaning your first request will get you a short timeout and your second will get you a regular one. We use a cookie like this anyways to verify people can register/stay signed in so I will need to look at using this with your bot check.

I am a bit confused as to why you would do this in the Application.cfc. So every time a bot visits your site, the application is going to lose session for all users? Where in the Application.cfc are you putting this code? OnRequestStart? Please explain how this does not corrupt the application scope.