Skin Spider : Tremendous Weekend Updates

I have made some excellent updates to Project Skin Spider. In particular, I have just about finished the spidering of galleries, videos, and thumbnails. This was really the "proof of concept" code module, the one that would prove that the Skin Spider idea is viable. Check out the pages spider_gallery_links.cfm, spider_gallery.cfm, and spider_video.cfm. If you look at those, you will see that I am using ColdFusion, Regular Expressions, and some smooth CFHttp calls to find the appropriate links and download them into the system.
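To give a feel for the CFHttp-plus-Regular-Expression approach, here is a minimal sketch, not the actual code from the spider pages. The gallery URL, the list of video extensions, and the variable names are all assumptions made for illustration.

```cfm
<!--- Hypothetical gallery URL; swap in a real page to spider. --->
<cfset galleryUrl = "http://www.example.com/gallery.htm" />

<!--- Grab the raw HTML of the gallery page. --->
<cfhttp url="#galleryUrl#" method="get" timeout="30" />

<cfset html = cfhttp.fileContent />
<cfset links = arrayNew( 1 ) />
<cfset startPos = 1 />

<!---
	Walk the HTML and collect every href value that ends in one of
	the (assumed) video extensions. The first captured group holds
	the link itself.
--->
<cfloop condition="true">
	<cfset match = reFindNoCase(
		"href\s*=\s*[""']([^""']+\.(mpg|mpeg|wmv|avi))[""']",
		html,
		startPos,
		true
		) />

	<!--- A position of zero means no more matches. --->
	<cfif NOT match.pos[ 1 ]>
		<cfbreak />
	</cfif>

	<cfset arrayAppend( links, mid( html, match.pos[ 2 ], match.len[ 2 ] ) ) />

	<!--- Resume the search just past the current match. --->
	<cfset startPos = match.pos[ 1 ] + match.len[ 1 ] />
</cfloop>
```

Links collected this way may be relative, so they would still need to be resolved against the gallery URL before being downloaded.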

As I was building the spidering pages, I realized that as I was testing I kept downloading duplicate information. Even worse than that, I kept spidering pages that I had already spidered and found to have no videos on them. This led me to the use of a "blacklist" database table. Now, anytime that I find a gallery that doesn't have videos or a video that doesn't download properly, I add it to the blacklist. Then, when I am going to add new records to the database, I first check to see if they have already been added (as valid entries) OR if they have been blacklisted. This is making the database much cleaner.
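The "already added OR blacklisted" check could look something like the sketch below. The datasource name, the table names (video, blacklist), and the column names are assumptions for illustration; the real schema may differ.

```cfm
<!--- Hypothetical datasource name and candidate URL. --->
<cfset dsn = "skinspider" />
<cfset videoUrl = "http://www.example.com/videos/clip.mpg" />

<!--- Look for the URL among both the valid entries and the blacklist. --->
<cfquery name="existing" datasource="#dsn#">
	SELECT url
	FROM video
	WHERE url = <cfqueryparam value="#videoUrl#" cfsqltype="cf_sql_varchar" />

	UNION ALL

	SELECT url
	FROM blacklist
	WHERE url = <cfqueryparam value="#videoUrl#" cfsqltype="cf_sql_varchar" />
</cfquery>

<!--- Only add the record if it is neither known nor blacklisted. --->
<cfif NOT existing.recordCount>
	<cfquery datasource="#dsn#">
		INSERT INTO video ( url, date_created )
		VALUES (
			<cfqueryparam value="#videoUrl#" cfsqltype="cf_sql_varchar" />,
			<cfqueryparam value="#now()#" cfsqltype="cf_sql_timestamp" />
		)
	</cfquery>
</cfif>
```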

So far, in testing, it doesn't seem that hot linking is going to be much of an issue. I know that some servers are all hardcore about their bandwidth usage, but so far so good. When I make my CFHttp calls, I am sending a user agent both in the tag attributes and in a CFHttpParam tag. Additionally, I always supply a valid CFHttpParam tag for the CGI variable, http_referer. Basically, with the CFHttp tag, I try my best to mimic the workflow of an actual browser.
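A browser-mimicking CFHttp call along those lines might be sketched like this. The URLs, the user agent string, and the target file path are assumptions; the param types simply mirror the description above, and it is worth checking exactly which headers the remote server ends up receiving.

```cfm
<!--- Hypothetical values for illustration. --->
<cfset galleryUrl = "http://www.example.com/gallery.htm" />
<cfset videoUrl = "http://www.example.com/videos/clip.mpg" />
<cfset userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0" />

<!---
	Request the video as binary data. The user agent goes in both
	the tag attribute and a header param, and the referer is set to
	the gallery page the link was found on.
--->
<cfhttp
	url="#videoUrl#"
	method="get"
	useragent="#userAgent#"
	getasbinary="yes"
	timeout="240"
	result="download">

	<cfhttpparam type="header" name="User-Agent" value="#userAgent#" />
	<cfhttpparam type="cgi" name="http_referer" value="#galleryUrl#" encoded="false" />
</cfhttp>

<!--- Only save the file if the server reported success. --->
<cfif left( download.statusCode, 3 ) EQ "200">
	<!--- Assumes a ./videos directory already exists. --->
	<cffile
		action="write"
		file="#expandPath( './videos/clip.mpg' )#"
		output="#download.fileContent#"
		/>
</cfif>
```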

I am not sure if I am going to need it, but currently, I am also taking a screen shot of the gallery page with the Web Shot command line utility. This is such a nifty little utility that I made a demo for a while back. You basically use CFExecute and give it some arguments and BAM! You get a screen shot of the gallery. Right now, I am going to force that image, and all the other images, to be 100 x 75 pixels in dimension. The video thumbs are not all that size, so there is going to be some distortion. For now, though, I think this should work alright. Perhaps we can update this in Phase II.

I am making sure to leave plenty of execution time for gallery and video downloads. Using the CFSetting tag, I give the RequestTimeOut about 4-5 minutes on the very intensive pages. I am also trying to keep the content stream steady by alternating between spidering a gallery and spidering the videos. Every time I spider a gallery, I then jump over and try to spider the videos from that gallery. This should keep the influx of videos at a good rate.
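For reference, extending the timeout on an intensive page is a one-liner at the top of the template. The value is in seconds, and 300 here is just an example matching the upper end of the range mentioned above.

```cfm
<!--- Give this request up to 5 minutes (300 seconds) to finish. --->
<cfsetting requesttimeout="300" />
```

Any CFHttp call on the page carries its own timeout attribute as well, so a long download needs room in both places.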

I try to keep the rate of content streaming good, but this smells like a job for Asynchronous gateways. I can imagine putting all the videos to spider on a single queue and then just letting the ColdFusion gateways slam them with multiple threads. But the reality of the situation is that this is not an enterprise application. It is meant to be run on a personal computer that happens to be running ColdFusion on it. Of course, there might be ways to speed this up.

Speaking of speeding it up, you might notice that I don't use CFLocation a whole lot. In fact, all of my page changes made during spidering are done via Javascript. The reason this is done is that I want to give the user some visual feedback as stuff happens in the spidering process. For the moment, at least, the spidering is done in a pop-up window and I keep flushing to the screen for user feedback. Of course, once you use the CFFlush tag, you can no longer use CFLocation. That is where Javascript comes into play with the window.location value.
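The flush-then-redirect pattern can be sketched roughly as follows. The gallery_id URL parameter is an assumption for illustration; only the page name comes from the project.

```cfm
<!--- Hypothetical URL parameter identifying the gallery to spider. --->
<cfparam name="url.gallery_id" type="numeric" default="0" />

<cfoutput>
	<p>Spidering gallery #url.gallery_id#...</p>
</cfoutput>

<!--- Push what we have so far to the browser. CFLocation is off the table after this. --->
<cfflush />

<!--- ... the long-running spidering work would happen here ... --->

<cfoutput>
	<p>Gallery done. Moving on to its videos...</p>

	<!--- Since the response has been flushed, redirect on the client instead. --->
	<script type="text/javascript">
		window.location.href = "spider_video.cfm?gallery_id=#url.gallery_id#";
	</script>
</cfoutput>
```

Because each hop is a fresh browser request, every page in the chain gets its own request timeout, which is a nice side effect of doing the redirect on the client.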

I had thought about not making it into a pop-up. I had thought about keeping it server side... but to be honest, I am not sure how to run pages like that. I suppose I could have done a scheduled task, but I think if you use a lot of CFLocation tags pages crap out because of redirect overflows or something. Still, I am sure that there is a way to make this perhaps a bit more streamlined.

Also, since I mentioned the use of CFFlush, it makes me think about the use of ColdFusion frameworks in later phases of the Skin Spider project. As I have said before, this phase, Phase I, is basically just a proof of concept. It is a "worst practices" method of building ColdFusion applications. I am trying to keep it clean, but at the same time, I am trying to make the same mistakes that many new programmers make. Well, mistakes is the wrong word. Basically, I am just not making it very "upper level" programming. But that's the whole point of the project - to learn - to take it from low level to high level.

But I digress, back to frameworks. As you can see in the code, I am using a lot of CFFlush tags. I think a lot of frameworks have problems with the CFFlush tag because they build their content templates from the inside out, and I think CFFlush conflicts with this idea. I am very interested in getting some feedback from framework people on the next phase of development.

One funny thing about this batch of updates is that I realized that I have not made any way to update items in the database. The DatabaseService.cfc ColdFusion component only has ways to add new records and delete existing records. I have to come up with a way to update records.

Also, one final note, going back to the asynchronous processes. Right now, the application is designed to run one spidering process at a time. If I were to run more than one at a time, I would probably have to tighten up my calls to the database. Right now, I try to make fewer calls to the database to increase the page's performance. However, if I knew that I could have two pages spidering at the same time, then I would have to set a lot more database flags to ensure that no two pages were spidering the same item at the same time. This would probably involve some CFLock tags and double-check locking. But, that will have to wait until the next Phase.
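As a rough idea of what double-check locking with CFLock might look like, here is a sketch. Note that it tracks in-progress items in an application-scoped struct rather than with database flags, purely to keep the example short. The struct name, lock name, and video ID are all assumptions.

```cfm
<!---
	Assumes application.inProgress was created as an empty struct
	when the application started up. The video ID is hypothetical.
--->
<cfset videoId = 42 />
<cfset canSpider = false />

<!--- First check: cheap, no lock required. --->
<cfif NOT structKeyExists( application.inProgress, videoId )>

	<cflock name="skinspider_claim_video" type="exclusive" timeout="10">

		<!---
			Second check: another request may have claimed this video
			between our first check and our getting the lock.
		--->
		<cfif NOT structKeyExists( application.inProgress, videoId )>
			<cfset application.inProgress[ videoId ] = now() />
			<cfset canSpider = true />
		</cfif>

	</cflock>

</cfif>

<cfif canSpider>
	<!--- ... download the video here ... --->

	<!--- Release the claim so the item is no longer marked as in progress. --->
	<cfset structDelete( application.inProgress, videoId ) />
</cfif>
```

The lock is only held long enough to claim the item, not for the download itself, so two spidering pages would only ever wait on each other for a moment.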
