Tuesday, February 23, 2010

In this post I will explain the basic principle behind download accelerators (such as DAP and wxDownload Fast) used to download files.

How a download accelerator speeds up downloading

First, let me explain the difference between downloading a file normally and downloading it with a download accelerator. A regular browser opens only one connection to the server to download a file. A download accelerator opens multiple connections to the server, downloads the file in chunks, and joins those chunks once the download completes. The number of chunks depends on the accelerator's configuration. I used wxDownload Fast and configured it to create 3 chunks per file; that is, it opens 3 connections to the server to download any file.

Let's consider an example.

I used wxDownload Fast as the download accelerator and downloaded the file ymsgr8us.exe (Yahoo Messenger), which is 9.9 MB in size.
The initial request sent by wxDownload Fast to the server was an ordinary HTTP request to retrieve the whole file.

From the Content-Length header field of the HTTP response, wxDownload Fast learned the actual size of the file. Once a download accelerator knows the file size, it decides how big each chunk should be based on the number of connections (chunks) it is configured to create. In our case I configured three chunks, so it divides the Content-Length value by 3.
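The division step can be sketched in a few lines of Python. This is only an illustration, not wxDownload Fast's actual code: the function name split_into_ranges is made up, and the file size of 10423800 bytes is an assumed value (3 x 3474600) chosen so that the result lines up with the offsets quoted in this post.

```python
def split_into_ranges(content_length, connections):
    """Return one (start, end) pair of inclusive byte offsets per connection."""
    if connections < 1 or content_length < connections:
        raise ValueError("need at least one byte per connection")
    chunk_size = content_length // connections
    ranges = []
    for i in range(connections):
        start = i * chunk_size
        if i == connections - 1:
            # The last chunk absorbs any remainder left by the integer division.
            end = content_length - 1
        else:
            end = start + chunk_size - 1
        ranges.append((start, end))
    return ranges


# Assumed Content-Length (hypothetical): 3 * 3474600 bytes.
print(split_into_ranges(10423800, 3))
# prints [(0, 3474599), (3474600, 6949199), (6949200, 10423799)]
```

Letting the last chunk take the remainder is one possible choice; a real accelerator may distribute leftover bytes differently.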

It then opens a second connection to the server and sends another HTTP request. Note the Range header field in that request:

Range: bytes=3474600-

This instructs the server to return the file from byte offset 3474600 onwards. Although the download manager requested the entire file on its first connection, it terminates that connection as soon as it has received the bytes up to offset 3474599. Hence it does not waste resources downloading duplicate byte streams.
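Here is a minimal Python sketch of those two behaviours: building a request with an open-ended Range header, and reading only the first part of a response before stopping. The helper names build_range_request and read_up_to are hypothetical, the host example.invalid is a placeholder, and no request is actually sent; an in-memory stream stands in for the first connection.

```python
import io
import urllib.request


def build_range_request(url, start):
    """Build a GET request asking for the file from byte offset `start` onwards."""
    request = urllib.request.Request(url)
    request.add_header("Range", "bytes=%d-" % start)
    return request


def read_up_to(stream, byte_count, block_size=8192):
    """Read at most `byte_count` bytes from `stream`, then stop reading."""
    received = bytearray()
    while len(received) < byte_count:
        block = stream.read(min(block_size, byte_count - len(received)))
        if not block:  # the stream ended before byte_count was reached
            break
        received.extend(block)
    return bytes(received)


# Placeholder URL (assumption); the request object is only built, never sent.
request = build_range_request("http://example.invalid/ymsgr8us.exe", 3474600)
print(request.get_header("Range"))  # prints bytes=3474600-

# A 100-byte in-memory stream stands in for the first connection's response body.
first_connection = io.BytesIO(b"x" * 100)
print(len(read_up_to(first_connection, 40)))  # prints 40
```

In a real client the caller would close the first connection after read_up_to returns, which is what keeps it from downloading bytes that the other connections already cover.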

By now you can imagine what the third HTTP request looks like. Its Range header field follows the same pattern, starting at the first byte of the third chunk.
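To tie the pieces together, the following Python sketch simulates the whole scheme against an in-memory byte string instead of a real server. It is an illustration under assumptions, not the tool's implementation: the names range_headers, fetch_chunk and download_in_chunks are invented here, and the third Range value it prints is inferred from the chunk size rather than taken from a captured request.

```python
from concurrent.futures import ThreadPoolExecutor


def range_headers(chunk_size, connections):
    """Range values for the 2nd..Nth connections (the 1st sends an ordinary request)."""
    return ["bytes=%d-" % (i * chunk_size) for i in range(1, connections)]


def fetch_chunk(source, start, end):
    """Stand-in for one connection: return bytes start..end (inclusive)."""
    return source[start:end + 1]


def download_in_chunks(source, connections):
    """Fetch `source` in `connections` chunks concurrently and join them in order."""
    size = len(source)
    if connections < 1 or size < connections:
        raise ValueError("need at least one byte per connection")
    chunk_size = size // connections
    starts = [i * chunk_size for i in range(connections)]
    # Each chunk stops right before the next one starts; the last runs to the end.
    ends = [start - 1 for start in starts[1:]] + [size - 1]
    with ThreadPoolExecutor(max_workers=connections) as pool:
        # map() yields results in submission order, so the join is in file order.
        chunks = list(pool.map(fetch_chunk, [source] * connections, starts, ends))
    return b"".join(chunks)


print(range_headers(3474600, 3))  # prints ['bytes=3474600-', 'bytes=6949200-']

data = bytes(range(256)) * 41  # made-up file contents for the simulation
print(download_in_chunks(data, 3) == data)  # prints True
```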

Friday, February 19, 2010

Hi folks. In this post I will explain how to build Heritrix from its source code and how to integrate HTMLUnit into Heritrix.

The first questions that come to mind are: what is Heritrix, and why do we need to integrate HTMLUnit into it?

Well, Heritrix is an open-source web crawler. It does not include a page-level DOM model or a JavaScript interpreter. Therefore, if you want to crawl the web looking for malicious scripts or obfuscated JS, you need a JS interpreter. This is where HTMLUnit comes into play: HTMLUnit is a headless browser that includes a JS interpreter.

Steps to Build Heritrix:

1. Download the latest version of the JDK rpm from the Sun website and install it.

2. Set the JAVA_HOME and PATH environment variables in your .bashrc file (~/.bashrc):

export JAVA_HOME=/usr/java/jdk1.6.x.x
export PATH=$JAVA_HOME/bin:$PATH

Now the JDK is ready to be used by Heritrix and Maven.
3. We need Maven 1.0.2 to build Heritrix. Note: we need the Heritrix source so that we can modify it in the future, so do not use the Heritrix binaries available on the Internet; build Heritrix from source. Also note the Maven version, which is very important: do not try the latest version of Maven, as it may not work.
4. Download the Maven 1.0.2 binary and extract it somewhere on disk. Now set the MAVEN_HOME environment variable for it as described below.
Edit the /etc/profile file and insert the following lines before the unset i and unset pathmunge commands at the end of the file:

export MAVEN_HOME=/path_of_Maven_directory
pathmunge $MAVEN_HOME/bin before

Now log out and log in again for the environment variable changes to take effect.

5. Run the maven -v command to check that Maven is running properly.
6. Run the maven jar command. This will create the /root/.maven/repository directory.
7. Now go into the Heritrix directory and run the command maven dist
8. This will create a target subdirectory with many other subdirectories inside it. The target/distribution directory holds the built version of Heritrix.
If the build fails because of a missing dependency JAR file, download that file from the Internet and store it in either the /root/.maven/cache or the /root/.maven/repository/.../jar/ directory.
9. Heritrix is now built successfully. Extract the built version and test Heritrix.
10. Launch Heritrix using the command:
$HERITRIX_HOME/bin/heritrix --admin=LOGIN:PASSWORD
where $HERITRIX_HOME is the location of your untarred heritrix.?.?.?.tar.gz.

Integrating HTMLUnit into Heritrix:

This is a little bit tricky. If you have reached this point, you already have Heritrix, the Sun JDK and Maven.

Follow the steps given below:

Step 1: Download HTMLUnit (I used HTMLUnit 2.5). We don't need the HTMLUnit source code, so download the HTMLUnit binary. We only need its JAR files.

Step 2: Copy all the HTMLUnit JAR files into the lib subdirectory of the Heritrix folder. Do not replace files that are already there; if you replace them, you will need to modify the project.properties file. Only add the files that are not already there.

Step 3: Edit the project.xml file in the Heritrix directory, because we need to tell Heritrix where the HTMLUnit classes can be found. Add a <dependency> tag for each HTMLUnit JAR file.

Step 4: Edit the project.properties file in the Heritrix directory to instruct Maven not to download those dependency files from the Internet, but to look in the local directory instead. The syntax for this can easily be found in the project.properties file itself; simply follow it.

For example:

maven.jar.htmlunit = ${basedir}/lib/htmlunit-2.5.jar

Add an entry for each HTMLUnit JAR file (that is, for each dependency entry made in Step 3).
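If there are many JAR files, Steps 2 and 4 can be scripted. The Python sketch below is one possible way to do it and rests on several assumptions: the function names are invented, "already there" is decided by exact file name only, and the property key is guessed by stripping the trailing version from the file name, so check every generated key against the dependency entries you wrote in project.xml. The demo runs in a temporary directory with made-up JAR names.

```python
import re
import shutil
import tempfile
from pathlib import Path


def copy_missing_jars(htmlunit_lib, heritrix_lib):
    """Copy JARs that the Heritrix lib directory lacks; never overwrite a file."""
    copied = []
    for jar in sorted(Path(htmlunit_lib).glob("*.jar")):
        target = Path(heritrix_lib) / jar.name
        if not target.exists():  # assumption: same file name means "already there"
            shutil.copy2(jar, target)
            copied.append(jar.name)
    return copied


def property_line(jar_name):
    """Build one project.properties line in the style of the example above."""
    # Assumption: the key is the file name minus ".jar" and a trailing "-<version>".
    key = re.sub(r"-\d[\w.]*$", "", Path(jar_name).stem)
    return "maven.jar.%s = ${basedir}/lib/%s" % (key, jar_name)


# Demo with hypothetical file names in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "htmlunit" / "lib"
    dst = Path(tmp) / "heritrix" / "lib"
    src.mkdir(parents=True)
    dst.mkdir(parents=True)
    (src / "htmlunit-2.5.jar").write_bytes(b"htmlunit")
    (src / "commons-io-1.4.jar").write_bytes(b"htmlunit copy")
    (dst / "commons-io-1.4.jar").write_bytes(b"heritrix copy")

    copied = copy_missing_jars(src, dst)
    kept = (dst / "commons-io-1.4.jar").read_bytes()  # must stay untouched

for name in copied:
    print(property_line(name))
# prints maven.jar.htmlunit = ${basedir}/lib/htmlunit-2.5.jar
```

The script only prints the lines; pasting them into project.properties by hand keeps you in control of which entries Maven actually sees.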