I am using SMILA 1.2 on a 32-bit Windows 7 machine. I can follow the steps of the 5-minute tutorial on the Eclipse site up to 'Start indexing job run', but an exception occurs at the step 'Start the crawler'.

My machine is behind a firewall, and I have added the proxy settings to the file \SMILA\configuration\org.eclipse.smila.importing.crawler.web\webcrawler.properties.

When I open the URL
"localhost:8080/smila/jobmanager/jobs/"
in the browser, the status of the job 'crawlSmilaWiki' is shown as 'FAILED'.

The log file contains the following exception:

org.eclipse.smila.importing.crawler.web.WebCrawlerException: org.apache.http.NoHttpResponseException: The target server failed to respond
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.handleCrawlException(WebCrawlerWorker.java:285)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.crawlLinkRecord(WebCrawlerWorker.java:270)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.initiateCrawling(WebCrawlerWorker.java:186)
at org.eclipse.smila.importing.crawler.web.WebCrawlerWorker.perform(WebCrawlerWorker.java:167)
at org.eclipse.smila.workermanager.internal.WorkerRunner.call(WorkerRunner.java:55)
at org.eclipse.smila.workermanager.internal.WorkerRunner.call(WorkerRunner.java:1)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

I have attached the log file for further analysis. Could anyone help me with this issue?

Please replace xxx.xxx.xxx.xxx with the IP address or host name of your proxy host and, if your proxy uses a different port than my squid, also change proxyPort to match your proxy's port number.

Then restart SMILA.
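
The proxy entries themselves are not shown in this thread. As a rough sketch only, the following Java snippet illustrates what a pair of proxyHost/proxyPort entries (the key names mentioned later in this thread) amounts to when read from a properties file. The class name ProxySettingsSketch, the host proxy.example.com and the port 3128 are assumptions made up for the example, and this is not how SMILA itself reads webcrawler.properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.Properties;

// Hypothetical illustration; not part of SMILA.
class ProxySettingsSketch {

    // Builds a java.net.Proxy from properties text containing
    // proxyHost and proxyPort entries. Returns Proxy.NO_PROXY
    // when no proxyHost is configured.
    static Proxy proxyFrom(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read properties", e);
        }
        String host = props.getProperty("proxyHost");
        if (host == null || host.trim().isEmpty()) {
            return Proxy.NO_PROXY;
        }
        String port = props.getProperty("proxyPort");
        if (port == null) {
            throw new IllegalArgumentException("proxyPort is missing");
        }
        // createUnresolved avoids a DNS lookup for the placeholder host.
        InetSocketAddress address =
            InetSocketAddress.createUnresolved(host.trim(), Integer.parseInt(port.trim()));
        return new Proxy(Proxy.Type.HTTP, address);
    }

    public static void main(String[] args) {
        // Assumed example values; replace with your own proxy host and port.
        String text = "proxyHost=proxy.example.com\n" + "proxyPort=3128\n";
        Proxy proxy = proxyFrom(text);
        InetSocketAddress address = (InetSocketAddress) proxy.address();
        System.out.println(proxy.type());
        System.out.println(address.getHostString());
        System.out.println(address.getPort());
    }
}
```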

But I found out something different. It seems that I cannot crawl the SMILA wiki via a proxy (no matter which way it is configured) because SMILA claims that robots.txt forbids it. That is not true: I can crawl it without a proxy. There seems to be a bug in the robots.txt handling. I will create a bug report, but this is not the same problem as the one you reported above.

I think the issue is related, as it is resolved by modifying the value returned by getHostAndPort(final URL url) so that the port defaults to 80.

I faced a similar issue, and my machine is behind a proxy as well. Here is what I did.

After downloading the SMILA 1.2 code, I added proxyHost and proxyPort to webcrawler.properties within the bundle and observed that the changes are not reflected in SMILA.application's webcrawler.properties.

The proxy settings had to be made in SMILA.application's webcrawler.properties to take effect.

The port value is set to -1 in the current build, and this causes the "target server failed to respond" error. After changing UriHelper to use a default of 80, the issue was resolved and I am able to crawl http://wiki.eclipse.org/SMILA
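
For illustration, the -1 port and a fallback of the kind described above could look roughly like the sketch below. It relies on the documented behaviour of java.net.URL: getPort() returns -1 when the URL names no explicit port, and getDefaultPort() returns the protocol default (80 for http). The class PortFallback and its methods are hypothetical names for this example, not SMILA's actual UriHelper code, and the real fix in SMILA may differ.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration; not SMILA's UriHelper.
class PortFallback {

    // URL.getPort() returns -1 when no port is given in the URL,
    // so fall back to the protocol's default port in that case.
    static int effectivePort(URL url) {
        int port = url.getPort();
        return port != -1 ? port : url.getDefaultPort();
    }

    // Returns "host:port" with the default port filled in if needed.
    static String hostAndPort(String spec) {
        try {
            URL url = new URL(spec);
            return url.getHost() + ":" + effectivePort(url);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // No explicit port: falls back to 80 for http.
        System.out.println(hostAndPort("http://wiki.eclipse.org/SMILA"));
        // Explicit port: kept as given.
        System.out.println(hostAndPort("http://localhost:8080/smila/jobmanager/jobs/"));
    }
}
```

Using getDefaultPort() instead of a hard-coded 80 keeps https URLs on 443; hard-coding 80 as described above works for the http wiki URL in the tutorial.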

> After downloading the SMILA 1.2 code, I added proxyHost and proxyPort to
> webcrawler.properties within the bundle and observed that the changes
> are not reflected in SMILA.application's webcrawler.properties.
>
> The proxy settings had to be made in SMILA.application's
> webcrawler.properties to take effect.

General hint: Configuration changes always have to be done in SMILA's
"configuration" folder, not in the bundles directly.
When running SMILA in the Eclipse IDE, the configuration folder can be found
in the "SMILA.application" project.

>
> The port value is set to -1 in the current build, and this causes the
> "target server failed to respond" error. After changing UriHelper to
> use a default of 80, the issue was resolved and I am able to
> crawl http://wiki.eclipse.org/SMILA

After Andreas' fix, crawling through a proxy should now be possible without
code changes when using the nightly build download (or the current trunk when
running in the Eclipse IDE).