Here is the problem: In Windows Vista, this code works fine with Google, Yahoo and Bing, but in Windows 7 and 10 the Google tests fail. By fail I mean: the TWebBrowser object does not show its content for each search result, only for the last term. So, it looks like the test goes straight through the list of terms in less than a second.

macicogna wrote:I've developed a simple VCL App to test the response of Search Engines like Google, Bing and Yahoo. My focus is the performance in terms of seconds/search and traffic that leads to Captcha blocking.

Why are you using TWebBrowser for that? That is a UI control, not well-suited for speed testing. An HTTP client class/API, like Indy's TIdHTTP, would be better suited for that task.

macicogna wrote:In order to "wait" for the result from Search Engines, I've created this method:

Calling ProcessMessages() in a loop is going to affect your timing results. Not to mention it is just plain bad to call it in a loop at all. I would suggest DoNavigate() simply grab the current clock time and then call Navigate() by itself and exit. Let the TWebBrowser::OnDocumentComplete event tell you when the navigation is complete, at which time you can then grab the clock time again and calculate the difference.
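A minimal sketch of that timing idea, written in plain standard C++ rather than VCL so it compiles anywhere. NavigationTimer and its methods are hypothetical names made up for illustration, not part of TWebBrowser; the assumption is that start() is called right before Navigate() and finish() is called from the OnDocumentComplete handler.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper, not part of the VCL: it records when a navigation
// was started and reports the elapsed time once completion is signalled.
class NavigationTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Call right before Navigate(); returns immediately, no waiting loop.
    void start() {
        started_ = Clock::now();
        running_ = true;
    }

    // Call from the completion handler (assumed: OnDocumentComplete).
    std::chrono::milliseconds finish() {
        assert(running_ && "finish() called without a matching start()");
        running_ = false;
        return std::chrono::duration_cast<std::chrono::milliseconds>(
            Clock::now() - started_);
    }

    bool running() const { return running_; }

private:
    Clock::time_point started_{};
    bool running_ = false;
};
```

steady_clock is used here because it never jumps backwards, which makes it a reasonable choice for measuring intervals.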

macicogna wrote:Here is the problem: In Windows Vista, this code works fine with Google, Yahoo and Bing, but in Windows 7 and 10 the Google tests fail. By fail I mean: the TWebBrowser object does not show its content for each search result, only for the last term. So, it looks like the test goes straight through the list of terms in less than a second.

Did you verify with a packet sniffer that Navigate() is actually attempting to contact Google? What you describe sounds like Navigate() is probably failing and exiting immediately.

Can you show your actual test code?

Last edited by rlebeau on Thu Aug 25, 2016 7:52 pm, edited 1 time in total.

It is always nice to exchange ideas with you. Thinking about your questions, I've solved the problem, but here are my answers in order to help other readers.

Why are you using TWebBrowser for that?

As modern Search Engines, like Google, return just a bunch of JavaScript, my first step was to capture "Captcha blocking" by viewing it in a TWebBrowser, in order to inspect the response later and learn how to identify its content as HTML files.

That is a UI control, not well-suited for speed testing. An HTTP client class/API, like Indy's TIdHTTP, would be better suited for that task.

Sure, nice hint. I think I'll implement a complementary version using TIdHTTP to check performance more accurately.

Calling ProcessMessages() in a loop is going to affect your timing results. Not to mention it is just plain bad to call it in a loop at all. I would suggest DoNavigate() simply grab the current clock time and then call Navigate() by itself and exit. Let the TWebBrowser::OnDocumentComplete event tell you when the navigation is complete, at which time you can then grab the clock time again and calculate the difference.

Yes, even with TWebBrowser, the TWebBrowser::OnDocumentComplete approach is a better idea. My concern is that subsequent TWebBrowser::Navigate() calls (inside a loop) might cancel the previous one. I've read about this in other places, and that is where I got the ProcessMessages() approach.

Did you verify with a packet sniffer that Navigate() is actually attempting to contact Google?

Yes, I was using TCPView.

Can you show your actual test code?

Sure. I've uploaded it as an attachment.

Your question about "[...] actually attempting to contact Google?" made me think that this problem might be outside my code, since only Google behaves badly in Windows 7 and 10.

So, I searched for information about Google's integration with web browsers and saw different Google URL patterns. So, I changed mine to this one:

macicogna wrote:As modern Search Engines, like Google, return just a bunch of JavaScript, my first step was to capture "Captcha blocking" by viewing it in a TWebBrowser, in order to inspect the response later and learn how to identify its content as HTML files.

Modern search engines provide REST APIs for performing searches in application code, returning machine-parsable results (usually XML or JSON). You should not be submitting HTML webforms and then scraping the resulting HTML/JavaScript for results.

macicogna wrote:Yes, even with TWebBrowser, the TWebBrowser::OnDocumentComplete approach is a better idea. My concern is that subsequent TWebBrowser::Navigate() calls (inside a loop) might cancel the previous one.

If you use the OnDocumentComplete event (and use it correctly, ie no ProcessMessages() loop), then you won't be able to use a simple Navigate() loop anymore. You will have to wait until OnDocumentComplete is fired before then calling Navigate() again. You will have to break up your code logic into pieces, executing each piece at the proper time.
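To make "breaking the logic into pieces" concrete, here is a rough sketch in plain standard C++. FakeBrowser is a hypothetical stand-in for TWebBrowser (its navigate() and complete() are invented for illustration and are not VCL calls), and SearchRunner is likewise a made-up name. The only point it shows is that the next navigation is started from inside the completion handler instead of from a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the browser: navigate() only records the
// request and returns; complete() plays the role of the control firing
// OnDocumentComplete some time later.
struct FakeBrowser {
    std::function<void(const std::string&)> onDocumentComplete;
    std::vector<std::string> requested;

    void navigate(const std::string& url) { requested.push_back(url); }

    void complete() {
        assert(!requested.empty());
        if (onDocumentComplete) onDocumentComplete(requested.back());
    }
};

// Runs the search terms one at a time. The next navigation is started
// only from inside the completion handler, never from a loop.
class SearchRunner {
public:
    SearchRunner(FakeBrowser& browser, std::vector<std::string> terms)
        : browser_(browser), terms_(std::move(terms)) {
        browser_.onDocumentComplete = [this](const std::string&) {
            ++completed_;
            startNext();
        };
    }

    // The handler captures 'this', so the object must not be copied.
    SearchRunner(const SearchRunner&) = delete;
    SearchRunner& operator=(const SearchRunner&) = delete;

    void start() { startNext(); }
    std::size_t completed() const { return completed_; }
    bool done() const { return completed_ == terms_.size(); }

private:
    void startNext() {
        if (next_ < terms_.size()) {
            // Terms are assumed to be URL-safe already; no encoding is done.
            browser_.navigate("https://www.google.com.br/search?q=" +
                              terms_[next_++]);
        }
    }

    FakeBrowser& browser_;
    std::vector<std::string> terms_;
    std::size_t next_ = 0;
    std::size_t completed_ = 0;
};
```

With two terms, start() issues exactly one request; the second request only goes out after the first completion is signalled, and nothing further is requested once the list is exhausted.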

macicogna wrote:

Did you verify with a packet sniffer that Navigate() is actually attempting to contact Google?

Yes, I was using TCPView.

TCPView is not a packet sniffer. It only shows you active connections, but not the actual data that is being transmitted on those connections. If you Navigate() to multiple URLs on the same server, connections might get reused. Use a real packet sniffer, like Wireshark or Fiddler, to look at the actual HTTP requests.

And "Bingo"! Now it is working! The "/search?" makes the difference in Windows 7 and 10.

Whatever made you think that "https://www.google.com.br/#q=$(q)" would work in the first place? "#" is a bookmark delimiter. Everything after "#" is not actually part of the requested URL itself.

When you navigate to "https://www.google.com.br/#q=bcbj", for example, the web browser will connect to "www.google.com.br" and send a request for "/". The web server will never see "q=bcbj". And Google's "/" page is fairly minimal, which could explain why your Navigate() loop ran so quickly. Only AFTER the response has been fully processed will the web browser look for a bookmark named "q=bcbj" within the HTML, and if found, scroll the display to that position.

When you navigate to "https://www.google.com.br/search?q=bcbj" instead, the web browser will connect to "www.google.com.br" and send a request for "/search?q=bcbj", which is a request for "/search" with "q=bcbj" as its input parameters, thus allowing search results to be queried and returned.
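The difference between the two URLs can be illustrated with a few lines of plain C++. This is only a sketch of the splitting rule, not how any particular browser is implemented; splitRequestTarget and SplitTarget are hypothetical names, and the input is assumed to be the part of the URL that follows the host name.

```cpp
#include <string>

// Result of splitting the part of a URL that follows the host name.
struct SplitTarget {
    std::string sentToServer;   // path plus "?query", if any
    std::string keptByBrowser;  // text after '#', never transmitted
};

// pathAndMore is everything after the host, e.g. "/#q=bcbj".
// Everything from the first '#' onward stays on the client side.
SplitTarget splitRequestTarget(const std::string& pathAndMore) {
    SplitTarget result;
    const std::string::size_type hash = pathAndMore.find('#');
    if (hash == std::string::npos) {
        result.sentToServer = pathAndMore;
    } else {
        result.sentToServer = pathAndMore.substr(0, hash);
        result.keptByBrowser = pathAndMore.substr(hash + 1);
    }
    // An empty path is requested as "/".
    if (result.sentToServer.empty()) result.sentToServer = "/";
    return result;
}
```

For "/#q=bcbj" the server part is just "/" and "q=bcbj" stays with the browser, while for "/search?q=bcbj" the whole string is sent and nothing is held back.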

Our focus with this subject is a freeware Anti-plagiarism software called CopySpider.

Modern search engines provide REST APIs...

Yes, but there are differences between "regular" and API searches [1] and obstacles to use like [2] and [3]. So, we decided to use TWebBrowser with a set of good Search Engines in order to run a few searches, as the user might do by hand, using the response's inner text to harvest our data of interest.

The good news is that there are examples like DuckDuckGo, which offers a partnership option that we will try in the near future.

You will have to wait until OnDocumentComplete is fired before...

I know my code needs a complete redesign, but I would like to ask you, if possible, for any hint about how to "wait" for the document complete event. Are you talking about multi-threading and, maybe, semaphores?

TCPView is not a packet sniffer.

My bad. I was seeing IPs and ports, but no packets.

Whatever made you think that "https://www.google.com.br/#q=$(q)" would work in the first place? "#" is a bookmark delimiter. Everything after "#" is not actually part of the requested URL itself.

I plead guilty! I totally overlooked this "#". On the other hand, thanks for your explanation. Now it is clear to me what happened.

In my defense: I don't know why the wrong URL pattern was working properly with my BDS 2006 and Windows Vista. If it had failed there too, maybe I would not have posted this topic.

Only if you want to search multiple sites at the same time in parallel. But you wouldn't use TWebBrowser for that, since it is a UI control. But you could do it with Indy's TIdHTTP instead, for instance.

What I was referring to earlier about using the OnDocumentComplete event is writing simple event-driven code to run your looping logic, for example: