5

Using Internet Explorer from .NET

5.0 Introduction

Earlier in this book we have looked at
how to read HTML from websites, and how to navigate through websites using GET
and POST requests. These techniques certainly offer high performance, but with
many websites using cryptic POST data, complex cookie data, and JavaScript
rendered text, it might be useful to know that you can always call on the
assistance of Internet Explorer’s browsing engine to help you get the data you
need.

It must be stated though, that using
Internet Explorer to data mine web pages creates a much larger memory footprint,
and is not as fast as scanning using HTTP requests alone. But it does come into
its own when a data mining process requires a degree of human interaction. A
good example of this would be if you wanted to create an automated test of your
website, and needed to allow a non-technical user the ability to follow a
sequence of steps, and select data to extract and compare, based on the familiar
Internet Explorer interface.

This chapter is divided into two main
sections. The first deals with how to use the Internet Explorer object to
interact with the various types of web page controls. The second section
deals with how Internet Explorer can detect and respond to a user interacting
with web page elements.

5.1 Web page navigation

The procedure for including the
Internet Explorer object in your application differs depending on which version
of Visual Studio .NET you are using. After starting a new windows forms project,
users of Visual Studio .NET 2002 should right click on their toolbox and select
“Customize toolbox”, click “COM components” then select “Microsoft Web
Browser”. Users of Visual Studio .NET 2003 should right click on their toolbox
and select “Add/Remove Items”, and then follow the same procedure as mentioned
above. In Visual Studio .NET 2005, you do not need to add the web browser to the
toolbox, just drag the “WebBrowser”
control to the form.

An important distinction between the
Internet Explorer object used in Visual Studio .NET 2002/03 and the 2005 version
is that the latter uses a native .NET class to interact with Internet Explorer,
whereas the former uses a .NET wrapper around a COM (Component Object Model)
object. This creates some syntactic differences between how Internet Explorer is
used within .NET 2.0 and .NET 1.x. The first example in this chapter will cover
both versions of .NET for completeness. Further examples will show .NET 2.0 code
only, unless the equivalent .NET 1.x code would differ substantially.

The first thing you will need to know
when using Internet Explorer is how to navigate to a web page. Since Internet
Explorer works asynchronously, you will also need to know when Internet Explorer
has finished loading a web page. In the following example, we will simply
navigate to
www.google.com and pop up a message box once the page is loaded.

To begin this example, drop an
Internet Explorer object onto a form, as described above, and call it “WebBrowser”.
Now add a button to the form and name it “btnNavigate”.
Click on the button and add the following code:

C#

private void btnNavigate_Click(object sender,
System.EventArgs e)

{

NavigateToUrlSync("http://www.google.com");

MessageBox.Show("page loaded");

}

VB.NET

Private Sub btnNavigate_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnNavigate.Click
    NavigateToUrlSync("http://www.google.com")
    MessageBox.Show("page loaded")
End Sub

We then create the NavigateToUrlSync
method. Note how the C# version differs between .NET 1.x and 2.0. This is because
the COM object expects four optional ref object parameters. These
parameters can optionally define the flags, target frame name, post data and
headers sent with the request. They are not used in this case, yet since C#
(before version 4.0) does not support optional parameters, they have to be
passed in nonetheless.
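The listing for NavigateToUrlSync is not reproduced here, so the following is only a minimal sketch of how it might look in .NET 2.0. It assumes the bBusy flag is cleared in the control's DocumentCompleted event; the class name NavigateForm and the layout code are hypothetical and exist only to make the sketch self-contained.

```csharp
using System;
using System.Windows.Forms;

// Sketch for .NET 2.0. Control and field names follow the text; the class
// name NavigateForm and the layout code are assumptions for illustration.
public class NavigateForm : Form
{
    private readonly WebBrowser WebBrowser = new WebBrowser();
    private readonly Button btnNavigate = new Button();

    // True while a navigation is in progress.
    public bool bBusy = false;

    public NavigateForm()
    {
        WebBrowser.Dock = DockStyle.Fill;
        btnNavigate.Text = "Navigate";
        btnNavigate.Dock = DockStyle.Top;
        Controls.Add(WebBrowser);
        Controls.Add(btnNavigate);

        WebBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
        btnNavigate.Click += btnNavigate_Click;
    }

    // Starts the navigation, then pumps window messages until the
    // DocumentCompleted handler clears the flag (or the form is closed).
    public void NavigateToUrlSync(string url)
    {
        bBusy = true;
        WebBrowser.Navigate(url);
        while (bBusy && !IsDisposed)
        {
            Application.DoEvents();
        }
    }

    private void WebBrowser_DocumentCompleted(object sender,
        WebBrowserDocumentCompletedEventArgs e)
    {
        bBusy = false;
    }

    private void btnNavigate_Click(object sender, EventArgs e)
    {
        NavigateToUrlSync("http://www.google.com");
        MessageBox.Show("page loaded");
    }

    [STAThread]
    private static void Main()
    {
        Application.Run(new NavigateForm());
    }
}
```

With the COM wrapper in .NET 1.x, the Navigate call would instead take four additional ref object arguments (an object variable set to null can be passed for each), and the completion event is named DocumentComplete. Note that on a page with frames the completion event can fire once per frame, so a more cautious version might also check the control's ready state before clearing the flag.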

To finish off the example, don’t
forget to declare the public
bBusy flag.

C#

public bool bBusy = false;

VB.NET

public bBusy As Boolean = false

To test the application, compile and
run it in Visual Studio, then press the navigate button. You should see
something similar to Figure 5.0

Figure 5.0 – Navigating synchronously
to a web page

5.2 Manipulating web pages

An advantage of using Internet Explorer
over raw HTTP requests is that you get access to the DOM (Document Object Model)
of web pages, once they are loaded into Internet Explorer. For developers
familiar with JavaScript, this should be an added bonus, since you will be able
to control the web page in much the same way as if you were using JavaScript
within a HTML page.

The main difference however, between
using the DOM in .NET versus JavaScript, is that .NET is a strongly typed
language, and therefore you must know the type of the element you are
interacting with before you can access its full potential.

If you are using .NET 1.x you will need
to reference the HTML type library by clicking Project > Add Reference, then
selecting Microsoft.mshtml
from the list. For each of the examples in this section you must import the
namespace into your code thus:

C#

using mshtml;

VB.NET

Imports mshtml

If you then cast the WebBrowser.Document
object to an HTMLDocument
class, many of the code examples shown below should work equally well for .NET
1.x as for .NET 2.0.

5.2.1 Frames

Frames may be going out of fashion in
modern websites, but you may still need to extract data from a website
that uses frames, so you need to be aware of how to handle them within Internet
Explorer. In this section, you will notice that the code differs substantially
between versions 1.x and 2.0 of .NET, therefore source code for both is
included.

To create a simple frameset, create
three files, Frameset.html,
Left.html
and Right.html,
containing the following HTML code respectively.

Frameset.html

<html>

<frameset cols="50%,50%">

<frame name="LeftFrame" src="Left.html">

<frame name="RightFrame" src="right.html">

</frameset>

</html>

Left.html

<html>

This is the left frame

</html>

Right.html

<html>

This is the right frame

</html>

In the following example, we will use
Internet Explorer to read the HTML contents of the left frame. This example uses
code from the program listing in section 5.1, and assumes you have saved the
HTML files in C:\

VB.NET 1.x

Private Sub btnNavigate_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnNavigate.Click
    NavigateToUrlSync("C:\frameset.html")
    Dim hDoc As HTMLDocument
    hDoc = WebBrowser.Document
    hDoc = CType(hDoc.frames.item(0), HTMLWindow2).document
    MessageBox.Show(hDoc.body.innerHTML)
End Sub

VB.NET 2.0

Private Sub btnNavigate_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnNavigate.Click
    NavigateToUrlSync("C:\frameset.html")
    Dim hDoc As HtmlDocument
    hDoc = WebBrowser.Document.Window.Frames(0).Document
    MessageBox.Show(hDoc.Body.InnerHtml)
End Sub

C# 1.x

private void btnNavigate_Click(object sender,
System.EventArgs e)

{

NavigateToUrlSync(@"C:\frameset.html");

HTMLDocument hDoc;

object oFrameIndex = 0;

hDoc = (HTMLDocument)WebBrowser.Document;

hDoc = (HTMLDocument)((HTMLWindow2)hDoc.frames.item(

ref oFrameIndex)).document;

MessageBox.Show(hDoc.body.innerHTML);

}

C# 2.0

private void btnNavigate_Click(object sender,
System.EventArgs e)

{

NavigateToUrlSync(@"C:\frameset.html");

HtmlDocument hDoc;

hDoc = WebBrowser.Document.Window.Frames[0].Document;

MessageBox.Show(hDoc.Body.InnerHtml);

}

The main difference between the .NET
2.0 and .NET 1.x versions of the above code is that the indexer on the frames
collection returns an object, which must be cast to an HTMLWindow2
under the COM wrapper in .NET 1.x. In .NET 2.0 the indexer performs the cast
internally, and returns an
HtmlWindow object.

To test the application, compile and
run it from Visual Studio .NET, press the navigate button, and a message box
should pop up saying “This is the left frame”, as shown in Figure 5.1

Figure 5.1 – Reading framesets with
Internet Explorer

5.2.2 Input boxes

Input boxes are used in HTML to allow
the user to enter text into a web page. Here we will automatically populate an
input box with some data.

Given some HTML, which we save as
InputBoxes.html, as follows:

<html>

<form name="myForm">

My Name is :

<input type="text" value="" id="myName" name="myName">

</form>

</html>

We can get a reference to the input
box on the form by calling
GetElementById on the HtmlDocument.
In .NET 1.x the element returned by getElementById should then be cast to an IHTMLInputElement.

C# 2.0

private void btnNavigate_Click(object sender,
System.EventArgs e)

{

NavigateToUrlSync(@"C:\InputBoxes.html");

HtmlElement hElement;

hElement = WebBrowser.Document.GetElementById("myName");

hElement.SetAttribute("value", "Joe Bloggs");

}

VB.NET 2.0

Private Sub btnNavigate_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnNavigate.Click
    NavigateToUrlSync("C:\InputBoxes.html")
    Dim hElement As HtmlElement
    hElement = WebBrowser.Document.GetElementById("myName")
    hElement.SetAttribute("value", "Joe Bloggs")
End Sub

In order to enter the text into the
input box, we call the
SetAttribute method of the HtmlElement,
passing in the property to change, and the new text. In .NET 1.x we would set
the value
property of the
IHTMLInputElement to the new text.

To test the application, compile and
run it from Visual Studio .NET, then press the navigate button. You should see
the name “Joe Bloggs” appearing in the input box as in Figure 5.2

Figure 5.2 – Input Boxes in Internet
Explorer

5.2.3 Drop down lists

In HTML, drop down lists are used in
web pages to allow users to choose from a list of pre-defined values. In the
following example, we will demonstrate how to set the value of a drop down list,
and then read it back.

We shall start off with an HTML file,
which we save as
DropDownList.html

<html>

<form name="myForm">

My favourite colour is:

<select id="myColour" name="myColour">

<option value="Blue">Blue</option>

<option value="Red">Red</option>

</select>

</form>

</html>

We can get a reference to the drop down
list by calling GetElementById
on the HtmlDocument.
In .NET 1.x the element returned by getElementById should then be cast to an IHTMLSelectElement.

Here, we can see that in order to set
our selection we pass “selectedIndex”
and the selection number to
SetAttribute. We then pass “value”
to GetAttribute
in order to read back the selection. In .NET 1.x, we achieve the same results by
setting the selectedIndex
property on the
IHTMLSelectElement and reading back the
selection from the value
property.
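The .NET 2.0 listing for this step is not reproduced here. As a rough sketch of the calls just described, and assuming DropDownList.html has already finished loading in the control, the selection could be set and read back as follows. The class and method names are hypothetical.

```csharp
using System.Windows.Forms;

public static class DropDownSketch
{
    // Selects the second option of the "myColour" list and returns the
    // value read back from it. Returns null if the element is not found.
    public static string SelectSecondColour(WebBrowser browser)
    {
        HtmlElement hElement = browser.Document.GetElementById("myColour");
        if (hElement == null)
        {
            return null;
        }

        // selectedIndex is zero-based, so 1 is the second <option>.
        hElement.SetAttribute("selectedIndex", "1");
        return hElement.GetAttribute("value");
    }
}
```

From the button's click handler this might be used as MessageBox.Show("My favorite color is: " + DropDownSketch.SelectSecondColour(WebBrowser)), after the call to NavigateToUrlSync.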

To test the application, compile and run
it from Visual Studio .NET, press the navigate button, and you should see a
message box appear saying “My favorite color is: Red”, similar to that shown in
Figure 5.3

Figure 5.3 – Using drop down lists in
Internet Explorer

5.2.4 Check boxes and radio buttons

Check boxes and radio buttons are
generally used on web pages to allow the user to select between small numbers of
options. In the following example, we shall demonstrate how to toggle check
boxes and radio buttons.

We shall start off with a HTML file,
which we will save as
CheckBoxes.html

<html>

<form name="myForm">

<input type="checkbox" id="myCheckBox" name="myCheckBox">Check this.<br>

<input type="radio" name="myRadio" value="Yes">Yes

<input type="radio" name="myRadio" checked="true"
value="No">No

</form>

</html>

As before we can get a reference to the
checkbox by calling
getElementById. However, since the two radio
buttons have the same name, we need to use

Document.All.GetElementsByName and then select
the required radio button from the HtmlElementCollection
returned.

In .NET 1.x, we would use a call to
getElementsByName
on the HTMLDocument.
This returns an
IHTMLElementCollection. We can then get the
reference to the
IHTMLInputElement with the method item(null,1).
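The listing for toggling these controls is not reproduced here. The sketch below shows one way the .NET 2.0 calls described above might fit together, assuming CheckBoxes.html has finished loading. Setting the "checked" attribute through SetAttribute is an assumption about how the original example toggled the controls; the class and method names are hypothetical.

```csharp
using System.Windows.Forms;

public static class CheckBoxSketch
{
    // Ticks the check box and selects the first ("Yes") radio button.
    public static void ToggleControls(WebBrowser browser)
    {
        HtmlDocument doc = browser.Document;

        HtmlElement checkBox = doc.GetElementById("myCheckBox");
        if (checkBox != null)
        {
            checkBox.SetAttribute("checked", "true");
        }

        // Both radio buttons share one name, so fetch them as a collection
        // and pick the required one by position.
        HtmlElementCollection radios = doc.All.GetElementsByName("myRadio");
        if (radios.Count > 0)
        {
            radios[0].SetAttribute("checked", "true");
        }
    }
}
```

An alternative that behaves more like a real user is to call InvokeMember("click") on the element, which also raises any click handlers attached by the page.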

We can get a reference to the button on
the form by calling
getElementById on the HtmlDocument.
In .NET 1.x this should be then cast to an IHTMLElement.

C# 2.0

private void btnNavigate_Click(object sender,
System.EventArgs e)

{

NavigateToUrlSync(@"C:\buttons.html");

HtmlElement hElement;

hElement =
WebBrowser.Document.GetElementById("btnSubmit");

hElement.InvokeMember("click");

}

VB.NET 2.0

Private Sub btnNavigate_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnNavigate.Click
    NavigateToUrlSync("C:\buttons.html")
    Dim hElement As HtmlElement
    hElement = WebBrowser.Document.GetElementById("btnSubmit")
    hElement.InvokeMember("click")
End Sub

In the above example, we can see that
after we get a reference to the button, we call the click method using InvokeMember.
Similarly, if we wanted to submit the form without clicking the button, we could
get a reference to myForm
and pass “submit”
to the InvokeMember
method.

In .NET 1.x, there is no InvokeMember
method on IHTMLElement,
so you must call the click method of the IHTMLElement directly. In the case of
a form, you should cast the
IHTMLElement to an IHTMLFormElement
and call its submit
method.

To test this application, compile and
run it from Visual Studio .NET, and press the navigate button. The form should
load and then automatically forward itself to a google.com search result page as
in Figure 5.5.

Figure 5.5 – Using Buttons and Forms
in Internet Explorer.

5.2.6 JavaScript

Many web pages use JavaScript to
perform complex interactions between the user and the page. It is important to
know how to execute JavaScript functions from within Internet Explorer. The
simplest method is to use
Navigate with the prefix javascript:
followed by the function name. However, this does not give us a return value, nor will
it work correctly in all situations.

We shall start with an HTML page, which
contains a JavaScript function to display some text. This will be saved as
JavaScript.html

We can then use the Document.InvokeScript
method to execute the JavaScript thus:

C# 2.0

private void btnNavigate_Click(object sender,
System.EventArgs e)

{

NavigateToUrlSync(@"C:\javascript.html");

string strRetVal = "";

strRetVal = (string)WebBrowser.Document.InvokeScript("jsFunction");

MessageBox.Show(strRetVal);

}

VB.NET 2.0

Private Sub btnNavigate_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles btnNavigate.Click
    NavigateToUrlSync("C:\javascript.html")
    Dim strRetVal As String
    strRetVal = WebBrowser.Document.InvokeScript("jsFunction").ToString()
    MessageBox.Show(strRetVal)
End Sub

In .NET 1.x, we would call the parentWindow.execScript
method on the HTMLDocument,
not forgetting to add empty parentheses after the JavaScript function name.
Unfortunately execScript
returns null
instead of the JavaScript return value.
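As a brief illustration of the .NET 1.x call, here is a sketch that assumes a reference to Microsoft.mshtml and a document that has already loaded. The class and method names are hypothetical.

```csharp
using mshtml;

public class ExecScriptSketch
{
    // 'document' is the COM document object, for example the browser
    // control's Document property cast to HTMLDocument.
    public static void RunJsFunction(HTMLDocument document)
    {
        // Note the parentheses after the function name. Both arguments must
        // be supplied, and the value returned by the call is null.
        document.parentWindow.execScript("jsFunction()", "JavaScript");
    }
}
```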

To test the application, compile and run
it from Visual Studio .NET, then press the Navigate button. You should see a
message “This was displayed by JavaScript” as shown in figure 5.6

Figure 5.6 – Using JavaScript in
Internet Explorer

5.3 Extracting data from web pages

In order to extract HTML from a web
page using Internet Explorer, you need to call Body.Parent.OuterHtml in
.NET 2.0 or
body.parentElement.outerHTML in .NET 1.x. You
should be aware that the HTML returned by this method is different to the actual
HTML content of the page.

Internet Explorer will “correct” HTML
in the page by adding <BODY>, <TBODY> and <HEAD> tags where missing. It will
also capitalize existing HTML Tags, and make other formatting changes that you
should be aware of.
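A short sketch of the .NET 2.0 call follows, assuming the page has finished loading. The class and method names are hypothetical; the null checks are defensive additions rather than part of the technique.

```csharp
using System.Windows.Forms;

public static class HtmlExtractSketch
{
    // Returns the HTML of the whole page as Internet Explorer has parsed
    // and "corrected" it, or an empty string if nothing is loaded.
    public static string GetPageHtml(WebBrowser browser)
    {
        HtmlDocument doc = browser.Document;
        if (doc == null || doc.Body == null)
        {
            return string.Empty;
        }

        // The parent of <BODY> is the root <HTML> element.
        HtmlElement root = doc.Body.Parent;
        return root != null ? root.OuterHtml : doc.Body.OuterHtml;
    }
}
```

Because of the corrections described above, comparing this string byte for byte against the file on the server is unlikely to succeed, so any later pattern matching should tolerate upper-case tags and inserted elements.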

Techniques for parsing this textual
data are explained later in the book under the section concerning Regular
Expressions.

5.4 Advanced user interaction

When designing an application which
uses Internet Explorer as a tool for data mining, it is an added benefit
if the user can interact with the control in a natural fashion in order to
manipulate its behavior. The following sections describe ways in which a user
can interact with Internet Explorer, and how these events can be handled within
.NET.

5.4.1 Design mode

If you wanted to provide the user with
the ability to manipulate web pages on-the-fly, there is no simpler way to do
it than using the built-in “design mode” in Internet Explorer. This particular
feature is not supported by the managed .NET 2.0 WebBrowser control. However,
it is possible to access the unmanaged interfaces, which we were using in .NET
1.x, through the
Document.DomDocument property. This can then be
cast to the HTMLDocument
in the mshtml
library (not to be confused with the managed HtmlDocument class).
Therefore, in this case, you will need to add a reference to the mshtml
library and add a “using mshtml”
statement to the top of your code.

In this example, we will create a
simple rich text editor based on Internet Explorer’s design mode. Within design
mode the user can perform a wide variety of tasks using intuitive actions, for
example, you can insert an image by right clicking on the browser, or convert
text to bold by pressing CTRL+B. Many of these tasks can be further automated
using the execCommand
method of the HTMLDocument
object. In the following example, we will demonstrate how to set fonts using
this method.

Open a new project in Visual Studio
.NET, and drag a WebBrowser
control onto the form, followed by a button named btnFont. Also add a FontDialog
control named fontDialog.
Click on the form and type the following code for the form load event.

C# 2.0

private void Form1_Load(object sender, EventArgs e)

{

string url = "about:blank";

webBrowser.Navigate(url);

Application.DoEvents();

HTMLDocument hDoc = (HTMLDocument)webBrowser.Document.DomDocument;

hDoc.designMode = "On";

}

VB.NET 2.0

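Private Sub Form1_Load(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles MyBase.Load
    Dim url As String = "about:blank"
    webBrowser.Navigate(url)
    Application.DoEvents()
    ' VB.NET is case-insensitive, so qualify the COM type to keep it
    ' distinct from System.Windows.Forms.HtmlDocument.
    Dim hDoc As mshtml.HTMLDocument = _
        CType(webBrowser.Document.DomDocument, mshtml.HTMLDocument)
    hDoc.designMode = "On"
End Sub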

In .NET 1.x, we would cast the
Document to an HTMLDocument,
rather than referencing the
DomDocument property, and also, the Navigate
method would be as described in section 5.1.
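The click handler for the font button is not reproduced here. The following sketch shows how the selection might be restyled in .NET 2.0, assuming a reference to Microsoft.mshtml. The class name, the method names, and in particular the mapping from point sizes to the 1 to 7 scale used by the FontSize command are assumptions for illustration.

```csharp
using System.Drawing;
using System.Windows.Forms;
using mshtml;

public static class FontSketch
{
    // Applies the given font to whatever text is highlighted in the browser.
    public static void ApplyFont(WebBrowser browser, Font font)
    {
        HTMLDocument hDoc = (HTMLDocument)browser.Document.DomDocument;

        // createRange returns a text range for highlighted text; for other
        // kinds of selection (such as an image) the cast yields null.
        IHTMLTxtRange range = hDoc.selection.createRange() as IHTMLTxtRange;
        if (range == null)
        {
            return;
        }

        range.execCommand("FontName", false, font.FontFamily.Name);
        range.execCommand("FontSize", false, ToHtmlFontSize(font.SizeInPoints));
    }

    // Hypothetical helper: approximates a point size on the HTML 1..7 scale.
    private static int ToHtmlFontSize(float points)
    {
        float[] limits = { 8f, 10f, 12f, 14f, 18f, 24f };
        for (int i = 0; i < limits.Length; i++)
        {
            if (points <= limits[i])
            {
                return i + 1;
            }
        }
        return 7;
    }
}
```

In the btnFont click handler this might be called as follows: if fontDialog.ShowDialog() returns DialogResult.OK, pass webBrowser and fontDialog.Font to FontSketch.ApplyFont.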

From the above code we can see that we
get a reference to the text currently highlighted by the user using the selection.createRange
method. We then execute two commands on this selection, “FontName”
and “FontSize”.
Other commands that could be used would be “ForeColor”,
“Italic”,
“Bold”
and so forth.

To test the application, compile and
run it from Visual Studio .NET, enter some text into the space provided, then
highlight it. Click on the font button and choose a new font and size. The text
should change to the selected font, as shown in Figure 5.7

Figure 5.7 – A WYSIWYG editor using
Internet Explorer

5.4.2 Capturing Post data

When a user is navigating through web
pages, it may be necessary to keep track of the URLs they visit, and the
post data sent between Internet Explorer and the web server. Although there are
ways and means of doing this using packet sniffing or third party tools, these
tend to capture too much data and record traffic from other
applications. The obvious alternative is the BeforeNavigate event, but due to a
bug in the .NET wrapper for Internet Explorer (see
Microsoft Knowledge Base article 311298), this event will not
fire as you move between pages.

In order to subscribe to this event,
we need to know a little about how COM events work “under the hood”. Every COM
object which generates events will implement the IConnectionPointContainer
interface. A client wishing to subscribe to events from this COM object must
call the FindConnectionPoint
method on this interface, passing the IID (Interface ID) of the required set of
events.

Some COM objects support multiple sets
of events or “Connection Points”, for example, Internet Explorer supports the
DWebBrowserEvents
connection point, and the
DWebBrowserEvents2 connection point. Herein
lies the problem, the .NET wrapper will by default attach to the DWebBrowserEvents2
connection point, which contains a version of BeforeNavigate which is
incompatible with .NET due to unsupported variant types.

If you open the ILDASM utility,
click File > Open, select
Interop.SHDocVw.DLL, select the DWebBrowserEvents connection point, then double click on BeforeNavigate,
you will see the following window:

Figure 5.8 The BeforeNavigate Event

From the information in Figure 5.8 we
can see that the Dispatch ID is set to 64 Hex (100 decimal). While using ILDASM
we can also find the IID of the
DWebBrowserEvents connection point by double
clicking on “class interface” that is, eab22ac2-30c1-11cf-a7eb-0000c05bae0b.
At this point we have everything we need to create an interface in C# for this
event.
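The interface listing itself is not reproduced here. A sketch consistent with the Dispatch ID and IID given above might look like the following. The method name RaiseBeforeNavigate and the parameter names follow the pattern used in Microsoft's Knowledge Base sample and should be treated as assumptions; only the DispId and the Guid come from the text.

```csharp
using System;
using System.Runtime.InteropServices;

// Managed view of the DWebBrowserEvents connection point. Only the one
// event we are interested in is declared.
[Guid("eab22ac2-30c1-11cf-a7eb-0000c05bae0b"),
 InterfaceType(ComInterfaceType.InterfaceIsIDispatch)]
public interface IWebBrowserEvents
{
    // Dispatch ID 100 (64 hex) identifies BeforeNavigate.
    [DispId(100)]
    void RaiseBeforeNavigate(string url, int flags, string targetFrameName,
        ref object postData, string headers, ref bool cancel);
}
```

The form class that calls Advise would then be declared as implementing IWebBrowserEvents, so that COM can call back into it through this interface.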

To put this all together, create a new
project in Visual Studio .NET, and drop in a Web Browser control (not the .NET
2.0 version, but the COM version). Add a reference to Microsoft Internet
Controls under COM references. You will need to include both
SHDocVw
and
System.Runtime.InteropServices in the using
list at the head of your code. Now add the code for the interface described above.

C#

private void Form1_Load(object sender, System.EventArgs e)

{

UCOMIConnectionPointContainer icpc;

UCOMIConnectionPoint icp;

int cookie = -1;

icpc =
(UCOMIConnectionPointContainer)axWebBrowser1.GetOcx();

Guid g = typeof(DWebBrowserEvents).GUID;

icpc.FindConnectionPoint(ref g, out icp);

icp.Advise(this, out cookie);

}

This code obtains
a reference to Internet Explorer’s underlying IConnectionPointContainer
by calling the GetOcx
method, which axWebBrowser1 has inherited from AxHost. From this, we can
then obtain a reference to the required connection point by passing its GUID /
IID to the FindConnectionPoint
method. To subscribe to events, we call the Advise method. To unsubscribe, we
should call the Unadvise
method, passing the cookie returned by Advise.

To handle the event, we shall simply pop
up a message box immediately before the page navigates. We shall also display
any post data being sent.

Since we have specified that our class
should implement
IWebBrowserEvents, this dictates that the post
data must be received as an object. This object should then be cast to a byte
array, and then to a UTF8 string for readability.
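The handler listing is not reproduced here, but the conversion it describes can be sketched in isolation. The class and method names below are hypothetical, and trimming trailing null characters is a defensive assumption rather than something the text requires.

```csharp
using System;
using System.Text;

public class PostDataSketch
{
    // Converts the post data object handed to BeforeNavigate into a
    // readable string. Returns an empty string when no data was posted.
    public static string Decode(object postData)
    {
        byte[] bytes = postData as byte[];
        if (bytes == null)
        {
            return string.Empty;
        }

        // The buffer may end with a null terminator, which is removed here.
        return Encoding.UTF8.GetString(bytes).TrimEnd('\0');
    }
}
```

Inside the BeforeNavigate handler on the form, the URL and PostDataSketch.Decode(postData) could then be passed to MessageBox.Show.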

To finish off the example, add a button
to the form, and attach some code to it, to allow it to navigate to some website
with a post-form on it, in this example, Amazon.com

To test the application, run it from
Visual Studio .NET, press the navigate button, enter something in the Amazon
search box, and press go. You should see a message box appearing, containing the
post data which you sent to the web server, as depicted in Figure 5.9

Figure 5.9 – Capturing Post data from
Internet Explorer

5.4.3 Capturing click events

Although you can capture events such
as DocumentComplete
to determine when a user navigates to a new page, it is a little trickier to
trap events which do not involve page navigation, such as entering text into a
text box for instance.

The event trapping technique differs
substantially between .NET 1.x and .NET 2.0. In the former version, you need to
implement the default
COM interop method. This is an entry point in your
application, marked as Dispatch ID 0, which COM uses to call back
whenever an event that your application has subscribed to is raised. In order to use COM
interoperability in .NET 1.x, you need to include a using System.Runtime.InteropServices
statement at the top of your code.

In .NET 2.0, it’s a little more
straightforward. Here, we attach an HtmlElementEventHandler
delegate to the Document.Click
event, and implement it in our own event handler.

Basing this example on the sample code
in section 5.1, we shall now add some extra event handling capabilities to pop
up a message box whenever the user clicks a HTML element in the web browser.

At this point we have now subscribed
to the click event, and in the case of .NET 2.0, supplied a call back delegate
named Document_Click.
For demonstration purposes, we shall simply display the tag name of the element
clicked, and the event type (which should always be “click” in our case).

In order to get a reference to the
element clicked, we have used a different technique for each version of .NET. In
.NET 2.0, the
GetElementFromPoint method is used to determine
the element from the mouse location. In .NET 1.x, we can get the reference to
the element via the
Document.parentWindow.@event.srcElement
property.
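The .NET 2.0 listings for this section are not reproduced here. The sketch below shows one way the subscription and the Document_Click handler might be written, assuming an existing WebBrowser control is passed in. The wrapper class ClickCaptureSketch is hypothetical; the handler is re-attached after every page load because each navigation produces a new document.

```csharp
using System;
using System.Windows.Forms;

public class ClickCaptureSketch
{
    private readonly WebBrowser browser;

    public ClickCaptureSketch(WebBrowser browser)
    {
        this.browser = browser;
        browser.DocumentCompleted += Browser_DocumentCompleted;
    }

    private void Browser_DocumentCompleted(object sender,
        WebBrowserDocumentCompletedEventArgs e)
    {
        HtmlDocument doc = browser.Document;
        if (doc == null)
        {
            return;
        }

        // Remove first so that pages with frames, which complete more than
        // once, do not end up with duplicate subscriptions.
        doc.Click -= new HtmlElementEventHandler(Document_Click);
        doc.Click += new HtmlElementEventHandler(Document_Click);
    }

    private void Document_Click(object sender, HtmlElementEventArgs e)
    {
        // ClientMousePosition is relative to the document's client area,
        // which is what GetElementFromPoint expects.
        HtmlElement element =
            browser.Document.GetElementFromPoint(e.ClientMousePosition);
        string tagName = element != null ? element.TagName : "(none)";
        MessageBox.Show(tagName + " : " + e.EventType);
    }
}
```

Creating one instance in the form's constructor, for example new ClickCaptureSketch(WebBrowser), is enough to start receiving clicks once a page has loaded.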

To test the application, compile and
run it from Visual Studio .NET, press the navigate button, then click anywhere
on the screen. You should see a message box appear with the tag name of the HTML
element that you clicked on, as shown in figure 5.10

Figure 5.10 – Capturing events within
Internet Explorer

5.5 Extending Internet Explorer

The examples so far have dealt with
embedding Internet Explorer in our applications, rather than embedding our
applications in Internet Explorer. This may not be ideal for all users, as we
lose the familiar interface that users are accustomed to. This section deals
with how to build applications around running instances of Internet Explorer.

5.5.1 Menu extensions

When you right click on a web page,
you can see a context menu, which you can extend with a simple registry tweak.
In this example, you can add a link to “Send to a friend” in the context menu,
which will link to a website that allows you to send emails. Firstly create the
following registry key:

HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\MenuExt\Send to a friend

Then set the default value to a location
on your hard drive, say c:\SendToAFriend.html,
which would contain the following HTML:

After making the change to the registry,
close all browser windows. To test the menu extension, open a new browser
window, right click on it, then select “Send to a friend”. The page should
redirect to a new website.
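If you would rather create the registry entry from code than by hand, a sketch along the following lines should work. It assumes the process is allowed to write to the current user's registry hive; the class and method names are hypothetical.

```csharp
using Microsoft.Win32;

public static class MenuExtSketch
{
    // Adds (or updates) a context menu entry for the current user.
    // menuText is the caption shown in the menu; htmlPath is the file
    // Internet Explorer runs when the entry is chosen.
    public static void Register(string menuText, string htmlPath)
    {
        string keyPath =
            @"Software\Microsoft\Internet Explorer\MenuExt\" + menuText;

        using (RegistryKey key = Registry.CurrentUser.CreateSubKey(keyPath))
        {
            // An empty value name sets the key's default value.
            key.SetValue("", htmlPath);
        }
    }
}
```

For the example in this section the call would be MenuExtSketch.Register("Send to a friend", @"c:\SendToAFriend.html").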

Figure 5.11 – Menu extensions in
Internet Explorer

5.5.2 Spawning a new instance of Internet Explorer

A simple way of controlling instances
of Internet Explorer is to create them yourself using COM. In this example I use
COM late binding, which differs from the early-bound examples used earlier in
this chapter, specifically in the .NET 1.x examples. Early bound COM objects are
compiled into the application at design time. Late bound COM objects are loaded
dynamically at run time.

The benefit of early bound objects is
that the development environment will be aware of the object model of the
component, and
Intellisense will help you determine which
methods you can call. We do not have such a luxury with late bound objects.
However, late binding has the advantage that we can bind to COM objects hosted as
executables, such as in the following example.

To start off, create a new windows
forms application in Visual Studio .NET, drop a button on the form, and attach
the following code to it.

You should also add references to System.Threading
and System.Reflection
at the top of your code.

The above code retrieves a reference to
the COM object model for the Internet Explorer application by inspecting the
ProgID
“InternetExplorer.Application”.
It then creates an instance of this COM object. It sets its Visible
property to true, then calls the Navigate2 method, passing
the URL
www.google.com as a parameter.
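The listing is not reproduced here, but the steps just described can be sketched with reflection as follows. The class and method names are hypothetical, and the helper for reading LocationURL is included only to illustrate the polling approach mentioned in this section.

```csharp
using System;
using System.Reflection;

public static class SpawnSketch
{
    // Starts a visible Internet Explorer window and sends it to 'url'.
    // Returns the late-bound COM object so the caller can keep using it.
    public static object OpenBrowser(string url)
    {
        Type ieType = Type.GetTypeFromProgID("InternetExplorer.Application");
        if (ieType == null)
        {
            throw new InvalidOperationException(
                "InternetExplorer.Application is not registered.");
        }

        object ie = Activator.CreateInstance(ieType);
        ieType.InvokeMember("Visible", BindingFlags.SetProperty,
            null, ie, new object[] { true });
        ieType.InvokeMember("Navigate2", BindingFlags.InvokeMethod,
            null, ie, new object[] { url });
        return ie;
    }

    // Reads the browser's current address, which can be polled to detect
    // navigation between pages.
    public static string GetLocation(object ie)
    {
        return (string)ie.GetType().InvokeMember("LocationURL",
            BindingFlags.GetProperty, null, ie, null);
    }
}
```

A button click handler could call SpawnSketch.OpenBrowser("http://www.google.com") and keep the returned object for later calls to GetLocation.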

Unfortunately it is not trivial to
subscribe to events from this late bound object, so therefore, if it is
necessary to detect navigation between pages, it may be necessary to poll on the
LocationURL
property of the browser.

To test this application, compile and
run it from Visual Studio .NET, then press the button on the form. You should
see a new browser window open, on the
www.google.com homepage as shown in figure 5.12

Figure 5.12 – Spawning a new instance
of Internet Explorer

5.5.3 Browser Helper Objects

When you need really tight
integration with Internet Explorer, in cases where you want code to execute
completely transparently to the user, and yet have full control of the browser’s
document model and be able to subscribe to events, Browser Helper Objects (BHOs)
are the way to go.

BHO technology is widely associated
with Spyware
applications, which silently run in the background, as a user is browsing
websites. Since the BHO would have access to the HTMLDocument object of the
Internet Explorer instance hosting it, it would be possible to read the text of
the webpage being visited, and duly display context-sensitive advertisements.

Internet Explorer expects the BHO
object to be COM based, not a .NET assembly. Therefore it is necessary to create
a CCW (COM Callable Wrapper) for our assembly. This CCW has a unique Class ID,
which we store in the registry at the following location:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\Browser Helper Objects

When Internet Explorer (or Windows
Explorer) starts, it reads all the Class IDs listed at the registry location
above, and creates instances of their respective COM objects, and in our
case, the underlying .NET assembly. It then interrogates the COM object to
ensure that it implements the
IObjectWithSite interface. This interface is
very strictly defined and implemented as follows:

C#

using System;

using System.Runtime.InteropServices;

namespace BrowserHelperObject

{

[ComVisible(true),

InterfaceType(ComInterfaceType.InterfaceIsIUnknown),

Guid("FC4801A3-2BA9-11CF-A229-00AA003D7352")]

public interface IObjectWithSite

{

[PreserveSig]

int
SetSite([MarshalAs(UnmanagedType.IUnknown)]object site);

[PreserveSig]

int GetSite(ref Guid guid, out IntPtr ppvSite);

}

}

Internet Explorer uses the two methods
listed above to interact with the BHO. The SetSite method is called by
Internet Explorer whenever it starts up, or shuts down. This is to update the
BHO with the status of any internal references it may hold to the instance of
Internet Explorer which is hosting it. GetSite may be called by
Internet Explorer to query the reference a BHO holds to it. Every BHO must
implement both of these methods, and handle requests to and from the hosting
instance correctly.

To demonstrate Browser Helper Objects,
we shall go through a simple example, where we attach a BHO to Internet Explorer
which will append the current date to every page visited by the user.

Start a new class library project in
Visual Studio, add a reference to the Microsoft.mshtml .NET
assembly, and also to the COM object named “Microsoft Internet Controls”. Add a
new class file containing the definition of IObjectWithSite as listed
above. Then you can create the skeleton of your BHO thus:

C#

using System;

using System.Runtime.InteropServices;

using SHDocVw;

using Microsoft.Win32;

using mshtml;

namespace BrowserHelperObject

{

[ComVisible(true),

Guid("F839CC51-A6D8-4e9c-ACE5-F05071AD0C74"),

ClassInterface(ClassInterfaceType.None)]

public class DateStamp : IObjectWithSite

{

WebBrowser webBrowser;

}

}

What you can see from the code above
is that the class implements the IObjectWithSite interface,
which is a prerequisite of any BHO. It also has a GUID (Globally Unique
Identifier). This is used to uniquely identify the CCW, and can be chosen
arbitrarily, using the
GuidGen.exe tool or similar. The WebBrowser
class in the code does not refer to the familiar WebBrowser class as used in
.NET 2.0, but instead is a class defined within SHDocVw. It is this object
which will contain a reference to the hosting instance of Internet Explorer.

As mentioned previously, it is
necessary for every BHO to implement both the GetSite and SetSite
methods. In most cases, there is little need to perform any custom actions
within GetSite,
so therefore its implementation would remain standard for most types of BHO. A
typical implementation would be as follows:

C#

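public int GetSite(ref Guid guid, out IntPtr ppvSite)
{
    // Obtain the IUnknown pointer for the hosting browser reference.
    IntPtr punk = Marshal.GetIUnknownForObject(webBrowser);

    // Ask for the interface identified by the GUID passed in by the host.
    int hr = Marshal.QueryInterface(punk, ref guid, out ppvSite);

    // Release the IUnknown pointer obtained above.
    Marshal.Release(punk);

    // Pass the HRESULT of the query back to the caller.
    return hr;
}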

This code first obtains a pointer to the IUnknown COM interface for
our reference to the hosting instance of Internet Explorer. It then queries
that interface for the interface identified by the GUID which Internet Explorer
passes in, and returns the resulting interface pointer through ppvSite.
Finally, the code releases the IUnknown pointer and returns the HRESULT from
the query, which tells Internet Explorer whether the call succeeded.

What is of more interest is the SetSite
method. This is where we have the opportunity to attach custom event handlers to
the hosting web browser. In this case, we attach the DocumentComplete
event handler.

C#

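public int SetSite(object site)
{
    if (site != null)
    {
        // The host is attaching the BHO: keep the reference and subscribe.
        webBrowser = (WebBrowser)site;
        webBrowser.DocumentComplete +=
            new DWebBrowserEvents2_DocumentCompleteEventHandler(
                this.OnDocumentComplete);
    }
    else if (webBrowser != null)
    {
        // The host is shutting down: unsubscribe and drop the reference.
        // The null check avoids an exception if no site was ever set.
        webBrowser.DocumentComplete -=
            new DWebBrowserEvents2_DocumentCompleteEventHandler(
                this.OnDocumentComplete);
        webBrowser = null;
    }
    return 0; // S_OK
}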

As mentioned earlier, Internet Explorer
also calls this method as it is shutting down, in which case the
passed parameter is null.
At that point we must detach our event handlers and free any associated
resources.

At this point we are in a position to
add our own custom logic, which we place within the OnDocumentComplete
function thus:
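
A minimal sketch of such a handler is given here. It assumes the (object, ref object) signature of DWebBrowserEvents2_DocumentCompleteEventHandler, and it assumes that the date is appended as a new paragraph at the end of the page body; the parameter names and the exact markup are illustrative only.

```csharp
// Member of the DateStamp class; relies on "using System;" and
// "using mshtml;" from the skeleton shown earlier.
private void OnDocumentComplete(object pDisp, ref object URL)
{
    // webBrowser.Document is typed as object, so cast it to the
    // mshtml document type (null if the content is not HTML).
    HTMLDocument document = webBrowser.Document as HTMLDocument;
    if (document == null || document.body == null)
    {
        return; // nothing we can append to
    }

    // Assumption: a short date in its own paragraph is sufficient.
    document.body.innerHTML +=
        "<p>" + DateTime.Now.ToShortDateString() + "</p>";
}
```

Note that DocumentComplete is raised once for each frame as well as for the top-level page, so on framed pages a handler written this way may run more than once.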

In the above code, we retrieve a
reference to the HTMLDocument
contained within the hosting Internet Explorer instance, and simply add the
current date to the HTML of the page.

Before we are ready to try out our new
BHO, we should add some extra plumbing to enable the assembly to store the Class
ID of its CCW in the registry with the other Browser Helper Objects installed
on the system.
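
A minimal sketch of this registration plumbing follows. It assumes that BHOKEYNAME, the constant used by the un-registration code, holds the standard Browser Helper Objects registry path; the method name RegisterBHO is hypothetical and simply mirrors UnregisterBHO.

```csharp
// Members of the DateStamp class; rely on "using System;",
// "using System.Runtime.InteropServices;" and "using Microsoft.Win32;".

// Assumption: the standard location where Internet Explorer looks for BHOs.
public static string BHOKEYNAME =
    "Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\Browser Helper Objects";

[ComRegisterFunction]
public static void RegisterBHO(Type t)
{
    // Open the Browser Helper Objects key for writing, creating it if absent.
    RegistryKey key = Registry.LocalMachine.OpenSubKey(BHOKEYNAME, true);
    if (key == null)
    {
        key = Registry.LocalMachine.CreateSubKey(BHOKEYNAME);
    }

    // The subkey name is the Class ID in brace-delimited form.
    string guidString = t.GUID.ToString("B");
    RegistryKey bhoKey = key.OpenSubKey(guidString);
    if (bhoKey == null)
    {
        bhoKey = key.CreateSubKey(guidString);
    }

    bhoKey.Close();
    key.Close();
}
```

Writing under HKEY_LOCAL_MACHINE normally requires administrative rights, so registration is expected to be run from an elevated prompt.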

The above code is called whenever we
create a CCW from the assembly. It inserts a new key containing the Class ID, at
the registry location as specified.

Similarly, as we un-register the CCW, we
will want to remove that key from the registry. This would be implemented thus:

C#

[ComUnregisterFunction]
public static void UnregisterBHO(Type t)
{
    RegistryKey key = Registry.LocalMachine.OpenSubKey(BHOKEYNAME, true);
    string guidString = t.GUID.ToString("B");
    if (key != null) key.DeleteSubKey(guidString, false);
}
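
To make the registry key name concrete, the following stand-alone sketch shows what the "B" format specifier produces for the GUID used in the DateStamp class. The class and method names here are hypothetical and exist only for illustration.

```csharp
using System;

public static class GuidFormatDemo
{
    // Returns the brace-delimited form that is used as the subkey name.
    public static string RegistryKeyNameFor(Guid classId)
    {
        return classId.ToString("B");
    }

    public static void Main()
    {
        Guid classId = new Guid("F839CC51-A6D8-4e9c-ACE5-F05071AD0C74");
        // prints {f839cc51-a6d8-4e9c-ace5-f05071ad0c74}
        Console.WriteLine(RegistryKeyNameFor(classId));
    }
}
```

The hexadecimal digits are emitted in lower case regardless of how the GUID was written in the attribute.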

You will find that whilst developing a
BHO, you may need to recompile and test the code several times to perfect your
application. Every time that you attach a BHO to Internet Explorer, it will also
attach itself to Windows Explorer, and the assembly will be locked for the
duration of the lifetime of these two processes. When the assembly is locked you
will not be able to delete it, or modify it by building a new version of the BHO
over it.

To unlock the BHO, you will need to
un-register it using regasm /unregister,
then stop all iexplore.exe
and explorer.exe
processes, either through Task Manager or by logging off and logging back in
again.
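
Before rebuilding, it can be handy to confirm that no host process is still holding the assembly. The following is a small sketch, assuming that the only hosts of interest are the two processes named above; the class and method names are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class BhoLockCheck
{
    // Process names (without ".exe") that load Browser Helper Objects.
    public static readonly string[] HostNames = { "iexplore", "explorer" };

    // Counts the host processes currently running on this machine.
    public static int CountRunningHosts()
    {
        int count = 0;
        foreach (string name in HostNames)
        {
            Process[] running = Process.GetProcessesByName(name);
            count += running.Length;
            foreach (Process p in running)
            {
                p.Dispose(); // release the process handles we opened
            }
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine("Host processes still running: " + CountRunningHosts());
    }
}
```

A result of zero suggests that the DLL is no longer locked by a browser host, although other tools may of course still have the file open.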

To test the application, compile
the code, then open up the Visual Studio .NET command prompt and navigate
to the folder that contains the output DLL. Then run the command:

regasm /codebase browserHelperObject.dll

Now open up an Internet Explorer window,
and you should see the date written at the bottom of the page, as shown in
figure 5.13.

Figure 5.13 – Browser Helper Objects
in Internet Explorer

If you receive the following warning, do not panic; as
long as the GUID you used in your assembly is unique, it will not cause a problem:

RegAsm warning: Registering an unsigned assembly with /codebase
can cause your assembly to interfere with other applications that may be
installed on the same computer. The /codebase switch is intended to be used
only with signed assemblies. Please give your assembly a strong name and
re-register it.

5.6 Conclusion

This chapter has demonstrated how to
control Internet Explorer from within a .NET application. It should pave the way
for automating data mining processes using this versatile component.

With the added benefit of enabling a
user to interact with web pages in a natural fashion, and being able to trap
events from within Internet Explorer, it should be possible to implement data
mining training tools, and website test automation utilities with the examples
shown in this chapter.

The next chapter deals with extracting
data from HTML code, using regular expressions and the hugely versatile HTML
Agility Pack.