If the site has been around for a long time without it, I would instead use the rel="canonical" tag here.
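In case it helps, the canonical tag goes in the <head> of each page and points at the one preferred URL for that page. A minimal sketch, with mysite.com standing in as a placeholder domain and www assumed to be the preferred version:

```html
<!-- In the <head> of every page, pointing at the preferred (www) version of that same page -->
<link rel="canonical" href="http://www.mysite.com/forum/" />
```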

Make sure you register both the www and non-www URLs in Google Webmaster Tools. Look at the crawl stats and crawl errors reports for both. Likewise, look at the internal linking and incoming external links reports. They are separate for www and for non-www.

I was planning to set up the redirect in the .htaccess file. Can a tag accomplish the same thing? Why use a rel tag instead of a redirect?

Regarding adding both the www and non-www versions to Google Webmaster Tools, may I ask what the benefit of that would be? Since it is a vBulletin forum, each page is dynamically generated and there should be no difference between the www and non-www pages, either in number or in content.

I am quite excited because mysite.com/forum/ is PR4 and www.mysite.com/forum/ is PR4. So, perhaps if I get this canonical situation sorted out it might go to PR5 or 6?

Regarding adding both www and non-www versions to Google Webmaster Tools, may I ask what the benefit of that would be?

Go look at the reports for your site. The answer will be obvious soon enough. :)

Fixing canonical problems from the very first day a site goes live is best done with a redirect. When it's an older site, a redirect can sometimes mess with the analytics in several ways; the rel="canonical" tag avoids that.
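For completeness, the redirect route in .htaccess would look something like this, assuming Apache with mod_rewrite enabled and www as the preferred version (mysite.com is a placeholder):

```apache
# Permanently (301) redirect non-www requests to the www version
RewriteEngine On
RewriteCond %{HTTP_HOST} ^mysite\.com$ [NC]
RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=301,L]
```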

Although your site is dynamically generated, you'll find that Google pulls the same page on different days for www and for non-www, and the revisit rates are completely different. That means the two versions will often appear quite different in the SERPs.

It's hard to measure how much of an effect implementing canonicalization has. I've done it on a couple of sites, and I'd say they usually gain 5-10% traffic over the following couple of months.

It can have the biggest impact when googlebot is spending so much time crawling non-canonical pages that it doesn't get to all of your content. I'd guess that 1 million posts is at least 100,000 forum topics. That is a lot of content for a PR4 site. I'd imagine that googlebot's crawl budget isn't enough to regularly crawl all of your content. Increased crawl rates could lead to better indexing in the long tail for you.

A while back, Google did have lots of trouble with canonical variations. Even today in certain "edge situations" they still can - so attention to canonical issues is definitely a best practice.

But when it comes to common platforms such as WordPress and vBulletin, Google seems to have already adapted. As earlier posts said, some sites may see a 10% boost over time, but others can't even be sure that anything changed.

If both versions of the URLs have the same toolbar PageRank, then in my experience Google has already got it figured out. In that case most of the advantage will come from more efficient crawling. If you see toolbar differences, then you may get more of a boost over time by taking care of the canonical issue.

Several years ago, googlebot was not very smart about common cases of non-canonical URLs. www vs. no-www, index.html, session ID parameters, and the like caused real crawling and ranking problems. Canonicalization was a solution, and it worked.

I always thought it was silly that googlebot couldn't seem to deal with www and no-www returning the same content. Today, googlebot seems much smarter to me about some of the basic cases.

It also used to be the case that how you linked your site together internally mattered much more. Back in the days when PageRank sculpting worked wonders, canonicalization was a way of getting every last drop of PageRank your site had coming to it. A few years ago, Google implemented algorithms to ensure that sites that are poorly sculpted or have some canonicalization issues aren't at a disadvantage. As a result, canonicalizing is no longer as necessary as it once was for ranking boosts.

There are certainly cases in which Googlebot still doesn't identify non-canonical content properly, and canonicalization can help. You can usually tell this is the case by looking in your logs and seeing whether Googlebot is spending time crawling two versions of the same URL.

One of the first things that comes to mind is the incoming links stat. About 99% of them use www.mysite.com. Either that, or maybe I need to wait longer for more results to come in?

Also, it seems that I have some crawl error issues restricted to www.mysite.com.

That's an odd pattern. You'd certainly expect higher numbers for the with-www version, but not outright zeros on the other side. It makes it seem as if every last link to the "wrong" sitename is present and accounted for. (That's good, if true, but it looks strange.)

How does Google identify a soft 404? That is, I know what it is, and I know how (not) to code it. But how can Google tell? Especially when, like here, there are plenty of "real" 404s in the mix.

Google counts redirects to the home page as "soft 404", as well as pages that return a 200 status code but contain nothing other than an error message. Usually when there is a mix, the redirects to the home page are what's causing the soft 404s.
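If the home-page redirects are the culprit, the usual fix is to serve a friendly error page while still returning a genuine 404 status. In Apache that might look like the following (the error page path is just an example):

```apache
# Serve a custom error page but keep the real 404 response code.
# Note: using a local path keeps the 404 status; pointing
# ErrorDocument at a full http:// URL would turn it into a redirect.
ErrorDocument 404 /errors/not-found.html
```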

Google also reports soft 404s if you have many pages with very thin but similar content. I have a case where an image gallery opens as a popup in a smaller window, and the popup page's HTML contains nothing but 2 iframes. The first iframe shows the big photo and the second iframe contains the thumbnail strip. The page itself is set to noindex (but allows follow) so that Google can still get to the big image.

Google is, however, reporting a soft 404 for every one of the pages that contain the 2 iframes.
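For anyone following along, the noindex-but-follow setup described above would be a robots meta tag along these lines:

```html
<!-- On the popup page: keep it out of the index but let Google follow through to the image -->
<meta name="robots" content="noindex, follow">
```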