This is the second of two articles about key search engine visibility and site crawlability issues for dynamically generated Web sites. The series integrates Laura’s and Matt’s presentations with additional insights. In part 1, we looked at three key problem areas for sites with dynamically generated content: information architecture and keyword research, robots.txt files, and the use of Sitemaps. In part 2, we’ll explore more issues.

Technical Problems

There are many potential technical problems you can end up with on your site. Here’s a summary of the most common ones:

The key problem in the above URL is the number of parameters, delimited by ampersand characters. Complex URLs with a lot of parameters can cause crawlers to ignore the page altogether. Probably not the result you’re looking for. These types of problems must be cleaned up.
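One rough way to find these URLs on your own site is to count the query parameters in each one. The sketch below is only an illustration: the sample URL, the function names, and the threshold of two parameters are all assumptions made for this example, not a limit published by any search engine.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumed threshold for illustration only; search engines publish no fixed limit.
MAX_PARAMS = 2


def count_query_params(url):
    """Return the number of name=value pairs in the URL's query string."""
    query = urlsplit(url).query
    return len(parse_qsl(query, keep_blank_values=True))


def is_crawler_unfriendly(url, max_params=MAX_PARAMS):
    """Flag URLs whose parameter count exceeds the assumed threshold."""
    return count_query_params(url) > max_params


# Hypothetical URLs, not taken from any real site.
COMPLEX_URL = "http://www.yourdomain.com/product.php?cat=12&item=345&sess=abc123&sort=price"
SIMPLE_URL = "http://www.yourdomain.com/widgets/blue"

if __name__ == "__main__":
    for url in (COMPLEX_URL, SIMPLE_URL):
        print(url, count_query_params(url), is_crawler_unfriendly(url))
```

Running a check like this over the URLs in your Sitemap or crawl output gives you a short list of pages to clean up first.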

2. Sometimes complex dynamic sites use redirects as a tool in managing site structure. Unfortunately, they often default to using 302 redirects. Search engine crawlers view the 302 redirect as temporary. That means that they don’t pass on link credit from the old page to the new page. That’s bad.
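The fix is to send a 301 (permanent) redirect instead. How you do that depends on your platform, so treat the following as a minimal sketch only: it is a small Python WSGI application, and the MOVED table and its paths are hypothetical names invented for this example.

```python
# Hypothetical mapping of retired paths to their permanent replacements.
MOVED = {"/old-page": "/new-page"}


def app(environ, start_response):
    """Minimal WSGI app that answers moved paths with a 301 redirect."""
    path = environ.get("PATH_INFO", "/")
    target = MOVED.get(path)
    if target is not None:
        # 301 tells crawlers the move is permanent; a 302 marks it as temporary.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"ok"]
```

To try it locally, you could serve the app with make_server from Python's wsgiref.simple_server module and request /old-page with a tool that shows response headers.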

3. Another common problem: CMS systems tend to use JavaScript-based menu systems. This can prevent a search engine crawler from seeing the link. Not a good thing.

The problem can range from bad to worse. For example, if a link is embedded in JavaScript and there’s still a visible “a href” statement in the code, you may be okay. On the other hand, if there’s no visible “a href” statement, you’re probably hosed.
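To get a feel for what a crawler that does not run JavaScript might see, you can parse a page's HTML and collect only the plain “a href” links. This is a simplified sketch, not a model of how any particular search engine works; the sample markup, the class name, and the function name are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, skipping javascript: pseudo-links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and not value.lower().startswith("javascript:"):
                self.links.append(value)


def crawlable_links(html):
    """Return the links visible without executing any JavaScript."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical menu markup showing the three cases.
SAMPLE = '''
<a href="/products.html" onclick="openMenu(); return false;">Products</a>
<span onclick="window.location='/about.html'">About</span>
<a href="javascript:go('/contact.html')">Contact</a>
'''

if __name__ == "__main__":
    print(crawlable_links(SAMPLE))
```

In the sample, only the first menu item exposes a real URL in an “a href”. The other two depend entirely on JavaScript, so a check like this finds nothing to follow.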

4. Duplicate content is also a huge issue on dynamic Web sites. Fundamentally, you want only one URL to reference a given document. Many CMS systems have big problems with this.

Worse still, the CMS systems sometimes actually reference pages on a site using more than one URL. This sometimes happens on a large scale resulting in large amounts of duplicate content on a site. Not good.

One of the more common duplicate content problems occurs when a CMS refers to the home page of the site using the default document name (e.g. www.yourdomain.com/index.html) and uses that form of the URL on the internal links to the home page. Links from other sites typically point to www.yourdomain.com, so the same page ends up with two URLs.

Not only does this create duplicate content, it divides up the “link juice” of your site in a really bad way. This issue is important enough that I wrote about it in more detail in a recent SEW Experts column, “SEO Hell, a CMS Production.”
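If your CMS lets you hook into how internal links are generated, one approach is to strip the default document name before the link is written out. The following is a sketch under that assumption; the function name and the list of default document names are illustrative, and your server may use different names.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of default document names; adjust to match your server.
DEFAULT_DOCS = ("index.html", "index.htm", "index.php", "default.aspx")


def strip_default_document(url):
    """Rewrite .../index.html style URLs to the bare directory URL."""
    parts = urlsplit(url)
    path = parts.path
    for doc in DEFAULT_DOCS:
        if path.endswith("/" + doc):
            path = path[: -len(doc)]  # keep the trailing slash
            break
    return urlunsplit((parts.scheme, parts.netloc, path or "/", parts.query, parts.fragment))


if __name__ == "__main__":
    print(strip_default_document("http://www.yourdomain.com/index.html"))
```

With this in place, every internal link to the home page uses the same form of the URL that outside sites are most likely to link to.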

5. Most CMS systems do not handle the problem of canonicalization very well. This problem occurs when every page on your site can be accessed in both http://www.yourdomain.com format and in http://yourdomain.com format.

Search engines treat this as duplicate content. What makes this particularly bad is that most of your inbound links will go to http://www.yourdomain.com, but some will go to http://yourdomain.com.

The search engines are going to pick one version of your page, and the links to the other version of your page are simply wasted.

Fixing this is usually relatively easy. On an Apache Web server you can fix this in your .htaccess file using mod_rewrite, Apache’s URL rewriting module, to issue a 301 redirect from one form of the domain to the other.

As with the robots.txt file, there’s a strong potential for screwing up your site if you get the rewrite rules wrong, so use them with great care. Make sure you have an experienced programmer doing this for you.
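Before anyone touches the server configuration, it can help to spell out exactly what the redirect should do. The sketch below models that decision in Python rather than in Apache rewrite rules, so it is a way to reason about and test the behavior, not something to paste into .htaccess. The constant and function names are assumptions for this example, and the sketch ignores port numbers for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed canonical host and its non-canonical aliases.
CANONICAL_HOST = "www.yourdomain.com"
ALIASES = {"yourdomain.com"}


def canonical_redirect(url):
    """Return (301, target_url) if the URL uses an alias host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALIASES:
        target = urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, parts.fragment)
        )
        return 301, target
    return None


if __name__ == "__main__":
    print(canonical_redirect("http://yourdomain.com/widgets?id=3"))
    print(canonical_redirect("http://www.yourdomain.com/widgets?id=3"))
```

The important details are that the redirect is permanent (301) and that the path and query string carry over unchanged, so every page on the alias host maps to its exact counterpart on the canonical host.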

Site Analysis and Web Site Analytics

You’re going to want to have insight into what’s going on with your site. Here are a few key points to consider:

1. Web Analytics. Of course, you’re going to have a Web analytics package in place. Google Analytics is sufficient for many sites, and will even work well for some dynamic sites.

Webmasters of complex sites often find they need something more powerful. There are many Web analytics packages out there. Some really good ones are:

Laura recommends tracking spider activity. She particularly likes NetTracker, a technology acquired by Unica and rolled into the NetInsight family of products.

Whatever solution suits your fancy, detailed tracking of spider activity requires a solution that can read log files. Analytics software that relies on JavaScript tagging of your pages to perform tracking will not capture spider activity because spiders do not execute the JavaScript.
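If you just want a quick look before investing in a package, a short script can pull spider hits out of your raw logs. This sketch assumes your server writes the common “combined” log format and that matching a few user-agent substrings is good enough; the token list, the sample log lines, and the function name are illustrative assumptions, and user-agent strings can be faked.

```python
import re
from collections import Counter

# Matches the Apache "combined" log format (assumed here).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative user-agent substrings; extend for the spiders you care about.
SPIDER_TOKENS = ("googlebot", "slurp", "msnbot")


def spider_hits(lines):
    """Count requests per (spider token, path) from log lines."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for token in SPIDER_TOKENS:
            if token in agent:
                hits[(token, match.group("path"))] += 1
                break
    return hits


# Hypothetical log lines for illustration.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2008:13:55:36 -0700] "GET /widgets.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Oct/2008:13:56:02 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2008:13:57:10 -0700] "GET /widgets.html HTTP/1.1" 200 2326 "http://www.yourdomain.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1)"',
]

if __name__ == "__main__":
    for (spider, path), count in sorted(spider_hits(SAMPLE_LOG).items()):
        print(spider, path, count)
```

Pages that never show up in a report like this are pages the spiders are not reaching, which is a good starting point for investigating the crawlability problems described earlier.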

2. Keyword Ranking. Laura likes to use Web Position Gold, a tool that scans the search engine results to determine where specific keywords rank in the engines.

Use tools like Web Position Gold with care. The search engines don’t like automated rank checking programs using up their bandwidth. Laura recommends you check no more than twice per month.

3. Link Checking. It’s smart to check out your site structure with an automated tool. This allows you to detect broken links on the site, and can also help you locate instances of more than one URL referring to the same page. The tool I like to use for this is Xenu’s Link Sleuth. It’s free and can provide a ton of information about your site.
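The duplicate-URL check is simple enough to sketch yourself if you already have crawl output. This example is not how Xenu’s Link Sleuth works internally; it is a minimal illustration that assumes you have a mapping of URLs to fetched page bodies and that exact byte-for-byte matches are what you want to catch. The function name and the sample pages are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicate_urls(pages):
    """Group URLs whose bodies are identical.

    pages: dict mapping URL -> page body (str).
    Returns a list of sorted URL groups that share the same content.
    """
    by_digest = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


# Hypothetical crawl results for illustration.
SAMPLE_PAGES = {
    "http://www.yourdomain.com/": "<html><body>Home</body></html>",
    "http://www.yourdomain.com/index.html": "<html><body>Home</body></html>",
    "http://www.yourdomain.com/about.html": "<html><body>About</body></html>",
}

if __name__ == "__main__":
    for group in find_duplicate_urls(SAMPLE_PAGES):
        print(group)
```

Exact hashing misses pages that differ only by a timestamp or session ID in the markup, so treat an empty result as a first pass rather than proof that the site is clean.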

Summary

There are still other SEO problems with dynamic Web sites, but the ones pointed out by Laura and Matt, summarized and expanded upon in this series, are common ones I’ve seen over and over again.

Large dynamic Web sites can be a big headache. Tackle the problems outlined in these two articles, and you’re likely to have little to worry about.
