Information and insight from the A&M Webmasters

Even though we paid Google a lot of money for our Search Appliance, that didn’t buy us the secrets of what makes Google so accurate. More than 100 factors are responsible for which results come up in our search engine.

Perhaps the most important factor is PageRank. Google looks at how many other pages (including your own) are linking to your page, and what text they use to link to it. For example, if many pages contain a link to Aggie Honor (which gives the philosophy behind the Aggie Honor System Office), Google will assume that page is an authoritative source on Aggie honor. If most pages link to Aggie Code of Honor (from the Student Rules website), Google will ascribe more “Aggie Honor” PageRank to that page instead. If they link to the same page using the words Click Here, Google will ascribe “Click Here” PageRank instead.
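To make the anchor-text idea concrete, here is a minimal sketch in Python. It is only an illustration of "what text do other pages use when they link to you"; it is not how Google or the Search Appliance computes anything, and the page names, paths, and the `anchor_text_votes` helper are all made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical (source page, anchor text, target path) triples; not real crawl data.
links = [
    ("page-a", "Aggie Honor", "/honor-office"),
    ("page-b", "Aggie Honor", "/honor-office"),
    ("page-c", "Aggie Code of Honor", "/student-rules"),
    ("page-d", "Click Here", "/honor-office"),
]

def anchor_text_votes(links):
    """Count, per target page, how often each anchor text is used to link to it."""
    votes = defaultdict(Counter)
    for _source, text, target in links:
        votes[target][text.lower()] += 1
    return votes

votes = anchor_text_votes(links)
# Anchor texts pointing at the honor office page, most frequent first.
print(votes["/honor-office"].most_common())
```

In this toy data, the link that says "Click Here" tells a search engine nothing about honor, which is why descriptive link text matters.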

Really, webmasters have the most control over search results – what they say on their pages and how other pages link to theirs.

But if your Web pages don’t appear when they should in search results, we have two obvious tools: KeyMatches and Related Queries.

KeyMatches work like pay-per-click advertisements (except they’re free): when someone searches for “your” search term, we can put a reference to your Web page at the top of the listings. For example, searching for “teaching” brings up a KeyMatch for the Department of Teaching, Learning and Culture, which keeps it from being buried by results from other Web sites. This feature is most helpful for new information or for new pages that haven’t yet found their rightful place in the search index. The text can be customized for time-sensitive queries: try searching for “spring break” or “graduation”.

Related Queries encourage visitors to use more fruitful search terms. Someone who isn’t an Aggie may try to search for “alumni”, but they will never find The Association of Former Students that way. So I added a Related Query that, when someone searches for “alumni”, says, “You could also try: former student”. We’ve also added some other fun Related Queries for the A&M community, but you’ll have to find them for yourself…
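Conceptually, both tools are just lookups keyed on the search term. The sketch below shows that idea in Python. It is not the Search Appliance's configuration format or API; the dictionaries, the `decorate_results` function, and the example.edu URL are hypothetical stand-ins.

```python
# Hypothetical tables; the real entries live in the appliance's admin settings.
KEYMATCHES = {
    "teaching": ("Department of Teaching, Learning and Culture",
                 "https://example.edu/tlac"),  # placeholder URL
}
RELATED_QUERIES = {
    "alumni": "former student",
}

def decorate_results(query, organic_results):
    """Put any Related Query hint and KeyMatch ahead of the normal results."""
    key = query.strip().lower()
    lines = []
    if key in RELATED_QUERIES:
        lines.append(f"You could also try: {RELATED_QUERIES[key]}")
    if key in KEYMATCHES:
        title, url = KEYMATCHES[key]
        lines.append(f"KeyMatch: {title} <{url}>")
    lines.extend(organic_results)
    return lines

for line in decorate_results("Alumni", ["(normal search results follow)"]):
    print(line)
```

The normal results are never removed or reordered here; the extras only sit on top of them, which matches how the two features are described above.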

6 Comments to Tweaking the search engine

Thanks for this informative website. I took a quick look in previous posts but did not find it: do you know approximately how many pages are currently in the appliance’s index? I noticed you had earlier stated the license is for 2 million pages, but was curious how many are actually indexed.

Now that we’ve eliminated most of the black holes, our index has dropped from 1.5 million to fewer than 350,000 pages. By comparison, Yale indexes about 450,000 pages. Besides black holes, we also eliminated some other large directories that tended to be outdated, redundant, or unrelated to our university mission (along the lines of photos of the 1998 departmental Fourth of July picnic). As we get a better handle on actual numbers, and as more webmasters fine-tune their robots files, we may be able to get less ruthless with our Do Not Crawl list, and worry less about exceeding our license limit.
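For webmasters who want to check what their own robots file actually blocks, here is a small sketch using Python's standard `urllib.robotparser`. The rules, paths, and example.edu URLs are invented for illustration, and the `gsa-crawler` user agent name is an assumption; check your server logs for the name the crawler really sends.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that fences off a calendar "black hole"
# and an outdated photo directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /photos/1998-picnic/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url, agent="gsa-crawler"):
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

for url in ("https://example.edu/calendar/2031/01/",
            "https://example.edu/about/"):
    print(url, "->", "crawl" if may_crawl(url) else "skip")
```

Every directory that webmasters exclude this way is one less entry we need to keep on the Do Not Crawl list.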

Actually, before we put anything on the do-not-crawl list we once passed 12 million pages seen (though obviously not crawled). After seeing that, we became overly stringent about what we excluded. We know that we’re excluding a lot of content that could be included and potentially seen as worthwhile, but we’re erring on the side of making sure that the results returned are the most relevant possible. As Michael said, we’ll likely start easing our filters as we get more comfortable with what is on campus.

For what it’s worth, Google’s PageRank isn’t simply made up of what is specified above. A lot of what goes into PageRank is really black magic (a trade secret, in the legal world), but several books and articles have been published on it, detailing some of the finer points that are known to help improve PageRank. One of these is the credibility of the page linking to your site. If a page that has an inherently high PageRank links to your page, your PageRank would end up being higher than if a page that has a low PageRank links to you.
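The effect of a high-PageRank page linking to you can be seen in a toy version of the publicly described PageRank iteration. This is a sketch under stated assumptions, not what Google actually runs today: the damping factor of 0.85 is the commonly quoted value, and the tiny link graph and page names are made up.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    graph maps each page to the list of pages it links to; every linked
    page must also appear as a key.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in graph.items():
            if outlinks:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: "hub" is linked from three pages, "obscure" from none.
graph = {
    "hub": ["a"],
    "p1": ["hub"], "p2": ["hub"], "p3": ["hub"],
    "obscure": ["b"],
    "a": [], "b": [],
}
rank = pagerank(graph)
print(rank["a"] > rank["b"])  # "a" is linked from the well-linked hub
```

Pages "a" and "b" each have exactly one inbound link, but "a" ends up ranked higher because its one link comes from a page that is itself well linked.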

Now, one thing that can certainly be detrimental is having many “bad” sites (e.g. sites with extremely low PageRanks) linking to your pages; chances are that your pages’ PageRank will suffer as well. This became particularly common when search engines started flooring the PageRank of link pages, which were nothing but pages that linked to other pages.

As has been mentioned previously (though not in much detail), the way your pages are designed and laid out (from an HTML standpoint) can either help or degrade your PageRank. If your pages conform to HTML or XHTML standards and you stay away from bad practices like nesting tables or putting text in images, then you’re already helping to improve your PageRank by making it easier for the GSA to search and index the data on the page that matters. However, if you do not conform to HTML or XHTML standards and you follow bad practices for web design (e.g. nesting tables within tables within tables, displaying data on your page through images without alt text, etc.), then your PageRank will likely plummet because the GSA won’t be as able to index the critical data on the page.
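One of these checks is easy to automate. The sketch below uses Python's standard `html.parser` to list images that have no usable alt text; the `MissingAltFinder` class and the sample markup are hypothetical, and a real audit would also need to allow for purely decorative images, where an empty alt is legitimate.

```python
from html.parser import HTMLParser

class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> whose alt attribute is absent or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            if not (attr_map.get("alt") or "").strip():
                self.missing.append(attr_map.get("src", "(no src)"))

# Made-up page fragment for illustration.
sample = """
<h1><img src="banner.gif"></h1>
<p><img src="seal.png" alt="University seal"> Welcome!</p>
<p><img src="hours.jpg" alt=""/></p>
"""

finder = MissingAltFinder()
finder.feed(sample)
print(finder.missing)
```

Any text that lives only inside `banner.gif` or `hours.jpg` is invisible to a crawler, so those are the images to fix first.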

But these topics go past just trying to hit it with Google and branch into an area called Search Engine Optimization (SEO). A great read on this is “Search Engine Optimization: Building Traffic and Making Money with SEO” – ISBN #9780596527860 ($9.99 Short Cut on Oreilly.com).

Good points, Chris, especially about the importance of authoritative pages linking to yours. Of course, Google decides whether a page is authoritative, in part, by how many other pages link to it. One problem with knowing for sure what things affect your PageRank is the troublesome fact that Google won’t tell you what your real PageRank is! The green display on the Google Toolbar is apparently not very meaningful for SEO purposes. When I was an SEO consultant, I used it for a while, but lost interest. An “empty” PageRank (gray bar not green) seems to be associated with new sites, while PageRank 0 is associated with link spamming sites. The real danger is infamous sites (spammers) linking to yours, not really the new sites that are not yet famous.