Introducing TF-IDF in WebSite Auditor:

Algorithmic keyword ideas to boost your pages' relevance and rankings in the age of semantic search

By: Aleh Barysevich, Co-founder and CMO, SEO PowerSuite

April 18th, 2017

TL;DR

TF-IDF (short for "term frequency-inverse document frequency") has long been used by Google to figure out the relevance of pages in its index to a given query. Then there was Hummingbird, and then RankBrain. The TF-IDF tool, out today in WebSite Auditor, is an attempt to bring those 3 concepts together to give you data-backed optimization advice in the age of semantic search. This new tool uses the TF-IDF algorithm to help you optimize your pages for topical relevance, so that they rank higher up in search engine results.

The TF-IDF tool is fully available in WebSite Auditor's free version; to start using it, simply download WebSite Auditor (or restart the app if you already have it installed - it will update automatically upon launch) and jump straight to Content Analysis > TF-IDF.

Or, read on for a brief explanation of TF-IDF and its place in Google's algorithm, the way the TF-IDF tool works in WebSite Auditor, and how you can use it to optimize your pages.

WTF, TF-IDF?

From day one, search engines have been trying to process and interpret content like humans do. In turn, humans (more precisely, SEOs, a particular subset of humans) tried to do the reverse: figure out how search engines interpret text so as to crack the secret code of ranking at the top of the search results. That's how SEOs came up with metrics like keyword density, a simple, easy-to-calculate figure that could be used in on-page optimization.

But Google never used keyword density due to its being noisy and easy to manipulate. Instead, Google's long been using TF-IDF in indexing and information retrieval; several of Google's patents also imply TF-IDF is used in ranking. The main purpose of TF-IDF is to figure out the importance of a given keyword to a given page.

Mathematically, TF-IDF is the product of two values: how often a keyword appears on a page (TF), and how rare that keyword is across a larger set of documents (IDF).

Because TF-IDF compares an individual page's keyword usage to that of a large corpus of documents, it is a pretty clean estimation of how important the term is to the page. It scales down the prominence of unimportant words and phrases (think function words and introductory terms) - because the entire set of documents uses them a lot, too. The more rare, meaningful terms are, on the contrary, scaled up in importance.

Term Frequency

You may want to think of term frequency as a normalized version of keyword density. Here's one of the formulas commonly used to calculate it:
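The worked figures in this section (4.32/9.97 and 5.32/9.97, with log base 2) are consistent with the log-scaled form below. Treat it as a reconstruction from those numbers; the symbols n (the number of times the keyword appears on the page) and N (the total number of words on the page) are notation introduced here.

$$
\mathrm{TF} = \frac{1 + \log n}{\log N}
$$

With n = 10 and N = 1,000 in base 2, this gives (1 + 3.32)/9.97 = 4.32/9.97.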

Don't let the logarithms put you off; thanks to the logs, there's less noise in TF than there is in keyword density. Say you have a 1,000-word page on which your target keyword appears 10 times. That term's keyword density is going to be 1%; its term frequency would be 4.32/9.97=0.43 (if you use log base 2).

If you edit the page so that the keyword appears twice as often (20 times), it'll have twice the original keyword density: 2%. But the TF will not go up nearly as much; it'll rise from 0.43 to 5.32/9.97=0.53 (again, using log base 2).
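To check the arithmetic yourself, here is a minimal Python sketch that compares keyword density with a log-scaled term frequency. The TF form, (1 + log(count)) / log(total words), is an assumption inferred from the figures above, and the function names are illustrative only.

```python
import math


def keyword_density(count, total_words):
    """Share of the page's words taken up by the keyword."""
    return count / total_words


def term_frequency(count, total_words, base=2):
    """Log-scaled TF: (1 + log(count)) / log(total_words).

    Assumed form, inferred from the worked figures in the article.
    """
    if count <= 0 or total_words < 2:
        return 0.0
    return (1 + math.log(count, base)) / math.log(total_words, base)


for count in (10, 20):
    density = keyword_density(count, 1000)
    tf = term_frequency(count, 1000)
    print(f"{count} uses: density {density:.0%}, TF {tf:.2f}")
# 10 uses: density 1%, TF 0.43
# 20 uses: density 2%, TF 0.53
```

Doubling the keyword count doubles the density, but only nudges the TF up by about 0.1.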

Inverse Document Frequency

IDF measures the ratio of the total number of documents in a corpus to the number of documents that contain the given keyword.
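Written as a formula, one common textbook form takes the logarithm of that ratio. The symbols D (the total number of documents) and d_t (the number of documents containing the term t) are notation introduced here, and the exact variant WebSite Auditor uses is not stated in this article.

$$
\mathrm{IDF}(t) = \log \frac{D}{d_t},
\qquad
\text{TF-IDF}(t) = \mathrm{TF}(t) \times \mathrm{IDF}(t)
$$

Some variants smooth the ratio, for example \log(1 + D/d_t), so that a term found in every document does not drop to exactly zero.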

As you see, if the keyword is a common word that a lot of documents mention, the IDF value will be tiny, and multiplying TF by it yields a small TF-IDF. If, on the contrary, the term only appears in a few documents, its IDF is going to be substantial (and, hence, TF-IDF will result in a larger figure).
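Here is a small Python sketch of that effect on a toy corpus of four made-up documents. It assumes the plain log2(total documents / documents containing the term) form of IDF and the log-scaled TF described earlier; the corpus and the function names are invented for illustration.

```python
import math


def term_frequency(term, words):
    """Log-scaled TF of a term within one tokenized document."""
    count = words.count(term)
    if count == 0 or len(words) < 2:
        return 0.0
    return (1 + math.log2(count)) / math.log2(len(words))


def inverse_document_frequency(term, documents):
    """log2(total documents / documents containing the term)."""
    containing = sum(1 for words in documents if term in words)
    if containing == 0:
        return 0.0
    return math.log2(len(documents) / containing)


def tf_idf(term, words, documents):
    """Product of TF on one document and IDF across the corpus."""
    return term_frequency(term, words) * inverse_document_frequency(term, documents)


# Toy corpus: every document uses "the", only the first mentions "link".
corpus = [
    "the guide to link building and the basics".split(),
    "the history of the bicycle".split(),
    "the recipe for the best pancakes".split(),
    "the rules of the game".split(),
]
page = corpus[0]

for term in ("the", "link"):
    print(term, round(tf_idf(term, page, corpus), 3))
```

Even though "the" appears twice on the page and "link" only once, "the" is in all four documents, so its IDF is log2(4/4) = 0 and its TF-IDF vanishes, while the rarer "link" keeps a meaningful score.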

Hummingbird, RankBrain, TF-IDF, and semantic search

Hummingbird is the name of the ranking algorithm Google started using in 2013. Hummingbird uses context and searcher intent (as opposed to the individual keywords in a query) to produce the best results. According to Wikipedia, Hummingbird is "capable of understanding the concepts and relationships between keywords", and its goal "is that pages matching the meaning do better, rather than pages matching just a few words".

RankBrain (launched in October 2015) forms part of Google's Hummingbird algorithm. Its objective is similar to that of Hummingbird, but our understanding is that the mechanism it uses is different. Google's recently said that RankBrain is "involved in every query", and affects the actual rankings "probably not in every query but in a lot of queries".

There are two parts to RankBrain: the query analysis part and the ranking part. For the former, RankBrain attempts to interpret queries (particularly the rare or completely new long-tail queries) by associating them with other more common queries and concepts, so as to provide better search results in response. For the ranking part, it analyzes the pages in its index and looks for specific features that make them relevant to the query (I'll get to how it figures out what these features are in a moment). These pages will not necessarily contain the exact words from the query, but are nonetheless relevant.

So both Hummingbird and RankBrain seem to focus on certain keyword-agnostic features of web pages to figure out whether a page is a good search result for the query. Such "features" are determined by analyzing the best-performing search results, according to Google's user satisfaction metrics. These metrics may include the SERP click-through rate, pogo-sticking, time on page, and so on.

So effectively, RankBrain may analyze a group of search results that rank well for similar searches and have good user satisfaction signals, and look for the features these pages share; in other words, the features that make them good search results. These features may then be used as niche-specific ranking signals for related queries. Because most online content is text, such features are often the presence of certain terms and phrases on the page.

Let me give you an example. If you search for "comprehensive seo guide" on Google, not even half of the results you get will contain these exact words. RankBrain may have a better way of knowing what the best results for this query are. As it looks at their content, it will discover that those best results have a few things in common…

Most of the top-ranking pages for "comprehensive seo guide" mention terms like "search engines", "link building", "keyword research", etc. — the terms that we can all agree should be present in an SEO guide that calls itself comprehensive. So that's RankBrain's impressive way of reverse-engineering the human brain.

The TF-IDF tool in WebSite Auditor does something similar: it analyzes the top-ranking pages for your target keywords and looks for terms and phrases that a large number of them use. These are the topic-relevant terms and concepts that will help you increase the relevance (and hence the rankings) of your pages in the semantic search era.

How TF-IDF works in WebSite Auditor

The new TF-IDF tool in WebSite Auditor lets you discover the terms that are inherently associated with your target keywords or topics, judging by the content of your top-performing competitors. It uses the same TF-IDF algorithm as search engines do, only the corpus of documents isn't the Web — it's your 10 top-ranking competitors.

To start the analysis, jump to Content Analysis > TF-IDF in WebSite Auditor, select a page you're about to optimize, and enter a target keyword. While you're at it, here's what the app does behind the scenes:

3. Puts up a complete list of words and phrases the competitors use in their content;

4. Calculates the TF-IDF for each term's usage on each page, and each term's average TF-IDF among the 10 pages;

5. Calculates the TF-IDF for the usage of the same terms on your page;

6. Builds a table of these keywords and a good-looking chart for you to look at.
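The calculation steps in this list can be sketched roughly as follows. This is a minimal Python sketch, not WebSite Auditor's actual code. It assumes the competitor pages' text has already been collected as plain strings, handles single words only (the tool also reports phrases), uses the log-scaled TF whose form matches the worked figures earlier in the article, and uses a smoothed IDF, log2(1 + N/df), so that a term found on every competitor page does not score zero. The names tokenize, tf and analyze are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf(count, total_words):
    """Log-scaled term frequency (assumed form)."""
    if count == 0 or total_words < 2:
        return 0.0
    return (1 + math.log2(count)) / math.log2(total_words)


def analyze(competitor_texts, own_text):
    """Return one row per term found on the competitor pages."""
    pages = [Counter(tokenize(text)) for text in competitor_texts]
    own = Counter(tokenize(own_text))
    n = len(pages)
    rows = []
    for term in set().union(*pages):
        df = sum(1 for page in pages if term in page)
        idf = math.log2(1 + n / df)  # smoothed IDF, an assumption
        scores = [tf(page[term], sum(page.values())) * idf for page in pages]
        rows.append({
            "term": term,
            "pages_using": df,
            "competitor_scores": scores,
            "average": sum(scores) / n,
            "own": tf(own[term], sum(own.values())) * idf,
        })
    # Most widely used terms first, then alphabetically.
    rows.sort(key=lambda row: (-row["pages_using"], row["term"]))
    return rows


competitors = ["link building guide", "link building tips", "keyword research guide"]
for row in analyze(competitors, "my seo guide"):
    print(row["term"], row["pages_using"], round(row["average"], 2), round(row["own"], 2))
```

The corpus here is just the competitor pages, so a term's IDF reflects how many of them use it rather than how common it is on the Web as a whole.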

The list of terms you see is sorted by the number of competitor pages that use them — this ensures that the most important, relevant terms appear at the top. The Recommendation column gives you usage advice for each term that appears on the pages of 5 or more of the competitors:

"Add" if you aren't using an important term at all;

"Use more" if the term's TF-IDF on your page is below the competitors' lowest value;

"Use less" if the term's TF-IDF is above the competitors' highest value.
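The recommendation rules above boil down to a few comparisons. The Python sketch below is one possible reading of them, not the tool's implementation: it assumes the "lowest" and "highest" values are taken only from competitor pages that actually use the term, and the "OK" label for a value inside that range is hypothetical, as is the function name recommend.

```python
def recommend(own_score, competitor_scores, min_pages=5):
    """Suggest a usage action for one term, or None if too few competitors use it."""
    used = [score for score in competitor_scores if score > 0]
    if len(used) < min_pages:
        return None  # the term appears on fewer than min_pages competitor pages
    if own_score == 0:
        return "Add"
    if own_score < min(used):
        return "Use more"
    if own_score > max(used):
        return "Use less"
    return "OK"  # hypothetical label: within the competitors' range


# Invented TF-IDF values for one term across 10 competitor pages.
scores = [0.2, 0.3, 0.25, 0.4, 0.35, 0, 0, 0, 0, 0]
for own in (0, 0.1, 0.3, 0.5):
    print(own, recommend(own, scores))
```

With these invented scores, a page that never uses the term gets "Add", a score of 0.1 gets "Use more", 0.3 falls inside the competitors' range, and 0.5 gets "Use less".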

You can even make changes to your page and implement these recommendations right in WebSite Auditor by going to Content Editor, where you can edit your content in a WYSIWYG editor or in HTML.

Try playing around with the TF-IDF tool yourself in WebSite Auditor's free version — I promise, you're in for more than a few exciting discoveries.

One final word of caution - please don't take every single recommendation in the TF-IDF dashboard literally. The algorithm does its part to pick up the best terms for you and give you usage advice; but before you make changes to your page, remember that whatever content you're adding, it has to offer value to the user. In other words, don't try to use this as a way to trick search engines into thinking your page is something it really isn't; instead, use it as algorithmic inspiration for keyword ideas and improving your content.