How does Google understand text?

On Yoast.com, we talk a lot about writing and readability. We consider both a very important part of good SEO: your text needs to satisfy your users’ needs, and that, in turn, will help your rankings. However, we rarely talk about how Google and other search engines read and understand texts. In this post, we’ll explore what we know about how Google analyzes online text.

Are we sure Google understands text?

We know that Google understands text to some degree. Think about it: one of the most important things Google has to do is match what the user types into the search bar to a search result. User signals alone won’t help Google to do this. Moreover, we also know that it is possible to rank for a phrase that you don’t use in your text (although it’s still good practice to identify and use one or more specific keyphrases). So clearly, Google reads and assesses your text in one way or another.

What is the current status?

I’m going to be honest. We don’t really know how Google understands texts. The information simply isn’t freely available. And we also know, judging from the search results, that a lot of work is still to be done. But there are some clues here and there that we can draw conclusions from. We know that Google has taken big steps when it comes to understanding context. We also know that it tries to determine how words and concepts are related to each other. How do we know this? On the one hand, by analyzing some of the patents Google has filed over the years. On the other hand, by considering how actual search results pages have changed.

Word embeddings

One interesting technique Google has filed patents for and worked on is called word embedding. I’ll save the details for another post, but the goal is basically to find out which words are closely related to other words. This is what happens: a computer program is fed a certain amount of text. It then analyzes the words in that text and determines which words tend to occur together. Then, it translates every word into a series of numbers. This allows each word to be represented as a point in space, which you could draw in a diagram such as a scatter plot. This diagram shows which words are related and how closely. More accurately, it shows the distance between words, sort of like a galaxy made up of words. So for example, a word like “keywords” would be much closer to “copywriting” in this space than it would be to “kitchen utensils”.
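To make the idea a bit more concrete, here is a minimal sketch in Python. It is not how Google does it: modern word embeddings are usually learned by a neural network, while this toy version simply counts which words appear together in the same text. The corpus and the function names (`build_vectors`, `cosine`) are made up for illustration.

```python
from itertools import combinations
import math

# Toy corpus, invented for this example; each string is one short "document".
CORPUS = [
    "keywords help copywriting for seo",
    "good copywriting uses keywords naturally",
    "seo copywriting starts with keywords research",
    "kitchen utensils include spoons and spatulas",
    "store kitchen utensils near the stove",
]


def build_vectors(corpus):
    """Turn every word into a list of numbers: how often it shares a
    document with each other word in the vocabulary."""
    docs = [doc.split() for doc in corpus]
    vocab = sorted({word for doc in docs for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {word: [0] * len(vocab) for word in vocab}
    for doc in docs:
        # Count each unordered pair of distinct words once per document.
        for a, b in combinations(set(doc), 2):
            vectors[a][index[b]] += 1
            vectors[b][index[a]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity: close to 1 means similar contexts, 0 means none shared."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


vectors = build_vectors(CORPUS)
sim_related = cosine(vectors["keywords"], vectors["copywriting"])
sim_unrelated = cosine(vectors["keywords"], vectors["utensils"])
print(sim_related > sim_unrelated)  # "keywords" sits closer to "copywriting"
```

In this tiny corpus, “keywords” and “utensils” never share a context, so their similarity is zero, while “keywords” and “copywriting” keep the same company and end up close together.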

Interestingly, you can do this not only for words, but for phrases, sentences and paragraphs as well. The bigger the data set you feed the program, the better it will be able to categorize and understand words and work out how they’re used and what they mean. And, what do you know, Google has an index that covers an enormous part of the web. How’s that for a dataset? With a dataset like that, it’s possible to create reliable models that predict and assess the value of text and context.
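One simple way to get from words to phrases is to average the vectors of the words in the phrase. The sketch below assumes exactly that; it is one common baseline, not a description of Google’s method. The two-dimensional vectors in `WORD_VECTORS` are hand-picked, hypothetical values, and `phrase_vector` is an illustrative name.

```python
import math

# Hypothetical 2-D word vectors, hand-picked for illustration only.
WORD_VECTORS = {
    "seo": (0.9, 0.1),
    "copywriting": (0.8, 0.2),
    "keywords": (0.85, 0.15),
    "kitchen": (0.1, 0.9),
    "utensils": (0.15, 0.85),
    "spoons": (0.2, 0.8),
}


def phrase_vector(phrase):
    """Average the vectors of all known words in the phrase."""
    words = [w for w in phrase.lower().split() if w in WORD_VECTORS]
    if not words:
        raise ValueError(f"no known words in phrase: {phrase!r}")
    dims = len(WORD_VECTORS[words[0]])
    return tuple(
        sum(WORD_VECTORS[w][d] for w in words) / len(words) for d in range(dims)
    )


seo_phrase = phrase_vector("seo copywriting")
keyword_phrase = phrase_vector("copywriting keywords")
kitchen_phrase = phrase_vector("kitchen spoons")

# Euclidean distance between phrase points: smaller means more closely related.
near = math.dist(seo_phrase, keyword_phrase)
far = math.dist(seo_phrase, kitchen_phrase)
print(near < far)
```

The two SEO phrases land close to each other, and the kitchen phrase lands far away, just as single words do.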

Related entities

From word embeddings, it’s only a small step to the concept of related entities (see what I did there?). Let’s take a look at the search results to illustrate what related entities are. If you type in “types of pasta”, this is what you’ll see right at the top of the SERP: a heading called “pasta varieties”, with a number of rich cards that include a ton of different types of pasta. These pasta varieties are even subcategorized into “ribbon pasta”, “tubular pasta”, and several other subtypes of pasta. And there are lots and lots of similar SERPs that reflect the way words and concepts are related to each other.

The related entities patent that Google has filed actually mentions the related entities index database. This is a database in which concepts or entities, like pasta, are stored. These entities also have characteristics. Lasagna, for example, is a pasta. It’s also made of dough. And it’s a food. Now, by analyzing the characteristics of entities, they can be grouped and categorized in all kinds of different ways. This allows Google to better understand how words are related, and, therefore, to better understand context.
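Here is a rough sketch of what grouping entities by their characteristics could look like. We don’t know what the database described in the patent actually looks like, so everything here is an assumption: the `ENTITIES` data, the function names, and the choice to score relatedness by the share of characteristics two entities have in common.

```python
from collections import defaultdict

# Hypothetical entities and characteristics, invented for this example.
ENTITIES = {
    "lasagna": {"pasta", "made of dough", "food"},
    "tagliatelle": {"pasta", "made of dough", "food", "ribbon pasta"},
    "penne": {"pasta", "made of dough", "food", "tubular pasta"},
    "pizza": {"made of dough", "food"},
    "apple": {"food", "fruit"},
}


def group_by_characteristic(entities):
    """Build an index from each characteristic to the entities that have it."""
    groups = defaultdict(set)
    for name, traits in entities.items():
        for trait in traits:
            groups[trait].add(name)
    return dict(groups)


def related(entities, name):
    """Rank the other entities by the share of characteristics they have in
    common with `name` (shared traits divided by all traits of the pair)."""
    target = entities[name]
    scores = []
    for other, traits in entities.items():
        if other == name:
            continue
        overlap = len(target & traits) / len(target | traits)
        scores.append((other, overlap))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


groups = group_by_characteristic(ENTITIES)
ranking = related(ENTITIES, "lasagna")
print(sorted(groups["pasta"]))
print([name for name, _ in ranking])
```

With this data, the other pastas rank as most related to lasagna, pizza follows because it is also a food made of dough, and the apple comes last because all it shares is being a food.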

Practical conclusions

Now, all of this leads us to two very important points:

First, if Google understands context in some way or another, it’s likely to assess and judge context as well. The better your copy matches Google’s notion of the context, the better its chances of ranking. So thin copy with limited scope is going to be at a disadvantage. You’ll need to cover your topics exhaustively. And on a larger scale, covering related concepts and presenting a full body of work on your site will reinforce your authority on the topic you specialize in.

Second, texts that are easy to read and that clearly reflect the relationships between concepts don’t just benefit your readers; they help Google as well. Complicated, inconsistent and poorly structured writing is harder to understand for both humans and machines. You can help the search engine understand your texts by focusing on:

Good readability (that is to say, making your text as easy to read as possible without compromising your message).

Good structure (that is to say, adding clear subheadings and transitions).

Good context (that is to say, adding clear explanations that show how what you’re saying relates to what is already known about a topic).

The better you do, the more easily your users and Google alike will understand your text and what it tries to achieve. That matters all the more because Google seems to be trying to create a model that mimics the way we humans process language and information. And yes, adding your keyphrase to your text still helps Google to match your page to a query.

Google wants to be a reader

In the end, the message is this: Google is trying to be, and is becoming, more and more like an actual reader. By writing rich content that is well structured, easy to read and clearly embedded in the context of the topic at hand, you’ll improve your chances of doing well in the search results.